Taking Inventory of the Unstructured World

In most companies, there is a wealth of unstructured textual information. Documents of many kinds are found in many places across the corporation. There are reports. There are articles. There are
spreadsheets. There are contracts.


Intuitively, the organization knows that it ought to be doing something with these documents. Trying to find a document six months after it was written is no small task. Trying to gather
documents for a cost justification or for litigation support is not trivial. Yet documents are like minnows in the water. They keep multiplying, and they are slippery to catch.

Managing Your Corporate Documents

Trying to manage corporate documents is like trying to catch the wind. Most corporations have never even attempted to manage their documents. Yet some of the most valuable
information the corporation has is found in documents.

Not all corporate documents need to be managed. Many informal documents and presentations do not warrant the attention of management. But many documents do need management. Many corporate documents
represent official pronouncements and statements of obligations and expectations by the corporation.

The Document Inventory

A good first step for an organization that wants to manage its documents proactively is to create a corporate document inventory. In creating an inventory, the organization looks at and catalogs its
existing documents. In some organizations, there are literally hundreds of thousands of documents. Building a “card catalog” of the documents that belong to the organization is an excellent
start to managing the corporate collection of documents.

Libraries have long used a card catalog to great effect. Libraries know that looking through an entire library with all of its books is a colossal waste of time. Realistically, without
the card catalog, libraries could hardly function. When a person is looking for a book in the library, the most efficient way to find it is to use the card catalog. With the card
catalog, the reader can quickly scan through all the possibilities. Upon finding the one or two books that look the most promising, the reader is then directed to the location of the book by the
card catalog. It is no different with the documents that belong to the corporation.

So what should an inventory of corporate documents (a corporate card catalog) contain? The likely contents of the corporate card catalog include:

  • A title or brief description of the document,
  • A measurement of the size of the document,
  • The date the document was created,
  • The date the document was last changed,
  • The date the document was last accessed,
  • The system path of the document, and
  • A classification of the document type.

All of these components of the card catalog are useful. Indeed, some elements of the card catalog, such as the size and the dates, are found in the metadata of the document. But not all card catalog
elements are found in the metadata of the document. Perhaps the most useful of the card catalog elements is the document classification, and it has to be derived from the content of the document.
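
As a rough illustration of the card catalog described above, the following Python sketch builds one catalog entry per file from file system metadata. The names `CatalogEntry`, `catalog_entry`, and `build_inventory` are hypothetical, and using the file name as the title is an assumption made only for this example. Note that the classification cannot be read from the metadata, so it defaults to "unclassified" here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class CatalogEntry:
    """One 'card' in the catalog (hypothetical structure for illustration)."""
    title: str          # title or brief description (here: the file name stem)
    size_bytes: int     # measurement of the size of the document
    created: datetime   # creation date, where the platform records one
    modified: datetime  # date the document was last changed
    accessed: datetime  # date the document was last accessed
    path: str           # system path of the document
    doc_type: str       # classification; NOT available from file metadata


def _ts(seconds: float) -> datetime:
    """Convert a POSIX timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def catalog_entry(path: Path, doc_type: str = "unclassified") -> CatalogEntry:
    st = path.stat()
    # st_birthtime exists only on some platforms. The fallback, st_ctime, is
    # the metadata-change time on most Unix systems, not a true creation date.
    created = getattr(st, "st_birthtime", st.st_ctime)
    return CatalogEntry(
        title=path.stem,
        size_bytes=st.st_size,
        created=_ts(created),
        modified=_ts(st.st_mtime),
        accessed=_ts(st.st_atime),
        path=str(path.resolve()),
        doc_type=doc_type,
    )


def build_inventory(root: Path) -> list[CatalogEntry]:
    """Walk a directory tree and catalog every regular file found."""
    return [catalog_entry(p) for p in sorted(root.rglob("*")) if p.is_file()]
```

Calling `build_inventory(Path("/some/share"))` on a directory of your choosing would return one entry per file. Last-access dates depend on how the file system is mounted and may not be reliable everywhere.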

Document Classification

Documents can be classified in many ways. Consider an oil company. The business of the oil company can be roughly divided into the sectors of “upstream,” “midstream” and
“downstream.” Upstream refers to exploration and production. Midstream refers to transportation and storage, such as pipelines. Downstream refers to refining and distribution. Each document
that belongs to the oil company can be read and classified according to the general category of information to which it refers. The document can be an “upstream”
document, a “midstream” document or a “downstream” document.

Or consider manufacturing. In manufacturing, there are the processes of handling raw goods, assembly, managing work in process, finishing a product, and shipping or storing the product. Documents for
manufacturers can be classified by the aspect of manufacturing to which each document best applies.

Classifying the content of the document is a jump-start for the analyst looking through the many documents that belong to the corporation.
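
One very simple way to approximate this kind of classification is to count category keywords in the text of each document. The sketch below is an assumption made for illustration, not a method prescribed here: the keyword lists are hypothetical, and a real taxonomy would come from the business itself and would be far richer.

```python
import re
from collections import Counter

# Hypothetical keyword lists for the oil company example; illustrative only.
CATEGORY_KEYWORDS = {
    "upstream": {"exploration", "drilling", "seismic"},
    "midstream": {"pipeline", "pipelines"},
    "downstream": {"distribution", "distributor", "retail"},
}


def classify(text: str) -> str:
    """Return the category whose keywords occur most often in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        category: sum(words[keyword] for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # With no keyword hits at all, refuse to guess.
    return best if scores[best] > 0 else "unclassified"
```

For example, `classify("Pipeline maintenance schedule")` returns `"midstream"`, while a document with no matching keywords comes back as `"unclassified"` and can be set aside for a person to review.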

Creating the inventory of corporate documents is the first step in managing the unstructured environment. Stated differently, without a corporate card catalog, the
world of unstructured data is a massive blob of ambiguity.

After the inventory is made, the next step is to read the documents and create a corporate index of those documents. The index doesn’t just reflect the document classification; the index goes
into the details of every word in every document. There are many and varied aspects to the creation of an index. Some of the aspects are:

  • Looking at and managing documents in different languages,
  • Classifying the content of documents so that there is a “higher” level of abstraction for each word and each concept in each document,
  • Taking information found in documents and organizing that information so that textual analytics can be supported, and
  • Organizing the information found in documents and structuring it so that it can be queried along with structured information.

Indeed, there are many different aspects to creating the corporate card catalog.
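
To make the idea of an index that covers every word in every document concrete, here is a minimal sketch of an inverted index, which maps each word to the documents that contain it. This is one common structure, chosen here as an assumption. The function names `build_index` and `search` are hypothetical, and the tokenization is deliberately naive: it ignores language-specific handling, abstraction of concepts, and the other aspects listed above.

```python
import re
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map every lowercased word to the set of document ids containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return dict(index)


def search(index: dict[str, set[str]], *words: str) -> set[str]:
    """Return the ids of documents that contain every query word."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()
```

Given `docs = {"a.txt": "Lease contract for pipeline", "b.txt": "Pipeline inspection report"}`, the call `search(build_index(docs), "pipeline", "contract")` narrows the result to `a.txt` without rereading either document.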

One of the challenges is that of dealing with different document types. Some documents are short (emails). Some documents are long (patents). Some documents are full of technical jargon (medical or
legal documents). Some documents are full of slang (chat logs). The corporate card catalog needs to be able to accommodate ALL the different kinds of documents.

The corporate document catalog is a good start for getting your hands around all of the important unstructured information in your corporation.

