Self Organizing Maps (SOM’s) – Visualizing Textual Data

Published in TDAN.com July 2006

For years there has been visualization of numeric data. Business Intelligence has pie charts, colored graphs, graphs over time, multidimensional analysis, Pareto charts, and so forth. It is
visualization that brings out the reality and contrast in a large collection of numbers. But numeric data is not the only kind of data there is. The counterpart to numeric data is textual
data. Textual data is found in many places, but nowhere more prominently than in the world of unstructured data. Unstructured data consists of emails, email
attachments, .pdf files, spreadsheets, PowerPoint files, text files, document files, and many more file types. Unstructured data runs the informal part of the organization while structured data
runs the formal part of the organization. It is a good bet that as many business decisions are made in the unstructured environment as in the structured environment.

But getting a handle on unstructured data is difficult. Business Intelligence technology is a poor fit because it is best suited to the display of numeric
data, while unstructured data is made up of text. Feeding text to Business Intelligence is like plugging an AC device into a DC power source. If it works at all,
it will probably fry the device once current starts to flow. In short, AC does not mix with DC, and textual visualization does not mix with numerically based Business Intelligence.

So where is there a need for assimilation, organization, and ultimately visualization of textual data? There are actually many places where there is such a need –

  • in the oil and gas environment, where pipeline and other infrastructure is described by text and is needed by researchers, scientists, and engineers,
  • in the pharmaceutical clinical trial environment, where doctors gather the results of thousands of clinical trials and must assimilate that information intelligently,
  • in the automobile manufacturing environment, where thousands of emails must be organized in order to know what part of the car is drawing consumer attention,
  • in the aerospace manufacturing environment, where there is a need for gathering the text of manufacturing and engineering specifications for mammoth projects such as an airplane, and so forth.

THE CHALLENGES

So how does the organization cope with the challenges of visualization of unstructured data? Consider the issues facing the end user in addressing unstructured data –

  • need to process large volumes of documents.

In many environments the end user faces massive amounts of unstructured documents. There may be MILLIONS of unstructured documents. The end user cannot read them all. There simply are not enough
hours in the day. And even if the end user could read them all, there is no way the end user could remember all the information that has been read. The limits of human memory
rule out manual reading as an effective way to process a large library of unstructured documents. And that is exactly what some organizations are up against when it comes to
making sense of a mass of unstructured documents.

  • need for speed

Not only do organizations face the challenge presented by a massive number of unstructured documents, but they also face the need for speed in the processing of those documents. On
occasion, there is a need to find some information quickly. If there were only a few unstructured documents then the end user could perhaps quickly find what was being sought. But when there
is a massive number of unstructured documents, looking through them is laborious, which is exactly what you don’t want when you need speed. When speed of retrieval is needed, there
has to be an automated way of examining a large body of unstructured documents.

  • need for accuracy

Where there are many unstructured documents, accuracy can become an issue in the retrieval of data. If a manual approach is used, it is simply a fact of life that human memory is both fuzzy and
limited. The more documents a human tries to ingest, the fuzzier the memory becomes for any one document. After a large number of documents have been read, it is a wonder that anything is retained
in human memory about any single document. Accuracy fades quickly in the face of the need to understand a large body of unstructured information.

  • need for relating documents

Another important aspect of understanding a body of unstructured documents is relating the documents to one another. It is one thing to understand a collection of individual unstructured documents. It
is quite another to be able to relate those documents to each other. In many cases the relationships among unstructured documents form a different and much more powerful picture than
the documents taken individually. And the more unstructured documents there are, the more challenging it becomes to see the larger picture formed by the
documents and their relationships.

  • need for finding lots of things

As if there were not enough challenges in finding information in a large body of unstructured documents, there is also the challenge that much of the work done
against a large body of unstructured documents is heuristic in nature. In a heuristic mode, the next step in processing is determined by the results obtained in the current analysis. The net result
of heuristic processing is jumping all over the body of unstructured documents. The first analysis concentrates in one place. The next analysis concentrates somewhere else. The third
iteration of analysis goes somewhere else again, and so forth. Doing heuristic analysis manually against a large body of unstructured documents is very difficult.

No wonder then that business analysts facing a large body of unstructured documents are so frustrated. Doing the analysis manually is simply not feasible, for these reasons and more.

SELF ORGANIZING MAPS – SOM’S

With modern visualization tools, you can now produce SOM’s – self organizing maps – of unstructured data and documents. A SOM addresses each of the challenges of visualizing unstructured
data and documents described above. Now, in one place you can look at unstructured data and documents as you have never looked at them before. A properly constructed SOM
offers –

  • volume. Lots of unstructured documents can be gathered into a library. Literally millions of unstructured documents can be merged into a single SOM,
  • speed. Once the SOM is created, the analyst can look at it in real time. The SOM is simple to bring up and, once up, lets the analyst operate at the speed of thought,
  • accuracy. The SOM goes down to the level of individual stemmed text, and that is as accurate as textual processing can
    become,
  • relationships. The SOM deals not only with individual unstructured documents but with the relationships between documents as well. One of the major values of the SOM is that it allows the
    analyst to see the larger picture as well as drill down to the detailed picture. With a SOM you can have your cake and eat it too when it comes to perspective,
  • versatility. When you want to change your mind as to what you want to look at, the SOM encompasses ALL of your unstructured data and documents, so that one minute you can examine
    one kind of unstructured data and document and the next minute you can examine another kind. Once the SOM is built, you can change your analysis at the
    speed of human thought.

There are several very valuable things that a SOM can do for you. One of those is to show correlation of data. The SOM’s show text that is correlated to other text. In the medical field, working
with medical records, this ability to correlate is very attractive.
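To make the idea of correlated text concrete, here is a minimal sketch of how a self organizing map can place documents on a grid so that documents with similar wording tend to land in the same or neighboring cells. It is an illustration only, not the method used by any particular visualization tool. The toy corpus, the 3 x 3 grid size, the learning schedule, and every function name below are assumptions made for the example, and whitespace splitting stands in for real stemming.

```python
import math
import random

# Hypothetical toy corpus; the article does not supply sample data.
DOCS = [
    "pipeline valve pressure inspection",
    "pipeline pressure leak repair",
    "clinical trial dosage results",
    "clinical trial patient results",
]

def vectorize(docs):
    """Turn each document into a unit-length bag-of-words count vector."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    vectors = []
    for doc in docs:
        words = doc.split()
        counts = [float(words.count(term)) for term in vocab]
        norm = math.sqrt(sum(c * c for c in counts)) or 1.0
        vectors.append([c / norm for c in counts])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_matching_unit(grid, vector):
    """Return the (row, col) of the cell whose weights are closest to vector."""
    return min(grid, key=lambda cell: sq_dist(grid[cell], vector))

def train_som(vectors, rows=3, cols=3, epochs=200, seed=0):
    """Train a small SOM; grid size and schedule are illustrative choices."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # shrinks from 1 toward 0
        rate = 0.5 * frac                    # learning rate decays over time
        radius = 1.0 + (max(rows, cols) / 2.0) * frac  # neighborhood shrinks
        for vector in rng.sample(vectors, len(vectors)):
            bmu = best_matching_unit(grid, vector)
            for cell, weights in grid.items():
                grid_dist2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2.0 * radius * radius))
                # Pull the cell's weights toward the document vector.
                for i in range(dim):
                    weights[i] += rate * influence * (vector[i] - weights[i])
    return grid

vocab, vectors = vectorize(DOCS)
som = train_som(vectors)
placement = {doc: best_matching_unit(som, vec) for doc, vec in zip(DOCS, vectors)}
for doc, cell in placement.items():
    print(cell, doc)
```

Each printed line shows the grid cell assigned to a document. On a real corpus the grid would be far larger and the text would be stemmed and cleaned first, but the principle is the same: position on the map reflects similarity of wording.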

Another thing SOM’s can do is to enable textual drill down processing. In textual drill down processing the analyst goes from one level of analysis to a lower level of analysis until the specific
detail is found.

Depending on how the data is arranged and integrated, SOM’s support qualified analysis of data. First the analyst looks at records for employees. Then the analyst looks for records for women
employees. Then the analyst looks for records for women college graduates. Then the analyst looks for women college graduates who are older than 50, and so forth. If the unstructured data is
properly conditioned and edited, SOM’s can yield insightful analysis for the selection of qualified data.
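The step-by-step qualification described above can be sketched as a chain of filters, each one narrowing the result of the previous step. This is only an illustration: the records, the field names, and the helper functions are hypothetical, and the sketch assumes the attributes have already been extracted from the unstructured text during conditioning and editing.

```python
# Hypothetical employee records; names and fields are illustrative assumptions.
RECORDS = [
    {"name": "Ana", "gender": "F", "college_graduate": True, "age": 54},
    {"name": "Beth", "gender": "F", "college_graduate": True, "age": 41},
    {"name": "Carl", "gender": "M", "college_graduate": True, "age": 58},
    {"name": "Dana", "gender": "F", "college_graduate": False, "age": 62},
    {"name": "Evan", "gender": "M", "college_graduate": False, "age": 35},
]

def qualify(records, predicate):
    """Keep only the records that satisfy the predicate."""
    return [record for record in records if predicate(record)]

def is_woman(record):
    return record["gender"] == "F"

def is_graduate(record):
    return record["college_graduate"]

def older_than(years):
    """Build a predicate that tests whether a record's age exceeds years."""
    def check(record):
        return record["age"] > years
    return check

# Each step qualifies the result of the step before it.
employees = RECORDS
women = qualify(employees, is_woman)
women_graduates = qualify(women, is_graduate)
women_graduates_over_50 = qualify(women_graduates, older_than(50))

for label, subset in [
    ("employees", employees),
    ("women", women),
    ("women college graduates", women_graduates),
    ("women college graduates over 50", women_graduates_over_50),
]:
    print(label, [record["name"] for record in subset])
```

Because each level is kept as its own list, the analyst can step back up to a broader selection without rerunning the earlier filters.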

About Inmon Data Systems

Founded in Colorado by Bill Inmon, Guy Hildebrand and Dan Meers, Inmon Data Systems (IDS) is a software company dedicated to the proposition that there needs to be a bridge between the worlds
of structured data and unstructured data. IDS has foundation technology that allows unstructured data to be brought into the structured environment and once there, integrated into the structured
environment.

Applications – unstructured visualization

  • Enterprise metadata consolidation
  • CDI/CRM linkage enhancement
  • Communication compliance
  • Email and unstructured indexing for bulk storage

IDS is located in Castle Rock, Colorado.

Contact IDS at 303-973-3788 for further information.

About Bill Inmon

Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.
