Textual Analytics: Business Intelligence from a Textual Foundation

Published in TDAN.com April 2007

Analytics have been around from the time the first computer program was written. Once corporations began to generate data, there were financial analysts, sales analysts, marketing analysts and
others eagerly waiting to use that data in novel and creative ways. In the early days, data from applications was hard to come by, and the tools the analysts used to access and analyze the data
were crude. As time passed and the volume of data grew, so grew the opportunity to use analytics to compete in the business arena.

Over time, the world discovered the data warehouse as a foundation for analytic processing. The data warehouse contained integrated, historical and granular data that was gathered from a host of
legacy systems. The data warehouse proved to be an ideal foundation for the analysis of data. Data from the data warehouse was predictable and easy to access; and because data in the data warehouse
was granular, it could be reshaped for many different purposes.


Numerical Data – A Fundamental Limitation

Over time, it was recognized that business analysis – analytics – had a very fundamental limitation. That limitation was that analytics operated only on numerical data. While analysis of numerical
data was quite useful, in fact, corporations have massive amounts of data that is not in the form of numerical data. There exist massive amounts of unstructured textual data – from e-mails, medical
records, contracts, warranties, reports, call centers, and so forth. In fact, most estimates show that 80% of the data in the corporation is in the form of text, not numbers.

There is a wealth of information in that textual data, but there are some problems with this unstructured, textual data. Textual data is not as neatly organized and as accessible as numerical data
and it does not lend itself to easy and facile analysis because the software and technology used for business analytics is almost 100% dedicated to handling well structured numeric data. The very
disorder of the textual data defeats (or at least greatly hampers!) any attempt at accessing and analyzing it in any sort of meaningful manner.

However, there is technology that indeed is designed for textual analysis, such as Inmon Data Systems Foundation software. The following discussion of textual analytics makes free use of the many
patents IDS (Inmon Data Systems) has on the process of doing textual analytics.


Textual Analysis and Search Engines

When the subject of textual analytics arises, it is natural to think of search engines such as Google and Yahoo, among others. While a simple search of raw text can be considered to be a crude form
of textual analytics, there are, in fact, many limitations to a simple textual search.

In order to do textual analytics in a sophisticated manner, first the unstructured textual data must be integrated. If raw text is not integrated before it is analyzed, the search of the raw text
will produce truly sketchy and questionable results. Therefore, the first step in textual analytics is the integration of the raw text into an integrated form. Some of the steps required for raw
text to be integrated and to be fit for analysis include:

  • Different terminology must be accounted for so as to yield consistent results, even though the original source text is different
  • Alternate spellings (even common misspellings!) must be taken into account
  • Words need to be stemmed to their Latin or Greek roots

(For an in-depth treatment of the subject of the technology needed for textual integration, please refer to the white papers on the subject available from Inmon Data Systems.)


Integrating Raw Text

In short, in order to do analytics on text, the raw text must first be integrated.

After the raw text is integrated, textual analytics can be done. The distinction is that searches are done against raw textual data, whereas textual analytic processing is done against integrated text.

A search can be something as simple as – “Tell me where the term – Katherine Heigl – is mentioned.” In this case, the search goes to the source or an index created from the source and looks for
the term or part of the term that has been specified.

An analytical treatment of text might be – “tell me about all the places where terms and information relating to Sarbanes Oxley can be found.”

The need for textual integration may not be obvious at all. In order to illustrate the importance of textual integration, consider the following. Suppose a medical file needs to be analyzed. In the
medical file is the term “ha.” If the raw data is searched on “ha,” there are many entries. Because “ha” means little or nothing to the layman, a search on “ha” is questionable. However, if
the raw data is integrated before being searched, then for all cardiologists the term “ha” is converted to “heart attack.” For all endocrinologists, the term “ha” is converted to “hepatitis
A,” and for all general practitioners, the term “ha” is converted to “headache.” After the conversion is completed, there is no questionable term “ha” to be dealt with, and patients with
heart attacks, headaches and hepatitis A are not grouped together.
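
The specialty-dependent conversion of “ha” can be sketched as a lookup keyed on both the author's specialty and the abbreviation. This is a minimal illustration, not the IDS implementation; the table EXPANSIONS and the function expand_abbreviations are hypothetical names, and a real vocabulary would be far larger.

```python
import re

# Hypothetical lookup table: (specialty, abbreviation) -> expanded term.
# The three entries mirror the "ha" example in the text.
EXPANSIONS = {
    ("cardiologist", "ha"): "heart attack",
    ("endocrinologist", "ha"): "hepatitis A",
    ("general practitioner", "ha"): "headache",
}


def expand_abbreviations(text, specialty):
    """Replace known abbreviations with the meaning used by this specialty."""
    def replace(match):
        word = match.group(0)
        # Words that are not in the table are left unchanged.
        return EXPANSIONS.get((specialty, word.lower()), word)

    # The pattern visits whole words only, so "has" or "that" are never
    # mistaken for the abbreviation "ha".
    return re.sub(r"\b[A-Za-z]+\b", replace, text)


note = "Patient has a history of ha."
print(expand_abbreviations(note, "cardiologist"))
print(expand_abbreviations(note, "general practitioner"))
```

Because the specialty is part of the key, the same raw note yields a different integrated text depending on who wrote it, which is what keeps the three conditions from being grouped together.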

From this simple example (and there are plenty more cases of textual data needing to be clarified before being analyzed), it is seen that integration of text unlocks the text so that effective
textual analytics can be performed. However, the example provided is not the only reason for the need for textual integration as a foundation for analytics.


Searching for Categories of Text

Suppose there is a body of text about ranching. Part of the body of text relates to horses. In some cases, the type of horse is discussed. In other cases, the age and maturity of the horse is
discussed. In other cases, the gender of the horse is addressed.

Now suppose that there is a desire to do analytical processing against this document or set of documents on ranching. Suppose that there is a desire to see information about horses. One way – the
search engine way – to look at horses is to look for colts, then to look for ponies, then to look for studs, and so forth. The searcher must know beforehand what is being sought. Then the searcher
must be able to gather all the information about horses together. Searching for a wide variety of information is tedious to do.

A better approach – the integrated text approach – is to gather all information about horses into a common category. Then the integration process identifies all the places in the text where those
pieces of information about horses exist.

The kind of information that is returned when looking at integrated textual information about horses might include palomino, stud, mare, bridle, saddle, types of hay, fencing, horse whispering,
gait, racing, gelding, and so forth.

Now when the textual analyst wants to know information about horses, the textual analyst simply queries on the category – horses – and all information relating to horses is returned. Note just how
different the results of a query are when done using textual analytic processing rather than when doing a simple search.

By integrating the raw data, the textual analyst has prepared the data for effective textual analytical processing.
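
A query by category can be sketched as follows, assuming the category has already been defined as a set of terms. The table CATEGORIES, the function query_category and the sample documents are all hypothetical; the horse terms themselves come from the discussion above.

```python
import re

# Hypothetical category table; the horse terms come from the text.
CATEGORIES = {
    "horses": {"palomino", "stud", "mare", "bridle", "saddle", "gelding",
               "colt", "pony"},
}


def query_category(category, documents):
    """Return (document id, term) pairs for every category term found."""
    terms = CATEGORIES[category]
    hits = []
    for doc_id, text in documents.items():
        # Lower-case the text and walk through it one word at a time.
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in terms:
                hits.append((doc_id, word))
    return hits


documents = {
    "ranch-1": "The mare was fitted with a new saddle.",
    "ranch-2": "Fencing repairs are due before the colt arrives.",
    "ranch-3": "Cattle prices rose again this spring.",
}
print(query_category("horses", documents))
```

The analyst names only the category; the individual terms never have to be typed into the query. This sketch matches exact words, so plural forms such as “ponies” would also need the stemming step to be found.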


Recasting Textual Data

There are many forms of textual integration that can set the stage for effective textual analytical processing. A simple example of another form of integration that needs to be done to raw
unstructured text before text analytics can be done is the recognition that there are multiple spellings of words, especially names.

By recognizing that there are multiple spellings of the same name, the text analytical processor will not miss mentions when the name is spelled differently. When a simple search engine is used,
the search may fail to pick up important information about the name entered in the search because a variation of the name is used.
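
One simple way to account for alternate spellings is a variant table that maps every known spelling to a single canonical form before analysis. The sketch below assumes such a table exists; NAME_VARIANTS, its entries and canonicalize_names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical variant table: each known spelling maps to one canonical form.
NAME_VARIANTS = {
    "katherine heigl": "Katherine Heigl",
    "catherine heigl": "Katherine Heigl",
    "kathryn heigl": "Katherine Heigl",
}

# One alternation pattern, longest variants first so that a longer spelling
# is never shadowed by a shorter one that happens to be its prefix.
_PATTERN = re.compile(
    r"\b("
    + "|".join(
        re.escape(v) for v in sorted(NAME_VARIANTS, key=len, reverse=True)
    )
    + r")\b",
    re.IGNORECASE,
)


def canonicalize_names(text):
    """Rewrite every known spelling of a name to its canonical form."""
    return _PATTERN.sub(lambda m: NAME_VARIANTS[m.group(0).lower()], text)


print(canonicalize_names("Catherine Heigl and kathryn heigl were both cited."))
```

After this step a query on the canonical name finds every mention, whichever spelling the source text used. A table like this only covers variants someone has listed; catching unseen misspellings would need fuzzy matching, which is not shown here.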


Stemming Raw Text

The need for integrated text only begins with the simple examples that have been described. Another way that textual data needs to be integrated is in terms of operating at the Latin or Greek stems
of words.

Latin-based words tend to have similar but not quite the same spellings. If a search is literally made, then the search will not connect the fact that a word is related to another word even though
they are not spelled exactly the same. As an example, consider the word – “move.” Some of the different forms of the word “move” are moved, moving, mover, moves, remove and removed. If an
effective analysis of the text is to be done, it must be recognized that words that have the same stem need to be considered as the same word.
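
The idea of reducing words to a common stem can be illustrated with a deliberately naive suffix stripper. This is a sketch under stated assumptions, not a production stemmer: the suffix list SUFFIXES and the function naive_stem are hypothetical, and only suffixes are handled.

```python
# A deliberately naive rule set, not a full stemming algorithm.
# Longer suffixes are listed before shorter ones so that "es" wins over "s".
SUFFIXES = ("ing", "ed", "er", "es", "s", "e")


def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["move", "moved", "moving", "mover", "moves", "remove", "removed"]:
    print(w, "->", naive_stem(w))
```

With these rules the first five forms share one stem, while “remove” and “removed” share a different one. Tying prefixed forms back to “move” would take an additional prefix rule or a dictionary of roots, which this sketch leaves out.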

Indeed, there are many other considerations of the discipline of integrating text. Some of them include screening text to see if it is business relevant, punctuation removal, case sensitivity (or
insensitivity), and so forth.


The Scope of the Search and Analysis

One of the challenges of a search engine is the scope of the material accessed and analyzed by the query. A search engine is capable of drawing on vast amounts of source material (such as the
Internet). A textual analytical tool, on the other hand, must access and draw upon data that it has access to and can manipulate. In other words, because textual analytics requires a serious amount
of preprocessing of data in order to integrate the data, textual analytics is performed on a much smaller amount of data than searches of data.

It does not make sense that a search engine would integrate data before doing a search because the search engine does not have the ownership and control of the data that is being searched. Textual
analytical tools, on the other hand, typically operate on data from the corporation. Indeed, there is the opportunity to access and integrate corporate data before textual analytics occurs.


A Simple Query

In addition to the standard search queries that the textual analyst needs to do, there are many different kinds of queries that the textual analyst submits. One of the simplest of the queries
submitted is the query by class of data.

Consider that a textual analyst has submitted a query for the category of financial information. The query for financial information includes many different terms, each of which relates to finance.
Some terms that relate to finance include stock, share, equity, warrant, profit and so forth.

The query is submitted by a reference to finance. The results of the query are a reference back to each place where a term related to the query is found. This type of query is sometimes called an
indirect query or a query by category.


Sophisticated Queries

An important type of query submitted by a textual analyst is that of a query looking for basic occurrences of information. Consider that a query has been made looking for all occurrences of the
word “water.”

Upon finding a reference to water, the next step is to do a search on the specific text preceding “water” and following “water.” These textual references to water along with their immediate
text are called “snippets.”


A Snippet Search

By looking at each of the snippets, the analyst can determine the context of the word that has been sought.

Snippets are most useful for determining the context of a particular word. The term “water” can refer to quite different things – for example, a water table, a watermark, sea water that is
menacing, and Waterford crystal.
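
A snippet search can be sketched as finding each occurrence of the term and cutting out a fixed window of text around it. The function name snippets and the width parameter are assumptions made for this illustration; the sample sentences echo the “water” examples in this paper.

```python
import re


def snippets(text, term, width=30):
    """Return each occurrence of term with up to `width` characters of
    context on either side."""
    results = []
    for match in re.finditer(re.escape(term), text, re.IGNORECASE):
        # Clamp the window so it never runs past either end of the text.
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        results.append(text[start:end])
    return results


text = ("She held the Waterford crystal in her hands. "
        "Was it a mirage or real water? He could not see beyond the dunes.")
for s in snippets(text, "water", width=20):
    print("..." + s + "...")
```

Reading the two snippets side by side makes it plain that one occurrence concerns crystal and the other concerns actual water, which a bare list of hits would not show.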


A Proximity Search

Another type of query that the textual analyst sometimes needs to submit is a proximity query. A proximity query searches one or more documents for two or more words that occur within a
predetermined distance of each other in a document. Proximity analysis can be done for lists of words as well as for individual words.
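
A proximity query for two words can be sketched by comparing the positions at which each word occurs. In this illustration distance is measured in characters between the starting offsets of the two words; the function within_proximity and its default of 200 are assumptions, not a description of any particular product.

```python
import re


def within_proximity(text, first, second, max_distance=200):
    """True if some occurrence of `first` starts within `max_distance`
    characters of some occurrence of `second`."""
    lowered = text.lower()
    first_positions = [
        m.start() for m in re.finditer(re.escape(first.lower()), lowered)
    ]
    second_positions = [
        m.start() for m in re.finditer(re.escape(second.lower()), lowered)
    ]
    # Compare every pair of positions; any close pair satisfies the query.
    return any(
        abs(a - b) <= max_distance
        for a in first_positions
        for b in second_positions
    )


sentence = "She spilled water on the television set accidentally."
print(within_proximity(sentence, "water", "television"))
print(within_proximity(sentence, "water", "television", max_distance=5))
```

Extending this to lists of words amounts to running the same check for each pair drawn from the two lists.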


Relating Textual Data to Structured Data

Another form of textual analytics relates textual data to structured data.

Consider demographic data from a customer as it relates to the communications from the customer. The e-mails that a customer has sent can be attached to the customer. By merging textual information
with structured information, a true 360-degree view of the customer is achieved. Stated differently, when an organization only has demographic information about a customer, that is hardly a
360-degree view of the customer. Customer communications as well as demographic information about the customer are required.
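
Attaching communications to a customer record can be sketched as a join on a shared customer identifier. The records, the e-mail texts and the function customer_view below are hypothetical; the sketch assumes each e-mail has already been matched to a customer id.

```python
from collections import defaultdict

# Hypothetical structured records, keyed by customer id.
customers = {
    101: {"name": "Pat Lee", "city": "Denver"},
    102: {"name": "Sam Ortiz", "city": "Austin"},
}

# Hypothetical unstructured communications, each tagged with a customer id.
emails = [
    (101, "My order arrived late again."),
    (102, "Thanks for the quick refund."),
    (101, "Please cancel my subscription."),
]


def customer_view(customers, emails):
    """Attach each customer's e-mails to the structured customer record."""
    by_customer = defaultdict(list)
    for customer_id, body in emails:
        by_customer[customer_id].append(body)
    # Customers with no e-mails still appear, with an empty list.
    return {
        customer_id: {**record, "emails": by_customer[customer_id]}
        for customer_id, record in customers.items()
    }


view = customer_view(customers, emails)
print(view[101])
```

The merged record carries both the demographic fields and the text of what the customer actually said.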


Textual Visualization

Another form of textual analytics that is extremely valuable is that of visualization of text. In a visualization of text, integrated text is ingested and clustered in order to find correlations
and relationships between words and phrases.

The text from the documents is integrated and then lifted into a work area. In the work area, the integrated text is clustered into what can be termed themes. The themes are then displayed in a
visualization called a SOM, or self-organizing map.

The clustering of data in a SOM has many uses. Some of those uses are identifying correlations of data, identifying the major themes of data, organizing data so that major themes of data are
obvious, and so forth.

SOMs can be created for very large amounts of data and for smaller amounts of data. Furthermore, SOMs can be used to look across whole vistas of information – looking at thousands of documents at a
time.

It is seen then that textual analytics is a very different subject than search engine processing. Very different results are achieved by textual analytics.


Bridging the Gap

One of the keys to creating the effective textual analytics environment is that of being able to access unstructured data in a structured format. In other words, if you want to use BusinessObjects
or Cognos against unstructured text, you have to put the unstructured data in a form that is useful to BusinessObjects or Cognos. This means that the unstructured data – after it is integrated –
must be restructured into a relational format. In other words, there is a need for taking textual information and placing it into a structured format where there are recognizable relational fields
in a predictable format.

Once unstructured data has been transformed into the relational format, the standard analytical tools can be applied.

But there are some subtleties which are important. Consider what happens when more than one record is converted into a relational format.

Consider that the drug Metformin has been specified for Carol Teal. Yet when the unstructured record for Carol Teal is read, there is no such drug specified. Instead, it is seen that Carol takes
Glucotrol. The software – under the guidance of the analyst – has translated Glucotrol to Metformin as part of the transformation process. The ability to recognize and translate text is an
important capability in preparing for textual analytics.

In addition, the analyst has specified that generalizations (or categorizations) be made on the raw text. For example, whether or not a patient is being treated for Type II diabetes is analyzed. Based
on the textual data that is found, a patient can be classified as to whether the patient is or is not a Type II diabetic.

By translating data and by classifying it, then putting the data in a relational format, the end user is prepared to do analytical processing on text.
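
The translate, classify and store sequence can be sketched with Python's built-in sqlite3 module. Everything named here is an assumption made for illustration: the tables DRUG_TRANSLATIONS and DIABETES_DRUGS, the function to_row and the patient_drug schema. The single translation entry follows the example in the text; in practice the analyst supplies these tables.

```python
import sqlite3

# Analyst-supplied translation table (hypothetical); the single entry
# follows the example in the text.
DRUG_TRANSLATIONS = {"glucotrol": "Metformin"}

# Hypothetical classification rule: these drugs indicate that the patient
# is being treated for Type II diabetes.
DIABETES_DRUGS = {"Metformin"}


def to_row(patient, note):
    """Turn one unstructured note into a (patient, drug, type_ii) row."""
    drug = None
    for word in note.lower().replace(".", " ").replace(",", " ").split():
        if word in DRUG_TRANSLATIONS:
            drug = DRUG_TRANSLATIONS[word]
            break
    # The flag is 1 when the translated drug is in the diabetes list.
    return (patient, drug, int(drug in DIABETES_DRUGS))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient_drug "
    "(patient TEXT, drug TEXT, type_ii_diabetic INTEGER)"
)
conn.execute(
    "INSERT INTO patient_drug VALUES (?, ?, ?)",
    to_row("Carol Teal", "Carol takes Glucotrol daily."),
)
print(conn.execute("SELECT * FROM patient_drug").fetchall())
```

Once the row is in the table, the translated drug name and the classification flag are ordinary relational fields that any SQL-based tool can filter and count.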


Accessing Integrated Textual Data Placed in a Relational Database

The first step in integrating raw data for textual analytics is to create the infrastructure that supports textual analytics. However, once that infrastructure is built, it then remains to put the
infrastructure to good use. This section of this paper is on the usage of the integrated textual infrastructure once that infrastructure is built.

Assume that you have a relational database that has been built from unstructured text and that the text has been integrated. The database is in a relational format and can be accessed by standard
industry analytical tools such as BusinessObjects, Cognos, MicroStrategy, Crystal Reports and others. The access to the database is through standard SQL.

There are some basic ways the data can be accessed. These ways are –

  • A simple search. A word or phrase is given to the software and the database is examined. Take the word “water.” A search of this type would find every occurrence of water.
  • A simple search of context surrounding a word (a “snippet”). Take the word “water.” A context search gathers the text before and after the word being sought. Suppose a context search was done for “water.” The results might look like: “…she held the Waterford crystal in her hands…,” “…the football players welcomed the waterboy, as Gatorade was passed…” and “…was it a mirage or real water? He couldn’t see beyond the…”
  • An indirect search. A search is done for items that belong to a class or category of information. For example, an indirect search on Sarbanes Oxley might return these results: “…revenue recognition…,” “…promise to deliver…,” “…conditional sale…” and “…delayed delivery…”
  • Proximity search for words. Are the two words “water” and “television” found in a document within 200 bytes of each other? An example of a result might be: “…Waterworld was advertised on television last night…” and “…she spilled water on the television set accidentally…”
  • Alternate spellings search. As an example, a search for all the places where “Osama bin Laden” is mentioned might yield: “…lead me to Usama bin laden or else…,” “…huddled in a cave, Osama ben ladeen drank tea and said prayers…” and “…the Muslims adore Abu ben laden, more every day…”

These are merely some of the analytical forms that can be taken based on unstructured data placed in a relational database.
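
As one illustration of access through standard SQL, an indirect search can be written as a join between a table of term occurrences and a table that assigns terms to categories. The schema, the sample rows and the function indirect_search are hypothetical; the sketch uses Python's built-in sqlite3 module in place of a BI tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: where each term occurs, and which category it is in.
conn.executescript("""
    CREATE TABLE term_occurrence (doc_id TEXT, term TEXT, byte_offset INTEGER);
    CREATE TABLE term_category (term TEXT, category TEXT);
    INSERT INTO term_occurrence VALUES
        ('contract-7', 'revenue recognition', 120),
        ('contract-7', 'delivery schedule', 480),
        ('memo-3', 'delayed delivery', 35);
    INSERT INTO term_category VALUES
        ('revenue recognition', 'Sarbanes Oxley'),
        ('delayed delivery', 'Sarbanes Oxley');
""")


def indirect_search(conn, category):
    """Return every occurrence of any term that belongs to the category."""
    return conn.execute(
        """
        SELECT o.doc_id, o.term, o.byte_offset
        FROM term_occurrence AS o
        JOIN term_category AS c ON c.term = o.term
        WHERE c.category = ?
        ORDER BY o.doc_id, o.byte_offset
        """,
        (category,),
    ).fetchall()


for row in indirect_search(conn, "Sarbanes Oxley"):
    print(row)
```

The query names only the category. Terms that have not been assigned to it, such as the delivery schedule entry in the sample data, are not returned.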

Textual analytics can be done by searching whole masses of documents or looking at just one document. Textual analytics can be as simple as looking for one word or looking for whole categories of
words and phrases. Textual analytics can look for the context surrounding words.


The Value Proposition

How do these forms of textual analysis lead to business advantage? The general answer is that an infrastructure of integrated unstructured data placed in a database and accessed by analytical tools
gives the corporation advantages that it never had before in that information coming from the textual environment is now readily available. Now, decision makers can ask questions that were never
before possible.

In order to posit some of these questions, consider the following industries and functions within industries.

(Note: a term in quotation marks is taken to mean a generic term.)


E-Mail/Call Center Administration

  • How many unhappy customers are there? Why are they unhappy? Who are they?
  • Is there some recurring topic or product associated with unhappy customers?
  • What is the rate of unhappy customers’ communications? Is it decreasing or increasing?
  • Are customers interested in “xxxxx” product or “yyyyy” service?
  • Do customers who are satisfied contact the corporation? If so, what product or service are they interested in?
  • What response has there been to a promotion? Was it a generally good response or a bad response?


Contract Administration

  • For all contracts, what would be the exposure if there were a failure in the “drive train” or any parts in the “drive train”?
  • For all contracts, what would be the exposure if the supplier “Ajax nails” were to fail?
  • For all contracts, what has been the experience with supplier “Hardtack gear”?
  • For the “power supply” line of goods, what obligations are there?
  • For the category of vendors who are classified as “8A,” how much business has been contracted in the past?
  • What contractual relationship is there between “rail splitters” and “power mitigators” over all our contracts?
  • How many contracts do we have that deal with “services” and all forms of “services”?


Warranties Administration

  • Over the past three years, have there been any noticeable patterns in the exercise of warranties? Any pattern in products failing? In the type of customer exercising a warranty? Any seasonality?
  • Does any product or sub product stand out?
  • Is any type of failure noticeable?
  • Does model type make a difference in warranties exercised? Does model year make a difference in warranties exercised?
  • Exactly what has been said about warranties exercised on “power generators”?
  • What is the general type of complaint that is issued?


Medical Healthcare Administration

  • What correlations are there between different medical diseases? Between different medical conditions?
  • Despite the fact that different doctors call the same thing something different, what common patterns of medical conditions are there?
  • When the condition “goiter” appears, does it appear in conjunction with “hypertension”?
  • When any condition relating to “heart failure” occurs, does it occur in conjunction with any condition relating to “liver cancer”?
  • What is the context of all occurrences of “smoking”?
  • How many patients can be generally described as “healthy”?
  • Show all textual references for people who are “overweight.”


Insurance Claims Processing

  • Does the same condition keep appearing in claims?
  • Does the same product or class of products appear frequently in claims?
  • What type of claim appears the most frequently?
  • When there is a claim for “broken light bulb,” is there also mention of “recreational weapons”?


Documentum (for anyone who has Documentum)

  • What information is in your Documentum file?
  • How can you analyze that information?
  • Can you integrate the data in Documentum for a cogent analysis?


Scientific

  • Can you scan my documents and create a list that shows the boiling point and melting point of all chemicals in the documents?
  • Can you scan a document and find all references to chemicals with melting points greater than or equal to 120 degrees centigrade?
  • Can you show all the places where “carbon” is discussed in conjunction with “benzene”?
  • Can you take a body of text and find and show the naturally occurring correlations that occur in the text?
  • Can you take text coming from many different sources and rationalize the terminology so that the text can be analyzed in a consistent manner?


In Summary

This paper has addressed the subject of textual integration and business intelligence (BI) operating on textual data. Raw text must first be integrated. The process of integration has many facets.
Once integrated, the raw text is placed in a relational database. Once the raw text has been placed in a relational database, it can be accessed and analyzed by standard BI tools.

The analysis can take many forms. Forms of textual analytics include:

  • Visualizations, in the form of a SOM where integrated text is clustered
  • Simple searches of integrated text
  • Searches of snippets of text
  • Searches of categories of text
  • Searches of words in the same proximity

About Bill Inmon

Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.
