This past summer I purchased software to help organize my collection of digital photographs. This software had some neat features including the ability to assign tags to each photograph. Tags can
help describe the photograph, such as who is in it and where it was taken. The software even comes with a mini-data mart, answering questions such as “How many photographs contain the tag
‘Jamie’ (my youngest)?” (The answer by the way is more than 3,000!) Being a diehard data modeler, I stared at the screen after assigning the tag ‘Jamie’ to her three
thousand and first photograph and wondered this: What does the data model look like behind this software? I started sketching boxes and lines and realized I had come to a decision point where I either
needed to model the photograph itself or the contents of the photograph. The photograph itself seemed unstructured to me, yet the photograph’s components appeared to have far more structure.
I then started asking myself, “What is unstructured data anyway? What is the difference between structured and unstructured?”
A quick search on the Internet turned up many computer-type definitions, such as this one from Wikipedia: “Unstructured data refers to (usually) computerized information that either does not have a
data structure or has one that is not easily usable by a computer program.” Definitions like this have little value to me as a data analyst, data architect or data modeler. In fact, I could not find
a definition that seemed relevant to the field of data management, so I wrote my own.
Before defining unstructured data, we need to understand the concept of a classword. A classword is the last term in a data element name that defines the high-level domain to which the data element
belongs. A few examples of classwords are Quantity, Amount, Name and Code. For example, the data element Gross Sales Amount contains the Amount classword, which implies a currency such as U.S.
dollars.
Some classwords are simple and some are compound in nature. Simple classwords cannot be broken down into data elements that contain other classword types. Simple classwords include Amount, Code,
Date, Identifier, Indicator, Name, Number, Percent and Quantity. For example, Customer Name can be broken down into Customer First Name and Customer Last Name, but this is all within the same
classword of ‘Name.’ Compound classwords include Text and Object. Examples of data elements containing the Text classword are Order Comments Text and Email Body Text. Examples of data
elements containing the Object classword are photographs, music, PDF files and voice conversations. Compound classwords can usually be broken down into smaller pieces. The data
element Photograph Object, for example, can be distilled into Photograph Taken City Name, Person In Photograph Name and Photograph Caption Text.
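To make the decomposition concrete, here is a minimal Python sketch of the Photograph Object example above. The function names and the COMPONENTS lookup table are my own illustrative assumptions, not part of any photo software or modeling tool; the only facts taken from the article are that the classword is the last term of the name and the three components listed for Photograph Object.

```python
from collections import defaultdict

# Hypothetical lookup table: the single entry is the article's own example.
COMPONENTS = {
    "Photograph Object": [
        "Photograph Taken City Name",
        "Person In Photograph Name",
        "Photograph Caption Text",
    ],
}


def classword(data_element_name: str) -> str:
    """Return the classword, i.e. the last term of the data element name."""
    terms = data_element_name.split()
    if not terms:
        raise ValueError("data element name is empty")
    return terms[-1]


def components_by_classword(data_element_name: str) -> dict:
    """Group the known components of a compound data element by classword."""
    grouped = defaultdict(list)
    for component in COMPONENTS.get(data_element_name, []):
        grouped[classword(component)].append(component)
    return dict(grouped)


if __name__ == "__main__":
    print(classword("Gross Sales Amount"))
    print(components_by_classword("Photograph Object"))
```

Note that one of the pieces, Photograph Caption Text, still ends in a compound classword, so a decomposition does not necessarily stop after one pass.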
For me as a data modeler, the distinction between structured and unstructured data boils down to a distinction between simple and compound classwords. To be more specific, structured data is any
data named with a simple classword. Semi-structured data is the same as structured data with one minor exception: semi-structured data requires looking at the data itself to determine structure, as
opposed to structured data, which requires only examining the data element name. Semi-structured data is one processing step away from structured data. From a data modeler’s point of view,
there is no difference between structured and semi-structured data. From an analyst’s point of view, however, there is a huge difference because the analyst needs to create the data element
source/target mapping, which is traditionally much more complex with semi-structured data. Unstructured data includes any data element that is neither structured nor semi-structured, which means any data
element whose name ends with the Text or Object classword.
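The naming rule above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name classify is hypothetical, and the two sets simply repeat the simple and compound classwords listed earlier. Because semi-structured data can only be recognized by looking at the data itself, a name-based check like this one cannot tell it apart from structured data, which matches the modeler's point of view.

```python
# Classword lists as given in the article.
SIMPLE_CLASSWORDS = {
    "Amount", "Code", "Date", "Identifier", "Indicator",
    "Name", "Number", "Percent", "Quantity",
}
COMPOUND_CLASSWORDS = {"Text", "Object"}


def classify(data_element_name: str) -> str:
    """Classify a data element by the last term (classword) of its name."""
    terms = data_element_name.split()
    if not terms:
        raise ValueError("data element name is empty")
    classword = terms[-1]
    if classword in SIMPLE_CLASSWORDS:
        # Covers structured and semi-structured data alike; the name
        # alone cannot distinguish the two.
        return "structured"
    if classword in COMPOUND_CLASSWORDS:
        return "unstructured"
    raise ValueError(f"unrecognized classword: {classword!r}")


if __name__ == "__main__":
    for name in (
        "Gross Sales Amount",
        "Photograph Taken City Name",
        "Email Body Text",
        "Photograph Object",
    ):
        print(f"{name}: {classify(name)}")
```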
If you are playing the role of analyst, architect or modeler, it is important to distinguish structured from unstructured data because doing so forces you to confirm you have the right requirements. That is,
does the business want to see the unstructured data element Photograph Object or the structured data elements that are derived from Photograph Object, such as Photograph Taken City Name?
What are your thoughts on these thoughts? Feel free to send your comments to me at me@stevehoberman.com.
This article was originally published in TDWI FlashPoint e-newsletter. Visit www.tdwi.org for more information.