Stages Of Data Utility & Value

Published in TDAN.com October 2005

The world is filled with data and information. Some of it is unknowable (such as what is lurking behind the Andromeda galaxy). Some of it is knowable but unknown, mainly because it was
never deliberately observed and properly recorded. Of the data and information which are recorded with some accuracy, how do we find what we need for specific uses or decisions? The utility of
data is a very specific characteristic, and depends upon the anticipated usage.


Utility characteristics of information

There are several significant characteristics of information which must be understood, and understood as distinct from each other, when evaluating utility. Any piece of information, in order to be
useful, should be…

  1. Knowable. Nearly everything (but not all, as Heisenberg[1] taught us) is knowable, although sometimes very difficult to learn or discern.
  2. Recorded. In some sharable, objective medium and not just in some human brain.
  3. Accessible (with the right resources and technology)
  4. Navigable (it may be there but is it easy to find?)
  5. Understandable (language, culture, technology, etc.)
  6. Of sufficient quality (for the intended use)
  7. Topically relevant to needs (perceived needs and unknown needs) (otherwise, it is noise)

These characteristics apply to a piece of data, information, or potential information, whether it is a single item of data or any meaningful grouping of data items.

Please notice that these characteristics are also tests, and seem naturally sequential (1 through 7). Subsequent tests are irrelevant if the previous tests are not passed. For example, it is no use
worrying about navigability if the data is not knowable, recorded, and accessible.
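The sequential nature of these tests can be sketched in a few lines of Python. This is only an illustration: the stage names, the dictionary representation of an item, and the function name are assumptions introduced here, not part of any established tool. The point it shows is that evaluation stops at the first failed test, so later tests are never consulted.

```python
from typing import Optional

# The seven characteristics, in the order listed above.
STAGES = [
    "knowable",
    "recorded",
    "accessible",
    "navigable",
    "understandable",
    "sufficient_quality",
    "relevant",
]


def first_failed_stage(item: dict) -> Optional[str]:
    """Return the first stage the item fails, or None if it passes all seven.

    `item` maps stage names to booleans; a missing key counts as a failure.
    """
    for stage in STAGES:
        if not item.get(stage, False):
            return stage  # stop here; later stages are not evaluated
    return None


# Hypothetical example: private letters that exist but cannot be reached.
letters = {"knowable": True, "recorded": True, "accessible": False, "navigable": True}
print(first_failed_stage(letters))
```

Here the item fails at "accessible", so its navigability (even though marked True) never enters the assessment.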

Another important characteristic of information is whether it is structured (or tabular), or unstructured. The tabular-unstructured dimension is orthogonal to these seven, meaning that there are
really 14 tests (in two columns) possible.

An overview of the stages of data utility is shown here.

Fig. 1. Seven utility characteristic questions about data.

As noted, these seven characteristics can be applied as seven tests in sequence. The exact sequence of the final five can be debated. For example, if a fact (whose value is known
or not) is judged not to be relevant, then we probably wouldn’t worry about its quality. But the sequence here seems somewhat intuitive, so we use it.


1. Knowable

There is a wealth of potentially knowable data and information in the universe. Only a very small portion of it has ever been observed or known by humans (or the technology which humans use to
extend mankind’s observations and senses). And little of that was ever recorded anyway. Slightly more was remembered, but not recorded. Of course, memory fades…and fails. Hence, the need for
recording stuff.


2. Recorded

Most of what is knowable is never recorded. People don’t feel it is worth recording. Some is recorded by hand (generally requiring paper and a writing instrument), but since 1850, some is recorded
through automated or technological means (photography, phonographs, magnetic media, etc.).

What we choose to record depends upon our expectations of later utility or interest. Students take notes in academic lectures, but not on what goes on at a football game (unless they are a sports
writer for the college newspaper). More captains of industry and government write memoirs than do postal clerks.

When we discover a need for data, we can adjust behavior and systems to start recording it. But much of the most crucially needed data is not recorded until it is too late. Recording of immigration
activity is now more meticulous. I wonder why. Surveillance tapes may be saved longer now[2].


3. Accessible

The very availability of data, and access to it, can be a major issue. Of all the knowledge and information ever recorded, much (especially in ancient times) was confined to letters and personal journals.
There were newspapers, but only late in the history of civilization.

Thus, much of what has been known and recorded is not in the public domain. It is in the personal property of families. In the modern industrial age, a great deal of knowledge is held in the
private files of corporations and government agencies (though many of these are destroyed on a regular basis, sometimes to prevent their availability for discovery to aid litigation). These
generally are not available to the public. For investigative purposes they can only be subpoenaed if their existence is known.

The public library was a major step forward in making data, knowledge, and information available to a wider readership. The internet is the latest significant mechanism for lowering the cost of
access, speeding navigation (see below), and lowering the cost of “publishing”.

We often underestimate the value and power of the human brain. Much of contemporary knowledge and information is held solely in human brains. Most people do not even understand or realize how much
they know. Mechanisms for recalling that information are sometimes faulty. We may require triggers or clues to remember things. We may remember the route to a particular store if we were driving it
ourselves (relying on visual clues along the way which work for us, but which we could not articulate during an interview), but it is more difficult to remember, describe, and organize those clues
to give verbal or written directions to someone who wishes to drive the same route.


4. Navigable (it may be there but is it easy to find?)

Orthogonal to the concepts of structured and unstructured data are issues of navigation; how do we find what we are looking for? Or what is important? …to us. Or what we need to know?

There is a wide range of “navigability” in data and information, particularly in unstructured information. The worst case is a collection of some individual’s personal letters on paper stationery
(not stacked chronologically), especially after the author is dead. How do you find the author’s mention of a particular distant cousin? You usually need to read it all.

Books are bound pages, with some physical organization (usually chronology); they are designed (except for reference books) to be read serially, front to back. Reference books have topics generally
organized alphabetically, or by some other meaningful categorization. Then, some non-fiction books have tables of contents which are significant (for example, text books). In these kinds of books
(compared to narratives), topics become easier to find. Then, topical indices were added to the back of the bound non-fiction book. These are helpful, but reflect the judgement of some editor who
decides which topics are important enough to get a reference.

Tabular data was generally stored (at least on paper) in entry sequence. The beauty of automated (electronic, digital) tabular data was that records could be sorted, and we could use several
alternate keys or indices to find records. This is a very significant enhancement in navigability of data. Indeed, tabular data has become quite easily managed.
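A minimal Python sketch of alternate indices follows. The customer records and field names are hypothetical, invented only for illustration. It shows the same set of records, held in entry sequence, being reached through two different keys and also being sorted on demand.

```python
from collections import defaultdict

# Hypothetical customer records, held in entry sequence as they would be on paper.
records = [
    {"id": 101, "name": "Baker", "state": "CA"},
    {"id": 102, "name": "Adams", "state": "NY"},
    {"id": 103, "name": "Clark", "state": "CA"},
]


def build_index(rows, key):
    """Map each value of `key` to the list of rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


# Two alternate indices over the same records.
by_state = build_index(records, "state")
by_name = build_index(records, "name")

# Find records by state without scanning the whole table.
ca_ids = [row["id"] for row in by_state["CA"]]

# Re-sequence the same records by name instead of entry order.
sorted_names = [row["name"] for row in sorted(records, key=lambda r: r["name"])]
```

Each index costs some storage and upkeep, but it lets a question be answered from a different starting point without reading every record.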

But what about unstructured data? Document imaging systems allowed a document’s image to be referenced by a few indices (perhaps name, date, and something else). But the content was not
reference-able in its graphic form. Then came the web, and search engines. Wherever unstructured textual information is digitized and placed on uniquely identifiable pages, search engines can
help us find it. Google has done much to revolutionize the finding of text.
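The basic idea behind finding text on uniquely identifiable pages can be sketched as a small inverted index in Python. This is a toy illustration, not a description of how any real search engine works; the page identifiers and letter contents are invented. It returns to the earlier question of finding a mention of a distant cousin without reading every letter.

```python
import re
from collections import defaultdict

# Hypothetical digitized letters, each with a unique identifier.
pages = {
    "letter-1": "Cousin Harold visited us in Boston last spring.",
    "letter-2": "The harvest was poor, and Harold has gone west.",
    "letter-3": "Boston was cold this winter.",
}


def build_inverted_index(docs):
    """Map each lowercased word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in docs.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, *words):
    """Return the sorted ids of pages that contain every one of the given words."""
    hits = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*hits)) if hits else []


index = build_inverted_index(pages)
harold_pages = search(index, "Harold")
harold_in_boston = search(index, "Harold", "Boston")
```

Note that this matches specific words only. A letter that refers to Harold as "your nephew" would not be found, which is the limitation of searching on text strings rather than meaning.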

And there appear to be some engines which can find images similar to each other. Whether the next generation of search engines can find meaning (rather than specific words or text strings) remains
to be seen.


5. Understandable

Data and information can be recognized as existing, and the meta-data (source, time of observation, etc.) may be known, but the content of the information may not be understandable. Drawing usable
meaning from data requires lingual, technical, and cultural familiarity. The more cryptic (or coded) the data is, the more culture must be “wrapped” around it, often supplied by the analyst or
interpreter.

The very existence of data (or communication) may itself be useful even if it is not understood. A friend of mine told of his military experience working at a “listening post” facility high on a
mountain in Turkey, with a good radio “view” into the Soviet Union. They listened to VHF radio signals, often voice, and though he spoke no Russian, he could note the time, frequency, duration,
number of voices, etc. This kind of “metadata” was useful (known as traffic analysis) even if the content was not understood–although he probably tape recorded what he heard over the radio
channel.

I frequently use the formula, “data plus context yields information”. The context can make raw data understandable. Two lanterns hanging in a Boston church steeple are merely data; their meaning
(profound meaning!) was only understood in the context of previously established values, plans, and codes by the “rebels” in the Revolutionary War.

The information and data held in the human brain are stored with context–sometimes vital context which makes the fact, out of context, of little value. The human manages to integrate and relate
all this information in a variety of intuitive ways–not at all in a tabular manner, thus going far beyond the ways we can relate or process tabular data in the relational model. The sole
proprietor entrepreneur is far better at integrating his personal knowledge about his products and customers (and thus making profitable business decisions) than any CRM system could do based upon
tabular data (said data being woefully incomplete, compared with the nuances of the sole proprietor’s memory).


6. Of sufficient quality

There are a variety of specific measures of quality. They include…

  • Presence of record. Is there a record present for an instance in reality? If not, then the entire row of data is missing.
  • Presence of data in cell. While the row may be present, the cell may not be populated when it ought to be.
  • Validity of a fact. Does the value of a fact (a cell in a row) conform to some rules? E.g. if the field is called “STATE” is the value a valid state code? This concept is sometimes achieved
    through referential integrity. In tabular data, this is relatively easy to determine through Boolean tests.
  • Reasonableness of a fact. While the value in a cell may be valid, it may not be deemed reasonable, in context of peer data, or other facts in the same record. E.g. the zip code is not
    consistent with the state code on an address.
  • Accuracy. A fact may be valid (a proper code), reasonable (in keeping with peer data), and still be flat out wrong–inaccurate. Generally, testing for accuracy can be very expensive.
  • Precision. Precision is different from accuracy. A numeric amount can be accurate to the dollar, but not precise to the penny. It is still accurate, in a sense, and useful in some analysis.
  • Consistent definition over time. A set of facts (a column) in a set of records (a table) should have consistent definition over the span and scope of the table. If they do not, then the
    definitional accuracy of the data (or more precisely, the meta-data) is lacking.
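Several of these measures can be expressed as simple tests on tabular data. The Python sketch below is illustrative only: the field names, the short list of state codes, and the mapping from state to first ZIP digit are simplified assumptions (real ZIP ranges have exceptions), and the function names are invented here. It covers presence of record, presence of data in a cell, validity, and reasonableness. Accuracy is deliberately left out, since testing it means comparing the data against reality.

```python
# Hypothetical, deliberately incomplete reference data for illustration.
VALID_STATES = {"CA", "NY", "TX"}
ZIP_FIRST_DIGIT = {"CA": "9", "NY": "1"}  # simplified; real ranges have exceptions


def missing_records(expected_ids, rows):
    """Presence of record: which expected instances have no row at all?"""
    present = {row["id"] for row in rows}
    return sorted(set(expected_ids) - present)


def check_record(row):
    """Return a list of quality problems found in one address record."""
    problems = []
    state = row.get("state")
    zip_code = row.get("zip")

    # Presence of data in cell: the row exists, but is the cell populated?
    if not state:
        problems.append("state missing")
    if not zip_code:
        problems.append("zip missing")

    # Validity: does the value conform to the rule for the field?
    if state and state not in VALID_STATES:
        problems.append("state invalid")

    # Reasonableness: is the value consistent with other facts in the same record?
    expected_digit = ZIP_FIRST_DIGIT.get(state)
    if expected_digit and zip_code and not zip_code.startswith(expected_digit):
        problems.append("zip inconsistent with state")

    return problems


rows = [
    {"id": 1, "state": "CA", "zip": "90210"},
    {"id": 2, "state": "NY", "zip": "94105"},
    {"id": 3, "state": "ZZ", "zip": "10001"},
    {"id": 4, "state": "", "zip": "10001"},
]
report = {row["id"]: check_record(row) for row in rows}
absent = missing_records([1, 2, 3, 4, 5], rows)
```

Record 2 passes the validity test (NY is a valid code) yet fails the reasonableness test, which illustrates why the two measures are kept distinct.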

Ideally, the quality of data is something that can be objectively measured, without reference to an intended use. And in designing systems and assessing data quality of latent and moving data
assets, we do need to strive for that.

But the quality of data or information (as objectively measured) may or may not meet the specific needs of a decision or analysis.


7. Topically relevant to needs

A final issue must be mentioned, and that is the relevance and usability of data and/or information to a particular need. Data which is not relevant can be distracting, or actually be considered
“noise”. Advertising is a culturally sanctioned, structured form of noise. (Graffiti, another form of advertising, is generally not sanctioned by society.) Ideally, we know what is an ad when we
see it, and we can choose to “tune it out”[3]. But noise can be used deliberately to obscure significant data, as can disinformation. There were several
such situations in World War Two (such as the fake army radio messages just prior to the Normandy invasion), and there have probably been many since then that we don’t know about.

Needs for information may be known or unknown. “Write this down; you are going to need this someday!” is our way of adapting to one assessment of our anticipated needs. Many bureaucracies have
people who save everything (“We might need this someday.”). They may be considered pack rats, but many a crime mystery is supplied with essential clues by the pack rat who says, “Well, actually, I
did save that back here.” Of course, saved data is of no utility if it cannot be found–back to navigability!

These seven characteristics of data and information, when understood and applied as sequential tests, allow us to determine its utility and value for particular business and social needs and
expectations.

[1] The physicist, Werner Karl Heisenberg, postulated that there are limits to what we can measure and know about the movement and behavior of some
subatomic particles. His Uncertainty Principle was influential in the development of quantum mechanics and ultimately the development of nuclear energy. He won the Nobel Prize for his
accomplishments.

[2] Interestingly, surveillance videos in the London Underground system were viewed again after the tickets (which carry magnetic coding) used by
the 2005 bombers were traced back to previous uses of those same tickets. By establishing the prior dates and times when the terrorists had used the Underground, authorities were able to
find more pictures of them, and determine that they had rehearsed their actions.

[3] If the appearance of the ad, in text and layout, is not readily distinguishable from editorial content, many magazines must label the ad
(“Advertisement”) at the top of the page.

© Copyright 2005 Neil Michael Scofield All rights reserved.

About Michael Scofield

Michael is an Assistant Professor of Health Information Management at Loma Linda University in southern California. He is a popular speaker in topics of data quality, decision support, and data visualization to professional audiences all over the United States. He is also a frequent guest lecturer at a number of other universities.
