Published in TDAN.com July 2005
Every organization both maintains and uses reference data sets within its enterprise. In many of these organizations there are standards governing the definition and use of that reference data, although sometimes those standards are at best tacitly understood rather than explicitly documented. Yet even in the best-governed environment, that reference data is bound eventually to be abused, either through perturbation of the value set or through extended usage expectations.
The pattern usually goes in one of two ways. In one pattern, a data set is defined or identified for use for a specific purpose. At some point, the purpose changes in a way that is not aligned to
the original data element definition, and so the data is appended or modified to incorporate data values that do not exactly match the definition. In another pattern, a data set is defined or
identified for use for a specific purpose. Then, other system developers with similar needs determine that the data set is suitable for their application, so they use it also. Over time,
application requirements diverge, which in turn raises questions about the data set’s continued suitability for each intended purpose. Yet by that point, so many applications depend on the data that it cannot easily be changed, which imposes artificial constraints on the data set’s flexibility.
As an example, consider the geographic Federal Information Processing Standards (FIPS) codes maintained by the National Institute of Standards and Technology (NIST). Every state in the United States is assigned a two-digit code, and each county within each state is assigned a three-digit code. The combined five-digit codes provide a simple hierarchy: for example, New York State is assigned code 36 and New York County is assigned code 061, so the five-digit code 36061 identifies New York County within New York State.
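The composition is simple enough to express directly. Here is a minimal sketch in Python, using the New York values above:

```python
def compose_fips(state: str, county: str) -> str:
    """Concatenate a two-digit state code and a three-digit county code."""
    assert len(state) == 2 and len(county) == 3
    return state + county

def decompose_fips(code: str) -> tuple[str, str]:
    """Split a five-digit code back into its state and county parts."""
    assert len(code) == 5
    return code[:2], code[2:]

print(compose_fips("36", "061"))   # 36061 (New York County, New York State)
print(decompose_fips("36061"))     # ('36', '061')
```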
One might think that this hierarchy (which is also a simple taxonomy) has a very simple definition. Or does it? One approach views the data set purely in terms of geographic containment: there is a state, and there is a set of counties contained within that state. In this view, a five-digit code resolves to a physical set of boundaries (for the county) that is fully contained within a
larger set of boundaries (the state). A different approach views this as a political taxonomy. In this view, a five-digit code resolves to a political entity (the county) that is subsidiary to a
larger political entity (the state).
In fact, the more you consider it, the more ways there are in which the five-digit encoding can be used:
- Location, representing the seat of government for a county
- Physical routing, as a way to map to physical mail addresses
- Electronic routing, as a way to map to email or IP addresses
- Aggregation, as a way to summarize activity or transactions at the state and county level
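The aggregation use, for instance, falls out of the encoding directly: rolling transactions up to the state level is just a matter of grouping on the first two digits. A minimal sketch, with invented amounts (the five-digit codes are real FIPS values):

```python
from collections import defaultdict

# Invented transaction amounts keyed by real five-digit FIPS codes.
transactions = [("36061", 100.0), ("36059", 250.0), ("06075", 75.0)]

state_totals: dict[str, float] = defaultdict(float)
for code, amount in transactions:
    state_totals[code[:2]] += amount   # group on the two-digit state prefix

print(dict(state_totals))   # {'36': 350.0, '06': 75.0}
```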
So let’s assume that you work with an organization that has an office in every county in the United States, and you have selected this standardized data set for routing electronic messages to
the office located in a county. As long as there is one office in each county, this scheme is fine, but what happens when your organization decides to open a second office in one of the counties?
In this case, the taxonomy breaks down because the use of the data set was predicated on an assumption (one office per county) that was true when the data set was first adopted and was violated
with the opening of that second county office. The reverse issue arises when the organization decides to consolidate two county offices into a single regional office. Now two counties share one office, and the translation from county to office is no longer a one-to-one mapping.
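The one-to-one assumption is easy to see once the routing table is written down. In the hypothetical sketch below (the office names are invented), a dictionary keyed by five-digit code can represent neither a second office in a county nor one office shared by two counties without losing information:

```python
# One office per county: the assumption the routing scheme was built on.
office_by_county = {
    "36061": "office-manhattan",
    "36059": "office-nassau",
}

# A second office in county 36061 has no key of its own; assigning it
# silently overwrites the original entry.
office_by_county["36061"] = "office-manhattan-2"

# Consolidation breaks the mapping the other way: two counties now point
# at one office, so an office no longer identifies a unique county.
office_by_county["36059"] = "office-long-island"
office_by_county["36103"] = "office-long-island"   # 36103 = Suffolk County, NY
```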
Now consider a different problem: your organization decides to open an office associated with the state itself (in addition to the county offices). You want to be able to route electronic messages either to the state office or to a county office, except there is no five-digit code that represents the state! So some states adopt ‘000’ as their state routing code, while others adopt ‘999.’ Two issues crop up here: the addition of non-standard values to a standard data set, and the conflicting use of those introduced values. Even so, you still know that as long as you
use the state’s two-digit code, you are at least going to route the message to the right general location.
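The conflict is concrete: a shared resolver now has to know about both conventions, neither of which is part of the standard. A sketch under those assumptions:

```python
def resolve_routing(code: str) -> str:
    """Interpret a five-digit routing code, including the invented sentinels."""
    state, county = code[:2], code[2:]
    if county in ("000", "999"):
        # Non-standard values: neither is a valid FIPS county code, and the
        # two conventions conflict, but the state digits are still reliable.
        return f"state office for state {state}"
    return f"county office {county} in state {state}"

print(resolve_routing("36000"))   # one application's convention
print(resolve_routing("36999"))   # another application's conflicting convention
```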
Here is a different situation: your company decides to introduce regional offices, each covering more than one state. What you are doing is introducing a new level into your taxonomy, yet the standardized data set does not support it. So the decision is made to take some of the unused two-digit codes to represent the new regions.
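For illustration, suppose the unassigned code 43 is commandeered for a region (a hypothetical choice; FIPS leaves gaps such as 03, 07, 14, 43, and 52 unassigned). Any consumer that assumes the first two digits always name a state is now wrong:

```python
# Hypothetical: an unassigned two-digit code is repurposed for a region.
REGION_BY_CODE = {"43": "northeast-region"}   # "43" is not a FIPS state code

def route_prefix(code: str) -> str:
    prefix = code[:2]
    if prefix in REGION_BY_CODE:
        # The prefix no longer names a state: a new level was grafted onto
        # a taxonomy that was never designed to carry it.
        return REGION_BY_CODE[prefix]
    return f"state {prefix}"
```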
Each of these three cases introduces a semantic data quality problem. In the first case (multiple offices per county, or multiple counties per office), original assumptions are violated, which perturbs the expected definition of the data set’s hierarchy; the actual use has expanded or contracted in a way that is inconsistent with the defined taxonomy. In the second case (routing to a higher level in the hierarchy), the desire to map a “state + county code” to a state implies overloading the meaning of the three-digit code: when routing to a state, digits 3 through 5 no longer refer only to a county (as originally intended) but may now also refer to a state. In the third case (encompassing regions), the problem is addressed by overloading the state code to incorporate a higher level of hierarchy than the original data set definition supports.
None of these solution approaches will have a significant impact while only one application uses the encoding. But when other applications rely on the same data set, introducing new values with new meanings diverges from the original data set semantics. Here we have turned a geographic containment hierarchy into a routing mechanism; when other applications change the semantics in their own ways, eventually there will be a clash.
As another example, consider North American telephone numbers. A number is divided into three parts: an area code, an exchange, and a line number. The original intent was that the number represented a routing scheme: the area code directed the call to a region, the exchange identified the local office handling that line, and the line number identified a specific circuit handled by that local office. This scheme makes a lot of sense when applied across a single network of physically connected units (e.g., land lines). And within this kind of network, certain assumptions can be made, such as automatically routing inbound calls to a specific regional inbound call center based on the area code of the incoming number.
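A minimal sketch of that assumption, with an invented call-center table (the 555-01xx exchange-and-line numbers are reserved for fictional use):

```python
def parse_nanp(number: str) -> tuple[str, str, str]:
    """Split a ten-digit North American number into area code, exchange, line."""
    digits = "".join(ch for ch in number if ch.isdigit())
    assert len(digits) == 10
    return digits[:3], digits[3:6], digits[6:]

# Hypothetical inbound routing rule: the caller's area code alone decides
# which regional call center handles the call.
CALL_CENTER_BY_AREA_CODE = {"917": "new-york-center", "415": "san-francisco-center"}

area_code, exchange, line = parse_nanp("(917) 555-0123")
print(CALL_CENTER_BY_AREA_CODE.get(area_code, "default-center"))   # new-york-center
```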
With the introduction of mobile telephones, numbers are still issued based on the geographic region in which the order originates, so the area code associated with your mobile number reflects the originating location. As long as you are in a location served by your service provider, your calls will go through, still associated with your mobile phone’s number and area code. Yet this violates the assumption that telephone calls originate in the location associated with the area code: I can be in San Francisco calling from my 917 (New York City) area code mobile phone. So when I dial the phone number of the call center, it routes me to a New York City call center even though I want to speak with someone in San Francisco. This becomes a real problem when the number being called is one that requires local assistance (e.g., 911).
These are simple examples, but they point out some of the problems with multi-purposing a standardized data set. The way to address this problem is to take data usage policies into account when defining data standards and to document those policies within your metadata management system. In a future article I will describe an approach to consolidating a representation of data usage policies for standardized data.
Copyright © 2005 Knowledge Integrity, Inc.