Governance for Taxonomic Reference Data

– – – – – – –

This article is being featured in cooperation with IRMUK and the Data Governance Conference Europe 2009.  For more information about the event, please

– – – – – – –

Reference data has many definitions. The one that I use is the following:

Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the

In practice, this means that reference data is nearly always “code tables.” These are tables that typically have a code column and a description column, and probably fewer than 200
records – in fact, they typically have fewer than 20. Another feature of reference data tables is that the data in them does not change very often. All of these characteristics tend to
combine in the minds of data management professionals to generate the conclusion that reference data tables are unimportant and have no special management needs. There is, for instance, much
greater interest in the (relatively) gargantuan master data cousins of reference data tables, such as Customer and Product. Yet, as we will see, it is impossible to deal with tables like Customer
and Product without effective governance over reference data.

The Dark Side of Reference Data

Reference data is the Rodney Dangerfield of the data world – it gets no respect. However, it is feared. Neither users nor IT staff ever want to delete records from a reference data table, or
to alter the primary key of one of these records. This is because the consequences of such an action cannot easily be determined, and this in turn is because it is difficult to figure out where
reference data is being used and what it is being used for. Even so, there is general awareness that it features prominently in two kinds of places. The first is reporting hierarchies. Thus,
changing a value in a reference data table can have unpredictable effects on what appears in reports, perhaps hiding information. The second area is in program code since if a data value ever
appears in a business rule, it is almost certain to be a code of some kind. Deleting a code value, or changing it, can therefore cause program logic to react in unpredictable ways.

Given these consequences, it is easy to make a case for strict governance of reference data tables and their contents. Yet this need is rarely addressed, let alone implemented. Part of the problem
seems to be that we are dealing with data content, not data structures. Data management has traditionally focused on building data models and taken the attitude that whatever content users put into
databases is their problem. This extends to recognizing the need for data stewardship and implementing data quality improvement. Data management is quite happy to help implement processes and
functionality to assist in these areas, but still wants to treat the data content as a “black box” that data management does not need to get involved with.

However, data management must get involved with data content of reference data tables. The great majority of these tables are some kind of taxonomy, where the data content of the table is used to
classify other data in the database. Taxonomies allow us to treat similar data in similar ways, and achieve orderly knowledge from of a mass of detail. Yet taxonomies are not found in data models
nearly so much as they are found in the data content of reference data tables, and this data content requires governance.

The Role of Taxonomies

Nobody would walk into a supermarket and expect to find fish mixed in with the baked goods, detergent with the meat or spices in with the dairy products. Yet this is the level of taxonomic chaos
that exists in most reference data tables. A supermarket would be closed down in hours if it managed its taxonomies in this way, but pretty much all enterprises limp along for decades with their
reference data in such a condition. Look at the data in any code table and you will likely see semantic mismatches, gaps, overlaps and different levels of granularity. Hence, we see examples such
as a Credit Status tables with code values like “Platinum,” “Gold-1,” “Gold-2,” “Silver,” “Suspended,” “Employee,” and
“Department 123.”

The root of the problem is that records in reference data tables have meanings, but traditional data management only addresses semantics in data models. Nor do traditional data stewardship and data
governance programs distinguish taxonomies in reference data tables from any other kind of data content. To them, all data is just data, and taxonomies do not have to be treated differently than
any other kind of data.

Taxonomies and Traditional Logic

Traditional logic has not been taught in schools for more than a century, but has quite a lot to say about what we now call “taxonomies” and what it calls “division and
classification.” Complaints against traditional logic usually include statements to the effect that set theory and linguistics have replaced it. I do not have space to go into this issue, so
I will only say that removing the fairly easily understood, and rather common-sense, tools of traditional logic and replacing them with relatively inaccessible and impractical academic theory can
only be described as unhelpful.

Here are some of the rules that traditional logic gives us for taxonomies:

  • The basis of division must remain constant. So you cannot have a code table that is supposed to be Color but also includes Flavor. You cannot shift the basis of division (i.e., of the
    taxonomy) from one record in a code table to the next.

  • The species in a division must exhaust the genus. In logic-speak, the genus is what is getting split up in a taxonomy, and the species are the individual categories in the taxonomy
    – i.e., the records in the reference data table. So, if I have an Economic Sector table and forget to include anything for the Internet, I will be making this error. And, yes, this is where
    the terms genus and species used in biology come from.

  • The divided parts must be less universal than the thing being divided. So, you should not find Regions in a Country table.

  • The parts must exclude one another. Everything in a taxonomy must be distinct and cannot overlap. I cannot have Voice over IP as one category and Internet Telephony as another.

The list can go on. What is important is that a set of rules for managing “taxonomies” has existed for several centuries but is completely ignored in data management when it comes to
reference data tables – which can be anywhere from 20% to 50% of the tables in a database.

Governance of Taxonomic Reference Data

Given that we have a set of rules for managing taxonomic reference data what should we do next? The obvious thing is to implement them. This can be done in two major ways. The first is to assess
the current mess and use the rules to understand what is wrong with the data content of reference data tables.

Unfortunately, reference data that has resided for a long time in silos is tightly bound to the program logic in these silos. The program code massages, filters, cleans up, enhances and otherwise
manipulates the messy reference data so that the silo actually works. Thus, when you ask a silo boss if he or she has any data quality problems in the silo, they will inevitably reply that data
quality is perfect. However, when you pull the data out of the silo and decouple it from its protective shell of program logic, it suddenly makes a lot less sense. Hence, we have the paradox that
data warehouses can have severe data quality problems, whereas the silos from which they take their data do not.

What we can do is to take the rules from traditional logic (and others) and use them to figure out where taxonomic reference data tables have gone astray. Just recording this knowledge is a big
step in being able to deal successfully with such data outside the applications where it is produced. The data can thus be transformed into something that makes sense in the other environments.

More importantly, the rules can be applied proactively. Users responsible for managing the data content of reference data tables can be identified and trained in the rules for management of
taxonomies, most of which come from traditional logic. Right now there is pretty much nothing to guide what a user puts into a code table, except perhaps an authority that will state what is
“right” or what is “wrong.” The rules give us a detailed, objective and clear basis for governance.

Much work remains to be done in the area of governance of reference data tables that are taxonomies. Hopefully, the next few years will see detailed implementations being built out in this area.


submit to reddit

About Malcolm Chisholm

Malcolm Chisholm has over 25 years experience in data management, and has worked in a variety of sectors, with a concentration on finance. He is an independent consultant specializing in data governance, master/reference data management, metadata engineering, and the organization of Enterprise Information Management. Malcolm has authored the books: Managing Reference Data in Enterprise Databases; How to Build a Business Rules Engine; and Definitions in Information Management. He was awarded the DAMA International Professional Achievement Award for contributions to Master Data Management. He holds an M.A. from the University of Oxford and a Ph.D. from the University of Bristol.