Published in TDAN.com October 2001
Reference data is known to most IT professionals as “code tables” or “lookup tables”, and usually exists as rather simple database tables with just two columns – one for a primary key code and
another for a description. This basic and rather monotonous design often leads to the unwarranted conclusion that reference data needs little attention, or even that it can be safely ignored. Yet,
the same IT professionals who arrive at this conclusion often feel a sense of unease, sometimes approaching dread, when they have to deal with reference data. This is because there is an often
unarticulated understanding that reference data may be simple in structure, but it is highly leveraged throughout all databases, and getting it wrong can have really bad consequences.
Getting it right, by contrast, has few rewards because no-one has any real idea of its value or importance to the enterprise. All this presents a challenge to data administrators, but it is a
challenge that can only be met by thinking about reference data in a new way, and understanding its special management needs.
What is Reference Data?
The first step to managing reference data is to understand what it is. Reference data can be defined as follows:
This definition is fairly broad and this is because reference data is actually quite diverse. Here are some of the major kinds of reference data:
Things external to the enterprise. Examples are country, currency and time zone. Very few enterprises in the business of producing new countries, currencies, etc, but all
enterprises need to put their information into a context that can be understood within the framework of the wider world. So it becomes necessary to use information that describes things outside
the enterprise which can never be changed by the transactions that the enterprise processes.
Type codes, status codes, and role codes. These reference data tables constrain the design of a database and play a significant part in the business rule logic of any
application. An example of a type code is shown in Figure 1, where the Patient Type table contains two records one for “Male Patient” and one for “Female
Patient”. It controls the subtypes of the Patient table, which consist of Male Patient and Female Patient. Status codes similarly control
entity life histories, and role codes control relationships of entities. This category of reference data is quite close to meta data, has values that are known when a database is designed, and is
never updated by users.
Classification schemes. Any information in a database can be classified in an infinite number of ways, although it is common for an enterprise to use just a few schemes, e.g.
Industry Classification. Classification schemes are rather prone to changing reporting needs. E.g. in the energy crises of the 1970’s it was common to classify economic activities by energy
consumption levels. Today, issues like environmental impact and diversity in the workplace are seen as important reporting needs
Constant Values. The reference data we have discussed so far describes entities, be they external things, design elements, or ideas. However, additional non-key reference data
may be used by the enterprise. Tax rates are a good example. They are set by a particular jurisdiction and must be used by the enterprise. These constant values are nearly always non-key
attributes of reference data tables
What’s Special About Reference Data?
Even though reference data can be quite diverse, there are several things that unify it as a distinct layer of data in an enterprise’s information architecture. The first of these is that it does
not represent things that are parties to the transactions of the enterprise, nor does it represent these transactions themselves. In other words, it is not describing things that the enterprise
does business with. Yet it is needed in a database to categorize the other data which does describe the enterprise’s transactions.
The next important thing about reference data is that it often consists of actual values that have meaning, i.e. definitions. These definitions may be just as complex and difficult to understand as
definitions for entities and attributes. This is illustrated in Table 1, which shows an example for Organization Type.
Other kinds of data have values that do not have semantic differences. Two different people have two different values for their Social Security Numbers, but there are not two different definitions
for these Social Security Numbers. Or again, a bank may have hundreds of thousands of accounts, each with a unique Account Number. Every one of the values for Account Number has an identical
meaning – the meaning provided by the business definition of the attribute Account Number.
The fact that reference data values can have meaning is what enables these data values to control program logic. It is also why reference data values turn up so often in business rules. The
relationship between reference data and business rules is usually not well understood, but the link is actually very strong. Indeed, it is safe to say that an enterprise that wants to implement a
successful business rules strategy must manage its reference data effectively.
Another property of a great deal of reference data is that its data values are often set by bodies outside of the enterprise. There are standard sets of country codes maintained by ISO and the
United Nations. Credit ratings are defined by Moody’s and also by Standard and Poor’s. Currency codes are maintained by ISO. This means that it often does not make sense for an individual
enterprise to create its own data values for a particular reference data table.
One final common property of reference data is that it is frequently used repetitively throughout a database, in databases across an organization, and even across different organizations. A
currency table can be related to every other table in a database where financial amounts are stored. After all, the currency of the amount must be known in addition to its magnitude. This can
create a huge number of relationships from the currency table to the other tables. Furthermore, many databases in an enterprise will need a currency table. Hopefully identical copies of the
currency table will be used. If this is not done, then it becomes a lot more difficult to combine data from these different databases. It may also be necessary to send data to another organization,
e.g. to the government for regulatory reporting. When this is done, the receiving party must be able to understand the reference data used by the sending party. The fact that reference data is
redundantly used in this way is not widely recognized, and lack of standardization often leads to incompatible implementations of the same reference data in different databases.
Should Data Administration Care?
All of these special properties possessed by reference data mean that it is a special layer of data in an enterprise’s information architecture. As such, it has special management needs, and data
administration must be responsive to these needs.
However, data administrators tend to shy away from dealing with data values. They are often more comfortable when dealing with meta data, such as data models, or with standards for e.g. naming
conventions. Actual data values seem to belong on the other side of a divide. After all, we are talking about real data values in a production database, and surely it is the users’ responsibility
to populate these values. This attitude ignores the fact that reference data describes things that the enterprise does not do business with, which means that reference data usually has no business
owners. Data administrators can look all they want for business owners of reference data, but they will probably never find them. The only logical conclusion is that the data administration
function itself is the owner of reference data.
Data administrators also have problems with the fact that data values of reference data have semantic meaning. CASE tools permit definitions for entities and attributes to be stored. They may even
allow “domains” to be defined. However, CASE tools do not permit definitions for reference data values. Why should they? After all, it is not possible to anticipate all reference data values
before a database is implemented in production. It can be expected that a good number of reference data values will be created after production implementation. This means that the logical place to
store definitions is with the data in the production database.
Some data administrators may question the value of storing definitions for reference data. However, these definitions can be extremely helpful to both developers and users. Users often need to
“code” transaction data, i.e. categorize it in some way. For instance, a hospital administrator may have to update a patient’s record with a diagnosis code based on a written diagnostic
description provided by a doctor. Having the definitions of the diagnosis codes available can be a great help in finding the one that matches the doctor’s description. Similarly, developers often
find it critical that they understand the meaning of reference data so they can apply it correctly in business logic. For instance, taxes are not normally payable on municipal bonds. Programmers
building a securities management system must be able to recognize municipal bonds in order to avoid applying the rules to calculate taxes on them. There is probably a table called Security Type in
such a system, and one value will be for municipal bonds.
What Should Data Administration Do About It?
There are clearly strong reasons why data administrators should play a role in the administration of reference data, but what should this role be? There are in fact a number of steps that the data
administration function can take to manage the enterprise’s reference data more effectively. The first step is to recognize that reference data has no business ownership and institutionalize its
ownership by data administration. This is in fact a natural fit, because many of the special needs of reference data management are suited to the skills possessed by data administration staff. This
is particularly true of the definitions that must be provided for reference data values. Data administrators have a great deal of experience in obtaining complete and correct definitions for
entities and attributes, and it is a simple extension of this “dictionary” work to apply it to reference data values. Cataloging the enterprise’s reference data is another task that is well
suited to data administrators. If possible, the goal should be to find all reference data used by the enterprise and map it to the production databases where it is used. If reference data
definitions and cataloging can be handled successfully, then data administration can be seen by all as the center of knowledge for reference data. Furthermore, data administration can speak with
authority on important issues such as data quality problems related to reference data, and the availability of reference data to drive business rules.
There are other aspects of reference data management that need to be undertaken as well. The use of externally defined standards, such as ISO currency codes, can be a great help, but it takes time
and effort to research these standards. Also, such standards are not always magic bullets. For instance, the SIC Industry Classification is very broad, but for any particular enterprise it may not
be detailed enough for the industry sector in which the enterprise operates. In cases like this, data administrators may need to adapt standards to the needs of the enterprise – something that must
be done carefully for the enterprise as a whole and not on a project by project basis.
If data administration does undertake the management of reference data, it will need the resources to do the job, which translates into people and infrastructure. This in turn means that there must
be a budgetary justification for the work. However, this is not too difficult to provide, since problems caused by bad reference data are easily understood. Any enterprise that does not assure the
quality and integrity of its reference data runs very high risks from misapplied business rules, and bad transaction data quality. This can include system failures, such as system crashes, and the
delivery of incorrect information to decision-makers or customers. The highly leveraged nature of reference data magnifies these problems enormously. These risks are not too difficult to document,
and there are often cases of real problems (or near misses) that the enterprise has encountered as a result of lack of attention to reference data. In addition to this, both developers and business
users will probably be glad to have someone take responsibility for reference data, or simply have a focus for reference data issues within the enterprise. This is because the need to manage
reference data is widely recognized, but it is justly seen as an enterprise-wide issue that is not really a task for business owners, with responsibilities for business processes, or developers who
have a more project-oriented view.
What is most important, however, is that reference data is recognized as a distinct layer of data in an enterprise, with its own unique properties and special management needs. It really is
different to other kinds of data, and must be viewed as such. If data administrators can take this first conceptual step, then moving to successful management of reference data is an easily
definable set of logical steps – albeit a great deal of work. Failure to recognize reference data as something different, condemns it to be managed in the same way as all other data, which is to
say that its unique needs will not be addressed and it will remain not just a source of risk and unnecessary expense, but also an unused asset for the enterprise.