There is an assumption in database design that all design can be represented in data models. I have not actually heard this assumption stated, but I have not heard it challenged either. It is true
that data modelers will question whether a denormalized data model intended for direct implementation as a database truly represents the enterprise’s view of its data. This is often a reason
for disagreement between those responsible for logical data models and those responsible for instantiating physical databases. However, ardent purists never act in a way that questions whether a
sound logical data model can capture everything that is needed to understand a database design.
Yet is this true? I would submit that it is not. In fact, within every physically implemented database, there is a layer of design that is absent from any possible data model, which cannot even be
represented in a data model and yet which must be understood to actually use the database. This additional layer of design only exists when data values are populated in a physical database, and it
is specified in reference data tables.
Reference Data
Reference data is a class of data that has many homonyms and synonyms. In the sense used here, it is what are commonly referred to as “code tables,” “lookup tables,” or
“domain values.” In a data model, these entities look unimpressive. They typically contain two attributes: a “code” that is a primary key, and a description. In a physically
implemented database, they seem equally simple. There are usually just a few rows in these tables, and these contain data that typically changes infrequently. In fact, it is often referred to as
“static,” or “slowly changing.”
However, reference data has some curious properties that are not shared by other categories of data. One of them is that the code values often have definitions. For instance, the code
“BRONZE” in a record in a Customer Preferred Status table may be used to identify a customer to whom no credit can be extended. Even in a Country
table, it may be necessary to have definitions to tell us, for instance, whether “USA” includes Puerto Rico, or whether “People’s Republic of China” includes Hong
Kong. Only metadata – at least the metadata for entities and attributes in a data model – shares this need for semantics. All other categories of data lack the need to assign meanings
to physical data values. The Customer “John Doe” has no semantic meaning, nor does the Product “Widget X” and neither does the order I just placed to buy an airline ticket
online.
In my book, Managing Reference Data in Enterprise Databases, I defined reference data as follows:
Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the
enterprise.
This definition is based on the role that reference data plays at a high level in a database. It provides an easy way to recognize reference data and understand what it does in general terms. It is
within the general role of categorizing other data that reference data is used to specify design in a physically implemented database.
Moving Design Around
The way in which reference data can specify database design is illustrated in the data model fragment in Figure 1. This is part of a larger data model for a broker who arranges the buying and
selling of businesses. Various parties interact in the sale of a business, playing a number of different roles. Figure 1 shows a Party entity, representing legal entities that may
be involved in the sale of a business, and a Business Sale Transaction entity, representing an actual sale of a business. Within the Business Sale Transaction
entity the different roles which a party can play are represented by the attributes Business for Sale ID, Broker ID, Buying Institution ID, Selling Institution ID, Buyer’s Legal
Representative ID, and Seller’s Legal Representative ID. All these attributes are role names of the attribute Party ID in the Party entity.
Figure 1: Fragment of Data Model from Business Broker Subject Area
Many data modelers will be tempted to look at Figure 1 and ask themselves if there are so many known roles for institution today, might there not be more in the future? If this question is answered
in the affirmative, the data modeler will likely implement a design like that shown in Figure 2.
Figure 2: Fragment of Revised Data Model from Business Broker Subject Area
In Figure 2, roles for a party are defined in a new reference table called Party Role. This will have one record corresponding to each of the attributes for the various party roles
previously placed in the Business Sale Transaction entity. If new roles arise, they can simply be added to this table as new records.
From a modeling perspective, this is very interesting. In Figure 1, we had many attributes in Business Sale Transaction that were foreign keys of Party. Each of
these attributes will hopefully have had a complete and accurate definition in the data model from which Figure 1 was taken. In Figure 2, these attributes and their definitions are no longer
present in the data model. They have been removed from the logical level and replaced by the generalized attribute Party Role Code that qualifies Party ID in the Party
Involvement table to create the equivalent meaning of the attributes removed from Business Sale Transaction. Thus, it is at the physical level – specifically in the
records within the Party Role database table – that we will find the equivalent of the definitions of the attributes formerly in Business Sale Transaction.
The Logical / Physical Divide
What we see here is how individual attributes at the logical level in one data model can be replaced by records in a physically implemented database table in an equivalent data model. The data
modelers have chased design structure out of the logical level into physical reference data tables. In Figure 1, it was possible to understand the different roles played by a party from the data
model itself. That is impossible in the data model from which Figure 2 is taken. The definitions of the institution roles can only be found in the records in the physical database itself. In
practice, there is usually no place to store such definitions in reference data tables, so this information is nearly always lost. It is lost from the data model, that is. Users still have to know
about it to use the database. No data modeler would add attributes to a data model without specifying definitions, but this is essentially what happens when records are added to the typical
reference data table.
It is not just metadata about attributes that has been removed from the data model in Figure 2. The primary key of Business Sale Transaction has changed, as the data modeler has
decided to make Business for Sale ID a Party Role record. Perhaps more importantly, the individual relationships between Party and Business Sale
Transaction have their own optionalities and cardinalities in Figure 1. All these have been collapsed into the three identifying relationships between Party Involvement
and its parent entities in Figure 2. Business rules for Buyer’s Legal Representative ID and Seller’s Legal Representative ID must be defined for every Business
Sale Transaction to be captured in the data model in Figure 1. They cannot be captured in the data model in Figure 2.
It may be argued that the example we have just described represents an unusual case and that data models still capture nearly all of the semantics of a database. This is not the case. Until proven
otherwise, it should be assumed that every record in a reference data table can be transformed into an attribute in a data model. Consider a Customer table with an attribute of
Address Country related to a Country reference data table. There is another way to design this. In the data model representation we could add one attribute for
each potential country in which a customer could live and remove the Country table entirely. The Customer entity would then have additional attributes such as Address is USA,
Address is CAN, Address is MEX, and so on. This would also conveniently provide a place for us to put our definitions of “USA,” CAN,” “MEX,” etc. for the purpose
of customer address.
Of course, nobody would do this. For one thing, it would be argued that we would not be able to connect such new columns to a description like “United States of America.” But this is
true of all attributes – we somehow have to know what to call them for screen prompts and report labels. Perhaps it is because we have all simply adopted the patterns of well-known reference
data tables and do not think much about it. The point is that reference data values can be transformed into attributes in a data model, and this shows why they are truly unlike other kinds of data
and why they are so important. Given that 20% to 50% of the tables in a given physical database are reference data tables, there is enormous scope for them to hold a great deal of database design
information.
More importantly, it needs to be understood that any data model with reference data tables means that a layer of database design is going to exist in the physical database itself and that the facts
of this design cannot be found in the data model itself.
Self-Limited Data Administration
The more reference data a physical database contains, the more the design is contained in the physical database itself, and the less it is contained in the corresponding data model. Data models
with many reference data entities always look generalized. We can pretend that we understand the design of the databases such models represent, but we do not and we cannot.
Database designs that are deliberately intended to be generalized always have many reference data tables. These include a lot of so-called COTS (commercial off-the-shelf) software. Although a
customer may think they are getting a standard product, the reality is that this software often needs “configuring.” This is done, in part, by adding values to and selecting values from
reference data tables. The result is a design that is very specific to an individual customer, even though the same data model applies to all customers.
None of this would be a problem if we acknowledged that we are simply pushing complexity from one level (the logical) to another (the physical), and captured all the metadata involved for reuse.
However, data modelers, and traditional data administrators, usually see themselves as concerned only with the logical level. They have the attitude that any data populated into a physically
implemented database is solely the responsibility of the business. Hence, the design constructs that the data modelers themselves have chased out of the logical level simply fall onto the unwitting
heads of the business users. It should, therefore, come as no surprise that enterprises have a hard time understanding and managing their data.
It may be thought that this is too harsh a judgment. After all, many data administration units try to do “codes management.” The problem is that they are still thinking in the paradigm
of IT owning the logical design and the business owning the physical data. Also, they tend to think that all data is just data, not that reference data is a special class of data (among others)
that has its own special needs – particularly semantic management. This artificial divide between the logical and physical allows data modelers and data administrators to escape from their
true responsibilities, and will always limit the capacity of an enterprise to manage its data assets.
There is an urgent need to acknowledge the limitations of data models and find a way to bridge the logical-physical divide between models and actual databases so that database designs can be fully
understood. There is an even greater need to move data administration beyond its self-imposed confines of the logical world. Newly emerging trends in master data management and enterprise
information architecture seem to be going in this direction. Hopefully, new methodologies and tools will emerge to support these important requirements.