Designing Reference Tables for Usefulness, Flexibility and Survival

Prologue

My inspiration for this article came while driving my Prius in the rain. The local classical music station feeds a digital signal1 that displays text on my car radio. I have to punch it up when I want to know what classical work I am listening to. Unfortunately, the primary data file for the composer and title information seems to have only one field, containing the full name of the composer and the name of the musical work. The problem is that text may be longer than the fields on many consumer radio displays. Certainly on mine. So I get short parts of the text (overlaying the previous string by about 10 seconds)….

JOHANNE S
EBASTIAN
BACH     BRA
NDENBURG
JOHANNE S

…repeating the cycle. In some cases, the name of the composer consumes the whole message, pushing out the title of the musical work. Only 64 characters are allowed for this message.

And that was unnecessary. Often, the last name (surname) of a popular composer is all that is necessary. How many composers were named Mozart? Or Handel? Or Copland? The “Bachs” could be differentiated by “CPE BACH” and “JS BACH.” All in a handful of characters. But the designer of whatever table or database feeds the RDS for this classical music station didn’t think about the audience or any technical constraint.

This is a common problem. In traditional file design (dating from the 1960s, when disk space was very expensive) only one name or description field was allowed for most reference tables (customer master, sales district master, G/L chart of accounts, etc.). But often the names and descriptions of things are longer than the space we have in reports and other displays.

We can do better than that. So we need to take another look at reference tables in databases.

Introduction

This article is basically about designing reference tables in a database to be useful and flexible in the face of many changes in data (and perhaps changes in data architecture).

To be highly useful, we want a reference table to contain all the facts about a subject entity which screens, programs, and reports may need, expressed in a way that fits those needs. And we certainly want these “facts” in reference tables, and not embedded in program code.2

However, facts may change over time. Customers relocate and morph. Prices go up. While many operational systems only need the “current” values of non-key attributes found in reference tables, sometimes it is important to remember historical values. Many legacy systems do not carry historical values on a database, so they may only be found in searching backup files, with some difficulty. So retaining historical versions of facts can be important.

Even some events supported by business applications require historical data. An invoice may need to reflect the price of a product at the time of an order (even if the order was placed three weeks ago, before the price changed). In such a case (unless the price is held on a transactional record), it is important to know what prices, policies, and conditions were true at the time of the order (or other historical event).

Hence, we want to design useful tables.

For Everything a Table

Businesses may have more reference entities than we sometimes realize. Almost every code (which is not human-created text) is a candidate for a reference table. Even “M” and “F” for gender could be a simple 2-row table, although many organizations have up to 16 gender codes. Don’t want to go there.

How about marital status? There are lots of different coding schemes; I know of few reliable standards. One company I know had 25 different values in this one-byte marital status code. Those values all need explanation, and if a reference table had existed (enforcing a domain of valid values), there probably would not have been as many as 25 values in some columns.
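Even a humble domain like marital status benefits from a reference table: with the valid codes in a table, the DBMS itself can enforce the domain. Here is a minimal sketch in Python/SQLite; the codes, names, and table layouts are my own illustrative assumptions, not any particular company's scheme:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
con.executescript("""
CREATE TABLE marital_status_ref (
    status_code TEXT PRIMARY KEY,
    description TEXT NOT NULL
);
INSERT INTO marital_status_ref VALUES
    ('S', 'Single'), ('M', 'Married'), ('D', 'Divorced'), ('W', 'Widowed');

CREATE TABLE person (
    person_id      INTEGER PRIMARY KEY,
    name           TEXT,
    marital_status TEXT REFERENCES marital_status_ref (status_code)
);
""")

con.execute("INSERT INTO person VALUES (1, 'Pat', 'M')")      # valid code: accepted
rejected = False
try:
    con.execute("INSERT INTO person VALUES (2, 'Lee', 'Q')")  # 'Q' is not in the domain
except sqlite3.IntegrityError:
    rejected = True
print("rejected:", rejected)  # rejected: True
```

Any program attempting to store a code outside the table's domain is stopped at the database, rather than each application re-validating the list in code.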

There are native (or natural) coded attributes (which don’t change much) of kernel and other entities, and then there are derived (from customer behavior or business behavior) coded attributes. An example of the latter is a scoring of customers based upon their value to the company – something which is probably somewhat volatile.

This is a classic part of customer relationship management (CRM). A bank may perceive some customers (with large deposit balances) to be of more value, and thus expend efforts to keep them happy, provide free checking, or other services.

These scores (“Grade”) will be maintained on the Customer Master Table.3

The Grade could be recalculated at the time of each customer contact, but that might slow down some transactions, so instead it is recalculated every night. The meaning of each Grade score should be kept on a reference file.

Again, it is possible to store this data (names and abbreviations of grades) in program code, but with a variety of uses, there would be too many places to store and maintain it. Better to keep it in the one master table shown above.

This table could (shown above in blue) contain the ranges of total balances for the recalculation processes (if it were that simple).4

And in this respect, many reference tables serve as the target of foreign keys in other (almost always larger) tables – sometimes other reference tables, but usually transaction tables.

Name length options

One thing a reference table should contain, at minimum, is a basic name of the instance being described. An example is shown below for the sales districts of a company. Because we often think spatially, each sales district has a name – sometimes of the city where the district sales office is located.

We might also have the name of the district manager in the table for a variety of processing purposes. And perhaps a cost center number (below).

But some reports (more paper than online) may be so wide (and some screen usages may be so limited for space) that the full name of the district cannot fit neatly in a column. Some kind of abbreviation is needed.

But it may also be that some reports (perhaps for government reporting) require the word “district” in the name text. And we may want to support different lengths of the “district name” field in various reporting formats and pull-down lists in GUI applications.

So a robust District Reference table might look like this:

This includes some “fully-qualified” names. Other attributes of a district might include:

  • Superordinate sales divisions
  • Dates when activity is due
  • Assistant manager’s name
  • Manager’s mailing address

So we can see that a reference table may be the logical place to house a variety of attributes.
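One way to sketch such a robust district table is to carry several name columns of different lengths, and let each report or screen pick the longest one that fits. The column names, sizes, and sample data below are illustrative assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE district_ref (
    district_no     TEXT PRIMARY KEY,
    name_full       TEXT,   -- e.g. 'Boston District' (qualified, for gov't reports)
    name_medium     TEXT,   -- e.g. 'Boston'
    name_short      TEXT,   -- e.g. 'BOS' (for tight screen columns)
    manager_name    TEXT,
    cost_center_no  TEXT
)""")
con.execute("""INSERT INTO district_ref VALUES
    ('01', 'Boston District', 'Boston', 'BOS', 'J. Smith', 'CC-4410')""")

def district_label(district_no, width):
    """Pick the longest stored name that fits the display width."""
    full, med, short = con.execute(
        "SELECT name_full, name_medium, name_short FROM district_ref "
        "WHERE district_no = ?", (district_no,)).fetchone()
    for name in (full, med, short):
        if len(name) <= width:
            return name
    return short[:width]   # last resort: truncate the shortest form

print(district_label('01', 40))  # Boston District
print(district_label('01', 8))   # Boston
print(district_label('01', 4))   # BOS
```

The RDS display in the prologue is exactly the kind of consumer this serves: a 64-character message would simply ask for the name that fits.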

Many coded attributes also serve as “categorical dimensions” in organizing how we think of breaking down business activity. I use the term “dimension” because they often become dimensions in the classical design of data marts for business intelligence.

Categorical Dimensions

Categorical dimensions often require a reference table. These include all the things we have codes for. Some examples might include:

  • States
  • Countries
  • Provinces
  • Sales divisions
  • Sales districts
  • Marketing districts
  • Distribution territories (for warehouses)
  • Brand and product groupings
  • Customer classes (by size)

Categorical dimensions are ways of clustering or grouping more granular entities (such as customers) and events (such as orders, sales, etc.) into summary aggregates for management reporting. And those groupings should be easy for management and knowledge workers to understand.

The list above includes government-defined entities as well as groupings of geography contrived solely by the company itself (for pricing purposes or for division of the field sales effort). These company-defined categories or groupings will probably not be seen anywhere else outside the company. Yet they may be the basis of numerous reports and even charts for managing the enterprise. So each deserves a reference table.

Characteristics of Coded Dimensions

Each code for which a reference table might be created may be thought of as a simple subject entity, and we need to ask some basic questions about such entities.

Are the instances mutually exclusive?

Many business analysts never think of this. This question is particularly appropriate for dimensions which are used to aggregate activity. We would naturally expect geographic territories to be mutually exclusive as shown below.

We would not want sales territories to overlap as shown below.

This territorial example is pretty clear-cut. But how about other categories of things? Are international customers rigorously outside the domestic definition? Are sales by internal sales representatives clearly not broker sales?

Some categories are more slippery:

  • Business products vs. consumer products.
  • Products for export rather than products for domestic consumption.
  • Elite customers vs. standard customers.
  • Product color, flavor, size and other characteristics.
  • Kosher or not.
  • Product usage (or intended use).

If concepts on adjacent records in a reference table are not mutually exclusive,5 one has ambiguity of definition and risks mis-reporting activity in aggregates based on these categories.

Are the categories mutually exclusive over time?

Ah, this is a more subtle problem. Consider a situation where the northeast United States is divided up into four sales districts by a successful, growing company.

To keep the example simple, the sales districts have been built of whole states (no partial states).

The district reference table would look something like this:

And the company grows, and the sales effort (and districts) must be subdivided. On January 1, 2013, the Vice President for Sales redesigned the northeast territory. States were added to or removed from districts, and sales representatives were reassigned. This constitutes a significant discontinuity in the district structure over time.

So the southern part of District 1 was split off to create District 22. And the southern part of District 2 was split off into District 21. And District 4 was expanded to include West Virginia, which was taken away from District 3.

Note something very important here. Some old district numbers, even though the districts were redesigned, are still being used, albeit with new meanings. “District 1” in 2013 is not the same as “District 1” in 2012.6 That may not be a problem in go-forward sales operations in 2013. But any kind of time-series analysis in the context of a data warehouse would be misleading. District 1 is not consistently defined over time. More about that later.

So the new (January 1, 2013 onward) District Reference Table would look like this:

Notice that District 01 was renamed. So what name would appear on time series reports such as that shown below?

Again, as we read across the District 01 line, we must remember that the scope (or coverage) of District 01 changed starting in FY-2013. That is a definitional discontinuity. If we revised this report format to include the names of the districts, we have a bigger problem.

What we have shown here is a very common situation. While the instances (districts) are mutually exclusive (at any point in time), they are not stable over time. And as they change over time, we may need to “remember” the historical characteristics of the districts. This memory is especially important for data warehousing and business intelligence analysis and reporting.
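One way to “remember” those historical characteristics is to effective-date the dimension itself, so each sale is aggregated under the district definition in force at the time. A minimal sketch, using the West Virginia reassignment from the January 1, 2013 realignment (table layout and dates are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Which district each state belongs to, effective-dated so that history
-- can be aggregated under the definition in force at the time.
CREATE TABLE state_district (
    state      TEXT,
    from_date  TEXT NOT NULL,   -- ISO yyyy-mm-dd
    to_date    TEXT NOT NULL,   -- '9999-12-31' means "still current"
    district   TEXT NOT NULL,
    PRIMARY KEY (state, from_date)
);
INSERT INTO state_district VALUES
    ('WV', '2000-01-01', '2012-12-31', '03'),
    ('WV', '2013-01-01', '9999-12-31', '04');   -- the Jan 1, 2013 realignment
""")

def district_for(state, sale_date):
    """Find the district a state belonged to on a given sale date."""
    row = con.execute("""
        SELECT district FROM state_district
        WHERE state = ? AND from_date <= ? AND ? <= to_date
    """, (state, sale_date, sale_date)).fetchone()
    return row[0] if row else None

print(district_for('WV', '2012-06-15'))  # 03
print(district_for('WV', '2013-06-15'))  # 04
```

A data warehouse joining sales to this table by sale date reports each year's activity under the district boundaries that actually applied that year.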

Linear Dimensions

A linear dimension is one based on some numeric attribute. Time is a linear attribute (although it is often broken down into discrete and named units).7 Customer size (and perhaps customer value) can be a linear dimension. Sometimes customers are categorized (or bracketed) by their size (see CRM above). Personal age, weight, and other characteristics can also be linear dimensions. Again, for meaningful reports, they are often bracketed.

Let us first consider the dimension of time. Here is a date master table. The subject entity is a calendar day.

Most of the field names in the above table are pretty self-explanatory. “Close” means the final day on which sales activity can be reported for credit in the current fiscal month. While this kind of data may also be stored in some kind of calendar system (like a company-wide calendar as part of the e-mail software), such a calendar database probably would not carry such unique attributes as month close, etc.

For most businesses, much reporting is done monthly. The definition of the month can vary (fiscal vs. calendar). The month may be keyed based upon the common, natural calendar (as below).

Some of this may seem trivial. But even if there are policy or business rules to determine the close-of-sales for a month, or the accounting-close-date for a month, if you want to make a one-time variation, you can do it in this table, without modifying any program.

Fiscal months and years

Many businesses structure their record-keeping and accounting activity in time periods which do not match the natural calendar. Fiscal years may begin in July or October. Fiscal months may be four or five whole weeks.

In the example below, fiscal months are composed of whole weeks (starting on Mondays, and ending on Sundays).

So some fiscal months have four weeks, and some have five. This reference table provides an authoritative source for application programs and processes to access this information in one place.
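Such a fiscal month reference table might be sketched as below; programs then map any calendar date to its fiscal period with a single lookup, rather than hard-coding calendar logic. The column names and specific dates are illustrative assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fiscal_month (
    fiscal_yr  INTEGER,
    fiscal_mo  INTEGER,
    start_date TEXT NOT NULL,   -- a Monday
    end_date   TEXT NOT NULL,   -- a Sunday, 4 or 5 whole weeks later
    close_date TEXT NOT NULL,   -- last day sales can be reported for credit
    PRIMARY KEY (fiscal_yr, fiscal_mo)
);
INSERT INTO fiscal_month VALUES
    (2013, 1, '2012-12-31', '2013-01-27', '2013-01-27'),   -- a 4-week month
    (2013, 2, '2013-01-28', '2013-03-03', '2013-03-03');   -- a 5-week month
""")

def fiscal_period(date):
    """Map a calendar date (ISO string) to its fiscal year and month."""
    return con.execute("""
        SELECT fiscal_yr, fiscal_mo FROM fiscal_month
        WHERE ? BETWEEN start_date AND end_date
    """, (date,)).fetchone()

print(fiscal_period('2013-01-15'))  # (2013, 1)
print(fiscal_period('2013-02-28'))  # (2013, 2)
```

A one-time variation to a close date is then a single-row data change, with no program modification.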

More problematic, we should mention, are some time periods which may be nominal “peers” to each other but actually overlap. One consumer product company had many promotion periods (with reduced list price or other incentives), some of them overlapping.

The start and end of each promotion period should be in a reference table. If there is ambiguity in the customer order process as to which should be used, the parameters for those rules should also be in the Promotion Table. Here is an example of peer instances (and records) which are not mutually exclusive.

What Code is Too Trivial?

Are there any common codes which are too trivial for a reference table? There may be. If there are names or descriptions for the codes, which might appear on reports, customer invoices, or on-line screens, then those names or descriptions probably need to be in a reference table. That covers the vast majority of the codes used in an enterprise.

Occasionally the meaning or definition of each code is held in some kind of corporate dictionary, glossary, or metadata repository. Such documentation is fine, but not easily accessible to application programs. Those meanings also need to be stored in a tabular structure in a common database.

But if there are any other attributes to those codes, especially ones which may drive business logic and branching, then a reference table is even more important to have.

Effective Dating – Remembering History

Some important facts have a succession of values over time. A common example is a customer’s address and how it may change over time.

It may be important to remember previous values of critical attributes. One way to deal with this is to use effective dating in reference table design. There are two kinds of potential volatility in a reference table or master table. One is the change of individual attribute value, and the other is the change in the form and substance of the instance overall. Let us deal with the attribute volatility first.

A typical customer master file contains the name and address elements for the customer.

If, on April 9, Acme Plumbing moved their office from 41 Main Street to 984 Elm St., it is a simple “update-in-place” maintenance to change that “fact” about Acme Plumbing. The change looks like this….

Well and good….as long as it is not important to know what the previous address was. For current operations, probably only the current address is important. For many other purposes, there may be value in retaining the previous address. This can be accomplished with some sort of effective dating of the records.

Effective dating introduces two new fields – a FROM_DATE and a TO_DATE. They describe the period of time during which the facts in the row are true.

The FROM_DATE is the first date in which all the facts to the right in the record are “true.” Sometimes this may be when the record was created. The TO_DATE is the last date the facts are true. If the TO_DATE is null (or some logical equivalent like Dec. 31, 9999), then all the facts to the right may be considered to be current and true.

This now means that this table has a two-part key. The TO_DATE does not have to be part of the business-logical key. In fact, it can be considered by purists to be redundant data, but having it in the record is certainly convenient.

Effective dating makes searching for a customer record a little more complex, because a date must be supplied when a record is retrieved. (Or, if the program is reliably seeking the current (true) record, then the null value may be useful. But this can backfire later on; see below.)

Now, when on April 9, Acme Plumbing moves to a new address, we will create a second record (below) reflecting the new information. We will also have to manage the FROM and TO dates on both records.

This may result in multiple records for a given customer if they keep changing some of their non-key attributes. Hopefully, all the address elements change together when they move. But if other attributes change, and it is important to remember what those attributes were in the past (such as an attribute influencing a pricing policy on a past invoice), then a new record must be created when each attribute changes. This may make some reference tables very large.
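Putting the pieces together, an effective-dated lookup might be sketched as below. The schema is illustrative: the two-part key is the customer number plus FROM_DATE, and '9999-12-31' stands in for the null TO_DATE of the current row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer_master (
    cust_no    INTEGER,
    from_date  TEXT NOT NULL,   -- first date the row's facts are true
    to_date    TEXT NOT NULL,   -- '9999-12-31' means "current"
    street     TEXT,
    PRIMARY KEY (cust_no, from_date)
);
INSERT INTO customer_master VALUES
    (17, '2001-01-01', '2012-04-08', '41 Main Street'),
    (17, '2012-04-09', '9999-12-31', '984 Elm St.');
""")

def address_as_of(cust_no, date):
    """Retrieve the row whose effective period contains the given date."""
    row = con.execute("""
        SELECT street FROM customer_master
        WHERE cust_no = ? AND from_date <= ? AND ? <= to_date
    """, (cust_no, date, date)).fetchone()
    return row[0] if row else None

print(address_as_of(17, '2012-04-08'))  # 41 Main Street
print(address_as_of(17, '2012-06-01'))  # 984 Elm St.
```

An invoice re-printed for an old order simply supplies the order date and gets the address that was true at the time.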

Alternative strategy – current and history

A different way to approach this (especially if lots of legacy programs are costly to change) is to maintain two reference tables. One is the “current” view of the world, and the other is the historical view.

In the above approach, Acme Plumbing has only one “current” record in the table on the left, but four “historical” records (including the current values) in the table on the right. This reflects a name change on July 5, 1999, a location change on June 2, 2006, and another location change on April 9, 2012.

The primary advantage of this strategy is that different programs have different data needs, and their access logic is made simpler.

Setting data up for the future – “future effectivity”

Another need arises when a future value of a non-key attribute is known, but it is not effective until some date yet to come. An example of this is a product price, or a ticket price for an airline. The best example is where we intend to raise the price of a product at some known date in the future.

Generally, pricing does not occur in the Product Master File because there are too many variables influencing pricing policy. So we will find that in many companies, the Price Master File may have a rather complex business-logical key.

A “current” price list (distributed to the sales force) might look like this:

Prices do vary by zone for any given product. This is a physical table, but what is published will look very similar.

But let us say that we plan to increase the price of product 477 in Zone 01 sometime in the future. From a late July, 2009 perspective, we plan to raise the price on October 1, 2009. Using the table on the right (below), we can pre-load that new pricing record (in yellow) well in advance of the effective date of the new price.

This saves someone from having to come into the office late on Sept. 30 to make all the pricing changes (with the potential for fatigue and error). This is especially valuable if some kind of change review process is involved.
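The pre-loading works because retrieval always asks for the price in force on a given date: the latest row whose effective date has arrived. A sketch using the product 477 example (table layout and prices are illustrative assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE price_master (
    product_no TEXT,
    zone       TEXT,
    from_date  TEXT NOT NULL,
    price      NUMERIC NOT NULL,
    PRIMARY KEY (product_no, zone, from_date)
);
-- Current price, plus the October 1 increase pre-loaded in late July
INSERT INTO price_master VALUES
    ('477', '01', '2009-01-01', 10.50),
    ('477', '01', '2009-10-01', 11.25);
""")

def price_on(product_no, zone, date):
    """The price in force on a date: latest row whose from_date has arrived."""
    row = con.execute("""
        SELECT price FROM price_master
        WHERE product_no = ? AND zone = ? AND from_date <= ?
        ORDER BY from_date DESC LIMIT 1
    """, (product_no, zone, date)).fetchone()
    return row[0] if row else None

print(price_on('477', '01', '2009-09-30'))  # 10.5  (old price still in force)
print(price_on('477', '01', '2009-10-01'))  # 11.25 (pre-loaded increase kicks in)
```

The yellow row in the example simply sits dormant until its date arrives; no one works late on Sept. 30.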

I should alert the reader to another issue: retroactive changes. These arise where (a) the company was not aware of a change (e.g., the customer moved, but we didn’t know it until the invoices started coming back in return mail), or (b) a government policy was passed by a tardy legislature and applied back to the first of the year.

Effective dating structures in reference tables can accommodate retroactive changes but they must be used carefully. (And also changing the reference table doesn’t fix or mitigate transaction history based upon the previous, now obsolete, policy parameter values.) A prudent design for accommodating retroactive changes is more complex, and beyond the scope of this article.

But this brings us to another important topic: the distinction between corrections to facts (which were not supposed to be “true” or posted as so) and updates to facts (in the normal course of business change).

Corrections vs. Updates

The distinction between these two concepts is often overlooked or misunderstood. An update should refer only to a fact which was true at one time but has changed, so that a new value for the attribute is now true. Updates are very common in the normal course of business. Customers change addresses. Businesses and humans change their names. Prices are updated (hopefully not in the past).

A correction is a replacement of an attribute value which was not true. For example, the customer address of “451 Elm Street” was never true. It was “4451 Elm Street.” An easy mistake to make. So in a non-versioned customer master file (i.e., no effective date used), we would simply overlay the old value with the new value.

Where attribute values are differentiated by time through effective dating or some other mechanism, the “correction” may apply only to the period of time during which the former value was true.

In the example below, the patient address did change, but WHEN it changed was not known.

So if our enterprise database accommodates meticulous information about when patients change residences (particularly valuable for epidemiology research), we would correct the effective date of the move.

The two addresses did not change; only the date of the move event did. So this is yet another kind of correction.

Morphing of the Entity Itself

In the previous few pages, we have talked about changes to single-valued attributes, but what if the instance of an entity changes its form or existence over time? I am speaking primarily of mergers and divestitures of businesses. This is a common problem in a customer master file composed of businesses. It does not seem to be a problem for human customers.

First, businesses can own businesses. If they are discrete entities, it is possible to maintain data about that ownership relationship in a simple tabular format.

For example, one company may be a parent (in terms of corporate ownership) to two other companies (below).

This parent-child relationship is easily handled in a simple column in the Customer Master File.

The foreign key in the Parent Number field points from the child to the parent.

This will not work, however, if a company can be a joint venture of two parents. And that often happens. The obvious resolution is to have an intersection entity which supports a variety of children and multiple parents (a classic data modeling issue).
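The intersection entity might be sketched as below: an ownership table keyed by both parent and child, so a 50/50 joint venture is simply two rows. Names and percentages are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (
    cust_no INTEGER PRIMARY KEY,
    name    TEXT
);
-- Intersection (associative) table: many parents per child, many children per parent
CREATE TABLE ownership (
    parent_no INTEGER REFERENCES customer,
    child_no  INTEGER REFERENCES customer,
    pct_owned NUMERIC,
    PRIMARY KEY (parent_no, child_no)
);
INSERT INTO customer VALUES (1, 'Alpha Corp'), (2, 'Beta Corp'), (3, 'Gamma JV');
INSERT INTO ownership VALUES (1, 3, 50.0), (2, 3, 50.0);   -- a 50/50 joint venture
""")

parents = con.execute("""
SELECT p.name FROM ownership o JOIN customer p ON p.cust_no = o.parent_no
WHERE o.child_no = 3 ORDER BY p.name
""").fetchall()
print([n for (n,) in parents])  # ['Alpha Corp', 'Beta Corp']
```

The single Parent Number column disappears; ownership becomes rows in the intersection table, with an ownership percentage available as an attribute of the relationship itself.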

But more vexing is the merging (at some point in time) or divestiture of corporate entities.

Consider a situation where Luigi’s Pizza Parlor buys out a competitor. The survivor has two stores. But Giuseppe’s Pizza is no more, and no longer purchases ingredients from us.

How is this managed on the customer master file, especially if we need to maintain a coherent history? One way is to recognize the full life cycle of a customer by including a create date and an “end date.” A business can be terminated by bankruptcy, acquisition, or other means.

In the above sample table, Giuseppe’s Pizza ended its life through an acquisition on June 1, 2008, and was acquired by Customer #45. On the other hand, Barney’s Diner went out of business on Sept. 4, 2009, and ceased to exist. Poor Barney.

This only scratches the surface of this topic – that of handling relationships between business customers, and their morphing, merging, divesting, and other changes in their basic existence.8 Again, this is far more complex than a mere attribute change (such as an address change).

Sequencing of Categories in Reference Tables

Some reference tables contain granular entities, like customers, elemental products-for-sale, etc. But many reflect categories by which elemental business activity is aggregated. These can be units of geography, enterprise-invented units of organization (e.g., sales districts), or others. Many times, these “categories” are perceived, by executives, as having certain significance. They want to see them in a certain sequence – the way they think about the world.

Suppose a company in the European market has a Country Master reference table. It might look something like this.

Those are its own internal codes for the countries. This does not conform to any international standard. But it has been in use within this company for years, and we must live with it.

But on a report, the executives want to see the countries arranged this way:

And that is with the gaps between the clusters. They have no name for the clusters, but that is how they want to see them. (Names may come later, but that is another problem.)

And they want to see the same sequence of countries on charts.

How do we drive this with reference data (rather than program code)? One approach is to change the identifiers of the country. But that is imbedding meaning into a key, and that is not a good data design practice.

To solve this, we can add new codes (which I call “cultural sequencers”) to the reference table.

The Cluster_Level_1 field tells which high level cluster the country is in. The Cluster_Level_2 field tells how to sequence the country within that cluster. Depending on the size and complexity of the enterprise, there could be any number of clustering levels. They must be hierarchical, however. And, there could be multiple hierarchies (e.g., one on a sales perspective, and one on a cost accounting view) being stored in the same reference table.

If the executives (in their oxygen-deprived wisdom) want to change the clustering or ordering of the countries, a change need be made in only one place – this reference table. All the downstream reports and charts that use this reference table (and its cultural sequencers) will reflect the change.

If new countries are added, no procedure code need be changed. A new row can be added into this table with ease.

Quality Control

Reference tables should be very accurate. Changes made to even one cell (fact) can have an enormous effect on transaction and downstream batch processing. Hence, the importance of ensuring quality on reference tables is greater than for individual transactions. Some clerks may not understand the downstream consequences of the changes they make.9

Many companies have rigorous testing and quality control protocols for software. Equal care should be applied to changes in reference table data. It may be appropriate to have some kind of test region. But at minimum, there should be an approval process. The use of effective dating (shown above) allows changes to be made and approved prior to their having any effect on logic and branching in production systems.

Summary

Reference tables are a valuable element of a logical data design. They do involve some extra effort, but holding the data about small-domain (perhaps fewer than 10 values) codes and concepts in a table rather than in program code will reduce effort down the road. But during the initial design, it is vital to understand (forecasting into the future of the enterprise) whether the attributes need to be versioned in some way – that is, formally (with logical structure) remembering previous values, and the range of dates when they were “true.”

In the past, some resistance to such tables was based on fears of inefficiency in processing. Advances in relational DBMS technology and techniques have rendered such fears largely moot. When in doubt, create a reference table!

End Notes:

  1. RDS is Radio Data System which allows a portion of the FM radio channel to be used for transmission of digital information including the station identifier, time, and program information.
  2. Some “old timers” may claim that storing reference data in such tables, even without SQL joins, is slow. But modern DBMS software would probably make small reference tables like many described here as “core resident” so there would be no delay in accessing a row.
  3. For ease of readability, I am using Times New Roman in these tables, and a mixed case. In reality, they might be stored in all upper case. Whether textual data is stored in all upper case (the default for many legacy mainframe databases) or mixed case must be a conscious decision by the data management or data stewardship experts in an enterprise. For me, when I get mail addressed in all upper case, I perceive the enterprise to be a bit backward.
  4. Robust CRM policy would also consider lateral relationships between customers. As Terry Moriarty has been telling us for years, it is valuable to know that some poor student (with a low balance) has an uncle who has $25 million deposited. That low-balance student would be treated with some deference.
  5. This situation is actually quite common where elemental items and aggregates both appear on a reference file. In one large company I worked for years ago, both elemental product offerings (SKUs) and brand groupings appeared as “peer” records in the Product Master File.
  6. This subtle discontinuity in the definition of “District 1” can be a major problem for downstream systems and data warehouses (within the organization, but especially outside the bureaucracy). If not properly informed of the redefinition of “District 1,” downstream users may misrepresent all kinds of time series analysis such as growth rates and such.
  7. Even if we use codes like “JAN” and “FEB” (3 characters) to designate months, they are understood to occur sequentially.
  8. Just for fun, it can get even more ambiguous about the nature of some businesses. Franchises, for example, can have an external name which does not reveal their true ownership. In the hospitality industry, it can get even worse. A hotel property can have a brand (e.g., Holiday Inn), an owner (e.g., Erie Partners, Ltd.), and a management company (e.g., Northeast Hospitality Management, Inc.) And who do the maids and front desk people work for?
  9. In one company I worked for, a low-level clerk in finance made a “realignment” change to an accounting structure reference table in an ERP system. It was made prematurely, and it took the company three days to recover, particularly re-loading all the downstream data warehouses and data marts.
