Intelligent keys are something that most data analysts abhor. Unfortunately, they are quite common, and there is often great attachment to them among communities of business users. This makes them
tricky to deal with. The problem is compounded by the fact that data analysts seem to be divided into two groups: one group with a near-religious belief that all entities should have surrogate
keys; and another group with an equally strong belief that all keys should be natural. Both groups are quick to condemn anything at variance with their core convictions, and intelligent keys are
typically treated as something close to heresy. Yet, are intelligent keys really all that bad? It is possible to build a strong case to include them in database designs, and there is good evidence
that they can be more useful than prejudiced data analysts often suppose.
What is Intelligence?
What is an intelligent key? It is an attribute that is used to identify an entity, but which encodes some form of meaning. One form of intelligent key is a number that conforms to some kind of
mathematical rule. An example I have worked with is a 7-digit employee number where the seventh digit was derived by dividing the first six digits by 7 and finding the remainder. This provided a
rough way of determining whether an employee number was authentic or not. Although this may seem rather a mild use of intelligence it is still too much for many zealots who declare that nothing
with even the faintest whiff of “intelligence” can be put in a primary key, or indeed, as an attribute in an entity.
The more common form of intelligent key is where more than one piece of information is encoded in a data value. A good example is the familiar Social Security Number (SSN). This has the format
XXX-YY-NNNN, which can be interpreted as follows:
- XXX is called the “area” and represents the zip code of the mailing address of the assignee at the time he or she applied for a SSN.
- YY is called the “group” and is a two-digit numbers. Groups are assigned in a complex pre-determined sequence, and thus can be tied to periods of time.
- NNNN is called the “serial number”, and is assigned sequentially.
This is not the whole story about SSN’s, but it does illustrate multiple pieces of information in one data value. Account numbers often follow similar patterns, and can often encode over 10
different pieces of information – which is one reason they can be so long.
More Than Keys
There is one other fundamental property of an intelligent key – it is a key. That is, it is used to identify instances of an entity. However, it is a key designed for use by humans, not for
use by computers, and as such does more than provide uniqueness. When we humans search for information and try to understand its significance, we do not have the capacity to process large amounts
of data. Our brains, eyes and hands are the only tools we have for this task and they do not have the efficiency of computerized databases in terms of speed of data processing. Intelligent keys are
a strategy that humans can use to overcome these shortcomings. If someone searching for information understands what is encoded in the value of the key, they can often determine whether to or not a
particular instance of an entity is relevant to their search simply by looking at the value of the intelligent key representing that particular instance. People experienced with working with a
particular intelligent key can spot patterns in its use, especially patterns that may be “red flags” indicating potential error situations. There is another aspect of intelligent keys,
which is that they permit cross-checks with other redundant information about the instance of the entity under consideration. Such cross-checks are an important way in which humans detect data
quality problems.
The usefulness of intelligent keys is one reason why communities of business users become so attached to them. This is not merely for purely manual information systems, such as filing cabinets full
of folders with the intelligent key typed onto each folder tab. Even in computerized applications, intelligent keys help users to do their work.
The Law of Atomicity
The features of intelligent keys that make them so useful to business users violate one of the fundamental tenets of data modeling, the law of atomicity. This states that one attribute should
represent only one piece of information. Having an intelligent key means that if the underlying nature of the instance of the entity changes, then the value of the key may also have to change
because one of the pieces of information encoded in the key value has changed. Furthermore, programs have to parse intelligent key values to extract the different pieces of information encoded in
them. This is known to be error-prone. Yet, beyond this there is simply a prejudice that intelligent keys are “wrong” as a matter of principle. Many data zealots think no further than
this, and actually have only hazy notions of why they are really “bad”. The result can be conflict with the user community and a database design that is less useful than it might
otherwise be.
Physician Heal Thyself
Surprisingly, in one respect, data analysts are among the most ardent proponents and users of intelligent keys. In fact, they often try to force entire enterprises to use elaborate schemes for
intelligent keys. Of course, data analysts do not call these artifacts “intelligent keys”, rather they call them “naming conventions”.
We are still in a time where repositories (databases that store metadata) are rather primitive. Such repositories, including the system catalogs of databases, usually store only a little
information about the structure of a database, e.g. the tables, columns and relationships. Each of these items is given a name, but DBA’s, programmers, and other people that use tables,
columns and relationships need to understand as much as possible what these names mean from a business standpoint. The answer is to introduce naming conventions. One example of a naming convention
for a physical database column name could be that column names consist of three nodes separated by underscores. The first node represents an entity name, the second node represents a business
concept in a standard way, and the third node represents a general datatype. Creating and enforcing these kinds of naming conventions is a major part of information resource management in all
enterprises.
Naming conventions are wonderful examples of intelligent keys, and how useful they are. Nobody would ever suggest creating a random number as a table or column name. How could programmers, for
instance, remember the meaning of such things? Surely they would cause all kinds of errors to be introduced into program code. Indeed, how could anyone else who needs to deal directly with tables
and columns work with such meaningless names?
Dealing with Intelligent Keys
Data analysts cannot both be for naming conventions as a matter of principle and against intelligent keys as a matter or principle. The contradiction is simply too great. Approaching database
design with a set of ideas about what is “wrong” and what is “right” with no thought about why this is so is a recipe for disaster. Design means truly understanding the
trade-offs we make when we choose a particular option. Every design decision optimizes some functionality, and forecloses other kinds of functionality. It cannot be an exercise carried out on
automatic pilot according to some predetermined set of rules. This is not to say that rules do not exist, but rather that they have to be fully understood.
In the case of intelligent keys, throwing them out of a database design is often a mistake. Finding a good way to include them may need some extra design work, but it can be done. For instance,
there is no need to actually make them into primary keys. The fact that intelligent keys often encode several pieces of information can be mitigated by creating additional attributes to hold these
pieces of information, while keeping the intelligent key in the database. Then, programs can be written that do not need to parse key values to extract the values of these attributes (though it can
be quite a chore to get programmers to do this). The intelligent key is thus available, albeit not as a primary key, and provides additional value to users.
Perhaps the most important lesson is that if business users are attached to intelligent keys, data analysts should try to incorporate this requirement in database designs. It may not be easy, and
it may require additional effort by the IT organization as a whole. However, absolute prohibitions of intelligent keys based on nothing more than the idea that this is “good practice”
does not serve the enterprise well, and is indeed hypocritical when it comes from groups that enforce naming conventions for physically implemented database structures.