Key Wars: Episode 1

 

Published in TDAN.com July 2004

 

A couple of years ago I was at a DAMA Day event in New Jersey where the theme was data modeling. It was well attended and there were a number of great presentations with a lot of audience
participation. Within a short time the topic of how to choose the primary key of an entity came up, and I groaned inwardly because I knew what was coming next. A large part of the rest of the day
was taken up by two sides in the audience exchanging strongly felt opinions. One group of modelers heartily insisted that “surrogate” keys were the right way to construct primary
keys, and an equally vociferous group countered that “natural” keys were the only sensible approach. Needless to say, the day ended with few true believers on either side changing their
minds, and the rest of the audience dazed and confused.

The issue of whether to use “surrogate” or “natural” keys for the primary key of an entity is a topic of active discussion on many database design projects, but it is
astonishing how much bad feeling it can generate between fellow data modelers, and between data modelers and other IT professionals. At the bottom of it there seems to be a belief that there exists a
“right” way to design primary keys, and that if only the right way is chosen then all kinds of benefits will automatically flow. This article, and one yet to come, explores the argument
over surrogate versus natural keys, and whether the notion that there is a single “right” way to model primary keys is justified or merely an unfounded assumption of two groups of key
zealots.

 


What Are Natural And Surrogate Keys?

 

The primary key of an entity is a set of one or more attributes of the entity that uniquely identify real world instances of the entity. In a physically implemented database, the primary key of a
database table is a set of one or more columns belonging to the table that uniquely identify records (also called rows) of the table. Some data modeling tools and database software packages allow
entities and tables to exist that have no primary keys. In reality, not having a primary key is a mistake because it makes it difficult to identify unique instances of an entity in practice,
impossible to establish relationships between entities, and difficult to prevent the inclusion of duplicate records in a database table. At the other extreme, it is possible to find different
attributes or groups of attributes that at least have the potential to uniquely identify instances of an entity. However, an entity or database table can only have one primary key. In these
situations data modelers will choose one attribute, or set of attributes, as the primary key, and may designate the other attribute or groups of attributes as alternate keys. In my experience it is
rare to see alternate keys identified as such in data models, but some modelers do it and it can be very useful.
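
To make the idea concrete, here is a minimal SQL sketch; the Employee table and its attributes are purely hypothetical. Two candidate keys exist: one is chosen as the primary key, and the other is declared as an alternate key by means of a UNIQUE constraint.

    -- Hypothetical Employee table with two candidate keys.
    -- EmployeeNumber is chosen as the primary key; NationalInsuranceNumber
    -- is designated as an alternate key via a UNIQUE constraint.
    CREATE TABLE Employee (
        EmployeeNumber          CHAR(8)     NOT NULL,
        NationalInsuranceNumber CHAR(9)     NOT NULL,
        FamilyName              VARCHAR(50) NOT NULL,
        GivenName               VARCHAR(50) NOT NULL,
        CONSTRAINT PK_Employee PRIMARY KEY (EmployeeNumber),
        CONSTRAINT AK_Employee_NINumber UNIQUE (NationalInsuranceNumber)
    );

Declaring the alternate key in this way documents the second candidate key and also prevents duplicate values in that column.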

Natural and surrogate keys can be differentiated as follows:

 

  • A natural key is a primary key composed of one or more attributes that have business meaning. These attributes are observable characteristics of an entity that exist in real world instances of
    the entity.
  • A surrogate key is a primary key composed of attributes that have no business meaning, but are created by IT personnel. They are added to entities during data modeling. Surrogate keys of
    entities are nearly always single attributes rather than groups of attributes. Both kinds are sketched in the SQL example following this list.
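
As an illustration only, the SQL below sketches both approaches. The Country tables and columns are hypothetical, and the auto-numbering syntax (IDENTITY, in the style of SQL Server) varies between DBMS products.

    -- Natural key: the country code is an observable business attribute.
    CREATE TABLE Country (
        CountryCode CHAR(2)     NOT NULL,  -- e.g. 'US', 'CA'
        CountryName VARCHAR(60) NOT NULL,
        CONSTRAINT PK_Country PRIMARY KEY (CountryCode)
    );

    -- Surrogate key: CountryID has no business meaning and exists only to
    -- identify the record; it is generated by the system, not observed in
    -- the real world.
    CREATE TABLE CountryWithSurrogate (
        CountryID   INT IDENTITY(1,1) NOT NULL,
        CountryCode CHAR(2)     NOT NULL,
        CountryName VARCHAR(60) NOT NULL,
        CONSTRAINT PK_CountryWithSurrogate PRIMARY KEY (CountryID)
    );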

 

 


Level of Modeling

 

While primary keys are necessary, they may require different approaches depending on what level a data model is at. Ideally, data models should be developed by going from conceptual to logical to
physical levels. The approach to the primary key is usually somewhat different at each of these levels, and this may be one of the factors that contributes to fights over keys.

At the conceptual level, where the emphasis is more on communication with business users, it may be possible to use natural business terms like Bank Name to describe the
Bank entity. At the logical level in a data model, we are expected to accurately represent the nature of the business in detail. At this point the primary key attributes must be fully
defined. This is more analysis than design. Typically the modeler searches for candidate key attributes from the known attributes of an entity. Analysis hopefully shows whether these candidate keys
really do uniquely identify real world instances of the entity in question. There may be several candidate keys that could serve as the primary key. One needs to be chosen, but the others can be
designated as alternate keys, as noted above. It is also possible that no suitable candidate key can be found. In this case a surrogate key must be created, typically by adding a new attribute to the entity.
Lastly, the logical data model has to be transformed into a physical one, which represents the implemented database. This part is pure design, not analysis. It is quite common for implementers,
such as programmers, to attempt to influence the design of the physical data model. Programmers are often apt to claim that changes need to be made for “performance reasons”, and one of
their favorite changes is to replace multi-column natural primary keys with single column surrogate keys. With today’s high performance hardware and database software, “performance
reasons” are often quite dubious. However, there are more subtle reasons why programmers want single column surrogate keys, as we shall see later.
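
To illustrate the kind of change involved, consider a hypothetical Order Line entity whose natural key is the combination of Order Number and Line Number. The sketch below shows the multi-column natural key design and the single-column surrogate design that programmers often ask for; the IDENTITY syntax is again SQL Server style and is only one way of generating surrogate values.

    -- Physical table carrying the logical design forward:
    -- a multi-column natural primary key.
    CREATE TABLE OrderLine (
        OrderNumber INT      NOT NULL,
        LineNumber  INT      NOT NULL,
        ProductCode CHAR(10) NOT NULL,
        Quantity    INT      NOT NULL,
        CONSTRAINT PK_OrderLine PRIMARY KEY (OrderNumber, LineNumber)
    );

    -- The change often requested: a single surrogate column as the primary
    -- key. The natural key should still be protected by a unique constraint,
    -- although in practice this step is sometimes forgotten.
    CREATE TABLE OrderLineWithSurrogate (
        OrderLineID INT IDENTITY(1,1) NOT NULL,
        OrderNumber INT      NOT NULL,
        LineNumber  INT      NOT NULL,
        ProductCode CHAR(10) NOT NULL,
        Quantity    INT      NOT NULL,
        CONSTRAINT PK_OrderLineWithSurrogate PRIMARY KEY (OrderLineID),
        CONSTRAINT AK_OrderLine_NaturalKey UNIQUE (OrderNumber, LineNumber)
    );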

So we can already see that at different levels of data models there are different requirements and perspectives that affect key structure, just as could be inferred from the Zachman Framework. Even
so, many IT professionals still feel that there is some “right” way that always works for defining keys, no matter what the level of the data model. Yet, how can a single right way
always apply to logical data models that are built as a result of analysis, and physical data models that incorporate a lot of design?

 


The Programmer’s Perspective

 

The progression of data models through different levels, as just described, is unfortunately not always followed in practice. The reality is that many databases are designed only at the physical
level, and that they are designed by implementers – that is, by IT staff who may have analytical skills and business knowledge, but who are essentially programmers. These personnel often have a
“pure” programming perspective that looks at business data from a special metadata viewpoint. A programmer is trying to build functionality, not to use functionality to solve business
problems, as a business user is. The programmer’s tools often reinforce this perspective, and a large component of this way of thinking is that a programmer needs to uniquely identify
records. The programmer is not interested in unique instances of an entity in the same way that a business user is. Rather, the programmer needs to be able to create functionality that will
navigate the database and process records in a reliable manner.

SQL has enabled programmers to do “set at a time” processing of large numbers of rows in single or related tables. However, there are limits to what can be done in single SQL
statements, and programmers often have to retrieve a recordset and loop through it a record at a time to do more complex processing. Quite often, this involves “remembering” the
identity of a particular record and then returning to that record later. When this happens, the programmer is concerned with records – artifacts of data storage – not with the business
information the records represent. Of course, in many other contexts the programmer does have to deal with the business information that records represent. However, dealing with records from the
metadata-based perspective of a record being a record is a real need for a programmer. In this context a primary key is something that uniquely identifies a record in a database table, irrespective
of whether it has business significance.
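
The pattern looks something like the following sketch, written in Transact-SQL style against the hypothetical OrderLineWithSurrogate table from the earlier example; the processing logic is invented purely for illustration.

    -- Row-at-a-time processing: loop through a recordset, remember the
    -- identity of a record of interest, and return to it later.
    DECLARE @OrderLineID INT, @Quantity INT, @RememberedID INT;

    DECLARE line_cursor CURSOR FOR
        SELECT OrderLineID, Quantity FROM OrderLineWithSurrogate;

    OPEN line_cursor;
    FETCH NEXT FROM line_cursor INTO @OrderLineID, @Quantity;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Some complex processing happens here; remember the record that
        -- will need follow-up work.
        IF @Quantity > 1000
            SET @RememberedID = @OrderLineID;
        FETCH NEXT FROM line_cursor INTO @OrderLineID, @Quantity;
    END;

    CLOSE line_cursor;
    DEALLOCATE line_cursor;

    -- Later, go straight back to the remembered record.
    UPDATE OrderLineWithSurrogate
       SET Quantity = 0
     WHERE OrderLineID = @RememberedID;

What is being remembered here is a record identifier, not a piece of business information – precisely the metadata-based perspective described above.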

 


Software Tools and Primary Keys

 

The perspective that programmers have of records in database tables is often reinforced by the tools they use. The early PC database packages that were founded on dBase, such as dBase III, FoxPro,
and Clipper, all had RECNO() functions. This function returned a unique number for a record that the programmer could use to navigate recordsets and tables. Indeed, programmers could use this in
place of primary keys to uniquely identify records in certain circumstances. More modern DBMS packages tend to be page-based, where blocks of disk storage can accommodate more than one record.
However, even these packages tend to have some mechanisms for programmers to identify individual records, such as bookmarks in Microsoft Access. These mechanisms tend to work only within the
context of temporary situations, such as recordsets, and are not always persistent in the database. As a result of this, programmers tend to want some means to uniquely identify records across
different user sessions. A single column surrogate primary key is the answer to this problem.
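
In other words, a bookmark or RECNO() value may not survive the current session, whereas a stored surrogate key value does. A minimal, hypothetical illustration using the same OrderLineWithSurrogate table:

    -- Session 1: the program notes the surrogate key of a record it cares about.
    SELECT OrderLineID
      FROM OrderLineWithSurrogate
     WHERE OrderNumber = 1001 AND LineNumber = 3;  -- suppose this returns 42

    -- Session 2, days later and on a different connection: the same record
    -- can be fetched directly, because the surrogate key is persistent data
    -- rather than a transient navigation artifact.
    SELECT *
      FROM OrderLineWithSurrogate
     WHERE OrderLineID = 42;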

Software tools present other constraints to programmers that affect their thinking about primary keys. For instance, Microsoft Access has a forms design tool that permits combo boxes (also known as
drop-down lists) to be easily created. A combo box of states, or countries, is a typical example. The list is taken from a database table, and when the user selects an item from the list a single
value is passed back to the Access program. The problem is that only one database column can be “bound” in this way. It is not possible to use multi-column primary keys for this
purpose. Again, this influences programmers to want to work with database tables that have single column primary keys.
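
A hypothetical sketch of the kind of design this encourages: a lookup table with a single-column key can be bound to the combo box, and the selected value is stored as a simple single-column foreign key, whereas a multi-column key offers no single value to bind.

    -- Lookup table whose single-column key can be bound to a combo box.
    CREATE TABLE StateProvince (
        StateCode CHAR(2)     NOT NULL,
        StateName VARCHAR(50) NOT NULL,
        CONSTRAINT PK_StateProvince PRIMARY KEY (StateCode)
    );

    -- The value selected in the combo box comes back as one column and is
    -- stored as a single-column foreign key.
    CREATE TABLE CustomerAddress (
        AddressID INT IDENTITY(1,1) NOT NULL,
        Street    VARCHAR(100) NOT NULL,
        StateCode CHAR(2)      NOT NULL,
        CONSTRAINT PK_CustomerAddress PRIMARY KEY (AddressID),
        CONSTRAINT FK_CustomerAddress_State FOREIGN KEY (StateCode)
            REFERENCES StateProvince (StateCode)
    );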

Another feature of Access is that when a new database table is created, Access will offer to automatically create a single column surrogate primary key. Again, this is not something confined to
Access, and other software packages have similar functionality. It appears that such functionality influences many programmers to think that surrogate single column primary keys are the correct
design. From a programmer’s perspective this can certainly be a useful design. However, it can go beyond that for some programmers, who will consider nothing other than a single column
surrogate key for every table in a database.

 


Drawing the Battle Lines

 

Data modelers understand the need to uniquely identify instances of entities from a business perspective, but programmers often have a different perspective based on dealing with records, and
influenced by the tools they use. This is part of the origin of disputes between these two camps when a database is physically implemented. Unfortunately, neither camp may have a good appreciation
of how the other camp has arrived at its opinions, and indeed their positions are often simply expressed as opinions and nothing more. However, the problem is much more complex than this. There are
consequences that arise from choosing a natural or surrogate key, and for either choice there can be advantages and disadvantages. Understanding these consequences is extremely important because it
allows informed decisions to be made about key structure.

In the next article on this topic we will look at these consequences, and particularly at the challenge presented when the analysis needed to produce a logical data model conflicts with the
design needed to produce a physical database.


Malcolm Chisholm

Malcolm Chisholm is an independent consultant with over 25 years of experience in data management and has worked in a variety of sectors. Chisholm is a well-known presenter at conferences in the US and Europe, a writer of columns in trade journals, and an author of three books. In 2011, Chisholm was presented with the prestigious DAMA International Professional Achievement Award for contributions to Master Data Management. He can be contacted at mchisholm@datamillennium.com or via the website www.datamillennium.com.
