In an article by Mr. James P. O’Brien (Maximum Business Solutions) the author outlined desirable properties of relational keys. I
agree with almost everything he had to say, but I have to take issue on some points.
I agree that a Natural Key can be inherent characteristics, such as DNA signatures, fingerprints and (longitude, latitude). I also agree that the ISO 3779 Vehicle Identification Number (VIN) can
also be Natural Keys. What makes them Natural Keys is a property that Mr. O’Brien does not mention: they can be verified and validated in the real world.
When I worked for a state prison system, we moved inmates by finger printing them because we had to be absolutely sure that we did not let someone out before his time, or keep them in prison longer
than their sentence. If I want to verify (longitude, latitude) as an attribute, I can walk to the location, pull out a GPS tool and push a button. The same principle holds for colors, weights, and
other physical measurements that can be done with instruments.
The VIN number is a bit different. I can look at the format and determine if it is validate VIN number – Honda does not make a Diablo and Lamborghini does not make a Civic. However, if the parts of
the VIN are in the correct format, I need to contact the automobile manufacturer and ask if the VIN was actually issued. If Honda made 1,000,000 Civics, then a VIN for the 1,000,001-th Civic is a
fake.
Validate internally, and verify externally. But then leads to the concept of a “Trusted Source” that can give us verification. And then leads to the concept of “How trusted?” is my source.
My local grocery story believes that the check I cash is good and that the address on the check and Texas Drivers license number are correct. If I produced a license with the picture of a large
black woman that did not match the name on the check they would question it. But as long as the photo id looks good and has a bald white male who looks like “Ming the Merciless of Mongo” on it,
they will probably cash the check.
When I go to travel to certain countries, I need a birth certificate and a passport. This is a higher degree of trust. For some security things I need to provide fingerprints. For some medical
things, I need to provide DNA – that is probably the highest degree of trust, since in theory you could make a clone from my sample a la many SF stories.
The points I want to challenge in Mr. O’Brien’s article are that a Natural Key
- Must have an invariant value
- Must have an invariant format
This year, the retail industry in the United States is switching from the 10-digit UPC barcode on products to the 13-digit EAN system and the ISBN (International Standard Book Number) is falling
under the same scheme. Clearly, this violates Mr. O’Brien’s condition. But the retail industry is still alive and well. Why?
The most important property of a key is that it must ensure uniqueness. But that uniqueness does not have to be eternal. Nor does the format have to be fixed for all time. They simply have to be
verifiable at the time I ask my question.
The retail industry has assured that the old and the new barcodes will identify the same products by a carefully planned migration path. This is what allowed us to change the values and the formats
of one of the most common identifiers on earth. The migration path started with changing the length of the old UPC code columns from 10 to 13 and padding them with leftmost zeros.
In a well-designed RDBMS product, referenced keys are easy to change. Thus, I might have an Inventory table that is referenced in the Orders table. The physical implementation is a pointer in the
Orders table back to the single value in the Inventory table. The main problem is getting the data types correct.
My next nit with Mr. O’Brien is the definition of Surrogate Keys. I prefer Dr. Codd’s over his: “..Database users may cause the system to generate or delete a surrogate, but they have no control
over its value, nor is its value ever displayed to them …”(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM
Transactions on Database Systems, 4(4). pp. 397-434.
This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does.
Auto-numbering is an exposed physical locator and not a surrogate.
Codd also wrote the following:
“There are three difficulties in employing user-controlled keys as permanent surrogates for entities.
- The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with
the result that some or all of the serial numbers might be changed.). - Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.
- It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a
retiree).
These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution – proposed in part [4] and more fully
in [14] – is to introduce entity domains which contain system-assigned surrogates.
A ‘surrogate key’ is an arbitrary artificial attribute / column. In tables they quite often are implemented as identity columns. This is an awful design decision. Auto-numbering is a way to fake
a sequential file record number or a pointer chain. It is a value generated in the hardware at insertion time and not an attribute of the entity.
Mr. O’Brien argues for exposed physical locators when
- No suitable Natural Key for the entity exists
- A concatenated key is so lengthy that performance is adversely affected.
The first condition — no suitable Natural Key exists – is a violation of Aristotle’s law of identity (to be is to be something in particular) and the result of a bad RDBMS flaw. Or the designer
is too lazy to look for industry standards.
If you honestly cannot find for industry standards and have to create an identifier, then you need to take the time to design one, with validation and verification rules, instead of returning to
1950’s style magnetic tape files.
The argument that a concatenated key that is “too long” forgets that you have to insure the uniqueness of that key to maintain data integrity anyway. Your performance choices are to either have
the SQL engine produce a true surrogate or to design an encoding that is shorter for performance. The VIN has a lot of data (country, company, make, model, plant, etc.) encoded in its 17-character
string for verification.