SELECT * FROM Celko – June 2009

Several weeks ago, I spoke at a SQL Saturday in Atlanta, GA. This is a series of free events all over the U.S. for MS SQL Server users to get
together on a weekend, get some free food, some free lectures and hear war stories. It was good to get back to my hometown and visit people I had not seen in, well, decades in some cases.

While I was there, I ran into a new term that I am passing along to you, dear reader. That term is “Google diligence.” It refers to the obligation of a professional database designer to get on the
Internet and research the data elements in his schema design.

In the old days, starting a new project in unknown territory was an adventure. Sometimes you were lucky enough to have an orientation class, trade magazines or books on the subject area. Most of the
time, you were not lucky and had to learn on the fly.

First, we interviewed subject-matter experts and compiled a ton of notes. They talked their jargon to us and we listened. We promptly misunderstood them and had to re-interview them after presenting
a preliminary design.  The term “iterative design” sounds much better than “groping in the dark” to the client.

After we got to the point we had a working vocabulary and some idea of the business rules we needed, it was time to go to the existing documentation. In those days that meant we grabbed a few meters
of three-ring binders off a shelf and started reading. The world ran on paper back then. You hoped that the documentation the client had was current. Major shops with legal requirements were pretty
good about keeping things current because they could afford a full-time librarian. Smaller shops were not so lucky and had to depend on a secretary or an intern to keep up the documentation.

Some industries have had trusted data sources that could provide some of the changes in electronic format. Back then “electronic format” meant a magnetic tape issued to subscribers. This
costs money. There were no websites or downloads that you could get.

Let me make this more concrete. Consider the humble ZIP code. It is used in virtually every commercial enterprise in the United Sates because it is the only way to send mail. Today the term has
become almost a generic name for any postal code, and almost nobody remembers what it originally meant (Zone Improvement Program). The basic format is a string of five digits that appear after the
state abbreviation code in a United States street address. An extended _ZIP+4_ code was introduced in 1983, which includes the five digits of the ZIP code, a hyphen, and four more digits that
determine a more precise location than the ZIP code alone.

Only bulk mailers used the ZIP code when it was first introduced. They got a discount for presorting and bundling their mail so there was an immediate payoff for them. When smaller commercial and
compulsive people like me started using them, we had to buy a book of ZIP codes at an office supply store or the post office. A subscription to the magnetic tapes was too expensive for home
use.

When we put ZIP code into files (or early databases), programmers usually allocated a five character field. The slightly smarter programmer allocated five digits with a COBOL PICTURE declaration or
equivalent in their programming language. No validation.

The code is based on geography. The first digit is a contiguous group of states. The first three digits identify the Sectional Center Facility (SCF). An SCF sorts and dispatches mail to all post
offices with those first three digits in their ZIP codes. Most of the time, this aligns with state borders, but there are some exceptions for military bases and other locations that cross political
units or that can be best served by a closer SCF.

The last two digits are an area within the SCF. But this does not have to match to a political boundary. They tend to identify a local post office or station that serves a given area.

In spite of this geographical nature, we never bothered much with ZIP code validation. The attitude was that when we got a bad ZIP code, we would wait for an address correction to come back from the
post office and fix it manually.

This got old fast. The first temptation is to write a simple CHECK() constraint to mimic the COBOL PICTURE clause:

zip_code SIMILAR TO _[:DIGIT:][:DIGIT:][:DIGIT:][:DIGIT:][:DIGIT:]_

This does not quite work. Not all five-digit strings are actually used as ZIP codes. There is an urban myth among old programmers that 99999 used to be an Eskimo village in Alaska when the system was
first implemented.

COBOL programmers traditionally use all 9’s in a field to signify a missing value (you will see that convention in many standard encoding schemes today). The result was that the village was
swamped with bags of improperly encoded bulk mail because nobody had validation routines for the new system. It made inexpensive fuel for the winter.

You might want to use 00000 and 99999 for special purposes, but the lowest ZIP code in use at the time of this writing is 00601 for Adjuntas, PR (Puerto Rico is a U.S. territory). The highest ZIP
code in use is 99950 for Edna Bay, Kassa and Ketchikan, AK.

While several towns might be put under one official name (in the last paragraph, Edna Bay and Kassa are considered part of Ketchikan), a ZIP code will belong to only one state, territory or military
postal unit.

The USPS also publishes a list of preferred abbreviations for states, street suffixes and other parts of an
address.

If nothing else, it is very easy to create a simple auxiliary look-up table with {low_zip_code, high_zip_code, state_code} to see if the state and the ZIP code matched. Later, the zip_code could be
used in a JOIN on this auxiliary table to provide the state_code when we moved to SQL.

You can find commercial software to edit your addresses to USPS standards. But none of the information I just gave you required a ton of research. A few minutes on Google or Wikipedia gave us all of
these specs for data integrity constraints. Another few seconds and you could have downloaded your auxiliary table.

You might not be impressed with ZIP codes because you know them quite well. Pick another encoding that you don’t know  – or one you think you do – and see how it is really defined.
Do your Google diligence!

Share

submit to reddit

About Joe Celko

Joe is an Independent SQL and RDBMS Expert. He joined the ANSI X3H2 Database Standards Committee in 1987 and helped write the ANSI/ISO SQL-89 and SQL-92 standards. He is one of the top SQL experts in the world, writing over 700 articles primarily on SQL and database topics in the computer trade and academic press. The author of six books on databases and SQL, Joe also contributes his time as a speaker and instructor at universities, trade conferences and local user groups. Joe is now an independent contractor based in the Austin, TX area.

Top