SELECT * FROM Celko – September 2012

I spend a lot of my time looking at bad SQL. I have a heuristic derived from decades of observation. Programmers do not understand that most (80-95%) of the work in SQL is done with the DDL. If you have to constantly write complicated DML, then you are most likely trying to repair bad DDL on the fly. Learning to hide this messy code in VIEWs is not a solution. Hoping that fast hardware will not save uyou.

But even an more common error is not know how to design the columns with the proper data type. When we were creating SQL, we knew that SQL had to work with data from any of the “X3J Language” data sources. That is a nickname we “Standards-guys” use for the programming languages that have an ANSI Standard defined by the X3J subcommittee. When we started, this was FORTRAN, COBOL, Mumps, C, PL/I and a few others. Today, some of the languages like ADA and PL/I are gone and others have appeared.

There are problems with converting from and to the internal SQL data types and the host language data types. COBOL knows the exact length of a string, while C uses a “nul” terminator. Some languages support BCD. Some have Boolean data types, but disagree on the internal format. The list goes on, but my point is that there are a lot of physical data type representations. Data conversion has cost.

But once inside the database, most designers do not consider the implications of their data type choices. They create the table with columns that can be loaded with source data. The designers are done and the DBA or programmers can worry about performance.

Inefficient data types require more storage. While we have cheap disk storage today, the time required to move it from the disk to the machine is what costs you. Oh yeah, backups and tape storage have a cost, too. My favorite example is the moron who blindly allocates strings of the largest possible length in Unicode for a fixed-length data element that is limited to the Latin-1 character subset.

Apparently, lots of designers do not know that Unicode standards require all languages have to support this Latin-1 subset (unaccented uppercase alphas, digits and a few punctuation marks). All ISO standards use only this character subset for encoding, so that any language can represent them.

One of my favorites was “zip_code VARCHAR(255) NOT NULL” which just invites garbage data. When is see this, I like to load it with the Heart Sutra (http://en.wikipedia.org/wiki/Heart_Sutra) in Chinese. It is 260 characters, so you lose the last four. If they do not learn SQL, perhaps they can obtain Zen enlightenment.

The obvious correction is “zip_code CHAR(5) NOT NULL”, but this is only a start. You can still get garbage data, so you might want to add “CONSTRAINT zip_digits_only CHECK (zip_code SIMILAR TO ‘[0-9][0-9][0-9][0-9][0-9]’)” to your declaration. Even better, add a constraint to check the range, “CONSTRAINT valid_zip_range CHECK (zip_code BETWEEN ‘00501’ AND ‘99950’)” to be safe from out of range errors, which usually means all zero and all nine entries.

A more subtle version for SQL Server users is having columns declared as DATETIME, their version of the ANSI Standard TIMESTAMP data type. It was eight bytes, but they also have a SMALLDATETIME data type that is often sufficient with only four bytes. Today, SQL Server has an ANSI Standard DATE data type that is only three bytes. But other SQL users will often define a temporal data type to an absurd precision and pay for the extra storage. Most commercial work needs only a date, not a timestamps. Timestamps seldom need to go to nanoseconds.

That extra storage space is in the table, all the indexes involved, and the backup files. All those extra bytes take space in the buffer cache when you read them. The fewer number of bytes it takes to fit a row on a page, the fewer pages need to get brought into cache for a query, then the faster the query. In fact, IBM has some statistics that show reading compressed data and expanding it in main storage much faster than reading uncompressed data directly. Disk reads are physical events done at the speed of rotation; decompression is done in silicon at the speed of electrons.

Officially, there is a difference between FLOAT and REAL, but today they both mean the IEEE-754 standards. We seldom need to use them in commercial work. They require special math which might or might not be built into the hardware you use. Converting the internal format to a display format is expensive. Contemplate the cost of “zip_code FLOAT NOT NULL” as a declaration.

But over-sized data types are just one problem. Mismatched data types are a great way to decrease performance and produce unwanted results. SQL has implicit conversions defined in the language which take mismatched data elements and promotes them to the same data type for math and comparisons.

The most common example is having tag numbers used with numeric values. For example “WHERE zip_code = 12345;” has to convert the integer value 12345 to a string for the search condition. The right way is “WHERE zip_code = ‘12345’;” so you have a CHAR(5) to CHAR(5) comparison. The conversion can mess up optimizer estimates and prevent index usage.

This is not just numeric versus string implicit conversions; Using unlike string data types have to be converted. CHAR(n) to CHAR(m) has to pad out the shorter string; CHAR(n) to VARCHAR(n) has to pad out the short string. CHAR(n) to NCHAR(n) has to go from 8 bits to 16 bits and the same for VARCHAR(n) to NVAR CHAR(n).

My favorite example is a CASE expression where the programmer is still locked in a procedural mindset so he thinks it is a statement. The skeleton code where alpha is INTEGER:

CASE alpha
WHEN 1 THEN 42
WHEN 2 THEN 75
WHEN 3 THEN 15
ELSE ‘alpha error’ END;

The CASE looks at each THEN and the ELSE clause (if any, ELSE is optional) and uses the highest data type for the data type of the result of the expression. The ELSE is a string! Now the THEN clauses have to be implicitly cast to strings. That is probably not the intent of the programmer, who was trying to send an error message. This would have worked with a procedural language where we have control flow as the computational model. Here is that procedural snippet in SQL/PSM:

IF alpha = 1
THEN SET x = 42;
ELSE IF alpha = 2
THEN SET x = 75;
ELSE IF alpha = 3
THEN SET x = 15;
ELSE CALL Print(‘alpha error’);
END IF;
END IF;
END IF;

The usual heuristic is that the best data type declaration is usually the shortest, smallest one that models the attribute values. The exception to this is when you expect expansion or contraction in the range of values. One example was the switch from UPC (10 digit) to EAN (13 digit) bar codes in the retail industry, which will be followed by the GTIN (15 digit) in a few years. Most of us in the book trade were caught off-guard by the change from the old 10 digit ISBN codes to the 13 digit EAN a few years ago. It was a pain to add the extra spaces and re-do the check digit in the database; the application programmers had it worse since they had to re-defines screens and printed forms.

But instead of declaring the column to be CHAR(13), we declared it to be CHAR(15) and then used a CHECK() constraint to enforce the EAN length and check digit. When GTIN comes along, we have the length in place and just have to change the other constraints. I wonder if the application programmers did anything like that on their side of the system?

MenuMenu

SELECT * FROM Celko – September 2012

Joe Celko

MenuMenu

Share this post

Joe Celko