SELECT * FROM Celko June 2012

In April, I attended a talk by Alexander Kopac at the local DB2 user group meeting. He is well-known in that community for his performance tuning skills. Obviously, a lot of the presentation was DB2-specific, and his code was written for the schema information tables in DB2.

I tend to come into a database from the SQL queries. Mr. Kopac works from the other direction. He looks at what is available from the system without going to the internals of a coded module. We often arrive at the same conclusions, but from different directions.

He hit on many general rules that apply to any SQL product.

At the SQL statement level, I do not care about the customer wait time for a web application. I just want the right answer. I am so inside the box that I do not even want to know what the front end is! But the client needs to know that a 15-second response time will move the potential customer to click over to the next competitor's website. Timeouts, lock waits, deadlocks, lock escalations and other performance snags can be caused by bad schemas and SQL. But I have to find where to look for this bad SQL.

Trust But Verify

Every modern SQL has some kind of schema statistics that are a static description of the contents of the table. You can write the best query in the world and have it perform like cold glue if the stats are out of date. This sounds simple, but it can be tricky. If I run stats on Monday, then load skewed data on Tuesday, the stats can be worse than useless. As a real-world example, a state government that gets data by counties can have huge differences, and the data loads are done by counties from tapes compiled locally. Sparse populations do not have the same demographics as dense populations, so the stats change after every data load.
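The cure is to tie the statistics refresh to the load, not to the calendar. The exact statement is product-specific; here is a sketch in DB2's dialect, with the table name made up for illustration:

```sql
-- After the county tapes are loaded, refresh the optimizer's picture of
-- the data. DB2 style shown; the options here are illustrative.
RUNSTATS ON TABLE Census_Data
   WITH DISTRIBUTION ON ALL COLUMNS
   AND INDEXES ALL;

-- Other products spell it differently, e.g. SQL Server:
-- UPDATE STATISTICS Census_Data WITH FULLSCAN;
```

Run it as the last step of the load job, so Wednesday's queries see Tuesday's skew.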

Another system level problem is a table reorganization (aka compression, defrag, garbage collection, etc). This is generally done on the weekend so that on Monday the tables are physically smaller and often are put into some physical sorted order. This makes the queries run on Monday and Tuesday faster, but by Wednesday things are slowing down. The reason is simple: the tables are being modified (insert, update, delete) but the reorganization was done with zero percent free space (aka fill factor, etc.) in the tables and perhaps zero percentage free space in the indexes. Every time the table changes the system has to handle a data page overflow condition!
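The fix is to leave growth room when you reorganize. Again, the syntax varies by product; this sketch uses DB2's PCTFREE and SQL Server's fill factor, with made-up object names:

```sql
-- DB2 style: reserve 10% free space on each page of the table and index,
-- so Wednesday's inserts do not cause page overflows.
ALTER TABLE Orders PCTFREE 10;
CREATE INDEX Orders_cust_idx ON Orders (customer_id) PCTFREE 10;

-- SQL Server spells it as a fill factor (percentage used, not free):
-- ALTER INDEX Orders_cust_idx ON Orders REBUILD WITH (FILLFACTOR = 90);
```

Note the two products count in opposite directions: PCTFREE 10 and FILLFACTOR = 90 describe the same page.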

Another tag line of Mr. Kopac's is "Are there leftover pizza crusts in your backyard?" which refers to schema objects that should have been removed (or never created in the first place), but still exist. Look for empty tables or for tables that are seldom accessed. Another crust is indexes that are not used, but still have to be maintained.

This happens when non-RDBMS programmers create tables in their code to mimic scratch tapes and then never get rid of them. Instead of writing a single SQL statement with derived table expressions, they do the job step by step, just like we did with magnetic tape files. Another give-away is tables with generated names that usually include a sequence number or an ordinal date. The ordinal date is how IBM used to label tapes, with a "yy-ddd" field in the label.
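To make the contrast concrete, here is a sketch of the two styles; the table and column names are invented, and the date arithmetic uses DB2's dialect:

```sql
-- Tape-file thinking: one scratch table per job step.
-- CREATE TABLE Work_12166 (..);   -- note the yy-ddd style name
-- INSERT INTO Work_12166 SELECT .. FROM Orders WHERE ..;
-- SELECT .. FROM Work_12166 GROUP BY ..;
-- DROP TABLE Work_12166;         -- if anyone remembers to

-- Set-oriented thinking: one statement with a derived table expression.
SELECT customer_id, SUM(order_amt) AS order_tot
  FROM (SELECT customer_id, order_amt
          FROM Orders
         WHERE order_date >= CURRENT_DATE - 30 DAYS) AS Recent_Orders
 GROUP BY customer_id;
```

The derived table lives only for the duration of the statement, so there is nothing to clean up and nothing to leave rotting in the schema.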

TRIGGERS are for Guns and Bombs

Triggers can be terrible resource hogs. Everything has to be locked while the triggers cascade over multiple tables and indexes. Any trigger is a sign of a poorly designed database; it says that you did not know how to write declarative code to enforce data integrity. Incompetent programmers will allow bad data into the table, then correct it after the fact via triggers. This is not to say that all triggers are bad, but they are rare. My heuristic is that you will write no more than five triggers in your career.
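Here is the pattern in miniature, with hypothetical names. The trigger version lets a bad price into the table and then patches it; the declarative version never lets the bad row in at all:

```sql
-- Procedural after-the-fact repair (the bad habit, sketched):
-- CREATE TRIGGER Fix_Price AFTER INSERT ON Products ..
--   UPDATE Products SET unit_price = 0.00 WHERE unit_price < 0.00;

-- Declarative: the constraint is checked before the row ever exists.
CREATE TABLE Products
(product_id CHAR(9) NOT NULL PRIMARY KEY,
 unit_price DECIMAL(8,2) NOT NULL
    CHECK (unit_price >= 0.00));
```

The CHECK() constraint costs one comparison at insert time; the trigger costs locks on every table it touches, every time it fires.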

When you look at the schema information tables and see too many triggers, you need to re-write the code.

What is Too Much?

Standard SQL has no CREATE INDEX statement because we considered it to be too physical for a language specification. But that does not mean indexing is not important. There are some things we want in the indexes in a schema.
  • We want low index cardinality — that is, few indexes per table. Low does not mean zero, because every table will have an index on its primary key.
  • Over five indexes per table is a really bad sign. In particular, single column indexes are suspect. If you reduce the number of overall indexes and change single column indexes to composite indexes, you will have fewer overall locks.
  • A lack of clustering or sorted indexes is a bad sign. It will force sorting and table scans that could be avoided.
  • Indexes that are simply not used at all are another bad sign. There is an old programming proverb that nothing is more permanent than a temporary fix. Nobody is sure that it is safe to remove the index, so it stays. Hints (pragmas, set locks, etc.) are even worse.
  • If your SQL product's indexing has an "include column" option, you should be using it. This allows non-index values to be carried by the index. It is very useful for look-up tables. You ought to look at converting old indexes to the new option.
  • If your indexes can allow a reverse scan, enable it. It can only help. 
  • Get rid of redundant and duplicate indexes. This is easy to do from the schema information table, because there will be some way to show all the columns in each index. In particular, look for overlapping column lists.
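Two of these points can be sketched in DDL. The names are invented; the INCLUDE syntax shown is DB2's, where include columns are only allowed on unique indexes (SQL Server allows them on any nonclustered index):

```sql
-- Two single-column indexes that are really one composite index in disguise:
-- CREATE INDEX Orders_cust_idx ON Orders (customer_id);
-- CREATE INDEX Orders_date_idx ON Orders (order_date);

-- One composite index covers both, with fewer index entries to lock:
CREATE INDEX Orders_cust_date_idx ON Orders (customer_id, order_date);

-- Carry a looked-up value in the index itself, so the query never has
-- to touch the base table at all:
CREATE UNIQUE INDEX Orders_nbr_uidx
    ON Orders (order_nbr)
       INCLUDE (order_amt);
```

The composite index in the first half also makes the two old single-column indexes redundant — a candidate for the pizza-crust cleanup.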

Ghosts from Enhancements Past

Most of us use some off-the-shelf software packages. We get regular upgrades and enhancements. We run scripts that alter tables if new columns are needed. But almost nobody will drop an index and perform a rebind with the enhancements. The indexes in a package are often obscure, and we do not want to take a chance by dropping the old ones.

Death by ‘Itty Bitty’ Tables

You can go to the schema information table and get the cardinality of all the tables. Small tables might be fine, but they are always worth looking at. The first thing to do is check whether these tables are volatile or stable. Those terms are a bit vague, but here are heuristics:
  • Small volatile tables are most often scratch tapes in SQL disguise. They are constantly being loaded, emptied or dropped. They need to be replaced with a derived table expression in the SQL statements.
  • Small stable tables are most often look-up tables that are used in REFERENCES clauses. They need to be replaced with CHECK (.. IN (..)) constraints that can be optimized. For example, a table with the 54 two-letter state abbreviation codes is very stable. Nobody wants to join the Union. The storage space for a short IN() list is nothing today, and many SQL products will optimize the search by building a hash table or binary tree when the list reaches a certain size.
  • Application programmers will put configuration data in a small table and load it into the application. Mr. Kopac's example was a system with a ten-row read-only table that was loaded by almost every Java application, thousands of times per day. This data needs to be put on the application side, not in the database.
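The second bullet looks like this in DDL. The table and the ten-code list are illustrative — the real DDL would spell out all 54 codes:

```sql
-- Instead of a one-column look-up table and a REFERENCES clause,
-- put the stable list where the optimizer can see it:
CREATE TABLE Customer_Addresses
(customer_id CHAR(9) NOT NULL PRIMARY KEY,
 state_code CHAR(2) NOT NULL
    CHECK (state_code IN ('AL', 'AK', 'AZ', 'AR', 'CA',
                          'CO', 'CT', 'DE', 'DC', 'FL')));
                          -- list abbreviated for the example
```

Validating a row now costs a search of a tiny in-memory list instead of a join to another table, with its locks and its index maintenance.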

Death by ‘Fat’ Tables

Instead of filling a table with all valid combinations, consider placing only the exceptions in a table. This means changing the logic in the SQL statements, but it is usually not too hard once you change your mind-set.

You often do not need all the data from a table. The Pareto distribution, the 80/20 rule or Sturgeon's Law says that a relatively small subset of the data gets most of the use. For example, 80% of your business comes from 20% of your clients. The most active part of the database will be today's work, and so forth.

You can partition your database and keep the most active parts in faster physical storage. You can design history tables and add Views for the current data. You can archive data and get it out of the database. You can purge data that you no longer need and which might be a source of legal problems.

Using SQL for the Wrong Reasons: It was Nice

Application programmers can use SQL to test the state of the system. Apparently Java programmers like to write statements “SELECT COUNT(*) FROM < some schema information table > ;” to test for a connection! This just ties up resources.
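If the application must ping the database, a constant query that touches no catalog or user data is far cheaper. The spelling varies by product; these are the usual dummy tables:

```sql
-- A connection test that reads no real table:
-- DB2:                            SELECT 1 FROM SYSIBM.SYSDUMMY1;
-- Oracle:                         SELECT 1 FROM DUAL;
-- SQL Server, PostgreSQL, MySQL:  SELECT 1;
```

Better still, use the connection pool's own validation mechanism and keep the SQL out of it entirely.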

Using “SELECT * FROM ..;” in production code is both dangerous and expensive. Many SQLs will expand the * into a column list at compile time. The execution plan is now set and will stay that way until a recompile. Gee, I sure hope the base tables do not add, drop or move columns because you will never be sure what the SELECT list is today.
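The defensive habit is trivial: name the columns you actually use, in the order you expect them. A sketch with invented names:

```sql
-- Fragile: the * is expanded at compile time and frozen into the plan.
-- SELECT * FROM Customers;

-- Robust: the SELECT list survives columns being added to the base table.
SELECT customer_id, customer_name, city_name
  FROM Customers;
```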

While not always wrong, options like "FETCH FIRST zillion ROWS" or "OPTIMIZE FOR zillion ROWS" are usually a waste. Will any human actually look at zillions of rows in a result set on a screen? If you really need all that data, it is usually cheaper to spool it to a file and give it to a report writer, a statistical analysis package or some tool that can handle zillions of rows.

It was nice to get out of my little box and look at things from the outside for once. Maybe I will learn a new programming language before I die. Nah!


About Joe Celko

Joe is an Independent SQL and RDBMS Expert. He joined the ANSI X3H2 Database Standards Committee in 1987 and helped write the ANSI/ISO SQL-89 and SQL-92 standards. He is one of the top SQL experts in the world, writing over 700 articles primarily on SQL and database topics in the computer trade and academic press. The author of six books on databases and SQL, Joe also contributes his time as a speaker and instructor at universities, trade conferences and local user groups. Joe is now an independent contractor based in the Austin, TX area.
