Published in TDAN.com October 2000
Software and Database engineering are complex activities that require planning and control to be successful. By the time the DBA is called up to tune the indices of a database it is probably
already too late. Efficiency should be designed into the data structure before the data is actually put on disk. Since the invention of CASE tools there is usually a missing step in the database
design. The logical database design is set up in the modeling tool and then the DDL is generated. The table design of the physical database is the entity design of the logical database. Then when
tuning is required, data is moved around on disk, indices are applied, freespace is modified, and more CPU memory is assigned.
The DBMS level tuning steps are valid and will continue to be used. But, there has been a missing step in the database design process.
Physical Design of Databases
The word ‘denormalization’ is used to describe changes to the table design that cause the physical tables to differ from the normalized entity relationship diagram.
‘Denormalization’ does not mean that anything goes. Denormalization does not mean chaos. The development of properly denormalized data structures follows software engineering principles
that insure that information will not be lost. If the table is read-only (periodically refreshed from the system-of-record) then the rules are looser. Star schemas and hyper-cubes are read-only
denormalizations. If the data is to be distributed and/or segmented and added-to, changed, or deleted from then the reconstruction described below must be followed. Fundamentally a single principal
must be followed. If the individual table is updated in more than one system, it should be possible to reconstruct the original table as if the data was never reformatted or taken apart.
There are many techniques for denormalizing a relational database design. These include –
- Duplicated data – This is the technique of making copies of data whole or in part and storing and utilizing both the original and the copy(s). This technique is great unless you want to update
it. This is the area of distributed updates and synchronization. Whole texts have been written on this subject. The general idea is that extra-DBMS processes must insure integrity and accuracy.
Stored joins are an example of duplicated data.
- Derived data – The issues with storing derived data are accuracy and timeliness. When the base data changes the derivation(s) must change accordingly. When the semantics of the derived columns
is ‘current balance’ you have one sort of accuracy problem. When the semantics of the derived column is average sales by product, salesman, and division, and month; and the salesman are
constantly being reassigned. You have another accuracy problem. Also many designers store the derivation in tables containing inappropriate keys. When derivations are not stored with their logical
(functionally dependent) keys subsequent (tertiary) derivations are inaccurate. Also many derivations are non-additive (percents, highest, lowest, etc). This subject deserves many chapters in data
warehousing texts. See references to summary data and slowly changing dimensions.
- Surrogate keys – There is a problem with very long and compound keys in that they are hard to use when writing queries and they generate inefficient indices. If the table has a very long key
and also has many rows this can generate a “show stopper” situation. If the table has a maximum of 100,000,000 rows and a fifty byte real compound key, assigning a 10 digit surrogate
key (and indexing on it) will increase performance dramatically. Imagine the situation where the fifty byte key is used in an equi-join! The real key(s) should not be deleted after the
surrogate key is added. This would make reversing out the surrogate key impossible. And would offend the Rule of Reconstruction (see below). Usually the long real key is made up of many sub-keys
that are useful in their own right.
- Over Normalization (Vertical partitioning/segmentation) – This is the technique of splitting the original logical table into two or more physical tables. By assigning some of the columns to one
physical table and some to another. Both tables end up with the same number of rows and have the same keys (see “Rule of Reconstruction”, below). Grossly this will increase performance
since the individual tables are now smaller. In most DBMSs the negative affect of long column length is non-liner. The query time against a 1000 byte row length table can be more than twice the
query time against a 500 byte row length table. So arbitrary vertical partitioning will cause much better performance against each of the separate partitions. If you are constantly joining the
partitions, over normalization is self-defeating. Therefore, the trick is to cluster the columns together that are used together.
- Horizontal segmentation – This is the technique of storing some of the rows in one table and some in another. Many modern DBMSs can do this automatically. When the criteria for segmentation is
non-simple, segmentation must still be done programmatically. Of course, update anomalies occur when rows occur in more that one segment.
- Stored Joins – This is the technique of joining two or more tables together and storing the answer set as an additional table. This is one of the most common denormalizations. If the stored
join table is never updated, there is no problem with this. Since this always generates duplicate data, updates are a problem. Look out for query anomalies when a measurement column is on the many
side of the relation being joined.
- Recurring data groups (vector data) – When there is a fixed small number of subordinate tables associated with a table collapsing the subordinate table into the parent table will increase
performance. Care must be taken that the logical key of the subordinate table is not deleted or obscured. Otherwise the join is not reversible and the “Rule of Reconstruction” is
Rule of Reconstruction
When the Rule of Reconstruction is ignored and the data updated, the data is corrupted.
Codd’s Rule of Reconstruction is a corollary to ‘Lossless-Join Decomposition’. Lossless Decomposition is a method for creating well-formed normalizations from non-normalized
database designs. The Rule of Reconstruction is basically the same idea in reverse. Well-formed non-normalized structures are created from normalized tables.
Read-only database (data warehouses, data marts, etc.) gain flexibility and quality if they also approximate this approach. The designer should think in terms of a series of transformations via
SQL. If the physical database design is not based on a well-formed (Third Normal Form) logical database design you cannot know if the Rule of Reconstruction is being followed or not. There seems to
be some kind of data entropic law here. If data is not carefully designed and managed through time, it slides into chaos. Constructions based upon algebra and set theory are fragile. In my
experience, if you don’t know and can’t prove the design is correct, it is very likely incorrect.
In reality, it is more subtle than this. For example, if the business changes and the logical data model is not updated and the corresponding physical database design is not updated, you will more
than likely get the same effect as a poorly formed denormalization. Add, Changes, and Deletes will corrupt the data relative to the present real business. Future queries will then get the wrong
answer relative to the present real business. This is the “Silent Killer.” As a designer, you start out right, change nothing, and now you are wrong.
In the textbook theoretical literature written on the Relational Model, the operators used on databases are Insert, Replace, Delete, Retrieve, Select, Project, and Join (and to be complete –
Product, Union, Intersection, Difference, and Divide). This causes some confusion with those of us that know SQL. The Retrieve, Select, Project, and Join functions are all performed by the SQL
SELECT operator. Just to reduce the amount of re-statement and/or translation from Codd’s original source, the following uses the Relational Model terminology.
The following is the Relational Model’s definition of Select, Project, and Join:
- Select – The Select function takes whole rows from a single table and creates an answer set consisting of some (or all) of the rows.
- Project – The Project function takes whole columns for a single table and creates answer set consisting of some (or all) of the columns.
- Join – The Join function takes whole rows form two or more tables and creates an answer set consisting of concatenated rows that pass the join criteria. The join criteria is usually that
two or more columns in the source tables’ rows are equal.
Any combination of relational operators can be applied to re-form how the data is distributed, provided that the total transformation is reversible. The issue is verifying this
reversibility. To not cause inaccuracies and anomalies, each updateable physical data store must be reversible to the well-formed normalized data model. The following is not intended to be
exhaustive and consists of only a few of examples of well-formed denormalizations.
This is achieved via the project relational operator. Each projection to be stored in some table, possibly another database, must include the primary key of the relation from which the projection
is made. Then, each and every projection is well-formed with no duplicate rows. A table can be split into any number of projections. Note that, at any time, the original table can be recovered
using equi-join with respect to the primary keys of the new denormalized tables being joined. In applying relational technology to the management of a denormalized database, it is essential that it
be possible the relational operators can be used to decompose relations in the global database into relations for the various denormalized target tables. In other words you should be able to use
SQL for the decomposition. This doesn’t mean that you must use SQL for every decomposition, only that you could. You might have a non-SQL Extract Transform and Load (ETL) engine interfacing
one or more OLTP systems with an Operational Data Store.
If the key in the above example was very long and compound, a surrogate key could be substituted. And one of the vertical segments could be stored with the surrogate key only.
The select relational operator is used to insert some rows of a table into another table, and other rows to other tables. The selection of rows cannot be arbitrary. The selection of rows must be
made using the select operator. When a table is partitioned by rows to be stored in other tables, union is used to recover the original table.
The use of stored joins as a denormalization will formally offend the rule of reconstruction if the joined-table is updated. Since this is a very common if not the most common denormalization some
discussion is appropriate. If you must update a stored join, non- relational programmatic measures must be taken to see that data integrity is maintained. A simple example of this is to not update
the duplicate columns. It is always dangerous to have the data integrity rules outside the data domain. When other update programs are added to the system they must also contain the non-relational
data integrity processes. It is better software engineering to have all the data integrity rules in the DBMS in the form of constraints, triggers, and stored procedures. Quality almost always
implies simplicity in usage and a ‘minimum number of moving parts.’ If you have to write complex code with a relational database to get the right answer you are allowed to leap to the
conclusion that the database design does not map to the business.
While the project and select operators can be used to split a single table from a database into several tables, the join, union, relational intersection, relation difference, and relation divide
operators can be used to combine several relations from a database into one table. If the transformations performed on tables from the database are to be reversible, these operators must be applied
with considerable care.
Any combination of relational operators can be applied sequentially provided that the total transformation is reversible. One can imagine over normalization and horizontal segmentation being
applied to the same table. Of course the same restriction applies. If the reversing relational operators are applied, and you cannot tell they table was not always in Third Normal Form, the
denormalization was well-formed. We should start the physical tuning of our databases at design time with well-formed denormalizations. To delay tuning until after implementation is a mistake.
Codd, E.F. (1990) The Relational Mode for Database Management: Version 2, Addison-Wesley
Kirkwood, John (1992) High Performance Relational Database Design, Ellis Horwood, Ltd.