A correspondent from St. Petersburg (Russia, not Florida) sent me a copy of an article on agile database development, a subject of particular interest to me. The article advocates that database
developers start from a mostly denormalized design (in which the data resides on as few physical tables as possible), as a way of simplifying the design, improving performance, and speeding up the
development process. The database design can then be “refactored” over time, adding increased complexity (and normalization) as needed. This suggested approach is a common one in the
literature of agile development, and it has a certain elegant simplicity that is appealing to many developers.
However, my experience with application development projects has taught me that there are several potential problems with this approach, and the unwary developer or DBA can quickly find himself
(and his application) in a quagmire from which no amount of incremental refactoring can rescue him.
First of all, we need to recognize the purpose of normalization in database design (I am speaking here of relational databases used in frequently updated transaction processing – OLTP –
applications). Most people are aware that a normalized database design helps ensure data integrity (by eliminating duplicate and redundant data, which can get “out of synch” over time)
and increases the performance of updates. But the principal value of a normalized design is that it produces a highly cohesive, loosely-coupled database schema that is reusable; that is, the same
database can be used to support multiple applications and multiple business uses [1]. Non-normalized designs usually result in a database that is tightly coupled to a specific application
(and, usually, an application that is tightly coupled to a database schema). This “deadly embrace” results in a number of problems:
- The database schema can’t be changed without breaking the application.
- The application becomes platform-bound, and thus difficult to scale or redeploy (as, for example, to a service-oriented architecture).
- It becomes difficult to extend the database design to support new application functionality.
- The business fails to realize the economic value of data reuse.
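The update-anomaly risk mentioned above is easy to demonstrate. Here is a minimal sketch (using SQLite for brevity, with invented table and column names, not the SQL Server environment assumed elsewhere in this article): when customer data is repeated on every order row, nothing in the schema prevents the copies from drifting apart.

```python
import sqlite3

# A denormalized design repeats customer data on every order row
# (names here are invented for illustration).
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE OrderFlat (
    OrderID INTEGER PRIMARY KEY,
    CustomerName TEXT,
    CustomerCity TEXT,
    Amount REAL)""")
db.executemany(
    "INSERT INTO OrderFlat VALUES (?, ?, ?, ?)",
    [(1, "Acme", "Tacoma", 100.0),
     (2, "Acme", "Tacoma", 250.0)])

# Update the customer's city on only one row -- nothing stops us,
# and the two copies of the same fact are now "out of synch".
db.execute("UPDATE OrderFlat SET CustomerCity = 'Seattle' WHERE OrderID = 1")
cities = {row[0] for row in
          db.execute("SELECT DISTINCT CustomerCity FROM OrderFlat")}
print(cities)  # two conflicting values for the same customer
```

In a normalized design the city would live on a single Customer row, and the inconsistency could not occur.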
As I’ve noted in previous articles, application-specific databases rarely return sufficient ROI to justify their existence; they become legacy “albatrosses” almost from the moment
of implementation, and the potential business value of data reuse (for example, the ability to do on-the-fly data analysis to identify potential customer relationship opportunities or supplier
quality problems) is lost [2]. Companies spend millions of dollars on data cleansing, migration, integration, and reporting technologies in an effort to pry data out of application-specific
silos for reuse – money that, for the most part, shouldn’t need to be spent!
Some people may argue that taking the time to do a normalized database design flies in the face of the basic tenets of agile development, but I disagree. There is nothing in agile development that
mandates poor (or nonexistent) design! Just as an artist sketches in his subject before starting to paint, the good agile developer makes sure to have a solid, well-architected, scalable,
extensible application design. In agile development, you don’t design (or build, or document) that which you don’t need (or don’t have enough information about); but what you do
build, you build well, and you base it on a design that is extensible and can be added to and scaled out incrementally as needed. In database design, this is the value that normalization provides.
There are two additional problems with premature denormalization, problems that surfaced in a couple of past development projects at our company.
The first problem is that denormalization masks important key attributes, and values needed for these attributes can become lost. Then, if you try to re-normalize later, you find that you
can’t because the key values are missing. A while back, I had to “reengineer” the database for a job scheduling and timesheet application written by one of our engineers. He had
put all of the data into a single denormalized table that looked something like this:
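The original table illustration is not reproduced in this copy of the article; the following sketch (in SQLite, with column names that are my reconstruction from the description that follows) captures its essential flaw: the only key is the surrogate identity column, and the attributes that actually identify a timesheet are all nullable.

```python
import sqlite3

# Hypothetical reconstruction of the single denormalized table;
# the actual column names in the original illustration are assumptions.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE Task (
    TaskID INTEGER PRIMARY KEY AUTOINCREMENT,  -- the IDENTITY column
    TaskName TEXT NULL,
    EmployeeName TEXT NULL,
    WeekEnding TEXT NULL,
    HoursWorked REAL NULL)""")

# Because the real key attributes (task, employee, week) are nullable,
# nothing prevents rows that lose the identifying data entirely:
db.execute("INSERT INTO Task (HoursWorked) VALUES (8.0)")
row = db.execute(
    "SELECT TaskName, EmployeeName, WeekEnding FROM Task").fetchone()
print(row)  # (None, None, None) -- the timesheet's identity is gone
```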
You can see the problem. The only key associated with this table (which actually represents timesheet data, not task data) is the IDENTITY column. But what is a timesheet? It’s the record of
hours worked on assigned tasks by a particular employee in a given week. But in this table, the task, employee, and week columns aren’t identified as key attributes, and are all nullable. In
fact, the application was sometimes putting null values in those columns. This means that we were losing critical data associated with the principal object class (the timesheet) in the application!
And when we tried to normalize this table later, we found that we couldn’t, because those key values were missing [3].
The other problem with denormalizing too soon is that, paradoxically, it can make SQL coding much harder for the developers and result in performance problems. This seems counter-intuitive, since
one argument for denormalization is that developers don’t need to code table joins (which should be masked with views, anyway, so that they don’t have to), and that the reduction in
table joins should simplify SQL coding and speed query performance.
Here’s the problem: if data for multiple entities is contained in the same table, then each query against the table must be coded in such a way that it extracts data for one entity, and
ignores the others. Here’s an example from a recent project: our developers requested that a supertype/subtype table relation be collapsed into a single table. When we did that (and, in
retrospect, we shouldn’t have), we ended up with a table that looked like this:
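The table illustration is likewise missing from this copy; a sketch consistent with the stored procedure parameters shown below (column names inferred from those parameters, SQLite standing in for SQL Server) would be a single Event table carrying one subtype-key column per event type, with all but one of them null on any given row.

```python
import sqlite3

# Hypothetical reconstruction of the collapsed supertype/subtype table;
# column names follow the parameter names of the procedure shown below.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE Event (
    EventID INTEGER PRIMARY KEY,
    EventType INTEGER NOT NULL,   -- discriminator: 1, 2, or 3
    EventType1Key INTEGER NULL,   -- populated only when EventType = 1
    EventType2Key INTEGER NULL,   -- populated only when EventType = 2
    EventType3Key INTEGER NULL)""")
db.executemany("INSERT INTO Event VALUES (?, ?, ?, ?, ?)",
               [(1, 1, 101, None, None),
                (2, 2, None, 201, None),
                (3, 3, None, None, 301)])

# Every row uses exactly one of the three subtype key columns:
sparse = db.execute("""SELECT COUNT(*) FROM Event
    WHERE (EventType1Key IS NOT NULL) + (EventType2Key IS NOT NULL)
        + (EventType3Key IS NOT NULL) = 1""").fetchone()[0]
print(sparse)
```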
Now, all the stored procedures that access this table have to have the following parameter list:
CREATE PROCEDURE csEventProcedure (@EventType smallint, @EventType1Key int = null, @EventType2Key int = null, @EventType3Key int = null…)
And the WHERE clause for the SELECT will have to look something like this:
WHERE (@EventType1Key IS NOT NULL AND @EventType1Key = Event.EventType1Key)
OR (@EventType2Key IS NOT NULL AND @EventType2Key = Event.EventType2Key)
OR (@EventType3Key IS NOT NULL AND @EventType3Key = Event.EventType3Key)
Or perhaps like this:
WHERE (@EventTypeCode = 1 AND @EventType1Key = Event.EventType1Key)
OR (@EventTypeCode = 2 AND @EventType2Key = Event.EventType2Key)
OR (@EventTypeCode = 3 AND @EventType3Key = Event.EventType3Key)
In either case, the presence of the multiple ORs in the WHERE clause will cause the SELECT to become non-optimizable, and the query will have to do multiple scans of the Event table, resulting in
very poor performance. In this case, our developers ended up having to code dynamic SQL queries to get around this problem, but still had to take a performance hit (although not as great a one)
because dynamic SQL can’t be pre-compiled and pre-optimized.
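The article doesn’t show the dynamic SQL the developers wrote, but the idea can be sketched as follows (again in SQLite, with the same assumed column names): build the WHERE clause for the one relevant subtype column, so the query carries a single simple predicate instead of a chain of ORs.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE Event (
    EventID INTEGER PRIMARY KEY,
    EventType INTEGER NOT NULL,
    EventType1Key INTEGER NULL,
    EventType2Key INTEGER NULL,
    EventType3Key INTEGER NULL)""")
db.executemany("INSERT INTO Event VALUES (?, ?, ?, ?, ?)",
               [(1, 1, 101, None, None),
                (2, 2, None, 201, None)])

def find_event(event_type, key):
    # Splice in only the relevant subtype column (a trusted, internal
    # name -- the key VALUE still goes through a bound parameter).
    column = f"EventType{event_type}Key"
    sql = f"SELECT EventID FROM Event WHERE EventType = ? AND {column} = ?"
    return db.execute(sql, (event_type, key)).fetchall()

print(find_event(2, 201))  # finds EventID 2 via one simple predicate
```

The trade-off described above still applies: the statement is assembled per call rather than compiled once as a stored procedure.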
I hope these examples will convince you of the dangers of trying to denormalize database tables too soon (or at all!). It’s actually safer, from a database perspective, to over-normalize in
the first iteration and then, as necessary, do any denormalization in later refactorings rather than starting with a completely denormalized schema (at least that way you don’t lose any critical key data).
After all, if you have the right tools and processes in place, it really shouldn’t take you that much longer to create a database with, say, 50 tables instead of 5. In our shop, we can
generate our database schema and all associated objects (triggers, constraints, rules, base views, etc.) directly from our modeling tool. And the fundamental stored procedures that our developers
use for updating the database are generated directly from the database schema using a Visual Basic application. This enables us to generate the database very quickly and to process schema changes
very quickly with little manual effort.
And, most importantly, be sure to mask the complexity of the normalized database schema with a “virtual data layer” of views, stored procedures, functions, application data objects, and
whatever else is required to abstract the functionality of the data from the underlying schema. This provides the application with an application-specific view of the data that maps more easily to
application object classes, keeps the application from being tightly coupled to the database schema, and enables the easy reuse of data by multiple applications [4].
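A minimal sketch of such a virtual data layer (SQLite again standing in for SQL Server, with invented table names): the application reads and writes one flat, application-friendly view, while an Instead-Of trigger routes the writes to the normalized base tables underneath.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Employee (
    EmployeeID INTEGER PRIMARY KEY,
    EmployeeName TEXT NOT NULL);
CREATE TABLE Timesheet (
    TimesheetID INTEGER PRIMARY KEY AUTOINCREMENT,
    EmployeeID INTEGER NOT NULL REFERENCES Employee(EmployeeID),
    WeekEnding TEXT NOT NULL,
    HoursWorked REAL NOT NULL);

-- The application sees one flat, application-specific shape...
CREATE VIEW EmployeeHours AS
    SELECT e.EmployeeName, t.WeekEnding, t.HoursWorked
    FROM Employee e JOIN Timesheet t ON t.EmployeeID = e.EmployeeID;

-- ...and an INSTEAD OF trigger routes writes to the base tables.
CREATE TRIGGER EmployeeHours_insert INSTEAD OF INSERT ON EmployeeHours
BEGIN
    INSERT INTO Timesheet (EmployeeID, WeekEnding, HoursWorked)
    SELECT EmployeeID, NEW.WeekEnding, NEW.HoursWorked
    FROM Employee WHERE EmployeeName = NEW.EmployeeName;
END;
""")
db.execute("INSERT INTO Employee VALUES (1, 'Pat')")
db.execute("INSERT INTO EmployeeHours VALUES ('Pat', '2006-08-18', 40.0)")
rows = db.execute("SELECT EmployeeID, HoursWorked FROM Timesheet").fetchall()
print(rows)
```

The base tables can now be refactored behind the view without the application ever noticing.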
This approach also enables the quick and easy refactoring of the data layer in response to application changes, without affecting the underlying schema or any other application that is using the
database. By using automated development processes, and by “virtualizing” the data, we don’t have to sacrifice database quality (and, potentially, data integrity and reusability)
for the sake of agility!
1. Larry Burns, “Promoting Data Management,” Parts 1 and 2, originally published August 11 and August 18, 2006, in DMReview Online.
2. Larry Burns, “The Ugly Truth About Data Quality,” originally published May 31, 2005, in DMReview Online.
3. Fortunately, we caught this problem early in the development process. I created a normalized set of base tables, created a denormalized view on top of the tables that looked exactly like the Task table above, and then wrote Instead-Of triggers on the view to handle the updates to the base tables. We were then able to use the application, as written, to recreate the missing data.
4. Larry Burns, “Views: The Key to Database Agility,” originally published October 2005 in TDAN (The Data Administration Newsletter).