Published in TDAN.com April 2002
This the first of two articles on the Denormalization Survival Guide, adapted from Chapter 8 of Steve Hoberman’s book, the Data Modeler’s Workbench (Order this book through Amazon.com Today!). This first article focuses on the dangers of denormalization
and introduces the Denormalization Survival Guide, which is a question-and-answer approach to applying denormalization to our data models. The second article will discuss the questions and answers
that comprise this guide when modeling a data mart.
The Dangers of Denormalization
In her book entitled, Handbook of Relational Database Design, Barbara von Halle has a great definition for denormalization: “Denormalization is the process whereby, after defining a
stable, fully normalized data structure, you selectively introduce duplicate data to facilitate specific performance requirements.” Denormalization is the process of combining data
elements from different entities. By doing so, we lose the benefits of normalization and, therefore, reintroduce redundancy into the design. This extra redundancy can help improve retrieval time
performance. Reducing retrieval time is the primary reason for denormalizing.
My favorite word in von Halle’s definition is “selectively”. We have to be very careful and selective where we introduce denormalization because it can come with a huge price even
though it can decrease retrieval time. The price for denormalizing can take the form of these bleak situations:
Update, delete, and insert performance can suffer. When we repeat a data element in two or more tables, we can usually retrieve the values within this data element much more
quickly. However, if we have to change the value in this data element, we need to change it in every table where it resides. If Bob Jones appears in five different tables, and Bob would prefer to
be called “Robert”, we will need to change “Bob Jones” to “Robert Jones” in all five tables, which takes longer than making this change to just one table.
Sometimes even read performance can suffer. We denormalize to increase read or retrieval performance. Yet if too many data elements are denormalized into a single entity, each
record length can get very large and there is the potential that a single record might span a database block size, which is the length of contiguous memory defined within the database. If a
record is longer than a block size, it could mean that retrieval time will take much longer because now some of the information the user requests will be in one block, and the rest of the
information could be in a different part of the disk, taking significantly more time to retrieve. A Shipment entity I’ve encountered recently suffered from this problem.
You may end up with too much redundant data. Let’s say the CUSTOMER LAST NAME data element takes up 30 characters. Repeating this data element three times means we are now using
90 instead of 30 characters. In a table with a small number of records, or with duplicate data elements with a fairly short length, this extra storage space will not be substantial. However, in
tables with millions of rows, every character could require megabytes of additional space.
It may mask lack of understanding. The performance and storage implications of denormalizing are very database- and technology-specific. Not fully understanding the data elements
within a design, however, is more of a functional and business concern, with potentially much worse consequences. We should never denormalize without first normalizing. When we normalize, we
increase our understanding of the relationships between the data elements. We need this understanding in order to know where to denormalize. If we just go straight to a denormalized design, we
could make very poor design decisions that could require complete system rewrites soon after going into production. I once reviewed the design for an online phone directory, where all of the data
elements for the entire design were denormalized into a single table. On the surface, the table looked like it was properly analyzed and contained a fairly accurate primary key. However, I
started grilling the designer with specific questions about his online phone directory design:
After receiving a blank stare from the designer, I realized that denormalization was applied before fully normalizing, and therefore, there was a significant lack of understanding of the
relationships between the data elements.
It might introduce data quality problems. By having the same data element multiple times in our design, we substantially increase opportunities for data quality issues. If we
update Bob’s first name from Bob to Robert in 4 out of 5 of the places his name occurred, we have potentially created a data quality issue.
Being aware of these potential dangers of denormalization encourages us to make denormalization decisions very selectively. We need to have a full understanding of the pros and cons of each
opportunity we have to denormalize. This is where the Denormalization Survival Guide becomes a very important tool. The Denormalization Survival Guide will help us make the right denormalization
decisions, so that our designs can survive the test of time and minimize the chances of these bleak situations from occurring.
What Is the Denormalization Survival Guide?
The Denormalization Survival Guide is a question-and-answer approach to determining where to denormalize your logical data model. The Survival Guide contains a series of questions in several
different categories that need to be asked for each relationship on our model. There are point values associated with answers to each question. By adding up these points, we can determine whether
to denormalize each specific relationship. If our score is 10 or more after summing the individual scores, we will denormalize the relationship. If our score is less than 10, we will keep the
relationship normalized and intact. When you are done asking these questions for each relationship, you will have a physical data model at the appropriate level of denormalization.
There are two main purposes to using the Denormalization Survival Guide:
Maximizing the benefits of denormalization and minimizing the costs. Whenever we denormalize, we potentially gain something but also potentially lose something. The purpose of
the Denormalization Survival Guide is to make sure we maximize our gains while minimize our losses. Therefore, if using the guide dictates that we should denormalize Account into the Customer
entity, it will be because the benefits of denormalization outweigh the costs.
Providing consistent criteria for denormalization. We are applying the same set of questions and point values to each relationship. Therefore, there is a consistency that we will
use in deciding when to denormalize. If two relationships have similar answers, either both will be denormalized or both will remain normalized. This consistency leads to greater reuse and
understanding. In addition, because we will start to see the same structures in multiple places, we will more easily be able to learn and understand new structures. For example, if Customer and
Account are denormalized into a single entity, we will be able to understand this denormalized Customer Account entity in other places it occurs.
This first article focused on the dangers of denormalization and introduced the Denormalization Survival Guide. The next article in this series will discuss the questions and answers that comprise
the guide, including these questions when determining whether to denormalize a relationship in a data mart:
- What type of relationship?
- What is the participation ratio?
- How many data elements are in the parent entity?
- What is the usage ratio?
- Is the parent entity a placeholder?
- What is the rate of change comparison?