The Denormalization Survival Guide – Part I

Published in TDAN.com April 2002

Articles in this series – Part 2,

This the first of two articles on the Denormalization Survival Guide, adapted from Chapter 8 of Steve Hoberman’s book, the Data Modeler’s Workbench (Order this book through Amazon.com Today!). This first article focuses on the dangers of denormalization
and introduces the Denormalization Survival Guide, which is a question-and-answer approach to applying denormalization to our data models. The second article will discuss the questions and answers
that comprise this guide when modeling a data mart.


The Dangers of Denormalization

In her book entitled, Handbook of Relational Database Design, Barbara von Halle has a great definition for denormalization: “Denormalization is the process whereby, after defining a
stable, fully normalized data structure, you selectively introduce duplicate data to facilitate specific performance requirements.” Denormalization is the process of combining data
elements from different entities. By doing so, we lose the benefits of normalization and, therefore, reintroduce redundancy into the design. This extra redundancy can help improve retrieval time
performance. Reducing retrieval time is the primary reason for denormalizing.

My favorite word in von Halle’s definition is “selectively”. We have to be very careful and selective where we introduce denormalization because it can come with a huge price even
though it can decrease retrieval time. The price for denormalizing can take the form of these bleak situations:

  • Update, delete, and insert performance can suffer. When we repeat a data element in two or more tables, we can usually retrieve the values within this data element much more
    quickly. However, if we have to change the value in this data element, we need to change it in every table where it resides. If Bob Jones appears in five different tables, and Bob would prefer to
    be called “Robert”, we will need to change “Bob Jones” to “Robert Jones” in all five tables, which takes longer than making this change to just one table.
  • Sometimes even read performance can suffer. We denormalize to increase read or retrieval performance. Yet if too many data elements are denormalized into a single entity, each
    record length can get very large and there is the potential that a single record might span a database block size, which is the length of contiguous memory defined within the database. If a
    record is longer than a block size, it could mean that retrieval time will take much longer because now some of the information the user requests will be in one block, and the rest of the
    information could be in a different part of the disk, taking significantly more time to retrieve. A Shipment entity I’ve encountered recently suffered from this problem.
  • You may end up with too much redundant data. Let’s say the CUSTOMER LAST NAME data element takes up 30 characters. Repeating this data element three times means we are now using
    90 instead of 30 characters. In a table with a small number of records, or with duplicate data elements with a fairly short length, this extra storage space will not be substantial. However, in
    tables with millions of rows, every character could require megabytes of additional space.
  • It may mask lack of understanding. The performance and storage implications of denormalizing are very database- and technology-specific. Not fully understanding the data elements
    within a design, however, is more of a functional and business concern, with potentially much worse consequences. We should never denormalize without first normalizing. When we normalize, we
    increase our understanding of the relationships between the data elements. We need this understanding in order to know where to denormalize. If we just go straight to a denormalized design, we
    could make very poor design decisions that could require complete system rewrites soon after going into production. I once reviewed the design for an online phone directory, where all of the data
    elements for the entire design were denormalized into a single table. On the surface, the table looked like it was properly analyzed and contained a fairly accurate primary key. However, I
    started grilling the designer with specific questions about his online phone directory design:
“What if an employee has two home phone numbers?”
“How can we store more than one email address for the same employee?”
“Can two employees share the same work phone number?”

After receiving a blank stare from the designer, I realized that denormalization was applied before fully normalizing, and therefore, there was a significant lack of understanding of the
relationships between the data elements.

  • It might introduce data quality problems. By having the same data element multiple times in our design, we substantially increase opportunities for data quality issues. If we
    update Bob’s first name from Bob to Robert in 4 out of 5 of the places his name occurred, we have potentially created a data quality issue.

Being aware of these potential dangers of denormalization encourages us to make denormalization decisions very selectively. We need to have a full understanding of the pros and cons of each
opportunity we have to denormalize. This is where the Denormalization Survival Guide becomes a very important tool. The Denormalization Survival Guide will help us make the right denormalization
decisions, so that our designs can survive the test of time and minimize the chances of these bleak situations from occurring.


What Is the Denormalization Survival Guide?

The Denormalization Survival Guide is a question-and-answer approach to determining where to denormalize your logical data model. The Survival Guide contains a series of questions in several
different categories that need to be asked for each relationship on our model. There are point values associated with answers to each question. By adding up these points, we can determine whether
to denormalize each specific relationship. If our score is 10 or more after summing the individual scores, we will denormalize the relationship. If our score is less than 10, we will keep the
relationship normalized and intact. When you are done asking these questions for each relationship, you will have a physical data model at the appropriate level of denormalization.

There are two main purposes to using the Denormalization Survival Guide:

  • Maximizing the benefits of denormalization and minimizing the costs. Whenever we denormalize, we potentially gain something but also potentially lose something. The purpose of
    the Denormalization Survival Guide is to make sure we maximize our gains while minimize our losses. Therefore, if using the guide dictates that we should denormalize Account into the Customer
    entity, it will be because the benefits of denormalization outweigh the costs.
  • Providing consistent criteria for denormalization. We are applying the same set of questions and point values to each relationship. Therefore, there is a consistency that we will
    use in deciding when to denormalize. If two relationships have similar answers, either both will be denormalized or both will remain normalized. This consistency leads to greater reuse and
    understanding. In addition, because we will start to see the same structures in multiple places, we will more easily be able to learn and understand new structures. For example, if Customer and
    Account are denormalized into a single entity, we will be able to understand this denormalized Customer Account entity in other places it occurs.


Summary

This first article focused on the dangers of denormalization and introduced the Denormalization Survival Guide. The next article in this series will discuss the questions and answers that comprise
the guide, including these questions when determining whether to denormalize a relationship in a data mart:

  • What type of relationship?
  • What is the participation ratio?
  • How many data elements are in the parent entity?
  • What is the usage ratio?
  • Is the parent entity a placeholder?
  • What is the rate of change comparison?

Share

submit to reddit

About Steve Hoberman

Steve Hoberman is a world-recognized innovator and thought-leader in the field of data modeling. He has worked as a business intelligence and data management practitioner and trainer since 1990.  Steve is known for his entertaining, interactive teaching and lecture style (watch out for flying candy!) and is a popular, frequent presenter at industry conferences, both nationally and internationally. Steve is a columnist and frequent contributor to industry publications, as well as the author of Data Modeler’s Workbench and Data Modeling Made Simple. He is the founder of the Design Challenges group and inventor of the Data Model Scorecard™. Please visit his website www.stevehoberman.com to learn more about his training and consulting services, and to sign up for his Design Challenges! He can be reached at me@stevehoberman.com.

Top