The Denormalization Survival Guide Part 2

Published in TDAN.com July 2002

Articles in this series – Part 1

This is the second of two articles on the Denormalization Survival Guide, adapted from Chapter 8 of Steve Hoberman’s book, the Data Modeler’s Workbench (you can order this book through Amazon.com). The previous article focused on
the dangers of denormalization and introduced the Denormalization Survival Guide, a question-and-answer approach to applying denormalization to our logical data models. This article
discusses the questions and answers that comprise the guide when modeling a data mart.


The Denormalization Survival Guide for a Data Mart

We will go through each of the questions in the Denormalization Survival Guide and introduce a handy template for recording the scores for each question. The questions for a data mart are:

  • What is the type of relationship?
  • What is the participation ratio?
  • How many data elements are in the parent entity?
  • What is the usage ratio?
  • Is the parent entity a placeholder?
  • Is there a similar rate of change?

An important concept to understand is that every relationship in our design has a parent and a child. The parent is on the “one” side of the relationship, and the child is on the
“many” side. For example, if a Customer can place many Orders, Customer is the parent and Order is the child. Remember that the entity containing the foreign key
for the relationship is the child: Order contains the foreign key back to Customer, and therefore Order is the child. The terms parent and child appear quite often in the survival guide
questions.

In addition, the higher the scores, the greater the chance we will denormalize the relationship. If the total after summing the scores on each question is 10 or more, we denormalize the
relationship. If the total is less than 10, we keep the relationship normalized and intact. I have found that the number 10 works for me in distinguishing the normalization/denormalization
border. It is really an arbitrary number: after answering the questions, the total score is usually much higher or much lower than 10, so I could just as easily have chosen 15 or
5. When you have asked these questions for each relationship, you will have a physical data model at the appropriate level of denormalization. Let’s briefly discuss each question.
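The scoring rule above can be sketched in a few lines of code. The helper below is my own illustrative addition, assuming each question’s answer has already been mapped to its point value; only the threshold of 10 comes from the guide itself:

```python
# Sketch of the Survival Guide decision rule: sum the points from the
# six questions and compare the total against the border value of 10.
DENORMALIZATION_THRESHOLD = 10

def should_denormalize(scores):
    """Return True if the relationship's total score says denormalize."""
    return sum(scores) >= DENORMALIZATION_THRESHOLD

# Example: Hierarchy (20), up to one-to-five ratio (20), fewer than 10
# parent data elements (20), strong usage correlation (30), parent is
# not a placeholder (20), similar rate of change (20).
print(should_denormalize([20, 20, 20, 30, 20, 20]))  # True: denormalize
```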


What is the type of relationship?

This question addresses the type of relationship we are analyzing. What relationship does the parent entity have to the child entity? We choose one of three options: Hierarchy (worth 20 points),
Peer (worth -10 points), or Definition (worth -20 points). A hierarchy relationship is one in which one entity contains, includes, or encompasses another entity. For example, Order Header contains Order
Lines. A peer relationship is one in which both the parent and child entities have an independent existence. For example, Sales Person services Customers. A definition relationship is one in which the parent entity determines the
meaning of the child entity. Definition relationships are either relationships from transaction to reference data, or associative entities. For example, Customer Account is the associative entity
between Customer and Account. Hierarchy is worth the most points, meaning entities participating in this type of relationship have the greatest chance of being denormalized. This is because in data
marts the users often select information from multiple levels of a hierarchy in the same query.


What is the participation ratio?

For a given parent entity value, roughly how many child entity values would we have? The closer to a one-to-one relationship between parent and child, the greater the chance we will denormalize.
This is because a relationship with a one-to-10 ratio will have, on average, 20 percent of the redundancy of a relationship with a one-to-50 participation ratio. We choose one of three
options for this question: up to a one-to-five ratio (worth 20 points), up to a one-to-100 ratio (worth -10 points), and over a one-to-100 ratio (worth -20 points).
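To make the ratio bands concrete, here is a hypothetical helper of my own (not from the book) that maps an average participation ratio to the point values above, along with the 20 percent redundancy arithmetic from the text:

```python
def participation_points(children_per_parent):
    """Map the average number of child values per parent value to the
    point bands given in the guide."""
    if children_per_parent <= 5:      # up to a one-to-five ratio
        return 20
    if children_per_parent <= 100:    # up to a one-to-100 ratio
        return -10
    return -20                        # over a one-to-100 ratio

print(participation_points(3))    # 20
print(participation_points(50))   # -10
print(participation_points(500))  # -20

# Redundancy comparison from the text: denormalizing a one-to-10
# relationship stores 10 copies of each parent value versus 50 copies
# for a one-to-50 relationship, i.e. 10 / 50 = 20 percent.
print(10 / 50)  # 0.2
```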


How many data elements are in the parent entity?

This question addresses the redundancy, in terms of extra space, from the parent entity data elements we will need in the child entity if we denormalize the parent into the child. If there
are fewer than 10 parent data elements, there will usually be a minimal amount of redundancy on each child record, so we get 20 points. If there are between 10 and 20 data elements, we
get -10 points. If there are more than 20 data elements from the parent, we get -20 points. For this question, also keep in mind the length of each data element. For example,
adding 20 one-character data elements introduces less redundancy than adding five 50-character data elements.
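The length comparison at the end of this question is simple arithmetic; the helper below is a hypothetical sketch of mine for estimating the extra width each child record carries:

```python
def extra_chars_per_child(parent_element_widths):
    """Extra characters each child record carries if the parent's data
    elements are folded into the child table (widths in characters)."""
    return sum(parent_element_widths)

# The example from the text: twenty one-character data elements add far
# less redundancy per child record than five 50-character elements.
print(extra_chars_per_child([1] * 20))   # 20 characters per child record
print(extra_chars_per_child([50] * 5))   # 250 characters per child record
```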


What is the usage ratio?

When users need information from the child, do they often include information from the parent? Likewise, when users need information from the parent, do they often include information from the child?
In other words, how tightly coupled or correlated are these two entities? If the data elements from the two entities will appear together on many user queries and reports, the information will be
retrieved more quickly from a single table than by joining multiple tables. Because of the importance of this question, I have assigned it more weight than the other questions. After
all, if very little correlation exists between two entities, why would we want to denormalize them? Assign 30 points for a strong correlation and -30 points for a weak or no
correlation.


Is the parent entity a placeholder?

Are we going to add more data elements or relationships to the parent entity in the near future? If the answer is no (worth 20 points), then there is a stronger tendency to denormalize. This is
because we would not need to worry about the extra space and redundancy new parent data elements would cause. Also, we do not have to worry about integrity and the enforcement of new business rules
against the parent. It also means that we will not impact the child entity with parent entity changes, thereby minimizing database maintenance problems. If the answer is yes (worth -20 points),
then we are going to add data elements in the near future and thus have a tendency to keep the two entities separate.


Is there a similar rate of change?

This question addresses whether the two entities have a similar number of inserts and updates within the same time period. If one of the two entities changes very seldom, whereas the other entity
changes very often, there is a strong tendency to keep them normalized in separate tables (worth -20 points). Separate tables avoid performance and data synchronization issues. If the two
entities have roughly the same rate of change, it makes more sense to denormalize them into the same entity; therefore, we receive 20 points.


Denormalization Survival Guide Template

A very good way to record the scores for each relationship is through a spreadsheet, as shown in the following template. The first column contains each of the six questions, the second
column contains the answers and their point values, and the third column contains the actual points, with their sum at the end.


Denormalization Survival Guide Questions Template
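Since the template table itself is not reproduced here, the sketch below builds the three columns described above for the Order Header–Order Line relationship. The questions and point values come from this guide; the answers chosen are hypothetical:

```python
# Hypothetical scoring of the Order Header (parent) to Order Line
# (child) relationship. Columns: question, answer, points.
template = [
    ("What is the type of relationship?",         "Hierarchy",          20),
    ("What is the participation ratio?",          "Up to one-to-five",  20),
    ("How many data elements are in the parent?", "Fewer than 10",      20),
    ("What is the usage ratio?",                  "Strong correlation", 30),
    ("Is the parent entity a placeholder?",       "No",                 20),
    ("Is there a similar rate of change?",        "Yes",                20),
]

total = sum(points for _, _, points in template)
for question, answer, points in template:
    print(f"{question:<44} {answer:<20} {points:>4}")
print(f"{'Total':<65} {total:>4}")
print("Denormalize" if total >= 10 else "Keep normalized")
```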


Summary

This article discussed the questions and answers that comprise the Denormalization Survival Guide when designing a data mart:

  • What is the type of relationship?
  • What is the participation ratio?
  • How many data elements are in the parent entity?
  • What is the usage ratio?
  • Is the parent entity a placeholder?
  • Is there a similar rate of change?

Although you can use the questions and point values discussed in this article exactly as they appear, you might want to consider some customization. You might have different questions you would
like to ask or might assign different point values, depending on how important each question is in relation to the others.


About Steve Hoberman

Steve Hoberman is a world-recognized innovator and thought-leader in the field of data modeling. He has worked as a business intelligence and data management practitioner and trainer since 1990.  Steve is known for his entertaining, interactive teaching and lecture style (watch out for flying candy!) and is a popular, frequent presenter at industry conferences, both nationally and internationally. Steve is a columnist and frequent contributor to industry publications, as well as the author of Data Modeler’s Workbench and Data Modeling Made Simple. He is the founder of the Design Challenges group and inventor of the Data Model Scorecard™. Please visit his website www.stevehoberman.com to learn more about his training and consulting services, and to sign up for his Design Challenges! He can be reached at me@stevehoberman.com.
