Published in TDAN.com January 2007
An application’s flexibility and data quality depend quite a bit on the underlying data model. In other words, a good data model can lead to a good application and a bad data model can lead to a
bad application. Therefore we need an objective way of measuring what is good or bad about the model. After reviewing hundreds of data models, I formalized the criteria I have been using into what
I call the Data Model Scorecard.
- The Scorecard contains 10 categories:
- How well does the model capture the requirements?
- How complete is the model?
- How structurally sound is the model?
- How well does the model leverage generic structures?
- How well does the model follow naming standards?
- How well has the model been arranged for readability?
- How good are the definitions?
- How well has real world context been incorporated into the model?
- How consistent is the model with the enterprise?
- How well does the metadata match the data?
This is the fourth of a series of articles on the Data Model Scorecard. The first article on the Scorecard summarized the 10 categories, the second article discussed the correctness category, the
third article focused on the completeness category, and this article will focus on the structure category. That is, How structurally sound is the model? For more on the Scorecard, please refer to
the book, Data Modeling Made Simple: A Practical Guide for Business & IT Professionals. How structurally sound is the model?
This is the “Data Modeling 101” category. That is, this category ensures that the model follows good design principles. We don’t need to necessarily understand the content of the model to score
this category. Many of the potential problems from this category are quickly and automatically flagged by our modeling and database tools. Examples include prohibiting having two data elements with
the same exact name in the same entity, a null data element in a primary key, and certain reserved words in data element and entity names
.
Here are a few of the areas for improvement I look for in validating this category:
Empty entities. Despite the strangeness of this statement, I have seen entities containing zero data elements more than I would like to admit. This is sometimes due to accidentally forward
engineering (converting from the model to database tables) entities that were designed as legends in the model or placeholders for the next phase of the project.
Improper use of surrogate keys. This includes over or under-utilizing surrogate keys. Over-utilizing would involve adding a surrogate key to an associative entity purely for the sake of
consistency. An example of under-utilizing surrogate keys would be deciding to uses a large and inefficient composite primary key instead of a surrogate key.
Inappropriate multivalued data elements. Having a data element called customer name that contains both first and last name, for example.
Relationships that don’t have foreign keys correctly propagated. Also known as a partial-key relationship, it occurs when the modeler would like only a subset of the parent’s primary key to
appear in the child entity as a foreign key.
Data elements with the same name yet different formats. For example, person telephone number is 20 characters long in one entity and 10 characters long in another.
Circular and redundant relationships. A circular relationship is a catch-22. That is, to do A, we need to do B, To do B, we need to do C. To do C, we need to do A. Therefore, we are stuck. An
example of a circular relationship appears in Figure 1.
Figure 1. Circular relationship
Each order can contain zero, one, or many products.
Each product can be offered on zero, one, or many promotions.
Each promotion can be applied to zero, one, or many orders.
Because these three entities are connected with identifying relationships, there is no way we can create any entity instance without having at least another one other entity instance already in
place. In other words, we become stuck. We are stuck because we can’t create an order without a promotion, yet a promotion requires an order before it can be created.
A redundant relationship is one that is derivable from other relationships on the model. See Figure 2.
Figure 2. Redundant relationship
Each order can contain zero, one, or many products.
Each product can be offered on zero, one, or many promotions.
Therefore we know the order for a given promotion without explicitly showing the relationship between order and promotion. If we are building a business intelligence application, showing this
redundant relationship might make sense if it improves performance and navigation.
As a proactive measure to improve the structure of the data model, I have found the following techniques to be very helpful:
- Make sure each data element’s format matches the rest of the data element’s metadata. Many times we can identify something structurally incorrect when the formatting of a data element does
not seem right, such as a 40-character gender code or a 10-character last name. - Put your model down and come back to it the next day. Modeling is one of those activities that at times require simply working on something else for a while and then coming back to your model.
I have had modeling revelations a few minutes after picking up a model, whereas a day earlier I was pulling hairs out trying to come up with a creative solution. - Try Validator. This software marketed by Computer Associates checks a data model against over 50 criteria. I have found it very useful to identify valuable structural inconsistencies that would
be too time consuming and difficult for me to check manually. For example, this tool identifies when two or more data elements with the same name have different formats across entities.