Published in TDAN.com April 2001
In the magical land of Logical Nirvana there lives Good Queen Normalization III. In this land the Black Night Redundancy lives in shackles in the basement prison. Redundancy shares his cell with
his squire Urgent Deadline. The old village of Legacy Data has been toppled and the new city of Enterprise Data has sprung up in its place. The blind alleys and dead ends have been replaced by an
efficient highway system. “Because we’ve always done it that way!” has been erased from the city archway.
Queen Normalization III appoints you Supreme Judge of Data. The application developers come to your court and bow down begging you to model their data. “Take as long as you need and please,
no denormalization!” they cry. “Please make sure it depends on the key, the whole key, and nothing but the key!” they beg. You have the power to instantly behead anyone avoiding
business keys or creating bundled elements. In Logical Nirvana all the RDBMS’s and application development languages are developed too only work with normalized data. Each night you sleep
soundly knowing that you have successfully defended Logical Nirvana from all evil.
Unfortunately Logical Nirvana is not where we live. We must deal with redundancy, deadlines, legacy data, and constraints of RDBMS’ and development languages. Most of the executives we know
do not have a lot in common with Good Queen Normalization III. So how do we live in today’s world of deadlines and performance constraints? As data analysts I believe it is our job to help
applications develop solid systems, on time, while ensuring that the new data entities are reusable and fit in the enterprise. This is not an easy task. Data analysts often find themselves caught
between the urgent need for new application data stores and the commitment to an enterprise view of the data.
There are two traps we can fall into. The most common trap is becoming too theoretical in the pursuit of “Logical Nirvana”. It is more important that what we produce is useful then that
it is technically perfect. Second, we can not give in to every application level demand at the expense of the “big picture”. We must find a way too balance the two. We must give the
application developers what they need to do their job while not undermining the enterprise view of the data.
I believe the goal of every logical model should be to normalize to 3rd normal form. Much to the horror of many life time data analysts, many of whom have no practical experience, I also believe
there are exceptions. Logical models should graphically represent the business in a way that is useful and understandable. I recently worked on a model that would be used to store data about
business recovery. The table in question contained the business activities that are essential and need to be quickly replaced if the building blows up or burns down. Two of the attributes needed
were the employee id of the disaster recovery plan owner (the person responsible for the activities plan upkeep.) and the employee id of the plan approver (typically but not always the plan
owner’s manager). When looking at this data from a logical standpoint the two employee id’s are a repeating group and therefore should be normalized to a separate table (see example 1).
The other choice, not as logically pure, would be too denormalize the id’s to the business activity table (see example 2).
When reviewing this I asked myself the following questions:
- Would the separate entity be shareable?
-
No, the business activity employee table would only have value in context of its relationship to the business activity table. No other entity would relate directly to the business activity
employee table. The table only has value because it further describes the business activity (Who is the approver of the business activity? Who is the owner of the business activity?) - Does creating a separate entity make the model easier to understand?
-
No, you must know the values of the type code to understand the model. In some cases Employee Id would represent the approver and in other cases it would represent the owner. When the two
employee ids are stored and properly named on the main table they are clearly identifiable. The separate entity makes the model more difficult to understand - Does creating a separate entity make the model more flexible?
-
Yes, if a new role is created for an employee it could be added to the model without altering the structure. You would simply have to add another type code. But, it would still require changes
to the application to edit, verify, display, and update the new role. Existing business rules, which must be built into the application, must be evaluated to ensure they are still valid. The
flexibility is at the expense of usability.
Based upon the about answers I fail to see a practical benefit of creating a separate entity.
Another example concerned the representation of business units. We needed to develop a model to support Division, Department, and Unit (a sub-group of department). One school of thought involves
creating a “generic” business unit table with a recursive relationship (see example 3) as opposed to creating 3 separate tables (see example 4). I agree with the intelligent use of
recursive relationships. But we must be careful that the increased flexibility is not at the expense of the models usability. If a model is technically perfect but is so generic that it increases
the complexity of accessing it, we should question that approach.
If we combine Division, Department and Unit into 1 entity with a type code we would increase flexibility but greatly decrease usability.
- We could not search a repository for every instance of division. We would have to search for ‘business unit id’ but would only be able to find every potential instance of division
(which would also include every potential instance of department and unit). - With separate division, department, and unit tables any RDMS can validate the relationships. With one combined table application level validation is required. (e.g. If type code = 1 we do not
need a related entity but if type code = 2 then there must be a another row in same table where the related business unit id is a business unit id with a type code of 1 but if the type code = 3
then there must be a another row in the same table where the related business unit id is a business unit it with a type code of 2). - When business unit is used as a foreign key it will typically represent one of the roles (division, department or unit) but it will not be apparent which role it is using without looking at the
data. In other words if business unit id existed as an attribute on an employee table we would not be able to tell if it meant the employees division, department or unit. It will also require
application level verification to ensure they are only entering business unit ids for the proper role. - The flexibility gains of the “generic” approach are greatly outweighed by the usability losses.
Now that I have all modeling purist cursing me, I would like to state that I am a firm believer in the need for good logical models. They are the corner stone to a good data enterprise. There are
some points in which we should be uncompromising. Some things we should allways enforce are :
-
Common naming If we call an employee identifier emp_id one place we should call it emp_id everyplace else it is used. We should be able to search the repository (or rdms
catalog) for emp_id and have a good deal of confidence that we located every instance of the employee identifier. -
Common data format If emp_id is determined to be a five digit number, it should be a five digit number everywhere it is used. This will help enforce integrity and
increase performance and accuracy of joins between tables. -
Bundled Elements Each attribute should represent one and only one thing. We should avoid the temptation of combining what should be two distinct attributes into one.
This greatly complicates validation and enforcement of business rules. -
Business Keys The primary key of a logical entity should represent the business key. The only time sequence numbers should appear, as part of a primary key is if no
other elements or combination of elements can uniquely and clearly identify the row or if the elements needed to uniquely identify the row are optional (nullable). A practical example of the use
of a sequence number would an order id. While you may be able to uniquely identify a row with a combination of order attributes (customer, timestamp, items ordered etc.) Using these as the
primary key would only add confusion to the model. Creating an unintelligent order id would make more make the model more understandable and easier to use.
I understand that all of this is confusing. Logical Modeling is more of an art then a science. Until we can tear it down, we need to walk the fence between the enterprise view and the application
view. Maybe someday we will find Logical Nirvana. Maybe someday Data Redundancy and Urgent Deadline will no longer haunt us. Until they do we must continue to press on. Keep the big picture in
focus. What real, defendable benefit is this decision providing? That is the best we can do.
December, 2000 issue of the Journal of Conceptual Modeling
(www.inconcept.com/JCM).