The article Why Data Models Cannot Work published in February edition of TDAN stirred my emotions. It stated requirements for a model from the knowledge management perspective and provided an example to
highlight where data models fell short. The intention was likely to stimulate a discussion. It definitely achieved that goal: it provoked me into writing this response.
Points of Concurrence
There are problems and issues within any discipline. As a practitioner, I am painfully aware of ours. A logical data model should represent the business domain while following all relational
normalization rules; a physical model is its implementation under constraints of the given DBMS. Very often, the real-world data model is a hybrid between logical and physical ones. There is not
enough time and resources to create two separate models and, more importantly, to maintain both and keep them in sync. Out of two, the physical one is essential for an IT organization; the scripts
for creating and modifying database objects are generated from it. By natural selection, the more dominant species survive, and the logical model disappears:
- Performance-related denormalized structures propagate from physical to the logical side;
- New mandatory attributes are added as optional to simplify processing;
- Keys are not fully identified or include extraneous attributes to reflect physical indexes;
- Sub-types are not identified or reflect physical partitioning instead;
Therefore, the data model should not be used alone but be augmented by other tools. These tools might include data profiling, as was done by my colleague. In a similar fashion, I verify new data
sources in my data modeling efforts.
Perceived Data Model Shortcomings
The perceived shortcomings arose from the three-fold assertion put forward. Models should “tell us (1) the truth, (2) the whole truth and (3) nothing but the truth about what it is they
are purporting to describe” [Chisholm 09]. Let us go over each part of the statement. A relational model should give correct information and should not mislead. I am in complete agreement on
the first and third points. I would also argue that these qualities are attainable. In the places where I have worked, they are more of the norm rather than the exception. The situation with the
second point is different. It is impossible for any model to contain the whole truth. In order to achieve that goal, the model would be as voluminous as the underlying data.
Definition of a Good Model
The first question would be what the principles are for creating a model. George Box, the industrial statistician, is credited with the quote “all models are wrong, some are useful”
[Box 79]. The author applied his maxim to statistical analyses, which are notoriously ambiguous. The relational models associate attributes on a deterministic basis; the relationships once
established are always true (at least they are supposed to be). So our models are less ambiguous. Nevertheless, the quote rings true to models in general and to relational models in particular.
The same ideas are detailed by Marshall Clemens on his website [Clemens 02]. The term “model” refers to any structured knowledge that accurately reflects
– and helps us to make sense of – the world. What are the desirable properties of a good model?
“A good model should be:
Salient: Since no model can represent everything, it must selectively represent those things most relevant to the task at
Accurate: The model should precisely encode the actual state of affairs and not an erroneous or biased
Complete yet Parsimonious: The model should be as simple as possible, but no simpler. It should concisely capture all the
relevant dimensions of the problem without squeezing out the opportunity for serendipitous or creative insight.”
The properties list includes a total of 14 requirements. I showed the first three, which are the most relevant to the discussion. Nowhere was it stated that a model should be able to capture
everything up to the tiniest detail. And models built on such principles have been used for years. They provided insight and spread knowledge. Therefore, models are the de facto basis of knowledge
The original article had an example that compared a data model to data profiling analyses to highlight where data models fell short. The example used the Financial Instrument entity. The
attribute level report only showed the attribute name and definition. Let us take a closer look at the data modeling language. It includes many more attribute properties, such as domain, optionality
(Null/not Null), and participation in candidate and primary keys. Thus, the hypothetical data model in expanded form is presented in the Table 1 below.
Each of the just “uncovered” attribute properties enhances the model. The domain allows interpreting data values and suggests possible operations. This information is impossible to glean
from the database alone. It is just one example of the model’s value.
Next is the optionality. Based on the incomplete attribute report, the CUSIP attribute was assumed to be mandatory. It led to a proposition that contradicted data: Every Financial Instrument has a CUSIP. Now it is clear that the CUSIP is an optional attribute. And the propositions are: some Financial
Instruments have a CUSIP, and some Financial Instruments do not have a CUSIP. That exactly matches the database profiling results.
Last are the keys. The key structure for this entity is very complicated. I only depicted two keys: primary key (PK) and candidate key (CK1). Moreover, the candidate key might not reflect the true
business domain and is shown for illustration purposes only. Anyway, let us finish with the disclaimers, and get to the point. In this example, the primary key does not reflect the business entity
identity. It is only a surrogate key that substitutes a multi-part natural key of the entity. The natural key is relegated to the role of a candidate key. Thus, it is important to identify candidate
keys for understanding the entity identity.
The Financial Instrument is a generalized construct representing a variety of investment products. A set of attributes vary by the product. This is apparent from the existence of optional attributes.
According to best practices [Reingruber 94] the model should be refined until there are no more optional attributes. So a more sophisticated IT shop would create entities for each subtype. It might
be a simple three-subtype structure based on the Financial Instrument Type with values of Equity, Bond and Option, or it might be a more complicated hierarchy. The attributes would be moved into the
subtypes, where they would become mandatory. This would fill another purported gap, because the sub-typing hierarchy would provide propositions “particular” to a subtype that has a given
attribute and “negative” propositions to subtypes that lack that attribute.
Some of the data profiling propositions in the original text showed dependency of an attribute on other attributes. One of them was “Every Financial Instrument
that has a Maturity Date must have either a Coupon or a Strike Price.” An attribute should depend only on a key. The dependence on other attributes is an anomaly and a violation of
normalization rules. That anomaly would go away once the candidate keys and the complete sub-typing hierarchy are established.
While a cursory look showed data modeling inadequacy, a more thorough analysis uncovers a treasure trove of useful metadata. The additional metadata covers most of the perceived gaps and
discrepancies. Therefore, in our hypothetical example, the root of the problem was either the data modeler not using the tool to its capacity, and/or the knowledge worker not requesting all of the
Models in general are an integral part of the body of knowledge. Data modeling language is quite expressive when used in full. It is based on a solid foundation of relational theory. Data models
are capable of adequately reflecting the underlying business. They can and should be used for knowledge management. As in any discipline, real-life implementations have issues. Knowledge workers
should verify and, if need be, augment models by other tools such as data profiling. They should demand logical data models instead of physical/logical hybrids.
- Malcolm Chisholm (February 2009). Why Data Models Cannot Work.
- Marshall Clemens, 2002
- Box, G.E.P.(1979), Robustness in the strategy of scientific model building, in Robustness in Statistics, R.L. Launer and G.N. Wilkinson, Editors.
Academic Press: New York.
- Reingruber Michael, William W. Gregory (1994).The Data Modeling Handbook: A Best-Practice Approach to Building Quality Data Models. A Wiley-QED Publication: New York.