The Information Model & the Data Vault

 

Published in TDAN.com October 2004


Introduction

This article is a follow-up to an article I wrote for the last issue of TDAN.com about Business Metadata. I think I may have found a way to integrate
the Information Model with the IT environment in a way that brings tangible value and ROI to the business, and helps us create viable meta data in the process.


The Information Model

To summarize the basic concepts from my last article, an Information Model is essentially the same thing as an Enterprise Data Model (EDM): It is a model of the business concepts, not IT jargon or
data elements used in information systems. Every business needs one, in order to make sense of their data, but it becomes paramount in businesses doing any integration work. This includes, but is
not limited to, data warehousing. It also includes applications needing to share data with other applications.

The part that makes information sharing and integration so difficult is the failure of information systems to encapsulate and encode the meaning of the data. It is difficult to get organizations to
agree to “standard” definitions of business terms or a standard way of formatting data for the entire enterprise. In my previous article, I established the need for the EDM or Common
Information Model in a data warehouse environment. Any business meta data initiative must have one, because it provides traceability between physical data elements and their corresponding business
concepts.


The Data Vault

The Data Vault is a design methodology by Dan Linstedt. The basic notion is to design the data warehouse using a hub and spoke method where each main business idea becomes a hub and attributes are
grouped together in satellites. Relationships between main business concepts become link structures (kind of like intersection/associative entities). For a more detailed primer on the Data Vault,
see Dan’s excellent series of articles in TDAN.com, beginning in January 2003.

*Publisher’s Note – See Dan’s TDAN.com Data Vault Series at:
Part 1, Part 2, Part 3, and Part 4.

I spoke to Dan about a month ago and asked him what the relationship between a data vault model and the Common Information Model was. He felt that they were one and the same. I began to reflect on
this a while. Is it possible and/or feasible to have the data warehouse design essentially reflect a Common Information Model? If so, perhaps the Data Vault would be an appropriate design strategy
to pull it off.

Here’s why it just might work:

The purpose of the data warehouse is to store integrated data in a format that represents business usage. The data vault lends itself well to iteration. Satellites, links and new hubs can be easily
added as needed.

For example, most operational systems contain fields that have evolved into overloaded buckets of information that require elaborate interpretation. These fields should be properly decoded into
their proper business concepts in the data warehouse. The resulting structure will typically come close to, if not mirror, the IT information concept model (or Common Information Model).

Now, the problem with this approach is it contradicts another goal of some data warehouse projects: that of traceability. Some data warehouse projects want to preserve the data in its original form
as much as possible, afraid that any transformation will cause the business users to complain that the data has been corrupted and it can no longer be trusted.

Here is a simple example. Suppose you have a field called “suffix” which serves multiple purposes in the business: it stores a type code, sometimes it stores a quantity (if the type code is a
certain value), and for the specified quantity it also includes a unit of measure abbreviation; the rest of the field contents is to be interpreted as a comment. The natural inclination of the
warehouse group is to leave suffix as a single field, because “the business is used to dealing with suffix” and they are afraid to tinker with it because they may decode it incorrectly. However,
breaking the field up into its separate pieces allows the business to use the data in the way it really needs to. Storing it in the former way is clumsy but it facilitates traceability.

Suppose we have three systems, each store the same data but in different formats:

System A:

 

Figure 1: Overloaded field layout

System B:

Separate fields for each data element:

  • Type code
  • Quantity
  • Unit of Measure
  • Comment

System C:

  • Type code as a numeric value
  • Quantity
  • No unit of measure; system has an implied/default value
  • No comment field

The corresponding data vault model is shown below. Note that keys are not shown in these models. For information about keys in a Data Vault model, see Dan Linstedt’s articles mentioned above.

 

Figure 2: Data Vault Model

The beauty of this is, all three systems’ data fit nicely into this model. It also permits the model to be added and enhanced as needed. If another system is added to the data warehouse,
which has additional attributes not represented, another satellite can be added to accommodate it, without any maintenance to the existing structures. Note, however, that there is no attribute
called “suffix” in this model. There is no such thing as “suffix” in the business; it was purely an invented thing, evolving from usage of the information system. There
would consequently be no “suffix” which would appear in the CIM either.

If the data warehouse wanted to preserve the suffix data in addition to maintaining the decoded data, it could be added as another satellite.

 

Figure 3: Suffix Satellite

This stores redundant data, which takes up more space on the disk. As we know, database design is all about trade-offs, so based on the goals of the warehouse, it may be helpful to do this but it
may also be more expensive.

Obviously, if the latter approach is adopted, the resulting data vault model will not be a common information model; it will be a hybrid, and will require interpretation to be understood.


Normalization (or Lack Thereof)

The Enterprise Data Model will probably be normalized. In order for this technique to work, the EDM will need to be converted into a data vault. Again, it can be converted iteratively, since new
structures can easily be added when required.

In the Data Vault model shown in Figure 2, The Item Hub is related to the Inventory Hub by means of a Link. The EDM for this Data Vault Model might actually look like the one shown in Figure 4.

 

Figure 4: Normalized Model

This model shows that the Item entity is related to the Inventory entity in a one-to-many fashion. The Data Vault links all relationships as if they were many-to-many. So some cardinality semantics
is lost with the Data Vault. This is due to the fact that the data vault model is intended to support the data warehouse. It allows data to be loaded into the structures even if the relationship is
not supported in the source systems. Many source systems do not enforce referential integrity. The Data Vault structure allows “imperfect” data to be loaded without error and be fixed
at a later time. This is an advantage for the data warehouse, but it does mean that some semantics are lost. Therefore it does not serve as a complete model of the business, but a model is never
totally accurate anyway. Should you maintain a separate EDM that is normalized or can you replace it with the Data Vault, knowing that it will be limited in the representation of cardinality?
Perhaps it is possible to maintain only one model (the data vault). It depends upon which part of the trade-off is more useful. You can maintain two models, both the normalized EDM and the data
vault, and the latter also serves as the data warehouse model. Consequently, you can maintain only one, and document the missing cardinality somewhere in your model annotations.


Performance

The main problem with the data vault is it is not optimized for performance in a typical relational database. Incurring a 4-way join as opposed to a two-table join is not a great idea! However, we
should consider what the purpose of the data warehouse is:

  • Historical
  • Integrated
  • Non-volatile

If the data warehouse feeds marts, which are the primary source of decision support data, then proliferation of tables and query performance of the warehouse itself is a secondary concern, with
flexibility taking precedence. Dan also discusses how special tables can be added to enhance performance. See the TDAN.com series mentioned above.


Conclusion

It can be seen that the data vault can be used as a representation of the Enterprise Data Model or Common Information Model. It is valuable for the data warehouse to utilize structures that
represent business concepts instead of data elements invented by and for information systems of the past. The data vault adds flexibility, facilitating iterative development at the same time
allowing for incomplete data to be loaded first and researched later. However, when using the Data Vault, it is important to realize that some cardinality will be lost, and a normalized model
provides more information. Therefore, the data vault has its limitations but it may still be useful as a model of the business.

Share

submit to reddit

About Bonnie O'Neil

Bonnie O'Neil is a Principal Computer Scientist at the MITRE Corporation, and is internationally recognized on all phases of data architecture including data quality, business metadata, and governance. She is a regular speaker at many conferences and has also been a workshop leader at the Meta Data/DAMA Conference, and others; she was the keynote speaker at a conference on Data Quality in South Africa. She has been involved in strategic data management projects in both Fortune 500 companies and government agencies, and her expertise includes specialized skills such as data profiling and semantic data integration. She is the author of three books including Business Metadata (2007) and over 40 articles and technical white papers.

Top