Welcome to the fifth article in a series about applying Agile techniques to data projects in our Tips and Tricks from the Trenches Column. The column’s goal is to share insights gained from experiences in the field through case studies. This column explores the first of two different ways (and actually, opposites) to perform data warehouse modeling. Both techniques use an Agile approach to data modeling and enable easy adaptation.
Credits and Kudos
The first approach we are going to explore is called the Data Vault and was invented by my well-respected colleague and friend Dan Linstedt, from whom I had the privilege of taking a class in Data Vault shortly after he had published his series of articles outlining and explaining his approach. He was very proud of the fact that, early on, Bill Inmon, the “father of data warehousing,” had endorsed the technique. His articles came out around 2000 and were launched on his website, danlinstedt.com, and the basics of Data Vault can still be found there. He has published a book in 2015 called Building a Scalable Data Warehouse with Data Vault 2.0, published by Morgan Kaufman. I hope to whet your appetite enough for you to pick up the book if you want more depth. The technique is an interesting one. This article is not intended to be a comprehensive overview, but instead a high-level summary.
In addition to Mr. Linstedt, the other “father” (or perhaps “uncle”) of the Data Vault is Kent Graziano, who hosts a blog on Dan’s site. He has written extensively on Data Vault and presented the subject at numerous conferences.
Introduction to the Data Vault
According to Dan Linstedt, here is a definition of the Data Vault:
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.
The Data Vault follows a hub and spoke architecture. Linstedt has created three basic structures:
- Hubs (business keys)
- Satellites (groups of attributes)
- Links (intersection entities)
Figure 1 shows a simple illustration of this concept.
Figure 1. Three Basic Building Block Structures of a Data Vault Model
A Very Flexible Structure
The basic concept behind the Data Vault is it is extremely flexible, and can be extended with very little work. The hubs contain the essence or identity of a central business concept, what Dan likes to stress are “business keys,” as close to the true identifying attributes from a business perspective; they are also called “natural keys.” In other words, the attributes that truly identify the business entity, are “naturally occurring in the business,” and not artificial or “surrogate keys.” A surrogate key is a data modeling convention that is used to uniquely identify an instance of an entity that has no inherent meaning. It is usually a sequence number or hash key that is system-generated so it is always guaranteed to be unique. But it is by its very nature, artificial, and not “naturally” occurring in the business.
For example, a business has an order and an invoice. What is the business key for the order? Data modelers for many years have assigned an Order ID to easily identify an order. It is a single attribute and facilitates in a relational database—a data convention to assist in physical database design. But what really makes the order unique from a business perspective? It might be the following:
- Products being ordered
- Date of the order, perhaps including the time, if multiple orders can be placed on the same day for the same customer
In a similar fashion, an invoice might have the following business keys:
- The business keys for the order, so the invoice can be paired with the order
- Invoice date
Order line items may retain their sequential numbers because it may be a useful construct from a business perspective to number them. But note that it is retained from a business perspective and not a physical database design consideration.
Example: A College
Let’s look at another example, that might illustrate business concepts a bit better: a small college.
A college must track Students, Classes, Departments, and Advisors. We begin with a simple design, shown in Figure 2 below. The model shows other concepts that are also important, but the ones in gray are chosen for consideration in the first iteration. The Data Vault is best implemented in iterations, and not as a “Big Bang.” This is one of the features that makes it Agile.
Figure 2 College Data Vault, Iteration 1
First Step: Identify Hubs
The college has decided that the first iteration of the Data Vault will focus around Departments. The main information they will start tracking is Teachers assigned to the Departments and the Students’ major within a Department. The first step is to identify the hubs; in this case, we have three:
The Advisor entity is not a hub; it represents a relationship between two hubs. Relationship entities are known as Links and will be discussed later in this article.
Second Step: Business Keys
The second step, after identifying the hubs, is to identify the business keys for each. Business keys are supposed to uniquely identify an instance of the entity and be immutable– resistant to change. The business keys should never change. An entity may have a combination of business keys when one element alone is not satisfactory for unique identification.
It is fiendishly difficult to determine natural keys for people. If you consider the natural identifiers for people, they are not reliable:
- Birth date is probably the most reliable identifier for people because it does not change. The only problem is that there may be more than one student with that birthdate, so by itself it is not a good business key.
- Social Security Number is not consistent because you may have foreign-born students. Foreign students may have Green Cards or Student Visas, which might work. But note that these represent two different types of IDs based on the citizenship of a student. It is tricky to do this and there are different attributes for Student Visa than Social Security Number, like expiration date, that don’t apply to the other.
- Names can change. Female students may get married and change their last name to that of their husband. Anyone can also change their name (either first name, last name, or both) at any time.
- Hair color can change; even eye color. A person may wear colored contacts.
- Gender is mostly constant, but this can change.
- Phone number, especially cell phone number, is fairly reliable but it can also change.
- At the end of the day, most organizations have created their own version of the surrogate key to identify people: “Employee ID” and “Student ID.”
The college chooses to use the Student ID and the Birthdate in the Student hub as the business keys. There is a similar issue with Teachers, although probably not as acute. The college assigns each employee an Employee ID, and the college uses this and birthdate as natural keys for a Teacher.
The business key for Department might be its Name. The college might group its departments into divisions or “schools,” such as the “School of Business”, “Science and Technology”, or “Engineering.” However, it might be good to not include the division as part of the natural key because the groupings may change. Today, Computer Science might be part of the Engineering school, but tomorrow the college may reorganize and it might be under the “Science and Technology” school. We shall see how data vault handles this in the next article in the series.
Now we have the following structure shown in Figure 3 below. The hubs are in gray, the business keys are in purple, and the link is in blue.
Figure 3. Evolving Data Vault with Business Keys
The next step is to define Links. One of the rules of Data Vault modeling is “Make all relationships many-to-many (M:M).” The reason for this is it creates an extremely flexible data model.
The ramification of this rule on our college model can be seen in Figure 4. We have our first Link: Advisor. So even if Advisor is usually a one-to-many, the Data Vault always will turn it into a M:M. In the case of Advisor, though, it is easy to understand that it might be a M:M. An advisor almost certainly advises more than one student, and a student may change advisors. So the figure assumes that this is the case.
Figure 4. New Links Added
In addition, Figure 4 shows several other Links that are added. One is Major, which relates a Student to a Department. The college may have a rule that prohibits double majors, but Data Vault makes it M:M anyway. Note that a Teacher is usually only associated with one Department, but again, the relationship is created as M:M. If you are familiar with intersection tables, Links are similar– they carry the keys of both tables that participate in the Link.
Dan Linstedt in his Data Vault Basics example shows a four-way Link table. It is instructive to note that Links don’t have to be binary, but can have more than two tables associated with them.
One of the secrets of Data Vault is Dates. Dates are added as a vital component. This enables the structure to support change. Dates become part of the keys in all Data Vault structures– every structure in a Data Vault model has the load date and a record source identifier. This is important for a data warehouse. It allows adjudication between sources as well as dates. The best value for a given data element may not be the most recently entered, but might be one data source trumping another. This means that the Data Vault can be used for Master Data Management—another plus for the technique.
Load dates turn all relationships into many-to-many. Even relationships that normally are 1:M can be M:M when you consider point in time. For example, a Student can have only one advisor at any given time. But because the model stores multiple points in time, the usual 1:M relationship actually turns into a M:M. This enables the structure to capture students switching advisors.
Next, the modeler adds Satellites. Satellites contain descriptive attributes for hubs or links. All attributes that are not business keys belong in satellites. The rule about how “attributes that change together, stay together” guides the design of the Satellite structures. Address is a good Satellite– all the attributes in a standard address are modified at the same time. You would have a satellite address table each for Student and Teacher.
Suppose we want to track Teachers’ specialty areas. You would probably not modify the specialty area at the same time as the Teacher changes his/her address, so the two would become separate satellites. It is possible to have a satellite with very few attributes in it; possibly even one—but then you would add the date and source ID attributes.
What if I want to expand my model? Suppose the source system now supports listing the journals that the various teachers publish in. No problem! Simply create a new satellite for it (see Figure 5 below). It shows three Satellites for Teacher: Specialty Area, Journals, and Teacher Address. This demonstrates the real Agile power of the Data Vault. It doesn’t matter what changes in the source; it is trouble-free to add new attributes, because it involves simply creating a new satellite. And the same goes if the source has deleted an attribute. Simply remove the satellite containing the attribute, or modify only that table.
Figure 5. Satellites
Now some of the benefits of Data Vault become clear. The Data Vault is meant to support the central data warehouse, which represents the history of the enterprise over time. It does this by storing all the changes from all the (relevant) source systems. Data Marts will then be created and populated from the Data Vault central data warehouse. Because the Data Vault stores all history, it can be used reliably to adjudicate Master Data for a Master Data Management system (MDM), and can support changes made to the underlying source systems very easily in an agile manner. A very flexible structure indeed!
The Data Vault is subject to performance problems, especially when implemented in a relational database, which was the only game in town when Linstedt first created the Data Vault. This is due to its hyper-normal structure. Normal forms mean more tables, and more tables in relational databases mean performance challenges. However, in today’s world there are more data structure types to choose from. Although I have never seen a Data Vault implemented on a graph database, I can see how it might work very well. There seems to be symmetry between them because both involve inherently networked structures.
I have seen the Data Vault both succeed and falter. In the latter case, my client had a hardware shortage and each database application had to share its hardware infrastructure with a number of other, usually highly utilized systems. The result was large performance degradations, affecting all applications hosted on the hardware. It certainly did not help that most queries involved 7 or 8-way joins or worse. But the hyper-normal database design was only partly to blame for the dreadful performance.
The success of the other implementation was due to the flexibility of the structures and the ability to implement the data warehouse incrementally. The company could introduce new parts of the model easily while users were able to take advantage of the already-built portions. This organization had robust infrastructure, which helped mitigate the potential poor performance due to large joins.
Linstedt admits that the Data Vault can cause slow performance. He mentions that the Vault structure is not meant to be a source of ad hoc queries, but instead is designed to feed data marts which will be the main end user database. He presents options such as Helper and Bridge Tables, which are specially designed tables for specific uses, such as to house summaries. But these structures are not meant to be part of the main design.
It is easy to see that the Data Vault technique can enable an Agile approach to data warehouse modeling. It provides a flexible way to incrementally build out a data warehouse, allowing for users to get the benefits from a smaller portion of the model while the development team is working on the next iteration. Targeted data marts can also be incrementally released when the underlying Data Vault is ready. High performance hardware should be used, or alternative storage structures such as graph databases.
The next column in our series will explore a completely different approach to Agile data warehousing, which is a different spin on the generic data modeling approach, explored in the third article in this series: here is the link.