A Really Close Look at the “Universal Data Vault” (UDV)

This is the second of two articles from John Giles on Universal Data Vault (UDV) design. Part one of the series can be found at Universal Data Vaults: Case Study in Combining “Universal” Data Model Patterns with Data Vault Architecture.

In Part 1 we looked at why you might want to combine the benefits of (1) “Universal” data model patterns (e.g. their robustness and flexibility as proven across diverse industries) and (2) the Data Vault architecture which facilitates the auditable, adaptable and scalable assembly of vast amounts of data for an enterprise.

However, it was noted that while these two approaches have significant benefits in their own right, combining them is potentially problematic: “Universal” data model patterns are by their very nature generalized, whereas Data Vault practitioners recommend a more specialized approach. Based on a real-life case study, a technique for combining them was briefly introduced.

This article expands on the technique, shares some experiences with top-down versus bottom-up methods and closes with a warning against unquestioned adoption of the “Universal Data Vault” for all situations.

UDV Hubs

Instead of having multiple physical Hubs in your UDV design (for example, a physical Hub for Aircraft plus specialized Hubs for Helicopter and Fixed Wing Plane, a physical Hub for Fire Truck with its specialized Hubs for Water Tanker and Slip On, and so on), we can have one generalized Asset Hub to hold all instances of helicopters, water tankers, generators, etc. The specializations are managed as logical types of the Asset Hub via a foreign key that links to the subtype definitions in the associated Asset Hub Type table.

The Hub is named according to the chosen selection of generalized data model patterns from the ELDM. They could include, in addition to Asset and Activity as shown above, concepts such as Account, Agreement, Event, Location, Product, etc. Another common pattern is the Party and Party Role pattern. This is a possibly contentious topic that we can leave aside in this article, since alternative modeling choices can easily be accommodated by the UDV structures.

Note that the Asset Hub has a unique index on the combination of Asset Type and Asset Business Code. A regular DV Hub is required to capture the value of a suitable business key that is unique within its domain. For example, aircraft registration numbers can reasonably be expected to be unique across all aircraft, and similarly the registration numbers for fire trucks can be expected to be unique within their context. But when one generalized Hub represents all these things, it might be possible for an aircraft and a truck to have the same registration number. To address this issue, we have included the Hub Type Code (here, the Asset Type) in the uniqueness constraint.
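The uniqueness rule described above can be sketched with Python's built-in sqlite3 module. This is an illustration only: the table and column names (asset_hub, asset_type_code, asset_business_code, and so on) and the sample codes are assumptions, not the physical design used in the case study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE asset_hub (
    asset_hub_key       INTEGER PRIMARY KEY,  -- surrogate key
    asset_type_code     TEXT NOT NULL,        -- would reference the Asset Type table
    asset_business_code TEXT NOT NULL,        -- e.g. a registration number
    load_datetime       TEXT NOT NULL,
    record_source       TEXT NOT NULL
);
-- Business key uniqueness is enforced per type, not across the whole Hub.
CREATE UNIQUE INDEX ux_asset_hub_business_key
    ON asset_hub (asset_type_code, asset_business_code);
""")

def load_asset(type_code, business_code, source="DEMO"):
    """Insert one Hub row; the unique index rejects true duplicates."""
    conn.execute(
        "INSERT INTO asset_hub (asset_type_code, asset_business_code,"
        " load_datetime, record_source) VALUES (?, ?, datetime('now'), ?)",
        (type_code, business_code, source),
    )

load_asset("HELICOPTER", "ABC123")
load_asset("WATER_TANKER", "ABC123")    # same code, different type: accepted

try:
    load_asset("HELICOPTER", "ABC123")  # same type and same code: rejected
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM asset_hub").fetchone()[0]
print(duplicate_rejected, row_count)
```

In this sketch a helicopter and a water tanker can share a registration number, while a second helicopter with the same number is treated as the same business key.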

UDV Hub Types

A Hub Type table, such as the Asset Type table, manages the logical identification of a supertype/subtype inheritance hierarchy. It is a simple self-referencing “type” table that defines types of (logical) Hubs and their parent/child hierarchy, e.g. it might have a row for Fixed-Wing Plane, a row for Helicopter, and a more generalized row for Aircraft as the “parent” of Helicopters and Fixed-Wing Planes.

A UDV Hub Type table is not a regular DV table. It doesn’t have a Source column, or a Load Date & Time, as its contents do not come from an operational system. It is simply configuration data, hand-crafted by the design team. This means that if, for example, a new asset type is required (as a specialization of an existing generalized Hub), you don’t have to create a new Hub – just add a row to the Asset Type configuration table, then start populating instances of that type in the Hub table.
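As a rough sketch of such a self-referencing type table, again using sqlite3: the column names and the DRONE type are hypothetical, chosen only to show that a new logical Hub type is a configuration row rather than a new table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE asset_type (
    asset_type_code  TEXT PRIMARY KEY,
    parent_type_code TEXT REFERENCES asset_type (asset_type_code)
)""")
conn.executemany(
    "INSERT INTO asset_type VALUES (?, ?)",
    [
        ("ASSET", None),
        ("AIRCRAFT", "ASSET"),
        ("HELICOPTER", "AIRCRAFT"),
        ("FIXED_WING_PLANE", "AIRCRAFT"),
    ],
)

def subtypes_of(type_code):
    """Return the type itself plus all of its descendants, sorted by code."""
    rows = conn.execute(
        """
        WITH RECURSIVE tree (code) AS (
            SELECT asset_type_code FROM asset_type WHERE asset_type_code = ?
            UNION ALL
            SELECT t.asset_type_code
            FROM asset_type AS t JOIN tree ON t.parent_type_code = tree.code
        )
        SELECT code FROM tree ORDER BY code
        """,
        (type_code,),
    ).fetchall()
    return [code for (code,) in rows]

before = subtypes_of("AIRCRAFT")
# Adding a new logical Hub type is a row insert, not a new physical Hub.
conn.execute("INSERT INTO asset_type VALUES ('DRONE', 'AIRCRAFT')")
after = subtypes_of("AIRCRAFT")
print(before)
print(after)
```

A query against the generalized Hub for "all aircraft" could then filter on the codes returned by subtypes_of("AIRCRAFT").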

There is a column to indicate if a Hub Type is “Abstract or Concrete.” This approximates the object-oriented meaning of an abstract or concrete class, but it is not essential that it be included in the core UDV architecture. In this sample diagram, an abstract Asset Type would be one that might be used to classify a collection of subtypes, but would not be expected to have any “concrete” instances actually recorded in the associated generalized Hub table.
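One possible use of the flag, sketched below under assumed names and sample data, is to list Hub rows that were recorded directly against an abstract type. Treating Aircraft as abstract is an assumption made for the example.

```python
# Hypothetical configuration: type code -> is the type abstract?
IS_ABSTRACT = {
    "AIRCRAFT": True,           # assumed abstract for this illustration
    "HELICOPTER": False,
    "FIXED_WING_PLANE": False,
}

# Hub rows as (asset_type_code, asset_business_code); illustrative data only.
hub_rows = [
    ("HELICOPTER", "H-001"),
    ("AIRCRAFT", "A-777"),
    ("FIXED_WING_PLANE", "P-042"),
]

def instances_of_abstract_types(rows, is_abstract):
    """List rows whose type is flagged abstract; nothing is removed from rows."""
    return [row for row in rows if is_abstract.get(row[0], False)]

flagged = instances_of_abstract_types(hub_rows, IS_ABSTRACT)
print(flagged)
```

The check only reports the unexpected row; the Hub contents are left as loaded.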

UDV Links

Just as Hubs can be generalized or specialized, so can Links. For example, if a bank has a generalized Hub for Agreement (Mortgage, Term Deposit, Security, etc.), and it also has a generalized Hub for its customers and associated parties, the types of Links between parties and agreements might include signatory-to-agreement, guarantor-for-agreement, witness-to-signature, employee-approval-of-agreement, and so on. In UDV, these types of Links can themselves be generalized, with a foreign key in each generalized Link that identifies its definition in the associated Link Type table (see below).

UDV Link Types

A UDV Link Type table, such as the Activity-To-Asset Link Type table, contains the logical definition for each allowable type of specialized Link between the participating generalized Hubs. As for Hub Types, the UDV Link Type tables:

- Are simple “type” tables that define types of (logical) Links, e.g. with a row for the Aircraft-assigned-to-Emergency Response Schedule Link.
- Are not regular DV tables: they don’t have a Source column or a Load Date & Time column.
- Can have new (logical) types of Links added dynamically where the generalized Link already exists.
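The points above can be sketched as a pair of tables, one generalized Link and one Link Type. The table names, column names and type codes below are assumptions for illustration, as is the second (fire truck) Link type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Configuration data: no Source or Load Date & Time columns.
CREATE TABLE activity_to_asset_link_type (
    link_type_code TEXT PRIMARY KEY,
    description    TEXT NOT NULL
);
-- Generalized Link between the Activity Hub and the Asset Hub.
CREATE TABLE activity_to_asset_link (
    link_key         INTEGER PRIMARY KEY,
    link_type_code   TEXT NOT NULL
        REFERENCES activity_to_asset_link_type (link_type_code),
    activity_hub_key INTEGER NOT NULL,
    asset_hub_key    INTEGER NOT NULL,
    load_datetime    TEXT NOT NULL,
    record_source    TEXT NOT NULL
);
""")

def add_link_type(code, description):
    conn.execute(
        "INSERT INTO activity_to_asset_link_type VALUES (?, ?)",
        (code, description),
    )

def add_link(type_code, activity_key, asset_key, source="DEMO"):
    conn.execute(
        "INSERT INTO activity_to_asset_link (link_type_code, activity_hub_key,"
        " asset_hub_key, load_datetime, record_source)"
        " VALUES (?, ?, ?, datetime('now'), ?)",
        (type_code, activity_key, asset_key, source),
    )

add_link_type("AIRCRAFT_ASSIGNED_TO_ERS",
              "Aircraft assigned to Emergency Response Schedule")
add_link("AIRCRAFT_ASSIGNED_TO_ERS", activity_key=1, asset_key=10)

# A new logical Link type needs only a configuration row; the physical
# generalized Link table is reused unchanged.
add_link_type("FIRE_TRUCK_ASSIGNED_TO_ERS",
              "Fire truck assigned to Emergency Response Schedule")
add_link("FIRE_TRUCK_ASSIGNED_TO_ERS", activity_key=1, asset_key=20)

links_by_type = dict(conn.execute(
    "SELECT link_type_code, COUNT(*) FROM activity_to_asset_link"
    " GROUP BY link_type_code"
).fetchall())
print(links_by_type)
```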

UDV Satellites

The Satellites are regular DV Satellites, but they can be hung off Hubs at any level in the (logical) inheritance hierarchy. For example, if all Activities share some common attributes (e.g. activity start and end dates and times), a common Satellite can be defined. Instances of more specialized Hubs can populate the common Satellite as well as their own specific Satellites.

Similarly, Satellites can be hung off Links, and can be generalized or specific to one particular type of Link.
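To make the inheritance idea concrete, here is a small sketch that works out which Satellites apply to an instance of a given logical Hub type by walking up the type hierarchy. The type codes, Satellite names and the MAINTENANCE_JOB type are hypothetical.

```python
# Hypothetical type hierarchy: child type -> parent type (None at the root).
PARENT_TYPE = {
    "ACTIVITY": None,
    "EMERGENCY_RESPONSE_SCHEDULE": "ACTIVITY",
    "MAINTENANCE_JOB": "ACTIVITY",
}

# Hypothetical mapping of logical Hub type -> Satellites hung off that level.
SATELLITES_BY_TYPE = {
    "ACTIVITY": ["sat_activity_common"],  # e.g. start and end dates and times
    "EMERGENCY_RESPONSE_SCHEDULE": ["sat_emergency_response_schedule"],
}

def satellites_for(type_code):
    """Satellites an instance may populate: inherited ones first, then its own."""
    chain = []
    while type_code is not None:
        chain.append(type_code)
        type_code = PARENT_TYPE[type_code]
    result = []
    for code in reversed(chain):  # most general type first
        result.extend(SATELLITES_BY_TYPE.get(code, []))
    return result

print(satellites_for("EMERGENCY_RESPONSE_SCHEDULE"))
print(satellites_for("MAINTENANCE_JOB"))
```

A type with no Satellite of its own, such as the assumed MAINTENANCE_JOB, still picks up the common Activity Satellite.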

Some Optional Extras

The standard Data Vault architecture has a beautiful elegance. It has only three core constructs – Hubs, Links, and Satellites. Yet within this simplicity, it has the flexibility to accommodate a number of variations. For example:

- Satellites can be split to have multiple Satellites for one Hub (or Link)
- A Hub or Link can have “effectivity” Satellites
- The standard Data Vault can be supplemented with “Point-In-Time” and “Bridge” tables

These, and many more topics, are best left to the experts who have already published on such matters!

The basic UDV architecture also has both simplicity and extensibility. A few examples are described below.

Self-Referencing UDV Link Types

If we have a self-referencing Link named “Asset Contains Asset” that links a pair of assets, we might have to guess which one is the container and which one is the component. Sometimes we might be able to guess correctly. If one item is a car and the other is an engine, it’s pretty reasonable to assume the car contains the engine rather than the engine containing the car! But to take a real-life example where the roles of participating Hubs are less obvious, one of our fire fighting agencies sees a fire response unit as being a fire truck that contains some crew members (i.e. the truck is the container), while another agency sees the grouping of people (the crew) as the container, and the truck as a component.

An amusing twist at another client site relates to office blocks and waste treatment plants. You can have an environmentally responsible office block that contains a small, local waste treatment plant. Conversely, you can have a large regional waste treatment plant that contains its own office block!

In such cases, rather than implying the role of the participants, the UDV Link Type may need to explicitly define the roles. An example of such a structure follows, where the UDV Link Type table has columns to define the role for each participant:
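As a minimal sketch of explicit roles, expressed in Python rather than as a table diagram: the field names hub_1_role and hub_2_role, and the sample asset names, are assumptions and not necessarily the column names of the author's structure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinkType:
    code: str
    hub_1_role: str  # role played by the first participating Hub instance
    hub_2_role: str  # role played by the second participating Hub instance

@dataclass(frozen=True)
class Link:
    link_type: LinkType
    hub_1_name: str
    hub_2_name: str

CONTAINS = LinkType("ASSET_CONTAINS_ASSET",
                    hub_1_role="container", hub_2_role="component")

def describe(link):
    """Spell out each participant's role instead of leaving it to guesswork."""
    lt = link.link_type
    return (f"{link.hub_1_name} is the {lt.hub_1_role}; "
            f"{link.hub_2_name} is the {lt.hub_2_role}")

# The same pair of asset kinds can appear either way round.
green_office = Link(CONTAINS, "Office Block", "Local Waste Treatment Plant")
regional_plant = Link(CONTAINS, "Regional Waste Treatment Plant", "Office Block")

print(describe(green_office))
print(describe(regional_plant))
```

Because the role of each position is declared on the Link Type, the position of a Hub instance in a Link row says unambiguously which asset is the container.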

Rules

Data Vault has a principle of “All the data, all the time.” You don’t filter out data during the load process just because it breaks some perceived business rule. You want to capture the actual data in the operational systems, whether it is “right” or “wrong.” After all, maybe it’s the rule that’s wrong!

The UDV has the potential to define a number of rules in its metadata, and it can easily be extended. For example, the UDV Link Type tables can be extended to hold expected multiplicity (optionality and cardinality). The key message, however, is that any such rules are used to report apparent breaches against the data as loaded, not to exclude data from the load.
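A sketch of this "report, don't reject" approach follows, assuming a hypothetical cardinality rule (at most one asset per activity for a given Link type) and made-up sample rows.

```python
from collections import Counter

# Hypothetical rule metadata: Link type -> maximum assets per activity.
MAX_ASSETS_PER_ACTIVITY = {"AIRCRAFT_ASSIGNED_TO_ERS": 1}

# Link rows as loaded: (link_type_code, activity_key, asset_key).
loaded_links = [
    ("AIRCRAFT_ASSIGNED_TO_ERS", "ERS-1", "HELI-1"),
    ("AIRCRAFT_ASSIGNED_TO_ERS", "ERS-1", "HELI-2"),  # apparent breach
    ("AIRCRAFT_ASSIGNED_TO_ERS", "ERS-2", "HELI-3"),
]

def report_breaches(links, max_per_activity):
    """Return (link type, activity, count) for each apparent breach.

    The input rows are only read; nothing is filtered out of the load.
    """
    counts = Counter((link_type, activity) for link_type, activity, _ in links)
    return [
        (link_type, activity, n)
        for (link_type, activity), n in sorted(counts.items())
        if link_type in max_per_activity and n > max_per_activity[link_type]
    ]

breaches = report_breaches(loaded_links, MAX_ASSETS_PER_ACTIVITY)
print(breaches)
print(len(loaded_links))
```

All three rows stay in the vault; the breach list is simply a report for someone to investigate, since it may be the rule that is wrong.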

Visualization

If we had a physical Hub for Aircraft and another for Emergency Response Schedules, plus a physical Assignment Link between them, we could inspect the database and “see” the structures. There may be hundreds of Hubs, Links and Satellites, but they are at least visible as tables. Conversely, if all this design is held as data in UDV Hub Type and UDV Link Type tables, the structure is no longer visible in the database schema. We can still find it by looking at the rows in these “type” tables, but some data people may be less comfortable doing so.

One solution is to read the metadata in the “type” tables (plus the column definitions in the Satellite tables) and create XML Metadata Interchange (XMI) files which in turn can be imported to an XMI-compliant tool and visualized as a UML class diagram.

Reflections

Pattern-based Enterprise Logical Data Models (ELDMs)

There can be passionate debates about top-down (generalized) versus bottom-up (specialized) modeling. I care less about where people start, but I do recommend that before anyone claims to have reached the end, they consider both aspects.

Graham Witt is a fellow Australian whom I respect highly. He came across a story of what I will call a “bottom-up” designer. The database had to include student records for a school. The designer had noted that in a given sample of data, all the students had two parents who shared the same surname and same address as the student, and designed the database on that assumption. Of course, once you encounter real-life data, this may not work! The mistake on the surname was embarrassing, but little more; the mistake regarding a common address was far more serious. A mother was living in a women’s refuge due to domestic violence issues, and this modeling “mistake” was the catalyst for having her address made known to her abusive husband.

There may have been many issues behind this story, but I am guessing that at least two were in play.

Firstly, I am suggesting that if generalized data model patterns were taken as a starting point, the additional flexibility of these patterns, such as each person having different names (and possibly multiple names), could be reviewed, and unnecessary “over-engineered” features consciously removed. This would have been far less painful than discovery of missed features.

Secondly, the developer in this real-life example was presumably a bit of a novice. I highly recommend anyone considering a pattern-based ELDM (or its logical extension, namely a UDV) purchase the books on data model patterns by Len Silverston and David Hay. In my book, The Nimble Elephant, I also provide insight into how these patterns can be effectively applied. But reading the books may not be enough. A bit of hard-won experience in the use of these patterns may prove to be valuable.

Universal Data Vault: a Battle or Bonanza

There is a saying that necessity is the mother of invention. At some of my client sites, I felt the need to see if I could combine the benefits of data model patterns with those of Data Vault architectures. It worked, and worked well!

I remember a colleague who challenged the suitability of the generalized data model patterns. He wanted to examine their ability to address the specific needs of the client. He took the most complex set of sample data he could get his hands on, and we mapped it to the patterns. I think he was pleasantly surprised when over 90% fitted. Being a cheeky sort of a bloke, I teased him by saying that if we had also incorporated the “survey” pattern in the ELDM, we would have got very close to 100%.

There are some drawbacks with UDV. One already mentioned is the lack of visibility of the data structure, at least in a form that is familiar for many data modelers. Another is that finding suitable business keys is hard for regular DV, but even harder for the more generalized Hubs of the UDV architecture.

Yet in spite of these challenges, the first UDV project was such a success that one of the employees at this client site subsequently approached me to “do the same thing” at his new site after he had left and joined another organization.

In summary:

Data model patterns are generic in nature, and may well offer benefits for you.
Data Vault architectures also offer much:
- If your enterprise can find one “sweet spot” on its generalization/specialization continuum, standard DV is great.
- If your enterprise can’t define a single level of specialization but has a limited number of focal points on the continuum, explicit Hubs for each point, supported by “same-as” Links, should be seriously considered.
- If your organization sees value in having a multi-level generalization/specialization hierarchy, especially one based on data model patterns, you may find that the Universal Data Vault approach, by combining universal data model patterns with the Data Vault architecture, demonstrates that the “whole is greater than the sum of the parts.”