Data Vaults have enormous potential to deliver value, but to do so, we’ve got to turn the spotlight off the technology. Sure, we absolutely need some underpinning techno stuff, but if we want tangible value for the business, we must focus on …? Focus on what? Yes, focus on the business.
In the “Agile” world, there is recognition of “technical debt” – consciously postponing some aspect of the technical solution set to another day. Unfortunately, some Data Vault vendors and consultants try to impress with today’s speed but they seem to avoid a conscious discussion of “business debt.” These fast-off-the-block Data Vault service providers can get data into the Data Vault quickly, but their technology-centric approach can bite hard later.
The good news is that there is a growing recognition of the need to apply business-centric views to the design of a Data Vault, from the very outset.
This paper is part of a trilogy. The first paper argues “why” top-down models should be considered, not just for Data Vault projects. A second, companion set of materials articulates “how” to develop such models.
But any model is useless shelfware if it’s not applied. One of the most common applications of these models is the design of a Data Warehouse, and this is the topic for this third member of the trilogy (itself split into two – you’re reading the first segment), specifically using Data Vault as the platform for discussion.
It is important to note that top-down models can deliver tangible business value across a variety of other scenarios such as formulation of enterprise strategies, delivery of MDM solutions, design of a services architecture, and more. Those interested in just Data Vault will hopefully enjoy the entire trilogy; others are encouraged to read and apply the first two parts, and then maybe add their own paper on how they applied top-down models to their particular area of interest.
Before We Start, What’s a Data Vault?
If you’re reading this paper, there’s a good chance you’re already familiar with what a Data Vault is. If so, please feel free to skip the next few paragraphs. For others, Data Vaults can be seen as an evolutionary advance from earlier data warehousing approaches.
Data Vault 2.0 has a few dimensions – architecture, methodology and modeling. The modeling aspect is but a part of the whole, but it’s the focus of this paper.
Within Data Vault modeling, you start with hubs that have “business keys.” The hubs are the center of everything, which is why they’re called hubs! Examples of hubs might include a Customer hub, an Order hub, an Employee hub, and so on. Perhaps unsurprisingly, the “business key” for the Customer hub is likely to be the Customer Number. Likewise, Orders have Order Numbers and Employees have Employee Numbers. Pretty simple so far?
These hubs are linked together with “links” – another pretty good name. For example, employees in the role of account manager may be assigned to look after customers, so there’s a link between the Employee hub and the Customer hub.
And these hubs and links can have satellites that hold the data values, over time.
So there you go. Modeling for a Data Vault taught in a few paragraphs. I haven’t done even the modeling aspect justice, but it may be enough to help Data Vault newbies get some value from this paper.
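For readers who think in code, the three building blocks described above might be sketched as follows. This is a minimal illustration only, not a Data Vault implementation: the class names, field names, and sample values (such as "C-1001") are my own assumptions, and each Hub object stands for a single hub instance rather than a whole table.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Hub:
    """One instance of a business concept, identified by its business key."""
    hub: str            # e.g. "Customer"
    business_key: str   # e.g. the Customer Number


@dataclass(frozen=True)
class Link:
    """A relationship between hub instances."""
    name: str
    hubs: tuple         # the Hub instances taking part in the relationship


@dataclass
class Satellite:
    """Descriptive data values for a hub or link, held over time."""
    parent: object                              # the Hub or Link described
    history: list = field(default_factory=list)

    def record(self, load_time: datetime, values: dict) -> None:
        # Append rather than overwrite, so earlier values are retained.
        self.history.append((load_time, values))


# Hypothetical sample data.
customer = Hub("Customer", "C-1001")
employee = Hub("Employee", "E-42")
account_manager = Link("Account Manager Assignment", (employee, customer))

customer_details = Satellite(customer)
customer_details.record(datetime(2018, 1, 1), {"name": "Acme Pty Ltd"})
customer_details.record(datetime(2018, 6, 1), {"name": "Acme Holdings"})
```

The point to notice is that the hub carries nothing but the business key, while everything descriptive (and everything that changes over time) lives in the satellite.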
Who Says Data Vault’s Not All About Technical Stuff?
After all these years, I still enjoy technical challenges. And Data Vault implementations certainly need to be able to cope with plenty of such challenges. Just taking a few of the ever-growing list of V’s in “big data” as an example, a Data Vault has to handle volume (Dan Linstedt, the Data Vault founder, has clients with petabytes of data), velocity (Micron’s billions of new rows each day), and the variability of Hadoop data structures. We sure need techos. But …
Dan Linstedt is emphatic that Data Vault builders “… must focus on ontologies [when] building the Data Vault solution …”. [By the way, my simplified definition of an enterprise “ontology” is that it’s a model that describes business concepts, their inter-relationships, and their major attributes. For practical purposes in this paper, I suggest that an enterprise ontology is pretty much the same as a top-down big-picture enterprise data model.]
At the start of 2018, Roelant Vos, a respected Aussie spokesperson for Data Vaults, put up an excellent article on his LinkedIn website. One tiny snippet from it stated that the Data Vault approach supports “…incremental, business focused and tightly scoped information delivery” [emphasis mine].
In a similar vein, Mike Magalsky spoke of Micron’s experiences with Data Vault, explaining that their creation of a foundational design reflective of core business concepts was central to their success.
Hopefully the message is clear: If you’re building a Data Vault, make sure you start with getting the business view of the world. Unfortunately, not everyone does Data Vault that way, in spite of Dan’s blunt admonition where he states, “… if the Data Vault you have in place today is not currently about the business, then unfortunately you’ve hired the wrong people, and those people need to go back to school and re-learn what Data Vault really means. OR you’ve built the wrong solution, and you need to fix it – immediately.”
So Why Might Some Be Missing the Mark?
I don’t want to focus on the problems. Solutions are much more valuable! But maybe an awareness of how good people fall into traps can serve as a warning. It’s better to have warning signs on top of a cliff than have excellent paramedics at the bottom treating those who have fallen.
Reasons for people doing bottom-up, source-driven Data Vault are varied, but some observations follow.
- Tools: Some tool vendors, maybe in their enthusiasm to demonstrate simplicity and speed, show how quickly they can point at a data feed from a source system, press the big green “Go” button, and hey presto, there’s a Data Vault. The tools may be excellent, and many are. It’s the implied source-driven Data Vault message these tool demonstrations carry that’s dangerous.
- Postponing business needs: You’ve heard the saying, “Build it and they’ll come?” Great theory, maybe, but not a guaranteed approach to success. Some practitioners, perhaps unconsciously, seem to follow this mantra. Load source data into a raw Data Vault as quickly as possible, and at some time in the future worry about who will consume it, and what business rules need to be applied so they can use it.
The business may admire these people for visible progress today in getting data into the Data Vault, but may be bitterly disappointed when they try to get integrated, enterprise-wide analytics out of the Data Vault.
- Fear of enterprise data models: Some may say they agree with the theory of driving a Data Vault from an enterprise ontology, but if one doesn’t exist, they either don’t know how to build one, or they think it will take so long to build that it’s not worth even trying. And even if one does exist, they may not know how to apply it to shape the Data Vault design. If they haven’t come across big-picture top-down modeling as an approach to delivery of “sufficient” design for a Data Vault, who could blame them?
So what are some symptoms of source-centric, bottom-up thinking?
One simple example encountered recently was the intention to build “lots of customer hubs” in the raw part of the Data Vault, and to consolidate them later in the business part of the Data Vault. Seeing a multitude of hubs around one common theme should set warning bells clanging loudly.
A Misguided Belief in Source Purity?
I’ve heard some disturbing comments from some Data Vault practitioners:
- “Business concepts don’t reflect reality. It’s the source systems that represent the facts. Go to them to discover reality.”
- “There is no such thing as a business concept in Data Vault.”
- “Through the source systems, Data Vaults accurately capture the business process realities.”
It appears to me that a bottom-up view of the world can be found at the heart of these statements, where source systems are somehow seen as representing unquestionable pure reality. Sorry, but that’s dangerously flawed thinking. A raw Data Vault must faithfully capture exactly what’s presented from a source feed. But we can’t stake our lives on the belief that the source system is, in turn, a pure reflection of “business” reality.
Let’s start with looking at source system data structure purity. Off-the-shelf packages, even with tailoring, are unlikely to ever be a perfect match to “reality.” Perhaps you’ve got an ideal scenario, where a hand-crafted IT system, at the time of its creation, was a perfect match to the business reality. Unfortunately, over time, businesses change. At best there may be a lag in re-aligning the IT system to the new reality, or, at worst, changes are too expensive to be made and business reality and the IT solution slowly drift apart, with work-arounds creeping in.
Let’s challenge the idea of source system data content purity a bit harder. All data is at best a partial abstraction of reality, and its representation can be blatantly wrong. What if the source system insists a job applicant must have a date of birth supplied, and the person keying the data hasn’t got a clue? In goes something like 1/1/2001.
And the source system isn’t a pure reflection of business processes either, as is sometimes claimed. It may faithfully track what was entered. Its audit mechanisms may capture when the data was entered and the user ID of the person doing the data entry (which should represent the person at the keyboard, but the rules can be broken and passwords shared). Even more fundamentally, the source systems will never catch the manual process workarounds performed outside of the system because the system constraints simply do not reflect the reality of how things are done.
I recently encountered a classic example of source system constraints. It’s almost unbelievable in this day-&-age, but one source system uses parts of a person’s name to generate a unique identifier for that person (plus a “tie-breaker” in case the name’s not unique). When a person changes their name, the source system will not permit a change to the primary key. As a work-around, the operator must create a new record with a different primary key. Same person, two records. This example should dispel the myth that a source system reliably reflects the real world.
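To make the consequence concrete, here is a small sketch of that kind of key generation. I don’t know the actual algorithm the source system uses, so the scheme below (four letters of surname, two of given name, a two-digit tie-breaker), the function name, and the sample person are all invented for illustration.

```python
def make_person_id(surname: str, given_name: str, existing_ids: set) -> str:
    """Build an identifier from parts of the name plus a numeric tie-breaker.

    Hypothetical scheme, assumed purely for this illustration.
    """
    stem = (surname[:4] + given_name[:2]).upper()
    tie_breaker = 1
    while f"{stem}{tie_breaker:02d}" in existing_ids:
        tie_breaker += 1
    return f"{stem}{tie_breaker:02d}"


records = {}  # primary key -> name as stored in the source system

# Initial registration.
first_id = make_person_id("Clarke", "Casey", set(records))
records[first_id] = "Casey Clarke"

# The person changes surname. The primary key cannot be updated, so the
# operator creates a second record for the same real-world person.
second_id = make_person_id("Davies", "Casey", set(records))
records[second_id] = "Casey Davies"
```

A raw Data Vault loading this feed would faithfully receive two business keys, and nothing in the keys themselves says they belong to one person.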
Please, I am not downplaying the enormous value of Data Vault’s solid audit and data lineage features related to the loading of source system records. But I am arguing against the perception of source system purity as a representation of the real-world truth.
Let’s Look at Data Vault Design a Bit Differently
What the business wants is raw data from many, many sources, turned into business information, based on business concepts and their relationships – their enterprise ontology if you like. So how can we align to this goal?
The egg-timer diagram above may give a hint. The amount of data held in source systems is relatively high. The raw Data Vault can be expected to hold (1) less data, as typically not all source data sets are nominated for loading to the Data Vault, but (2) more data, because it holds history over time. The business Data Vault layer ideally will have relatively little data – for example, multiple conflicting satellites for a common parent may be resolved into a single, conformed satellite. Data volumes are then likely to start increasing again if the presentation layer and consumer reports are retained as physical objects rather than as virtual objects (maybe multiple data marts could be constructed from common business Data Vault artifacts). Finally, the number of reports or extracts produced by consumers could potentially be huge.
The business Data Vault is the business-centric layer of a Data Vault, and it is core because that is where the raw data is finally assembled as business information, founded on the business ontology. Yes, assembly of disparate raw source feeds can and should commence in the raw Data Vault – for example, hanging satellites from multiple sources off shared business keys in one hub (“passive integration”) – but the business Data Vault is where the last touches are applied.
An investment in a well-considered design can pay massive dividends. David Hay has classifications for different types of conceptual models. He talks of divergent semantic models, where different parts of the enterprise see things differently, but he also refers to a convergent “essential” model that seeks to represent a unified view across all parts. It is this single unified enterprise view that should shape the business Data Vault’s design.
And the good news is that the companion paper on how to build a top-down big picture enterprise data model demonstrates that a “sufficient” model can be assembled in weeks, not years.
An Outline of a Data Vault Design Approach
After we’ve got our hands on the enterprise data model, what steps are to be taken to shape up the Data Vault design?
- Hubs, as reflected by their name, are at the center of our Data Vault world, so we start there, immediately focusing on hubs that reflect the business view – the business Data Vault that’s in the narrow neck of the egg-timer diagram above. If we get this bit right, we have a solid starting point for the rest of the entire Data Vault.
- Identify the relationships (i.e. links) between hubs. These may be fundamental business relationships (e.g. Account Managers manage Customers), or they can reflect transactions and events (“units of work”) as presented to us as part of some process.
- Identify the clusters of attributes coming from source systems that are to provide descriptive context for hubs and links, and that are to be stored as satellites.
Designing the Hubs
Identifying the Business Hubs
Let’s assume we have operational source systems that manage response to emergency events such as wildfires and floods. A snippet from our enterprise data model might look something like the following diagram:
I have used UML class modeling notation, and the diagram is intended to convey the message that Emergency Events are a supertype (superclass) with attributes that are common to all subtypes (subclasses). The diagram is far from comprehensive – for example it doesn’t include earthquakes and tornados – but it may be sufficient for our discussion here.
The business recognizes the hierarchy of types (you could call it a taxonomy if you like). The business has identified major attributes for each entity type (i.e. class of thing). And the business wants a business Data Vault that reflects their view of the world, even if some purchased software packages that implement emergency response management look nothing like their view. (I have been putting “business” in italics to emphasize this is all about the business, not technology, nor the views of source systems.)
One design option might be to define a hub called Emergency Event, and implement at that level, but supertype hubs are not generally recommended. At the other extreme, we could implement a hub for atomic subtypes (forest fire, grass fire, river flood, and tsunami). Or, heaven help us, we could have a hub for every entity type (class) across the entire hierarchy, with Data Vault “same-as links” between them.
So what do we do to resolve the best approach? We talk to the business. The example above has only three levels in the supertype/subtype hierarchy, but in the real world there may be more. If you listen to how people talk about the things in their business, don’t be surprised if they only sometimes talk at the highest levels of generalization or at the lowest levels of specialization, but more frequently chat about the middle layers. In our example, that translates to lots of folks chatting about fires and floods, but less frequently discussing emergency events at one extreme, or forest fires, grass fires, river floods, or tsunamis at the other extreme. Listen to the business, and try to find the “sweet spot” that seems to most commonly reflect their conversations, then nominate that as the default level for the design of our business Data Vault hubs.
The next step is to check against business processes. If the subtypes of wildfire in the enterprise data model (forest fire and grass fire) follow the same or similar processes, and use the same or similar data, then a Wildfire Hub is looking good. Note that the logical subtypes of the Wildfire Hub may require a “Wildfire Type Code” attribute to distinguish between forest fire and grass fire instances in the common hub. It is quite likely that this is really simple – source systems may well provide just what you need as a “type” attribute to be stored in their satellite.
Conversely, when we look at floods and their subtypes, if we discover that river floods and tsunamis are processed in a totally different manner, we might decide to have a River Flood Hub and a Tsunami Hub. It may not be black-&-white, but use the enterprise data model to guide the conversation.
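Suppose, purely for the sake of illustration, the conversation with the business ends up with one shared Wildfire Hub but separate River Flood and Tsunami hubs. The outcome could be written down as a simple routing table, as in the sketch below. The dictionary, the function name, and the row layout are my own assumptions, not a prescribed Data Vault structure.

```python
# Hypothetical outcome of the hub-level discussion: each atomic subtype in
# the enterprise data model is assigned to the hub chosen for it.
SUBTYPE_TO_HUB = {
    "Forest Fire": "Wildfire",     # shared hub: same or similar processes
    "Grass Fire": "Wildfire",
    "River Flood": "River Flood",  # separate hubs: processed very differently
    "Tsunami": "Tsunami",
}


def route_event(subtype: str, business_key: str) -> dict:
    """Return the hub an event belongs to, plus any type code for a satellite."""
    hub = SUBTYPE_TO_HUB[subtype]
    row = {"hub": hub, "business_key": business_key, "satellite": {}}
    if hub != subtype:
        # Several subtypes share this hub, so the distinction is kept as a
        # type code attribute stored in a satellite.
        row["satellite"][f"{hub} Type Code"] = subtype
    return row


forest_fire = route_event("Forest Fire", "EV-0001")
tsunami = route_event("Tsunami", "EV-0002")
```

Here the forest fire lands in the Wildfire hub carrying a "Wildfire Type Code", while the tsunami goes to its own hub and needs no type code.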
This top-down approach using the enterprise data model also defuses another topic that seems to generate lots of heated discussion. You may come across Data Vault hub design guidelines that state all sources sharing one hub must have the same “semantics and grain.” Sounds simple, but I have seen debates on hub design go in circles in meeting after meeting. Does this source system’s “client” mean the same thing (semantics) and have the same grain as this second system’s “customer?”
Why these lengthy debates? I suggest that doing Data Vault design bottom up, starting with source systems without the context provided by the enterprise data model, is at the heart of the problem. You’re trying to derive enterprise meaning on the fly as you pick up one source system after another source system.
Conversely, if you’ve already built an enterprise data model, it turns out these “semantics and grain” issues are delightfully simple to resolve. Here’s how.
We start with “semantics” – i.e. the meaning behind our model. Instead of arguing about one system’s “client” and another’s “customer,” you can go to the enterprise data model. You’ve already had many conversations with the business. Every entity type in the model should be fully described in the documentation: what it is, what it is not, and some examples.
You know what each entity type means. Now all you have to do is map source system data to the existing entity type that matches. Of course, this mapping may uncover some holes in the top-down model, but that’s just part of proving any top-down model by cross-checking against bottom-up details.
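The mapping exercise can be as plain as a lookup table that is reviewed with the business. The sketch below assumes some made-up source systems and table names ("CRM", "Billing", "guarantor" and so on). Anything that fails to map is reported rather than silently dropped, because those are the candidate holes in the top-down model.

```python
# Hypothetical mapping of (source system, source entity) to the entity type
# in the enterprise data model that it has been agreed to represent.
SOURCE_TO_ENTITY_TYPE = {
    ("CRM", "client"): "Customer",
    ("Billing", "customer"): "Customer",
    ("Dispatch", "incident"): "Emergency Event",
}


def map_feeds(feeds):
    """Group source feeds by enterprise entity type and list the unmapped ones."""
    mapped, unmapped = {}, []
    for feed in feeds:
        entity_type = SOURCE_TO_ENTITY_TYPE.get(feed)
        if entity_type is None:
            unmapped.append(feed)  # a possible hole in the top-down model
        else:
            mapped.setdefault(entity_type, []).append(feed)
    return mapped, unmapped


mapped, unmapped = map_feeds(
    [("CRM", "client"), ("Billing", "customer"), ("Billing", "guarantor")]
)
```

In this run both "client" and "customer" resolve to the one Customer entity type, and the "guarantor" feed is flagged for a conversation with the business.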
Grain should be relatively simple, too. Taking the example introduced above, emergency event is a coarse-grained supertype, with wildfire and flood as finer-grained subtypes.
The conclusion is simple – each entity type in the enterprise data model identifies one class of thing that has a consistent “semantics and grain.”
Of course, that may be how the business sees their world. For them to pick the “sweet spot” that seems to be the natural “hub” of attention (pun intended) is relatively easy. Taking the example above, what if the business likes the idea of a Wildfire Hub, but some source systems are more fine-grained, and others are more coarse-grained? In such a case, should we attempt some integration at the raw Data Vault level by having a common, shared Wildfire Hub, or leave it all to the business Data Vault?
Business Key Structures
One of the goals for Data Vault is to achieve integration of data around business keys. You will often find multiple sources offer data feeds for what has been determined should belong to the same hub – remember, a common “grain and semantics” – but business key structures can be problematic.
Same hub, same key value for same real-world object
Let’s assume we’ve got an Employee Hub with Employee Number as the business key, and you’ve got two sources – a Payroll System and a Human Resources (HR) system. If Employee Number 123 in both systems should refer to the same real-world person, we want:
- One hub instance for that employee
- Attribute values recorded as a data instance in a Payroll satellite, plus attribute values recorded as a data instance in an HR satellite, both hanging off the one, shared hub instance.
That’s called passive integration.
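A minimal sketch of passive integration follows. It is deliberately simplified: each satellite here keeps only the latest values per key, whereas a real satellite holds history over time, and the attribute names (salary_band, job_title) are invented for the example.

```python
hub_employee = set()   # one entry per Employee Number (the business key)
sat_payroll = {}       # Employee Number -> attributes from the Payroll feed
sat_hr = {}            # Employee Number -> attributes from the HR feed


def load(source_satellite: dict, employee_number: str, attributes: dict) -> None:
    """Load one source record: register the key, then store its attributes."""
    hub_employee.add(employee_number)   # no effect if the key already exists
    source_satellite[employee_number] = attributes


# Both systems present Employee Number 123 for the same real-world person.
load(sat_payroll, "123", {"salary_band": "B"})
load(sat_hr, "123", {"job_title": "Analyst"})
```

After both loads there is still exactly one hub instance for employee 123, with a Payroll satellite row and an HR satellite row hanging off it. No matching logic was needed, which is why the integration is called passive.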
Same hub, different key values (non-overlapping) for same real-world object
Maybe one of the systems might use integer identifiers (123), and the other includes some alphabetic characters (ABC) for the same real-world employee. If the Employee Number attribute is a text string, and the integer is converted to text, the two systems can still load data to one common hub, though as separate instances. Hopefully one of the source systems will inform the Data Vault that 123 and ABC represent the same employee, and the two instances can be associated using a “same-as-link.”
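As a rough sketch of that scenario, with the variable and function names being my own: the integer key is converted to text on the way in, both keys land in the same hub as separate instances, and the same-as-link is only populated once a source tells us the two keys are the same employee.

```python
hub_employee = set()   # Employee Number held as a text string
same_as_link = set()   # pairs of business keys known to be the same employee


def load_key(raw_key) -> str:
    """Convert a source key to text and register it in the shared hub."""
    key = str(raw_key)
    hub_employee.add(key)
    return key


integer_key = load_key(123)    # from the system that uses integers
text_key = load_key("ABC")     # from the system that uses alphabetic codes

# Assumed here: one of the source systems reports the equivalence.
same_as_link.add((integer_key, text_key))
```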
Same hub, different key values (potentially overlapping) for same real-world object
What if the HR system and the Payroll system both use integers, but they generate their own independent codes instead of sharing a common code set? It may be possible for HR employee 123 to represent Alex Anderson, and Payroll employee 123 to represent Brooke Brown. A new column needs to be added to the business key to act as a discriminator to identify the source of the code set. We might have Employee Numbers “HR/123” for Alex and “PAY/123” for Brooke. We can again use same-as-links to associate Alex’s HR record “HR/123” and Alex’s payroll record (say) “PAY/555.”
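The discriminator idea might be sketched like this. The "HR" and "PAY" prefixes and the "/" separator come from the example above; the function name and the use of Python sets are just my illustration.

```python
def qualified_key(source_code: str, local_key) -> str:
    """Prefix a source-local key with a discriminator for its code set."""
    return f"{source_code}/{local_key}"


hub_employee = set()
hub_employee.add(qualified_key("HR", 123))    # Alex Anderson in HR
hub_employee.add(qualified_key("PAY", 123))   # Brooke Brown in Payroll
hub_employee.add(qualified_key("PAY", 555))   # Alex Anderson in Payroll

# Alex's HR and Payroll records are associated with a same-as-link.
same_as_link = {(qualified_key("HR", 123), qualified_key("PAY", 555))}
```

Without the discriminator, the two different people numbered 123 would have collapsed into a single hub instance.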
Same hub, different multi-part business key values
What do we do if we add another source system for the employee hub that identifies employees by Department Number plus Surname plus Date Of Birth? The entity type itself still presents the same “grain and semantics” even if the business key has what some confusingly call a different grain, i.e. a different number of columns.
Some practitioners avoid this problem by designing different hubs for the raw data, getting close to the ill-advised source Data Vault construct – one hub for each source. But this approach doesn’t really solve things – it just postpones it to being someone else’s problem when the business insists that you must have one business Data Vault hub. So if it has to be faced sooner-or-later, I suggest it should be faced sooner.
Some people suggest we have one business key column, with all the multiple parts concatenated into a single text string, with appropriate separators. An issue with this solution is that if a consumer wants to analyze the data using discrete parts of the business key, a mechanism is required to separate the parts. Others suggest that the parts be kept in separate generic columns (Employee Number Part 1, Employee Number Part 2 …), but then you need a layer to rename these generic columns back into business terms. Yet another approach some suggest is to have a hub with a single-column business key with concatenated parts, supplemented by separate columns as attributes in satellites. My recommendation is that whatever approach you choose, please don’t compromise the business fidelity and have multiple hubs just to try to work around technical implementation issues. Stay true to the business!
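To illustrate the first option and the “mechanism to separate the parts” it calls for, here is a small sketch. The "|" separator, the function names, and the sample values are assumptions of mine. Note that the separator must never occur inside a key part, which the sketch checks explicitly.

```python
SEPARATOR = "|"   # assumed separator; it must not appear inside any key part


def concatenate_key(parts) -> str:
    """Join the parts of a multi-part business key into one text string."""
    texts = [str(part) for part in parts]
    for text in texts:
        if SEPARATOR in text:
            raise ValueError(f"key part {text!r} contains the separator")
    return SEPARATOR.join(texts)


def split_key(key: str, part_names) -> dict:
    """Recover the named parts of a concatenated business key."""
    values = key.split(SEPARATOR)
    if len(values) != len(part_names):
        raise ValueError("key does not have the expected number of parts")
    return dict(zip(part_names, values))


# Hypothetical employee identified by department, surname and date of birth.
key = concatenate_key(["D10", "Anderson", "1990-04-23"])
parts = split_key(key, ["Department Number", "Surname", "Date Of Birth"])
```

The third option described above amounts to storing the output of concatenate_key in the hub and the named parts in a satellite, so consumers never have to split the key themselves.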
In this segment of the ‘Applying top-down to Data Vault’ paper, we have looked at some reasons why a Data Vault design should be based around the enterprise view, why some people might ignore this sound advice, and finally we began to look at some tips-&-techniques for actually applying a top-down enterprise view to the design of a Data Vault, starting with Hubs.
The next segment of this paper continues the application of a top-down enterprise view, this time to Data Vault Links and Satellites, and then shares some concluding thoughts.
Many of the ideas presented in this paper were refined during one Data Vault project. I worked with many wonderful people at the site, but I wish to give special mention to Rob Barnard, Emma Farrow and Natalia Bulashenko. You were a delight to work with, and truly an inspiration.
I also wish to acknowledge the contribution of Roelant Vos, an internationally recognized Data Vault thought leader, who kindly provided much valued feedback across all parts of the trilogy.