Universal Data Vault: Case Study in Combining “Universal” Data Model Patterns with Data Vault Architecture – Part 1

Giles01There are plenty of reasons you might want to use proven “universal” data model patterns to improve the outcome of your enterprise initiative or your agile project – the patterns are proven, robust, efficient now, and flexible for the future.

There are also many reasons for considering the Data Vault (DV) architecture for your data warehouse or operational data integration initiative – auditability, scalability, resilience, and adaptability.

But if your enterprise wants the benefits of both, does their combination (as a “Universal Data Vault”) cancel out the advantages of each? A case study looks at why a business might want to try this approach, followed by a brief technical description of how this might be achieved.

Country Endeavours Pty Ltd

Creative solutions for Difficult Problems”

Why Would Anyone Want a “Universal Data Vault”?

Patterns at Work

giles03The idea of using proven data model patterns to deliver business benefits isn’t new. In his 1995 book, Data Model Patterns: Conventions of Thought, David Hay stated, “Using simpler and more generic models, we will find they stand the test of time better, are cheaper to implement and maintain, and often cater to changes in the business not known about initially.

That’s the theory, and I’m pleased to say it works. I smile when I remember one interview for a consulting job. The interviewer reminded me that another telecommunications company we both knew had spent years developing an enterprise logical data model (ELDM). He then threw me a challenge, saying, “John, I want you to create the same sort of deliverable, on your own, in two weeks – we’ve got three projects about to kick off, and we need an ELDM to lay the foundations for data integration across all three.”

I cut a deal. I said that if I could use the telecommunications industry data model pattern from one of Len Silverston’s data model patterns books, modified minimally based on time-boxed interviews with the company’s best and brightest, I could deliver a fit-for-purpose model that could be extended and refined over a few additional weeks. And I did.

I’ve done this sort of thing many times now. I commonly use “patterns”, but it’s the way they’re applied that varies. Sometimes an ELDM is used to shape an Enterprise Data Warehouse (EDW). Other times it is used to define the XML schema for an enterprise service bus (ESB). Or maybe mold a master data management (MDM) strategy, or define the vocabulary for a business rules engine, or provide a benchmark for evaluation of the data architecture of a commercial-off-the-shelf package…

There’s a key message here. If you use a pattern-based ELDM for any of these initiatives, the job of information integration across them is much easier, as they are founded on a common information model. This is exactly what an IT manager at another of my clients wanted. But his challenges were daunting.

Some of you will recognize the data administration headaches when one company acquires another, and the two IT departments have to merge. It can be painful. Now try to imagine the following. As of June 30th, 2010, there were 83 autonomous, but related, organizations across Australia, each with something like 5 fundamental IT systems, and each of these systems had maybe 20 central entities. If they were all in relational databases, that might add up to something like 8,000 core tables. Then on July 1st, under orders of our national government, they become one organization. How do you merge 8,000 tables?

They had some warning that the organizational merger was going to happen, and they did amazingly well, given the complexity, but after the dust settled, they wanted to clean up one or two (or more!) loose ends.

Firstly, I created an ELDM (pattern-based, of course) as the foundation for data integration. It’s a bit of a story in its own right, but I ran a one-day “patterns” course for a combined team of business representatives and IT specialists, and the very next day, I facilitated their creation of a framework for an ELDM. It did take a few weeks for me to flesh out the technical details, but the important outcome was business ownership of the model. It reflected their concepts, albeit expressed as specializations of generic patterns. This agreed expression of core business concepts was vital, as we will shortly see.

A Data Vault Solution Looks Promisinggiles04

The second major activity was the creation of an IT strategy that considered Master Data Management, Reference Data Management, an Enterprise Service Bus and, of course, Enterprise Data Warehouse components. The major issue identified was the need to consolidate the thousands of entities of historical data, sourced from the 83 disparate organizations. Further, the resulting data store needed to not only support Business Intelligence (BI) reporting, but also operational queries. If a “client” of this new, nationalized organization had a history recorded in several of the 83 legacy organizations, one consistent “single-view” was required to support day-to-day operations. An EDW with a services layer on top was nominated as part of the solution.

The organzation had started out with a few Kimball-like dimensional databases. They worked well for a while, but then some cross-domain issues started to appear, and work commenced on an Inmon-like EDW. For the strategic extension to the EDW, I suggested a Data Vault (DV) approach. This article isn’t intended to provide an authoritative description of DV – there’s a number of great books on the topic (with a new reference book by Dan Linstedt hopefully on its way soon), and the TDAN.com website has lots of great material. It is sufficient to say that the some of the arguments for DV were very attractive. This organization needed:

  • Auditability: If the DV was to contain data from hundreds of IT systems across the 83 legacy organizations, it was vital that the source of the EDW’s content was known.

  • Flexibility/Adaptability to change: The EDW had to be sufficiently flexible to accommodate the multiple data structures of its predecessor organizations. Further, even during the first few years of operation, this new organization kept having its scope of responsibility extended, resulting in even more demand for change.

I presented a DV design, and I was caught a bit off guard by a comment from one of the senior BI developers who said, “John, what’s the catch?” He then reminded me of the saying that if something looks too good to be true, it probably is! He was skeptical of the DV claims. Later, he was delighted to find DV really did live up to its promises.

But there was one more major challenge.

A Marriage of Convenience? The Generalization/Specialization Dilemmagiles05

Data model patterns are, by their very nature, generic. Len Silverston produced a series of 3 books on data model patterns (with Paul Agnew joining him for the third). Each volume has the theme of “universal” patterns in their titles. Len argues that you can walk into any organization in the world, and there is a reasonable chance they will have at least some of the common building blocks such as product, employee, customer, and the like. Because the patterns are deliberately generalized, their core constructs can often be applied, with minor tuning, to many situations. In Volume 3, Len & Paul note that data model patterns can vary from more specialized through to more generalized forms of any one pattern, but that while the more specialized forms can be helpful for communication with non-technical people, the more generalized forms are typically what is implemented in IT solutions. The bottom line? We can assume pattern implementation is quite generalized.

Conversely, DV practitioners often recommend that the DV Hubs, by default, avoid higher levels of generalization. There are many good reasons for this position. When you talk to people in an enterprise, you will likely discover that their choice of words to describe the business appears on a continuum, from highly generalized to very specialized. As an example, I did some work many years ago with an organization responsible for emergency response to wildfires, floods, and other natural disasters. Some spoke sweepingly of “deployable resources” – a very generalized concept. Others were a bit more specific, talking of fire trucks and aircraft, for example. Yet others spoke of “water tankers” (heavy fire trucks) and “slip-ons” (portable pump units temporarily attached to a lighter 4WD vehicle) rather than fire trucks. Likewise, some spoke of “fixed-wing planes” and “helicopters” rather than just referring to aircraft.

In practice, if much of the time an enterprise uses language that reflects just one point on the continuum, life’s easier; the DV implementation can pick this sweet spot, and implement these relatively specialized business concepts as Hubs. This may be your reality, too – if so, you can be thankful. But within my recent client’s enterprise involving 83 legacy organizations, there was no single “sweet spot”; like with the wildfire example, different groups wanted data represented at their sweet spot!

Here was the challenge. By default, data model patterns, as implemented in IT solutions are generalized, while DV Hub designs are, by default, likely to represent a more specialized view of the business world. Yet my client mandated that:

  1. The DV implementation, while initially scoped for a small cross-section of the enterprise, had to have the flexibility to progressively accommodate back loading the history from all 83 legacy organizations. As a principle, any DV implementation should seek to create Hubs that are not simplistic, mindless transformations from the source system’s tables. Rather, each Hub should represent a business concept, and any one Hub may be sourced from many tables across many systems. So even though the combined source systems contained an estimated 8,000 tables, we did not expect we would need 8,000 Hubs. Nonetheless, if we went for Hubs representing somewhat specialized business concepts, we feared we might still end up with hundreds of Hubs. There was a need for a concise, elegant, but flexible data structure capable of accepting back loading of history from disparate systems. I met this need by designing a DV solution where the Hubs were based on “universal” data model patterns.

  2. The DV implementation of the new EDW had to align with other enterprise initiatives (the enterprise service bus and its XML schema, MDM and RDM strategies, and an intended business rules engine). All of these, including the DV EDW, were mandated to be consistent with the one generalized, pattern-based ELDM. Interestingly, the business had shaped the original ELDM framework, and it was quite generalized, but with a continuum of specialization as well. I needed to create a DV solution that could reflect not just more generalized entities, but this continuum.

I called the resultant solution a “Universal Data Vault”. This name reflects my gratitude to the data model pattern authors such as Len Silverston (with his series on “universal” data model patterns) and David Hay (who started the data model patterns movement, and whose first book had a cheeky “universal data model” based on thing and thing type!), and Dan Linstedt’s Data Vault architecture.

We have now considered why I combined generalized data model patterns with Data Vault to gain significant business benefits. Next, we look at how we design and build a “Universal Data Vault”, from a more technical data model perspective.

Introducing the “Universal Data Vault” (UDV)

Foundations: A Brief Review of Data Vault – Hubs, Links and Satellitesgiles06

If we go back to the emergency response example cited earlier, and prepare a data model for one tiny part of their operational systems, we might end up with a data model something like the following:giles07Here we see aircraft (helicopters or a fixed wing planes) being assigned to planned schedules-of-action for emergency responses (fires, floods, earthquakes, etc.). This model reflects an operational point-in-time view: at any one point in time, an aircraft can be assigned to only one emergency. Of course, over time, it can be assigned to many emergencies. If we now design a database to hold the history, and if we choose to use Data Vault (DV) architecture, we could end up with something like this:

giles08

Click image to enlarge in another tab

Without going into details of the transformation, we can note:

  • Each business concept (an aircraft, a schedule) becomes a Hub, with a business key (e.g. the aircraft’s registration number), plus a surrogate key that is totally internal to the DV. Each Hub also records the date & time when the DV first discovered the business instance (e.g. an aircraft registered as ABC-123), and the operational system that sourced the data.

  • Each relationship between business concepts (e.g. the assignment of an aircraft to an emergency) becomes a Link, with foreign keys pointing to the participating Hubs. Like a Hub, the Link also has its own internal surrogate key, a date & time noting when the relationship was first visible in the DV, and the source of this data. Note that while an operational system may restrict a relationship to being one-to-many, the Links allow many-to-many. In this case, this is essential to be able to record the history of aircraft assignment to many emergencies over time.

  • Hubs (and Links – though not shown here) can have Satellites to record a snapshot of data values at a point in time. For example, the emergency schedule Hub is likely to have changes over time to its severity classification, and likewise, its responsible incident controller.

Fundamentals of the UDV “How-to” Solution

giles09As noted in the previous part of this paper, people may challenge the level of generalization/specialization of the business concepts that become Hubs. While some may be comfortable with the business concept of an “aircraft”, others may want separate, specialized Hubs for Helicopters and Fixed Wing Planes, and yet others may want to go the other way – to combine aircraft with fire trucks and have a generalized Hub for Deployable Resources.

One solution is to have a Hub for the generalized concept of a Deployable Resource, as well has Hubs for Aircraft (and Fixed Wing Plane and Helicopter), Fire Truck (and Water Tanker and Slip-On), Plant (Generator, Water Pump) and so on. The inheritance hierarchies between the generalized and specialized Hubs can be recorded using “Same-As” Links. This approach is most definitely something you should consider, and is explained in a number of publications on Data Vault. However, for this particular client, there was a concern that this approach may have resulted in hundreds of Hubs.

I created the “Universal Data Vault” (UDV) approach as an alternative. A sample UDV design follows. At first glance, it may appear to be a bit more complicated than the previous diagram. However, the generalized Asset Hub not only represents the incorporation of Aircraft, but additionally it can accommodate a large number of specialized resource Hubs for Fire Trucks, Plant & Equipment and so on. Likewise, the generalized Activity Hub not only handles the Emergency Response Schedule Hub, but also other specialized Hubs such as activities for preventative planned burns.

Click to enlarge in another tab

Click image to enlarge in another tab

In the next part of this paper, we will look in more detail at the individual UDV components.

Share

submit to reddit

About John Giles

John Giles is an independent consultant, focusing on information architecture, but with a passion for seeing ideas taken to fruition. He has worked in IT since the late 1960s, across many industries. He is a Fellow in the Australian Computer Society, and completed a Master’s degree at RMIT University, with a minor thesis comparing computational rules implementations using traditional and object-oriented platforms. He is the author of “The Nimble Elephant: Agile Delivery of Data Models Using a Pattern-based Approach”.

  • Joel Mamedov

    Thanks John for an article.
    Before, commenting on substance of the article I would like to point out on entity relationship error on diagram. I think first diagram a relationship one to many should be flipped. Aircraft would be on “one” side instead. I will come back to continue my comment after your response. Thank You.

    • John Giles

      G’day Joel,
      Thanks for posting your comments. I believe that I understand the point you are making, that one Aircraft can (over time) be assigned to several Emergency Response Schedules (plans for responding to wildfires), and this is true. Additionally, one Emergency Response Schedule can have several Aircraft assigned to it (even at the same time). If we model this scenario, we could expect a many-to-many relationship.
      However, the description below the diagram notes that the model is intended to reflect an operational view at a single point in time. The reality of the actual real-world emergency response organization is that an Aircraft is assigned to one “Incident Controller” who “owns” the aircraft exclusively for a period of time i.e. an Aircraft can only be assigned to one Schedule at a time.
      Hope that helps, Joel.
      Regards, John

      • Joel Mamedov

        If I understand your explanation correctly and “point in time” condition then it sounds to me it is one to one relationship with transferability (transferable relationship) and Schedule would be a required instance which is an optional on depicted diagram.
        But,I digress. Maybe choice of example can be made better to focus on main subject.

        • John Giles

          G’day again, Joel,
          A quick comment on”point in time” data. While it is central to data warehousing to hold history, operational systems often (though not always) hold “current state” data. For example, a payroll system might hold only the current name and current department of an employee (but it might hold current and historic pay rates). Taking this payroll example a little further, the department information may be held as a foreign key in the Employee record. As you note, it is a transferable relationship i.e. the employee may be assigned to another department. But being a transferable relationship does not make it mandatory if the business allows some employees to be “unassigned” until they are allocated a new department. And it would typically be a one-to-many in that one Department may have many Employees.
          Similarly, my intention in the sample ERD is to suggest that an Aircraft can be transferred to a different Schedule, but it is an optional relationship in that Aircraft may be unassigned for a time. And as for Departments and Employees, each Schedule item may have many Aircraft assigned to it.
          In working through my response, I did note that I have made an error in the foreign key within the Aircraft table. I nominated the Maximum Cruise Distance instead of creating a new column named Emergency Reference Number. So thanks for challenging my ERD, Joel. I hope to be at Dan Linstedt’s WWDVC (World Wide Data Vault Consortium) next May (2016), so if you’re there, I owe you a coffee!
          As you note, maybe we are better off now focussing on the main topic? But thanks for the discussion.
          Regards,
          John.

          • Joel Mamedov

            Hello John.
            I do understand “point in time” concept. I was trying to
            underline an importance of condition that you have put forward.
            Back to the main topic.

            I usually do not post any comments on DV related sites or engage in any heated discussion about it. Simply because I do not like a concept of DV and I personally think that it is like digging a deep hole that sooner or later you have to stop digging it. This concept
            or methodology is not new (out-triggers, snapshot tables, audit logs and others
            ) are a similar concepts. DV is more formalized and hyped up in my view.
            But, I might be wrong and therefore I do not shy away from reading about it and trying to give the benefit of the doubt.
            I can be specific about my issues about DV but, that is not a subject of your posting.

            The most obvious effect of generalization is to reduce the
            number of entities which may or may not correspond to an actual reduced number of implementation tables. As you
            know there is no free lunch. By taking this implementation approach you inevitably loose semantic details, business rules that pre-generalize structures embody.
            In a way you are “kicking the can down” to ETL coders , procedural programmers that must deal with consequences of
            “pattern” usages in schema design. ( I mean a physical design ).

            The place where I work data architects implemented (not
            fully compliant) version of DV and they call it an “integration layer”. Obviously,I have not involved with this effort and strongly disagree with their definition of “data integration” .
            DV does not attempt to integrate anything. It is just a
            dumping ground for source data with limited historical data snapshot capabilities which constructed around “load date”. At most
            I characterize it as a source data staging layer.
            I do stress with my peers that you cannot avoid a real data integration (hard work) effort when you have multiple and inconsistent data sources that you are trying to consolidate under promise of “Enterprise Data Warehouse “. You are
            fooling yourself with a new toy that called DV.

            Now, I am trying to understand what we will be gaining by adapting your design approach.
            Here is my understating and will be glad to be corrected.

            1. Before loading source data (numerous tables )
            into abstracted or partially abstracted DV structures we need some mapping mechanism that will translate detailed table
            content from source into generalized (super and sub-types) structures. (The reason I am calling it “structure” is that one needs to figure out which source table will be constructed into Hub,
            Link or Satellite and that might not be an easy task.)

            2. When that is done then you will lose business
            rules and possible semantics of source
            data. That knowledge must be documented somewhere to go to a step 3.

            3. Now you need to extract data from DV into your
            application which in most cases would be some sort of analytical or reporting application/ database.

            4. Reporting and analytical application cannot work
            with abstracted schema (based on my experience at least). Therefore, it needs to be translated into integrated/normalized schema or highly de-normalized (star ,snowflake ) schema. So, we are talking about an additional need for mapping mechanism
            to de-contract generalized DV .

            5. More likely this entire process will be repeated
            with some kind of automation. All those translation layers are a potential source of errors that might be based on design flaws (generalizing and specializing ) or source data issues or both.

            I will stop here and take a breath before going further with my comments.

            Thank You.

          • John Giles

            G’day Joel,

            I am concerned that much dialogue has occurred that is off
            the topic of Universal Data Vault (UDV). Your earlier posts spent time debating the merits of the base operational model I used to compare against, and a significant portion of the most recent post expresses your misgivings about DV in general. I disagree with some of your views on DV, especially in relation to DV being little or nothing more than a source data staging layer, but I do not wish to use this forum to pursue that debate. My concern is that the foundation upon which you are basing an evaluation of the merits of extending DV to create a UDV is a different foundation to mine. Dan Linstedt has recently published an
            excellent book on DV 2.0 which I highly recommend, and which I believe will lessen if not remove misunderstandings about DV.

            I also suggest that you reserve your judgement of UDV until
            you have read the follow-up article of which the above is only Part 1. In Part 2, I address the dangers of seeing generalisation as an approach to be used in all cases. Instead, it is a case study where I am sharing the actual approach taken for a specific real-world situation that is almost guaranteed to be different in most aspects to the situation you find yourself in, but hopefully may
            present an alternative worth considering if similar challenges are encountered.

            Regards,

            John

          • Joel Mamedov

            Hello John.
            There is nothing to concern about it. A scrutiny is part of our job. If you put something out to public then nothing off the topic including examples you have provided. If you think people should reserve judgment then maybe it would be better to disable the comment until you completed your posting. I comment what I read so far.
            Thank you for posting and good luck!

  • Magnus Borg

    Hi John, interesting and well written article! I look forward to read part 2..
    Regards
    /Magnus

    • John Giles

      Thanks for the encouragement Magnus. It is very much appreciated. Regards, John.

  • glenncogar

    Great idea and article John :) would love to catch up and discuss Are you in Sydney before Xmas/NY or in the new year?

    • John Giles

      G’day Glen, Sorry, but no plans for Sydney at this stage. Maybe see you at Dan Linstedt’s Data Vault conference in May? (But sad if we both have to travel half-way around the world to meet.) Regards, John.

      • glenncogar

        Thanks for your reply John, good to hear from you again :) I am not sure yet if I’ll get to the DV conference in May 2016, I will let you know. What’s the best way to get back in contact? My email address is glenn.cogar at gmail dot com Please feel free to send me an email :)

  • Kent

    Interesting article. Would definitely need more information on DV. I am still trying to cross the data lake, empty the data reservoir etc. and now there’s DV.
    Where do I start? How does one merge these technologies if you are already in the middle of a warehouse effort?
    Can you easily move from a data lake/warehouse to a DV?

    • John Giles

      G’day Kent. First let me give you a massive apology for the delayed response – I had thought I would get emails on updates to this posting, but maybe they either didn’t get sent, or perhaps they got lost at my end. Either way, I am sincerely sorry.

      Yes, there are lots of dimensions to data warehousing (pun intended). The Data Vault (DV) initiative is seen by some as just another way to do data warehousing, but it’s interesting that Bill Inmon has given his thumbs-up to Data Vault. And he has also published on data lakes – his title is thought-provoking – “Data Lake Architecture: Designing the Data Lake and Avoiding the Garbage Dump”.

      But back to your questions. In my opinion, the DV architecture / methodology happily co-exist and even integrate with other initiatives such as data lakes (for example, it is designed to accommodate “big data” such as Hadoop), and it is also intended to complement / supplement consumption via traditional dimensional data marts. I would recommend you read Dan Linstedt’s “Building a scalable data warehouse with Data Vault 2.0”. Even if you start by reading Chapter 2 (architecture), I think it will answer many of your questions. And by the way, I see DV as being much more than a “better data warehouse” – Dan’s book will help.

      Another option might be to come to Dan’s WWDVC conference next week – I’m an Aussie, but I will be there and would love to buy you a coffee. Alternatively, I will be back in the USA in October at the Data Modeling Zone and we could meet there (assuming you are from the USA – a big assumption). I will be presenting an introduction to DV.

      • Kent

        Thanks, John I would love this to be at one of these events but my schedule Won’t permit at this time. I will keep looking out for future events then maybe someday you can get me my free coffee.

        Thanks for your detailed answer it was worth the the wait.
        Steven

  • John O’

    John – Would you entertain the idea of extending the concept of universal data models to an even larger universal pattern of communication?

    In other words, if we accept the premise that ‘fit-for-purpose’ data models can be constructed using (to start) universal data model patterns, then is it much of a stretch to say that *all* information falls into universal patterns of classification, identification and association with which, as Len Silverston says: You can walk into any organization in the world and immediately recognize – not just data modelling patterns – but ones which run the gamut from terminology, to structured data, to documents and analytics of all kinds.

    I enjoyed your article on several levels, especially the sub-text that strongly suggests that the complexity most analysts are ‘seeing’ is largely the result of being too close to the trees. Once you step back – maybe a better word might be “above” the apparent chaos all is revealed.

    • John Giles

      G’day John – As in my posting to Kent below, I am enormously embarrassed by the lateness of my reply. Please accept my sincere apology.

      Sounds like you are a fan of data modelling patterns. I think we all owe a debt of gratitude to the authors (David Hay, Len Silverston, and more). You may be aware that many see the patterns movement as having been started by Christopher Alexander who published on patterns for homes, towns and the like – an architect in the traditional sense. Since then, there have been many others who have added their own contributions, including design patterns for object oriented programmers (Gamma, Helm, Johnson & Vlissides), analysis patterns (Martin Fowler), and many more. One of my current interests is in Data Vault, and it is solidly founded on patterns.

      All of that to say that yes, I think you are onto something when you talk of using patterns in communication. I am assuming you are talking of “communication” in the IT sense. One of my views is that if you have, for example, an enterprise-wide data model, it can not only contribute to the shaping of a data warehouse initiative, but also to data structures in a services layer (e.g. XML or JSON), can assist with Master Data Management and Reference Data Management initiatives, provide a foundation for business rules glossaries, and so on.

      Hope this belated answer help. Regards, John.

Top