Welcome to the wrap-up column in our series on applying Agile techniques to data projects in our Tales and Tips from the Trenches Column. The column’s goal is to share insights gained from experiences in the field through case studies.
The last column in this series explored one way to perform data warehouse modeling in an Agile fashion, using the Data Vault methodology, and mentioned another way to perform data warehouse modeling using the opposite approach, promising to cover it next.
We will touch upon it briefly, introduce another related approach, and conclude the series by summarizing the past topics and setting the stage for the next series of articles that illustrate a practical use of one specific Agile technique.
“Hyper-Normal” Data Warehousing Recap
In the last column, the Data Vault technique was explained as a “Hyper-Normal” approach. The technique gets its power in modeling by using a hub-spoke-link architecture: the main business table is represented as a Hub, and many attribute tables (called Satellites) can be created, related to the main hub. Hubs can be connected to other hubs via Links. The power of this approach is that new tables can be added at any time, with little to no impact on existing tables; new attributes are added by creating one or more new Satellites. Inherent in the Hub is a time element, which forms part of the key. The temporal history of the enterprise is thus captured using this approach. History can easily be reconstructed.
The downside of this approach is clear: Tables proliferate at a fast rate. The relational model is strained due to many joins that are required even for simple queries. “Helper” or summary tables are a necessity to obtain reasonable performance.
The Opposite Approach
Instead of creating many tables and over-normalizing, the opposite thought process can be used, that of over-generalizing. We discussed Generic Modeling in Column 3 of the series; the basic notion can be extended to data warehousing. The core idea is to model the business, and business facts and dimensions can be derived from the business model. Think of it as Business “topics” which would become Dimensions, and things with a temporal component (such as transactions) which link business topics together would become “Facts.”
The approach is very metadata-driven; the Thing Type entity would contain information about the business topics such as “Product” and “Customer” whereas the Thing entity would be the instances themselves, such as “Car Part 1234” and “Mary Smith.” The main components are the Model (the Type entity) and the Data (the Thing entity); as in the Data Vault, link entities are required to tie them together, along with an Effectivity entity linking Thing Types to Thing in a point in time. This provides a temporal data warehouse like the Data Vault, but in a different way: there are very few actual tables, only generic structures. See Figure 1, below. The Data Vault creates a new Hub for every business concept. The Hyper-General model contains only the basic six tables, as shown in the figure.
This data structure contains the business model. The intention is that this business model would form the main online transaction processing (OLTP) model for source applications. The data structures surrounding “Thing” and “Thing Type” can then be used to generate a dimensional model, in a similar fashion like the Data Vault. The transformation is reasonably intuitive. This is where the technique derives its power.
The problem with this approach is that it is not easy to build it on your own; it necessitates a tool. The creation of the business model can be tricky. There was at one point such a tool: Kalido. The company that produced Kalido merged with another company to form a third company called Magnitude. Kalido’s underlying technology was used to create a Master Data Management (MDM) product, but it is unclear from their website if the original Business Model product which transforms into a Data Warehouse model still exists. It was a very clever approach, and I’m sorry to see it disappear. I worked on a project that implemented it several years back. The problem with the approach involves maintaining the business model over time. The business model nicely transforms into Facts and Dimensions, but we all know that business changes. This approach may be Agile in the transformation to the Star Schema, but it doesn’t solve the main problem of maintaining business changes. The main data structures don’t change, but business functionality does. So perhaps it wasn’t very Agile after all!
Another Agile Data Warehousing Technique: Anchor Modeling
We cannot leave the subject of Agile Data Warehousing without briefly highlighting Anchor Modeling. This modeling approach is very similar to the Data Vault. It is based on 6th Normal Form, and Data Vault is loosely based on this as well. It emulates a temporal database. There are open source tools available and tutorials on the Anchor Modeling website. It has its origin and is maintained by Lars Ronnback and a team from Stockholm University. It deserves mention, but I have no personal experience with this technique. There are articles on their website which discuss the differences between Data Vault and Anchor Modeling, and I was able to find information concerning a debate that was held between the two creators of these approaches. I welcome any feedback that readers may have concerning this approach and its use in the field.
Another related topic involves what a Data Vault “uncle”, Hans Hultgren, dubbed “Ensemble Modeling”: it labels the group of modeling techniques that break up a model into those parts considered mutable and those that are not. Anchor and Data Vault are two examples, but there are others. The term “ensemble modeling” is unfortunately also used by the data analytics community, so don’t be confused when doing an internet search.
It is interesting that the topic of data change is receiving a lot of attention. Michael Stonebraker, the inventor of the Ingres database, is working with others at MIT concerning “database decay”, how databases don’t keep up with business changes over time and “become so decayed that they cannot be further modified”. They offer several solutions, one of which is to create schemas that are more resilient to change, which is essentially using generic modeling or data model patterns. Another technique they suggest is to use a layer of abstraction between the database and the code such as messaging. This sounds like a great topic for a later article.
Time and Temporal Data in the Data Warehouse
Data warehouse data modelers have always struggled with how to properly model and expose temporal data. For example, what’s the best way to handle a product hierarchy that changes every so often? Product is an important, pivotal dimension in any sales-oriented data warehouse. I have always felt that the Kimball approach to “Slowly Changing Dimensions” (SCDs) solves the problem incompletely. There are many articles and books that discuss the “Type 1”, “Type 2”, and all the other Slowly Changing Dimension types; here’s an example of a basic tutorial on SCDs. The heart of the problem is trying to meet the need for both the historical source of record (since the beginning of the business) and satisfying the business need for focused analysis using only one modeling technique. The combination of Data Vault and Data marts in Data Virtualization will increase the agility and reduce the cost.
The Data Vault structure enables the storage of comprehensive history, and the data marts facilitate customization, focused on a specific business problem. Different data marts can exist for targeted business analysis. This hybrid approach of both a central, historical warehouse coupled with data marts for specific purposes accomplishes the goals of the data warehouse.
I have seen implementations of various approaches with respect to layer integration. Most common are batch-based ETL that transfers data from a Data Vault storage to a business warehouse. However, there are also approaches where the business warehouse is implemented with Data Vault or where all superior structures are created on-the-fly with data virtualization based on the Data Vault warehouse.
In today’s world, we have very sophisticated technology that enables current, timely data to enter the warehouse using techniques like Change Data Capture (CDC), in-memory databases and appliance technology, and add to these many “No SQL” approaches. You can get the data in, using these great technical solutions, but how do you accommodate it in your dimensions? It is important to understand the needs of the business; don’t over-engineer when a simple solution will do. Kimball had a nice way of representing the various latencies with his “Data Highway”; see Figure 2.
The Data Highway can be used to gauge which latency is appropriate for the business problem. Not all business problems need raw source or real-time cache access. The basic “tip from the trench” is make sure the solution fits the problem. As mentioned above, you can have focused data marts for specific time-based views. You don’t need to accommodate the product dimension for many years back in your targeted analyses; you can focus only on the pertinent requirements. This then frees up the granularity of the dimensional model, because you have a separate architecture such as the Data Vault to handle the historical record of the enterprise. This also alleviates the performance issues with the Data Vault because it is not used for the business analysis. It is the best of both worlds.
Conclusions from the Agile Data Series
Agile methodologies are used pervasively. There is wide recognition that change is the only constant, and our methods must adapt to the changing landscape of the way we do business. There is by necessity a balancing act between the specificity of our data models and their ability to adapt to change. The more specific our models are, the more expressive they are, the more easily understood they are, and it usually follows that if they are easy to understand, they will be easy to change. But we have seen that this is not often the case, due to the dependency of code. A model may be easy to change for a data modeler, but the change presents a major problem for the code that depends upon that model. This explains the Developer/Data Modeler Wall that typically exists between them, and it is also known as the Relational/Object Impedance— the fundamental paradigm mismatch. Data modelers use the relational model and developers use object orientation. These two approaches do not conform well. This disconnect is one of the major causes of code maintenance problems, as highlighted in our series. Relational databases are better for storing data than object-oriented databases, so they continue to be in use. Our series has explored various modeling techniques that can be used to express data in more agile ways.
One of the techniques we discussed in the series was the use of data model patterns, in Article 4. There are many resources available which provide common patterns that can be used to express many frequently occurring business needs. It is possible to start with a common pattern and tweak it as needed to fit the specific situation. Patterns can be easily reused, and due to their basic nature, can accommodate change easily. Patterns provide a jumpstart to modeling, can help you get up to speed faster, and can help in modeling a basic scenario when not much is known.
Another technique is Generic Modeling. Article 3 in our series provided examples of how a very generic approach can be used for ambiguous business problems. Again, the trade-off between specificity in the model and understandability is an important consideration. The model may be very generic, enabling just about any situation to be modeled; but even though the model does not have to be changed, it requires the data to understand it, because the model itself does not inherently represent the business.
Article 2 mentioned that a good technique when using an Agile methodology is for the data modeler to remain at least one sprint ahead of development. Another method which helps with module dependencies is the Scaled Agile Framework or SAFe. This method builds in Systems Thinking (Principle #2) and synchronization with cross-domain planning (Principle #7) which helps to deal with dependencies. In my own experience, the inability to deal properly and effectively with module dependencies has been a major failure of the Agile approach, and it affects the data modeler greatly. Therefore, the SAFe method works to mitigate this common problem.
There are many tools that the data modeler can have in his or her toolbox to help deal with both changing business requirements and working within Agile methods. It is possible to shift as needed between techniques as situations warrant. After all, we can be Agile in our methods too!
New Series Coming Up: Data Transformation
In our new series, we will introduce a technique a colleague of mine has used very successfully to apply generic data modeling to a very complex, difficult problem: Data Transformation. As we all know, the need for data transformation is everywhere, and in every situation the requirements are unique. Data transformation is not limited to Extract, Transform and Load (ETL), and data warehousing. We will explore data transformation and its many forms, and how to tackle it universally using an adaptable, home-grown, generic approach.
 Adapted from Hughes, Ralph. “Hyper-Modeled Databases: A Crucial Technique for Truly Agile Data Warehousing”. Enterprise Data World, 2015.
 Stonebraker, Michael et al. “Database Decay and What To Do About it”. Communications of the ACM. https://cacm.acm.org/blogs/blog-cacm/208958-database-decay-and-what-to-do-about-it/fulltext
 Kimball, Ralph. “Newly Emerging Best Practices for Big Data”, http://www.kimballgroup.com/wp-content/uploads/2012/09/Newly-Emerging-Best-Practices-for-Big-Data1.pdf Page 4.