In Part 1, I showed that SCDs inadequately live up to their stated purpose. They are intended to provide history for dimensions, but the history they provide is incomplete. They support as-was reporting in one sense, but not in the two senses that provide essential information to the business.
Here in Part 2, I will show that, even if type 2 SCDs did provide an adequate history of dimensional change, the new SCD types defined in the 2013 edition of the Data Warehouse Toolkit involve hard and soft costs that outweigh any savings obtained by using less storage than type 2 SCDs.
The conclusion is that the very valuable fact-dimension pattern should be altered. SCDs should be abandoned. Bi-temporal dimensions should take their place.
Basic and Advanced SCDs
In the 2013 edition of their data warehouse toolkit, Ralph Kimball and Margy Ross introduced several new types of slowly changing dimensions. However, none of these new “advanced” types of SCDs capture any information that the basic type 2 SCD cannot capture. Moreover, none of them can be reasonably claimed to be simpler than type 2 SCDs, in any total-cost-measured sense of “simpler”.
So the only justification for these non-type 2 SCDs is to store dimensional information in less space than type 2 SCDs would otherwise require, and thus to reduce storage costs. But with the rapidly declining cost of storage, at all technologies throughout the storage hierarchy, extending from solid-state devices, to spinning disks, to off-line storage, this justification apparently no longer exists, even for the largest of real-world dimensions.
Advanced SCDs cost more than type 2 SCDs for two reasons The first involves the incremental human resources costs of modeling, designing, coding, testing, implementing, documenting, and maintaining advanced SCDs, over the comparable costs for type 2 SCDs. Since storage costs are rapidly declining, while IT personnel costs are not, these personnel costs will almost always outweigh any possible savings in storage costs, no matter how enormous the dimensions involved, and no matter how frequently rows of those dimensions are updated. If any “monster dimensions” for which this is not yet obvious still tremain, cost trends will soon rule out advanced SCDs even for them.
If measuring the total cost of ownership of an enterprise’s software assets is to be anything more than an empty gesture, then the incremental personnel costs of implementing SCD types, and often of replacing those implementations when additional history is required, cannot be ignored.
The second way in which advanced SCDs cost more than type 2 SCDs is that the physically complex data structures of advanced SCDs effectively isolate them from direct querying by business users. Those business users can reasonably be expected to know the names of individual dimension tables and their attributes, but cannot reasonably be expected to understand the vertical partitioning of dimension tables that results in “mini-dimensions”, nor the added complexities of outriggers, additional history columns, or any of the other variations that distinguish SCDs types 3, 4, 5 and 6 from one another.
Following is the demonstration of these claims.
Types of SCDs
Kimball and Ross describe eight methods for “dealing with slowly changing attribute dimensions” (p.53).
The type 0 method is to leave the attributes unchanged. This will lose no information as long as the attributes are for unchanging properties, such as an employee’s date of birth. For these attributes, historical data is not an issue.
The type 1 method is to overwrite all changes to dimension attributes. For tables for which historical data is not required, this is the typical way updates are carried out. Here, again, historical data is not an issue.
The type 2 method is to create a new row every time there is a change to any attribute in a dimension. The after-update row is then distinguished from the before-update row by means of a date or timestamp on one or both rows. This method uses the first of Kimball and Ross’ SCD types that addresses the issue of keeping track of historical data.
The type 3 method is to add a new column for each attribute for which we want to keep both the current value and the previous value (or perhaps the current value and the original value, but, in any case, only one value other than the current value). This will normally result in using less storage than a type 2 SCD will use, for the same amount of information, and indeed there appears to be no other reason to choose a type 3 over a type 2.
The type 4 method is to vertically partition a dimension into a “base dimension,” for which updates will be handled as overwrites, and a separate “mini-dimension,” for which updates will be handled as a type 2 SCD. This means that the technique of making a full copy of a row and applying an update to the new copy will be applied to rows with fewer columns, and will thus use less storage than a type 2 SCD for which each update requires a copy of the entire row.
The type 5 method also splits a dimension into a base dimension table and a mini-dimension table. But in each base dimension row, it adds a foreign key to the row in the base dimension to point to the row in the mini-dimension, which contains the most current values for those mini-dimension attributes. Kimball and Ross call this method “add(ing) mini-dimension and type 1 outrigger”.
The type 6 method is to include all mini-dimension columns in the base table, and to overwrite the values for those columns, in the base table, every time there is an update. The mini-dimension then keeps the history of changes to the values in those columns, and the base dimension keeps the current values for those columns. This eliminates a join from the base dimension to the mini-dimension which would otherwise be required to retrieve the current values for all columns in the complete dimension.
The type 7 method combines a type 1 SCD that records current data, and a type 2 SCD that records a history of changes.
Other combinations of SCD types are also possible. For example, combining a type 0 and a type 2 would record original data and a history of changes. And, of course, a combination of types 0, 1 and 2 might also be useful.
Rows and Columns
All history-preserving SCD types use one or both of two basic techniques. The first technique is to copy a row. A variation on this technique is to vertically partition a table into one set of columns for which only current and/or original values are needed, and a second set of columns for which history must be tracked. We will look more closely at this technique when we consider an example of using a Kimball and Ross mini-dimension.
The second technique is to copy a column, which corresponds to Kimball and Ross’ type 3 SCD. It is important to realize the cost implications of this second technique. For example, if a new column is added to our Product dimension, to capture a second unit price on the rows in that dimension, the two unit prices could then be used to record (i) an original unit price (as of January 1st, for example) and the current unit price, or (ii) the previous unit price and the current unit price, or (iii) the current unit price and any other related price, such as a budget forecast unit price.
But it is important to notice that, after all the work to change the Product dimension by adding a new column, it is only a second value for product unit price that can be captured. If a third value for unit price is later required, the work of modeling, designing, coding, testing, debugging, documenting, and supporting the modified table structure, must be undertaken again. If there is an even later requirement, say to capture product name changes, an equally costly redesign must be undertaken. If any later requirement to capture additional product history is required, an equally costly redesign must be undertaken.
It would be hard to think of a better example of what agile software is not.
A Mistaken Claim
Kimball and Ross claim that their history-recording SCD types (types 2 – 7) “support both as-was and as-is reporting”. (p.56) But this is seriously misleading. These SCD types can support as-was reporting in the sense of reporting on the history of changes to the things that are described by dimensional data. But they can report on this history only as it is currently recorded. As we have seen, however, these SCD types cannot support as-was reporting in the sense of recreating any previously produced report (i) with the original data, or (ii) with the most current data, or (iii) with data that includes corrections made up to any point in time between the two.
Advanced SCDs vs. Type 2 SCDs: a Cost/Benefit Analysis
Kimball and Ross refer to “rapidly changing monster dimension(s)” i.e. “multi-million row dimension tables” (p.54), and recommend the use of “mini-dimensions” to manage them. Mini-dimensions – which are what distinguish types 4, 5 and 6 SCDs from type 2 SCDs – are created by splitting a set of frequently changing columns from monster dimensions and storing them in a separate table. Although Kimball and Ross avoid mentioning the reason for using mini-dimension SCDs rather than type 2 SCDs, the very term “monster dimension” makes it clear what this intended benefit is. It is the ability to store the same information as type 2 SCDs, but in a smaller amount of space, thus saving on storage costs.
To assess the value of mini-dimension SCDs, let’s look at a cost/benefit analysis.
Suppose a monster dimension (MD-X) starts out (before any history is accumulated) as ten million five-hundred-byte rows. That’s five billion bytes. A terabyte is a thousand billion bytes, so MD-X takes up half a percent of a terabyte. No matter how high in the on-line storage hierarchy MD-X is placed, half of one percent of a terabyte is not going to be very expensive.
Now let’s assume that half the rows in this table fully change every day, and that we are using type 2 SCDs, making a new copy of the entire row each time it changes. Over the course of ten years, say, this monster dimension grows from ten million rows to ten million plus an additional five million rows every day for 3,650 days. That’s a total of over eighteen billion rows. At five hundred bytes per row, the storage required for this rapidly changing type 2 SCD monster dimension is 9.13 terabytes.
Even if none of this full ten year’s worth of data is archived, 9.13 terabytes is no longer a prohibitively expensive amount of data to keep on-line. Indeed, it is a small amount of data compared to the storage requirements for many fact tables, and an insignificant amount of data compared to the storage requirements for non-character-set data and for Big Data. Using non-type 2 SCDs is like responding to a requirement to reduce warehouse costs by rearranging the items on a few shelves in the back corner of the warehouse.
So let’s see how much we do save in storage costs by using a mini-dimension. Because we are using a mini-dimension, the only columns that will not be split out into the mini-dimension are those that won’t change, or those for which only current values are required. A generous assumption would be that twenty percent of the columns of MD-X are like this.
So we assume that ten million four hundred byte rows of MD-X are split off into a mini-dimension, (MD-X2). This leaves ten million one-hundred-byte rows in a type 0 and/or type 1 vertically truncated MD-X (MD-X1).
History now begins to be recorded in MD-X2. It is accumulated at the rate of five million rows each day, for an assumed ten years. This results in a storage requirement for 7.3 terabytes for the mini-dimension. For MD-X1, the storage requirement is for one-half of one percent of a terabyte, so we still have 7.3 terabytes as the estimate for storage requirements when using the mini-dimension technique.
So using a mini-dimension, instead of a type 2 SCD, saves us 1.83 terabytes of storage.
What, then, are the incremental development costs for using an advanced SCD rather than a type 2 SCD? These costs will include the incremental costs incurred during designing, coding, testing, documenting, implementing and supporting a mini-dimension SCD.
To make the numbers as favorable as possible for the mini-dimension approach while remaining at least marginally realistic, let’s say that the incremental cost for a mini-dimension SCD over that of a type 2 SCD is minimal – say a single man-week. This is the incremental cost associated with vertically partitioning the original dimension, possibly adding an outrigger to the base dimension, and writing the retrieval logic to join rows from a non-historical MD-X1 and an historical MD-X2. Using a conservative loaded man-hour cost of $150, that’s $6,000.
Kimball and Ross, to manage this monster dimension, would have us spend $6,000 to save less than two terabytes of storage.
But that doesn’t reveal the full absurdity of using advanced SCDs. Consider, to begin with, type 3 SCDs. As I pointed out earlier, every time a requirement for additional history is met by adding a new column to a dimension, a project must be undertaken to model, design, code, test, debug and document the new structure. Such projects will certainly involve several man-weeks, and perhaps several man-months. If we conservatively estimate a single man-month to do all this, what is the cost?
At our assumptions, which make the numbers as favorable as possible to advanced SCDs, the per-hour loaded cost of $150 means that the project costs are approximately $27,000. And these costs – which would be zero if a type 2 SCD were used – will be incurred every time any type 3 SCD, in any fact-dimension structure, in any database in the enterprise, over the multiple years’ lifetime of those fact-dimension structures, must be modified to capture additional history.
But suppose we avoid type 3 SCDs, and limit our improvements on type 2 SCDs to the mini-dimension SCD types – 4, 5 and 6? In that case, we may still be exposed to project costs to accommodate additional history requirements for a mini-dimension SCD. For example, suppose that the MD-X1 + MD-X2 monster dimension, in our example, is our Product dimension and that, originally, product name was kept in MD-X1. Later, a requirement is added to keep history on product name changes as well as on price changes. This means that the product name column must be moved from MD-X1 to MD-X2. Modeling, design, coding, testing and documenting costs will be incurred.
Soft Costs of Advanced SCDs
In addition, there are two significant soft costs, i.e. costs that are difficult to quantify, of using any mini-dimension SCD in a fact-dimension structure that is not isolated as a cube structure whose data cannot be accessed directly. Both costs are the result of splitting dimensions into base tables and mini-dimensions, and of adding outriggers – in place of a single normalized table.
Since Kimball and Ross are vociferous in their scathing remarks about normalized tables, let me point out what these soft costs of denormalized dimensions are.
The first cost is that advanced dimensions isolate fact-dimension structures within the semantic firewall of an OLAP tool. A business user may want to query a product dimension, for example, and have several columns returned in the result set of a query on that dimension. We can expect and require this user to know the physical name of the dimension, and also the physical name of the columns she wants returned. But we cannot expect and should not require her to know which columns are in the base dimension table and which are in the mini-dimension table – which are in MD-X1 and which are in MD-X2.
We can expect and require her to provide an as-of timestamp on a query against a type 2 dimension. But we cannot expect and should not require her to join a non-temporal (MD-X1) and temporal (MD-X2) table in her query.
Users are under an obligation to know what data is provided for each dimension in a fact-dimension structure. But what the user can reasonably be required to know about a dimension is what its name is, and what the names of its columns are. The dimension, as a semantic object, corresponds to a single table with one column per attribute. Base dimension tables vs. mini-dimension tables, outriggers, and any number of columns for historical copies of specific attributes, are technical details which only IT technicians should have to be aware of.
Because of their incremental IT personnel costs, and their additional soft costs, there is no justification for using advanced SCDs instead of type 2 SCDs.
Because no SCD, including a type 2 SCD, can provide a bi-temporal history, and because both orthogonal dimensions of bi-temporal history have business value, there is no justification for using any SCD type instead of using bi-temporal dimensions.
The important fact-dimension pattern must be modified. SCDs must be abandoned. Bitemporal dimensions must be used in their place.
A use case of a bi-temporal dimension and its associated fact table can be found in Chapter 18 of my Bitemporal Data: Theory and Practice.
Note: my thanks to Mr. Stan Muse, of IBM, who helped me with Part 2 of this article.