The Case Against Slowly-Changing Dimensions, Part One

Beginning of ChaosI believe that Ralph Kimball’s fact-dimension pattern is one of the most important and powerful data design patterns ever discovered. But his slowly-changing dimensions (SCDs) are another matter. In this article, I will show why slowly-changing dimensions should be abandoned, and explain what should replace them.

In brief, there are two things wrong with SCDs. The first is that they inadequately live up to their stated purpose. They are intended to provide history for dimensions; but the history they provide is incomplete. This objection applies to all SCD types, used alone or in combination.

The second is that the costs of using “advanced SCDs” (SCD types 3 – 7) outweigh any cost savings that those advanced SCDs achieve over type 2 SCDs. In addition, there is no information that can be captured with those advanced SCDs that cannot be captured with type 2 SCDs. Consequently, if history-capturing SCDs are to be used (types 2 and higher), it is only type 2 SCDs that are cost-justified. However, since even type 2 SCDs fail to provide the complete histories that businesses need, it follows that something else must be used in place of SCDs.

What is there to replace SCDs that will provide a complete history for dimensional data? Bitemporal dimensions. Bitemporal tables are already defined in the ISO 9075:2011 SQL standard, and they are supported by IBM, Oracle and Teradata, and by such middleware products as the Asserted Versioning Framework [developed by myself and my co-author of Managing Time in Relational Databases (2010, Morgan-Kaufmann)].[1] Bitemporal data provides both a complete history of changes to the things represented by that data, and an orthogonal history of changes to the data itself. As we will see, both kinds of history are needed by business.

Additionally, the complexities of managing bitemporal data are well-encapsulated in SQL 2011, so the costs of invoking that functionality are minimal compared to the costs of hand-coding access to such structures as mini-dimensions, outriggers, and history columns.

What SCDs Can’t Do

The business requirement that motivates SCDs is usually stated as the requirement to “provide history” for one or more dimensions in a fact-dimension structure. This will support historical reporting – or so those who provide the requirement are told. But the question that is never asked, when requirements are gathered, is this: what does “support historical reporting” mean?

On the one hand, what a business user may want is to rerun a historical report and get exactly the same results she got when she ran it the first time. However, between the time the original report was run and the time it is rerun, the data that was used for the original report may have been corrected any number of times. So what the business user will get, instead of what she expected, is an historical report which is identical to the original report except that it uses corrected data.

 Indeed, the business user may want an original-data historical report, or an otherwise-identical current-data historical report, or even an otherwise-identical historical report based on the state of the database at different points in time between when the original report was run, and when one or more corrections, made at different points in time, were applied to the historical data. Depending on the point in time chosen, the data that appears in the report may be identical to the original data, identical to the current data, or identical to the data as it existed after one, but not all, of those corrections were applied.

 These are reasonable things for a business user to expect from a database which, she is told, “provides history”, and supports “both as-was and as-is reporting”. (Kimball and Ross, The Data Warehouse Toolkit, 3rd edition (2013, John Wiley), p.56). When IT personnel convert a Product dimension, for example, to an SCD, and tell the business that this will provide both as-was and as-is reporting on product data, the business probably assumes that the result will support all of these reasonable requests.

 But SCDs cannot support all these reasonable requests; and no combinations and/or variations of them can, either. To satisfy all these reasonable requests, dimension tables must be bitemporal. That is, they must locate the data they contain at the intersection of two orthogonal points or periods in time – the time when the things represented by the data were in a given state, and the time when the data itself was in a given state.

An Example

Take as an example a report on a history of unit price changes for products, for the first half of 2015. For simplicity, I focus on just one product, product xyz. Assume that this price list report is produced from a type 2 Product dimension, and is run on July 1st. The history of xyz’s price changes, in the first half of 2015, is shown in Figure 1.

Prod     Eff Date    Uprice
[xyz | 01/01/2015 | $3.98 ]
[xyz | 04/23/2015 | $4.25 ]
[xyz | 05/01/2015 | $4.49 ]
[xyz | 05/24/2015 | $4.98 ]

Figure 1. The Product Dimension on July 1st.

(Underlined column headings are columns that are part of the business key of the table.) Although the details of the type 2 Product dimension are not shown, the rows on this report and the rows in a type 2 table clearly correspond one-to-one.

Now suppose that, on July 2nd, a correction is made. The price change from $4.25 to $4.49 actually took effect on April 27th, not on May 1st. So the original row, with its incorrect effective date, is deleted, and a new row, with the correct effective date, is added. The report is rerun on July 3rd, and the result is shown in Figure 2.

Prod     Eff Date    Uprice
[xyz | 01/01/2015 | $3.98 ]
[xyz | 04/23/2015 | $4.25 ]
[xyz | 04/27/2015 | $4.49 ]
[xyz | 05/24/2015 | $4.98 ]

Figure 2. The Product Dimension on July 3rd.

However, we now cannot tell that, from when the price change to $4.49 was originally entered into the database (on or close to its original effective date, likely), until the correction was made on July 2nd, the Product dimension incorrectly stated that, for the last four days of April, the unit price of xyz was $4.25. That information has been lost. But since business decisions may have been made on the basis of that incorrect information, it is as important as you care to imagine. Methods of managing temporal data should not result in the loss of important information.

So from this point in time on, we cannot reproduce our original price list report. That original report showed that from April 27th through April 30th, the unit price for xyz was $4.25. But with the correction made on July 2nd, the report run on July 3rd shows that the unit price for xyz, on those four days, was $4.49.

Next suppose that, on July 5th, a second correction is made. The price change that took place on May 24th was actually a change to $4.99, not to $4.98, and so the database is updated to reflect that. The report is again rerun, this time on July 6th, and the result is shown in Figure 3.

Prod     Eff Date    Uprice
[xyz | 01/01/2015 | $3.98 ]
[xyz | 04/23/2015 | $4.25 ]
[xyz | 04/27/2015 | $4.49 ]
[xyz | 05/24/2015 | $4.99 ]

Figure 3. The Product Dimension on July 5th.

From this point in time on, we cannot reproduce either our original price list report run on July 1st, or the corrected report run on July 3rd. More generally, what has happened is that, although the SCD has captured an historical record of price changes and when they became effective, it has not captured a historical record of what the database said those prices were and when it said it.

If historical tables in databases, including SCDs, were bitemporal, this problem would not exist. Bitemporal data allows the database to record (i) what X was like at time t1, (ii) according to the data present in the database as of any orthogonal time t2.

But since SCDs are not bitemporal, they cannot do this. And even though business users are typically not sophisticated enough to explicitly ask for this bitemporal functionality, it is almost always what they want. What business user, when requesting that history be captured for a particular table, would not want the capability to learn about the history of any object of interest, as it was recorded in that table as of any point in time? What business user, on learning that she cannot exclude corrections to data when rerunning a report, would not be disappointed?

With any SCD, she will be disappointed.

Aside: I note, in passing, that for the same reason, no Inmon-conformant data warehouse has ever had the properties Inmon attributes to them, the properties of being both “time-variant” and “non-volatile”. Neither Inmon nor Kimball warehouse structures permit us to locate data at the intersection of two orthogonal points in time. Yet what things were like at point in time t1 is one thing, but what the data said, at that or any other point in time, that things were like at t1, is an entirely different thing.

Two Kinds of Dates

One way around this loss of information might seem to be to add a second date to the dimension. It will be the date the row was inserted into the table. If we use that approach, then after the two corrections described above, our data will look like this:

Prod       Eff Date      Ins Date     Uprice
[xyz | 01/01/2015 | 12/31/2014 | $3.98 ]
[xyz | 04/23/2015 | 04/22/2015 | $4.25 ]
[xyz | 05/01/2015 | 04/30/2015 | $4.49 ]
[xyz | 05/24/2015 | 05/23/2015 | $4.98 ]
[xyz | 04/27/2015 | 07/02/2015 | $4.49 ]
[xyz | 05/24/2015 | 07/05/2015 | $4.99 ]

Figure 4. The Dual-Date Product Dimension on July 5th.

This is clearly complicated data to use. For example, what is the query that would return the effective date and unit price history of xyz as of July 4th? Is it a filter on insert date, which would return the first five rows from Figure 4?

That query does produce the correct result set, although the May 1st entry in it is superfluous. However, if the as-of date was July 6th, then all six rows would be returned. But in that case, the two rows with the May 24th effective date would contradict one another.

So is a filter on insert date plus, for matching product and effective date pairs, the selection of the one with the latest insert date that is not later than the query’s as-of date? That does return the correct result, in either case.

Here are two more corrections that might be made to the data for product xyz. I leave it to those who think that a dual-date approach to SCDs can support the orthogonal histories of changes to products and changes to product data, to work out how they would handle these two corrections.

Suppose that the second row in Figure 4 has also been corrected. In this case, xyz did not take on a unit price of $4.25 on April 23rd. Instead, it kept its unit price of $3.98 until April 25th, at which time its unit price changed to $4.28. How would a dual-date approach to SCDs retain the original second row in Figure 4 while also recording the corrected data? Perhaps the result would be this pair of rows:

[xyz | $4.25 | …. | 04/23/2015 | 04/22/2015 ]
[xyz | $4.28 | …. | 04/25/2015 | 07/03/2015 ]

But how do we know that the second of these two rows is a correction to the first one? How do we know that the unit price for the 23rd and 24th should be $3.98, and not, as shown, $4.25?

Consider another correction, one which entirely removes an originally-entered price change. Suppose that the price change from $3.98 to $4.25 never took place. So the next unit price for xyz is in fact $4.49. How will this correction be made? The original $4.25 row cannot be deleted, because that would lose the information that, until the correction was inserted, the Product dimension did show a price change to $4.25.

I am not saying that something couldn’t be jury-rigged, possibly involving a delete flag or a delete date.  Perhaps it would also require adding an effective period end date. What I am saying is that what we see here is the tip of the iceberg of the issues involved in managing a dual history of changes to product prices and changes to product pricing data.

We don’t need a jury-rigged solution, because this is what bitemporal data management technology already does. Since a complete theory of bitemporal data has been stable within the computer science community for several decades, and since robust implementations currently exist for it, there is no justification for embarking on a course of modifying SCDs to express this dual history. That would be to re-invent a complicated wheel that was invented a long time ago, whose design and use have been refined over several decades, whose maintenance and retrieval logic has been encapsulated in vendor code, and for which SQL 2011 provides a simplified way of maintaining and querying those orthogonal histories.

In Part 2, I’ll describe the eight types of SCDs defined in the 2013 edition of Kimball and Ross’ Data Warehouse Toolkit. I’ll distinguish between two senses of “as-was reporting”, only one of which is supported by SCDs. My second argument against SCDs is that, even if they could provide a complete historical record of dimensional change, they involve hard and soft costs that outweigh any savings in storage costs that they might provide.

My conclusion is that to track dimensional changes, those dimensions cannot be SCDs; they must be bitemporal dimensions. A use case that I provide in Chapter 18 of my book Bitemporal Data: Theory and Practice (2014, Morgan-Kaufmann) shows how bitemporal dimensions can be used in fact-dimension structures.

 

[1]     The Asserted Versioning Framework is described in that book and, in greater detail, in U.S. Patents 8,219,522 B2 (July 20, 2012) and 8,713,073 B2 (April 29, 2014).

 

Share

submit to reddit

About Tom Johnston

Tom Johnston is the President of Asserted Versioning, LLC, a company which offers bi-temporal data management technology and consulting services. He has a doctorate in Philosophy, with an academic concentration in ontology, logic and semantics. He has spent his entire working career in business IT, in such roles as programmer, systems programmer, analyst, systems designer, data modeler and enterprise data architect. He has designed and implemented systems for numerous clients, in such industries as healthcare, telecommunications, banking, manufacturing, transportation, and retailing.

Top