Untangling Spaghetti (Real World ODSs)

There are lots of theories out there about what an Operational Data Store (ODS) is, is not, what it should do and shouldn’t, how long it should hold data, how that data should be modeled, even
what the basic purpose of the ODS is and whether or not it’s even necessary. Maybe it’s a streak of arrogance, but to me, it seems like a no-brainer. The name even implies the purpose and whether
or not you will need one. It is a store for operational data — or should I say a central store for operational data. The data from lots of operational systems is brought
together, cleaned, given a single definition, and sent on to a data warehouse for use.

Of course the problem is those words “lots of operational systems” and “given a single definition”. Words of bane I tell you, words of bane. In most large corporations, especially ones that
have been around for more than thirty years, there are typically three or more operational systems — each of which has its own definitions (and frequently for the same stuff),
its own way of storing data, its own unique data problems, and so on and so on. This brings us to Data Warehousing 101. There are two things a data warehouse is supposed to do:

  1. Store high quality data
  2. Give the decision-makers a single set of definitions with which to make their decisions by

The idea is that when the Director of Sales asks to see a report on gross revenue by product and the Finance Director asks to see a report on gross revenue by product they both get the same
information exactly. The problem is that the data “source” is actually more than one system (accounting, sales, and possibly purchasing) — each of which has its own definition of what
“revenue” is. To further complicate matters each of these systems may not even have the same definition for the word “product” and there is overlap of… well… products within them.

To heap trouble on top of this mess, some of these systems are “simple” COBOL flat files, some VSAM data, some DB2, maybe some are home-grown Oracle or Informix, and maybe some data from an ERP
package that is being implemented. These are used in operational systems for sales, multiple systems for manufacturing, for finance, maybe two or three for accounting (GL, payroll, etc), another
for HR, and the list goes on. So you have a mixture of denormalized data with some highly normalized data and each of these systems defining things differently, storing it differently, and modeling
it differently. THAT is when an ODS is necessary. Given that this is the operational architecture that most of us live under (or one very close to this ugly) an ODS is generally going to be needed
for most any enterprise data warehouse. The vast differences between the several operational systems are going to make not having one a major undertaking.

Now that I’ve stuck my stake in the ground let me continue to add that an ODS also needs to store data from all of the systems in the same tables for a given data set. The model needs to be such
that the data within the tables describes the data within the tables. Ok, go back and read that last sentence again. What I mean here is that everything should be dynamic and changeable. What
fields a given item something has, what relationships, what hierarchies, everything. You see, the first and most primary job of any ODS should be to take data that is all over the place (meaning it
has different definitions and different fields) and bring it together into a single place so that it can be loaded into the warehouse. The ODS needs to be the place where “parts” from several
operational systems with several different definitions of what a “part” is, is brought together, formed into a something with a single definition, and uploaded to the warehouse. Traditional
modeling techniques simply don’t have it them to do this. Fortunately, there is an approach that does.

If you have an operational architecture that should be called an operational spaghetti shop this is the technique for you — in stage one of the ODS. Now that I’ve told you I
advocate a multi-staged ODS, let me explain. If the data is all over the place (and most companies operational data is) then you will need more than one stage to get the data to have a single
definition. The data is pulled from your operational systems and ran through a vigorous ETL process and loaded into ODS Stage 1.

This stage is modeled using the data driven approach. The idea is to get the data “together” in a single place on a single machine in a single table set. This does not mean that your Stage 1 is
only four tables. It does mean that all of your part/product information goes into the same four tables. All of your financial information goes into a separate set of tables. Sales information
another set, Manufacturing information gets its own, and so on. Again, the idea is to get wildly divergent things into the same bucket.

From there, the data is moved to Stage 2. This stage of the ODS should be done in either 3NF or Star (depending on your requirements and philosophies) AND it should be such that
it allows for only one definition of a single business concept. Revenue is “this”, a part is “such and such”; you get the idea. Now you may be thinking to yourself now wait just a minute! why
would you 3NF part information in Stage 2 when you said in your last article how differently this data can be described? The reason is simple. If you can’t get the data to fit into 3NF how are you
possibly going to get it to fit into the Star model that is in your data warehouse? What needs to be done is to find out the types of reporting that will be done in the warehouse and put in only the
needed columns. You should look to fulfil analytical requirements at this point, not try to adapt to change at the speed of light. Your ODS needs to be the place your warehouse receives trusted data
in a constant format with well defined definitions, columns, and values. The ODS is not the place to be putting together your part master database.

The next question you may be asking is why bother with Stage 1 at all? Why not just do the requirements gathering up front so that you only need a single stage? In this case the reason is two fold.
The first is that I feel there is a lot of value to having all of this operational data described in the manner of the source systems all in one place. Remember, an ODS is a central store for
operational data. That means it should hold all of the data from the operational systems. By keeping all of it described in the manner of the source systems you have a rollback path. The second
reason is plain old performance. It performs better to have the data go into Stage 1 with only quality checking being ran against it. From there the data transformations take place within the ODS
itself running at the speed of the data engine instead of at the speed of your external ETL process/tool. There is a final reason why I like the multi-staged approach to an ODS. It has been my
experience that Stage 1 actually serves to isolate the warehouse itself from the numerous business changes that take place. I have found that most of the changes to the business itself do not need
to be reflected in the warehouse for analytical purposes. Without Stage 1 it becomes difficult to isolate the warehouse from those changes because of the need to highly alter the ETL process.

So, we have moved the data from the various operational systems to what amounts to a bucket of high quality data. The first stage is a place where the data (after being ran through a data quality
filter) is sort of just “dumped”. The data then goes to stage two. This stage is the “warehouse prep”. Single definitions are given to business entities (for some reason, revenue and part seem
to be the ones most likely to be defined differently by different departments) in this stage. Lastly the data goes to the warehouse.

This “multi-staged” approach to an ODS gives you tremendous flexibility and allows you to convert a mess (your various operational systems) into something manageable. The only question left
unanswered is how long should the data stay in the ODS. I have my own ideas, but I feel that question really doesn’t have a set answer. You should decide how long your ODS holds data based on your
requirements at the time. Remember, disk space is cheap and getting cheaper. In this area it is tough to make a mistake and there are preciously few like that when dealing with something as
expensive as a data warehouse infrastructure. So relax about the storage requirement for the ODS and focus instead on making your ODS such that it helps you untangle your operational spaghetti.


submit to reddit

About Danny Wall

Danny Wall is the CEO of Wolf Data Systems, a company based in San Diego, California. Wolf Data Systems is a new breed of professional service company. Their only goal is to make you money. Mr. Wall can be reached at 619-640-6340.