What is the bane of every data warehouse project? What seems to be the major cause of project overruns?
Most experienced data warehouse project managers would quickly say, “Data Integration.” When planning a data warehouse project, a project manager breaks it into main phases, and the phase labeled
“Data Integration” is the one they see at fault. It’s the phase that, no matter how long they budgeted, always seems to go longer than expected.
Instead of dissecting why this is the case, the simple solution always is to hire more ETL programmers, hire more highly skilled ETL Programmers, or switch ETL vendors. But perhaps the answer does
not involve ETL.
What Exactly is ETL?
As a data consultant for many years (Gasp! More than twenty!), I have seen a lot of confusion surrounding ETL. Let’s examine what ETL does and what it doesn’t do.
ETL stands for Extract, Transform and Load. Essentially, ETL tools move data from place to place, usually performing some action along the way. Sometimes, they simply move data and leave it the way
it is, like a customer’s last name. Sometimes, the tool may standardize the data, such as looking up a value in a code table and moving the standard code value instead of the original value.
Sometimes, the tool might transform the value into an entirely different value by applying a function formula or conversion factor.
ETL tools facilitate data warehousing and data migration projects in many ways. They provide a graphical interface to show the flow of data; they can help programmers pinpoint fields in either
source or target that do not have a map. They greatly facilitate maintenance, because it is much easier to maintain graphical maps than it is to maintain code of any kind. Most ETL tools also are
equipped with performance-enhancing mechanisms like parallel loads.
What ETL is Not
ETL tools do not do everything, however. I had a client recently who was confused and thought that ETL tools performed reverse engineering. This is a task that is performed when you want to analyze
a database structure. Although you can hook up an ETL tool to a database and “see” the tables in that database, the ETL tool is not designed to produce an Entity-Relationship Diagram (ERD) and
show the connections between all tables in the database. A CASE tool (computer-aided software engineering tool) is the right tool for the job. Using an ETL tool in this way is like using a
screwdriver to hammer in a nail; you could do it, it just would take five times as long. ETL diagrams are also not really ERDs, although they can show a connection between two tables. Usually you
have to know what the keys are to join it on in advance, and you must manually enter this information into the ETL tool; the tool cannot infer the relationships from the database.
ETL will also not provide any indication as to how correct the mappings are. Research is necessary in order to understand what each field means and whether the mappings reflect the same things
being mapped together. This may be the hidden problem behind data integration woes.
The Secret of Successful Mappings
Semantics is the study of meaning. Every data warehouse project should have a person in charge of semantic verification and validation. This person should be a skilled data architect and modeler,
trained in extracting discriminating definitions from the data itself, documentation and people’s heads. This person needs to compare source and target fields on a concept level to ensure that the
mapping is accurate. In short, this person should be a “data sleuth:” A real Sherlock Holmes of data, able to detect and define each field carefully and consistently, and to document the meanings
clearly and concisely so all can understand, both business and technical people.
The Dark Side: What Really Happens
Who really designs mappings on a typical data warehouse project? It is usually an ETL programmer. ETL programmers are typically not data modelers; they usually have not been trained in semantic
discrimination. Their skills revolve around optimizing mappings, writing formulas and transformations, and designing job flows (which jobs are triggered upon success of a particular job).
Surprisingly, programmers designing mappings have failed to base them on the demographics of the actual data. I have always insisted on data profiling as a “best practice” for data integration.
Programmers and project managers alike look surprised and say “Why do I need to do this? If the field is called revenue in both places, isn’t it the same thing?” Profiling quickly reveals the
truth or falsity of these assumptions. Without profiling, programmers always suffer from the “load, code and explode” problem. Programmers should have been the first to grasp the advantages of
this approach, and on most projects I am still amazed at having to be a data profile evangelist.
As stated above, due to no systematic, repeatable approach to mapping which takes semantics into account, the first result that is detected immediately is schedule overruns. Load, code and explode
is common operating procedure. If data harmonization is not done systematically in advance, it must be done as a corrective measure.
Only the fields that did not fit technically (datatype, range etc.) would fail. The real semantic distinctions will always remain undetected.
ETL maps have usually been designed hastily; “Customer” in one database has always been assumed to be the same as another, without delving into more difficult questions as to what constitutes a
customer in each system, and then comparing them to see if they are really the same. The result is a data warehouse that looks like it is correct, but something often doesn’t add up. These types
of errors are extremely difficult to track down and diagnose. What often happens is the warehouse is seen to be more and more inaccurate over time, and since the errors are elusive, finally the
warehouse is scrapped.
Exploiting Semantics for Success
How do we avoid these tales of woe?
Data warehouse projects must acknowledge the importance of a semantic approach to integration. I call this approach Data Unification and Harmonization, or DUH. My two sons use the term “duh”
(usually said with a certain inflection) to signify something that should be very obvious but does not appear to be obvious to the original speaker. It should be obvious to all that DUH should be
done in a systematic, proactive way. Proactive instead of reactive; proactive meaning in advance, before any code is written. However, no one considers DUH until it’s too late and the project is
already over time and budget.
Here are the types of tasks that should be done as part of the DUH approach:
- Reference the Enterprise Conceptual Model, if it exists
- Conduct data profiling
- Consult documentation
- Consult with Subject Matter Experts
- Design a Conceptual model of the data warehouse (map it to the Enterprise model if possible)
- Create accurate definitions of all data elements in each source system and compare them with like elements to be mapped to the data warehouse
- Make these definitions accessible to warehouse users
- Document any discrepancies between like elements in different systems and make these notes available for business users of the warehouse
If all you do is data profiling, it will take you farther than if you don’t do it.’ But be aware that data profiling is only one component of DUH.
The moral is: If you hire a piper to play at your party, you must pay him. You can choose to pay him before the song, during the song, or after the song, but you must pay him; that is not optional.
The case for DUH is you pay much, much less in the project if you perform DUH upfront. Whether you have planned for it or not, DUH will be cone in any data warehouse project. The drastic mistake
that most project managers make is not planning for DUH at all.
So you must do DUH, duh! (Don’t forget that. Inflection!)