Published in TDAN.com April 2000
A critical success factor for a data warehouse is the perception of the quality of the data it contains. The quality of this data is directly related to the source of the data used to populate the
data warehouse. The ability to select the most appropriate source of data for use in a data warehouse is critical to ensure quality as well as provide complete knowledge and understanding of the
information made available through a data warehouse.
Over the years, the information needed to make business decisions has been contained within the glass-enclosed, water-cooled rooms that housed gigantic mainframe computers. The only way to access
this information was through green-screen terminals and batch reports. These delivery mechanisms were pre-designed and very inflexible. The earliest “data warehouse” applications, in concept, can
be traced back to the use of calculators for producing numbers on the reports from which managers made business decisions. As personal computers became more prevalent, business users started taking
any information available and loading it into their “personal” applications. This time-consuming manual effort often involved rekeying the data into spreadsheets so that the desired information
could be manipulated as needed.
The need for change was finally recognized and became the driving influence behind data warehousing efforts. Instead of requiring the end-user to select and load the information manually, an
automated process was developed to extract information and load it into files that were “outside the glass room.” Through this data delivery process, data was selected and used to populate target
structures commonly known as data warehouses, or more often, data marts (business unit-specific data warehouses). This data was timely, accessible and enabled the end-user to make informed business
Because of the pressure to deliver these data structures quickly, shortcuts in the development process were often taken. A data warehouse/data mart is supposed to be designed to accommodate
multiple queries using large quantities of historical summarized data. In reality, most of these structures are merely images of the production files that were extracted and replicated, i.e. “snap
shots” of production files at a point in time. This extraction reduces the strain of reporting against production files, thus reducing performance degradation. But it really does nothing to
facilitate true adhoc queries.
Another result of the pressure to quickly satisfy pressing information needs was that these new structures were populated with whatever data was readily known and accessible to the populator.
Instead of taking the time to analyze the proper source of each data item that would be made available in the new data warehouse, any accessible, known source might be selected. An extreme example
would be a case in which a number of production reports are selected (as the source), and entered manually into an Access database by the departmental administrator to produce spreadsheets and
graphs with no regard to data compatibility.
As the delivery pressure increases, the quality of the information continues to decrease. The need to extract the required information from multiple production systems often becomes too
time-consuming and staging becomes too complicated. The risk of providing incorrect information vs. taking the additional time required to determine the most appropriate source of that data is
often not even weighed.
PROBLEM: DATA WAREHOUSE DATA QUALITY
The fastest way to undermine end-users’ confidence in the newly delivered data warehousing “system” is to present data that results in inconsistent, inaccurate, ambiguous query/report results.
This can happen as soon as the data in the data warehouse does not agree with data shown on existing production reports. What causes even more damage is when managers from various business units
attempt to make decisions based upon inconsistent results from their individual data marts.
This inconsistency can arise from a number of factors. One frequent problem is that the data used to populate each data mart originates from a different source. There are many data sources within
an organization from which to derive a list of customers, product/service offerings, etc. Each of these sources could result in different totals when the numbers are averaged or added. Moreover,
even if each data mart uses the same source to populate the data, unless there is a consistent, accepted definition of the information, confusion could, and often does, arise.
Another result of the increased pressure to deliver data warehousing solutions in a short time frame is that data marts are often developed based on a limited knowledge of existing operational
systems. Each business unit, or individual department, populates its individual data marts (often referred to as “independent” data marts) with the information they require. The source of their
information is that which is within their sphere of knowledge. This contrasts to the use of dependent data marts populated from a predefined set of information that has been integrated from
multiple operational system sources. The development of dependent data marts requires some initial data warehousing analysis at an enterprise level. Even though this enterprise-view effort requires
more time in initial development, the time spent on follow-on incremental efforts can be sharply reduced. This is especially true when “conformed” dimensions are understood, modeled and utilized.
DATA WAREHOUSE DEVELOPMENT
Even though the single development effort of a complete enterprise data warehouse has been widely accepted as an impossible task, the incremental development of dependent data marts has proven to
be feasible and acceptable. In order to understand how these development problems arise, a brief review of the development effort required for a data warehousing effort might help.
The major tasks in this incremental development process include:
- Development of an understanding of the information requirements
- Design of the target structure
- Design of the population/refreshing steps
- Development of initial queries to deliver the required information
The first step in determining the content of the data mart really consists of two major efforts. The first is to develop an understanding of the meaning of the information requested and to clarify
business definitions. It is well understood that as soon as the initial information is delivered, additional requirements will be discovered, hence the idea of incremental data marts. Once the
identification of the required information is completed, the source of where that data will come from must be determined. Most often this is accomplished by asking other members of the development
team. (Whether or not the “official source” of critical information is available often doesn’t matter, since no one knows of its existence or how to obtain it.) Using the existing format of the
required data, the new target structure for the data mart is designed. More often, these tasks are accomplished using some form of data modeling software. This enables analysis and documentation of
requirements and helps design and generate the physical target tables and columns.
Once the information is understood and the new structure is designed, the procedures to populate this data are designed and developed. These procedures include extraction of data from the
appropriate operational source, and any transformations, scrubbing, decoding and staging required prior to loading the data into the new target structure. This process can be as simple as
extracting the entire production data structure and replicating it, as is, into a new read-only target structure. On the other hand, various sources may be used to populate the new structure,
requiring extensive staging procedures to prepare the data. Depending on the incoming quality and format of the data, data transformations and scrubbing may also be performed prior to the final
load. This set of procedures is often executed through scheduled procedures developed using an Extract, Transform, Load (ETL) tool. The process of continually refreshing the data in the new data
mart can be handled in a number of ways. One is a fairly complex procedure that only adds changes that have occurred since the previous refresh (commonly known as a delta refresh). A simpler method
adds all the data that exists at a point in time (a snapshot population). The second method causes the size of the data mart to grow at almost an exponential rate, while the first requires a more
in-depth understanding of why and when data changes occur and how those changes can be captured.
Once the data is available in a new read-only structure, the delivery of this information to the end-user is now available. Through the increased availability and functionality of Business
Intelligence (BI) tools, the process of creating queries based on this data has become much easier. In fact, these tools are marketed as capable of being used by non-technical business people,
especially to develop adhoc queries. In reality, most organizations create initial queries to satisfy the main requirements of the business end users. These queries can be easily adapted to support
parameter-driven requests that will meet the majority of the basic informational needs of the business.
PROBLEM: DATA WAREHOUSE META DATA
Just as operational applications can become like “stovepipe” applications/islands of information, data warehouses/data marts can increase the number of “applications” in the same way. In order
to begin to understand the meaning of the information contained in these data warehouse “applications,” meta data must be available to explain and help facilitate the understanding of the
information content. In the past, meta data (whether for operational systems, ERP (Enterprise Resource Planning) applications, or data warehousing efforts) has been thought of as something captured
and stored in a repository. Once captured, it can be made available upon request. These capture and distribution activities are usually manual efforts and therefore often overlooked and
Each of the tools used to develop data marts has its own individual meta data requirements. Meta data is not standardized from tool to tool, and meta data was never meant to be shared. In fact,
many believe that the subtle differences in this meta data are what give each tool its competitive edge. Individual tool meta data “repositories” were developed to facilitate the requirements of
each individual tool. The fact that the meta data captured in one tool could be used in the next tool (from a tool perspective) is merely a coincidence. Each tool requires some information (meta
data) to begin the process as well as to provide information (meta data) when the process it supports is completed. In order to reduce the time required to develop and deliver a data mart, the meta
data information must be available and shared between the various data warehousing tools.
Just as documentation is perceived as a necessary evil, and often left to the last minute (and therefore often not completed), meta data capture also is often overlooked. Since most meta data is
captured through a highly labor-intensive, manual process, its value is not deemed to be sufficient to justify the cost. The ability to electronically capture the software tool meta data available
at the end of each logical unit is an absolutely necessity. Additionally, previously captured meta data should be made available to each tool when work is initiated. As standards such as XML
emerge, this integration of meta data requirements between various data warehousing tools should become much simpler.
AN EXPERT’S PERSPECTIVE
“The central meta data repository is the heart of the data warehouse. It is used as a “single version of the truth” to provide central definitions of business rules, semantics, data
definitions, transformations, and data models. There can only be one version of the truth for these definitions, not multiple, inconsistent definitions…” – Pieter R. Mimno, The Data
META DATA CONSUMERS
A number of different potential consumers of meta data exist in a typical organization. These range from the corporate data administrator to the IT staff to the businessperson. Each person has his
or her own set of meta data requirements. The capture and distribution of these various categories of information require different procedures and delivery mechanisms.
In the past, meta data and the meta data repository were the primary tools of the corporate data administrator. There were volumes of standards and procedures for the capture and usage of meta data
within the enterprise. Over time, operational systems meta data has been used more and more by application developers for impact analysis. The repository has made it possible to understand the
impact of a change to an operational system.
It also has provided a picture of how various applications relate to each other.
This use has expanded the scope of the repository and its meta data contents to encompass the IT organization requirements, even though its use was very limited. Most of the meta data contained in
the repository to support these needs was about technical artifacts such as applications, jobs, files/tables, records/columns, reports, queries, etc. Little if any business meta data was available
(or deemed to be necessary).
As more and more business knowledge is required, the need for business meta data becomes more obvious. The content of the repository begins to be extended to include not only the operational
systems technical meta data, but also the definitions, meanings and business rules (i.e. business meta data) that provide the business knowledge behind these system components. The capture of this
information is a bit more difficult. No one has yet figured out how to transfer knowledge that resides in a businessperson’s brain into a machine-readable format (brain scan). Often the only
electronic version of this knowledge is in a word processing document or spreadsheet, or possibly in a data modeling tool. It has become imperative that this information be captured electronically
for retention in the repository. Once the problem of capture is overcome, the delivery vehicle for this new meta data to the business enduser has to be established. It is important to realize that
the meta data currently provided through most software tools, including most meta data repositories, focuses on technical content. As the distribution of this meta data is expanded to include
business meta data and business end-users, the content of that information should be more business-oriented.
As if our lives were not “meta data poor” already within a given organization, we are now also faced with the newer challenges related to e-Business relative to meta data. An organization must be
prepared and enabled to provide meta data to users within organizations outside their own. The business endusers, as we know them today, are only a small percentage of the audience of the future.
There is one other group of meta data consumers. This is not a group of people, but a set of software tools. Restricting the discussion here to data warehousing, many tools are available to support
the development of a data warehouse/data mart. As the use of each tool is introduced and established, the meta data requirements for that tool must be identified. Previously captured meta data
should be electronically transferred from the enterprise meta data repository to each individual tool. In return, the tool-specific meta data should be analyzed for inclusion in the enterprise meta
data repository. If captured meta data is never required by another tool, or group of users, a question should be posed as to whether or not the storage of this information is really required. As
the consumption of meta data grows beyond the organizational boundaries, information requirements also grow.
As the pressure to deliver incremental data marts at “warp speed” increases, organizations look for ways to reduce development time. One way that has proven successful at not only reducing the
development time, but also increasing the quality from one increment to another, is the use of conformed dimensions. A conformed dimension is a dimension that has been analyzed and modeled by a
central data warehouse design team and then used for all following data mart implementations. The official definition of a conformed dimension, according to data warehousing expert Ralph Kimball,
is “a dimension that means the same thing with every possible fact table to which it can be joined.” For example, once a product dimension has been analyzed and modeled, that knowledge can be
reused. Because the actual usage may be different, the actual physical structure of this dimension may vary from data mart to data mart, but the logical structure, and imbedded business rules, need
only be developed once. As more and more data is requested regarding a dimension, those items can be added to the logical model and are then available for future increments and enhancements. In
addition to the physical model of each dimension, the business meta data, including business rules and approved sources, can be maintained in a meta data repository.
Another analysis effort, which needs to be done only once, is to develop an understanding of the effect of business events on the changing of information in each dimension. This is commonly
referred to as “slowly changing dimensions.” Rather than simply replacing the entire product dimension, for example, on each refresh cycle, the business event that adds, changes, or “sunsets” a
product is analyzed. A determination is then made as to how best to reflect the change in the dimension. When this business event occurs, the resulting information could be captured (i.e. delta
change) and used to refresh the dimension. Using this approach, the amount of data that is captured and loaded on each cycle is reduced.
DATA WAREHOUSE SOURCE
The underlying requirement that satisfies both data quality and conformed dimensions lies in the definition of the appropriate source for each data item that is included in a data warehouse
implementation. There is no questioning the fact that each and every piece of business information is contained in multiple physical files/tables throughout the organization. Some of these physical
structures are referred to as “master files” while many are extracts that have been developed and used for specific purposes. In order to facilitate the “correct” choice of data sources, an
extensive analysis must be undertaken to:
- Determine which files are to be considered as master files
- Map the business elements to the most appropriate location in the enterprise.
This knowledge should be gathered by either the enterprise data administration organization or the central data warehouse team. Once gathered, easy access to it must be provided to the rest of the
organization. The capture, maintenance and distribution of this information should be facilitated by an enterprise meta data repository that is available to all individuals within an enterprise.
The task of developing this “business element-appropriate source” link is monumental. It need not be undertaken as a single effort all at once. In fact, it can be handled concurrently with the
identification of the requirements of each incremental data warehousing effort. As the need for individual business elements are identified, the appropriate source can be determined and provided to
the data warehouse developer. The identification and capture of not only the actual physical field/column and file/table of the appropriate source, but also the application that maintains this
information, is important. This valuable information can then become another set of meta data attributes about a business element. In organizations where modeling tools are used to facilitate the
design of each data warehouse/data mart increment, a list of business elements, appropriate source, business rules, business definitions, previously developed data models, etc. can be
electronically transferred from the enterprise meta data repository to the modeling tool. The presence of this type of information saves discovery time at the initiation of each new effort. It also
helps ensure data consistency and quality whenever the information is presented.
The real difference between the meta data repositories contained in each data warehousing tool and a true enterprise meta data repository is the scope of the information contained. The tool
repositories are meant to support the functionality of the tool and therefore contain only that information required by the tool itself. Very seldom is additional information captured in these
tools (unless through user-defined fields).
On the other hand, an enterprise repository is designed to provide a source for managing the “meta data” for the entire enterprise. As one data warehouse professional remarked, “meta data is not
data about data, but rather, data about resources.” As a result, many enterprise repositories contain much more than just the typical information about the information assets of an organization
(such as applications, files, programs, files,) and have begun the capture of additional technical and business meta data. Many organizations use the Zachman Framework (a framework for information
systems architecture) to help identify the types of meta data that could be captured and maintained in an enterprise repository. Today an organization would be lucky to have even the first three
columns of the last row populated “to an excruciating level of detail” (which means including textual definitions/descriptions!). As the need for this information increases, the pressure to
capture and deliver the entire contents of the Zachman Framework increases. The value of the understanding and reusability of this information is impossible to measure until the need arises and the
information is not available.
As organizations continue to push the decision-making process further and further out in an organization, the greater the need for understanding becomes. There are fewer and fewer individuals who
can be contacted to answer the question “what does net net sales mean on my sales report?” Wouldn’t it be nice if, when that question arises, that the business person could right-click on that
net net sales number and view a box that describes (in the business user’s language) the definition, business rules, calculation, application source of the data, time last updated, etc.? And as
normal procedure, that every report and query that shows net net sales would contain the same value? It need not be impossible or improbable.