Implementing Data Quality Through Metadata, Part 1

Published in April 2006

A special thanks to Mike Jennings for his invaluable contribution to this month’s column

How are you addressing the single most difficult problem facing data warehouses today? Data Quality. When the quality of data is compromised, incorrect interpretation and use of information from
your data warehouse can destroy the confidence level of its customers, YOUR users. Once the user’s confidence in your warehouse is eroded it is a question of time before your system will no longer

This data quality quandary often results from system architectures that fail to identify “bad” data before it is loaded into the data warehouse. This missed opportunity leads to a dramatic
increase in the time and costs that companies expend to reconcile and audit information in the warehouse. Insertion of technical meta data “tags” directly into the data warehouse’s dimensional
data model design and the extraction, transformation and loading (ETL) processes corrects this situation by providing a practical means to measure data quality precisely at a table row level of

This article is the first portion of a two-part series on implementing data quality through meta data. This installment examines the role meta data can have in the data warehouse model and data
acquisition designs for information content and quality. Part two of the series will examine the beneficial technical meta data tags that can be incorporated into an architecture to measure data
quality and provide flexibility to the system design.

The warehouse developers use technical meta data as a method to build a tighter relationship between the repository and the data warehouse. This is accomplished by incorporating technical meta data
directly into the data warehouse design and ETL processes. This technique is used to extend the design and architecture of the data warehouse to provide increased processing optimizations for data
acquisitions, maintenance activities, and data quality measurement opportunities. These technical meta data tags, unlike information stored in a meta data repository, are referenced at a row level
of granularity in the data warehouse. This direct association of meta data to each row of information in the data warehouse is a key distinction of extending meta data into the architecture.

To select operators, each row of data is tagged from the source systems during ETL processing with technical meta data. The meta data tags on each row in the warehouse provide a clearer semantic
meaning to the data by placing the information in context with the repository. As an example, consider a client dimension table that derives its information from two operational sources. Client
information is extracted in priority order from either from a sales force automation application, an enterprise resource planning application (ERP), or both, depending on availability and stability
of the data. The absence of technical meta data in the dimension table would require use of the information without consideration of the source system(s) that provided it. Technical meta data
tagging allows you to determine the origin(s) of information in the dimension table. Information originating from one or more sources can be easily and quickly determined through the technical meta
data tag on that row.

A clear, consistent method of tagging the data originating from the operational systems needs to be developed and agreed upon by both the technical and business users of the data warehouse. Any
technical meta data tied to the row must be applicable to the entire row of data, not just the majority of columns in the table.

I like to keep technical data tagging to a minimum in a simple dimensional data model design. Schemas that have only one or two fact tables or the case where a single operational system is source
to the warehouse are examples of these basic cases. I prefer to increase its use in very complex schema designs, using multiple fact tables with several conformed dimension tables, or when
integration of numerous source systems is required. These more complicated cases make mapping of technical meta data from the source systems more challenging.

The data warehouse team is responsible for reviewing the design, construction and maintenance impact resulting from the use of technical meta data to the repository, ETL processes, data model
design, database sizing and front-end data access tools. For example, some ROLAP (relational online analytical processing) tools require a very strict adherence in the design of the data warehouse
model in order to function properly or fully. This may preclude the use of some, or all, meta data tags on certain tables such as the time dimension. Certain types of technical meta data can
require additional ETL processing times to occur that may interfere with tight processing window schedules. For example, careful consideration to ETL processing times needs to occur in the case
where a meta data tag is used in the warehouse to indicate whether production keys, or natural keys, are still active or not in the source system.

The benefits of technical meta data use include source system identification, data quality measurement, improved management of ETL processes and database administration. Use of these technical data
tags offers warehouse administrators and business users a means for measuring the content quality of the data in the warehouse. Adding these types of meta data tags into the warehouse model and ETL
processes can help you to reconcile data quality issues in your environment. Such reconciliation offers an increase in data integrity benefiting both the warehouse technical and business users by
increasing the level of confidence in the quality of the information content.


submit to reddit

About David Marco

Mr. Marco is an internationally recognized expert in the fields of enterprise architecture, data warehousing and business intelligence, and is the world’s foremost authority on metadata.  Mr. Marco is the author of several widely acclaimed books including “Universal Meta Data Models” and “Building and Managing the Meta Data Repository: A Full Life-Cycle Guide”.  Mr. Marco has taught at the University of Chicago, DePaul University, and in 2004 he was selected to the prestigious Crain’s Chicago Business "Top 40 Under 40". He is the founder and President of EWSolutions, a GSA schedule and Chicago-headquartered strategic partner and systems integrator dedicated to providing companies and large government agencies with best-in-class business intelligence solutions using data warehousing, enterprise architecture and managed metadata environment technologies (