Implementing Data Quality Through Metadata, Part 2

Published in TDAN.com July 2006

A special thanks to Mike Jennings for his invaluable contribution to this month’s column

This article is the concluding portion of a two-part series on implementing data quality through meta data. The first installment examined the role meta data can have in the data warehouse model
and data acquisition designs for information content and quality. This segment will examine real world examples of technical meta data tags that can be incorporated into your designs to facilitate
measurement of data quality and promote user confidence in the informational content of the warehouse. This meta data provides a semantic layer of knowledge about the information in your warehouse
that is highly valuable to both business users and IT (information technology) development staff.


Load Date

The most commonly used technical meta data tag is the load date column. It indicates when, date and/or time, a row of information was loaded into the warehouse. This “snapshot” date maintains
temporal integrity of the data in the warehouse. The column can be referenced by warehouse administrators to identify candidate rows for archival/purge processing or by users to reconcile/audit
information in the data warehouse with the source systems.


Update Date

Another common technical meta data column is the update date. This column denotes when a row was last updated in the warehouse. Like the load date, this column maintains the historical meaning
(temporal integrity) of information in the data warehouse. It is routinely used in dimensional models that implement slowly changing dimensions, (SCD) type 1 or 3 processing methods, to identify
when the row was refreshed or updated. For those not familiar with the implementation of SCD, type 1 maintains a single row per production key(s) in the dimension table, which is updated as
required over writing any history about the row. Type 3 also maintains one row per production key(s) but doubles the number of columns to keep both a current and previous view of the information.
The column is used in administration activities such as archival/purge processing or reconciliation/audit by end users.


Load Cycle Identifier

One more technical meta data tag a data warehouse development team can incorporate is the load cycle identifier. This column is a sequential identifier assigned during each load cycle to the data
warehouse regardless of the refresh frequency (e.g., daily, weekly, monthly, etc.). It can be used to easily remove data from a particular load cycle run if data corruption or other data quality
issues arise. The load cycle identifier is typically used in conjunction with a meta data repository that describes other operational statistics about the load cycle. Using the repository alone you
could determine how many and when load cycles occurred against the warehouse. Now, by tying the repository statistics to the actual warehouse content, you know exactly which rows were loaded and
when.


Current Flag Indicator

The current flag indicator tag identifies the latest version of a row in a table. It facilitates quick identification of the latest version of a row as compared to performing date comparisons. This
flag is especially useful for managing dimension tables using SCD, type two, where history of a production record is maintained. SCD 2 is used to model a dimension table when changes to relevant
columns need to be captured over time. This technique relies on the use of a production key(s) in the dimension table not changing. New surrogate keys are assigned to the dimension table when
changes to relevant columns are detected during a batch load cycle. Comparisons are made between the previously loaded production key(s) of the dimension table and the new load cycle data. Changes
to relevant columns in the new load cycle data for matching production keys are loaded with new surrogate key assignments. This tag is also very useful in non-star like data model schema designs
such as an atomic (most granular) data warehouse where structures tend to conform closer to third-normal form. Instead of querying a table for the latest date field, the ETL process assigns a “Y”
to the latest record loaded for a production/natural key(s) while setting any previously loaded record to an “N”.


Operational System Identifier

One of the most useful technical meta data tags is the operational system identifier. This tag is used to track the originating source(s) of a data row in the warehouse. This tag allows
identification of each row in a warehouse table to the sources used in its construction. This provides the user, repository architect, and data acquisition developer with a powerful means for
identifying and measuring the quality of the data received from an operational source. In the common case where your ETL process is required to extract and merge data from multiple sources, the tag
is assigned a value that represents this particular integration. For example, where a row of data is integrated from more than one operational source system (client information from a trouble
ticket, order management and billing information) a column value indicating the combination of these systems is assigned.


Active Operational System Flag

This tag is used to indicate whether the production keys (rows of data) in a warehouse table are still active in the originating operational system or systems. The active operational system flag
provides an analysis alternative to queries posed to the data warehouse. This column can be used effectively in a variety of analysis activities to identify dormant data or data that should be
constrained in reporting.


Confidence Level Indicator

One of the more intriguing technical meta data tags is the confidence level indicator. This column is used to indicate how business rules or assumptions were applied during the ETL processes for a
particular row of data through application of a ranking value. This tag provides a means of measure to a user as to the credibility level of a data row based on the transformation processing
performed. It is used to identify potential problems with data quality from source systems and to facilitate correcting these issues. For example, data from a stable source like customer were
loaded at the highest level. Data more volatile, easy to cleanse or relatively moderate to define, was loaded at the second level. The third level of data was considered more problematic to define,
such as planning or forecasting data. The fourth level consisted of data not originating from a internal operational systems but provided by management, in the form of a spreadsheet. The fifth,
level was used to tag the data from external sources such as new services or commercial sources.

Using these technical meta data tags allows your corporation the ability to vastly improve the overall data content quality by tying the meta data repository and decision support data models closer
together.

Share

submit to reddit

About David Marco

Mr. Marco is an internationally recognized expert in the fields of enterprise architecture, data warehousing and business intelligence, and is the world’s foremost authority on metadata.  Mr. Marco is the author of several widely acclaimed books including “Universal Meta Data Models” and “Building and Managing the Meta Data Repository: A Full Life-Cycle Guide”.  Mr. Marco has taught at the University of Chicago, DePaul University, and in 2004 he was selected to the prestigious Crain’s Chicago Business "Top 40 Under 40". He is the founder and President of EWSolutions, a GSA schedule and Chicago-headquartered strategic partner and systems integrator dedicated to providing companies and large government agencies with best-in-class business intelligence solutions using data warehousing, enterprise architecture and managed metadata environment technologies (www.EWSolutions.com).

Top