Published in TDAN.com April 2001
This paper focuses on a few practical tips for ensuring data quality specifically in the context of data warehouses sourced from multiple source systems across the globe. While many of these tips
may also be applicable for all other kinds of data warehouses, this paper focuses on the global warehouse described above.
Data Quality is one of the biggest challenges facing data warehouses, especially in companies with global operations, and multiple source systems of different vintages. It is the author’s
belief that the tips given here are critical to ensuring data quality which in turn is key to the success of the warehousing effort as a whole.
The paper is pitched towards the data warehouse practitioner and hence it is assumed that the reader is familiar with the data warehousing life cycle, and typical issues that arise in any
warehousing effort.
By no means is this list of tips intended to be a complete checklist for ensuring data quality. These, in the author’s experience, are necessary, important, and often overlooked.
Define Source Systems Correctly
Treat each unique combination of application system, line of business, and geographic region as a completely different system when identifying and evaluating candidate source systems.
This is because of likely regional variations to the same system depending on the line of business and local market conditions. Often times these variations are not minor.
Also, in the author’s experience, the same system name may be used to denote completely different systems in the different regions.
The term ‘source system’ will henceforth be used in this paper to mean a unique combination of application system, line of business, and geographic region.
Focus on Each Data Class Separately While Evaluating and Selecting Source Systems
The appropriateness of a given candidate source system as a data source for the warehouse depends on the data class(es) for which the source system is being considered as a possible source.
Examples of data classes could include: client, account, transaction, etc. Data classes can be identified from high-level warehouse data models and the business requirements.
Accordingly, the evaluation of source systems must be conducted in a focused manner for each data class. To facilitate the evaluation, a clear set of criteria needs to be developed. Examples
include: granularity of source data vis-à-vis target warehouse needs, how current the source data is, ease of availability (extraction), data quality, completeness of data, etc. The criteria
are very dependent on the requirements of the data warehouse.
Assess Data Quality Up-Front in Source Systems
The old adage “there is many a slip between the cup and the lip” holds true when it comes to differences between the definition of source data (as available in Cobol copybooks for
instance) and the actual data in the source files/tables. Examples of such mismatch include: banker code always contains ZZZZ rather than the actual banker code; country code is always blank in the
US implementation of the source system, etc.
Often such mismatches catch the project team and the business users by surprise at late stages of the project with disastrous consequences, such as project delays, lost sponsorship, inability to
deliver on project objectives, etc.
To avoid such situations, it is best to conduct an up-front assessment of data quality in the source systems. This assessment needs to be done at least for the “critical” data elements
(the definition of what is “critical” is situation specific). The data quality needs to be assessed in light of warehouse requirements, warehouse data model, and any documented data
mapping between source and target.
Data quality assessment tools are available in the market for helping with up-front quality checking before the source systems are selected and the actual extract programs written.
Assess Reference Data Quality Up-Front
Reference data is defined in this paper as various code values (and descriptions) such as transaction code, product code, etc. In the author’s experience, reference data values are not well
documented and usually in a few people’s “heads”, creating lot of downstream effort in mapping these to target warehouse values.
For instance, Product code A100 in source system X and product code Z999 in source system Y may actually be the same product. It is imperative that this be recognized in a warehouse project so that
these product codes are eventually mapped to the same target product code.
This is different from the assessment of data quality referred to in the previous tip since this deals more with the semantic meaning behind each code value, which can only be known by manual
inspection as opposed to automated checking.
It is recommended that the warehouse project team focus on understanding the actual reference data (values) available in each source system up-front (typically once the source systems are selected
and in parallel with the data mapping and data quality assessment activities).
This can cut down delivery time, but more importantly ensure high quality data in the warehouse.