With my involvement in business intelligence (BI) and data warehousing (DW) projects for over a decade now, I have often seen that during the initial phases of an effort there is generally a good
focus on data architecture and the right components to build the operational data store (ODS), enterprise data warehouse (EDW) and reporting architecture. But somehow the focus on
data quality is lukewarm at best. There is always a good understanding of its significance initially, but as progress is made on requirements and design, the overwhelming list of activities
around modeling, data capture and ETL makes the team lose sight of data quality. And eventually comments like these pop up:
“Our data isn’t synchronized. The daily transaction files didn’t match the customer data.”
“Data was extracted by an individual user, then passed around to other users who don’t understand the context of the data or its relationship to other applications.”
“Oh, but we have multiple sources for the same data.”
“Important entities and their attributes are hidden and floating in text fields.”
Usually in situations like these, folks with data quality experience are brought in to clean up the data problems while still trying to achieve the business benefits that were originally built into
the business case. If the system involves data integration from a number of disparate sources (which it usually does), it is useful at times to follow a bifocal data quality strategy
(short-term and long-term) to ensure that the projects achieve their business objectives. In the short term, we can look at tactical ways to improve data quality using data profiling, data
cleansing and enhancement. I have described a Six Sigma approach for this that I have found helpful, using a DMAIC (Define, Measure, Analyze, Improve, and Control) framework. The long-term strategy
should be to implement an ongoing data quality initiative that provides a cost-effective operational data architecture for the management, storage, transformation and delivery of data. This article
details the high-level approach for both strategies; organizations can pick their place on the spectrum depending upon their appetite for investment and their focus on data quality.
Before deciding what level of investment to make in data quality efforts, it is worthwhile to consider the following key questions about the BI/DW initiative being undertaken by the organization:
- How much data quality is needed? Data quality should be framed in terms of the level needed for a given piece of business functionality; otherwise, one can theoretically spend infinite cycles chasing perfect quality.
- Do business and IT users view quality data as a prerequisite? Is data quality considered in system and business planning? Is the quality of the data investigated prior to system design?
- Are subject areas prioritized based on their importance to the business strategy?
- Has a data quality program with dedicated employees been set up?
- Does the data governance organization identify business sponsors to provide ownership, accountability, direction and buy-in for data quality improvements? Good data ownership is quintessential to proper data lineage and to the execution of a data quality effort.
- Are quantifiable metrics for a data quality baseline available and monitored on a regular basis?
- Does the team use data definition dictionaries to define fields and their meanings? Are changes to fields and data managed through a change request mechanism that measures and reports the impact of changes to field definitions?
- Are cross-field validations defined by the business and systematically enforced at the point of entry?
- Is there an archive and purge strategy to retire data that has exceeded its useful lifespan?
Depending on your initial assessment, if the general direction is to work toward some quick wins, you need to follow a short-term data quality effort to clean up the existing data sources. I have
found that the following structure (DMAIC), borrowed from Six Sigma methodology, is very useful and practical. I have used it in one- to two-week cycles, dealing with different source systems
on a regular cadence to tackle tactical data quality issues:
1. DEFINE
Create a data dictionary
Even if an elaborate effort has not been made to build a central metadata repository, the initial task for the data management team should be to do the following:
- Identify all production files from source systems
– Important fields
– Expected values
- Get information from metadata repositories, if available
– Enterprise repositories
– Data dictionaries
- Extract data from data definition artifacts
– COBOL copybooks
– PL/1 data definitions
– Database catalogs
The results of these findings could be stored in a repository as simple as an MS Excel workbook. An example of the ideal set of data architecture artifacts is shown in the figure below. The data dictionary is a key
document used through all phases of an implementation.
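As an illustration of the catalog-extraction task above, here is a minimal sketch that seeds a data dictionary skeleton from a database catalog and writes it to a CSV file that can be maintained in Excel. SQLite's PRAGMA table_info stands in for the catalog purely for illustration (on Oracle you would read ALL_TAB_COLUMNS; on SQL Server or PostgreSQL, the INFORMATION_SCHEMA.COLUMNS view), and the customer table is made up:

```python
# Sketch: seed a data dictionary from a database catalog and save it as a
# CSV that can be maintained in Excel. SQLite's catalog is used here for
# illustration only.
import csv
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up source table standing in for a real production file.
conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, "
             "cust_name TEXT NOT NULL, cust_type TEXT, zip_code TEXT)")

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # 'definition' and 'expected_values' are left for analysts to fill in.
    writer.writerow(["table", "column", "data_type", "nullable",
                     "definition", "expected_values"])
    for table in tables:
        for _, name, col_type, notnull, _, _ in conn.execute(
                f"PRAGMA table_info({table})"):
            writer.writerow([table, name, col_type,
                             "NO" if notnull else "YES", "", ""])
```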
2. MEASURE
Profile this data using tools like Informatica Profiler or just simple PL/SQL queries. Create data reports to define a data quality baseline. The team should look for things like the following (a small runnable sketch of a few of these checks appears after the list):
A. Inspect values
– Inspect data values
• Value/frequency list
• Boundary points (high/low values, short/long values)
• Values with special characters
– Take counts of values that are:
• Null or missing values
• Unique values
• Constant values or code values
– Look for usage of columns
• Low-usage columns (high percentage of blanks or zeros)
• Unused data values from code tables
• Indications of a “no value,” like null, blank, “n/a”, etc.
B. Inspect structure
– Orphaned records or referential integrity violations
– Pattern recognition – check for predefined formats and the frequency of each pattern:
• Telephone numbers – with and without dashes and parentheses
• Social Security numbers – with and without dashes
• ZIP codes – with and without the PLUS 4 field
– Derived values
• Computations like PROFIT = REVENUE – COST
• Business rules like IF CUSTOMER_TYPE = “PREFERRED” THEN DISCOUNT = 10%
– Third-normal form
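Most of these checks reduce to simple queries against a staged copy of the data. Here is a minimal runnable sketch of a few of them (value/frequency list, “no value” counts, and telephone-number pattern recognition), using SQLite and a made-up customer table; the same queries could be written in PL/SQL or produced by a profiling tool:

```python
# Sketch of a few MEASURE checks against a staged copy of the data.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER, cust_type TEXT, phone TEXT);
    INSERT INTO customer VALUES
        (1, 'PREFERRED', '555-867-5309'),
        (2, 'n/a',       '(555) 867 5309'),
        (3, NULL,        '5558675309');
""")

# Value/frequency list for a code-value column.
for value, count in conn.execute(
        "SELECT cust_type, COUNT(*) FROM customer "
        "GROUP BY cust_type ORDER BY COUNT(*) DESC"):
    print("cust_type value:", value, "count:", count)

# Nulls, blanks, and other 'no value' indicators like 'n/a'.
(missing,) = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE cust_type IS NULL "
    "OR TRIM(cust_type) IN ('', 'n/a')").fetchone()
print("missing cust_type:", missing)

# Pattern recognition: frequency of each telephone-number format.
patterns = {}
for (phone,) in conn.execute("SELECT phone FROM customer"):
    pattern = re.sub(r"\d", "9", phone or "")  # '555-867-5309' -> '999-999-9999'
    patterns[pattern] = patterns.get(pattern, 0) + 1
print("phone patterns:", patterns)
```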
The deliverable output could be in the form of a report composed of a list of defects annotated with key information, such as impact on the business, criticality level, frequency, and possible
resolutions (a sketch of one such record follows the field list). For example:
System – name of the system in which the error/defect was found
Object – name of the object (entity, table) where the error/defect was found
Field – name of the field (attribute, label) related to the error
Description of error – detailed description of the defect
Count – number of errors of this type that were found
Impact – depending on the understanding of the impact on business functionality and its importance, categorize the defect as critical, high, medium, or low
Correction effort – this is to explain the time and effort it might take to correct an error of this category
Action – this is to explain how this error type could be handled – for example, create automated update scripts, manually cleanse in source system, no action, etc.
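As one possible shape for that report, here is a sketch of a single defect record with the fields listed above; the enum-style values and the example defect are invented, not from any real system:

```python
# Sketch of one record in the defect report described above; field names
# mirror the list, and the example values are invented.
from dataclasses import dataclass

@dataclass
class Defect:
    system: str             # system in which the error/defect was found
    table: str              # object (entity, table) where it was found
    field: str              # field (attribute, label) related to the error
    description: str        # detailed description of the defect
    count: int              # number of errors found of this type
    impact: str             # critical | high | medium | low
    correction_effort: str  # estimated time and effort to correct
    action: str             # e.g., automated update script, manual cleanse, no action

example = Defect(
    system="CRM", table="CUSTOMER", field="CUST_TYPE",
    description="Misspelled code value 'PREFRRED' where 'PREFERRED' is expected",
    count=214, impact="high",
    correction_effort="Half a day: scripted UPDATE against the source system",
    action="create automated update script")
print(example)
```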
3. ANALYZE
- Compare the generated data reports with the project documentation, if it exists (data map, data dictionary, high-level designs, and low-level designs).
- Identify differences between expected values and what the reports actually show, and identify where transformations are required.
- Meet with configuration and interface designers when unexpected values appear or undocumented transformations are required, and create a plan of action for these.
The deliverable output could be a root cause document built as follows: based on the results of the initial data quality assessment, group specific data issues into more generic observations; identify the root cause of each by asking business subject matter experts (SMEs), developers and technical experts; and finally, formulate recommendations on how to tackle specific issues after talking with that same group.
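Here is a sketch of the core ANALYZE comparison: expected code values from the data dictionary versus values observed during profiling, with invented inputs. The mismatches it prints are the items to bring to the SMEs:

```python
# Expected values come from the data dictionary; observed values come from
# the MEASURE profiling reports. All inputs here are invented.
expected = {"CUST_TYPE": {"PREFERRED", "STANDARD", "INTERNAL"}}
observed = {"CUST_TYPE": {"PREFERRED", "STANDARD", "n/a", "PREFRRED"}}

for field_name, expected_values in expected.items():
    unexpected = observed.get(field_name, set()) - expected_values
    unseen = expected_values - observed.get(field_name, set())
    if unexpected:
        # Candidates for root cause analysis with SMEs and developers.
        print(f"{field_name}: unexpected values {sorted(unexpected)}")
    if unseen:
        # Documented values never observed: dead codes or data gaps.
        print(f"{field_name}: documented but never observed {sorted(unseen)}")
```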
4. IMPROVE
- Work with the development and source system teams to have them update:
– Their data dictionary, with the values expected for code value fields
– Data attribute definitions, with any new discoveries about the data
– The data map, to indicate the data type: add a data type column for attributes and populate it with the data type of each attribute for each system
- Facilitate consensus on new transformations required and/or unique aspects of data attributes
5. CONTROL
There should be ongoing monitoring of the critical data quality metrics to ensure that the cleaned data does not fall below a quality threshold. People often believe that once a fix has been
identified and applied, a given problem will remain resolved forever. While this is often the case, things are rarely that simple when people and processes are involved. Ongoing measurement
against the stored data quality baseline is advisable.
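A minimal sketch of such a control check, assuming the baseline metrics from the MEASURE step are stored as named percentages with agreed thresholds (the metric names and numbers here are invented):

```python
# Recompute the baseline metrics on a schedule and flag any that fall
# below the agreed threshold. Metric names and values are invented.
baseline_thresholds = {
    "customer.cust_type.pct_populated": 98.0,
    "customer.phone.pct_valid_pattern": 95.0,
}

current_metrics = {  # in practice, recomputed by the profiling queries
    "customer.cust_type.pct_populated": 99.2,
    "customer.phone.pct_valid_pattern": 91.4,  # regressed since the cleanup
}

for metric, threshold in baseline_thresholds.items():
    value = current_metrics.get(metric)
    if value is None or value < threshold:
        # In practice this would raise an alert or feed an audit report.
        print(f"ALERT: {metric} = {value} (threshold {threshold})")
```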
After your team has helped alleviate some immediate and tactical data problems, you can start laying some foundation for a long-term data quality strategy with the
following guidelines:
1. Data quality initiatives need to encompass improvements in all aspects of the enterprise: people, processes, and technology.
People: Establish data governance organization(s) to provide ownership, accountability, direction, and buy-in for data quality improvements. It addresses the processes,
skills, leadership and assets required to successfully administer a company’s information resources. This involves data administration, data quality services team, data governance
committee, DBAs, some business and executive members, etc.
Process: Streamline data set-up and maintenance processes to minimize handoffs. This, in turn, will reduce the risk of introducing errors into the system. Since you
have just cleaned up the major data quality issues, strict data governance needs to be enforced from this point forward (a sketch of an audit-trail table follows the list below):
- Formal change request process for any changes to metadata or data artifacts.
- Proper approvals for any changes (from team, delivery manager, data champion and anyone who could be impacted by the change).
- Proper recording of any changes (how many changes were done, how many are approved but pending implementation, how many are pending approval, and how many were rejected and why). This should be stored in a tool that has the audit trail available to the team.
- Timestamps of when the change was requested, when it was implemented and by whom.
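As a sketch of what such recording could look like, here is an illustrative audit-trail table (SQLite DDL; the column names and statuses are assumptions, not a prescribed schema) from which the counts listed above fall out as simple queries:

```python
# Illustrative audit-trail table for the change request process above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_change_request (
        request_id     INTEGER PRIMARY KEY,
        artifact       TEXT NOT NULL,  -- metadata or data artifact affected
        description    TEXT NOT NULL,
        status         TEXT NOT NULL CHECK (status IN
                         ('pending_approval', 'approved_pending',
                          'implemented', 'rejected')),
        reject_reason  TEXT,           -- why, if rejected
        requested_by   TEXT NOT NULL,
        requested_at   TEXT NOT NULL,  -- timestamps for the audit trail
        implemented_by TEXT,
        implemented_at TEXT
    )
""")
conn.execute("INSERT INTO data_change_request VALUES "
             "(1, 'data map', 'widen CUST_NAME to 120 characters', "
             "'implemented', NULL, 'analyst1', '2009-01-05', "
             "'dev2', '2009-01-09')")

# The recording requirements above become simple counts by status.
for status, n in conn.execute(
        "SELECT status, COUNT(*) FROM data_change_request GROUP BY status"):
    print(status, n)
```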
Tools: Implement system controls to reduce the introduction of errors at the point of entry. Implement data profiling/data quality toolsets to help assess and monitor the health
of the data. This involves data profiling, quality and monitoring tools, ETL tools, and audit reports.
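A minimal sketch of a point-of-entry control of this kind: a record is validated against single-field and cross-field rules before it is accepted. The rules shown are invented examples of what the business might define:

```python
# A record is checked against single-field and cross-field rules before
# being accepted into the system. The rules are invented examples.
import re

def validate(record: dict) -> list:
    errors = []
    # Single-field rule: ZIP code must be five digits or ZIP+4.
    if not re.fullmatch(r"\d{5}(-\d{4})?", record.get("zip_code", "")):
        errors.append("zip_code must be 5 digits or ZIP+4")
    # Cross-field rule: discounts are only allowed for preferred customers.
    if record.get("discount", 0) > 0 and record.get("cust_type") != "PREFERRED":
        errors.append("discount only allowed when cust_type is PREFERRED")
    return errors

# A record that passes the field rule but violates the cross-field rule.
print(validate({"zip_code": "60601", "cust_type": "STANDARD", "discount": 10}))
```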
2. Key subject areas for data quality improvements should be prioritized relative to their importance to key business strategies and initiatives. Also enough time
should be allocated for subject matter experts to support the data quality team in profiling the source systems and analyzing data quality concerns.
- The data samples provided should cover a significant period of time (at least a quarter if possible, and preferably a whole year).
- When data originates on the mainframe, it should be staged so anomalies can be scrutinized with more care using SQL statements. This also makes it easier to set up tools like Informatica Profiler, Power Analyzer, etc., used to profile the data.
3. There should be a proper process for defining what level of data quality is needed for the compensation systems and any outbound interfaces. The business analysts should
be consulted, and a formal data quality definition document should capture things like threshold levels for nulls, spaces, negative values, etc. This document should be formally signed off.
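A sketch of what such a signed-off definition might reduce to operationally, assuming the thresholds are expressed as maximum allowed percentages per field (all names and limits here are invented):

```python
# Per-field maximum allowed percentages for nulls, blanks, and negatives.
quality_definition = {
    "ORDER.ORDER_AMT":    {"max_pct_null": 0.0, "max_pct_negative": 1.0},
    "CUSTOMER.CUST_NAME": {"max_pct_null": 0.5, "max_pct_blank": 0.5},
    "CUSTOMER.ZIP_CODE":  {"max_pct_null": 5.0},
}

def within_threshold(observed_pct, field_name, rule):
    limit = quality_definition[field_name].get(rule)
    return limit is None or observed_pct <= limit

# 0.8% nulls against a 0.5% limit fails the signed-off definition.
print(within_threshold(0.8, "CUSTOMER.CUST_NAME", "max_pct_null"))  # False
```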
4. A data quality assessment plan should be created. This plan should be used to validate the data definitions and rules.
- Data profiling should be performed during the analysis phase of the project to ensure the business needs are going to be met by the source. Without a minimum level of data quality, the data source should not be used and alternative sources should be sought. A data profiling template, filled in for all data elements, could be used for this work (a sketch of such a template follows).
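Since the template itself is not reproduced here, the following sketch of a per-element profiling record is assembled from the checks described in the MEASURE step; the exact fields are assumptions:

```python
# Sketch of a per-element profiling record; every field is an assumption
# derived from the MEASURE checks, not the original template.
from dataclasses import dataclass, field

@dataclass
class ProfileResult:
    source_system: str
    table: str
    column: str
    data_type: str
    row_count: int = 0
    null_count: int = 0
    distinct_count: int = 0
    min_value: str = ""
    max_value: str = ""
    patterns: dict = field(default_factory=dict)  # format -> frequency
    notes: str = ""                               # analyst observations
```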
5. During the analysis and design phases, the data needs to be defined in both technical and business terms. Examples of technical terms include data type, length, and number of decimal places. This definition step also captures any rules the data should satisfy.
6. These definitions should be included in the data map and data dictionary. The data maps should have data type definitions, ranges, lengths, and description of columns.
7. The data map should be the one true source of data definition for the project.
- The data in the data map should also be in sync with the data as defined in the corporate metadata repository.
- The data map should include data information for all operational data stores, data warehouses, data marts, views, etc.
8. The data architecture team should be hands-on during the analysis and design phases, with artifacts like the logical data model, physical data model, data maps, etc. This team
should also be involved in reviewing the ETL development process during build and deploy, for things like code reviews.
9. After the data artifacts are finalized, any data object change should go through a data change request process involving a committee represented by team members from data
architecture, development, business analysis, and testing.
10. Like any other corporate effort, proper communication is key to the success of the data quality initiative. Raising awareness of the project and frequently communicating
the benefits of the effort will help foster a culture where data is viewed as a corporate asset.