I have recently received many challenging questions about how to use Data Governance processes, technology, and the Business Glossary to manage and present data quality metrics to our business reporting consumers.
I’ve discussed this in seminars and articles for a while, but I believe it is time to take a different view of how we can achieve results.
Improving the quality and value of our data assets is a complex objective, and it is not the responsibility of one person. It takes a village: data consumers, stewards, technical architects, process experts, and, at times, even application developers. Success requires the support of people, processes, and technology.
– – – – – – – – – –
Some of the recent questions I’ve been asked include:
- What is the relationship between a business rule, a data quality rule, and a validation rule?
- When, where, and how do we define and manage any of those rules?
- Should our data governance policies be related to the validation rules in our MDM application?
- Who is responsible for managing business rules versus validation rules on a data asset?
- How should my Data Quality/profiling technology be integrated with the business glossary?
- We already have a data quality reporting team in IT, so why should data quality be a responsibility of the data governance program?
We have seen a lot of executive-level articles suggesting that the core objective of Data Governance should be to improve the quality of data for the organization. We all agree that improving the quality of data is an objective of data governance, but it is certainly not the sole one. So, what are the data quality challenges, and how do Data Governance and a Business Glossary resolve them?
Data Quality Challenges
In my experience, the challenges to a Data Quality program often include the following:
- We don’t have a clear understanding of who is accountable for the quality of our critical data; thus, we don’t know who should be involved in defining the rules or even correcting the data.
- We don’t really know what data is critical to include in the Data Quality program. To a data profiling tool, all data appears to be of equal importance.
- We don’t have a consistent definition of the data, nor do we agree on the valid value set for that data.
- We don’t agree on the authoritative source application and database to use as the basis for our profiling and quality metrics. There are multiple databases where the same or similar data exists for user access across the enterprise, and each may produce different profiling results.
- We do not agree on the business rules or validation rules for this data, since different business units have different business definitions and usage rules.
- We don’t know what quality dimensions are important to measure. Do we just measure completeness percentage?
- We do not have consistent data controls in the data integration flows, so measurements on different databases produce different results for what should be the same data.
- Should the data controls (or lack of them) also be considered a component of the Data Quality program?
Hopefully you are not saying “yes, that’s our challenge” to every item above. But I’ve been there and have had to address all of these challenges. Without some sort of Business Glossary, I did not have effective answers to the six operatives of what, how, where, who, why, and when. Answers to many of the questions above can be provided by Data Governance processes and captured in a Business Glossary.
I believe it is best to report data quality to Data Consumers at the Business Term level rather than at the critical data element level. Why? First, most organizations have many occurrences of each data element throughout the technology landscape; the same data element may exist 10 or even 70 times across the enterprise. Capturing the definition of the quality rules, accountability, authoritative sources, and other metadata is a challenge to do just once; it is improbable that you can manage it across every occurrence at the data element level. Second, critical data elements are defined with different physical names across applications and platforms. However, we can define all of that metadata once at the Business Term level and then relate all of the critical data element physical occurrences to the Business Term. These activities are done effectively in a Business Glossary.
The measurement of data quality has more business meaning at the Business Term level. Yet the actual measurement has to be done on the physical data element at the Authoritative Source database; no question about that. While we may have the same data attribute in many systems, applications, and databases in our environment, there should be only one Authoritative Source for that data attribute, and that is where our data quality measurement must be done.
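To make that linkage concrete, here is a minimal sketch of a glossary entry that carries the metadata once at the Business Term level and relates it to each physical occurrence, with exactly one occurrence flagged as the Authoritative Source. The class and field names (`BusinessTerm`, `PhysicalElement`, and so on) are illustrative assumptions, not the model of any particular glossary product.

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalElement:
    """One physical occurrence of a data element (illustrative names)."""
    system: str
    database: str
    column: str
    is_authoritative: bool = False  # exactly one occurrence should be True

@dataclass
class BusinessTerm:
    """Glossary entry that carries definitions and rules once, at term level."""
    name: str
    definition: str
    steward: str
    occurrences: list = field(default_factory=list)

    def authoritative_source(self):
        # Data quality is measured only at the single Authoritative Source.
        auth = [o for o in self.occurrences if o.is_authoritative]
        if len(auth) != 1:
            raise ValueError("expected exactly one Authoritative Source")
        return auth[0]

term = BusinessTerm(
    name="Customer State Code",
    definition="Two-character U.S. state code for the customer mailing address",
    steward="Jane Doe",  # hypothetical Data Content Owner
    occurrences=[
        PhysicalElement("CRM", "crm_db", "cust.state_cd", is_authoritative=True),
        PhysicalElement("EDW", "edw_db", "dim_customer.state_code"),
    ],
)
print(term.authoritative_source().column)  # cust.state_cd
```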
How Can Data Governance Be Used For Data Quality?
Prior to the popularity of Data Governance programs, data quality initiatives were most often seen as an IT issue. IT was recognized as the custodian of the data, and “Why can’t you just get the data right?” was a statement often heard from business management. Well, the definition of “right” was never something IT could complete alone. To borrow a phrase, it takes a village to achieve an acceptable level of data quality: both IT and the business teams must work together to answer the six operatives necessary to achieve it.
The Data Governance program will enable solutions to our quality challenges, and the Business Glossary is the core application for Data Governance information management. What can the Data Governance program provide to support a Data Quality program?
- Data Governance should provide an organization of people, such as a Working Group, to define the strategy around data quality for critical data. The working group is important since the resolution of data quality issues is often cross-functional and may require funding of operational application changes.
- Data Governance should identify Data Content Owners or Data Stewards who are responsible for the lifecycle and management of specific Business Terms and related data elements. These individuals should be recognized in the Business Glossary, and they use it to capture the Business Term definitions, valid values, business rules, quality rules, quality expectations of “fit for purpose”, issue management and resolution, data usage guidelines, etc.
- Data Governance should have established Data Policies that include a Data Quality Strategy identifying what, why, who, when, where, and how data quality will be managed. These Policies should be accessible from your Business Glossary as well.
- The physical data elements and authoritative sources should be related to the appropriate Business Term in the Business Glossary. Thus, the Business Glossary provides the linkage to the critical data’s physical database and column where data quality is measured. The quality measurements for each quality dimension at each period of time should be integrated into the Business Glossary. All Data Consumers can have access to the Data Quality metrics in the Business Glossary or in dashboards you create from the Business Glossary metadata.
- Data Governance allows for the coordinated definition and capture of data quality requirements for each quality dimension. The definitions, requirements, thresholds, and periodic quality metrics should be managed in the Business Glossary, which should be your repository of information about data quality and the associated metrics at the Business Term level (a sketch of one such metric record follows this list).
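As a concrete illustration, the record below sketches the shape of one periodic quality measurement as it might be stored against a Business Term. Every field name and value here is an assumed example, not a prescribed schema.

```python
from datetime import date

# Illustrative shape of one periodic data quality metric record
# attached to a Business Term; field names and values are assumptions.
metric_record = {
    "business_term": "Customer State Code",
    "quality_dimension": "consistency",
    "period_ending": date(2024, 1, 31).isoformat(),
    "measured_pct": 96.4,                           # share of rows passing the rule
    "lower_threshold_pct": 90.0,                    # minimum acceptable
    "upper_threshold_pct": 98.0,                    # desired fit for purpose
    "authoritative_source": "crm_db.cust.state_cd", # where the measurement was taken
}
print(metric_record["measured_pct"])
```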
I suggest that you categorize the data quality rules for your Business Terms along data quality dimensions similar to those listed below: accuracy, completeness, conformity, consistency, integrity, and timeliness. Your Data Governance program should determine which dimensions are important to your organization; not all dimensions may be important to every critical Business Term. A short code sketch of such rules follows the table.
| Dimension | Rule definition |
| --- | --- |
| Accuracy rule | Quality rules to capture accurate data content values. The rules must state the domain or valid value set as well as the validation rules to be applied. The data content must meet the requirements of the validation rule and the data domain valid value set. |
| Completeness rule | Quality rules that define the constraints applied to the stored data content that represent the existence of a business-valid value. The completeness rule may state what values must exist, or what values cannot exist, for the attribute to have a complete value. |
| Conformity rule | Validation rules that define how the data must agree with the data standards for data structure, domain, and format. For example, telephone number values must be numeric only and must be structured in the format +xx-xxxx-xxx-xxxx. |
| Consistency rule | Validation rules that define how the stored data conforms to the domain values or complies with applicable common reference data values. For example, the attribute “State Code” must be a value from the “State Reference List” of values distributed by the Data Content Owner. |
| Integrity rule | Validation rules that define adherence of the data content and its usage across systems and applications. Compares like stored values of the same data attribute across two or more database columns; also used for comparing two different but related attributes across two or more database columns. |
| Timeliness rule | Quality rules that define the currency of time for updates to content made available to Data Consumers. For example, the rule may state that “data is to be available in the EDW by 0800 Eastern Time.” |
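To ground these dimensions, here is a minimal sketch of what completeness, conformity, and consistency rules could look like in code, reusing the telephone format and State Reference List examples from the table. The function names, the regular expression, and the reference list contents are all illustrative assumptions, not a standard implementation.

```python
import re

# Hypothetical reference list distributed by the Data Content Owner.
STATE_REFERENCE_LIST = {"NY", "CA", "TX", "IL"}

def completeness_rule(value):
    """Completeness: a business-valid value must exist (not null, blank, or a placeholder)."""
    return value is not None and str(value).strip() not in ("", "N/A", "UNKNOWN")

def conformity_rule_phone(value):
    """Conformity: telephone value must match the +xx-xxxx-xxx-xxxx standard format."""
    return bool(re.fullmatch(r"\+\d{2}-\d{4}-\d{3}-\d{4}", str(value)))

def consistency_rule_state(value):
    """Consistency: state code must appear in the common reference data."""
    return value in STATE_REFERENCE_LIST

def measure(rows, rule):
    """Return the percentage of rows that pass the rule (the quality metric)."""
    passed = sum(1 for v in rows if rule(v))
    return 100.0 * passed / len(rows) if rows else 100.0

phones = ["+01-2345-678-9012", "555-1234", None]
print(round(measure(phones, conformity_rule_phone), 1))  # 33.3
```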
Data Quality Thresholds
Data Quality thresholds are stated as percentages to be achieved for each data quality dimension. These threshold percentages are defined through an analysis across all Data Consumers to represent the “fit for purpose” usage. In effect, the thresholds state the Data Consumer requirements for fit for purpose as percentages across all rows of a data element related to a Business Term. For example, if 950 of 1,000 rows pass the completeness rule, the completeness measure is 95 percent.
I recommend keeping two thresholds if possible. The first is a “lower” or minimum acceptable usage threshold; its purpose is to warn Data Consumers that a specific quality dimension has fallen below an acceptable range. The second threshold identifies the desired percentage, the fit for purpose, for Data Consumers’ intended usage.
Those threshold percentages are used to drive data quality improvement activities by the data governance organization. The dimensions for Data Quality thresholds are as follows:
| Dimension | Threshold definition |
| --- | --- |
| Accuracy, Completeness, Conformity, and Consistency thresholds | Both lower and upper thresholds are stated as a percentage from 0 to 100 that represents the minimum and the desired fit for purpose, respectively. |
| Integrity thresholds | Both lower and upper thresholds identifying adherence to the defined Integrity business rules and usage across databases and applications. Includes the matching of values for the same data across all databases that contain that data. |
| Timeliness thresholds | Both lower and upper thresholds that represent the degree to which data content is available to Data Consumers at the required point in time. |
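As an illustration of how the two thresholds could drive consumer-facing reporting, the sketch below classifies a measured percentage against a lower (minimum acceptable) and upper (fit-for-purpose) threshold. The status labels and the threshold values are assumptions chosen for illustration, not prescribed settings.

```python
from dataclasses import dataclass

@dataclass
class Threshold:
    """Lower (minimum acceptable) and upper (fit-for-purpose) percentages."""
    lower: float
    upper: float

def classify(measured_pct, t):
    """Map a measured percentage to a consumer-facing quality status."""
    if measured_pct < t.lower:
        return "WARNING: below minimum acceptable threshold"
    if measured_pct < t.upper:
        return "ACCEPTABLE: below fit-for-purpose target"
    return "FIT FOR PURPOSE"

completeness = Threshold(lower=90.0, upper=98.0)  # illustrative values
print(classify(95.0, completeness))  # ACCEPTABLE: below fit-for-purpose target
```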
By aligning data quality business rules to data quality dimensions and thresholds in the Business Glossary, your firm is able to understand how data is to be measured and which areas of data quality carry the greatest risk, priority, and impact for the business. Leverage your Data Governance program for your Data Quality initiatives; the Business Glossary will function as your repository for aggregating data quality metrics and reporting.
It is OK. Stay calm and allow your business glossary to prosper.