I have been fortunate to lead many data quality improvement efforts over my professional life. Those projects have included quality improvement efforts in individual operational applications, ERP, reference data and master data management, BI and Analytics reporting, Internet environments, as well as data governance programs. We have had varying degrees of success with those efforts. The most successful have been those driven by the Data Governance organization, leveraging its processes and people and supported by technologies such as the Business Glossary. There are likely many reasons for this, but one is certainly that Data Governance provides a sustainable platform and resources for improving data quality.
That fact should not be a surprise to anyone who has experience in data quality improvement projects. While no one should question that data is a great asset to their organization, your critical data can be viewed as a depreciating asset. Data left unmanaged has a tendency to become stale, thus losing its value to the consumers who need to use it. Data is not like wine; it does not improve with age. The opposite is true; thus we need sustainable processes and technology to keep our critical data valuable.
I have heard many professionals say that the sole objective of Data Governance is to improve the quality of data for the organization. We all agree that improving the quality of data is an objective of data governance, but it is certainly not the sole objective. Yet, what are the data quality challenges, and how do Data Governance and a Business Glossary resolve them?
Data Quality Challenges
From my experience, the challenges to the Data Quality program often include many or all of the following:
- We don’t have a clear understanding of who is accountable for the quality of the critical data, so we don’t know who will be involved in defining or correcting it.
- We don’t really know what data is critical to include in the Data Quality program.
- We don’t have a consistent definition of the data nor do we agree on the value set for this data.
- We don’t agree on the authoritative source application and database for the data to use as a basis for our profiling and quality metrics. There are multiple databases where the same or similar data exists for user access across the enterprise and each may produce different profiling results.
- We do not agree on the business rules for this data as different business units have different business and usage rules.
- We don’t know what quality dimensions are important to measure. Do we just measure completeness percentage?
- Is it appropriate to measure data quality with our SQL scripts each weekend? We don’t have a data quality tool to fully automate the quality measures thus we have an inconsistent history of the quality measures that we do create.
- We do not have consistent data controls in the data integration flows thus measures on different databases produce different results for what should be the same data.
- Should the data controls (or lack of them) also be considered a component of the Data Quality program?
Hopefully you are not saying, “Yes, that’s our challenge” to all of the above. But I’ve been there and have had to address all of the above challenges. Without some sort of Business Glossary, I did not have effective answers to the 6 operatives of what, how, where, who, why, and when. Answers to those questions can be provided by Data Governance processes and captured in a Business Glossary.
I believe it is best to report data quality to Data Consumers at a Business Term level rather than at a critical data element level. Why? First, most organizations have many occurrences of each data element throughout the technology landscape. The same data element may exist 10 or even 70 times across the enterprise. Capturing the definition, the business rules, quality rules, accountability, authoritative sources, and other metadata is a challenge to do just once. It is improbable that you can manage all of that for every occurrence of every critical data element across the enterprise. Second, we have critical data elements defined with different physical names across applications and platforms. Again, defining all the associated metadata is too daunting to manage at the data element level. Yes, it can be said that it is a “bottom-up then top-down process” for identifying the critical data in the quality program. However, we can define all that metadata once at the Business Term level and then relate all of the physical occurrences of the critical data element to the Business Term. These activities are done effectively in a Business Glossary.
The measurement of data quality has more business meaning at the Business Term level. Yet, the actual measurement is done on the physical data element and at the Authoritative Source database. No question on that. While we may have the same data attribute in many systems, applications, and databases in our environment, there should only be one Authoritative Source for that data attribute and that is where our data quality measurement must be done.
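The relationship between a Business Term, its physical occurrences, and the single Authoritative Source can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, field names, and the sample databases and columns are hypothetical and do not describe any particular Business Glossary product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PhysicalElement:
    """One physical occurrence of a data element (hypothetical naming)."""
    database: str
    table: str
    column: str
    authoritative: bool = False


@dataclass
class BusinessTerm:
    """Glossary entry: metadata is defined once, occurrences are linked to it."""
    name: str
    definition: str
    steward: str
    elements: list = field(default_factory=list)

    def authoritative_source(self) -> PhysicalElement:
        """Return the single occurrence where quality is measured."""
        sources = [e for e in self.elements if e.authoritative]
        if len(sources) != 1:
            raise ValueError(
                f"{self.name!r} needs exactly one authoritative source, "
                f"found {len(sources)}"
            )
        return sources[0]


# Hypothetical example: one term, three physical names across platforms.
term = BusinessTerm(
    name="Customer State Code",
    definition="Two-letter code of the state in the customer's mailing address.",
    steward="Customer Data Steward",
    elements=[
        PhysicalElement("crm_db", "customer", "st_cd", authoritative=True),
        PhysicalElement("edw", "dim_customer", "state_code"),
        PhysicalElement("billing_db", "account", "cust_state"),
    ],
)

source = term.authoritative_source()
print(f"Measure {term.name} at {source.database}.{source.table}.{source.column}")
```

Raising an error when zero or several occurrences are flagged as authoritative is a design choice in this sketch; it makes a disagreement about the Authoritative Source visible before any profiling is run.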
How Can Data Governance be Used for Data Quality?
Prior to the popularity of Data Governance programs, data quality initiatives were most often seen as an IT issue. IT was recognized as the custodian of the data. “Why can’t you just get the data right” was often a statement heard from business management. Well, the definition of “right” was never something IT could complete alone. To borrow a phrase, “it takes a village” to achieve an acceptable level of data quality. By that I mean both IT and the business teams must work together to answer the 6 operatives necessary to achieve acceptable data quality.
The Data Governance program will enable solutions to our quality challenges, and the Business Glossary is the core application for Data Governance information management. What can the Data Governance program provide to support a Data Quality program?
- Data Governance should provide an organization of people such as a Working Group or Board to define the strategy around data quality and critical data. The working group is important since the resolution of data quality issues is often cross-functional and may require funding of operational application changes.
- Data Governance should have identified Data Content Owners or Data Stewards who are responsible for the life-cycle and management of specific Business Terms and associated data elements. These individuals should be recognized in the Business Glossary. These individuals work with the Business Glossary to capture the Business Term definitions, valid values, business rules, quality rules, quality expectations of “fit for purpose,” issue management and resolution, data usage guidelines, etc.
- Data Governance should have established Data Policies that include a Data Quality Strategy that identifies what, why, who, when, where, and how data quality will be managed. These policies should be accessible from your Business Glossary as well.
- The physical data elements and authoritative sources should be related to the appropriate Business Term in the Business Glossary. Thus, the Business Glossary provides the linkage to the critical data physical database and column where data quality is measured. The quality measurements for each quality dimension at each period of time should be integrated into the Business Glossary. Thus, all Data Consumers can have access to the Data Quality metrics via the Business Glossary or dashboards you create from the Business Glossary metadata.
- Data Governance allows for the coordinated definition and capture of data quality requirements for each quality dimension. These definitions, requirements, thresholds, and periodic quality metrics should be managed in the Business Glossary. The Business Glossary should be your repository of information about data quality and associated metrics at the Business Term level.
I suggest that you categorize the data quality business rules for your Business Terms by data quality dimension, along the lines of the dimensions listed below. There are additional quality dimensions that can be measured. However, the dimensions listed below are those that most organizations seem to measure. The dimensions of data quality that are defined below are accuracy, completeness, conformity, consistency, integrity and timeliness. Your Data Governance program should determine the dimensions that are important to your organization. Not all dimensions may be important to all critical Business Terms.
Accuracy rule | Business rules to capture accurate data content values. The business rules must state the domain or valid value set as well as the validation rules to be applied. Valid values are a different metadata attribute in the Business Glossary. The data content must meet the requirements of the business rule and data domain valid value set. |
Completeness rule | Business rules that define the constraints applied to the stored data content which represents the existence of a business valid value. A completeness business rule may state what values must exist or what values cannot exist for the attribute to have a complete value. |
Conformity rule | Business rules to define how the data must conform to the data standards for data structure, domain, and format. For example, telephone number values must contain digits only, apart from the separators, and must be structured in the format +xx-xxxx-xxx-xxxx. |
Consistency rule | Business rules to define how the stored data conforms to the domain values or complies with applicable common reference data values. For example, the attribute “State Code” must be a value from the “State Reference List” of values distributed by the Data Content Owner. |
Integrity rule | Business rules that define adherence to the data content and its usage across systems and applications. These rules compare like stored values of the same data attribute across two or more database columns. They are also used for comparing two different but related attributes across two or more database columns. |
Timeliness rule | Business rules to define the currency of time for updates to content made available to Data Consumers. For example, the rule may state that “data is to be available in the EDW by 0800 Eastern Time”. |
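Several of the rule types in the table above can be expressed as simple checks whose pass rate becomes the measured percentage for that dimension. The following is a minimal sketch, not a definitive implementation: the function names, the three-value reference list, and the sample rows are all hypothetical, and only the telephone format and the “State Code” example come from the table.

```python
import re

# Conformity example from the table: +xx-xxxx-xxx-xxxx, digits between separators.
PHONE_FORMAT = re.compile(r"\+\d{2}-\d{4}-\d{3}-\d{4}")

# Hypothetical excerpt of a "State Reference List".
STATE_REFERENCE_LIST = {"CA", "NY", "TX"}


def is_complete(value):
    """Completeness: a value exists and is not blank."""
    return value is not None and str(value).strip() != ""


def conforms(value):
    """Conformity: the value matches the required telephone format."""
    return isinstance(value, str) and PHONE_FORMAT.fullmatch(value) is not None


def is_consistent(value):
    """Consistency: the value appears in the reference list."""
    return value in STATE_REFERENCE_LIST


def pass_rate(values, rule):
    """Percentage of values that satisfy the rule, from 0 to 100."""
    if not values:
        raise ValueError("cannot measure an empty column")
    passed = sum(1 for v in values if rule(v))
    return 100.0 * passed / len(values)


# Hypothetical sample rows from an authoritative source column.
phones = ["+01-0555-010-0199", "555-0100", None, "+44-0207-946-0000"]
states = ["CA", "NY", "ZZ", "TX"]

completeness = pass_rate(phones, is_complete)   # 3 of 4 rows have a value
conformity = pass_rate(phones, conforms)        # 2 of 4 rows match the format
consistency = pass_rate(states, is_consistent)  # 3 of 4 rows are in the list

print(completeness, conformity, consistency)
```

In this sketch a missing telephone number counts against conformity as well as completeness. Whether empty values should be excluded from the conformity denominator is a decision for the Data Steward to record alongside the rule.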
Data Quality Thresholds
The Data Quality thresholds are stated as percentages to be achieved for each data quality dimension. These threshold percentages are defined through an analysis across all Data Consumers to represent the usage “fit for purpose.” Data Consumer thresholds effectively state the Data Consumers’ fit for purpose requirements as percentages across all rows of a data element related to a Business Term.
I recommend keeping two thresholds if possible. The first is a “lower” or minimum acceptable usage threshold. The purpose of this is to provide usage warnings to Data Consumers that a specific quality dimension has fallen below an acceptable range. The second threshold is used to identify the desired percentage – the fit for purpose – for Data Consumers’ desired usage purposes.
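The two-threshold idea can be sketched as a small classifier that turns a measured percentage into a status for Data Consumers. The class name, the status labels, and the sample percentages are assumptions made for illustration; your Data Governance program would choose its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Lower and desired thresholds for one quality dimension, in percent."""
    lower: float   # minimum acceptable usage percentage
    target: float  # desired "fit for purpose" percentage

    def __post_init__(self):
        if not 0 <= self.lower <= self.target <= 100:
            raise ValueError("expected 0 <= lower <= target <= 100")

    def status(self, measured: float) -> str:
        """Classify a measured percentage (labels are hypothetical)."""
        if measured < self.lower:
            return "warning"            # below the minimum acceptable range
        if measured < self.target:
            return "needs improvement"  # usable, but short of the desired level
        return "fit for purpose"


# Hypothetical thresholds for the completeness of one Business Term.
completeness_threshold = Threshold(lower=90.0, target=98.0)

for measured in (85.0, 95.0, 99.0):
    print(measured, completeness_threshold.status(measured))
```

A measurement that lands between the two thresholds is the interesting case: the data is still usable, but it is a candidate for an improvement activity.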
Those threshold percentages are used to drive data quality improvement activities by the data governance organization. The dimensions for Data Quality thresholds are as follows:
Accuracy, Completeness, Conformity and Consistency thresholds | Both lower and upper thresholds are stated as a percentage from 0 to 100 that represent the minimum and desired fit for purpose, respectively. |
Integrity thresholds | Both lower and upper thresholds identifying adherence to defined Integrity business rules and usage across databases and applications. Includes the matching of values for the same data across all databases that contain the same data. |
Timeliness thresholds | Both lower and upper thresholds that represent the degree to which data content is available to Data Consumers at the required point in time. |
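Integrity and timeliness can also be reduced to percentages so that they compare against lower and upper thresholds in the same way as the other dimensions. The sketch below is one possible approach under stated assumptions: the keyed dictionaries stand in for two databases holding the same attribute, the 08:00 deadline is borrowed from the timeliness rule example, time zones are ignored, and all names and sample values are hypothetical.

```python
from datetime import time


def integrity_match_rate(authoritative: dict, replica: dict) -> float:
    """Percentage of authoritative keys whose value matches in the replica."""
    if not authoritative:
        raise ValueError("cannot measure an empty authoritative source")
    matched = sum(
        1
        for key, value in authoritative.items()
        if key in replica and replica[key] == value
    )
    return 100.0 * matched / len(authoritative)


def on_time_rate(arrivals: list, deadline: time) -> float:
    """Percentage of loads that were available at or before the deadline."""
    if not arrivals:
        raise ValueError("cannot measure an empty arrival history")
    on_time = sum(1 for arrived in arrivals if arrived <= deadline)
    return 100.0 * on_time / len(arrivals)


# Hypothetical "State Code" values keyed by customer id in two databases.
crm = {101: "CA", 102: "NY", 103: "TX", 104: "CA"}
edw = {101: "CA", 102: "NJ", 103: "TX"}  # 102 differs, 104 is missing

# Hypothetical daily availability times for one week of EDW loads.
arrivals = [time(7, 45), time(8, 0), time(8, 20), time(7, 30), time(9, 5)]

integrity = integrity_match_rate(crm, edw)        # 2 of 4 keys match
timeliness = on_time_rate(arrivals, time(8, 0))   # 3 of 5 loads on time

print(integrity, timeliness)
```

Treating a key that is missing from the replica as a mismatch is a choice made here for simplicity; a program could equally track missing and differing values as separate measures.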
By aligning data quality business rules with data quality dimensions and thresholds in the Business Glossary, your firm can understand how data is to be measured and what areas of data quality have the greatest risk, priority, and impact on the business. Leverage your Data Governance program for your Data Quality initiatives. The business glossary will function as your repository to aggregate data quality management and reporting.
It is OK, stay calm, and allow your business glossary to prosper.