This column is the next in a series on how organizations achieve and sustain significant improvements in fundamental data management disciplines on their journey to Great Data. Content and recommendations are based on my work with many organizations to evaluate, accelerate and enhance their EDM Programs.
We’re going to explore “What Good Looks Like” for an organization’s data quality program, with examples of benefits and challenges to overcome. We’ll also provide some suggested approaches, key implementation steps, and work products you need to help your organization build strong data quality capabilities and evolve to a ‘quality culture.’
In a previous column, How to Tackle a Data Quality Strategy, we outlined how to manage a project to develop an organization-wide data quality strategy, including roles and activity steps. In another, Accelerating Enterprise Data Quality, we explained how to develop a data quality program from the ground up, employing the mechanism of a pilot project with activity steps and work products. If you are engaged in launching either of these initiatives, check out those columns for practical and useful tips.
The Data Management Maturity (DMM) Model’s Data Quality category comprises practices in four process areas that combine to create a holistic approach to building data quality capabilities and creating a data quality program. The two columns mentioned focused primarily on the Data Quality Strategy and Data Profiling process areas, respectively. Now, we’ll discuss the business-led evaluation of data quality, Data Quality Assessment, and the activities and work products that foster data quality governance and continuous improvements.
Let’s briefly review – There are many paths by which incorrect data may enter your systems. Some of the main causes of poor data quality are:
- When data is entered by a person, it is subject to data entry mistakes: entering the wrong data, failing to enter expected data, or entering a default value because the correct information is unavailable.
- When data is acquired from another source, it may contain inconsistencies, such as discrepancies in data definitions, missing, incomplete, or inaccurate data.
- Improper design can also result in bad data, for example, free-text fields for phone numbers, or incorrect field lengths, formats, or precision.
- Data quality efforts are mostly reactive. An issue or error is often identified because it is noticed by system users, or is a component of a report and found to be the cause of inaccuracies.
- Organizations typically approach data quality improvements project-by-project, data store-by-data store. This can result in different rules for the same data. For shared data, if rules and procedures are developed independently for each location, the organization incurs excessive effort and costs, and uncoordinated, inefficient implementations (for example, repeatedly cleansing data in a downstream data store, while data quality at the source is not improved).
Put all that together, and you can see that the elusive goal of “zero defects” frequently disappears in the rear-view mirror. There is no such thing as ‘perfect data’ across an entire organization. We should accept that at the outset, and develop criteria to weight the importance of issues, and determine what we can accomplish.
Automated tools can support the effort to identify data defects (errors) and anomalies (unexpected results that may be errors). Data profiling tools can help identify defects in physical data stores, and AI detection software can surface patterns that may indicate errors. However, the people who produce and use the data are needed to make definitive decisions about the data’s condition.
“You Are Where You Are” – Building a Foundation
Executives realize that data quality is very important to improve business operations and decisions, as well as avoid risk. They frequently hear complaints about the data from their staff and internal consumers, and are probably aware of major known issues. They do want to improve the data, but are also wary of incurring excessive costs. As one Chief Technology Officer said to me, “I hate data quality – it’s so expensive!” And he is correct – if the organization leaves data quality efforts to siloed project teams and remedial ad hoc efforts, it will incur repeated efforts (and costs) across the organization.
Some organizations have not taken up the challenge of moving forward from square one. For example, one global corporation spent millions of dollars every month on a data quality team that cleansed financial data from many regional financial systems. The company didn’t make the effort to analyze the root causes of the issues, and had no master plan to consolidate or redesign the separate systems. That’s a sure-fire recipe for repeated spending, month after month, year after year.
If the organization doesn’t make a broader commitment to improve the condition of the data, it will always be on a treadmill. Everyone would like the data to be better than it is. To make that happen over the long haul, organizations need to commit to three transformative initiatives:
- Enhance data governance engagement through defining robust data quality processes
- Educate staff in these processes and provide policies, mentoring, and guidance
- Plan and prioritize where to focus – by major program implementation, by domain, or by business line – and ideally, formalize in a data quality strategy and roadmap.
Pop quiz! What’s the data quality process that is rarely defined, modeled, mandated, and enforced?
Answer – The proactive, governance-intensive activities comprising Data Quality Assessment – the business-driven, responsibility-accepting, roll-up-the-sleeves work that probably accounts for 75-80% of the accomplishments that lead to the greatly desired, but seldom achieved ‘Quality Culture.’
The business owns the data. So, walk the talk, yes? How many times have we heard statements like “IT is responsible for the quality of the data” from executives, managers, and staff? Doesn’t it drive you crazy? Really! Business responsibility for assessing quality is not a new idea. Total Information Quality Management, a seminal work by Larry P. English, which synthesizes a comprehensive approach to establishing and maintaining data quality, has been highly influential since the 1990s. It’s spelled out in black and white in the framework: Process 2 – Assess Information Quality.
The organization needs to acknowledge factual reality – it will take the involvement of many individuals, over a significant period of time, to achieve a satisfactory condition of data across the most important data stores. My previous columns Enterprise Data & EDM Strategy and Know Thy Enterprise Data address how to determine what ‘enterprise data’ is for your organization, and how to approach designating critical data elements – both assist in sharpening your focus. In the data quality strategy column mentioned earlier, we address how to prioritize your organization-wide quality plan.
When we explore what’s involved in the Data Quality Assessment process, we’ll assume that the organization has selected one or more segments of data to concentrate on first – for example, a data domain, or a major shared data implementation effort (e.g., data lake Ingestion Data Set #1, adding new data to a data warehouse, or implementing a data mastering solution).
Making Proactive Decisions About Data is the Path to Progress
The organization’s governance structures and operations need to expand to emphasize four foundational areas of focus. Organizations that have implemented best practices in this area have achieved these elements:
- Data steward engagement in defining quality objectives and characteristics
- A systematic method to measure and monitor quality
- Application of business priorities to decide what to remediate
- Collaboration with IT partners to determine how to improve quality.
Overall, for data governance, this might mean designating additional business data stewards per data domain or business line, defining the events or triggers for when they engage, and enhancing data governance role descriptions to include specific data quality tasks.
Let’s explore what needs to happen, through the dual lenses of what the organization needs to commit to, and the role of the data steward / business data expert.
Step One – Defining Quality
Organizations need to implement a systematic, business-driven approach to evaluate and measure data quality, to ensure fitness for purpose, establish targets and thresholds, and employ data quality dimensions to analyze quality in detail.
Starting at the top: What does quality look like? You must decide. We can encapsulate ‘good’ as follows:
- The right data – the data set(s) you need to produce or use to fulfill a business purpose
- At the right time – data that meets currency (time period) requirements, and is available when you need it
- In the right condition – data that conforms to characteristics (dimensions) that make it useful.
These decisions can only be made by data producers and consumers. IT can support you in analyzing what quality is from the perspective of the business, but it cannot determine this by itself. The elements necessary to define ‘quality’ are:
- Fitness for purpose – defining the data set(s) that will satisfy requirements to perform a business process, make a business decision, or create useful reports and models to answer key questions, analyze trends, or model business scenarios. The data set may be narrowly or broadly defined, and it may originate from a single data source or from multiple sources. Conversely, the determination of fitness also includes the identification of how existing data is currently not fulfilling its business purpose.
- Quality targets – a ‘target’ is an aspirational goal for the data’s desired condition; for example, “99.9% of Patients shall be uniquely identified in the registration system.” The target must be quantifiable, and should be realistic, measurable, and achievable.
- Quality thresholds – a ‘threshold’ is the lowest level of quality acceptable for the business purpose, for example, “No more than one percent of Patient records may be duplicates.” You define the threshold by answering the question, “What can we live with, without major impact on our work?” Targets are the state you want the data to be in, and thresholds are your considered tolerance for non-perfection.
- Quality dimensions – a ‘dimension’ is a characteristic (or criterion) pertaining to the data’s condition. Some dimensions are termed “inherent” – they apply to each data object per se; and some are termed “pragmatic” – they apply to the data’s use. Leading data management organizations and academic institutions may advocate different lists of dimensions. Those presented below are in common use by most organizations.
An organization needs to adopt the use of dimensions as a key component of improving the data’s condition, through defining and implementing a process with activity steps and corresponding roles, and then require that it be followed, according to a set of criteria for scope and size.
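The relationship between a target and a threshold can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `QualityObjective` class, its field names, and the record counts are hypothetical, while the 99.9% target and the one percent tolerance echo the Patient examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityObjective:
    """Hypothetical container for one dimension's target and threshold."""

    dimension: str
    target: float      # aspirational goal, e.g. 0.999 means 99.9%
    threshold: float   # lowest acceptable level, e.g. 0.99 means 99%

    def evaluate(self, observed: float) -> str:
        """Classify an observed rate against the threshold and the target."""
        if observed < self.threshold:
            return "below threshold"
        if observed < self.target:
            return "acceptable"
        return "meets target"


# Assumed figures for illustration: 100,000 Patient records, 400 duplicates.
total_records = 100_000
duplicate_records = 400
uniqueness_rate = (total_records - duplicate_records) / total_records

patient_uniqueness = QualityObjective("uniqueness", target=0.999, threshold=0.99)
status = patient_uniqueness.evaluate(uniqueness_rate)
print(f"{uniqueness_rate:.2%} unique: {status}")
```

With these invented counts the data clears the threshold but has not yet reached the target, which is exactly the zone where prioritization decisions are needed.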
Let’s take a hypothetical scenario in which this activity should be mandated: A company wants to ‘improve the customer experience.’ They’re going to implement a Customer Relationship Management System, and there are currently multiple operational systems which can create a new customer. This is a major effort. In addition to profiling the data from the candidate sources to find defects prior to migration, the data working group (composed of recognized data stewards and other subject matter experts) needs to evaluate the data’s condition through the data quality assessment process, applying dimensions for each data object. For smaller projects, non-production use cases, or data that isn’t shared, the process may not be required. We’re not talking about boiling the ocean, but rather, training the big guns on a worthy adversary – highly shared, or operationally critical data.
If you’re in the early stages of a quality analysis and improvement effort, it can be useful to ‘back into’ your evaluation by surveying and itemizing what is wrong with the data. You can pose these questions to get started:
- Why can’t we trust the data now?
- What are the most important persistent data issues?
- What are the specific negative impacts on business processes?
There may be long-standing data quality issues that haven’t been addressed, either because responsibility has not been assigned, or the affected consumers have simply accepted the need for recurring data fixes and extensive manual reconciliation.
You can begin by listing known issues, for instance, duplicate records, missing zip codes, inconsistent status codes, multiple names for the same data, inconsistent reports caused by divergent sources for the same data, etc.
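A first inventory of known issues can often be confirmed with very simple checks before any tooling is involved. The sketch below is hypothetical throughout: the sample customer rows, the field names, and the choice of name plus zip code as a duplicate key are assumptions made only for illustration.

```python
from collections import Counter

# Invented sample rows standing in for a customer extract.
customers = [
    {"id": 1, "name": "Ann Lee", "zip": "20001", "status": "Active"},
    {"id": 2, "name": "Ann Lee", "zip": "20001", "status": "ACTIVE"},
    {"id": 3, "name": "Raj Patel", "zip": None, "status": "A"},
    {"id": 4, "name": "Mia Cruz", "zip": "", "status": "Inactive"},
]

# Duplicate candidates: more than one row sharing the assumed key.
key_counts = Counter((c["name"], c["zip"]) for c in customers)
duplicate_keys = [key for key, count in key_counts.items() if count > 1]

# Missing zip codes: null or empty values.
missing_zip_ids = [c["id"] for c in customers if not c["zip"]]

# Inconsistent status codes: list every distinct spelling in use.
status_variants = Counter(c["status"] for c in customers)

print("Duplicate candidates:", duplicate_keys)
print("Rows missing zip:", missing_zip_ids)
print("Status spellings:", sorted(status_variants))
```

Results like these are only candidates. Data stewards still decide whether two rows really describe the same customer, or which status spelling is the correct one.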
Armed with an understanding of the existing problems, you can focus on the desired data conditions by applying the quality definition approach we outlined above. Once you have determined fitness for purpose, targets, thresholds, and applied dimensions, you’re ready to analyze and codify requirements that will ensure the data’s satisfactory condition.
Step Two – Measuring Quality
There are two aspects to measurement of data quality:
- Prevention – requirements for each data object that must be met to prevent data quality defects (i.e., “rules”)
- Monitoring – measures and metrics, applied systematically, that evaluate the extent to which rules are successful at preventing poor data quality.
Let’s briefly clarify the term ‘quality rule.’ Quality requirements primarily apply to a specified data object, while business rules primarily involve logic. An example of a quality rule, stated in business terms, is “Marital Status must not be null.” A business rule related to Marital Status might be, “When a person’s Divorce Date is created, the Marital Status value “M” (Married) is expired, and a new value “D” (divorced) is created.” Hence the importance of dimensions, which force you to start by zeroing in on each data object.
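To make the distinction concrete, here is a minimal Python sketch of the two Marital Status examples. The function names, the record layout, and the `status_history` field are assumptions introduced for illustration; a real implementation would depend on your systems.

```python
from datetime import date


def marital_status_not_null(record: dict) -> bool:
    """Quality rule: 'Marital Status must not be null.'"""
    return record.get("marital_status") not in (None, "")


def apply_divorce(record: dict, divorce_date: date) -> dict:
    """Business rule: recording a Divorce Date expires 'M' and creates 'D'."""
    updated = dict(record)  # leave the caller's record untouched
    history = list(record.get("status_history", []))
    if record.get("marital_status") == "M":
        history.append({"value": "M", "expired_on": divorce_date})
    updated["status_history"] = history
    updated["divorce_date"] = divorce_date
    updated["marital_status"] = "D"
    return updated


person = {"name": "A. Example", "marital_status": "M"}
divorced = apply_divorce(person, date(2020, 1, 15))

print(marital_status_not_null(person))   # the quality rule checks one data object
print(divorced["marital_status"], divorced["status_history"])
```

The quality rule inspects a single data object in isolation, while the business rule encodes logic that relates one event to a change in another value.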
Many quality rules are specified in logical data models and enforced through the physical system or database constraints, for example, entities must have a unique identifier, or the format of an attribute must be numeric, etc.
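As one illustration of rules enforced by the database itself, the sketch below uses Python’s built-in `sqlite3` module with an in-memory database. The `person` table, its columns, the set of status codes, and the five-character postal code check are all hypothetical choices for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        person_id      INTEGER PRIMARY KEY,   -- unique identifier
        marital_status TEXT NOT NULL          -- must not be null
                       CHECK (marital_status IN ('S', 'M', 'D', 'W')),
        postal_code    TEXT CHECK (length(postal_code) = 5)
    )
    """
)


def try_insert(row):
    """Attempt an insert and report whether the constraints accepted it."""
    try:
        conn.execute("INSERT INTO person VALUES (?, ?, ?)", row)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    try_insert((1, "M", "20001")),   # valid row
    try_insert((1, "D", "20002")),   # duplicate identifier
    try_insert((2, None, "20003")),  # null Marital Status
    try_insert((3, "X", "20004")),   # status code outside the allowed set
]
for outcome in results:
    print(outcome)
```

Only the first row is stored. The other three are stopped at the point of entry, which is the preventative effect the quality rules are meant to have.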
If you’re involved in a project for a new implementation, your analysis may discover both quality rules and business rules; this is a typical situation. Just be sure that you document both, and if you’re evolving a data model, ensure that rules which can be enforced by a database are reflected in the logical data model.
You’ll need to work with your Information Technology partners (who hold up half the sky, after all) for their input on the feasibility of your quality rules, since they will be implementing them in physical data stores. Your organization’s data quality assessment process should specify technology roles and activities, building in collaboration and cooperation as a standard.
Once rules are defined, approved, and implemented, you have not only proactively implemented preventative requirements to improve the quality of data, you have also compiled a set of initial measures – e.g., targets A, B, C, thresholds D, E, F, and quality rules G, H, I – to apply to production data. Measures are typically counts, as in “how many null values in Marital Status?” And from them, you develop metrics, which are calculated, such as “what percentage of Marital Status values are null?”
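The difference between a measure and a metric is easy to see in code. This is a minimal sketch using invented sample records; the function names are hypothetical.

```python
# Invented sample records for illustration.
records = [
    {"id": 1, "marital_status": "M"},
    {"id": 2, "marital_status": None},
    {"id": 3, "marital_status": "D"},
    {"id": 4, "marital_status": None},
    {"id": 5, "marital_status": "S"},
]


def null_count(rows, field):
    """Measure: a raw count of null values in the field."""
    return sum(1 for row in rows if row.get(field) is None)


def null_percentage(rows, field):
    """Metric: the measure expressed as a percentage of all rows."""
    if not rows:
        return 0.0
    return 100.0 * null_count(rows, field) / len(rows)


measure = null_count(records, "marital_status")
metric = null_percentage(records, "marital_status")
print(f"{measure} null values, {metric:.1f}% of records")
```

The count answers “how many,” and the percentage makes the result comparable across data stores of different sizes and against your targets and thresholds.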
The process that your organization defines for evaluating data quality – targets, thresholds, and rules – and capturing appropriate metrics should specify that rules for shared data will be populated into a quality rules repository, enabling their use wherever the data may be stored.
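One way to picture a quality rules repository is as a single registry of rules keyed by data element, which any data store holding that element can consult. The sketch below is a hypothetical in-memory stand-in, not a description of any particular product; the class, the rule names, and the five-digit zip code rule are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[object], bool]


class RuleRepository:
    """Hypothetical in-memory stand-in for a shared quality rules repository."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[Tuple[str, Rule]]] = {}

    def register(self, element: str, name: str, check: Rule) -> None:
        """Store a named rule against a data element."""
        self._rules.setdefault(element, []).append((name, check))

    def violations(self, record: dict) -> List[str]:
        """Apply every registered rule for the elements this record holds."""
        failed = []
        for element, rules in self._rules.items():
            if element not in record:
                continue  # this store does not hold the element
            for name, check in rules:
                if not check(record[element]):
                    failed.append(f"{element}: {name}")
        return failed


repo = RuleRepository()
repo.register("marital_status", "not null", lambda v: v is not None)
repo.register(
    "zip_code",
    "five digits",
    lambda v: isinstance(v, str) and len(v) == 5 and v.isdigit(),
)

# Two different stores holding overlapping shared data.
crm_record = {"customer_id": 7, "marital_status": None, "zip_code": "20001"}
warehouse_record = {"customer_key": 7, "zip_code": "2000"}

crm_failures = repo.violations(crm_record)
warehouse_failures = repo.violations(warehouse_record)
print(crm_failures, warehouse_failures)
```

Because both stores draw on the same definitions, the rule for a shared element is written once and cannot drift between implementations.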
Now, let’s look at what you can do with the metrics you developed.
Step Three – Quality Monitoring
You can apply metrics to ensure that the quality of the data exceeds your thresholds and approaches the targets. It’s highly recommended to establish systematic monitoring of key data sets. Monitoring starts with a commitment by your organization to check the improved data sets at designated intervals, and track progress against quality objectives.
Monitoring is usually conducted through two methods:
- Data quality dashboards and reports – by capturing metrics and displaying them visually through a dashboarding feature that many data quality toolsets support. This allows for the efficient tracking of progress and helps communicate your progress to stakeholders and senior management.
- Periodic re-profiling of data sets, applying the data quality rules you implemented, capturing defined metrics, and supplementing with evaluation by Data Stewards, as needed. For example, say six months have elapsed since your improvements were implemented. The initial profiling effort yielded measures and metrics for the defects you intended to correct. Running it again will provide detailed inch-by-inch discovery of uncorrected, or new defects.
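Comparing a re-profiling run against the initial baseline can be as simple as the sketch below. The metric names and every figure are invented for illustration, and the function assumes that each metric is a defect percentage, so lower is better.

```python
def compare_profiles(baseline: dict, current: dict) -> dict:
    """Classify each current metric as improved, regressed, unchanged, or new."""
    outcome = {}
    for name, value in current.items():
        if name not in baseline:
            outcome[name] = "new"
        elif value < baseline[name]:
            outcome[name] = "improved"
        elif value > baseline[name]:
            outcome[name] = "regressed"
        else:
            outcome[name] = "unchanged"
    return outcome


# Defect percentages from the initial profiling run (invented figures).
baseline = {
    "null_marital_status_pct": 12.0,
    "missing_zip_pct": 8.5,
    "duplicate_pct": 1.4,
}
# The same metrics six months later, plus one newly discovered defect type.
current = {
    "null_marital_status_pct": 2.0,
    "missing_zip_pct": 8.5,
    "duplicate_pct": 1.9,
    "invalid_status_code_pct": 0.6,
}

report = compare_profiles(baseline, current)
for name, result in report.items():
    print(f"{name}: {result}")
```

A summary like this is a natural feed for a dashboard, and the “regressed” and “new” entries are the ones to route to Data Stewards for evaluation.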
This may seem like a lot of effort, but remember, you’re only going to implement systematic monitoring for important shared data stores – when the data must be right. However, it’s notable that organizations which implement data quality monitoring are often rewarded, not only with improved data, but also with continued funding for quality improvements, and expansion of their scope. Is a data quality Center of Excellence in your future?
Step Four – Prioritizing Quality Remediation
You don’t want that Chief Technology Officer to be laughing behind the curtain – “See, I told you that it was expensive!” To avoid boondoggles that don’t show results, start from the data quality strategy (if available) or the data management strategy (also if available). Or, demonstrate how your efforts support one or more strategic objectives or programs (e.g., the example of ‘improving the customer experience’).
At the front end, work with your business sponsor to develop an effective business case. One avenue to tighten up your business case is your definitive list of existing issues. If you’ve analyzed the business impacts, and explained how your data quality assessment effort is integral to correcting them, that’s strong evidence. The other avenue is to address business aspirations, for example, describe what the business would like to accomplish with the data but can’t do, or can’t know, now because of poor data quality.
At the back end, you evaluated quality and defined quality rules. The key consideration then is, “What do we fix?” If you’re engaged in a new development or consolidation effort, the task is simpler – you can build in quality from the ground up and should be able to obtain approval for most or all your preventative measures.
If you’re addressing an existing data store, you can simplify the decision by creating a comparison table that summarizes your analysis of business and technical impacts for each modification. Describe the issue or defect, determine the negative impact on the business (high, medium, or low) and the technical impact (i.e., how much effort/cost it will take to fix) and the recommended remediation action. Here’s a sample:
| Issue | Business Impact | Technical Impact | Recommended Action |
|---|---|---|---|
| Missing zip codes | Medium | Low | Employ address validation |
| Null Marital Status | Medium | Medium | Implement rules in sources and destinations |
The organization is advised to provide a template for a similar table when making data quality remediation decisions, just as it may supply a template for categorizing project risks. The template should include definitions for the impact categories and describe effort and cost thresholds.
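Once the table is filled in, ranking the candidates is mechanical. The sketch below orders issues by highest business impact first and, within the same business impact, by lowest technical impact. That ordering is an assumption, one reasonable heuristic rather than a prescribed method, and the third row is an invented example added so the sort has something to do.

```python
IMPACT_RANK = {"Low": 1, "Medium": 2, "High": 3}

issues = [
    {"issue": "Missing zip codes", "business": "Medium", "technical": "Low",
     "action": "Employ address validation"},
    {"issue": "Null Marital Status", "business": "Medium", "technical": "Medium",
     "action": "Implement rules in sources and destinations"},
    # Hypothetical row, not from the sample table above.
    {"issue": "Duplicate customer records", "business": "High", "technical": "High",
     "action": "Introduce matching and merge rules"},
]


def prioritize(rows):
    """Order by highest business impact, then by lowest technical impact."""
    return sorted(
        rows,
        key=lambda r: (-IMPACT_RANK[r["business"]], IMPACT_RANK[r["technical"]]),
    )


ranked = prioritize(issues)
for position, row in enumerate(ranked, start=1):
    print(f"{position}. {row['issue']} -> {row['action']}")
```

Your organization may weigh cost more heavily than this sketch does. The point is that agreed definitions for High, Medium, and Low make the decision repeatable.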
Organizations Must Educate Staff & Formalize Quality Activities
For data stewards and business data experts, the activities required to assess data quality can be likened to graduate-level work. To be capable and confident with the in-depth analysis that is needed, they should have knowledge and experience in defining business terms, specifying data requirements, reading a data model, and working in problem-solving teams.
Added to that, they need specific education in setting targets and thresholds and applying data quality dimensions. A combination of coursework, available educational materials (webinars, templates, and guidelines), and mentoring will ensure that staff assigned to a data quality assessment effort can be productive from day one.
Organizations implementing data quality assessment practices have reaped the rewards of focused data quality efforts, and have supported continued positive accomplishments by implementing processes, guidelines, and assigned responsibilities for business engagement. Once the data quality assessment process has been formalized, communicated, and rolled out, a policy should be developed and issued, providing the basis for compliance. This is the final demonstration of the organization’s commitment to quality.
Over time, the organization can evolve to the point where the principles and practices are embedded across the organization, and the initiative to drive quality improvements originates from the business lines. That’s success!