Data Professional Introspective: Comparable Data Management (Part 4)


In our last column, Comparative Data Management Part 3, we explored two Data Management Maturity (DMM SM) Model data management processes with the highest average scores on our benchmark range chart. These processes were Governance Management and Provider Management. Then, we saw examples of organizational and implementation factors that reflect robust capabilities, which propel the organization towards optimizing its data assets and gaining the most value from them.

Part 1 | Part 2 | Part 3

In Part 4, we will focus on one broad topic – Data Quality – and share observations and examples from both lower and higher-performing organizations, with practical suggestions for how to accelerate and implement a data quality program.

Let’s look at our chart again for a quick visual on the average scores in the four Data Quality process areas, against the range of all scores:

[Figure: benchmark range chart – average scores for the four Data Quality process areas against the range of all scores]

Below is a table containing the same information shown in the chart above, with the addition of the number of practices in each process area – 74 practices total in the Data Quality category:

Process Area # Practices Low Average High
Data Quality Strategy 21 0.50 2.75 5.00
Data Profiling 19 1.00 2.50 3.75
Data Quality Assessment 18 1.00 3.00 5.00
Data Cleansing 16 0.50 2.75 5.00

Before we turn our attention to the individual process areas, let’s consider the big picture. If you listed all of the components within the scope of Enterprise Data Management (EDM) – business, technical, and solution-oriented (respective examples: Business Glossary, Data Design, Master Data Management) – I would argue that you can capture the “Essence of Everything” in three broad groupings: Architecture = WHAT data assets are built; Governance = HOW the data assets are managed; and Quality = WHY you build, acquire, and manage data assets – namely, to put the data to meaningful use.

[Figure: the “Essence of Everything” – Architecture (WHAT), Governance (HOW), and Quality (WHY)]

Quality is the ultimate reason for implementing EDM, although this often goes unstated because it is thought that it should be obvious to everybody. But too often, it is not.

In accordance with its mission, its purpose, and the business processes needed to be the organization that it is – with the strategic objectives it has, in the industry in which it exists – an organization creates and acquires data, stores it[1], modifies it, uses it for operational decisions and actions, integrates it, and aggregates, models, and analyzes it to make tactical and strategic decisions.

Better data leads to better decisions and predictions, greater staff efficiency, increased agility, and competitive advantage. While this is a commonly stated principle, even a cliché in our industry, the reality is that at the enterprise level, there is little recognition of its importance, and few organizations demonstrate a clear understanding of the true impact of bad data on their business. Fewer still have made a top-down commitment to an organized set of processes and work products, resourced sufficiently to constitute a data quality program or a formalized permanent quality function that eventually benefits all major business processes and the entire organization.

Instead, enlightened business sponsors and responsible IT project managers discover and apply data quality practices in their own realm of control, usually at the project level. And there it typically stays, without expansion or extension to other projects, other business lines, or the enterprise. This is reminiscent of a block of upscale urban townhouses where amazingly, only a few doorsteps are kept free from trash, leaves, and dust. ‘Neglected’ is the best word to describe the condition of data in many organizations.

‘The business owns the data it creates and manages.’ Another commonly stated principle, another cliché. Historically, however, information technology has been blamed for poor data quality. This has been intensely frustrating for IT staff, because many business analysts and data architects have strongly encouraged building quality rules in during requirements, finding defects in existing data stores before data consolidation or migration takes place, and other sensible and sound practices. Nonetheless, they have frequently found themselves running after a business expert with a notebook, asking questions and not getting answers: ‘If we find a Social Security Number of 999-99-9999 in the data load, shouldn’t the workflow automatically notify the business owner?’ Or ‘Do you want to import all those addresses where the street name is XXXXXXXXXX?’ Or ‘Shouldn’t we redesign this database so that ‘inactive’ and ‘dead’ aren’t in the same column values list, since different business rules apply?’
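Questions like these are, at bottom, quality rules waiting to be written down. As a minimal sketch – the field names, rule conditions, and messages here are illustrative assumptions, not rules from any particular system – such checks might look like:

```python
def check_record(record):
    """Return a list of quality-rule violations for one source record."""
    violations = []
    # A placeholder SSN should trigger notification of the business owner
    if record.get("ssn") == "999-99-9999":
        violations.append("placeholder SSN - notify business owner")
    # Street names padded with X's indicate defaulted/unknown addresses
    street = record.get("street", "")
    if street and set(street) == {"X"}:
        violations.append("defaulted street name - review before import")
    # 'inactive' and 'dead' carry different business rules and should not
    # sit undistinguished in the same column values list
    if record.get("status") in {"inactive", "dead"}:
        violations.append("status '%s' requires rule review" % record["status"])
    return violations

record = {"ssn": "999-99-9999", "street": "XXXXXXXXXX", "status": "inactive"}
print(check_record(record))
```

The point is not the code – it is that until someone answers the analysts’ questions, rules like these cannot be agreed, automated, or enforced.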

The default solution has been to rely on ‘data heroes’ to research, wrangle, mish and mash the data on their own, but the organization often doesn’t want to acknowledge that those individuals are regularly giving up nights and weekends to manually fix data for, say, the month-end financial report. In effect, where data quality is concerned, organizations would like reality to be different than it is. But it isn’t. And persisting in that belief is a serious obstacle to facing the pervasive quality issues and coming up with a plan to address them. If I seem to be waxing a bit melodramatic at this juncture, it’s because I’ve seen this pattern So…. Many…. Times (!!!) So organizations, here’s another cliché for you: ‘If you do what you’ve always done, you’ll get what you’ve always gotten.’

Let’s state some key facts about data quality:

  • Quality is expensive, yet may yield cost savings
  • You can’t achieve it overnight – you need a plan and a segmented approach
  • Many individuals are involved – (period)
  • There are details – many details
  • Technology is not a total solution
  • Elegant design (or redesign) is not a total solution
  • If you can’t demonstrate the business impacts of poor data quality, you won’t get funding
  • Hide the detail (but not the effort required) from the executives or their heads will spin
  • Paint a vivid picture of the gains when quality is improved – make them believe

You may recall the picture below from Part 3, representing the Buddhist view of the journey of humanity from illusion to enlightenment.

Data quality capabilities tend to range across many stages of maturity, for example:

[Figure: the journey from illusion to enlightenment, illustrating varying stages of data quality maturity]

Project A – has a data quality tool and has used it in one data store migration effort

Program B – has developed quality rules for a product master data store used by several lines of business

EDM Organization – has developed a dashboard for two data stores containing critical data elements

Technology Organization – has a data quality COE, but only for financial data

This presents a confusing picture – the same organization has varying capabilities in isolated areas, and there is no central backbone (in this picture, undoubtedly the Four Noble Truths and the Eight-fold Path) or plan to benefit from the separate achievements.

What would we advise the organization experiencing the woes of this piecemeal approach? (And in our visual metaphor, help it attain ‘quality enlightenment?’)

  • To implement a data quality program that addresses the entire scope of data quality principles, considerations, processes, and best practices
  • To ensure that this program is enterprise-wide, and to align the program activities and progress to support the data management strategy, which in turn aligns with the business strategy (see the last three items in the key facts list above)
  • To create and mandate the policies, triggers, and conditions governing when processes and practices must be followed
  • To create and communicate useful guidance and templates for defect reporting, quality scoring, and cleansing
  • To educate staff with data responsibilities across the entire data lifecycle, from the executives to the data entry individuals and everyone else along the chain

The Data Management Maturity Model’s four Data Quality process areas represent a complete set of activities and practices that address the four major components of a well-organized approach to improving data quality. Achievement of DMM Level 3 in these process areas is evidence that the policies, processes, standards, and guidelines have been established, are understood, and are followed by all relevant stakeholders across the organization. Let’s take a look.

Data Quality Strategy

It is rare to encounter an organization that has engaged in sufficient self-reflection to draw this kind of conclusion: ‘hey, we’re spending money on multitudes of projects to improve data quality. We couldn’t tell you who is doing exactly what, and we can’t derive a total cost for it because our funds are allocated by project. But we know that the more data we have, the more these efforts will multiply. And (near-term objectives differ according to priorities – here’s an example) we want to implement advanced analytics, the data science team is going to use scores of sources, and we have no idea how accurate the data is. So we have to get our act together.’

Creating a strategy to launch, build, and maintain a data quality program is the quickest way to consider and encompass the vision of how the business will benefit. I mean ‘vision’ literally – what are the accomplishments the organization could achieve if data quality was improved? To take an insurance example, ‘If the property data didn’t have so many errors, like a Built Year of 2065, and a Construction Type Code value of Unknown, I could better estimate our risk and needed reserves, lower premiums, and increase sales.’ Business leaders across the organization have similar aspirations. Capture them.

The DMM contains a suggested outline for the contents of a data quality strategy, and the 21 practice statements enable precise evaluation against best practices. Here are the Cliff Notes:

  • An ounce of top-down planning is worth a pound of gritty bottom-up work
  • If you already have an enterprise data management strategy, start with the data quality section – if you don’t, you can still forge ahead, because quality is always important
  • Perform a high-level analysis of the primary data quality issues. Interview the program leaders responsible for the major data stores in their line of business
    • The analysis can be accomplished in a few weeks at the most, yielding a succinct summary document
    • They will know their major quality problems, because their staff has been complaining to them all along
    • Describe the major quality issues and their corresponding business impacts
  • Aim for relatively high-level treatment – the details will be determined in the spawned implementation projects
  • Convene a working group of experienced business experts to facilitate quality aspirations for the strategy
  • Draft a mission statement, goals, and objectives for the program with their input – get the executive governance group to review and approve (they need to remain engaged periodically until the strategy is approved)
  • Discuss and agree on the initial data scope – e.g., Product data in a primary operational system following a company acquisition; this gets the experts to take the enterprise perspective versus my-me-mine thinking
  • Discuss and agree on the next scope segment, then the next, for all major subject areas – create a sequence plan for the data scope
    • Note: it is always helpful to align the scope around critical projects and programs, e.g., the need to improve data quality in the Entity master before we launch the hub for multiple business lines next year
  • Discuss and agree on the capabilities that are available (the DMM is very helpful here) and those that aren’t yet implemented – e.g., we don’t have a data quality tool, we don’t have documented processes for profiling, etc.
  • Determine and gain agreement on the prioritization of new disciplines
  • Determine the roles, responsibilities and resources, and the organizational unit that will be the central mover in building the program
  • Determine a starter set of metrics that are meaningful to the business and help baseline the data’s condition and track improvements (dashboards are quite useful as well)
  • Finalize a sequence plan (1 year, 3 years, or 5 years are the most often used)
  • Estimate Year 1 funding for the program business case

To create a successful strategy, you need to engage business representatives at all stages. If the organization never undertakes planning for data quality at the enterprise level, quality will remain uneven and costs will continue to be a burden (aka, ‘If you don’t get it, you don’t get it.’) Finally, this should all be completed in no more than three months, tops – if you drag it out, you won’t gain the momentum you need to launch.

Process Area # Practices Low Average High
Data Quality Strategy 21 0.50 2.75 5.00

Note that the low score here is really, really low, 0.50 on a 5-point scale. There are organizations which have never given much thought to what they can do to improve data quality, relying on data heroes and manual reconciliation to make the data usable. The average score of 2.75 shows that most organizations have implemented data quality at the program or line of business level and begun to standardize or centralize some functions and processes. The high score of 5.00 was earned by two organizations which had devoted considerable time, effort, and resources to planning and executing data quality improvements for many years and now enjoy the fruits of exemplary data quality.

Data Profiling

Data profiling is the development and application of discovery techniques to physical data sets to develop an understanding of the content, quality, and rules of a set of data under management. Data profiling is an important first step for many information technology initiatives. It is a discovery task conducted through automated (tool supported or custom queries) and/or manual analysis of physical records. For a selected data set, it reveals what is stored in databases and how physical values may differ from expected, allowed, or required values listed in data store documentation or described in metadata repositories. Definite errors are often referred to as “defects.” Suspected errors are often referred to as “anomalies.”

Data profiling has been conducted for application data stores for many years at most organizations; however, it is typically performed in an ad hoc manner – for example, by a project team responding to a specific business complaint about data errors. Profiling as a cornerstone of an effective data quality improvement program is not often encountered, yet it is critical, since ‘you don’t know what you don’t know.’ Without profiling, the organization may not discover that there are nulls in important columns, such as a status code; there may be bad addresses or missing zip codes in customer data; or there may be surprise values in key code fields, such that research is required to see if the source system was modified or the value has not been documented. The causes and resulting effects are legion, like the grains of sand on a beach.
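The discovery logic behind those examples is straightforward. A minimal column-profiling sketch – real programs would use a vendor profiling toolset, and the column values and allowed domain here are illustrative assumptions – might summarize nulls, distinct values, and undocumented ‘surprise’ codes like this:

```python
from collections import Counter

def profile_column(values, allowed=None):
    """Summarize nulls, distinct values, and out-of-domain 'surprise' values."""
    total = len(values)
    nulls = sum(1 for v in values if v in (None, ""))
    freq = Counter(v for v in values if v not in (None, ""))
    # Values outside the documented domain require research: was the source
    # system modified, or was the value simply never documented?
    surprises = [v for v in freq if v not in allowed] if allowed else []
    return {
        "total": total,
        "null_count": nulls,
        "null_pct": round(100.0 * nulls / total, 1) if total else 0.0,
        "distinct": len(freq),
        "surprise_values": sorted(surprises),
    }

status_codes = ["A", "A", "I", None, "Z", "", "A"]
print(profile_column(status_codes, allowed={"A", "I"}))
```

Even a summary this simple surfaces the null status codes and the undocumented ‘Z’ value that an ad hoc review would likely miss.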

Profiling identifies defects and anomalies. It deepens business and technical knowledge about a data set, and basic tests and checks can be expanded to address all the problems that arise over time. Here are the Cliff Notes:

  • Select a robust data quality toolset – most data quality packages from major vendors will do everything you need them to do – choose the toolset that is most in harmony with your technology stack
  • Develop and document key activities, and share them with the organization
  • Create a standard data profiling process – when to profile, how to request profiling, who performs the profiling, who is responsible for analyzing the results, governance involvement, etc.
    • Create a profiling plan template – how to determine the boundaries of the data set of interest (e.g., one data store, joins with the same data in multiple data stores, etc.), stating what techniques you will use (such as out-of-the-box tests, business rule checks, duplicate checks, known issues checks with complex queries, etc.)
    • Create a profiling results report template – how summary results should be formatted, what conclusion types will be included, how detailed results will be linked or referenced, how technical and business impacts will be classified, how recommendations will be reported to business users, etc.
    • Develop a brief course on data profiling aimed at business data experts – why profile, what support is needed from them, etc.
  • Profiling supports data quality metrics – use the results to enhance metrics for the data quality program and answer the question ‘Is the data getting better?’
  • Don’t profile the same data in multiple sources again and again – fix the source (see Data Cleansing)
  • Reuse quality rules and checks developed for a profiling effort wherever that data occurs – this is especially important for shared and governed data (aka, enterprise data)
  • Determine the criteria for when profiling is performed – what events trigger profiling, such as migration of data to a new data store, and how often data should be monitored (re-profiled in the same data store)
  • Consider establishing a center of excellence with highly skilled staff, supplemented by the project data architect for each data store in which data is profiled

Process Area # Practices Low Average High
Data Profiling 19 1.00 2.50 3.75

The low score of 1.00 indicates that data profiling has not progressed beyond the ad hoc phase. Methods are not documented, custom queries are employed exclusively, results are not reported in a standard manner, etc. The average score of 2.5 shows that most organizations have understood the importance of profiling and are employing it across more areas of the organization, yielding greater benefits to more internal business customers. The high score of 3.75 indicates that an organization has implemented an enterprise-level standard process, standard methods, tools, and techniques, has engaged governance and business approvals for profiling and results recommendations, and likely has implemented centralized data quality services for critical and enterprise data.

Data Quality Assessment

Data quality assessment is a systematic, business-driven approach to measure and evaluate data quality employing data quality dimensions, to ensure fitness for purpose and establish targets and thresholds for quality. The business owns the data it creates and manages. No organization’s IT group can single-handedly improve the quality of data. Business representatives across the lifecycle of any given data set must be engaged to determine the data’s fitness for purpose and to define both the desired level of quality and the minimum acceptable level.

This is often a heavy lift for the business data experts and governance representatives, but accepting the responsibility is necessary. The data quality assessment processes consist of making decisions about the data and acting on those decisions. Only those who create, modify, and delete data across every phase of its lifecycle can decide:

  • If the data set is sufficiently complete and accurate to support business process needs (i.e., “Fit for purpose”)
  • The desired state of specific attributes (i.e., “Target”)
  • The minimum level of quality acceptable (i.e., “Thresholds”)
  • The most meaningful measures and metrics to track improvements

To take an example of fitness for purpose, an organization may discover, based on data profiling results, that it does not capture a sufficient set of attributes to maximize the efficacy of its record matching algorithm; the data set is not fit for purpose because it is incomplete for preventing and reconciling duplicates. This conclusion may drive modifications to the data store(s) and lead to additional business rules.

The most effective mechanism to assist the business in assessing data quality and establishing useful targets, thresholds, and metrics is the consideration and application of data quality dimensions to each attribute. A “dimension” is a criterion against which data quality is measured. A number of different dimensions of quality can be measured, and the DMM offers a sample set of dimensions against which to determine the data’s condition.
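To make the target/threshold mechanics concrete, here is a small sketch of scoring an attribute against quality dimensions. The dimension names and percentage figures are illustrative assumptions, not the DMM’s sample set; in practice the business working group sets these values and governance approves them:

```python
def score_dimension(measured, target, threshold):
    """Classify a measured quality level against its target and threshold."""
    if measured >= target:
        return "meets target"
    if measured >= threshold:
        return "acceptable"
    return "below threshold - remediation required"

# Hypothetical assessment of one attribute, expressed as percentages
assessment = {
    "completeness": {"measured": 97.2, "target": 99.0, "threshold": 95.0},
    "validity":     {"measured": 91.0, "target": 98.0, "threshold": 93.0},
}
for dim, d in assessment.items():
    print(dim, "->", score_dimension(d["measured"], d["target"], d["threshold"]))
```

Recording the measured value alongside the target and threshold is what lets the organization track improvement over time rather than argue about whether the data ‘feels’ better.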

A data quality assessment is based on predefined quality expectations and criteria set by stakeholders and approved by governance. Here are the Cliff Notes:

  • Create a training class aimed at governance representatives and business data experts – include:
    • Dimensions – what they are, examples, and exercises
    • Targets – how to formulate a realistic yet aspirational target for quality
    • Thresholds – how to determine what level of quality is acceptable
    • Quality rules – definition, examples, and exercises
  • Perform a proof of concept quality assessment on a small data set supporting one or more business processes
  • Profile the data first and share the results
  • Convene a working group with expertise on the data set – define dimensions, thresholds, and targets, and identify any missing data which would assist business process performance
  • Characterize the impacts of identified data quality defects or anomalies – e.g. risk, productivity, compliance, etc.
  • Identify meaningful metrics for the data set
  • Determine if any profiling errors warrant root cause analysis in source data stores
  • Determine if any business process enhancements would improve the quality of the data set
  • Define a standard process capturing the above activities, and share it with the working groups undertaking assessments of other data sets.

Process Area # Practices Low Average High
Data Quality Assessment 18 1.00 3.00 5.00

The low score of 1.00 illustrates that an organization has not engaged the business in data quality responsibilities beyond the project level, is not employing data quality dimensions, and hasn’t implemented effective governance. The average score of 3.0 shows that some organizations have taken data quality seriously and have implemented processes and assigned responsibilities across the major business lines with significant business engagement. The high score of 5.00 demonstrates organizations in which the principles, processes, and practices of quality are deeply embedded, and the business is driving quality improvements.

Data Cleansing

Data cleansing identifies the mechanisms, processes, and methods used to validate and correct data defects according to predefined quality rules, as well as the analysis and enhancement of business processes to prevent errors.

Data cleansing focuses on data correction to meet business criteria (targets and thresholds) as determined by data quality rules addressing all applicable quality dimensions. Quality rules, developed through data quality assessment and the results of data profiling efforts, provide a baseline for identifying data defects which can affect business operations. Gaps in business processes may be addressed to create better data at the point of origin or modification.

Data cleansing activities are most effective when conducted at, or as close as possible to, the point of first capture, i.e. the first automated data store to record data, or as close to the original creation point as feasible. A best practice is to undertake cleansing activities based on data profiling and/or data quality assessment analysis.
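Cleansing driven by predefined rules can be sketched very simply. In this minimal example – the correction rules, field names, and values are illustrative assumptions – each rule pairs a defect test with a fix, and every correction is logged so it can be verified with data providers and propagated downstream:

```python
def cleanse(record, rules):
    """Apply correction rules; return the corrected record plus an audit log."""
    corrected = dict(record)
    log = []
    for field, (is_defect, fix) in rules.items():
        value = corrected.get(field)
        if is_defect(value):
            corrected[field] = fix(value)
            log.append((field, value, corrected[field]))  # before/after for providers
    return corrected, log

rules = {
    # Normalize padded/blank street names to an explicit null for downstream stores
    "street": (lambda v: v is not None and set(v) <= {"X", " "}, lambda v: None),
    # Standardize lower- or mixed-case state codes
    "state": (lambda v: isinstance(v, str) and v != v.upper(), str.upper),
}
record = {"street": "XXXXXXXXXX", "state": "va"}
print(cleanse(record, rules))
```

Because the rules are data, not code scattered across projects, the same rule set can be reused for every data store in which the cleansed data occurs – the reuse point made in the Cliff Notes below.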

Organizations need to establish criteria for what events trigger data cleansing efforts. In most organizations, data cleansing is more frequently conducted on key shared data stores, such as a master data hub or a critical operational system serving multiple business lines. As priorities and budgets allow, it is advisable to expand the scope to operational systems which provide data to internal or external consumers. Here are the Cliff Notes:

  • Data corrections should be made available for affected downstream data stores and repositories
  • When data defects are determined to be caused by gaps in a business process, the responsible parties should consider corresponding data store design modifications
  • Develop a standard process for escalating issues appropriately to governance and/or the data store owner
  • Data corrections should be verified with internal and external data providers, preferably through an automated message or report
  • Develop a standard data cleansing plan template to ensure that cleansing rules for data are shared and reused for any data store in which the data being cleansed is located
  • Avoid cleansing the same data in more than one physical location while applying different rules – this causes rework
  • Data cleansing plans should be subject to impact analysis, as with data profiling

Process Area # Practices Low Average High
Data Cleansing 16 0.50 2.75 5.00

In this process area, the low score of 0.50 implies that cleansing methods have not been specified for even one project, and that the business users have not been involved. The average score of 2.75 shows that the organization has implemented a standard cleansing process for at least one program and is working to extend these efficiency gains to other business lines. The high score of 5.00 illustrates that the organization’s data is well governed, that maximum effectiveness has been achieved, and that defects and anomalies are discovered and corrected as they arise, with root cause analysis and improvements conducted along the way, creating trust in the data across the organization.

As mentioned, there are many detailed considerations, activities, and work products involved in addressing data quality as an enterprise imperative. The data quality strategy is the critical element to gain support and approval of the data quality program, making an enterprise approach possible and yielding steady improvements in quality over time. The results are worth the effort.


[1] By the way, ‘data’ is a mass noun, like ‘water’ or ‘tea’ or ‘advice’– the term ‘datum’ as a singular with the plural ‘data’ originated in the academic scientific world (first recorded in 1946) and is archaic in a modern context. “Data are” is incorrect business usage.


About Melanie Mecca

Melanie Mecca, Director of Data Management Products and Services, CMMI Institute, led development of the Data Management Maturity (DMM) SM Model. Her team created a highly interactive method for assessing an organization’s capabilities against the DMM, and she has led numerous assessments for organizations in the financial, Federal, and technology industries. She directed creation of the Building EDM Capabilities, Mastering EDM Capabilities, and Enterprise Data Management Expert (EDME) courses leading to DMM certification. In 30+ years solving enterprise data challenges, Ms. Mecca has architected and implemented data management programs and projects, data strategies and architectures, and designed enterprise data services. She is an active presenter of classes, seminars, webinars and case studies, and is a strong advocate for data management education, with a passion for assisting organizations to realize business value from their data management programs.
