The Data Forecast: Data Quality is Not What We Think

The data management space is riddled with terms that sound familiar but belie their true complexity. We also often choose ridiculous terms that describe simple concepts (see metadata management), but that is a topic for another column.

Today we will focus on data quality, since it is arguably the most misunderstood topic around.

Data quality is one of these simple terms that get us data folks into sticky situations. Everybody knows what data is, right? It’s those 1’s and 0’s we have in our databases that get all the attention.

Similarly, everybody has an idea of quality. We think of the finest materials, or exquisite craftsmanship, or some combination of similar attributes that lead to a high relative value compared with a benchmark example that represents something marginal.

People somewhat reasonably think that by mashing these two definitions together they will arrive at a basic understanding of data quality. This is unfortunately not the case, and has resulted in the commonly accepted definition of data quality: pieces of information that are either perfect (of quality) or worthless (not of quality).

Say what?!?!

Because data is a virtual construct, it hurts our human brains to evaluate it along the same kinds of dimensions as physically manifested quality. When data is involved, we lose the ability to use most of our five senses in the comparison, so we oversimplify. Since data is fundamentally made up of 1’s and 0’s, we think data quality must also be a binary (yes or no) operation.

After all, nobody has ever told us otherwise.

People in organizations expect the data they receive to be correct — but what does that mean? What are acceptable thresholds of variance? Does the account balance need to be accurate to the microsecond, or will a daily balance suffice? Can we update customer addresses based on returned mail, or do we need to post somebody with binoculars watching for moving vans near their homes?

It’s all relative.

In fact, that’s the entire challenge with data quality. What is sufficient for marketing’s purposes may not be sufficient for accounts payable. The executive team may prefer to make strategic decisions based on higher levels of aggregation than IT security needs to protect the network.

Data quality cannot be considered perfection or bust. Data quality is the measurement of data’s suitability for use. Suitability for use is a spectrum, not a binary construct.

There are two sides to this:

  1. First, we must determine data’s actual suitability for the variety of uses for which it may be considered. We must figure out where data of interest falls on this spectrum. Most organizations with formal data quality programs adopt a data scoring methodology that works for their particular business. The best approach is to start simple to articulate basic usefulness, and iterate through the more complex aspects that determine suitability for your organization’s needs.
  2. Second, we must communicate those suitability characteristics of the data to the folks who are considering using it. At minimum, a data quality score should be displayed alongside any data being considered for use. More sophisticated data quality implementations use matrix designs or integrate additional metadata to provide a more nuanced understanding of the purposes for which particular data can effectively be used. For data quality to have any hope, a suitability-for-use assessment must accompany any data that may be used to drive a business outcome.
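
To make the two sides concrete, here is a minimal sketch in Python of a "start simple" scoring approach. Everything in it is an assumption for illustration: the single completeness score, the field names, the sample records, and the per-use thresholds are all made up, and a real program would add more dimensions over time.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    min_score: float  # hypothetical threshold on a 0-100 scale


def completeness(records, required_fields):
    """Percentage of required fields that are populated across all records."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return 100.0 * filled / total


def suitability(score, use_cases):
    """Side two: report, per intended use, whether the score is sufficient."""
    return {use.name: score >= use.min_score for use in use_cases}


# Made-up sample data: two of the twelve required values are missing.
records = [
    {"name": "Ada", "address": "1 Main St", "email": "ada@example.com"},
    {"name": "Bob", "address": "", "email": "bob@example.com"},
    {"name": "Cy", "address": "9 Oak Ave", "email": None},
    {"name": "Di", "address": "4 Elm Rd", "email": "di@example.com"},
]
score = completeness(records, ["name", "address", "email"])

# Hypothetical thresholds: marketing tolerates more gaps than accounts payable.
uses = [UseCase("marketing", 70.0), UseCase("accounts payable", 95.0)]
report = suitability(score, uses)
```

The same score lands on different sides of the line depending on the use, which is the whole point: the data is neither perfect nor worthless, just suitable for some purposes and not others.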

Now we are ready to start talking about actual data quality initiatives!

Note: up until now, everything we have been discussing about data quality is metadata management in disguise! But you never would have read the article if I called it “Metadata Management Necessities to Inform Your Data Quality Program, Plus Some Actual Data Quality Stuff Too!” Even if I added the exclamation point, we’re talking single-digit clicks, mostly from my immediate family — and none of them would have read this far, anyway.

So now that we have an understanding of data’s suitability for use, we will probably see cases where the data isn’t good enough for what we want to do with it. This is when we can implement a data quality improvement initiative that has measurable objectives. We can say things like, “Our data currently has an overall data quality score of 75, but for the intended use we need 85, so we must devote resources to improve the completeness and verified sub-scores by 20%. This will require approximately 60 hours of effort at $X per hour.”
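
The arithmetic behind a statement like that can be sketched in a few lines of Python. The assumptions here are mine, not a standard: the overall score is an equally weighted average of four sub-scores, the sub-score values are invented so that they average to 75, and "20%" is read as 20 points on a 0-100 scale. The hourly rate stays a parameter because $X is whatever your organization pays.

```python
def overall_score(sub_scores, weights=None):
    """Weighted average of sub-scores on a 0-100 scale (equal weights by default)."""
    if weights is None:
        weights = {name: 1.0 for name in sub_scores}
    total_weight = sum(weights[name] for name in sub_scores)
    return sum(sub_scores[name] * weights[name] for name in sub_scores) / total_weight


def effort_cost(hours, hourly_rate):
    """Cost of an initiative; hourly_rate stands in for the unspecified $X."""
    return hours * hourly_rate


# Hypothetical sub-scores chosen so that the overall score is 75.
current = {"completeness": 60, "verified": 65, "accuracy": 85, "timeliness": 90}
target = 85

# Assumption: "improve by 20%" means raising each sub-score by 20 points.
improved = dict(current, completeness=80, verified=85)

gap_before = target - overall_score(current)   # distance to the target today
gap_after = target - overall_score(improved)   # distance after the initiative
```

Under these assumptions, lifting the two weak sub-scores closes the gap exactly, and `effort_cost(60, hourly_rate)` gives the price of getting there once a rate is supplied.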

In a previous TDAN.com article, I outlined the Value of Data (http://tdan.com/the-data-forecast-intro-the-value-of-data/20304). Analyzing the projected value from data made sufficient by proposed data quality improvement initiatives enables comparisons across all data management disciplines. Said another way, we never have enough resources to do everything we would like to do. By quantifying value, we can optimally allocate scarce resources across any data initiatives we are considering, regardless of whether they are data quality, data governance, systems-related, etc.
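
As a rough illustration of that allocation, the Python sketch below ranks candidate initiatives by projected value per hour and funds them until a budget runs out. The initiative names, values, hours, and budget are all hypothetical, and the greedy ranking is a simple heuristic I am swapping in for illustration; it is not guaranteed to find the best possible mix.

```python
from dataclasses import dataclass


@dataclass
class Initiative:
    name: str
    projected_value: float  # estimated value unlocked, in any consistent unit
    hours: float            # estimated effort; must be greater than zero


def prioritize(initiatives, budget_hours):
    """Greedily fund initiatives by value per hour until the budget runs out."""
    ranked = sorted(
        initiatives, key=lambda i: i.projected_value / i.hours, reverse=True
    )
    funded, remaining = [], budget_hours
    for item in ranked:
        if item.hours <= remaining:
            funded.append(item.name)
            remaining -= item.hours
    return funded


# Hypothetical candidates from three different data management disciplines.
candidates = [
    Initiative("address cleanup (data quality)", 9000.0, 60.0),
    Initiative("stewardship roles (data governance)", 4000.0, 40.0),
    Initiative("warehouse migration (systems)", 12000.0, 200.0),
]
funded = prioritize(candidates, budget_hours=120.0)
```

Because every initiative is expressed in the same terms, value and effort, a data quality project can be compared directly against a governance or systems project rather than being argued for on its own.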

For the truly intrepid data quality aficionados, this quantified data quality approach should raise an eyebrow when extended to its logical limit. Though the common misconception (thoroughly debunked above, if I say so myself) implies a “perfection or nothing” default approach, the truth is that aiming for perfect data quality is always a mistake. At the limit, diminishing marginal returns in the pursuit of perfection will consume infinite resources while driving zero meaningful difference in business outcomes.

A more pragmatic rule of thumb is that when there is a sufficiently high likelihood of driving the optimal decisions or activities, the data in question should no longer be allocated additional data quality improvement resources. Executives often make decisions with confidence levels of 70% or less, as they cannot delay decisions until more complete information is available. For rocket scientists, higher levels of confidence are typically required before moving forward.

Once the right decision or activity is likely, we must divert precious data quality improvement resources to data assets that need the help more. Why keep trying to improve data that already drives the best possible business outcomes?

The misunderstood state of data quality in our organizations may now seem unbelievable. But if we have ever provided data without explaining its suitability for use, then we have directly contributed to it. I have certainly added more than my share over the years.

However, data rightfully has people’s attention now, and we must start putting in better data quality practices that will help all of our organizations reach their potential. To correct the perception of data quality, we must have the courage to address these items head-on, and ensure that we make the most of any resources that come our way. Start with data quality scoring and suitability for use, and help your organizations get better at what they do.

Good luck in your journey — and until next time, go make an impact!

About Anthony Algmin

Anthony J. Algmin was the first Chief Data Officer for the Chicago Transit Authority and is now Chief Data Officer for Uturn Data Solutions, a Chicago-based consultancy that helps companies use data and cloud technologies to get better at what they do best. For more information on OpenGrid or Uturn Data Solutions, contact Anthony at aalgmin@uturndata.com.

  • Richord1

    Although the pursuit of data quality is admirable, it is “too little, too late”.

    Those who experienced the era of improving product quality discovered that quality must be designed in. With data, you can check data quality using data profiling tools and discover patterns of errors, but after that, data errors become increasingly unique (one-off) and subjective (interpretation, bias).

    If you want to improve data quality, start with the design of the metadata. Identify the syntax, semantics, and pragmatics. Use ontologies and taxonomies, and define the data quality metrics for each context of use before designing a data model or writing code. Most importantly, make sure the data is usable in all contexts and uses: transactions, reports, analytics, and compliance, and within each department (sales, customer service, marketing, finance, operations, etc.).

    If you don’t change your data (metadata) design practices, your efforts in data quality will be remedial and add little value.
