In my last article, “The Shift from Syntactic to Semantic Data Curation and What It Means for Data Quality,” published in the August 2024 issue of this newsletter, I argued that the adoption of generative AI will change the focus and scope of data quality management (DQM). Because data quality is measured by the degree to which data meets data requirements, the starting point for data quality is formulating data requirements. In turn, data requirements are expressed in terms of data quality dimensions. Fortunately (or perhaps unfortunately), there are many data quality dimension frameworks to choose from. For example, Danette McGilvray advocates for 14 dimensions of data quality in her wonderful book “Executing Data Quality Projects, 2nd Ed.,” the ISO 25012 standard lists 15 dimensions, and in their seminal paper “Beyond Accuracy: What Data Quality Means to Data Consumers” (Journal of Management Information Systems, 1996), Richard Wang and Diane Strong identified 16 dimensions. You can easily find many other frameworks. Dan Myers at dqmatters.com has taken on the unenviable task of trying to reconcile these frameworks into what he calls 11 “conformed dimensions.”
Despite this profusion of data quality dimensions, most data quality professionals will tell you not to worry and to just focus on the 6 or 7 basic dimensions that, in one form or another, are common to all these frameworks. These typically include Accuracy, Completeness, Consistency, Timeliness, Validity, Redundancy, and Relational Integrity. The primary reason for the focus on these dimensions is that, except for Accuracy, failures in these dimensions can all be detected, and in some cases remedied, at scale through rule-driven software applications and process improvements. In other words, these are data quality problems you can sink your teeth into: you can measure them, establish improvement plans, and show improvement across your entire data system.
But why do I say “except for Accuracy”? First, let’s be clear about what we mean by “accuracy.” Some twenty years ago, a data professional named Jack Olson wrote an entire book about data accuracy. The book, “Data Quality: The Accuracy Dimension” (Morgan Kaufmann), is still available and, in my opinion, still very relevant to DQM best practices today. His definition of Accuracy is that “the data values stored for an object are the correct values, and that to be correct, it must be the right value represented in a consistent and unambiguous form.”
So here is the rub: to verify that the value of a data item is accurate, you must have access to the correct value to confirm that the two are the same. And to do this at scale, you need the correct values in digital form from an authoritative source. This can be a difficult task. The sources that are easiest to find and access mostly cover static, non-sensitive values, such as the US Postal Service’s table of correct ZIP codes for a given mailing address. For more sensitive and dynamic data, such as the correct names of the current occupants at a given address, it is much more challenging.
Notice that I used the phrase “to verify” that a value is correct. This is because we often use “verify” and “validate” interchangeably, but they have different meanings. When we validate a data item, we do exactly the opposite: we try to determine whether the value is incorrect by applying some kind of reasonability test. For example, a date-of-birth value might be syntactically wrong because it contains the 30th day of February. On the other hand, a date of birth could be syntactically correct but semantically wrong if it occurs in the future. Validations can also be relational tests, such as a pregnancy diagnosis for a male patient or a work separation date earlier than a hire date. The sketch below shows what such rules might look like in code.
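To make the distinction concrete, here is a minimal sketch of these three kinds of reasonability tests in Python. The field names (date_of_birth, sex, diagnosis, hire_date, separation_date) and the rule names are illustrative assumptions, not drawn from any particular schema or tool.

```python
from datetime import date


def validate_record(rec: dict) -> list[str]:
    """Return the names of any reasonability rules the record fails.

    Field and rule names are illustrative only.
    """
    failures = []

    # Syntactic test: the raw string must parse to a real calendar date
    # (e.g., "1990-02-30" is rejected by the date constructor).
    dob = None
    try:
        year, month, day = (int(p) for p in rec["date_of_birth"].split("-"))
        dob = date(year, month, day)
    except (KeyError, ValueError):
        failures.append("dob_not_a_valid_date")

    # Semantic test: a syntactically valid date of birth cannot lie in the future.
    if dob is not None and dob > date.today():
        failures.append("dob_in_future")

    # Relational tests: values must be consistent with one another.
    if rec.get("sex") == "M" and rec.get("diagnosis") == "pregnancy":
        failures.append("pregnancy_diagnosis_for_male_patient")
    if rec.get("hire_date") and rec.get("separation_date") \
            and rec["separation_date"] < rec["hire_date"]:  # ISO dates compare as strings
        failures.append("separation_before_hire")

    return failures


# A record that passes every rule above can still carry the wrong value;
# validation narrows the error space, it does not establish accuracy.
print(validate_record({"date_of_birth": "1990-02-30", "sex": "M",
                       "hire_date": "2020-01-15", "separation_date": "2019-12-31"}))
# -> ['dob_not_a_valid_date', 'separation_before_hire']
```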
Don’t get me wrong, I am a strong advocate for data validation, but we need to recognize that it is not the same as data accuracy, for two reasons. First, when a data item fails validation rules, we learn that the value is incorrect, but not what the correct value should be. Second, and most importantly, when a data item passes validation rules, it doesn’t mean that the value is correct. I think this second part is where we really confuse validation with accuracy. Sometimes our thinking is that if the validation didn’t say a data item was incorrect, then it must be correct, right? A data item is either correct or incorrect. The problem is that a data item can be “reasonably incorrect.” Back to the date-of-birth example: the value can be a valid date within a reasonable range for a date of birth, yet still not be the correct date of birth for the person it references. In other words, data validation rules are not exhaustive. They don’t detect every way in which a data value can be incorrect. If they could, then it would be true that data validation and data accuracy are the same.
Perhaps I am being a bit picky about this, but as a professor I like to have clear definitions and terminology. From a more practical point of view, however, short of having an authoritative source of correct values for direct comparison, data validation is our best tool for achieving accuracy, provided we recognize and address its two shortcomings.
So, if your organization is claiming through its DG policy to have accurate data, here are two questions to ask the DQM team. First, if a data value fails validation (here I am assuming that everyone has some type of data validation process), what is your course of action? Do you remove the value? Do you insert a placeholder value? Do you impute a new value? Or do you consult the source to establish the correct value? Of course, the answers may vary depending on the data type and source, but they are still good questions to ask. While getting the correct value from the source is best, it is not always practical or possible. The worst option is inserting a placeholder; placeholders can be very misleading and make your data appear more complete than it is.
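For illustration only, here is a minimal sketch of those four courses of action expressed as interchangeable strategies. The `column_values` argument (input for imputation) and the `lookup` callable (a query against an authoritative source) are hypothetical stand-ins, not components of any real pipeline.

```python
from statistics import median


def remediate(strategy, column_values=None, lookup=None):
    """Return a replacement for a value that failed validation, per the chosen strategy."""
    if strategy == "remove":
        return None                   # drop the bad value entirely
    if strategy == "placeholder":
        return "UNKNOWN"              # risky: makes the data look more complete than it is
    if strategy == "impute":
        return median(column_values)  # e.g., substitute a typical value from the column
    if strategy == "consult_source":
        return lookup()               # best option, when it is practical
    raise ValueError(f"unrecognized strategy: {strategy}")


# Example: impute a missing age from the other values in the column.
print(remediate("impute", column_values=[34, 41, 29, 55]))  # -> 37.5
```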
As for the other shortcoming, what do you do with the values that pass validation? Unfortunately, in most cases, the answer is nothing. In software engineering, the rule of thumb is that the number of undiscovered bugs in an application is proportional to the number you find. If this is true for data quality, then discovering a lot of data validation failures suggests that there are many more errors in the records that passed validation. On this point, I really like the approach developed by Dr. Tom Redman, a leading author and expert on data quality. He has developed a process he calls the Friday Afternoon Measurement. It is basically a simple sampling method in which a small team of data experts in the company examines a sample of about 100 records (on a Friday afternoon) and reviews them for inaccuracies. The ratio of inaccurate records to the overall sample size gives an indication of the overall accuracy of the data source. While you could argue that this is just another form of validation, it brings two additional techniques into the picture: sampling and the review of records that pass validation.
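Redman’s measurement is a human process, but the bookkeeping around it is easy to sketch. The outline below is my own illustration, not taken from Redman’s materials; the `expert_review` stub is a hypothetical stand-in for the team’s judgment (returning True when a record is judged inaccurate).

```python
import random


def expert_review(record) -> bool:
    """Placeholder for the human judgment call: True means the record is inaccurate.
    Real reviews are done by people; this stub exists only so the sketch runs."""
    return record.get("known_bad", False)


def friday_afternoon_measurement(records, sample_size=100, seed=42):
    """Draw a simple random sample, score it with expert review, and report the result."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    inaccurate = sum(expert_review(rec) for rec in sample)
    return {"sample_size": len(sample),
            "inaccurate": inaccurate,
            "estimated_accuracy": 1 - inaccurate / len(sample)}


# Toy data: 1,000 records, 150 of which a reviewer would flag as inaccurate.
records = [{"known_bad": i < 150} for i in range(1000)]
print(friday_afternoon_measurement(records))
```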
You can think of data validation processes as a form of stratified sampling, where the strata are defined by which validation rules a record passed or failed. And as we noted, while validation rules can detect some inaccuracies, they don’t detect all of them. By using a random, or at least systematic, sampling method along with human observation, Redman’s approach suggests how a DQM team can go beyond validation to get a better estimate of overall accuracy. The expert review process also has the advantage of identifying new validation rules to augment the organization’s current process: the data experts can surface defects that escaped validation, and analyzing those defects can lead to new and improved validation rules, further closing the gap between data validation and data accuracy.
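Continuing the sampling analogy, here is a sketch of how an overall accuracy estimate might be computed with the strata defined by validation outcome. The weighting by stratum size is standard stratified-sampling arithmetic, not a prescribed DQM method, and the `passed_validation` and `judge` callables are hypothetical stand-ins for a rule engine and the expert review.

```python
import random


def stratified_accuracy_estimate(records, passed_validation, judge,
                                 sample_per_stratum=50, seed=7):
    """Estimate overall accuracy by sampling within the validation-pass and -fail strata.

    `passed_validation` is a predicate over a record; `judge` stands in for the
    expert review (True = inaccurate). Both are illustrative callables.
    """
    rng = random.Random(seed)
    strata = {"passed": [r for r in records if passed_validation(r)],
              "failed": [r for r in records if not passed_validation(r)]}
    total = len(records)
    estimated_error = 0.0
    for stratum in strata.values():
        if not stratum:
            continue
        sample = rng.sample(stratum, min(sample_per_stratum, len(stratum)))
        stratum_error = sum(judge(r) for r in sample) / len(sample)
        estimated_error += (len(stratum) / total) * stratum_error  # weight by stratum size
    return 1 - estimated_error


# Toy example: records that fail validation are all bad, and about 5% of the
# "clean" records are also inaccurate, which per-stratum review can surface.
records = ([{"valid": True,  "bad": i % 20 == 0} for i in range(900)] +
           [{"valid": False, "bad": True} for _ in range(100)])
print(stratified_accuracy_estimate(records,
                                   passed_validation=lambda r: r["valid"],
                                   judge=lambda r: r["bad"]))
```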
So, bottom line, what is the answer to the opening question for your organization? Are you simply substituting validation for data accuracy, or are you actively trying to close the gap between validation and accuracy? If you really want to claim that your DQM program is addressing the accuracy dimension, you must go beyond a once-and-done validation approach to include additional processes for sampling records that pass validation, having experts analyze those records, and systematically reviewing their findings to continually improve your data validation process. Again, I highly recommend Jack Olson’s book as a guide to the many other processes and techniques you can employ to improve data accuracy, such as identifying and correcting sources of inaccurate data, implementing data quality assurance programs, and applying data profiling technology.
One thing I didn’t mention, which relates to my next article, is how GenAI is going to impact data quality management. So I was wondering: in the small team of experts recommended in Redman’s Friday Afternoon Measurement, should one of the members be a Large Language Model? More about that next time.