Data Quality (DQ) continues to be an area that many data professionals pay lip service to, but are often unable to tackle in a meaningful way. Part of the problem seems to be a poor understanding of what the definition and scope of DQ are. Coupled with this, or perhaps because of it, is a tendency to confuse other sub-disciplines of Data Governance and Data Management with DQ.
The traditional definition of DQ is often given as:
fitness of data for use
This is adapted from Juran’s Quality Handbook, Fifth Edition, p. 2.2, McGraw-Hill, 1999, which states:
data to be of high quality if they are fit for their intended uses in operations, decision making and planning
Incidentally, “Handbook” is a very complimentary term for a tome that is about the size of a cinder block and certainly does not fit in my hand.
“Fitness for use” is a poor definition. Suppose I am a data scientist looking across my enterprise for a data set that relates customers to returned goods. Now, suppose I find what I think is a good data set, but I fail to check the data definitions, and what I originally thought were “returned goods” are actually “damaged goods”, which is not what I wanted at all. Under the standard definition of DQ I can now say that this data set has poor DQ. And yet the data set might be a perfectly accurate compilation of customers who have had experiences with damaged goods. So how can this data set be 100% accurate and of poor data quality at the same time? Clearly it cannot.
It is important to understand that people like econometricians and, I strongly suspect, data scientists, do not think of data in the same way as professionals who have been more involved with IT, Operations, and Data Management. Econometricians and their ilk think of data like fuel, or a similar kind of input, that they need for their modeling activities. So for them, “quality” often means how useful the data is for their activities, not how inherently accurate the data is (though presumably that is a separate concern for them).
In the damaged goods example we just went through, the problem was caused either by the data scientist not bothering to look for data definitions, or by adequate data definitions not being available. I would say that this is not a DQ problem. If the data scientist looked for definitions and found inadequate ones, we may have a semantic quality problem, but this is not a problem of the underlying data. If the underlying data did faithfully capture what it was intended to capture (i.e. the relationship between customers and damaged goods) then there is no DQ problem. Obviously, there is a problem, but classifying it as a DQ problem is incorrect.
There is yet another problem with the poorly defined scope of DQ, and this is the failure to distinguish between DQ and Data Issue Management. To me, Data Quality, as a Data Governance and Data Management practice, is about detecting problems in the data, that is, data issues. Once a data issue has been detected, someone needs to set about resolving it. The whole resolution process has very little in common with the detection capabilities needed for the practice of DQ. DQ often involves specifying business rules against which the data is tested. Data Issue Management involves trying to figure out if an issue is really a problem (and not a false positive), how to prevent it from propagating, how to remediate the bad data, and how to correct the root cause (among other things). Lumping Data Issue Management in with DQ mixes two quite different disciplines, each with its own distinct problems, methodological approaches, supporting software, metadata needs, and so on. The result will be confusion.
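To make the split concrete, here is a minimal Python sketch of the detection side only. The record fields, the two rules, and the Rule and DataIssue types are all hypothetical, invented for illustration. The point is that the DQ practice ends once issues have been detected and reported; deciding whether an issue is a false positive, remediating it, and fixing the root cause would belong to a separate Data Issue Management process.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical record layout, invented for this illustration.
Record = dict


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[Record], bool]  # returns True when the record passes


@dataclass(frozen=True)
class DataIssue:
    rule: str       # name of the business rule that failed
    record_id: int  # identifier of the offending record


# Two example business rules (assumptions, not taken from any real system).
RULES = [
    Rule("customer_id_present", lambda r: r.get("customer_id") is not None),
    Rule("quantity_positive",
         lambda r: isinstance(r.get("quantity"), int) and r["quantity"] > 0),
]


def detect_issues(records, rules=RULES):
    """Test every record against every rule and report each failure."""
    return [
        DataIssue(rule.name, rec["id"])
        for rec in records
        for rule in rules
        if not rule.check(rec)
    ]


records = [
    {"id": 1, "customer_id": "C-100", "quantity": 2},
    {"id": 2, "customer_id": None, "quantity": 1},
    {"id": 3, "customer_id": "C-101", "quantity": 0},
]

# Detection stops here: the issues are handed off, not resolved.
issues = detect_issues(records)
for issue in issues:
    print(issue)
```

Nothing in this sketch knows how to fix a record, and that is deliberate: the list of issues is the hand-off point between the two disciplines.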
I submit that Data Quality as an attribute should be defined as:
the extent to which data actually represents what it purports to represent
and that the practice of Data Quality should be defined as:
the capabilities to detect and report on instances where data does not actually represent what it purports to represent
This does not cover the situation where someone has a perfectly good data set that is not suitable for their needs, and some other term will need to be created to signify that. Maybe “Data Suitability”?
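The damaged goods example can be sketched in Python to show the two questions being asked separately. Everything here is an assumption made for illustration: the DataSet type, its purported_subject field, and both functions are hypothetical, and the quality measure is simply the share of recorded rows that match a trusted reference for the same subject.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    purported_subject: str  # what the data definition says the data is about
    rows: set               # (customer_id, item_id) pairs as recorded


def quality_score(data, verified_rows):
    """Share of recorded rows that match a trusted reference for the same subject."""
    if not data.rows:
        return 1.0  # an empty data set misrepresents nothing
    return len(data.rows & verified_rows) / len(data.rows)


def is_suitable(data, needed_subject):
    """Whether the data set is about what this consumer actually needs."""
    return data.purported_subject == needed_subject


# Hypothetical data: an accurate record of customers and damaged goods.
damaged = DataSet("damaged goods", {("C-100", "SKU-1"), ("C-101", "SKU-7")})
verified_damage_reports = {("C-100", "SKU-1"), ("C-101", "SKU-7")}

print(quality_score(damaged, verified_damage_reports))  # quality question
print(is_suitable(damaged, "returned goods"))           # suitability question
```

Under these assumptions the data set scores perfectly on quality while failing the suitability test for someone who wanted returned goods, which is the distinction the two proposed definitions are meant to preserve.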