Data Accuracy is one of the so-called “dimensions” of Data Quality. The goal of these dimensions, and it is a noble one, is to let us measure each of them and, where deficiencies are found, apply a uniform set of best practices to correct them. Of course, these best practices will differ from dimension to dimension. But just how feasible is this for Data Accuracy?
In this blog, we start by looking at the definition of accuracy in general. Wikipedia provides a commonly accepted definition, which is:
“The degree of closeness of measurements of a quantity to that quantity’s true value.”
The word “true” is very important, and we need to understand it in order to understand accuracy. Aristotle defined truth in a way that highlights the relation between a representation and the reality it is trying to represent. That seems to be very much connected with Data Accuracy.
We give the definition of Data Accuracy as:
“The degree to which a data value represents what it purports to represent.”
One of the problems with the dimensions of Data Quality is that there are differences of opinion about how they should be defined. No doubt our definition can be improved, but we will take it as a basis for exploring how we should estimate Data Accuracy.
The next thing we must consider is that, for data based on observations, 100% Data Accuracy is impossible to achieve. Two great minds in Quality Control provide a basis for this understanding. Walter A. Shewhart pointed out that all systems of measurement introduce error. W. Edwards Deming, who was taught by Shewhart, went further and pointed out that there is “no true value of anything.”
So, it seems we can never get 100% Data Accuracy. If this is the case, we will want to know how to assess Data Accuracy, since we will want to know just how imperfect it is. And this is where we run into another hard truth, which is that we cannot do this by considering the data alone. This is potentially a difficulty for data professionals. We are used to working with data and like to work with it. But to estimate Data Accuracy, we will need to step outside the data: take a sample of the population we are interested in, find a method to independently assess the data values in question for that sample, and compare the results with the data under curation. It just cannot be done from entirely within the data itself. Of course, the way we assess Data Accuracy is likely to vary from data element to data element.
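To make the sampling idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a prescribed method. The names `draw_sample` and `estimate_accuracy` are hypothetical, the data is made up, and a simple exact-match comparison stands in for whatever comparison suits a given data element. The independently verified values are assumed to come from outside the data, for example from a field check. The Wilson score interval is one common way to express the uncertainty of an estimate drawn from a sample.

```python
import math
import random


def draw_sample(record_ids, sample_size, seed=0):
    """Pick a simple random sample of record ids to verify independently."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(sorted(record_ids), sample_size)


def estimate_accuracy(recorded, verified, z=1.96):
    """Estimate the share of sampled records whose recorded value matches
    the independently verified value.

    recorded: dict of record id -> value held in the data under curation
    verified: dict of record id -> value found by independent assessment
              (hypothetical input; it must come from outside the data)
    Returns (point_estimate, lower_bound, upper_bound), where the bounds
    are a Wilson score interval (z=1.96 gives roughly 95% confidence).
    """
    n = len(verified)
    if n == 0:
        raise ValueError("need at least one verified record")
    matches = sum(1 for key, value in verified.items()
                  if recorded.get(key) == value)
    p = matches / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Made-up example: four sampled customer records and their verified cities.
recorded = {1: "Leeds", 2: "York", 3: "Hull", 4: "Bath"}
verified = {1: "Leeds", 2: "York", 3: "Ripon", 4: "Bath"}
estimate, low, high = estimate_accuracy(recorded, verified)
print(f"estimated accuracy {estimate:.2f}, interval {low:.2f} to {high:.2f}")
```

With so small a sample the interval is very wide, which is the point: the estimate is only as good as the independent assessment behind it and the size of the sample.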
There is one exception to this, which is where the data is itself the managed reality. In these cases, the data is not based on observations of something outside itself. Anything dematerialized, like a bank account, falls into this category. In the database that actively manages a bank account, the data is the reality. However, if we look at bank accounts from outside and try to, say, capture their balances, then we are making observations and we are back to normal data. We will explore this distinction more in the future.
So those are a couple of hard truths about Data Accuracy: for data based on observations, it is impossible to achieve it completely, and we cannot estimate it just by looking at data.