
While everyone is asking if their data is ready for AI, I want to ask a somewhat different question: Is your data quality management (DQM) program ready for AI?
In my opinion, you need to be able to answer yes to the following four questions before you can have any assurance you are ready to build AI-enabled information products and services:
- Do you have a DQM program?
- Does your DQM program address quality control (QC) requirements?
- Does your measurement of data accuracy go beyond simply measuring data validation?
- Does your DQM program include coverage of the semantic data quality dimensions?
Do You Have a DQM Program?
I’m just going to punt on this one. If you don’t already have a formal DQM program, you have some serious catching up to do. The good news is that you don’t have to start from scratch. There are many established frameworks you can start with: the Data Management Body of Knowledge (DAMA-DMBOK), MIT’s Total Data Quality Management (TDQM), the Conformed Dimensions of Data Quality (CDDQ), the Data Management Capability Assessment Model (DCAM), the Six Sigma Data Quality (SSDQ) framework, the Data Quality Assessment Framework (DQAF), the ISO 8000-61 standard, Ten Steps to Quality Data (McGilvray), and BCBS 239 (Basel Committee on Banking Supervision), just to name a few.
These are general frameworks and will have to be adapted to your business. You will probably need some help to get started.
Does Your DQM Program Address Quality Control (QC) Requirements?
A few years ago, I wrote an article for this column called “What Could Possibly Go Wrong?” In that article, I talked about how many DQM programs tend to focus on the data producer side (input) while giving little or no attention to the consumer side (output). These programs subscribe to the theory that if you clean the input data very thoroughly and run it through tested and approved software, then there is no need to check the output. What could possibly go wrong?
As I mentioned in the same article, no manufacturer builds products this way. Even companies that produce shirts typically have a final inspection process, performed by someone who kindly leaves an inspection slip in one of the pockets. Quality control (QC), the process of checking that the final product meets quality requirements, is an essential part of quality management: the “check” in Shewhart and Deming’s Plan-Do-Check-Act cycle. Just as QC is essential to manufacturing, it is one of the founding principles of data quality management, as elaborated early on by Richard Wang et al. in “Manage Your Information as a Product.”
Yet QC is a missing component of many of the DQM programs I have reviewed over the years. It seems that we just can’t let go of the idea that if we start with clean and accurate data, everything will work out fine. The entire focus is on quality assurance (QA): making sure all the parts (sources) meet tolerances. As I remind my students, QC and QA are not interchangeable terms. QC happens after the final product is built; QA happens during the building process to assure the success of QC. Both are essential.
Although it may seem obvious, the purpose of collecting data and running information systems in your organization is to build information products and services that create value. Yet I still see many DQM programs that don’t include QC requirements for those information products and services. This is bad enough for traditional information products and services, and the situation will only get worse as organizations begin to build AI-enabled information products and services without proper QC.
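To make the idea concrete, here is a minimal sketch of an output-side QC gate in Python. It is only an illustration: the report structure, row-count floor, and null-rate tolerance are assumptions, not a prescribed standard. The point is that the finished product itself is inspected before release, independent of any input-side QA.

```python
def qc_check_report(rows: list[dict], min_rows: int = 100,
                    max_null_rate: float = 0.02) -> list[str]:
    """Inspect the finished product (here, a tabular report) before
    release -- the 'check' step of Plan-Do-Check-Act."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows; expected at least {min_rows}")
    if rows:
        null_count = sum(1 for row in rows for value in row.values() if value is None)
        null_rate = null_count / (len(rows) * len(rows[0]))
        if null_rate > max_null_rate:
            problems.append(f"null rate {null_rate:.1%} exceeds {max_null_rate:.0%}")
    return problems  # empty list means release; otherwise hold and investigate
```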
Does Your Measurement of Data Accuracy Go Beyond Simply Measuring Data Validation?
While everyone with any type of DQM program will claim that it measures data accuracy, very few actually do. In most cases, they substitute data validation as a proxy for accuracy. In another recent article, “Data Validation – Data Accuracy Imposter or Assistant?,” I discussed the problem of measuring data accuracy. While it is not easy, it is important, especially for AI applications, and there is usually much more that can be done with a marginal investment.
Although validation is an important DQ process, it only tells you when values can’t be correct, not that they are correct. It is asymmetrical: if a value fails validation, it is incorrect, but if it passes validation, it could be either correct or incorrect; you just don’t know. In other words, a data value can be perfectly reasonable and still be incorrect. In an accuracy (verification) process, each data value is judged as either correct or incorrect, and the judgment can be fully automated only when you have an online authoritative source to compare against.
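A minimal sketch may help make the asymmetry concrete. The field (a U.S. ZIP code), the format rule, and the lookup-table “authoritative source” below are all hypothetical; the point is that validation can only reject impossible values, while verification judges each value against a reference.

```python
import re

# Hypothetical validation rule: a U.S. ZIP code must be 5 digits,
# optionally followed by a 4-digit extension. Passing this check
# does NOT mean the value is correct.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

def validate_zip(zip_code: str) -> bool:
    """Validation: could this value possibly be correct?"""
    return bool(ZIP_PATTERN.match(zip_code))

def verify_zip(customer_id: str, zip_code: str, authoritative: dict) -> bool:
    """Verification: IS this value correct, judged against an
    authoritative source (here, a simple lookup table)?"""
    return authoritative.get(customer_id) == zip_code

# A well-formed but wrong value passes validation yet fails verification.
authoritative_source = {"C001": "02139"}  # assumed reference data
print(validate_zip("90210"))                              # True: could be correct
print(verify_zip("C001", "90210", authoritative_source))  # False: it is not
```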
Short of the ideal situation of having access to an online authoritative verification source, there are still ways to start closing the gap between simple validation and true accuracy. The two important actions are 1) correcting invalid values, and 2) manually checking samples of valid values, as sketched below. The second action not only improves the overall accuracy of the data, but can also lead to new validation rules that further improve the validation process. And if you can save the correct values confirmed by these two actions, it may be possible to build your own authoritative source over time.
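Here is a hedged sketch of the second action, assuming you have a list of records that have already passed validation: draw a reproducible random sample for manual review, then bank the values a human confirms as correct so they can seed a local authoritative source over time. The record layout, field names, and sample size are illustrative.

```python
import random

def sample_for_review(valid_records: list[dict], sample_size: int = 50,
                      seed: int = 42) -> list[dict]:
    """Draw a reproducible random sample of records that passed
    validation, for manual accuracy checking."""
    rng = random.Random(seed)
    return rng.sample(valid_records, min(sample_size, len(valid_records)))

def record_verified(record: dict, verified_store: dict) -> None:
    """Bank a manually confirmed value; over time these entries can
    grow into a local authoritative source."""
    verified_store[record["customer_id"]] = record["zip_code"]

# Usage: review the sample by hand, then save what was confirmed correct.
valid_records = [{"customer_id": "C001", "zip_code": "02139"},
                 {"customer_id": "C002", "zip_code": "90210"}]
verified: dict = {}
for rec in sample_for_review(valid_records, sample_size=2):
    record_verified(rec, verified)  # in practice, only after human review
```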
Does Your DQM Program Cover the Semantic Data Quality Dimensions?
Finally, I appeal to yet another recent article, “The Shift from Syntactic to Semantic Data Curation.” In that article, I bring up the issue of how AI-enabled products are pushing us to enlarge the scope of the data quality dimensions addressed in our DQM programs.
For traditional processing, we have been content to focus on a handful of so-called basic dimensions such as completeness, timeliness, validity, consistent representation, and, of course, accuracy, which, as we discussed earlier, is often only a validity check, not a true accuracy check.
But there are many other data quality dimensions to consider, and until recently we didn’t have much motivation to consider them. Take as an example seven dimensions from the Wang-Strong 16-dimensional framework described in “Beyond Accuracy: What Data Quality Means to Data Consumers”: believability, accuracy, objectivity, reputation, value-added, relevance, and access security. These are all important considerations for AI applications. Is your AI product believable? Will one crazy hallucination destroy its reputation with your customers? Is it biased, or truly objective in its answers? Are the sources, training data, or retrieval-augmented generation (RAG) data relevant to the task, and does using them actually add value to the product? Is it possible that you are leaking confidential data in the product’s responses?
While we haven’t been paying much attention to these dimensions until now, you can develop QA and QC requirements for all of these issues and measure and monitor them in your DQM program, as illustrated below. There is nothing so special about these requirements that they can’t be addressed by your data quality team. Now would be a good time to conduct a complete review of your DQM program, using these four questions as your guide.
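As one illustration of how such requirements could be measured and monitored, here is a hedged sketch of a per-dimension scorecard. The dimension names come from the Wang-Strong framework, but the scores, thresholds, and the idea of reducing each dimension to a single number are assumptions standing in for whatever measurements your data quality team defines.

```python
from dataclasses import dataclass

@dataclass
class DimensionCheck:
    dimension: str    # e.g., a Wang-Strong dimension
    score: float      # measured value, 0.0 to 1.0 (method is up to your team)
    threshold: float  # minimum acceptable score

def failing_dimensions(checks: list[DimensionCheck]) -> list[str]:
    """Return the dimensions that are out of tolerance."""
    return [c.dimension for c in checks if c.score < c.threshold]

# Hypothetical scorecard for one AI-enabled information product.
scorecard = [
    DimensionCheck("believability", 0.92, 0.95),
    DimensionCheck("objectivity", 0.97, 0.90),
    DimensionCheck("relevance", 0.88, 0.85),
    DimensionCheck("access_security", 1.00, 1.00),
]
print(failing_dimensions(scorecard))  # ['believability']
```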
References
Wang, R. Y., et al. (1998). Manage Your Information as a Product. MIT Sloan Management Review.
Wang, R. Y., & Strong, D. M. (1996). Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4), 5–33. https://doi.org/10.1080/07421222.1996.11518099