Data Speaks for Itself: Data Quality Management in the Age of Language Models

Unsurprisingly, my last two columns discussed artificial intelligence (AI), specifically the impact of language models (LMs) on data curation. My August 2024 column, “The Shift from Syntactic to Semantic Data Curation and What It Means for Data Quality,” and my November 2024 column, “Data Validation, the Data Accuracy Imposter or Assistant?” addressed some of the touch points between LMs and data quality management (DQM). In this column, I would like to suggest a more complete roadmap for adapting your organization's DQM program to meet both the challenges and the opportunities LMs pose.

So, just to be clear from the beginning, when I refer to AI, I am talking about LMs, whether commercially available models such as ChatGPT, Gemini, or Llama, or your own bespoke models. The power of LMs is their ability to understand natural language syntax and semantics. Therein lies the opportunity to use LMs to process natural language inputs directly and avoid much of the time and effort currently spent on parsing, structuring, and standardizing data.

As a simple experiment, I prompted ChatGPT with the following made-up address: “John Gregory, 123 Oak Street, Great Falls, Ohio.” It was able to tell me the person's first and last name, street number, street name, city name, and the correct postal abbreviation of “OH” for the state of Ohio. These are all things we usually must parse and standardize from address records with a rule-based program. And without any supplemental training or prompting, ChatGPT told me that it could not provide the missing ZIP code because Great Falls is not a city in Ohio! A nice, unexpected validation on top of the accurate parsing and labeling.

Unfortunately, this ability to understand name, address, and other demographic semantics is an opportunity largely unrealized in current large-scale processing, for two reasons. The first is that we have not yet developed semantic processing languages that can sequentially process natural language inputs and return consistent results. Because of their probabilistic nature, LMs can give different, and sometimes incorrect, results for the same input and prompt (the so-called hallucination problem). The best we can do now is to use LMs as a pre-processing tool that converts such inputs into a structured format, which can then be fed into our legacy applications that expect attribute-value pairs. This is mainly due to the huge investment organizations have in legacy software applications that require standardized, labeled input. But I don't doubt that newer LM-enabled software will soon arrive to take advantage of this semantic understanding.
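To make the pre-processing idea concrete, here is a minimal sketch of using an LM to turn a free-text contact record into labeled fields a legacy application could consume. It assumes the OpenAI Python client and a model name such as gpt-4o-mini; the field schema and the idea of handing the result to a downstream loader are illustrative, not any specific product's API.

```python
# Minimal sketch: use an LM to structure a free-text address into the
# attribute-value pairs a legacy application expects.
# Assumes the OpenAI Python client; the schema is a hypothetical illustration.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "Parse this contact record into JSON with the keys "
    "first_name, last_name, street_number, street_name, city, state_abbr, zip. "
    "Use null for anything you cannot determine or that appears invalid.\n\n"
    "Record: {record}"
)

def structure_record(free_text: str) -> dict:
    """Ask the LM to return labeled fields for one free-text record."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # assumed model name
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(record=free_text)}],
        response_format={"type": "json_object"},  # request JSON output
        temperature=0,                            # reduce run-to-run variation
    )
    return json.loads(response.choices[0].message.content)

fields = structure_record("John Gregory, 123 Oak Street, Great Falls, Ohio")
print(fields)
# The labeled output can now be handed to a legacy application that
# expects standardized attribute-value pairs.
```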

The second reason we don't use LMs for large-scale operational processing is that they are much slower than deterministic, rule-based applications. The most common workaround is to embed natural language inputs into vectors. While these numeric vectors allow for faster processing, the downside is that much of the granular semantic information is lost in the process. So, while we can say that two natural language inputs, such as names and addresses, are semantically close in overall meaning because their vector embeddings are near each other in the vector space, we cannot extract and manipulate the name and address components from their embeddings the way we can with prompts.
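As a rough illustration of that trade-off, the sketch below, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model are available, computes the cosine similarity of two address strings from their embeddings. The single similarity score is cheap to compute at scale, but the individual field values cannot be read back out of the vectors.

```python
# Minimal sketch of comparing two contact records by embedding similarity,
# assuming the sentence-transformers package and the all-MiniLM-L6-v2 model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

a = "John Gregory, 123 Oak Street, Great Falls, Ohio"
b = "Jon Gregory, 123 Oak St., Great Falls, OH"

vec_a, vec_b = model.encode([a, b])

# Cosine similarity: values near 1.0 mean the records are semantically close.
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"cosine similarity: {similarity:.3f}")

# Note: the vectors tell us the two strings are close in overall meaning,
# but there is no way to extract the city or last name back out of them.
```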

Nevertheless, while LMs are not yet processing the company's payroll, they are adding tremendous value in many other applications. Coupled with generative AI (GenAI), they are used for question answering, translation between natural languages, chatbots, code generation, report summarization, and document and email generation. This is happening across a broad spectrum of verticals such as healthcare, entertainment, finance, and manufacturing.

Back to the main theme, the good news is that the fundamental Deming-Shewhart Plan-Do-Check-Act (PDCA) model and the focus on managing information as product are still sound underpinnings of DQM, even in the age of LMs. In my opinion, it is not the framework that needs to change; rather, it is the prioritization of data quality dimensions and of the data quality requirements expressed in those dimensions. While traditional data quality dimensions such as validation, completeness, consistency, duplication, and timeliness are still important, the value derived from LM applications depends heavily on other, less commonly addressed dimensions such as accuracy, value added, relevance, objectivity (lack of bias), reputation, believability, and security.

This leads to number one on my roadmap: ensuring that your DQM program is truly pursuing accuracy, and not just validation. Is it relentlessly working to close the gap between validation and accuracy by always correcting invalid data, systematically sampling and checking valid data values, continually enlarging its validation test portfolio, and creating verification databases whenever practical? See my previous column, “Data Validation, the Data Accuracy Imposter or Assistant?” for more detail.
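A minimal sketch of the sampling step, with hypothetical record fields and a toy format check standing in for your real validation rules, might look like this:

```python
# Minimal sketch of systematically sampling records that passed validation
# so they can be checked for accuracy against a verification source.
# The record structure and the is_valid() rule are hypothetical placeholders.
import random
import re

records = [
    {"id": 1, "state": "OH", "zip": "44101"},
    {"id": 2, "state": "OH", "zip": "99999"},   # valid format, possibly inaccurate
    {"id": 3, "state": "XX", "zip": "1234"},    # fails validation outright
]

def is_valid(rec: dict) -> bool:
    """Syntactic validation only: format checks, not ground truth."""
    return bool(re.fullmatch(r"[A-Z]{2}", rec["state"]) and re.fullmatch(r"\d{5}", rec["zip"]))

valid_records = [r for r in records if is_valid(r)]

# Sample a fraction of the *valid* records for accuracy verification;
# validation alone cannot tell us whether these values are actually true.
sample = random.sample(valid_records, k=max(1, len(valid_records) // 10))
for rec in sample:
    print(f"queue record {rec['id']} for verification against a reference source")
```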

Number two, enlarge your thinking about data quality dimensions and the scope of your DQM program. Relevant data contributes directly to the problem at hand, helping LMs focus on the most important variables and relationships, while irrelevant data can clutter models and lead to inefficiencies. Knowing what is, and is not, relevant and adding value requires listening to the needs of the data product and data service owners, the voice of the customer (VoC). Understanding customer needs is one of the most important principles of managing information as product.

Objectivity, or lack of bias, is usually associated with demographic data, but bias can occur in almost any type of data source. There are statistical methods for detecting and measuring it, which makes this a great opportunity to engage with the statisticians on the data science or modeling team. Reputation and believability are more subjective, but they are still measurable through periodic surveys of data consumers, data providers, and data stewards. Amazon has vendor and product ratings to measure reputation; why can't you?
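As one example of such a statistical method, the sketch below, using made-up counts, applies a chi-square goodness-of-fit test to compare the demographic mix observed in a data source against a reference distribution; a small p-value is a signal to investigate further.

```python
# Minimal sketch of a statistical bias check: compare the demographic mix
# observed in a data source against a reference (expected) distribution
# using a chi-square goodness-of-fit test. The counts here are made up.
from scipy.stats import chisquare

groups = ["A", "B", "C", "D"]
observed = [480, 260, 180, 80]              # records per group in the source
reference_share = [0.40, 0.30, 0.20, 0.10]  # expected proportions

total = sum(observed)
expected = [p * total for p in reference_share]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Group mix differs significantly from the reference: investigate for bias.")
```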

LMs raise new security concerns because they can accidentally reveal sensitive information, proprietary algorithms, or other confidential information in their responses. DQM should become involved with the processes for filtering both inputs (training data) and outputs (responses). This is an opportunity to engage with the IT security team to help implement strict response filtering and to anonymize data during training. Regular audits of LM responses are just like any other quality control (QC) process, and any disclosure problems should be ticketed and resolved just as any other DQ issue would be.
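A disclosure audit can start very simply. The sketch below scans LM responses for a few patterns that look like sensitive data; the patterns and the idea of printing a ticket for each hit are placeholders for whatever your security team and DQ issue tracker actually require.

```python
# Minimal sketch of a disclosure audit over LM responses: scan outputs for
# patterns that look like sensitive data and ticket any hits as DQ issues.
import re

SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

def audit_response(response_id: str, text: str) -> list[str]:
    """Return the names of any sensitive patterns found in an LM response."""
    hits = [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)]
    for name in hits:
        # In practice this would open a ticket in your DQ issue tracker.
        print(f"DQ ticket: response {response_id} may disclose {name}")
    return hits

audit_response("resp-001", "You can reach the customer at 501-555-0123.")
```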

Number three is to increase the DQM footprint through collaborations with other teams, some of which have already been mentioned. These include Data Acquisition, Data Science, Risk Management and Security, and Data Product Managers. Members of these teams can be your best friends in understanding how current sources are adding value, how other relevant sources could improve products and services, and how you can promote the reuse of products, services, and data sources across the organization. In general, DQM programs should help the organization pursue the FAIR principles of making data Findable, Accessible, Interoperable, and Reusable.

Number four is to begin thinking longer term about how AI can improve DQM. Vendors are already enhancing DQ tools with built-in statistical analytics, data mining techniques, and AI to automate the generation of data validation rules, classify data, automate data cleaning and standardization, and structure unstructured text. As LMs are trained with more complete and accurate data, they will be able to recognize and correct certain types of data quality errors. 
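To give a flavor of automated rule generation, here is a toy sketch that profiles a column of sample values and suggests candidate validation rules from what it observes; commercial DQ tools do this at far greater depth, often with AI assistance.

```python
# Minimal sketch of profiling a column and suggesting candidate validation
# rules from the observed values. The sample values are made up.
values = ["44101", "43215", "45202", "44113", "43604"]

def profile_column(vals: list[str]) -> dict:
    """Derive simple facts a rule generator could turn into validation rules."""
    return {
        "distinct_lengths": {len(v) for v in vals},
        "all_digits": all(v.isdigit() for v in vals),
        "min": min(vals),
        "max": max(vals),
    }

profile = profile_column(values)

suggested_rules = []
if profile["all_digits"] and profile["distinct_lengths"] == {5}:
    suggested_rules.append(r"value matches ^\d{5}$")
suggested_rules.append(f"value between {profile['min']} and {profile['max']}")

for rule in suggested_rules:
    print("candidate validation rule:", rule)
```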

Number five is to learn as much as you can about LMs and Generative AI. Don’t make the mistake of thinking that all things AI belong to the Data Science team. As LMs become integrated into the organization’s data processing, they need to be subject to the same data quality assurance and data quality control requirements as traditional data products and services. Even though LMs are constantly evolving, try to become conversant with the basic principles and vocabulary. Concepts such as LM training, fine-tuning, retrieval-augmented generation (RAG), prompt engineering, agents, encoding, decoding, embedding, vectorization, and vector databases can all be understood at a high level without a degree in computer science. 
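For example, the basic RAG pattern can be understood from a few lines of Python: embed a question and a small document collection, retrieve the closest passage, and fold it into the prompt. The embed() function below is a toy stand-in for a real embedding model, and the resulting prompt would be sent to an LM of your choice.

```python
# Minimal sketch of retrieval-augmented generation (RAG): retrieve the passage
# closest to a question and include it in the prompt sent to the LM.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy embedding (character-frequency vector); a real system would call an embedding model."""
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "Customer addresses are standardized nightly by the address hygiene job.",
    "Duplicate customer records are merged by the entity resolution service.",
]
doc_vectors = [embed(d) for d in documents]

question = "How are duplicate customer records handled?"
scores = [float(np.dot(embed(question), v)) for v in doc_vectors]
best_passage = documents[int(np.argmax(scores))]

prompt = f"Answer using only this context:\n{best_passage}\n\nQuestion: {question}"
print(prompt)  # this prompt would then be sent to the LM
```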

So, I hope that these suggestions have prompted you (no pun intended) to think about specific ways in which you can begin to adapt your DQM program to improve the LM products and services in your organization. Don't let others convince you that LMs have nothing to do with data or data quality. Don't be afraid to engage.


Dr. John Talburt

Dr. John Talburt is Professor of Information Science and Acxiom Chair of Information Quality at the University of Arkansas at Little Rock (UALR) where he serves as the Coordinator for the Information Quality Graduate Program.  He also holds appointments as Executive Director of the UALR Laboratory for Advanced Research in Entity Resolution and Information Quality, Associate Director of the Acxiom Laboratory for Applied Research, and Co-Director of the MIT Information Quality Program’s Working Group on Customer-Centric Information Quality Management.  His current research is at the intersection of information quality and information integration, particularly the areas of entity resolution and entity identification.  Prior to his appointment at UALR he was a leader for research and development and product innovation at Acxiom Corporation.  Professor Talburt is an inventor for several patents related to customer data integration, the author for numerous articles on information quality and entity resolution, and the winner of the 2008 DAMA International Academic Award.  He can be reached at (501) 371-7616 or by email at jrtalburt@ualr.edu.
