Data Speaks for Itself: The Shift from Syntactic to Semantic Data Curation

My column today is a follow-up to my article “The Challenge of Data Consistency,” published in the May 2023 issue of this newsletter. In that article, I discussed how semantic encoding (also called concept encoding) is the go-to solution for consistently representing master data entities such as customers and products. Assigning and maintaining persistent identifiers for master entities is designed to sidestep the difficulty of maintaining consistent representations of complex attributes like customer name and address. And by complex, I mean the number of rules and the amount of effort required to keep the syntax of the values that represent these concepts consistent.

Fundamentally, it is the problem of encoding human expressions into machine-readable form. In natural language, we have many ways of saying the same thing. But since the beginning of data processing, we have been condemned to standardize the syntax of both data and metadata values, and to parse high-level concepts such as name and address into their simplest logical components. For many years, these tasks have comprised the primary work of data curation and data quality management. And in most cases, it is effort that is consistently underestimated and undervalued. How many dataflow diagrams have you seen where the source goes into a box called data cleansing or data preparation and, by assumption, comes out the other side clean? We all know this work must be done, but we simply assume that some unknown persons will run the proper processes to get it out of the way so we can perform the real magic of building our data product.
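As a toy illustration of this parsing work, here is a minimal Python sketch that splits a one-line US address into its logical components. The regex and field names are my own assumptions for illustration; a real curation pipeline needs far more rules to handle the many ways people write addresses.

```python
import re

def parse_us_address(line: str) -> dict:
    """Toy parser: split a one-line US address into components.
    Real curation pipelines need far more rules; this is a sketch."""
    m = re.match(
        r"(?P<number>\d+)\s+(?P<street>[\w\s\.]+?),\s*"
        r"(?P<city>[\w\s]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})",
        line.strip(),
    )
    return m.groupdict() if m else {}

parsed = parse_us_address("123 N Main St, Little Rock, AR 72201")
# parsed == {'number': '123', 'street': 'N Main St',
#            'city': 'Little Rock', 'state': 'AR', 'zip': '72201'}
```

Even this trivial sketch breaks as soon as an apartment number, a PO box, or a missing comma shows up, which is exactly why the effort is so consistently underestimated.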

All this work and effort is necessary because our current processing applications, and the programming languages they are written in, are actually quite dumb. We must constantly point out to our computer programs things like “Okay, now the value in this variable is the customer’s last name” and “Look, the tokens in this column are the street names for the billing address, got it?” And like children, they don’t remember from record to record or run to run; it all must be repeated each time new data arrives. In short, our current application systems expect that the semantics of the input data have already been defined.

Well, as I suggested in that same article, this is beginning to change. Large language models (LLMs) like ChatGPT, Llama, and Claude are good, and I mean really good, at reading and understanding the semantics of natural language. Although this capability is now mostly applied to downstream processes like report summarization and composing emails, it will eventually be applied to upstream data curation processes. In a great example of process and design fixation, most of the current focus on using LLMs upstream is to do exactly the opposite: take unstructured data and apply labels to key entities (named entity recognition) so our dumb processes can operate.
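The labeling step described above can be sketched with a toy stand-in. A real system would use an LLM or a trained named-entity-recognition model; the patterns below are simplified assumptions, but the input/output shape is the point: raw text goes in, labeled values come out for the schema-bound processes downstream.

```python
import re

# Toy stand-in for an LLM/NER labeling step: tag key entities in raw
# text so downstream schema-bound processes can consume them.
# These patterns are simplified assumptions for illustration.
PATTERNS = {
    "ZIP":   re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def label_entities(text: str) -> list[tuple[str, str]]:
    """Return (label, value) pairs found in unstructured text."""
    found = []
    for label, pattern in PATTERNS.items():
        found.extend((label, m) for m in pattern.findall(text))
    return found

labels = label_entities("Reach me at 501-555-0123 or jane@example.com in 72201.")
# labels contains ('PHONE', '501-555-0123'), ('EMAIL', 'jane@example.com'),
# and ('ZIP', '72201')
```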

We even make our customers do this work for us when we can. We require them to fill out forms with labeled boxes so we can be sure our semantically challenged applications will know which data value is the zip code for shipping and which is the customer’s last name.

Eventually, we will have to ask ourselves, why are we taking this extra step? Why not just design new, semantically aware processes that can read and process the source data in its raw form? Eventually this will happen, and I believe that LLMs will completely change how we approach data curation in the future. Data curation and data quality processes will move from being mere syntax manipulation and labeling to new, semantically enabled processes that can drink directly from the data lake and skip the medallion architecture’s bronze, silver, and gold layers in building data products.

Just as a simple experiment, I gave one of the currently available LLMs two customer name-and-address records. Both had identical names and addresses except that in one record the first name was given as “Julie” and in the other as “Julian”. These similar-name situations are among the most difficult for current MDM systems to resolve because they usually apply some level of fuzzy matching to overcome spelling and keying errors. Because the names “Julie” and “Julian” differ by only two characters, many current MDM systems would consider them matching records and assume they represent the same customer.
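To make concrete why a rule-based matcher treats “Julie” and “Julian” as the same person, here is a minimal Python sketch of Levenshtein edit distance, the kind of fuzzy measure many matching systems use. The similarity threshold at the end is an assumption chosen purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

dist = levenshtein("Julie", "Julian")                     # 2
similarity = 1 - dist / max(len("Julie"), len("Julian"))  # ~0.67
# A fuzzy matcher with a typical threshold (say, similarity >= 0.6)
# would declare these records a match.
```

Nothing in that computation knows that one string is a man’s name and the other a woman’s; it sees only characters.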

In my prompt, I simply asked the LLM whether these two records were for the same person. Interestingly, an older release of the LLM responded that they did represent the same person, as would most rule-based MDM systems. However, the most current (public) release recognized that they most likely represent two different people, because one has a man’s name (Julian) and the other a woman’s name (Julie). This kind of semantic understanding is very powerful, and best of all, it works across different languages and doesn’t have to be programmed as some kind of rule.

How soon will these new semantic applications happen? I’m not sure, but at the rate technology is moving, it probably won’t be long. Will there still be a role for data quality? Yes, most definitely, but the emphasis will shift. In 1996, Rich Wang and Diane Strong published the article “Beyond Accuracy: What Data Quality Means to Data Consumers.” The article established a data quality framework of 15 dimensions, factors that data consumers can use to formulate quality requirements.

Initially, data quality practitioners and vendors criticized this framework as too academic and too complex. They argued that you only need to pay attention to a few of the dimensions, such as consistent representation, validity, completeness, timeliness, and accuracy. Not surprisingly, except for accuracy, these are the dimensions rooted in syntax that current programming methods can readily handle. And even in the case of accuracy, most organizations substitute validation as a proxy for the effort required to actually measure accuracy.
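The validation-versus-accuracy distinction is easy to see in a toy sketch. The zip codes and the “ground truth” value below are invented for illustration: a value can pass every syntactic check and still be wrong.

```python
import re

def is_valid_zip(value: str) -> bool:
    """Syntactic validation: five digits, optional +4 extension."""
    return bool(re.fullmatch(r"\d{5}(-\d{4})?", value))

# Invented example: the customer actually lives in 72201 (assumed ground
# truth), but the record holds a different, equally well-formed value.
recorded, actual = "90210", "72201"

valid = is_valid_zip(recorded)   # True  -> passes the validity check
accurate = recorded == actual    # False -> fails accuracy
```

Validation only tells us the value *could* be a zip code; measuring accuracy requires comparing the record against the real world, which is exactly the effort most organizations avoid.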

I find it amusing how what was considered old becomes new again. In the new world of AI readiness, we will have to revisit the Wang-Strong framework and pay a lot more attention to not only accuracy, but to several other dimensions that have been largely ignored such as objectivity, reputation, believability, relevance, and value added. Correcting syntax will not address these qualities of data, and while they are not easy to measure, they are becoming extremely important to the integrity of AI applications. Measuring and educating users on these dimensions will be the new challenge for data quality professionals and data quality management systems as we move forward into the new world of semantic processing. 


Dr. John Talburt

Dr. John Talburt is Professor of Information Science and Acxiom Chair of Information Quality at the University of Arkansas at Little Rock (UALR) where he serves as the Coordinator for the Information Quality Graduate Program.  He also holds appointments as Executive Director of the UALR Laboratory for Advanced Research in Entity Resolution and Information Quality, Associate Director of the Acxiom Laboratory for Applied Research, and Co-Director of the MIT Information Quality Program’s Working Group on Customer-Centric Information Quality Management.  His current research is at the intersection of information quality and information integration, particularly the areas of entity resolution and entity identification.  Prior to his appointment at UALR he was a leader for research and development and product innovation at Acxiom Corporation.  Professor Talburt is an inventor for several patents related to customer data integration, the author for numerous articles on information quality and entity resolution, and the winner of the 2008 DAMA International Academic Award.  He can be reached at (501) 371-7616 or by email at jrtalburt@ualr.edu.
