Data Speaks for Itself: The Challenge of Data Consistency

Data quality management (DQM) has advanced considerably over the years. The full extent of the data quality problem was first recognized during the data warehouse movement in the 1980s. Out of this came dimensional frameworks for expressing data quality requirements and DQM models such as ISO 8000 Part 61: Data quality management: Process reference model. While I have discussed the overall DQM process in previous articles, I want to dig a little more into the details in this article.

If you were to ask most people, even data people, about the meaning of data quality (DQ), the typical response would be “we want accurate data!” However, in my experience, by far the most pervasive DQ problems in organizations are incompleteness and inconsistent representation. When key data values are missing, it is an acute problem whose intervention is finding some way to acquire the missing values or, perhaps, to keep from losing them in the first place. But the problem of consistent representation is more of a chronic problem that must be treated every day.

In general, consistent representation simply means that when the values of an attribute are the same, they mean the same thing, and when the values of an attribute are different, they mean different things. In other words, consistent representation means that the syntax (expression) of a concept (data item) is in alignment (one-to-one correspondence) with the semantics (meaning) of the concept. This rule can be broken from two sides. One side is allowing the same concept to be represented by more than one value. The other side is allowing the same value to represent more than one concept.
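The one-to-one rule can be checked mechanically when both the values and their intended meanings are known. The following is only a minimal sketch under that assumption: the function name `find_inconsistencies` and the sample pairs are hypothetical, chosen for illustration. It reports meanings that have more than one value (the first side) and values that have more than one meaning (the second side).

```python
from collections import defaultdict

def find_inconsistencies(pairs):
    """Return (synonyms, homonyms) for a list of (value, meaning) pairs.

    synonyms: meanings represented by more than one value.
    homonyms: values that represent more than one meaning.
    """
    meanings_by_value = defaultdict(set)
    values_by_meaning = defaultdict(set)
    for value, meaning in pairs:
        meanings_by_value[value].add(meaning)
        values_by_meaning[meaning].add(value)
    synonyms = {m: vals for m, vals in values_by_meaning.items() if len(vals) > 1}
    homonyms = {v: means for v, means in meanings_by_value.items() if len(means) > 1}
    return synonyms, homonyms

# Illustrative sample: "CA" is used for both a state and a country.
sample_pairs = [
    ("AR", "Arkansas"),
    ("Ark", "Arkansas"),
    ("CA", "California"),
    ("CA", "Canada"),
]
synonyms, homonyms = find_inconsistencies(sample_pairs)
print(synonyms)
print(homonyms)
```

In practice the intended meaning is rarely recorded alongside the value, which is why the rest of this article is about the processes that keep the two aligned.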

As a simple example, the calendar date May 7, 2023, can be expressed as 5/7/23 (US pre-Y2K convention), 5/7/2023 (US post-Y2K convention), 7/5/2023 (European convention), or 2023-05-07 (ISO standard). Another example is how my state of Arkansas can be represented in several ways, including “AR” (USPS code), “Ark” (traditional abbreviation), “Arkansaw” (alternate spelling), and “Arknsas” (misspelling). These examples represent two different types of inconsistent representation: format and value. In the case of dates, the problem is having the same value expressed in different formats. In the case of the state names, the problem is having different but synonymous string values with the same meaning.
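To make the format problem concrete, here is a small sketch that converts the four date strings above to the ISO form using Python's standard `datetime` module. The convention labels in `DATE_FORMATS` and the function `to_iso` are hypothetical names for this illustration. Note the assumption built into it: the caller must already know which convention a source uses, because a string like 5/7/2023 cannot be disambiguated from the text alone.

```python
from datetime import datetime

# Hypothetical convention labels mapped to strptime patterns.
DATE_FORMATS = {
    "us_short": "%m/%d/%y",   # 5/7/23
    "us_long": "%m/%d/%Y",    # 5/7/2023
    "european": "%d/%m/%Y",   # 7/5/2023
    "iso": "%Y-%m-%d",        # 2023-05-07
}

def to_iso(date_text: str, convention: str) -> str:
    """Parse date_text under the named convention and return YYYY-MM-DD."""
    parsed = datetime.strptime(date_text, DATE_FORMATS[convention])
    return parsed.date().isoformat()

examples = [
    ("5/7/23", "us_short"),
    ("5/7/2023", "us_long"),
    ("7/5/2023", "european"),
    ("2023-05-07", "iso"),
]
# All four inputs collapse to a single ISO value.
normalized = {to_iso(text, convention) for text, convention in examples}
print(normalized)
```

Reading 7/5/2023 under the wrong convention silently yields July 5 instead of May 7, which is exactly why a uniform format matters.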

Consistent formatting is usually addressed through a data standard that prescribes a uniform format to be used across all systems in the organization. On the other hand, the synonymous value issue is usually addressed through reference data management (RDM). In RDM, a particular value from a collection of synonyms is designated as the value to be used across all systems in an organization, such as the use of ISO geographic codes or USPS postal codes. To facilitate the maintenance of the lists of designated values, many vendors have developed RDM software systems. In order to enforce the consistent use of these standard formats and values, they must be mandated by data governance (DG) standards for data quality.
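At its simplest, the RDM idea amounts to a lookup from every known synonym to the one designated value. The sketch below assumes the USPS code is the designated value; the table `STATE_REFERENCE` and the function `standardize_state` are hypothetical and do not represent any vendor's RDM product.

```python
from typing import Optional

# Hypothetical reference table: each known synonym (lowercased)
# maps to the designated value, here the USPS code.
STATE_REFERENCE = {
    "ar": "AR",
    "ark": "AR",
    "arkansas": "AR",
    "arkansaw": "AR",
    "arknsas": "AR",
}

def standardize_state(raw_value: str) -> Optional[str]:
    """Return the designated code, or None if the value is not in the table."""
    key = raw_value.strip().rstrip(".").lower()
    return STATE_REFERENCE.get(key)

print(standardize_state("Arkansaw"))
print(standardize_state(" Ark. "))
print(standardize_state("Texas"))
```

Returning `None` for an unknown value, rather than guessing, leaves the decision to whoever maintains the reference list.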

Both data standards and RDM systems deal primarily with concepts expressed by single-value representations such as state names. A more difficult case to manage is when multiple values are required to express a single concept. Here is where we encounter the other side of the inconsistent representation problem, i.e., the same value having different meanings. For example, consider the problem of representing an individual customer. If you are a corner neighborhood store, you might be able to identify your customers just by their names.

But for a large company, this would not be sufficient, as several different customers might have the same name. To solve this, we might add the elements of address. This will help distinguish customers with the same name if they have different addresses. However, as we introduce additional attributes to describe each customer, we move back to the other side of inconsistent representation. It becomes more and more difficult to maintain the same values for the same customer because there are so many different values to maintain, such as the spellings for the first name, last name, street number, street name, city name, and so on. If any one of these varies from an agreed-upon fixed value, then the representation of the customer becomes inconsistent.

For these cases, the technique of semantic encoding is used. In semantic encoding, an agreed upon, single value is created (an identifier) to represent the entire concept. In the example of customers, the unique, single identifier is a customer number created to represent each customer. The software required to maintain this strict correspondence between a concept identifier such as customer number and the data describing the concept such as customer name and address is much more sophisticated and complex than for RDM systems that handle single values. For this reason, the semantic encoding technique is usually reserved for only the most critical concepts of the organization such as customers, suppliers, and products. Because the data describing these key concepts is called master data, the software that maintains the semantic encoding for them is called a master data management (MDM) system.
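As a toy illustration of semantic encoding, the sketch below assigns one customer number to records that describe the same customer in different ways. Everything in it is a simplifying assumption: the class `CustomerRegistry`, the tiny lookup tables, the two-word name format, and the exact-match key are hypothetical, and real MDM systems use far more sophisticated matching.

```python
from itertools import count

# Hypothetical, deliberately tiny lookup tables for this illustration.
NICKNAMES = {"jim": "james"}
STREET_SUFFIXES = {"st", "street"}
STATE_CODES = {"ar": "AR", "ark": "AR"}

def match_key(name, street, city, state):
    """Reduce a customer description to a normalized comparison key.

    Assumes the name is exactly "First Last".
    """
    first, last = name.lower().split()
    first = NICKNAMES.get(first, first)
    street_tokens = [
        token for token in street.lower().replace(".", "").split()
        if token not in STREET_SUFFIXES
    ]
    state_key = state.lower().rstrip(".")
    state_code = STATE_CODES.get(state_key, state.upper())
    return (first, last, " ".join(street_tokens), city.lower(), state_code)

class CustomerRegistry:
    """Keeps one identifier per distinct normalized customer key."""

    def __init__(self):
        self._ids = {}
        self._next_number = count(1)

    def identify(self, name, street, city, state):
        key = match_key(name, street, city, state)
        if key not in self._ids:
            self._ids[key] = f"C{next(self._next_number):06d}"
        return self._ids[key]

registry = CustomerRegistry()
id_a = registry.identify("James Doe", "123 Oak", "Anytown", "AR")
id_b = registry.identify("Jim Doe", "123 Oak St", "Anytown", "Ark")
id_c = registry.identify("James Doe", "456 Elm St", "Anytown", "AR")
print(id_a, id_b, id_c)
```

The first two records receive the same customer number despite their different spellings, while the third, at a different address, receives a new one. Downstream systems can then rely on the identifier alone.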

Fundamentally, MDM is a data quality process to maintain consistent representation of key concepts across all systems in the organization. However, no matter how well an MDM system maintains consistent representation of concepts from a technical viewpoint, if parts of the organization don’t use the MDM system or insist on using a different MDM system, then MDM will fail to achieve its intended goal of maintaining consistent representation. Just as with data standards and RDM, MDM standards must be included in DG for MDM to be successful. DG standards for MDM typically mandate having only one MDM system for each master data type, requiring all parts of the organization to use the identifiers maintained by these systems, and requiring that MDM identifiers be added to source data as early as possible.

But we should stop and ask ourselves, “Why do organizations go to such great lengths to maintain consistent representation through data standards, RDM, and MDM systems?” The reason is the current state of the art in software development. Application systems are currently designed with the expectation that input data has already been preprocessed to be consistent with a set of data input requirements. Creating application software that handles inconsistent data as part of its application processing is too complex given our current software development languages and programming methods. Thus, there is a separation of concerns: first analyze and correct inconsistencies, then perform the application processing on the consistent data.

But as entrenched as this approach has become, I believe this will soon change. One of the first impacts of the new AI models being produced, especially large language models like ChatGPT and Bard, will be to overcome inconsistencies and other data quality problems in data in the same way that we do. When you or I see records like “James Doe, 123 Oak, Anytown, AR” and “Jim Doe, 123 Oak St, Anytown, Ark”, we already recognize that they are describing the same customer despite the inconsistencies. But because we know our application systems like billing cannot recognize them as the same, we put these through an MDM process to add the customer number so that the billing function will work correctly.

But soon, I believe that new AI-enabled software will be able to do the same thing. And I don’t mean automatically carrying out an MDM process to add a customer number. I mean it will directly ingest the records and use the data “as is” without MDM or any other data cleaning preprocessing. After all, isn’t the goal of AI to build systems that can equal or exceed human performance?

I believe that a fundamental shift in the world of data and data processing is coming soon, and it will be amazing! The next phase of data quality is on our doorstep, a phase that will focus on data opportunities, not data issues. We will no longer have to analyze source data and build ETL flows. Soon, we will see large-scale systems capable of ingesting data in all formats. Instead of being pre-programmed for specialized applications such as billing, payroll, and business reports, each requiring specially preprocessed data, these systems will, like the computer that serves Mr. Spock on Star Trek’s Enterprise, answer our questions and obey our commands. And not only for operational requests like, “Give me a report of our gross and net revenue for the last 60 days”, but also for strategic questions such as, “How should we optimize our product inventory for the coming holiday season?”

Perhaps this will lead to a new interpretation of GIGO. Instead of Garbage In, Garbage Out, it can become Garbage In, Good Out. Are you ready for the future?


Dr. John Talburt
