Business has a fundamental problem with data quality. In some places it’s merely painful, in others it’s nearly catastrophic. Why is the problem so pervasive? Why does it never seem to get fixed? I believe we’ve been thinking about the problem wrong. It’s time for a fresh look.
The central flaw in the long-running discussion over data quality is literally its focus on ‘data.’ Stored data is merely the system or database residue of things that have already happened in the business, a memory of past events.
Truly fixing ‘data quality’ problems requires a business perspective: a shift in focus from data design or data cleansing to what occurs in the business itself. Our sights should be trained squarely on the business activity that results in the data.
Creating Data: More To It Than You Think
Consider what workers are actually doing when they create a piece of data (these days, probably using some application). In the business world, of course, they’re probably just doing a bit of work. Look more closely, however, and in some ways what they’re doing is actually quite profound. Think about it this way:
The act of creating data is the act of
creating a message to people in the future.
Recipients of the message might be only milliseconds away – but they might also be weeks, months, or even years away. Data isn’t just data – it’s an effort to communicate.
Normally we think of communication as either direct conversation or (in the spirit of the times) a flurry of text messages exchanged more or less in real time with people we know. In either case there’s usually a shared context within which the meaning of the messages can be interpreted, as well as more or less real-time exchange of clarifications.
What’s distinct about creating data is that you’re almost certainly not going to be face-to-face with the recipients of the message or connected live with them via an interactive network. That fact rules out body language (e.g., raised eyebrows or emoticons) and dialog (including grunts and groans – or more emoticons) to clarify what you mean. In that sense the communication is blind.
As a consequence, the data a worker creates literally needs to speak for itself. The emphasis needs to be on communication quality – that is, on communication clarity.
Communication clarity focuses on whether the meaning of a message is clear. Just formatting data correctly doesn’t get you there. If the meaning isn’t clear, a business communication won’t be properly understood. In other words, you need clarity for the concepts you communicate – not just data quality.
The Role of Data/System Architectures and What Data Quality is Really About
Because of the time delay in delivering blind communications to everyone in the future who might need them, a secure, well-organized holding area is needed. IT professionals, hopefully guided by knowledgeable data architects, create data/system architectures for that purpose.
Unfortunately, typical data quality measures in current use focus on the health of the content of the data/system architecture rather than on the communication clarity of the original business messages. That focus serves a purpose for data management but misses the mark almost entirely in clarifying what practices produce good business communications in the first place. Typical data quality dimensions are:
- Retroactive rather than proactive
- Quantitative rather than qualitative
- Systemic rather than semantic
Worst of all, typical data quality dimensions implicitly lift responsibility off the shoulders of those who create the data.
The quality of data in a data/system architecture can never be any better than the quality of the business communications that produced it. A systematic means to manage data at rest simply does not guarantee the vitality – the semantic health – of the business communications it supports. Sometimes IT professionals focus so intently on software development that the importance of the point escapes them. (Many data professionals do understand the point, but do not know quite how to articulate it or feel powerless to do much about it.)
To make the point differently, it is entirely possible to assess your data quality as outstanding even though the business communications that produced the data were confusing, contradictory, unintelligible, or otherwise ineffective. Rating data quality high when communication is poor is simply nonsense!
Forming High-Quality Business Communications
Rather than retroactively focusing on data already formed, business people and professionals need proactive measures to form high-quality messages in the first place – no matter whether structured data or written business communication (‘unstructured data’).
What should the recipients of blind messages expect? They have the right to expect:
- High-quality evidence about what the content means.
- No need for any significant assumptions, whether unconscious or deliberate, to supplement that evidence.
- The content representing exactly the reality the evidence suggests.
What form does evidence available to recipients take?
- Names, codes and words
- Definitions
- Business vocabulary
- Business rules
The four dimensions of communication clarity presented below in Table 1 arise directly from these four kinds of evidence, respectively. They provide the context for blind communications. They apply equally to structured data and to written business communication (‘unstructured data’).
Table 1. The Four Dimensions of Communication Clarity
|Dimension of Communication Clarity|Basic Meaning|Criterion|
|Readable|not encoded or cryptic|Meaning is not obscured by the choice of signifiers (terms/codes/words).|
|Understandable| |Only terms with solid business definitions are used.|
|Precise|consistent in meaning|Only shared terms from a business vocabulary based on a concept model are used.|
|Reliable|free from flaws or defects that might impair usefulness|All relevant business rules are followed.|
The four dimensions of communication clarity are discussed individually below with examples. They might seem largely self-evident, but there is more to them than initially meets the eye.
A readable message is one that is not unintentionally encoded or cryptic – that is, one whose meaning is not obscured by the choice of signifiers (names, codes or words). If a message is encrypted (as security of course usually demands these days), the encryption should be applied on top of the message, not be an accidental by-product of forming the message (data) itself.
Cryptic names and codes are rampant in IT systems; they are encouraged by programming languages, software platforms, and legacy computer tradecraft. Some typical examples:
- PID-RAD2-TYPE. Who but programmers might know what that name represents?
- A coding scheme for the values of a field where ‘0’ stands for ‘no’ and ‘1’ stands for ‘yes’. Why?!
- The abbreviation ‘PT’. Without adequate evidence, this abbreviation could stand for many things, including the following:
  - PT Emp -> part-time employee
  - PTCRSR -> Personal Transportation Cruiser
  - Blk pt chassis -> black platinum chassis
  - 24pt bk -> manual published in 24-point type
  - 2 pt asbl -> two-part assembly
  - 1 pt -> one pint
  - LIS PT -> Lisbon, Portugal
How you name things should always be based on natural-language ways of communicating about the things. Inadequate or misleading names, or ones that could easily be misconstrued, should be carefully avoided.
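The contrast between a cryptic coding scheme and a natural-language one can be sketched in a few lines of Python. This is only an illustration: the legacy field name `NWS-FLG`, the readable name `wants_newsletter`, and the `YesNo` enum are all hypothetical, invented here to stand in for the ‘0’/‘1’ example above.

```python
from enum import Enum


class YesNo(Enum):
    """Hypothetical readable replacement for a '0'/'1' coding scheme."""
    NO = "no"
    YES = "yes"


# Legacy-style record: the reader must already know what '1' stands for
# and what the field name abbreviates. (Field name is made up.)
cryptic_record = {"NWS-FLG": "1"}

# Assumed decoding table; knowledge like this often lives only in program code.
LEGACY_CODES = {"0": YesNo.NO, "1": YesNo.YES}


def to_readable(record: dict) -> dict:
    """Translate the cryptic flag into a self-describing field and value."""
    return {"wants_newsletter": LEGACY_CODES[record["NWS-FLG"]].value}


readable_record = to_readable(cryptic_record)
print(readable_record)  # {'wants_newsletter': 'yes'}
```

A future recipient of `readable_record` needs no decoding table to interpret it, which is the sense in which the message speaks for itself.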
In subject matter of any complexity – which is to say virtually all business subject matter – word choice can make a huge difference in the ultimate effectiveness of a communication. There is simply no name like exactly the right name. Chapter 6 examines this important topic in detail.
An understandable message uses only terms with solid business definitions.
Suppose in immunology someone calls something a site, and no definition is given. Does site refer to a location where a vaccination took place (e.g., a doctor’s office), or to an anatomical location where a vaccination was injected?
Miscommunication can easily result where definitions for terms are absent, unclear, imprecise, incomplete, and/or un-business-like. Defining things accurately is a central skill for designing concepts. Chapter 7 provides the basic notions you need in this regard; Part V expands on them with numerous guidelines and examples.
A precise message is one that uses shared terms from a business vocabulary correctly.
Sometimes the choice of word for some concept in a message is simply wrong. Such usage can be highly misleading. For example:
- Using extension to mean an offering of a product given to a prospect when the prospect clicks on an ad, rather than the concept model’s meaning, an additional period of time given to a prospect to accept an offer. (Yes, that’s a real example from a large organization arising from social-media marketing vs. traditional marketing.)
Digging Deeper: Perhaps even worse is being inconsistent in usage – e.g., sometimes a word means one thing, and sometimes another. Such cases are called homonyms (one word or word phrase, but multiple meanings).
Other times a word can span a broad gray band of meaning. For example:
- Using customer to mean anything from active customer to any party that has ever expressed even the slightest interest in the company’s products or services.
Terms (including synonyms) should always refer to only a single concept in a given context. For that you need a solid business vocabulary, which in turn requires a robust concept model.
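One way to picture the single-concept-per-context guideline is as a check over a vocabulary listing. The Python below is a rough sketch, not a prescribed technique: the tuple layout, the context names, and the function `find_homonyms` are assumptions made for illustration, and the wording given for customer is invented. Only the two meanings of extension come from the example above.

```python
from collections import defaultdict

# Hypothetical vocabulary entries: (term, context, concept the term refers to).
entries = [
    ("extension", "marketing",
     "additional period of time given to a prospect to accept an offer"),
    ("extension", "marketing",
     "offering of a product given to a prospect who clicks on an ad"),
    ("customer", "sales",
     "party that has placed at least one order"),  # invented definition
]


def find_homonyms(vocabulary):
    """Return terms that refer to more than one concept within a single context."""
    concepts = defaultdict(set)
    for term, context, concept in vocabulary:
        concepts[(term, context)].add(concept)
    return {key: sorted(found) for key, found in concepts.items() if len(found) > 1}


homonyms = find_homonyms(entries)
for (term, context), meanings in homonyms.items():
    print(f"'{term}' in context '{context}' has {len(meanings)} meanings")
```

A check like this can only flag conflicts that have already been written down; deciding which meaning is the right one remains a business decision.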
A reliable message is one that complies with all relevant business rules.
Much confusion arises over business rules. Professionals who work with data/system architectures often have a technical view of them. That’s off-target. Business rules are not data rules or system rules. A true business rule is a criterion for running the business. Business rules are about business knowledge and business activity, not data – at least not directly.
I recently read the following statement about data quality: “Business rules capture accurate data content values.” No. Business rules are about running the business correctly.
If the business is run correctly, its business communications should be formed correctly. If its business communications are formed correctly, then the content of its data/system architecture should be correct. So yes, business rules result in correct data, but more importantly correct data arises because business activity is conducted correctly in the first place.
In other words, data quality isn’t really about the quality of your data; it’s more about the quality of your business rules.
Here are some relatively simple examples to illustrate true business rules. Each example is first expressed by a clear textual business statement, then as a corresponding data constraint. The alternative expressions illustrate the fundamental difference between communicating in business terms vs. communicating in data-speak. And remember, many or most business rules are much more complicated than these examples.
- Business rule: A customer must have an assigned agent if the customer has placed an order.
Expressed as a corresponding data constraint: A valid agent id is required in the assigned-agent field of a customer record if any order records are listed for that customer record.
- Business rule: The payee of a claim payment for a claim must be a party who made the claim.
Expressed as a corresponding data constraint: The payee number, if any, listed in the payee field of a claim-payment record must be for one of the parties listed as having made the claim.
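The two rules above can also be sketched as executable checks, which makes it easier to see that each one spans more than a single field. The Python below is a minimal illustration under stated assumptions: the classes `Customer` and `ClaimPayment`, their attribute names, and the sample parties are hypothetical, and the agent check tests only that an agent is present, not that the agent id is valid.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Customer:
    name: str
    assigned_agent: Optional[str] = None        # hypothetical agent identifier
    orders: list = field(default_factory=list)  # orders placed by the customer


@dataclass
class ClaimPayment:
    claimants: frozenset                        # parties who made the claim
    payee: Optional[str] = None


def violates_agent_rule(customer: Customer) -> bool:
    """A customer must have an assigned agent if the customer has placed an order."""
    return bool(customer.orders) and customer.assigned_agent is None


def violates_payee_rule(payment: ClaimPayment) -> bool:
    """The payee of a claim payment must be a party who made the claim."""
    return payment.payee is not None and payment.payee not in payment.claimants


new_prospect = Customer("Acme Ltd")
buyer_without_agent = Customer("Birch Co", orders=["order-1001"])
good_payment = ClaimPayment(claimants=frozenset({"J. Doe", "M. Doe"}), payee="M. Doe")
bad_payment = ClaimPayment(claimants=frozenset({"J. Doe"}), payee="Unrelated Inc")

print(violates_agent_rule(new_prospect))         # False
print(violates_agent_rule(buyer_without_agent))  # True
print(violates_payee_rule(good_payment))         # False
print(violates_payee_rule(bad_payment))          # True
```

The docstrings carry the business statements verbatim, so the code stays traceable to the rule it enforces rather than to a field layout.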
Digging Deeper: Unfortunately, trivial examples are almost always used to illustrate problems with data quality arising from failure to comply with business rules. Here are samples:
- Data in a field is invalid because it violates some definitional business rule(s) – for example, social security numbers are found in a field for a person’s surname.
- Data in a field is invalid because it violates some minimum or maximum threshold – for example, a number greater than 99 is found in a percentile field.
Obviously, you do need rules like these, but don’t be fooled! They barely scratch the surface. They just happen to be easy to talk about because they involve values of only a single field. Sad to say, most discussions of data quality have been complicit in a vast oversimplification.
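To see how little single-field checks cover, consider the following Python sketch. Everything in it is assumed for illustration: the record layout, the function names, the social security number pattern, and the lower bound of 0 for a percentile. The record passes both single-field checks yet still breaks a rule that spans two fields (a customer with orders but no assigned agent).

```python
import re

# Assumed pattern for a US social security number, with or without hyphens.
SSN_PATTERN = re.compile(r"^\d{3}-?\d{2}-?\d{4}$")


def surname_is_valid(surname: str) -> bool:
    """Single-field check: a surname must be non-empty and must not look like an SSN."""
    cleaned = surname.strip()
    return bool(cleaned) and not SSN_PATTERN.match(cleaned)


def percentile_is_valid(value: int) -> bool:
    """Single-field check: a percentile must not exceed 99 (lower bound 0 assumed)."""
    return 0 <= value <= 99


# Hypothetical record: every individual field looks fine.
record = {
    "surname": "Nguyen",
    "percentile": 87,
    "assigned_agent": None,
    "orders": ["order-1"],
}

single_field_ok = (
    surname_is_valid(record["surname"]) and percentile_is_valid(record["percentile"])
)

# Cross-field rule: orders on file require an assigned agent.
cross_field_ok = not (bool(record["orders"]) and record["assigned_agent"] is None)

print(single_field_ok)  # True
print(cross_field_ok)   # False
```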
The four dimensions of communication clarity get to root causes of ‘data quality’ problems, as well as of miscommunication in written or other business communications. Communicating about difficult subject matter is hard to begin with. Blind communication to people you can’t converse or interact with directly is the hardest of all. It requires order-of-magnitude sophistication in the techniques used to form the messages. You can address the challenge successfully by means of a concept model, which the next chapter explains.
Extracted from Business Knowledge Blueprints: Enabling Your Data to Speak the Language of the Business, by Ronald G. Ross, 2020.
 From “Six Myths about Data Quality,” by Steven Sarsfield, January 28, 2017 https://www.ewsolutions.com/six-myths-data-quality/