The first step to fixing any problem is to understand that problem, and this is a significant point of failure when it comes to data. Most organizations agree that they have data issues, which they typically categorize as data quality issues. Organizations tend to define the scope of their data problems by their currently known data quality issues, which are only symptoms. This definition is misleading because the known data issues are just the tip of the iceberg; the majority are hidden and unknown.
Data issues are one of the most underestimated and misunderstood challenges organizations face today. A significant percentage of the data within any organization is disjointed, undefined, redundant, untrustworthy, and unwieldy in both size and complexity. The sheer volume, low quality, and mismanagement of data cripple organizations, even without Big Data. The bigger issue, however, is that most organizations do not realize the extent of their data problems.
Broken data, especially the unknown portion, forces organizations to spend ridiculous amounts of time hunting for missing data, correcting inaccurate data, creating “workarounds,” pasting data together, and reconciling conflicting data. These activities waste valuable time and resources, resulting in lost productivity and missed opportunities. Because most organizations are unaware of the full extent and impact of their broken data, they fail to address their core data problems. This scenario does not need to continue. Organizations can gain an accurate understanding of the extent of their data issues, find the root causes, and fix the underlying problems.
I identify nine underestimated or unrecognized data challenges. Below, I will cover the first four:
1. The Scope of the Data Issues
Organizations know that they have data quality issues. However, most are unaware of the full extent of those issues, what they cost, and the harm they cause. Data quality refers to the degree to which data is accurate, complete, relevant, and trustworthy. Studies on data quality have produced alarming statistics on the cost of poor-quality data. However, most of those statistics do not include the unknown data issues and their hidden costs.
Information workers spend valuable time repeatedly correcting inaccurate data, hunting for missing data, and verifying the data’s accuracy. Although these activities go unaccounted for, they detract from the bottom line. Critical operational and strategic business decisions, essential to an organization’s profitability, require accurate data. Organizations without an accurate picture of their inventory, profits, losses, or customers inevitably make poor decisions. Decisions based on incorrect data waste time, money, and resources, yet these losses are rarely counted as part of the cost of poor-quality data.
A general practice is to place responsibility for data quality with the technology department. Technologists use data quality tools to automate the identification of data quality issues and are made responsible for “fixing” the incorrect data. However, since poor-quality data is typically a symptom of a core data problem, a band-aid on the symptom only temporarily masks the underlying problem. Unless an organization addresses that underlying problem, it will continue to create poor-quality data.
As poor-quality data moves through the organization, the data quality issues multiply because data is often used to create additional data. The longer the incorrect data stays in a system, the greater the negative impact. Unfortunately, this scenario is all too familiar. The actual cost and negative impact of poor-quality data remain unknown.
The 1-10-100 quality rule describes how the cost of a defect increases the longer it flows through its life-cycle undetected. The rule states that every dollar spent on prevention saves ten dollars spent addressing a quality issue in production. If a quality issue goes undetected until a customer discovers it, the cost to rectify it rises to one hundred dollars, or 100 times the cost of prevention. The 1-10-100 rule applies equally to data: the most cost-effective solution is prevention, as the sketch below illustrates.
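To make the rule concrete, here is a minimal worked sketch. The dollar multipliers come from the 1-10-100 rule itself; the defect count of 500 is an assumption chosen purely for illustration:

```python
# Illustrative cost of 500 data defects under the 1-10-100 rule.
# The multipliers come from the rule; the defect count is an assumption.

COST_PER_DEFECT = {
    "prevention": 1,    # caught at entry, before the data spreads
    "correction": 10,   # caught and fixed in production
    "failure": 100,     # discovered by a customer downstream
}

defects = 500
for stage, unit_cost in COST_PER_DEFECT.items():
    print(f"{stage:>10}: ${defects * unit_cost:,}")

# prevention: $500
# correction: $5,000
#    failure: $50,000
```

The same 500 defects cost a hundred times more once customers find them, which is why prevention dominates any honest cost comparison.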
2. Unknown-Inaccurate Data
Unknown-inaccurate data is any data with an issue that goes unrecognized. Organizations are challenged not only by their inaccurate data but even more by their limited ability to recognize inaccurate data. Consider a data query whose results are incorrect. How does anyone know the results are wrong, especially when the answer falls within an acceptable range? If a numeric query returns a result 100 times larger than expected, or returns characters where numbers should be, the data’s accuracy is questioned. However, if the results fall within the expected range (often an extensive one) and the answer “looks” right, the data’s accuracy is rarely questioned. Sometimes there is no expected range at all; if someone already knew the answer, they would not be asking the question in the first place. Only when the distortion becomes apparent does it get noticed and tagged as a data quality issue.
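A minimal sketch of this trap, assuming a monthly-revenue query checked only against a broad “looks right” range (all figures and the range are hypothetical):

```python
# A plausibility check only catches values outside the expected range.
# Figures are hypothetical; in practice nobody knows the true value
# at query time, which is the whole problem.

EXPECTED_RANGE = (800_000, 1_200_000)  # broad "looks right" range

def looks_plausible(result: float) -> bool:
    low, high = EXPECTED_RANGE
    return low <= result <= high

reported_revenue = 910_000  # wrong (rows dropped upstream), yet in range

print(looks_plausible(reported_revenue))  # True: the error goes unnoticed
print(looks_plausible(105_000_000))       # False: only a 100x distortion is caught
```

The check catches only the gross distortion; the plausible-looking wrong answer passes and flows into decisions.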
Another instance of unknown-inaccurate data occurs when technologists attempt to “clean” incorrect data while handicapped by not knowing what is correct. When data lacks business meaning, definitions, or context, technologists have no choice but to take their best guess.
Consider all the unknown-inaccurate data used to make critical business decisions. This unrecognized liability includes the loss of business confidence and opportunity and an increased risk of non-compliance and data security issues. These problems multiply as the unknown-inaccurate data is used and reused to plan, operate, predict, and steer the organization. The adverse effects on the business remain unknown, leaving the organization in blind ignorance.
Unfortunately, experience indicates an unconscious consensus in many organizations: as long as the data does not appear incorrect, it is considered accurate. This misbelief is one of the biggest fallacies underpinning our data challenges. Many blindly trust their data merely on the assumption that it is correct, never knowing for sure. Therefore, within what appears to be good-quality data, there is a significant likelihood of error. Remember, data quality statistics only account for data that is obviously or provably incorrect. The known data issues are just the tip of the broken-data iceberg.
3. Loss of Business Context
Data represents the real-world business organization: its things, events, and relationships (tangible or intangible). Data, along with the information gleaned from it, empowers the organization, but only when the data accurately represents those real-world things and events. If it does not, almost every use of the data is compromised. Losing the connection between the data and its business meaning seriously compromises the data.
Data can only be considered right when it correctly represents the business. When organizations skip business data requirements and business data architecture, data loses its business meaning. Missing, inadequate, or inaccurate business data names and definitions result in the misunderstanding and misuse of data. This issue underlies many data challenges.
Data needs to capture all the characteristics of the real-world business considered important to the organization, including the interrelationships of business things and events. Everything in a business organization is interrelated and interdependent; nothing exists on its own, and everything derives meaning from its relationships. This principle is especially relevant for analytics to succeed. Therefore, capturing the context and relationships of real-world things and events is essential to the data’s meaning. Taking data out of context compromises its meaning and integrity. The importance of this concept is difficult to appreciate from a purely technical view of data. The Business-to-Data Connection is fundamental to the integrity and accuracy of data, as the sketch below illustrates.
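One way to keep the Business-to-Data Connection explicit is to carry the business name, definition, and relationships alongside the physical field. The sketch below is a minimal, hypothetical glossary record; every name in it is invented for the example:

```python
# A minimal, hypothetical business-glossary record. The point is that the
# physical field travels with its business name, definition, and
# relationships, rather than leaving technologists to guess.

from dataclasses import dataclass, field

@dataclass
class GlossaryEntry:
    physical_name: str    # column name in the database
    business_name: str    # name the business actually uses
    definition: str       # agreed business definition
    related_to: list[str] = field(default_factory=list)  # business relationships

cust_since = GlossaryEntry(
    physical_name="CUST_DT_01",
    business_name="Customer Onboarding Date",
    definition="The date the customer signed their first active contract.",
    related_to=["Customer", "Contract"],
)
print(cust_since.business_name, "->", cust_since.physical_name)
```

The cryptic column name CUST_DT_01 means nothing on its own; the record preserves the context and relationships that give the field its meaning.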
4. Undefined or Inaccurately Defined Data
Much of the data in an organization is undefined, inadequately defined, or inaccurately defined, resulting in a significant loss of the data’s meaning. Adding to the issue, technologists create best-guess data definitions where definitions are missing. Best-guess definitions are often more dangerous than no definition at all. Sadly, the best-guess method is becoming the norm. Best-guess data names and definitions make the data appear purposefully architected and designed, giving users a false sense of trust when, in fact, there is a high probability of error.
Even with adequately defined data, a widely used practice is to overload a data field. Data-field overloading occurs when additional types of data, never intended for that field, are entered into it. Common reasons for data-field overloading are to avoid expensive system enhancements or to apply a “quick fix” for another data problem.
Multiple departments can overload the same data field, each with its own cryptic translation of the various types of data placed there. One department’s use of the overloaded field inevitably collides with another’s, resulting in multiple, often unknown, data issues, as sketched below. Time is also a factor: the various types of data in the overloaded fields become outdated, no one remembers their purpose, and the complexity of the data issue grows.
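Here is a hypothetical sketch of such a collision: three departments reuse one generic reference field, each with its own encoding, so no single consumer can interpret the field safely. All field names and codes are invented for the example:

```python
# Hypothetical rows sharing one overloaded "ref_code" field. Each department
# writes its own encoding into it, so the field's content cannot be
# interpreted from its name or definition alone.

rows = [
    {"dept": "finance",   "ref_code": "INV-2019-00482"},  # invoice number
    {"dept": "logistics", "ref_code": "SHP/DE/7731"},     # shipment id
    {"dept": "marketing", "ref_code": "SUMMER10"},        # promo code
]

# A downstream job that assumes ref_code holds invoice numbers silently
# misreads two of the three rows, which is exactly the kind of unknown
# data issue described above.
invoices = [r["ref_code"] for r in rows if r["ref_code"].startswith("INV-")]
print(invoices)  # ['INV-2019-00482']
```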
The practices of best-guess naming and data-field overloading are more common than one would imagine. Both result in data fields that do not contain the data their names or definitions indicate. These little-known yet widely used practices are a significant challenge for the many automated data tools that depend on data-field names or content to operate effectively. Best-guess naming and data-field overloading add to the unknown data issue challenge, especially for information protection and security, because these fields can contain sensitive information that the field name or definition does not indicate. For example, a field classified as “non-PII,” such as a customer name or description field, may contain a credit card number or social security number, as the sketch below shows.
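As a rough illustration of why this matters for protection, here is a minimal scan of a supposedly non-PII description field for card-like and SSN-like patterns. The patterns are deliberately simplistic (real detection needs checksums, context, and format awareness), and the sample values are fabricated:

```python
import re

# Simplistic, illustrative patterns only; sample values are fabricated.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),  # 16-digit card-like
    "ssn":         re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

descriptions = [
    "Preferred customer, call before delivery",
    "Paid by card 4111 1111 1111 1111 per phone order",
    "ID on file: 078-05-1120",
]

for text in descriptions:
    for label, pattern in PATTERNS.items():
        if pattern.search(text):
            print(f"{label} found in supposedly non-PII field: {text!r}")
```

A tool that classifies fields by name alone would mark this field safe; only inspecting the content reveals the exposure.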
Continued in Broken Data – What You Do Not Know Will Hurt You (Part 2)