Introduction
The role of information in creating competitive advantage for businesses and other enterprises has been well documented and is now a business axiom: whoever controls critical information can leverage that knowledge for profitability. The difficulties of dealing with the mountains of data produced in businesses brought about the concept of information architecture, which has spawned projects such as Operational Data Stores (ODS), Data Warehousing and Data Marts. Along with these came a set of complementary technologies that help companies collect, massage, process, analyze and deliver useful information from this mass of raw, unconnected data. The growth of Data Warehousing into a $6 billion market demonstrates the degree to which organizations have taken a proactive role in managing their data.
Enterprise Data Quality Management
After some years of attempting to deal with the issue of data quality, a new discipline has emerged within information architecture development to address the need for appropriately managing data quality. This discipline, known as Enterprise Data Quality Management (EDQM), is intended to ensure the accuracy, timeliness, relevance and consistency of data throughout an organization, or across multiple business units within an organization, and therefore to ensure that decisions are made on consistent and accurate information.
Clean, useful and accurate data translate directly to the bottom line for most companies. They represent the added revenues that are realized when businesses correctly model and track their customer relationships and product or service preferences. With reliable data, a major credit company was able to assign risk assessments for loans based on its ability to read free-format, generalized text describing automobile year, make and model. Within weeks of implementation, 27 million records were processed and the company was able to offer a new product line to its customers.
Similarly, an insurance company was able to cleanse and standardize the names and addresses in its customer information files, eliminating 62% of the names and 80% of the addresses as duplicates. This translated into large savings in processing time, storage and mailing costs, into greater confidence among users in their own data, analysis and conclusions, and, most importantly, into lower costs of contacting customers and managing ongoing customer relationships.
Clearly, information is of value only if it is accurate, and in today’s more complex information technology environments, where internal and external data are blended together in data warehouses and more advanced OLAP (on-line analytical processing) applications, new technology processes are required to ensure the accuracy of information. Today, more than ever, it is imperative to tackle the data quality issue through prevention as well as by cleansing existing data stores. While many organizations realize the dollar value of clean data, most are still leaving money on the table.
The Challenge – Where and How to Begin
According to the Gartner Group, “Most information reengineering initiatives will fail due to a lack of data quality.” Just as Total Quality Management (TQM) required a certain level of pain within organizations before it took hold in the manufacturing sector, permanent data quality management occurs only when companies feel sufficient pain from poor data quality to be willing to build new practices that solve the problems on an enterprise basis.
Projects to cleanse data are often created when a “crisis” occurs and a key project is in danger of failure. Unfortunately, many of these special projects are neither permanent nor consistent across the organization, and they may consume several months of effort and hundreds of thousands of dollars on solutions that are not permanent in nature. Traditionally, data reengineering projects lacked the key factor for success in enterprise data quality management: a set of consistent technology processes that institutionalize data quality as a strategic asset, and business processes that make it a consistent competitive advantage.
Effective EDQM approaches can significantly lower the costs of data-cleansing. In a recent article, Larry English, an international expert on data-cleansing processes and the “Data-cleansing”
feature writer for DM Review magazine, succinctly outlined the costs of handling data errors, anomalies and inconsistencies at three separate points within the information technology infrastructure
of an organization. “The costs of data quality impacts organizations when existing systems fail to provide the data in the format necessary to profitably conduct business and results in scrap and
rework remedies. Additionally, costs occur during assessment or the inspection phase of the process. Lastly, are the costs associated with prevention.” He summarizes the need for EDQM in that
article with a question: “If the data were correct at the source and in an enterprise-defined format, would we need to spend so much on data clean-up?”
Developing programs to convert data from one format to another is not difficult. Designing processes to clean and standardize data on an enterprise-wide scale, including data values that may not be obvious, presents a greater challenge. Fortunately, today’s new generation of data management solutions provides data reengineering and process tools along with conversion programs to assist companies in implementing EDQM programs.
Why is EDQM valuable and what problems does it solve? Perhaps the best way to illustrate the value of data quality is to review the roots of bad data, and some of the ways corrupt data can impact
an organization.
Mistakes: The origin of mistakes in data is the simplest of problems to understand. Mistakes include misspellings, typographical errors, out-of-range values and incorrect data types. While typographical errors are difficult to correct, validation routines within applications typically handle out-of-range values and incorrect data types. An example of an out-of-range value might be 13 in a month field, or an alphabetical character in a numeric field such as interest rate.
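A minimal validation sketch in Python, using hypothetical field names and rules, shows how both kinds of mistake can be caught:

    def validate_record(record):
        """Return a list of validation errors for a single input record."""
        errors = []

        # Out-of-range check: a month must fall between 1 and 12.
        month = record.get("month")
        if not isinstance(month, int) or not 1 <= month <= 12:
            errors.append(f"month out of range or not an integer: {month!r}")

        # Data-type check: an interest rate must be numeric, not alphabetic.
        rate = record.get("interest_rate")
        try:
            float(rate)
        except (TypeError, ValueError):
            errors.append(f"interest_rate is not numeric: {rate!r}")

        return errors

    # Both mistakes cited above are flagged:
    print(validate_record({"month": 13, "interest_rate": "ABC"}))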
Homonyms: The English language contains many words and abbreviations with identical spellings but multiple, often unrelated or conflicting, meanings, and it relies on the context of usage to determine the correct meaning. Improper interpretation of the context in which a homonym is used can have a significant impact on data accuracy. For instance, the contextual use of St. in the example below illustrates how context-sensitive processing is built into our language, and that proper interpretation of data requires recognizing the format and context in which words and abbreviations are used.
Catherine B. St. James, MD.
In trust for Mary Church
St. Catherine’s Church
MS ST 225
111 1st St.
St. Petersburg, FL 33708
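A data-cleansing routine must apply the same kind of context sensitivity. The Python sketch below uses a few hypothetical, deliberately simplified rules to decide whether “St.” should be read as Saint, Street, or part of a mail-stop code; real tools rely on much richer lexicons and positional logic:

    import re

    def interpret_st(line):
        """Guess the meaning of 'St.' (or 'ST') within one line of an address block."""
        if re.search(r"\bSt\.\s+[A-Z][a-z]+", line):
            return "Saint"           # e.g., "St. James", "St. Petersburg"
        if re.search(r"\d.*\bSt\.?$", line):
            return "Street"          # e.g., "111 1st St."
        if re.search(r"\bMS\s+ST\b", line):
            return "mail stop code"  # e.g., "MS ST 225"
        return "unknown"

    for line in ["Catherine B. St. James, MD.", "St. Catherine's Church",
                 "MS ST 225", "111 1st St.", "St. Petersburg, FL 33708"]:
        print(line, "->", interpret_st(line))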
Lack of Standards: When data entry responsibilities are spread among different people and business units, variations are bound to arise, as in the example below from information gathered for inventory purposes. Within fields as simple as Product and Location, data may be represented in several different ways:
Product: PC | …
Location: Bin Location 223 | …
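A common remedy is a standardization pass that maps the many observed spellings onto a single enterprise-approved value. A minimal Python sketch, with a hypothetical synonym table and normalization rule, might look like this:

    import re

    # Hypothetical synonym table: observed product variants -> standard value.
    PRODUCT_STANDARDS = {
        "pc": "Personal Computer",
        "p.c.": "Personal Computer",
        "personal computer": "Personal Computer",
    }

    def standardize_product(raw):
        """Map a raw product string onto the enterprise-standard spelling."""
        return PRODUCT_STANDARDS.get(raw.strip().lower(), raw.strip())

    def standardize_location(raw):
        """Normalize location variants such as 'Bin Location 223' or 'Bin 223'."""
        match = re.search(r"(\d+)\s*$", raw)
        return "BIN-" + match.group(1) if match else raw.strip()

    print(standardize_product("P.C."))               # Personal Computer
    print(standardize_location("Bin Location 223"))  # BIN-223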
Legal Entities: In many instances, the addition or omission of elements of a naming convention may alter the actual legal definition of a document. Many banking and financial institutions require complex naming conventions that are not recognized by most applications but must remain intact in order to protect the legal purity of the document.
Missing/Invisible Data: Data that is present may have the proper structure and values, and may in fact appear to be correct, but data that has inadvertently been omitted can cause identification and linkage mechanisms to unknowingly “grow” a mountain of poor-quality data. This problem usually occurs without an organization’s knowledge. For instance, “35 Avenue of the Americas” is syntactically correct. What goes undetected are the thousands of apartments, suites and mail stops within that same address. Similarly, the name “Leslie Brown” is correct, but without a title from which to derive gender, matching can be accomplished only with a lesser degree of certainty.
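The effect on matching can be illustrated as a confidence score that is discounted whenever identifying attributes are absent; the fields and weights below are illustrative assumptions, not drawn from any particular product:

    def match_confidence(rec_a, rec_b):
        """Score how confidently two records can be declared duplicates."""
        score = 0.0
        if rec_a.get("name") and rec_a.get("name") == rec_b.get("name"):
            score += 0.4
        if rec_a.get("street") and rec_a.get("street") == rec_b.get("street"):
            score += 0.4
        # Secondary unit (apartment, suite, mail stop): decisive evidence when present.
        if rec_a.get("unit") and rec_a.get("unit") == rec_b.get("unit"):
            score += 0.1
        # A title (Mr./Ms.) lets gender be derived and compared.
        if rec_a.get("title") and rec_a.get("title") == rec_b.get("title"):
            score += 0.1
        return score

    a = {"name": "Leslie Brown", "street": "35 Avenue of the Americas"}
    b = {"name": "Leslie Brown", "street": "35 Avenue of the Americas"}
    print(match_confidence(a, b))  # 0.8: likely the same party, but with reduced certainty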
Phantom Data: In many applications, phony data (e.g., the date 99/99/99) may be used to flag a record or signify that there is no valid data for a particular field. Equally perplexing, the flag
inserted into a field may have nothing to do with the data in that field; for instance, a phantom date may serve as an indicator that the record in question is no longer valid.
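Detecting such sentinel values is straightforward once they have been catalogued. The sketch below flags a few commonly seen placeholders; the particular values in the catalogue are assumptions for illustration, not a standard list:

    # Hypothetical catalogue of sentinel ("phantom") values seen in legacy feeds.
    PHANTOM_VALUES = {"99/99/99", "9999-99-99", "00/00/00", "N/A", "UNKNOWN"}

    def flag_phantom_fields(record):
        """Return the names of fields whose values are placeholders, not real data."""
        return [field for field, value in record.items()
                if str(value).strip().upper() in PHANTOM_VALUES]

    print(flag_phantom_fields({"close_date": "99/99/99", "amount": 1250}))
    # ['close_date']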
To be effective, EDQM as a process must also meet a number of technical challenges. The process must work across multiple platforms and information architectures, must be adaptable and able to capture knowledge from an organization, and must not scare away users by being difficult to use. When all of these challenges are met, EDQM can be leveraged into an Enterprise “Business Intelligence” Asset. Some of the critical technical challenges are:
- Interoperability is a critical technical challenge for EDQM. Today’s enterprises not only contain a variety of computing technologies, from PCs to workstations to servers and mainframes, but also a number of database management systems, data architectures and applications with which EDQM must interface. Hardware flexibility is a critical component in choosing a data-reengineering tool.
- Adaptability and the ability to accumulate knowledge within an organization, so as not to “reinvent the wheel,” are critical factors for data-cleansing tools. Tools must be portable from one application to another, scalable for large as well as small applications, and reusable throughout an organization, building on prior knowledge and operational rules. Functional flexibility is another critical component in choosing a data-reengineering tool.
- Ease of Use is critical to the successful implementation of EDQM within an organization. A successful data-cleansing application must be easy to implement, integrate with existing applications and business processes, and provide for monitoring and tuning of the system to ensure that knowledge and rules are easily maintained. Flexibility in use is another critical component in choosing a data-reengineering tool.
A data-cleansing tool with these three technical capabilities will facilitate deployment and consistent utilization of EDQM techniques throughout an organization. Once processes and procedures to
ensure data quality are in place, the organization can begin to leverage its data resources into a “business intelligence” asset.
Data-cleansing – An Emerging Field
Once data warehousing architects and practitioners discovered the need for data quality, the question became: how to achieve it? Initially, data reengineering consisted of manually written code interposed between the data extraction and data loading phases of a Data Warehousing implementation. Each project had specific needs, tailored to specific target and legacy data structures and contexts, and therefore each project required custom-built edits to achieve the data quality required for the warehouse. Data-cleansing has grown from the editing process of the early days of information systems through a series of first- and second-generation tools that help manage data quality.
Proactive data quality initiatives start at the Data Entry phase. Data entry validation is the first line of defense against bad data, with validation routines checking data ranges and ensuring that all required fields are filled during the data entry process. Such checks are commonplace in many newer systems, and newer-generation solutions often contain more sophisticated conditional logic that narrows the range of acceptable data based on entries in previous fields, as sketched below. Most importantly, solutions that enable organizations to develop data reengineering processes independent of particular projects, and to execute those processes either on-line at the point of entry or in batch mode within legacy systems, come closest to achieving enterprise-wide data quality management.
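A minimal sketch of that conditional logic, with hypothetical fields and deliberately truncated code lists, might look like this in Python:

    # The acceptable values for "state" are narrowed by what was entered for "country".
    VALID_STATES = {
        "US": {"FL", "NY", "CA"},   # truncated lists, for illustration only
        "CA": {"ON", "QC", "BC"},
    }

    def validate_entry(entry):
        errors = []
        country = entry.get("country")
        state = entry.get("state")
        if country not in VALID_STATES:
            errors.append(f"unknown country code: {country!r}")
        elif state not in VALID_STATES[country]:
            errors.append(f"state {state!r} is not valid for country {country!r}")
        return errors

    print(validate_entry({"country": "US", "state": "ON"}))
    # ["state 'ON' is not valid for country 'US'"]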
The advantages of checking data quality at the data entry stage are fairly obvious: mistakes are nipped in the bud while the information is still fresh, avoiding downstream rework that is often performed by someone unfamiliar with the source data. Data entry validations, however, are not foolproof. Just as a word processing spell checker will not catch grammatical errors made with properly spelled words, data entry personnel can still enter incorrect codes into the right fields in the correct format and range, and the error will go undetected. This is why the tools must be used in conjunction with enterprise standards that allow certain accepted mechanisms for entering data and reject others. These standards must be implemented at an enterprise level to ensure that all departments involved in data entry (accounts receivable, order entry, sales) use the conventions consistently.
Need for an Enterprise Approach
Managing data quality throughout an organization requires an enterprise approach. Such an approach, which focuses on prevention and standards, as well as error correction, can provide significant
benefits to users, information technologists and, most importantly, to the bottom line.
Just as TQM focuses on the prevention of scrap and rework, EDQM focuses on ensuring the accuracy of data throughout the enterprise. EDQM requires changes to business processes and the development of standards that ensure data are entered and standardized in accordance with a set of rules that adapt to changes in business needs.
In a future article to be included on the TDAN web-site, I will discuss the benefits of a data quality approach as well as the differences between the first- and second-generation technologies available to help organizations achieve their goals.