Published in TDAN.com October 2000
Today, more people are entering more data into databases than ever – and making more mistakes than ever before.
Such problems are only compounded in e-commerce environments, where customers are more concerned with speed than accuracy. A recent study has shown that customer data input via the web is far less
accurate than data keyed in by customer service representatives.
The traditional solution to data quality problems has been the use of data cleansing and editing programs. However, they are only a partial solution, as they can also make mistakes. And sometimes,
they may actually cause more problems than they solve.
For instance, a single individual keying names and telephone numbers can only make a few mistakes at a time. When data cleansing programs make mistakes, they can instantly affect hundreds of
thousands of records.
When such a program “fixes” a problem that doesn’t exist, it commits a “Type 1” error. When it misses a problem entirely, it commits a “Type 2” error.
Maximum data quality is found somewhere between the two extremes.
This problem is compounded by software that relies on a small knowledge base. Such software does not recognize subtle differences in the data, which can lead to over-matching or under-matching.
Here’s a brief description of both cases, and how to choose a software solution that strikes a balance between the two.
Type 1 Errors
A Type 1 error occurs when the data cleansing program thinks there is something wrong with the data, but there is no actual problem.
For example, every large city has districts or areas whose inhabitants prefer that their addresses contain the area name, not the city name. However, postal authorities prefer the city name on the
address. The data cleansing program that changes these “vanity” names back to the city name will have made a Type 1 error. The postal authorities are happy — customers are less than
happy.
Another example of a Type 1 error is the data cleansing program that verifies gender codes and changes KIM from a male name to a female name. Many people named KIM are female, but some are male.
Yet another is the program that classifies SHIRLEY TEMPLE as an organization. Many temples are organizations, but Shirley is a person.
A slightly trickier example is the record matching program that considers HARRY JONES at 132 MAIN ST to be a duplicate of HARRY JONES at 131 MAIN ST. After all, the address is only one digit off.
But if these two HARRYs are not the same person, but are neighbors, a Type 1 error will have been made.
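The HARRY JONES case can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the record layout, the 0.9 threshold, and the function names are assumptions, and string similarity comes from the standard library's difflib.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def naive_is_duplicate(rec_a, rec_b, threshold=0.9):
    """Flag two records as duplicates when name and address are 'close enough'."""
    return (similarity(rec_a["name"], rec_b["name"]) >= threshold
            and similarity(rec_a["address"], rec_b["address"]) >= threshold)


def strict_is_duplicate(rec_a, rec_b, threshold=0.9):
    """Like the naive rule, but the leading house number must match exactly."""
    house_a = rec_a["address"].split()[:1]
    house_b = rec_b["address"].split()[:1]
    return house_a == house_b and naive_is_duplicate(rec_a, rec_b, threshold)


# Hypothetical records taken from the example in the text.
a = {"name": "HARRY JONES", "address": "132 MAIN ST"}
b = {"name": "HARRY JONES", "address": "131 MAIN ST"}

print(naive_is_duplicate(a, b))   # the loose rule merges the two neighbors
print(strict_is_duplicate(a, b))  # the strict rule keeps them apart
```

The strict rule avoids this Type 1 error, but it would also keep the two records apart if they really were one person with a mis-keyed house number, which is a Type 2 error.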
Type 2 Errors
A Type 2 error occurs when the data cleansing program passes over bad data. Such data passes all of the data hygiene tests, but still isn’t right. Another common name for a Type 2 error is a
“passed” error. The data cleansing program passes data that it should not.
Sometimes this is because the error was made in such a way that it came out looking reasonable. Sometimes it’s just because the data cleansing program missed it, or was not up to the task.
It is possible to take all of the examples of Type 1 errors and provide Type 2 equivalents.
In such examples, a city name that is seriously in error is mistaken for a “vanity” name. Or a person with a common female name is actually a male whose obscure name was
handed down through his family. Or a restaurant is named after its specialty drink, the SHIRLEY TEMPLE. Or the two HARRY JONES records really were duplicates, but one address was incorrectly
keyed. The number of error permutations is almost endless.
The Tension
A natural tension occurs in the data cleansing program that seeks to minimize both Type 1 and Type 2 errors.
To avoid Type 1 errors, the program must exercise caution so as not to change anything that is already correct. Such caution increases the likelihood of missing something, or of committing a Type 2
error.
Type 2 errors are avoided by being so thorough that nothing is missed, thus running the risk of over-cleansing, or committing a Type 1 error.
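A toy calculation makes the tension concrete. Assume, purely for illustration, that a program gives each value a suspicion score and "fixes" everything at or above a threshold. The scores and labels below are made up, and the names are hypothetical.

```python
# Hypothetical checks: (suspicion score, is the value truly wrong?)
SAMPLE = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]


def count_errors(sample, threshold):
    """Return (type1, type2) when every item scoring >= threshold is 'fixed'."""
    # Type 1: good data that was changed anyway.
    type1 = sum(1 for score, bad in sample if score >= threshold and not bad)
    # Type 2: bad data that was passed over.
    type2 = sum(1 for score, bad in sample if score < threshold and bad)
    return type1, type2


for threshold in (0.2, 0.5, 0.85):
    t1, t2 = count_errors(SAMPLE, threshold)
    print(f"threshold={threshold:.2f}  type1={t1}  type2={t2}")
```

On this made-up sample, lowering the threshold (thoroughness) drives Type 2 errors down and Type 1 errors up, and raising it (caution) does the reverse.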
Some data cleansing programs strike the balance between caution and thoroughness by creating voluminous suspect reports. These programs catch everything suspicious, good and bad, in a big net and
simply report it. This severely escalates the cost of achieving accurate data, because the cost of the people who review the reports greatly outweighs the software acquisition cost.
Others overcompensate for Type 1 errors by fixing very little. The poorest and most dangerous data cleansing programs overcompensate for Type 2 errors and end up with data that is worse than what
they started with.
A good data cleansing program leaves good data alone, corrects bad data, creates small suspect reports, and does so with a minimum of setup, maintenance and clerical
review. It strikes the optimum balance between Type 1 and Type 2 errors.
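The role of the suspect report can be sketched as a three-way routing rule: fix what looks clearly wrong, leave what looks clearly right, and report only what falls in between. The scores, the two cut-off values, and the function names are all assumptions made for this illustration.

```python
# Hypothetical checks: (suspicion score, is the value truly wrong?)
SAMPLE = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]


def route(score, fix_at, ignore_below):
    """Send one scored item to 'fix', 'leave', or the 'suspect' report."""
    if score >= fix_at:
        return "fix"
    if score < ignore_below:
        return "leave"
    return "suspect"


def summarize(sample, fix_at, ignore_below):
    """Count Type 1 errors, Type 2 errors, and the size of the suspect report."""
    result = {"type1": 0, "type2": 0, "suspects": 0}
    for score, truly_bad in sample:
        action = route(score, fix_at, ignore_below)
        if action == "suspect":
            result["suspects"] += 1      # a person must review this one
        elif action == "fix" and not truly_bad:
            result["type1"] += 1         # changed data that was fine
        elif action == "leave" and truly_bad:
            result["type2"] += 1         # passed over data that was wrong
    return result


print(summarize(SAMPLE, fix_at=1.01, ignore_below=0.0))    # report everything
print(summarize(SAMPLE, fix_at=0.5, ignore_below=0.5))     # report nothing
print(summarize(SAMPLE, fix_at=0.85, ignore_below=0.35))   # report the middle band
```

Reporting everything avoids both error types but sends every record to clerical review, while reporting nothing leaves the errors in the data. The middle band happens to produce no errors here only because the sample is tiny and invented. Real data would leave some of each.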
The Remedy
How is such a balance achieved?
Every data cleansing program has some kind of knowledge base that it uses to find and correct errors. The larger the knowledge base, and the more varied the kinds of information within the
knowledge base, the better the results.
For example, consider the SHIRLEY TEMPLE problem. A knowledge base that only contains the word TEMPLE defined as an organization will probably get it wrong. A knowledge base that contains the word
TEMPLE as an organization and the word SHIRLEY as a female first name may get it right or may not. It depends on which word is stronger.
A knowledge base that also contains the pattern FIRST NAME-ORGANIZATION as an individual has a better chance of getting it right. A knowledge base that can also say that the address is residential
has the best chance of getting it right.
The latter knowledge base looks at data from four different angles: two word meanings, a word pattern, and address style. Three of the four agree that SHIRLEY is a person. The weakest angle, one of
the word meanings, suggests the possibility that she is an organization. The evidence for setting her up as a person is quite strong. The more corroborating evidence, the better the chance of getting it
right.
In short, the larger and more diverse the knowledge base, the more corroboration there will be.
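The SHIRLEY TEMPLE walk-through can be mimicked with a simple vote count. This is a deliberately simplified sketch: every angle gets one equal vote, whereas the text notes that some words are "stronger" than others, which a real program would presumably express as weights. The evidence list and the function name are hypothetical.

```python
from collections import Counter


def classify(evidence):
    """Tally (angle, label) votes; return the winner, or 'ambiguous' on a tie."""
    tally = Counter(label for _angle, label in evidence)
    ranked = tally.most_common()
    if not ranked:
        return "ambiguous", {}
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "ambiguous", dict(tally)
    return ranked[0][0], dict(tally)


# Hypothetical evidence, in the order the text adds it to the knowledge base.
EVIDENCE = [
    ("word meaning: TEMPLE", "organization"),
    ("word meaning: SHIRLEY", "person"),
    ("pattern: FIRST NAME-ORGANIZATION", "person"),
    ("address style: residential", "person"),
]

# Grow the knowledge base one angle at a time and watch the decision change.
for n in range(1, len(EVIDENCE) + 1):
    label, tally = classify(EVIDENCE[:n])
    print(f"{n} angle(s): {label}  {tally}")
```

With only the TEMPLE entry the sketch chooses organization, with both word meanings it cannot decide, and with the pattern and the address style added it settles on person.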
Conclusion
If you are in the market for a data cleansing solution, here are some guidelines that may help you make the right decision:
- Knowledge Base Size Is Critical: Take note of the size of the knowledge base, and the different kinds of information stored there. See how much corroboration is brought into the decision. Request that your potential supplier walk you through a few examples of ambiguous data.
- Experience Counts: Ask your potential software supplier to describe how they deal with Type 1 and Type 2 errors. If they don’t know what you’re talking about (even after you have defined the terms), you’re either talking to the wrong people or the wrong vendor.
- Beware Of Stock Solutions: “Canned” demos that imply that your potential supplier has all the answers to every problem are highly suspect. Demos are great for allowing the vendor to show you the shiny parts of the product. But you want to get under the hood. The best demos will accept data on the fly, preferably your own.
- Be An Informed Buyer: Be sure that you are clear on all the underlying setup, maintenance and clerical costs. The software cost alone can be just the tip of the iceberg. Make sure you know what you’re getting into.
The best data cleansing solution is the one that provides you with the cleanest data at the lowest possible cost. Making the right decision today can ensure that your data warehouse is clean
today and stays that way tomorrow.
Copyright 2000 Innovative Systems, Inc. All Rights Reserved.