Understanding and Overcoming the Unavoidable Error and Variation in Identity Data

Published in TDAN.com January 2001

Most people take their names for granted. Some of us like our names; others don't. Either way, we are stuck with them. We respond when our name is spoken (even when pronounced badly), dutifully write it down when requested (often not all that neatly), and recognize it when someone else writes it (even when the spelling is wrong). But how do computer systems cope with such error and variation? Specifically, how do they competently search and match this type of data?

To some extent, back office systems control this problem through the management of the people who enter or use the data. However, these systems are far removed from the data’s real owners, such as
you and me, and suffer from errors introduced by interpretations and assumptions. Users of front office systems can also be managed with the added benefit that the data owner is at hand for
confirmation. However, such systems often require speed – which is often at odds with quality. Web-based systems, on the other hand, have raised the problem to new heights. There is no management
of the people who enter or use the data, except for the feeble attempts to structure the way the data is entered. In addition, the users are far removed and want responses in real time.

This article exposes the problems that computer systems have with names and other identity data, how traditional systems are ineffective and how intelligent systems may overcome these limitations.

The Problem With ‘Names’

In many systems, whether computerized or manual, it is important to find, match or group information that has been filed away using a person’s or customer’s name, account name, company or
location name, address, file title, author’s name, book title, etc. All such “names” are collections of words, numbers and codes used to label the original real world item.

Such labels are chosen from a much larger and very different vocabulary than any meaningful language. There are no dictionaries, spell checkers or rules governing the names given to addresses,
people, places or things. For the sake of simplicity, when the word “name” is used in this article it should be taken to mean all of the above mentioned “labels”.

Names, when spoken, written and especially when entered into a computer system are subject to considerable variation and error. Although this variation and error can be reduced, it can’t be
entirely eliminated. Even if the data on file is absolutely correct, the search criterion comes from the real world and is subject to natural error and variation.

In addition to the words and codes in names, addresses, titles and descriptions, other data is frequently used to make decisions about whether we believe two reports or records are about the same
identity. Data such as dates of birth, dates of contract, ages, phone numbers and identity numbers are all used and all subject to error and variation.

Examples of Variation

The variations that occur in names include spelling, typing and phonetic error; synonyms & nicknames; Anglicization and foreign versions of names; initials, truncation and abbreviation; prefix
and suffix variations; compound names; account names; missing words, extra words and word sequence variations, as well as format, character and convention variations.
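A few of these variation classes can be made concrete with a small sketch. The name pairs below are illustrative examples, not data from any real system; they show that a naive exact comparison treats every class of variation as a non-match.

```python
# Illustrative pairs for several of the variation classes listed above.
# All example data is hypothetical.
variations = {
    "spelling/phonetic": ("Catherine", "Katherine"),
    "nickname":          ("William", "Bill"),
    "anglicization":     ("Giovanni", "John"),
    "initials":          ("John A. Smith", "J A Smith"),
    "truncation":        ("Christopher", "Chris"),
    "word sequence":     ("Smith, John", "John Smith"),
    "format/convention": ("MacDonald", "Mc Donald"),
}

for kind, (a, b) in variations.items():
    # Exact string comparison fails for every one of these classes.
    print(f"{kind:20s} {a!r} vs {b!r}: exact match = {a == b}")
```

Every pair refers to the same real-world identity, yet none would survive an exact-key lookup.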

Apart from the natural error and variation that unavoidably occurs in all real world identification data, in many systems the objective is also to overcome fraudulent modification of identity data.
This class of error, which does not occur naturally, is more aggressive because it is introduced to defeat or control aspects of matching systems while retaining the defense that it was in error
rather than fraudulent.

The frequency distribution of names is also a concern when searching for a match from within a large population. The vocabulary in use for people’s first names includes in excess of 2,500,000
words in the USA alone, yet as much as 80% of the population may have names from as few as 500 words. Family names are just as unevenly distributed, causing searches for common names to take longer
and requiring additional supplementary identification data to make the correct choice.

Name search and matching systems must work well at both ends of this extreme curve. They must perform efficiently for uncommon names as well as for very common ones. This is a difficult challenge when a database of 100,000,000 people may contain 100,000 John Smiths or Mike Joneses, in addition to as many common addresses (such as 1 Main Street).

When people make choices about whether words match or not, they compensate for the error and variation. To confirm that records match requires that systems use the same data in the same manner as
the human users. In fact, the system needs to mimic the very best users doing the same job.

Whether the process is an on-line inquiry (such as customer identification), or a batch matching process (as might occur during the merging of marketing lists before a mailing), or a criminal
record search, we must mimic the human expert in finding all candidate records. It is important that the system make the same matching choices as the human expert for any specific business purpose.

For the system to overcome error and variation increases the work done, and therefore the cost. In addition, the actual process of compensating for the error may introduce errors and false matches.

Any solution to this unique data processing problem requires a balance between performance and quality, between under-matching and over-matching.

Popular Yet Often Ineffective Techniques

Exact Name Searches

The use of exact name keys is very inefficient, leading to much of the duplication of records, accounts and customers in today's systems. Finding an exact match does not mean the correct record has been located, nor is it necessarily any better a match than one with some variation.

Searching With Wildcards

Wildcard searches do overcome some of the error and variation in the name and for that reason are popular with users. Unfortunately, since wildcard searches often return too many irrelevant candidates, users don't realize how much relevant data they are actually missing.

In reality, wildcard searches work if the searcher guesses the correct character sequences to include or exclude – assuming there are no errors in the characters of the database being searched. Not
only do these searches miss relevant records, they do not address nicknames and abbreviations, or the fact that different records have different types of errors.
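The limitation is easy to demonstrate. The records and pattern below are illustrative; Python's standard `fnmatch` module stands in for any wildcard matcher. The search finds only the variants the searcher anticipated in the pattern, and silently misses errors elsewhere in the word.

```python
from fnmatch import fnmatchcase

# Hypothetical surname records; the last three are the kinds of
# variation a wildcard pattern cannot anticipate.
records = [
    "Robertson", "Robinson", "Robson",
    "Robertsen",      # typing error in a part the pattern does not cover
    "Roberston",      # transposition inside the word
    "Bob Robertson",  # extra word ahead of the surname
]

pattern = "Rob*son"
hits = [r for r in records if fnmatchcase(r, pattern)]
print(hits)  # only the first three records are found
```

The pattern "guesses" that the name starts with "Rob" and ends with "son"; any record violating either guess is invisible, and the user has no way to know it was missed.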

Keying Partial Words To Save Time

Data entry time can be reduced by only keying partial search criteria – e.g. the first three characters of the first word followed by the first three characters of the next word. Performing
searches with this type of criteria, however, makes it impossible to use techniques to overcome word variation and errors in a database. Soundex (an algorithm for encoding the last name) or other
techniques to handle nicknames or Anglicization, or translation or formal abbreviation, cannot be applied to partial words.
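A small sketch shows why. The nickname table below is a tiny illustrative stand-in for the much larger tables real systems use; once a word is truncated to three characters, no such table can be applied to it.

```python
# Hypothetical nickname table; real systems carry thousands of entries.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}

def partial_key(name):
    """First three characters of each word, as keyed to save entry time."""
    return " ".join(word[:3].lower() for word in name.split())

def expand_nickname(word):
    """Normalize a full word through the nickname table."""
    return NICKNAMES.get(word.lower(), word.lower())

# Full words can be normalized through the table...
print(expand_nickname("Bill"), expand_nickname("William"))  # both "william"

# ...but the truncated keys cannot: "bil" and "wil" have no table entry
# and no longer resemble each other at all.
print(partial_key("Bill Smith"))     # "bil smi"
print(partial_key("William Smith"))  # "wil smi"
```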

Text Retrieval Software and Name Search

The use of text retrieval packages for name search applications also misses data. Even when full text inversion indexes have phonetic algorithms or “expert” rule bases for name searches, the
indexing mechanism is an inefficient process. Does it make sense to find all index values for the records containing John and then join them with those that contain Smith to discover the subset
John Smith?

For text retrieval based systems to be successful they must recognize, find or discover the names from within the text and index them with the specialized techniques that are necessary for quality
and performance in a name search system (as opposed to indexing the names in the same way that other words are indexed).

Match-Codes

A Match-code is a key built from a combination of an identity's attributes. For example, a key built from State code + Surname Soundex code + Birth date is a match-code. Match-codes require that each attribute first be strictly formatted into its pieces (e.g. that the position of the surname within the name is known); that all pieces used are in a "stable" order (e.g. that the birth date is always yyyy/mm/dd, or some other fixed format); and that there are no errors in the pieces used.

Of course, attributes such as Sex, Birth Date, State, Postal or Zip code have a stable, known set of valid formats and values and can be accurately edited and validated. However, the fact that they
are valid does not mean that they are true, accurate or consistent (e.g. a birth date can be a valid date without being the correct date for a certain identity). Mathematically such data can be
precise but not necessarily accurate.

Typically, Match-Codes find correct records but they frequently miss the other equally good candidates.

Yesterday’s Soundex’s And Related Tools

In the early 1900's the Russell Soundex technique was developed to provide a stable manual filing code for the USA Census documents. The development of this algorithm for encoding a person's last name was based upon phonetics and certain classes of typical spelling and filing errors. This simple set of rules converts a last-name word into a short code that has a high probability of being the same for two words that are variations of each other.
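A sketch of the classic algorithm follows (this is the common American Soundex variant; details differ between implementations). Similar-sounding consonants share a digit, vowels and H/W/Y are dropped, and the code is padded or truncated to a first letter plus three digits.

```python
def soundex(word):
    """American Soundex: first letter of the word plus three digits."""
    codes = {**dict.fromkeys("BFPV", "1"),
             **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":      # H and W are transparent: they do not
            prev = code         # separate consonants with equal codes
    return (result + "000")[:4]

# Spelling variants of one surname stabilize to the same code...
print(soundex("Robert"), soundex("Rupert"))   # both R163
# ...but so do many unrelated names, which is the over-matching problem.
print(soundex("Smith"), soundex("Schmidt"))   # both S530
```

Both effects in the output matter: the first line is the algorithm working as intended, the second is exactly the "too many incorrect records" failure described below.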

Since then, many algorithms with similar objectives have been developed and modified. In the 1960's, the New York State Identification and Intelligence System (NYSIIS) project evaluated the popular algorithms. This included evaluation of algorithms such as Soundex and many of its variants; LA County Sheriff consonant coding; Phonic standard and extended; Michigan Lien; and several extract-list based systems. The end result of this project was a popular algorithm known as NYSIIS that proved to be better optimized for their data at that time.

While such "stabilization" (similar sounding words are "stabilized" to the same encoding) algorithms can be a critical piece of a name search engine, the algorithms of the past are not sufficient for use on their own as database keys with today's data or volumes. In addition, purely "English" based algorithms are not suitable for non-English languages. Typically, these stabilization algorithms either cause too many incorrect records to be found, or miss too many relevant records.

Mimicking the Expert Users

The best solution will overcome the error and variation in the identity data while: a) maintaining acceptable performance and b) not missing candidates or generating too many false matches.

Such a solution needs intelligent and scalable algorithms, which, through the use of rich keys and search strategies, return all of the candidates an expert user would consider as being the same as
the search data.

These algorithms must be able to cope with data from the real world. This includes data from different countries, data which is not formatted or cleaned or not capable of being formatted or
cleaned, which contains noise characters, noise words, initials, abbreviations, nicknames and concatenations, and which contains tokens in an unstable order.

The algorithms need a customizable rule base to incorporate the knowledge of the expert user, and a default populated rule base in the case where the user is not that experienced.

The algorithms require phonetic and orthographic correction functionality, to address spelling and typing errors.
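One common building block for orthographic correction is edit distance. The sketch below is a standard dynamic-programming implementation of Levenshtein distance, offered as an illustration of the idea rather than as the specific technique any particular product uses.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

# "Jonsen" is two edits from "Johnson": insert an "h", substitute e -> o.
print(edit_distance("Jonsen", "Johnson"))  # 2
```

A search engine can use a small distance threshold to treat near-spellings as candidate matches, trading a little over-matching for far less under-matching.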

Intelligent matching routines must be available and able to be tuned to mimic the expert user making a choice as to which candidates are the correct matches. Such matching routines need to take
into account all of the error and variation in the identities’ attributes, as well as weighting the attributes as the user would.
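Attribute weighting can be sketched as a simple score: each populated attribute contributes a weighted agreement, mimicking how an expert values a strong identifier (a birth date) over a weak one (a common first name). The weights, fields and threshold behavior below are illustrative assumptions, not calibrated values from any real system.

```python
# Hypothetical weights: stronger identifiers carry more weight.
WEIGHTS = {"surname": 3.0, "first_name": 1.0, "birth_date": 4.0, "postcode": 2.0}

def match_score(a, b):
    """Weighted agreement over the attributes populated in both records,
    normalized to the range 0.0 - 1.0."""
    total = possible = 0.0
    for field, weight in WEIGHTS.items():
        if a.get(field) and b.get(field):   # compare only populated fields
            possible += weight
            if a[field].lower() == b[field].lower():
                total += weight
    return total / possible if possible else 0.0

rec1 = {"surname": "Smith", "first_name": "John",
        "birth_date": "1960-04-12", "postcode": "10001"}
rec2 = {"surname": "Smith", "first_name": "Jon",
        "birth_date": "1960-04-12", "postcode": "10001"}

print(round(match_score(rec1, rec2), 2))  # 0.9: all agree except first name
```

In practice the per-field comparison would itself be fuzzy (nickname tables, edit distance, phonetic codes) rather than the exact equality used here, and the cut-off score would be tuned per business purpose.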

The algorithms must work well regardless of the country of origin and language of the data, and must insulate the application developer from the differences between countries and languages.

Conclusion

Increasing M&A activity, expanding databases, the need to limit risk and fraud, the current emphasis on customer relationships, the proliferation of data available on or from the web, and the "thirst" for that data are all putting pressure on search and matching systems to deliver these capabilities. There are many general search engines around; hopefully this article has illustrated that identity search and matching is a non-trivial problem requiring a more intelligent and targeted solution.

Copyright 2000 – Search Software America,
A division of SPL WorldGroup. All rights reserved.
