Mastering Data: An Algorithm to Clean Names

Businessman working with ecology business conceptIt has been a great challenge in many companies to preserve name consistency with written Latin characters in countries where the main alphabet is not Latin. Nowadays many companies worldwide use the English language as the main language for data storage. This is not a problem in countries where the native language is based on the Latin alphabet; however, it would be a great problem for those who their main languages are non-Latin alphabet like Arabic, Chinese or Hindi.

Why Should We Care About Name Cleansing?

First of all, name cleansing is not important only to non-Latin alphabet based countries only, it is severely important worldwide. Here are some usages of such an algorithm:

  1. It can play a great role when a merge or acquisition takes place between companies and the customer bases need to be merged.

  2. Identify users across social network so that a person’s opinion on both Facebook and Twitter can be known.

  3. Articles or blogs written by the same author on different websites can be detected.

  4. Make names look more consistent and professional.

Concepts of Name Cleansing

The manual effort is the first step in implementing this algorithm. A database should be prepared containing the clean names in the selected native language. Sound like a tedious task? No. When implementing the Arabic-Egyptian culture names cleansing algorithm, the table including the clean names contained only 1206 names.

The names included should be atomic and unique (i.e. primary keys). Atomic means that composite names should be divided into several rows. For example, the name “Abd Al-Rahman” should be separated into two different records. It is no doubt that the name in the native language is the primary/unique key since there should not be two different spellings for the same name.

The algorithm is culture-based; i.e. it is not based on language.

Data Model for Name Cleansing

The following are the list of tables which are needed to implement this solution:

  1. CleanData: Includes all names in their native language and their corresponding clean Latin form.

    1. CleanDataID: Sequence [Optional]

    2. Domain: That is the field discriminating the different cultures, if the algorithm can clean them [Optional]

    3. PrimaryEntity: The name in its native form

    4. CleanEntity: The name in its clean form

    5. Language: The language used in column CleanEntity [Optional]

NOTE: If supporting multicultural name cleansing is needed then a unique constraint should be created on Domain and PrimaryEntity columns, otherwise make it on PrimaryEntity only.

  1. Domain [Optional]: This would be used to differentiate cultures

    1. DomainID: Sequence

    2. Name: Culture name

    3. Language: The language on which the culture depends

  2. Language [Optional]: Key/Value pair for languages

  3. Log: This is used for logging the name cleansing trials. It is very important to extend the algorithm and make it learn from its failed attempts.

    1. MalformedName: The name that needs to be cleaned

    2. ProposedName: The outcome of the algorithm

    3. DomainID: The domain used while cleansing

    4. LanguageID: The output language targeted while cleansing

    5. LogDate: The timestamp when that trial was attempted

  4. Rule: It is used for determining the rules which will be applied on a name to normalize it.

Input for Name Cleansing

The algorithm requires some inputs in order to process the request:

  1. Malformed Name: The name to be cleaned

  2. Input Culture: The culture against which the name should be cleaned

  3. Output Language: The desired clean output language

  4. Split Character: The character which will be used to split between names in the inputs and outputs

  5. Unrecognized Character: The character which will be used to represent an unrecognized name (i.e. a fail attempt)

  6. Unrecognition Behavior [Optional]:

    1. Same: The unrecognized name would be inserted in the proposed name as-is

    2. Replace: The unrecognized name would be replaced with the unrecognized character defined

    3. Prefix: The unrecognized name would be inserted in the proposed name as-is and would be prefixed with the unrecognized character as well

The Name Cleansing Algorithm

The input name is split first by the Split Character defined in the earlier section then the result is passed to the core method (called here CleanName), which outputs the proposed name and receives the following as parameters:

  1. An array of strings filled with names

  2. A Boolean representing whether to try to split the name. For example: husseinfareed would be converted to Hussein Fareed

CleanName loops through each name and tries to clean them using several methods which are prioritized in order to get the best results. The methods are:

  1. Check exact match: Attempts to compare the name with the entries recorded in the database

  2. Check a normalized match: Attempts to compare the name with the entries recorded in the database after applying some normalization rules on both input name and stored data.

  3. Try to split the name and repeat from the first step again.

Check Normalized Match and Splitting

Normalizing a name is based on a set of rules which is defined per domain. For example, rules can be set for Egyptian culture and others for Hindi culture.

A rule has a set of characteristics:

  1. Replacement: A rule is basically used to replace a portion of a name to another standard form. For example, a rule may be set to replace “ee” to “i” so “Fareed” and “Farid” will be treated in the same manner.

  2. Round: Consider a rule as a compromise. When no exact match is found, more compromises occur by making some replacements. These replacements are prioritized through rounds. For example: round 1 may include some rules like replacing “ee” to “i” and “oo” to “o.” If no match is found, then more compromises like replacing “ae” to “a” may take place.

  3. Priority: Sets a priority inside each round as it may be needed to make replacements in specific order. For example: replace “el-” to “el” first, then replace “el” to “al.”

  4. Normalizing occurs on both the input name and the names database.

  5. Rule Direction: A rule may be applied by replacing any occurrence in the name at the start/end of the name. For example, it may be needed to replace any names starting with “el” to “al,” but the rule should not be applied to the middle of the name.

  6. Applying Type: There are three types of applying a rule

    1. Full: Applies to the full name; not just single names. It is applied once at the beginning of the cleansing process.

    2. Clean: Applies on the malformed name.

    3. Primary: Applies on the original name (e.g. Arabic name).

After rules are applied, all redundancies are removed from the input name. Applying name normalization is a loop that compares the normalized form of the input name and names stored in the database. It tries first to normalize the name by applying the “Primary” rules then “Clean” rules. If the above trials do not work, the normalization process uses a fuzzy match algorithm which is called the Levenshtein distance algorithm. If the process reaches the fuzzy match part, the word with the least score is chosen to get the nearest match.

Splitting a malformed name is a basic loop which iterates from both sides (starting from the beginning and end of the name). The process attempts to split the name into two parts only. Upon each iteration, the input is split at the current position defined by the iteration and both sides are cleaned. If matches were found for both sides, the algorithm returns the split names. Otherwise, the loops continues until the best match is found either in the left or right. A match is defined to be the best by reaching the clean match in the fewest number of rounds.

Quick Recommendations and Future Work

To wrap up this brief article I would like to include two quick recommendations:

  1. The names database should be stored in a hashed list to allow fast access.

  2. The names database should be normalized upon loading them to allow fast comparison.

The algorithm might need to have the ability to grow the original names database by taking the number of repetitive failures and flagging them as “consider to revise.”

To See a Demo

A demo algorithm is already implemented for Arabic-Egyptian names at http://www.hg-solutions.net by registering to the website (the quota will be set to 200 characters for free daily).

Share

submit to reddit

About Hussein Fareed

Hussein Fareed is a computer scientist that specializes in data. He started programming when he was 10 years old and joined TE Data, a large ISP in Egypt as a Senior Business Intelligence Developer at 25. He has been promoted to be Strategic Research Supervisor and is responsible for building enterprise data warehouses using the approaches of Ralph Kimball and Dan Linstedt. Hussein is the author of the Dictionary-Based Name Cleansing algorithm. You can contact him at husseinfareed@gmail.com.

Top