Setting the Record Straight: Introduction to Data Rules

Published in TDAN.com July 2003

 


DATA EXCHANGE

Data exchange is simply the transfer of data from one system to another. It is usually called “data migration,” but that term limits the scope of data movement efforts. While data
migration infers that data will move from a source system to a target system, data exchange includes bi-directional data movement in addition to unilateral movement between two systems. This paper
will address the many issues associated with data exchange. Specific topics include:

The Reasons for Performing Data Exchange. This section further discusses why data exchange is a much larger effort than data migration.

  • Data Exchange Problems. This section identifies problems that the IT industry has recognized with respect to data migration. It then identifies problems that the IT industry has not recognized.
  • Solving Today’s Problems. This section fully describes industry-recognized problems, shows how these problems have not been totally resolved, and makes suggestions for solving them.
  • Solving Tomorrow’s Problems. This section fully describes the problems that industry has failed to recognize and describes how to address them.

At first glance, data exchange sounds rather simple, right? Not so fast. The issues involved with data exchange must not be underestimated. A precise, disciplined approach is required to overcome
these formidable, technical and non-technical issues. Our approach describes how to do this. Read on.


DATA EXCHANGE-WHY DO IT

Data exchange is almost always a subset of a much larger project. When this is the case, it must be accomplished in parallel with that other project. Data exchange can be a major component of one
of six different system initiatives:

  1. Implementation of a Business Process Re-engineering (BPR) project
  2. Implementation of Electronic Data Interfacing (EDI)
  3. The result of a merger or acquisition
  4. Implementation of a decision support (data storage) project
  5. Implementation of a data farming project
  6. Implementation of an information exchange project


BPRS AND MERGERS/ACQUISITIONS INITIATIVES

A BPR might involve the replacement of one or more legacy systems for the purpose of increasing functionality. During a merger or acquisition, organizations often need to combine the data from
multiple, disparate systems. These two initiatives focus on moving data from one place to another, most often unilaterally, although sometimes a best-of-breed approach requires interfacing data
between unrelated systems (EDI). These types of projects make up a large portion of today’s data exchange market. These initiatives usually involve the implementation of one or more better
systems; therefore, if successful, these data exchange initiatives will reduce the cost of doing business because they result in increasing efficiency.


INFORMATION INITIATIVES

The remaining three initiatives focus on a new concept: turning data into information, and then turning that information into knowledge. Organizations can use this knowledge to expand their
business in a cost-effective manner. For example, the knowledge provided by the data in decision support and data farming projects might involve static data that will be analyzed or referenced at
some future time:

  • An organization uses a decision support system to examine its own data. That data is then used to expose existing clients to new products based on what their clients buy and when they are
    mostly likely to make purchases (throughout the year, or mostly in December?).
  • An organization uses the knowledge gleaned from a data farming project to find new clients who, based on previous spending habits, might be interested in their products.

Information exchange initiatives focus on the sharing of information between two or more, often competing organizations, i.e., Company A’s data will be periodically shared with Company B;
likewise, Company B’s data will be periodically shared with Company A. Practically speaking, an auto insurance company might want to partner with a life insurance company. It is assumed that
people who buy one type of insurance are more likely to buy another type of insurance. The data owned by these similar businesses compliment each other; their data is used to improve the
competitive advantage of both businesses.


COMMONALTIES OF THE SIX INITIATIVES

All of these initiatives can reduce the cost of doing business by increasing efficiency, whether it be by upgrading existing systems, or by helping an organization target customers who are more
likely than the general public to buy their product(s).


DATA EXCHANGE PROBLEMS

Data exchange problems abound. Some have been acknowledged by the industry, although they have not been sufficiently corrected, while others have escaped notice completely. Let us summarize both.


TODAY’S RECOGNIZED PROBLEMS

We have identified six distinct data exchange problems that are recognized by the industry. Although these have all been acknowledged to some degree, they have not been adequately addressed. Left
unresolved, these problems can sink the data exchange project along with the main project:

  1. Legacy systems lack data integrity, making it difficult to translate data to a more restrictive architecture
  2. The theoretical design differences between legacy and relational systems are numerous and complex
  3. The personnel assigned to the task lack enthusiasm
  4. Planning for the task is enormously insufficient
  5. The personnel assigned to the task are often not the right people for the task
  6. The performance of the task-the amount of time it will take-is not considered


TOMORROW’S UNRECOGNIZED PROBLEMS

We have identified three distinct data exchange problems that the industry has failed to recognize. If acknowledged and properly addressed, these problems can turn into solutions that will greatly
increase the efficiency of the data exchange project:

  1. The importance of process mapping is downplayed or ignored
  2. The idea of data patterning-fi nding patterns in the legacy data-is virtually unknown
  3. The importance of instituting validation metrics is overlooked and rarely employed


HOW WE FIX TODAY’S RECOGNIZED PROBLEMS

This section will detail the recognized problems of today. It will further explore industry’s solutions for them. To recap, recognized problems include:

  1. Lack of data integrity in legacy systems
  2. Complex design differences between legacy and relational systems
  3. Personnel’s lack of enthusiasm
  4. Enormously insufficient planning
  5. Improper personnel assignment
  6. Negligence of performance issues


LACK OF DATA INTEGRITY

One would think that any two systems that maintain the same sort of data must perform very similar tasks; information from one system should map to the other with ease. This is rarely true. Legacy
systems are typically quite lenient with respect to enforcing data integrity; i.e., fields that should be populated from a list of valid values require that the user enter a value, but seldom is
the value validated.

For example, a field for STATE ordinarily requires two characters. Ideally, when a user enters a value in this field, the system should check that the two characters form the abbreviation for one
of the 50 US states. However, if the system does not validate this field, there may be values entered that do not correspond to any state. After all, we all make typos. Also, not everyone knows all
the two-digit state abbreviations, which means that a user might enter an invalid state code, thinking that it is the right code.

Furthermore, legacy systems are infamous for containing de-normalized and/or redundant data, which does not make for an easy transition to relational or object oriented architectures. Legacy
systems were designed around a logical procedure that accepts data, processes it, and outputs data. Newer systems look at the data as objects that need to be manipulated, rather than looking at the
logic required to manipulate the data. These architectures were designed to ensure that any given bit of data appears only once in the system. To find information, relational architectures join two
structures in a dependant fashion, while object oriented architectures have one “main” structure to which other structures are joined.


DESIGN DIFFERENCES

The many complex, theoretical design differences between legacy systems and the systems of today can frustrate those involved in a data exchange project. Traditionally, performance was considered
very important, and designers would seek to enhance performance even if it would limit flexibility. Newer architectures, like relational and object-oriented systems, inherently eliminate many
performance concerns. These newer systems combine performance with flexibility.


ATTITUDES OF PERSONNEL

Another issue of data exchange projects involves the attitudes of the people assigned to the task. Most technical people view data exchange as unglamorous. They also believe that they will not
learn anything new that can further their careers (with the exception of learning about tools used in data exchange). On the surface, programmers think that they will simply be writing a bunch of
complex insert-update-delete scripts. Our new approach to the tasks of data exchange shows the people involved that they will learn a lot, and that the work can be intriguing.


PLANNING

People simply do not understand the intricacies of data exchange-that is, until they have braved a number of data exchange projects. Because of this lack of understanding, the planning of data
exchanges often become a one-line task on the main project plan, which downplays its complexity. The first pass at data exchange reveals its complexity, and by then, it may be too late. Much time
has passed, and the users are clamoring for the new system. A lack of proper planning can cause the entire project to fail.

Most data exchange plans start with a rushed analysis, followed by designing the data exchange routines, data mapping, and codewriting, followed by testing. Then the project explodes. Next,
it’s back to more analysis, mapping, coding, and testing. It becomes a spiral, the end of which cannot be seen.


PERSONNEL ASSIGNMENT

Project leaders fail to dedicate adequate access to appropriate personnel. The staff members who are most familiar with the data must be involved in its exchange, yet these people are usually
assigned to more high-profile tasks, such as designing and gathering requirements for the main system. Project leaders think that they can push all of the data exchange tasks to either lower-level
staff members, or to contractors who are unfamiliar with the data. This usually is a mistake-unless process mapping and data patterning are used (these concepts will be explored later in this
paper).

If staffs are unfamiliar with the source system are assigned to the data exchange, they need to perform extensive process mapping in order to learn the processes. Spending a lot of time on mapping
data to data will prove to be time wasted. It cannot be stressed enough! Key personnel must be identified and assigned to the data exchange project at the earliest phases of the main project. It
must be very clear that these key players must dedicate a specified number of hours per week to this task. The staff members who have been supporting the source system are the ones who should be
working on the data exchange. To make this work, staff members must present the results of the various mapping deliverables to management. This fosters ownership of the project and brings
recognition to those working on the data exchange.


PERFORMANCE ISSUES

Let’s face it-data exchange takes time. Systems were never designed to push large, historical volumes of data in a short window of time. In many cases, it took systems years to generate the
volumes of data that you now wish to transform and move to another environment, all in less than one day, in most cases. Knowing how much time and planning for it is important, especially if you
want to migrate a large 24×7 system that cannot afford any significant downtime.


INDUSTRY’S RESPONSE TO TODAY’S PROBLEMS

The most recognized problem in data exchange is writing the code, and, true to form, the technical industry has turned to a technical solution for this problem. ETL (extract, transform, load) tools
look good because they are good for iterative code generation. However, they are not the panacea that many had hoped they would be.


ETL DRAWBACKS

The major drawback to ETL tools is that they don’t sufficiently address analysis and performance issues. These two issues absolutely must not be minimized. By their very nature, ETL tools
have difficulty supporting these issues because they are technology-independent. They are designed this way for two reasons: (1) It is simply not cost-effective for ETL designers to continually
redesign their product to keep up with the latest technology. (2) Basing the ETLs on a given technology would reduce their flexibility. However, being technology-independent means that ETL tools
cannot leverage the performance savings features inherent in specific platforms. These code-generators offer a framework that forces organizations to develop custom code beyond the reach of the
generated code.


DATA PARTITIONING

To improve performance, a method called partitioning is often employed. For example, in the retail industry, much of their historic data is static, i.e., old orders. To use partitioning, we pick a
point in time and declare that all data assembled before that time is historical, and any data compiled after that point in time is dynamic. To meet the project launch date, a copy of the
historical data is moved to the target system before it has to go live. Dynamic data, which will be the newest data, is moved right before the new system comes online. (A delta strategy will be
employed to account for the chance that some parts of the historical data did change.) ETL tools do not support partitioning inherently; it’s up to the migration architect to plan for it (if
one is assigned to the project, that is).


SUMMARY

ETL tools are not designed to take analysis results into consideration. A thorough analysis forms the cornerstone of a successful data exchange project; you must apply everything you learn from
your analysis to ensure success. A key component of the analysis phase will be the definition and application of validation metrics (described later), which is another item unsupported by ETL
tools. Our method does use ETL tools, but it takes these shortcomings into account and compensates for them.


HOW WE FIX TOMORROW’S UNRECOGNIZED PROBLEMS

This section will discuss the unrecognized problems of today and explore how we propose to resolve them. The solutions to the unrecognized problems include:

  • Process Mapping
  • Data Patterning
  • Validation Metrics

As you read this section, you might think that these solutions will take too much valuable time away from the task at hand. Try to remember the times that you spent extra time at the start of a
task, and how it paid off during the execution of the task, especially near the end.


PROCESS MAPPING

Process mapping gives the people involved a better understanding of how the data relates to itself and to the system(s) involved. This might best be described via an example. Let us say that we are
a retail establishment that wants to move from a legacy system to a state-of-the-art system. Both systems provide support for promotions, but the source (legacy) system may only support promotions
at the order level, whereas the target system supports promotions at the line item level. It is probable that the business itself supported promotions at the line item level in the past, but the
legacy system was just not designed to capture this information. The line item-to-promotion relationship was lost. If not for process mapping, this knowledge would probably never be discovered
until the data exchange occurred.


KNOWLEDGE MANAGEMENT VS. INFORMATION MANAGEMENT

Knowledge management has to do with how we use and interpret data for the purpose of improving how we do business. Information management has to do with how we access, manipulate, and supply the
supporting data. Information management is actually a subset of knowledge management.


NEW APPROACH

Historically, few people looked at data collection systems as containing knowledge. This is a relatively new phenomenon. Despite this, it is not too late for organizations to think of it this way.
We must be willing to break out our data into logical subsets called knowledge areas.


KNOWLEDGE AREAS

A knowledge area might best be described through an example. In the world of retail, knowledge areas might include Customer, Order, and Product. These areas do not necessarily correspond directly
with specific departments in an organization. For instance, the Product knowledge area may contain data from the product department, the promotions department, and the marketing department; the
knowledge area is a mix of these three departments. Each has its own context.

For each knowledge area, a team of area experts will be assembled. Area experts should include both technical and business experts from the departments involved. Having separate teams for each
knowledge area will help spread out the workload. Each team can work on its own specialty.


EVENT

After defining each knowledge area team, you will start to document the high-level events that take place within each area. Back to our retail example, when a customer places an order, this is an
event. Although both the source and target systems will support customers placing orders, they mostly likely will perform this event differently. Identifying the event is the first step in process
mapping.


SCENARIO

After documenting all high-level events, you can write scenarios for each event that might encompass multiple events. By documenting these scenarios, and associating events, we can begin to
articulate the true context of how we do business. From here, the data exchange team has a strong foundation to seriously review data.


PROCESS MAPPING

The biggest challenge in data exchange is translating existing processes to the new architecture. Back to the retail example, the old system might support back orders inefficiently, while the new
system might not support them at all. This is a new business rule. As a result, we need rules to define how to handle back ordered items at the time of conversion. This arises as a result of
transporting data to a new system. This rule is not inherent to the source or target system. Without understanding the context, we would not be able to detect or resolve issues like this one.


SUMMARY

Process mapping allows us to break processes down into knowledge areas. From there, we create teams and give them responsibility for leading the data mapping. Data mapping should be the end-product
of this task, not the starting point. It might seem that data mapping should begin right after process mapping concludes, but a step called data patterning should be performed first. This leads us
to the next section of this paper.


DATA PATTERNING

Data patterning gives us a jump on data mapping. The time spent on this step will translate into time saved during the analysis phase. As discussed earlier, a solid analysis increases the chances
of the data exchange project being successful. For example, the experts claim that the data column containing a name always has a first name, a middle initial, and a last name. While conducting
data patterning, we find that this is untrue. Sometimes there is no middle initial, and in some cases, the entire middle name is spelled out. Data patterning allows us to discover this early, when
we still have time to challenge that rule and alter the code accordingly.


IDENTIFY ANOMALIES

In any industry, project leaders seek to identify and correct anomalies before they impact the project. Spending the time up-front, before developing code, will eventually save you more money and
resources (money and manpower). Figure 1 depicts how the cost of data exchange increases when anomalies are not discovered until after the code has been developed. Let us explore this figure in
detail. Whether you use the iterative approach or the analytical approach, the same number of data errors will exist, because the project is constant. The time axis also corresponds to the volume
of data errors (i.e., everything above the optimal line means there are more errors that must be resolved, while everything below the optimal line means there are more errors that can be resolved,
but it is not as critical for project success that they are resolved). The analytical approach spends more of its time in analysis resolving more errors than the iterative approach does-hence the
analytical curve drops more rapidly, and intersects the optimal error rate curve sooner.

Given this situation, the team is afforded more time to make the exchange more error free than anticipated, whereas the iterative approach only aspires to achieve acceptance.


PROCESS MAPPING VS. DATA MAPPING

Data mapping looks only at the data, not at the context of the data. Looking at a row of data without the context makes it almost impossible to know why that data exists. Expert knowledge of data
and processes is vital to data exchange success. When inexperienced staff members focus on the data without understanding the context from which it originates, they eventually find themselves
struggling to understand the data leading to struggles to complete the project. Their approach isn’t entirely wrong; however, our approach allows inexperienced staff members to successfully
perform a data exchange.


SYSTEM STRUCTURE DEFINITIONS

System structures explain the source and target data structures involved in the exchange. Defining system structures involves determining the following items:

  • Data Type. Is the data alpha only, alpha-numeric, numeric only?
  • Length. How long is the data string?
  • Precision. If the data is numeric, is there a decimal point? How many places are used?
  • Optionality. Is the data mandatory or optional?

Documenting this information gives us a baseline to begin our data patterning. We’ll know that a certain level of data integrity exists, i.e., a numeric column really is all numeric. In
legacy systems, columns or data elements that are supposed to contain only alpha characters will contain other characters as well. In this case, further validation of these alpha values is only
possible through customized code. Unfortunately, the source code that conducts this validation is typically not well documented, if it even exists. Therefore, we’ve found a way to extend the
structure definition concept: through value sets and combination sets.


VALUE SETS

Value sets are made up of both character sets and reference sets. A character set identifies all valid characters for a given field. A reference set identifies the valid combinations of values for
this field. For example, a column of data contains all abbreviations for the 50 states of the US. The character set would be the alpha characters a, c-i, k-p, r-t, v-z (in other words, all 26
letters of the alphabet except for b, j, q, u). The reference set would contain only valid combinations of the 24 valid characters, meaning that NJ is valid, but JN is not.


COMBINATION SETS

While value sets look at column patterning, combination sets look at row-level patterning. For instance, they will show us what a valid customer looks like (name, address, telephone number, fax
number), what a valid order looks like (customer, ship-to address, bill-to address, terms of payment, item ordered, quantity), etc.


VALIDATION METRICS INTRODUCTION

Validation metrics allow us to measure the success of the data exchange before it actually takes place. Typically, no one really knows if the data exchange project has been a success or a failure
until the exchange actually occurs. If the exchange project fails, there may not be enough time to find out why it failed, make corrections, and hope that it works the second time around.
Validation metrics combine process mapping and data patterning. Process mapping is based at the row or transaction set level, i.e., an address on an order or a header on an order. Data patterning
is based at the column level; it is used to help execute analysis. Validation metrics comprise a set of multi-faceted tasks that look to discover the number of errors that are associated with the
data. These errors are then given an assignment based on severity. Validation metrics are percentage based and priority based.


PERCENTAGE BASED

This metrics tells us how many of the total records fall into the top few patterns. These patterns will guide us as to how many records upon which to concentrate. Uncovering the five top patterns
is a good rule of thumb. First we need to establish an appropriate set of patterns from which to conduct an analysis. If the column of data was STATES, there can be only one valid pattern: XX. You
will also find that data that is of a higher quality will produce fewer patterns. A long column of “good” data that represents titles might easily fit into two patterns only: alphas
followed by a period, or only alphas, i.e., Dr., Miss, Ms., Mr., Mrs.

A better example for explaining this concept might be a column of data that contains last names of employees. We would assume that this would consist of a string of alpha characters with no spaces.
Upon inspection, we find this to be true only 60 percent of the time. Ten percent of the data contains alphas plus spaces, i.e., Del Rosa, Johnson III. Another 10 percent contains alphas separated
by a dash, i.e., Smith-Taylor. Another 10 percent contains alphas, spaces, and punctuation, i.e., Bronson Sr., Bronson Jr. The last 10 percent may contain a variety of combinations, i.e., a space
followed by alphas, or numerics. In this example, we may be able to recognize four patterns that fi t 90 percent of all data.

Continuing the last name example, we find that 10 percent of the data falls outside of these four patterns and therefore must be analyzed independently. These metrics helps us estimate how much
time we need to spend on this data and plan our time accordingly.


PRIORITY BASED

This tells us how many errors (exceptions to the top five patterns) exist. We then must decide how many errors we can tolerate and still consider this project successful. With this, we can identify
which records need to be fixed right away, which can be fixed later on, and which can be forgotten about totally.


BENCHMARKS

Benchmarks are samples of production data. We select five records that match the top five most common records (patterns). We then manually transform these five records and show them to our
developers. Our developers use this as an example as to what all records should look like after the routines are run, i.e., as target records (after transformation). By giving this to our
developers, we eliminate a vast number of errors prior to the development phase of the project.


CONCLUSION

It should be obvious by now that there is a desperate need for a reliable, methodological approach with which organizations can tackle data exchange projects. Although it is impossible to avoid
unpleasant surprises in data exchanges, you can handle them if you are properly prepared.

In this paper, we have uncovered the most significant obstacles to successful data exchange projects-process mapping for better understanding the context of the data to be mapped, and data
patterning for rapidly identifying and resolving data errors.

By using our methodology to combine a devoted team, an early start, a clearly defined project plan, validation metrics to accurately evaluate progress and automated tools (where applicable), you
will tremendously help your organization achieve its goal of a successful data exchange.

Share

submit to reddit

About Joseph Hudicka

Joseph is the founder of the Information Architecture Team, an organization which specializes in data quality, data migration and ETL. Winner of the ODTUG Best Speaker award for the Spring 1999 conference, Joseph is a regular speaker at ODTUG, OOW, IOUG-A, and a variety of local user groups. Joseph has extensive knowledge in the areas of information architecture and data exchange, with substantial experience in pharmaceutical, retail, public, and e-commerce applications. Joseph co-authored Oracle8 Design Using UML Object Modeling for Osborne/McGraw-Hill & Oracle Press along with Paul Dorsey. He has also written several articles for a variety of publications.

Top