Published in TDAN.com July 2003
  DATA EXCHANGE
Data exchange is simply the transfer of data from one system to another. It is usually called “data migration,” but that term limits the scope of data movement efforts. While data migration implies that data moves one way, from a source system to a target system, data exchange includes bi-directional data movement in addition to one-way movement between two systems. This paper will address the many issues associated with data exchange. Specific topics include:
- The Reasons for Performing Data Exchange. This section further discusses why data exchange is a much larger effort than data migration.
- Data Exchange Problems. This section identifies the problems that the IT industry has recognized with respect to data migration, and then identifies the problems it has not.
- Solving Today’s Problems. This section describes the industry-recognized problems in detail, shows why they have not been fully resolved, and suggests how to solve them.
- Solving Tomorrow’s Problems. This section describes the problems that the industry has failed to recognize and explains how to address them.
 
At first glance, data exchange sounds rather simple, right? Not so fast. The issues involved with data exchange must not be underestimated. A precise, disciplined approach is required to overcome these formidable technical and non-technical issues. Our approach describes how to do this. Read on.
DATA EXCHANGE: WHY DO IT
  Data exchange is almost always a subset of a much larger project. When this is the case, it must be accomplished in parallel with that other project. Data exchange can be a major component of one
  of six different system initiatives:
- Implementation of a Business Process Re-engineering (BPR) project
- Implementation of Electronic Data Interchange (EDI)
 - The result of a merger or acquisition
 - Implementation of a decision support (data storage) project
 - Implementation of a data farming project
 - Implementation of an information exchange project
 
BPR AND MERGER/ACQUISITION INITIATIVES
  A BPR might involve the replacement of one or more legacy systems for the purpose of increasing functionality. During a merger or acquisition, organizations often need to combine the data from
  multiple, disparate systems. These two initiatives focus on moving data from one place to another, most often unilaterally, although sometimes a best-of-breed approach requires interfacing data
between unrelated systems (EDI). These types of projects make up a large portion of today’s data exchange market. They usually involve the implementation of one or more better systems; if successful, they reduce the cost of doing business by increasing efficiency.
  INFORMATION INITIATIVES
The remaining three initiatives focus on a new concept: turning data into information, and then turning that information into knowledge. Organizations can use this knowledge to expand their business in a cost-effective manner. For example, the knowledge produced by decision support and data farming projects might be drawn from static data that is analyzed or referenced at some future time:
- An organization uses a decision support system to examine its own data. That data is then used to expose existing clients to new products based on what those clients buy and when they are most likely to make purchases (throughout the year, or mostly in December?).
- An organization uses the knowledge gleaned from a data farming project to find new clients who, based on previous spending habits, might be interested in its products.
 
Information exchange initiatives focus on the sharing of information between two or more organizations, often competitors; i.e., Company A’s data will be periodically shared with Company B and, likewise, Company B’s data will be periodically shared with Company A. Practically speaking, an auto insurance company might want to partner with a life insurance company, on the assumption that people who buy one type of insurance are more likely to buy another. The data owned by these similar businesses complement each other; their combined data is used to improve the competitive advantage of both businesses.
COMMONALITIES OF THE SIX INITIATIVES
  All of these initiatives can reduce the cost of doing business by increasing efficiency, whether it be by upgrading existing systems, or by helping an organization target customers who are more
  likely than the general public to buy their product(s).
  DATA EXCHANGE PROBLEMS
Data exchange problems abound. Some have been acknowledged by the industry, although they have not been sufficiently corrected, while others have escaped notice completely. Let us summarize both.
  TODAY’S RECOGNIZED PROBLEMS
  We have identified six distinct data exchange problems that are recognized by the industry. Although these have all been acknowledged to some degree, they have not been adequately addressed. Left
  unresolved, these problems can sink the data exchange project along with the main project:
- Legacy systems lack data integrity, making it difficult to translate data to a more restrictive architecture
 - The theoretical design differences between legacy and relational systems are numerous and complex
 - The personnel assigned to the task lack enthusiasm
 - Planning for the task is enormously insufficient
 - The personnel assigned to the task are often not the right people for the task
- The performance of the task (the amount of time it will take) is not considered
 
  TOMORROW’S UNRECOGNIZED PROBLEMS
  We have identified three distinct data exchange problems that the industry has failed to recognize. If acknowledged and properly addressed, these problems can turn into solutions that will greatly
  increase the efficiency of the data exchange project:
- The importance of process mapping is downplayed or ignored
- The idea of data patterning (finding patterns in the legacy data) is virtually unknown
 - The importance of instituting validation metrics is overlooked and rarely employed
 
  HOW WE FIX TODAY’S RECOGNIZED PROBLEMS
This section will detail the recognized problems of today and further explore the industry’s solutions for them. To recap, the recognized problems include:
- Lack of data integrity in legacy systems
 - Complex design differences between legacy and relational systems
 - Personnel’s lack of enthusiasm
 - Enormously insufficient planning
 - Improper personnel assignment
 - Negligence of performance issues
 
  LACK OF DATA INTEGRITY
  One would think that any two systems that maintain the same sort of data must perform very similar tasks; information from one system should map to the other with ease. This is rarely true. Legacy
  systems are typically quite lenient with respect to enforcing data integrity; i.e., fields that should be populated from a list of valid values require that the user enter a value, but seldom is
  the value validated.
For example, a field for STATE ordinarily requires two characters. Ideally, when a user enters a value in this field, the system should check that the two characters form the abbreviation of one of the 50 US states. However, if the system does not validate this field, values may be entered that do not correspond to any state. After all, we all make typos. Also, not everyone knows all the two-letter state abbreviations, which means that a user might enter an invalid state code, believing it to be correct.
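To make the problem concrete, here is a minimal sketch of the kind of check a lenient legacy system never performed: validating a STATE field against the 50 valid two-letter abbreviations. The record layout and field name are assumptions made for illustration, not taken from any particular system.

```python
# Minimal sketch: flag STATE values that are not valid two-letter US state codes.
# The record layout and field name are illustrative assumptions.

VALID_STATES = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
}

def invalid_state_rows(rows):
    """Yield (row_number, value) for every row whose STATE field fails validation."""
    for i, row in enumerate(rows, start=1):
        value = (row.get("STATE") or "").strip().upper()
        if value not in VALID_STATES:
            yield i, value

# Example: typos and guesses that a lenient legacy system would have accepted.
sample = [{"STATE": "NJ"}, {"STATE": "N J"}, {"STATE": "NW"}, {"STATE": ""}]
print(list(invalid_state_rows(sample)))   # -> [(2, 'N J'), (3, 'NW'), (4, '')]
```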
Furthermore, legacy systems are infamous for containing de-normalized and/or redundant data, which does not make for an easy transition to relational or object-oriented architectures. Legacy systems were designed around a logical procedure that accepts data, processes it, and outputs data. Newer systems look at the data as objects that need to be manipulated, rather than at the logic required to manipulate the data. These newer architectures were designed to ensure that any given bit of data appears only once in the system. To find information, relational architectures join two structures in a dependent fashion, while object-oriented architectures have one “main” structure to which other structures are joined.
  DESIGN DIFFERENCES
  The many complex, theoretical design differences between legacy systems and the systems of today can frustrate those involved in a data exchange project. Traditionally, performance was considered
  very important, and designers would seek to enhance performance even if it would limit flexibility. Newer architectures, like relational and object-oriented systems, inherently eliminate many
  performance concerns. These newer systems combine performance with flexibility.
  ATTITUDES OF PERSONNEL
  Another issue of data exchange projects involves the attitudes of the people assigned to the task. Most technical people view data exchange as unglamorous. They also believe that they will not
  learn anything new that can further their careers (with the exception of learning about tools used in data exchange). On the surface, programmers think that they will simply be writing a bunch of
  complex insert-update-delete scripts. Our new approach to the tasks of data exchange shows the people involved that they will learn a lot, and that the work can be intriguing.
  PLANNING
People simply do not understand the intricacies of data exchange until they have braved a number of data exchange projects. Because of this lack of understanding, the planning of a data exchange often becomes a one-line task on the main project plan, which downplays its complexity. The first pass at the data exchange reveals that complexity, and by then it may be too late. Much time has passed, and the users are clamoring for the new system. A lack of proper planning can cause the entire project to fail.
Most data exchange plans start with a rushed analysis, followed by the design of the data exchange routines, data mapping, and coding, followed by testing. Then the project explodes. Next, it’s back to more analysis, mapping, coding, and testing. It becomes a spiral, the end of which cannot be seen.
  PERSONNEL ASSIGNMENT
Project leaders often fail to provide adequate access to the appropriate personnel. The staff members who are most familiar with the data must be involved in its exchange, yet these people are usually assigned to more high-profile tasks, such as designing and gathering requirements for the main system. Project leaders think that they can push all of the data exchange tasks to either lower-level staff members or to contractors who are unfamiliar with the data. This is usually a mistake, unless process mapping and data patterning are used (these concepts will be explored later in this paper).
If staff members who are unfamiliar with the source system are assigned to the data exchange, they need to perform extensive process mapping in order to learn the processes; spending a lot of time mapping data to data will prove to be time wasted. It cannot be stressed enough: key personnel must be identified and assigned to the data exchange project in the earliest phases of the main project. It must be very clear that these key players must dedicate a specified number of hours per week to this task. The staff members who have been supporting the source system are the ones who should be working on the data exchange. To make this work, staff members must present the results of the various mapping deliverables to management. This fosters ownership of the project and brings recognition to those working on the data exchange.
  PERFORMANCE ISSUES
Let’s face it: data exchange takes time. Systems were never designed to push large, historical volumes of data in a short window of time. In many cases, it took the source system years to generate the volumes of data that you now wish to transform and move to another environment in less than one day. Knowing how much time the exchange will take, and planning for it, is important, especially if you want to migrate a large 24×7 system that cannot afford any significant downtime.
  INDUSTRY’S RESPONSE TO TODAY’S PROBLEMS
The most recognized problem in data exchange is writing the code, and, true to form, the technical industry has turned to a technical solution for this problem. ETL (extract, transform, load) tools look attractive because they are good at iterative code generation. However, they are not the panacea that many had hoped they would be.
  ETL DRAWBACKS
The major drawback of ETL tools is that they don’t sufficiently address analysis and performance issues. These two issues absolutely must not be minimized. By their very nature, ETL tools have difficulty addressing these issues because the tools are technology-independent. They are designed this way for two reasons: (1) it is simply not cost-effective for ETL vendors to continually redesign their products to keep up with the latest technology, and (2) basing the tools on a given technology would reduce their flexibility. However, being technology-independent means that ETL tools cannot leverage the performance-saving features inherent in specific platforms. These code generators offer a framework, but anything beyond the reach of the generated code forces organizations to develop custom code.
  DATA PARTITIONING
To improve performance, a method called partitioning is often employed. For example, in the retail industry, much of the historical data is static (e.g., old orders). To use partitioning, we pick a point in time and declare that all data assembled before that time is historical, and any data compiled after that point is dynamic. To meet the project launch date, a copy of the historical data is moved to the target system well before it has to go live. The dynamic data, which is the newest data, is moved right before the new system comes online. (A delta strategy is employed to account for the chance that some parts of the historical data did change.) ETL tools do not support partitioning inherently; it is up to the migration architect to plan for it (if one is assigned to the project, that is).
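A minimal sketch of this partitioning strategy follows; the cutoff date, timestamp fields, and record layout are assumptions made for illustration only.

```python
# Minimal sketch of the partitioning strategy described above. Field names,
# the cutoff date, and the record layout are illustrative assumptions.
from datetime import datetime

CUTOFF = datetime(2003, 1, 1)  # data created before this point is treated as historical

def partition(rows):
    """Split records into a static historical set and a dynamic set."""
    historical = [r for r in rows if r["created"] < CUTOFF]
    dynamic = [r for r in rows if r["created"] >= CUTOFF]
    return historical, dynamic

def delta(historical, copied_at):
    """Historical rows touched after the early bulk copy; re-send these at go-live."""
    return [r for r in historical if r["last_modified"] > copied_at]
```

The historical set is bulk-loaded well before the launch date; the dynamic set, plus the delta, is loaded right before the new system comes online.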
  SUMMARY
  ETL tools are not designed to take analysis results into consideration. A thorough analysis forms the cornerstone of a successful data exchange project; you must apply everything you learn from
  your analysis to ensure success. A key component of the analysis phase will be the definition and application of validation metrics (described later), which is another item unsupported by ETL
  tools. Our method does use ETL tools, but it takes these shortcomings into account and compensates for them.
  HOW WE FIX TOMORROW’S UNRECOGNIZED PROBLEMS
This section will discuss the problems that the industry has yet to recognize and explore how we propose to resolve them. The solutions to these unrecognized problems include:
- Process Mapping
 - Data Patterning
 - Validation Metrics
 
  As you read this section, you might think that these solutions will take too much valuable time away from the task at hand. Try to remember the times that you spent extra time at the start of a
  task, and how it paid off during the execution of the task, especially near the end.
  PROCESS MAPPING
  Process mapping gives the people involved a better understanding of how the data relates to itself and to the system(s) involved. This might best be described via an example. Let us say that we are
  a retail establishment that wants to move from a legacy system to a state-of-the-art system. Both systems provide support for promotions, but the source (legacy) system may only support promotions
  at the order level, whereas the target system supports promotions at the line item level. It is probable that the business itself supported promotions at the line item level in the past, but the
legacy system was just not designed to capture this information. The line-item-to-promotion relationship was lost. Without process mapping, this knowledge would probably not be discovered until the data exchange occurred.
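Purely as an illustration, the sketch below shows one conversion rule that process mapping might surface in this situation: restating an order-level promotion at the line-item level by allocating the discount in proportion to each line’s value. The record structures and the allocation rule are hypothetical, not part of any specific system.

```python
# Illustrative sketch only: allocate an order-level discount across line items
# in proportion to each line's value. Structures and field names are hypothetical.

def allocate_promotion(order):
    """Return (line, allocated_discount) pairs for an order-level discount."""
    total = sum(line["qty"] * line["unit_price"] for line in order["lines"])
    discount = order["order_discount"]
    return [
        (line, round(discount * (line["qty"] * line["unit_price"]) / total, 2))
        for line in order["lines"]
    ]

order = {
    "order_discount": 10.00,
    "lines": [
        {"sku": "A", "qty": 2, "unit_price": 25.00},   # 50.00 of a 100.00 order
        {"sku": "B", "qty": 1, "unit_price": 50.00},   # 50.00 of a 100.00 order
    ],
}
print([(l["sku"], d) for l, d in allocate_promotion(order)])  # -> [('A', 5.0), ('B', 5.0)]
```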
  KNOWLEDGE MANAGEMENT VS. INFORMATION MANAGEMENT
  Knowledge management has to do with how we use and interpret data for the purpose of improving how we do business. Information management has to do with how we access, manipulate, and supply the
  supporting data. Information management is actually a subset of knowledge management.
  NEW APPROACH
Historically, few people looked at data collection systems as containing knowledge; viewing them this way is a relatively new phenomenon, but it is not too late for organizations to adopt it. We must be willing to break our data out into logical subsets called knowledge areas.
  KNOWLEDGE AREAS
  A knowledge area might best be described through an example. In the world of retail, knowledge areas might include Customer, Order, and Product. These areas do not necessarily correspond directly
  with specific departments in an organization. For instance, the Product knowledge area may contain data from the product department, the promotions department, and the marketing department; the
  knowledge area is a mix of these three departments. Each has its own context.
  For each knowledge area, a team of area experts will be assembled. Area experts should include both technical and business experts from the departments involved. Having separate teams for each
  knowledge area will help spread out the workload. Each team can work on its own specialty.
  EVENT
After defining each knowledge area team, you will start to document the high-level events that take place within each area. Back to our retail example: when a customer places an order, this is an event. Although both the source and target systems support customers placing orders, they most likely perform this event differently. Identifying the event is the first step in process mapping.
  SCENARIO
After documenting all high-level events, you can write scenarios; a single scenario might encompass multiple events. By documenting these scenarios and associating them with events, we can begin to articulate the true context of how we do business. From here, the data exchange team has a strong foundation from which to seriously review the data.
  PROCESS MAPPING
The biggest challenge in data exchange is translating existing processes to the new architecture. Back to the retail example: the old system might support back orders inefficiently, while the new system might not support them at all. As a result, we need a new business rule that defines how to handle back-ordered items at the time of conversion. This rule arises purely from transporting data to a new system; it is not inherent to either the source or the target system. Without understanding the context, we would not be able to detect or resolve issues like this one.
  SUMMARY
  Process mapping allows us to break processes down into knowledge areas. From there, we create teams and give them responsibility for leading the data mapping. Data mapping should be the end-product
  of this task, not the starting point. It might seem that data mapping should begin right after process mapping concludes, but a step called data patterning should be performed first. This leads us
  to the next section of this paper.
  DATA PATTERNING
  Data patterning gives us a jump on data mapping. The time spent on this step will translate into time saved during the analysis phase. As discussed earlier, a solid analysis increases the chances
  of the data exchange project being successful. For example, the experts claim that the data column containing a name always has a first name, a middle initial, and a last name. While conducting
  data patterning, we find that this is untrue. Sometimes there is no middle initial, and in some cases, the entire middle name is spelled out. Data patterning allows us to discover this early, when
  we still have time to challenge that rule and alter the code accordingly.
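One simple way to perform data patterning is to collapse each value into a pattern signature and count how often each signature occurs. The sketch below is a minimal, assumed implementation of that idea; the sample values are invented.

```python
# Minimal data-patterning sketch: letters become 'A', digits become '9', and all
# other characters are kept, so values with the same shape share one pattern.
from collections import Counter

def pattern(value):
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in value)

names = ["John Q Smith", "Mary Jones", "Ana de la Cruz"]
print(Counter(pattern(n) for n in names).most_common())
# -> [('AAAA A AAAAA', 1), ('AAAA AAAAA', 1), ('AAA AA AA AAAA', 1)]
```

Run against the real column, the counts immediately reveal how often the “first name, middle initial, last name” rule actually holds.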
  IDENTIFY ANOMALIES
In any industry, project leaders seek to identify and correct anomalies before they impact the project. Spending the time up front, before developing code, will ultimately save both money and manpower. Figure 1 depicts how the cost of data exchange increases when anomalies are not discovered until after the code has been developed. Let us explore this figure in detail. Whether you use the iterative approach or the analytical approach, the same number of data errors will exist, because the underlying data is the same. The figure plots the volume of data errors over time against an optimal error line: everything above the optimal line represents errors that must be resolved, while everything below it represents errors that can be resolved but are not as critical to project success. The analytical approach spends more of its time in analysis and resolves more errors than the iterative approach does; hence the analytical curve drops more rapidly and intersects the optimal error rate curve sooner. Given this head start, the team is afforded more time to make the exchange more error-free than anticipated, whereas the iterative approach only aspires to achieve acceptance.
  PROCESS MAPPING VS. DATA MAPPING
Data mapping looks only at the data, not at the context of the data. Looking at a row of data without its context makes it almost impossible to know why that data exists. Expert knowledge of the data and processes is vital to data exchange success. When inexperienced staff members focus on the data without understanding the context from which it originates, they eventually find themselves struggling to understand the data, which in turn makes it a struggle to complete the project. Their approach is not entirely wrong; however, our approach allows even inexperienced staff members to perform a data exchange successfully.
  SYSTEM STRUCTURE DEFINITIONS
System structure definitions describe the source and target data structures involved in the exchange. Defining them involves determining the following items (a brief sketch follows the list):
- Data Type. Is the data alpha only, alphanumeric, or numeric only?
 - Length. How long is the data string?
 - Precision. If the data is numeric, is there a decimal point? How many places are used?
 - Optionality. Is the data mandatory or optional?
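The sketch below shows what checking a value against such a structure definition might look like; the column specifications, rules, and messages are illustrative assumptions rather than a prescribed implementation.

```python
# Sketch of checking a value against a structure definition (type, length,
# precision, optionality). The column definitions and rules are illustrative.
import re

COLUMNS = {
    "CUSTOMER_NAME": {"type": "alpha", "length": 40, "optional": False},
    "STATE":         {"type": "alpha", "length": 2, "optional": False},
    "ORDER_TOTAL":   {"type": "numeric", "length": 10, "precision": 2, "optional": True},
}

def violations(column, value):
    spec, problems = COLUMNS[column], []
    if value in ("", None):
        return [] if spec["optional"] else ["missing mandatory value"]
    text = str(value)
    if len(text) > spec["length"]:
        problems.append("too long")
    if spec["type"] == "alpha" and not re.fullmatch(r"[A-Za-z ]+", text):
        problems.append("contains non-alpha characters")
    if spec["type"] == "numeric":
        places = spec.get("precision", 0)
        numeric = r"\d+" + (r"(\.\d{1,%d})?" % places if places else "")
        if not re.fullmatch(numeric, text):
            problems.append("not numeric with the expected precision")
    return problems

print(violations("STATE", "N1"))            # -> ['contains non-alpha characters']
print(violations("ORDER_TOTAL", "12.345"))  # -> ['not numeric with the expected precision']
```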
 
Documenting this information gives us a baseline from which to begin our data patterning. We’ll know that a certain level of data integrity exists, i.e., that a numeric column really is all numeric. In legacy systems, however, columns or data elements that are supposed to contain only alpha characters will often contain other characters as well. In this case, further validation of these alpha values is only possible through customized code. Unfortunately, the source code that conducts this validation is typically not well documented, if it even exists. Therefore, we’ve found a way to extend the
  structure definition concept: through value sets and combination sets.
  VALUE SETS
  Value sets are made up of both character sets and reference sets. A character set identifies all valid characters for a given field. A reference set identifies the valid combinations of values for
this field. For example, a column of data contains the abbreviations of the 50 US states. The character set would be the alpha characters a, c-p, r-z (in other words, all 26 letters of the alphabet except for b and q, which appear in no state abbreviation). The reference set would contain only the valid combinations of those 24 characters, meaning that NJ is valid, but JN is not.
  COMBINATION SETS
  While value sets look at column patterning, combination sets look at row-level patterning. For instance, they will show us what a valid customer looks like (name, address, telephone number, fax
  number), what a valid order looks like (customer, ship-to address, bill-to address, terms of payment, item ordered, quantity), etc.
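As a rough sketch under assumed structures, the code below expresses a value set (character set plus reference set) for a STATE column and a combination set for a customer row; the abridged reference set and the column names are illustrative.

```python
# Sketch of value-set and combination-set checks as described above; the sets and
# record layout are illustrative.

# Value set for a STATE column: the character set limits which characters may appear,
# the reference set limits which combinations of those characters are valid.
STATE_CHARACTER_SET = set("ACDEFGHIJKLMNOPRSTUVWXYZ")   # the 24 letters used by state codes
STATE_REFERENCE_SET = {"NJ", "NY", "PA", "CA", "TX"}    # abridged; the real set has 50 entries

def value_set_ok(value):
    return set(value) <= STATE_CHARACTER_SET and value in STATE_REFERENCE_SET

# Combination set for a customer row: which columns must be present together
# for the row to look like a valid customer.
CUSTOMER_COMBINATION_SET = ("name", "address", "telephone")

def combination_ok(row):
    return all(row.get(col) for col in CUSTOMER_COMBINATION_SET)

print(value_set_ok("NJ"), value_set_ok("JN"))   # -> True False
print(combination_ok({"name": "Smith", "address": "1 Main St", "telephone": None}))  # -> False
```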
  VALIDATION METRICS INTRODUCTION
  Validation metrics allow us to measure the success of the data exchange before it actually takes place. Typically, no one really knows if the data exchange project has been a success or a failure
  until the exchange actually occurs. If the exchange project fails, there may not be enough time to find out why it failed, make corrections, and hope that it works the second time around.
Validation metrics combine process mapping and data patterning. Process mapping operates at the row or transaction-set level, e.g., an address on an order or an order header. Data patterning operates at the column level and helps drive the analysis. Validation metrics comprise a set of multi-faceted tasks that seek to discover the number of errors associated with the data. These errors are then assigned a severity. Validation metrics are percentage based and priority based.
  PERCENTAGE BASED
This metric tells us how many of the total records fall into the top few patterns, which in turn guides us as to how many records to concentrate on. Uncovering the top five patterns is a good rule of thumb. First we need to establish an appropriate set of patterns from which to conduct an analysis. If the column of data were STATE, there can be only one valid pattern: XX. You will also find that higher-quality data produces fewer patterns. A long column of “good” data that represents titles might easily fit into only two patterns: alphas followed by a period, or alphas only, i.e., Dr., Miss, Ms., Mr., Mrs.
  A better example for explaining this concept might be a column of data that contains last names of employees. We would assume that this would consist of a string of alpha characters with no spaces.
  Upon inspection, we find this to be true only 60 percent of the time. Ten percent of the data contains alphas plus spaces, i.e., Del Rosa, Johnson III. Another 10 percent contains alphas separated
  by a dash, i.e., Smith-Taylor. Another 10 percent contains alphas, spaces, and punctuation, i.e., Bronson Sr., Bronson Jr. The last 10 percent may contain a variety of combinations, i.e., a space
followed by alphas, or numerics. In this example, we may be able to recognize four patterns that fit 90 percent of all data.
Continuing the last name example, we find that the remaining 10 percent of the data falls outside of these four patterns and therefore must be analyzed independently. This metric helps us estimate how much time we need to spend on this data and plan our time accordingly.
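A rough sketch of computing the percentage-based metric follows, reusing the pattern-signature idea from the data patterning discussion so the example stands alone; the sample last names are invented, and the 80 percent result is purely illustrative.

```python
# Sketch of a percentage-based metric: how much of a column the top patterns cover.
# Sample data and the resulting percentage are illustrative.
from collections import Counter

def pattern(value):
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in value)

def top_pattern_coverage(values, top_n=5):
    counts = Counter(pattern(v) for v in values)
    top = counts.most_common(top_n)
    coverage = sum(n for _, n in top) / len(values)
    return top, coverage

last_names = ["Smith", "Jones", "Del Rosa", "Smith-Taylor", "Bronson Jr.",
              "Brown", "Garcia", "Lee 2", "O'Neil", "Nguyen"]
top, coverage = top_pattern_coverage(last_names)
print(top[:2])                                           # -> [('AAAAA', 3), ('AAAAAA', 2)]
print(f"top patterns cover {coverage:.0%} of the rows")  # -> 80%; the rest needs hands-on analysis
```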
  PRIORITY BASED
This metric tells us how many errors (exceptions to the top five patterns) exist. We then must decide how many errors we can tolerate and still consider the project successful. With this, we can identify which records need to be fixed right away, which can be fixed later on, and which can be ignored altogether.
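As an assumed illustration of that decision, the sketch below buckets exception records by severity so the team can see what must be fixed before go-live, what can wait, and what can be ignored; the rules and field names are invented.

```python
# Illustrative severity triage for exception records; the rules and field names
# are assumptions, not part of any specific methodology or system.
from collections import Counter

def severity(row):
    if not row.get("customer_id"):
        return "fix now"      # would break referential integrity in the target
    if not row.get("state"):
        return "fix later"    # cosmetic; can be cleaned up after go-live
    return "ignore"           # tolerable exception

exceptions = [{"customer_id": None, "state": "NJ"},
              {"customer_id": 42, "state": None},
              {"customer_id": 7, "state": "CA"}]
print(Counter(severity(r) for r in exceptions))
# -> Counter({'fix now': 1, 'fix later': 1, 'ignore': 1})
```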
  BENCHMARKS
Benchmarks are samples of production data. We select five records that match the top five most common patterns. We then manually transform these five records and show them to our developers. Our developers use them as examples of what all records should look like after the routines are run, i.e., as target records (after transformation). By giving this to our developers, we eliminate a vast number of errors prior to the development phase of the project.
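The sketch below shows one assumed way to put benchmarks to work: pair each selected source record with its hand-transformed target form and compare the routine’s output against it. The name-splitting routine and record layouts are invented for illustration and merely stand in for the real exchange routines.

```python
# Sketch: store each benchmark as a (source record, hand-transformed target record)
# pair, then compare the routine's output against the expected target.

def transform_name(source):
    """Stand-in for the real exchange routine being tested; layout is hypothetical."""
    last, rest = source["NAME"].split(", ")
    parts = rest.split()
    return {"last_name": last.title(),
            "first_name": parts[0].title(),
            "middle_initial": parts[1] if len(parts) > 1 else ""}

benchmarks = [
    ({"NAME": "SMITH, JOHN Q"},
     {"last_name": "Smith", "first_name": "John", "middle_initial": "Q"}),
]

for source, expected in benchmarks:
    assert transform_name(source) == expected, f"benchmark failed for {source}"
print("all benchmarks match")
```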
  CONCLUSION
  It should be obvious by now that there is a desperate need for a reliable, methodological approach with which organizations can tackle data exchange projects. Although it is impossible to avoid
  unpleasant surprises in data exchanges, you can handle them if you are properly prepared.
In this paper, we have uncovered the most significant obstacles to successful data exchange projects and introduced the techniques that address them: process mapping, for better understanding the context of the data to be mapped, and data patterning, for rapidly identifying and resolving data errors.
By using our methodology to combine a devoted team, an early start, a clearly defined project plan, validation metrics to accurately evaluate progress, and automated tools (where applicable), you will go a long way toward helping your organization achieve its goal of a successful data exchange.
