Published in TDAN.com April 2003
What makes management think that 20 years of data can be migrated in a day? Or that it is even the right decision to migrate in the first place? Many of us face tough decisions like this every day,
and most tend to go with the flow. A decision comes from higher management to migrate a set of data, and no one questions the validity of this decision.
To make matters worse, nobody even cares about the status of the data migration until about two months before the go-live date of the new system. You know how it goes – development is now three
months late, and testing is just getting under way, when everyone notices the obvious: It’s really hard to test applications when there’s no data behind them! The data migration is holding up the entire project.
And so it goes.
Now you find yourself in the meeting area of the CEO’s office. You own the white board, and you have been tasked with drawing the “high level” view of the data migration architecture that you
will be implementing in the next few days.
Where to Begin
From where do you begin? How many systems are involved? How much processing power is available in the existing systems? Are any of the systems’ CPU and memory resources expandable? Are there any
production applications that may conflict with the migration? What is the available network throughput, or, rather, how much data can you really send through the network in an hour, and does that
performance degrade significantly during peak hours?
Don’t even think that this list includes all of the concerns you must address in designing your migration/ETL architecture. So many questions – how do you answer them?
Now for the good news – and yes, there is good news. Compare designing your migration/ETL architecture to buying a new car. No, not at a huge auto-mall that pushes you through on a conveyor belt –
think more along the lines of “carsdirect.com.”
Assemble a menu of all the different options that are available to you. Weigh the costs and benefits for each option. Can your migration architecture be as successful as a Honda – reliable, best
when used within legal speed limits? Or must it be a Ferrari – leaving nothing but vapor trails and some tire residue on the network?
The Main Variables to Ponder
For the purposes of this article, let’s keep things simple. The main variables you have to consider are throughput of data across the network, and processing power for transformation and
cleansing. These are formidable challenges – far worse than any nail you’ll ever run over on the road. They’re more like that bubble that forms on the inside wall of a tire as the tread wears
through, soon to give way to a blowout.
So where do we begin? We need to start by getting some diagnostics of the current environment:
- Determine how much data needs to move from point to point (or server to server) in the migration.
- Determine how much processing power is available at each point. Take these measurements at both peak and non-peak intervals.
- Estimate the amount of transformation and cleansing to be conducted.
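Those three measurements feed directly into a back-of-the-envelope sizing exercise. The sketch below shows one way to turn a measured data volume and sustained network rate into a migration window estimate; the 500 GB volume and the peak/off-peak rates are hypothetical placeholders you would replace with your own diagnostics.

```python
# A rough sizing sketch, not a benchmark: plug in your own measured
# data volume and sustained (not theoretical) network throughput.

def transfer_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Estimate wall-clock hours to move data_gb at a sustained rate."""
    seconds = (data_gb * 1024) / throughput_mb_per_s
    return seconds / 3600

# Illustrative: 500 GB of source data at measured peak/off-peak rates.
peak = transfer_hours(500, 20)       # ~20 MB/s during business hours
off_peak = transfer_hours(500, 80)   # ~80 MB/s overnight

print(f"peak: {peak:.1f} h, off-peak: {off_peak:.1f} h")
# → peak: 7.1 h, off-peak: 1.8 h
```

Even this crude arithmetic settles arguments quickly: if the off-peak window is eight hours and the estimate says seven, you have no slack, and that is worth knowing before the whiteboard session.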
When you have obtained these figures, it is time to begin comparing the pre-migration data size to the post-migration data size. This will help you determine when transformation/cleansing
activities should be conducted. The examples discussed below will assume that a staging area will be necessary, though this may not always be the case.
For example, if during transformation, several arrays are normalized into separate database records, that will result in a significant increase in the overall data volume. In this case, you may
want to extract the data from the source system(s) as is and move it to a staging area in the target system, where transformation can be conducted locally.
The architecture may look something like it does in Figure 1:
This solution leverages the processing power of the target server.
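To see why normalizing arrays inflates volume, consider a minimal sketch: a legacy record carrying a twelve-element monthly array becomes twelve target rows. The record layout and field names here are invented for illustration.

```python
# Hypothetical legacy record: one row carrying a twelve-bucket array,
# a common COBOL OCCURS-style layout.
legacy = {"acct": "A-100",
          "monthly_bal": [100, 110, 95, 120, 130, 125,
                          140, 150, 145, 160, 170, 180]}

# Normalization: one child row per array element, with the month
# position promoted to a column of its own.
rows = [
    {"acct": legacy["acct"], "month": m + 1, "balance": bal}
    for m, bal in enumerate(legacy["monthly_bal"])
]

print(len(rows))  # one source record became 12 target rows
```

A twelve-fold row-count increase on a large table is exactly the kind of expansion that argues for moving the raw extract first and normalizing on the target side.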
Let’s look at another example. You may have an abundance of processing resources available on the source system(s), but it may be slow. In such a case, you will want to explore the possibility of
partitioning your source data and pre-converting the historic data in the source environment. The architecture for this solution may look something like it does in Figure 2:
Of course, in this solution, there remains a second wave of migration for current data. This will bring many complexities with it as well, including the possibility of both source and target being
operational in parallel. This might sound scary, but sometimes it’s the only option.
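The partitioning itself is usually a date predicate: pick a cutoff, pre-convert everything older than it, and sweep up the newer records in the second wave. A minimal sketch, with an invented cutoff and order table:

```python
from datetime import date

# Hypothetical rule: anything placed before the cutoff is "historic"
# and can be pre-converted on the source while operations continue;
# everything on or after it migrates in the second wave.
CUTOFF = date(2003, 1, 1)

orders = [
    {"id": 1, "placed": date(2001, 6, 30)},
    {"id": 2, "placed": date(2002, 11, 15)},
    {"id": 3, "placed": date(2003, 3, 2)},
]

historic = [o for o in orders if o["placed"] < CUTOFF]
current = [o for o in orders if o["placed"] >= CUTOFF]
```

The hard part is not the predicate but choosing a cutoff the business will stand behind, since records near the boundary may still be updated while both systems run in parallel.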
Business Intelligence, Data Warehousing, Information Exchange
While the architectures described thus far are more than adequate for legacy system conversions and business process reengineering type projects, they’re not up to the task of business
intelligence, data warehousing or information exchange. This is where we need to borrow from the aviation industry for a moment, and explore the notion of a hub and spoke architecture.
Business intelligence, data warehousing or information exchange initiatives all have at least one common attribute – volatility. The number of sources and/or targets can and certainly will change
over time, with frequency. The volume of data to be processed expands daily. And the data rules – we cannot forget the data rules and how rapidly they will evolve.
Initiatives of this sort require a robust, flexible architecture that welcomes change and is prepared to manage it carefully. That is what the hub and spoke architecture is best suited for. The
architecture for this solution may look something like Figure 3:
As you can see from the diagram of Figure 3, you can support any number of source and target systems (spokes) while managing the overall ETL processes through a hub. This does not, however,
necessitate a commercially available ETL software product. If you have ever worked with a software product that claims to generate code, then you know what a slippery slope that can be.
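A hand-rolled hub can be surprisingly small. The toy sketch below is one possible shape, not a product: source spokes register extractors, target spokes register loaders, and the transformation rules live in the hub so adding a spoke never means copying rules. Every name in it is illustrative.

```python
# A toy hub-and-spoke ETL: spokes are plain functions registered by
# name; the hub owns the shared transformation rules.

def extract_hr():            # a source spoke
    return [{"emp": "smith", "dept": "10"}]

def load_warehouse(rows):    # a target spoke
    load_warehouse.received = rows

SOURCES = {"hr": extract_hr}
TARGETS = {"warehouse": load_warehouse}

def transform(row):
    # Rules live in the hub, not in each spoke, so a new source or
    # target is just one more registry entry.
    return {"emp": row["emp"].title(), "dept": int(row["dept"])}

def run_hub(source, target):
    rows = [transform(r) for r in SOURCES[source]()]
    TARGETS[target](rows)

run_hub("hr", "warehouse")
```

The point is the shape, not the code: when a fourth source or a fifth target shows up next quarter, the hub absorbs the change instead of multiplying point-to-point interfaces.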
The astute reader will notice that I have carefully avoided the issues associated with network bandwidth in this article. It has been my experience that obtaining true, accurate network bandwidth
statistics is on par with seeing pink elephants – and I have never seen a pink elephant – not yet, anyway.
But I can tell you that file compression will undoubtedly be your best friend at some point in your architecture, especially if you are transferring large amounts of data. Extracting data into
manageable files that you can compress and move across the network in parallel maximizes your overall data throughput.
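As a sketch of that idea, the snippet below splits an extract into fixed-size chunks and compresses them in parallel; in a real migration each compressed chunk would then be shipped over the network and reassembled on the far side. The chunk size, worker count, and stand-in data are all illustrative.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a flat-file extract; in practice this is read from disk.
data = b"ACCT|BALANCE|DATE\n" * 50_000
CHUNK = 256 * 1024  # 256 KB chunks, an arbitrary but manageable size

chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

# Compress the chunks in parallel; each result is independently
# transferable and independently decompressible.
with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(gzip.compress, chunks))

# Round-trip check: decompressing and reassembling recovers the extract.
assert b"".join(gzip.decompress(c) for c in compressed) == data
```

Highly repetitive business data tends to compress very well, so the bytes actually crossing the wire can be a small fraction of the extract size, which is precisely the throughput win described above.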
But hey, you probably have at least 24 hours before you have to present your migration/ETL architecture to the team, so what’s the hurry, right?
There is, fortunately, a better way to manage data migrations. However, management expectations are rarely something you can control. While this choice of architectures will provide you with an
excellent starting point, you must deploy a rigorous analysis strategy to identify and document the data rules. After all, even cars need fuel, and data rules are the fuel of data migrations,
providing the what, where, when and how of data movement.