Abstract
It is difficult to find an IT shop that is not considering or already building a data warehouse. Though the nature of a true data warehouse is subject to debate, the data warehouses being designed and built today share the same basic purpose: to provide a data storage facility for efficient analytical reporting.
Driving this effort has been the gradual improvement of the technology involved and the falling prices of its components. Warehousing technology has improved on the front end, with better OLAP tools now offered at lower prices, and on the back end, with lower server prices and improved server technology for warehouse queries. And in the middle of this technical architecture, tools to assist in the building, extraction, transformation and loading of warehouse data have also come of age.
Data warehouses have traditionally been built using a variety of available tools: ERD modeling tools for building the model, database tools for building the database and loading the data, programming languages for extraction and transformation, and sort tools for creating aggregations. Extraction, transformation and loading (ETL) tools combine many of these functions in a single, integrated package. The latest generation of these tools provides a user-friendly front end, point-and-click programming for source-to-target mappings and transformations, metadata repositories and collaborative development facilities, in addition to other features, thus addressing many of the shortcomings of using separate and sometimes inappropriate or inadequate tools for these tasks.
Yet despite the clear advantages of using ETL tools, there is still significant resistance to their purchase and use on data warehouse projects. The perception persists that these tools are either unnecessary or so new and ‘cutting edge’ that using them represents a risky proposition.
This article provides a brief history of ETL tools and identifies some of the significant benefits that can be derived from using them. Some of the resistance to these tools is discussed, and arguments in favor of their use are presented.
In the past few years data warehousing has managed to work its way into our collective IT consciousness. At its core, data warehousing is based on concepts that are at once both old and new. For instance, the process of taking data required for analysis and placing it in a separate database is a notion almost as old as database software itself. Likewise, star-schema data modeling uses denormalization techniques to improve performance and simplify data access; such denormalization has long been a tried-and-true excuse for retreating from the third normal form of relational database modeling.
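To make the denormalization point concrete, here is a minimal star-schema sketch, shown as plain Python structures rather than any particular database; the sales tables and field names are hypothetical:

```python
# Minimal sketch of a star-schema layout (hypothetical sales example).
# The dimension tables are deliberately denormalized: each row carries
# descriptive attributes (category, region) that 3NF would split out.

product_dim = {
    101: {"name": "Widget", "category": "Hardware"},
    102: {"name": "Gadget", "category": "Electronics"},
}
store_dim = {
    7: {"city": "Denver", "region": "West"},  # region folded into the row
}

# The fact table holds only keys and additive measures.
sales_facts = [
    {"product_key": 101, "store_key": 7, "date_key": 19990104, "amount": 250.0},
    {"product_key": 102, "store_key": 7, "date_key": 19990104, "amount": 125.0},
]

# Analytical queries become simple key lookups plus aggregation.
west_total = sum(
    f["amount"] for f in sales_facts
    if store_dim[f["store_key"]]["region"] == "West"
)
print(west_total)  # 375.0
```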
But what has made data warehousing feasible on such a large scale is the technology involved. As database products have matured, larger volumes of data can be stored and accessed efficiently. Coupled with this development has been the steady improvement of computing power on the mid-range server platform. Finally, the most notable development has been the rapid maturing of the client-server platform and of the OLAP tools used to access the data in the warehouse. These tools provide a user-friendly interface to relational data, something that had been difficult to achieve over more complex, fully normalized relational databases.
This combination of technological and analytical developments has led to the continued growth of the data warehouse. And it is technology that continues to drive the success of data warehousing, with the steady improvement of OLAP tools and the continued downward spiral of mid-range server costs. (While software costs have come down to some degree, future pressure from Microsoft, which plans to combine low cost with bundled OLAP functionality, will undoubtedly drive data warehouse software prices down even more.)
The Architecture of the Data Warehouse
Data warehousing technology has traditionally been built on the technical architecture shown in the figure below. On the far left of this architectural diagram is the source of the warehouse data, the system containing the transactional data. At the far right of the diagram is the delivery mechanism, the OLAP/data mining tool used to analyze the warehouse data. These tools retrieve data from the warehouse storage database, generally a relational database or a multidimensional database (MDD).
Figure 1 – Data Warehouse Architecture
Between the warehouse database and the data source reside the tools for extracting data from the source system and loading the database. These extraction, transformation and load (ETL) tools extract the data from the source system, perform any transformation needed on the source data and then load the data into the database for the data warehouse.
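In its simplest form, that three-step process can be sketched in a few lines of Python. This is an illustrative sketch only; the source file, field names and target table are hypothetical, and no particular vendor tool works this way:

```python
import csv
import sqlite3

def extract(path):
    """Read raw records from a (hypothetical) source extract file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(record):
    """Apply business rules to one record: standardize a code, type a field."""
    record["region"] = record["region"].strip().upper()
    record["amount"] = float(record["amount"])
    return record

def load(records, db_path="warehouse.db"):
    """Insert transformed records into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales (region, amount) VALUES (:region, :amount)",
        records,
    )
    con.commit()
    con.close()

# Wire the three steps together over a hypothetical daily extract file.
load(transform(r) for r in extract("daily_sales.csv"))
```

Real warehouse loads differ from this sketch mainly in scale and in the complexity of the business rules applied in the transform step.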
Traditionally, a great deal of warehouse development effort has been applied to this ETL process. Taking transactional data designed for operational systems and putting it into a format useful for reporting is non-trivial, often involving the application of complex business rules to resolve data inconsistencies. When this processing is developed in third-generation languages (3GLs) such as C or COBOL, the effort of developing and implementing such programs can easily become a technical and organizational nightmare. ETL tools can help to alleviate this nightmare by providing a consistent, simplified solution to a recurring set of development and implementation problems. Vendors such as D2K, Sagent, Prism, Ardent and others provide these integrated tool sets.
ETL Tools
An ETL tool is by definition an integrated suite of tools for extracting, transforming and loading data warehouse data, an approach distinctly different from the more common warehousing solution of cobbling together a set of programs or tools to accomplish the same task. A disjointed set of tools can ultimately perform the processing needed to load the warehouse, but in the course of developing these programs valuable information may be lost. For instance, transformation of the data may occur in more than one process: in the extraction program running on the legacy system and then again during the load process. A developer attempting to modify existing transformation and load routines may look in one location and not the other, and thus miss important business rule information. Since data warehousing is admittedly an iterative process, the likelihood that modifications will be made to existing extraction, transformation and load programs is high. In such a scenario, finding the correct business rule information can be difficult if not impossible, as the sketch below illustrates.
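The hazard is easy to see in miniature. In this hypothetical sketch, the same legacy decode rule has been implemented twice, by different developers, and the two copies have quietly drifted apart:

```python
# --- extract program, running against the legacy system ---
def extract_customer(row):
    # Business rule: decode legacy gender codes; unknowns become "U".
    row["gender"] = {"1": "M", "2": "F"}.get(row["gender"], "U")
    return row

# --- load program, written later by a different developer ---
def load_customer(row):
    # The "same" rule re-implemented, with a subtle drift:
    # unknown codes now pass through unchanged instead of becoming "U".
    row["gender"] = {"1": "M", "2": "F"}.get(row["gender"], row["gender"])
    return row
```

A developer who updates the extract-side rule but never finds the load-side copy has silently changed the warehouse's business rules.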
ETL tools capture transformation information (business rules) from the ETL process in a metadata repository. This centralized repository can facilitate communication between the members of the warehouse development team: the end user, the subject matter expert, the data warehouse architect, the developer and others. The information in the repository is then available to subsequent developers who, when adding a subject area or modifying an existing one, must understand the existing transformation logic before developing new logic.
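Conceptually, a repository entry is just a record tying source, target and rule together with its provenance. A hypothetical sketch of one such entry (the field names are illustrative, not any vendor's schema):

```python
# Hypothetical shape of one mapping record in a metadata repository.
mapping_entry = {
    "source": "LEGACY.CUSTOMER.GNDR_CD",
    "target": "WAREHOUSE.CUSTOMER_DIM.GENDER",
    "transformation": "decode 1 -> M, 2 -> F, else U",
    "business_rule": "Unknown legacy gender codes are reported as 'U'.",
    "author": "jsmith",
    "last_modified": "1999-03-15",
}
```

With the rule recorded once in a shared place, the duplicated-logic problem sketched above becomes a lookup rather than an archaeology exercise.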
The ETL Tool Today
The components of ETL tools have existed in some form or another for a number of years. Initially, these bundled tools provided only the ability to extract data from legacy mainframe systems into a file format that could then be used to load the target database. The tools generated programs (usually in COBOL) that would then run on the mainframe to perform the required extraction and transformation.
This initial set of bundled tools is sometimes referred to as the first generation of ETL tools. While effective in some ways, these tools did not provide a mechanism to facilitate collaboration within the data warehouse development team, and they bundled only one or two of the ETL functions.
But the ETL tool has matured, and the current slate of tools, the self-proclaimed second generation of ETL tools, provides added user-friendly features (client-server GUIs, Web access) along with additional functionality and performance benefits. Source-to-target mappings and transformations can be programmed easily through a process that creates documentation and collects metadata automatically as part of development.
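Behind the point-and-click interface, what such a tool captures is essentially a declarative mapping. The following hypothetical sketch shows the idea: a single mapping specification drives both the transformation logic and the generated documentation:

```python
# A hypothetical declarative mapping: one specification drives both
# the transformation logic and the documentation/metadata it generates.
MAPPINGS = [
    ("cust_name", "CUSTOMER_DIM.NAME", str.title,
     "Proper-case the customer name"),
    ("gndr_cd", "CUSTOMER_DIM.GENDER",
     lambda code: {"1": "M", "2": "F"}.get(code, "U"),
     "Decode legacy gender codes; unknowns become 'U'"),
]

def apply_mappings(source_row):
    """Produce a target row from a source row using the mapping spec."""
    return {target: fn(source_row[src]) for src, target, fn, _ in MAPPINGS}

def document_mappings():
    """Emit human-readable documentation from the same spec."""
    for src, target, _, note in MAPPINGS:
        print(f"{src} -> {target}: {note}")
```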
Resistance to ETL Tool Usage
The contribution that these tools can make to the data warehouse development process is significant. Yet there continues to be significant resistance to using them. IT professionals (clients, consultants, and data warehouse developers) still have issues and concerns with a technology that many consider new and cutting edge. These issues center on initial product cost, the learning curve for using and maintaining the tool, and ongoing maintenance of the programs that make up the ETL process.
Cost Issues
Unfortunately, the cost of these tools remains high. But this initial cost should be weighed against the return on investment (ROI) the tool provides. When considered against the alternative of programming a set of extraction and transformation programs with a disconnected set of tools, it is clear that significant development effort and cost can be reduced or eliminated by using an integrated ETL tool.
Learning Curve
Often the decision IT managers must make is whether to use in-house legacy system programmers to write the extraction, transformation and load programs, a choice that incurs no learning-curve cost, or to adopt a new ETL tool and incur the associated costs of training internal staff on it. IT managers squeamish about change and the potential for failure have much to consider here.
But IT managers should perform a careful cost-benefit analysis that also considers the costs of not using an ETL tool in a data warehouse development effort. Chief among these is the increased development time of hand-coding extraction and transformation programs relative to using a GUI to map sources to targets and define transformations. With the tool-based approach, development time is shortened (thus lowering development cost) and program bugs are greatly reduced.
Ongoing Maintenance
An additional management concern with ETL tool use is the cost of ongoing software maintenance for the tool. As with any development and production tool used in-house, qualified staff must be found to use and maintain it. In a world where IT professionals can be among the most expensive resources, this is a valid concern. But this cost should be balanced against the user-friendliness of the ETL tool and the ease of maintaining and monitoring the ETL process with it. Additionally, there are savings from eliminating the maintenance of disjointed extraction and transformation programs (usually a collection of 3GL programs, a maintenance nightmare in itself).
ETL Tools and the Enterprise Model
The lofty goal of the enterprise-wide data model continues to go unrealized in many organizations. How can a proposition that offers consistency, stability, and accuracy to the business continue to be ignored? This IT failure relates directly to the process of developing a data warehouse. Bill Inmon’s well-known vision of a data warehouse involves a consolidated, non-volatile database of information from throughout the organization, effectively an enterprise database, and thus one requiring an enterprise-wide data model. And as Inmon is quick to point out, a set of disjointed data marts does not make a data warehouse.
The business reality is that the budgets required to launch a lengthy data modeling exercise rarely exist. The budgets that do exist are smaller and demand more tangible, short-term results, which explains the popularity of the quick data mart. But with consistent metadata management and a centralized, well-managed data modeling effort for the data marts, many of the goals of the enterprise data model, such as a consistent data model across the enterprise, can potentially be realized.
The foundation of any such modeling effort is a centralized metadata repository that facilitates collaboration among the members of the data warehouse development team. This centralized repository allows data modeling issues to be resolved with input from team members and then provides centralized access to the resulting model. ETL tools can provide a technical foundation for this effort if they contain a sound repository component. The centralized repository, plus numerous time-saving tools for the extraction, transformation and loading of data, creates a significant value proposition for data warehouse developers.
Conclusion
While resistance to ETL tool use continues, the realization that these tools have significant value for warehouse development, and for the enterprise in general, is beginning to dawn on the IT world. The continuing maturity of ETL tools, combined with growing acceptance of the data warehouse concept, will lead to their increased use. In time, the hand-coded 3GL extraction and transformation program may become a thing of the past, and the most important decision in the data warehouse technical architecture may be the selection of the ETL tool.