Published in TDAN.com January 2000
A data warehouse is more than an archive for corporate data and more than a new way of accessing corporate information. A data warehouse is a subject-oriented repository designed with
enterprise-wide access in mind. It provides tools to satisfy the information needs of enterprise managers at all organizational levels — not just for complex data queries, but as a general
facility for getting quick, accurate, and often insightful information. A data warehouse is designed so that its users can recognize the information they want and access that information using
One of the principal reasons for developing a data warehouse is to integrate operational data from various sources into a single and consistent structure that supports analysis and decision-making
within the enterprise. Operational (legacy) systems create, update and delete production data that “feed” the data warehouse.
A data warehouse is analogous to a physical warehouse. Operational systems create data “parts” that are loaded into the warehouse. Some of those parts are summarized into information
“components” that are stored in the warehouse. Data warehouse users make requests and are delivered information “products” that are created from the stored components and parts.
A data warehouse is typically a blending of technologies, including relational and multidimensional databases, client/server architecture, extraction/transformation programs, graphical user
interfaces, and more.
Data warehousing is one of the hottest industry trends — for good reason. A well-defined and properly implemented data warehouse can be a valuable competitive tool.
A data warehouse has its own unique peculiarities and characteristics that make developing a data warehouse unlike developing just another application. Not every enterprise is able to successfully
develop an effective data warehouse – in fact there are many more failures than successes.
Critical Success Factors (CSF)
The critical success factors for data warehouse engineering are:
- Sponsorship and Involvement
- Business Requirements
- Enterprise Information Architecture
- Data Warehouse Architecture and Design
- Data Warehouse Technology
- Information Quality
CSF: Sponsorship and Involvement
Enterprise executives and managers must sponsor data warehouse development. Equally important, all potential users must be involved in data warehouse engineering. Without both management
sponsorship and near universal involvement, enterprise-wide data warehouse projects usually fail.
commitment to implementing a data warehouse that is the single source for corporate measurement and decision support data.
three things: management commitment, universal approval, and appropriate measures and rewards.
ensure that resources necessary to effect the change are available. Consistent commitment means that the change becomes both an enterprise strategy and an enterprise goal that leaders continuously
and obviously support. The visibility of leadership support is a primary factor in achieving universal approval for change.
for them. They agree that the change being undertaken is the right change. Peter Senge, in his book The Fifth Discipline, describes the need for universal approval in order to implement systemic
change: “People want change, they don’t want to be changed.”
even more difficult. The best way to achieve and maintain universal approval is to ensure that the process and results of change are measured appropriately and accurately and communicated
enterprise-wide. Good results and changed behavior must be rewarded. At the same time, unchanged behavior and poor results should not be rewarded. Employees will not work toward change if they
continue to be rewarded for old practices.
management. Data warehouse users will have the most influence on acceptance of the warehouse, so it is imperative that their needs are addressed. They are also the “owners” and “stewards” of
operational data and thus are the best source for subject matter expertise.
CSF: Business Requirements
Developing a data warehouse without first determining strategic business requirements is a sure recipe for failure. The best source for these requirements is the enterprise strategic
plan and the performance measures identified in the plan. These become the basis for the enterprise information architecture as well as the data warehouse architecture and
design (see below). An enterprise should never undertake system development efforts, particularly engineering a data warehouse, without first determining its strategic business and information
with which effective managers guide their organizations and ensure corporate success.
Through the strategic planning process, the enterprise defines and documents its purpose, goals, and objectives, along with strategies for achieving them. Included in the process is an assessment
of external opportunities and threats as well as an assessment of internal strengths and weaknesses.
for every critical outcome.
stakeholder expectations are being met.
the measures, often based upon external benchmarks, form the structure for an enterprise performance measurement system.
combination of all the reports of all the performance measures becomes the basis for a data warehouse and a Strategic Information System that is truly tailored to the enterprise’s requirements.
strategic needs. Linking timely accurate measures to specific goals and objectives begins to make enterprise management more of a science and less of an art.
CSF: Enterprise Information Architecture
Linking the enterprise strategic plan with an enterprise data architecture (EDA), information systems architecture (ISA) and enterprise technical architecture
(ETA) results in an enterprise information architecture. This architecture is a logical organization of corporate information requirements, descriptions of application systems that support
the enterprise’s strategic requirements. It includes the relationships between application systems via shared software components and shared data elements. The enterprise information
architecture also establishes guidelines, standards, and operational services that define the enterprise’s computing technology environment.
Before an enterprise can define, design, and implement the architecture for its strategic information management systems, including data warehouse, data mart, decision support, and executive
information systems, it must first document the environment in which these systems will be implemented.
data elements, and identification of the “owner” of the data. In addition, it is important for the model to indicate the circumstances (who, when, where, how) for creating, updating, using, and
deleting enterprise data. For ease of use, subsets of the enterprise data architecture model should be established. These subsets, or views, can represent functions, organizations, regions,
systems, and any other significant grouping of information.
appropriate data elements in the Enterprise Data Architecture. Every system should also be linked to appropriate elements of the enterprise technology architecture.
guidelines, standards, and operational services that define the enterprise’s systems development environment are documented.
CSF: Data Warehouse Architecture and Design
The key to success in scaleable data warehouse development and the single factor that contributes most to data warehousing success is a data warehouse architecture.
The architecture and design of an enterprise’s data warehouse should reflect the performance measurement and business requirements of the enterprise. Its data model, structure,
components, and meta data should all be based upon internal information requirements – not specific technologies.
model provides a clear and unambiguous definition of every key data entity, describing the way each is used, as well as defining derivation formulas, aggregation categories, and refreshment time
periods. The data warehouse model, linked with the enterprise information architecture, becomes both requirement documentation and a source for communicating the contents of the data warehouse to
its users and developers. Issues that must be addressed in the data model include what legacy data will be used to populate the data warehouse, how data will be moved from legacy environments to
the data warehouse, and how the legacy data will be integrated or transformed to ensure data quality and integrity in the data warehouse. The two most important issues for any data warehouse
are data quality and data access.
different dimension from other warehouse data. Meta data used to manage and control data warehouse creation and maintenance resides outside the data warehouse, often in a digital repository. Meta
data for data warehouse users is part of the data warehouse itself and is available to control access and analysis of the data warehouse. To a data warehouse user, meta data is like a “card
catalog” to the subjects contained in the data warehouse. The two types of data warehouse meta data are called structural and access.
meta data is the data warehouse model that describes its data entities, their characteristics, and how they are related to one another. The way potential data warehouse users currently use, or
intend to use, enterprise measures provides insight into how to best serve them from the data warehouse; i.e., what data entities to include and how to aggregate detailed data entities. The data
warehouse model provides a means of documenting and identifying structural meta data. This includes both strategic and operational uses of enterprise measures, as well as multi-dimensional
summarization. Structural meta data also includes performance metrics for programs and queries so that users and developers know how long programs and queries should run. Data warehouse performance
tuning also uses these metrics.
dictionary of standard terms including user-defined custom names and aliases. Access meta data also includes the location and description of data warehouse servers, databases, tables, detailed
data, and summaries along with descriptions of original data sources and transformations. Access meta data provides rules for drill up, drill down and views across enterprise dimensions and subject
hierarchies like products, markets, and customers. Access meta data also allows rules for user-defined custom calculations and queries to be included. In addition, access meta data contains
individual, work group, and enterprise security for viewing, changing, and distributing custom calculations, summaries, or other analyses.
Data Warehouse Components
The data warehouse architecture also contains descriptions data warehouse components: current detail, summarized data, and archives as well as systems of
record and integration/transformation programs.
data or as an aggregation of raw data. Current detail, organized by subject area, represents the entire enterprise, rather than a given application. Current detail is the lowest level of data
granularity in the data warehouse. Every data entity in current detail is a snapshot, at a moment in time, representing the instance when the data are accurate. Current detail is typically
maintained for two to five years, but some enterprises may require detail data for significantly longer periods. When initially implemented, a data warehouse may include current detail more than
two years old, but the often questionable quality of older data must be considered and measures taken to ensure its validity. Current detail refreshment occurs as frequently as necessary to support
effective data warehouse design provides for customized, lightly summarized data for every enterprise element (see Data Mart, below). An enterprise element may have access to both detailed and
summarized data, but typically much less than the total stored in current detail.
detail. Data volume at this level is much less than other levels and represents an eclectic collection supporting a wide variety of needs and interests. In addition to access to highly summarized
data, executives also should have the capability of accessing increasing levels of detail through a “drill down” process.
stored in the data warehouse archives that has a low incidence of access. Archive data are most often used for forecasting and trend analysis. Although archive data may be stored with the same
level of granularity as current detail, it is more likely that archive data are aggregated as they are archived. Archives include not only old data (in raw or summarized form); they also include
the meta data that describes the old data’s characteristics.
as is, into a data warehouse. Raw operational data are virtually unintelligible to most end users. Additionally, operational data seldom conform to the logical, subject-oriented structure of a data
warehouse. Further, different operational systems represent data differently, use different codes for the same thing, squeeze multiple pieces of information into one field, and more. Operational
data can also come from many different physical sources: old mainframe files, non-relational databases, indexed flat files, even proprietary tape and card-based systems. Thus operational data must
be cleaned up, edited, and reformatted before being loaded into a data warehouse.
As operational data items pass from their systems of record to a data warehouse, integration and transformation programs convert them from application-specific data into enterprise data. These
integration and transformation programs perform functions such as:
- Reformatting, recalculating, or modifying key structures and other data elements.
- Adding time elements
- Identifying default values
- Supplying logic to choose between multiple data sources
- Summarizing, tallying, and merging data from multiple sources
- Reconciling data from multiple sources
When either operational or data warehouse environments change, integration and transformation programs must be modified to reflect that change.
Data Warehouse Structure
A data warehouse may have any of several structures. The structure that best meets the data warehouse needs of an enterprise is fully dependent upon the enterprise business, data,
and access requirements. The basic data warehouse structures are:
and processing the detail data.
actual data. Instead it contains the information necessary to access the data wherever they reside. This structure is possible only when operational systems exactly reflect the enterprise data
architecture and system capacities can support both operational and management functions.
reflects the needs of the enterprise element it supports, and may be different from the organization of the enterprise data warehouse. Specific data elements may be stored redundantly in both the
data mart and the data warehouse. As part of an iterative data warehouse development process, an enterprise builds a series of physical data marts over time and links them via an enterprise-wide
logical data warehouse or feeds them from a single physical warehouse.
“horizontal slice” of data that pertains to their branch (i.e. they need all the data elements – tables and columns – but only the rows pertaining to their branch). Other users need “vertical
slices” or a combination of horizontal and vertical slices.
regularly used by the user group should be on their Data Mart server. All other data accessible to the user group should be available on other machines in the network when needed.
The major issues to be addressed in implementing a particular data warehouse structure involve data distribution and data replication. How much data? How often? Detail or summarized?
Uni-directional or bi-directional update? Data distribution and replication decisions will also have application implications.
Distribution refers to the parceling out of segments of the data to distinct multiple independent computers (or clusters). Replication refers to the copying of portions of the data in one (or more)
databases to a second database (often on a different computer) and guaranteeing that whenever the data is changed that all the replicas implement the change in order to stay in synchronization.
The challenges of distribution and replication are not only exist in developing the initial design, but also the ongoing management and maintenance of the overall system. Data Warehouse
architecture design must take into account the following:
Replication: An effective design must consider
- From what database and to what servers is the data moving? Is the data moved one-way or bi-directionally?
- How many replicas will be needed and are they all identical?
- By operational application, how much data – whole databases or selected sets of fields – will be moved throughout the network?
- By application, how time sensitive is the data?
- How is data delivery guaranteed? Who is responsible for the guarantee?
- What should be done if the data to be replicated cannot be delivered due to temporary problems?
- How is data replication tracked?
Network: The network issues that revolve around distribution and replication include
- What are the physical characteristics of the network architecture that connects the data sources to the replicas? Can its bandwidth handle the amount to data to be transferred?
- What is the overall processing speed of each component of the network? How fast can data effectively move between each node?
- What type of replication process will be implemented (synchronous or asynchronous)? Will it require 100% availability of the network? How is the replication process affected if the network is
temporarily down? How will failed replication attempts be managed?
Data Warehouse Application(s): Applications must be designed to be aware of the replicas available.
- Each application at each site must know where to find the data (preferred site)
- If the data is not available at the preferred site, how does the application detect the problem?
- Should the application have the ability to switch to an alternate site from which to retrieve the data?
Scheduling: Efficient scheduling of replication must consider
- What events or conditions will trigger a dynamic data transfer? How much data is transferred during a triggered data move?
- How long will it take to perform the data transfer under various conditions? Can the required time be minimized through better scheduling
- What time zones play an important consideration when moving data?
Change Management: Replication is not a static concept. Ongoing changes must be anticipated in the overall replication strategy.
- What technological changes could impact the operation or performance of the current replication strategy? What investments will improve the strategic business impact of the data?
- What organizational, product, service, regulatory, market, competitive, or other environmental changes could impact the operation or performance of the present replication strategy?
- Does the replication strategy fit today’s users’ needs?
Distribution and replication of data warehouse data is primarily a physical architecture design and implementation issue.
Copyright © 1999 Visible Systems Corporation