Critical Success Factors for Data Warehouse Engineering Part 1

Published in TDAN.com January 2000

Articles in this series – Part 1, Part 2


Introduction

A data warehouse is more than an archive for corporate data and more than a new way of accessing corporate information. A data warehouse is a subject-oriented repository designed with
enterprise-wide access in mind. It provides tools to satisfy the information needs of enterprise managers at all organizational levels — not just for complex data queries, but as a general
facility for getting quick, accurate, and often insightful information. A data warehouse is designed so that its users can recognize the information they want and access that information using
simple tools.

One of the principal reasons for developing a data warehouse is to integrate operational data from various sources into a single and consistent structure that supports analysis and decision-making
within the enterprise. Operational (legacy) systems create, update and delete production data that “feed” the data warehouse.

A data warehouse is analogous to a physical warehouse. Operational systems create data “parts” that are loaded into the warehouse. Some of those parts are summarized into information
“components” that are stored in the warehouse. Data warehouse users make requests and are delivered information “products” that are created from the stored components and parts.

A data warehouse is typically a blending of technologies, including relational and multidimensional databases, client/server architecture, extraction/transformation programs, graphical user
interfaces, and more.

Data warehousing is one of the hottest industry trends — for good reason. A well-defined and properly implemented data warehouse can be a valuable competitive tool.

A data warehouse has its own unique peculiarities and characteristics that make developing a data warehouse unlike developing just another application. Not every enterprise is able to successfully
develop an effective data warehouse – in fact there are many more failures than successes.

Critical Success Factors (CSF)

The critical success factors for data warehouse engineering are:

  • Sponsorship and Involvement
  • Business Requirements
  • Enterprise Information Architecture
  • Data Warehouse Architecture and Design
  • Data Warehouse Technology
  • Information Quality


CSF: Sponsorship and Involvement

Enterprise executives and managers must sponsor data warehouse development. Equally important, all potential users must be involved in data warehouse engineering. Without both management
sponsorship and near universal involvement, enterprise-wide data warehouse projects usually fail.

Management
Enterprise management must fully sponsor data warehouse development and usage. Sponsorship includes ensuring sufficient resources are available. Sponsorship also means consistent
commitment to implementing a data warehouse that is the single source for corporate measurement and decision support data.
Data warehouse development and usage often requires significant culture change. This cannot happen without management commitment. Managing internal change, particularly culture change, requires
three things: management commitment, universal approval, and appropriate measures and rewards.
Management Commitment: In order for anything to happen in an enterprise, including change, executives and managers must be consistently committed to making it happen. Only enterprise leaders can
ensure that resources necessary to effect the change are available. Consistent commitment means that the change becomes both an enterprise strategy and an enterprise goal that leaders continuously
and obviously support. The visibility of leadership support is a primary factor in achieving universal approval for change.
Universal Approval: Change is successful only when the people involved approve of the change. They understand the need for the change. They believe the change is good for the enterprise and good
for them. They agree that the change being undertaken is the right change. Peter Senge, in his book The Fifth Discipline, describes the need for universal approval in order to implement systemic
change: “People want change, they don’t want to be changed.”
Measures and Rewards: Getting everyone to want change is difficult. It requires a level and degree of communication and cooperation not found in most enterprises. Maintaining universal approval is
even more difficult. The best way to achieve and maintain universal approval is to ensure that the process and results of change are measured appropriately and accurately and communicated
enterprise-wide. Good results and changed behavior must be rewarded. At the same time, unchanged behavior and poor results should not be rewarded. Employees will not work toward change if they
continue to be rewarded for old practices.
Potential Users
All potential users of the data warehouse, even executives, from every organizational unit and level, must be actively involved in data warehouse design, development, and
management. Data warehouse users will have the most influence on acceptance of the warehouse, so it is imperative that their needs are addressed. They are also the “owners” and “stewards” of
operational data and thus are the best source for subject matter expertise.


CSF: Business Requirements

Developing a data warehouse without first determining strategic business requirements is a sure recipe for failure. The best source for these requirements is the enterprise strategic
plan
and the performance measures identified in the plan. These become the basis for the enterprise information architecture as well as the data warehouse architecture and
design (see below). An enterprise should never undertake system development efforts, particularly engineering a data warehouse, without first determining its strategic business and information
requirements.

Strategic Plan
A strategic plan outlines an enterprise’s mission and purpose, goals, strategies and performance measures (business requirements). Properly used, a strategic plan is the tool
with which effective managers guide their organizations and ensure corporate success.
An enterprise’s strategic plan not only provides a guide for effective management; it also provides the guiding force for internal change and the guidelines for responding to external change.
Through the strategic planning process, the enterprise defines and documents its purpose, goals, and objectives, along with strategies for achieving them. Included in the process is an assessment
of external opportunities and threats as well as an assessment of internal strengths and weaknesses.
The most useful strategic plans are multi-dimensional, incorporating the enterprise’s overall plan with the subordinate plans of every enterprise element, and including performance measures
for every critical outcome.
Establishing the right performance measures is the key to successful enterprise management. An enterprise must be able to tell whether progress is being made on its critical goals and whether
stakeholder expectations are being met.
The most effective and useful performance measures are cross-functional and are linked to the appropriate strategies, objectives, and performance criteria. Management’s targets and thresholds for
the measures, often based upon external benchmarks, form the structure for an enterprise performance measurement system.
Performance measurement documentation should include not only the content of reports and queries, but also document the path of the data from source to ultimate information recipient. The
combination of all the reports of all the performance measures becomes the basis for a data warehouse and a Strategic Information System that is truly tailored to the enterprise’s requirements.
Executives and managers use the information produced from the data warehouse to reinforce initiatives, reward behavior and change strategies. Employees use it to adjust operations and respond to
strategic needs. Linking timely accurate measures to specific goals and objectives begins to make enterprise management more of a science and less of an art.

CSF: Enterprise Information Architecture

Linking the enterprise strategic plan with an enterprise data architecture (EDA), information systems architecture (ISA) and enterprise technical architecture
(ETA)
results in an enterprise information architecture. This architecture is a logical organization of corporate information requirements, descriptions of application systems that support
the enterprise’s strategic requirements. It includes the relationships between application systems via shared software components and shared data elements. The enterprise information
architecture also establishes guidelines, standards, and operational services that define the enterprise’s computing technology environment.

Before an enterprise can define, design, and implement the architecture for its strategic information management systems, including data warehouse, data mart, decision support, and executive
information systems, it must first document the environment in which these systems will be implemented.

The EDA is a fully normalized data model that describes all the data necessary to the enterprise. It includes relationships between “business data objects,” business rules concerning usage of the
data elements, and identification of the “owner” of the data. In addition, it is important for the model to indicate the circumstances (who, when, where, how) for creating, updating, using, and
deleting enterprise data. For ease of use, subsets of the enterprise data architecture model should be established. These subsets, or views, can represent functions, organizations, regions,
systems, and any other significant grouping of information.
The ISA documents all the information systems in use by the enterprise to create, read, update, and delete enterprise data. In order to be useful, the information systems should be linked to
appropriate data elements in the Enterprise Data Architecture. Every system should also be linked to appropriate elements of the enterprise technology architecture.
This third segment of the Enterprise Information Architecture documents the enterprise’s hardware platforms, operating systems, and telecommunications infrastructure. The ETA is also where
guidelines, standards, and operational services that define the enterprise’s systems development environment are documented.
More detail concerning “Enterprise Information Architecture” can be found in the Visible White Paper with that title.


CSF: Data Warehouse Architecture and Design

The key to success in scaleable data warehouse development and the single factor that contributes most to data warehousing success is a data warehouse architecture.

The architecture and design of an enterprise’s data warehouse should reflect the performance measurement and business requirements of the enterprise. Its data model, structure,
components,
and meta data should all be based upon internal information requirements – not specific technologies.

A data model documents the data elements whose values at any point in time are necessary to tell data warehouse users how well their enterprise is performing. The data warehouse
model provides a clear and unambiguous definition of every key data entity, describing the way each is used, as well as defining derivation formulas, aggregation categories, and refreshment time
periods. The data warehouse model, linked with the enterprise information architecture, becomes both requirement documentation and a source for communicating the contents of the data warehouse to
its users and developers. Issues that must be addressed in the data model include what legacy data will be used to populate the data warehouse, how data will be moved from legacy environments to
the data warehouse, and how the legacy data will be integrated or transformed to ensure data quality and integrity in the data warehouse. The two most important issues for any data warehouse
are data quality and data access.
Meta data, or data about data, is the nerve center of a data warehouse and is essential. Meta data is essential to all levels of the data warehouse, but exists and functions in a
different dimension from other warehouse data. Meta data used to manage and control data warehouse creation and maintenance resides outside the data warehouse, often in a digital repository. Meta
data for data warehouse users is part of the data warehouse itself and is available to control access and analysis of the data warehouse. To a data warehouse user, meta data is like a “card
catalog” to the subjects contained in the data warehouse. The two types of data warehouse meta data are called structural and access.
Structural meta data is used for creation and maintenance of the data warehouse. It fully describes data warehouse structure and content. The basic building block of structural
meta data is the data warehouse model that describes its data entities, their characteristics, and how they are related to one another. The way potential data warehouse users currently use, or
intend to use, enterprise measures provides insight into how to best serve them from the data warehouse; i.e., what data entities to include and how to aggregate detailed data entities. The data
warehouse model provides a means of documenting and identifying structural meta data. This includes both strategic and operational uses of enterprise measures, as well as multi-dimensional
summarization. Structural meta data also includes performance metrics for programs and queries so that users and developers know how long programs and queries should run. Data warehouse performance
tuning also uses these metrics.
Access meta data is the dynamic link between the data warehouse and end-user applications. It generally contains the enterprise measures supported by the data warehouse and a
dictionary of standard terms including user-defined custom names and aliases. Access meta data also includes the location and description of data warehouse servers, databases, tables, detailed
data, and summaries along with descriptions of original data sources and transformations. Access meta data provides rules for drill up, drill down and views across enterprise dimensions and subject
hierarchies like products, markets, and customers. Access meta data also allows rules for user-defined custom calculations and queries to be included. In addition, access meta data contains
individual, work group, and enterprise security for viewing, changing, and distributing custom calculations, summaries, or other analyses.

Data Warehouse Components

The data warehouse architecture also contains descriptions data warehouse components: current detail, summarized data, and archives as well as systems of
record
and integration/transformation programs.

The heart of a data warehouse is its current detail. It is the place where the bulk of data resides. Current detail comes directly from operational systems and may be stored as raw
data or as an aggregation of raw data. Current detail, organized by subject area, represents the entire enterprise, rather than a given application. Current detail is the lowest level of data
granularity in the data warehouse. Every data entity in current detail is a snapshot, at a moment in time, representing the instance when the data are accurate. Current detail is typically
maintained for two to five years, but some enterprises may require detail data for significantly longer periods. When initially implemented, a data warehouse may include current detail more than
two years old, but the often questionable quality of older data must be considered and measures taken to ensure its validity. Current detail refreshment occurs as frequently as necessary to support
enterprise requirements.
Lightly summarized data are the hallmark of a data warehouse. All enterprise elements (department, region, function, etc.) do not have the same information requirements, so
effective data warehouse design provides for customized, lightly summarized data for every enterprise element (see Data Mart, below). An enterprise element may have access to both detailed and
summarized data, but typically much less than the total stored in current detail.
Highly summarized data are primarily for enterprise executives. Highly summarized data can come from either the lightly summarized data used by enterprise elements or from current
detail. Data volume at this level is much less than other levels and represents an eclectic collection supporting a wide variety of needs and interests. In addition to access to highly summarized
data, executives also should have the capability of accessing increasing levels of detail through a “drill down” process.
Data warehouse archives contain old data (normally over two years old) of significant, continuing interest and value to the enterprise. There is usually a massive amount of data
stored in the data warehouse archives that has a low incidence of access. Archive data are most often used for forecasting and trend analysis. Although archive data may be stored with the same
level of granularity as current detail, it is more likely that archive data are aggregated as they are archived. Archives include not only old data (in raw or summarized form); they also include
the meta data that describes the old data’s characteristics.
The components that link operational systems with the data warehouse are the integration/transformation programs. Even the “rightest” operational data cannot usually be copied,
as is, into a data warehouse. Raw operational data are virtually unintelligible to most end users. Additionally, operational data seldom conform to the logical, subject-oriented structure of a data
warehouse. Further, different operational systems represent data differently, use different codes for the same thing, squeeze multiple pieces of information into one field, and more. Operational
data can also come from many different physical sources: old mainframe files, non-relational databases, indexed flat files, even proprietary tape and card-based systems. Thus operational data must
be cleaned up, edited, and reformatted before being loaded into a data warehouse.

As operational data items pass from their systems of record to a data warehouse, integration and transformation programs convert them from application-specific data into enterprise data. These
integration and transformation programs perform functions such as:

  • Reformatting, recalculating, or modifying key structures and other data elements.
  • Adding time elements
  • Identifying default values
  • Supplying logic to choose between multiple data sources
  • Summarizing, tallying, and merging data from multiple sources
  • Reconciling data from multiple sources

When either operational or data warehouse environments change, integration and transformation programs must be modified to reflect that change.

Data Warehouse Structure

A data warehouse may have any of several structures. The structure that best meets the data warehouse needs of an enterprise is fully dependent upon the enterprise business, data,
and access requirements. The basic data warehouse structures are:

Physical Data Warehouse – physical database in which all the data for the data warehouse are stored, along with meta data and processing logic for scrubbing, organizing, packaging
and processing the detail data.
Logical Data Warehouse – also contains meta data including enterprise rules and processing logic for scrubbing, organizing, packaging and processing the data, but does not contain
actual data. Instead it contains the information necessary to access the data wherever they reside. This structure is possible only when operational systems exactly reflect the enterprise data
architecture and system capacities can support both operational and management functions.
Data Mart – subset of an enterprise-wide data warehouse. Typically it supports an enterprise element (department, region, function, etc.). The organization of data in a data mart
reflects the needs of the enterprise element it supports, and may be different from the organization of the enterprise data warehouse. Specific data elements may be stored redundantly in both the
data mart and the data warehouse. As part of an iterative data warehouse development process, an enterprise builds a series of physical data marts over time and links them via an enterprise-wide
logical data warehouse or feeds them from a single physical warehouse.
Both within the Data Warehouse as a whole and within the individual Data Marts, different groups of users have needs for differing slices of data. For example, users at a branch generally need the
“horizontal slice” of data that pertains to their branch (i.e. they need all the data elements – tables and columns – but only the rows pertaining to their branch). Other users need “vertical
slices” or a combination of horizontal and vertical slices.
The general approach is to try to make data that are needed by a user group available on a machine that is as close to the users as is feasible – a Data Mart server. Only that slice of data that is
regularly used by the user group should be on their Data Mart server. All other data accessible to the user group should be available on other machines in the network when needed.

The major issues to be addressed in implementing a particular data warehouse structure involve data distribution and data replication. How much data? How often? Detail or summarized?
Uni-directional or bi-directional update? Data distribution and replication decisions will also have application implications.

Distribution refers to the parceling out of segments of the data to distinct multiple independent computers (or clusters). Replication refers to the copying of portions of the data in one (or more)
databases to a second database (often on a different computer) and guaranteeing that whenever the data is changed that all the replicas implement the change in order to stay in synchronization.

The challenges of distribution and replication are not only exist in developing the initial design, but also the ongoing management and maintenance of the overall system. Data Warehouse
architecture design must take into account the following:

Replication: An effective design must consider

  • From what database and to what servers is the data moving? Is the data moved one-way or bi-directionally?
  • How many replicas will be needed and are they all identical?
  • By operational application, how much data – whole databases or selected sets of fields – will be moved throughout the network?
  • By application, how time sensitive is the data?
  • How is data delivery guaranteed? Who is responsible for the guarantee?
  • What should be done if the data to be replicated cannot be delivered due to temporary problems?
  • How is data replication tracked?

Network: The network issues that revolve around distribution and replication include

  • What are the physical characteristics of the network architecture that connects the data sources to the replicas? Can its bandwidth handle the amount to data to be transferred?
  • What is the overall processing speed of each component of the network? How fast can data effectively move between each node?
  • What type of replication process will be implemented (synchronous or asynchronous)? Will it require 100% availability of the network? How is the replication process affected if the network is
    temporarily down? How will failed replication attempts be managed?

Data Warehouse Application(s): Applications must be designed to be aware of the replicas available.

  • Each application at each site must know where to find the data (preferred site)
  • If the data is not available at the preferred site, how does the application detect the problem?
  • Should the application have the ability to switch to an alternate site from which to retrieve the data?

Scheduling: Efficient scheduling of replication must consider

  • What events or conditions will trigger a dynamic data transfer? How much data is transferred during a triggered data move?
  • How long will it take to perform the data transfer under various conditions? Can the required time be minimized through better scheduling
  • What time zones play an important consideration when moving data?

Change Management: Replication is not a static concept. Ongoing changes must be anticipated in the overall replication strategy.

  • What technological changes could impact the operation or performance of the current replication strategy? What investments will improve the strategic business impact of the data?
  • What organizational, product, service, regulatory, market, competitive, or other environmental changes could impact the operation or performance of the present replication strategy?
  • Does the replication strategy fit today’s users’ needs?

Distribution and replication of data warehouse data is primarily a physical architecture design and implementation issue.

Articles in this series – Part 1, Part 2

Copyright © 1999 Visible Systems Corporation

Share

submit to reddit

About Alan Perkins

Alan Perkins has been an executive consultant for a Big Four firm, senior enterprise architecture consultant for a large international IT solution provider, vice president of consulting for a software engineering tools company, general manager of a high-tech consulting firm, vice president of system R&D for a virtual corporation, chief information officer for a technology transfer consortium, chief solutions architect for a repository technology, systems analyst on the White House staff, and director of the U.S. Army Data Processing School in Germany. He has provided information and enterprise management consulting to numerous companies, associations and government agencies.

In his SRA role as Principal Enterprise Architecture Consultant, Alan guides and advises clients in order to help them achieve both alignment of IT with business and convergence of IT and business strategies to improve readiness, reliability and resilience. Alan is also leading the internal effort to implement a world-class sustainable career development program for SRA enterprise architects. He may be contacted by e-mail at alan_perkins@sra.com.

Top