Four Ways to Build a Data Warehouse

It has been said there are as many ways to build data warehouses as there are companies to build them. Each data warehouse is unique because it must adapt to the needs of business users in
different functional areas, whose companies face different business conditions and competitive pressures.

Nonetheless, four major approaches to building a data warehousing environment exist. These architectures are generally referred to as 1) top-down 2) bottom-up 3) hybrid, and 4) federated. Most
organizations—wittingly or not—follow one or another of these approaches as a blueprint for development.

Although we have been building data warehouses since the early 1990s, there is still a great deal of confusion about the similarities and differences among these architectures. This is especially
true of the “top-down” and “bottom-up” approaches, which have existed the longest and occupy the polar ends of the development spectrum.

As a result, some companies fail to adopt a clear vision for the way the data warehousing environment can and should evolve. Others, paralyzed by confusion or fear of deviating from prescribed
tenets for success, cling too rigidly to one approach or another, undermining their ability to respond flexibly to new or unexpected situations. Ideally, organizations need to borrow concepts and
tactics from each approach to create environments that uniquely meets their needs.

Semantic and Substantive Differences The two most influential approaches are championed by industry heavyweights Bill Inmon and Ralph Kimball, both prolific authors and consultants in the data
warehousing field. Inmon, who is credited with coining the term “data warehousing” in the early 1990s, advocates a top-down approach, in which companies first build a data warehouse
followed by data marts. Kimball’s approach, on the other hand, is often called bottom-up because it starts and ends with data marts, negating the need for a physical data warehouse
altogether.

On the surface, there is considerable friction between top-down and bottom-up approaches. But in reality, the differences are not as stark as they may appear. Both approaches advocate building a
robust enterprise architecture that adapts easily to changing business needs and delivers a single version of the truth. In some cases, the differences are more semantic than substantive in nature.
For example, both approaches collect data from source systems into a single data store, from which data marts are populated. But while “top-down” subscribers call this a data warehouse,
“bottom-up” adherents often call this a “staging area.”

Nonetheless, significant differences exist between the two approaches (see chart.) Data warehousing professionals need to understand the substantial, subtle, and semantic differences among the
approaches and which industry “gurus” or consultants advocate each approach. This will provide a clearer understanding of the different routes to achieve data warehousing success and
how to translate between the advice and rhetoric of the different approaches.

Top-Down Approach
The top-down approach views the data warehouse as the linchpin of the entire analytic environment. The data warehouse holds atomic or transaction data that is extracted from one or more source
systems and integrated within a normalized, enterprise data model. From there, the data is summarized, dimensionalized, and distributed to one or more “dependent” data marts. These data
marts are “dependent” because they derive all their data from a centralized data warehouse.

Sometimes, organizations supplement the data warehouse with a staging area to collect and store source system data before it can be moved and integrated within the data warehouse. A separate
staging area is particularly useful if there are numerous source systems, large volumes of data, or small batch windows with which to extract data from source systems.

The major benefit of a “top-down” approach is that it provides an integrated, flexible architecture to support downstream analytic data structures. First, this means the data warehouse
provides a departure point for all data marts, enforcing consistency and standardization so that organizations can achieve a single version of the truth. Second, the atomic data in the warehouse
lets organizations re-purpose that data in any number of ways to meet new and unexpected business needs. For example, a data warehouse can be used to create rich data sets for statisticians,
deliver operational reports, or support operational data stores (ODS) and analytic applications. Moreover, users can query the data warehouse if they need cross-functional or enterprise views of
the data.

On the downside, a top-down approach may take longer and cost more to deploy than other approaches, especially in the initial increments. This is because organizations must create a reasonably
detailed enterprise data model as well as the physical infrastructure to house the staging area, data warehouse, and the marts before deploying their applications or reports. (Of course, depending
on the size of an implementation, organizations can deploy all three “tiers” within a single database.) This initial delay may cause some groups with their own IT budgets to build their
own analytic applications. Also, it may not be intuitive or seamless for end users to drill through from a data mart to a data warehouse to find the details behind the summary data in their
reports.

Bottom-Up Approach
In a bottom-up approach, the goal is to deliver business value by deploying dimensional data marts as quickly as possible. Unlike the top-down approach, these data marts contain all the
data—both atomic and summary—that users may want or need, now or in the future. Data is modeled in a star schema design to optimize usability and query performance. Each data mart
builds on the next, reusing dimensions and facts so users can query across data marts, if desired, to obtain a single version of the truth as well as both summary and atomic data.

The “bottom-up” approach consciously tries to minimize back-office operations, preferring to focus an organization’s effort on developing dimensional designs that meet end-user
requirements. The “bottom-up” staging area is non-persistent, and may simply stream flat files from source systems to data marts using the file transfer protocol. In most cases,
dimensional data marts are logically stored within a single database. This approach minimizes data redundancy and makes it easier to extend existing dimensional models to accommodate new subject
areas.

Pros/Cons. The major benefit of a bottom-up approach is that it focuses on creating user-friendly, flexible data structures using dimensional, star schema models. It also delivers value
rapidly because it doesn’t lay down a heavy infrastructure up front.

Without an integration infrastructure, the bottom-up approach relies on a “dimensional bus” to ensure that data marts are logically integrated and stovepipe applications are avoided. To
integrate data marts logically, organizations use “conformed” dimensions and facts when building new data marts. Thus, each new data mart is integrated with others within a logical
enterprise dimensional model.

Another advantage of the bottom-up approach is that since the data marts contain both summary and atomic data, users do not have to “drill through” from a data mart to another structure
to obtain detailed or transaction data. The use of a staging area also eliminates redundant extracts and overhead required to move source data into the dimensional data marts.

One problem with a bottom-up approach is that it requires organizations to enforce the use of standard dimensions and facts to ensure integration and deliver a single version of the truth. When
data marts are logically arrayed within a single physical database, this integration is easily done. But in a distributed, decentralized organization, it may be too much to ask departments and
business units to adhere and reuse references and rules for calculating facts. There can be a tendency for organizations to create “independent” or non-integrated data marts.

In addition, dimensional marts are designed to optimize queries, not support batch or transaction processing. Thus, organizations that use a bottom-up approach need to create additional data
structures outside of the bottom-up architecture to accommodate data mining, ODSs, and operational reporting requirements. However, this may be achieved simply by pulling a subset of data from a
data mart at night when users are not active on the system.

Hybrid Approach
The hybrid approach tries to blend the best of both “top-down” and “bottom-up” approaches. It attempts to capitalize on the speed and user-orientation of the
“bottom-up” approach without sacrificing the integration enforced by a data warehouse in a “top down” approach. Pieter Mimno, an independent consultant who teaches at TDWI
conferences, is currently the most vocal proponent of this approach.

The hybrid approach recommends spending about two weeks developing an enterprise model in third normal form before developing the first data mart. The first several data marts are also designed in
third normal form but deployed using star schema physical models. This dual modeling approach fleshes out the enterprise model without sacrificing the usability and query performance of a star
schema.

The hybrid approach relies on an extraction, transformation, and load (ETL) tool to store and manage the enterprise and local models in the data marts as well as synchronize the differences between
them. This lets local groups, for example, develop their own definitions or rules for data elements that are derived from the enterprise model without sacrificing long-term integration.
Organizations also use the ETL tool to extract and load data from source systems into the dimensional data marts at both the atomic and summary levels. Most ETL tools today can create summary
tables on the fly.

After deploying the first few “dependent” data marts, an organization then backfills a data warehouse behind the data marts, instantiating the “fleshed out” version of the
enterprise data model. The organization then transfers atomic data from the data marts to the data warehouse and consolidates redundant data feeds, saving the organization time, money, and
processing resources. Organizations typically backfill a data warehouse once business users request views of atomic data across multiple data marts.

The major benefit of a hybrid approach is that it combines rapid development techniques within an enterprise architecture framework. It develops an enterprise data model iteratively and only
develops a heavyweight infrastructure once it’s really needed (e.g. when executives start asking for reports that cross data mart boundaries.)

However, backfilling a data warehouse can be a highly disruptive process that delivers no ostensible value and therefore may never be funded. In addition, few query tools can dynamically and
intelligently query atomic data in one database (i.e. the data warehouse) and summary data in another database (i.e. the data marts.) Users may be confused when to query which database.

This approach also relies heavily on an ETL tool to synchronize meta data between enterprise and local versions, develop aggregates, load detail data, and orchestrate the transition to a data
warehousing infrastructure. Although ETL tools have matured considerably, they can never enforce adherence to architecture. The hybrid approach may make it too easy for local groups to stray
irrevocably from the enterprise data model.

Federated Approach
The federated approach is sometimes confused with the hybrid approach above or “hub-and-spoke” data warehousing architectures that are a reflection of a top-down approach.

However, the federated approach—as defined by its most vocal proponent, Doug Hackney—is not a methodology or architecture per se, but a concession to the natural forces that undermine
the best laid plans for deploying a perfect system. A federated approach rationalizes the use of whatever means possible to integrate analytical resources to meet changing needs or business
conditions. In short, it’s a salve for the soul of the stressed out data warehousing project manager who must sacrifice architectural purity to meet the immediate (and ever-changing) needs of
his business users.

Hackney says the federated approach is “an architecture of architectures.” It recommends how to integrate a multiplicity of heterogeneous data warehouses, data marts, and packaged
applications that companies have already deployed and will continue to implement in spite of the IT group’s best effort to enforce standards and adhere to a specific architecture.

Hackney concedes that a federated architecture will never win awards for elegance or be drawn up on clean white boards as an “optimal solution.” He says it provides the “maximum
amount of architecture possible in a given political and implementation reality.” The approach merely encourages organizations to share the “highest value” metrics, dimensions,
and measures wherever possible, however possible. This may mean, for example, creating a common staging area to eliminate redundant data feeds or building a data warehouse that sources data from
multiple data marts, data warehouses, or analytic applications.

The major problem with the federated approach is that it is not well documented. There are only a few columns written on the subject. But perhaps this is enough, as it doesn’t prescribe a
specific end-state or approach. Another potential problem is that without a specific architecture in mind, a federated approach can perpetuate the continued decentralization and fragmentation of
analytical resources, making it harder to deliver an enterprise view in the end. Also, integrating meta data is a pernicious problem in a heterogeneous, ever-changing environment.

Summary
The four approaches described here represent the dominant strains of data warehousing methodologies. Data warehousing managers need to be aware of these methodologies but not wedded to them. These
methodologies have shaped the debate about data warehousing best practices, and comprise the building blocks for methodologies developed by practicing consultants.

Ultimately, organizations need to understand the strengths and limitations of each methodology and then pursue their own way through the data warehousing thicket. Since each organization must
respond to unique needs and business conditions, having a foundation of best practice models to start with augurs a successful outcome.

Top-Down Bottom-Up Hybrid Federated

Major Characteristics

· Emphasizes the DW.
· Starts by designing an enterprise model for a DW.
· Deploys multi-tier architecture comprised of a staging area, a DW, and “dependent” data marts.
· The staging area is persistent.
· The DW is enterprise-oriented; data marts are function-specific.
· The DW has atomic-level data; data marts have summary data.
· The DW uses an enterprise-based normalized model; data marts use a subject-specific dimensional model.
· Users can query the data warehouse and data marts.
 
· Emphasizes data marts.
· Starts by designing a dimensional model for a data mart.
· Uses a “flat” architecture consisting of a staging area and data marts.
· The staging area is largely non-persistent.
· Data marts contain both atomic and summary data.
· Data marts can provide both enterprise and function-specific views.
· A data mart consists of a single star schema, logically or physically deployed.
· Data marts are deployed incrementally and “integrated” using conformed dimensions.
 
· Emphasizes DW and data marts; blends “top-down” and “bottom-up” methods.
· Starts by designing enterprise and local models synchronously.
· Spends 2–3 weeks creating a high-level, normalized, enterprise model; fleshes out model with initial marts. · Populates marts with atomic and summary data
via a non-persistent staging area.
· Models marts as one or more star schemas.
· Uses ETL tool to populate data marts and exchange meta data between ETL tool and data marts.
· Backfills a DW behind the marts when users want views at atomic level across marts; instantiates the “fleshed out” enterprise model, and moves atomic data to the
DW.
 
· Emphasizes the need to integrate new and existing heterogeneous BI environments.
· An architecture of architectures.
· Acknowledges the reality of change in organizations and systems that make it difficult to implement a formalized architecture.
· Rationalizes the use of whatever means possible to implement or integrate analytical resources to meet changing needs or business conditions.
· Encourages organizations to share dimensions, facts, rules, definitions, and data wherever possible, however possible.
 

Pros

· Enforces a flexible, enterprise architecture.
· Once built, minimizes the possibility of renegade “independent” data marts.
· Supports other analytical structures in an architected environment, including data mining sets, ODSs, and operational reports.
· Keeps detailed data in normalized form so it can be flexibly re-purposed to meet new and unexpected needs.
· Data warehouse eliminates redundant extracts.
· Focuses on creating user-friendly, flexible data structures.
· Minimizes “back office” operations and redundant data structures to accelerate deployment and reduce cost.
· No drill-through required since atomic data is always stored in the data marts.
· Creates new views by extending existing stars or building new ones within the same logical model.
· Staging area eliminates redundant extracts.
 
· Provides rapid development within an enterprise architecture framework.
· Avoids creation of renegade “independent” data marts.
· Instantiates enterprise model and architecture only when needed and once data marts deliver real value.
· Synchronizes meta data and database models between enterprise and local definitions.
· Backfilled DW eliminates redundant extracts.
 
· Provides a rationale for “band aid” approaches that solve real business problems.
· Alleviates the guilt and stress data warehousing managers might experience by not adhering to formalized architectures.
· Provides pragmatic way to share data and resources.
 

Cons

· Upfront modeling and platform deployment mean the first increments take longer to deploy and cost more.
· Requires building and managing multiple data stores and platforms.
· Difficult to drill through from summary data in marts to detail data in DW.
· Might need to store detail data in data marts anyway.
 
· Few query tools can easily join data across multiple, physically distinct marts.
· Requires groups throughout an organization to consistently use dimensions and facts to ensure a consolidated view.
· Not designed to support operational data stores or operational reporting data structures or processes.
 
· Requires organizations to enforce standard use of entities and rules.
· Backfilling a DW is disruptive, requiring corporate commitment, funding, and application rewrites.
· Few query tools can dynamically query atomic and summary data in different databases.
 
· The approach is not fully articulated.
· With no predefined end-state or architecture in mind, it may give way to unfettered chaos.
· It might encourage rather than reign in independent development and perpetuate the disintegration of standards and controls.
 

Major Proponents

Bill Inmon and co-authors
 
Ralph Kimball and co-authors
 
Many practitioners
 
Doug Hackney
 

Share

submit to reddit
Top