Published in TDAN.com July 2002
Executive Summary
With the introduction of CRM comes the need to cut costs, improve quality, and speed delivery of warehousing efforts. There is a need to shorten the time it takes to attain ROI while improving
quality and repeatability of these projects. The Corporate Information Factory (CIF) and The Matrix Methodology (TMM) work together to help businesses achieve these goals. The CIF is a logical
architecture whose purpose is to deliver Business Intelligence and Business Management capabilities driven by data provided from Business Operations. TMM provides the steps necessary to implement
the components in the CIF framework.
At the heart of The Matrix Methodology is the Data Vault (see Data Vault Overview article in this issue). This is the next evolutionary step in data modeling. It provides flexibility,
scalability, and the capability to store massive sets of granular information over long periods of time. The Data Vault combines the best practices of 3rd normal form and star schema modeling to
reach the next level of data and information storage. The Data Vault overcomes the traditional difficulties that arise when the original data model must change.
TMM is built around Software Engineering Institute (SEI level 3) concepts such as the Capability Maturity Model (CMM), Key Process Areas (KPA), and Key Process Indicators (KPI). These concepts make it possible to measure the resources applied to pre-defined build processes. Included in the TMM implementation steps (SEI paradigm) are risk assessments, quality checkpoints, peer reviews, versioning, and change management. Function point assessments are also a key component of the implementation effort – making the components of the project plan measurable from a man-hours perspective. Any time the executives feel the project is out of scope, it can be re-evaluated or re-scoped with predictive impact analysis.
Introduction
Welcome to the new era of business intelligence. There are many more challenges today than there ever have been, ranging from integration of massive sets of information to gaining more rapid ROI
out of the systems we deploy. Many of today’s drivers come from CRM efforts and the desire for effective competitive business intelligence. Some of the questions raised by
these efforts include the following:
- How do we increase or even measure ROI from the systems we are installing?
- Do we really understand our competition? Furthermore, is there a better method for competing in the new economies?
- Why do we appear to be outgrowing our existing data warehouses?
Some of the problems that CRM raises in the business intelligence arena come from sheer size: the enormous amount of information that CRM generates begins to cause problems. The scope of the projects involved in building proper warehouse architectures also affects our projects’ success. It is imperative to address these issues in a consistent, repeatable, and scalable fashion. To do so, we must have the proper frameworks and implementation methodologies.
The framework we choose is the Corporate Information Factory (CIF). This provides the architectural blueprint for business intelligence projects needed to handle the enormity of analytical
processing in support of CRM. The Matrix Methodology™ (TMM) supplies the instructions, standards, and processes necessary to implement the architecture successfully. At the heart of TMM is
the Data Vault. The Data Vault is a modeling technique capable of sustaining massive volume and granular data, while remaining extensible and scalable. It is easily adaptable to change, and can be
configured to nearly any new requirements without significantly impacting existing processes, structures, or data.
Industry analysts back up this data growth: according to the MetaGroup, the volume of data in data warehouses is increasing 400% per year, and the warehouses themselves will grow from 3TB to 300TB (terabytes) by 2004. There is now an urgent need for strategic data management processes.
Corporate Information Factory (CIF)
The Corporate Information Factory (CIF) is a logical architecture whose purpose is to deliver Business Intelligence and Business Management capabilities driven by data provided from Business
Operations. The CIF has proven to be a stable and enduring technical architecture for any size enterprise desiring to build strategic and tactical decision support systems. The CIF consists of
producers of data and consumers of information. Figure 1 shows all the components found within the Corporate Information Factory architecture.
The producers in the Corporate Information Factory capture the data (Data Acquisition) from the operational systems and assemble it (Enterprise Data Management) into a usable format (Data Warehouse
or Operational Data Store) for consumption by the business consumers. The CIF consumers acquire the information produced (Data Delivery), manipulate it (Data Marts) and assimilate it into their own
environments (Decision Support Interface or Transaction Interface). To sustain and maintain the CIF, the four Operations and Administration functions must be implemented. These consist of:
- Systems Management – Processes that manage the changes to the core technologies such as new versions of databases, upgrades to software, and installation of new hardware components.
- Data Acquisition Management – Processes that monitor and maintain the programs that capture, integrate, cleanse, transform and load or update data in the data warehouse or operational data store.
- Service Management – Processes that register, prioritize, assign and track the disposition of all requests for service coming from the business community.
- Change Management – Processes that ensure that changes to the entire environment (models, programs, interfaces, etc.) are tracked and handled appropriately.
The last part of the CIF is the Information Workshop, in which all the capabilities and knowledge come together in an easy-to-use, workbench-like environment for the business community. Capabilities
and knowledge become integrated into the business processes for which they were developed. The technological aspects of the CIF fade into the background as these workbenches become prevalent
throughout the business.
We use a simple model that separates the major components of the CIF into two fundamental processes – “Getting Data In” versus “Getting Information Out”. Figure 2
demonstrates the relationship between these two processes as well as the distinct components of the CIF involved in each. (Please see the September 1999 DM Review article, “Are You an Inny or
an Outty”, for more information on these processes.)
The CIF is a proven robust logical architecture for both strategic and tactical decision support. It demonstrates the interaction between the various components and the processes needed to support
it. Each component has a specific function and purpose and, if left out, may cause disruption in the overall efficiency and usability of the architecture.
This architecture can be used for both large and small Business Intelligence implementations. Successful implementation depends heavily upon the design of the heart of BI – the data warehouse component. This design must be adaptable to accommodate the constantly changing BI environment. Yet it must also offer stability in terms of maintaining the data structures already developed and in use. Finally, the data must be stored in a flexible format that can accommodate the various types of analytical capabilities available today – from simple reporting and querying to multidimensional analyses, data mining, and even exploration of the data.
To accomplish these daunting requirements for the data warehouse data model, a new way of thinking and designing is needed. The rest of this paper describes just such a new data design technique – the Data Vault. The Data Vault is the next evolutionary step in data modeling. It provides flexibility, scalability, and the capability to store massive sets of granular information over long periods of time, yet it provides a solid design for smaller implementations as well. The Data Vault combines the best practices of 3rd normal form and star schema modeling to reach the next level of data and information storage. The Data Vault overcomes the traditional difficulties of change through the modeling techniques it applies.
The Matrix Methodology and Data Vault
The Matrix Methodology (TMM) is a powerful and flexible methodology for building successful data migration/integration and business intelligence projects. It is the implementation side of the CIF framework and builds on the set of standards defined by Bill Inmon in his book “Building the Data Warehouse.” The process steps in TMM define how to build each of the components specified in the CIF. It provides the foundation for building projects in a repeatable, reliable, and consistent fashion through the use of Software Engineering Institute (SEI) principles. The SEI principles included in TMM project plans include (but are not limited to) peer reviews, quality checks, and risk assessments. These principles are based on the Key Process Area (KPA), Key Process Indicator (KPI), and Capability Maturity Model (CMM) work within SEI. For those who need definitions, a glossary of terms has been provided at the end of this document.
At the heart of The Matrix Methodology is the Data Vault. It is a data modeling technique that builds on the concepts of a data warehouse, and it is an evolutionary step in data modeling. The evolutionary chain starts with 3rd normal form, a modeling technique capable of handling fully normalized, granular data but difficult to understand and use for analysis. The Star Schema (which followed in the late 80’s) brought multidimensional analytic capabilities, the idea of aggregating information for end users, and faster query times, but was difficult to change or use for other forms of analysis. The Data Vault has the capability to house massive sets of granular data in a smaller, more normalized physical space and provides flexibility and extensibility without high impacts. The Data Vault works just as well for small data warehouses (this will be demonstrated later with an example) as it does for massive ones.
The paradigm has shifted. In the past, 3rd normal form and Star Schemas were the traditional approaches to business intelligence; each had its pros and cons. 3rd normal form offered normalization, linkage, and data lineage – historical views of low-level granular data, linked together to form an entire organization’s historical view. In other words, you could get what you wanted – but it was difficult both to get and to build. Because the entire enterprise had to be organized into a single data structure (in order for the information to make sense), it became a daunting task at best to build – much less deliver to the end user. This paper will explore the evolution of data modeling techniques, arriving now at the Data Vault.
In the beginning, the 3rd normal form model, or entity relationship diagram (ERD), worked its way through data modeling circles. It became the foundation of nearly all the OLTP systems available. This model worked wonderfully to capture transactions from users and to store relationships between transactions (or business information), but it began to have trouble when applied to delta-driven, time-variant, non-volatile information. It became apparent that answering growing business questions from such volumes of information required accessing too many tables and joining too many rows, and the resulting queries were too complex for the RDBMS engine to handle and for business users to understand. The business was beginning to ask for multidimensional analysis capabilities, such as: how many customers purchased product X this day, week, month, or year?
These questions were difficult to answer in a timely fashion because of the normalized data model. It simply wasn’t built for the types of response times now needed in analytical processing. Its inflexibility also reared its head when the business attempted to change business processes or business rules, which then required changing the way the data related to itself. Sometimes the business required additional information about certain activities. Changing a 3NF (3rd normal form) ERD became awkward and difficult. By storing data in 3NF at the detailed level, you have a flexible model, but it is a bear to query or understand from a business perspective.
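To make the join burden concrete, the following is a minimal sketch in Python/SQLite, assuming a small hypothetical normalized schema (customer, invoice, invoice_line, and product tables invented purely for illustration). Even this toy model needs four tables to answer the simple question above; a real enterprise model would need many more.

    # A minimal sketch of the join burden a 3NF model places on a simple
    # business question. The four normalized tables are hypothetical,
    # invented purely for illustration.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE customer     (customer_no TEXT PRIMARY KEY, customer_name TEXT);
    CREATE TABLE invoice      (invoice_no TEXT PRIMARY KEY, customer_no TEXT,
                               invoice_date TEXT);
    CREATE TABLE invoice_line (invoice_no TEXT, line_no INTEGER, product_no TEXT,
                               quantity INTEGER, PRIMARY KEY (invoice_no, line_no));
    CREATE TABLE product      (product_no TEXT PRIMARY KEY, description TEXT);
    """)

    # "How many customers purchased product X this month?" already needs four
    # tables; add campaigns, returns, or organizations and the joins keep growing.
    query = """
    SELECT COUNT(DISTINCT c.customer_no)
    FROM   customer c
    JOIN   invoice i       ON i.customer_no = c.customer_no
    JOIN   invoice_line il ON il.invoice_no = i.invoice_no
    JOIN   product p       ON p.product_no  = il.product_no
    WHERE  p.description = 'Product X'
      AND  i.invoice_date >= date('now', 'start of month')
    """
    print(con.execute(query).fetchone()[0])
    con.close()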
What developed next was the Star Schema. It came with the simplicity and elegance of denormalization, but above all it was able to serve the business need for multidimensional analysis capabilities. The star model works well for rapid delivery of multidimensional information to specific end-user groups. Its strengths are multidimensional analysis and numerical reporting (counts, minimums, maximums, and aggregation levels) for summary reporting. It is capable of answering the question posed above, along with many others. It seemed flexible in the beginning: if a business didn’t like the information it was receiving, it could simply rebuild the star under a new design. However, this proved to be poor practice; ultimately its structure proved suited to a rigid, non-changing format. Another weakness was its poor ability to support statistical analysis.
Some of the problems occurring with the star schema data model include its inability to handle granular levels of detail over long periods of time, store massive sets of information, provide relationship information between business functions, remain flexible for ad hoc queries, and change rapidly as the business does. The Star Schema is capable of delivering highly focused sets of “slice and dice” data to an individual or group of business users. Another weakness lies in its inability to join data together in a way that allows users to ask and answer questions from any business perspective. Finally, the model is very redundant and difficult to change once implemented.
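For contrast, here is a minimal star schema sketch, again with hypothetical table and column names invented for illustration. The question above collapses into a short aggregate query over one fact table, which shows both the star’s strength and how tightly its structure is bound to the questions it was built to answer.

    # A minimal star schema sketch: one fact table surrounded by dimensions.
    # The tables and columns are hypothetical, invented for illustration only.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, calendar_date TEXT,
                               month_name TEXT, year INTEGER);
    CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE fact_sales   (date_key INTEGER, product_key INTEGER,
                               customer_key INTEGER, quantity INTEGER,
                               sales_amount REAL);
    """)

    # The multidimensional question becomes a simple aggregate over the fact table.
    query = """
    SELECT d.month_name, COUNT(DISTINCT f.customer_key) AS customers
    FROM   fact_sales f
    JOIN   dim_date d    ON d.date_key    = f.date_key
    JOIN   dim_product p ON p.product_key = f.product_key
    WHERE  p.description = 'Product X' AND d.year = 2002
    GROUP  BY d.month_name
    """
    for row in con.execute(query):
        print(row)
    con.close()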
Soon the business users changed the question to: now I want to know why they purchased product X and not Y. Was it because of a marketing campaign, a sale price, or brand loyalty? This is where the Data Vault picks up. It is capable of answering these questions, and many more. The Data Vault combines the best of both worlds (3NF and Star Schema) to allow a flexible, rapidly deployable, and linkable data model. It uses the concepts of fully normalized information and parent-child relationships to describe the information, while providing a granular level of detail and time-driven changes similar to a conformed dimension. The Data Vault is a unique hub-and-spoke data modeling technique, thus overcoming the problems previously associated with normalization and with multidimensional or query analysis.
The components of a data vault are Hub, Satellite, and Link tables. There are derivations of the satellite tables to accommodate query issues. For instance, since the satellites are time-driven, nested sub-selects are often required to find the maximum date or a specific instance of history; this doubles the number of tables in the query. The satellite derivation is called a Picture table (or Point in Time). This picture table houses parts of the primary key of each of the satellites, so that equal joins can be performed after a single max sub-select is used. This keeps access fast and the number of tables in the query to a minimum. The hub is a list of business keys, the satellite is similar to a dimension, and link tables are similar to many-to-many relationships in 3rd normal form. Sometimes the hub is mistakenly identified as a factless fact table. This is not so. While it may not contain facts, it does not resemble the fact table in any form. The hub’s function is to provide a singular list of unique business keys; those keys then migrate outward to satellites and link tables, whereas in a fact table the dimension keys migrate in.
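The following is a minimal sketch of these three core structures for a simple customer/invoice subject area, expressed as SQLite DDL from Python. The table and column names are illustrative assumptions, not a prescribed physical standard.

    # A minimal sketch of the three core Data Vault structures described above.
    # Table and column names are illustrative only.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Hub: a singular list of unique business keys, nothing more.
    CREATE TABLE hub_customer (customer_no TEXT PRIMARY KEY);
    CREATE TABLE hub_invoice  (invoice_no  TEXT PRIMARY KEY);

    -- Link: a many-to-many relationship between hubs, as in 3rd normal form.
    CREATE TABLE link_customer_invoice (
        customer_no TEXT NOT NULL REFERENCES hub_customer(customer_no),
        invoice_no  TEXT NOT NULL REFERENCES hub_invoice(invoice_no),
        PRIMARY KEY (customer_no, invoice_no)
    );

    -- Satellite: delta-only, time-variant description of exactly one hub.
    -- The hub's business key migrates outward, paired with a load date stamp.
    CREATE TABLE sat_customer_detail (
        customer_no   TEXT NOT NULL REFERENCES hub_customer(customer_no),
        load_dts      TEXT NOT NULL,          -- when this version was loaded
        customer_name TEXT,
        address       TEXT,
        sign_on_date  TEXT,
        PRIMARY KEY (customer_no, load_dts)
    );
    """)
    con.close()

Note how the business key, not a fact or measure, is the only thing the hub carries; everything descriptive and time-variant moves outward into satellites.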
A satellite is a delta-only, time-variant set of information. A satellite contains the hub’s business key and a date-time stamp indicating when it was built/loaded. The satellite contains descriptive information (including calculated statistics if necessary) about the hub and its business key. It relates to one and only one hub, although foreign keys can be part of satellites for additional reference links. So what is the point of having a satellite? A unique ability is embedded in the modeling technique: a satellite can be broken apart, based on the rates of change and types of information it holds, to form new satellites.
For example: I have a product and a product number. I have quantities, schedules, defect reasons, and descriptions. The product number would become the hub, and the rest of the descriptive information would be housed in a single satellite. Then, upon further analysis, we discover that quantities, schedules, defect reasons, and descriptions all have different rates of change. What we do is split each of these components into its own satellite, again migrating the product key outward. We end up with one hub (product number) and four satellites: quantity, schedule, defect reasons, and description.
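As an illustrative sketch (the load date column and physical names are assumptions, not taken from the article), the split might look like this:

    # Sketch of the product example after the split: one hub (product number)
    # and per-rate-of-change satellites. Columns are illustrative assumptions.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE hub_product (product_no TEXT PRIMARY KEY);   -- the business key

    -- Rapidly changing data lives in its own long, narrow satellite.
    CREATE TABLE sat_product_quantity (
        product_no TEXT NOT NULL REFERENCES hub_product(product_no),
        load_dts   TEXT NOT NULL,
        quantity   INTEGER,
        PRIMARY KEY (product_no, load_dts)
    );

    -- Slowly changing descriptive data stays separate.
    CREATE TABLE sat_product_description (
        product_no  TEXT NOT NULL REFERENCES hub_product(product_no),
        load_dts    TEXT NOT NULL,
        description TEXT,
        PRIMARY KEY (product_no, load_dts)
    );

    -- The schedule and defect-reason satellites follow the same pattern.
    """)
    con.close()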
What is visible now is that the satellites with rapid change (such as quantities) can be long and narrow, so queries are rapid – as long as we know what group of products we are after. The satellite takes on one additional form: point in time, or picture in time. This is a structure that assists queries by enabling equal joins across the different satellites once a time period has been chosen (e.g., today, last month, last year).
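A sketch of such a point-in-time (picture) table and the equal joins it enables follows. It reuses the hypothetical product structures from the previous sketch, and the point-in-time layout shown is one possible interpretation rather than a prescribed design.

    # Sketch of a point-in-time (picture) table and the equal joins it enables.
    # Structures and column names are illustrative assumptions.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE hub_product (product_no TEXT PRIMARY KEY);
    CREATE TABLE sat_product_quantity (
        product_no TEXT, load_dts TEXT, quantity INTEGER,
        PRIMARY KEY (product_no, load_dts));
    CREATE TABLE sat_product_description (
        product_no TEXT, load_dts TEXT, description TEXT,
        PRIMARY KEY (product_no, load_dts));

    -- Point-in-time table: for each snapshot date, the load date of the row
    -- that was current in each satellite at that moment.
    CREATE TABLE pit_product (
        product_no    TEXT,
        snapshot_dts  TEXT,
        qty_load_dts  TEXT,
        desc_load_dts TEXT,
        PRIMARY KEY (product_no, snapshot_dts));
    """)

    # One sub-select picks the snapshot; every satellite is then reached with
    # plain equal joins instead of its own nested max-date sub-select.
    query = """
    SELECT h.product_no, q.quantity, d.description
    FROM   hub_product h
    JOIN   pit_product p             ON p.product_no = h.product_no
    JOIN   sat_product_quantity q    ON q.product_no = p.product_no
                                    AND q.load_dts   = p.qty_load_dts
    JOIN   sat_product_description d ON d.product_no = p.product_no
                                    AND d.load_dts   = p.desc_load_dts
    WHERE  p.snapshot_dts = (SELECT MAX(snapshot_dts) FROM pit_product)
    """
    for row in con.execute(query):
        print(row)
    con.close()

The single MAX sub-select replaces one nested sub-select per satellite, which is what keeps the number of tables in the query to a minimum.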
Figure 3 is a high-level illustration of a data vault model. It is based on a CRM example: marketing or sales campaigns. The business wants to know when products were sold, what products were sold, whether they were sold under certain campaigns, and why one campaign might have been more successful than another. Product, Invoice, Campaign, and Customer are all hubs. Line Items and Campaign Products are link tables; the rest are satellites. The direction of the arrows demonstrates which way the primary keys migrate. Not shown in the figure are the point-in-time satellites, which would normally be centered on the hubs. They were left out to simplify the example.
Additionally, we have provided a sample logical data vault below; we have expanded just the Invoice hub from above to better demonstrate the Data Vault.
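Since the original diagram is not reproduced here, the following is a hypothetical sketch of what such an expansion might look like, using the invoice attributes named in the build steps below (date of invoice, description, days past due, amount of invoice); the line-item link layout is likewise an assumption for illustration.

    # Hypothetical expansion of the Invoice hub: its satellite and its link to
    # Product via line items. Attribute names follow the build-step examples
    # later in the article; everything else is an illustrative assumption.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE hub_invoice (invoice_no TEXT PRIMARY KEY);
    CREATE TABLE hub_product (product_no TEXT PRIMARY KEY);

    -- Satellite describing the invoice business key.
    CREATE TABLE sat_invoice (
        invoice_no     TEXT NOT NULL REFERENCES hub_invoice(invoice_no),
        load_dts       TEXT NOT NULL,
        invoice_date   TEXT,
        description    TEXT,
        days_past_due  INTEGER,
        invoice_amount REAL,
        PRIMARY KEY (invoice_no, load_dts)
    );

    -- Line Items: the link table joining the Invoice and Product hubs.
    CREATE TABLE link_line_item (
        invoice_no TEXT NOT NULL REFERENCES hub_invoice(invoice_no),
        product_no TEXT NOT NULL REFERENCES hub_product(product_no),
        line_no    INTEGER NOT NULL,
        PRIMARY KEY (invoice_no, product_no, line_no)
    );
    """)
    con.close()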
How do I build a data vault?
The major steps to building a data vault are as follows:
- Identify the business functional areas, followed by the unique subject areas within the business units, and finally the individual topic areas.
- Build a hierarchical diagram showing the components of these business areas. Leave enough room in the diagram to place connectors between these business areas.
- Place connectors between the business and topic areas that are of interest or importance to you. For example, if information passes between a CPA and an auditor, describe what the data is and what business function passes this information between these two people. Be sure to describe what the CPA must do to “certify” the information before it reaches the auditor; this is known as the business rule.
- Change the hierarchical diagram to a flow diagram at a topic level, group the different components of business flow by subject area, and group the subject areas by business functional areas (such as marketing, finance, etc.). In other words, you start with business process modeling, but you are identifying the INFORMATION that flows between these business processes, not just the processes and links themselves.
- Take the topic area components in the flow and title those “hubs.” These are the hubs of the hub-and-spoke architecture.
- Decide what makes up the business keys to the information flowing between these components. For instance, if a CPA relies on an account number, a contract number, and perhaps an invoice number, write these all down. If you have more than one particular key identifier, you have several hubs within that topic. That means the topic is really a subject and should be broken down further into elements such as Invoice, Contract, Customer, etc.
- Each of these key identifiers belongs in its own hub. These are the foundational layers of the business and the first step in the data model.
- Take all the surrounding data (such as date of invoice, description of invoice, days past due, amount of invoice), all the things that BELONG to an individual key such as invoice, and group them together as a “satellite.” This is similar to a dimension, but we will change its definition shortly. Draw an arrow from the hub to the satellite, indicating that the hub business key migrates outward.
- Draw connecting lines between related hubs, for example between invoice and customer (part of this business process). Be SURE you only connect the hubs that are connected in your process flow / business flow.
- Repeat the satellite step for customer (items such as customer name, address, description, sign-on date, deleted date, etc.).
Building a Data Vault can be accomplished quickly, as long as the business requirements have been addressed. Not addressed here are the structures of the tables within the data vault. As an exercise, take a long-distance and local phone bill and break it down into a data vault; this will help the thought process of creating data vaults. It is vital to note that this is a business approach to modeling data, and that walking through a data vault allows the business to focus on the relationships between business elements (such as how a customer is related to a bill). These are terms that the business users will help correct in a data vault model walk-through.
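As one possible answer to that exercise (the hubs, links, and satellites below are illustrative assumptions, not the only correct breakdown), a phone bill might be modeled roughly as follows:

    # One possible (illustrative) data vault for the phone bill exercise:
    # account and phone-line business keys become hubs, their relationship a
    # link, and the charges and line details become rate-of-change satellites.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE hub_account    (account_no TEXT PRIMARY KEY);
    CREATE TABLE hub_phone_line (phone_no   TEXT PRIMARY KEY);

    -- An account can carry several lines (local and long-distance service).
    CREATE TABLE link_account_line (
        account_no TEXT NOT NULL REFERENCES hub_account(account_no),
        phone_no   TEXT NOT NULL REFERENCES hub_phone_line(phone_no),
        PRIMARY KEY (account_no, phone_no)
    );

    -- Monthly charges change often: a long, narrow satellite on the account.
    CREATE TABLE sat_account_charges (
        account_no           TEXT NOT NULL REFERENCES hub_account(account_no),
        load_dts             TEXT NOT NULL,
        local_amount         REAL,
        long_distance_amount REAL,
        PRIMARY KEY (account_no, load_dts)
    );

    -- Descriptive detail about each line changes slowly: its own satellite.
    CREATE TABLE sat_line_detail (
        phone_no  TEXT NOT NULL REFERENCES hub_phone_line(phone_no),
        load_dts  TEXT NOT NULL,
        plan_name TEXT,
        PRIMARY KEY (phone_no, load_dts)
    );
    """)
    con.close()

Walking through a structure like this with business users is exactly the kind of model walk-through described above: the discussion centers on how an account relates to a line and a bill, not on the physical tables.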
Conclusion
The CIF provides the best architecture for business intelligence, giving you an easily understood roadmap and thus helping ensure a successful implementation. It has proven to be a stable and flexible technical architecture for any size enterprise desiring to build strategic and tactical decision support systems.
The Data Vault is a highly scalable data model that can be utilized to house massive sets of information in a granular format. The Data Vault architecture is the next step in the evolutionary chain
of data modeling for a data warehouse. It plugs into The Matrix Methodology as a well-integrated component. The Matrix Methodology offers all the steps necessary to implement high quality data
warehouses. The steps in TMM conform to SEI standards and concepts by utilizing such ideas as: peer reviews, quality checkpoints, customer signoff, common standards, and risk assessments.
Through these concepts, a company can implement CMM, KPAs, and KPIs for business intelligence projects. TMM rolls up under the CIF framework for a complete solution.
Glossary
3NF (3rd Normal Form) = Is a method of representing information (data) in a mostly non-redundant format. It is a form in which information is separated into parent-child relationships and mapped to a physical database to comprise a storage system.
Change Management = Is the set of processes coordinating modifications to the Corporate Information Factory.
CIF Data Management = is the set of processes that protect the integrity and continuity of the data within and across the data warehouse and operational data store. It may employ a
staging area for cleansing and synchronizing data.
Data Acquisition = Is the set of processes that capture, integrate, transform, cleanse, and load source data into the data warehouse and operational data store.
Data Acquisition Management = Is the set of processes that manage and maintain the processes used to capture source data and prepare it for loading into the data warehouse or operational data store.
Data Delivery = Is the set of processes that enables end users and their supporting IT groups to filter, format, and deliver data to data marts and oper-marts.
Data Mining Warehouse = Is a data mart for analysts to test their hypotheses, assertions, and assumptions. Specialized tools containing intelligent agents and statistical analysis
capabilities are used to perform these tasks.
Data Vault (DV) = Is a data modeling technique that combines the best of both 3rd normal form and star schema modeling to provide an expansive, extensible, yet solid foundation on which to construct a data warehouse environment. The Data Vault provides the implementation guidelines and modeling techniques necessary to store massive sets of information while remaining flexible enough to extend into tomorrow’s needs.
Data Warehouse = is a subject-oriented, integrated, time-variant, non-volatile collection of data used to support the strategic decision-making process for the enterprise.
Decision Support Interface = Is an easy-to-use, intuitive tool to enable end user capabilities such as exploration, data mining, OLAP, query, and reporting to distill information
from data.
ERD (Entity Relationship Diagram) = Is a 3rd normal form model structured to house normalized information. The information is related to other components through links and parent-child relationships. It is the traditional data modeling form used for OLTP systems and was one of the starting points for data warehouse models.
Exploration Warehouse = Is a data mart whose purpose is to provide a safe haven for exploratory and ad hoc processing. An exploration warehouse may utilize specialized technologies
to provide fast response times with the ability to access the entire database.
External Data = Is any data outside the normal data collected through an enterprise’s internal applications. Generally external data, such as demographic, credit, competitor, and financial information, is purchased by the enterprise from a vendor of such information.
Information Feedback = is the set of processes that transmit the intelligence gained through the usage of the Corporate Information Factory to appropriate data stores.
Information Workshop = Is the set of the facilities that optimize use of the Corporate Information Factory by organizing its capabilities and knowledge, and then assimilating them
into the business process.
KPA (Key Process Area) = Is an area of any business that is considered critical to operation. KPAs are defined as part of business improvement or business re-engineering efforts. Further definitions can be found under SEI at Carnegie Mellon University.
KPI (Key Process Indicator) = Is a method of measuring (a metric) the business process selected or identified by a KPA. These metrics can then be gathered continually in order to assess the impact of changes to that component of the business. KPIs are also defined by SEI at Carnegie Mellon University.
Library and Tool Box = Is the collection of meta data and capabilities that provides information to effectively use and administer the Corporate Information Factory. The library
provides the medium from which knowledge is enriched. The toolbox is a vehicle for organizing, locating, and accessing capabilities.
Meta Data Management = is the set of processes for managing the information needed to promote data legibility, use, and administration.
OLAP (online analytical processing) Data Mart = Is aggregated and/or summarized data that is derived from the data warehouse and tailored to support the multidimensional
requirements of a given business unit or business function.
OLTP (On-line transaction processing) = Is the act of capturing and storing granular levels of information in a relational database system. It is the process by which this
information is taken from an input source and placed in the appropriate structure within the database. The elements in the information are typically time and date stamped, and are called
transactions.
Oper-Mart = Is a subset of data derived from the operational data store, used in tactical analysis and usually stored in a multidimensional manner (star schema or hypercube). Oper-marts may be created temporarily and dismantled when no longer needed.
Operation and Administration = Is the set of activities required to ensure smooth daily operations, to ensure that resources are optimized, and to ensure that growth is managed.
Operational Data Store = Is a subject-oriented, integrated, current, volatile collection of data used to support the tactical decision-making process for the enterprise. It is the
central point of data integration for business management, delivering a common view of enterprise data.
Operational Systems = Are the internal and external core systems that run the day-to-day business operations. They are accessed through application program interfaces (APIs) and
are the source of data for the data warehouse and operational data store.
SEI (Software Engineering Institute) = Is the institute dedicated to improving software development processes. Sponsored by Carnegie Mellon, it provides templates and definitions followed around the world that make software engineering easier, more repeatable, and more reliable. There are 5 maturity levels, and with each increasing level comes an increasing level of success and repeatability.
Service Management = Is the set of processes for promoting user satisfaction and productivity within the Corporate Information Factory. It includes processes that manage and
maintain service level agreements, requests for change, user communications, and the data delivery mechanisms.
Systems Management = Is the set of processes for maintaining, versioning, and upgrading the core technology on which the data, software, and tools operate.
Transactional Interface = Is an easy-to-use and intuitive interface for the end user to access and manipulate data in the operational data store.
TMM (The Matrix Methodology) = Is a set of steps that describe how to build each of the components contained within the Corporate Information Factory. These steps (process steps)
describe the order, dependency, and implementation methods necessary to build different pieces in a successful, reliable, and repeatable fashion.
Workbench = Is a strategic mechanism for automating the integration of capabilities and knowledge into the business process.