Meta Data – the Backbone of the Data Warehouse Environment
The management of meta data is critical for the successful implementation of a data warehouse program. This is best achieved through a single-point meta-data repository that supports the
gradual spread of informational systems throughout the enterprise on the basis of a uniform pool of information.
To enable the data warehouse to grow and, at the same time, adapt to technology changes, there must be no rigid codification of rules, processes and structures. However, the individual components must
be connected in a sufficiently flexible manner to allow changes and the integration of new components without damaging the integrity of the overall concept.
Meta data management represents the backbone of a data warehouse environment, meaning:
- an integrated and expandable meta data model for all data, processes and tools
- functions for data flow analysis in the data warehouse
- functions to analyze the effects of changes in the data warehouse
- functions for the integration of the meta data of other tools
- configuration and version management in the data warehouse
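As an illustration only, the following Python sketch shows what a minimal, expandable metamodel of this kind might look like. The class and attribute names (MetaElement, sources, lineage, impact) are assumptions chosen for the example rather than a prescribed schema; the essential point is that recording the sources of each element is what makes data flow analysis and impact analysis possible.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass(eq=False)  # identity-based hashing so elements can be collected in sets
class MetaElement:
    """One element known to the repository (table, column, file, report ...)."""
    name: str
    kind: str                 # e.g. 'table', 'column', 'report'
    definition: str = ""      # business meaning of the element
    version: str = "1.0"      # hook for configuration/version management
    sources: List["MetaElement"] = field(default_factory=list)  # lineage links

def lineage(element: MetaElement) -> Set[MetaElement]:
    """Data flow analysis: every element this one is (transitively) derived from."""
    found: Set[MetaElement] = set()
    stack = list(element.sources)
    while stack:
        src = stack.pop()
        if src not in found:
            found.add(src)
            stack.extend(src.sources)
    return found

def impact(changed: MetaElement, repository: List[MetaElement]) -> Set[MetaElement]:
    """Impact analysis: every element affected by a change to `changed`."""
    return {e for e in repository if changed in lineage(e)}

# Example: a report derived from a warehouse table fed by an operational table.
orders_op = MetaElement("ORDERS", "table", "operational order entry table")
orders_dw = MetaElement("DW_ORDERS", "table", "cleansed orders", sources=[orders_op])
report = MetaElement("Monthly revenue", "report", sources=[orders_dw])
print([e.name for e in impact(orders_op, [orders_op, orders_dw, report])])
# -> ['DW_ORDERS', 'Monthly revenue'] (in some order)
```

Integrating the meta data of other tools then amounts to mapping their own descriptions onto such a common model.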
Quality of information in a data warehouse implies that it is directly usable by the user and structured in a user-oriented manner. Information must be made available in a simple, variable,
structured and efficient form. It is also necessary to integrate a host of tools and to keep such integration possible in the future.
The Structure of Meta Data
In order for meta data to have value, its structure must reflect the physical systems it represents. Meaningful criteria are therefore needed to determine the relevant structure.
Criteria for Structuring Meta Data
The architecture of the environment plays a central role in determining and structuring the meta data required for a data warehouse environment. The architecture is determined by:
- the operational systems
- the external data sources
- the data warehouse database(s)
- the data marts
- the tools used
- the considered time frame
- the data warehouse processes.
Origin
The simplest way to structure meta data is based on its origin as seen by the data warehouse. One can differentiate according to the description of the operational systems (and external sources)
and the informational systems (data warehouse).
Meta data has long played a central role in operational systems. It is contained in the database catalog, active dictionaries for 4GLs or standard software packages, data
dictionaries, CASE systems, and data catalogs. Systems outside of the enterprise (information services, Internet) represent external data sources. Meta data within the operational environment can
be determined via analysis tools and parsers or can be derived via interfaces from CASE systems. An inventory of these meta data serves as the basis for the subsequent processes in the data
warehouse.
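By way of illustration, and assuming the operational data already resides in a relational database, such an inventory can be read directly from the database catalog. The sketch below uses Python's built-in sqlite3 module and its catalog views; the two tables are stand-ins for real operational tables.

```python
import sqlite3

# Stand-in for an operational database; in practice this is an existing system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")

# Inventory: read table and column descriptions from the database catalog.
inventory = {}
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
for (table,) in tables:
    # Each catalog row: (cid, name, type, notnull, dflt_value, pk).
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    inventory[table] = [{"column": c[1], "type": c[2], "primary_key": bool(c[5])}
                        for c in columns]

for table, cols in inventory.items():
    print(table, cols)
```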
Similarly, corresponding data arises from the modelling and formation of the informational world. In comparison to the thirty-year tradition of operational systems, the variety of database
technologies, and the employment of standard software, the ‘new world’ of the data warehouse is rather ‘simple’ at first. Essentially, relational systems are being used, which are
supplemented by multi-dimensional or object-oriented systems.
Type of Use
Another organizing scheme for meta data is based on its type of usage within the data warehouse. The objective of meta data management in the data warehouse is to guarantee functionality and to
ensure the continued evolution of the data warehouse. One might speak of a DNA of meta data, analogous to the DNA of human genes. The letters DNA can be used to describe the different functions
of meta data in the data warehouse:
- Definition — unique definition and transparency of data warehouse data
- Navigation — finding and re-using data warehouse elements
- Administration — management of the operation and continued development of the data warehouse
Definition becomes increasingly significant in the data warehouse. Within the operational environment the user is guided by the application; in the data warehouse, the user has the freedom to
choose, combine or reassemble the appropriate data. In addition, there are aggregations, analysis criteria, and calculations. If conflicting results and misinterpretation are to be avoided, an
exact definition of significance, origin, currency (up-to-dateness) and quality of the data is indispensable (Business Library).
Navigation concerns finding all the information on given topics, keywords and structures. This is a new role in the data warehouse environment, enabling a data warehouse user to
immediately generate frequently used reports and analyses. Requesting new information or implementing enhancements, however, requires locating existing information; meta data provides the
navigational aid to determine its availability and location.
Administration is the ‘classical’ function of meta data, involving the management of changes and enhancements to an existing system environment. Functions include impact
analysis, language generation, versioning and archiving.
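These three roles can be pictured as simple operations on a repository entry. The sketch below is purely illustrative: the layout of the entry and the function names (describe, find, history) are assumptions for the example, not a fixed repository interface.

```python
# One illustrative repository entry for a warehouse element.
repository = {
    "DW_ORDERS.amount": {
        "definition": "Order value in EUR, gross, as of the order entry date",  # Definition
        "keywords": ["revenue", "orders", "sales"],                             # Navigation
        "origin": "OLTP ORDERS.amount, loaded nightly",
        "versions": ["1.0 initial load", "1.1 converted to EUR"],               # Administration
    },
}

def describe(name: str) -> str:
    """Definition: exact meaning and origin of an element."""
    entry = repository[name]
    return f"{name}: {entry['definition']} (origin: {entry['origin']})"

def find(keyword: str) -> list:
    """Navigation: locate elements by topic or keyword."""
    return [name for name, entry in repository.items()
            if keyword.lower() in (kw.lower() for kw in entry["keywords"])]

def history(name: str) -> list:
    """Administration: version history as a basis for change management."""
    return repository[name]["versions"]

print(find("revenue"))              # -> ['DW_ORDERS.amount']
print(describe("DW_ORDERS.amount"))
```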
Process
Data warehouse processes allow a detailed structuring of the meta data; their type and shape are therefore essential for the architecture of the metamodel and are largely determined by the chosen
data warehouse architecture. The choice of architecture depends on both the size of the enterprise and the level of maturity of the
data warehouse environment. Figure 1 shows the basic architecture, where the information pool and the aggregation levels can be virtual, central or decentralized.
Figure 1.
In mature data warehouse environments, transforming data from operational systems and external sources to information that is directly usable is a multi-level process. The first step is the
extraction or acquisition of data from the operational databases and file systems, together with the determination of which data elements will form the basis of the information pool. This is followed by
transformation to the data warehouse structures, quality control measures and validation tests. The newly structured information is inserted in the warehouse database. The data can be added
directly to the data store of the data mart (e.g. a multi-dimensional system) or can be stored first in a relational data warehouse database. The latter then serves as a data source for the data
marts to supply the decentralized ‘branches’ of the warehouse with consistent and quality-assured information. The required intermediate aggregation levels can be integrated at different
locations. The meta data is thus structured according to data stores (operational, information pool, aggregation levels) and processes (acquisition, preparation, query).
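The following sketch illustrates this multi-level flow, with each step recording its own process meta data (what ran, when, and on how many records). The step names and the shape of the records are assumptions made for the example.

```python
from datetime import datetime, timezone

process_meta = []  # process meta data: which step ran, when, and on how many records

def log_step(step, records):
    process_meta.append({"step": step, "records": len(records),
                         "at": datetime.now(timezone.utc).isoformat()})
    return records

def extract(operational_rows):
    """Acquisition from operational databases, file systems or external sources."""
    return log_step("extract", list(operational_rows))

def transform(rows):
    """Transformation to warehouse structures with simple quality control."""
    cleaned = [{"customer": r["customer"].strip().upper(), "amount": float(r["amount"])}
               for r in rows if r.get("amount") is not None]  # validation: drop incomplete rows
    return log_step("transform", cleaned)

def load_warehouse(rows, warehouse):
    """Insert into the (relational) information pool."""
    warehouse.extend(rows)
    return log_step("load_warehouse", rows)

def load_mart(warehouse, predicate):
    """Feed a data mart from the quality-assured warehouse, not from the sources."""
    return log_step("load_mart", [r for r in warehouse if predicate(r)])

warehouse = []
source = [{"customer": " acme ", "amount": "100.0"}, {"customer": "Beta", "amount": None}]
load_warehouse(transform(extract(source)), warehouse)
mart = load_mart(warehouse, lambda r: r["amount"] > 50)
print(process_meta)
```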
Subject-Related and Technical Data
The data stores and processes described above provide a basis for structuring the meta data; they are represented by the lowest level of the meta data pyramid in Figure 2. Each of the rectangles
within the pyramid represents a self-contained part of the meta data model with its own meta-types and meta-attributes.
Figure 2.
The structures of the (relational) information pool, at the logical/technical meta level, are described by a structure corresponding to the database catalog, with tables, columns, table spaces,
stored procedures and triggers. However, the database structure for the data warehouse or data mart is not set up directly at the technical level of the database, but rather, at the conceptual
level of an entity relationship model. This model serves as the interface between the subject-related view and the technical level, and is easier for the user to understand.
Technically, the creation of the conceptual data model in the form of an entity relationship diagram, as well as its generation in the database, can be carried out by a CASE tool or directly in the
repository. Integrating the meta data requires transferring both the conceptual model and the database catalog into the central repository. This guarantees consistency between the conceptual
model and the database schema, supports different platforms, and allows integration with other meta data.
In the metamodel, the meta data of the database schema is linked to the meta data of the query tool in order to allow queries and reports about database element usage. It is also linked with the
meta data of its origin in the central warehouse, operational systems, or external sources, in order to allow for later analysis of dataflow and origin.
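A miniature version of such linking can be sketched as two relations in a single-point repository; the table and column names below (meta_column, meta_report_usage) are assumptions for the example. The join shows how element usage and data origin can be answered from one place.

```python
import sqlite3

repo = sqlite3.connect(":memory:")
repo.executescript("""
    CREATE TABLE meta_column (col TEXT PRIMARY KEY, source TEXT);   -- schema meta data + origin
    CREATE TABLE meta_report_usage (report TEXT, col TEXT);         -- query-tool meta data
    INSERT INTO meta_column VALUES ('DW_ORDERS.amount', 'OLTP ORDERS.amount');
    INSERT INTO meta_report_usage VALUES ('Monthly revenue', 'DW_ORDERS.amount');
""")

# Usage and lineage in one query: which reports use which column, and where it comes from.
rows = repo.execute("""
    SELECT u.report, c.col, c.source
    FROM meta_report_usage AS u
    JOIN meta_column AS c ON c.col = u.col
""").fetchall()
print(rows)  # [('Monthly revenue', 'DW_ORDERS.amount', 'OLTP ORDERS.amount')]
```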
In general, the conceptual level will be present both for the information pool and for the data marts. The information pool is a classical, subject-oriented data model with an integrative character
for the enterprise. Structurally, it should grow in the direction of an enterprise-wide data model with a full definition of the attributes. It is not feasible to consider an enterprise-wide model
as the prerequisite for a data warehouse. However, it does make sense to enable a uniform view of all data marts of an enterprise in order not to generate conflicting information. Therefore, the
data model of the data warehouse performs an integrative role with regard to the meaning and structure of enterprise information.
The structure of the data mart at the conceptual level depends on the view of the users and on the query tool being used. In the case of OLAP, one will use a Starjoin/Multijoin or Snowflake schema
as the basic structure.
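For illustration, a minimal star schema might look as follows; the sales subject area and all table names are assumptions made for the example.

```python
import sqlite3

# A minimal star schema: one fact table joined to its dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, day DATE, month TEXT, year INTEGER);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, product_group TEXT);
    CREATE TABLE fact_sales (
        time_id    INTEGER REFERENCES dim_time(time_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        amount     REAL,      -- measures analysed along the dimensions
        quantity   INTEGER
    );
""")
```

In a snowflake variant, attributes such as the product group would be normalized out into further dimension tables.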
The existence of information about operational systems at the conceptual level, or the need for it to be generated, largely depends on the extent of CASE use and the completeness and reliability of
these descriptions in the enterprise. Other meta-information can be described at the logical or conceptual level.
Summary
Meta data is central to the data warehouse environment. Its importance to the end user has grown significantly since the explanatory (Definition) and exploratory (Navigation) roles are no longer
filled by a pre-defined and completely programmed user interface. The user can manipulate and combine data in many ways, but must be able to find all relevant data and understand its exact meaning.
This is possible only if the respective subject-related meta data is made available. A tension arises here: on the one hand, the decentralized character of data marts and their flexible definition
by means of appropriate tools; on the other, the considerable effort and high cost of providing an integrated, high-quality information pool.
Ultimately, only integrated meta data management for the data warehouse and the data marts can provide the end user with information, enable reuse during creation and expansion, support change
management, and ensure the efficient control of the entire data warehouse environment, including the tools.
The warehouse environment contains numerous, diverse elements including operational systems, extraction and transformation tools, data warehouse DBMSs, data models and query tools. In a
conventional environment, each of these diverse elements may have its own meta-data store. The crux of achieving a manageable data-warehouse environment is a single-point meta-data repository.
Thus, four essential success factors for the operation of a data warehouse environment are achieved:
- self-sufficiency of the individual users by supplying them with quality-assured data and meta data for finding and understanding these data
- uniform informational data through uniform use of enterprise-relevant criteria and code numbers
- efficient availability of an information pool
- maintainability of the overall data warehouse environment
In short, the challenges of successful data warehouse/mart implementation stem largely from the numerous and disparate tools required, each with its own isolated meta data store. The
best way to rise to this challenge is a single-source, tool-independent meta data repository, which will allow one to place a homogeneous face on a heterogeneous world.