Publisher’s Note: This article is being published in association with DebTech’s Meta-Data and Data Modeling Summit in February 2006 in Orlando, Florida.
© InfoModel 2005
The purpose of this article is to define the process for converting a logical data model to a physical data model, especially in a warehouse environment.
Before discussing the specific methods for optimizing a data warehouse data model, let us first review the overall process for developing a logical data model and eventually building it as a physical, optimized database. The following procedures can be used successfully to transform a Logical Analytical Data Model into a physical database, such as a data warehouse or a data mart. By following these procedures, one can ensure that:
- The data model has integrity (e.g., that business rules have been honored),
- The process has integrity (i.e., that it accounts for all the necessary steps and feedback loops), and
- The overall process is pragmatic (e.g., that it addresses issues of query, report and load performance).

This article describes a process that is generic enough to be used as a guide for different projects.
Overview
Let us start with an overview of the overall process for creating logical and physical data models. Here are the major steps in the transformation of the model:
- The business authorization to proceed is received.
- Business requirements are gathered and represented in a logical data model, which will completely represent the business data requirements and will be non-redundant.
- The logical model is transformed into a first-cut physical model by applying several simple modifications, such as splitting a large table or combining entities in a 1:1 relationship.
- The logical model is then transformed into a second-cut physical model by iteratively applying three levels of optimizations or compromises. The outcome of this is a physical database design:
  - Apply safe compromises to the model, such as splitting a table or combining two tables.
  - Apply aggressive compromises to the model, such as adding redundant data.
  - Apply technical optimizations to the model, such as indices, referential integrity or partitioning.
- The physical database design is then converted to a physical structure by generating or writing the DDL and installing the database.
- Steps 4 and 5 are performed iteratively so that the database can be tested before going into production. Sometimes aggressive compromises are made precisely so that their effects can be properly tested; aggressive compromising is an iterative process. If views are used up front, the changes may even be transparent to the application teams and business users.
Analysis Phase – Develop a Logical Data Model
This step is obviously the crucial starting point. There is no substitute for business knowledge. Any team that does not know the business has no business building the logical data model. The key characteristics of this model are that it:
- Must declare the grain.
- Should be atomic to the appropriate level.
- Should fully satisfy user requirements.
- Should be non-redundant.
- Should be independent of technology, implementation and organizational constraints.
- Should also contain any external data that is essential to the business.
Next we explain these points in slightly more detail.
- The Grain. The DW establishes the granularity of the data. It is generally advisable to retain as detailed a grain as possible to ensure the flexibility and power of the DW.
- Atomic. Any query that can be asked should be capable of being satisfied by this model. It should satisfy these queries effectively, although it may not necessarily satisfy all types of queries efficiently. It should be only as atomic as the queries and usage of the data dictate. The atomic data in a DW consists of analytical data. The level of detail is the most detailed grain that the analytical environment will require.
- Non-redundant. A logical model should be normalized (to at least third normal form). This runs contrary to what lots of people say, but remember, normalization does not presuppose the granularity of the data. Normalization does not require that the model be reduced below the level necessary to satisfy the query and reporting needs of the user. Later in design, the logical model will be optimized, and, if necessary, denormalized. It is best to separate the search for business requirements from the concerns over designing an optimized structure.
- Independent of technology and organization. Certainly the logical model should be independent of technology, and thereby implementable on any technology. It is also important for it to be independent of organizational structure. An analytical or decision support model that is tied to a given organizational structure will have a very short life. This is so because organizational structures, and more so reporting and roll-up structures, can and will change repeatedly. The best way to prepare the model for change is to design it to accommodate organizational changes from the beginning.
- External data. Analytical models can use vast amounts of purchased data, such as demographic data, household data, and customer data. This needs to be considered when designing the data structure.
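To make the ideas of a declared grain and atomic detail concrete, here is a minimal sketch expressed in SQL for concreteness. The SALES_FACT table and its columns are purely illustrative assumptions, not part of any particular model discussed in this article; the point is simply that the grain (one row per order line) is stated explicitly and kept as detailed as the analytical environment requires.

```sql
-- Hypothetical fact table; the declared grain is one row per order line.
CREATE TABLE sales_fact (
    order_id      INTEGER       NOT NULL,
    order_line_no INTEGER       NOT NULL,
    customer_id   INTEGER       NOT NULL,
    product_id    INTEGER       NOT NULL,
    sale_date     DATE          NOT NULL,
    quantity      DECIMAL(9,2)  NOT NULL,
    sales_amount  DECIMAL(11,2) NOT NULL,
    PRIMARY KEY (order_id, order_line_no)
);
```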
Information Gathering
The most critical task in the development of the data model is requirements gathering. There are three general ways to approach model expansion:
- Top-down,
- Inside-out and
- Bottom-up.
Top-down expansion essentially starts with a high level model to represent the breadth of data in the DW. This model is rough cut and coarse but represents the major subject areas of data that the DW must support. Top-down expansion starts with this rough-cut model and progressively enlarges it by adding data elements to it.
Inside-out expansion starts with an initial high-level model, which expands by adding details. For example, we can start with Customer and a few attributes, and eventually populate it with many attributes.
Bottom-up expansion proceeds by collecting views or use cases and incorporating their data into the model. A typical view could be as simple as a query. A typical query could be: “How much volume did we sell last month (and this year to date) versus a year ago last month (and last year to date)?”
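As a rough illustration, the query above might be expressed as follows against the hypothetical sales_fact table sketched earlier; the date ranges are placeholders that assume a reporting month of May 2005.

```sql
-- Volume for last month and this year to date, versus the same periods a year ago.
-- Date ranges are hard-coded placeholders for an assumed reporting month of May 2005.
SELECT SUM(CASE WHEN sale_date BETWEEN DATE '2005-04-01' AND DATE '2005-04-30'
                THEN quantity ELSE 0 END) AS last_month,
       SUM(CASE WHEN sale_date BETWEEN DATE '2005-01-01' AND DATE '2005-04-30'
                THEN quantity ELSE 0 END) AS year_to_date,
       SUM(CASE WHEN sale_date BETWEEN DATE '2004-04-01' AND DATE '2004-04-30'
                THEN quantity ELSE 0 END) AS last_month_prior_year,
       SUM(CASE WHEN sale_date BETWEEN DATE '2004-01-01' AND DATE '2004-04-30'
                THEN quantity ELSE 0 END) AS prior_year_to_date
FROM sales_fact;
```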
It is generally best to start the model in an inside-out fashion and to expand the data model independently of specific views. This encourages creativity and thinking out of the box. Once this is done, it is then appropriate to identify queries and merge the results of these into the model. A good way is to identify the top 10 queries. (It does not have to be 10; it can be any number – 15 or 20, or even more.) These queries can then be used to crosscheck and expand the model.
We have found that when you start modeling by focusing exclusively or heavily on user queries, the result is a model that has more limited usage and thereby a more limited life. A model produced this way satisfies those queries but runs into trouble when the kind of query or the granularity of data in the queries changes.
There are several vehicles to use for information gathering:
- user interviews,
- facilitated sessions,
- examination of current data and queries and
- direct observation.
User interviews work well to get an overall sense of the business background and to provide detailed feedback. Facilitated sessions are useful for generating a synergistic and creative answer to a problem. These are mostly inside-out (or top-down) methods of development. Examination of current data and queries is a bottom-up method and is good for providing details and confirming the model. Direct observation of business people involved in business processes can provide invaluable insights.
It’s important to go beyond external observation. In a DW it is also important to understand the work that people do and how they do it; this often goes beyond queries. For example, say a marketing department has been producing a set of 35 queries and reports for over a year. If the model is designed specifically to support those queries, then the model will stop working as the query needs change. We have found this to be true: a typical marketing group will change its reporting and analytical requirements frequently. It is important to consider this wider range of requirements when designing the model. Flexibility is essential.
Scenarios
Queries and reports can be used to verify that the database will support them. From these it is best to identify a set of key user scenarios that can be passed against the model. A scenario is a sequence of activities that respond to a major business event. A major business event is something of importance that happens to the business; usually, but not always, it is initiated by some agent outside of the business area. A scenario is actually a test case, but a business test case. It consists of two parts: the definition of the business case and a script of values. The purpose of scenarios is to test the correctness and completeness of the model. A scenario should be defined for each major business event. Some people call scenarios Use Cases.
A scenario could be as simple as a sample report from last month. It could be more complex, such as: “IBM just made a big announcement. Can the model tell us which of our customers own IBM stock and, of them, which should we call?” (A sketch of such a query follows the examples below.) Other typical examples could be:
- a breakdown of lapsed and surrendered policies in a given month,
- a complete or partial transfer of an investment account.
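Here is a minimal sketch of the IBM scenario mentioned above. The customer, account_holding and security tables, and the call rule (positions over 100,000), are hypothetical assumptions introduced only for illustration.

```sql
-- Which customers own IBM stock, and which of them should we call first?
-- Assumes hypothetical customer, account_holding and security tables.
SELECT c.customer_id,
       c.customer_name,
       SUM(h.quantity * h.unit_cost) AS position_value
FROM customer c
JOIN account_holding h ON h.customer_id = c.customer_id
JOIN security s        ON s.security_id = h.security_id
WHERE s.ticker_symbol = 'IBM'
GROUP BY c.customer_id, c.customer_name
-- Illustrative call rule: prioritize positions worth more than 100,000.
HAVING SUM(h.quantity * h.unit_cost) > 100000
ORDER BY position_value DESC;
```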
How to Use Scenarios
The simplest way to use scenarios is to manually walk them through the model. This is easiest to do in a facilitated session. The facilitator takes the model and the scenario. Each activity, data element and sample value in a scenario is compared to the model to ensure that the model is complete and correct. One wants to ensure the data is there and that the model will allow one to navigate properly. Usually, the first 3 or 4 scenarios disrupt the model. This trauma is essential for the health of the model.
Using scenarios and queries with actual sample values validates the data model, ensures that the model satisfies the user needs, guarantees model completeness and simplifies database changes.
It is important to collect whatever volumetric information you can. Volumetric information includes the number of occurrences of data, the frequency of query execution, the ratios between tables, and so on. These factors substantially influence the physical design and should be a part of the process for gathering business requirements.
Finalization of the Logical Model
Two important additions need to be made to the logical model before it is complete:
- Completion of definitions of all entities and attributes. In a data warehouse environment, this is especially critical. Users will need to understand the data in order to use it. Do not underestimate the magnitude of this task. Do yourself a favor and compile the definitions as the model grows.
- Addition of domain information. A domain is the set of physical characteristics that one or more columns can have in common, such as length and data type.
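As a small example of domain information carried into the physical design, the sketch below assumes a DBMS that supports SQL domains (PostgreSQL, for instance); products without domains achieve the same effect by applying the agreed data type, length and rules consistently to every column in the domain. The domain, table and column names are illustrative only.

```sql
-- A hypothetical monetary domain: every monetary column shares the same
-- data type, precision and rule.
CREATE DOMAIN money_amount AS DECIMAL(11,2)
    CHECK (VALUE >= 0);

CREATE TABLE premium_payment (
    payment_id     INTEGER      NOT NULL PRIMARY KEY,
    policy_id      INTEGER      NOT NULL,
    payment_amount money_amount NOT NULL
);
```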
In the next step, do not neglect to forward engineer the definitions to the physical model.
Design Phase – First Cut Physical Model
Before doing extensive usage analysis, it is possible and even advisable to construct a first-cut physical model. This can be done quickly by applying the following modifications to the data model.
The following steps represent what is considered the first-cut physical model:
- Ensure physical column characteristics. Use domains or assign data type, length, nullability, optionality, etc.
- Ensure proper physical database object names are used. Follow a standard (enforced by the CASE tool) for physical names.
- Add necessary constraints and business rules. Take the example of an Order Item that must belong to an Order, and an Order that must have at least one Order Item. The first is enforced by referential integrity (RI); the second requires a trigger or stored procedure. (A DDL sketch follows this list.)
- Resolve implementation of subtypes. Decide how the entire hierarchy will be implemented.
- Perform any obvious denormalization or other safe compromises. Safe compromises are those trade-offs that do not compromise the integrity and non-redundancy of the data, e.g., splitting or partitioning a table based on usage. Storing the mailing address in both the Customer and the Customer Address tables is an example.
- Adjust keys. Ensure keys truly provide uniqueness and stability (even considering anomalies over time, such as Order Numbers getting reused). Add surrogate keys as justified.
- Add indexes on primary keys and major foreign keys only. Introduce indexing sparingly at first to determine the raw capabilities of the design.
- Add any production objects. Examples of these are transient tables and processing tables/columns (e.g., batch control tables).
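As referenced above, a minimal DDL sketch of the Order / Order Item rule follows. The table layouts are hypothetical and the trigger uses PostgreSQL-style syntax; other DBMSs would use their own trigger or stored procedure mechanism, and insert-time enforcement of the "at least one item" rule is typically handled in the load procedure.

```sql
-- Hypothetical Order and Order Item tables (PostgreSQL-style syntax).
CREATE TABLE customer_order (
    order_id   INTEGER NOT NULL PRIMARY KEY,
    order_date DATE    NOT NULL
);

CREATE TABLE order_item (
    order_id      INTEGER      NOT NULL,
    order_line_no INTEGER      NOT NULL,
    product_id    INTEGER      NOT NULL,
    quantity      DECIMAL(9,2) NOT NULL,
    PRIMARY KEY (order_id, order_line_no),
    -- Rule 1: an Order Item must belong to an Order (plain referential integrity).
    FOREIGN KEY (order_id) REFERENCES customer_order (order_id)
);

-- Rule 2: an Order must have at least one Order Item. Referential integrity
-- cannot express this, so a trigger (or stored procedure) is needed; this one
-- blocks deletion of the last remaining item of an order.
CREATE FUNCTION prevent_empty_order() RETURNS trigger AS $$
BEGIN
    IF NOT EXISTS (SELECT 1
                   FROM order_item
                   WHERE order_id = OLD.order_id
                     AND order_line_no <> OLD.order_line_no) THEN
        RAISE EXCEPTION 'An order must retain at least one order item';
    END IF;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_item_not_last
    BEFORE DELETE ON order_item
    FOR EACH ROW EXECUTE FUNCTION prevent_empty_order();
```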
The process for creating the first-cut physical model should be rapid. The point is to get a neutral model up so that real testing can be done.
Forward Engineer Logical Model to a Physical Database Design
Using these scenarios, the entire data model should be reviewed table by table and column by column to confirm what is really required, what could be eliminated, which data elements should be combined and which should be broken into smaller tables to improve ease of use and reduce query and report response time. For example, it is conceivable to reduce the number of tables from 100 to 60. It is also possible to eliminate unnecessary columns when many of the remaining tables are code tables, used only to look up descriptions. Consequently, the number of actual fact tables and dimensions could be significantly reduced. Also, information gathered during this activity is useful in determining index deployment, load balancing across devices and optimal data clustering. All optimization changes based on discoveries made during this process should be implemented in the physical model. Any business rule changes or additional columns should be incorporated into the logical model and then forward engineered again.
The forward engineering process from a logical model to a first-cut physical database can be automated using one of many tools, such as Visible Systems’ Visible Advantage, CA’s Erwin, Sybase’s PowerDesigner, Silverrun and others. It is even possible to develop the logical model using one CASE product and then bus it into another toolset for physical design, for example, doing the logical model in Erwin or Visible Analyst and the physical design in PowerDesigner or Platinum; different components are more robust or rich in different toolsets. In this case, a product such as the Reichmann Toolbus product can be used to convert the logical entity relationship diagram to the physical format. Using a single toolset, however, will minimize data movement errors, save time and cost, and reduce the possibility that the logical and physical models will get out of synch.
At this point, the physical modeling tool is used to forward engineer the logical model into the physical design. Initially, it may be best to accept most defaults to automate the creation of the first-cut physical design. An exception to this could be the conversion of entity and attribute names to table and column names. Many entity and attribute names may exceed the length limits imposed by database management systems; in this case, a manual effort will have to be undertaken during forward engineering to ensure that standard abbreviations are assigned where necessary. The forward engineering process should convert the names according to a physical naming standard. The forward engineering process can be accomplished without such tools, but the level of manual intervention becomes more significant.
Remember to forward engineer all relevant data, including definitions. A data warehouse is useless without these.
Design Data Model Optimization
There are three general ways to optimize a database:
- get better hardware (or take better advantage of it),
- get better software (or take better advantage of it), or
- optimize the data structure itself.
Before the database can be generated, a variety of optimizations need to be evaluated for their applicability to the logical model. These optimizations are actually trade-offs. A trade-off emphasizes one feature, which is considered to be an advantage, while compromising another, which is considered to be less important. These trade-offs are of two types: safe trade-offs and aggressive trade-offs.
Safe Trade-Offs
Safe trade-offs do not compromise the integrity of the data. They are exemplified by the following model changes:
- Collapse some or all subtypes into a supertype (a sketch of this follows the list),
- Collapse a supertype into one or more subtypes,
- Split one entity into multiple tables,
- Merge multiple entities into a single table,
- Collapse certain code tables,
- Convert subtypes to tables (there are various ways) and
- Violate first normal form, among others.
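As a minimal sketch of the first trade-off in the list, hypothetical Individual Customer and Corporate Customer subtypes are collapsed into a single Customer table; the table and column names are illustrative only.

```sql
-- Subtypes collapsed into the supertype: one table with a type discriminator
-- and nullable subtype-specific columns.
CREATE TABLE customer (
    customer_id        INTEGER      NOT NULL PRIMARY KEY,
    customer_type_code CHAR(1)      NOT NULL,   -- 'I' = individual, 'C' = corporate
    customer_name      VARCHAR(100) NOT NULL,
    birth_date         DATE,                    -- individual customers only
    duns_number        CHAR(9),                 -- corporate customers only
    CHECK (customer_type_code IN ('I', 'C'))
);
```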
Aggressive Trade-Offs
Aggressive trade-offs compromise the integrity of the model. They are exemplified by several main modifications:
- Storing derived data,
- Adding redundant data or relationships and
- Adding surrogate keys.
Some other very aggressive trade-offs are possible, such as the use of hybrid data structures, though in practice these are rare. A hybrid structure is a column that holds an embedded, non-relational data structure, such as an array, and this internal structure is usually variable. To an RDBMS it is simply a long string column. Exceedingly aggressive trade-offs like hybrid data require extraordinary justification. Some data mining applications, such as SAS, might take advantage of such a hybrid structure. Often, these structures are not part of the permanent database but are retained separately.
Some form of load or data usage analysis should justify aggressive trade-offs. These types of modifications to the data model represent thoughtful changes that require careful study of the data and its relationships.
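As a minimal sketch of storing derived data, assume the monthly totals below could always be recomputed from the hypothetical sales_fact table introduced earlier; the summary table is redundant by design, and usage analysis should confirm that the saved query cost justifies it.

```sql
-- Derived (redundant) data: monthly sales totals pre-computed from sales_fact.
CREATE TABLE monthly_sales_summary (
    sale_year      INTEGER       NOT NULL,
    sale_month     INTEGER       NOT NULL,
    product_id     INTEGER       NOT NULL,
    total_quantity DECIMAL(13,2) NOT NULL,
    total_amount   DECIMAL(15,2) NOT NULL,
    PRIMARY KEY (sale_year, sale_month, product_id)
);

-- The load process (or a periodic refresh) populates the derived rows.
INSERT INTO monthly_sales_summary (sale_year, sale_month, product_id,
                                   total_quantity, total_amount)
SELECT EXTRACT(YEAR FROM sale_date),
       EXTRACT(MONTH FROM sale_date),
       product_id,
       SUM(quantity),
       SUM(sales_amount)
FROM sales_fact
GROUP BY EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date), product_id;
```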
Technology Choices and Trade-offs
Here is where choices like the use of indices and referential integrity are addressed.
Once the above optimizations are performed, the next activity is a basic clean up of the forward-engineered physical design. Some very substantial choices have to be made at this stage, such as alternate keys and indices. This step also includes defining devices, segments in Sybase (or tablespaces in Oracle or DB2), a database and a server name in the physical design tool. Also, a verification of domain and dimension information (i.e., column size and alphanumeric requirements) should be performed.
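As a brief illustration of these choices, the sketch below uses Oracle-style syntax to place foreign-key indexes on the hypothetical sales_fact table into a named tablespace; Sybase segments serve the same purpose, and the tablespace and index names are assumptions.

```sql
-- Indexes on major foreign keys of the fact table, each placed in a
-- dedicated index tablespace (Oracle-style syntax; names are illustrative).
CREATE INDEX sales_fact_customer_ix
    ON sales_fact (customer_id)
    TABLESPACE dw_index_01;

CREATE INDEX sales_fact_product_ix
    ON sales_fact (product_id)
    TABLESPACE dw_index_01;
```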
Volume Determination and Database Sizing
Volume determination and database sizing is facilitated by determining the best source for data knowledge and getting a best estimate for each table. The numbers can be plugged into a tool, such as Platinum Desktop DBA, that can generate space requirements, or into a spreadsheet designed to perform the same estimations, or the calculations can be done manually. It is important to pad your estimates in case actual row counts exceed what was anticipated. An additional 20 percent when sizing each table is reasonable.
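As a quick, hypothetical illustration of the arithmetic: a fact table expected to hold 10 million rows at an average of 200 bytes per row works out to roughly 2 GB of raw data (10,000,000 × 200 bytes), and adding the 20 percent pad brings the planning figure to about 2.4 GB, before indexes and DBMS overhead.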
Determine Hardware Requirements
Disk requirements are determined by volume estimates and system software disk utilization requirements. Aspects such as number of processors and amount of memory are best determined by hardware and
database vendors based on information determined in earlier steps combined with detailed user information (i.e., number of users, queries, reports, etc.). Once the hardware requirements are determined, hardware and system software can be ordered, installed and configured. For example, in one situation an IBM UDB database could be installed on a parallel platform, or a SQL Server database could be deployed on a Sun server. However, the above rules remain applicable for most platforms.
Generate the Physical Database
After all previous information has been gathered and entered into the physical design tool, a physical database definition (DDL for DB2, schema for Sybase or Oracle) can be generated. This must be reviewed carefully and in all likelihood modified to some degree, since no physical design tool generates 100 percent perfect database definitions. The script can then be run against the database management system to define the physical environment.
Subsequent Steps
While the completion of the above step signifies the end of the logical model to physical database transformation, the task of delivering a production database includes several additional activities that could result in modifications. Issues like load file content, new query requirements, additional data requirements or new user requirements can necessitate database modifications. It is recommended that all database changes be implemented through a physical design tool, as this provides a central, authoritative source for the physical database environment and allows the generation of scripts that can recreate the database server if necessary.