Different Types of Data Models
One of the things you often find people arguing about is what a data model is, and what it is for. Here’s one of the secrets of analysis: when you find people arguing passionately about something, try to discover why they are both right. So it is with data models. Data models have many purposes. These cause differences in both style and content, which can cause confusion, surprise, and disagreement. This section looks at some different types of data models (I do not claim necessarily to have exhausted the possibilities) and how their purposes might lead them to differ for nominally the same scope. A particular data model may be of more than one of the types identified.
Physical Data Model
A physical data model represents the actual structure of a database—tables and columns, or the messages sent between computer processes. Here the entity types usually represent tables, and the relationship type lines represent the foreign keys between tables. The data model’s structure will often be tuned to the particular needs of the processes that operate on the data to ensure adequate performance. It will typically include:
Restrictions on the data that can be held
Denormalization to improve performance of specific queries
Referential integrity rules to implement relationship types
Rules and derived data that are relevant to the processes of the application(s) the physical data model serves
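As a minimal sketch of the features listed above, the following Python snippet uses the standard-library sqlite3 module to build a small physical data model. The table and column names (customer, customer_order, customer_name_copy) are assumptions made for illustration only and do not come from any particular system.

```python
import sqlite3

# In-memory database; every table and column name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id           INTEGER PRIMARY KEY,
    -- Referential integrity rule implementing the relationship type
    customer_id        INTEGER NOT NULL REFERENCES customer(customer_id),
    -- Restriction on the data that can be held
    quantity           INTEGER NOT NULL CHECK (quantity > 0),
    -- Denormalized copy of customer.name, avoiding a join in a frequent query
    customer_name_copy TEXT NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd')")
conn.execute("INSERT INTO customer_order VALUES (10, 1, 5, 'Acme Ltd')")


def try_insert(row):
    """Return True if the order row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO customer_order VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


rejected_fk = not try_insert((11, 99, 1, "Nobody"))       # there is no customer 99
rejected_check = not try_insert((12, 1, 0, "Acme Ltd"))   # quantity must be positive
```

The denormalized column buys a cheaper query at the price of having to keep two copies of the name consistent, which is exactly the kind of tuning that distinguishes a physical data model from a normalized one.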
Logical Data Model
There is a range of views on what a logical data model is. So I will start by talking about how I see them and then mention the divergences that I have noticed.
A logical data model is a fully attributed data model that is fully normalized. Fully attributed means that the entity types have all the attributes and relationship types for all the data that is required by the application(s) it serves. It may include:
Restrictions on the data that can be held
Rules and derived data that are relevant to the processes of the application(s) the logical data model serves
The main difference I see from this in practice is that many data models that are described as logical actually have some level of denormalization in them, particularly where change over time is involved.
A logical data model might relate to a physical data model, but this is not the only possibility. For example, with a software application, it would be quite appropriate for a logical data model to be developed of the user view of the application through the screens, and/or the computer interfaces to and from the application. This might be considerably less flexible than the underlying database, with restrictions imposed either by the application code, or by the configuration of the application.
It should be clear from this description that a physical data model can also be a logical data model, provided it does not include any denormalizations.
Conceptual Data Model
As with logical data models, there are some differing opinions about what a conceptual data model is. So again, I will state the way that I understand the term and then identify some key variations I have noticed.
A conceptual data model is a model of the things in the business and the relationships among them, rather than a model of the data about those things. So in a conceptual data model, when you see an entity type called car, then you should think about pieces of metal with engines, not records in databases. As a result, conceptual data models usually have few, if any, attributes. What would often be attributes may well be treated as entity types or relationship types in their own right, and where information is considered, it is considered as an object in its own right, rather than as being necessarily about something else. A conceptual data model may still be sufficiently attributed to be fully instantiable, though usually in a somewhat generic way.
Variations in view seem to focus on the level of attribution and therefore whether or not a conceptual data model is instantiable.
A conceptual data model might include some rules, but it would not place limits on the data that can be held about something (whether or not it was instantiable) or include derived data.
The result of this is that it is possible for a conceptual data model and a logical data model to be very similar, or even the same for the same subject area, depending on the approach that is taken with each.
Canonical Data Model
In the context of data models, a canonical data model means a data model that is fully normalized and in which no derived data is held. So a logical data model might or might not also be a canonical data model.
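To make the condition that no derived data is held more concrete, here is a minimal Python sketch. The OrderLine type and its fields are hypothetical; the point is only that the line total is calculated on demand instead of being stored alongside the values it depends on.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OrderLine:
    """Hypothetical entity type that holds only non-derived attributes."""
    quantity: int
    unit_price_pence: int

    @property
    def line_total_pence(self) -> int:
        # Derived data: calculated when needed, never held as an attribute.
        return self.quantity * self.unit_price_pence


line = OrderLine(quantity=3, unit_price_pence=250)
held_attributes = [f.name for f in fields(line)]
```

Here held_attributes contains only quantity and unit_price_pence, while the total is still available to any process that needs it.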
Application Data Model
An application data model is (obviously) one that relates to a particular application. It may be any or all of the following data models: conceptual, logical, physical, or canonical.
Business Requirements Data Model
The purpose of a business requirements data model is to capture and reflect a statement of business requirements. For such a model, it is important that the notation is simple and easily understood. This data model will form a basis for further analysis, so it does not need to capture all the detail. It can also function as a useful framework for capturing business rules as part of the definition of the entity types.
So this type of data model is essentially a simplified conceptual data model, perhaps without cardinalities and without taking account of change over time, since most users do not understand their implications. My preferred notation for this, first introduced by CDIF (CASE Data Interchange Format), is particularly easy to read. It consists of boxes and arrows with the names of the relationship types on them, where the direction of the arrow just tells you in which direction to read the relationship type name. Figure 1 shows an example of how such a data model might look.
Figure 1: A business requirements data model for part of an order data model.
The idea is that you can just read around it following the arrows, to get phrases like:
Order of offering of product at price.
Order from customer.
Order delivered to address.
Address within location.
Delivery charge for product to location.
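The reading rule can be sketched in a few lines of Python by treating each arrow as a (from box, relationship name, to box) triple. The set of arrows below is an assumption reconstructed from the example phrases, not a transcription of Figure 1.

```python
# Each arrow is (from_box, relationship_name, to_box).
# This arrow set is assumed from the example phrases, not copied from the figure.
ARROWS = [
    ("order", "of", "offering"),
    ("offering", "of", "product"),
    ("offering", "at", "price"),
    ("order", "from", "customer"),
    ("order", "delivered to", "address"),
    ("address", "within", "location"),
    ("delivery charge", "for", "product"),
    ("delivery charge", "to", "location"),
]


def read_arrow(arrow):
    """Read one arrow in the direction it points."""
    source, name, target = arrow
    return f"{source.capitalize()} {name} {target}."


def phrases_from(box):
    """All phrases that start at the given box."""
    return [read_arrow(arrow) for arrow in ARROWS if arrow[0] == box]
```

Calling phrases_from("order") yields the three phrases that start at the order box; longer phrases such as the first example come from following one arrow after another.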
Integration Data Model
An integration data model integrates a number of separate applications. In order to do this, it needs to be instantiable. Its scope is usually either all the data for the applications it integrates or any data that is shared between at least two of these applications.
You can also use an integration data model to share data between enterprises, for example, in the supply chain.
Enterprise Data Model
An enterprise data model is a type of integration model that covers all (well, probably most in practice) of the data of an enterprise. Your Enterprise Architecture may include enterprise-wide data models that are also conceptual, logical, or physical data models.
For most types of data model, it is fairly obvious when you need to develop them. Enterprise data models, however, seem to be the exception. There are many cases where enterprise data model projects have been abandoned, or where the results have languished unused, even when what was asked for has been delivered. The reason for these failures is usually straightforward: it was not clear at the outset what questions the enterprise data model needed to answer, nor was it clear what the economic imperative to answer these questions was.
Establishing the questions to be answered as the purpose of the enterprise data model is good not only because it gives you a clear purpose, but also because it tells you when you can stop data modeling. Otherwise it is perfectly possible and very tempting to develop the enterprise data model to a level of detail that is unnecessary, and this adds both cost and time to the exercise. It is, of course, always possible to return to the enterprise data model later and develop more detail when questions arise that require that detail.
There are two occasions when I think an enterprise data model is clearly justified. The first is when a major business re-engineering project is being undertaken and the processes of the enterprise are being fundamentally revised. In this case, developing an enterprise data model alongside the enterprise process model will deliver significant value to the re-engineering process. The second occasion arises from a bottom-up approach to enterprise architecture. As the need arises to integrate across applications, a logical data model showing the overlaps between the various systems becomes necessary. A key element of this will be master and reference data, since getting this consistent is what enables consistency across different applications, but the data exchanged between systems will also need to be in scope. Since most transaction data is eventually transferred to data warehouses and reporting systems, it is likely that this will grow to cover most of the enterprise’s data.
Business Information Model
Business information models are a type of application data model that is used in data warehouses for reporting and slicing and dicing your transaction data. Instead of being normalized in their structure, these models are arranged in terms of “facts” (transactions, typically), and the “dimensions” (such as time, geography, or product lines) used to specify reports. The simplest structure is a “star” pattern, with a fact or group of facts in the middle, and dimensions radiating out from there. More complex structures resemble “snowflakes.” There are some special rules that apply to business information models; for example, only hierarchical relationship types are allowed, otherwise as you summarize up the relationships, your data may get counted more than once. On the other hand, at different levels in the hierarchy, you may use different relationship types for summarization.
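The double-counting problem is easy to demonstrate. The sketch below, with made-up products, categories, and amounts, summarizes the same facts up a hierarchical dimension and up a non-hierarchical one in which a product belongs to two categories at once.

```python
from collections import defaultdict

# Fact rows: (product, sales amount). All values are illustrative.
SALES = [("kettle", 100), ("toaster", 50)]

# Hierarchical dimension: each product rolls up to exactly one category.
HIERARCHICAL = {"kettle": ["kitchen"], "toaster": ["kitchen"]}

# Non-hierarchical: the kettle belongs to two categories at once.
NON_HIERARCHICAL = {"kettle": ["kitchen", "gifts"], "toaster": ["kitchen"]}


def total_by_category(sales, product_to_categories):
    """Summarize the facts up the product-to-category relationship."""
    totals = defaultdict(int)
    for product, amount in sales:
        for category in product_to_categories[product]:
            totals[category] += amount
    return dict(totals)


actual_total = sum(amount for _, amount in SALES)
hierarchical_total = sum(total_by_category(SALES, HIERARCHICAL).values())
non_hierarchical_total = sum(total_by_category(SALES, NON_HIERARCHICAL).values())
```

With the hierarchical dimension the category totals add back up to the actual total; with the non-hierarchical one the kettle sale is counted under both categories, so the summed categories overstate sales.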
Data Usage Model (Data Flow Diagram)
A data usage model shows which processes create and use which data. Some examples of data usage models are CRUD (create, read, update, delete) matrices and data flow diagrams. It is these that show how the process and data models interact with each other.
One of the challenges here is that the processes in a data usage model may themselves be things about which we wish to hold information, so you need to recognize that a process in such a model may also be represented in an entity-relationship model.
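A CRUD matrix is simple enough to hold as a nested mapping. The following sketch uses hypothetical process and entity type names to show the kind of question such a matrix answers.

```python
# Rows are processes, columns are entity types, cells hold CRUD letters.
# All process and entity type names are hypothetical.
CRUD = {
    "take order": {"order": "C", "customer": "R", "product": "R"},
    "amend order": {"order": "RU", "product": "R"},
    "deliver order": {"order": "RU", "address": "R"},
    "register customer": {"customer": "C", "address": "C"},
}


def processes_that(operation, entity):
    """Processes that perform the given operation (C, R, U, or D) on an entity type."""
    return sorted(p for p, row in CRUD.items() if operation in row.get(entity, ""))


def entities_never_created():
    """Entity types that are used somewhere but created by no process in scope."""
    entities = {entity for row in CRUD.values() for entity in row}
    return sorted(e for e in entities if not processes_that("C", e))
```

A result such as entities_never_created() returning product is a useful prompt: either a process is missing from the model or the data is created outside its scope.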
Summary of Different Types of Data Model
You will see from the earlier descriptions that these classifications of data models are not mutually exclusive. Figure 2 is a Venn diagram that illustrates the combinations possible.
Figure 2: A Venn diagram showing the different types of data model.
This now enables me to draw your attention to the focus of this book, which is conceptual and logical data models that are also enterprise or integration data models. This does not mean that other types of data model will be ignored entirely, but I think this is where data modelers face the greatest challenge. Indeed, I have heard people say that enterprise data models are simply impossible, to which I would respond that the only thing you know for certain is that the person who says that does not know how to create one. I hope to show that such data models are quite achievable while pointing out some of the reasons that people fail and how you can overcome them.
Integration of Data and Data Models
In the previous sections I explained the different purposes and types of data model. In this section I am going to look in somewhat more detail at integration data models and at integrating data, including an architecture and a methodology you can use for data integration.[1] The reason for this is that the approach to data modeling presented here is very much a response to the demanding requirements of data integration.
I will start by introducing the basic principles for the architecture and integration methodology presented here.
The three-schema architecture for data models[2] shows that, for any data model, it is possible to construct views on the original model. This principle can be extended to cover other types of model. In the integration of models, this process is reversed: a model is created for which the initial models are views. A model created in this way is an integration model with respect to the initial models in that it is capable of representing information with the scope of either or both of the original models. This is illustrated in Figure 3.
Figure 3: Model integration.
You can create an integration model if you can establish a common understanding of the application models to be integrated. Difficulties in creating such a model point to a gap in human knowledge about the subject of the application models. Further, you may be able to integrate an application model to more than one integration model, where the integration models support different ways of looking at the world (see Figure 4).
Figure 4: Integration into more than one integration model.
Integration models can themselves be integrated. This means that any arbitrary set of models can, in principle, be integrated at the cost of creating a new model (see Figure 5).
Figure 5: Integrating integration models.
To have to create a new integration model each time you add a data model to the set can be time-consuming and expensive. What you want is an integration model that is stable in the face of the integration of additional models. Here stable means that the existing integration model does not need to be changed as more models are integrated, though extensions of the integration model may be necessary. So it is worth looking at some of the barriers that mean that the existing integration model needs to be changed, rather than simply extended.
Scope and Context
The scope of a data model is defined by the processes supported by its actual content. The context of a data model is the broadest scope that it could be part of without being changed.
When you create an integration model, its scope and context must be no smaller than the combined scope of the application models being integrated. Figure 6 shows the relationship between the scope and context of an integration model where the context is hardly larger than the scope.
Figure 6: A limited integration model.
If an additional application model needs to be integrated that falls outside the context of the existing integration model (see Figure 7), then a new integration model will have to be developed to integrate the existing integration model with the additional application model.
Figure 7: Integrating an application model and a limited integration model.
However, you can choose for the initial integration model to have a wide model context. This means that it can support the information needs of many different applications, even though its initial model scope is limited to that of the models that it integrates, as shown in Figure 8.
Figure 8: Using an integration model with a broad model context.
You can then integrate further application models by extending the integration model—enlarging the model scope within the broad model context.
It turns out that the main barrier to having a broad context is the implementation of rules that apply in the narrow context of the original use and intention of the data model but that do not apply in a wider context.
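A small sketch may help show how a rule narrows the context. The customer structures and the one-address rule below are hypothetical: in the narrow structure the rule is built in and cannot be relaxed without changing the model, while in the broad structure the same rule is kept as a separate check applied only where it holds.

```python
# Narrow context: the rule "a customer has exactly one address" is built into
# the structure, because the record has a single address field.
narrow_customer = {"name": "Acme Ltd", "address": "1 High Street"}

# Broad context: the structure allows any number of addresses.
broad_customer = {"name": "Acme Ltd", "addresses": ["1 High Street"]}


def one_address_rule(customer):
    """The original rule, kept outside the structure."""
    return len(customer["addresses"]) == 1


# Hypothetical contexts and the rules that apply in each of them.
RULES_BY_CONTEXT = {
    "original application": [one_address_rule],
    "enterprise": [],
}


def valid_in(context, customer):
    return all(rule(customer) for rule in RULES_BY_CONTEXT[context])


two_site_customer = {"name": "Bolt Co", "addresses": ["2 Mill Lane", "9 Dock Road"]}
```

The two-site customer can be held in the broad structure and is valid in the enterprise context, even though it would break the rule of the original application.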
Integration Model Content
The content of an integration model can be divided into primitive concepts and derived concepts. Primitive concepts are those that cannot be defined in terms of other concepts in the integration model; they are in turn the building blocks for the definition of other concepts. Primitive concepts can be further divided into foundation concepts, general concepts, and discipline-specific concepts, as represented in Figure 9.
Figure 9: Primitive concepts.
Discipline-specific concepts depend on general concepts, which in turn depend on foundation concepts, since all the lower concepts rely on the existence of one or more higher-level concepts. For example, without the foundation concept of classification, relatively little can be said about anything.
At the top level, an integration model might have foundation concepts like classification, connection, and composition. General concepts might include agreements and organization. Finally, there are discipline-specific concepts that are limited in their range of application, such as pumps and valves.
An integration model is not just a data model. It includes master and reference data that adds detail, particularly about the detailed kinds of things that are of interest.
A Full Integration Model
A full integration model, as illustrated in Figure 10, is more than just primitive concepts; it includes derived concepts—useful and valid combinations of primitive concepts. You only need to record derived concepts that are of interest, since their existence is implicitly recognized.
Figure 10: A full integration model.
A primitive concept is not necessarily primitive forever. If a concept that is initially thought to be a primitive concept turns out not to be, then you can identify and add the concepts it is derived from, and add the derivation, so that it becomes a derived concept away from the front face of the pyramid. This allows flexibility to reflect an improved knowledge of the world, rather than reflecting knowledge of the world that is constrained by a modeler’s knowledge at a point in time. This is one of the ways that you will need to maintain and extend an integration model.
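One way to picture this kind of maintenance is to record, for each concept, the concepts it is derived from. The sketch below is only an illustration: the derivation of pump from equipment and pumping function is a hypothetical example, not a claim about how any particular integration model defines it.

```python
# Each concept maps to the concepts it is derived from.
# An empty list marks a concept that is currently treated as primitive.
concepts = {
    "classification": [],
    "connection": [],
    "composition": [],
    "pump": [],  # initially assumed to be primitive
}


def is_primitive(name):
    return not concepts[name]


def record_derivation(name, derived_from):
    """Reclassify a concept as derived, adding any source concepts that are missing."""
    for source in derived_from:
        concepts.setdefault(source, [])
    concepts[name] = list(derived_from)


was_primitive = is_primitive("pump")
# Hypothetical later insight: a pump is equipment with a pumping function.
record_derivation("pump", ["equipment", "pumping function"])
```

Note that nothing already recorded had to change: the model was extended with two concepts and one derivation.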
Mapping specifications specify the transformations that determine how the instances of one model can be represented as instances of another model.
When you create a mapping specification, it is important to note that:
• New concepts or constraints are not introduced in the mapping specification; that is, mapping specifications are limited to transformations of structure, terminology, and encoding.
• A complete mapping specification is bidirectional. However, the transformations of the first model to the second model may have to be specified separately from those of the second model to the first model, and the mapping in one direction need not be derivable from the mapping in the other direction.
• Before you can support a bidirectional mapping, you may need to make explicit some of the context of the data model being integrated.
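The points above can be sketched as a pair of Python functions, one per direction. The record layouts, field names, and status codes are all assumptions made for illustration. The two directions are written out separately, even though in this simple case each happens to be the inverse of the other.

```python
# Encoding tables, specified separately for each direction (hypothetical codes).
STATUS_TO_INTEGRATION = {"A": "active", "L": "on leave"}
STATUS_TO_APPLICATION = {"active": "A", "on leave": "L"}


def to_integration(app_rec):
    """Application model record -> integration model record."""
    family, given = [part.strip() for part in app_rec["emp_name"].split(",")]
    return {
        # Structural transformation: one field becomes a nested structure.
        # Terminology transformation: "emp_name" becomes "person".
        "person": {"given_name": given, "family_name": family},
        # Encoding transformation: code letters become words.
        "employment_status": STATUS_TO_INTEGRATION[app_rec["status"]],
    }


def to_application(int_rec):
    """Integration model record -> application model record."""
    person = int_rec["person"]
    return {
        "emp_name": f"{person['family_name']}, {person['given_name']}",
        "status": STATUS_TO_APPLICATION[int_rec["employment_status"]],
    }


app_record = {"emp_name": "Lovelace, Ada", "status": "A"}
round_trip = to_application(to_integration(app_record))
```

Neither function introduces a new concept or constraint; each only changes structure, terminology, or encoding, and the round trip returns the original record.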
Overview of the Model Integration Process
The model integration process takes a number of application models and an integration model. It ensures that all the concepts of the application models are represented in the integration model, and it develops a mapping specification between the integration model and each of the application models.
There are three possible cases for the integration process:
The integration model and the application models both exist before the integration process starts.
The application models to be integrated exist before the integration process starts, but the integration model does not.
The integration model exists before the integration process starts, but the application model needs to be developed from some statement of requirements.
The process of integrating an application model with an integration model is illustrated in Figure 11. The goal of this integration process is to allow the same information that is represented in the application model to be represented in the integration model without losing any meaning and to allow transformations between these representations. The result of the integration process is a mapping specification between the application model and a part of the integration model. In order to define this mapping, you may need to extend the integration model so that it precisely represents the concepts found in the application model.
Figure 11: Integrating application models with an integration model.
The process of integrating application models with an integration model is divided into a number of steps, as follows:
Analyze the application models and identify the equivalent concept of the integration model, including any constraints that apply (see Figure 12). Most application models have a context within which the model has to be understood but which is not explicit in the model itself. Usually it is inappropriate to add this information explicitly to the application model. In this case, you should capture these requirements in the mapping specification as part of the integration process.
Figure 12: Analyzing the application models.
If necessary, extend the integration model so that it includes all the concepts found in the application models (Figure 13).
Figure 13: Adding any missing concepts to the integration model.
Identify the part of the integration model that represents the concepts in each application model (see Figure 14).
Figure 14: Identifying the subset of the integration model.
Create the mapping in each direction between each application model and the appropriate subset of the integration model (see Figure 15).
Figure 15: The mapping between integration model subset and application model.
Specify any structural transformations, terminology transformations, or encoding transformations that apply within the mapping.
Specify any transformations that are necessary between model representations. For example, if an application model is specified in the XML Schema definition language and the integration model to which it is mapped is specified in EXPRESS (ISO 10303-11), a transformation between these languages will be necessary to map between different representations of the same concepts.
Repeat this process for all other application models to be integrated.
Most application models have a context within which the model has to be understood but which is not explicitly represented in the application model itself. Mapping successfully in both directions requires that both the explicit model and its context be mapped into the integration model. For example, in a salary payment system, there may be an entity data type called employee. However, it is often implicit that each person represented by instances of this entity data type is an employee of the company that operates the system and is legally eligible for employment under company and government policies.
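The employee example can be sketched as follows. The company name and record layouts are hypothetical, and the sketch covers only the implicit employer, not the eligibility conditions. In the forward direction the mapping states the employer explicitly; in the reverse direction it uses that explicit fact to decide who counts as an employee for the payroll system.

```python
OPERATING_COMPANY = "Example Corp"  # hypothetical; never stated in the payroll model


def employee_to_integration(employee):
    """Map an application 'employee' to explicit person and employment facts."""
    return {
        "person": {"name": employee["name"]},
        "employment": {
            "employee": employee["name"],
            "employer": OPERATING_COMPANY,  # implicit context made explicit
        },
    }


def integration_to_employee(person, employments):
    """Reverse mapping: only people employed by the operating company qualify."""
    for employment in employments:
        if (employment["employee"] == person["name"]
                and employment["employer"] == OPERATING_COMPANY):
            return {"name": person["name"]}
    return None


mapped = employee_to_integration({"name": "Ada Lovelace"})
back = integration_to_employee(mapped["person"], [mapped["employment"]])
outsider = integration_to_employee(
    {"name": "Charles Babbage"},
    [{"employee": "Charles Babbage", "employer": "Another Co"}],
)
```

Without the explicit employer, the reverse mapping would have no way to tell that the second person does not belong in the payroll system.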
Data Mapping and Consolidation
Mapping between models is not sufficient to achieve integration. Integration requires reconciliation of information represented according to the different models. This process is illustrated in Figure 16.
Figure 16: Data consolidation.
Translate the data population 1 and 2 according to their source models into the data populations 10 and 20 according to the model C.
Identify which data elements in the two data sets represent the same things, and consolidate them in data population 3.
It should be noted that this requires a common, reliable, persistent identification mechanism. This will enable the different ways the same object is identified in different systems to be captured and effectively provide a translation service between those systems.
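A minimal sketch of such an identification mechanism is a cross-reference from each system's local identifier to a persistent common identifier. The systems, identifiers, and records below are hypothetical, and the merge rule (first value seen wins) is a deliberate simplification of real reconciliation.

```python
# Cross-reference: (system, local identifier) -> persistent common identifier.
XREF = {
    ("sales", "C-17"): "ORG-1",
    ("finance", "900045"): "ORG-1",  # the same organization as sales C-17
    ("finance", "900046"): "ORG-2",
}

# Data populations already translated into the common model.
translated = {
    "sales": [{"local_id": "C-17", "name": "Acme Ltd", "phone": "555-0100"}],
    "finance": [
        {"local_id": "900045", "name": "ACME LIMITED", "credit_limit": 5000},
        {"local_id": "900046", "name": "Bolt Co", "credit_limit": 1000},
    ],
}


def consolidate(populations):
    """Merge records that the cross-reference says represent the same thing."""
    merged = {}
    for system, records in populations.items():
        for record in records:
            common_id = XREF[(system, record["local_id"])]
            target = merged.setdefault(common_id, {"local_ids": {}})
            target["local_ids"][system] = record["local_id"]
            for key, value in record.items():
                if key != "local_id":
                    # Simplification: the first value seen wins. Real
                    # reconciliation needs explicit rules for conflicts.
                    target.setdefault(key, value)
    return merged


def translate_id(merged, from_system, local_id, to_system):
    """Look up how another system identifies the same object, if it does."""
    common_id = XREF[(from_system, local_id)]
    return merged[common_id]["local_ids"].get(to_system)


consolidated = consolidate(translated)
```

Because each consolidated record keeps the local identifiers it was built from, the same structure doubles as the translation service between systems.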
In this chapter I have looked at the different types of data model that you can find, and in particular at the business of data integration. You will have noticed that data integration presents some particular challenges, and that the desirable characteristics of integration models are not necessarily the same as those of models developed for other purposes. In particular it is desirable that they are stable and extensible. Parts 2 and 3 of this book are largely about a way to achieve this.
[1] This is largely taken from West, Matthew; Fowler, Jason. The “IIDEAS” architecture and integration methodology for integrating enterprises. PDT Days 2001 (2001), which was used as the basis for ISO TS 18876 – Integration of Industrial Data for Exchange, Access and Sharing.
[2] ISO/TR 9007:1988, “Information processing systems – Concepts and terminology for the conceptual schema and the information base.”
©2011 Elsevier, Inc. All rights reserved. Printed with permission from Morgan Kaufmann, a division of Elsevier. Copyright 2011. “Developing High Quality Data Models” by Matthew West. For more information on this title and other similar books, please visit elsevierdirect.com.