The old chestnut goes that the only difference between a data modeler and a terrorist is that you can negotiate with a terrorist. In spite of the fact that there are many notations available and
many ways to approach the effort using each of those notations, data modeling is such an intense, personal experience that it is very easy to become emotionally attached to a particular way to do
it. On top of that, the difficulty that goes into making a modeling tool work just the way we want it to also tends to makes us uncomfortable with the prospect of change.
It’s a wonder any of us can talk to each other at all. Still, data models are products to be delivered to one’s consumers, and it is clear, on the surface at least, that there is a difference
between a good data model and a bad one. Unfortunately, however, that difference between good and bad is rarely articulated, and more often it is expressed in arguments carried out with an almost
This paper attempts to cut through such arguments and present as dispassionately as possible some basic principles of data model quality. At the very least, it hopes to get the issues out on the
table where we can discuss them.
To be sure, it is inevitable that the “principles” will reflect the prejudices of your author. They are, however, drawn not only from his own sixteen years’ experience, but also from that of
other people as well. So, they are presented for your consideration with some confidence that there is merit to them.
The first order of business of course is to clarify exactly what kind of data model we are talking about. The term “data model” has taken on several quite different meanings in recent years,
reflecting the very different uses that collections of boxes and lines can be put to.
- A “divergent” data model describes the concrete things of interest to the people who run the day to day business of the organization. By definition, these are specific to particular
situations or specialties, and their representation is not constrained by any modeling principles. These are the models seen by John Zachman’s “business owners”. The collection of such models for an enterprise will overlap and often be inconsistent with each other.
- A “convergent” data model describes the fundamental things of significance to the organization. The specific things seen by each business owner are but examples of these more general
things. This is a conceptual model encompassing all the divergent models, and it uses various modeling disciplines to ensure that all things presented are organized coherently and consistently.
This corresponds to the view of John Zachman’s “information system designers”, although your author would prefer to call this the “architect’s view”.
- The term “data model” is also often used to refer to a database design. For example, Rational Corporation has added a “data modeler” to its Rose products, which portrays relational database
structures. This is a representation of the tables and columns that will be used to implement a particular computer system. The term “object model” has also come up, sometimes referring to the
convergent model described above, and sometimes referring to the structure of data in an object-oriented program. The relational and object-oriented designs are views of data held by Mr. Zachman’s
“builders”, although your author would prefer to call this the perspective of the “designers”.
This article will address the conceptual-convergent-view of data modeling. This is the view that seems the most misunderstood-not the least because of its often being confused with the other two.
The purpose of the conceptual model is not only to describe an organization in a way that represents fundamental structures, but also to describe it in terms that can be understood by the business
stakeholders who will be responsible for any systems developed from it. For this reason, it is not only a tool for stimulating and disciplining thought processes but also a tool for communication.
The issues described in this article, then, have to do with aesthetics as well as with clarity and coherence of thought.
It is important to understand that the conceptual modeling described in this article is not about capturing the concrete terms of business clients, nor is it about database (or object class)
design. Any notation can be used to create a conceptual model, but it is important to understand that that is what you are doing. A UML conceptual model composed of business classes is not the same
as a UML design model describing the bits of program code that are design classes.
Conceptual data model quality is particularly important for an organization that is beginning to focus on data quality. You cannot collect quality data if your database is poorly organized. You
cannot organize your database properly if you do not understand the real structure and organization of the data. And you cannot understand the structure of the data if you cannot discuss it with
the people who define and use it. This requires all the graphic and linguistic help you can get.
By the way, the first prerequisite to data model quality is simply a passion for data modeling. If you are not a fanatic about making sure that each attribute and entity type has a precise
definition, that each box is properly aligned, and that each symbol is in exactly the right position, you will never produce a great model, no matter what technique or approach you use.
This article addresses data model quality in terms of three dimensions: a model’s content, its names, and its appearance. At the end of the article, we will discuss how to go about producing a
quality data model.
Content: Six Principles
In 1994, Matthew West, of the Shell Oil Company, prepared a set of three white papers called “Developing High Quality Data Models”. The first of these specifically addressed data model
quality. In it, he laid out six principles for high quality data models:
- Entity types should represent, and be named after, the underlying nature of an object, not the role it plays in a particular context.
- Entity types should be part of a sub-type/super-type hierarchy (“class hierarchy” if you’re familiar with object oriented terms) in order to define a universal context for the model.
- Activities and associations should be represented by entity types (not relationships).
- Relationships (in the entity/relationship sense) should only be used to express the involvement of entity types with activities or associations.
- Candidate attributes should be suspected of representing relationships to other entity types.
- Entity types should have a single attribute as their primary unique identifier. This should be artificial, and not changeable by the user. Relationships should not be used as a part of the
primary unique identifier. (They may be part of alternate identifiers.)
1. Entity types: The Underlying Nature of an Object
The first principle goes to the heart of the meaning of an entity type: An entity type in a conceptual model is a thing of significance to the organization, about which it wishes to keep
information. It is the thing itself, not the thing playing a role. Unfortunately, if you look through your interview notes looking for nouns as candidate entity types, many of the nouns will
combine references to things with the roles they play.
Figure 1 shows two common models, describing customer and vendor: Each customer may be a buyer in one or more sales orders; Each vendor may be a seller in one or more purchase orders.
These models raise a question, however: what if the same person or organization is both a customer and a vendor? Or, what if a division of your own company is a customer? By folding the
“customerness” and the “vendorness” of a person or an organization into its underlying definition, you are no longer able to deal with that definition separately.
Figure 2 shows an alternative. This model actually addresses two issues. First, it recognizes that we will often have relationships that apply to either a person or an organization. By
defining the super-type party, we can talk about the groups of people and organizations together. Second, we have generalized the concept of sales order and purchase order to simply, order. This
recognizes that the two kinds of orders are in fact the same thing. The only difference is whether the viewer of the model is buying or selling. Thus “customerness” and “vendorness” are not
characteristics of entity types but rather roles played in the relationships between party and order. More than that, by explicitly recognizing that every order has both a buyer
and a seller in it, we have also recognized that we are a party, just like all those others we deal with. This turns out to be a useful thing. Indeed, the values of our attributes
can be stored just like those for everyone else.
By the way, note that if you are successful in creating a model of true entity types, and have correctly assigned single-valued attributes to them, you will have by definition created a fully
normalized model. To say that all attributes of an entity type are a function of the key, the whole key, and nothing but the key is to say that they are in fact attributes of this thing.
2. Sub-types and Super-types
Following on the above point, the second principle asserts that “Entity types should be part of a sub-type/super-type hierarchy (‘class hierarchy’ if you’re familiar with object oriented terms)
in order to define a universal context for the model.” As it happens, the sub-type/super-type structure is not a relational concept. By definition, a relational database consists of flat,
two-dimensional tables only. But in his formulation of information engineering, Clive Finkelstein picked up the concept of “inheritance” from the object-oriented community and it has turned out
to be a very powerful way to represent the world. Among other things, it allows you to show that the concrete things most people see are examples of more general concepts.
There are actually two justifications for setting up a sub-type/super-type structure-one formal and one informal.
The formal justification for a sub-type comes when it inherits all relationships and attributes from its super-type, but it has at least one unique relationship and/or attribute. The informal
justification for a sub-type is to illustrate the kinds of things represented by the super-type, even if there are no differences in their attributes or relationships. This can be very
useful in the presentation of a model, to clarify the meaning of the super-type.
Figure 3 shows examples of both of these uses. It represents the kinds of facilities that constitute an oil field. The kinds of subsurface facilies (the stuff in an oil well) have distinct
attributes, as well as specific relationships among them that could not be expressed if they were not broken out into sub-types. Specifically, each well may be composed of one or more
completions, and each well may be composed of one or more wellbores. Each of these three entity types has a different set of attributes: “Surface Elevation” for well; “Top Depth” and
“Bottom Depth” for wellbore; none for completion.
On the other hand, however, there seems to be no formal justification for the sub-types of surface facility. As nearly as we can tell from the figure, the attributes of a steam generator, a
dehydration plant or any other surface facility are all the same. Even so, it is useful to show them as sub-types, to clarify just what is meant by a “surface facility”. In fact, further analysis
may in fact show different attributes for the different sub-types, but even if it did not, this would still be a reasonable way to present surface facility.
3. Activities and Associations are Entity Types
The first draft of a model is likely to contain many-to-many relationships. Indeed even many of the one-to-many relationships, upon further reflection, turn out to be many-to-many relationships.
Normalization will require you then to define associative (intersect) entities to account for them. (The tangible entity types thus associated are called “reference entity types”.) Principle 3
says that, from the beginning, “activities and associations should be represented by entity types (not relationships).”
Figure 4 shows employment as a relationship between person and organization. Each person may be an “employee” by virtue of being employed by one and only one organization. (Note that the
commonly specified “employee” is an example of the spurious entity types described in Principle 1, above.)
The only problem with this is that it represents considerably more company loyalty than is normally the case. In the “real” world, each person is likely to be employed by more than one
organization over time, necessitating records to be kept of each employment. This results in Figure 5. As it happens, over time, nearly all one-to-many relationships between reference
entity types are really many-to-many relation-ships.
4. Relationships as Involvement
Ok, if associations are to be represented as entity types, what do relationships represent? Simply, a relationship is an entity type’s involvement with one of those associative
entities described above. In Figure 5, above, there are two relationships between any associative entity and its reference entities. Each employment may be of one person and in
5. All Attributes are Candidate Relationships
The fourth principle says simply to examine each attribute of an entity type and ask whether it isn’t really a relationship to something else. As an example, consider data about “households”
received by a bank from a market research company.
First, we draw an entity type for household, as shown in Figure 6. The attributes shown are those delivered by the market research company.
Let’s look at the attributes of household: “Income”, “number of cars”, and “sociological group”, are examples of potential relationships to other things. “Average balance” could stay as an
attribute, but if we want to keep it over time, it needs its own entity type as well. This means that the model can be better represented as shown in Figure 7.
Here, what had been the attribute “Sociological group” is now a sociological group membership for a household in a sociological group. Using this structure allows you, as you
could not before, define in advance a specific list of sociological groups that are of interest. You can also now see that a car ownership is of a car model. (If it were of interest, you could add
a relationship to assert that each car model must be manufactured by an organization-a car company.) Income ranges can similarly be grouped into pre-defined categories. And finally, bank
balance contains the household’s “Average balance” for multiple points in time, as designated by the “Effective date” of each occurrence.
If it is not clear what kinds of categories will be of interest in the future, this model can be further generalized, as shown in Figure 8. For example, a household parameter group could be “Car
Model”, and a household parameter could be “Number of Jaguars”. Thus, the household parameter value of “Number of Jaguars” for “Jay Leno” could be “5”.
A word of caution: Generalizing a model makes it much more flexible and robust. It insulates it against the effects of future changes to the business. At the same time, however, it removes the
model from the language of your customers. It is important to use judgment in determining how general to become. In your author’s experience, most business people can be taught the meaning and
implications of more general models, and once so taught, they appreciate the elegance. It is important to do this teaching carefully, however. In presenting this model, begin with Figure 6 and work
your way to Figures 7 and 8.
6. Only Use Surrogate Identifiers
The sixth principle in the Shell paper asserts that only surrogate identifiers should be used as unique identifiers for entity types. A surrogate identifier is a system generated number that is
uniquely assigned to each row. It has no inherent meaning and is used only to distinguish one row from another.
Your author respectfully disagrees with part of this one. It is the case that for reference entity types (those that are not required to be related to any other entity types), surrogate identifiers
are best. It is extremely rare to find any attributes for product, person, activity, etc. that could never change value. Moreover, the temptation (left over from manual identification systems), to
encode all manner of information into the identifier is genuinely a bad idea. Surrogate identifiers are simpler, easier to implement, and a better guarantor of the integrity of the data.
In the case of associative entity types, however, the identity of an occurrence is, by definition, derived from the things it is associating. If the related things are cleanly identified with
surrogate identifiers, there is no harm in using the relationships to those things as the identifier for the association. By definition, you cannot change the values of such a composite identifier
for an occurrence and keep the same occurrence.
If it is necessary to allow for more than one occurrence of a pair, it is a simple extension to add a “Sequence number” to the compound identifier of the associative entity type. Even though this
is using a surrogate for part of the identifier, it is useful to keep the natural identifier of the entity type visible.
Figure 9 shows this. party and product type are each identified by a surrogate “ID”. vendor, on the other hand is identified by a combination of the party that is in the role of the vendor and
the product type that is sold by the vendor.
(In this notation, an attribute’s participation in a unique identifier is shown by the octothorpe (#) to the left of the attribute. A relationship’s participation is shown by a short line across
the relationship line, next to the entity type being identified.)
If one of the relationships is optional, then of course this argument doesn’t fly. In that case a surrogate identifier is better.
By the way, this is a very gentle argument. To assign a surrogate identifier to all entity types, including associative ones does minimal harm. To do so, however, does cloud the meaning of each
The second important dimension of data model quality is the use of adequate names for entity types, attributes, and relationships.
Entity type Names
As cited in the description of Principle 1, above, an entity type is a thing of significance to the organization. As such, it should have a common, recognized natural language name. In Figure 9,
for example, the names party, vendor, and product type are all names of real things. An entity type name should not be:
- The name of a computer artifact, like a file or table
- An acronym
- A “List” or “History” (the entity type name describes a representative example. It does not describe a collection.)
- An attribute name (like “ZIP code”. Use postal area.)
An attribute is the definition of a kind of data about an entity type. This also should be a simple natural language term. There is some discussion in the industry about standardizing the structure
of attribute names, but this is problematic. The idea of classifying attribute names is not a bad one, but it must be carried out with care. A structure that produces names that are not simple or
intuitive is counter-productive if the model is to be understood by the public at large.
Some argue that the entity type name should be included in the attribute name. In a conceptual data model, however, an attribute is never shown or described outside the context of its entity type,
so this is redundant. In deference to those who use CASE tool utilities to generate default database designs, however, one place where the entity type name can reasonably be added is to the
identifier, “ID” (“Person ID”, “Product Type ID”, and so forth). It does not make the name excessively long, and when the attributes get converted to foreign keys in a system implementation,
the foreign key columns are easier to understand. In most cases, however, descriptive attributes gain nothing but length when the entity type name is appended to it.
More problematic is the desire to standardize the qualifiers (“amount”, “text”, “quantity”, etc.) to attribute names. This can be of value if the list of qualifiers is defined with care.
Attribute names like “description text”, however, are simply silly.
Your author studied under Richard Barker and is a strong believer in the Oracle/SSADM standard structure for relationship names. It
provides discipline, and it permits the relationships to be read as easy-to-understand but precise English sentences. When models are being presented to the public, this is extremely valuable.
Each of the two roles in a relationship should be named with this structure:
For example, in Figure 10, the first relationship consists of the following two roles:
- Each order must be from one and only one party.
- Each party may be a buyer in one or more orders.
The second relationship consists of the following two roles:
- Each order must be to one and only one party.
- Each party may be a seller in one or more orders.
(Note that these sentences are much more graceful than something like “A party buys in zero, one, or more orders”. Non-data modelers don’t talk like that.)
Sentences in approximately this structure have been used throughout this article, and it is likely that you didn’t even notice. This is a very natural way to describe a relationship.
Edward Tufte has written, “Confusion and clutter are failures of [drawing] design, not attributes of information. And so the point is to find design strategies that reveal detail and complexity
¾ rather than to fault the data for an excess of complication. Or, worse, to fault viewers for a lack of understanding.”
(Emphasis added.) In other words, if someone has trouble understanding your model, it is not necessarily because the underlying meaning of the model is too complex, nor is it because the
viewer isn’t smart enough to understand it. It may simply be the case that the model wasn’t good enough.
A conceptual data model, developed as part of a requirements analysis project is fundamentally for the purpose of sharing the analyst’s understanding of the business with the business people. For
this reason, it is extremely important that the data model which represents that analysis be as clear and easy to read as possible. This means that aesthetics are a very important dimension of data
The first aspect of a model’s aesthetics is the notation used to create the model. Among the notations available are:
- Oracle/SSADM notation-The best, in your author’s humble opinion, for reasons described below.
- Information Engineering-This is similar to the Oracle/SSADM notation and is aesthetically as good.
UML-This may be used for either analysis models or design models. When used for analysis models, it is equivalent to any other data modeling notation. Aesthetically, it
is better than some and not as good as others. It has additional symbols that are most useful to object-oriented designers.
IDEF1X-This is an extremely complex notation. It’s the sort of thing a government bureaucracy might use. Oh, wait! It’s the one that the U.S. Government does use!
Seriously, it is oriented more toward relational design than toward analysis, and its notation is pretty forbidding to the novice.
Object Role Modeling-This is a modeling technique that takes a completely different approach. It has advantages in its ability to convey business rules, but it does
require the practitioner to learn new skills.
- Chen-The original entity/relationship notation, this shows relationships and attributes as separate symbols
- cluttering the diagram, but it does allow for “n-ary” relationships (between more than two entity types), which makes it useful in presenting models closer to the business person’s
Criteria for selecting a conceptual modeling notation include:
Are the symbols intuitive? The Oracle/SSADM notation uses a dashed line for optionality and a crow’s foot for cardinality. This is reasonably intuitive. Information
engineering uses a circle for optionality which is also fairly intuitive. IDEF1X uses a combination of dots, diamonds, and various letters to represent optionality and cardinality. This is less
than intuitive. UML is acceptable, but using characters rather than graphics to communicate optionality and cardinality requires the viewer to understand the concepts, rather than simply to see
them. That may be a matter of personal taste, however. Even so, it should be noted that UML was intended as a notation to support object-oriented design, and while a sub-set of it can be used to
draw an entity/relationship diagram, many elements in the language are not suitable and must be left out of a conceptual model.
Is the notation compact? Notations like Chen’s or Object Role Modeling show attributes external to entity type boxes. This takes considerable room on the diagram. Also,
notations that show sub-types outside super-types waste a great deal of space.
Does it have the fewest different kinds of symbols? The minimum symbols required show entity types, relationship optionality, and relationship cardinality. Attributes
don’t have to be displayed, but it should be possible to show them inside the entity type box. A symbol designating part of a unique identifier is useful but not necessary. If present it should
be subtle, not overpowering the aesthetic of the model, as is the case with IDEF1X. UML unnecessarily adds special symbols for the part of/composed of relationship, as well as a strange box for a
variation on unique identifiers. It also adds the datatype to attribute names. These things all clutter the model. IDEF1X has different symbols for optionality and cardinality, depending on
whether foreign keys are involved.
Is the notation rich enough for your purposes? Is it too rich? Some notations get trapped by trying to represent too many things. For an analysis model, the Oracle/SSADM
notation shows most things required. It is not as strong as UML in describing constraints, but it is better than the other notations for this. It intentionally does not show physical
characteristics like foreign keys or class composition, since that is not its purpose. When a model is to be turned over to a designer, it is useful to be able to show additional things that were
not of interest to the business person. UML is well suited for that transition.
Many of you have been presented with a data model, only to be overwhelmed by the profusion of boxes and lines on the page. The invention of the plotter was not an unmixed blessing.
If a model is to be readable, it must be constrained to a reasonable number of entity types. As a rule of thumb, it should not have more than about fifteen entity types on an 8 ½” by 11″
page. (A few more are permissible if they are sub-types.) A page should represent a subject area, ideally consisting of the “logical horizon” of a key transaction entity type.*
An important aesthetic principle is that all lines between pairs of entity types must be straight. To put an elbow in a line is to add a symbol. But it is a symbol that has no meaning, so it is a
visual distraction. Moreover, to have a relationship between two entity types that wanders all over the drawing is to have one that is extremely difficult to follow and understand. If necessary,
stretch entity boxes. Following this rule makes it much less important whether or not lines cross. The eye doesn’t see the crossing because it’s focused on the relationship.
One problem many models have is that there is no sense of organization. Figure 11 shows an example of a typical model. What is the model about? Measurements? Test Types? It’s hard to say.
The Oracle/SSADM answer to this is to have all the crow’s feet point either to the left or the top of the diagram. Figure 12 shows an alternative to Figure 11, following this rule. Note also how
straightening the lines helped as well. In this diagram, we can see that its subjects are people, test types, and samples. The reason the diagram exists at all is to portray the collection of
measurements. Note also that a measurement is similar in structure to expected measurement. Organization makes a big difference.
And yes there are heretics out there who would have the toes of the crow’s feet point south and east. Fie on them! Well, ok, as long as the practice is followed consistently, the effect is the
The art of data modeling is in the aesthetics of positioning entity type boxes on the page. It is difficult to give hard and fast rules in this area. It is generally a good idea, however, to align
the left or right and top or bottom edges of logical groups of entity types. In general, have relationships (or groups of relationships) intersect the edge of an entity type at its center. Be sure
to tidy up the relationship names, so that they don’t cross the relationship lines.
The Modeling Process
Ok, if these are the characteristics of a good model, how do you go about achieving them? Is this a skill that can be mastered by ordinary workers or can it only be done by a superior expert? The
fact of the matter is that there are ways that anyone can go about developing a model that greatly increase the chances of being successful.
First, it is important to recognize that modeling is a process like any other part of the system development effort. It requires time and resources and these have to be planned for. At the very
least, a project plan could look something like this:
- Introduce the project through a briefing session for the stakeholders-especially those who will be asked to participate.
- Conduct interviews: Survey interviews of top management reveal the objectives, priorities and constraints of the organization as a whole; detailed interviews of the people who operate the
company reveal details of data, process, timing and business rule structures.
- Prepare models
- Present models in a feedback session, so the stakeholders can verify that they correctly describe the business.
- Digest feedback responses and produce the final report.
The big question of course is what does the “prepare models” step look like? How much time should be allocated to it? When your author did his first data models, this was a very painful process
of collecting all the nouns from the interviews, posting them up (one per Post-it note) on a whiteboard and trying to infer structure from them. This required many iterations over many months.
Arriving at a model required a lot of “Ah ha! Now I see it!” With experience, however, came the discovery that, if we are really going after fundamental things of significance, those are
the same for any organization. Out of this discovery came, among other things, the book, Data Model Patterns: Conventions of Thought. By using patterns, the process of developing a model no longer requires great intuitive leaps. It is simply a matter of recognizing these patterns when they occur, and
elaborating on them as necessary to accommodate the particulars of this organization.
The patterns exist in six broad categories:
- People and organizations
- Facilities (sites) and geographic locations
- Product types and product instances
- Procedures (activity types) and activities
- Contracts and other kinds of agreements
Now, instead of trying to infer these things from notes, you can begin by asking the questions, such as “What people and organizations are of interest to you?” A person may be an employee, a
customer, or an agent for a vendor. An organization may be a customer, a department, or a vendor. Each of these are roles played by a person and they can be modeled in a straightforward manner.
(What? A customer can be either a person or an organization? Then perhaps we should define a super-type of person and organization that can play the role of “vendor”. By
convention, that super-type is known as party.) Similarly, what kinds of products are of interest? Are we only concerned with products for sale? What about equipment used in manufacturing? Are
chemicals included? Other questions home in on facilities, procedures, and contracts.
Just as data models provide structure to the enterprise, data model patterns provide structure to the process of modeling.
To be sure, the particular sub-types of these elements are different from company to company. But the five categories are fundamental. Using them as a starting point makes the details much easier
An important component in the search for data quality is quality in the design of the tools that will capture, store, and manipulate data. This design must be derived from a sound understanding of
the nature of the data and the organization manipulating them. This understanding can come from an effective effort to describe that nature in ways that everyone can see. Hence, data model quality
is an important pre-requisite to data quality.
 For a description of Mr. Zachman’s “Framework for information systems architecture” go to
 For a discussion of the differences between John Zachman’s and David Hay’s framework, see David C. Hay, Requirements Analysis: From Business Views to
Architecture (Englewood Cliffs, NJ: Prentice Hall, 2003), 5-6.
 Matthew West “Developing High Quality Data Models, Volume 1, Principles and Techniques”, The Data Management Guide. (London: Shell International Petroleum
Company Limited, 1994). These ideas are further expanded at http://www.matthew-west.org.uk/documents/princ03.pdf.
 Ibid, p. 30.
 “Structured Systems Analysis Methodology”, commonly used in Europe.
 Edward R. Tufte, Envisioning Information. (Cheshire, Connecticut: Graphics Press, 1990), 53.
 For a full comparison of the most common notations, see Hay, Requirements Analysis, 343-387.
* The logical horizion of an entity type is the entity type itself, plus all the entity types it requires-either directly or indirectly. To define one, start with the
entity type in question and follow the trail of mandatory relationships to find all required entity types.
 David C. Hay, Data Model Patterns: Conventions of Thought. (New York: Dorset House, 1995).
This article was originally published in Cutter IT Journal, Vol. 16, No. 1, January 2003. It is reprinted with permission of the publisher, Cutter Information, LLC, Arlington, MA USA www.cutter.com.