Mastering Data Modeling

Authors: John Carlis & Joseph Maguire
Publisher: Addison-Wesley, 2000
ISBN 020170045X

First impression

I confess to greeting books on data modeling with suspicion. Invariably they take a slightly or greatly different approach than I do. I have worked hard over the years to come up with a particular
way of going about this art, and I don’t take to people telling me how to do it differently. After all, if other people do it differently, they’re probably wrong, eh?

Well, Mastering Data Modeling: A User Driven Approach, by John Carlis and Joseph Maguire, indeed approaches some aspects of data modeling differently than I do, but unfortunately, in most
cases they are right. Now that’s really irritating!

Actually, this turns out to be an excellent book. What is significant is that the authors’ premise is exactly right: Unlike many in this field, they believe strongly that the reason you
create a data model is to communicate better with the world of people who don’t know about system development – or data modeling – but who ultimately will have to use its
products.

Much of what is written about data modeling is for data modelers and system developers. There are books on syntax and on various attempts to develop industry standards. What books about data
modeling should be promoting instead is better ways to get through to the uninitiated. This book does that. This is the first book I have encountered whose main thrust is how to make data models
accessible to the public at large. It has been sorely needed.

Because they bring the right attitude and approach to data modeling, I don’t mind when they offer conventions for particular aspects of it that differ from mine. If they have a
better way than I do of getting their points across, I salute them.

In some cases, I will respectfully continue to disagree, but in others, I think they may be onto something that I will certainly pay attention to in my future work.

The authors contend that a data model (what they call a “logical data structure” or LDS) should be developed as a tool for communicating with users. This has four important
implications:

  • The LDS is very much based on language, and skill in data modeling is based on skill in using language.
  • It is important for the graphics of an LDS to be as simple and unintimidating as possible. The pair’s approach is to use four symbols: The entity box, the relationship line, the
    “chicken foot” (I call it a “crow’s foot”, but who’s counting?), and the identifier line. Figure 1 shows the entire graphic vocabulary. Cow and herd are
    entities. One or more cows are related to exactly one herd. “Cow ID”, “name”, and “birth weight” are attributes of cow. “Herd ID” is an attribute that
    identifies occurrences of herd. “Cow ID” and the relationship from cow to herd together identify occurrences of cow.
  • They don’t represent optionality in their models. By this one move, they remove much of the complexity present in other model notations. They believe that whether a relationship (or an
    attribute) is required or not is a constraint that cannot be unambiguously asserted when first sketching out models.
  • Indeed, no constraints appear at all on an LDS. The authors believe that in many cases the constraints identified initially are wrong, they change, and they often reflect processing, rather
    than data structure.


Figure 1 : LDS Notation
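
The cow-and-herd example can also be sketched in code. This is only an illustration, not the authors’ notation: the class names, the field types, and the identifier method are assumptions introduced here to show how “Cow ID” and the relationship to herd together identify a cow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Herd:
    herd_id: str  # "Herd ID" identifies occurrences of herd


@dataclass(frozen=True)
class Cow:
    cow_id: str                           # part of the identifier
    herd: Herd                            # each cow is related to exactly one herd
    name: Optional[str] = None            # the LDS does not say whether attributes are required
    birth_weight: Optional[float] = None  # units are not given in the example

    def identifier(self) -> tuple:
        # "Cow ID" alone is not enough: the herd relationship is part of the identifier.
        return (self.herd.herd_id, self.cow_id)


north = Herd("H1")  # hypothetical sample data
south = Herd("H2")
first = Cow("C1", north, name="Bessie")
second = Cow("C1", south)
print(first.identifier() != second.identifier())
```

Two cows may share a “Cow ID” as long as they belong to different herds, which is what the identifier line across the relationship expresses.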

A significant part of the book is about the process of developing models jointly with users. Their focus is on getting people to articulate just what it is they want a system to
“remember”. A relationship might be read, “about each herd we can remember its cows”. An attribute might be described by saying “about each cow we can remember its
birth weight”. They use many examples to show how a rough idea of a model is slowly refined as more information is extracted from the audience.

As I said, because their premises are right, and because they are serving such a valuable purpose in their book, my responses to the specifics of their conventions are clearly less important. For
what it’s worth, however, they are these:

Graphics

First of all, I really appreciate the simplicity of their graphics. Our industry is rife with people trying to design ever more complex notations in the attempt to capture every imaginable nuance
of data structure. In fact, the amount of information to be discussed with users is relatively small, and the notation should reflect that.

The dropping of optionality is fascinating. For years I was looking for a notation that let me say “initially may be but eventually must be”. Recently I discovered that Clive
Finkelstein has just such a notation, but even that doesn’t really describe the circumstances under which the relationship becomes mandatory. When I discussed a metamodel with Larry English, he
observed that it is important to recognize that whether a relationship or an attribute value is required depends on the state of the entity involved. I included that assertion in my metamodel, but
no notation really supports it.

Messrs Carlis and Maguire simply ignore the whole question. They believe that this is a kind of business rule that should be documented outside the model drawing. They may be right. One of the
reasons that IDEF1X and the UML are as aesthetically messy as they are is their treatment of optional relationships.
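
One way to document such a rule outside the drawing is as a small table keyed by entity state. The sketch below is hypothetical throughout: the states “expected” and “born”, the attribute names, and the function missing_attributes are invented here for illustration and come from neither the book nor any notation.

```python
# Hypothetical rule table: which attributes are required in which state.
REQUIRED_BY_STATE = {
    "expected": {"cow_id"},               # initially, birth weight "may be" recorded
    "born": {"cow_id", "birth_weight"},   # eventually, it "must be" recorded
}


def missing_attributes(state: str, values: dict) -> set:
    """Return the attributes required in this state that have no value yet."""
    present = {name for name, value in values.items() if value is not None}
    return REQUIRED_BY_STATE[state] - present


calf = {"cow_id": "C1", "birth_weight": None}
print(missing_attributes("expected", calf))
print(missing_attributes("born", calf))
```

The same record satisfies the rule in one state and violates it in another, which is the point a single “optional” or “mandatory” mark on a diagram cannot capture.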

The authors discuss model patterns (they call them “shapes”) at length. My first dispute with the book is that these are low-level, abstract shapes, such as collections, subordinate
entities, and many-to-many resolutions. While these are useful and should be understood, some are a bit too arcane for my taste. My real dispute with their approach, however, is that they
specifically argue against using more business-oriented patterns – which they believe make application models too abstract. They argue that this can make them harder for the
intended audience to understand. There is merit in what they say, but in my experience – if presented with care – audiences can understand the more generalized models, and they benefit from doing so.

I do believe that their models would benefit from positional conventions. For example, Figure 2 shows a model the authors used to present what they call their “chicken feet in” and
“chicken feet out” shapes. (Note the lines across the relationships from aspiration. These indicate that an occurrence of aspiration is identified by the skill and the
creature it applies to.)


Figure 2 : LDS Shapes

An alternative arrangement, which in no way violates the rules the authors set down, points chicken feet (toes) either to the left or the top of the diagram. This shows more clearly which are
reference entities and which are transactions. Clearly, the model in Figure 3 is about creatures and skills, and refers to dates. The data about those things are achievement,
aspiration, practice session, and exam.


Figure 3 : Better LDS Shapes

Vocabulary

I like the authors’ attitude towards naming. Their first priority is to devise names that are meaningful to the world at large. They insist that entity names be real things that make sense to
the users. As mentioned above, they discourage making models too abstract – arguing that more concrete names are more meaningful to the audience. This is certainly a worthy goal. The problem
is that the users’ language is often too imprecise for what we’re trying to do. I have found that many times, the data modeler’s greatest contribution can be the introduction of a
new, more precise vocabulary than the one they have been using.

As discussed above, an extension of this idea is that of making the model more abstract – connecting it with industry patterns. This has two benefits: It expands the audience’s
horizons, forcing the participants to think outside their concrete world; and it specifically allows them to understand the more general principles that, when used as the basis for system
architecture, allow for the building of systems that will last.

Relationships

I disagree with the authors’ approach to naming relationships. They name relationships with verbs, in the form:

Each [entity] can (or “must” if the relationship is identifying) [verb] one (or) one or more [entity].

For example, “each creature can desire one skill”, and “each skill can be desired by one or more creatures”.

Each relationship name must be a verb, and it is desirable for the relationship in one direction to be the active form and the relationship in the other direction to be the passive form of the same
verb.
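
The sentence form is mechanical enough to sketch as a small function. This is an illustration under assumptions, not something the book provides: the function name read_relationship and its parameters are made up here, and the caller supplies the target already in singular or plural form.

```python
def read_relationship(subject: str, verb: str, target: str,
                      many: bool = False, identifying: bool = False) -> str:
    """Build a relationship sentence in the verb-based form described above."""
    modal = "must" if identifying else "can"     # "must" only for identifying relationships
    quantity = "one or more" if many else "one"
    return f"each {subject} {modal} {verb} {quantity} {target}"


# Active form in one direction, passive form of the same verb in the other.
print(read_relationship("creature", "desire", "skill"))
print(read_relationship("skill", "be desired by", "creatures", many=True))
```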

Their structure is consistent, and this does yield reasonable sentences, most of the time. The problem with it is that – in my humble opinion – the relationship names should not be
verbs. This implies that there is some sort of action occurring between the entities. But this is a structure model, not a process model. We are not concerned with actions here. We are concerned
with relationships.

The part of speech that describes relationships is not the verb but the preposition. (Remember Grover’s words in Sesame Street?) The only verb that should exist in a relationship
sentence is “be”. The relationship exists. It is only a matter of describing its nature. A better structure would be:

Each [entity] can be [prepositional phrase] one (or) one or more [entity].

(Since I am an unreconstructed specifier of optionality constraints, I use “must be (or) may be” instead of “can be” in that structure, but that is not important here.)

This is consistent with the second sentence above: Each skill can be desired by one or more creatures. But “be” is now part of the standard structure, not the
relationship. The relationship is only the prepositional part. The relationship going the other direction then becomes something like: “Each creature can be desirous of one skill”.

The authors themselves acknowledge a problem when the most reasonable verb is intransitive. In one example, “each element can belong to one or more sets” means that
the relationship in the other direction has to be something like “each set can be belonged to (by) one or more elements”. (Yuchh!) The problem with all this is that
the verb they want to use is “belong”. But “belong” is intransitive. You cannot “belong”
something. You have to either “belong to”, “belong above”, or “belong” something else to it.

Instead, the relationship should read “each element can be part of one or more sets”, and “each set can be composed of one or
more elements”.
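
A minimal sketch of the preposition-based form follows, using the element and set example. The Relationship class and its field names are assumptions made for illustration. Here “can be” lives in the fixed sentence structure, and each direction carries its own prepositional phrase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One relationship, named by a prepositional phrase in each direction."""
    left: str            # singular entity name, e.g. "element"
    left_plural: str
    right: str
    right_plural: str
    left_to_right: str   # e.g. "part of"
    right_to_left: str   # e.g. "composed of"

    def sentences(self) -> tuple:
        # "can be" is part of the sentence structure, not of the relationship name.
        # Both directions read "one or more" because this example is many-to-many.
        return (
            f"each {self.left} can be {self.left_to_right} one or more {self.right_plural}",
            f"each {self.right} can be {self.right_to_left} one or more {self.left_plural}",
        )


membership = Relationship("element", "elements", "set", "sets", "part of", "composed of")
for sentence in membership.sentences():
    print(sentence)
```

Nothing in this structure requires the two phrases to share a word, which is why the active and passive pairing does not carry over.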

Unfortunately, this runs into one other rule they have imposed: According to Messrs. Carlis and Maguire, the relationship going one way should, if possible, be the same verb as the relationship
going the other way, but in passive voice. When using prepositions, this rule goes away. The preposition in one direction of a relationship is often quite different from the preposition in the
other direction.

Sub-types

One problem with their shapes is that they do not recognize sub-types. Instead, they use one-to-one “to be” links, which seem to serve much the same function without constraints, but
which I believe are harder to understand.

Figure 4 shows an example of this. This model asserts that, for example, each human can also be a creature. Each creature, in turn, can be a human. The cross line, however, asserts that each human
is identified by the creature it is. That is, a human must be a creature. Similarly, a monster must be a creature, and a frog must be a creature.

This is almost equivalent to asserting that human, monster and frog are sub-types of creature. Indeed, the model suggests, for example (although the authors do not assert this explicitly), that
human, frog, and monster all inherit the “creature ID” of the creature.


Figure 4 : LDS “To be” Notation

There are two differences, however. First of all, there are no constraints saying that a creature must be a human, a monster, or a frog, or that no creature can be more than one of
those things. This is consistent with the authors’ reluctance to show constraints, but these seem reasonable constraints to express. (Although our colleagues in the industry can’t agree on
whether they should be imposed or not, so maybe these guys have the right idea.) More significantly, however, there is no sense of the fact that the set of creatures consists of frogs, humans and
monsters.
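
The two constraints that the “to be” links leave unstated can be written down as checks outside the diagram. The sketch below is only one possible reading: the CreatureRecord class, the KINDS set, and both function names are hypothetical.

```python
from dataclasses import dataclass, field

KINDS = {"human", "monster", "frog"}  # the three "to be" links in the example


@dataclass
class CreatureRecord:
    creature_id: str
    kinds: set = field(default_factory=set)  # which "to be" links exist for this creature


def is_complete(creature: CreatureRecord) -> bool:
    # Completeness: every creature is at least one of the kinds.
    return len(creature.kinds & KINDS) >= 1


def is_exclusive(creature: CreatureRecord) -> bool:
    # Exclusivity: no creature is more than one of the kinds.
    return len(creature.kinds & KINDS) <= 1


hybrid = CreatureRecord("c1", {"human", "frog"})  # allowed by the links, not exclusive
unknown = CreatureRecord("c2")                    # allowed by the links, not complete
print(is_complete(hybrid), is_exclusive(hybrid))
print(is_complete(unknown), is_exclusive(unknown))
```

A sub-type notation that asserts both rules would reject both records, whereas the one-to-one links accept them.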

I for one am partial to using sub-types. An important aspect of the world that should be modeled seems to be that larger sets of things are composed of smaller sets of things. There are different
kinds of contracts, activities, and products. It is useful to be able to show this fact graphically.

In the chapter describing other notations, they do explicitly discuss sub-types in terms of what they call the “triangle relationship”. They are not completely opposed to the idea, but
they are troubled by some of the constraints on it required by some notations – for example that the notation does not permit the identifiers of sub-types to be different from that of the
super-type. That’s a rule I don’t follow anyway, so I don’t see it as a problem.

So, I think I will continue using sub-types and super-types.

Altogether, in spite of some disagreements about particulars, I am happy to endorse this book as an important addition to the body of data modeling knowledge. I hope the basic ideas described here
catch on.


About David Hay

In the Information Industry since it was called “data processing”, Dave Hay has been producing data models to support strategic and requirements planning for thirty years. As President of Essential Strategies International for nearly twenty-five of those years, Dave has worked in a variety of industries and government agencies. These include banking, clinical pharmaceutical research, intelligence, highways, and all aspects of oil production and processing. Projects entailed defining corporate information architecture, identifying requirements, and planning strategies for the implementation of new systems. Dave’s recently-published book, “Enterprise Model Patterns: Describing the World”, is an “upper ontology” consisting of a comprehensive model of any enterprise, from several levels of abstraction. It is the successor to his ground-breaking 1995 book, “Data Model Patterns: Conventions of Thought” – the original book describing standard data model configurations for standard business situations. In addition, he has written other books on metadata, requirements analysis, and UML. He has spoken at numerous international and local data architecture, semantics, user group, and other conferences.
