UML as a Data Modeling Notation, Part 1

This series of articles has two audiences: The data modelers who have been convinced that UML has nothing to do with them; and UML experts who don’t realize that data modeling really is different from object modeling (and the differences are important). The authors’ objective is to finally bring the two groups together in peace.

The series is in three Parts. Part I, here, sets the stage, describing the basic differences between the notations and how they can be reconciled. Part II, which will be published in the next issue of The Data Administration Newsletter, goes into more detail, addressing sub-types and constraints, along with both what in UML should not be used for a data model, and what “stereotype” has to be added. And, since the whole point of preparing a data model is to present it to management for validation, Part III discusses the aesthetics of preparing and presenting data models – no matter what notation is used.

INTRODUCTION 1: FOR THE DATA MODELER

Premise: A class model in UML is not the same thing as an entity/relationship model.

Data modelers come in (at least) two flavors. Some of you (the database group) view data modeling as a prelude to database design, and in fact include many relational design concepts (such as “foreign keys”) in your data models. Using ERwin as a modeling tool is particularly conducive to that way of thinking. The second group (the business concepts group) views data modeling as a way to capture the language of a business without regard to the database management system that might one day capture the data. This group tends to view the world a bit more abstractly and is more concerned with accurately describing the business than with concerns about such things as database performance.

Both of these groups find UML to be at least annoying, if not threatening to their world views. First, the object-oriented approach to data is dramatically different from the relational database approach. Most significantly, object-orientation makes extensive use of sub-typing (inheritance), while this cannot be directly represented in a purely relational database.

Second, while the database administrator views a database as a corporate resource and is concerned with controlling access to the data and to their definitions, the object-oriented developer is concerned with data as the basis for program design. As used by an object-oriented designer, a class in UML refers to a piece of program code that describes its attributes and behaviors, with objects in that class coming into existence and going out of existence as needed.

UML “class models” are different from the business conceptual entity/relationship models as well, because the object-oriented community is not constrained in specifying what constitutes a class. Pretty much anything can be an object to be collected into classes. In the entity/relationship world, on the other hand, in a business-oriented data model, an entity class can only be something of significance to the business, about which it wishes to hold information.

Of course, data modelers themselves are sometimes a bit free-wheeling in their decisions as to what constitutes an object, so perhaps they too can learn from this article.

The problem is that UML is here. Whatever its flaws, it is widely recognized as a standard. We can proclaim (as your authors have proclaimed) that it is fundamentally different from data modeling and has nothing to do with database design – but clients and hiring managers keep asking whether you know how to do UML.

One of your authors has been one of the most vocal opponents of UML’s approach to class models over the years, primarily on aesthetic grounds [Hay, 1999]. As it happens, however, this year, David Hay has “gone over to the dark side” and been working on a project for the Object Management Group (OMG), the creators of UML. The project is to produce a set of metamodels to describe entity/relationship modeling itself, as well as relational database design, XML Schema design, and others. It is necessary in this project to produce what are essentially “conceptual” entity/relationship models, but since it is the OMG, it is necessary to use UML Class notation.

OK, it’s true. It can be done. To be sure, the tools for manipulating UML are significantly more complex to use, because UML (even the class models) includes way more than is required for an entity/relationship model, but with patience the appropriate sub-set of the notation can be lifted and used. It is a matter of turning off various options in the tool. To be sure, at first it looked as though there were some logical obstacles that prevented this from being done at all. But it turned out that there was a “secret handshake” (again, an option in the tool that had to be turned off) that solved the problem.

This article shows you how to do it. It will take familiar concepts and show you how to represent them using the UML notation. The notation is not (dare we say it?) pretty. But it is usable.

Note, however, that, issues with UML modeling notwithstanding, people even within the data modeling community have very different ideas about what constitutes a “good” data model. Be advised, therefore, that this article does reflect the prejudices of the authors. We have been doing data modeling for over twenty years, and we learned from the beginning to address it as a semantic, not a technical discipline.

This article will reflect that history.

So, even as the article will show UML modelers how to expand their horizons to use their notation in a new way, perhaps it will also give data modelers new insights into the models they produce as well.

INTRODUCTION 2: FOR THE UML MODELER

Premise: An entity/relationship model is not the same thing as a class model in UML.

The Unified Modeling Language (UML) began as a collection of notations to support object-oriented design. It was derived from an assortment of existing approaches and, as a result, is not a single notation, but an array of notations for modeling elements as diverse as classes, behaviors, events, and others.

By the time UML appeared in the mid-1990s, the use of models to support the discovery of system requirements for business was already highly developed. Both data flow diagrams and entity/relationship data models were nearly 20 years old. Modeling in that context (whether it was data flows, events, or data structures) clearly distinguished between modeling the nature of the business and modeling the systems that would support that business.

A particularly powerful tool for describing the nature and structure of a business is the entity/relationship diagram. This drawing of things or objects of significance to a business and relationships between them can be used to allow business people to see clearly what things the analyst has misunderstood. Finding such misunderstandings during a modeling session is vastly cheaper than finding them embodied in a system that was subsequently built at great expense.

Because programs often did describe things in the real world, the UML class models that supported design looked much like the entity/relationship models that supported requirements analysis. Meiler Page-Jones, for example, included in his set of object classes those that referred to real-world things. He wrote of the “domains” of classes—categories such as the “business domain” and the “application domain” [Page-Jones 2000]. From this, the object-oriented world concluded that it had created “object-oriented analysis”.

Well, no. Requirements analysis (as it had developed over the previous 30 years or so) couldn’t be “object-oriented”, any more than it could be “relational-oriented” or “COBOL-oriented”. Requirements analysis is fundamentally about the business, not about technology.

According to the “three amigos” of UML, an “object” is a “discrete entity with a well-defined boundary and identity that encapsulates state and behavior; an instance of a class” [Rumbaugh et al. 1999, p. 360]. A “class”, in turn, is “the descriptor for a set of objects that share the same attributes, operations, methods, relationship, and behavior” [Rumbaugh et al. 1999, p. 185]. Note that there are no constraints in either of these definitions as to what kinds of objects or classes are of interest. Anything is an object.

The entity/relationship modeling world uses classes in a similar way to those in UML, but it has a much narrower definition. First of all, an “entity” (in the entity/relationship world), unlike an “object” (in the object-oriented world), is not concerned with operations, methods, or behavior. These belong to the world of “process modeling.” An entity/relationship model is only concerned with the structure of data. Second, an entity class in an entity/relationship model is not just any “discrete entity with a well-defined boundary and identity”. It is limited to what Richard Barker calls those things or objects “of significance, whether real or imagined, about which information needs to be known or held” [Barker 1989].

Barker’s orientation is toward only those entities that are of interest to a business, while UML encompasses any objects and classes that one can come up with. Indeed, according to James Martin and James Odell, an “object type” (“class”) is simply “a concept” [Martin & Odell 1995, pp. 34, 143].

Any concept.

This includes computer elements and artifacts in addition to those of interest to the business.

Does this mean that the UML class diagram notation cannot be used to produce conceptual entity/relationship diagrams? Of course not. The Barker entities are certainly a subset of the objects defined by the three amigos.

But it is important to realize three things:

Only entity classes that pertain to the business at hand (see footnote 1) will be treated.

Only a subset of the notation used in UML can be used to represent the semantics of a business.

The meaning of the symbols is fundamentally – if subtly – different from their meaning as used in the object-oriented world.

With these slightly different meanings, a diagram using these subsets carries exactly the same semantics as a corresponding diagram using the Information Engineering notation, the Barker-Ellis notation, or any other.

A section has been added, by the way, about the aesthetic characteristics desirable in a data model, with a few words about how to present a model to business observers. This is included because, no matter what notation is used, a conceptual entity/relationship model is intended to be a means for communicating with the business. Contrary to what some in both camps believe, aesthetics is important.

The easy part of these articles (for both audiences) is to understand the notation required for this approach. More difficult is the change in attitude required in each case, in order to be successful.

Before proceeding, three observations should be kept in mind.

There are better and worse data modelers.
There are better and worse UML modelers.
Neither “community” is homogeneous.

The objective of this series of articles is to provide all modelers with guidance as to how to produce an excellent conceptual entity/relationship model using UML Class Diagram notation.

NOTATION

Both the various forms of entity/relationship notation and UML can describe entity classes and relationships. Figure 1 shows a model fragment in the notation developed by Richard Barker and Harry Ellis. It asserts that an instance of an ORDER will be described by values of “Order number” and “Order date”, while a LINE ITEM is described by values of “Line number”, “Quantity”, “Price”, and “Delivery date”. Moreover, a value of “(Extended value)” may be computed for each instance of LINE ITEM as well.

In addition, this model fragment asserts that:

Each ORDER may be composed of one or more LINE ITEMS, and
Each LINE ITEM must be part of exactly one ORDER (see footnote 2).

Figure 1: A Relationship in Barker-Ellis Notation

Figure 2 shows exactly the same information, but in UML form.

In the Barker-Ellis notation, the “may be” part of the first assertion is represented by a dashed line connected to the first entity class (ORDER). In the UML model, this is represented by the “0..” notation next to the second entity class (LINE ITEM). In the Barker-Ellis notation, the “must be” part of the second assertion is represented by a solid line next to the first entity class (LINE ITEM). In the UML model, it is represented by “1..”, next to the second entity class (ORDER).

Instead of the “one or more” part of the first assertion being represented in the Barker-Ellis notation by a “crow’s foot” (<—) symbol next to the second entity class, in the UML version, it is represented by “..*” next to the second entity class. Instead of the “exactly one” part of the second assertion being represented by a straight line (with no crow’s foot) in the Barker-Ellis notation, in the UML model it is represented by the characters “..1” next to the second entity class.

Figure 2: A Relationship in UML

The two forms are semantically equivalent.

Note that the form “1..1” is often abbreviated “1”.

Because the optionality part (“may be” or “must be”) of the notation is next to the first entity class in the Barker-Ellis notation, by convention, the relationship name is next to the first entity class as well. In UML, optionality is denoted by the symbols next to the second entity class. For that reason, in the case of UML, the relationship name is next to the second entity class. To protect the sanity of those who have to work with both notations, in all cases, relationship sentences are read in a clockwise direction: left to right above and right to left below.

In the Barker-Ellis model, mandatory attributes are designated with an asterisk (*) or an octothorpe (#) (see footnote 3) and optional ones are designated with a circle (o). In UML, the same symbols used for relationships are also used for attributes. The mandatory attributes in the example are annotated with “[1]”, representing “1..1”, and meaning that at least one value is required, but no more than one is permitted. Optional attributes are annotated with “[0..1]”, meaning that a value is not required but, again, no more than one value is permitted. In the entity/relationship modeling world, the second “..1” is unnecessary, since the original relational theory rules prohibiting multi-valued attributes are in effect. In the UML world such things are permitted, so the “..1” will always be present.

Note that, strictly speaking, any expressions could be used to describe the roles played by each end of the relationship, but in disciplined data modeling, there are stringent rules, described in the following section.

LANGUAGE

An entity/relationship diagram is primarily a graphic portrayal of English language assertions about an organization. Therefore, the only language to appear on a diagram must use terms relevant to the business. That is, only business terms (and conventional English) may be used as the names of entity classes and the names of roles.

Note that the typographical conventions (all capitals for entity class names, italics for relationships) are unnecessary. Indeed, a case could be made for showing the sentences in all normal case. It is helpful, however, to distinguish the components in this tutorial so that their role in the sentences is clear.

Entity Classes

An Entity class is the name of a “thing or object of significance to a business, whether real or imagined, about which information needs to be known or held.” [Barker 1989, p. 5-1]. This may be a concrete thing, such as PERSON, or GEOGRAPHIC LOCATION, or it may be an abstraction, such as LINE ITEM or PROJECT ROLE.

A subset of the UML concept of “class” can be used for this, provided that it is understood to mean only entity/relationship model classes (that is, things of significance to the enterprise), and only if the conventions described here for naming are followed.

Specifically, the name of an entity class is in the singular, and refers to an instance of that class. Hence, ORDER and LINE ITEM, above. While the name “Project history” is not allowed, an entity class called PROJECT could contain instances over time, so it may in fact be a project “history”. But that is not how it is named. Database table names are not allowed, nor are abbreviations or acronyms (see footnote 4). Classes that are computer artifacts (“window”, “cursor”, and the like) are not allowed.

Attributes

As in UML, an attribute in an entity/relationship model is a characteristic of an entity class that “serves to qualify, identify, classify, quantify, or express the state of an entity” [Barker 1989, p. 5-6]. In the examples above, attributes of ORDER are “Order number” and “Order date”. Attributes of LINE ITEM are “Line number”, “Quantity”, “Price”, and “/Extended value”. The “/” in front of “Extended value” is a UML symbol for a computed field. (Most entity/relationship notations have no such symbol, although your authors’ convention surrounds the name with parentheses.) Each value of /Extended value is derived from the expression, “Quantity times Price”. The algorithm is not shown in an entity/relationship drawing, but must be documented behind the scenes. In UML, it can be shown in an annotation on the drawing.
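The idea of a derived attribute can be illustrated with a short Python sketch. This is only an illustration, not part of either notation: the class and field names are hypothetical renderings of the LINE ITEM attributes, and which attributes are mandatory or optional is assumed here, since the figures are not reproduced in the text. The point is that “/Extended value” is never stored; it is always computed as Quantity times Price.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class LineItem:
    # Stored attributes. The optionality shown is an assumption for illustration.
    line_number: int                       # treated as mandatory, "[1]"
    quantity: int                          # treated as mandatory, "[1]"
    price: Decimal                         # treated as mandatory, "[1]"
    delivery_date: Optional[date] = None   # treated as optional, "[0..1]"

    @property
    def extended_value(self) -> Decimal:
        """Derived attribute "/Extended value": Quantity times Price."""
        return self.quantity * self.price


item = LineItem(line_number=1, quantity=3, price=Decimal("19.99"))
print(item.extended_value)  # 59.97
```

Modeling the derived value as a read-only property rather than a field mirrors the role of the “/” prefix: the value belongs to the entity class, but it cannot be set independently of the attributes it is computed from.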

UML has the ability to display a large number of things about an attribute: its data type, its “visibility” (see footnote 5), whether it is “read-only” or not, and so forth. In the entity/relationship version, the only things to display are the attribute name, whether it is optional or not, its optional “<<ID>>” stereotype (more on that, below), and the optional “/” that designates it as a derived attribute. Datatype must be documented behind the scenes, but, as it adds clutter, it is not normally shown on a diagram used for presentations. It can be included on diagrams if they are solely used for documentation. “Visibility” is a characteristic of an attribute’s use in a particular context, and does not belong on a structural diagram.

As with entity class names, attribute names must be common English names for the characteristics involved. In general, it is not necessary to include the entity class name in the attribute name, but in some companies, standards dictate that the entity class name be inserted in front of the common attributes – for example, “Person name” and “Person ID”.

Relationships and Roles

A relationship between two entity classes consists of two assertions about them. Each assertion is one entity class’s role with respect to the other. This can be described using the UML line for an “association”. In one sense a UML association is equivalent to an entity/relationship relationship, but a relationship in an entity/relationship model is more constrained in what it can represent than is an object-oriented association. Specifically, as will be described below, each relationship is a pair of assertions about the nature of the business. It is not simply recognition that two things are somehow associated with each other.

Note that, while in preliminary entity/relationship models, many-to-many relationships are common, by the time the model has been resolved into a “conceptual” model, they have all been resolved into one-to-many relationships. This is important because often the intersection of the two entity classes contains important business information. Simply saying that each A is related to a lot of Bs and each B is related to a lot of As tells you nothing about each occurrence of an A being related to a B.
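A small Python sketch may make the point about intersection entities concrete. It assumes, purely for illustration, that ORDER and PRODUCT TYPE start out in a many-to-many relationship and that LINE ITEM is the entity class that resolves it; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    order_number: int


@dataclass(frozen=True)
class ProductType:
    name: str


@dataclass(frozen=True)
class LineItem:
    """Intersection entity: each instance is part of exactly one ORDER
    and is for exactly one PRODUCT TYPE."""
    order: Order
    product_type: ProductType
    quantity: int  # information about this particular pairing


order = Order(1001)
widget, gadget = ProductType("Widget"), ProductType("Gadget")

line_items = [
    LineItem(order, widget, quantity=3),
    LineItem(order, gadget, quantity=1),
]

# The many-to-many question is still answerable through the intersection...
product_types_on_order = {li.product_type.name for li in line_items if li.order == order}

# ...and so is the per-occurrence question that a bare many-to-many cannot answer.
quantity_of_widgets = sum(
    li.quantity for li in line_items
    if li.order == order and li.product_type == widget
)
```

The two one-to-many relationships carry everything the original many-to-many did, and the intersection gives attributes such as quantity somewhere to live.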

In Information Engineering and the Barker-Ellis notation for entity/relationship modeling, cardinality (called “multiplicity” in UML) is represented by either the presence or absence of the “crow’s foot” (>-) symbol. Optionality (also known as “modality”) is represented (in Information Engineering) by either a circle (O) or vertical line (|), or (in the Barker-Ellis notation) by whether half of the relationship line involved is solid or dashed.

In UML, cardinality is represented by characters: “..1” (meaning that an instance of the first entity class can be associated with no more than one instance of the second class) or “..*” (meaning that the first entity can be associated with an unlimited number of instances of the second class). A relationship’s optionality can be either “0..” (meaning that the relationship is optional) or “1..” (meaning that it is required).

UML, by the way, unlike entity/relationship modeling, supports a variety of values for maximum cardinality. The expression could be “1,4, > 7”, meaning the value must be exactly 1 or 4, or greater than 7.

Unlike in conventional UML usage, each relationship consists of two ordinary English sentences, although each sentence does have a rigorous structure. Each relationship end is called a “role” in UML. Thus, the relationship portrayed in Figure 2 shows cardinality and optionality in graphic terms. Specifically:

Each
<entity class 1>
must be … if the second entity class has “1..” next to it
(or)
may be … if the second entity class has “0..” next to it
<role name>
one or more … if the second entity class has “..*” next to it
(or)
exactly one … if the second entity class has “..1” next to it
<entity class 2>.
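The sentence structure above is mechanical enough to sketch in a few lines of Python. The function name and parameters are hypothetical, and the sketch deliberately supports only the four combinations used in entity/relationship modeling (“0..1”, “0..*”, “1..1”, “1..*”) plus the “1” abbreviation; any other UML multiplicity raises an error.

```python
def role_sentence(entity_1: str, role_name: str, multiplicity: str,
                  entity_2: str, entity_2_plural: str) -> str:
    """Build the English assertion for one end of a relationship.

    `multiplicity` is the text next to the second entity class.
    """
    if multiplicity == "1":          # common abbreviation for "1..1"
        multiplicity = "1..1"
    lower, upper = multiplicity.split("..")

    # Optionality comes from the lower bound, cardinality from the upper bound.
    optionality = {"0": "may be", "1": "must be"}[lower]
    cardinality = {"1": "exactly one", "*": "one or more"}[upper]

    noun = entity_2_plural if upper == "*" else entity_2
    return f"Each {entity_1} {optionality} {role_name} {cardinality} {noun}."


print(role_sentence("ORDER", "composed of", "0..*", "LINE ITEM", "LINE ITEMS"))
# Each ORDER may be composed of one or more LINE ITEMS.
print(role_sentence("LINE ITEM", "part of", "1", "ORDER", "ORDERS"))
# Each LINE ITEM must be part of exactly one ORDER.
```

Note that the function cannot supply the role name itself. Choosing a prepositional phrase that makes the sentence read naturally remains the modeler’s job.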

Note that to say an ORDER may be composed of one or more items is often expressed as an ORDER is composed of zero, one, or more items, but in your authors’ opinion, the latter is clumsier. Saying this to a non-technical audience sounds, well, technical.

Note also that each role name is in the form of a prepositional phrase, not a verb. The preposition is the part of speech that denotes a relationship. (Remember “Grover words”?) Verbs represent actions, which are the subject of a process-oriented, not a structural, model.

The most common configurations are “1..1”, for “…must be … exactly one…”, and “0..*” for “…may be…one or more…”. As mentioned above, because it is so common, “1..1” is often abbreviated “1”. That means, when reading such a role, the reader must parse “1” into its two components.

Thus, in Figure 2, above, the role reading from right to left produces the sentence:

Each ORDER may be composed of one or more LINE ITEMS.

From left to right, it reads:

Each LINE ITEM must be part of exactly one ORDER. (Note that “1..1” could have been abbreviated “1”.)

Note that if the modeler is successful, these relationship sentences appear almost self-evident to the viewer. These are perfectly normal, non-technical sentences. Not only do they sound normal, they are also strong sentences, such that if the assertions are in fact wrong, a viewer cannot simply let them go. He or she has to disagree with them.

Note also, however, that coming up with such self-evident role names is very hard. To do so means that you must really understand the nature of the relationship, and you must be good at manipulating the English language (or whatever language you are modeling in). Unfortunately, many modelers don’t have the inclination or the ability to do so. The final product suffers (see footnote 6).

This, by the way, is a very different approach to naming roles than is taken in the object-oriented community. There the role is usually a noun, describing what the second entity is about. In many cases, this is indeed a noun form of the relationship coming back (“customer” rather than “customer in”), but in many cases it simply reproduces the name of the entity.

Figure 3 shows a model developed under object-oriented rules. In this, the SUBJECT AREA plays the role of being the containing subject area for the ENTITY. The ENTITY, in turn, plays the role of being the contained entity for the SUBJECT AREA. This is as opposed to the entity/relationship approach of asserting that each ENTITY may be contained in one or more SUBJECT AREAS and that each SUBJECT AREA may be a container of one or more ENTITIES.

Figure 3: Object-oriented Roles (see footnote 7)

A Constraint

Some entity/relationship notations have the ability to describe an “exclusive or” arrangement of relationships. For example, Figure 4 shows the assertion:

Each LINE ITEM must be either for exactly one PRODUCT TYPE, or for exactly one SERVICE.

The “arc” across the relationship lines denotes this.

Figure 4: Exclusive Or in the Barker-Ellis Notation

Not all entity/relationship notations can show this, but in fact UML can. In UML, it is called an “XOR Constraint” and is shown in Figure 5 (see footnote 8).

Figure 5: Exclusive Or in UML
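What the exclusive-or constraint demands of the data can be sketched in Python as a validation rule. This is an illustrative assumption about how one might enforce the assertion, not something either notation prescribes, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProductType:
    name: str


@dataclass(frozen=True)
class Service:
    name: str


@dataclass
class LineItem:
    line_number: int
    product_type: Optional[ProductType] = None
    service: Optional[Service] = None

    def __post_init__(self) -> None:
        # Exclusive or: exactly one of the two "for" roles must be filled.
        if (self.product_type is None) == (self.service is None):
            raise ValueError(
                "Each LINE ITEM must be either for exactly one PRODUCT TYPE "
                "or for exactly one SERVICE"
            )


valid = LineItem(1, product_type=ProductType("Widget"))

try:
    LineItem(2)  # neither role filled, so the constraint is violated
except ValueError as error:
    print(error)
```

Taken separately, each of the two relationships looks optional; it is the constraint across them that makes one, and only one, mandatory.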

Part 2 of this series will explore the representation of sub-types and unique identifiers, as well as some UML features that are, well, unnecessary for an entity/relationship model.

UML as a Data Modeling Notation, Part 2

UML as a Data Modeling Notation, Part 3

UML as a Data Modeling Notation, Part 4

FOOTNOTES:

1. By “business at hand” is meant the subject being modeled, which might be a business or a microbiology lab or a Space Shuttle. The key is that we are interested in describing the “problem space”, not the “solution space.” For convenience in this article, the term will be “business” even though the subject matter could be other than commercial.
2. The UML readers will point out that “part of/composed of” is represented in UML by the additional symbol of a black (or open) diamond. In E/R models, however, the additional symbol is unnecessary because the roles are fully described by the English phrases. (More on this next month.)
3. More on the “octothorpe” when we discuss identifiers in Part 2 of this series.
4. If the acronym is widely accepted in the organization, and if everyone agrees as to what it means, and if to spell it out would be too long and clumsy, then it may be permissible to use it in an entity class name. Maybe.
5. Note to E/R modelers: “Visibility” refers to whether or not an attribute in a module of an object-oriented program can be referred to by other modules. Clearly this is not of interest in a requirements analysis model.
6. As an indication of how important it can be to use the right relationship name, the editors of the Hitchhiker’s Guide to the Galaxy were once “sued by the families of those who had died as a result of taking the entry on the planet Traal literally (it said ‘Ravenous Bugblatter Beasts often make a very good meal for visiting tourists’ instead of ‘Ravenous Bugblatter Beasts often make a very good meal of visiting tourists’).” Douglas Adams, The Restaurant at the End of the Universe (New York: Pocket Books, 1980), p. 38.
7. Our thanks to Jim Logan of Model Driven Solutions for this model.
8. Note that the role name “for” as a property of LINE ITEM is duplicated. This is not a logical problem, since “for a SERVICE” is not the same as “for a PRODUCT TYPE”. It can be a problem for UML, though, since the roles are “properties” of the entity class, but the other entity classes are not. In entity/relationship modeling, however, this is not an issue. See the discussion of “Dealing with quirky UML concepts” next month.

References:

Barker, R. 1989. CASE*Method: Entity Relationship Modeling. (Wokingham, England: Addison-Wesley).

Box, George E. P.; Norman R. Draper (1987). Empirical Model-Building and Response Surfaces, p. 424, Wiley. Available at: http://en.wikiquote.org/wiki/Special:BookSources/0471810339

Hay, D. 1999. “UML Misses the Boat,” East Coast Oracle Users’ Group: ECO 99 (Conference Proceedings / HTML File). Apr 1, 1999. Available at http://essentialstrategies.com/publications/objects/umleco.htm

Hay, D. 2003. Requirements Analysis: From Business Rules to Architecture (Upper Saddle River, NJ: Prentice Hall PTR).

Miller, G. A. 1956. “The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information”, The Psychological Review, Vol. 63, No 2 (March, 1956). Available at http://www.musanim.com/miller1956/.

Martin, J., and James Odell. 1995. Object-Oriented Methods. (Englewood Cliffs, NJ: Prentice Hall).

Page-Jones, M. 2000. Fundamentals of Object-Oriented Design in UML. (New York: Dorset House). Pp. 233-240.

Rumbaugh, J., Ivar Jacobson, Grady Booch. 1999. The Unified Modeling Language Reference Manual.


David Hay