Published in TDAN.com July 2000
For several years now, your author has been searching for a catalogue to use for storing the “data about data” that are required to support a data warehouse or any major
application. Alas, it isn’t called a “catalogue” any more, or even a “data dictionary”. It is now called a “metadata repository”, and in keeping with this
new high-falutin’ name, actual examples are way more complex and abstract than seems to be really needed.
Now, no one has ever accused David Hay of being afraid to be abstract when the modeling situation required it. But both the commercial repositories and the generic models being promoted by the
likes of Microsoft, the Metadata Coalition, and the Object Management Group take this too far.
The fact of the matter is that, in most situations, there are relatively few, very well defined, things that we want to keep track of in a catalogue. To model these things should not be very
difficult. The models in this article took less than two days to develop.
This is the first of two articles which present a simple set of models to describe a catalogue that will support a typical application. Yes, these are sketches, and they could certainly be made
more elaborate. Indeed, there may be errors here. But I believe that they constitute a reasonably accurate representation of the things they set out to represent – concisely and in concrete terms.
This article is about the conceptual model and other analysis artifacts. Next issue’s article will describe database design.
The first thing to collect in a catalogue is the object model that is the basis for the system’s architecture. To do this, we start with ObjectClass and
Attribute. The UML representation of these is shown in Figure 1. It shows the things that we are interested in as boxes and the relationship between them as a line.
An ObjectClass is the definition of an object – a thing of significance to an organization, about which it wishes to capture information. Indeed “ObjectClass” is itself an example
of an ObjectClass.
An Attribute is such a piece of information about an ObjectClass. “Attribute” is also an example of an ObjectClass. The line connecting the two boxes together means that an occurrence
of an ObjectClass is associated in some way to one or more occurrences of Attribute. Specifically, the notation means that each occurrence of ObjectClass may be associated with no, one, or many
occurrences of Attribute, and that each Attribute must be associated with exactly one ObjectClass. This may be more gracefully worded as: each ObjectClass may be described
by one or more Attributes, while each Attribute must be about one and only one ObjectClass. “Described by” and “about” are called “role
names” that describe the relationship in each direction.
Since “ObjectClass” is an ObjectClass, it has attributes – well, one, at least. While there might be many attributes of ObjectClass, the only one of interest to us here is simply its
“Name”. This is shown in the figure. Attributes of Attribute also include “Name”, as well as its data “Type”, “Maximum length”, “Average
length”, number of “Decimal” places, and so forth.
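As a minimal sketch (an illustration added here, not one of the article's figures), the Figure 1 relationship could be written down in Python roughly as follows. The class and field names follow the text; the helper `add_attribute` and the `"string"` data type are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Attribute:
    """A piece of information about exactly one ObjectClass."""
    name: str
    about: "ObjectClass"                  # mandatory: one and only one ObjectClass
    data_type: Optional[str] = None
    maximum_length: Optional[int] = None
    decimal_places: Optional[int] = None


@dataclass(eq=False)
class ObjectClass:
    """A thing of significance about which information is held."""
    name: str
    described_by: List[Attribute] = field(default_factory=list)  # zero or more

    def add_attribute(self, name: str, **properties) -> Attribute:
        # Hypothetical helper: creating the Attribute through its ObjectClass
        # guarantees that every Attribute is "about" exactly one ObjectClass.
        attribute = Attribute(name=name, about=self, **properties)
        self.described_by.append(attribute)
        return attribute


# "ObjectClass" and "Attribute" are themselves occurrences of ObjectClass.
object_class = ObjectClass("ObjectClass")
object_class.add_attribute("Name", data_type="string")

attribute_class = ObjectClass("Attribute")
for attribute_name in ("Name", "Type", "Maximum length", "Average length", "Decimal"):
    attribute_class.add_attribute(attribute_name)
```

Note that the catalogue describes itself here: the two occurrences created at the bottom are the two boxes of Figure 1.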
ObjectClasses may be associated with each other (as indeed the ObjectClass “ObjectClass” in this diagram is associated with the ObjectClass “Attribute”). Each association is
composed of two Roles, one going in each direction. In Figure 1, “about” and “described by” are examples of Roles. Also in each direction there are
characteristics of the association. Are occurrences of the association mandatory or optional? This is a measure of its “Optionality”. Can you have more than one occurrence of an
association for each occurrence of an entity? This is a measure of its “Cardinality”. These qualities are shown in an object model by the symbols “0..*” (for zero or more)
and “1” (for exactly one).
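The reading convention for these symbols can be sketched in a few lines of Python. This is only an illustration: the function names are invented for the example, and treating a bare "*" as "0..*" is an assumption borrowed from common UML usage rather than something stated in this article.

```python
def describe_multiplicity(multiplicity: str) -> dict:
    """Split a UML multiplicity into its optionality and cardinality."""
    if multiplicity == "*":               # assumed shorthand for "0..*"
        lower, upper = "0", "*"
    else:
        lower, _, upper = multiplicity.partition("..")
        if not upper:                     # "1" is shorthand for "1..1"
            upper = lower
    return {"optional": lower == "0", "many": upper == "*"}


def association_sentence(subject: str, role: str, multiplicity: str, target: str) -> str:
    """Word one direction of an association the way the text does."""
    parts = describe_multiplicity(multiplicity)
    verb = "may be" if parts["optional"] else "must be"
    count = "one or more" if parts["many"] else "one and only one"
    return f"Each {subject} {verb} {role} {count} {target}."


print(association_sentence("ObjectClass", "described by", "0..*", "Attributes"))
print(association_sentence("Attribute", "about", "1", "ObjectClass"))
```

The first character of the symbol drives "may be" versus "must be"; the last drives "one or more" versus "one and only one".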
Figure 2, then, shows that each ObjectClass may be connected [to other object classes] via one or more Roles. Because Roles represent the two ends of an association, each Role must
be connected to exactly one other Role, and the second Role must be connected from the first Role.
A “sub-type” is an ObjectClass that contains some of the occurrences of a “super-type” entity. That is, the occurrences of a super-type are categorized into two or more
sub-types. Thus, Figure 2 shows that each ObjectClass may be a super-type of two or more other ObjectClasses (zero, two, or more). Each ObjectClass, in turn, may be a
sub-type of one and only one other ObjectClass (zero or one). (Yes, some would assert that an ObjectClass may be a sub-type of more than one other ObjectClass, but this
version of the repository won’t allow it.)
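One way to picture the Figure 2 rules is the sketch below. It is an illustration under stated assumptions: `associate`, `add_sub_type`, and `has_valid_sub_types` are names invented here, and the sub-type names are chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ObjectClass:
    name: str
    roles: List["Role"] = field(default_factory=list)        # connected via 0..* Roles
    super_type: Optional["ObjectClass"] = None                # zero or one
    sub_types: List["ObjectClass"] = field(default_factory=list)

    def add_sub_type(self, sub_type: "ObjectClass") -> None:
        # This version of the repository does not allow multiple super-types.
        if sub_type.super_type is not None:
            raise ValueError(f"{sub_type.name} already has a super-type")
        sub_type.super_type = self
        self.sub_types.append(sub_type)

    def has_valid_sub_types(self) -> bool:
        # Zero, two, or more: a single sub-type categorizes nothing.
        return len(self.sub_types) != 1


@dataclass(eq=False)
class Role:
    name: str
    object_class: ObjectClass
    multiplicity: str                     # e.g. "0..*" or "1"
    other_end: Optional["Role"] = None    # filled in by associate()


def associate(first: Role, second: Role) -> None:
    """Pair two Roles so that each is the other end of one association."""
    first.other_end, second.other_end = second, first
    first.object_class.roles.append(first)
    second.object_class.roles.append(second)


object_class = ObjectClass("ObjectClass")
attribute = ObjectClass("Attribute")
described_by = Role("described by", object_class, "0..*")
about = Role("about", attribute, "1")
associate(described_by, about)

# Hypothetical sub-types, used only to exercise the "two or more" rule.
attribute.add_sub_type(ObjectClass("DerivedAttribute"))
one_sub_type_is_valid = attribute.has_valid_sub_types()      # False
attribute.add_sub_type(ObjectClass("OtherAttribute"))
two_sub_types_are_valid = attribute.has_valid_sub_types()    # True
```

The "two or more" rule is checked separately rather than enforced on every insert, since sub-types necessarily arrive one at a time.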
Ok, so much for object classes and attributes. Suppose you are one of those old-fashioned people who still model with entities and relationships? What does that model look like? This is shown in
Figure 3. In the entity/relationship model, each Entity may be described by one or more Attributes. It may also be connected to one
or more RelationshipEnds, where each RelationshipEnd must be connected to exactly one other RelationshipEnd. Each Entity may be a supertype of two or
more other Entities, and each Entity may be a sub-type of one and only one other Entity.
Funny thing about the meta-model of entities and relationships: Figure 3 looks just like Figure 2’s meta-model of objects and roles, with a couple of names changed. This is not actually a
coincidence, though, since they in fact represent the same things. An object class model (at least as much as we’ve seen so far) is in fact an entity/relationship model. Both an Entity and
an Object Class represent a thing of significance to the business about which it wishes to hold information. The two models are sufficiently alike, for that matter, that the UML repository model
itself can be represented as an e/r diagram. This is shown in Figure 4.
This model is exactly equivalent to the UML model shown above, with one minor exception that will be described below. To be sure, the typography and the graphics are different. Instead of the first
character “0” in the relationship notation, you see a dashed line half from the first entity. This means that the relationship is optional (“may be” in the association
sentences above). Instead of the first character “1”, you see a solid half line from the first entity. This represents a mandatory relationship (“must be” in the sentences
above). Instead of the second character “*” you see a “crow’s foot” symbol for “one or more”. Absence of a crow’s foot represents “1” as
the second character, for “one and only one”. These differences have no effect whatsoever on the content of the model, however. Typographic changes also have no effect on the content of the model.
Again, each entity may be described by one or more attributes and each attribute must be about one and only one entity. Also, the model says that each entity may be connected
via one or more relationship ends and that each relationship end must be connected to one and only one entity. As before, each relationship end must be connected to exactly
one other relationship end.
The one place where the entity/relationship model is not quite as expressive as the UML model is in the assertions about super-types and sub-types. It can say that each entity may be a
sub-type of one and only one other entity. But going the other direction it can only say that an entity may be a super-type of one or more other entities. It cannot constrain the
statement to two or more.
Which notation to use has been the cause of extensive debates in our industry over the years. Since your author is writing this article, for aesthetic reasons, he is going to pull rank and use the
entity/relationship notation that he prefers for the rest of the article. It is to be hoped that the equivalence between this and a UML version of the same subject has been adequately demonstrated.
In relational theory, derived attributes are not permitted. The information content of a derived attribute, after all, is contained in the rest of the model. Sometimes, however, it is useful to
show a derived value, in order to clarify the meaning of certain structures and calculations. To include a derived attribute in an object class or entity/relationship model is not to assert whether
the result should be calculated when the data are stored or when they are retrieved. That is a design decision. It only represents the fact that the value may in fact be derived from others.
Object modeling does differ from entity/relationship modeling in this regard. In an object model, in effect all attributes are derived. That is, to refer to an attribute is to ask the object for
its value. Whether the value is computed or retrieved from a data store is of no concern. Either way, processing is involved. In a relationally oriented data model, this is not the case. To refer to an
attribute is to retrieve its value from a data store, unless it is explicitly identified in some way as computed.
Figure 5 shows that an attribute may be either a derived attribute or an other attribute. Here we have examples of the sub-types described above. Derived attribute and other attribute are sub-types
of attribute. An other attribute is one whose value will be stored in a database. A derived attribute, on the other hand, is an attribute whose value is not given. Instead its value is
determined from the values of other attributes.
Each derived attribute may be calculated via one or more derivations. A derivation is an algorithm that describes the calculations involved. (A derivation must be the source of a
derived attribute.) Its primary attribute is a “Formula”. A derivation in turn may be composed of one or more derivation elements. (Each derivation element must be part
of one and only one derivation.) Each derivation element is a term in the derivation “Formula”. That is, each derivation element is either a “Constant”, or is the
use of another attribute.
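To make the structure concrete, here is a small sketch of derived attributes, derivations, and derivation elements. The example attribute names ("Quantity", "Unit price", "Extended price") and the method `source_attributes` are assumptions introduced for illustration; the formula is held as plain text and is not evaluated.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Attribute:
    name: str


@dataclass(eq=False)
class OtherAttribute(Attribute):
    """An attribute whose value is stored in a database."""


@dataclass(eq=False)
class DerivationElement:
    """One term in a formula: either a constant or the use of an attribute."""
    constant: Optional[float] = None
    attribute: Optional[Attribute] = None

    def __post_init__(self) -> None:
        if (self.constant is None) == (self.attribute is None):
            raise ValueError("an element is either a constant or an attribute, not both")


@dataclass(eq=False)
class Derivation:
    formula: str
    elements: List[DerivationElement] = field(default_factory=list)

    def source_attributes(self) -> List[Attribute]:
        # The attributes whose values the derived value is determined from.
        return [e.attribute for e in self.elements if e.attribute is not None]


@dataclass(eq=False)
class DerivedAttribute(Attribute):
    """An attribute whose value is determined from other attributes."""
    derivations: List[Derivation] = field(default_factory=list)


# Hypothetical example: extended price = quantity * unit price.
quantity = OtherAttribute("Quantity")
unit_price = OtherAttribute("Unit price")
derivation = Derivation(
    formula="Quantity * Unit price",
    elements=[DerivationElement(attribute=quantity), DerivationElement(attribute=unit_price)],
)
extended_price = DerivedAttribute("Extended price", derivations=[derivation])
```

Nothing here decides whether the value is computed on storage or on retrieval; as noted above, that remains a design decision.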
Each occurrence of an entity (or an ObjectClass) must be unique. In object-oriented development, uniqueness is achieved by assigning a surrogate “object identifier” to each object
occurrence. This is not shown in the object class model. Relational database developers, on the other hand, are concerned about determining the attributes which uniquely identify an occurrence of
an entity. To be sure, surrogate keys are often used in a relational environment as well, but they must be identified, and to the extent that natural attributes can be used to identify occurrences,
it is useful to do so. Whichever kind of identifier is used, it is necessary to identify it.
The set of columns that uniquely identify each occurrence of an entity is called the unique identifier of that entity. Each unique identifier may be composed of one or more attributes or
one or more relationship ends, or both. This is shown in Figure 6.
(Naturally, there is a business rule that states that the unique identifier of an entity must be composed of attributes that are about and relationship ends that are connected to the same entity.)
For example, a contract line item might be identified by a combination of the attribute “Line number” and its relationship to one and only one contract. That is, to identify an
occurrence you must specify both the identifier of a contract, and the “Line number” of the line item.
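The contract line item example, together with the business rule that every component must belong to the entity being identified, might be sketched like this. The role name "part of contract" and the choice to enforce the rule in `__post_init__` are assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Entity:
    name: str


@dataclass(eq=False)
class Attribute:
    name: str
    about: Entity


@dataclass(eq=False)
class RelationshipEnd:
    name: str
    connected_to: Entity          # the entity this end is attached to


@dataclass(eq=False)
class UniqueIdentifier:
    entity: Entity
    attributes: List[Attribute] = field(default_factory=list)
    relationship_ends: List[RelationshipEnd] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Business rule: every component must belong to the identified entity.
        for attribute in self.attributes:
            if attribute.about is not self.entity:
                raise ValueError(f"{attribute.name} is not about {self.entity.name}")
        for end in self.relationship_ends:
            if end.connected_to is not self.entity:
                raise ValueError(f"{end.name} is not connected to {self.entity.name}")


contract = Entity("contract")
line_item = Entity("contract line item")
line_number = Attribute("Line number", about=line_item)
part_of = RelationshipEnd("part of contract", connected_to=line_item)

# A line item is identified by its "Line number" plus its relationship to a contract.
line_item_id = UniqueIdentifier(line_item, [line_number], [part_of])
```

Trying to build an identifier for contract out of `line_number` raises a `ValueError`, which is the business rule at work.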
And the object-oriented among you, dear readers, should be heartened to see that an object class is not required to be identified by a unique identifier.
Figure 7 shows a function, which represents something that is done by the enterprise. For purposes of this article, the only representation of functions that is available is that of a hierarchy.
That is, each function may be composed of one or more other functions. To describe data flows would require another article. The modeling of data flow diagrams is left as an assignment for the reader.
Each function makes use of data. It either uses, creates, updates, or deletes data. In Figure 7, a function data usage is an expression of the fact that a particular function somehow uses either an
attribute or an entity. Attributes of function data usage are of course the indicators showing whether the usage is “create”, “retrieve”, “update”,
“delete”, or some combination of these.
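A rough sketch of the function hierarchy and its data usages follows. The function names and the `crud` summary method are invented for the example, and the used entity or attribute is referred to by name only, to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Function:
    """Something done by the enterprise; functions form a hierarchy."""
    name: str
    parent: Optional["Function"] = None
    children: List["Function"] = field(default_factory=list)

    def add_child(self, name: str) -> "Function":
        child = Function(name, parent=self)
        self.children.append(child)
        return child


@dataclass(eq=False)
class FunctionDataUsage:
    """The fact that a function uses one entity or one attribute."""
    function: Function
    target: str                   # name of the entity or attribute used
    target_kind: str              # "entity" or "attribute"
    create: bool = False
    retrieve: bool = False
    update: bool = False
    delete: bool = False

    def crud(self) -> str:
        # Summarize the four indicators as a familiar CRUD string.
        flags = (self.create, self.retrieve, self.update, self.delete)
        return "".join(letter for letter, on in zip("CRUD", flags) if on)


# Hypothetical functions and usages.
manage_contracts = Function("Manage contracts")
draft_contract = manage_contracts.add_child("Draft contract")
usages = [
    FunctionDataUsage(draft_contract, "contract", "entity", create=True),
    FunctionDataUsage(draft_contract, "party", "entity", retrieve=True),
    FunctionDataUsage(draft_contract, "Line number", "attribute", create=True, update=True),
]
```

Collected across all functions, such usages amount to the familiar function/entity matrix.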
It is possible to define views of entities. For example, in the conceptual model, we have party and contract. Business viewers of the model, however, might be interested
in seeing customer and vendor. A “customer” may be defined simply as a party who is buyer in a contract, where our company is the seller in the same contract.
Similarly, a “vendor” is a party who is seller in a contract, where our company is the buyer in the same contract.
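At the level of actual data, these two definitions amount to simple filters over party and contract. The sketch below shows that idea only; the company names and the functions `customers` and `vendors` are hypothetical, and it does not model the repository's own entity view structures.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Party:
    name: str


@dataclass(frozen=True)
class Contract:
    buyer: Party
    seller: Party


def customers(contracts: Iterable[Contract], our_company: Party) -> List[str]:
    """Parties who are buyer in a contract where our company is the seller."""
    return sorted({c.buyer.name for c in contracts if c.seller == our_company})


def vendors(contracts: Iterable[Contract], our_company: Party) -> List[str]:
    """Parties who are seller in a contract where our company is the buyer."""
    return sorted({c.seller.name for c in contracts if c.buyer == our_company})


us = Party("Our Company")
acme = Party("Acme")
globex = Party("Globex")
contracts = [
    Contract(buyer=acme, seller=us),
    Contract(buyer=us, seller=globex),
    Contract(buyer=acme, seller=us),
]

print(customers(contracts, us))
print(vendors(contracts, us))
```

Neither "customer" nor "vendor" needs to exist as an entity in the conceptual model; both fall out of party, contract, and the two relationships between them.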
Figure 8 shows how an entity view can be composed of some combination of entity elements, attribute elements, or relationship elements, where each of these is the use of either
an entity, an attribute, or a relationship, respectively.
These entity views are essentially the same as what some authors refer to as “business objects” – the objects perceived by business people in their everyday life.
Following the object-oriented idea further, it is possible to describe the “behavior” of each of these entity views, as shown in Figure 9. This is shown by the entity view behavior,
where each view behavior must be of an entity view and the use of a function.
Should you, dear reader, take exception to any of the models presented above – good! It is about time we had a discussion on the specific content we expect in a repository, instead of being
surrounded by fluff pieces talking about what a good idea it is.
Tell me exactly which entities (object classes) and/or relationships (associations) you disagree with. The purpose of a data model is to be wrong. This one is an assertion of your author’s
best guess as to the truth, and it is there for people to correct.
Please post your disagreements on the Data Management Mailing list. Subscribe by sending an e-mail to firstname.lastname@example.org, or go to its homepage at
http://www.egroups.com/list/dm-discuss. I appreciate any comments, good or bad.