From Data Modeling to Ontologies: Discovering What Exists, Part 1

Introduction

Data management’s history as the blending of business management and information technology makes it an unlikely candidate to have anything to do with the worlds of linguistics and philosophy. In recent years, however, companies and their systems have become so complex that the task of retrieving coherent information from various parts of an enterprise has become challenging, to say the least. The time has come to examine other disciplines for help. A particularly noteworthy issue is confusion in semantics—the fact that different parts of an organization often use terms in inconsistent ways. Thus, pulling up a coherent view of an organization has become progressively more difficult.

The time has come to address semantics straight on. Bring in the linguists and the philosophers!

Traditionally, attempts to model the semantics of organizations have made use of various kinds of entity/relationship diagrams. The results of this approach have been decidedly mixed. The data models created by database designers are concerned more with database artifacts (tables, keys, etc.) than with the underlying semantics of the business being modeled.

As an alternative, Harry Ellis and Richard Barker developed an approach to describing a business via a data model that is very effective at capturing semantics. The notation is fundamentally limited, however, by the assumption that a named entity type describes one and only one underlying thing. Moreover, there is no good way to describe two entity types as being “similar” to each other.

In the last 10 years or so, however, web technology has taken on the problem from a different direction. Specifically, Tim Berners-Lee, the inventor of the World Wide Web, in cooperation with his colleagues at the World Wide Web Consortium (W3C), has developed an enhancement to the Web called the Semantic Web. Where the World Wide Web allowed us to link together pages from works published around the world, the Semantic Web will allow us to link together the language in those works from around the world.

This set of two articles is about two steps to be taken to move into the future of understanding and managing a large organization’s semantics—and from that its information.

  • This article describes the move from database design to a more conceptual approach for using entity/relationship models to describe the structure of an organization’s information. That is, instead of modeling data (words and numbers), model information (the nature of the business).
  • The second one will then describe the move to a more comprehensive approach to semantics, using ontological languages to describe the structure of information in a more sophisticated way. This clarifies the way words are used, and allows for the computer to help draw inferences from the model.

About Semantics and Ontology

Semantics
Semantics is “the theory of meaning; it is study of the signification of signs or symbols, as opposed to their formal relations (which is called “syntactics”).”1

This has become important to businesses in recent years, as the growth and increased complexity of companies and their systems have revealed the extent to which different parts of an organization use language in different ways.

Among other things, different departments (and the systems that manage their information) often use…

  • …the same word to mean different things.
  • …different words to mean the same thing.

For systems (or departments, for that matter) to communicate, semantics must be addressed. To do this, the company’s ontology must be created.
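The two kinds of mismatch can be made concrete with a minimal sketch. It assumes each department keeps a glossary mapping its terms to an agreed concept identifier. The department names, terms, and concept identifiers are hypothetical, invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical glossaries: department -> {term: concept identifier}.
GLOSSARIES = {
    "Sales": {"customer": "party-that-buys", "client": "party-that-buys"},
    "Support": {"customer": "party-with-open-ticket", "account": "party-that-buys"},
}


def find_mismatches(glossaries):
    """Return (homonyms, synonyms) found across department glossaries."""
    concepts_by_term = defaultdict(set)
    terms_by_concept = defaultdict(set)
    for terms in glossaries.values():
        for term, concept in terms.items():
            concepts_by_term[term].add(concept)
            terms_by_concept[concept].add(term)
    # Same name, different meanings.
    homonyms = {t: sorted(c) for t, c in concepts_by_term.items() if len(c) > 1}
    # Different names, same meaning.
    synonyms = {c: sorted(t) for c, t in terms_by_concept.items() if len(t) > 1}
    return homonyms, synonyms


homonyms, synonyms = find_mismatches(GLOSSARIES)
```

The sketch presupposes the hard part: that someone has already agreed on the concept identifiers. Supplying that agreement is the job of the ontology.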

Ontology
“Ontology” is one of the latest hot new IT industry buzzwords. OK, the word is 400 years old,2 and it refers to a branch of philosophy that is 2500 years old,3 but who’s counting? That is, it describes “the branch of metaphysics concerned with identifying, in the most general terms, the kinds of things that actually exist.”4

It turns out that identifying the things that exist in a systematic way is not a trivial task. For example, if you replace all of the parts of your chariot/car one by one, at what point is it no longer the same chariot/car? What are the essential characteristics whose values define the identity of the chariot/car, as compared with its accidental characteristics, whose values do not?

In modern times, the term “ontology” has come to mean “a catalog of the types of things that are assumed to exist…

…in a domain of interest,
…with rules governing how those terms can be combined to make valid statements,
…and ‘sanctioned inferences’ that can be made.”5

A common example of this is a glossary of terms for an enterprise—or for an industry. This could be as simple as a collection in one place of common, agreed-upon definitions for the terms of a company. The problem is with the “as simple as” part. Capturing the definitions of a wide range of terms in a large, complex organization has proved to be difficult indeed.

Data Model as an Ontology

An alternative could be a data model, which can graphically represent the things of significance to an organization, the attributes which describe them, and relationships between pairs of them.

But is such a data model really an example of an ontology?

Not necessarily. As usually produced, data models do not describe what exists in an organization. Instead, they describe the way data are structured in a database or elsewhere in a computer system. While the entity types in an ontological model are defined to describe things in the world, the entity types in data models as typically created simply describe information structures.

Tables, columns, primary keys, and foreign keys are not things that exist in a business. These are computer artifacts. A symptom of this is the fact that relationships are often either not labeled at all, or labeled casually. If you haven’t described how things in the world are related to each other, you have not described what exists.

Modeling Notations

In 1977 Dr. Peter Chen published the original approach to data modeling as “Entity-Relationship Approach to Logical Database Design.”6 The notation did not have any technological artifacts, but it did assume a relational database environment.

James Martin and Clive Finkelstein published their original paper “Information Engineering” in 1981.7 In this, they described a comprehensive approach to systems development, beginning with strategic planning and moving through requirements analysis, design, programming, and implementation. To support this method, they developed a notation for representing data structure that was based on Dr. Chen’s approach.

This also presumed a “relational” approach, with foreign keys being described to implement relationships.8

The Information Engineering approach to notation describes relationships with “verb phrases” such as “has,” “orders,” “occupies,” etc. The problem with this is that verb phrases describe processes. They do not describe what exists. Process modeling, after all, is a different kind of modeling. Over the years, the Information Engineering notation has been most often used to describe database design.

Another approach, which was even more focused on relational database design, was IDEF1X, published as a Federal Information Processing Standard in 1993. It found its greatest use in the U.S. Federal Government.9

Barker/Ellis

Also in the early 1980s, Harry Ellis and Richard Barker developed an alternative way of modeling information. Starting, too, from Dr. Peter Chen’s 1977 version of entity/relationship modeling, they wanted to provide a disciplined way of describing company information that is explicit enough to be validated by a group of non-technical business people. This meant forgoing any references to database technology. This effort resulted in a more visually pleasing notation, but more than that, it included an elegant and disciplined approach to naming relationships.

As with the other notations, it is based on rectangles representing entity types,10 and lines representing relationships.

Their premise—which was radically different from what had been published before—was this:

Relationship names assert facts about things that are purported to exist. 

To describe what exists, the verb that must be implicit in all ontological sentences is “to be.” Description of the relationship itself is a prepositional phrase. After all, it is the preposition that is the part of speech that describes relationships.

Those of you young enough to have seen the “Sesame Street” program for children may recall “Grover words”: “up,” “around,” “under,” and so forth. These are prepositions.

The verb “to be” with this approach is in the form “must be” or “may be,” where “must” and “may” are auxiliaries to the verb “be.” These are used to specify the relationship’s minimum cardinality. In addition, the predicate is further supplemented by adding the maximum cardinality, “one and only one” or “one or more.”

The effect of all this is that every relationship direction shown on a model can be read in the form:

Each [entity type] must be / may be [relationship name] one and only one / one or more [entity type].

For example, the model shown in Figure 1 represents pairs of direct, explicit assertions about the automobile business:

1. Person / Automobile / Automobile Brand (compound association)

a. Each Person may be owner of one or more Automobiles, each of which must be an example of one and only one Automobile Brand.

b. Each Automobile Brand may be embodied in one or more Automobiles, each of which must be owned by one and only one Person.

2. Sports Car / Automobile (sub-type)

a. Each Sports Car must be an Automobile.
b. An Automobile may be a Sports Car.

3. Sports Car / Person (simple association)

a. Each Sports Car must be for the exhilaration of one and only one Person.
b. Each Person may be exhilarated by one or more Sports Cars.

Note that each of these sentences is explicit enough to be immediately recognized as a true or false statement. For example, is it true that each Automobile must be owned by one and only one Person? Could it not have an owner? Could it have more than one? The model may not be correct, but it must be clear enough that untruths will come out.
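The reading convention is mechanical enough to sketch in code. The following is a minimal illustration, not part of the Barker/Ellis method itself. The class name `RelationshipEnd` and its fields are hypothetical, chosen only to mirror the pieces of the sentence described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationshipEnd:
    """One direction of a relationship, read from subject to object."""
    subject: str          # entity type the sentence is about
    mandatory: bool       # minimum cardinality: True -> "must be", False -> "may be"
    phrase: str           # prepositional phrase, e.g. "owner of"
    many: bool            # maximum cardinality: True -> "one or more"
    object_singular: str  # target entity type, singular form
    object_plural: str    # target entity type, plural form

    def sentence(self) -> str:
        verb = "must be" if self.mandatory else "may be"
        if self.many:
            target = f"one or more {self.object_plural}"
        else:
            target = f"one and only one {self.object_singular}"
        return f"Each {self.subject} {verb} {self.phrase} {target}."


# The two directions of the Person / Automobile relationship.
owns = RelationshipEnd("Person", False, "owner of", True, "Automobile", "Automobiles")
owned_by = RelationshipEnd("Automobile", True, "owned by", False, "Person", "Persons")
```

Calling `owns.sentence()` and `owned_by.sentence()` yields the two assertions that a business reviewer is asked to confirm or refute.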

One thing that an entity/relationship diagram cannot show is the assertion that the person that is exhilarated must also be an owner. That is, the “for the exhilaration of” relationship could be a sub-type of the “owned by” relationship. This cannot be represented on an entity/relationship diagram.11

It can, however, be represented in the semantic languages to be described in Part 2 of this series.
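As a rough sketch of what that missing constraint amounts to, a relationship can be treated as a set of pairs, and the sub-type rule as a subset test. The car identifiers and person names are hypothetical sample data.

```python
# Hypothetical instance data: each relationship is a set of (car, person) pairs.
owned_by = {("car-1", "Alice"), ("car-2", "Bob")}
for_the_exhilaration_of = {("car-1", "Alice"), ("car-2", "Carol")}


def violations(sub_relationship, super_relationship):
    """Pairs asserted by the sub-relationship but missing from its super-type."""
    return sorted(sub_relationship - super_relationship)


# Carol is exhilarated by car-2 but does not own it, so the rule is broken.
bad_pairs = violations(for_the_exhilaration_of, owned_by)
```

An empty result would mean that every exhilarated person is also an owner, which is exactly the assertion the diagram has no way to draw.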

NOTE: There is nothing to prevent using this approach to relationship naming in Information Engineering, IDEF1X, or even UML diagrams. Indeed, in order for any diagrams to truly represent ontologies, this would be required.



Figure 1: Sample Model

UML “roleNames”

Note that UML models (as normally rendered) are not ontologies.12 Distinguishing them even more from conventional entity/relationship models, these models are normally about the design of object-oriented software: An “association” is not a structure. It is a path to be navigated by program code. A “roleName” is not a predicate; it is a label for the object class—to enable that navigation to take place. Often it simply repeats the name of that class.

So, following the rules to make an entity/relationship diagram into a conceptual model brings us closer to creating an ontology to describe a business in terms of its information. But there is more.

Issues

Even the most semantically sophisticated entity/relationship model, however, falls short of being a complete ontology because of its “closed-world assumption”:

If an assertion does not follow a set of rules for truth, it is presumed to be false. This is derived from our data management view that it is our job to protect a database from “bad” data.

To describe the world that exists, in all of its messiness, this assumption must at least be acknowledged and managed. To the extent that it remains useful—for defining a closed data repository—so be it. But there is a lot of information out there that will not be found if you are limited to this assumption.

Specifically, if you are in the job of exploring the information that exists, rather than trying to create a structure for capturing “clean” data, the closed-world assumption doesn’t work.

Ontology promoters are advocates of the “open-world assumption”:

If your rules haven’t explicitly asserted something to be false, it could be true. This comes from a world where you are exploring a large body of data to obtain insights.

Also, the closed-world view makes much stronger assumptions about the relationship between the name of something and the thing itself. These assumptions may not be valid.

As an example of the issues involved:

  • An E/R rule says that “each City must be located in one and only one State.”
  • A record is received that asserts that a City named “Portland” exists, with no other information.
    • Closed world rejects it: each City must be located in one and only one State, and none is given.
    • Open world accepts that this is all we know, and doesn’t throw it away.
  • A record is received that asserts that a City named “Portland” is located in a State called “Maine.”
    • Closed world accepts it.
    • Open world accepts it.
  • Another record is received that asserts that a City named “Portland” is located in a State called “Oregon.”
    • Closed world rejects it: City may only be in one State.
    • Open world asks questions:
      – Is the State named “Oregon” the same as the State named “Maine”? Or,
      – Is the City named “Portland” that is in the State named “Maine” different from the City named “Portland” that is in the State named “Oregon”?

Note that this exercise doesn’t invalidate the rule, as such. But it clarifies that the rule is not complete without dealing with the way things are named.
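The Portland exercise can be traced in a minimal sketch. It assumes records arrive as simple name pairs, and that the closed-world loader identifies a City by its name alone, which is precisely the naming assumption in question. The `CityRecord` class and both loader functions are hypothetical and written only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CityRecord:
    city: str
    state: Optional[str] = None


def closed_world_load(records):
    """Reject anything that breaks 'one and only one State' for a City name."""
    state_by_city = {}
    rejected = []
    for rec in records:
        if rec.state is None:
            rejected.append((rec, "no State given"))
        elif state_by_city.get(rec.city, rec.state) != rec.state:
            rejected.append((rec, "City already located in another State"))
        else:
            state_by_city[rec.city] = rec.state
    return state_by_city, rejected


def open_world_load(records):
    """Keep every assertion and turn apparent conflicts into questions."""
    states_by_city = {}
    for rec in records:
        states = states_by_city.setdefault(rec.city, set())
        if rec.state is not None:
            states.add(rec.state)
    questions = []
    for name, states in states_by_city.items():
        if len(states) > 1:
            listed = " and ".join(sorted(states))
            questions.append(
                f"Do {listed} name the same State, "
                f"or are there several Cities named {name}?"
            )
    return states_by_city, questions


records = [
    CityRecord("Portland"),
    CityRecord("Portland", "Maine"),
    CityRecord("Portland", "Oregon"),
]
accepted, rejected = closed_world_load(records)
known, questions = open_world_load(records)
```

The closed-world loader ends up with one Portland and two rejected records. The open-world loader keeps everything it was told and surfaces the question a person would have to answer.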

So, there is more to be done to create a proper ontology.

More significantly, this highlights the fact that our data models are constrained by the language we have for describing something. Only indirectly do they address the underlying thing being described.

Part 2 of this series will describe how the languages of the Semantic Web, namely the Resource Description Framework (RDF), RDF Schema, and the Web Ontology Language (OWL), address these ontological issues.

End Notes:

  1. G. Kemmerling. Philosophical Dictionary. http://www.philosophypages.com/dy/s4.htm#sems. 2002.
  2. Oxford University Press. 1971. “Ontology.” The Compact Edition of the Oxford English Dictionary. (New York: Oxford University Press). “[ad. mod. L. ontologia (Jean le Clerc, 1692), f. Gr.onto+logia]”.
  3. Aristotle. ~331 BCE. Categories. Translated by E. M. Edghill. 2007. Aristotle’s Collection. (Publish This, LLC.)
  4. G. Kemmerling. Philosophical Dictionary. http://www.philosophypages.com/dy/o.htm#onty. 2002.
  5. Knowledge Based Systems, Inc. Information Integration For Concurrent Engineering. Prepared for Armstrong Laboratory AL/HRGA. 1994.
  6. Chen, P. 1977. “The Entity-Relationship Approach to Logical Database Design.” The Q.E.D. Monograph Series: Data Management. (Wellesley, MA: Q.E.D. Information Sciences, Inc.)
  7. James Martin and Clive Finkelstein. November, 1981. “Information Engineering.” Technical Report, two volumes. (Lancs, UK: Savant Institute, Carnforth).
  8. Clive Finkelstein. 1992. Information Engineering: Strategic Systems Development. (Sydney: Addison-Wesley Publishing Company).
  9. Federal Information Processing Standard (FIPS). 1993. IDEF1X Federal Information Processing Standard. FIPS pub. 184. Dec., 1993.
  10. Originally, Dr. Chen described an entity as an object in the world. An entity type, represented by a box, stands for all entities which share attributes. That is, an “entity” is an example of (or an instance of) an “entity type”. In the discussion of the Semantic Web in Part 2, what is in the boxes here will simply be called classes.
  11. It can be inferred, however, based on the sub-type structure of the entity classes. The reader is hereby offered the opportunity to articulate just how.
  12. See my 2011 book, UML and Data Modeling: A Reconciliation for a description of how to make UML more ontological.


About David Hay

In the Information Industry since it was called “data processing”, Dave Hay has been producing data models to support strategic and requirements planning for thirty years. As President of Essential Strategies International for nearly twenty-five of those years, Dave has worked in a variety of industries and government agencies. These include banking, clinical pharmaceutical research, intelligence, highways, and all aspects of oil production and processing. Projects entailed defining corporate information architecture, identifying requirements, and planning strategies for the implementation of new systems. Dave’s recently-published book, “Enterprise Model Patterns: Describing the World”, is an “upper ontology” consisting of a comprehensive model of any enterprise, from several levels of abstraction. It is the successor to his ground-breaking 1995 book, “Data Model Patterns: Conventions of Thought”, the original book describing standard data model configurations for standard business situations. In addition, he has written other books on metadata, requirements analysis, and UML. He has spoken at numerous international and local data architecture, semantics, user group, and other conferences.
