Data Modeling, Left and Right

Published in TDAN.com April 2003

Maybe I’ve been watching too many nightly editions of CrossFire on CNN, but it seems to me that two very different ideals of data modeling are currently competing in the data-architecture
space. Both of these ideals are valuable and have their place. Consequently the need to reconcile them is great.


From the Left:

Data modeling as it is practiced by AI experts, KM experts, object-oriented data analysts, and “semantic web” ontology advocates involves the specification of deeply layered type
hierarchies, including multiple inheritance for entities and (often) multiple values per-attribute per-entity. Furthermore, “left-wing” data modelers are concerned to capture logical
rules (business rules) that are semantically dependent on the particular subject-area being studied, in addition to the logical/business rules that are independent of
subject-matter.

One key source of such semantic dependence is the particular verbs used within a given subject area:

The fact is that the meanings of the predicate letters of [basic first-order] predicate logic vary from problem to problem: Unlike quantifiers, truth-functional operators, and the identity
predicate, they do not have fixed meanings. Consequently [basic first-order] predicate logic provides no rules to account for the semantics of specific predicates – with the sole exception of
the identity predicate. It is therefore insensitive to validity generated by their distinctive semantics.[i]

For example, if we know that, within a specific subject-matter area, one predicate (such as “is the father of”) is the inverse of another predicate (“is the son of”), then
we may safely conclude that IF “Abraham is the father of Isaac” THEN “Isaac is the son of Abraham”. As another example, it is clear that the predicates “is the father
of” and “is the grandfather of” have a semantic relationship which should enable us to conclude that IF “Abraham is the father of Isaac” AND “Isaac is the father
of Jacob” THEN “Abraham is the grandfather of Jacob”. It is also clear that these particular predicates have meaning only with respect to specific types of entities, namely those
entities that can actually be grandfathers, fathers, or sons.
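As a minimal sketch of these two inferences, the facts can be held as subject/predicate/object triples and the subject-specific rules written as small functions. The function names `apply_inverse_rule` and `apply_chain_rule` are hypothetical, and the inverse rule simply assumes (as the text does) that every child in the data is a son.

```python
# Facts as (subject, predicate, object) triples, taken from the example above.
facts = {
    ("Abraham", "is the father of", "Isaac"),
    ("Isaac", "is the father of", "Jacob"),
}

def apply_inverse_rule(facts, predicate, inverse):
    """If (x, predicate, y) holds, conclude (y, inverse, x)."""
    return {(y, inverse, x) for (x, p, y) in facts if p == predicate}

def apply_chain_rule(facts, predicate, derived):
    """If (x, predicate, y) and (y, predicate, z) hold, conclude (x, derived, z)."""
    return {
        (x, derived, z)
        for (x, p1, y1) in facts if p1 == predicate
        for (y2, p2, z) in facts if p2 == predicate and y1 == y2
    }

# Both rules depend on what the verbs mean, not on the shape of the data.
sons = apply_inverse_rule(facts, "is the father of", "is the son of")
grandfathers = apply_chain_rule(facts, "is the father of", "is the grandfather of")
```

Neither rule follows from first-order logic alone; each has to be stated explicitly for the particular predicates involved.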

For these reasons (and others), left-wing data modeling requires the use of “non-standard” logics, such as semantically dependent logic, higher-order predicate logic, frame logic
(F-logic), three-valued logic, and/or probability-calculus logic. In spite of the fact that definite, unambiguous true/false conclusions often cannot be achieved by such non-standard logics, they
do help to capture how human subject-matter experts think about what they know, and they do help to capture how corporations and other human social institutions “think”, store, and
process information.

Ultimately, then, the strength and the weakness of left-wing data modeling are one and the same: it is comprehensive, nonlinear, and “far from equilibrium”.[ii]


From the Right:

By contrast, relational data modelers and other devotees of sorted, two-valued, first-order predicate logic hate hierarchies, since type hierarchies (especially) necessarily get them
involved with the paradoxes of higher-order predicate logics.[iii] In fact, C.J. Date and Hugh Darwen, in
their great book Foundation for Future Database Systems: The Third Manifesto, 2nd ed., can only find room for one possible kind of type inheritance in relational database theory,
namely, the so-called “specialization by constraint”, e.g. where “circle” is a subtype of “ellipse” because of a constraint on the ellipse (i.e., the equal
length of the major and minor axes in the case where the ellipse is also a circle).[iv] By contrast, with
respect to left-wing data modeling’s common specialization through the addition of entirely new properties (e.g., “colored ellipse” as a subtype of
“ellipse”), Date and Darwen can only handle this either by defining a relation that groups the added properties (“color”, etc.) together with the original
properties (“major axis”, “minor axis”, etc.), or by allowing an attribute of domain type “ellipse” to be included within the definition of
type “colored ellipse”, then using delegation to get at the colored ellipse’s specifically ellipse-related attributes.[v]
Unfortunately (or fortunately?) neither strategy results in a separate, definite subtype “colored ellipse” inheriting from the supertype “ellipse”, such as
pervasively occurs throughout left-wing data modeling. It is true that the strategies of Date and Darwen do keep relational database theory solidly within the linear, definite true/false
bounds of sorted, two-valued, first-order predicate logic, but only at the expense of possibly losing the rich layers of semantic meaning which a deeply layered fully-extensible type hierarchy can
provide.

(Interestingly, while the question of whether “colored ellipse” is a subtype of “ellipse” is controversial, the question of whether “colored elliptical entity”
is a subtype of “elliptical entity” is not, since the latter is certainly just an instance of “specialization by constraint”. Object-oriented data modeling could certainly
use clearer thinking on the difference between these two cases.)
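The difference between the two kinds of specialization can be sketched in code. This is an illustration only, not Date and Darwen's own notation: the class and function names are assumptions. It shows “circle” as a constraint on ellipse values, and “colored ellipse” as a separate type that contains an ellipse and delegates to it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ellipse:
    major_axis: float
    minor_axis: float

def is_circle(e: Ellipse) -> bool:
    """Specialization by constraint: a circle is an ellipse whose axes are equal."""
    return e.major_axis == e.minor_axis

@dataclass(frozen=True)
class ColoredEllipse:
    """Holds an Ellipse value and a color; it does NOT inherit from Ellipse."""
    shape: Ellipse
    color: str

    @property
    def major_axis(self) -> float:
        # Delegation: ellipse-related attributes are fetched from the contained value.
        return self.shape.major_axis

ce = ColoredEllipse(Ellipse(2.0, 2.0), "red")
```

Here `is_circle(ce.shape)` holds, yet `ce` is not itself an `Ellipse`, which is exactly the loss of a “colored ellipse” subtype described above.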

Additionally (as we suggested earlier) first-order predicate logic (which is the solid basis for relational “right-wing” data modeling) is “semantically complete”
only with respect to the identity predicate, the truth-functional operators, and the quantifiers expressible within that system.[vi] Consequently the many additional business rules within a subject-area of interest that are dependent on the particular semantics of that
subject-area (especially the verbs within that subject-area) are simply lost if that subject-area is modeled exclusively in relational terms.

Consider the following example of an EMPLOYEE relation headed by the following attributes (the values of which are, of course, drawn from particular domains defined by their
corresponding domain types):

EmployeeID, Name, Address, Salary

Date and Darwen would say that this relation is a predicate (i.e., a truth-valued function) which might be expressed as follows:

An EMPLOYEE has a unique EmployeeID, is called by a certain Name, lives at a certain Address, and makes a certain yearly Salary.

Furthermore, each tuple (i.e., row) in this relation denotes a certain true proposition which must be constructed by substituting values for each attribute (drawn from their
proper domains) into the placeholders in the predicate. For example:

Jane Smith has unique employee ID 60756, is called “Jane Smith”, lives at “345 Holland Drive, Anytown USA”, and makes $45265 per year.

If the above proposition is true, then the following tuple may be inserted into the relation:

60756, “Jane Smith”, “345 Holland Drive, Anytown USA”, 45265

(See Date and Darwen, pp. 16-17, for an analogous example.[vii])
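The reading of a relation as a predicate and of each tuple as a true proposition can be sketched as follows. The names `Employee`, `PREDICATE`, and `proposition` are hypothetical, and the template wording is adapted from the example sentence above.

```python
from typing import NamedTuple

class Employee(NamedTuple):
    employee_id: int
    name: str
    address: str
    salary: int

# The relation's predicate, with one placeholder per attribute.
PREDICATE = (
    '{name} has unique employee ID {employee_id}, is called "{name}", '
    'lives at "{address}", and makes ${salary} per year.'
)

def proposition(row: Employee) -> str:
    """Substitute a tuple's attribute values into the predicate's placeholders."""
    return PREDICATE.format(**row._asdict())

# The relation is a set of tuples; each tuple asserts one proposition.
employee_relation = {
    Employee(60756, "Jane Smith", "345 Holland Drive, Anytown USA", 45265),
}

for row in employee_relation:
    print(proposition(row))
```

Note that the verbs live only in the template string, which the database itself never sees.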

But is the relational model really capturing all of the information that Date and Darwen say it is capturing? What the relational model is really capturing in the example above is the
following predicate:

Each particular EMPLOYEE is associated with one-and-only one EmployeeID (drawn from a specified domain) which uniquely identifies that tuple within the
relation. Furthermore, each particular EMPLOYEE is also associated with one-and-only-one Name (drawn from a specified domain), with one-and-only-one
Address (drawn from a specified domain), and with one-and-only-one Salary (drawn from a specified domain).

One alternative way of saying most of this is that attributes Name, Address, and Salary are fully-functionally dependent on the
EmployeeID. In other words, if you know the EmployeeID (which uniquely and minimally identifies the tuple), then Name, Address,
and Salary can each be specified by a single, unique value.
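A functional dependency of this kind can be checked mechanically. The sketch below assumes rows held as dictionaries; the function name `functionally_determines` and the second row (“John Doe”) are invented purely for illustration.

```python
def functionally_determines(rows, key, dependents):
    """Return True if each value of `key` maps to exactly one value of every dependent attribute."""
    seen = {}
    for row in rows:
        values = tuple(row[d] for d in dependents)
        # setdefault stores the first combination seen for this key value
        # and returns it; any later, different combination violates the dependency.
        if seen.setdefault(row[key], values) != values:
            return False
    return True

rows = [
    {"EmployeeID": 60756, "Name": "Jane Smith",
     "Address": "345 Holland Drive, Anytown USA", "Salary": 45265},
    # Hypothetical second employee with the same salary.
    {"EmployeeID": 60757, "Name": "John Doe",
     "Address": "12 Main Street, Anytown USA", "Salary": 45265},
]
```

With this data, `EmployeeID` determines `Name`, `Address`, and `Salary`, while `Salary` does not determine `Name`. The check says nothing about what “makes a salary” means.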

Now, what’s missing from this second formulation of the relational predicate? What’s missing is all of the semantically particular verbs in the first formulation (“has”, “is called by”,
“lives at”, and “makes”), as can be seen in the following restatement of that first formulation:

An EMPLOYEE has a unique EmployeeID, is called by a certain Name, lives at a certain Address, and makes a certain yearly Salary.

From the point of view of relational database theory, the information represented by these verbs must be stored outside the relational system (perhaps in “system
documentation” or just in “people’s heads”).

A similar point was made by Bertrand Meyer in his book Object-Oriented Software Construction, 2nd ed.:

We must distinguish between the abstract relation loved_one and the set of loved_one links that exist between the elements of a certain set of objects.
This distinction is emphasized neither by the standard mathematical definitions of relations nor, in the software field, by the theory of relational databases. Limiting ourselves to binary
relations, a relation is defined in both mathematics and relational databases as a set of pairs, all of the form <x, y> where every x is a member of a given set TX and every y is a member of a given set
TY. (In software terminology: all x are of type TX and all y are of type TY.) Appropriate as such definitions may be mathematically, they are not satisfactory for system modeling, as they fail to
make the distinction between an abstract relation and one of its particular instances. For system modeling, if not for mathematics and relational databases, the loves relation has its own
general and abstract properties, quite independent of the record of who loves whom in a particular group of people at a particular time. [viii]

In other words, to use Meyer’s example, relational database theory doesn’t really care whether x loves y, x is married to y, or x
hates y: Rather, it just documents the bare fact that instances x and y of types TX and TY respectively can be grouped in pairs in some “meaningful” way. Of course, if
it is additionally specified that x is a unique identifier for the tuples <x, y>, then one can also conclude that y is “functionally dependent” on x and that therefore only one value
of y can exist for each value of x. This captures a part (but only a part) of the semantics of is married to, as opposed to loves or hates (assuming a monogamous culture!). But this still leaves
the logical implications of most verbal semantics formally uncaptured.
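A small sketch makes the point concrete. The people (“Ann”, “Bob”, “Cy”) and the helper `is_functional` are hypothetical; the relations are modeled, as in the mathematical definition, as bare sets of pairs.

```python
# Two relations with different verbs but the same recorded pairs.
loves = {("Ann", "Bob"), ("Bob", "Cy")}
hates = {("Ann", "Bob"), ("Bob", "Cy")}

# As sets of pairs they are indistinguishable: the verb is only a variable name.
same_extension = loves == hates

def is_functional(pairs):
    """True if each x is paired with at most one y (y is functionally dependent on x)."""
    xs = [x for x, _ in pairs]
    return len(xs) == len(set(xs))

# Declaring x a unique identifier captures one part of "is married to" ...
is_married_to = {("Ann", "Bob"), ("Bob", "Ann")}
# ... which a relation like "loves" need not satisfy.
loves_many = {("Ann", "Bob"), ("Ann", "Cy")}
```

The uniqueness constraint separates `is_married_to` from `loves_many`, but nothing in the data separates `loves` from `hates`.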

Now, it cannot be denied that Date and Darwen are, at some level, aware of all of this. Elsewhere C.J. Date writes:

In an ideal world . . . the DBMS would know the [full] meaning of every relation, so that it could deal correctly with all possible updates. But, of course, that’s impossible. There’s
no way it can know those meanings exactly. For example, there’s no way the DBMS can know what it means for a certain supplier to be “in” a certain city or to “have” a
certain status; these concepts are outside the system – they’re understood by users, not by the DBMS. More precisely, they’re part of what logicians call the interpretation (of
the relation in question).[ix]

But if we couple this statement with another statement by C.J. Date, namely that “domains (or types) and relations are together both necessary and sufficient to represent
absolutely any data whatsoever”[x], it is clear that he rejects the idea that semantically
dependent logic (e.g., verb-dependent logic) can (or should?) be formally represented within computer-based business systems. In effect, this leaves many “business rules”
entirely outside the scope of formal business systems!

By contrast, left-wing data modelers seek to capture more of the meaning of business-related predicates in a formal way, in spite of the often-considerable difficulties this quest
presents.

How might they go about doing this? One way might be to replace all relational “attributes” with basic “sub-predicates” having a simple subject/verb/object format.
In that case, the first formulation of the example predicate above might then become changed to:

An EMPLOYEE has a unique EmployeeID
AND
An EMPLOYEE is called by a certain Name
AND
An EMPLOYEE lives at a certain Address
AND
An EMPLOYEE makes a certain yearly Salary

Within such a proposed left-wing logic, all “attributes” would be transformed into simple “business-rules” whose truth would be partially dependent on the semantics of the
particular subject-area being represented. (This approach is suggested, for example, by the RDF (Resource Description Framework) standard for the representation of metadata on the internet:
RDF’s Resource/Property/Value structure closely parallels the subject/verb/object structure suggested above.[xi])
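One way to picture this decomposition is to turn each attribute of a row into its own subject/verb/object triple. This is a plain-Python sketch, not RDF syntax or any RDF library's API; the `VERBS` mapping, the `to_triples` helper, and the subject label `employee:60756` are all assumptions made for illustration.

```python
# Hypothetical mapping from each relational attribute to the verb it stands for.
VERBS = {
    "EmployeeID": "has unique employee ID",
    "Name": "is called by",
    "Address": "lives at",
    "Salary": "makes yearly salary",
}

def to_triples(subject, row):
    """Split one row into subject/verb/object sub-predicates, one per attribute."""
    return [(subject, VERBS[attr], value) for attr, value in row.items()]

row = {
    "EmployeeID": 60756,
    "Name": "Jane Smith",
    "Address": "345 Holland Drive, Anytown USA",
    "Salary": 45265,
}

triples = to_triples("employee:60756", row)
```

Each resulting triple carries its verb explicitly, so rules about that verb have something formal to attach to.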

However, more complexities are introduced when we consider other types of semantically dependent functions besides semantically dependent predicates. For example, the logical function “the
father of” returns a male human being rather than a Boolean true/false value, and (moreover) it has a semantic relationship to “the grandfather of” function that is analogous to
the semantic relationship between the predicates “is the father of” and “is the grandfather of”. From such considerations we should be able to conclude
(within this particular subject-matter area) that IF “The father of Isaac is Abraham” AND “The father of Jacob is Isaac” THEN “The grandfather of Jacob is
Abraham”.[xii]
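The same inference can be sketched with functions that return a person rather than a truth value. The lookup table and function names below are illustrative assumptions, and only the paternal grandfather is modeled.

```python
from typing import Optional

# "The father of X is Y" recorded as a lookup from X to Y.
FATHER_OF = {"Isaac": "Abraham", "Jacob": "Isaac"}

def father_of(person: str) -> Optional[str]:
    """Return the recorded father, or None when the father is not known."""
    return FATHER_OF.get(person)

def grandfather_of(person: str) -> Optional[str]:
    """Paternal grandfather, defined as the father of the father."""
    father = father_of(person)
    return father_of(father) if father is not None else None
```

Here `grandfather_of("Jacob")` yields `"Abraham"`, while `grandfather_of("Isaac")` yields `None` because Abraham's father is not recorded: a small instance of the system answering “I dunno”.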

But the downside of all such left-wing logics, whether they be semantically dependent logics, higher-order predicate logics, frame logics, three-valued logics, or probability-calculus logics is
that they often cannot yield unambiguous true/false answers: And when a business computer system is processing and analyzing millions of transactions per day, you don’t want it to tell you
“I dunno” very often!

Ultimately, then, the strength and the weakness of right-wing data modeling are one and the same: it is essentially linear and deterministic, which is why it can be relied on to produce definite
true/false conclusions.


Bridging the Gap:

How can we bridge this vast chasm which seems to separate left-wing and right-wing data modeling? Well, here are a few possibilities:


A. Stake out a position on the far-left or on the far-right and demonize the other side.

For example, the often-excellent “Database Debunkings” web site (www.dbdebunk.com) maintained by Fabian Pascal and C.J. Date is marred by attacks
on non-relational data modeling that border on ad hominem. This approach does not seem to me to be productive, since I see significant value on both the left and the right.


B. Hope for (or work towards) a breakthrough in left-wing data modeling that will make it as solid, deterministic, and two-valued (true/false) as right-wing data modeling.

This approach seems to me to be unrealistic: In order to logically model deeply layered, fully extensible type hierarchies, you must use a higher-order predicate logic. But all such higher-order
logics lead to logical contradictions (i.e., paradoxes) which effectively prevent those logics from being cleanly two-valued.

The contradictions that arise in logic whenever you attempt to model “properties of properties” or “types of types” (i.e., fully extensible type inheritance) are closely
analogous to the contradictions that arise in set theory with respect to “sets of sets” (such as Cantor’s paradox, Russell’s paradox, and so on). As Seymour Lipschutz puts
it: “Although it is possible to eliminate these known contradictions by a strict axiomatic development of set theory, there are still many questions which are unanswered.”[xiii] In other words, you can shift these problems around, but it is highly unlikely that you can eliminate them,
since over 100 years of effort by mathematicians and logicians have failed to do so.

Particularly instructive here is the case of frame logic (often abbreviated as F-logic). As presented in the well-known paper “Logical Foundations of Object-Oriented and
Frame-Based Languages”,[xiv] F-logic seeks to eliminate some of the complexities of
higher-order logics by using a higher-order syntax, but a first-order semantics, in an attempt to give a firm logical foundation to the object-oriented paradigm. In this it partly
succeeds, and its approach is one of the bases for the “semantic web” initiative of Tim Berners-Lee and others.[xv]

However, in a recent paper titled “Well-Founded Optimism: Inheritance in Frame-Based Knowledge Bases”,[xvi] Michael Kifer (the principal creator of F-logic) admits that the
“integration of inference by inheritance into rule-based deductive systems presents serious semantic and computational difficulties” and that the semantics he originally proposed for
F-logic is “known to yield questionable results in many cases”. Kifer goes on to propose solutions to these problems, but (significantly) those solutions require that
F-logic be transformed from a two-valued logic to a three-valued logic. (The third value, as always, is “I dunno”.) This strongly suggests to me that
left-wing logic, in general, is inherently recursive, nonlinear, and multi-valued (i.e., more than two-valued).
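To show what a third truth value looks like in practice, here is a minimal sketch of Kleene's strong three-valued connectives, with Python's `None` standing in for “I dunno”. This is a generic illustration and an assumption on my part; it is not the specific semantics that Yang and Kifer propose for F-logic.

```python
from typing import Optional

Truth = Optional[bool]  # True, False, or None for "I dunno"

def k_not(a: Truth) -> Truth:
    return None if a is None else not a

def k_and(a: Truth, b: Truth) -> Truth:
    # A single definite False settles a conjunction, whatever the other side is.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def k_or(a: Truth, b: Truth) -> Truth:
    # A single definite True settles a disjunction, whatever the other side is.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False
```

Definite answers survive only when the known operands are enough to settle the result; otherwise the unknown propagates.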

In this context it could be argued that two-valued, straight true/false logic is an unrealistic ideal that should be entirely given up. This point of view is suggested, for example, by the clearly
unrealistic nature of the material-implication (if-then) truth table in basic propositional logic: If the antecedent is false, then regardless of whether the consequent is true or false, the
overall if-then statement is regarded as being true (rather than, realistically, as being “I dunno”).[xvii]
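The truth table in question can be generated directly. The function `implies` below is standard material implication; `hedged_implies` is a hypothetical variant, invented here only to show what the “more realistic” reading would return, and does not correspond to any standard logic cited in this article.

```python
def implies(p: bool, q: bool) -> bool:
    """Material implication: false only when the antecedent is true and the consequent false."""
    return (not p) or q

def hedged_implies(p: bool, q: bool):
    """Hypothetical variant: answers None ("I dunno") whenever the antecedent is false."""
    return q if p else None

# Rows are (antecedent, consequent, classical result, hedged result).
table = [
    (p, q, implies(p, q), hedged_implies(p, q))
    for p in (True, False)
    for q in (True, False)
]
```

The two functions agree whenever the antecedent is true and differ only in the two rows where it is false.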

But the correct right-wing relational answer to this line of argument is surely that at least some logical conclusions must be definitely true or false, and that, as a practical matter, business
systems which process many millions of transactions per day need to rely on such definite true/false conclusions as much as possible.


C. Accept both left-wing and right-wing data modeling as vital to the creation of the business data architecture, taking advantage of the strengths of each.

This seems to me to be the right way to go: The strengths of left-wing data modeling clearly center on the modeling of the complete subject-knowledge of business users and departments at
all corporate levels, while by contrast the strengths of right-wing relational data modeling clearly center on the delivery of massive-scale, definite, reliable business information to
those same business users and departments. So, let’s work together to provide full data-modeling business value to our clients! (Excuse me while I duck, to avoid the crossfire.)

——————————————————————————–

[i] John Nolt, Dennis Rohatyn, and Achille Varzi, Schaum’s Outline of Theory and Problems of Logic, 2nd ed. (New York: McGraw
Hill, 1998), p. 277.

[ii] Phillip L. Engle, Far From Equilibrium (Greensburg PA: Laurel Highlands Media, 2002)

[iii] Cf. Nolt, Rohatyn, and Varzi, p. 280.

[iv] C.J. Date and Hugh Darwen, Foundation for Future Database Systems: The Third Manifesto, 2nd ed. (Reading MA: Addison-Wesley,
2000), p. 415.

[v] Date and Darwen, pp. 416-7.

[vi] Nolt, Rohatyn, and Varzi, p. 277.

[vii] Date and Darwen, pp. 16-17.

[viii] Bertrand Meyer, Object-Oriented Software Construction, 2nd ed. (Upper Saddle River NJ: Prentice Hall PTR 1997), pp. 229-230.

[ix] C.J. Date, “The Question of Meaning” (Source: BRCommunity.com:: The Business Rules Community, http://www.BRCommunity.com)

[x] C.J. Date, “Twelve Rules for Business Rules” (5/1/2000), p. 6.

[xi] “What is RDF?”

[xii] Nolt, Rohatyn, and Varzi, pp. 284-6.

[xiii] Seymour Lipschutz, Schaum’s Outline of Set Theory and Related Topics, 2nd ed. (New York: McGraw-Hill 1998), p. 221.

[xiv] Michael Kifer, Georg Lausen, and James Wu, “Logical Foundations of Object-Oriented and Frame-Based Languages”, Journal of the ACM, 42:741-843, July 1995.

[xv] “Semantic Web”,

[xvi] Guizhen Yang and Michael Kifer, “Well-Founded Optimism: Inheritance in Frame-Based Knowledge Bases”,
CoopIS/DOA/ODBASE 2002: 1013-1032.

[xvii] Nolt, Rohatyn, and Varzi, p. 58.

About Phillip Engle

Phillip L. Engle is an information technology professional with fourteen years of experience in data architecture, information analysis, technical writing, and programming. During the past six years he was a data architect and information analyst for Mellon Financial Corporation in Pittsburgh PA.  He can be reached by email, or by calling (724) 832-5891.
