*Published in TDAN.com April 2003*

Maybe I’ve been watching too many nightly editions of CrossFire on CNN, but it seems to me that two very different ideals of data modeling are currently competing in the data-architecture space. Both of these ideals are valuable and have their place. Consequently the need to reconcile them is great.

From the Left:

Data modeling as it is practiced by AI experts, KM experts, object-oriented data analysts, and “semantic web” ontology advocates involves the specification of deeply layered type hierarchies, including multiple inheritance for entities and (often) multiple values per-attribute per-entity. Furthermore, “left-wing” data modelers are concerned to capture logical rules (business rules) that are *semantically dependent* on the particular subject-area being studied, in addition to the logical/business rules that are *independent* of subject-matter.

One key source of such *semantic dependence* is the particular *verbs* used within a given subject area:

predicate, they do not have fixed meanings. Consequently [basic first-order] predicate logic provides no rules to account for the semantics of specific predicates – with the sole exception of the identity predicate. It is therefore insensitive to validity generated by their distinctive semantics.[i]

For example, if we know that, within a specific subject-matter area, one predicate (such as “is the father of”) is the inverse of another predicate (“is the son of”), then we may safely conclude that IF “Abraham is the father of Isaac” THEN “Isaac is the son of Abraham”. As another example, it is clear that the predicates “is the father of” and “is the grandfather of” have a semantic relationship which should enable us to conclude that IF “Abraham is the father of Isaac” AND “Isaac is the father of Jacob” THEN “Abraham is the grandfather of Jacob”. It is also clear that these particular predicates have meaning only with respect to specific types of entities, namely those entities that can actually be grandfathers, fathers, or sons.
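To make the two inferences concrete, here is a minimal Python sketch of my own. It assumes facts are stored as (subject, predicate, object) triples and that the two subject-specific rules are hard-coded; the function name `apply_rules` and the representation are illustrative assumptions, not taken from any particular logic system.

```python
# Hypothetical fact base: (subject, predicate, object) triples.
facts = {
    ("Abraham", "is the father of", "Isaac"),
    ("Isaac", "is the father of", "Jacob"),
}

def apply_rules(facts):
    """Apply the two subject-specific rules until no new facts appear."""
    derived = set(facts)
    while True:
        new = set()
        for (s, p, o) in derived:
            if p == "is the father of":
                # Inverse rule: within this subject area (males only, as in
                # the text), "father of" implies the inverse "son of".
                new.add((o, "is the son of", s))
                # Composition rule: the father of a father is a grandfather.
                for (s2, p2, o2) in derived:
                    if p2 == "is the father of" and s2 == o:
                        new.add((s, "is the grandfather of", o2))
        if new <= derived:  # nothing new was derived: fixed point reached
            return derived
        derived |= new

closure = apply_rules(facts)
for fact in sorted(closure):
    print(fact)
```

Note that both rules live in the code rather than in the logic itself, which is exactly the sense in which they are semantically dependent on the subject area.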

For these reasons (and others), left-wing data modeling requires the use of “non-standard” logics, such as semantically dependent logic, higher-order predicate logic, frame logic (F-logic), three-valued logic, and/or probability-calculus logic. In spite of the fact that definite, unambiguous true/false conclusions often cannot be achieved by such non-standard logics, they do help to capture how human subject-matter experts think about what they know, and they do help to capture how corporations and other human social institutions “think”, store, and process information.

Ultimately, then, both the strength and weakness of left-wing data modeling is that it is comprehensive, nonlinear, and “far from equilibrium”.[ii]

From the Right:

By contrast, relational data modelers and other devotees of sorted, two-valued, first-order predicate logic *hate* hierarchies, since type hierarchies (especially) necessarily get them involved with the paradoxes of higher-order predicate logics.[iii] In fact, C.J. Date and Hugh Darwen, in their great book *Foundation for Future Database Systems: The Third Manifesto*, 2nd ed., can only find room for *one* possible kind of type inheritance in relational database theory, namely, the so-called “specialization by constraint”, e.g. where “circle” is a subtype of “ellipse” because of a constraint on the ellipse (i.e., the equal length of the major and minor axes in the case where the ellipse is also a circle).[iv] By contrast, with respect to left-wing data modeling’s common specialization through the addition of *entirely new* properties (e.g., “colored ellipse” as a subtype of “ellipse”), Date and Darwen can only handle this *either* by defining a *relation* that groups the added properties (“color”, etc.) together with the original properties (“major axis”, “minor axis”, etc.), *or* by allowing an attribute of domain type “ellipse” to be included *within* the definition of type “colored ellipse”, then using delegation to get at the colored ellipse’s specifically ellipse-related attributes.[v]
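The contrast between the two approaches can be loosely sketched in Python. This is only an analogy of my own, not Date and Darwen’s notation: the class names, the `is_circle` method, and the `shape` attribute are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ellipse:
    major_axis: float
    minor_axis: float

    def is_circle(self) -> bool:
        # Specialization by constraint: an ellipse counts as a circle
        # exactly when its major and minor axes are equal.
        return self.major_axis == self.minor_axis

@dataclass(frozen=True)
class ColoredEllipse:
    # Deliberately NOT a subclass of Ellipse: the ellipse is held as an
    # attribute, and ellipse-related properties are reached by delegation.
    shape: Ellipse
    color: str

    @property
    def major_axis(self) -> float:
        return self.shape.major_axis  # delegate to the contained ellipse

    @property
    def minor_axis(self) -> float:
        return self.shape.minor_axis  # delegate to the contained ellipse

red = ColoredEllipse(Ellipse(3.0, 2.0), "red")
print(Ellipse(2.0, 2.0).is_circle())  # a circle by constraint
print(red.major_axis)                 # obtained via delegation
print(isinstance(red, Ellipse))       # no subtype relationship exists
```

The last line is the point of the sketch: under delegation, a colored ellipse can answer ellipse-related questions without ever being an ellipse in the type system.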

Unfortunately (or fortunately?) *neither* strategy results in a separate, definite subtype “colored ellipse” inheriting from the supertype “ellipse”, such as pervasively occurs throughout left-wing data modeling. It *is* true that the strategies of Date and Darwen do keep relational database theory solidly within the linear, definite true/false bounds of sorted, two-valued, first-order predicate logic, but only at the expense of possibly losing the rich layers of semantic meaning which a deeply layered fully-extensible type hierarchy can provide.

(Interestingly, while the question of whether “colored ellipse” is a subtype of “ellipse” is controversial, the question of whether “colored elliptical entity” is a subtype of “elliptical entity” is not, since the latter is certainly just an instance of “specialization by constraint”. Object-oriented data modeling could certainly use clearer thinking on the difference between these two cases.)

Additionally (as we suggested earlier) first-order predicate logic (which is the solid basis for relational “right-wing” data modeling) is “semantically complete” *only* with respect to the identity predicate, the truth-functional operators, and the quantifiers expressible within that system.[vi] Consequently the many additional business rules within a subject-area of interest that are dependent on the particular semantics of that subject-area (especially the *verbs* within that subject-area) are simply lost if that subject-area is modeled *exclusively* in relational terms.

Consider the following example of an **EMPLOYEE** relation headed by the following attributes (the values of which are, of course, drawn from particular domains defined by their corresponding domain types):

**EmployeeID, Name, Address, Salary**

Date and Darwen would say that this relation is a predicate (i.e., a truth-valued function) which might be expressed as follows:

An EMPLOYEE has a unique EmployeeID, is called by a certain Name, lives at a certain Address, and *makes a certain yearly Salary*.

Furthermore, each *tuple* (i.e., row) in this relation denotes a certain *true proposition* which must be constructed by substituting values for each attribute (drawn from their proper domains) into the placeholders in the *predicate*. For example:

*Jane Smith has unique employee ID 60756, is called “Jane Smith”, lives at “345 Holland Drive, Anytown USA”, and makes $45265 per year.*

If the above proposition is true, then the following tuple may be inserted into the relation:

*60756, “Jane Smith”, “345 Holland Drive, Anytown USA”, 45265*

(See Date and Darwen, pp. 16-17, for an analogous example.[vii])
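The reading of a relation as a predicate and of each tuple as a true proposition can be sketched as follows. This is an illustrative Python rendering of my own; the `Employee` type and the `proposition` function are assumed names, and a Python set of named tuples is only a rough stand-in for a relation.

```python
from typing import NamedTuple

class Employee(NamedTuple):
    """Heading of the EMPLOYEE relation (attribute names follow the text)."""
    employee_id: int
    name: str
    address: str
    salary: int

def proposition(t: Employee) -> str:
    """Instantiate the relation's predicate with the values of one tuple."""
    return (
        f'{t.name} has unique employee ID {t.employee_id}, '
        f'is called "{t.name}", lives at "{t.address}", '
        f'and makes ${t.salary} per year.'
    )

# The relation body: a set of tuples, each asserted to be a true proposition.
employee_relation = {
    Employee(60756, "Jane Smith", "345 Holland Drive, Anytown USA", 45265),
}

for row in employee_relation:
    print(proposition(row))
```

Notice that the verbs appear only inside the string template of `proposition`; the tuple itself carries nothing but values.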

But is the relational model really capturing all of the information that Date and Darwen say it is capturing? What the relational model is *really* capturing in the example above is the following predicate:

*Each particular EMPLOYEE is associated with one-and-only-one EmployeeID (drawn from a specified domain) which uniquely identifies that tuple within the relation. Furthermore, each particular EMPLOYEE is also associated with one-and-only-one Name (drawn from a specified domain), with one-and-only-one Address (drawn from a specified domain), and with one-and-only-one Salary (drawn from a specified domain).*

One alternative way of saying most of this is that attributes **Name**, **Address**, and **Salary** are *fully-functionally dependent* on the **EmployeeID**. In other words, if you know the **EmployeeID** (which uniquely and minimally identifies the tuple), then **Name**, **Address**, and **Salary** can each be specified by a single, unique value.
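A functional dependency of this kind can be checked mechanically. The sketch below is a minimal illustration under my own assumptions: rows are plain Python dictionaries, the helper name `functionally_determines` is hypothetical, and the second row (“John Doe”) is invented sample data added only to exercise the check.

```python
def functionally_determines(rows, determinant, dependents):
    """Return True if equal determinant values always imply equal dependent values."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in dependents)
        # setdefault stores the first value seen for this key and returns it,
        # so any later row with the same key must carry the same value.
        if seen.setdefault(key, value) != value:
            return False
    return True

rows = [
    {"EmployeeID": 60756, "Name": "Jane Smith",
     "Address": "345 Holland Drive, Anytown USA", "Salary": 45265},
    # Invented second row, for illustration only.
    {"EmployeeID": 60757, "Name": "John Doe",
     "Address": "12 Main Street, Anytown USA", "Salary": 45265},
]

print(functionally_determines(rows, ["EmployeeID"], ["Name", "Address", "Salary"]))
print(functionally_determines(rows, ["Salary"], ["Name"]))
```

The check is purely structural: it says that one value goes with one key, and nothing about what “lives at” or “makes” means.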

Now, what’s missing from this second formulation of the relational predicate? What’s missing is all of the semantically particular verbs in the first formulation, as marked in bold in the following restatement of that first formulation:

*An EMPLOYEE **has** a unique EmployeeID, **is called by** a certain Name, **lives at** a certain Address, and **makes** a certain yearly Salary.*

From the point of view of relational database theory, the information represented by the verbs marked above must be stored outside the relational system (perhaps in “system documentation” or just in “people’s heads”).

A similar point was made by Bertrand Meyer in his book *Object-Oriented Software Construction*, 2nd ed.:

*loved_one* links that exist between the elements of a certain set of objects.

relations, a relation is defined in both mathematics and relational databases as a set of pairs, all of the form ⟨x, y⟩ where every x is a member of a given set TX and every y is a member of a given set TY. (In software terminology: all x are of type TX and all y are of type TY.) Appropriate as such definitions may be mathematically, they are not satisfactory for system modeling, as they fail to make the distinction between an abstract relation and one of its particular instances. For system modeling, if not for mathematics and relational databases, the *loves* relation has its own general and abstract properties, quite independent of the record of who loves whom in a particular group of people at a particular time.[viii]

In other words, to use Meyer’s example, relational database theory doesn’t really care whether x **loves** y, x **is married to** y, or x **hates** y; rather, it just documents the bare fact that instances x and y of types TX and TY respectively can be grouped in pairs in some “meaningful” way. Of course, if it is additionally specified that x is a *unique identifier* for the tuples ⟨x, y⟩, then one can also conclude that y is “functionally dependent” on x and that therefore only one value of y can exist for each value of x. This captures a part (but only a part) of the semantics of **is married to**, as opposed to **loves** or **hates** (assuming a monogamous culture!). But this still leaves the logical implications of most verbal semantics formally uncaptured.

Now, it cannot be denied that Date and Darwen are, at some level, aware of all of this. Elsewhere C.J. Date writes:

no way it can know those meanings exactly. For example, there’s no way the DBMS can know what it means for a certain supplier to be “in” a certain city or to “have” a certain status; these concepts are outside the system – they’re understood by users, not by the DBMS. More precisely, they’re part of what logicians call the interpretation (of the relation in question).[ix]

But if we couple this statement with another statement by C.J. Date, namely that “domains (or types) and relations are together both *necessary* and *sufficient* to represent absolutely any data whatsoever”[x], it is clear that he rejects the idea that semantically dependent logic (e.g., verb-dependent logic) can (or should?) be *formally* represented within computer-based business systems. In effect, this leaves many “business rules” entirely outside the scope of formal business systems!

By contrast, left-wing data modelers seek to capture more of the *meaning* of business-related predicates in a *formal* way, in spite of the often-considerable difficulties this quest presents.

How might they go about doing this? One way might be to replace all relational “attributes” with basic “sub-predicates” having a simple *subject/verb/object* format.

In that case, the first formulation of the example predicate above might then become:

An **EMPLOYEE** has a unique **EmployeeID**

**AND**

An **EMPLOYEE** is called by a certain **Name**

**AND**

An **EMPLOYEE** lives at a certain **Address**

**AND**

An **EMPLOYEE** makes a certain yearly **Salary**

Within such a proposed left-wing logic, all “attributes” would be transformed into simple “business-rules” whose truth would be partially dependent on the semantics of the particular subject-area being represented. (This approach is suggested, for example, by the RDF (Resource Description Framework) standard for the representation of metadata on the internet: RDF’s *Resource/Property/Value* structure closely parallels the *subject/verb/object* structure suggested above.[xi])
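As a rough sketch of the *subject/verb/object* idea, the Jane Smith tuple from earlier can be broken into four separate statements. This uses plain Python tuples rather than any real RDF library; the identifier `employee:60756`, the verb strings, and the `objects` helper are my own illustrative assumptions.

```python
# Hypothetical identifier for the subject of all four statements.
employee = "employee:60756"

# Each former "attribute" becomes its own subject/verb/object statement,
# so the verb is stored as data instead of living in documentation.
triples = [
    (employee, "has unique EmployeeID", 60756),
    (employee, "is called by Name", "Jane Smith"),
    (employee, "lives at Address", "345 Holland Drive, Anytown USA"),
    (employee, "makes yearly Salary", 45265),
]

def objects(subject, verb):
    """Return every object asserted for the given subject and verb."""
    return [o for (s, v, o) in triples if s == subject and v == verb]

print(objects(employee, "lives at Address"))
print(objects(employee, "makes yearly Salary"))
```

Because `objects` returns a list, the structure also accommodates the multiple values per attribute per entity that left-wing modelers often want.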

However, more complexities are introduced when we consider other types of semantically dependent functions besides semantically dependent predicates. For example, the logical function “the father of” returns a male human being rather than a Boolean true/false value, and (moreover) it has a semantic relationship to “the grandfather of” function that is analogous to the semantic relationship between the predicates “*is* the father of” and “*is* the grandfather of”. From such considerations we should be able to conclude (within this particular subject-matter area) that IF “The father of Isaac is Abraham” AND “The father of Jacob is Isaac” THEN “The grandfather of Jacob is Abraham”.[xii]
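The function-valued version of the same inference might be sketched like this. It is a minimal Python illustration of my own: the dictionary `father_of`, the function `grandfather_of`, and the simplification to the paternal grandfather only are all assumptions.

```python
from typing import Optional

# "The father of" as a function from a person to a person.
father_of = {
    "Isaac": "Abraham",
    "Jacob": "Isaac",
}

def grandfather_of(person: str) -> Optional[str]:
    """Compose father_of with itself (paternal grandfather only).

    Returns None when the knowledge base cannot supply an answer.
    """
    father = father_of.get(person)
    if father is None:
        return None
    return father_of.get(father)

print(grandfather_of("Jacob"))
print(grandfather_of("Isaac"))
```

The second call returns `None` because Abraham’s father is not recorded, a small foretaste of the “I dunno” answers discussed next.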

But the downside of all such left-wing logics, whether they be semantically dependent logics, higher-order predicate logics, frame logics, three-valued logics, or probability-calculus logics, is that they often cannot yield unambiguous true/false answers. And when a business computer system is processing and analyzing millions of transactions per day, you don’t want it to tell you “I dunno” very often!

Ultimately, then, both the strength and weakness of right-wing data modeling is that it is essentially linear and deterministic, which is why it can be relied on to produce definite true/false conclusions.

Bridging the Gap:

How can we bridge this vast chasm which seems to separate left-wing and right-wing data modeling? Well, here are a few possibilities:

*A. Stake out a position on the far-left or on the far-right and demonize the other side.*

For example, the often-excellent “Database Debunkings” web site (www.dbdebunk.com) maintained by Fabian Pascal and C.J. Date is marred by attacks on non-relational data modeling that border on ad hominem. This approach does not seem to me to be productive, since I see significant value on both the left and the right.

*B. Hope for (or work towards) a breakthrough in left-wing data modeling that will make it as solid, deterministic, and two-valued (true/false) as right-wing data modeling.*

This approach seems to me to be unrealistic: In order to logically model deeply layered, fully extensible type hierarchies, you must use a higher-order predicate logic. But all such higher-order logics lead to logical contradictions (i.e., paradoxes) which effectively prevent those logics from being cleanly two-valued.

The contradictions that arise in logic whenever you attempt to model “properties of properties” or “types of types” (i.e., fully extensible type inheritance) are closely analogous to the contradictions that arise in set theory with respect to “sets of sets” (such as Cantor’s paradox, Russell’s paradox, and so on). As Seymour Lipschutz puts it: “Although it is possible to eliminate these known contradictions by a strict axiomatic development of set theory, there are still many questions which are unanswered.”[xiii] In other words, you can shift these problems around, but it is highly unlikely that you can eliminate them, since over 100 years of effort by mathematicians and logicians have failed to do so.

Particularly instructive here is the case of *frame logic* (often abbreviated as *F-logic*). As presented in the well-known paper “Logical Foundations of Object-Oriented and Frame-Based Languages”,[xiv] *F-logic* seeks to eliminate some of the complexities of higher-order logics by using a higher-order syntax, but a *first-order* semantics, in an attempt to give a firm logical foundation to the object-oriented paradigm. In this it partly succeeds, and its approach is one of the bases for the “semantic web” initiative of Tim Berners-Lee and others.[xv]

However, in a recent paper titled “Well-Founded Optimism: Inheritance in Frame-Based Knowledge Bases”,[xvi] Michael Kifer (the principal creator of *F-logic*) admits that the “integration of inference by inheritance into rule-based deductive systems presents serious semantic and computational difficulties” and that the semantics he originally proposed for *F-logic* is “known to yield questionable results in many cases”. Kifer goes on to propose solutions to these problems, but (significantly) those solutions require that *F-logic* be transformed from a *two-valued* logic to a *three-valued* logic. (The third value, as always, is “I dunno”.) This strongly suggests to me that left-wing logic, in general, is *inherently* recursive, nonlinear, and multi-valued (i.e., more than two-valued).
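To give a feel for what a third truth value does to ordinary connectives, here is a small sketch using Kleene’s strong three-valued logic, with Python’s `None` standing for “I dunno”. This is one common three-valued scheme chosen by me for illustration; it is not claimed to be the semantics that Yang and Kifer propose, and the function names are assumptions.

```python
# Truth values: True, False, and None (meaning "unknown" / "I dunno").

def k_not(a):
    """Negation: unknown stays unknown."""
    return None if a is None else (not a)

def k_and(a, b):
    """Conjunction: a single False settles the answer, even if the other is unknown."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def k_or(a, b):
    """Disjunction: a single True settles the answer, even if the other is unknown."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

for a in (True, False, None):
    for b in (True, False, None):
        print(a, b, k_and(a, b), k_or(a, b))
```

Of the nine input combinations, five yield an unknown result for AND or OR, which is the practical cost of admitting the third value.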

In this context it could be argued that two-valued, straight true/false logic is an unrealistic ideal that should be entirely given up. This point of view is suggested, for example, by the clearly unrealistic nature of the material-implication (if-then) truth table in basic propositional logic: If the antecedent is false, then regardless of whether the consequent is true or false, the overall if-then statement is regarded as being true (rather than, realistically, as being “I dunno”).[xvii]
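The truth table in question is easy to generate. The following Python sketch uses the standard definition of material implication as “not p, or q”; the function name `implies` is simply my own label for it.

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material implication: false only when p is true and q is false."""
    return (not p) or q

# Enumerate all four combinations of antecedent and consequent.
table = [(p, q, implies(p, q)) for p, q in product([True, False], repeat=2)]

for p, q, result in table:
    print(f"p={p!s:5}  q={q!s:5}  p->q={result}")
```

Both rows with a false antecedent come out true, which is the feature of two-valued logic that the “I dunno” camp finds unrealistic.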

But the correct right-wing relational answer to this line of argument is surely that at least some logical conclusions must be definitely true or false, and that, as a practical matter, business systems which process many millions of transactions per day need to rely on such definite true/false conclusions as much as possible.

*C. Accept both left-wing and right-wing data modeling as vital to the creation of the business data architecture, taking advantage of the strengths of each.*

This seems to me to be the right way to go: The strengths of left-wing data modeling clearly center on the modeling of the *complete subject-knowledge* of business users and departments at all corporate levels, while by contrast the strengths of right-wing relational data modeling clearly center on the *delivery of massive-scale, definite, reliable business information* to those same business users and departments. So, let’s work together to provide full data-modeling business value to our clients! (Excuse me while I duck, to avoid the *crossfire*.)

[i] John Nolt, Dennis Rohatyn, and Achille Varzi, *Schaum’s Outline of Theory and Problems of Logic*, 2nd ed. (New York: McGraw-Hill, 1998), p. 277.

[ii] Phillip L. Engle, *Far From Equilibrium* (Greensburg PA: Laurel Highlands Media, 2002).

[iii] Cf. Nolt, Rohatyn, and Varzi, p. 280.

[iv] C.J. Date and Hugh Darwen, *Foundation for Future Database Systems: The Third Manifesto*, 2nd ed. (Reading MA: Addison-Wesley, 2000), p. 415.

[v] Date and Darwen, pp. 416-7.

[vi] Nolt, Rohatyn, and Varzi, p. 277.

[vii] Date and Darwen, pp. 16-17.

[viii] Bertrand Meyer, *Object-Oriented Software Construction*, 2nd ed. (Upper Saddle River NJ: Prentice Hall PTR, 1997), pp. 229-230.

[ix] C.J. Date, “The Question of Meaning” (Source: BRCommunity.com:: The Business Rules Community, http://www.BRCommunity.com)

[x] C.J. Date, “Twelve Rules for Business Rules” (5/1/2000), p. 6.

[xi] “What is RDF?”

[xii] Nolt, Rohatyn, and Varzi, pp. 284-6.

[xiii] Seymour Lipschutz, *Schaum’s Outline of Set Theory and Related Topics*, 2nd ed. (New York: McGraw-Hill, 1998), p. 221.

[xiv] Michael Kifer, Georg Lausen, and James Wu, “Logical Foundations of Object-Oriented and Frame-Based Languages”, *Journal of the ACM*, 42:741-843, July 1995.

[xv] “Semantic Web”,

[xvi] Guizhen Yang and Michael Kifer, “Well-Founded Optimism: Inheritance in Frame-Based Knowledge Bases”, *CoopIS/DOA/ODBASE* 2002: 1013-1032.

[xvii] Nolt, Rohatyn, and Varzi, p. 58.