Understand Relational to Understand the Secrets of Data

BLG01x - image - EDI don’t like talking about the relational theory of data. It is absolutely fundamental to any deep understanding of data, but most practitioners get along fine without it. It’s more the implementers of database management systems (DBMSs) who need to understand relational theory, so teaching relational theory to ordinary practitioners is a bit like tormenting people with irrelevant theory before you let them get on with the business at hand. Moreover, some of those who understand relational theory use their knowledge to beat other people over the head with it. I don’t want to be associated with that high-handed approach to this important theory.

But I’ve been goaded. Google made me do it.

My attention was drawn to a video put out by some folks at Google, Data Modeling for BigQuery. The video is fine for the most part, but it makes some misstatements about relational theory that just drive me crazy. They repeat commonly accepted misconceptions about relational databases—misconceptions that, unfortunately, have driven some of the “advances” we’ve seen of late in the realm of database technology. There have definitely been some true advances, but some new technology is merely different without being better.

If you’re a practitioner, designing, implementing, and using databases, whether SQL or NoSQL, this won’t matter much to you, although it never hurts to learn a little more about the theory of data. However, if you are a programmer who might be the one who builds the next NoSQL mega-star that will replace decades-old technology, you need to know this, because this knowledge will enable you to blind-side every established DBMS vendor, whether SQL or NoSQL.

So, I’m going to correct the record here.

Relational data theory doesn’t tell us how to store data (in tables); rather, it tells us how to understand data.

On the surface, it seems that relational theory introduced the idea of storing data in tabular form, and there’s some truth to that. However, when you treat relational theory as purely logical, divorced from any assumptions about physical storage, you realize that it provides the only comprehensive theory of data that works for data stored in any form, whether tabular (row-oriented, column-oriented, or key/value), document (XML or JSON), graph, or whatever.

Let me address some of the misstatements in the Google video.

Misstatement #1: The relational model of data is about mimicking relationships in the real world (2:55).

All databases, not just relational ones, mimic relationships in the real world. Document databases mimic real-world relationships between fields of a document, and certainly graph databases mimic relationships between nodes of a graph. Tabular databases mimic relationships between the columns of a table.

Perhaps this belief that relational theory is uniquely about mimicking real-world relationships comes from the fact that, in ordinary English, “relation” and “relationship” are synonyms. Unfortunately, the “relation” in relational theory does not refer to relationships. To put it simply, the “relation” of relational theory is a table where the order of its rows and columns carry no information. Relationships are something else. It’s a terminological tragedy that relational theory overloaded the word “relation” and created this confusion with “relationship.”

Misstatement #2: The cost of data drove the desire to eliminate duplication (6:44).

The relational theory of data was introduced to the world in 1970 when IBM researcher E. F. “Ted” Codd published a paper, A Relational Model of Data for Large Shared Data Banks. In this paper he states, “Although it is logically unnecessary to store both a relation and some permutation of it, performance considerations could make it advisable.” In other words, far from recommending against data duplication, he advocates duplicating data if it will make the database perform better.

Misstatement #3: Dependence on the key was important in order to eliminate data duplication (7:34).

Well, yes, but the desire to eliminate data duplication was not because storage costs were high in 1970. It was because data duplication leads to the potential for inconsistent data. For instance, data in one part of a database might be updated while data in another part of the database that is supposed to be identical (that is, duplicate) might be incorrectly left unchanged. The result could be that various queries on the database return inconsistent results.

Keep in mind that this paper was written in an era when no one was thinking of distributed databases with hundreds of servers working hard to keep data consistent across the nodes. The assumption in the Relational Model paper was that all the data was in a single-node database. Inconsistency would be the result, then, not of physical data distribution, but of a design allowing data duplication. The goal of making everything dependent on the key was one of correctness, not of minimizing storage. And, again, Codd makes allowance for deliberately duplicating data if it helps with performance.

Misstatement #4: Dependence on the key was important because of row-style access (9:34).

Wrong again. The reality is that, in order to understand any set of data, one needs to know which parts of the data make it unique and distinguishable from other sets of data. Those parts that make it unique are its key. That is a logical distinction, and is unrelated to whether the data is stored in rows, columns, documents, graphs, or whatever.

So Why Does Relational Theory Matter?

Relational theory, when properly understood, and when divorced from all considerations of physical storage, gives us the ability to understand, describe, and design any data at the logical level. If properly applied, it gives us data independence, which means independence from the form of storage. That independence is critical, so that we can store the same data in many different physical forms without distorting or corrupting it in some way.

DBMS Rock-Star, are you Listening?

The DBMS that will wipe out all its competition will:

  1. Treat all possible physical data organizations as implementation details, and support them all;
  2. Have, as a basic feature, the ability to manage the deliberate duplication of data—not only duplication in indexes, but also deliberate duplication in denormalized tables, documents, and graphs; and
  3. Have a logical data language that can describe any data, no matter how it’s stored or represented, without reference to its storage or representation.

The last feature can only be achieved by a data language that’s faithful to relational data theory.


This monthly blog talks about data architecture and data modeling topics, focusing especially, though not exclusively, on the non-traditional modeling needs of NoSQL databases. The modeling notation I use is the Concept and Object Modeling Notation, or COMN (pronounced “common”), and is fully described in my book, NoSQL and SQL Data Modeling (Technics Publications, 2016). See http://comn.dataversity.net/ for more information.

Share

submit to reddit

About Ted Hills

Ted Hills has been active in the Information Technology industry since 1975. At LexisNexis, Ted co-leads the work of establishing enterprise data architecture standards and governance processes, working with data models and business and data definitions for both structured and unstructured data.

  • Chris Pehura

    Ever so often, I’m asked what an “ER diagram” is. Remember those? There was a time where we would have these huge posters of them. We would roll them out on a huge table.

    People just don’t work that way today.

    • Frank van Amsterdam

      Putting a map of the world on the table when you are looking for a building in Washington DC doesn’t mean the map is wrong but that you have the wrong map. Sounds obvious. If your ER diagram is so big and complex you don’t understand or need it, perhaps you should look for one that fits your purpose rather than blame the approach.

      • Chris Pehura

        You should have asked about my context rather saying the following:

        “If your ER diagram is so big and complex you don’t understand or need it, perhaps you should look for one that fits your purpose rather than blame the approach”

        Just because something is big doesn’t mean it’s complex.

        I thought we were coming from the same background. Obviously, I’m mistaken. When I started, posters were a common practice for managing the change of big things involving millions of dollars. When they were done right they got people engaged to help them better articulate what needed to be done and how everyone could contribute. The poster was a show piece to discuss and was a byproduct of that discussion, Today some companies are still using posters for their large initiatives and hang them framed on the corporate walls. Years ago it was system diagrams. Today it’s cultural maps and journeys.

        People aren’t working this way anymore (using posters or using deliverables as show pieces) because of “paper”. People want to be paperless. But it’s more than that. Rather than deliverables being byproducts of conversations, the work is done differently. They are designing deliverables individually and then the deliverables are reviewed via a group. This is a little backwards and can greatly lower the quality of that deliverable..

        I’m all for going paperless, but there are times when we need paper. There is a psychological need and use for it. Look at Agile. Look at Kanban. Some are using posters to support those practices.

        Last year, I used one of my favorite traditional approaches called the “war room”. This is where over a hundred feet of posters were pasted around a room to form a 360 degree picture of what was going on. The client had a couple of billion on the line, so they didn’t want to miss anything. Was it complicated? Yes, but then those complications were targeted and simplified. Because of the psychological underpinnings the war room helped fast-track all conversations the client needed to have. Almost a year of conversations were condensed down to under two months. When the programs launched and ran their course, there were no earth shattering surprises.

        In my line of work posters are an effective approach. If I work with those 50 and over, we use posters. For those that are younger, they want to avoid as much paper as possible.

        My apologies for the confusion.

        • Frank van Amsterdam

          I think we agree more than we disagree. My main point was that ER diagrams don’t need to be huge posters. They need to explain a problem which can be simple or complex, small or large and therefore can be represented by a smaller or larger diagrams. Problem of posters is that it is difficult to take submodels separate which makes it seem more complex than it may be in reality.

  • Tessalator

    The real world is more hierarchal than relational, and things don’t fit nicely into fixed definition buckets. While I agree with much of what you say, I see relational perspectives more helpful in developing storage strategies than understanding the “true” nature of the information.

    • Clifford Heath

      The “real world” is neither hierarchical nor relational. Both are mental constructs. In reality there are no things, only processes. But if we allow the mental construct that there are things, and there are relationships between things, then it becomes clear that some of those relationships create hierarchies. In that respect, hierarchies are secondary to relationships. Modeling is the process of creating “nice fixed definition buckets” that simplify real world complexities for a specific purpose. No model is ever reality.

    • Ted Hills

      I agree with Clifford that the world is neither hierarchical nor relational. In fact, one frequently gets into trouble trying to force things into a single hiearchy. Watch for my “duck-billed platypus” blog coming next month.

      The fact that you reference “fixed definition buckets” in your comment indicates to me that you are reacting to the traditional approach to the relational model, which is to pre-define a fixed database schema. I agree with you that that doesn’t fit much of the world–though it does fit a good portion of it, and that portion needs to continue to be served well. But I think you’ve missed my main point. The relational model, when properly understood, is *not* about storage strategies. It’s about what data is and what data means. Any set of data, whether hierarchical, flat, nested, or graph, can be described as a relation.

      It’s too bad I can’t teach the relational model in a series of blogs. It’s too deep for that format. For now, please take it on faith that the relational model is not about tables.

Top