I don’t like talking about the relational theory of data. It is absolutely fundamental to any deep understanding of data, but most practitioners get along fine without it. It’s more the implementers of database management systems (DBMSs) who need to understand relational theory, so teaching relational theory to ordinary practitioners is a bit like tormenting people with irrelevant theory before you let them get on with the business at hand. Moreover, some of those who understand relational theory use their knowledge to beat other people over the head with it. I don’t want to be associated with that high-handed approach to this important theory.
But I’ve been goaded. Google made me do it.
My attention was drawn to a video put out by some folks at Google, Data Modeling for BigQuery. The video is fine for the most part, but it makes some misstatements about relational theory that just drive me crazy. They repeat commonly accepted misconceptions about relational databases—misconceptions that, unfortunately, have driven some of the “advances” we’ve seen of late in the realm of database technology. There have definitely been some true advances, but some new technology is merely different without being better.
If you’re a practitioner, designing, implementing, and using databases, whether SQL or NoSQL, this won’t matter much to you, although it never hurts to learn a little more about the theory of data. However, if you are a programmer who might be the one who builds the next NoSQL mega-star that will replace decades-old technology, you need to know this, because this knowledge will enable you to blind-side every established DBMS vendor, whether SQL or NoSQL.
So, I’m going to correct the record here.
Relational data theory doesn’t tell us how to store data (in tables); rather, it tells us how to understand data.
On the surface, it seems that relational theory introduced the idea of storing data in tabular form, and there’s some truth to that. However, when you treat relational theory as purely logical, divorced from any assumptions about physical storage, you realize that it provides the only comprehensive theory of data that works for data stored in any form, whether tabular (row-oriented, column-oriented, or key/value), document (XML or JSON), graph, or whatever.
Let me address some of the misstatements in the Google video.
Misstatement #1: The relational model of data is about mimicking relationships in the real world (2:55).
All databases, not just relational ones, mimic relationships in the real world. Document databases mimic real-world relationships between fields of a document, and certainly graph databases mimic relationships between nodes of a graph. Tabular databases mimic relationships between the columns of a table.
Perhaps this belief that relational theory is uniquely about mimicking real-world relationships comes from the fact that, in ordinary English, “relation” and “relationship” are synonyms. Unfortunately, the “relation” in relational theory does not refer to relationships. To put it simply, the “relation” of relational theory is a table where the order of its rows and columns carry no information. Relationships are something else. It’s a terminological tragedy that relational theory overloaded the word “relation” and created this confusion with “relationship.”
Misstatement #2: The cost of data drove the desire to eliminate duplication (6:44).
The relational theory of data was introduced to the world in 1970 when IBM researcher E. F. “Ted” Codd published a paper, A Relational Model of Data for Large Shared Data Banks. In this paper he states, “Although it is logically unnecessary to store both a relation and some permutation of it, performance considerations could make it advisable.” In other words, far from recommending against data duplication, he advocates duplicating data if it will make the database perform better.
Misstatement #3: Dependence on the key was important in order to eliminate data duplication (7:34).
Well, yes, but the desire to eliminate data duplication was not because storage costs were high in 1970. It was because data duplication leads to the potential for inconsistent data. For instance, data in one part of a database might be updated while data in another part of the database that is supposed to be identical (that is, duplicate) might be incorrectly left unchanged. The result could be that various queries on the database return inconsistent results.
Keep in mind that this paper was written in an era when no one was thinking of distributed databases with hundreds of servers working hard to keep data consistent across the nodes. The assumption in the Relational Model paper was that all the data was in a single-node database. Inconsistency would be the result, then, not of physical data distribution, but of a design allowing data duplication. The goal of making everything dependent on the key was one of correctness, not of minimizing storage. And, again, Codd makes allowance for deliberately duplicating data if it helps with performance.
Misstatement #4: Dependence on the key was important because of row-style access (9:34).
Wrong again. The reality is that, in order to understand any set of data, one needs to know which parts of the data make it unique and distinguishable from other sets of data. Those parts that make it unique are its key. That is a logical distinction, and is unrelated to whether the data is stored in rows, columns, documents, graphs, or whatever.
So Why Does Relational Theory Matter?
Relational theory, when properly understood, and when divorced from all considerations of physical storage, gives us the ability to understand, describe, and design any data at the logical level. If properly applied, it gives us data independence, which means independence from the form of storage. That independence is critical, so that we can store the same data in many different physical forms without distorting or corrupting it in some way.
DBMS Rock-Star, are you Listening?
The DBMS that will wipe out all its competition will:
- Treat all possible physical data organizations as implementation details, and support them all;
- Have, as a basic feature, the ability to manage the deliberate duplication of data—not only duplication in indexes, but also deliberate duplication in denormalized tables, documents, and graphs; and
- Have a logical data language that can describe any data, no matter how it’s stored or represented, without reference to its storage or representation.
The last feature can only be achieved by a data language that’s faithful to relational data theory.
This monthly blog talks about data architecture and data modeling topics, focusing especially, though not exclusively, on the non-traditional modeling needs of NoSQL databases. The modeling notation I use is the Concept and Object Modeling Notation, or COMN (pronounced “common”), and is fully described in my book, NoSQL and SQL Data Modeling (Technics Publications, 2016). See http://comn.dataversity.net/ for more information.