Occasionally I encounter discussions on various social media sites arguing passionately about the definitions of the terms “data” and “information.” Such threads usually involve wide-ranging opinions but very few appeals to authority, often intertwined with questions from frustrated readers who rightly ask, “What difference does it make?”
That last question is a good place to start. In my opinion, there are only two reasons why it matters. One reason is that it’s a bit embarrassing to be data/information professionals whose very job titles are based on words we can’t agree on. The other reason is that a deeper understanding of the difference between data and information paves the way for us to think clearly about meaning, and enables us to design systems that extract and use the meaning arising from our data and information.
As mentioned, these discussions rarely invoke authorities, which leaves each of us to act as our own authority when offering definitions. Compared to the definitions found in DAMA, English dictionaries, and the general Web, the definitions below are narrower and are therefore, I believe, more useful than the standard generalities.
Unfortunately, a short blog post like this can’t give the full justification for these definitions. I give that justification in my book NoSQL and SQL Data Modeling (Technics Publications, 2016). But I’ll do my best to explain the definitions here by example.
“Hello, My Name Is _______”
Suppose you are at a conference and the leader is asking attendees to introduce themselves by stating their first names and their cities of residence. We hear the following:
“My name is Pam and I live in Los Angeles.”
“My name is Siva and I live in Raleigh.”
. . .
This is getting a bit tedious, especially since there are 30 people in the room. So the leader interrupts and asks people to be more efficient by simply stating their first names and cities without the full sentence. Now we hear this:
“Sally, Chicago”
“Giuseppe, Milano”
. . .
As I’m sure most of you have guessed, the responses in the second set are data, and only have meaning if they are provided with the context that was explicit in each of the responses in the first set. We can provide that context generically in a form like this:
My name is X and I live in Y.
We understand that the values substituted for X should be personal names, and the values substituted for Y should be city names.
We assume that the conference attendees are telling the truth—that each statement is a fact—but perhaps we doubt some of them. This is a clue that each of the full name-and-city sentences we started with is a proposition, which is defined by Merriam-Webster as “an expression in language or signs of something that can be believed, doubted, or denied, or is either true or false.” Thus, every proposition has a truth-value. Propositions are at the core of propositional logic, which studies the truth-value of propositions in various combinations. Propositional logic is fundamental to the logic on which our data and programming languages depend.
When we replace parts of a proposition with symbols that we call variables, the resulting pattern for a proposition is called a predicate. So, “My name is X and I live in Y” is a predicate, where X and Y are the predicate’s variables. If we replaced the variables with values, we would once again have a proposition. Thus, a predicate can be defined as “a statement containing variables which, when the variables are bound, yields a proposition.” Some of you might know that the relational theory of data, applicable to all data no matter how it’s stored, is based on first-order predicate logic.
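To make the predicate-and-binding idea concrete, here is a minimal Python sketch. It treats a predicate as a string template and produces a proposition once every variable is bound. The helper names `variables_of` and `bind` are my own inventions for illustration; they are not part of any library or of COMN.

```python
from string import Formatter

# The predicate from the text, with X and Y written as template placeholders.
NAME_AND_CITY = "My name is {X} and I live in {Y}."


def variables_of(predicate):
    """Return the variable names that appear in a predicate template."""
    return [name for _, name, _, _ in Formatter().parse(predicate) if name]


def bind(predicate, **values):
    """Bind values to every variable of the predicate, yielding a proposition.

    Hypothetical helper: refuses to produce a proposition while any
    variable is still unbound.
    """
    missing = [v for v in variables_of(predicate) if v not in values]
    if missing:
        raise ValueError(f"unbound variables: {missing}")
    return predicate.format(**values)


proposition = bind(NAME_AND_CITY, X="Sally", Y="Chicago")
print(variables_of(NAME_AND_CITY))  # ['X', 'Y']
print(proposition)                  # My name is Sally and I live in Chicago.
```

Note that the result of `bind` is a statement that can be believed, doubted, or denied, while the template on its own cannot.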
Now comes the tricky part. Most of us intuitively reacted right away with the observation that something like “Sally, Chicago” is data. However, the personal name Sally and the city name Chicago are only data if they are intended to be bound to the variables of a predicate. Otherwise, they’re just values.
Speaking of Sally and Chicago individually, each one is a datum, which is “that which is intended to be given to a predicate as a value for one of its variables.” This definition is consistent with datum’s Latin meaning, “that which is given.” Data, in this strict sense, is simply the plural of datum. We observe, then, that a value is a datum not by its nature, but by playing a role—that of being bound to a predicate’s variable.
A proposition is a piece of information. Information is defined as “a collection of propositions.” The term “information,” like the word “water,” refers to a mass quantity, meaning something we don’t count. When we say, “a cup of water,” we’re not counting water molecules. Similarly, when we say, “information,” we’re not counting propositions.
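These definitions can be sketched in a few lines of Python. In the sketch below, the pairs are just values until they are handed to a predicate; once bound, each pair yields a proposition, and the collection of propositions is the information. The function name `to_information` is hypothetical and chosen only for this illustration.

```python
# The predicate from the conference example.
PREDICATE = "My name is {X} and I live in {Y}."

# On their own these are just values. They become data only when they are
# intended to be bound to the variables of a predicate.
responses = [("Sally", "Chicago"), ("Giuseppe", "Milano")]


def to_information(predicate, rows):
    """Bind each (X, Y) pair to the predicate and collect the propositions."""
    return {predicate.format(X=x, Y=y) for x, y in rows}


information = to_information(PREDICATE, responses)

for proposition in sorted(information):
    print(proposition)
```

The result is a set because, in this reading, information is a collection of propositions rather than something we count one at a time.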
Linkage to Data Management
In programs and databases, variables like X and Y are usually named mnemonically, for example as FirstName and CityName. We also give them types, such as char{2-100}, or define them as foreign keys to tables of first names and city names.
It’s helpful to realize, then, that database fields hold values intended to be bound to predicates—which is to say, they hold data. We spend a lot of time describing the variables absent any predicates, but we rarely spell out the predicates. Data management needs to embrace predicates.
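As a rough illustration of what spelling out the predicate could look like, the sketch below stores the predicate next to a table definition and binds each row to it. It uses Python's built-in sqlite3 module. The table name `attendee` and the constant `ATTENDEE_PREDICATE` are assumptions made for this example; FirstName and CityName are the variable names mentioned above.

```python
import sqlite3

# The predicate the table is meant to serve, written out explicitly.
# Its variables match the column names.
ATTENDEE_PREDICATE = "My name is {FirstName} and I live in {CityName}."

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read each row by column name
conn.execute(
    "CREATE TABLE attendee (FirstName TEXT NOT NULL, CityName TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO attendee (FirstName, CityName) VALUES (?, ?)",
    [("Pam", "Los Angeles"), ("Siva", "Raleigh")],
)

# Each row holds data: values intended for the predicate's variables.
# Binding a row to the predicate gives back a proposition.
propositions = [
    ATTENDEE_PREDICATE.format(**dict(row))
    for row in conn.execute(
        "SELECT FirstName, CityName FROM attendee ORDER BY FirstName"
    )
]
conn.close()

for p in propositions:
    print(p)
```

Without `ATTENDEE_PREDICATE`, the table would still hold the same values, but a reader would have to guess what each row asserts.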
Other Definitions of Data and Information
Many definitions of data draw on the observation that data carries no meaning without context, but that is an observation of a quality of data, not a definition. “Data” is sometimes used as a pejorative term—“it’s just data”—to emphasize the lack of information. Sometimes, by analogy, rudimentary information is referred to as data—but, strictly speaking, it’s still information.
Some assert that data plus metadata equals information, and that is kinda/sorta true, but only if one classifies a predicate as metadata. Metadata is a kind of data, but a predicate is not data; it’s just a predicate.
What Have We Gained?
What have we gained by these definitions and observations, besides—hopefully—terms that are more precise and more useful as building blocks for other terms and concepts? In my opinion, one of the most valuable insights is that propositional logic and predicate logic can be used to reason with data. We also learn that something is data not by its nature but by its use, and that the context for data is a predicate. All of these things must be understood before semantics can be conquered.
Most of these definitions don’t roll off the tongue, so don’t hope to impress your business stakeholders by quoting them. Instead, just integrate them into your thinking, use the terms “data” and “information” in ways consistent with these definitions, and you’ll find that everything is clearer and more concrete.
This monthly blog talks about data architecture and data modeling topics, focusing especially, though not exclusively, on the non-traditional modeling needs of NoSQL databases. The modeling notation I use is the Concept and Object Modeling Notation, or COMN (pronounced “common”), and is fully described in my book, NoSQL and SQL Data Modeling (Technics Publications, 2016). See http://www.tewdur.com/ for more details.