We are coming to grips with big data through analytical techniques that enable us to reduce it to useful insights that can guide our future actions. Eventually, we want to use this data to imbue our systems with knowledge of the world, so that they can use their mechanical processing power to serve us more intelligently. We want them not merely to automate routine tasks, but also to process new information against existing knowledge and give us insights that up until now have been obscured by too much data and not enough analysis. This is no longer the province of science fiction. Many companies are working to create “knowledge bases” and to program systems to use that knowledge.
But what is knowledge? How can we teach our systems knowledge if we don’t know what knowledge is and how to represent it inside a machine?
State of the Art
The reigning standards for knowledge representation are the Resource Description Framework (RDF) and the Web Ontology Language (OWL). They aim to make it possible to combine logical predicates from many sources, including the Semantic Web, in order to allow machine reasoning across all of the knowledge available on the World Wide Web. While their goals are worthy and they have made some contributions, these standards have also thrown up a couple of significant roadblocks to achieving those goals.
The first is their insistence that all knowledge be reduced to three things, collectively called a triple: a subject, a predicate, and an object. Not every fact can be reduced to these three things. For instance, the statement that “John Marshall was Chief Justice of the Supreme Court during the years 1801 to 1835” does not reduce to any single triple; forcing it into triples only makes knowledge representation harder. Second, the so-called “predicate” of a triple is more like the verb phrase of a sentence. Overloading “predicate” in this way obscures the word’s prior meanings in English grammar and in logic, making it harder to leverage those fields of knowledge. Knowledge of grammar is essential to natural language processing, and knowledge of logic is essential to reasoning.
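To make the point concrete, here is a minimal sketch in Python, using hypothetical predicate and node names, of the John Marshall statement as a single n-ary assertion next to the several triples, hung off an invented intermediate node, that the triple model forces on us.

```python
# A single n-ary assertion captures the whole statement in one record
# (hypothetical predicate name).
fact = ("presidedOverCourt", "John Marshall", "Supreme Court", 1801, 1835)

# Forced into subject-predicate-object triples, the same statement needs an
# invented intermediate node ("_:tenure1") and four separate triples, the
# usual reification workaround.
triples = [
    ("_:tenure1", "chiefJustice", "John Marshall"),
    ("_:tenure1", "court",        "Supreme Court"),
    ("_:tenure1", "tenureBegan",  1801),
    ("_:tenure1", "tenureEnded",  1835),
]
```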
In addition to the difficulties posed by RDF and OWL, there is some confusion between knowledge and its representation in data. One hears talk of “knowledge graphs.” Yes, knowledge is graph-like, but so is all data. It confuses things to believe that one must organize data as a graph in order to represent knowledge. We can treat the relationships in any data as a graph for analytical purposes.
Finally, there is confusion between relationships and attributes. A relationship in the real world is between two entities—two independent things. The statement, “I am 39 years old” (I wish!), expresses an attribute—an inherent characteristic—of me, not a relationship. That’s because my age does not have any existence without me—it is not an entity. When we reduce this statement to data, there will be a data relationship between an identifier representing me and a value representing my age, but that data relationship expresses an attribute of me, not a relationship with me. The sketch below illustrates the distinction.
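Here is a minimal sketch of that distinction in Python, with hypothetical classes: age is modeled as an attribute of a person, while employment relates two independent entities.

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int            # attribute: has no existence apart from the person

@dataclass
class Company:
    name: str

# A relationship proper holds between two independent entities.
@dataclass
class Employment:
    employee: Person
    employer: Company
```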
Let’s step away from these things that make knowledge harder to represent, and get back to the simpler basics.
Provenance and Confidence
Knowledge comes in many forms, the most basic of which is information that is true, to the best of our knowledge. Here is the first problem. It is very difficult—some even say impossible—to know whether something we accept as true is “really” true. But there’s more to it than that. You’ll believe some sources more than others. If someone you don’t know tells you that the total world population is 7.4 billion, you might or might not believe it. If Wikipedia shows that number, then you’re more likely to accept it as true. If you learn that the person who told you the number is one of the world’s foremost authorities on population data, then you’re even more likely to believe the source. However, if that same person tried to tell you who was going to win the World Cup, you would probably not place much confidence in that data.
So, for every significant piece of data we record, we are going to want to record:
- The provenance of the data: the source it came from, and when the data was obtained from that source
- Our confidence in the source for that data: The simplest way to record this is as a number in the range of 0 (no confidence) to 1 (full confidence); this approach enables the use of fuzzy logic for calculating what our confidence should be in data that is combined from several sources
As humans, we do this without even thinking about it, by remembering who told us something, when they told us, and how much we trust them as a source of that kind of information. In the recent past, it was impractical to record the provenance and confidence of every data item. Now it is possible, though it will require careful modeling and design so as not to cause an unfeasible explosion in data volumes.
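Here is a minimal sketch, in Python with hypothetical names and values, of a fact recorded alongside its provenance and confidence, together with the standard fuzzy-logic operators (minimum for “and”, maximum for “or”) that could be used to derive a confidence score for data combined from several sources.

```python
import datetime
from dataclasses import dataclass

@dataclass
class RecordedFact:
    predicate: str            # what the fact asserts, e.g. "worldPopulation"
    arguments: tuple          # the fact's values
    source: str               # provenance: where the fact came from
    obtained: datetime.date   # provenance: when it was obtained
    confidence: float         # 0.0 = no confidence, 1.0 = full confidence

population = RecordedFact(
    predicate="worldPopulation",
    arguments=(7_400_000_000,),
    source="Wikipedia",
    obtained=datetime.date(2017, 1, 1),   # hypothetical date of record
    confidence=0.9,                       # hypothetical score
)

# Standard fuzzy-logic operators for deriving confidence in combined data:
def fuzzy_and(*scores: float) -> float:
    """Confidence in a conclusion that requires all of its premises."""
    return min(scores)

def fuzzy_or(*scores: float) -> float:
    """Confidence in a fact that any one of several sources supports."""
    return max(scores)
```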
The Representation of Knowledge Itself
Besides recording the provenance of and our confidence in data, we will often want to record what the data means, in a way that enables our machines to combine facts to develop new facts, or to confirm assertions made by users. The most straightforward way to do this is to record the fact as a logical predicate with variables. This is not the predicate of RDF, nor of English grammar. See my blog from December 19, 2016, “What are Data and Information”, for details. The identity of the predicate corresponds to the meaning asserted by the fact, and the predicate’s variables are the entities and attributes referenced by the fact. Consider the statement:
John Marshall was Chief Justice of the Supreme Court during the years 1801 to 1835.
We could fashion a predicate to match this statement as follows:
presidedOverCourt(<PersonName>, <CourtName>, <TenureBeginYear>, <TenureEndYear>)
The original proposition is reconstituted, so to speak, by making the obvious substitutions:
presidedOverCourt(“John Marshall”, “Supreme Court”, 1801, 1835)
One could represent this statement, along with its provenance (source and date of record) and a confidence score, in XML or JSON or, if one wished, as a table with seven columns; a sketch follows below. One could also make assertions, in the form of predicates, about the meaning of predicates themselves. It is illuminating to think about the unwritten predicates behind every relational database table, JSON object, and XML complex element. We’ll explore this point further in a future blog.
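As one illustration, here is a minimal sketch, in Python with hypothetical field names and values, of the statement serialized as JSON with its provenance and confidence. The predicate name plays the role of the table name; the four arguments plus source, date of record, and confidence supply the seven column values.

```python
import json

# Hypothetical field names; the source and date are placeholders.
fact = {
    "presidedOverCourt": {
        "personName": "John Marshall",
        "courtName": "Supreme Court",
        "tenureBeginYear": 1801,
        "tenureEndYear": 1835,
        "source": "a biographical reference (hypothetical)",
        "dateOfRecord": "2017-06-01",
        "confidence": 0.95,
    }
}
print(json.dumps(fact, indent=2))
```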
This monthly blog talks about data architecture and data modeling topics, focusing especially, though not exclusively, on the non-traditional modeling needs of NoSQL databases. The modeling notation I use is the Concept and Object Modeling Notation, or COMN (pronounced “common”), and is fully described in my book, NoSQL and SQL Data Modeling (Technics Publications, 2016). See http://www.tewdur.com/ for more details.
Copyright © 2017, Ted Hills