I have recently become more interested in both data quality and philosophy, and was struck by their relationship. Data quality is the extent to which data satisfies the implicit and explicit requirements for its use. It is often expressed in terms of dimensions such as accuracy, completeness, consistency, credibility, currentness, and understandability. Philosophy attempts to answer the deeper questions that we have, providing us with wisdom. These questions often relate to areas such as existence, knowledge, values, reason, mind, and language. Two themes are central to both data quality and philosophy: reality and meaning. This indicates that data quality is about “stuff that matters,” and should thus be taken very seriously. In this blog, I will explore other implications of these overlapping themes.
Let’s start with “reality”: the state of things that exist. In data quality, the relationship with reality is reflected in the quality dimension “accuracy.” In this dimension, it is determined whether a value in a dataset is the same as, or close to, the value of a property of a real-world object. This can be a very simple “equals” question, but also a much more complicated “distance to” question. The latter is very common in spatial data, where positional accuracy is one of the most important measures of data quality. Accuracy is one of the most difficult data quality dimensions to measure. An important reason is that the party that checks the data quality often does not have direct and/or digital access to reality. This can be solved by “going to” reality and observing or measuring the real-world object, but this is a very expensive thing to do. An alternative could be to compare the data with a reference dataset that is closer to reality; however, that is not the same as comparing it with reality. Positional accuracy can also be very difficult to measure when there are no clear boundaries (e.g. an area with trees).
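To make the difference between the “equals” question and the “distance to” question concrete, here is a minimal sketch in Python. It assumes a reference dataset is available as a stand-in for reality, which, as noted above, is not the same as reality itself. The function names, the sample data, and the one-metre tolerance are all hypothetical and only serve as illustration.

```python
import math

def exact_accuracy(observed, reference):
    """Share of shared records whose value equals the reference value."""
    keys = observed.keys() & reference.keys()
    if not keys:
        return 0.0
    matches = sum(1 for k in keys if observed[k] == reference[k])
    return matches / len(keys)

def positional_accuracy(observed, reference, tolerance_m):
    """Share of shared points within tolerance_m metres of the reference point."""
    keys = observed.keys() & reference.keys()
    if not keys:
        return 0.0
    within = sum(
        1 for k in keys
        if math.dist(observed[k], reference[k]) <= tolerance_m
    )
    return within / len(keys)

# Hypothetical data: street names ("equals" question).
street_names = {"a": "Main St", "b": "High St", "c": "Park Ln"}
reference_names = {"a": "Main St", "b": "High Street", "c": "Park Ln"}

# Hypothetical data: tree positions in metres ("distance to" question).
tree_positions = {"t1": (0.0, 0.0), "t2": (10.0, 10.0)}
reference_positions = {"t1": (0.3, 0.4), "t2": (13.0, 14.0)}

name_score = exact_accuracy(street_names, reference_names)
position_score = positional_accuracy(
    tree_positions, reference_positions, tolerance_m=1.0
)
print(f"name accuracy: {name_score:.2f}, positional accuracy: {position_score:.2f}")
```

Note that the “equals” check penalizes “High St” versus “High Street” even though both may describe the same street, which already hints at the interpretation problems discussed next.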
From a philosophical perspective, reality is the state of things as they exist, rather than as they may appear or might be imagined. Reality includes everything that is and has been, whether or not it is observable or comprehensible. The truth refers to what is real, while falsity refers to what is not. Many philosophers have stated that terms such as “reality” and “truth” cannot be defined. Some idealists say that we only know how things appear to us, not how they really are. An object may appear contradictory from different viewpoints, which would imply that there cannot be an object that unites these contradictory characteristics. Whichever philosophers you believe, and whether or not there is one objective reality, the way we perceive reality is subjective. It is based on how we interpret what we see, through filters that we have built from our personal experience. From a data quality perspective, this makes “accuracy” an even harder quality dimension to measure. How can you be sure that the value recorded by the person who measured the data is not just their interpretation of reality?
Relatedly, from a data quality perspective we may be interested in determining the accuracy of unstructured data containing certain statements. You could for example try to measure the data quality of policy documentation, containing all sorts of policy statements that you want to test. You may wonder whether these statements are “true.” Policy documents with a lot of “true” statements would then have a high data quality. Similar difficulties exist with “truth” as with “reality.” Philosophers do not agree on what “truth” really is. Minimalists say that truth is not a characteristic of a conviction or statement. Pragmatists say that something is true when it works. An interesting perspective on the truth of statements is provided by the Tractatus Logico-Philosophicus of Ludwig Wittgenstein. The statements in this philosophical work have been discussed at length by a lot of people, and discussions remain about the truth of the statements. What does not really help is that the work consists only of propositions that are not accompanied by any support; they are just stated. From a data quality perspective, we may relax the “truth” test of policy statements and settle for a subjective alternative. We could, for example, ask a group of people whether they are convinced that a statement is true, given the support that is provided.
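Such a subjective alternative could be scored very simply. The sketch below assumes that a group of reviewers each gives a yes/no answer per statement; the statement identifiers, the votes, and the idea of averaging the per-statement scores into a document score are my own illustrative assumptions, not an established method.

```python
from statistics import mean

def conviction_score(votes):
    """Fraction of reviewers who are convinced that a statement is true."""
    return sum(votes) / len(votes)

# Hypothetical votes: one list of yes/no answers per policy statement.
votes_per_statement = {
    "S1": [True, True, True, False],
    "S2": [True, False, False, False],
}

statement_scores = {
    statement: conviction_score(votes)
    for statement, votes in votes_per_statement.items()
}

# One possible document-level score: the average over all statements.
document_score = mean(statement_scores.values())
print(statement_scores, document_score)
```

This measures how convincing the statements are to this particular group, not whether they are true, which is exactly the concession made above.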
Another theme that is reflected in data quality as well as in philosophy is that of meaning: how something is interpreted. In data quality, this very much builds on what is defined in the data model (or ontology). The classes, attributes, relationships, rules and definitions defined in such a model provide meaning to the actual data. They help you in interpreting the data. This is reflected in the “consistency” quality dimension, sometimes also referred to as semantic consistency: the extent to which data conforms to what is described in the data model. Part of this is a very common type of test, which is often performed automatically using techniques such as schema validation. A more intricate version may also automate the testing of the rules and constraints described in the model, using some sort of rules engine. Meaning is also reflected in the “understandability” dimension. Within this dimension the understandability of text can be tested, for example, by determining whether it uses terms that are understood by the target audience of the text.
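As an illustration of such a consistency test, the sketch below checks records against a tiny hand-written data model and one rule. It is deliberately simplified: a real setup would more likely use a schema language and a rules engine. The model, the rule, and the sample records are hypothetical.

```python
# Hypothetical data model: attribute name -> expected Python type.
MODEL = {"id": int, "name": str, "height_m": float}

# Hypothetical rules: (description, predicate over a record).
RULES = [
    ("height must be positive", lambda record: record["height_m"] > 0),
]

def check_record(record):
    """Return the list of violations of the data model for one record."""
    violations = []
    # Structural check, comparable to basic schema validation.
    for attribute, expected_type in MODEL.items():
        if attribute not in record:
            violations.append(f"missing attribute: {attribute}")
        elif not isinstance(record[attribute], expected_type):
            violations.append(f"wrong type for {attribute}")
    # Rules are only evaluated on structurally valid records.
    if not violations:
        for description, predicate in RULES:
            if not predicate(record):
                violations.append(f"rule violated: {description}")
    return violations

records = [
    {"id": 1, "name": "oak", "height_m": 12.5},
    {"id": 2, "name": "elm", "height_m": -3.0},
    {"id": "3", "name": "ash"},
]

results = [check_record(record) for record in records]
consistency = sum(1 for violations in results if not violations) / len(records)
print(results, f"consistency: {consistency:.2f}")
```

The structural check and the rule check are kept separate on purpose, so that a rule never runs against a record that lacks the attributes it needs.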
From a philosophical perspective, there is a lot of discussion on meaning. Next to questions about what words mean, philosophers have been very interested in meaning itself. Meaning has problems similar to those of “reality” and “truth”: it is subjective in practice and we are not sure whether there really is an objective meaning (although in practice we sort of understand each other). Philosophers have been intrigued by language for a long time. Problems were often seen as a manifestation of a problem in the interpretation of a word. Jacques Derrida stated that the meaning of words depends on the context and that they can have an insoluble ambiguity. The meaning of words is partly determined by the listener and by social uses over which the speaker has no influence. Schopenhauer states that languages provide us with a distorted view of reality, with untrue boundaries and a false feeling of order. It should be clear that with all these philosophical insights a true “understandability” test is probably not feasible. We must settle for simpler tests, such as testing whether terms are part of a pre-defined vocabulary.
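Such a simpler test could look like the sketch below. It assumes that a pre-defined vocabulary for the target audience exists as a plain set of lower-case words; the tokenization (runs of the letters a to z), the vocabulary, and the sample sentence are hypothetical simplifications.

```python
import re

def tokenize(text):
    """Split text into lower-case words (runs of the letters a-z only)."""
    return re.findall(r"[a-z]+", text.lower())

def unknown_terms(text, vocabulary):
    """Words in the text that are not part of the vocabulary, sorted."""
    return sorted(set(tokenize(text)) - vocabulary)

def vocabulary_coverage(text, vocabulary):
    """Share of words in the text that are part of the vocabulary."""
    words = tokenize(text)
    if not words:
        return 1.0
    return sum(1 for word in words if word in vocabulary) / len(words)

# Hypothetical vocabulary of the target audience.
vocabulary = {"the", "tree", "is", "tall", "a", "and"}
text = "The tree is tall and deciduous."

print(unknown_terms(text, vocabulary))
print(f"coverage: {vocabulary_coverage(text, vocabulary):.2f}")
```

A test like this says nothing about whether the reader interprets a known word in the way the writer intended, which is precisely the ambiguity the philosophers point at.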
This blog post has tried to provide you with some insights into the relationship between data quality and philosophy. I have not tried to be complete, nor do I have the knowledge to be complete, nor will I ever have it. For life is a complex thing that we will never be able to fully comprehend. On the other hand, data quality is something that we can start with immediately. Hopefully I have provided you with a taste of the fundamentals of data quality. Maybe I have even provided you with some philosophical inspiration on how to improve your approach to data quality. I end with a philosophical statement attributed to Confucius: “Life is really simple, but we insist on making it complicated.”