Is a Science of Data Possible?

A common complaint of data managers is that business people, and even many IT staff, simply do not understand the importance of data. Is this complaint just grouchiness, or a poor approach to
getting attention? I do not think so. It seems to me that it is an observation of fact; and like many such observations, there is an unasked question that lies behind it. That question has to be
whether business people, and others, understand why data is important. And if they do not think that data is important, it must be either because data really is not important, or because data is
important but they fail to see it as such.

I, for one, think that data is important, and I think that the best way in today’s world to demonstrate that anything is important is for it to have its own science. I cannot prove this very
easily, and there may be other ways to demonstrate that data is important. However, let us just accept for a moment that if data had its own science, it would be generally regarded as important,
and that being important is a good thing.


There is No Science of Data – Yet

There is no word or term for “a science of data” that exists in common or even specialized usage. “Data management” refers more to a craft, like some kind of modern
equivalent of plumbing or carpentry, than to a formal science. There is no “Department of Data” in every university in the world. Many would object to this point and claim that actually
there is a huge amount of work done on data-related matters in universities, research institutions, and the private sector. And a few universities really do specialize in data-related
matters.
 
What about “information science”? Surely many university departments have been created to teach it and do research into it? This is true; but if you look closely at these departments, a
large percentage of what they are concerned about is still technology. Beyond that they still treat data as a “black box.” There is usually a heavy emphasis on networks (to move data
around), or storage and retrieval (usually with reference to performance and reliability), or security (who can access what data). Other topics include human interaction with information
(inevitably interaction with technology) and artificial intelligence (again, something that can be done with data). Thus, information science is all about doing things to or with data, but never
dealing with data as data outside of all other contexts.


What is a Science?

From this we can see that although data is essential for a number of other activities that we call sciences, none of these sciences has a primary orientation to data.

If we try to think whether a science is possible for data, we need to first work out what a science is. It might be that a science is any of the set of sciences that are called by their distinct
names, and which we find as university departments. However, many of these are relatively new, like “genetics,” “biochemistry,” and “immunology.” Only within the
last hundred years or so have they acquired names and become recognized as distinct sciences. So the fact that we have no name for a science of data, and no university departments for it, means only
that we are a little behind other fields of endeavor. We should be perfectly free to invent our own science of data, if we can justify it.

A more serious problem is that something cannot be a distinct science if everything we need to know about it can be answered by some science that already exists. My response to this would be that
at least some questions about data cannot be answered by other sciences. For instance, the relationship between character patterns and data quality in a given column in a given table is something
about which I have to ask questions in my job. No other science will answer them for me. If I want to know whether a dimensional model is more appropriate than an ER model in a given situation,
only the science of data can help me.
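To make the character-pattern question a little more concrete, here is a minimal Python sketch of one way such a question might be posed. It assumes a simple convention in which every letter is reduced to "A" and every digit to "9"; the function names and the sample postal codes are invented for illustration and are not taken from any real system.

```python
from collections import Counter


def char_pattern(value: str) -> str:
    """Reduce a value to its character pattern: letters -> 'A', digits -> '9'."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)  # punctuation and spaces are kept as they are
    return "".join(out)


def profile_column(values):
    """Count how often each character pattern occurs in a column of strings."""
    return Counter(char_pattern(v) for v in values)


# Hypothetical column of postal codes (illustrative data only)
postal_codes = ["10001", "94105", "9410S", "10001-1234", "n/a"]
profile = profile_column(postal_codes)

# Print the patterns, most frequent first
for pattern, count in profile.most_common():
    print(pattern, count)
```

Rare patterns in such a profile are only candidates for data quality problems. Whether a given pattern is actually an error is exactly the kind of question that still has to be answered about the data itself.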

The best that other sciences can do is to supply tools like statistical analysis. Now, people can just as well use tools like statistical analysis in botany and meteorology, so using them in data
analysis does not prevent a science of data from existing, just as it does not eliminate botany and meteorology.


Asking Questions

A more robust response to the challenge of having questions about data answered by other sciences is to say that too few questions are being asked about data.

All sciences start with people asking questions and never feeling comfortable with the answers they get. People who ask these questions may first try to analogize them, to see if they can get the
answers from other pre-existing sciences. If the discomfort remains, then they are forced to think through the questions in a new way. As if this was not difficult enough, science also demands that
we ask questions in such a way that they can actually be answered. Asking a question that cannot be answered is the cardinal sin in science. To get to the point where questions can be asked in this
way is one of the most difficult things a human being can do.

It seems to me that this is where we are with data today. We can ask a few questions about it with clarity. But there is a lot more going on that we see, and for which we cannot as yet state clear
questions. Until we can state these questions, we will not have developed a distinct, autonomous science of data, and we will not be really sure if we are masters in our own house.


Methods and Properties

A second front in pushing for a science of data is to consider whether data has any characteristics that are necessarily, always, and only part of data. Such characteristics are
technically termed “properties.” Data has one such property, which Larry English has pointed out: data is not consumed as it is used. We have yet to work out the vast ramifications that
flow from this one property. Perhaps data has more properties. If data has properties, then we can claim that it has to be studied autonomously and needs its own science. If some other science
thinks it can deal with these properties successfully, it will have to prove it. This will not be an easy task. As of yet, we have not been able to fully resolve biology to biochemistry, and
biochemistry to organic chemistry, and organic chemistry to physics, and physics to mathematics. There may be overlaps, but these sciences have not all merged into one. The reason is that they each
have their unique properties that cannot be generalized or abstracted away.

Besides the subject matter having properties, a science of data may be able to develop its own methods. These methods may apply techniques from other sciences to the questions of data, or may be
entirely new. However, they will not be identical to anything that is used in any other science. Data modeling is one of these methods. Normalization is another. These are not ad hoc ways of
putting together a database. They are methods that enable us to make sense of and bring order to what would otherwise be an impenetrable tangle of data elements. Data modeling and normalization are
not used by other sciences to solve their special problems, and they are not copied in their entirety from some other science and used to solve identical problems with data. Where unique methods
are used to gain knowledge in a particular domain, we have a strong argument for a distinct science.


What is the Science of Data?

If we can now accept that a science of data is possible, we should ask what sort of science it might be. I think that the answer must be that it would be a specialized science since it could not
possibly be one of the general sciences like physics. It would have to be an applied science, since the rationale for data is the need for human beings to record information, rather than anything
we find external to humanity. It would also have relations to different sciences. For instance, since data is the stored representation of a fact, the aspect of storage needs to be dealt with, and
that brings us into contact with materials. However, there can be no doubt that a science of data deals with the conceptual order much more than the real order, and the science of the conceptual
representation of the real order is logic. So if the science of data is to be placed in relation to other sciences, it is probably best placed as a specialized, applied branch of logic.

This brings us to another question: Why is there no mature science of data? The reason, I think, is that data is much more about the conceptual than the real. Empirical sciences like meteorology
study phenomena that are material and not fashioned by the human mind. We find it easier to deal with physical reality than the abstractions that form a large part of data. However, in the end this
only makes our task difficult; it does not doom it to failure.

A science of data is possible and can justifiably be differentiated from other sciences. It would be nice to have a name for it, but we do not have one yet. In any case, it is more important to
grow the body of knowledge about data than worry about titles. It is only by growing this body of knowledge that we will be able to attain general recognition of a science of data, and thereby
firmly establish the importance of data.

About Malcolm Chisholm

Malcolm Chisholm has over 25 years of experience in data management, and has worked in a variety of sectors, with a concentration on finance. He is an independent consultant specializing in data governance, master/reference data management, metadata engineering, and the organization of Enterprise Information Management. Malcolm has authored the books Managing Reference Data in Enterprise Databases; How to Build a Business Rules Engine; and Definitions in Information Management. He was awarded the DAMA International Professional Achievement Award for contributions to Master Data Management. He holds an M.A. from the University of Oxford and a Ph.D. from the University of Bristol.
