Organizations have long struggled with integrating data across their departments and with business partners. Differences in syntax can often be solved by tools, but differences in semantics are inherently difficult. This difficulty stems from the fact that we do not have one language that everyone agrees upon. What makes it even harder is that people also have their own interpretation of terms; we often think we understand each other, but are really talking about something slightly different. A lot of semantics comes from context, and can thus not be defined up front. It is naïve to think that it is possible to come up with one language that everyone understands and agrees upon, even within the borders of the organization. Rather, we need just enough semantics, just in time.
There once was a time when we thought it was possible to define a corporate data model. This data model would define all data elements within the organization, and applications would use it as their internal data model. This would be so much easier for everyone: we would not have to define application-specific data models, and translations between data models would not be needed. This clearly did not work, as organizations are not something you can fully design and control. There are just too many people, too many contexts and too many changes. And why would everyone need to agree upon every term they use? A lot of what we talk about is local to a specific domain or context, and we shouldn’t bother other people with it.
It seemed logical to only try to agree upon the language that we share and use when we exchange data. This is why we came up with canonical data models that formed a uniform language for exchanging data between applications. At the same time, similar data would be copied to our corporate data warehouse, integrated in a similar manner and structured in third normal form. And wouldn’t it be great if this integrated data model in our warehouse could just be our canonical data model? A lot of organizations still seem to follow this route, with varying degrees of success. It turns out that this is not an easy route either. There are still a lot of data elements to define and a lot of contexts that our semantics need to take into account. The schemas we defined for our canonical data models turned out to be very difficult to evolve. Our time-to-market is still hindered by the time needed for data integration.
New concepts are appearing that are positioned as silver bullets. We can implement a data lake that contains all the relevant data inside and outside our organization, without being bothered by semantics. Big Data technology allows us to process all sorts of structured and unstructured data without knowing their exact contents. We can supposedly just apply semantics and structure when we need to derive information from the data. Clearly, this leads to all sorts of problems when you really try to do so. Thinking about semantics only after the fact is naïve. You cannot automatically add semantics to data that is unclear. We need to think about the types of data and their semantics up to a certain level beforehand. This allows us to detect the subjects that the data addresses and to link the data so we can find related data elements when we need them.
Semantic technology such as ontologies and Linked Data seems to fit the need for just enough, just in time semantics. Incoming data elements can be “tagged” semantically, in terms of ontologies. Such tagging works for structured as well as unstructured data, and also provides a means to link them. Ontologies do not necessarily describe all data elements that are exchanged, just those that we feel are important to our business. And when we reuse existing ontologies, we do not have to spend a lot of time defining our own semantics. Even better, we can benefit from the fact that other organizations are also using these ontologies, which implicitly links our data elements to those of others. This is the core of the semantic web. It also does not require us to put all these data elements into a fully structured, third-normal-form database that is hard to evolve. Instead, we can use a less strict form of structure that is easier to evolve, for example a triple store. This leads to a semantic data lake that is more valuable than “just a data lake” and less rigid than a corporate data warehouse. Note that the latter is probably still relevant for your stable (back-office) processes, and can co-exist with your semantic data lake.
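To make the idea of semantic tagging and a triple store concrete, here is a minimal sketch in plain Python. All URIs, the `tag` and `match` helpers, and the customer record are hypothetical illustrations; a real implementation would use shared vocabularies (such as FOAF or schema.org) and a library like rdflib with SPARQL queries, but the principle is the same: annotations become subject–predicate–object triples that can be linked and queried without a rigid up-front schema.

```python
# Minimal sketch of a triple store: data elements "tagged" with ontology
# terms become subject-predicate-object triples that can be queried.
# All URIs and helper names below are hypothetical illustrations.

EX = "http://example.org/ontology/"       # hypothetical local ontology
FOAF = "http://xmlns.com/foaf/0.1/"       # widely reused vocabulary

triples = set()

def tag(subject, predicate, obj):
    """Record one semantic annotation as a triple."""
    triples.add((subject, predicate, obj))

def match(s=None, p=None, o=None):
    """Query the store; None acts as a wildcard for that position."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# Tag an incoming data element with terms from both vocabularies.
customer = "http://example.org/data/customer/42"
tag(customer, "rdf:type", FOAF + "Person")
tag(customer, FOAF + "name", "Alice Example")
tag(customer, EX + "segment", "SME")

# Find the names of everything tagged as a Person.
people = [s for (s, _, _) in match(p="rdf:type", o=FOAF + "Person")]
names = [o for person in people
         for (_, _, o) in match(s=person, p=FOAF + "name")]
print(names)  # ['Alice Example']
```

Because the store is just a set of triples, new predicates can be added at any time; nothing like a third-normal-form schema has to be migrated when the semantics evolve.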