In preparation for the SemTech conference this year, and in editing my book, I have been doing a great deal of reflecting on the relationship between Semantics and Business Metadata. This article
attempts to articulate these reflections, and hopefully will cause you to do the same. And, as a shameless plug, it will whet your appetite for more, and will motivate you to read my book when it
comes out later this year!
Definitions
Business Metadata is metadata that is intended to be used, and many times even created and edited, by business people. The term Semantics is
derived from ancient Greek philosophy. In ancient Greece, and also today on college campuses, it was (and still is) the study of meaning. The definitions, context, assumptions and rules
surrounding a business concept are its semantics. In our information systems throughout the years, we have done a poor job of capturing semantics. Humans are, by nature, poor communicators. In
fact, not only are we poor communicators, but we are even worse at writing things down. (Remember how we all absolutely hated to do documentation?! Now it’s coming back to bite us!)
The Vision of the Semantic Web
Tim Berners-Lee envisioned the idea of the “semantic web,” where intelligent agents will be truly intelligent. He envisioned that the computer would know exactly what “booking a restaurant
reservation” means, and all the underlying tasks associated with it. For example, you could ask the computer to book a reservation at an Indian restaurant on the way home from work, and the
computer can find an Indian restaurant located directly on your way home, book a reservation for you, and put it automatically on your calendar, all without human intervention.
In the context of searching for documents, a semantic web would be able to understand what the documents contain. Today, we rely mostly on document titles and tagging. Tagging is usually done
manually either by the document author, someone else charged with tagging after the fact, or through a folksonomy like del.icio.us. But a true semantic web could decipher document contents on its
own.
On a smaller scale, the semantic web means distinguishing between two or more senses of a word and asking the user, “Did you mean…?” For example, we have used the word “mole”
throughout the book to illustrate word sense. Right now, Google can distinguish between spelling variations and probable errors. However, if Google were semantically enabled, it would be able to
distinguish between the different word senses of mole, and either ask the user which sense they wanted or, better, display results grouped by sense. Both intelligent agents and
semantically-aware queries involve understanding the meaning of things. In Berners-Lee’s example, the agent bases actions upon meanings, and can combine several different tasks automatically because it knows what
making a reservation means. In the simpler example of the semantically aware query, it translates into returning query results differently based upon different meanings. In either case, the
basic notion is the same: the goal is to codify meaning so computers can “understand” it and take useful action based upon that meaning.
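To make the semantically aware query a little more concrete, here is a minimal Python sketch. It assumes the hard part is already done, namely that every document has been tagged with the sense of the word it uses. The documents, the sense labels and the function name search_by_sense are all invented for illustration; this is not how any real search engine works.

```python
from collections import defaultdict

# Hypothetical index: each document is already tagged with the sense of
# "mole" that it uses. Producing these tags is the genuinely hard problem.
DOCUMENTS = [
    {"title": "Keeping moles out of your lawn", "term": "mole", "sense": "animal"},
    {"title": "Avogadro's number and the mole", "term": "mole", "sense": "chemistry unit"},
    {"title": "When to have a mole checked", "term": "mole", "sense": "skin blemish"},
    {"title": "Trapping garden moles humanely", "term": "mole", "sense": "animal"},
]


def search_by_sense(query: str) -> dict[str, list[str]]:
    """Return matching titles grouped by word sense instead of one flat list."""
    grouped = defaultdict(list)
    for doc in DOCUMENTS:
        if doc["term"] == query.lower():
            grouped[doc["sense"]].append(doc["title"])
    return dict(grouped)


# Display one block of results per sense, as a "Did you mean...?" page might.
for sense, titles in search_by_sense("mole").items():
    print(f"mole ({sense}):")
    for title in titles:
        print(f"  - {title}")
```

The point of the sketch is only the shape of the answer: one group of results per meaning, rather than a single list in which the animal, the chemistry unit and the skin blemish are mixed together.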
The Importance of Semantics
Two very important items which encapsulate the semantics of a business are the definitions of its terms and its business rules. One way to help the business articulate its semantics therefore is to
build repositories that store these critical semantic components. Out of necessity, these repositories are usually home-grown, because as of this writing, the traditional repository tools don’t
store business rules or any type of business metadata (beyond that found in E/R models) very well.
This article explores semantics in several areas:
- How do we express semantics so it can be communicated?
- How is the nature of semantic communication different between humans to humans, humans to computers, and computers to other computers?
Semantic technologies are covered briefly; however, they typically aren’t working to solve the same problem that business metadata is. We will cover these two different approaches, and will
highlight how business metadata concerns itself with semantics. The goal of business metadata is to make data understandable to business people, so this article is about providing services
that clearly communicate semantics to them.
Semantics are Context-Sensitive
Every industry has its own unique language. Often, we don’t notice this because we work in that specific industry. But a consultant who doesn’t know the industry and its jargon can get confused
really fast. Likewise, we have all experienced going to work for a new company and being abruptly confronted with the company’s language: the acronyms, abbreviations and business terms that are
uniquely its own. Even within the organization, each department usually has its own language; many times, formulas or calculations mean different things in different divisions. For example, the
term “revenue” can be ambiguous and usually means different things in different groups, such as Sales or Finance.
Therefore, it is safe to say that semantics is highly dependent upon the context. The meaning of a term or phrase can vary, depending upon the group of people using the term. See Figure 1. Each
person has his or her own context, based upon their unique background and circumstances, as well as which organizational division he or she belongs to.
Figure 1: Each Person has His/Her Own Context
Each Information System has its Own Context
In a similar fashion, each information system has its own set of semantics, what data elements mean in the context of that system; see Figure 2 below. The problem is, the semantics of each system
are not documented very well. Some systems don’t have any documentation at all, and if you are lucky enough to have documentation, it is usually of shoddy quality. The state of data definitions in
most systems is deplorable. Most definitions are tautologies. (“A unicorn is a beast with one horn” defines the term by itself; it adds no new information, because the word unicorn already means one horn:
uni = one, corn = horn, from the Latin cornu.) A very common tautological definition seen in most systems is “Customer ID: The ID of the Customer”.
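A tautological definition of the “Customer ID” kind is simple enough to catch mechanically. The sketch below is one rough way to do it in Python, under the assumption that a definition is tautological when it contains no meaningful word that is not already in the term's name. The stop-word list and the function names are illustrative, and a real data dictionary would need a more forgiving test (plurals, abbreviations and so on).

```python
import re

# Words too common to carry meaning in a definition (illustrative, not exhaustive).
STOP_WORDS = {"a", "an", "the", "of", "for", "to", "is", "that", "this", "in", "on", "and", "or"}


def content_words(text: str) -> set[str]:
    """Lower-case the text and return its words, minus the stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def is_tautology(term: str, definition: str) -> bool:
    """True when the definition adds no content word beyond the term itself."""
    return content_words(definition) <= content_words(term)


print(is_tautology("Customer ID", "The ID of the Customer"))
print(is_tautology(
    "Customer ID",
    "A unique number assigned to a party when it places its first order",
))
```

Running a check like this across an existing data dictionary gives a quick, if crude, list of the definitions that most need a business person's attention.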
Figure 2: Each System has its Own Semantic Context
Each System has its own Semantics, and Semantics are Not Shared
Each system is a semantic island: The semantics for any given system hold true for that system only, and all bets are off that the same semantics apply to any other system. See Figure 3. This raises
the question: How can two or more systems really share data, if the semantics are different?
Figure 3: Each System has its Own Context, so How do they Share Data?
For example, suppose a firm has a sales database that tracks Customers. However, the Sales Department’s definition of a Customer is most probably not how the Shipping Department has defined the
term in its database. The Sales Department’s Customer database almost certainly contains prospects, and the Shipping Department’s database almost certainly does not; it is highly unlikely that
they are shipping anything to someone who hasn’t bought it (unless they have a “try it, you’ll like it” policy). This problem arises even among information systems within the same department,
because different systems are usually developed by different people, in different situations, to solve different business problems. Individual developers leave or get transferred and new ones come
in. Unfortunately, the semantics of business systems have been largely overlooked. Everyone just “assumes” that Customer is the same across the organization, because it is called the same
thing. Assumptions are very dangerous!
The discipline of semantics, therefore, is all about becoming aware of these definitions, assumptions, and the contextual nature of data, and trying to capture this information so that data can be
more understandable and also can be easily shared, across the enterprise and even external to the enterprise, when appropriate.
Human to Computer: No Shared Semantics
Semantics, as it is communicated to humans, is a type of business metadata, which leads us to the next problem. The context problem is further complicated by the human/computer interaction: how do
we know for sure that the human’s context is the same as the one the system has? See Figure 4.
Figure 4: Human/Computer Context
Semantics as Business Metadata
Semantics is all about meaning, and business metadata is about adding meaning to data. Making meaning explicit is adding context to data. Thus, whenever we capture the semantics of data and
display that meaning to a business user to add clarity to the data, we are delivering business metadata.
The simplest way to do this is to provide definitions of terms used in applications, formulas or calculations used, etc. This provides the meaning of an individual term or data element on the spot
to a user viewing data in an application, and can provide immediate clarification. For example, it would be helpful to provide definitions of possibly confusing fields in a web form.
However, as Dave McComb points out in his excellent book, Semantics in Business Systems, “Definitions are Not Enough” [McComb, page 49]. Dictionaries typically do not provide the
relationships between terms that are so critical to understanding. Therefore, we need to capture relationships between terms and deliver them also as business metadata. In our web form example, it
may be helpful to provide business rules about the field in question, clarifying the relationship between the term in question and other data.
Expressing relationships can get very tricky: There is a clear need for increased richness of expression. However, the tools we have today are severely limited in communicating these concepts to
humans. The tools and languages that offer semantic richness are usually difficult for a non-technical person to decipher. While the semantic vendors are tackling the more difficult
problem of enabling computers to reason, with the end goal of offering more useful solutions to humans, we still have much work to do on the human side of the equation. We must continue to work on
more precision in communication; we must be able to distinguish nuances of dialects, and distinguish different usages or meanings of the same word, or, vice versa, recognize when two different words mean the
same thing. The next section discusses how concept modeling can help in this endeavor.
Semantic and Conceptual Models
A concept or conceptual model is a model whose purpose is to convey individual concepts and their relationships to one another, independent of implementation. It is more specific than business
terms, because a term can often express more than one concept, or more than one term may express a single concept (in other words, there is a many-to-many relationship between term and
concept). Dave McComb states that the difference between a semantic model and a conceptual model is “…the effort spent on resolving meaning” [McComb, page 77]. Therefore, a semantic model
attempts to be more rigorous in the expression of meaning for each concept.
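The many-to-many relationship between term and concept can be sketched as a simple data structure. In the Python below, the terms, the concepts and the helper names are all hypothetical; “revenue” is given two meanings only to echo the Sales versus Finance example earlier in the article.

```python
from collections import defaultdict

# Each pair links a business term to one concept it can express (invented data).
TERM_CONCEPT_LINKS = [
    ("revenue", "value of orders booked"),       # how Sales might use the term
    ("revenue", "income recognized in period"),  # how Finance might use the term
    ("customer", "party that buys goods or services"),
    ("client", "party that buys goods or services"),
]

concepts_by_term = defaultdict(set)
terms_by_concept = defaultdict(set)
for term, concept in TERM_CONCEPT_LINKS:
    concepts_by_term[term].add(concept)
    terms_by_concept[concept].add(term)


def is_ambiguous(term: str) -> bool:
    """A term is ambiguous when it expresses more than one concept."""
    return len(concepts_by_term.get(term, set())) > 1


def synonyms(term: str) -> set[str]:
    """Other terms that express at least one of the same concepts."""
    related = set()
    for concept in concepts_by_term.get(term, set()):
        related |= terms_by_concept[concept]
    related.discard(term)
    return related


print(is_ambiguous("revenue"))
print(synonyms("customer"))
```

Resolving meaning, in McComb's sense, is the work of deciding which of these links are real; the structure itself is the easy part.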
Delivering Definitions and Relationships
A conceptual model can therefore be used to express each atomic concept and its relationship to other concepts. An ER model can be used to represent relationships between concepts. It should be
noted that data modelers distinguish between three levels of ER models: Conceptual, Logical and Physical. Good modelers disagree on the boundaries between conceptual and logical, but all will agree
that a physical model includes implementation details that are not pertinent to the other two. A conceptual model is meant to focus on the main concepts required to do business.
Alternatively, Concept Maps can be used, which allow for more flexibility in relationship expression. See Figure 5 for an example of a C-Map.
Figure 5: Example of a Concept Map (C-Map)
The C-map is not constrained by the rules of OO, which force hierarchies. It also does not have the problems of relational modeling, which does not model hierarchies well. The lack of these constraints
is both good and bad.
For example, even though the relational model does not model hierarchies well [1], it does provide rigor and certain rule enforcement (like cardinality and optionality,
and relational integrity).
There are open source tools available that extend C-maps, and the Institute for Human and Machine Cognition (IHMC) is instrumental in knowledge modeling research and extending C-maps, enriching
them with semantic technologies (see http://www.ihmc.us:16080/about.php). One of the known problems in the semantic modeling technologies available
today is their lack of visual modeling approaches. The use of the C-Map and its extensions allows us to have both semantic richness and visual models.
Business Metadata Expression
The problem with OWL (and many semantic languages or modeling techniques) is that it is a rich, robust modeling vocabulary but it is not easily translatable into business language; it is typically expressed in
XML syntax. If the goal of your project is to create a purely business metadata environment for the delivery of data explanations to business people, then OWL at this stage of its evolution is not
the right way to go; well-formed definitions are the appropriate vehicle. English language explanations of rules and relationships can be added as enhancements to definitions. Another alternative
is to use a combination approach with both a dictionary and C-Map or ER model to express the concept model; some users like graphical models. When the user is looking at a data element in a system
and wants its definition, a hot key or button can be pressed. If the user wants more detail, such as the relationships to other business concepts, another mouse click or “details” button can be
pressed. Probably not everyone will want this level of detail, and the definition may be all that’s needed at first, but the detail is available right there if required.
However, OWL goes a long way to solving one part of the equation: the computer’s understanding and reasoning capability. Go back to Figure 4, shown earlier in the article, which depicts the
man/machine communication problem. OWL helps the computer side of the drawing. This will help the business person behind the scenes, because it will provide the building blocks and the
infrastructure for automated reasoning, more intelligent search and intelligent agents (see the next section); but you still are left with communicating back to the human in a reasonable way, which
is what business metadata concerns itself with. Therefore, semantic technologies, while paving the way for the future with lots of promise, at this time don’t provide much business metadata.
Exposing Semantics to the Business
We have established the need for semantic expression and delivery of business metadata to communicate context. How can we use business metadata to deliver contextual information and semantics to
business people to enhance their understanding of the data, if the expression of semantics is only in a technical representation?
Today’s technology necessitates writing a business translation alongside any technical representation of semantics, if one exists. It seems that the main link is the dictionary or glossary. If
OWL is used, the attempt can be made to generate fact statements from the relationships, or use a C-map. Glossaries and thesauri can be used to serve up rudimentary definitions, along with synonyms
and broad/narrow term relationships between terms, and these can be enhanced. As we discussed earlier in the article, the delivery of the definitions needs to be as ubiquitous as possible; they
should be made universally available throughout the Corporate Information Factory (CIF) and the entire IT environment. Although this takes some infrastructure and architectural planning, it
generally doesn’t require purchase of additional hardware or software. You can get it done with “Bonnie’s Law”, or “Use whatever is lying around”.
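Generating fact statements from relationships does not have to be elaborate. As a minimal sketch in Python, assume the relationships have been exported from a model as simple (subject, relationship, object) triples. The triples and the function names here are invented, and the sketch does not use any OWL library or claim to show how an OWL tool would do this.

```python
# Hypothetical (subject, relationship, object) triples, as they might be
# exported from a concept model or C-map.
RELATIONSHIPS = [
    ("Customer", "places", "Order"),
    ("Order", "is shipped to", "Address"),
    ("Invoice", "bills", "Customer"),
]


def article(noun: str) -> str:
    """Pick 'A' or 'An' from the first letter (a rough rule, good enough here)."""
    return "An" if noun[0].lower() in "aeiou" else "A"


def fact_statement(subject: str, relationship: str, obj: str) -> str:
    """Turn one relationship triple into a plain English sentence."""
    return f"{article(subject)} {subject} {relationship} {article(obj).lower()} {obj}."


for triple in RELATIONSHIPS:
    print(fact_statement(*triple))
```

Sentences like these can be stored next to the glossary definition of each term, so that the business translation travels with the technical representation.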
In addition to definitions, it is wise to do semantic modeling and create an enterprise conceptual model, as described above, and map all systems’ data elements to this model. Unlike some semantic
technologies, semantic models such as an enterprise concept model can be used directly by analysts and business people, providing the following benefits to the enterprise:
- A semantic and physical inventory of data elements throughout the enterprise
- A systematic way to determine data redundancy and put plans in place to manage it
- Two data elements that represent the same concept can reuse the same definition
- Sets the foundation for data sharing, both internal and external to the organization
- Sets the foundation for an enterprise data quality initiative
Such a system can then be used to track associated contextual information for each concept, including business rules. Then this business metadata can be made available wherever it is needed.
Although tools are helpful, they are not necessary; it is possible to create this as a homegrown solution.
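As an illustration of how modest such a homegrown solution can be, here is a Python sketch of the mapping idea. The system names, data element names, concepts and definitions are all made up; a real inventory would live in a database rather than in dictionaries, but the structure would be much the same.

```python
from collections import defaultdict

# One definition per enterprise concept (invented examples).
CONCEPT_DEFINITIONS = {
    "Customer": "A party that has purchased at least one product.",
    "Prospect": "A party that Sales is pursuing but that has not yet purchased.",
}

# Every (system, data element) pair is mapped to exactly one concept.
ELEMENT_TO_CONCEPT = {
    ("Sales DB", "CUST_NM"): "Customer",
    ("Shipping DB", "CUSTOMER_NAME"): "Customer",
    ("Sales DB", "LEAD_NM"): "Prospect",
}


def definition_for(system: str, element: str) -> str:
    """Look up a data element's definition through the concept it maps to."""
    return CONCEPT_DEFINITIONS[ELEMENT_TO_CONCEPT[(system, element)]]


def redundancy_candidates() -> dict[str, list[tuple[str, str]]]:
    """Concepts represented by more than one data element across systems."""
    by_concept = defaultdict(list)
    for key, concept in ELEMENT_TO_CONCEPT.items():
        by_concept[concept].append(key)
    return {c: keys for c, keys in by_concept.items() if len(keys) > 1}


print(definition_for("Shipping DB", "CUSTOMER_NAME"))
print(redundancy_candidates())
```

Because both name fields map to the same concept, they reuse one definition, and the same mapping doubles as the inventory from which redundancy can be spotted. Deciding whether the two fields really do mean the same thing is, of course, the semantic work that no script can do for you.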
Summary
The quest to provide meaning to data is inherent in business metadata, and therefore semantics play a large role. There are two facets to the delivery of meaning:
- To the human
- To the computer
The emphasis so far in the semantic technology community has been on the latter, forgetting the former. The rationale behind this dismissal was perhaps that the computer issue was the more
difficult one, in trying to figure out how to codify meaning, and that once it was solved the human side of the problem would follow easily. However, we are finding that all our attempts at
codifying meaning using semantic technology are leaving humans more and more confused. OWL code and XML, after all, are not really human-friendly.
Yet the problem remains: how to best communicate the meaning and context behind the data, to enhance business people’s understanding of the data. The answer to our quest, given the state of
today’s technology, can be summarized in two areas:
- Good, unambiguous, robust definitions of business concepts (and mapping terms and data to these concepts)
- A method of representing relationships between concepts that allows for different kinds of relationships (perhaps C-maps or topic maps are the most promising).
[1] Various ER database design methods, such as IDEF1X and Barker Notation, have added extensions that handle hierarchies, but both have problems when physical relational models are
generated from them.