I have always been fascinated with the need for business metadata: data that describes data from a business point of view. I got involved in this area about 15 years ago when I became interested in
business rules. I worked for Sybase at the time, and our advertising message stated that we were the only database vendor to store “business rules in the database server”. The year was
1989.
One of my customers pointed out that what we were storing in the server wasn’t business rules at all but stored procedures and triggers, which were code. Code is
considered arcane by business people. Business rules are different; they are supposed to be understandable and are therefore supposed to be expressed in the language of the business.
These comments sent me on a personal adventure that has lasted 15 years and is by no means over. This article outlines some of the issues I see in capturing business metadata, its
relationship to technical metadata, and what both mean for practical repository implementation.
What Business Metadata Is and Is Not
The most practical use of metadata involves impact analysis, answering questions like “If this field were to change from 20 characters long to 30 characters long, where would I have to make
modifications?” I don’t mean to minimize this problem; it is a very important one. But this involves technical metadata, not business metadata.
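To make the impact analysis question concrete, here is a minimal Python sketch of how technical metadata can answer it. The dependency table and every name in it (the CRM column, the ETL job, the reports) are hypothetical and invented for illustration; a real repository would harvest these links from the tools themselves.

```python
from collections import deque

# Hypothetical technical metadata: each key is a column or artifact, and the
# list holds the artifacts that read from it directly.
DEPENDENTS = {
    "crm.customer.last_name": ["etl.load_customer", "report.mailing_labels"],
    "etl.load_customer": ["dw.dim_customer.last_name"],
    "dw.dim_customer.last_name": ["report.top_customers"],
    "report.mailing_labels": [],
    "report.top_customers": [],
}


def impacted_by(field, dependents):
    """Return every artifact downstream of `field`, in breadth-first order."""
    seen = []
    queue = deque(dependents.get(field, []))
    while queue:
        item = queue.popleft()
        if item in seen:
            continue  # already recorded via another path
        seen.append(item)
        queue.extend(dependents.get(item, []))
    return seen
```

Calling impacted_by("crm.customer.last_name", DEPENDENTS) lists the ETL job, the warehouse column and both reports that would need attention if the field grew from 20 to 30 characters. Note that nothing here says what a last name means to the business; that is exactly the gap business metadata fills.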
Business metadata answers questions like, “This report shows revenue; what kind of revenue? What is meant by revenue? What calculations went into the determination of revenue?” And
questions like this: “How does this division of the enterprise calculate revenue? Is it the same way that our division calculates it?”
Many tools state in their marketing literature that they support business rules. However, most technical tools do not support business rules; they support business rules translated into code. ETL
tools and data quality tools all show code, not English-language descriptions. As stated above, business rules are supposed to be expressed in the language of the business.
Is it as easy as taking a transformation expression in an ETL mapping and translating it into English? Not really. Some tools have a nice “description” field for this purpose. Can you
just use this? Not by itself. Business metadata also includes all the assumptions around the data, and the meaning of the terms used. Therefore, it is usually much more encompassing than a single
“description” field. This collection of “stuff” is referred to as semantics. I will come back to semantics momentarily; now I would like to take you back to the 1990s
briefly and discuss a data modeling fad that was popular back then.
Business Metadata and the Enterprise Data Model
Like all the other data modeler types back in the 1990s, I was bitten by the “enterprise data model” bug: if we could just design enterprise data models, life would
magically work. An Enterprise Data Model is supposed to model the business concepts used throughout the enterprise and map them to their technical instantiations. However, many of these efforts
were abandoned in midstream because the ROI wasn’t visible, and/or no one could figure out how to demonstrate it for an enterprise data model. It was difficult to illustrate the payback
of the enterprise data model to the business. We knew it was important; it was just difficult to articulate in the language of finance. After all, I’m just a data modeler geek, not a finance person! But
it’s just this kind of mindset that kills project funding! Then the Data Warehouse rage hit, and warehouses took the place of the Enterprise Data Model, because they showed excellent ROI. But
now that I’ve made that point, let’s go back to semantics.
The Importance of Meaning, Language and Definitions
I have discovered that a large part of data migration projects (and this includes data warehouse and integration efforts) is getting the meanings of the terms correct. Everybody knows what happens
when the definition of Customer gets screwed up; questions like “How many customers do we have?” and “Who are our top ten customers?” cannot be reasonably answered because no
one can agree on what a customer is. And in any enterprise, the definition of customer varies across the information systems. The definitions and rules surrounding these concepts are the
semantics. In information systems throughout the years, we have done a lousy job of capturing these semantics. (Remember how we all absolutely hated to do documentation?! Now it’s coming back
to bite us!)
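Here is a small Python sketch of why “How many customers do we have?” has no single answer until the definition is pinned down. Both definitions, the department names and the account records are assumptions I made up for illustration; they do not come from any particular client.

```python
from datetime import date

# Hypothetical account records; the field names are invented for illustration.
ACCOUNTS = [
    {"name": "Acme", "last_order": date(2004, 3, 1), "has_contract": True},
    {"name": "Bolt", "last_order": None, "has_contract": True},
    {"name": "Cray", "last_order": date(2001, 6, 15), "has_contract": False},
]


def is_customer_sales(account):
    """Assumed Sales definition: anyone with a signed contract is a customer."""
    return account["has_contract"]


def is_customer_billing(account, as_of):
    """Assumed Billing definition: a customer ordered within the past 365 days."""
    last = account["last_order"]
    return last is not None and (as_of - last).days <= 365


def count_customers(accounts, predicate):
    """Count the accounts that satisfy one definition of Customer."""
    return sum(1 for account in accounts if predicate(account))


as_of = date(2004, 6, 30)
sales_count = count_customers(ACCOUNTS, is_customer_sales)
billing_count = count_customers(ACCOUNTS, lambda a: is_customer_billing(a, as_of))
print(sales_count, billing_count)  # prints: 2 1
```

Same data, two defensible answers. Neither department is wrong; the definitions were simply never written down and reconciled.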
So, as a good consultant, when I realized this, I began to help my clients build systems that kept track of these semantics. In my mind I saw two very important things: definitions and business
rules. I helped my clients build repositories that store both of these things. Usually these repositories were home-grown, because the COTS tools didn’t store business rules very well.
The Enterprise Data Model Reappears
On one consulting assignment, we were capturing definitions of business terms. Then we moved to definitions of tables and columns in the systems. We were doing a data warehouse, so we had sources
and targets; all the tables and columns needed to be defined well.
The first thing I noticed was something I’ve seen over and over: There are business facts that are represented very poorly in the data. Have you ever seen fictitious fields with no inherent
business value, fields named “suffix”, for example? There’s nothing in the business called “suffix”. It turns out this field was dreadfully overloaded; it was used to
track three or four totally different things, and the only way you could tell which thing (or things) it was tracking was to use a secret decoder ring!
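The “secret decoder ring” can be written down. The Python sketch below is only an illustration: the single-letter codes and the three business facts they stand for are hypothetical, since the real field's contents are not something I can reproduce here.

```python
# Hypothetical decoder ring: the codes and their meanings are invented.
SUFFIX_MEANINGS = {
    "R": ("contract_status", "renewal"),
    "X": ("contract_status", "cancelled"),
    "E": ("region", "east"),
    "W": ("region", "west"),
    "P": ("billing_type", "prepaid"),
}


def decode_suffix(suffix):
    """Split an overloaded suffix value into separately named business facts.

    Characters with no known meaning are collected, not silently dropped,
    so that someone can chase down what they were supposed to mean.
    """
    facts = {}
    unknown = []
    for ch in suffix.strip().upper():
        if ch in SUFFIX_MEANINGS:
            name, value = SUFFIX_MEANINGS[ch]
            facts[name] = value
        else:
            unknown.append(ch)
    if unknown:
        facts["unrecognized_codes"] = unknown
    return facts
```

With these made-up codes, decode_suffix("RE") yields a contract status of renewal and a region of east: two distinct business facts that were hiding in one column. The lookup table itself is business metadata, and it belongs in the repository rather than in one person's head.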
Anyhow, I thought about this for a while and realized that what they needed was an Enterprise Data Model to map stuff to; a universal model of all the business concepts. This would be the best,
semantically responsible way to do it. Some people in our industry call this a Conceptual Data Model; others call it an Information Model. But you still have to have the definitions; you must
responsibly define everything. Best practice: enter the definition at the moment you uncover these business facts during analysis.
So here we are again, back to the Enterprise Data Model! We were right all along. Why? Because you have to have a central business repository to map all the data to in order to make sense of it. As
everyone has experienced, the business concepts show up in all sorts of different instantiations in all the diverse systems all over the enterprise. The business concept is represented differently,
with different terms, formats and rules. There needs to be a central point of reference which is the pure business fact, without any “legacy artifact rules”.
Have you noticed that the system sometimes shapes the business? The limitations of the system force the business to do workarounds because it can’t store the data the way
the business actually uses it. The most obvious example of this is the old teletype machines, which only had capital letters, no lower case. This meant that if a contract had lower case letters in it, the
semantic was either lost or they used an arcane code (like surrounding the letter with equal signs: “=Y=” for lower case y). The worst part was that sometimes the business would be
forced to use contract identifiers with all uppercase letters, even if they wanted to use lowercase. They had to adapt to the system.
Repository Requirements for Business Metadata
So what does all this have to do with repositories? Plenty! Have you always wondered, like I have, why the Commercial Off-the-Shelf (COTS) package repositories never seem to cut it?
Every client that I have assisted with a COTS repository has had to extend the tool. Every single one, without exception; or if they were lazy they would overload fields. They are
perpetuating the very problem we are trying to mitigate when they use a field for “something else” or overload it. Here we go again!
Well, I think one of the answers to the question about COTS tools is that the tools were not designed for business metadata, and the business metadata problem is really a semantic problem. It is all
about meaning.
So I tackled the problem by designing the “homegrown” metadata repositories around the data dictionary notion, and then expanded to tangential concepts. I also included business rules
in the “heart” of the repository. And I think this was what distinguished the homegrown repositories from the COTS ones. The latter products were designed mainly for technical metadata, and most of
them do an OK job of this. However, David Marco, the expert on technical metadata, has always been quick to remark that you always have to extend COTS tools, even for technical metadata.
But the business metadata problem is much more complex, as I argued earlier. It involves the semantic ingredient.
Information Model
I am now beginning to see that the Information Model needs to be the heart of the business metadata repository, along with the definitions and the business rules. You really need all three. Then,
you map everything semantically to the Information Model, so you can see which business concepts are included in the system you are dealing with, and what semantics surround them. Then you can apply
whatever transforms are necessary to have the data conform to the expectations of the department or business audience that will consume the data.
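As a rough illustration of “all three,” here is a minimal Python sketch of an Information Model that holds concepts with their definitions and business rules, plus the mappings from physical columns to those concepts. The class names, fields and methods are my own hypothetical choices for this sketch, not the design of any particular repository or product.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessConcept:
    name: str
    definition: str                             # plain business language
    rules: list = field(default_factory=list)   # business rules, also in plain language


@dataclass
class ColumnMapping:
    system: str
    column: str
    concept: str     # name of the BusinessConcept this column represents
    transform: str   # plain-language note on how to conform the data


class InformationModel:
    def __init__(self):
        self.concepts = {}
        self.mappings = []

    def add_concept(self, concept):
        self.concepts[concept.name] = concept

    def map_column(self, mapping):
        # Refuse to map to a concept nobody has defined yet.
        if mapping.concept not in self.concepts:
            raise KeyError(f"undefined concept: {mapping.concept}")
        self.mappings.append(mapping)

    def concepts_in_system(self, system):
        """Which business concepts does this system carry?"""
        return sorted({m.concept for m in self.mappings if m.system == system})

    def columns_for(self, concept):
        """Where does this business concept show up physically?"""
        return [(m.system, m.column) for m in self.mappings if m.concept == concept]
```

The one design choice worth pointing out is that map_column rejects a mapping to an undefined concept. That is a deliberate assumption of this sketch: it forces the definition to be entered when the business fact is uncovered, not after the fact.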
In essence, this Information Model acts like an integrated data store in a data warehouse environment: everything is stored in a central business, plain vanilla way, with no department bias (or the
department bias is documented well so everyone will understand it up front); then the specific rules are applied to the data and it is shipped to the individual marts as needed.
What will this do to performance? In this day of Change Data Capture, Messaging and EAI, the penalty shouldn’t be too bad. Speed of processors and databases is improving all the time. And
this approach truly makes sense. All my current clients see the wisdom in this, and I am seeing growing demand for this solution in my practice.
COTS, Taxonomies and Ontologies
Taxonomy and ontology are words that are also being used increasingly alongside semantics.
I am a relative newcomer in this area, so I will do my best to describe these concepts as I understand them.
Taxonomy means a classification scheme, usually involving a hierarchy; an ontology is a classification scheme with more semantics added, usually more complex business rules concerning relationships
and navigation within the taxonomy. Ontologies can be developed and standardized for specific industries; for example, the medical field has many different ontologies. Today, many industries are
beginning to publish standard ontologies, and the World Wide Web/search engine crowd is driving these efforts. Standardized ontologies can make web searches easier. The whole point of taxonomies and
ontologies is to create common semantics so whenever a term is used, everyone knows instantly what is being talked about. This is a step beyond dictionaries because it offers more than
definitions; it also provides more contextual information such as rules.
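As I understand the distinction, it can be sketched in a few lines of Python. The terms and relationships below are hypothetical examples of my own, and real ontology standards are far richer than this; the sketch only shows a hierarchy on one hand and relationships layered on top of it on the other.

```python
# Taxonomy: a pure hierarchy, stored here as child -> parent.
TAXONOMY = {
    "Customer": "Party",
    "Supplier": "Party",
    "Corporate Customer": "Customer",
    "Retail Customer": "Customer",
}


def ancestors(term, taxonomy):
    """Walk up the hierarchy from `term` and return its broader terms, nearest first."""
    chain = []
    while term in taxonomy:
        term = taxonomy[term]
        chain.append(term)
    return chain


# Ontology: the same hierarchy plus typed relationships between terms,
# stored as (subject, relationship, object).
RELATIONS = [
    ("Customer", "places", "Order"),
    ("Order", "generates", "Revenue"),
]


def relations_for(term, taxonomy, relations):
    """Relationships stated for a term or inherited from any broader term."""
    lineage = [term] + ancestors(term, taxonomy)
    return [r for r in relations if r[0] in lineage]
```

The taxonomy alone can tell you that a Retail Customer is a kind of Customer, which is a kind of Party. Only with the added relationships can you also conclude that a Retail Customer places Orders, and that is the extra context a dictionary of definitions does not give you.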
There are some new COTS tools that are beginning to address these issues, such as Unicorn. Unicorn uses an Information Model to drive everything. This area is still emerging, however, and the
products are not mature yet.
Conclusion
The business metadata problem is all about semantics. Period.
In the past, you had to build it yourself. We started out with data dictionaries. But that only addresses a small part of the problem. Then we started discovering business rules. And all of this
was after we had begun to build Enterprise Data Models and ran out of funding because nobody saw their value. And we never had the resources to connect them all together.
It is my belief that the Information Model, coupled with all the associated semantics of proper definitions and business rules (and maybe some workflow/process semantics too), is what is needed to
solve the business repository problem. Stay tuned; in an upcoming article I will address my findings concerning COTS tools in the emerging semantic area and see if they really address the issues
well. And another thing: can they link the technical metadata (the expression transforms in ETL mappings) to the business concepts and business rules that would be stored in the
Information Model? I will do my Sherlock Holmes imitation and check it out!