Conquering the Logical-Physical Divide February 2011

I’ve been following a spirited (and very informative) discussion in the DAMA section of LinkedIn on the question, “Is There A Single Version of the Truth?” The discussion has been interesting as much for what hasn’t been said as for what has been.

For one thing, none of the practitioners participating in the discussion has answered this question in the affirmative. The farthest anyone has been willing to go is to state that there may be a “golden” (acknowledged and approved) source for data in a particular business domain. For the most part, there seems to be general agreement that the truthfulness of data is intertwined with business requirements and business needs. As Peter Aiken said, “It is better to focus not on the truth, but on fitness of use.”

In the words of Allen Perper, “a database (and certainly any models that precede its design and population), is not reality. It is only a collection of representations of a perspective of reality that may have been valid at a point in time.” As data modelers, we can sometimes be misled into thinking that we are capturing and modeling reality, rather than perspectives and points of view.

It is mostly for this reason that I regard the question of a Single Version of the Truth (SVOT) as meaningless and pointless. After all, it is not our job, as data management professionals, to create Truth. Our job is to create value for the organizations that employ us and pay our wages. So the proper question is not how can we create a single version of a company’s “truth” or a “golden copy” of an organization’s data. The proper question is how can we create data structures that add value to a company’s business processes and open up new opportunities for process improvement, knowledge creation, and business growth. To paraphrase another of Allen Perper’s comments, the “truth” of data is in the eye of the beholder simply because the value of that data will vary from individual to individual. Some people will place a higher value on the most current data, even if it’s not from the most trusted source. Other people will only value data from the most trusted source, even if it’s not current.

So what is it we’re doing (or think we’re doing) when we’re engaged in, say, a Master Data Management project?  I think Jim Harris hit the nail on the head when he spoke of trying to find a “Shared” (rather than a “Single”) version of the truth. In other words, trying to identify (and model) those data assets that cross organizational boundaries. By capturing this data, and storing it in a form that supports a shared view of the business and agreed-upon business rules and definitions, we can create a data repository that adds value to (and across) multiple areas of the organization, supports more efficient business processes, and opens up fresh perspectives on a company’s operations.
Similarly, in Data Quality projects, we are trying to come to a shared understanding of the business rules and definitions pertaining to shared data and then, in the words of Ken Smith, make sure the values for the data in the database adhere to these agreed-upon business requirements.
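To make that concrete, here is a minimal sketch of the idea in Python: a shared, agreed-upon set of business rules is expressed as executable checks, and records are validated against them. The field names, vocabulary, and rules below are hypothetical illustrations, not any particular organization's standard.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Each rule pairs an agreed-upon business definition with a check on actual values.
@dataclass
class BusinessRule:
    name: str
    check: Callable[[dict], bool]

# Hypothetical rules that a data governance group might agree on.
RULES = [
    BusinessRule("customer_id is present",
                 lambda r: bool(r.get("customer_id"))),
    BusinessRule("order_total is non-negative",
                 lambda r: r.get("order_total", 0) >= 0),
    BusinessRule("status uses the agreed vocabulary",
                 lambda r: r.get("status") in {"active", "lapsed", "closed"}),
]

def violations(record: dict) -> list[str]:
    """Return the names of every agreed rule this record fails."""
    return [rule.name for rule in RULES if not rule.check(record)]
```

The point of the sketch is that the rules live in one shared place: when the business definition changes, the check changes with it, and every consumer of the data is measured against the same agreed standard.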

In my opinion, the two biggest “traps” in data work are these: thinking that we are capturing and modeling reality rather than perspectives on reality, and trying to shoehorn differing (and possibly competing) points of view into an overly simplistic framework under the guise of achieving “a single version of the truth”. Eric Aranow brought up the analogy of the “Procrustean Bed” (Procrustes was a mythical Greek innkeeper whose rooms had beds of only one size. Guests who were shorter than the beds were stretched out on racks until they fit; guests who were too tall had their limbs lopped off!). Eric then provided an excellent example of over-simplification: the set of “Customers” recognized by a support department vs. the set of “Customers” recognized by Marketing. If you blindly combine the two, you might lose the distinction between Prospects and registered buyers, and the resultant data could be of lower quality than what went into it. I had a similar experience some years ago, when I was asked to integrate data from three different Customer databases. The problem was, each database represented a different type of customer; each of these customer types interacted with the company in a different way and was affected by different business processes. It didn’t make sense to treat these different customer types as though they were the same!
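A small Python sketch of the Customer example above: instead of blindly unioning the two departments' lists, the merge keeps a discriminator so the Prospect/registered-buyer distinction survives integration. The record layouts and department names here are hypothetical, chosen only to illustrate the point.

```python
# Hypothetical records from two departments' "Customer" lists.
# Marketing's list includes prospects; Support's list contains only registered buyers.
marketing_customers = [
    {"email": "a@example.com", "source": "marketing"},
    {"email": "b@example.com", "source": "marketing"},
]
support_customers = [
    {"email": "b@example.com", "source": "support"},
]

def integrate(marketing: list[dict], support: list[dict]) -> list[dict]:
    """Merge the two lists, but tag each merged record with a customer_type
    discriminator rather than flattening everyone into one undifferentiated
    'Customer'."""
    buyers = {r["email"] for r in support}
    merged: dict[str, dict] = {}
    for r in marketing + support:
        kind = "registered_buyer" if r["email"] in buyers else "prospect"
        merged[r["email"]] = {"email": r["email"], "customer_type": kind}
    return list(merged.values())
```

The blind union would have produced two indistinguishable “Customers”; the discriminated merge records that one is a Prospect and the other a registered buyer, which is exactly the distinction the downstream processes care about.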

I solved a similar problem, years ago, when designing and building a Finance and Budgeting data warehouse for a public University. The first stumbling block I hit was the discovery that there were multiple definitions of basic accounting entities such as “Budget” and “Account”. These differences exist because a public University has different funding sources (e.g., funds from taxpayers, allocated by the State legislature, funds from grants administered by agencies of the Federal government, and funds from contracts with private businesses). Each of these funding sources has a different set of auditing and reporting requirements. Rather than trying to create a single definition of these terms, I solved the problem (in part) by creating subtypes of the Budget and Account entities, using a different subtype for each funding source. I ended up with more complex (and also more robust) metadata that was better suited to satisfying the auditing and reporting requirements of each University department.
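The subtyping approach described above can be sketched in Python using a common supertype with funding-source-specific subtypes. The attribute names below (biennium, CFDA number, contract number, and so on) are hypothetical illustrations of the kinds of audit fields each funding source might require, not the actual warehouse design.

```python
from dataclasses import dataclass

# Supertype: attributes every budget shares, regardless of funding source.
@dataclass
class Budget:
    budget_id: str
    fiscal_year: int
    amount: float

# Subtype for state-appropriated funds: legislature-facing audit fields.
@dataclass
class StateBudget(Budget):
    biennium: str = ""
    legislative_line_item: str = ""

# Subtype for federal grants: grant-specific audit and reporting fields.
@dataclass
class GrantBudget(Budget):
    cfda_number: str = ""
    granting_agency: str = ""

# Subtype for contracts with private businesses.
@dataclass
class ContractBudget(Budget):
    contract_number: str = ""
    sponsor: str = ""

def audit_fields(b: Budget) -> list[str]:
    """Return the subtype-specific fields relevant to this budget's
    auditing and reporting requirements."""
    shared = set(Budget.__dataclass_fields__)
    return [f for f in type(b).__dataclass_fields__ if f not in shared]
```

Each subtype carries only the metadata its funding source's auditors require, while reports that don't care about funding source can still treat every record uniformly as a Budget.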

The object, as Daniel Paolini points out, is to make sure that everyone with the same specific need for data gets the same data and understands it to mean the same thing. This is completely different from trying to make sure that data means the same thing to everybody!  Making sure that everyone who has a specific use for data gets data with the same business meaning is ultimately what we are trying to achieve in data work.
Managing data to support multiple “versions of the truth” is more complex (but much less frustrating!) than trying to create a “single version of the truth”. You will have to create and manage more complex metadata, create more complex (and, occasionally, redundant) data structures, create and manage multiple views of data across your data repository, and (probably) create and manage multiple delivery systems for data (including web services and ad-hoc query and reporting capabilities). None of this is trivial, but all of it is important in satisfying the Data Manager’s mantra:

 The right data
 To the right person
 At the right time
 In the right form
 At a cost that yields value

My thanks (and respect) to all my learned colleagues at DAMA – I’ve enjoyed learning from you all!

NOTE: I’d like to make this a dialogue, so please feel free to email questions, comments and concerns to me. Thanks for reading!



About Larry Burns

Larry Burns has worked in IT for more than 25 years as a database administrator, application developer, consultant and teacher. He holds a B.S. in Mathematics from the University of Washington and a Masters degree in Software Engineering from Seattle University. He currently works for a Fortune 500 company as a database consultant on numerous application development projects, and teaches a series of data management classes for application developers. He was a contributor to DAMA International’s Data Management Body of Knowledge (DAMA-DMBOK), and is a former instructor and advisor in the certificate program for Data Resource Management at the University of Washington in Seattle. You can contact him at