A New Way of Thinking – July 2006

Published in TDAN.com July 2006

My interest in the area of Master Data Management (MDM) is largely driven by my experience in data cleansing – parsing, standardization, and matching. As any data management professional who
has not been living in a cave for the past year knows, MDM is one of the hottest topics around among analysts, media channels, conference sessions, and marketers. The concept is to engineer a
central repository that consolidates variant, replicated copies of (what should be) shared data objects, in order to establish, under well-defined governance policies and service-level agreements, a
unified “best record” representation of each identifiable entity within the enterprise. Quite a mouthful, huh?

Clearly, data quality tools play a large part in this. These tools are needed to take different data sets, shake out all the unique entities, and then merge them together. And
when you do a literature scan on the Web, you will find reams of articles and white papers (including some of my own!) touting the benefits of MDM – high quality data, consistent views,
reduced complexity, etc. A typical claim is that MDM will ensure that all applications have a consistent, accurate, and timely view of all master data objects.

Yet there is one lingering issue in my mind that I seem to be unable to resolve, which is the question of performance. In my web searching, I have not been able to find a significant amount of
information regarding performance.

Let’s look at this a little more closely. The conventional wisdom is that there are three styles of MDM models:

The Central Master or Coexistence style, in which a unique record is stored in a central repository, and a subset of the attributes associated with each entity (e.g., customer) are
maintained within that centralized record. This central master is published out to each participant application, which may augment the data set with its own required attributes. Applications
can create their own instance records, and the new records are propagated back to the central repository (either in batch or through a service).

The Registry style, in which a unique identifier is assigned to each data instance, along with a mapping or cross-reference to all records in participating applications carrying
information for that data instance. Creation of new records triggers an update to the registry to document the new mapping.

The Transaction Hub style, in which the central repository is the only copy of the data, and participating applications interact with the master via a set of services. New record
creation is done only in the master repository.

Each of these styles must be able to support traditional database operations: create, read, update. Deletes are a little trickier, and can be ignored for this thought experiment. In
addition, these systems must support a lookup operation that finds a “best match” for an entity, which is necessary to ensure that duplicate data is not inserted into the
repository.

Consider the record creation operation for customer data. First, any application needing a customer record will need to acquire enough identifying information, and then consult the master index for
a lookup. If matching records are returned, one of them may be selected as the appropriate match; if none is suitable, a new record needs to be created and then returned as the appropriate match. Next, any
modifications to the record need to be made and posted to the central repository.
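
The create flow just described can be sketched in a few lines. This is only an illustration under stated assumptions, not a description of any MDM product: the `MasterIndex` class, the `normalize` helper, and the 0.85 similarity threshold are hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for a real parsing, standardization, and matching engine.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Crude standardization: lowercase, drop punctuation, collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


class MasterIndex:
    """Hypothetical master index supporting lookup and create."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold   # assumed cutoff for accepting a "best match"
        self.records = {}            # master_id -> normalized name
        self._next_id = 1

    def lookup(self, name: str):
        """Return (master_id, score) for the best match, or None if below threshold."""
        probe = normalize(name)
        best = None
        for master_id, stored in self.records.items():
            score = SequenceMatcher(None, probe, stored).ratio()
            if score >= self.threshold and (best is None or score > best[1]):
                best = (master_id, score)
        return best

    def lookup_or_create(self, name: str) -> int:
        """Consult the index first; create a record only when nothing matches."""
        match = self.lookup(name)
        if match is not None:
            return match[0]
        master_id = self._next_id
        self._next_id += 1
        self.records[master_id] = normalize(name)
        return master_id


index = MasterIndex()
first = index.lookup_or_create("John Q. Smith")
second = index.lookup_or_create("john q smith")    # same entity, variant spelling
third = index.lookup_or_create("Maria Gonzalez")   # different entity
```

Note that every create pays for a lookup first, and in this sketch the lookup is a linear scan over the whole index. That cost is exactly where the performance question starts.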

In the central master style, each application has a local copy of the master data, so these activities can be done locally, right? New records are created by the application itself, and that data
must be propagated back to the central master and then on to the other applications. Hold on a minute there – if the application can create new records locally, then between the time that action
takes place and the information propagates back, and then out to other applications, we have a situation where master data exists in one local copy and not in others… and doesn’t that
break the whole “consistent, unified record” concept? And it is possible that other applications are creating new records for the same customer at the same time. All of a sudden we are
bound by transaction semantics, which, if enforced, create a performance bottleneck at the central repository.
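
To make that window concrete, here is a small deterministic simulation. It is a sketch under assumptions: `CentralMaster`, `Application`, and the batch-style `propagate` step are hypothetical names, and exact string equality stands in for real entity matching.

```python
class CentralMaster:
    """Hypothetical central repository that accepts batched records from applications."""

    def __init__(self):
        self.records = []   # list of (source_app, customer_name)

    def accept_batch(self, source_app, new_records):
        for name in new_records:
            self.records.append((source_app, name))

    def publish(self):
        """Return the set of customer names every application should now see."""
        return {name for _, name in self.records}


class Application:
    """Hypothetical participant holding a local copy of the master data."""

    def __init__(self, name, central):
        self.name = name
        self.central = central
        self.local = set(central.publish())
        self.pending = []   # created locally, not yet propagated

    def create_customer(self, customer):
        # The duplicate check only sees this application's local copy.
        if customer not in self.local:
            self.local.add(customer)
            self.pending.append(customer)

    def propagate(self):
        self.central.accept_batch(self.name, self.pending)
        self.pending = []


central = CentralMaster()
billing = Application("billing", central)
support = Application("support", central)

billing.create_customer("Jane Doe")
visible_before_sync = "Jane Doe" in support.local   # the inconsistency window

support.create_customer("Jane Doe")                 # same customer, created again
billing.propagate()
support.propagate()

copies_at_master = [r for r in central.records if r[1] == "Jane Doe"]
```

In this run the second application cannot see the new customer before propagation, and the central master ends up holding two records for the same person. Closing that gap means either matching and merging on arrival or coordinating creates up front, which is the transaction-semantics bottleneck.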

Well, let’s look at the registry. Since the central repository only maintains an index and cross-references, data reads are a little hairier; the central registry will need to invoke a series
of queries to each application that holds a piece of each virtual master record, and then assemble that master record on demand. While the performance penalty for creation goes down, the
performance penalty for reads goes way up.
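
The read fan-out can be sketched as follows. The `Registry` and `SourceSystem` classes, the identifiers, and the sample attributes are all made up for illustration; the point is only to count how many source queries one master read costs.

```python
class SourceSystem:
    """Hypothetical participating application that owns some attributes."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows            # local_key -> dict of attributes
        self.queries_served = 0

    def fetch(self, local_key):
        self.queries_served += 1
        return self.rows[local_key]


class Registry:
    """Hypothetical registry: holds only identifiers and cross-references."""

    def __init__(self):
        self.xref = {}              # master_id -> list of (SourceSystem, local_key)

    def register(self, master_id, system, local_key):
        self.xref.setdefault(master_id, []).append((system, local_key))

    def read(self, master_id):
        """Assemble the virtual master record on demand, one query per source."""
        assembled = {}
        for system, local_key in self.xref[master_id]:
            assembled.update(system.fetch(local_key))
        return assembled


crm = SourceSystem("crm", {"C-17": {"name": "Jane Doe", "phone": "555-0100"}})
billing = SourceSystem("billing", {"B-902": {"billing_city": "Springfield"}})
shipping = SourceSystem("shipping", {"S-4": {"ship_city": "Shelbyville"}})

registry = Registry()
registry.register("M-1", crm, "C-17")        # creation: one cheap xref update each
registry.register("M-1", billing, "B-902")
registry.register("M-1", shipping, "S-4")

record = registry.read("M-1")
queries_for_one_read = sum(s.queries_served for s in (crm, billing, shipping))
```

A single read of one master record costs one query per participating system, so read latency grows with the number of sources. In this sketch later sources simply overwrite earlier ones when attributes collide; a real registry would need explicit survivorship rules.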

Next is the transaction hub, which is just a more restrictive form of the central master. In this case we have the bottleneck associated with the transaction semantics at the central repository,
which must also contend with reads now occurring at the single copy instead of the local copies. Still more potential performance hits.
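
A toy version of the hub shows where the contention comes from. This is an assumption-laden sketch: `TransactionHub` is a hypothetical class, and a single `threading.Lock` stands in for whatever transaction mechanism a real hub would use.

```python
import threading


class TransactionHub:
    """Hypothetical hub: the only copy of the data, reached through services."""

    def __init__(self):
        self._lock = threading.Lock()
        self.records = set()
        self.operations = 0       # every create and read passes through here

    def create(self, customer):
        with self._lock:          # transaction semantics: one operation at a time
            self.operations += 1
            if customer in self.records:
                return False      # duplicate rejected
            self.records.add(customer)
            return True

    def read(self, customer):
        with self._lock:          # reads contend for the same single copy
            self.operations += 1
            return customer in self.records


hub = TransactionHub()
results = []
results_lock = threading.Lock()


def application(customer):
    """One participating application: try to create, then read back."""
    created = hub.create(customer)
    found = hub.read(customer)
    with results_lock:
        results.append((created, found))


threads = [threading.Thread(target=application, args=("Jane Doe",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

creates_that_won = sum(1 for created, _ in results if created)
```

Eight applications racing to create the same customer yield exactly one record, so consistency holds. The price is that all sixteen operations, reads included, queue up behind the same lock at the same repository.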

No matter what, it seems that MDM system performance is one of those nagging questions begging to be answered.

How are these performance questions addressed? One approach is the traditional system engineer’s answer: buy more powerful hardware. Throwing massively parallel appliances at the problem will probably help,
especially when they carry multiple I/O channels. Another approach is caching copies of the data along the system geography. Of course, this then boils down to a memory hierarchy management and
cache coherence management problem, which is a whole other kettle of fish (albeit one with a nice history of research behind it). A third approach is to allow a bounded amount of inconsistency,
covered by service-level agreements guided by the governance component. Essentially, you can allow for some level of variance within a certain time frame for propagation, or restrict creation of
new records by a set of policies for coherence.
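
The idea of tolerating variance within an agreed time frame can be sketched as a local cache with a staleness limit. Everything here is hypothetical: the `StalenessBoundedCache` class, the 60-second window, and the injected fake clock (used so the example behaves the same on every run).

```python
class StalenessBoundedCache:
    """Hypothetical local cache that tolerates staleness up to an agreed limit."""

    def __init__(self, master, max_staleness_seconds, clock):
        self.master = master                  # dict standing in for the repository
        self.max_staleness = max_staleness_seconds
        self.clock = clock                    # injectable so the example is deterministic
        self._entries = {}                    # key -> (value, fetched_at)
        self.master_reads = 0

    def get(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] <= self.max_staleness:
            return entry[0]                   # within the agreed window: may be stale
        self.master_reads += 1
        value = self.master[key]
        self._entries[key] = (value, now)
        return value


fake_now = [0.0]
master = {"M-1": "Jane Doe, Springfield"}
cache = StalenessBoundedCache(master, max_staleness_seconds=60, clock=lambda: fake_now[0])

first = cache.get("M-1")                      # miss: goes to the master
master["M-1"] = "Jane Doe, Shelbyville"       # the master record changes

fake_now[0] = 30.0
within_window = cache.get("M-1")              # old value, permitted by the agreement

fake_now[0] = 61.0
after_window = cache.get("M-1")               # window expired: refreshed from master
```

Three reads cost only two trips to the master, at the price of serving an out-of-date record for part of the window. How large that window may be is a governance decision, not a technical one.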

The articles and papers that I did find making reference to MDM performance issues seemed to imply that vendors are not adequately addressing them. My desire to see MDM succeed is tempered by
the pervasive presence of the 800 lb. performance gorilla hovering around the back of the room.

David Loshin
