The past few years have seen enormous growth in interest in master data management (MDM). In terms of architecture, interest seems to have crystallized around various kinds of hubs. The original vision seems to have been that master data can be taken into these hubs from transaction applications where it is produced. Once in the hub, it can be integrated (including de-duplication), cleaned, enriched, and then distributed to consuming applications. There has been a debate on the scope of these hubs. For instance, in some cases little more than identifying information (keys) may be managed, while in others, full-blown “golden records” containing many non-key attributes are maintained. Additionally, there has been debate over whether an instance of a hub should manage one or a few master data entities, or whether a hub should be truly “multi-entity” and manage many such entities.
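The classic hub flow can be caricatured in a few lines of Python. The sketch below is only a toy, with hypothetical record shapes and a naive key-based match, meant to show the sequence of steps (consolidate, de-duplicate, clean, enrich, distribute) rather than any product's actual behavior.

```python
# Toy sketch of the classic hub flow; record shapes and matching rules are hypothetical.

def consolidate(*source_batches: list[dict]) -> list[dict]:
    """Take raw master records in from the transaction applications that produce them."""
    return [record for batch in source_batches for record in batch]

def deduplicate(records: list[dict]) -> list[dict]:
    """Keep one merged record per natural key (a real hub would use fuzzy matching)."""
    by_key: dict[str, dict] = {}
    for record in records:
        by_key.setdefault(record["customer_id"], {}).update(record)
    return list(by_key.values())

def clean(records: list[dict]) -> list[dict]:
    """Standardize formats, e.g. trim and upper-case names."""
    return [{**r, "name": r.get("name", "").strip().upper()} for r in records]

def enrich(records: list[dict], country_names: dict[str, str]) -> list[dict]:
    """Add attributes from reference data, e.g. a country name from a country code."""
    return [{**r, "country": country_names.get(r.get("country_code", ""), "UNKNOWN")}
            for r in records]

def distribute(records: list[dict]) -> None:
    """Hand the "golden records" to consuming applications (here, just print them)."""
    for record in records:
        print(record)
```

Everything that follows in this article is really about where the input to that first step comes from, and how trustworthy it can ever be.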
Also during the past few years, many projects have been implemented, and they have not always gone as well as originally anticipated. This has resulted in feedback into the architectural debates and into the way in which MDM product vendors position themselves. One important trend seems to center on the difficulty of relying on master data produced in legacy applications. Such data is often of inherently poor quality because the legacy applications were never intended to support MDM. Cleaning and integrating it in a hub therefore has an inescapable element of “hit-or-miss,” and the result is that there is a ceiling on how reliable the hub master data can be, and that ceiling may fall below what the enterprise finds acceptable.
As a consequence, techniques for producing master data directly in MDM hubs have gained increasing attention. High-quality data is seen as the result of improved stewardship, and if facilities to support this can be built out in an MDM hub, then it seems possible to overcome the curse of master data produced by legacy applications. Figure 1 summarizes this architecture.
Figure 1: Simplified Modern MDM Hub Architecture
But is this architecture the final word? There are reasons to think that it might have inherent limitations and that another pattern may have more advantages. Let us look at why this might be so.
Producing Master Data versus Consuming Master Data
One of the early projects I worked on was for a large intergovernmental organization that ran social and economic projects in developing countries. These projects were financed from specific funds, and my task was to build an application that would permit the creation of new funds. I had considerable experience with nearly all the major applications in this environment, and I knew that the only fund data these applications required was a couple of attributes: Fund Code and Fund Name. However, I was horrified to find that there were elaborate processes to create a new fund that involved a large number of parties each with specific responsibilities. This not only dictated a complex workflow with many states that a nascent fund passed through, but it also meant that there were many additional attributes (and entities) needed to store information about the fund on-boarding process. These included quite a lot of metadata about the process flow and participation by the users.
I had originally thought that the project simply had to produce a table of Fund Code and Fund Name. In a sense it did, because this was all the consuming applications required. But the process and data required to get to this end point were very complex. It could also take a long time to on-board a fund – sometimes months.
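To make the contrast concrete, here is a minimal sketch in Python of the two data models, using entirely hypothetical names and attributes (the real on-boarding model was considerably richer). The production side carries workflow state and process metadata; the record the consuming applications needed was just the code and the name.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class OnboardingState(Enum):
    """Hypothetical workflow states a nascent fund passes through."""
    DRAFT = "draft"
    UNDER_REVIEW = "under_review"
    APPROVED = "approved"
    CODES_ASSIGNED = "codes_assigned"
    PRODUCTION_READY = "production_ready"


@dataclass
class FundOnboardingRecord:
    """Production-side model: rich workflow and process metadata."""
    fund_code: str
    fund_name: str
    state: OnboardingState
    sponsoring_unit: str
    approvals: list[str] = field(default_factory=list)  # parties who have signed off
    state_history: list[tuple[OnboardingState, date]] = field(default_factory=list)
    notes: str = ""


@dataclass(frozen=True)
class FundDistributionRecord:
    """Distribution-side model: all that the consuming applications required."""
    fund_code: str
    fund_name: str


def to_distribution(record: FundOnboardingRecord) -> FundDistributionRecord:
    """Project the rich production record down to the simple distribution record."""
    return FundDistributionRecord(record.fund_code, record.fund_name)
```

The projection function is trivial, which is exactly the point: almost everything in the production model exists to serve the on-boarding process, not the consumers.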
Does Master Data Production Fit in a Distribution Hub?
What this experience taught me is that the production of master data is quite different from the distribution of master data. When I looked at the data model of my fund on-boarding application, it was vastly different from the data model of the simple table that had to be distributed. Would it make sense to incorporate the unique data and processes needed to produce the fund data into a distribution hub? On my project, we chose not to.
Of course, in technical terms, there is nothing impossible about combining production and distribution in the same hub environment. But what is gained versus the problems that have to be overcome? Let us consider a few of the problems:
- State Control. Suppose the new fund must go through 5 states before it is production-ready. It is then necessary to prevent every fund that is not yet production-ready from being distributed from the hub along with the production-ready data (see the sketch after this list). This can be quite a challenge. However, if the production environment (where master data is created) is completely separated from the distribution environment, then there is no need to build this elaborate state control. Furthermore, such state control may never entirely eliminate the risk of accidentally distributing master data that is not yet in production status.
- Data Models. The data model for producing (creating) master data can be very different from the data model for distributing the same master data. Why try to have just one data model? Does it not make more sense to keep them separate?
- Business Logic. The business logic for producing (creating) master data may be elaborate, while there may be little or none required for distribution. Is this one application, or two? Surely a combined application will be much more difficult to build and maintain than two separate ones, one for production (creation) and another for distribution.
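To illustrate the state-control point above, here is a hedged sketch, again in Python with hypothetical names, of where the state check has to live in each design. In a combined hub, every path that serves consumers must remember to filter by state; with a separate distribution environment, the check happens once, at the point of publication, and nothing non-production ever crosses the boundary.

```python
from dataclasses import dataclass

PRODUCTION_READY = "production_ready"  # hypothetical final state of the fund workflow


@dataclass
class Fund:
    fund_code: str
    fund_name: str
    state: str  # one of the on-boarding states, e.g. "draft", ..., "production_ready"


def funds_safe_to_distribute(all_funds: list[Fund]) -> list[Fund]:
    """Combined hub: every path that serves consumers must remember this filter."""
    return [f for f in all_funds if f.state == PRODUCTION_READY]


class MarketHub:
    """Separate distribution environment: it only ever holds records that were
    explicitly published, so there is nothing non-production in it to leak."""

    def __init__(self) -> None:
        self._funds: dict[str, str] = {}  # fund_code -> fund_name

    def publish(self, fund: Fund) -> None:
        # The state check happens once, at the single point of publication.
        if fund.state != PRODUCTION_READY:
            raise ValueError(f"Fund {fund.fund_code} is not yet production-ready")
        self._funds[fund.fund_code] = fund.fund_name

    def all_funds(self) -> dict[str, str]:
        """What consuming applications see: simple, always production-ready data."""
        return dict(self._funds)
```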
Of course, I am aware that designs can be offered for a single hub that deals with all of the above issues. But how difficult would such a hub be to configure and maintain? And what degree of risk would be inherent in such a design? I would suggest that both maintenance requirements and inherent risk would be high. However, there is an additional concern, and this is the emergence of a new class of users in many enterprises whose job is purely to manage data. And these users are unlikely to be satisfied by any generic, monolithic hub architecture.
The Master Data Farmers
For decades, IT has viewed business users as rather a uniform lot. It is true that some have been recognized as being involved in data entry, and others as making business decisions based on outputs, and yet others as sponsors for IT activities. However, we are now seeing the emergence of a new class of users, and perhaps the best term for them is data content managers. These users do not participate in running or managing the enterprise. They are subject-matter experts in specific data domains. In financial services, we now commonly see teams dedicated to specific areas such as client data, account data, instrument data, or corporate action data. The data content managers, more than anyone else, “own” the data they are responsible for, and it is nearly always master data.
Failure to recognize the existence of data content managers is a crucial error for IT, because IT certainly does not “own” master data. IT only builds and manages the environments in which master data can be created and distributed – and has no interest in the data content itself. But the architecture that IT creates should match the way the data is managed. The data content managers are like farmers of master data. They slowly tend and grow their crops of data, and when these are ripe, they send them to market; that is, they distribute them to the rest of the enterprise. Surely an architecture that confuses farm and market is inappropriate.
An additional issue is that the environments in which master data is produced may need to be quite different. A team of data content managers looking after financial instrument master data will have very little in common with a team looking after client and counterparty master data. Why should the two teams be offered only a single environment in which to do their work? It is a little like saying that a dairy farm and a peach orchard are both farms and so they should be designed on identical principles. At a very high level this may be true, but for practical purposes such “one-size-fits-all” approaches always break down. Thus, not only should master data production be separated from distribution in our architecture, but the production of different master data entities should also be separated into independent applications. Figure 2 summarizes this architectural pattern.
Figure 2: Farm and Market Architecture Applied to MDM
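As a rough illustration of Figure 2, the sketch below assumes that each “farm” is an independent application with its own internal model and workflow, and that the farms share only a thin publication contract with the single “market” hub. The names and the contract itself are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class MasterRecord:
    """The thin contract shared by every farm: entity type, key, and attributes."""
    entity_type: str          # e.g. "instrument", "client"
    key: str
    attributes: dict[str, str]


class Farm(Protocol):
    """Each production application implements this however it likes internally."""

    def harvest(self) -> list[MasterRecord]:
        """Return only the records that are ripe, i.e. production-ready."""
        ...


class MarketHub:
    """The single distribution point for the whole enterprise."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], MasterRecord] = {}

    def collect(self, farm: Farm) -> None:
        for record in farm.harvest():
            self._store[(record.entity_type, record.key)] = record

    def lookup(self, entity_type: str, key: str) -> Optional[MasterRecord]:
        return self._store.get((entity_type, key))
```

The point of the shared contract is that the market dictates only what a ripe record looks like; how each farm grows its records is left entirely to that farm.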
It is likely that as vendors try to put more and more specific master data production functionality into their hub products, they will be driven to the conclusion that the environments for the production of master data cannot be generalized.
The Master Data Market
In contrast to farming, markets do tend to be centralized. It does make sense to have a single distribution hub from which all enterprise applications can obtain master data. In this respect, the current hub architecture is really good. It is very difficult to see what better pattern could be implemented. The issues we have been discussing lie much more with the production – the “farming” – of master data.
Like all architectural discussions, this one involves some degree of looking at an ideal state. Some degree of integration and cleansing (at least data quality checking) will probably have to remain in the “market” hub. Also, the current designs of MDM hubs try very hard to deal with the fact that much master data is not truly “farmed” but is a by-product of legacy transaction applications. There will be no getting away from this in the near future, and so the issues of cleansing and integration will still need to be dealt with. However, architecture is also about planning for a target state, and the idea of having many master data farms and a single master data market at least represents a valid pattern for consideration in this planning.
Note: I am grateful to my colleague Fabio Corzo for suggesting the farm and market analogy when we were discussing these problems.