El Dorado:
1. A vaguely defined historical region and city of the New World, often thought to be in northern South America. Fabled for its great wealth of gold and precious jewels, it was eagerly sought after by 16th- and 17th-century explorers.
2. A place of fabulous wealth or inordinately great opportunity.
Enterprise Repository:
1. A vaguely defined managerial region of the Information Age, often thought to be in Corporate America and Europe. Fabled for its great wealth of Return On Investment (ROI) and Return On Equity (ROE), it was eagerly sought after by late 20th-century IT developers.
2. A place of fabulous wealth or inordinately great opportunity.
Many, but not all, of us have heard the legend of the Enterprise Repository. Some of us have presented the evidence, achieved funding, and gone off in search of the legendary riches. Of the armadas that have left, many have perished while sailing their projects, forever chasing the setting sun. Some have returned with small amounts of savings that helped to offset the cost of such adventurous voyages. But to my knowledge, none have actually discovered the treasured city and brought back the celebrated great riches.
THE LEGEND
First, let me recount the legend. In the dark and undocumented days early in the Information Age there began to be code. The benefit of code enabled the exchange of information in such ways as never before known. People benefited greatly. Thus, the demand for code was great and it became a great and costly labor. So there came about methods to reuse code. The first great method of reuse to arise was:
COPY AND MODIFY
Again, the people benefited greatly as code was produced at much faster rates than hitherto known. But there arose a management problem with the proliferation of code. Many objects were duplicated and the enormous benefit of code was now compromised due to the high cost of maintenance. Then a new innovation was introduced. The second great method of reuse to arise was:
SHARED SOURCE CODE
Although primarily used for data definitions rather than procedure, the use of copybooks (or includes, if you will) was a tremendous advance in the managerial effort now that there was much code to maintain. One could simply recompile code when a common definition or procedure was changed. Yet, as code increased there came about enormous problems of coordination when identifying what code needed to be changed and when to install such code. Production problems became pandemic. Then as a way to alleviate the problems attendant with the second great method of reuse, there arose yet another improvement. The third great method of reuse to arise was:
SHARED STATICALLY LINKED MACHINE CODE
This did exceedingly alleviate, but did not rid the people of, the problems of coordinating source code objects, and it did nothing to help with the coordination of machine code objects. This led immediately to the last great plateau of sharing. The fourth great method of reuse to arise was:
SHARED DYNAMICALLY LINKED MACHINE CODE
This method addressed the shortcomings of the third great method. But it did not help with embedded business rules and notions of data that were now hopelessly lost in machine code. And this last great method itself spawned myriad different and incompatible ways of sharing dynamically linked machine code. And people all had their own purposes and favorite vendors and so they were given over to their own ways. There arose many languages, methodologies and great IS cultures and alliances. In all this, code still had to be managed. As the notions of code became more and more sophisticated, all the different kinds of code and things that eventually become code collectively progressed to be known as “Meta data”. The seers of the Information Age were always delving into Meta data in order to understand and unravel the mysteries of managing Information. Information itself was the domain of the Application Specialists.
The Information Seers made ships of systems for controlling versioning. But these lacked versatility. Then they made ships of ideas, called “models”, which were not code, but a kind of code; a precursor to what would eventually become code. Then they made ships to hold models. These were known as “Meta models” when referring to models of a kind and as data dictionaries or repositories when referring to collections of models and/or Meta data. And as the Information Seers thought and reckoned with the high cost of maintenance (on this they thought much and trembled for fear of the budget barons on whose funding they lived) they looked at many causes of expenses. They saw that Codemakers did not reuse code when it was possible. Codemakers preferred to ignore the great methods of reuse and live and work as though still in the dark undocumented days of code. The Seers accused the Codemakers of heretical empire building. But the Codemakers simply said it was easier to work in the dark rather than explore the jungles of existing code. They were manufacturers, not explorers.
The Codemakers were actually a rather docile bunch doing whatever they could to achieve the most immediate and apparent benefit. In fact, the Codemakers did reuse code if it was within their limited scope of awareness. So the Information Seers counted the high cost of working in the dark, which also increased the density of the coded jungle. They surmised and theorized that fortunes were being lost and that a veritable “City of Gold”, an “El Dorado” as it were, existed if reuse were to be fully exploited. Indeed, it took little to realize that upon discovery of the “City” all that was needed was to farm the benefits for the rest of the Age by following the Great Methods of Reuse. They had documents of project cost overruns, endless meetings to discuss interface strategies, production failures, and duplicated efforts, which all suggested a great wealth in savings yet to be achieved. Just as the appearance of quartz suggests gold veins to be in the immediate vicinity, so did the research suggest a gold mine, even a “City of Gold”, yet to be found. So certain were they that such a place existed that they christened the ship Repository for voyaging and charting the unknown waters and lands of the Enterprise. Many Enterprises embarked on such voyages looking for the City of Gold, some cautiously and others with great fanfare and send-offs, while still others entirely discounted the existence of a place where such returns could be met. So goes the legend.
SPECULATION
Let’s ponder why the legendary level of return has not been found. Indeed, the arguments and evidence that such savings are there are quite convincing. I have no dispute with those. But I think we are still like the early explorers who thought that to cross the ocean west from Europe was to arrive in India or East Asia and simply plunder the riches of the foreign land. Little did they realize that there were whole continents to be crossed and explored, and another ocean larger than the first, before arriving at the intended destination, let alone reaping untold wealth. At this point I am not dissuaded that there truly is a way around our globe of technological discovery.
Only there is first a whole New World which we must come to discover, know and understand before being able to draw on the abstruse returns of the “City of Gold”. The divined source of great wealth will be many newer sources of linguistic, semantic, modeling and intelligence technologies becoming available to address the amblyopic path through the New World. The net result of these discoveries (there will be many) will be new technologies added into repository technology to create a kind of corporate technical self-awareness. These technologies will enhance or possibly displace the object-relation paths supported by present technology. These technologies include, but are not limited to, Object Role Modeling, rule-based technologies including various AI learning approaches, highly abstracted object-oriented implementations chock full of overloaded and polymorphic behaviors, and generative lexicons.
This may well take years and will almost certainly not be a thunderous introduction of a single technology. But here we are discussing why the anticipated great returns have by and large not been met. We should discuss repository needs first, or better yet, Meta data management needs and the role of repositories in fulfilling those needs. The repository, however it is called or implemented, is plainly the tool of Meta data management.
DEFINING WHAT A REPOSITORY IS
For purposes of this discussion, a data dictionary and repository are one and the same thing.
Repository may suggest more function than a commercial data dictionary, but any differences are not relevant here. We are talking about a tool or collection of tools used to manage physical abstractions (code) and logical expressions (models) of the systems of the Enterprise. In a way, this definition includes version control systems because of the overlapping functionality with repositories. But I don’t intend to include them here because they do not have internal meta models with which to express the relations of system objects.
Version control systems simply collect and version individual objects or collections of objects. We do not consider file systems, or even repository engines applied to areas other than Meta data management, as repositories. Repositories (for this discussion) are not inventory systems of where software has been installed, nor lists of job schedules, nor configuration management diagrams of network topologies. Ideally, though, a repository would link closely with systems for managing physical location and task-oriented plans. Repositories are inventoried lists of systems, databases, model diagrams, data element mappings, data transformations, and cross references organized for storing definitions and obtaining change impact on existing systems. For the sake of this discussion we will categorize repositories into four main varieties:
- TOOL type – Supports a development tool or tool suite of a single vendor
- HOT PROJECT type – Supports analysis efforts
- INDUSTRIAL type – Supports a production software application
- ENTERPRISE type – Supports the entire Enterprise (as a goal)
Furthermore, let us consider whether the repository is a System of Record or a System of Reference (this is also called “active” or “passive”). As a System of Record its objects are the most reliable and current source of Meta data. As a System of Reference its objects must be kept synchronized with an actual source Meta data object as it exists in the version control system or wherever the reliable source resides. We find that Enterprise class repositories are not clear in this regard. Nor can they be. While attempting to contain the original source of Meta data, they are in competition with database catalog/dictionary Meta data, with existing version control systems (if any) in the Enterprise, and with the proprietary Meta models of vendors with private repositories who also implement their own version control schemes.
The TOOL type repository is commonly found as the internal way CASE and other development and modeling tools store Meta data objects. The limited scope and narrow support requirements make for a maintainable closed system configuration. The paradigm for how to view Meta data is dictated by the creator (the vendor) and maintainer (the vendor) of the underlying Meta model. Conflicting notions of Meta data do not generally exist in these kinds of closed systems. Quite simply, the tool based repository systems work well for what they are intended for. They do not, however, address the greater needs of the Enterprise and remain project focused in scope. They are usually the System of Record for pre-code objects and may also have places for storing code snippets. An entire project or model is typically stored in a version control system or on a backup as a single versioned object.
The HOT PROJECT type repositories are analytical in nature. These are ad hoc or tool based repositories that begin to take Meta data from disparate enterprise sources for the single purpose of answering specific systems analysis questions. These tools may have very simple Meta models and never attempt versioning objects. They are always Systems of Reference and are reloaded periodically for refreshing. Y2K projects typically have multiple repositories of this sort.
The INDUSTRIAL type repository is the active Meta data store used for the operation of an automated system. Data catalogs or data dictionaries of relational database systems fall into this category. So do interface definition sources used for dynamically obtaining linking parameters in object-oriented systems, such as those found in the CORBA specification. These are always Systems of Record.
The ENTERPRISE type repository is the visionary umbrella which attempts to collect all relevant Meta data and make it available to developers and business analysts alike.
This is the legendary “El Dorado” of the Information Age. If all of the information about systems were collected together and made readily available, it would indeed be a savings bonanza. Before discussing problems of Enterprise Repositories, we should remember that the driving force for such a tool is the fact that the systems of any large enterprise contain so very much of the following: duplication of objects leading to exponentially larger than necessary maintenance efforts, ambiguity of meaning, rework, disorganized access to explicit and implied business rules, loss of business rule knowledge, interfaces unnecessarily copying data to multiple sources instead of sharing, and a general lack of convention.
PROBLEMS WITH REPOSITORIES OF THE ENTERPRISE
The problem areas that seem to be endemic to repositories with the scope of an enterprise are:
- Confusion of whether the repository is a System of Record or System of Reference (as already noted)
- Amalgamation of multiple paradigms
- Versioning and synchronization
- Maintenance effort
Record or Reference problems
The bane of documentation is the enormous effort required to keep documents current with installed software. As programmers know all too well, the final authority on what a program does is the code, not the requirement or the analysis or the specification. As a result repository vendors have attempted to bill Enterprise repositories as capable of being Systems of Record. This is done by tightly coupling production system change control procedures to include steps to populate the repository. The extra layer is usually viewed by programmers and DBAs as an encumbrance to the most direct source. There are too many tools with nifty features that work directly with database system tables and source code to be ignored by DBAs and programmers.
One solution would be for the repository to actually share database catalog/dictionary tables and also be the version control system for old-fashioned third generation source code. This is not likely to happen soon. Next are the CASE type tools with their own proprietary meta models. It is a very daunting task for a commercial repository vendor to store and recreate models developed in various products in a single repository data store. If a DA wants the most current model of record, they will go to the tool. The repository will only be used as a system of reference when there is nothing better available. The solution is the same as noted above: make the CASE tools use the Enterprise repository data stores for their own repository needs. There has been some vendor effort in that direction, but don’t expect a rush of cooperation and purpose to occur. Another solution may be for Enterprise repository objects to map or point to objects in proprietary repositories.
This also would be difficult to do. So what we are left with is the cold fact that an Enterprise repository can only be a System of Reference – that is to say, a place of documentation. As such it consumes much resource in efforts to stay synchronized with production systems.
Multiple Paradigm problems
Information Technology has had no shortage of ways to accomplish the same task. This has resulted in much duplication of work, much “reinventing the wheel” and much extra analysis and maintenance work due to the proliferation of code. Repositories have been touted as a place to cross reference and begin the work of reducing the overwhelming amount of source code. But having a tool that has objects that relate to other objects does not make it a simple, or even a possible, task. At the current state of the art, Enterprise Repositories do not help to homogenize business views across different implementation strategies.
Although files, tables, OO objects, entities and classes all have notions of synonymy, they are not always the same. Because they are not always the same, repository meta modelers make them out to be entirely different things. They maintain states of synonymy by creating inordinate numbers of relations. These relations do need to exist, but a repository user should not have to know them explicitly in order to make a semantic comparison between an ACCOUNT table and an ACCOUNT file. It becomes too overwhelming to know all the different relations that exist between repository objects. The repository ends up becoming unnavigable. Then there is the logical/physical translation problem. Different vendors have different levels of support for “denormalization”. This is the complex mapping that works when going from a thoroughly expressed idea (model) to a physical implementation expression. It is not a homogeneous process. To serve everyone, that is, to have Enterprise scope, a repository would have to support all the different techniques different tool vendors use and the different techniques used in various releases of a vendor’s products.
The result is enormous growth in the underlying repository meta model along with extra effort to support scan and load processes for all the tools. CASE tool vendors do not even support their own products that well. In the course of this monumental effort repositories may become error prone and labor intensive.
Versioning and Synchronization problems
We will look at versioning problems with the Enterprise Repository as having two main flavors, though they occur more like a chocolate swirl in vanilla ice cream than as two distinct areas. The first is versioning within the Enterprise Repository. The second is bringing Meta data into the repository from another meta source, usually a CASE tool or program generator which has its own proprietary meta model. The first problem also harkens back to the intended use of the repository as a place of reference or as the primary version control system for a source of Meta data. If the repository is only for reference, then maintaining versions need be less exacting than if using the repository as a version control system.
The underlying object reuse requirements will likely be different for the different purposes. For example, if the data type of a column in a database changes from char(15) to varchar(50), we may only need to reflect the version change on the column itself when the intended use is only for reference. This would, of course, depend on how much synchronization of versioning is required between the reference source and the actual source. (Higher levels of synchronization are more difficult to maintain.) But if the repository is meant to be an actual source of record, then a simple data type change would have to be reflected in all the independent entities that might contain that changed column: table, database, record, interface, program, system, or other meta data descriptor entity.
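The contrast between the two purposes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the containment relations, the object names (ACCOUNT, LEDGER, ACCTUPD and so on) and the two functions are hypothetical and do not come from any real repository product.

```python
from collections import deque

# Hypothetical containment relations: each meta data object maps to the
# larger objects that contain or depend on it. All names are invented.
CONTAINED_IN = {
    "column:ACCOUNT.NAME": ["table:ACCOUNT", "record:ACCOUNT-REC"],
    "table:ACCOUNT": ["database:LEDGER"],
    "record:ACCOUNT-REC": ["interface:ACCT-FEED", "program:ACCTUPD"],
    "database:LEDGER": ["system:GENERAL-LEDGER"],
    "interface:ACCT-FEED": ["system:GENERAL-LEDGER"],
    "program:ACCTUPD": ["system:GENERAL-LEDGER"],
}


def affected_objects(changed, as_system_of_record):
    """Return the objects whose version must change when `changed` changes."""
    if not as_system_of_record:
        # Reference use: only the changed object itself is re-versioned.
        return [changed]
    # Record use: walk upward (breadth first) through every containing object.
    seen = {changed}
    order = [changed]
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in CONTAINED_IN.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def bump_versions(versions, changed, as_system_of_record):
    """Return a copy of `versions` with every affected object incremented."""
    updated = dict(versions)
    for name in affected_objects(changed, as_system_of_record):
        # Objects not yet tracked are assumed to start at version 1.
        updated[name] = updated.get(name, 1) + 1
    return updated


if __name__ == "__main__":
    column = "column:ACCOUNT.NAME"  # the char(15) to varchar(50) change
    print(affected_objects(column, as_system_of_record=False))
    print(affected_objects(column, as_system_of_record=True))
```

In this toy graph the reference case touches one object while the record case touches seven, which gives some feel for why the exacting case is so much more work.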
Additionally, as a system of record, the repository would need to have features for check-in and check-out; object version branching and merging; and object status designation. Independent native source objects would have to be declared so that only reasonable check-in and check-out could occur. For example, checking out a field in a COBOL record would be prohibited whereas check-out of a copybook would be allowed. The copybook file is a native source object. The field in the record in the copybook is always part of a larger meta data object. While this is obvious in COBOL, it can become more obscure with other kinds of meta data. As many repository administrators have found, it is difficult to keep an Enterprise Repository synchronized with a database catalog/dictionary.
One problem is that the database system repository does not support versioning of objects. If a table is dropped, it is gone with no record of the change. This reflects the industrial kind of use purposed by database engines on their local store of meta data, but may be difficult to show in version levels in the Enterprise Repository. This is because databases are changed directly by DBAs. Enterprise Repositories are nowhere near tightly enough coupled to maintain a transparent layer over the actual sources of meta data in databases. Database Administrators work directly on databases, even in tightly controlled environments. Once one has experienced the challenges of making the Enterprise Repository a reliable and current resource, one discovers that, barring some intelligent and automatic interface, there is enormous effort required to keep the Enterprise Repository object population properly versioned.
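Since the catalog itself keeps no history, one modest way to recover the lost changes is to compare periodic snapshots of it. The Python sketch below assumes that each snapshot has already been extracted into a plain dictionary of table name to column definitions; how that extraction is done, the function name diff_catalog, and the sample tables are all hypothetical.

```python
def diff_catalog(previous, current):
    """Compare two catalog snapshots and list the changes between them.

    Each snapshot is assumed to be a dict of the form
    {table_name: {column_name: data_type}}.
    """
    changes = []
    # Tables that have vanished since the last snapshot were dropped.
    for table in sorted(previous.keys() - current.keys()):
        changes.append(("dropped_table", table))
    # Tables that appear only in the new snapshot were added.
    for table in sorted(current.keys() - previous.keys()):
        changes.append(("added_table", table))
    # For tables present in both, compare column definitions one by one.
    for table in sorted(previous.keys() & current.keys()):
        old_cols = previous[table]
        new_cols = current[table]
        for col in sorted(old_cols.keys() | new_cols.keys()):
            before = old_cols.get(col)  # None means the column was absent
            after = new_cols.get(col)
            if before != after:
                changes.append(("changed_column", f"{table}.{col}", before, after))
    return changes


if __name__ == "__main__":
    yesterday = {
        "ACCOUNT": {"NAME": "char(15)"},
        "BRANCH": {"CODE": "char(4)"},
    }
    today = {
        "ACCOUNT": {"NAME": "varchar(50)"},
    }
    for change in diff_catalog(yesterday, today):
        print(change)
```

Such a comparison only notices what changed between two snapshots; anything created and dropped in between is still lost, so it narrows the gap without closing it.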
Maintenance Effort problems
As noted above, just maintaining versions is an enormous effort. Apart from the substantial effort of writing any computer software product and keeping it working, an Enterprise Repository requires effort by vendor or user to:
- Develop meta models for various meta data sources and synchronization
- Write scanners which properly parse and decompose meta data objects from various sources
- Develop meta data population reuse strategies
- Populate the repository with all the different kinds of meta data to be stored on an ongoing basis, including correction and problem solving for load rejects
- Extract meta data components back into source objects (when using the repository as a system of record)
The draw on resources can be extraordinary. If not careful, it may cost a City of Gold to implement. We eagerly await more automated and reliable solutions to the labor intensive procedures that are involved in populating and maintaining Enterprise repositories.
CONCLUSION
My contention in this discussion is that Enterprise repositories as huge collectives of Meta data which mix and rationalize across paradigms are not robust possibilities for the near future. They do not exist today.
El Dorado is not likely to be found. At best we get physical element cross reference systems which are labor intensive to maintain, sometimes having a modest value and sometimes disputable in their immediate value. But the exploration and effort of working with Enterprise repositories has invaluable long term benefit for those companies with an interest in acquiring wisdom. Those that are doing the work will have a greater understanding of Meta data and the difficulties of defining it, let alone managing it. As newer technologies are introduced into the vision of the Enterprise Repository those venturing out will be positioned to stake out a claim, exploit and obtain the benefits of the New World. Some of the early adventurers will fail for various reasons.
Others will stall and eventually make beachheads and small settlements. Introducing such a vision into an organization is a major cultural and technological change. It takes much time and effort regardless of the amount of funds available. This is also to say that it will be difficult for large organizations to quickly implement complex Meta data management solutions as the technology becomes stronger. In the meantime it remains a difficult decision as to how much an organization should spend on Enterprise Meta data management. Simple cost-justifications over a relatively short period of time, say 1 to 5 years, may not fulfill ROI requirements. But the exercise of attempting to collect, categorize, cross-reference, and generally simplify business analysis and programming analysis can be worthwhile for any large Enterprise that plans to continue as an organization. (If you are up for sale, don’t bother.) Lessons learned will likely yield meta data management strategies that will have benefit for ROI and ROE (especially getting return on the value of code), though they may not fit into the category of “Enterprise” in scope for a while to come.
BIBLIOGRAPHY
[1] http://msdn2.microsoft.com This is a good technical geek peek at how the near future of repositories is taking form. Notwithstanding, there is still a long, long way to go.
There are many books on UML (Unified Modeling Language). For a starting point on general UML information check:
[2] http://www.rational.com/uml/index.jtmpl
[3] http://www.cs.brandeis.edu/~jamesp/projects/models.html This is a page entitled “Models of Lexical Meaning” and well worth spending 5 minutes to read. You may then ask, “What does this have to do with meta data repositories?” Well, to the best of my knowledge, it hasn’t had anything to do with repositories. But it should! We cannot model semantic behavior unless we understand how language works.
Also see the following two books on how language works:
[4] The Generative Lexicon, James Pustejovsky, 1996
[5] Lexical Semantics, D.A. Cruse, 1986
[6] http://www.inconcept.com/JCM/ Great place to find discussions on analysis and methodologies. Repository meta modelers need to think about modeling for storing more than the syntactic parts of a programming language or computing environment. We should also model to hold knowledge of the business.