Repository Directions – Part 1

Published in TDAN.com June 1999

Articles in this series – Part 1, Part 2


The Beginnings

“Those who cannot remember the past are condemned to repeat it.” George Santayana (1863-1952)

Presented here are not so much the details of history as its important lessons, in an attempt both to learn from them and to ask the question, “Are the problems that people tried to
solve then still the problems people are trying to solve now?” The 1980s can truly be looked upon as the beginning of repository implementations.


Focus on Data Administration and Standardization

Striving for common meanings and assigning responsibility for Data

With the extensive data collection activities that resulted from the automation efforts of the 1970s, organizations found themselves sitting on mountains of misunderstood, sometimes poor quality
data assembled by a number of disparate applications feeding into databases that were often poorly designed from an application sharing perspective. This data quality problem was addressed from two
directions. The first direction was technological: relational databases provided a mathematics and body of theory for normalizing data. The second direction was organizational: a data
administration function provided a common means of defining and storing data semantics and structure. These aspects of data came to be referred to as metadata. A database was used as the storage
facility for the metadata, and the data dictionary was born.

To address data definition change management issues, organizations determined the need for data stewards, or single points of control. Data items were associated with specific organizations and
roles for assigning responsibility. The dictionary not only tracked the technology items of the database and their semantics (definitions); it also tracked the relationship of these items to the
people responsible for them. One of the significant challenges of the time was to build an active data dictionary – a dictionary that kept track of database schema changes in real
time and was always up to date with the database systems it documented.

Due to the diverse semantics, formats and structure of data definitions across the services, the Department of Defense (DoD) took a pioneering role in the area of data standardization. The Air
Force and the Army collaborated in merging their divergent data standards into DoD 8320, a standard later adopted by all the services.

People also realized that they had to devise a way to track how applications manipulate data. This became necessary for impact analysis, because a change in the database affected an unknown number
of applications with an unknown degree of severity. Data dictionaries now had to track the applications and their associations with data items.
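
To make the idea concrete, the following is a minimal sketch (in present-day Python, with invented names, not any vendor's dictionary product) of the kind of structure such a data dictionary maintained: data items with definitions and stewards, applications with their data associations, and a simple impact analysis query over them.

    from dataclasses import dataclass, field

    @dataclass
    class DataItem:
        name: str
        definition: str    # the agreed business semantics
        steward: str       # person or organization responsible

    @dataclass
    class Application:
        name: str
        uses: list = field(default_factory=list)   # names of data items read or written

    class DataDictionary:
        def __init__(self):
            self.items = {}
            self.applications = {}

        def register_item(self, item):
            self.items[item.name] = item

        def register_application(self, app):
            self.applications[app.name] = app

        def impact_of_change(self, item_name):
            """Which applications, and which steward, are affected if this item changes?"""
            affected = [a.name for a in self.applications.values() if item_name in a.uses]
            return {"steward": self.items[item_name].steward,
                    "affected_applications": affected}

    # Hypothetical example: a change to CUSTOMER_ID touches every application that uses it.
    dd = DataDictionary()
    dd.register_item(DataItem("CUSTOMER_ID", "Unique identifier for a customer", "Customer Data Steward"))
    dd.register_application(Application("BILLING", uses=["CUSTOMER_ID"]))
    dd.register_application(Application("ORDER_ENTRY", uses=["CUSTOMER_ID"]))
    print(dd.impact_of_change("CUSTOMER_ID"))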

As relational database technology moved down from the mainframe to smaller servers, so did the databases. This increased the scope of the data dictionaries, which now had to track data items
against the server computers they ran on, the communications infrastructure needed to access the databases, and the person or organization responsible for each data item. With the fragmentation
of application delivery into multiple user interface screens, it became necessary to track which field of a user interface was associated with which data item. With the client/server revolution it
became necessary to provide database definitional items to client/server development environments.


The Information Resource Dictionary System (IRDS)

A super dictionary for all information resources.

Early on it became apparent that the data dictionary was growing beyond its initial charter of managing data definitions. It was now managing a variety of application development items, including
applications, application menus, screens, actions, server computer references, and database server references. A standardization effort was undertaken to define a unified storage
mechanism for all information resources. In 1988 this culminated in the publishing of the commercial American National Standards Institute (ANSI) IRDS standard and the Federal Information
Processing Standard FIPS 156.

The IRDS was designed as an information resource dictionary that was significantly broader in scope than a data dictionary. The IRDS specification essentially allowed every organization to build its
own information resource dictionary based on whichever information resources it considered important to manage. One of the features of the IRDS was the complete freedom to define and
extend the schema of the resource dictionary; the IRDS defined a schema versioning language to achieve this flexibility. The Basic Functional Schema (BFS) was offered as a starter schema.
The BFS was provided for managing non-relational database systems (reflecting the times) and served as an illustration of the schema building capability. In addition, it was foreseen that the schema
of the dictionary itself would need to be versioned, and a complicated mechanism for versioning the IRDS schema was specified. The data that populated the IRDS was also versioned, and commands were
specified to access and manipulate this data.

In summary, the IRDS provided a specification for a command language driven engine that could store an arbitrary dictionary classification scheme (using the “new” Entity-Relationship vocabulary)
along with the dictionary data. Access to both the schema and the data was accomplished through a stream of commands. In addition, a “Panel Interface” for user access and an
Application Programming Interface (API) for the engine’s services were also specified. The specification did not mandate any database technology for implementing the engine; that decision was left
to the implementers.

The IRDS was a first in many areas. It offered:

  • Extensible schema that could be customized by implementers to manage an arbitrary set of information resources
  • Formal entity relationship schema to manage the dictionary itself in much the same way the industry was using the ER model to define database schemas
  • Meta-Schema for the dictionary designer consisting of the 8 primitives of the IRDS (somewhat analogous to the 26 letters of the alphabet and the special characters that have given us the wealth
    of the literature in English and all the languages that use the Roman script)
  • Schema versioning to manage the dictionary schema as it evolved over the years
  • Dictionary data versioning to manage the evolution of dictionary data
  • Extensive audit trails and stewardship information
  • Command language engine concept for the dictionary, similar to the way an RDBMS is a SQL-driven engine
  • Specification that was completely free of implementation technology and could be delivered on relational, non-relational or object oriented database storage technology.
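
A minimal sketch of the extensibility idea, under simplified assumptions of our own (this is not the IRDS command language or the actual BFS), might treat the dictionary schema itself as data, so that an implementer can add new entity types and relationship types for whatever information resources matter locally:

    class DictionarySchema:
        """The schema of the dictionary is itself data and can be extended."""
        def __init__(self):
            self.entity_types = set()
            self.relationship_types = {}   # name -> (source type, target type)

        def add_entity_type(self, name):
            self.entity_types.add(name)

        def add_relationship_type(self, name, source, target):
            assert source in self.entity_types and target in self.entity_types
            self.relationship_types[name] = (source, target)

    class Dictionary:
        """Dictionary data recorded against whatever schema is currently defined."""
        def __init__(self, schema):
            self.schema = schema
            self.entities = {}        # (entity type, name) -> attributes
            self.relationships = []   # (relationship type, source name, target name)

        def add_entity(self, etype, name, **attrs):
            assert etype in self.schema.entity_types
            self.entities[(etype, name)] = attrs

        def relate(self, rel_type, source_name, target_name):
            assert rel_type in self.schema.relationship_types
            self.relationships.append((rel_type, source_name, target_name))

    # A starter schema in the spirit of the BFS, then a local extension.
    schema = DictionarySchema()
    for etype in ("FILE", "RECORD", "ELEMENT"):
        schema.add_entity_type(etype)
    schema.add_relationship_type("FILE-CONTAINS-RECORD", "FILE", "RECORD")

    schema.add_entity_type("SCREEN")   # extension: manage screens against data elements
    schema.add_relationship_type("SCREEN-DISPLAYS-ELEMENT", "SCREEN", "ELEMENT")

    d = Dictionary(schema)
    d.add_entity("ELEMENT", "CUSTOMER-NAME", definition="Legal name of the customer")
    d.add_entity("SCREEN", "CUST-INQUIRY-01")
    d.relate("SCREEN-DISPLAYS-ELEMENT", "CUST-INQUIRY-01", "CUSTOMER-NAME")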

The IRDS, though a technologically advanced specification for its time, went relatively unsung in the commercial world because of an important announcement from the dominant corporation in the
industry: IBM. In addition, customers found that developing or extending repository information models took specialized skills. These extensions had significant ramifications for the import/export
tools used to populate information, which had to be modified to take advantage of the extensions. Often this lack of expertise resulted in a dictionary that was an overcomplicated database, more
simply implementable as a straightforward relational schema. It became obvious that designing a repository information model required knowledge of design tools, methodologies, metamodeling,
business rules and technology constraints, among other things. Such expertise was not commonly available, and customers often had to go back to their repository vendor for additional help in these
areas.


The CASE & Methodology Revolution

Standardizing the Analysis and Design Process

In the meantime, a revolution was occurring in the way applications and database systems were analyzed and designed. The advent of structured modeling methodologies that promised repeatability
of design, automatic generation of databases, and a high degree of communication between analysts changed the way databases were analyzed and designed. CASE tools such as Knowledgeware,
Bachman, and Excelerator rapidly gained market share. Each tool stored the results of the design in its private dictionary and lost control of the original design once it was generated into
(often) a relational database system. In addition, there were formidable challenges in merging workproducts from different team members, because each of them used standalone tools with standalone
dictionaries. Because of the flux among vendors, and the business risk they perceived in depending on any one of them, many companies bought more than one tool for the same purpose and did not
standardize on a tool vendor. In the applications arena there was a host of design methodologies, each with its own diagramming notations and conventions. Attempts to bridge applications to data
were often inadequate, and the concept of a separate data track and a process or function track gained popularity. The mantras of this age were CASE tool integration (upstream and downstream
integration) and model integration.

The advent of CASE tools often coincided with the publishing and adoption of methodologies. Because of the plethora of methodologies, CASE tool vendors offered many methodology options in order to
encompass a broad enough market to make their tools a viable choice. Not only did customers have to negotiate interchange of designs across CASE tools, they also had to deal with transformation of
designs from one methodology to another. The Federal market, especially the DoD, sponsored efforts to standardize a few methodologies for extensive adoption throughout its operations, and thus
IDEF1X and IDEF0 were born.

The primary goal of the CASE revolution was to move the design of software applications further upstream in the software development lifecycle. The CASE tools of the time assisted the analysis and
design phase of applications development; analysts were still required to understand the business aspects of the enterprise and translate them into designs using the CASE tools. As a result,
organizations were able to increase the span of each analyst and improve the degree of communication and the precision of analysis products such as data models, process models, structure charts,
and data flow diagrams.


Application Development Cycle (AD/Cycle) and the IBM Repository

IBM’s vision of the Software Application Development Process

IBM recognized the opportunity to consolidate the CASE tool revolution, the client/server revolution, and the relational database management revolution, and to dominate the applications development
arena. It embarked on the development of a framework for applications development. The centerpiece would be a repository engine running on a mainframe (MVS) and DB2, built somewhat along the lines
of the IRDS though not quite compliant with the specification. In addition to the repository, IBM also announced the use of a Source Code Library Manager (SCLM) that would manage the various files
of application development code – from programs to test scripts. The architectural vision was to provide seamless operation between the IBM Repository and the SCLM. Repository Manager MVS
(RM/MVS) was architected with a classification scheme that IBM called the Information Model (IM). IBM perceived RM/MVS as the hub of all applications development in the enterprise. The objectives
for RM/MVS were to support the storage of designs from the most popular CASE tools, to support relational and non-relational databases from IBM, and to manage application development in the most
popular IBM programming languages such as COBOL and PL/1. With announced support for a few CASE tool vendors (in which IBM made business investments) and a focus on IBM customers and IBM platforms,
IBM’s repository did not serve the market of non-IBM customers.

IBM also promoted storing models and application artifacts in the repository and only the repository. All files that were used to build the design in the CASE tool were deemed expendable
because of the role of the repository as the central store. This resulted in a loss of information when tools stored more information in their design files than the current repository information
model could manage. When a design was imported into the repository from one tool and exported to another tool, information was lost or distorted to an extent determined by the divergence of
methodology, vocabulary and design semantics between the two tools. A significant burden was placed on the repository developer to keep the IM a superset of the metamodels of all participating
tools and to deliver new versions of the IM each time any participating tool’s metamodel changed. This pressure resulted from IBM’s vision of the repository offering “best of breed”
choices for application development tools and a mix and match approach that relied on the repository to make the necessary design transformations.

In summary, the IBM Repository and AD/Cycle provided:

  • Comprehensive information model comprising CASE, RDBMS, and application development technology, delivered in a phased implementation.
  • Static information model. Extension of the model involved a complex agreement from a number of (often) competing vendors.
  • Mainframe repository engine using a DB2 RDBMS with an Entity Relationship schema.
  • User interface to the repository through terminal emulation screens.
  • Proprietary interfaces between the repository and supported CASE tools.
  • Unpublished information model.
  • Plan and direction to integrate repository objects and source code through associations.
  • Plan for integrating diverse application development platforms inside the IBM realm (e.g. IMS, PL/1, DB2, SQL/DS, SQL/400, OS/2 etc) and other applications.
  • Degree of interoperability between a limited number of CASE tools, both in terms of design exchange and upstream/downstream interchange.
  • Definition of “the 8 blocks of granite” representing key phases in the application development process, and the map of these 8 phases to different regions of the repository’s Information Model.
  • Definition of the word Repository to distinguish IBM’s Repository from the then prevalent data dictionaries. In IBM’s view the repository encompassed metadata not just about the data items,
    but all the elements of application development from programs to test scripts to user interfaces to HELP text. This general nature of the contents of the repository made the term repository
    equivalent to the information resource dictionary of the IRDS.
  • Central storing of models and application artifacts in the repository and only the repository. All files that were used to build the design using the CASE tool were deemed unnecessary because
    of the role of the repository as the central store.

IBM AD/Cycle and the repository were embraced mainly by the commercial marketplace. Over time, the inherent conflict between competition for market share and cooperation inside the repository and
IM arena drove rifts between the CASE tool vendors, who disassociated themselves from the repository and went their own way with their own tool dictionaries. As more and more customers returned
their repository software, IBM softened the AD/Cycle and enterprise mainframe repository message and continued to work on object oriented repository engines for less ambitious workgroup/LAN based
solutions.


The Business Process Reengineering Revolution

Looking at the business in new and different ways.

With the publishing of the book “Reengineering the Corporation” by Michael Hammer and James Champy, and the embarkation of the DoD on a massive Corporate Information Management (CIM) initiative, the
focus of many enterprises shifted to representing and improving their business processes. The rush was to document business processes, analyze them, and formulate strategies for change based on the
results of the analysis.

The result of the Business Process Reengineering (BPR) rush was the development of a number of business function models (primarily using IDEF0 in the DoD) that documented the business processes of
the enterprise to arbitrary levels of detail. Often these analyses and the modeling process were performed by external contractors based on task assignments.

Organizations found that the business reengineering process itself was more complex and dealt with the impact of a variety of other factors, such as the data requirements of business functions,
business locations, business/market cycles and events, and the primary motivations for business functions. The fervor of the revolution was tempered by the reality they saw. Enterprises looked for
ways and means to capture and represent the “other” aspects of the enterprise, not simply the business processes.

As a result of understanding the limitations of seeing enterprises simply as a collection of business processes, there was a renewed interest in the Enterprise as a collection of different
categories of information. The focus of the business process revolution was driven by the needs of the business. The focus on the repository and the application development cycle was driven by the
need to construct repeatable, cost effective, reusable, high quality software application systems based on contemporary technology. The tie-in was obvious: software applications must and do
implement critical functions of the enterprise. Software application development must therefore be driven by the business. At the same time the need to construct repeatable, cost effective,
reusable, high quality software application systems based on contemporary technology has not gone away!

Solving this dilemma demanded a single framework that allows enterprises to see and represent both the business side and the application development side of the problem.
As early as 1987 John Zachman had described the first three columns of the now famous thirty cell “Framework for Information Systems Architecture”. In 1992 Zachman postulated that this
matrix, displaying the answers to the six closed interrogatives (What, How, Where, Who, When, and Why) from the five perspectives of the Planner, Owner, Designer, Builder and Subcontractor, is a
complete representation scheme for the architectures of information systems.

Zachman’s Framework, for the first time, provided a sweeping method for classifying the contents of an enterprise wide repository that would manage both the information important from the business
perspective and the information important from the application development perspective.
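
As an illustration only (the cell names and artifacts below are invented examples, not Zachman's own notation), a repository built around the Framework can file every artifact, business or technical, into one cell of the interrogative-by-perspective matrix:

    INTERROGATIVES = ["What", "How", "Where", "Who", "When", "Why"]
    PERSPECTIVES = ["Planner", "Owner", "Designer", "Builder", "Subcontractor"]

    cells = {}   # (interrogative, perspective) -> list of artifact names

    def classify(artifact, interrogative, perspective):
        """File a repository artifact into one cell of the Framework."""
        assert interrogative in INTERROGATIVES and perspective in PERSPECTIVES
        cells.setdefault((interrogative, perspective), []).append(artifact)

    # Business-side and application-development-side artifacts share one classification scheme.
    classify("Business process model (IDEF0)", "How", "Owner")
    classify("Enterprise data model", "What", "Owner")
    classify("Logical data model", "What", "Designer")
    classify("COBOL program structure chart", "How", "Builder")

    print(len(INTERROGATIVES) * len(PERSPECTIVES), "cells in the matrix")   # 30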

Today, there is heightened interest in the Zachman Framework and enterprise architectures. The challenge has been the implementation of a repository system that embraces the Zachman Framework, the
capturing of both business and application development architectures and the capability to set up and maintain the hundreds of thousands of relationships that comprise the “seams” of the
enterprise.


The Near Past

The early to mid 1990s saw an era of dramatic change in the way software applications were built and deployed. Microsoft emerged as a significant, often perceived as dominant, vendor in the area of
application development platforms and environments for the desktop. The shrinking market share of both OS/2 and the Apple platform made Windows and Windows NT the desktop of choice for personal
computers.

In summary, the early 1990s saw a widespread adoption of tools, but the primary driver was the integration of the analysis and construction process and the consolidation of development environments.
Careful attention was paid to the coverage and integration of rows 4 and 5 of the Zachman Framework, and significant attention to making row 3 more automated and aligned with row 4.

Significant developments that impacted the software development process and environment for the 1990s repository were:

  • Consolidation and standardization in the Object Oriented Analysis and Design Arena
  • Announcement of repository engines by Microsoft and Unisys
  • Emergence of the Internet and the Web
  • Renewed insight on the importance of metadata
  • Enterprise Architecture Planning
  • Enterprise models.


Consolidation and Standardization in the Object Oriented Analysis and Design Arena

The growth of object oriented analysis and design techniques, heralded as vehicles for reuse and standardization, resulted in a number of methodologies and practitioners. CASE tool vendors, like
those of the 1980s, were compelled to offer a variety of methodology options to command acceptable market shares. In November 1997 the Object Management Group (OMG), a consortium of over 800
companies, adopted the Unified Modeling Language (UML) as a consistent language for specifying, visualizing, constructing and documenting the artifacts of software systems, as well as for business
modeling.


Announcement of repository engines by Microsoft and Unisys

During this period, both Microsoft and Unisys announced the availability of object oriented repository engines that could perform repository management functions. The Microsoft engine used the
Microsoft Jet engine (a desktop database) or Microsoft SQL Server as the RDBMS for storing its objects. Initially, the repository did not support versioning and was primarily intended as an
Original Equipment Manufacturer (OEM) engine to be used by CASE and design tool manufacturers. They would use it as a storage mechanism for designs and models in much the same way that the MS
Access Jet engine served as a relational engine that could be embedded in tools and applications as a private store of data. Based on Microsoft’s pronouncements, they were motivated more by the
larger market for embedded repository engines distributed with every copy of a design tool sold than by the small number of enterprise repository licenses that they could sell. With this
motivation, Microsoft started shipping free copies of the repository engine, and the development environment needed to build applications around it, as part of the Visual Basic product.

The target audience for the Microsoft repository engine was tool developers who were already using Microsoft’s development environment and programming languages to build applications. The
interfaces to Microsoft’s repository were proprietary Component Object Model (COM) interfaces, announced by Microsoft in competition with the industry standard Interface Definition
Language (IDL). Microsoft then set up alliances with Platinum Technology to port the engine and its services to other relational database and computing platforms. In the Microsoft Repository, the
use of interfaces allows the separation of the actual repository schema, itself expressed as a set of classes, properties and relationships, from the external views (interfaces) that are visible to
developers of applications that use the repository. The Microsoft Repository was intended as a desktop engine that would not have to face the scalability, multi-user concurrency, complex locking
and protection mechanisms, and security considerations that govern the design of an enterprise repository. The Microsoft repository would support object management services and the storage of
files. The layers that tie the object management services to the file/artifact management services, the multi-user controls, and the security and policy enforcement considerations would need
to be programmed by repository product vendors, who would have to implement these facilities on top of the basic engine.
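
The following sketch illustrates that separation in spirit only; it is plain Python, not Microsoft's actual COM API, and every name in it is invented. The internal repository object is a bag of properties and relationships conforming to the repository schema, while tools see only a published interface (view) over it:

    from typing import Protocol

    class ITableDefinition(Protocol):
        """The external view (interface) published to tool developers."""
        def get_name(self) -> str: ...
        def get_columns(self) -> list: ...

    class RepositoryObject:
        """Internal storage: an object of some repository class with properties and relationships."""
        def __init__(self, class_name, properties, relationships=None):
            self.class_name = class_name
            self.properties = properties
            self.relationships = relationships or []

    class TableDefinitionView:
        """Adapts the internal object to the published interface."""
        def __init__(self, obj):
            self._obj = obj

        def get_name(self):
            return self._obj.properties["name"]

        def get_columns(self):
            return [target for (rel, target) in self._obj.relationships if rel == "has_column"]

    # The tool programs against the interface; the underlying schema can evolve separately.
    customer = RepositoryObject(
        "Table",
        {"name": "CUSTOMER", "owner": "dbo"},
        relationships=[("has_column", "CUSTOMER_ID"), ("has_column", "NAME")],
    )
    view: ITableDefinition = TableDefinitionView(customer)
    print(view.get_name(), view.get_columns())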

Unisys has also been seeking industry alliances with tool builders to embed the Unisys Repository (UREP) engine into their products. The Unisys repository architecture is similar to
Microsoft’s, using an object oriented repository schema metamodel, but Unisys does support the industry standard IDL as an interface definition language. Unisys is also concentrating on the Unix
platforms for propagating its repository engine.

At the same time, Microsoft announced the development of several information models covering its own application development tools, such as Visual Basic and MS Project, as well as relational
database systems such as MS Access and MS SQL Server. This coincided with the convergence of Microsoft software development tools into interactive development environments such as Microsoft Visual
Studio and the consolidation of Microsoft’s relationships with software developers through the Microsoft Developer Network (MSDN).

Microsoft also continued to heavily advocate Object Oriented Development (OOD) and the use of a standard parts library, the Microsoft Foundation Classes (MFC). At the same time, it was
architecting the Windows and Windows NT operating environments to support the COM and Distributed Component Object Model (DCOM) interface environments and formulating an object oriented Microsoft
communications architecture.


The emergence of the Internet and the Web

The emergence of the Internet and the World Wide Web (WWW) presented a major competitor to the 1980s technology of client/server computing over network and dial-up connections as the mechanism for
connecting users to information systems. With the Web has come the opportunity to distribute tremendous amounts of information on request to vast numbers of people. Coupled with this has come the
challenge of security, scalability, performance and presentation styles that are more natural to the way people work with a web browser.


A renewed insight on the importance of metadata

Two significant trends have brought the focus back on the importance of metadata. One is the widespread implementation of Data Warehouses and Data Marts in many enterprises. Implementing a data
warehouse involves understanding the organization of enterprise databases, extracting information and transferring it to the warehouse. As enterprises run into metadata vacuums, they have
realized the need to capture, store and manage their metadata as an ongoing process.

The other trend that has brought the focus back on metadata has been the Year 2000, or Y2K, problem. In the process of assessing the impact of date handling on applications, enterprises have had to
look into their databases to determine the size allocation of date fields and how they are set and used by applications. During the process they have unearthed a significant amount of metadata that
will allow them to address other impact analysis questions as well, such as the impact of changes in legislation.


Enterprise Architecture Planning (EAP)

Enterprises have come to realize that unless the business aspects of the enterprise are captured, represented and then used to drive the planning of information systems, they will continue to build
ineffective information systems. These systems do not collect relevant information, do not perform relevant functions, do not provide the information basis for relevant decisions, and do not
automate the relevant business processes, while still costing significant amounts to develop and maintain.

A new set of methodologies has been formulated for gathering enterprise business related information, analyzing it, developing and formulating information system functions, understanding systems’
data requirements, understanding implementation technology requirements, and developing implementation schedules. Some of these methodologies also involve baselining current application systems and
technology and using this understanding to temper the development schedule for new applications and to fit them around current applications.


The Enterprise Data Model

With the intense adoption of data modeling methods and tools for implementing database design throughout the enterprise, and the need to understand the nature of data at all rows of the Zachman
Framework, enterprises have had to build a common, well understood enterprise data model. This logical data model depicts and standardizes the data items of interest to the enterprise and
represents a common starting point for all database model development. The strategy was that, by using a common starting point, the divergence of specific database models from each other for
common items would be much smaller or non-existent, depending on the degree of change management freedom allowed by the enterprise. A primary example of this approach was the DoD’s design of a
single model with several thousand entities, called the Defense Data Model (DDM).
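
A simple illustration of the common-starting-point strategy (our own toy example, not the DDM itself): project database models copy their shared entities from the enterprise data model, so common items do not drift apart, while project-specific additions remain free.

    from copy import deepcopy

    # Invented enterprise-level entities, for illustration only.
    ENTERPRISE_DATA_MODEL = {
        "CUSTOMER": {"CUSTOMER_ID": "integer", "NAME": "char(60)"},
        "ORDER": {"ORDER_ID": "integer", "CUSTOMER_ID": "integer", "ORDER_DATE": "date"},
    }

    def start_project_model(entity_names):
        """Seed a project-specific model from the enterprise data model."""
        return {name: deepcopy(ENTERPRISE_DATA_MODEL[name]) for name in entity_names}

    billing_model = start_project_model(["CUSTOMER", "ORDER"])
    billing_model["INVOICE"] = {"INVOICE_ID": "integer", "ORDER_ID": "integer"}  # project-specific addition

    # The shared core stays identical to the enterprise definition.
    assert billing_model["CUSTOMER"] == ENTERPRISE_DATA_MODEL["CUSTOMER"]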

© Metadata Management Corporation, Ltd. 1999

Paula Pahos

Paula Pahos is the President of Metadata Management Corp., Ltd. Located near Washington D.C. in Vienna, Virginia, Metadata Management Corporation (MMC) specializes in providing enterprise information management capability to organizations that value the strategic importance of information. MMC recognizes that organizations, in both the public and private sector, are changing. More than ever, this requires information systems departments to be both reactive and proactive in their support of these changes.
