Multi-year, content-rich documentation was deemed unreliable and stopped being used; an immediate effort was required to bring it back into shape. The generally accepted approach to handling metadata is to load it from all sources into a single repository for integration. The premise is to avoid the “spaghetti” of point-to-point synchronizations. Are there any alternatives if that approach is not feasible? What if the metadata is distributed among multiple documents and the company has no plans for a full-scale metadata repository tool? What if, in addition, the documents don’t follow a strict layout? This article describes the approach taken and discusses the results achieved.
Importance of Documentation
Development continuity for a typical data warehousing effort is disrupted by time gaps, geographic and organizational boundaries:
- The teams responsible for data acquisition (ETL) and the teams responsible for data access (BI) have different skill sets and frequently report to different managers/directors.
- Resources within a single project are geographically dispersed; they may reside on different continents with work shifts barely overlapping. Face-to-face communication is limited and is replaced by documentation handoffs.
- Development effort is subdivided into a series of projects spread over time. Every subsequent project incorporates data created by the previous ones and relies heavily on their documentation. Otherwise, it would be forced to reverse-engineer application code, which is time-consuming and should be unacceptable for any reasonable organization.
As a result, any significant data warehouse needs a trail of documentation to survive. Within the development workflow, an analyst produces documents that are used by next-level analysts, and so
on. Only the final deliverables are programs and database structures.
Documentation Flaws
Documentation is always flawed. It is imprecise and incomplete to begin with and easily gets out of date. We can only wish that documentation had the rigor of programs, with syntax checking of individual modules, resolution of their mutual references by a linker, and, finally, multiple levels of testing.
Documents are checked only visually and many typographical and other errors are never caught. The original requirements are usually incomplete; multiple changes occur, but are not always reflected
in the documents. Some information is omitted because it is assumed obvious by the development team. At each workflow step, an analyst interprets information from the earlier documents; the errors
are compounded along the document chain.
Multiple versions are needed to reflect all projects within the development pipeline. The versions should be tagged and archived properly. Documents for cancelled and on-hold projects should be
distinguishable as well. Issues come in two different flavors. Sometimes versions are not tagged properly, and the resulting duplicates cause confusion. The opposite problem exists when only a
single instance of a document is maintained. It is then difficult to tell where each individual artifact is in its life cycle. For example, a data model might have obsolete tables and early drafts of new ones coexisting on equal footing with the tables currently implemented in production.
Case Study Problem
In our case study, a strong Data Administration team developed and maintained content-rich metadata. All relevant information about a given table was combined into a single document. The document layout was well thought through and easy to understand. The documentation had been widely used for years by multiple development teams and business partners.
Time, however, took its toll, and the discrepancies accumulated. The documents stopped being used, because they were deemed unreliable. For example, only 60% of the data model tables were found in
the source-to-target spreadsheets. At first glance, it is difficult to comprehend how any work can be done under such circumstances. However, bear in mind that humans are adaptive while
inventorying programs can’t handle any discrepancy, however minute. (See also Flexible Authoring Tool Challenges below.)
An effort was initiated with the goal of improving the quality of the documentation and maintaining it going forward. In order for the effort to be sustainable over time, it needed to be automated and integrated into the development process.
Distributed Metadata Challenges
The generally accepted way to integrate metadata is to load it from all sources into a single repository. This is the best practice recommendation and the author’s preference as well.
However, a full-scale metadata repository tool was not a feasible option. Results needed to be achieved within a distributed metadata framework. The immediate need was limited to synchronizing just data models and source-to-target maps, but the selected solution had to be extensible.
Let’s assume the metadata is distributed; it is produced by multiple authoring tools and stored within numerous individual documents. These documents in their entirety constitute the system
of record, and each document serves as a source of metadata. With n nodes (sources), the number of links is defined by the formula n(n-1)/2. If each link requires a
synchronization process, the number of processes grows as the square of sources. This is the dangerous “spaghetti” effect of point-to-point synchronization. For the complex environment
of a real-life organization, this number becomes prohibitively large. This potential scalability issue can be subdivided into two: multiple types of documents being integrated, and multiple
individual documents being integrated. Let’s address each one separately.
Multiple Types of Documents Being Integrated
The required number of interfaces between document types is defined by the relationships within a metadata data model. Theoretically, anomalies could be avoided by restricting to only one access
path between any two entities [Reingruber ‘94]. Therefore, a “pure” data model would predict a linear growth of (n-1) relationships. This ideal is never achieved
because of the inevitable redundancies.
Let us consider the metadata example in Figure 1. The organization is interested in keeping the following five documents in sync:
- Project folders with project details and resources
- Data models with tables and columns
- File record layouts
- Source-to-target spreadsheets with transformation rules
- Employee phone list.
Figure 1: Sample Metadata Entity-Relationship Diagram
The square-power growth predicts 10 interfaces; the linear growth predicts 4. The reality is somewhere between the two predictions: 5 relationships cross subject area boundaries. This is still a manageable number and does not prevent expansion to a more comprehensive metadata integration.
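To make the two growth rates concrete, here is a minimal sketch in Python; the function names are illustrative and not part of the original effort:

```python
def point_to_point_links(n: int) -> int:
    """Pairwise links among n metadata sources: n(n-1)/2."""
    return n * (n - 1) // 2

def linear_links(n: int) -> int:
    """Relationships in an idealized model with a single access path: n-1."""
    return n - 1

# For the five document types above: 10 pairwise links versus 4 linear ones.
for n in (5, 10, 50):
    print(n, point_to_point_links(n), linear_links(n))
```

At 50 sources the pairwise count already exceeds a thousand links, which is exactly the “spaghetti” effect described above.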
Multiple Individual Documents Being Integrated
Information is distributed among multiple documents of the same type. Each document needs to be compared to all documents of the opposite type. Here lies the same danger of square-power growth. And
again, under closer consideration, the problem can be mitigated.
I remember a seaside lodge where only cold water ran from the faucets in the morning, so you could not have HOT RUNNING water. But you heated some and made do by constantly switching between RUNNING water and HOT water. It was not optimal, but it was still a workable solution.
Similarly, we provide a workable metadata integration solution by combining two partial methods that complement each other:
- Full Inventory of Documents with Limited Functionality. Information from all documents on each side is extracted and compared in one sweep. The functionality is limited to compiling a report; no corrective action is taken.
- Full Synchronization of Individual Documents. Full metadata synchronization not only uncovers discrepancies but corrects them as well. It involves updating a document on either side of the comparison. However, it can be done only on an individual document basis, not for the full portfolio.
Here is a typical scenario. A data modeler designs a new table for a project. She is only interested in a handful of project-related tables, not the whole data warehouse. She identifies the subset
of documents required by using the Full Inventory method. Before starting her own work, she verifies and reconciles existing documents (from now on using the Full Synchronization method). She
generates the first draft of a table in a data model out of the source-to-target spreadsheet. She then periodically updates the source-to-target spreadsheet based on the data model changes (and
vice versa). Her documents are in sync throughout the development process.
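As a rough illustration of how the two methods divide the work, consider the sketch below; the data structures and function names are hypothetical, not the actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """A simplified stand-in for a data model or a source-to-target spreadsheet."""
    name: str
    tables: set[str] = field(default_factory=set)

def full_inventory(models: list[Doc], maps: list[Doc]) -> dict[str, list[str]]:
    """Full Inventory: compare everything on both sides in one sweep, report only."""
    model_tables = {t for d in models for t in d.tables}
    map_tables = {t for d in maps for t in d.tables}
    return {
        "missing_maps": sorted(model_tables - map_tables),
        "extraneous_maps": sorted(map_tables - model_tables),
    }

def full_sync(model: Doc, stm: Doc) -> None:
    """Full Synchronization for one document pair: here, naively copy missing tables both ways."""
    stm.tables |= model.tables   # draft map entries from the model
    model.tables |= stm.tables   # draft model tables from the map
```

A real corrective pass would, of course, update cell contents and transformation rules rather than just table names, but the division of labor stays the same.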
Flexible Authoring Tool Challenges
A flexible authoring tool, such as Microsoft™ Excel or Word, is typically used to capture source-to-target maps and transformation rules. A standard document layout is usually established from the start. However, enforcement of the guidelines is difficult, and deviations creep in: new spreadsheet columns are added; columns are moved around; headers change. In addition, standards evolve over time, creating a series of accepted templates.
There is a strong (and understandable) desire to squeeze more information on a viewable page. The axiom of one atomic datum per element is often violated. A single spreadsheet cell gets overloaded
with multiple pieces of data.
- The old value is crossed out but kept together with the new value.
- Comments get embedded in the cell text.
- The value and the name of an artifact are combined.
- Multiple values get inserted separated by commas, spaces, or newlines.
- The owner, table, and column names are combined with a dot notation.
For example, in our case, the “Source Column Name” header had over 50 different variations. As another example, there were four permutations for table name placement as depicted in
Figure 2. Variations might arise to accommodate specific project requirements or the personal taste of the document author. Regardless of whether the reasons are well grounded or whimsical, all of this complicates extracting information uniformly across the full portfolio of documents.
Figure 2: Table Name Placement
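As a sketch of the kind of cleanup this requires (the rules below are illustrative assumptions, not the project’s actual code), an overloaded cell can be split into atomic values before comparison:

```python
import re

def split_cell(value: str) -> list[str]:
    """Break an overloaded spreadsheet cell into atomic values.

    Illustrative rules: drop parenthesized comments, split on commas,
    semicolons, and newlines, and keep only the column part of
    owner.table.column notation.
    """
    value = re.sub(r"\(.*?\)", "", value)   # remove embedded comments
    parts = re.split(r"[,;\n]+", value)     # multiple values packed into one cell
    atoms = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        atoms.append(part.split(".")[-1])   # OWNER.TABLE.COLUMN -> COLUMN
    return atoms

print(split_cell("DW.CUSTOMER.CUST_ID, CUST_NM (renamed)"))
# ['CUST_ID', 'CUST_NM']
```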
Algorithm
Document Recognition
In order to handle flexible layouts, document recognition techniques were applied. To keep the programming effort down, the algorithm was simplified as much as possible.
- Artifacts were searched either by name or by position.
- A list of synonyms was used to convert terms to a canonical form.
- False positives were eliminated with a list of exceptions and a list of elimination rules.
Overall flexibility was achieved by extending and tuning the above lists. Rules that were difficult to fit into this scheme were hard-coded. Still, some documents evaded any straightforward logic. The decision was simple: if a variation was used in many documents, handle it programmatically; if only in a few, modify the documents instead.
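A minimal sketch of that recognition scheme might look like the following; the synonym and exception lists are invented examples, while the real lists were tuned against the actual portfolio:

```python
# Canonical header recognition: synonyms map variants to one term,
# exception rules discard known false positives.
SYNONYMS = {
    "src column": "source column name",
    "source col": "source column name",
    "src col nm": "source column name",
    "target tbl": "target table name",
}
CANONICAL = set(SYNONYMS.values())
EXCEPTIONS = {"source column count"}   # looks similar but is not a header we want

def canonical_header(text: str) -> str | None:
    """Return the canonical header term, or None if the text is not a recognized header."""
    key = " ".join(text.lower().split())   # normalize case and whitespace
    if key in EXCEPTIONS:                  # known false positive, discard
        return None
    if key in SYNONYMS:                    # variant spelling, convert to canonical form
        return SYNONYMS[key]
    return key if key in CANONICAL else None
```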
Inventory Metrics
Both Full Inventory and Full Synchronization methods were implemented programmatically, but only results of the Full Inventory are within the article’s scope. The inventory provided
table-level comparison of data models with source-to-target spreadsheets (or more precisely, matched data model Tables with table-level source-to-target Maps). The metrics depicted in Figure 3 were
used to measure progress:
- Extraneous Maps are Maps without Tables.
- Duplicate Maps are measured by the total number of instances (excluding the first one) of the same target table across all spreadsheets. The number is overstated because it includes correctly captured multiple sources for the same target table.
- Exceptions are the Tables that do not need their own Map, for example, “clone” tables that are populated without transformations as a straight copy.
- Missing Maps = (Tables without Maps) – (Exceptions)
- Matched Table count is the number of Tables matched to Maps.
- Match ratio = Matched Tables / (Matched Tables + Missing Maps)
Figure 3: Inventory Metrics
The goal was to reduce the number of missing, extraneous, and duplicate maps and, as a result, improve the match ratio.
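The calculation itself is simple; the following sketch, with invented names rather than the actual program’s internals, shows one way to compute the Figure 3 metrics:

```python
def inventory_metrics(model_tables: set[str], map_tables: list[str],
                      exceptions: set[str]) -> dict[str, float]:
    """Compute the inventory metrics from table names found in models and maps.

    map_tables is a list (not a set) so that duplicates across spreadsheets count.
    """
    distinct_maps = set(map_tables)
    extraneous = distinct_maps - model_tables                 # Maps without Tables
    duplicates = len(map_tables) - len(distinct_maps)         # instances beyond the first
    missing = (model_tables - distinct_maps) - exceptions     # Tables without Maps, less Exceptions
    matched = model_tables & distinct_maps                    # Tables matched to Maps
    match_ratio = len(matched) / (len(matched) + len(missing)) if (matched or missing) else 0.0
    return {
        "extraneous_maps": len(extraneous),
        "duplicate_maps": duplicates,
        "missing_maps": len(missing),
        "matched_tables": len(matched),
        "match_ratio": match_ratio,
    }
```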
Implementation
The success of the effort hinged on the data analysts doing research and correcting documentation. They had to do it in addition to their heavy project load. Of course, they were motivated to
eliminate the perceived inadequacy of their collective results, but that did not diminish their accomplishments, and the author is grateful for their dedication. The inventorying program only facilitated the process by monitoring the progress and identifying the remaining issues. It also provided “immediate gratification” to the analysts: it generated a hyperlinked cross-reference of source-to-target documents and tables within data models. The document became an essential part of the development process by providing an easy search and navigation tool.
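For illustration only, a hyperlinked cross-reference of this kind can be generated with a few lines of Python; the file paths and HTML layout below are assumptions, not the original deliverable:

```python
import html

def write_cross_reference(pairs: dict[str, str], out_path: str = "cross_reference.html") -> None:
    """Write an HTML page linking each table to the spreadsheet that documents it."""
    rows = "\n".join(
        f'<tr><td>{html.escape(table)}</td>'
        f'<td><a href="{html.escape(path)}">{html.escape(path)}</a></td></tr>'
        for table, path in sorted(pairs.items())
    )
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("<table>\n<tr><th>Table</th><th>Source-to-Target Map</th></tr>\n"
                + rows + "\n</table>\n")

write_cross_reference({"CUSTOMER": "maps/customer_stm.xlsx"})
```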
At first, the inventory was run on a weekly basis. All problem areas were attacked: typos were corrected, exception rules were tuned, old versions were pruned, obsolete documents and artifacts were archived, missing documents were recovered, etc. The progress during that initial period was achieved mostly by improving the quality of the existing documentation, not by creating new documents.
In a couple of months, the results started to show:
- About 20% of all tables were designated as not needing maps.
- The number of missing maps was cut by about 50%.
- The number of extraneous maps was cut by about 50%.
- The number of duplicates, as measured, slightly increased, because of the additional documents uncovered.
- The match ratio surpassed 80%.
Apparently, the confidence of developers and business partners in the data modeling documentation was restored, as all the crucial tables were matched. After the initial burst of improvement, progress stabilized with only small incremental changes. The inventory frequency was cut down to once a month.
Subsequently, the effort was transformed into an ongoing metadata administration function. It was assumed that the quality of the existing documentation portfolio was acceptable, and attention was focused on changes only. This prevents future crises while keeping the work effort low.
Further improvements can be achieved by restoring the documentation, for example, by reverse-engineering from ETL programs. This is an expensive proposition. One way to justify the work is by
limiting it to the immediate needs of the new projects. Such an overhead might be acceptable to project sponsors.
Conclusions
The metadata improvement effort targeted immediate needs, but was based on a solid foundation that allowed its extension in the future. A point-to-point metadata synchronization was used to
accommodate the existing distributed metadata framework. Two partial methods complementing each other were developed and implemented programmatically. Document recognition techniques were used to
handle flexible layouts. A new cross-referencing document was added to the analysts’ toolset, making it an essential part of the development process. The initial effort improved the quality
of the documentation, and restored the confidence of developers and business partners in the data modeling deliverables. Subsequently, the effort was transformed into an ongoing metadata
administration function as a low maintenance measure against future crises.
Reference:
[Reingruber ’94] Michael C. Reingruber and William W. Gregory, The Data Modeling Handbook: A Best-Practice Approach to Building Quality Data Models. John Wiley & Sons, 1994.