How to Protect the Data Warehouse Using the Repository

Meta data often is spoken about in terms associated with data warehouses. While it is understood that knowledge workers need meta data to perform their functions, other individuals related to the
data warehouse can also reap great rewards from the use of meta data. This article is about how the data warehouse developer’s can make use of the meta data to “protect” the data warehouse.

Knowledge Workers need to know:

  • the business definitions of the warehouse data
  • where the data was extracted from
  • the logic that was used to transform the data from operational to warehouse format
  • the valid values of the data
  • where the data is located and how they can get access to it
  • and so on…

As a matter of common practice, the meta data to support the data warehouse is stored in some sort of meta data repository. The repository can exist in several warehouse building products or it can
be stored in a centralized repository that brings together disparate meta data (much like a warehouse project of its own). This topic has been written about and presented many times.

However, one item that is almost never mentioned is that warehouse developers need support in the form of the meta data also.

For this article, let us consider that the meta data be stored in a centralized repository. A centralized meta data repository (or a coordinated set of product repositories) can be used to avert
potential disasters by giving the developers an early “heads up” when operational system changes may affect the loading process.

In a perfect world, warehouse developers should be notified when changes are being made to operational systems and data that may impact the warehouse building process. In the real world, this type
of communications does not always exist. That is why “warehouse protection” using the meta data repository is important.

The Set Up

To “protect” the data warehouse, there are five primary types of meta data that must be stored and kept current in the repository:

  • file definitions for operational data sources that feed the warehouse; this includes database and table names, copybooks names, or any name that represents the data definition …
  • the application programs that reference these operational data sources
  • jobs that execute the application programs that reference these operational data sources
  • the relationship between file definitions and the programs that reference them
  • the relationship between the programs and the jobs that reference them

By capturing these types of meta data, warehouse developers can effectively monitor the operational systems for changes that may impact data that feeds the warehouse.

For example:

  • If an operational file definition (see the first bullet above) changes, warehouse developers need to be made aware of changes that impact the specific fields of information that are being
    extracted from that data source.
  • If an application program is added or changed (see the second bullet above) that references these file definitions, warehouse developers need to be made aware of the impact that the logic
    changes have on the file definition.
  • If a job (see the third bullet above) that executes these programs changes, warehouse developers need to be made aware of the changes in dsns, scheduling, dependencies, program references, …

In many of these situations, changes are caught on the back end (the audit, testing or loading process) causing warehouse developers to spend valuable time tracing the source of a data problems.
The three types of meta data (and two type of relations between meta data), when harvested properly and coordinated in conjunction with a change management system, can be used to pro-actively
“protect” the data warehouse.

To “protect” the warehouse, there are two change control processes that must be in place:

  • a component check-in/check-out function for test and production
  • a method to produce change control reports that identify when components are being retrieved (for changes) from production, moved into the test environment, and moved to production

These two change control processes will be used to identify when a “warehouse protected component” is checked in and checked out of the development, test, and production system environments.

The Look

Warehouse developers should be able to identify all of their warehouse source data sets (table, copybook, …) by their data definition names (table names, copybook names, …). The meta data
repository should be able to identify the programs that make use of these file definitions. The repository should also be able to identify the jobs that execute the programs. To protect the data
warehouse, tag these items in the repository as “warehouse protected”.

There are three primary “action” points in a change control process that are critical to the warehouse developer:

  • the retrieval of source code or file definitions from production for (review of) changes
  • the movement of the operational source code or file definitions into a testing region
  • the movement of the operational source code or file definitions into a production region

When these three actions are taken on the system components mentioned earlier, the warehouse development team need to be informed immediately. It is important to recognize that the earlier in the
change control process that the warehouse developers are made aware of changes that affect the warehouse source data, the more likely the developers will be prepared to accept and transform the
operational data when the warehouse load process is (re) initiated.

Allow me to take a closer look at these actions:

The Retrieval of Source Code from Production

  • When an application system developer retrieves a component from product, it is likely that the developer is considering making a change to that component. If that component is labeled as being
    “protected” for the warehouse, this is the BEST TIME to notify the warehouse developers that a “protected” component is being reviewed for a change.
  • The warehouse development team may (or may not) try to influence the change in business requirements or the specific change to the component. This depends on the ability of the warehouse
    development team to drive business requirements. In some shops, data warehousing is considered another (expensive but rewarding) application and the development team is made up of less influential
    (but extremely knowledgeable) former application developers and data users. In other shops, warehouse developers have a strong ability to influence business requirement.
  • The influence IS NOT the most important factor. The early knowledge of potential system changes impacting warehouse source data IS the most important factor. If warehouse developers are
    notified that a warehouse protected component is being reviewed for change, the warehouse developer can contact the individual responsible for retrieving the component, and the communications
    between application developers and warehouse developers is assured.

The Movement of Source Code to Test

  • If the warehouse developers can not identify when warehouse protected components are retrieved from production, or components are checked out prior to the ability to tag components as
    “protected”, it is important to identify when warehouse protected components are moved into test. This is the SECOND BEST (not ideal) time to identify a
    change in an application.
  • The time that a system component resides in a test environment varies from IS shop to IS shop, from application to application, from specific change to specific change. Therefore, a component
    on the “fast track” may make a brief visit to the test region and move straight to production with little or not time for communications between the application and warehouse developers.
  • However, if the warehouse developers are made aware that a “protected” component is moved into test, they may still have a better chance to learn about and prepare for the change.

The Movement of Source Code to Production

  • This is the WORST possible time to be notified that a warehouse protected component HAS changed, however, it is better than not being notified at all.
    If a warehouse developer can be notified when “protected” components are moved to production, they may (or may not) have the ability to learn about and prepare for the change prior to the next
    warehouse refresh.
  • In many situations, being notified of a change at this time will delay the refresh of the data warehouse. This can become a painful experience and one that should be avoided if possible.

The Delivery

If the repository administrator can identify warehouse “protected” system components and the change control environment is capable of reporting check-ins and check-outs of system components, all
that needs to be developed is the ability to match components that have changed to the components tagged as “protected” and to report the matches to the warehouse developers.

This can be done by:

  • first, scanning change control reports or a change control master file for component additions and changes
  • second, comparing the component name to the names of the warehouse protected components in the repository
  • third, reporting the matches to the warehouse development manager or team

Summary

This type of approach is easily completed if the IS shop has implemented a meta data repository and has a sound means of change control. Both of these tasks require commitment and time to complete.
All shops are not this fortunate and many have just begun to identify the need for a meta data repository and a strict change control environment. BUT… Many of the less fortunate shops also have
data warehouses in production or in the analysis or building phase.

To the shops that have repositories and change control in place, this article has identified additional uses for meta data in the data warehouse environment.

To the less fortunate shops, perhaps this article can raise a warning signal as to where warehouse development can be broken down due to a lack of communications.

No matter who you are, if you are developing a data warehouse, meta data can play a large role in “protecting” what is often considered to be “the-mother-of-all” IS development efforts. … And
certainly worth consideration.

Share

submit to reddit

About Robert S. Seiner

Robert S. (Bob) Seiner is the publisher of The Data Administration Newsletter (TDAN.com) – and has been since it was introduced in 1997 – providing valuable content for people that work in Information & Data Management and related fields. TDAN.com is known for its timely and relevant articles, columns and features from thought-leaders and practitioners. Seiner and TDAN.com were recognized by DAMA International for significant and demonstrable contributions to Information and Data Resource Management industries. Seiner is the President and Principal of KIK Consulting & Educational Services, a data and information management consultancy that he started in 2002, providing practical and cost-effective solutions in the disciplines of data governance, data stewardship, metadata management and data strategy. Seiner is a recognized industry thought-leader, has consulted with and educated many prominent organizations nationally and globally, and is known for his unique approach to implementing data governance. His book “Non-Invasive Data Governance: The Path of Least Resistance and Greatest Success” was published in late 2014. Seiner speaks often at the industry’s leading conferences and provides a monthly webinar series titled “Real-World Data Governance” with DATAVERSITY.

Top