Meta data often is spoken about in terms associated with data warehouses. While it is understood that knowledge workers need meta data to perform their functions, other individuals related to the
data warehouse can also reap great rewards from the use of meta data. This article is about how the data warehouse developer’s can make use of the meta data to “protect” the data warehouse.
Knowledge Workers need to know:
- the business definitions of the warehouse data
- where the data was extracted from
- the logic that was used to transform the data from operational to warehouse format
- the valid values of the data
- where the data is located and how they can get access to it
- and so on…
As a matter of common practice, the meta data to support the data warehouse is stored in some sort of meta data repository. The repository can exist in several warehouse building products or it can
be stored in a centralized repository that brings together disparate meta data (much like a warehouse project of its own). This topic has been written about and presented many times.
However, one item that is almost never mentioned is that warehouse developers need support in the form of the meta data also.
For this article, let us consider that the meta data be stored in a centralized repository. A centralized meta data repository (or a coordinated set of product repositories) can be used to avert
potential disasters by giving the developers an early “heads up” when operational system changes may affect the loading process.
In a perfect world, warehouse developers should be notified when changes are being made to operational systems and data that may impact the warehouse building process. In the real world, this type
of communications does not always exist. That is why “warehouse protection” using the meta data repository is important.
The Set Up
To “protect” the data warehouse, there are five primary types of meta data that must be stored and kept current in the repository:
- file definitions for operational data sources that feed the warehouse; this includes database and table names, copybooks names, or any name that represents the data definition …
- the application programs that reference these operational data sources
- jobs that execute the application programs that reference these operational data sources
- the relationship between file definitions and the programs that reference them
- the relationship between the programs and the jobs that reference them
By capturing these types of meta data, warehouse developers can effectively monitor the operational systems for changes that may impact data that feeds the warehouse.
- If an operational file definition (see the first bullet above) changes, warehouse developers need to be made aware of changes that impact the specific fields of information that are being
extracted from that data source.
- If an application program is added or changed (see the second bullet above) that references these file definitions, warehouse developers need to be made aware of the impact that the logic
changes have on the file definition.
- If a job (see the third bullet above) that executes these programs changes, warehouse developers need to be made aware of the changes in dsns, scheduling, dependencies, program references, …
In many of these situations, changes are caught on the back end (the audit, testing or loading process) causing warehouse developers to spend valuable time tracing the source of a data problems.
The three types of meta data (and two type of relations between meta data), when harvested properly and coordinated in conjunction with a change management system, can be used to pro-actively
“protect” the data warehouse.
To “protect” the warehouse, there are two change control processes that must be in place:
- a component check-in/check-out function for test and production
- a method to produce change control reports that identify when components are being retrieved (for changes) from production, moved into the test environment, and moved to production
These two change control processes will be used to identify when a “warehouse protected component” is checked in and checked out of the development, test, and production system environments.
Warehouse developers should be able to identify all of their warehouse source data sets (table, copybook, …) by their data definition names (table names, copybook names, …). The meta data
repository should be able to identify the programs that make use of these file definitions. The repository should also be able to identify the jobs that execute the programs. To protect the data
warehouse, tag these items in the repository as “warehouse protected”.
There are three primary “action” points in a change control process that are critical to the warehouse developer:
- the retrieval of source code or file definitions from production for (review of) changes
- the movement of the operational source code or file definitions into a testing region
- the movement of the operational source code or file definitions into a production region
When these three actions are taken on the system components mentioned earlier, the warehouse development team need to be informed immediately. It is important to recognize that the earlier in the
change control process that the warehouse developers are made aware of changes that affect the warehouse source data, the more likely the developers will be prepared to accept and transform the
operational data when the warehouse load process is (re) initiated.
Allow me to take a closer look at these actions:
The Retrieval of Source Code from Production
- When an application system developer retrieves a component from product, it is likely that the developer is considering making a change to that component. If that component is labeled as being
“protected” for the warehouse, this is the BEST TIME to notify the warehouse developers that a “protected” component is being reviewed for a change.
- The warehouse development team may (or may not) try to influence the change in business requirements or the specific change to the component. This depends on the ability of the warehouse
development team to drive business requirements. In some shops, data warehousing is considered another (expensive but rewarding) application and the development team is made up of less influential
(but extremely knowledgeable) former application developers and data users. In other shops, warehouse developers have a strong ability to influence business requirement.
- The influence IS NOT the most important factor. The early knowledge of potential system changes impacting warehouse source data IS the most important factor. If warehouse developers are
notified that a warehouse protected component is being reviewed for change, the warehouse developer can contact the individual responsible for retrieving the component, and the communications
between application developers and warehouse developers is assured.
The Movement of Source Code to Test
- If the warehouse developers can not identify when warehouse protected components are retrieved from production, or components are checked out prior to the ability to tag components as
“protected”, it is important to identify when warehouse protected components are moved into test. This is the SECOND BEST (not ideal) time to identify a
change in an application.
- The time that a system component resides in a test environment varies from IS shop to IS shop, from application to application, from specific change to specific change. Therefore, a component
on the “fast track” may make a brief visit to the test region and move straight to production with little or not time for communications between the application and warehouse developers.
- However, if the warehouse developers are made aware that a “protected” component is moved into test, they may still have a better chance to learn about and prepare for the change.
The Movement of Source Code to Production
- This is the WORST possible time to be notified that a warehouse protected component HAS changed, however, it is better than not being notified at all.
If a warehouse developer can be notified when “protected” components are moved to production, they may (or may not) have the ability to learn about and prepare for the change prior to the next
- In many situations, being notified of a change at this time will delay the refresh of the data warehouse. This can become a painful experience and one that should be avoided if possible.
If the repository administrator can identify warehouse “protected” system components and the change control environment is capable of reporting check-ins and check-outs of system components, all
that needs to be developed is the ability to match components that have changed to the components tagged as “protected” and to report the matches to the warehouse developers.
This can be done by:
- first, scanning change control reports or a change control master file for component additions and changes
- second, comparing the component name to the names of the warehouse protected components in the repository
- third, reporting the matches to the warehouse development manager or team
This type of approach is easily completed if the IS shop has implemented a meta data repository and has a sound means of change control. Both of these tasks require commitment and time to complete.
All shops are not this fortunate and many have just begun to identify the need for a meta data repository and a strict change control environment. BUT… Many of the less fortunate shops also have
data warehouses in production or in the analysis or building phase.
To the shops that have repositories and change control in place, this article has identified additional uses for meta data in the data warehouse environment.
To the less fortunate shops, perhaps this article can raise a warning signal as to where warehouse development can be broken down due to a lack of communications.
No matter who you are, if you are developing a data warehouse, meta data can play a large role in “protecting” what is often considered to be “the-mother-of-all” IS development efforts. … And
certainly worth consideration.