Data Integration and Sharing, Part Two

Published in TDAN.com January 2004

Publisher’s Note … Click here to read Part One of this article.

3.0 Data Movement Methods for Data Sharing and Integration

Now let us discuss the actual methods or manners in which data is moved. These are specific implementation methods for implementing the data movement strategies we discussed above.


3.1 Data Layer

The standard client-server model divides architecture into three layers of Data, Function and Presentation. A Data Layer provides programming access to any database from any application. A Data
Layer consists of two main parts:

  • Data Integration Layer, which is DBMS independent and accepts requests for data; and
  • Data Storage Layer, which converts the request into a DBMS specific request.
  • The Data Layer is a client-developed piece of middleware that is:
  • Application independent, in that it can be called by any application;
  • DBMS independent, in that it can work with any supported DBMS
  • SQL independent, in that it retrieves data based on messages passed to it;
  • Database neutral, in that it can access any database in its catalog.

The Data Layer should be used to gain access to data required by an application, without that application becoming DBMS dependent. It should not be used to access data in other applications, to
access data from a reporting environment, nor to access reference data stored in other applications.


3.2 Database Link

A database link is a schema object in the local database that enables you to access objects on a remote database. For example, Oracle supports this. A database link connection allows local users to
access data on a remote database. A database link can be used refer to tables and views on the remote database. For example, you can query a remote table or view with the SELECT statement. The
remote database need not be an Oracle system.

3.2.1 One Way

A database link is a one-way link in this sense:

  • A client connected to local database A can use a link stored in database A to access information in remote database B, but users connected to database B cannot use the same link to access data
    in database A.

If local users on database B want to access data on database A, then they must define a link that is stored in database B.

3.2.2 Why Use Database Links?

The great advantage of database links is that they allow users to access another user’s objects in a remote database so that they are bounded by the privilege set of the object’s owner. In other
words, a local user can access a link to a remote database without having to be a user on the remote database.

For example, assume that employees submit expense reports to Accounts Payable (AP), and further suppose that a user using an AP application needs to retrieve information about employees from the HQ
database. The AP users should be able to connect to the HQ database and execute a stored procedure in the remote HQ database that retrieves the desired information. The AP users should not need to
be HQ database users to do their jobs; they should only be able to access HQ information in a controlled way as limited by the procedure.

Database links allow you to grant to local users a limited access on remote databases. By using current user links, you can create centrally managed global users whose password information is
hidden from both administrators and non-administrators. For example, A/P users can access the HQ database as SMYTH, but unlike fixed user links, Smyth’s credentials are not stored where
database users can see them.

3.2.3 General Guidelines on DB Links

  1. Do not create PUBLIC database links (it’s a potential security issue).
  2. A synonym can be created for the database link to simplify its use.

3.3 Extract-Transformation-Load (ETL)

Extract-Transformation-Load (ETL) is the process of gathering data from source systems, enhancing and enriching that data, and then loading it into a target system. Generally speaking, ETL is the
mainstay of data warehousing and reporting systems, though it is by no means restricted to them. ETL is ideally suited to data movement that has the following characteristics:

  • Large amounts of data are involved. ETL is often used to move millions of rows a day to the reporting environment. Messaging methods would be better if small amounts of data are involved.
  • Some latency is acceptable. With ETL, data is typically transformed at selected intervals. These intervals can be a short at minutes, and as long as days, weeks, and even periods. Many data
    warehouses, such as marketing warehouses, are still loaded monthly; others, closer to near-real time, are loaded at short minute intervals, such as every 15 minutes – all using ETL. ETL,
    however, is not the only way to achieve this. The replication services within DBMS’s are another way.
  • Data involves periodic snapshots. ETL is more appropriate where data is captured at specific points and fixed, rather than by propagation of the event or transaction itself.
  • Significant consolidation and integration of sources is required. In reporting environments, often the same target data element originates in many source data elements, often with incompatible
    definitions and domains. For example, an insurance policy might have many active statuses in operational systems that are consolidated into a single active status in the data warehouse. Where data
    is merely moved, different data movement methods, such as replication, messaging and ETL, could equally be used.
  • Transformations are complex. Where significant transformations must be done to the data, methods that merely move data as opposed to transform it, may not suffice. For example, we offer an
    over-the-counter drug product in multiple sizes but for the purpose of sales analysis and marketing we want all quantities converted to some common volume equivalent, like a 100-gram equivalent.
    ETL is well suited for such transformations.


3.4 Replication

Replication is the storing of a separate copy of data. Replication may use synchronous or asynchronous technology, though asynchronous is more common. Data can be complete or partial. A complete
replica copies the entire source. Partial replicas can be horizontal or vertical. A horizontal replica is a copy of some rows, such as all customers in New Jersey only. A vertical replica is a copy
of some columns, such as Customer ID, Customer Name, Customer Address and Customer Phone Number only. Replication has these advantages:

  • Reliability. If one copy fails, there is always another copy that can be obtained.
  • Performance. Local query processing can occur.
  • Simplification of integrity. Replicas are usually refreshed at planned intervals. These intervals are usually chosen when production requirements are more relaxed, such as overnight.
  • Decoupling. Transactions do not usually require coordination across a network.
  • Reduced network traffic at peak times. Updating replicas is usually performed off-hours, thereby reducing network traffic when update processing is at its peak.

Replication has two main disadvantages.

  • Storage. Each replica will consume the same amount of duplicate space as the original.
  • Update cost. If a replica is updated, the update will have to be propagated to all copies.

Replication should be considered when:

  • Performance is an issue with requesting set processing over a network.
  • Availability is an issue if the network were to go down.
  • Data access requirements are complex, such as doing large, complex joins over a network.
  • Consistent data is required for analysis. The same request for the same data should yield the same results.


3.5 Global Views

Global views are a method implemented in some organizations. A view is a virtual table, constructed dynamically as the data is requested. The real data is stored in base tables. The view provides a
dynamic in that its contents are selected each time the view is used. A view may combine some or all data from several base tables. A view only returns the specific columns contained in the view.
It can thereby hide base data, which the developer does not want the user to see. Global views are predefined views that a client organization has defined for either security or performance
purposes. Views have several uses and advantages:

  • Simplify query preparation. One does not have to know the detailed relationships in the base data.
  • Help with data security. Only the data in the view is returned.
  • Improve developer productivity. Predefined view should relieve the developer of having to repetitively code complex joins.
  • Returns most current data. Unlike stored aggregates, which record specific derived results, the view performs the retrieval and aggregation each time the view is used.
  • Use little storage space. Views are database object definition and contain no real data.

Views have several disadvantages:

  • Consume processing time. Each time the view is requested it is re-executed.
  • Updateability. Views may or may not be updateable, depending on the DBMS.


3.6 Guidelines for Application Development

3.6.1 Transactional Systems

Transactional Systems should be focused on data and processing. They should deal with operational requirements of the business process being supported and should not divert themselves into
reporting or maintenance of reference data.

Reporting should be minimized by limiting it to operational reports. Application reporting, also called Intra-Application Reporting, has the following additional characteristics

  • Involving small volume of data and in result set returned. Transactional application should not process or return all the data in a given table or set of tables.
  • Usually described as several tables, few rows
  • Limited to reports required to run the business
  • The audience is clerical or operational management.
  • Reporting does not require data from multiple sources
  • Requires a limited time range in reporting, usually days to weeks
  • Contains or requires a limited amount of history
  • Supports decisions within a single business function or application
  • Supports structured to semi-structured decisions
  • Reporting tends to be preplanned, parameterized, or template reporting only
  • Requires a limited volume of data
  • Uses data exclusively from within the application.

Transactional applications should also be autonomous in its usage of data. If other applications need the data, then the data should to be ‘pushed out’ or published for their usage.

Their data retention requirements are different from those of reporting applications. They are as follows:

  • Short time horizon – recent history
  • Follow aggressive archive/delete to optimize performance
  • Longer-term history and reporting needs handled through a reporting framework
  • “Archiving” is thereby partly handled in reporting framework

In terms of data sources, transactional applications obtain data from other sources in the following ways:

  • Access other data stores directly only through a publish/subscribe approach
  • Synchronous data sharing under two circumstances:
    If performance allows it and,
    If database updates require updates to tables in multiple databases

If synchronous sharing is required, the following methods should be evaluated:

  • Using an RPC, which is loosely coupled, as in a masterfile or reference data service.
  • Using the Data layer, which is loosely coupled, as with the Data Integration Layer/Data Storage Layer (DIL/DSL), which we saw earlier under Data Layer.
  • Using DB links and global views, which are tightly coupled.

Asynchronous is acceptable or needed:

  • If you have to copy data into the application, then copy only from a pub/sub authorized source.
  • Subscribe to a pub/sub will need adapter logic to retrieve the data
  • There may be workflow implications in using message queues. The application will have to decide how to process asynchronous responses.
  • Using snapshot. A snapshot is a read-only replica of selected portions of a database. It is a tightly couple method of sharing.

When transactional systems need to provide data to other sources:

  • Push the data out to a common store or a reporting staging area.
  • Use a queue or file method to push deltas out.

3.6.2 Analytical Reporting Applications

A Reference Data Application should not be part of a reporting framework but may feed the reporting framework. A Reference Data Application has the following characteristics:

  • Low volume data entry
  • Preferably one system for data entry, there may be multiple sources but the publishing point is controlled at one point.
  • Business stewards need to be in place, such as for masterfile or reference data

Analytical Reporting has the following characteristics:

  • Uses common data subjects
  • Supports upper management’s decision making
  • Supports strategic to tactical reporting
  • Requires cross-functional data or integration of data
  • Is fed from multiple data sources
  • Requires history (usually significantly well beyond operational data retention requirements)
  • Is essentially read-only and is not updateable. Changes to dimensions are quite common and can be handled in many ways. Histories of selected dimensions are often required, based on business
    needs.
  • Supports semi-structured to unstructured decisions
  • Reporting can be ad hoc, as well as preplanned, parameterized, or template reporting. In addition, sophisticated data mining can be performed, often on large amounts of data.


Summary

As we said at the beginning, data does flow within an organization. It flows frequently and plentifully throughout most organizations. We in data management, with our focus on integrated data,
often lose sight of the realities of dealing with the large and complex issues in managing the vast amounts of data that drive the modern organization. Unfortunately, vast amounts of disparate data
are often the byproduct of this phenomenon. Sound data management principles, implemented with sound messaging practices, can work together over time to provide order to this state.

Share

submit to reddit

About Tom Haughey

Tom is considered one of the four founding fathers of Information Engineering in America.  He is currently President of InfoModel, Inc, training and consulting company specializing in practical and rapid development methods. His courses on data management, data warehousing, and software development have been delivered to Fortune 100 companies around the world. He has worked on the development of seven different CASE tools, over 40,000 copies of which have been sold to date. He was formerly Chief Technology Officer for the Pepsi Bottling Group and Enterprise Director of Data Warehousing for Pepsico. He was also formerly Vice President of Technology for Computer Systems Advisers, who market the CASE tools called POSE and SILVERRUN. He wrote his own CASE tool in 1984. He formerly worked for IBM for 17 years as a Senior Project Manager. He is an author of many articles on Data Management, Information Engineering and Data Warehousing.

His book, Designing the Data Warehouse-The Real Deal will be published later this year.

Top