Data Integration and Sharing – Part One

 

Published in TDAN.com October 2003

We data modelers have a great passion for data. We understand (even enjoy) the process of creating data models. We love to see a business unfold before our eyes in the form of the data model. We
often say, “Data doesn’t flow.” In the late 1970s and early 1980s, I too said, “Data doesn’t flow.” By 1984, I stopped saying this. Of course, data
does not flow within a data model. It resides there. The reality is that data does flow and flows abundantly throughout an organization in the form of data movement. In most organizations, there is
a huge amount of data movement. Reference systems pass data to transactional and analytical systems. Transaction systems pass data to one another. External data is absorbed into transaction and
analytical systems. Warehouses pass data to marts. Warehouses and marts pass data for data mining. Some people even contend that there is more data moving throughout organizations than stored by
them. This article, which is in two parts, addresses different ways in which data can move throughout an organization. It specifically focuses on methods for data sharing. Part I, this month,
discusses messaging methods for data sharing. Part II, next month, will address data movement methods for data sharing.

Data integration and sharing deals with the use of common data by multiple applications or the exchange of data across multiple applications. When multiple applications exchange data, messages
are exchanged among them in some form or other. We can do this in several ways, some more flexible than others and some more powerful than others. Understanding integration models first
requires understanding a few simple, common integration concepts and terms. We define these concepts before going into the data sharing models.


1.0 Common Integration Terms

Messaging refers to a mechanism for getting systems to interact via the passing of messages. A message is a single unit of communication encapsulating some
information. It is the unit of data for sending data and values across applications. Messages may contain factual or status information about application objects or processes, or even instructions
for the recipient. They consist of a header, containing message identification, and a body containing user-defined information. Depending on how the sending and receiving applications behave, they
can be in different states and have different levels of coupling between them.
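
As a minimal sketch of the header-and-body structure described above, a message could be modeled as follows. The field names (message_id, message_type, sent_at) and the helper make_message are illustrative assumptions, not part of any particular messaging product.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class Header:
    """Message identification; the field names are hypothetical."""
    message_id: str
    message_type: str
    sent_at: str


@dataclass(frozen=True)
class Message:
    """A single unit of communication: a header plus a user-defined body."""
    header: Header
    body: dict[str, Any] = field(default_factory=dict)


def make_message(message_type: str, body: dict[str, Any]) -> Message:
    """Build a message with a unique ID and a UTC timestamp in its header."""
    header = Header(
        message_id=str(uuid.uuid4()),
        message_type=message_type,
        sent_at=datetime.now(timezone.utc).isoformat(),
    )
    return Message(header=header, body=body)


# A status message about an application object (an order).
msg = make_message("order.status", {"order_id": 1001, "status": "Certified"})
print(msg.header.message_type, msg.body["status"])
```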

Senders, receivers and messages themselves can have state. State is the description of the current situation of a component or object. It represents knowledge of the object. State
is typically described in memory. State could describe the identity of the object or the progress of an object through different processes, such as an Order being in the state of Certified, In
Process, and later Fulfilled. State could also describe the operations that a transaction can validly require. A stateful application is an application that retains state
information in memory after a service or operation has been performed. A stateless application is an application that flushes state information from memory after a service or
operation has been performed. Some integration methods are stateful, such as request/reply. Others are stateless, such as messaging (see below).
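
The difference between stateless and stateful behavior can be sketched with the Order example above. The transition table is an assumption made for illustration (the text names the states but not the rules between them), and OrderTracker is a hypothetical class.

```python
# Order states named in the text, with an assumed set of valid transitions.
VALID_TRANSITIONS = {
    "Certified": {"In Process"},
    "In Process": {"Fulfilled"},
    "Fulfilled": set(),
}


def advance(current_state: str, new_state: str) -> str:
    """Stateless: the caller supplies the current state; nothing is retained."""
    if new_state not in VALID_TRANSITIONS[current_state]:
        raise ValueError(f"cannot move from {current_state} to {new_state}")
    return new_state


class OrderTracker:
    """Stateful: keeps each order's state in memory between calls."""

    def __init__(self) -> None:
        self._states: dict[int, str] = {}

    def register(self, order_id: int) -> None:
        self._states[order_id] = "Certified"

    def advance(self, order_id: int, new_state: str) -> str:
        # Reuses the stateless rule, but remembers the result.
        self._states[order_id] = advance(self._states[order_id], new_state)
        return self._states[order_id]


tracker = OrderTracker()
tracker.register(1001)
print(tracker.advance(1001, "In Process"))
```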

Coupling has to do with how intimately components relate to other components. Tight coupling is a form of integration in which each component has knowledge of the other component.
Thereby a change in one object will affect the other object. In loose coupling, one component does not have knowledge of the other and thereby is insulated from changes in the
other. Some integration methods use tight coupling, such as a database link. Others use loose coupling, such as messaging (see below).

Synchronization has to do with how extensively components cooperate in ensuring transactions are properly completed. Synchronous communication means that two or more separate
objects or systems partake in a single unit of work. One is dependent on the other. The requestor must wait until the service provider responds. The requestor resumes its execution after it
receives the response. The entire unit of work from end to end is completed or nothing is completed. A typical form of synchronous communication is called request/reply (see below).
Asynchronous communication means that the work is broken into separate parts. One component is not dependent on the other. The requestor does not have to wait for the remote
process to complete, nor for a reply. In fact, the requestor can do other work while waiting for an answer. Two forms of asynchronous communication are queue-based and publish/subscribe (see
below).

Keeping applications in a loosely coupled, asynchronous relationship requires special software. Message-Oriented Middleware (MOM) provides this. MOM is
software that provides a common, reliable way for programs to create, send, receive, and read messages in a distributed environment. MOM ensures fast and reliable asynchronous electronic
communication, guaranteed message delivery, receipt notification, and transaction control. MOM is probably the best way to ensure asynchronous, loose coupling.

The basic unit of work on data is the transaction. A transaction is a logical construct through which applications perform work on shared resources, such as databases. A
transaction is a complete unit of work, though it can involve multiple sub-units of work, which may even be performed on one or more systems. A transaction has four major characteristics, called
its ACID properties, defined as follows:

  • Atomic, which means that the transaction is a complete unit of work and either all data changes are completed or all are reversed.
  • Consistent, which means that the transaction must obey all integrity and business rules.
  • Isolated, which requires that changes made by a transaction to a database must not be visible to other operations until the transaction is complete.
  • Durable, which guarantees that changes made by a transaction are permanent and survive the completion of the transaction, even if there are future system or media failures.
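
Atomicity and consistency can be illustrated with a small sketch using Python's built-in sqlite3 module. The account table, its CHECK rule, and the transfer function are assumptions made for the example. The credit is applied first, so that when the debit violates the rule, the rollback visibly undoes an update that had already succeeded.

```python
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    with conn:
        conn.executemany(
            "INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)]
        )
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move amount from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK rule rejected the debit, so the credit was rolled back too.
        return False


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM account"))


conn = open_demo_db()
print(transfer(conn, "a", "b", 30), balances(conn))
print(transfer(conn, "a", "b", 500), balances(conn))
```

The second transfer fails the integrity rule, and the balances are left exactly as they were before it started.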


2.0 Data Integration and Sharing Models

We will now take the above concepts and form them into the different integration models. Messaging systems can either be:

  • Synchronous or
  • Asynchronous

and can be classified into four interaction models that determine how messages are passed:

  • Conversation
  • Request-Reply
  • Publish-Subscribe
  • Point-to-Point, which includes event messaging.


2.1 Four Messaging Models

The following table summarizes these models:

 


Conversation

In a conversation, applications A and B exchange messages reciprocally and steadily, and state is maintained in both. This type of messaging is inappropriate for business applications and is usually reserved
for lower level system and network functions.


Request-Reply

Request-Reply is used when an application sends a message and waits to receive a corresponding message in return. This is typically done in a remote procedure call. It is the standard synchronous
object-messaging format. In Request-Reply, state is maintained in the calling application only. The called application is only acting as a server and only needs to know how to respond to an
incoming message.
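
A minimal in-process sketch of Request-Reply follows. Threads and queues stand in for two applications and the network between them, which is a simplifying assumption; the function names and the reply format are hypothetical. The point is that the caller blocks until the reply arrives, while the called side only knows how to answer an incoming message.

```python
import queue
import threading


def server(requests: queue.Queue) -> None:
    """Called application: answers each request and keeps no caller state."""
    while True:
        item = requests.get()
        if item is None:  # sentinel value: shut down
            break
        payload, reply_box = item
        reply_box.put({"echo": payload, "status": "ok"})


def request_reply(requests: queue.Queue, payload: dict, timeout: float = 2.0) -> dict:
    """Calling application: send a request, then block until the reply arrives."""
    reply_box: queue.Queue = queue.Queue(maxsize=1)
    requests.put((payload, reply_box))
    return reply_box.get(timeout=timeout)  # raises queue.Empty on timeout


requests: queue.Queue = queue.Queue()
worker = threading.Thread(target=server, args=(requests,), daemon=True)
worker.start()
reply = request_reply(requests, {"op": "create_receivable", "amount": 250})
print(reply["status"])
requests.put(None)
worker.join()
```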


Publish-Subscribe

When multiple applications need to receive the same messages, Publish-Subscribe Messaging can be used. State is kept only in the subscribing application. The publishing application just sends
messages. It is up to the Subscriber to keep track of where it is in the published messages.

Multiple publishers can send messages to a topic, and all subscribers to that topic receive all the messages sent to it. This model is extremely useful when a group of applications wants to notify
each other of a particular occurrence, such as new product data being available.

In Publish-Subscribe Messaging, there may be multiple Senders and multiple Receivers. It is not necessary that the applications act as both, only that the solution supports both. For example,
a reference data owner may want to send out a notification to all subscribers regarding the arrival or availability of a new version of an organizational hierarchy. The Subscribers can use this
information to subscribe to and retrieve the necessary sets of this data.
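
The topic mechanism can be sketched with a very small in-memory broker. This is an illustration only: the Broker class and the topic name are hypothetical, and a real MOM product would add persistence, delivery guarantees, and network transport.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class Broker:
    """Minimal in-memory topic broker (no persistence, no network)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver to every subscriber of the topic; return how many got it."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


# Each subscriber keeps its own record of what it has received.
warehouse_inbox: list = []
billing_inbox: list = []

broker = Broker()
broker.subscribe("org_hierarchy.new_version", warehouse_inbox.append)
broker.subscribe("org_hierarchy.new_version", billing_inbox.append)
delivered = broker.publish("org_hierarchy.new_version", {"version": 12})
print(delivered, warehouse_inbox, billing_inbox)
```

The publisher does not know who the subscribers are or how many exist; it only names the topic.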


Point-To-Point

Point-To-Point Messaging is used when one or more senders need to send messages to a single receiver. However, this may or may not be a one-way relationship. An application in a messaging system
may only send messages, only receive messages, or both send and receive messages. At the same time, another application can also send and/or receive messages. In the simplest case, one application
is the sender of the message, and the other application is the receiver of the message.

There are two basic types of Point-to-Point Messaging:

  • Direct Messaging. A client sends a message directly to another client. This is a somewhat older method of communicating and is not in as much favor today.
  • Event Messaging. This is the more common implementation and is based on the concept of a message queue, a holding area for messages that is typically first-in, first-out. Each message is addressed to a specific
    queue; clients get messages from the queues created to hold their messages. Senders drop messages into a message queue. The receiver takes the message out of the queue. State is kept in neither
    the sender nor the receiver but in the message itself in the queue. This message queue may be stored on the messaging server or may even be stored in a relational database for increased
    reliability.

In Event Messaging, even though there may be multiple Senders of messages, there is only a single Receiver. For example, multiple departments may send messages to a Purchase Department requesting
items to be purchased. These messages are only intended for Purchasing, and other applications will not receive them.
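
The Purchasing example can be sketched with Python's standard queue module standing in for a message queue, which is an assumption made for illustration (the queue here lives in memory, not on a messaging server or in a database). Several departments send; one receiver drains the queue in arrival order.

```python
import queue


def send_purchase_request(q: queue.Queue, department: str, item: str) -> None:
    """Sender: drop a message into the queue addressed to Purchasing."""
    q.put({"from": department, "item": item})


def drain(q: queue.Queue) -> list:
    """Receiver: take every waiting message off the queue, in arrival order."""
    received = []
    while True:
        try:
            received.append(q.get_nowait())
        except queue.Empty:
            return received


purchasing_queue: queue.Queue = queue.Queue()  # FIFO by default
send_purchase_request(purchasing_queue, "Marketing", "brochures")
send_purchase_request(purchasing_queue, "IT", "laptops")
for message in drain(purchasing_queue):
    print(message["from"], "requests", message["item"])
```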


2.2 Synchronization Models

Here is a summary of the two methods for synchronizing applications:

 


2.3 Synchronous Integration

Synchronous communication follows the request-response model. An application initiates a request to another target application. The calling application then blocks its processing in the request
invocation thread while it waits for a response from the called application. The application continues its execution after it receives the response.

Typically, an application uses a remote procedure call to issue synchronous requests to the other application. For example, an application might define a remote procedure call to create an account
receivable item in the database. The calling application invokes this remote function to create an account receivable item and waits until it receives a reply containing the results and response.
This interaction is synchronous because the calling application waits for the called application and continues only when it receives the remote response.

Synchronous interaction is applicable, for instance, where it is critical that multiple database updates are exactly synchronized.

Synchronous interaction leads to tight coupling between applications. One should consider the implications of this when integrating applications within an organization.

Synchronous interaction reveals three potential disadvantages:

  • Application interdependence
  • Middleware dependence
  • Network dependence

Here are several scenarios.

  1. An application needs to access another application to process a request. The calling application itself is designed to handle a large number of concurrent requests. When the calling application
    receives a client request, it synchronously invokes a remote procedure. The thread in the calling application's process is then blocked from further processing until it receives a reply from the called
    application.
  2. The target application may have a more limited load capacity than the calling application. In other words, the called application is capable of handling only a limited number of concurrent
    requests on a limited number of connections. As a result, the target application is unable to process the same number of concurrent requests as the calling application. In a tightly coupled synchronous
    integration, the calling application’s response time and throughput will diminish since it must wait in synchrony for the called application to complete.
  3. An application’s performance may be affected by network failures. Here are two examples. If the target application is unavailable, an application request immediately gets an error return.
    To accommodate this, the application logic needs to include code to retry after such failed requests.
  4. The same problem can happen at a different point in the network. The called application may successfully get and execute the request, but be unable to reply because of a network failure. To
    handle this, the calling application must include timeouts or it will hang indefinitely waiting for a response.
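
The retry and timeout logic called for in the last two scenarios might look like the sketch below. The exception TargetUnavailable and both helper functions are hypothetical names; remote_call stands for whatever synchronous call the application actually makes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimeout


class TargetUnavailable(Exception):
    """Raised by a remote call when the target application cannot be reached."""


def call_with_retry(remote_call, attempts: int = 3, delay: float = 0.1):
    """Invoke remote_call, retrying after TargetUnavailable with a fixed delay."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return remote_call()
        except TargetUnavailable:
            if attempt == attempts:
                raise  # out of retries: report the failure to the caller
            time.sleep(delay)


def call_with_timeout(remote_call, timeout: float):
    """Wait at most timeout seconds for a reply instead of hanging forever."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(remote_call)
    try:
        return future.result(timeout=timeout)  # raises CallTimeout if too slow
    finally:
        # Do not block here; note the abandoned call still runs to completion.
        pool.shutdown(wait=False)
```

A timeout only tells the caller that no reply arrived in time. As the last scenario notes, the called application may still have executed the request, so the caller cannot assume the work was not done.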


2.4 Asynchronous Integration

Asynchronous integration involves message-based communication across applications. An application sends a request to a target application. The sender continues its own processing, while the target
application handles the request independently. The sender does not have to wait for the remote processing to complete nor for a reply to come back. Instead, the thread sends the message and
continues processing client requests.

When using asynchronous communication, applications are said to be loosely coupled. With loose coupling, an application can continue processing without interference from performance or
communication aberrations. The requesting application is not bound to the responding application, nor to the communication delivery mechanism.
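
A sketch of the asynchronous pattern: the sender hands the message to a queue and gets control back at once, while a separate thread plays the target application and processes the message in its own time. The in-memory queue stands in for messaging middleware, and all function names are assumptions made for illustration.

```python
import queue
import threading


def target_application(inbox: queue.Queue, handle) -> None:
    """Target: process messages independently of the sender, one at a time."""
    while True:
        message = inbox.get()
        if message is None:  # sentinel value: shut down
            break
        handle(message)


def send_async(inbox: queue.Queue, message) -> None:
    """Sender: enqueue and return immediately, without waiting for a reply."""
    inbox.put(message)


processed: list = []
inbox: queue.Queue = queue.Queue()
worker = threading.Thread(target=target_application, args=(inbox, processed.append))
worker.start()

for request_id in ("req-1", "req-2", "req-3"):
    send_async(inbox, request_id)
    # The sender is free to continue with other client requests here.

inbox.put(None)
worker.join()
print(processed)
```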


2.5 Comparing Approaches

When designing an application, one needs to decide whether to use synchronous or asynchronous integration with other applications. Both synchronous and asynchronous integration approaches are valid for
application integration, and the choice should be based on the integration requirements and use cases. In deciding, use the following guidelines:

  • Quality of service. The use of queuing or a publish-subscribe mechanism provides a higher quality of service, such as guaranteed delivery, than synchronous communication.
  • Performance. Asynchronous messaging can lead to better performance because a queue buffers messages, so the sender does not have to wait for the receiver to process each one.
  • Transaction integration. A synchronous communication model is more suitable when an application needs to perform secure and transactional access to one or more
    applications synchronously for client request processing. In such cases, an application can afford the overhead of tighter coupling to ensure higher quality request processing and error handling.
  • Programming complexity. The synchronous model requires extra logic to handle error conditions, such as the network failures described above. In this respect, the synchronous model is
    more complex. However, the asynchronous model introduces its own complexities. When a message is dropped off for asynchronous processing, the calling application continues other processing. The
    calling application needs to be built to handle the arrival of responses to prior messages at unpredictable times and in any order. Though the asynchronous model provides more services, it comes with
    the cost of this greater application complexity.
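
One common way to cope with replies arriving in any order is to tag each request with a correlation identifier and look the original request up when its reply comes back. The sketch below assumes such an identifier is carried with every reply; the PendingRequests class is hypothetical.

```python
import uuid


class PendingRequests:
    """Remembers outstanding requests so replies can be matched in any order."""

    def __init__(self) -> None:
        self._pending: dict[str, dict] = {}

    def register(self, request: dict) -> str:
        """Record a request before sending it; return its correlation ID."""
        correlation_id = str(uuid.uuid4())
        self._pending[correlation_id] = request
        return correlation_id

    def resolve(self, correlation_id: str, reply: dict) -> tuple:
        """Match a reply to its request and forget it; KeyError if unknown."""
        request = self._pending.pop(correlation_id)
        return request, reply

    def outstanding(self) -> int:
        return len(self._pending)


pending = PendingRequests()
first = pending.register({"item": "laptops"})
second = pending.register({"item": "brochures"})

# Replies may come back in the opposite order from the requests.
print(pending.resolve(second, {"status": "ordered"}))
print(pending.resolve(first, {"status": "backordered"}))
print(pending.outstanding())
```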

Publisher’s Note … Look for Part Two of this article in the January issue of TDAN.com.


About Tom Haughey

Tom is considered one of the four founding fathers of Information Engineering in America. He is currently President of InfoModel, Inc., a training and consulting company specializing in practical and rapid development methods. His courses on data management, data warehousing, and software development have been delivered to Fortune 100 companies around the world. He has worked on the development of seven different CASE tools, over 40,000 copies of which have been sold to date. He was formerly Chief Technology Officer for the Pepsi Bottling Group and Enterprise Director of Data Warehousing for PepsiCo. He was also formerly Vice President of Technology for Computer Systems Advisers, who market the CASE tools called POSE and SILVERRUN. He wrote his own CASE tool in 1984. He formerly worked for IBM for 17 years as a Senior Project Manager. He is the author of many articles on Data Management, Information Engineering and Data Warehousing.

His book, Designing the Data Warehouse-The Real Deal will be published later this year.
