A New Way of Thinking – October 2003

Published in TDAN.com October 2003

Over the past fifteen years or so, there have been a number of attempts at making better use of idle computers by farming out application functionality to underused computing resources from some
designated pool. This used to be referred to as “cycle-stealing,” and it is a way to increase computing capacity without a significant capital investment. In recent years, other ideas
and concepts have been incorporated, mostly abstracted as general resource sharing, and a number of researchers working on these ideas have come together to formalize the process, which is now
referred to as “grid computing.”

In general, grid computing is about collaboration and sharing. A grid is composed of a collection of resources and a set of protocols for sharing those resources, much like the electrical power
grids that provide power services to many distributed clients. The kinds of resources that compose a grid might include computers, disks, memory, services, or even data. The protocols control the
distribution of functionality, security, authentication, management of distributed processes, and policies regarding restriction of use, such as thresholds of local activity above which the
resource is not available for sharing. The people who interact or share resources are together referred to as a virtual organization, and a virtual organization may actually incorporate
individuals and resources from more than one administrative authority. For example, two scientific research groups at different universities working on similar problems may share both data and
resources to solve problems relevant to both groups.

Grid applications are typically scientific ones that require large-scale processing power to run or large amounts of data as input. However, if we imagine a collection of
capabilities transparently provided by a virtual organization, we can see that grids have the potential to develop into an underlying fabric for providing value-added services seamlessly
throughout an enterprise (or across more than one). Incorporating functionality as services within the grid augments enterprise architecture, because it provides a formal method for reusing
capabilities that up to this point might have been acquired multiple times independently by different vertical organizations within the enterprise, which promises some cost
benefit.

As a simple example, more than one group within a company might have a need for an extract, transform, load (ETL) tool, and vendors are all too happy to sell multiple licenses to these
different groups. On the other hand, it is not likely that the same tool is being used constantly by processes in each group, and a single license would probably be sufficient. In the grid
environment, all the groups needing that ETL tool can participate in a virtual organization, and one instance of that same tool can be made available to any of the members. By purchasing a single
instance, the company saves money by not buying multiple copies, the costs of maintenance and management are reduced, and the tool is more likely to have a high utilization, all of which result in
a larger return on investment (ROI).

I was recently working on a report on grid computing, and during my research I found a lot of material discussing the benefits of grid computing with respect to sharing information in what is
called a “data grid.” Data grids are more likely to focus on the distribution and/or sharing of data for scientific applications, but conceptually there is no reason why the grid
paradigm cannot be extended to the information management world. Apparently this is what Oracle has been thinking, considering its recent announcement of the Oracle Enterprise Grid computing
initiative as part of its Oracle 10g product suite.

Contrast this with the concepts of web services built to provide distributed access to databases. These services can be constructed to allow knowledge workers to peruse metadata via the browser
interface, request data from a database, and have the service provide the data and deliver it directly to the client. This also gives clients in different locations distributed access to
data.

I can imagine the next step in combining these two ideas: the virtual distributed database. This might be a system that, via web services, provides a traditional client interface to a collection of
independent databases that have been made available as shared resources within a virtual organization implemented using grid technology. In this paradigm, a knowledge worker can sit down in front
of a browser and interact with the virtual database as if it were a single resource. Having this capability would reduce the reliance on data replication (whether that is managed replication or
whether it is ad hoc, i.e., “making my own copy”) and reduce storage requirements. It might mitigate the need for certain ETL functions, or even reduce the necessity for some kinds of
data marts for analytical purposes.

I am certain that some readers are champing at the bit to claim that this is all doable today; new enterprise metadata tools are helpful in producing the enterprise view of available information,
and we can certainly build services as information provision middleware objects. While I am confident that this environment is buildable, I believe that there are some serious issues that need to
be understood before this virtual distributed database becomes a reality. These include (although this list is not meant to be all-inclusive):

  • Access
  • Security
  • Functionality
  • Performance

My use of the word access refers to the mechanisms through which information is collected and repackaged. For this database to be successful, there must be a mechanism to gain access to any data
that exists in a structured format that is to be made available. Fortunately, we can use Open Database Connectivity (ODBC) or its Java counterpart (JDBC) to allow applications to
access data directly from program code. But this will only work as long as there is an existing driver for the targeted database. Ultimately, there must be some kind of adapter and corresponding
API that will allow the service to access the data if it is to be incorporated into the pool of shared data resources.
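
To make the adapter idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the names DataSourceAdapter, SqliteAdapter, and build_demo_adapter are hypothetical, and Python's built-in sqlite3 module stands in for whatever ODBC or JDBC driver a real deployment would use. The point is that the service codes against one small interface, and each database that joins the shared pool supplies an implementation of it.

```python
import sqlite3
from typing import Any, Protocol


class DataSourceAdapter(Protocol):
    """Hypothetical interface that every shared data resource would expose."""

    def query(self, sql: str, params: tuple = ()) -> list[tuple[Any, ...]]:
        ...


class SqliteAdapter:
    """Adapter over a DB-API connection; sqlite3 stands in for an ODBC/JDBC driver."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._connection = connection

    def query(self, sql: str, params: tuple = ()) -> list[tuple[Any, ...]]:
        # Parameters are bound by the driver rather than pasted into the SQL text.
        cursor = self._connection.execute(sql, params)
        return cursor.fetchall()


def build_demo_adapter() -> SqliteAdapter:
    """Create an in-memory database with made-up rows, for demonstration only."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    connection.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )
    return SqliteAdapter(connection)


if __name__ == "__main__":
    adapter = build_demo_adapter()
    # prints [('Grace',)]
    print(adapter.query("SELECT name FROM customers WHERE id = ?", (2,)))
```

A database without an existing driver would need its own class with the same query method, which is exactly the "adapter and corresponding API" requirement described above.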

Security is particularly relevant in a distributed environment. For the most part, the database server model provides certain levels of security and authentication at either a coarse level (via
usernames and passwords) or a granular level (in those database systems with table-, record-, or column-level access control). In the distributed environment, though, we are using proxies to
access the data, and therefore the access rights are those granted to the proxy, not to the ultimate information client. Therefore, the service must also be able to provide a level of security and
authentication, and in the distributed environment the management of these access rights must also be viewed as a service that is part of the set of shared resources. Again, we are fortunate in
that these capabilities are already available as part of the grid computing protocols.
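
The proxy problem can be illustrated with a small sketch, again in Python and again under assumptions of my own: the CLIENT_GRANTS table, the AccessDenied exception, and the SecuredDataService class are all hypothetical, and a real grid would delegate this check to its own security services rather than to an in-process dictionary. What the sketch shows is that the service holds the database connection (the proxy's rights) but decides each request according to the rights of the ultimate information client.

```python
import sqlite3

# Hypothetical grants: client identity -> tables that client may read.
CLIENT_GRANTS = {
    "analyst": {"orders"},
    "auditor": {"orders", "salaries"},
}


class AccessDenied(Exception):
    """Raised when the requesting client lacks rights to the requested table."""


class SecuredDataService:
    """Proxy that checks the end client's rights before using its own connection."""

    def __init__(self, connection, grants):
        self._connection = connection  # carries the proxy's database rights
        self._grants = grants

    def read_table(self, client_id, table):
        allowed = self._grants.get(client_id, set())
        if table not in allowed:
            raise AccessDenied(f"{client_id!r} may not read {table!r}")
        # The table name was matched against the grant list above, so it is
        # one of the known names and can be placed in the statement.
        return self._connection.execute(f"SELECT * FROM {table}").fetchall()


def build_demo_service():
    """Create an in-memory database with made-up rows, for demonstration only."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    connection.execute("CREATE TABLE salaries (name TEXT, amount REAL)")
    connection.execute("INSERT INTO orders VALUES (1, 250.0)")
    connection.execute("INSERT INTO salaries VALUES ('Ada', 1000.0)")
    return SecuredDataService(connection, CLIENT_GRANTS)
```

With this in place, build_demo_service().read_table("analyst", "orders") returns the order rows, while the same client asking for "salaries" raises AccessDenied even though the proxy's own connection could read that table.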

The last two issues are the ones I believe are more difficult. Let’s consider a simple view of functionality: supporting standard SQL. Simple queries are easy to handle – we have
already discussed the use of ODBC/JDBC for accessing the data from any specific database, and it is easy to package a query, direct it to a server, and then wait for the result set to be forwarded
back to the service, which can then repackage it and display it to the information client.

The problem arises with cross-table queries in which the tables live in different databases. Joining two tables from the same database is easy – it is just another query.
Joining tables from different databases means that you cannot rely on the internal query engine of either database to materialize the result. This implies that we need to build the mechanics of the
query engine into the service itself! In other words, a join of two tables from different databases means that the service needs to access the data from each table and then execute the join outside
of the database, and this requires both query engine functionality and memory and disk resources outside of the individual databases.
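
As a rough illustration of what "executing the join outside of the database" involves, the following Python sketch pulls rows from two independent databases and joins them in the service using a simple in-memory hash join. The table names, sample rows, and function names are invented for the example, and two separate in-memory sqlite3 connections stand in for two different database servers. A production service would also need to handle result sets too large for memory, which this sketch does not attempt.

```python
import sqlite3
from collections import defaultdict


def hash_join(left_rows, right_rows, left_key, right_key):
    """Join two lists of row tuples on the given column positions."""
    # Build phase: index the right-hand rows by their join key.
    index = defaultdict(list)
    for row in right_rows:
        index[row[right_key]].append(row)
    # Probe phase: look up each left-hand row and emit combined rows.
    joined = []
    for row in left_rows:
        for match in index.get(row[left_key], []):
            joined.append(row + match)
    return joined


def build_demo_databases():
    """Create two unrelated in-memory databases with made-up rows (demo only)."""
    crm = sqlite3.connect(":memory:")
    crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    crm.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace"), (3, "Edsger")],
    )
    sales = sqlite3.connect(":memory:")
    sales.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
    sales.executemany(
        "INSERT INTO orders VALUES (?, ?)", [(1, 250.0), (1, 75.5), (3, 10.0)]
    )
    return crm, sales


def customers_with_orders(crm, sales):
    """Fetch each table from its own database, then join inside the service."""
    customers = crm.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    orders = sales.execute(
        "SELECT customer_id, total FROM orders ORDER BY rowid"
    ).fetchall()
    return hash_join(customers, orders, left_key=0, right_key=0)
```

Neither database ever sees the other's table. The index built in hash_join is precisely the memory the service has to supply on its own, and the customer with no orders simply drops out, as in an inner join.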

And this leads to the last issue: performance. For this virtual database to work, it must be able to provide certain performance levels that are acceptable to the information client. But presumably
we are implementing a significant amount of database functionality outside of the database servers. In addition, we also need to factor in the latency for delivery of information from the separate
servers through the network. That means that the service must be able to optimize the database functionality across the collection of distributed resources. Here we are in luck again: the grid
computing paradigm is designed to support parallelism across distributed systems, and a large part of database query functionality is well suited to parallelization. We can therefore take
advantage of the grid to meet the computational requirements and exploit network connectivity to support the needed functionality.
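
One way to picture this kind of parallelism is a partitioned aggregate: each database computes a partial result locally, the partial queries run concurrently, and the service combines the much smaller partial results. The Python sketch below is a simplification under my own assumptions. Worker threads stand in for grid nodes, in-memory sqlite3 connections stand in for remote servers, and the function names and sample data are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_source(rows):
    """Build one in-memory database standing in for a remote server (demo only)."""
    # check_same_thread=False lets a worker thread use a connection created here;
    # each connection is still used by only one worker at a time.
    connection = sqlite3.connect(":memory:", check_same_thread=False)
    connection.execute("CREATE TABLE orders (region TEXT, total REAL)")
    connection.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    connection.commit()
    return connection


def partial_totals(connection):
    """Let each database aggregate its own rows, so only subtotals travel."""
    return connection.execute(
        "SELECT region, SUM(total) FROM orders GROUP BY region"
    ).fetchall()


def parallel_totals(sources):
    """Run the partial queries concurrently and merge the subtotals."""
    combined = {}
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        for partial in pool.map(partial_totals, sources):
            for region, subtotal in partial:
                combined[region] = combined.get(region, 0.0) + subtotal
    return combined


if __name__ == "__main__":
    sources = [
        make_source([("east", 100.0), ("west", 50.0)]),
        make_source([("east", 25.0)]),
    ]
    # prints {'east': 125.0, 'west': 50.0}
    print(parallel_totals(sources))
```

Pushing the GROUP BY down to each server addresses both halves of the performance concern: the work is spread across resources, and less data crosses the network.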

I anticipate that this kind of virtual distributed database service is the next logical step within a grid services environment. I would be interested in hearing from you if you are currently
working on this kind of project to learn more about your experiences – email me at loshin@knowledge-integrity.com, and I will be happy to
summarize what I learn in my next column.

Copyright © 2003 Knowledge Integrity, Inc.


About David Loshin

David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions including information quality solutions consulting, information quality training and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.

