Grids in Data Warehousing

When we think of a data warehouse as a solution, the first thing that comes to mind is technology that centralizes all the data in a single system while still responding quickly to user
queries, so that users can make analytical decisions. In today’s competitive market, IT organizations are under pressure to increase operational agility, to establish and
meet IT service levels and to control costs. They need a cost-effective solution that lets them stay competitive. When designing a new data warehouse, the data architect
has to choose technology that delivers the reliable and secure performance the application needs. Business users need to understand current key business indicators, query historical
information and perform trend analysis to predict future outcomes. As growing volumes of data are made available to a large number of business users and analysts making tactical decisions,
scalability and high availability are of paramount importance.

Considering these needs, many IT organizations have started building grid computing, an emerging technology, into their data warehouse
products. With grid-enabled technology in data warehousing, customers can now build an architecture that provides the speed and performance their applications need.


What is Grid?

A grid is typically a collection of low-cost servers connected over a high-speed network, in which IT resources such as computing power, storage and network capacity are pooled into a
single set of shared services that can be allocated on demand. This maximizes the utilization of resources that are already available.

Figure 1: A Simple Grid Implementation


Why Grid?

Data warehousing involves loading data from heterogeneous sources such as operational systems, mainframes and files. The data is then queried by business users to make
analytical decisions.

Figure 2: Data Warehouse Architecture

As shown in Figure 2, the data from various sources is extracted, transformed and loaded into the data warehouse. This process is called ETL (extract, transform, load). As the data
volume grows, loading takes longer and the time needed to load the data warehouse increases. The data warehouse is typically loaded at night and queried during the day, so the ETL process must be
efficient enough to complete within the given load window or service level agreement (SLA), irrespective of the volume of data. To avoid missing
SLAs, organizations add hardware resources, but that makes the overall system expensive. The challenge for IT organizations is to get more computational
power out of the resources they already have and to build cost-effective, scalable and highly available systems.
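
To make the load-window constraint concrete, here is a rough back-of-the-envelope sketch. The data volume, per-server throughput and window length are made-up figures, and the model assumes the load divides evenly across servers with no coordination overhead, which real systems only approximate.

```python
import math

def load_hours(volume_gb: float, gb_per_hour_per_server: float, servers: int) -> float:
    """Estimated load time, assuming the work divides evenly across servers."""
    return volume_gb / (gb_per_hour_per_server * servers)

def servers_needed(volume_gb: float, gb_per_hour_per_server: float, window_hours: float) -> int:
    """Smallest server count whose estimated load time fits in the window."""
    return math.ceil(volume_gb / (gb_per_hour_per_server * window_hours))

# Hypothetical figures: 2,400 GB nightly, 100 GB/hour per server, 8-hour window.
VOLUME_GB = 2400
RATE_GB_PER_HOUR = 100
WINDOW_HOURS = 8

# One server alone would need far longer than the window allows.
single = load_hours(VOLUME_GB, RATE_GB_PER_HOUR, servers=1)

# Number of pooled servers required, and the resulting load time.
needed = servers_needed(VOLUME_GB, RATE_GB_PER_HOUR, WINDOW_HOURS)
pooled = load_hours(VOLUME_GB, RATE_GB_PER_HOUR, servers=needed)

print(f"single server: {single:.1f} h, servers needed: {needed}, pooled: {pooled:.1f} h")
```

Under these assumed numbers a single server misses the window by a wide margin, while a small pool of servers fits it. If the volume doubles, the same calculation shows how many more pooled servers the load would need.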

To handle this data explosion and these IT challenges, grid computing offers an innovative solution that provides:

  • Scalability: Distributing tasks over a shared pool of resources improves scalability and performance.

  • Reliability: In a grid, if any server fails, another server takes over the processing without failing the job, thus providing a reliable
    structure.

  • Cost Saving: By utilizing the computing power of unused resources, organizations can optimize their return on investment and lower the cost of ownership.

  • Throughput: Many users can access the shared pool of resources and obtain the best possible response time, because the utilization of all resources available
    in the pool is maximized.
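
The reliability point above can be illustrated with a minimal sketch. It is not any vendor's scheduler: the node names, task names and the `fake_execute` function are hypothetical, and the tasks run one after another here, where a real grid would run them in parallel.

```python
from typing import Callable, Dict, List

def run_with_failover(tasks: List[str],
                      nodes: List[str],
                      execute: Callable[[str, str], str]) -> Dict[str, str]:
    """Run each task on a node, retrying on the next node if one fails.

    Tasks start on nodes round-robin; a task fails only if every node fails.
    """
    results: Dict[str, str] = {}
    for i, task in enumerate(tasks):
        last_error = None
        for attempt in range(len(nodes)):
            node = nodes[(i + attempt) % len(nodes)]
            try:
                results[task] = execute(node, task)
                break
            except RuntimeError as err:
                last_error = err  # this node failed; try the next one
        else:
            raise RuntimeError(f"all nodes failed for {task}") from last_error
    return results

# Hypothetical grid of three servers in which node-b is down.
def fake_execute(node: str, task: str) -> str:
    if node == "node-b":
        raise RuntimeError(f"{node} is unavailable")
    return node  # report which node actually ran the task

assignments = run_with_failover(
    ["load_customers", "load_orders", "load_products"],
    ["node-a", "node-b", "node-c"],
    fake_execute,
)
print(assignments)
```

The task that would have started on the failed node is picked up by the next node in the pool, so the overall job still completes.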

With grid computing, groups of independent hardware and software components can be pooled and shared on demand to meet the changing needs of businesses. Instead of dedicating computing resources to specific
applications, a grid allows them to be shared, while also making systems highly scalable and available. The accelerating adoption of grid technology is a direct response to the
challenges that IT organizations face with today’s rapidly changing and unpredictable business needs.


How is Grid Implemented in the Data Warehouse?

Given these advantages of grid technology, many organizations have started implementing grids in their ETL tools and databases. By combining grid technology with data warehouses,
organizations can reduce processing times while lowering costs. ETL tools such as Informatica and SAS and databases such as Oracle have implemented grid in their products.

Below we briefly discuss how grids are implemented in data warehousing products and how those products benefit from this technology.

  1. Informatica Corporation

    Informatica PowerCenter 8, the latest release of Informatica, harnesses the power of grid computing for greater data integration performance and scalability. The enterprise
    grid option delivers load balancing, dynamic partitioning, parallel processing and high availability to ensure optimal scalability, performance and reliability. The grid technology
    implemented in Informatica PowerCenter 8 distributes the workload across the available resources, balancing the load and thereby increasing scalability and performance. The
    High Availability option reduces the chance of system failure and provides uninterrupted availability of computing resources.

  2. Oracle Corporation

    Oracle introduced grid technology in Oracle 10g (‘g’ for grid). Oracle has incorporated the fundamentals of grid computing into the
    Oracle database, application server and Enterprise Manager. In Oracle 10g, a database can rebalance its workload onto a new node as that node's processing capacity is re-provisioned from
    one database to another, and can relinquish the machine when it is no longer needed. This is on-demand sharing of resources.

    Oracle Database 11g delivers the benefits of grid computing with more self-management and automation, making it easier to partition and compress tables to store more data and run queries faster
    and to protect and audit data.

  3. SAS Institute

    SAS Grid Computing delivers enterprise-class grid computing capabilities that enable SAS applications to automatically leverage grid computing, run faster and take optimal advantage of
    computing resources. SAS Grid Manager helps automate the management of SAS Computing Grids with dynamic load balancing, resource assignment and monitoring, and job priority and termination
    management. Customers can process high-volume SAS programs faster, improve hardware utilization and future-proof computing infrastructures while increasing the resilience of SAS applications.
    Computing resources can be scaled out to cost-effectively add new users and meet fluctuating processing demands.
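
All three products describe some form of load balancing across the servers in the grid. The following is a generic sketch of one common approach, greedy assignment of the largest work item to the currently least-loaded node. It is not the algorithm used by Informatica, Oracle or SAS, and the partition names, sizes and node names are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

def balance(partitions: Dict[str, int], nodes: List[str]) -> Dict[str, List[str]]:
    """Assign partitions to nodes, largest first, always to the least-loaded node."""
    # Min-heap of (current load, node name); the least-loaded node is popped first.
    heap: List[Tuple[int, str]] = [(0, node) for node in nodes]
    heapq.heapify(heap)
    plan: Dict[str, List[str]] = {node: [] for node in nodes}
    for name, rows in sorted(partitions.items(), key=lambda kv: kv[1], reverse=True):
        load, node = heapq.heappop(heap)
        plan[node].append(name)
        heapq.heappush(heap, (load + rows, node))
    return plan

# Hypothetical partition sizes, in millions of rows.
partitions = {"p1": 50, "p2": 30, "p3": 20, "p4": 20, "p5": 10}
plan = balance(partitions, ["node-a", "node-b"])

# Total rows assigned to each node under the plan.
loads = {node: sum(partitions[p] for p in parts) for node, parts in plan.items()}
print(plan, loads)
```

The greedy rule is simple and usually gives a reasonably even split, though it is not guaranteed to be optimal. Production schedulers also weigh factors such as node capacity and job priority.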


Who Benefits from Grid in Data Warehousing?

Grid reduces the bottlenecks that occur while loading data into the data warehouse, making data processing faster. Faster data processing means the data is available to business
users for analysis sooner, so business decisions are not delayed and can be made in a timely manner. Also, because a grid makes better use of resources that are already available, the hardware cost remains the
same.

Because of these advantages, business personnel at every level benefit from grid implementation:

IT managers and directors: Because grid harnesses the power of underutilized resources to provide more computational power, there is no need for extra hardware, which makes it
cost-effective.

Data warehouse architects and specialists: Because grid options are available in databases and ETL tools, data warehouse architects do not have to worry about load
windows even if data volumes explode in the future.

Business analysts and decision makers: With grid, the data is available to business analysts and decision makers in a timely manner, and they can make analytic decisions
quickly.


Conclusion

Because grid computing harnesses the power of underutilized resources, it will have a major impact on productivity and cost at the enterprise level. The benefits of grid, such as flexible
management of resource utilization, high availability, scalability and greater performance at lower cost, are leading many IT organizations to adopt it in their tools and
technologies.

While the development and implementation of grid computing is still emerging, adoption will continue to increase rapidly over the next several years. Data warehousing will be a forerunner in utilizing
grid and will benefit from its use.

References:

  1. SAS Institute, “Grid Computing and SAS,” by Merry Rabb and Cheryl Doninger
  2. Oracle Corporation, “Oracle Grid Computing,” May 2008
  3. Informatica Corporation, “Informatica PowerCenter Today and in the Future,” November 2006
  4. United Devices, “Grid-Enabled Data Management and ETL,” June 2007

About Madhu Zode

Madhu Zode is a data architect who has worked extensively in data modeling using canonical models, ETL architecture, design integration and implementation. She has published white papers in the past on "ETL Evolution" (DM Review, 2007), "Grids in Data Warehousing" (The Data Administration Newsletter, 2009) and "Canonical Data Model: Does It Actually Ease Data Modeling?" (Information Management, 2015). Madhu can be reached at madhuzode@gmail.com.