In this age of business analytics, business users are clamoring for data to be provided to them at a faster and faster pace. At the same time, the generation and expected consumption of source data has reached a feverish pace, threatening to overload existing infrastructures, already harried IT staff and tight budgets. Structured data provided in batch mode is no longer good enough; unstructured data, including social network data, third-party data, XML data and streaming data, is increasingly becoming a necessity for the analytics architectures of the near future. To help counter this data onslaught, one architectural component that has been gaining ground in analytic environments in recent years is data virtualization. Data virtualization can be categorized in many ways, but this article examines two main functions: the abstraction function and the federation function.
Data Abstraction Function
Data virtualization used as an abstraction tool is a computing discipline that abstracts or “hides” the complexity of a computing infrastructure from an end user. The end user should need no knowledge of, or concern with, the implementation details regarding physical storage, server locations or database structures.
The goal for the abstraction function is to:
- Hide the IT infrastructure details
- Reduce architectural complexity
- Allow for seamless movement behind the scenes as requirements change and build-outs occur
In database systems architecture, the data abstraction layer is typically implemented using a layered architecture in which physical, logical and view layers are defined. The physical layer, which tends to be complex, has the most detailed definition and contains the actual implementation details. This layer is overlaid with a logical layer, which is less complex than the physical layer and hides less important implementation details. Finally, a set of views is set up to provide users with a simpler, more concise view into their data.
The view layer normally consists of security views, business area views and custom views. Also, for performance reasons some of these views could be implemented as materialized views or managed persistent table structures.
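The three layers described above can be sketched with a few SQL views. The following is a minimal illustration only, using Python's built-in sqlite3 module; the table name `t_cust_01`, the view names `customer` and `customer_public`, and all column names are hypothetical and chosen purely for the example.

```python
import sqlite3


def build_layers(conn):
    """Create a physical table, a logical view, and a security view on top of it."""
    conn.executescript("""
        -- Physical layer: cryptic names that mirror storage decisions (hypothetical)
        CREATE TABLE t_cust_01 (
            c_id INTEGER PRIMARY KEY,
            c_nm TEXT,
            c_rgn_cd TEXT,
            c_tax_id TEXT
        );
        INSERT INTO t_cust_01 VALUES (1, 'Acme', 'NE', '11-111');
        INSERT INTO t_cust_01 VALUES (2, 'Globex', 'SW', '22-222');

        -- Logical layer: business-friendly names, same data
        CREATE VIEW customer AS
            SELECT c_id AS customer_id,
                   c_nm AS customer_name,
                   c_rgn_cd AS region,
                   c_tax_id AS tax_id
            FROM t_cust_01;

        -- View layer: a security view that omits the sensitive column
        CREATE VIEW customer_public AS
            SELECT customer_id, customer_name, region
            FROM customer;
    """)


conn = sqlite3.connect(":memory:")
build_layers(conn)

# End users query only the view; the physical table can change behind it.
rows = conn.execute(
    "SELECT customer_name, region FROM customer_public ORDER BY customer_id"
).fetchall()
public_columns = [d[0] for d in conn.execute("SELECT * FROM customer_public").description]
```

If the physical table is later renamed or split, only the logical view definition needs to change; queries against `customer_public` stay the same, which is the insulation the abstraction function aims for.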
Data Federation Function
Data federation’s goal is to join data from multiple heterogeneous sources while leaving the data intact in its original source system, avoiding redundant copies of the data. Data federation is traditionally implemented with a data federation solution that acts as a middleware component, allowing SQL-like queries to be applied to both structured and unstructured data sources. These sources may reside in many disparate database and/or file technologies, such as Oracle, DB2, SQL Server, XML data stores, etc. End users and client applications send standard queries to the data federation solution, which then performs the necessary conversions to satisfy the query and sends the results back to the data requestor in a seamless fashion.
As with the abstraction function, the data supplied back to the end user looks like it came from one physical database, but in reality it was rationalized across many different data structures and technologies.
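To make the federation idea concrete, here is a small sketch, under heavy simplifying assumptions, of a middleware function that answers one request from two heterogeneous sources: a relational table (an in-memory SQLite database standing in for an operational DBMS) and an XML document. The function names, the sample data and the XML layout are all hypothetical; commercial federation tools add query optimization, predicate pushdown and much more.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Source 2: an XML data store (hypothetical layout)
CUSTOMERS_XML = """
<customers>
  <customer id="1"><name>Acme</name></customer>
  <customer id="2"><name>Globex</name></customer>
</customers>
"""


def make_orders_source():
    """Source 1: a relational database holding orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0)],
    )
    return conn


def read_customers(xml_text):
    """Convert the XML source into an id -> name lookup at query time."""
    root = ET.fromstring(xml_text.strip())
    return {int(c.get("id")): c.findtext("name") for c in root.findall("customer")}


def federated_totals(orders_conn, xml_text):
    """Join both sources in the middleware; neither source is copied or altered."""
    names = read_customers(xml_text)
    totals = orders_conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()
    return [(names[customer_id], total) for customer_id, total in totals]


result = federated_totals(make_orders_source(), CUSTOMERS_XML)
```

The caller sees a single result set of customer names and order totals, with no indication that the names came from XML and the amounts from a relational table. Note that the aggregation is done by the source database itself, which keeps the amount of data moved into the middleware small.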
The benefits of data virtualization (abstraction and federation) are many, including but not limited to the following:
- Establishes a layer of insulation between the technical solution and the end user
- Reduces complexity for the end user
- Increases IT flexibility and agility, allowing platforms to be built out and/or shifted behind the scenes with minimal user impact
- Provides a single data provisioning point
- Allows for prototyping and data mashups
- Standardizes on a common taxonomy and schema
- Abstracts data model complexity
The ultimate goal of data virtualization is for the end users and client applications to perceive all of their analytic data sources as coming from a single database repository.
Data virtualization has some challenges to overcome before it can be utilized as an optimal solution. The following is a short list of important items that should be part of any organization’s evaluation criteria.
- Source system performance. Data federation must be able to query the source systems without hindering their performance. Before treating a source system as a candidate for data federation, confirm that federated queries can execute against it with minimal or no impact.
- Complex data transformation. Data virtualization has difficulty satisfying complex data transformations “on the fly.” This type of data integration may best be satisfied using an ETL data pattern.
- Query performance. Great care must be taken to ensure the DBMS or data virtualization tool’s query optimizer can perform to meet the query’s performance targets. A performance environment where this can be measured is a prerequisite if data virtualization becomes a pervasive part of the data warehousing architecture.
- Data quality. Data quality issues are rampant throughout most organizations today, and the data virtualization tool cannot also act as a data remediation tool. Great care must be taken to ensure that data interrogation, data cleansing and data remediation are not folded into the data virtualization layer along with its other functions.
There are many other considerations, but if your source systems and requirements satisfy the criteria above, data virtualization should be considered as a potential alternative for implementation, provided that there are no negative architectural, infrastructure and/or organizational impacts.
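For the query performance criterion above, a performance environment needs some repeatable way to compare measured latency against a target. The sketch below is one simple possibility, not a prescribed method: the function name `meets_target`, the use of the median and the default of five runs are all assumptions made for illustration.

```python
import statistics
import time


def meets_target(run_query, target_seconds, runs=5):
    """Time a query callable several times and compare the median to a target.

    run_query is any zero-argument callable that executes the query under test
    (hypothetical; in practice it would call the virtualization layer).
    Returns (median_seconds, True if the median is within the target).
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    median = statistics.median(timings)
    return median, median <= target_seconds


# Example: a trivial stand-in "query" measured against a one-second target.
median, ok = meets_target(lambda: sum(range(1000)), target_seconds=1.0)
```

The same harness can be pointed at a source system's own workload before and after federated queries are introduced, which gives a rough check on the source system performance criterion as well.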
There are multiple options for implementing data virtualization within your existing architecture. Data virtualization can be provided by the DBMS platform, an enterprise information integration (EII) tool, the BI tool itself or any combination of these. The DBMS platform can implement it using the architectural layers mentioned above: physical, logical and views. It can also be implemented through EII solutions provided by Composite, Red Hat MetaMatrix or IBM’s Federation Server, to name just a few. BI tools such as SAP Business Objects also contain an abstraction layer that shelters end users from the unnecessary complexity of the underlying architecture. One, some or all of these can be used to build a comprehensive and robust data virtualization layer.
Data virtualization does have a place in the analytics environment of today, but great care must be taken not to make it the first choice for data integration. Separate the abstraction and federation functions and establish clear criteria for the selective use of each within your organization. Federating too much data will overwhelm the infrastructure, while federating too little or no data will stunt the ability to provide optimal data consumption. Careful architecture and design, combined with clear business requirements, will allow the data warehouse architect to design the correct overlap of functionality for each pillar (DBMS, EII tool, BI tool) of the overall architecture, ensuring standardized common access, sufficient performance and adequate flexibility for future implementations.