The Analytic Sandbox

The Challenge

Traditional enterprise data warehouse (EDW) and business intelligence (BI) processes are not rapid, responsive or agile. It typically takes six to nine months to analyze and define business requirements, model the data, develop and test extract, transform and load (ETL) processes to cleanse and integrate the data, create the metadata layer and develop and test BI reports and dashboards. This costs $500,000 or more for a typical initiative.

Efforts to make the process more agile have largely focused on speeding up the various steps in the traditional delivery cycle and staggering the steps across multiple projects. For example, once the business requirements and report specifications are complete on the first project, the business analyst(s) can start on the next project, while Data Modeling, ETL and BI report development and testing continue on the first one. These approaches have had limited success.

Often, the business has not had an opportunity to work with the data, so can’t clearly define metrics and BI reports properly. Another issue is the difficulty of integrating external data with data in an existing data mart. In short, the traditional approach doesn’t work for these cases, nor does it work for one-off exploratory initiatives due to the long development time and cost involved for what is essentially “throw away.”

Casual vs. Power Users

Most importantly, the traditional EDW treats all users as casual users. Casual users are executives, managers and front-line workers who periodically consume information created by others. They monitor daily or weekly reports and occasionally dig deeper to analyze an issue or get details. Generally, a well-designed interactive dashboard or parameterized report backed by a data warehouse with a well-designed dimensional schema is sufficient to meet these information needs.

Power users, on the other hand, explore data to answer unanticipated questions and issues. No predefined dashboard, report or semantic layer is sufficient to meet their needs. They need to access data both in the data warehouse and outside of it, beyond the easy reach of most BI tools and predefined metrics and entities. They then need to dump the data into an analytical tool (e.g., Excel, SAS) so they can merge and model the data in novel and unique ways.

For years, we’ve tried to reconcile casual users and power users within the same BI architecture, but it’s a losing cause. Power users generate “runaway” queries that bog down performance in the data warehouse, and they generate hundreds or thousands of reports that overwhelm casual users. As a result, casual users reject self-service BI and revert back to old habits of requesting custom reports from IT or relying on gut feeling. Meanwhile, power users exploit BI tools to proliferate spreadmarts and renegade data marts that undermine enterprise information consistency while racking up millions in hidden costs.

The Answer – An Analytic Sandbox/Exploration Warehouse

The answer is to create a separate analytic environment to meet the needs of the most extreme power users. This analytic sandbox concept is actually an old concept, as shown in the 2001 Corporate Information Factory diagram by Bill Inmon and Claudia Imhoff (Figure 1), although they used the term Exploration Warehouse (see the yellow database objects in the diagram).

Figure 1: Corporate Informaton Factory diagram

Inmon defined the Exploration Warehouse as a “DSS architectural structure whose purpose is to provide a safe haven for exploratory and ad hoc processing.”

Compared to a traditional data warehousing environment, an analytic sandbox is much more free-form with fewer rules of engagement. Data does not need rigorous cleaning, mapping, or modeling, and hardcore business analysts don’t need semantic guardrails to access the data. In an analytic sandbox, the onus is on the business analyst to understand source data, apply appropriate filters, and make sense of the output. Certainly, it is a “buyer beware” environment. As such, there may only be a handful of analysts in your company who are capable of using this architecture.

Based on the experience of companies who have implemented sandboxes, the best practices involve establishing clear rules and guidelines so that users don’t treat the sandbox environment like a production EDW. For example:

  1. No off-hours support.

  2. No regular backups.

  3. Limited and fixed disk space – once it’s full, something has to go, e.g., the oldest data.


Implementation Options

Physical Sandbox
One type of sandbox architecture uses a new platform –a data warehouse appliance, columnar database, or massively parallel processing (MPP) database – to create a separate physical sandbox for their hardcore business analysts and analytical modelers. They offload complex queries from the data warehouse to these turbocharged analytical environments, and they enable analysts to upload personal or external data to those systems. This safeguards the data warehouse from runaway queries and liberates business analysts to explore large volumes of heterogeneous data without limit in a centrally managed information environment.

Virtual Sandbox
Another approach is to implement virtual sandboxes inside the data warehouse using workload management utilities. Business analysts can upload their own data to these virtual partitions, mix it with corporate data, and run complex SQL queries with impunity. These virtual sandboxes require delicate handling to keep the two populations (casual and power users) from encroaching on each other’s processing territories. But compared to a physical sandbox, it avoids having to replicate and distribute corporate data to a secondary environment that runs on a non-standard platform.

Desktop Sandbox
Other BI teams are more courageous (or desperate) and have decided to give their hardcore analysts powerful, in-memory, desktop databases (e.g., Microsoft PowerPivot, Lyzasoft, QlikTech, Tableau, or Spotfire) into which they can download data sets from the data warehouse and other sources to explore the data at the speed of thought. Analysts get a high degree of local control and fast performance but give up data scalability compared to the other two approaches. The challenge here is preventing analysts from publishing the results of their analyses in an ad hoc manner that undermines information consistency for the enterprise.

Software as a Service (SaaS) Sandbox
Software as a Service BI can enable the easy and rapid creation of sandboxes for data exploration. SaaS BI vendors offering a complete and comprehensive solution can satisfy IT and business requirements for sandboxes. For IT, a comprehensive solution provides an integrated offering of all of the necessary BI components, such as ETL, data warehousing, data analysis and reporting. A unified metadata model ensures consistency across the entire BI stack and eliminates the need to consolidate and maintain metadata. The built-in ETL and data warehousing functionality means that users can tap into a clean, managed source of enterprise data and create a sandbox on the fly for ad hoc analysis. With on-demand BI, the entire ETL and data warehouse processes are automated with intelligent source data analysis, data extraction, loading and transformation to automatically create staging and warehouse tables. The end result is that IT can quickly set up BI sandboxes based on data that IT can control. The sandboxes themselves can also be centrally managed, so that end users have access to the appropriate amount of information and IT is confident in the data’s integrity and security. A SaaS solution can complement an existing BI solution, while requiring significantly less maintenance and support than traditional BI. This means that there are no conflicts with existing IT investments and the solution requires a minimal amount of IT’s time and attention.

Summary

Sandboxes / Exploration Warehouse offer many benefits. They allow the business an opportunity to work with the data first, and then define metrics and BI reports properly. They facilitate the integrating of external data with data from the EDW, i.e. support combining and enrichment of data. Sandboxes also support one-off exploratory initiatives and supports power users in their exploration without affecting other BI users. Lastly, sandboxes can serve as a “relief valve” until the work to put the data into the warehouse can be done.

References

  1. “Dual BI Architectures: The Time Has Come”, Wayne Eckerson, TDWI, Sep 30, 2010

  2. Bill Inmon’s Corporate Information Factory, http://www.inmoncif.com/library/cif/

  3. “Solve the Data Management Conflict Between Business and IT”, Brad Peters, Information Management Newsletters, July 20, 2010

  4. “Bringing Agility to Business Intelligence”, David Caruso, InfoManagement Direct, February 17, 2011

Share

submit to reddit
Top