Data warehouse performance is dependent on a number of factors including the nature of the queries being run against the physical database, the hardware and software resources available and the
underlying physical distribution of the data in the database.
This paper will identify many of the causes of poor performance in data warehouse queries and identify practical solutions for managing and improving performance. The talk will identify methods for
determining performance bottlenecks. Information will be provided on taking advantage of parallel processing, controlling the physical distribution of data for performance gain and aggregating data
across dimensions to improve performance.
The Slow Database
A complaint that is heard too often from many data warehouse users from newly implemented data warehouses is that data access performance is slow. In the world of fast client-server
applications and ever-increasing CPU speeds, users have come to expect rapid results from virtually any query. A large visible data warehouse effort must properly set user expectations for
performance and then must be able to deliver on those expectations. But achieving proper performance with a large diverse data set is not a trivial matter.
Identifying ‘Poor Performance’ vs. ‘Expected Performance’
It is important for data warehouse architects and IT managers to properly set expectations for a data warehouse effort. Performance is largely a factor of software and hardware resources versus the
size and complexity of the data set being processed. A large number of records being processed in a simple two-table join will generally process faster than the same number of records being
processed in a five-table join (which actually involves processing more records). So the type of processing being performed (join, aggregation, union) and the size of the data set directly
The other major factor influencing performance is the hardware and software being used to process the data. Hardware and software represent the resource available for processing; much as the size
of a car engine (number of cylinders, cylinder bore) impact the performance of the car, so the hardware available (number of disks, speed and number of CPUs) influence the speed of the processing
for a database. The software also impacts performance and in the case of the data warehouse it can involve several pieces of software: the front-end data access software (OLAP, DSS Query),
middleware and the back-end database. In most cases today, the database used for data warehousing will be a relational database, the type of database which will be the focus of
A query that involves scanning six million records and performing a complex join and aggregation on moderately powerful hardware would be unlikely to complete in six seconds; a more reasonable
processing time for this query would be on the order of a half-hour to an hour (depending on software and other factors that will be detailed in this article). This would be considered
‘expected performance’. But if an end-user were running this query and performance expectations had not been properly set, it could be viewed as ‘poor performance’ by the user.
It is possible, through various techniques, to improve the performance of many queries. In the case of the six million row query, reducing the number of rows read by creating aggregations across
dimensions is one technique. Desktop OLAP (DOLAP) tools that create ‘data cubes’ use a technique similar to aggregation by creating a local multi-dimensional database (MDD) that allows facts
to be read quickly across dimensions, but in most cases these tools must still at some point access the six million rows in the data warehouse. This article will explore methods of optimizing that
Identifying Resource Problems
The process of identifying resource problems involves reviewing the resource usage on the machine during data warehouse processing. If there is very little resource available during peak
processing, then there may be inadequate resources for peak processing.
Contrary to popular opinion, database engines are more often CPU-bound than disk-bound, meaning that available CPU resource is the constraining performance factor. Parallel
processing, large memory caches and asynchronous i/o allow data be moved quickly from disk into main memory where it is then processed by the database engine. This processing can be performed very
quickly, and CPU cycles are required to process this data efficiently. A lack of adequate CPU cycles would inhibit database performance.
To determine whether or not lack of CPU resources is limiting system performance,
system utilities should be run during database idle time to determine CPU resource available on the system. A machine that is using between 60 to 100 % of CPU resource during database idle time
could have inadequate CPU available for running the data warehouse at a full system load. On a symmetrical multi-processing (SMP) machine, additional CPUs could be added to increase the CPU
available. On a single processor machine, a faster CPU could be used to provide more CPU resource.
Similarly, tracking system CPU availability as data warehouse processing increases could be used to discern a the point at which CPU usage reaches 80-100%. Once again, additional CPU resource may
be needed to improve performance. The point at which the data warehouse is running at peak usage load and system CPU usage is between 60-80% is the point at which CPU resource is no longer a
performance constraint for the database.
Memory resource could be another constraining performance factor. As mentioned previously, database engines read large amounts of data into an available data cache in memory (usually
shared memory). All database processing is performed against this data cache. If a data page (a unit of data cache) that is needed is not in the data cache, it must be read from disk. And
if there is not room in the data cache for the data page from disk, a data page in the cache must be removed and/or written to disk. A small data cache would have to constantly remove pages from
the cache to make room for additional pages and thus increase the likelihood that a data page would not be found in memory. This is a process that would ultimately lead to more disk i/o and longer
processing waits while a process waited for a data page to be read from disk. For that reason, all other factors being equal, a larger data cache would lead to fewer disk i/o’s and faster, more
Identifying Problem Queries
Often problems with processing involve the execution of a specific query. With most relational databases, a query is phrased in the standard Structured Query Language (SQL) and then parsed and
optimized by the database engine. Ultimately the database engine will make decisions on how to process the query constructing what is referred to as a ‘query path’. This query path can have
a significant impact on the performance of the query.
A problem query is obviously any query where performance is suspect. If system resource problems have been ruled out (if there is adequate system resource available during the processing of the
query), then the query path being used should be examined. Most database engines have some means of displaying this information for a given query.
The query path should be examined to determine whether or not the tables in the query are being accessed correctly. If a table with a highly selective index that could be used to satisfy the query
is being scanned serially (the index is not being used and every page in the table is being read), then it is likely that the engine optimizer has not made the correct decision in accessing the
Conversely, if an index is not highly selective for the query being processed and is being used nevertheless, the query process may once again have made the wrong decision. Index access for a large
number of data pages is not efficient since it requires additional i/o calls to read an index page and then a data page. A sequential scan of the table would require fewer i/o calls and would
process more quickly. Some put the threshold for this type of access at scans that read 25% of the table, meaning that if 25% of the table is being read, then an index should not be used.
In order to correct a problem with a query that is generating an incorrect query path, several actions could be taken. It may be that the database engine does not have adequate statistics
concerning the table and/or indices available to make the correct decision. If there is some process for updating the statistics available to the optimizer, that should be attempted. Rephrasing the
SQL statement could also have an impact on performance (though in the perfect world, the optimizer should understand alternate methods of stating the same query and still choose the correct query
path). And some databases support optimizer hints, a method of indicating to the optimizer that the query being run should be optimized in a specific manner, such as choosing a specified
index or performing a table scan on a specified table.
Reviewing Physical Data Distribution
The physical distribution of the data on the disk can impact performance. Decision support applications that access data warehouse data often scan a large number of records. For a particular table
these scans are often referred to as full table scans. A full table scan can benefit from contiguous data, data pages (collections of table records) that reside physically side-by-side on
the disk. A table that contains contiguous pages is easier to scan than a table that contains data pages that are fragmented or distributed in various non-contiguous locations on the disk. A table
that is fragmented would require more i/o’s to read than a table that is contiguous. And more i/o’s means that data access will be slower than for the same amount of data collected with fewer
Taking Advantage of Multi-threading and Parallel Processing
All major relational database vendors support some form of internal parallel processing. Multi-threading allows several internal threads of execution to process at the same time. Parallel
processing generally refers to the operating system capability to execute multiple processes concurrently. Combined, these capabilities allow a database engine to perform multiple tasks
As explained previously, when a database engine creates a query plan, there are portions of this query plan that can be executed in parallel. For instance, data located in separate tables can be
scanned with multiple threads of execution which then join the data using a hash algorithm to find rows that fulfill the join criteria. Or a sort operation can be started as a separate thread of
execution and fed rows from an ongoing join or scan operation.
Parallel processing can have a significant impact on the performance of a query. Data warehouses generally have large fact tables with a number of smaller dimension tables. The
fact table is often scanned and performing scan and join operations on this table using parallel processing could have a significant impact on performance. Aggregation operations performed
on the fact table could also benefit from parallel processing, with parallel threads of execution performing aggregations as data is fed from scan operations.
Aggregating Data to Improve Performance
Decision support queries must often aggregate data within fact tables. These aggregate operations are usually performed across dimensions, for instance aggregating daily sales for each quarter for
each sales region. If these queries are run consistently it is usually possible to aggregate this data before the query is run and store the aggregations in tables within the database. When the
query is run, it will recognize these aggregate tables and use them to compute its results, thus avoiding the processing necessary to create the aggregations.
A sophisticated aggregation scheme would recognize dimensional hierarchies and build higher-level aggregations from more granular aggregations. For instance, a weekly aggregation of sales for a
region could be used to build a monthly or quarterly aggregation of sales by region, and thereby avoid an aggregation of the more granular daily sales totals. Such aggregate-awareness can
significantly improve performance but requires meta-data that recognizes dimensional hierarchies.