Published in TDAN.com January 2002
This is the second part of a three part series on Very Large Databases. The paper is divided into six chapters and two chapters will be included with each part.
3. Hardware and VLDB
In this chapter of the series we focus on the setup, necessity, proper sizing and usage of hardware resources; we do not discuss pricing options. By introducing the hardware concepts in relation to
VLDB, the software (next chapter) should make more sense. If you have not been reading the series from the beginning, we recommend you take a moment to do so. The concepts in this chapter build on
the previous chapters. VLDB in relation to hardware can be a challenging aspect. However without the proper knowledge, the VLDB installation can fail. As indicated in chapter 1, the triangles of
power with all of its components are related to the successful implementation of a VLDB system.
Having a VLDB system without the proper hardware could be like having an 18-wheeler truck sitting on wheelbarrow wheels – not a pretty site. On the other hand, if the hardware is an
18-wheeler, and there isn’t really a need for all that cargo space – it could be overkill. Like trying to drive a Ferrari 130 miles an hour just to go 5 miles. Common sense plays in to
this. What we address in this series are the concepts behind the hardware, and the factors to consider depending on the size of the VLDB system being created.
The following is an outline (as covered in chapter 1) for the content presented in this chapter.
One of the things frequently overlooked is the mathematics behind the VLDB – hardware and software included. Because hardware is mathematically based (as is software), there is a plethora of
tuning options available to improve throughput or performance. What the community forgets is that no matter what is being done, the foundations of math are still there – and these concepts
and theories need to be re-applied in today’s modern world. It will help provide a basis to get the architecture correct the first time around. What is being referred to here is in the
hardware sense known as speed, performance, resistance, heat production, and electricity transfer. This level of information is unnecessary to explain VLDB, but necessary to explain hardware. The
point is to remember the foundational roots of all systems as we discuss the theories, algorithms, and enhancement suggestions.
3.1. What is parallel hardware?
Parallel hardware is basically hardware that can run in parallel – at the same time as other hardware, performing a small chunk of a larger task. It’s the same concept base as threads
with parallel processing, only on the hardware level. However – there are numerous differences when the definition of parallel hardware changes. For instance, Numa-Q (now owned by IBM), or
Torrent’s Orchestrate different types of parallel hardware can be achieved. One is based on the clustering approach, the other based on load balancing and availability of on-line processors.
Clustering of hardware is one method of grouping hardware together to appear to the operating system, and software layers as one big machine. It takes the available CPU’s, RAM, and DISK and
places it all together. Although the clustering components usually provide for logical partitioning if desired. For instance, clustering 12 machines with 2 CPU’s each, and 2 GIG’s RAM
each would end up with 48 CPU’s, and 48 GIG’s of RAM available to the software. The issue with this, is that the machines connected in the cluster must have high speed parallel
communications between them, which typically means either backbone connectivity, or some sort of hardware addition (board). Usually the machines should be co-located close together. There are many
different types of methods by which clustering can be achieved, even through remote internet connections and shared resources – however the ultimate goal is to have the processes run
independently of each other, and be load-balanced across the multiply linked systems.
Load balancing is less of a hardware technique and more of a software layer technique. It is a concept that can be implemented on top of clustered machines as well as non-clustered machines. Load
balancing is the method by which the available resources are weighted, scored in their activity and utilization – then based on what’s available, and what’s free, and of course
what’s in the job queue are assigned tasks to complete. In other words, the load balancer works (usually) on between the Operating system and the applications layers to move threads around to
available hardware. Whereas clustering (most often) works between the hardware and operating system level, clustering most often acts as an extension to the operating system. Load balancing is a
function of computational analysis, although there are aspects of load balancing in the clustering hardware and software.
More examples of parallel hardware can be seen in Figure 3-4 where multiple controllers are added to a disk array. Thus allowing the controllers to operate I/O in parallel. Without the multiple
controllers, the one controller is left to run the operations in serial (one at a time) – forcing all other threads or processes to wait.
Parallel hardware is very expensive in terms of a solution, and can be related to the lumberyard crane we discussed in chapter 1 of this series. However, parallel hardware is the most scalable for
VLDB, and can achieve incredible throughput for off-the-shelf applications and databases (see: beyond the frontier on http://www.wintercorp.com). There are
different types of parallel hardware that need to be addressed for VLDB: disk controllers, disk devices, and network cards. Each of these devices works in conjunction with parallel CPU’s, and
3.2. How many CPU’s are enough?
This is a difficult question indeed. In the RAM section below, a chart is presented as a guideline. It applies to today’s’ technology but will surely change within 8 months of
publication of this article. The data sets always seem to grow faster than the hardware and it’s respective capabilities. However, how many CPU’s are enough? Well – ok….
That depends again on how much processing is done. It would also be wise to understand what types of processing is happening on the box or boxes involved. If there are many aggregations, these
process points might be CPU intensive. However, if there are many I/O operations there might be disk intensive activity.
Ok, in VLDB all of these things apply. The only true method for answering this question is to follow the suggested chart below, introduce the system to load gradually – monitor the load as
it’s introduced. Then, proceed to upgrade accordingly. Be prepared: when entering VLDB space to spend a lot of money on hardware upgrades, and software tuning/monitoring to get the best ROI
from the systems. VLDB can be achieved – but usually at a very high cost. CPU’s are not cheap, but can be one of the most significant bottlenecks on any system if not sized properly.
When considering CPU’s for the system also consider the speed (Mhz) of the cpu’s. Too often a lot of slow CPU’s will be installed and the thinking will be: there are a lot of
them, therefore we shouldn’t have any trouble with multi-threaded or VLDB operations. The mistake is that VLDB and the RDBMS engines that drive these systems can smoke slow CPU’s
quickly. Exception (there’s always one): Numa-Q clustering options change the picture dramatically. See the British Telecom 80 Terabyte Warehouse installation on: http://www.wintercorp.com/ for more information.
The only way to truly know if there are enough CPU’s is to monitor, monitor, and monitor. Of course – there are the high priced monitoring tools (non-invasive) that will use a hardware
card to watch the CPU performance, and gauge what the possibilities are. VLDB is not cheap, and probably never will be – given the volumes of data that these systems handle. The idea of
having CPU’s follows the need for multi-threading. Typically the rule of thumb is: 2 CPU’s for OS (or better), 2 to 4 for processing, and 4+ for RDBMS. Small systems not performing VLDB
can survive on 2 to 4 CPU’s, however when multi-threaded jobs are started, more CPU’s are better – again as long as the RAM can support the CPU utilization without disk I/O.
In VLDB situations, the number of CPU’s scales up dramatically, and the processing is typically moved to another server. Especially where the two servers are backboned together. Again,
clustering of the boxes changes the picture (as previously discussed). The RAM and disk typically scale up to match the CPU’s on the machine. There usually is a glass ceiling, or a point at
which adding new CPU’s has no affect on performance, but this usually is in the 16+ CPU range. The rule of thumb that RDBMS vendors like to tell people is: 1 CPU per bulk-loading facility or
process. Most of the activity in VLDB is either getting the data in, or out of a database. The rule we typically apply is ½ a CPU per bulk-process and ½ a CPU to handle a partition on
a query. But, the rule changes when starting to talk about partitioning in the database.
Threads typically like to run on a single CPU – this is called affinity. However threads can be bound to particular CPU’s or sets of CPU’s to run on. This goes against the
automated tuning grain, and insures that a particular CPU at a particular address runs that process. The CPU’s on the other hand (when not clustered) typically share threads, as one CPU swaps
a thread out, another idle CPU picks up the thread and can run it (as long as it’s not a I/O blocked thread). Firmware on the motherboard will work to balance CPU loads as much as possible.
All of the disks today typically have controllers. These are the motherboards of the disk devices. They include CPU, RAM (most often), and firmware. Their responsibilities are to make the disk
perform as fast as possible as well as handle fail over and buffered requests.
3.3. Disk and Disk Controllers.
Why are disk and disk controllers important to VLDB? They can have a significant impact on the throughput and opportunity to execute parallel architectures effectively. In the case of the
lumberyard, if the forklifts are the CPU’s and the lumber is the data, the storage areas that house the lumber must be the disk. If a trip is taken down to the local lumber store, there will
be different kinds of lumber stacked in different places. They don’t stack all of their lumber all together, they separate it by grain, wood type, quality, length and finish. In our data
world we should separate data by rate of change, type of data, aggregate level, and possibly quality.
In order to achieve this separation, particularly in rate of change, it is suggested that it be accompanied by physical delineation. Which means: put the different growth of data across different
physical storage areas. Think of it this way, having terabytes of information on a single disk – is this even possible? Hypothetically let’s say it is, then how long would it take to
retrieve 600 bytes from that Terabyte of information? The disk would either have to be spinning incredibly fast (much faster than 10,000 RPM) – or have an incredible set of RAM to cache most
frequently used information. The problem is, all this data in a single place with a single read/write head – not very efficient. If this is the case, then it forces all of the upstream
processes to become blocked while it reads or writes information for one process at a time; regardless if it’s parallel hardware, or a threaded software process.
The green process is running, and waiting on a disk operation to complete. The gray processes are blocked and in idle until the green process gets its result.
Simply throwing disk controllers at the problem wouldn’t necessarily solve it, neither would adding CPU’s, bandwidth, RAM, or increasing the degree of parallelism of the processes. All
the processes would still need to wait their turn to get the information to and from permanent storage. By adding another disk, we’ve increased the parallelism of the disk to two. We’ve
now allowed for two different independent processes to work at the same time, even if there’s only one controller available (while one disk is busy, the controller sends commands to the other
disk). The rule here is mathematically based, that electricity and electronic components are always faster than mechanical operations.
As indicated, the disk is allocated in non-striped mode, and represents the availability of disk to a single controller. The processes are still in blocked or I/O wait states.
In this case, the controller is after a lot of information for the top process. The other processes now wait in idle state until the controller and disk return the proper information requested by
the top process. There are many different variations on this theme, but this case is set up specifically to demonstrate the discussion. The disk is non-striped, and the entire set of data is kept
on a single disk for each process. Keep in mind that what usually happens in this case with multiple disks, is that the controller will tell the CPU/operating system “I’m ready to
execute another process on a different disk.” And if one of these other processes is ready, and actually requesting information on another disk, it will then run that request as well. However
– these operations are done in serial. Only the disk action itself (on different disks) is done in parallel. The bottleneck in this case might be the controller, and it’s ability to
handle I/O quickly.
Adding a controller allows a second process to become unblocked, and be running at the same time as the firstpocess. Thus increasing the degree of parallelism. Note: this is a sample of only one
Adding disk reaches it’s limit though. One controller can optimally control X number of disk platters before it’s “comfort zone” capacity is reached. The comfort zone would
be a combination of throughput, CPU cycles on the controller, RAM availability (buffer capacities), and disk speeds. Of course the other consideration is striped vs non-striped, RAID 5 vs RAID 0+1
(which we’ll discuss briefly here), and raw vs file system. These are all different algorithms for arranging disks and managing the content residing on the storage. Striping on disk usually
means that there are multiple disks (as shown above), and one segment of each disk contains a piece of information. To get the entire information, one would have to read that stripe from each disk.
It’s a vertical split of the file/data compared to placing the entire file or data set on a single disk. It has it’s advantages and disadvantages, but typically is better for OLTP than
VLDB. VLDB is generally better off with non-striped disk, setting up RAID 0+1 mirroring. Some Raid 0+1 configurations are pure RAW format; others may be striped file systems.
RAID 5 basically represents 4 levels of redundancy for every piece of information. Meaning that the first copy (original) is disk 1, disks 2 through 5 carry replicated copies of that information.
Raid 5 lends itself well to striping in order to maximize speed, and minimize impact. Although with hot-swap disks these days, Raid 5 can be done successfully on non-striped systems. Raid 5 is slow
for VLDB, and has tremendous impact (usually negatively) because of the number of copies that the disk device must make in order to back up your information. However it’s extremely reliant
and fault tolerant.
Raid 0+1 is the typical recommendation – especially for high-volume VLDB. This technique represents disk zero with your entire original content, and disk 1 to be a direct copy or mirror of
the entire disk. This replication technique usually requires 1 for 1 controllers. For every controller that processes a set of disk information, there is an equal controller to handle the mirrors.
Because there is only one copy of the data to make, this technique is extremely fast in VLDB. It also provides speed boosts to loading mechanisms and the software external to the disk device. Raid
0+1 is typically striped (although not required). The redundancy is 1 disk, which is hot-swappable and immediately recoverable from an original image. The hardware has gotten so good over the years
that it provides immediate isolation and fail-over with Raid 0+1. It’s helped to make 0+1 a viable reality for VLDB.
Some things to remember when implementing disk are: the number of parallel operating controllers help the throughput tremendously. A large number of controllers (32 controllers, 16 for disk, 16 for
mirror) can be very beneficial to performance numbers, provided that RAID 0+1 is implemented, and the speed of the disk averages 10,000 RPM or better. Usually the other number that makes a
difference in disk performance is the average seek time in milliseconds. Even though it’s milliseconds between seeks, it can have a huge impact on the amount and type of I/O occurring on a
disk. Of course, having on-board cache with the disk devices changes the whole picture, usually dramatically increases performance and throughput due to buffering strategies with no network impact.
Different levels of RAID can be extremely fast, if there are enough controllers to handle the I/O activity. There are different software techniques used to limit the I/O and buffering activity that
RDBMS’s make. This will be discussed in Series 4 on software.
3.4. What about RAM?
Well that’s always the question. How much memory is enough or too much? Having RAM is beneficial; no doubt about it; however typically most organizations don’t have enough. It’s
not just the RAM, but what’s on the box and the configuration of the RAM that matters. In VLDB there is no such thing as too much RAM (just kidding). There is too much RAM if there is not
enough CPU, or not enough throughputs on disk to support the RAM, or the processes are not fast enough to make use of the RAM.
Sometimes, the configuration of the RDBMS acts like a speed governor on a car, no matter what is done to the hardware, or software configuration – even adding RAM, all speed stays constant.
Which of course raises the question: how exactly do you measure performance of VLDB? Very carefully (ha, sorry about that)… There are different methods of tuning which will be discussed in a
future series on performance and tuning of VLDB, things to think about. For now, the focus of this topic is RAM. Slow RAM is worse than no RAM in some cases. There are different Mhz (megahertz)
speeds of RAM supported by motherboards. The slower, the cheaper. Most of the time this is not an option in the Unix environment, nearly all RAM is fast enough to meet the needs of the system.
In case there are questions, just what is a “typical” environment for RAM? What about the different sizes to handle small to large, including VLDB? While there is no definitive answer
to these questions, we will explore the most common well-performing environments, with exception to hardware technology “fixes” such as Numa-Q and clustering of machines.
This table is only misleading if one side of the system is considered. These projections are guidelines meant to be considered when dealing with BOTH sides: getting data IN and OUT *query time
– discussed later. Considering the qualification of VLDB is to handle 4 terabytes or better with what users call: reasonable query times. Of course that depends on the definition of
“reasonable query times”, and “reasonable data refresh or load times”. One example definition of VLDB is above 10 Terabytes, and growing every month.
So you can see, the hardware of today will not be the hardware of tomorrow. In fact, by the time this article is published, these specs should have changed (increased). The reason is the rate of
data in a VLDB system is typically growing at a very rapid rate, forcing the hardware to handle more and more information in the same time frames as before. So what does this mean to the world of
Microsoft and Intel? Basically, that if a clustering solution such as Numa-Q is used, the combination of Microsoft and Intel can be scaled appropriately to handle VLDB easily. But it also means
that the proper disk devices (such as EMC) be implemented to handle the disk operations.
In other words, VLDB can get expensive fast. With the above table, the machine is considered stripped bare, except for the database system, and the operating system to handle the data. This affects
the RAM. RDBMS systems love RAM, and the more RAM cache they have, the faster they go… This is generally a rule of thumb anyhow. However, as indicated by the chart, if the rest of the
hardware isn’t scaled appropriately – performance will hit a glass ceiling, no matter how much RAM is installed in the machine.
RAM serves a much more important purpose than just a place to address user data. Its other real task is to provide buffering capacity. RAM is a set of data storage much faster than any physical
device – because of this; RAM is used to buffer I/O. I/O is not just disk I/O, but CPU, keyboard, monitor (display), mouse, and any other hardware device attached to the machine. If 100% of
the RAM is allocated for user tasks, OS tasks, processing tasks, then there is nothing left for buffering purposes. This can kill a machine’s performance. In the Unix and NT world one can see
a high level of activity in the /tmp (temp) or pagefile areas when RAM is overbooked. RAM buffers provide tremendous speed advantage, but like anything else – too much of a good thing can
turn the tables the other way. Likewise it would leave nothing available for the OS or user processes of housing data. What’s being suggested is a good balance. Have enough RAM to allow the
system and hardware to provide appropriate buffering, while still leaving a majority of the RAM available for user/OS and hardware buffers. In this case, user buffers are being defined to include
RDBMS systems that access RAM buffers of their own. In the next series we will explore the RAM buffers of different database systems.
Lest one forget: the RAM is also used in multi-threaded systems and CPU’s to swap live processes in and out, when waiting for blocked I/O. If you have questions about the terminology, feel
free to post them on the Inner Core. CPU’s do work with RAM, which is why the speed of the RAM chips is critical, not to mention utilization factors. If 100% of the RAM is allocated, and
processes are attempting to run (say 23 processes on 6 CPU’s), then the CPU’s will be attempting to swap their processes out. Which in turn forces the operating system to pre-empt other
processes on the CPU with a virtual RAM process to swap the original processes to disk… Thrashing at this point on both RAM and CPU can occur very fast. The best rule of thumb I apply all
the time is to never run above 75% capacity on CPU and RAM resources (if at all possible), this provides the OS with enough breathing room to prevent thrashing from occurring.
3.5. Swap Algorithms
Now that’s interesting… What do swap algorithms mean? I thought that having enough RAM, and plenty of CPU that nothing would be swapped? So what’s really going on? Ok, as
discussed briefly in the section above, thrashing can occur. But that’s the short answer. The long answer is: swapping algorithms are built in to the OS, all the way down to the firmware on
the hardware board. These swapping algorithms actually manage the processes as they execute across the CPU’s. For instance: the clock always has interrupt priority over everything executing-
the CPU has one or two machine instructions which it must execute every (x millionths of a second). The inherent swap algorithms make sure that the interrupt gets serviced (in other words) the CPU
is told to stop what’s it’s doing and execute the clock update algorithm instructions (which update the RAM area holding the clocks’ value).
This is a very rudimentary explanation. There are many other books which describe this functionality at the hardware level, therefore it will not be discussed any deeper than this, except to say
that it becomes a factor when running VLDB database processes which manage I/O, RAM, and user SQL queries. As discussed – if RAM is at 100% utilization, thrashing begins and if it’s
thrashing at the CPU level typically the machine freezes and needs a cold-boot to be reset. The main reason we choose not to elaborate on this is that awareness of it is enough. It cannot be
controlled at the firmware level, nor can it be configured. The only solution is to add RAM to the machine.
3.6. Parallel Network Cards, and Network Traffic
After spending the money on machine hardware, the bottleneck shifts. Bottlenecks in VLDB are an occurrence that shift around but can never really be eliminated. There has yet to be software or
hardware that’s specifically focused on identifying and eliminating (automatically) bottlenecks for huge volumes of information. That being said, the bottleneck has now shifted to the
network. The focus of this part of the discussion is to pick apart the functionality of the network cards, hubs, routers, and network traffic itself.
VLDB is notorious for quadrupling (or better) the amount of information on any given network. With this in mind it becomes important to address the potential issues of bottlenecking on the network.
The network could be thought of as a series of pipes through your city. If the pipes aren’t big enough to handle the flow of water (for example) then the city crawls to a stop when it rains
(been there, seen that….) Ok, so what’s the big hoopla?
The big hoopla is this: most VLDB installations or more typically: high volume transaction environments today – the installation consists of a 1 gigabit Ethernet backbone hosted on a VPN in
use only by the servers themselves, meaning that NO client traffic crosses through those lines – allowing the servers exclusive access to the disk devices mounted to the server network. VPN
is virtual private network, and is sealed off by routers with specific tables – the routers are like post offices filling PO boxes with mail. If the mail goes to another city, it’s
forwarded on, it never reaches the carrier’s route.
IP (Internet Protocoll) works today by sending packets of information across the network, relying on routers to “route” or place the information/send it to the right receiver, as well
as deliver it back to the right client. However, 15 clients on the same router (aka HUB) will all have the same subnet. It then becomes the job of the network card in the machine to determine which
packets are for its address, and which ones to ignore. All 15 clients receive ALL the IP packets sent out on that subnet.
So why is this important to understand when a server is involved? When more servers that are setup on the same subnet (VPN or not) – there is more packet traffic that will clog the network
lines. This includes NFS mounted drives such as network appliances. Without sufficient controllers, and enough bandwidth to handle the traffic, the servers could slow to a crawl because the network
is overloaded with packets. This is also why it’s important to place the disk locally on the server if at all possible – eliminate the network between the I/O and the server machines
themselves. This leaves the network between the servers to handle the real traffic – information exchange between multiplexed machines. Again, when using Clustering hardware, this
doesn’t necessarily hold true, because clustering hardware opens it’s own networking protocol and shares information on a backbone (below the network layer) and doesn’t usually
even use the standard TCP/IP network to share information. It’s a proprietary solution (which helps explain the cost).
So, what can be done with VLDB? Adding network cards is one solution, but typically that is only done to handle fail over – if a network card goes out, the other picks up where the other left
off. That also requires some architectural and operating system solution. What usually is done for VLDB is bandwidth of networking. Items like Fibre Channel are installed between the machine and
disks, or 1 gigabit VPN Ethernet between servers, backbone between servers, or fast-wide internal SCSI is hooked up with disk.
Now that we understand some of this, what happens with IP configurations and packet sizes? Aren’t all of the default sizes of IP packets the same (512 bytes)? The answer is it varies,
depending on the network cards, bandwidth, and router capacities. These days IP (the protocol) itself has defined extensions, allowing software and hardware to work together to sometimes change the
default packet sizes traveling across the network. In most cases though, the hardware changes the RAM buffering algorithm it contains and still uses 512 bytes across the wire. By reallocating the
RAM buffer, and setting up the RAM buffering allocation the network appears faster. Buffers on both sides of the network (sender and receiver) must be set appropriately to the right sizes in order
to see improvements to speed. Most RDBMS vendors these days allow “packet tuning” or buffer sizing to be controlled and altered from the software level.
What does this mean to the hardware layers? Sometimes hardware vendors offer compression/decompressing with their buffering firmware. This can help – but the cost is: twice as much RAM buffer
is needed to compress the set of data. This allows for the original set of data to be held resident in RAM while it’s compressed in to a new buffer. This also means that there is some
firmware and on-board CPU activity to compress the actual data. Well now – you mean processing/threads are involved in compression activities on the network card? You bet! So, can we add
CPU’s to the network card to help it out? Not usually. These cards come pre-configured special from the manufacturer. This is one of the many reasons why clustering hardware is so expensive.
It contains a vast amount of firmware, RAM and more CPU’s just to process and handle the networking traffic across machines. These clustering cards are essentially modified network cards
acting as mini-computers on their own. They perform special tasks that can only be controlled at the hardware level.
Can multiple Network Cards help? Sometimes – if the packets can be split and the work can be split across TCP/IP addresses from a database perspective. But it isn’t necessarily
recommended as the first alternative for performance and tuning in a VLDB world. What about network traffic? It’s been discussed a little above, however network traffic can be a problem.
Typically processing in VLDB is attempted to be co-located with the RDBMS engine. While TCP/IP listeners are started and utilized – shared memory connections are used more often. This is
discussed in the next series on Software.
3.7. Measuring Resource Utilization
So we’ve hit bottom have we? Now we’ve discussed nearly all the hardware components at a very high level. Again keep in mind that there are many books written on each of the subjects
above, and will serve as wonderful resources respectively. The idea and focus here is to keep it all relevant as to why it’s important to consider these things when thinking of VLDB. It all
comes down to understanding the hardware. So, measurement should be a snap right? WRONG. There are many different factors, which only some of the most expensive resource utilization software begins
to touch. This is a growing industry that is still young in the software age side. Resource measurement utilities are a double-edged sword. They can help, but they can also lie.
What about the standard monitors that come with Unix/NT? First off – NT by itself is incapable of handling VLDB on it’s own. It requires special hardware like Numa-Q, then of course the
software to monitor and balance Numa-Q system comes with it. In Unix, the OS tells the operators only part of the story. The resource monitoring tools provided with Unix will tell you on average,
the resource utilization. Anyone with a statistics background should be able to see that the highs and lows are thrown out, and it’s an average over time across the # of resources available.
That’s just it: averages lie (in this case).
Consider the following: 1 car at a gas pump, pumping gas. There’s only 1 gas pump, and it’s utilized at 100%. Ok fine, works with one pump. Now, two pumps – one diesel, one
unleaded. 2 cars pull-up, both needing unleaded. The second car waits while the first uses the unleaded gas pump. The unleaded gas pump is at 100%, the diesel is at 0% utilization. The average pump
utilization is: 50%…. (100% + 0%) / 2 pumps. The same thing happens when using resource-monitoring tools against CPU’s and RAM. No other way to put it. We need to understand
“why” one is at 100%, and the other at 0% in order to do something about it. In this case, the only solution is to install a second unleaded pump. It wouldn’t do any good to
install another diesel pump would it?
On the other hand, using simple resource utilization tools can help Systems Analysts gain insight in to hardware activity as a whole. These figures must be considered as part of a larger whole in
order to be a help. For instance, when we look at CPU utilization, we also look at RAM, disk, network, and controller activity. Once an understanding of how the data is flowing is established
– a different set of monitoring tools will be necessary to pick apart what is truly happening. On Unix systems there are good operational commands provided like “mpstat” which
show resource utilization, and break it down by CPU, disk controller, etc.. What is important here is the whole picture – broken down by the hardware sub-components, not necessarily just the
averages (like those provided by perfmon).
The more expensive resource utilization monitoring tools take this in to account, and help measure blocked I/O, and blocked RAM operations. Once the bottlenecks are identified properly then
the real problems can be addressed. Therefore: most free or standard resource utilization tools don’t tell the whole truth. They will show CPU utilization across 6 CPU’s at 60%
on average. Ouch…. Particularly if 2 are being burned up on I/O requests, causing the rest of the system to wait and wait and wait.
In the next series we explore the RDBMS, and software sides of these issues including parts of the operating system. VLDB best hardware performance is achieved by performing a balancing act between
all the available hardware and software. If the balancing act is not performed or one side of the scale is heavily utilized then VLDB installations will perform poorly. This includes the systems
that are being operated under clustering hardware.
3.8. Operating Systems
There are three major types of operating systems that VLDB is capable of running on: Unix, Teradata, and MainFrame. NT and Windows 2000 is only capable of VLDB when the machines are clustered
together (as discussed in the previous article). Linux is up-and-coming, and eventually will be capable of running VLDB as the RDBMS vendors convert their products to be compatible on Linux. So
what does this mean? Do I have to upgrade? Not necessarily, not until your machine pushes the limits of VLDB. Working on any operating system requires a particular architecture in any
software-based tool (outside the operating system).
The analogy here is: if you have a Ferrari (the hardware), and it’s capable of moving at speeds of 150 mph, before you jump in and go – you might want a driving class. The class teaches
you (the operator / software) how to manipulate the machine at the proper rates, with the correct clutch movements, and how the car handles in turns at high speeds. If you don’t learn how to
drive the new vehicle – you may be in danger of crashing the machine. Operating systems, and databases must understand how to manage, and manipulate the hardware properly. Hardware upgrades
frequently require software upgrades (as well as operating system upgrades). Software (you – the operator) is a critical component to the hardware’s success in a VLDB mode. The more
load that is placed on a machine and it’s resources, the more critical it becomes to have the software be capable of correctly balancing it’s operations.
The different operating systems all provide unique solutions to handling massive load, however they all have one thing in common: true pre-emptive multi-tasking capabilities with protected mode
memory. The only operating systems that don’t contain true pre-emptive multi-threaded tasking are NT and Windows 2000. These two systems must interrupt existing tasks in order to execute I/O
on disk. This is well known as full-system blocked I/O. The procedure of swapping to a single swap file during the middle of execution causes system performance to degrade significantly. It also
means that the NT and Win2k operating systems must stop all other processing in order to perform a disk swap to it’s “pagefile.sys”. It is incapable of running this swap
Unix, Mainframe, and Teradata all of other options – and are capable of swapping processes, RAM, and other items to swap files when necessary at the same time. The result is full-parallelism
and complete scalability on CPU’s and RAM. Limitations abound on NT and Win2k which keep it from rising to the ranks and power of the other operating systems. Scalability is key in the
success of VLDB. Again, these limitations are overcome by purchasing hardware such as Numa-Q at a high cost (which basically replaces the motherboard, almost entirely). The basic algorithm utilized
in Unix is: if there’s enough physical RAM, store it there – until the RAM is needed by a higher priority process or thread. Then, swap it to it’s own disk swap file. Each process
can contain it’s own swap file – and doesn’t need to follow the issues of a single swapfile like “pagefile.sys”.
Fragmentation will be discussed later, but it can be thought of this way: one file that sits around for the life of the machine as RAM swap – is heavily utilized, and is severely fragmented
over time. Whereas, in unix, each swap file is deleted at the end of the process – and re-created the next time the process starts up (only if needed) – thus fragmentation is not an
Because of some of these limitations, NT and Win2k cannot keep up with the hardware. Moving from 32 bit to 64 bit hardware, unix (for instance) must be recompiled to address 64 bit. Not a problem.
Adding CPU’s – the only limitation is the motherboard (aside from cost of course). Adding RAM – the limit is the addressable amount available in the bit level (64 bit can address
much more than 32 bit). Of course the motherboards on NT and Win2k are still 16 bit, and emulating 32 bit architectures through the operating system. That’s why the NT and Win2k hardware is
still limited to 4 GIG of RAM, and 4 CPU’s. Microsoft has not yet made the jump through the hoop allowing Intel to upgrade its hardware capabilities. Again, cost is an inhibitor and Microsoft
doesn’t want cost in the way of progress.
There are apparently varying degrees of opinion on NT and Win2k, as to if they support pre-emptive multi-tasking. I have heard directly from Microsoft that it does not support real time true
pre-emptive multi-tasking, and have witnessed this behavior. On the other hand, there are apparently books out there of a different opinion. Until the machine begins to perform like a similar
machine in a Unix environment, and until the I/O blocking issues are removed, it should be said that we believe it is not true pre-emptive multi-task capable.
3.9. Configuration and Environment
Configuration of the environment is critical to success of VLDB. On Unix, and other operating systems the Administrator can configure the maximum amount of RAM, disk, stack space, total time, and
priority of processes. This flexibility allows the Administrator to rank certain processes higher in the food chain than others. It also allows the Administrator to set the processes for maximum
amounts of RAM utilized on the machine by user’s processes. This can be dangerous in VLDB situation. The amount of RAM for ETL processing, Bulk-Loaders, Databases, and operating systems
should be balanced in the architecture. However, only the RDBMS engine should be limited to a maximum – but that needs to be done in the configuration of the database (which will be discussed
in the next chapter).
Configuration of the operating system is critical if it is to perform properly. In general it’s enough to leave the priorities of processes alone, providing a sound architecture is in place
for the overall use of the machine. As far as the environment goes – there usually doesn’t need to be any additional settings (outside those required by 3rd party software) in order to
get the Operating System to perform. Although certain OS boasts of hidden environment variables which “could” increase the performance of your machine in specific instances.
So what’s the analogy here? Back to the Ferrari – without the correct gas, a good tune-up, clean air filter, and proper oil – as well as all the rest of the general maintenance,
the car engine would come to a grinding halt. No matter how good the car (hardware), or the operator (software), the car has not been maintained or configured properly for operation. The driver
would be unable to start the engine. What’s the point in having a Ferrari if you can’t drive it (don’t keep it maintained)? The same can be said for expensive hardware/software.
More often than not, the money is spent on hardware and software to handle VLDB, but none of the skills necessary to configure, setup, tune, and maintain the environment is available. Nor is the
company willing to spend the time and money to send it’s employees to the proper training so that they can retrieve the maximum ROI from the machines and software they purchased.
Configuration also includes the proper number of disk controllers, amount of physical RAM, number of CPU’s, speed of CPU’s, bus speed, disk speed, and network speed. All of this
directly contributes to either the success or failure of the VLDB system in question.
4. Software and VLDB
In this chapter of the series we focus on the setup, necessity, proper sizing and usage of software resources; we do not discuss pricing options. If you have not been reading the series from the
beginning, we recommend you take a moment to do so. The concepts in this chapter build on the previous chapters.
This chapter focuses on installation, setup, and provides an overview of the RDBMS engines which are installed on the systems to handle VLDB data sets. Without the proper configuration and setup
the system could fail. It would be similar to having the Ferrari, but attempting to pull a camper – the Ferrari isn’t built to pull trailers (over utilization of the available
resources). On the other hand, it could be like purchasing a big-rig to pull a rowboat trailer – underutilization of the big-rig engine. Why use the wrong machine to handle the incorrect
load? If you have a Ferrari, you are looking for performance, speed, handling. If you have a big-rig you’re looking for towing capacity, strength, and long hauling. The larger the machine,
the more information it can handle in shorter periods of time, however the architecture of the jobs / processes / threads plays a huge role in the success or failure of the machine’s
performance as seen by management.
4.1. Configuration and Environment
RDBMS engine configuration is an important part of the balancing act on a VLDB system. Over utilization of the system by the RDBMS engine can lead to slow, sluggish or even system failure
performance. Under-utilization leads to inefficient use of resources – slow and sluggish performance of the RDBMS while the system shows “low load usage”.
Configuration of the system plays a significant factor in the success or failure of a VLDB “hardware ready” system. It is vital that the systems administrators take the time to measure
the capacity, throughput, and load capabilities of the system they wish to install software on. Without the correct consideration – it becomes difficult to achieve the expected results. For
example, if you have a ¼ ton truck, and you want to haul a ½ ton of rocks for landscaping, you run the risk of breaking the struts (among other things); not a very good decision, is
it? Well – in this case, the engineers building the truck and labeling it as a ¼ ton, have done all the analysis. They’ve considered the engine size, pulling power, capacity of
the bed, tire pressure, strut support and so on.
Now, back to systems. In the software world to many Systems Administrators are told to continue installing more and more software on a machine because it “appears big,” and looks like
it can handle it (please refer to Series 2, and 3 for information on load balancing and performance monitors). This is a poor decision, it’s like loading the truck with more rocks because it
“appears big,” and looks like it can handle it. Truth of the matter, software sizing and hardware sizing has always been somewhat of a mystified art form – protected like some
black magic. Really it’s not that hard. Common sense and a good set of experience, and the systems administrator should be capable of stating up front: this machine can handle X and Y but not
Z (in terms of software packages that should be installed).
For example, the cases I’ve run in to in the past vary from 2 instances of an RDBMS engine to 4 instances on a 2 CPU machine, Intel no less. I’ve also seen machines with RDBMS engines,
java engines, ftp engines, app servers, web servers, ETL applications, and transactional capture applications. Those machines are extremely overloaded. There are beginning to be some basic
recommendations or rules of thumb that one can follow when considering the configuration and setup of a machine with regard to software:
- ETL applications and RDBMS engines play really well together in a single sandbox (MSSQLServer and anything else is an exception).
- Single instances of RDBMS engines typically do much better than multiple instances, combine the tablespaces / databases on to a single instance and manage the data that way. If there is a
problem with security, or fail over reasons require a second instance, move it to another machine.
- Application Servers and Java Run Time environments play well together on the same machine, as long as the RAM is properly scaled.
- Application Servers and JRE’s don’t play well with Web Servers (typically).
- Transactional servers and services should be separated on their own machine as well.
- File Servers, Security Services (LDAP etc…) should be separated on their own machine.
By separating the configurations, a company can achieve maximum performance – and the SA can then tune the configuration appropriately for the required environment. Typically when
there’s too much on a single box, anything the SA tunes can decrease performance of a different piece. This makes it dangerous for the SA to really do anything other than defaults on the
machine (which are a significant performance hit in VLDB). Keep in mind this series is focused on massive sets of data (as described in Series 3) being passed through the ETL in to the RDBMS, or
massive sets of transactions being utilized on the source feed systems. This series is not targeted at the small scale systems employed by small businesses.
Bottom line for configuration: we don’t attempt to move 5 tons of rock or earth, using a ¼ ton pickup truck, or even a 2 ton pickup truck – it takes a big investment in a large
scale machine such as a dump truck, then the 5 tons of rock can be moved effectively and efficiently.
Recommended configurations for RDBMS servers is as follows: (most of these are discussed later in this series)
- Limit the single instance to at least ½ of the available physical RAM in the system.
- Limit the number of available CPU’s (engines) to at least ½ the physical available CPU’s to avoid pegging the machine, and to avoid process thrashing (see series 2 on
- Enable parallel query options if at all possible.
- Increase the amount of shared global ram (SGA for Oracle, Global Buffer for Sybase) so that the parallel engines can do their job properly.
- Partition the source and target data sets, again allow the parallel engines to do their job.
- Separate the indexes from the data, move to a RAID 0+1 configuration, stay as far away from RAID 5 as possible for performance reasons.
- Change the optimizer from Rule to Choose (Cost or Rule based)
- Use hash based indexing if possible.
- Partition the indexes as much as possible.
Environment is the next big player in this setup. Without the proper environment, the proper configuration won’t help. Most cases the environment consists of correct semaphore settings,
proper location of temp and log spaces, as well as removal of old software or old versions of the RDBMS engines or clients. The typical recommendation for environment TEMP space is to have 2 times
the amount of physical RAM allocated on the /tmp (TEMP) space free (on Microsoft machines it’s the pagefile size). The other rule of thumb is to have at least 40% or more free on the temp
device at any given time. The reason for this is: the disk manufacturer’s tolerance for determining a hard drive is “bad” or unshippable is up to 10% failure. Which means: if you
run at 60% temp, and the system fills to 90% temp, the machine is at a much higher risk for rate of failure due to hard disk errors.
In order to understand this, the differences between Windows and a basic Unix system will be briefly discussed here. In Unix, each process / thread is assigned it’s own temp file (in the TMP
or TEMP directory). This allows the system to maximize performance, minimize thrashing, and minimize blocked I/O (see series 3). Because of this – each program can run in sync with the
others. Not forgetting that this goes for the data that each program uses as well. In NT/Windows, the pagefile is a single threaded, single file system for ALL processes and ALL RAM allocations.
Thus making it more difficult to achieve large scale performance (again without the use of hardware add-ons).
We have evaluated numerous cases where hard disk failures occur with large sets of data, and where the disks are running at 80% and above (full) on average at any given moment. To find the
potential failures on Unix, it is necessary to examine the multiple log entries in var/adm/messages. This generally provides clues in to disk read/write/seek errors that occur. Even sporadic errors
can wreak havoc on a system, causing the system to “swap in” a bad image of an executable, or the data it’s dealing with. On Windows, it may be necessary to use a utility program
such as chkdsk, or Norton Utilities to check the disk and it’s files for potential errors. If the errors are left unattended eventually the system will crash and the disk must then be
reformatted. By that time it’s usually too late to recover.
In setting up the environment, the location of the RDBMS TEMP space, RDBMS Log Space, and RDBMS engine code is critical. These should not conflict with each other, nor should they reside on the
same disks as the data they are attempting to manage. To frequently a sample is setup with the RDBMS engine and then is populated with small amounts of information (while the organization
understands that this is to grow very large). This works for the proof of concept, so the project receives the go-ahead, the data set then grows and after 6 months users begin to complain about the
speed of the system. Why can’t anything be done to improve performance without spending more money (they cry….)? Well the truth of the matter is, there is something that can be done in
this situation. The rule of thumb is that RDBMS TEMP space doesn’t need redundancy, or backup – so why not put it on the Internal (typically SCSI) disk? Which is both very fast, and
hardly used (except for the TEMP OS space that’s out there). If there is a potential conflict with the OS Temp Space, then the RDBMS TEMP space should be moved to another internal disk which
also does not require mirroring or recovery. However, the place where the RDBMS Temp space is placed should operate under the same rules about free space as discussed above for the OS Temp space.
Typically the optimal configuration is: multiple controllers internal, multiple internal SCSI disk. With the OS Temp, and the RDBMS engine Code, the OS, and the ETL application code all on the
first internal disk, the RDBMS Temp Space on a second internal disk, and the RDBMS log space on a third internal disk. With the RDBMS Temp Space disk being formatted RAW. This significantly reduces
contention for the data device, and provides a solid (very fast) platform from which to launch all the system’s operations. This frees up the disks formatted for Data and Indexes, and allows
minimal contention across external disk lines for optimal speed. It’s like tuning the Ferrari engine with larger carburetor’s, larger header block, and a racing cam – as opposed
to utilizing what’s sold in the car. It increases the top speed of the car.
When I started this series, I probably mentioned (if I didn’t, I should have) that a) tuning is critical in VLDB, but b) only tune where and when necessary, unless you know ahead of time that
this system is geared to grow in to the VLDB space within 1 to 6 months after launch. This keeps the focus on the right items necessary for ongoing success in a VLDB operating environment. I
probably also forgot to mention that much of this is mathematically based, and can be proven through numerous algorithms and theorems, however these documents are not meant for the Math PHD –
therefore the explanations will speak to the basic physics which support the laws, rather than the actual calculations which prove these concepts.
4.2. RAW versus Files
Many different RDBMS engines permit two styles of formatting disk, one is RAW, and the other is File System or Files. The first of these is called RAW because the RDBMS engine manages the following
factors: disk blocks, block size, distribution, fragmentation, free space, and overall speed of the disks. The second of these options (files) is simply a file system, where each of the tablespaces
or databases is created as a series of one or more files. It is then up to the operating system to manage all the components. Thus adding a layer of management that often slows performance,
especially in a VLDB situation.
The file system is good when there is no time for DBA’s or SA’s to architect the RAW solution, or when running Raid 5 instead of Raid 0+1 (see series 3 for this description). The file
system provides quick and easy management of the database; the size and location of it’s information, however as with anything that’s easy (not complex), there remains a lower level of
power to tune for performance. All the work is left up to the OS to manage the files in a file system thus as mentioned; adding a layer of overhead between the database and the information. When
this layer is added, VLDB RDBMS slow down. Why? The database must pass the information out of the engine first when talking to the disk; also because the OS is not typically tuned for huge or
ultra-large files (in terms of speed). This is a hindrance for VLDB systems where the time it takes to format devices (adding file space, extending segments) is a significant cost.
In the RAW system, the OS does not know, cannot see, and does not manage the RDBMS’s information. The manner in which RAW is setup, it is up to the RDBMS engine to read/write directly to a
specially formatted disk. In a RAW format the disk is formatted by the RDBMS engine; thereby introducing complexity that also introduces tuning points and can increase speed. RAW formatted disks
are handled, managed, and altered all through the RDBMS engine. Because of this items such as the parallel query options, and multiple engines (CPU’s) can be more efficiently utilized. Not to
mention the RDBMS’s ability to tie it’s global RAM area directly to a disk map.
Under RAW, fragmentation is handled and minimized by the RDBMS, the disk can be reorganized (and often is) while the engine is idle to make larger contiguous free blocks. While both RAW and file
system can be split on different disks across Raid 0+1 (which should be done for performance reasons of VLDB), RAW partition block sizes can vary depending on the database size; whereas in the file
system, each file is dependent on a single block size that the OS manages (which cannot be changed). The typical block size for RDBMS engines in a VLDB situation should be between 20k and 50k, for
indexes should be closer to 5k to 15k, and for wide tables / data sets should be closer to 45k. Narrow tables, or a series of stagnant tables should be between 15k and 25k.
We can liken this back to the Ferrari, we shouldn’t attempt to carry a 27inch TV in the back seat of a Ferrari should we? I don’t even think it fits in the doors, much less the back
seat… Typically this should be carried in a larger cargo space (think of the TV as one row of data). Now you have 100,000 of these TV’s. Are you still going to use a pickup to haul
lots of small loads to and from the destination? Probably not – you’ll probably hire a few hundred big rigs to haul all the TV’s in one shot. The pickup represents the Operating
System in a File System setup, whereas the big rigs represent the RAW setup, where you have control over how many rows (TV’s) can be placed in to a single block (one rig). While more complex,
it’s easier to manage and much quicker to transport the data.
In this instance having control over the size of the rigs is equivalent to having control over the size of the blocks. Where this example somewhat falls apart is when the TV’s are 5 inch
TV’s, in the electronic world – smaller blocks for smaller sets or more stagnant sets of information are better. What we can liken this to is: say you need to find 100 of the 100,000
TV’s, and you have all 100,000 5 inch TV’s in 25 big rigs, that’s a tall task (particularly if these TV’s are numbered, ordered, and the ranges of these numbers are posted
on the outside of the big rigs. It would take you twice as long to find all 100 TV’s in 25 big rigs, than it would if you had 50 smaller rigs for the smaller TV sets. By doubling the number
of blocks (making them smaller), you make the searching easier – more ranges of smaller distance between the ranges. With the large TV’s the same is true mathematically –
therefore larger rigs but with the same number of TV’s in the truck will also enable you to find 100 big TV’s faster. Which, basically if you had 50 small trucks, and 100,000 large
TV’s it may take you twice as long to search more trucks to find the 100 you want.
The same goes for the RAW file system, simply setting up a “one size fits all” block size won’t necessarily work. It takes time, effort, and analysis to get it right. However once
correct for 80% of the information, speed becomes nearly a non-issue and the ability to handle very large information sets becomes simpler for the RDBMS engine. Below we discuss the concept of disk
fragmentation, and how that affects what a VLDB system can handle.
4.3. Disk Fragmentation
Disk fragmentation is a huge cause of slowdown in a VLDB situation. If you’ve ever run a Windows operating system you’ve seen this first hand, even in DOS it was a problem (hence Norton
Speed Disk for instance). What happens? A fresh new disk is formatted, and Windows installed in the first few sectors. Then as you continue to install other applications the entire system seems to
slow down (but the boot time is still relatively fast). However – over time, as the machine is used, documents written, files moved, deleted, and copied the whole system (including the boot
time) appears to slow down significantly. What’s happening? The disk is becoming fragmented.
What does that mean? Fragmentation basically means smaller pieces (but more numerous) are spread across the disk in a random fashion, thus making access of any given file slow, due to the fact that
the system must seek out all the small pieces and put them back together again for a coherent piece of information. Ok, that’s all tech speak… It’s like this: if you have 100,000
TV”s, at first they are all on shelves in a single large warehouse, indexed and ordered – you have a very efficient and quick method of retrieving what you want (25 TV’s of stock
number 200 to 225). They are all co-located on similar shelves right next to each other, and preferably they are ordered. This is like a disk with NO fragmentation. Now, spread these 100,000
TV’s across the city in to 25 different warehouses with no order to their distribution – then attempt to locate all 25 TV’s of stock number 200 to 225, is it more difficult? You
bet, why? Because they could potentially be scattered across 25 different warehouses (fragmentation of a disk). Each of the TV’s could potentially reside in a different warehouse. Not only is
the search time consuming, but the collection of the 25 TV’s in to a single shipment takes an order of magnitude more time.
So what else is going on? When you receive a shipment of 100 more of these TV’s, in the single organized warehouse example you know right away if you have room on the shelf for the new
shipment, and at which warehouse the shipment can be stored. Just like the RDBMS or Operating system knows where the “free space” is for new data, new files, new documents, and can
stack them all together in a single portion of free space. On the other hand, if your warehouses are spread across the city, and your organization of the TV’s and knowing where the free space
is – is not very good, you have to search the whole warehouse (and probably each one) to find enough free shelf space to place each of the 100 new TV’s. It is a very labor intensive
process and takes a lot of time. It’s the same with the operating system or RDBMS engine, attempting to locate a large enough “shelf space” (free space) to put all the new data.
Things slow down mathematically when they are disorganized, and not well tracked. The level of effort to keep things in order sits on an exponential curve. The more complex the operations, the
longer it takes by an ever increasing factor.
So what can we do about it? Well, there are a few things. By setting up the disk space as RAW the RDBMS engine will de-fragment the information as best as it knows how during idle time. It will
move blocks around and re-order information according to the indexing that it needs to reference the information later. By setting up a file system, there isn’t much you can do. You are at
the mercy of the Operating System. However, you can run de-fragmentation programs on the disk where the files reside. This will speed up the references to the files, but any expansion or growth of
the files will cause havoc, because all the files will be stacked back to back – so when they grow, they will allocate free space beyond the entire file set, thus compounding the
fragmentation problem. In RAW, new blocks will be allocated – all the blocks are the same size and the maximum amount of blocks is fixed until a new RAW disk segment is added. However, there
are a few things to remember when running a de-fragmentation program on a File System based RDBMS:
- The RDBMS MUST be taken off-line, or at least that database / tablespace, or the OS which has the file open, may not allow you access because of corruption possibilities
- The de-fragmentation at a file level doesn’t solve the entire problem
- This operation must be run periodically, thus incurring huge overhead to the system and negatively impacting the user base.
- If a de-fragmentation program quits in the middle or is interrupted, it may leave the database/file in an unusable state, forcing a restore. In VLDB situations this is not an option, because
the data sets are much too large – and the overhead is way beyond what the business is willing to bear.
There are typically commands in the RDBMS engine to de-fragment RAW databases, which can be executed on-line (hot, or during operations), or scheduled during idle times. There are also methods of
managing the fragmentation potentials by partitioning your data sets, distributing the data sets across multiple differently partitioned RAW segments, and cluster indexing techniques. All of these
items are discussed below.
Partitioning of the databases has long been overlooked. Some of the people consider this to be too difficult to implement or too complex to manage. Others understand that partitioning is nearly
impossible on both: file based systems and Raid 5 configurations. If implemented on these systems it causes more harm than good. What is partitioning? We’ve discussed it already in different
examples, like multiple big rigs to haul hundreds of TV’s. Only in that case we were stating that the big rigs were blocks on the disk. Partitioning is the segregation of information by a
particular key (single or composite) allowing the information to be read or written in parallel across multiple disk segments.
Therefore, Raid 0+1 setup is just the first step to successful partitioning, followed by RAW formatting. Once these two components are in place, partitioning can be considered as an extremely
effective method for dealing with very large sets of data. The key to partitioning your data correctly is selecting the right key to divide the information in to equally accessible parts. One can
liken it to cutting a pizza pie. If we cut a pizza pie in to too many small pieces – it’s difficult to pick up and eat. On the other hand, if we cut it only in half, the pieces are too
large to pick up and eat. Generally each size pizza is cut in to reasonable and often equal size pieces that fit neatly in your hand.
This figure represents a disk configuration of Raid 0+1. It is a logical representation demonstrating how a database table might be partitioned on to different disk sets. Each disk representing a
physical partition of information.
Data partitioning (pieces of the pie) should be split up in the same manner. Too many small partitions become difficult and awkward to manage, as well as causing problems with fragmentation. Too
large, and the partitions can’t serve the purpose of access. In other words, try not to defeat the purpose of partitioning by cutting the data in to the incorrect segment sizes. For instance,
a small pizza usually is cut in to 4 equal slices, a medium in 8, and a large in to 12, very large in to 16. This is usually the manner in which different pizza sizes are handled. In setting up the
partitioning, consider the overall size of the table in 3 to 6 months from now, and attempt to map the growth rates of different potential keys. These keys should cut the information in to equal
slices as the data set grows over time. Generally once a year, or twice a year it’s good to re-evaluate the key selection for partitions (unless the analysis has been done correctly up
front). This will continue to ensure the success of the partition.
Ok, so why does partitioning the data really help all that much, and what does it affect? It helps by allowing the queries to access only the information on the disk segments that it needs to
return the proper results. In a manner of speaking, it also helps the parallel engine designs, and the multi-threaded access points across the disk. It increases parallelism by a factor of X where
X is the number of partitions that have been separated on to different disks. Once a partitioning scheme has been setup, it will be necessary to partition the indexes as well. The whole idea is to
reduce I/O contention by allowing more parallel processes to access additional information quickly.
One of the most forgotten items in VLDB is the index. How they’re constructed and where they’re placed can make a huge difference in the success or failure of a VLDB implementation.
There are two types of indexes: single key, and composite, with many different options such as unique, constraint based, clustered, and partitioned. What is most frequently forgotten is the
placement of these indexes. Where they reside on the disk can make a huge difference in the performance of the retrieval of the information.
Too often indexes are placed on the same disks as the data set, the system log, or the database TEMP space. This causes the indexes to cost double or triple the I/O when the disk is hit for data.
Both reads and writes are impacted. What is forgotten, is that the database does all the work to balance index trees during inserts, updates, and deletes. This means that there are blocks of
indexes being moved around on the disk in conjunction with 80% of the activity taking place. Most indexing technologies do house data, and are built in RAM in most cases, but managed on disk.
Increasing parallelism and dividing the I/O work across different controllers and disks is a technique utilized to increase performance. This means being able to recognize and architect a solution
where indexes and their activity doesn’t collide with the other data on that disk. The table below represents this concept of basic separation.
|RDBMS Log Space (Internal Disk)||Staging Data Space|
|RDBMS Temp Space (Internal Disk)||Staging Index Space|
Each of the divisions above represents an individual disk, formatted RAW and separated across controllers as a Raid 0+1 setup. This concept extends itself across multiple data and index spaces. For
instance, when adding other table spaces such as data warehouses and data marts, the architect must consider when data is read, when indexes are written along with other things. What the system
should NOT do, is read from an index disk while it’s attempting to write data to that same disk. This is a common faux pas made all the time, and is unavoidable when using Raid 5
configuration. In a Raid 0+1 setup, specific disks can be allocated for read operations while others are being written to. In this scenario, the load can be spread across the device thus avoiding
what is known as hot spots on the disk.
The table below represents the concept in a visual fashion utilizing 4 disks.
|Disk 1: Staging Data Space||Disk 3: Data Warehouse Data Space||Disk 1: Data Mart Data Space|
|Disk 2: Staging Index Space||Disk 4: Data Warehouse Index Space||Disk 2: Data Mart Index Space|
This is a limited situation, requiring a limited architecture. In this case we attempt to minimize the impact on the users, while maximizing throughput during load times. Mainly because of the
amount of information we have to deal with. The data is picked up off of flat files on Internal disk (read), and inserted in to the Staging Data and Staging Index spaces (disk 1 and 2). The data is
then picked up from Disk 1 and 2, and written to disk 3 and 4. The data finally is picked up from disk 3 and 4, and written back to disk 1 and 2. Thus reducing contention as much as possible.
Consider these to be logical disks at the moment, physically they can be comprised of as many disks as necessary to hold the information. Consider these to be groups of disks.
The point here is separating the contention across disks, so that when large sets of information are being read/written to they are not on a single disk. Again, with Raid 5, there is no concept of
groups of physical devices – everything is striped as one large array of storage. The other disks, Disk 1, 2, 3, and 4 appear as “virtual disk 1” – so the concept of
separation doesn’t apply. Also keep in mind that beyond this are the clustering, unique, non-unique indexing techniques (as well as partitioning) which need to be applied.
A basic definition of clustering data is to think of it as keeping it in physical order to match the key. The information is actually stored in a sorted order designated by a key or index.
Clustering can help remove fragmentation of the information. We’ve mentioned earlier that fragmented data can slow down a system significantly, and that in a VLDB situation there is so much
data to deal with – the fragmentation problem places speed impacts high on the exponential curve of impact. In other words, the system will slow down on orders of magnitude if the
fragmentation is allowed to grow and continue.
Ok, so how does clustering help? Won’t the system slow down trying to insert or update massive sets of data in order to keep it in sorted order? Yes, yes it will, unless the concept of
partitioning has been applied properly. However, clustering will keep the data from fragmenting too much (unless the information is deleted frequently). Keeping the data in sorted order forces the
physical blocks to be moved around on the disk so that record named “A” is physically and always continually located next to a record named “B”.
The paradigm could be: having a parking garage, and saying I will park all the Ferrari’s in this section, Cadillac in that section, and Lamborghini’s in the other section. And when a
section becomes too full – we’ll move some Cadillac’s to make more room, we want to keep the entire set of Ferrari’s together. Another method of clustering the cars is by
color, or maybe make and model, or maybe year. Anyhow – the concept stays the same as partitioning: keep the distribution of the data as evenly spread as possible, and by clustering it
– if you are looking for a particular Ferrari, you’ll always be guaranteed to find it in “that section.” because they are always kept together.
However, putting a clustered index on partitioned information (particularly when the partitions are already split across physical disks), can make for a very fast system – especially in
retrieval. Although for inserts, updates, and deletes on a mass scale – it may make it a bit slower. That is, unless the partitions are broken up slightly smaller than they were before, and
the cluster coincides with the partitioning scheme. Why is this important? In dealing with massive sets of information, it is often times necessary to consider ALL the alternatives to performance
and tuning. Some things work for certain situations, but without all the right tools in the box – the job may not be done correctly. This series is an attempt to place additional tools in
Well gee whiz; I never thought indexing was as complicated as setting up the architecture of a system. Do you mean to say I should consider it when designing the information storage? Absolutely. If
it’s not designed properly up front, then the business analysis may not have been done properly. It basically means that there is a missing level of understanding of what kind of information
is to be stored. The architecture of the Index in VLDB is just as critical as every other component in the system. However – it does require an additional level of knowledge as to the data
set content being stored.
What other things should I think about? We’ve discussed the fragmentation, and partitioning at basic levels. These concepts apply to index trees as well. Remember that indexes are also data
storage components. In a sense, they are data about data, just so happens, it’s ordered or sorted data, and is used for the specific purpose of helping the optimizer perform it’s binary
search. What other types of indexing strategies should be considered in design? Composite versus non-composite, bitmap, and hash indexing. Each of these indexing strategies have their benefits and
place in a VLDB situation, however the only index type which will be discussed in this series is composite, due to the nature of the index.
A composite index is one that is made up (generally) of more than one field. These indexes are often built without an understanding of why, or what they are needed for. Thus resulting in a complex
system that is over-indexed. The more indexes, the longer the insert, update, or mass delete times. In other words – bad indexing can impact the performance of VLDB system negatively. When
dealing with that much information, a poor decision can cost the implementation team weeks or even months of internal analysis as to “how to solve the performance problem.” When in all
reality the index probably isn’t even used by any other component.
So what’s in a composite index anyway? Generally the concept is that an index is made up of a basic tree like structure, typically a B+ tree (derivative of the Binary Tree). The B+ tree
structure allows the binary search to work it’s magic. Inside the tree structure is a set of levels (depth), and a finally a set of pointers to the actual data (breadth). The depth of the
tree is part of the efficiency and how well balanced it is across the structure. The root of the tree is inverted, and is the top entry always used first by the optimizer to determine the
tree’s efficiency. Actually the optimizer keeps statistics on how efficient each index tree is, and utilizes a series of mathematical formulas to determine which index to use.
This figure represents a binary tree structure, not a B+ tree structure. The binary structure only has two and at most two nodes or pointers left or right. The B+ structure extends 2 nods to N
nodes (5 or 6 usually), but can contain additional entries. This is for visualization purposes only.
The first field in any index is always the most important field. This field should represent a very good “coverage” or split of the data within the particular table. In VLDB this can
make or break the performance of the system. The first field’s coverage should be (on average) a very equally divided distribution across the data set. If the target table is partitioned, the
first field should represent the range or hash that is utilized to originally partition the data. On the other hand, if it’s the primary key, the first field should always be the unique
primary key value. Typically partitions are not built on the primary key – there are too many variations in the growth pattern of the primary key to partition this way.
The second field of the composite index should further equally divide the information. In this manner, the index coverage or “depth” is managed. If an index is too narrow, and too deep
it takes many different comparison operations to reach the data that is being asked for (read slow). If the index is too wide and shallow, the same affect can be seen. However, it should be noted
that (as with any rules or generalizations) there are exceptions to this rule. In discussing the exceptions, it is important to note that the focus on the index ordering may be frequency in which
these terms show up in qualified queries. Under this circumstance, the coverage across the first fields of ALL the indexes should be good (allow the optimizer 80% of the time to choose the correct
index). If all the indexes start with the same first field, the chances of the optimizer selecting the right index are poor. Most optimizers only keep statistics on the first N bytes in an index.
Some are 30 bytes, some are 50 bytes, it varies from RDBMS to RDBMS. Following these rules of thumb, any RDBMS engine can be utilized and the implementer is capable of constructing successful
For example, if the user desires to query a VLDB table with 80+ million rows, partitioned by Year, account type, and finally city, each year containing 10 million rows (8 years of data), then the
index should first and always start with year (at least one index) – because it represents the partitioning of the information. The second field could be account type (if each of the 10
million rows are potentially split average across 4 account types). However if all 80 million, or each partition, only holds 1 account type – then the second field may need to be city.
There are several different methods for architecting the correct indexes – it simply boils down to the basics of understanding the data, and how the data will change over time. Without these
two components, successful indexing is difficult if not impossible in a VLDB situation. It simply isn’t good enough to understand how the data needs to be accessed. The knowledge needs to be
a mix of understanding the business and the physical layout of the indexes themselves.