Published in TDAN.com October 2001
This is the first part of a three-part series on Very Large Databases. The paper is divided into six chapters; two chapters are included with each part.
1. Introduction to VLDB, Very Large DataBases
Welcome to the discussion on VLDB. The purpose of this series of white papers is to present and discuss VLDB with regard to its implementation, application, design, and architecture needs. The focus here will be centered on the techniques necessary to set up, maintain, and grow a database of 1 to 80 terabytes in size. The target audience is those who have a background in systems architecture, database administration, or systems administration. This series will not discuss the business benefits or the pros and cons of VLDB from a business perspective.
The series promotes the following goals of understanding:
- Impacts of VLDB on Hardware and Software
- Sizing and scoping of Hardware
- Parallel processing and multi-threaded architecture in relation to VLDB
The other areas of discussion will focus on: networks, partitions, clustering files and indexes, database sharing, CPU speed and quantity, RAM speed and quantity, and an introduction to MPP
possibilities (such as NUMA-Q). A future series will introduce new structures capable of assisting in housing VLDB data sets: a structure called a Data Collection, and a new data architecture model called the Data Vault™, presented as a means for massive data storage in a detailed, normalized form. Watch for that series in the coming months.
Many of the theories and hypotheses presented here are supported by mathematical functions; however, mathematical theory is not the focus of this series, so the mathematical functions may only be referenced in certain places. The proofs of these theorems and functions will not be demonstrated.
1.1. What is VLDB? VLDW?
Different definitions exist for each term, and the terms themselves conjure different images for everyone. However, the standard definitions in the industry are as follows:
- VLDB = Very Large DataBase: an environment or storage space managed by a relational database management system (RDBMS) consisting of vast quantities of information. The definition of what exactly constitutes a VLDB changes every day as hardware and software adapt, become faster and faster, and handle more and more load. Usually, though, VLDBs “don’t fit on laptops.” But wait, even this changes tomorrow, too: IBM is scheduled to produce, within the next year, a disk capable of holding three times the gigabytes on your laptop.
- VLDW = Very Large Data Warehouse: a warehouse made up of both physical (hardware/software) and virtual (electronic information) content, the warehouse being the structure or architecture serving one or more specific business purposes. A VLDW is based on VLDB technology; the two usually go hand in hand.
1.2. Technical Theory
The database or data warehouse will grow in the specific areas consisting of transactional information. The information describing the transaction itself usually changes less frequently and can be termed metadata, or data about data. Most transactions around the world consist of: date and time (when did the transaction occur?), who (the individual conducting, initiating, or entering the transaction), and what (the content of the transaction itself). Beyond these three components, transactions typically include many descriptive elements and relationships between information, all of which must be recognized but should be stored in a minimalist fashion in the database. Therefore, compression ratios, growth ratios, and metrics ranging from implementation time to overall sizing should be computable attributes following a mostly accurate profiling activity. The keys are: how many types of transactions, and how many total transactions, are being entered into the system?
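As a quick illustration of that profiling activity, here is a back-of-the-envelope sizing sketch in Python. Every figure below is an invented assumption for illustration, not a measurement:

    # Rough growth estimate from a transaction profile (all numbers assumed).
    transactions_per_day = 5_000_000   # total transactions entering the system
    avg_transaction_bytes = 250        # date/time + who + what + descriptors
    index_and_overhead = 1.4           # 40% extra for indexes, free space, etc.
    compression_ratio = 0.6            # stored bytes per raw byte

    daily_bytes = (transactions_per_day * avg_transaction_bytes
                   * index_and_overhead * compression_ratio)
    yearly_tb = daily_bytes * 365 / 1024 ** 4
    print(f"estimated growth: {yearly_tb:.2f} TB/year")

Even this crude arithmetic makes the point: the transaction count and the average row width dominate the sizing exercise.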
The theory goes on to say: hardware, software, speed, the RDBMS, and data set growth appear to be closely related, as well as consistent in nature. Therefore, common techniques from mathematical algorithms should hold true for splitting data into chunks while holding the other elements constant. As a result, speed should only change if one of the other variables changes.
The suggested very broad formula for this theory might be: speed = [(hardware)*(software)] bytes per second / (size of data in bytes)
With this formula, if the size of the data increases while the hardware and software stay constant, the speed decreases. Vice versa, if the hardware increases and the size of the data stays the same, the speed should increase. Of course, the dependency is on the bottleneck of the system. A delicate balance is achieved in keeping these three “points” growing in equal portions. Think of the three sides of a triangle, one side each for hardware, software, and data size:
The pressure is in the middle of the triangle, pushing outward; speed is the size of the arrow set inside the triangle. Increasing one side doesn’t necessarily increase speed; a number of factors must be addressed. If you picture a pyramid instead, the third dimension might be the number of parallel processes capable of executing the requests. Consider the volume of the pyramid as the resulting speed: the more volume, potentially the more speed for more data. For this example, the items considered for each piece are listed below:
- Hardware = Size + Speed
  - Size = amount of RAM, # of parallel CPUs, # of controllers, # of disks
  - Speed = RAM speed, disk speed, bus speed, network speed (bandwidth), less the # of idle cycles per second per CPU and the # of blocked I/O processes per second per CPU
- Software = # of parallel processes + # of partitions + operating speed of the software (throughput)
If the formula holds true, then there are a number of things that can be done to achieve better performance.
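To make the formula concrete, here is a minimal sketch that plugs invented capability numbers into the broad formula above; the hardware and software factors are placeholders, not real measurements:

    # speed = (hardware * software) bytes per second / (size of data in bytes)
    def speed(hardware: float, software: float, data_bytes: float) -> float:
        return (hardware * software) / data_bytes

    hw, sw = 4.0e9, 0.5      # illustrative hardware and software factors
    one_tb = 1024 ** 4

    print(speed(hw, sw, 1 * one_tb))       # baseline
    print(speed(hw, sw, 2 * one_tb))       # double the data: speed halves
    print(speed(2 * hw, sw, 2 * one_tb))   # double the hardware too: speed restored

Holding two variables constant and moving the third shows exactly the behavior the triangle suggests.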
1.3. Why is all of this important?
We are seeing rapid change in both technology and the requirements for analysis of massive sets of information. More and more businesses are looking at the related issues and recognizing that patterns, trends, and analysis of these items are becoming critical to the success of the business. Ok, enough business talk. What about the technical end of it? Because of the business, the IT specialists left to implement these complex problems are beginning to deal with physical problems they’ve never had to think about before. OLTP systems simply “deleted” or rolled off the unused portion of the data, and data warehouses were physically unable (until recently) to hold such massive sets of information. The focus has simply changed: the name of the game for IT is to get the most accurate answer from a compilation of as much information as possible, and to do it as quickly as possible, providing that much-desired competitive edge in business.
In recognizing speed as a foremost goal, we begin to realize that one of IT’s responsibilities is integrating, understanding, and tuning all of these different components, not to mention the mountains of data that exist within an enterprise. Technically, having a VLDB situation without the expertise to predict it, manage it, grow it, or architect it is like having the information aboard the best aircraft carrier in the world without having the aircraft. What’s the point? More simply put, it’s like a rowboat without the oars.
More and more IT personnel need to understand the varying levels in order to tune the correct component. This is analogous to improving a car with the best carburetor money can buy, but then putting the same old fuel into the engine. Maybe the engine is too small? Maybe the fuel is bad? Maybe there isn’t enough horsepower to take advantage of the best carburetor? By understanding only a single component of this particular triangle, only certain levels of performance and management of VLDB can be achieved. Given the amount of money it costs a business to house, build, and deploy VLDB, it would be a grave mistake on the business’s part not to properly train IT employees.
So where does one go to achieve such knowledge? It isn’t widely taught, and it’s not readily available on the web. This series will try to assist in unraveling some of the complications and necessities of what to think about when getting into VLDB for the first time. For those advanced with VLDB, please send any comments, omissions, errors, or thoughts to the author directly. We always welcome input. This series is meant to be dynamic in nature.
1.4. Who’s involved in VLDB?
Now that the discussion has opened, who is actually involved in VLDB? Who needs to know? What’s the weakest link? Ok, sorry for the bad pun; we’re just checking to see if you’re awake. The answer is: everyone. Everyone who interacts from a business perspective is interested in VLDB, even if they don’t know it. They need to understand that the questions they are beginning to ask about the content (specifically the historical patterns of the content) require more and more analysis. In turn, most statisticians will tell you that to predict six months ahead with reasonable accuracy, you should have at least twelve months of history behind you. In other words: twice as much data for half the amount of prediction over time. These statistics hold true when drawing the most accurate conclusions in VLDB. Sure, sample sets can be used, but what if that key factor is missed? Examples of this are written up in various books, such as the Oracle8i DBA Handbook, and some other sparsely populated writings.
From an IT perspective, everybody who touches, maintains, architects/designs, or builds a data warehouse is interested (or should be) in VLDB. The list of roles includes: systems architect, systems administrator, enterprise architect, database administrator, ETL operator, DSS/OLAP (business intelligence) specialist, and operations staff for backup/restore and disaster recovery.
The next chapter will cover the concepts of VLDB, such as parallel processing, architectures, and introductions to hardware/software requirements.
2. VLDB / VLDW Concepts
This next section discusses the concepts behind VLDB and VLDW, and what to think about when creating these environments or considering them for implementation. There are many concepts regarding VLDB which, due to time and space constraints, will not be covered here. If there is a desire to see a write-up on certain concepts pertaining to VLDB that have been left out of this series, please feel free to contact Core Integration Partners and let us know. We will attempt to include them in a section of post-series write-ups.
The concepts included in this discussion focus on the hardware, software, and architecture sides of VLDB. There are separate chapters on both hardware and software included in this series. For that reason these concepts are interwoven with all such ideas, and should be read as precursory knowledge before launching into the other respective areas. Included in the discussion are the items in the sections that follow.
2.1. Architecture Overview
Architecture plays a much larger role in VLDB than most care to think about. In order to understand VLDB/VLDW, we must consider the architecture and the concepts behind what makes it all work. The architecture is like the set of blueprints for a house: without an overview of what to put where, and the guidelines or steps to build it, it becomes difficult to create correctly. All too often, projects begin with the right intentions but forget to include the architecture components. This leaves the foundations of VLDB extremely weak, and many problems begin to crop up as data set size grows rapidly.
An example paradigm might be as follows:
Suppose you had a lumberyard, and in that yard you had one stack of 500-foot boards, 8 feet high and 6 feet deep. These boards are stacked neatly on one side of the lumberyard. Your objective is to move the boards across to the other side of the yard. Do you simply start by grabbing a board and attempting to move it by hand, one by one, across to the other side? Most often, all the options for how the board(s) can be moved are considered first. In doing so, you find out that a forklift of standard size is available. Does this mean you can move all the boards now? Not necessarily. The boards are too long and too heavy for a single forklift, so additional options must be considered.
In this paradigm, the boards represent data, and the forklift represents the database/query software, with its engine possibly a CPU. If the boundaries of the lumberyard aren’t considered, there could be trouble. Again, the architecture of the entire process (all the options) must be considered before the best option for the situation at hand is chosen. If the worst option is chosen (because the entire architecture was not considered), then the boards may break, it may take a very long time to move them, or halfway through, no more boards can be moved. Maybe some of the other boards have now been placed in the way of completing the task.
Four other co-workers join you, each with their own forklift, and your boss has just told you these boards need to be moved within the next two hours. This further complicates the problem. Now there are five people, five forklifts, and still 500-ft-long boards. After discussing it, everyone agrees that there are two options: cut the boards into five portions of 100 feet each and move the stacks independently, or put all five forklifts under the boards at the same time and try to move the entire stack at once. Of course, the latter would require great synchronization so as not to disturb the stack or have it fall over.
After further consideration, the option to cut the boards is chosen, the stacks are moved efficiently across the yard, and the job is done in 1 hour and 45 minutes; just in time for the boss to come out and start congratulating you, explaining how he couldn’t have moved the boards in trucks anyway if they were still longer than 100 ft. Or maybe he begins to yell, because these 500-ft boards were specifically ordered at that length, and you and your co-workers should have asked first.
What happened here? You solved the problem, with two possible outcomes (and probably more available). But was the architecture right for the problem? The problem itself was solved, but it can be considered a success only if the requirements were discovered before delving into the actual application of solving the problem. Let’s examine one more possibility…
You and your co-workers go back and ask the boss: “Do you want these in 500-ft lengths, or do you mind if we cut them? If we cut them, we can move them with our five forklifts; if you want them in one length, we need to use the crane to haul the whole stack across the yard.” The boss thinks for a minute and responds: “The crane is more expensive, but the customer wants the boards in 500-ft lengths. Go ahead and use the crane to move the boards.” Successfully, the boards are moved across the yard in 45 minutes, in a single stack, uncut.
Interesting: now there’s another solution to the same problem, introducing a high-priced item, but one that can accomplish the task successfully according to the requirements. The items in this case should be thought of as follows: the boards are data in a table. Cutting the boards is like partitioning the table. The forklifts are like CPUs, with RAM as their carrying capacity and a certain operating speed. The more forklifts, the faster the stacks can be moved, up to a maximum of five stacks; additional forklifts would simply sit idle. Each forklift operating independently is like a parallel process capable of accessing a partition. If the boards are in a single stack, the forklifts must operate together at the same time, as if they were one large forklift.
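To make the mapping concrete, here is a toy rendering of the analogy in Python. The names and numbers are invented for illustration: the “boards” are rows, “cutting” is partitioning, and each “forklift” is a worker assigned one partition:

    boards = list(range(500))  # a 500-"board" stack of data rows

    def partition(rows, n):
        """Cut the stack into n roughly equal pieces."""
        size = (len(rows) + n - 1) // n
        return [rows[i:i + size] for i in range(0, len(rows), size)]

    for i, chunk in enumerate(partition(boards, 5)):
        print(f"forklift {i} moves rows {chunk[0]}..{chunk[-1]}")

Five partitions, five workers: each worker gets a manageable piece instead of the whole stack.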
The crane is a high-priced item that can meet the requirements, and it can do the job faster. It’s like having more CPUs, more RAM, faster machines, faster bus speeds, faster disks, more controllers, etc. Higher cost gets the job done faster than lower cost, but it’s overkill unless this kind of operation will take place regularly. If the data set grows (more boards, more stacks), then it would make sense to invest in a crane, but only after the forklifts are overwhelmed or the time frame is squeezed too tightly.
Ok, enough of the theory; let’s get down to brass tacks. The bottom line is: architecture, engineering, and requirements are extremely important when considering VLDB. This series will now delve into the technical side of all of this. Here we go…
2.2. What is a thread?
A thread is typically defined as an independent process which can execute asynchronously with respect to other processes. Usually a thread is capable of operating independently of other threads or processes. When a thread has to wait on the results of another thread, it will sit in “idle mode” until a token is passed (a semaphore or mutex). Ok, in English: a thread can be likened to a car. Picture four cars coming to a stop at a four-way stop sign; each car is running independently of the others. They all arrive at the stop at different times. Then each car proceeds in turn and continues on its way. Think of each car as a thread and the stop sign as a semaphore or shared lock by which all cars must stop before proceeding, with each car proceeding only when the intersection is clear.
Threads can also be thought of in terms of different users connecting to the database. Each has its own connection, and each can execute a different query at the same time, or they can all execute the same query at the same time. Threads allow multiple processes to occur simultaneously. This concept is very important in understanding how multi-threaded architectures work. A multi-threaded architecture provides the system with the ability to run multiple threads or processes at the same time. The system performs management of the four-way stop (with stoplights), so that accidents and traffic jams are avoided as much as possible.
Each car represents a thread: all cars must stop before proceeding, but each is in motion independently of the others.
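As a minimal sketch of that stop-sign semaphore, here is a short Python example; the car names and timings are invented for the illustration:

    import threading
    import time

    intersection = threading.Lock()  # the stop sign: one car crosses at a time

    def car(name: str, arrival_delay: float) -> None:
        time.sleep(arrival_delay)            # each car arrives at a different time
        print(f"{name} arrives at the stop sign")
        with intersection:                   # wait for the intersection to clear
            print(f"{name} crosses the intersection")
            time.sleep(0.1)                  # time spent crossing

    cars = [threading.Thread(target=car, args=(f"car-{i}", i * 0.05))
            for i in range(4)]
    for t in cars:
        t.start()
    for t in cars:
        t.join()

All four threads run independently, but the shared lock forces them through the intersection one at a time.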
Why is this important? It is important to recognize threads as a basis of architecture for VLDB systems, so that larger tasks or requests (such as massive inserts, selects, deletes, or updates) can be split into multiple smaller tasks that run at the same time, thus shortening the overall time frame necessary to complete the task. Going back to the analogy used earlier, it’s similar to having N forklifts: each thread is a forklift, and the operators (people) are like the system that communicates and manages what each thread is doing. It can take lots of little threads working together to complete, in a short period of time, what is otherwise a huge task. However, without the operating system providing parallel processing capabilities, threads still execute in order (serially). It would be like having five forklifts but only one key that fits them all: you can only operate one forklift at a time.
The timeline for serial execution is many multiples longer than that of parallel execution. Parallel threads all execute together at the same time, while serial threads execute one after the other.
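A quick sketch makes the timeline difference visible. Note that time.sleep stands in for I/O-bound work here, since pure computation in Python threads would be serialized by the interpreter anyway:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def task(_):
        time.sleep(0.2)  # stand-in for one I/O-bound unit of work

    start = time.perf_counter()
    for i in range(5):               # serial: one after the other
        task(i)
    serial = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=5) as pool:
        list(pool.map(task, range(5)))   # parallel: all five together
    parallel = time.perf_counter() - start

    print(f"serial: {serial:.2f}s, parallel: {parallel:.2f}s")  # ~1.0s vs ~0.2s

Five tasks in sequence cost five task-times; five tasks in parallel cost roughly one.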
2.3. What is Parallel Processing?
Parallel processing is the ability of the system to run, manage, and maintain multiple threads at the same time. It’s the five keys and five operators needed to run each forklift independently. Degree of parallelism is a measure which indicates how many parallel processes can run at once before the system is overloaded or maxed out. For instance, if you have five stacks of lumber and seven forklifts, you can still only operate five forklifts at a time, indicating a degree of parallelism of five. This is a very simplistic explanation of parallel processing theory. There are many books in the marketplace that describe in great detail how parallel processing works and how the degree of parallelism is reached. It is not the scope of this series to discuss each in extreme detail, only to introduce the concepts as they relate to VLDB.
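The lumberyard version of degree of parallelism can be sketched directly; the worker and stack counts below are simply the numbers from the example:

    from concurrent.futures import ThreadPoolExecutor

    stacks = [f"stack-{i}" for i in range(5)]   # five stacks of lumber
    forklifts = 7                               # seven available workers

    def move(stack: str) -> str:
        return f"{stack} moved across the yard"

    # Only five tasks exist, so only five of the seven workers do useful
    # work at once: the effective degree of parallelism is min(7, 5) = 5.
    with ThreadPoolExecutor(max_workers=forklifts) as pool:
        for result in pool.map(move, stacks):
            print(result)

The extra two workers simply sit idle, exactly like the spare forklifts.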
Parallel processing enables speed and the division of tasks across CPUs. This is also where the concept of load balancing comes into play. Load balancing is the process by which the operating system decides where to run the threads, how many threads to run, and which ones need to sit idle. It attempts to maximize the overall usage of the hardware resources available. Load balancing in the lumberyard would consist of deciding which forklifts are out of gas and need to sit idle. Maybe one of the forklifts can carry a larger load than the others, so it may be used to carry a slightly larger pile in a shorter time frame, or to carry two smaller piles more quickly to the other side of the yard. Load balancing is dynamic in nature. Threads are swapped to and from different CPUs unless they are “CPU bound” during execution. CPU bound here means that the threads have been tied to run on a specific CPU, and only that CPU: according to the load balancer, they cannot be moved to run on another CPU. Very few programs allow this level of control by the designer or operator. This is a double-edged sword, and can be tremendously beneficial or extremely dangerous depending on how it’s set up.
In this case, the stoplights represent the operating system and its capacity for executing and managing parallel processes. The threads (cars) are the processes.
Why is parallel processing important to VLDB? It allows multiple processes (SQL queries, for instance) to execute in the exact same time frame and with the same priority. It also provides the vehicle for very large tasks to be split into multiple smaller tasks for the purpose of faster execution times. Without parallel processing, it becomes difficult if not impossible to handle systems of scale. Parallel processing also plays a huge role in the success of partitioning the data, which in the example provided here is the stacks of lumber being cut into shorter stacks. This concept of partitioning will be explained in one of the follow-on chapters.
Ok, so now that threads exist, and parallel processes exist to manage threads, how does the system decide who takes priority and when? The answer is through a technique called pre-emptive
multi-tasking.
2.4. What is pre-emptive multi-tasking?
Pre-emptive multi-tasking is the ability to handle parallel processes, but to grant priorities to certain processes so they may interrupt other processes at specific points in time, or in reaction to specific events. In the case of the four-way intersection with stoplights, a pre-emptive task would be an ambulance, which changes all the lights to red but is still (in most cases) able to go through the intersection. It’s a random, unpredictable occurrence, but when it happens, it takes priority over the cars on the road. The multi-tasking part is parallel processing: it allows multiple tasks to interrupt other tasks at the same time. For instance, three ambulances at three different intersections all have the same effect and the same priority on the road. Hopefully not all the ambulances head toward the same intersection at the same time; this would risk a collision, and the ambulances would have to slow down, stop, and wait until each one makes it through the intersection. This is analogous to blocked I/O or blocked processing, which will be discussed later in this chapter.
Why is pre-emptive multi-tasking important? What does it mean to the VLDB world, and how does it play into the operating system? It’s important because it allows the operating system to manage execution priorities. For instance, if you type a key on your keyboard, the system must respond immediately, regardless of what it’s doing. This is an example of a hardware interrupt, one of the highest-priority tasks that can occur on a system. It allows the system to react to a series of unpredictable events at the time they happen. The system considers it less important to update the screen than to respond to a keyboard event. These priorities also run through different software in the operating system. In fact, in near-real-time systems or true pre-emptive multi-tasking systems such as Unix, these priorities can even be assigned. The Unix term for this is “niceness”: setting the “nice” value for a piece of software sets its scheduling priority.
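A minimal, Unix-only sketch of niceness from Python (os.nice adjusts the calling process’s priority; the increment of 10 is an arbitrary example):

    import os

    # nice(0) reports the current niceness without changing it.
    print("nice value before:", os.nice(0))

    # Raising the nice value by 10 makes this process "politer":
    # the scheduler gives competing work priority over it.
    print("nice value after: ", os.nice(10))

Lowering a nice value (raising priority) typically requires root privileges; raising it does not.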
For instance, you can tell the system what’s an ambulance versus a fire truck versus a car, and so on. This is important: being able to set priorities in VLDB means that the database software can take precedence over other software running on the same system. Once priority is decided, the software makes a request to the operating system, where it enters a queue. When the system has time, or needs to execute the request, it pulls it from the queue, assigns it to a processor, and begins executing. Of course there are many more steps to this process than have been described here, but that’s the general flow.
Unfortunately, there are still systems out there that claim to be multi-threaded. That may be true, but when it comes down to brass tacks, they are not truly pre-emptive multi-tasking systems. These systems block processes from executing an order of magnitude more frequently than those with true pre-emptive multi-tasking. It’s one of the primary differences between Unix and Microsoft’s Windows platforms.
2.5. What is a blocked process?
A blocked process is a thread that is sitting in idle mode, waiting for another of its requests to complete. It can be likened to arriving at a gas station and finding out they are filling the tanks below the surface, so you have to wait maybe five minutes (until they turn the pumps back on) before you can fill up your car. Before you can get gas, events such as turning the pumps back on must happen first. This sequencing of events causes you to wait, which is equivalent to the “idle state” of the thread that was executing.
The thread sitting idle, waiting for something else to happen or complete, is called a blocked process or blocked thread. Blocked I/O simply means that the thread is waiting for a disk operation to complete, which is the most common event to block threads. There are many different ways for threads or processes to become blocked; to find out more, look for books on multi-threaded processing, which walk through all the details.
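A blocked thread is easy to demonstrate with a blocking queue read; the half-second sleep below stands in for the slow disk operation:

    import queue
    import threading
    import time

    results: queue.Queue = queue.Queue()

    def producer() -> None:
        time.sleep(0.5)                       # the slow "disk operation"
        results.put("block read from disk")

    threading.Thread(target=producer).start()

    print("consumer: waiting (blocked)...")
    data = results.get()                      # blocks until the producer delivers
    print(f"consumer: unblocked with '{data}'")

While results.get() waits, the consumer consumes no CPU; it is suspended exactly like the thread described in the text.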
Are blocked processes bad? Not necessarily, unless there are too many processes waiting in the wings to execute. Sometimes multiple threads are waiting on each other; this can be equated to a deadlock situation in a database. Deadlocks cause machines to freeze up, frequently requiring cold boots. Blocked processes are a problem for resource monitors, because CPU utilization rates drop during processing while the CPU suspends the thread in an “idle” or wait state until the blocking lock releases and execution continues. The problem is that the actual utilization of the CPU may drop while 100% of all available resources are taken, causing CPU load to be significantly higher than the utilization rates.
The less blocking the processes or threads do, the faster the overall processing executes. The catch is that no matter what’s done, there will always be interrupting threads with higher priorities, some of which require one or more lower-priority processes to become blocked. Why is this important to VLDB? When dealing with massive sets of information, it is optimal to break the work up into smaller, more manageable processes, and also to have them be independently executable. When the dependencies between the separate threads have been removed, they can execute in parallel with less “blocked” action, as long as each thread is going after an independent data set on disk. Again, this leads to partitioning the large data set in the database. If you have four threads going after the same data, you’ll end up with dueling threads. It would be similar to having four forklifts trying to lift the same stack of lumber at the same time: it makes no sense.
2.6. What are pipes and why do they matter?
Pipes are a conceptual term used for throughput, at least in this document. A pipe represents the ability of the system to move X amount of information in Y time frame through a series of circuits or connections. Ok, what was that again? Back in the lumberyard, it might be the number of stacks, or the total amount of lumber, that a delivery truck can carry over the course of a single tank of gas. It’s the transport mechanism for the information inside the system. Without proper sizing of the transport mechanism, the job takes too long, costs too much, or produces too much waste or excess. In technical terms, that translates to over-utilization of available resources, under-utilization of available resources, or too tight a time frame in which to move too much information.
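A back-of-the-envelope throughput check shows why pipe sizing matters; both figures below are invented for illustration:

    # How long to move a 2 TB data set through a pipe sustaining 80 MB/s?
    data_bytes = 2 * 1024 ** 4            # 2 TB
    pipe_bytes_per_sec = 80 * 1024 ** 2   # 80 MB/s sustained

    seconds = data_bytes / pipe_bytes_per_sec
    print(f"transfer time: {seconds / 3600:.1f} hours")  # ~7.3 hours

If the nightly load window is only four hours, this pipe is undersized no matter how fast the CPUs are.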
Just like everything else, pipes come in all shapes, sizes, and colors; a virtual pipe can have any attribute you wish to assign to it. What matters most is the throughput of these pipes: how fast can a pre-determined amount of information travel from point A to point B? A pipe may be a network connection, a disk connection, a disk controller, or the bus on the CPU board. It could even represent the speed of the CPU itself; yes, it’s all about speed and performance. As mentioned earlier, throughput is a critical success factor for VLDB, especially since users are less and less willing to wait for responses from the systems they access; they need and want immediate answers from mountains of data.
Where pipes come into play is in how many are available and what bandwidth (diameter and volume) they offer. These are frequently the most expensive components to “get right” in the complex world of servers and connectivity, but in most cases they are the most beneficial to increasing throughput. Going back to the triangle introduced earlier in this series, the connectivity between the points is the pipe system.