Published in TDAN.com April 2002
In 1998, our company was contacted by one of our customers, a Social office of a Swedish municipality, with a request to build a system that could help them to see the trends in the distribution of
their resources. The information on which this system would work was gathered by a software system that supported the everyday operations. The operational system was of a transactional type, which
means that it stored all transactions made, e.g., payments, but not the history of changes in “static” objects, like people. The prospective-users of a new analytical system would be
ordinary office workers without any special analytical education.
Getting this request, we screened the market for analytical tools that existed at the time. We found the following:
- All OLAP (OnLine Analytical Processing) tools were built for analysts, and not for ordinary office workers.
- There were no programming toolkits available on the market that could help a system developer to build an OLAP system. All OLAP tools were meant for the end-users (analysts).
- Buying and tuning any of the existing tools would require considerable investments equal to developing a system from scratch that could be tuned to the customer’s needs in a better way
than a standard system.
Thus, a decision had been made to develop an OLAP system on our own. The next step was screening the OLAP literature in order to find practical recommendations on how to build such a system.
Unfortunately, though there were quite a lot of books on OLAP and Datawarehouse on the shelves, we had not found the information we were looking for. All books we found were about how to use
analytical tools more or less equal to the ones we found on the market.
Not finding practically useful information in an application field was not completely new for us. We started to work in a usual way:
- Creating a conceptual model of a new field that lists the most important concepts and their relationships.
- Deciding on how various elements of the conceptual model could be represented in the database structure and system architecture.
- Choosing the programming tools to develop the system.
- Creating a programming framework that consists of general routines that could be (re)used in many places of the would be system.
- Coding the system.
This document represents the results of the first step of this work. It was first written as an internal memo for our development group. The text below represents the slightly edited version of the
internal memo. We apologize for conciseness of the text and lack of examples. Still, we hope that the reader could find it interesting to have a look on the description of the OLAP and
Datawarehouse concepts that somewhat differs from the one normally found in the literature.
The system was successfully built based on the approach above. It had a web-based user interface, and it allowed the end-user to choose from tenths of predefined reports. The results could be
presented as lists or graphs. The ordinary Oracle database was used to store analytical data. As a system development tool, we used JAM from Prolifics, Inc. (New York, NY, US).
A conceptual model is a (minimal) set of abstract notions that describes a particular application domain, or class of tasks or systems. The conceptual model serves as a basis for both database
design, and system construction. Each new object in the database should be classified according to the notions of the conceptual model. Each operation should be also classified according to it. If
it is difficult or impossible to represent some element of the application domain in the conceptual model, it is desirable that the conceptual model is modified to determine the right place for
this element in the overall picture of the application world. If, because of the time limitations, the programming fixed is used as a work around the limits of the conceptual model, the revision of
the model should be undertaken at a later time.
In this document, a conceptual model is presented for the application domain of so-called OLAP systems, where OLAP stands for Online Analytical Processing. This is not a conceptual model to be used
when talking to the end-users. The goal is to have a common language in the development group that allows to formally represent functional specifications requested by the customer, e.g., reports.
The goal with OLAP systems is to measure some essential properties of a dynamic world, e.g., a company, organization, society, etc. Examples of values to be measured include: total of payments per
month, an average number of days a patient remains in the intensive ward after an operation, etc.
3. Time Based Facts
To be able to measure the world’s characteristics, the dynamic world of interest needs to be represented in the system in some formal way. One way of representing such world is as a set of
time-based facts. A time-based fact (TBF) is a logical statement about the external world that is true at a certain time. A time-based fact may be of two types:
Event fact – states that certain event happened at a certain time. Examples: 5 pieces of goods of a particular type were sold on November 1, or a decision to
invest $5000000 was made on December 12, etc.
State fact – states that certain objects were in a certain state, or at certain place, etc. at a certain time. Examples, 5 pieces of goods were in stock on
November 1. A patient was in the hospital on December 12, etc.
Each time-based fact, as a minimum, has a predicate, and a timestamp.
- The predicate defines what kind of logical statement the fact stands for (i.e., logical predicate name). Examples: Sale (something has been sold),
Decision (some decision has been made), Placed (somebody was in the hospital).
- The timestamp defines when the fact was valid in the real world. The precision of the timestamp may differ from one type of facts to another or from application to
application. The timestamp can be given with precision of seconds, minutes, hours, days, weeks, months, etc. The precision of timestamps is chosen based on the application needs. The meaning of
non-precise timestamps is as follows:
Beside the predicate, and timestamp, a TBF may include attributes, object references, and measurements (measured values):
- An attribute specifies an event or state in more details. For examples, for the Decision predicate, attribute type of decision (investment/reorganization) can be of
use. For the Placed predicate, attribute type of ward (the patient was placed in) can be of use. Each attribute is defined by its name, and a set of values. All TBFs with the same predicate share
the same set of attributes, though not all values may be known for a particular TBF.
- An object is an abstraction for representing physical objects and processes from the given application domain. Examples of physical objects include a person, a
hospital, a company, etc. Examples of processes include order processing in a sales department, patient case in the hospital, etc. Objects are grouped in types. Each object has its unique ID,
current state, and history:
However, for each particular object, some attribute values may not be known.
Note: In our model, we simplify the reality by not allowing one object to refer directly to another object. All connections between objects are done
- A TBF often concern specific objects, or processes. For example, the Sale TBF is about some product (object) being sold to some customer (object). This event also happens in the frame of some
order process (object) that can embrace several other Sale events that concern other products sold to the same customer. A Placed predicate concerns a person (object) who is placed in the hospital
(object), for example to undergo basic examinations (process object – patient case).
hospitals, hospital cases, etc.). All TBFs with the same predicate shares the same set of object references (more precisely, the names of object references) However, for a particular TBF, not all
of them may be known.
given TBF was true in the real world.
- A measurement is a parameter that numerically characterizes TBFs. Examples include amount of money received in a Sale event, amount of money to invest in a Decision event, etc. Each measurement
is defined by the name of the parameter, and the range of possible values. All TBFs with the same predicate share the same set of measurements. However, for a particular TBF, not all of them may be
Measurements are used to evaluate integrated parameters of the dynamic world, e.g. an average amount of investment planned per month, the total sales per month for a certain group of products, etc.
Attributes are used to group TBFs during the aggregation process, e.g. all investment (attribute value) decisions made this month, all patients placed in the intensive ward (attribute value), etc.
Note: Sometimes the same parameter may be used as both a measurement and an attribute. For example, assume we want to group all investments
decisions according to the size of investments made, like small investments (< $10000), middle-size investments ($10000-$100000), etc. Suppose also that we want to know the totals of planned
investments for each group. To cover this situation, characteristic size of investment should be included in TBFs as both an attribute, and measurement. The precision of the attribute and the
measurement based on the same characteristic may differ. Normally, an attribute does not require the same precision as a measurement.
In a more complicated case, even characteristics of the objects referred by the TBFs may need to be included as measurements in a particular TBF. For example, if we
are interested in the average age of patients who undergo a specific operation, the person age should be included in the TBF as a measurement, while it continues to serve as an attribute in the
description of object state.
4. Space Transformation
4.1 Simple Dimensions
Let P(t,m1,m2, …,mi,,a1, a2,…, aj,r1,…,rk) be a TBF, where
P – is a predicate,
t – is a timestamp,
m1,m2, …,mi – are measurements,
a1, a2,…, aj – are attributes,
r1,…,rk – are object references.
Let us consider different ways of representing this fact as a point in a multidimensional space. First, we need to have some order over components referred by the TBFs. Measurements have numeric
values; thus for each measurement, values are ordered in a natural way. The same is true for the timestamps.
As far as attributes are concerned, for some attributes (e.g. numeric attributes), the values have some natural order, for others, there is no natural order. However, we can always introduce some
artificial order over the attribute values. An attribute with some order introduced over its values will be called an attribute dimension. The same trick can be made with references. Objects that a
particular TBF’s reference can point to are, normally, not ordered, but we can introduce some arbitrary order in them. A predicate reference with some order introduced over the objects it
points to will be called a reference dimension.
With the dimensions introduced above, we can construct a multidimensional space in a number of different ways.
One way is when an (i+j+k+1)-dimensional space is constructed from the time dimension (timestamp t), i measurement dimensions (m1,m2, …,mi), j attribute dimensions (a1, a2,…, aj), and k
reference dimensions (r1,…,rk). This way is standard for OLAP literature. However, the above space may have undesirable dependencies between dimensions. This happens when some measurement and
some attribute represent the same characteristic of the real world.
Another way of defining a multidimensional space is to construct it from the time dimension (timestamp t), j attribute dimensions (a1, a2,…, aj ), and k reference dimensions (r1,…,rk); the
resulting space will have (j+k+1) dimensions This construct gives us a “more orthogonal” space than the previous one, and we will use it for defining a multidimensional space to
represent TBFs of the given type (i.e., with the given predicate). Under assumption that the timestamp, attributes and references totally differentiate one TBF from another, measurements may be
treated as values (i.e., weights) assigned to some points of thus defined space.
4.2 Extension and Reduction
Let rc, 1 >= c <= k, be an object reference from TBF P(t,m1,m2, …,mi,,a1, a2,…, aj,r1,…,rk), and the state of the object to which this reference points is defined by a set of
attributes: . Suppose, we have defined a dimension over each attribute of objects to which reference c can point. Now, we can consider any TBF with predicate P as a point in a (j+k+n+1)-dimensional
space constructed from the time dimension (timestamp t), j attribute dimensions (a1, a2,…, aj), k reference dimensions (r1,…,rk), and n object attribute dimensions (b1, …, bn). An operation
of space transformation where a reference dimension is complemented by the object attribute dimensions is called extension of the original space via reference rc.
Another useful operation to transform the space is by eliminating one of the dimensions that belong to the original space (attribute, reference, or object attribute dimension). This transformation
is called reduction via the specified dimension. The coordinates of a TBF in the reduced space are the same as in the original space for all axes but the one that has been eliminated.
Note that in the reduced space, there is no guarantee that each TBF of the original space will be mapped into a unique point in a new space. Consequently, we need a way of defining measured values
assigned to a point in a new space that assert several TBFs from the old space. This process is called aggregation and it will be described later.
A complex transformation that includes several extensions may be defined by naming a set of reference dimensions that should be extended.
A complex transformation that includes several reductions could be defined by naming a set of dimensions (be it attribute, reference, or object attribute) that remains in the transformed space (all
other are eliminated). The latter is just a usual projection.
Note: If the reduction concerns an object attribute (e.g. a product category), and the object reference is not part of the excluded dimensions, no aggregation will happen along the dropped
4.3 Hierarchical Dimensions
As we saw in the previous section, a simple attribute dimension is defined by introducing a strict order over the attribute values. A hierarchical dimension introduces several levels of attribute
values to make it possible to specify the attribute with less precision than the exact value. Basically, a hierarchical dimension defined over an attribute specifies:
- A number of levels, greater than 1, which are numbered from 0 and up. Each level constitutes a new scale of attribute values that defines the attribute with less precision than the original
- A strictly ordered set of elements (units) on each level. Units on level 0 are equal to the original values of the attribute. The number of units on a level higher than 0 is always less than
the number of units on the previous level.
- A mapping between adjusted level units. Each unit of the given level should include one or more units of the previous level. The sets of lower level units mapped to two different units of the
current level should not overlap. All units of the lower level should be covered (included in some higher level unit). The order of the higher-level units should not contradict the order imposed on
the mapped lower level units.
Let us also presume that each hierarchical dimension includes a so-called root level. The root level is the highest level in the hierarchy, and it has only one element (unit) denoted as all.
According to the above definition of hierarchical dimension, all embraces all units of the previous level. With the root-level concept, each simple dimension can be treated as a hierarchical
dimension with two levels: zero, and one, where level one is the root.
Note: Our definition of a hierarchical dimension is not done in the form usually used in the OLAP literature.
Suppose we have defined a multidimensional space that includes a hierarchical dimension x. Then substituting the original values along axes x by the higher-level units, say of level n, will produce
a new space that specifies the x values with less precision. We will refer to this kind of transformation as to shrinking along axes x up to level n. Again, when using this kind of transformation,
several TBFs that occupies different points in the original space can be mapped into the same point in the “shrunk” space.
If we treat all dimensions as being hierarchical, then the reduction operation discussed in the previous section may be regarded as a particular case of shrinking, i.e. shrinking up to the root
level. This definition may be expanded to the reference dimensions too. Now, we can define a complex operation of reduction/shrinking by specifying the level of “shrinkage” for each
When a shrinking/reduction operation is undertaken over a multidimensional space, a number of points from the original space with measurement assigned to them may be projected to the same point in
the reduced space. This effect is called aggregation, and it requires some way of calculating new measured values to assign to the aggregated points in the new space. New measurements are assigned
by “integrating” the values originally assigned to the points being aggregated.
Actually, the new space does not reflect the meaning of TBFs that served as a basis for building the original space. The new TBFs predicate can be defined, for example, as characteristics of an
average monthly sale, total investment per business segment per month, etc. Consequently, the integration formula depends on the meaning we want to assign to the new predicate and, it cannot be
defined generally. Such formula should specify what measurements are to be assigned to the points in the new space, and how they can be calculated from the measurements assigned to the points of
the original space.
Though the general integration formula does not exist, we would like to introduce one standard way of integration that can be useful as an intermediate step for other types of integration.
Let m1,m2, …,mi be measurements defined for the TBFs being aggregated. Add to them i+1 additional measurements c, c1, …, ci.. Define an integration procedure that assigns
values to the measurements m1,m2, …,mi,, c, c1, …, ci as follows:
- c gets the number of all points being aggregated.
- ck – gets the number of all points which have the measurement mk defined, k = 1,…,i (remember that not all TBFs need to have all measurements assigned).
- And last, mk gets the average value of mk over all the points being aggregated that have the k-th measurement defined, k = 1,…,i.
The above integration rule is associative and commutative in respect to the order in which the space is transformed, if the latter is done step by step. The results will be the same if a complex
transformation of shrinking/reduction along several axes is done in one step, or several reductions each concerning only one axis are done one at a time.
The standard integration procedure has all data to calculate some other types of values, e.g. totals over all aggregated points. The total over the p-th measured value is derived from cp, and mp
according to formula cp * mp.
In some cases, the integrated values can be treated not as measurements, but as attributes in the transformed space (sometimes, we need both). This allows us to define a new hierarchical dimension,
for example, on average values, and perform shrink (or restrict, see the next section) operations over this dimension.
4.6 Restriction and Shift
Sometimes, only part of the attribute dimension values are of interest in the current analysis.
Let s be a subset of all possible values in the attribute dimension x. A new space can be created from the original one by eliminating all points for which the value of coordinate x does not belong
to s. This operation of space transformation is called restriction of dimension x on set s. The way to defined a subset of values for restriction is chosen from practical considerations. The set
can be defined by:
(a) listing all values that are included
(b) specifying the range limits (minimum, and maximum)
(c) listing several ranges
(d) listing ranges and standalone values
(e) specifying values that shouldn’t be included in the set
Unlike the shrinking/reduction operation, restriction does not result in any aggregation, it just removes uninteresting parts of the space.
After the restriction has been made, the initial range of values from 0 and up to some limit may become free of points with measurements. The shift operation moves point 0 in a dimension to the
right. The operation is defined by the shift value. For example if we are interesting in TBFs that are valid for 1997, we can make restriction over this year and shift the beginning of the time
axis to the beginning of 1997.
Restriction with shift that follows allows comparative analysis of measurements that belong to two different time intervals, e.g., 1996 and 1997. When we perform two separate restrictions over the
two years and make a shift in each of the resulting spaces, the time axes in both spaces will become compatible, i.e. they will get the same set of values. The join operation described below helps
two create a new space to compare values for different years.
In the previous sections, we discussed operations that transform a multidimensional space based on TBFs, all of which have the same predicate. Some of these operations, shrinking/reduction, may
result in a space that corresponds to TBFs with another predicate than the original one. This is an example of so called derived TBFs, e.g. time based facts that originally are not included in the
database, but can be derived from the facts that are included in the database.
There are other ways of defining derived TBFs. For example, suppose we have a hospital database that for each patient case stores a TBF that register the event when a patient was accepted to the
hospital and a TBF that register the event when he/she was discharged from the hospital. These TBF’s are connected to the same person object (physical object), and the same hospital case
(process object), and there is a rule that a case may have only one Accept and one Discharge event. Even if the original database does not have TBFs that shows who was at the hospital at a
particular date, this information can be derived. To do that, we need for each date:
- find all events of type Accept with the timestamp less or equal than the given date,
- and exclude from them all events that belong to the cases for which event Discharge was registered with the timestamp less than the given date.
The above example presents a relatively complicated case of derived TBFs that is difficult, if possible, to define via space transformation. However, in some practically useful cases, a derived
space can be defined in a general way. Joining two spaces is one of them.
Consider two multidimensional spaces M1, M2 each with n dimensions but with completely different sets of measurements. Let each i-th dimension from M1 be compatible with the i-th dimension of M2,
i.e. they have the same set of values ordered in the same way. Then we can join M1 with M2 along the common axes getting a new space M3 . To do that, we need to assign measurements to each point in
M3. This can be done only if the corresponding point in M1, or/and M2 has some measurement assigned. There can be different ways to assign measurements to the points in the new space, but we can at
least define a standard one:
- the measurements assigned to the point in M3 consist of all measurements assigned to the corresponding point in M1 plus all measurements assigned to the corresponding point in M2.
Continue the example from the previous section, we can use the join operation to create a space which compares measurements made over time in two different time intervals, e.g. in 1996 and 1997.
5. Space Scanning and Presentation
5.1 What is a Query
An end-user query in an OLAP system is a request to present time-base facts in the form of one or more multidimensional spaces. The query has two parts: what spaces(s) should be presented, and in
what form, e.g., list (report), graph, 3-d picture, etc.
When more than one space is to be presented, there should be some likeliness between the spaces, e.g., they should share at least one dimension. Here, in contrast with the join operation, sharing
may not mean that the dimensions are fully compatible. Weaker compatibility is allowed, i.e. the actual axes used in the two spaces belong to the same hierarchical dimension, but different levels
of units are used in each of the spaces. Typical example when several spaces are presented together is a report with totals and subtotals for different time ranges (week, month, and year),
categories of products sold, etc. For these reports, the spaces merged can be produced by stepwise shrinking/reduction of the original space along various axes (time, product, etc).
5.2 Query Decomposition
After the problem of what space(s) to present is solved, the next question to answer is if this space is stored in the database, or should be derived from the database in some way. In the latter
case, we need to understand if the derived space can be obtained with standard space transformation operations (extension, shrinking/reduction, restriction, union), or nonstandard operations should
Generally, a target space should be presented as a sequence of operations (standard or nonstandard) over one or more source spaces. Each operation that can lead to aggregation should be supplied
with a formula for measurements calculation. A formula can be added to non-aggregating operations too, for example, if the measurements stored in the database are not the ones that are needed
according to the query, but they can be calculated. Moreover, a calculation formula may be needed even when no space transformation is required, i.e. when the database already has the space
requested in the query, but with another set of measurements than requested.
Often, there are several ways for defining an order of operations that lead to the given target space. A new order can be obtained, for example, by switching the places of two standard operations
in an already defined order. This is allowed only if the calculation formula is invariant to such changes in order of operations. This is true, for example, for the standard integration formula
discussed in section 4.5. When the order of operations can be changed, the following rules are useful to follow:
- Swapping shrinking and restriction. If restriction follows shrinking over the same dimension, it might be more effective to reverse the order of these operations. Restriction exclude some
points without integrating them, so there will be less number of points to integrate when shrinking is invoked. One problem should be solved though, i.e. when restriction is defined after shrinking
in the same dimension, the restriction specification will be based on units of the higher level in hierarchy than the units of the non-shrunk space. In order to swap shrinking and restriction, the
restriction specification should be translated into the lower level units. This is done based on the hierarchical dimension definition.
- If two shrink operations are defined in the same dimension, first try to change the order of operations so that they follow one another (using Rule 1 above, for example). After that they may be
merged in one shrinking that covers several levels. This is most effective when the second shrinking is done up to the root level (i.e., the dimension is being dropped after all)
- If a dimension undergoing shrinking has an index defined in the database, it is better to move shrinking in this dimension after shrinking in other dimensions. If more than one dimension
indexes are present, the dimension that result in less number of aggregated points (small cells) should be preferred as the last one to shrink. If a complex index that stretches over several
dimensions exists, then the shrinking order should be reversed in respect to the order of attributes defined for this index.
Scanning is a process of traversing a multidimensional space and collecting all points that have got some measurements. The result of scanning is a stream of weighted points, each of which is
defined by its coordinates in the space, and measurements. Scanning is needed for both:
- presenting information to the end-user
- aggregation and integration when the space undergo shrinking/reduction
When scanning is being done, each point should be visited only once. In practice, this rule is not sufficient for the purposes of presentation and integration, more strict traversal order may be
A scanning order can be defined as a sequence of space dimensions, say: D1, .., Dn. Scanning starts with the lowest value of dimension D1 and goes through all the points of the space that have this
lowest value as the coordinate in dimension D1. After that, the next value in dimension D1 is chosen and all points with this value are scanned. The order of scanning the points with D1 value fixed
is defined by dimension D2, etc.
Suppose we scan an n-dimensional space in order D1, .., Dn. Then the resulting stream may be considered as consisting of a number of series, defined as follows:
- Each series includes the points with the same coordinate in dimension D1. A point that starts a new series is called a break of level 0 in dimension D1.
- Alternatively, each series includes the points with the same coordinates in dimensions D1 and D2.A point that starts a new series is called a break of level 0 in dimensions . Naturally, breaks
in D1, D2 will happen more often then breaks in D1. All breaks in D1 are also breaks in .
- Alternatively, each series includes the points with the same coordinates in dimensions D1, …, Dk, where k <= n. A point that starts a new series is called a break of level 0 in dimensions
If dimension Dk is a hierarchical dimension with h levels and f is a nonzero level (f <=h) , then the scanned stream may be considered as consisting of a number of series such that:
- Each series includes the points with the same coordinates in dimensions D1, …, Dk–1, and Dk , but for dimension Dk the coordinates are the same only with precision of level f units. A point
that starts a new series is called a break of level f in dimensions .
5.4 Stream Processing and Merging
Scanning can be used as a way of space transformation on the fly. Applying special processing rules to the scanned stream, we can get a stream that could be produced by scanning a space that
differs from the original one. Naturally, this other space can be defined as a transformation of the original space.
Let space M2 be derivable from an n-dimensional space M1 = D1 x …x Dn via the shrink operation along dimension Dk up to level f. Assume farther that some integration formula to assign
measurements to aggregated points is defined. Let us scan M1 in the order defined as D1, …, Dk-1, Dk+1, …, Dn,, Dk. Consider the resulting stream as a number of series, each series lying
between two breaks of level f in dimensions (the rightmost break being included in the series). By applying the integration formula to all points in each series, and substituting the Dk value of
precision 0 with the value of precision f, we will get a stream that would result from scanning space M2. This procedure can be generalized to cover cases where several shrink operations are
applied one after another.
Beside of being useful for space transformations, breaks are useful for merging two or more spaces that should be presented to the end-user together. More precisely, breaks of a level l greater
than 0 can be used for merging two streams, where the second one defines some integrated values, like subtotal, average, etc., for the units of level l. In this case, the next point from the second
stream is placed directly before the next break in the first stream (to summarize the previous series). Moreover, instead of merging two independent streams, the stream processing rules described
above may be applied to the current stream to get the second stream while scanning the original space. This will result in space transformation and merging made at the same time.
Visualization is a way of presenting the stream of points to the end-user. This can be done in various forms, e.g., as a list (report), graphical diagram (pie, bar), graph, two dimensional matrix,