Published in Association with the Meta-Data and Data Modeling Summit
June 14-16, 2005 – Long Branch, New Jersey
For more information please visit – http://www.debtechint.com/summit2005/summit2005.htm
Possibly, the most difficult problem to support in the relational model is hierarchical data. A hierarchy according to Webster is a “group of persons or things arranged in order to rank grade,
class, etc.” Examples are organization structures, product reporting structures, employee-manager relations, and customer-to-customer relationships. There are three issues with hierarchical data:
- how to represent it logically;
- how to design it physically; and
- how to write SQL to navigate it.
As so often happens, the data modeling community praises the logical purity of its solution, which is recursion. The database community rightfully revolts because that solution has serious
performance problems. The SQL community complains that you’d have to be Houdini to get the data out.
In this article, we will talk about the basic concepts of recursion in a logical model, and will discuss several practical physical implementations, namely, current DBMS implementations, descendent
tables, snowflaked hierarchies, flattened hierarchies, self-joins, and nested sets.
A reader completely familiar with recursion might choose to scan the Logical Model section and proceed to the Physical Implementation section.
The Logical Model
Relationships can fall into three general categories as follows:
- one-to-one relationships, such as husband and wife;
- one-to-many relationships, such as manager and employees; and
- many-to-many relationships, such as a product and all its parts.
Where these relationships are between occurrences of the same entity, as above, they are termed “recursive relationships”.
Hierarchical Recursion
The one-to-many relationship represents a simple hierarchy, which explodes in size as we look down in the hierarchy. An example is a simple corporate employee hierarchy. An employee reports to a
manager – who of course is also an employee. In relational, this is called a self-join.
Network Recursion
The many-to-many recursive relationship, sometimes called a bill of material (BOM), is more complex. This structure is also called the “adjacency model”. For now, we will stick to the term BOM.
It is commonly used. It is also very descriptive of an actual manufacturing bill of materials. Although it can be applied to many other recursive situations, an actual bill of material makes a good
example for explaining the concept. The term BOM was originally coined in manufacturing and some of the earliest manufacturing data base systems were called BOMP (Bill of Material Processor) and
DBOMP (Data Base Bill of Material Processor).
Simplified View Of Bill Of Materials
The Bill of Materials represents what component parts comprise a superior part and into what superior parts a component part goes. The structure requires a primary entity, such as ITEM, and an
associative entity, such as ITEM BOM (BILL OF MATERIAL). The ITEM represents the part out of context. All items, regardless of their role or associations, are first represented in ITEM. The ITEM
BOM specifies the relationships that exist between any item and its components. Here is a typical bill of materials structure.
Why the multiple relationships between the two entities? Is this legitimate?
In ITEM BOM, the two ITEM NOs are required. One ITEM NO represents the parent occurrence and the second ITEM NO represents the child occurrence. The combination of the two represents the
relationship. Consequently, the two ITEM NOs play different roles. There are multiple relationships between the two entities.
Multiple relationships are valid when any of the following conditions exists:
- the business rules are different,
- the cardinalities are different, or
- the occurrences are different.
In this case, the occurrences would be different because one points to the parent occurrence, the other to the child occurrence. It does not matter which attribute is chosen to play which role –
parent or child. The parent does not have to be the first one and the child the second one. However, once chosen, the role of the attribute is fixed. There will be as many rows in the ITEM BOM as
there are parts in the entire bill of materials.
To break the item down further, the children of each parent may have child parts of their own. They are then used as parents with their own children to extend the structure, as is illustrated below
for an F15 plane.
Navigating The BOM
If an assembly is navigated downward from any given parent, it is commonly called explosion. The quantity of parts needed to form an assembly can be identified. We can ask, for example, what are
all the items that make up this part? If we look upward, from the view of any given child, it is called implosion. The quantity needed within each assembly can be identified. We can ask, what parts
is this part used in? Note that for both the explosion and implosion, the quantity is dependent on this child part (ITEM) within this parent part (also ITEM). In other words, it is dependent on the
combination of parent and child.
The bill of materials structure, consisting of the primary entity, such as ITEM, and an associative entity, such as ITEM BOM, is a fully normalized structure.
The classical BOM provides the capability to navigate from any parent to all of its component parts. It may be necessary to navigate and flip-flop many, many parent-child relationships to get
there, but it is eminently possible. It is important to note one final thing: the classical BOM contains relationships only between any parent and its immediate children. Retrieval
of the entire bill of materials for a given part requires navigation through the succession of parent-child relationships, starting with the ultimate parent for explosion, or the ultimate child for
implosion.
This treatment also shows those familiar with current computer implementations of BOM that the construct is, in fact, a network and not a hierarchy.
Assembly Hierarchies and Product Structure Trees
Consider the example of an airplane. An airplane has a hierarchy of components. If one is simply decomposing one airplane into greater and greater levels of detail, it appears as a hierarchy.
However, when looking at the entire range of airplanes for a manufacturer, the structure is not a hierarchy but rather a network of parts, i.e., a many-to-many relationship. In other words, a given
part or assembly such as a carburetor type can be used in many different types of engines. Likewise, an engine might be used in many different types of airplanes, and so on. So any given assembly
can have subassemblies under it and other assemblies above it.
This can be depicted in what is called a product structure tree as shown. Any item can be considered either an assembly or a part depending upon its relationship to other items. For example, item C
is a part within assembly B, but it is also an assembly consisting of parts 2, D, and 4.
Financial Example
The financial industry often complains that texts like this do not show enough examples that relate to them. With this in mind, let us examine a typical loan example using the typical “bill of
materials” structure.
Consider the case that a bank offers commercial loans to credit approved organizations and that those organizations can have different relationships with other organizations. An organization can
offer collateral, can co-sign a loan, can syndicate a loan, and can hypothecate a loan – all with different organizations. To solve this, first one should consider each of these organizations as
customers: the borrowing and the supporting customer. Therefore, CUSTOMER can relate to CUSTOMER as follows:
Look familiar? The many-to-many relationship is resolved by retaining the CUSTOMER entity and creating a CUSTOMER RELATIONSHIP entity, retaining the dual relationships, as follows:
Incidentally, it might be best to expand this example by using subtypes to represent the different types of the CUSTOMER RELATIONSHIPs, or by adding a role type attribute. The latter type code
could differentiate supporting roles like cosigning, offering collateral, etc.
The Pros and Cons
The benefit of the recursive structure, especially for a logical model, is its elegance and simplicity. The structure is endlessly extendible. You can even add components in the middle without
traumatizing the rest of the structure. To do this, you do the following:
- Add the part to ITEM if necessary;
- Add its relationships with its children to ITEM BOM; and
- Modify the previous parent-child relationships in ITEM BOM to point to the new part.
It can easily be applied to a wide variety of problems. For example, it can be applied to a true bill of materials, to organization and employee structures, to financial relationships and to
reporting rollup structures.
In current relational technology, recursion can pose two main problems: performance and programming complexity.
The Physical Implementation
One way to simplify a problem is to eliminate variability. There are two popular implementations of this, especially in data warehousing, called the flattened structure and the snowflake. Of
course, these solutions are not unique to data warehousing.
Fixed Solutions
The first solution is a fixed hierarchy, such as you see in a snowflake. In this approach, as in the snowflake, the different levels of the hierarchy are kept in separate tables. Each level is
fixed. Clive Finkelstein in his book Introduction to Information Engineering often models this way. Here is an example of a snowflake or fixed hierarchy.
In the flattened structure, the different levels in the hierarchy are reduced to sets of different columns in a single row. Imagine a product hierarchy containing (in one row), product, component,
subcomponent, assembly, subassembly, and part. This is highly denormalized but if the structures are fixed, it could work. Here is an example of a flattened reporting structure (on the right
below).
Both the flattened and the snowflake really prefer a requirement that is fairly fixed and with few levels. Although both of these structures are popular in data warehousing, they do not have to be
restricted to it.
Recursive Solutions
Some situations are more complex and require a more open-ended solution, which often comes down to recursion. Simple recursions can be solved easily, almost simplistically. First is a recursion
that is very shallow, such as, an Employee-Manager relationship, which is only two deep. These can be supported in an SQL self-join, for example, joining the Employee table as Employee to the
Employee table as Manager.
Second is a recursion that has more levels but where the levels are fixed and few, such as a product hierarchy that always contains 5 levels. These are easy to support in SQL. The SQL can be a
fixed set of nested subqueries, roughly as follows:
SELECT PARENT1_ITEMNO, CHILD1_ITEMNO,…
(SELECT PARENT2_ITEMNO, CHILD2_ITEMNO,…
(SELECT PARENT3_ITEM_NO, CHILD3…
(SELECT PARENT4_ITEM_NO, CHI …)
When the recursion is deeper or more variable, this approach becomes unwieldy.
Vanilla SQL Implementation
Vanilla SQL is just a colloquial term for those SQL dialects that do not have specific support for recursion. You can get all the products and all their parts, with the emphasis on all. There are
several problems: first, the results are not ordered by product, and second, you cannot get the entire path for an individual parent or child. To retrieve the path from a given parent down, or from
a given child up, you have to exit to application code. This poses several problems. First, you have to exit SQL, which has overhead. Second, you will be reestablishing cursors in the DBMS, which
entails more overhead.
Recent SQL Implementations
Recently, DBMSs have risen to the occasion. IBM’s UDB and Oracle, for example, have extensions to support recursion. UDB does it with a common table function and Oracle with a CONNECT BY
parameter. Effectively, these implementations do what vanilla SQL could not: they retrieve the children for a specified parent, and build a temporary table of them. They then use those children as
parents and retrieve their children, appending this new result set to the temporary table. This process is repeated for the Maximum Levels specified. The result is presented in a final result set.
The problem with this implementation, which is eons easier to use than former relational implementations, is still performance. Performance is much better than the combination of SQL and
application code, but not as good as other implementations, which we describe below.
General Considerations
Before we discuss two optimized implementations, several changes can be made even to basic recursion to improve its usability. In most cases, a recursive structure would benefit from the addition
of several attributes:
- A Level Number for each row, indicating at what level this occurrence is; or
- A Depth attribute, indicating the distance of this entry from the top (or bottom); and
- A Max Levels attribute indicating the total number of levels for any branch of the tree.
- A Sequence Number indicating the preferred order of retrieval.
The level number would tell you the level of each entry; the depth attribute would tell you the distance from the top and bottom; the maximum levels would tell you the total possible levels for
that branch of the tree, and a sequence number would allow ordered result sets.
An Ordered Result Set
One of the characteristics of relational is that the data is not ordered. If you want the result set ordered, either you have to sort the data somehow (for example, using the ORDER BY parameter) or
you have to provide a sequence number.
A sequence number can be used in one of several ways.
A universal sequence number is a single attribute from top to bottom that provides the entire data structure in a given sequence. This is useful if you want to pull them out in an
absolutely certain order. (We will explore this later when we discuss nested sets.)
A Sequence Number within Level Number would allow the data to be pulled off sequentially within level. It would not, however, allow parts to be pulled off in sequence within any
given parent. In other words, you get all the level ones, then all the level twos, etc., not a level one parent, then all its level 2’s, then its level 3’s, etc., and then the next parent in the
same way. To achieve this, one has to resort to special implementations, which we discuss next.
Two Important Implementations
To circumvent the limitations of complexity and performance, two particular implementations are often used: descendent tables (often called speed tables) and nested sets. For different purposes,
each makes a different modification to recursion.
Descendent Tables
Descendent tables are often called speed tables. This is not a good name to use exclusively for this implementation because there are many different designs about many different things – all
intended to provide speed. Why use this term to refer only to this implementation? So we recommend the term, “descendent tables”.
As we saw, the traditional BOM includes relationships between any parent and its immediate children. You retrieve the whole hierarchy by retrieving a parent, then taking the children for this
parent and using them as parents, retrieve their children, and so forth. Again the problem is that this structure is slow (with today’s technology) and difficult to support.
Descendent tables change this by including relationships between each parent and all of its descendents, not just the immediate children. There is one row for each parent and each descendent. The
following illustrates this with a simple example:
This solution works well for hierarchies, such as reporting hierarchies. It applies only to hierarchy recursion, not network recursion. Second, it does not allow you to navigate easily upward, from
any given child to all of its parents. Also, be careful as to how your Business Intelligence tool handles this. Ensure that its ORDER BY, GROUP BY does not double-count.
Nevertheless, this is an effective compromise for many reporting implementations.
Nested Sets
Joe Celko is a consultant-author whose writings on SQL are always worth reading. For hierarchies, he recommends the Nested Set Model and has been very critical of recursion. Celko refers to the
typical BOM, which we called above the Network Recursion, as an adjacency model. Some of his criticisms of recursion are right on target, others are way off base. First let us discuss the Nested
Set Model, then his comments about the adjacency model (classical recursion) and lastly his comments on normalization.
The Nested Set Model is optimized and very appropriate under certain circumstances.
Refer to the example below. This example is taken from an article by Gijs Van Tulder, entitled “Storing Hierarchical Data in a Database.” Gijs calls this model a “modified preorder tree
traversal”.
The tables to support this are as follows (using his terms):
parent | title | lft | rgt |
Food | 1 | 18 | |
Food | Fruit | 2 | 11 |
Fruit | Red | 3 | 6 |
Red | Cherry | 4 | 5 |
Fruit | Yellow | 7 | 10 |
Yellow | Banana | 8 | 9 |
Food | Meat | 12 | 17 |
Meat | Beef | 13 | 14 |
Meat | Pork | 15 | 16 |
In the nested set model, each entry has two hierarchical attributes, called a LEFT and RIGHT count. These are generally referred to as “LFT” and “RGT” because LEFT and RIGHT are reserved words
in SQL. (Remember LEFT and RIGHT OUTER JOIN). In the following example, each entry is assigned a left and right pointer. In this discussion, the term, “the current entry”, means any specific
entry, such as Banana. The left pointer is the position of the current entry in the hierarchy. The right pointer is its last place in the hierarchy. It is as though you enter on the left pointer
and exit on the right one. If the entry is a leaf entry (that is, has no descendents), its right pointer will be its left pointer plus one. You enter it and immediately ascend after it. For an
entry that is non-leaf (that is, has descendents), the first descendent’s left pointer will be next in sequence. You enter the current entry and descend to the next level.
An analogy often used for this model can help one understand it. Consider a worm that is crawling throughout the entire hierarchy. It crawls to each entry sequentially. When it gets to an entry, it
looks for the next entry. To do so, it looks at the RGT column. If the number is LFT plus one, it exits this entry upward; otherwise, it crawls to the next lower entry.
The nested set model can be seen as a variation of sequence numbers. You can follow any hierarchical path by following it in sequence. To go from parent to child, follow it in ascending sequence.
To go from child to parent, follow it in descending sequence. To get all the entries between any two entries get the entries between the left and right. The beauty of this solution is that it
requires no special SQL functionality, it retrieves data fast and in sequence, it allows you to navigate forward or backward, and it even allows you to retrieve partial sets of data between any two
points. But, there’s no such thing as a free lunch.
The main disadvantage of nested sets is maintenance. To insert new entries, you have to modify all other entries that depend on it. This could cause you to regenerate all or most of the entire
hierarchy, depending on the modification. Certainly, you have to change all entries beneath the modified entry.
Celko responds that, well after all, organizational structures do not change every day. Is that really so? Let’s look at a real-life example.
I think Celko significantly oversimplifies this problem, especially in a data warehousing environment. Organizations can change often, even every day in some cases — not in its gross
organizational structures, but often the detailed sales structures. Say we are a soft drink distributor. Customers are organized into Routes. Once assigned to a Route, a Customer stays on the
Route. Routes are organized into Territories, and Territories into Market Units. However, and here’s the catch, Routes are assigned to different Territories when a Territory gets too big. New
Territories can be created. When the Route is reassigned, all the sales of that Route are moved to the new Territory. In a large national distributor, changes like this can happen every day. To
update a Nested Set Model of Org Structure like this each day could be a significant maintenance challenge. In one real-life case, maintenance took 15 minutes. This can be done in a data warehouse
if you have enough batch window. In a tight, daily batch window of 3 hours for a warehouse, an additional 15 minutes is an eternity. If you do not have enough batch window, then you have to adjust.
One way is to apply the change at a longer interval, such as, at weekend or period-end, where there is a longer batch window. If that is not possible, and you cannot change the batch window or the
service level agreement, then you may have to resort to a solution that provides easier maintenance at the cost of reduced performance.
Celko and Normalization
Joe Celko’s also made some comments on the normalization of the recursive model (the adjacency model). The following discussion does not diminish the respect I have for his SQL insights. However,
in this case his analysis of normalization is really quite incorrect. Here are his example and his comments.
Emp | boss | salary |
‘Albert’ | ‘Null’ | 1000.00 |
‘Bert’ | ‘Albert’ | 900.00 |
‘Chuck’ | ‘Albert’ | 900.00 |
‘Donna’ | ‘Chuck’ | 800.00 |
‘Eddie’ | ‘Chuck’ | 700.00 |
‘Fred’ | ‘Chuck’ | 600.00 |
Of the recursive model (which he calls the adjacency model), he says it “is denormalized in several ways. We are modeling both the OrgChart and the organization chart [sic] in one table. …
Another problem with the adjacency list model is that the boss and employee columns are the same kind of thing (i.e., name of OrgChart), and therefore should be shown in only one column in a
normalized table.”
“To prove that this is not normalized, assume that ‘Chuck’ changes his name to ‘Charles’; you have to change his name in both columns and several places. The defining characteristic of a
normalized table is that you have one fact, one place one time.” Thus spake Celko.
Obviously, his example is not normalized – but not for the reasons he says. It is not the recursion that’s the problem. It’s his model and specifically his keys. Using a name as the identifier is
a very poor choice. No good modeler should do this. A key must be stable (never change), unique (have no duplicates) and minimal (have only as much key as necessary). His key violates two of those
principles: stability and uniqueness. By definition of his example, the employee name can change. Because the key is a name, it is also not unique. Surprisingly, the key he has chosen even violates
the primary key principles in his own book, Joe Celko’s Data and Databases: Concepts in Practice, p. 247.
In the adjacency model (classical recursion), the two attributes in question are “the same kind of thing”. For that matter, so are LFT and RGT in the nested set model. In the Employee hierarchy
example, both are Employee IDs. In the ITEM example, the keys to ITEM BOM are both ITEM IDs. However, in each case, each foreign key plays a different role, one is the parent, and the other the
child. Below is a better solution to his example, and this example is normalized.
Corrected OrgChart Example
As we just said, his same reasoning could equally be applied to the Nested Set Model. After all, the LFT and RGT columns are really “the same kind of thing”; they are each positional or
sequential references.
In fact, we should say that both the Adjacency Model and the Nested Set Model (when properly modeled) are normalized. Each model is normalized because the IDs play different roles, as do LFT and
RGT. Each solves the problem. They are different solutions. One is easier to implement than the other; one is faster than the other. One is more flexible than the other; one is more navigable than
the other.
Conclusion
There are many ways to solve the problem of hierarchies, namely,
- as a classical “bill of materials” structure, supported by current SQLs;
- as a fixed hierarchy (such as a snowflake);
- flattened (as in a star);
- with explicit sequencing;
- as descendent tables with relationships between a parent and all of its children;
- as nested sets, which have high query performance but high update maintenance.
They do not all have the same pros and cons. Define your requirements; evaluate the alternatives; refine your compromises; pick your solution. It truly depends on your needs. As they said in The
Pirates of Penzance, “Let the punishment fit the crime.”
© InfoModel, Inc.