|
Modeling Hierarchies
Published: June 14, 2005
Published in Association with the Meta-Data and Data Modeling Summit Possibly, the most difficult problem to support in the relational model is hierarchical data. A hierarchy according to Webster is a "group of persons or things arranged in order to rank grade, class, etc." Examples are organization structures, product reporting structures, employee-manager relations, and customer-to-customer relationships. There are three issues with hierarchical data:
As so often happens, the data modeling community praises the logical purity of its solution, which is recursion. The database community rightfully revolts because that solution has serious performance problems. The SQL community complains that you'd have to be Houdini to get the data out. In this article, we will talk about the basic concepts of recursion in a logical model, and will discuss several practical physical implementations, namely, current DBMS implementations, descendent tables, snowflaked hierarchies, flattened hierarchies, self-joins, and nested sets. A reader completely familiar with recursion might choose to scan the Logical Model section and proceed to the Physical Implementation section. The Logical ModelRelationships can fall into three general categories as follows:
Where these relationships are between occurrences of the same entity, as above, they are termed "recursive relationships". Hierarchical Recursion The one-to-many relationship represents a simple hierarchy, which explodes in size as we look down in the hierarchy. An example is a simple corporate employee hierarchy. An employee reports to a manager - who of course is also an employee. In relational, this is called a self-join.
Network Recursion The many-to-many recursive relationship, sometimes called a bill of material (BOM), is more complex. This structure is also called the "adjacency model". For now, we will stick to the term BOM. It is commonly used. It is also very descriptive of an actual manufacturing bill of materials. Although it can be applied to many other recursive situations, an actual bill of material makes a good example for explaining the concept. The term BOM was originally coined in manufacturing and some of the earliest manufacturing data base systems were called BOMP (Bill of Material Processor) and DBOMP (Data Base Bill of Material Processor).
Simplified View Of Bill Of Materials The Bill of Materials represents what component parts comprise a superior part and into what superior parts a component part goes. The structure requires a primary entity, such as ITEM, and an associative entity, such as ITEM BOM (BILL OF MATERIAL). The ITEM represents the part out of context. All items, regardless of their role or associations, are first represented in ITEM. The ITEM BOM specifies the relationships that exist between any item and its components. Here is a typical bill of materials structure.
Why the multiple relationships between the two entities? Is this legitimate? In ITEM BOM, the two ITEM NOs are required. One ITEM NO represents the parent occurrence and the second ITEM NO represents the child occurrence. The combination of the two represents the relationship. Consequently, the two ITEM NOs play different roles. There are multiple relationships between the two entities. Multiple relationships are valid when any of the following conditions exists:
In this case, the occurrences would be different because one points to the parent occurrence, the other to the child occurrence. It does not matter which attribute is chosen to play which role - parent or child. The parent does not have to be the first one and the child the second one. However, once chosen, the role of the attribute is fixed. There will be as many rows in the ITEM BOM as there are parts in the entire bill of materials. To break the item down further, the children of each parent may have child parts of their own. They are then used as parents with their own children to extend the structure, as is illustrated below for an F15 plane. Navigating The BOM If an assembly is navigated downward from any given parent, it is commonly called explosion. The quantity of parts needed to form an assembly can be identified. We can ask, for example, what are all the items that make up this part? If we look upward, from the view of any given child, it is called implosion. The quantity needed within each assembly can be identified. We can ask, what parts is this part used in? Note that for both the explosion and implosion, the quantity is dependent on this child part (ITEM) within this parent part (also ITEM). In other words, it is dependent on the combination of parent and child. The bill of materials structure, consisting of the primary entity, such as ITEM, and an associative entity, such as ITEM BOM, is a fully normalized structure. The classical BOM provides the capability to navigate from any parent to all of its component parts. It may be necessary to navigate and flip-flop many, many parent-child relationships to get there, but it is eminently possible. It is important to note one final thing: the classical BOM contains relationships only between any parent and its immediate children. Retrieval of the entire bill of materials for a given part requires navigation through the succession of parent-child relationships, starting with the ultimate parent for explosion, or the ultimate child for implosion. This treatment also shows those familiar with current computer implementations of BOM that the construct is, in fact, a network and not a hierarchy. Assembly Hierarchies and Product Structure Trees Consider the example of an airplane. An airplane has a hierarchy of components. If one is simply decomposing one airplane into greater and greater levels of detail, it appears as a hierarchy. However, when looking at the entire range of airplanes for a manufacturer, the structure is not a hierarchy but rather a network of parts, i.e., a many-to-many relationship. In other words, a given part or assembly such as a carburetor type can be used in many different types of engines. Likewise, an engine might be used in many different types of airplanes, and so on. So any given assembly can have subassemblies under it and other assemblies above it.
This can be depicted in what is called a product structure tree as shown. Any item can be considered either an assembly or a part depending upon its relationship to other items. For example, item C is a part within assembly B, but it is also an assembly consisting of parts 2, D, and 4. Financial Example The financial industry often complains that texts like this do not show enough examples that relate to them. With this in mind, let us examine a typical loan example using the typical "bill of materials" structure. Consider the case that a bank offers commercial loans to credit approved organizations and that those organizations can have different relationships with other organizations. An organization can offer collateral, can co-sign a loan, can syndicate a loan, and can hypothecate a loan - all with different organizations. To solve this, first one should consider each of these organizations as customers: the borrowing and the supporting customer. Therefore, CUSTOMER can relate to CUSTOMER as follows:
Look familiar? The many-to-many relationship is resolved by retaining the CUSTOMER entity and creating a CUSTOMER RELATIONSHIP entity, retaining the dual relationships, as follows:
Incidentally, it might be best to expand this example by using subtypes to represent the different types of the CUSTOMER RELATIONSHIPs, or by adding a role type attribute. The latter type code could differentiate supporting roles like cosigning, offering collateral, etc. The Pros and Cons The benefit of the recursive structure, especially for a logical model, is its elegance and simplicity. The structure is endlessly extendible. You can even add components in the middle without traumatizing the rest of the structure. To do this, you do the following:
It can easily be applied to a wide variety of problems. For example, it can be applied to a true bill of materials, to organization and employee structures, to financial relationships and to reporting rollup structures. In current relational technology, recursion can pose two main problems: performance and programming complexity. The Physical ImplementationOne way to simplify a problem is to eliminate variability. There are two popular implementations of this, especially in data warehousing, called the flattened structure and the snowflake. Of course, these solutions are not unique to data warehousing. Fixed Solutions The first solution is a fixed hierarchy, such as you see in a snowflake. In this approach, as in the snowflake, the different levels of the hierarchy are kept in separate tables. Each level is fixed. Clive Finkelstein in his book Introduction to Information Engineering often models this way. Here is an example of a snowflake or fixed hierarchy.
In the flattened structure, the different levels in the hierarchy are reduced to sets of different columns in a single row. Imagine a product hierarchy containing (in one row), product, component, subcomponent, assembly, subassembly, and part. This is highly denormalized but if the structures are fixed, it could work. Here is an example of a flattened reporting structure (on the right below).
Both the flattened and the snowflake really prefer a requirement that is fairly fixed and with few levels. Although both of these structures are popular in data warehousing, they do not have to be restricted to it. Recursive Solutions Some situations are more complex and require a more open-ended solution, which often comes down to recursion. Simple recursions can be solved easily, almost simplistically. First is a recursion that is very shallow, such as, an Employee-Manager relationship, which is only two deep. These can be supported in an SQL self-join, for example, joining the Employee table as Employee to the Employee table as Manager. Second is a recursion that has more levels but where the levels are fixed and few, such as a product hierarchy that always contains 5 levels. These are easy to support in SQL. The SQL can be a fixed set of nested subqueries, roughly as follows:
SELECT PARENT1_ITEMNO, CHILD1_ITEMNO,... When the recursion is deeper or more variable, this approach becomes unwieldy. Vanilla SQL Implementation Vanilla SQL is just a colloquial term for those SQL dialects that do not have specific support for recursion. You can get all the products and all their parts, with the emphasis on all. There are several problems: first, the results are not ordered by product, and second, you cannot get the entire path for an individual parent or child. To retrieve the path from a given parent down, or from a given child up, you have to exit to application code. This poses several problems. First, you have to exit SQL, which has overhead. Second, you will be reestablishing cursors in the DBMS, which entails more overhead. Recent SQL Implementations Recently, DBMSs have risen to the occasion. IBM's UDB and Oracle, for example, have extensions to support recursion. UDB does it with a common table function and Oracle with a CONNECT BY parameter. Effectively, these implementations do what vanilla SQL could not: they retrieve the children for a specified parent, and build a temporary table of them. They then use those children as parents and retrieve their children, appending this new result set to the temporary table. This process is repeated for the Maximum Levels specified. The result is presented in a final result set. The problem with this implementation, which is eons easier to use than former relational implementations, is still performance. Performance is much better than the combination of SQL and application code, but not as good as other implementations, which we describe below. General Considerations Before we discuss two optimized implementations, several changes can be made even to basic recursion to improve its usability. In most cases, a recursive structure would benefit from the addition of several attributes:
The level number would tell you the level of each entry; the depth attribute would tell you the distance from the top and bottom; the maximum levels would tell you the total possible levels for that branch of the tree, and a sequence number would allow ordered result sets. An Ordered Result Set One of the characteristics of relational is that the data is not ordered. If you want the result set ordered, either you have to sort the data somehow (for example, using the ORDER BY parameter) or you have to provide a sequence number. A sequence number can be used in one of several ways. A universal sequence number is a single attribute from top to bottom that provides the entire data structure in a given sequence. This is useful if you want to pull them out in an absolutely certain order. (We will explore this later when we discuss nested sets.) A Sequence Number within Level Number would allow the data to be pulled off sequentially within level. It would not, however, allow parts to be pulled off in sequence within any given parent. In other words, you get all the level ones, then all the level twos, etc., not a level one parent, then all its level 2's, then its level 3's, etc., and then the next parent in the same way. To achieve this, one has to resort to special implementations, which we discuss next. Two Important Implementations To circumvent the limitations of complexity and performance, two particular implementations are often used: descendent tables (often called speed tables) and nested sets. For different purposes, each makes a different modification to recursion. Descendent Tables Descendent tables are often called speed tables. This is not a good name to use exclusively for this implementation because there are many different designs about many different things - all intended to provide speed. Why use this term to refer only to this implementation? So we recommend the term, "descendent tables". As we saw, the traditional BOM includes relationships between any parent and its immediate children. You retrieve the whole hierarchy by retrieving a parent, then taking the children for this parent and using them as parents, retrieve their children, and so forth. Again the problem is that this structure is slow (with today's technology) and difficult to support. Descendent tables change this by including relationships between each parent and all of its descendents, not just the immediate children. There is one row for each parent and each descendent. The following illustrates this with a simple example: This solution works well for hierarchies, such as reporting hierarchies. It applies only to hierarchy recursion, not network recursion. Second, it does not allow you to navigate easily upward, from any given child to all of its parents. Also, be careful as to how your Business Intelligence tool handles this. Ensure that its ORDER BY, GROUP BY does not double-count. Nevertheless, this is an effective compromise for many reporting implementations.
Nested Sets Joe Celko is a consultant-author whose writings on SQL are always worth reading. For hierarchies, he recommends the Nested Set Model and has been very critical of recursion. Celko refers to the typical BOM, which we called above the Network Recursion, as an adjacency model. Some of his criticisms of recursion are right on target, others are way off base. First let us discuss the Nested Set Model, then his comments about the adjacency model (classical recursion) and lastly his comments on normalization.
The Nested Set Model is optimized and very appropriate under certain circumstances. Refer to the example below. This example is taken from an article by Gijs Van Tulder, entitled "Storing Hierarchical Data in a Database." Gijs calls this model a "modified preorder tree traversal". The tables to support this are as follows (using his terms):
In the nested set model, each entry has two hierarchical attributes, called a LEFT and RIGHT count. These are generally referred to as "LFT" and "RGT" because LEFT and RIGHT are reserved words in SQL. (Remember LEFT and RIGHT OUTER JOIN). In the following example, each entry is assigned a left and right pointer. In this discussion, the term, "the current entry", means any specific entry, such as Banana. The left pointer is the position of the current entry in the hierarchy. The right pointer is its last place in the hierarchy. It is as though you enter on the left pointer and exit on the right one. If the entry is a leaf entry (that is, has no descendents), its right pointer will be its left pointer plus one. You enter it and immediately ascend after it. For an entry that is non-leaf (that is, has descendents), the first descendent's left pointer will be next in sequence. You enter the current entry and descend to the next level. An analogy often used for this model can help one understand it. Consider a worm that is crawling throughout the entire hierarchy. It crawls to each entry sequentially. When it gets to an entry, it looks for the next entry. To do so, it looks at the RGT column. If the number is LFT plus one, it exits this entry upward; otherwise, it crawls to the next lower entry. The nested set model can be seen as a variation of sequence numbers. You can follow any hierarchical path by following it in sequence. To go from parent to child, follow it in ascending sequence. To go from child to parent, follow it in descending sequence. To get all the entries between any two entries get the entries between the left and right. The beauty of this solution is that it requires no special SQL functionality, it retrieves data fast and in sequence, it allows you to navigate forward or backward, and it even allows you to retrieve partial sets of data between any two points. But, there's no such thing as a free lunch. The main disadvantage of nested sets is maintenance. To insert new entries, you have to modify all other entries that depend on it. This could cause you to regenerate all or most of the entire hierarchy, depending on the modification. Certainly, you have to change all entries beneath the modified entry. Celko responds that, well after all, organizational structures do not change every day. Is that really so? Let's look at a real-life example. I think Celko significantly oversimplifies this problem, especially in a data warehousing environment. Organizations can change often, even every day in some cases -- not in its gross organizational structures, but often the detailed sales structures. Say we are a soft drink distributor. Customers are organized into Routes. Once assigned to a Route, a Customer stays on the Route. Routes are organized into Territories, and Territories into Market Units. However, and here's the catch, Routes are assigned to different Territories when a Territory gets too big. New Territories can be created. When the Route is reassigned, all the sales of that Route are moved to the new Territory. In a large national distributor, changes like this can happen every day. To update a Nested Set Model of Org Structure like this each day could be a significant maintenance challenge. In one real-life case, maintenance took 15 minutes. This can be done in a data warehouse if you have enough batch window. In a tight, daily batch window of 3 hours for a warehouse, an additional 15 minutes is an eternity. If you do not have enough batch window, then you have to adjust. One way is to apply the change at a longer interval, such as, at weekend or period-end, where there is a longer batch window. If that is not possible, and you cannot change the batch window or the service level agreement, then you may have to resort to a solution that provides easier maintenance at the cost of reduced performance. Celko and Normalization Joe Celko's also made some comments on the normalization of the recursive model (the adjacency model). The following discussion does not diminish the respect I have for his SQL insights. However, in this case his analysis of normalization is really quite incorrect. Here are his example and his comments.
Of the recursive model (which he calls the adjacency model), he says it "is denormalized in several ways. We are modeling both the OrgChart and the organization chart [sic] in one table. ... Another problem with the adjacency list model is that the boss and employee columns are the same kind of thing (i.e., name of OrgChart), and therefore should be shown in only one column in a normalized table." "To prove that this is not normalized, assume that 'Chuck' changes his name to 'Charles'; you have to change his name in both columns and several places. The defining characteristic of a normalized table is that you have one fact, one place one time." Thus spake Celko. Obviously, his example is not normalized - but not for the reasons he says. It is not the recursion that's the problem. It's his model and specifically his keys. Using a name as the identifier is a very poor choice. No good modeler should do this. A key must be stable (never change), unique (have no duplicates) and minimal (have only as much key as necessary). His key violates two of those principles: stability and uniqueness. By definition of his example, the employee name can change. Because the key is a name, it is also not unique. Surprisingly, the key he has chosen even violates the primary key principles in his own book, Joe Celko's Data and Databases: Concepts in Practice, p. 247. In the adjacency model (classical recursion), the two attributes in question are "the same kind of thing". For that matter, so are LFT and RGT in the nested set model. In the Employee hierarchy example, both are Employee IDs. In the ITEM example, the keys to ITEM BOM are both ITEM IDs. However, in each case, each foreign key plays a different role, one is the parent, and the other the child. Below is a better solution to his example, and this example is normalized.
Corrected OrgChart Example As we just said, his same reasoning could equally be applied to the Nested Set Model. After all, the LFT and RGT columns are really "the same kind of thing"; they are each positional or sequential references. In fact, we should say that both the Adjacency Model and the Nested Set Model (when properly modeled) are normalized. Each model is normalized because the IDs play different roles, as do LFT and RGT. Each solves the problem. They are different solutions. One is easier to implement than the other; one is faster than the other. One is more flexible than the other; one is more navigable than the other. ConclusionThere are many ways to solve the problem of hierarchies, namely,
They do not all have the same pros and cons. Define your requirements; evaluate the alternatives; refine your compromises; pick your solution. It truly depends on your needs. As they said in The Pirates of Penzance, "Let the punishment fit the crime." © InfoModel, Inc. Go to Current Issue | Go to Issue Archive Recent articles by Tom Haughey
Tom Haughey -
Tom is considered one of the four founding fathers of Information Engineering in America. He is currently President of InfoModel, Inc, training and consulting company specializing in practical and rapid development methods. His courses on data management, data warehousing, and software development have been delivered to Fortune 100 companies around the world. He has worked on the development of seven different CASE tools, over 40,000 copies of which have been sold to date. He was formerly Chief Technology Officer for the Pepsi Bottling Group and Enterprise Director of Data Warehousing for Pepsico. He was also formerly Vice President of Technology for Computer Systems Advisers, who market the CASE tools called POSE and SILVERRUN. He wrote his own CASE tool in 1984. He formerly worked for IBM for 17 years as a Senior Project Manager. He is an author of many articles on Data Management, Information Engineering and Data Warehousing. His book, Designing the Data Warehouse-The Real Deal will be published later this year. |