You might remember back to your early school days, when you learned about the five major divisions of the vertebrates of the Animal Kingdom: reptiles, amphibians, fish, birds, and mammals. Scientists have refined this taxonomy greatly, but this is close enough to the technical classifications for most schoolchildren.
Mammals were defined as those animals that breathed with lungs, gave birth to live young (as opposed to laying eggs), and nursed their young. These classification criteria worked just fine until the Europeans discovered Australia. In the late 1700s, someone came across the duck-billed platypus in eastern Australia. This is a mammal that has a bill like a duck’s, a tail like a beaver’s, and feet like an otter’s, and yet it lays eggs!
When Captain John Hunter sent a platypus pelt back to Great Britain, many scientists thought it was a hoax. Eventually, however, scientists had to accept the reality of this animal that challenged their classification criteria, and they dropped the requirement that an animal give birth to live young in order to be classified as a mammal.
I often encounter hierarchical systems of classification in data systems that insist that each item be put into exactly one bucket of a hierarchy. A product must be either tangible or intangible; it can’t be both. A customer must be from a small, medium, or large firm, and nothing else. A document must be either a treatise or an editorial work, and so on. Sooner or later, something is encountered that won’t fit into the nice classification system that people have worked so hard to define. Tremendous amounts of energy go into deciding which of two equally suitable classifications an entity must be assigned to. Complex rules are written to handle those things that cross the classifications. Costs, delays, and emotional pains ensue.
When you have the data equivalent of a duck-billed platypus come crashing through your classification system, what should you do?
When designing classification systems, there are some key principles that you can follow to prevent news from Australia giving you a bad day (so to speak).
The most important principle is to accommodate the reality that any one thing can usually be classified in multiple independent ways. I illustrate this in my workshops with a standard deck of playing cards. Take, for example, a jack of diamonds. It can be classified by color (red), suit (diamonds), face or number card (face card). While playing the game of euchre, when trump is declared, a jack of diamonds could end up being the left bower, the right bower, or just the jack of diamonds. There are multiple independent systems of classification in operation here, including one that is dynamic (trump). If a playing card could only ever be classified exactly one way, playing cards wouldn’t be much fun.
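The playing-card example can be sketched in a few lines of Python. This is only an illustration: the `Card` type and the function names are my own assumptions, not part of any library. Each function is an independent classification of the same card, and `euchre_role` is the dynamic one, because its answer depends on which suit is currently trump.

```python
from dataclasses import dataclass

RED_SUITS = {"diamonds", "hearts"}
FACE_RANKS = {"jack", "queen", "king"}
# In euchre, the left bower is the jack of the other suit of the same color as trump.
SAME_COLOR_SUIT = {
    "diamonds": "hearts",
    "hearts": "diamonds",
    "clubs": "spades",
    "spades": "clubs",
}


@dataclass(frozen=True)
class Card:
    rank: str
    suit: str


def color(card):
    """Static classification by color."""
    return "red" if card.suit in RED_SUITS else "black"


def kind(card):
    """Static classification as face card or not."""
    return "face card" if card.rank in FACE_RANKS else "non-face card"


def euchre_role(card, trump):
    """Dynamic classification: depends on the suit declared as trump."""
    if card.rank == "jack":
        if card.suit == trump:
            return "right bower"
        if card.suit == SAME_COLOR_SUIT[trump]:
            return "left bower"
    return "ordinary card"


jack_of_diamonds = Card(rank="jack", suit="diamonds")
```

The same `jack_of_diamonds` object is red, a diamond, and a face card all at once, and its euchre role changes from hand to hand without the card itself changing.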
We often see this need for multiple classification systems with financial data in multinational corporations. Given one set of leaves (the financial accounts), the balances in these accounts often need to be “rolled up” or summarized in multiple independent ways:
- Across every country and product line, as the board of directors would like to see the summaries for managerial purposes (the “management view”)
- Summarized in a way appropriate to deliver to shareholders
- As dictated by each region’s Generally Accepted Accounting Principles (GAAP)
- As dictated by each region’s tax laws
- From the Sales organization’s viewpoint
- By product line
- By capital expenditure versus expense
In order to accommodate what appear to be conflicting needs, systems are designed that enable users to start from the same set of leaf accounts and build trees on top that roll up independently to different roots. Each tree must account for every leaf, so that the total seen at the top of every tree is the same, but the values at the intermediate summary nodes of one tree are unrelated to those of any other tree. Further, each tree might be very different from the other trees that roll up the same leaves. One tree might be balanced, with the same number of levels from its root to every leaf, while another might be unbalanced or ragged, with the number of levels between root and leaf varying from one leaf to another.
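Here is a minimal sketch of that idea in Python, with made-up account names and balances. It assumes each tree is stored as a simple child-to-parent mapping with no cycles, which is only one of several possible representations. The `by_country` tree is balanced, and the `by_product` tree is ragged because one account reports directly to the root.

```python
from collections import defaultdict

# Hypothetical leaf accounts and their balances.
balances = {
    "acct_us_widgets": 100,
    "acct_us_gadgets": 50,
    "acct_de_widgets": 70,
}

# Each tree maps a child to its parent; a root has no entry as a child.
by_country = {  # balanced: every leaf is two levels below the root
    "acct_us_widgets": "US",
    "acct_us_gadgets": "US",
    "acct_de_widgets": "DE",
    "US": "ALL_COUNTRIES",
    "DE": "ALL_COUNTRIES",
}
by_product = {  # ragged: gadgets roll straight up to the root
    "acct_us_widgets": "WIDGETS",
    "acct_de_widgets": "WIDGETS",
    "acct_us_gadgets": "ALL_PRODUCTS",
    "WIDGETS": "ALL_PRODUCTS",
}


def roll_up(parent_of, leaf_balances):
    """Return the total at every node of one tree (assumes no cycles)."""
    totals = defaultdict(int)
    for leaf, amount in leaf_balances.items():
        if leaf not in parent_of:
            raise ValueError(f"tree does not account for leaf {leaf!r}")
        node = leaf
        totals[node] += amount
        # Walk upward, adding this leaf's balance to every ancestor.
        while node in parent_of:
            node = parent_of[node]
            totals[node] += amount
    return dict(totals)


country_totals = roll_up(by_country, balances)
product_totals = roll_up(by_product, balances)
```

The two roots, `ALL_COUNTRIES` and `ALL_PRODUCTS`, end up with the same grand total, while the intermediate nodes (`US`, `DE`, `WIDGETS`) have nothing to do with one another. Adding a new roll-up means adding one more mapping; the leaf balances are untouched.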
Systems designed to support multiple independent roll-ups are also usually designed so that it’s easy, from a data point of view, to add a brand-new roll-up or tree to the existing system. Such flexibility enables these systems to adapt to changing classification criteria.
We can learn a lot from these systems for financial reporting, and apply the lessons to classification systems in general. A classification system should be designed to support multiple independent systems of classification. For example, products might be classified by 50 different states’ sales tax laws, by Sales incentive plans (which change over time), and by delivery organization (which is also subject to change). Customers might be classified by region, by size, by industry, by revenue, and by sales organization serving the customer. It should be easy to add, remove, or modify a classification system. It should be possible to change the classification criteria. (Often classification criteria are stored separately from the system storing the classification hierarchies.)
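To make this concrete, here is a rough Python sketch of a registry that holds any number of independent classification schemes for the same items. The class name, method names, and sample customer are all hypothetical. It is a sketch under the assumption that each scheme is a flat item-to-category mapping; a real system might store full hierarchies instead.

```python
class ClassificationRegistry:
    """Holds independent classification schemes over the same items."""

    def __init__(self):
        self._schemes = {}  # scheme name -> {item -> category}

    def add_scheme(self, name):
        """Register a new scheme; existing schemes are unaffected."""
        self._schemes.setdefault(name, {})

    def remove_scheme(self, name):
        """Drop a scheme; the items and the other schemes are unaffected."""
        self._schemes.pop(name, None)

    def assign(self, scheme, item, category):
        """Classify an item within one scheme."""
        self._schemes[scheme][item] = category

    def classify(self, item):
        """Return every classification the item currently has."""
        return {
            name: members[item]
            for name, members in self._schemes.items()
            if item in members
        }


registry = ClassificationRegistry()
for scheme in ("region", "size", "industry"):
    registry.add_scheme(scheme)

registry.assign("region", "Acme Corp", "Midwest")
registry.assign("size", "Acme Corp", "large")
registry.assign("industry", "Acme Corp", "manufacturing")
before = registry.classify("Acme Corp")

# Later, the business adds one way of classifying customers and retires another.
registry.add_scheme("sales_org")
registry.assign("sales_org", "Acme Corp", "Enterprise Sales")
registry.remove_scheme("size")
after = registry.classify("Acme Corp")
```

Because no scheme knows about any other, adding `sales_org` or retiring `size` never forces a decision about how the customer fits into the remaining schemes.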
Data modelers should understand the various strategies available for designing tables that can store multiple classification hierarchies, and that can “unroll” or denormalize them for easier use.
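One such strategy, sketched here with Python’s built-in `sqlite3` module, is to store every hierarchy as parent-child edges in a single table keyed by hierarchy name, and then “unroll” the edges into one row per descendant-ancestor pair with a recursive query. The table name, column names, and sample rows are assumptions made for illustration, and this is only one of several workable designs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per parent-child edge; many hierarchies share the table.
    CREATE TABLE hierarchy_edge (
        hierarchy TEXT NOT NULL,
        child     TEXT NOT NULL,
        parent    TEXT NOT NULL,
        PRIMARY KEY (hierarchy, child)
    );
    INSERT INTO hierarchy_edge VALUES
        ('by_region', 'Acme', 'Midwest'),
        ('by_region', 'Midwest', 'US'),
        ('by_size', 'Acme', 'Large');
    """
)

# Recursively follow edges upward, staying within the same hierarchy,
# to produce one row for every (descendant, ancestor) pair.
UNROLL_SQL = """
WITH RECURSIVE ancestry(hierarchy, descendant, ancestor, depth) AS (
    SELECT hierarchy, child, parent, 1
    FROM hierarchy_edge
    UNION ALL
    SELECT e.hierarchy, a.descendant, e.parent, a.depth + 1
    FROM ancestry AS a
    JOIN hierarchy_edge AS e
      ON e.hierarchy = a.hierarchy AND e.child = a.ancestor
)
SELECT hierarchy, descendant, ancestor, depth
FROM ancestry
ORDER BY hierarchy, descendant, depth
"""

unrolled = conn.execute(UNROLL_SQL).fetchall()
```

The unrolled rows can be materialized into a table of their own, so that a query such as “everything under US in the region hierarchy” becomes a simple filter rather than a tree walk. A new hierarchy is just a new set of rows in `hierarchy_edge`.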
By designing our classification systems to be flexible, we won’t have to spend inordinate amounts of time wringing our hands over the one thing that breaks our singular, rigid classification system. News of the latest mammal from Australia will be easy to accommodate.
This monthly blog talks about data architecture and data modeling topics, focusing especially, though not exclusively, on the non-traditional modeling needs of NoSQL databases. The modeling notation I use is the Concept and Object Modeling Notation, or COMN (pronounced “common”), and is fully described in my book, NoSQL and SQL Data Modeling (Technics Publications, 2016). See http://comn.dataversity.net/ for more information.
Copyright © 2017, Ted Hills