The Data-Centric Revolution: Fighting Class Proliferation

One of the ideas we promote is elegance in the core data model in a Data-Centric enterprise. This is harder than it sounds. Look at most application-centric data models: you would think they would be simpler than the enterprise model; after all, they are a small subset of it. Yet we often find individual application data models that are far more complex than the enterprise model that covers them.

You might think that the enterprise model is leaving something out, but that’s not what we’re finding when we load data from these systems. We can generally get all the data and all the fidelity in a simpler model.

It behooves us to ask a pretty broad question:

Where and when should I add new classes to my Data-Centric Ontology?

To answer this, we’re going to dive into four topics:

The tradeoff of convenience versus overhead
What is a class, really?
Where is the proliferation coming from?
What options do I have?

Convenience and Overhead

In some ways, a class is a shorthand for something (we’ll get a bit more detailed in the next section). As such, putting a label on it can often be a big convenience. I have a very charming book called Thing Explainer – Complicated Stuff in Simple Words^[1] by Randall Munroe (the author of the xkcd comics). The premise of Thing Explainer is that even very complex technical topics, such as dishwashers, plate tectonics, the International Space Station, and the Large Hadron Collider, can all be explained using a vocabulary of just ten hundred words. (To give you an idea of the lengths he goes to, he uses “ten hundred” instead of “one thousand” to save a word in his vocabulary.)

So instead of coining a new word in his abbreviated vocabulary, “dishwasher” becomes “box that cleans food holders” (food holders being bowls and plates). I lived in Papua New Guinea part time for a couple of years, and the national language there, Tok Pisin, has only about 2,000 words. Its speakers end up with similar word salads. I remember the grocery store was at “plas bilong san kamup,” or “place belong sun come up,” which is Tok Pisin for “East.”

It is much easier to refer to “dishwashers” and “East” than their longer equivalents. It’s convenient. And it doesn’t cost us much in everyday conversation.

But let’s look at the convenience / overhead tradeoff in a typical information system. Every time you add a new class (or a new attribute) to an information system, you are committing the enterprise to deal with it, potentially for decades to come. The overhead starts with application programming: that new concept has to be referred to by code, and not just a small amount. I’ve done some calculations in my book, Software Wasteland, that suggest each attribute added to a system adds at least 1,000 lines of source code: code to move the item from the database to some API, code to take it from the API and put it in the DOM or something similar, code to display it on a screen, in a report, maybe even in a drop-down list, and code to validate it. Given that it costs money to write and test code, this adds to the cost of a system. The real impact is felt downstream: in application maintenance, especially in the brittle world of systems integration, and by the users. Every new attribute is a new field on a form to puzzle about. Every new class is often a new form. New forms often require changes to process flow. And so the complexity grows.

Finally, there is cognitive load. When we have to deal with dozens or hundreds of concepts, we don’t have too much trouble. When we get to thousands it becomes a real undertaking. Tens of thousands and it’s a career. And yet many individual applications have tens of thousands of concepts. Most large enterprises have millions.

One of the other big overheads in traditional technology is duplication. When you create a new class, let’s say “hand tools,” you may have to make sure that the wrench is in the Hand Tools class / table and also in the Inventory table. Relying on humans and procedures to remember to put things in more than one place is a huge undocumented burden.

We want to think long and hard before introducing a new class or even a new attribute.

What is a Class

In modern programming environments a class is really three things, almost simultaneously:

A template
A set
A type

Sure enough, when I googled “What is a class?” most of the first page suggested the first meaning, using either the term template or blueprint. Developers create classes (either database classes / tables or classes within their programming language environment) when they want to say something different about a particular type of object. If a Single-Family Mortgage has different properties than a Multi-Family Mortgage, they create new classes to accommodate those differences. They do this not necessarily because it’s the best thing to do, but because it’s usually the most expedient. We will explore other options and tradeoffs, but first let’s finish answering “What is a class?”

Database modelers and users acknowledge the template-ness of a class / table. If you have a table with four columns, when you insert a row, you can supply up to four values. Not five. The template doesn’t allow five.
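To make the template idea concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The employee table and its four columns are made up for illustration; the point is only that the table definition acts as a template that accepts up to four values and rejects a fifth.

```python
import sqlite3

# In-memory database with a hypothetical four-column table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER, name TEXT, hire_date TEXT, dept TEXT)"
)

# Supplying fewer than four values is allowed; the other columns stay NULL.
conn.execute("INSERT INTO employee (id, name) VALUES (?, ?)", (1, "Ana"))


def try_insert_five(connection):
    """Attempt to insert five values into the four-column table."""
    try:
        connection.execute(
            "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
            (2, "Ben", "2020-01-01", "Ops", "extra"),
        )
        return True
    except sqlite3.OperationalError:
        # The template does not allow a fifth value.
        return False


five_accepted = try_insert_five(conn)
row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```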

But database users also see the set idea. When they query the employee master file, they expect to get the set of all employees. When they want to subset it, say to just exempt employees, they supply a filter: overtime status = “07”, or some such obscure way of signaling that an individual is exempt from overtime. Of course, some developer or database designer may have beaten them to it and already created two tables, one for exempt and one for nonexempt, in which case the challenge isn’t so much sub-setting as putting things back together. Now if you want the set of all employees, you need to query two tables and combine them.
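The two designs can be sketched side by side, with plain Python lists standing in for tables. The names, rows, and the meaning of the “07” code are illustrative assumptions, not taken from any real system.

```python
# Design A: one employee table; exempt status is just a column value.
employees = [
    {"id": 1, "name": "Ana", "overtime_status": "07"},  # "07" = exempt (assumed)
    {"id": 2, "name": "Ben", "overtime_status": "01"},
    {"id": 3, "name": "Cy", "overtime_status": "07"},
]


def exempt_employees(rows):
    """Subset by filter: the set is defined by a query, not by a table."""
    return [r for r in rows if r["overtime_status"] == "07"]


# Design B: the designer split the same people into two tables up front.
exempt_table = [{"id": 1, "name": "Ana"}, {"id": 3, "name": "Cy"}]
nonexempt_table = [{"id": 2, "name": "Ben"}]


def all_employees(exempt, nonexempt):
    """Getting the full set back now requires combining both tables."""
    return sorted(exempt + nonexempt, key=lambda r: r["id"])
```

In Design A the subset costs one filter. In Design B the full set costs a union, and every later split (by region, by department) multiplies the number of tables to recombine.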

The final use of class is to signal type. Type is simple with primitive types. We talk about dates and integers as primitive types. When we want to refer to more complex types, we need classes (or something somewhat like a class).

When you come across a person and want to know (relative to your domain) what type of person this is (Is this a patient or a provider?), then you’re asking about the type. You’re not immediately concerned with the structure (template) of the records, nor with the sets; you’re only interested in the one individual.

Where is the Proliferation Coming From?

I think one source of class proliferation comes from conflating these three notions of class. If you need to make a distinction for any of the three reasons, you create a new class. In doing so, you’ve committed to a considerable amount of overhead.

Another place proliferation comes from is laziness. Often, when you come across something new, it’s easier to treat it as a new thing. That way, it doesn’t mess up the stuff you’ve already done. You have another separate place to deal with this new thing. While it may reduce the cognitive load for the developer, it has exported the cognitive load to the persons who have to deal with the system.

A variation on this, and one that bleeds into the next couple of paragraphs, is the distinction between splitters and lumpers, and how the world of data modelers and developers seems to be highly skewed toward splitters. Darwin coined the terms in 1857 in a letter to Joseph Hooker: “It is good to have hair-splitters & lumpers.” Over the years the distinction has grown to be that splitters like to have many small categories and lumpers like to take new distinctions and fit them into existing categories. You can see how a predilection toward splitting would lead to more classes.

Importing and reuse also lead to proliferation. If you bring in a data model, an ontology, or a whole application, just to get a few concepts, you have easily and rapidly polluted your data space.

And finally, a malady that seems to be more common in ontological circles than in traditional database design circles is “throwing in everything you know.” Many ontologies are built by committees. The individual members of committees tend to want to contribute, and in the act of doing so, throw in everything they know about a subject. A lot of what gets thrown in becomes classes. There are some pretty bloated examples out there. SNOMED, an ontology of symptoms and diseases, has over 300,000 classes. And yet one of the most successful healthcare applications, at Montefiore Healthcare, has managed to import SNOMED into an ontology with fewer than 500 classes. eCl@ss is an ontology of electrical devices containing over 30,000 classes, and yet we imported all of Schneider Electric’s products into an ontology of fewer than 100 classes.

Proliferation comes from many quarters and in many guises. The question is: What to do about it?

Options

I’m going to proceed as if you have a Knowledge Graph and an Ontology, because this affords many more options to solve these problems. These problems are solvable in traditional technology, but the solutions are often harder to see. We have come across cases where designing something in semantics led to design choices that could be re-exported and implemented in traditional technology. Once you see the resulting traditional design, you think, “That could have been designed that way natively,” but that is not what happens in our experience. The affordances of traditional development environments often blind us to some possibilities.

First, consider why you are contemplating making a new class. Next, ask for whom it is a convenience and who will suffer the overhead.

I’m going to use an example from a client to make these options and tradeoffs more tangible.

This client deals with countries as clients. They like to group countries for analytic purposes. Some of these groupings are purely geographical. “Africa” is one such group; if you are located within the borders of the continent (plus Madagascar), you are “in” “Africa.” Some of the groupings are self-selected; for instance, ASEAN (the Association of Southeast Asian Nations) is a membership-based grouping, where the individual countries opt into the group. Some groupings are based on characteristics of the country; for instance, one might define the group of countries called “emerging markets” as those with low to middle per capita income. This requires us to define what low to middle per capita income is, but that’s not a mighty hurdle. And finally, there are groups defined somewhat arbitrarily to assign internal divisions or departments. They may lump Libya, Egypt, Jordan, and Syria into a Northeast Africa group, even though two of the four aren’t in Africa.

Let’s look at these through the lens of the three types of classes above.

Template

Do we think these types of groups need separate templates? That is, do we think we are going to impose different structures, or require different sets of properties, on these different types of groups? It seems unlikely to me on the surface. This is good, because template expansion seems to be the category that creates the most overhead. In a typical relational environment, because there is no reuse and no extension mechanism, every new class built as a template creates bloat.

Interestingly, in a semantic system we can separate meaning from structure in some very interesting ways. I will only cover it briefly here, but imagine the set of all employees. That set wants to have a different structure in the payroll database than the HR database. In the payroll database, we insist on having social security number, start date, pay rate, vacation accrual, and the like. In the HR database, we may insist on not having social security number, but at the same time, require emergency contact information, skills inventory, previous employers, and the like. In a traditional database, those would be two completely separate sets. In a semantic database (really a federation of semantic databases), this can be one set (the set of all employees) who have different properties in different repositories.
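As a rough sketch of that idea, each repository below is just a Python set of (subject, predicate, object) triples. The identifiers, property names, and values are all hypothetical; a real deployment would use a triple store rather than in-memory sets.

```python
# Two repositories holding different facts about the same individual.
payroll = {
    (":emp42", "rdf:type", ":Employee"),
    (":emp42", ":payRate", 38.50),
    (":emp42", ":startDate", "2019-03-01"),
}
hr = {
    (":emp42", "rdf:type", ":Employee"),
    (":emp42", ":emergencyContact", "J. Doe"),
    (":emp42", ":skill", "welding"),
}


def members(repo, cls):
    """Individuals asserted to be of the given class in one repository."""
    return {s for s, p, o in repo if p == "rdf:type" and o == cls}


def properties(repo, subject):
    """Predicates (other than rdf:type) used on a subject in one repository."""
    return {p for s, p, o in repo if s == subject and p != "rdf:type"}
```

The set of employees is the same in both repositories, while the properties recorded about each member differ: meaning is shared, structure is local.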

Set

In Semantics, one individual (country, in this case) can simultaneously be in multiple sets. It would be as if one row were in multiple tables, which doesn’t happen in traditional systems (although with a lot of work, you can simulate it). We have at least two mechanisms for assigning an individual to a set. Each involves a single triple. We can say the individual, :Libya, is rdf:type :EmergingMarket. We’ve just assigned it to this group, which is implemented as a class. :Libya is still rdf:type :Country, and perhaps rdf:type :DangerousRegion.
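A minimal sketch of multiple set membership, again using a plain Python set of triples as a stand-in for a graph. The :Egypt triple is added only so that the :Country set has more than one member.

```python
RDF_TYPE = "rdf:type"

# One triple per set membership; an individual can carry any number of them.
triples = {
    (":Libya", RDF_TYPE, ":Country"),
    (":Libya", RDF_TYPE, ":EmergingMarket"),
    (":Libya", RDF_TYPE, ":DangerousRegion"),
    (":Egypt", RDF_TYPE, ":Country"),
}


def types_of(individual, graph):
    """All classes (sets) the individual currently belongs to."""
    return {o for s, p, o in graph if s == individual and p == RDF_TYPE}


def members_of(cls, graph):
    """All individuals in the given class (set)."""
    return {s for s, p, o in graph if p == RDF_TYPE and o == cls}
```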

This can work, but astute readers will find it a bit awkward. It is a bit more natural when we think of its relationship to OPEC. OPEC is the Organization of the Petroleum Exporting Countries. Despite the fact that the US currently exports petroleum, it is not a member, as this is an explicit membership group. We could make a class for OPEC. But we could just as easily make an instance of the class :MembershipBasedCountryGroups (that’s a bit of a mouthful, but I wanted to be precise). In many ways, this is preferable, especially as we may want to keep track of when a country joined a group. It is virtually impossible to model when an individual became a member of a class.
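Here is one way the instance-based approach might look, sketched with a small Python dataclass. The Membership record, the placeholder countries, and the dates are all invented for illustration; they are not real OPEC accession dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Membership:
    """A country's membership in a group instance, with its own dates."""
    member: str
    group: str
    joined: date
    left: Optional[date] = None


# Placeholder countries and dates, purely illustrative.
memberships = [
    Membership(":CountryA", ":OPEC", date(2000, 1, 1)),
    Membership(":CountryB", ":OPEC", date(2005, 6, 1), left=date(2010, 1, 1)),
]


def members_on(group, day, records):
    """Countries that were members of the group on the given day."""
    return {
        m.member
        for m in records
        if m.group == group
        and m.joined <= day
        and (m.left is None or day < m.left)
    }
```

Because the group is an instance and the membership is data, the join and leave dates have somewhere to live. A bare rdf:type assertion has no such slot.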

Our go-to way to define a set is usually to start with a category. We use a single predicate, gist:categorizedBy, to connect an individual to a category, which indirectly defines a set. So when we say an individual person is categorized as :female, we have a set (which you have to query for) of all female persons. If we later decide that we need a class for female persons, it is a very simple thing in OWL to say :Woman == :Person and hasValue gist:categorizedBy :female. As soon as you make this one-line axiom (and create a formal definition of :Woman) and run the reasoner, all the :Persons that have been categorized as female will now be members of this class.
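The effect of that axiom can be imitated in a few lines of Python. This is only a toy stand-in for one direction of what an OWL reasoner does with a hasValue restriction, not a real reasoner; the individuals are invented, and :rex is included to show that being categorized as :female is not enough without also being a :Person.

```python
triples = {
    (":alice", "rdf:type", ":Person"),
    (":alice", "gist:categorizedBy", ":female"),
    (":bob", "rdf:type", ":Person"),
    (":bob", "gist:categorizedBy", ":male"),
    (":rex", "gist:categorizedBy", ":female"),  # categorized, but not a :Person
}

# :Woman == :Person and (gist:categorizedBy hasValue :female)
definitions = {":Woman": (":Person", "gist:categorizedBy", ":female")}


def infer_types(graph, defs):
    """Return the rdf:type triples implied by the class definitions."""
    inferred = set()
    for cls, (base, prop, value) in defs.items():
        for s, p, o in graph:
            is_base = p == "rdf:type" and o == base
            if is_base and (s, prop, value) in graph:
                inferred.add((s, "rdf:type", cls))
    return inferred
```

Nobody had to place :alice in :Woman by hand; the membership follows from data that was already there.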

This has all kinds of benefits; perhaps the most profound is that this declaration can be done locally. Everyone need not be burdened by this class definition. Additionally, it does not imply any further overhead. No one has to place the people in this class and everyone else can continue to classify people by their categorical gender.

In the case of countries, any case where the assignment of a country to a category involves human judgement is probably better done this way. The borderline cases are groups that could be determined by inference.
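For the inference-based case, the “emerging markets” group from the client example could be derived from data rather than assigned. In the sketch below the income ceiling and the per capita figures are made-up placeholders, not real statistics or an official definition.

```python
# Hypothetical ceiling for "low to middle" per capita income.
LOW_TO_MIDDLE_MAX = 13_000

# Placeholder countries and incomes, purely illustrative.
per_capita_income = {
    ":CountryA": 4_200,
    ":CountryB": 61_000,
    ":CountryC": 12_999,
}


def emerging_markets(incomes, ceiling=LOW_TO_MIDDLE_MAX):
    """Derive group membership from a characteristic instead of asserting it."""
    return {country for country, income in incomes.items() if income <= ceiling}
```

When the income data or the threshold changes, the group changes with it, and no one has to remember to reassign countries.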

Type

In the case of countries, most countries are primarily of type :Country. It is probably best to think of other cases as being membership or grouping rather than types.

Conclusion

Almost all traditional systems have runaway proliferation of classes. There are many reasons for this, but many come down to a lack of introspection, to not asking why a class was added. The cost of proliferation is huge, and often hidden.

Semantic technology gives us some options (but as we’ve seen from bloated ontologies, not the guarantee) that can dramatically reduce class bloat.

^[1] https://www.amazon.com/dp/0544668251/


Dave McComb
