A Universal Problem in Data Management – Classification

Some classification problems are very straightforward. For example, a data vendor uses a numerical country code, and the business uses the alphanumeric ISO-2 standard. In principle, there is a one-to-one mapping between values and a few obvious and simple implementation choices. In practice, there are always exceptions, and even this simple scenario can be tricky.
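To make the country-code case concrete, here is a minimal sketch in Python. The vendor codes below are invented for illustration (a real table would come from the vendor's documentation), and the names `VENDOR_TO_ISO2`, `OVERRIDES`, and `to_iso2` are hypothetical. The point is only that the exceptions deserve a home of their own, separate from the one-to-one table.

```python
from typing import Dict, Optional

# Hypothetical vendor codes mapped to ISO alpha-2 codes.
# These numbers are made up; a real table comes from the vendor.
VENDOR_TO_ISO2: Dict[int, str] = {
    101: "US",
    102: "GB",
    103: "DE",
}

# Exceptions to the one-to-one rule, kept separate so they stay visible.
# Assumption: the vendor uses 999 for "unknown", which maps to no country.
OVERRIDES: Dict[int, Optional[str]] = {999: None}


def to_iso2(vendor_code: int) -> Optional[str]:
    """Translate a vendor country code to ISO alpha-2, or fail loudly."""
    if vendor_code in OVERRIDES:
        return OVERRIDES[vendor_code]
    try:
        return VENDOR_TO_ISO2[vendor_code]
    except KeyError:
        # An unmapped code is a data problem, not something to guess at.
        raise ValueError(f"unmapped vendor country code: {vendor_code}") from None
```

Raising on an unmapped code is one design choice among several. Returning a sentinel value or routing the record to a review queue would be just as defensible, depending on how the downstream system treats gaps.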

Other classification problems get very complex. A dozen data elements from three different vendors traverse a decision tree full of branches to produce a proprietary asset risk classification. The observations of a handful of doctors who have never met produce an evolving medical diagnosis. From a universe of potential suitors, a bride chooses one.
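The asset risk example can be sketched as an ordered rule set where the first matching rule wins. Everything specific here is an assumption made for illustration: the `Asset` fields, the rating buckets, and the risk labels are hypothetical, and a real rule set would have far more branches.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Asset:
    # Hypothetical data elements, imagined as coming from different vendors.
    asset_type: str
    credit_rating: str
    years_to_maturity: float


# A rule pairs a predicate with the label it assigns.
Rule = Tuple[Callable[[Asset], bool], str]

# Rules are evaluated top to bottom, so order encodes priority.
RULES: List[Rule] = [
    (lambda a: a.asset_type == "equity", "HIGH"),
    (lambda a: a.credit_rating in {"AAA", "AA"} and a.years_to_maturity <= 5, "LOW"),
    (lambda a: a.credit_rating in {"AAA", "AA", "A"}, "MEDIUM"),
]


def classify(asset: Asset, default: str = "UNCLASSIFIED") -> str:
    """Return the label of the first rule that matches, else the default."""
    for predicate, label in RULES:
        if predicate(asset):
            return label
    return default
```

For instance, `classify(Asset("bond", "AAA", 3.0))` falls through the first rule and matches the second, while an asset that matches nothing gets the default label rather than a silent guess.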

Classification is a particular problem in Master Data Management (MDM), which requires the alignment of mostly-incompatible data standards from multiple vendors, as well as the implementation of deeper logic like the asset classification system described above. Hard-pressed MDM teams focus on the problem of the moment, typically producing a special-purpose solution to each issue as it arises, instead of a general solution to all their classification requirements.

Real-World Asset Classification: a Rule Set

Cue the Chorus: Wait, are you seriously suggesting every team should build a general-purpose classification engine? Should we all write our own versions of Microsoft Word, too?

Fair enough. If we were talking about some other software space, the progression would look like this:

  1. Build a special-purpose mapping module.
  2. Build another six completely different classification widgets.
  3. Realize each component is a special case of the same general principle.
  4. Build out the general case, refactor away all the special cases, and rely on your interfaces to limit the blast area of the change.

You get there, and you’ll probably wind up wondering why you didn’t just go out and buy the thing, but you get there gracefully and you move on.
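As a rough illustration of step 4, the general case might be nothing more than a shared interface that every special case implements. This is a sketch under assumptions, not a design for a real engine: the names `Classifier`, `LookupClassifier`, `RuleClassifier`, and `classify_all` are hypothetical, and records are assumed to be plain key-value mappings.

```python
from typing import Any, Callable, Dict, List, Mapping, Optional, Protocol, Tuple

Record = Mapping[str, Any]


class Classifier(Protocol):
    """The one interface callers depend on."""

    def classify(self, record: Record) -> Optional[str]: ...


class LookupClassifier:
    """Special case 1: a mapping table keyed on a single field."""

    def __init__(self, field: str, table: Dict[Any, str]) -> None:
        self.field = field
        self.table = table

    def classify(self, record: Record) -> Optional[str]:
        return self.table.get(record.get(self.field))


class RuleClassifier:
    """Special case 2: ordered predicates, first match wins."""

    def __init__(self, rules: List[Tuple[Callable[[Record], bool], str]]) -> None:
        self.rules = rules

    def classify(self, record: Record) -> Optional[str]:
        for predicate, label in self.rules:
            if predicate(record):
                return label
        return None


def classify_all(classifier: Classifier, records: List[Record]) -> List[Optional[str]]:
    """Callers see only the interface, which limits the blast area of change."""
    return [classifier.classify(r) for r in records]


country = LookupClassifier("vendor_country", {101: "US", 102: "GB"})
print(classify_all(country, [{"vendor_country": 101}, {"vendor_country": 7}]))
# prints ['US', None]
```

Because `classify_all` knows nothing about tables or rules, swapping one implementation for another touches no calling code.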

In Data Management, interface-based design is hard. So instead, steps 4 and beyond usually look more like this:

  4. Observe that the tangled plate of spaghetti that is your code base will require a complete overhaul in order to integrate your new classification approach, at a far higher cost than just continuing to build out more mapping tables.
  5. Decide not to do the refactor, and not to build out the general case going forward, either, because we probably won’t have to do that much more classification anyway.
  6. GO TO 1.

We were fine up through step 5. But what bit us in the end was the one thing that can be truly said about all successful Data Management projects: they never end. New requirements never stop coming in. So the critical assumption at the end of step 5 is just plain wrong: you will have to build another classification widget. Probably lots of them.

All of this might sound like an argument in favor of instructing your Data Management team to build out a general-purpose classification engine before they do anything else. Also my team. And the guys down the street.

Yes? All of us should solve precisely the same general problem, at a considerable expense of time and treasure, and in as confidential and uncooperative a manner as modern commerce can devise. And instead of doing it under the covers of Agile iteration, let’s build those elephants right up front in all their redundant glory!

That’s not such a good argument.

Here’s a better one: if there’s a piece of general functionality that you know you’re going to need, take all the money you were going to spend on building that thing – however you were going to build it – and go buy it instead. Don’t create. Integrate.

It’s 2018. We live in a world of mashups. Software is a service, and even your toaster has an API. Forget reinventing the wheel… these days, writing software to solve a problem that has already been solved in software makes just about as much sense as producing a wheel from piles of iron ore and coal. Right next door to a wheel store.

Buying component functionality brings a lot of advantages:

  • It’s usually a lot less expensive than building things from scratch, especially in the long run.
  • A dedicated toolmaker with many customers is likely to think of corner cases and features before not having them becomes a crisis.
  • Time is money. And there’s no kind of software development faster than the kind you do with a credit card.

The bottom line is that a little up-front analysis at project start, focused on identifying key low-level functional requirements and aligning them with commercially available integrations, can shave a lot of time and treasure off any project’s road map.

And when that project is in Data Management, and the functionality in question is related to classification – which is pervasive in these systems – those numbers can very easily add up to man-years of effort and hundreds of thousands or millions of dollars.


About Jason Williscroft

Jason is an Annapolis-trained systems engineer with a talent for software development and a deeply mathematical streak. After wrestling with bad financial data for more than a decade, Jason founded HotQuant to fix it at its source. HotQuant serves Data Governance Organizations with strategic consulting, technology implementation services, and craftsman-grade developer tools.

  • Richord1

    Agreed, classifications of data are important but not a technical problem. A classification scheme must be designed using techniques from ontology, taxonomy, linguistics, human behaviors and philosophy of human communications and information.
    Each organization has its own ontological view of its data. Standards bodies such as ISO have their ontological views. There is no “one classification”. Harmonizing these ontological views is a reality. Mapping of ontologies must be done before any rules engine is considered.
    For example, the ISO country codes contain countries that are not recognized by the US State Department. From a political point of view they do not exist as “countries”. Using the ISO codes as expressed is not necessarily viable in some instances.
    A list of US states may or may not include territories. Seems like a “simple” problem, but how do you “map” a territory to a state? The concept of a territory is different and has different properties. Simplistic classification mapping (smashups) is not a solution.
