A Universal Problem in Data Management – Classification

Some classification problems are very straightforward. For example, a data vendor uses a numerical country code, and the business uses the alphanumeric ISO-2 standard. In principle, there is a one-to-one mapping between values and a few obvious and simple implementation choices. In practice, there are always exceptions, and even this simple scenario can be tricky.
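To make the simple case concrete, here is a minimal sketch in Python. The table contents and the names `VENDOR_TO_ISO2`, `UnmappedCountryCode`, and `to_iso2` are illustrative assumptions, not any real vendor's scheme. The sample codes happen to match ISO 3166-1 numeric values, but an actual vendor may use something else entirely.

```python
# Hypothetical vendor-code table, for illustration only. The keys happen to
# match ISO 3166-1 numeric codes; a real vendor's scheme may differ.
VENDOR_TO_ISO2 = {
    840: "US",
    826: "GB",
    276: "DE",
    392: "JP",
}


class UnmappedCountryCode(KeyError):
    """Raised when a vendor code has no known ISO alpha-2 equivalent."""


def to_iso2(vendor_code: int) -> str:
    """Translate a vendor country code to an ISO alpha-2 code.

    Unknown codes raise instead of passing through silently, so the
    inevitable exceptions surface where somebody can see them.
    """
    try:
        return VENDOR_TO_ISO2[vendor_code]
    except KeyError:
        raise UnmappedCountryCode(vendor_code) from None
```

Calling `to_iso2(840)` returns `"US"`, while an unlisted code such as `999` raises `UnmappedCountryCode`. Failing loudly is one design choice among several; routing unknown codes to a review queue would be another.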

Other classification problems get very complex. A dozen data elements from three different vendors traverse a decision tree full of branches to produce a proprietary asset risk classification. The observations of a handful of doctors who have never met produce an evolving medical diagnosis. From a universe of potential suitors, a bride chooses one.

Classification is a particular problem in Master Data Management (MDM), which requires the alignment of mostly-incompatible data standards from multiple vendors, as well as the implementation of deeper logic like the asset classification system described above. Hard-pressed MDM teams focus on the problem of the moment, typically producing a special-purpose solution to each issue as it arises, instead of a general solution to all their classification requirements.

[Figure: Real-World Asset Classification: a Rule Set]

Cue the Chorus: Wait, are you seriously suggesting every team should build a general-purpose classification engine? Should we all write our own versions of Microsoft Word, too?

Fair enough. If we were talking about some other software space, the progression would look like this:

  1. Build a special-purpose mapping module.
  2. Build another six completely different classification widgets.
  3. Realize each component is a special case of the same general principle.
  4. Build out the general case, refactor away all the special cases, and rely on your interfaces to limit the blast area of the change.
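Step 3 above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended engine design: the names `Rule`, `classify`, and `rules_from_mapping`, the record fields, and the risk thresholds are all hypothetical. The point is only that a mapping table and a branching rule set can sit behind the same interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Mapping, Optional, Sequence

Record = Mapping[str, Any]


@dataclass(frozen=True)
class Rule:
    """A predicate over a record, plus the label assigned when it matches."""
    predicate: Callable[[Record], bool]
    label: str


def classify(record: Record, rules: Sequence[Rule],
             default: Optional[str] = None) -> Optional[str]:
    """Return the label of the first matching rule, else the default."""
    for rule in rules:
        if rule.predicate(record):
            return rule.label
    return default


def rules_from_mapping(field: str, mapping: Mapping[Any, str]) -> List[Rule]:
    """Express a plain lookup table as a list of equality rules."""
    # The default argument k=key binds each key when its lambda is created.
    return [Rule(lambda r, k=key: r.get(field) == k, label)
            for key, label in mapping.items()]


# Special case 1: a one-to-one mapping table (hypothetical codes).
country_rules = rules_from_mapping("country_code", {840: "US", 826: "GB"})

# Special case 2: a small decision list (hypothetical fields and thresholds).
risk_rules = [
    Rule(lambda r: r.get("rating") in {"AAA", "AA"}
         and r.get("maturity_years", 0) <= 5, "LOW"),
    Rule(lambda r: r.get("rating") in {"AAA", "AA", "A"}, "MEDIUM"),
]
```

Both rule sets go through the same call: `classify({"country_code": 840}, country_rules)` and `classify({"rating": "A", "maturity_years": 3}, risk_rules, default="HIGH")`. Rules are evaluated in order and the first match wins, so rule order carries meaning in this sketch.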

You get there, and you’ll probably wind up wondering why you didn’t just go out and buy the thing, but you get there gracefully and you move on.

In Data Management, interface-based design is hard. So instead, steps 4 and beyond usually look more like this:

  4. Observe that your code base is a tangled plate of spaghetti that will require a complete overhaul in order to integrate your new classification approach, at a far higher cost than just continuing to build out more mapping tables.
  5. Decide not to do the refactor, and not to build out the general case going forward, either, because we probably won’t have to do that much more classification anyway.
  6. GO TO 1.

We were fine up through step 5. But what bit us in the end was the one thing that can be truly said about all successful Data Management projects: they never end. New requirements never stop coming in. So the critical assumption at the end of step 5 is just plain wrong: you will have to build another classification widget. Probably lots of them.

All of this might sound like an argument in favor of instructing your Data Management team to build out a general-purpose classification engine before they do anything else. Also my team. And the guys down the street.

Yes? All of us should solve precisely the same general problem, at a considerable expense of time and treasure, and in as confidential and uncooperative a manner as modern commerce can devise. And instead of doing it under the covers of Agile iteration, let’s build those elephants right up front in all their redundant glory!

That’s not such a good argument.

Here’s a better one: if there’s a piece of general functionality that you know you’re going to need, take all the money you were going to spend on building that thing – however you were going to build it – and go buy it instead. Don’t create. Integrate.

It’s 2018. We live in a world of mashups. Software is a service, and even your toaster has an API. Forget reinventing the wheel… these days, writing software to solve a problem that has already been solved in software makes just about as much sense as producing a wheel from piles of iron ore and coal. Right next door to a wheel store.

Buying component functionality brings a lot of advantages:

  • It’s usually a lot less expensive than building things from scratch, especially in the long run.
  • A dedicated toolmaker with many customers is likely to think of corner cases and features before not having them becomes a crisis.
  • Time is money. And there’s no kind of software development faster than the kind you do with a credit card.

The bottom line is that a little up-front analysis at project start, focused on identifying key low-level functional requirements and aligning them with commercially available integrations, can shave a lot of time and treasure off any project’s road map.

And when that project is in Data Management, and the functionality in question is related to classification – which is pervasive in these systems – those numbers can very easily add up to man-years of effort and hundreds of thousands or millions of dollars.

Jason Williscroft

Jason is an Annapolis-trained systems engineer with a talent for software development and a deeply mathematical streak. After wrestling with bad financial data for more than a decade, Jason founded HotQuant to fix it at its source. HotQuant serves Data Governance Organizations with strategic consulting, technology implementation services, and craftsman-grade developer tools.