The Data Centric Revolution: The Evolution of the Data Centric Revolution Part Two

In the previous installment (The Data Centric Revolution: The Evolution of the Data Centric Revolution Part One), we looked at some of the early trends in application development that foreshadowed the data centric revolution, including punched cards, magnetic tape, indexed files, databases, ERP, Data Warehouses and Operational Data Stores.

In this installment, we pick up the narrative with some of the more recent developments that are paving the way for a data centric future.

Master Data Management

Somewhere along the line, someone noticed (perhaps harkening back to the reel-to-reel days) that two kinds of data had become mixed together in the database-backed application: transactional data and master data.  The master data was data about entities, such as Customers, Vendors, Equipment, Fixed Assets, or Products, and it was often replicated widely. For instance, every order entry system has to have yet another Customer table, if for nothing else than its integrity constraints.

If you could just get all the master data in one place, you’d have made some headway.  In practice, it rarely happened. Why? In the first place, it’s pretty hard.  Most of the MDM packages still use older, brittle technology, which makes it difficult to keep up with the many and various end-points to be connected.  Secondly, it only partially solves the problem, as each system still has to maintain a copy of the data, if for nothing else than its data integrity constraints.  Finally, it delivers only a partial solution to the use cases that justified it. For example, the 360° view of the customer was a classic justification, but people didn’t want a 360° view of the master data; they wanted to see the transaction data.  Our observation is that most companies that set out to implement several MDMs gave up after about a year and a half, when they found they weren’t getting the payout they expected.

Canonical Message Model

Service Oriented Architecture (SOA) was created to address the diseconomy of the system integration space.  Instead of point-to-point interfacing, you send transactional updates onto a bus (the Enterprise Service Bus) and let rules on the bus distribute the updates to wherever they are needed.

The plumbing of SOA works great.  It’s mostly about managing messages and queues and making sure messages don’t get lost, even if part of the architecture goes down. But most companies stalled out on their SOA implementations because they had not fully addressed their data issues.  Most companies took the APIs that each of their applications “published” and then put them on the bus as messages.  This essentially required all the other end-points to understand each other.  This was point-to-point interfacing over a bus.  To be sure, it is an improvement, but not as much as was expected.

Enter the Canonical Message Model.  This is a little-known approach that generally works well where we’ve seen it applied.  The basic concept is to create an elegant [1] model of the data that is to be shared.  The trick is in the elegance.  If you can build a simple model that captures the distinctions that need to be communicated, there are tools that will help you build shared messages derived from that simple model.  Having a truly shared message is what gets one out of the point-to-point trap. Each application “talks” through messages to the shared model (which is only instantiated “in motion,” so the versioning problem that plagued the ODS is much easier to solve), which in turn “talks” to the receiving application.
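The mechanics can be sketched in a few lines. This is a toy illustration, not any particular tool: the application names, field names, and mapping tables below are all hypothetical. The point is that each application maps its local vocabulary to one shared canonical model, so N applications need N mappings rather than N × (N − 1) point-to-point translations.

```python
# Toy sketch of the canonical message idea. Each application maps its
# local field names to one shared, elegant model; the canonical message
# exists only "in motion" between sender and receiver.

CANONICAL_FIELDS = {"customer_id", "name", "postal_code"}

# Per-application mappings (hypothetical names).
ORDER_ENTRY_TO_CANONICAL = {"cust_no": "customer_id", "cust_nm": "name", "zip": "postal_code"}
BILLING_FROM_CANONICAL = {"customer_id": "acct_id", "name": "acct_name", "postal_code": "post_cd"}

def to_canonical(message: dict, mapping: dict) -> dict:
    """Translate an application's outbound message into the canonical model."""
    out = {mapping[k]: v for k, v in message.items() if k in mapping}
    assert set(out) <= CANONICAL_FIELDS  # only canonical distinctions cross the bus
    return out

def from_canonical(message: dict, mapping: dict) -> dict:
    """Translate a canonical message into a receiving application's local shape."""
    return {mapping[k]: v for k, v in message.items() if k in mapping}

# An order-entry update travels to billing via the canonical model.
update = {"cust_no": "C-42", "cust_nm": "Acme Corp", "zip": "80521"}
canonical = to_canonical(update, ORDER_ENTRY_TO_CANONICAL)
local = from_canonical(canonical, BILLING_FROM_CANONICAL)
print(local)  # {'acct_id': 'C-42', 'acct_name': 'Acme Corp', 'post_cd': '80521'}
```

Adding a new application to the bus means writing one mapping to the canonical model, not one interface per existing end-point.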

RESTful Endpoints

Roy Fielding’s PhD dissertation [2] reverse engineered the principles underlying the incredible success of the World Wide Web.  We take it for granted that we now have access to a network with billions of attached nodes, handling a zettabyte [3] of information that can be retrieved by any node with the access rights.  One of the key principles he described is Representational State Transfer (REST), a style of interaction that focuses on “Resources” rather than APIs.  Most APIs are very procedural (e.g., “getCustomerAddress”, “postBalanceToAccount”), whereas RESTful interfaces are data centric (“/Customers” will give you a list of customers, and “/Customer/<custid>” will get you everything we know about that customer).

This is a great architectural development, and all that’s really needed to make a RESTful architecture data centric is to focus on the definition and modeling of the resources.
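The contrast between the two styles can be made concrete with a small routing sketch. No web framework is assumed here; the paths, resource names, and data are hypothetical. The key REST idea on display is that the URL names a piece of data, and a generic verb (GET) applies to it, rather than each operation getting its own procedure name.

```python
# Minimal sketch of resource-centric (RESTful) routing. A procedural API
# would expose getCustomerAddress(); the resource style exposes the
# customer itself and lets the client pick out what it needs.

CUSTOMERS = {
    "c1": {"name": "Acme Corp", "address": "123 Main St"},
    "c2": {"name": "Globex", "address": "9 Elm Ave"},
}

def dispatch(method: str, path: str):
    """Route a request: the path names a resource, not a verb."""
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] != "Customers":
        return 404, None
    if len(parts) == 1:
        # GET /Customers -> the list of customers
        return (200, list(CUSTOMERS)) if method == "GET" else (405, None)
    cust = CUSTOMERS.get(parts[1])
    if cust is None:
        return 404, None
    # GET /Customers/<custid> -> everything we know about that customer
    return (200, cust) if method == "GET" else (405, None)

status, body = dispatch("GET", "/Customers/c1")
print(status, body["address"])  # 200 123 Main St
```

Because every resource is addressed the same way, making such an architecture data centric reduces, as noted above, to modeling the resources well.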

Big Data

Many people think of a very large (say multi-petabyte) data warehouse as “Big Data.”  It is a lot of data, but it isn’t what Big Data practitioners mean when they say “Big Data.”

Big Data is more about an approach to dealing with data.  The approach co-evolved with MapReduce and Hadoop, and generally involves massively replicating small code emissaries, sending them off to distributed data stores, and then coordinating the results on their way back.

Companies have been able to handle incredible scale with the Big Data approach. But in my opinion, it’s the stylistic approach that makes it Big Data, not the amount of data.

The Big Data approach is easiest with huge amounts of similarly structured data.  Examples of such homogeneous data include clickstream analysis, Twitter feeds, and query logs.

Data Scientists [4] have taken to the Big Data approach. And because of the nature of their tasks, they are often dealing with many heterogeneous data sets rather than the homogeneous sets on which Big Data originally cut its teeth.  This is what led Gartner to coin the three Vs of Big Data: volume (the amount of data), velocity (really latency, as many Big Data algorithms will not give you real-time information), and variety (the data is not really homogeneous most of the time).

The variety piece is catching up to us.  Several studies [5] have suggested that data scientists spend 50-80% of their time on the hand work of collecting, understanding, and organizing the data they are dealing with.

The Data Lake

One of the current data fads is the “Data Lake.”  It is generally motivated by either the cost and hand work involved in onboarding new data sources (in most mature environments it takes 3-6 months to bring a new data set into the data warehouse), or the latency (most data warehouses still work on a batch cycle).

The basic premise of the data lake is to skip most of the hand work in the ETL process and just lay the source data down in the “lake” pretty close to its native format.  This also has the added benefit of not introducing errors in the extract and transform processes.

From there, data scientists (or business analysts who would like to believe they are data scientists) can dip into the lake and get “insights.” In other words, they can spend 50-80% of their time trying to interpret what has been laid down in the lake in hopes of finding a few nuggets worthy of justifying their time investment.

The good news about the Data Lake fad is that soon everyone will have a data lake and it won’t seem like a foreign concept that has to be sold to management.  The bad news is that, in many organizations, it will get a bad name.

However, it offers a platform that, with a bit of work, has a lot of promise for the data-centric approach.  Just a minimal bit of alignment on the way in will save incredible amounts of ad-hoc schema creation.  Having a prebuilt starting point will also save a great deal of time.  And for the time that does get invested in “research” (plumbing the depths of the data lake), we can create a data lake repository so that others following in the footsteps need not retrace all the steps.
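What "a minimal bit of alignment on the way in" might look like can be sketched briefly. This is a hypothetical illustration, not a product feature: the source data lands in the lake untransformed, close to its native format, but each drop is wrapped with just enough metadata (source system, format, landing time) that later research need not start from zero.

```python
import json
import pathlib
import tempfile
import time

# Hypothetical sketch of minimal alignment on data-lake ingest: keep the
# raw bytes untouched, but record a small manifest alongside each drop.

lake = pathlib.Path(tempfile.mkdtemp()) / "lake"

def land(source: str, fmt: str, raw: bytes) -> pathlib.Path:
    """Lay source data down near its native format, plus a manifest."""
    drop = lake / source / time.strftime("%Y-%m-%d")
    drop.mkdir(parents=True, exist_ok=True)
    (drop / "data").write_bytes(raw)  # native format, no transform, no new errors
    (drop / "manifest.json").write_text(json.dumps(
        {"source": source, "format": fmt, "landed_at": time.time()}))
    return drop

path = land("order_entry", "csv", b"cust_no,zip\nC-42,80521\n")
print(json.loads((path / "manifest.json").read_text())["format"])  # csv
```

The manifest is the seed of the "data lake repository" mentioned above: each investment of interpretation effort can be recorded next to the data, so those following in the footsteps need not retrace all the steps.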

The Evolution of the Data-Centric Revolution

I rehashed the history of the computer industry in order to point out that, almost since the beginning, there has been a latent desire to organize our systems around the data. The history of any industry is a series of advances, lurches, side steps, and retracings, as we work through things that would have worked but were ahead of their time, and things that did work but ran into essential limits.

Many people over a long period of time have believed that an approach that puts data front-and-center will ultimately lead to much more satisfying results. We’re going to start with all the lessons learned, and the technology that is now available, and explore in this column what is now possible and what needs to happen to open up this possibility.

If you’re interested in this, by all means, sign the Data-Centric Manifesto at http://datacentricmanifesto.org/.


References

[1] I’m using the 6th definition of elegant from http://dictionary.reference.com/browse/elegant “gracefully concise and simple; admirably succinct.”

[2] As PhD dissertations go, this is very readable.  https://www.ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf

[3] Cisco estimates we’ll hit a zettabyte this year http://www.telegraph.co.uk/technology/2016/02/04/worlds-internet-traffic-to-surpass-one-zettabyte-in-2016/

[4] https://en.wikipedia.org/wiki/Data_science

[5] http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0

About Dave McComb

Dave McComb is President of Semantic Arts, Inc., a Fort Collins, Colorado-based consulting firm specializing in Enterprise Architecture and the application of Semantic Technology to Business Systems. He is the author of Semantics in Business Systems, and program chair for the annual Semantic Technology Conference.
