Data Architecture, Where Go’est Thou?

The other day, I ran across an article that discussed data architecture. The article had an “old” architecture and the “new and improved” architecture. The old architecture had applications, ETL, a data warehouse, and data marts. The new and improved architecture had big data, containing both structured and unstructured data. The sources for big data were varied – from Facebook to legacy systems to email, and so forth.

From the perspective of the big data vendor trying to sell new technology, and from the standpoint of marketing hype, there is little question that the architecture depicted as the new and improved architecture is accurate. Ask IBM. Ask Teradata. Ask HP. Ask Cloudera. Ask Hortonworks. They will all tell you that there is an evolution from the old architecture to the new and improved architecture.

There is no question that the new and improved architecture can handle larger volumes of data than the older architecture. That is – after all – the main feature of big data.

The Architectural Vision

The hardware and software vendors want you to believe that they know how to lead the change when it comes to architectural vision.

But there are some serious and fatal deficiencies with the architecture depicted as the new and improved architecture. In their zeal to sell new hardware and new software products, the vendors have overlooked some important architectural elements.

The Integrity/Believability of Structured Data

For example – when it comes to structured data – in the new and improved architecture, where is the infrastructure made for integrating structured data? The big data vendors seem to not be aware that you can’t just take raw data and use it intelligently for decision-making purposes. A long time ago, the IT and end user community learned that in order to use raw data for decision-making purposes, you need to pass the raw structured data through a rigorous integration process. That is where corporate data comes from, and corporate data is the basis for making intelligent decisions. If there is any integration of structured data that is occurring in the new and improved architecture, it is well hidden.
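To make the idea of an integration process concrete, here is a minimal sketch in Python. It assumes two hypothetical source systems, "crm" and "billing", that encode gender and dates differently; the source names, field names, and code tables are invented for illustration and do not come from any particular product. The point is only that raw records must be converted to one agreed corporate format before they can be compared or analyzed.

```python
from datetime import datetime

# Hypothetical per-source conventions (assumptions for illustration only).
GENDER_CODES = {
    "crm": {"M": "male", "F": "female"},
    "billing": {"1": "male", "0": "female"},
}
DATE_FORMATS = {"crm": "%Y-%m-%d", "billing": "%m/%d/%Y"}


def integrate(source: str, raw: dict) -> dict:
    """Convert one raw source record into a single shared corporate layout."""
    return {
        # Normalize the key so the same customer matches across systems.
        "customer_id": str(raw["id"]).strip().upper(),
        # Translate each system's private code into one common value.
        "gender": GENDER_CODES[source][str(raw["gender"])],
        # Parse each system's date format into a real date.
        "signup_date": datetime.strptime(raw["signup"], DATE_FORMATS[source]).date(),
        "source": source,
    }


crm_row = integrate("crm", {"id": " c-17 ", "gender": "F", "signup": "2014-03-09"})
billing_row = integrate("billing", {"id": "c-17", "gender": 0, "signup": "03/09/2014"})
```

After integration, the two rows agree on customer key, gender, and signup date even though the raw inputs did not. A real integration layer would also handle units, key conflicts, and missing values, none of which this sketch attempts.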

In their zeal to solve the problems of managing large amounts of data, the big data vendors seem to have missed some of the history of the computer profession. The new and improved architecture resets the clock to about 1980 when it comes to the architectural mindset of data. Back in 1980, we were just learning about the need to integrate data.

Turning Unstructured Data into Usable Data

A second huge miscalculation made by the big data vendors is the assumption that unstructured data is simply there for the taking. Indeed, you can find and capture unstructured data. But once you have captured a handful of unstructured data, there isn’t really much you can do with it. The world is learning that a whole different technology is required to turn unstructured data into useful data on which to base business decisions. The big data vendors apparently have never heard of taxonomies, inline contextualization, or textual disambiguation, for example. Subjects like homographic resolution, inference processing, conjunction resolution, and special character sweeping are unknown in the world of big data. Yet all of those technologies, and more, are required to turn unstructured data into useful, believable data. If the big data vendors do understand that technology, it certainly is not apparent.

The big data vendors seem to think that you just capture unstructured data and then turn your data scientist loose on it. That, for the most part, is a formula for disaster. Making sense of and using unstructured data requires a wholly different technology (indeed a whole technological world) that the big data vendors are apparently not even aware of.
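As a rough illustration of two of the techniques named above, the Python sketch below applies a tiny taxonomy and a homograph table to raw text. Everything in it is an assumption made for the example: the taxonomy entries, the homograph table, the function name `contextualize`, and the idea of passing the author's specialty as the context. Real textual disambiguation is far more involved; this only shows the shape of the output, which is text turned into rows that carry context.

```python
import re

# Hypothetical taxonomy: specific term -> broader category (illustrative only).
TAXONOMY = {
    "honda": "car",
    "porsche": "car",
    "oak": "tree",
    "elm": "tree",
}

# Hypothetical homograph table: the same abbreviation resolves differently
# depending on who wrote the text.
HOMOGRAPHS = {
    "ha": {"cardiology": "heart attack", "hepatology": "hepatitis A"},
}


def contextualize(text: str, author_context: str) -> list:
    """Return one row per recognized word, with its position and context."""
    rows = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word in TAXONOMY:
            # Taxonomy lookup: attach the broader category to the word.
            rows.append({"offset": match.start(), "word": word,
                         "context": TAXONOMY[word]})
        elif word in HOMOGRAPHS and author_context in HOMOGRAPHS[word]:
            # Homograph resolution: pick the meaning that fits the author.
            rows.append({"offset": match.start(),
                         "word": HOMOGRAPHS[word][author_context],
                         "context": author_context})
    return rows


note = "Patient reports HA last year; drives a Honda."
rows = contextualize(note, "cardiology")
```

With the cardiology context, "HA" resolves to "heart attack" and "Honda" is classified as a car; words found in neither table are skipped. The resulting rows can be loaded into an ordinary database, which is the step that makes the text usable for analysis.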

Where is the Return on Investment?

So where does that leave us? There is no question that a lot of companies have invested a lot of money in the new and improved architectural vision because of the PR job done by the big data vendors. One day those companies are going to wake up and ask: where is the return on my investment? I have spent a lot of money on big data; now how does that investment make my company a better-run and more profitable company? The truth is that there is only a pittance when it comes to return on investment for big data. The return on investment is nothing like that promised by the big data vendor when the organization was sold on the great promise of big data.

Chasing a Mirage

The problem is that these corporations have spent so much money chasing a mirage that they cannot admit to having made a grievous strategic mistake. They believed the big data vendors. Some errors of judgment are so large that they cannot be publicly admitted. The real winners are those corporations that have not bought into the big data mirage.

From an Architectural Standpoint

From an architectural standpoint, it is predictable that – over time – there will be a merging of the two architectures. People do need to process more data than ever before, but they need data with integrity on which to make intelligent decisions. They need to turn their unstructured data into useful data. Without integrity of structured data and without transforming unstructured data into useful data, there is little to no return on investment. The architecture of the future will include those capabilities, despite the best efforts of the big data vendors.


About Bill Inmon

Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations.

  • John O’Gorman

    Best line in the whole article:

    “For example – when it comes to structured data – in the new and improved architecture, where is the infrastructure made for integrating structured data? The big data vendors seem to not be aware that you can’t just take raw data and use it intelligently for decision-making purposes.”

    Integration is not about architecture, though. Semantic equivalence in all of its forms (translation, abbreviation, acronyms, etc.) and identity management (aliases, deltas, renaming) require process. You can store / structure the results any way you want.
