Agile Data Design February 2014

As I mentioned in my last column [1], our company has been exploring the potential of Big Data analytics and is about to embark on a proof of concept of this exciting new technology. The first step is to identify potential Big Data opportunities, which leads us to two questions: Where in our company do we find “Big Data”? And what characteristics of this data require us to adopt Big Data technologies (such as Hadoop and MapReduce)?

We’re a Microsoft shop, so most of our data analysis currently is done using SQL Server and Microsoft’s BI stack (which includes Analysis Services, Reporting Services and PowerPivot). Some people reading this will say, “It’s obvious; you can only use SQL Server for structured data,” but that simply isn’t true. You can do textual analysis (including fuzzy matching), create and manipulate objects, parse XML, read and write file data, and execute .NET program code in SQL Server. You can implement non-relational structures, including graph structures, columnar stores, and NoSQL-type name/value tables [2]. And SQL Server scales to at least multiple terabytes comfortably. You can even implement event processing using Service Broker. So it’s not as simple as saying, “If it’s not relational data, you need Hadoop.”

So when would you look to a Big Data platform like Hadoop? Some possible criteria might include the following:

  • When the volume of data, or the type of analysis involved, requires parallel computing capabilities across multiple commodity servers.
  • When the computations involved can be handled more quickly or effectively in Java or .NET than in SQL.
  • When results need to be delivered in real or near-real time.
  • When you need to be able to scale the solution quickly and easily to accommodate ever-growing volumes of data.
  • When your developers’ skill sets include Java or .NET programming, but not SQL or database expertise.
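To make the first criterion a little more concrete, here is a minimal sketch of the map-and-reduce pattern in plain Python. It is not Hadoop code and does not use any Hadoop API; the event records, the function names and the use of threads in place of separate servers are all assumptions made purely for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical telematics events as (vehicle_id, event_type) pairs.
EVENTS = [
    ("truck-1", "hard_brake"),
    ("truck-2", "idle"),
    ("truck-1", "idle"),
    ("truck-3", "hard_brake"),
    ("truck-2", "hard_brake"),
    ("truck-1", "hard_brake"),
]


def partition(records, n_parts):
    """Split the records into n_parts slices, one per worker."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_count(records):
    """Map step: a worker counts event types in its own slice only."""
    return Counter(event_type for _, event_type in records)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_events(records, n_parts=3):
    parts = partition(records, n_parts)
    # Threads stand in here for the commodity servers of a real cluster.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(map_count, parts))
    return reduce(reduce_counts, partials, Counter())


totals = count_events(EVENTS)
print(totals["hard_brake"], totals["idle"])  # prints: 4 2
```

The point of the sketch is that each worker sees only its own slice, so the same answer comes back however many slices there are. A problem that can be cut up this way is a candidate for a parallel platform; one that cannot is probably better left in the database.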

The simple truth is that almost all of the data used or generated by our company has at least some type of structure (most often XML “updategrams” generated from telematics devices or web services), and the characteristics and properties of this data are known. And since we’re not Walmart, Google, or Amazon, the volumes of data we produce are manageable using traditional database and BI technology. So it will be interesting to see whether we can actually find data and/or a computational problem of sufficient size, complexity and ambiguity to warrant a Hadoop-type solution. (Note, too, that Hadoop and MapReduce are not the only Big Data technologies; there are many available options and platforms, including some that are SQL-based.)

We’re working with our partners at Microsoft now, looking at the HDInsight Hadoop platform that is part of their Azure cloud service, and talking with our management to identify business problems and opportunities that might require a Big Data solution. I’ll have more to report on this in my next article.

Another question I’m wrestling with is this: Once we’ve done our Big Data analysis, what will we do with the results? From a data management standpoint (as I explained in my previous article), I’m excited about the possibilities of using the output from Big Data analytics as input to data governance and master data management processes. But the output from Big Data analytics is used mostly to guide and direct business decisions, often in real or near-real time. This has the potential to create significant business value, as companies like UPS, Express Scripts, Daimler, Dannon and Groupon have demonstrated. It also gives companies the ability to shoot themselves in the foot, suffering financial losses and public embarrassment while alienating their customers and other stakeholders. Examples from the Big Data Hall of Shame [3] include:

  • JP Morgan Chase, which lost billions of dollars by investing heavily in mortgage derivatives during the housing bubble, in response to market analyses that ignored the increasing numbers of mortgage defaults.
  • Target, which mailed coupons for maternity items to the parents of a teenaged girl who didn’t even know she was pregnant.
  • CNN, which ran a headline incorrectly announcing that the Supreme Court had overturned the Affordable Care Act.
  • Wells Fargo, whose data-driven attempts at cross-selling and up-selling led its employees to sell six checking and savings accounts, with fees totaling $39 a month, to a homeless woman.
  • Insurance companies that deny coverage or raise premiums based on their customers’ online purchases (such as plus-sized clothing).
  • Credit card companies that use their customers’ social network associations to determine credit risk. Some companies even adjust their customers’ credit ratings based on the repayment history of other customers of stores where they shop!

I can provide another example from my own personal experience: My insurance company raised my auto insurance premiums after I took out a home equity loan to have new energy-efficient windows installed in my home! They explained that their statistical analysis showed a correlation between people who increased their debt and people who had auto accidents. I explained that there was also a correlation between companies that didn’t value long-time customers and companies that lost those customers!

This isn’t simply a matter of not being careful enough with data analysis (although so-called “data scientists” certainly need to understand the difference between correlation and causation, and the dangers of trying to create correlations out of context). It’s more a matter of not understanding that the whole point of data analytics is to build relationships of trust between a company and its stakeholders (including customers, employees, suppliers, dealers, regulators and the community at large). Companies need to stop asking the question: “How can I screw more money out of this customer?” and start asking the question: “How can I develop a stronger, more positive, long-term relationship with this customer?” [4] The companies that have been truly successful with Big Data are the companies that understand the importance of using data to build relationships. They are the companies that allow their customers to custom-tailor their relationship with the company, and manage the ways in which they use that company’s products and services. They make their customers feel like empowered partners, not terrified serfs.
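The correlation-versus-causation caution above can be shown with a few lines of Python. This is only a toy sketch: the eight rows, the two scores and the “hidden factor” are invented for the illustration and say nothing about any real insurer’s data or method. The two scores never influence each other, yet they correlate strongly overall because both depend on the same hidden factor.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical rows: (hidden_factor, debt_score, accident_score).
# Both scores are built from the hidden factor plus an unrelated
# component (a or b), so neither score causes the other.
ROWS = [
    (z, 2 * z + a, 2 * z + b)
    for z in (0, 1)
    for a in (0, 1)
    for b in (0, 1)
]


def correlation(rows):
    """Correlation between debt_score and accident_score in rows."""
    return pearson([r[1] for r in rows], [r[2] for r in rows])


overall = correlation(ROWS)
within_group_0 = correlation([r for r in ROWS if r[0] == 0])
within_group_1 = correlation([r for r in ROWS if r[0] == 1])
```

Here `overall` comes out at 0.8, while the correlation within each group of the hidden factor is zero. An analyst who sees only the two scores would find a strong pattern that vanishes once the context is put back in.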

A colleague of mine expressed it best: “Data is only as good as the relationships it is supporting.” Or, as I like to put it, there needs to be a grown-up in the room when decisions about what to do with data are being made. It helps if this grown-up is someone who is truly knowledgeable about the company’s business, who understands the ethical imperatives of business, and who is committed to the company’s long-term success (not just short-term profits).

I am very much looking forward to our exploration of Big Data, and hope to be able to derive significant value for our company from Big Data analytics. At the same time, I will be working diligently to influence how the results of these analyses are used. I don’t want to see our company’s name added to the list of the Big Data Hall of Shame!

NOTE: I’d like to make this a dialogue, so please feel free to email questions, comments and concerns to me at Larry_Burns@comcast.net. Thanks for reading!

References:

  1. Burns, Larry. “Big Data and Data Governance.” TDAN, November 2013: http://www.tdan.com/view-articles/17106
  2. Burns, Larry. “NoSQL and the Data Dump.” TDAN, May 2013: http://www.tdan.com/view-articles/16928
  3. Frank, Adam. “A Brave New World: Big Data’s Big Dangers.” NPR blog, June 11, 2013: http://www.npr.org/blogs/13.7/2013/06/10/190516689/a-brave-new-world-big-datas-big-dangers
  4. Crosman, Panny. “The Downside of the Data-Driven Decision.” Information Management, January 8, 2014: http://www.information-management.com/news/the-downside-of-the-data-driven-decision-10025214-1.html?utm_campaign=daily-jan%209%202014&utm_medium=email&utm_source=newsletter

About Larry Burns

Larry Burns has worked in IT for more than 25 years as a database administrator, application developer, consultant and teacher. He holds a B.S. in Mathematics from the University of Washington and a master’s degree in Software Engineering from Seattle University. He currently works for a Fortune 500 company as a database consultant on numerous application development projects, and teaches a series of data management classes for application developers. He was a contributor to DAMA International’s Data Management Body of Knowledge (DAMA-DMBOK), and is a former instructor and advisor in the certificate program for Data Resource Management at the University of Washington in Seattle. You can contact him at Larry_Burns@comcast.net.
