Pragmatic Data: Where Does Your Data Science Live?

“…with the help of the pattern language that existed in traditional society, they were able to generate a complete, living structure, because this local adaptation… could happen because they all shared these local languages. In our time, the production of the environment has largely gone out of the hands of people at large in society.”

Christopher Alexander
ACM OOPSLA, 1996, San Jose1

 

Some of my friends aren’t sure I have my job title right. They are “real” Architects – the kind who design stuff like skyscrapers, stores, and homes out here in the real world, and who license the use of the term “Architect” to enforce professional standards in their practice. I would (and often do) argue that there’s more in common between IT architects and general contractors than between IT and building Architects, but the strength of the architecture metaphor has grown tremendously since the early 2000s. With nearly all work and home life spent in worlds created by the software and mathematical models we in data architecture create, our work now creates as much of our environment as do the spaces where we work, live, and sleep.

Christopher Alexander, author of “A Pattern Language,” is famous for his contributions to architecture practice for both building and software. In 1996, at a conference of the Association for Computing Machinery, he noted that past architecture (“real” architecture) was able to adapt and provide complete, living structures – buildings, roadways, towns – because it was created from a common language of patterns shared among local occupant/architects. In modernity, he said, “the production of the environment has largely gone out of the hands of people at large in society.” Building and development had become less adaptive and had substituted a goal of “working” for one of “good.” The same has been happening to data and software for most of the last sixty years.

The Agile movement and the practice of data science are beginning to change our way of thinking about how and where some aspects of IT should be practiced. A CIO I’m acquainted with has been known to say, “IT should not exist.” And while I think that may go a bit too far (Accounting and Facilities both exist for a reason, as should IT), it does seem that IT may not all need to happen in IT.

Big Data and data science have become popular buzzwords, but the ideas have been around for a long time. The term “data science” has been in circulation since the 1960s, but has specifically meant the practice of statistics and data analysis since around 1997.2 “Big Data” has been in use for almost exactly the same time, first appearing in its modern use in the ACM digital library in 1997.3 The terms are sometimes used interchangeably, but for most practitioners of data science I know, Big Data is too restrictive a description of what they do. The ability to analyze and visualize huge amounts of data has certainly made data science more visible, but marketing analysts and scientists have been applying and creating mathematical analyses on increasingly large sets of data since before the days when rooms full of people (mostly women, by the way) employed as “computers” analyzed and modeled nuclear fission data for the US’s Manhattan Project4. Asset managers have modeled returns and attempted to predict mutual fund performance since the 1920s; consultants have been modeling compensation and career paths for decades; and although marketing departments were working from their gut during the days of “Mad Men,” they have been using tools like regression analysis to analyze and forecast the results of advertising campaigns for most of the half-century since.

What do all these data teams have in common? They’re not in IT. Their work began on literal spread-sheets, two- or three-foot-wide ruled paper, and it sometimes remains in the spreadsheet’s electronic descendants today. Excel, once limited to 65,536 rows, now handles respectable datasets of more than a million rows and supports multiple regression. Some amazing things have been done with its statistical packages and VBA. You’ll find SAS, MATLAB, SPSS, and other commercial statistical software in your marketing department, your research labs, on the quant desk, and elsewhere. The open source R language and the RStudio GUI tools are freely downloadable advanced analytical and data visualization software. With more than 7,000 add-on packages available, R is frequently what college grads bring to the office, familiar with it from stats classes. A bit less press-friendly, but easily as powerful, the Python language is also free, enthusiastically supported by its community, and finding its way onto the laptops of researchers and weekend data wranglers (yes, this is a thing).

Several years ago, while trying to figure out how we (in IT) could best help the quantitative analysts on a quant desk, I asked a group about what they used for analysis and what their preferred tools were. “Mostly C#,” came the reply. As someone working in a traditional BI & reporting group, I found this a bit surprising at the time – these folks were in “the business,” not developers! But increasingly, this is the norm, and it should be.

Analysts, quants, DIY developers, statisticians, data scientists – call them what you will – move at the speed of business out of necessity. Although databases and developers still frequently reside in IT, and storage infrastructure is well within its domain, actionable analysis depends upon the speed and flexibility of local patterns, shared among the community closest to the work. Sound familiar?

Your big data (and data science) live in “the business.” Put more clearly, they live outside of IT. As with spreadsheets, word processors, mobile devices and most every other form of technology, tools supporting data scientists can do the most good when they are the most available. There are huge volumes of data being handled and analyzed throughout most organizations. Though not given IT titles, those analyzing that data are out in the field, managing databases, writing code, managing computing clusters. They are creating and sharing patterns and responding to their own needs more rapidly and with better alignment than many of us in IT may remember ever having been able to do, and cloud-based providers are backing up that flexibility with on-demand capacity.

Does Data Architecture have a role here? In a word, yes. Data Architects – and IT more broadly – can help to ensure that data science practice delivers on its promise efficiently and securely, and that data science teams have ready access not only to the tools, but also to the pattern languages that they need to create good analyses (with apologies to Christopher Alexander). In my opinion, though our practice may have been a little late to the table, we are developing an understanding of the technologies and learning how to create a “good” data science work environment. We have developed excellent patterns for database design, security models and shared coding that we can make easier for these new practices to adopt. We have practiced data modeling methodologies that create consensus, adapt to change and provide the foundation for a better understanding of data; these apply equally (though differently) to datasets being tidied for analysis or stored in massive distributed databases. And if we’re paying close attention, we’re learning how to apply a new interpretation of the term “distributed computing” beyond just application architectures and datacenter topology, distributing the capability for data science to everyone whose work it will enhance.


 

1 Alexander, Christopher (1996) “Christopher Alexander – Patterns in Architecture,” https://youtu.be/98LdFA-_zfA retrieved 11/2015

2 “Data science,” https://en.wikipedia.org/wiki/Data_science, retrieved 11/2015

3 Press, Gil (2013) “A Very Short History of Big Data,” http://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/ retrieved 11/2015

4 Kean, Sam (2010). The Disappearing Spoon – and other true tales from the Periodic Table.

About William Brooks

Bill Brooks has been modeling, managing and integrating data since 1995, beginning at CID Associates developing application databases, then at Children's Hospital Boston as manager of the Decision Support Systems Group. He managed data integration before becoming Enterprise Data Architect for MFS Investment Management. Bill is now Chief Data Architect at Mercer, where he is developing a firm-wide data architecture practice. Bill's background includes traditional relational database design, data warehouse design and implementation, and enterprise application integration using a variety of ETL, message broker and service bus approaches.