Everyone’s talking about data. Data is the key to unlocking insight: the secret sauce that will help you get predictive, the fuel for business intelligence. The transformative potential in AI? It relies on data. The thing that powers your CRM, your monthly report, your Tableau dashboard. Data.
The good news is that data has never been more available. Tried-and-true sources of public data like the census are becoming easier to access as an increasingly tech-savvy public demands modern ways to connect to them.
The open data movement has also evolved, and more and more businesses and organizations are releasing data into the wild. The result is an overwhelming desire among businesses to use data to get smarter about their operations and deliver results in every industry, across every department. In fact, in 2018, 98.6% of firms aspired to a data-driven culture. The excitement has led a lot of organizations to invest heavily in business intelligence tools and the fleet of data scientists who will use them to spin gold out of raw data.
“In 2018, 98.6% of firms aspired to a data-driven culture.”— MIT Technology Review report
The problem? There’s a big disconnect between wanting to be data-driven and getting it done.
The main cause of this disconnect is the massive infrastructure gap between data and analytics – and the largest share of the resources being spent ends up dumped into this chasm. The reality is that data science, once called the sexiest job of the 21st century, is currently 80% data preparation and 20% actual insight. This means that organizations that have invested in a data science division are seeing only a fraction of the business intelligence they anticipated.
Why is this happening? There are a few reasons, and most of them can be addressed by a largely missing component in the data economy: DataOps.
Working with large amounts of data requires centralized management and automated operations.
Like its namesake DevOps, DataOps is something that, once you realize you need it, you realize you actually needed it a year ago. Before DevOps, engineering pipelines became unsustainably complex. They required a system – a process and a set of tools – to bring together teams, technology, and processes, while removing the operational silos between team members. With the volume of data that’s available – and the opportunities open to those who can use it effectively – it is becoming clear that organizations need a similar solution for managing their data.
Similar to DevOps, DataOps isn’t any one thing. There are dozens of janitorial tasks most data scientists do daily on a case-by-case basis that can be addressed by getting smarter about their data operations.
Sourcing Data
Sounds simple. It isn’t. Anyone can find data – it’s never been more available – but data that answers the specific question you’re asking is incredibly rare. You’re ultimately looking for a needle in a stack of needles.
Can You See the Data You’re After?
Data is everywhere, sure, but seemingly simple requests (“which days did it rain last year?”, “how does the price of rent change from one neighborhood to another?”) rarely produce the data you need, in an accessible format, in a neat and tidy package. There’s no single repository for all the public data on the planet – no, it’s not Google’s Public Data Explorer – so it’s unrealistic to think that a data scientist can comb through the entire universe of available data in order to act on a good idea.
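Even when the dataset exists, answering the question assumes you’ve already found it, downloaded it, and learned its schema. A minimal sketch of that rain question, assuming a hypothetical daily-weather CSV with date and precipitation columns:

```python
# Hypothetical file and column names: the real work is finding a dataset
# that actually has this shape.
import pandas as pd

weather = pd.read_csv("daily_weather_2018.csv", parse_dates=["date"])

# "Which days did it rain last year?" is one line -- once the data is
# local, parsed, and in a known schema.
rainy_days = weather.loc[weather["precipitation_mm"] > 0, "date"]
print(f"{len(rainy_days)} rainy days")
```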
Cleaning Data
Scrubbing data is probably the most time-consuming, most necessary, and least productive part of a data scientist’s job. One of the problems is that “clean” is a relative term. Data semantics are highly particular. For any given dataset a data scientist may have to normalize the file format (but who doesn’t love working with PDFs?), the character encoding, and the column formats. What’s more, one person’s ideal dataset is another’s headache.
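As a concrete illustration, here’s a minimal sketch of two of those chores: guessing a file’s encoding with the third-party chardet library, then normalizing column headers. The file name and the detected encoding are assumptions.

```python
# A sketch of routine scrubbing: detect a non-UTF-8 encoding, then
# standardize the column headers. "vendor_export.csv" is hypothetical.
import chardet
import pandas as pd

with open("vendor_export.csv", "rb") as f:
    encoding = chardet.detect(f.read())["encoding"]  # e.g. "ISO-8859-1"

df = pd.read_csv("vendor_export.csv", encoding=encoding)

# Normalize headers: strip whitespace, lowercase, underscores for spaces.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
```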
To compound those problems, legacy systems, proprietary formats, and human error can slow the process down to a crawl. An unescaped comma somewhere in that CSV? It’s a problem with a quick fix, but a lot of time might go into finding out where the fix is needed. Because it’s so particular, cleaning data is currently a brute-force job, and it’s a million miles from analysis.
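Finding that comma doesn’t have to be manual, though. A small sketch that scans a CSV and flags rows whose field count disagrees with the header (again, the file name is hypothetical):

```python
# Report rows with the wrong number of fields -- the usual symptom of an
# unescaped delimiter.
import csv

with open("vendor_export.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for rownum, row in enumerate(reader, start=2):  # header is row 1
        if len(row) != len(header):
            print(f"row {rownum}: expected {len(header)} fields, got {len(row)}")
```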
Transforming Data
Once a dataset is cleaned, it may still need to be transformed to meet internal standards. How does your company record dates and addresses? Do they want every proper noun title-cased? What if you want to perform more complex transformations, like adding logic and math functions at the row level?
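All three of those examples fit in a few lines of pandas. A sketch, with invented column names and an assumed US date format:

```python
# Row-level transformations to match hypothetical internal standards.
import pandas as pd

df = pd.DataFrame({
    "customer": ["acme corp", "widgets ltd"],
    "signup": ["03/15/2018", "11/02/2018"],
    "units": [12, 7],
    "unit_price": [9.5, 14.0],
})

df["customer"] = df["customer"].str.title()                     # title-case proper nouns
df["signup"] = pd.to_datetime(df["signup"], format="%m/%d/%Y")  # one canonical date format
df["revenue"] = df["units"] * df["unit_price"]                  # logic/math at the row level
```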
Any data scientist worth their salt can make these transformations, but you can’t grow your data science team the same way you grow your data libraries. With all of the variability in processes, what happens when one of your data scientists transitions into a new job? Can you count on somebody new to maintain data products in exactly the same way? And can you scale that handover? Any medical professional can tell you that handovers don’t always go as planned (but they’re using data to understand and solve the problem). Adding transparency and collaboration to a pipeline is critical to ensuring you avoid single points of failure in your data operations.
Serving Data
As I mentioned earlier, there are a lot of SaaS solutions that run on data to generate insight and analysis. Ever since Gartner announced that Business Intelligence was the top priority for CIOs in 2009, BI tools have become increasingly available and user-friendly. Data visualization platforms like Tableau are mainstream, and data scientists are increasingly using open source tools like Jupyter Notebooks and RStudio for statistical computing and machine learning.
The problem, though, is that you need to pipe data into these solutions in order to make them work. Similar to what we’re seeing with AI, data analytics is often tested in a synthetic or sterile environment, leading business leaders to believe that it’s easy to plug a dataset into a tool and see immediate business impact. It isn’t. Raw data is crude oil and a BI tool is a fast car – there’s a lot of refinement needed before you’re going anywhere.
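That piping step is often just a matter of landing clean data somewhere a BI tool can reach it. A minimal sketch, writing a cleaned table into a local SQLite file that Tableau or a Jupyter notebook could then query; the file and table names are assumptions:

```python
# Land the cleaned data in a database the downstream tools can read.
import sqlite3
import pandas as pd

df = pd.read_csv("cleaned_sales.csv")  # hypothetical cleaned output
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("sales", conn, if_exists="replace", index=False)
```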
The truth of the matter is that tasks like these represent the majority of a data scientist’s working day. Currently, they’re looked at as a necessary evil, but finding a way to apply automation to these processes would have an immediate impact. Data scientists would be able to spend the majority of their time doing precisely what they got into the business to do: analyze, model, and experiment.
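What might that automation look like? At its simplest, it means chaining the steps above into one repeatable function that runs on a schedule instead of living in somebody’s head. A toy sketch, where every name is an assumption:

```python
# A toy end-to-end pipeline: source, clean, serve.
import sqlite3
import pandas as pd

def run_pipeline(source_csv: str, db_path: str) -> None:
    df = pd.read_csv(source_csv)                                            # source
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]  # clean
    df = df.dropna(how="all")                                               # clean
    with sqlite3.connect(db_path) as conn:                                  # serve
        df.to_sql("observations", conn, if_exists="replace", index=False)

run_pipeline("daily_weather_2018.csv", "warehouse.db")
```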
Unlocking the value of data has undeniable business impact – but to do it right, you’re going to have to start thinking seriously about DataOps.