Data is about as prevalent as electricity, yet it can be difficult to adequately explain how critical it is to the modern world. From business operations to tackling the environmental crisis, data is the key to unlocking insight and developing intelligent solutions across every sector. Although Big Data has been in the news for at least a couple of decades, other types of data are now getting airtime as well. Open data, external data, training data, dark data – these add new contours to an already multi-faceted conversation, and it’s understandable that the general public is getting overwhelmed.
For those of us in the industry, though, there are really only two types of data: the data you have and the data you don’t. What makes things confusing is that the line between the two isn’t always clear. Open data and public data, as their names suggest, should be easy to access. They’re not. Even the data within your organization – data that essentially belongs to you – can be locked away behind operational silos or internal restrictions. People are clamoring to develop AI products, but they can’t be built without a corpus of training data. This often has to be synthetically generated, meaning you have to build the ballpark before playing catch. The upshot is that organizations that have already taken the difficult step of establishing a data science division are now facing the very real problem of actually getting access to the data they need to drive insight.
Virtually every new business can leverage data to gain a competitive advantage, but the fact of the matter is that there are a lot of professionals trained in using data who are spending most of their time just finding it. Because of this, the “importance” of data is difficult to quantify. Like the Internet, the limitless uses of data will only be fully understood when everyone – business, government, and individual – can get their hands on it.
A Digital Divide
There is a canyon between those who have data and those who want it. The first five years of the open data movement involved governments of every size publishing whatever data they could, wherever they could. The result was a mess. Standards were lacking, access was difficult, and even when you got it, the data was often one or more of the three Ps of open data: partial, PDFs, or a piece of s***. Due to these limitations, using public data became a skill that belonged to the minority, which was directly contrary to its purpose and to the principles of transparency on which the open data movement was based.
The past few years have seen improvements. Governments doubled down on policy and reworked their open data portals to make access easier and support more use cases. This is a good thing. Governments and organizations that are focused on releasing data should iterate in order to streamline the process. But it is not their role to build a bridge across the divide. I’ve sat in these meetings, and it’s too easy for well-intentioned governments to get bogged down in the highly idiosyncratic technical details of delivering data to a spectrum of users and use cases.
At the same time, it’s not the user’s job to figure out how to access data. Average citizens should be data literate, but that doesn’t mean they all have to learn how to run Python scripts. Without the ability to test out their ideas, most consumers aren’t eager to commit the time, energy, and money to figure out ways to get at the data they’re interested in using.
The result is that data producers and data users stay on their own sides of the digital canyon, theorizing about the advantages that access to the other side could provide.
There’s a prediction that data will disrupt virtually every industry in the next decade — from business and government to food production and medical research — but this prediction is based on the hypothesis that data is, or will be, fundamentally easy to access. The reality is that this isn’t currently the case, and it won’t be until a bridge is built between those who have data and those who can use it. Put simply, data can’t be useful until it’s usable.
The Myth of Usability
I speak to a lot of people about data, and although everyone agrees that it’s important, there are two distinct groups: idealists and realists. Idealists want to use data in hundreds of different ways. They want to develop apps, build visualizations, and plug new feeds into their BI tools. They want AI now. The difference between them and realists – who, incidentally, want the same things – is that realists know how difficult it’s going to be.
At this point, there’s little doubt about the value of data in general, but any confidence in this value assumes that it is ready to be consumed. The truth is that whether we’re talking about data that exists inside of an organization or data that’s generated outside of it, there are operational hurdles that need to be overcome before you start to see the kind of game-changing insight that everyone’s excited about.
Take procurement data, for example. Many countries, including Canada, release procurement data as “open” information that’s available for general consumption and use. It’s worth taking a minute to think about how valuable this is. Every day, companies win contracts with the government – the largest, most trusted buyer in the market – and these contracts are made available for public consumption. In an ideal scenario, citizens would be able to use this data to see how their tax dollars are being spent and ask important questions about their government’s purchasing process. How many contracts did multinational corporations win over Canadian companies? What’s the spend on defense? Education? The environment? The business use case is also powerful: a bank could use this data to make more efficient decisions around activities like approving small business loans, or augment its database with the information to fine-tune risk ratings.
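In that ideal scenario, the analysis itself would be almost trivial. Here’s a minimal sketch in Python, assuming a feed that had already been cleaned and loaded as a single tidy table – the file name and columns (supplier_country, category, contract_value) are hypothetical, not what the released data actually looks like:

```python
# A minimal sketch, assuming the procurement feed were already clean and
# loaded as one tidy table. File and column names are hypothetical.
import pandas as pd

contracts = pd.read_csv("procurement_contracts.csv")  # hypothetical, cleaned export

# How much contract value went to Canadian suppliers vs. everyone else?
by_origin = (
    contracts.assign(is_canadian=contracts["supplier_country"].eq("CA"))
    .groupby("is_canadian")["contract_value"]
    .sum()
)
print(by_origin)

# Total spend by category: defense, education, environment, ...
print(
    contracts.groupby("category")["contract_value"]
    .sum()
    .sort_values(ascending=False)
)
```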
But this isn’t happening.
Why not?
Well, the data that’s released is incompatible with how people want to use it. Currently, in order to query the data or inject it into a model, you have to do one or all of the following (a rough sketch of two of these steps follows the list):
- connect to the feed directly and monitor it for updates;
- normalize the feed into standard formats;
- roll up subsidiary organizations to parent companies;
- run sophisticated entity resolution scripts over the data to provide a master record of organizations and the contracts they win; and finally,
- use some analytics tool to make the millions of records human-readable.
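To give a sense of what even a fraction of that pipeline involves, here is a rough, hypothetical sketch of two of those steps – normalizing supplier names and rolling subsidiaries up to their parents – with made-up file names, column names, and matching rules:

```python
# A rough, illustrative sketch of two steps from the list above.
# File names, column names, and the matching rule are hypothetical;
# production entity resolution needs far more than string cleanup.
import re

import pandas as pd

def normalize_supplier(name: str) -> str:
    """Crudely standardize a supplier name so that, e.g., 'ACME Ltd.'
    and 'Acme Limited' produce the same key."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    name = re.sub(r"\b(inc|incorporated|ltd|limited|corp|corporation|llc)\b", "", name)
    return re.sub(r"\s+", " ", name).strip()

feed = pd.read_csv("raw_contracts.csv")        # hypothetical raw export from the portal
parents = pd.read_csv("parent_companies.csv")  # hypothetical subsidiary -> parent mapping

feed["supplier_key"] = feed["supplier_name"].astype(str).map(normalize_supplier)
parents["supplier_key"] = parents["subsidiary_name"].astype(str).map(normalize_supplier)

# Roll subsidiaries up to their parent organization where a mapping exists;
# fall back to the raw supplier name where it doesn't.
resolved = feed.merge(
    parents[["supplier_key", "parent_name"]], on="supplier_key", how="left"
)
resolved["organization"] = resolved["parent_name"].fillna(resolved["supplier_name"])
```

And that’s a toy version of just two steps; monitoring the feed, resolving millions of records, and making the result human-readable is a different order of effort.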
I don’t know about you, but that’s not the kind of tech most people have at their disposal.
The government, despite the great strides it has made in improving the quality of its procurement data, can’t be on the hook for layering all of the above onto the feed; its job is to get the data out the door. But by the same token, neither organizations nor individuals should have to develop the infrastructure. In both cases, the time, energy, and technological expertise required is prohibitive.
The work that’s necessary to make data like this compatible with the end user can’t be the job of the provider or the consumer – there needs to be an intermediary set of tools, products, and processes that automates it.
The Role of Government
Although governments are increasingly eager to be transparent, they’re not really coached on how best to open up their records. They have been told that their data is valuable, but they cannot know which datasets represent the highest value or how exactly that value will flow downstream. Frequently, municipal governments are left to work out their processes and policies with small, scrappy open data divisions that have to decide what to release and how to release it. This is a tall order, and it also ensures that the idiosyncratic nature of the data being released by governments won’t get fixed any time soon. If, for example, San Francisco releases the city’s building permits, that’s useful for San Francisco. If you can benchmark that data against similar datasets in New York and Chicago, you have market indicators you can use to model economic health across the country. When every city does it, you have something else entirely. The point is that more data provides bigger opportunities. Instead of thinking about the individual opportunities that public data provides, it’s helpful to think of open data as a global puzzle, where every municipality and county has a piece to contribute.
The size, structure, and nature of government bodies make them poor candidates for the kind of overarching standardization and aggregation that’s necessary for government data to be usable at scale. It’s a predictable problem. Governments act independently of one another because it’s important for them to tailor what they do to better serve their citizens. But when it comes to data, this customization hampers the efficacy of the entire movement; it isolates one government from another, which means that only a fraction of the problems their data could address are being solved.
Our governments have access to a wealth of data that they should make available, but asking them to also be curators and catalysts of digital best practices is too much.
Bridging the Divide
Despite how hard it can be to get data, the past few years have seen the development of new products and solutions that help individuals and organizations access more data with less effort. DataOps frameworks are becoming commonplace, and data science continues to grow as a discipline. Government strategies have evolved, too. Ontario recently announced the formation of a Digital and Data Task Force designed to help citizens and businesses benefit directly from the data economy. This task force will work alongside the province’s open data policy and will help educate and define best practices to enable people to use data effectively.
In 2013, McKinsey put the value of open data at $3–5 trillion per year, a number that’s trotted out whenever anyone asks about the value of public data in the private sector. While it’s impressive, this number means nothing whatsoever to the average data consumer who can’t figure out how to load a shapefile on their computer. The value of data is obvious, and the benefits are enormous, but at ground level the operational hurdles that need to be overcome in order to unlock this value have slowed innovation to a crawl.
In 2018, 98.6% of companies aspired to a data-driven culture. This near-total agreement that data is the single most important thing to get right is illustrated by the size of data science divisions and the massive appetite for analytics and BI tools in the market.
Behind this number is the reality that wanting to use data to achieve insight is a far cry from actually getting it done, which might explain why, in the same year, Gartner found that 87% of organizations have low business intelligence maturity.
Why the disconnect? It’s the same problem. Data providers push more data into the wild while consumers try to figure out how to use it. Data scientists are stuck in the prep and process phase while business divisions demand results. There is a divide between the people who have data and the people who can use it, and we’ll never uncover the benefit of a data-driven world unless we find a way to build a bridge between the two.