Companies are increasingly focused on finding ways to connect to new and valuable sources of data in order to enhance their analytical capabilities, enrich their models, or deliver more insight to their business units.
Due to the increased demand for new data sources, companies are also looking at their internal data differently. Organizations that have access to valuable datasets are starting to explore opportunities that will let them share and monetize their data outside the four walls of their business.
The Data Sharing Landscape Today
Data sharing is receiving more and more focus. Its importance is tightly connected to that of data discovery, which has been in the spotlight for months now. VentureBeat featured data sharing specifically as part of its 2021 Machine Learning, AI, and Data Landscape, and highlighted a few key developments in the space:
Google launched Analytics Hub in May 2021 as a platform for combining datasets and sharing data, dashboards, and machine learning models. Google also launched Datashare, which is aimed at financial services and built on Analytics Hub.
Databricks announced Delta Sharing on the same day as Google’s release. Delta Sharing is an open-source protocol for secure data sharing across organizations.
In June 2021, Snowflake opened broad access to its data marketplace, coupled with secure data sharing capabilities.
ThinkData has always made data sharing a cornerstone of our technology, and recently, our capacity to securely share data within and outside of an organization has increased (you can read more about the warehouse technology we recommend and why).
The Changing Needs of the Enterprise
There are a few things to consider when it comes to sharing data. Deciding who you will let use your data, and for what purposes, is a strategic question that organizations should think through. But beyond the rules about who will use your data and why, there are also logistical problems involved in sharing data that are more technological than legal in nature.
When it comes to data, whether it's coming in or going out, public or private, part of core processes or a small piece of an experiment, how you connect to it is often as important as why. For modern companies looking to get more out of their data, choosing to deploy a data catalog shouldn't mean you have to move your data. The benefits of a platform that lets you organize, discover, share, and monetize your data can't come at the cost of migrating your database to the public cloud.
The value of finding and using new data is clear, but the process by which organizations should share and receive data is not.
Memory sticks aren’t going to cut it, so what does a good sharing solution look like?
Owner-Defined Access
Often, when organizations want to share the data they have, they have to slice and dice datasets to remove sensitive information or create customer-facing copies. Aside from the overhead of building a client-friendly version of a dataset, this is bad practice from a data governance perspective.
Every copy of a dataset that gets made and distributed creates a new way for data and data handling to go wrong, and opens up possibilities for inconsistencies, errors, and security issues. If you want to share the dataset with ten different people for ten different use cases, you're repeating work and multiplying risk every time you do.
A good data sharing solution lets data owners create row- and column-level permissions on datasets, creating customized views of one master file, ensuring all data shares still roll up to a single source of truth.
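In practice, this can be as simple as defining each share as a filtered view over the one master dataset rather than as a copy of it. Here is a minimal sketch in Python, assuming the master table is loaded with pandas; the share_view helper, column names, and row filter are illustrative, not a real platform API:

```python
import pandas as pd

# One master dataset; every share is defined as a view over it, never as a copy.
master = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region":      ["CA", "US", "CA"],
    "revenue":     [1200, 340, 980],
    "email":       ["a@x.com", "b@y.com", "c@z.com"],  # sensitive field
})

def share_view(df, allowed_columns, row_filter=None):
    """Return a consumer-specific view: column-level permissions drop
    sensitive fields, row-level permissions restrict which records appear."""
    view = df if row_filter is None else df[row_filter(df)]
    return view[allowed_columns]

# A hypothetical analyst in the Canadian business unit: no emails, Canadian rows only.
ca_analyst_view = share_view(
    master,
    allowed_columns=["customer_id", "region", "revenue"],
    row_filter=lambda df: df["region"] == "CA",
)
print(ca_analyst_view)
```

Because each consumer sees a view rather than an extract, a correction made to the master dataset is immediately reflected in every share, which is what keeps all of them rolled up to a single source of truth.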
Consistent and Configurable Metadata
Most datasets aren’t built to be consumer-facing. If the data is generated as a by-product of a core business process, it’s probably particular to a very specific use case that may not be shared by the end user who wants to use the data.
Metadata is the key that lets us understand and use data effectively. Whether it’s descriptive, telling us who created the data and how we should interpret it, or administrative, declaring who owns it and how it can be used, appending and maintaining quality metadata is the most efficient way to ensure anyone understands how to use the data they have access to.
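As a rough illustration of the two kinds of metadata described above, here is a minimal sketch of a metadata record in Python; the DatasetMetadata class and its field names are hypothetical, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    # Descriptive metadata: who created the data and how to interpret it
    title: str
    description: str
    created_by: str
    column_descriptions: dict = field(default_factory=dict)
    # Administrative metadata: who owns it and how it can be used
    owner: str = ""
    licence: str = ""
    allowed_uses: list = field(default_factory=list)

meta = DatasetMetadata(
    title="Retail transactions, 2021",
    description="Point-of-sale records aggregated daily per store.",
    created_by="ops-data-pipeline",
    column_descriptions={"revenue": "Gross daily revenue in CAD"},
    owner="Retail Analytics team",
    licence="Internal use only",
    allowed_uses=["forecasting", "store performance dashboards"],
)
print(meta)
```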
Secure Data Flow Management and Access Auditing
This should be a no-brainer, but you might be surprised how many organizations still email spreadsheets from one floor to the next. If you're sharing open data, you may not have to worry about data security, but if the data is even moderately sensitive (not to mention valuable), you'll want to make sure you have a way to trace exactly where it's going.
Ensuring the data is delivered through a service with adequate data protection measures is important, and making sure the delivery is reliable and stable will increase trust.
That covers internal sharing, but what about data sharing between organizations, or even between departments?
Historically, providing your data to another business unit has been a transaction from one black box into another. But if your organization has limitations on who can use the data, you need to know which users are connecting to the dataset on an ongoing basis. A good data sharing platform should not only let you control the flow of data, but also give you insight into how users are connecting to it, and when.
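As a rough sketch of what that kind of audit trail might capture, here is a minimal example in Python; the record_access function and the event fields are hypothetical, not any specific platform's API:

```python
import json
from datetime import datetime, timezone

# Hypothetical audit trail: every connection to a shared dataset is recorded,
# so the data owner can see who is using the data, and when.
audit_log = []

def record_access(dataset_id, user, action="read"):
    event = {
        "dataset": dataset_id,
        "user": user,
        "action": action,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(event)
    return event

record_access("retail-transactions-2021", "analyst@partner.example", "read")
print(json.dumps(audit_log, indent=2))
```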
Support for On-prem and Public Cloud
Most modern data sharing initiatives are designed exclusively for the public cloud, which not only limits who can leverage them, but also increases spend for both provider and consumer as they use more storage to access all the data that's been shared with them.
The reality is that companies at any stage of becoming data-driven require systems and infrastructure that disrupt their standard operating procedures as little as possible. One of the problems with many sharing solutions and traditional data catalogs is that they require an up-front migration to the public cloud, which isn't always possible. For organizations with mission-critical or sensitive data and transactional workloads, the fine-grained control of an on-premises data centre will always be a necessity.
Even for organizations that would love to do a lift-and-shift to the public cloud, there are miles of legal and practical considerations that grind the process to a halt just as they're trying to modernize. A multinational financial institution can't take a couple of months to physically move its data into the cloud.
No matter the state of your database, you should still be able to safely and securely provide access to your data where it resides, without migrating it to a public cloud or creating unnecessary copies.
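The pattern here is query-in-place rather than copy-and-move. Below is a minimal sketch using SQLAlchemy against a hypothetical on-premises Postgres host; the connection string, table, and column names are placeholders, not a prescribed setup:

```python
from sqlalchemy import create_engine, text

# Connect to the database where it already lives (hypothetical on-prem host);
# no bulk export, no copy into a public-cloud bucket.
engine = create_engine(
    "postgresql://catalog_user:password@onprem-db.internal:5432/warehouse"
)

# Only the rows a consumer is permitted to see ever leave the database.
query = text("""
    SELECT customer_id, region, revenue
    FROM retail_transactions
    WHERE region = :region
""")

with engine.connect() as conn:
    for row in conn.execute(query, {"region": "CA"}):
        print(row)
```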
A Mechanism for Discovering New Data
The rise of the data marketplace indicates a strong need for most organizations to leverage a solution that helps them find and connect to data that resides outside the four walls of their business.
Where data marketplaces often fail, however, is that they provide a shopping experience, not a consumption model. The data marketplace has historically been a place that gives users the ability to find data, but not a mechanism by which they can then connect to it, which is a bit like letting someone shop for clothes but not letting them try anything on.
A new data sharing model has emerged out of this need, and Snowflake made waves last year when they launched the Snowflake Data Marketplace as a place where users could bilaterally share data into each other’s accounts.
This model works well in the public cloud, where it's relatively easy to create a dataset in a new cloud region, but it's a bit unwieldy (recreating the data in every cloud region adds substantial overhead) and it can create runaway compute and storage costs if you're not careful.