Data is the raw material for any type of analytics, whether that is historical analysis presented by business analysts in reports and dashboards, or predictive analysis in which data scientists build a model that anticipates an event or behavior that has not yet occurred.
Before business analysts or data scientists can do anything with data, they essentially go through the same set of steps to make data useful for analysis.
The set of operations to convert raw data into a format that is easy to understand and hence useful for further analysis is collectively known as data preparation or pre-processing. Generally speaking, this involves:
- Finding relevant and available data
- Arranging or labeling data into categories such as numerical, categorical, text, audio, video, etc.
- Improving data consistency by coming up with an approach to deal with misspelled categories, null values, and missing values
- Applying data balancing methods if the data contains an unequal number of observations per class
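The cleaning and balancing steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended pipeline: the records, the field names (`region`, `spend`, `churned`), the `SPELLING_FIXES` mapping, and the choice of mean imputation and random oversampling are all assumptions made for the example.

```python
import random
from collections import Counter

# Hypothetical raw records; field names and values are invented for illustration.
raw = [
    {"region": "north", "spend": 120.0, "churned": 0},
    {"region": "North ", "spend": None, "churned": 0},
    {"region": "nroth", "spend": 80.0, "churned": 0},
    {"region": "south", "spend": 200.0, "churned": 1},
    {"region": "SOUTH", "spend": 100.0, "churned": 0},
]

# Assumed mapping from known misspellings to canonical category names.
SPELLING_FIXES = {"nroth": "north"}


def clean(records):
    """Normalize category spellings and fill missing spend with the mean."""
    known = [r["spend"] for r in records if r["spend"] is not None]
    mean_spend = sum(known) / len(known)
    cleaned = []
    for r in records:
        region = r["region"].strip().lower()
        region = SPELLING_FIXES.get(region, region)
        spend = r["spend"] if r["spend"] is not None else mean_spend
        cleaned.append({"region": region, "spend": spend, "churned": r["churned"]})
    return cleaned


def oversample(records, label, seed=0):
    """Balance classes by randomly duplicating rows of the smaller classes."""
    rng = random.Random(seed)
    by_class = {}
    for r in records:
        by_class.setdefault(r[label], []).append(r)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        balanced.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return balanced


prepared = oversample(clean(raw), "churned")
class_counts = Counter(r["churned"] for r in prepared)
```

Note that both functions build new lists rather than reading only: even this toy version writes data as well as reads it, which is why the sandbox needs write access.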
When data is ready for consumption, business analysts can create reports, dashboards and visualizations to paint a picture of the overall health of the business. Data scientists, in turn, can build and train statistical models using techniques such as linear regression, logistic regression, clustering, and time series analysis. The output of these models can then be used to automate decision making using sophisticated techniques such as machine learning.
At its core, data analytics is an exploratory exercise where data consumers (business analysts and data scientists) analyze large bodies of data to reveal patterns, behaviors or insights that inform some decision-making process. In the case of machine learning, the analysis specifically attempts to identify the features with the biggest influence on the target variable. This requires access to a large amount of data that may contain sensitive elements such as customer information (e.g., age, social security number, address). In many instances, this data is owned by different business units and is subject to strict data sharing agreements.
This presents data infrastructure teams with unique challenges. For example, they need to provide data consumers with access to enterprise data at the required granularity while complying with both privacy regulations and the requirements set by the data owners. Another major challenge for the data infrastructure team is keeping up with the data science team's rapidly growing demand for data for its analytics and innovation projects.
Data science requires not only reading data but also updating it, as described in the pre-processing steps. In other words, data science is by nature a read- and write-intensive activity. Data infrastructure teams usually create sandbox instances for these data consumers whenever they start a new project. However, these sandbox instances also require robust data access governance so that no sensitive or confidential data is exposed during data exploration.
To support the growing data science activities in a company, data infrastructure teams need to implement a unified data access governance platform that has four important attributes.
- First is the ability to encrypt data when it is being extracted from source systems, on its way to the sandbox instance, and at rest. This also includes the ability to mask columns that contain sensitive elements such as social security numbers, so that data consumers can still use those columns for analysis while data infrastructure teams redact or partially mask the values at the individual customer level to protect customer privacy.
- Secondly, the data access governance platform must have the ability to natively enforce read as well as write access control for on-prem and cloud services. For example, a business analyst might need to insert a column in a table to reflect a change in sales territories. To perform this operation, the analyst needs permission from the administrator both to read data from the table and to write new data into it. Read and write operations are critical for data science, yet some data access control products can enforce only read access policies.
- Thirdly, the data access governance platform you select must provide the functionality to identify, classify and tag sensitive elements in your data. A platform that offers this capability through a combination of mechanisms such as data dictionaries, pattern matching, and models is generally more effective in detecting sensitive elements in your data.
- Finally, the data access governance platform must give IT administrators visibility into the analytics process. This requires the platform to have real-time capabilities that tell administrators who requested access to what data, whether access was granted or denied, and which policy was in effect when the decision was made. The ability of the data access governance platform to log the status of every access request is critical to complying with privacy and industry regulations as part of internal and external audits.
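To make these attributes concrete, the sketch below shows how partial masking, read and write policy checks, pattern-based tagging, and request logging might fit together. It is a toy illustration under stated assumptions, not how any particular governance product works: the `POLICIES` table, the user and table names, the SSN format, and every function name are hypothetical. Encryption in transit and at rest is deliberately left out.

```python
import re
from datetime import datetime, timezone

# Assumed pattern for US social security numbers in the form 123-45-6789.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

# Hypothetical policies: (user, table) -> set of allowed operations.
POLICIES = {
    ("analyst", "customers"): {"read", "write"},
    ("intern", "customers"): {"read"},
}

audit_log = []  # every request is recorded, whether granted or denied


def tag_sensitive_columns(rows):
    """Tag a column as sensitive if every non-empty value matches the SSN pattern."""
    tagged = set()
    for column in rows[0]:
        values = [row[column] for row in rows if row[column]]
        if values and all(SSN_PATTERN.match(str(v)) for v in values):
            tagged.add(column)
    return tagged


def mask_ssn(value):
    """Partially mask an SSN, keeping only the last four digits."""
    return "***-**-" + value[-4:]


def request_access(user, table, operation):
    """Check the policy table and append the decision to the audit log."""
    allowed = POLICIES.get((user, table), set())
    granted = operation in allowed
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "table": table,
        "operation": operation,
        "granted": granted,
        "policy": (user, table) if (user, table) in POLICIES else None,
    })
    return granted


def read_table(user, table, rows):
    """Return rows with sensitive columns masked, or raise if access is denied."""
    if not request_access(user, table, "read"):
        raise PermissionError(f"{user} may not read {table}")
    sensitive = tag_sensitive_columns(rows)
    return [
        {col: mask_ssn(val) if col in sensitive else val for col, val in row.items()}
        for row in rows
    ]


customers = [
    {"name": "Ada", "ssn": "123-45-6789"},
    {"name": "Bo", "ssn": "987-65-4321"},
]
masked = read_table("intern", "customers", customers)
intern_can_write = request_access("intern", "customers", "write")
```

In this sketch the intern's read succeeds with masked values and the write request is denied, and both decisions land in `audit_log` together with the policy that was consulted. A production platform would enforce these checks inside the data services themselves rather than in application code.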
Business analysts and data scientists constitute a substantial and growing portion of the data consumer community in any company. This community is unique in that it needs not only to find answers to its questions by querying data in various repositories but also to transform that data to build sophisticated analyses and models. Therefore, read and write operations are at the heart of the data science process.
Data infrastructure teams that are tasked with democratizing data while complying with privacy and industry regulations must select a data governance platform that can accelerate data sharing in the organization by satisfying the unique requirements of these data consumers.