Data Professional Introspective: Data Provider Management – Part 2

Governance for Acquired Data / Selecting Sources

Our next column in the series explores the challenges of governing acquired data, and then introduces a framework for managing acquired data: the data acquisition lifecycle.

We’ll explore best practices in developing data requirements and selecting data sources that lead to acquiring the right data for business purposes.

In Part 2, we’ll introduce the data acquisition and monitoring lifecycle framework and address its front end:

Governance capabilities and common gaps
Harnessing the power of a data catalog
Data acquisition and monitoring lifecycle
Defining requirements for data sources
Selecting appropriate data sources and comparing them.

Governance Capabilities and Gaps

In many cases, the lack of a robust data governance program can be a major cause of scattered, ad hoc, and poorly controlled data assets. Conversely, when organizations commit to establishing and operationalizing data governance, it is a major step towards effective management of both existing and future data assets.

Neglecting to extend governance responsibilities to acquired data and internal data sources fosters a chaotic environment. These are the key capabilities pertaining to governance of acquired data and management of data providers:

Designation of data domain stewards – A data domain steward is an individual (or defined group) fulfilling a leadership role for a defined shared data domain. For example, domains for a university typically include Enrollment, Academic Planning, and Courses; for manufacturing, domains may include Product Design, Inventories, and Sales; and for banking, domains may include Customer Accounts, Loans, and Financial Products. The data domain steward is ultimately responsible for decisions about domain data that affect its definition, production, and use, on behalf of supplying and consuming stakeholders across the organization.

Once data domains are defined and bounded, organizations usually align their operational data stores and repositories to one or more domains. However, organizations frequently do not undertake the effort to ensure that acquired data is also categorized, governed, and managed appropriately. This creates a gap in responsibility.

Designation of data owners – A data owner is the individual responsible for one or more data stores, who decides who can create or modify the data, who can access the data, what performance requirements should be met, what quality checks should be implemented, and so on.

Many organizations have not assigned data ownership to acquired data, and since project teams working towards their deadlines are apt to grab data from the most convenient location, this often results in ‘orphaned’ data sources. For instance, a legal organization in the process of implementing security features found that nearly half of their systems had no designated owner to receive a security alert about a potential data breach. In the same vein, organizations may have many ‘stray’ interfaces that are undocumented, which may not be discovered until a system is redesigned.

Business data steward engagement – Designated data stewards and business subject matter experts are often not consulted or fully engaged in the source selection process, leading to incomplete scoping requirements, lack of input in comparing sources, and missed opportunities to help improve contracts and data provider performance.

Internal operating agreements / service level agreements – For internally acquired data, no purchase contract is required. However, it is advised to institute a baseline standard operating agreement for internal data suppliers and consumers. A service level agreement is more formal, and should be employed for all external data sources. It describes the working relationship between the parties, specifying the service to be provided (i.e., delivery of a data product), its frequency, how change notifications are delivered, quality measures, monitoring, issue reporting and resolution, and other terms. SLAs enable the data consumer to rely on the data acquired. We’ll address them in more detail in Part 3.

Since most organizations have so many internal data sources, it may seem like an undue burden to document operating agreements, but failure to do so exposes data consumers to risks caused by incomplete or missing data, lack of change notifications, or an undefined issues resolution process. In effect, it creates a significant gap in the ‘controlling’ function of data governance.

Data Catalogs Help Manage Interfaces and Data Feeds

Constructing a comprehensive description of the data assets is facilitated by a data catalog. Its content scope should be extended beyond physical repositories and databases to include recurring interfaces, external data feeds, and internal data sets. This will greatly assist in the organization’s knowledge of what data is entering the data layer, where it is going, who owns it, and who consumes it. A data catalog is essential for improving the management of acquired data.
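As a rough illustration of extending catalog scope to interfaces and feeds, the sketch below models a catalog record and flags assets that have no designated owner. The field names, asset types, and sample entries are hypothetical and do not reflect any particular catalog product’s schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogEntry:
    """One catalog record. Field names are illustrative assumptions."""
    name: str
    asset_type: str                      # e.g. "repository", "interface", "external_feed"
    domain: Optional[str] = None         # data domain the asset is aligned to
    owner: Optional[str] = None          # designated data owner, if any
    supplying_system: Optional[str] = None
    consumers: List[str] = field(default_factory=list)


def orphaned_assets(entries: List[CatalogEntry]) -> List[str]:
    """Return the names of assets with no designated data owner."""
    return [e.name for e in entries if not e.owner]


# Hypothetical entries covering a repository, an external feed, and an interface.
catalog = [
    CatalogEntry("customer_master", "repository", "Customer Accounts",
                 "J. Rivera", "CRM", ["analytics"]),
    CatalogEntry("regional_sales_feed", "external_feed", "Sales",
                 None, "Vendor A", ["marketing"]),
    CatalogEntry("inventory_extract", "interface", "Inventories",
                 "", "ERP", []),
]

print(orphaned_assets(catalog))
```

Even a simple report like this one makes ownership gaps visible before a security alert or a system redesign exposes them.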

Because metadata (knowledge management for your data) is just as important as the data itself, most organizations have determined that implementing and populating a robust data catalog platform is critical to finding, understanding, and accelerating access to data. Because business users and analytics teams engaged in integrating data often spend an inordinate amount of time simply locating needed data, having the information readily available is a tangible productivity gain.

As a software product, the catalog is an automated, organized inventory of data assets that captures, stores, and delivers information about the data. A catalog is primarily composed of metadata properties, which are linked to business concepts (that is, shared terms and definitions). It is a foundational pillar of description for the existing data architecture, and it enables parsing and expanding governance responsibilities and controls at a detailed level, for data stores, repositories, and acquired data from external and internal sources.

Key benefits include:

It enables data consumers to search for, understand, and access data across the organization.

When organized by data domain, it facilitates rapid discovery of useful shared data assets that were previously not easily discoverable.

It simplifies establishing relationships across databases and data sources, to enable traceability and data lineage across the lifecycle.

It provides embedded control levels to govern access, and to improve communication about data changes and impacts.

And it features built-in scanners and connectors to business intelligence platforms and other advanced software, enabling self-service and accelerating time to insights.

The data catalog is a key focus point for development of policies, processes, work products, and data governance responsibilities.

Data Acquisition and Monitoring Lifecycle

What processes, when implemented, help organizations effectively manage data products and data providers?

Overall, implementing ‘defined’ (documented and approved) processes for external and internal data sourcing ensures that business requirements are satisfied by the data acquired, and that data is understood, documented, discoverable, and accessible. In addition, effective management of data providers requires defined processes to specify agreements and support services, implement systematic communication, and enforce issue resolution and remediation.

The diagram below outlines a lifecycle approach to developing standard processes that can be applied across the organization. We’ll summarize them and then explore them further, while indicating corresponding standard work products that an organization is advised to develop and employ.

Data requirements specification is the first step in data acquisition. The driver for developing requirements can be relatively straightforward. For example, if you’re implementing a repository to answer questions about your customers’ purchase patterns, you would definitely need to identify the customers. If your organization has a reliable Customer master data application, the sourcing decision for that requirement is simple.

When the need is to answer key questions for a strategic business decision, the requirements definition process has a broader scope and can be complex. Let’s say a consumer stone products company wants to determine if they should expand into an adjacent region.

They first need to pose the high-level questions which, when answered, will shape that decision. Questions could include: Has there been recent growth of similar product sales in that region? What competitors operate there now, and how are they performing? What is the percentage of home ownership? What is the average income level? And what marketing channels are the most consistently successful for similar products? These and similar questions would define the scope of exploration for data sources.

The next step is selecting sources. If the data sets satisfying your requirements are available internally, your task is to discover where they are, what they include, and how to access them. If there’s more than one possibility, you’ll need to define criteria to determine the best source to satisfy your requirements.

If the data is not available internally, you’ll need to be more precise in defining the data sets you need, explore what is available (i.e., publicly available information or a vendor data product), what organizations or vendors supply it, and what the estimated cost would be.

If there is more than one external source, you need to determine the criteria you’ll apply to compare them. We’ll discuss some factors to help you decide.

Next, you’ll need to coordinate with the Procurement or Purchasing group. It is advised to learn in advance what their requirements for requests are, and what they will need from you to complete the product purchase.

Most vendors have standard contracts that can be modified according to the purchase requirements. With assistance from Procurement and your legal group, you’ll want to review the contract, note any areas of concern, and request any needed modifications.

Accompanying the contract is a Service Level Agreement which specifies the actions the supplier will take to deliver the data and perform related services, such as timeliness, quality controls, issues escalation, etc. If the data source is internal, the SLA (or operating agreement) functions in lieu of a contract.

Once the product is being delivered, you’ll need to implement regular and systematic communication with the internal supplier or external vendor, and monitor the product and associated services utilizing defined metrics.

And of course, you should provide information about the new data source for inclusion in the data catalog.

Standardizing these processes leads to greater ease and efficiency for the consumer, reduced time from request to provisioning, more straightforward contract negotiations, and captured information about the new data sources for other potential consumers.

Defining Source Requirements

Defining data requirements, including those pertaining to data sources, is a top-down process, starting with an expressed business purpose, a use case, or business questions that need to be answered. The complexity and duration of your requirements effort may range from minimal, as in the Customer master data example, to more extensive, as in the regional expansion example.

Another aspect of defining the scope of data to be acquired is the extent of the potential consumer base. For example, ‘consumer sentiment’ data may be useful to shape development of a company’s products, as well as to create marketing strategies and sales campaigns. Therefore, whichever group originated the intended acquisition, it’s advised to engage other relevant stakeholders to ensure that their current or future requirements are included. These may include data governance groups, domain stewards, data asset managers, business and technical data stewards, and possibly multiple projects or analytics teams.

After establishing the business purpose and the potential consumer base, you’ll be prepared to define the data sets of interest, decomposing the high-level requirements to a greater level of detail. You can create a conceptual data model as a starting point, or develop a list of critical data elements.

Now you’re ready to identify sources. Step one is to determine if the data is available internally, so your first stop would be the data catalog. If it is not yet populated fully, metadata about data sources that may contain your critical data elements may not be available. In that case, you should contact other organizational units who may use the same or similar data, and request a data model or data dictionary.

If there are multiple internal sources that contain the data you’re after, you’ll need to consider enhanced criteria to help you choose the best source for the purpose, such as: Does the time series match your needs (for example, daily or monthly aggregations)? How often is the data refreshed? What is the access method?

Whether the data is sourced internally or externally, these requirements are similar.

Your organization may have work products available to assist you in developing requirements for acquired data. If they are not available, your data acquisition effort may serve as a Proof of Concept that helps the organization to develop them.

If the data is not available internally, you need to research where it may be found, whether from a publicly available source or a data vendor. For example, the consumer stone products company could obtain from the Census Bureau how many households in a state, county, or municipal area have two or more people, the number of households with an annual income over $150,000, the number of homeowners with or without a mortgage, and the percentage of income spent on housing. Those basic aggregated facts help support an estimate of disposable income, and indicate both the likelihood of interest in making improvements and the ability to do so.

In our example, the company also needs to acquire current and historical spending data for residential stone products. Several data providers offer free statistical reports, and sell detailed information by product type, application, region, and related dimensions. Deciding which external sources to pursue requires additional analysis.

Refining Source Selection

If you’ve verified that data meeting your requirements is available internally, and there is more than one source, first determine if your organization has designated one of the candidate sources as ‘authoritative.’ By ‘authoritative’ we mean that the source has been approved by governance processes for shared business use. The ‘authoritative source’ may be the system of origin (or ‘record’) which creates the data, or a repository designed to supply high-quality shared data, which has been integrated and is provisioned according to best practices, such as a data warehouse or data lake.

When there is no designated authoritative source, you will need to research further. Your organization’s data catalog may support detailed information for data assets that have already been populated. Metadata that may bear upon your decision is illustrated in this table.

Catalog metadata should contain the identification of the supplying system and the data owner. If available, this information enables you to definitively map your data requirements to the candidate sources. Some additional questions that you may want to ask are:

How is the data produced, and what are the business processes and systems that create it? Process steps and execution affect the data; e.g., a process may fail to capture some of the data that you need, or a system may lack quality controls.

If the internal source is not the originating source, where is the data created? For instance, the data may originate from a data vendor, in which case there may be licensing restrictions on shared use.

If the product provides statistical information, is the data aggregated such that it will be satisfactory for the business purpose? For example, if the stone products company wanted to distinguish homeowners with reportable annual income over $150,000, and the source being considered only provided a top range of annual income ‘over $100,000,’ that wouldn’t enable them to refine their target market as planned.

Data Source Selection Criteria

If the answers still leave you with more than one candidate data source, you should apply criteria that can help you decide on the best source. Your organization may have standard templates or guidelines that you can use to assist your decision. These factors can apply both to internally and externally sourced data:

Is the level of detail (granularity) sufficient for the intended data usage? If the company in our example needed to obtain data by ZIP code, but it was only available by state, that wouldn’t provide enough information to analyze the suitability of a smaller region.

Is the time range for the data adequate for the desired currency of information? How often is the data updated in the source: intraday, daily, or monthly? And how important is it for your purposes that a specific time range is met?

If historical data is required, what is the time period available? For example, if a medical researcher is analyzing evolving health trends, they might need data spanning several decades.

How is the condition of the data maintained in the source? Evaluate key data quality characteristics and how they are supported by the source owner or responsible parties. You should consider the following data quality dimensions:

- Completeness – Does the data to be received contain values in all of the specified attributes? Which values may be missing?
- Timeliness – Can the data be received according to the frequency required?
- Accuracy – What validations are performed, if any, to ensure that the information reflects real-world facts?
- Uniqueness – What is the incidence of duplicate records, and how does the source address this?
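Two of the dimensions above, completeness and uniqueness, lend themselves to simple checks against a sample of the data. The following is a minimal sketch assuming records arrive as dictionaries; the attribute names (`household_id`, `zip`, `income_band`) and sample values are hypothetical. Timeliness and accuracy are not covered here, since they depend on delivery timestamps and trusted reference data.

```python
from typing import Dict, List, Optional, Sequence

Record = Dict[str, Optional[str]]


def completeness(records: List[Record], required: Sequence[str]) -> Dict[str, float]:
    """Share of records with a non-empty value, per required attribute."""
    total = len(records)
    result = {}
    for attr in required:
        filled = sum(1 for r in records if r.get(attr) not in (None, ""))
        result[attr] = filled / total if total else 0.0
    return result


def duplicate_rate(records: List[Record], key: Sequence[str]) -> float:
    """Fraction of records whose key repeats one already seen."""
    seen = set()
    dupes = 0
    for r in records:
        k = tuple(r.get(a) for a in key)
        if k in seen:
            dupes += 1
        else:
            seen.add(k)
    return dupes / len(records) if records else 0.0


# Hypothetical sample; a real one would come from the candidate source.
sample = [
    {"household_id": "H1", "zip": "30301", "income_band": "150k+"},
    {"household_id": "H2", "zip": "", "income_band": "100-150k"},
    {"household_id": "H2", "zip": "", "income_band": "100-150k"},
    {"household_id": "H3", "zip": "30305", "income_band": None},
]

print(completeness(sample, ["household_id", "zip", "income_band"]))
print(duplicate_rate(sample, ["household_id"]))
```

The resulting ratios can be compared with whatever thresholds your requirements set, and raised with the source owner or vendor where they fall short.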

Is there a convenient method to obtain the data that conforms with the receiving application’s capabilities? Automated methods are preferred, for instance, delivery through an API or file transfer protocol, versus requiring a system administrator to initiate manual downloads.

For internal shared data repositories, data consumer satisfaction metrics may be collected and available for review, or you can contact other users of the data to learn about their experiences with the source. For external data sources, the vendor may supply you with metrics and customer references.

Standard work products that your organization may offer to assist your decision are displayed in this table.

Next, we’ll look at additional criteria that you can use to compare external data products and providers.

Comparing Vendor Products

By the time your requirements effort concludes that external data is needed, you may already have a good idea which data providers are likely to have what you want. If not, research which vendors offer products that may meet your needs, and ensure that the content scope is met as well as the other selection factors important for the purpose. If this exploration yields only one vendor product, you can begin negotiating prices and options and reviewing the vendor’s standard contract. If it yields more than one, you’ll need to compare the products at a more granular level.

With regard to the content of external data products, the scope of data provided may be broader than the data set you’ve identified. Vendors frequently package many tables or files into a product offering, based on the collective requirements of their primary customer base. Luckily, it is not difficult to ignore or delete unused data, so obtaining more data than you actually need should not be a negative factor in your decision.

Then there is the opposite situation. In some cases, such as with product offerings from credit reporting agencies, the purchaser is required to pay for each selected data element, or for defined groupings of available data elements, similar to tiered television content packages. If there’s a specific channel you really want to watch, you may have to pay more for a premium access tier. In this case, you’ll want to constrain the scope, aligning with defined data set(s).

Let’s say there are several vendors with the content you’ve specified, and you’ve compared their standard pricing to verify that they fall within your allowed budget. A comparison table is a very useful tool to help you evaluate products based on the factors most important for your business purpose. The candidate products are arranged as columns, and the comparison criteria are arranged as rows. Your organization may have a standard template available with key factors designated.

You can add additional criteria as needed, weighting the factors by relative importance. For example, the availability of historical data for a specified time period may be more or less critical for the business purpose. You can add comment fields to document discrepancies, or extend the template to account for desirable features that are vendor-specific.
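One way to make the weighted comparison concrete is a small scoring sketch. The criteria, weights, product names, and 1-to-5 ratings below are hypothetical placeholders; your organization’s template may define different factors and scales.

```python
from typing import Dict


def weighted_score(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of criterion ratings (same scale as the ratings)."""
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


# Hypothetical weights: a higher number means more important for the purpose.
weights = {"granularity": 3, "history": 2, "refresh": 2, "delivery": 1}

# Hypothetical ratings on a 1 (poor) to 5 (excellent) scale.
products = {
    "Product A": {"granularity": 5, "history": 3, "refresh": 4, "delivery": 2},
    "Product B": {"granularity": 3, "history": 5, "refresh": 4, "delivery": 5},
}

scores = {name: weighted_score(r, weights) for name, r in products.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

With these placeholder numbers the two products score within a fraction of a point of each other, which is a reminder that the weights deserve as much scrutiny as the ratings.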

Once the comparison table is populated, share the results with other potential data consumers through the appropriate governance group, and solicit feedback. Revise the comparison, or your weighting, to accommodate shared data requirements.

If possible, it is also a good idea to request a sample of the data from the vendor. The sample should include all tables and columns provided in the product over a specified period of time. This allows you to test the data against your intended use case and validate its completeness and condition.

Comparing costs may be straightforward or complex, depending on the product scope and the vendor’s pricing scheme: for instance, pricing for individual products versus a tiered or bundled price structure.

Considerations that you may encounter that affect pricing include:

The vendor may limit the number of locations where the data may be stored
They may limit data usage to the organizational unit that made the purchase
Or, they may only offer an enterprise-wide license.

If there are multiple user groups for your data set, incurring some additional costs for broader access may yield significant value to the organization. Some organizations have found that multiple organizational units have purchased the same data from the vendor; the vendor will not always inform you of this fact, but if your organization has many business units or is regionally dispersed, you may have the opportunity to purchase an enterprise license for less than the cost of the separate feeds. It’s certainly worth exploring!
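The arithmetic behind that comparison is simple, as the sketch below shows. All of the fees and unit names are invented for illustration and are not drawn from any vendor’s pricing.

```python
from typing import Dict


def enterprise_license_saving(unit_fees: Dict[str, int], enterprise_fee: int) -> int:
    """Annual saving from replacing separate unit licenses with one enterprise license.

    A negative result means the separate licenses are cheaper.
    """
    return sum(unit_fees.values()) - enterprise_fee


# Hypothetical annual fees paid by three organizational units for the same feed.
unit_fees = {"Marketing": 40_000, "Sales": 35_000, "Product": 30_000}

# Hypothetical quote for an enterprise-wide license.
saving = enterprise_license_saving(unit_fees, 80_000)
print(saving)
```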

At this point, we’ve discussed the front end of developing requirements for, and selecting, data sources. In Part 3, our next column, we’ll address:

Coordinating with procurement: pricing, purchasing, and onboarding
Developing contract terms
Service level and operating agreements
Managing service level agreements
Metrics and monitoring data providers
Data provider communication.
