Can You Trust Your Data?

It’s increasingly difficult to achieve a reliable data set – and that could mean you can’t trust the information shared with the business. What’s the solution?

Four Essential Steps to Trustworthy Data

Without reliable data, your smart-looking marketing dashboard is a liability at best, misleading at worst. But maintaining a flow of high-quality, harmonized data is a tough challenge that can swallow an enormous amount of time. The answer? An intelligent, structured approach to data management.

Your visualization interface gives you quick business insights – but the results presented are only as good as the data used. The familiar dictum still holds true: garbage in, garbage out.

That doesn’t mean your data sources are necessarily garbage, but they might not be the right ones for the business any more. It does mean that the quality of your data preparation can dramatically affect the results in your dashboard. And with data flowing from a growing number of sources in multiple formats, organized by multiple criteria, presenting visualization interfaces with clean data sets is an increasingly difficult process.

What’s the answer? And, while we’re at it, what do terms like ‘data discovery’ and ‘data lake’ really mean?

A Problem Consuming Far too Much Time

Businesses are spending enormous amounts of time wrestling structured and unstructured data into shape in an effort to feed valid, relevant data into the underlying engine. The challenge goes beyond basic data quality. There’s also the highly complex process of combining (or harmonizing) data into a workable data set, with your own business rules and market definitions applied.

Surveys suggest that data preparation consumes up to 80% of the time spent on analytics projects[1]. Much of the work has historically been done using Excel spreadsheets, which presents further trouble. Studies have shown that close to 90% of spreadsheets contain errors[2] and that, even after careful development, 1% or more of their formula cells are wrong. One MIT professor asserted that large spreadsheets with thousands of formulas hide dozens of undetected errors.

So is the answer throwing more bodies at the problem? There is a better way than brute force.

The answer lies in understanding the data in these multiple data sets, together with a robust data management framework to link harmonized data together and then to maintain the integrity of these links. And, of course, the ability to deal with changes as soon as they occur.

There are four key parts to this framework:

  1. Data discovery
  2. Data harmonization
  3. Data curation
  4. Data traceability

1. Data Discovery

Gone are the days of ‘one size fits all’ for sales and marketing data on today’s leading pharmaceutical products and therapy areas, whether cross-country or in-country.

Today the landscape ranges from good data for primary sector small molecule products, to patchy data for biologics with their use in multiple indications, to poor or no data for certain drug/device combinations and products being developed for rare diseases. From a geographical perspective, there is (generally) good data in established western markets, but management now demands information from across the globe – and this means emerging economies, where the data is patchy.

Increasingly, management want faster, more granular insights. This in turn is being supported by an increasing number of data vendors/institutions providing faster, more granular data thanks to better technology at the point of collection. It’s particularly notable in emerging economies – vendors are beginning to fill the market gap.

So the challenge is more one of knowing which data source to use to monitor your market – ‘data discovery’ in the business sense – than of creating and capturing data from scratch. The primary question is: what is already out there, and what should I know about that I don’t?

However, in some cases the data doesn’t even exist. In the case of cross-country analytics, each affiliate may have its own source to contribute, so some way of capturing the input and applying governance to it is required.

Data discovery, in the business intelligence sense, includes searching for appropriate sources and then capturing the data into a ‘data lake’. Gartner defines a data lake as ‘a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format. The role of a data lake is to present an unrefined view of data to only the most highly skilled analysts’. It therefore requires a lot more work to make the data fit for purpose for reporting and regular analytics.

Once the data lake is assembled, data discovery in the IT sense starts.

The Data Discovery Process

Data discovery includes all of the following stages:

  • assessing samples
  • profiling
  • cataloguing data assets
  • tagging/annotating data for future exploration
  • discovering sensitive and/or commonly used attributes
  • discovering data lineage and pattern detection

All these stages will lead to a certain level of trust as to whether the data is accurate. In some cases various datasets, or parts of data sets, will need to be combined and triangulated to get a more reliable and complete picture.
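
To make the profiling stage concrete, here is a minimal sketch in Python. It is not a description of any particular tool: the function name `profile_column`, the field names, and the sample rows are all hypothetical, chosen only to show the kind of summary (row count, missing values, distinct values) that helps decide how far a data set can be trusted.

```python
from collections import Counter

def profile_column(rows, column):
    """Summarize one column of a sample held as a list of dicts (hypothetical layout)."""
    values = [row.get(column) for row in rows]
    # Treat None and empty strings as missing values.
    present = [v for v in values if v is not None and v != ""]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1),
    }

# Illustrative sample only; not real market data.
sample = [
    {"product": "ProductA", "country": "DE", "units": "120"},
    {"product": "ProductA", "country": "FR", "units": ""},
    {"product": "ProductB", "country": "DE", "units": "45"},
]

units_profile = profile_column(sample, "units")
product_profile = profile_column(sample, "product")
print(units_profile)
print(product_profile)
```

A high share of missing values, or an unexpectedly low number of distinct values, is an early signal that a source may need to be triangulated against another before it is relied on.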

Should You Use Self-Service Data Preparation?

There’s much talk in the IT world now about ‘self-service’ data preparation tools to enable this data discovery, but these tools have a specific function. They’re business-oriented and enable data preparation for interactive analysis, essentially allowing users to quickly see the links between data sources so they can evaluate these sources.

While the tools are useful for this specific purpose, they don’t replace an enterprise data integration platform – although they do complement it. The two approaches to data preparation can and should co-exist. It all depends upon what business question needs to be answered.

Consider the question ‘is time to insight more important than data quality?’ If the answer is YES – for example, where a quick decision is being made as to whether to pursue a licensing opportunity, or to investigate ‘Big Data’ such as EHR/EMR data – then a data discovery tool could be useful to find the insights and then ascertain the appropriate data to track. If the answer is NO, which it is for the majority of solutions focusing on regular reporting and market monitoring, then a robust enterprise data integration platform is essential.

2. Data Harmonization

Once the appropriate data has been identified, the next challenge is getting that disparate data to play nicely together.

Multiple data sources offer different data elements and levels of detail, each with unique identifiers. Bringing these sources together presents challenges.

For example, consider an organization looking to combine a sales audit file with an internal sales ex-factory file. The files probably won’t be in the same format, so before the data can be blended or harmonized, you’ll need to match product and pack level. In this type of case, the matching is for own product/own pack only. But just think how the challenge is exponentially exacerbated when different audit data sources with thousands of products and packs are included.
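
The product and pack matching described above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: the function `harmonize`, the mapping table `pack_map`, the common key format, and every identifier in the sample data are hypothetical. The idea is that each source keeps its own pack identifier and a maintained mapping table translates both to one shared key.

```python
def harmonize(audit_rows, ex_factory_rows, pack_map):
    """Combine two sources on a shared pack key via a mapping table.

    pack_map maps (source name, source-specific pack id) to a common key.
    Returns (merged, unmatched) so unmapped packs are reported, not silently dropped.
    """
    merged = {}
    unmatched = []
    sources = (
        ("audit", audit_rows, "audit_units"),
        ("ex_factory", ex_factory_rows, "ex_factory_units"),
    )
    for source, rows, value_field in sources:
        for row in rows:
            key = pack_map.get((source, row["pack_id"]))
            if key is None:
                unmatched.append((source, row["pack_id"]))
                continue
            record = merged.setdefault(key, {"audit_units": 0, "ex_factory_units": 0})
            record[value_field] += row["units"]
    return merged, unmatched

# Hypothetical identifiers and volumes, for illustration only.
pack_map = {
    ("audit", "A-001"): "PRODX_10MG_30",
    ("ex_factory", "SKU778"): "PRODX_10MG_30",
}
audit = [{"pack_id": "A-001", "units": 100}, {"pack_id": "A-999", "units": 5}]
ex_factory = [{"pack_id": "SKU778", "units": 110}]

merged, unmatched = harmonize(audit, ex_factory, pack_map)
print(merged)
print(unmatched)
```

Returning the unmatched packs separately is a deliberate choice: with thousands of products and packs, the list of records that failed to map is where most of the harmonization effort ends up.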

When the complexity of data harmonization becomes overwhelming, organizations will sometimes reduce the scope of the project. That helps keep things manageable, but it can mean forgoing valuable insights that could be gained from a more expansive analysis.

True and comprehensive data harmonization requires an experienced business partner who really understands the data sources, and can recommend the level of harmonization that is right for the therapy area. You may not need to harmonize all products and all packs, but if you do, the devil really is in the detail. Converting syrups or long duration injectables into days of therapy requires careful thought. Also, is it necessary to harmonize all packs with all sales in value or volume even if these values are zero?
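
As a rough illustration of why days-of-therapy conversion needs careful thought, the sketch below shows one simplified way the arithmetic could look. The formulas are assumptions made for this example (pack volume divided by daily dose for a syrup; number of injections multiplied by dosing interval for a long-duration injectable), and the function names and figures are hypothetical. Real conversions depend on the therapy area and agreed business rules.

```python
def syrup_days_of_therapy(bottle_ml, daily_dose_ml):
    """Assumed rule: days covered = bottle volume / daily dose volume."""
    if daily_dose_ml <= 0:
        raise ValueError("daily_dose_ml must be positive")
    return bottle_ml / daily_dose_ml

def injectable_days_of_therapy(injections_per_pack, dosing_interval_days):
    """Assumed rule: each injection covers one full dosing interval."""
    if injections_per_pack < 0 or dosing_interval_days <= 0:
        raise ValueError("invalid pack size or dosing interval")
    return injections_per_pack * dosing_interval_days

# Hypothetical packs: a 150 ml bottle dosed at 10 ml per day,
# and a single injection given every 28 days.
syrup_days = syrup_days_of_therapy(150, 10)
injectable_days = injectable_days_of_therapy(1, 28)
print(syrup_days, injectable_days)
```

Even in this toy version, one unit of each pack represents a very different amount of treatment, which is why comparing raw unit volumes across formulations can mislead.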

3. Data Curation

Once you’ve harmonized your data sources you will have unified data sets, but over time data sources become out of date. They need to be maintained and enhanced to keep them relevant to the business – that’s the ‘data life cycle’.

There is no point maintaining a data source that is out of date or irrelevant to the business. If a data source isn’t updated with the latest data, or you suspect it’s inaccurate, the marketing user will quickly stop using it and find other sources themselves.

Data curation is therefore an essential element of data preparation: it’s the maintenance element.

There’s another aspect of this. Certainly your data sets should be updated as quickly as practical when the source data is updated, while maintaining data quality, but they are also a resource for reuse and further discovery. Are there trends in the data that should alert us to a particular change in the dynamics of the market? Whether maintained in house or outsourced, the platform should have consistent care and attention to lengthen its life cycle.

Similarly, you can extend its life cycle if it accurately reflects market changes. These could be new market definitions or business rules, or they could be a change of data source or an enhanced data source.

Data curation is a key component whose importance should not be underestimated. Key to this is traceability and transparency.

4. Data Traceability

Being able to trace where the data came from to ensure it’s reliable is critical for understanding and delivering credible results – and Excel spreadsheets mean trouble.

Data that is captured, presented, and distributed in Excel makes traceability painful at best and sometimes impossible. Most of the work happens in the background, creating a ‘black box’ syndrome for analysts who have no means of seeing into the logic of the data. Also, once an operation is performed, the original data is gone, with no history to recapture it. If spreadsheets are linked together, the process is even more convoluted and opaque.

But traceability and transparency are key if you’re to deliver credible results. Sometimes a number in a dashboard might not look right, in which case you need a simple, accurate way to go back and review the original data sources.

This is something an outsourced data management and analytics solution is designed to do, especially one built for your specific business problem: you’ll have full traceability and visibility built into the data management processes.

Quality control routines ensure that the data can be traced back to the source to help resolve discrepancies and validate the results. If something fails, for example at step 16 of the data management, the team can determine the issue and resolve the problem. What’s more, all business rules and market definitions will be documented and shared – making group collaboration more likely and making it easy to gain a powerful and informed consensus for change. These are important side benefits.
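
The idea of tracing a failure to a specific step can be sketched in a few lines of Python. This is a minimal illustration, not a description of any vendor's platform: `run_pipeline`, the step names, and the sample rows are hypothetical. Each step is named, each produces a new data set rather than overwriting its input, and a trace records what happened at every stage.

```python
def run_pipeline(rows, steps):
    """Apply named steps in order and record a trace entry for each one.

    On failure, stops and returns the data as it stood before the failing step.
    """
    trace = []
    current = rows
    for number, (name, func) in enumerate(steps, start=1):
        try:
            result = func(current)
        except Exception as exc:
            trace.append({"step": number, "name": name, "status": f"failed: {exc}"})
            return current, trace
        trace.append({
            "step": number,
            "name": name,
            "status": "ok",
            "rows_in": len(current),
            "rows_out": len(result),
        })
        current = result
    return current, trace

def keep_rows_with_pack(rows):
    # Returns a new list; the input list is left untouched.
    return [r for r in rows if r.get("pack")]

def convert_units(rows):
    # Raises ValueError if a units value is not a whole number.
    return [dict(r, units=int(r["units"])) for r in rows]

# Hypothetical input with one bad value to show where the trace points.
raw = [{"pack": "P1", "units": "10"}, {"pack": "P2", "units": "n/a"}]
steps = [
    ("keep rows with a pack id", keep_rows_with_pack),
    ("convert units to integers", convert_units),
]

result, trace = run_pipeline(raw, steps)
for entry in trace:
    print(entry)
```

Because the trace names the failing step and the earlier data is still available, the team can see both where the problem occurred and what the data looked like at that point.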

This transparency is particularly important when data needs to be modified or new data added to enrich the analyses. Making changes to a hidden 10-step process built in Excel is extremely difficult and increases the probability of errors. You can find yourself depending on one person’s labyrinthine knowledge, and that’s a dangerous dependency. When they leave the business, they’ll take a great deal with them. It is far better to make things transparent and accessible to anyone.

Conclusion: Visualization is Powerful but Requires Trustworthy Data

The visualization of marketing information for pharmaceutical clients certainly provides rapid insights that enable faster decision-making. But to reap the full rewards of those decisions, your data needs to be properly prepared.

Manually manipulating data through Excel spreadsheets or pulling data from different solutions throughout the organization is cumbersome, time consuming, and prone to error.

The answer is a data management platform designed specifically for the purpose of aligning and regulating the data, with data curation routines ensuring that the data life cycle is maximized. This will enable you to unleash the power of your analysts, who are now free to focus on uncovering valuable business insights.


  • Review your data management procedures.
  • Establish whether spreadsheets and other platforms are being used to manage data.
  • Identify whether quick insights are more important than thorough, accurate insights.
  • Identify a data management platform designed for proper data management.
  • Approach a data management partner to provide efficiencies your internal teams can’t match.

Themis brings your data to life.

Themis gives pharmaceutical companies a clearer picture of their markets, empowering them to make better, faster decisions. As experts at combining data from disparate sources and transforming it into usable, useful information, we build powerful connections between people and data.


[1] Gartner Business Intelligence, Analytics and Information Management Summit, February 22-23 2016, Sydney.

[2] 88% of Spreadsheets Have Errors, by Jeremy Oshan, MarketWatch, April 20, 2013.


Marion Wyncoll


As Co-founder and Business Development Director of Themis Analytics, Marion is passionate about equipping clients with the right information to make the best decisions. Marion started her career in pharmaceutical market research with SmithKline (now GSK) before moving to Isis (now Ipsos), where as Research Director she was responsible for multi-country primary market research projects.
