The FAIR principles for data sets are gaining traction, especially in the pharmaceutical industry in Europe. FAIR stands for: Findable, Accessible, Interoperable and Reusable. In a world of exponential data growth and ever-increasing silo-ization, the FAIR principles are needed more than ever. In this article, we will first summarize the FAIR principles and describe the typical roadmap to FAIR. After that we will argue that using a Data-Centric approach is the best route to achieving FAIR principles. This material was recently presented at a FAIR workshop for a European pharmaceutical firm.
FAIR emerged as a response to the fragmentation of data within and among most large firms. Most of the publicly available FAIR materials tend to focus on awareness and assessment (how FAIR are you?) and leave the way-finding to individual companies. The difficulty of sharing and applying data to problems creates friction that slows innovation, and that friction is what FAIR is meant to address.
The official position of FAIR is presented on the GoFAIR website. There are fifteen principles that both define what is meant by FAIR and allow you to self-assess how far along you are on the journey. Here are the first-level summaries of each top-level principle; we will return to the detailed list later when we take up FAIR and Data-Centricity.
- Findable – Metadata and data should be easy to find for both humans and computers.
- Accessible – Once the user finds the required data, they need to know how the data can be accessed, possibly including authentication and authorization.
- Interoperable – The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
- Reusable – To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
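To make the first two principles concrete, here is what a dataset description might look like as RDF. This is a sketch, not a prescription: the `ex:` namespace and dataset names are hypothetical, while DCAT and Dublin Core Terms are real W3C/DCMI vocabularies commonly used for exactly this purpose.

```turtle
@prefix ex:   <https://example.com/data/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

ex:trial-0042  a  dcat:Dataset ;                  # globally unique, persistent IRI (Findable)
    dct:title        "Phase II trial results, compound X" ;
    dct:description  "Blood panel results for 340 subjects" ;
    dcat:keyword     "HbA1c", "glucose", "phase II" ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:accessURL <https://example.com/sparql> ;   # how to get at it (Accessible)
        dct:format     "application/sparql-results+json"
    ] .
```

Because the description is itself data (triples), the same query machinery that retrieves the data can retrieve the metadata.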
Why is This Hard?
These are worthy aims. We might pause to first ask: “why isn’t this already the norm?” It’s primarily because of the extreme balkanization of our data. The problem is more extreme in life sciences, partly driven by the huge volume of data coming off devices and being generated by experiments, and exacerbated by the move toward narrower and narrower sub-specializations. Every experiment, every clinical trial, every scientific instrument generates data conforming to a specific data model designed for that particular situation.
Are We at the Wrong End of the DIKA pyramid?
The “Data, Information, Knowledge, Action” pyramid (often the “Data, Information, Knowledge, Wisdom” pyramid) suggests a relationship between raw data and its meaning.
Most sources are a bit vague about the top of the pyramid, but I prefer “Action” at the top to be a bit more, well, actionable. There is pretty good agreement on the bottom two layers. Data is raw data, signals, signs, and is only meaningful if it is interpreted. Theoretically, data could be about any sort of phenomenon in the world; from a practical standpoint there isn’t much need to discuss it until we have digitized it. But having digitized it doesn’t mean we’ve understood it, and even if it is understood, that understanding isn’t necessarily shared.
Relative to data, information is telling us, at some level, what the data means. Schema is the classic example. The column heading on a table tells us that the data in the cells is “rainfall” or “gross domestic product” or “HbA1c.” This information is itself also data (in this case metadata), and the extent to which the schema means anything to the consumer depends on whether the data consumer knows (in some out-of-band way) what the column heading means. The value “9.0” in a column headed “HbA1c” means many things to someone who knows what HbA1c is. First, it means that this is the result of a blood test, performed on a patient. Second, it means that the patient in question has had an average blood glucose of roughly 212 mg/dL over the last 60–90 days. By implication, it means that this patient is having trouble managing blood sugar, has poorly controlled diabetes, and is at risk for complications common to diabetes, such as neuropathy, if left untreated. “9.0” means nothing to someone who doesn’t know what HbA1c means.
For this reason, we need to be cautious when assigning value to “information” as the value continues to be subjective. (That is, it might mean something to some data consumers and not to others.)
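The Semantic Web response to this out-of-band problem is to put the meaning in band. Instead of a bare “9.0” under a column heading, a triple-based representation carries the units and a pointer to the definition along with the value. In this hypothetical sketch the `ex:` identifiers are invented; LOINC code 4548-4 is the real code for HbA1c:

```turtle
@prefix ex:    <https://example.com/clinical/> .
@prefix loinc: <https://loinc.org/rdf/> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .

ex:obs-981  a  ex:LabResult ;
    ex:observedProperty  loinc:4548-4 ;        # HbA1c, defined once, globally
    ex:hasValue          "9.0"^^xsd:decimal ;
    ex:hasUnit           ex:percent ;
    ex:observedOn        ex:patient-113 .

loinc:4548-4  rdfs:label  "Hemoglobin A1c/Hemoglobin.total in Blood" .
```

Any consumer, human or machine, can follow `loinc:4548-4` to find out what the value means, rather than relying on out-of-band knowledge of a column heading.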
After much thinking about this topic, I’ve come to find it useful to think of knowledge in terms of models. The knowledge we have in our heads is a mental model of the world. A machine learning algorithm works by creating a model in the form of weights of simulated neurons in a black box. Some of the most useful knowledge models are complex feedback models such as those that predict pandemics based on contagion and lethality. And as semantics practitioners, our own favorite models are those that help us define new concepts in terms of existing ones, and that allow us to infer new information and data from existing information and data.
Why bring all this up? Because if you look at it this way, you see that the challenge is a numerical one. These are rough orders of magnitudes for most firms:
| Layer | Rough order of magnitude |
| --- | --- |
| Actions | Hundreds (of actionable insights) |
| Knowledge | Thousands (of concepts) |
| Information | Millions (of schema elements) |
| Data | Billions (of individual facts) |
Many FAIR-ification projects are run as data governance projects. This typically means lots of cataloging and dashboards. Initially, the dashboards keep track of how many buckets of the data tsunami have been filled. Eventually, people realize that merely cataloging the data sets is not really getting to the objectives of FAIR. Then come FAIR maturity indexes. A maturity index is meant to measure whether people are merely cataloging or actually getting to shared meaning.
This seems like a long and tortuous route to get to FAIR. Let’s look a bit deeper at the detailed FAIR principles and line them up with their Semantic/Data-Centric embodiments.
Detailed FAIR Principles
The four high-level principles have been further broken down with specific guidelines. The main point of this article is to highlight the relationship between FAIR and Data-Centric, which in turn is enabled by semantic technology. Below, we summarize those relationships. GoFAIR is the website promoting the FAIR principles:
| GoFAIR principle | Semantic / Data-Centric embodiment |
| --- | --- |
| F1: (Meta)data are assigned globally unique and persistent identifiers. | Metadata and data are identified by URI/IRIs. |
| F2: Data are described with rich metadata. | Data and metadata are expressed in the same syntax and standards (RDF) and are collocated and connected. Graph-based metadata can be arbitrarily rich. |
| F3: Metadata clearly and explicitly include the identifier of the data they describe. | Both data and metadata have URI/IRIs. (It seems more productive to have the data refer to the metadata that describes it, but either way works.) |
| F4: (Meta)data are registered or indexed in a searchable resource. | We recommend a triple store. |
| A1: (Meta)data are retrievable by their identifiers using a standardized communication protocol. | SPARQL is one such protocol. It is implemented over HTTP/HTTPS, but others can be supported and stored with the metadata. |
| A1.1: The protocol is open, free, and universally implementable. | RDF/OWL and SPARQL are W3C standards and have been implemented by dozens of major vendors and in dozens of major open-source projects. |
| A1.2: The protocol allows for an authentication and authorization procedure where necessary. | The SPARQL protocol allows for authentication and authorization. The metadata should also include references to any other authentication and authorization procedures. |
| A2: Metadata should be accessible even when the data is no longer available. | While metadata and data can be commingled in the same triple store, best practice suggests keeping the metadata in its own named graph. Metadata (TBox) and data (ABox) can each be accessed and processed without the other; however, the ABox gains meaning when combined with the TBox, and the TBox becomes instantiated when combined with the ABox. |
| I1: (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. | Data use RDF; metadata primarily use OWL. Both should be present in the searchable triple store. |
| I2: (Meta)data use vocabularies that follow the FAIR principles. | We use the term CBox (category box) for taxonomies (vocabularies) that have been designed to interoperate with an ontology (TBox). These CBoxes are expressed in RDF and OWL. |
| I3: (Meta)data include qualified references to other (meta)data. | The further detail on this item says that rather than generic links like “seeAlso” or “refersTo,” FAIR data should use more explicit references (predicates). We strongly recommend object properties that are as specific as possible, but no more specific than necessary. |
| R1: (Meta)data are richly described with a plurality of accurate and relevant attributes. | F2 focused mostly on keywords that make a data set findable; this principle focuses on data that tells you whether you want to use the data once you’ve found it. We advocate encoding this, too, as machine-readable metadata (triples). |
| R1.1: (Meta)data are released with a clear and accessible data usage license. | In the metadata, references to licenses can go in a property specific to that purpose, such as dct:license. In ABox data, we recommend attaching the license to the named graph containing all the triples it applies to. |
| R1.2: (Meta)data are associated with detailed provenance. | The example shown on the GoFAIR website is more about attribution, which is a good thing to do, but the more powerful move is to recognize that many, perhaps most, datasets are derived from other data sets. There are a small number of generic derivations (augmentation, deduplication, projection, etc.). Codifying and including this information can be hugely beneficial, for instance to draw data lineage flow charts and to identify alternate sources of data. |
| R1.3: (Meta)data meet domain-relevant community standards. | This is primarily about third-party archiving standards. |
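Several of the rows above (F1, A2, R1.1, R1.2) come together in practice as a metadata graph kept alongside, but separate from, the data it describes. Here is a sketch in TriG (Turtle extended with named graphs); all the `ex:` names are hypothetical, while `dct:license` and `prov:wasDerivedFrom` are real properties from Dublin Core Terms and the W3C PROV-O ontology:

```trig
@prefix ex:   <https://example.com/data/> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .

# Metadata lives in its own named graph (A2): it survives even if
# the data graph is archived or deleted.
ex:meta-0042 {
    ex:graph-0042  dct:license  <https://creativecommons.org/licenses/by/4.0/> ;  # R1.1
                   prov:wasDerivedFrom  ex:graph-0017 ;                           # R1.2
                   dct:creator  ex:assay-team .
}

# The data graph itself.
ex:graph-0042 {
    ex:subject-77  ex:hasHbA1c  "9.0" .
}
```

Attaching license and provenance to the named graph, rather than to individual triples, is what makes these statements apply to the whole data set at once.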
It is apparent that the authors of FAIR knew of the semantic web. They chose not to make the Semantic Web standards a prerequisite and opted to allow for local variation.
But it is clear that the Semantic Web and Data-Centric provide a very direct route to becoming FAIR. Everything contemplated in the FAIR principles (and much more) is already covered in the Semantic / Data-Centric approach.
We advocate taking a slight detour on the way to FAIR. At first, it looks like going to the right is a detour, until you realize the road you were on is washed out, and the apparent alternative snakes on forever. That’s why we say this detour is a shortcut.
The first step involves creating a high-level version of your enterprise data model, your core ontology. This should include many concepts beyond what you will immediately catalog in your FAIR effort, but it goes a long way toward future-proofing much of what you will do later.
The second very important step is to populate the model with some real data. This step has been skipped in most of the ontologies in life science. As a result, most life science ontologies cannot be instantiated. One of the things I like to do when I encounter a new ontology is think of a few concepts I expect to find in it and then search to see where they are, how well they are expressed, and so on. One of the more popular life science ontologies is SNOMED. It is touted as being “the most comprehensive and precise, multilingual health terminology in the world.” I figured I could find “liver” in there. It turns out there are at least 25 references to “liver” (maybe the UI stops at 25), but I couldn’t find “liver” itself. So, I decided to take one of the 25: “liver mass.” Being a gist ontologist and not a medical doctor, I assumed a liver mass would be a measurement of weight. Indeed, in gist the class for all measurements of weight is called gist:Mass.
Nothing in the SNOMED definition or its place in the hierarchy dissuaded me from thinking that “liver mass” was a weight. Later, I did a web search and found that it is a growth, often a tumor, and usually not cancerous. It is highly unlikely that you would ever have an instance of a liver mass. You might, but let’s work through this to see what sort of data instances you would have. Certainly, you would have an instance of a person (a Patient). By the way, SNOMED has many hundreds, maybe over a thousand, subtypes of Person, three of which are patients (Patient, InPatient, OutPatient). You would likely have an instance of the radiological image that identified the mass. You might have a region on the image as an instance. Probably, you would annotate the region as a “liver mass” along with dozens of other properties. But it is highly unlikely, even in a knowledge graph, that you would instantiate that liver mass, giving it a unique identifier and declaring it an instance of the class “Liver Mass.”
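To make the instantiation point concrete: in a knowledge graph you would more plausibly annotate the image region with a “liver mass” category than mint an instance of a Liver Mass class. A hypothetical Turtle sketch (all `ex:` names are invented for illustration; the SNOMED concept appears only as an annotation value):

```turtle
@prefix ex: <https://example.com/radiology/> .

ex:patient-113   a  ex:Patient .
ex:image-5501    a  ex:RadiologyImage ;
    ex:imageOf     ex:patient-113 .
ex:region-5501a  a  ex:ImageRegion ;
    ex:regionOf    ex:image-5501 ;
    ex:annotatedAs ex:snomed-liver-mass ;   # a category, not an instance
    ex:annotatedBy ex:radiologist-9 .
```

The things that get instantiated are the patient, the image, and the region; “liver mass” functions as a category applied to the region, which is the CBox role described earlier.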
I’m making a few points here, and it may seem like an unhelpful digression with a lot of detail. But pure generalizations are often not as convincing. For the most part, the life science “ontologies” are really taxonomies or controlled vocabularies. They are two or three orders of magnitude more complex than the kind of ontology a life science or healthcare delivery company needs. We think they have distracted people from what is necessary and possible.
So, the detour is: build a proper core enterprise ontology and populate a few corners of it with real data. This will prove to yourselves that it is a workable ontology and at the right level of detail.
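The detour can be surprisingly small. A minimal sketch of a core-ontology fragment populated with one real instance might look like the following Turtle; the class and property names are hypothetical (in the spirit of gist), while the OWL and RDFS terms are standard:

```turtle
@prefix ex:   <https://example.com/ont/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# TBox: a handful of core concepts
ex:Patient     a owl:Class ; rdfs:subClassOf ex:Person .
ex:LabResult   a owl:Class .
ex:observedOn  a owl:ObjectProperty ;
    rdfs:domain ex:LabResult ;
    rdfs:range  ex:Patient .

# ABox: one real data point, proving the model can be instantiated
ex:patient-113  a ex:Patient .
ex:obs-981      a ex:LabResult ;
    ex:observedOn ex:patient-113 .
```

The point of the ABox lines is exactly the test that most life science ontologies fail: if you cannot write even one plausible instance against a class, the class is probably at the wrong level of detail.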
Now, start cataloging your data sets and making them FAIR. Conform them to your core ontology as you go. This will be a forcing function. It will get you to the right level of detail, and it will ensure that you are not making up yet another vocabulary for the FAIR catalog. (We only say this because we’ve watched clients do it).
And I hope the description above about the sheer number of datasets, and what people do when they must catalog tens of thousands of things (spoiler alert: they just repeat the column name as the description) will convince you that a straight-ahead FAIR-ification project will get mired in detail and complexity.
Not only is the Data-Centric approach going to speed up your FAIR-ification, it will give you things that the FAIR approach does not even dream of. Despite its broader ambitions, the FAIR-only approach generally stalls out at merely building a catalog. The Data-Centric approach is about building a catalog within the context of your future Enterprise Knowledge Graph.
By conforming your catalog to your Core Enterprise Ontology, you know exactly where every data item would go in your knowledge graph. In the short term, you may not populate your Knowledge Graph with data from the datasets that you are cataloging, but over the long term you might do so.
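Once the catalog is conformed to the core ontology, the payoff is queryable. A single SPARQL query can ask which cataloged datasets carry data about a given core concept or any of its specializations (the `ex:` names are hypothetical; DCAT and Dublin Core Terms are real vocabularies):

```sparql
PREFIX ex:   <https://example.com/ont/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Find every cataloged dataset whose declared subject is, or
# specializes, the core concept ex:LabResult.
SELECT ?dataset ?title WHERE {
    ?dataset  a           dcat:Dataset ;
              dct:title   ?title ;
              dct:subject ?concept .
    ?concept  rdfs:subClassOf*  ex:LabResult .
}
```

A conventional catalog with free-text descriptions cannot answer this kind of question; a catalog conformed to an ontology can, because the property path follows the class hierarchy of the core model.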
The FAIR movement outlines a valid set of worthwhile aims and principles. But it is a bit light on prescribing exactly how to get there, and cataloging one’s datasets just results in more catalogs. We believe the North Star to aim for is having your datasets cataloged against your emerging Data-Centric Enterprise Knowledge Graph, which delivers on the Interoperability and Reusability goals. It will be an asset that not only continues to yield high-quality insights, but also serves as a platform to begin the process of legacy replacement.