FAIR Bioassay Metadata from the Data FAIRy
The Pistoia Alliance’s Data FAIRy project is a great example of the power of human-in-the-loop machine learning in drug discovery. By applying NLP to bioassay protocols (via CDD’s BioHarmony: Annotator), incorporating standardized ontologies (e.g., BAO), and including a final curation step by domain experts, the Data FAIRy project team showed how standardized annotation of assay metadata can be scaled.
Standardized assay protocols organized in machine-readable formats have the potential to unlock enormous value in drug discovery research. Genetics, genomics, proteomics, metabolomics, high-content, virtual, and high-throughput bioassay data contain valuable insights for identifying novel targets, uncovering disease mechanisms, and discovering new medicines.
These data are often hidden and unusable, siloed in different locations and formats. The problem is particularly acute for bioassay data, because interpreting them depends on metadata that describe the methods, the cell types used, the assay conditions, and the endpoints measured. While many protocols are accessible, they are often incomplete and are not standardized, organized, or available in machine-readable formats.
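To make "machine-readable" concrete, here is a minimal Python sketch of what a structured assay metadata record could look like. It is purely illustrative and is not the Data FAIRy information model: the class names, the field names, and the `EXAMPLE:` term IDs are all hypothetical placeholders, not real BAO identifiers.

```python
from dataclasses import dataclass, field, asdict
import json
from typing import Optional


@dataclass
class Annotation:
    """A free-text label paired with an ontology term ID (placeholder IDs here)."""
    label: str
    term_id: Optional[str] = None  # None means "not yet mapped to an ontology term"


@dataclass
class AssayMetadata:
    """Hypothetical record covering method, cell type, conditions, and endpoint."""
    method: Optional[Annotation] = None
    cell_type: Optional[Annotation] = None
    conditions: list[Annotation] = field(default_factory=list)
    endpoint: Optional[Annotation] = None

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are empty or lack an ontology term."""
        missing = []
        for name in ("method", "cell_type", "endpoint"):
            value = getattr(self, name)
            if value is None or value.term_id is None:
                missing.append(name)
        if not self.conditions or any(c.term_id is None for c in self.conditions):
            missing.append("conditions")
        return missing


# An incomplete protocol: the cell type has no ontology term and the endpoint is absent.
record = AssayMetadata(
    method=Annotation("luminescence", "EXAMPLE:0001"),
    cell_type=Annotation("HEK293"),
    conditions=[Annotation("48 h incubation", "EXAMPLE:0002")],
)

print(json.dumps(asdict(record), indent=2))  # machine-readable serialization
print(record.missing_fields())               # fields a curator would still need to fill
```

A record like this can be searched, compared, and checked for completeness automatically, which is not possible when the same details live only in free-text protocol documents.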
In a recent Bio-IT World commentary, “Bioassays Have An Integration Problem: Collaboration Will Be Key To Making Them FAIR,” Dana Vanderwall and Vladimir Makarov described the problem and the work of the Data FAIRy (or BioAssay FAIR Annotation Project) team to address it.
In the pilot phase, the team developed a process that uses NLP software (e.g., CDD’s BioHarmony: Annotator) to annotate 500 assay protocols. The project team is now scaling up the process 10- to 100-fold and developing a standard information model for assay protocol metadata. Once finalized, this consensus model will be an open, publicly available standard.
Find out more about this project here.