Read a draft of the full paper here.
Abstract: Information emerging from life science research is increasingly recorded electronically and stored in databases. The sheer volume of data collected by researchers, the speed at which it is generated, the range of its sources, its quality and accuracy, and the need to assess its usefulness result in complex, multidimensional, diverse datasets, often annotated in specific terminologies and coding systems by researchers in distinct disciplines. The resulting data silos undermine interoperability, meta-data analysis, reproducibility, pattern identification, and discovery across disciplines. The value of cross-discipline meta-data analysis is, however, evident in the present pandemic. Prostate cancer researchers have leveraged existing research on enzymes crucial to host cell penetration by SARS-CoV-2 to explain differences in disease severity between the sexes. Immunologists have combined insights from research on SARS-CoV-1 and MERS-CoV with chemical compound profile data to identify drug and vaccine options for SARS-CoV-2. Pediatric researchers, observing that children have fewer nasal epithelia susceptible to SARS-CoV-2 infection than adults do, have suggested that this difference partially explains symptom disparities between the two groups. Researchers across the life sciences are recognizing the pressing need for coordinated, data-driven efforts during the current crisis.
Shared, interoperable, logically well-defined, controlled vocabularies representing common entities and relations across life science disciplines facilitate data-driven insights across those disciplines. The present need for rapid analysis of evolving datasets representing coronavirus research motivates, moreover, the development of virus-specific, coronavirus-specific, and SARS-CoV-2-specific vocabularies. To these ends, we have developed the Virus Infectious Disease Ontology (VIDO; https://bioportal.bioontology.org/ontologies/VIDO) and the COVID-19 Infectious Disease Ontology (IDO-COVID-19; https://bioportal.bioontology.org/ontologies/IDO-COVID-19). Each is a structured vocabulary with textual definitions for terms and relations, as well as logical axioms expressed in the OWL 2 Web Ontology Language (https://www.w3.org/TR/owl2-overview/), a World Wide Web Consortium (https://www.w3.org/) language developed for the semantic web. The formal representations of these ontologies support automated consistency checking, querying over relevant datasets, and interoperability with existing data on the semantic web. VIDO is an extension of the widely used Infectious Disease Ontology Core (IDO Core; https://bioportal.bioontology.org/ontologies/IDO), an ontology composed of terminological content common to all investigations of infectious disease. VIDO refines IDO Core to the specific domain of infectious diseases caused by viruses. As such, VIDO is composed of terminological content common to investigations of viral diseases, including virus classification, epidemiology, replication, vaccinology, and rational viral drug design. VIDO provides a carefully curated foundation for ontologies representing specific viral infectious diseases, such as IDO-COVID-19, which extends VIDO to the specific disease COVID-19 and its causative virus, SARS-CoV-2.
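To make concrete how a shared class hierarchy supports querying across datasets annotated at different levels of granularity, the following is a minimal sketch in plain Python. It is not drawn from VIDO or IDO-COVID-19 themselves: the term labels, the `SUBCLASS_OF` table, the sample `RECORDS`, and the helper functions are all hypothetical stand-ins, and a real application would load the OWL files and use an OWL reasoner instead of this hand-written lookup.

```python
# Toy "is-a" hierarchy. The labels are illustrative placeholders only;
# they are NOT actual term labels or IRIs from IDO Core, VIDO, or IDO-COVID-19.
SUBCLASS_OF = {
    "virus": "pathogen",
    "coronavirus": "virus",
    "SARS-CoV-2": "coronavirus",
    "SARS-CoV-1": "coronavirus",
    "MERS-CoV": "coronavirus",
}

# Hypothetical records annotated by different groups with different terms.
RECORDS = [
    {"id": "r1", "annotated_with": "SARS-CoV-2"},
    {"id": "r2", "annotated_with": "MERS-CoV"},
    {"id": "r3", "annotated_with": "virus"},
    {"id": "r4", "annotated_with": "bacterium"},
]


def ancestors(term):
    """Return every superclass of `term`, nearest first."""
    found = []
    while term in SUBCLASS_OF:
        term = SUBCLASS_OF[term]
        if term in found:
            # A cycle in the hierarchy is a simple kind of inconsistency.
            raise ValueError(f"cycle detected at {term!r}")
        found.append(term)
    return found


def is_a(term, cls):
    """True if `term` is `cls` or one of its (transitive) subclasses."""
    return term == cls or cls in ancestors(term)


def query(cls):
    """Return ids of records annotated with `cls` or any subclass of it."""
    return [r["id"] for r in RECORDS if is_a(r["annotated_with"], cls)]


if __name__ == "__main__":
    print(query("coronavirus"))
    print(query("virus"))
```

In this sketch, a query for the general class also retrieves records annotated with more specific terms, which is the kind of cross-dataset retrieval that a common, logically structured vocabulary is meant to enable.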