Scientific Experimental Data

Scientific experiments generate data ranging from the 'raw' output of scientific instruments, such as Mass Spectrometers (MS) or Real Time Polymerase Chain Reaction (RT-PCR) machines, to processed and analyzed data, such as lists of identified peptides. In the post Human Genome Project (HGP) era, the biological sciences have adopted industrial-scale techniques that generate large amounts of data at a rapid pace. For example, MS-based analysis of one sample generates about 700-800 data files (each ~5MB), and a mass spectrometry group may conduct 4-5 analysis runs each day.
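To make the scale concrete, the figures above can be turned into a rough daily estimate. This is a back-of-the-envelope sketch that assumes each analysis run processes one sample and that 1 GB = 1000 MB; neither assumption is stated in the text.

```python
# Figures taken from the paragraph above.
MB_PER_FILE = 5
FILES_PER_SAMPLE = (700, 800)   # low and high estimates
RUNS_PER_DAY = (4, 5)           # low and high estimates


def daily_volume_gb(files, runs, mb_per_file=MB_PER_FILE):
    """Approximate data volume per day in GB.

    Assumption: one run analyzes one sample, and 1 GB = 1000 MB.
    """
    return files * mb_per_file * runs / 1000


low = daily_volume_gb(FILES_PER_SAMPLE[0], RUNS_PER_DAY[0])
high = daily_volume_gb(FILES_PER_SAMPLE[1], RUNS_PER_DAY[1])
print(f"{low:.0f}-{high:.0f} GB per day")
```

Under these assumptions a single group produces on the order of tens of gigabytes per day, spread over several thousand files.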

The primary problem researchers face in such a scenario is retrieving specific datasets according to a domain context. The domain context is derived from the thought process of a researcher trying to achieve a biological objective. It consists of named relationships and concept values that are associated with, and link, seemingly disparate experimental datasets.

A traditional experimental data repository adapts poorly to changes in the domain. The structure of the repository, and the retrieval techniques built on it, cannot change easily or rapidly enough to let researchers query and retrieve experimental datasets that satisfy a domain context. Hence, in collaboration with researchers at the Complex Carbohydrate Research Center (CCRC), University of Georgia, we have built an extensive infrastructure using Semantic Web technologies that enables researchers to query and retrieve experimental datasets according to a context defined for the glycoproteomics domain.

This infrastructure, named Integrated Semantic knowledge and Information System (ISiS), enables biologists to use a unified interface to access experimental datasets through the domain-based named relationships that link them. ISiS is built on two ontologies, ProPreO and FormA (described in detail in the section on Knowledge Representation and Management). We use automated semantic annotation of scientific experimental data, based on the two ontologies, to associate semantic provenance information with each experimental dataset generated in a high-throughput environment. We store the semantic provenance information in RDF format and, in response to biologists' queries, currently use SPARQL to query and retrieve the relevant experimental datasets.
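The idea of retrieving datasets through provenance relationships can be illustrated with a minimal sketch. It does not use RDF or SPARQL; plain Python tuples stand in for triples and a small pattern matcher stands in for a query engine, so the example runs without third-party libraries. All identifiers and predicate names (`ex:producedBy`, `ex:analyzedSample`, the run and sample names) are hypothetical and are not taken from ProPreO, FormA, or ISiS.

```python
# Hypothetical provenance triples (subject, predicate, object).
# Names are illustrative only, NOT terms from ProPreO or FormA.
TRIPLES = [
    ("data:file_001", "ex:producedBy", "run:ms_run_17"),
    ("data:file_002", "ex:producedBy", "run:ms_run_17"),
    ("data:file_003", "ex:producedBy", "run:ms_run_18"),
    ("run:ms_run_17", "ex:analyzedSample", "sample:glyco_A"),
    ("run:ms_run_18", "ex:analyzedSample", "sample:glyco_B"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def files_for_sample(triples, sample):
    """Find data files whose producing run analyzed the given sample."""
    runs = {s for s, _, _ in match(triples, p="ex:analyzedSample", o=sample)}
    return sorted(
        s for s, _, o in match(triples, p="ex:producedBy") if o in runs
    )


print(files_for_sample(TRIPLES, "sample:glyco_A"))
```

The query follows two named relationships in sequence (file to run, run to sample), which is the kind of context-driven retrieval the paragraph describes; a SPARQL query over the RDF store would express the same join declaratively.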

Structured Biological Data

In collaboration with the Lister Hill National Center for Biomedical Communications (U.S. National Library of Medicine, NIH), we are working on integrating biological data held in structured resources, such as relational databases, using Semantic Web representational formats. We converted the NCBI Entrez Gene (EG) data source into RDF, using named relationships to relate data entities in EG. This enabled us to capture the logical, domain-relevant connections between genes, the proteins encoded by these genes, the disease information associated with these genes, and their locations on the chromosomes.
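A minimal sketch of this kind of conversion is shown below: each row of a relational table becomes a set of triples, with each column mapped to a named relationship. The rows, column names, and predicate names (`eg:encodes`, `eg:located_on_chromosome`, and so on) are placeholders chosen for illustration; they are not the actual Entrez Gene schema or the relationships used in the EG RDF.

```python
# Hypothetical rows, as they might appear in a relational gene table.
# All values are placeholders, not real Entrez Gene records.
GENE_ROWS = [
    {"gene_id": "G1", "symbol": "GENE_A", "protein": "P1",
     "chromosome": "9", "disease": "disease_X"},
    {"gene_id": "G2", "symbol": "GENE_B", "protein": "P2",
     "chromosome": "19", "disease": None},
]

# Assumed mapping from column name to a named relationship.
COLUMN_TO_PREDICATE = {
    "symbol": "eg:has_symbol",
    "protein": "eg:encodes",
    "chromosome": "eg:located_on_chromosome",
    "disease": "eg:associated_with_disease",
}


def rows_to_triples(rows):
    """Turn each row into (subject, predicate, object) triples.

    The gene identifier becomes the subject; empty columns are skipped.
    """
    triples = []
    for row in rows:
        subject = "gene:" + row["gene_id"]
        for column, predicate in COLUMN_TO_PREDICATE.items():
            value = row.get(column)
            if value is not None:
                triples.append((subject, predicate, value))
    return triples


for triple in rows_to_triples(GENE_ROWS):
    print(triple)
```

Replacing anonymous column positions with named relationships is what makes the connections between genes, proteins, diseases, and chromosomes explicit and queryable.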

Next, we integrated the Gene Ontology (GO) structure, available in RDF format, with the EG RDF and were able to effectively answer research queries linking 'glycosyltransferase' to 'congenital muscular dystrophy'. Currently, we are working to integrate all gene-related NCBI data sources with EG and GO, using RDF as the common representational format.
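The sketch below illustrates how merging the two graphs makes such a query answerable: gene annotations are joined with an ontology hierarchy to find genes whose function falls under a given term and that are also associated with a given disease. Plain tuples again stand in for RDF triples. The gene identifiers, term identifiers, and predicate names are hypothetical, and the data is invented for illustration; it does not reproduce actual EG or GO content or the results of the original queries.

```python
# Hypothetical gene annotations (stand-in for the EG RDF).
EG_TRIPLES = [
    ("gene:G1", "eg:has_function", "go:term_child"),
    ("gene:G1", "eg:associated_with_disease", "congenital muscular dystrophy"),
    ("gene:G2", "eg:has_function", "go:term_other"),
    ("gene:G2", "eg:associated_with_disease", "congenital muscular dystrophy"),
    ("gene:G3", "eg:has_function", "go:glycosyltransferase"),
]

# Hypothetical ontology hierarchy (stand-in for the GO RDF).
GO_TRIPLES = [
    ("go:term_child", "go:is_a", "go:glycosyltransferase"),
    ("go:term_other", "go:is_a", "go:kinase"),
]


def descendants(triples, root):
    """Return root plus every term that reaches it through is_a links."""
    found = {root}
    frontier = [root]
    while frontier:
        parent = frontier.pop()
        for child, predicate, obj in triples:
            if predicate == "go:is_a" and obj == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def genes_linking(triples, go_root, disease):
    """Genes annotated under go_root that are also tied to the disease."""
    terms = descendants(triples, go_root)
    with_function = {
        s for s, p, o in triples if p == "eg:has_function" and o in terms
    }
    with_disease = {
        s for s, p, o in triples
        if p == "eg:associated_with_disease" and o == disease
    }
    return sorted(with_function & with_disease)


merged = EG_TRIPLES + GO_TRIPLES  # integration is a simple union of triples
print(genes_linking(merged, "go:glycosyltransferase",
                    "congenital muscular dystrophy"))
```

Neither source alone can answer the question: the disease association lives in the gene data, while the fact that a specific function is a kind of glycosyltransferase lives in the ontology hierarchy.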