04/28/2008: We released a small application for navigating the corpus through the set of typed dependencies computed for this work: Semantic Dependencies Browser. Thanks to Wes Workman for his contributions to the implementation.
Abstract:
Richly populated ontologies with entities, their syntactic/semantic variants and a variety of relationships connecting these entities are critical to the successful use of Semantic Web technology in a variety of domains. In this paper we investigate unsupervised population of a biomedical ontology via information extraction from biomedical literature. Entities in biomedical text seldom occur in their simple canonical forms. Complex sentential variants of entities are more prevalent in text and are connected by relationships in sentences. We therefore focus on identifying complex entities rather than mentions of simple entities. We present a method based on simple rules over grammatical dependency structures for unsupervised segmentation of sentences into complex entities and relationships. We complement the rule-based approach with a statistical component that prunes structures with low information content, thereby reducing false positives in the prediction of complex entities and relationships. We use a superset of the BioInfer corpus created via PubMed queries using the known entities in BioInfer. The extraction is evaluated with respect to the UMLS Semantic Network by analyzing the conformance of the extracted triples with the corresponding UMLS relationship type definitions.
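To make the rule-based step concrete, here is a minimal sketch of segmenting a sentence into a (subject, relationship, object) triple from its dependency edges. It is an illustration under stated assumptions, not the rule set used in the paper: the single rule (a verb with both an `nsubj` and a `dobj` dependent), the function names `subtree` and `extract_triples`, and the hand-written edges are all hypothetical. The statistical pruning step is not shown.

```python
def subtree(head, children):
    """Return the sorted token indices of the dependency subtree rooted at head."""
    nodes = [head]
    for child, _label in children.get(head, []):
        nodes.extend(subtree(child, children))
    return sorted(nodes)


def extract_triples(tokens, edges):
    """Apply one assumed rule: verb + nsubj subtree + dobj subtree -> triple.

    tokens: list of words in sentence order.
    edges:  list of (head_index, dependent_index, label) typed dependencies.
    """
    children = {}
    for head, dep, label in edges:
        children.setdefault(head, []).append((dep, label))

    triples = []
    for head, deps in children.items():
        by_label = {label: dep for dep, label in deps}
        if "nsubj" in by_label and "dobj" in by_label:
            # The full subtree under each argument forms the compound entity.
            subj = " ".join(tokens[i] for i in subtree(by_label["nsubj"], children))
            obj = " ".join(tokens[i] for i in subtree(by_label["dobj"], children))
            triples.append((subj, tokens[head], obj))
    return triples


# Hand-written parse of one sample sentence (assumed, not parser output).
tokens = ["alpha-catenin", "inhibits", "beta-catenin", "signaling"]
edges = [(1, 0, "nsubj"), (1, 3, "dobj"), (3, 2, "nn")]
print(extract_triples(tokens, edges))
```

Taking the whole subtree under the subject and object, rather than only the head noun, is what yields compound entities such as "beta-catenin signaling" instead of the bare mention "signaling".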
Samples
Compound Entity Subject | UMLS Relationship | Compound Entity Object
The cardiac myosin heavy chain Arg-403-->Gln mutation | causes | hypertrophic cardiomyopathy
A pre-treatment of cells with SGE from partially fed ticks in amounts salivary glands | increased | the level of both viral nucleocapsid (N) protein phosphoprotein (P) in a dose-dependent manner
alpha-catenin | inhibits | beta-catenin signaling
MgCl2 | inhibits | these effects of profilin, most likely
Moreover, addition of profilin to steady-state actin filaments | causes | slow depolymerization
11-22 microM) into infected PtK2 cells | causes | a marked slowing of actin tail elongation and bacterial migration
the cytoplasmic domain of E-cadherin | binds | either beta-catenin or plakoglobin
a constituent | binds | RBC alpha-spectrin antibody plus the presence of significant quantities of actin
In addition to instantiating UMLS relationships, our extraction mechanism also finds relationships that are not in UMLS but are nonetheless relevant.
Compound Entity Subject | NON-UMLS Relationship | Compound Entity Object
The comparison of E-CD proteins synthesized cell lines | revealed | no structural or functional differences
These triples clearly show the compound entities that are discovered. Going one step beyond this, we use corpus statistics to predict token subsequences that form sub-entities within these compound entities.
One example is "cardiac myosin heavy chain", which is a sub-entity of the subject of the first triple listed above, i.e. "The cardiac myosin heavy chain Arg-403-->Gln mutation".
More examples of these can be found here
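As a rough illustration of how corpus statistics can suggest sub-entities, the sketch below counts contiguous token n-grams over a set of sentences and keeps the longest n-grams of a compound entity that recur in the corpus. This is an assumption made for illustration: the page does not say which statistic is used, and the names `count_ngrams` and `sub_entities`, the `min_count` threshold, and the toy corpus are all hypothetical.

```python
from collections import Counter


def ngrams(tokens, max_len):
    """Yield every contiguous n-gram of length 2..max_len as a tuple."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def count_ngrams(sentences, max_len=4):
    """Count, for each n-gram, the number of sentences it occurs in."""
    counts = Counter()
    for sentence in sentences:
        counts.update(set(ngrams(sentence.lower().split(), max_len)))
    return counts


def is_inside(short, long_):
    """True if `short` is a proper contiguous subsequence of `long_`."""
    if len(short) >= len(long_):
        return False
    return any(long_[i:i + len(short)] == short
               for i in range(len(long_) - len(short) + 1))


def sub_entities(compound, counts, min_count=2, max_len=4):
    """Return the maximal n-grams of `compound` seen in >= min_count sentences."""
    tokens = compound.lower().split()
    frequent = [g for g in ngrams(tokens, max_len) if counts[g] >= min_count]
    # Drop n-grams that are contained in a longer frequent n-gram.
    maximal = [g for g in frequent
               if not any(is_inside(g, other) for other in frequent)]
    return [" ".join(g) for g in maximal]


# Toy corpus (invented for this example, not taken from the real data).
corpus = [
    "cardiac myosin heavy chain is expressed in the heart",
    "mutations in cardiac myosin heavy chain were reported",
    "the mutation causes hypertrophic cardiomyopathy",
]
counts = count_ngrams(corpus)
print(sub_entities("The cardiac myosin heavy chain Arg-403-->Gln mutation", counts))
```

A raw frequency threshold is the simplest possible choice; an association measure such as pointwise mutual information could be substituted in `sub_entities` without changing the overall structure.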
Resources
List of sentences on which information extraction was performed. These sentences were collected from the BioInfer corpus.
List of constituent entities contained within the extracted complex entities. These were predicted from the compound entities using statistics collected over 850,000 sentences.