TaxaMiner: An Experimental Framework for Automated Taxonomy Bootstrapping

Ontologies are a central component of the Semantic Web (SW) infrastructure. The design and construction of domain ontologies and taxonomies is a human-intensive process that requires substantial cost and time. For the SW to scale and become feasible, approaches that reduce human effort and resource commitments need to be investigated urgently. To this end, we present a framework for automated taxonomy construction from a large corpus of documents, as a first step towards large-scale, automated ontology construction. Our approach involves: (a) generation of a document cluster hierarchy; (b) extraction of a topic hierarchy from this cluster hierarchy; and (c) assignment of labels to nodes in the topic hierarchy. We draw upon a suite of clustering and NLP techniques and identify parameters that form the basis of an experimentation framework. We also propose metrics to measure the quality of the resulting topic hierarchy and evaluate the impact of various parameters on these quality metrics. The MEDLINE® database is used as the document corpus and the MeSH thesaurus as the gold standard. Insights from these experiments are presented and discussed.
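As an illustration only, the three steps (a) to (c) could be sketched as below. This is a minimal sketch and not the framework's actual implementation: the abstract does not name the clustering algorithm, the pruning rule, or the labeling method, so the choices here (greedy agglomerative clustering over raw term counts with cosine similarity, collapsing a cluster into a single topic when its merge similarity reaches a hypothetical `threshold`, and labeling by the most frequent terms) are all assumptions, as are the function names and the toy documents.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_cluster_hierarchy(docs):
    """Step (a), assumed variant: greedy bottom-up clustering.

    Repeatedly merges the two most similar clusters until one root remains.
    Each node records the similarity at which its children were merged.
    """
    nodes = [
        {"vec": Counter(d.lower().split()), "docs": [i], "children": [], "sim": 1.0}
        for i, d in enumerate(docs)
    ]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                s = cosine(nodes[i]["vec"], nodes[j]["vec"])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        left, right = nodes[i], nodes[j]
        merged = {
            "vec": left["vec"] + right["vec"],
            "docs": sorted(left["docs"] + right["docs"]),
            "children": [left, right],
            "sim": s,
        }
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]


def top_terms(vec, k=2):
    """Step (c), assumed variant: label = k most frequent terms (ties alphabetical)."""
    ranked = sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def extract_topic_hierarchy(node, threshold=0.5):
    """Step (b), assumed variant: a cluster whose members were merged at
    similarity >= threshold is treated as one topic and not split further."""
    is_topic_leaf = not node["children"] or node["sim"] >= threshold
    children = [] if is_topic_leaf else [
        extract_topic_hierarchy(c, threshold) for c in node["children"]
    ]
    return {"label": top_terms(node["vec"]), "docs": node["docs"], "children": children}


# Toy corpus (hypothetical; stands in for a real document collection).
docs = [
    "heart attack cardiac arrest",
    "cardiac heart failure",
    "lung cancer tumor",
    "tumor cancer therapy",
]
topics = extract_topic_hierarchy(build_cluster_hierarchy(docs), threshold=0.5)
for child in topics["children"]:
    print(child["label"], child["docs"])
```

In this sketch, `threshold` plays the role of one experimental parameter: raising it yields a deeper, finer-grained topic hierarchy, while lowering it collapses more clusters into single topics.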

