A Novel Approach for Classifying Gene Expression Data using Topic Modeling

Title: A Novel Approach for Classifying Gene Expression Data using Topic Modeling
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Soon Jye Kho; Yalamanchili, H. Bindu; Raymer, M. L.; Amit P. Sheth
Conference Name: 8th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
Date Published: 08/2017
Publisher: ACM
Conference Location: Boston, MA
Keywords: Cancer, Classification, Clustering, Gene Expression, Latent Dirichlet Allocation, Machine Learning, Topic Modeling
Abstract

Understanding the role of differential gene expression in cancer etiology and cellular processes is a complex problem that continues to pose a challenge due to the sheer number of genes and inter-related biological processes involved. In this paper, we employ an unsupervised topic model, Latent Dirichlet Allocation (LDA), to mitigate overfitting of high-dimensional gene expression data and to facilitate understanding of the associated pathways. LDA has recently been applied to clustering and exploring genomic data, but not to classification and prediction. Here, we propose to use LDA for clustering as well as for classification of cancer and healthy tissues, using lung cancer and breast cancer messenger RNA (mRNA) sequencing data. We describe our study in three phases: clustering, classification, and gene interpretation. First, LDA is used as a clustering algorithm to group the data in an unsupervised manner. Next, we develop a novel LDA-based classification approach to classify unknown samples based on similarity of co-expression patterns. An evaluation of this approach shows that LDA can achieve high accuracy compared to alternative approaches. Lastly, we present a functional analysis of the genes identified using a novel topic profile matrix formulation. This analysis identified several genes and pathways that could potentially be involved in differentiating tumor samples from normal samples. Overall, our results suggest that LDA is a promising approach for classification of tissue types based on gene expression data in cancer studies.
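
The general idea of treating samples as documents and genes as words can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a plain collapsed Gibbs sampler for LDA, a tiny made-up count matrix, and a nearest-centroid rule over per-sample topic proportions as a stand-in for "similarity of co-expression patterns". All names (`lda_gibbs`, `counts_to_tokens`, `nearest_label`) and all parameter values are hypothetical.

```python
import random

def lda_gibbs(docs, n_topics, n_words, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]            # topic counts per sample
    topic_word = [[0] * n_words for _ in range(n_topics)]  # gene counts per topic
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_d)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic proportions; each row sums to 1.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]

def counts_to_tokens(row):
    """Expand a row of integer expression counts into a list of gene ids."""
    return [gene for gene, count in enumerate(row) for _ in range(count)]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def nearest_label(vec, centroids):
    """Label whose centroid has the smallest squared Euclidean distance to vec."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(vec, centroids[label]))

# Made-up toy data: rows are samples, columns are six hypothetical genes.
counts = [
    [9, 8, 7, 0, 1, 0],  # labeled tumor
    [8, 9, 6, 1, 0, 0],  # labeled tumor
    [0, 1, 0, 9, 8, 7],  # labeled normal
    [1, 0, 0, 8, 7, 9],  # labeled normal
    [7, 9, 8, 0, 0, 1],  # unlabeled sample to classify
]
labels = ["tumor", "tumor", "normal", "normal", None]

docs = [counts_to_tokens(row) for row in counts]
theta = lda_gibbs(docs, n_topics=2, n_words=6)

centroids = {
    lab: centroid([t for t, l in zip(theta, labels) if l == lab])
    for lab in ("tumor", "normal")
}
predicted = nearest_label(theta[-1], centroids)
print("topic proportions of unlabeled sample:", theta[-1])
print("predicted label:", predicted)
```

In this sketch the model is fitted on labeled and unlabeled samples together, which keeps the example short; a real evaluation would hold out test samples and infer their topic proportions from a model trained only on the training set.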

Citation:

  • S. Kho, H. Yalamanchili, M. Raymer, A. Sheth. A Novel Approach for Classifying Gene Expression Data using Topic Modeling. 8th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. Boston, MA. August 20-23, 2017.

