Machine Learning and Natural Language Processing Lab

Scientists in many fields now collect massive, high-dimensional data on complex processes. The key research problems in these fields are increasingly those of coping with (and indeed benefiting from) scale. The Machine Learning and Natural Language Processing Lab aims to support this developing mode of scientific research by addressing the statistical and computational challenges of building statistical models that make optimal interpretations of data from noisy, incomplete, and conflicting evidence. In particular, we investigate techniques for learning accurate models from data, performing efficient inference in complex models, and solving the difficult optimization and search problems that arise. The goal is to advance the state of the art in computer interpretation (natural language processing and computer perception), computer reasoning and decision making (automated reasoning and autonomous systems), and intelligent data analysis (data mining and bioinformatics), including the discovery of new patterns in large databases of medical, financial, or consumer-preference data.

Research projects

  • Large-scale distributed syntactic, semantic, and lexical language models
    We aim to build large-scale distributed syntactic, semantic, and lexical language models, trained on a supercomputer with corpora of up to web scale, to substantially improve the performance of machine translation and speech recognition systems. The work is conducted under the directed Markov random field paradigm, which integrates both topics and syntax to form complex distributions for natural language. It uses hierarchical Pitman-Yor processes to model the long-tail properties of natural language. By exploiting the particular structure of the model, the seemingly complex statistical estimation and inference algorithms are decomposed and performed in a distributed environment. Moreover, this approach may resolve a long-standing open problem: smoothing, in a principled manner and in the Kneser-Ney sense, the fractional counts that arise from latent variables. We demonstrate how to integrate these complex language models into the one-pass decoders of machine translation systems and the lattice-rescoring decoder of a speech recognition system.
  • Semi-supervised structured prediction
  • Direct loss minimization for classification and ranking problems
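
The first project above mentions Pitman-Yor processes as a way to model the long-tail properties of natural language. As a rough illustration only, the sketch below implements the seating rule of a single-level Pitman-Yor Chinese restaurant process. It is not the lab's hierarchical model or its distributed implementation; the function names, parameter values, and the restriction to a positive concentration parameter are assumptions made for this example.

```python
import random


def pitman_yor_seating_probs(table_counts, discount, concentration):
    """Seating probabilities for the next customer in a Pitman-Yor
    Chinese restaurant process (hypothetical helper for illustration).

    Returns (probabilities of joining each existing table,
             probability of opening a new table).
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= 0.0:
        # Simplifying assumption for this sketch: strictly positive.
        raise ValueError("concentration must be positive in this sketch")
    n = sum(table_counts)          # customers seated so far
    k = len(table_counts)          # tables opened so far
    denom = concentration + n
    existing = [(c - discount) / denom for c in table_counts]
    new_table = (concentration + discount * k) / denom
    return existing, new_table


def sample_table_counts(num_customers, discount, concentration, seed=0):
    """Seat customers one at a time and return the final table sizes."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_customers):
        existing, new_table = pitman_yor_seating_probs(
            counts, discount, concentration
        )
        weights = existing + [new_table]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(counts):
            counts.append(1)       # open a new table
        else:
            counts[choice] += 1    # join an existing table
    return counts


if __name__ == "__main__":
    sizes = sorted(sample_table_counts(1000, 0.5, 1.0), reverse=True)
    print("tables:", len(sizes), "largest:", sizes[:5])
```

The discount is subtracted from every occupied table and that mass is reassigned to new tables. This is what produces many small tables alongside a few large ones, and it is closely related to the absolute discounting used in Kneser-Ney smoothing.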

Faculty: Shaojun Wang

Ph.D. students: Ming Tan, Tian Xia, Shaodan Zhai, Raymond Kulhanek

M.S. students: Lily Guo

Visiting scholars: Professor Baoguo Wei, Northwestern Polytechnical University, September 2011 - August 2012

Group wiki page: http://130.108.28.50/

© 2012 Knoesis | 377 Joshi Research Center, 3640 Colonel Glenn Highway, Dayton, OH 45435 (937 - 775 - 5217)