ContactPerson: srikanth@cedar.buffalo.edu
Remote host: meissa.cedar.buffalo.edu
### Begin Citation ### Do not delete this line ###
%R 2004-12
%U /tmp/ubphd-submitted.ps
%A Srikanth, Munirathnam
%T EXPLOITING QUERY FEATURES IN LANGUAGE MODELING APPROACH\\FOR INFORMATION RETRIEVAL
%D August 26, 2004
%I Department of Computer Science and Engineering, SUNY Buffalo
%K information retrieval, statistical language modeling, biterms, concept language model, maximum entropy language models
%X Statistical Language Modeling has recently been applied to the problem of Information Retrieval~(IR), and appreciable performance improvements have been achieved over traditional vector space and probabilistic retrieval models. Most experiments demonstrating the language modeling approach to text retrieval have been based on smoothed unigram language models that exploit only term occurrence statistics in probability estimation. Experiments with additional features such as bigrams have met with limited success. However, language models incorporating $n$-gram, word-trigger, topic-of-discourse, syntactic, and semantic features have shown significant improvements in speech recognition. The main thrust of this dissertation is to identify the need to {\em design} language models for IR that satisfy its specific modeling requirements, and to demonstrate this by designing language models that (1) incorporate IR-specific features~(biterm language models), (2) correspond to better document and query representations~(concept language models), and (3) combine evidence from different information sources~(language features) toward modeling the relevance of a document to a given query~(maximum entropy language models for IR).

Prior research on incorporating additional features in language models for information retrieval has adopted models derived for speech recognition. However, speech recognition and information retrieval have different language modeling requirements. For example, in speech recognition the utterances {\em information retrieval} and {\em retrieval of information} must be distinguished, whereas they should have the same representation for effective information retrieval. {\em Biterm Language Models} -- a variant of bigram language models -- are introduced here to address this problem. Biterm language models handle local variation~(within a two-word window) in the surface form of the words used to express a concept. It is these concepts, however, that need to be modeled in queries to improve retrieval performance. {\em Concept Language Models}~(CLMs), which prescribe a two-layered query model, are presented in this dissertation. A user's information need is modeled as a sequence of concepts, and the query is viewed as an expression of those concepts of interest. It is shown that such a model yields significant improvements in retrieval performance. CLMs also provide a natural way of incorporating concept synonymy in the language modeling approach to IR.

Mixture models combine statistical evidence from different sources to estimate a probability distribution. For example, the smoothed unigram language model for a document combines unigram counts from the document and from the corpus. While mixture models are easy to implement, they appear to make suboptimal use of their components. A natural method of combining information sources based on the Maximum Entropy Principle has been shown to contribute significantly to perplexity reduction, and hence to better language models, for speech recognition.
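As a concrete illustration of such a mixture~(a standard Jelinek-Mercer formulation, not necessarily the exact estimator used in the dissertation): $p(w \mid d) = \lambda\, p_{ml}(w \mid d) + (1-\lambda)\, p(w \mid C)$, where $p_{ml}(w \mid d)$ is the maximum likelihood estimate from the document, $p(w \mid C)$ is the collection model, and $0 \le \lambda \le 1$ is the smoothing weight; under query likelihood, a query $q = w_1 \ldots w_n$ is then scored against a document by $p(q \mid d) = \prod_{i=1}^{n} p(w_i \mid d)$.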
Such a framework is proposed for information retrieval in the context of document likelihood or relevance language models. The {\em maximum entropy language model} for information retrieval provides a better mechanism for incorporating external knowledge and additional syntactic and semantic features of the language into language models for IR.
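For reference, the Maximum Entropy Principle yields models of the standard log-linear form $p(w \mid h) = \frac{1}{Z_\Lambda(h)} \exp\big(\sum_i \lambda_i f_i(h, w)\big)$, where the $f_i$ are feature functions over a conditioning context $h$ and word $w$, the $\lambda_i$ are learned weights, and $Z_\Lambda(h)$ is the normalizer; the dissertation's specific feature set and parameterization for IR may differ from this generic sketch.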