ContactPerson: srikanth@cedar.buffalo.edu
Remote host: meissa.cedar.buffalo.edu
### Begin Citation ### Do not delete this line ###
%R 2004-12
%U /tmp/ubphd-submitted.ps
%A Srikanth, Munirathnam
%T EXPLOITING QUERY FEATURES IN LANGUAGE MODELING APPROACH\\FOR INFORMATION RETRIEVAL
%D August 26, 2004
%I Department of Computer Science and Engineering, SUNY Buffalo
%K information retrieval, statistical language modeling, biterms, concept language model, maximum entropy language models
%X Statistical Language Modeling has recently been applied to the problem of Information Retrieval~(IR), and appreciable performance improvements have been achieved over traditional vector space and probabilistic retrieval models. Most experiments demonstrating the language modeling approach to text retrieval have been based on smoothed unigram language models that exploit only term occurrence statistics in probability estimation. Experiments with additional features such as bigrams have met with limited success. However, language models incorporating $n$-gram, word-trigger, topic-of-discourse, syntactic, and semantic features have shown significant improvements in speech recognition. The main thrust of this dissertation is to identify the need to {\em design} language models for IR that satisfy its specific modeling requirements, and to demonstrate this by designing language models that (1) incorporate IR-specific features~(biterm language models), (2) correspond to better document and query representations~(concept language models), and (3) combine evidence from different information sources~(language features) toward modeling the relevance of a document to a given query~(maximum entropy language models for IR).

Prior research on incorporating additional features in language models for information retrieval has adopted models derived for speech recognition. However, speech recognition and information retrieval have different language modeling requirements. For example, in speech recognition the utterances {\em information retrieval} and {\em retrieval of information} must be distinguished, whereas they should have the same representation for effective information retrieval. {\em Biterm Language Models} -- a variant of bigram language models -- are introduced here to address this problem. Biterm language models handle local variation~(within a two-word window) in the surface form of the words used to express a concept. It is these concepts, however, that need to be modeled in queries to improve retrieval performance. {\em Concept Language Models}~(CLMs), which prescribe a two-layered query model, are presented in this dissertation. A user's information need is modeled as a sequence of concepts, and the query is viewed as an expression of those concepts of interest. It is shown that such a model yields significant improvements in retrieval performance. CLMs also provide a natural way of incorporating concept synonymy in the language modeling approach to IR.

Mixture models combine statistical evidence from different sources to estimate a probability distribution. For example, the smoothed unigram language model for a document combines unigram counts from the document and from the corpus. While mixture models are easy to implement, they appear to make suboptimal use of their components. A natural method of combining information sources based on the Maximum Entropy Principle has been shown to contribute significantly to perplexity reduction, and hence to better language models, for speech recognition.
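As a concrete illustration of such a mixture~(a standard Jelinek-Mercer formulation, not necessarily the exact estimator used in the dissertation): $p(w \mid d) = \lambda\, p_{ml}(w \mid d) + (1-\lambda)\, p(w \mid C)$, where $p_{ml}(w \mid d)$ is the maximum likelihood estimate from the document, $p(w \mid C)$ is the collection model, and $0 \le \lambda \le 1$ is the smoothing weight; under query likelihood, a query $q = w_1 \ldots w_n$ is then scored against a document by $p(q \mid d) = \prod_{i=1}^{n} p(w_i \mid d)$.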
Such a framework is proposed for information retrieval in the context of document likelihood or relevance language models. The {\em maximum entropy language model} for information retrieval provides a better mechanism for incorporating external knowledge and additional syntactic and semantic features of the language into language models for IR.
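For reference, the Maximum Entropy Principle yields models of the standard log-linear form $p(w \mid h) = \frac{1}{Z_\Lambda(h)} \exp\big(\sum_i \lambda_i f_i(h, w)\big)$, where the $f_i$ are feature functions over a conditioning context $h$ and word $w$, the $\lambda_i$ are learned weights, and $Z_\Lambda(h)$ is the normalizer; the dissertation's specific feature set and parameterization for IR may differ from this generic sketch.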