Papers by:
 Google Scholar
 DBLP
 CiteSeer
 CSB 
Papers by year:

Creswell, C.Y., Beal, M.J., Chen, J., Cornell, T., Nilsson, L. and Srihari, R.K.
Automatically Extracting Nominal Mentions of Events with a Bootstrapped Probabilistic Classifier
To appear in Proceedings of CoLing/ACL 2006 (International Committee on Computational Linguistics and the Association for Computational Linguistics), 17-21 July, Sydney, Australia, 2006. (poster)
[pdf]
[abstract]
Most approaches to event extraction focus on mentions anchored in
verbs. However, many mentions of events surface as noun
phrases. Detecting them can increase the recall of event extraction
and provide the foundation for detecting relations between events.
This paper describes a weakly-supervised method for detecting nominal
event mentions that combines techniques from word sense disambiguation
(WSD) and lexical acquisition to create a classifier that labels noun
phrases as denoting events or non-events. The classifier uses
bootstrapped probabilistic generative models of the contexts of events
and non-events. The contexts are the lexically-anchored semantic
dependency relations that the NPs appear in. Our method dramatically
improves with bootstrapping, and comfortably outperforms lexical
lookup methods which are based on very much larger hand-crafted
resources.

Beal, M.J. and Krishnamurthy, P.
Clustering Gene Expression Time Course Data with Countably Infinite Hidden Markov Models
To appear in Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI 2006), 13-16 July, MIT, 2006.
Also presented by Praveen at NESCAI'06, Cornell, April 28-29, 2006.
[pdf]
[abstract]
Most existing approaches to clustering gene expression time course data treat the different time points as independent dimensions and are invariant to permutations, such as reversal, of the experimental time course. Approaches utilizing HMMs have been shown to be helpful in this regard, but are hampered by having to choose model architectures with appropriate complexities. Here we propose for a clustering application an HMM with a countably infinite state space; inference in this model is possible by recasting it in the hierarchical Dirichlet process (HDP) framework (Teh et al. 2006), and hence we call it the HDP-HMM. We show that the infinite model outperforms model selection methods over finite models, and traditional time-independent methods, as measured by a variety of external and internal indices for clustering on two large publicly available data sets. Moreover, we show that the infinite models utilize more hidden states and employ richer architectures (e.g. state-to-state transitions) without the damaging effects of overfitting.
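As a toy illustration (not from the paper) of an external clustering index of the kind used in this evaluation, the Rand index measures pairwise agreement between two clusterings:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """External clustering index: the fraction of point pairs on which two
    clusterings agree (same cluster in both, or different clusters in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Two clusterings of five time courses, identical up to label renaming:
print(rand_index([0, 0, 1, 1, 2], [1, 1, 0, 0, 2]))  # -> 1.0
```

An index of 1.0 means perfect agreement; labels only matter up to renaming, which is why the second clustering scores perfectly against the first.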

Beal, M.J. and Ghahramani, Z.
Variational Bayesian Learning of Directed Graphical Models with Hidden Variables
To appear in Bayesian Analysis 1(4), 2006.
[preprint pdf]
[abstract]
A key problem in statistics and machine learning is inferring suitable
structure of a model given some observed data. A Bayesian approach to
model comparison makes use of the marginal likelihood of each
candidate model to form a posterior distribution over models;
unfortunately for most models of interest, notably those containing
hidden or latent variables, the marginal likelihood is intractable to
compute.
We present the variational Bayesian (VB) algorithm for directed
graphical models, which optimises a lower bound approximation to the
marginal likelihood in a procedure similar to the standard EM
algorithm. We show that for a large class of models, which we call
conjugate exponential, the VB algorithm is a straightforward
generalisation of the EM algorithm that incorporates uncertainty over
model parameters. In a thorough case study using a small class of
bipartite DAGs containing hidden variables, we compare the accuracy of
the VB approximation to existing asymptotic-data approximations such
as the Bayesian Information Criterion (BIC) and the Cheeseman-Stutz
(CS) criterion, and also to a sampling-based gold standard, Annealed
Importance Sampling (AIS). We find that the VB algorithm is
empirically superior to CS and BIC, and much faster than AIS.
Moreover, we prove that a VB approximation can always be constructed in
such a way that guarantees it to be more accurate than the CS
approximation.
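In the paper's notation (data y, hidden variables x, parameters θ, model m), the lower bound that VB optimises can be sketched as:

```latex
\ln p(\mathbf{y}\,|\,m) \;\ge\;
\mathcal{F}_m(q_{\mathbf{x}}, q_{\boldsymbol\theta}) =
\int q_{\mathbf{x}}(\mathbf{x})\, q_{\boldsymbol\theta}(\boldsymbol\theta)
\ln \frac{p(\mathbf{y}, \mathbf{x}, \boldsymbol\theta\,|\,m)}
         {q_{\mathbf{x}}(\mathbf{x})\, q_{\boldsymbol\theta}(\boldsymbol\theta)}
\, d\mathbf{x}\, d\boldsymbol\theta ,
% coordinate ascent on F gives the EM-like VBE and VBM steps:
q_{\mathbf{x}}(\mathbf{x}) \propto
  \exp \big\langle \ln p(\mathbf{y}, \mathbf{x}\,|\,\boldsymbol\theta, m) \big\rangle_{q_{\boldsymbol\theta}},
\qquad
q_{\boldsymbol\theta}(\boldsymbol\theta) \propto
  p(\boldsymbol\theta\,|\,m)\,
  \exp \big\langle \ln p(\mathbf{y}, \mathbf{x}\,|\,\boldsymbol\theta, m) \big\rangle_{q_{\mathbf{x}}} .
```

Iterating the two updates monotonically increases the bound, mirroring the E and M steps of standard EM but with a distribution (not a point estimate) over θ.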

Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.
Hierarchical Dirichlet Processes
To appear, Journal of the American Statistical Association, December 2006.
[preprint pdf] [older tech report version]
[abstract]
We consider problems involving groups of data, where each observation within a group is
a draw from a mixture model, and where it is desirable to share mixture components between
groups. We assume that the number of mixture components is unknown a priori and is to be
inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one
for each group, where the well-known clustering property of the Dirichlet process provides a
nonparametric prior for the number of mixture components within each group. Given our desire
to tie the mixture models in the various groups, we consider a hierarchical model, specifically
one in which the base measure for the child Dirichlet processes is itself distributed according to
a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily
share atoms. Thus, as desired, the mixture models in the different groups necessarily share
mixture components. We discuss representations of hierarchical Dirichlet processes in terms of
a stick-breaking process, and a generalization of the Chinese restaurant process that we refer
to as the "Chinese restaurant franchise." We present Markov chain Monte Carlo algorithms
for posterior inference in hierarchical Dirichlet process mixtures, and describe applications to
problems in information retrieval and text modelling.
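A minimal sketch (illustrative, not the paper's code) of the clustering property the abstract relies on, via the Chinese restaurant process that underlies the franchise construction:

```python
import random

def crp_partition(n, alpha, seed=0):
    """Sample a partition of n items from the Chinese restaurant process,
    the clustering property of the Dirichlet process: item i joins an
    existing cluster with probability proportional to its size, or starts
    a new cluster with probability proportional to alpha."""
    rng = random.Random(seed)
    sizes = []       # number of customers at each table (cluster)
    assignment = []  # table index for each item
    for _ in range(n):
        weights = sizes + [alpha]            # last entry: open a new table
        r = rng.uniform(0, sum(weights))
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(sizes):
            sizes.append(1)                  # new cluster created
        else:
            sizes[k] += 1
        assignment.append(k)
    return assignment

part = crp_partition(100, alpha=1.0)
print(len(set(part)))  # number of clusters grows roughly like alpha * log(n)
```

In the hierarchical version, each group runs its own "restaurant" but the dishes (atoms) are drawn from a shared top-level process, which is what lets groups share mixture components.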

Sridharan, K., Beal, M.J. and Govindaraju, V.
Competitive Mixtures of Neurons
To appear at 18th International Conference on Pattern Recognition (ICPR'06).
[pdf]
[abstract]
We propose a competitive finite mixture of neurons (or perceptrons) for solving binary classification problems. Our classifier includes a prior for the weights between different neurons such that it prefers mixture models made up from neurons having classification boundaries as orthogonal to each other as possible. We derive an EM algorithm for learning the mixing proportions and weights of each neuron, consisting of an exact E step and a partial M step, and show that our model covers the regions of high posterior probability in weight space and tends to reduce overfitting. We demonstrate the way in which our mixture classifier works using a toy 2dimensional data set, showing the effective use of strategically positioned components in the mixture. We further compare its performance against SVMs and onehiddenlayer neural networks on four realworld data sets from the UCI repository, and show that even a relatively small number of neurons with appopriate competitive priors can achieve superior classification accuracies on heldout test data.

Ghosh, J., Beal, M.J., Ngo, H. and Qiao, C.
On Profiling Mobility and Predicting Locations of Campus-wide Wireless Network Users
To appear in ACM/SIGMOBILE REALMAN, in conjunction with ACM MobiHoc, Florence, Italy, May 2006.
[pdf]
[related TR]
[abstract]
In this paper we analyze a year-long mobility trace of wireless network users collected on the ETH Zurich campus. Unlike earlier work in Chen et al. (2005) and Ghosh et al. (2005a), we profile the movement patterns of wireless users and predict their locations. More specifically, we show that each network user regularly visits a list of places such as buildings (also referred to as "hubs") with some probability. The daily list of hubs, along with their corresponding visit probabilities, is referred to as a mobility profile. We also show that over a period of time (e.g., a week), a user may repeatedly follow a mixture of mobility profiles with certain probabilities associated with each of the profiles. Our analysis of the mobility trace data not only validates the existence of our so-called sociological orbits (Ghosh et al., 2005b), but also demonstrates the advantages of exploiting them in performing hub-level location predictions. In particular, we show that such profile-based location predictions are more precise than common statistical approaches based on observed hub visitation frequencies alone.
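As a rough sketch of the mobility-profile idea (hub names and visit data here are invented for illustration):

```python
from collections import Counter

def mobility_profile(visits):
    """Estimate a mobility profile: the probability of visiting each hub,
    from a list of observed hub visits (hub names are hypothetical)."""
    counts = Counter(visits)
    total = sum(counts.values())
    return {hub: c / total for hub, c in counts.items()}

def predict_hub(profile):
    """Hub-level prediction: the most probable hub under the profile."""
    return max(profile, key=profile.get)

week = ["library", "cs-building", "library", "cafeteria", "library"]
profile = mobility_profile(week)
print(predict_hub(profile))  # -> library
```

The paper's richer model treats a user as mixing over several such profiles across days; the sketch above is only the single-profile frequency baseline it is compared against.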

Srinivasan, H., Srihari, S.N., Beal, M.J., Fang, G. and Phatak, P.
Comparison of ROC-based and Likelihood methods for Fingerprint
Verification
Proc. Biometric Technology for Human Identification III: SPIE Defense and Security Symposium, Orlando, FL, April 17-22, 2006.
[pdf]
[abstract]
The fingerprint verification task answers the question of whether or not two fingerprints belong to the same
finger. The paper focuses on the classification aspect of fingerprint verification. Classification is the third and
final step after the two earlier steps of feature extraction, where a known set of features (minutiae points)
has been extracted from each fingerprint, and scoring, where a matcher has determined a degree of match between
the two sets of features. Since this is a binary classification problem involving a single variable, the commonly
used threshold method is related to the so-called receiver operating characteristic (ROC). In the ROC approach
the optimal threshold on the score is chosen so as to decide match or non-match. Such a method works
well when there is a well-registered fingerprint image. On the other hand, more sophisticated methods are needed
when there exists a partial imprint of a finger, as in the case of latent prints in forensics or due to limitations
of the biometric device. In such situations it is useful to consider classification methods based on computing the
likelihood ratio of match/non-match. Such methods are commonly used in some biometric and forensic domains
such as speaker verification, where there is a much higher degree of uncertainty. This paper compares the two
approaches empirically for the fingerprint classification task when the number of available minutiae is varied.
In both ROC-based and likelihood ratio methods, learning is from a general population of an ensemble of pairs,
each of which is labeled as being from the same finger or from different fingers. In the ROC-based method the
best operating point is derived from the ROC curve. In the likelihood method the distributions of same-finger
and different-finger scores are modeled using Gaussian and Gamma distributions. The performances of the two
methods are compared for varying numbers of minutiae points available. Results show that the likelihood method
performs better than the ROC-based method when fewer minutiae points are available. Both methods converge to
the same accuracy as more minutiae points become available.
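A toy sketch contrasting the two decision rules (the Gaussian score models and all parameters below are illustrative stand-ins for the paper's fitted Gaussian/Gamma distributions):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def decide_threshold(score, threshold):
    """ROC-based rule: declare a match when the score exceeds the
    operating point chosen from the ROC curve."""
    return score > threshold

def decide_likelihood_ratio(score, same_params, diff_params):
    """Likelihood-ratio rule: declare a match when
    p(score | same finger) / p(score | different fingers) > 1."""
    lr = normal_pdf(score, *same_params) / normal_pdf(score, *diff_params)
    return lr > 1.0

# Illustrative score distributions: same-finger scores centred higher.
same, diff = (0.8, 0.1), (0.3, 0.15)
print(decide_likelihood_ratio(0.7, same, diff))  # -> True
print(decide_likelihood_ratio(0.2, same, diff))  # -> False
```

Unlike a fixed threshold, the likelihood-ratio rule adapts its implicit decision boundary to how the two score distributions overlap, which is why it degrades more gracefully when few minutiae are available.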

Ghosh, J., Beal, M.J., Ngo, H. and Qiao, C.
On Profiling Mobility and Predicting Locations of Campus-wide Wireless Network Users
Technical Report # 2005-27 (December), Department of Computer Science and Engineering, University at Buffalo, SUNY, 2006.
[pdf]
[abstract]
In this paper, we analyze a year-long mobility trace of wireless network users collected on the ETH Zurich campus. Unlike earlier work in Chen et al. (2005), Kim et al. (2006), and Tuduce and Gross (2005), we profile the movement patterns of wireless users and predict their locations. More specifically, we show that each network user regularly visits a list of places, such as buildings (also referred to as "hubs"), with some probability. The daily list of hubs, along with their corresponding visit probabilities, is referred to as a mobility profile. We also show that over a period of time (e.g., a week), a user may repeatedly follow a mixture of mobility profiles with certain probabilities associated with each of the profiles. Our analysis of the mobility trace data not only validates the existence of our so-called sociological orbits (Ghosh et al., 2005), but also demonstrates the advantages of exploiting them in performing hub-level location predictions. Moreover, such profile-based location predictions are found not only to be more precise than a common statistical approach based on observed hub visitation frequencies, but also to incur a much lower overhead. We further illustrate the benefit of profiling users' mobility by discussing relevant work and suggesting applications in different types of wireless networks, including mobile ad hoc networks.

Beal, M.J., Falciani, F., Ghahramani Z., Rangel, C. and Wild, D.L.
A Bayesian Approach to Reconstructing Genetic Regulatory Networks with Hidden Factors
In Bioinformatics 21:349-356, 2005.
[web abstract]
[pdf]
supplementary data page
VBLDS software
[abstract]
Motivation: We have used state-space models to reverse engineer transcriptional networks from highly replicated gene expression profiling time series data obtained from a well-established model of T cell activation. State-space models are a class of dynamic Bayesian networks in which the observed measurements depend on some hidden state variables that evolve according to Markovian dynamics. These hidden variables can capture effects which cannot be directly measured in a gene expression profiling experiment, for example: genes that have not been included in the microarray, levels of regulatory proteins, the effects of mRNA and protein degradation, etc.
Results: We have approached the problem of inferring the model structure of these state-space models using both classical and Bayesian methods. In our previous work, a bootstrap procedure was used to derive classical confidence intervals for parameters representing `gene-gene' interactions over time. In this paper variational approximations are used to perform the analogous model selection task in the Bayesian context. Certain interactions are present in both the classical and the Bayesian analyses of these regulatory networks. The resulting models place JunB and JunD at the centre of the mechanisms that control apoptosis and proliferation. These mechanisms are key for clonal expansion and for controlling the long term behavior (e.g. programmed cell death) of these cells.
Availability: Supplementary data is available at http://public.kgi.edu/~wild/index.htm and Matlab source code for variational Bayesian learning of state-space models is available at http://www.cs.toronto.edu/~beal/software.html.

Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.
Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes
In Neural Information Processing Systems 17:1385-1392, MIT Press, 2005.
[ps]
[pdf]
[ps.gz]
[pdf.gz]
[djvu]
[bibtex]
NIPS 2004
project page
[abstract]
We propose the hierarchical Dirichlet process (HDP), a nonparametric
Bayesian model for clustering problems involving multiple groups of
data. Each group of data is modeled with a mixture, with the number of
components being open-ended and inferred automatically by the model.
Further, components can be shared across groups, allowing dependencies
across groups to be modeled effectively as well as conferring
generalization to new groups. Such grouped clustering problems occur often in
practice, e.g. in the problem of topic discovery in document corpora. We
report experimental results on three text corpora showing the effective
and superior performance of the HDP over previous models.

Srinivasan, H., Srihari, S.N. and Beal, M.J.
Signature Verification using Kolmogorov-Smirnov Statistic
In Proceedings of the 12th Conference of the International Graphonomics Society (IGS), Salerno, Italy, June 2005.
[pdf]
[abstract]
Automatic signature verification of scanned documents is presented here. The strategy used for verification is applicable in scenarios where there are multiple knowns (genuine signature samples) from a writer. First, the learning process involves learning the variation and similarities among the known genuine samples from the given writer; then the classification problem answers the question of whether or not a given questioned sample belongs to the ensemble of known samples. The learning strategy discussed compares pairs of signature samples from amongst the known samples to obtain a distribution in distance space that represents the variation amongst samples for that particular writer. The corresponding classification method involves comparing the questioned sample with all the available knowns to obtain another distribution in distance space. The classification task is then to compare the two distributions to obtain a probability of similarity, which represents the probability of the questioned sample belonging to the ensemble of knowns. These strategies are applied to the problem of signature verification and performance results are presented.
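The comparison of the two distance distributions can be sketched with the two-sample Kolmogorov-Smirnov statistic (the distance values below are toy numbers, not the paper's data):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (here, distributions of
    pairwise distances between signature samples)."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        return sum(v <= x for v in sorted_sample) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

# Known-vs-known distances vs questioned-vs-known distances (toy values):
within = [0.10, 0.12, 0.15, 0.11]
questioned = [0.30, 0.35, 0.40, 0.32]
print(ks_statistic(within, questioned))  # -> 1.0 (completely separated)
```

A statistic near 0 suggests the questioned sample's distances look like the writer's own variation; a statistic near 1, as in this toy case, suggests the questioned sample lies outside the ensemble of knowns.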

Srinivasan, H., Beal, M.J. and Srihari, S.N.
Machine Learning Approaches for Person Identification and Verification
In Proceedings of the SPIE Defense and Security Symposium, Orlando, FL, March 2005.
[pdf]
[abstract]
New machine learning strategies are proposed for person identification which can be used in several biometric modalities such as friction ridges, handwriting, signatures and speech. The biometric or forensic performance task answers the question of whether or not a sample belongs to a known person. Two different learning paradigms are discussed: person-independent (or general) learning and person-dependent (or person-specific) learning. In the first paradigm, learning is from a general population of an ensemble of pairs, each of which is labelled as being from the same person or from different persons; the learning process determines the range of variations for given persons and between different persons. In the second paradigm the identity of a person is learnt when presented with multiple known samples of that person, where the variation and similarities within a particular person are learnt. The person-specific learning strategy is seen to perform better than general learning (5% higher performance with signatures). Improvement of person-specific performance with increasing numbers of samples is also observed.

Srihari, S.N., Beal, M.J., Bandi, K., Shah, V. and Krishnamurthy, P.
A Statistical Model for Writer Verification
In Proc. 8th International Conference on Document Analysis and Recognition (ICDAR), Seoul, Korea, August 2005, pp. 1105-1109.
[pdf]
[abstract]
A statistical model for determining whether a pair of documents, a known and a questioned, were written by the same individual is proposed. The model has the following four components: (i) discriminating elements, e.g., global features and characters, are extracted from each document, (ii) differences between corresponding elements from each document are computed, (iii) using conditional probability estimates of each difference, the log-likelihood ratio (LLR) is computed for the hypotheses that the documents were written by the same or different writers; the conditional probability estimates themselves are determined from labelled samples using either Gaussian or gamma estimates for the differences assuming their statistical independence, and (iv) distributions of same-writer and different-writer LLRs are analyzed to calibrate the strength of evidence into a standard nine-point scale used by questioned document examiners. The model is illustrated with experimental results for a specific set of discriminating elements.

Attias, H.T. and Beal, M.J.
Tree of Latent Mixtures for Bayesian Modelling and Classification of High Dimensional Data
Technical Report No. 2005-06. Dept. Computer Science and Engineering, University at Buffalo, SUNY. January 1, 2005.
[pdf]
[ps.gz]
[abstract]
Many domains of interest to machine learning, such as audio and video, computational biology, climate modelling, and quantitative finance, involve very high dimensional data. Such data are often characterized by statistical structure that includes correlations on multiple scales. Attempts to model those high dimensional data raise problems that are much less significant in low dimensions. This paper presents a novel approach to modelling and classification in high dimensions. The approach is based on a graphical model with a tree structure, which performs density estimation on multiple spatial scales. Exact inference in this model is computationally intractable; two algorithms for approximate inference using variational techniques are derived and analyzed. We discuss learning and classification using these algorithms, and demonstrate their performance on a handwritten digit recognition task.

Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.
Hierarchical Dirichlet Processes
Technical Report 653, UC Berkeley Statistics, 2004.
[JASA version]
[pdf]
project page
[abstract]
We consider problems involving groups of data, where each observation
within a group is a draw from a mixture model, and where it is
desirable to share mixture components between groups.
We assume that the number of mixture components is unknown a priori
and is to be inferred from the data. In this setting it is natural
to consider sets of Dirichlet processes, one for each group, where
the well-known clustering property of the Dirichlet process provides
a nonparametric prior for the number of mixture components within
each group. Given our desire to tie the mixture models in the
various groups, we consider a hierarchical model, specifically one
in which the base measure for the child Dirichlet processes is
itself distributed according to a Dirichlet process. Such a base
measure being discrete, the child Dirichlet processes necessarily
share atoms. Thus, as desired, the mixture models in the different
groups necessarily share mixture components. We discuss representations
of hierarchical Dirichlet processes in terms of a stick-breaking
process, and a generalization of the Chinese restaurant process
that we refer to as the ``Chinese restaurant franchise.'' We present
Markov chain Monte Carlo algorithms for posterior inference in
hierarchical Dirichlet process mixtures, and describe applications
to problems in information retrieval and text modelling.

Neal, R.M., Beal, M.J. and Roweis, S.T.
Inferring State Sequences for Nonlinear Systems with Embedded Hidden Markov Models
In Neural Information Processing Systems 16:401-408, MIT Press, 2004.
[pdf]
[ps.gz]
[abstract]
We describe a Markov chain method for sampling from the distribution of the hidden state sequence in a nonlinear dynamical system, given a sequence of observations. This method updates all states in the sequence simultaneously using an embedded Hidden Markov Model (HMM). An update begins with the creation of "pools" of candidate states at each time. We then define an embedded HMM whose states are indexes within these pools. Using a forward-backward dynamic programming algorithm, we can efficiently choose a state sequence with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools. We illustrate the method in a simple one-dimensional example, and in an example showing how an embedded HMM can be used to, in effect, discretize the state space without any discretization error. We also compare the embedded HMM to a particle smoother on a more substantial problem of inferring human motion from 2D traces of markers.
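A compact sketch of the dynamic-programming pass over pools (a max-probability, Viterbi-style variant for brevity; the paper instead samples a sequence with the appropriate probabilities, and the model densities here are caller-supplied placeholders):

```python
def best_pool_sequence(pools, log_trans, log_obs):
    """Given a pool of candidate states at each time, pick the
    highest-probability sequence passing through one state per pool.
    log_trans(prev, cur) and log_obs(state) are placeholder log-densities."""
    T = len(pools)
    score = [log_obs(s) for s in pools[0]]
    back = []  # back[t-1][k]: best predecessor index for pools[t][k]
    for t in range(1, T):
        new_score, ptr = [], []
        for s in pools[t]:
            cands = [score[j] + log_trans(prev, s)
                     for j, prev in enumerate(pools[t - 1])]
            j = max(range(len(cands)), key=cands.__getitem__)
            new_score.append(cands[j] + log_obs(s))
            ptr.append(j)
        score = new_score
        back.append(ptr)
    k = max(range(len(score)), key=score.__getitem__)  # backtrack
    seq = [pools[-1][k]]
    for t in range(T - 1, 0, -1):
        k = back[t - 1][k]
        seq.append(pools[t - 1][k])
    return list(reversed(seq))

# Pools of candidate states; transitions prefer nearby states.
pools = [[0.0, 1.0], [0.1, 2.0], [0.2, 3.0]]
seq = best_pool_sequence(pools, lambda a, b: -(a - b) ** 2, lambda s: 0.0)
print(seq)  # -> [0.0, 0.1, 0.2]
```

The cost is linear in T and quadratic in the pool size, while implicitly searching the exponentially many sequences through the pools.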

Beal, M.J. and Ghahramani, Z.
The Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures
In Bayesian Statistics 7:453-464, Oxford University Press, 2003.
Preprint:
[pdf]
[ps.gz].
Volume preface
[pdf]
and contents
[pdf]
[abstract]
We present an efficient procedure for estimating the marginal likelihood of probabilistic models with latent variables or incomplete data. This method constructs and optimises a lower bound on the marginal likelihood using variational calculus, resulting in an iterative algorithm which generalises the EM algorithm by maintaining posterior distributions over both latent variables and parameters. We define the family of conjugate-exponential models, which includes finite mixtures of exponential family models, factor analysis, hidden Markov models, linear state-space models, and other models of interest, for which this bound on the marginal likelihood can be computed very simply through a modification of the standard EM algorithm. In particular, we focus on applying these bounds to the problem of scoring discrete directed graphical model structures (Bayesian networks). Extensive simulations comparing the variational bounds to the usual approach based on the Bayesian Information Criterion (BIC) and to a sampling-based gold standard method known as Annealed Importance Sampling (AIS) show that variational bounds substantially outperform BIC in finding the correct model structure at relatively little computational cost, while approaching the performance of the much more costly AIS procedure. Using AIS allows us to provide the first serious case study of the tightness of variational bounds. We also analyse the performance of AIS through a variety of criteria, and outline directions in which this work can be extended.

Beal, M.J.
Variational Algorithms for Approximate Bayesian Inference
PhD. Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003. (281 pages)
[pdf]
[ps.gz]
or as individual chapters:
Ch 1: Introduction 
(31) 
[pdf 410k]
[ps.gz 335k]

Ch 2: Variational Bayesian Theory 
(38) 
[pdf 491k]
[ps.gz 379k]

Ch 3: Variational Bayesian Hidden Markov Models 
(24) 
[pdf 529k]
[ps.gz 652k]

Ch 4: Variational Bayesian Mixture of Factor Analysers 
(53) 
[pdf 980k]
[ps.gz 906k]

Ch 5: Variational Bayesian Linear Dynamical Systems 
(47) 
[pdf 1.1M]
[ps.gz 1.3M]

Ch 6: Learning the structure of discretevariable graphical models with hidden variables 
(44) 
[pdf 749k]
[ps.gz 689k]

[abstract]
The Bayesian framework for machine learning allows for the
incorporation of prior knowledge in a coherent way, avoids overfitting
problems, and provides a principled basis for selecting between
alternative models. Unfortunately the computations required are
usually intractable. This thesis presents a unified variational
Bayesian (VB) framework which approximates these computations in
models with latent variables using a lower bound on the marginal
likelihood.
Chapter 1 presents background material on Bayesian inference,
graphical models, and propagation algorithms. Chapter 2 forms the
theoretical core of the thesis, generalising the
expectation-maximisation (EM) algorithm for learning maximum
likelihood parameters to the VB EM algorithm which integrates over
model parameters. The algorithm is then specialised to the large
family of conjugate-exponential (CE) graphical models, and several
theorems are presented to pave the road for automated VB derivation
procedures in both directed and undirected graphs (Bayesian and Markov
networks, respectively).
Chapters 3-5 derive and apply the VB EM algorithm to three
commonly-used and important models: mixtures of factor analysers,
linear dynamical systems, and hidden Markov models. It is shown how
model selection tasks such as determining the dimensionality,
cardinality, or number of variables are possible using VB
approximations. Also explored are methods for combining sampling
procedures with variational approximations, to estimate the tightness
of VB bounds and to obtain more effective sampling algorithms.
Chapter 6 applies VB learning to a long-standing problem of scoring
discrete-variable directed acyclic graphs, and compares the
performance to annealed importance sampling amongst other
methods. Throughout, the VB approximation is compared to other methods
including sampling, Cheeseman-Stutz, and asymptotic approximations
such as BIC. The thesis concludes with a discussion of evolving
directions for model selection including infinite models and
alternative approximations to the marginal likelihood.

Beal, M.J., Jojic, N. and Attias, H.
A Graphical Model for Audio-Visual Object Tracking
In IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), July 2003 (Vol. 25, No. 7), pp. 828-836.
[web abstract]
[pdf]
[abstract]
We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real world scenarios using off-the-shelf equipment.

Beal, M.J., Ghahramani, Z. and Rasmussen, C.E.
The Infinite Hidden Markov Model
In Advances in Neural Information Processing Systems 14:577-584, eds. T. Dietterich, S. Becker, Z. Ghahramani, MIT Press, 2002.
[pdf]
[ps.gz]
[abstract]
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite; consider, for example, symbols being possible words appearing in English text.

Beal, M.J., Attias, H. and Jojic, N.
Audio-Video Sensor Fusion with Probabilistic Graphical Models
In Proc. European Conf. on Computer Vision (ECCV), May 2002.
[web abstract]
[abstract]
We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real world scenarios using off-the-shelf equipment.

Beal, M.J., Jojic, N. and Attias, H.
A Self-Calibrating Algorithm for Speaker Tracking Based on Audio-Visual Statistical Models
In Proc. Int. Conf. on Acoustics Speech and Signal Proc. (ICASSP), May 2002.
[pdf]
[ps.gz]
[abstract]
We present a self-calibrating algorithm for audio-visual tracking using two microphones and a camera. The algorithm uses a parametrized statistical model which combines simple models of video and audio. Using unobserved variables, the model describes the process that generates the observed data. Hence, it is able to capture and exploit the statistical structure of the audio and video data, as well as their mutual dependencies. The model parameters are estimated by the EM algorithm; object templates are learned and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location using the model. Successful performance is demonstrated on real multimedia clips.

Beal, M.J. and Ghahramani, Z.
The Variational Kalman Smoother
Gatsby Unit Technical Report TR01-003 (contains an error that is corrected in the Ph.D. thesis).
[pdf]
[ps]
[abstract]
In this note we outline the derivation of the variational Kalman smoother, in the context of Bayesian Linear Dynamical Systems. The smoother is an efficient algorithm for the E-step in the Expectation-Maximisation (EM) algorithm for linear-Gaussian state-space models. However, inference approximations are required if we hold distributions over parameters. We derive the E-step updates for the hidden states (the variational smoother), and the M-step updates for the parameter distributions. We show that inference of the hidden state is tractable for any distribution over parameters, provided the expectations of certain quantities are available, analytically or otherwise.
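For reference, the classical E-step recursion that the note builds on, a Rauch-Tung-Striebel smoother for a 1-D linear-Gaussian state-space model, might be sketched as follows. This uses fixed point-estimate parameters; the variational smoother in the report additionally averages over parameter distributions:

```python
import numpy as np

# Minimal RTS smoother for x_t = a x_{t-1} + w,  y_t = c x_t + v,
# with process noise variance q and observation noise variance r.
def rts_smoother(y, a=0.9, c=1.0, q=0.1, r=0.5, mu0=0.0, p0=1.0):
    T = len(y)
    mu_f = np.zeros(T); p_f = np.zeros(T)       # filtered means / variances
    mu_p = np.zeros(T); p_p = np.zeros(T)       # one-step predicted
    for t in range(T):                          # forward (filter) pass
        mu_p[t] = a * (mu_f[t - 1] if t else mu0)
        p_p[t] = a * a * (p_f[t - 1] if t else p0) + q
        k = p_p[t] * c / (c * c * p_p[t] + r)   # Kalman gain
        mu_f[t] = mu_p[t] + k * (y[t] - c * mu_p[t])
        p_f[t] = (1.0 - k * c) * p_p[t]
    mu_s = mu_f.copy(); p_s = p_f.copy()        # smoothed estimates
    for t in range(T - 2, -1, -1):              # backward (smoother) pass
        j = p_f[t] * a / p_p[t + 1]
        mu_s[t] = mu_f[t] + j * (mu_s[t + 1] - mu_p[t + 1])
        p_s[t] = p_f[t] + j * j * (p_s[t + 1] - p_p[t + 1])
    return mu_s, p_s
```

The backward pass conditions each state on the full observation sequence, which is exactly the sufficient statistic the EM M-step consumes.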

Ghahramani, Z. and Beal, M.J.
Propagation Algorithms for Variational Bayesian Learning
In Advances in Neural Information Processing Systems 13:507-513, eds. T.K. Leen, T. Dietterich, V. Tresp, MIT Press, 2001.
[pdf]
[ps.gz]
[poster]
[software]
[abstract]
Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set.

Ghahramani, Z. and Beal, M.J.
Graphical Models and Variational Methods
Book chapter in Advanced Mean Field Methods: Theory and Practice, eds. D. Saad and M. Opper, MIT Press, 2001. [jacket]
Part of this work was discussed at the NIPS*99 workshop of the same title as the book.
[pdf]
[ps.gz]
[abstract]
We review the use of variational methods for approximating inference and learning in probabilistic graphical models. In particular, we focus on variational approximations to the integrals required for Bayesian learning. For models in the conjugate-exponential family, a generalisation of the EM algorithm is derived that iterates between optimising hyperparameters of the distribution over parameters, and inferring the hidden variable distributions. These approximations make use of available propagation algorithms for probabilistic graphical models. We give two case studies of how the variational Bayesian approach can be used to learn model structure: inferring the number of clusters and dimensionalities in a mixture of factor analysers, and inferring the dimension of the state space of a linear dynamical system. Finally, importance sampling corrections to the variational approximations are discussed, along with their limitations.
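The generalised EM iteration described in the abstract is often stated compactly as a pair of coupled updates; the notation below is a standard formulation, not taken from the chapter itself. For hidden variables $x$, data $y$ and parameters $\theta$:

```latex
\begin{align}
\text{VBE step:}\quad q(x) &\propto
  \exp\left\{ \langle \ln p(x, y \mid \theta) \rangle_{q(\theta)} \right\} \\
\text{VBM step:}\quad q(\theta) &\propto
  p(\theta)\, \exp\left\{ \langle \ln p(x, y \mid \theta) \rangle_{q(x)} \right\}
\end{align}
```

Each iteration increases a lower bound on the marginal likelihood $\ln p(y)$; for conjugate-exponential models both expectations are available in closed form, which is what lets standard propagation algorithms be reused in the VBE step.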

Ghahramani, Z. and Beal, M.J.
Variational Inference for Bayesian Mixtures of Factor Analysers
In Advances in Neural Information Processing Systems 12:449-455, eds. S. A. Solla, T.K. Leen, K. Müller, MIT Press, 2000.
[pdf]
[ps.gz]
[poster]
[software]
[abstract]
We present an algorithm that infers the model structure of a mixture of factor analysers using an efficient and deterministic variational approximation to full Bayesian integration over model parameters. This procedure can automatically determine the optimal number of components and the local dimensionality of each component (i.e. the number of factors in each factor analyser). Alternatively it can be used to infer posterior distributions over number of components and dimensionalities. Since all parameters are integrated out the method is not prone to overfitting. Using a stochastic procedure for adding components it is possible to perform the variational optimisation incrementally and to avoid local maxima. Results show that the method works very well in practice and correctly infers the number and dimensionality of nontrivial synthetic examples.
By importance sampling from the variational approximation we show how to obtain unbiased estimates of the true evidence, the exact predictive density, and the KL divergence between the variational posterior and the true posterior, not only in this model but for variational approximations in general.
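The importance-sampling correction described in the final paragraph can be illustrated on a toy model where the exact evidence is known. The model and the deliberately crude proposal q below are invented for the sketch; this is not the mixture-of-factor-analysers setting:

```python
import numpy as np

# Toy model: theta ~ N(0, 1), y | theta ~ N(theta, 1), so the exact
# evidence is p(y) = N(y; 0, 2). Sampling theta from an approximate
# posterior q(theta) gives an unbiased estimate of the evidence via
#   p(y) = E_q[ p(y | theta) p(theta) / q(theta) ].

def norm_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def is_evidence(y, q_mu, q_var, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(q_mu, np.sqrt(q_var), size=n_samples)  # draws from q
    w = (norm_pdf(y, theta, 1.0) * norm_pdf(theta, 0.0, 1.0)
         / norm_pdf(theta, q_mu, q_var))                       # importance weights
    return w.mean()

y = 1.3
exact = norm_pdf(y, 0.0, 2.0)
est = is_evidence(y, q_mu=0.5, q_var=1.5)   # crude q, still unbiased
```

Even with a mismatched q, the weighted average converges to the true evidence; the quality of q only affects the variance of the estimate, which is the sense in which importance sampling "corrects" a variational approximation.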

Classical and Bayesian approaches to reconstructing genetic regulatory networks (28th July 2004)
A poster presented at the 12th International Conference on Intelligent Systems for Molecular Biology (ISMB'04) in Edinburgh on 02/08/04.
[pdf]
[ps.gz]

Variational Bayesian quick-reference sheet (updated 1st April 2002)
Something I cooked up during the summer of 2000. Ongoing, and not thoroughly checked for mistakes. This version is for A4 paper.
[pdf]
[ps.gz]
[html]

Variational Scoring of Graphical Model Structures (15/09/03)
A recent talk on VB structure learning that I gave at the Toronto Machine Learning group meetings.
[pdf]

Variational Inference in the Conjugate-Exponential Family (16/08/00)
A talk I gave to David J.C. MacKay's Inferential Sciences group at Cambridge University. Many of the slides' contents are from the NIPS 2001 paper with Zoubin.
[pdf]
[ps.gz]