Object Recognition

"No recognition is possible without knowledge. Decisions about classes or groups into which recognized objects are classified are based on such knowledge -- knowledge about objects and their classes gives the necessary information for object classification."
       -- Milan Sonka, Vaclav Hlavac & Roger Boyle, Image Processing, Analysis, and Machine Vision

The Problem
Some methods exist for clustering image representations and text to produce models that link images with words (Barnard & Forsyth, Exploiting image semantics for picture libraries, 2001; Barnard & Forsyth, Learning the semantics of words and pictures, 2001). These models can predict words for a given image by computing the words that have a high posterior probability given the image. This process, referred to as auto-annotation, is useful in itself. In this form, however, auto-annotation does not tell us which image structure gives rise to which word.

The problem at hand is to construct and train a model, using annotated images, that allows us to automatically classify specific portions of an image. To make the classification process amenable to human interpretation, we represent classes using English words. In doing so, the process of annotation becomes analogous to machine translation: we have a representation in one form (image regions) and we wish to turn it into another form (words). In particular, our task is to build a lexicon: a device that predicts one representation (words) given another representation (image regions).

lion mane grass sky

plane jet sky

cactus flower

Images labeled with text. Sample training data from the Corel data set provided by Kobus Barnard, Pinar Duygulu and David Forsyth.
Our solution
Our approach is to segment images into regions and then learn to predict words from information about the regions (for background, see Duygulu et al., Object recognition as machine translation, 2002). We use a Bayesian statistical model to find the correspondence between the image regions (which we call "blobs") and the word labels. Below, we walk through the training process for our probabilistic translation model.

The original training image with label: lion tree grass

Step 1: Segment the image into blobs. We use the Normalized Cuts algorithm (Shi and Malik, Normalized cuts and image segmentation, 2000), as implemented by Doron Tal, to segment the images into blobs using colour, texture and position information.

Step 2: Normalize the blob features. For each segment (i.e. blob) in the image, we have an n-dimensional vector of features. In our case, we use LAB colour and orientation information.
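As a sketch of this normalization step, each feature dimension can be scaled to zero mean and unit variance across the blobs. The function below is a hypothetical helper for illustration, not the authors' code:

```python
import numpy as np

def normalize_blobs(features):
    """Scale each feature dimension to zero mean and unit variance.

    features: (num_blobs, n) array of per-blob feature vectors
    (e.g. LAB colour and orientation values).
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (features - mu) / sigma
```

Normalizing keeps no single feature (say, a colour channel with a large numeric range) from dominating the distances and densities computed later.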

The segmented image provided by Pinar Duygulu, Kobus Barnard and David Forsyth.
Step 3: Run the Expectation-Maximization algorithm on the translation model. The idea is to maximize the likelihood of our model; in other words, the probability of the data given the model. In this instance we are maximizing the likelihood that the blobs will occur given the corresponding words (i.e. labels). The likelihood of the model is given by the following equation:

    p(B | W) = \prod_{n=1}^{N} \prod_{m=1}^{M} \frac{1}{L} \sum_{l=1}^{L} t(b_{nm} | w_{nl})

where N is the number of images in the data, M is the number of segments (blobs) in each image, and L is the number of words in each image label. t(b|w) is the probability that a specific word (w) will translate to the corresponding blob (b). In the discrete case, we learn the t's using Expectation-Maximization. The Gaussian model's parameter set consists of the mean and variance for each word k.
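To make the likelihood concrete, here is one way it could be computed for the discrete model, assuming a table t[w, b] of word-to-blob translation probabilities and a uniform prior over which label word aligns to each blob. The names and data layout are illustrative, not the authors' code:

```python
import numpy as np

def log_likelihood(images, t):
    """Log-likelihood of the discrete translation model.

    images: list of (blobs, words) pairs, where blobs and words are
            lists of discrete blob/word indices for one image.
    t:      (num_words, num_blobs) array, t[w, b] = p(blob b | word w).
    A uniform alignment prior 1/L over the label words is assumed.
    """
    ll = 0.0
    for blobs, words in images:
        L = len(words)
        for b in blobs:
            # sum over the candidate words that may have generated blob b
            ll += np.log(sum(t[w, b] for w in words) / L)
    return ll
```

Working in log space avoids numerical underflow when the products run over many images and blobs.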

We appeal to a Gaussian mixture model to describe our word-to-blob translation probabilities, with one Gaussian per word: t(b | w = k) = N(b; \mu_k, \Sigma_k). We do this because the blobs possess features more appropriately considered to be continuous. For example, the colour features are continuously distributed over the LAB space. Gaussians allow us to model our classifications continuously, which leads to a more flexible model for annotation.
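As an illustration, the per-word Gaussian translation probability might be evaluated as follows; a diagonal covariance is assumed here purely for simplicity:

```python
import numpy as np

def gaussian_translation(b, mu_k, var_k):
    """t(b | w = k) under a per-word Gaussian with diagonal covariance.

    b:     feature vector of a blob
    mu_k:  per-dimension mean for word k
    var_k: per-dimension variance for word k
    (Illustrative parameterisation, not the authors' exact model.)
    """
    d = len(b)
    diff = b - mu_k
    exponent = -0.5 * np.sum(diff ** 2 / var_k)
    norm = np.sqrt((2.0 * np.pi) ** d * np.prod(var_k))
    return np.exp(exponent) / norm
```

With this density in hand, each word's mean and variance play the role that the discrete table entries t(b|w) play in the discrete model.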

We have a "chicken and egg" situation: to estimate the translation probabilities we need to know the correct correspondences, and to estimate the correspondences we need to know the translation probabilities. The Expectation-Maximization algorithm provides a satisfying solution because it quickly climbs the likelihood surface towards a local maximum. What makes this representation so handy is that we can run the training several times and choose the run that converges to the highest likelihood.
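The alternation between estimating correspondences and re-estimating translation probabilities can be sketched for the discrete model in the spirit of IBM Model 1. The function and data layout below are hypothetical, not the authors' implementation:

```python
import numpy as np

def em_step(images, t):
    """One EM iteration for the discrete word-to-blob table t.

    images: list of (blobs, words) pairs of discrete indices per image.
    t:      (num_words, num_blobs) array, t[w, b] = p(blob b | word w).
    Returns the re-estimated table.
    """
    counts = np.zeros_like(t)
    for blobs, words in images:
        for b in blobs:
            # E-step: posterior that each candidate label word generated blob b
            probs = np.array([t[w, b] for w in words])
            probs /= probs.sum()
            for w, p in zip(words, probs):
                counts[w, b] += p
    # M-step: renormalise the expected counts into probabilities
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    return counts / totals
```

Starting from a uniform table and iterating this step increases the likelihood monotonically until it settles at a local maximum, which is why several random restarts are worthwhile.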

An example of translation on the training data is given to the right.

The lion labeled on the trained model.

Once we have trained our probabilistic translation model using the process described above, we can use it to identify regions within an image.
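A minimal sketch of this region-labelling step, assuming a trained discrete table t and a word vocabulary (the names are illustrative only):

```python
import numpy as np

def annotate(blobs, t, vocabulary):
    """Label each blob with the word that best explains it.

    blobs:      list of discrete blob indices for one segmented image.
    t:          (num_words, num_blobs) array, t[w, b] = p(blob b | word w).
    vocabulary: list mapping word indices to English words.
    """
    return [vocabulary[int(np.argmax(t[:, b]))] for b in blobs]
```

Each region thus receives the word with the highest translation probability for its blob, which is what produces the labelled segmentations shown in the results.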

Below are some sample results. Note that the results exhibit some variance: training starts from a random initialization, and the EM algorithm only guarantees convergence to a local, not global, maximum of the likelihood, so the model can end up at a suboptimal solution. Results can therefore be somewhat difficult to predict. Another problem is that if the training data is biased towards certain images (say, it contains a lot of horses), then the model will place greater emphasis on the correspondences in those images and, as a result, will predict horses more often than other objects.

Here are some selected results for the discrete and continuous (Gaussian) translation models.

[1] Nando de Freitas and Kobus Barnard. Bayesian Latent Semantic Analysis of Multimedia Databases. UBC TR 2001-15.

[2] Pinar Duygulu, Kobus Barnard, Nando de Freitas and David Forsyth. Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary. ECCV 2002. Best paper prize in Cognitive Computer Vision.

[3] Peter Carbonetto, Nando de Freitas, Paul Gustafson and Natalie Thompson. Bayesian Feature Weighting for Unsupervised Learning, with Application to Object Recognition. AI-Stats 2003.