Place and scene recognition from video

While navigating in an environment, a vision system has to be able to recognize where it is and what the main objects in the scene are. We present a context-based vision system for place and object recognition. The goal is to identify familiar locations (e.g., office 610, conference room 941, Main Street), to categorize new environments (office, corridor, street) and to use that information to provide contextual priors for object recognition (e.g., table, chair, car, computer). We have trained a system to recognize over 60 locations (indoors and outdoors) and to suggest the presence and locations of more than 20 different object types. The algorithm has been integrated into a mobile system that provides real-time feedback to the user.

As a test bed for the proposed approach, we use a helmet-mounted mobile system. It consists of a webcam set to capture 4 color images per second at a resolution of 120x160 pixels. The webcam is mounted on a helmet so that it follows the user's head movements as they explore the environment. The user receives feedback about system performance through a head-mounted display.

   
Kevin Murphy, Antonio Torralba

We use a low-dimensional global image representation that captures the "gist" of the scene. This representation is used as input to a Bayes net/HMM. (See our ICCV03 paper for details.)
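To make the HMM step concrete, here is a minimal sketch of forward filtering over places, written in Python. It is an illustration under stated assumptions, not the model from the paper: the function name `forward_filter`, the three-place setup, the transition matrix, and the likelihood values are all hypothetical. In the real system the per-frame likelihoods would come from an observation model over gist features, which this sketch does not attempt to reproduce.

```python
def forward_filter(prior, transition, likelihoods):
    """Return filtered posteriors P(place_t | observations_1..t).

    prior[i]          : P(place_1 = i)
    transition[i][j]  : P(place_t = j | place_{t-1} = i)
    likelihoods[t][j] : p(observation_t | place_t = j), up to a constant
    """
    n = len(prior)
    belief = list(prior)
    posteriors = []
    for t, lik in enumerate(likelihoods):
        if t > 0:
            # Predict: push the previous belief through the transition model.
            belief = [sum(belief[i] * transition[i][j] for i in range(n))
                      for j in range(n)]
        # Update: weight by the observation likelihood, then renormalize.
        unnorm = [belief[j] * lik[j] for j in range(n)]
        total = sum(unnorm)
        if total == 0:
            raise ValueError("all places have zero likelihood at step %d" % t)
        belief = [u / total for u in unnorm]
        posteriors.append(belief)
    return posteriors


# Toy example (hypothetical numbers): three places, "sticky" transitions.
prior = [1 / 3, 1 / 3, 1 / 3]
transition = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
# Two frames that look like place 0, then two that look like place 2.
likelihoods = [
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
    [0.05, 0.05, 0.90],
    [0.05, 0.05, 0.90],
]

posteriors = forward_filter(prior, transition, likelihoods)
most_likely = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
print(most_likely)
```

The transition model is what lets evidence accumulate over time: a single ambiguous frame does not immediately overturn the belief built up from earlier frames.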

Below we show the performance of place recognition on a sequence that starts indoors and then goes outdoors (ICCV03, Figure 3). Top: the solid line represents the true location, and the dots represent the posterior probability associated with each location. There are 63 possible locations, but we show only those with non-negligible probability mass. Middle: estimated category of each location. Bottom: estimated probability of being indoors or outdoors.
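One simple way to relate the three kinds of estimate is to sum a location posterior over groups of locations. The sketch below assumes each known location carries a category label and an indoor/outdoor label; the helper `marginalize`, the place "corridor A", the labels, and the probabilities are hypothetical, and the paper may compute the category and indoor/outdoor estimates differently.

```python
from collections import defaultdict


def marginalize(posterior, labels):
    """Sum a posterior over locations into a posterior over their labels."""
    totals = defaultdict(float)
    for location, prob in posterior.items():
        totals[labels[location]] += prob
    return dict(totals)


# Hypothetical posterior over four known locations at one time step.
location_posterior = {
    "office 610": 0.60,
    "conference room 941": 0.10,
    "corridor A": 0.25,
    "Main Street": 0.05,
}

# Assumed label tables; not taken from the dataset.
category_of = {
    "office 610": "office",
    "conference room 941": "conference room",
    "corridor A": "corridor",
    "Main Street": "street",
}
setting_of = {
    "office 610": "indoors",
    "conference room 941": "indoors",
    "corridor A": "indoors",
    "Main Street": "outdoors",
}

by_category = marginalize(location_posterior, category_of)
indoor_outdoor = marginalize(location_posterior, setting_of)
print(by_category)
print(indoor_outdoor)
```

Coarser labels pool probability mass from several locations, so the indoor/outdoor estimate can be confident even when the exact location is uncertain.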

Some images from the dataset.

Publications

Movies

Data