Next: Role of depth Up: Three-Dimensional Object Recognition Previous: Three-Dimensional Object Recognition

Introduction

Much recent research in computer vision has been aimed at the reconstruction of depth information from the two-dimensional visual input. An assumption underlying some of this research is that the recognition of three-dimensional objects can most easily be carried out by matching against reconstructed three-dimensional data. However, there is reason to believe that this is not the primary pathway used for recognition in human vision and that most practical applications of computer vision could similarly be performed without bottom-up depth reconstruction. Although depth measurement has an important role in certain visual problems, it is often unavailable or is expensive to obtain. General-purpose vision must be able to function effectively even in the absence of the extensive information required for bottom-up reconstruction of depth or other physical properties. In fact, human vision does function very well at recognizing images, such as simple line drawings, that lack any reliable clues for the reconstruction of depth prior to recognition. This capability also parallels many other areas in which human vision can make use of partial and locally ambiguous information to achieve reliable identifications. This paper presents several methods for bridging the gap between two-dimensional images and knowledge of three-dimensional objects without any preliminary derivation of depth. Of equal importance, these methods address the critical problem of robustness, with the ability to function in spite of missing data, occlusion, and many forms of image degradation.

How is it possible to recognize an object from its two-dimensional projection when we have no prior knowledge of the viewpoint from which we will be seeing it? An important role is played by the process of perceptual organization, which detects groupings and structures in the image that are likely to be invariant over wide ranges of viewpoints. While it is true that the appearance of a three-dimensional object can change completely as it is viewed from different viewpoints, it is also true that many aspects of an object's projection remain invariant over large ranges of viewpoints (examples include instances of connectivity, collinearity, parallelism, texture properties, and certain symmetries). It is the role of perceptual organization to detect those image groupings that are unlikely to have arisen by accident of viewpoint or position. Once detected, these groupings can be matched to corresponding structures in the objects through a knowledge-based matching process. It is possible to use probabilistic reasoning to rank the potential matches in terms of their predicted reliability, thereby focusing the search on the most reliable evidence present in a particular image.
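The ranking idea can be illustrated with a small sketch. It is not the grouping procedure of the system described in this paper. It assumes only that line-segment orientations are independent and uniformly distributed, so that the chance of two unrelated segments agreeing in orientation to within an angle d is d / (pi/2). The function names and sample segments are hypothetical.

```python
import math
from itertools import combinations

def orientation(seg):
    """Orientation of an undirected segment, in [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def angle_difference(a, b):
    """Smallest angle between two undirected segments, in [0, pi/2]."""
    d = abs(orientation(a) - orientation(b))
    return min(d, math.pi - d)

def accidental_probability(a, b):
    """Probability that two unrelated segments are at least this parallel,
    assuming (for illustration) uniformly random orientations."""
    return angle_difference(a, b) / (math.pi / 2)

def rank_parallel_pairs(segments):
    """Return (probability, i, j) triples, least likely accidental first."""
    scored = [
        (accidental_probability(segments[i], segments[j]), i, j)
        for i, j in combinations(range(len(segments)), 2)
    ]
    return sorted(scored)

# Hypothetical image segments, each given by its two endpoints.
segments = [
    ((0, 0), (10, 0)),    # horizontal
    ((0, 2), (10, 2.1)),  # nearly horizontal
    ((3, 3), (4, 9)),     # steep
]
ranking = rank_parallel_pairs(segments)
best_probability, best_i, best_j = ranking[0]
```

The pair at the head of the ranking is the one least likely to be parallel by accident, so it would be the first candidate passed to matching. A fuller treatment would also account for proximity and segment length, which this sketch ignores.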

Figure 1: On the left is a diagram of a commonly accepted model for visual recognition based upon depth reconstruction. This paper instead presents the model shown on the right, which makes use of prior knowledge of objects and accurate verification to interpret otherwise ambiguous image data.

Unfortunately, the matches based on viewpoint-invariant aspects of each object are by their nature only partially reliable. Therefore, they are used simply as "trigger features" to initiate a search process and viewpoint-dependent analysis of each match. A quantitative method is used to simultaneously determine the best viewpoint and object parameter values for fitting the projection of a three-dimensional model to given two-dimensional features. This method allows a few initial hypothesized matches to be extended by making accurate quantitative predictions for the locations of other object features in the image. This provides a highly reliable method for verifying the presence of a particular object, since it can make use of the spatial information in the image to the full degree of available resolution. The final judgement as to the presence of the object can be based on only a subset of the predicted features, since the problem is usually greatly overconstrained due to the large number of visual predictions from the model compared to the number of free parameters. Figure 1 shows a diagram of these components and contrasts them with methods based upon depth reconstruction.
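The fitting and prediction steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulation used in this paper: it assumes a weak-perspective camera with six parameters (three rotation angles, an image translation, and a scale), a finite-difference Jacobian, and Gauss-Newton iteration with step halving. All names and the synthetic data are hypothetical.

```python
import math

def project(params, point):
    """Weak-perspective projection of a 3D model point (assumed camera model)."""
    rx, ry, rz, tx, ty, s = params
    x, y, z = point
    # Rotate about the x axis, then y, then z.
    y, z = y * math.cos(rx) - z * math.sin(rx), y * math.sin(rx) + z * math.cos(rx)
    x, z = x * math.cos(ry) + z * math.sin(ry), -x * math.sin(ry) + z * math.cos(ry)
    x, y = x * math.cos(rz) - y * math.sin(rz), x * math.sin(rz) + y * math.cos(rz)
    return (s * x + tx, s * y + ty)

def residuals(params, model, image):
    """Stacked differences between projected model points and image points."""
    out = []
    for p, (u, v) in zip(model, image):
        pu, pv = project(params, p)
        out += [pu - u, pv - v]
    return out

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_viewpoint(model, image, params, iterations=20, h=1e-6):
    """Gauss-Newton fit of the viewpoint parameters to matched features."""
    params = list(params)
    n = len(params)
    for _ in range(iterations):
        r = residuals(params, model, image)
        err = sum(v * v for v in r)
        # Finite-difference Jacobian, stored as one column per parameter.
        cols = []
        for j in range(n):
            bumped = params[:]
            bumped[j] += h
            rb = residuals(bumped, model, image)
            cols.append([(b - a) / h for a, b in zip(r, rb)])
        jtj = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(n)]
               for i in range(n)]
        jtr = [-sum(p * q for p, q in zip(cols[i], r)) for i in range(n)]
        step = solve(jtj, jtr)
        # Halve the step until the squared error does not increase.
        scale = 1.0
        while scale > 1e-4:
            trial = [p + scale * d for p, d in zip(params, step)]
            if sum(v * v for v in residuals(trial, model, image)) <= err:
                params = trial
                break
            scale /= 2
    return params

# Synthetic example: four matched model points fix the viewpoint.
matched = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
true_params = [0.2, -0.15, 0.3, 0.5, -0.4, 1.2]
image = [project(true_params, p) for p in matched]
estimate = fit_viewpoint(matched, image, [0, 0, 0, 0, 0, 1])

# Predict where a further, unmatched model feature should appear.
predicted = project(estimate, (1, 1, 1))
expected = project(true_params, (1, 1, 1))
```

Four matched points give eight equations for six unknowns, so even this small case is overconstrained. The predicted location of the fifth point is the kind of quantitative prediction that can then be checked against the image to extend or reject the hypothesized match.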

These methods for achieving recognition have been combined in a functioning vision system named SCERPO (for Spatial Correspondence, Evidential Reasoning, and Perceptual Organization). This initial implementation uses simplified components at a number of levels, for example by performing matching only between straight line segments rather than arbitrary curves. However, even this initial system exhibits many forms of robustness, with the ability to identify objects from any viewpoint in the face of partial occlusion, missing features, and a complex background of unrelated image features. The current system has a significant level of performance relative to other model-based vision systems, including those that are based upon the accurate derivation of depth measurements prior to recognition. In addition, it provides a framework for incorporating numerous potential improvements in image description, perceptual grouping, knowledge indexing, object modeling, and parameter solving, with resulting potential improvements in performance. Many of the components of this system can be related to corresponding capabilities of human vision, as will be described at the relevant points in this paper. The following section examines the psychological validity of the central goal of the system, which is to perform recognition directly from single two-dimensional images.


David Lowe
Fri Feb 6 14:13:00 PST 1998