
Role of depth reconstruction in human vision

There is a widespread assumption within the computer vision and psychology communities that the recognition of three-dimensional objects is based on an initial derivation of depth measurements from the two-dimensional image [11,22]. However, this assumption seems to be based more upon the perceived difficulty of achieving recognition from a two-dimensional projection than upon any convincing psychological data. In this paper we will argue that human capabilities for recognition are much more general than this restricted model would suggest, and that in fact the need for depth reconstruction is the exception rather than the rule. It is true that human vision contains a number of capabilities for the bottom-up derivation of depth measurements, such as stereo and motion interpretation, and these presumably have important functions. However, biological visual systems have many objectives, so it does not follow that these components are central to the specific problem of visual recognition. In fact, the available evidence strongly suggests the opposite.

One difficulty with these methods for depth reconstruction is that the required inputs are often unavailable or require an unacceptably long interval of time to obtain. Stereo vision is only useful for objects within a restricted portion of the visual field and range of depths for any given degree of eye vergence, and is never useful for distant objects. At any moment, most parts of a scene will be outside of the limited fusional area. Motion information is available only when there is sufficient relative motion between observer and object, which in practice is also usually limited to nearby objects. Recognition times are usually so short that it seems unlikely that the appropriate eye vergence movements or elapsed time measurements could be taken prior to recognition even for those cases in which they may be useful. Depth measurements from shading or texture are apparently restricted to special cases such as regions of approximately uniform reflectance or regular texture, and they lack the quantitative accuracy or completeness of stereo or motion.

Secondly, human vision exhibits an excellent level of performance in recognizing images---such as simple line drawings---in which there is very little potential for the bottom-up derivation of depth information. Biederman [4] describes an experiment in which almost identical reaction times (about 800 ms) and error rates were obtained for recognition of line drawings as compared with full-color slides of the same objects from the same viewpoints. Whatever mechanisms are being used for line-drawing recognition have presumably developed from their use in recognizing three-dimensional scenes. The common assumption that line-drawing recognition is a learned or cultural phenomenon is not supported by the evidence. In a convincing test of this conjecture, Hochberg and Brooks [15] describe the case of a 19-month-old human baby who had had no previous exposure to any kind of two-dimensional image, yet was immediately able to recognize ordinary line drawings of known objects. It is true that there has been some research on the bottom-up derivation of depth directly from line drawings or the edges detected in a single image [2,3,27], including previous research by the author [20]. However, these methods usually lead to sparse, under-constrained relationships in depth rather than to something resembling Marr's 2½-D sketch. In addition, these methods apply only to special cases and it is often not possible to tell which particular inference applies to a particular case. For example, one often-discussed inference is the use of perspective convergence to derive the orientation of lines that are parallel in three dimensions; however, given a number of lines in the image that are converging to a common point, there is usually no effective way to distinguish convergence due to perspective effects from the equally common case of lines that are converging to a common point in three dimensions. In this paper we will make use of many of the same inferences that have previously been proposed for deriving depth, but they will instead be used to generate two-dimensional groupings in the image that are used directly to index into a knowledge base of three-dimensional objects.
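To make the convergence grouping concrete, the following minimal sketch (not from the paper; Python with NumPy, and the function name common_convergence_point is purely illustrative) tests whether a set of 2-D line segments, extended to full lines, pass near a single common point. Such a test yields a purely two-dimensional grouping; by itself it cannot decide whether the convergence arises from perspective projection of parallel 3-D lines or from lines that actually meet in three dimensions, which is exactly the ambiguity noted above.

<pre>
import numpy as np

def common_convergence_point(segments, tol=2.0):
    """Test whether 2-D line segments (extended to full lines)
    approximately converge to a single common point.

    segments : array-like of shape (N, 2, 2), each row ((x0, y0), (x1, y1)).
    tol      : maximum acceptable RMS perpendicular distance in pixels.

    Returns (point, rms, converges).
    """
    segments = np.asarray(segments, dtype=float)
    p0, p1 = segments[:, 0], segments[:, 1]
    d = p1 - p0
    n = np.stack([-d[:, 1], d[:, 0]], axis=1)        # line normals
    n /= np.linalg.norm(n, axis=1, keepdims=True)    # unit length
    b = np.einsum('ij,ij->i', n, p0)                 # n_i . p0_i
    # Least-squares intersection: minimize the sum of squared
    # perpendicular distances from the point to all extended lines.
    point, *_ = np.linalg.lstsq(n, b, rcond=None)
    rms = np.sqrt(np.mean((n @ point - b) ** 2))
    return point, rms, rms < tol

# Example: three segments whose extensions all pass through (100, 50).
segs = [((0, 0), (20, 10)), ((0, 20), (20, 26)), ((0, -30), (20, -14))]
point, rms, ok = common_convergence_point(segs)
# point is approximately (100, 50) and ok is True
</pre>

The least-squares formulation is one simple way to score how tightly a candidate group converges; a grouping stage could rank such candidates and pass the best ones on to indexing, without ever committing to a depth interpretation.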

Finally, and of most relevance for many applications of computer vision, there has been no clear demonstration of the value of depth information for performing recognition even when it is available. The recognition of objects from complete depth images, such as those produced by a laser scanner, has not been shown to be easier than for systems that begin only with the two-dimensional image. This paper will describe methods for directly comparing the projection of three-dimensional representations to the two-dimensional image without the need for any prior depth information. Since the final verification of an interpretation can be performed by comparing the projected knowledge with each available image to the full accuracy of the data, there is nothing to be gained at this stage from any depth information that is derived from the original images. The one remaining issue is whether there is some way in which the depth information can significantly speed the search for the correct knowledge to compare to the image.
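As an illustration of verification by projection against the 2-D image alone, here is a minimal sketch (hypothetical, not the paper's implementation) that projects 3-D model points through an assumed pinhole camera at a hypothesized pose and measures the image-plane residual against matched image features. The pose parameters and point correspondences are simply assumed given here; solving for them is the subject of the following sections.

<pre>
import numpy as np

def project_points(model_pts, R, t, f):
    """Project 3-D model points into the image with a simple pinhole camera.

    model_pts : (N, 3) array of model coordinates.
    R, t      : hypothesized rotation (3x3) and translation (3,) taking
                model coordinates into camera coordinates.
    f         : focal length in pixel units.
    """
    cam = model_pts @ R.T + t            # model -> camera frame
    return f * cam[:, :2] / cam[:, 2:3]  # perspective division onto image plane

def verification_error(model_pts, image_pts, R, t, f):
    """RMS image-plane distance between projected model features and the
    matched image features; a small residual, checked to the full accuracy
    of the image data, supports the hypothesized interpretation."""
    proj = project_points(model_pts, R, t, f)
    return np.sqrt(np.mean(np.sum((proj - image_pts) ** 2, axis=1)))
</pre>

Because the comparison is carried out entirely in the image plane, any depth map derived from the same image would add nothing to this final check; its only possible contribution would be to narrow the search for candidate models and poses.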

Of course, none of the above is meant to imply that depth recovery is an unimportant problem or lacks a significant role in human vision. Depth information may be crucial for the initial stages of visual learning or for acquiring certain types of knowledge about unfamiliar structures. It is also clearly useful for making precise measurements as an aid to manipulation or obstacle avoidance. Recognition may sometimes leave the precise position in depth undetermined if the absolute size of an object is unknown. Human stereo vision, with its narrow fusional range for a given degree of eye vergence, seems to be particularly suited to making these precise depth measurements for selected nearby objects as an aid to manipulation and bodily interaction. However, it seems likely that the role of depth recovery in common instances of recognition has been overstated.






