Next: Related research on Up: Three-Dimensional Object Recognition Previous: The SCERPO vision

Directions for future research

The most obvious direction in which to extend the current system is to generalize the object models to include many new types of visual knowledge. These extensions could include the modeling of movable articulations, optional components, and other variable parameters in the models. The section above on solving for spatial correspondence described methods for incorporating these extensions during the viewpoint-solving and matching process. However, further research is required to determine the optimal order in which to solve for individual parameters. Imagine, for example, that we had a generic model of the human face. The model would include small ranges of variation for the size and position of every feature, as well as many optional components such as a beard or glasses. However, given some tentative correspondences for, say, the eyes and nose, we could use the expectation of bilateral symmetry and the most tightly constrained dimensions of our model to solve for approximate viewpoint. This would then suggest quite tightly constrained regions in which to search for other features, such as ears, chin, eyebrows, etc., each of which could be used to derive better estimates of viewpoint and the other parameters. The resulting values of these parameters could then be mapped into a feature space and used to identify particular individuals, which in turn may lead to further detailed constraints and expectations. Some mechanisms for ordering these constraints were incorporated into the ACRONYM system [6].
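The face example can be sketched in miniature. The following is only an illustrative stand-in, not the method used in SCERPO: a least-squares 2-D similarity transform (the names `solve_similarity` and `predict_region` are invented here) replaces the full 3-D viewpoint solution described earlier, but it shows how a few tight correspondences yield both an approximate viewpoint and constrained search regions for the remaining features.

```python
def solve_similarity(model_pts, image_pts):
    """Least-squares similarity transform (scale, rotation, translation)
    mapping 2-D model points onto image points, via complex arithmetic."""
    m = [complex(x, y) for x, y in model_pts]
    p = [complex(x, y) for x, y in image_pts]
    mc, pc = sum(m) / len(m), sum(p) / len(p)   # centroids
    num = sum((pi - pc) * (mi - mc).conjugate() for mi, pi in zip(m, p))
    den = sum(abs(mi - mc) ** 2 for mi in m)
    c = num / den      # complex number encoding scale and rotation
    t = pc - c * mc    # translation
    return c, t

def predict_region(c, t, model_pt, model_tolerance):
    """Project an unmatched model feature into the image and return the
    centre and radius of a circular search region, with the model's
    positional tolerance scaled into image units by the solved transform."""
    z = c * complex(*model_pt) + t
    return (z.real, z.imag), abs(c) * model_tolerance
```

Matching, say, the two eyes and the nose tip fixes `c` and `t`; the chin, ears, and eyebrows would then be sought only inside the small predicted regions, which is exactly the ordering argued for above.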

The current implementation of SCERPO has used only an edge-based description of the image because that is a comparatively reliable and well-researched form of image analysis. But the same framework could incorporate many other dimensions of comparison between model and image, including areas such as surface modeling, texture, color, and shading properties. Further research would be required to detect viewpoint-invariant aspects of these properties during bottom-up image analysis. Many of the modeling and predictive aspects of these problems have been developed for use in computer graphics, but it may be necessary to find faster ways to perform these computations for use in a computer vision system.
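As one concrete, and deliberately simplified, example of the kind of invariant property such bottom-up analysis might extract: intensity-normalized chromaticity is unchanged whenever a change of viewpoint or shading merely scales all three color channels by a common factor. The function below is this sketch's own illustration, under that idealized assumption, and is not part of SCERPO.

```python
def chromaticity(r, g, b):
    """Intensity-normalized color coordinates.  Under the (idealized)
    assumption that viewpoint and shading changes only scale all three
    channels by a common factor, this pair is invariant, so it could be
    computed bottom-up and matched directly against a model's color."""
    s = r + g + b
    if s == 0:
        return (1 / 3, 1 / 3)   # convention for pure black
    return (r / s, g / s)
```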

Once a number of different sources of information are used to achieve the optimal ordering of the search process, it is necessary to make use of general methods for combining multiple sources of evidence. The use of evidential reasoning for this problem has been discussed elsewhere by the author in some detail [19, Chap. 6]. These methods make use of prior estimates for the probability of the presence of each object, and then update these estimates as each new source of evidence is uncovered. Sources of evidence might include particular perceptual groupings, colors, textures, and contextual information. Context plays an important role in general vision, since most scenes will contain some easily identified objects which then provide a great deal of information regarding size, location, and environment which can greatly ease the recognition of more difficult components of the scene. Evidential reasoning also provides an opportunity to incorporate a significant form of learning, since the conditional probability estimates can be continuously adjusted towards their optimal values as a system gains experience with its visual environment. In this way associations could be automatically created between particular objects and the viewpoint-invariant features to which they are likely to give rise in the image.
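The updating scheme can be sketched as sequential Bayesian updating in odds form. The helper names and numbers below are purely illustrative, and multiplying likelihood ratios assumes the evidence sources are conditionally independent, which real groupings and colors need not be:

```python
def update_odds(prior_odds, likelihood_ratios):
    """Start from the prior odds that an object is present and fold in
    each new source of evidence (a perceptual grouping, a color, a
    contextual cue) as a likelihood ratio
    P(evidence | object present) / P(evidence | object absent)."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds

def to_probability(odds):
    return odds / (1.0 + odds)

# A rare object (prior odds 1:100) supported by a distinctive grouping
# (ratio 50) and a matching color (ratio 4) becomes quite probable.
posterior = to_probability(update_odds(0.01, [50.0, 4.0]))
```

Learning then amounts to adjusting the stored likelihood ratios as the system gains experience, in the manner the paragraph above describes.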

The psychological implications of this research are also deserving of further study. Presumably human vision does not perform a serial search of the type used in the SCERPO system. Instead, the brief time required for typical instances of recognition indicates that any search over a range of possible objects and parameters must be occurring in parallel. Yet even the human brain does not contain enough computational power to search over every possible object at every viewpoint and position in the image. This can be demonstrated by the fact that even vague non-visual contextual clues can decrease the length of time required to recognize degraded images [17]. Presumably, if a complete search were being performed in every instance, any top-down clues that narrowed the search would have little effect. Given that the search is proceeding in parallel, the mechanisms used for ranking the search in SCERPO would instead be used to select a number of possibilities to explore in parallel, limited according to the available computational resources. This model for the recognition process suggests many psychophysical experiments in which average recognition times could be measured for different combinations of image data and contextual information. Some important experimental results relating recognition time to the availability of various image and contextual clues have been reported by Biederman [4].
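A minimal sketch of that resource-limited selection (the names and scores below are invented for illustration, drawn neither from SCERPO nor from any psychological model): rank the hypotheses by the same evidential scores used for serial ordering, then explore only as many in parallel as the available resources allow, with contextual clues re-weighting the ranking beforehand.

```python
import heapq

def select_for_parallel_search(scored_hypotheses, capacity):
    """Keep the `capacity` best-ranked object hypotheses; each would then
    be matched concurrently rather than in strict serial order."""
    return heapq.nlargest(capacity, scored_hypotheses, key=lambda h: h[1])

def apply_context(scored_hypotheses, context_boosts):
    """A contextual cue re-weights hypotheses before selection, which is
    one way top-down clues could shrink the effective search."""
    return [(name, score * context_boosts.get(name, 1.0))
            for name, score in scored_hypotheses]
```

For example, a weak "office" context boost can promote an otherwise low-ranked hypothesis into the small set that is explored at all, consistent with the finding that contextual clues speed recognition of degraded images.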







David Lowe
Fri Feb 6 14:13:00 PST 1998