
Solving for spatial correspondence

Many areas of artificial intelligence are aimed at the interpretation of data by finding consistent correspondences between the data and prior knowledge of the domain. In this paper, we will begin by defining the consistency conditions for judging correspondence between image data and three-dimensional knowledge. Unlike many other areas of artificial intelligence, an important component of this knowledge is quantitative spatial information that requires specific mathematical techniques for achieving correspondence. The particular constraint that we wish to apply can be stated as follows:

The viewpoint consistency constraint: The locations of all projected model features in an image must be consistent with projection from a single viewpoint.
The effective application of this constraint will allow a few initial matches to be bootstrapped into quantitative predictions for the locations of many other features, leading to the reliable verification or rejection of the initial matches. Later sections of this paper will deal with the remaining problems of recognition, which involve the generation of primitive image structures to provide initial matches to the knowledge base, as well as the algorithms and control structures needed to actually perform the search process.

The physical world is three-dimensional, but a camera's image contains only a two-dimensional projection of this reality. It is straightforward mathematically to describe the process of projection from a three-dimensional scene model to a two-dimensional image, but the inverse problem is considerably more difficult. It is common to remark that this inverse problem is underconstrained, but this is hardly the source of difficulty in the case of visual recognition. In the typical instance of recognition, the combination of image data and prior knowledge of a three-dimensional model results in a highly overconstrained solution for the unknown projection and model parameters. In fact, we must rely upon the overconstrained nature of the problem to make recognition robust in the presence of missing image data and measurement errors. So it is not the lack of constraints but rather their interdependent and non-linear nature that makes the problem of recovering viewpoint and scene parameters difficult. The difficulty of this problem has been such that few vision systems have made use of quantitative spatial correspondence between three-dimensional models and the image. Instead it has been common to rely on qualitative topological forms of correspondence, or else to produce three-dimensional depth measurements that can be matched directly to the model without having to account for the projection process.

Our goal, then, is to carry out a quantitative form of spatial reasoning to provide a two-way link between image measurements and the object model. Matches between the model and some image features can be used to constrain the three-dimensional position of the model and its components, which in turn leads to further predictions for the locations of model features in the image, leading to more matches and more constraints. The problem of generating the few initial matches to begin this process will be dealt with in later sections of this paper. Here we will describe the spatial reasoning process that relates the image measurements to three-dimensional constraints.

The precise problem we wish to solve is the following: given a set of known correspondences between three-dimensional points on a model and points in a two-dimensional image, what are the values of the unknown projection and model parameters that will result in the projection of the given model points into the corresponding image points. The unknown parameters might include the position and orientation of the object in three dimensions, the focal length of the camera, and various degrees of freedom in the model description, such as articulations or variable dimensions. We will later extend this problem description to allow for the least-squares solution of overdetermined constraints, and to allow for matches between corresponding lines (without concern for the position of endpoints) rather than just points.

There has been some previous work on solving for the position of a rigid three-dimensional object given matches between points on the model and points in the image. This problem must be solved in order to bring photographs into correspondence with mapping data in the field of photogrammetry. An analytic solution, known as the Church method, is presented in the photogrammetry literature [31], but this solution involves nonlinear equations which must themselves be solved by iterative numerical methods. The current preferred technique in photogrammetry is to use the same linearization and iterative methods that will serve as our starting point in this paper. Fischler and Bolles [9] have presented another analytic solution to the problem, but one which also requires iterative numerical solution. In addition, they have presented useful information on the conditions under which multiple solutions can arise. A different approach was taken in the ACRONYM computer vision system [6], which used a general symbolic equation solver to place bounds on projection and model parameters from image measurements. However, the equations for the general projection problem were often much too difficult for exact solution, so sub-optimal bounds would be produced that failed to apply the information inherent in the viewpoint consistency constraint.

  
Figure 2: Three steps in the application of Newton's method to achieve spatial correspondence between 2-D image segments and the projection of a 3-D object model.

The approach taken in this paper will be to linearize the projection equations and apply Newton's method for the necessary number of iterations. A reparameterization of these equations is used to simplify the calculation of the necessary derivatives. This allows us to efficiently solve not only the basic rigid-body problem studied in photogrammetry, but also to extend the method to variable model parameters and forms of correspondence other than the matching of points. Figure 2 illustrates the sequence of steps involved in applying Newton's method to this problem.

Application of Newton's method

Following standard practice in computer graphics, we can describe the projection of a three-dimensional model point $\mathbf{p}$ into a two-dimensional image point $(u, v)$ with the following equations:

$(x, y, z) = R\mathbf{p} + \mathbf{t}$

$(u, v) = \left( \frac{fx}{z},\; \frac{fy}{z} \right)$

where $\mathbf{t}$ is a 3-D translation vector and $R$ is a rotation matrix which transform the model point $\mathbf{p}$ in the original model coordinates into a point $(x, y, z)$ in camera-centered coordinates. These are combined in the second equation with a parameter $f$ proportional to the camera focal length to perform perspective projection into an image point $(u, v)$.
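For concreteness, these two projection equations can be sketched in a few lines of Python (only an illustration; the function name project_point and the example values are placeholders, not part of the original implementation):

    import numpy as np

    def project_point(p, R, t, f):
        # Transform the model point into camera-centered coordinates, then
        # apply perspective projection with the focal-length parameter f.
        x, y, z = R @ p + t
        return f * x / z, f * y / z

    # Example: identity rotation, object 10 units in front of the camera.
    print(project_point(np.array([1.0, 2.0, 0.0]), np.eye(3), np.array([0.0, 0.0, 10.0]), 1.0))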

Our task is to solve for $\mathbf{t}$, $R$, and possibly $f$, given a number of model points and their corresponding locations in an image. In order to apply Newton's method, we must be able to calculate the partial derivatives of u and v with respect to each of the unknown parameters. However, it is difficult to calculate these partial derivatives for this standard form of the projection equation. In particular, this formulation does not describe how to represent the rotation R in terms of its three underlying parameters. Many previous attempts to solve the viewpoint determination problem have treated the rotation as consisting of more than three parameters, which leads to the requirement for more image data than is actually needed and to poor techniques for handling errors.

The partial derivatives with respect to the translation parameters can be most easily calculated by first reparameterizing the projection equations to express the translations in terms of the camera coordinate system rather than model coordinates. This can be described by the following equations:

$(x, y, z) = R\mathbf{p}$

$(u, v) = \left( \frac{fx}{z + D_z} + D_x,\; \frac{fy}{z + D_z} + D_y \right)$

Here the variables $R$ and $f$ remain the same as in the previous transform, but the vector $\mathbf{t}$ has been replaced by the parameters $D_x$, $D_y$ and $D_z$. The two transforms are equivalent [under an affine approximation] when

$D_z = t_z, \quad D_x = \frac{f\,t_x}{t_z}, \quad D_y = \frac{f\,t_y}{t_z}.$

In the new parameterization, $D_x$ and $D_y$ simply specify the location of the object on the image plane and $D_z$ specifies the distance of the object from the camera. As will be shown below, this formulation makes the calculation of partial derivatives with respect to the translation parameters almost trivial.
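A corresponding sketch of the reparameterized projection, with the translation expressed by $D_x$, $D_y$ and $D_z$ in camera coordinates (again only an illustrative placeholder, not the original code):

    import numpy as np

    def project_point_camera_params(p, R, Dx, Dy, Dz, f):
        # Rotation only; the translation now lives in the camera coordinate system.
        x, y, z = R @ p
        return (f * x / (z + Dz) + Dx,   # Dx, Dy place the object on the image plane
                f * y / (z + Dz) + Dy)   # Dz sets its distance from the camera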

We are still left with the problem of representing the rotation R in terms of its three underlying parameters. Our solution to this second problem is based on the realization that the Newton method does not in fact require an explicit representation of the individual parameters. All that is needed is some way to modify the original rotation in mutually orthogonal directions and a way to calculate partial derivatives of image features with respect to these correction parameters. Therefore, we have chosen to take the initial specification of R as given and add to it incremental rotations $\phi_x$, $\phi_y$ and $\phi_z$ about the x, y and z axes of the current camera coordinate system. In other words, we maintain R in the form of a 3x3 matrix rather than in terms of 3 explicit parameters, and the corrections are performed by prefix matrix multiplication with correction rotation matrices rather than by adding corrections to some underlying parameters. It is fast and easy to compose rotations, and these incremental rotations are approximately independent of one another if they are small. The Newton method is now carried out by calculating the optimum correction rotations $\Delta\phi_x$, $\Delta\phi_y$ and $\Delta\phi_z$ to be made about the camera-centered axes. The actual corrections are performed by creating matrices for rotations of the given magnitudes about the respective coordinate axes and composing these new rotations with R.
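This update step can be sketched as follows, assuming small correction angles in radians (helper names are placeholders; the composition uses standard counterclockwise rotation matrices):

    import numpy as np

    def rot_x(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

    def rot_y(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

    def rot_z(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

    def apply_corrections(R, dphi_x, dphi_y, dphi_z):
        # Prefix multiplication: the corrections are rotations about the
        # current camera-centered axes, composed with the existing estimate R.
        return rot_z(dphi_z) @ rot_y(dphi_y) @ rot_x(dphi_x) @ R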

 
Figure 3: The partial derivatives of x, y and z (the coordinates of rotated model points) with respect to counterclockwise rotations $\phi$'s (in radians) about the coordinate axes.

Another advantage of using the $\phi$'s as our convergence parameters is that the derivatives of x, y, and z (and therefore of u and v) with respect to them can be expressed in a strikingly simple form. For example, the derivative of x at a point with respect to a counter-clockwise rotation of $\phi_z$ about the z axis is simply $-y$. This follows from the fact that $x = r\cos\phi_z$, where r is the distance of the point from the z axis, and therefore $\partial x / \partial \phi_z = -r\sin\phi_z = -y$. The table in Figure 3 gives these derivatives for all combinations of variables.
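Since the printed table is not reproduced here, the nine derivatives implied by this counterclockwise convention can be listed directly (the dictionary layout below is only an illustrative form):

    # Derivatives of the rotated coordinates (x, y, z) with respect to small
    # counterclockwise rotations phi_x, phi_y, phi_z, evaluated at a point (x, y, z).
    # The convention matches dx/dphi_z = -y as stated above.
    def rotation_derivatives(x, y, z):
        return {
            ("x", "phi_x"): 0.0, ("y", "phi_x"): -z,  ("z", "phi_x"): y,
            ("x", "phi_y"): z,   ("y", "phi_y"): 0.0, ("z", "phi_y"): -x,
            ("x", "phi_z"): -y,  ("y", "phi_z"): x,   ("z", "phi_z"): 0.0,
        }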

Given this parameterization it is now straightforward to accomplish our original objective of calculating the partial derivatives of u and v with respect to each of the original camera parameters. For example, our new projection transform above tells us that:

$u = \frac{fx}{z + D_z} + D_x$

so

$\frac{\partial u}{\partial D_x} = 1.$

Also,

$\frac{\partial u}{\partial \phi_x} = \frac{f}{z + D_z}\,\frac{\partial x}{\partial \phi_x} - \frac{fx}{(z + D_z)^2}\,\frac{\partial z}{\partial \phi_x}$

but, from the table in Figure 3, we know that

$\frac{\partial x}{\partial \phi_x} = 0 \quad \text{and} \quad \frac{\partial z}{\partial \phi_x} = y$

and, for simplicity, we will substitute

$c = \frac{f}{z + D_z}$

giving,

$\frac{\partial u}{\partial \phi_x} = \frac{-cxy}{z + D_z}.$

Similarly,

$\frac{\partial v}{\partial \phi_x} = -c\left( z + \frac{y^2}{z + D_z} \right).$
All the other derivatives can be calculated in a similar way. The table in Figure 4 gives the derivatives of u and v with respect to each of the seven parameters of our camera model, again substituting $c = f/(z + D_z)$ for simplicity.
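Putting the chain rule together with the rotation derivatives above gives one row of derivatives for u and one for v per model point. A sketch of these rows (parameter order $D_x, D_y, D_z, \phi_x, \phi_y, \phi_z$; the focal-length column is omitted, and the function name is only a placeholder):

    import numpy as np

    def jacobian_rows(p, R, Dz, f):
        # Partial derivatives of u and v with respect to (Dx, Dy, Dz, phi_x, phi_y, phi_z),
        # using the substitution c = f / (z + Dz).
        x, y, z = R @ p
        c = f / (z + Dz)
        du = np.array([1.0, 0.0, -c * x / (z + Dz),
                       -c * x * y / (z + Dz),
                       c * (z + x * x / (z + Dz)),
                       -c * y])
        dv = np.array([0.0, 1.0, -c * y / (z + Dz),
                       -c * (z + y * y / (z + Dz)),
                       c * x * y / (z + Dz),
                       c * x])
        return du, dv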

 
Figure 4: The partial derivatives of u and v with respect to each of the camera viewpoint parameters.

Our task on each iteration of the multi-dimensional Newton convergence will be to solve for a vector of corrections

$\left[ \Delta D_x,\; \Delta D_y,\; \Delta D_z,\; \Delta\phi_x,\; \Delta\phi_y,\; \Delta\phi_z \right]$

If the focal length is unknown, then $\Delta f$ would also be added to this vector. Given the partial derivatives of u and v with respect to each variable parameter, the application of Newton's method is straightforward. For each point in the model which should match against some corresponding point in the image, we first project the model point into the image using the current parameter estimates and then measure the error in its position compared to the given image point. The u and v components of the error can be used independently to create separate linearized constraints. For example, making use of the u component of the error, $E_u$, we create an equation which expresses this error as the sum of the products of its partial derivatives times the unknown error-correction values:

$\frac{\partial u}{\partial D_x}\Delta D_x + \frac{\partial u}{\partial D_y}\Delta D_y + \frac{\partial u}{\partial D_z}\Delta D_z + \frac{\partial u}{\partial \phi_x}\Delta\phi_x + \frac{\partial u}{\partial \phi_y}\Delta\phi_y + \frac{\partial u}{\partial \phi_z}\Delta\phi_z = E_u$
Using the same point we create a similar equation for its v component, so for each point correspondence we derive two equations. From three point correspondences we can derive six equations and produce a complete linear system which can be solved for all six camera model corrections. After each iteration the corrections should shrink by about one order of magnitude, and no more than a few iterations should be needed even for high accuracy.
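A sketch of one such iteration from exactly three point correspondences, reusing the projection and derivative expressions above (all names are placeholders for illustration):

    import numpy as np

    def newton_step(points, targets, R, Dx, Dy, Dz, f):
        # Build the six linearized constraints and solve for the six corrections
        # (dDx, dDy, dDz, dphi_x, dphi_y, dphi_z).
        A, b = [], []
        for p, (u_img, v_img) in zip(points, targets):
            x, y, z = R @ p
            c = f / (z + Dz)
            u, v = c * x + Dx, c * y + Dy          # predicted image location
            du = [1, 0, -c * x / (z + Dz), -c * x * y / (z + Dz),
                  c * (z + x * x / (z + Dz)), -c * y]
            dv = [0, 1, -c * y / (z + Dz), -c * (z + y * y / (z + Dz)),
                  c * x * y / (z + Dz), c * x]
            A += [du, dv]
            b += [u_img - u, v_img - v]            # measured errors E_u and E_v
        return np.linalg.solve(np.array(A, float), np.array(b, float))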

Unknown model parameters, such as variable lengths or angles, can also be incorporated. In the worst case, we can always calculate the partial derivatives with respect to these parameters by using standard numerical techniques that slightly perturb the parameters and measure the resulting change in projected locations of model points. However, in most cases it is possible to specify the three-dimensional directional derivative of model points with respect to the parameters, and these can be translated into image derivatives through projection. Examples of the solution for variable model parameters simultaneously with solving for viewpoint have been presented in previous work [18].
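A sketch of the fallback numerical approach for a single variable model parameter q, assuming a caller-supplied function model(q) that returns the 3-D model points for a given parameter value (all names are hypothetical):

    import numpy as np

    def numeric_param_column(model, q, R, Dx, Dy, Dz, f, eps=1e-5):
        # Finite-difference derivatives of every projected (u, v) with respect to q.
        def project_all(points):
            out = []
            for p in points:
                x, y, z = R @ p
                out += [f * x / (z + Dz) + Dx, f * y / (z + Dz) + Dy]
            return np.array(out)
        # Perturb the parameter slightly and measure the change in projected locations.
        return (project_all(model(q + eps)) - project_all(model(q - eps))) / (2 * eps)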

In most applications of this method we will be given more correspondences between model and image than are strictly necessary, and we will want to perform some kind of best fit. In this case the Gauss least-squares method can easily be applied. The matrix equations described above can be expressed more compactly as

$J\mathbf{h} = \mathbf{e}$

where $J$ is the Jacobian matrix containing the partial derivatives, $\mathbf{h}$ is the vector of unknown corrections for which we are solving, and $\mathbf{e}$ is the vector of errors measured in the image. When this system is overdetermined, we can perform a least-squares fit of the errors simply by solving the corresponding normal equations:

$J^{\mathrm{T}}J\mathbf{h} = J^{\mathrm{T}}\mathbf{e}$

where $J^{\mathrm{T}}J$ is square and has the correct dimensions for the vector $\mathbf{h}$.
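In code, the overdetermined case reduces to a couple of lines (a sketch; numpy.linalg.lstsq would give the same least-squares answer with better numerical behavior):

    import numpy as np

    def least_squares_corrections(J, e):
        # Solve the normal equations J^T J h = J^T e for the correction vector h.
        return np.linalg.solve(J.T @ J, J.T @ e)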

Making use of line-to-line correspondences

Another important extension to the basic algorithm is to allow it to use line-to-line correspondences in addition to point-to-point ones. This is important in practice because low-level vision routines are relatively good at finding the transverse locations of lines but are much less certain about exactly where the lines terminate. Line terminations may also be widely displaced due to occlusion, shadows, or various sources of failure in the low-level edge detection algorithms. Therefore, we should express our errors in terms of the distance of one line from another, rather than in terms of the error in the locations of points. The solution is to measure as our errors the perpendicular distance of selected points on the model line from the corresponding line in the image, and to then take the derivatives in terms of this distance rather than in terms of u and v.

In order to express the perpendicular distance of a point from a line it is useful to first express the image line as an equation of the following form, in which m is the slope:

$\frac{-mu + v}{\sqrt{m^2 + 1}} = d$

In this equation d is the perpendicular distance of the line from the origin. If we substitute some point $(u_0, v_0)$ into the left side of the equation and calculate the new value of d for this point (call it $d'$), then the perpendicular distance of this point from the line is simply $d' - d$. What is more, it is easy to calculate the derivatives of $d'$ for use in the convergence, since the derivatives of $d'$ are just a linear combination of the derivatives of u and v as given in the above equation, and we already know how to calculate the u and v derivatives from the solution given for using point correspondences. The result is that each line-to-line correspondence we are given between model and image gives us two equations for our linear system---the same amount of information that is conveyed by a point-to-point correspondence. Figure 2 illustrates the measurement of these perpendicular errors between matching model lines and image lines.
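A sketch of this error measure and of its derivatives for one selected point on the projected model line, given the slope m and offset d of the matching image line (function names are placeholders):

    import numpy as np

    def perpendicular_error(u, v, m, d):
        # Signed perpendicular distance of the projected point (u, v) from the image line.
        d_point = (-m * u + v) / np.sqrt(m * m + 1.0)
        return d_point - d

    def perpendicular_error_derivs(du, dv, m):
        # The derivatives of the error are a fixed linear combination of the
        # u and v derivatives already computed for point correspondences.
        return (-m * du + dv) / np.sqrt(m * m + 1.0)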

The same basic method could be used to extend the matching to arbitrary curves rather than just straight line segments. In this case, we would assume that the curves are locally linear and would minimize the perpendicular separation at selected points as in the straight line case, using iteration to compensate for non-linearities. For curves that are curving tightly with respect to the current error measurements, it would be necessary to match points on the basis of orientation and curvature in addition to simple perpendicular projection. However, the current implementation of SCERPO is limited to the matching of straight line segments, so these extensions to include the matching of arbitrary curves remain to be implemented.

The use of parameter determination for matching

The mathematical methods for parameter determination presented above need to be integrated into a matching algorithm that can extend a few initial matches and return a reliable answer as to the presence or absence of the object at the hypothesized location. Some of the implementation details that must be worked out include the choice of a representation for object models, the calculation of a starting viewpoint to initiate Newton iteration, and the methods for making use of the initial parameter determination to generate new matches between image and object features.

The object models used in the current implementation of SCERPO consist simply of a set of 3-D line segments. A primitive form of hidden line elimination is performed by attaching a visibility specification to each line segment. The visibility specification contains a boolean combination of hemispheres from which the segment is visible. Each hemisphere of directions is represented by a unit vector, so that the visibility of the segment can be very efficiently computed by simply checking the sign of the dot product of this vector with the vector pointing to the camera position. It is important that the visibility calculation be fast, as it is performed in the inner loop of the matching process. However, this form of hidden line elimination is only an approximation, since it does not allow for partial occlusion of a line segment or for calculating occlusion between objects. It would also need to be extended to allow for models with variable internal parameters. As with other parts of the matching process, we rely upon the overall robustness of the system in the face of missing or extra features to compensate for occasional errors in hidden-line determination.
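A sketch of this visibility test for the simple case of a single hemisphere per segment (boolean combinations of hemispheres add only a little more logic; names are placeholders):

    import numpy as np

    def segment_visible(hemisphere_normal, camera_direction):
        # hemisphere_normal: unit vector defining the hemisphere of viewpoints
        #                    from which the segment is visible
        # camera_direction:  unit vector from the object toward the camera
        return float(np.dot(hemisphere_normal, camera_direction)) > 0.0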

As with all applications of Newton's method, convergence of the viewpoint solving algorithm is guaranteed only for starting positions that are ``sufficiently close'' to the final solution. Fortunately, in this problem several of the parameters (scaling and translation in the image plane) are exactly linear, while the other parameters (rotation and perspective effects) are approximately linear over wide ranges of values. In practice, we have found that as long as the orientation parameters are within 60 degrees of their correct values, almost any values can be chosen for the other parameters. Reasonable estimates of the viewpoint parameters can be easily computed from the same matches that will be used for convergence. Since most parts of the model are only visible from a limited range of viewpoints, we can select an orientation in depth that is consistent with this range for each object feature being matched. Orientation in the image plane can be estimated by causing any line on the model to project to a line with the orientation of the matching line in the image. Similarly, translation can be estimated by bringing one model point into correspondence with its matching image point. Scale (i.e., distance from the camera) can be estimated by examining the ratios of lengths of model lines to their corresponding image lines. The lines with the minimum value of this ratio are those that are most parallel to the image plane and have been most completely detected, so this ratio can be used to roughly solve for scale under this assumption. These estimates for initial values of the viewpoint parameters are all fast to compute and produce values that are typically much more accurate than needed to assure convergence. Further improvements in the range of convergence could be achieved through the application of standard numerical techniques, such as damping [8].
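Two of these estimates, image-plane orientation and distance from the camera, can be sketched as follows (the argument conventions are only illustrative assumptions):

    def initial_image_plane_rotation(model_line_angle, image_line_angle):
        # Rotate about the viewing axis so a chosen model line projects to the
        # orientation of its matching image line (angles in radians).
        return image_line_angle - model_line_angle

    def initial_distance(model_lengths, image_lengths, f):
        # Use the minimum model-to-image length ratio, which corresponds to lines
        # most parallel to the image plane and most completely detected.
        ratios = [lm / li for lm, li in zip(model_lengths, image_lengths)]
        return f * min(ratios)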

A more difficult problem is the possible existence of multiple solutions. Fischler and Bolles [9] have shown that there may be as many as four solutions to the problem of matching three image points to three model points. Many of these ambiguities will not occur in practice, given the visibility constraints attached to the model, since they involve ranges of viewpoints from which the image features would not be visible. Also, in most cases the method will be working with far more than the minimum amount of data, so the overconstrained problem is unlikely to suffer from false local minima during the least-squares process. However, when working with very small numbers of matches, it may be necessary to run the process from several starting positions in an attempt to find all of the possible solutions. Given several starting positions, only a couple of iterations of Newton's method on each solution are necessary to determine whether they are converging to the same solution and can therefore be combined for subsequent processing. Finally, it should be remembered that the overall system is designed to be robust against instances of missing data or occlusion, so an occasional failure of the viewpoint determination should lead only to an incorrect rejection of a single match and an increase in the amount of search rather than a final failure in recognition.

Extending the initial set of matches

Once the initial set of hypothesized matches has been used to solve for the viewpoint parameters, this estimate of viewpoint can be used to predict the locations of other model features in the image and thereby extend the match. The predictions of these image locations can be obtained by simply projecting the model edges onto the image using the current set of viewpoint parameters. The more difficult problem is to select matches in the image that are consistent with these predictions. Rather than setting arbitrary thresholds on the error range within which image segments must fall to be considered in agreement with a predicted model segment, a probabilistic approach is taken. First, each segment is projected, and the image data structure is searched to select potential matches. All line segments in the image are indexed according to location and orientation, so this search can be carried out efficiently. Each potential match is assigned a value giving the probability that a randomly placed image segment would agree with the prediction to within the measured difference in length, orientation and transverse position. The probability calculation uses the same assumptions and formalisms as are used for the perceptual organization evaluations to be described in the next section. The lower this probability of accidental agreement, the more likely it is that the match is correct.

After evaluating each potential match for a given prediction, the top-ranked match is assigned a higher probability of being mistaken if the second-ranked match has a similar evaluation. The purpose of this penalty is to avoid committing ourselves to either choice of an ambiguous match if there is some less ambiguous alternative from some other model prediction. At each stage of the iteration we select only matches with a probability of being accidental of less than 0.01, or else we select the single lowest probability match from all of the model predictions. These are then added to the least-squares solution to update the estimate of viewpoint. By the time a number of the most reliable matches have been found, the viewpoint estimation will be based on a substantial amount of data and should be accurate enough to choose correctly between the more ambiguous alternatives. The set of matches is repeatedly extended in this way until no more can be found. This iterative matching procedure has the appealing property of using the easy cases to provide better viewpoint estimates to disambiguate the more difficult situations, yet it avoids the expense of search and backtracking.
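A sketch of this selection rule, assuming each prediction already carries candidate matches scored with their probability of accidental agreement (the scoring itself belongs to the next section; the factor-of-two ambiguity penalty and all names are illustrative assumptions):

    def select_matches(predictions, threshold=0.01):
        # predictions: for each projected model segment, a list of (probability, image
        # segment) candidates sorted so the lowest accidental probability comes first.
        best = []
        for candidates in predictions:
            if not candidates:
                continue
            prob, segment = candidates[0]
            if len(candidates) > 1 and candidates[1][0] < 2.0 * prob:
                prob *= 2.0          # penalize top matches that have a close runner-up
            best.append((prob, segment))
        confident = [seg for p, seg in best if p < threshold]
        if confident:
            return confident
        return [min(best, key=lambda t: t[0])[1]] if best else []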

The final accuracy of the viewpoint determination procedure could probably be improved by attempting to discard the points that deviate the most from the least-squares solution and reconverging on the remaining data. This would allow the system to converge on a consensus viewpoint estimate that was not influenced by the largest errors in modeling or feature detection. However, this procedure remains to be implemented.

The final judgement as to the presence of the object is based simply on the degree to which the final viewpoint estimate is overconstrained. Since only three edges are needed to solve for viewpoint, each further match adds to the verification of the presence of the object. In addition, the least-squares solution provides an estimate of the standard deviation of the error. Given sufficiently detailed models, correct instances of recognition should be greatly overconstrained even in the face of partial occlusion and other missing features, so the precise threshold for rejection should be unimportant. In the examples that are presented at the end of this paper, correct matches typically had over 20 image segments in close agreement with the model, while incorrect matches seldom found more than 7 matching segments.






