
Ecological Facial Display Recognition

I have recently been working on unsupervised recognition of facial displays from video. The basic idea is as follows.
Input: A continuous video stream taken with a video camera mounted on the computer screen. Initialize the tracker on the first frame, $I_0$, of the sequence using skin-color segmentation to get the centroid and scale (in the x and y directions) of the face: $c_0 = \{x_{c,0}, y_{c,0}, s_{x,0}, s_{y,0}\}$. For each subsequent frame $I_t$, $t = 1, 2, \ldots$:
  1. Compute the optical flow between $I_{t-1}$ and $I_t$ using the Lucas and Kanade algorithm [1], with the modifications suggested by Simoncelli, Adelson and Heeger [2]. I modified John Barron's code [3] by adding a Gaussian pyramid so that the flow can be calculated for large motions. I use 3 levels in the pyramid.
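
    As a rough illustration of the core Lucas and Kanade step, here is a minimal single-window sketch in plain Python. It is not Barron's code or my modified version: it omits the Gaussian pyramid and the error covariances of [2], and the function name `lucas_kanade_window` is made up for this example. It solves the 2x2 least-squares system that comes from the brightness-constancy constraint Ix*u + Iy*v = -It over one small window.

```python
def lucas_kanade_window(prev, curr):
    """Estimate one (u, v) flow vector for a small grayscale window.

    prev, curr: equally sized 2-D lists of floats (rows of pixel values).
    Returns None when the gradients do not constrain the flow.
    """
    rows, cols = len(prev), len(prev[0])
    sxx = sxy = syy = sxt = syt = 0.0
    # Skip the border so the central differences stay inside the window.
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            ix = (prev[r][c + 1] - prev[r][c - 1]) / 2.0  # spatial derivative in x
            iy = (prev[r + 1][c] - prev[r - 1][c]) / 2.0  # spatial derivative in y
            it = curr[r][c] - prev[r][c]                  # temporal derivative
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return None  # aperture problem: the system is singular
    # Solve [sxx sxy; sxy syy] [u; v] = [-sxt; -syt] by Cramer's rule.
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v
```

    In a pyramid version, the same estimate would be computed at the coarsest level first and then refined at each finer level after warping by the current flow estimate.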


  2. Project the optical flow onto the complete, orthogonal basis of Zernike polynomials. The projection is performed over the ellipse defined by
    $((x - x_{c,t})/s_{x,t})^2 + ((y - y_{c,t})/s_{y,t})^2 = 1$

    This gives a feature vector of variable dimension, which can represent the input optical flow with arbitrary accuracy. This vector is denoted $z(t) = \{z_i(t)\}$, $i = 0, \ldots, n_z$, in what follows. Read more about Zernike polynomials.
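
    Zernike polynomials are defined on the unit disk, so each pixel inside the face ellipse first has to be mapped to polar coordinates on that disk. The sketch below shows one way to do this, assuming the x axis is normalized by the x scale and the y axis by the y scale. The function name `to_unit_disk` and its argument names are made up for this example.

```python
import math

def to_unit_disk(x, y, xc, yc, sx, sy):
    """Map pixel (x, y) to polar coordinates (rho, theta) on the face ellipse.

    (xc, yc) is the ellipse centroid and (sx, sy) its scale in x and y.
    Returns None for pixels that fall outside the ellipse (rho > 1).
    """
    nx = (x - xc) / sx          # normalized x, in [-1, 1] inside the ellipse
    ny = (y - yc) / sy          # normalized y, in [-1, 1] inside the ellipse
    rho = math.hypot(nx, ny)    # radial coordinate on the unit disk
    if rho > 1.0:
        return None
    return rho, math.atan2(ny, nx)
```

    Only pixels for which this returns a value would contribute to the projection of the flow field.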


  3. Update the scale and centroid based on the Zernike projection using
    $x'_{c,t} = x_{c,t-1} + z_0$
    $y'_{c,t} = y_{c,t-1} + z_1$
    $s'_{x,t} = s_{x,t-1} + z_2$
    $s'_{y,t} = s_{y,t-1} + z_5$
    where $z_0$ and $z_1$ are the DC components of the flow in the x and y directions, respectively, $z_2$ is the divergence in the x direction, and $z_5$ is the divergence in the y direction (see footnote 1).

    This tracking gives fairly good results by itself for short periods of time, but it loses the track after 1000 frames of normal face movement. If there are any unusual movements (like head-scratching or face-rubbing), the track is also lost. Therefore, we need some kind of measurement of the scale and centroid which we can use to correct this estimate (think Kalman filter...).

    We get this measurement, call it $ec_t = \{ex_{c,t}, ey_{c,t}, es_{x,t}, es_{y,t}\}$, using skin-color segmentation. However, these estimates are usually quite a bit worse than the flow-based ones, so they must be used only in situations where the flow estimates are really bad. The estimates of the error covariances in each flow vector (see [2]) can be propagated to estimates of the errors in the flow-based scale and centroid, $dc'_t = \{dx'_{c,t}, dy'_{c,t}, ds'_{x,t}, ds'_{y,t}\}$. Assuming we also have an estimate of the error on the scale and centroid from the skin segmentation, $dec_t = \{dex_{c,t}, dey_{c,t}, des_{x,t}, des_{y,t}\}$, we can update the scale and centroid using a weighted combination of $c'_t$ and $ec_t$:

    $x_{c,t} = x'_{c,t} + (ex_{c,t} - x'_{c,t}) \, dx'_{c,t} / (dx'_{c,t} + dex_{c,t})$
    $y_{c,t} = y'_{c,t} + (ey_{c,t} - y'_{c,t}) \, dy'_{c,t} / (dy'_{c,t} + dey_{c,t})$
    $s_{x,t} = s'_{x,t} + (es_{x,t} - s'_{x,t}) \, ds'_{x,t} / (ds'_{x,t} + des_{x,t})$
    $s_{y,t} = s'_{y,t} + (es_{y,t} - s'_{y,t}) \, ds'_{y,t} / (ds'_{y,t} + des_{y,t})$
    For now, I just choose the errors on the skin segmentation to be really large, so the skin measurements only come into play when the flow is really bad.
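
    The predict-and-correct update of step 3 can be sketched in a few lines of Python. This is only an illustration of the equations above, not the tracker itself: the function names `predict_from_flow` and `fuse` are made up, and the error values are assumed to be given as plain positive numbers, one per component.

```python
def predict_from_flow(c_prev, z):
    """Flow-based prediction of (xc, yc, sx, sy) from the previous state.

    c_prev: (xc, yc, sx, sy) at time t-1.
    z: Zernike coefficients; z[0], z[1] shift the centroid and
       z[2], z[5] change the scale in x and y.
    """
    xc, yc, sx, sy = c_prev
    return (xc + z[0], yc + z[1], sx + z[2], sy + z[5])

def fuse(pred, pred_err, meas, meas_err):
    """Blend the flow prediction with the skin-color measurement.

    Each component moves from the prediction toward the measurement by
    the fraction pred_err / (pred_err + meas_err), so a measurement with
    a very large error is almost ignored.
    """
    fused = []
    for p, dp, m, dm in zip(pred, pred_err, meas, meas_err):
        gain = dp / (dp + dm)
        fused.append(p + (m - p) * gain)
    return tuple(fused)
```

    With equal errors the result is the midpoint of the two estimates. With a skin error many orders of magnitude larger than the flow error, the result stays very close to the flow-based prediction, which is the behavior described above.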
  4. These mpegs show a small clip (a few seconds) of a difficult situation for a flow-only tracker, and how it is overcome using skin segmentation. See mpegs without and with skin-color corrections.


  5. Temporally segment the video stream into small sequences of facial movement. This temporal segmentation must be done at more than one level of temporal abstraction. That is, the sequence needs to be broken up on different time scales. We use scale-space segmentation to accomplish this [4,5,6,7].

    We filter each dimension of the vectors $z(t)$, $t = 0, \ldots, N$, with Gaussians of different variances $s$, giving $c'_s(t)$. The peaks of $c'_s(t)$ are the places in the signal where there is a lot of change going on at the scale $s$. However, since facial displays are often tri-phasic, we want to make sure to group all three phases together. Therefore, we compute the cumulative sum of each $c'_s(t)$, giving $c_s(t) = \sum_{t'=0}^{t} c'_s(t')$. We then compute the length of these vectors, $d_s(t) = \sum_{i=0}^{n_z} c_{i,s}(t)$, a scalar function of time and of scale $s$. Finally, we locate the zero-crossings of the first temporal derivative of $d_s(t)$ (the peaks and valley bottoms): these are the centers of the segments.

    Most of the frames in between a peak and a valley contain little or no movement in the face. Therefore, we segment out sequences surrounding each peak or valley. These sequences are the segmented frames. All that remains to do is choose the appropriate scale to catch the level of facial display we are interested in. If the scale is chosen correctly, the segmentation is very close to what you would expect. Examples to come here soon.
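
    The segmentation at a single scale can be sketched as follows. This is a simplified illustration in plain Python, not the code used for the results: the function names are made up, the filter width is given as a standard deviation `sigma` (the square root of the variance $s$), the kernel is truncated at three standard deviations, and $d_s(t)$ is taken as the plain sum over dimensions, as in the formula above.

```python
import math

def gaussian_kernel(sigma):
    """Normalized Gaussian weights, truncated at 3 standard deviations."""
    radius = max(1, int(3 * sigma))
    k = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]

def smooth(signal, sigma):
    """Gaussian-filter a 1-D list, repeating the end values at the borders."""
    k = gaussian_kernel(sigma)
    radius = len(k) // 2
    n = len(signal)
    out = []
    for t in range(n):
        acc = 0.0
        for j, w in enumerate(k):
            idx = min(max(t + j - radius, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def segment_centers(z, sigma):
    """Return the frame indices where d_s(t) has a peak or a valley.

    z: list of frames, each a list of Zernike coefficients.
    """
    n_frames, n_dims = len(z), len(z[0])
    d = [0.0] * n_frames
    for i in range(n_dims):
        filtered = smooth([frame[i] for frame in z], sigma)  # c'_{i,s}(t)
        running = 0.0
        for t, value in enumerate(filtered):
            running += value   # cumulative sum c_{i,s}(t)
            d[t] += running    # sum over dimensions gives d_s(t)
    centers = []
    for t in range(1, n_frames - 1):
        before = d[t] - d[t - 1]
        after = d[t + 1] - d[t]
        if before * after < 0:  # the first derivative changes sign here
            centers.append(t)
    return centers
```

    Running `segment_centers` for several values of `sigma` gives the segmentation at several time scales. Larger values merge nearby movements into a single segment.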


  6. Model the complete sequence using a hierarchical dynamic Bayesian network [8]. More to come on this too. You can read a paper in gzipped ps or in pdf about this model. The model essentially winds up clustering the sequences of Zernike vectors, $z(t)$, and building temporal dynamic models of the clusters. It also builds a model of the high-level dynamics (the transitions between clusters of sequences). This is important because we can add the context in at this high level. On simple sequences the results are quite good, agreeing with 80-90% of what you would intuitively cluster.

References
  1. B.D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. of IJCAI, Vancouver, Canada, 1981, pp. 674-679.
  2. E.P. Simoncelli, E.H. Adelson and D.J. Heeger. Probability distributions of optical flow. In Proc. of CVPR, Maui, Hawaii, USA, 1991, pp. 310-315.
  3. J. Barron, D. Fleet, S. Beauchemin and T. Burkitt. Performance of optical flow techniques. IJCV, vol. 12(1), 1994, pp. 77-104.
  4. Andrew P. Witkin. Scale-space filtering. In Proc. of IJCAI, 1983, pp. 1019-1022.
  5. Andrew P. Witkin. Scale-space filtering: a new approach to multi-scale description. In Proc. of ICASSP, San Diego, CA, March 1984, pp. 39A.1.1-39A.1.4.
  6. Tony Lindeberg. Scale-space theory: A basic tool for analysing structures at different scales. Journal of Applied Statistics, vol. 21(2), 1994, pp. 225-270.
  7. Malcolm Slaney and Dulce Ponceleon. Hierarchical segmentation using latent semantic indexing in scale space. In Proc. of ICASSP, Salt Lake City, Utah, May 2001.
  8. Jesse Hoey. Hierarchical unsupervised learning of facial expression categories. In Proc. of IEEE Workshop on Detection and Recognition of Events in Video, Vancouver, BC, July 2001.

Footnote 1: These low-frequency Zernike polynomials correspond to the affine flow parameters, so we can write the flow field as
$u = z_0 + z_2 x + z_3 y$
$v = z_1 + z_4 x + z_5 y$
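
To make the role of each coefficient concrete, the affine flow field can be evaluated at any point with a few lines of Python. The function name `affine_flow` is made up for this example, and (x, y) is assumed to be measured relative to the face centroid.

```python
def affine_flow(z, x, y):
    """Evaluate the affine flow (u, v) at position (x, y).

    z: the six low-order coefficients z[0]..z[5].
    z[0], z[1] are the translation terms; z[2] and z[5] are the
    expansion (divergence) terms in x and y.
    """
    u = z[0] + z[2] * x + z[3] * y
    v = z[1] + z[4] * x + z[5] * y
    return u, v
```

A pure translation, such as z = [1, 2, 0, 0, 0, 0], gives the same flow (1, 2) at every point, which is why z0 and z1 move the centroid. A nonzero z2 gives a horizontal flow that grows with x, which is why it changes the scale in the x direction.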