This paper addresses computer vision-based face motion capture as an alternative to physical sensor-based technologies. The proposed method combines deformable template-based tracking of the mouth and eyes in arbitrary video sequences featuring a single speaker with a global 3D head pose estimation procedure that yields robust initializations. The mathematical principles underlying deformable template matching are presented, together with the definition and extraction of salient image features. Specifically, interpolating cubic B-splines between the MPEG-4 Face Animation Parameters (FAPs) associated with the mouth and eyes are used as the template parameterization. Modeling the template as a network of springs interconnecting the mouth and eyes FAPs, the internal energy is expressed as a combination of elastic and symmetry local constraints. The external energy function, which enforces interactions with the image data, combines contour, texture and topography properties within robust potential functions. Template matching is achieved by applying the downhill simplex method to minimize the global energy cost. Stability and accuracy of the results are assessed on a set of 2000 frames corresponding to 5 video sequences of speaking people.
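The internal-energy formulation outlined above — elastic springs between FAP control points plus local left/right symmetry constraints — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy mouth template, spring weights, rest lengths, and symmetry axis are all assumed for the example, and the external (image-driven) energy and downhill simplex minimization are omitted.

```python
import math

def elastic_energy(points, edges, rest, k=1.0):
    """Sum of spring energies k * (|p_i - p_j| - rest_ij)^2 over template edges."""
    e = 0.0
    for (i, j), r in zip(edges, rest):
        d = math.dist(points[i], points[j])
        e += k * (d - r) ** 2
    return e

def symmetry_energy(points, pairs, axis_x, k=1.0):
    """Penalize paired left/right control points that are not mirror
    images about the vertical axis x = axis_x (a local symmetry constraint)."""
    e = 0.0
    for i, j in pairs:
        (xl, yl), (xr, yr) = points[i], points[j]
        # horizontal offsets from the axis should match; heights should match
        e += k * (((axis_x - xl) - (xr - axis_x)) ** 2 + (yl - yr) ** 2)
    return e

# Hypothetical 4-point mouth template: left corner, top, right corner, bottom.
pts = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.0), (1.0, -0.5)]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
rest = [math.dist(pts[i], pts[j]) for i, j in edges]  # rest lengths at equilibrium

# Internal energy = elastic term + symmetry term (mouth corners paired).
e_int = elastic_energy(pts, edges, rest) + symmetry_energy(pts, [(0, 2)], axis_x=1.0)
print(e_int)  # 0.0 at the rest configuration
```

In the full method, this internal energy would be summed with the external (contour, texture, topography) terms and minimized over the template parameters with the downhill simplex method.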