Abstract

We analyze in this paper the benefits that can be derived from employing color image alignment techniques in the context of face segmentation or tracking based on texture template matching, where the texture is defined as the patch of pixel intensities across the face. By making full use of the decorrelated color information, improvements in the accuracy of the segmentation are demonstrated. The aim is to enhance the face segmentation algorithm by increasing its robustness to differences between images caused by various image acquisition devices or settings, or by variations in the ambient illumination conditions.

1. Introduction

The use of color information is becoming increasingly important in today's image processing applications, as inexpensive color image acquisition devices are now widely available. Color image processing permits a richer image representation, which is expected to lead to better results.

We deal in this paper with the specific case of face segmentation which employs face modeling techniques. This can also be viewed in the more general context of deformable template matching, using for this purpose a statistical model of shape variations. Extensive work has been carried out in the area of face modeling and face segmentation using statistical models [1–5]. These techniques were initially developed for gray level images, and extensions were later proposed for color images [6, 7]. Some advantages of using the color extension have been demonstrated, but mostly for controlled image acquisition environments. Processing color information can be challenging, especially when designing more general applications that are expected to work under unconstrained image acquisition conditions. We demonstrate in this paper some positive results obtained by using decorrelated color information for applications such as face segmentation and face tracking, intended to work under no predefined constraints.

The outline of this paper is as follows. In Section 2, we briefly describe several decorrelated color spaces in terms of their transforms from the common color space; we also include a comparison between these color spaces in terms of how well they are able to decorrelate image channels on a series of test images. In Section 3, a face segmentation method is described, based on a statistical shape model and a fixed face texture template. The limitations of the method described in Section 3 are addressed in Section 4, introducing texture alignment and color transfer techniques in order to adapt the texture template to the color distribution of the current image. These operations are facilitated by converting the texture data to one of the decorrelated color spaces presented in Section 2. In Section 5, we present our experiments, performed on a general face image database built as a mixture of images gathered mostly from various standard face image and video databases, together with a set of comparative results. Finally, in Section 6 we draw the conclusions of our work.

2. Image Decorrelation with Respect to Color Information

Decorrelating the color channels of an image is useful because it allows color image processing operations to be applied independently to each channel.

2.1. Karhunen-Loève Transform

The Karhunen-Loève transform (KLT) is optimal in terms of energy compaction and mean-squared error minimization for a truncated representation. Applied to a color image, the KLT creates orthogonal image basis vectors and thus achieves complete decorrelation of the image channels [8–12] as follows:
$$\mathbf{y} = \mathbf{T}\left(\mathbf{x} - \boldsymbol{\mu}_x\right),$$
where $\mathbf{x} = [x_1\; x_2\; x_3]^T$ contains the image color signals and $\boldsymbol{\mu}_x = E[\mathbf{x}]$, with $E[\cdot]$ being the mathematical expectation. $\mathbf{C}_x$ is the covariance matrix of the image color signals:
$$\mathbf{C}_x = E\left[(\mathbf{x} - \boldsymbol{\mu}_x)(\mathbf{x} - \boldsymbol{\mu}_x)^T\right],$$
with eigenvectors $\mathbf{e}_k$, $k = 1, 2, 3$. $\mathbf{T}$ is the transformation matrix formed by the eigenvectors of the covariance matrix $\mathbf{C}_x$:
$$\mathbf{T} = [\mathbf{e}_1\; \mathbf{e}_2\; \mathbf{e}_3]^T.$$

Yet, the KLT is data dependent, meaning that the transformation matrix must be recomputed for each new set of data (e.g., each new image).
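To make the per-image nature of the transform concrete, the following is a minimal sketch (not part of the original formulation) of KLT-based channel decorrelation, assuming an RGB image stored as an H x W x 3 NumPy array; the function name and array layout are illustrative only.

```python
import numpy as np

def klt_decorrelate(img):
    """Return the KLT-decorrelated channels plus the transform used."""
    x = img.reshape(-1, 3).astype(np.float64)   # one row per pixel
    mu = x.mean(axis=0)                         # mean color vector
    cov = np.cov(x - mu, rowvar=False)          # 3x3 channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigen-decomposition of C_x
    order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
    T = eigvecs[:, order].T                     # rows of T are eigenvectors
    y = (x - mu) @ T.T                          # y = T (x - mu) for every pixel
    return y.reshape(img.shape), T, mu

# Note: T and mu are data dependent and must be recomputed for each image.
```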

2.2. The $I_1I_2I_3$ Color Space

An interesting color space is $I_1I_2I_3$, proposed by Ohta et al. [13], which realizes a statistical minimization of the interchannel correlations (decorrelation of the components) for natural images. The conversion from $RGB$ to $I_1I_2I_3$ is given by the simple linear transformation in (4) as follows:
$$I_1 = \frac{R + G + B}{3}, \qquad I_2 = \frac{R - B}{2}, \qquad I_3 = \frac{2G - R - B}{4}.$$

$I_1$ stands as the achromatic (intensity) component, while $I_2$ and $I_3$ are the chromatic components. We remark that the simple numeric transformation from $RGB$ to $I_1I_2I_3$ enables simple and efficient conversion of datasets between these two color spaces.
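As a concrete illustration of the transform in (4), a short sketch follows; the matrix below simply encodes (4), and the helper names are ours.

```python
import numpy as np

# Fixed linear map of (4): no per-image statistics are required, unlike the KLT.
RGB_TO_I1I2I3 = np.array([[ 1/3, 1/3,  1/3],
                          [ 1/2, 0.0, -1/2],
                          [-1/4, 1/2, -1/4]])

def rgb_to_i1i2i3(img):
    """img: H x W x 3 RGB array; returns the I1, I2, I3 channels."""
    return img.astype(np.float64) @ RGB_TO_I1I2I3.T

def i1i2i3_to_rgb(img):
    """Inverse of the (invertible) linear transform."""
    return img @ np.linalg.inv(RGB_TO_I1I2I3).T
```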

$I_1I_2I_3$ was designed as an approximation of the KLT of the data, to be used for region segmentation of color images. As the transformation to $I_1I_2I_3$ represents a good approximation of the KLT for a large set of natural images, the resulting color channels are almost completely decorrelated.

In the previous work of Ohta et al., the discriminating power of linear combinations of $R$, $G$, and $B$ was tested on eight different color scenes. The selected linear combinations were gathered such that they could successfully be used for segmenting important (large area) regions of an image, based on a histogram threshold. It was found that a number of the selected linear combinations had all positive weights, corresponding mainly to an intensity component, which is best approximated by $I_1$; another group showed opposite signs for the weights of $R$ and $B$, representing the difference between the $R$ and $B$ components, which is best approximated by $I_2$; finally, the remaining linear combinations could be approximated by $I_3$. Thus it was shown that the $I_1$, $I_2$, and $I_3$ components in (4) are effective for discriminating between different regions and that they are significant in this order [13]. We can further conclude, based on the above figures, that most of the color features are well discriminated on the first channel, with progressively fewer on the second and third channels.

$I_1I_2I_3$ was also found in [14] to perform better than other color space representations, such as YIQ and CIELAB, for segmentation of color images based on Markov random field (MRF) processing. In [15], the $I_1I_2I_3$ color space was used for color image segmentation based on an MRF model and simulated annealing, due to its effectiveness in terms of the quality of the segmentation and the reduced complexity of the transformation.

2.3. The $l\alpha\beta$ Color Space

Assuming that the human visual system is ideal for processing natural scenes, Ruderman et al. [16] developed the $l\alpha\beta$ color space, which also minimizes the correlation between channels for natural images. The conversion from $RGB$ is realized by means of an initial transform to $LMS$ cone space, followed by a conversion of the data to logarithmic space (used to reduce skewness):
$$\begin{bmatrix} L \\ M \\ S \end{bmatrix} = \begin{bmatrix} 0.3811 & 0.5783 & 0.0402 \\ 0.1967 & 0.7244 & 0.0782 \\ 0.0241 & 0.1288 & 0.8444 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}, \qquad L' = \log L, \quad M' = \log M, \quad S' = \log S.$$

Finally, the $l\alpha\beta$ data is obtained from
$$\begin{bmatrix} l \\ \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} \tfrac{1}{\sqrt{3}} & 0 & 0 \\ 0 & \tfrac{1}{\sqrt{6}} & 0 \\ 0 & 0 & \tfrac{1}{\sqrt{2}} \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & -2 \\ 1 & -1 & 0 \end{bmatrix} \begin{bmatrix} L' \\ M' \\ S' \end{bmatrix}.$$
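A sketch of this conversion is given below, using the $RGB$-to-$LMS$ matrix and the log/$l\alpha\beta$ stage as published by Ruderman et al. and Reinhard et al.; the epsilon guard against log(0) and the function name are our additions.

```python
import numpy as np

RGB_TO_LMS = np.array([[0.3811, 0.5783, 0.0402],
                       [0.1967, 0.7244, 0.0782],
                       [0.0241, 0.1288, 0.8444]])

# Combined rotation/scaling applied to the log cone responses.
LMS_TO_LAB = np.array([[1/np.sqrt(3), 0, 0],
                       [0, 1/np.sqrt(6), 0],
                       [0, 0, 1/np.sqrt(2)]]) @ np.array([[1,  1,  1],
                                                          [1,  1, -2],
                                                          [1, -1,  0]])

def rgb_to_lab(img, eps=1e-6):
    """img: H x W x 3 RGB array with positive values; returns the l, alpha, beta channels."""
    lms = img.astype(np.float64) @ RGB_TO_LMS.T
    log_lms = np.log10(lms + eps)        # logarithmic cone space (reduces skewness)
    return log_lms @ LMS_TO_LAB.T
```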

This color space has successfully been used in [17, 18] for image color transfer operations, which will be described in Section 4.1.

2.4. Comparison between the Different Color Image Representations

The correlation between two image channels is given by
$$r_{ij} = \frac{E\left[(x_i - \mu_i)(x_j - \mu_j)\right]}{\sigma_i \sigma_j},$$
where $x_i$ and $x_j$ represent the $i$th and $j$th image channel signals, respectively (with $i, j \in \{1, 2, 3\}$, $i \neq j$), for a certain color image representation, and $\mu$ and $\sigma$ denote the corresponding channel means and standard deviations.

The total interchannel correlation is calculated as the sum of the absolute pairwise correlation coefficients:
$$r_{\mathrm{total}} = |r_{12}| + |r_{13}| + |r_{23}|.$$
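The sketch below computes these quantities for an image; note that summing the absolute pairwise coefficients as the "total" correlation follows our reconstruction above and is an assumption rather than a quoted formula.

```python
import numpy as np

def interchannel_correlation(img):
    """Pairwise correlation coefficients (r12, r13, r23) between the channels of an H x W x 3 image."""
    x = img.reshape(-1, 3).astype(np.float64)
    r = np.corrcoef(x, rowvar=False)        # 3x3 correlation matrix
    return r[0, 1], r[0, 2], r[1, 2]

def total_correlation(img):
    """Scalar summary: sum of the absolute pairwise correlations."""
    r12, r13, r23 = interchannel_correlation(img)
    return abs(r12) + abs(r13) + abs(r23)
```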

The correlation coefficients have been measured for several test images (see Figure 1) in the discussed color image representations using the above formulae [19]. Results are summarized in Table 1.

It can be observed that the $RGB$ representation presents a very high interchannel correlation, while the $I_1I_2I_3$ and $l\alpha\beta$ representations significantly reduce this correlation. As stated above, the KLT, which is adapted to each particular image, achieves total decorrelation of the image channels.

3. Face Segmentation Using Deformable Template Matching

Note that the term texture, frequently used in this paper, refers in the context of this work to the set of pixel intensities across an object, subsequent to a suitable normalization.

3.1. Statistical Shape Models

We are interested in designing a shape model that is robust to head pose variations. The shape is defined as the set of positions of fiducial points on the face. The model is statistically built from a training dataset which contains image examples annotated with a fixed set of landmark points. The sets of 2-D coordinates of the landmark points define the shapes inside the image frame. These shapes are aligned using generalized Procrustes analysis [20], a technique for removing the differences in translation, rotation, and scale between the training shapes. This defines the shapes in the normalized frame.

Let $N$ be the number of training examples. Each shape example is represented as a vector of concatenated coordinates of its points, $\mathbf{s} = (x_1, y_1, \dots, x_L, y_L)^T$, where $L$ is the number of landmark points. Principal components analysis (PCA) is then applied to the set of aligned shape vectors, reducing the initial dimensionality of the data. It can be noted that PCA is very similar to KLT. In a geometric interpretation, KLT can be viewed as a rotation of the coordinate system, while for PCA, the rotation of the coordinate system is preceded by a shift of the origin to the mean point [21]. Shape variability is thus linearly modeled as a base (mean) shape plus a linear combination of shape eigenvectors:
$$\mathbf{s} = \bar{\mathbf{s}} + \boldsymbol{\Phi}_s \mathbf{b}_s,$$
where $\mathbf{s}$ represents a modeled shape, $\bar{\mathbf{s}}$ is the mean of the aligned shapes, $\boldsymbol{\Phi}_s$ is a matrix having the first $t$ shape eigenvectors as its columns, and $\mathbf{b}_s$ defines the set of parameters of the shape model. $t$ is chosen so that a certain percentage of the total variance of the data is retained.
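For illustration, a minimal PCA shape-model sketch follows, assuming the aligned shapes are stacked as rows of an N x 2L array; the 98% variance threshold is an arbitrary example value.

```python
import numpy as np

def build_shape_model(shapes, variance_kept=0.98):
    """shapes: N x 2L array of aligned shapes (x1, y1, ..., xL, yL).
    Returns the mean shape and the first t eigenvectors retaining the given variance."""
    mean_shape = shapes.mean(axis=0)
    cov = np.cov(shapes - mean_shape, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                 # decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()
    t = int(np.searchsorted(cum, variance_kept)) + 1  # number of retained modes
    return mean_shape, eigvecs[:, :t], eigvals[:t]

def synthesize_shape(mean_shape, Phi, b):
    """s = s_bar + Phi_s b_s: generate a shape instance from the model parameters."""
    return mean_shape + Phi @ b
```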

The standard deviation of each parameter of the face model, as obtained from the training dataset, provides its dynamic range. Altering the model parameters only within their dynamic ranges helps ensure that only plausible instances of the modeled object are generated. A description of how the optimal model parameters for a new image can be estimated automatically follows in Section 3.2.
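A common way to enforce the dynamic range, sketched below, is to clip each parameter to plus or minus k standard deviations of its training distribution; the choice k = 3 is a typical convention, not a value taken from this paper.

```python
import numpy as np

def clamp_shape_parameters(b, eigvals, k=3.0):
    """Constrain each shape parameter to +/- k standard deviations (its dynamic range),
    so that only plausible shape instances are generated."""
    limits = k * np.sqrt(eigvals)       # per-mode standard deviations from the eigenvalues
    return np.clip(b, -limits, limits)
```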

3.2. Face Texture Template Optimization Algorithm

In order to optimize the face model parameters, a texture template is also required. The separation between shape and texture is realized using a reference shape; based on this reference shape, the so-called texture examples can be extracted. The reference shape is usually chosen as the pointwise mean of the shape examples. The texture examples are defined in the normalized frame of the reference shape. Each image example is then warped so that the points defining its attached shape, used as control points, match the reference shape while the topology is preserved. An image warping method is employed for this purpose; image warping methods are discussed in Section 3.3.

Subsequent to the warping stage, all shape differences between the image examples have been removed. The texture across each image object is thus mapped into a shape-normalized representation. The resulting images are also called the image examples in the normalized frame. For each of these images, the corresponding pixel values across their common shape are scanned to form the texture vectors $\mathbf{t}_i \in \mathbb{R}^m$, where $m$ is the number of texture samples.

Based on previous experiments, we remark that, for a successful segmentation of the face, the variability of the shape component is much more important than the variability of the texture component. For this reason, we consider in the following a simplified formulation of a model-based face segmentation technique, in which the modeled image is represented by a fixed texture template; extensions could be made to include texture variability, yet that was beyond the purpose of the current work. Thus, during an optimization stage (fitting the model to a query image), the parameters to be found are $\mathbf{p} = (\mathbf{g}, \mathbf{b}_s)$, where $\mathbf{g}$ groups the 2-D position, 2-D rotation, and scale parameters of the shape inside the image frame, and $\mathbf{b}_s$ are the shape model parameters. The optimization of the parameters is realized by minimizing the reconstruction error between the query image and the modeled image. The error is evaluated in the coordinate frame of the model, that is, in the normalized texture reference frame, rather than in the coordinate frame of the image. The difference between the query image and the modeled image is thus given by the difference between the (normalized) image texture and the (normalized) template texture as follows:
$$\mathbf{r}(\mathbf{p}) = \mathbf{t}_{\mathrm{image}}(\mathbf{p}) - \mathbf{t}_{\mathrm{template}},$$
and $E = \lVert \mathbf{r}(\mathbf{p}) \rVert^2$ is the reconstruction error, with $\lVert \cdot \rVert$ marking the Euclidean norm.

A first-order Taylor expansion of $\mathbf{r}(\mathbf{p} + \delta\mathbf{p})$ is given by
$$\mathbf{r}(\mathbf{p} + \delta\mathbf{p}) \approx \mathbf{r}(\mathbf{p}) + \frac{\partial \mathbf{r}}{\partial \mathbf{p}}\,\delta\mathbf{p}.$$
$\delta\mathbf{p}$ should be chosen so as to minimize $\lVert \mathbf{r}(\mathbf{p} + \delta\mathbf{p}) \rVert^2$. It follows that
$$\delta\mathbf{p} = -\left(\frac{\partial \mathbf{r}}{\partial \mathbf{p}}\right)^{+} \mathbf{r}(\mathbf{p}).$$
Normally, the gradient matrix $\partial \mathbf{r} / \partial \mathbf{p}$ should be recomputed at each iteration. Yet, as the error is estimated in a normalized texture frame, it was shown that this gradient matrix may be considered fixed, making it possible to precompute it from a training dataset; these techniques, introduced in [22] and extended so as to also incorporate a statistical texture variation model (as opposed to the fixed texture template described above), are called active appearance models (AAMs). Using this technique, each parameter in $\mathbf{p}$ is systematically displaced from its known optimal value while retaining the normalized texture differences. The resulting matrices are then averaged over several displacement amounts and over several training images. The update direction of the model parameters is then given by
$$\delta\mathbf{p} = -\mathbf{R}\,\mathbf{r}(\mathbf{p}),$$
where $\mathbf{R}$ is the pseudoinverse of the determined gradient matrix, which can be precomputed as part of the training stage. The parameters continue to be updated iteratively until the error can no longer be reduced and convergence is declared.
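The iterative update can be summarized by the following sketch; `sample_texture` is a hypothetical helper that warps the image with the current parameters and returns the normalized texture vector, and the damped step sizes are an illustrative detail rather than the exact scheme used here.

```python
import numpy as np

def fit_template(p0, sample_texture, template, R, n_iter=30):
    """Iteratively update the model parameters p so that the normalized image texture
    matches the fixed texture template; R is the precomputed pseudo-inverse of the
    gradient matrix and sample_texture(p) returns the normalized texture for p."""
    p = p0.copy()
    best_err = np.linalg.norm(sample_texture(p) - template)
    for _ in range(n_iter):
        r = sample_texture(p) - template            # texture residual r(p)
        dp = -R @ r                                 # precomputed update direction
        for step in (1.0, 0.5, 0.25):               # damped steps if the full update overshoots
            p_new = p + step * dp
            err = np.linalg.norm(sample_texture(p_new) - template)
            if err < best_err:
                p, best_err = p_new, err
                break
        else:
            break                                   # no improvement: declare convergence
    return p, best_err
```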

3.3. A TPS-Based Model Fitting Technique

Piecewise affine warping is extensively used in techniques like AAMs due to its reduced computational cost. A triangulation (e.g., Delaunay) is used to partition the convex hull of the control points. The points inside each triangle are then mapped via the affine transformation that is uniquely defined by the assignment of the triangle's corners to their new positions. Although the assumption that the face patches are piecewise affine within the triangles is satisfactory when there is a sufficiently large number of landmark points, it also has an important drawback: when modeling large face pose variations, the corners of some triangles tend to get reversed due to occlusions of the corresponding landmark points. This obviously affects the image warping outcome by creating erroneous face patches. The errors are further propagated into the fitting algorithm, resulting in an incorrect fit. That is why the piecewise affine warping method works well mostly for modeling frontal or nearly frontal faces.

A more advanced and accurate warping method is obtained by employing thin plate splines (TPSs), introduced in [23]. A short description of this warping method is given in the appendix. An initial drawback of thin plate splines was that they were quite expensive to calculate. The solution requires the inversion of a matrix (the bending energy matrix), which has a computational complexity of $O(n^3)$, where $n$ is the number of points in the dataset (i.e., the number of pixels in the image); furthermore, the evaluation process is $O(n^2)$. Fortunately, important progress has been made in speeding this process up. An approximation approach was proved in [24] to be very efficient in dealing with the first problem, greatly reducing the computational burden. As far as the evaluation process is concerned, the multilevel fast multipole method (MLFMM) framework was described in [25] for the evaluation of two-dimensional polyharmonic splines, while in [26] this work was extended to the specific case of TPS, showing that a substantial reduction of the computational complexity of the evaluation is indeed possible. Thus, the computational difficulties involved in the use of TPSs have to an important extent been removed.

We show in Figures 2 and 3 an example of fitting the model based on TPS warping. The error is evaluated relative to the number of available data points after the deformation.

4. Improved Model Fitting by Means of Local Color Transfer

A face detection algorithm is first applied to the current image. We used the Viola-Jones face detector [27], which is based on the AdaBoost algorithm [28]. A statistical relation between the face detector estimates of the face position and size (a rectangular region) and the position and size of the reference shape inside the image frame is initially learnt (offline) from a set of training images. This relation is then used to obtain a more accurate initialization for the reference shape, tuned to the employed face detection algorithm. It is important to have an initialization reasonably close to the real values in order to ensure the convergence of the fitting algorithm described in Section 3. Color statistics are then extracted across the convex hull of the landmark points of the initialized reference shape.
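The paper does not specify the form of this statistical relation; the sketch below assumes a simple linear least-squares mapping from the detector rectangle to the reference-shape placement, with illustrative function and variable names.

```python
import numpy as np

def learn_initialization(det_boxes, ref_placements):
    """Learn (offline) a linear relation between the face detector rectangle (x, y, w, h)
    and the reference shape placement (tx, ty, scale) by least squares.
    det_boxes: N x 4 array, ref_placements: N x 3 array from annotated training images."""
    A = np.hstack([det_boxes, np.ones((det_boxes.shape[0], 1))])   # add a bias term
    coeffs, *_ = np.linalg.lstsq(A, ref_placements, rcond=None)
    return coeffs

def initialize_shape(det_box, coeffs):
    """Predict the reference shape placement for a new detection."""
    return np.append(det_box, 1.0) @ coeffs
```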

4.1. Image Color Transfer

According to [17], color can be transferred between two images (global color transfer) using the formula in (15), applied in the $l\alpha\beta$ color space:
$$x' = \frac{\sigma_s}{\sigma_t}\,(x - \mu_t) + \mu_s,$$
where $x$ is a (per-channel) pixel value in the target image, $\mu$ and $\sigma$ are, respectively, the mean and standard deviation of the Gaussian-modeled channel distribution in the considered color space, and the subscripts $t$ and $s$ denote the target (recolored) and source (color reference) images.
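A sketch of the global transfer of (15), per channel in a decorrelated space, is shown below; the images are assumed to be already converted to that space, and the small epsilon guards against a zero standard deviation.

```python
import numpy as np

def global_color_transfer(target_lab, source_lab):
    """Global color transfer in a decorrelated space (e.g., l-alpha-beta):
    each channel of the target image is shifted and scaled so that it takes on
    the per-channel mean and standard deviation of the source image."""
    t = target_lab.reshape(-1, 3).astype(np.float64)
    s = source_lab.reshape(-1, 3).astype(np.float64)
    mu_t, sd_t = t.mean(axis=0), t.std(axis=0)
    mu_s, sd_s = s.mean(axis=0), s.std(axis=0)
    out = (t - mu_t) * (sd_s / np.maximum(sd_t, 1e-12)) + mu_s
    return out.reshape(target_lab.shape)
```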

For local color transfer between two images, color statistics (e.g., the mean and variance of the Gaussian-modeled color distribution) are gathered from the target and source image, respectively, and used to calculate the color influence map (CIM). The CIM contains a weight for each pixel in the target image, determined by the pixel's proximity to the color range of the source image.

Consider the distance between a pixel and the center of the color distribution. For three-dimensional color data this is the Mahalanobis distance given by
$$d(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{C}^{-1} (\mathbf{x} - \boldsymbol{\mu})},$$
where $\mathbf{C}$ is the covariance matrix of the three-variate color texture vector.

Yet, if a decorrelated color space is used, then the covariance matrix is close to being diagonal and (16) reduces to the normalized Euclidean distance (17):
$$d(\mathbf{x}) = \sqrt{\sum_{k=1}^{3} \frac{(x_k - \mu_k)^2}{\sigma_k^2}},$$
where $\boldsymbol{\sigma} = (\sigma_1, \sigma_2, \sigma_3)$ is the vector of standard deviations of the color data over the sample set.

The weights in the CIM are calculated using a function $f$ of the above distance $d$, for which the following conditions should be met: $f(0) = 1$; $f$ is monotonically decreasing; and $f(d) \to 0$ as $d \to \infty$.

A specific such function, to be used with the $l\alpha\beta$ color space, was proposed in [18].

The color transfer equation in (15) was also extended in [18] by blending, at each pixel, the color-transferred value with the original value according to the CIM weight; if a single color is used as the source for the color transfer, the pixel is instead blended toward that source color according to its weight.
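The sketch below illustrates a CIM-weighted local transfer along these lines; the Gaussian fall-off used as the weight function is only an assumed stand-in that satisfies the conditions stated above, not the specific function of [18].

```python
import numpy as np

def local_color_transfer(target_lab, mu_t, sd_t, mu_s, sd_s):
    """Local color transfer weighted by a color influence map (CIM).
    target_lab: H x W x 3 image in a decorrelated space; (mu_t, sd_t) describe the
    color region of interest in the target image and (mu_s, sd_s) the source statistics
    (each a length-3 array)."""
    x = target_lab.reshape(-1, 3).astype(np.float64)
    d = np.linalg.norm((x - mu_t) / sd_t, axis=1)           # normalized Euclidean distance (17)
    w = np.exp(-0.5 * d**2)                                  # assumed smooth fall-off CIM weight
    transferred = (x - mu_t) * (sd_s / sd_t) + mu_s          # global transfer of (15)
    out = w[:, None] * transferred + (1.0 - w[:, None]) * x  # blend according to the CIM
    return out.reshape(target_lab.shape)
```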

4.2. Adaptive Texture Template Matching

Using a decorrelated color space (see Section 2), the color of the texture template (see Figure 4(a)) can be adapted to the current image, increasing the chance of a correct fitting (correct face segmentation) of the face model. Experimental results supporting this premise and confirming the benefits of employing color adaptation techniques with the template matching algorithm follow next.

5. Experiments

The experiments have been performed on a randomly chosen subset of 16 images from the database in Figure 1. The images have been semiautomatically annotated, and the set of annotations has been used as the ground truth for calculating the boundary errors, which give an objective measure of the fitting quality of the face model. The boundary errors are measured between the exact shape in the image frame (obtained from the ground truth annotations) and the optimized model shape in the image frame. The boundary error is calculated as the point-to-point (Pt-Pt) error, which is given by the Euclidean distance between the two shape vectors of concatenated x and y coordinates of the landmark points. The mean and standard deviation of the Pt-Pt errors are used to evaluate the boundary errors over a whole set of images. The results are summarized in Table 2.
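For clarity, a sketch of the boundary-error computation is given below, following the Pt-Pt definition above; the helper names are ours.

```python
import numpy as np

def pt_pt_error(shape_gt, shape_fit):
    """Pt-Pt boundary error: Euclidean distance between the two shape vectors
    of concatenated x and y landmark coordinates, in the image frame."""
    a = np.asarray(shape_gt, dtype=np.float64)
    b = np.asarray(shape_fit, dtype=np.float64)
    return float(np.linalg.norm(a - b))

def boundary_error_stats(gt_shapes, fitted_shapes):
    """Mean and standard deviation of the Pt-Pt errors over a set of images."""
    errs = [pt_pt_error(g, f) for g, f in zip(gt_shapes, fitted_shapes)]
    return float(np.mean(errs)), float(np.std(errs))
```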

An implementation based only on the intensity (gray scale) component has also been tested. The gray scale images have been obtained by applying the standard mix of $RGB$ components in (22):
$$Y = 0.299R + 0.587G + 0.114B.$$

The initial results (no color adaptation) show a slight gain in fitting accuracy over the gray scale implementation when color information is added. However, a significant increase in face segmentation accuracy can be observed when adapting the color of the texture template using the color transfer techniques. It can also be noted that the implementation based on the $I_1I_2I_3$ color space performs slightly better in terms of segmentation accuracy, although subjectively better color adaptation results have been observed when using the $l\alpha\beta$ color space. This can be explained by the fact that the former representation is closer to the color space in which the fitting algorithm is implemented.

The robustness to changes in the illumination conditions was also tested using the Oulu face image database [29]. An example of color adaptation of the texture template for this database is shown in Figure 5.

6. Discussion and Conclusions

We analyzed in this paper the possibility of enhancing a face segmentation/tracking method based on texture template matching by means of color image alignment. We also presented a model parameter optimization approach which minimizes the error between the texture template and the warped image texture across the current shape. We employed the TPS-based warping method, which is more robust to head pose variations.

The color alignment techniques make use of the decorrelated color statistics of the current image and of the template image. Improvements in the accuracy of the segmentation have been demonstrated.

From our experiments, we can conclude that the color-adaptation method for the texture template can also be useful in face tracking applications which employ face modeling techniques similar to the one described in Section 3. In particular, significant improvements and increased robustness were shown for the case of tracking a face under changing illumination conditions, such as a change of the illuminant type. This may be a real change of the illuminant, or it may be caused by a wrong white balance setting of the image acquisition device.

Appendix

Image Warping: Principal Warps

The thin plate spline (TPS)-based warping method, also named principal warps, was first introduced in [23]. It represents a nonrigid registration method built upon an analogy with a theory in mechanics, namely, minimizing the bending energy of a thin metal plate on which pressure is exerted at some point constraints. The bending energy is given by a quadratic form, and the spline is represented as a linear combination (superposition) of basis functions centered at the control points:
$$f(x, y) = a_1 + a_x x + a_y y + \sum_{i=1}^{n} w_i\, U\!\left(\lVert P_i - (x, y)\rVert\right),$$
where $U(r) = r^2 \log r^2$ and $P_i = (x_i, y_i)$, $i = 1, \dots, n$, are the initial control points. The term $a_1 + a_x x + a_y y$ defines the affine part, while the weighted sum of basis functions $U$ defines the nonlinear part of the deformation.

The total bending energy is expressed as
$$I_f = \iint_{\mathbb{R}^2} \left[ \left(\frac{\partial^2 f}{\partial x^2}\right)^{2} + 2\left(\frac{\partial^2 f}{\partial x\, \partial y}\right)^{2} + \left(\frac{\partial^2 f}{\partial y^2}\right)^{2} \right] dx\, dy.$$

The surface is deformed such that it has minimum bending energy. The conditions that need to be met so that (A.1) is valid (so that $f$ has square-integrable second-order derivatives) are given by
$$\sum_{i=1}^{n} w_i = 0, \qquad \sum_{i=1}^{n} w_i x_i = \sum_{i=1}^{n} w_i y_i = 0.$$

Adding to this the interpolation conditions $f(x_i, y_i) = v_i$, (A.1) can now be written as the linear system in (A.4):
$$\begin{bmatrix} K & P \\ P^{T} & O \end{bmatrix} \begin{bmatrix} \mathbf{w} \\ \mathbf{a} \end{bmatrix} = \begin{bmatrix} \mathbf{v} \\ \mathbf{o} \end{bmatrix},$$
where $K_{ij} = U(\lVert P_i - P_j \rVert)$, $O$ is a $3 \times 3$ matrix of zeros, $\mathbf{o}$ is a $3 \times 1$ vector of zeros, and the $i$th row of $P$ is $(1, x_i, y_i)$; $\mathbf{w}$ and $\mathbf{a}$ are the column vectors formed by the weights $w_i$ and by the affine coefficients $(a_1, a_x, a_y)$, respectively, while $\mathbf{v}$ is the column vector formed by the target values $v_i$.
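A direct (non-accelerated) sketch of setting up and evaluating the system in (A.4) follows; it solves for one output coordinate at a time and uses dense computations, so it illustrates the formulation rather than the fast methods of [24–26].

```python
import numpy as np

def _tps_kernel(d):
    """U(r) = r^2 log r^2, with U(0) = 0."""
    d2 = d ** 2
    return np.where(d2 > 0, d2 * np.log(d2 + (d2 == 0)), 0.0)

def tps_coefficients(ctrl_pts, target_vals):
    """Solve the linear system (A.4) for one warped coordinate.
    ctrl_pts: n x 2 control points P_i; target_vals: n target values v_i."""
    n = ctrl_pts.shape[0]
    K = _tps_kernel(np.linalg.norm(ctrl_pts[:, None, :] - ctrl_pts[None, :, :], axis=-1))
    P = np.hstack([np.ones((n, 1)), ctrl_pts])
    L = np.block([[K, P], [P.T, np.zeros((3, 3))]])
    rhs = np.concatenate([target_vals, np.zeros(3)])
    sol = np.linalg.solve(L, rhs)
    return sol[:n], sol[n:]       # nonlinear weights w_i and affine part (a_1, a_x, a_y)

def tps_evaluate(pts, ctrl_pts, w, a):
    """Evaluate f(x, y) of (A.1) at the m x 2 points `pts` by direct summation."""
    U = _tps_kernel(np.linalg.norm(pts[:, None, :] - ctrl_pts[None, :, :], axis=-1))
    return a[0] + pts @ a[1:] + U @ w
```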

Acknowledgments

This research was jointly sponsored by Enterprise Ireland and FotoNation (Ireland) Ltd. under the Innovation Partnership Scheme, Grant no. IP/06/361, part of the National Development Program of Ireland. In addition to financial support the authors also wish to express their appreciation for advice and access to facilities provided by the industrial sponsor.