Applied Digital Signal and Image Processing Research Centre, University of Central Lancashire, Preston PR1 2HE, UK
This paper describes a novel method for representing different
facial expressions based on the shape space vector
(SSV) of the statistical shape model (SSM) built from 3D
facial data. The method relies only on the 3D shape, with
texture information not being used in any part of the algorithm,
which makes it inherently invariant to changes in the
background, illumination, and to some extent viewing angle
variations. To evaluate the proposed method, two comprehensive
3D facial data sets have been used for the testing.
The experimental results show that the SSV not only controls
the shape variations but also captures the expressive
characteristic of the faces and can be used as a significant
feature for facial expression recognition. Finally the paper
suggests improvements of the SSV discriminatory characteristics
by using 3D facial sequences rather than 3D stills.
1. Introduction
Facial expressions provide important information in communication between people and can be used to enable communication with computers in a more natural way. Recent advances in imaging technology and ever-increasing computing power have opened up the possibility of automatic facial expression recognition. To date, research efforts have been directed at applications such as human-computer interaction (HCI) systems [1], video conferencing [2], and augmented reality [3]. From the biometric perspective, automatic expression recognition has been investigated in the context of monitoring patients in intensive care and neonatal units [4] for signs of pain and anxiety, behavioural research on children’s ability to learn emotions by interacting with adults in different social contexts [5], identifying the level of concentration [6], for example, for detecting driver tiredness, and finally in aiding face recognition. Facial expression representation, which forms one of the most important elements of a facial expression recognition system, is concerned with the extraction of facial features for representing variations of expressions. Good features for representing facial expressions should enable interpretation of various face articulations without any limitation of race, gender, and age. Furthermore, they should also be capable of reducing the complexity of classification algorithms.
Generally, facial expressions can be represented in two forms, namely, holistic representation and local representation [7]. For the holistic representation, the face is processed as a single entity. Wang and Yin [8] introduced a holistic method for representing facial expressions, named the topographic context (TC). In this method a grey-scale facial image is treated as a topographic terrain surface in a 3D space, with the height of the terrain represented by the image intensity at each pixel. As a result of the topographic analysis, each pixel of the image is described by one of the topographic labels: peak, ridge, saddle, hill, flat, ravine, and pit. The topographic context has also been extended to 3D facial surfaces by Wang et al. [9], where it is referred to as the primitive surface feature method. Huang et al. [10] proposed a method for expression representation based on the local binary pattern, which was originally designed for texture description. The local binary pattern is calculated by encoding the depth differences of a 3D facial surface. The active appearance model (AAM) is a statistical model of the shape and grey level of an object of interest and is mainly used for 2D facial images. For facial expression representation, the AAM is built on facial images which are manually annotated with a set of landmarks localised around facial features such as the eyebrows, eyes, mouth, and nose [11]. As an extension of the AAM, the 3D morphable model was developed by Blanz and Vetter [12]. Instead of using manually selected sparse facial landmarks, the 3D morphable model uses all the data points of 3D facial scans to represent the geometrical information. This model has been used to control 3D facial surfaces from a 2D image, across variations in pose, ranging from frontal to profile view, and a wide range of illuminations. The B-spline is a parametric model which is often used to describe surfaces. When used with 3D facial data, a large number of data points can be efficiently modelled by a small number of B-spline control points [13]. When combined with the facial action coding system (FACS) [14], the control points are placed in areas that correspond to action units, and the expression of a face can be generated automatically by adjusting the B-spline control points.
In contrast to the holistic approaches, the local representation methods focus on the local features or areas that are prone to change with facial expressions. Saxena et al. [15] introduced the localised geometric model to locally represent facial expressions. Their method uses the classical edge detectors with colour analysis for extracting the local appearances of a face such as eyebrows, lips, and nose. Subsequently a feature vector containing measurements of the facial appearances, such as the height of eyebrows, brow distance, mouth height, mouth width, and lip curvature, is created for the facial expression classification. A local parameterised model proposed by Black and Yacoob [16] is developed based on image motion which is calculated using the optical flow of facial image sequences. The image motion not only accurately models a nonrigid facial motion but also provides a concise description that is related to the motion of local facial features to recognise facial expressions. Kobayashi et al. [17] used a point-based geometric model for the facial expression representation. The model contains facial characteristic points in the frontal-view of the face. These facial characteristic points are around the areas that are the most affected by change of facial expressions, such as eyes, nose, brows, and mouth.
In this paper, a novel method for representing facial expressions is proposed based on the authors’ previous work [18–20], which postulates that the shape space vectors constitute a significant feature space for the recognition of facial expressions. The proposed method uses only 3D shape information, with the texture not being used at all. The method is therefore inherently invariant to variations in scene illumination conditions, background clutter, and to some extent angle of view. This is in striking contrast to methods based on texture, where these factors severely limit practical applicability. Additionally, as the texture is not used, it does not have to be captured; hence fast full-frame 3D acquisition techniques based on the time-of-flight principle [21] can be used (3D scanners capturing in excess of 40 frames/sec are commercially available) instead of more computationally intensive, and therefore slower, stereovision scanning systems. The shape space vector (SSV) is the key element in the statistical shape model (SSM), which models the high-dimensional shape variations observed in the training data set using projections onto a low-dimensional shape space. In order to obtain the SSV two consecutive stages are necessary, namely, (i) the model building stage and (ii) the model fitting stage. In the model building stage, the correspondences of points between all faces present in the training data set are established first so that the training data set can be aligned to a common reference face. Subsequently the principal component analysis (PCA) technique is applied to the aligned training data set to obtain the SSM of the shape variations. In the model fitting stage, an iterative algorithm based on a modified iterative closest point (ICP) method is used to gradually adjust the pose parameters and optimise the shape parameters in order to match the model to the newly observed facial data. The pose parameters consist of a translation vector, a rotation matrix, and a scaling factor, whereas the shape parameters are embedded in the SSV. In order to validate the discriminatory ability of the SSV, 3D synthetic faces generated from the FaceGen Modeller [22] and real 3D facial scans from the BU-3DFE database [23] are used for the separability analysis in the SSV domain. The experiments on recognition of facial expressions using a selection of standard classification tools are also presented.
The remainder of this paper is organised as follows. Section 2 introduces the details of construction of the SSM. Section 3 describes the procedure used for fitting the model to the facial data that has not been included in the training data set. Section 4 provides results of qualitative and quantitative separability analysis. Results of facial expression recognition using some popular classification algorithms operating on the SSV feature space are presented in Section 5. Finally, concluding remarks are given in Section 6, and a potential improvement of the expression representation using the SSV constructed for dynamic 3D data is briefly discussed in Section 7.
2. Statistical Shape Model
The statistical shape model (SSM) is developed based on the point distribution model (PDM), which was proposed by Cootes et al. [24], and it is one of the most widely used techniques for model-based data representation and registration. The model describes shape variations based on the statistics calculated from the positions of the corresponding points in the training data set. In order to build an SSM, the correspondence of points between different 3D faces in the training data set must be established first. Subsequently the principal component analysis (PCA) is applied to the mutually aligned training data set.
2.1. Estimating Point Correspondence
Knowledge of the correspondence of points between 3D faces in the training data set is essential, because incorrect correspondence can either introduce too much variation or lead to illegal instances of the model [24]. For the data used in this paper, the correspondence of points for the database generated using the FaceGen Modeller is explicitly provided by the software, whereas the dense correspondence of points for the faces in the BU-3DFE database is estimated based on a set of facial landmarks included in the database.
In this work, the estimation of the correspondence is achieved in three steps: (i) facial landmark determination, (ii) thin-plate spline (TPS) warping, and (iii) closest point matching. The first step is to identify the corresponding facial landmarks on the reference and training faces. The second step is to warp the reference face to each training face using a TPS transformation that is calculated using the selected facial landmarks as control points [25]. The last step is to estimate the point correspondence between the warped reference face and each training face based on the closest distance metric. Figure 1 shows the framework for computing the dense point correspondence of different training faces from the BU-3DFE database. The reference face is usually selected as a face with a neutral expression and the mouth closed. Such a selection of the reference face helps to avoid wrong correspondences when matching between closed-mouth and open-mouth shapes. If the reference face were selected with the mouth open, then after dense correspondence estimation each point in the open-mouth area of the reference face would be assigned an incorrect corresponding point within the closed-mouth region of the training face, even though no such corresponding points exist in training faces with the mouth closed.
Figure 1: Example of point correspondence estimation in the training data set, with example images from the BU-3DFE.
2.1.1. Thin-Plate-Spline Warping
The TPS warping technique is a point-based registration method which was first proposed by Bookstein [26]. The TPS warping can be used for interpolation as well as approximation. For the TPS interpolation, the positions of corresponding landmarks are assumed to be known exactly and the corresponding landmarks are forced to match each other exactly after warping [25, 27]. For the TPS approximation, the landmark position errors are taken into account, implying that the corresponding landmarks are not forced to match exactly after warping is applied. It can be shown that the solution of the approximation problem is equivalent to the inclusion of a regularisation term in the cost function along with a fidelity term which is exactly the same as that used in the definition of the interpolation problem [28]. In this work, the corresponding facial landmarks are manually labeled on the 3D face scans, and their positions are always prone to some errors. Therefore, the TPS approximation model is more suitable for our application.
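Given sparse corresponding facial landmarks in the reference face and one of the training faces, represented, respectively, by $\mathcal{P} = \{\mathbf{p}_i\}_{i=1}^{n}$ and $\mathcal{Q} = \{\mathbf{q}_i\}_{i=1}^{n}$, where $\mathbf{p}_i = (x_i, y_i, z_i)^T$ and $\mathbf{q}_i = (x'_i, y'_i, z'_i)^T$ denote the $x$, $y$, and $z$ coordinates of the $i$th corresponding pair and $n$ is the total number of corresponding facial landmarks, the objective is to find the TPS warping function that warps the reference face to the training face. The interpolating warping function, $\mathbf{f}$, has to fulfill the following constraint for all the landmarks in $\mathcal{P}$ and $\mathcal{Q}$:
$$\mathbf{f}(\mathbf{p}_i) = \mathbf{q}_i, \quad i = 1, \ldots, n,$$
where the deformation model is defined in terms of the warping function $\mathbf{f}$ with
$$\mathbf{f}(\mathbf{p}) = \left[f_x(\mathbf{p}),\ f_y(\mathbf{p}),\ f_z(\mathbf{p})\right]^T,$$
where $\mathbf{p} = (x, y, z)^T$ is a point on the reference face and the warping functions for the $x$, $y$, and $z$ coordinates are defined as follows:
$$f_k(\mathbf{p}) = a_{k1} + a_{kx}x + a_{ky}y + a_{kz}z + \sum_{i=1}^{n} w_{ki}\,U(\|\mathbf{p} - \mathbf{p}_i\|), \quad k \in \{x, y, z\}.$$
Function $U(r)$ is a radial basis function, where $r$ is the distance between two points. According to Bookstein [26], the coefficients of the TPS interpolation model can be calculated from
$$\mathbf{K}\mathbf{W} + \mathbf{P}\mathbf{A} = \mathbf{V}$$
and
$$\mathbf{P}^T\mathbf{W} = \mathbf{O},$$
where $\mathbf{V}$ is a matrix which contains the facial landmarks on the target face, written as
$$\mathbf{V} = \left[\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_n\right]^T,$$
$\mathbf{W}$ and $\mathbf{A}$ are the matrices containing the coefficients of the TPS interpolation, defined as
$$\mathbf{W} = \begin{bmatrix} w_{x1} & w_{y1} & w_{z1} \\ \vdots & \vdots & \vdots \\ w_{xn} & w_{yn} & w_{zn} \end{bmatrix}, \qquad \mathbf{A} = \begin{bmatrix} a_{x1} & a_{y1} & a_{z1} \\ a_{xx} & a_{yx} & a_{zx} \\ a_{xy} & a_{yy} & a_{zy} \\ a_{xz} & a_{yz} & a_{zz} \end{bmatrix},$$
whereas the matrix $\mathbf{K}$, which contains the radial basis functions, is defined as
$$\mathbf{K} = \begin{bmatrix} U(r_{11}) & \cdots & U(r_{1n}) \\ \vdots & \ddots & \vdots \\ U(r_{n1}) & \cdots & U(r_{nn}) \end{bmatrix},$$
and the radial basis function is evaluated at the landmark distances
$$r_{ij} = \|\mathbf{p}_i - \mathbf{p}_j\|.$$
$\mathbf{P}$ is the matrix including all corresponding landmarks of the reference face, defined as
$$\mathbf{P} = \begin{bmatrix} 1 & x_1 & y_1 & z_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & y_n & z_n \end{bmatrix},$$
and matrix $\mathbf{O}$ is defined as the $4 \times 3$ zero matrix.
In the TPS approximation model, the interpolation condition has to be weakened since the landmark localisation errors have to be taken into account. A regularisation term needs to be added to the TPS interpolation model in order to control the smoothness of the transformation. The coefficients of the TPS approximation model can be calculated as
$$(\mathbf{K} + \gamma\mathbf{I})\mathbf{W} + \mathbf{P}\mathbf{A} = \mathbf{V}, \qquad \mathbf{P}^T\mathbf{W} = \mathbf{O},$$
where $\mathbf{I}$ is the $n \times n$ identity matrix and $\gamma$ is a relative weighting factor between the interpolating behavior and the smoothness of the transformation. For small $\gamma$, the TPS warping maintains a good approximation of the landmarks. For large $\gamma$, the TPS warping function becomes very smooth and adapts very little to the local structures present in the data.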
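The TPS approximation described above can be illustrated with a short NumPy sketch that solves the regularised linear system for the warp coefficients and then applies the warp to arbitrary points. This is a minimal sketch under stated assumptions rather than the implementation used in this work: the kernel U(r) = -r (a common choice for thin-plate splines in 3D), the function names tps_fit and tps_warp, and the parameter name reg for the weighting factor are all introduced here for illustration.

```python
import numpy as np

def tps_fit(ref_lm, tgt_lm, reg=0.0):
    """Fit TPS coefficients that warp reference landmarks onto target landmarks.

    ref_lm, tgt_lm: (n, 3) arrays of corresponding landmarks.
    reg: weighting factor; 0 gives exact interpolation, larger values
         give a smoother warp that only approximates the landmarks.
    Returns W (n, 3) non-affine weights and A (4, 3) affine coefficients.
    """
    n = ref_lm.shape[0]
    # Kernel matrix K_ij = U(||p_i - p_j||), with U(r) = -r (assumed kernel).
    K = -np.linalg.norm(ref_lm[:, None, :] - ref_lm[None, :, :], axis=2)
    P = np.hstack([np.ones((n, 1)), ref_lm])        # rows [1, x_i, y_i, z_i]
    # Block system [[K + reg*I, P], [P^T, 0]] [W; A] = [V; 0].
    L = np.zeros((n + 4, n + 4))
    L[:n, :n] = K + reg * np.eye(n)
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.vstack([tgt_lm, np.zeros((4, 3))])
    sol = np.linalg.solve(L, rhs)
    return sol[:n], sol[n:]

def tps_warp(points, ref_lm, W, A):
    """Apply the fitted warp to an (m, 3) array of points."""
    U = -np.linalg.norm(points[:, None, :] - ref_lm[None, :, :], axis=2)
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return P @ A + U @ W

# Usage with synthetic landmarks (hypothetical data).
rng = np.random.default_rng(0)
ref = rng.normal(size=(12, 3))
tgt = ref + 0.1 * rng.normal(size=(12, 3))
W, A = tps_fit(ref, tgt, reg=0.5)
warped = tps_warp(ref, ref, W, A)   # close to, but not exactly, tgt
```

With reg set to zero the warped reference landmarks coincide with the target landmarks; increasing reg trades landmark fidelity for smoothness, which is the behaviour wanted when the landmarks are manually labeled and therefore noisy.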
2.1.2. Closest Point Matching
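After the TPS approximation, the shape of the reference face is warped to match the training face. Since the shape of the warped reference face is close to the shape of the training face, the dense point correspondence between the reference face and the training face can be computed using the closest distance metric. With the Euclidean distance between two points $\mathbf{p} = (x_p, y_p, z_p)^T$ and $\mathbf{q} = (x_q, y_q, z_q)^T$ defined as
$$d(\mathbf{p}, \mathbf{q}) = \|\mathbf{p} - \mathbf{q}\| = \sqrt{(x_p - x_q)^2 + (y_p - y_q)^2 + (z_p - z_q)^2},$$
and denoting the set of points of the training face by $\mathcal{T}$, the closest distance between a point $\mathbf{p}$ of the reference face and the training face is defined as
$$d(\mathbf{p}, \mathcal{T}) = \min_{\mathbf{q} \in \mathcal{T}} d(\mathbf{p}, \mathbf{q}).$$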
Using the TPS approximation and closest point matching, the dense point correspondence between the reference face and a training face can be established. This process is applied to all the training faces such that all of them are in correspondence. The training faces from the BU-3DFE database contain between 13 000 and 20 000 mesh polygons with 8711 to 9325 vertices. The reference face used in this paper has 15 687 mesh polygons and 8925 vertices. After performing the TPS approximation and closest point matching, it is likely that there will be many-to-one correspondences between the reference face and a training face, that is, several reference points matched to the same training point. It is impossible to avoid this completely due to the nature of the closest point matching technique. In order to reduce the number of such correspondences, a subdivision surface method has been used to increase the number of vertices in the training faces [29].
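The closest point matching step, and the many-to-one correspondences it can produce, can be sketched as follows. This is an illustrative brute-force version with hypothetical function names (closest_point_correspondence, count_shared_targets); it is not the implementation used in this work.

```python
import numpy as np

def closest_point_correspondence(ref_pts, train_pts):
    """For every warped reference point, find the closest training point.

    ref_pts: (n_ref, 3), train_pts: (n_train, 3).
    Returns the index of the matched training point and the distance to it.
    """
    # Pairwise Euclidean distances, shape (n_ref, n_train).
    d = np.linalg.norm(ref_pts[:, None, :] - train_pts[None, :, :], axis=2)
    idx = d.argmin(axis=1)
    return idx, d[np.arange(ref_pts.shape[0]), idx]

def count_shared_targets(idx):
    """Number of training points that are matched by more than one reference point."""
    _, counts = np.unique(idx, return_counts=True)
    return int((counts > 1).sum())

# Small hypothetical example: two reference points share one training point.
ref = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
train = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.1], [5.0, 5.0, 5.0]])
idx, dist = closest_point_correspondence(ref, train)
shared = count_shared_targets(idx)
```

The full pairwise distance array grows with the product of the two vertex counts, so for meshes with thousands of vertices a spatial index such as a k-d tree would be the more practical choice.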
2.2. Principal Component Analysis
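Using the standard principal component analysis (PCA), each 3D face in the training data set can be approximately represented in a low-dimensional shape vector space [30] instead of the original high-dimensional data vector space. Consider a training data set of $M$ faces, $\{\mathbf{x}_i\}_{i=1}^{M}$, each containing $N$ corresponding data points, where $\mathbf{x}_i = (x_{i1}, y_{i1}, z_{i1}, \ldots, x_{iN}, y_{iN}, z_{iN})^T$ contains all the data points of the $i$th face encoded as a $3N$-dimensional vector. The first step of the PCA is to calculate the mean vector $\bar{\mathbf{x}}$ (representing the mean 3D face):
$$\bar{\mathbf{x}} = \frac{1}{M}\sum_{i=1}^{M}\mathbf{x}_i. \tag{18}$$
Let $\mathbf{C}$ be defined as the covariance matrix calculated from the training data set:
$$\mathbf{C} = \frac{1}{M-1}\sum_{i=1}^{M}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T.$$
By building a matrix $\mathbf{D}$ of “centered” data vectors with $\mathbf{x}_i - \bar{\mathbf{x}}$ as the $i$th column of matrix $\mathbf{D}$, covariance matrix $\mathbf{C}$ can be calculated as
$$\mathbf{C} = \frac{1}{M-1}\mathbf{D}\mathbf{D}^T,$$
where matrix $\mathbf{D}$ has $3N$ rows and $M$ columns. Since the number of faces, $M$, in the training data set is smaller than the number of data points, the eigen decomposition of the $M \times M$ matrix $\mathbf{D}^T\mathbf{D}$ is performed first [31]. The first $K$ largest eigenvalues $\lambda_i$ and eigenvectors $\mathbf{u}_i$ of the original covariance matrix, $\mathbf{C}$, are then determined, respectively, from
$$\lambda_i = \frac{\lambda'_i}{M-1},$$
$$\mathbf{u}_i = \frac{\mathbf{D}\mathbf{u}'_i}{\|\mathbf{D}\mathbf{u}'_i\|}, \tag{22}$$
where $\lambda'_i$ and $\mathbf{u}'_i$ are eigenvalues and eigenvectors of matrix $\mathbf{D}^T\mathbf{D}$, respectively. By using these eigenvalues and eigenvectors, the data points on any 3D face in the training data set can be approximately represented using a linear model of the form
$$\mathbf{x} \approx \bar{\mathbf{x}} + \mathbf{\Phi}\mathbf{b}, \tag{23}$$
where $\mathbf{\Phi} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_K]$ is a so-called “Shape Matrix” of $K$ eigenvectors, or “modes of variation”, which correspond to the $K$ largest eigenvalues, and $\mathbf{b} = (b_1, b_2, \ldots, b_K)^T$ is the shape space vector (SSV), which controls the contribution of each eigenvector, $\mathbf{u}_i$, in the approximated surface [12]. The shape matrix is database-dependent. When new faces are added to the existing database, this shape matrix needs to be recalculated. Most of the surface variations can usually be modelled by a small number of modes $K$. Equation (23) can be used to generate new examples of faces by changing the SSV, $\mathbf{b}$, within suitable limits [24]. According to Edwards et al. [11], the suitable limits of the SSM are typically defined as
$$-3\sqrt{\lambda_i} \le b_i \le 3\sqrt{\lambda_i}, \quad i = 1, \ldots, K.$$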
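The model building procedure of this section, including the eigen decomposition of the small matrix formed from the centered data, can be sketched in NumPy as follows. The sketch assumes that faces are stored as rows of an array, that the covariance is normalised by one less than the number of faces, and that the limits on each SSV element are three standard deviations; the function names build_ssm and synthesise are hypothetical.

```python
import numpy as np

def build_ssm(X, k):
    """Build a statistical shape model from aligned training faces.

    X: (M, 3N) array, one face per row, with M < 3N.
    k: number of modes to keep (k <= M - 1).
    Returns the mean face, shape matrix Phi (3N, k), and eigenvalues lam (k,).
    """
    M = X.shape[0]
    mean = X.mean(axis=0)
    D = (X - mean).T                       # centered data, one face per column
    # Decompose the small M x M matrix instead of the 3N x 3N covariance.
    evals, evecs = np.linalg.eigh(D.T @ D)
    order = np.argsort(evals)[::-1][:k]    # indices of the k largest eigenvalues
    lam = evals[order] / (M - 1)
    Phi = D @ evecs[:, order]              # map back to the 3N-dimensional space
    Phi /= np.linalg.norm(Phi, axis=0)     # unit-length modes of variation
    return mean, Phi, lam

def synthesise(mean, Phi, lam, b):
    """Generate a face from an SSV b, clipped to +/- 3 standard deviations."""
    limit = 3.0 * np.sqrt(lam)
    return mean + Phi @ np.clip(b, -limit, limit)

# Usage with random stand-in data (6 "faces" of 10 points each).
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 30))
mean, Phi, lam = build_ssm(X, k=3)
new_face = synthesise(mean, Phi, lam, np.array([1.0, -0.5, 0.2]))
```

Working with the small matrix keeps the decomposition at the size of the number of training faces, which is what makes the approach practical when each face contains several thousand vertices.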
Figure 2 shows the effect of varying the first three largest principal components of the two models. These models were built using training faces from the FaceGen and BU-3DFE database, respectively.
Figure 2: Effects of changing the contribution of the first three principal components of the shape space vector on the models derived from the FaceGen and BU-3DFE data sets.
3. Model Fitting
Provided that the faces in the database are representative of the faces in the population, a new face from the same population, which has not been included in the training data, can be represented using the derived SSM. In the proposed method, the model fitting is treated as a surface registration problem, which includes the estimation of the pose parameters and shape parameters of the model. Whilst the pose parameters include a translation vector, a rotation matrix, and a scaling factor, the shape parameters are defined by the SSV. As described in the following subsection, the algorithm starts by aligning a new face with the mean face of the model using similarity transformation. Subsequently the model continues to be refined by iteratively estimating the SSV and pose parameters.
3.1. Similarity Registration
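The iterative closest point (ICP) method can be used to achieve similarity registration between the model mean face and a new face. The ICP [32] is a widely used point-based surface matching algorithm. This procedure iteratively refines the alignment by alternately estimating point correspondences and finding the best similarity transformation that minimises a cost function between the corresponding points. In this work the cost function is defined using the Euclidean distance:
$$E(\mathbf{R}, \mathbf{t}, s) = \sum_{i=1}^{N}\left\|\mathbf{p}_i - (s\mathbf{R}\mathbf{q}_i + \mathbf{t})\right\|^2,$$
where $\mathbf{p}_i$ and $\mathbf{q}_i$ are, respectively, the corresponding vertices from the model and the data face. $\mathbf{R}$ is a $3 \times 3$ rotation matrix, $\mathbf{t}$ is a translation vector, and $s$ is a scaling factor. Following the algorithms in [33, 34], $\mathbf{R}$, $\mathbf{t}$, and $s$ are calculated as follows.
From the point sets, $\{\mathbf{p}_i\}$ and $\{\mathbf{q}_i\}$, compute the mean vectors, $\bar{\mathbf{p}}$ and $\bar{\mathbf{q}}$:
$$\bar{\mathbf{p}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{p}_i, \qquad \bar{\mathbf{q}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{q}_i.$$
Calculate $\mathbf{p}'_i$ and $\mathbf{q}'_i$:
$$\mathbf{p}'_i = \mathbf{p}_i - \bar{\mathbf{p}}, \qquad \mathbf{q}'_i = \mathbf{q}_i - \bar{\mathbf{q}}.$$
Calculate the matrix $\mathbf{H}$:
$$\mathbf{H} = \sum_{i=1}^{N}\mathbf{q}'_i\,{\mathbf{p}'_i}^T.$$
Find the SVD of $\mathbf{H}$:
$$\mathbf{H} = \mathbf{U}\mathbf{\Lambda}\mathbf{V}^T.$$
Compute the rotation matrix:
$$\mathbf{R} = \mathbf{V}\mathbf{S}\mathbf{U}^T, \qquad \mathbf{S} = \mathrm{diag}\left(1, 1, \det(\mathbf{V}\mathbf{U}^T)\right). \tag{32}$$
Find the translation vector and scaling factor:
$$\mathbf{t} = \bar{\mathbf{p}} - s\mathbf{R}\bar{\mathbf{q}}, \qquad s = \frac{\mathrm{tr}\left({\mathbf{P}'}^T\mathbf{R}\mathbf{Q}'\right)}{\mathrm{tr}\left({\mathbf{Q}'}^T\mathbf{Q}'\right)},$$
where $\mathbf{P}' = [\mathbf{p}'_1, \ldots, \mathbf{p}'_N]$ and $\mathbf{Q}' = [\mathbf{q}'_1, \ldots, \mathbf{q}'_N]$ are $3 \times N$ matrices. In (32), matrix $\mathbf{S}$ is used as a “safeguard” making sure that the calculated matrix $\mathbf{R}$ is a rotation matrix and not a reflection in 3D space. The outline of the similarity registration procedure is given in Algorithm 1. The criterion used to terminate the iteration of the algorithm is based on the variation of the distance between the two surfaces at two successive iterations. According to the experimental results, the iteration of similarity registration is terminated when this variation falls below a preset threshold. Figure 3(a) shows an example of the results obtained by the similarity registration. The position of the model is fixed and the new face is transformed to align to the model. Although there are noticeable local misalignments, for example, around the mouth and eyes, due to different facial expressions, the two surfaces are globally well matched.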
Algorithm 1: Similarity registration.
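A minimal sketch of how the similarity registration loop could be implemented is given below. It combines a closed-form, SVD-based estimate of the rotation, scaling, and translation with a brute-force closest-point search. The function names, the use of the RMS closest-point distance as the monitored quantity, and the default values of max_iter and tol are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def similarity_transform(P, Q):
    """Least-squares s, R, t such that P is approximated by s * R q + t.

    P, Q: (N, 3) arrays of corresponding model (P) and data (Q) points.
    """
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - p_mean, Q - q_mean
    H = Qc.T @ Pc                                   # sum of q'_i p'_i^T (3 x 3)
    U, _, Vt = np.linalg.svd(H)
    V = Vt.T
    # Safeguard: force a proper rotation (determinant +1), not a reflection.
    S = np.diag([1.0, 1.0, np.linalg.det(V @ U.T)])
    R = V @ S @ U.T
    s = np.sum(Pc * (Qc @ R.T)) / np.sum(Qc ** 2)
    t = p_mean - s * R @ q_mean
    return s, R, t

def rms_closest(model, pts):
    """Closest model vertex for every point, and the RMS of those distances."""
    d = np.linalg.norm(pts[:, None, :] - model[None, :, :], axis=2)
    idx = d.argmin(axis=1)
    return idx, np.sqrt(np.mean(d[np.arange(pts.shape[0]), idx] ** 2))

def icp_similarity(model, data, max_iter=50, tol=1e-6):
    """Align data to a fixed model by alternating matching and pose estimation."""
    moved = data.copy()
    idx, err = rms_closest(model, moved)
    for _ in range(max_iter):
        s, R, t = similarity_transform(model[idx], moved)
        moved = (s * moved) @ R.T + t
        idx, new_err = rms_closest(model, moved)
        converged = abs(err - new_err) < tol        # variation between iterations
        err = new_err
        if converged:
            break
    return moved, err
```

As in the text, the model stays fixed and the new face is the surface being transformed; the loop stops once the distance between the surfaces changes by less than the tolerance between two successive iterations.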
Figure 3: An example of the model fitting.
3.2. Model Refinement
With the data registered to the current model using similarity transformation, the objective of the model refinement is to deform the model so that it is better aligned to the transformed data points. To estimate the optimal pose and shape parameters the whole process has to iterate. This can be seen as a superposition of the ICP method and the least squares projection onto the shape space. The least squares projection onto the shape space provides the SSV, $\mathbf{b}$, which controls the deformations of the model. It is also postulated here that at the convergence point this vector can be used as a feature for interpretation of the face articulation. The SSV, $\mathbf{b}$, for an observed face is calculated from
$$\mathbf{b} = \mathbf{\Phi}^T(\mathbf{x} - \bar{\mathbf{x}}),$$
where $\mathbf{x}$ is a vector which contains the corresponding data points representing the new face. The mean vector of data points $\bar{\mathbf{x}}$ and shape matrix $\mathbf{\Phi}$ are obtained from (18) and (22), respectively. The details of the algorithm are explained in Algorithm 2. The criterion used to terminate the iteration of the model refinement is based on the change of the SSVs at two successive iterations. According to the experimental results, the iteration of the algorithm is terminated when the change of the SSVs falls below a preset threshold. For most cases, it is seen that the shape variation of the model during the model refinement is negligible when the change of the SSVs is smaller than this preset threshold.
Algorithm 2: Model refinement.
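The shape part of the model refinement can be sketched as an alternation between a closest-point correspondence search and a least squares projection onto the shape space. The sketch below is deliberately simplified: it assumes the data have already been similarity-registered and keeps the pose fixed, whereas the procedure described above also re-estimates the pose at each iteration. The function names and the default tolerance are hypothetical.

```python
import numpy as np

def project_to_ssv(x, mean, Phi):
    """Least squares projection of a face vector onto the shape space."""
    return Phi.T @ (x - mean)

def reconstruct(mean, Phi, b):
    """Model instance generated by the SSV b."""
    return mean + Phi @ b

def refine_ssv(data_pts, mean, Phi, max_iter=30, tol=1e-6):
    """Estimate the SSV of a registered face without known correspondence.

    data_pts: (n_data, 3) points of the new face, already aligned to the model.
    mean: (3N,) mean face; Phi: (3N, K) shape matrix with orthonormal columns.
    """
    b = np.zeros(Phi.shape[1])
    for _ in range(max_iter):
        model = reconstruct(mean, Phi, b).reshape(-1, 3)
        # Match every model vertex to its closest data point.
        d = np.linalg.norm(model[:, None, :] - data_pts[None, :, :], axis=2)
        x = data_pts[d.argmin(axis=1)].ravel()
        b_new = project_to_ssv(x, mean, Phi)
        done = np.linalg.norm(b_new - b) < tol     # change of the SSV
        b = b_new
        if done:
            break
    return b
```

The vector returned at convergence is the quantity that the remainder of the paper treats as the expression feature.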
An example of the results obtained from the model refinement is shown in Figure 3(b). In this case the model is matched to a face with a strong fear expression. The intermediate states illustrate how the model is being deformed to match the new face during the refinement iterations.
4. Separability Analysis
To assess if the SSV can be used as a feature space for the facial expression analysis and recognition, the separability of the SSV-based features has been analysed, using qualitative and quantitative methods. In the qualitative analysis, the separability of the SSV-based features is examined visually in a low-dimensional SSV space. The quantitative analysis is carried out using one of the numerical separability criteria. Four types of data sets have been used in the separability analysis; they are 3D synthetic faces generated from the FaceGen Modeller, manually selected 3D facial landmarks from the BU-3DFE database, 3D face scans from the BU-3DFE database, and automatically detected 3D facial landmarks from the BU-3DFE database. All these data sets cover a wide variety of ethnicity, age range, as well as gender. Face samples from the FaceGen and BU-3DFE data sets showing different individuals and different expressions are shown in Figure 4. The faces used for testing are not included in the training data sets used for building the SSM.
Figure 4: Face samples showing four different subjects and expressions with four levels of expression intensity.
4.1. Qualitative Evaluation
Since the high-dimensional SSV-based features are hard to visualise, only the first three elements of the SSV are used for qualitative analysis. For different types of data, the first three principal components retain different levels of variability present in the training data set. With the retained variability defined as , where and are given in Section 2.2, the first three principal components retain around of the total data variability for the model built using synthetic faces. For the model built from the facial landmarks the first three principal components retain around data shape variability, whereas for the model built using dense set of facial points the first three principal components retain of the variability. The last two models were built using the same faces randomly selected from the BU-3DFE database.
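Assuming the usual definition of retained variability as the ratio of the sum of the leading eigenvalues to the sum of all eigenvalues of the model, the quantity discussed above can be computed as in the following sketch. The function name and the example eigenvalues are hypothetical and do not correspond to any of the models used in this paper.

```python
import numpy as np

def retained_variability(eigenvalues, k=3):
    """Fraction of total variance captured by the k largest eigenvalues."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return lam[:k].sum() / lam.sum()

# Hypothetical eigenvalue spectrum, for illustration only.
ratio = retained_variability([4.0, 3.0, 2.0, 1.0], k=3)
```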
4.1.1. 3D Synthetic Faces
Firstly, the 3D synthetic faces generated from the FaceGen Modeller are used to show the separability of the SSV-based features. The FaceGen Modeller is a commercial software package designed to create realistic faces with controllable type and level of expressions for subjects of any ethnic origin or gender. Since the correspondence information is provided for all the face vertices (3428 vertices are used to represent all the synthetic faces), the SSM can be built directly without a correspondence search. However, it needs to be stressed that the a priori knowledge about the correspondence, for the faces in the training data set, was only used in the model building stage. In the model fitting stage the information about the data correspondence was ignored and the correspondence search was included in finding the SSV representation of the faces from the test sets.
For the evaluation, a training data set of 3D synthetic faces from subjects was used to build the SSM. A sample of faces from the training data set is shown in Figure 4(a). A further set of synthetic faces was used for testing. The training and testing faces are mutually exclusive. First, for clarity of the presentation, Figure 5 shows the separability of the synthetic faces’ SSVs for selected expression pairs with five different subjects and varying expression intensity. The SSVs of the same subject representing the same expression at various expression intensities are linked together. With the expression intensity as the only variable, the corresponding SSVs are aligned on the same line segment. It can be observed that the SSV-based features corresponding to different subjects and different facial expressions are well separated; furthermore the orientation of each line seems to define the type of the expression. Figure 6 shows the separability of the synthetic faces’ SSVs for all six basic expressions and five subjects shown in different colours. It can be seen that the SSVs representing different expressions for the same subject are clustered together and the SSVs representing the same expression are located on line segments having the same orientation, which is independent of the subject.
Figure 5: Visualization of the synthetic faces separability using first three elements of the SSV and five different subjects.
Figure 6: Visualization of the synthetic faces separability for six expressions and five subjects.
From the obtained results, showing clustered lines in the SSV space, it seems reasonable to postulate that the FaceGen Modeller uses a linear shape space model for face generation, whereby different eigen subspaces represent different face expressions as well as different face types. Such an approach for face generation was previously proposed in computer graphics literature [35]. From the presented results, it can be concluded that the proposed face registration method is able to recover the facial expression and subject control parameters used in the face generation model (e.g., orientations of the clustered lines in the SSV space define eigen faces responsible for generating different expressions in the FaceGen shape space model, whereas positions of the clustered lines define the subject’s identity, as shown in Figures 5 and 6).
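The observation that SSVs of one expression at varying intensity lie on line segments of a common orientation can be checked numerically by fitting a line to each group of SSVs and comparing the line directions. The sketch below assumes the simple linear model suggested above, in which an SSV is a subject-dependent offset plus an intensity-scaled expression direction; the function names and the example vectors are hypothetical and are not taken from the experiments.

```python
import numpy as np

def line_direction(ssvs):
    """Unit direction of the best-fit line through a set of SSVs (one per row)."""
    centred = ssvs - ssvs.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[0]

def orientation_similarity(d1, d2):
    """Absolute cosine between two directions (1 = parallel, 0 = orthogonal)."""
    return abs(float(d1 @ d2))

# Hypothetical expression directions and subject offsets in a 3D SSV space.
expr_1 = np.array([1.0, 2.0, 2.0]) / 3.0
expr_2 = np.array([2.0, -1.0, 0.0]) / np.sqrt(5.0)
intensity = np.array([0.25, 0.5, 0.75, 1.0])[:, None]
subject_a = np.array([0.0, 0.0, 0.0])
subject_b = np.array([5.0, -1.0, 2.0])

same_expr = orientation_similarity(
    line_direction(subject_a + intensity * expr_1),
    line_direction(subject_b + intensity * expr_1),
)
diff_expr = orientation_similarity(
    line_direction(subject_a + intensity * expr_1),
    line_direction(subject_a + intensity * expr_2),
)
```

Under this model, the position of each fitted line reflects the subject while its direction reflects the expression, which mirrors the interpretation of Figures 5 and 6 given above.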
4.1.2. Manually Selected Facial Landmarks
To test whether the SSV feature space can be used for classification of expressions present in real faces, and at the same time to circumvent any potential problems caused by wrong data correspondence, tests were carried out on the SSM derived from manually selected landmarks on faces from the BU-3DFE database. Each set of 3D facial landmarks provided in the database contains 83 facial points, which are manually labeled around the areas that are most affected by changes of facial expressions, including the eyes, nose, brows, and mouth. Figure 7 illustrates positions of the landmarks on two different faces. The BU-3DFE database contains 100 subjects; for each subject, various expressions are included, which can be categorised into neutral, happy, disgust, fear, angry, surprise, and sad [23]. The SSM was built using landmarks from faces belonging to randomly selected subjects. Another set of landmarks from a different set of faces from different subjects was used as a test set.
Figure 7: Example of manually selected landmarks in two different faces from the BU-3DFE database.
Figure 8 demonstrates the separability of the SSV feature space, derived using manually selected landmarks. The first three elements of the SSV were used with five types of facial expressions. Figure 8(a) shows that the facial expressions of happy and sad can be easily separated even in a low-dimensional SSV feature space. This is in agreement with the general consensus that the expressions of sadness and happiness are the most recognisable human expressions, as confirmed by a number of psychophysical tests. Some of the expressions, for example, “angry” and “fear”, are not as well separated in the feature space, as shown in Figure 8(c). Although they are partly “mixed” together in the low-dimensional shape space, it is still possible to separate the majority of these facial expressions. Again this result reflects findings of psychophysical tests, which confirm that expressions such as anger and fear can be easily misclassified by a human observer [36].
Figure 8: Separability analysis for manually selected landmarks using first three principal components.
4.1.3. Full 3D Face Scans
The results from the previous section show that with the use of the SSV feature space it is possible to discriminate facial expressions on real facial scans. Unfortunately, although the SSM built from manually selected landmarks uses real faces, the correspondence is established manually. This approach would not be a satisfactory solution for most applications as the manual landmark selection is too tedious and time consuming. In this section discriminatory characteristics of the SSV feature space constructed using a dense set of facial points, as described in Section 3, are examined. As explained there, the correspondence is estimated automatically during the pose estimation stage of the model fitting process. It should be noted here that as the dense correspondence is not given in the training data set, the correspondence between points on different training facial scans is also estimated during the model building phase as explained in Section 2.1.
Figure 9 illustrates the separability of the facial expressions in the feature space of the first three principal components of the SSV built from the full facial scans. As in the previous section five different facial expressions were used. Similarly to the results shown for the manually selected facial landmarks the results demonstrate again that the SSV feature space offers a good expression separability.
Figure 9: Separability of the facial expressions in the feature space of first three principal components of the SSV built from the full facial scans.
4.1.4. Automatically Selected Facial Landmarks
As shown in the previous section, the SSV feature space built from full facial scans, using dense facial points, provides good separability of expressions. Additionally, this approach is more practical as the correspondence is estimated automatically. Intuitively, the discriminatory characteristics of the SSV feature space can be further improved by using only information from the facial regions which are articulated the most during different expressions. In the “full facial scan” approach, all the points contribute to the SSM, but some points, for example those on the forehead, carry very little information about facial expression. These points would still contribute to the variations of the SSM as they represent the variability of facial shape between subjects. An evaluation was therefore carried out in which the “full facial scan” SSM was first used to establish the correspondence between the model and the data, and the SSM built from predefined facial landmarks on the model was subsequently used for the facial expression representation.
This approach is in principle very similar to using the SSV representing variations of the manually selected landmarks, the difference being that landmark selection is automated through registration of the “full facial scan” SSM with a new face. Since the indices of the facial landmarks on the model are already known, the positions of the corresponding landmarks on a new face scan can be directly estimated once the model is matched to that scan. In this case the surface registration error may introduce variability in the position of the landmarks, which in turn may have a negative effect on the classification performance. To examine the registration accuracy of the proposed method, tests were carried out with the synthetic and real faces. In the experiments, for each data type, the model was matched to 450 faces which were not used for the model building. Subsequently, the Euclidean distance between corresponding landmarks on the deformed model and the test faces was calculated. The average distance between corresponding landmarks on the synthetic faces and the model, calculated from all the 450 test faces, was 1.49 mm with a maximum error of 3.95 mm, whereas the corresponding distances obtained for the real faces were 3.56 mm and 7.64 mm, respectively. The larger registration errors obtained for the real faces are thought to be mainly due to errors in the manual selection of the facial landmarks. Indeed, it is believed that the errors in the manual landmark selection, used in the model building stage, have more influence on the method performance than the registration error.
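The landmark look-up and the error measure described above can be sketched as follows. This is an illustrative example for a single face with made-up coordinates: the function name `landmark_errors`, the toy model, and the landmark indices are all hypothetical. In the experiments reported above the mean and maximum would be taken over all landmarks of all 450 test faces.

```python
import numpy as np

def landmark_errors(deformed_model, landmark_idx, reference_landmarks):
    """Euclidean distance per landmark, in the units of the input (e.g. mm)."""
    # Landmarks are read off the registered model by their known vertex indices.
    estimated = deformed_model[landmark_idx]
    return np.linalg.norm(estimated - reference_landmarks, axis=1)

# Hypothetical toy data: a 5-vertex "model" after registration and 2 landmarks.
model = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0],
                  [0.0, 0.0, 3.0],
                  [1.0, 1.0, 1.0]])
idx = np.array([1, 3])                       # landmark indices on the model
reference = np.array([[1.0, 0.0, 1.0],       # reference position of landmark 1
                      [0.0, 4.0, 0.0]])      # reference position of landmark 2
errors = landmark_errors(model, idx, reference)
mean_error, max_error = errors.mean(), errors.max()
```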
Similar to the previous experiments, the model is built using face scans from randomly selected subjects, and another face scans from subjects are used for testing. Figure 10 shows the separability test for the proposed method. As before the first three principal components are used to represent five facial expressions. Compared to the case with the manually selected facial landmarks, the SSV feature space offers a comparable performance on separability of expressions.
Figure 10: Separability analysis for automatically selected landmarks using first three principal components.
4.2. Quantitative Evaluation
The separability of the SSV-based features has been demonstrated qualitatively in the preceding section. This qualitative analysis shows that the SSV feature space exhibits good facial expression separability. Due to the way the synthetic data is generated, the SSV-based features in that case were seen to form very distinctive linear patterns, with different line directions corresponding to different expressions. In the experiments with real facial scans from the BU-3DFE database, the best performance is achieved when landmarks are used to build the SSM.
In order to further investigate the separability of the SSV-based features, a quantitative evaluation was carried out. For this analysis, only the SSM which was generated using the data from the real scans was included in the test. The data sets included (i) manually selected facial landmarks, (ii) full face scans, and (iii) automatically selected facial landmarks. In this quantitative evaluation, a computable criterion based on the within-class and between-class distances [37] was used to measure the separability of expressions in the corresponding SSV feature spaces. A similar criterion has been used by Wang and Yin [8] to evaluate the separability of topographic context (TC) and intensity-based features for the facial expression analysis and recognition. The criterion relies on the average between-class distance in the case of multiple categories, which is defined as follows:
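$$J_d(\mathbf{x}) = \frac{1}{2}\sum_{i=1}^{c} P_i \sum_{j=1}^{c} P_j \frac{1}{n_i n_j}\sum_{k=1}^{n_i}\sum_{l=1}^{n_j} \delta\left(\mathbf{x}_k^{(i)}, \mathbf{x}_l^{(j)}\right),$$
where $n_i$ and $n_j$ are the number of samples in classes $\omega_i$ and $\omega_j$, $\mathbf{x}_k^{(i)}$ and $\mathbf{x}_l^{(j)}$ are the $D$-dimensional feature vectors (SSV) with labels $\omega_i$ and $\omega_j$, and $c$ is the number of distinct classes. $P_i$ and $P_j$ are the class-prior probabilities, and $\delta\left(\mathbf{x}_k^{(i)}, \mathbf{x}_l^{(j)}\right)$ denotes the distance between two samples, which is usually calculated using Euclidean distance. $J_d(\mathbf{x})$ can be represented in a compact form by using the so-called within-class scatter matrix $S_w$ and between-class scatter matrix $S_b$ [38], which are defined as follows:
$$S_w = \sum_{i=1}^{c} P_i \frac{1}{n_i}\sum_{k=1}^{n_i}\left(\mathbf{x}_k^{(i)} - \mathbf{m}_i\right)\left(\mathbf{x}_k^{(i)} - \mathbf{m}_i\right)^T, \qquad S_b = \sum_{i=1}^{c} P_i \left(\mathbf{m}_i - \mathbf{m}\right)\left(\mathbf{m}_i - \mathbf{m}\right)^T,$$
where $\mathbf{m}_i$ is the mean of samples in the $i$th class:
$$\mathbf{m}_i = \frac{1}{n_i}\sum_{k=1}^{n_i}\mathbf{x}_k^{(i)},$$
and $\mathbf{m}$ is the mean for all of the samples:
$$\mathbf{m} = \sum_{i=1}^{c} P_i \mathbf{m}_i.$$
Using (38), $J_d(\mathbf{x})$ can be rewritten in the following form:
$$J_d(\mathbf{x}) = \operatorname{tr}\left(S_w + S_b\right).$$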
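The equivalence between the pairwise form of the average between-class distance and its scatter-matrix form can be checked numerically. The sketch below makes two assumptions that are not stated in the text: the sample distance is the squared Euclidean distance, and the class priors are estimated as relative class frequencies. The function names and the random data are hypothetical.

```python
import numpy as np

def scatter_matrices(features, labels):
    """Within-class and between-class scatter with priors n_i / n."""
    classes = np.unique(labels)
    dim = features.shape[1]
    priors = {c: np.mean(labels == c) for c in classes}
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    overall_mean = sum(priors[c] * means[c] for c in classes)
    s_w = np.zeros((dim, dim))
    s_b = np.zeros((dim, dim))
    for c in classes:
        centred = features[labels == c] - means[c]
        s_w += priors[c] * (centred.T @ centred) / centred.shape[0]
        diff = (means[c] - overall_mean)[:, None]
        s_b += priors[c] * (diff @ diff.T)
    return s_w, s_b

def pairwise_criterion(features, labels):
    """Prior-weighted average of squared distances over all class pairs."""
    classes = np.unique(labels)
    priors = {c: np.mean(labels == c) for c in classes}
    total = 0.0
    for ci in classes:
        xi = features[labels == ci]
        for cj in classes:
            xj = features[labels == cj]
            squared = ((xi[:, None, :] - xj[None, :, :]) ** 2).sum(axis=2)
            total += priors[ci] * priors[cj] * squared.mean()
    return 0.5 * total

# Hypothetical data: three classes of 20 four-dimensional feature vectors.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
features = rng.normal(size=(60, 4)) + 3.0 * labels[:, None]
s_w, s_b = scatter_matrices(features, labels)
j_trace = np.trace(s_w + s_b)
j_pairs = pairwise_criterion(features, labels)
```

Under these assumptions the two values agree up to floating-point error, which is why the trace form is convenient: it avoids the quadratic number of pairwise distances.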
Although the average between-class distance $J_d$ is an efficient and computable separability criterion for feature selection, it is not appropriate for comparing two or more features, since its calculated value depends on the scale and dimensionality of the feature space. In order to compare two or more features which lie in different spaces with different scales and dimensionalities, a new criterion similar to $J_d$ is used (as in [8]), based on the natural logarithm of the ratio of the determinants of the within-class and between-class scatter matrices. The new metric is defined as
where is the entry which contains the maximum value in matrix , and matrix is obtained using the singular value decomposition (SVD) of matrix :
The larger the value of this criterion, the better the samples are separated. For comparison, the models using manually selected landmarks, full face scans, and automatically selected landmarks are built using the same face scans as described in the previous sections. As shown in Figure 11, for the same ratio of retained variability in the model training data, the criterion value calculated for the SSV feature space of manually selected landmarks is always the highest. It is not, however, significantly different from the value calculated for automatically selected landmarks when the retained variability is within the most commonly used range of 70% to 90%. As expected, the separability measured by this criterion is worst for the SSV computed from the full face scans.
Figure 11: Quantitative evaluation of facial expression separability in the SSV feature spaces.
5. Experiments on Facial Expression Recognition
The separability analyses performed in the previous section indicate that the SSV feature space can in principle be used for classification of facial expressions. In this section, person-independent facial expression recognition experiments using the high-dimensional SSV are conducted to further validate the discriminatory properties of the SSV feature space. Again, four different types of facial data were used in the experiments. For each type of facial data, faces from subjects are used containing six basic facial expressions of anger, disgust, fear, happiness, sadness, and surprise. These faces are divided into six subsets. Each subset contains six subjects with faces per subject representing different expressions. During algorithm evaluation one of the subsets is selected as the test subset while the remaining subsets are used to construct the training database. This experiment is repeated six times, with a different subset selected as the test subset each time. As the focus of this paper is on feature extraction and not on the design of the best possible classification algorithm, three well-known (off-the-shelf) classification methods have been used, namely linear discriminant analysis (LDA) [39], the quadratic discriminant classifier (QDC) [40], and the nearest neighbor classifier (NNC) [37]. A detailed description of these methods is beyond the scope of this paper but can be found in most textbooks on pattern recognition. The average recognition rates and standard deviations, calculated from all six experiments using different subsets of faces, for the four different types of facial data, are given in Table 1. To have a fair comparison, the size of the SSV for each data type has been selected in such a way that the retained variability in each corresponding SSM is as similar as possible.
For the results presented below, SSV for the synthetic data has elements corresponding to of retained variability, SSV for the full facial scans has elements corresponding to , whereas the SSV for the facial landmarks (both manually and automatically selected landmarks are using the same model) has elements corresponding to .
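The person-independent evaluation protocol can be sketched as a subject-wise six-fold cross-validation. The example below is only an illustration: it uses a plain nearest neighbour classifier written with NumPy, synthetic feature vectors, 36 subjects (six subsets of six), and one face per expression per subject, which is an assumption made for the toy data. The function names are hypothetical.

```python
import numpy as np

def nearest_neighbour_predict(train_x, train_y, test_x):
    """Label each test vector with the label of its closest training vector."""
    distances = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    return train_y[distances.argmin(axis=1)]

def subject_wise_cross_validation(features, labels, subjects, n_folds=6, seed=0):
    """Mean and standard deviation of the rate over subject-disjoint folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(np.unique(subjects)), n_folds)
    rates = []
    for test_subjects in folds:
        # No subject appears in both the training and the test subset.
        test = np.isin(subjects, test_subjects)
        predicted = nearest_neighbour_predict(
            features[~test], labels[~test], features[test])
        rates.append(np.mean(predicted == labels[test]))
    return float(np.mean(rates)), float(np.std(rates))

# Hypothetical toy data: 36 subjects, one feature vector per expression each.
rng = np.random.default_rng(1)
n_subjects, n_expressions = 36, 6
subjects = np.repeat(np.arange(n_subjects), n_expressions)
labels = np.tile(np.arange(n_expressions), n_subjects)
centres = 10.0 * np.eye(n_expressions)          # well-separated class centres
features = centres[labels] + rng.normal(scale=0.5, size=(labels.size, n_expressions))
mean_rate, std_rate = subject_wise_cross_validation(features, labels, subjects)
```

Splitting by subject rather than by face is what makes the test person-independent: a classifier never sees any scan of a test subject during training.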
Table 1: Recognition rate.
As shown in Table 1, all the classifiers achieve a similar recognition rate for the same data type, with extremely high rates achieved for the synthetic faces by all the classifiers except the NNC. For the facial data from the BU-3DFE database, the SSVs of the manually selected landmarks always reach the highest recognition rate, whereas the SSVs of the real faces always achieve the lowest rate. Tables 2 to 5 show the LDA classifier confusion matrices for all the data types used in the experiments.
Table 2: Confusion matrix of the LDA classifier for the synthetic faces.
Table 3: Confusion matrix of the LDA classifier for the real faces.
Table 4: Confusion matrix of the LDA classifier for the manually selected landmarks.
Table 5: Confusion matrix of the LDA classifier for the automatically selected landmarks.
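A confusion matrix of this kind can be computed from true and predicted labels as sketched below. The layout is an assumption: rows are taken as the true expression, columns as the predicted expression, and each row is normalised to percentages. The labels in the example are made up and do not reproduce any result from the tables.

```python
import numpy as np

EXPRESSIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def confusion_matrix_percent(true_labels, predicted_labels, n_classes):
    """Row-normalised confusion matrix in percent (rows: true, columns: predicted)."""
    counts = np.zeros((n_classes, n_classes))
    # Unbuffered accumulation so that repeated (true, predicted) pairs all count.
    np.add.at(counts, (true_labels, predicted_labels), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows without samples are left as zeros instead of dividing by zero.
    return 100.0 * counts / np.where(row_sums == 0, 1, row_sums)

# Hypothetical labels: fear (index 2) is confused with surprise (index 5) half the time.
true_labels = np.array([0, 0, 2, 2, 2, 2, 5, 5])
predicted_labels = np.array([0, 0, 2, 5, 5, 2, 5, 5])
cm = confusion_matrix_percent(true_labels, predicted_labels, len(EXPRESSIONS))
```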
The presented results show that the SSV-based features can be used for recognition of facial expressions. The results for the manually selected landmarks are included only for reference, as using this data type is not practical due to the lengthy process of landmark selection. From the presented results it can be seen that the best recognition rate of obtained for the automatically selected landmarks is comparable with the best recognition rate of obtained for the manually selected landmarks. This shows that the deformable surface registration method described in Section 3 is able to recover correct correspondences. An interesting insight into classification performance can be gained by looking at the confusion matrices. From Table 5, showing the confusion matrix of the LDA classifier for the automatically selected landmarks, it can be concluded that the anger and surprise expressions are all classified with above accuracy, whereas the fear expression is only classified correctly in . This raises the question of the adequacy of the ground truth data. This is a difficult problem, as human expressions are very subjective by nature. To demonstrate this, Table 6 shows the confidence confusion matrix obtained for human observers. This data has been obtained as part of a project aiming to build and validate a 3D dynamic human facial expression database [41]. The specific results shown in the table are based on 10 observers asked to rank their confidence about recognising facial expressions represented in video clips and each video clip lasts seconds. As can be seen in the table, the observers were very confident about recognising the happy expression, whereas the fear expression was often confused with the surprise expression. This shows the “subjective” nature of the ground truth data.
Although the recognition rate of for the fear expression in Table 5 seems quite low, it can be considered reasonable when the results presented in Table 6 are taken into account.
Table 6: Confidence confusion matrix for the human observers using 2D video sequences.
6. Conclusions
A novel method for facial expression representation has been presented in this paper. It uses only 3D shape information, and therefore, in contrast to most of the methods using texture, our method is invariant to changes in the illumination, background, and to some extent viewing angle. The proposed method assumes that the SSV efficiently encodes facial expressions, and that this encoding can be separated from the SSV variations caused by observing different faces. The tests performed confirmed this hypothesis, showing that the proposed representation is, at least partially, invariant to changes in ethnicity, gender, or age. A number of different configurations of the SSM have been tested. These include SSMs built from facial landmarks and from full facial scans, using real as well as simulated data. A fully automatic method has also been proposed for estimation of the SSV, using an iterative procedure which estimates the correspondence and the shape parameters in turn.
7. Future Work
In the method described in this paper the statistical shape model is built using a single database. Where multiple databases are subsequently integrated or combined, a further improvement of the method would be the construction of a hierarchical system in which the face type is decided first, and the facial expression is then recognized using a shape model built from the facial expression database constructed for the face type detected in the previous step.
The separability results presented in the paper show that the SSV feature space can offer generally good separation for different expressions. For some expressions, however, such as angry and fear, the method provides only limited separation, at least for the data used in the experiments. As a result, these expressions can be easily confused. One way to improve the separation of these “difficult” expressions is to provide more information to the model. From the reported psychophysical tests it can be concluded that temporal information about the expression articulation provides important cues for human observers and helps them to read expressions correctly. Following this observation, some simple tests were conducted with dynamic 3D facial scans. The dynamic face sequences were captured by the 3dMD scanner [42] at the ADSIP Research Centre, and the facial landmark set on each face in the sequence was subsequently labeled manually. An example of a face sequence is shown in Figure 12. Using the face sequences, the trajectory of each specified facial expression is recorded and displayed in the 3D feature space. Figure 13 shows two trajectories plotted in the SSV domain for sequences representing fear and angry expressions. It can be seen that these trajectories are well separated in the SSV domain, thereby illustrating the potential usefulness of the temporal information of the face articulation for automatic expression classification.
Figure 12: An example of the dynamic 3D face sequence.
Figure 13: Trajectories of the first three principal components of the SSV-based feature on dynamic face sequences.
Acknowledgments
The authors would like to acknowledge Dr. Lijun Yin from Binghamton University (USA) for making the BU-3DFE database available to them. This work has been supported in part by the MEGURATH project (EPSRC grant no. EP/D077540/1).