Abstract

Perception of head pose is useful for many face-related tasks such as face recognition, gaze estimation, and emotion analysis. In this paper, we propose a novel random forest based method for estimating head pose angles from single face images. In order to improve the effectiveness of the constructed head pose predictor, we introduce feature weighting and tree screening into the random forest training process. In this way, the features with more discriminative power are more likely to be chosen for constructing trees, and each of the trees in the obtained random forest usually has high pose estimation accuracy, while the diversity and generalization ability of the forest are not compromised. The proposed method has been evaluated on four public databases as well as a surveillance dataset collected by ourselves. The results show that the proposed method achieves state-of-the-art pose estimation accuracy. Moreover, we investigate the impact of pose angle sampling intervals and heterogeneous face images on the effectiveness of trained head pose predictors.

1. Introduction

Research on head pose estimation has been attracting increasing attention from researchers thanks to its importance in a wide variety of applications, such as behavior analysis, gaze estimation, driver fatigue detection, and face recognition. During the past decades, a number of methods have been proposed. In general, these methods can be roughly divided into two major categories depending on whether or not they require facial landmarks for pose estimation. A thorough review of various head pose estimation methods can be found in [1].

The methods using facial landmarks estimate pose angles mostly based on the assumption that the geometrical structure among the facial landmarks is relevant to the pose angles. Some of the methods [2–5] in this category fit a 3D face model to the landmarks detected on 2D face images by changing pose angles, and pick as the final result the angles which make the 3D face model fit the landmarks best. These methods have achieved impressive results, especially on faces with relatively small pose angles. However, their performance highly depends on the accuracy of landmark localization, which is itself a challenging task. Besides, the assumption underlying these methods is also arguable, because the difference between facial landmark configurations might be due either to different head poses or to different faces (i.e., identities).

The methods [6–9] not using landmarks treat the pose estimation problem as a classification or regression problem. They first train a pose angle predictor using a set of face images whose pose angles are known and then use the predictor to estimate the pose angles in unseen face images. These methods do not need to detect facial landmarks but directly extract appearance features from the face images. They are thus easier to apply. However, there are still some open issues related to these methods. First, the available training face images usually have discrete pose angles (say from −90 to 90 degrees with an increment of 5 degrees), although the head pose angle itself is continuous. In other words, the available training data is merely a sampling of the full head pose space, and it is still unknown how the pose angle sampling interval of the training face images affects the effectiveness of the obtained pose angle predictor. Second, several popular feature extraction methods have been used for head pose estimation, such as Gabor filter responses, Local Binary Patterns (LBP), Histograms of Oriented Gradients (HOG), and even grey-scale intensity features. In this paper, we investigate the performance of different feature extraction methods within our proposed method. Moreover, existing studies usually use face images captured with the same devices and under similar conditions for both training and testing. In real-world applications, however, one might have to apply a trained pose angle predictor in a scenario different from that of the training phase.

In this paper, we focus on the appearance- and training-based approach, which does not need to detect facial landmarks before pose estimation and is thus easier to use. Specifically, we attempt to estimate the head pose from a single 2D face image by using improved random forests with appearance-based features. To this end, we first extract features from 2D face images and then use supervised locality preserving projections (SLPP) to reduce the feature dimension. After applying SLPP, each dimension of the features is associated with an importance index (i.e., a weight). Finally, we construct random forests based on the weighted features [10, 11]. When constructing random forests for head pose estimation, unlike previous methods, we consider the importance of different features and conduct tree screening to discard deficient trees [12–14]. Thanks to the weighting and screening processes, our constructed random forests are more effective, as will be demonstrated by our experimental results. The main contributions of this paper include:
(i) employing feature weighting and tree screening to improve the effectiveness of the constructed random forest based head pose predictor;
(ii) investigating the impact of the pose angle sampling interval of training face images on the accuracy of trained pose angle predictors;
(iii) studying the performance of pose angle predictors when the training and testing face images are captured under different scenarios (i.e., heterogeneous face images).

The remainder of this paper is organized as follows. Section 2 briefly introduces related work. Section 3 presents the proposed head pose estimation method. Section 4 then reports the experimental results. Finally, conclusions are drawn in Section 5.

2. Related Work

A random forest is an ensemble of tree predictors that can cope with both classification and regression problems. It has been successfully applied to many problems [15–21], including head pose estimation. Head pose estimation based on random forests was first proposed by Huang et al. [22] in 2010. Their method enhanced the strength of the decision trees by using LDA with a minimum Euclidean distance classifier as the node tests and achieved good results. Li et al. [23] used random regression forests to estimate head poses in two steps: (i) half-face and tree structured classifiers with a cascaded-AdaBoost algorithm are adopted to detect faces with various head poses; (ii) based on the cropped face images, random regression forests are learned and applied to estimate head orientations. Recently, Fanelli et al. [24, 25] exploited random forests for estimating 3D head poses in 3D face images. They utilized depth information as feature vectors when constructing the tree predictors in the random forests. However, because depth information is not available in 2D images, this method cannot be applied to 2D face images.

In this paper, we improve the strength of the decision trees in the random forest by selecting features according to their importance and by screening the trees to retain only those with higher accuracy. Our method differs from previous random forest based head pose estimation methods in the following three ways:
(i) our method assumes that different features are not equally important and selects important features with higher probabilities;
(ii) our method evaluates the trees and retains them only if they have sufficiently high accuracy; this screening process improves the quality of the obtained forests;
(iii) a machine learning technique, namely, SLPP, is applied to enhance the discriminant power of the features extracted from 2D face images; moreover, SLPP provides a natural measure of the importance of different features.

3. The Proposed Head Pose Estimation Method

Figure 1 shows the block diagram of the proposed method. Details of each component are given below.

3.1. Feature Extraction

Faces are first detected in the input images using the method in [26]. They are then cropped from the original images and normalized to grey-scale images of 64 × 64 pixels. Next, certain texture features (e.g., Gabor [8], LBP [21], or HOG [22]) are extracted from the normalized face images. The features are further processed by the SLPP method for two purposes: (i) reducing and weighting the feature dimensions and (ii) enhancing the discriminant power of the features.
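For concreteness, the following sketch shows what the normalization and HOG extraction step could look like, assuming scikit-image and an RGB input crop; the cell and block sizes follow the settings reported in Section 4, while the number of orientation bins (9) is an assumed default not stated in the paper.

```python
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize

def extract_hog_feature(face_region):
    """Normalize a cropped (RGB) face region and extract its HOG descriptor."""
    gray = rgb2gray(face_region)                       # RGB -> grey-scale
    face = resize(gray, (64, 64), anti_aliasing=True)  # 64 x 64 pixels
    # 6 x 6-pixel cells, 3 x 3 cells per block (Section 4 settings);
    # 9 orientation bins is an assumed default, not stated in the paper
    return hog(face, orientations=9, pixels_per_cell=(6, 6),
               cells_per_block=(3, 3), block_norm='L2-Hys')
```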

SLPP is a method for supervised linear dimensionality reduction [27]. It seeks a set of projection axes $W = [w_1, w_2, \ldots, w_{d'}]$ that preserve the local structure of samples from the same class while maximizing the separation margin between samples of different classes. Let $X = [x_1, x_2, \ldots, x_n]$ be the $d$-dimensional features extracted from a given set of $n$ training samples belonging to $C$ classes. The objective function of SLPP is defined as follows:

$$W^{*} = \arg\max_{W} \frac{\operatorname{tr}\left(W^{T} S_{b} W\right)}{\operatorname{tr}\left(W^{T} \left(S_{w} + X L X^{T}\right) W\right)}, \tag{1}$$

where $L = D - S$ is called the Laplacian matrix ($S$ is the sample similarity matrix and $D$ is the diagonal degree matrix with $D_{ii} = \sum_{j} S_{ij}$), $S_{w}$ and $S_{b}$ are the within-class scatter matrix and the between-class scatter matrix, respectively, and $(\cdot)^{T}$ denotes the transpose operator. The optimal projection axes of SLPP are obtained by maximizing the between-class separation while simultaneously minimizing the within-class scatter. It can be mathematically shown that the optimal projection axes of SLPP are the eigenvectors corresponding to the nonzero eigenvalues of the following generalized eigen-decomposition problem:

$$S_{b} w = \lambda \left(S_{w} + X L X^{T}\right) w. \tag{2}$$
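As a minimal sketch, assuming the generalized eigen-problem takes the symmetric form of (2) with $S_b$ and the regularized within-class/locality term precomputed, the projection axes can be obtained with SciPy:

```python
import numpy as np
from scipy.linalg import eigh

def slpp_axes(S_b, S_w_loc, n_components):
    """Solve S_b w = lambda * S_w_loc w and keep the leading axes.

    S_b     : between-class scatter matrix (d x d, symmetric).
    S_w_loc : within-class/locality term, e.g. S_w + X @ L @ X.T
              (d x d, symmetric positive definite).
    """
    eigvals, eigvecs = eigh(S_b, S_w_loc)   # generalized symmetric problem
    order = np.argsort(eigvals)[::-1]       # largest eigenvalues first
    keep = order[:n_components]
    return eigvecs[:, keep], eigvals[keep]
```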

In the SLPP-generated feature space, each dimension corresponds to one of the projection axes (i.e., the eigenvectors of (2)) and hence to the associated eigenvalue. The larger the eigenvalue is, the better the feature along the corresponding dimension fulfills the objective function in (1). This motivates us to measure the importance of different features based on the eigenvalues. Specifically, in the lower dimensional feature space, the $j$th feature is assigned a weight defined as follows:

$$\omega_{j} = \frac{\lambda_{j}}{\sum_{k=1}^{d'} \lambda_{k}}, \quad j = 1, 2, \ldots, d', \tag{3}$$

where $\lambda_{j}$ is the eigenvalue associated with the $j$th eigenvector and $d'$ is the dimension of the reduced SLPP feature space. The weights of different features will be used to guide the selection of features in constructing tree predictors in random regression forests.
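The weighting itself is a simple normalization of the retained eigenvalues; a minimal sketch:

```python
import numpy as np

def feature_weights(eigvals):
    """Turn the d' retained SLPP eigenvalues into weights summing to one."""
    lam = np.asarray(eigvals, dtype=float)
    return lam / lam.sum()
```

These weights later bias the per-node feature sampling described in Section 3.2.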

3.2. Training Random Regression Forests

Based on the training samples, a random regression forest is constructed. Let $X = \{x_1, x_2, \ldots, x_n\}$ denote the extracted features, and let $Y = \{y_1, y_2, \ldots, y_n\}$ denote the annotated class labels of the training samples (i.e., yaw or pitch angles in this paper). For each of the trees in the random forest, a subset of samples is randomly chosen by bootstrap from the training samples. To train a tree using the chosen training samples, the parameters involved in the binary tests at its nonleaf nodes have to be determined.

The binary test at a node determines which branch a sample arriving at the node should be forwarded to. In this paper, a binary test is defined by a chosen feature and an associated threshold. Given a sample, the binary test compares its feature value with the threshold and forwards the sample to the left branch if the value is lower than the threshold, or to the right branch otherwise. In the training phase, a pool of binary tests is generated at every nonleaf node by randomly choosing a feature from a subset of $m$ features and a threshold for it; the subset of $m$ features is randomly chosen at each nonleaf node out of the $d'$ features. In order to choose the best binary test from these randomly generated binary tests, we use the information gain to evaluate the effectiveness of each binary test. Specifically, all the training samples (denoted by $Q$) arriving at the node are first split into the left and right branches by the binary test under evaluation; the two resulting subsets are denoted by $Q_l$ and $Q_r$, respectively. The information gain of this splitting is then computed as follows:

$$IG = H(Q) - \sum_{i \in \{l, r\}} \frac{|Q_i|}{|Q|} H(Q_i), \qquad H(Q) = -\sum_{c} p_c \log p_c, \tag{4}$$

where $H$ is the information entropy and $p_c$ is the probability of the $c$th class (i.e., pose angle) occurring at the node. The binary test with the highest information gain is taken as the best one for the node.
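For illustration, here is a sketch of this node-training step, assuming discrete pose labels; the pool size `n_tests` and the uniform threshold sampling are illustrative choices, not the paper's specification.

```python
import numpy as np

def entropy(labels):
    """Information entropy of a set of discrete pose labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_binary_test(X, y, feature_pool, n_tests=20, rng=None):
    """Evaluate random (feature, threshold) pairs; keep the best by gain."""
    rng = rng or np.random.default_rng()
    h_parent, best = entropy(y), None
    for _ in range(n_tests):
        f = rng.choice(feature_pool)                   # candidate feature
        t = rng.uniform(X[:, f].min(), X[:, f].max())  # candidate threshold
        left = X[:, f] < t
        if left.all() or not left.any():
            continue                                   # degenerate split
        # information gain = parent entropy - weighted child entropies
        gain = h_parent - (left.mean() * entropy(y[left])
                           + (~left).mean() * entropy(y[~left]))
        if best is None or gain > best[0]:
            best = (gain, f, t)
    return best   # (gain, feature index, threshold), or None
```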

During the construction of a tree, the tree keeps growing until the number of samples arriving at a node is sufficiently small, the samples are all of the same class, or the tree reaches the maximum number of layers. A leaf is created when the tree stops growing. The leaf stores a real-valued, multivariate distribution computed from the pose parameters (i.e., yaw and pitch angles) of the samples arriving at it. Figure 2 illustrates a random regression forest for head pose estimation.

As proven by Breiman [28], the generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Motivated by this fact, we choose features according to their weights; that is, a feature with a larger weight has a higher probability of being chosen. At each nonleaf node, we randomly select $m$ features from the $d'$ features, without repetition, according to their weights. This feature weighting method improves the effectiveness of the constructed trees, as will be demonstrated by our experiments.
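NumPy supports weighted sampling without replacement directly; a minimal sketch:

```python
import numpy as np

def sample_feature_subset(weights, m, rng=None):
    """Draw m distinct feature indices, biased by the SLPP weights.

    `weights` must be nonnegative and sum to one (see feature_weights above).
    """
    rng = rng or np.random.default_rng()
    return rng.choice(len(weights), size=m, replace=False, p=weights)
```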

3.3. Tree Evaluation and Screening

Once a tree is built, we evaluate its quality to determine whether it should be retained or discarded; only trees with sufficiently high accuracy are kept. As mentioned above, the bootstrap method is used to randomly choose a subset of samples from the given training samples to construct a tree. These chosen samples are called in-bag (IB) data, while the remainder is called out-of-bag (OOB) data. The OOB data are utilized to evaluate the tree constructed from the IB data.

Given a tree regressor $t_k$ built from the $k$th training data subset, we define the mean absolute error (MAE) of the tree as

$$\mathrm{MAE}_k = \frac{1}{|O_k|} \sum_{x_i \in O_k} \left| t_k(x_i) - y_i \right|, \tag{5}$$

where $O_k$ denotes the OOB data of the tree, $x_i$ is a sample in the OOB data, and $y_i$ is the class label of the sample $x_i$. Obviously, a tree with low MAE has high accuracy. Hence, we accept a tree only if its MAE on the OOB data is below a prespecified threshold. In this way, the trees retained in the forest all have low MAE, and a better random forest is obtained.
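A sketch of the screening rule; `tree.predict` stands for whatever prediction interface the trained tree exposes (an assumption, not an interface specified in the paper):

```python
import numpy as np

def keep_tree(tree, X_oob, y_oob, mae_threshold=8.0):
    """Retain a tree only if its OOB mean absolute error is low enough."""
    pred = np.array([tree.predict(x) for x in X_oob])  # assumed interface
    mae = np.mean(np.abs(pred - np.asarray(y_oob)))
    return mae < mae_threshold   # 8 degrees in the paper's experiments
```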

3.4. Predicting Head Poses

Once the random regression forest is constructed, it can be used to predict the head pose in a new, unseen 2D face image. At each node of a tree in the forest, the stored binary test evaluates a feature of the testing sample and sends the sample either to the left or to the right branch, until it reaches a leaf. When the sample arrives at a leaf, the tree gives an estimate of the pose parameters in terms of the stored distribution. The multivariate distributions predicted by the different trees in the forest are averaged to give the final estimate of the head pose of the testing sample.
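A sketch of the test-time routing and averaging, assuming each nonleaf node stores a feature index and a threshold and each leaf stores the mean of its pose distribution (these node attributes are illustrative):

```python
import numpy as np

def route_to_leaf(node, x):
    # follow the stored binary tests down to a leaf
    while not node.is_leaf:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node

def predict_pose(forest, x):
    # average the per-tree estimates (e.g. stored (yaw, pitch) means)
    estimates = [route_to_leaf(tree, x).mean_pose for tree in forest]
    return np.mean(estimates, axis=0)
```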

4. Experiments

In this section, we evaluate the proposed method on four public databases: Pointing’04 [29], CAS-PEAL [30], NCKU [31], and CUbiC FacePix [32, 33]. For all the images in these databases, the face detector in [26] is applied to locate the face regions (when the method fails to detect the right face region, we crop the face region manually), and the face regions are cropped and normalized to the same size of 64 × 64 pixels.

We compare our method (denoted by WSRF) with several existing random forest based methods, including the method (denoted by VRF + LDA) in [22], the method (denoted by SRF) in [23], and the method (denoted by WRF) in [7], as well as some other state-of-the-art methods, including the methods proposed in [34] (denoted by LPLS and KPLS), [35] (denoted by LBIF and SLBIF), [36] (denoted by MGD), [2] (denoted by 3DMM), and [9] (denoted by MLD). In the experiments, we investigate how different features (e.g., Gabor, LBP, HOG, and grey-scale intensity) work with our proposed method. We use Gabor kernels with five spatial frequencies and eight orientations. To extract LBP features, face images are divided into cells of 16 × 16 pixels, and for each pixel in a cell we consider a circular neighborhood with a radius of eight pixels. When extracting the HOG features, each scanning window over the face images is divided into cells with a fixed size of 6 × 6 pixels, and each group of 3 × 3 cells is integrated into a block in a sliding fashion. The extracted Gabor, LBP, and HOG features are reduced to 256 dimensions by SLPP. For each method, we report the mean absolute error (MAE) and the accuracy (the percentage of samples whose pose estimation error is less than 5 degrees) of head pose estimation. Furthermore, in order to evaluate the generalization ability of the proposed method, we evaluate it on a set of samples captured by surveillance cameras.

In our experiments, when constructing random regression forests, the number of trees is determined empirically, as is done in many random forest based methods [37, 38]. Specifically, the number of trees is gradually increased until the prediction error becomes stable or begins to increase.
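One simple way to implement this stopping rule is sketched below; `train_tree` and `validation_mae` are assumed helpers, and the patience parameter is an illustrative choice.

```python
def grow_forest(train_tree, validation_mae, max_trees=100, patience=3):
    """Add trees until the prediction error stops improving."""
    forest, best_mae, stall = [], float('inf'), 0
    for _ in range(max_trees):
        forest.append(train_tree())
        mae = validation_mae(forest)
        if mae < best_mae:
            best_mae, stall = mae, 0
        else:
            stall += 1               # error stable or increasing
            if stall >= patience:
                break
    return forest
```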

4.1. Results on Pointing’04

The Pointing’04 database [29] contains 2D face images of 15 individuals. Each subject has 2 series of 93 images at different poses. The images display variations in expression, skin color, and occlusion (e.g., wearing glasses). The pose is described by yaw and pitch rotations: yaw angles range between −90 and 90 degrees with increments of 15 degrees, and pitch angles range between −90 and 90 degrees with increments of 30 degrees. Figure 3 shows some example face images of a subject in the Pointing’04 database. It is worth mentioning that the subjects in this database were asked to rotate their heads while a single camera in front of them captured their face images. Consequently, the head pose angles in this database are not very precise (see Figure 4).

In the experiments, the number of trees and the number of features randomly selected at each node are determined empirically. For a fair comparison with the methods in [9, 17, 29, 31], we use 80% of the Pointing’04 images for training, 10% for cross-validation, and 10% for testing.

Table 1 shows the accuracy and MAE of the yaw and pitch angles achieved by different methods on the Pointing’04 database, as well as by our proposed improved random regression forest based method with different features. The proposed method with the HOG feature achieves MAEs of 5.21 and 4.06 degrees for yaw and pitch, respectively, which are comparable to the MLD method and better than the other methods. Considering that the MLD method is specially designed for handling the uncertainty of head pose angles in this database, our proposed method could be further improved by incorporating the idea of MLD into the training process (say, using face images with yaw angles of 40 to 50 degrees as training samples for the yaw angle of 45 degrees); this is among our future work.

4.2. Results on CAS-PEAL

The CAS-PEAL face database [30] contains 99,594 images of 1,040 individuals (595 males and 445 females). Each subject displays twenty-one poses combining seven yaw angles (−45, −30, −15, 0, 15, 30, and 45 degrees) and three pitch angles (−30, 0, and 30 degrees). In this experiment, we consider only yaw angles, as in [6].

For a fair comparison with [6, 22], in our experiments we use a subset containing a total of 4,200 images of 200 subjects, whose IDs range from 401 through 600. The images are ranked by their subject IDs and then divided into three subsets; two subsets are taken as the training set and the remaining one as the testing set. Figure 5 shows some example cropped face regions in this database. In the experiments, the number of trees and the number of randomly selected features at each node are set as 40 and 14, respectively.

Table 2 summarizes the accuracy and MAE achieved by different methods, as well as by our proposed method with different features, on the CAS-PEAL database. Obviously, the proposed method is much better than the compared existing methods. In particular, the proposed method with the HOG feature achieves an MAE of 4.53 degrees for yaw, which is the best result among the methods under consideration. In addition, the method in [2] achieved an MAE of 3.78 degrees, but it was tested only on yaw angles ranging from −15 to 15 degrees. By comparing the results of WRF and the proposed method with different features, it can be seen that screening trees is effective in improving the pose estimation accuracy of random forests.

4.3. Results on NCKU

The NCKU face database [31] contains 6,660 images of 90 subjects (78 males and 12 females). Each subject has 74 images, of which 37 were taken every 5 degrees from the right profile (defined as +90) to the left profile (defined as −90) in the yaw rotation. The remaining 37 images were generated (synthesized) from the captured 37 images by flipping them horizontally using commercial image processing software. In our experiments, we use a subset containing the first 37 images of all subjects. Figure 6 shows some example face images of a subject in the NCKU face database.

In the experiment, the images are ranked by their subject IDs and divided into three subsets. Two subsets are taken as the training set and the other subset is used for testing. The number of trees and the number of randomly selected features at each node are set as 30 and 14, respectively.

Table 3 summarizes the accuracy and MAE achieved by different methods on the NCKU face database. It can be seen that our proposed method with the HOG feature achieves the best results. This again demonstrates the effectiveness of our proposed method in improving the pose estimation accuracy of random forests.

4.4. Results on CUbiC FacePix

The FacePix database [32] contains 16,290 images of 30 individuals (25 males and 5 females). Each subject has two sets of face images: a set with pose angle variations and a set with illumination angle variations. In this experiment, we only use the images with pose variations. Each subject displays 181 poses from the right profile (denoted by +90) to the left profile (denoted by −90) with increments of 1 degree in the yaw rotation. Figure 7 shows some images of a subject displaying yaw angles from −90 to 90 degrees with an interval of 10 degrees.

In the experiments, the number of trees and the number of features randomly selected at each node are determined empirically. The images are ranked by their subject IDs and divided into three subsets; two subsets are used for training and the remaining one is used for testing.

Table 4 summarizes the accuracy and MAE achieved by different methods on the CUbiC FacePix face database. The proposed method with HOG features obtains an MAE of 4.51 degrees and an accuracy of 89.15% for yaw, which are significantly better than those of the existing methods.

4.5. Results on Surveillance Data

In order to evaluate the generalization ability of our method in a more realistic environment, the forests trained with samples from Pointing’04, CAS-PEAL, NCKU, and FacePix are directly applied to a set of samples captured by surveillance cameras. A total of 30 volunteers participated. To capture images with pose variations of a subject, the subject is asked to look upwards, look right into the camera, and look downwards while walking under the surveillance cameras. As a result, 15 images are captured for each subject, showing yaw angles of (−30, −15, 0, 15, and 30 degrees) and pitch angles of (−30, 0, and 30 degrees). Figure 8 shows the 15 images of a subject in this surveillance dataset.

The face regions in the captured images are located using the face detector in [26] and then cropped and normalized to the same size of 64 × 64 pixels. HOG features are extracted from the normalized cropped face regions and projected into the SLPP-reduced feature space. Table 5 shows the MAE and accuracy of the yaw and pitch angles achieved on this surveillance dataset by, respectively, using the random forests trained on the Pointing’04, CAS-PEAL, NCKU, and CUbiC FacePix databases. In order to examine whether using face images from multiple sources to train the random forest helps, we combine the four databases (i.e., Pointing’04, CAS-PEAL, NCKU, and CUbiC FacePix) and use all the face images together to train another random forest based head pose predictor. The results are also shown in Table 5.

It can be observed from Table 5 that, when using a single training database, the smallest pose estimation error is achieved by the NCKU-trained predictor, whose training data have an interval of 5 degrees in pose angles. In contrast, the other three predictors, trained using data with a 1-degree interval (i.e., FacePix) or a 15-degree interval (i.e., Pointing’04 and CAS-PEAL), all have larger errors. A possible reason is that too coarse a sampling of pose angles cannot well interpolate all possible pose variations, whereas too dense a sampling of pose angles might in fact introduce more confusing information rather than more discriminative information. On the other hand, by combining multiple databases for training, the generalization ability of the resulting head pose predictor can be enhanced, thanks to the improved diversity of the training data.

Figure 9 shows some face images for which the trained predictors fail to accurately estimate the pose angles; the MAE for these images is larger than 10 degrees. As can be seen, cluttered backgrounds and intense facial expressions could generate distracting features for predictors trained using face images collected under laboratory conditions. How to cope with such poor-quality images is among the topics of our future work.

Furthermore, by comparing the results in Tables 1, 2, 3, and 5, we can see that the predictors obtain higher accuracy when the testing images are captured by the same devices as the training images. This is essentially a problem of heterogeneous image analysis. To better understand the impact of heterogeneous images, we evaluate our method by using the surveillance dataset for both training and testing. Specifically, the images of 20 subjects are used for training, while the images of the other 10 subjects are used for testing. The number of trees and the number of randomly selected features at each node are set as 10 and 14, respectively. Table 6 gives the accuracy and MAE of the yaw and pitch angles achieved by the predictor trained on the surveillance dataset. The best MAEs of the estimated yaw and pitch angles are, respectively, 5.43 and 5.21 degrees, which are obviously lower than those obtained by the predictors trained using other databases, and even lower than those achieved by the predictor trained on the combined databases (refer to Table 5). This again demonstrates the difficulty in handling heterogeneous data.

4.6. The Threshold in Tree Screening

In tree screening, we evaluate the performance of a tree in terms of its MAE on the OOB data and impose a threshold on this MAE. In the above experiments, the threshold of the OOB MAE is set to 8 degrees; that is, we accept the trees whose OOB MAE is below 8 degrees. Figures 10 and 11, respectively, show the MAE and accuracy of the yaw and pitch angles on the testing data when different MAE thresholds are used for tree screening in training. Figure 12 plots the required training time versus the MAE threshold.

From these curves, we can see that as the MAE threshold increases, it takes less time to train a random forest, but its accuracy becomes lower. On the other hand, as the MAE threshold decreases to 8 degrees and below, the resulting random forests’ accuracy becomes relatively stable. This suggests that using an MAE threshold smaller than 8 degrees does not significantly improve the accuracy, while the training time increases substantially. Therefore, we report above the performance of our proposed method when the MAE threshold is set to 8 degrees.

4.7. Diversity Analysis

This experiment evaluates the diversity of the trees in the obtained random forests. A random forest with higher diversity is preferred because such a forest is believed to generalize better. To assess the diversity of a forest, we say that two trees share a common node if there are two nodes which are at the same position in the two trees and use the same feature. The similarity of two trees $T_a$ and $T_b$ can then be computed as

$$\mathrm{Sim}\left(T_{a}, T_{b}\right) = \frac{2 N_{c}}{N_{a} + N_{b}}, \tag{6}$$

where $N_{a}$ and $N_{b}$ are, respectively, the total numbers of nodes in the two trees and $N_{c}$ denotes the number of common nodes between them. The diversity of a forest is measured by the mean and variance of the similarity over all pairs of trees in it. A forest with a low similarity mean and variance is deemed to have high diversity.
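A sketch of computing this measure, assuming the normalization in (6) and that each tree is represented as a dictionary mapping a node's position (root = 1, children of node i at 2i and 2i + 1) to the index of the feature it tests:

```python
from itertools import combinations
import numpy as np

def tree_similarity(nodes_a, nodes_b):
    """Similarity = 2 * (common nodes) / (total nodes of both trees)."""
    common = sum(1 for pos, feat in nodes_a.items()
                 if nodes_b.get(pos) == feat)
    return 2.0 * common / (len(nodes_a) + len(nodes_b))

def forest_diversity(trees):
    """Mean and variance of pairwise similarities; lower means more diverse."""
    sims = [tree_similarity(a, b) for a, b in combinations(trees, 2)]
    return np.mean(sims), np.var(sims)
```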

Table 7 shows the similarity mean and variance of the random forests obtained on the five databases. As can be clearly seen, all the values are small, which indicates the high diversity of these forests. It is worth mentioning that if we further considered the thresholds associated with the nodes, the measured diversity would be even higher. Therefore, although the trees in a forest are built from the same set of weighted features, they still have high diversity, thanks to the random division of both training data and candidate features (i.e., for each node, a random subset of the training data and a random subset of the candidate features are used to learn its parameters).

5. Conclusions

An improved random forest based method has been presented in this paper for estimating the head pose in 2D face images. The method weights the features when constructing the tree predictors in the random forest and screens the trees to retain only those with high accuracy. A series of evaluation experiments has been conducted on four public databases and a set of surveillance images captured by ourselves. The experimental results demonstrate that the random forests constructed with our proposed method indeed improve the head pose estimation accuracy.

We have systematically evaluated the accuracy of random forest based head pose predictors trained on face images with different pose angle sampling intervals. Our results suggest that neither too dense nor too coarse a sampling interval is preferable; a moderate sampling interval (say, five degrees) yields the best pose angle prediction accuracy.

We have also investigated the performance of trained pose angle predictors on a heterogeneous testing dataset. The results demonstrate that the accuracy decreases when the testing data are captured in a scenario different from that of the training data. This is particularly evident when the predictor is trained with laboratory data and tested with real-world surveillance data. Such a heterogeneity problem is now being recognized by more and more researchers.

In our future work, we are going to focus more on the heterogeneous problem in head pose estimation and further improve the pose estimation accuracy for real-world face images by exploring more advanced face image processing techniques and more effective feature representations.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Key Scientific Instrument and Equipment Development Project of China (no. 2013YQ49087903) and National Natural Science Foundation of China (Grant no. 61202160).