Abstract

During the past few decades, face recognition has been an active research area in pattern recognition and computer vision due to its wide range of applications. However, one of the most challenging problems in face recognition is handling large head pose variations. Therefore, efficient and effective head pose estimation is a critical step in face recognition. In this paper, a novel feature extraction framework, called Directional Correlation Filter Bank (DCFB), is presented for head pose estimation. Specifically, in the proposed framework, the 1-Dimensional Optimal Tradeoff Filters (1D-OTF) corresponding to different head poses are jointly designed in a low-dimensional linear subspace. Unlike traditional methods that rely heavily on the precise localization of key facial feature points, the proposed framework exploits the frequency domain of the face images, which effectively captures the high-order statistics of faces. As a result, the obtained features are compact and discriminative. Experimental results on public face databases with large head pose variations show the superior performance of the proposed framework on both head pose estimation and face recognition.

1. Introduction

Face recognition has attracted significant attention in computer vision and pattern recognition. It has a wide range of practical engineering applications, including access control, video surveillance, and human-computer interaction. Many works have been developed towards robust face recognition systems [1–4], and they usually encounter several challenging issues, such as occlusions, illumination changes, and pose variations [1]. Among these issues, face recognition under large head pose variations is one of the most challenging problems [5–10], mainly because the face appearance changes dramatically under different head poses.

Generally speaking, an image can be described by its two-dimensional Fourier transform in the spatial frequency domain, where significant advantages can be exploited (such as shift-invariance, graceful degradation, and closed-form solutions [10]). However, most traditional methods work in the image domain, which may lose much of the information that is important for image recognition. The main objective of this paper is to investigate the benefits of using the spatial frequency domain representation for effective head pose estimation and face recognition.

In this paper, we propose a novel feature extraction framework, called Directional Correlation Filter Bank (DCFB), for robust head pose estimation and face recognition by taking advantage of the correlation filter technique. Specifically, in the proposed framework, the 1-Dimensional Optimal Tradeoff Filters (1D-OTF) corresponding to different head poses are jointly designed in the low-dimensional subspace obtained by Principal Component Analysis (PCA). Then, the correlation outputs at the origin corresponding to the different 1D-OTFs constitute a compact and discriminative feature vector, which can be used for the final head pose classification. Owing to the shift-invariance property of the correlation filter technique, one distinguishing advantage of the proposed framework is that it does not need the precise localization of the key facial feature points required by traditional methods. Experimental results on public face databases show that the obtained compact feature vector is very discriminative for head pose estimation and face recognition.

In summary, the main contributions of this paper are given as follows. Firstly, we develop a novel feature extraction framework for head pose estimation. The extracted directional feature vector (a three-dimensional feature vector) effectively encodes the information for pose estimation. Secondly, an effective correlation filter is designed. Compared with the traditional correlation filter designed in the 2D image space, the proposed filter is designed in the 1D low-dimensional PCA subspace, which not only reduces the computational complexity but also improves the final performance. Thirdly, experimental results show the superiority of the proposed framework for head pose estimation and face recognition. Moreover, we show that the proposed framework can also be extended to occlusion estimation.

A preliminary version of this work was reported in [11]. However, we have made several significant improvements. More specifically, we reformulate the original method and offer more mathematical details and motivations of the proposed method. We conduct extensive experiments to demonstrate the superiority of the proposed method for head pose estimation and face recognition. Moreover, we successfully extend the proposed method to the application of occlusion estimation, which validates the generalization ability of the proposed method for different tasks.

The remainder of the paper is organized as follows. Section 2 describes related work, covering head pose estimation methods and correlation filter methods in turn. Section 3 explains the methodology of the proposed framework in detail. Section 4 presents extensive experimental results on head pose estimation and face recognition. Finally, Section 5 concludes the paper.

2. Related Work

Our work mainly focuses on head pose estimation for robust face recognition [5]. Existing head pose estimation methods can be roughly classified into two categories: model-based methods [12–14] and appearance-based methods [7, 8, 15–26].

On the one hand, model-based methods aim to construct a 3D model of the human head, where a large number of 3D samples are usually required during the training stage. On the other hand, appearance-based methods have received much attention recently. These methods depend on pose-invariant local features or the localization of facial feature points [7], and they include appearance template methods [8, 17], detector array methods [15], nonlinear regression methods [7, 16–20], manifold embedding methods [21–23], and Convolutional Neural Network (CNN) based methods [24–26].

For instance, Kim et al. [17] estimated the head pose based on a part-based face matching algorithm. However, the performance of head pose estimation relies heavily on the accuracy of the localization of facial features. Ho and Chellappa [8] proposed using dense SIFT features [9] for head pose estimation. Jones and Viola [15] built different face detectors for different head poses based on a decision tree, where the face detectors are trained separately. Dantone et al. [7] demonstrated the benefits of Conditional Regression Forests (CRF) by modeling the appearance and the location of facial feature points that are conditionally dependent on the head pose. Guo et al. [18] proposed combining regression and classification for head pose estimation. Lu et al. [22] proposed an ordinary preserving manifold analysis method for head pose estimation. Wei et al. [23] developed a robust face pose estimation method by using the geometry-preserving visual phrase.

Recently, Convolutional Neural Networks (CNNs) have made significant progress in head pose estimation. Xu et al. [24] proposed a framework to jointly perform head pose estimation and face alignment based on global and local CNN features. Later, Patacchiola and Cangelosi [25] developed a head pose estimation method using CNNs and adaptive gradient methods and obtained promising performance. Ahn et al. [26] used a multitask deep CNN for real-time head pose estimation. However, most CNN-based methods suffer from high computational complexity.

The correlation filter technique has been shown to be effective for the task of object recognition [10], due to its desirable properties, such as graceful degradation, shift-invariance, and closed-form solutions. Graceful degradation means that, even if some pixels in a face image are occluded or contaminated, the face can still be recognized, since only the peak on the correlation plane is decreased (and it remains discernible). Shift-invariance means that any shift of the input image results in the correlation output being shifted by the same distance. Therefore, a correlation filter based method does not need to align the face images, whereas the traditional feature-based head pose estimation methods are required to align the face images before feature extraction. Finally, instead of the iterative operations used in many filtering methods (e.g., [27]), correlation filters usually have closed-form solutions.
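The shift-invariance property can be checked numerically on a toy 1D signal. The following is only a minimal sketch: it assumes circular (wrap-around) shifts and frequency-domain correlation with NumPy, and the names `circular_correlation`, `template`, and `shift` are illustrative rather than taken from the paper.

```python
import numpy as np

def circular_correlation(signal, template):
    """Circular cross-correlation of two 1D arrays, computed via the FFT."""
    S = np.fft.fft(signal)
    T = np.fft.fft(template)
    # Correlation in the signal domain is S * conj(T) in the frequency domain.
    return np.real(np.fft.ifft(S * np.conj(T)))

rng = np.random.default_rng(0)
template = rng.standard_normal(64)        # toy 1D "feature" (assumption)
shift = 5
shifted = np.roll(template, shift)        # circularly shifted copy of the input

out_original = circular_correlation(template, template)
out_shifted = circular_correlation(shifted, template)

peak_original = int(np.argmax(out_original))  # peak location for the unshifted input
peak_shifted = int(np.argmax(out_shifted))    # peak location for the shifted input
print(peak_original, peak_shifted)
```

In this sketch the correlation peak of the shifted input moves by exactly `shift` samples and the whole output plane is a shifted copy of the original one, which is the behaviour that removes the need for precise alignment.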

Over decades of development, many types of correlation filters have been proposed. For example, Mahalanobis et al. [28] proposed the Minimum Average Correlation Energy (MACE) filter. Kumar [29] proposed the Minimum Variance Synthetic Discriminant Function (MVSDF) filter. The Optimal Tradeoff Filter (OTF) in [30] combines the MACE filter and the MVSDF filter to produce sharp correlation peaks while suppressing noise. Note that all these correlation filters are designed in the 2D image space.

Correlation filters are employed not only in face recognition [10] but also in other pattern recognition tasks. Venkataramani et al. [31] showed the application of correlation filters to fingerprint verification for access control. Henning et al. [32] illustrated the application of correlation filters to palmprint identification and verification. Henriques et al. [33] proposed the popular Kernelized Correlation Filter (KCF) for robust object tracking. In this paper, we extend the application of correlation filters to robust head pose estimation and face recognition.

3. Methodology

In Section 3.1, we present a novel feature extraction framework, called DCFB, for head pose estimation. In Section 3.2, we describe the 1-Dimensional Optimal Tradeoff Filter (1D-OTF), which is used to design each directional correlation filter in this framework.

3.1. Directional Correlation Filter Bank

The proposed feature extraction framework, i.e., the Directional Correlation Filter Bank (DCFB), is illustrated in Figure 1. First, a high-dimensional feature vector is extracted from each input image. After that, Principal Component Analysis (PCA) is used to perform dimension reduction so that the prominent information of the face is preserved while the noise is reduced. The DCFB consists of three correlation filters corresponding to the left-profile pose, the frontal pose, and the right-profile pose, respectively; each filter is separately correlated with the low-dimensional PCA features to generate the final direction feature vector.

As shown in Figure 1, the 1-Dimensional Optimal Tradeoff Filter (1D-OTF) is proposed for each correlation filter in the 1D frequency domain. Note that the Optimal Tradeoff Filter (OTF) has been successfully applied in face recognition [10], but its computational complexity is very high. Besides, OTF is relatively sensitive to variations in illumination and facial expressions because the filter is designed in the original 2D image space. Unlike the traditional OTF, which is based on the 2D image space, we design the 1D-OTF in the low-dimensional PCA subspace, and a directional correlation filter bank consisting of three 1D-OTF correlation filters is obtained. The DCFB is concerned only with the pose information and ignores the person identity information: each correlation filter in the DCFB tries to discriminate one specific pose from all the other poses. Finally, a direction feature vector is obtained for head pose estimation. In this paper, the head pose is estimated by a simple nearest neighbor classifier based on the Euclidean distance between direction feature vectors.
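The final classification step can be illustrated with a short sketch. It assumes that each image has already been mapped to a three-dimensional direction feature vector (one origin correlation output per pose filter); the numeric values below are made-up toy data, and `nearest_neighbor_pose` is a hypothetical helper name.

```python
import numpy as np

def nearest_neighbor_pose(query, train_vectors, train_labels):
    """Return the label of the training vector closest to `query` (Euclidean distance)."""
    distances = np.linalg.norm(train_vectors - query, axis=1)
    return train_labels[int(np.argmin(distances))]

# Toy direction feature vectors: (left-profile, frontal, right-profile) filter outputs.
train_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.0, 1.0],
    [0.0, 0.2, 0.8],
])
train_labels = ["left-profile", "left-profile", "frontal",
                "frontal", "right-profile", "right-profile"]

query = np.array([0.2, 0.7, 0.1])   # toy test vector with a dominant frontal response
predicted = nearest_neighbor_pose(query, train_vectors, train_labels)
print(predicted)
```

Because the feature vector has only three components, the distance computation is negligible compared with feature extraction.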

3.2. 1-Dimensional Optimal Tradeoff Filter

In the proposed DCFB, PCA is first used to preserve the dominant information and remove the noise in the face image. The 1D-OTF is then designed in the PCA feature subspace. The advantages of the 1D-OTF are that the computational cost is reduced significantly and the variations caused by illumination and pose are effectively alleviated.

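PCA is a classic and popular dimension reduction method [34]. Formally, let the face images be denoted as $\{x_1, x_2, \ldots, x_N\}$, where $x_i \in \mathbb{R}^d$ and each image belongs to one class. The total scatter matrix is defined as

$$S_T = \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T, \tag{1}$$

where $N$ is the number of all the training images and $\mu$ is the mean of all the training images.

The projection matrix is defined as

$$W_{opt} = \arg\max_{W} \left| W^T S_T W \right| = [w_1, w_2, \ldots, w_m], \tag{2}$$

where $\{w_j \mid j = 1, 2, \ldots, m\}$ is the set of the eigenvectors of $S_T$ corresponding to the $m$ largest eigenvalues.

The new feature vector is obtained by using the following transformation:

$$y_i = W_{opt}^T x_i. \tag{3}$$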
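The PCA step can be sketched in a few lines of NumPy. This is an illustrative sketch only: the data are random, the sizes (20 samples, 36 input dimensions, 8 retained components) are arbitrary assumptions, and `pca_projection` is a hypothetical helper. The projection is applied to the raw vectors; some implementations subtract the mean before projecting.

```python
import numpy as np

def pca_projection(X, m):
    """Return the (d, m) projection matrix and the mean of the rows of X.

    X : (N, d) array whose rows are vectorized training images or features.
    m : number of leading eigenvectors of the total scatter matrix to keep.
    """
    mu = X.mean(axis=0)
    centered = X - mu
    scatter = centered.T @ centered                  # total scatter matrix, (d, d)
    eigvals, eigvecs = np.linalg.eigh(scatter)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]            # indices of the m largest eigenvalues
    return eigvecs[:, order], mu

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 36))    # toy data: 20 samples, 36-dimensional features
W, mu = pca_projection(X, m=8)
Y = X @ W                            # low-dimensional features, one row per sample
print(Y.shape)
```

`np.linalg.eigh` is used because the scatter matrix is symmetric, which keeps the eigenvectors real and orthonormal.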

After performing PCA, we obtain a 1D low-dimensional feature vector for each face image. Then, a variant of OTF is designed based on the frequency domain of the obtained feature vector. Unlike the traditional OTF, which is designed in the 2D image space, the proposed 1D-OTF is designed in the low-dimensional PCA subspace. In this way, three different 1D-OTF correlation filters are obtained, where each filter corresponds to a specific pose and distinguishes it from the other two poses.

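Specifically, the 1D-OTF is derived by combining the 1D-MACE (Minimum Average Correlation Energy) filter and the 1D-MVSDF (Minimum Variance Synthetic Discriminant Function) filter. The objective of the 1D-MACE filter is to minimize the average correlation energy (ACE), which can be formulated as

$$E_{ave} = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{m} |G_i(k)|^2 = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{m} |H(k)|^2 |Y_i(k)|^2 = \mathbf{h}^{+} D \mathbf{h}, \tag{4}$$

where $G_i(k)$, $H(k)$, and $Y_i(k)$ are the 1D Fourier transforms of the output $g_i(n)$, the correlation filter $h(n)$, and the input $y_i(n)$ (the PCA feature of the $i$th training image), respectively; $\mathbf{h}$ is the vector of $H(k)$; "+" means the conjugate transpose; $D$ is a diagonal matrix whose diagonal entries are the average power spectrum of all features.

The origin value of the correlation output is $g_i(0) = \mathbf{Y}_i^{+}\mathbf{h}$, where $\mathbf{Y}_i$ is the vector of $Y_i(k)$. The constraints of the 1D-MACE filter are that the values of the outputs at the origin are equal to 1 for the authentic images (corresponding to the images of a specific pose) and 0 for the imposter images (corresponding to the images of the other poses), expressed as

$$\mathbf{Y}^{+}\mathbf{h} = \mathbf{u}, \tag{5}$$

where $\mathbf{Y} = [\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_N]$ is the 1D Fourier transform of the low-dimensional features obtained by PCA; $\mathbf{u}$ is an $N \times 1$ vector and $u_i$ denotes the correlation peak amplitude of the $i$th training image; $u_i$ is equal to 1 for the authentic images and 0 for the imposter images.

Therefore, the objective of the 1D-MACE filter is

$$\min_{\mathbf{h}} \; \mathbf{h}^{+} D \mathbf{h} \quad \text{s.t.} \quad \mathbf{Y}^{+}\mathbf{h} = \mathbf{u}. \tag{6}$$

The solution of the above objective function is obtained by using the method of Lagrange multipliers. The optimum solution of 1D-MACE is

$$\mathbf{h}_{MACE} = D^{-1}\mathbf{Y}\left(\mathbf{Y}^{+} D^{-1}\mathbf{Y}\right)^{-1}\mathbf{u}. \tag{7}$$

The solution of the 1D-MVSDF filter is derived in the same way as the 1D-MACE filter. The optimum solution is

$$\mathbf{h}_{MVSDF} = C^{-1}\mathbf{Y}\left(\mathbf{Y}^{+} C^{-1}\mathbf{Y}\right)^{-1}\mathbf{u}, \tag{8}$$

where $C$ is the noise covariance matrix, which is an identity matrix if the input noise is modelled as white noise.

Based on the combination of (7) and (8), 1D-OTF is written as

$$\mathbf{h}_{OTF} = T^{-1}\mathbf{Y}\left(\mathbf{Y}^{+} T^{-1}\mathbf{Y}\right)^{-1}\mathbf{u}. \tag{9}$$

Here, $T = \alpha D + \sqrt{1-\alpha^{2}}\, C$, where $\alpha$ ($0 \le \alpha \le 1$) is a trade-off parameter ($\alpha = 0$ leads to the 1D-MVSDF filter and $\alpha = 1$ leads to the 1D-MACE filter).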
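To make the closed-form filter design concrete, the following sketch builds one tradeoff filter per pose from toy features and evaluates the origin correlation outputs. It is not the authors' implementation. Several points are assumptions: the tradeoff matrix is taken in the common form T = alpha * D + sqrt(1 - alpha^2) * C with white noise (C = I), the number of training vectors is assumed not to exceed the feature dimension so that the linear system is solvable, the origin output is taken as the inner product of the feature spectrum with the filter, and all function names are hypothetical.

```python
import numpy as np

def design_1d_otf(features, u, alpha=0.9, noise_psd=None):
    """Closed-form 1D tradeoff filter in the frequency domain.

    features : (N, m) real array, one low-dimensional feature vector per row (N <= m assumed).
    u        : (N,) desired origin outputs (1 = authentic pose, 0 = imposter).
    Returns the (m,) complex filter vector h.
    """
    Y = np.fft.fft(features, axis=1).T               # (m, N): one spectrum per column
    D = np.mean(np.abs(Y) ** 2, axis=1)              # diagonal of D: average power spectrum
    C = np.ones_like(D) if noise_psd is None else np.asarray(noise_psd, dtype=float)
    T = alpha * D + np.sqrt(1.0 - alpha ** 2) * C    # diagonal of the tradeoff matrix (assumed form)
    T_inv_Y = Y / T[:, None]                         # T^{-1} Y without forming the full matrix
    gram = Y.conj().T @ T_inv_Y                      # Y^+ T^{-1} Y, shape (N, N)
    return T_inv_Y @ np.linalg.solve(gram, np.asarray(u, dtype=complex))

def origin_output(feature, h):
    """Origin value of the correlation output, i.e. (spectrum of feature)^+ h."""
    return float(np.real(np.vdot(np.fft.fft(feature), h)))

def direction_vector(feature, filters):
    """Stack the origin outputs of one feature against every filter in the bank."""
    return np.array([origin_output(feature, h) for h in filters])

rng = np.random.default_rng(0)
features = rng.standard_normal((6, 16))              # toy PCA features: 6 samples, 16 dims
labels = np.array([0, 0, 1, 1, 2, 2])                # 0 = left, 1 = frontal, 2 = right
filters = [design_1d_otf(features, labels == pose) for pose in range(3)]
direction_vectors = np.array([direction_vector(f, filters) for f in features])
print(np.round(direction_vectors, 3))
```

On the training samples the constraints are met exactly, so each direction vector is a one-hot indicator of the sample's pose; on unseen samples the outputs are only approximately 1 and 0, which is why a nearest neighbor classifier is applied afterwards.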

4. Experimental Results and Analysis

In this section, we first introduce the face databases used in the experiments in Section 4.1. Then, we give the experimental results obtained by the different methods for head pose estimation in Section 4.2. Next, the generalization ability of the proposed framework is shown in Section 4.3. To further validate the effectiveness of the proposed framework, we extend it to the task of occlusion estimation in Section 4.4. Finally, we describe the experimental results for face recognition in Section 4.5.

4.1. Face Databases

In this paper, three popular face databases with large pose variations, namely, PIE [35], HPI [36], and UMIST [37], are used to demonstrate the performance of the proposed method for head pose estimation and face recognition. Figure 2 shows some examples from the PIE and HPI databases. Besides, the AR database [38], which contains face images with occlusions, is used to evaluate the performance for occlusion estimation.

The PIE face database contains 41,368 images of 68 different persons with variations in pose, illumination, and expression. We choose 612 images with three different poses, that is, left-profile ([-45°, -15°]), frontal ([-15°, 15°]), and right-profile ([15°, 45°]). Each pose has nine images for each person. The HPI face database contains 15 persons with various poses. We choose ten images for each of the left-profile and right-profile poses and six images for the frontal pose for each person. The UMIST face database consists of 564 images of 19 persons in different poses. We choose six images for each pose in our experiments. The AR face database contains over 4000 face images of 126 people, including frontal views of faces with different facial expressions and occlusions (including sunglasses and scarves). The images of 120 individuals are taken in two sessions (separated by two weeks) and each session contains 13 color images.

All the faces in the images are cropped and resized to 64×64 pixels. For all the databases, we randomly choose 30% of the images as the training set and use the rest as the test set. We also compare our method with several other competing methods (see the following subsections for details). The experiments are repeated 30 times, and we report the average pose/face classification rates obtained by all the competing methods.
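The evaluation protocol (a random 30% training split, repeated 30 times, with the average classification rate reported) can be sketched as follows. The sketch assumes a plain, non-stratified random split, since the text does not say whether the split is stratified; the data are synthetic clusters, and `repeated_holdout` and `nearest_neighbor` are hypothetical helpers.

```python
import numpy as np

def nearest_neighbor(train_x, train_y, test_x):
    """Label every test row with the label of its closest training row."""
    distances = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    return train_y[np.argmin(distances, axis=1)]

def repeated_holdout(features, labels, classify, train_fraction=0.3, repeats=30, seed=0):
    """Average classification rate over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    n_train = max(1, int(round(train_fraction * n)))
    rates = []
    for _ in range(repeats):
        perm = rng.permutation(n)                     # new random split in every repetition
        train, test = perm[:n_train], perm[n_train:]
        predicted = classify(features[train], labels[train], features[test])
        rates.append(np.mean(predicted == labels[test]))
    return float(np.mean(rates))

# Synthetic stand-in for pose features: three well-separated 2D clusters.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
labels = np.repeat(np.arange(3), 20)
features = centers[labels] + rng.standard_normal((60, 2))

average_rate = repeated_holdout(features, labels, nearest_neighbor)
print(average_rate)
```

Averaging over repetitions reduces the dependence of the reported rate on any single random split.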

4.2. Results on Head Pose Estimation

In this section, we show the results of head pose estimation obtained by the competing methods (including PCA [34], LDA [39], QRLDA [40], and OTF [41]) and our proposed DCFB method. To validate the effectiveness of the proposed method, we use three different face representations, including gray, Gabor [42], and HOG [43] features.

The HOG features use a sliding block to record the gradient information of images, and the block size greatly affects the performance of head pose estimation. A large block captures the global information, while a small block records the details. Head pose estimation usually needs the outline information rather than the identity characteristics, which suggests that a large block size should be used. To verify this assumption, we test ten different block sizes for the HOG features in head pose estimation. From Figure 3, we can see that the proposed method with a larger block size has better head pose estimation performance than with a smaller one. However, using a single block to cover the whole image does not give the best performance, because not only the outline gradient but also the gradient information of the mouth and the eyes is recorded, which decreases the performance. The best results are obtained by using four blocks with nine bins each. Therefore, we only need 36-dimensional features to estimate the head pose. We use the 36-dimensional HOG features in all the following experiments.
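The dimensionality argument (four blocks times nine bins gives 36 values) can be illustrated with a simplified gradient-histogram descriptor. This is a coarse sketch, not the exact HOG descriptor of [43]: it assumes a 2×2 grid of non-overlapping blocks, unsigned orientations, no cell subdivision, and no interpolation between bins, and `coarse_hog` is a hypothetical name.

```python
import numpy as np

def coarse_hog(image, grid=2, bins=9):
    """Simplified HOG: one magnitude-weighted orientation histogram per block."""
    image = np.asarray(image, dtype=float)
    gy, gx = np.gradient(image)                         # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)     # unsigned orientation
    # Map orientation to a bin index; clip guards against rounding at the upper edge.
    bin_index = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)
    block_h, block_w = image.shape[0] // grid, image.shape[1] // grid
    descriptor = []
    for r in range(grid):
        for c in range(grid):
            rows = slice(r * block_h, (r + 1) * block_h)
            cols = slice(c * block_w, (c + 1) * block_w)
            hist = np.bincount(bin_index[rows, cols].ravel(),
                               weights=magnitude[rows, cols].ravel(),
                               minlength=bins)
            descriptor.append(hist / (np.linalg.norm(hist) + 1e-12))  # per-block L2 normalization
    return np.concatenate(descriptor)

rng = np.random.default_rng(0)
image = rng.random((64, 64))          # toy stand-in for a cropped 64x64 face image
hog_feature = coarse_hog(image)
print(hog_feature.shape)
```

With `grid=2` and `bins=9` the descriptor has 2 × 2 × 9 = 36 entries, matching the feature length used in the experiments.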

Table 1 shows the results obtained by all the competing methods for head pose estimation based on the three types of features. From Table 1, we can see that the proposed method obtains the best performance in most cases across the three face representations, which shows the feasibility and robustness of DCFB. In particular, DCFB achieves 100% accuracy on the PIE and UMIST databases. Most of the methods achieve higher pose estimation performance with the HOG features than with the gray and Gabor features. This is because the HOG features use the histogram of gradients to describe the shape of a face image, which includes the information of face directions. Therefore, the HOG features are more effective for head pose estimation. In contrast, the performance of the Gabor features is worse than that of the HOG features, because the Gabor features are relatively insensitive to head pose variations.

To demonstrate the superiority of the proposed DCFB for head pose estimation, we further show the distance distributions between the test images and the template images for all the competing methods on the PIE database (see Figure 4). We can see that PCA is not suitable for head pose estimation, since the distances between images of different poses are not large enough to be distinguished. Although PCA can effectively reduce the noise, the extracted features are not discriminative for head pose estimation. LDA achieves better results than PCA, since LDA considers the class information. LDA and QRLDA achieve similar performance because both methods try to differentiate all the classes. OTF gives better distance distributions than LDA and QRLDA. However, compared with OTF, DCFB shows an excellent capability to separate the distance distributions for different templates, since a specific filter is designed to discriminate one pose from the other two poses in the PCA feature subspace. This makes head pose estimation more effective and robust. The effectiveness of DCFB can be attributed to the superiority of the correlation filter in dealing with frequency domain representations and the advantage of the filter bank design in learning head pose features.

4.3. Generalization Ability

In this section, we show the generalization ability of the proposed framework. That is, we use one database for training and another database for testing. In this way, the generalization ability of the proposed method across different databases can be evaluated.

More specifically, we randomly choose 30% of the images of one database to train the proposed DCFB and use all the images in the other database for testing. The results are given in Tables 2–4. We can see that the proposed DCFB using the HOG features obtains excellent performance in all the experiments, which demonstrates the effectiveness of the correlation filter for cross-database validation. Note that the DCFB using the gray features does not work well in the experiments. This is mainly because the gray features only preserve the appearance information. The DCFB using the HOG features achieves better results than the DCFB using the Gabor features. There are three main reasons for the good generalization ability of DCFB. Firstly, the correlation filter bank is designed in the low-dimensional PCA subspace, which not only significantly suppresses the noise in the facial images but also focuses on extracting intrinsic features for head pose estimation. Secondly, the proposed framework exploits the frequency domain representation, which inherits the advantages of graceful degradation and shift-invariance of the correlation filter. Therefore, the precise localization of the key facial feature points for different individuals is not required. Thirdly, the HOG features are used to compute the gradient histograms so as to effectively reduce the appearance differences between individuals. Therefore, one advantage of the proposed DCFB is its robustness against overfitting (i.e., we can train on one database and test on another database with completely different individuals). In summary, the proposed framework using the HOG features can effectively extract the discriminative information, and thus superior generalization ability is obtained for head pose estimation.

4.4. Occlusion Estimation

In this section, we show the results of the proposed framework for the task of occlusion estimation. Occlusion estimation determines whether a given input image is occluded or not. Therefore, only two 1D-OTFs (one for occluded images and the other for non-occluded images) are designed in DCFB. The AR database is used for evaluation. The images of the AR database are separated into two classes, i.e., occluded images and non-occluded images. The gray, HOG, and Gabor features are extracted from all the images of the AR database. Figure 5 shows the distance distributions for occlusion estimation obtained by the proposed framework using the gray, HOG, and Gabor features. As demonstrated in Figure 5, the occluded images are classified better than the non-occluded images. The proposed framework using the gray features does not work well for non-occluded images (it only obtains a 67% classification rate). The Gabor features can classify the images correctly, but the gap between the occluded images and the non-occluded images is not distinct. The gray features are sensitive to illumination changes, while the Gabor features mainly depict the textures of facial images. In contrast, the HOG features effectively characterize the edge variations, which are of great importance to occlusion estimation, and they yield a distinct gap between the two classes. In short, the proposed framework using the HOG features achieves steady and excellent performance in occlusion estimation.

4.5. Face Recognition with Head Pose Estimation

In this section, we show the results on face recognition by taking advantage of head pose estimation. Note that if a face is not frontal, then its appearance is usually surrounded by the background. The information provided by the background is not effective for face recognition. To overcome such a problem, we use the partial information (i.e., the nonbackground face region) to perform the face recognition process. The CFA method [10] (without relying on head pose estimation) is used as the baseline face recognition method. For head pose estimation, we use the OTF and DCFB for comparisons. Table 5 shows the results of the recognition accuracy obtained by all the competing methods. For each method, three different face representations (i.e., gray, Gabor, and HOG) are used for face recognition. The CFA method directly performs face recognition without head pose estimation. The method (A+B) denotes the combination of the head pose estimation method A (OTF or DCFB) and the face recognition method B (CFA in our paper). Note that we use the HOG features for both DCFB and OTF.

From Table 5, we can see that DCFB+CFA with the Gabor features obtains the best recognition performance among the three feature representations. By using head pose estimation, the performance of face recognition (i.e., DCFB+CFA) can be significantly improved. Specifically, DCFB+CFA with the Gabor features improves the accuracy by at least 40% over CFA with the Gabor features. DCFB+CFA obtains higher recognition accuracy than OTF+CFA because DCFB has better head pose estimation ability than OTF. In addition, DCFB+CFA with the Gabor features achieves the highest recognition accuracy among all the competing methods. CFA achieves the worst performance since head pose estimation is not used. In this case, the intraclass variations caused by pose differences are much larger than the interclass variations caused by individual differences. As a result, the recognition rates drop greatly. Note that, unlike head pose estimation, where the HOG features show the best performance for DCFB, the Gabor features are much more effective than the gray and HOG features for DCFB in face recognition, because the Gabor features are less sensitive to pose variations during the filtering steps. These results indicate that robust head pose estimation is an essential step for face recognition.

5. Conclusions

In this paper, a novel feature extraction framework DCFB has been proposed for robust head pose estimation and face recognition. An effective 1-Dimensional optimal tradeoff filter, called 1D-OTF, is designed in DCFB by using the frequency representations of 1D features on the linear subspace obtained by principal component analysis. Experimental results have demonstrated the effectiveness of DCFB for head pose estimation, face recognition, and occlusion estimation. Moreover, the superiority of the generalization capability of DCFB has been shown.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure

An earlier version of this paper was presented as a conference paper at the 4th International Conference on Intelligent Science and Big Data Engineering, Beijing, China, 2013.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported by the National Key R&D Program of China (no. 2017YFB1302400), by the National Natural Science Foundation of China (nos. 61503315, 61571379), and by the Natural Science Foundation of Fujian (nos. 2018J01576, 2017J01127). The authors are indebted to Professor Hanzi Wang for helpful criticism of an earlier version of this paper.