Abstract

Head pose estimation from single 2D images is an important and challenging research task in computer vision. This paper presents a novel head pose estimation method that utilizes the shape model of the Basel face model (BFM) and five fiducial points in faces. It adjusts the shape deformation according to a Laplace distribution to accommodate shape variation across different persons. A new matching method based on the PSO (particle swarm optimization) algorithm is applied both to reduce the time cost of shape reconstruction and to achieve higher accuracy than traditional optimization methods. In order to evaluate accuracy objectively, we propose a new way to compute the pose estimation errors. Experiments on the BFM-synthetic database, the BU-3DFE database, the CUbiC FacePix database, the CMU PIE face database, and the CAS-PEAL-R1 database show that the proposed method is robust, accurate, and computationally efficient.

1. Introduction

As one of the most active research topics, head pose estimation has received significant attention in the computer vision field. The human visual system interprets the orientation of a human head naturally, but giving computers this ability is challenging, and many difficulties must be overcome. Head pose is an important cue that can be widely used in many applications. In video surveillance, it helps the computer determine human identity and focus of attention in the scene [1]. In human-computer interaction, it helps the computer understand the intentions of a human being [2]. In video conferencing, it helps estimate the user's focus of attention in meetings [3]. In driver monitoring, it helps assess the driver's state of attention [4]. Most automatic face recognition algorithms try to normalize facial images in order to remove variations caused by anything but the identity of the person, and head pose is one of the largest sources of such undesired variation in facial images [5].

There have been numerous approaches to the head pose estimation problem. Temporal information from a face video sequence can improve head pose estimation [6], but head pose estimation from a single 2D image has broader applications and remains difficult, so we focus on it. Recent work demonstrates that depth information is key to overcoming many of the problems inherent to 2D image data. Fortunately, all kinds of sensors, including depth sensors, have recently become both affordable and accurate [7], and much research is now conducted with multiple sensors, such as [8]. The use of 3D data to assist pose estimation from a single 2D image is therefore already practicable.

Murphy-Chutorian and Trivedi [9] reviewed the different algorithms used for head pose estimation. Most head pose estimation algorithms can be divided into two steps: the first extracts features, and the second estimates the head pose from the obtained features. We can roughly categorize head pose estimation approaches into model-based approaches and appearance-based approaches.

Model-based approaches: these approaches usually use geometric features. In earlier work, Gee and Cipolla [10] used five key feature points (the nose tip, the outer eye corners, and the mouth corners) to estimate the facial orientation. Wang and Sung [11] used the two outer eye corners, the two inner eye corners, and the two mouth corners to estimate pose. They exploited the fact that the three lines through the outer eye corners, the inner eye corners, and the mouth corners are parallel: under the assumption that the eye corners and mouth corners lie approximately in the same plane, the ratio of the lines' lengths can be used to estimate the head orientation. Because this ratio differs across persons, the precision is not ideal. Lanitis et al. [12] extracted facial features with an active shape model (ASM) and adopted a greedy search to match the feature points. However, such techniques usually need many feature landmarks, more than 60.

Appearance-based approaches: these approaches use features that are modeled and learned from training data. Distance metric learning [13], subspace analysis [14], and manifold embedding [15] are popular feature extraction methods. However, such techniques are usually sensitive to data alignment and background noise.

Both robustness and accuracy are important in head pose estimation; hence many hybrid approaches have been proposed. Reference [16] gives a coarse-to-fine framework to estimate pose: the first stage is based on subspace analysis, and in the second stage the estimated pose is refined by analyzing finer geometrical structure details captured by bunch graphs. Reference [17] presents a 3D head pose estimation approach based on a 3D morphable appearance model, but it considers only limited ranges of yaw and pitch angles.

A remaining difficulty in many applications based on a single 2D face image is invariance to pose, illumination, and expression. 3D information about the object is very useful for overcoming this difficulty. Flexible and expressive 3D face models are a powerful tool to estimate the 3D shape, texture, and pose transfer between individuals using a single photograph [18]. For this reason, 3D morphable models (3DMM) were introduced a decade ago [19]. A morphable face model represents the shape and texture of samples by a vector space. In this vector space, any convex combination of the shape and texture vectors of some face samples describes a realistic human face [20], and the constraints on shape and texture are derived from the statistics of the sample faces [18]. However, the expensive and tedious fitting process makes the full model impractical: on an SGI R10000 processor, the computation time was 50 minutes [18]; in [21], the optimization time is kept below 180 s on a 2.8 GHz processor; and in [17], runtimes of 33.5 s are measured in MATLAB on an Intel Core 2 Duo processor running at 2.4 GHz. Therefore we simplify the traditional 3D morphable model to a 3D morphable shape model that encodes only shape in its model parameters; reconstructing only the shape of the model is enough to estimate head pose.

During 3D face shape reconstruction, model matching is an important step. It is a high-dimensional optimization problem for which both speed and matching precision must be considered. The traditional stochastic gradient descent method rarely converges to the optimal point in a short time, and its result is unstable because it depends heavily on the initial values. The PSO algorithm, originally proposed by Kennedy and Eberhart [22, 23], is a swarm-intelligence-based optimization algorithm that is independent of the initial values and of the gradient information of the objective function. Moreover, its sharing of global information helps to prevent the optimization from being derailed by a few "bad" individuals. For these reasons we choose it, and we further improve the PSO algorithm to make it better suited to 3D face shape reconstruction.

This paper presents a novel hybrid head pose estimation method that consists of three major parts. (1) Get the coordinates of five feature points on the 2D input image (the left and right eye centers, the nose tip, and the left and right mouth corners). Automatically detecting facial feature points is an important and open problem, and many methods have been proposed for it; in this paper we employ two approaches: the FACE++ SDK [24] for the 2D databases, and recording the feature point values during projection for the 3D databases. Experiments show that five feature points are enough to estimate pose angles accurately. (2) Structure the cost function according to the best-suited dimension of the shape parameters that we found and an improved model of the shape parameters' distribution. (3) Apply an improved particle swarm optimization algorithm to minimize the fitting error between the five feature points on the 2D input image and the 3D shape model. This reconstructs a 3D model with the pose we want to obtain: the rotational Euler angles about the x-, y-, and z-axes of the head in the input image, that is, three degrees of freedom (yaw, pitch, and roll).

The paper is organized as follows. Section 3 introduces the 3D morphable shape model. Section 4 presents our proposed shape parameter estimation method and the PSO optimization process. Section 5 gives the experimental results on five databases. Finally, Section 6 concludes the paper and discusses potential future work.

3. 3D Morphable Face Shape Model

As is well known, any face shape can be represented as a linear combination of a limited basis set of face shape prototypes. We represent an exemplar face shape by a 3n-dimensional vector S = (x_1, y_1, z_1, ..., x_n, y_n, z_n)^T, which we call a shape vector; in a 3D face scan, n is the number of vertices and (x_i, y_i, z_i) are the coordinates of the i-th vertex. Given m face shape vectors S_1, ..., S_m, we establish dense correspondence between them so that all vectors have the same length, and a 3D average face shape S̄ can be constructed. We then perform a principal component analysis (PCA [25, 26]) on the shape vectors S_i: the covariance matrix is calculated, and we denote its eigenvectors by s_i and the corresponding eigenvalues by σ_i². Using the eigenvectors as an orthogonal basis, any face shape can be represented as

  S = S̄ + Σ_{i=1}^{m−1} α_i s_i.  (1)

Obviously, different α construct different persons; we define α = (α_1, α_2, ..., α_{m−1})^T as the shape parameters.
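To make the shape model concrete, the following minimal Python sketch reconstructs a face shape from shape parameters as in (1). The arrays mean_shape, basis, and alpha are hypothetical names (the paper publishes no code).

```python
import numpy as np

def reconstruct_shape(mean_shape, basis, alpha):
    """Return S = mean_shape + sum_i alpha_i * s_i as an (n, 3) vertex array.

    mean_shape: (3n,) average face shape; basis: (3n, m-1) PCA eigenvectors
    s_i stored as columns; alpha: (m-1,) shape parameters."""
    shape = mean_shape + basis @ alpha   # linear combination, equation (1)
    return shape.reshape(-1, 3)          # one (x, y, z) row per vertex

# Example: a random face drawn with Laplace-distributed shape parameters,
# matching the prior assumed later in Section 4.1:
# alpha = np.random.default_rng(0).laplace(scale=lam, size=basis.shape[1])
```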

The Basel face model (BFM) [27] was constructed from high-resolution laser scans of 200 persons with a large variety in face shape and appearance, registered with each other using an improved dense correspondence algorithm. After alignment, all 200 faces are unified with 53,490 vertices each.

4. Proposed Method

4.1. Cost Function

By fitting the 3D shape model to an input 2D image we can estimate the pose of the face in that image, because the fitting process determines not only the projective parameters and the shape parameters but also the pose parameters. We concatenate the projective parameters (the focal length f of the camera and the 3D translation t_w) and the pose parameters (the rotation angles φ, θ, and γ) into a vector ρ.

The primary goal of the fitting is to minimize the sum of squared differences between the five facial feature points q_j in the input 2D image and the projected positions p_j of the corresponding vertices of the BFM:

  E_F = Σ_{j=1}^{5} ‖q_j − p_j‖²,  (2)

where p_j = (p_{x,j}, p_{y,j})^T can be computed through the following method.

The object-centred coordinates v_i of the i-th vertex are mapped to a position relative to the camera by the rigid transformation [28]

  (w_x, w_y, w_z)^T = R_γ R_θ R_φ v_i + t_w,

where the angles φ and θ control the BFM's rotation around the vertical and horizontal axes (i.e., yaw and pitch), γ defines a rotation around the camera axis (i.e., roll), and t_w is a spatial shift.

According to the perspective projection model, vertex i is then mapped to the image-plane coordinates

  p_x = p̄_x + f · w_x / w_z,   p_y = p̄_y − f · w_y / w_z,

where f is the focal length of the camera and (p̄_x, p̄_y) defines the principal point, that is, the intersection of the optical axis with the image plane.
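For illustration, here is a hedged Python sketch of the rigid transformation and perspective projection above. The assignment of yaw, pitch, and roll to the y-, x-, and z-axes follows the description in the text; the exact axis conventions of the original implementation are an assumption.

```python
import numpy as np

def rotation_matrix(phi, theta, gamma):
    """R = R_gamma @ R_theta @ R_phi: yaw phi (vertical axis), pitch theta
    (horizontal axis), roll gamma (camera axis)."""
    c, s = np.cos(phi), np.sin(phi)
    r_phi = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])      # yaw about y
    c, s = np.cos(theta), np.sin(theta)
    r_theta = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])    # pitch about x
    c, s = np.cos(gamma), np.sin(gamma)
    r_gamma = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])    # roll about z
    return r_gamma @ r_theta @ r_phi

def project(vertices, phi, theta, gamma, t_w, f, principal=(0.0, 0.0)):
    """Map object-centred vertices (n, 3) to image-plane coordinates (n, 2)."""
    w = vertices @ rotation_matrix(phi, theta, gamma).T + t_w  # rigid transform
    px = principal[0] + f * w[:, 0] / w[:, 2]                  # perspective division
    py = principal[1] - f * w[:, 1] / w[:, 2]                  # image y grows downward
    return np.stack([px, py], axis=1)
```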

We suggest that the principal point be set equal to the projected position of the origin of the object-centred coordinates, as in Figure 1. Then we can assume t_x = t_y = 0 and fix t_z to an empirical value, so that only f, φ, θ, and γ (together with the shape parameters) need to be estimated, which reduces the fitting time. To locate the projected position of the object-centred origin on the 2D input image, we move the origin of the BFM's object-centred coordinates to the nose tip and keep the optical axis passing through that origin and collinear with the z-axis of the object-centred coordinates.

If we only minimize the E_F in (2), overfitting is unavoidable, so we add a maximum a posteriori (MAP) estimator to the cost function. Given the feature points q on the input 2D image, our task is to find the model parameters with maximum posterior probability P(α, ρ | q). According to Bayes' rule,

  P(α, ρ | q) ∝ P(q | α, ρ) · P(α, ρ).

If we neglect correlations between the variables, the right-hand side factorizes as P(q | α, ρ) · P(α) · P(ρ). We assume that the feature point coordinates are subject to Gaussian pixel noise with standard deviation σ_N, so the likelihood is

  P(q | α, ρ) ∝ exp(−E_F / (2σ_N²)),

and P(ρ) is a normal distribution with the starting values ρ̄_j as means and ad hoc values σ_{ρ,j} as standard deviations. In [29], Ding et al. propose that the shape parameters are more likely to follow a Laplace distribution than a Gaussian distribution, so the prior probability is

  P(α) = c · Π_i exp(−|α_i| / λ_i),  (8)

where c is a normalization factor.

The posterior probability can then be maximized by minimizing

  E = E_F / σ_N² + Σ_i |α_i| / λ_i + Σ_j (ρ_j − ρ̄_j)² / σ_{ρ,j}².  (10)

The value of σ_N can be adjusted after empirical experimentation, the values of λ_i can be obtained by trial and error [29], and the values of σ_{ρ,j} can be decided theoretically.

Equation (10) is the final cost function.
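As a sketch of how the final cost (10) can be evaluated (not the authors' code; project_features is a hypothetical helper that projects the five BFM feature vertices for given parameters):

```python
import numpy as np

def cost(q_image, alpha, rho, sigma_n, lam, rho_bar, sigma_rho, project_features):
    """MAP cost of equation (10)."""
    p_model = project_features(alpha, rho)                 # (5, 2) projected points
    e_f = np.sum((q_image - p_model) ** 2)                 # feature error E_F, equation (2)
    laplace = np.sum(np.abs(alpha) / lam)                  # -log P(alpha), Laplace prior (8)
    gauss = np.sum((rho - rho_bar) ** 2 / sigma_rho ** 2)  # -log P(rho), Gaussian prior
    return e_f / sigma_n ** 2 + laplace + gauss
```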

4.2. Improved PSO Algorithm

The PSO algorithm is a swarm-intelligence-based optimization algorithm that is independent of the initial values and of the gradient information of the objective function. The most important step in head pose estimation based on a morphable 3D model is model matching. For some models, low-dimensional parameters can be determined by skillfully solving a set of linear matrix inequalities [30], but face model matching is a high-dimensional optimization problem in which both speed and precision matter. We therefore dynamically adjust the inertia weight w of the PSO algorithm according to the probability distribution of the shape parameters α. The improved matching method reduces the running time and improves the fitting result. The flowchart of the improved PSO algorithm is shown in Figure 2.

In the PSO model, each particle in the search space represents a potential solution and comprises three parts: a position, a velocity, and a record of its past performance. The best solution can be found smoothly through cooperation and information sharing among the individuals in the swarm.

We encode each particle as x = (f, φ, θ, γ, α_1, ..., α_k), where f is the focal length, φ, θ, and γ are the three pose angles, and (α_1, ..., α_k) are the k-dimensional shape coefficients, so the dimension of a particle is k + 4.

The particles move according to the update equations

  v_i ← w · v_i + c_1 · rand() · (pbest_i − x_i) + c_2 · rand() · (gbest − x_i),
  x_i ← x_i + v_i,

where v_i is the velocity of particle i and x_i is its position. pbest_i is the best previous position of particle i, gbest is the best previous position among all particles in the swarm, and c_1 and c_2 are acceleration factors: c_1 pulls the particle towards its own best position and c_2 pulls it towards the global best position, so changes to c_1 and c_2 determine the relative pull of each particle towards pbest and gbest. rand() is a random function over the range [0, 1], which injects random factors into the optimization process. w is the inertia weight: when w is large, the swarm can escape local optima easily, while a small w favors fast convergence.
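A minimal sketch of one PSO iteration under these update equations. The acceleration factors c1 = c2 = 2.0 are common PSO defaults, not values from the paper, and the inertia weight w is supplied per iteration because the paper's dynamic formula from [31] is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, pbest, gbest, w, c1=2.0, c2=2.0):
    """One velocity/position update for all particles.

    x, v, pbest: (num_particles, dim) arrays; gbest: (dim,) array."""
    r1 = rng.random(x.shape)   # rand() in [0, 1], drawn per dimension
    r2 = rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v
```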

Dynamically adjusting the inertia weight gives better results than a fixed value. In model fitting, we determine the value of w as a function of the prior probability P(α) obtained from (8), following [31].

Particle swarm optimization iterates according to the above equations and returns a solution once E in (10) reaches its minimum.

5. Experimental Results

5.1. The Dimension of Shape Components

The locations of the feature points on a person's face are related not only to the head pose but also to the person's unique identity. Hence, our method refines the shape deformation to ensure that the positions of the five feature points are reasonable for the same pose across different persons.

The dimension of the BFM's shape components is 199. However, as mentioned in [32], the leading components control most aspects of a face shape, while some later components may decrease the reconstruction precision. We therefore investigate how the number of shape components affects the quality of pose estimation.

We use the BFM to construct 20 synthetic 3D faces with reasonable random shape parameters, as shown in Figure 3. We then rotate and project them to 2D images while accurately recording the coordinates of the five feature points. In Figure 4(a), we give the absolute pose estimation error on three degrees of freedom when the number of shape components varies from 0 to 199 and the pose angle of each degree of freedom takes 19 uniformly spaced values. In Figure 4(b), we give the absolute estimation error on two degrees of freedom, pitch and yaw (fixing the roll angle to zero), over 361 uniformly spaced pose angle combinations.

In Figure 4(a), we can see that the minimum absolute error is obtained with 20 shape components in the yaw direction, 35 in pitch, and 25 in roll. So we vary the number of shape components from 10 to 35 to find the best value for two degrees of freedom. In Figure 4(b), we can see that the best result is achieved with 25 components. We report the best result in terms of the mean, standard deviation, median, and lower (Q25) and upper (Q75) quartiles of the absolute error in Table 1.

5.2. The Number of Feature Points

In this section we compare the absolute estimation errors as the number of feature points decreases from 15 to 5. On the BFM-synthetic database the absolute error is already very small with 5 feature points and the trend is not notable, so we run this experiment on the BU-3DFE 3D face database. The BU-3DFE database [33] consists of 100 individuals with 7 emotional states; we use only the neutral images. Each facial model is composed of more than 7,700 vertices and 14,000 triangles, and the individuals' models can be projected at arbitrary poses. We project the 100 3D models and record the coordinates of the 15 feature points. On each degree of freedom the pose angle takes 19 uniformly spaced values. The 15 feature points are the two upper eye-outline midpoints, the two lower eye-outline midpoints, the two outer eye corners, the two inner eye corners, the nostril demarcation point, the jaw tip, the two eye-socket midpoints, the two mouth corners, and the nose tip. The number of feature points is reduced to 13, 11, 9, 7, 6, and 5 by removing the preceding points in order. Figure 5 gives the comparison result.

The BU-3DFE database does not supply the two eye-socket midpoints, the nose tip, or the nostril demarcation point, so we compute these absent points from the supplied ones. We use the mean of points 1, 3, 5, and 7 (the release file 4DFE-Featurepoints83.JPG in BU-3DFE defines these 83 points) to compute the x and y coordinates of the right eye-socket midpoint; then, according to these x and y coordinates, the nearest mesh vertex is found and its coordinates are taken as those of the right eye-socket midpoint. In the same way, we compute the left eye-socket midpoint using points 9, 11, 13, and 15, the nostril demarcation point using points 42 and 43, and the nose tip using points 40 and 45.
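The absent-point computation can be sketched as follows (a hedged Python illustration; landmarks83 and vertices are hypothetical arrays holding the 83 supplied landmarks and the mesh vertices):

```python
import numpy as np

def absent_point(landmarks83, indices, vertices):
    """Average the x, y coordinates of the given landmarks, then snap to the
    nearest mesh vertex in the x-y plane and return its full coordinates.

    landmarks83: (83, 3); vertices: (n, 3); indices are zero-based, so the
    paper's one-based points 1, 3, 5, 7 become [0, 2, 4, 6] here."""
    target_xy = landmarks83[indices, :2].mean(axis=0)
    d2 = np.sum((vertices[:, :2] - target_xy) ** 2, axis=1)  # squared x-y distance
    return vertices[np.argmin(d2)]

# right_socket_mid = absent_point(landmarks83, [0, 2, 4, 6], vertices)
```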

In Figure 5, we can see that when the number of feature points is five, the absolute error of the pose angle (5 pixels on average) is small enough for many practical applications. Furthermore, five feature points are much easier to obtain than 15, and fitting is more efficient. In the following experiments we therefore set the number of feature points to 5.

5.3. Comparison with Other Methods

We perform head pose estimation on two publicly available datasets (the BU-3DFE 3D face database and the CUbiC FacePix database) to compare with the 3D MAM [17] and 3D morphable model (3DMM) [19] methods.

The CUbiC FacePix database [34] consists of 30 individuals, with three sets of images available for each. The first set contains images annotated only with yaw angle, captured from the individual's right to left in one-degree increments; the second and third sets target illumination experiments. We use only the first set for our pose estimation experiments.

We prepare the experimental data according to [17] and compare the proposed method with 3D MAM and the well-known 3DMM method. Because we were unable to obtain the USF Human ID 3D face database used in [17], we substitute a similar 3D database, the BU-3DFE database. From BU-3DFE we use projected images of all 100 individuals, varying only the yaw angle over a uniform range of values while fixing the roll and pitch angles to zero; the coordinates of the five feature points are recorded during projection. Table 2(a) presents the comparison results on the USF Human ID and the BU-3DFE databases. From the CUbiC FacePix database we take images whose yaw angles span a uniform range, with the coordinates of the five feature points detected automatically by the FACE++ SDK [24]. Table 2(b) presents the comparison result.

We then extend the previous experiment to two degrees of freedom by varying the yaw and pitch angles over corresponding ranges. Table 3 presents the results.

Table 4 reports the runtime per facial fit. Our approach is faster than the 3DMM [19] but slightly slower than 3D MAM. Our runtimes are measured on an Intel Core 2 Duo processor running at 2.4 GHz with an image resolution of 180 × 180 pixels; 3D MAM [17] uses the same processor with an image resolution of 60 × 80 pixels.

In [35], several experiments were conducted on the CUbiC FacePix database using 68 feature points to train support vector regression with AAM fitting, and a mean yaw error was reported over the pose variation considered there. Our approach reaches a comparable error on the same database using 15 facial feature points, but over a smaller range of yaw variation, because at large yaw angles there are obvious deviations in the locations of the feature points detected by the FACE++ SDK.

5.4. Experimental Results on Other Databases
5.4.1. On the CMU PIE Face Database

The CMU PIE face database [36] consists of over 40,000 facial images of 68 individuals. We use only the pictures taken by cameras c27, c07, c09, c29, and c05 with a neutral expression and without glasses. For three individuals, particular factors such as skin color and appearance prevented the FACE++ SDK from detecting the face; excluding them, our experimental set includes 65 individuals and covers a range of yaw and pitch angles. When estimating the pitch angle we adjust the position of the optical center to match the shooting setup. The results are given in Table 5.

In practice, many objective factors affect the accuracy. Figure 6 shows the frontal pose images of five individuals; it is obvious that the frontal face images of different people can appear very different. Unexpected artificial factors may cause this phenomenon, and with fixed cameras the height of the subject clearly affects the pitch angle. In other words, it is difficult to define a "standard frontal face image" for all people.

As a result, the estimated pose angles will have certain deviations, so we propose a new way to evaluate the estimation accuracy. Let β_i be the annotated value of angle i and β̂_i its estimate. Conventionally, the absolute error is defined as |β̂_i − β_i|; our modified absolute error instead compensates for each individual's deviation at the annotated frontal pose before taking the absolute value. We believe this new absolute error helps avoid the effect of inaccurate frontal pose annotations and makes the absolute error more objective. We therefore give the modified absolute error in Table 6 and plot the estimation results and the corresponding deviations in Figures 7 and 8.
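Since the exact formula is not reproduced above, the following sketch encodes one plausible reading of the modified error, assuming the modification subtracts the subject's deviation at the annotated frontal pose; this is our assumption, not a verbatim transcription of the paper's equation.

```python
import numpy as np

def conventional_error(beta_est, beta_ann):
    """Standard absolute error |beta_est - beta_ann|."""
    return np.abs(beta_est - beta_ann)

def modified_error(beta_est, beta_ann, frontal_est, frontal_ann):
    """Assumed form: subtract the subject's deviation at the annotated
    frontal pose before taking the absolute value, so a consistently
    shifted "frontal" annotation does not inflate the error."""
    return np.abs((beta_est - beta_ann) - (frontal_est - frontal_ann))
```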

From the curves (Figures 7 and 8), we can see that the deviations between the estimated pose angles and the annotated ones are relatively consistent across different people. This demonstrates that the method is reasonably accurate despite the possible deviations in the annotated frontal face images.

5.4.2. On the CAS-PEAL-R1 Database

The CAS-PEAL-R1 database [37], a subset of the CAS-PEAL face database, has been released for research purposes. It contains 30,871 images of 1,040 individuals (595 males and 445 females) with varying pose, expression, accessory, and lighting (PEAL). We are interested only in the pose subset. Because the pitch poses in CAS-PEAL-R1 were produced by artificially rotating the head, we do not test on the pitch images but use only the images with yaw-angle variations. The results are shown in Table 7.

6. Conclusion

In this paper, we introduced a novel head pose estimation method that relies on only five facial feature points and utilizes the shape model of the BFM. Our method needs no training and maintains stable performance across different databases. For a more objective evaluation, we modified the way the absolute errors are computed. The experimental results show that the proposed method can estimate face pose in most realistic situations with an absolute error acceptable for practical applications. In the future we plan to improve our approach in two respects: first, improving the optimization algorithm to reduce running time and enhance estimation accuracy; second, constructing a new 3D morphable face model that includes more races, to serve more international applications.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Key Scientific Instrument and Equipment Development Project of China (no. 2013YQ49087903), National Natural Science Foundation of China (Grant no. 61402307), and Educational Commission of Sichuan Province of China (no. 15ZA0007).