Mathematical Problems in Engineering

Volume 2015, Article ID 678973, 10 pages

http://dx.doi.org/10.1155/2015/678973

## Robust Head Pose Estimation Using a 3D Morphable Model

^{1}School of Computer Science, Sichuan University, Chengdu 610064, China
^{2}Wisesoft Software Co., Ltd., Chengdu 610045, China
^{3}College of Information Engineering, Sichuan Agricultural University, Ya’an 625014, China
^{4}School of Aeronautics and Astronautics, Sichuan University, Chengdu 610064, China

Received 30 September 2014; Accepted 17 November 2014

Academic Editor: Hui Zhang

Copyright © 2015 Ying Cai et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Head pose estimation from a single 2D image has long been an important and challenging research task in computer vision. This paper presents a novel head pose estimation method that utilizes the shape model of the Basel face model and five fiducial points in faces. It adjusts shape deformation according to a Laplace distribution to accommodate the shape variation across different persons. A new matching method based on the PSO (particle swarm optimization) algorithm is applied both to reduce the time cost of shape reconstruction and to achieve higher accuracy than traditional optimization methods. To evaluate accuracy objectively, we propose a new way to compute pose estimation errors. Experiments on the BFM-synthetic database, the BU-3DFE database, the CUbiC FacePix database, the CMU PIE face database, and the CAS-PEAL-R1 database show that the proposed method is robust, accurate, and computationally efficient.

#### 1. Introduction

As one of the most active research topics, head pose estimation has received significant attention in the computer vision field. The human visual system interprets the orientation of a human head naturally, but giving computers this ability is challenging, and many difficulties must be overcome. At the same time, the head pose is an important cue that can be widely used in many applications. In video surveillance, it helps the computer determine human identity and focus of attention in the scene [1]. In human-computer interaction, it helps the computer understand the intentions of a human being [2]. In video conferencing, it helps estimate a user’s focus of attention in meetings [3]. In driver monitoring, it helps assess the driver’s state of attention [4]. Moreover, most automatic face recognition algorithms try to normalize facial images in order to remove variations caused by anything but the identity of the person, and head pose is a major source of such undesired variation [5].

There have been numerous approaches to the head pose estimation problem. Temporal information from a face video sequence can improve head pose estimation [6], but estimating head pose from a single 2D image has wider applications and remains difficult, so we focus on it. Recent work demonstrates that depth information is key to overcoming many of the problems inherent in 2D image data. Fortunately, all kinds of sensors, including depth sensors, have recently become both affordable and accurate [7], and many studies now make use of multiple sensors, such as [8]. The use of 3D data to assist pose estimation from a single 2D image is therefore already practicable.

#### 2. Related Work

Murphy-Chutorian and Trivedi [9] reviewed the different algorithms used for head pose estimation. Most head pose estimation algorithms can be divided into two steps: the first extracts features, and the second estimates the head pose from the obtained features. Head pose estimation approaches can be roughly categorized into model-based approaches and appearance-based approaches.

Model-based approaches: these approaches usually use geometric features. In earlier work, Gee and Cipolla [10] used five feature points (the nose tip, the outer eye corners, and the mouth corners) to estimate facial orientation. Wang and Sung [11] used the two outer eye corners, the two inner eye corners, and the two mouth corners to estimate pose. They considered the three lines through the outer eye corners, the inner eye corners, and the mouth corners to be parallel, so the ratios of their lengths could be used to estimate the head orientation under the assumption that the eye corners and mouth corners lie approximately in the same plane. Because different persons have different ratios, the precision is not ideal. Lanitis et al. [12] extracted face features with an active shape model (ASM) and adopted a greedy search to match the feature points, but such techniques [12] usually need a large number of feature landmarks, more than 60.

Appearance-based approaches: these approaches use features that are modeled and learned from the training data. Distance metric learning [13], subspace analysis [14], and manifold embedding [15] are popular methods used to extract features. But such techniques are usually sensitive to data alignment and background noise.

Robustness and accuracy are both important in head pose estimation; hence many hybrid approaches have been proposed. Reference [16] gives a coarse-to-fine framework to estimate pose: the first stage is based on subspace analysis, and in the second stage the estimated pose is refined by analyzing finer geometrical structure details captured by bunch graphs. Reference [17] presents a 3D head pose estimation approach based on a 3D morphable appearance model, but it considers only a limited range of yaw and pitch angles.

The obvious difficulty in many applications using a single 2D face image is still invariance to pose, illumination, and expression. 3D information about the object is very useful for overcoming this difficulty. Flexible and expressive 3D face models are a powerful tool to estimate the 3D shape, texture, and pose from a single photograph and to transfer them between individuals [18]. For this reason, 3D morphable models (3DMM) were introduced over a decade ago [19]. A morphable face model represents the shapes and textures of samples in a vector space, in which any convex combination of the shape and texture vectors of the face samples describes a realistic human face [20], with the constraints on shape and texture derived from the statistics of the sample faces [18]. However, the expensive and tedious fitting process makes the full model impractical: on an SGI R10000 processor, computation time was 50 minutes [18]; in [21], the optimization time was kept below 180 s on a 2.8 GHz processor; and in [17], runtimes of 33.5 s were measured in MATLAB on an Intel Core 2 Duo processor running at 2.4 GHz. Therefore we simplify the traditional 3D morphable model to a 3D morphable shape model that encodes only shape in terms of model parameters; reconstructing only the shape of the model is enough to estimate head pose.

During 3D face shape reconstruction, model matching is an important step. It is a high-dimensional optimization problem in which both speed and matching precision have to be considered. The traditional stochastic gradient descent method is hard to converge to the optimum in a short time; in addition, its result is not stable since it depends too much on the initial values. The PSO algorithm, originally proposed by Kennedy and Eberhart [22, 23], is a swarm-intelligence-based optimization algorithm that is independent of initial values and of the gradient information of the objective function. Moreover, its global-best mechanism provides central guidance, which helps prevent a few “bad” individuals from derailing the optimization. We therefore choose it, and we further improve the PSO algorithm to make it better suited to 3D face shape reconstruction.
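To make the global-best PSO idea concrete, here is a minimal NumPy sketch of the standard algorithm (not the improved variant developed in this paper); the function name, inertia weight, and acceleration coefficients are our own illustrative choices:

```python
import numpy as np

def pso_minimize(cost, dim, bounds, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `cost` over a box [lo, hi]^dim with standard global-best PSO."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))   # particle positions
    v = np.zeros_like(x)                               # particle velocities
    pbest = x.copy()                                   # personal best positions
    pbest_val = np.array([cost(p) for p in x])
    g = pbest[np.argmin(pbest_val)].copy()             # global best position
    g_val = pbest_val.min()
    for _ in range(n_iter):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Velocity update: inertia + cognitive pull + social (global-best) pull.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([cost(p) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        if pbest_val.min() < g_val:                    # update the swarm's guide
            g_val = pbest_val.min()
            g = pbest[np.argmin(pbest_val)].copy()
    return g, g_val
```

The social term `c2 * r2 * (g - x)` is the central guidance discussed above: every particle is pulled toward the best solution found by the whole swarm, so a few poorly placed particles cannot derail the search.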

This paper presents a novel hybrid head pose estimation method which consists of three major parts. (1) Get the coordinates of five feature points on the 2D input image (the left and right eye centers, the nose tip, and the left and right mouth corners). Automatically capturing facial feature points is an important and open problem, and many methods have been presented for this task. In this paper, we employ two approaches to get the feature points: the FACE++ SDK [24] for the 2D databases, and for the 3D databases we record the values of the feature points during the projection process. Experiments show that five feature points are enough to estimate pose angles accurately. (2) Structure the cost function according to the most suitable dimensionality of the shape parameters that we found and the improved data distribution of the shape parameters. (3) Apply an improved particle swarm optimization algorithm to minimize the fitting error between the five feature points on the 2D input image and the 3D shape model. The reconstructed 3D model then carries the pose we want: the rotational Euler angles of the head relative to the x-, y-, and z-axes of the input image, that is, three degrees of freedom (yaw, pitch, and roll).

The paper is organized as follows. First, we introduce the 3D morphable shape model in Section 3. In Section 4 we introduce our proposed method of shape parameters estimation and PSO optimization process. In Section 5, we give the experiment results on five databases. Finally, we conclude the paper and discuss potential future work in Section 6.

#### 3. 3D Morphable Face Shape Model

As is well known, any face shape can be represented as a linear combination of a limited basis set of face shape prototypes. We use $3n$-dimensional vectors to represent an exemplar face shape as $S = (x_1, y_1, z_1, \ldots, x_n, y_n, z_n)^{T}$ and define these vectors as shape vectors. In a 3D face scan, $n$ is the number of vertices and $(x_i, y_i, z_i)$ are the coordinates of the $i$th vertex. Given $m$ face shape vectors $S_1, \ldots, S_m$, we establish dense correspondence between them so that all vectors have the same length, and a 3D average face shape $\bar{S}$ can be constructed. We then perform a principal component analysis (PCA [25, 26]) on the shape vectors $S_1, \ldots, S_m$: the covariance matrix is calculated; let its eigenvectors be $s_i$ and its eigenvalues be $\sigma_i^2$. Using the eigenvectors as an orthogonal basis, any face shape can be represented as follows:

$$S = \bar{S} + \sum_i \alpha_i s_i.$$

Obviously, different coefficients $\alpha_i$ construct different persons; we define the $\alpha_i$ as shape parameters.
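In code, synthesizing a face shape from this linear model amounts to adding a weighted combination of basis vectors to the mean shape. A minimal NumPy sketch, using a small random orthonormal basis in place of the real BFM eigenvectors (which would have 3 × 53490 rows):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vertices = 4                      # toy size; the BFM has 53490 vertices
k = 3                               # number of retained principal components

# Stand-ins for the PCA results: mean shape (x1, y1, z1, ..., xn, yn, zn)
# and an orthonormal eigenvector basis s_i stored as columns.
mean_shape = rng.normal(size=3 * n_vertices)
basis, _ = np.linalg.qr(rng.normal(size=(3 * n_vertices, k)))

def synthesize(alpha):
    """Face shape as the mean plus a linear combination of eigenvectors."""
    return mean_shape + basis @ alpha

average_face = synthesize(np.zeros(k))   # alpha = 0 reproduces the average face
```

During fitting, the optimizer searches over the low-dimensional `alpha` rather than the raw vertex coordinates, which is what makes the reconstruction tractable.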

The Basel face model (BFM) [27] was constructed from high-resolution laser scans of 200 persons with a large variety of face shapes and appearances, registered with each other by an improved dense correspondence algorithm. After alignment, all 200 faces are unified to 53490 vertices.

#### 4. Proposed Method

##### 4.1. Cost Function

By fitting the 3D shape model to an input 2D image we can estimate the pose of the face in that image, because the fitting process determines not only the projective parameters and the shape parameters but also the pose parameters. We concatenate the projective parameters (the focal length $f$ of the camera and the 3D translation $t$) and the pose parameters (the rotation angles $\phi$, $\theta$, and $\gamma$) into a vector $\rho$.

The primary goal of the fitting is to minimize the sum of squared differences between the five facial feature points $(x_i, y_i)$ in the input 2D image and the projected positions $(\tilde{x}_i, \tilde{y}_i)$ of the corresponding vertices in the BFM:

$$E = \sum_{i=1}^{5} \left[ (x_i - \tilde{x}_i)^2 + (y_i - \tilde{y}_i)^2 \right],$$

where $\tilde{x}_i$ and $\tilde{y}_i$ can be computed through the following method.
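This reprojection error is a one-line computation; a minimal sketch, assuming both point sets are stored as (5, 2) arrays of image coordinates (the array layout and function name are our own):

```python
import numpy as np

def feature_cost(observed, projected):
    """Sum of squared differences between the five observed image feature
    points and the projected model feature points, each of shape (5, 2)."""
    d = np.asarray(observed, dtype=float) - np.asarray(projected, dtype=float)
    return float(np.sum(d * d))
```

In the fitting loop, `projected` would be recomputed from the current pose and shape parameters at every cost evaluation, so this function serves directly as the PSO objective.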

The object-centred coordinates $v$ of the $i$th vertex can be mapped to a position $w$ relative to the camera by the rigid transformation [28]

$$w = R_{\gamma} R_{\theta} R_{\phi} \, v + t,$$

where the angles $\phi$ and $\theta$ control the BFM’s rotation around the vertical and horizontal axes (i.e., yaw and pitch), $\gamma$ defines a rotation around the camera axis (i.e., roll), and $t$ is a spatial shift.
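Assuming the common convention of yaw about the y-axis, pitch about the x-axis, and roll about the z-axis (the exact axis convention in [28] may differ), the rigid transformation can be sketched as:

```python
import numpy as np

def rigid_transform(vertices, phi, theta, gamma, t):
    """Map object-centred vertices (N, 3) to camera coordinates:
    w = R_gamma @ R_theta @ R_phi @ v + t."""
    c, s = np.cos(phi), np.sin(phi)
    R_phi = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])    # yaw about y
    c, s = np.cos(theta), np.sin(theta)
    R_theta = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])  # pitch about x
    c, s = np.cos(gamma), np.sin(gamma)
    R_gamma = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])  # roll about z
    R = R_gamma @ R_theta @ R_phi
    return vertices @ R.T + t
```

Only the five vertices corresponding to the feature points need to be transformed during fitting, which keeps each cost evaluation cheap.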

According to the perspective projection method, we can map the vertex $w = (w_x, w_y, w_z)^{T}$ to image plane coordinates

$$\tilde{x} = o_x + f \frac{w_x}{w_z}, \qquad \tilde{y} = o_y - f \frac{w_y}{w_z},$$

where $f$ is the focal length of the camera and $(o_x, o_y)$ defines the principal point, which is the intersection point of the optical axis and the image plane.
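The projection step translates directly into code; a small sketch (the minus sign on the y term reflects image coordinates growing downward, per our reconstruction of the formula):

```python
def project(w, f, ox, oy):
    """Perspective projection of a camera-frame point w = (wx, wy, wz)
    onto the image plane: x = ox + f*wx/wz, y = oy - f*wy/wz."""
    wx, wy, wz = w
    return (ox + f * wx / wz, oy - f * wy / wz)

# A point on the optical axis projects exactly onto the principal point.
center = project((0.0, 0.0, 10.0), f=100.0, ox=320.0, oy=240.0)
```

Chaining `rigid_transform`-style mapping with this projection yields the predicted feature positions that enter the squared-error cost.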

We suggest taking the principal point to be the projective position of the origin of the object-centered coordinates, as in Figure 1. We can then treat $(o_x, o_y)$ as known and fix $f$ to an empirical value, so that only the rotation angles $\phi$, $\theta$, $\gamma$ and the translation $t$ need to be estimated, which reduces the fitting time. To find the projective position of the origin of the object-centered coordinates on the 2D input image, we move the origin of the object-centered coordinates of the BFM to the nose tip and keep the optical axis passing through that origin and collinear with the $z$-axis of the object-centered coordinates.