Abstract
Face recognition under varying pose is a challenging problem, especially when illumination variations are also present. In this paper, we address one of the most challenging scenarios in face recognition: identifying a subject from a test image acquired under a pose and illumination condition different from those of the only training sample (also known as
a gallery image) of this subject in the database. For example, the test image could be semifrontal and illuminated by multiple lighting sources while the corresponding training image is frontal under a single lighting source. Under the assumption of Lambertian reflectance, the spherical harmonics representation has proved to be effective in modeling illumination
variations for a fixed pose. In this paper, we extend the spherical harmonics representation to encode pose information. More specifically, we utilize the fact that 2D harmonic basis images at different poses are related by closed-form linear transformations, and give a more convenient transformation matrix that can be applied directly to basis images. An immediate application is that
we can easily synthesize a different view of a subject under arbitrary lighting conditions by changing the coefficients of the spherical harmonics representation. A more important result is an efficient face recognition method, based on the orthonormality of the linear transformations, for solving the above-mentioned challenging scenario. Thus, we directly project a nonfrontal view test image onto the space of frontal view harmonic basis images. The impact of some empirical factors due to the projection is embedded in a sparse warping matrix; for most cases, we show that the recognition performance does not deteriorate
after warping the test image to the frontal view. Very good recognition results are obtained using this method for both synthetic and challenging real images.
1. Introduction
Face recognition is one of the most successful
applications of image analysis and understanding [1]. Given a database of training images (sometimes
called a gallery set, or gallery images), the task of face recognition is to
determine the facial ID of an incoming test image. Built upon the success of
earlier efforts, recent research has focused on robust face recognition that
handles significant differences between a test image and its
corresponding training images (i.e., images of the same subject). Despite
significant progress, robust face recognition under varying lighting and
different pose conditions remains a challenging problem. The problem
becomes even more difficult when only one training image per subject is available.
Recently, methods have been proposed to handle the combined pose and
illumination problem when only one training image is available, for example,
the method based on morphable models [2] and its extension [3] that proposes to handle the complex illumination
problem by integrating spherical harmonics representation [4, 5]. In these methods, either arbitrary illumination
conditions cannot be handled [2] or the expensive computation of harmonic basis images
is required for each pose per subject [3].
Under the assumption of Lambertian reflectance, the
spherical harmonics representation has proved to be effective in modelling
illumination variations for a fixed pose. In this paper, we extend the harmonic
representation to encode pose information. We utilize the fact that all the
harmonic basis images of a subject at various poses are related to each other
via closed-form linear transformations [6, 7], and derive a more convenient transformation matrix
to analytically synthesize basis images of a subject at various poses from just
one set of basis images at a fixed pose, say, the frontal view [8]. We prove that the derived transformation matrix is
consistent with the general rotation matrix of spherical harmonics. According
to the theory of spherical harmonics representation [4, 5], this implies that, from one
image under a fixed pose and lighting, we can easily synthesize images at different poses and
under arbitrary lighting. Moreover, these linear transformations are orthonormal.
This suggests that recognition methods based on projection onto fixed-pose
harmonic basis images [4] for test images under the same pose can be easily
extended to handle test images under various poses and illuminations. In other
words, we do not need to generate a new set of basis images at the same pose as
that of the test image. Instead, we can warp the test image to a frontal view and
directly use the existing frontal view basis images. The impact of some
empirical factors (i.e., correspondence and interpolation) due to the warping
is embedded in a sparse transformation matrix; for most cases, we show that the recognition
performance does not deteriorate after warping the test image to the frontal
view.
To summarize, we propose an efficient face synthesis
and recognition method that needs only a single training image per subject
for novel view synthesis and robust recognition of faces under variable
illuminations and poses. The structure of our face synthesis and recognition
system is shown in Figure 1. We have a single training image at the frontal pose
for each subject in the training set. The basis images for each training
subject are recovered using a statistical learning algorithm [9] with the aid of a bootstrap set consisting of 3D face
scans. For a test image at a rotated pose and under an arbitrary illumination
condition, we manually establish the image correspondence between the test image
and a mean face image at the frontal pose. The frontal view image is then
synthesized from the test image. The test face is identified as the subject whose
basis images give the linear reconstruction that is closest to the test
image. Note that although Figure 1 shows only training images acquired at the frontal pose,
this does not exclude cases in which the available training images are at
different poses. Furthermore, the user is given the option to visualize the
recognition result by comparing the synthesized images of the chosen subject
against the test image. Specifically, we can generate novel images of the
chosen subject at the same pose as the test image by using the closed-form
linear transformation between the harmonic basis images of the subject across
poses. The pose of the test image is estimated from a few manually selected
main facial features.
Figure 1: The proposed face synthesis and recognition system.
We test our face recognition method on both synthetic
and real images. For synthetic images, we generate the training images at the
frontal pose and under various illumination conditions, and the test images at
different poses, under arbitrary lighting conditions, all using Vetter's 3D face
database [10]. For real images, we use the CMU-PIE [11] database which contains face images of 68 subjects
under 13 different poses and 43 different illumination conditions. The test
images are acquired at six different poses and under twenty-one different
lighting sources. High recognition rates are achieved on both synthetic and
real test images using the proposed algorithm.
The remainder of the paper is organized as follows.
Section 2 introduces related work. The pose-encoded spherical
harmonic representation is illustrated in Section 3 where we derive a more convenient transformation
matrix to analytically synthesize basis images at one pose from those at
another pose. Section 4 presents the complete face recognition and synthesis
system. Specifically, in Section 4.1 we briefly summarize a statistical learning method to
recover the basis images from a single image when the pose is fixed. Section 4.2 describes the recognition algorithm and demonstrates that
the recognition performance does not degrade after warping the test image to
the frontal view. Section 4.3 presents how to generate the novel image of the
chosen subject at the same pose as the test image for visual comparison. The
system performance is demonstrated in Section 5. We conclude our paper in Section 6.
2. Related Work
As pointed out
in [1] and many references cited therein, pose and/or illumination
variations can cause serious performance degradation to many existing face
recognition systems. A review of these two problems and proposed solutions can
be found in [1]. Most earlier methods focused on either illumination
or pose alone. For example, an early effort to handle illumination variations
is to discard the first few principal components that are assumed to pack most
of the energy caused by illumination variations [12]. To handle complex illumination variations more
efficiently, spherical harmonics representation was independently proposed by
Basri and Jacobs [4] and Ramamoorthi [5]. It has been shown that the set of images of a convex
Lambertian face object obtained under a wide variety of lighting conditions can
be approximated by a low-dimensional linear subspace. The basis images spanning
the illumination space for each face can then be rendered from a 3D scan of the
face [4]. Following the statistical learning scheme in [13], Zhang and Samaras [9] showed that the basis images spanning this space can
be recovered from just one image taken under arbitrary illumination conditions
for a fixed pose.
To handle the pose problem, a template matching scheme
was proposed in [14] that needs many different views per person and does
not allow lighting variations. Approaches for face recognition under pose
variations [15, 16] avoid the strict correspondence problem by storing
multiple normalized images at different poses for each person. View-based
eigenface methods [15] explicitly code the pose information by constructing
an individual eigenface for each pose. Reference [16] treats face recognition across poses as a bilinear
factorization problem, with facial identity and head pose as the two factors.
To handle the combined pose and illumination
variations, researchers have proposed several methods. The synthesis method in [17] can handle both illumination and pose variations by
reconstructing the face surface using the illumination cone method under a
fixed pose and rotating it to the desired pose. That method essentially
builds illumination cones at each pose for each person. Reference [18] presented a symmetric shape-from-shading (SFS)
approach to recover both shape and albedo for symmetric objects. This work was
extended in [19] to recover the 3D shape of a human face using a
single image. In [20], a unified approach was proposed to solve the pose
and illumination problem. A generic 3D model was used to establish the
correspondence and estimate the pose and illumination direction. Reference [21] presented a pose-normalized face synthesis method
under varying illuminations using the bilateral symmetry of the human face. A
Lambertian model with a single light source was assumed. Reference [22] extended the photometric stereo algorithms to recover
albedos and surface normals from one image illuminated by one or more unknown
distant illumination sources.
Building upon the highly successful statistical
modeling of 2D face images [23], the authors in [24] propose a 2D + 3D active appearance model (AAM)
scheme to enhance AAM in handling 3D effects to some extent. A sequence of face
images (900 frames) is tracked using AAM and a 3D shape model is constructed
using structure-from-motion (SFM) algorithms. As camera calibration and 3D
reconstruction accuracy can be severely affected when the camera is far away
from the subjects, the authors imposed these 3D models as soft constraints for
the 2D AAM fitting procedure and showed convincing tracking and image synthesis
results on a set of five subjects. However, this is not a true 3D approach with
accurate shape recovery and does not handle occlusion.
To handle both pose and illumination variations, a 3D
morphable face model has been proposed in [2], where the shape and texture of each face are
represented as a linear combination of a set of 3D face exemplars and the
parameters are estimated by fitting a morphable model to the input image. By
far the most impressive face synthesis results were reported in [2] accompanied by very high recognition rates. In order
to effectively handle both illumination and pose, a recent work [3] combines spherical harmonics and the morphable model.
It works by assuming that shape and pose can be first solved by applying the morphable
model and illumination can then be handled by building spherical harmonic basis
images at the resolved pose. Most of the 3D morphable model approaches are
computationally intensive [25] because of the large number of parameters that need
to be optimized. In contrast, our method does not require the
time-consuming procedure of building a set of harmonic basis images for each
pose. Rather, we can analytically synthesize many sets of basis images from
just one set of basis images, say, the frontal basis images. For the purpose
of face recognition, we can further improve the efficiency by exploiting the
orthonormality of linear transformations among sets of basis images at
different poses. Thus, we do not synthesize basis images at different
poses. Rather, we warp the test image to the same pose as that of the existing
basis images and perform recognition.
3. Pose-Encoded Spherical Harmonics
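The spherical
harmonics are a set of functions that form an orthonormal basis for the set of
all square-integrable functions defined on the unit sphere [4]. Any image of a Lambertian object under certain
illumination conditions is a linear combination of a series of spherical
harmonic basis images $b_{lm}$.
In order to generate the basis images for the object, 3D information is
required. The harmonic basis image intensity of a point
with surface normal $n = (n_x, n_y, n_z)$
and albedo $\lambda$
can be computed as the combination of the
first nine spherical harmonics, shown in (1), where $n_{x^2} = n_x .* n_x$; $n_{y^2}$, $n_{z^2}$, $n_{xy}$, $n_{xz}$, $n_{yz}$
are defined similarly. $\lambda .* t$ denotes the component-wise product of $\lambda$
with any vector $t$.
The superscripts $e$ and $o$ denote the even and the odd components of the
harmonics, respectively:
$$
\begin{aligned}
b_{00} &= \frac{1}{\sqrt{4\pi}}\,\lambda, &
b_{10} &= \sqrt{\frac{3}{4\pi}}\,\lambda .* n_z, \\
b_{11}^{e} &= \sqrt{\frac{3}{4\pi}}\,\lambda .* n_x, &
b_{11}^{o} &= \sqrt{\frac{3}{4\pi}}\,\lambda .* n_y, \\
b_{20} &= \frac{1}{2}\sqrt{\frac{5}{4\pi}}\,\lambda .* (2 n_{z^2} - n_{x^2} - n_{y^2}), &
b_{21}^{e} &= 3\sqrt{\frac{5}{12\pi}}\,\lambda .* n_{xz}, \\
b_{21}^{o} &= 3\sqrt{\frac{5}{12\pi}}\,\lambda .* n_{yz}, &
b_{22}^{e} &= \frac{3}{2}\sqrt{\frac{5}{12\pi}}\,\lambda .* (n_{x^2} - n_{y^2}), \\
b_{22}^{o} &= 3\sqrt{\frac{5}{12\pi}}\,\lambda .* n_{xy}.
\end{aligned}
\tag{1}
$$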
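As a minimal sketch of how the nine harmonic basis images could be computed from per-pixel surface normals and albedo, the following NumPy function assumes the standard real spherical harmonics of orders 0 to 2 used by Basri and Jacobs [4]. The function name `harmonic_basis_images`, the argument layout, and the column ordering (00, 10, 11e, 11o, 20, 21e, 21o, 22e, 22o) are illustrative assumptions, not part of the original system.

```python
import numpy as np

def harmonic_basis_images(normals, albedo):
    """Return a (d, 9) matrix of harmonic basis images.

    normals: (d, 3) array of unit surface normals (n_x, n_y, n_z), one per pixel.
    albedo:  (d,) array of per-pixel albedo values.
    Column order (an assumption of this sketch):
    b00, b10, b11e, b11o, b20, b21e, b21o, b22e, b22o.
    """
    nx, ny, nz = normals[:, 0], normals[:, 1], normals[:, 2]
    c0 = 1.0 / np.sqrt(4 * np.pi)          # order-0 constant
    c1 = np.sqrt(3 / (4 * np.pi))          # order-1 constant
    c2 = 3 * np.sqrt(5 / (12 * np.pi))     # shared order-2 constant
    cols = [
        c0 * np.ones_like(nx),
        c1 * nz,
        c1 * nx,
        c1 * ny,
        0.5 * np.sqrt(5 / (4 * np.pi)) * (2 * nz**2 - nx**2 - ny**2),
        c2 * nx * nz,
        c2 * ny * nz,
        0.5 * c2 * (nx**2 - ny**2),
        c2 * nx * ny,
    ]
    # Component-wise product of the albedo with every harmonic.
    return albedo[:, None] * np.stack(cols, axis=1)
```

Each row of the returned matrix holds the nine basis values of one pixel, so the matrix can be used directly as the set of basis images in the projection-based recognition described later.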
Given a bootstrap set of 3D models, the spherical
harmonics representation has proved to be effective in modeling illumination
variations for a fixed pose, even in the case when only one training image per
subject is available [9]. In the presence of both illumination and pose
variations, two possible approaches can be taken. One is to use a 3D morphable
model to reconstruct the 3D model from a single training image and then build
spherical harmonic basis images at the pose of the test image [3]. Another approach is to require multiple training
images at various poses in order to recover the new set of basis images at each
pose. However, multiple training images are not always available and a 3D
morphable model-based method could be computationally expensive. For efficient
recognition of a rotated test image, a natural question is whether we can represent the basis
images at different poses using one set of basis images at a given pose, say,
the frontal view. The answer is yes, and the reason lies in the fact that 2D
harmonic basis images at different poses are related by closed-form linear
transformations. This enables an analytic method for generating new basis
images at poses different from that of the existing basis images.
Rotations of spherical harmonics have been studied by
researchers [6, 7], and it can be shown that a rotated spherical
harmonic of order $l$
is a linear combination of spherical
harmonics of the same order only. In terms of group theory, the transformation
matrix is the
$(2l+1)$-dimensional representation of the rotation
group SO(3) [7]. Let
be the spherical harmonic, the general
rotation formula of spherical harmonic can be written as
,
where
,
,
are the rotation angles around the
,
, and
axes,
respectively. This means that for each order
,
is a matrix that tells us how a spherical
harmonic transforms under rotation. As a matrix multiplication, the
transformation is found to have the following block diagonal sparse form:
(2)
where the entries of each diagonal block are functions of the rotation angles. The analytic
formula is rather complicated, and is derived in [6, equation (7.48)].
Assuming that the test image
is at a different pose (e.g., a rotated view)
from the training images (usually at the frontal view), we look for the basis
images at the rotated pose from the basis images at the frontal pose. It will
be more convenient to use the basis image form as in (1), rather than the spherical harmonics form
.
The general rotation can be decomposed into three concatenated Euler angles
around the
,
, and
axes, namely, elevation (
), azimuth (
), and roll (
), respectively. Roll is an in-plane rotation
that can be handled much more easily and so will not be discussed here. The
following proposition gives the linear transformation matrix from the basis
images at the frontal pose to the basis images at the rotated pose for orders
$l = 0, 1, 2$,
which capture 98% of the energy [4].
Proposition 1.
Assume that a rotated view is obtained by rotating a
frontal view head with an azimuth angle
.
Given the correspondence between the frontal view and the rotated view, the
basis images
at the rotated pose are related to the basis
images
at the frontal pose as
(3)
where
,
,
,
,
,
,
,
,
.
Further, if there is an elevation angle
, the basis images
for the newly rotated view are related to
in the following linear form:
(4)
where
,
,
,
,
,
,
,
,
.
A direct proof (rather than deriving from the general rotation
equations) of this proposition is given in the appendix, where we
also show that the proposition is consistent with the general
rotation matrix of spherical harmonics.
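The structural claims of the proposition (a linear relation that is block diagonal per order and orthonormal) can be checked numerically. The sketch below does not reproduce the closed-form matrix; it estimates the transformation by least squares from sampled surface normals. It assumes that the azimuth rotation is about the vertical (y) axis, that albedo is 1, that pixel correspondence is given (each row is the same surface point before and after rotation), and that the harmonics follow the standard Basri and Jacobs [4] form. The helper names `harmonics9` and `rot_y` are hypothetical.

```python
import numpy as np

def harmonics9(n):
    """First nine real spherical harmonics evaluated at unit normals n (d, 3)."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    c1 = np.sqrt(3 / (4 * np.pi))
    c2 = 3 * np.sqrt(5 / (12 * np.pi))
    return np.stack([
        np.full_like(x, 1 / np.sqrt(4 * np.pi)),
        c1 * z, c1 * x, c1 * y,
        0.5 * np.sqrt(5 / (4 * np.pi)) * (2 * z**2 - x**2 - y**2),
        c2 * x * z, c2 * y * z, 0.5 * c2 * (x**2 - y**2), c2 * x * y,
    ], axis=1)

def rot_y(theta):
    """Rotation matrix about the y axis (assumed azimuth convention)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

rng = np.random.default_rng(0)
n = rng.normal(size=(2000, 3))
n /= np.linalg.norm(n, axis=1, keepdims=True)    # random unit normals

theta = np.deg2rad(30.0)
B_front = harmonics9(n)                          # basis values at the frontal pose
B_rot = harmonics9(n @ rot_y(theta).T)           # same points after the rotation

# Least-squares estimate of M such that B_rot = B_front @ M.
M, *_ = np.linalg.lstsq(B_front, B_rot, rcond=None)

# Entries outside the order-0 (1x1), order-1 (3x3), and order-2 (5x5) blocks.
mask = np.ones((9, 9), dtype=bool)
for lo, hi in [(0, 1), (1, 4), (4, 9)]:
    mask[lo:hi, lo:hi] = False
max_off_block = np.abs(M[mask]).max()
orthonormality_error = np.abs(M.T @ M - np.eye(9)).max()
fit_error = np.abs(B_front @ M - B_rot).max()
```

Under these assumptions, `max_off_block`, `orthonormality_error`, and `fit_error` should all be at the level of floating-point round-off, which is the numerical counterpart of the block-diagonal, orthonormal relation stated above.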
To illustrate the effectiveness of (3) and (4), we synthesized the basis images at an arbitrarily
rotated pose from those at the frontal pose, and compared them with the ground
truth generated from the 3D scan in Figure 2. The first three rows present the results for subject
1, with the first row showing the basis images at the frontal pose generated
from the 3D scan, the second row showing the basis images at the rotated pose
(azimuth angle
,
elevation angle
) synthesized from the images at the first
row, and the third row showing the ground truth of the basis images at the
rotated pose generated from the 3D scan. Rows four through six present the
results for subject 2, with the fourth row showing the basis images at the
frontal pose generated from the 3D scan, the fifth row showing the basis images
for another rotated view (azimuth angle
,
elevation angle
) synthesized from the images at the fourth
row, and the last row showing the ground truth of the basis images at the
rotated pose generated from the 3D scan. As we can see from Figure 2, the synthesized basis images at the rotated poses
are very close to the ground truth. Note in Figure 2 and the figures in the sequel the dark regions
represent the negative values of the basis images.
Figure 2: (a)–(c) present the results of the synthesized basis images for subject 1, where (a) shows the basis images at the frontal pose generated from the 3D scan, (b) the basis images at a rotated pose synthesized from (a), and (c) the ground truth of the basis images at the rotated pose. (d)–(f) present the results of the synthesized basis images for subject 2, with (d) showing the basis images at the frontal pose generated from the 3D scan, (e) the basis images at a rotated pose synthesized from (d), and (f) the ground truth of the basis images at the rotated pose.
Given that the correspondence between the rotated-pose
image and the frontal-pose image is available, a consequence of the existence
of such linear transformation is that the procedure of first rotating objects
and then recomputing basis images at the desired pose can be avoided. The block
diagonal form of the transformation matrices preserves the energy on each order
$l$. Moreover, the orthonormality of the transformation matrices helps to further
simplify the computation required for the recognition of the rotated test image
as shown in Section 4.2. Although in theory new basis images can be generated
from a rotated 3D model inferred from the existing basis images (since basis
images actually capture the albedo and the 3D surface normal of a given human face), the procedure of
such 3D recovery is not trivial in practice, even if computational cost is
taken out of consideration.
4. Face Recognition Using Pose-Encoded Spherical Harmonics
In this section, we present an efficient face recognition method using pose-encoded
spherical harmonics. Only one training image is needed per subject and high
recognition performance is achieved even when the test image is at a different
pose from the training image and under an arbitrary illumination condition.
4.1. Statistical Models of Basis Images
We briefly summarize a statistical learning method to recover the harmonic basis images
from only one image taken under arbitrary illumination conditions, as shown in [9].
We build a bootstrap set with fifty 3D face scans and corresponding texture maps from Vetter's 3D face database [10], and generate nine basis images for each face model.
For a novel
-dimensional vectorized image
,
let
be the
matrix of basis images,
,
a 9-dimensional vector, and
,
an
-dimensional error term. We have
.
It is assumed that the probability density functions (pdf's) of
are Gaussian distributions. The sample mean
vectors
and covariance matrices
are estimated from the basis images in the
bootstrap set. Figure 3 shows the sample mean of the basis images estimated
from the bootstrap set.
Figure 3: The sample mean of the basis images estimated from the bootstrap set [10].
By estimating
and the statistics of
in a prior step with kernel regression and
using them consistently across all pixels to recover
,
it is shown in [9] that for a given novel face image
,
the corresponding basis images
at each pixel
are recovered by computing the maximum a
posteriori (MAP) estimate,
. Using the Bayes rule,
(5)
Taking the logarithm, and setting the derivatives of the
right-hand side of (5) (w.r.t.
) to 0, we get
,
where
and
.
Note that the superscript
denotes the transpose of the matrix here and
in the sequel. By solving this linear equation,
of the subject can be recovered.
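A minimal per-pixel sketch of this MAP step is given below. It assumes the linear model in which the pixel intensity equals the inner product of the nine basis values with the lighting coefficients plus an error term, with a Gaussian error (mean `mu_e`, standard deviation `sigma_e`) and a Gaussian prior on the basis values (mean `mu_b`, covariance `C_b`). Setting the gradient of the log posterior to zero under these assumptions gives a 9-by-9 linear system. The function name, argument names, and the exact form of the system are assumptions of this sketch, stated in the style of [9], and are not copied from the original equations.

```python
import numpy as np

def map_basis_pixel(intensity, alpha, mu_e, sigma_e, mu_b, C_b):
    """MAP estimate of the nine basis values at one pixel.

    Assumed model: intensity = b @ alpha + e,
    with e ~ N(mu_e, sigma_e**2) and b ~ N(mu_b, C_b).
    """
    C_inv = np.linalg.inv(C_b)
    # Zero gradient of the log posterior gives A @ b = U.
    A = np.outer(alpha, alpha) / sigma_e**2 + C_inv
    U = (intensity - mu_e) / sigma_e**2 * alpha + C_inv @ mu_b
    return np.linalg.solve(A, U)

# Hypothetical usage with random statistics standing in for the bootstrap set.
rng = np.random.default_rng(1)
alpha = rng.normal(size=9)                 # lighting coefficients (assumed known)
L = rng.normal(size=(9, 9))
C_b = L @ L.T + 9 * np.eye(9)              # symmetric positive definite covariance
mu_b = rng.normal(size=9)
b_map = map_basis_pixel(0.7, alpha, 0.01, 0.1, mu_b, C_b)
```

Because the system matrix is the sum of a positive semidefinite term and a positive definite inverse covariance, it is always invertible, so the estimate is well defined at every pixel under these assumptions.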
In Figure 4, we illustrate the procedure for generating the basis
images at a rotated pose (azimuth angle
) from a single training image at the frontal
pose. In Figure 4, rows one through three show the results of the
recovered basis images from a single training image, with the first column
showing different training images
under arbitrary illumination conditions for
the same subject and the remaining nine columns showing the recovered basis
images. We can observe from the figure that the basis images recovered from
different training images of the same subject look very similar. Using the
basis images recovered from any training image in row one through three, we can
synthesize basis images at the rotated pose, as shown in row four. As a
comparison, the fifth row shows the ground truth of the basis images at the
rotated pose generated from the 3D scan.
Figure 4: The first column in (a) shows different training images under arbitrary illumination conditions for the same subject, and the remaining nine columns in (a) show the basis images recovered from them. We can observe that the basis images recovered from different training images of the same subject look very similar. Using the basis images recovered from any training image in (a), we can synthesize basis images at the rotated pose, as shown in (b). As a comparison, (c) shows the ground truth of the basis images at the rotated pose generated from the 3D scan.
For the CMU-PIE [11] database, we used the images of each subject at the
frontal pose (c27) as the training set. One hundred 3D face models from
Vetter's database [10] were used as the bootstrap set. The training images
were first rescaled to the size of the images in the bootstrap set. The
statistics of the harmonic basis images were then learned from the bootstrap set
and the basis images
for each training subject were recovered. Figure 5 shows two examples of the recovered basis images from the single training image, with the first column showing the training images
and the remaining 9 columns showing the
reconstructed basis images.
Figure 5: The first column shows the training images for two subjects in the CMU-PIE database and the remaining nine columns show the reconstructed basis images.
4.2. Recognition
For recognition, we follow a simple yet effective algorithm given in [4]. The test face is identified as the subject for which there exists a
weighted combination of basis images that is the closest to the test image. Let
be the set of basis images at the frontal
pose, with size
,
where
is the number of pixels in the image and
is the number of basis images used. Every
column of
contains one spherical harmonic image. These
images form a basis for the linear subspace, though not an orthonormal one. A QR
decomposition is applied to compute
,
an
matrix with orthonormal columns, such that
, where
is a
upper triangular matrix.
For a vectorized test image
at an arbitrary pose, let
be the set of basis images at that pose. The
orthonormal basis
of the space spanned by
can be computed by QR
decomposition. The matching score is defined
as the distance from
to the space spanned
by
.
However, this algorithm is not efficient for handling pose variation because the
set of basis images
has to be generated for each subject at the
arbitrary pose of a test image.
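The projection-based matching just described can be sketched as follows, assuming each subject's basis images are stored as the columns of a matrix and the test image is vectorized to the same length. The names `matching_score`, `identify`, and `gallery` are illustrative assumptions rather than identifiers from the original system.

```python
import numpy as np

def matching_score(test_image, basis):
    """Distance from the test image to the subspace spanned by the basis columns."""
    Q, _ = np.linalg.qr(basis)                 # reduced QR: Q has orthonormal columns
    residual = test_image - Q @ (Q.T @ test_image)
    return np.linalg.norm(residual)

def identify(test_image, gallery):
    """Return the subject with the smallest matching score, plus all scores.

    gallery: dict mapping a subject id to its (d, r) matrix of basis images.
    """
    scores = {sid: matching_score(test_image, B) for sid, B in gallery.items()}
    return min(scores, key=scores.get), scores

# Hypothetical usage with random matrices standing in for real basis images.
rng = np.random.default_rng(2)
d = 50
gallery = {sid: rng.normal(size=(d, 9)) for sid in ("s1", "s2", "s3")}
test = gallery["s2"] @ rng.normal(size=9) + 1e-3 * rng.normal(size=d)
best, scores = identify(test, gallery)
```

In this sketch the QR factorization is recomputed for every comparison for clarity; in practice the orthonormal bases of the gallery subjects could be computed once and cached.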
We propose to warp the test image
at the arbitrary (rotated) pose to its
frontal view image
to perform recognition. In order to warp
to
,
we have to find the point correspondence between these two images, which can be
embedded in a sparse
warping matrix
,
that is,
. The positions of the nonzero elements in
encode the 1-to-1 and many-to-1 correspondence
cases (the 1-to-many case is the same as the 1-to-1 case for pixels in
) between
and
,
and the positions of zeros on the diagonal line of
encode the no-correspondence case. More
specifically, if pixel
(the
th element in vector
) corresponds to pixel
(the
th element in vector
), then
.
There may be cases in which there is more than one pixel in
corresponding to the same pixel
,
that is, there is more than one 1 in the
th row of
,
and the column indices of these
's are the corresponding pixel indices in
.
For this case, although there are several pixels in
mapping to the same pixel
,
it can only have one reasonable intensity value. We compute a single
“virtual” corresponding pixel in
for
as the centroid of
's real corresponding pixels in
,
and assign it the average intensity. The weight for each real corresponding
pixel
is proportional to the inverse of its distance
to the centroid, and this weight is assigned as the value of
.
If there is no correspondence in
for
which is in the valid facial area and should
have a corresponding point in
,
it means that
.
This often happens when the corresponding “pixel” of
falls in the subpixel region. Thus,
interpolation is needed to fill the intensity for
.
Barycentric coordinates [26] are calculated with the pixels which have real
corresponding integer pixels in
as the triangle vertices. These Barycentric
coordinates are assigned as the values of
,
where
is the column index for each vertex of the
triangle.
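The way the warping matrix collects the three correspondence cases can be sketched as below. The sketch uses a dense matrix for readability, whereas a sparse format would be used at real image sizes. Normalizing the many-to-1 weights to sum to one, the small `eps` guard, and the helper names `many_to_one_weights`, `barycentric_weights`, and `build_warp` are assumptions of this sketch.

```python
import numpy as np

def many_to_one_weights(src_xy, eps=1e-9):
    """Weights proportional to the inverse distance of each source pixel to the centroid.

    Normalization to unit sum and the eps guard are assumptions of this sketch.
    """
    src_xy = np.asarray(src_xy, dtype=float)
    centroid = src_xy.mean(axis=0)
    inv = 1.0 / (np.linalg.norm(src_xy - centroid, axis=1) + eps)
    return inv / inv.sum()

def barycentric_weights(p, tri):
    """Barycentric coordinates of 2D point p with respect to triangle tri (3 vertices)."""
    a, b, c = np.asarray(tri, dtype=float)
    T = np.column_stack((a - c, b - c))
    w_ab = np.linalg.solve(T, np.asarray(p, dtype=float) - c)
    return np.append(w_ab, 1.0 - w_ab.sum())

def build_warp(n_front, n_rot, rows):
    """Assemble the warping matrix from (front_index, rot_indices, weights) triples.

    Rows that are never listed stay all zero (the no-correspondence case).
    """
    K = np.zeros((n_front, n_rot))
    for i, cols, weights in rows:
        K[i, cols] = weights
    return K

# Hypothetical example: pixel 0 is 1-to-1, pixel 1 is many-to-1, pixel 2 has no match.
w = many_to_one_weights([(0, 0), (2, 0)])
K = build_warp(3, 3, [(0, [0], [1.0]), (1, [1, 2], w)])
frontal = K @ np.array([10.0, 20.0, 40.0])
bary = barycentric_weights((1 / 3, 1 / 3), [(0, 0), (1, 0), (0, 1)])
```

For an interpolated pixel, the three barycentric weights would be written into the row of that pixel at the column indices of the triangle vertices, in the same way as the many-to-1 weights above.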
We now have the warping matrix
which encodes the correspondence and
interpolation information in order to generate
from
.
It provides a very convenient tool to analyze the impact of some empirical
factors in image warping. Note that due to self-occlusion,
does not cover the whole area, but only a
subregion, of the full frontal face of the subject it belongs to. The missing
facial region due to the rotated pose is filled with zeros in
.
Assume that
is the basis images for the full frontal view
training images and
is its orthonormal basis, and let
be the corresponding basis images of
and
its orthonormal basis. In
,
the rows corresponding to the valid facial pixels in
form a submatrix of the rows in
corresponding to the valid facial pixels in
the full frontal face images. For recognition, we cannot directly use the
orthonormal columns in
because it is not guaranteed that all the columns
in
are still orthonormal.
We study the relationship between the matching score
for the rotated view
and the matching score for the frontal view
.
Let subject
be the one that has the minimum matching score
at the rotated pose, that is,
, for all
,
where
is the number of training subjects. If
is the correct subject for the test image
,
warping
to
undertakes the same warping matrix
as warping
to
,
that is, the matching score for the frontal view
. Note here that we only consider the correspondence and interpolation
issues. Due to the orthonormality of the transformation matrices as shown in (3) and (4), the linear transformation from
to
does not affect the matching score. For all
the other subjects
,
the warping matrix
for
is different from that for
,
that is,
. We will show that warping
to
does not deteriorate the recognition
performance, that is, given
,
we have
.
In terms of
,
we consider the following cases.
Case 1.
,
where
is the
-rank identity matrix. It means that
is a diagonal matrix and the first
elements on the diagonal line are 1, all the
rest are zeros.
This is the case when
is at the frontal pose. The difference between
and
is that there are more missing (nonvalid)
facial pixels in
than in
,
and all the valid facial pixels in
are packed in the first
elements. Since
and
are at the same pose,
and
are also at the same pose. In this case, for
subject
,
the missing (nonvalid) facial pixels in
are at the same locations as in
since they have the same warping matrix
.
On the other hand, for any other subject
,
the missing (nonvalid) facial pixels in
are not at the same locations as in
since
.
Apparently the 0's and 1's on the diagonal line of
have different positions from those of
,
thus
has more 0's on the diagonal line than
.
Assume
and
,
where
is a
matrix. Similarly, let
,
where
is a
vector. Then
,
,
and 

.
Therefore,
.
Similarly,
, where
is also a
matrix that might contain rows with all 0's, depending on the locations of the 0's on the diagonal line of
.
We have
. Thus,
.
If
has rows with all 0's in the first
rows, these rows will have
's at the diagonal positions for
,
which will increase the matching score
.
Therefore,
.
Case 2.
is a diagonal matrix with rank
,
however, the
1's are not necessarily the first
elements on the diagonal line.
We can use some elementary transformation to reduce
this case to the previous case. That is, there exists an orthonormal matrix
,
such that
.
Let
and
.
Then
(6)
Note that an elementary transformation does not change the norm. Hence, it reduces
to the previous case. Similarly, we have that
stays the same as in Case 1. Therefore,
still holds.
In the general case,
's in
can be off-diagonal. This means that
and
are at different poses. There are three
subcases that we need to discuss for a general
.
Case 3.1.
1-to-1 correspondence between
and
.
If pixel
has only one corresponding point in
,
denoted as
,
then
and there are no other 1's in the
th row and the
th column in
.
Suppose there are only
columns of the matrix
containing 1.
Then, by appropriate elementary transformation again, we can left multiply and
right multiply
by orthonormal transformation matrices,
and
,
respectively, such that
.
If we define
and
,
then
(7)
Under
,
it reduces to Case 2, which can be further reduced to Case 1 by the
aforementioned technique. Similarly, we have that
stays the same as in Case 2. Therefore,
still holds.
In all the cases discussed up to now, the
correspondence between
and
is 1-to-1 mapping. For such cases, the following lemma
shows that the matching score stays the same before and after the warping.
Lemma 1.
Given that the correspondence between a rotated test image
and its geometrically synthesized frontal view
image
is a 1-to-1 mapping, the matching score
of
based on the basis images
at that pose is the same as the matching score
of
based on the basis images
.
Let
be the transpose of the combined coefficient
matrices in (3) and (4), we have
by QR
decomposition, where
is the warping matrix from
to
with only 1-to-1 mapping. Applying QR
decomposition again to
,
we have
,
where
is an orthonormal matrix and
is an upper triangular matrix. We now have
with
.
Since
is the product of two orthonormal matrices,
forms a valid orthonormal basis for
.
Hence the matching score is 

.
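The invariance stated in Lemma 1 can be checked numerically. The following is a minimal sketch, assuming that the matching score is the norm of the residual left after projecting the image onto the span of the basis images, and that a 1-to-1 warping acts as a permutation matrix. The array sizes, the random stand-in data, and the names `projection_residual`, `basis_rot`, and `warp` are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def projection_residual(image, basis):
    """Norm of the part of `image` that lies outside span(basis)."""
    q, _ = np.linalg.qr(basis)               # orthonormal basis of the column space
    return np.linalg.norm(image - q @ (q.T @ image))

rng = np.random.default_rng(0)
d = 200                                       # number of pixels (hypothetical)
basis_rot = rng.standard_normal((d, 9))       # stand-in for basis images at the rotated pose
image_rot = basis_rot @ rng.standard_normal(9) + 0.05 * rng.standard_normal(d)

# A 1-to-1 pixel correspondence is a permutation: one 1 in every row and column.
perm = rng.permutation(d)
warp = np.eye(d)[perm]

score_rotated = projection_residual(image_rot, basis_rot)
score_frontal = projection_residual(warp @ image_rot, warp @ basis_rot)
```

Because a permutation matrix is orthonormal, the two residuals agree up to floating-point error, which matches the argument in the proof. The sketch does not cover the many-to-1 or missing-correspondence cases.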
If the correspondence between
and
is not a 1-to-1 mapping, we have the following two cases.
Case 4.
Many-to-1 correspondence between
and
.
Case 5.
There is no correspondence for
in
.
For Cases 4 and 5, since the 1-to-1 correspondence assumption no longer holds,
the relationship between
and
is more complex. This is due to the effects of
foreshortening and interpolation. Foreshortening leads to larger contributions
to recognition in the rotated view but smaller ones in the frontal view (or
vice versa). The information gained (or lost)
through interpolation, and the weight assigned to each interpolated
pixel, are not guaranteed to be the same as before the warping. Therefore,
the relationship between
and
relies on each specific
,
which may vary significantly depending on the variation of the head pose.
Instead of theoretical analysis, the empirical error bound between
and
is sought to give a general idea of how the
warping affects the matching scores. We conducted experiments using Vetter's
database. For the fifty subjects that were not used in the bootstrap set, we generated images at various poses and obtained
their basis images at each pose. For each pose,
and
are compared, and the mean of the relative
error and the relative standard deviation for some poses are listed in Table 1.
The experimental results show that, although
and
are not exactly the same, the difference
between
and
is very small. We examined the ranking of the
matching scores before and after the warping. Table 2 shows the percentage of cases in which
the top-ranked pick before the warping remains top-ranked after the warping.
Thus, warping the test image
to its frontal view image
does not reduce the recognition
performance. We now have a very efficient solution for face recognition to
handle both pose and illumination variations as only one image
needs to be synthesized.
Now, the only remaining problem is that the
correspondence between
and
has to be built. Although a necessary
component of the system, finding correspondence is not the main focus of this
paper. Like most approaches that handle pose variations, we use
sparse main facial features to build the dense cross-pose or
cross-subject correspondence [9]. Some automatic facial feature detection/selection
techniques are available, but most of them are not robust enough to reliably
detect the facial features in images at arbitrary poses taken under
arbitrary lighting conditions. For now, we manually pick sixty-three designated
feature points (eyebrows, eyes, nose, mouth, and the face contour) on
at the arbitrary pose. An average face
calculated from training images at the frontal pose and the corresponding
feature points were used to help build the correspondence between
and
.
Triangular meshes on both faces were constructed and barycentric interpolation
inside each triangle was used to find the dense correspondence, as shown in
Figure 6. The number of feature points needed in our approach
is comparable to the 56 manually picked feature points in [9] to deform the 3D model.
Figure 6: Building dense
correspondence between the rotated view and the frontal view using sparse
features. The first and second images show the sparse features and the
constructed meshes on the mean face at the frontal pose. The third and fourth
images show the picked features and the constructed meshes on the given test
image at the rotated pose.
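The barycentric step used to densify the sparse correspondence can be sketched as follows. This is a minimal illustration, assuming that each pixel has already been assigned to a triangle of the mesh and that matching triangles list their vertices in the same order. The triangle coordinates and the function names `barycentric_coords` and `map_point` are hypothetical.

```python
import numpy as np

def barycentric_coords(p, tri):
    """Barycentric coordinates of the 2D point p in triangle tri (3 x 2 array)."""
    a, b, c = tri
    edges = np.column_stack((b - a, c - a))   # 2 x 2 matrix of edge vectors
    u, v = np.linalg.solve(edges, p - a)
    return np.array([1.0 - u - v, u, v])

def map_point(p, tri_src, tri_dst):
    """Carry p from the source triangle to the same barycentric position in the destination."""
    weights = barycentric_coords(p, tri_src)
    return weights @ tri_dst

# One triangle on the frontal mean face and its counterpart on the rotated test image.
tri_frontal = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
tri_rotated = np.array([[1.0, 1.0], [3.0, 1.0], [1.0, 5.0]])

p_frontal = np.array([1.0, 1.0])
p_rotated = map_point(p_frontal, tri_frontal, tri_rotated)   # -> [1.5, 2.0]
```

Applying `map_point` to every pixel of every triangle yields a dense correspondence map. In practice the mapped location is generally not an integer pixel position, so the intensity has to be interpolated, which is one source of the interpolation effects discussed for Cases 4 and 5.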
4.3. View Synthesis
To verify the
recognition results, the user is given the option to visually compare the
chosen subject and the test image
by generating the face image of the chosen
subject at the same pose and under the same illumination condition as
.
The desired
-dimensional vectorized image
can be synthesized easily as long as we can
generate the basis images
of the chosen subject at that pose by using
.
Assuming that the correspondence between
and the frontal pose image has been built as
described in Section 4.2, then
can be generated from the basis images
of the chosen subject at the frontal pose
using (3) and (4), given that the pose
of
can be estimated as described later. We also
need to estimate the 9-dimensional lighting coefficient vector
.
Assuming that the chosen subject is the correct one, that is,
,
we have
by substituting
into
.
Recalling that
,
we have
and then
due to the orthonormality of
.
Therefore,
.
Having both
and
available, we are ready to generate the face
image of the chosen subject at the same pose and under the same illumination
condition as
using
.
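The lighting estimation and re-synthesis step can be sketched numerically. The sketch assumes the linear image model in which the vectorized image equals the pose-specific basis matrix times the 9-dimensional lighting vector, and it recovers the lighting vector through a QR factorization of the basis matrix. The random stand-in data and the names `basis_pose`, `light_true`, and `light_est` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 500                                        # number of pixels (hypothetical)
basis_pose = rng.standard_normal((d, 9))       # stand-in for basis images at the test pose
light_true = rng.standard_normal(9)            # unknown lighting of the test image
test_image = basis_pose @ light_true

# Thin QR: q has orthonormal columns (d x 9), r is upper triangular (9 x 9).
q, r = np.linalg.qr(basis_pose)

# Multiplying by q.T removes q because of orthonormality; solving with r gives the lighting.
light_est = np.linalg.solve(r, q.T @ test_image)

# Re-synthesize the image of the chosen subject at this pose and lighting.
synthesized = basis_pose @ light_est
```

With noise-free data the estimate reproduces the lighting vector exactly. For a real test image the same expression gives the least-squares lighting estimate, and the synthesized image is the projection of the test image onto the subject's basis images.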
The only unknown to be estimated is the pose
of
,
which is needed in (3) and (4).
Estimating head pose from a single face image is an
active research topic in computer vision. Either a generic 3D face model or
several main facial features are utilized to estimate the head pose. Since we
already have the feature points to build the correspondence across views, it is
natural to use these feature points for pose estimation. In [27], five main facial feature points (four eye corners
and the tip of the nose) are used to estimate the 3D head orientation. The
approach employs the projective invariance of the cross-ratios of the eye corners
and anthropometric statistics to determine the head yaw, roll and pitch angles.
The focal length
has to be assumed known, but it is not always
available for an uncontrolled test image. We take advantage of the fact that the
facial features on the frontal view mean face are available, and show how to
estimate the head pose without knowing
.
All notations follow those in [27].
Let
be the image coordinates of the four eye
corners, and
and
denote the width of the eyes and half of the
distance between the two inner eye corners, respectively. From the well-known
projective invariance of the cross ratios we have
which yields
,
where
.
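The projective invariance used here can be illustrated with a small numerical check. The sketch below assumes four collinear eye corners, a rotation about the vertical axis, and a pinhole camera. The cross-ratio convention, the eye-corner offsets, the depth, and the focal length are illustrative assumptions and may differ from the exact form used in [27].

```python
import numpy as np

def cross_ratio(x1, x2, x3, x4):
    """One common cross-ratio convention for four collinear points."""
    return ((x1 - x2) * (x3 - x4)) / ((x1 - x3) * (x2 - x4))

def project_eye_line(offsets, yaw_deg, depth, focal):
    """Rotate points on the eye line by a yaw angle and project them with a pinhole camera."""
    yaw = np.deg2rad(yaw_deg)
    x = offsets * np.cos(yaw)                  # horizontal position after rotation
    z = depth + offsets * np.sin(yaw)          # distance from the camera after rotation
    return focal * x / z

# Hypothetical positions of the four eye corners along the eye line.
offsets = np.array([-4.5, -1.5, 1.5, 4.5])

frontal = project_eye_line(offsets, 0.0, depth=60.0, focal=500.0)
rotated = project_eye_line(offsets, 30.0, depth=60.0, focal=500.0)

cr_model = cross_ratio(*offsets)               # 0.25 for these offsets
cr_frontal = cross_ratio(*frontal)
cr_rotated = cross_ratio(*rotated)
```

All three cross-ratios coincide even though the projected eye corners move and the focal length never enters the comparison, which is why the cross-ratio provides a constraint that does not depend on the unknown camera.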
In order to recover the yaw angle
(around the
-axis), it can be shown, as in [27], that
,
where
is the focal length and
is the solution to the equation
,
where
and
.
Assume that
is the inner corner of one of the eyes for the
frontal view mean face. With perspective projection, we have
and
.
Thus,
(8)
Then we have
,
which gives
(9)
In [27],
(the rotation angle around the
-axis) is
shown to be
with
,
where
denotes the projected length of the bridge of
the nose when it is parallel to the image plane, and
denotes the observed length of the bridge of
the nose at the unknown pitch
.
Anthropometric statistics are employed in [27] to get
.
With the facial features on the mean face at the frontal view available, we do
not need the anthropometric statistics.
is just the length between the upper midpoint
of the nose and the tip of the nose for the frontal view mean face. So we can
directly use this value and the estimated focal length
in (8) to get the pitch angle
.
The head pose estimation algorithm is tested on both
synthetic and real images. For synthetic images, we use Vetter's 3D face
database. The 3D face model for each subject is rotated to the desired angle
and projected onto the 2D image plane. Four eye corners and the tip of the nose are
used to estimate the head pose. The mean and standard deviation of the
estimated poses are listed in Table 3. For real images, we use the CMU-PIE database. The
ground truth of the head pose can be obtained from the available 3D locations
of the head and the cameras. The experiments are conducted for all 68 subjects
in the CMU-PIE database at six different poses, illustrated in Figure 7 with the ground truth of the pose shown beside each
pose index. The mean and standard deviation of the estimated poses are listed
in Table 4. Overall, the pose estimation results are satisfactory,
and we believe that the relatively large standard deviation is due to the error
in selecting the facial features.
Table 3: The mean and standard deviation (std) of the estimated pose for images from Vetter's
database.
Table 4: The mean and standard deviation (std) of the estimated pose for images from the CMU-PIE
database.
Figure 7: An illustration
of the pose variation in part of the CMU-PIE database, with the ground truth of
the pose shown beside each pose index. Four of the cameras (c05, c11, c29, and
c37) sweep horizontally, and the other two are above (c09) and below (c07) the
central camera, respectively.
Having the head pose estimated, we can now perform the
face synthesis. Figure 8 shows the comparison of the given test image
and some synthesized face images at the same
pose as
from the chosen subject, where Figure 8(a) is for the
synthetic images in Vetter's 3D database and Figure 8(b) is for the real images in the
CMU-PIE database. Column one shows the training images. Column two shows the
synthesized images at the same pose as
by direct warping. Column three shows the
synthesized images using the basis images
from the chosen subject and the illumination
coefficients
of the training images. A noticeable
difference between columns two and three is the lighting change. By direct
warping, we obtain the synthesized images by not only rotating the head pose,
but also rotating the lighting direction at the same time. By using
,
we only rotate the head pose to get the synthesized images, while the lighting
condition stays the same as in the training images. Column four shows the synthesized
images using the basis images
from the chosen subject and the same
illumination coefficients
of
.
As a comparison, column five shows the given test image
.
Overall, the columns from left to right in Figure 8 show the progression from the training images
to the given test images.
Figure 8: View synthesis results with
different lighting conditions for (a) synthetic images from Vetter's 3D database and (b) real images
in the CMU-PIE database. Columns from left to right show the training images,
the synthesized images at the same pose as the test images using direct
warping (both the head pose and the lighting direction are rotated), the synthesized images at the same pose as the test images from

(the basis images of
the chosen subject) and

(the illumination coefficients of the training images),
the synthesized images at the same pose as the test images from

and

(the illumination coefficients of the given test images),
and the given test images

.
5. Recognition Results
We first
conducted recognition experiments on Vetter's 3D face model database. There are
a total of one hundred 3D face models in the database, of which fifty were used
as the bootstrap set and the other fifty were used to generate training images.
We synthesized the training images under a wide variety of illumination
conditions using the 3D scans of the subjects. For each subject, only one
frontal view image was stored as the training image and used to recover the
basis images
using the algorithm in Section 4.1. We generated the test images at different poses by
rotating the 3D scans and illuminated them with various lighting conditions
(represented by the slant angle
and tilt angle
). Some examples are shown in Figures 9(a), 9(b), 9(e), and 9(f). For a test image
at an arbitrary pose, the frontal view image
was synthesized by warping
,
as shown in Figures 9(c), 9(d), 9(g), and 9(h).
Figure 9: (a) shows the
test images of a subject at azimuth

under different lighting conditions (

;

;

;

;

;

from left to right). The test images of the
same subject under some extreme lighting conditions (

;

;

;

from left to right) are shown in (b). (c) and
(d) show the generated frontal pose images from the test images in (a) and (b),
respectively. The test images at another pose (with

and

) of the same subject are shown in (e) and
(f), with the generated frontal pose images shown in (g) and (h),
respectively.
The
recognition score was computed as
where
is the orthonormal basis of the space spanned
by
.
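The recognition rule can be sketched as a nearest-subspace search. This is a minimal sketch, assuming that the score of a subject is the norm of the residual left after projecting the (warped) test image onto the orthonormal basis of that subject's basis images, and that the subject with the smallest score is chosen. The gallery contents, the sizes, and the names `subspace_distance` and `recognize` are hypothetical.

```python
import numpy as np

def subspace_distance(image, basis):
    """Distance from `image` to the space spanned by the columns of `basis`."""
    q, _ = np.linalg.qr(basis)                 # orthonormal basis of the span
    return np.linalg.norm(image - q @ (q.T @ image))

def recognize(image, gallery):
    """Return the identity with the smallest distance, plus all scores."""
    scores = {name: subspace_distance(image, basis) for name, basis in gallery.items()}
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(2)
d = 300                                        # number of pixels (hypothetical)

# One set of nine frontal basis images per enrolled subject (random stand-ins).
gallery = {f"subject_{k}": rng.standard_normal((d, 9)) for k in range(5)}

# A probe lit by an arbitrary lighting vector, with a little noise added.
probe = gallery["subject_3"] @ rng.standard_normal(9) + 0.01 * rng.standard_normal(d)

best, scores = recognize(probe, gallery)
```

Only the probe has to be warped to the frontal view. The orthonormal bases of the gallery subjects can be computed once and reused for every test image.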
As a benchmark, the first column (f2f) of Table 5 lists the recognition rates when both the testing
images and the training images are from the frontal view. The correct
recognition rates using the proposed method are listed in columns (r2f) of
Table 5. As a comparison, we also conducted the recognition
experiment on the same test images assuming that the training images at the
same pose are available. By recovering the basis images
at that pose using the algorithm in Section 4.1 and computing
,
we achieved the recognition rates as shown in columns (r2r) of Table 5. As we can see, the recognition rates using our
approach (r2f) are comparable to those when the training images at the rotated
pose are available (r2r). The last two rows of Table 5 show the mean and standard deviation of the recognition
rates for each pose under various illumination conditions. We believe that
the relatively large standard deviation is due to the images under some extreme
lighting conditions, as shown in Figures 9(b) and 9(f).
Table 5: The correct
recognition rates at two rotated poses under various lighting conditions for
synthetic images generated from Vetter's 3D face model database.
We also conducted experiments on real images from the
CMU-PIE database. For testing, we used images at six different poses, as shown
in the first and third rows of Figure 10, and under twenty-one different illuminations.
Examples of the generated frontal view images are shown in the second and
fourth rows of Figure 10.
Figure 10: The first and
third rows show the test images of two subjects in the CMU-PIE database at six
different poses, with the pose numbers shown above each column. The second and
fourth rows show the corresponding frontal view images generated by directly
warping the given test images.
Similar to Table 5, Table 6 lists the correct recognition rates under all these
poses and illumination conditions, where column (f2f) is the frontal view testing
image against frontal view training images, columns (r2r) are the rotated testing
image against the same pose training images, and columns (r2f) are the rotated
testing image against the frontal view training images. The last two rows of Table 6 show the mean and standard deviation of the
recognition rates for each pose under various illumination conditions. As we
can see, the recognition rates using our approach are comparable to, and even slightly better than, those obtained when
the training images at the rotated pose are available. The reason is that the
training images of different subjects at the same nominal rotated pose are actually at slightly different poses. Therefore, the
2D-3D registration of the training images and the bootstrap 3D face models is
not perfect, producing slightly worse basis image recovery than in the frontal pose case.
Table 6: The correct
recognition rates at six rotated poses under various lighting conditions for 68
subjects in the CMU-PIE database.
Note that although color basis images
are recovered for visualization purposes, all the recognition experiments are
performed on grayscale images for speed. We are
investigating how color information affects the recognition performance.
6. Discussions and Conclusion
We have
presented an efficient face synthesis and recognition method to handle
arbitrary pose and illumination from a single training image per subject using
pose-encoded spherical harmonics. Using a prebuilt 3D face bootstrap set, we
apply a statistical learning method to obtain the spherical harmonic basis
images from a single training image. For a test image at a different pose from
the training images, we accomplish recognition by comparing the distance
from a warped version of the test image to the space spanned by the basis
images of each subject. The impact of some empirical factors (i.e.,
correspondence and interpolation) due to warping is embedded in a sparse
transformation matrix, and we show that the recognition performance is not
significantly affected after warping the test image to the frontal view.
Experimental results on both synthetic and real images show that high
recognition rates can be achieved when the test image is at a different pose and
under an arbitrary illumination condition. Furthermore, the recognition results
can be visually verified by easily generating a face image of the chosen subject
at the same pose as the test image.
In scenarios where only one training image is
available, finding the cross-correspondence between the training images and the
test image is unavoidable. Automatic correspondence establishment remains a
challenging problem. Recently, promising results have been shown by using the 4
planes, 4 transitions stereo matching algorithm described in [28]. The disparity map can be reliably built for a pair
of images of the same person taken under the same lighting condition, even with
some occlusions. We conducted some experiments using this technique on both
synthetic and real images. Reasonably good correspondence maps were achieved,
even for cross-subject images. This technique has been used for 2D face
recognition across pose [29]. However, like all other stereo methods, it requires the
intensity-invariance condition, which does not hold if the images
are taken under different lighting conditions. For our challenging face
recognition application, the lighting condition of the test image is
unconstrained. Therefore, this stereo method currently cannot be directly used
to build the correspondence between
and
.
Further investigation of dense stereo that compensates for illumination
variations is under way.
Appendix
Assume that
and
are the surface normals of point
at the frontal pose and the rotated view,
respectively.
is related to
as
(A.1)
where
is the azimuth angle.
By replacing
in (A.1) with
,
and assuming that the correspondence between the rotated view and the frontal
view has been built, we have
(A.2)
Rearranging, we get
(A.3)
As shown in (A.3),
and
are linear combinations of basis images at the
frontal pose. For
,
and
,
we need to have
which is not known. From [4], we know that if the sphere is illuminated by a
single directional source in a direction other than the
direction, the reflectance obtained would be
identical to the kernel, but shifted in phase. Shifting the phase of a function
distributes its energy between the harmonics of the same order
(varying
), but the overall energy in each order
is maintained. The quality of the
approximation, therefore, remains the same. This can be verified by
for the order
.
Noticing that
,
we still need
to preserve the energy for the order
.
Let
and
,
we have
(A.4)
Then
(A.5)
Having
and
,
we get
(A.6)
and then
.
Two possible roots of the polynomial are
or
.
Substituting
into (A.4) gives
,
,
,
which is apparently incorrect. Therefore, we have
and
.
Substituting them into (A.4), we get
(A.7)
Using (A.3) and (A.7), we can write the basis images at the rotated pose
in matrix form in terms of the basis images at the frontal pose, as shown in (3).
Assuming that there is an elevation angle
after the azimuth angle
and denoting by
the surface normal for the new rotated view,
we have
(A.8)
Repeating the
above derivation easily leads to the linear equations in (4), which relate the basis images at the new
rotated pose to the basis images at the old rotated pose.
Next, we show that the proved proposition is
consistent with the general rotation matrix of spherical harmonics. If we use a
formulation for the general rotation, we have
,
the dependence of
on
and
is simple,
where
is a matrix that defines how a spherical
harmonic transforms under rotation about the
-axis. We can further decompose it into a
rotation of
about the
-axis, a general rotation
about the
-axis followed finally by a rotation of
about the
-axis [30]. Since
(A.9)
(A.10)
it is easy to show that
is exactly the same as shown in (3) by substituting the above equations into
and reorganizing the order of the spherical
harmonics
.
Since (4) is derived similarly to (3), the rotation around the
-axis can be proved to be the same as (4). This can also be verified by substituting the rotation
angle
into (4), which gives the same
as shown above.
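The two properties used throughout the derivation, namely that the harmonics at a rotated pose are a linear transformation of those at the frontal pose and that this transformation is orthonormal and preserves the energy within each order, can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the standard real spherical harmonics of orders 0 to 2 evaluated at unit normals, omits albedo and the per-order Lambertian kernel constants, and chooses its own ordering of the nine harmonics and sign convention for the rotation, which may differ from those in (3). The matrix is fitted by least squares rather than taken from the closed form.

```python
import numpy as np

def harmonics(n):
    """First nine real spherical harmonics at unit normals n (N x 3), grouped by order."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    c0 = 0.5 * np.sqrt(1.0 / np.pi)
    c1 = 0.5 * np.sqrt(3.0 / np.pi)
    c2 = 0.5 * np.sqrt(15.0 / np.pi)
    c3 = 0.25 * np.sqrt(5.0 / np.pi)
    return np.column_stack((
        c0 * np.ones_like(x),                              # order 0
        c1 * z, c1 * x, c1 * y,                            # order 1
        c3 * (3.0 * z**2 - 1.0), c2 * x * z, c2 * y * z,   # order 2
        0.5 * c2 * (x**2 - y**2), c2 * x * y,
    ))

rng = np.random.default_rng(3)
normals = rng.standard_normal((2000, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

# Rotation about the vertical (y) axis by an azimuth angle (hypothetical value).
theta = np.deg2rad(25.0)
rot_y = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
rotated = normals @ rot_y.T

h_front = harmonics(normals)
h_rot = harmonics(rotated)

# Fit the 9 x 9 matrix that maps frontal harmonics to rotated harmonics.
m_t, *_ = np.linalg.lstsq(h_front, h_rot, rcond=None)
transform = m_t.T                                          # h_rot[i] = transform @ h_front[i]
fit_error = np.abs(h_front @ m_t - h_rot).max()

# Energy of the order-2 harmonics at each rotated normal.
energy_order2 = (h_rot[:, 4:] ** 2).sum(axis=1)
```

The fit is exact up to rounding, the fitted matrix is orthonormal, and it is block diagonal with blocks of size 1, 3, and 5, so harmonics of different orders do not mix. The order-2 energy equals 5/(4π) at every normal, which is the energy preservation within an order that the derivation relies on.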
Acknowledgment
This work is partially supported by a contract from UNISYS.
References
- W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, “Face recognition: a literature survey,” ACM Computing Surveys, vol. 35, no. 4, pp. 399–458, 2003.
- V. Blanz and T. Vetter, “Face recognition based on fitting a 3D morphable model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063–1074, 2003.
- L. Zhang and D. Samaras, “Face recognition from a single training image under arbitrary unknown lighting using spherical harmonics,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 351–363, 2006.
- R. Basri and D. W. Jacobs, “Lambertian reflectance and linear subspaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 218–233, 2003.
- R. Ramamoorthi, “Analytic PCA construction for theoretical analysis of lighting variability in images of a Lambertian object,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 10, pp. 1322–1333, 2002.
- Y. Tanabe, T. Inui, and Y. Onodera, Group Theory and Its Applications in Physics, Springer, Berlin, Germany, 1990.
- R. Ramamoorthi and P. Hanrahan, “A signal-processing framework for reflection,” ACM Transactions on Graphics (TOG), vol. 23, no. 4, pp. 1004–1042, 2004.
- Z. Yue, W. Zhao, and R. Chellappa, “Pose-encoded spherical harmonics for robust face recognition using a single image,” in Proceedings of the 2nd International Workshop on Analysis and Modelling of Faces and Gestures (AMFG '05), vol. 3723, pp. 229–243, Beijing, China, October 2005.
- L. Zhang and D. Samaras, “Face recognition under variable lighting using harmonic image exemplars,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '03), vol. 1, pp. 19–25, Madison, Wis, USA, June 2003.
- “3dfs-100 3 dimensional face space library (2002 3rd version),” University of Freiburg, Germany.
- T. Sim, S. Baker, and M. Bsat, “The CMU pose, illumination, and expression (PIE) database,” in Proceedings of the 5th IEEE International Conference on Automatic Face and Gesture Recognition (AFGR '02), pp. 46–51, Washington, DC, USA, May 2002.
- P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
- T. Sim and T. Kanade, “Illuminating the face,” Tech. Rep. CMU-RI-TR-01-31, Robotics Institute, Carnegie Mellon University, Pittsburgh, Pa, USA, 2001.
- D. Beymer, “Face recognition under varying pose,” Tech. Rep. 1461, MIT AI Lab, Cambridge, Mass, USA, 1993.
- A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenspaces for face recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '94), pp. 84–91, Seattle, Wash, USA, June 1994.
- W. T. Freeman and J. B. Tenenbaum, “Learning bilinear models for two-factor problems in vision,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '97), pp. 554–560, San Juan, Puerto Rico, USA, June 1997.
- A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “Illumination-based image synthesis: creating novel images of human faces under differing pose and lighting,” in Proceedings of the IEEE Workshop on Multi-View Modeling and Analysis of Visual Scenes (MVIEW '99), pp. 47–54, Fort Collins, Colo, USA, June 1999.
- W. Zhao and R. Chellappa, “Symmetric shape-from-shading using self-ratio image,” International Journal of Computer Vision, vol. 45, no. 1, pp. 55–75, 2001.
- R. Dovgard and R. Basri, “Statistical symmetric shape from shading for 3D structure recovery of faces,” in Proceedings of the 8th European Conference on Computer Vision (ECCV '04), pp. 99–113, Prague, Czech Republic, May 2004.
- W. Zhao and R. Chellappa, “SFS based view synthesis for robust face recognition,” in Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition (AFGR '00), pp. 285–292, Grenoble, France, March 2000.
- Z. Yue and R. Chellappa, “Pose-normalized view synthesis of a symmetric object using a single image,” in Proceedings of the 6th Asian Conference on Computer Vision (ACCV '04), pp. 915–920, Jeju City, Korea, January 2004.
- S. K. Zhou, G. Aggarwal, R. Chellappa, and D. W. Jacobs, “Appearance characterization of linear lambertian objects, generalized photometric stereo, and illumination-invariant face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2, pp. 230–245, 2007.
- T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681–685, 2001.
- J. Xiao, S. Baker, I. Matthews, and T. Kanade, “Real-time combined 2D+3D active appearance models,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 535–542, Washington, DC, USA, June-July 2004.
- S. Romdhani, J. Ho, T. Vetter, and D. J. Kriegman, “Face recognition using 3-D models: pose and illumination,” Proceedings of the IEEE, vol. 94, no. 11, pp. 1977–1999, 2006.
- P. Henrici, “Barycentric formulas for interpolating trigonometric polynomials and their conjugates,” Numerische Mathematik, vol. 33, no. 2, pp. 225–234, 1979.
- T. Horprasert, Y. Yacoob, and L. S. Davis, “Computing 3-D head orientation from a monocular image sequence,” in Proceedings of the 2nd International Conference on Automatic Face and Gesture Recognition (AFGR '96), pp. 242–247, Killington, Vt, USA, October 1996.
- A. Criminisi, J. Shotton, A. Blake, C. Rother, and P. H. S. Torr, “Efficient dense stereo with occlusions for new view-synthesis by four-state dynamic programming,” International Journal of Computer Vision, vol. 71, no. 1, pp. 89–110, 2007.
- C. Castillo and D. Jacobs, “Using stereo matching for 2-D face recognition across pose,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), Minneapolis, Minn, USA, June 2007.
- R. Green, “Spherical harmonic lighting: the gritty details,” in Proceedings of the Game Developers' Conference (GDC '03), San Jose, Calif, USA, March 2003.