Abstract
We propose a fast face detector using an efficient architecture based on a hierarchical cascade
of neural network ensembles with which we achieve enhanced detection accuracy and efficiency.
First, we propose a way to form a neural network ensemble by using a number of neural network
classifiers, each of which is specialized in a subregion in the face-pattern space. These classifiers
complement each other and, together, perform the detection task. Experimental results show that the
proposed neural-network ensembles significantly improve the detection accuracy as compared to
traditional neural-network-based techniques. Second, in order to reduce the total computation cost
for the face detection, we organize the neural network ensembles in a pruning cascade. In this way,
simpler and more efficient ensembles used at earlier stages in the cascade are able to reject a majority
of nonface patterns in the image backgrounds, thereby significantly improving the overall detection
efficiency while maintaining the detection accuracy. An important advantage of the new architecture
is that it has a homogeneous structure so that it is suitable for very efficient implementation using
programmable devices. Our proposed approach achieves one of the best detection accuracies in the
literature with significantly reduced training and detection cost.
1. Introduction
Face detection from images (videos) is a crucial preprocessing step for a number of
applications, such as face identification, facial expression analysis, and face
coding [1]. Furthermore, research results in face detection can
broadly facilitate general object detection in visual scenes.
A key question in face detection is how to best discriminate
faces from nonface background images. However, for realistic situations, it is
very difficult to define a discriminating metric because human faces usually
vary strongly in their appearance due to ethnic diversity, expressions, poses,
and aging, which makes the characterization of the human face difficult.
Furthermore, environmental factors such as imaging devices and illumination can
also exert significant influences on facial appearances.
In the past decade, extensive research has been
carried out on face detection, and significant progress has been made toward
the following two performance goals.
(1) Detection accuracy: the accuracy of a face detector is usually characterized by its
receiver operating characteristic (ROC), which shows its performance as a trade-off
between the false acceptance rate and the face detection rate.
(2) Detection efficiency: the efficiency of a face detector is often characterized by its
operation speed. An efficient detector is especially important for real-time
applications (e.g., consumer applications), where the face detector is required
to process one image at a subsecond level.
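As an illustration of how such an ROC curve is traced, the following minimal Python sketch sweeps a decision threshold over classifier scores and records one (false acceptance rate, detection rate) pair per threshold. The function name `roc_points` and the toy scores are hypothetical and are not taken from our experiments.

```python
def roc_points(face_scores, nonface_scores, thresholds):
    """Return one (false_acceptance_rate, detection_rate) pair per threshold.

    A window is accepted as a face when its score exceeds the threshold.
    """
    points = []
    for t in thresholds:
        # Fraction of true faces that are accepted
        detection_rate = sum(s > t for s in face_scores) / len(face_scores)
        # Fraction of nonfaces that are wrongly accepted
        false_acceptance_rate = sum(s > t for s in nonface_scores) / len(nonface_scores)
        points.append((false_acceptance_rate, detection_rate))
    return points


# Toy scores (hypothetical): higher means "more face-like"
face_scores = [0.9, 0.8, 0.6, 0.4]
nonface_scores = [0.7, 0.3, 0.2, 0.1]
curve = roc_points(face_scores, nonface_scores, [0.25, 0.5, 0.75])
```

Raising the threshold lowers both rates, which is the trade-off that the ROC curve makes visible.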
Tremendous effort has been spent to achieve the
above-mentioned goals in face-detector design. Various techniques have been
proposed, ranging from simple heuristics-based algorithms to more advanced
algorithms based on machine learning [2].
Heuristics-based face detectors exploit empirical knowledge about face
characteristics, for instance, the skin color [3] and edges around facial
features [4].
Generally speaking, these detectors are simple, easy
to implement, and usually do not require much computation cost. However, it is
complicated to translate empirical knowledge into well-defined classification
rules. Therefore, these detectors usually have difficulty in dealing with
complex image backgrounds and varying illumination, which limits their
accuracy.
Alternatively, statistics-based face detectors have
received wider interest in recent years. These detectors implicitly distinguish
between face and nonface images by using pattern-classification techniques,
such as neural networks [5, 6]
and support vector machines [7]. The learning-based detectors generally achieve
highly accurate and robust detection performance. However, they are usually far
more computationally demanding in both training and detection.
To further reduce the computation cost, an emerging
line of work in the literature studies structured face detectors employing multiple
subdetectors. For example, in [8], a set of reduced set vectors is applied
sequentially to reject unlikely faces in order to speed up a nonlinear support
vector machine classification. In [9], the AdaBoost
algorithm is used to select a set of Haar-like feature classifiers to form a single detector. In order to
improve the overall detection speed, a set of such detectors with different
characteristics are cascaded into a chain. Detectors consisting of smaller
numbers of feature classifiers are relatively fast, and they can be used at the
first stages in the detector cascade to filter out regions that most likely do
not contain any faces. The Viola-Jones face detector in [9]
has achieved real-time processing speed with fairly robust detection accuracy.
The feature-selection (training) stage, however, can be time consuming in practice.
It is reported that several weeks are needed to
completely train a cascaded detector. Later, a number of variants of the
Viola-Jones detector have also been proposed in literature, such as the detector
with extended Haar features [10], the FloatBoost
based detector [11], and so forth.
In [12], we have proposed a heterogeneous face detector
employing three subdetectors using various image features.
In [13], hierarchical support vector machines (SVM) are
discussed, which use a combination of linear SVMs to efficiently exclude most
nonfaces in images, followed by a nonlinear SVM to further verify possible
face candidates.
Although the above techniques manage to reduce the
computation cost of traditional statistics-based detectors, the detection
accuracy of these detectors is also sacrificed. In this paper, we aim to design
a face detector with highly accurate performance, which is also computationally
efficient for embedded applications.
More specifically, we propose a high-performance face
detector built as a cascade of subdetectors, where each subdetector consists
of a neural network ensemble [14]. The ensemble technique effectively improves the
detection accuracy of a single network, leading to an overall enhanced
accuracy. We also cascade a set of different ensembles in such a way that both
detection efficiency and accuracy are optimized.
Compared to related techniques in literature, we have
the following contributions.
(1) We use an ensemble of neural networks for simultaneously improving accuracy and
architectural simplicity. We have proposed a new training paradigm to form an
ensemble of neural networks, which are subsequently used as the building blocks
of the cascaded detector. The training strategy is very effective as compared
to existing techniques and significantly improves the face-detection accuracy.
(2) We also insert this ensemble structure into the cascaded framework with scalable complexity,
which yields a significant gain in efficiency with (near) real-time detection
speed. Initial ensembles in the cascade adopt base networks that only receive a
coarse feature representation. They usually have fewer nodes and connections,
leading to simpler decision boundaries. However, since these networks can be
executed with very high efficiency, a large portion of an image containing no
faces can be quickly pruned. Subsequent ensembles adopt relatively complex
base networks, which have the capability of forming more precise decision
boundaries. These more complex ensembles are only invoked for difficult cases
that fail to be rejected by earlier ensembles in the cascade. We propose a
way to optimize the cascade structure such that the computation cost
involved can be significantly reduced while retaining overall high detection
accuracy.
(3) The proposal in this paper consists of a two-layer classifier architecture including parallel
ensembles and sequential cascade based on repetitive use of similar structures.
The result is a rather homogeneous architecture, which facilitates an efficient
implementation using programmable hardware.
Our proposed approach achieves one of the best
detection accuracies in literature, with 94% detection rate on the well-known
CMU+MIT test set and up to 5 frames/second processing speed on live videos.
The remainder of the paper is organized as follows. In
Section 2, we first explain the construction of a
neural network ensemble, which is used as the basic element in the detector
cascade. In Section 3, we formulate a cascaded detector consisting of
multiple neural network ensembles.
Section 4 analyzes the performance of the approach and
Section 5 gives the conclusions.
2. Neural Network Ensemble
In this section, we present the basic elements of our proposed architecture, which will
be reused later to constitute a complete detector cascade. We first present, in
Section 2.1, some basic design principles of our proposed neural network
ensemble. The ensemble structure and training paradigms will be presented in
Sections 2.2 and 2.3.
2.1. Basic Principles
For complex
real-world classification problems such as face detection, the usage of a
single classifier may not be sufficient to capture the complex decision
surfaces between face and nonface patterns. Therefore, it is attractive to
exploit multiple algorithms to improve the classification accuracy. In
Rowley's approach [5] for face detection, three networks with different
initial weights are trained and the final output is based on the majority
voting of these networks. The Viola-Jones detector [9]
makes use of the boosting strategy, which
sequentially trains a set of classifiers by reweighting the sample importance.
During the training of each classifier, those samples misclassified by the
current set of classifiers have higher probabilities to be selected. The final
output is based on a linearly weighted combination of the outputs from all
component classifiers.
For the aforementioned reasons, our approach starts
with an ensemble of neural network classifiers. We refer to each neural
network in the ensemble as a component network; each component network is randomly
initialized with different weights. More importantly, we manipulate the
training data such that each component network is specialized in a different
region of the training data space. Our proposed ensemble has the following new
characteristics that are different from existing approaches in literature.
(1) The component neural networks in our proposal are sequentially trained, each of which uses
training face samples that are misclassified by its previous networks. Our
approach differs from the boosting approach in that the training samples that
are already successfully classified by the current network are discarded and
not used for the later training. This gives a hard partitioning of the training
set, where each component neural network characterizes a specific subregion.
(2) The final output of the ensemble is determined by a decision neural network, which is
trained after the component networks are already constructed. This offers a
more flexible combination rule than the voting or linear weighting as used in
boosting.
The experimental evidence (Section 4.1) shows that our proposed ensemble technique
outperforms the traditional ensemble techniques in face detection.
2.2. Ensemble Architecture
We depict the structure of our proposed neural network
ensemble in Figure 1. The ensemble consists of two layers: a set of
sequentially trained component networks
, and a decision network
. The outputs of the component networks
are fed to the
decision network to give the final output. The input feature vector
is a normalized
image window of
pixels.
Figure 1: The architecture of the neural network ensemble.
(1) Component neural network
Each component
classifier
is a
multilayer feedforward neural network, which has inputs receiving certain
representations of the input feature vector
and one output
ranging from 0 to 1. The network is trained with a target output of unity
indicating a face pattern and zero otherwise. Each network has locally
connected neurons, as motivated by [5].
It is pointed out in [5] that, by incorporating heuristics of facial feature
structures in designing the local connections of the network, the network gives
much better performance (and higher efficiency) than a fully connected network.
We present here four novel base-network structures
employed in this paper: FNET-A, FNET-B, FNET-C, and FNET-D (see Figure 2), which extend the networks of [5] with scalable complexity. These networks
are used as the basic elements in the final face-detector cascade. The design
philosophy for these networks is partially based on heuristic reasoning. The
motivation behind the design is explained below.
(1) We aim at building a complexity-scalable structure for all these base networks. The
networks are constructed with similar structures.
(2) The complexity of the network is controlled by the following structural parameters: the input
resolution, the number of hidden layers, and the number of hidden units in each
layer.
(3) As Figure 2 shows, FNET-B (FNET-D) enhances FNET-A (FNET-C) by
incorporating more hidden units, which specifically aim at capturing various
facial feature structures. Similarly, FNET-C (FNET-D) enhances FNET-A (FNET-B)
by using a higher input resolution and more hidden layers.
Figure 2: Topology of four types of component networks.
In this way, we obtain a set of networks with scalable
structures and varying representation properties. In the following, we
illustrate each network in more detail.
As shown in Figure 2(a), FNET-A has a relatively simple structure with one
hidden layer. The network accepts an
grid as its
inputs, where each input element is an averaged value of a neighboring
block in the
original
input features.
FNET-A has one hidden layer with
neurons, each
of which looks at a locally neighboring
block from the
inputs.
FNET-B (see Figure 2(a)) shares the same type of inputs as FNET-A, but
with extended hidden neurons. In addition to the
hidden neurons,
additional
and
neurons are
used, each of which looks at a
(or
) block from
the inputs. These additional horizontal and vertical stripes are used to capture
corresponding facial features such as eyes, mouths, and noses.
The topology of FNET-C is depicted in Figure 2(b), which has two hidden layers with
and
hidden neurons,
respectively. The FNET-C directly receives the
input features.
In the first hidden layer, each hidden neuron takes inputs from a locally
neighboring
block of the
input layer. In the second hidden layer, each hidden neuron unit takes a
locally neighboring
block as an input
from the first hidden layer.
FNET-D (see Figure 2(b)) is an enhanced version of both FNET-B and FNET-C,
with two hidden layers and additional hidden neurons arranged in horizontal and
vertical stripes.
From FNET-A to FNET-D, the complexity of the network
is gradually increased by using a finer input representation, adding more
layers or adding more hidden units to capture more intricate facial characteristics.
Therefore, the networks have an increasing number of connections and consume
more computation power.
(2) Decision neural network
For the
decision network
(see Figure 1), we adopt a fully connected feedforward neural
network, which has one hidden layer with eight hidden units. The number of
inputs for
is determined
by the number of the component classifiers in the network ensemble. The decision
network receives the outputs from each component network
, and outputs a value
ranging from 0 to 1, which indicates the confidence that the input vector represents a face.
In other words,
(1)
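The two-layer evaluation described above can be sketched in Python as follows. This is only a minimal illustration: the class `DecisionNetwork`, the function `ensemble_output`, the random weights, and the toy component functions are all hypothetical stand-ins. Only the structure (component outputs feeding a fully connected network with one hidden layer of eight units and one output between 0 and 1) follows the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class DecisionNetwork:
    """Fully connected net: n_inputs -> 8 hidden sigmoid units -> 1 sigmoid output."""

    def __init__(self, n_inputs, n_hidden=8, seed=0):
        rng = random.Random(seed)
        # Index 0 of every weight vector is the bias term.
        # Weights are random placeholders; a real network would be trained.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def __call__(self, inputs):
        hidden = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], inputs)))
                  for w in self.w_hidden]
        return sigmoid(self.w_out[0] +
                       sum(wi * hi for wi, hi in zip(self.w_out[1:], hidden)))


def ensemble_output(components, decision, window):
    """Feed the window to every component network, then combine their outputs."""
    component_outputs = [h(window) for h in components]
    return decision(component_outputs)


# Toy component "networks" (hypothetical), each mapping a window to a value in (0, 1)
components = [
    lambda window: sigmoid(sum(window) / len(window)),
    lambda window: sigmoid(max(window)),
    lambda window: sigmoid(min(window)),
]
decision = DecisionNetwork(n_inputs=len(components))
confidence = ensemble_output(components, decision, [0.1, -0.3, 0.8, 0.2])
```

The number of inputs of the decision network equals the number of component networks, as stated above.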
In the following, we present the training paradigms
for our proposed neural network ensemble.
2.3. Training Algorithms
Since each
ensemble is a two-layer system, the training consists of the following two
stages.
(i)Sequentially,
train
component
classifiers
(
) with a feature sample
drawn from a
training data set
.
contains a face
sample set
and a nonface
sample set
.(ii)Train the decision
neural network
with samples
, where
.
Let us now
present the training algorithm for each stage in more detail.
(1) Training algorithm for component neural networks
One important characteristic of the component-network
training is that each network
is trained on a
subset
of the complete
face set
.
contains only
face samples misclassified by the previous
trained
component classifiers. More specifically, suppose the
th component
network is trained over sample set
. After the training, the network is able to correctly
classify samples
(
). The next component network (the
th network) is
then trained over sample set
. This procedure can be iteratively carried out until
all
component
networks are trained. This is also illustrated in Table 1.
Table 1: Partitioning of the training set for component networks.
In this way, each component network is trained over a
subset of the total training set and is specialized in a specific region in the
face space. For each
, the nonface samples are selected in a bootstrapping manner,
similar to the approach used in [5].
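The hard partitioning of the face set can be sketched as follows. This is a simplified illustration and not the algorithm of Table 2: `train_fn` is a hypothetical stand-in for backpropagation training, the one-dimensional "samples" are toy data, and the nonface samples are omitted.

```python
def train_components_hard_partition(face_samples, num_components, train_fn, threshold):
    """Sequentially train components; each sees only faces that all earlier ones missed."""
    remaining = list(face_samples)
    components, subsets = [], []
    for _ in range(num_components):
        if not remaining:
            break  # every face is already handled by an earlier component
        net = train_fn(remaining)
        components.append(net)
        subsets.append(list(remaining))
        # Hard partitioning: faces accepted by this network are discarded
        # and never reused for later components.
        remaining = [x for x in remaining if net(x) <= threshold]
    return components, subsets


def toy_train(samples):
    """Hypothetical trainer: the 'network' responds to samples near the smallest one."""
    center = min(samples)
    return lambda x: 1.0 if abs(x - center) <= 1.0 else 0.0


faces = [0.0, 0.5, 1.0, 5.0, 5.5, 10.0]
components, subsets = train_components_hard_partition(
    faces, num_components=3, train_fn=toy_train, threshold=0.5)
subset_sizes = [len(s) for s in subsets]
```

In this toy run the training subsets shrink from one component to the next, which is the behavior summarized in Table 1.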
According to the bootstrapping strategy, an initial
set of randomly chosen nonface samples is used, and during the training, new
false positives are iteratively added to the current nonface training set. In
this way, more difficult nonface samples are reinforced during the training
process.
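A minimal sketch of this bootstrapping loop is given below. The names `bootstrap_nonfaces` and `toy_train`, the pool of candidate nonface samples, and the stopping rule are assumptions made for illustration; the actual procedure is the one referenced in [5].

```python
def bootstrap_nonfaces(train_fn, faces, initial_nonfaces, nonface_pool,
                       threshold, max_rounds=10):
    """Retrain while adding the current false positives to the nonface training set."""
    nonfaces = list(initial_nonfaces)
    net = train_fn(faces, nonfaces)
    for _ in range(max_rounds):
        # Nonface samples that the current network wrongly accepts as faces
        false_positives = [x for x in nonface_pool
                           if net(x) > threshold and x not in nonfaces]
        if not false_positives:
            break
        nonfaces.extend(false_positives)
        net = train_fn(faces, nonfaces)
    return net, nonfaces


def toy_train(faces, nonfaces):
    """Hypothetical 1-D trainer: boundary halfway between the two classes."""
    boundary = (min(faces) + max(nonfaces)) / 2.0
    return lambda x: 1.0 if x > boundary else 0.0


net, nonfaces = bootstrap_nonfaces(
    toy_train, faces=[8, 9, 10], initial_nonfaces=[1, 2],
    nonface_pool=[1, 2, 3, 6, 7], threshold=0.5)
```

Here the first network accepts the difficult nonface samples 6 and 7; after they are added, the retrained network rejects them.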
Up to now, we have explained the training-set
selection strategy for the component networks. The actual training of each
network
is based on the
standard backpropagation algorithm [15]. The network is trained with unity for face samples
and zero for nonface samples.
During the classification,
a threshold
needs to be
chosen such that the input
is classified
as a face when
. In the following, we will elaborate on how the
combination of neural networks (
to
) can yield a
reduced classification error over the training face set.
First, we define the face-learning ratio
of the component network
as
(2)
where
denotes the
number of elements in a set. Furthermore, we define
as the fraction
of the face samples successfully classified by
with respect to the
total training face samples, given by
(3)
We can see that
(4)
(5)
By recursively
applying (5), we derive the following relation between
and
:
(6)
The
th component
classifier
thus uses a
percentage of
of all the
training samples, and
(7)
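Because the symbols of (2)-(7) are not legible in this version of the text, we give one reading of the derivation that is consistent with the prose above. The notation is assumed: F denotes the complete face training set, F_n the subset used to train the n-th component network (with F_1 = F), and F_n^c the samples of F_n that the n-th network classifies correctly. The exact form of the original equations may differ.

```latex
% Assumed notation: F (all training faces), F_n (training subset of network n),
% F_n^c (samples of F_n correctly classified by network n), F_{n+1} = F_n \setminus F_n^c.
\begin{align}
\alpha_n &= \frac{|F_n^c|}{|F_n|} \tag{2} \\
\beta_n  &= \frac{|F_n^c|}{|F|} \tag{3} \\
\beta_n  &= \alpha_n \, \frac{|F_n|}{|F|} \tag{4} \\
         &= \alpha_n \Bigl( 1 - \sum_{i=1}^{n-1} \beta_i \Bigr) \tag{5} \\
\beta_n  &= \alpha_n \prod_{i=1}^{n-1} (1 - \alpha_i) \tag{6} \\
\gamma_n &= \frac{|F_n|}{|F|} = \prod_{i=1}^{n-1} (1 - \alpha_i) \tag{7}
\end{align}
```

Step (5) uses the fact that the faces left for the n-th network are those not yet correctly classified by networks 1 to n-1, and (6) follows because each network removes a fraction α_i of the samples it receives.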
During the sequential training of the component
networks, each network has a decreasing number of available training samples
. To ensure that each component network has sufficient
samples to learn some generalized facial characteristics,
should be
larger than a performance critical value (e.g., 5% when
).
Given a fixed topology of component networks, the
value of
is inversely
proportional to threshold
. Hence, the larger
, the smaller
. Equation (7) provides guidance to the selection of a proper
for each
component network such
that
is large enough
to provide sufficient statistics.
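As a small worked example of this guidance, the sketch below computes the fraction of the face training set that reaches each component network, assuming (as the hard partitioning implies) that every network removes the share of its samples given by its face-learning ratio. The function names and the learning ratios are hypothetical; the 5% critical value is the example mentioned above.

```python
def training_fractions(learning_ratios):
    """Fraction of all training faces available to each successive component."""
    fractions = []
    remaining = 1.0  # the first component sees the complete face set
    for alpha in learning_ratios:
        fractions.append(remaining)
        remaining *= (1.0 - alpha)  # faces it classifies correctly are discarded
    return fractions


def usable_components(learning_ratios, critical_fraction):
    """Count the leading components that still receive enough training faces."""
    count = 0
    for fraction in training_fractions(learning_ratios):
        if fraction < critical_fraction:
            break
        count += 1
    return count


# Hypothetical learning ratios: each network learns 75% of the faces it is given
fractions = training_fractions([0.75, 0.75, 0.75, 0.75])
n_usable = usable_components([0.75, 0.75, 0.75, 0.75], critical_fraction=0.05)
```

With these numbers, the third network still receives 6.25% of the faces, but a fourth would receive only about 1.6%, below the critical value. Raising the threshold lowers the learning ratio and therefore leaves more samples for later networks.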
In Table 2, we give the complete training algorithm for
component neural network classifiers.
Table 2: The training algorithm for component neural classifiers.
(2) Training Algorithm for the Decision Neural Network
In Table 3, we present the training algorithm for the decision
network
. During the training of
, the inputs are taken from
, where
is drawn from
the face set or the nonface set. The training also makes use of the bootstrapping
procedure as in the training of the component networks to dynamically add
nonface samples to the training set (line (5) in Table 3). In order to prevent the well-known over-fitting
problem during the backpropagation training, we use here an additional face set
and an additional nonface set for validation purposes.
Table 3: The training algorithm for the decision network.
(3) Difference between our proposed technique and bagging/boosting
Let us now
briefly compare our proposed approach to two other popular ensemble techniques:
bagging and boosting. Bagging selects training samples for each component
classifier by sampling the training set with replacement. There is no
correlation between the different subsets used for the training of different
component classifiers. When applied for neural network face detection, we can
train
component
neural classifiers independently using randomly selected subsets of the
original face training set. The nonface samples are selected in a
bootstrapping fashion similar to Table 2. The final output
is based on the
average of outputs from component classifiers, given by
(8)
Unlike bagging, boosting sequentially
trains a series of classifiers by emphasizing difficult samples. A well-known
example is the AdaBoost algorithm [15]. During the training of the
th component
classifier, AdaBoost alters the distribution of the samples such that
those samples misclassified by its previous component classifier are
emphasized. The final output
is a weighted
linear combination of the outputs from the component classifiers.
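The two combination rules described above can be written compactly as follows. This is only an illustrative sketch: the function names and the example weights are hypothetical, and the way AdaBoost derives its weights is not shown.

```python
def bagging_output(component_outputs):
    """Bagging: plain average of the component outputs."""
    return sum(component_outputs) / len(component_outputs)


def weighted_output(component_outputs, weights):
    """Boosting-style rule: weighted linear combination of the component outputs."""
    return sum(w * h for w, h in zip(weights, component_outputs))


average = bagging_output([0.5, 1.0, 0.0])
combined = weighted_output([1.0, 0.0], weights=[0.25, 0.75])
```

Both rules are linear in the component outputs, in contrast to the decision network used in our ensemble.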
Different from bagging, our proposed ensemble
technique sequentially trains a set of interdependent component classifiers. In
this sense, it shares the basic principle with boosting. However, the proposed
ensemble technique differs from boosting in the following aspects.
(1) Our approach uses a “hard” partitioning of the face training set. Those samples, already
correctly classified by the current set of networks, will not be reused
for subsequent networks. In this way, face characteristics already learned by
the previous networks are not included in the training of subsequent
components. Therefore, the subsequent networks can focus more on a different
class of face patterns during their corresponding training stages.
As a result of the hard partitioning, the subsequent
networks are trained on smaller subsets of the original face training set. We
have to ensure that each network has sufficient samples that characterize a
subclass of face patterns. This has also been discussed previously.
(2) We use a decision neural network to make the final classification based on individual
outputs from component networks. This results in a more flexible decision
function than the linear combination rule used by bagging or boosting.
In Section 4, we will give some examples to compare the
performance of the resulting neural network ensembles trained with different
strategies.
The newly created ensemble of cooperating
neural-network classifiers will be used in the following section as “building blocks”
in a pruning cascade.
3. Cascaded Neural Ensembles for Fast Detection
In this section, we incorporate the ensemble technique into
a cascading architecture for face detection such that the detection
accuracy and efficiency are jointly optimized.
Figure 3 depicts the structure of the cascaded neural network
ensembles for face detection. More efficient ensemble classifiers with simpler
base networks are used at earlier stages in the cascade, which are capable of
rejecting a majority of nonface patterns, thereby boosting the overall
detection efficiency.
Figure 3: Pruning cascade of neural network ensembles.
In the following, we introduce a notation framework in
order to come to expressions for the detection accuracy and efficiency of
cascaded ensembles. Afterwards, we propose a technique to jointly optimize the
cascaded face detector for both accuracy and efficiency. Following that, we
introduce an implementation of a cascaded face detector using five neural-network
ensembles.
3.1. Formulation and Optimization of Cascaded Ensembles
As shown in Figure 3, we assume a total of
neural network
ensembles
(
) with increasing base network complexity. The
behavior of each ensemble classifier
can be
characterized by face detection rate
and false
acceptance rate
, where
is the output
threshold of the decision network in the ensemble. By varying
in the interval
, we can obtain different pairs
which actually
constitute the ROC curve of ensemble
. Now, the question is how we can choose a set of
appropriate values for
such that the performance of the cascaded classifier
is optimal.
Suppose we have a detection task with a total of
candidate
windows, and
, where
is the number
of faces and
is the number
of nonfaces. The first classifier in the cascade takes
windows as an
input, among which
windows are
classified as faces and
windows are
classified as nonfaces. Hence
. The
windows are
passed on to the second classifier for further verification. More specifically,
the
th classifier
(
) in the cascade takes
input windows
and classifies them into
faces and
nonfaces. At
the first stage, it is easy to see that
(9)
More generally, it holds that
(10)
where
and
represent the
face detection rate and false acceptance rate, respectively, of the subcascade
formed jointly by the first to the
th ensemble
classifiers. Note that it is difficult to express
explicitly
using
and
, since the behaviors of different ensembles are usually
correlated. In the following, we first define two target functions for
maximizing the detection accuracy and efficiency of the cascaded detector.
Following this, we propose a solution to optimize both objectives.
(a) Detection accuracy
The detection accuracy of a face detector is characterized by both its face detection rate
and false acceptance rate. For a specific application, we can define the
maximally allowed false acceptance rate. Under this constraint, the higher the
face detection rate, the more accurate the classifier. More specifically, we
use cost function
to measure the
detection accuracy of the
-ensemble
cascaded classifier, which is defined by the maximum face detection rate of the
classifier under the condition that the false acceptance rate is below a
threshold value
. Therefore,
(11)
(b) Detection efficiency
We define the detection efficiency of a cascaded classifier by the total amount of time
required to process the
input windows,
denoted as
. Suppose the classification of one image window by
ensemble classifier
takes
time. To
classify
candidate
windows by the complete
-layer cascade,
we need a total amount of time
(12)
where the last step is based on (10) and we define the initial rates
and
.
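The bookkeeping behind this time estimate can be sketched as follows. The sketch assumes the reading of (10) given in the text: the number of windows entering a stage equals the number of faces times the detection rate of the preceding subcascade, plus the number of nonfaces times its false acceptance rate, with both initial rates equal to one. All names and numbers are hypothetical.

```python
def cascade_cost(num_faces, num_nonfaces, stage_times,
                 sub_detection_rates, sub_false_accept_rates):
    """Total processing time and the number of windows entering each stage.

    sub_detection_rates[k] and sub_false_accept_rates[k] describe the subcascade
    formed by stages 0..k (not the individual stage k).
    """
    d_prev, a_prev = 1.0, 1.0  # initial rates: the first stage sees every window
    total_time = 0.0
    inputs_per_stage = []
    for t, d, a in zip(stage_times, sub_detection_rates, sub_false_accept_rates):
        n_in = num_faces * d_prev + num_nonfaces * a_prev
        inputs_per_stage.append(n_in)
        total_time += t * n_in
        d_prev, a_prev = d, a
    return total_time, inputs_per_stage


# Hypothetical example: 100 faces, 9,900 nonfaces, three stages of growing cost
total, inputs = cascade_cost(
    num_faces=100, num_nonfaces=9900,
    stage_times=[1.0, 4.0, 16.0],
    sub_detection_rates=[0.98, 0.97, 0.96],
    sub_false_accept_rates=[0.2, 0.05, 0.01])
```

In this toy setting, the most expensive stage processes only about 6% of the windows, so the total cost is far below that of applying the last stage to all 10,000 windows.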
The performance of a cascaded face detector should be
expressed by both its detection accuracy and efficiency. To this end, we
combine cost functions
(11) and
(12) into a unified function
, which measures the overall performance of a cascaded
face detector. There are various combination methods. One example is based on a
weighted summation of (11) and (12):
(13)
We use a
subtraction for the efficiency (time) component to trade it off against accuracy.
By adjusting
, the relative importance of desired accuracy and
efficiency can be controlled.
In order to obtain a cascaded face detector of high
performance, we aim at maximizing the performance goal as defined by (13). For a given cascaded detector consisting of
ensembles, we
can optimize over all possible
(
) to obtain the best parameters
. However, this process can be computationally
prohibitive, especially when
is large. In
the following, we propose a heuristic suboptimal search to determine these
parameters.
(c) Sequential backward parameter selection
In Table 4, we present the algorithm for selecting a set of
parameters
that maximizes
(13). Since the final face detection rate
is upper bounded
by
, we first ensure a high detection accuracy by
choosing a proper
for the final
ensemble classifier (line 1 in Table 4). Following that, we add each ensemble in a backward
direction and choose its threshold parameter
such that the
partially formed cascade from the
th to the
th ensemble
gives an optimized
.
Table 4: Parameter selection for the face-detection cascade.
The experimental results show that this selection
strategy gives very good performance in practice.
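A possible realization of the backward search is sketched below. It is not the algorithm of Table 4, whose details are not reproduced here: the grid of candidate thresholds, the rule for the final stage (lowest threshold meeting a false-acceptance limit), and the objective (detection rate minus a weighted mean time per window) are assumptions. Stage scores on a validation set are used directly, so correlations between stages are taken into account.

```python
def partial_cascade_stats(face_scores, nonface_scores, thresholds, stage_times, first):
    """Evaluate stages first..last on validation scores.

    face_scores[k][i] is the output of stage k for validation face i
    (likewise for nonface_scores). Returns
    (detection_rate, false_acceptance_rate, mean_time_per_window).
    """
    def run(scores):
        alive = list(range(len(scores[0])))
        time = 0.0
        for k in range(first, len(stage_times)):
            time += stage_times[k] * len(alive)  # stage k runs on surviving windows
            alive = [i for i in alive if scores[k][i] > thresholds[k]]
        return len(alive), time, len(scores[0])

    f_pass, f_time, n_f = run(face_scores)
    b_pass, b_time, n_b = run(nonface_scores)
    return f_pass / n_f, b_pass / n_b, (f_time + b_time) / (n_f + n_b)


def backward_select(face_scores, nonface_scores, stage_times, candidates, max_far, weight):
    num_stages = len(stage_times)
    thresholds = [None] * num_stages
    # Final stage first: lowest threshold whose false acceptance rate is allowed
    # (if none qualifies, the largest candidate is kept).
    for t in sorted(candidates):
        thresholds[-1] = t
        _, far, _ = partial_cascade_stats(face_scores, nonface_scores,
                                          thresholds, stage_times, num_stages - 1)
        if far <= max_far:
            break
    # Then add earlier stages one by one, in backward direction
    for k in range(num_stages - 2, -1, -1):
        best_score, best_t = None, None
        for t in candidates:
            thresholds[k] = t
            dr, _, time = partial_cascade_stats(face_scores, nonface_scores,
                                                thresholds, stage_times, k)
            score = dr - weight * time  # accuracy minus weighted time
            if best_score is None or score > best_score:
                best_score, best_t = score, t
        thresholds[k] = best_t
    return thresholds


# Toy two-stage example (hypothetical validation scores)
face_scores = [[0.9, 0.8, 0.7, 0.6], [0.9, 0.9, 0.8, 0.4]]
nonface_scores = [[0.5, 0.2, 0.1, 0.1], [0.6, 0.3, 0.2, 0.1]]
selected = backward_select(face_scores, nonface_scores, stage_times=[1.0, 10.0],
                           candidates=[0.15, 0.35, 0.55, 0.75],
                           max_far=0.25, weight=0.01)
```

The search evaluates each candidate threshold once per stage, so its cost grows linearly with the number of ensembles instead of exponentially as in an exhaustive search.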
3.2. Implementation of a Cascaded Detector
We build a five-stage cascade of classifiers with
increasing order of topology complexity. The first four stages are based on
component network structures FNET-A to FNET-D, as illustrated in Section 2.2. The final ensemble consists of all component
networks of FNET-D, plus a set of additional component networks that are
variants of FNET-D. These additional component networks allow overlapping of
locally connected blocks so that they offer slightly more flexibility than the
original FNET-D. Although, in principle, a more complex base network structure
can be used and the final ensemble can be constructed following the similar
principle as FNET-A to FNET-D, we found, in our experiments, that using our
proposed strategy for the final ensemble construction already offers sufficient
detection accuracy while still keeping the complexity at a reasonably low
level.
In order to apply the face detector to real-world
detection from arbitrary images (videos), we need to address the following
issues.
(1) Multiresolution face scanning
Since we have no a priori knowledge about the sizes of the faces in the input image,
in order to select face candidates of various sizes, we need to scan the image
at multiple scales. In this way, potential faces of any size can be matched to
the
pixel model at
(at least) one of the image scales. Here, we use a scaling factor of
between
adjacent image scales during the search. In Figure 4, we give an illustrating example of the
multiresolution search strategy.
Figure 4: The multiresolution search for face detection.
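The multiresolution search can be sketched as an enumeration of image scales. In the example below, the window size of 24 pixels and the scaling factor of 2.0 are placeholders chosen for easy arithmetic, since the actual values are not legible in this version of the text; the function name and the one-pixel scanning step are likewise assumptions.

```python
def pyramid_levels(image_width, image_height, window_size, scale_factor):
    """List (scale, width, height, window_positions) for every usable image scale."""
    if scale_factor <= 1.0:
        raise ValueError("scale_factor must be greater than 1")
    levels = []
    scale = 1.0
    while True:
        w = int(image_width / scale)
        h = int(image_height / scale)
        if w < window_size or h < window_size:
            break  # the downscaled image no longer fits a single window
        # Number of window positions when scanning with a step of one pixel
        positions = (w - window_size + 1) * (h - window_size + 1)
        levels.append((scale, w, h, positions))
        scale *= scale_factor
    return levels


levels = pyramid_levels(96, 96, window_size=24, scale_factor=2.0)
total_windows = sum(level[3] for level in levels)
```

A face that is k times larger than the model window is matched at the level whose scale is closest to k, which is why a scaling factor closer to 1 gives a finer search at a higher cost.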
(2) Fast preprocessing using integral images
Our proposed
face detector accepts an image window preprocessed by zero mean and unity
standard deviation, with the aim to reduce the global illumination influence.
To facilitate efficient image preprocessing during the multiresolution search,
we compute the mean and variance of an image window using a pair of auxiliary
integral images of the original input image. The integral image of an image
with intensity
is defined as
(14)
As introduced
in [9], using integral images can facilitate a fast
computation of mean value of an arbitrary window from an image. Similarly, a
“squared” integral image can facilitate a fast computation of the variance of
the image window.
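The following sketch illustrates the idea with the usual definition of the integral image from [9], in which each entry holds the sum of all pixel intensities above and to the left of a position. For convenience, the code uses a zero-padded table with one extra row and column, which is an implementation choice of this sketch; the function names are hypothetical.

```python
def integral_image(img, power=1):
    """Zero-padded integral image of img (list of rows); power=2 gives the squared version."""
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x] ** power
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def window_sum(ii, x, y, w, h):
    """Sum over the w-by-h window with top-left corner (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def window_mean_var(ii, ii_sq, x, y, w, h):
    """Mean and variance of a window from the plain and squared integral images."""
    n = w * h
    mean = window_sum(ii, x, y, w, h) / n
    variance = window_sum(ii_sq, x, y, w, h) / n - mean * mean
    return mean, variance


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
ii_sq = integral_image(image, power=2)
mean, variance = window_mean_var(ii, ii_sq, x=1, y=1, w=2, h=2)
```

Once the two tables are built, the mean and variance of any window cost a constant number of operations, independent of the window size.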
In addition to the preprocessing, the fast
computation of the mean values of image windows can also accelerate the
computation of the low-resolution image input for the neural network such as
FNET-A and FNET-B.
(3) Merging multiple detections
Since the
trained neural network classifiers are relatively robust to face variations
in scale and translation, the multiresolution image search normally
yields multiple detections around a single face. As a postprocessing procedure,
we merge adjacent detections into one group, removing repetitive
detections and reducing false positives.
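One simple way to realize such a grouping is sketched below. The overlap measure (intersection over union), the overlap limit, the averaging of grouped boxes, and the rule of discarding groups with too few members are assumptions made for illustration and may differ from the procedure used in our implementation.

```python
def overlap_ratio(a, b):
    """Intersection over union of two square boxes given as (x, y, size)."""
    ax, ay, asz = a
    bx, by, bsz = b
    iw = min(ax + asz, bx + bsz) - max(ax, bx)
    ih = min(ay + asz, by + bsz) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (asz * asz + bsz * bsz - inter)


def merge_detections(boxes, min_overlap=0.5, min_group_size=1):
    """Greedily group overlapping detections and replace each group by its average box."""
    groups = []
    for box in boxes:
        for group in groups:
            # Compare against the first member, which represents the group
            if overlap_ratio(box, group[0]) >= min_overlap:
                group.append(box)
                break
        else:
            groups.append([box])
    merged = []
    for group in groups:
        if len(group) >= min_group_size:  # isolated detections can be dropped
            n = len(group)
            merged.append(tuple(sum(b[i] for b in group) / n for i in range(3)))
    return merged


detections = [(10, 10, 24), (12, 10, 24), (11, 13, 24), (100, 100, 24)]
faces = merge_detections(detections, min_overlap=0.5, min_group_size=2)
```

In this toy example, the three overlapping detections collapse into one box, while the isolated detection is discarded as a likely false positive.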
4. Performance Analysis
In this section, we evaluate the performance of our proposed face detector. As a first
step, we look at the performance of the new ensemble technique.
4.1. Performance Analysis of the Neural Network Ensemble
To demonstrate
the performance of our proposed ensemble technique, we evaluate four network
ensembles (FNET-A to FNET-D) (refer to Figure 2) that are employed in the cascaded detection. Our
training face set
consists of
6,304 highly variable face images, all cropped to the size of
pixels.
Furthermore, we build up an initial nonface training set
consisting of
4,548 nonface images of size
. Set
comprises
around 1,000 scenery pictures containing no faces. For each scenery picture, we
further generate five scaled versions of it, thereby acquiring altogether 5,000
scenery images. Each
sample is
preprocessed to zero mean and unity standard deviation to reduce the influence
of global illumination changes.
Let us first quantitatively analyze the performance
gain by using an ensemble of neural classifiers. We vary the number of
constituent components and derive the
corresponding ROC curve of each ensemble. The evaluation is based on two
additional validation sets, one containing faces and one containing nonfaces.
In Figure 5, we depict the ROC curves for ensembles based on
networks FNET-A and FNET-C, respectively.
In Figure 5(a), we can see that the detection accuracy of the
FNET-A ensemble consistently improves by adding up to three components.
However, no obvious improvement can be achieved by using more than three
components. Similar results also hold for the FNET-C ensemble (see Figure 5(b)).
Figure 5: ROC curves of various network ensembles with respect to different numbers of components.
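For reference, an ROC curve of the kind discussed here can be traced by sweeping the decision threshold over the classifier outputs on a face validation set and a nonface validation set. The sketch below is a generic illustration. The function name `roc_points` and the example scores are assumptions, not values from our experiments.

```python
def roc_points(face_scores, nonface_scores):
    """Return (false_acceptance_rate, detection_rate, threshold) triples.

    A sample is accepted as a face when its score is >= threshold.
    Thresholds are swept from the highest observed score to the lowest.
    """
    thresholds = sorted(set(face_scores) | set(nonface_scores), reverse=True)
    points = []
    for t in thresholds:
        detection_rate = sum(s >= t for s in face_scores) / len(face_scores)
        false_acceptance = sum(s >= t for s in nonface_scores) / len(nonface_scores)
        points.append((false_acceptance, detection_rate, t))
    return points


# Hypothetical classifier outputs, for illustration only.
face_scores = [0.9, 0.8, 0.4]
nonface_scores = [0.7, 0.3, 0.1, 0.05]
curve = roc_points(face_scores, nonface_scores)
```

Comparing such curves for ensembles with different numbers of components shows where adding further components stops improving the detection rate at a given false acceptance rate.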
Since using more component classifiers in a
neural network ensemble inevitably increases the total computation cost during
the classification, for a given network topology, we need to select
the number of components with the best
trade-off between detection accuracy and computation efficiency.
As a next performance-evaluation step, we compare our
proposed classifier ensemble for face detection with two other popular ensemble
techniques, namely, bagging and boosting. We have adopted a slightly different
version of the AdaBoost algorithm [15].
According to the conventional AdaBoost algorithm,
the training procedure uses a fixed nonface set and face set to train a
set of classifiers. However, we found, from our experiments, that this strategy
does not lead to satisfactory results. Instead, we minimize the training error
only on the face set. The nonface set is dynamically formed using the
bootstrapping procedure.
As shown in Figure 6,
for complex base network
structures such as FNET-D, our proposed neural-classifier ensemble produces the
best results. For a base network with a relatively simple structure, such as
FNET-A, our proposed ensemble gives results comparable to those of the
boosting-based algorithm. It is worth mentioning that, for the most complex
network structure FNET-D, bagging and boosting give only a marginal improvement
over a single network, while our proposed ensemble gives much
better results than the other techniques. This can be explained by the
following reasoning.
Figure 6: ROC curves of network ensembles using different training strategies.
The training strategy adopted by the boosting
technique is most suitable for combining weak classifiers that may
work only slightly better than random guessing. Therefore, during the
sequential training as in boosting, it is beneficial to reuse the
samples that are correctly classified by the previous component networks to reinforce
the classification performance. For a neural network with simple structures,
the use of boosting can be quite effective in improving the classification
accuracy of the ensemble. However, when training strong component
classifiers, which can already give quite accurate classification results in
a stand-alone operation, it is less effective to repeatedly feed the samples that
are already learned by the preceding networks. Neural networks with complex
structures (e.g., FNET-C and FNET-D) are such strong classifiers, and for these
networks, our proposed strategy is more effective and gives better results in
practice.
4.2. Performance Analysis of the Face-Detection Cascade
We have built
five neural network ensembles as described in Section 3.2. These ensembles have increasing order of structural
complexity, denoted as
(
). As the first step, we evaluate the individual
behavior of each trained neural network ensemble. Using the same training sets
and validation sets as in Section 4.1, we obtain the ROC curves of different ensemble
classifiers
as depicted in
Figure 7. The plot at the right part of the figure is a zoomed
version where the false acceptance rate is within
.
Figure 7: ROC curves of individual ensemble classifiers for face detection.
Afterwards, we form a cascade of neural network
ensembles from
to
. The decision threshold of each network ensemble is
chosen according to the parameter-selection algorithm given in Table 4. We depict the ROC curve of the resulting cascade in
Figure 8, and the performance of the fifth (final)
ensemble classifier is given in the same plot for comparison. It can be noticed
that, for false acceptance rates below
for the given
validation set which is normally required for real-world applications, the
cascaded detector has almost the same face detection rate as the most complex
fifth-stage
classifier. The highest detection rate that can be achieved by the cascaded
classifier is 83%, which is only slightly worse than the 85% detection rate of
the final ensemble classifier. The processing time required by the cascaded
classifier drastically drops to less than
compared to
using the
fifth-stage
classifier alone, when tested on the validation sets
and
. For example, a full detection process on a CMU test
image of
pixels takes
around two minutes by using the
fifth-stage
classifier alone. By using the cascaded detector, only four seconds are required
to complete the processing.
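The source of this speed-up can be illustrated with a small sketch of cascaded classification with early rejection. The stage scores, thresholds, relative costs, and window population below are hypothetical and chosen only to make the effect visible. They are not the ensembles, the thresholds of Table 4, or the timings measured in our experiments.

```python
def cascade_classify(window, stages):
    """Run a window through the stages, stopping at the first rejection.

    stages: list of (score_fn, threshold, cost), ordered from the cheapest
    to the most expensive classifier.
    Returns (is_face, accumulated_cost).
    """
    total_cost = 0
    for score_fn, threshold, cost in stages:
        total_cost += cost
        if score_fn(window) < threshold:
            return False, total_cost  # rejected early, later stages skipped
    return True, total_cost


def faceness(window):
    """Toy score: each window is represented directly by a score in [0, 1]."""
    return window


# Hypothetical three-stage cascade with increasing thresholds and costs.
stages = [(faceness, 0.2, 1), (faceness, 0.4, 5), (faceness, 0.6, 25)]

# Mostly easy background windows, a few harder ones, and one face.
windows = [0.05] * 90 + [0.3] * 7 + [0.5] * 2 + [0.9]

results = [cascade_classify(w, stages) for w in windows]
cascade_cost = sum(cost for _, cost in results)
final_only_cost = len(windows) * stages[-1][2]
num_faces = sum(is_face for is_face, _ in results)
```

In this toy setting, the cascade and the final stage alone accept the same window, but the cascade spends most of its effort on the few windows that survive the cheap early stages.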
Figure 8: Comparison between the final ensemble classifier (the fifth ensemble classifier) and the cascaded classifier for face detection.
In our implementation, we train each ensemble
independently and then build up a cascade. A slightly different strategy is to
train the ensembles sequentially, such that each subsequent ensemble detector
is fed only with the nonface samples that are misclassified by the previous
ensemble detectors. This strategy was adopted by the Viola-Jones detector
in [9]. When this strategy is applied to our neural ensemble
cascade, our experiments show that it leads to
slightly worse results than independent training. This may be due to
the relatively good learning capability of the subsequent ensemble classifiers,
which is less dependent on the relatively “easy” nonface patterns being
pruned. More study is needed to arrive at a solid explanation.
Another benefit offered by independent training is
the saving of training time (see footnote 2). This is because, during
cascaded training, it takes longer to collect nonface samples during the
bootstrapping training for more complex ensembles, considering the relatively
low false acceptance rate of the partially formed subcascade.
4.3. Performance Analysis for Real-World Face Detection
In this
subsection, we apply our cascaded face detector on a number of real-world test
sets and evaluate its detection accuracy and efficiency. Three test sets containing
various images and video sequences are used for our evaluation purposes, which
are listed in Table 5. The CMU+MIT set is the
most widely used test set for benchmarking face-detection algorithms [5], and many of the images included in this data set are
of very low quality. The WEB test set contains various images randomly
downloaded from the Internet. The HN2R-DET set contains various images and
video sequences we have collected using both a DV camera and a web camera
during several test phases in the HN2R project [16].
Table 5: Data sets used for the evaluation of our proposed face detector.
(1) Detection accuracy
First, we
compare our detection results with results reported in the literature on the
CMU+MIT test set.
The comparison results are given in Table 6 (see footnote 3). It can be seen that our approach for face detection is among
the best-performing techniques in terms of detection accuracy.
Using the WEB data set, we achieve a face detection
rate of 93% with a total of 29 false positives. For the HN2R-DET set, which
captures indoor scenes with relatively simple backgrounds, a detection rate of
98% is achieved with zero false positives.
(2) Detection efficiency
Furthermore, we
have evaluated the efficiency gain obtained by using a cascaded detector. For the CMU+MIT test set,
the five ensembles in the cascade reject 77.2%, 15.5%, 6.2%, 1.1%, and 0.09% of
all the background image windows, respectively. For a typical image of size
, using a cascade can significantly reduce the
computation of the final ensemble by 99.4%, bringing the processing time from
several minutes to a subsecond level. When processing video sequences of
resolution, we
achieve a 4-5 frames/second detection speed on a Pentium-IV PC (3.0 GHz). The
detection is frame-based without the use of any tracking techniques.
The proposed detector has been integrated into a
real-time face-recognition system for consumer-use interactions [17], which gives quite reliable performance under various
operating environments.
(3) Training efficiency
The
training of state-of-the-art learning-based face detectors, such as
the Viola-Jones detector [9],
usually takes weeks due to the large
number of features involved. The training of our proposed face detector is
highly efficient, usually taking only a few hours including the parameter
tuning. This is because the cascaded detector involves only five stages, each
of which can be trained independently. For each stage, only a limited number of
component networks need to be trained, owing to the relatively good learning
capacity of multilayer neural networks (Section 2). As a result, the computation cost is kept low,
which is advantageous for applications where frequent updates of
detection models are necessary.
5. Conclusions
In this paper,
we have presented a face detector using a cascade of neural-network ensembles,
which offers the following distinct advantages.
First, we have used a neural network ensemble for improved detection accuracy, which consists of a set of component neural
networks and a decision network. The experimental results have shown that our
proposed ensemble technique outperforms several existing techniques such as
bagging and boosting, with significantly better ROC performance for more
complex neural network structures. For example, as shown in Figure 6(b), by using our proposed technique, the false
rejection rate has been reduced by 23% (at the false acceptance rate of 0.5%)
as compared to bagging and boosting.
Second, we have used a cascade of neural network
ensembles with increasing complexity, in order to reduce the total
computation cost of the detector. Fast ensembles are used first to quickly
prune large background areas while subsequent ensembles are only invoked for
more difficult cases to achieve a refined classification. Based on a new
weighted cost function incorporating both detection accuracy and efficiency, we
use a sequential parameter-selection algorithm to optimize the defined cost.
The experimental results have shown that our detector has effectively reduced
the total processing time from minutes to a fraction of a second, while
maintaining similar detection accuracy as compared to the most powerful
subdetector in the cascade.
When used for real-world face-detection tasks, our
proposed face detector is one of the best-performing detectors
in detection accuracy, with a 94.4% detection rate and 61 false positives on the
CMU+MIT data set (see Table 6). In addition, the cascaded structure has greatly
reduced the required computation complexity. The proposed detector has been
applied in a real-time face-recognition system operating at 4-5 frames/second.
Table 6: Comparison of different face detectors for the CMU+MIT data set.
It is also worth pointing out the architectural
advantages offered by the proposal. In our detector framework, each subdetector
(ensemble) in the cascade is built upon similar structures, and each ensemble
is composed of base networks of the same topology. Within one ensemble, the
component networks can simultaneously process an input window. This structure
is well suited to implementation on parallelized hardware architectures,
either in a multiprocessor layout or with reconfigurable hardware cells.
Additionally, the different ensembles in a cascade can be implemented in a
streamlined manner to further accelerate the cascaded processing. It is readily
understood that these features are highly relevant for embedded
applications.
1 Factor
also
compensates for the different units used by
(detection
rate) and
(time).
2 The complete training takes roughly a few hours
in our experimental setup (P-IV PC 3.0 GHz).
3 Techniques 3, 4, 7, and 8 and our approach use a
subset of the test set that excludes hand-drawn faces and cartoon faces, leaving
483 faces in the test set. If we further exclude four faces that have face masks or
poor resolution, as we do not consider these situations in the
construction of our training sets, we can achieve a 94.4% face-detection rate
with the same number of false positives. Note that not all techniques listed in
the table use the same size of training faces, and the training data size may
also vary.
References
- W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, “Face recognition: a literature survey,” ACM Computing Surveys, vol. 35, no. 4, pp. 399–458, 2003.
- M.-H. Yang, D. J. Kriegman, and N. Ahuja, “Detecting faces in images: a survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34–58, 2002.
- S. L. Phung, A. Bouzerdoum, and D. Chai, “Skin segmentation using color pixel classification: analysis and comparison,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 1, pp. 148–154, 2005.
- B. Fröba and C. Küblbeck, “Real-time face detection using edge-orientation matching,” in Proceedings of the 3rd International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA '01), vol. 2091 of LNCS, pp. 78–83, Springer, Halmstad, Sweden, June 2001.
- H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23–38, 1998.
- C. Garcia and M. Delakis, “Convolutional face finder: a neural architecture for fast and robust face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1408–1423, 2004.
- B. Heisele, T. Poggio, and M. Pontil, “Face detection in still gray images,” Tech. Rep. 1687, Massachusetts Institute of Technology, Cambridge, Mass, USA, 2000, AI Memo.
- S. Romdhani, P. Torr, B. Schölkopf, and A. Blake, “Computationally efficient face detection,” in Proceedings of the 8th IEEE International Conference on Computer Vision (ICCV '01), vol. 2, pp. 695–700, Vancouver, BC, Canada, July 2001.
- P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
- R. Lienhart and J. Maydt, “An extended set of Haar-like features for rapid object detection,” in Proceedings of the International Conference on Image Processing (ICIP '02), vol. 1, pp. 900–903, Rochester, NY, USA, September 2002.
- S. Z. Li and Z. Zhang, “FloatBoost learning and statistical face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1112–1123, 2004.
- F. Zuo and P. H. N. de With, “Fast human face detection using successive face detectors with incremental detection capability,” in Image and Video Communications and Processing, vol. 5022 of Proceedings of SPIE, pp. 831–841, Santa Clara, Calif, USA, January 2003.
- Y. Ma and X. Ding, “Face detection based on hierarchical support vector machines,” in Proceedings of the 16th International Conference on Pattern Recognition (ICPR '02), vol. 1, pp. 222–225, Quebec, Canada, August 2002.
- F. Zuo and P. H. N. de With, “Fast face detection using a cascade of neural network ensembles,” in Proceedings of the 7th International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS '05), vol. 3708 of LNCS, pp. 26–34, Antwerp, Belgium, September 2005.
- R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley-Interscience, New York, NY, USA, 2nd edition, 2000.
- HomeNet2Run, http://www.hitech-projects.com/euprojects/hn2r/.
- F. Zuo and P. H. N. de With, “Real-time embedded face recognition for smart home,” IEEE Transactions on Consumer Electronics, vol. 51, no. 1, pp. 183–190, 2005.
- H. Schneiderman and T. Kanade, “Probabilistic modeling of local appearance and spatial relationships for object recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '98), pp. 45–51, Santa Barbara, Calif, USA, June 1998.
- M.-H. Yang, D. Roth, and N. Ahuja, “A SNoW-based face detector,” in Proceedings of Advances in Neural Information Processing Systems (NIPS '99), vol. 12, pp. 862–868, Denver, Colo, USA, November-December 1999.