Computer Vision and Robotics Research Laboratory, University of California, San Diego, La Jolla, CA 92093, USA
Abstract
We present a system that simultaneously tracks eyes and detects eye blinks. Two interactive particle filters are
used for this purpose, one for the closed eyes and the other one for the open eyes. Each particle filter is used to track the eye locations as well as the scales of the eye subjects. The set of particles that gives higher confidence is defined as the primary set and the other one is defined as the secondary set. The eye location
is estimated by the primary particle filter, and whether the eye status is open or closed is also decided by the label of the primary particle filter. When a new frame comes, the secondary particle filter is reinitialized according to the estimates from the primary particle filter. We use autoregression models for describing the state transition and a classification-based model for measuring the observation. Tensor subspace analysis is used for feature extraction which is followed by a logistic regression model to give the posterior estimation. The performance is carefully evaluated from two aspects: the blink detection rate and the tracking accuracy. The blink detection rate is evaluated using videos from varying scenarios, and the tracking accuracy is given by comparing with the benchmark data obtained using the Vicon motion capturing system. The setup for obtaining benchmark data for tracking accuracy evaluation is presented and experimental results are shown. Extensive experimental evaluations validate the capability of the algorithm.
1. Introduction
Eye blink
detection plays an important role in human-computer interface (HCI) systems. It
can also be used in driver assistance systems. Studies show that eye blink
duration has a close relation to a subject's drowsiness [1]. The openness of
the eyes, as well as the frequency of eye blinks, indicates the person's level of
consciousness, which has potential applications in monitoring a driver's
vigilance level for additional safety control [2]. Also, eye blinks can be used
as a method of communication for people with severe disabilities, in which
blink patterns can be interpreted as semiotic messages [3–5]. This provides
an alternate input modality to control a computer: communication by “blink
pattern.” The duration of eye closure determines whether the blink is
voluntary or involuntary. Blink patterns are used by interpreting voluntary
long blinks according to the predefined semiotics dictionary, while ignoring
involuntary short blinks.
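The duration rule described above can be illustrated with a minimal sketch. It assumes a per-frame open/closed decision is already available; the function name `classify_blinks` and the 0.2-second threshold are hypothetical and are not values taken from this work.

```python
from itertools import groupby


def classify_blinks(closed_flags, fps, long_blink_seconds=0.2):
    """Group consecutive closed-eye frames into blinks and label each one.

    closed_flags: per-frame booleans (True = eyes closed).
    long_blink_seconds: hypothetical threshold separating voluntary (long)
    blinks from involuntary (short) ones; not a value from the paper.
    Returns a list of (start_frame, length_in_frames, label) tuples.
    """
    blinks = []
    frame = 0
    for is_closed, run in groupby(closed_flags):
        length = len(list(run))
        if is_closed:
            duration = length / fps
            label = "voluntary" if duration >= long_blink_seconds else "involuntary"
            blinks.append((frame, length, label))
        frame += length
    return blinks


# Example: a 2-frame blink and a 6-frame blink in a 25 fps sequence.
flags = [False, True, True, False, False] + [True] * 6 + [False]
result = classify_blinks(flags, fps=25)
```

Only the blinks labeled as voluntary would then be passed to the blink-pattern dictionary, while the short ones are ignored.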
Eye blink detection has attracted considerable
research interest from the computer vision community. In the literature, most
existing techniques use two separate steps for eye tracking and blink detection
[2, 3, 5–8]. For eye blink detection systems, there are three types of
dynamic information involved: the global motion of eyes (which can be used to
infer the head motion), the local motion of eye pupils, and the eye openness/closure.
Accordingly, an effective eye tracking algorithm for blink detection purposes
needs to satisfy the following constraints:
(i) track the global motion of eyes, which is confined by the head motion;
(ii) maintain invariance to local motion of eye pupils;
(iii) classify the closed-eye frames from the open-eye frames.
Once the eyes' locations are estimated by the tracking
algorithm, the differences in image appearance between the open eyes and the
closed eyes can be used to find the frames in which the subjects' eyes are
closed, such that eye blinking can be determined. In [2], template matching is
used to track the eyes and color features are used to determine the openness of
eyes. Detected blinks are then used together with pose and gaze estimates to
monitor the driver's alertness. In [6, 9], blink detection is implemented as
part of a large facial expression classification system. Differences in
intensity values between the upper eye and lower eye are used for eye
openness/closure classification, such that closed-eye frames can be detected.
The use of low-level features makes the real-time implementation of the blink
detection systems feasible. However, for videos with large variations, such as
the typical videos collected from in-car cameras, the acquired images are
usually noisy and of low resolution. In such scenarios, simple low-level
features, like color and image differences, are not sufficient. Temporal
information has also been used by other researchers for blink detection
purposes. For example, in [3, 5, 7], the image difference between neighboring
frames is used to locate the eyes, and the temporal image correlation is used
thereafter to determine whether the eyes are open or closed. This approach
provides a possible new solution for human-computer interaction systems that
can be used by people with severe disabilities. Motion information has been
exploited as well. An estimate of the dense motion field describes the motion
patterns, from which the eyelid movements can be separated to detect eye blinks.
In [8], dense optical flow is used for this purpose. The ability to
differentiate the motion related to blinks from the global head motion is
essential. Since faces are nonrigid and nonplanar, this is not a trivial
task.
Such two-step blink detection systems require
that the tracking algorithm be capable of handling the appearance change
between the open eyes and the closed eyes. In this work, we propose an
alternative approach that simultaneously tracks eyes and detects eye blinks. We use
two interactive particle filters: one tracks the open eyes and the other
tracks the closed eyes. Eye detection algorithms can be used to give the
initial position of the eyes [10–12], and after that the interactive
particle filters are used for eye tracking and blink detection. The set of
particles that gives higher confidence is defined as the primary particle set
and the other one is defined as the secondary particle set. Estimates of the
eyes' location, as well as the eye class labels (open-eye versus closed-eye),
are determined by the primary particle filter, which is also used to
reinitialize the secondary particle filter for the new observation. For each
particle filter, the state variables characterize the location and size of the
eyes. We use autoregression (AR) models to describe the state transitions,
where the location is modeled by a second-order AR and the scale is modeled by
a separate first-order AR. The observation model is a classification-based
model, which tracks eyes according to the knowledge learned from examples
instead of the templates adapted from previous frames. Therefore, it can avoid
accumulation of the tracking errors. In our work, we use a regression model in
tensor subspace to measure the posterior probabilities of the observations.
Other classification/regression models can be used as well. Experimental
results show the capability of the algorithm.
The remaining part of the paper is organized as
follows. In Section 2, the theoretical foundation of the particle filter is
reviewed. In Section 3, the details of the proposed algorithm are presented.
The system flowchart in Figure 1 gives an overview of the algorithm. In Section
4, a systematic experimental evaluation of the performance is described. The
performance is evaluated from two aspects: the blink detection rate and the
tracking accuracy. The blink detection rate is evaluated using videos collected
under varying scenarios, and the tracking accuracy is evaluated using benchmark
data collected with the Vicon motion capturing system. Section 5 gives some discussion and concludes the paper.
Figure 1: Flowchart of the eye blink detection system. For every new frame observation, new particles are first predicted from the known importance distribution and then updated based on the posterior estimated by the logistic regressor in the tensor subspaces. The best estimate gives the class label (open-eye/closed-eye) as well as the eye location.
2. Dynamic Systems and Particle Filters
The fundamental
prerequisite of a simultaneous eye tracking and blink detection system is to
accurately recover the dynamics of eyes, which can be modeled by a dynamic
system. Open eyes and closed eyes appear to have significantly different
appearances. A straightforward way is to model the dynamics of open-eye and
closed-eye individually. We use two interactive particle filters for this
purpose. The posterior probabilities learned by the particle filters are used
to determine which particle filter gives the correct tracks, and this particle
filter is thus labeled as the primary one. Figure 1 gives the diagram of the
system. Since the particle filters are the key part of this blink detection
system, in this section, we present a detailed overview of the dynamic system
and its particle filtering solutions, such that the proposed system for
simultaneous eye tracking and blink detection can be better understood.
2.1. Dynamic Systems
A dynamic system can be described by two mathematical models. One is the state-transition model, which describes the system evolution rules, represented by the stochastic process $\{X_t\}$, where
$$X_t = f_t(X_{t-1}, V_t). \tag{1}$$
$V_t$ is the state transition noise with known probability density function (PDF) $p_V$. The other one is the observation model, which shows the relationship between the observable measurement of the system and the underlying hidden state variables. The dynamic system is observed at discrete times $t$ via realization of the stochastic process, modeled as follows:
$$Y_t = h_t(X_t, W_t). \tag{2}$$
$Y_t$ is the discrete observation obtained at time $t$. $W_t$ is the observation noise with known PDF $p_W$, which is independent of $V_t$. For simplicity, we use capital letters to refer to the random processes and lowercase letters to denote the realization of the random processes.
Given that these two system models are known, the problem is to estimate any function of the state $g(x_t)$ using the expectation $E[g(x_t) \mid y_{1:t}]$. If $f_t$ and $h_t$ are linear, and the two noise PDFs, $p_V$ and $p_W$, are Gaussian, the system can be characterized by a Kalman filter [13]. Unfortunately, Kalman filters only provide the first-order
approximations for general systems. Extended Kalman Filter (EKF) [13] is one
way to handle the nonlinearity. A more general framework is provided by
particle filtering techniques. Particle filtering is a Monte Carlo solution for
general form dynamic systems. As an alternative to the EKF, particle filters
have the advantage that with sufficient samples, the solutions approach the
Bayesian estimate.
2.2. Review of a Basic Particle Filter
Particle filters are sequential analogues of Markov chain Monte Carlo (MCMC) batch methods. They are also known as sequential Monte Carlo (SMC) methods. Particle filters are widely used in positioning, navigation, and tracking for modeling dynamic systems [14–20]. The basic idea of particle filtering is to use point masses, or particles, to represent the probability densities. The tracking problem can be expressed as a Bayes filtering problem, in which the posterior distribution of the target state is updated recursively as a new observation comes in:
$$p(x_{0:t} \mid y_{1:t}) \propto p(y_t \mid x_{0:t}, y_{1:t-1})\, p(x_t \mid x_{0:t-1}, y_{1:t-1})\, p(x_{0:t-1} \mid y_{1:t-1}). \tag{3}$$
The likelihood $p(y_t \mid x_{0:t}, y_{1:t-1})$ is the observation model, and $p(x_t \mid x_{0:t-1}, y_{1:t-1})$ is the state transition model.
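There are several versions of particle filters, such as sequential importance sampling (SIS) [21, 22]/sampling-importance resampling (SIR) [22–24], auxiliary particle filters [22, 25], and Rao-Blackwellized particle filters [20, 22, 26, 27]. All particle filters are derived based on the following two assumptions. The first assumption is that the state transition is a first-order Markov process, which simplifies the state transition model in (3) to
$$p(x_t \mid x_{0:t-1}, y_{1:t-1}) = p(x_t \mid x_{t-1}). \tag{4}$$
The second assumption is that the observations $y_{1:t}$ are conditionally independent given known states $x_{0:t}$, which implies that each observation only relies on the current state; then we have
$$p(y_t \mid x_{0:t}, y_{1:t-1}) = p(y_t \mid x_t). \tag{5}$$
These two assumptions simplify the Bayes filter in (3) to
$$p(x_{0:t} \mid y_{1:t}) \propto p(y_t \mid x_t)\, p(x_t \mid x_{t-1})\, p(x_{0:t-1} \mid y_{1:t-1}). \tag{6}$$
Exploiting this, a particle filter uses a number of particles $\{x_t^{(i)}, w_t^{(i)}\}_{i=1}^{N}$ to sequentially compute the expectation of any function of the state, which is $E[g(x_t) \mid y_{1:t}]$, by
$$E[g(x_t) \mid y_{1:t}] \approx \sum_{i=1}^{N} w_t^{(i)}\, g(x_t^{(i)}), \tag{7}$$
where the weights $w_t^{(i)}$ are normalized to sum to one.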
In our work, we use the combination of SIS and SIR. Equation (6) tells us that the estimation is achieved by a prediction step, $p(x_t \mid x_{t-1})\, p(x_{0:t-1} \mid y_{1:t-1})$, followed by an update step, $p(y_t \mid x_t)$. At the prediction step, the new state $x_t^{(i)}$ is sampled from the state evolution process $p(x_t \mid x_{t-1}^{(i)})$ to generate a new cloud of particles. With the predicted state $x_t^{(i)}$, an estimate of the observation is obtained, which is used in the update step to correct the posterior estimate. Each particle is then reweighted in proportion to the likelihood of the observation at time $t$. We adopt the idea of “resampling when necessary” as suggested by [21, 28, 29]: resampling is performed only when the effective number of particles is sufficiently low. The SIS/SIR algorithm can be summarized as in Algorithm 1.
Algorithm 1: SIS/SIR Particle Filter.
$q(x_t \mid x_{0:t-1}, y_{1:t})$ is also called the proposal distribution. A common and simple choice is to use the prior distribution [30] as the proposal distribution, which is also known as a bootstrap filter. We use the bootstrap filter in our work, and in this way the weight update can be simplified to
$$w_t^{(i)} \propto w_{t-1}^{(i)}\, p(y_t \mid x_t^{(i)}). \tag{12}$$
This indicates that the weight update is directly related to the observation model.
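To make the predict/update/resample cycle concrete, the following is a minimal sketch of one bootstrap filter step on a toy one-dimensional problem. It is not the authors' Algorithm 1: the random-walk transition, the Gaussian likelihood, the noise levels `q_std` and `r_std`, the resampling threshold `ess_ratio`, and the choice of systematic resampling are all assumptions made for illustration.

```python
import math
import random


def effective_sample_size(weights):
    """Effective number of particles for normalized weights."""
    return 1.0 / sum(w * w for w in weights)


def systematic_resample(particles, weights, rng):
    """Draw len(particles) samples with probability proportional to weight."""
    n = len(particles)
    start = rng.random() / n
    cumulative, j, out = weights[0], 0, []
    for i in range(n):
        u = start + i / n
        while u > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j]
        out.append(particles[j])
    return out


def bootstrap_step(particles, weights, y, rng, q_std=0.5, r_std=0.5, ess_ratio=0.5):
    """One predict/update/resample cycle (toy models, assumed parameters)."""
    n = len(particles)
    # Prediction: sample from the prior p(x_t | x_{t-1}), here a random walk.
    particles = [x + rng.gauss(0.0, q_std) for x in particles]
    # Update: w_t is proportional to w_{t-1} * p(y_t | x_t), Gaussian likelihood.
    weights = [w * math.exp(-0.5 * ((y - x) / r_std) ** 2)
               for w, x in zip(weights, particles)]
    total = sum(weights)
    weights = [w / total for w in weights] if total > 0 else [1.0 / n] * n
    estimate = sum(w * x for w, x in zip(weights, particles))
    # Resample only when the effective number of particles is low.
    if effective_sample_size(weights) < ess_ratio * n:
        particles = systematic_resample(particles, weights, rng)
        weights = [1.0 / n] * n
    return particles, weights, estimate


rng = random.Random(0)
n = 500
particles = [rng.uniform(-5.0, 5.0) for _ in range(n)]
weights = [1.0 / n] * n
for y in [1.0, 1.2, 0.9, 1.1]:
    particles, weights, estimate = bootstrap_step(particles, weights, y, rng)
```

Because the proposal is the prior, the weight update involves only the likelihood of the new observation, which is why the quality of the observation model matters so much in this setting.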
3. Particle Filters for Eye Tracking and Blink Detection
The appearance
of the eyes changes significantly when blinks occur. To
effectively handle such appearance changes, we use two interactive particle
filters, one for open eyes and the other one for closed eyes. These two particle
filters are only different in the observation measurement. In the following
sections, we present the three elements of the proposed particle filters: state
transition model, observation model, and prediction/update scheme.
3.1. State Transition Model
The system dynamics, which are described by the state variables, are defined by the location of the eye and the size of the eye image patches. The state vector is $x_t = (\mathbf{u}_t, s_t)$, where $\mathbf{u}_t = (u_t^x, u_t^y)$ defines the location and $s_t$ is used to define the size of eye image patches and normalize them to a fixed size. In other words, the state vector $(\mathbf{u}_t, s_t)$ means that the image patch under study is centered at $\mathbf{u}_t$ and its size is $s_t m \times s_t n$, where $m \times n$ is the fixed size of the eye patches we use in our study.
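A second-order autoregressive (AR) model is used for estimating the eyes' movement. The AR model has been widely used in the particle filter tracking literature for modeling motion. It can be written as
$$\mathcal{U}_t - \bar{\mathcal{U}} = A\,(\mathcal{U}_{t-1} - \bar{\mathcal{U}}) + B\,\boldsymbol{\eta}_t, \tag{13}$$
where
$$\mathcal{U}_t = \begin{bmatrix} u_t^x & u_t^y & u_{t-1}^x & u_{t-1}^y \end{bmatrix}^{T}, \qquad \bar{\mathcal{U}} = \begin{bmatrix} \bar{u}^x & \bar{u}^y & \bar{u}^x & \bar{u}^y \end{bmatrix}^{T}. \tag{14}$$
$\bar{u}^x$ and $\bar{u}^y$ are the corresponding mean values for $u_t^x$ and $u_t^y$. As pointed out by [31], this dynamic model is actually a temporal Markov chain. It is capable of capturing complicated object motion. $A$ and $B$ are matrices representing the deterministic and the stochastic components, respectively. $A$ and $B$ can be either obtained by a maximum-likelihood estimation or set manually from prior knowledge. $\boldsymbol{\eta}_t$ is the i.i.d. Gaussian noise.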
We use a first-order AR model to model the scale transition, which is
$$s_t - \bar{s} = C\,(s_{t-1} - \bar{s}) + D\,\xi_t. \tag{15}$$
Similar to the motion model, $C$ is the parameter describing the system deterministic component, and $D$ is the parameter describing the system stochastic component. $\bar{s}$ is the mean value of the scales, and $\xi_t$ is the i.i.d. noise. We assume $\xi_t$ is uniformly distributed. The scale is crucial for many image appearance-based classifiers. An incorrect scale causes a significant difference in the image appearance. Therefore, the scale transition model is one of the most important prerequisites for obtaining an effective particle filter for measuring the observation. Experimental evaluation shows that the AR model with uniform i.i.d. noise is appropriate for tracking the scale changes.
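A minimal sketch of how one particle could be propagated with these two AR models is given below. It uses a scalar special case of the second-order model, applied to each coordinate independently, and every coefficient (`a1`, `a2`, `b`, `a_s`, `b_s`) and mean value is a hypothetical placeholder rather than a parameter reported in this work.

```python
import random


def predict_state(prev, prev2, rng, a1=2.0, a2=-1.0, b=1.0,
                  mean_xy=(0.0, 0.0), a_s=0.9, b_s=0.05, mean_s=1.0):
    """Propagate one particle state (x, y, scale) by one frame.

    Location: second-order AR around mean_xy with Gaussian noise.
    Scale: first-order AR around mean_s with uniform noise in [-1, 1].
    All coefficients are illustrative assumptions.
    """
    location = []
    for k in range(2):
        m = mean_xy[k]
        value = (m + a1 * (prev[k] - m) + a2 * (prev2[k] - m)
                 + b * rng.gauss(0.0, 1.0))
        location.append(value)
    scale = mean_s + a_s * (prev[2] - mean_s) + b_s * rng.uniform(-1.0, 1.0)
    return (location[0], location[1], scale)


rng = random.Random(1)
# With the noise switched off, a1=2 and a2=-1 give constant-velocity motion.
noise_free = predict_state((10.0, 5.0, 1.2), (8.0, 5.0, 1.2), rng, b=0.0, b_s=0.0)
noisy = predict_state((10.0, 5.0, 1.2), (8.0, 5.0, 1.2), rng)
```

The uniform noise bounds the scale change per frame, so a particle cannot jump to an implausible patch size in a single step.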
3.2. Classification-Based Observation Model
In the literature,
much effort has been devoted to the problem of selecting the proposal
distribution [15, 32–35]. A carefully selected proposal distribution can
alleviate the sample depletion problem, which refers to the problem that the
particle-based posterior approximation collapses over time to a few particles.
For example, in [35], AdaBoost is incorporated into the proposal distribution
to form a mixture proposal. This is crucial in some typical occlusion
scenarios, since `cross over' targets can be represented by the mixture-model.
However, the introduction of complicated proposal distributions greatly
increases the computational complexity. Also, since blink detection is usually
a single-target tracking problem, the proposal distribution is more likely to
be single-mode. Therefore, we use only the bootstrap particle filtering approach
and avoid the nontrivial proposal distribution estimation problem.
In this work, we focus on a better observation model
$p(y_t \mid x_t)$. The rationale is based on the observation that,
combined with the resampling step, a more accurate likelihood learned from a
better observation model can move the particles to areas of high likelihood.
This will in turn mitigate the sample depletion problem, leading to a
significant increase in performance. In the literature, many existing approaches
use simple online template matching [16, 18, 19, 36] to get the observation model,
where the templates are constructed from low-level features, such as color,
edges, contour, and so forth, from previous observations. The likelihood is
usually estimated based on a Gaussian distribution assumption [26, 34]. However,
such approaches rely to a large extent on a reasonably stable feature detection
algorithm. Also, usually a large number of the single low-level feature points
are needed. For example, the contour-based method requires that the state
vector be able to describe the evolution of all contour points. This results in
a high-dimensional state space. Correspondingly, the computational cost is
expensive. One solution is to use abstracted statistics of these single feature
points, such as using color histogram instead of direct color measurement.
However, this causes a loss in the spatial layout information, which implies a
sacrifice in the localization accuracy. Instead we use a subspace-based
classification model for measuring the observation such that a more accurate
probability evaluation can be obtained. Statistics learned from a set of
training samples are used for classification instead of simple template
matching and online updating. This can greatly alleviate the problem of error
accumulation. The likelihood estimation problem, $p(y_t \mid x_t)$, becomes a problem of estimating the distribution of a Bernoulli variable, which is $p(\ell_t = 1 \mid y_t)$. $\ell_t = 1$ means that the current state generates a positive example. In our eye tracking and blink detection problem, it indicates that an eye patch, either open or closed, has been located. Logistic regression is a straightforward solution for this purpose. Other existing classification/regression techniques can be used as well.
Such a classification-based particle filtering framework
makes simultaneous tracking and recognition feasible and straightforward. There
are two different ways to embed the recognition problem. The first approach is
to use a single particle filter, whose observation model is a multiclass
classifier. The second approach is to use multiple particle filters, where
the observation model of each particle filter uses a binary classifier designed
for a specific object class. The particle filter that attains the highest posterior
is used to determine the class label as well as the object location, and at the
next frame $t+1$, the other particle filters are reinitialized
accordingly. We use the second approach for simultaneous eye tracking and blink
detection. Individual observation models are built for open eye and closed eye
separately, such that two interactive sets of particles can be obtained. The
observation models contain two parts: tensor subspace analysis for feature
extraction, and logistic regression for class posterior learning. The two parts
are individually discussed in Sections 3.2.1 and
3.2.2. Posterior probabilities measured by particles from these two particle filters are individually denoted as $p(\ell_t^{o} = 1 \mid y_t)$ and $p(\ell_t^{c} = 1 \mid y_t)$, respectively, where $\ell_t^{o} = 1$ refers to the presence of an open eye and $\ell_t^{c} = 1$ refers to the presence of a closed eye.
3.2.1. Subspace Analysis for Feature Extraction
Most existing
applications of using particle filters for visual tracking involve
high-dimensional observations. With the increase of the dimensionality in
observations, the number of particles required increases exponentially.
Therefore, lower dimensional feature extraction is necessary. Sparse low-level
features, such as the abstracted statistics of the low-level features, have
been proposed for this purpose. Examples of the most commonly used features are
color histogram [35, 37], edge density [15, 38], salient points [39], contour
points [18, 19], and so forth. The use of such features makes the system capable
of easily accommodating the scale changes and handling occlusions; however,
the performance of such approaches relies on the robustness of the feature detection
algorithms. For example, color histogram is widely used for pedestrian and
human face tracking; however, its performance suffers from the illumination
changes. Also, the spatial information and the texture information are
discarded, which may cause the degradation of the localization accuracy and in
turn deteriorate the performance of the successive recognition algorithms.
Instead of these variants of low-level features, we
use eigen-subspace for feature extraction and dimensionality reduction. Eigenspace
projection provides a holistic feature representation that preserves spatial
and textural information. It has been widely exploited in computer vision
applications. For example, eigenface has been an effective face recognition
technique for decades. Eigenface focuses on finding the most representative
lower-dimensional space in which the pattern of the input can be best
described. It tries to find a set of “standardized face ingredients”
learned from a set of given face samples. Any face image can be decomposed as
the combination of these standard faces. However, this principal component
analysis- (PCA-) based technique treats each image input as a vector, which
causes the ambiguity in image local structure.
In [40], multilinear analysis is proposed as a natural alternative to PCA
in the image domain. Multilinear
analysis offers a potent mathematical framework for analyzing the multifactor
structure of the image ensemble. For example, a face image ensemble can be
analyzed from the following perspectives: identities, head poses, illumination
variations, and facial expressions. Multilinear analysis uses tensor algebra to
tackle the problem of disentangling these constituent factors. In this way, the
sample structures can be better explored and a more informative data
representation can be achieved. Under different optimization criteria,
variants of the multilinear analysis technique have been proposed. One solution
is the direct extension of the PCA algorithm, tensorPCA from [41], which is
obtained under the criterion of least reconstruction error. Both PCA and
tensorPCA are unsupervised techniques, where the class labels are not
incorporated in such representations. Here we use a supervised version of the
tensor analysis algorithm, which is called tensor subspace analysis (TSA) [42].
Extended from locality preservation projections (LPP) [43], TSA detects the
intrinsic geometric structure of the tensor space by learning a
lower-dimensional tensor subspace. We compare both observation models of using
tensorPCA and TSA. TSA preserves the local structure in the tensor space
manifold, hence a better performance should be obtained. Experimental
evaluation validates this conjecture. In the following paragraphs, a brief
review of the theoretical fundamentals of tensorPCA and TSA is presented.
PCA is a widely used method for dimensionality
reduction. PCA offers a well-defined model, which aims to find the subspace
that describes the directions of greatest variance and at the same time suppresses
known noise as well as possible. Tensor space analysis is used as a natural
alternative to PCA in the image domain, both for efficient computation and for
avoiding ambiguities in the local spatial structure of images. Tensor space analysis
handles images using their natural 2D matrix representation. TensorPCA subspace
analysis projects a high-dimensional rank-2 tensor onto a low-dimensional
rank-2 tensor space, where the tensor subspace projection minimizes the
reconstruction error. Different from the traditional PCA, tensor space analysis
provides techniques for decomposing the ensemble in order to disentangle the
constituent factors or modes. Since the spatial location is determined by two
modes: horizontal position and vertical position, tensor space analysis has the
ability to preserve the spatial location, while the dimension of the parameter
space is much smaller.
Similar to traditional PCA, the tensorPCA projection finds a set of orthogonal bases on which the information is best preserved. Also, the tensorPCA subspace projection decreases the correlation between pixels, while the projected coefficient indicates the information preserved on the corresponding tensor basis. However, for tensorPCA, the set of bases is composed of second-order tensors instead of vectors. If we use the matrix $M$ to denote the original image samples, and the matrix $Z$ as the tensorPCA projection result, tensorPCA can be simply computed by [41]
$$Z = U^{T} M V. \tag{16}$$
The column vectors of the left and right projection matrices $U$ and $V$ are the eigenvectors of the matrix
$$G_U = \sum_{i=1}^{N} (M_i - \bar{M})(M_i - \bar{M})^{T} \tag{17}$$
and the matrix
$$G_V = \sum_{i=1}^{N} (M_i - \bar{M})^{T}(M_i - \bar{M}), \tag{18}$$
respectively, while $\bar{M} = \frac{1}{N}\sum_{i=1}^{N} M_i$. The dimensionality of $Z$ reflects the information preserved, which can be controlled by a parameter $\alpha$. For example, assume the left projection matrix is computed from $G_U = U \Lambda U^{T}$; then the rank of the left projection matrix $U$ is determined by
$$r_U = \min\Bigl\{ k : \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i} \lambda_i} \ge \alpha \Bigr\}, \tag{19}$$
where $\lambda_i$ is the $i$th diagonal element of the diagonal eigenvalue matrix $\Lambda$ ($\lambda_i \ge \lambda_j$ if $i < j$). The rank of the right projection matrix $V$, $r_V$, can be decided similarly.
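The rank selection rule can be sketched as follows, under the assumption that the information-preservation parameter acts as a threshold on the cumulative fraction of the eigenvalue sum. The function name `select_rank` and the example eigenvalues are hypothetical.

```python
def select_rank(eigenvalues, alpha):
    """Smallest number of leading eigenvalues whose sum reaches the
    fraction alpha of the total (assumed reading of the threshold rule)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= alpha:
            return k
    return len(ordered)


# Illustrative eigenvalues only; cumulative fractions are 0.5, 0.8, 0.9, 1.0.
rank_75 = select_rank([5.0, 3.0, 1.0, 1.0], alpha=0.75)
rank_95 = select_rank([5.0, 3.0, 1.0, 1.0], alpha=0.95)
```

A larger threshold keeps more basis tensors and therefore yields a larger projected feature matrix.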
TensorPCA is an unsupervised technique. It is not clear whether the information preserved is optimal for classification. Also, only the Euclidean structure is explored instead of the possible underlying nonlinear local structure of the manifold. The Laplacian-based dimensionality reduction technique is an alternative that focuses on discovering the nonlinear structure of the manifold [44]. It considers preserving the manifold nature while extracting the subspaces. By introducing this idea into tensor space analysis, the following objective function can be obtained [42]:
$$\min_{U, V} \sum_{i,j} \bigl\| U^{T} M_i V - U^{T} M_j V \bigr\|^{2} S_{ij}, \tag{20}$$
where $S$ is the weight matrix of a nearest neighbor graph similar to the one used in LPP [43]:
$$S_{ij} = \begin{cases} \exp\bigl(-\|M_i - M_j\|^{2} / \tau\bigr), & \text{if } M_i \text{ and } M_j \text{ are neighbors}, \\ 0, & \text{otherwise}. \end{cases} \tag{21}$$
We use the iterative approach provided in [42] to compute the left and right projection matrices $U$ and $V$. The same as tensorPCA, for a given example $M$, TSA gives
$$Z = U^{T} M V. \tag{22}$$
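The bilinear projection in (22), in which a left and a right projection matrix are applied to an image kept in its 2D matrix form, can be sketched with plain nested lists. The helper names and the tiny matrices are hypothetical and chosen only so that the result is easy to check by hand; learned projection matrices would come from tensorPCA or TSA training.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def project(image, left, right):
    """Bilinear projection: left^T * image * right."""
    return matmul(matmul(transpose(left), image), right)


# A 3x3 "image" projected to a 2x1 feature matrix (toy projection matrices).
image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
left = [[1.0, 0.0],
        [0.0, 1.0],
        [0.0, 0.0]]
right = [[1.0],
         [0.0],
         [0.0]]
features = project(image, left, right)
```

Because the image is never flattened into a vector, rows and columns are reduced separately, which is how the spatial layout is retained with far fewer parameters than a vector-based projection.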
At each frame $t$, the $i$th particle determines an observation $y_t^{(i)}$ from its state $x_t^{(i)}$. Tensor analysis extracts the corresponding features $z_t^{(i)}$. Now the observation model becomes computing the posterior $p(\ell_t^{(i)} = 1 \mid z_t^{(i)})$. For simplicity, in the following section, we omit the time index $t$ and denote the problem as $p(\ell = 1 \mid z)$. Logistic regression is a natural solution for this purpose, which is a generalized linear model for describing the probability of a Bernoulli distributed variable.
3.2.2. Logistic Regression for Modeling Probability
Regression is the problem of modeling the conditional expected value of one random variable based on the observations of some other random variables, which are usually referred to as explanatory (independent) variables. The variable to model is called the response variable. In the proposed algorithm, the explanatory variables are the coefficients from the tensor subspace projection, $z$, and the response variable to model is the class label $\ell$, which is a Bernoulli variable that defines the presence of an eye subject. For the closed-eye particle filter, this Bernoulli variable defines the presence of a closed eye, while for the open-eye particle filter, this variable defines the presence of an open eye.
The relationship between the class label $\ell$ and its explanatory variables, which are the tensor subspace coefficients $z = (z_1, \ldots, z_d)$ here, can be written as
$$\ell = g^{-1}\Bigl(\beta_0 + \sum_{k=1}^{d} \beta_k z_k\Bigr) + \epsilon, \tag{23}$$
where $\epsilon$ is the error and $g$ is called the link function. The variable $\ell$ can be estimated by
$$E[\ell \mid z] = g^{-1}\Bigl(\beta_0 + \sum_{k=1}^{d} \beta_k z_k\Bigr). \tag{24}$$
Logistic regression uses the logit as the link function, which is $g(p) = \ln\bigl(p / (1 - p)\bigr)$. Therefore, the probability of the presence of an eye subject can be modeled as
$$p(\ell = 1 \mid z) = \frac{\exp\bigl(\beta_0 + \sum_{k=1}^{d} \beta_k z_k\bigr)}{1 + \exp\bigl(\beta_0 + \sum_{k=1}^{d} \beta_k z_k\bigr)}, \tag{25}$$
where $\ell = 1$ means that an eye subject is present.
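As a minimal sketch, the logistic posterior can be evaluated from a flattened list of subspace coefficients as below. The function name `eye_posterior` and the coefficient values are hypothetical; in practice the regression weights would be fitted on the eye/noneye training samples.

```python
import math


def eye_posterior(features, coefficients, intercept):
    """Posterior probability that an eye is present, from the logistic model.

    features: flattened subspace coefficients of one image patch.
    coefficients, intercept: regression parameters (assumed already trained).
    """
    eta = intercept + sum(b * z for b, z in zip(coefficients, features))
    # Numerically stable evaluation of exp(eta) / (1 + exp(eta)).
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


neutral = eye_posterior([1.0, 2.0], [0.5, -0.25], 0.0)   # linear term is 0
confident = eye_posterior([4.0, 0.0], [1.0, 1.0], 0.0)   # linear term is 4
```

The output lies strictly between 0 and 1, so it can be used directly as the likelihood term when reweighting a particle.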
3.3. State Update
The observation
models for open eye and closed eye are individually trained. We have one TSA
subspace learned from open eye/noneye training samples, and another TSA
subspace learned from closed eye/noneye training samples. Each TSA projection
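determines a set of transformed features, which are denoted as $z^{o}$ and $z^{c}$. $z^{o}$ is the transformed TSA coefficients for the open eyes and $z^{c}$ is the transformed TSA coefficients for the closed eyes. Correspondingly, for open eye and closed eye, individual logistic regression models are used separately for modeling $p(\ell^{o} = 1 \mid z^{o})$ and $p(\ell^{c} = 1 \mid z^{c})$ as follows:
$$p(\ell^{o} = 1 \mid z^{o}) = \frac{\exp\bigl(\beta_0^{o} + \sum_k \beta_k^{o} z_k^{o}\bigr)}{1 + \exp\bigl(\beta_0^{o} + \sum_k \beta_k^{o} z_k^{o}\bigr)}, \qquad p(\ell^{c} = 1 \mid z^{c}) = \frac{\exp\bigl(\beta_0^{c} + \sum_k \beta_k^{c} z_k^{c}\bigr)}{1 + \exp\bigl(\beta_0^{c} + \sum_k \beta_k^{c} z_k^{c}\bigr)}. \tag{26}$$
The posteriors are used to update the weights of the corresponding particles, as indicated in (12). The updated weights are $w_t^{o,(i)}$ and $w_t^{c,(i)}$.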
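If we have
$$\max_i\, p\bigl(\ell^{o} = 1 \mid z^{o,(i)}\bigr) > \max_i\, p\bigl(\ell^{c} = 1 \mid z^{c,(i)}\bigr), \tag{27}$$
it indicates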
the presence of open eyes, and the particle filter for tracking the open eye is
the primary particle filter. Otherwise the eyes of the human subject in the
current frame are closed, which indicates the presence of a blink, and the
particle filter for the closed eye is determined as the primary particle
filter. The use of the max function indicates that our criterion is to
trust the most reliable particle. Other criteria can also be used, such as the
mean or product of the posteriors from the best $K$ particles. The
guideline to select the suitable criteria is that only the good particles,
which are the particles that reliably indicate the presence of eyes, should be
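considered. At frame $t$, assume the particles for the primary particle filter are $\{x_t^{(i)}, w_t^{(i)}\}_{i=1}^{N}$ with $x_t^{(i)} = (\mathbf{u}_t^{(i)}, s_t^{(i)})$; then the location $\hat{\mathbf{u}}_t$ of the detected eye is determined by
$$\hat{\mathbf{u}}_t = \sum_{i=1}^{N} w_t^{(i)}\, \mathbf{u}_t^{(i)}, \tag{28}$$
and the scale $\hat{s}_t$ of the eye image patch is
$$\hat{s}_t = \sum_{i=1}^{N} w_t^{(i)}\, s_t^{(i)}. \tag{29}$$
We compute the effective number of particles $N_{\mathrm{eff}}$. If $N_{\mathrm{eff}} < N_{\mathrm{thr}}$, we perform resampling for the primary particle filter. The particles with high posteriors are multiplied in proportion to their posteriors. The secondary particle filter is reinitialized by setting the particles' previous states to $(\hat{\mathbf{u}}_t, \hat{s}_t)$ and the importance weights $w_t^{(i)}$ to uniform.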
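The interaction between the two filters can be sketched as follows. The max-posterior rule follows the text; taking the state estimate as the weighted mean of the primary particles is an assumption of this sketch, and all function names and numbers are hypothetical.

```python
def select_primary(open_posteriors, closed_posteriors):
    """Pick the filter whose best particle has the higher posterior."""
    return "open" if max(open_posteriors) > max(closed_posteriors) else "closed"


def weighted_state(particles, weights):
    """Weighted mean of particle states (x, y, scale); assumed estimator."""
    total = sum(weights)
    dim = len(particles[0])
    return tuple(sum(w * p[k] for w, p in zip(weights, particles)) / total
                 for k in range(dim))


def reinitialize(n, state):
    """Restart the secondary filter: all particles at the primary estimate,
    with uniform importance weights."""
    return [state] * n, [1.0 / n] * n


# Toy example with two particles per filter.
primary = select_primary([0.2, 0.9], [0.6, 0.7])
estimate = weighted_state([(0.0, 0.0, 1.0), (10.0, 20.0, 2.0)], [0.25, 0.75])
secondary_particles, secondary_weights = reinitialize(4, estimate)
```

Here the open-eye filter becomes primary, so the frame would be labeled open-eye and the closed-eye filter would restart from the open-eye estimate in the next frame.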
4. Experimental Evaluation
The performance
is evaluated from two aspects: the blink detection accuracy and the tracking
accuracy. There are two factors that explain the blink detection rate: first,
how many blinks are correctly detected; second, the detection accuracy of the
blink duration. Videos collected under different scenarios are studied,
including indoor videos, in-car videos, and news report videos. A quantitative
comparison is listed. To evaluate the tracking accuracy, benchmark data are
required to provide the ground truth of the eye locations. We use a
marker-based motion capturing system to collect the ground-truth data. The
experimental setup for obtaining the benchmark data is explained, and the
tracking accuracy is presented. Two hundred particles are used for each particle
filter if not stated otherwise. For training the tensor subspaces and the
logistic regression-based posterior estimators, we use eye samples from FERET
gray database to collect open-eye samples. Closed-eye samples are from these
three sources: (1) FERET database; (2) Cohn-Kanade AU-coded facial expression
database; and (3) online images with closed eye. Noneye samples are from both
the FERET database and the online images. We have 273 open-eye images; 149
closed-eye images, and 1879 noneye images. All open-eye, closed-eye, and noneye
samples are resized to
for computing
the tensor subspaces and then getting the logistic regressors. With the
information-preservation threshold set as
, the sizes of the tensorPCA subspaces used for
modeling the open-eye/noneye and closed-eye/noneye samples are
and
respectively;
and the sizes of the TSA subspaces for open eye/noneye and closed eye/noneye
are
and
respectively.
4.1. Blink Detection Accuracy
We use videos
collected under different scenarios for evaluating the blink detection
accuracy. In the first set of experiments, we use videos collected in an
indoor lab setting. The subjects are asked to make voluntary long blinks or
involuntary short blinks. In the second set of experiments, videos of drivers
collected in outdoor driving scenarios are used. In the third set of
experiments, we collect videos of different anchormen/women from news
reports. In the second and third experiments, the subjects perform natural
actions, such as speaking, so only involuntary short blinks are present. We
have 8 videos from indoor lab settings, 4 videos of drivers from an in-car
camera, and 20 news report videos; altogether, 637 blinks are present. For
indoor videos, the frame rate is around 25 frames per second, and each
involuntary blink may last 5-6 frames. For in-car videos, the image quality is
low, and there are significant illumination changes. The frame rate is also
fairly low (around 10 frames per second), so an involuntary blink may last only
around 2-3 frames. For the news report videos, the frame rate is around 15
frames per second. The videos are compressed, and the involuntary blinks last
about 3-4 frames. Table 1 summarizes the comparison results. The true number of
blinks, the detected number of blinks, and the number of false positives are
shown. Figures 2–8 give examples of the detection results and show typical
video frames used in the study. Red boxes show
the tracked eye location, while blue dots show the center of the tracking
results. If there is a red bar on the top right corner, it means that the eyes
are closed in the current frame. Examples of the typical false detections or
misdetections are also shown.
Figure 2: Examples of the blink detection results for indoor videos.
Red boxes are tracked eyes, and the blue dots are the center of the eye
locations. The red bar on the top-left indicates the presence of closed eyes.
Figure 3: Examples of the blink detection
results for indoor videos. Red boxes are tracked eyes, and the blue dots are
the center of the eye locations. The red bar on the top-left indicates the
presence of closed eyes.
Figure 4: Examples of the blink detection results for in-car videos.
Red boxes are tracked eyes, and the blue dots are the center of the eye
locations. The red bar on the top-left indicates the presence of closed eyes.
Figure 5: Examples of the blink detection results for in-car videos.
Red boxes are tracked eyes, and the blue dots are the center of the eye
locations. The red bar on the top-left indicates the presence of closed eyes.
Figure 6: Examples of the blink detection results for news report
videos. Red boxes are tracked eyes, and the blue dots are the center of the eye
locations. The red bar on the top-left indicates the presence of closed eyes.
Figure 7: Examples of the blink detection results for news report
videos. Red boxes are tracked eyes, and the blue dots are the center of the eye
locations. The red bar on the top-left indicates the presence of closed eyes.
Figure 8: Examples of the blink detection results for news report
videos. Red boxes are tracked eyes, and the blue dots are the center of the eye
locations. The red bar on the top-left indicates the presence of closed eyes.
Blink duration time plays an important role in HCI
systems. Involuntary blinks are usually fast while voluntary blinks usually
last longer [48]. Therefore, it is also necessary to compare the detected blink
duration with the manually labeled true blink duration (in terms of the frame
numbers). In Figure 9, we show the detected blink duration in comparison with
the manually labeled blink duration. The horizontal axis is the blink index,
and the vertical axis shows the duration time in terms of the frame numbers.
Experimental evaluation shows that the proposed algorithm is capable of
capturing short blinks as well as the long voluntary blinks accurately.
Figure 9: Examples of the duration time of each blink: true blink duration versus detected blink
duration. The heights of the bars show the blink duration (in terms of frame
numbers). In each pair of bars, the left (blue) bar shows the duration of the
detected blink, and the right bar (magenta) shows that of the true blink.
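Measuring blink duration in frames amounts to finding runs of consecutive closed-eye frames. The sketch below assumes the tracker outputs one open/closed decision per frame; the function name `blink_durations` and the list-of-flags input format are illustrative, not taken from the paper.

```python
def blink_durations(closed_flags):
    """Return (start_frame, duration_in_frames) for each run of consecutive
    closed-eye frames in a per-frame sequence of booleans."""
    blinks = []
    start = None
    for i, closed in enumerate(closed_flags):
        if closed and start is None:
            start = i                          # a blink begins
        elif not closed and start is not None:
            blinks.append((start, i - start))  # the blink just ended
            start = None
    if start is not None:                      # sequence ends mid-blink
        blinks.append((start, len(closed_flags) - start))
    return blinks
```

Applying the same function to the detected labels and to the manually labeled frames gives the two duration values compared for each blink.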
As indicated in (27), the ratio of the posterior
maxima, which is
, determines the presence of an open eye or closed eye.
Figure 10(a) shows an example of the obtained ratios for one sequence.
Log-scale is used. Let
and
, the presence of the closed-eye frame is determined
when
, which corresponds to
in the
log-scale. Examples of the corresponding frames are also shown in Figures
10(b)–10(d) for illustration.
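The decision rule described above, thresholding the ratio of the posterior maxima in log-scale, can be sketched as follows. The orientation of the ratio (closed over open), the default threshold of 0 in log-scale, and the function name are assumptions made for illustration; the threshold should be replaced by the value used in the experiments.

```python
import math


def is_closed_eye(p_closed_max, p_open_max, log_threshold=0.0, eps=1e-12):
    """Compare the maximum posteriors of the two particle filters in log-scale.

    Returns (closed, log_ratio), where log_ratio = log(p_closed / p_open).
    The ratio direction and the threshold value are assumptions.
    """
    log_ratio = math.log(max(p_closed_max, eps)) - math.log(max(p_open_max, eps))
    return log_ratio > log_threshold, log_ratio
```

Working in log-scale turns the ratio test into a comparison of a difference against a threshold, so equal posteriors map to a log ratio of zero.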
Figure 10: (a) The log ratio of posteriors for each frame in Seq. 5.
(b), (c), and (d) The frames corresponding to examples a, b, and c in Figure
10(a). The tracked eyes and the posteriors of being closed eye and open eye
are also shown. In each figure, the top red
line shows the posterior of being closed eye, and the bottom red line shows the
posterior of being open eye.
4.2. Comparison of Using TensorPCA Subspace and TSA Subspace
As stated above, by introducing multilinear analysis, the images can better
preserve the local spatial structure. However, variants of the tensor subspace
basis can be obtained based on different objective functions. TensorPCA is a
straightforward extension of 1D PCA; both are unsupervised approaches. TSA
extends LPP, which preserves the nonlinear locality in the manifold, and also
incorporates the class information. By introducing the local manifold
structure and the class information, TSA is expected to perform better, and
the experimental evaluations support this expectation. Particle filters that
use the tensorPCA subspace and the TSA subspace, respectively, for their
observation models are compared for eye tracking and blink detection. Examples
of the tracking results from both observation models are shown in Figure 11;
TSA gives the more accurate tracking result. In each subfigure, the left image
shows the result from using the TSA subspace, and the right image shows the
result from using the tensorPCA subspace. As above, red bounding boxes show the
tracked eyes, the blue dots show the center of the detection, and the red bar
at the top-right corner indicates the presence of a detected closed-eye frame.
For subspace-based analysis, image alignment is critical for classification
accuracy. An inaccurate observation model causes errors in the posterior
probability computation, which in turn results in inaccurate tracking and poor
blink detection.
Figure 11: Comparison of using TSA subspace versus using tensorPCA
subspace in observation models. In each subfigure, the left image shows the
result from using TSA subspace, and the right one shows the result from using
tensorPCA subspace.
4.3. Comparison of Different Scale Transition Models
It is worth noting that, for a subspace-based observation model, the scale
used for normalizing the size of the images is crucial. A bad scale transition
model can severely degrade the performance. Two popular models are used to
model the scale transition, and their performance is compared. The first is
the AR model in (15), and the other is a Gaussian transition model in which
the transition is controlled by Gaussian-distributed random noise, as
follows:
(30)
where
is a Gaussian
distribution with
as the mean and
as the
variance. Examples are shown in Figure 12. The parameters of the Gaussian
transition model are obtained by the MAP criterion from a manually labeled
training sequence. In each subfigure, the left image shows the result from
using the AR model for scale transition, and the right one shows the result
from using the Gaussian transition model. Experimental results show that the
AR model performs better. This is because the AR model retains a certain
“memory” of the past system dynamics, while the Gaussian transition model only
remembers its immediate past. The “short memory” of the Gaussian transition
model therefore uses less information to predict the scale trajectory, which
is less effective and in turn causes the tracking to fail.
Figure 12: Comparison of using AR versus using Gaussian transition model
in the scale model. In each subfigure, the left image shows the result from AR
scale transition model, and the right one shows the result from the Gaussian
scale transition model.
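To make the difference between the two scale transition models concrete, the sketch below propagates a cloud of scale particles under each model. The exact forms are assumptions: the AR step is written as a first-order model that pulls toward a learned mean scale, and the Gaussian step as a random walk centered on the previous scale. The coefficient, mean, and noise values are illustrative, not parameters from the paper.

```python
import numpy as np


def ar1_scale_step(prev_scale, mean_scale, coeff, noise_std, rng):
    """Assumed AR(1) form: s_t = mean + coeff * (s_{t-1} - mean) + noise."""
    return rng.normal(mean_scale + coeff * (prev_scale - mean_scale), noise_std)


def gaussian_scale_step(prev_scale, noise_std, rng):
    """Assumed Gaussian transition: s_t ~ N(s_{t-1}, noise_std**2)."""
    return rng.normal(prev_scale, noise_std)


def final_spread(step, n_particles=2000, n_steps=200, s0=1.0, seed=0):
    """Propagate a particle cloud and return the std of the final scales."""
    rng = np.random.default_rng(seed)
    scales = np.full(n_particles, s0)
    for _ in range(n_steps):
        scales = step(scales, rng)
    return float(scales.std())


ar_spread = final_spread(lambda s, rng: ar1_scale_step(s, 1.0, 0.9, 0.05, rng))
rw_spread = final_spread(lambda s, rng: gaussian_scale_step(s, 0.05, rng))
print(f"AR(1) spread: {ar_spread:.3f}, Gaussian random-walk spread: {rw_spread:.3f}")
```

Under these assumptions the AR particles stay within a bounded band around the learned mean, while the random-walk particles spread out as the number of steps grows. This is one way a scale prediction that ignores the learned dynamics can drift away from a usable normalization size.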
4.4. Eye Tracking Accuracy
Benchmark data are required for evaluating the tracking accuracy. We use the
marker-based Vicon motion capture and analysis system to provide the ground
truth. The Vicon system has both hardware and software components. The
hardware includes a set of infrared cameras (usually at least 4), controlling
hardware modules, and a host computer to run the software. The software
includes Vicon IQ, which manages, sets up, captures, and processes the motion
data, and a database manager that keeps records of the data files, their
calibration files, and the models. We use four Vicon MCAM cameras to track
four reflective markers. The setup is shown in Figure 13. The Vicon system
tracks the markers' positions in Vicon's reference coordinate system, and the
video camera collects the video needed for evaluating the proposed algorithm.
Figure 13: Setup for collecting groundtruth data with Vicon system.
Cameras in red circles are Vicon infrared cameras, and the camera in green
circle is the video camera for collecting testing sequences.
Before collecting data, the Vicon system requires preprocessing steps,
including camera calibration, data acquisition, and model building. With the
calibration tool included with the motion capture system, a reflectance
marker's 3D position can be obtained in either the Vicon camera coordinate
system or an assigned world coordinate system. Since the Vicon camera
coordinate system is different from the video camera coordinate system, a
calibration between these two camera systems is also required. We use a
checker-board pattern with reflectance markers at specified locations for this
purpose, as shown in Figure 14.
Intrinsic parameters
and extrinsic
parameters
and
are computed.
Intrinsic parameters give the transform from the 3D coordinates in the camera
reference frame to the 2D coordinates in the image domain, while extrinsic
parameters define the transform between the grid reference frame (as shown in
Figure 15) and the camera reference frame. From intrinsic parameters, the 3D
coordinates in the camera coordinate system
can be related
with the 2D coordinates in the image plane
by
(31)
where
is a nonlinear
function describing the lens distortion. Extrinsic parameters describe the
relation between the 3D coordinate in the camera system
and the 3D
coordinate in a given grid reference frame
, as follows:
(32)
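The two mappings described above, from the grid reference frame to the camera frame (extrinsic) and from the camera frame to pixel coordinates (intrinsic), can be sketched as follows. This is an illustration under stated assumptions: a pinhole model with focal lengths `fx`, `fy` and principal point `cx`, `cy`, and a single radial coefficient `k1` standing in for the nonlinear lens distortion function. These parameter names and the distortion form are hypothetical and not taken from the paper.

```python
import numpy as np


def grid_to_camera(points_grid, R, T):
    """Extrinsic mapping X_c = R @ X_g + T for an (N, 3) array of grid points."""
    pts = np.asarray(points_grid, dtype=float)
    return pts @ np.asarray(R, dtype=float).T + np.asarray(T, dtype=float)


def project_to_image(points_cam, fx, fy, cx, cy, k1=0.0):
    """Intrinsic mapping from 3D camera coordinates to 2D pixel coordinates.

    Normalizes by depth, applies an assumed one-coefficient radial
    distortion, then scales by the focal lengths and shifts by the
    principal point.
    """
    pts = np.asarray(points_cam, dtype=float)
    x = pts[:, 0] / pts[:, 2]
    y = pts[:, 1] / pts[:, 2]
    distortion = 1.0 + k1 * (x ** 2 + y ** 2)
    u = fx * distortion * x + cx
    v = fy * distortion * y + cy
    return np.stack([u, v], axis=1)
```

A point on the optical axis projects to the principal point regardless of the distortion coefficient, which is a quick sanity check when wiring up a calibration.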
Figure 14: Checker board pattern for calibration between video camera
coordinate system and Vicon camera coordinate system. Reflectance markers are
put at specific locations.
Figure 15: Example of the grid reference frame.
Figure 15 gives an example of the grid reference
frame. Each pose of
the checker-board defines one grid reference frame, hence an individual set of
extrinsic parameters can be determined. The reflectance markers are assumed to
be infinitely thin, such that their depth can be neglected. Therefore, the
reflectance markers' coordinates in current grid reference frame are known,
denoted as
.
can be
transformed back to the video camera reference frame, which gives the 3D
coordinates in the video camera reference frame
, using the corresponding extrinsic parameters
and
. These markers are also visible to the Vicon system,
as shown in Figure 16. The calibrated Vicon system gives the 3D positions of
the markers, which are denoted as
, in the Vicon camera system reference frame. Hence,
and
can be related
by an affine transform:
(33)
This relation
remains unchanged when the pose of the checker-board changes. A set of
can be used to
determine this transform. We use the approach proposed by Goryn and Hein in
[49] to estimate
and
. The rotation matrix
can be
determined by a least-squares approach as follows:
(34)
where
and
are unitary
matrices obtained from the singular value decomposition (SVD) of the matrices
(35)
The translation vector
can be obtained
accordingly by
(36)
Equation (33) together with (31) determines the mapping from the markers' 3D
positions given by the Vicon system to the 2D pixel positions in the image
plane. Therefore, with the Vicon IQ system providing the markers' 3D positions
in the Vicon camera system, we can obtain our ground-truth data. For reliable
tracking, four markers are used, as shown in Figure 17. We use the Vicon system
to track the right-eye location and to provide the scale of the image, and
apply the proposed algorithm to tracking and blink detection of the left eye.
After normalization with the scale, the distance between the right eye and the
left eye is constant, so the benchmark data can be used for evaluating the
tracking accuracy. The fixed size
for computing the subspace is
. We use the center of the markers as the groundtruth
for eyes' location.
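The least-squares estimation of the rotation and translation between the Vicon frame and the video camera frame can be sketched with the standard SVD-based solution for corresponding 3D point sets. This is a generic formulation and is not claimed to match the exact expressions of [49]; the function name and argument names are illustrative.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Find R, t minimizing sum_i ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of corresponding points, for example marker
    positions in the Vicon frame and in the video camera frame.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Correct a possible reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

Marker positions collected over several checker-board poses can be stacked into `src` and `dst`, since the relation between the two camera systems does not depend on the board pose. At least three non-collinear correspondences are needed for the rotation to be determined.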
Figure 16: Reflectance markers observed by Vicon IQ system.
Figure 17: Marker deployment for tracking accuracy benchmark data collection.
Figure 18 gives an example of the tracking accuracy.
The horizontal axis shows the frame number, and the vertical axis shows the
error in pixels after normalization with the scales. The error is the distance
between the center of the detection and the ground truth. Experimental results
show that the tracking error is larger in certain frames. This is because the
proposed algorithm tries to center on the pupil instead of the center of the
eye.
Figure 18: Tracking error after normalization
using the scales. The horizontal axis is the frame index, and the vertical axis
is the tracking error in pixels after normalization with the scales.
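The per-frame error plotted in Figure 18 can be computed along the following lines. The sketch assumes that normalization means dividing the pixel distance by the per-frame scale; the function name and array layout are illustrative.

```python
import numpy as np


def normalized_tracking_error(detected_centers, ground_truth_centers, scales):
    """Euclidean distance between detected and ground-truth eye centers,
    divided by the per-frame scale (assumed normalization).

    detected_centers, ground_truth_centers: (N, 2) pixel coordinates.
    scales: (N,) scale factors, one per frame.
    """
    det = np.asarray(detected_centers, dtype=float)
    gt = np.asarray(ground_truth_centers, dtype=float)
    s = np.asarray(scales, dtype=float)
    return np.linalg.norm(det - gt, axis=1) / s
```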
5. Discussion and Concluding Remarks
A simultaneous eye tracking and blink detection system is presented in this
paper. We use two interactive particle filters for this purpose. Each particle
filter tracks the eye location, using AR models to describe the state
transition and a classification-based model in tensor subspace to measure the
observation. One particle filter tracks the closed eyes and the other tracks
the open eyes. The set of particles that gives higher confidence is used to
determine the estimated eye location as well as the eye's status (open versus
closed), and the other set of particles is reinitialized accordingly. The
system dynamics are described by two types of hidden state variables: the
position and the scale. We use a second-order autoregression model to describe
the eye's movement and a first-order autoregression model to describe the
scale transition. Tensor subspace analysis is used for feature extraction, and
logistic regression is used to evaluate the posterior probabilities. The
algorithm is evaluated using videos collected under different scenarios,
including both indoor and outdoor data. We evaluate the performance in terms of
both the blink detection rate and the tracking accuracy. The experimental setup
for acquiring benchmark data to evaluate the accuracy is presented, and the
experimental results show that the proposed algorithm is able to accurately
track eye locations and detect both voluntary long blinks and involuntary
short blinks.
Acknowledgments
This research
was supported in part by grants from the UC Discovery Program and the Technical
Support Working Group of the US Department of Defense. The authors are thankful
for the assistance and support of their colleagues from the UCSD Computer
Vision and Robotics Research Laboratory, especially valuable assistance
provided by Shinko Cheng, which made systematic experimental evaluation using
the motion capture system possible.
References
- N. Kojima, K. Kozuka, T. Nakano, and S. Yamamoto, “Detection of consciousness degradation and concentration of a driver for friendly information service,” in Proceedings of the IEEE International Vehicle Electronics Conference, p. 31, Tottori, Japan, September 2001.
- P. Smith, M. Shah, and N. D. V. Lobo, “Monitoring head/eye motion for driver alertness with one camera,” in Proceedings of the International Conference on Pattern Recognition, vol. 15, p. 636, Cambridge, UK, September 2000.
- K. Grauman, M. Betke, J. Lombardi, J. Gips, and G. Bradski, “Communication via eye blinks and eyebrow raises: video-based human-computer interfaces,” Universal Access in the Information Society, vol. 2, no. 4, p. 359, 2003.
- K. Grauman, M. Betke, J. Gips, and G. R. Bradski, “Communication via eye blinks—detection and duration analysis in real time,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, p. 1010, Kauai, Hawaii, USA, December 2001.
- M. Chau and M. Betke, “Real time eye tracking and blink detection with USB cameras,” Tech. Rep. 2005-12, Boston University Computer Science, Boston, Mass, USA, April 2005.
- T. Moriyama, T. Kanade, J. F. Cohn, et al., “Automatic recognition of eye blinking in spontaneously occurring behavior,” in Proceedings of the International Conference on Pattern Recognition (ICPR '02), vol. 16, p. 78, Kauai, Hawaii, USA, 2002.
- D. Gorodnichy, “Second order change detection, and its application to blink-controlled perceptual interfaces,” in Proceedings of the International Association of Science and Technology for Development (IASTED '03) Conference on Visualization, Imaging and Image Processing (VIIP '03), p. 140, Benalmadena, Spain, September 2003.
- T. Morris, P. Blenkhorn, and F. Zaidi, “Blink detection for real-time eye tracking,” Journal of Network and Computer Applications, vol. 25, no. 2, p. 129, 2002.
- J. F. Cohn, J. Xiao, T. Moriyama, Z. Ambadar, and T. Kanade, “Automatic recognition of eye blinking in spontaneously occurring behavior,” 2007, to appear in Behavior Research Methods, Instruments, and Computers.
- J. C. McCall and M. M. Trivedi, “Facial action coding using multiple visual cues and a hierarchy of particle filters,” in Proceedings of the IEEE Workshop on Vision for Human Computer Interaction in Conjunction with IEEE (CVPR '06), vol. 2006, p. 150, New York, NY, USA, 2006.
- J. Wu and M. M. Trivedi, “Robust facial landmark detection for intelligent vehicle system,” in Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures in Conjunction with IEEE (ICCV '05), vol. 3723, p. 213, Beijing, China, 2005.
- J. Wu and M. M. Trivedi, “A binary tree for probability learning in eye detection,” in Proceedings of the IEEE International Workshop on Face Recognition Grand Challenge in conjunction with IEEE (CVPR '05), vol. 3, p. 170, San Diego, Calif, USA, 2005.
- G. Welch and G. Bishop, “An introduction to the Kalman filter,” University of North Carolina at Chapel Hill, Chapel Hill, NC, USA, 1995.
- F. Gustafsson, F. Gunnarsson, N. Bergman, et al., “Particle filters for positioning, navigation, and tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, p. 425, 2002.
- Y. Rui and Y. Chen, “Better proposal distributions: object tracking using unscented particle filter,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '01), vol. 2, p. 786, 2001.
- M. Lee, I. Cohen, and S. Jung, “Particle filter with analytical inference for human body tracking,” in Proceedings of the IEEE Workshop on Motion and Video Computing, p. 159, December 2002.
- M. Bolić, S. Hong, and P. M. Djurić, “Performance and complexity analysis of adaptive particle filtering for tracking applications,” in Conference Record of the Asilomar Conference on Signals, Systems and Computers, vol. 1, p. 853, 2002.
- C. Chang and R. Ansari, “Kernel particle filter: iterative sampling for efficient visual tracking,” in IEEE International Conference on Image Processing, vol. 3, p. 977, 2003.
- C. Chang and R. Ansari, “Kernel particle filter for visual tracking,” IEEE Signal Processing Letters, vol. 12, no. 3, p. 242, 2005.
- A. Giremus, A. Doucet, V. Calmettes, and J.-Y. Tourneret, “A Rao-Blackwellized particle filter for INS/GPS integration,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), vol. 3, p. 964, 2004.
- J. S. Liu and R. Chen, “Blind deconvolution via sequential imputation,” Journal of the American Statistical Association, vol. 90, no. 430, p. 567, 1995.
- A. Doucet, J. F. G. de Freitas, and N. J. Gordon, Sequential Monte Carlo Methods in Practice, Springer, New York, NY, USA, 2001.
- K. Heine, “Unified framework for sampling/importance resampling algorithms,” in Proceedings of the IEEE International Conference on Information Fusion, vol. 2, p. 1459, 2005.
- J. S. Liu and R. Chen, “Sequential Monte Carlo methods for dynamic systems,” Journal of the American Statistical Association, vol. 93, no. 443, p. 1032, 1998.
- R. Karlsson, Particle Filtering for Positioning and Tracking Applications, Ph.D. thesis, Linköping University, Linköping, Sweden, 2005.
- N. J. Gordon, D. J. Salmond, and A. F. M. Smith, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” in IEE Proceedings, Part F: Radar and Signal Processing, vol. 140, no. 2, p. 107, April 1993.
- M. K. Pitt and N. Shephard, “Filtering via simulation: auxiliary particle filters,” Journal of the American Statistical Association, vol. 94, no. 446, p. 590, 1999.
- Z. Khan, T. Balch, and F. Dellaert, “A Rao-Blackwellized particle filter for eigen tracking,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, p. 980, 2004.
- S. Sarkka, A. Vehtari, and J. Lampinen, “Rao-Blackwellized particle filter for multiple target tracking,” Information Fusion, vol. 8, no. 7, p. 2, 2007.
- J. Carpenter, P. Clifford, and P. Fearnhead, “An improved particle filter for non-linear problems,” Department of Statistics, University of Oxford, Oxford, UK, 1997.
- A. Doucet, S. Godsill, and C. Andrieu, “On sequential Monte Carlo sampling methods for Bayesian filtering,” Statistics and Computing, vol. 10, no. 3, p. 197, 2000.
- J. Hol, T. Schon, and F. Gustafsson, “On resampling algorithms for particle filters,” in Nonlinear Statistical Signal Processing Workshop, Cambridge, UK, September 2006.
- M. Isard and A. Blake, “Visual tracking by stochastic propagation of conditional density,” in Proceedings of the 4th European Conference on Computer Vision (ECCV '96), p. 343, Cambridge, UK, April 1996.
- M. Isard and A. Blake, “Condensation—conditional density propagation for visual tracking,” International Journal of Computer Vision, vol. 29, no. 1, p. 5, 1998.
- K. Nishiyama, “Fast and effective generation of the proposal distribution for particle filters,” Signal Processing, vol. 85, no. 12, p. 2412, 2005.
- Y. Guan, R. Fleißner, P. Joyce, and S. M. Krone, “Markov chain Monte Carlo in small worlds,” Statistics and Computing, vol. 16, no. 2, p. 193, 2006.
- C. Shen, M. J. Brooks, and A. D. Van Hengel, “Augmented particle filtering for efficient visual tracking,” in Proceedings of the International Conference on Image Processing (ICIP '05), vol. 3, p. 856, 2005.
- K. Okuma, A. Taleghani, N. De Freitas, J. J. Little, and D. G. Lowe, “A boosted particle filter: multitarget detection and tracking,” in Proceedings of the European Conference on Computer Vision, vol. 3021, p. 28, Copenhagen, Denmark, May 2004.
- X. Xu and B. Li, “Head tracking using particle filter with intensity gradient and color histogram,” in Proceedings of the IEEE International Conference on Multimedia and Expo, (ICME '05), vol. 2005, p. 888, 2005.
- P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, “Color-based probabilistic tracking,” in Proceedings of the European Conference on Computer Vision (ECCV '02), Copenhagen, Denmark, May 2002.
- C. Yang, R. Duraiswami, and L. Davis, “Fast multiple object tracking via a hierarchical particle filter,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV '05), vol. 1, p. 212, 2005.
- M. Pupilli and A. Calway, “Real-time camera tracking using a particle filter,” in Proceedings of the British Machine Vision Conference, p. 519, Oxford Brookes University, Oxford, UK, September 2005.
- M. A. O. Vasilescu and D. Terzopoulos, “Multilinear analysis of image ensembles: Tensorfaces,” in Proceedings of the European Conference on Computer Vision (ECCV '02), p. 447, Copenhagen, Denmark, May 2002.
- D. Cai, X. He, and J. Han, “Subspace learning based on tensor analysis,” Tech. Rep. (UIUCDCS-R-2005-2572), Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2005.
- X. He, D. Cai, and P. Niyogi, “Tensor subspace analysis,” in Proceedings of the Neural Information Processing Systems, 2005.
- S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, p. 2323, 2000.
- X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, “Face recognition using laplacianfaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, p. 328, 2005.
- S. Esaki, Y. Ebisawa, A. Sugioka, and M. Konishi, “Quick menu selection using eye blink for eye-slaved nonverbal communicator with video-based eye-gaze detection,” in Annual International Conference of the IEEE Engineering in Medicine and Biology, vol. 5, p. 2322, 1997.
- D. Goryn and S. Hein, “On the estimation of rigid body rotation from noisy data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 12, p. 1219, 1995.