Computer Science Department, Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel
Abstract
We present an approach for tracking human body parts in 3D with prelearned motion models using multiple cameras. A Gaussian process annealed particle filter is proposed for tracking in order to reduce the dimensionality of the problem and to increase the tracker's stability and robustness. We show that our algorithm tracks low frame rate videos better than a regular annealed particle filter-based tracker. We also show that our algorithm is capable of recovering after a temporary loss of the target.
1. Introduction
Human body pose estimation and tracking is a challenging task for several reasons. First, the large dimensionality of the human 3D model complicates the examination of the entire subject and makes it harder to detect each body part separately. Secondly, the significantly
different appearance of different people that stems from various clothing
styles and illumination variations adds to the already great variety of images
of different individuals. Finally, the most challenging difficulty that has to
be solved in order to achieve satisfactory results of pose understanding is the
ambiguity caused by body.
This paper presents an approach to 3D articulated
human body tracking, that enables reduction of the complexity of this model. We
propose a novel algorithm—Gaussian process annealed particle filter (GPAPF)
(see also Raskin et al. [1, 2]). In
this algorithm, we apply a nonlinear dimensionality reduction using Gaussian
process dynamical model (GPDM) (Lawrence [3], Wang et al. [4]) in order to create a low-dimensional latent space. This space describes poses from a specific motion
type. We then use the annealed particle filter proposed by Deutscher and Reid [5, 6], which operates in this latent space, in order to generate particles.
The annealed particle filter performs well when applied to videos with a high frame rate (60 fps, as reported by Balan et al. [7]), but its performance drops when the frame rate is lower (30 fps). We show that our approach provides good results even for low frame rates (30 fps and lower).
An additional advantage of our tracking algorithm is the capability to recover
after a temporary loss of the target, which makes the tracker more robust.
2. Related Works
There are two main approaches for body pose estimation. The first one is
the body detection and recognition, which is based on a single frame
(Song et al. [8], Ioffe and Forsyth [9], Mori and Malik [10]). The second approach is the body pose tracking which
approximates body pose based on a sequence of frames (Sidenbladh et al. [11], Davison et al. [12], Agarwal and Triggs [13, 14]). A variety of methods have been developed for tracking people from single views (Ramanan and Forsyth [15]), as well as from multiple views (Deutscher et al. [5]).
One of the common approaches for tracking is using
particle filtering methods. Particle filtering uses multiple predictions,
obtained by drawing samples of pose and location prior and then propagating
them using the dynamic model, which are refined by comparing them with the
local image data, calculating the likelihood (see, e.g., Isard and MacCormick [16] or Bregler and Malik [17]). The prior is typically quite diffused (because motion can be fast) but the likelihood function may be very peaky, containing multiple local maxima which
are hard to account for in detail. For example, if an arm swings past an
arm-like pole, the correct local maximum must be found to prevent the track
from drifting (Sidenbladh et al. [18]). Annealed particle filter
(Deutscher and Reid [6]) or local searches are the ways to attack this difficulty. An
alternative is to apply a strong model of dynamics (Mikolajczyk et al. [19]).
There exist several possible strategies for reducing
the dimensionality of the configuration space. Firstly it is possible to
restrict the range of movement of the subject. This approach has been pursued
by Rohr [20]. The assumption is that the subject is performing a specific
action. Agarwal and Triggs [13, 14] assume a constant angle of view of the subject. Because of the restricting assumptions the resulting trackers are not
capable of tracking general human poses. Several works have been done in
attempt to learn subspace models. For example, Ormoneit et al. [21] have used
PCA on the cyclic motions. Another way to cope with high-dimensional data space
is to learn low-dimensional latent variable models [22, 23]. However, methods like Isomap [24] and locally linear embedding (LLE) [25] do not provide a mapping between the latent space and the data space. Urtasun et al. [26–28] use a form of
probabilistic dimensionality reduction by Gaussian process dynamical model
(GPDM) (Lawrence [3], Wang et al. [4]) and formulate the tracking as a nonlinear least-squares optimization problem.
We propose a tracking algorithm that consists of two stages. We separate the body model state into two independent parts: the first one contains information about the 3D location and orientation of the body and the second one describes the pose. We learn a latent space that describes poses only. In the first stage we generate particles in the latent space and transform them into the data space by using a mapping function learned a priori. In the second stage we add rotation and translation parameters to obtain valid poses. Then we project the poses onto the cameras in order to calculate the weighting function.
The article is organized as follows. In Sections 3 and 4, we give a description of particle filtering and Gaussian fields. In Section 5, we describe our algorithm. Section 6 contains our experimental results and a comparison to the annealed particle filter tracker. The conclusions and possible extensions are given in Section 7.
3. Filtering
3.1. Particle Filter
The particle filter algorithm was developed for tracking objects using the Bayesian inference framework. In order to estimate the parameters of the tracked object, this algorithm uses importance sampling. Importance sampling is a general technique for estimating the statistics of a random variable. The estimation is based on samples of this random variable generated from another distribution, called the proposal distribution, which is easy to sample from.
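Let us denote $x_n$ as a hidden state vector and let $y_n$ be a measurement at time $n$. The algorithm builds an approximation of a maximum posterior estimate of the filtering distribution $p(x_n \mid y_{1:n})$, where $y_{1:n} \equiv (y_1, \ldots, y_n)$ is the history of the observation. This distribution is represented by a set of pairs $\{x_n^{(i)}, \pi_n^{(i)}\}_{i=1}^{N_p}$, where $\pi_n^{(i)} \propto p(y_n \mid x_n^{(i)})$. Using Bayes' rule, the filtering distribution can be calculated using two steps:
(i) prediction step:
$$p(x_n \mid y_{1:n-1}) = \int p(x_n \mid x_{n-1})\, p(x_{n-1} \mid y_{1:n-1})\, dx_{n-1} \qquad (1)$$
(ii) filtering step:
$$p(x_n \mid y_{1:n}) \propto p(y_n \mid x_n)\, p(x_n \mid y_{1:n-1}) \qquad (2)$$
Therefore, starting with a weighted set of samples $\{x_{n-1}^{(i)}, \pi_{n-1}^{(i)}\}_{i=1}^{N_p}$, the new sample set $\{x_n^{(i)}, \pi_n^{(i)}\}_{i=1}^{N_p}$ is generated according to a distribution that may depend on the previous set $x_{n-1}^{(i)}$ and the new measurements $y_n$: $x_n^{(i)} \sim q(x_n \mid x_{n-1}^{(i)}, y_n)$. The new weights are calculated using the following formula:
$$\pi_n^{(i)} = k\, \pi_{n-1}^{(i)}\, \frac{p(y_n \mid x_n^{(i)})\, p(x_n^{(i)} \mid x_{n-1}^{(i)})}{q(x_n^{(i)} \mid x_{n-1}^{(i)}, y_n)} \qquad (3)$$
where
$$k = \left( \sum_{j=1}^{N_p} \pi_{n-1}^{(j)}\, \frac{p(y_n \mid x_n^{(j)})\, p(x_n^{(j)} \mid x_{n-1}^{(j)})}{q(x_n^{(j)} \mid x_{n-1}^{(j)}, y_n)} \right)^{-1} \qquad (4)$$
and $q$ is the proposal distribution.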
The main problem is that the distribution $p(y_n \mid x_n)$ may be very peaky and far from being convex. For such a $p(y_n \mid x_n)$ the algorithm usually detects several local maxima instead of choosing the global one (see Deutscher and Reid [6]). This usually happens in high-dimensional problems, like body part tracking. In this case a large number of samples have to be taken in order to find the global maximum instead of a local one. The other problem that arises is that the approximation of $p(x_n \mid y_{1:n})$ for high-dimensional spaces is a computationally inefficient and hard task. Often a weighting function $w(y_n, x)$ can be constructed according to the likelihood function, as in the condensation algorithm of Isard and Blake [29], such that it provides a good approximation of $p(y_n \mid x)$ but is also relatively easy to calculate. Therefore, the problem becomes to find the configuration $x$ that maximizes the weighting function $w(y_n, x)$.
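The prediction and filtering steps can be illustrated with a minimal sketch. It assumes a hypothetical 1D random-walk state with Gaussian measurement noise, and it uses the transition prior as the proposal, so that after resampling the weight update reduces to the likelihood. The function names, noise levels, and model are illustrative assumptions, not the tracker used in this work.

```python
import math
import random


def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("all particle weights are zero")
    return [w / total for w in weights]


def resample(particles, weights, rng):
    """Draw len(particles) samples with replacement, proportionally to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))


def particle_filter_step(particles, weights, measurement, rng,
                         process_std=0.5, meas_std=1.0):
    """One prediction/filtering step for a hypothetical 1D random-walk model."""
    # Prediction: resample by the previous weights, then sample the
    # transition prior p(x_n | x_{n-1}), which serves as the proposal here.
    moved = [x + rng.gauss(0.0, process_std)
             for x in resample(particles, weights, rng)]
    # Filtering: with this proposal and uniform post-resampling weights,
    # the new weight is proportional to the likelihood p(y_n | x_n).
    likelihood = [math.exp(-0.5 * ((measurement - x) / meas_std) ** 2)
                  for x in moved]
    return moved, normalize(likelihood)


def estimate(particles, weights):
    """Weighted mean of the particle set."""
    return sum(x * w for x, w in zip(particles, weights))
```

With a more informed proposal the weight update would also need the ratio of transition prior to proposal density, which this simplified sketch omits.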
3.2. Annealed Particle Filter
The main idea is to use a set of weighting functions instead of a single one. While a single weighting function may contain several local maxima, the weighting functions in the set should be smoothed versions of it, and therefore contain a single maximum point, which can be detected using the regular particle filter.
A series of weighting functions $\{w_m(y_n, x)\}_{m=0}^{M}$ is used, where $w_m$ differs only slightly from $w_{m-1}$ and represents a smoothed version of it. The samples should be drawn from the $w_0$ function, which might be peaky, and therefore a large number of particles would be needed in order to find the global maximum. Therefore, $w_M$ is designed to be a very smoothed version of $w_0$. The usual method to achieve this is by using $w_m = (w_0)^{\beta_m}$, where $1 = \beta_0 > \beta_1 > \cdots > \beta_M$ and $w_0$ is equal to the original weighting function. Therefore, each iteration of the annealed particle filter algorithm consists of M steps, in each of which the appropriate weighting function is used and a set of pairs $\{x_{n,m}^{(i)}, \pi_{n,m}^{(i)}\}_{i=1}^{N_p}$ is constructed. Tracking is described in Algorithm 1.
Algorithm 1: The annealed particle filter algorithm.
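The effect of raising a weighting function to a power below one can be illustrated with a short sketch. The bimodal weighting function, the halving schedule for the exponents, and the shrinking diffusion noise are all hypothetical choices made for illustration; they are not the schedule used in the experiments.

```python
import math
import random


def base_weight(x):
    """Hypothetical bimodal weighting function: local peak at -2, global at 3."""
    return (0.6 * math.exp(-0.5 * ((x + 2.0) / 0.3) ** 2)
            + 1.0 * math.exp(-0.5 * ((x - 3.0) / 0.3) ** 2))


def annealed_weight(x, beta):
    """Smoothed weighting function: the original raised to the power beta."""
    return base_weight(x) ** beta


def annealing_betas(num_layers):
    """Exponents 1, 1/2, 1/4, ...; index 0 is the original function."""
    return [0.5 ** m for m in range(num_layers + 1)]


def peak_contrast(beta):
    """Ratio between the global peak and the valley between the two peaks."""
    return annealed_weight(3.0, beta) / annealed_weight(0.5, beta)


def annealed_search(num_particles, betas, rng, lo=-6.0, hi=6.0, noise=1.0):
    """Run the layers from the smoothest function down to the original one."""
    particles = [rng.uniform(lo, hi) for _ in range(num_particles)]
    for layer, beta in enumerate(reversed(betas)):
        weights = [annealed_weight(x, beta) for x in particles]
        survivors = rng.choices(particles, weights=weights, k=num_particles)
        # Diffuse the survivors with a variance that shrinks at every layer.
        sigma = noise * 0.5 ** layer
        particles = [x + rng.gauss(0.0, sigma) for x in survivors]
    return particles
```

Smaller exponents flatten the function, so early layers spread particles over both peaks, while later layers concentrate them where the original function is large.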
Figure 1 shows the illustration of the 5-layered
annealing particle filter. Initially the set contains many particles that
represent very different poses and therefore can fall into local maximum. On
the last layer all the particles are close to the global maximum, and therefore
they represent the correct pose.
Figure 1: Annealed particle filter illustration for M = 5. Initially the set contains many
particles that represent very different poses and therefore can fall into local
maximum. On the last layer all the particles are close to the global maximum,
and therefore they represent the correct pose.
4. Gaussian Fields
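The Gaussian process dynamical model (GPDM) (Lawrence [3], Wang et al. [4]) represents a mapping from the latent space to the data: $y = f(x)$, where $x \in \mathbb{R}^d$ denotes a vector in a d-dimensional latent space and $y \in \mathbb{R}^D$ is a vector that represents the corresponding data in a D-dimensional space. The model that is used to derive the GPDM is a mapping with first-order Markov dynamics:
$$x_t = \sum_i a_i \phi_i(x_{t-1}) + n_{x,t}, \qquad y_t = \sum_j b_j \psi_j(x_t) + n_{y,t}, \qquad (5)$$
where $n_{x,t}$ and $n_{y,t}$ are zero-mean Gaussian noise processes, $A = [a_1, a_2, \ldots]$ and $B = [b_1, b_2, \ldots]$ are weights, and $\phi_i$ and $\psi_j$ are basis functions.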
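From a Bayesian perspective, A and B should be marginalized out through model averaging. With an isotropic Gaussian prior on B, the marginalization can be done in closed form to yield
$$p(Y \mid X, \bar{\beta}) = \frac{|W|^N}{\sqrt{(2\pi)^{ND} |K_Y|^D}} \exp\left(-\frac{1}{2} \operatorname{tr}\left(K_Y^{-1} Y W^2 Y^T\right)\right), \qquad (6)$$
where W is a scaling diagonal matrix, Y is a matrix of training vectors, X contains the corresponding latent vectors, and $K_Y$ is the kernel matrix:
$$(K_Y)_{i,j} = k_Y(x_i, x_j) = \beta_1 \exp\left(-\frac{\beta_2}{2} \|x_i - x_j\|^2\right) + \frac{\delta_{x_i, x_j}}{\beta_3}. \qquad (7)$$
The scaling matrix W is used to account for the different variances in different data elements. The hyperparameter $\beta_1$ represents the scale of the output function, $\beta_2$ represents the inverse width of the radial basis function (RBF), and $\beta_3^{-1}$ represents the variance of $n_{y,t}$. For the dynamic mapping of the latent coordinates X, the joint probability density over the latent coordinates and the dynamics weights A is formed with an isotropic Gaussian prior over A. It can be shown (see Wang et al. [4]) that
$$p(X \mid \bar{\alpha}) = \frac{p(x_1)}{\sqrt{(2\pi)^{(N-1)d} |K_X|^d}} \exp\left(-\frac{1}{2} \operatorname{tr}\left(K_X^{-1} X_{\mathrm{out}} X_{\mathrm{out}}^T\right)\right), \qquad (8)$$
where $X_{\mathrm{out}} = [x_2, \ldots, x_N]^T$, $K_X$ is a kernel constructed from $[x_1, \ldots, x_{N-1}]^T$, and $x_1$ has an isotropic Gaussian prior. GPDM uses a “linear+RBF” kernel with parameters $\alpha_i$:
$$k_X(x, x') = \alpha_1 \exp\left(-\frac{\alpha_2}{2} \|x - x'\|^2\right) + \alpha_3 x^T x' + \alpha_4^{-1} \delta_{x, x'}. \qquad (9)$$
Following Wang et al. [4],
$$p(X, \bar{\alpha}, \bar{\beta} \mid Y) \propto p(Y \mid X, \bar{\beta})\, p(X \mid \bar{\alpha})\, p(\bar{\alpha})\, p(\bar{\beta}), \qquad (10)$$
the latent positions and hyperparameters are found by maximizing this distribution or, equivalently, minimizing the negative log posterior:
$$\mathcal{L} = \frac{d}{2} \ln |K_X| + \frac{1}{2} \operatorname{tr}\left(K_X^{-1} X_{\mathrm{out}} X_{\mathrm{out}}^T\right) + \sum_i \ln \alpha_i - N \ln |W| + \frac{D}{2} \ln |K_Y| + \frac{1}{2} \operatorname{tr}\left(K_Y^{-1} Y W^2 Y^T\right) + \sum_i \ln \beta_i. \qquad (11)$$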
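As an illustration of the two kernel forms used by the GPDM of Wang et al. [4], the following sketch builds a kernel matrix from a small set of latent points: an RBF kernel with an additive noise term for the latent-to-data mapping, and a “linear+RBF” kernel for the dynamics. The function names and the parameter names (`beta1` to `beta3`, `alpha1` to `alpha4`) are assumptions made for this example, and no hyperparameter learning is shown.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def rbf_kernel(x, x2, beta1, beta2, beta3, same_index):
    """RBF kernel plus a noise term that is added only on the diagonal."""
    noise = 1.0 / beta3 if same_index else 0.0
    return beta1 * math.exp(-0.5 * beta2 * sq_dist(x, x2)) + noise


def linear_rbf_kernel(x, x2, alpha1, alpha2, alpha3, alpha4, same_index):
    """'Linear+RBF' kernel: RBF term, linear term, and diagonal noise."""
    noise = 1.0 / alpha4 if same_index else 0.0
    dot = sum(u * v for u, v in zip(x, x2))
    return alpha1 * math.exp(-0.5 * alpha2 * sq_dist(x, x2)) + alpha3 * dot + noise


def kernel_matrix(points, kernel, params):
    """Evaluate the kernel on every pair of points."""
    return [[kernel(p, q, *params, i == j) for j, q in enumerate(points)]
            for i, p in enumerate(points)]
```

For example, `kernel_matrix(X, rbf_kernel, (2.0, 1.0, 4.0))` returns a symmetric matrix whose diagonal entries equal `beta1 + 1 / beta3`.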
5. GPAPF Filtering
5.1. The Model
In our work we use a model similar to the one proposed by Deutscher et al. [5], with some differences in the annealing schedule and weighting function. The body model is defined by a pair of parameter sets: the first holds the limb lengths, and the second holds the angles between the limbs and the global location of the body in 3D. The limb parameters are constant, and represent the actual size of the tracked person.
The angles represent the body pose and, therefore, are dynamic. The state is a
vector of dimensionality 29 : 3 DoF for the global 3D location, 3 DoF for the
global rotation, 4 DoF for each leg, 4 DoF for the torso, 4 DoF for each arm,
and 3 DoF for the head (see Figure 2). The whole tracking process estimates the
angles in such a way that the resulting body pose will match the actual pose.
This is done by maximizing the weighting function which is explained next.
Figure 2: (a) The 3D body model and (b) the samples drawn for the weighting function calculation. In (b) the blue samples are used to evaluate the edge matching, the cyan points are used to calculate
the foreground matching, the rectangles with the edges on the red points are
used to calculate the part-based body histogram.
5.2. The Weighting Function
In order to evaluate how well the body pose matches the actual pose using the particle filter tracker, we have to define a weighting function $w(\Gamma, Z)$, where $\Gamma$ is the model's configuration (i.e., angles) and $Z$ stands for the visual content (the captured images). The weighting function that we use is a version of the one suggested by Deutscher and Reid [6] with some modifications. We have experimented with 3 different features: edges, foreground silhouette, and foreground histogram.
The first feature is the edge map. As Deutscher and Reid [6] propose, this feature is the most important one, and provides a good outline for visible parts, such as arms and legs. The other important property of this feature is that it is invariant to the color and lighting conditions. The edge maps, in which each pixel is assigned a value dependent on its proximity to an edge, are calculated for each image plane. Each part is projected on the image plane and samples of the $N_e$ hypothesized edges of the human body model are drawn. A sum-squared difference function is calculated for these samples:
$$\Sigma^{e}(\Gamma, Z) = \frac{1}{N_{cv}} \frac{1}{N_e} \sum_{i=1}^{N_{cv}} \sum_{j=1}^{N_e} \left(1 - p_j^{e}(\Gamma, Z_i)\right)^2, \qquad (12)$$
where $\Gamma$ is the model configuration, $N_{cv}$ is the number of camera views, and $Z_i$ stands for the image from the ith camera. The $p_j^{e}(\Gamma, Z_i)$ are the edge map values at the sample points.
However, the problem that occurs when using this feature is that occluded body parts produce no edges. Even visible parts, such as the arms, may not produce edges, because of the color similarity between the part and the body. This causes the edge map values at the sample points to be close to zero and thus increases the squared difference function. Therefore, a good pose that represents the visual context well may be omitted. In order to overcome this problem, for each combination of image plane and body part we calculate a coefficient which indicates how well the part can be observed in this image. For each sample point on the model's edge we estimate the probability of its being covered by another body part. Let $N_i$ be the number of hypothesized edges that are drawn for the part i. The total number of drawn sample points can be calculated using $N_e = \sum_{i=1}^{N_{bp}} N_i$, where $N_{bp}$ is the total number of body parts in the model. The coefficient of part i for the image plane j can be calculated as follows:
(13)
where
is the model
configuration for part i and
is the value of
the foreground pixel map of the sample k. If a body part is occluded by
another one, then the value of
will be close
to one and therefore the coefficient of this part for the specific camera will
be low. We propose using the following function instead of sum-squared
difference function as presented in (12):
(14)
where
(15)
The second feature is the silhouette obtained by
subtracting the background from the image. The foreground pixel map is
calculated for each image plane with background pixels set to 0 and foreground
set to 1 and sum-squared difference function is computed:
(16)
where
is the value is
the foreground pixel map values at the sample points.
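A minimal sketch of a silhouette term of this kind is given below. It assumes the sum-squared difference form used by Deutscher and Reid [6], that is, the mean of the squared distance from one of the foreground values at the sample points, averaged over the camera views. The function name and the averaging over views are assumptions of this example.

```python
def silhouette_ssd(samples_per_view):
    """Mean squared distance from 1 of foreground values, averaged over views.

    samples_per_view holds one list per camera; each list contains the
    foreground map values (0 for background, 1 for foreground) read at the
    projected sample points of the hypothesized pose.
    """
    if not samples_per_view:
        raise ValueError("at least one camera view is required")
    total = 0.0
    for values in samples_per_view:
        if not values:
            raise ValueError("each view needs at least one sample point")
        total += sum((1.0 - v) ** 2 for v in values) / len(values)
    return total / len(samples_per_view)
```

A pose whose sample points all fall on the foreground yields 0, and a pose lying entirely on the background yields 1.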
The third feature is the foreground histogram. The
reference histogram is calculated for each body part. It can be a grey level
histogram or three separated histograms for color images, as shown in Figure 3. Then, on each frame a normalized histogram is calculated for
a hypothesized body part location and is compared to the referenced one. In
order to compare the histograms we have used the squared Bhattacharyya distance [30, 31], which provides a correlation measure between the model and the target
candidates:
(17)
where
(18)
and
is the value of
bin i of the body part bp on the view cv in the reference
histogram, and the
is the value of the corresponding bin on the current frame using the hypothesized body part
location.
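The histogram comparison can be sketched as follows, using the common definition of the squared Bhattacharyya distance as one minus the Bhattacharyya coefficient of two normalized histograms. The function names are hypothetical, and the summation over body parts and camera views is omitted.

```python
import math


def normalize_hist(counts):
    """Turn raw bin counts into a histogram that sums to one."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram is empty")
    return [c / total for c in counts]


def bhattacharyya_coefficient(p, q):
    """Sum over bins of sqrt(p_i * q_i); equals 1 for identical histograms."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def squared_bhattacharyya_distance(p, q):
    """One minus the Bhattacharyya coefficient, clipped at zero."""
    return max(0.0, 1.0 - bhattacharyya_coefficient(p, q))
```

Identical histograms give a distance of 0, and histograms with no overlapping bins give a distance of 1.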
Figure 3: The reference histograms of the torso: (a) red, (b) green, and (c) blue colors of the
reference selection.
The main drawback of that feature is that it is
sensitive to changes in the lighting conditions. Therefore, the reference
histogram has to be updated, using the weighted average from the recent
history.
In order to calculate the total weighting function the
features are combined together using the following formula:
(19)
As stated above, the goal of the tracking process is to maximize the weighting
function.
5.3. GPAPF Learning
The drawback in the particle filter tracker is that a high dimensionality of the state space
causes an exponential increase in the number of particles that are needed to be
generated in order to preserve the same density of particles. In our case, the
data dimension is 29D. In their work, Sigal et al. [7] show that the annealed
particle filter is capable of tracking body parts with 125 particles using 60 fps video input.
However, using a significantly lower frame rate (15 fps) causes the tracker to
produce bad results and eventually to lose the target.
The other problem of the annealed particle filter tracker
is that once a target is lost (i.e., the body pose was wrongly estimated, which
can happen for the fast and not smooth movements) it is highly unlikely that
the pose on the following frames will be estimated correctly.
In order to reduce the dimension of the space we
introduce Gaussian process annealed particle filter (GPAPF). We use a set of
poses in order to create a low-dimensional latent space. The latent space is
generated by applying nonlinear dimension reduction on the previously observed
poses of different motion types, such as walking, running, punching, and
kicking. We divide our state into two independent parts. The first part
contains the global 3D body rotation and translation parameters and is
independent of the actual pose. The second part contains only information
regarding the pose (26 DoF). We use Gaussian process dynamical model (GPDM) in
order to reduce the dimensionality of the second part and to construct a latent
space, as shown in Figure 4. GPDM is able to capture properties of high-dimensional
motion data better than linear methods such as PCA. This method generates a
mapping function from the low-dimensional latent space to the full data space.
This space has a significantly lower dimensionality (we have experimented with
2D or 3D). Unlike Urtasun et al. [28], whose latent state variables include
translation and rotation information, our latent space includes solely pose
information and is therefore rotation and translation invariant. This allows
using the sequences of the latent coordinates in order to classify different
motion types.
Figure 4: The latent space that is learned from different poses during the walking sequence. (a) The
2D space; (b) the 3D space. The brighter pixels (a) correspond to more precise
mapping.
We use a 2-stage algorithm. In the first stage a set of new particles is generated in the latent space. Then we apply the learned mapping function that transforms latent coordinates to the data space. As a result, after adding the translation and rotation information, we construct 31-dimensional vectors that describe a valid data state, which includes location and pose information, in the data space. In order to estimate how well the pose matches the images, the likelihood function, as described in the previous section, is calculated.
The main difficulty in this approach is that the
latent space is not uniformly distributed. Therefore, we use the dynamic model,
as proposed by Wang et al. [4], in order to achieve smoothed transitions
between sequential poses in the latent space. However, there are still some
irregularities and discontinuities. Moreover, while in a regular space the
change in the angles is independent of the actual angle value, in a latent
space this is not the case. Each pose has a certain probability to occur and
thus the probability to be drawn as a hypothesis should be dependent on it. For
each particle we can estimate the variance that can be used for generation of
the new ones. In Figure 4(a) the lighter pixels represent lower variance,
which depicts the regions of the latent space that produce more likely
poses.
Another advantage of this method is that the tracker is capable of recovering, after several frames, from poor estimations. The reason for this is that particles generated in the latent space represent valid poses more authentically. Furthermore, because of its low dimensionality, the latent space can be covered with a relatively small number of particles. Therefore, most of the possible poses will be tested, with emphasis on the poses that are close to the one that was retrieved in the previous frame. So if the pose was estimated correctly, the tracker will be able to choose the most suitable one from the tested poses. However, if the pose on the previous frame was miscalculated, the tracker will still consider poses that are quite different. As these poses are expected to get a higher value of the weighting function, the next layers of the annealing process will generate many particles using these different poses. As shown in Figure 5, the pose is in this way likely to be estimated correctly, despite the mistracking on the previous frame.
Figure 5: Losing and finding the tracked target despite the mistracking on the previous frame. (a) Frame 137, camera 1; (b) frame 138, camera 1; (c) frame 137, camera 4; (d) frame 138, camera 4.
In addition, the generated poses are, in most cases, natural. The large variance in the data space causes the generation of unnatural poses by the condensation or annealed particle filtering algorithms. In the introduced approach, the poses produced from latent space points with low variance are usually natural, as the whole latent space is constructed by learning from a set of valid poses. The unnatural poses correspond to points with large variance (black regions in Figure 4(a)) and, therefore, it is highly unlikely that they will be generated. Therefore, the effective number of particles is higher, which enables more accurate tracking.
As shown in Figure 4 the latent space is not
continuous. Two sequential poses may appear not too close in the latent space;
therefore, there is a minimal number of particles that should be drawn in order
to be able to perform the tracking.
The other drawback of this approach is that it requires more calculation than the regular annealed particle filter, due to the transformation from the latent space into the data space. However, as mentioned above, if the same number of particles is used, the number of effective poses is significantly higher in the GPAPF than in the original annealed particle filter. Therefore, we can reduce the number of particles for the GPAPF tracker, and thereby compensate for the additional calculations.
5.4. GPAPF Algorithm
As we have explained before, we use a 2-stage algorithm. The state consists of 2 statistically independent parts. The first one describes the body 3D location: the rotation and the translation (6 DoF). The second part describes the actual pose, that is, the latent coordinates of the corresponding point in the Gaussian space (generated as explained in Section 5.3). The second part usually has very few DoF (as mentioned before, we have experimented with 2- and 3-dimensional latent spaces). The first stage is the generation of new particles. Then we apply the learned transform function that transforms latent coordinates to the data space (25 DoF). As a result, after adding the translation and rotation information, we construct 31-dimensional vectors that describe a valid data state, which includes location and pose information, in the data space. Then the state is projected to the cameras in order to estimate how well it fits the images.
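Suppose we have M annealing layers. The state is defined as a pair $\Gamma = \{\Lambda, \Omega\}$, where $\Lambda$ is the location information and $\Omega$ is the pose information. We also define $\omega$ as the latent coordinates corresponding to the data vector $\Omega$: $\Omega = \wp(\omega)$, where $\wp$ is the mapping function learned by the GPDM. $\Lambda_{n,m}$, $\Omega_{n,m}$, and $\omega_{n,m}$ are the location, pose vector, and corresponding latent coordinates on frame n and annealing layer m. For each annealing layer m, $\Lambda_{n,m}$ and $\omega_{n,m}$ are generated by adding a multidimensional Gaussian random variable to the location and the latent coordinates of the preceding layer, respectively. Then $\Omega_{n,m}$ is calculated using $\Omega_{n,m} = \wp(\omega_{n,m})$. The full body state $\Gamma_{n,m} = \{\Lambda_{n,m}, \Omega_{n,m}\}$ is projected to the cameras and the likelihood $\pi_{n,m}$ is calculated using the likelihood function as explained in Section 5.2 (see Algorithm 2).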
Algorithm 2: The GPAPF algorithm.
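The generation of a single particle in the two-stage scheme can be sketched as follows. The mapping `toy_mapping` is a hypothetical stand-in for the mapping learned by the GPDM (a 1D latent phase mapped to three joint angles), and the noise levels are arbitrary; only the order of operations follows the description above.

```python
import math
import random


def toy_mapping(latent):
    """Hypothetical stand-in for the learned latent-to-pose mapping."""
    (phase,) = latent
    return [0.5 * math.sin(phase),
            0.5 * math.sin(phase + math.pi),
            0.2 * math.cos(phase)]


def perturb(vector, sigma, rng):
    """Add independent zero-mean Gaussian noise to every component."""
    return [v + rng.gauss(0.0, sigma) for v in vector]


def generate_particle(location, latent, rng, mapping=toy_mapping,
                      loc_sigma=0.05, latent_sigma=0.1):
    """Perturb location and latent coordinates, then build the full state."""
    new_location = perturb(location, loc_sigma, rng)   # rotation + translation
    new_latent = perturb(latent, latent_sigma, rng)    # stage 1: latent space
    pose = mapping(new_latent)                         # map to the data space
    full_state = new_location + pose                   # stage 2: add location
    return new_location, new_latent, full_state
```

The full state returned here is what would be projected onto the cameras and scored by the weighting function; that projection is outside the scope of this sketch.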
In the original annealed particle filter algorithm, the optimal configuration is obtained by calculating the weighted average of the particles in the last layer. However, as the latent space is not a Euclidean one, applying this method to the latent coordinates $\omega$ will produce poor results. The other method is choosing the particle with the highest likelihood as the optimal configuration $\omega_n = \omega_n^{(i_{\max})}$, where $i_{\max} = \arg\max_i \pi_n^{(i)}$. However, this is an unstable way to calculate the optimal pose: in order to ensure that there exists a particle which represents the correct pose, we have to use a large number of particles. Therefore, we propose to calculate the optimal configuration in the data space and then project it back to the latent space. At the first stage we apply the mapping $\wp$ learned by the GPDM to all the particles to generate vectors in the data space. Then in the data space we calculate the average of these vectors, weighted by the particle weights, and project it back to the latent space. It can be written as $\omega_n = \wp^{-1}\left(\sum_i \pi_n^{(i)} \wp(\omega_n^{(i)})\right)$.
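A toy example shows why averaging in the data space and projecting back can differ from averaging latent coordinates directly. Here the latent coordinate is an angle and the hypothetical mapping sends it to a point on the unit circle, which has a simple inverse; the real GPDM mapping and its inverse are far more involved, so this is only an illustration of the principle.

```python
import math


def to_data(theta):
    """Hypothetical mapping: latent angle to a point on the unit circle."""
    return (math.cos(theta), math.sin(theta))


def to_latent(point):
    """Inverse of the toy mapping, applied to any non-zero 2D vector."""
    return math.atan2(point[1], point[0])


def latent_average(latents, weights):
    """Weighted mean taken directly on the latent coordinates."""
    return sum(w * t for w, t in zip(weights, latents))


def data_space_estimate(latents, weights):
    """Weighted mean in the data space, projected back to the latent space."""
    points = [to_data(t) for t in latents]
    mean = (sum(w * p[0] for w, p in zip(weights, points)),
            sum(w * p[1] for w, p in zip(weights, points)))
    return to_latent(mean)
```

For two equally weighted particles at angles 3.0 and -3.0, which are neighbors on the circle, the latent average is 0 (the opposite side of the circle), while the data-space estimate lands near ±π, between the two particles.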
5.5. Towards More Precise Tracking
The problem with such a 2-stage approach is that the Gaussian field is not capable of describing all possible poses. As we have mentioned above, this approach resembles using probabilistic PCA in order to reduce the data dimensionality. However, for tracking we are interested in getting a pose estimate as close as possible to the actual one. Therefore, we add an additional annealing layer as the last step. This layer consists of only one stage. We use the data states that were generated on the previous 2-staged annealing layer, described in the previous section, in order to generate data states for the next layer. This is done with very low variances in all the dimensions, which are practically equal for all actions, as the purpose of this layer is to make only slight changes in the final estimated pose. Thus it does not depend on the actual frame rate, contrary to the original annealed particle filter tracker, where if the frame rate is changed one needs to update the model parameters (the variances for each layer).
The final scheme of each step is shown in Figure 6 and described in Algorithm 3. Suppose we have M annealing layers, as explained in Section 5.4; then we add one more single-staged layer. In this last layer the pose vector is calculated using only the pose vector of the preceding layer, without calculating the latent coordinates. Note also that the last layer has no influence on the quality of tracking in the following frames, as the latent coordinates from the last 2-staged layer are used for the initialization of the next layer.
Figure 7 shows the difference between the version without the additional annealing layer and the results after adding it. We have used 5 2-staged annealing layers in both cases. For the second tracker, we have added an additional single-staged layer. Figure 7 shows the error graphs produced by the two trackers. The error was calculated based on a comparison of the tracker's output and the result of the MoCap system. The comparison was suggested by Sigal et al. [7]. This is done by calculating the 3D distance between the locations of the different joints as estimated by the MoCap system and by the tracker. The joints that are used are hips, knees, and so forth. The distances are summed and multiplied by the weight of the corresponding particle. Then the sum of all the weighted distances is calculated, which is used as an error measurement. We can see that, for the walking sequence taken at 30 fps, the error produced by the GPAPF tracker with the additional annealing layer (blue circles on the graph) is lower than the one produced by the original GPAPF algorithm without it (red crosses on the graph). However, as we expected, the improvement is not dramatic. This is explained by the fact that the difference between the pose estimated using only the latent space annealing and the actual pose is not very big. That suggests that the latent space accurately represents the data space.
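The error measurement described above can be sketched in a few lines. The data layout (a list of 3D joint positions per particle, one weight per particle) and the function names are assumptions of this example; the joint set and units depend on the MoCap data.

```python
import math


def joint_distance_sum(estimated_joints, mocap_joints):
    """Sum of 3D Euclidean distances between corresponding joints."""
    return sum(math.sqrt(sum((a - b) ** 2 for a, b in zip(est, ref)))
               for est, ref in zip(estimated_joints, mocap_joints))


def weighted_tracking_error(particle_joints, weights, mocap_joints):
    """Sum over particles of weight times the summed joint distances."""
    return sum(w * joint_distance_sum(joints, mocap_joints)
               for joints, w in zip(particle_joints, weights))
```

For instance, with two particles of weights 0.75 and 0.25, where the first matches the MoCap joints exactly and the second is off by 3 and 4 units on its two joints, the error is 0.25 × 7 = 1.75.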
Algorithm 3: The GPAPF algorithm with the additional layer.
Figure 6: GPAPF with additional annealing layer graphical model. The black solid arrows
represent the dependencies between state and the visual data; the blue arrows
represent the dependencies between the latent space and the data space; dashed
magenta arrows represent the dependencies between sequential annealing layers;
the red arrows represent the dependencies of the additional annealing layer.
The green arrows represent the dependency between sequential frames.
Figure 7: The errors of the GPAPF tracker with the additional annealing layer (blue circles) and
without it (red crosses) for a walking sequence captured at 30 fps.
Figure 8: (a) and (b) GPAPF algorithm without the additional layer; (c) and (d) GPAPF algorithm with the additional layer.
We can also notice that the improved GPAPF has fewer peaks on the error graph. The peaks stem from the fact that the argmax function, which has been used to find the optimal configuration, is very sensitive to the location of the best fitting particle. In the improved version, we calculate the weighted average of all the particles. As we have seen in our experiments, there are often many particles with a weight close to the optimal one. Therefore, the result is less sensitive to the location of any particular particle; it depends on the whole set of them.
We have also tried to use the results produced by the additional layer in order to initialize the state in the next time step. This was done by applying the inverse mapping function, suggested by Lawrence and Candela [32], to the particles that were generated in the previous annealing layer. However, this approach did not produce any valuable improvement in the tracking results. As the inverse function is computationally heavy, it caused a significant increase in the calculation time. Therefore, we decided not to experiment with it further.
6. Results
We have tested the GPAPF tracking algorithm using the HumanEva dataset [33]. The sequences contain different activities, such as walking, boxing, and so forth, which were captured by 7 cameras; however, we have used only 4 inputs in our evaluation. The sequences were captured using the MoCap system, which provides the correct 3D locations of the body parts for evaluation of the results and comparison to other tracking algorithms.
The first sequence that we have used was a walk in a circle. The video was captured at a frame rate of 120 fps. We have tested the annealed particle filter-based body tracker, implemented by A. Balan, and compared the results with the ones produced by the GPAPF tracker. The error was calculated based on a comparison of the tracker's output and the result of the MoCap system, using the average distance between 3D joint locations, as explained in Section 5.5. Figure 10 shows the error graphs produced by the GPAPF tracker (blue circles) and by the annealed particle filter (red crosses) for the walking sequence taken at 30 fps. As can be seen, the GPAPF tracker produces a more accurate estimation of the body location. Similar results were achieved for 15 fps. Figure 9 presents sample images with the actual pose estimation for this sequence. The poses are projected to the first and second cameras. The first 2 rows show the results of the GPAPF tracker. The third and fourth rows show the results of the annealed particle filter.
Figure 9: Tracking results
of annealed particle filter tracker and GPAPF tracker. Sample frames from the
walking sequence. First row: GPAPF tracker, first camera. Second row: GPAPF
tracker, second camera. Third row: annealed particle filter tracker, first
camera. Fourth row: annealed particle filter tracker, second camera.
Figure 10: The errors of the annealed tracker (red crosses) and GPAPF tracker (blue circles) for a walking sequence captured at 30 fps.
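The error measure used for these graphs can be illustrated with a minimal sketch. It assumes the error for one frame is the plain Euclidean distance between corresponding estimated and MoCap joints, averaged over all joints; the exact joint set and units are those of Section 5.4 and are not reproduced here. The function name `mean_joint_error` and the sample coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]


def mean_joint_error(estimated: Sequence[Point3D],
                     reference: Sequence[Point3D]) -> float:
    """Average Euclidean distance between corresponding 3D joints.

    Assumption: both sequences list the same joints in the same order.
    """
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("joint lists must be non-empty and of equal length")
    total = sum(math.dist(e, r) for e, r in zip(estimated, reference))
    return total / len(estimated)


# Hypothetical two-joint example: one joint is off by 5 units, the other is exact.
estimated_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
mocap_pose = [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]
frame_error = mean_joint_error(estimated_pose, mocap_pose)
```

Evaluating this per frame and plotting the values against the frame index would give a curve of the kind shown in Figure 10.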
We experimented with between 100 and 2000
particles. With 100 particles per layer and 5 annealing layers, the
computational cost of GPAPF was 30 seconds per frame. The annealed particle
filter with the same number of particles and layers takes 20 seconds per
frame. However, the annealed particle filter was not able to
track the body pose with so few particles on the 30 fps and 15 fps videos. Therefore, we had to increase the number of particles used in the
annealed particle filter to 500.
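A rough cost comparison follows from these numbers. The sketch below assumes that, with the number of layers fixed at 5, the per-frame cost grows linearly with the number of particles. This assumption is consistent with the GPAPF timings reported in this section (30 seconds at 100 particles, 2.5 minutes at 500 particles). The figure for the annealed particle filter at 500 particles is an extrapolation under that assumption, not a measurement. The helper name `seconds_per_frame` is hypothetical.

```python
def seconds_per_frame(particles: int,
                      reference_particles: int,
                      reference_seconds: float) -> float:
    """Extrapolate per-frame cost, assuming it is linear in the particle count."""
    return reference_seconds * particles / reference_particles


# GPAPF: 30 s per frame at 100 particles (reported).
gpapf_100 = seconds_per_frame(100, 100, 30.0)
gpapf_500 = seconds_per_frame(500, 100, 30.0)  # 150 s, i.e. the reported 2.5 minutes

# Annealed particle filter: 20 s per frame at 100 particles (reported).
# The value at 500 particles is extrapolated, not measured.
apf_500 = seconds_per_frame(500, 100, 20.0)
```

Under this assumption, the annealed particle filter at the 500 particles it needs would cost roughly 100 seconds per frame, compared with 30 seconds for GPAPF at 100 particles.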
We also tried to compare our results with those
of the condensation algorithm. However, the condensation
algorithm either produced very poor results or required a very large number of
particles, which made it computationally impractical. Therefore, we
do not show the results of this comparison.
The second sequence was captured in our lab. In that
sequence we filmed similar behavior performed by a different actor. The
frame rate was 15 fps. For walking, the learning was done on the data of the first
sequence. The GPAPF tracker was able to track the person and produced
results similar to those obtained for the original
sequence.
We also experimented with sequences containing
other behaviors, such as leg movements, object lifting, clapping, and boxing.
We manually marked some of the sequences to produce the
training sets needed for GPDM. After learning, we ran the validation on
other sequences containing the same behavior. As shown in Figure 11, the
tracker successfully tracked these sequences. We experimented with between 100 and 2000 particles. With 100 particles, the computational cost was 30 seconds per frame. The results shown in the videos were produced
with 500 particles (2.5 minutes per frame). The code that we are using is
written in Matlab with no optimization packages. Therefore, the computational
cost could be reduced significantly by moving to C libraries.
Figure 11: Tracking results of annealed particle filter tracker and
GPAPF tracker. Sample frames from the running, leg movements and object lifting
sequences.
7. Conclusion and Future Work
We have presented an approach that uses GPDM to reduce the dimensionality and
thereby improve the ability of the annealed particle filter tracker to
track the object even in a high-dimensional space. We have also shown that
using GPDM can increase the ability to recover from temporary target loss. We
have also presented a method to approximate the possibility of self-occlusion,
and we have suggested a way to adjust the weighting function for such cases in
order to produce a more accurate evaluation of a pose.
The main limitation is that the learning and tracking are
done for a specific action. The ability of the tracker to use a latent space to
track a different motion type has not been shown yet. A possible
approach is to construct a common latent space for poses from different
actions. The difficulty with such an approach may be the presence of a large
number of gaps between consecutive poses. In the future we plan to extend
the approach so that it can track different activities using the same
learned data.
Another challenging task is to track two or more
people simultaneously. The main problem in this case is the high
probability of occlusion. Furthermore, while for a single person each body part
can be seen from at least one camera, that is not the case for crowded
scenes.
References
- L. Raskin, E. Rivlin, and M. Rudzsky, “3D human tracking with Gaussian process annealed particle filter,” in Proceedings of the 2nd International Conference on Computer Vision Theory and Applications (VISAPP '07), vol. 2, pp. 459–465, Barcelona, Spain, March 2007.
- L. Raskin, M. Rudzsky, and E. Rivlin, “GPAPF: a combined approach for 3D body part tracking,” in Proceedings of the 5th International Conference on Computer Vision Systems (ICVS '07), Bielefeld University, Germany, March 2007.
- N. D. Lawrence, “Gaussian process models for visualisation of high dimensional data,” in Advances in Neural Information Processing Systems (NIPS), vol. 16, pp. 329–336, 2004.
- J. Wang, D. J. Fleet, and A. Hertzmann, “Gaussian process dynamical models,” in Proceedings of the 19th Annual Conference on Neural Information Processing Systems (NIPS '05), pp. 1441–1448, Vancouver, BC, Canada, December 2005.
- J. Deutscher, A. Blake, and I. Reid, “Articulated body motion capture by annealed particle filtering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '00), vol. 2, pp. 126–133, Hilton Head Island, SC, USA, June 2000.
- J. Deutscher and I. Reid, “Articulated body motion capture by stochastic search,” International Journal of Computer Vision, vol. 61, no. 2, pp. 185–205, 2005.
- A. O. Bălan, L. Sigal, and M. J. Black, “A quantitative evaluation of video-based 3D person tracking,” in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS '05), pp. 349–356, Beijing, China, October 2005.
- Y. Song, X. Feng, and P. Perona, “Towards detection of human motion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '00), vol. 1, pp. 810–817, Hilton Head Island, SC, USA, June 2000.
- S. Ioffe and D. Forsyth, “Human tracking with mixtures of trees,” in Proceedings of the 8th IEEE International Conference on Computer Vision (ICCV '01), vol. 1, pp. 690–695, Vancouver, BC, Canada, July 2001.
- G. Mori and J. Malik, “Estimating human body configurations using shape context matching,” in Proceedings of the 7th European Conference on Computer Vision (ECCV '02), vol. 3, pp. 134–141, Copenhagen, Denmark, May 2002.
- H. Sidenbladh, M. J. Black, and D. J. Fleet, “Stochastic tracking of 3D human figures using 2D image motion,” in Proceedings of the 6th European Conference on Computer Vision (ECCV '00), vol. 2, pp. 702–718, Dublin, Ireland, June-July 2000.
- A. J. Davison, J. Deutscher, and I. D. Reid, “Markerless motion capture of complex full-body movement for character animation,” in Proceedings of the Eurographics Workshop on Computer Animation and Simulation, pp. 3–14, Manchester, UK, September 2001.
- A. Agarwal and B. Triggs, “Learning to track 3D human motion from silhouettes,” in Proceedings of the 21st International Conference on Machine Learning (ICML '04), pp. 9–16, Banff, Alberta, Canada, July 2004.
- A. Agarwal and B. Triggs, “3D human pose from silhouettes by relevance vector regression,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 882–888, Washington, DC, USA, June-July 2004.
- D. Ramanan and D. A. Forsyth, “Automatic annotation of everyday movements,” in Proceedings of the 15th Annual Conference on Neural Information Processing Systems (NIPS '03), Vancouver, BC, Canada, December 2003.
- M. Isard and J. MacCormick, “BraMBLe: a Bayesian multiple-blob tracker,” in Proceedings of the 8th IEEE International Conference on Computer Vision (ICCV '01), vol. 2, pp. 34–41, Vancouver, BC, Canada, July 2001.
- C. Bregler and J. Malik, “Tracking people with twists and exponential maps,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '98), pp. 8–15, Santa Barbara, Calif, USA, June 1998.
- H. Sidenbladh, M. J. Black, and L. Sigal, “Implicit probabilistic models of human motion for synthesis and tracking,” in Proceedings of the 7th European Conference on Computer Vision (ECCV '02), vol. 1, pp. 784–800, Copenhagen, Denmark, May 2002.
- K. Mikolajczyk, C. Schmid, and A. Zisserman, “Human detection based on a probabilistic assembly of robust part detectors,” in Proceedings of the 8th European Conference on Computer Vision (ECCV '04), vol. 1, pp. 69–82, Prague, Czech Republic, May 2004.
- K. Rohr, “Human movement analysis based on explicit motion models,” in Motion-Based Recognition, pp. 171–198, 1997, chapter 8.
- D. Ormoneit, H. Sidenbladh, M. Black, and T. Hastie, “Learning and tracking cyclic human motion,” in Advances in Neural Information Processing Systems 13, pp. 894–900, 2001.
- A. Elgammal and C.-S. Lee, “Inferring 3D body pose from silhouettes using activity manifold learning,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 681–688, Washington, DC, USA, June-July 2004.
- Q. Wang, G. Xu, and H. Ai, “Learning object intrinsic structure for robust visual tracking,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '03), vol. 2, pp. 227–233, Madison, Wis, USA, June 2003.
- J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
- S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
- R. Urtasun and P. Fua, “3D human body tracking using deterministic temporal motion models,” in Proceedings of the 8th European Conference on Computer Vision (ECCV '04), vol. 3, pp. 92–106, Prague, Czech Republic, May 2004.
- R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua, “Priors for people tracking from small training sets,” in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 1, pp. 403–410, Beijing, China, October 2005.
- R. Urtasun, D. J. Fleet, and P. Fua, “3D people tracking with Gaussian process dynamical models,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 1, pp. 238–245, New York, NY, USA, June 2006.
- M. Isard and A. Blake, “CONDENSATION—conditional density propagation for visual tracking,” International Journal of Computer Vision, vol. 29, no. 1, pp. 5–28, 1998.
- D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 564–577, 2003.
- P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, “Color-based probabilistic tracking,” in Proceedings of the 7th European Conference on Computer Vision (ECCV '02), pp. 661–675, Copenhagen, Denmark, May 2002.
- N. D. Lawrence and J. Quiñonero-Candela, “Local distance preservation in the GP-LVM through back constraints,” in Proceedings of the 23rd International Conference on Machine Learning (ICML '06), pp. 513–520, Pittsburgh, Pa, USA, June 2006.
- L. Sigal and M. J. Black, “HumanEva: synchronized video and motion capture dataset for evaluation of articulated human motion,” Tech. Rep. CS-06-08, Brown University, Providence, RI, USA, 2006.