Abstract
This paper presents a robust computational framework for monocular 3D tracking of human movement. The main innovation of the proposed framework is to explore the underlying data structures of the body silhouette and pose spaces by constructing low-dimensional silhouettes and poses manifolds, establishing intermanifold mappings, and performing tracking in such manifolds using a particle filter. In addition, a novel vectorized silhouette descriptor is introduced to achieve low-dimensional, noise-resilient silhouette representation. The proposed articulated motion tracker is view-independent, self-initializing, and capable of maintaining multiple kinematic trajectories. By using the learned mapping from the silhouette manifold to the pose manifold, particle sampling is informed by the current image observation, resulting in improved sample efficiency. Decent tracking results have been obtained using synthetic and real videos.
1. Introduction
Reliable recovery and tracking of articulated human motion from video are
considered a very challenging problem in computer vision, due to
the versatility of human movement, the variability of body types, various
movement styles and signatures, and the 3D nature of human body. Vision-based
tracking of articulated motion is a temporal inference problem. There exist
numerous computational frameworks addressing this problem. Some of the
frameworks make use of training data (e.g., [1]) to inform the tracking, while some attempt to
directly infer the articulated motion without using any training data (e.g., [2]). When training
data is available, the articulated motion tracking can be cast into a
statistical learning and inference problem. Using a set of training examples, a learning and inference
framework needs to be developed to track both seen and unseen movements
performed by known or unknown subjects. In terms of the learning and inference
structure, existing 3D tracking algorithms can be roughly clustered into two
categories, namely, generative-based and discriminative-based approaches.
Generative-based approaches, for example [2–4], usually assume the
knowledge of a 3D body model of the subject and dynamical models of the related
movement, from which kinematic predictions and corresponding image observations
can be generated. The movement dynamics are learned from training
examples using various dynamic system models, for example, autoregressive
models [5], hidden
Markov models [6],
Gaussian process dynamical models [1], and piecewise linear models in the form of a mixture
of factor analyzers [7]. A recursive filter is often deployed to temporally
propagate the posterior distribution of the state. Especially, particle filters
have been extensively used in movement tracking to handle nonlinearity in both
the system observation and the dynamic equations. Discriminative-based
approaches, for example [8–13], treat kinematics recovery from images as a
regression problem from the image space to the body kinematics space. Using
training data, the relationship between image observation and body poses is
obtained using machine-learning techniques. When compared against each other,
both approaches have their own pros and cons. In general, generative-based
methods utilize movement dynamics and produce more accurate tracking results,
although they are more time consuming, and usually the conditional distribution
of the kinematics given the current image observation is not utilized directly.
On the other hand, discriminative-based methods learn such conditional
distributions of kinematics given image observations from training data and
often result in fast image-based kinematic inference. However, movement
kinematics are usually not fully explored by discriminative-based methods.
Thus, the rich temporal correlation of body kinematics between adjacent frames
is unused in tracking.
In this paper, we present a 3D tracking framework that
integrates the strengths of both generative and discriminative approaches. The
proposed framework explores the underlying low-dimensional manifolds of
silhouettes and poses using nonlinear dimension reduction techniques such as
Gaussian process latent variable models (GPLVM) [14] and Gaussian process dynamic
models (GPDM) [15].
Both Gaussian process models have been used for people tracking [1, 16–18]. The Bayesian mixture of experts (BME) and relevance
vector machine (RVM) are then used to construct bidirectional mappings between
these two manifolds, in a manner similar to [10]. A particle filter defined over the pose manifold is
used for tracking. Our proposed tracker is self-initializing and capable of
tracking multiple kinematic trajectories due to the BME-based multimodal
silhouette-to-kinematics mapping. In addition, because of the bidirectional
inter-manifold mappings, the particle filter can draw kinematic samples using
the current image observation, and evaluate sample weights without projecting a
3D body model. To overcome noise present in silhouette images, a
low-dimensional vectorized silhouette descriptor is introduced based on
Gaussian mixture models. Our proposed framework has been tested using both
synthetic and real videos with different subjects and movement styles from the
training. Experimental results show the efficacy of the proposed method.
1.1. Related Work
Among existing methods on integrating generative-based
and discriminative-based approaches for articulated motion tracking, the 2D
articulated human motion tracking system proposed by Curio and Giese [19] is the most revelent to our
framework. The system in [19] conducts dimension reduction in both image and pose
spaces. Using training data, one-to-many support vector regression (SVR) is
learned to conduct view-based pose estimation. A first-order autoregressive
(AR) linear model is used to represent state dynamics. A competitive particle
filter defined over the hidden state space is deployed to select plausible
branches and propagate state posteriors over time. Due to SVR, this system is
capable of autonomous initialization. It draws samples using both current
observation and state dynamics. However, there are four major differences
between the approach in [19] and our proposed framework. Essentially, [19] presents a tracking system
for 2D articulated motion, while our framework is for 3D tracking. In
addition, In [19] a 2D
patch-model is used to obtain the predicted image observation, while in our
proposed framework this is done through nonlinear regression without using any
body models. Furthermore, during the initialization stage of the system in
[19], only the best
body configuration obtained from the view-based pose estimation and the
model-based matching is used to initialize the tracking. It is obvious that
using a single initial state has the risk of missing other admissible solutions
due to the inherent ambiguity. Therefore, in our proposed system multiple
solutions are maintained in tracking. Finally, BME is used in our proposed
framework for view-based pose estimation instead of SVR as in [19]. BME has been used for
kinematic recovery [10].
In summary, our proposed framework can be considered as an extension of the
system in [19] to
better address the integration of generative-based and discriminative-based
approaches in the case of 3D tracking of human movement, with the advantages of
tracking multiple possible pose trajectories over time and removing the
requirement of a body model to obtain predicted image observations.
Dimension reduction of the image silhouette and pose
spaces has also been investigated using kernel principle component analysis
(KPCA) [12, 20] and probabilistic PCA
[13, 21]. In [7, 22], a mixture of factor analyzers
is used to locally approximate the pose manifold. Factor analyzers perform
nonlinear dimension reduction and data clustering concurrently within a global
coordinate system, which makes it possible to derive an efficient multiple
hypothesis tracking algorithm based on distribution modes. Recently, nonlinear
probabilistic generative models such as GPLVM [14] have been used to represent
the low-dimensional full body joint data [16, 23] and upper body joints [24] in a probabilistic
framework. Reference [16] introduces the scaled GPLVM to learn dynamical models
of human movements. As variants of GPLVM, GPDM [15, 25], and balanced GPDM
[1] have shown to be
able to capture the underlying dynamics of movement, and at the same time to
reduce the dimensionality of the pose space. Such GPLVM-based movement
dynamical models have been successfully used as priors for tracking of various
types of movement, including walking [1] and golf swing [16]. Recently, [26] presents a hierarchical GPLVM to explore the
conditional independencies, while [27] extends GPDM into a multifactor analysis framework
for style-content separation. In our proposed framework, we follow the balanced
GPDM presented in [1]
to learn movement dynamics due to its simplicity and demonstrated ability to
model human movement. Furthermore, we adopt GPLVM to construct the silhouette
manifold using silhouette images from different views, which has been shown to
be promising in our experiments. Additional results using GPLVM for 3D tracking
have been reported recently. In [18], a real-time body tracking framework is presented
using GPLVM.
Since image observations and body poses of the same
movement essentially describe the same physical phenomenon, it is reasonable to
learn a joint image-pose manifold. In [17] GPLVM has been used to obtain a joint silhouette and
pose manifold for pose estimation. Reference [28] presents a joint learning
algorithm for a bidirectional generative-discriminative model for 2D people
detection and 3D human motion reconstruction from static images with cluttered
background by combining the top-down (generative-based) and bottom-up
(discriminative-based) processings. The combination of top-down and bottom-up
approaches in [28] is
promising for solving simultaneous people detection and pose recovery in
cluttered images. However, the emphasis of [28] is on parameter learning of the bidirectional model
and movement dynamics are not considered. Comparing
with [17, 28], the separate kinematics
and silhouette manifold learning is a limitation of our proposed framework.
View-independent tracking and handling of ambiguous
solutions are critical for monocular-based
tracking. To tackle this challenge, [29] represents shape deformations according to view and
body configuration changes on a 2D torus manifold. A nonlinear mapping is then
learned between torus manifold embedding and visual input using empirical
kernel mapping. Reference [30] learned a clustered exemplar-based dynamic model for
viewpoint invariant tracking of the 3D human motion from a single camera. This
system can accurately track large movements of the human limbs. However,
neither of the above approaches explicitly considers multiple solutions and
only one kinematic trajectory is tracked, which results in an incomplete
description of the posterior distribution of poses. To handle the multimodal
mapping from the visual input space to the pose space, several approaches
[10, 31, 32] have been proposed. The
basic idea is to split the input space into a set of regions and approximate a
separate mapping for each individual region. These regions have soft boundaries, meaning that data points may lie simultaneously in multiple regions
with certain probabilities. The mapping in [31] is based on the joint probability distribution of
both the input and the output data. An inverse mapping function is used to
formulate an efficient inference. In [10, 32], the conditional distribution of the output given the
input is learned in the framework of mixture of experts. Reference [32] also uses the joint
input-output distribution and obtains the conditional distribution using the
Bayes rule while [10]
learns the conditional distribution directly. In our proposed framework, we
adopt the extended BME model [33] and use RVM as experts [10] for multimodal regression.
A related work that should be mentioned here is the extended multivariate RVM
for multimodal multidimensional 3D body tracking [8]. Impressive full body
tracking results of human movement have been reported in [8].
Another highlight of our proposed system is that
predicted visual observations can be obtained directly from a pose hypothesis
without projecting a 3D body model. This feature allows efficient likelihood
and weight evaluation in a particle filtering framework. The 3D-model-free
approaches for image silhouette synthesis from movement data reported in
[34, 35] are most related to our
proposed approach. The main difference is that our approach achieves visual
prediction using RVM-based regression, while in [34, 35] multilinear analyis
[36] is used for
visual synthesis.
2. System Architecture
An overview of the architecture of our proposed system is presented in Figure 1, consisting of
a training phase and a tracking phase.
Figure 1: An overview of the proposed framework, (a): training phase; (b): tracking phase.
The training phase contains training data preparation
and model learning. In data preparation, synthetic images are rendered using
animation software from motion capture data, for example, Maya. The
model-learning process has five major steps as shown in Figure 1(a). In the
first step, key frames are selected from synthetic images using
multidimensional scaling (MDS) [37, 38] and
-means. In the second step, silhouettes in the
training data are then be vectorized according to its distances to these key
frames. Then in the following step, GPLVM is used to construct the
low-dimensional manifold
of the image silhouettes from multiple views
using their vectorized descriptors. The fourth step is to reduce dimensionality
of the pose data and obtain a related motion dynamical model. GPDM is used to
obtain the manifold
of full-body pose angles. This latent space is
then augmented by the torso orientation space
to form the complete pose latent space
. Finally in the last step, the forward and backward nonlinear mappings between
to
are constructed in the learning phase. The
forward mapping from
to
is established using RVM, which will be used
to efficiently evaluate sample weights in the tracking phase. The multimodal
(one-to-many) backward mapping from
to
is obtained using BME.
The essence of tracking in our proposed framework is
the propagation of weighted movement particles in
based on the image observation up to the
current time instant and learned movement dynamic models. In tracking, the body
silhouette is first extracted from an input image and then vectorized. Using
the learned GPLVM, its corresponding latent position is found in
. Then BME is invoked to find a few plausible pose estimates in
. Movement samples are drawn according to both the BME outputs and learned GPDM.
The sample weights are evaluated according to the distance between the observed
and predicted silhouettes. The empirical posterior distributions of poses are
then obtained as the weighted samples. The details of the learning and tracking
steps are described in the following sections.
3. Preparation of Training Data
To learn various models in the proposed framework, we need to construct training data
sets including complete pose data (body joint angles, torso orientation), and
the corresponding images. In our experiments, we focus on the tracking of gait.
Three walking sequences (
) from different subjects were taken from CMU
motion capture database [39], with each sequence containing two gait cycles. These
sequences were then downsampled by a factor of 4, constituting 226 motion
capture frames in total. There are 56 original local joint angles in the
original motion capture data. Only 42 major joint angles are used in our
experiments. This set of local joint angles is denoted as
.
To synthesize multiple views of one body pose defined
by a frame of motion capture data, sixteen frames complete pose data were
generated by augmenting the local joint angles with 16 different torso
orientation angles. To obtain silhouettes from diverse view points, these
orientation angles are randomly altered from frame to frame. Given one frame of
motion capture data, these 16 torso orientation angles were selected as
follows. A circle centered at the body centroid in the horizontal plane of the
human body can be found. To determine the 16 body orientation angles, this
circle is equally divided into 16 parts, corresponding to 16 cameras views. In
each camera view, an angle is uniformly drawn in an angle interval of 22.5°. Hence for each given motion capture frame, there are 16 complete pose
frames with different torso orientation angles, resulting 3616 (
) complete pose frames in total. This training
set of complete poses is denoted as
.
Using
, corresponding silhouettes were generated using animation software. We denote
this silhouette training set
. Three different 3D models (one female and two males) were used for each subject
to obtain a diverse silhouette set with varying appearances.
4. Image Feature Representation
4.1. GMM-Based Silhouette Descriptor
Assume that silhouettes can be extracted from images using background subtraction and
refined by morphological operation. The remaining question is how to represent
the silhouette robustly and efficiently. Different shape descriptors have been
used to represent silhouettes. In [40], Fourier descriptor, shape context, and Hu moments
were computed from silhouettes and their resistance to variations in body
built, silhouette extraction errors, and viewpoints were compared. It is shown
that both Fourier descriptor and shape context perform better than the Hu
moment. In our approach, Gaussian mixture models (GMM) are used to represent
silhouettes and it performs better than shape context descriptor. We have used
GMM-based shape descriptor in our previous work on single-image-based pose
inference [41].
GMM assumes that the observed unlabeled data is
produced by a number of Gaussian distributions. The basic idea of GMM-based
silhouette descriptor is to consider a silhouette as a set of coherent regions
in the 2D space such that the foreground pixel locations are generated by a
GMM. Strictly speaking, foreground pixel locations of a silhouette do not
exactly follow the Gaussian distribution assumption. Actually a uniform
distribution confined to a closed area given by the silhouette contour would be
a much better choice. However, due to its simplicity, GMM is selected in the
proposed framework to represent silhouettes. From Figure 2, we can see that the
GMM can model the distribution of the silhouette pixels well. It has good
locality to improve the robustness compared the global descriptor such as shape
moment. The reconstructed silhouette points look very similar to the original
silhouette image.
Figure 2: (a): the original silhouette, (b): learned Gaussian
mixture components using EM, (c): point samples drawn such a GMM.
Given a silhouette, the GMM parameters can be obtained
using an EM algorithm. Initial data clustering can be done using the
-means algorithm. The full covariance matrices
of the Gaussian are estimated. In our implementation, a GMM with 20 components
is used to represent one silhouette. It takes about 600 milliseconds to extract
the GMM parameters from an input silhouette (
120 pixel-high) using Matlab.
4.2. KLD-Based Similarity Measure
It is critical to measure the similarities between silhouettes. Based on the GMM descriptor,
the Kullback-Leibler divergence (KLD) is used to compute the distance between
two silhouettes. Similar approaches have been taken for GMM-based image
matching for content-based image retrieval [42]. Given two distributions
and
, the KLD from
to
is
(1) The symmetric
version of the KLD is given by
(2)
In our implementation, such symmetric KLD is used to compute the distance between two silhouettes and the
KLDs are computed using a sampling-based method.
GMM representation can handle noise and small shape
model differences. For example, Figure 3 has three columns of images. In each
column, the bottom image is a noisy version of the top image. The KLD between
the noisy and clean silhouettes in the left, middle, and right columns are
0.04, 0.03, and 0.1, respectively. They are all below 0.3, which is an
empirical KLD threshold indicating similar silhouettes. This threshold was obtained
according to our experiments running over a large number of image silhouettes
of various movements and dance poses.
Figure 3: Clean (top row) and noisy silhouettes of
some dance poses.
4.3. Vectorized Silhouette Descriptor
Although GMM and KLD can represent silhouettes and compute their similarities,
sampling-based KLD computation between two silhouettes is slow, which harms the
scalability of the proposed method when a large number of training data is
used. To overcome this problem, in the proposed framework a vectorization of
the GMM-based silhouette descriptor is introduced. The nonvectorized GMM-based
shape descriptor has been used in our previous work on single-image-based pose
inference [41]. Vector
representation of silhouette is critical since it will simplify and expedite
the GPLVM-based manifold learning and mapping from silhouette space to its
latent space.
To obtain a vector representation for our GMM
descriptor, we use the relative distances of one silhouette to several key
silhouettes to locate this point in the silhouette space. The distance between
this silhouette and each key silhouette is one element in the vector. The
challenge here is to determine how many of them will be sufficient and how to
select these key frames.
In our propose framework, we first use MDS [37, 38] to estimate the underlying
dimensionality of the silhouette space. Then the
-means algorithm is used to cluster training
data and locate the cluster centers. Silhouettes that are the closest to these
cluster centers are then selected as our key frames. Given training data, the
distance matrix
of all silhouettes is readily computed using
KLD. MDS is a nonlinear dimension reduction method if one can obtain a good
distance measure. An excellent review of MDS can be found in [37, 38]. Following MDS,
can be computed. When
is a distance matrix of a metric space (e.g.,
symmetric, nonnegative, satisfying triangle inequality),
is positive semidefinite (PSD), and the
minimal embedding dimension is given by the rank of
. Here
is the centering matrix, where
is the number of training data and
is an
matrix of all ones. Due to observation noise
and errors introduced in the sampling-based KLD calculation, the KLD matrix
we obtained is only an approximate distance
matrix and
might not be purely PSD in practice. In our
case, we just ignored the negative eigenvalues of
and only considered the positive ones. Using
the 3616 training samples in
described in Section 3, 45 dimensions are kept
to count over
of the energy in the positive eigenvalues. To
remove a representation ambiguity, distances from 46 key frames are needed to
locate a point in a 45-dimensional space. To select these key frames, all the
training silhouettes are clustered into 46 groups using the
-means algorithm. The closest silhouette to
the center of each cluster is chosen as the key silhouette. Some of these 46
key frames are shown in Figure 4. Given these key silhouettes, we obtain the
GMM vector representation as
,
where
is the KLD distance between this silhouette
and the
th key silhouette.
Figure 4: Some of the 46 key frames
selected from the training samples.
4.4. Comparison with Other Common Shape Descriptors
To validate the proposed vectorized silhouette representation based on GMM, extensive
experiments have been conducted to compare GMM descriptor, vectorized GMM
descriptor, shape context, and the Fourier descriptor. To produce shape context
descriptors, a code book of the 90-dimensional shape context vectors is
generated using the 3616 walking silhouettes from different views in
described in Section 3. Two hundred points are
uniformly sampled on the contour. Each point has a shape context (5 radial, 12
angular bins, size range 1/8 to 3 on log scale). The code book center is
clustered from shape context of all sampling points. To compare these four
types of shape descriptor, distance matrices between silhouettes of a walking
sequence are computed based on these descriptors. This sequence has 149 side
views of a person walking parallel to a fixed camera over about two and half
gait cycles (five steps). The four distance matrices are shown in Figure 5. All
distance matrices are normalized with respect to the corresponding maxima. Dark
blue pixels indicate small distances. Since the input is a side-view walking
sequence, significant inter-frame similarity is
presented, which results in a periodic pattern in the distance
matrices. This is caused by both repeated movement in different gait cycles and
the half cycle ambiguity in a side-view walking sequence in the same or
different gait cycles (e.g., it is hard to tell the left arm from the right arm
from a side-view walking silhouette even for humans). Figure 6 presents the
distance values from the 10th frame to the remaining frames according to the
four different shape descriptors. It can be seen from Figure 5 that the
distance matrix computed using KLD based on GMM (Figure 5(a)) has the clearest
pattern as a result of smooth similarity measure as shown by Figure 6(a). The
continuity of the vectorized GMM is slightly deteriorated comparing to the
original GMM. However, it is still much better than that of the shape context
as shown by Figures 5(b), 5(c), 6(b),
and 6(c). The Fourier descriptor is the least
robust among the four shape descriptors. It is difficult to locate similar
poses (i.e., find the valleys in Figure 6). This is because the outer contour of
a silhouette can change suddenly between successive frames. Thus, the Fourier
descriptor is discontinuous over time. Other than these four descriptors, the
columnized vector of the raw silhouette is actually also a reasonable shape
descriptor. However, the huge dimensionality (
1000) of the raw silhouette makes the
dimension reduction using GPLVM very time consuming and thus computationally
prohibitive.
Figure 5:
Distance matrices of a 149-frame sequence of side-view
walking silhouettes computed using (a) GMM, (b) vectorized GMM using 46 key
frames, (c) shape context, and (d) Fourier descriptor.
Figure 6:
Distances between the 10th frame of the side-view
walking sequence and all the other frames computed using (a) GMM, (b)
vectorized GMM using 46 key frames, (c) shape context, and (d) Fourier
descriptor.
To take a close look at the smoothness of the three
shape descriptors, original GMM, vectorized GMM, and shape context, we examine
the resulting manifolds after dimension reduction and dynamic learning using
GPDM. A smooth trajectory of latent point in the manifold indicates smoothness
of the shape descriptor. Figure 7 shows three trajectories corresponding to
these three shape descriptors. It can be seen that the vectorized GMM has a
smoother trajectory than that of the shape context, which is consistent to our
findings based on distance matrices.
Figure 7:
Movement trajectories of 73 frames of side-view
walking silhouette in the manifold learned using GPDM from three shape
descriptors, including (a) GMM, (b) vectorized GMM using 46 key frames, and
(c) shape context.
5. Dimension Reduction and Dynamic Learning
5.1. Dimension Reduction of Silhouettes Using GPLVM
GPLVM [43] provides a
probabilistic approach to nonlinear dimension reduction. In our proposed
framework, GPLVM is used to reduce the dimensionality of the silhouettes and to
recover the structure of silhouettes from different views. A detailed tutorial
on GPLVM can be found in [14]. Here we briefly describe the basic idea of the GPLVM
for the sake of completeness.
Let
be a set of
-dimensional data points and
be the
-dimensional latent points associated with
.
Assume that
is already centered and
.
and
are related by the following regression
function,
(3)where
and the weight vector
.
's are a set of basis functions. Given
, each dimension of
is a Gaussian process. By assuming
independence among different dimensions of
, the marginalized distribution of
over
given
is
(4)where
is the gram matrix of the
's. The goal in GPLVM is to find
and the parameters that maximize the marginal
distribution of
. The resulting
is thus considered as a low-dimensional
embedding of
. By using the kernel trick, instead of defining what
is, one can simply
define a kernel function over
and compute
so that
.
By using a nonlinear kernel function, one introduces a nonlinear dimension
reduction. In our approach, the following radial basis fundtion (RBF) kernel is
used:
(5)where
is the overall scale of the output,
is the inverse width of the RBFs. The variance
of the noise is given by
.
are the unknown model parameters. We need to
maximize (4) over
and
,
which is equivalent to minimizing the negative log of the objective
function:
(6)with respect to the
and
.
The last term in (6) is added to take care of the ambiguity between the scaling
of
and
by enforcing a low energy regurlization prior
over
.
Once the model is learned, given a new input data
its corresponding latent point
can be obtained by solving the likelihood
objective function:
(7)where
(8)
(9)
is the mean pose reconstructed from the latent
point
, and
is the reconstruction variance.
is the mean of the training data
.
is the kernel function of
evaluated over all the training data. Given
input
, the initial latent position is obtained as
.
Given
,
the mean data reconstructed in high dimension can be obtained using (8). In our
implementation, we make use of the FGPLVM Matlab toolbox (http://www.cs.man.ac.uk/neill/gpsoftware.html) and the fully independent training conditional
(FITC) approximation [44] software provided by Dr. Neil Lawrence for GPLVM
learning and bidirectional mapping between
and
.
Although the FITC approximation was used to expedite the silhouette learning
process, it took about five hours to process all the 3616 training silhouettes.
As a result, it will be difficult to extend our approach to handle multiple
motions simultaneously.
When applying GPLVM to silhouettes modeling, the image
feature points are embedded in a 5D latent space
. This is based on the consideration that three dimensions
are the minimum representation of walking
silhouettes [34]. One
more dimension is enough to describe view changes along a
body-centroid-centered circle in the horizontal plane of the subject. We then
add the fifth dimension to allow the model to capture extra variations, for
example, introduced by body shapes of different 3D body models used in
synthetic data generation. By using the FGPLVM toolbox, we obtained the
corresponding manifold of the training silhouette data set
described in Section 3. In Figure 8, the first
three dimensions of 640 silhouette latent points from
are shown. They represent 80 poses of one gait
cycle (two steps) with 8 views for each pose. It can be seen in Figure 8 that
silhouettes in different ranges of view angles are generally in different part
of the latent space with certain levels of overlapping. Hence, the GPLVM can
partly capture the structure of the silhouettes introduced by view changes.
Figure 8: The first three
dimensions of the silhouette latent points of 640 walking frames.
5.2. Movement Dynamic Learning Using GPDM
GPDM simultaneously provides a low-dimensional embedding of human motion data and
dynamics. Based on GPLVM, [15] proposed GPDM to add a dynamic model in the latent
space. It can be used for the modeling of a single type of motion. Reference [1] extended the GPDM to
balanced-GPDM to handle multiple subjects' stylistic variation by raising the
dynamic density function.
GPDM defines a Gaussian process to relate latent
points
to
at time
. The model is defined as:
(10)where
and
are regression weights, and
and
are Gaussian noise. The marginal distribution
of
is given by
(11)where
,
, and
consists of the kernel parameters which will
be introduced later.
is the kernel associated with the dynamics
Gaussian process and is constructed on
. We use an RBF kernel with a white noise term for the dynamics as in [14]
(12)where
are parameters of the kernel function for the
dynamics. GPDM learning is similar to GPLVM learning. The objective function is
given by two marginal log-likelihoods:
(13)
are found by maximizing
. Based on
, one is ready to sample from the movement dynamics, which is important in
particle filter-based tracking. Given
,
can be inferred from the learned dynamics
as follows:
(14)
where
and
are the mean and variance for prediction.
is the kernel function of
evaluated over
. In our implementation, the balanced GPDM [1] is adopted to balance the effect of the dynamics and
the reconstruction. As a data preprocessing step, we first center the motion
capture data and then rescale the data to unit variance [45]. This preprocessing reduces
the uncertainty in high-dimensional pose space. In addition, we follow the
learning procedure in [14] so that the kernel parameters in
are prechosen instead of being learned for the
sake of simplicity. This is also due to the fact that these parameters carry
clear physical meanings so that they can be reasonably selected by hand
[14]. In our
experiment,
. The local joint angles from motion capture are projected to joint angle
manifold
.
By augmenting
with the torso orientation space
, we obtain the complete pose latent space
. A 3D movement latent space learned using GPDM from the joint angle data set
described in Section 3 (six walking cycles
from three subjects) are shown in Figure 9.
Figure 9:
Two views of a 3D GPDM learned using gait data set

(see Section
3), including six walking cycles' frames from three subjects.
6. BME-Based Pose Inference
The backward mapping from the silhouette manifold
to the joint space of the pose manifold and
the torso orientation
is needed to conduct both autonomous tracking
initialization and sampling from the most recent observation. Different poses can
generate the same silhouette, which means this backward mapping is one-to-many
from a single-view silhouette.
6.1. The Basic Setup of BME
The BME-based
pose learning and inference method we use here mainly follows our previous work
in [41]. Let
be the latent point of an input silhouette and
the corresponding complete pose latent point.
In our BME setup, the conditional probability distribution
is represented as a mixture of
predictions from separate
experts:
(15)where
denotes the model parameters.
is a latent variable such that
indicates that
is generated by the
th expert, otherwise
.
is the gate variable, which is the
probability of selecting the
th expert given
.
For the
th expert, we assume that
follows a Gaussian
distribution:
(16)where
and
are the mean and covariance matrix of the output of the
th expert.
and
.
Following [33], in our
framework we consider the joint distribution
and assume the marginal distribution of
is also a mixture of Gaussian. Hence, the gate
variables are given by the posterior probability
(17)where
.
and
are the mixture coefficient, the mean and
covariance matrix of the marginal distribution of
for the
th expert, respectively.
's sum to one.
Given a set of training samples
,
the BME model parameter vector
needs to be learned. Similar to [10], in our framework the
expectation-maximization (EM) algorithm is used to learn
.
In the E-step of the
th iteration, we first compute the posterior
gate
using the current parameter estimate
.
is basically the posterior probability that
is generated by the
th expert. Then in the M-step, the estimate of
is refined by maximizing the expectation of
the log likelihood of the complete data including the latent variables. It can
be easily shown [33]
that the object function can be decomposed into two subfunctions: one related
to gate parameters
and the other one to the expert parameters
.
Details about the update of
can be found in [33], which are essentially the
basic equations in the M-step for Gaussian mixture modeling of
using EM.
6.2. Experts Learning Using Weighted RVM
In this section, we present our method for the learning of the expert parameters
.
There are
data pair clusters in BME. For each cluster,
we need to construct an expert for the mapping from silhouette latent point
to the complete pose latent point
.
The learning process of the parameters for all of the
experts is identical. We now consider the
learning of
.
The input to the learning algorithm is
,
including the original training data pairs and their associated posterior gate
values with respect to the
th expert.
's are the outputs of the E-step of the BME
learning mentioned in the previous section. Following [33], the objective function for
the optimization of the expert parameters is given by
(18)In our proposed framework, we
deployed RVM [46] to
solve this maximization problem. In our current implementation, individual
dimensions of
are considered separately assuming
independence between dimensions. To be concise in notation, in the remaining of
this section we assume that
is a scalar. When
is a vector, the expert learning processes in
all dimensions are identical. Denote
,
,
and
,
.
The RVM regression from
to
takes the following form:
(19)where
is a column vector of known kernel functions.
Hence, the likelihood of
is
(20)where
is the kernel matrix. To overfitting, a
diagonal hyper-parameter matrix
is introduced to model the prior of
:
.
Following the derivation in [46], it can be easily shown that in the case of weighted
RVM, the conditional probability distribution of
is given by
(21)
are computed through the following iterative
procedure:
(22)where
is the
th element of
.
and
are the
th diagonal terms of
and
,
respectively. Once the parameters have been estimated, given a new input
,
the conditional probability distribution of the output is given
by
(23)with
.
6.3. Experiments Results for 3D Pose Inference
To demonstrate the validity of the above BME-based pose inference framework, some experimental
results are included in this section. The resulting BME constitutes a mapping
from
to
.
The training data used includes the projection of silhouette training set
onto
using GPLVM and the projection of the pose
data
on
using GPDM. The number of experts in BME is
the number of mappings from
to
.
When the local body kinematics is fixed, usually five mappings are sufficient
to cover the variations introduced by different torso orientations. When the torso
orientation is fixed, the number of mappings needed to handle changes due to
different body kinematics depends on the complexity of the actual movement. In
the case of gait, three mappings are sufficient. Therefore, in our experiment
when both torso orientation and body kinematics are allowed to vary, fifteen
experts were learned in BME for pose inference of gait.
Synthetic testing data were generated using different
3D human models and motion sequences from different subjects. Some
reconstructed poses for the first two most probable outputs, that is, the
outputs with the first two largest gate values computed using (17), are shown
in Figure 10. It is clear that BME can handle ambiguous poses.
Figure 10:
BME-based pose inference results of a synthetic
walking sequence. Top row: input images; middle row: the most probable poses;
bottom row: the second most probable poses.
A real video (40 frames, two steps' side-view walking)
was also used to evaluate this approach. Due to observation noise, the
silhouettes extracted from this video were not as clean as the synthesized
ones. However, BME can still produce perceptually sound results. Some recovered
poses are shown in Figure 11.
Figure 11:
BME-based pose inference results of a real
walking video. Top row: input video images; middle row: the most probable
poses; The third row: The second most probable poses.
7. Tracking Using Particle Filter
A particle filter defined over
is used for 3D tracking of articulated motion.
The state parameter at time
is
,
where
is the latent point of the body joint angles,
and
is the torso orientation. Given a sequence of
latent silhouette points
obtained from input images using GPLVM, the
posterior distribution of the state is approximated by a set of weighted
samples
.
The importance weights of the particles are propagated over time as
follows:
(24)
Pose estimation results from BME are used to
initialize the tracking. BME cannot disambiguate, however, it provides multiple
possible solutions. In our experiments, the first three most probable solutions
from BME are selected as tracking seeds according to their gate values. Then
samples are drawn around these seeds. Generally, a wrong initialized branch
will merge with the correct ones after several frames estimation. But in some
situations, due to inherent ambiguity, an ambiguous solution might also stay.
For example, multiple tracking trajectories were obtained in some of our experiments as discussed in Section 8.
7.1. Sampling
Particles are propagated over time from a proposal distribution
.
To take into account both the movement dynamics and the most recent observation
,
in our approach we select
to be the mixture of two distributions as
follows:
(25)
where
is chosen as the BME output
given by (15) and
(26)
In our experiment, we only use
the first three most probable components of the 15 BME outputs and draw samples
according to the regression covariance. The second term in (25) is from
movement dynamics learned using GPDM and a first-order AR model for the torso
orientation
(27)In (25),
is the mixture coefficient of the BME-based
prediction and the dynamics-based prediction components. In our experiments,
. Because of
is a 5D space, only 100 particles were used in
tracking, which makes the tracking computationally efficient.
7.2. Likelihood Evaluation
In our framework, we take RVM as the regression
function to construct a forward mapping from
to
.
The hypothesized pose latent point
is first projected to
, and then to the image feature space using the inverse mapping in GPLVM. In the
RVM learning, we used the same training set as that in the BME learning
described in Section 6.3. The final number of the relevance vectors accounts
about 10%–20% of the total data. To evaluate the
effectiveness of the RVM-based mapping, it was compared against a model-based
approach, in which the hypothesized torso orientation and body kinematics were
obtained from
, and then Maya was used to render the corresponding silhouettes of the 3D body
model. The silhouette distance is measured in the vectorized GMM feature space.
Comparison results using five walking images are included in this section. For
each input silhouette, fifteen poses were inferred using BME learned according
to the method presented in Section 6.3. Given a pose, two vectorized GMM
descriptors were obtained using both the RVM- and model-based approaches. The
root mean square errors (RMSEs) between the predicted and true image features
were then computed. Figure 12(a) shows exemplar input silhouettes from view
number 1 through view number 5, indexed starting from the leftmost figure. For
each view and each method, given an input silhouette, we found the smallest
RMSE among all of the 15 candidate poses provided by BME. We then compute the
average of the smallest RMSE over all the input silhouettes. The average RMSEs
of all the five views from both methods are shown in Figure 12(b). It can be
seen that the average RMSEs are close for these two approaches, which indicates
that the likelihoods of a good pose candidate computed using both methods are
similar. Hence, we can use the example-based approach for computation
efficiency. In addition, the example-based approach does not need a 3D body
model of the subject, which also simplifies the problem.
Figure 12:
(a) Sample silhouettes from view number 1through view number 5, indexed
starting from the leftmost figure. (b) RMS error results using rendering and
regression approaches. The average error is close for both approaches.
8. Experimental Results
The proposed framework has been tested using both synthetic and real image sets. The system
was trained using training data described in Section 3.
During tracking, the preprocessing of the input image
takes about 800 milliseconds per frame, including silhouette extraction, GMM,
and vectorization. Out of these three operations, GMM is the most time
consuming, taking about 600 milliseconds. The mapping from vectorized GMM to
the silhouette manifold
is the most time consuming operation in our
current implementation, which takes about 3 seconds per frame. BME inference,
sampling, and sample weight evaluation is fairly fast, taking about 200
milliseconds per frame. The total time to process one frame of input image is
about 4 seconds.
We first used synthetic data to evaluate the accuracy
of our tracking system. The test sequence was created using motion capture data
(sequence
in the CMU database) of a new subject not
included in training sets and a new 3D body model. Some of the camera views are
also new. This test sequence has 63 frames of two walking cycles. The five
camera views used to create the testing data are the same as those shown in
Figure 12(a). The RMSEs between the ground truth and the estimated joint
angles are given to show the tracking accuracy. The tracking results based only
on sampling from the GPDM movement dynamics are also included for comparison
purposes. One hundred particles were used in both cases. The average RMS errors
from different views are shown in Figure 13(a). The tracking from view number
1 (frontal view) is rather ambiguous. The frame-wise RMSE of view number 5 (side
view walking from left to right) is given in Figure 13(b).
Figure 13(c) presents some input silhouettes from view 5 (top row) and their estimated poses
(the second and third rows). To show the effectiveness of the proposed
framework, results from the static image estimation using only BME and results
from sampling from dynamics are also shown in the fourth and fifth rows,
respectively. It can be seen that our proposed framework provided the most
accurate tracking results among all three methods.
Figure 13: Experimental
results obtained using synthetic data. (a) average RMS errors obtained using
synthetic testing sequences from different views; (b) frame-wise RMS from the
side view. (c) exemplar input silhouettes of view 5 and tracking results. Top
row: some input silhouettes; the second and third rows: two plausible solutions
obtained using our framework; the fourth row: the recovered poses directly from
the observed image using BME; bottom row: the recovered poses obtained using
only dynamic prediction; (d) the tracked movement trajectories in the joint
angle manifold

.
The result obtained using the proposed method
successfully describes the inherent left-and-right ambiguity present in the
input silhouette. It can been seen that the initial and the continuous
silhouettes are difficult to be distinguished from the left and the right. The
proposed framework returned both admissible results, although we cannot tell
which one corresponds to the true movement. Both movement trajectories tracked
in
are shown in Figure 13(d).
A real video sequence of 42 frames of two steps
walking along diagonal direction to the camera was used to evaluate the
proposed system. The subject was not seen in the system training. One hundred
particles were used in the tracking. Due to observation noise in the video, the
extracted silhouettes were not as clean as the synthesized ones. However, the
proposed approach can still produce plausible results. Some recovered poses are
shown in Figure 14.
Figure 14: Reconstructed poses for a real image sequence of 42
frames. Top row: sample input images; the second row: extracted silhouettes;
the third row: recovered poses using the proposed framework; the fourth row:
recovered poses directly from the observed image using only BME; bottom row:
recovered poses obtained using only dynamic prediction.
Another real video sequence (40 frames, two steps side
view walking) was used to evaluate the proposed system. This video is slightly
more challenging than the previous one because there is a jump between frame 29
to frame 30 due to missing frames caused by a misoperation during the video
recording. One hundred particles were used in the tracking. The proposed
framework still recovered two reasonable movement trajectories. Some of the
results are shown in Figure 15(a). Both admissible tracking trajectories in the
joint angle manifold
are shown in Figure 15(b). The last set of
experimental results included here shows the generalization capability of our
proposed tracking framework. A video of a circular walking from [3] was used. Two hundred particles were used in the tracking. The number of samples used in this
experiment was more than the other experiments because of the increased
movement complexity present in a circular walking. The corresponding results
are shown in Figure 16. It can be seen that our proposed framework can track
this challenging video fairly well. Our results are much better than those
obtained either using only BME or direct sampling from movement dynamics.
Figure 15:
Pose tracking results obtained using a real image sequence of 40 frames, where there is a
movement jump between frame 29 to frame 30. (a) Top row: sample input images;
the second row: extracted silhouettes; the third and fourth rows: two plausible
solutions tracked using our framework; the fifth row: recovered poses directly
from the corresponding images using BME; bottom row: recovered poses obtained
using only dynamic prediction. (b) The tracked movement trajectories in the
joint angle manifold

.
Figure 16:
Pose tracking
results using a real image sequence of circular walking. Top row: sample input
images; the second row: extracted silhouettes; the third row: recovered poses
using the proposed framework; the fourth row: recovered poses directly from the
corresponding images using only BME; bottom row: recovered poses obtained using
only dynamic prediction.
9. Conclusion and Future Work
In this paper, a 3D articulated human motion tracking fra-mework using a single camera is
proposed based on manifold learning, nonlinear regression, and particle
filter-based tracking. Experimental results show that once properly trained,
the proposed framework is able to track patterned motion, for example, walking.
A number of improvements can be made as part of our
future work. In the proposed framework, there are two separate low-dimensional
manifolds for silhouettes and poses, which requires a number of
forward-backward mappings. In our future work, we will try to construct a joint
silhouette-pose manifold which will greatly simplify the mapping procedure from
the input silhouette to the corresponding latent pose point, in a way similar
to [17]. In the proposed framework, we assume that all the entries of the vectorized GMM are
independent given the latent variables. This might not be true in reality. We
will investigate possible errors carried by this assumption. In our current
implementation, we only learn the parameters for a first-order Markov process.
To explore higher-order Markov processes, there
will be another interesting research problem to work on as well. In our current
BME learning, experts for different dimensions of
are learned separately using univariate RVM.
In our future work, we would like to adopt the
multivariate RVM framework proposed in [8] for BME learning and pose inference. We will also
compare the final tracking results obtained using univariate RVM and
multivariate RVM. Finally, we are working on extending our proposed framework
in this paper into a multiple-view setting. Research challenges include
optimization of a fusion scheme of input from multiple cameras.
Acknowledgments
The authors would like thank the anonymous reviewers for their insightful comments and
constructive suggestions. They also want to thank Dr. Neil Lawrence for making
the GPLVM and related toolboxes freely available online to the community. This paper is based upon work partly supported by U.S. National Science Foundation on CISE-RI no. 0403428 and IGERT no. 0504647. Any opinions, findings and
conclusions or recommendations expressed in this material are those
of the author(s) and do not necessarily reflect the views of the U.S. National Science Foundation (NSF).
References
- R. Urtasun, D. J. Fleet, and P. Fua, “3D people tracking with Gaussian process dynamical models,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 1, pp. 238–245, New York, NY, USA, June 2006.
- A. Blake, J. Deutscher, and I. Reid, “Articulated body motion capture by annealed particle filtering,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '00), vol. 2, pp. 126–133, Hilton Head Island, SC, USA, June 2000.
- H. Sidenbladh, M. J. Black, and D. J. Fleet, “Stochastic tracking of 3D human figures using 2D image motion,” in Proceedings of the 6th European Conference On Computer Vision (ECCV '00), pp. 702–718, Dublin, Ireland, June-July 2000.
- L. Kakadiaris and D. Metaxas, “Model-based estimation of 3D human motion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1453–1459, 2000.
- A. Agarwal and B. Triggs, “Tracking articulated motion using a mixture of autoregressive models,” in Proceedings of the 8th European Conference on Computer Vision (ECCV '04), pp. 54–65, Prague, Czech Republic, May 2004.
- V. Pavlovic, J. M. Rehg, and J. MacCormick, “Learning switching linear models of human motion,” in Proceedings of the Annual Conference on Neural Information Processing Systems Conference (NIPS '00), Denver, Colo, USA, December 2000.
- R. Li, T.-P. Tian, and S. Sclaroff, “Simultaneous learning of nonlinear manifold and dynamical models for high-dimensional time series,” in Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV '07), pp. 1–8, Rio de Janeiro, Brazil, October 2007.
- A. Thayananthan, R. Navaratnam, B. Stenger, P. H. S. Torr, and R. Cipolla, “Multivariate relevance vector machines for tracking,” in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), pp. 124–138, Graz, Austria, May 2006.
- G. Mori and J. Malik, “Estimating human body configurations using shape context matching,” in Proceedings of the 7th European Conference on Computer Vision (ECCV '02), pp. 150–180, Copenhagen, Denmark, May 2002.
- C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas, “Discriminative density propagation for 3D human motion estimation,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 1, pp. 390–397, San Diego, Calif, USA, June 2005.
- A. Agarwal and B. Triggs, “Recovering 3D human pose from monocular images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 1, pp. 44–58, 2006.
- C. Sminchisescu, A. Kanujia, Z. Li, and D. Metaxas, “Conditional visual tracking in kernel space,” in Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS '05), Vancouver, BC, Canada, December 2005.
- K. Grauman, G. Shakhnarovich, and T. Darrell, “Inferring 3D structure with a statistical image-based shape model,” in Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03), vol. 1, pp. 641–648, Nice, France, October 2003.
- N. D. Lawrence, “Probabilistic non-linear principal component analysis with Gaussian process latent variable models,” Journal of Machine Learning Research, vol. 6, pp. 1783–1816, 2005.
- J. M. Wang, D. J. Fleet, and A. Hertzmann, “Gaussian process dynamical models,” in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), Vancouver, BC, Canada, December 2006.
- R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua, “Priors for people tracking from small training sets,” in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 1, pp. 403–410, Beijing, China, October 2005.
- C. H. Ek, N. D. Laurence, and P. H. S. Torr, “Gaussian process latent variable models for human pose estimation,” in Proceedings of the 4th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI '07), Brno, Czech Republic, June 2007.
- S. Hou, A. Galata, F. Caillette, N. Thacker, and P. Bromiley, “Real-time body tracking using a gaussian process latent variable model,” in Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV '07), Rio de Janeiro, Brazil, October 2007.
- C. Curio and M. A. Giese, “Combining view-based and model-based tracking of articulated human movements,” in Proceedings of IEEE Workshop on Motion and Video Computing (MOTION '05), vol. 2, pp. 261–268, Breckenridge, Colo, USA, January 2005.
- B. Scholkopf and A. Smola, Learning with Kernels, MIT Press, Cambridge, Mass, USA, 2002.
- M. E. Tipping and C. M. Bishop, “Mixtures of probabilistic principal component analysers,” Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.
- R. Li, M.-H. Yang, S. Sclaroff, and T.-P. Tian, “Monocular tracking of 3D human motion with a coordinated mixture of factor analyzers,” in Proceedings of the 9th European Conference on Computer Vision (ECCV '06), pp. 137–150, Graz, Austria, May 2006.
- K. Grochow, S. L. Martin, A. Hertzmann, and Z. Popović, “Style-based inverse kinematics,” ACM Transactions on Graphics, vol. 23, no. 3, pp. 522–531, 2004.
- T.-P. Tian, R. Li, and S. Sclaroff, “Articulated pose estimation in a learned smooth space of feasible solutions,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR '05), vol. 3, p. 50, San Diego, Calif, USA, June 2005.
- K. Moon and V. Pavlović, “Impact of dynamics on subspace embedding and tracking of sequences,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 1, pp. 198–205, New York, NY, USA, June 2006.
- N. D. Lawrence and A. J. Moore, “Hierarchical Gaussian process latent variable models,” in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 481–488, Covallis, Ore, USA, June 2007.
- J. M. Wang, D. J. Fleet, and A. Hertzmann, “Multifactor Gaussian process models for style-content separation,” in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 975–982, Covallis, Ore, USA, June 2007.
- C. Sminchisescu, A. Kanaujia, and D. Metaxas, “Learning joint top-down and bottom-up processes for 3D visual inference,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, pp. 1743–1750, New York, NY, USA, June 2006.
- C.-S. Lee and A. Elgammal, “Simultaneous inference of view and body pose using torus manifolds,” in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), vol. 3, pp. 489–494, Hong Kong, August 2006.
- E.-J. Ong, A. S. Micilotta, R. Bowden, and A. Hilton, “Viewpoint invariant exemplar-based 3D human tracking,” Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 178–189, 2006.
- R. Rosales and S. Sclaroff, “Learning body pose via specialized maps,” in Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS '01), vol. 14, pp. 1263–1270, Vancouver, BC, Canada, December 2001.
- A. Agarwal and B. Triggs, “Monocular human motion capture with a mixture of regressors,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR '05), pp. 54–65, San Diego, Calif, USA, June 2005.
- L. Xu, M. I. Jordan, and G. E. Hinton, “An alternative model for mixtures of experts,” in Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS '94), pp. 633–640, Denver, Colo, USA, December 1994.
- A. Elgammal and C.-S. Lee, “Inferring 3D body pose from silhouettes using activity manifold learning,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 681–688, Washington, DC, USA, June 2004.
- C.-S. Lee and A. Elgammal, “Modeling view and posture manifolds for tracking,” in Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV '07), pp. 1–8, Rio de Janeiro, Brazil, October 2007.
- M. A. O. Vasilescu and D. Terzopoulos, “Multilinear subspace analysis of image ensembles,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '03), vol. 2, pp. 93–99, Madison, Wis, USA, June 2003.
- I. Borg and P. Groenen, Modern Multidimensional Scaling. Theory and Applications, Springer, New York, NY, USA, 1997.
- C. J. C. Burges, “Geometric methods for feature extraction and dimensional reduction,” in Data Mining and Knowledge Discovery Handbook, Kluwer Academic Publishers, Dordrecht, The Netherlands, 2005.
- CMU Human Motion Capture DataBase, http://mocap.cs.cmu.edu/.
- R. Poppe and M. Poel, “Comparison of silhouette shape descriptors for example-based human pose recovery,” in Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR '06), pp. 541–546, Southampton, UK, April 2006.
- F. Guo and G. Qian, “Learning and inference of 3D human poses from Gaussian mixture modeled silhouettes,” in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), vol. 2, pp. 43–47, Hong Kong, August 2006.
- J. Goldberger, S. Gordon, and H. Greenspan, “From image gaussian mixture models to categories,” in Proceedings of the 7th European Conference on Computer Vision (ECCV '02), Copenhagen, Denmark, May-June 2002.
- N. D. Lawrence, “Gaussian process latent variable models for visualisation of high dimensional data,” in Proceedings of the 15th Annual Conference on Neural Information Processing Systems (NIPS '03), Vancouver, BC, Canada, December, 2003.
- N. D. Lawrence, “Learning for larger datasets with the gaussian process latent variable model,” in Proceedings of the 11th International Workshop on Artificial Intelligence and Statistics, San Juan, Puerto Rico, USA, March 2007.
- G. Taylor, G. Hinton, and S. Roweis, “Modeling human motion using binary latent variables,” in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), Vancouver, BC, Canada, December 2006.
- M. E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, no. 3, pp. 211–244, 2001.