Laboratoire des Sciences de l'Image, de l'Informatique et de la Télédétection, LSIIT, UMR 7005, CNRS-Université Louis Pasteur, Strasbourg 1, Boulevard S. Brant, BP 10413, 67412 Illkirch Cedex, France
Laboratoire d'Imagerie et de Neurosciences Cognitives, LINC, UMR 7191, CNRS-Université Louis Pasteur, Strasbourg 1, LINC-IPB, 4, rue Kirschleger, 67085 Strasbourg Cedex, France
Abstract
Manifold learning may be seen as a procedure aiming at capturing the degrees
of freedom and structure characterizing a set of high-dimensional data, such as
images or patterns. The usual goals are data understanding, visualization, classification, and the computation of means. In a linear framework, this problem is
typically addressed by principal component analysis (PCA).
We propose here a nonlinear extension to PCA. Firstly, the reduced variables are
determined in the metric multidimensional scaling framework. Secondly, regression
of the original variables with respect to the reduced variables is achieved considering
a piecewise linear model. Both steps parameterize the (noisy) manifold
holding the original data. Finally, we address the projection of data onto the manifold.
The problem is cast in a Bayesian framework. Application of the proposed
approach to standard data sets such as the COIL-20 database is presented.
1. Introduction
Data reduction consists in parameterizing a set of
high-dimensional data with a set of reduced coordinates and mapping the
original space to the reduced one and vice versa. The (noisy) data are
assumed to lie close to a nonlinear manifold whose intrinsic dimension is the
dimension of the reduced space. Usual goals of data reduction are visualization
(as a scatter plot of the reduced data with labels from the original data, in
order to get insight into the structure of the data), data understanding (the
degrees of freedom are given a physical interpretation, such as pose angle or
lighting intensity), classification (classification is more robust when
considering the reduced space, especially in a nonlinear reduction framework),
denoising, and the computation of means. Applications are face-recognition,
character recognition, and shape analysis, to mention a few examples.
Manifold learning may be addressed in a more formal
way. The goal is to determine, for each
on the
manifold, a reduced variable
, an approximation error
, and a mapping
such that
. To tackle the undetermination of this problem (
and the
's are both
unknown), the hypothesis that
preserves
distances is introduced. Hence, the data structure in the original space is
preserved in the reduced space. Consequently, the mapping
is determined
up to an isometry, which has no practical influence on the goals of manifold
learning. Although higher-order models could be considered, we will consider
here piecewise linear mappings. The
's and
will be
determined sequentially.
Manifold learning is typically addressed by principal
component analysis (PCA) with severe limitations, the main one being that only
linear manifolds can be comprehended. The main consequence is that the
dimension of the reduced space may be significantly overestimated when tackling
nonlinear manifolds, thus hampering for example classification, visualization,
or the computation of means. The main advantage of PCA over other techniques is
that the reduced coordinates and the mapping
are computed
simultaneously andeasily.
A straightforward nonlinear extension of PCA is local
PCA [1], allowing to
fit the manifold locally. The main problem with this approach is that the
different sets of reduced variables are unrelated, thus also hampering the
final goals of the procedure. Several approaches, such as isometric feature
mapping (Isomap, see [2]) and locally linear embedding (LLE, see [3]), address the compression
problem globally instead of blockwise, such as local PCA. The former tries to
preserve the approximate geodesic distances on the manifold, considering
classical multidimensional scaling (MDS) [4] while the latter aims at preserving the local
structure of the manifold considering local barycentric coordinates. Both approaches
suffer from noise and errors on the estimated distances in the original space
[5]. We propose here a
metric MDS approach [6, 7], with a view to reducing this effect: contrary to
standard MDS, we consider a robust cost function to estimate the
's. We also
propose a piecewise linear regression framework to map the reduced space to the
original one and a Bayesian framework to determine the mapping from the
original space to the reduced one.
Besides, a class of dimension reduction methods known
as eigenmap techniques has recently gained interest. Those methods aim at
preserving local (Laplacian eigenmaps [8], Hessian eigenmaps [9]) or global (diffusion maps [10]) structure by minimizing a
cost function. In practice, the minimizations are carried out using eigenvector
computations. Comparison of the proposed approach with such techniques is
provided in this article. Well-identified shortcomings of eigenmap techniques
are exhibited by the examples we consider.
Several other attempts have been reported in the
literature to reduce the data groupwise [11–13], with an alignment correction in the reduced space,
and using a piecewise linear function for the mapping between the original and
the reduced spaces. To our opinion, the main drawback of these approaches is
the patching of local reduced variables which may introduce distortion in the
reduced space considered globally. Nevertheless, these approaches certainly
deserve further attention and work.
Finally, a class of dimension reduction methods known
as principal curves and principal surfaces [14, 15] attempt to reduce the data nonlinearly. These methods
fundamentally only handle reduced spaces of dimension
or
which is too
restrictive for the present case where we want to be able to consider reduced
variables of higher dimension. LeBlanc and Tibshirani [16] have proposed an extension
of such approaches in higher dimensions. In their setting, the reduced
variables are re-estimated during the procedure, which does not correspond to
our goal, since we want the reduced variables to reflect the structure of the
original data. The model is constructed by adding (and pruning) linear pieces,
which is what we also consider in our approach.
This article is organized as follows. Our approach is
described in Section 2. Application examples are given in Section 3. Conclusion
and prospects are presented in Section 4.
2. The Nonlinear Manifold Learning Procedure
In this section, we detail the proposed two-stage
learning algorithm, consisting in compression (i.e., determination of the
's) and in
regression (i.e., estimation of the mapping
). The compression step is in fact a feature
extraction step, where the features are determined as an implicit nonlinear
function of the original data. Feature extraction is opposed to feature
selection, where a meaningful subset of the original coordinates has to be
selected. From a differential geometry point of view, compression and
regression correspond to devising an atlas on a manifold (or to charting a
manifold). The presentation of a Bayesian projection procedure, mapping new (i.e., incoming) data of original space to the
reduced space, concludes this section. An approach close to the present one was
proposed by Elgammal and Lee [17] to infer body pose from silhouettes.
2.1. Learning Stage I: Compression
Determination of the set
of reduced
coordinates is based
on the hypothesis that the mapping
preserves distances, such
that for all
where
is the
(geodesic) distance in the original space and
the (Euclidean) distance in the reduced
space. This hypothesis allows to import the structure of the
original data in the reduced space, hence bringing clarity in
visualization and robustness in classification. The mapping is
determined up to an isometry which, as already stated, has no
influence on the final goals of the global procedure. The computation of approximate geodesic distances is detailed in [2]. In practice,
, with typically
, whereas
, where
may take
values in the tens of thousands. The last coordinate of each
will be
, so that affine regression is handled with
simplicity and clarity. This coordinate will not be considered
when accounting for the dimension of the reduced space.
The
's may be
obtained as the solution of the following optimization problem:
(1) where
and
is a cost (or
stress) function. We will consider several stress functions in this work.
(i)
, proposed by Sammon [6] and advocated by Duda,
et al. [18, chapter 10, (109)]:
(2) where
is geodesic (in
[6],
is supposed to
be Euclidean though it is mentioned that any distance could be used) and
Euclidean.
(ii)
, corresponding to the Isomap [2] algorithm
(3) where
is the
Frobenius norm,
is the
centering matrix, with
. Matrix
(resp.,
) encompasses
the squared
(resp.,
) distances
between the
(resp.,
). Distances
and
are,
respectively, geodesic and Euclidean.In the standard MDS approach,
is replaced by
a monotonic transformation of
, where
is the
proximity (to be user-defined according to the type of data considered) between
and
(see [4]),
(iii)
, corresponding to the Curvilinear Component Analysis
approach proposed by Demartines et al. [19]:
(4) where
,
being the
indicator function taking value
(resp.,
) if the
condition between parentheses is true (resp., false).
(iv)
, which is the stress function we propose:
(5) where
is geodesic and
Euclidean.
PCA can be interpreted as a particular case of this
approach (Isomap with Euclidean distances in both spaces, see [20]). Standard MDS, which uses
squared differences between distances, is sensitive to outliers, which can be
present in real data. This advocates the metric MDS approach proposed here
(involving a robust, nonquadratic cost function, see the first factor in the
definition of
). Besides,
small distances can be dominated by noise. Small distances thus should have
reduced influence on the stress. This is achieved by the third factor in the
definition of
. Finally, we would like to enhance the convexity of
the cost function, which is the role of the second factor in the definition of
(in the
transient phase of the optimization, when
is large, the
corresponding term in the sum is quadratic). This eliminates many local minima.
More precisely,
is a small real
number so that the square root is always differentiable,
is a threshold
indicating whether the cost function is linear or quadratic; and
is a threshold
indicating whether
should be
considered or not (if
vanishes and
has no
influence on the stress). We set
, we chose
as the first
percentile of the distances
and
as the
th percentile
of the distances
.
This cost function is highly multimodal and must
therefore be handled with care. Two different stochastic approaches are
proposed to solve the optimization problem.
The first approach (Algorithm 1) considers all points
simultaneously.
We use gradient descent with an exact line search. The algorithm proceeds
iteratively and adds a small amount of noise to the points
after each
descent step to try to avoid local minima. The
's are centered
after each descent step to avoid drifting and to maximize precision. The second
approach (Algorithm 2) proceeds by incorporating successively the
's. The
rationale is that a cost function encompassing few points will be less
multimodal, thus enabling to reach the global optimum without particular care
and computational effort. Incorporating additional points will have negligible
effect on the points already optimized and will (hopefully, at least if only
one point is incorporated at a time) correspond to a monomodal problem. As for
the first algorithm, gradient descent and centering are used.
Algorithm 1: Standard compression algorithm.
Algorithm 2: Successive
incorporation compression algorithm.
The solution of the Isomap problem may be obtained
analytically and thus does not involve any iterative optimization [20]. Sammon [6] proposed gradient descent with a fixed step size for
the stress function
. Other approaches include optimizations on local
distances [21], as
opposed to approximated geodesic distances, but this can create folding in the
reduced space as may also be observed for LLE [3].
Let us mention that the optimization algorithms
approaches mentioned in the literature proved to be inefficient for the data
sets we considered. Handling the intricate optimization problem related to
compression may thus be considered as a contribution of the present work.
In addition to those approaches based on the optimization of a stress function,
another class of approaches, known as eigenmap methods, has recently gained
interest. Generally speaking, eigenmap methods rely on the computation of
eigenvectors and amount indirectly to the optimization of a cost. Some of the
eigenvectors are used as a compact representation of the data. This general class
encompasses Belkin and Niyogi's Laplacian eigenmaps
[8], Donoho and
Grime's Hessian eigenmaps [9], and Coifman and Lafon's diffusion maps [10]. Those methods, which
attempt to preserve either local or global structure, will not be detailed
here, the reader being referred to the cited references.
Let us finally mention that unifying interpretations
of several methods have been proposed: Ham et al. propose an
interpretation from a kernel point of view [22], Coifman and Lafon propose an interpretation from a
diffusion map point of view [10].
2.2. Learning Stage II: Regression
The mapping
can be
estimated, now that the reduced set of coordinates is known. We choose for
a piecewise
affine function, because affine functions are highly adaptable and relatively
easy to handle. PCA would correspond to a piecewise affine function of only one
piece.
The general goal is to estimate a set
of
regression
matrices, where
is the unknown
number of pieces of the model approximating the manifold optimally and where
(resp.,
) is the
dimension of the compressed (resp., response) variable. The guideline will be
the minimization of
. The index
involved is
unknown and depends on
. Such an optimization problem is intricate and we
will propose a suboptimal solution, but holding satisfying approximation
properties. Let us mention that we enforce a connectivity constraint: no vector
can have a
label not represented in its neighborhood and two or more patches with the same
label are not authorized. The proposed formulation of the regression problem
has a stochastic interpretation as maximum likelihood estimation with Gaussian
iid noise, subject to the connectivity constraint.
Let us also mention that nonlinear regression is
usually addressed in the literature in the scalar case (
) or in the
case of the juxtaposition of independent scalar regressions (e.g., see the
projection pursuit regression method of Friedman and Stuetzle [23] and the multivariate
adaptive regression splines method of Friedman [24]). The reader is also
referred to the monograph [25]. More generally, all monographs we are aware of are
dealing with regression and smoothing methods in the scalar response case. In
the multidimensional response case, our approach is close to the one proposed
by Haralick and Harpaz [26], where search for features linear cluster by linear
cluster is carried out. In Haralick and Harpaz's approach, no dimension
reduction is considered, thus leaving the clusters unconnected, which is a
drawback for future classification, visualization, data understanding, and the
computation of means. To our best knowledge, no other work has addressed the
general case
, set aside principal surfaces approaches which we
already mentioned.
Two approaches, both stochastic and iterative, are
proposed to estimate the
's and the
's. The first
one (Algorithm 3) creates a plane in an unlabeled neighborhood of a randomly
chosen point. It must be noticed that some points may remain unlabeled at the
end of this procedure. We choose not to process them since there are few such
points and since their influence is negligible. The second one (Algorithm 4)
creates planes considering one point among those which are the most unlikely to
belong to an already existing plane (this is done by considering the
likelihoods of the point and of its neighborhood). One advantage of the latter
algorithm is the control over the quality of the regression allowed by the
information criterion. This criterion may also be used to impose the number of
pieces of the model or the accuracy of the regression.
Algorithm 3: Piecewise linear mapping 1 (PLM 1).
Algorithm 4: Piecewise linear mapping
2 (PLM 2).
2.3. Projection
Projecting a new (i.e., incoming) point
onto the
manifold amounts to estimating the variables
parameterizing
this point. We will thus have
. The general goal is to assess the belonging of point
to the
manifold.
This problem will be cast in a Bayesian framework. The
unknowns
and
(
is implicitly
known, when
and
are given) will
be chosen as the maximizers of
. All probabilities involved are conditioned on the
model previously determined. This conditioning is dropped for the sake of
clarity. Maximizing
amounts to
maximizing
, which amounts to minimizing
(6) where
is the variance
of the noise as estimated from the initial (i.e., learning) data set. Probability
is determined
from the learning data set using Gaussian kernels. Probability
is also
estimated from the learning data set.
For
,
is computed
using a local optimization algorithm with a multistart process, since the cost
function may be multimodal. The estimate
is retained as
the minimizer of the cost function among all
's, whose
indices
satisfy the
connectivity constraint.
It must be noted that the original data
are not needed
for the projection, since only the reduced coordinates
and the
matrices
are used in the
process. This saves a lot of memory, which is an advantage of the proposed
methodology.
2.4. Comment: Iterating Compression and Regression
As an extension to the previous approach, one might
also consider iterating compression and regression steps. The rationale is to
use the result of the regression step to initialize the neighborhood graph: in
addition to its use for the nearest neighbors, Euclidean distance is then used
for all pair of points belonging to the same linear piece, thus reducing the
errors that might occur in the estimation of those distances.
We implemented this approach on the SCurve. The
estimation of the reduced coordinates was not improved. This is due to the fact
that the regression step is an approximation to the true function if few planes
are used. Thus, errors are introduced in the regression step, which are further
propagated in the compression step. If many planes are used, the new
initialization of the graph is not different from the one considering the
nearest neighbors. Hence, there is no global advantage in iterating between
both steps, at least for the SCurve, and it is highly probable that this would
also be the case for other data sets. This approach—iterating compression and
regression—will not be considered in the sequel.
3. Experimental Results
This section is separated in two parts. The first part
presents results of the compression procedure; the second one addresses the
projection procedure. The manifolds used here are standard data sets known as
the SwissRoll (Figure 1(a)), the SCurve (Figure 1(c)) (e.g., see [2, 3, 13]) and the COIL-20 database
(e.g., see Figure 6, [27]). We mention that both the SwissRoll and the SCurve
have two degrees of freedom and that neither of them can be described by a
-dimensional
reduced space in a linear framework in a manner that preserves their intrinsic
structure. This precludes PCA for a compact representation of these data.
Figure 1: (a) Original SwissRoll, (b) regressed
SwissRoll (

planes), (c)
original SCurve, (d) regressed SCurve (

planes).
3.1. Compression
First of all, it should be mentioned that comparing
compression methods is an intricate problem because of the choice of the
evaluation criterion which should not favor one method over the others. As
benchmarks, we consider the SwissRoll and the SCurve which were compressed
using different paradigms and stress functions (PCA, Isomap,
,
,
, Laplacian eigenmaps, diffusion maps, Hessian
eigenmaps, and locally linear embedding). We consider a reduced space of dimension
2, which is the true intrinsic dimensionality of the data (we will return on
the determination of the intrinsic dimensionality in the sequel).
From a qualitative point of view, we observe that
PCA does not behave well, since the neighborhood structure is not preserved in
the reduced space (see Figure 2and compare the colors of the reduced data to
the colors of the original data displayed Figure 1(a)) (see also Figure 4) .
Locally linear embedding, Laplacian eigenmaps, and diffusion maps do not behave
satisfyingly either. All other approaches behave quite well in the nonnoisy
SCurve case. In the noisy case, as expected, all methods are affected by noise,
though to a different extent (see Figure 3). Besides all not very satisfying
behaviors mentioned above, we notice that the Hessian eigenmap compression
degrades in the presence of noise, as mentioned in the literature [9].The Laplacian eigenmap
compression tends to introduce holes, which correspond to its known property of
emphasizing clusters [8]. Sammon's cost function and the proposed cost function
exhibit the most satisfying behaviors. Similar conclusions may be drawn with
the SwissRoll test case. These differences should be quantified, which we did
using two different tests.
Figure 2: Reduced space computed by (a) PCA, (b) Isomap,
(c)

, (d)

, (e)

, (f) locally
linear embedding, (g) Laplacian eigenmap, (h) Hessian eigenmap, and (i)
diffusion maps for the nonnoisy SCurve.
Figure 3: Reduced space computed by (a) PCA, (b) Isomap,
(c)

, (d)

, (e)

, (f) locally
linear embedding, (g) Laplacian eigenmap, (h) Hessian eigenmap, and (i)
diffusion maps for the noisy SCurve (Gaussian noise,

variance).
Figure 4: Reduced space computed by (a) PCA and (b)

for the duck of
the COIL-20 database.
The rationale of the first test is to try to recover
the true (initial)
's (which will
be denoted
's) from the
data
's. For this
first test to make sense, we have to consider a manifold whose local
magnification factor is
(the distances
between the
's must be the
same as the ones between the
's since the
distances between the
's are
reproduced in the distances between the
's). The SCurve
was retained for this reason. Since the
's are
determined up to an isometry, the isometry putting the
's and the
's into
correspondence is first computed and applied on the
's.The sum of
the squared distances between both sets of coordinates is then computed (notice
that many other choices for the criterion quantifying the discrepancy between
both sets may have been done. We chose this particular one because it is the
most used). Examining Table 1, we may assert that
and
behave better,
which is coherent with the qualitative observations. Besides, it should be
mentioned that the cost function of Demartines and Hérault (
) behaves quite
well and exhibits good robustness to noise. As expected,
yields better
results in the presence of outliers (see the last line of Table 1).
Table 1: Distances between the original reduced
coordinates and the estimated ones for a 2000-point SCurve. For

, the threshold was set so that

of the lowest
distances

were
considered. LEM, DM, HEM, and LLE stand for Laplacian eigenmap, diffusion map,
Hessian eigenmap, and locally linear embedding, respectively. In the cases of
rows

and

(Gaussian and
Laplacian noise), noise is affecting the coordinates of the

's. The noise
variances (

and

) are
quantified with respect to the variance of the noise-free data. In case of row

, the distances

are perturbated
with a Laplacian noise affecting

of the

’s. The noise
variance (

) is quantified
with respect to the variance of the noise-free

's.
In the second test, we computed the linear correlation
between the true geodesic distances and the Euclidean distances computed from
the estimated reduced coordinates. This was achieved on the SwissRoll. We will
not further comment this benchmark since it was not able to clearly
discriminate between Isomap,
,
and
. Let us notice that other choices might have been
made for the comparison criterion (linear correlation is in essence a quadratic
criterion).
The determination of the optimal dimension of the
reduced space is tackled by the scree test [28],as is classically done in the literature.
Optimizations are achieved for several dimensions of the reduced space and the
stress values at the optima are compared. The dimension retained is the value
after which the stress level does not decrease significantly any more. This can
be seen on Figure 5 for several data sets (the SwissRoll and two sets from the
COIL-20 database). Different noise levels for the SwissRoll indicate that low
noise level has no influence on the number of dimensions. In each case, the
optimal dimension is
.
Figure 5: (a) Scree plots for

on the
SwissRoll, (b) the duck and the piece of wood from the COIL-20 database. The
abscissa axis is the dimension of the reduced space; and the ordinate axis is
the value of the cost function at the solution.
Figure 6: The

images of the
duck set of the COIL-20 database with (a) 0% and (b) 40% occlusion; (c)
reconstruction from PCA-

compression, (d)
reconstruction from

-2 compression.
The occlusion percentage is measured with respect to the entire image and would
be larger if measured with respect to the foreground.
To conclude this section, we comment on both proposed
algorithms. Algorithm 1 is less demanding from a CPU-time point of view, but
requires several runs since it may get stuck at local minima. The computational
burden of Algorithm 2is larger, but the correct solutions were obtained from
sin- gle runs on the data we processed. A
-point SwissRoll
requires
-minute CPU time
with Algorithm 1 (one run),
minutes with
Algorithm 2,
seconds with
Isomap. A
-point SwissRoll
requires
minutes with
Algorithm 1 (one run),
hours with
Algorithm 2, between
seconds and
minutes with
eigenmaps methods,
minutes with
Isomap,
minutes with
Sammon's cost function, and
hours with
. Nevertheless, though quite demanding, a compression
has to be achieved only once for a given data set. For data sets with a very
large number of points, other approaches known as landmark approaches, where
compression is done with a subset of the initial data set, might be more
adapted. The remaining data are then compressed with respect to the fixed
landmarks. The regression phase requires from about 3- to 4-hour CPU time.
3.2. Projection
Two test cases are considered to evaluate the
projection method (and in fact the whole procedure, since the projection method
cannot be isolated).
The first test case is the computation of the
projection error for the nonnoisy SwissRoll and SCurve. Points
are projected
onto the regressed manifold. The average error is less than
, when measured with respect to the standard deviation
of the
's.
The second test case is conducted by projecting
occluded COIL-20 images (see Figure 6) onto each underlying manifold (i.e., a model is learnt for each data set). In the
learning phase, all images (nonoccluded) of a given set are considered.
Regression is achieved using both algorithms (PLM1 and PLM2), the best solution
being retained.
The goal is to recover, for each occluded image, the
original image. A hit is obtained when the image closest to the projection is
the nonoccluded image. A near hit is obtained when the image closest to the
projection is one of the five images closest to the original one. This assesses
the robustness of the procedure. Because of the occlusion, the Gaussian noise
assumption in the projection (see (6)) is replaced by a generalized Gaussian
noise, with exponential rate of decay of
. The results are displayed in Figure 7. It may be
observed that the proposed approach yields better results than PCA-
. PCA-
yields better
results than the proposed approach, particularly in the case displayed in
Figure 7(a), but at the cost of a significant increase of the dimension of the
underlying reduced space. Even if PCA-
behaves well
from a hit or miss point of view, the distance in the reduced space is very
large (see the reconstructed ducks, Figure 6(c)), much larger than the distance
yielded by the proposed approach (Figure 6(d)). As a complement to this hit and
miss point of view, we compute the pose estimation errors (see Figure 8). The
analyses of those errors are consistent with the analysis corresponding to the
hit and miss framework.
Figure 7: Projection of the sets of the COIL-20 database
(abscissa represents the index of the data set, ranging from

to

). The curves
represent (a) hits with

occlusion, (b)
near hits with

occlusion, (c)
hits with

occlusion, (d)
near hits with

occlusion for
different paradigms (PCA and

-dimensional

).
Figure 8: Histograms of the pose estimation error in
degrees of one representative example from the COIL-20 database (

occlusion): (a)
PCA 2, (b) PCA 15, (c)

-dimensional

.
Moreover, we emphasize once more that the true degrees
of freedom are not captured by PCA-
, thus hampering visualization and data understanding.
We remind the reader that the computation of means is an important goal of the
present work, with applications to brain imaging. The Fréchet means [29, chapter 9], [30] will be computed in the
reduced space and then lifted into the original space. It is thus important to
capture the true degrees of freedom of the data in order to avoid spurious
effects induced by extra coordinates. We will thus retain the proposed
compression
approach, along with the regression and Bayesian projection methods detailed in
this article. Finally, let us mention that the computational cost of the projection
phase is negligible (from
to
seconds).
4. Conclusion
In this article, we have introduced an original metric
multidimensional scaling-based nonlinear manifold learning framework, allowing
efficient and robust reduction of high-dimensional data. The approach is
composed of compression, regression, and projection. The original data do not
need to be stored hence saving significant memory space once the model is
learnt.
Data understanding, visualization, classification, and
the computation of means are possible even in the case of highly nonlinear
manifolds. The classical solution to the general problem addressed here is PCA,
which is clearly outperformed by the proposed method. Application of this
method to analysis and classification of shapes in brain imaging is currently
investigated.
Acknowledgment
The authors are grateful to the reviewers and
associate editor for their questions and comments which lead to a significant
improvement of the paper.
References
- N. Kambhatla and T. K. Leen, “Dimension reduction by local principal component analysis,” Neural Computation, vol. 9, no. 7, pp. 1493–1516, 1997.
- J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
- S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
- I. Borg, Modern Multidimensional Scaling, Springer, New York, NY, USA, 2nd edition, 2005.
- M. Balasubramanian and E. L. Schwartz, “The isomap algorithm and topological stability,” Science, vol. 295, no. 5552, p. 7, 2002.
- J. Sammon, “A nonlinear mapping for data structure analysis,” IEEE Transactions on Computers, vol. 18, no. 5, pp. 401–409, 1969.
- A. Webb, Statistical Pattern Recognition, John Wiley & Sons, New York, NY, USA, 2002.
- M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
- D. Donoho and C. Grimes, “Hessian eigenmaps: locally linear embedding techniques for high-dimensional data,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 10, pp. 5591–5596, 2003.
- R. R. Coifman and S. Lafon, “Diffusion maps,” Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 5–30, 2006.
- S. Roweis, L. Saul, and G. Hinton, “Global coordination of local linear models,” in Proceedings of the 14th Annual Conference on Advances in Neural Information Processing Systems
(NIPS '01), vol. 14, pp. 889–896, MIT Press, Vancouver, BC, Canada, December 2001.
- M. Brand, “Charting a manifold,” in Proceedings of the 15th Annual Conference on Advances in Neural Information Processing Systems
(NIPS '02), vol. 15, pp. 985–992, MIT Press, Vancouver, BC, Canada, December 2002.
- Z. Zhang and H. Zha, “Principal manifolds and nonlinear dimensionality reduction via tangent space alignment,” SIAM Journal of Scientific Computing, vol. 26, no. 1, pp. 313–338, 2004.
- T. Hastie and W. Stuetzle, “Principal curves,” Journal of the American Statistical Association, vol. 84, no. 406, pp. 502–516, 1989.
- B. Kégl, A. Krzyzak, T. Linder, and K. Zeger, “Learning and design of principal curves,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 3, pp. 281–297, 2000.
- M. LeBlanc and R. Tibshirani, “Adaptive principal surfaces,” Journal of the American Statistical Association, vol. 89, no. 425, pp. 53–64, 1994.
- A. Elgammal and C.-S. Lee, “Inferring 3D body pose from silhouettes using activity manifold learning,” in Proceedings of IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR '04), vol. 2, pp. 681–688, Washington, DC, USA, June-July 2004.
- R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley & Sons, New York, NY, USA, 2nd edition, 2001.
- P. Demartines and J. Hérault, “Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets,” IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 148–154, 1997.
- T. Cox and M. Cox, Multidimensional Scaling, Chapman & Hall, Boca Raton, Fla, USA, 2001.
- D. K. Agrafiotis and H. Xu, “A self-organizing principle for learning nonlinear manifolds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 25, pp. 15869–15872, 2002.
- J. Ham, D. D. Lee, S. Mika, and B. Schölkopf, “A kernel view of the dimensionality reduction of manifolds,” in Proceedings of the 21st International Conference on Machine Learning (ICML '04), pp. 369–376, Banff, Canada, July 2004.
- J. Friedman and W. Stuetzle, “Projection pursuit regression,” Journal of the American Statistical Association, vol. 76, no. 376, pp. 817–823, 1981.
- J. Friedman, “Multivariate adaptive regression splines,” Annals of Statistics, vol. 19, no. 1, pp. 1–67, 1991.
- T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, Berlin, Germany, 2003.
- R. Haralick and R. Harpaz, “Linear manifold clustering in high dimensional spaces by stochastic search,” Pattern Recognition, vol. 40, no. 10, pp. 2672–2684, 2007.
- S. Nene, S. Nayar, and H. Murase, “Columbia object image library (coil-20),” Deparment of Computer Science, Columbia University, New York, NY, USA, 1996.
- R. Cattell, “The scree test for the number of factors,” Multivariate Behaviorial Research, vol. 1, no. 2, pp. 245–276, 1966.
- D. Kendall, D. Barden, T. Carne, and H. Le, Shape and Shape Theory, John Wiley & Sons, New York, NY, USA, 1999.
- A. Kume and H. Le, “On Fréchet means in simplex shape spaces,” Advances in Applied Probability, vol. 35, no. 4, pp. 885–897, 2003.