Abstract

Kernel sliced inverse regression (KSIR) is a natural framework for nonlinear dimension reduction using the mapping induced by kernels. However, there are numeric, algorithmic, and conceptual subtleties in making the method robust and consistent. We apply two types of regularization in this framework to address computational stability and generalization performance. We also provide an interpretation of the algorithm and prove consistency. The utility of this approach is illustrated on simulated and real data.

1. Introduction

The goal of dimension reduction in the standard regression/classification setting is to summarize the information in the $p$-dimensional predictor variable $X$ that is relevant to predicting the univariate response variable $Y$. The summary statistic $S(X)$ should have $d \ll p$ variates and ideally should satisfy the conditional independence property $Y \perp X \mid S(X)$. Thus, any inference on $Y$ involves only the summary statistic $S(X)$, which is of much lower dimension than the original data $X$.

Linear methods for dimension reduction focus on linear summaries of the data, $S(X) = (\beta_1^\top X, \ldots, \beta_d^\top X)$. The $d$-dimensional subspace $B = \operatorname{span}\{\beta_1, \ldots, \beta_d\}$ is defined as the effective dimension reduction (e.d.r.) space in [1], since the projection of $X$ onto $B$ summarizes all the predictive information on $Y$. A key result in [1] is that, under some mild conditions, the e.d.r. directions correspond to the top eigenvectors of the matrix $\Sigma_X^{-1}\operatorname{Cov}(\mathbb{E}[X \mid Y])$, where $\Sigma_X = \operatorname{Cov}(X)$. Thus, the e.d.r. directions or subspace can be estimated via an eigenanalysis of the sample version of this matrix, which is the foundation of the sliced inverse regression (SIR) algorithm proposed in [1, 2]. Further developments include sliced average variance estimation (SAVE) [3] and principal Hessian directions (PHD) [4]. The aforementioned algorithms cannot be applied in high-dimensional settings, where the number of covariates $p$ is greater than the number of observations $n$, since the sample covariance matrix is singular. Recently, an extension of SIR has been proposed in [5], which can handle the case $p > n$ based on the idea of partial least squares.
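To make the SIR eigenanalysis concrete, the following is a minimal sketch in Python/NumPy (the function name and the equal-frequency slicing scheme are illustrative assumptions, not the exact implementation of [1]); it slices the sorted response, forms the between-slice covariance of the slice means, and solves the generalized eigenproblem against the sample covariance.

    import numpy as np
    from scipy.linalg import eigh

    def sir_directions(X, y, n_slices=10, n_dirs=2):
        """Sketch of SIR: leading eigenvectors of Cov(E[X|Y]) relative to Cov(X).

        Assumes n > p so that the sample covariance is invertible."""
        n, p = X.shape
        Xc = X - X.mean(axis=0)                    # center the predictors
        Sigma = Xc.T @ Xc / n                      # sample covariance of X
        Gamma = np.zeros((p, p))                   # between-slice covariance of slice means
        for idx in np.array_split(np.argsort(y), n_slices):
            m = Xc[idx].mean(axis=0)               # slice mean of the centered predictors
            Gamma += (len(idx) / n) * np.outer(m, m)
        # generalized eigenproblem Gamma b = lambda Sigma b; eigh returns ascending eigenvalues
        _, evecs = eigh(Gamma, Sigma)
        return evecs[:, ::-1][:, :n_dirs]          # e.d.r. directions for the largest eigenvalues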

A common premise in high-dimensional data analysis is that the intrinsic structure of the data is in fact low dimensional, for example, the data are concentrated on a manifold. Linear methods such as SIR often fail to capture this nonlinear low-dimensional structure. However, there may exist a nonlinear embedding of the data into a Hilbert space where a linear method can capture the low-dimensional structure. The basic idea in applying kernel methods is the application of a linear algorithm to the data mapped into a feature space induced by a kernel function. If projections onto this low-dimensional structure can be computed by inner products in this Hilbert space, the so-called kernel trick [6, 7] can be used to obtain simple and efficient algorithms. Since the embedding is nonlinear, linear directions in the feature space correspond to nonlinear directions in the original data space. Nonlinear extensions of classical linear dimension reduction methods using this approach include kernel principal component analysis [6], kernel Fisher discriminant analysis [8], and kernel independent correlation analysis [9]. This idea was applied to SIR in [10, 11], resulting in the kernel sliced inverse regression (KSIR) method, which allows for the estimation of nonlinear e.d.r. directions.

There are numeric, algorithmic, and conceptual subtleties to a direct application of this kernel idea to SIR, although it looks quite natural at first glance. In KSIR, the $p$-dimensional data are projected into a Hilbert space $\mathcal{H}$ through a feature map $\phi: \mathcal{X} \to \mathcal{H}$, and the nonlinear features are supposed to be recovered by the eigenfunctions of the operator $\Sigma^{-1}\Gamma$, where $\Sigma$ is the covariance operator of $\phi(X)$ and $\Gamma$ is the covariance operator of the inverse regression curve $\mathbb{E}[\phi(X) \mid Y]$. However, this operator is not well defined in general, especially when $\mathcal{H}$ is infinite dimensional and the covariance operator $\Sigma$ is not invertible. In addition, the key utility of representer theorems in kernel methods is that optimization in a possibly infinite dimensional Hilbert space reduces to solving a finite dimensional optimization problem. In the KSIR algorithm developed in [10, 11], the representer theorem has been used implicitly and seems to work well in empirical studies. However, this use is not theoretically justified, since the relevant operators are only estimated empirically from the observed data. Moreover, the computation of eigenfunctions of an empirical estimate of $\Sigma^{-1}\Gamma$ from $n$ observations is often ill-conditioned and results in computational instability. In [11], a low rank approximation of the kernel matrix is used to overcome this instability and to reduce computational complexity. In this paper, our aim is to clarify the theoretical subtleties in KSIR and to motivate two types of regularization schemes that overcome the computational difficulties arising in KSIR. Consistency is proven and the practical advantages of regularization are demonstrated via empirical experiments.

2. Mercer Kernels and Nonlinear e.d.r. Directions

The extension of SIR to use kernels is based on properties of reproducing kernel Hilbert spaces (RKHSs) and in particular Mercer kernels [12].

Given predictor variables $X \in \mathcal{X} \subset \mathbb{R}^p$, a Mercer kernel is a continuous, symmetric, positive semidefinite function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the spectral decomposition $k(x, x') = \sum_{j} \lambda_j \psi_j(x)\psi_j(x')$, where the $\psi_j$ are the eigenfunctions and the $\lambda_j$ are the corresponding nonnegative, nonincreasing eigenvalues. An important property of Mercer kernels is that each kernel uniquely corresponds to an RKHS of the form $\mathcal{H}_k = \{f = \sum_j c_j \psi_j : \sum_j c_j^2/\lambda_j < \infty\}$ (5), where the number of nonzero eigenvalues is the dimension of the RKHS, which may be infinite [12, 13].

Given a Mercer kernel, there exists a unique map or embedding from $\mathcal{X}$ to a Hilbert space defined by the eigenvalues and eigenfunctions of the kernel. The map takes the form $\phi(x) = (\sqrt{\lambda_1}\,\psi_1(x), \sqrt{\lambda_2}\,\psi_2(x), \ldots)$, so that $\langle \phi(x), \phi(x')\rangle = k(x, x')$. The Hilbert space induced by this map with the standard inner product is isomorphic to the RKHS (5), and we will denote both Hilbert spaces by $\mathcal{H}$ [13]. In the case where the RKHS is infinite dimensional, the map takes values in $\ell^2$.
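As a small illustration of the kernel trick in this setting (a sketch only; the Gaussian kernel and the helper name are assumptions, not necessarily the kernels used later in the paper), inner products of the implicit feature map are obtained purely from kernel evaluations:

    import numpy as np

    def gaussian_gram(X, Z, sigma=1.0):
        """Gram matrix with entries k(x_i, z_j) = exp(-||x_i - z_j||^2 / (2 sigma^2)).

        Each entry equals the inner product <phi(x_i), phi(z_j)> in the RKHS,
        so the feature map phi is never formed explicitly."""
        sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * sigma ** 2))

For example, gaussian_gram(X, X) gives the (uncentered) training Gram matrix.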

The random variable $X$ induces a random element $\phi(X)$ in the RKHS. Throughout this paper, we will use Hilbert space valued random variables, so we now recall some basic facts. Let $Z$ be a random element in $\mathcal{H}$ with $\mathbb{E}\|Z\| < \infty$, where $\|\cdot\|$ denotes the norm in $\mathcal{H}$ induced by its inner product $\langle\cdot,\cdot\rangle$. The expectation $\mathbb{E}[Z]$ is defined to be the element of $\mathcal{H}$ satisfying $\langle\mathbb{E}[Z], f\rangle = \mathbb{E}\langle Z, f\rangle$ for all $f \in \mathcal{H}$. If $\mathbb{E}\|Z\|^2 < \infty$, then the covariance operator of $Z$ is defined as the linear operator $\Sigma_Z$ on $\mathcal{H}$ satisfying $\langle f, \Sigma_Z g\rangle = \mathbb{E}\big[\langle f, Z - \mathbb{E}Z\rangle\langle Z - \mathbb{E}Z, g\rangle\big]$ for all $f, g \in \mathcal{H}$.

Let $\rho_X$ denote the measure of the random variable $X$. Throughout, we assume the following conditions.

Assumption 1. (1) For all $x \in \mathcal{X}$, $k(\cdot, x)$ is measurable with respect to $\rho_X$. (2) There exists $\kappa > 0$ such that $k(X, X) \le \kappa^2$ (a.s.) with respect to $\rho_X$.

Under Assumption 1, the random element $\phi(X)$ has a well-defined mean and covariance operator because $\|\phi(X)\| = \sqrt{k(X, X)} \le \kappa$ is bounded (a.s.). Without loss of generality, we assume $\mathbb{E}[\phi(X)] = 0$, where $0$ is the zero element in $\mathcal{H}$. The boundedness also implies that the covariance operator $\Sigma$ of $\phi(X)$ is compact and has the spectral decomposition $\Sigma = \sum_{j} \theta_j\, e_j \otimes e_j$ (8), where $\theta_1 \ge \theta_2 \ge \cdots \ge 0$ and $e_1, e_2, \ldots$ are the eigenvalues and eigenfunctions, respectively.

We assume the following model for the relationship between $Y$ and $X$: $Y = f(\langle\beta_1, \phi(X)\rangle, \ldots, \langle\beta_d, \phi(X)\rangle, \epsilon)$ (9), with $\beta_1, \ldots, \beta_d \in \mathcal{H}$ and the distribution of the noise $\epsilon$ independent of $X$. This model implies that the response variable $Y$ depends on $X$ only through the $d$-dimensional summary statistic $(\langle\beta_1, \phi(X)\rangle, \ldots, \langle\beta_d, \phi(X)\rangle)$. Although this is a linear summary statistic in $\mathcal{H}$, it extracts nonlinear features in the space of the original predictor variables $X$. We call $\beta_1, \ldots, \beta_d$ the nonlinear e.d.r. directions, and $\operatorname{span}\{\beta_1, \ldots, \beta_d\}$ the nonlinear e.d.r. space. The following proposition [10] extends the theoretical foundation of SIR to this nonlinear setting.

Proposition 2. Assume the following linear design condition (11) for $\phi(X)$: for any $f \in \mathcal{H}$, there exists a vector $(b_0, b_1, \ldots, b_d)$ such that $\mathbb{E}[\langle f, \phi(X)\rangle \mid \langle\beta_1, \phi(X)\rangle, \ldots, \langle\beta_d, \phi(X)\rangle] = b_0 + \sum_{i=1}^{d} b_i \langle\beta_i, \phi(X)\rangle$. Then, for the model specified in (9), the inverse regression curve $\mathbb{E}[\phi(X)\mid Y]$ is contained in the span of $\Sigma\beta_1, \ldots, \Sigma\beta_d$, where $\Sigma$ is the covariance operator of $\phi(X)$.

Proposition 2 is a straightforward extension of the multivariate case in [1] to a Hilbert space, or a direct application of the functional SIR setting in [14]. Although the linear design condition (11) may be difficult to check in practice, it has been shown that such a condition usually holds approximately in a high-dimensional space [15]. This conforms to the argument in [11] that linearity in a reproducing kernel Hilbert space is a weaker requirement than linearity in Euclidean space. Moreover, it is pointed out in [1, 10] that even if the linear design condition is violated, the bias of the SIR variates is usually not large.

An immediate consequence of Proposition 2 is that the nonlinear e.d.r. directions are the eigenvectors corresponding to the largest eigenvalues of the generalized eigendecomposition problem $\Gamma\beta = \lambda\Sigma\beta$ (12), where $\Gamma$ denotes the covariance operator of the inverse regression curve $\mathbb{E}[\phi(X)\mid Y]$, or equivalently from an eigenanalysis of the operator $\Sigma^{-1}\Gamma$. In the infinite dimensional case, a technical difficulty arises since the operator $\Sigma^{-1}$ is not defined on the entire Hilbert space $\mathcal{H}$. So for the operator $\Sigma^{-1}\Gamma$ to be well defined, we need to show that the range of $\Gamma$ is indeed contained in the range of $\Sigma$. A similar issue also arose in the analysis of dimension reduction and canonical analysis for functional data [16, 17]. In those analyses, extra conditions are needed for operators like $\Sigma^{-1}\Gamma$ to be well defined. In KSIR, this issue is resolved automatically by the linear design condition, and no extra conditions are required, as stated by the following theorem; see Appendix A for the proof.

Theorem 3. Under Assumption 1 and the linear design condition (11), the following hold: (i) the operator $\Gamma$ is of finite rank $r \le d$; consequently, it is compact and has the spectral decomposition $\Gamma = \sum_{i=1}^{r} \tau_i\, \eta_i \otimes \eta_i$ (14), where $\tau_1 \ge \cdots \ge \tau_r > 0$ and $\eta_1, \ldots, \eta_r$ are the eigenvalues and eigenvectors, respectively; moreover, $\eta_i$ lies in the range of $\Sigma$ for all $i$; (ii) the eigendecomposition problem (12) is equivalent to the eigenanalysis of the operator $\Sigma^{-1}\Gamma$, which is well defined and of finite rank.

3. Regularized Kernel Sliced Inverse Regression

The discussion in Section 2 implies that nonlinear e.d.r. directions can be retrieved by applying the original SIR algorithm in the feature space induced by the Mercer kernel. There are computational challenges to this idea: the covariance operator may be infinite dimensional, and the feature map is often difficult or impossible to compute for many kernels. We address these issues by working with inner products of the feature map and by adding a regularization term to kernel SIR.

3.1. Estimating the Nonlinear e.d.r. Directions

Given observations $\{(x_i, y_i)\}_{i=1}^{n}$, our objective is to obtain an estimate of the e.d.r. directions $\beta_1, \ldots, \beta_d$. We first formulate a procedure almost identical to the standard SIR procedure except that it operates in the feature space $\mathcal{H}$. This highlights the immediate relation between the SIR and KSIR procedures.

(1) Without loss of generality, assume that the mapped predictor variables are mean zero, that is, $\frac{1}{n}\sum_{i=1}^{n}\phi(x_i) = 0$; otherwise subtract the sample mean from each $\phi(x_i)$. The sample covariance is estimated by $\widehat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)\otimes\phi(x_i)$.

(2) Bin the responses $y_i$ into $H$ slices $S_1, \ldots, S_H$ and compute the mean vector of the corresponding mapped predictor variables for each slice, $\bar{\phi}_h = \frac{1}{n_h}\sum_{i \in S_h}\phi(x_i)$, where $n_h$ is the number of observations in slice $S_h$.

(3) Compute the sample between-group covariance operator $\widehat{\Gamma} = \sum_{h=1}^{H}\frac{n_h}{n}\,\bar{\phi}_h\otimes\bar{\phi}_h$.

(4) Estimate the SIR directions by solving the generalized eigendecomposition problem $\widehat{\Gamma}\beta = \lambda\widehat{\Sigma}\beta$ (19).

This procedure is computationally infeasible if the RKHS is infinite dimensional or the feature map cannot be computed (which is the usual case). However, the model given in (9) requires not the e.d.r. directions themselves but the projections onto these directions, that is, the summary statistics $\langle\beta_j, \phi(x)\rangle$, which we call the KSIR variates. Like other kernel algorithms, the kernel trick enables the KSIR variates to be computed efficiently from only the kernel $k$, not the map $\phi$.

The key quantity in this alternative formulation is the centered Gram matrix $K$ defined by the kernel function $k$, where $K_{ij} = k(x_i, x_j) - \frac{1}{n}\sum_{l=1}^{n} k(x_i, x_l) - \frac{1}{n}\sum_{l=1}^{n} k(x_l, x_j) + \frac{1}{n^2}\sum_{l,m=1}^{n} k(x_l, x_m)$. Note that the rank of $K$ is less than $n$; so $K$ is always singular.

Given the centered Gram matrix $K$, the following generalized eigendecomposition problem can be used to compute the KSIR variates: $KJKc = \lambda KKc$ (22), where $c \in \mathbb{R}^n$ denotes the $n$-dimensional generalized eigenvector and $J$ denotes an $n \times n$ matrix with $J_{ij} = 1/n_h$ if $y_i$ and $y_j$ are in the $h$th group consisting of $n_h$ observations and zero otherwise. The following proposition establishes the equivalence between the two eigendecomposition problems, (22) and (19), in the recovery of the KSIR variates.

Proposition 4. Given the observations $\{(x_i, y_i)\}_{i=1}^{n}$, let $c_j$ and $\beta_j$ denote the generalized eigenvectors of (22) and (19), respectively. Then, for any $i$ and $j$, $\langle\beta_j, \phi(x_i)\rangle = (Kc_j)_i$ holds, provided that $\widehat{\Sigma}$ is invertible. When $\widehat{\Sigma}$ is not invertible, the conclusion holds modulo the null space of $\widehat{\Sigma}$.

This result was proven in [10], in which the algorithm was further reduced to solving $JKc = \lambda Kc$ (24) by canceling $K$ from both sides of (22).
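For reference, the two matrices entering (22) and (24) can be formed as in the following sketch (the helper names and the equal-frequency slicing of the sorted responses are illustrative assumptions rather than a prescription of the original algorithm):

    import numpy as np

    def center_gram(K):
        """Double-center a Gram matrix: K_ij minus row mean, minus column mean, plus grand mean."""
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        return H @ K @ H

    def slice_weight_matrix(y, n_slices=10):
        """Matrix J with J_ij = 1/n_h if y_i and y_j fall in the same slice of size n_h, else 0."""
        n = len(y)
        J = np.zeros((n, n))
        for idx in np.array_split(np.argsort(y), n_slices):
            J[np.ix_(idx, idx)] = 1.0 / len(idx)
        return J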

Remark 5. It is important to remark that, when $\widehat{\Sigma}$ is not invertible, Proposition 4 states that the equivalence between (19) and (22) holds modulo the null space of $\widehat{\Sigma}$, which is a requirement of the representer theorem, that is, that the $\beta_j$ are linear combinations of $\phi(x_1), \ldots, \phi(x_n)$. Without this mandated condition, we will see that each eigenvector of (22) produces an eigenvector of (19), while an eigenvector of (19) is not necessarily recovered by (22).

Remark 6. It is necessary to clarify the difference between (12) and (19). In (12), it is natural to assume that $\beta$ is orthogonal to the null space of $\Sigma$. To see this, write $\beta = \beta_0 + \beta_1$ with $\beta_0$ belonging to the null space and $\beta_1$ orthogonal to it. Then $\operatorname{Var}(\langle\beta_0, \phi(X)\rangle) = \langle\beta_0, \Sigma\beta_0\rangle = 0$, that is, $\langle\beta_0, \phi(X)\rangle = 0$ (a.s.). This means that $\beta_0$ does not contribute to the KSIR variates and thus can be set to 0. However, in (19), if $\beta$ is an eigenvector and $\beta_1$ is its component orthogonal to the null space of $\widehat{\Sigma}$, the identity $\langle\beta, \phi(x)\rangle = \langle\beta_1, \phi(x)\rangle$ is in general not true for a new point $x$ that differs from the observations. Thus, from a theoretical perspective, it is not as natural to assume the representer theorem, although it works well in practice. In this sense, the KSIR algorithm based on (22) or (24) does not have a thorough mathematical foundation.

3.2. Regularization and Stability

Aside from the theoretical subtleties, in applications with relatively small samples the eigendecomposition in (22) is often ill-conditioned, resulting in overfitting as well as numerically unstable estimates of the e.d.r. space. This can be addressed either by thresholding eigenvalues of the estimated covariance matrix or by adding a regularization term to (22) or (24).

We motivate two types of regularization schemes. The first is the traditional ridge regularization, used in both linear SIR and functional SIR [18–20], which solves the eigendecomposition problem $\widehat{\Gamma}\beta = \lambda(\widehat{\Sigma} + sI)\beta$ (26). Here, and in the sequel, $I$ denotes the identity operator and $s > 0$ is a tuning parameter. Assuming the representer theorem, its kernel form is given as $KJKc = \lambda(KK + nsK)c$ (27).

Another type of regularization is to regularize (22) directly: $KJKc = \lambda(KK + n^2 sI)c$ (28). The following proposition, whose proof is given in Appendix B, states that solving the generalized eigendecomposition problem (28) is equivalent to finding the eigenvectors of the operator $(\widehat{\Sigma}^2 + sI)^{-1}\widehat{\Sigma}\widehat{\Gamma}$ (29).

Proposition 7. Let $c_j$ be the eigenvectors of (28), and let $\beta_j$ be the eigenvectors of (29). Then, the regularized KSIR variates satisfy $\langle\beta_j, \phi(x_i)\rangle = (Kc_j)_i$ for all $i$ and $j$.

This scheme is termed Tikhonov regularization. For linear SIR, it is shown in [21] that Tikhonov regularization is more efficient than ridge regularization.
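A sketch of how the Tikhonov-regularized problem might be solved numerically is given below, assuming the kernel form of (28) reconstructed above with any sample-size scaling of the regularization constant absorbed into the tuning parameter s; the KSIR variates at the training points are then the columns of Kc.

    import numpy as np
    from scipy.linalg import eigh

    def rksir_variates(K, J, s=1e-3, n_dirs=2):
        """Tikhonov-regularized KSIR sketch: solve K J K c = lambda (K K + s I) c.

        K is the centered Gram matrix, J the slice weight matrix; s > 0 is the
        regularization parameter (scaling constants absorbed into s)."""
        n = K.shape[0]
        A = K @ J @ K                       # kernel form of the between-slice covariance
        A = (A + A.T) / 2                   # symmetrize against round-off
        B = K @ K + s * np.eye(n)           # regularization makes the right-hand side positive definite
        _, C = eigh(A, B)                   # generalized eigenproblem, ascending eigenvalues
        C = C[:, ::-1][:, :n_dirs]          # keep eigenvectors with the largest eigenvalues
        return C, K @ C                     # eigenvectors and KSIR variates at the training points

In this sketch, only the variates at the training points are returned; computing variates at new points additionally requires centering the test kernel columns consistently with the training data.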

Beyond computational stability, regularization also makes the matrix forms of KSIR, (27) and (28), interpretable through a rigorously justified representer theorem.

Proposition 8. For both the ridge and the Tikhonov regularization schemes of KSIR, the eigenfunctions $\beta_j$ are linear combinations of $\phi(x_1), \ldots, \phi(x_n)$.

The conclusion follows from the observation that $\beta = \frac{1}{s}\big(\frac{1}{\lambda}\widehat{\Gamma} - \widehat{\Sigma}\big)\beta$ for the ridge regularization and $\beta = \frac{1}{s}\big(\frac{1}{\lambda}\widehat{\Sigma}\widehat{\Gamma} - \widehat{\Sigma}^2\big)\beta$ for the Tikhonov regularization, together with the fact that both operators on the right-hand side have range contained in $\operatorname{span}\{\phi(x_1), \ldots, \phi(x_n)\}$.

To close, we remark that KSIR is computationally advantageous even for the case of linear models when $p \gg n$, due to the fact that the eigendecomposition problem involves $n \times n$ matrices rather than the $p \times p$ matrices in the standard SIR formulation.

3.3. Consistency of Regularized KSIR

In this subsection, we prove the asymptotic consistency of the e.d.r. directions estimated by regularized KSIR and provide conditions under which an explicit rate of convergence can be obtained. An important observation from the proof is that the rate of convergence of the e.d.r. directions depends on the contribution of the small principal components. The rate can be arbitrarily slow if the e.d.r. space depends heavily on eigenvectors corresponding to small eigenvalues of the covariance operator.

Note that various consistency results are available for linear SIR [22–24]. These results hold only in the finite dimensional setting and cannot be adapted to KSIR, where the RKHS is often infinite dimensional. Consistency of functional SIR has also been studied before. In [14], a thresholding method is considered, which selects a finite number of eigenvectors and uses results for finite rank operators. Their proof of consistency requires stronger and more complicated conditions than ours. The consistency of functional SIR with ridge regularization is proven in [19], but it is of a weaker form than our result. We remark that the consistency results for functional SIR can be improved using an argument similar to the one in this paper.

In the following, we state the consistency result for the Tikhonov regularization. A similar result can be proved for the ridge regularization; the details are omitted.

Theorem 9. Assume , and ; then where is the rank of , is the projection onto the th e.d.r., and is the projection onto the th e.d.r. as estimated by regularized KSIR.
If the e.d.r. directions depend only on a finite number of eigenvectors of the covariance operator , the rate of convergence is .

This theorem is a direct corollary of the following theorem which is proven in Appendix C.

Theorem 10. Define the projection operator and its complement for each where are the eigenvectors of the covariance operator as defined in (8), with the corresponding eigenvalues denoted by .
Assume . For each , the following holds: where and are the eigenvectors of as defined in (14) and denotes the Hilbert-Schmidt norm of a linear operator.
If satisfy and as , then

4. Application to Simulated and Real Data

In this section, we compare regularized kernel sliced inverse regression (RKSIR) with several other SIR-related dimension reduction methods. The comparisons are used to address two questions: does regularization improve the performance of kernel sliced inverse regression, and does the nonlinearity of kernel sliced inverse regression improve the prediction accuracy?

We would like to remark that the assessment of nonlinear dimension reduction methods can be more difficult than that of linear ones. When the feature mapping for an RKHS is not available, we do not know the true e.d.r. directions or subspace. In that case, we use prediction accuracy to evaluate the performance of RKSIR.

4.1. Importance of Nonlinearity and Regularization

Our first example illustrates that both the nonlinearity and regularization of RKSIR can significantly improve prediction accuracy.

The regression model has ten predictor variables with each one following a normal distribution . A univariate response is given as with noise . We compare the effectiveness of the linear dimension reduction methods SIR, SAVE, and PHD with RKSIR, by examining the predictive accuracy of a nonlinear kernel regression model on the reduced space. We generate 100 training samples and apply the above methods to compute the e.d.r. directions. In RKSIR, we used an additive Gaussian kernel as follows: where . After projecting the training samples on the estimated e.d.r. directions, we train a Gaussian kernel regression model based on the new variates. Then, the mean square error is computed on 1000 independent test samples. This experiment is repeated 100 times with all parameters set by cross-validation. The results are summarized in Figure 1.
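The exact form of the additive Gaussian kernel used here is not reproduced in the text above; one common form, given purely as an illustrative assumption, sums a univariate Gaussian kernel over the coordinates:

    import numpy as np

    def additive_gaussian_gram(X, Z, sigma=1.0):
        """One possible additive Gaussian kernel (an assumption, not necessarily the
        kernel used in this experiment): k(x, z) = sum_j exp(-(x_j - z_j)^2 / (2 sigma^2))."""
        diff = X[:, None, :] - Z[None, :, :]
        return np.exp(-diff ** 2 / (2.0 * sigma ** 2)).sum(axis=2)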

Figure 1(a) displays the accuracy of RKSIR as a function of the regularization parameter, illustrating the importance of selecting the regularization parameter. KSIR without regularization performs much worse than RKSIR. Figure 1(b) displays the prediction accuracy of the various dimension reduction methods.

RKSIR outperforms all the linear dimension reduction methods, which illustrates the power of nonlinearity introduced in RKSIR. It also suggests that there are essentially two nonlinear e.d.r. directions. This observation seems to agree with the model in (35). Indeed, Figures 1(c) and 1(d) show that the first two e.d.r. directions from RKSIR estimate the two nonlinear factors well.

4.2. Effect of Regularization

This example illustrates the effect of regularization on the performance of KSIR as a function of the anisotropy of the predictors.

The regression model has ten predictor variables and a univariate response specified by where and with a randomly chosen orthogonal matrix and . We will see that increasing the parameter increases the anisotropy of the data, which in turn increases the difficulty of identifying the correct e.d.r. directions.

For this model, it is known that SIR will miss the direction along the second variable . So we focus on the comparison of KSIR and RKSIR in this example.

If we use a second-order polynomial kernel, which corresponds to a feature space containing all monomials of degree at most two, then the nonlinear term in (37) can be captured in one e.d.r. direction. Ideally, the first KSIR variate should be equivalent to it modulo shift and scale. So for this example, given estimates of the KSIR variates at the data points, the error of the first e.d.r. direction can be measured by the least squares fit of the estimated variates to this term.
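The claim that a second-order polynomial kernel linearizes a quadratic function can be checked directly. The sketch below assumes the common inhomogeneous form k(x, z) = (1 + x·z)^2, which may differ from the kernel actually used in this experiment; its explicit feature map consists of the constant, the (scaled) linear terms, and all degree-two monomials, so any quadratic in x is a linear functional in the feature space.

    import numpy as np

    def poly2_features(x):
        """Explicit feature map for k(x, z) = (1 + x.z)^2 (assumed form):
        constant, sqrt(2)*x_i, x_i^2, and sqrt(2)*x_i*x_j for i < j."""
        d = len(x)
        feats = [1.0]
        feats += list(np.sqrt(2.0) * x)
        feats += list(x ** 2)
        feats += [np.sqrt(2.0) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
        return np.array(feats)

    rng = np.random.default_rng(0)
    x, z = rng.normal(size=10), rng.normal(size=10)
    # the explicit features reproduce the kernel value
    assert np.isclose((1.0 + x @ z) ** 2, poly2_features(x) @ poly2_features(z))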

We drew observations from the model specified in (37) and then applied the two dimension reduction methods KSIR and RKSIR. The means and standard errors over repetitions of this procedure are reported in Figure 2. The results show that KSIR becomes more and more unstable as the anisotropy increases and that regularization helps to reduce this instability.

4.3. Importance of Nonlinearity and Regularization in Real Data

When SIR is applied to classification problems, it is equivalent to Fisher discriminant analysis. For multiclass classification, it is natural to use SIR and consider each class as a slice. Kernel forms of Fisher discriminant analysis (KFDA) [8] have been used to construct nonlinear discriminant surfaces, and regularization has been shown to improve the performance of KFDA [25]. In this example, we show that this idea of adding nonlinearity and a regularization term improves predictive accuracy on a real multiclass classification data set, the classification of handwritten digits.

The MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist/) contains 60,000 images of handwritten digits as training data and 10,000 images as test data. Each image consists of 28 × 28 gray-scale pixel intensities. It is commonly believed that there is clear nonlinear structure in this 784-dimensional space.

We compared regularized SIR (RSIR) as in (26), KSIR, and RKSIR on this data to examine the effects of regularization and nonlinearity. Each draw of the training set consisted of a fixed number of observations of each digit. We then computed the top e.d.r. directions using these observations and 10 slices, one for each digit. We projected the test observations onto the e.d.r. directions and used a k-nearest neighbor (kNN) classifier to classify the test data. The accuracy of the kNN classifier without dimension reduction was used as a baseline. For KSIR and RKSIR, we used a Gaussian kernel with the bandwidth parameter set to the median pairwise distance between observations. The regularization parameter was set by cross-validation.
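The median-heuristic bandwidth mentioned above can be computed as in the following sketch (the helper name is hypothetical):

    import numpy as np

    def median_bandwidth(X):
        """Median of the pairwise Euclidean distances between the rows of X,
        used as the Gaussian kernel bandwidth."""
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return np.sqrt(np.median(d2[np.triu_indices_from(d2, k=1)]))

In the notation of the earlier sketches, the training Gram matrix would then be, for example, gaussian_gram(X, X, sigma=median_bandwidth(X)).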

The mean and standard deviation of the classification accuracy over iterations of this procedure are reported in Table 1. The first interesting observation is that the linear dimension reduction does not capture discriminative information, as the classification accuracy without dimension reduction is better. Nonlinearity does increase classification accuracy and coupling regularization with nonlinearity increases accuracy more. This improvement is dramatic for , , and .

5. Discussion

The interest in manifold learning and nonlinear dimension reduction in both statistics and machine learning has led to a variety of statistical models and algorithms. However, most of these methods are developed in the unsupervised learning framework, so the estimated dimensions may not be optimal for regression models. Our work incorporates nonlinearity and regularization into inverse regression approaches and results in a robust, response-driven nonlinear dimension reduction method.

RKHS has also been introduced into supervised dimension reduction in [26], where the conditional covariance operators on such kernel spaces are used to characterize the conditional independence between linear projections and the response variable. Therefore, their method estimates linear e.d.r. subspaces, while in [10, 11] and in this paper, RKHS is used to model the e.d.r. subspace, which leads to nonlinear dimension reduction.

There are several open issues in the regularized kernel SIR method, such as the selection of the kernel, the regularization parameter, and the number of dimensions. A direct assessment of the nonlinear e.d.r. directions would be expected to reduce the computational burden of procedures based on cross-validation. While these issues are well understood for linear dimension reduction, little is known for nonlinear dimension reduction. We leave them for future research.

There are some interesting connections between KSIR and functional SIR, which are developed by Ferré and his coauthors in a series of papers [14, 17, 19]. In functional SIR, the observable data are functions and the goal is to find linear e.d.r. directions for functional data analysis. In KSIR, the observable data are typically not functions but mapped into a function space in order to characterize nonlinear structures. This suggests that computations involved in functional SIR can be simplified by a parametrization with respect to an RKHS or using a linear kernel in the parametrized function space. On the other hand, from a theoretical point of view, KSIR can be viewed as a special case of functional SIR, although our current theoretical results on KSIR are different from the ones for functional SIR.

Appendices

A. Proof of Theorem 3

Under the assumption of Proposition 2, for each ,

Since , the rank of (i.e., the dimension of the image of ) is less than , which implies that is compact. Since it is also symmetric and positive semidefinite, there exist positive eigenvalues and eigenvectors , such that .

Recall that for any , also belongs to because of (A.1); so This proves (i).

Since for each , , the operator is well defined over the whole space. Moreover, This proves (ii).

B. Proof of Proposition 7

We first prove the proposition for matrices to simplify the notation; we then extend the result to operators, where the dimension of $\mathcal{H}$ is infinite and a matrix form does not make sense.

Let . It has the following singular value decomposition: where , and is a diagonal matrix of dimension .

We need to show the KSIR variates It suffices to prove that if is a solution to (28), then is also a pair of eigenvalue and eigenvector of and vice versa, where and are related by Using the facts , , and , the argument may be made as follows: Note that the implication in the third step is necessary only in one direction, which is obtained by multiplying both sides and using the facts . For the last step, since , we use the following facts:

In order for this result to hold rigorously when the RKHS is infinite dimensional, we need to formally define , , and the SVD of when is infinite. For the infinite dimensional case, is an operator from to defined by for and is its adjoint, an operator from to such that for . The notions and are similarly defined.

The above formulation of and coincides with the definition of as a covariance operator. Since the rank of is less than , it is compact and has the following representation: where is the rank and . This implies that each lies in and hence we can write , where should be considered as an operator from to and . Denote . It is easy to check that . Let and . Then, we obtain the SVD for as , which is well defined.

C. Proof of Consistency

C.1. Preliminaries

In order to prove Theorems 9 and 10, we use the properties of the Hilbert-Schmidt operators, covariance operators for the Hilbert space valued random variables, and the perturbation theory for linear operators. In this subsection we provide a brief introduction to them. For details, see [2729] and references therein.

Given a separable Hilbert space, a linear operator $A$ on it is said to belong to the Hilbert-Schmidt class if $\sum_{j}\|A u_j\|^2 < \infty$, where $\{u_j\}$ is an orthonormal basis. The Hilbert-Schmidt class forms a new Hilbert space with norm $\|A\|_{HS} = \big(\sum_j \|A u_j\|^2\big)^{1/2}$.

Given a Hilbert-Schmidt operator $A$ and a bounded operator $B$ on the Hilbert space, the operators $AB$ and $BA$ both belong to the Hilbert-Schmidt class and the following holds: $\|AB\|_{HS} \le \|A\|_{HS}\|B\|$ and $\|BA\|_{HS} \le \|B\|\,\|A\|_{HS}$, where $\|\cdot\|$ denotes the operator norm.

Let be a random vector taking values in satisfying . The covariance operator is self-adjoint, positive, compact, and belongs to the Hilbert-Schmidt class.

A well-known result from the perturbation theory of linear operators states that if a sequence of linear operators converges in the Hilbert-Schmidt norm and the eigenvalues of the limit are nondegenerate, then the eigenvalues and eigenvectors of the operators converge to those of the limit, with the same rate of convergence as that of the operators.

C.2. Proof of Theorem 10

We will use the following result from [14]: To simplify the notation, we denote . Also, define Then,

For the first term, observe that For the second term, note that Therefore, Since , there exists such that . Then, which implies For the third term, the following holds: and for each , Combining these terms results in (33).

Since as , consequently, we have if and .

If all the e.d.r. directions depend only on a finite number of eigenvectors of the covariance operator, then there exist some such that . This implies that Therefore, . Let ; the rate is .

Acknowledgments

The authors acknowledge the support of the National Science Foundation (DMS-0732276 and DMS-0732260) and the National Institutes of Health (P50 GM 081883). Any opinions, findings and conclusions, or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.