Abstract

Kernel selection is a central issue in kernel methods of machine learning. In this paper, we investigate regularized learning schemes based on kernel design methods. Our ideal kernel is derived from a simple iterative procedure that uses large-scale unlabeled data in a semisupervised framework. Compared with most existing approaches, our algorithm avoids solving multiple optimization problems in the process of learning kernels, and its computation is as efficient as that of standard single-kernel algorithms. Moreover, a large amount of information associated with the input space can be exploited, so the generalization ability is improved accordingly. We provide theoretical support for the least squares case in our setting; these advantages are also demonstrated by a simulation experiment and a real data analysis.

1. Introduction

Kernel-based methods have proved powerful for a wide range of data analysis problems. Since the support vector machine (SVM) was initially proposed by Vapnik [1], many other kernel-based methods have been developed, such as kernel PCA, kernel Fisher discriminant analysis, and kernel CCA. In many cases, the performance of kernel methods depends greatly on the choice of the kernel function (for the importance of specifying an appropriate kernel, see [2]). To choose an appropriate kernel, many kernel learning algorithms have been proposed in recent years, such as [3–5]. Among them, two kinds of candidate kernel sets are considered: the first involves parameter selection within a parameterized family of kernels, such as Gaussian kernels with varying bandwidth [6]. The second refers to linear combinations of certain prespecified kernels, which is also called “multiple kernel learning” [4]. Recall that Lanckriet et al. [5] proposed a semidefinite programming formulation to search for the best linear combination automatically for SVM; however, this approach is time-consuming and only feasible for small samples. Sonnenburg et al. [7] relaxed this optimization problem to a semi-infinite linear program, which is capable of coping with large numbers of kernels and samples. However, these multiple kernel learning algorithms sometimes fail to outperform a traditional unweighted combination of kernels in SVM, and Cortes [8] asked “can learning kernels help performance?”. Recently, Kloft and Blanchard [9] introduced an $\ell_p$-norm multiple kernel learning approach, which has been shown to be effective in both theory and practice [10, 11]. Essentially, $\ell_p$-norm multiple kernel learning is an empirical risk minimization algorithm whose candidate kernel set consists of $\ell_p$-norm constrained combinations of base kernels. Kloft and Blanchard [9] provided an excess generalization error bound for $\ell_p$-norm multiple kernel learning using local Rademacher complexity. Although these kernel learning algorithms provide more flexibility than single-kernel approaches, the more involved computational problems induced by multiple kernel learning arise additionally. In addition, the above kernel learning algorithms are considered only under fully supervised learning settings. In practice, labeled instances are often difficult, expensive, or time-consuming to obtain, as they require the efforts of experienced human annotators, whereas unlabeled data may be relatively easy to collect. In the machine learning literature, semisupervised learning addresses this problem by using a large amount of unlabeled data, together with the labeled data, to build better learners.

In this paper, we pursue kernel learning algorithms under the semisupervised learning framework. To this end, we construct a sequence of candidate kernels using an iterative procedure; our regularized learning algorithm is then run on the corresponding RKHS, which leads to a classical convex optimization problem on the training data. Finally, we use the test data to select the optimal kernel function and regularization parameter. It is worth noting that the proposed method consists of the two-step estimation stated above. In the first step, we use a large amount of unlabeled data to explore the underlying data structure. The optimization problem involved in the second step is as efficient as classical single-kernel approaches. More importantly, we provide theoretical support for our approach and demonstrate its effectiveness by experiments.

The rest of the paper is organized as follows. In Section 2 we introduce some basic notation and our two-step estimation for kernel learning. Section 3 presents the main theoretical results for the proposed approach, which are obtained mainly by using advanced concentration inequalities. Section 4 contains proof details such as the error decomposition and the approximation error. We implement a simulation and a real data experiment in Section 5. Some proofs are relegated to the Appendix.

2. The Proposed Algorithm

We first describe the notation used in this paper. Suppose that our algorithm produces a learner $f \colon X \to Y$ from a compact metric space $X$ to the output space $Y \subseteq \mathbb{R}$. Such a learner yields for each point $x \in X$ the value $f(x)$, which is a prediction made for $x$. The goodness of the estimation is usually assessed by a specified loss function denoted by $\ell(f(x), y)$. The most commonly used loss function is the least squares one; that is, $\ell(f(x), y) = (f(x) - y)^2$. Let $z = (x, y)$ be the random variable on $Z = X \times Y$ with probability distribution $\rho$. Within the statistical learning framework, the target function $f_\rho$ can be formulated as a minimizer of the following functional optimization:
$$ f_\rho = \arg\min_{f} \int_Z \ell(f(x), y) \, d\rho. \tag{1} $$
In particular, in the case of the least squares loss, we derive the explicit solution
$$ f_\rho(x) = \int_Y y \, d\rho(y \mid x), \tag{2} $$
where $\rho(\cdot \mid x)$ is the conditional probability measure at $x$ induced by $\rho$. Under the fully supervised learning setting, based on available samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m}$, the main goal of learning is to design an efficient learning algorithm that produces a learner $f_{\mathbf{z}}$ capable of approximating the regression function well on the whole space. The popular regularized learning algorithms within an RKHS can be stated as
$$ f_{\mathbf{z}, \lambda} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i) + \lambda \|f\|_K^2 \Big\}, \tag{3} $$
where $\mathcal{H}_K$ is a specified RKHS and $\lambda > 0$ is the regularization parameter, controlling the trade-off between the empirical error and the functional complexity of $f$. Note that $\lambda$ may depend on the sample size and satisfies $\lambda = \lambda(m) \to 0$ as $m \to \infty$.
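For concreteness, the following minimal sketch (Python with NumPy) shows the closed-form solution of the least squares instance of (3); the Gaussian kernel and all function names are illustrative choices for this sketch, not part of the algorithm description above.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # Pairwise Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def regularized_least_squares(K_train, y_train, lam):
    # Closed-form minimizer of (1/m) * sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2,
    # represented by coefficients alpha with f(x) = sum_i alpha_i K(x, x_i).
    m = K_train.shape[0]
    alpha = np.linalg.solve(K_train + lam * m * np.eye(m), y_train)
    return alpha
```

Predictions on new points are then obtained as `K_new_train @ alpha`, where `K_new_train` collects the kernel values between the new inputs and the training inputs.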

In the semisupervised learning framework, the first $m$ samples $\{(x_i, y_i)\}_{i=1}^{m}$ are labeled as above and are followed by $n$ unlabeled samples $\{x_i\}_{i=m+1}^{m+n}$. Denote by $K^{(1)} = K$ a weak kernel taken as the original kernel. Compared with standard kernels, a weak kernel here means that its complexity is very large or it is less smooth. A learner based on a weak kernel usually leads to overfitting, while it can approximate more complicated functions well and hence reduce the estimation bias of the learner. Hence, selecting an appropriate kernel requires trading off the functional complexity of the various hypothesis spaces $\mathcal{H}_{K^{(t)}}$. Motivated by this observation, we propose an iterative procedure as the first step for constructing candidate kernels. At the $t$th step, the next candidate kernel is derived as follows:
$$ K^{(t+1)}(x, x') = \frac{1}{n} \sum_{j=m+1}^{m+n} K^{(t)}(x, x_j) K(x_j, x'), \qquad t = 1, \ldots, T - 1. \tag{4} $$
The labeled samples are divided into training data, denoted by $\mathbf{z}_1$, and test data $\mathbf{z}_2$, respectively; then we run our regularized learning algorithm based on the associated $\mathcal{H}_{K^{(t)}}$:
$$ f_{\mathbf{z}_1, \lambda}^{(t)} = \arg\min_{f \in \mathcal{H}_{K^{(t)}}} \Big\{ \frac{1}{|\mathbf{z}_1|} \sum_{(x_i, y_i) \in \mathbf{z}_1} \ell(f(x_i), y_i) + \lambda \|f\|_{K^{(t)}}^2 \Big\}. \tag{5} $$
Given the total number $T$ of iteration steps, we minimize the least squares error on the test data:
$$ (\hat{t}, \hat{\lambda}) = \arg\min_{1 \le t \le T, \, \lambda > 0} \frac{1}{|\mathbf{z}_2|} \sum_{(x_i, y_i) \in \mathbf{z}_2} \big( f_{\mathbf{z}_1, \lambda}^{(t)}(x_i) - y_i \big)^2. \tag{6} $$
Thus, we take $f_{\mathbf{z}_1, \hat{\lambda}}^{(\hat{t})}$ as our final learner in the semisupervised learning setting. Note that we use the least squares loss instead of $\ell$ in the final step, since its solution can easily be computed or approximated thanks to the nice mathematical properties of the least squares loss.
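The two-step procedure (4)–(6) can be sketched as follows under the least squares loss, reusing the helpers above; the train/test split ratio, the grid of $\lambda$ values, and the matrix form of the empirical update (4) are our own assumptions for illustration, not a transcription of the authors' implementation.

```python
import numpy as np

def iterate_kernel(K_base_fn, X_unlabeled, X_rows, X_cols, T):
    """Empirical iteration (4): K^{(t+1)}(x, x') = (1/n) * sum_j K^{(t)}(x, u_j) K(u_j, x').

    Returns [K^{(1)}, ..., K^{(T)}] evaluated on the block (X_rows, X_cols),
    where K^{(1)} is the original (weak) kernel and u_j are the unlabeled points.
    """
    n = X_unlabeled.shape[0]
    K_uc = K_base_fn(X_unlabeled, X_cols)     # K(u_j, x') used in every update
    K_uu = K_base_fn(X_unlabeled, X_unlabeled)
    Kt_ru = K_base_fn(X_rows, X_unlabeled)    # K^{(t)}(x, u_j), starting at t = 1
    kernels = [K_base_fn(X_rows, X_cols)]     # K^{(1)} on the requested block
    for _ in range(T - 1):
        kernels.append(Kt_ru @ K_uc / n)      # K^{(t+1)} on the requested block
        Kt_ru = Kt_ru @ K_uu / n              # K^{(t+1)}(x, u_j) for the next step
    return kernels

def semisupervised_kernel_learning(X_lab, y_lab, X_unlab, K_base_fn, T=3,
                                   lambdas=(1e-3, 1e-2, 1e-1), train_frac=0.7, seed=0):
    # Split the labeled sample into training data z1 and test data z2.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_lab))
    m_tr = int(train_frac * len(y_lab))
    tr, te = idx[:m_tr], idx[m_tr:]
    X_tr, y_tr = X_lab[tr], y_lab[tr]
    X_te, y_te = X_lab[te], y_lab[te]

    # Candidate kernels K^{(1)}, ..., K^{(T)} on the train/train and test/train blocks.
    K_tt = iterate_kernel(K_base_fn, X_unlab, X_tr, X_tr, T)
    K_et = iterate_kernel(K_base_fn, X_unlab, X_te, X_tr, T)

    # Steps (5)-(6): fit on z1 for each (t, lambda); pick the pair minimizing the test error.
    best = None
    for t_idx in range(T):                    # list index t_idx corresponds to K^{(t_idx + 1)}
        for lam in lambdas:
            alpha = regularized_least_squares(K_tt[t_idx], y_tr, lam)
            test_mse = np.mean((K_et[t_idx] @ alpha - y_te) ** 2)
            if best is None or test_mse < best["test_mse"]:
                best = {"test_mse": test_mse, "t": t_idx, "lam": lam,
                        "alpha": alpha, "X_train": X_tr}
    return best
```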

Our motivation for designing the kernel as in (4) is based on the following fact. By the Mercer theorem [12], any given kernel $K$ defined on a compact set can be expressed as $K(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')$, where $(\lambda_i, \phi_i)$ are the eigenpairs of the integral operator $L_K$, which will be defined in (19) below. In general, the problem of selecting a kernel corresponds to a suitable choice of the eigenvalues $\lambda_i$, since the eigenvalues have a close relationship with the functional complexity of $\mathcal{H}_K$ [13]. In our case, we use an iterative procedure to select an appropriate kernel. To be precise, we define a new candidate kernel by $\widetilde{K}^{(t)}(x, x') = \sum_i \lambda_i^{t} \phi_i(x) \phi_i(x')$. Based on this observation, it suffices to find an appropriate iteration step. Since the marginal distribution $\rho_X$ is often unknown, we instead use an empirical estimator of $\widetilde{K}^{(t)}$, defined as in (4), as our candidate kernel. Furthermore, in view of the slow rate at which the empirical kernel in (4) converges to its population counterpart, the large amount of unlabeled data guarantees a small error generated by random sampling.
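To make the connection explicit, the short calculation below (a sketch based on the Mercer expansion above, assuming the eigenfunctions $\phi_i$ are orthonormal in $L^2_{\rho_X}$ and writing $\widetilde{K}^{(t)}$ for the kernel with eigenvalues $\lambda_i^{t}$) shows that composing the current kernel with the original one against the marginal distribution raises each eigenvalue by one power while keeping the eigenfunctions fixed:
$$
\int_X \widetilde{K}^{(t)}(x, u)\, K(u, x') \, d\rho_X(u)
= \sum_{i, j} \lambda_i^{t} \lambda_j \, \phi_i(x)\, \phi_j(x') \int_X \phi_i(u)\, \phi_j(u)\, d\rho_X(u)
= \sum_{i} \lambda_i^{t+1} \phi_i(x)\, \phi_i(x')
= \widetilde{K}^{(t+1)}(x, x').
$$
The empirical update (4) simply replaces the integral over $\rho_X$ by an average over the unlabeled points.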

It is seen from the proposed procedure above that this method avoids solving multiple optimization problems in the process of learning kernels, and its computation is as efficient as that of standard single-kernel algorithms up to a constant factor. Moreover, a large amount of information associated with the input space is fully used, so some intrinsic data structure may be exploited.

3. Main Results

To highlight our idea with more refined theoretical results, in what follows we are primarily concerned with the least squares setting, since the regularized least squares algorithm has a closed-form solution. First of all, by the law of large numbers, with high probability we can replace the first step (4) with the following iterative procedure:
$$ \widetilde{K}^{(t+1)}(x, x') = \int_X \widetilde{K}^{(t)}(x, u) K(u, x') \, d\rho_X(u), \qquad \widetilde{K}^{(1)} = K, \tag{7} $$
where $\rho_X$ is the marginal distribution induced by $\rho$. We denote by $f_{\mathbf{z}_1, \lambda}^{(t)}$ the derived learner of (5) at the $t$th iteration. For notational simplicity, we write $f_{\mathbf{z}} := f_{\mathbf{z}_1, \lambda}^{(t)}$. In this paper, we focus on the generalization error of the proposed algorithm; that is,
$$ \| f_{\mathbf{z}} - f_\rho \|_\rho^2 = \int_X \big( f_{\mathbf{z}}(x) - f_\rho(x) \big)^2 \, d\rho_X(x). \tag{8} $$
A small value of (8) implies a good prediction ability of $f_{\mathbf{z}}$. Different from the classical literature under fixed kernel settings, such as [13, 14], the main goal of this paper is to indicate theoretically some specific advantages over fixed kernel approaches.

To simplify the theoretical analysis, we assume that the conditional distribution $\rho(\cdot \mid x)$ is supported on $[-M, M]$ for some $M > 0$; it follows that $|f_\rho(x)| \le M$ almost everywhere. To this end, we introduce the projection operator as follows.

Definition 1. Define the projection operator $\pi$ on a measurable function $f \colon X \to \mathbb{R}$ as
$$ \pi(f)(x) = \begin{cases} M, & f(x) > M, \\ f(x), & -M \le f(x) \le M, \\ -M, & f(x) < -M. \end{cases} \tag{9} $$

Note that the error bound between the projection $\pi(f_{\mathbf{z}})$ and $f_\rho$ can be expressed as
$$ \| \pi(f_{\mathbf{z}}) - f_\rho \|_\rho^2 = \mathcal{E}\big(\pi(f_{\mathbf{z}})\big) - \mathcal{E}(f_\rho), \tag{10} $$
where $\mathcal{E}(f) = \int_Z (f(x) - y)^2 \, d\rho$ denotes the population error of the function $f$.

Since the regression function $f_\rho$ may not belong to $\mathcal{H}_{K^{(t)}}$, the approximation error between $\mathcal{H}_{K^{(t)}}$ and $f_\rho$ is needed. Considering the error induced by sampling and the approximation error, we introduce the following empirical error:
$$ \mathcal{E}_{\mathbf{z}}(f) = \frac{1}{|\mathbf{z}|} \sum_{(x_i, y_i) \in \mathbf{z}} \big( f(x_i) - y_i \big)^2, \tag{11} $$
and we also introduce an approximation error associated with the joint distribution $\rho$:
$$ \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho) + \lambda \|f_\lambda\|_{K^{(t)}}^2 = \mathcal{D}(\lambda), \qquad f_\lambda = \arg\min_{f \in \mathcal{H}_{K^{(t)}}} \big\{ \mathcal{E}(f) + \lambda \|f\|_{K^{(t)}}^2 \big\}, \tag{12} $$
where $\mathcal{D}(\lambda)$ is called the regularization function, given as
$$ \mathcal{D}(\lambda) = \min_{f \in \mathcal{H}_{K^{(t)}}} \big\{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_{K^{(t)}}^2 \big\}. \tag{13} $$

Remark 2. In the learning theory literature, one usually assumes that there exist $\beta \in (0, 1]$ and $C_\beta > 0$ such that $\mathcal{D}(\lambda) \le C_\beta \lambda^{\beta}$. In fact, such a polynomial decay of $\mathcal{D}(\lambda)$ is equivalent to a regularity (smoothness) condition on $f_\rho$, and vice versa [15–17]. Strictly speaking, $\mathcal{D}(\lambda)$ is formally discussed in approximation theory.

To obtain convergence rates for (10), we decompose this term into two parts, the approximation error and the sample error; see [14, 15].

Proposition 3. Let $f_{\mathbf{z}_1, \lambda}^{(t)}$ be defined by (5). Then the following inequality holds:
$$ \mathcal{E}\big(\pi(f_{\mathbf{z}_1, \lambda}^{(t)})\big) - \mathcal{E}(f_\rho) \le \mathcal{S}(\mathbf{z}_1, \lambda) + \mathcal{D}(\lambda), $$
where
$$ \mathcal{S}(\mathbf{z}_1, \lambda) = \big\{ \mathcal{E}\big(\pi(f_{\mathbf{z}_1, \lambda}^{(t)})\big) - \mathcal{E}_{\mathbf{z}_1}\big(\pi(f_{\mathbf{z}_1, \lambda}^{(t)})\big) \big\} + \big\{ \mathcal{E}_{\mathbf{z}_1}(f_\lambda) - \mathcal{E}(f_\lambda) \big\}. $$

Proposition 3 shows that $\mathcal{E}(\pi(f_{\mathbf{z}_1, \lambda}^{(t)})) - \mathcal{E}(f_\rho)$ is bounded by $\mathcal{S}(\mathbf{z}_1, \lambda) + \mathcal{D}(\lambda)$. We usually call $\mathcal{S}(\mathbf{z}_1, \lambda)$ the sample error, since this quantity mainly involves the random sampling and the complexity of $\mathcal{H}_{K^{(t)}}$.

Bounding the sample error is a standard technique in learning theory [13, 15, 18]. To this end, we introduce the notion of covering number to measure the complexity of $\mathcal{H}_{K^{(t)}}$.

Definition 4. Let $(\mathcal{M}, d)$ be a pseudometric space and let $S \subseteq \mathcal{M}$. For any $\eta > 0$, the covering number $\mathcal{N}(S, \eta)$ of $S$ with respect to $d$ is defined as the minimal number of balls of radius $\eta$ needed to cover $S$.

Recall that a kernel function is called a Mercer kernel if it is symmetric, positive definite, and continuous. Several properties of Mercer kernels are well established and can be found in [12, 15]. Throughout, suppose that $\kappa := \sup_{x \in X} \sqrt{K(x, x)} < \infty$.

Assumption 5. Suppose that the Mercer kernel $K^{(t)}$ has polynomial complexity; that is, there exist an exponent $s > 0$ and a constant $c_s > 0$ such that
$$ \log \mathcal{N}(B_1, \eta) \le c_s \eta^{-s}, \qquad \eta > 0, $$
where $B_1$ is the unit ball of $\mathcal{H}_{K^{(t)}}$ (regarded as a subset of $C(X)$). For the Sobolev space on $X$ with order $h$, it is known from [12] that this assumption holds, with $s$ determined by $h$ and the dimension of $X$.

On the other hand, to quantify the approximation error and characterize the regularity of $f_\rho$, we need to introduce the fractional integral operator associated with $K$. Recall that the standard inner product on $L^2_{\rho_X}$ is defined as $\langle f, g \rangle_\rho = \int_X f(x) g(x) \, d\rho_X(x)$. Then we can define an integral operator $L_K$ on $L^2_{\rho_X}$:
$$ L_K f(x) = \int_X K(x, u) f(u) \, d\rho_X(u), \qquad f \in L^2_{\rho_X}. \tag{19} $$
It has been verified in [16] that $L_K$ is a compact, self-adjoint, and positive definite operator from $L^2_{\rho_X}$ to $L^2_{\rho_X}$, so the fractional power $L_K^{r}$ of $L_K$ is well defined for any $r > 0$. Moreover, it is easy to check that $\widetilde{K}^{(t)}(x, x') = \sum_i \lambda_i^{t} \phi_i(x) \phi_i(x')$ and that the integral operator associated with $\widetilde{K}^{(t)}$ is $L_K^{t}$. Lemma 12 below will show that if a kernel that is weak with respect to the true function is used for learning, the approximation ability cannot be improved even if the true function is sufficiently smooth. This is why we propose the iterative procedure (4) for updating kernels.
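In practice the spectrum of $L_K$ is approximated from data: the eigenvalues of the kernel matrix on a sample of size $n$, scaled by $1/n$, approximate the $\lambda_i$. The sketch below illustrates this and the matrix counterpart of the iterated kernel; the kernel function and the sample are placeholders.

```python
import numpy as np

def empirical_operator_spectrum(K_fn, X_sample):
    # Eigendecomposition of the scaled kernel matrix (1/n) * K(X, X),
    # whose eigenvalues approximate the spectrum {lambda_i} of L_K.
    n = X_sample.shape[0]
    K = K_fn(X_sample, X_sample)
    eigvals, eigvecs = np.linalg.eigh(K / n)
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
    return eigvals[order], eigvecs[:, order]

def iterated_kernel_matrix(K_fn, X_sample, t):
    # Matrix counterpart of the iterated kernel: eigenvalues raised to the power t,
    # mirroring K^{(t)}(x, x') = sum_i lambda_i^t phi_i(x) phi_i(x').
    n = X_sample.shape[0]
    lam, U = empirical_operator_spectrum(K_fn, X_sample)
    lam_t = np.clip(lam, 0.0, None) ** t       # guard against tiny negative round-off
    return n * (U * lam_t) @ U.T               # rescale back to a Gram matrix
```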

With these preparations, we can state the main results, which depend on the capacity of the hypothesis space and the smoothness of the target function, as follows.

Theorem 6. Let $f_{\mathbf{z}_1, \lambda}^{(t)}$ be defined by (5), and let Assumption 5 hold. If $f_\rho$ lies in the range of the fractional operator $L_K^{r}$ for some $r > 0$, then, for any $\lambda > 0$ and $0 < \delta < 1$, the following holds with probability at least $1 - \delta$, where the constant is given by Proposition 11.

From Proposition 3, we can deduce that the remaining term in the decomposition can be ignored by studying the new, equivalent sample error. The following corollary provides an asymptotically optimal convergence rate of the generalization error. The proof can be found in the Appendix.

Corollary 7. Let $f_{\mathbf{z}_1, \lambda}^{(t)}$ be defined by (5), and let Assumption 5 hold. Under the regularity condition of Theorem 6 and with the regularization parameter chosen accordingly, for any $0 < \delta < 1$ the stated rate holds with probability at least $1 - \delta$. In particular, for a sufficiently smooth target function, the rate holds up to an arbitrarily small positive exponent.

It is seen from Corollary 7 that the ideal choice of the regularization parameter depends on two quantities, the smoothness of the target function and the complexity exponent in Assumption 5, which are often unknown in advance. Alternatively, cross-validation is one of the commonly used tools in practice. It is worth noting that our approach selects the ideal kernel and the regularization parameter simultaneously, which differs significantly from classical fixed kernel methods.

Next, we compare our rate with existing results. Recently, sharp learning rates have been established by advanced empirical process techniques in [14]. Note that we use the iterated kernel $K^{(t)}$ to replace the kernel appearing in algorithm (3), and its covering number has polynomial decay index $s$. To be precise, [14] gave an upper bound on the sample error in terms of this covering number.

From formula (A.2) in the Appendix, we can obtain the corresponding bound for our setting. Using the equivalence between the covering number and the spectrum of the integral operator (see Theorem 10 of [13]), we see that a sufficiently large iteration step ensures a small complexity exponent. Additionally, in that case the corresponding quantities in Theorem 6 and in the bound above are both constants, and hence the two bounds differ only by a constant factor. In summary, our derived sample error is sharper than that in [14]. Thus, if the regression function is sufficiently smooth, so that it can also be approximated well by the iterated hypothesis space, we conclude that the corresponding learning rate of Theorem 6 outperforms that in [14]. This shows that if we know the smoothness of the target function, we can choose a proper kernel to improve the sample error effectively, which provides a theoretical basis for choosing the kernel function in real problems. Of course, considering the noise in real samples, the chosen kernel also needs to be smoother than the target function.

4. Error Analysis

The sample error is analyzed using empirical process techniques. Early studies of the sample error mainly applied the McDiarmid inequality without considering any notion of "space complexity." However, the McDiarmid inequality cannot exploit the variance information of the random variables. Later, the VC dimension was introduced into the literature, and other notions such as covering numbers, combined with Bernstein-type probability inequalities, significantly reduce the sample error; see [19] for an overview. To bound the sample error, we split it into two parts again:
$$ \mathcal{S}(\mathbf{z}_1, \lambda) = \big\{ \mathcal{E}_{\mathbf{z}_1}(f_\lambda) - \mathcal{E}(f_\lambda) \big\} + \big\{ \mathcal{E}\big(\pi(f_{\mathbf{z}_1, \lambda}^{(t)})\big) - \mathcal{E}_{\mathbf{z}_1}\big(\pi(f_{\mathbf{z}_1, \lambda}^{(t)})\big) \big\}. $$
Note that the first part involves only the fixed function $f_\lambda$ and hence no functional complexity, so it can be estimated easily by the following one-sided Bernstein probability inequality.

Lemma 8. Let $\xi$ be a random variable on the probability space $(Z, \rho)$, and suppose that there exists a constant $B > 0$ such that $|\xi - \mathbb{E}\xi| \le B$ almost surely. Denote the variance of $\xi$ by $\sigma^2$. Then, for any $0 < \delta < 1$, the following holds with probability at least $1 - \delta$:
$$ \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - \mathbb{E}\xi \le \frac{2 B \log(1/\delta)}{3 m} + \sqrt{\frac{2 \sigma^2 \log(1/\delta)}{m}}. $$
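As a quick numerical sanity check of the inequality in the standard one-sided Bernstein form stated above, the snippet below estimates how often the deviation exceeds the bound over many repetitions; the Bernoulli choice of $\xi$ is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, delta, trials = 200, 0.05, 20000
p = 0.1                                     # xi ~ Bernoulli(p): mean p, variance p(1 - p)
B, sigma2 = 1.0, p * (1 - p)                # |xi - E xi| <= max(p, 1 - p) <= 1 = B
bound = 2 * B * np.log(1 / delta) / (3 * m) + np.sqrt(2 * sigma2 * np.log(1 / delta) / m)

samples = rng.binomial(1, p, size=(trials, m))
deviations = samples.mean(axis=1) - p       # (1/m) sum_i xi(z_i) - E xi
print("empirical exceedance rate:", np.mean(deviations > bound), "(should be <=", delta, ")")
```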

Proposition 9. Define the random variable $\xi(z) = (f_\lambda(x) - y)^2 - (f_\rho(x) - y)^2$, so that $\mathbb{E}\xi = \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho)$. For any $0 < \delta < 1$, the corresponding one-sided Bernstein bound for $\xi$ holds with probability at least $1 - \delta$.

Proof. Notice that $\xi(z) = (f_\lambda(x) - f_\rho(x))(f_\lambda(x) + f_\rho(x) - 2y)$. Since $|y| \le M$ holds almost everywhere, $|\xi|$ is bounded almost surely, which yields the constant $B$ in Lemma 8.
Additionally, since $\xi^2$ is controlled by a constant multiple of $(f_\lambda - f_\rho)^2$ and $\mathbb{E}(f_\lambda - f_\rho)^2 = \mathbb{E}\xi$, the variance $\sigma^2$ is bounded in terms of $\mathbb{E}\xi$.
By Lemma 8 and the elementary inequality $\sqrt{ab} \le (a + b)/2$, we obtain the claimed bound with probability at least $1 - \delta$.

Bounding the second part of the sample error is more involved, since the estimator $f_{\mathbf{z}_1, \lambda}^{(t)}$ varies with the random sample. To handle it, an advanced uniform concentration inequality is required [14].

Lemma 10. Let $\mathcal{G}$ be a set of functions on $Z$, and assume that there exist constants $B, c > 0$ such that, for every $g \in \mathcal{G}$, $|g - \mathbb{E}g| \le B$ almost everywhere and $\mathbb{E}g^2 \le c\, \mathbb{E}g$. Then, for any positive $\varepsilon$ and $0 < \alpha \le 1$,

Proposition 11. Let $f_{\mathbf{z}_1, \lambda}^{(t)}$ be defined by (5), and suppose that Assumption 5 holds. Then, under the stated condition on $\lambda$, for any $0 < \delta < 1$, the corresponding bound on the second part of the sample error holds with probability at least $1 - \delta$.

Proof. Introduce the function set suggested by Proposition 9, whose elements have the form $g(z) = (\pi(f)(x) - y)^2 - (f_\rho(x) - y)^2$ with $f$ ranging over a suitable ball of $\mathcal{H}_{K^{(t)}}$. Each such $g$ is bounded almost everywhere, and its second moment is controlled by its mean, so the assumptions of Lemma 10 are satisfied. Applying Lemma 10 with appropriate choices of $\varepsilon$ and $\alpha$, we obtain the desired uniform bound with high probability. It remains to estimate the covering number of this function set.
For arbitrary $f_1, f_2$ in the ball, we have $|g_1(z) - g_2(z)| \le C \, \|\pi(f_1) - \pi(f_2)\|_\infty$ for some constant $C$ depending on $M$; since the projection $\pi$ is a contraction, the covering number of the function set is bounded by that of the ball in $\mathcal{H}_{K^{(t)}}$. According to Assumption 5, the latter has polynomial complexity with exponent $s$. Combining this estimate with Lemma 7 of [16] and Lemma 4.1 of [14], and choosing the radius of the ball appropriately, the proof of Proposition 11 is completed.

According to the conclusion of Proposition 9, two important quantities involved in its bound still need to be estimated.

Lemma 12. Let $\mathcal{D}(\lambda)$ be defined as in (13). If $f_\rho$ lies in the range of the fractional operator $L_K^{r}$ for some $r > 0$, the following holds:

The estimation of the first quantity extends Lemma 4.3 in [18], with $K^{(t)}$ taking the place of $K$. Now we discuss the second quantity. In the classical algorithm (3), once the smoothness index of $f_\rho$ exceeds a certain level, a further increase in smoothness cannot improve the error $\mathcal{D}(\lambda)$; this is called the "saturation" phenomenon in the literature on inverse problems. For the algorithm we study, however, saturation occurs only at a much higher smoothness level, which grows with the iteration step $t$. This shows a specific advantage of using $K^{(t)}$ instead of the original $K$ from the perspective of approximation theory.

Proof of Lemma 12. According to [17], $\mathcal{D}(\lambda)$ can be expressed through the spectrum of the integral operator. Expanding $f_\rho$ in the eigenfunctions and using the regularity assumption, we obtain the stated bound on $\mathcal{D}(\lambda)$, where $\{\lambda_i\}$ is the corresponding spectrum of the integral operator $L_K$. On the other hand, using the same expansion together with the assumption on $f_\rho$, the second part of the claim follows. This completes the proof of Lemma 12.

Together with Propositions 9 and 11 and Lemma 12, Theorem 6 is proved easily.

5. Numerical Experiments

5.1. Simulated Example

Although this paper mainly focuses on theoretical analysis, we carry out some experiments to show the method's efficiency in practice. A simulated example is considered in which the true regression function is an additive model in several covariates. The covariates are first generated independently; the responses are then generated from the additive model with independent noise.

For the above example, several scenarios with different numbers of labeled and unlabeled samples are taken into account, and each scenario is repeated a number of times. We use the widely used Gaussian kernel, whose bandwidth parameter is specified by 10-fold cross-validation on each data set. Besides, as mentioned before, we start with a weak kernel and search for a better one iteratively; the weak kernel contains an adjustable parameter, which is fixed throughout the experiments.
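A schematic version of this simulation, reusing the helper functions sketched in Section 2, is given below; the additive components, sample sizes, noise level, and kernel bandwidth are placeholders rather than the settings actually used for Table 1.

```python
import numpy as np

rng = np.random.default_rng(7)

def f_true(X):
    # Illustrative additive regression function; the paper's exact components are not reproduced here.
    return np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + np.exp(-X[:, 2]) + 0.5 * X[:, 3]

d, m, n, m_new = 4, 100, 400, 500            # dimensions and sample sizes are placeholders
X_lab   = rng.uniform(-1, 1, size=(m, d))
X_unlab = rng.uniform(-1, 1, size=(n, d))
X_new   = rng.uniform(-1, 1, size=(m_new, d))
y_lab   = f_true(X_lab) + 0.2 * rng.standard_normal(m)

K = lambda A, B: gaussian_kernel(A, B, sigma=0.5)   # base (weak) kernel choice is illustrative

# SEKM: iterate the kernel with unlabeled data and select (t, lambda) on the held-out labeled part.
res = semisupervised_kernel_learning(X_lab, y_lab, X_unlab, K, T=4)
K_new_tr = iterate_kernel(K, X_unlab, X_new, res["X_train"], 4)[res["t"]]
y_sekm = K_new_tr @ res["alpha"]

# SKM baseline: single Gaussian kernel, regularized least squares on all labeled data.
alpha_skm = regularized_least_squares(K(X_lab, X_lab), y_lab, 1e-2)
y_skm = K(X_new, X_lab) @ alpha_skm

for name, pred in [("SEKM", y_sekm), ("SKM", y_skm)]:
    print(name, "mean squared error on fresh inputs:", np.mean((pred - f_true(X_new)) ** 2))
```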

The performance of the various methods is measured by the relative mean squared error (MSE) of each kernel-based regression. The averaged performance measures are summarized in Table 1. SKM denotes the single kernel method with the Gaussian kernel, UKM denotes the proposed method without using any unlabeled data, and SEKM denotes the proposed method using all the unlabeled data.
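For reference, one common convention for the relative MSE, which we assume here since the exact normalization is not spelled out in the text, divides the mean squared prediction error by the empirical variance of the responses:

```python
import numpy as np

def relative_mse(y_pred, y_true):
    # Relative mean squared error: MSE normalized by the variance of the targets
    # (an assumed convention; the paper does not state its exact normalization).
    return np.mean((y_pred - y_true) ** 2) / np.var(y_true)
```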

From Table 1 we find that the proposed kernel learning method yields better prediction accuracy on these data sets than a single kernel. A likely explanation is that the true function is relatively complicated, so the Gaussian kernel has limited learning ability; starting from a weak kernel gives a much larger hypothesis space than the one induced by the Gaussian kernel, so the true function can be learned well by our algorithm. Moreover, the last row of Table 1 shows that using the unlabeled data in our algorithm further reduces the prediction error, as expected from the theory.

5.2. Real Example

The proposed method is also applied to a real example, the Boston housing data, which is publicly available. The Boston housing data concerns the median value of owner-occupied homes in each of the 506 census tracts in the Boston Standard Metropolitan Statistical Area in 1970. It consists of 13 variables, including per capita crime rate by town (CRIM), proportion of residential land zoned for lots over 25,000 square feet (ZN), proportion of nonretail business acres per town (INDUS), Charles River dummy variable (CHAS), nitric oxides concentration (NOX), average number of rooms per dwelling (RM), proportion of owner-occupied units built prior to 1940 (AGE), weighted distances to five Boston employment centers (DIS), index of accessibility to radial highways (RAD), full-value property-tax rate per $10000 (TAX), pupil-teacher ratio by town (PTRATIO), the proportion of blacks by town (B), and lower status of the population (LSTAT), which may affect the housing price.

In our analysis, all the variables are standardized. To compute the averaged prediction error, the data set is randomly split into two parts, training data and testing data. To compare the performance of our method with a single kernel method, we consider three different sizes of the training data, and each scenario is repeated a number of times. The bandwidth parameter is again specified by 10-fold cross-validation on each data set. The prediction performance of the single kernel method versus the proposed method is summarized in Table 2.
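A sketch of this workflow, again reusing the helper functions from Section 2, might look as follows; the data file, training size, bandwidth grid, and fixed regularization parameter are placeholders rather than the settings used for Table 2.

```python
import numpy as np

# Placeholder loading: assumes a local CSV with the 13 predictors followed by MEDV in the last column.
data = np.loadtxt("boston_housing.csv", delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1]
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize all variables
y = (y - y.mean()) / y.std()

rng = np.random.default_rng(0)
idx = rng.permutation(len(y))
n_train = 300                                     # training size is a placeholder
tr, te = idx[:n_train], idx[n_train:]

def cv_bandwidth(X_tr, y_tr, sigmas=(0.5, 1.0, 2.0, 4.0), lam=1e-2, folds=10):
    # 10-fold cross-validation over the Gaussian bandwidth for the single kernel method.
    fold_ids = np.arange(len(y_tr)) % folds
    scores = []
    for s in sigmas:
        errs = []
        for f in range(folds):
            hold, fit = fold_ids == f, fold_ids != f
            K_ff = gaussian_kernel(X_tr[fit], X_tr[fit], s)
            alpha = regularized_least_squares(K_ff, y_tr[fit], lam)
            pred = gaussian_kernel(X_tr[hold], X_tr[fit], s) @ alpha
            errs.append(np.mean((pred - y_tr[hold]) ** 2))
        scores.append(np.mean(errs))
    return sigmas[int(np.argmin(scores))]

sigma_hat = cv_bandwidth(X[tr], y[tr])
alpha_skm = regularized_least_squares(gaussian_kernel(X[tr], X[tr], sigma_hat), y[tr], 1e-2)
pred_skm = gaussian_kernel(X[te], X[tr], sigma_hat) @ alpha_skm
print("SKM test MSE:", np.mean((pred_skm - y[te]) ** 2))
```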

As shown in the table, the proposed method achieves a clear improvement in prediction accuracy, except that one of the six results in Table 2 performs worse than the single kernel method. This result is acceptable, since the underlying rule for this real data set is unknown, and it is hard to guarantee a uniform improvement across all settings. Overall, the proposed method is a simple but efficient kernel learning method within the family of kernel methods.

6. Conclusions and Discussions

This paper has discussed kernel learning problems within the semisupervised learning setting. Our candidate kernel sequence is generated by a simple iterative procedure using a large amount of unlabeled data. Under mild assumptions on the target function, it is shown that we can choose a kernel that provably reduces the sample error compared with single-kernel learning. This also shows that, in our setting, learning the kernel function outperforms traditional kernel-based learning algorithms. Moreover, a simulation example and a real data experiment are implemented to demonstrate the effectiveness of the proposed method.

We note that the complexity of the function space in this paper is described by the covering number, which is a straightforward notion, yet not the sharpest choice theoretically. Given the way in which the kernel function is constructed in this paper, we could replace Assumption 5 with an assumption on the eigenvalue decay of the integral operator $L_K$. Based on the relationships among entropy numbers and the Rademacher complexity, better theoretical results may be achieved; this will be our future work. We also attempt to explore intrinsic structure of the input space by selecting an appropriate kernel; there may be other, more effective ways to explore such underlying structures.

Appendix

The integral operator $L_K$ has the following properties, which were proved in [15].

(1) $L_K$ is a positive, self-adjoint, and compact operator from $L^2_{\rho_X}$ to $L^2_{\rho_X}$. Consequently, according to the classical spectral theorem, its normalized eigenfunctions $\phi_1, \phi_2, \ldots$ form an orthonormal basis of $L^2_{\rho_X}$. The spectrum of the operator is discrete, and the corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ are either finite in number or monotonically decreasing to $0$.

(2) For each $r > 0$, the fractional power $L_K^{r}$ is defined as $L_K^{r} f = \sum_i \lambda_i^{r} \langle f, \phi_i \rangle_\rho \phi_i$. Denote $\mathcal{H}_K = L_K^{1/2}(L^2_{\rho_X})$; then $\{\sqrt{\lambda_i}\, \phi_i\}$ forms an orthonormal basis of $\mathcal{H}_K$, and $L_K^{1/2}$ is an isomorphism between $L^2_{\rho_X}$ and $\mathcal{H}_K$. In particular, the following holds: $\| L_K^{1/2} f \|_K = \| f \|_\rho$ for every $f \in L^2_{\rho_X}$.
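The following display is a routine reformulation of property (2) in terms of the eigenpairs; the second identity additionally assumes, as in Section 2, that the iterated kernel shares the eigenfunctions of $K$ with eigenvalues $\lambda_i^{t}$.
$$
g = \sum_{\lambda_i > 0} c_i \phi_i \in \mathcal{H}_K
\iff
\|g\|_K^2 = \sum_{\lambda_i > 0} \frac{c_i^2}{\lambda_i} < \infty,
\qquad
\|g\|_{K^{(t)}}^2 = \sum_{\lambda_i > 0} \frac{c_i^2}{\lambda_i^{t}}.
$$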

Proof of Corollary 7. Notice that, under the smoothness condition, the leading quantity in the conclusion of Proposition 3 is a constant. Choosing $\lambda$ to balance the sample error and the approximation error, we obtain the stated rate. In addition, as the smoothness index increases, the corresponding loss in the exponent approaches $0$ by the classical argument described in [14]. This completes the proof of Corollary 7.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The second author’s research is supported partially by National Natural Science Foundation of China (no. 11301421) and Fundamental Research Funds for the Central Universities of China (Grant nos. JBK141111, 14TD0046, and JBK151134).