Mathematical Problems in Engineering

Volume 2015 (2015), Article ID 451947, 9 pages

http://dx.doi.org/10.1155/2015/451947

## An Efficient Kernel Learning Algorithm for Semisupervised Regression Problems

Statistics School, Southwestern University of Finance and Economics, Chengdu 611130, China

Received 4 July 2015; Accepted 25 August 2015

Academic Editor: Igor Andrianov

Copyright © 2015 Chao Zhang and Shaogao Lv. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Kernel selection is a central issue in kernel methods of machine learning. In this paper, we investigate regularized learning schemes based on kernel design methods. Our ideal kernel is derived from a simple iterative procedure that exploits large-scale unlabeled data in a semisupervised framework. Compared with most existing approaches, our algorithm avoids solving multiple optimization problems in the process of learning kernels, and its computation is as efficient as that of standard single-kernel algorithms. Moreover, large amounts of information associated with the input space can be exploited, so generalization ability improves accordingly. We provide theoretical support for the least square case in our setting; these advantages are also demonstrated by a simulation experiment and a real data analysis.

#### 1. Introduction

Kernel-based methods have proved powerful for a wide range of data analysis problems. Since the support vector machine (SVM) was initially proposed by Vapnik [1], many other kernel-based methods have followed, such as kernel PCA, kernel Fisher discriminant analysis, and kernel CCA. In many cases, the performance of kernel methods depends heavily on the choice of kernel function (for the importance of specifying an appropriate kernel, see [2]). To choose an appropriate kernel, many kernel learning algorithms have been proposed in recent years, such as [3–5]. Among them, two kinds of candidate kernel sets appear. The first involves parameter selection from a candidate collection such as the family of Gaussian kernels indexed by their width parameter [6]. The second mainly refers to linear combinations of certain prespecified kernels, which is also called "multiple kernel learning" [4]. Recall that Lanckriet et al. [5] proposed a semidefinite program to search for the best linear combination automatically for SVM; however, this approach is time-consuming and only feasible for small samples. Sonnenburg et al. [7] relaxed this optimization problem to a semi-infinite linear program, which can cope with a large spectrum of kernels and samples. However, multiple kernel learning algorithms sometimes fail to outperform a traditional unweighted kernel in SVM, and Cortes [8] asked "can learning kernels help performance?". Recently Kloft and Blanchard [9] introduced an $\ell_p$-norm multiple kernel learning approach, which has been shown to be effective in both theory and practice [10, 11]. Essentially, $\ell_p$-norm multiple kernel learning minimizes the empirical risk over a candidate set of kernels. Kloft and Blanchard [9] provided an excess generalization error bound using the local Rademacher complexity of $\ell_p$-norm multiple kernel learning.
Although these kernel learning algorithms provide more flexibility than single-kernel approaches, multiple kernel learning additionally introduces more complex computational problems. Moreover, the kernel learning algorithms above are considered only in fully supervised settings. In practice, labeled instances are often difficult, expensive, or time-consuming to obtain, as they require the efforts of experienced human annotators, while unlabeled data may be relatively easy to collect. In the machine learning literature, semisupervised learning addresses this problem by using large amounts of unlabeled data, together with the labeled data, to build better learners.

In this paper, we pursue kernel learning algorithms in a semisupervised framework. To this end, we construct a sequence of candidate kernels by an iterative procedure; our regularized learning algorithms then operate on the corresponding RKHSs, which leads to a classical convex optimization problem on the training data. Finally, we use the test data to select the optimal kernel function and regularization parameter. Thus the proposed method consists of the two-step estimation just described. In the first step, we use large amounts of unlabeled data to explore the underlying data structure. The optimization problem in the second step is as efficient as classical single-kernel approaches. More importantly, we provide theoretical support for our approach and demonstrate its effectiveness by experiments.

The rest of the paper is organized as follows. In Section 2 we introduce basic notation and our two-step estimation procedure for kernel learning. In Section 3 we present the main theoretical results for the proposed approach, obtained mainly by means of advanced concentration inequalities. Section 4 contains proof details such as the error decomposition and the approximation error. We implement a simulation and a real data experiment in Section 5. Some proofs are relegated to the Appendix.

#### 2. The Proposed Algorithm

We first describe the notation used in this paper. Suppose that our algorithm produces a learner $f$ from a compact metric space $X$ to the output space $Y \subseteq \mathbb{R}$. Such a learner yields for each point $x \in X$ the value $f(x)$, which is a prediction made for $x$. The goodness of the estimation is usually assessed by some specified loss function $\ell(y, f(x))$. The most commonly used loss function is the least square one, that is, $\ell(y, f(x)) = (y - f(x))^2$. Let $z = (x, y)$ be the random variable on $Z = X \times Y$ with probability distribution $\rho$. Within the statistical learning framework, the target function can be formulated as a minimizer of the functional optimization $f_\rho = \arg\min_f \int_Z \ell(y, f(x)) \, d\rho$. In particular, in the case of the least square loss, we derive the explicit solution $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$, where $\rho(\cdot \mid x)$ is the conditional probability measure at $x$ induced by $\rho$. Under the fully supervised learning setting, based on available samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^n$, the main goal of learning is to design an efficient learning algorithm producing a learner $f_{\mathbf z}$ that approximates the regression function well on the whole space. The popular regularized learning algorithms within an RKHS can be stated as
$$f_{\mathbf z, \lambda} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)) + \lambda \|f\|_K^2 \Big\},$$
where $\mathcal{H}_K$ is a specified RKHS and $\lambda > 0$ is the regularization parameter, controlling the trade-off between the empirical error and the functional complexity of $f$. Note that $\lambda$ may depend on the sample size and satisfies $\lambda = \lambda(n) \to 0$ as $n \to \infty$.
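With the least square loss, the regularized scheme above is kernel ridge regression, whose minimizer has the well-known closed form $f_{\mathbf z}(x) = \sum_i \alpha_i K(x, x_i)$ with $\alpha = (G + n\lambda I)^{-1} y$. A minimal sketch; the Gaussian kernel, toy data, and all parameter values here are our own illustrative choices, not from the paper:

```python
import numpy as np

def gaussian_gram(X1, X2, sigma=1.0):
    # Pairwise squared distances -> Gaussian (RBF) Gram matrix.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def krr_fit(X, y, lam=1e-2, sigma=1.0):
    # Closed-form minimizer of (1/n) sum (y_i - f(x_i))^2 + lam * ||f||_K^2:
    # alpha = (G + n*lam*I)^{-1} y,   f(x) = sum_i alpha_i K(x, x_i).
    n = len(X)
    G = gaussian_gram(X, X, sigma)
    return np.linalg.solve(G + n * lam * np.eye(n), y)

def krr_predict(alpha, Xtrain, Xnew, sigma=1.0):
    return gaussian_gram(Xnew, Xtrain, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 1))
y = np.sin(np.pi * X[:, 0]) + 0.05 * rng.standard_normal(80)
alpha = krr_fit(X, y, lam=1e-3, sigma=0.5)
yhat = krr_predict(alpha, X, X, sigma=0.5)
print(float(np.mean((yhat - y) ** 2)))  # small training MSE
```

The single linear solve per kernel is what makes the second step of the proposed method as cheap as a standard single-kernel algorithm.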

In the semisupervised learning framework, the first $n$ samples $\{(x_i, y_i)\}_{i=1}^{n}$ are labeled as above and are followed by $m$ unlabeled samples $\{u_j\}_{j=1}^{m}$. Denote by $K_1 = K$ a weak kernel serving as the original kernel. Compared to standard kernels, a weak kernel here means that its complexity is very large or that it is less smooth. A learner with a weak kernel usually leads to overfitting, while it can approximate more complicated functions well and hence reduce the estimation bias of the learner. Hence, selecting an appropriate kernel requires trading off the functional complexity of the various $\mathcal{H}_{K_t}$. Motivated by this observation, we propose an iterative procedure as our first step for constructing candidate kernels. At the $t$th step, the next candidate kernel is derived as follows:
$$K_{t+1}(x, x') = \frac{1}{m} \sum_{j=1}^{m} K_t(x, u_j) \, K(u_j, x'). \tag{4}$$
The labeled samples are divided into training data $\mathbf{z}^{tr}$ and test data $\mathbf{z}^{te}$, respectively; then we establish our regularized learning algorithm based on the associated $\mathcal{H}_{K_t}$:
$$f_t = \arg\min_{f \in \mathcal{H}_{K_t}} \Big\{ \frac{1}{|\mathbf{z}^{tr}|} \sum_{(x_i, y_i) \in \mathbf{z}^{tr}} \ell(y_i, f(x_i)) + \lambda \|f\|_{K_t}^2 \Big\}. \tag{5}$$
Given the total number $T$ of iteration steps, we minimize the least square error on the test data:
$$t^* = \arg\min_{1 \le t \le T} \sum_{(x_i, y_i) \in \mathbf{z}^{te}} (y_i - f_t(x_i))^2. \tag{6}$$
Thus, we take $f_{t^*}$ as our final learner in the semisupervised setting. Note that we use the least square loss instead of a general loss $\ell$ in the final step, since its solution can be computed or approximated easily thanks to the nice mathematical properties of the least square loss.
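The full two-step procedure can be sketched compactly. This follows our reading of the displays (4)–(6); the kernel update rule, the parameter values, and the toy data are our own assumptions, not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def base_kernel(X1, X2, sigma=0.3):
    # Narrow-bandwidth Gaussian as a stand-in "weak" starting kernel K.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Labeled data split into train/test, plus a large unlabeled pool U.
n_tr, n_te, m = 60, 40, 500
Xl = rng.uniform(-1, 1, (n_tr + n_te, 1))
yl = np.sin(np.pi * Xl[:, 0]) + 0.1 * rng.standard_normal(n_tr + n_te)
U = rng.uniform(-1, 1, (m, 1))
Xtr, ytr, Xte, yte = Xl[:n_tr], yl[:n_tr], Xl[n_tr:], yl[n_tr:]

lam, T = 1e-2, 4
Xall = np.vstack([Xtr, Xte])
C = base_kernel(Xall, U)            # K_t(Xall, U), initialized at t = 1
G = base_kernel(Xall, Xtr)          # K_t(Xall, Xtr), initialized at t = 1
K_UU, K_Utr = base_kernel(U, U), base_kernel(U, Xtr)

best_err, best_t = np.inf, 0
for t in range(1, T + 1):
    # Step 2: regularized least squares in H_{K_t} on the training data.
    alpha = np.linalg.solve(G[:n_tr] + n_tr * lam * np.eye(n_tr), ytr)
    err = np.mean((G[n_tr:] @ alpha - yte) ** 2)  # test error of f_t
    if err < best_err:
        best_err, best_t = err, t
    # Step 1: kernel update K_{t+1}(x,x') = (1/m) sum_j K_t(x,u_j) K(u_j,x'),
    # carried out only through cross-Gram matrices over the unlabeled pool.
    G, C = C @ K_Utr / m, C @ K_UU / m
print("selected iteration:", best_t, "test MSE:", round(best_err, 4))
```

Each candidate kernel costs one matrix multiplication over the unlabeled pool plus one standard single-kernel solve, which is the efficiency claim made above.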

Our motivation for designing the kernel in (4) is based on the following fact. By the Mercer theorem [12], any given kernel $K$ defined on a compact set can be expressed as $K(x, x') = \sum_{i} \lambda_i \phi_i(x) \phi_i(x')$, where $(\lambda_i, \phi_i)$ are the eigenpairs of the integral operator $L_K$, which will be defined in (19) below. In general, the problem of selecting a kernel corresponds to a suitable choice of the parameters $\lambda_i$, since the eigenvalues have a close relationship with the functional complexity of $\mathcal{H}_K$ [13]. In our case, we use an iterative procedure to select an appropriate kernel. To be precise, we define a new candidate kernel by $K_t(x, x') = \sum_{i} \lambda_i^t \phi_i(x) \phi_i(x')$. Based on the observation that the eigenvalues of $K_t$ are exactly the powers $\lambda_i^t$, it suffices to find an appropriate iteration step. Since the marginal distribution is often unknown, we instead use the empirical estimator of $K_t$ defined in (4) as our candidate kernel. Furthermore, in view of the slow rate of order $O(1/\sqrt{m})$ at which the empirical kernel in (4) converges to its population counterpart, large amounts of unlabeled data guarantee a small error generated by random sampling.
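The eigenvalue-powering effect can be checked numerically on a Gram matrix: one application of the update squares the eigenvalues of the empirical integral operator, since the same eigenvectors are shared. The kernel and sample below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
d2 = (X[:, 0][:, None] - X[:, 0][None, :]) ** 2
G = np.exp(-d2 / 0.5)            # Gram matrix of a Gaussian kernel K
m = len(X)

# The empirical integral operator of K is (1/m) G.  The once-iterated kernel
# K_2(x, x') = (1/m) sum_j K(x, u_j) K(u_j, x') has Gram matrix (1/m) G @ G,
# so its empirical operator is exactly ((1/m) G) @ ((1/m) G).
ev1 = np.sort(np.linalg.eigvalsh(G / m))[::-1]
ev2 = np.sort(np.linalg.eigvalsh(G @ G / m**2))[::-1]
assert np.allclose(ev2, ev1**2, atol=1e-10)   # eigenvalues are squared
print(ev1[:3], "->", ev2[:3])
```

Because the eigenvalues lie below one, powering them shrinks the spectrum, i.e., each iteration yields a smoother, lower-complexity candidate kernel, exactly the trade-off the selection step (6) is meant to tune.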

It is seen from the proposed program that the method avoids solving multiple optimization problems in the process of learning kernels, and its computation is as efficient as standard single-kernel algorithms up to a constant. Moreover, large amounts of information associated with the input space are fully used, so that some intrinsic data structure may be exploited.

#### 3. Main Results

To highlight our idea with more refined theoretical results, in what follows we are primarily concerned with the least square setting, since the regularized least square algorithm has a closed-form solution. First of all, by the law of large numbers, with high probability we can replace the first step (4) with the following iterative procedure:
$$K_{t+1}(x, x') = \int_X K_t(x, u) \, K(u, x') \, d\rho_X(u),$$
where $\rho_X$ is the marginal distribution induced by $\rho$. We denote by $f_t$ the learner derived from (5) at the $t$th iteration, and for notational simplicity we write $f_{\mathbf z}$ for the finally selected learner. In this paper, we focus on the generalization error of the proposed algorithm, that is, $\mathcal{E}(f_{\mathbf z}) - \mathcal{E}(f_\rho)$. A small value of this quantity implies a good prediction ability of $f_{\mathbf z}$. Different from the classical literature under fixed-kernel settings, such as [13, 14], the main goal of this paper is to indicate theoretically some specific advantages over fixed-kernel approaches.

To simplify the theoretical analysis, we assume that the conditional distribution $\rho(\cdot \mid x)$ is supported on $[-M, M]$ for some $M > 0$; it follows that $|f_\rho(x)| \le M$ almost everywhere. To this end, we introduce the projection operator as follows.

*Definition 1. *Define the projection operator $\pi$ on the space of measurable functions $f: X \to \mathbb{R}$ as
$$\pi(f)(x) = \begin{cases} M, & f(x) > M, \\ f(x), & |f(x)| \le M, \\ -M, & f(x) < -M, \end{cases}$$
where $[-M, M]$ is the support of the conditional distribution.

Note that the error between the projected estimator and $f_\rho$ can be expressed as
$$\mathcal{E}(\pi(f_{\mathbf z})) - \mathcal{E}(f_\rho) = \|\pi(f_{\mathbf z}) - f_\rho\|_{L^2_{\rho_X}}^2,$$
where $\mathcal{E}(f) = \int_Z (f(x) - y)^2 \, d\rho$ denotes the population error of the function $f$.

Since the regression function $f_\rho$ may not be found within $\mathcal{H}_K$, the approximation error between $\mathcal{H}_K$ and $f_\rho$ is needed. Considering the error induced by sampling together with the approximation error, we introduce the empirical error
$$\mathcal{E}_{\mathbf z}(f) = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2,$$
and we also introduce an approximation error associated with the joint distribution $\rho$, where $\mathcal{D}(\lambda)$ is called the regularization function, given as
$$\mathcal{D}(\lambda) = \inf_{f \in \mathcal{H}_K} \left\{ \|f - f_\rho\|_{L^2_{\rho_X}}^2 + \lambda \|f\|_K^2 \right\}.$$

*Remark 2. *In the literature of learning theory, one usually assumes that the regularization function $\mathcal{D}(\lambda)$ satisfies $\mathcal{D}(\lambda) \le c_\beta \lambda^{\beta}$ for some $\beta \in (0, 1]$ and constant $c_\beta > 0$. This condition characterizes the regularity of $f_\rho$ relative to $\mathcal{H}_K$, and the converse implication also holds [15–17]. Strictly speaking, the decay of $\mathcal{D}(\lambda)$ is formally studied in approximation theory.

To obtain convergence rates of (10), we decompose the generalization error into two parts: the approximation error and the sample error; see [14, 15].

Proposition 3. *Let $f_{\mathbf z}$ be defined by (5); then the following inequality holds:
$$\mathcal{E}(\pi(f_{\mathbf z})) - \mathcal{E}(f_\rho) \le \mathcal{S}(\mathbf z, \lambda) + \mathcal{D}(\lambda),$$
where $\mathcal{S}(\mathbf z, \lambda)$ denotes the sample error and $\mathcal{D}(\lambda)$ is the regularization function.*

Proposition 3 shows that the generalization error is bounded by the sum of the sample error and the approximation error. The sample error earns its name because it mainly involves random sampling and the complexity of the hypothesis space.

Bounding the sample error is a standard technique in learning theory [13, 15, 18]. To this end, we introduce the notion of covering number to measure the complexity of the hypothesis space.

*Definition 4. *Let $(\mathcal{M}, d)$ be a pseudometric space and let $S \subseteq \mathcal{M}$. For any $\varepsilon > 0$, the covering number $\mathcal{N}(S, \varepsilon, d)$ is defined as the minimal number of balls of radius $\varepsilon$ (with respect to $d$) needed to cover $S$:
$$\mathcal{N}(S, \varepsilon, d) = \min \Big\{ l \in \mathbb{N} : S \subseteq \bigcup_{j=1}^{l} \{ t \in \mathcal{M} : d(t, t_j) \le \varepsilon \} \text{ for some } t_1, \ldots, t_l \in \mathcal{M} \Big\}.$$

Recall that a kernel function is called a Mercer kernel if it is symmetric, positive definite, and continuous. Several properties of Mercer kernels are well established and can be found in [12, 15]. Throughout, suppose that $\kappa := \sup_{x \in X} \sqrt{K(x, x)} < \infty$.

*Assumption 5. *Suppose that the Mercer kernel $K$ has polynomial complexity: there exist $s > 0$ and a constant $c_s > 0$ such that
$$\log \mathcal{N}(B_1, \varepsilon) \le c_s \varepsilon^{-s}, \quad \forall \varepsilon > 0,$$
where $B_1$ is the unit ball of $\mathcal{H}_K$. For the Sobolev space on a domain in $\mathbb{R}^d$ with order $h > d/2$, it is known from [12] that this assumption holds with an exponent $s$ determined by the ratio $d/h$.

On the other hand, to quantify the approximation error and characterize the regularity of $f_\rho$, we need to introduce the fractional integral operator associated with $K$. Recall that the standard inner product on $L^2_{\rho_X}$ is $\langle f, g \rangle_{\rho} = \int_X f(u) g(u) \, d\rho_X(u)$. Then we can define an integral operator $L_K$ on $L^2_{\rho_X}$:
$$(L_K f)(x) = \int_X K(x, u) f(u) \, d\rho_X(u).$$
It has been verified in [16] that $L_K$ is a compact, self-adjoint, and positive definite operator from $L^2_{\rho_X}$ to $L^2_{\rho_X}$, so the fractional operator $L_K^r$ ($r > 0$) is well defined. Moreover, several basic properties of $L_K$ and its fractional powers are easy to check. Lemma 12 below will show that if a kernel that is weak with respect to the true function is used for learning, the approximation ability cannot be improved even if the true function is sufficiently smooth. This is why we propose the iterative procedure (4) for updating kernels.

With these preparations, we can state the main results, which depend on the capacity of $\mathcal{H}_K$ and the smoothness of the target function, as follows.

Theorem 6. *Let $f_{\mathbf z}$ be defined by (5), and suppose that Assumption 5 holds. Then, for any $0 < \delta < 1$ and any $\lambda > 0$, the following bound holds with probability at least $1 - \delta$, where the constant is given by Proposition 11.*

From Proposition 3, the analysis reduces to studying an equivalent sampling error. The following corollary provides an asymptotically optimal convergence rate of the generalization error; the proof can be found in the Appendix.

Corollary 7. *Let $f_{\mathbf z}$ be defined by (5), and suppose that Assumption 5 holds. Then, with a suitable choice of the regularization parameter, for any $0 < \delta < 1$ the stated rate holds with probability at least $1 - \delta$; in particular, the exponent can be taken arbitrarily close to the optimal one, up to an arbitrarily small positive number.*

It is seen from Corollary 7 that the ideal choice of the regularization parameter depends on two quantities that are often unknown in advance. Alternatively, cross-validation is one of the most commonly used tools in practice. It is worth noting that our approach selects the ideal kernel and the regularization parameter simultaneously, which differs significantly from classical fixed-kernel methods.

Next, we compare our rate with existing results. Recently, sharp learning rates were established by advanced empirical process techniques in [14]. Note that we use the iterated kernel $K_t$ in place of the kernel appearing in algorithm (3), and its covering number has a polynomial decay index. To be precise, an upper bound on the sampling error in [14] was given as follows.

From formula (A.2) in the Appendix, we can deduce the corresponding covering number bound. Following the equivalence between the covering number and the spectrum of the integral operator (see Theorem 10 of [13]), a sufficiently large iteration number makes the complexity index arbitrarily small. Additionally, the constants appearing in Theorem 6 are then absolute, and the resulting sample error is sharper than that in [14]. Thus, if the regression function is sufficiently smooth, so that it can also be approximated well by the iterated RKHS, we conclude that the corresponding learning rate of Theorem 6 outperforms that in [14]. This shows that, when the smoothness of the target function is known, we can choose a proper kernel to improve the sampling error effectively, which provides an excellent theoretical basis for choosing kernel functions in real problems. Of course, since samples in real problems are noisy, the chosen kernel also has to be smoother than the target function.

#### 4. Error Analysis

The sampling error is analyzed using empirical process techniques. Early studies of the sampling error mainly applied the McDiarmid inequality without considering any notion of "space complexity." However, the McDiarmid inequality cannot capture the variance of the random variables involved. Later, VC dimension was introduced into the literature, and related notions such as covering numbers combined with Bernstein-type probability inequalities significantly reduce the sampling error; see [19] for an explicit overview. To bound the sample error, we split it into two parts. The first part does not involve any functional complexity and can be estimated easily by the following one-sided Bernstein probability inequality.

Lemma 8. *Let $\xi$ be a random variable on the probability space $(Z, \rho)$ such that $|\xi - \mathbb{E}\xi| \le M_\xi$ almost surely for some constant $M_\xi > 0$, and define its variance $\sigma^2 = \mathbb{E}(\xi - \mathbb{E}\xi)^2$. Then, for any $0 < \delta < 1$, the following holds with probability at least $1 - \delta$:
$$\frac{1}{n} \sum_{i=1}^{n} \xi(z_i) - \mathbb{E}\xi \le \frac{2 M_\xi \log(1/\delta)}{3n} + \sqrt{\frac{2 \sigma^2 \log(1/\delta)}{n}}.$$*

Proposition 9. *Let $\xi$ be the random variable on $(Z, \rho)$ given by the difference between the squared loss of a fixed function and that of $f_\rho$. For any $0 < \delta < 1$, the stated bound holds with probability at least $1 - \delta$.*

*Proof. *Since the output $y$ and the projected function values are bounded by $M$ almost everywhere, the random variable $\xi$ is uniformly bounded, which yields the required range bound. Additionally, its second moment admits a similar bound, which implies the variance estimate. By Lemma 8 and the elementary inequality $2\sqrt{ab} \le a + b$ ($a, b \ge 0$), we obtain the stated bound with probability at least $1 - \delta$.

Bounding this part of the sampling error is more involved, since the estimator varies with the random sample. To handle it, an advanced uniform concentration inequality is required [14].

Lemma 10. *Let $\mathcal{F}$ be a set of functions on $Z$ such that, for some constant $B > 0$, every $f \in \mathcal{F}$ satisfies $|f| \le B$ almost everywhere together with a variance condition. Then, for any positive numbers $\varepsilon$ and $\delta$, the stated uniform bound holds.*

Proposition 11. *Let $f_{\mathbf z}$ be given by (5), and suppose that Assumption 5 holds. Then, for any $0 < \delta < 1$, the stated bound holds with probability at least $1 - \delta$.*

*Proof. *Introduce the function set induced by Proposition 9, each element of which is of the form given there. Its members are uniformly bounded almost everywhere, and their variances can be controlled accordingly. Applying Lemma 10 to this set, we obtain the desired concentration bound with the stated probability. It remains to estimate the covering number of the function set.

For arbitrary elements of the function set, the difference of their values can be bounded by the distance between the underlying functions, since the projection is a contraction mapping. Hence the covering number of the function set is controlled by that of a ball in $\mathcal{H}_{K_t}$. According to Assumption 5, the latter has polynomial complexity. By Lemma 7 of [16], the covering number of the iterated-kernel ball can be related to that of the original one; substituting this into (38) and applying Lemma 4.1 of [14], we obtain the required estimate. Proposition 11 is completed by balancing the resulting terms.

According to the conclusion of Proposition 9, two important quantities involved in the bound remain to be controlled.

Lemma 12. *Let the quantity be defined as in (13). Under the stated smoothness condition, the following bound holds.*

The estimate extends Lemma 4.3 in [18], with the iterated kernel in place of the original one. Now we discuss the second quantity. In the classical algorithm (3), once the smoothness of the regression function exceeds a certain threshold, further smoothness cannot improve the approximation error; this is called the "saturation" phenomenon in the literature on inverse problems. For the algorithm studied here, saturation occurs only at a much higher threshold. This shows a specific advantage of using the iterated kernel instead of the original one from the perspective of approximation theory.

*Proof of Lemma 12. *According to [17], the stated representation holds. Using the spectral decomposition of the integral operator, the first bound follows. On the other hand, using the smoothness assumption on the regression function, we obtain the second bound. This completes the proof of Lemma 12.

Theorem 6 now follows easily by combining Propositions 9 and 11 with Lemma 12.

#### 5. Numerical Experiments

##### 5.1. Simulated Example

Although this paper mainly focuses on theoretical analysis, we conduct experiments to show its efficiency in practice. We consider a simulated example in which the true regression function is an additive model over four component functions. First, the covariates are generated independently; then the responses are generated from the additive model with additive noise.

For the above example, different scenarios with varying labeled and unlabeled sample sizes are taken into account, and each scenario is repeated multiple times. We use the widely used Gaussian kernel $K(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$, where the parameter $\sigma$ is specified by 10-fold cross-validation on each data set. Besides, as mentioned before, we start with a weak kernel and search for a better one iteratively; a standard weak kernel with an adjustable parameter is used, whose value we fix in the experiments.
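The 10-fold cross-validation step for the Gaussian width can be illustrated as follows; the data-generating model, the $\sigma$ grid, and the regularization value below are our own stand-ins, not the paper's exact simulation settings:

```python
import numpy as np

rng = np.random.default_rng(2)

def gram(X1, X2, sigma):
    # Gaussian kernel Gram matrix between two point sets.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Illustrative data standing in for the simulated additive model.
n = 200
X = rng.uniform(0, 1, (n, 4))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

def cv_mse(X, y, sigma, lam=1e-2, folds=10):
    # 10-fold CV error of kernel ridge regression at bandwidth sigma.
    idx = np.arange(len(X))
    err = 0.0
    for k in range(folds):
        te = idx[k::folds]
        tr = np.setdiff1d(idx, te)
        G = gram(X[tr], X[tr], sigma)
        alpha = np.linalg.solve(G + len(tr) * lam * np.eye(len(tr)), y[tr])
        pred = gram(X[te], X[tr], sigma) @ alpha
        err += np.mean((pred - y[te]) ** 2)
    return err / folds

sigmas = [0.1, 0.3, 0.5, 1.0, 2.0]
best_sigma = min(sigmas, key=lambda s: cv_mse(X, y, s))
print("selected sigma:", best_sigma)
```

The same grid-plus-CV loop applies unchanged when the candidate set is the sequence of iterated kernels rather than a bandwidth grid.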

The performance of the various methods is measured by the relative mean squared error (MSE) of each kernel-based regression. The averaged performance measures are summarized in Table 1. Note that SKM denotes the single kernel method with the Gaussian kernel, UKM denotes the proposed method without using any unlabeled data, and SEKM denotes the proposed method using all the unlabeled data.