
Abstract and Applied Analysis

Volume 2014 (2014), Article ID 206015, 8 pages

http://dx.doi.org/10.1155/2014/206015

## A Simpler Approach to Coefficient Regularized Support Vector Machines Regression

^{1}School of Statistics, University of International Business and Economics, Beijing 100029, China
^{2}College of Mathematics & Computer Science, Wuhan Textile University, Wuhan 430200, China
^{3}School of Statistics and Mathematics, Central University of Finance and Economics, Beijing 100081, China

Received 1 March 2014; Accepted 19 May 2014; Published 27 May 2014

Academic Editor: Gerd Teschke

Copyright © 2014 Hongzhi Tong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We consider a class of support vector machine regression (SVMR) algorithms associated with coefficient-based regularization and a data-dependent hypothesis space. Compared with the previous literature, we provide here a simpler convergence analysis for these algorithms. The novelty of our analysis lies in the estimation of the hypothesis error, which is carried out by setting a stepping stone between the coefficient regularized SVMR and the classical SVMR. An explicit learning rate is then derived under very mild conditions.

#### 1. Introduction

Recall the regression setting in learning theory. Let $X$ be a compact subset of $\mathbb{R}^n$ and $Y=[-M,M]$ for some $M>0$. Let $\rho$ be an unknown probability distribution on $Z=X\times Y$, and let $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^{m}$ be a set of samples independently drawn according to $\rho$. Given the samples $\mathbf{z}$, the regression problem aims to find a function $f\colon X\to\mathbb{R}$ such that $f(x)$ is a satisfactory estimate of the output $y$ when a new input $x$ is given.

Support vector machine regression (SVMR) is a family of kernel-based regression algorithms built on the $\epsilon$-insensitive loss, defined by $L_\epsilon(y,t)=\max\{|y-t|-\epsilon,\,0\}$ for some fixed $\epsilon\ge 0$. A function $K\colon X\times X\to\mathbb{R}$ is called a Mercer kernel if it is continuous, symmetric, and positive semidefinite; that is, for any finite set of distinct points $\{x_1,\dots,x_l\}\subset X$, the matrix $\big(K(x_i,x_j)\big)_{i,j=1}^{l}$ is positive semidefinite. The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with a Mercer kernel $K$ is defined (see [1]) to be the completion of the linear span of the set of functions $\{K_x:=K(x,\cdot)\colon x\in X\}$, with the inner product $\langle\cdot,\cdot\rangle_K$ satisfying $\langle K_x,K_y\rangle_K=K(x,y)$, and the reproducing property is given by $\langle K_x,f\rangle_K=f(x)$ for all $f\in\mathcal{H}_K$. Let $\kappa=\sup_{x\in X}\sqrt{K(x,x)}$; then the reproducing property tells us the following: $\|f\|_\infty\le\kappa\|f\|_K$ for every $f\in\mathcal{H}_K$.
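For concreteness, the $\epsilon$-insensitive loss and the Mercer (positive semidefiniteness) condition can be checked numerically. This is a sketch of our own; the Gaussian kernel, the bandwidth, and the sample points are illustrative choices, not part of the paper.

```python
import numpy as np

def eps_insensitive_loss(y, t, eps=0.1):
    """epsilon-insensitive loss: L_eps(y, t) = max(|y - t| - eps, 0)."""
    return np.maximum(np.abs(y - t) - eps, 0.0)

def gaussian_gram(xs, sigma=1.0):
    """Gram matrix (K(x_i, x_j)) of the Gaussian kernel, a standard Mercer kernel."""
    xs = np.asarray(xs, dtype=float)
    d2 = (xs[:, None] - xs[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Errors inside the eps-tube incur no loss; larger errors are penalized linearly.
losses = eps_insensitive_loss(np.array([1.0, 1.0]), np.array([0.95, 0.0]), eps=0.1)

# Mercer property: the Gram matrix on distinct points is positive semidefinite.
eigs = np.linalg.eigvalsh(gaussian_gram([0.0, 0.5, 1.0]))
```

The flat region of the loss inside the tube is what produces the support-vector effect: samples fitted within $\epsilon$ contribute nothing to the empirical error.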

The classical SVMR proposed by Vapnik and his coworkers [2, 3] is given by the following regularization scheme:
$$f_{\mathbf{z},\lambda}=\arg\min_{f\in\mathcal{H}_K}\big\{\mathcal{E}_{\mathbf{z}}(f)+\lambda\|f\|_K^2\big\},\tag{5}$$
where $\mathcal{E}_{\mathbf{z}}(f)=\frac{1}{m}\sum_{i=1}^{m}L_\epsilon(y_i,f(x_i))$ is the empirical error with respect to $\mathbf{z}$ and $\lambda>0$ is a regularization parameter. It is well known (see, e.g., [4, Proposition 6.21]) that the solution is of the form $f_{\mathbf{z},\lambda}=\sum_{i=1}^{m}\alpha_i K_{x_i}$, where the coefficients $(\alpha_i)_{i=1}^{m}$ are a solution of the associated finite-dimensional optimization problem.
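As a numerical illustration of scheme (5) in its representer form, the sketch below minimizes the empirical $\epsilon$-insensitive loss plus the penalty $\lambda\,\alpha^{\top}K\alpha$ (which equals $\lambda\|f\|_K^2$ for $f=\sum_i\alpha_i K_{x_i}$) directly over the coefficients with a general-purpose optimizer. The Gaussian kernel, the toy data, and SciPy's Powell method are our choices for illustration; they are not the dual QP of [4, Proposition 6.21].

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(xs, sigma=1.0):
    xs = np.asarray(xs, dtype=float)
    d2 = (xs[:, None] - xs[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def svmr_rkhs(K, y, lam=1e-3, eps=0.1):
    """Minimize (1/m) sum_i max(|y_i - (K alpha)_i| - eps, 0) + lam * alpha^T K alpha
    over the representer coefficients alpha (a numerical sketch, not the dual QP)."""
    m = len(y)
    def objective(alpha):
        slack = np.maximum(np.abs(y - K @ alpha) - eps, 0.0)
        return slack.mean() + lam * alpha @ K @ alpha
    res = minimize(objective, np.zeros(m), method="Powell")
    return res.x, objective

xs = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.8, 0.9, 0.1])
K = gaussian_gram(xs)
alpha, obj = svmr_rkhs(K, y)
fitted = K @ alpha   # values of the fitted function at the sample points
```

Starting from the zero coefficient vector, the optimizer can only improve the regularized empirical objective.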

*Remark 1. *The equality constraint needed in [4, Proposition 6.21] is superfluous since we do not include an offset term in the primal problem (5).

The mathematical analysis of algorithm (5) is well understood, with various techniques developed in an extensive literature; see, for example, [5–7]. In this paper, we are interested in a different regularized SVMR algorithm: in our setting, the regularizer is not the RKHS norm but an $l^q$-norm of the coefficients in the kernel ensemble.

*Definition 2. *For $1\le q\le 2$, let
$$\mathcal{H}_{K,\mathbf{z}}=\Big\{\textstyle\sum_{i=1}^{m}\alpha_i K_{x_i}\colon \alpha_i\in\mathbb{R},\ i=1,\dots,m\Big\},\qquad
\Omega_{\mathbf{z}}(f)=\min\Big\{\textstyle\sum_{i=1}^{m}|\alpha_i|^{q}\colon f=\sum_{i=1}^{m}\alpha_i K_{x_i}\Big\}.$$
Then, the SVMR with $l^q$-coefficient regularization learning algorithm that we study in this paper takes the form
$$f_{\mathbf{z},\eta}=\arg\min_{f\in\mathcal{H}_{K,\mathbf{z}}}\big\{\mathcal{E}_{\mathbf{z}}(f)+\eta\,\Omega_{\mathbf{z}}(f)\big\}.\tag{9}$$

*Remark 3. *The regularization parameter $\eta$ in (9) may differ from $\lambda$ in scheme (5), but a relationship between $\eta$ and $\lambda$ will be given in Section 3 when we derive the learning rate of algorithm (9).

Learning with coefficient-based regularization has attracted considerable attention in recent years, in terms of both theoretical analysis and applications. It was pointed out in [8] that by taking $q$ close to 1, and especially at the limit value $q=1$, the minimization procedure in (9) can promote the sparsity of the solution; that is, it tends to produce a solution with few nonzero coefficients [9]. This phenomenon has also been observed for the LASSO algorithm [10] and in the literature on compressed sensing [11].
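The shrinkage effect of the $q=1$ endpoint of scheme (9) can be seen in a small numerical experiment. The kernel, the toy data, and the general-purpose Powell solver below are our illustrative choices, not the paper's algorithm; a larger regularization parameter pulls the coefficient vector toward zero.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(xs, sigma=1.0):
    xs = np.asarray(xs, dtype=float)
    d2 = (xs[:, None] - xs[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def svmr_l1(K, y, eta=0.01, eps=0.1):
    """l^1-coefficient regularized SVMR (the q = 1 case of scheme (9)):
    minimize (1/m) sum_i max(|y_i - (K alpha)_i| - eps, 0) + eta * sum_i |alpha_i|."""
    m = len(y)
    def objective(alpha):
        slack = np.maximum(np.abs(y - K @ alpha) - eps, 0.0)
        return slack.mean() + eta * np.abs(alpha).sum()
    return minimize(objective, np.zeros(m), method="Powell").x

xs = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.8, 0.9, 0.1])
K = gaussian_gram(xs)

# A larger eta shrinks the coefficient vector toward zero (the sparsity effect).
l1_small_eta = np.abs(svmr_l1(K, y, eta=0.01)).sum()
l1_large_eta = np.abs(svmr_l1(K, y, eta=1.0)).sum()
```

With a heavy penalty, reducing the empirical loss can no longer pay for the cost of nonzero coefficients, so the solution collapses toward the zero vector.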

However, it should be noticed that there are essential differences between the learning schemes (9) and (5). On the one hand, the regularizer $\Omega_{\mathbf{z}}(f)$ is not a Hilbert space norm, which causes technical difficulties in the mathematical analysis. On the other hand, both the hypothesis space and the regularizer depend on the samples $\mathbf{z}$; this increases the flexibility and adaptivity of algorithm (9), but it also means that the standard error analysis methods for scheme (5) are no longer appropriate for scheme (9). To overcome these difficulties, [12] introduced a Banach space of all functions of the form $f=\sum_{i}\alpha_i K_{u_i}$, with norm given by the infimum of the coefficient sums over all such representations. An error analysis framework was established there, and a series of papers then began to investigate the performance of kernel learning schemes with coefficient regularization (see [13–16]). In that literature, a condition imposed on the marginal distribution $\rho_X$ of $\rho$ on $X$ plays a critical role in the error analysis: a probability measure $\rho_X$ on $X$ is said to satisfy the condition $L^{\tau}$ if there exist some $\tau>0$ and $c_{\tau}>0$ such that
$$\rho_X\big(\{u\in X\colon |u-x|\le r\}\big)\ge c_{\tau}r^{\tau},\quad\forall x\in X.\tag{12}$$
In general, the index $\tau$ is hard to estimate. If $X$ satisfies some regularity conditions (such as an interior cone condition) and $\rho_X$ is the uniform distribution on $X$, then (12) holds with $\tau=n$. This leads to a low convergence rate which depends on $n$, the dimension of the input space $X$, and $n$ is often large in learning problems.

In this paper, we succeed in removing condition (12) and provide a simpler error analysis for scheme (9). The novelty of our analysis is a stepping stone technique used to bound the hypothesis error. As a result, we derive an explicit learning rate for (9) under very mild conditions.

#### 2. Error Decomposition and Hypothesis Error

The main purpose of this paper is to provide a convergence analysis of the learning scheme (9). With respect to the $\epsilon$-insensitive loss $L_\epsilon$, the prediction ability of a measurable function $f\colon X\to\mathbb{R}$ is measured by the following generalization error:
$$\mathcal{E}(f)=\int_Z L_\epsilon(y,f(x))\,d\rho=\int_X\int_Y L_\epsilon(y,f(x))\,d\rho(y\mid x)\,d\rho_X(x),$$
where $\rho_X$ is the marginal distribution on $X$ and $\rho(\cdot\mid x)$ is the conditional probability measure at $x$ induced by $\rho$. Let $f_\rho$ be a minimizer of $\mathcal{E}(f)$ among all measurable functions on $X$. It was proved in [6] that $|f_\rho(x)|\le M$ for almost every $x\in X$. To make full use of this feature of the target function $f_\rho$, one can introduce a projection operator, which has been used extensively in the error analysis of learning algorithms; see, for example, [17, 18].

*Definition 4. *The projection operator $\pi$ is defined on the space of measurable functions $f\colon X\to\mathbb{R}$ as
$$\pi(f)(x)=\begin{cases}M, & f(x)>M,\\ f(x), & -M\le f(x)\le M,\\ -M, & f(x)<-M.\end{cases}$$

It is easy to see that $|\pi(f)(x)-y|\le|f(x)-y|$ whenever $|y|\le M$, so $\mathcal{E}(\pi(f))\le\mathcal{E}(f)$ and $\mathcal{E}_{\mathbf{z}}(\pi(f))\le\mathcal{E}_{\mathbf{z}}(f)$. We thus take $\pi(f_{\mathbf{z},\eta})$ instead of $f_{\mathbf{z},\eta}$ as our empirical target function and analyze the related learning rates.
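The projection operator of Definition 4 is simply clipping to $[-M, M]$. A minimal sketch (the value of $M$ and the sample values are our illustrative choices) confirming that clipping never increases the $\epsilon$-insensitive loss when the outputs satisfy $|y|\le M$:

```python
import numpy as np

def project(values, M=1.0):
    """Projection operator pi: clip function values to the interval [-M, M]."""
    return np.clip(values, -M, M)

def eps_insensitive_loss(y, t, eps=0.1):
    return np.maximum(np.abs(y - t) - eps, 0.0)

y = np.array([0.5, -0.7, 1.0])          # outputs, all within [-M, M] with M = 1
f_vals = np.array([2.3, -1.6, 0.4])     # a hypothetical estimator's values
clipped = project(f_vals, M=1.0)
```

Since every $y$ lies in $[-M,M]$, moving $f(x)$ back into that interval can only bring it closer to $y$, pointwise.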

##### 2.1. Error Decomposition

Error decomposition is a useful approach to the error analysis of regularized learning schemes. For the sample-dependent hypothesis space $\mathcal{H}_{K,\mathbf{z}}$, [12] proposed a modified error decomposition with an extra hypothesis error term, obtained by introducing a regularization function
$$f_{\eta}=\arg\min_{f\in\mathcal{H}_K}\big\{\mathcal{E}(f)+\eta\|f\|_K^2\big\}.\tag{16}$$
We conduct the error decomposition for scheme (9) with the same underlying idea as [12].

Proposition 5. *Let $f_{\mathbf{z},\eta}$ and $f_{\eta}$ be defined in (9) and (16). Then,
$$\mathcal{E}(\pi(f_{\mathbf{z},\eta}))-\mathcal{E}(f_\rho)\le\mathcal{S}(\mathbf{z},\eta)+\mathcal{H}(\mathbf{z},\eta)+\mathcal{D}(\eta).$$
Here,
$$\mathcal{S}(\mathbf{z},\eta)=\big[\mathcal{E}(\pi(f_{\mathbf{z},\eta}))-\mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z},\eta}))\big]+\big[\mathcal{E}_{\mathbf{z}}(f_{\eta})-\mathcal{E}(f_{\eta})\big],$$
$$\mathcal{H}(\mathbf{z},\eta)=\big[\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\eta})+\eta\,\Omega_{\mathbf{z}}(f_{\mathbf{z},\eta})\big]-\big[\mathcal{E}_{\mathbf{z}}(f_{\eta})+\eta\|f_{\eta}\|_K^2\big],$$
$$\mathcal{D}(\eta)=\mathcal{E}(f_{\eta})-\mathcal{E}(f_\rho)+\eta\|f_{\eta}\|_K^2.$$
*

*Proof. *A direct computation shows that
$$\mathcal{E}(\pi(f_{\mathbf{z},\eta}))-\mathcal{E}(f_\rho)=\mathcal{S}(\mathbf{z},\eta)+\mathcal{H}(\mathbf{z},\eta)+\mathcal{D}(\eta)-\eta\,\Omega_{\mathbf{z}}(f_{\mathbf{z},\eta})-\big[\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\eta})-\mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z},\eta}))\big],$$
and both subtracted terms are nonnegative, since $\Omega_{\mathbf{z}}(f_{\mathbf{z},\eta})\ge 0$ and $\mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z},\eta}))\le\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\eta})$.
This proves the proposition.

$\mathcal{S}(\mathbf{z},\eta)$ is usually called the sample error; it will be estimated by a concentration inequality in the next section. $\mathcal{D}(\eta)$ is independent of the sample and is often called the approximation error; the decay of $\mathcal{D}(\eta)$, as $\eta\to0$, characterizes the approximation ability of $\mathcal{H}_K$. We will assume that, for some $0<\beta\le1$ and $c_\beta>0$,
$$\mathcal{D}(\eta)\le c_\beta\,\eta^{\beta},\quad\forall\eta>0.\tag{20}$$

*Remark 6. *Since $\mathcal{D}(\eta)=\mathcal{E}(f_{\eta})-\mathcal{E}(f_\rho)+\eta\|f_{\eta}\|_K^2$, assumption (20) concerns the approximation of $f_\rho$ by functions from $\mathcal{H}_K$. In fact, (20) can be satisfied when $f_\rho$ lies in a suitable interpolation space of the pair (see, e.g., [19, 20]).

$\mathcal{H}(\mathbf{z},\eta)$ is called the hypothesis error, since the regularization function $f_{\eta}$ may not be in the hypothesis space $\mathcal{H}_{K,\mathbf{z}}$. The major contribution we make in this paper is a simpler estimation of the hypothesis error, obtained by setting a stepping stone between the coefficient regularized scheme (9) and the classical scheme (5).

##### 2.2. Hypothesis Error Estimate

Since the solution of scheme (9) has a representation similar to that of scheme (5), it is reasonable to expect close relations between the two schemes; the latter may therefore play a role in the analysis of the former.

Theorem 7. *Let , , , and then
*

*Proof. *Let be the solution to (5). By (7), we have
Noting that , it can be derived from (15), (9), and (22) that
By taking in (5), one can get
Putting (24) into (23), one then has
This proves the theorem.

*Remark 8. *The stepping stone method was first introduced in [21] for the error analysis of linear programming SVM classifiers. The technique was also used in [22, 23] to study $l^q$-coefficient regularized least squares regression for a restricted range of $q$. In this paper, we extend the index $q$ to a larger range, which should help improve the understanding of these coefficient-based regularized algorithms.

*Remark 9. *Theorem 7 presents a simpler approach to estimating the hypothesis error. In contrast with the former literature (see, e.g., [13–16]), we conduct the estimation without imposing any assumptions on the input space $X$, the kernel $K$, or the marginal distribution $\rho_X$.

#### 3. Sample Error and Learning Rate

This section is devoted to estimating the sample error and deriving the learning rate of algorithm (9).

##### 3.1. Sample Error Estimate

We will adopt some results from the literature to estimate the sample error. To this end, we need some definitions and assumptions. For a measurable function $f$ on $X$, denote $\|f\|_\infty=\sup_{x\in X}|f(x)|$.

*Definition 10. *A variance power of the pair $(L_\epsilon,\rho)$ is a number $\theta\in[0,1]$ such that, for any measurable function $f\colon X\to[-M,M]$, there exists some constant $c_\theta>0$ satisfying
$$\mathbb{E}\Big[\big(L_\epsilon(y,f(x))-L_\epsilon(y,f_\rho(x))\big)^2\Big]\le c_\theta\big(\mathcal{E}(f)-\mathcal{E}(f_\rho)\big)^{\theta}.\tag{26}$$

Equation (26) is usually called a variance-expectation condition for the pair $(L_\epsilon,\rho)$. It is easy to see that (26) always holds with $\theta=0$. When $\epsilon=0$, the target function $f_\rho$ becomes the median of the conditional distribution $\rho(\cdot\mid x)$ (see [6]). In this case, as pointed out in [24], if $\rho$ has a median of $p$-average type for some $p>0$, then (26) can be satisfied with a positive power $\theta$ depending on $p$. Here, we say $\rho$ has a median of $p$-average type if, for every $x\in X$, there exist constants $a_x\in(0,1]$ and $b_x>0$ such that, for all $t\in(0,a_x]$, $\rho\big((f_\rho(x)-t,\,f_\rho(x))\mid x\big)\ge b_x t$ and $\rho\big((f_\rho(x),\,f_\rho(x)+t)\mid x\big)\ge b_x t$, and the function on $X$ taking value $(b_x a_x)^{-1}$ at $x$ lies in $L^{p}_{\rho_X}$. But for $\epsilon>0$, as far as we know, it is still an open problem to find a meaningful condition on $\rho$ guaranteeing that (26) holds with a positive index $\theta$.

*Definition 11. *Let $\mathcal{F}$ be a class of functions on $X$ and let $\mathbf{u}=(u_i)_{i=1}^{k}\in X^{k}$. The $\ell^2$-empirical metric is defined on $\mathcal{F}$ by
$$d_{2,\mathbf{u}}(f,g)=\Big(\frac{1}{k}\sum_{i=1}^{k}\big(f(u_i)-g(u_i)\big)^2\Big)^{1/2}.$$
For every $\varepsilon>0$, the covering number of $\mathcal{F}$ with respect to $d_{2,\mathbf{u}}$ is
$$\mathcal{N}_{2,\mathbf{u}}(\mathcal{F},\varepsilon)=\min\Big\{l\in\mathbb{N}\colon\exists\,f_1,\dots,f_l\in\mathcal{F}\ \text{such that}\ \mathcal{F}=\bigcup_{j=1}^{l}\big\{f\in\mathcal{F}\colon d_{2,\mathbf{u}}(f,f_j)\le\varepsilon\big\}\Big\}.$$

Let $B_1=\{f\in\mathcal{H}_K\colon\|f\|_K\le1\}$ be the unit ball of $\mathcal{H}_K$. The $\ell^2$-empirical covering number of the unit ball is defined as
$$\mathcal{N}_2(B_1,\varepsilon)=\sup_{k\in\mathbb{N}}\,\sup_{\mathbf{u}\in X^{k}}\mathcal{N}_{2,\mathbf{u}}(B_1,\varepsilon).$$
We assume that $\mathcal{H}_K$ satisfies the following capacity assumption.

There exists an exponent $p$ with $0<p<2$ and a constant $c_p>0$ such that
$$\log\mathcal{N}_2(B_1,\varepsilon)\le c_p\,\varepsilon^{-p},\quad\forall\varepsilon>0.\tag{31}$$
We now set out to bound the sample error. Write $\mathcal{S}(\mathbf{z},\eta)$ as
$$\mathcal{S}(\mathbf{z},\eta)=\big[\mathcal{E}(\pi(f_{\mathbf{z},\eta}))-\mathcal{E}_{\mathbf{z}}(\pi(f_{\mathbf{z},\eta}))\big]+\big[\mathcal{E}_{\mathbf{z}}(f_{\eta})-\mathcal{E}(f_{\eta})\big].$$
Applying [6, Proposition 4.1], we obtain the following estimate for the second term.

Lemma 12. *For any , under the assumption (26), with confidence , one has
*

The estimation of the remaining part of the sample error is based on the following concentration inequality, which can be found in [25].

Lemma 13. *Let be a set of measurable functions on , and let , and be constants such that each satisfies and . If, for some and ,
**
then there exists a constant depending only on such that for any , with probability , there holds
**
where .*

We may apply Lemma 13 to a set of functions with , where

Proposition 14. *If assumptions (26) and (31) are satisfied, then for any , with confidence , there holds
**
for all , where
*

*Proof. *Each function has a form
with some . We can easily see that and
The assumption (26) tells us that with . Moreover, for any , and ,
we get
This together with (31) implies
Hence, all the conditions in Lemma 13 hold, and we know that, for any , with confidence , there holds, for every ,
Here,
Recall an elementary inequality
Applying it with , , to the first term of (44), we can derive the conclusion.

It remains to find a ball containing the solutions $f_{\mathbf{z},\eta}$ for all samples $\mathbf{z}$.

Lemma 15. *Let , be defined by (9). Then, for any , one has
*

*Proof. *For any , there exists , such that and
Taking $f=0$ in (9), we can see that
It follows that
When , by the Hölder inequality, we can see that
It is easy to check that (51) still holds for . Let ; we then get the assertion.
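The Hölder step invoked in this proof can be reconstructed as follows (with conjugate exponent $q^{*}=q/(q-1)$ for $1<q\le2$; for $q=1$ the inequality is trivial since $m^{1-1/q}=1$):

```latex
\sum_{i=1}^{m}|\alpha_i|
  \;=\; \sum_{i=1}^{m} 1\cdot|\alpha_i|
  \;\le\; \Big(\sum_{i=1}^{m} 1^{q^{*}}\Big)^{1/q^{*}}
          \Big(\sum_{i=1}^{m}|\alpha_i|^{q}\Big)^{1/q}
  \;=\; m^{1-1/q}\Big(\sum_{i=1}^{m}|\alpha_i|^{q}\Big)^{1/q}.
```

This converts a bound on the $l^q$ penalty $\sum_i|\alpha_i|^q$ into a bound on the coefficient $l^1$ norm, at the cost of a factor $m^{1-1/q}$.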

From Lemma 15 and Proposition 14, we can get the following.

Corollary 16. *If assumptions (26) and (31) hold, then, for any , with confidence , there holds
**
where
*

##### 3.2. Deriving Learning Rates

Combining the estimates in Sections 2.2 and 3.1, we can derive an explicit learning rate for scheme (9) by suitably selecting the regularization parameters $\lambda$ and $\eta$.

Theorem 17. *Suppose that assumptions (20), (26), and (31) are satisfied, for any , by taking , , and we have, with confidence ,
**
where is a constant independent of or .*

*Proof. *Putting Theorem 7, Lemma 12, Corollary 16, and assumption (20) into Proposition 5, by taking , we find that, for any , with confidence ,
Here, . According to the choice of , we can easily check that
So, our theorem follows by taking and .

*Remark 18. *Theorem 17 provides an explicit learning rate for coefficient-based regularized SVMR. This learning rate is independent of the dimension $n$ of the input space $X$. We do not require the marginal distribution $\rho_X$ or the kernel $K$ to satisfy any additional regularity condition, such as condition (12).

*Remark 19. *Another advantage of the coefficient-based regularization scheme is its flexibility in choosing the kernel. For instance, [26, 27] consider least squares regression with indefinite kernels and coefficient regularization, where the requirement on the kernel is relaxed to its being only a continuous and uniformly bounded bivariate function on $X\times X$. Extending the method of this paper to the indefinite kernel setting will be a very interesting topic for future work.

Let us end this paper by comparing our result with the learning rate presented in [7] in a special case. To this end, we reformulate [7, Theorem 2.3] as follows.

Proposition 20. *If , , and $\rho$ has a median of $p$-average type for some , taking , , then, with , for any , with confidence , one has
**
where is a constant independent of or .*

By Theorem 17, we can see the following.

Corollary 21. *Under the same conditions of Proposition 20, for , by taking , one has, for any , with confidence ,
**
where is a constant independent of or .*

*Proof. *Note that when . From [24, Theorem 2.7], we know that
and here is a constant independent of or .

Since , we know that (20) holds with . The assumption that $\rho$ has a median of $p$-average type implies that (26) is satisfied with . Since and , we know from [28] that (31) holds true for any . Therefore, let ; according to Theorem 17, by taking , , we have, for any , with confidence ,
This together with (59) proves the corollary with the constant .

Corollary 21 shows that the learning rate presented in Theorem 17 for the coefficient regularized SVMR is faster than the one given in [7] for the RKHS-norm regularized learning scheme, at least in the special case considered here.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The authors thank the referees for their valuable comments and helpful suggestions, which greatly improved this work. The research was supported by the NSF of China under Grant 11171014.

#### References

1. N. Aronszajn, “Theory of reproducing kernels,” *Transactions of the American Mathematical Society*, vol. 68, pp. 337–404, 1950.
2. V. Vapnik, S. Golowich, and A. Smola, “Support vector method for function approximation, regression estimation, and signal processing,” in *Advances in Neural Information Processing Systems*, M. Mozer, M. Jordan, and T. Petsche, Eds., vol. 9, pp. 281–287, MIT Press, Cambridge, Mass, USA, 1997.
3. V. Vapnik, *Statistical Learning Theory*, John Wiley & Sons, New York, NY, USA, 1998.
4. N. Cristianini and J. Shawe-Taylor, *An Introduction to Support Vector Machines*, Cambridge University Press, Cambridge, UK, 2000.
5. I. Steinwart and A. Christmann, *Support Vector Machines*, Springer, New York, NY, USA, 2008.
6. H. Z. Tong, D. R. Chen, and L. Z. Peng, “Analysis of support vector machines regression,” *Foundations of Computational Mathematics*, vol. 9, no. 2, pp. 243–257, 2009.
7. D.-H. Xiang, T. Hu, and D.-X. Zhou, “Approximation analysis of learning algorithms for support vector regression and quantile regression,” *Journal of Applied Mathematics*, vol. 2012, Article ID 902139, 17 pages, 2012.
8. I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” *Communications on Pure and Applied Mathematics*, vol. 57, no. 11, pp. 1413–1457, 2004.
9. D. Donoho, “For most large underdetermined systems of linear equations, the minimal $l^1$-norm solution is also the sparsest solution,” Tech. Rep., Stanford University, 2004.
10. R. Tibshirani, “Regression shrinkage and selection via the lasso,” *Journal of the Royal Statistical Society B: Statistical Methodology*, vol. 58, no. 1, pp. 267–288, 1996.
11. E. J. Candès and T. Tao, “Decoding by linear programming,” *IEEE Transactions on Information Theory*, vol. 51, no. 12, pp. 4203–4215, 2005.
12. Q. Wu and D.-X. Zhou, “Learning with sample dependent hypothesis spaces,” *Computers and Mathematics with Applications*, vol. 56, no. 11, pp. 2896–2907, 2008.
13. Q.-W. Xiao and D.-X. Zhou, “Learning by nonsymmetric kernels with data dependent spaces and $l^1$-regularizer,” *Taiwanese Journal of Mathematics*, vol. 14, no. 5, pp. 1821–1836, 2010.
14. L. Shi, Y.-L. Feng, and D.-X. Zhou, “Concentration estimates for learning with $l^1$-regularizer and data dependent hypothesis spaces,” *Applied and Computational Harmonic Analysis*, vol. 31, no. 2, pp. 286–302, 2011.
15. H. Tong, D.-R. Chen, and F. Yang, “Support vector machines regression with $l^1$-regularizer,” *Journal of Approximation Theory*, vol. 164, no. 10, pp. 1331–1344, 2012.
16. H.-Y. Wang, Q.-W. Xiao, and D.-X. Zhou, “An approximation theory approach to learning with $l^1$ regularization,” *Journal of Approximation Theory*, vol. 167, pp. 240–258, 2013.
17. P. L. Bartlett, “The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network,” *IEEE Transactions on Information Theory*, vol. 44, no. 2, pp. 525–536, 1998.
18. I. Steinwart, D. Hush, and C. Scovel, “An oracle inequality for clipped regularized risk minimizers,” in *Advances in Neural Information Processing Systems*, B. Schölkopf, J. Platt, and T. Hoffman, Eds., vol. 19, pp. 1321–1328, MIT Press, Cambridge, Mass, USA, 2007.
19. S. Smale and D. X. Zhou, “Estimating the approximation error in learning theory,” *Analysis and Applications*, vol. 1, no. 1, pp. 17–41, 2003.
20. D.-X. Zhou, “Density problem and approximation error in learning theory,” *Abstract and Applied Analysis*, vol. 2013, Article ID 715683, 13 pages, 2013.
21. Q. Wu and D.-X. Zhou, “SVM soft margin classifiers: linear programming versus quadratic programming,” *Neural Computation*, vol. 17, no. 5, pp. 1160–1187, 2005.
22. H. Z. Tong, D.-R. Chen, and F. H. Yang, “Least square regression with $l^p$-coefficient regularization,” *Neural Computation*, vol. 22, no. 12, pp. 3221–3235, 2010.
23. Y.-L. Feng and S.-G. Lv, “Unified approach to coefficient-based regularized regression,” *Computers and Mathematics with Applications*, vol. 62, no. 1, pp. 506–515, 2011.
24. I. Steinwart and A. Christmann, “Estimating conditional quantiles with the help of the pinball loss,” *Bernoulli*, vol. 17, no. 1, pp. 211–225, 2011.
25. Q. Wu, Y. Ying, and D.-X. Zhou, “Multi-kernel regularized classifiers,” *Journal of Complexity*, vol. 23, no. 1, pp. 108–134, 2007.
26. H. W. Sun and Q. Wu, “Least square regression with indefinite kernels and coefficient regularization,” *Applied and Computational Harmonic Analysis*, vol. 30, no. 1, pp. 96–109, 2011.
27. H. W. Sun and Q. Wu, “Indefinite kernel network with dependent sampling,” *Analysis and Applications*, vol. 11, no. 5, Article ID 1350020, 15 pages, 2013.
28. D.-X. Zhou, “Capacity of reproducing kernel spaces in learning theory,” *IEEE Transactions on Information Theory*, vol. 49, no. 7, pp. 1743–1752, 2003.