Constructive Analysis for Least Squares Regression with Generalized K-Norm Regularization
We introduce a constructive approach to least squares algorithms with generalized K-norm regularization. Unlike previous studies, we construct a stepping-stone function with adjustable parameters in the error decomposition. This makes the analysis flexible and may allow it to be extended to other algorithms. Using a projection technique for the sample error and the spectral theorem for the integral operator in the regularization error, we derive a learning rate.
1. Introduction
In learning theory, we are always given a sample set , which is drawn from a joint distribution on the sample space . Here, the input space is a compact metric space and for a regression problem. For a function obtained via some algorithm, a loss functional is defined to measure its performance on a sample point . In regression problems, the least squares loss is the most widely used. We can then use the generalization error to evaluate over the whole sample space:
From , we know that the target function is , called the regression function, which minimizes the generalization error. Since is always unknown in practice, we have to find another function close to based on the sample. The well-known empirical risk minimization (ERM) algorithm is introduced in [2, 3]. To avoid overfitting, a penalty term related to is added to this algorithm, which is usually called regularization. While the squared K-norm regularization term is extensively studied in , and so forth, in this paper we consider a more general model: with some . In this algorithm, minimization is restricted to a hypothesis space which is a reproducing kernel Hilbert space (RKHS) on . The RKHS is defined as with , associated with a Mercer kernel which is continuous, symmetric, and positive definite. Since is a compact metric space, the kernel is bounded and we denote in the following.
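The regularized objective described above can be sketched numerically. The following is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes the penalty has the form lam * ||f||_K^q with an exponent q (q = 2 giving the classical squared K-norm), restricts f to the span of the kernel sections at the sample points, and uses a Gaussian kernel as one example of a Mercer kernel. The names `gaussian_kernel`, `regularized_objective`, `lam`, and `q` are hypothetical.

```python
import math

def gaussian_kernel(s, t, width=1.0):
    """One example of a Mercer kernel on the real line (assumed for illustration)."""
    return math.exp(-((s - t) ** 2) / (2.0 * width ** 2))

def regularized_objective(coeffs, xs, ys, lam, q, kernel=gaussian_kernel):
    """Empirical least squares risk plus lam * ||f||_K^q
    for f = sum_j coeffs[j] * K(xs[j], .). The penalty form is an assumption."""
    m = len(xs)
    gram = [[kernel(a, b) for b in xs] for a in xs]
    # f(x_i) = sum_j c_j K(x_i, x_j)
    preds = [sum(c * g for c, g in zip(coeffs, row)) for row in gram]
    risk = sum((p - y) ** 2 for p, y in zip(preds, ys)) / m
    # ||f||_K^2 = c^T G c, which equals sum_i c_i f(x_i)
    norm_sq = sum(c * p for c, p in zip(coeffs, preds))
    return risk + lam * max(norm_sq, 0.0) ** (q / 2.0)
```

With q = 2 the penalty is the familiar squared norm; other values of q change only the last line, which is why the same evaluation routine covers the whole family.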
2. Main Result
Although the uniform boundedness assumption was dropped in previous work , we still assume almost surely for some constant throughout this paper for simplicity; our analysis can be extended to the unbounded case by using a different probability inequality.
For the hypothesis space, a polynomial decay condition is imposed to control the capacity. To state this condition, we first recall the notion of covering number.
Definition 1. Let be a pseudometric space and . For , the covering number of the set with respect to is defined to be the minimal number of balls of radius whose union covers . That is, where .
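For a finite set, Definition 1 can be illustrated with a greedy construction. This is a sketch only: it returns the size of one valid cover by closed balls centered at points of the set, which is an upper bound on the covering number and not necessarily the minimum. The function name `greedy_cover_size` and its arguments are hypothetical.

```python
def greedy_cover_size(points, radius, dist):
    """Size of a greedily built cover of `points` by closed balls of the
    given radius centered at points of the set (an upper bound on the
    covering number, not necessarily the minimal number of balls)."""
    centers = []
    for p in points:
        # p needs a new ball only if no existing center is within `radius`
        if all(dist(p, c) > radius for c in centers):
            centers.append(p)
    return len(centers)
```

For example, `greedy_cover_size([0, 1, 2, 3], 1, lambda a, b: abs(a - b))` keeps the centers 0 and 2 and returns 2.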
When the metric is chosen to be , that is, , this is the classical uniform covering number, which is widely used in [4, 6–8] and elsewhere; a more detailed analysis can be found in [9, 10]. More recent references [11–14] use the -empirical covering number to obtain a sharper upper bound for the excess generalization error .
Definition 2. Denote for some . For a set of functions on and , with notation and , the -empirical covering number of is given by
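The empirical metric behind Definition 2 is easy to compute on a given set of points. The sketch below assumes the usual normalized form, the square root of the mean of the squared differences of two functions over k points; both this normalization and the name `empirical_l2_distance` are assumptions made for illustration.

```python
import math

def empirical_l2_distance(f, g, xs):
    """Normalized empirical l2 distance between f and g on the points xs
    (assumed form: sqrt of the mean squared difference)."""
    return math.sqrt(sum((f(x) - g(x)) ** 2 for x in xs) / len(xs))
```

Because it only looks at function values on the sample points, this distance is never larger than the uniform distance between the two functions.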
Now, we can describe the capacity condition of the hypothesis space .
Definition 3. We say that has empirical polynomial complexity with exponent , , if there exists a constant such that where is the ball with radius in .
The integral operator defined by is also important in learning theory and has been studied in . In , the authors show that, for a Mercer kernel , the associated is a compact operator with a nonincreasing sequence of positive eigenvalues . The induced fractional operator is then well defined, for any with orthogonal basis of . In the following, we make use of this notion in our constructive analysis.
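For orientation, the standard forms of these two operators can be written as follows. The symbols $\rho_X$ (marginal distribution on the input space), $L^2_{\rho_X}$, $\mu_i$, $e_i$, and $r$ are notation assumed for this sketch and need not match the cited works.

```latex
L_K f(x) = \int_X K(x,t)\, f(t)\, d\rho_X(t), \qquad
L_K^{r} f = \sum_{i \ge 1} \mu_i^{r}\, \langle f, e_i \rangle_{L^2_{\rho_X}}\, e_i , \quad r > 0,
```

Here $\{e_i\}$ are eigenfunctions of $L_K$ associated with the eigenvalues $\{\mu_i\}$, so the fractional power acts on each eigen-direction separately.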
We additionally introduce the projection operator on the space of measurable functions :
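A projection of this kind is commonly a truncation of function values to a bounded interval. The sketch below assumes that form, clipping values to [-bound, bound], where `bound` plays the role of the almost sure bound on the output. The name `project` is hypothetical.

```python
def project(f, bound):
    """Return the function x -> f(x) clipped to [-bound, bound]
    (assumed form of the projection operator)."""
    def projected(x):
        return max(-bound, min(bound, f(x)))
    return projected
```

When the outputs are bounded by `bound`, clipping a prediction can only move it closer to the observed value, which is why the projected function has no larger empirical least squares error.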
Our main result, stated below, is proved in Section 6.
Theorem 4. Assume (3), (7) hold for sample distribution and hypothesis space . The regression function satisfies for some . is obtained from (2). Then, by choosing appropriate (explicit expression can be found in the proof) with confidence for any , we have for some constant not depending on or and
3. Error Decomposition
Various error decomposition methods motivate our research, especially [7, 12, 14, 16, 17]. The general idea of error decomposition is to split the excess generalization error (see  for details) into two parts, which can be bounded by concentration inequalities and approximation analysis, respectively. In our setting, let be a function to be determined in ; it can be expressed as where The first and second terms and are called the sample error, which is studied in Section 5, while the third term is the regularization error (or approximation error), which is the main focus of this paper.
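A decomposition of this type can be sketched as follows. The notation is assumed for illustration: $\mathcal{E}$ and $\mathcal{E}_{\mathbf z}$ denote the generalization and empirical errors, $\pi$ the projection, $f_{\mathbf z}$ the minimizer of the regularized empirical problem, $f_\lambda$ the stepping-stone function, and $q$ the exponent of the K-norm penalty. The exact form used in the paper may differ.

```latex
\begin{aligned}
\mathcal{E}(\pi(f_{\mathbf z})) - \mathcal{E}(f_\rho)
&\le \mathcal{E}(\pi(f_{\mathbf z})) - \mathcal{E}(f_\rho) + \lambda \|f_{\mathbf z}\|_K^q \\
&= \big\{ [\mathcal{E}(\pi(f_{\mathbf z})) - \mathcal{E}(f_\rho)] - [\mathcal{E}_{\mathbf z}(\pi(f_{\mathbf z})) - \mathcal{E}_{\mathbf z}(f_\rho)] \big\} \\
&\quad + \big\{ [\mathcal{E}_{\mathbf z}(\pi(f_{\mathbf z})) + \lambda \|f_{\mathbf z}\|_K^q] - [\mathcal{E}_{\mathbf z}(f_\lambda) + \lambda \|f_\lambda\|_K^q] \big\} \\
&\quad + \big\{ [\mathcal{E}_{\mathbf z}(f_\lambda) - \mathcal{E}_{\mathbf z}(f_\rho)] - [\mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho)] \big\} \\
&\quad + \big\{ \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho) + \lambda \|f_\lambda\|_K^q \big\}.
\end{aligned}
```

Under the boundedness assumption on the output, projection does not increase the empirical error, and $f_{\mathbf z}$ minimizes the regularized empirical objective, so the second bracket is at most zero. What remains is two sample error terms and one regularization error term, for any choice of $f_\lambda$ in the hypothesis space.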
It is known that can be chosen freely in , provided it is close to in some sense; in previous work it is always naturally chosen to be the one minimizing . However, difficulties arise if the minimizer does not exist or has no explicit expression. In this paper, we construct a special function in the form with some to handle this problem.
4. Regularization Error
The main contribution of this section is an error analysis of the regularization error. The regularization error, also called the approximation error, has already been studied in . However, we analyze this part of the error from a different viewpoint. From , we know is a compact, self-adjoint, and positive operator. By applying the spectral theorem for compact operators, we can bound a compact positive operator by its eigenvalues. First, we introduce a useful lemma.
Lemma 5. Let and ; one has
Proof. Taking the derivative of the right-hand side with respect to , we find that it reaches its maximum when ; that is, Since , where , the lemma is proved.
Proposition 6. Assume and (15); there holds with some constant depending on and by choosing appropriate and .
Proof. Since , we will analyze the two terms, respectively. Noting that and , we have
Recall that , combining with Lemma 5; there holds
For the term , we have the following inequality as : This means
Minimizing the sum of the upper bounds (22) and (24) is equivalent to maximizing the power of . We can choose Then, This proves the result with
Remark 7. Another choice also leads to the same result, up to the constants.
Remark 8. In the case , our result becomes , which is consistent with the classical one . In fact, for a general of interest, the bound is better than since , while .
In , the authors construct a function based on the generalized Fourier expansion of and derive that with some constant for any . The rate is always much less than and cannot achieve when . On the other hand, our result is better than , while .
Compared with , we obtain the same rate for the upper bound. There, the authors find a connection between and with different . However, their analysis relies on an existing result, while our method does not.
5. Sample Error
There is a vast literature on the sample error. Here, we follow the analysis of . First, we recall the Bernstein inequality . Denote , , and for an integrable function on .
Lemma 9. Assume for some constant almost surely. Then, for any .
Now, we can obtain the sample error bound involving .
Proposition 10. Assume (3), for any , with confidence ; there holds
Proof. Let we have . Note that and It is easy to see that Then, and By Bernstein inequality, holds with . Set the right-hand side to be ; we can solve and the following bound This proves the proposition.
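The step of setting a Bernstein-type tail bound equal to a confidence level and solving for the deviation can be sketched numerically. The sketch assumes the standard one-sided Bernstein bound exp(-m eps^2 / (2 (sigma2 + bound * eps / 3))) for the average of m independent copies of a random variable with variance sigma2 and centered values bounded by `bound`. The function names are hypothetical, and the constants in the proposition above may differ.

```python
import math

def bernstein_tail(eps, m, sigma2, bound):
    """Assumed standard one-sided Bernstein tail bound for a deviation eps."""
    return math.exp(-m * eps ** 2 / (2.0 * (sigma2 + bound * eps / 3.0)))

def bernstein_deviation(delta, m, sigma2, bound):
    """Positive root eps of bernstein_tail(eps) = delta, obtained from the
    quadratic m*eps^2 - (2*bound*t/3)*eps - 2*sigma2*t = 0, t = log(1/delta)."""
    t = math.log(1.0 / delta)
    b = bound * t / 3.0
    return (b + math.sqrt(b * b + 2.0 * m * sigma2 * t)) / m
```

Since the square root of a sum is at most the sum of the square roots, the returned deviation is bounded by 2 * bound * t / (3 m) + sqrt(2 * sigma2 * t / m), which is the familiar two-term form of such sample error bounds.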
The sample error term is more difficult to handle, since it involves the function , which varies with the sample. We therefore need a concentration inequality for a set of functions, as in . By setting , the inequality becomes as follows.
Lemma 11. Let be a set of measurable functions on , and is constant such that each function satisfies and . If for some and , then there exists a constant depending only on such that, for any , with probability at least , there holds where .
The result will be used to estimate . We apply this lemma to the function set and have the following proposition.
Proposition 12. Let be defined as above with some satisfying , whose expression will be given in the next section. Assume (3) and (7) hold. Then, we have for some constant depending only on with confidence .
Proof. From definition, we know that
where is an element of .
In the following, we verify the conditions for in Lemma 11. For any function , it holds On the other hand, for any depending, respectively, on , This means and
Now, we can see from Lemma 11 that, with confidence , there holds This proves the proposition.
6. Total Error
Combining the regularization and sample error bounds, we can prove the main result as follows.
Proof of Theorem 4. Replacing the regularization error and sample error in the error decomposition formula with the bounds obtained in the previous two sections, we have Note that and the radius is always larger than ; the bound becomes where From , we have the bound for the radius: , and the above inequality is now To balance the two terms, we choose and the result is proved with constant
Remark 13. In , the authors also use the -empirical covering number and derive an optimal rate . Compared with their classical rate for squared K-norm regularization, our result can also achieve the best one , while tends to . Though when , that is, , our rate is worse than , we get a better rate than when . Moreover, by the iteration technique , we can expect that the radius for is close to the upper bound of , which leads to a sharper learning rate . This is always better than for any .
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by NSF of China (Grant no. 11326096), Foundation for Distinguished Young Talents in Higher Education of Guangdong, China (no. 2013LYM 0089), Doctor Grants of Huizhou University (Grant no. C511.0206), Major Project of Chinese National Statistics Bureau (no. 2013LZ52), NSF of Guangdong Province in China (no. S2013010014601), “12.5” Planning Project of Common Construction Subject for Philosophical and Social Sciences in Guangdong (no. GD12XYJ18), Project of Science and Technology Innovation in Guangdong Education Department (no. 2013KJCX0175), and Planning Fund Project of Humanities and Social Science Research in Chinese Ministry of Education (no. 14YJAZH040).
References
[1] F. Cucker and S. Smale, “On the mathematical foundations of learning,” Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1–49, 2002.
[2] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[3] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
[4] Q. Wu, Y. Ying, and D. X. Zhou, “Learning rates of least-square regularized regression,” Foundations of Computational Mathematics, vol. 6, no. 2, pp. 171–192, 2006.
[5] N. Aronszajn, “Theory of reproducing kernels,” Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.
[6] C. Wang and D. Zhou, “Optimal learning rates for least squares regularized regression with unbounded sampling,” Journal of Complexity, vol. 27, no. 1, pp. 55–67, 2011.
[7] Q. W. Xiao and D. X. Zhou, “Learning by nonsymmetric kernels with data dependent spaces and l1-regularizer,” Taiwanese Journal of Mathematics, vol. 14, pp. 1821–1836, 2010.
[8] Y. Feng, “Least-squares regularized regression with dependent samples and q-penalty,” Applicable Analysis, vol. 91, no. 5, pp. 979–991, 2012.
[9] D. X. Zhou, “The covering number in learning theory,” Journal of Complexity, vol. 18, no. 3, pp. 739–767, 2002.
[10] D. X. Zhou, “Capacity of reproducing kernel spaces in learning theory,” IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1743–1752, 2003.
[11] Z. C. Guo and D. X. Zhou, “Concentration estimates for learning with unbounded sampling,” Advances in Computational Mathematics, vol. 38, no. 1, pp. 207–223, 2013.
[12] L. Shi, “Learning theory estimates for coefficient-based regularized regression,” Applied and Computational Harmonic Analysis, vol. 34, no. 2, pp. 252–265, 2013.
[13] S. G. Lv, D. M. Shi, Q. Xiao, and M. S. Zhang, “Sharp learning rates of coefficient-based lq-regularized regression with indefinite kernels,” Science China Mathematics, vol. 56, no. 8, pp. 1557–1574, 2013.
[14] C. Wang and J. Cai, “Convergence analysis of coefficient-based regularization under moment incremental condition,” International Journal of Wavelets, Multiresolution and Information Processing, vol. 12, no. 1, Article ID 1450008, 19 pages, 2014.
[15] S. Smale and D. X. Zhou, “Learning theory estimates via integral operators and their approximations,” Constructive Approximation, vol. 26, no. 2, pp. 153–172, 2007.
[16] Q. Wu and D.-X. Zhou, “Learning with sample dependent hypothesis spaces,” Computers & Mathematics with Applications, vol. 56, no. 11, pp. 2896–2907, 2008.
[17] L. Shi, Y. Feng, and D. Zhou, “Concentration estimates for learning with l1-regularizer and data dependent hypothesis spaces,” Applied and Computational Harmonic Analysis, vol. 31, no. 2, pp. 286–302, 2011.
[18] S. Smale and D. Zhou, “Estimating the approximation error in learning theory,” Analysis and Applications, vol. 1, no. 1, pp. 17–41, 2003.
[19] I. Steinwart, D. Hush, and C. Scovel, “Optimal rates for regularized least squares regression,” in Proceedings of the 22nd Annual Conference on Learning Theory, S. Dasgupta and A. Klivans, Eds., pp. 79–93, 2009.
[20] G. Bennett, “Probability inequalities for the sum of independent random variables,” Journal of the American Statistical Association, vol. 57, pp. 33–45, 1962.
[21] Q. Wu, Y. Ying, and D. Zhou, “Multi-kernel regularized classifiers,” Journal of Complexity, vol. 23, no. 1, pp. 108–134, 2007.