Research Article | Open Access

Xiaorong Chu, Hongwei Sun, "Regularized Least Square Regression with Unbounded and Dependent Sampling", *Abstract and Applied Analysis*, vol. 2013, Article ID 139318, 7 pages, 2013. https://doi.org/10.1155/2013/139318

# Regularized Least Square Regression with Unbounded and Dependent Sampling

**Academic Editor:** Changbum Chun

#### Abstract

This paper focuses on the least square regression problem for $\phi$-mixing and $\alpha$-mixing processes. The standard boundedness assumption on the output data is abandoned: the learning algorithm is implemented with samples drawn from a dependent sampling process under a more general condition on the output data. Capacity-independent error bounds and learning rates are deduced by means of the integral operator technique.

#### 1. Introduction and Main Results

The aim of this paper is to study the least square regularized regression learning algorithm. The main novelty here is the unboundedness and dependence of the sampling process. Let $X$ be a compact metric space (usually a subset of $\mathbb{R}^n$) and $Y = \mathbb{R}$. Suppose that $\rho$ is a probability distribution defined on $Z = X \times Y$. In regression learning, one wants to learn or approximate the regression function $f_\rho$ given by
$$ f_\rho(x) = \int_Y y \, d\rho(y \mid x), \quad x \in X, \tag{1} $$
where $\rho(\cdot \mid x)$ is the conditional distribution of $y$ for given $x$. Since $\rho$ is in fact unknown, $f_\rho$ is not directly computable. Instead we learn a good approximation of $f_\rho$ from a set of observations $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \in Z^m$ drawn according to $\rho$.

The learning algorithm studied here is based on a Mercer kernel $K : X \times X \to \mathbb{R}$, which is a continuous, symmetric, and positive semidefinite function. The RKHS $\mathcal{H}_K$ associated with the Mercer kernel $K$ is the completion of $\operatorname{span}\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_t \rangle_K = K(x, t)$. The learning algorithm is a regularization scheme in $\mathcal{H}_K$ given by
$$ f_{\mathbf{z},\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m \left( f(x_i) - y_i \right)^2 + \lambda \|f\|_K^2 \right\}, \tag{2} $$
where $\lambda > 0$ is a regularization parameter.
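As a concrete numerical illustration of this regularization scheme (a minimal sketch with an illustrative Gaussian kernel and synthetic data, not part of the paper's analysis): by the representer theorem, the minimizer of the regularized empirical risk has the form $f_{\mathbf{z},\lambda} = \sum_{i=1}^m c_i K_{x_i}$ with $\mathbf{c} = (\mathbf{K} + m\lambda I)^{-1}\mathbf{y}$, where $\mathbf{K}_{ij} = K(x_i, x_j)$.

```python
import numpy as np

def gaussian_kernel(s, t, sigma=0.2):
    # Mercer kernel K(s, t) = exp(-(s - t)^2 / (2 sigma^2)); returns the matrix K(s_i, t_j).
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2 * sigma ** 2))

def regularized_least_squares(x, y, lam):
    # Minimize (1/m) * sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^2 over f in H_K.
    # Representer theorem: f = sum_i c_i K_{x_i} with c = (K + m*lam*I)^{-1} y.
    m = len(x)
    K = gaussian_kernel(x, x)
    c = np.linalg.solve(K + m * lam * np.eye(m), y)
    return lambda t: gaussian_kernel(np.atleast_1d(t), x) @ c

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)  # noisy regression data
f = regularized_least_squares(x, y, lam=1e-3)
print(np.mean(np.abs(f(x) - np.sin(2 * np.pi * x))))  # average distance to the target
```

Larger $\lambda$ smooths more and increases the approximation error; smaller $\lambda$ tracks the noise and increases the sample error.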

Error analysis for learning algorithm (2) has been studied extensively [1–4], with the focus on independent samples. In recent years, several studies have relaxed the independence restriction and turned to learning with dependent sampling [5–8]. In [8] the learning performance of regularized least square regression was studied with mixing sequences, and the result for this setting was refined by an operator monotone inequality in [7].

For a stationary real-valued sequence $\{z_t\}_{t \ge 1}$, the $\sigma$-algebra generated by the random variables $z_t$, $a \le t \le b$, is denoted by $\mathcal{M}_a^b$. The uniformly mixing condition (or $\phi$-mixing condition) and the strongly mixing condition (or $\alpha$-mixing condition) are defined as follows.

*Definition 1 ($\phi$-mixing).* The $k$th $\phi$-mixing coefficient for the sequence $\{z_t\}$ is defined as
$$ \phi(k) = \sup_{j \ge 1} \sup_{B \in \mathcal{M}_1^j,\; \mathbb{P}(B) > 0,\; A \in \mathcal{M}_{j+k}^{\infty}} \left| \mathbb{P}(A \mid B) - \mathbb{P}(A) \right|. $$
The process is said to satisfy a uniformly mixing condition (or $\phi$-mixing condition) if $\phi(k) \to 0$ as $k \to \infty$.

*Definition 2 ($\alpha$-mixing).* The $k$th $\alpha$-mixing coefficient for the random sequence $\{z_t\}$ is defined as
$$ \alpha(k) = \sup_{j \ge 1} \sup_{B \in \mathcal{M}_1^j,\; A \in \mathcal{M}_{j+k}^{\infty}} \left| \mathbb{P}(A \cap B) - \mathbb{P}(A)\,\mathbb{P}(B) \right|. $$
The random process is said to satisfy a strongly mixing condition (or $\alpha$-mixing condition) if $\alpha(k) \to 0$ as $k \to \infty$.

By the fact $\alpha(k) \le \phi(k)$, the $\alpha$-mixing condition is weaker than the $\phi$-mixing condition. Many random processes satisfy the strongly mixing condition, for example, the stationary Markov process which is uniformly pure nondeterministic, the stationary Gaussian sequence with a continuous spectral density that is bounded away from 0, certain ARMA processes, and some aperiodic, Harris-recurrent Markov processes; see [5, 9] and the references therein.
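For a concrete feel for such dependent sampling (a hypothetical simulation; the coefficient $a = 0.5$ is chosen purely for illustration), a stationary Gaussian AR(1) sequence is strongly mixing with geometrically decaying coefficients:

```python
import numpy as np

def ar1_sample(m, a=0.5, seed=0):
    # Stationary Gaussian AR(1): x_t = a * x_{t-1} + e_t with |a| < 1 and
    # e_t i.i.d. N(0, 1); such chains are alpha-mixing with geometric decay.
    rng = np.random.default_rng(seed)
    x = np.empty(m)
    x[0] = rng.standard_normal() / np.sqrt(1.0 - a * a)  # draw from the stationary law
    for t in range(1, m):
        x[t] = a * x[t - 1] + rng.standard_normal()
    return x

x = ar1_sample(5000)
# Lag-k dependence decays like a^k: the lag-1 autocorrelation is close to a = 0.5.
acf1 = np.corrcoef(x[:-1], x[1:])[0, 1]
print(round(acf1, 2))
```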

In this paper we follow [7, 8] to consider $\phi$-mixing and $\alpha$-mixing processes, estimate the error bounds, and derive the learning rates of algorithm (2), where the output data satisfy the following unbounded condition.

*Unbounded Hypothesis*. There exist two constants $p \ge 1$ and $M > 0$ such that
$$ \mathbb{E}\left( |y|^{2p} \right) = \int_Z |y|^{2p} \, d\rho \le M^{2p}. \tag{5} $$

The error analysis for algorithm (2) was usually presented under the standard assumption that $|y| \le M$ almost surely for some constant $M > 0$. This standard assumption was abandoned in [10–14]. In [10] the authors introduced the condition
$$ \int_Y \left( e^{\frac{|y - f_{\mathcal{H}}(x)|}{M}} - \frac{|y - f_{\mathcal{H}}(x)|}{M} - 1 \right) d\rho(y \mid x) \le \frac{\Sigma^2}{2M^2} \tag{6} $$
for almost every $x \in X$ and some constants $M, \Sigma > 0$, where $f_{\mathcal{H}}$ is the orthogonal projection of $f_\rho$ onto the closure of $\mathcal{H}_K$ in $L^2_{\rho_X}$. In [11–13] the error analysis was conducted in another setting satisfying the following moment hypothesis: there exist constants $M > 0$ and $C > 0$ such that $\int_Y |y|^{\ell} \, d\rho(y \mid x) \le C \ell!\, M^{\ell}$ for all $\ell \in \mathbb{N}$. Notice that, with different constants, the moment hypothesis and (6) are equivalent [13]. Obviously, our unbounded hypothesis is a natural generalization of the moment hypothesis, since it requires only a single moment of finite order. An example for which the unbounded hypothesis (5) is satisfied but the moment hypothesis fails has been given in [15], which mainly studies half supervised coefficient regularization with indefinite kernels and unbounded sampling.
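For intuition, here is a hypothetical instance (not the example constructed in [15]) of an output distribution that has finite moments of fixed order while violating the all-orders moment hypothesis. Take $y \ge 1$ Pareto distributed with density $p(y) = \alpha y^{-\alpha - 1}$ for some tail index $\alpha > 0$:

```latex
\mathbb{E}|y|^{\ell}
= \int_1^{\infty} y^{\ell}\, \alpha y^{-\alpha - 1}\, dy
= \frac{\alpha}{\alpha - \ell} \quad (\ell < \alpha),
\qquad
\mathbb{E}|y|^{\ell} = \infty \quad (\ell \ge \alpha).
```

With $\alpha = 5$, all moments up to order $4$ are finite, so a condition of the form $\mathbb{E}|y|^{2p} \le M^{2p}$ holds for $2p \le 4$, while the moment hypothesis $\mathbb{E}|y|^{\ell} \le C\,\ell!\,M^{\ell}$ for all $\ell \in \mathbb{N}$ cannot hold.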

Since $\|f - f_\rho\|_{L^2_{\rho_X}}^2 = \mathcal{E}(f) - \mathcal{E}(f_\rho)$, where the generalization error is defined as $\mathcal{E}(f) = \int_Z (f(x) - y)^2 \, d\rho$, the goodness of the approximation of $f_\rho$ by $f_{\mathbf{z},\lambda}$ is usually measured by the excess generalization error $\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho)$. Denoting $\kappa = \sup_{x \in X} \sqrt{K(x, x)}$, the reproducing property $f(x) = \langle f, K_x \rangle_K$ in the RKHS yields $\|f\|_\infty \le \kappa \|f\|_K$ for any $f \in \mathcal{H}_K$. Thus, the distance between $f_{\mathbf{z},\lambda}$ and $f_\rho$ in $\mathcal{H}_K$ can be applied to measure this approximation as well when $f_\rho \in \mathcal{H}_K$.
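The inequality $\|f\|_\infty \le \kappa \|f\|_K$ follows from $|f(x)| = |\langle f, K_x \rangle_K| \le \|K_x\|_K \|f\|_K = \sqrt{K(x,x)}\, \|f\|_K$, and can be checked numerically on a random function in the span of kernel sections (an illustrative sketch; for the Gaussian kernel below, $\kappa = 1$):

```python
import numpy as np

def gauss(s, t, sigma=0.3):
    # Gaussian Mercer kernel; K(x, x) = 1, hence kappa = sup_x sqrt(K(x, x)) = 1.
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, 30)
c = rng.standard_normal(30)
# f = sum_i c_i K_{x_i} lies in span{K_x} with ||f||_K^2 = c^T K c.
norm_K = np.sqrt(c @ gauss(centers, centers) @ c)
grid = np.linspace(0.0, 1.0, 2001)
sup_f = np.max(np.abs(gauss(grid, centers) @ c))  # sup-norm on a fine grid
print(sup_f <= norm_K)  # reproducing property gives |f(x)| <= kappa * ||f||_K
```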

The noise-free limit of algorithm (2) takes the form
$$ f_\lambda = \arg\min_{f \in \mathcal{H}_K} \left\{ \left\| f - f_\rho \right\|_{L^2_{\rho_X}}^2 + \lambda \|f\|_K^2 \right\}; $$
thus the error analysis can be divided into two parts: the difference between $f_{\mathbf{z},\lambda}$ and $f_\lambda$ is called the sample error, and the distance between $f_\lambda$ and $f_\rho$ is called the approximation error. We will bound the error in $\mathcal{H}_K$ and in $L^2_{\rho_X}$, respectively. Estimating the sample error is more difficult because $f_{\mathbf{z},\lambda}$ changes with the sample and cannot be treated as a fixed function. The approximation error does not depend on the samples and has been studied in the literature [2, 3, 7, 16, 17].

We mainly devote the next two sections to estimating the sample error with more general sampling processes. Our main results can be stated as follows.

Theorem 3. *Suppose that the unbounded hypothesis holds, $f_\rho = L_K^r g_\rho$ with $g_\rho \in L^2_{\rho_X}$ for some $0 < r \le 1$, and the $\phi$-mixing coefficients satisfy a polynomial decay, that is, $\phi(k) \le a k^{-b}$ for some $a > 0$ and $b > 0$. Then, for any $0 < \delta < 1$, one has with confidence $1 - \delta$,*
$$ \left\| f_{\mathbf{z},\lambda(m)} - f_\rho \right\|_{L^2_{\rho_X}} \le \widetilde{C} \log\frac{2}{\delta}\, m^{-\theta}, $$
*where the exponent $\theta = \theta(r, b) > 0$ and the corresponding choice of $\lambda(m)$ are determined by the case analysis in Section 4. Moreover, when $\frac{1}{2} \le r \le 1$, one has with confidence $1 - \delta$,*
$$ \left\| f_{\mathbf{z},\lambda(m)} - f_\rho \right\|_K \le \widetilde{C} \log\frac{2}{\delta}\, m^{-\theta'}, $$
*where the exponent $\theta' = \theta'(r, b) > 0$ is likewise determined in Section 4.*

Theorem 3 establishes the asymptotic convergence of algorithm (2) when the samples satisfy a uniformly mixing condition. Our second main result considers this algorithm for the $\alpha$-mixing process.

Theorem 4. *Suppose that the unbounded hypothesis with $p \ge 2$ holds, $f_\rho = L_K^r g_\rho$ with $g_\rho \in L^2_{\rho_X}$ for some $0 < r \le 1$, and the $\alpha$-mixing coefficients satisfy a polynomial decay, that is, $\alpha(k) \le a k^{-b}$ for some $a > 0$ and $b > 0$. Then, for any $0 < \delta < 1$, one has with confidence $1 - \delta$,*
$$ \left\| f_{\mathbf{z},\lambda(m)} - f_\rho \right\|_{L^2_{\rho_X}} \le \widetilde{C} \log\frac{2}{\delta}\, m^{-\vartheta}, $$
*where the exponent $\vartheta = \vartheta(r, b, p) > 0$ and the corresponding choice of $\lambda(m)$ are determined by the case analysis in Section 4. Moreover, when $\frac{1}{2} \le r \le 1$, with confidence $1 - \delta$,*
$$ \left\| f_{\mathbf{z},\lambda(m)} - f_\rho \right\|_K \le \widetilde{C} \log\frac{2}{\delta}\, m^{-\vartheta'}, $$
*where the exponent $\vartheta' = \vartheta'(r, b, p) > 0$ is likewise determined in Section 4.*

The proof of these two theorems will be given in Sections 2, 3, and 4; notice that the log term can be dropped except in the critical case of the decay exponent. Our error analysis reveals some interesting phenomena for learning with unbounded and dependent sampling.

(i) A smoother target function (i.e., larger $r$) implies better learning rates. Stronger dependence between samples (i.e., smaller $b$) implies that they contain less information and hence leads to worse rates.

(ii) The learning rates improve as the dependence between samples becomes weaker and $b$ becomes larger, but they are no longer improved after some constant threshold. This phenomenon is called the saturation effect, which was discussed in [18–20]. In our setting, the saturation effects include saturation for the smoothness of the target function, mainly relative to the approximation error, and saturation for the dependence between samples. An interesting phenomenon revealed here is that when the $\alpha$-mixing coefficients satisfy $\alpha(k) \le a k^{-b}$, the saturation for the dependence between samples occurs at $b = \frac{p}{p-1}$, which depends on the unbounded condition parameter $p$.

(iii) For the $\phi$-mixing process, the learning rates have nothing to do with the unbounded condition parameter $p$, since the sample error is controlled by the second moment $\mathbb{E}(y^2)$, which is bounded by $M^2$. But for the $\alpha$-mixing process, to derive the learning rate we have to estimate moments of order $2p$ with $p \ge 2$.

(iv) Under the $\alpha$-mixing condition, when $b$ and $p$ are large, the influence of the unbounded condition becomes weak. Recall that the learning rate derived in [8] holds for uniformly bounded outputs under the same mixing conditions. It implies that when $p$ is large enough, our learning rate for unbounded samples is as sharp as that for uniformly bounded sampling.

#### 2. Sampling Satisfying $\phi$-Mixing Condition

In this section, we apply the integral operator technique in [7] to handle the sample error under the $\phi$-mixing condition. Different from the uniformly bounded case, the learning performance of the unbounded sampling is not measured directly. Instead, the expectations are estimated first, and then the bound for the sample error is deduced by the Markov inequality:
$$ \mathbb{P}\left\{ \left\| f_{\mathbf{z},\lambda} - f_\lambda \right\|_K \ge \frac{1}{\delta}\, \mathbb{E}\left\| f_{\mathbf{z},\lambda} - f_\lambda \right\|_K \right\} \le \delta, \qquad 0 < \delta < 1. $$
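The Markov inequality $\mathbb{P}\{X \ge t\} \le \mathbb{E}X / t$ for a nonnegative random variable $X$ is the only probabilistic tool needed to pass from expectation bounds to confidence statements; a quick Monte Carlo sanity check (illustrative, with an exponential variable):

```python
import numpy as np

# Markov: P(X >= t) <= E[X] / t for X >= 0.  Check empirically with X ~ Exp(1),
# where E[X] = 1 and the exact tail is P(X >= t) = exp(-t).
rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=100_000)
for t in (1.0, 2.0, 5.0):
    print(t, np.mean(x >= t), 1.0 / t)  # empirical tail vs. Markov bound
```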

To this end, define the sampling operator $S_{\mathbf{x}} : \mathcal{H}_K \to \mathbb{R}^m$ as $S_{\mathbf{x}} f = (f(x_1), \ldots, f(x_m))^T$, where $\mathbf{x} = \{x_i\}_{i=1}^m$ is the set of input data. Then its adjoint (with respect to the normalized inner product on $\mathbb{R}^m$) is $S_{\mathbf{x}}^* \mathbf{c} = \frac{1}{m} \sum_{i=1}^m c_i K_{x_i}$ for $\mathbf{c} \in \mathbb{R}^m$. The analytic expression of the optimization solution was given in [3]:
$$ f_{\mathbf{z},\lambda} = \left( S_{\mathbf{x}}^* S_{\mathbf{x}} + \lambda I \right)^{-1} S_{\mathbf{x}}^* \mathbf{y}, \qquad f_\lambda = \left( L_K + \lambda I \right)^{-1} L_K f_\rho, $$
where $L_K$ is the integral operator defined as
$$ L_K f(x) = \int_X K(x, t) f(t) \, d\rho_X(t), \quad f \in L^2_{\rho_X}. $$
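The spectrum of $L_K$ is approximated by that of the empirical matrix $\frac{1}{m} \left( K(x_i, x_j) \right)_{ij}$ built from $x_i \sim \rho_X$, the finite-sample shadow of $S_{\mathbf{x}}^* S_{\mathbf{x}}$. A sketch (kernel and sample size are illustrative choices) with $K(s,t) = \min(s,t)$ on $[0,1]$ and uniform $\rho_X$, whose eigenvalues are known in closed form:

```python
import numpy as np

# Empirical approximation of the integral operator L_K: the eigenvalues of the
# m x m matrix (1/m) K(x_i, x_j), with x_i drawn from rho_X, converge to the
# eigenvalues of L_K.  For K(s, t) = min(s, t) on [0, 1] with uniform rho_X
# these are 4 / ((2k - 1)^2 pi^2), k = 1, 2, 3, ...
rng = np.random.default_rng(0)
m = 1000
x = rng.uniform(0.0, 1.0, m)
K = np.minimum.outer(x, x)
emp = np.sort(np.linalg.eigvalsh(K / m))[::-1][:3]   # top empirical eigenvalues
true = 4.0 / ((2 * np.arange(1, 4) - 1) ** 2 * np.pi ** 2)
print(np.round(emp, 3), np.round(true, 3))
```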

For a random variable $\xi$ with values in a Hilbert space and $1 \le p \le \infty$, denote the $p$th moment as $\|\xi\|_p = \left( \mathbb{E}\|\xi\|^p \right)^{1/p}$ if $p < \infty$ and $\|\xi\|_\infty = \operatorname{ess\,sup} \|\xi\|$. Lemma 5 is due to Billingsley [21].

Lemma 5. *Let $\xi$ and $\eta$ be random variables with values in a separable Hilbert space, measurable with respect to the $\sigma$-fields $\mathcal{M}_1^j$ and $\mathcal{M}_{j+k}^{\infty}$, respectively, and having finite $p$th and $q$th moments, where $p, q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$. Then*
$$ \left| \mathbb{E}\langle \xi, \eta \rangle - \langle \mathbb{E}\xi, \mathbb{E}\eta \rangle \right| \le 2\, \phi(k)^{1/p}\, \|\xi\|_p \|\eta\|_q. $$

Lemma 6. *For a $\phi$-mixing sequence $\{(x_i, y_i)\}$, one has*
$$ \mathbb{E}\left\| S_{\mathbf{x}}^* \mathbf{y} - \mathbb{E}\left( S_{\mathbf{x}}^* \mathbf{y} \right) \right\|_K^2 \le \frac{\kappa^2\, \mathbb{E}\left( y^2 \right)}{m} \left( 1 + 4 \sum_{k=1}^{m-1} \sqrt{\phi(k)} \right). $$

*Proof.* With the definition of the sampling operator, we have
$$ S_{\mathbf{x}}^* \mathbf{y} = \frac{1}{m} \sum_{i=1}^m y_i K_{x_i}. $$
Letting $\xi_i = y_i K_{x_i}$, then $\xi_i$ is an $\mathcal{H}_K$-valued random variable defined on $Z$. Note that $\|\xi_i\|_K \le \kappa |y_i|$ and $\mathbb{E}\|\xi_i\|_K^2 \le \kappa^2\, \mathbb{E}(y^2)$. We have
$$ \mathbb{E}\left\| S_{\mathbf{x}}^* \mathbf{y} - \mathbb{E}\left( S_{\mathbf{x}}^* \mathbf{y} \right) \right\|_K^2 = \frac{1}{m^2} \sum_{i,j=1}^m \left( \mathbb{E}\langle \xi_i, \xi_j \rangle_K - \langle \mathbb{E}\xi_i, \mathbb{E}\xi_j \rangle_K \right). \tag{22} $$

By Lemma 5 with $p = q = 2$, for $i \ne j$,
$$ \left| \mathbb{E}\langle \xi_i, \xi_j \rangle_K - \langle \mathbb{E}\xi_i, \mathbb{E}\xi_j \rangle_K \right| \le 2 \sqrt{\phi(|i - j|)}\, \mathbb{E}\|\xi_1\|_K^2. \tag{23} $$
Thus the desired estimate can be obtained by plugging (23) into (22).

Proposition 7. *Suppose that the unbounded hypothesis holds with some $p \ge 1$ and that the sample sequence satisfies a $\phi$-mixing condition, and set $\Phi_m = 1 + 4 \sum_{k=1}^{m-1} \sqrt{\phi(k)}$. Then one has*
$$ \mathbb{E}\left\| f_{\mathbf{z},\lambda} - f_\lambda \right\|_K \le \frac{C_1 \sqrt{\Phi_m}}{\sqrt{m}\,\lambda}, $$
*where $C_1$ is a constant only dependent on $\kappa$, $M$, and $\|f_\rho\|_{L^2_{\rho_X}}$.*

*Proof.* By [7, Theorem 3.1], we have
$$ \left\| f_{\mathbf{z},\lambda} - f_\lambda \right\|_K \le \frac{1}{\lambda} \left\| \frac{1}{m} \sum_{i=1}^m \xi_i - \mathbb{E}\xi \right\|_K, \tag{25} $$
where $\xi = (y - f_\lambda(x)) K_x$ is a random variable with values in $\mathcal{H}_K$, and $\xi_i = (y_i - f_\lambda(x_i)) K_{x_i}$. A computation similar to that in Lemma 6 leads to
$$ \mathbb{E}\left\| \frac{1}{m} \sum_{i=1}^m \xi_i - \mathbb{E}\xi \right\|_K^2 \le \frac{\Phi_m}{m}\, \mathbb{E}\|\xi\|_K^2 \le \frac{\kappa^2 \Phi_m}{m}\, \mathbb{E}\left( y - f_\lambda(x) \right)^2. \tag{26} $$

It suffices to estimate $\mathbb{E}(y - f_\lambda(x))^2$. By the Hölder inequality, there is
$$ \mathbb{E}\left( y^2 \right) \le \left( \mathbb{E}|y|^{2p} \right)^{1/p} \le M^2. $$
Thus $\mathbb{E}(y - f_\rho(x))^2 \le \mathbb{E}(y^2) \le M^2$ and $\|f_\rho - f_\lambda\|_{L^2_{\rho_X}} \le \|f_\rho\|_{L^2_{\rho_X}}$, which implies
$$ \mathbb{E}\left( y - f_\lambda(x) \right)^2 = \mathbb{E}\left( y - f_\rho(x) \right)^2 + \left\| f_\rho - f_\lambda \right\|_{L^2_{\rho_X}}^2 \le M^2 + \left\| f_\rho \right\|_{L^2_{\rho_X}}^2. \tag{29} $$

Plugging (29) into (26), there holds
$$ \mathbb{E}\left\| \frac{1}{m} \sum_{i=1}^m \xi_i - \mathbb{E}\xi \right\|_K^2 \le \frac{\kappa^2 \Phi_m}{m} \left( M^2 + \left\| f_\rho \right\|_{L^2_{\rho_X}}^2 \right). \tag{30} $$

Combining (25) and (30) and taking the constant $C_1 = \kappa \left( M^2 + \|f_\rho\|_{L^2_{\rho_X}}^2 \right)^{1/2}$, we complete the proof.

The following proposition provides the bound on the difference between $f_{\mathbf{z},\lambda}$ and $f_\lambda$ in $L^2_{\rho_X}$ for the $\phi$-mixing process.

Proposition 8. *Under the assumption of Proposition 7, there holds*
$$ \mathbb{E}\left\| f_{\mathbf{z},\lambda} - f_\lambda \right\|_{L^2_{\rho_X}} \le \frac{C_1 \sqrt{\Phi_m}}{\sqrt{m}\,\sqrt{\lambda}}. $$

*Proof.* The representations of $f_{\mathbf{z},\lambda}$ and $f_\lambda$ imply that
$$ \left\| f_{\mathbf{z},\lambda} - f_\lambda \right\|_{L^2_{\rho_X}} \le \frac{1}{\sqrt{\lambda}} \left\| \frac{1}{m} \sum_{i=1}^m \xi_i - \mathbb{E}\xi \right\|_K. \tag{32} $$
Then the desired bound follows from (30) and (32).

#### 3. Samples Satisfying $\alpha$-Mixing Condition

Now we turn to bounding the sample error when the sampling process satisfies the strongly mixing condition and the unbounded hypothesis holds. In Section 2, the key point was to estimate the second moment $\mathbb{E}\|\xi\|_K^2$ in the absence of uniform boundedness. For sampling satisfying the $\alpha$-mixing condition, we have to deal with the $2p$th moment $\mathbb{E}\|\xi\|_K^{2p}$ for some $p \ge 2$.

Proposition 9. *Suppose that the unbounded hypothesis holds with some $p \ge 2$ and that the sample sequence satisfies an $\alpha$-mixing condition, and set $A_m = 1 + 8 \sum_{k=1}^{m-1} \alpha(k)^{1 - 1/p}$. Then one gets*
$$ \mathbb{E}\left\| f_{\mathbf{z},\lambda} - f_\lambda \right\|_K \le \frac{C_2 \sqrt{A_m}}{\sqrt{m}\,\lambda} \left( 1 + \frac{1}{\sqrt{\lambda}} \right), $$
*where $C_2$ is a constant only depending on $\kappa$, $M$, and $\|f_\rho\|_{L^2_{\rho_X}}$.*

*Proof.* For the strongly mixing process, by [8, Lemma 5.1],
$$ \mathbb{E}\left\| \frac{1}{m} \sum_{i=1}^m \xi_i - \mathbb{E}\xi \right\|_K^2 \le \frac{A_m}{m} \left( \mathbb{E}\|\xi\|_K^{2p} \right)^{1/p}. \tag{34} $$
Taking $\xi = (y - f_\lambda(x)) K_x$ in [8, Lemma 4.2] and combining (34) with (25), we have
$$ \mathbb{E}\left\| f_{\mathbf{z},\lambda} - f_\lambda \right\|_K \le \frac{1}{\lambda} \sqrt{\frac{A_m}{m}} \left( \mathbb{E}\|\xi\|_K^{2p} \right)^{1/(2p)}. \tag{35} $$

The estimation of $\mathbb{E}(y - f_\lambda(x))^2$ has been obtained in Section 2, and now we mainly devote ourselves to estimating $\left( \mathbb{E}\|\xi\|_K^{2p} \right)^{1/(2p)}$. To get this estimation, the bound of $\|f_\lambda\|_K$ is needed, which can be stated as follows ([3, Lemma 3] or [8, Lemma 4.3]):
$$ \left\| f_\lambda \right\|_K \le \sqrt{\frac{D(\lambda)}{\lambda}}, $$
where $D(\lambda) = \|f_\lambda - f_\rho\|_{L^2_{\rho_X}}^2 + \lambda \|f_\lambda\|_K^2 \le \|f_\rho\|_{L^2_{\rho_X}}^2$. Observe that $\|f_\lambda\|_\infty \le \kappa \|f_\lambda\|_K$. Hence, by the Minkowski inequality,
$$ \left( \mathbb{E}\left| y - f_\lambda(x) \right|^{2p} \right)^{1/(2p)} \le \left( \mathbb{E}|y|^{2p} \right)^{1/(2p)} + \left\| f_\lambda \right\|_\infty \le M + \frac{\kappa \left\| f_\rho \right\|_{L^2_{\rho_X}}}{\sqrt{\lambda}}. $$
Now we can deduce that
$$ \left( \mathbb{E}\|\xi\|_K^{2p} \right)^{1/(2p)} \le \kappa \left( \mathbb{E}\left| y - f_\lambda(x) \right|^{2p} \right)^{1/(2p)} \le \kappa \left( M + \frac{\kappa \left\| f_\rho \right\|_{L^2_{\rho_X}}}{\sqrt{\lambda}} \right). $$
Plugging this estimate into (35) yields
$$ \mathbb{E}\left\| f_{\mathbf{z},\lambda} - f_\lambda \right\|_K \le \frac{C_2 \sqrt{A_m}}{\sqrt{m}\,\lambda} \left( 1 + \frac{1}{\sqrt{\lambda}} \right), \tag{39} $$
where $C_2 = \kappa \max\left\{ M, \kappa \|f_\rho\|_{L^2_{\rho_X}} \right\}$ is a constant only depending on $\kappa$, $M$, and $\|f_\rho\|_{L^2_{\rho_X}}$. This completes the proof.

For the $\alpha$-mixing process we have the following proposition bounding the sample error in $L^2_{\rho_X}$; the proof follows directly from inequality (32).

Proposition 10. *Under the assumption of Proposition 9, one has*
$$ \mathbb{E}\left\| f_{\mathbf{z},\lambda} - f_\lambda \right\|_{L^2_{\rho_X}} \le \frac{C_2 \sqrt{A_m}}{\sqrt{m}\,\sqrt{\lambda}} \left( 1 + \frac{1}{\sqrt{\lambda}} \right), $$
*where $C_2$ is the constant in Proposition 9.*

#### 4. Error Bounds and Learning Rates

In this section we derive the learning rates, that is, the convergence rates of $\|f_{\mathbf{z},\lambda} - f_\rho\|_{L^2_{\rho_X}}$ and $\|f_{\mathbf{z},\lambda} - f_\rho\|_K$ as $m \to \infty$, by choosing the regularization parameter $\lambda$ according to $m$. The following approximation error bound is needed to get the convergence rates.

Proposition 11. *Suppose that $f_\rho = L_K^r g_\rho$ with $g_\rho \in L^2_{\rho_X}$ for some $0 < r \le 1$; then there holds*
$$ \left\| f_\lambda - f_\rho \right\|_{L^2_{\rho_X}} \le \lambda^{r} \left\| g_\rho \right\|_{L^2_{\rho_X}}. $$
*Moreover, when $\frac{1}{2} \le r \le 1$, that is, $f_\rho \in \mathcal{H}_K$, there holds*
$$ \left\| f_\lambda - f_\rho \right\|_K \le \lambda^{r - 1/2} \left\| g_\rho \right\|_{L^2_{\rho_X}}. $$

The first conclusion in Proposition 11 was proved in [20], and the second one can be proved in the same way. To derive the learning rates, we need to balance the approximation error and the sample error. For this purpose, the following simple fact on partial sums is necessary: for $s > 0$,
$$ \sum_{k=1}^{m-1} k^{-s} \le \begin{cases} \dfrac{s}{s-1}, & s > 1, \\[2pt] 1 + \log m, & s = 1, \\[2pt] \dfrac{m^{1-s}}{1-s}, & 0 < s < 1. \end{cases} \tag{43} $$
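The balancing step can be made explicit in schematic form (generic exponents $s$, $\beta$, and constants $c_1$, $c_2$, not the paper's exact quantities): if the sample error term is of order $m^{-s} \lambda^{-1}$ and the approximation error term is of order $\lambda^{\beta}$ (with $\beta = r$ or $\beta = r - \tfrac{1}{2}$), then

```latex
\frac{d}{d\lambda}\left( c_1 m^{-s} \lambda^{-1} + c_2 \lambda^{\beta} \right)
= -c_1 m^{-s} \lambda^{-2} + \beta c_2 \lambda^{\beta - 1} = 0
\;\Longrightarrow\;
\lambda^* = \left( \frac{c_1}{\beta c_2} \right)^{\frac{1}{\beta + 1}} m^{-\frac{s}{\beta + 1}},
\qquad
c_1 m^{-s} (\lambda^*)^{-1} + c_2 (\lambda^*)^{\beta}
= O\!\left( m^{-\frac{s\beta}{\beta + 1}} \right),
```

so the choice $\lambda = m^{-s/(\beta+1)}$ yields the rate $m^{-s\beta/(\beta+1)}$.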

*Proof of Theorem 3.* The estimate of the learning rates is divided into cases according to the decay exponent $b$ of the mixing coefficients. By (43) with $s = b/2$ (recall $\sqrt{\phi(k)} \le \sqrt{a}\, k^{-b/2}$), the quantity $\Phi_m = 1 + 4 \sum_{k=1}^{m-1} \sqrt{\phi(k)}$ is bounded by a constant when $b > 2$, by $O(\log m)$ when $b = 2$, and by $O(m^{1 - b/2})$ when $0 < b < 2$. Thus Proposition 7 yields an explicit bound for the expected sample error in each case. By Proposition 11 and the Markov inequality, with confidence $1 - \delta$, the total error $\|f_{\mathbf{z},\lambda} - f_\rho\|_K$ is bounded by the sum of the sample error bound (multiplied by $1/\delta$) and the approximation error bound. In each case, taking $\lambda = m^{-\gamma}$ with the exponent $\gamma > 0$ that balances the two bounds, the learning rate of Theorem 3 can be deduced; the critical case $b = 2$ produces the extra log term.

Next, for bounding the generalization error in the $L^2_{\rho_X}$ norm, Proposition 8 in connection with Proposition 11 tells us that, with confidence $1 - \delta$, an analogous decomposition holds. The rest of the proof is analogous to the estimate mentioned previously.

*Proof of Theorem 4.* By (43) with $s = b(1 - 1/p)$, the quantity $A_m = 1 + 8 \sum_{k=1}^{m-1} \alpha(k)^{1 - 1/p}$ is bounded by a constant when $b > \frac{p}{p-1}$, by $O(\log m)$ when $b = \frac{p}{p-1}$, and by $O\left( m^{1 - b(1 - 1/p)} \right)$ when $0 < b < \frac{p}{p-1}$. By Propositions 9 and 11 and the Markov inequality, with confidence $1 - \delta$, the total error is again bounded by the sum of the sample error bound (multiplied by $1/\delta$) and the approximation error bound, and taking $\lambda = m^{-\gamma}$ with the balancing exponent $\gamma > 0$ in each case gives the learning rates of Theorem 4.

The rest of the analysis is similar to the proof of Theorem 3; we omit it here.

#### Acknowledgment

This work was supported by the National Natural Science Foundation of China (no. 11071276).

#### References

1. T. Evgeniou, M. Pontil, and T. Poggio, “Regularization networks and support vector machines,” *Advances in Computational Mathematics*, vol. 13, no. 1, pp. 1–50, 2000.
2. S. Smale and D.-X. Zhou, “Shannon sampling. II. Connections to learning theory,” *Applied and Computational Harmonic Analysis*, vol. 19, no. 3, pp. 285–302, 2005.
3. S. Smale and D.-X. Zhou, “Learning theory estimates via integral operators and their approximations,” *Constructive Approximation*, vol. 26, no. 2, pp. 153–172, 2007.
4. Q. Wu, Y. Ying, and D.-X. Zhou, “Learning rates of least-square regularized regression,” *Foundations of Computational Mathematics*, vol. 6, no. 2, pp. 171–192, 2006.
5. D. S. Modha and E. Masry, “Minimum complexity regression estimation with weakly dependent observations,” *IEEE Transactions on Information Theory*, vol. 42, no. 6, pp. 2133–2145, 1996.
6. S. Smale and D.-X. Zhou, “Online learning with Markov sampling,” *Analysis and Applications*, vol. 7, no. 1, pp. 87–113, 2009.
7. H. Sun and Q. Wu, “A note on application of integral operator in learning theory,” *Applied and Computational Harmonic Analysis*, vol. 26, no. 3, pp. 416–421, 2009.
8. H. Sun and Q. Wu, “Regularized least square regression with dependent samples,” *Advances in Computational Mathematics*, vol. 32, no. 2, pp. 175–189, 2010.
9. K. B. Athreya and S. G. Pantula, “Mixing properties of Harris chains and autoregressive processes,” *Journal of Applied Probability*, vol. 23, no. 4, pp. 880–892, 1986.
10. A. Caponnetto and E. De Vito, “Optimal rates for the regularized least-squares algorithm,” *Foundations of Computational Mathematics*, vol. 7, no. 3, pp. 331–368, 2007.
11. Z.-C. Guo and D.-X. Zhou, “Concentration estimates for learning with unbounded sampling,” *Advances in Computational Mathematics*, vol. 38, no. 1, pp. 207–223, 2013.
12. S.-G. Lv and Y.-L. Feng, “Integral operator approach to learning theory with unbounded sampling,” *Complex Analysis and Operator Theory*, vol. 6, no. 3, pp. 533–548, 2012.
13. C. Wang and D.-X. Zhou, “Optimal learning rates for least squares regularized regression with unbounded sampling,” *Journal of Complexity*, vol. 27, no. 1, pp. 55–67, 2011.
14. C. Wang and Z. C. Guo, “ERM learning with unbounded sampling,” *Acta Mathematica Sinica*, vol. 28, no. 1, pp. 97–104, 2012.
15. X. R. Chu and H. W. Sun, “Half supervised coefficient regularization for regression learning with unbounded sampling,” *International Journal of Computer Mathematics*, 2013.
16. S. Smale and D.-X. Zhou, “Shannon sampling and function reconstruction from point values,” *Bulletin of the American Mathematical Society*, vol. 41, no. 3, pp. 279–305, 2004.
17. H. Sun and Q. Wu, “Application of integral operator for regularized least-square regression,” *Mathematical and Computer Modelling*, vol. 49, no. 1-2, pp. 276–285, 2009.
18. F. Bauer, S. Pereverzev, and L. Rosasco, “On regularization algorithms in learning theory,” *Journal of Complexity*, vol. 23, no. 1, pp. 52–72, 2007.
19. L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri, “Spectral algorithms for supervised learning,” *Neural Computation*, vol. 20, no. 7, pp. 1873–1897, 2008.
20. H. Sun and Q. Wu, “Least square regression with indefinite kernels and coefficient regularization,” *Applied and Computational Harmonic Analysis*, vol. 30, no. 1, pp. 96–109, 2011.
21. P. Billingsley, *Convergence of Probability Measures*, John Wiley & Sons, New York, NY, USA, 1968.

#### Copyright

Copyright © 2013 Xiaorong Chu and Hongwei Sun. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.