Abstract

This paper focuses on the least squares regression problem for $\phi$-mixing and $\alpha$-mixing processes. The standard boundedness assumption on the output data is abandoned: the learning algorithm is implemented with samples drawn from a dependent sampling process whose outputs satisfy a more general moment-type condition. Capacity-independent error bounds and learning rates are derived by means of the integral operator technique.

1. Introduction and Main Results

The aim of this paper is to study the least squares regularized regression learning algorithm. The main novelty here is the unboundedness and the dependence of the sampling process. Let $X$ be a compact metric space (usually a subset of $\mathbb{R}^n$) and $Y = \mathbb{R}$. Suppose that $\rho$ is a probability distribution defined on $Z = X \times Y$. In regression learning, one wants to learn or approximate the regression function $f_\rho$ given by
$$ f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad x \in X, $$
where $\rho(\cdot \mid x)$ is the conditional distribution of $y$ for a given $x$. Since $\rho$ is unknown, $f_\rho$ is not directly computable. Instead we learn a good approximation of $f_\rho$ from a set of observations $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \subset Z$ drawn according to $\rho$.

The learning algorithm studied here is based on a Mercer kernel $K : X \times X \to \mathbb{R}$, which is a continuous, symmetric, and positive semidefinite function. The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with the Mercer kernel $K$ is the completion of $\mathrm{span}\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_u \rangle_K = K(x, u)$. The learning algorithm is a regularization scheme in $\mathcal{H}_K$ given by
$$ f_{\mathbf{z},\lambda} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{m} \sum_{i=1}^m \big( f(x_i) - y_i \big)^2 + \lambda \|f\|_K^2 \Big\}, \qquad (2) $$
where $\lambda > 0$ is a regularization parameter.
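For illustration, here is a minimal computational sketch of the regularization scheme (2), assuming a Gaussian kernel and synthetic data; the kernel choice, bandwidth, regularization value, and all names in the snippet are our own illustrative assumptions, not prescribed by the paper. It uses the standard representer-theorem reduction of (2) to a linear system.

```python
import numpy as np

# Minimal sketch of the regularization scheme (2): kernel ridge regression
# with a Gaussian Mercer kernel. Kernel, bandwidth, data, and lambda are
# illustrative assumptions, not prescribed by the paper.

def gaussian_kernel(X1, X2, sigma=0.2):
    """Mercer kernel K(x, u) = exp(-|x - u|^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def regularized_least_squares(X, y, lam, sigma=0.2):
    """By the representer theorem, f_{z,lambda} = sum_i c_i K_{x_i} with
    c = (K + m*lambda*I)^{-1} y, where K is the Gram matrix (K(x_i, x_j))."""
    m = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + m * lam * np.eye(m), y)
    return lambda X_new: gaussian_kernel(X_new, X, sigma) @ c

# Usage on synthetic (here still i.i.d.) data.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
f_hat = regularized_least_squares(X, y, lam=1e-3)
print(f_hat(np.array([[0.25], [0.75]])))  # close to sin(pi/2)=1 and sin(3*pi/2)=-1
```

The dependent-sampling setting considered below changes only how the data are generated, not the algorithm itself.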

Error analysis for the learning algorithm (2) has been studied extensively in the literature [1–4], where the focus is on independent samples. In recent years, some studies have relaxed the independence restriction and turned to learning with dependent sampling [5–8]. In [8] the learning performance of regularized least squares regression was studied with mixing sequences, and the result for this setting was refined by an operator monotone inequality in [7].

For a stationary sequence $\{z_t\}_{t \ge 1}$, the $\sigma$-algebra generated by the random variables $z_a, z_{a+1}, \ldots, z_b$ is denoted by $\mathcal{M}_a^b$. The uniformly mixing condition (or $\phi$-mixing condition) and the strongly mixing condition (or $\alpha$-mixing condition) are defined as follows.

Definition 1 ($\phi$-mixing). The $k$th $\phi$-mixing coefficient for the sequence $\{z_t\}$ is defined as
$$ \phi(k) = \sup_{j \ge 1} \; \sup_{A \in \mathcal{M}_1^j,\ B \in \mathcal{M}_{j+k}^{\infty},\ P(A) > 0} \big| P(B \mid A) - P(B) \big| . $$
The process is said to satisfy a uniformly mixing condition (or $\phi$-mixing condition) if $\phi(k) \to 0$ as $k \to \infty$.

Definition 2 ($\alpha$-mixing). The $k$th $\alpha$-mixing coefficient for the sequence $\{z_t\}$ is defined as
$$ \alpha(k) = \sup_{j \ge 1} \; \sup_{A \in \mathcal{M}_1^j,\ B \in \mathcal{M}_{j+k}^{\infty}} \big| P(A \cap B) - P(A)P(B) \big| . $$
The random process is said to satisfy a strongly mixing condition (or $\alpha$-mixing condition) if $\alpha(k) \to 0$ as $k \to \infty$.

By the fact that $\alpha(k) \le \phi(k)$, the $\alpha$-mixing condition is weaker than the $\phi$-mixing condition. Many random processes satisfy the strongly mixing condition, for example, stationary Markov processes that are uniformly pure nondeterministic, stationary Gaussian sequences with a continuous spectral density bounded away from 0, certain ARMA processes, and some aperiodic, Harris-recurrent Markov processes; see [5, 9] and the references therein.
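For concreteness, the following sketch (an illustration under assumptions of our own choosing, not an example from the paper) draws a dependent sample from a stationary Gaussian AR(1) chain, one of the strongly mixing processes just mentioned; the coefficient `a` tunes the strength of the dependence, and the Gaussian outputs are unbounded.

```python
import numpy as np
from math import erf

# A stationary Gaussian AR(1) sequence is a classical example of a strongly
# (alpha-) mixing process, with mixing coefficients decaying geometrically
# in the gap k. The map into [0, 1], the target function, and the noise
# level below are illustrative assumptions, not part of the paper.

def ar1_inputs(m, a=0.6, rng=None):
    """x_t = a*x_{t-1} + e_t with e_t ~ N(0, 1 - a^2), so the stationary
    marginal is N(0, 1); the standard normal CDF then maps the chain into
    the compact input space X = [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(m)
    x[0] = rng.standard_normal()
    for t in range(1, m):
        x[t] = a * x[t - 1] + np.sqrt(1.0 - a ** 2) * rng.standard_normal()
    return np.array([0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in x])

rng = np.random.default_rng(1)
x = ar1_inputs(500, a=0.6, rng=rng)                 # larger a = stronger dependence
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(500)  # unbounded Gaussian outputs
```

Feeding such $(x_i, y_i)$ into scheme (2) gives the setting analyzed below: dependent samples whose outputs violate the uniform boundedness assumption $|y| \le M$ almost surely.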

In this paper we follow [7, 8] and consider $\phi$-mixing and $\alpha$-mixing processes, estimate the error bounds, and derive the learning rates of algorithm (2), where the output data satisfy the following unbounded condition.

Unbounded Hypothesis. There exist two constants $p$ and $M$ such that the output variable $y$ satisfies the moment-type condition (5).

The error analysis for algorithm (2) is usually presented under the standard assumption that $|y| \le M$ almost surely for some constant $M > 0$. This standard assumption was abandoned in [10–14]. In [10] the authors introduced an exponential-moment condition (6) on $y - f_{\mathcal{H}}(x)$, required to hold for almost every $x \in X$ and some constants, where $f_{\mathcal{H}}$ is the orthogonal projection of $f_\rho$ onto the closure of $\mathcal{H}_K$ in $L^2_{\rho_X}$. In [11–13] the error analysis was conducted in another setting satisfying the following moment hypothesis; that is, there exist constants $M > 0$ and $C > 0$ such that $\int_Y |y|^{\ell} \, d\rho(y \mid x) \le C \, \ell! \, M^{\ell}$ for all $\ell \in \mathbb{N}$. Notice that, up to different constants, the moment hypothesis and (6) are equivalent in a particular case [13]. Obviously, our unbounded hypothesis is a natural generalization of the moment hypothesis. An example for which the unbounded hypothesis (5) is satisfied but the moment hypothesis fails was given in [15], which mainly studies semisupervised coefficient regularization with indefinite kernels and unbounded sampling under a related unbounded condition involving a single constant.
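As a quick orientation (not taken from the paper, and assuming the moment hypothesis has the form displayed above), Gaussian noise already shows why dropping uniform boundedness matters: the outputs are unbounded, yet the moment hypothesis holds.

```latex
% Illustration: Gaussian noise violates |y| <= M a.s. but satisfies the
% moment hypothesis. Suppose y = f_rho(x) + eps with eps ~ N(0, sigma^2)
% and |f_rho(x)| <= B. Then, for every integer l >= 1,
\[
  \int_Y |y|^{\ell}\, d\rho(y \mid x)
  \;\le\; 2^{\ell-1}\bigl(B^{\ell} + E|\varepsilon|^{\ell}\bigr)
  \;\le\; 2^{\ell-1}\bigl(B^{\ell} + \sigma^{\ell}\,\ell!\bigr)
  \;\le\; \ell!\,\bigl(2\max(B,\sigma)\bigr)^{\ell},
\]
% so the moment hypothesis holds with C = 1 and M = 2*max(B, sigma),
% even though y is not bounded almost surely.
```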

Since $\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_{\rho}^2$, where the generalization error is $\mathcal{E}(f) = \int_Z (f(x) - y)^2 \, d\rho$ and $\|\cdot\|_\rho$ denotes the norm of $L^2_{\rho_X}$, the goodness of the approximation of $f_\rho$ by $f_{\mathbf{z},\lambda}$ is usually measured by the excess generalization error $\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho)$. Denoting $\kappa := \sup_{x \in X} \sqrt{K(x, x)}$, the reproducing property in the RKHS yields that $\|f\|_\infty \le \kappa \|f\|_K$ for any $f \in \mathcal{H}_K$. Thus, the distance between $f_{\mathbf{z},\lambda}$ and $f_\rho$ in $\mathcal{H}_K$ can be applied to measure this approximation as well when $f_\rho \in \mathcal{H}_K$.

The noise-free limit of algorithm (2) takes the form
$$ f_\lambda = \arg\min_{f \in \mathcal{H}_K} \Big\{ \|f - f_\rho\|_{\rho}^2 + \lambda \|f\|_K^2 \Big\}; $$
thus the error analysis can be divided into two parts. The difference between $f_{\mathbf{z},\lambda}$ and $f_\lambda$ is called the sample error, and the distance between $f_\lambda$ and $f_\rho$ is called the approximation error. We will bound the error in $\mathcal{H}_K$ and in $L^2_{\rho_X}$, respectively. The estimate of the sample error is more difficult because $f_{\mathbf{z},\lambda}$ changes with the sample and cannot be considered as a fixed function. The approximation error does not depend on the samples and has been studied in the literature [2, 3, 7, 16, 17].
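In symbols, the two-part analysis rests on the usual triangle-inequality decomposition (stated here only for orientation; the norm may be taken either in $\mathcal{H}_K$ or in $L^2_{\rho_X}$):

```latex
% Error decomposition behind the sample/approximation error split.
\[
  \|f_{\mathbf{z},\lambda} - f_\rho\|
  \;\le\;
  \underbrace{\|f_{\mathbf{z},\lambda} - f_\lambda\|}_{\text{sample error}}
  \;+\;
  \underbrace{\|f_\lambda - f_\rho\|}_{\text{approximation error}} .
\]
```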

We devote the next two sections mainly to estimating the sample error for these more general sampling processes. Our main results can be stated as follows.

Theorem 3. Suppose that the unbounded hypothesis holds, that $f_\rho = L_K^r(g_\rho)$ for some $r > 0$ and $g_\rho \in L^2_{\rho_X}$, and that the $\phi$-mixing coefficients satisfy a polynomial decay, that is, $\phi(k) \le a k^{-t}$ for some $a > 0$ and $t > 0$. Then, for any $0 < \delta < 1$, one has, with confidence $1 - \delta$, a bound on $\|f_{\mathbf{z},\lambda} - f_\rho\|_\rho$ of order $m^{-\theta}$ (up to a logarithmic factor), where the exponent $\theta$ is determined by $r$ and $t$ through the choice of $\lambda$ made in the proof.
Moreover, when $f_\rho \in \mathcal{H}_K$ (that is, $r \ge 1/2$), one has, with confidence $1 - \delta$, a bound of the same power type on $\|f_{\mathbf{z},\lambda} - f_\rho\|_K$, with an exponent again determined by $r$ and $t$.

Theorem 3 establishes the asymptotic convergence of algorithm (2) when the samples satisfy a uniformly mixing condition. Our second main result considers this algorithm for an $\alpha$-mixing process.

Theorem 4. Suppose that the unbounded hypothesis holds with some $p$, that $f_\rho = L_K^r(g_\rho)$ for some $r > 0$ and $g_\rho \in L^2_{\rho_X}$, and that the $\alpha$-mixing coefficients satisfy a polynomial decay, that is, $\alpha(k) \le a k^{-t}$ for some $a > 0$ and $t > 0$. Then, for any $0 < \delta < 1$, one has, with confidence $1 - \delta$, a bound on $\|f_{\mathbf{z},\lambda} - f_\rho\|_\rho$ of order $m^{-\theta}$ (up to a logarithmic factor), where the exponent $\theta$ is determined by $r$, $t$, and $p$ through the choice of $\lambda$ made in the proof.
Moreover, when $f_\rho \in \mathcal{H}_K$ (that is, $r \ge 1/2$), with confidence $1 - \delta$, a bound of the same power type holds for $\|f_{\mathbf{z},\lambda} - f_\rho\|_K$, with an exponent again determined by $r$, $t$, and $p$.

The proofs of these two theorems will be given in Sections 2, 3, and 4; notice that the logarithmic factor can be dropped in certain parameter ranges. Our error analysis reveals some interesting phenomena for learning with unbounded and dependent sampling.
(i) A smoother target function (i.e., a larger $r$) implies better learning rates. Stronger dependence between samples (i.e., a smaller $t$) implies that they contain less information and hence leads to worse rates.
(ii) The learning rates improve as the dependence between samples becomes weaker and $t$ becomes larger, but they are no longer improved once $t$ exceeds some constant. This phenomenon is called the saturation effect and was discussed in [18–20]. In our setting, saturation occurs both for the smoothness of the target function, mainly related to the approximation error, and for the dependence between samples. An interesting phenomenon revealed here is that, when the $\alpha$-mixing coefficients satisfy a polynomial decay, the saturation level for the dependence between samples depends on the unbounded condition parameter $p$.
(iii) For the $\phi$-mixing process, the learning rates do not involve the unbounded condition parameter $p$, since only second moments of the relevant random variables are needed and these admit bounds free of $p$. But for the $\alpha$-mixing process, to derive the learning rate we have to estimate moments of order higher than two, whose bounds do involve $p$.
(iv) Under the $\phi$-mixing condition, when $t$ and $r$ are large enough, the influence of the unbounded condition becomes weak. Recall the learning rate derived in [8] for uniformly bounded samples under a polynomial mixing decay; our result implies that, when $t$ is large enough, the learning rate for unbounded samples is as sharp as that for uniformly bounded sampling.

2. Sampling Satisfying the $\phi$-Mixing Condition

In this section we apply the integral operator technique of [7] to handle the sample error under the $\phi$-mixing condition. Differently from the uniformly bounded case, the learning performance for unbounded sampling is not bounded directly. Instead, expectations are estimated first, and the bound for the sample error is then deduced from the Markov inequality: for a nonnegative random variable $\xi$ and any $\varepsilon > 0$,
$$ P\{\xi \ge \varepsilon\} \le \frac{E\xi}{\varepsilon}, $$
so that, with confidence $1 - \delta$, one has $\xi \le E\xi / \delta$.

To this end, define the sampling operator $S_{\mathbf{x}} : \mathcal{H}_K \to \mathbb{R}^m$ as $S_{\mathbf{x}} f = (f(x_1), \ldots, f(x_m))$, where $\mathbf{x} = \{x_1, \ldots, x_m\}$ is the set of input data. Then its adjoint is $S_{\mathbf{x}}^T \mathbf{c} = \sum_{i=1}^m c_i K_{x_i}$ for $\mathbf{c} = (c_1, \ldots, c_m) \in \mathbb{R}^m$. The analytic expression of the optimization solution was given in [3]:
$$ f_{\mathbf{z},\lambda} = \Big( \frac{1}{m} S_{\mathbf{x}}^T S_{\mathbf{x}} + \lambda I \Big)^{-1} \frac{1}{m} S_{\mathbf{x}}^T \mathbf{y}, \qquad f_\lambda = (L_K + \lambda I)^{-1} L_K f_\rho, $$
where $L_K$ is the integral operator defined as
$$ L_K f(x) = \int_X K(x, u) f(u) \, d\rho_X(u), \qquad f \in L^2_{\rho_X},\ x \in X, $$
with $\rho_X$ the marginal distribution of $\rho$ on $X$.
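As a sanity check (purely illustrative; the data, kernel, and bandwidth are our own assumptions), the operator form of $f_{\mathbf{z},\lambda}$ above coincides on $\mathrm{span}\{K_{x_i}\}$ with the Gram-matrix formula used in the earlier snippet, provided the $K_{x_i}$ are linearly independent:

```python
import numpy as np

# Numerical check (under illustrative data) that the operator form
#   f_{z,lambda} = ((1/m) S_x^T S_x + lambda I)^{-1} (1/m) S_x^T y
# reduces, on span{K_{x_i}} (assuming the K_{x_i} are linearly independent),
# to the Gram-matrix formula c = (K + m*lambda*I)^{-1} y: writing
# f = sum_j c_j K_{x_j}, both equations become (K + m*lambda*I) c = y.

rng = np.random.default_rng(2)
m, lam = 50, 1e-2
X = rng.uniform(0.0, 1.0, size=(m, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(m)
K = np.exp(-(X - X.T) ** 2 / (2 * 0.2 ** 2))   # Gram matrix (K(x_i, x_j))

# Coefficients from the Gram-matrix formula.
c = np.linalg.solve(K + m * lam * np.eye(m), y)

# Operator form in the same coordinates: ((1/m) K + lam I) c' = (1/m) y,
# which is the same linear system up to scaling by m.
c_op = np.linalg.solve(K / m + lam * np.eye(m), y / m)

print(np.allclose(c, c_op))  # True
```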

For a random variable $\xi$ with values in a Hilbert space $(H, \|\cdot\|)$ and $1 \le q < \infty$, denote the $q$th moment as $\|\xi\|_q = (E\|\xi\|^q)^{1/q}$ whenever $E\|\xi\|^q < \infty$, and $\|\xi\|_\infty = \operatorname{ess\,sup} \|\xi\|$. Lemma 5 is due to Billingsley [21].

Lemma 5. Let $\xi$ and $\eta$ be random variables with values in a separable Hilbert space, measurable with respect to the $\sigma$-fields $\mathcal{M}_1^j$ and $\mathcal{M}_{j+k}^{\infty}$, respectively, and having finite $q$th and $q'$th moments, respectively, where $q, q' > 1$ with $\frac{1}{q} + \frac{1}{q'} = 1$. Then
$$ \big| E\langle \xi, \eta \rangle - \langle E\xi, E\eta \rangle \big| \le 2\, \phi(k)^{1/q} \, \|\xi\|_q \, \|\eta\|_{q'} . $$

Lemma 6. For a $\phi$-mixing sequence $\{z_i\}_{i=1}^m$ and $\xi := y K_x$, one has
$$ E\Big\| \frac{1}{m} S_{\mathbf{x}}^T \mathbf{y} - L_K f_\rho \Big\|_K^2 \le \frac{1}{m} \Big( 1 + 4 \sum_{k=1}^{m-1} \phi(k)^{1/2} \Big) E\|\xi\|_K^2 . $$

Proof. With the definition of the sampling operator, we have $\frac{1}{m} S_{\mathbf{x}}^T \mathbf{y} = \frac{1}{m} \sum_{i=1}^m y_i K_{x_i}$. Letting $\xi_i = \xi(z_i) = y_i K_{x_i}$, then $\xi$ is an $\mathcal{H}_K$-valued random variable defined on $Z$. Note that $E\xi = L_K f_\rho$ and $\|\xi\|_K = |y|\sqrt{K(x,x)} \le \kappa |y|$. We have
$$ E\Big\| \frac{1}{m} \sum_{i=1}^m \xi_i - E\xi \Big\|_K^2 = \frac{1}{m^2} \sum_{i=1}^m E\|\xi_i - E\xi\|_K^2 + \frac{1}{m^2} \sum_{i \ne j} E\langle \xi_i - E\xi, \, \xi_j - E\xi \rangle_K . \qquad (22) $$
By Lemma 5 with $q = q' = 2$, for $i \ne j$,
$$ \big| E\langle \xi_i - E\xi, \, \xi_j - E\xi \rangle_K \big| \le 2\, \phi(|i-j|)^{1/2} \, E\|\xi - E\xi\|_K^2 . \qquad (23) $$
Thus the desired estimate can be obtained by plugging (23) into (22).

Proposition 7. Suppose that the unbounded hypothesis holds with some $p$ and that the sample sequence satisfies a $\phi$-mixing condition. Then $E\|f_{\mathbf{z},\lambda} - f_\lambda\|_K$ admits a bound in terms of $m$, $\lambda$, and the mixing coefficients $\{\phi(k)\}$, with a constant only dependent on $\kappa$ and the constants of the unbounded hypothesis.

Proof. By [7, Theorem 3.1], $\|f_{\mathbf{z},\lambda} - f_\lambda\|_K$ is bounded through an $\mathcal{H}_K$-valued random variable constructed from the sample; this is estimate (25). A computation similar to that in the proof of Lemma 6, together with Lemma 6 itself, leads to the second-moment bound (26).
It remains to estimate the moment of $|y|$ appearing in (26). By the Hölder inequality and the unbounded hypothesis, this moment is controlled by the constants $p$ and $M$, which implies (29).
Plugging (29) into (26), there holds (30).
Combining (25), (22), and (30) and collecting the constants, we complete the proof.

The following proposition provides a bound for the difference between $f_{\mathbf{z},\lambda}$ and $f_\lambda$ in $L^2_{\rho_X}$ for the $\phi$-mixing process.

Proposition 8. Under the assumptions of Proposition 7, an analogous bound holds for $E\|f_{\mathbf{z},\lambda} - f_\lambda\|_\rho$.

Proof. The representations of $f_{\mathbf{z},\lambda}$ and $f_\lambda$ imply the norm comparison (32). Then the desired bound follows from (30) and (32).

3. Samples Satisfying the $\alpha$-Mixing Condition

Now we turn to bounding the sample error when the sampling process satisfies the strongly mixing condition and the unbounded hypothesis holds. In Section 2, the key point was to estimate second moments of the relevant $\mathcal{H}_K$-valued random variables in the absence of uniform boundedness. For sampling satisfying the $\alpha$-mixing condition, we have to deal with moments of order higher than two, with the order depending on $p$.

Proposition 9. Suppose that the unbounded hypothesis holds with some $p$ and that the sample sequence satisfies an $\alpha$-mixing condition. Then $E\|f_{\mathbf{z},\lambda} - f_\lambda\|_K$ admits a bound in terms of $m$, $\lambda$, and the mixing coefficients $\{\alpha(k)\}$, where the constant involved depends only on $\kappa$ and the constants of the unbounded hypothesis.

Proof. For the strongly mixing process, the covariance inequality of [8, Lemma 5.1] plays the role of Lemma 5 and yields (34). Taking a suitable exponent in [8, Lemma 4.2], we obtain (35).
The second-moment estimate has been obtained in Section 2, so we now mainly devote ourselves to estimating the higher-order moment appearing in (35). To get this estimate, a bound on $f_\lambda$ is needed, which is provided by [3, Lemma 3] or [8, Lemma 4.3]; see (36). Observe that the unbounded hypothesis controls the corresponding moments of $|y|$. Hence (37) holds, and we can deduce (38). Plugging this estimate into (35) yields (39), where the constant depends only on the kernel and the constants of the unbounded hypothesis. Then, combining (34) and (39) with (25), we complete the proof.

For the $\alpha$-mixing process we have the following proposition giving the bound of the sample error in $L^2_{\rho_X}$; the proof follows directly from inequality (32).

Proposition 10. Under the assumptions of Proposition 9, an analogous bound holds for $E\|f_{\mathbf{z},\lambda} - f_\lambda\|_\rho$.

4. Error Bounds and Learning Rates

In this section we derive the learning rates, that is, the convergence rates of $\|f_{\mathbf{z},\lambda} - f_\rho\|_\rho$ and $\|f_{\mathbf{z},\lambda} - f_\rho\|_K$ as $m \to \infty$, by choosing the regularization parameter $\lambda$ according to $m$. The following approximation error bound is needed to get the convergence rates.

Proposition 11. Suppose that $f_\rho = L_K^r(g_\rho)$ for some $r > 0$ and $g_\rho \in L^2_{\rho_X}$. Then, for $0 < r \le 1$, there holds
$$ \|f_\lambda - f_\rho\|_\rho \le \lambda^r \|g_\rho\|_\rho . $$
Moreover, when $r \ge 1/2$, that is, $f_\rho \in \mathcal{H}_K$, there holds
$$ \|f_\lambda - f_\rho\|_K \le \lambda^{r - 1/2} \|g_\rho\|_\rho \qquad \text{for } \tfrac{1}{2} \le r \le \tfrac{3}{2} . $$
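A minimal sketch of why the first bound can hold under the source condition $f_\rho = L_K^r(g_\rho)$ with $0 < r \le 1$ (a standard spectral-calculus argument given only for orientation; the proof cited from [20] may proceed differently):

```latex
% Spectral sketch: since f_lambda = (L_K + lambda I)^{-1} L_K f_rho,
%   f_lambda - f_rho = -lambda (L_K + lambda I)^{-1} f_rho
%                    = -lambda (L_K + lambda I)^{-1} L_K^r g_rho .
% Functional calculus for the positive operator L_K gives, for 0 < r <= 1,
\[
  \big\| \lambda (L_K + \lambda I)^{-1} L_K^{\,r} \big\|
  \;\le\; \sup_{\sigma \ge 0} \frac{\lambda\, \sigma^{r}}{\sigma + \lambda}
  \;\le\; \lambda^{r},
  \qquad\text{hence}\qquad
  \|f_\lambda - f_\rho\|_{\rho} \;\le\; \lambda^{r}\, \|g_\rho\|_{\rho}.
\]
```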

The first conclusion in Proposition 11 has been proved in [20], and the second one can be proved in the same way. To derive the learning rates, we need to balance the approximation error and the sample error by choosing $\lambda$ as a function of $m$.
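The balancing step takes the following generic form, where the exponents $r$, $\gamma$, $\beta > 0$ are placeholders rather than the specific values appearing in Theorems 3 and 4:

```latex
% Generic balancing of an approximation term lambda^r against a sample term
% m^{-gamma} lambda^{-beta}. Choosing lambda to equate the two terms,
\[
  \lambda^{r} = m^{-\gamma}\lambda^{-\beta}
  \;\Longleftrightarrow\;
  \lambda = m^{-\gamma/(r+\beta)},
  \qquad\text{which gives}\qquad
  \lambda^{r} + m^{-\gamma}\lambda^{-\beta} = 2\, m^{-\gamma r/(r+\beta)} .
\]
```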

Proof of Theorem 3. The estimate of the learning rates in the $L^2_{\rho_X}$ norm is divided into two cases according to the range of $r$.
Case 1. In the first range of $r$, by (43) and the approximation error bound, Proposition 7 yields a bound for the expected sample error. By Proposition 11 and the Markov inequality, with confidence $1 - \delta$, the total error bound follows.
In each subrange of $t$, taking $\lambda$ to be an appropriate power of $m$ balances the sample error and approximation error terms and yields the corresponding learning rate; when $t$ is large enough, the final choice of $\lambda$ gives the desired convergence rate.
Case 2. In the remaining range of $r$, with confidence $1 - \delta$, an analogous bound holds, and the learning rates are again obtained by taking $\lambda$ to be suitable powers of $m$ in the respective subranges of $t$.
Next, to bound the error in $\mathcal{H}_K$, Proposition 8 in connection with Proposition 11 shows that a corresponding bound holds with confidence $1 - \delta$. The rest of the proof is analogous to the estimate in the $L^2_{\rho_X}$ norm given previously.

Proof of Theorem 4. In the relevant range of $r$, by (43) and the approximation error bound, Propositions 9 and 11 together with the Markov inequality imply that, with confidence $1 - \delta$, the total error is bounded by the sum of the sample error and approximation error terms.
Taking $\lambda$ to be an appropriate power of $m$ in each subrange of $t$ balances these terms and yields the stated learning rates; when $t$ is large enough, the final choice of $\lambda$ gives the desired convergence rate.
The rest of the analysis is similar and is omitted here.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (no. 11071276).