Abstract

Many generalization results in learning theory are established under the assumption that samples are independent and identically distributed (i.i.d.). However, many learning tasks in practical applications involve time-dependent data. In this paper, we propose a theoretical framework for analyzing the generalization performance of the empirical risk minimization (ERM) principle for sequences of time-dependent samples (TDS). In particular, we first present the generalization bound of the ERM principle for TDS. By introducing some auxiliary quantities, we then give a further analysis of the generalization properties and the asymptotic behavior of the ERM principle for TDS.

1. Introduction

Let and be an input space and the corresponding output space, respectively. Define with . Many classical results of statistical learning theory are built under the assumption that samples are independently drawn from an identical distribution on , the so-called i.i.d.-sample assumption; see, for example, [15]. However, theoretical results resting on the i.i.d.-sample assumption may not be valid for (or cannot be directly applied to) non-i.i.d. scenarios.

There has been considerable research interest in the theoretical analysis of learning processes for time-dependent samples. Zhang and Tao [6, 7] study the generalization properties of ERM-based learning processes for time-dependent samples drawn from Lévy processes and continuous-time Markov chains, respectively. Zou et al. [8] establish an exponential bound on the rate of relative uniform convergence for the ERM algorithm with dependent observations and derive the related generalization bounds for -mixing sequences. Jiang [9] introduces a triplex inequality based on Hoeffding's inequality (cf. [10]) and then obtains probability bounds for uniform deviations in a very general framework, which accommodates unbounded loss functions and dependent samples. Zou et al. [11] present a novel Markov sampling algorithm to generate uniformly ergodic Markov chain samples from a given dataset. Xu et al. [12] adopt techniques similar to those of [11] to develop an error bound for an online SVM classification algorithm, which achieves better learning performance than classical random sampling methods.

In this paper, we are mainly concerned with the generalization performance of the ERM principle for sequences of time-dependent samples (TDS), which are independently and repeatedly observed from an undetermined stochastic process at fixed time points. The generalization performance of such a learning process is affected by the following factors: the number of independent observation sequences, the time-dependence among samples within a sequence, the number of samples in a sequence, and the observation time points.

By introducing some auxiliary quantities, we propose a new framework to analyze the generalization properties of the learning process. In particular, we first show that the generalization bound of this learning process can be decomposed into four parts: , , , and , which correspond to the aforementioned factors. We then analyze the properties of these quantities. Under some mild conditions, we obtain upper bounds for the first three quantities , , and , respectively. Finally, some techniques of statistical learning theory are applied to bound the quantity , namely, the uniform entropy number, the relevant deviation inequality, and the symmetrization inequality for independent sequences of TDS.

Different from the previous works [6, 7], we make no specific assumption on the distribution of the stochastic process being sampled; in contrast, those works require that samples be observed from Lévy processes and continuous-time Markov chains, respectively. Instead of the -valued function classes appearing in the previous works, we consider a function class consisting of functions evaluated on both and the time interval . Moreover, the samples considered in the previous works [6–8] form a single series of data observed from one stochastic process, while this paper studies a learning process based on a number of independent sequences of TDS observed from a stochastic process at fixed time points.

Moreover, the works [9, 11, 12] only consider the case of one single sequence of TDS, while this paper studies the learning process for multiple sequences of TDS, where one sequence corresponds to one trajectory (sample path) of a stochastic process. Therefore, our results are more general than previous ones.

The rest of paper is organized as follows. In Section 2, we first introduce some notions and notations used in the paper and then exhibit the decomposition of the generalization bounds of ERM principle for TDS. In Section 3, we bound the four auxiliary quantities , , , and and present the main results of the paper. The last section concludes the paper.

2. Problem Setup

In this section, we formalize the main research issue of this paper and then show the decomposition of the generalization error of ERM principle for TDS.

2.1. ERM Principle and Generalization Bounds

For any , let and be the -time inputs and the corresponding outputs, respectively. Denote () and assume that is an undetermined stochastic process with a countable state space. Consider a function class consisting of functions evaluated on the input space and the time interval . We would like to find a function such that, for any input (), the corresponding output can be predicted as accurately as possible.

A natural criterion for choosing the function is the lowest expected risk incurred by a function in :
where is a function of the -time input and the time point (here, in the function is just an input value of the functional ), is the composite function of after , and stands for the -time distribution of the stochastic process on . However, the distribution of is unknown, and it is difficult to obtain the target function directly by minimizing the expected risk .

Instead, the empirical risk minimization (ERM) principle provides a solution to this issue. For , let be the th sequence of samples observed from a certain stochastic process at the fixed time points , and let the sequences () be independent of each other. The ERM principle aims to minimize the empirical risk over :
and the solution is regarded as an estimate of the expected solution with respect to the sample sequences .
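As a toy illustration, the ERM scheme above can be sketched numerically. All modeling choices here (an AR(1)-driven process, a squared loss, a two-parameter hypothesis class fitted by grid search) are hypothetical and only mimic the structure of the setting: several independent trajectories observed at the same fixed time points, with the empirical risk averaged over all sequences and time points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n independent trajectories of an AR(1)-driven process,
# each observed at the same fixed time points t_1 < ... < t_m.
n_sequences, n_times = 50, 20
t = np.linspace(0.0, 1.0, n_times)

def sample_trajectory():
    """One sequence of time-dependent samples: x_k depends on x_{k-1}."""
    x = np.zeros(n_times)
    for k in range(1, n_times):
        x[k] = 0.8 * x[k - 1] + rng.normal(scale=0.1)
    y = 2.0 * x + 0.5 * t + rng.normal(scale=0.05, size=n_times)
    return x, y

data = [sample_trajectory() for _ in range(n_sequences)]

def empirical_risk(a, b):
    """Squared loss averaged over all sequences and all time points."""
    return np.mean([np.mean((y - (a * x + b * t)) ** 2) for x, y in data])

# ERM by grid search over the small parametric class f(x, t) = a*x + b*t.
grid = np.linspace(-3, 3, 61)
a_hat, b_hat = min(((a, b) for a in grid for b in grid),
                   key=lambda p: empirical_risk(*p))
print(a_hat, b_hat)
```

With enough independent sequences, the empirical minimizer lands near the parameters (2.0, 0.5) of the data-generating model, as the generalization analysis in the paper would predict.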

For convenience, we further define the loss function class
and call the function class in the rest of this paper. Given sample sequences , taking yields the following brief notations for expected risk (1) and empirical risk (2):

As in classical statistical learning theory [13], the main issue of this paper is whether the empirical solution provided by the ERM principle performs as well as the expected solution . The supremum
, called the generalization bound of the ERM principle for TDS, plays a central role in analyzing the generalization performance of the above learning process.

2.2. Relationship with Some Time-Dependent Problems

At the end of this section, we show that many time-dependent problems can be reduced to the aforementioned ERM-based learning process, for example, the estimation of information channels and functional linear models.

2.2.1. Estimation of Information Channel

Since the functions in are evaluated on both the real input space and the time interval , the framework proposed in this paper describes the inherent characteristics of time-dependent problems more precisely than the learning framework of the previous works [6, 7], where the function class is evaluated only on . In particular, the time-dependent problem mentioned in [14], namely, the estimation of an information channel, is also covered by this learning setting. Different from the previous one, the setting considered in this paper is also suitable for analyzing the performance of estimating a dynamic information channel.

The estimation of a dynamic information channel is based on the following model: , where and , both changing status as time varies, are the channel matrix and the noise vector, respectively. The corresponding function class is formalized as
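The channel model can be sketched as follows. The drifting channel matrix, the noise scale, and the per-time least-squares estimator are all illustrative assumptions, not taken from the paper: many independent input/output pairs are observed at each fixed time point, and the time-varying channel matrix is recovered at each of those points.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dynamic channel y = H(t) x + n(t): both the channel matrix H
# and the noise n change status as time varies.  We observe many independent
# (x, y) pairs at each fixed time point and estimate H(t) by least squares.
d_in, d_out, n_obs = 3, 2, 200
time_points = [0.0, 0.5, 1.0]

def H_true(t):
    # Assumed ground-truth channel, drifting linearly with t.
    return np.array([[1.0 + t, 0.2, 0.0],
                     [0.1, 0.5 - 0.3 * t, 1.0]])

estimates = {}
for t in time_points:
    X = rng.normal(size=(n_obs, d_in))
    N = rng.normal(scale=0.05, size=(n_obs, d_out))
    Y = X @ H_true(t).T + N
    # Least-squares channel estimate at this fixed time point.
    H_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    estimates[t] = H_hat.T

err = max(np.abs(estimates[t] - H_true(t)).max() for t in time_points)
print(err)
```

Each fixed observation time point plays the role it has in the TDS setting: the repeated independent observations at that point are what make the per-time estimate accurate.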

2.2.2. Functional Linear Models

Moreover, functional data classification or regression with functional linear models also fits the learning setting of this paper. By denoting , the following are the most frequently used functional linear models mentioned in [15]:
(i) The model with a scalar input and a functional output: , which corresponds to the function class for any :
(ii) The model with a functional input and a scalar output: , which corresponds to the function class for any :
(iii) The two models with a functional input and a functional output:
(a) one model is , which corresponds to the function class for any :
(b) the other is , which corresponds to the function class for any :
In the above models, the loss function is usually chosen as the mean square error, and functional data algorithms are then used to find the function that minimizes empirical risk (2) over the function class . We refer to [16] for more details on functional linear models.

2.3. Analysis of Generalization Performance

Different from the i.i.d. learning setting, and regardless of the distribution characteristics of , the generalization performance of the aforementioned learning process is also affected by the following factors:
(i) The time-dependence among samples in one sequence.
(ii) The number of independent sequences of samples.
(iii) The number of samples in one sequence.
(iv) The fixed time points at which the samples are observed.
The main concern of this paper is to quantitatively analyze the influence of these factors on the generalization performance and then to obtain the generalization bound of the ERM principle for TDS.

Proposition 1. Denote as an ordered set with and . If the set achieves the supremum
then there holds that
where the quantities , , , and are, respectively, defined by

This result implies that the behavior of the generalization bound can be described by the sum of the quantities , , , and . In the next section, we discuss the properties of each of these quantities in turn.

3. Analysis of Relevant Quantities

As mentioned above, regardless of the distribution characteristics of , there are four factors affecting the generalization performance of the ERM learning process for TDS: time-dependence, the number of TDS sequences, the sample number, and the observing time points, which are actually related to the quantities , , , and , respectively.

Moreover, two mild conditions are imposed for the following discussion:
(C1) Assume that there exists a constant such that, for all and , there holds that
(C2) Assume that each is differentiable with respect to the time and that there exists a constant such that, for all and , there holds that, for any and ,
Condition (C1) requires that any function have a bounded expectation with respect to the distribution of at any time , and Condition (C2) implies that all functions in have a bounded first-order partial derivative with respect to .

3.1. Upper Bound of Quantity

Under Condition (C1) and recalling (15), we can bound the quantity as follows:
which implies that is affected by the choice of the specific time sequence achieving supremum (13). Note that the time sequence achieving supremum (13) may not be unique; we therefore define

It can be observed that if the time points are such that the lengths of the intervals () are all equal, the sampling-time error is zero. From a probabilistic perspective, this means that is the candidate sequence closest to the uniform distribution. In other words, the more uniformly the time points in are distributed, the closer the quantity is to zero. On the other hand, a uniform distribution of time points implies that, for the stochastic process , the distributions of are identical for all , which is actually a probability distribution.

3.2. Upper Bound of Quantity

Denote , which is the so-called integral probability metric (IPM) between the two distributions and with respect to the function class . The IPM plays an important role in measuring the discrepancy between two distributions in probability theory, and we refer to [17, 18] for more details.
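The IPM can be approximated by a simple Monte Carlo sketch. The finite trigonometric function class below is an assumption made purely so that the supremum is a maximum over a few features; it is not the class used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo sketch of the integral probability metric (IPM)
#   d_F(P, Q) = sup over f in F of | E_P f - E_Q f |
# for an assumed finite function class F of trigonometric features.
feats = [lambda x, k=k: np.sin(k * x) for k in (1, 2, 3)] + \
        [lambda x, k=k: np.cos(k * x) for k in (1, 2, 3)]

def ipm(xs, ys):
    """Empirical IPM between two samples over the finite class `feats`."""
    return max(abs(f(xs).mean() - f(ys).mean()) for f in feats)

n = 20000
same = ipm(rng.normal(size=n), rng.normal(size=n))           # P = Q
diff = ipm(rng.normal(size=n), rng.normal(loc=1.0, size=n))  # P != Q
print(same, diff)
```

When the two distributions coincide the empirical IPM shrinks toward zero, mirroring the observation in the text that vanishes when the process has an identical distribution at every time.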

According to (16), we then obtain that
In particular, according to (23), if the stochastic process has an identical distribution at every time, the quantity is equal to zero.

3.3. Upper Bound of Quantity

According to (17), it follows from Condition (C2) that
where lies between and . According to (25), the quantity can be bounded by the sum of the discrepancies between the supremum-achieving time points and the fixed observation time points . Recalling (13), the time points are determined by the selection of . It is noteworthy that the time points usually do not coincide with .

3.4. Upper Bound of Quantity

As in classical statistical learning theory, the task of bounding can be divided into three steps: the complexity measure of function classes, the deviation inequality, and the symmetrization inequality. Based on Azuma's inequality, we obtain a suitable deviation inequality for TDS and then the relevant symmetrization inequality. By using the uniform entropy number, we finally bound the quantity in probability.

3.4.1. Deviation Inequality

Let be the observation sequence of the stochastic process with a countable state space . Following the construction in [19], a filtration associated with can be built as follows:
(i) Let be the collection of all subsets of .
(ii) Set .
(iii) For any , let be the -algebra generated by .
Then, there naturally holds that
Since the observation sequences are independent of each other, Azuma's inequality [19, 20] yields the following result.

Lemma 2. Given a function , set , and for any , define the associated martingale differences as follows:
Then, there holds that, for any ,
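Azuma's inequality, which drives the deviation bound of Lemma 2, can be checked numerically. The simplest assumed example is a plus/minus-one random walk, whose martingale differences are bounded by one, so the bound reads P(|S_n| >= t) <= 2 exp(-t^2 / (2n)).

```python
import numpy as np

rng = np.random.default_rng(4)

# Numeric sanity check of Azuma's inequality for a martingale S_n with
# bounded differences |D_k| <= 1 (here: a simple +/-1 random walk):
#   P(|S_n| >= t) <= 2 * exp(-t**2 / (2 * n)).
n, trials, t = 100, 50000, 25.0
steps = rng.choice([-1.0, 1.0], size=(trials, n))
S = steps.sum(axis=1)

empirical = np.mean(np.abs(S) >= t)   # Monte Carlo tail probability
azuma = 2 * np.exp(-t ** 2 / (2 * n)) # Azuma's upper bound
print(empirical, azuma)
```

The simulated tail probability sits comfortably below the Azuma bound, as the inequality guarantees for any martingale with bounded differences.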

3.4.2. Symmetrization Inequality

Symmetrization inequalities are mainly used to replace the expected risk by an empirical risk computed on another sample set that is independent of the given sample set but has the identical distribution. In this manner, certain complexity measures can be incorporated into the resulting bounds.

For clarity of presentation, we introduce a notation used in the following discussion. Given a sample sequence , consider another sample sequence independent of . If has the same distribution as for any , the sequence is called the ghost sample sequence of . The following result is a direct extension of the classical symmetrization result of Lemma 2 in [21].

Theorem 3. Assume that is a function class consisting of functions with range and that is a stochastic process with a countable state space . Let and be drawn from in the time interval . Then, given any , one has, for any ,
where .
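The symmetrization device behind Theorem 3 can be illustrated empirically. Under an assumed toy class of threshold functions (not the class of the paper), the expected sup-deviation between the empirical risks on a sample and on its independent ghost sample is unchanged when each paired difference is flipped by an independent Rademacher sign, since the two samples are exchangeable coordinate by coordinate.

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed small function class: indicators 1{x <= c} for a few thresholds c.
thresholds = np.linspace(0.1, 0.9, 9)

def sup_dev(x, x_ghost, signs):
    """Sup over the class of the (sign-flipped) sample/ghost mean difference."""
    diffs = (x[None, :] <= thresholds[:, None]).astype(float) \
          - (x_ghost[None, :] <= thresholds[:, None]).astype(float)
    return np.max(np.abs((signs * diffs).mean(axis=1)))

n, trials = 200, 2000
plain, signed = [], []
for _ in range(trials):
    x, xg = rng.random(n), rng.random(n)
    plain.append(sup_dev(x, xg, np.ones(n)))
    signed.append(sup_dev(x, xg, rng.choice([-1.0, 1.0], size=n)))
print(np.mean(plain), np.mean(signed))
```

The two averages agree up to Monte Carlo error, which is exactly what licenses replacing the sample/ghost deviation by its Rademacher-symmetrized version in the proof.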

3.4.3. Approximation Error

The following is the definition of the covering number and we refer to [22] for details.

Definition 4. Let be a function class and let be a metric on . For any , the covering number of at radius with respect to the metric , denoted by , is the minimum size of a cover of radius .

The uniform entropy number (UEN) is a variant of the covering number, and we refer to [22] for details as well. By setting the metric (), the UEN is defined as follows:
Based on the uniform entropy number, the upper bound of can be obtained as follows.
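An empirical covering number in the sense of Definition 4 can be computed directly for a small example. The threshold class and the greedy construction below are illustrative assumptions; the greedy procedure returns a valid eps-cover, so its size upper-bounds the minimal covering number, and the UEN corresponds to the logarithm of the worst case of such counts over samples of a given size.

```python
import numpy as np

rng = np.random.default_rng(6)

# Evaluate an assumed class of threshold functions 1{x <= c} on a fixed
# sample; each function becomes a 0/1 row vector of length len(x).
x = rng.random(500)
F = (x[None, :] <= np.linspace(0, 1, 201)[:, None]).astype(float)

def covering_number(F, eps):
    """Greedy eps-cover of the rows of F under the empirical L1 metric.

    Returns the size of the cover found (an upper bound on the minimal
    covering number at radius eps)."""
    centers = []
    for f in F:
        if not any(np.mean(np.abs(f - c)) <= eps for c in centers):
            centers.append(f)
    return len(centers)

for eps in (0.2, 0.1, 0.05):
    print(eps, covering_number(F, eps))
```

As expected, the cover grows as the radius shrinks, roughly like 1/eps for this one-parameter class.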

Theorem 5. Assume that with is a function class such that, for any and any , the functional is bounded with the range . Let be a stochastic process with a countable state space. Let be independent observation sequences of at the time points . Let be the ghost samples of and denote for any . Then, given any , one has, for any ,

Compared with the classical generalization results for i.i.d. samples (cf. Theorem 2.3 of [22]), time-dependent data are involved in the analysis of upper bound (31). This result also differs from the generalization bound of [7] because more than one sequence of time-dependent samples is investigated. Denoting the right-hand side of (31) by , under the notation of Theorem 5 we immediately obtain that, with probability at least ,
which explicitly gives the upper bound of .

Finally, according to the upper bounds of () and Proposition 1, we can bound given in (6) as follows: with probability at least ,
From result (33), we find that the generalization performance of the ERM principle for TDS is affected by the following factors:
(i) The choice of the observation time points .
(ii) The number of TDS sequences.
(iii) The length of the sequences.
(iv) The properties of the martingale differences .
(v) The complexity of the function classes.
Moreover, the quantitative relationship among these factors is explicitly given by the above result.

4. Conclusion

In this paper, we propose a new framework to analyze the generalization properties of the ERM principle for sequences of TDS. By introducing four auxiliary quantities , , , and , we give detailed insight into the interactions among the complexity of function classes, the size of the samples, and the dependence among samples. To achieve the upper bound of , we develop the relevant deviation inequality and symmetrization inequality for sequences of TDS. This work extends the classical techniques of statistical learning theory. In future work, we will consider relaxing Conditions (C1) and (C2) and further investigate the properties of ERM-based learning processes for TDS sequences that are not observed at fixed observation time points.

Appendices

A. Proof of Proposition 1

According to (3), (4), and (5), we have
which leads to result (14). This completes the proof.

B. Proof of Theorem 5

Given a time sequence , let . For any , define a function :
Then, consider as independent Rademacher random variables, that is, independent -valued random variables taking either value with equal probability. Given and , we denote
and, for any ,
According to (5) and Theorem 3, given any , we have, for any ,

For any given , let be a -radius cover of with respect to the norm. Since is composed of bounded functions with range , we assume that the same holds for any . According to the triangle inequality, if is the function that achieves
then there must exist that satisfies
and meanwhile
Therefore, we arrive at

On the other hand, given any , we also have, for any ,
The last inequality of (B.9) follows from definition (30) and Lemma 2.

The combination of (B.4), (B.8), and (B.9) leads to the result: given any , there holds that, for any ,
This completes the proof.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is supported by the National Natural Science Foundation of China under Project 11401076, Project 61473328, Project 11171367, and Project 61473059.