Abstract
More and more mathematicians have turned their attention to the field of learning in Banach spaces, since Banach spaces may provide richer geometric structures than inner-product spaces. We investigate the convergence of a kernel-regularized online binary classification learning algorithm in the setting of reproducing kernel Banach spaces (RKBSs), design an online iteration algorithm based on the subdifferential of the norm and the logistic loss, and provide an upper bound for the learning rate, which shows that the online learning algorithm converges if the RKBS is 2-uniformly convex.
1. Introduction
Theoretical analysis of the convergence of binary classification learning algorithms has long been one of the central problems in learning theory and has attracted the attention of many researchers (see e.g., [1–13]).
A binary classification algorithm produces a binary classifier which divides the input space into two classes represented by . The classifier gives a prediction for each point (a vector with components corresponding to practical measurements). A real-valued function can be used to produce a classifier where
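With the two classes labeled $\pm 1$ (a standard convention; the labels here are our notational assumption), the classifier induced by a real-valued $f$ is its sign:

```latex
\operatorname{sgn}(f)(x) =
\begin{cases}
1,  & f(x) \ge 0, \\
-1, & f(x) < 0 .
\end{cases}
```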
In many cases, we use the samples or observations together with kernel-regularized off-line classification algorithms to produce a classifier from a hypothesis function space called a reproducing kernel Hilbert space (RKHS). Let be a probability distribution on and be a set of random samples drawn independently (i.i.d.) according to . Let be a compact subset in the Euclidean space. The RKHS associated with a Mercer kernel is a Hilbert space consisting of all the continuous real functions defined on , satisfying (see e.g., [14])
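Written out in standard notation (the symbols are our choice, following the convention of [14]), the reproducing property states that $K_x := K(x, \cdot) \in \mathcal{H}_K$ for every $x \in X$ and

```latex
f(x) = \langle f, K_x \rangle_K
\qquad \text{for all } f \in \mathcal{H}_K,\ x \in X .
```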
The kernel-regularized batch off-line binary classification learning algorithm associated with a classification loss and RKHS is defined as (see e.g., [3, 10–12, 15, 16])
where is the penalty parameter, is a binary classification loss which is often a differentiable convex function defined on satisfying , and or (see e.g., [3, 10, 11, 17, 18]). Then, the classifier can be obtained by taking
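A standard instance of such a regularized scheme, written in our own notation (sample $z = \{(x_i, y_i)\}_{i=1}^m$, loss $\phi$, penalty parameter $\lambda > 0$), is

```latex
f_z = \operatorname*{arg\,min}_{f \in \mathcal{H}_K}
      \frac{1}{m} \sum_{i=1}^{m} \phi\bigl(y_i f(x_i)\bigr)
      + \lambda \, \|f\|_K^2 ,
```

with the classifier then taken as $\operatorname{sgn}(f_z)$.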
For the regression function , we define the Bayes rule (see e.g., [19]) as
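In standard learning-theory notation (our choice of symbols), the regression function and the Bayes rule take the form

```latex
f_\rho(x) = \int_Y y \, d\rho(y \mid x)
          = P(y = 1 \mid x) - P(y = -1 \mid x),
\qquad
f_c(x) = \operatorname{sgn}\bigl(f_\rho(x)\bigr).
```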
We define the misclassification probability of a classifier as
Then, the aim of learning theory is to bound the misclassification error in probability (see e.g., [3, 16])
Many mathematicians have conducted extensive research in this field (see e.g., [3, 10–13, 16]).
In the case of batch learning, we need to process all samples in each round of training. When the amount of data is large or new sample points are added, the efficiency of batch learning decreases significantly. Online learning is one of the most effective approaches proposed for analyzing and processing big data in various applications, such as communication, electronics, and other fields (see e.g., [20–23]). The performance of kernel-based regularized online learning algorithms has been investigated, and their effectiveness has been verified (see e.g., [24–26]). Unlike off-line learning algorithms, online learning algorithms process the observations one by one, and the output is adjusted in time according to the result of the previous step. For example, one way of obtaining the learning sequence from the observations is the following online iteration (see e.g., [13, 27])
where is called the step size, is the regularization parameter, and the sequence satisfies . Algorithm (8) corresponds to the batch off-line binary learning model (3).
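To make this one-by-one processing concrete, here is a minimal Python sketch of a kernel-regularized online update with the logistic loss in an RKHS. The Gaussian kernel, the step-size schedule $\eta_t = 1/\sqrt{t}$, and the regularization schedule $\lambda_t = 1/t$ are illustrative assumptions of ours, not the exact choices analyzed in this paper.

```python
import numpy as np

# Sketch of a kernel-regularized online update (our illustration).
# The learned function is stored by its kernel expansion
#   f_t(x) = sum_i c_i K(x_i, x),
# and each observation (x_t, y_t), y_t in {-1, +1}, triggers
#   f_{t+1} = (1 - eta_t*lam_t) f_t - eta_t * phi'(y_t f_t(x_t)) y_t K(x_t, .)

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

def online_kernel_logistic(stream, kernel=gaussian_kernel):
    points, coeffs = [], []
    for t, (x, y) in enumerate(stream, start=1):
        eta, lam = 1.0 / np.sqrt(t), 1.0 / t          # illustrative schedules
        f_x = sum(c * kernel(p, x) for p, c in zip(points, coeffs))
        # derivative of the logistic loss phi(v) = ln(1 + e^{-v}) at v = y*f(x),
        # multiplied by y (chain rule): phi'(y f(x)) * y = -y / (1 + e^{y f(x)})
        grad = -y / (1.0 + np.exp(y * f_x))
        coeffs = [(1.0 - eta * lam) * c for c in coeffs]  # regularization shrink
        points.append(x)
        coeffs.append(-eta * grad)                        # new kernel atom
    return lambda x: sum(c * kernel(p, x) for p, c in zip(points, coeffs))
```

The returned function is the kernel expansion built up one observation at a time; taking its sign yields the online classifier.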
The geometric properties of Hilbert spaces are well understood, and the bilinearity of the inner product makes a thorough analysis possible. However, the simple structure of Hilbert spaces is also a limitation, since much data do not come with a distance induced from an inner product. Moreover, all Hilbert spaces with orthonormal bases of the same cardinality are isometrically isomorphic, so there is essentially only one inner-product space structure. On the other hand, Banach spaces may have richer geometric structures and various distances, can provide a more natural notion of distance between data points, and are more suitable for describing complicated data (see e.g., [28]). The reproducing property theory for Banach spaces has been investigated (see e.g., [29–33]), and the corresponding kernel-regularized regression learning has been defined and investigated by many mathematicians (see e.g., [4, 34–37]).
A question then arises: can we design an online classification learning algorithm corresponding to the kernel-regularized binary classification in the setting of RKBSs? Such discussions are rare in the literature. On the other hand, Banach space geometry properties and techniques have been used to design descent iteration algorithms for solving the minimization of Tikhonov functionals in Banach spaces (see e.g., [38]) and to design sharp approximation methods in approximation theory (see [39–41]). So it is possible for us to design online learning algorithms to generate binary classifiers. This is the main motivation for writing this manuscript.
We denote by the Banach space with a dual space and norm . For and , we write .
A reproducing kernel Banach space (RKBS) on is a reflexive Banach space of real functions on , and there exists a unique function called the reproducing kernel of such that
and
When is an RKHS, is indeed the reproducing kernel in the usual sense (see e.g., [14]).
Since is a reflexive Banach space, we have
We express our idea through the logistic loss as follows (see e.g., [42–44]):
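In its usual form (our notation), the logistic loss and its derivative are

```latex
\phi(v) = \ln\bigl(1 + e^{-v}\bigr),
\qquad
\phi'(v) = -\frac{1}{1 + e^{v}},
```

a convex, differentiable, and strictly decreasing function on $\mathbb{R}$.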
The batch kernel-regularized off-line learning algorithm corresponding to is
where is an RKBS associated with a reproducing kernel . The integral-type model corresponding to scheme (13) is
where is the generalization error defined as
We shall design a kind of kernel-regularized online learning algorithm associated with the loss in the setting that is an RKBS with respect to a kernel on and use the Banach geometry properties of to measure the learning rate. For this purpose, we need some notions and techniques from Banach space geometry.
Let be a Banach space and be a convex function. Then,and we call the subdifferential of at
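Written out in standard notation, the subdifferential of a convex function $\Phi : B \to \mathbb{R}$ at $x \in B$ is the set

```latex
\partial \Phi(x) = \bigl\{\, x^* \in B^* :
\Phi(y) \ge \Phi(x) + \langle y - x, x^* \rangle
\ \text{for all } y \in B \,\bigr\}.
```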
Let be a normed space and be a real-valued functional. We say is Gateaux differentiable at if there is such that for any , there holds
and we write .
For a Banach space , we define the modulus of convexity and the modulus of smoothness as
and
We say is uniformly convex if for all and uniformly smooth if (see e.g., [45, 46]). Let be real numbers. We say is p-uniformly convex (resp. q-uniformly smooth) if there is a constant such that (resp. ).
It is shown in [47] that the Banach space is p-uniformly convex if and only if is q-uniformly smooth, and is q-uniformly smooth if and only if is p-uniformly convex, where
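For reference, the standard definitions of these moduli (see e.g., [45, 46]) are

```latex
\delta_B(\varepsilon) = \inf\Bigl\{\, 1 - \tfrac{1}{2}\|x + y\|
  : \|x\| = \|y\| = 1,\ \|x - y\| \ge \varepsilon \,\Bigr\},
\qquad
\rho_B(\tau) = \sup\Bigl\{\, \tfrac{\|x + \tau y\| + \|x - \tau y\|}{2} - 1
  : \|x\| = \|y\| = 1 \,\Bigr\},
```

so that p-uniform convexity means $\delta_B(\varepsilon) \ge C \varepsilon^p$, q-uniform smoothness means $\rho_B(\tau) \le C \tau^q$, and the duality of [47] pairs the conjugate exponents $\tfrac{1}{p} + \tfrac{1}{q} = 1$.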
We define and . Then, the uniform convexity and uniform smoothness of and can be described by and , respectively (see e.g., [46, 48–51]). If is an RKHS, then . This fact encourages us to design an online learning algorithm with the help of and . We modify the kernel-regularized online learning algorithm (8) according to the subgradient method (see Algorithm 1 in Section 6.2) and define the following iteration algorithm:
where , and are chosen arbitrarily. The excess misclassification error (7) is
Since and , we have and . By Theorem 34 in Appendix A of [3], we have a constant such that for any measurable function , there holds
Therefore, to estimate error (21), we need to bound the excess generalization error
where is the minimizer of the generalization error
We can decompose the error (23) as
where
is a K-functional in learning theory whose decay is determined by the approximation ability of , and
which is called the sample error; its convergence rate will be bounded later.
We shall first give the convergence rate estimate for and then show the convergence rate for (25) under the assumption that has a decay of power.
The main contribution lies in extending the online iteration learning algorithm from the RKHS setting to an RKBS setting for the first time and in investigating its performance.
The organization of this paper is as follows: In Section 2, we provide some assumptions on and the kernel , based on which we give the main results of the present manuscript, including an upper bound for the error and an explicit convergence rate for (21). Two kinds of RKBSs satisfying the assumptions of the present paper are given in Section 3. In Section 4, we provide some auxiliary lemmas. The theorems and corollaries are proved in Section 5. Section 6 is an appendix which provides some inequalities for Banach spaces and the subgradient method.
In what follows, we shall write if there is a constant such that . We say if and .
2. The Results
To give the upper bound estimate for error (23), we provide two assumptions.
Assumption 1. The K-functional has the decay of power, i.e.,
where .
Assumption 2. and are uniformly continuous with respect to on , and there is a constant such that
In Assumption 1, denotes the approximation ability of the RKBS , whose convergence rate has been investigated in many papers (see e.g., [52–54]); it is a usual assumption in learning theory (see e.g., [11, 12, 36]).
We point out here that RKBSs satisfying Assumption 2 do exist; in Section 3, we provide two kinds of RKBSs whose reproducing kernels satisfy Assumption 2. Also, by the results of Lemma 9 in [35], we know that if Assumption 2 holds, then both and are compact spaces, so and exist and are unique.
2.1. The Convergence Rate
We now give some bounds for the convergence of the online iteration algorithm.
Theorem 1. Let be a 2-uniformly convex RKBS whose reproducing kernel satisfies Assumptions 1 and 2. Assume that the sequences satisfy and . Then,
We now give an error bound for .
Theorem 2. Let be a 2-uniformly convex RKBS whose reproducing kernel satisfies Assumption 2. Let be defined as in (14) and be the sequence defined by algorithm (20). If with and , then
where is the constant defined in the convex inequality (140).
Under some assumptions, error (31) can be bounded explicitly.
Corollary 1. Let be a 2-uniformly convex RKBS whose reproducing kernel satisfies Assumption 2. Let be the sequence produced by the algorithm (20). For any , we take . If the step size is chosen as , then
where is a constant depending only on and .
Finally, we can give an error upper bound for in mathematical expectation.
Corollary 2. Let be a 2-uniformly convex RKBS whose reproducing kernel satisfies Assumptions 1 and 2. Let be the sequence produced by the algorithm (20). For , and , we choose . Then,
2.2. Comments
At first glance, the form of the convergence results for online learning in the present paper is the same as in some published papers (see e.g., [13, 24, 25, 27]). However, there are essential differences. First, our investigations are in the setting of RKBSs. Note that a 2-uniformly convex Banach space has properties closest to those of a Hilbert space, so it is natural that the form of the iteration sequences and of the performance bounds may be the same as in an RKHS setting.
Second, we expect that the geometric properties of the RKBS hypothesis space influence the learning rate (this fact has been proved in the case of batch off-line learning; see [34]). So it will be an interesting topic to design online iteration algorithms according to the geometric parameters of the RKBS hypothesis space, for example, the modulus of convexity and the parameters in the convex inequality (see Section 6), and to investigate the performance. From this view, we can say that the present paper is only a beginning.
Third, the choices and are arbitrary. It would be interesting if these choices could be made according to some rules.
3. Examples
We now construct two kinds of RKBSs which are 2-uniformly convex and whose reproducing kernels satisfy Assumption 2.
Let be two given real numbers, and let be the generalized Jacobi weight function satisfying . The weighted space consists of the real functions on for which the norms
are finite. Denote by the Jacobi orthonormal polynomial of order , normalized by
We have
For given , we define generalized symmetric translations as
Then, the bivariate function is a Mercer kernel on .
Proposition 1. Let and . Assume that satisfies , both and hold, and there exists a constant such that . We define
With the norm,
We define
With the norm,
Also, we define a continuous bilinear form on as
Then, is a reproducing kernel for and , i.e.,
Moreover, there are the following results:
(i) is 2-uniformly smooth, and is 2-uniformly convex.
(ii) is uniformly continuous about ; is uniformly continuous about .
(iii) There holds
and
Proof. The proofs can be found in the Appendix of [36].
Let and be real numbers, and denote by the space of all measurable real functions on such that
where . The function is defined on by
which is called the normalized Bessel function of the first kind and order , where is the Gamma function. For , we define the Fourier–Bessel transform by
For , we define
Proposition 2. Let and . Assume that satisfies , both and hold.
We define
With the norm ,
and with the norm,
We define a bilinear form on as
Then, is a reproducing kernel for and , i.e.,
Moreover, there are the following results:
(i) is 2-uniformly smooth, and is 2-uniformly convex.
(ii) is uniformly continuous about ; is uniformly continuous about .
(iii) There hold the inequalities
and
Proof. The proofs can be found in the Appendix of [36].
4. Lemmas
To prove the results in Section 2, we provide here some lemmas which are proved in Section 5.
Lemma 1. Let be a convex loss function, and for any , there holds
Lemma 2 (see Remark 4.6.1 in [55] and line 15 on page 1128 of [46]). The sets and defined in Section 2 have the following expressions:
and
Lemma 3. Let be the sequence produced by the algorithm (20), and take . Then, . Since , we have
Also, since , then
Proof. (61) and (62) can be deduced from (58) and (59).
Lemma 4. Let be the sequence produced by the algorithm (20). If for any , then there holds
Lemma 5. Let be defined as in (14). Then, there exists such that
Lemma 6. Let be defined as in (14) and be a constant in inequality (138) in the Appendix. Then, for any , there holds
Lemma 7. Let be a 2-uniformly convex RKBS; therefore, is a 2-uniformly smooth RKBS. Let be the sequence produced by the algorithm (20) and be the constant in inequality (140) in the Appendix. If , then
Furthermore, there holds
Lemma 8 (see (56) in Lemma 4 of [13]). For any and , there holds
Lemma 9 (see Lemma 5 of [13]). Let and Then, is bounded by
5. Proofs
We now give proofs for the lemmas, theorems, and corollaries in the present paper.
Proof of Lemma 1. We denote by the space of continuous functions on and
Combining (11) and (29) with the reproducing property, we know that
Therefore,
According to the mean value theorem, there exists between and such that
Due to the reproducing property and (29), we have
Since , we have
(57) thus holds.
Proof of Lemma 4. We prove it by induction. According to the definition of the algorithm (20), we know . By (58), we have , where . Since , then , and by the definition of a Banach space of functions, we have . The case is trivial since . We suppose that (63) holds true for . We now consider . It can be expressed as
Under Assumption 2, there is
Since ,
By the assumption that , we have
Also, there holds
(63) thus holds.
Proof of Lemma 5. Since is the solution of (14), there holds
Note that and are convex functions; we have
Then, there is such that
(64) thus holds.
Proof of Lemma 6. Since is a 2-uniformly convex RKBS, for any , by inequality (138), we have
Therefore,
Combining (64) with the above inequality, we have
(65) then holds.
Proof of Lemma 7. By Lemma 3, we have
Also, we know that . Since is a 2-uniformly smooth RKBS, by inequality (140) and equality (62), we have
We take
Since is a 2-uniformly convex RKBS, by inequality (138), we have that
It follows that
where we have used the fact that is a convex function and
Since depends on but not on , it follows that
Combining (87) and (92), we have
According to (65), we have
Combining (61), (62), (93), and (94), we arrive at
By (63), we know
Substituting (96) into (95), we obtain
Applying the above relation iteratively for , we have
Since , and according to the definition of , we have . It follows that
(66) thus holds. According to the inequality , for any , we obtain (67) from (66).
Proof of Theorem 1. It is easy to see that
By Assumption 1, there is
Therefore, for any , there exists such that for there holds
According to the assumption , for every , we have
Since is fixed, there exists some such that for every , it holds that
For any , we have
According to the assumptions on , there exists an integer such that for all ; hence,
We know that there exists some such that and
Since , we denote .
From (106) and (108), we know that for any , there exists such that for , there holds
Let ; then by (66), (102), and (109), we have
whenever . Thus, we have (30).
Proof of Theorem 2. According to (93), we need a more concrete expression for
By (61) and (143) and Assumption 2, we have
When , we have
Since and , we have
By the definition of , we can see
Substituting (114) into (93), we get
According to (115), we have
Combining (116) with (117), then
Based on the assumptions about , we know that
Also, by (65), we have
We take
Then, by (120), we have
Combining (116) with (122), then
We denote . For , we apply the iterative relation (123) repeatedly, and we have
For any , we know that the inequality holds, and (124) implies that
We take
We have
Then, by (125) and the assumption about the step size , we have
Now, we estimate and . By (68), we obtain the following estimate for :
On the other hand, by (69) with , we have
Therefore,
(131) yields the following estimate for :
Since and , the conclusion follows by combining (128), (129), and (132).
Proof of Corollary 1. For any , by (31) with and the fact
we have
Since for any , there exists such that ; hence, the first term on the right-hand side of (134) decays in the form of for any large .
However, the second term on the right-hand side of (134) is bounded by . Consequently, there exists a constant depending only on and such that
(32) thus holds.
Proof of Corollary 2. Collecting decomposition (25), (28) together with (32), we have (33).
6. Appendix: Further Results on Convex Analysis
We give some known inequalities in Banach spaces and the optimization algorithm about the subgradient.
6.1. Some Inequalities
By (3.2) in Corollary 1 of [46], we know is -uniformly convex if and only if there is a positive such that for all and all , there holds
By (ii) in Corollary 1’ of [46], we know is -uniformly smooth if and only if there is a constant such that for all and all , there holds
In particular, is a -uniformly convex space if and only if there exists a constant such that for all , there holds
By in Corollary 1’ of [46], we know is -uniformly smooth, and there is a constant such that for all and all , there holds
In particular, is a -uniformly smooth space if and only if there exists a constant such that for all , there holds
By (29) in Lemma 2.1 of [56], we know that if is a uniformly convex Banach space with modulus of convexity of power type , then there is a constant such that
for all and , where
Combining (141) with (142) and taking , we have
It is known that all the Hilbert spaces, and for , the Banach spaces and (Sobolev spaces) all are both uniformly convex and uniformly smooth (see e.g., [50, 57, 58]).
6.2. The Subgradient Method
Let be a convex function defined on the Euclidean space. Then, the minimization problem has a solution , i.e., if and only if . To obtain an approximation of , one uses the classical subgradient descent method.
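As an illustration (the objective, the subgradient oracle, and the step sizes below are our own choices, not the paper's), the classical subgradient method can be sketched in Python as follows:

```python
import numpy as np

# Minimal sketch of the classical subgradient method for a convex function F
# on Euclidean space: w_{t+1} = w_t - eta_t * g_t with g_t a subgradient of F
# at w_t. The example objective and step sizes are illustrative.

def subgradient_descent(F, subgrad, w0, n_steps=2000):
    w = float(w0)
    best_w, best_val = w, F(w)
    for t in range(1, n_steps + 1):
        w = w - subgrad(w) / np.sqrt(t)   # diminishing steps eta_t = 1/sqrt(t)
        if F(w) < best_val:               # track the best iterate: for a
            best_w, best_val = w, F(w)    # nonsmooth F the last need not be best
    return best_w

# Example: F(w) = |w - 3| is convex but not differentiable at its minimizer 3;
# sign(w - 3) is a valid subgradient everywhere (0 at the kink).
F = lambda w: abs(w - 3.0)
subgrad = lambda w: np.sign(w - 3.0)
```

Because F need not be differentiable at the minimizer, the method tracks the best iterate seen so far rather than returning the last one.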
We observe that is chosen arbitrarily.
Data Availability
Data are not available as no new data were created or analyzed in this study.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Acknowledgments
This work was partially supported by the NSFC/RGC Joint Research Scheme (Project nos. 12061160462 and N_CityU102/20) and NSF (Project no. 61877039) of China.