Abstract

Learning in Banach spaces has attracted growing attention since Banach spaces provide a far richer variety of geometric structures than Hilbert spaces. We investigate the convergence of a kernel-regularized online binary classification learning algorithm in the setting of reproducing kernel Banach spaces (RKBSs), design an online iteration algorithm based on the subdifferential of the norm and the logistic loss, and provide an upper bound for the learning rate, which shows that the online learning algorithm converges provided the RKBS is 2-uniformly convex.

1. Introduction

Theoretical analysis of the convergence of binary classification learning algorithms has always been one of the central problems in learning theory and has attracted the attention of many researchers (see e.g., [1–13]).

A binary classification algorithm produces a binary classifier, which divides the input space into two classes represented by . The classifier gives a prediction for each point (a vector whose components correspond to practical measurements). A real-valued function can be used to produce a classifier , where
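For illustration only (this code is not part of the paper), the passage above amounts to the familiar sign rule: a real-valued decision function is turned into a binary classifier by thresholding at zero. A minimal Python sketch, where the linear score function and the sample inputs are hypothetical:

```python
import numpy as np

def classifier_from_score(f):
    """Wrap a real-valued decision function f into a binary classifier.

    The returned function predicts +1 when f(x) >= 0 and -1 otherwise,
    i.e., the sign rule described in the text.
    """
    def classify(x):
        return 1 if f(x) >= 0.0 else -1
    return classify

# Hypothetical linear score function f(x) = <w, x>, used only as an example.
w = np.array([0.5, -1.0])
classify = classifier_from_score(lambda x: float(np.dot(w, x)))
print(classify(np.array([2.0, 0.3])))   # prints 1
print(classify(np.array([0.1, 1.0])))   # prints -1
```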

In many cases, we use the samples or observations together with kernel-regularized off-line classification algorithms to produce a classifier from a hypothesis function space called a reproducing kernel Hilbert space (RKHS). Let be a probability distribution on and be a set of random samples drawn independently and identically distributed (i.i.d.) according to . Let be a compact subset of a Euclidean space. The RKHS associated with a Mercer kernel is a Hilbert space consisting of all the continuous real-valued functions defined on and satisfying (see e.g., [14])

The kernel-regularized batch off-line binary classification learning algorithm associated with a classification loss and the RKHS is defined as (see e.g., [3, 10–12, 15, 16]) where is the penalty parameter, is a binary classification loss which is often a differentiable convex function defined on satisfying , and or (see e.g., [3, 10, 11, 17, 18]). Then, the classifier can be obtained by taking
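As an illustrative sketch (not the paper's batch scheme itself), the following Python code minimizes a regularized empirical logistic risk over a finite kernel expansion, which is how such a batch off-line algorithm is typically implemented in the RKHS case via the representer theorem; the Gaussian kernel, the regularization parameter, and the L-BFGS solver are assumptions made only for this example:

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, Z, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def batch_kernel_logistic(X, y, lam=0.1, sigma=1.0):
    """Regularized ERM over f(x) = sum_i c_i K(x_i, x):
    minimize (1/m) sum_i log(1 + exp(-y_i (K c)_i)) + lam * c^T K c."""
    K = gaussian_kernel(X, X, sigma)
    m = len(y)

    def objective(c):
        margins = y * (K @ c)
        return np.mean(np.log1p(np.exp(-margins))) + lam * c @ K @ c

    c = minimize(objective, np.zeros(m), method="L-BFGS-B").x
    # The classifier is the sign of the learned real-valued function.
    return lambda Xnew: np.sign(gaussian_kernel(Xnew, X, sigma) @ c)
```

The penalty term `c @ K @ c` equals the squared RKHS norm of the kernel expansion, matching the norm penalty in the scheme above.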

For the regression function , we define the Bayes rule (see e.g., [19]) as

We define the misclassification probability of a classifier as

Then, the aim of learning theory is to bound the misclassification error in probability (see e.g., [3, 16])
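For the reader's convenience, the standard objects of binary classification referred to in the three paragraphs above can be written as follows; the notation ($f_\rho$, $f_c$, $\mathcal{R}$, $\rho$) is chosen here only for illustration and may differ from the paper's symbols:

```latex
% Standard definitions in binary classification learning theory, stated here
% for reference; \rho is a probability measure on X \times \{-1, 1\}.
\[
  f_\rho(x) = \int_{\{-1,1\}} y \, d\rho(y \mid x)
            = \rho(y = 1 \mid x) - \rho(y = -1 \mid x)
  \qquad \text{(regression function)},
\]
\[
  f_c(x) = \operatorname{sgn}\bigl(f_\rho(x)\bigr)
  \qquad \text{(Bayes rule)},
\]
\[
  \mathcal{R}(\mathcal{C}) = \operatorname{Prob}\{\mathcal{C}(x) \neq y\}
  = \int_X \rho\bigl(y \neq \mathcal{C}(x) \mid x\bigr)\, d\rho_X(x)
  \qquad \text{(misclassification error)},
\]
% and the quantity to be bounded is the excess misclassification error
% \mathcal{R}(\operatorname{sgn}(f)) - \mathcal{R}(f_c).
```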

Many mathematicians have conducted extensive research in this field (see e.g., [3, 10–13, 16]).

In the case of batch learning, we need to use all the samples in each round of training. When the amount of data is large or new sample points are added, the learning efficiency of batch learning decreases significantly. Online learning is one of the most effective approaches proposed for analyzing and processing big data in various applications, such as communication, electronics, and other fields (see e.g., [20–23]). The performance of kernel-based regularized online learning algorithms has been investigated, and their effectiveness has been verified (see e.g., [24–26]). Unlike off-line learning algorithms, online learning algorithms process the observations one by one, and the output is adjusted in time according to the result of the previous step. For example, a way of obtaining a learning sequence through the observations is the following online iteration (see e.g., [13, 27]) where is called the step size, is the regularization parameter, and the sequence satisfies . Algorithm (8) corresponds to the batch off-line binary learning model (3).
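As a schematic RKHS analogue of an online iteration of this type (the exact form of (8) is not reproduced here, so the update below is the commonly used rule $f_{t+1} = f_t - \eta_t\bigl(\phi'(y_t f_t(x_t))\, y_t K_{x_t} + \lambda_t f_t\bigr)$ with the logistic loss; the Gaussian kernel and the step-size/regularization callables are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def online_kernel_logistic(stream, eta, lam, sigma=1.0):
    """Schematic online update
        f_{t+1} = (1 - eta_t * lam_t) f_t - eta_t * phi'(y_t f_t(x_t)) * y_t * K_{x_t},
    with the logistic loss phi(v) = log(1 + exp(-v)); f_t is kept as a kernel expansion."""
    centers, coeffs = [], []

    def f(x):
        return sum(c * gaussian_kernel(x, z, sigma) for c, z in zip(coeffs, centers))

    for t, (x, y) in enumerate(stream, start=1):
        margin = y * f(x)
        dphi = -1.0 / (1.0 + np.exp(margin))                    # phi'(margin)
        coeffs = [(1.0 - eta(t) * lam(t)) * c for c in coeffs]  # shrinkage from the penalty
        centers.append(x)
        coeffs.append(-eta(t) * dphi * y)                       # new term along K_{x_t}
    return f
```

Each observation adds one expansion term, so the memory of this sketch grows linearly with the number of samples processed.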

The geometric properties of Hilbert spaces are well understood, and the bilinearity of the inner product makes a thorough analysis possible. However, the simple structure of a Hilbert space has limitations, since much data does not come with a distance induced by an inner product. Moreover, all Hilbert spaces with the same basis cardinality are isometrically isomorphic, so up to isomorphism there is essentially only one inner-product space. On the other hand, Banach spaces may have richer geometric structures and various distances, can provide a more natural notion of distance between data points, and are more suitable for describing complicated data (see e.g., [28]). The reproducing property theory for Banach spaces has been investigated (see e.g., [29–33]), and the corresponding kernel-regularized regression learning has been defined and investigated by many mathematicians (see e.g., [4, 34–37]).

A question then arises: can we design an online classification learning algorithm corresponding to kernel-regularized binary classification in the setting of RKBSs? Such discussions appear to be rare in the literature. On the other hand, Banach space geometry and related techniques have been used to design descent iteration algorithms for minimizing Tikhonov functionals in Banach spaces (see e.g., [38]) and to design sharp approximation methods in approximation theory (see [39–41]). It is therefore possible for us to design online learning algorithms to generate binary classifiers. This is the main motivation for writing this manuscript.

We denote by the Banach space with a dual space and norm . For and , we write .

A reproducing kernel Banach space (RKBS) on is a reflexive Banach space of real-valued functions on for which there exists a unique function , called the reproducing kernel of , such that and

When is an RKHS, is indeed the reproducing kernel in the usual sense (see e.g., [14]).

Since is a reflexive Banach space, we have

We express our idea through the logistic loss as follows (see e.g., [42–44]):
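In its standard form (we assume the paper follows this convention, possibly up to the base of the logarithm), the logistic loss is

```latex
\[
  \phi(v) = \log\bigl(1 + e^{-v}\bigr), \qquad v \in \mathbb{R}.
\]
% phi is convex, strictly decreasing, and differentiable, with
% phi'(v) = -1/(1 + e^{v}), so in particular phi'(0) = -1/2 < 0,
% which makes it a classification-calibrated loss.
```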

The batch kernel-regularized off-line learning algorithm corresponding to is where is an RKBS associated with a reproducing kernel . The integral-type model corresponding to scheme (13) is where is the generalization error defined as

We shall design a kernel-regularized online learning algorithm associated with the loss in the setting where is an RKBS with respect to a kernel on , and we shall use the Banach geometry properties of to measure the learning rate. For this purpose, we need some notions and techniques from Banach space geometry.

Let be a Banach space and be a convex function. Then, and we call the subdifferential of at .
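In the usual notation of convex analysis (assumed here to match the definition above), for a convex function $\Omega$ on a Banach space $B$ the subdifferential at $f \in B$ is the set

```latex
\[
  \partial\Omega(f) =
  \bigl\{\, \xi \in B^{*} :
     \Omega(g) - \Omega(f) \ge \langle g - f, \xi \rangle
     \ \text{for all } g \in B \,\bigr\},
\]
% and f minimizes \Omega over B if and only if 0 \in \partial\Omega(f).
```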

Let be a normed space and be a real-valued functional. We say is Gâteaux differentiable at if there is such that for any , there holds and we write .

For a Banach space , we define the modulus of convexity and the modulus of smoothness as and

We say is uniformly convex if for all and uniformly smooth if (see e.g., [45, 46]). Let be real numbers. We say is p-uniformly convex (resp. q-uniformly smooth) if there is a constant such that (resp. ).
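For reference, the standard definitions of these moduli (see e.g., [45, 46]) read as follows; the symbols $\delta_B$ and $\rho_B$ are our notation:

```latex
% Standard moduli of convexity and smoothness of a Banach space B.
\[
  \delta_B(\varepsilon) =
  \inf\Bigl\{\, 1 - \tfrac{1}{2}\|x + y\| :
      \|x\| = \|y\| = 1,\ \|x - y\| \ge \varepsilon \,\Bigr\},
  \qquad 0 < \varepsilon \le 2,
\]
\[
  \rho_B(\tau) =
  \sup\Bigl\{\, \tfrac{1}{2}\bigl(\|x + y\| + \|x - y\|\bigr) - 1 :
      \|x\| = 1,\ \|y\| \le \tau \,\Bigr\},
  \qquad \tau > 0.
\]
% B is uniformly convex if delta_B(eps) > 0 for all eps > 0 and uniformly smooth
% if rho_B(tau)/tau -> 0 as tau -> 0; it is p-uniformly convex if
% delta_B(eps) >= C eps^p and q-uniformly smooth if rho_B(tau) <= C tau^q
% for some constant C > 0.
```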

It is shown in [47] that a Banach space is p-uniformly convex if and only if is q-uniformly smooth, and is q-uniformly smooth if and only if is p-uniformly convex, where $\frac{1}{p} + \frac{1}{q} = 1$.

We define and . Then, the uniform convexity and uniform smoothness of and can be described by and , respectively (see e.g., [46, 48–51]). If is an RKHS, then . This fact encourages us to design an online learning algorithm with the help of and . We modify the kernel-regularized online learning algorithm (8) according to the subgradient method (see Algorithm 1 in Section 6.2) and define the following iteration algorithm (a schematic sketch is given after the algorithm steps below): where , and are chosen arbitrarily. The excess misclassification error (7) is

Step 1: Given an initial point .
Step 2: Compute
Step 3: Choose step size and set .
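The following Python sketch is only an illustrative analogue of the iteration described by Steps 1-3 and is not the paper's exact update (20): it performs a subgradient-type online step in the dual space and maps back through a duality mapping, using the sequence space $\ell^{p}$ with $1 < p \le 2$ (which is 2-uniformly convex) as a stand-in for the RKBS and a hypothetical feature map in place of the kernel sections:

```python
import numpy as np

def duality_map(x, p):
    """Normalized duality map of l^p: J(x)_i = ||x||_p^{2-p} |x_i|^{p-1} sgn(x_i).
    It maps l^p onto its dual l^q (1/p + 1/q = 1) and is the gradient of (1/2)||x||_p^2."""
    nrm = np.linalg.norm(x, ord=p)
    if nrm == 0.0:
        return np.zeros_like(x)
    return nrm ** (2 - p) * np.abs(x) ** (p - 1) * np.sign(x)

def online_rkbs_sketch(stream, feature, eta, lam, dim, p=1.5):
    """Illustrative analogue (NOT iteration (20)): subgradient step in the dual,
    then back to the primal space via the inverse duality map."""
    q = p / (p - 1.0)
    f = np.zeros(dim)                                  # primal iterate (coefficients of f_t)
    for t, (x, y) in enumerate(stream, start=1):
        phi_x = feature(x)                             # hypothetical feature vector of x
        margin = y * float(f @ phi_x)
        dphi = -1.0 / (1.0 + np.exp(margin))           # derivative of the logistic loss
        grad_dual = dphi * y * phi_x + lam(t) * duality_map(f, p)   # subgradient in the dual
        f_dual = duality_map(f, p) - eta(t) * grad_dual
        f = duality_map(f_dual, q)                     # inverse map back to the primal space
    return f
```

The final mapping back to the primal space uses the fact that, for this choice of gauge, the duality map of $\ell^{q}$ inverts that of $\ell^{p}$.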

Since and , we have and . By Theorem 34 in Appendix A of [3], there is a constant such that for any measurable function , there holds

Therefore, to estimate error (21), we need to bound the excess generalization error where is the minimizer of the generalization error

We can decompose error (23) as where is a K-functional in learning theory whose decay is determined by the approximation ability of , and , which is called the sample error; its convergence rate will be bounded later.

We shall first give a convergence rate estimate for and then show the convergence rate for (25) under the assumption that has a power-type decay.

The main contribution of this paper is that it extends the online iteration learning algorithm from the RKHS setting to the RKBS setting for the first time and investigates its performance.

The organization of this paper is as follows: In Section 2, we provide some assumptions on and the kernel , based on which we state the main results of the present manuscript, among which there are an upper bound for the error and an explicit convergence rate for (21). Two kinds of RKBSs satisfying the assumptions of the present paper are given in Section 3. In Section 4, we provide some auxiliary lemmas. Theorems and corollaries are proved in Section 5. Section 6 is an appendix which provides some inequalities for Banach spaces and the subgradient method.

In what follows, we shall write if there is a constant such that . We say if and .

2. The Results

To give the upper bound estimate for error (23), we provide two assumptions.

Assumption 1. The K-functional has a power-type decay, i.e., , where .

Assumption 2. and are uniformly continuous about on , and there is a constant such that
In Assumption 1, denotes the approximation ability of the RKBS , whose convergence rate has been investigated in many papers (see e.g., [52–54]); it is a usual assumption in learning theory (see e.g., [11, 12, 36]).
We point out that RKBSs satisfying Assumption 2 do exist; in Section 3, we provide two kinds of RKBSs whose reproducing kernels satisfy Assumption 2. Also, by the results of Lemma 9 in [35], we know that if Assumption 2 holds, then both and are compact spaces, so and exist and are unique.

2.1. The Convergence Rate

We now give some bounds for the convergence of the online iteration algorithm.

Theorem 1. Let be a 2-uniformly convex RKBS whose reproducing kernel satisfies Assumptions 1 and 2. Assume that the sequences satisfy and . Then,

We now give an error bound for .

Theorem 2. Let be a 2-uniformly convex RKBS whose reproducing kernel satisfies Assumption 2. Let be defined as in (14) and be the sequence defined by algorithm (20). If with and , then where is the constant defined in the convex inequality (140).

Under some assumptions, error (31) can be bounded explicitly.

Corollary 1. Let be a 2-uniformly convex RKBS whose reproducing kernel satisfies Assumption 2. Let be the sequence produced by algorithm (20). For any , we take . If the step size is chosen as , then where is a constant depending only on and .

Finally, we can give an error upper bound for in mathematical expectation.

Corollary 2. Let be a 2-uniformly convex RKBS whose reproducing kernel satisfies Assumptions 1 and 2. Let be the sequence produced by the algorithm (20). For , and , we choose . Then,

2.2. Comments

It may seem that the form of the convergence results for online learning in the present paper is the same as that in some published papers (see e.g., [13, 24, 25, 27]); however, there are essential differences. First, our investigation is carried out in the setting of RKBSs. Note that among Banach spaces, a 2-uniformly convex space has properties closest to those of a Hilbert space, so it is natural that the form of the iteration sequences and the performance may be the same as in the RKHS setting.

Second, it is expected that the geometric properties of the RKBS hypothesis space influence the learning rate (this fact has been proved in the case of batch off-line learning; see [34]). It will therefore be an interesting topic to design online iteration algorithms according to the geometric parameters of the RKBS hypothesis space, for example, the modulus of convexity and the parameters in the convex inequality (see Section 6), and to investigate their performance. From this point of view, the present paper is only a beginning.

Third, the choices and are arbitrary. It would be interesting to make these choices according to some rules.

3. Examples

We now construct two kinds of RKBSs which are 2-uniformly convex and whose reproducing kernels satisfy Assumption 2.

Let be two given real numbers and let be the generalized Jacobi weight function satisfying . The weighted space consists of the real-valued functions on for which the norm is finite. Denote by the Jacobi orthonormal polynomial of order , which is normalized by

We have

For given , we define generalized symmetric translations as

Then, the bivariate function is a Mercer kernel on .

Proposition 1. Let and . Assume that satisfies , both and hold, and there exists a constant such that . We define

With the norm,

We define With the norm,

Also, we define a continuous bilinear form on as

Then, is a reproducing kernel for and , i.e.,

Moreover, there are the following results:
(i) is 2-uniformly smooth, and is 2-uniformly convex.
(ii) is uniformly continuous about ; is uniformly continuous about .
(iii) There holds and

Proof. The proof can be found in the Appendix of [36].
Let and be real numbers, and denote by the space of all measurable real-valued functions on such that where . The function is defined on by which is called the normalized Bessel function of the first kind of order , where is the Gamma function. For , we define the Fourier–Bessel transform by For , we define

Proposition 2. Let and . Assume that satisfies , both and hold.

We define With the norm, , and with the norm,

We define a bilinear form on as

Then, is a reproducing kernel for and , i.e.,

Moreover, there are the following results:
(i) is 2-uniformly smooth, and is 2-uniformly convex.
(ii) is uniformly continuous about ; is uniformly continuous about .
(iii) There hold the inequalities and

Proof. The proof can be found in the Appendix of [36].

4. Lemmas

To prove the results in Section 2, we provide here some lemmas which are proved in Section 5.

Lemma 1. Let be a convex loss function, and for any , there holds

Lemma 2 (see Remark 4.6.1 in [55] and line 15 on page 1128 of [46]). The sets and defined in Section 2 have the following expressions:

and

Lemma 3. Let be the sequence produced by algorithm (20) and take . Then, . Since , we have

Also, since , then

Proof. (61) and (62) can be deduced from (58) and (59).

Lemma 4. Let be the sequence produced by the algorithm (20). If for any , then there holds

Lemma 5. Let be defined as in (14). Then, there exists such that

Lemma 6. Let be defined as in (14) and be a constant in inequality (138) in Appendices. Then, for any , there holds

Lemma 7. Let be a 2-uniformly convex RKBS; therefore, is a 2-uniformly smooth RKBS. Let be the sequence produced by algorithm (20) and be the constant in inequality (140) in Appendices. If , then

Furthermore, there holds

Lemma 8 (see (56) in Lemma 4 of [13]). For any and , there holds

Lemma 9 (see Lemma 5 of [13]). Let and . Then, is bounded by

5. Proofs

We now give proofs for the lemmas, theorems, and corollaries in the present paper.

Proof of Lemma 1. We denote by the space of continuous functions on and Combining (11) and (29) with the reproducing property, we know that Therefore, According to the mean value theorem, there exists between and such that Due to the reproducing property and (29), we have Since , we have (57) thus holds.

Proof of Lemma 4. We prove it by induction. According to the definition of algorithm (20), we know . By (58), we have , where . Since , then , and by the definition of the Banach space of functions, we have . The case is trivial since . We suppose that (63) holds true for . We now consider . It can be expressed as Under Assumption 2, there is Since , by the assumption that , we have Also, there holds (63) thus holds.

Proof of Lemma 5. Since is the solution of (14), there holds Noting that and are convex functions, we have Then, there is such that (64) thus holds.

Proof of Lemma 6. Since B is a 2-uniformly convex RKBS, for any , by inequality (138), we have Therefore, Combining (64) with the above inequality, we have (65) then holds.

Proof of Lemma 7. By Lemma 3, we have Also, we know that . Since is a 2-uniformly smooth RKBS, by inequality (140) and equality (62), we have We take Since B is a 2-uniformly convex RKBS, by inequality (138), we have that It follows that where we have used the fact that is a convex function and Since depends on but not on , it follows that Combining (87) and (92), we have According to (65), we have Combining (61), (62), (93), and (94), we arrive at By (63), we know Substituting (96) into (95), we obtain Applying the above relation iteratively for , we have Since , and according to the definition of , we have . It follows that (66) thus holds. According to the inequality , which holds for any , we obtain (67) from (66).

Proof of Theorem 1. It is easy to see that By Assumption 1, there is Therefore, for any , there exists such that for there holds According to the assumption , for every , we have Since is fixed, there exists some , and for every , it holds that For any , we have According to the assumptions on , there exists an integer such that for all ; hence, We know that there exists some such that and Since , we denote . From (106) and (108), we know that for any , there exists such that for , there holds Let ; then by (66), (102), and (109), we have whenever . Thus, we have (30).

Proof of Theorem 2. According to (93), we need a more concrete expression for By (61) and (143) and Assumption 2, we have When , we have Since and , we have By the definition of , we can see Substituting (114) into (93), we get According to (115), we have Combining (116) with (117), we obtain Based on the assumptions about , we know that Also, by (65), we have We take Then, by (120), we have Combining (116) with (122), we obtain We denote . For , we apply the iterative relation (123) repeatedly and obtain For any , the inequality holds, and (124) implies that We take We have Then, by (125) and the assumption about the step size , we have Now, we estimate and . By (68), we obtain the following estimate for On the other hand, by (69) with , we have Therefore, (131) yields the following estimate for : Since and , the conclusion follows by combining (128), (129), and (132).

Proof of Corollary 1. For any , by (31) with , and the fact we have Since for any , there exists such that ; hence, the first term on the right-hand side of (134) decays in the form of for any large .
However, the second term on the right-hand side of (134) is bounded by . Consequently, there exists a constant depending only on and such that (32) thus holds.

Proof of Corollary 2. Combining decomposition (25) and (28) with (32), we obtain (33).

6. Appendix: Further Results on Convex Analysis

We collect some known inequalities in Banach spaces and describe the subgradient optimization algorithm.

6.1. Some Inequalities

By (3.2) in Corollary 1 of [46], we know that is -uniformly convex if and only if there is a positive constant such that for all and all , there holds

By (ii) in Corollary 1’ of [46], we know is -uniformly smooth if and only if there is a constant such that for all and all , there holds

In particular, is a -uniformly convex space if and only if there exists a constant such that for all , there holds

By in Corollary 1’ of [46], we know is -uniformly smooth, and there is a constant such that for all and all , there holds

In particular, is a -uniformly smooth space if and only if there exists a constant such that for all , there holds
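For the reader's convenience, characterizations of this kind from [46] are commonly stated in the following form (our rendering; the precise constants are not specified here), where $J_p$ denotes the duality mapping with gauge $t \mapsto t^{p-1}$:

```latex
% Xu-type characterizations (cf. [46]).
\[
  B \text{ is } p\text{-uniformly convex} \iff
  \exists\, c_p > 0:\quad
  \|x + y\|^{p} \ge \|x\|^{p} + p\,\langle y, j_p(x)\rangle + c_p \|y\|^{p},
  \quad x, y \in B,\ j_p(x) \in J_p(x),
\]
\[
  B \text{ is } q\text{-uniformly smooth} \iff
  \exists\, C_q > 0:\quad
  \|x + y\|^{q} \le \|x\|^{q} + q\,\langle y, j_q(x)\rangle + C_q \|y\|^{q},
  \quad x, y \in B.
\]
% Taking p = q = 2 gives the quadratic lower and upper inequalities used in the text.
```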

By (29) in Lemma 2.1 of [56], if is a uniformly convex Banach space with modulus of convexity of power type , then there is a constant such that for all and , where

Combining (141) with (142) and taking , we have

It is known that all Hilbert spaces and, for , the Banach spaces and (Sobolev spaces) are both uniformly convex and uniformly smooth (see e.g., [50, 57, 58]).

6.2. The Subgradient Method

Let be a convex function defined on a Euclidean space. Then, the minimization problem has a solution , i.e., , if and only if . To obtain an approximation of , one uses the classical subgradient descent method.
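A minimal sketch of this classical subgradient method on a Euclidean space; the $\ell^1$ objective, the starting point, and the $1/\sqrt{t}$ step-size rule below are illustrative assumptions:

```python
import numpy as np

def subgradient_descent(subgrad, x0, steps, T):
    """Classical subgradient method: x_{t+1} = x_t - eta_t * g_t, with g_t in the
    subdifferential of the convex objective at x_t.  Returns the last iterate."""
    x = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        g = subgrad(x)
        x = x - steps(t) * g
    return x

# Example: minimize the nonsmooth convex function f(x) = ||x - a||_1.
a = np.array([1.0, -2.0, 3.0])
subgrad = lambda x: np.sign(x - a)          # a subgradient of the l1 objective
x_star = subgradient_descent(subgrad, np.zeros(3),
                             steps=lambda t: 1.0 / np.sqrt(t), T=5000)
print(np.round(x_star, 2))                  # approximately equal to a
```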

We observe that is chosen arbitrarily.

Data Availability

Data are not available, as no new data were created or analyzed in this study.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was partially supported by the NSFC/RGC Joint Research Scheme (Project nos. 12061160462 and N_CityU102/20) and NSF (Project no. 61877039) of China.