Probability Error Bounds for Approximation of Functions in Reproducing Kernel Hilbert Spaces
We find probability error bounds for approximations of functions in a separable reproducing kernel Hilbert space with reproducing kernel on a base space , firstly in terms of finite linear combinations of functions of type and then in terms of the projection on , for random sequences of points in . Given a probability measure , letting be the measure defined by , , our approach is based on the nonexpansive operator where the integral exists in the Bochner sense. Using this operator, we then define a new reproducing kernel Hilbert space, denoted by , that is the operator range of . Our main result establishes bounds, in terms of the operator , on the probability that the Hilbert space distance between an arbitrary function in and linear combinations of functions of type , for sampled independently from , falls below a given threshold. For sequences of points constituting a so-called uniqueness set, the orthogonal projections to converge in the strong operator topology to the identity operator. We prove that, under the assumption that is dense in , any sequence of points sampled independently from yields a uniqueness set with probability 1. This result improves on previous error bounds in weaker norms, such as uniform or norms, which yield only convergence in probability and not almost certain convergence. Two examples that show the applicability of this result to a uniform distribution on a compact interval and to the Hardy space are presented as well.
1. Introduction
Several machine learning algorithms that use positive semidefinite kernels, such as support vector machines (SVM), have been analysed and justified rigorously using the theory of reproducing kernel Hilbert spaces (RKHS), yielding statements of optimality, convergence, and approximation bounds, e.g., see Cucker and Smale . Reproducing kernel Hilbert spaces are Hilbert spaces of functions associated to a suitable kernel such that convergence with respect to the Hilbert space norm implies pointwise convergence, and in the context of approximation possess various favourable properties resulting from the Hilbert space structure. For example, under certain conditions on the kernel, every function in the Hilbert space is sufficiently differentiable, and differentiation is in fact a nonexpansive linear map with respect to the Hilbert space norm, e.g., see (, Subsection 2.1.3).
In order to substantiate the motivation for our investigation, we briefly review previously obtained bounds on the approximation of functions as linear combinations of kernels evaluated at finitely many points. The theory of Vapnik and Chervonenkis of statistical learning theory [3–5] relies on concentration inequalities such as Hoeffding’s inequality to bound the supremum distance between expected and empirical risk. The theory considers a data space on which an unknown probability distribution is defined, a hypothesis set , and a loss function , such that one wishes to find a hypothesis that minimizes the expected risk
Since is not known in general, instead of minimizing the expected risk one usually minimizes the empirical risk over a finite set of samples. Vapnik-Chervonenkis theory measures the probability with which the maximum distance between and falls below a given threshold. Recall that the Vapnik-Chervonenkis (VC) dimension of with respect to is the maximum cardinality of finite subsets that can be shattered by , i.e. for each , there exist and such that
Thus, they prove that, assuming that for each and the VC dimension of is , then, for any ,
Girosi, see  and (, Proposition 2), has used this general result to bound the uniform distance between integrals and sums of the form , by reinterpreting as , as , and as . Kon and Raphael  then applied this methodology to obtain uniform approximation bounds for functions in reproducing kernel Hilbert spaces. They consider two cases: one where the Hilbert space is dense in with a stronger norm (, Theorem 4), and one where it is a closed subspace with the same norm (, Theorem 5). Also, Kon et al.  extended Girosi’s approximation estimates to functions in Sobolev spaces. While these bounds guarantee uniform convergence in probability, the approximating functions are neither orthogonal projections of nor necessarily elements of a reproducing kernel Hilbert space, and hence may not capture exactly at nor converge monotonically. Furthermore, the fact that the norm is not a RKHS norm means that derivatives of may not be approximated in general, since differentiation is not bounded with respect to the uniform norm, unlike the RKHS norm associated with a continuously differentiable kernel.
The purpose of this article is thus to establish sufficient conditions for convergence and approximation in the reproducing kernel Hilbert space norm. In Section 3, we find probability error bounds for approximations of functions in a separable reproducing kernel Hilbert space with reproducing kernel on a base space , firstly in terms of finite linear combinations of functions of type and then in terms of the projection onto , for random sequences of points in the base space . Given a probability measure , letting be the measure defined by , , we approach these problems by firstly showing the existence of the nonexpansive operator where the integral exists in the Bochner sense. Using this operator, we then define a new reproducing kernel Hilbert space, denoted by , that is the operator range of . Our main result establishes bounds, in terms of the operator , on the probability that the Hilbert space distance between an arbitrary function in and linear combinations of functions of type , for sampled independently from , falls below a given threshold, see Theorem 8. For sequences of points constituting a so-called uniqueness set, see Subsection 3.4, the orthogonal projections onto the converge in the strong operator topology to the identity operator. As an application of our main result, we show that, under the assumption that is dense in , any sequence of points sampled independently from yields a uniqueness set with probability 1.
The results obtained in this article improve on those of Kon and Raphael in several respects: the convergence of approximations is in the RKHS norm, which is stronger than the uniform norm whenever the kernel is bounded; the type of convergence with respect to the points is strengthened from convergence in probability to almost certain convergence; and the separability of then allows the result to be extended from the approximation of a single function to the simultaneous approximation of all functions in the Hilbert space. In addition, compared with the existing methods for this kind of problem, our approach, based on the operator defined at (5), which encodes the interplay between the kernel and the probability measure , and on the associated RKHS , is completely new and has the potential to overcome many difficulties.
These results are confined to the special case of a separable RKHS of functions on an arbitrary set , for several reasons, one of them being that the Bochner integral requires separability; we do not see this as a loss of generality, since most of the spaces of interest for applications are separable. In the last section, we present two examples that illustrate both the applicability and the limitations of our results: the first concerns the uniform probability distribution on the compact interval , together with a class of bounded continuous kernels, and the second concerns the Hardy space corresponding to the Szegő kernel, which is unbounded. In each case, we can explicitly calculate the space , its reproducing kernel , and the operator .
2. Notation and Preliminary Results
2.1. Reproducing Kernel Hilbert Spaces
In this subsection, we briefly review some concepts and facts on reproducing kernel Hilbert spaces, following classical texts such as Aronszajn [9, 10] and Schwartz , or more modern ones such as Saitoh and Sawano (, Chapter 2) and Paulsen and Raghupathi .
Throughout this article, we denote by one of the commutative fields or . For a nonempty set , let denote the set of -valued functions on , forming an -vector space under pointwise addition and scalar multiplication. For each , the evaluation map at is the linear functional
The evaluation maps equip with the locally convex topology of pointwise convergence, which is the weakest topology on that renders each evaluation map continuous. Under this topology, a generalized sequence in converges if and only if it converges pointwise, i.e., its image under each evaluation map converges. Since each evaluation map is linear and hence the vector space operations are continuous, this renders into a complete Hausdorff locally convex space. With respect to this topology, if is a topological space, a map is continuous if and only if is continuous for all .
We are interested in Hilbert spaces with topologies at least as strong as the topology of pointwise convergence of , so that the convergence of a sequence of functions in implies that the functions also converge pointwise. When is a finite set, , where is the number of elements of , can itself be made into a Hilbert space with a canonical inner product , or in general by an inner product induced by a positive semidefinite matrix. This leads to the concept of reproducing kernel Hilbert spaces.
By the Riesz Representation Theorem for bounded linear functionals on Hilbert spaces, if the restriction of to is continuous for each , then there exists a unique vector such that . But, since each vector in is itself a function , these vectors together define a map , . Also, recall that a map is usually called a kernel.
Definition 1. Let be a Hilbert space, a kernel. For each define . is said to be a reproducing kernel for , and is then said to be a reproducing kernel Hilbert space (RKHS), if, for each , we have (i)(ii), that is, for every we have The second property is referred to as the reproducing property of the kernel .
We may then summarize the last few paragraphs with the following characterization: Let be a Hilbert space. The following assertions are equivalent: (i) The canonical injection is continuous; (ii) For each , the map is continuous; (iii) admits a reproducing kernel.
In that case, the reproducing kernel admitted by the Hilbert space is unique, by the uniqueness of the Riesz representatives of the evaluation maps. We may further apply the reproducing property to each to obtain that for each , yielding the following properties: (i) For each , (ii) For each , , and (iii) For each , ,
The property in (7) is the analogue of the Schwarz Inequality. As a consequence of it, if for some then for all .
For any , each so we may define the subspace
of . If is the reproducing kernel of a Hilbert space , is also a subspace of and therefore, is a dense subspace of , equivalently, is a total set for .
The property at item (iii) is known as the positive semidefiniteness property. A positive semidefinite kernel is called definite if for all . Positive semidefiniteness is in fact sufficient to characterize all reproducing kernels. By the Moore-Aronszajn Theorem, for any positive semidefinite kernel , there is a unique Hilbert space with reproducing kernel .
Let us briefly recall the construction of the Hilbert space in the proof. We first render into a pre-Hilbert space satisfying the reproducing property. Define on the inner product for any . One can prove that this definition is correct, that is, independent of the chosen representations, and that it indeed provides an inner product.
Let be the completion of ; then is a Hilbert space with an isometric embedding whose image is dense in . One then shows that this abstract completion can actually be realized in and that it is the RKHS with reproducing kernel , denoted by .
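The first step of this construction can be checked numerically on a small example. The following Python sketch is only an illustration under stated assumptions: it uses a real-valued Gaussian kernel on the real line (our choice, not taken from the text) and the usual real-case formula for the inner product of two finite kernel combinations, namely the double sum of coefficient products times kernel values. All helper names are hypothetical. It verifies the reproducing property on the span of the kernel sections and the positive semidefiniteness of a Gram matrix.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on the real line (illustrative choice, not from the text)."""
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

def gram(xs, ys, kernel=gaussian_kernel):
    """Matrix with entries kernel(xs[i], ys[j])."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    return kernel(xs[:, None], ys[None, :])

def inner_product(a, xs, b, ys, kernel=gaussian_kernel):
    """Inner product of sum_i a_i K_{x_i} and sum_j b_j K_{y_j} (real case):
    sum_{i,j} a_i b_j K(x_i, y_j)."""
    return float(np.asarray(a) @ gram(xs, ys, kernel) @ np.asarray(b))

def evaluate(a, xs, t, kernel=gaussian_kernel):
    """Value at the point t of the function f = sum_i a_i K_{x_i}."""
    return float(np.asarray(a) @ kernel(np.asarray(xs, dtype=float), t))

# A finite kernel combination f and a test point t (arbitrary values).
a = [1.0, -2.0, 0.5]
xs = [0.0, 0.3, 1.2]
t = 0.7

# Reproducing property: <f, K_t> should equal f(t).
lhs = inner_product(a, xs, [1.0], [t])
rhs = evaluate(a, xs, t)
print(lhs, rhs)
```

The squared norm of such a combination is the quadratic form of the Gram matrix in the coefficients, which is why positive semidefiniteness of the kernel is exactly what makes the construction work.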
In applications, one of the most useful tools is the interplay between reproducing kernels and orthonormal bases of the underlying RKHSs. Although this fact holds in higher generality, we state it for separable Hilbert spaces since, most of the time, this is the case of interest: if is a separable RKHS with reproducing kernel , and is an orthonormal basis of , then where the series converges absolutely pointwise.
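As a concrete instance of this expansion, one may take the Hardy space on the unit disc, for which it is standard that the monomials z^n, n = 0, 1, 2, ..., form an orthonormal basis and that the Szegő kernel is 1/(1 - z w̄). The Python sketch below compares partial sums of the basis expansion with the closed form; the function names and the test points are hypothetical choices made for illustration.

```python
import numpy as np

def szego_kernel(z, w):
    """Closed form of the Szego kernel on the unit disc: 1 / (1 - z * conj(w))."""
    return 1.0 / (1.0 - z * np.conj(w))

def partial_sum(z, w, n_terms):
    """Partial sum of sum_n e_n(z) * conj(e_n(w)) with e_n(z) = z**n."""
    n = np.arange(n_terms)
    return np.sum(z ** n * np.conj(w) ** n)

# Two arbitrary points inside the unit disc.
z, w = 0.5 + 0.3j, -0.2 + 0.6j

for n_terms in (5, 20, 60):
    err = abs(partial_sum(z, w, n_terms) - szego_kernel(z, w))
    print(n_terms, err)
```

The error decays geometrically with ratio |z||w|, so convergence slows down as the points approach the unit circle, where this kernel is unbounded.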
We now recall a useful result on the construction of new RKHSs and positive semidefinite kernels from existing ones. It also shows that the concept of reproducing kernel Hilbert space is actually a special case of the concept of operator range. Let be a Hilbert space, a continuous linear map. Then with the norm is a RKHS, unitarily isomorphic to . The kernel for is then given by the map where such that on . Applying this proposition to particular continuous linear maps, one obtains useful results for pullbacks, restrictions, sums, scaling, and normalizations of kernels.
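A finite-dimensional sketch may help to fix ideas. It assumes that the norm on the operator range is the norm of the minimal-norm preimage and that the resulting kernel is given by the matrix of the operator composed with its adjoint; both are the standard operator range conventions, stated here as assumptions because the displayed formulas are not reproduced above. The base space is a three-point set, the source Hilbert space is four-dimensional Euclidean space, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                       # dimension of the source space, number of points in X
A = rng.standard_normal((n, m))   # matrix of T: R^m -> functions on X = {0, 1, 2}
K = A @ A.T                       # candidate kernel, K[x, y] = (T T^*)(x, y)
A_pinv = np.linalg.pinv(A)        # sends f in ran(T) to its minimal-norm preimage

def range_inner(f, g):
    """Inner product on ran(T), transported from the minimal-norm preimages."""
    return float((A_pinv @ f) @ (A_pinv @ g))

# An arbitrary element of the range; h may have a component in the null space of T.
h = rng.standard_normal(m)
f = A @ h

# Reproducing property: <f, K_x> in the range norm should equal f(x).
for x in range(n):
    print(x, range_inner(f, K[:, x]), f[x])
```

Since m exceeds n here, the operator has a nontrivial null space, and the pseudoinverse discards exactly that component, which mirrors the identification of the range with the orthogonal complement of the null space.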
2.2. Integration of RKHS-Valued Functions
In this article, we use integrals of Hilbert space-valued functions. We first provide fundamental definitions and properties concerning the Bochner integral, an extension of the Lebesgue integral for Banach space-valued functions, following Cohn (, Appendix E).
Let be a (real or complex) Banach space and a finite measure space. On , we consider the Borel σ-algebra denoted by . A map is called measurable if for all , and it is called strongly measurable if it is measurable and its range is separable. If is a separable Banach space, then the two concepts coincide. Both sets of measurable functions, respectively, strongly measurable functions, are vector spaces. It can be shown that a function is strongly measurable if and only if there exists a sequence of simple functions such that pointwise on . In addition, in this case, the sequence can be chosen such that for all .
A function is Bochner integrable if it is strongly measurable and the scalar function is integrable. In this case, the Bochner integral of is defined by approximation with simple functions. Bochner integrable functions share many properties with scalar-valued integrable functions, but not all. For example, the collection of all Bochner integrable functions forms a vector space, and, for any Bochner integrable function , we have
Also, letting denote the collection of all equivalence classes of Bochner integrable functions, identified -almost everywhere, this is a Banach space with norm
In addition, the Dominated Convergence Theorem holds for the Bochner integral as well, e.g., see (, Theorem E.6).
In this article, we will use the following result, which is a special case of a theorem of Hille, e.g., see (, Theorem III.2.6). In Hille’s Theorem, the linear transformation is supposed to be only closed, and, consequently, additional assumptions are needed, so we provide a proof for the special case of bounded linear operators for the reader’s convenience.
Theorem 2. Let be a Banach space, a measure space, and a Bochner integrable function. If is a continuous linear transformation between Banach spaces, then is Bochner integrable and
Proof. Since is Bochner integrable, there exists a sequence of simple functions that converges pointwise to on and for all and all . Then, hence, the sequence converges pointwise to . Also, it is easy to see that is a simple function for all . These show that is strongly measurable. Since for all and is Bochner integrable, it follows that hence, is Bochner integrable.
On the other hand, hence, by the Dominated Convergence Theorem for the Bochner integral, it follows that ☐
A direct consequence of this fact is a sufficient condition for when a pointwise integral coincides with the Bochner integral, valid not only for RKHSs but also for Banach spaces of functions on which evaluation maps at any point are continuous, e.g., for some compact Hausdorff space .
Proposition 3. Let be a measure space, a Banach space of functions on , such that all evaluation maps on are continuous. Let be such that for each we have .
If, for each , the map is Bochner integrable, then the scalar map is integrable, for each fixed .
Moreover, in that case, the pointwise integral map lies in and coincides with the Bochner integral.
Proof. Since, for each , the map is Bochner integrable, and taking into account that, for all , the linear functional is continuous, by Theorem 2, we have Since for all , this means that the scalar map is integrable, for each fixed , and hence, the pointwise integral map lies in and coincides with the Bochner integral .☐
3. Main Results
Throughout this section, we consider a probability measure space and a RKHS in , with norm denoted by , such that its reproducing kernel is measurable. In addition, throughout this section, the reproducing kernel Hilbert space is supposed to be separable.
3.1. The Reproducing Kernel Hilbert Space
On the measurable space , we define the measure by that is, is the absolutely continuous measure with respect to such that the function is the Radon-Nikodym derivative of with respect to .
With respect to the measure space , we consider the Hilbert space . Our approach is based on the following natural bounded linear operator mapping to .
Proposition 4. With notation and assumptions as before, let be a measurable function such that the integral is finite. Then, the Bochner integral exists in .
In addition, the mapping is a nonexpansive, hence bounded, linear operator.
Proof. By assumptions, the map is measurable, and, since is separable, it follows that this map is actually strongly measurable. Letting denote the norm on and using the assumption that is finite, we have hence, by the Schwarz Inequality and taking into account that is a probability measure, we have
For arbitrary , by the triangle inequality for the Bochner integral (15), we then have and applying the Schwarz Inequality for the integral and taking into account that is a probability measure hence, is a nonexpansive linear operator.
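The nonexpansiveness can be observed numerically on a finite base space, where all integrals reduce to sums. The Python sketch below rests on two assumptions that should be read as such, since the displayed formulas are not reproduced above: that the operator sends a function λ to the integral of λ(x) K_x with respect to the probability measure, and that the weighted measure has density K(x, x) with respect to it. The Gaussian kernel, the sample points, and the helper name are hypothetical choices.

```python
import numpy as np

def operator_norms(K, p, lam):
    """Return (norm of the image in the RKHS, weighted L^2 norm of lam).

    K   : Gram matrix of the kernel on a finite base space (PSD)
    p   : probability vector on the base space
    lam : real-valued function on the base space
    Assumed form of the operator: lam -> sum_x p(x) * lam(x) * K_x.
    """
    c = p * lam                                    # coefficients of the image
    norm_image = np.sqrt(max(c @ K @ c, 0.0))      # RKHS norm of sum_x c_x K_x
    norm_lam = np.sqrt(np.sum(p * np.diag(K) * lam ** 2))
    return float(norm_image), float(norm_lam)

rng = np.random.default_rng(1)
pts = np.linspace(0.0, 1.0, 8)
K = np.exp(-(pts[:, None] - pts[None, :]) ** 2)    # Gaussian kernel, illustrative
p = rng.random(8)
p /= p.sum()                                       # normalize to a probability vector
lam = rng.standard_normal(8)

norm_image, norm_lam = operator_norms(K, p, lam)
print(norm_image, norm_lam)
```

In this discrete setting the inequality follows from the triangle inequality in the RKHS and the Schwarz Inequality for the sum, the same two steps used in the proof.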
Using the bounded linear operator defined as in (26), let us denote its range by
which is a subspace of the RKHS .
Proposition 5. is a RKHS contained in , hence, in , and its reproducing kernel is where whenever , by convention we define for all .
Proof. Since is a Hilbert space and is a bounded linear map, by (13) it follows that is a RKHS in , isometrically isomorphic to the orthogonal complement of , and its norm is given by
Let and let us define by
From the Schwarz Inequality for the kernel , it follows that if then for all . This shows that for all .
For each , by the Schwarz inequality and the fact that is a probability measure, we have
hence, . Then, taking into account that for all and all , it follows that, for each and , we have
In conclusion, is exactly the representative for the functional so, by (13) the kernel of is and, using the convention that whenever and for arbitrary , ☐
One of the main results of this article, see Theorem 11, assumes that the space is dense in . The next proposition provides sufficient conditions for this.
Proposition 6. Suppose that is continuous on , that , and that is strictly positive on any nonempty open subset of . Then, is dense in .
Proof. The assertion is equivalent to showing that the orthogonal complement of in is trivial. To this end, let , . That is, for each , we have
Then noting the fact that is a Bochner integral, and hence, by Theorem 2, it commutes with inner products,
By assumption, , so we can take to obtain
This implies that -almost everywhere, i.e., the set has zero measure.
Since is continuous by assumption, by Theorem 2.3 in (, Section 2.1.3), each is continuous, hence is an open subset of . But, since is assumed strictly positive on any nonempty open set, it follows that must be empty, hence identically.☐
3.2. Probability Error Bounds of Approximation
The first step in our enterprise is to find error bounds for approximations of functions in the reproducing kernel Hilbert space in terms of distributional finite linear combinations of functions of type . To do that, we use the celebrated Markov-Bienaymé-Chebyshev Inequality on the concentration of probability measures to obtain regions of large measure with small approximation error, in terms of the Hilbert space norm and not simply the uniform norm.
Theorem 7. (Markov-Bienaymé-Chebyshev Inequality) Let be a probability space, a Banach space, and let be two Borel measurable functions. Then, for any , we have
The classical Bienaymé-Chebyshev Inequality is obtained from (43) applied for , , and , for , where is the expected value of the random variable and is the variance of .
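For the classical scalar case just mentioned, the inequality states that the probability of a deviation from the mean of at least t is at most the variance divided by t squared. The short Python sketch below compares the two sides on a simulated sample; the exponential distribution, the sample size, and the function name are arbitrary illustrative choices.

```python
import numpy as np

def chebyshev_check(samples, t):
    """Empirical P(|X - E X| >= t) and the bound Var(X) / t**2,
    both computed from the same sample (population variance, ddof=0)."""
    dev = np.abs(samples - samples.mean())
    return float(np.mean(dev >= t)), float(samples.var() / t ** 2)

rng = np.random.default_rng(2)
samples = rng.exponential(scale=1.0, size=100_000)

for t in (0.5, 1.0, 2.0, 3.0):
    tail, bound = chebyshev_check(samples, t)
    print(t, tail, bound)
```

Because both quantities are computed from the same empirical distribution, the inequality holds exactly for the sample and not only approximately; the bound is typically far from sharp.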
Theorem 8. With notation and assumptions as before, let and . For each and , consider the set
Then, letting denote the product probability measure on and defining the bounded linear operator as in (26), we have
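The mechanism behind a bound of this kind can be simulated on a finite base space. The Python sketch below does not reproduce the bound of the theorem; under the assumption that the operator sends λ to the integral of λ(x) K_x with respect to the probability measure, it only checks two general facts: the variance identity for the average of n independent Hilbert space-valued samples λ(x_i) K_{x_i}, and the resulting Markov-type tail estimate. The kernel, the measure, the function λ, and all names are hypothetical.

```python
import numpy as np

def mean_square_error(K, p, lam, n):
    """Exact E || (1/n) sum_i lam(x_i) K_{x_i} - L lam ||^2 for x_i i.i.d. from p:
    (E ||Y||^2 - ||E Y||^2) / n with Y = lam(x) K_x."""
    c = p * lam
    second_moment = np.sum(p * np.diag(K) * lam ** 2)
    return float((second_moment - c @ K @ c) / n)

def simulate(K, p, lam, n, delta, trials=20_000, seed=0):
    """Monte Carlo estimate of the mean squared RKHS error and of the
    probability that the RKHS error is at least delta."""
    rng = np.random.default_rng(seed)
    c = p * lam
    counts = rng.multinomial(n, p, size=trials)      # how often each point is drawn
    diff = counts * lam / n - c                      # coefficients of the error
    err2 = np.einsum("ti,ij,tj->t", diff, K, diff)   # squared RKHS norms
    return float(err2.mean()), float(np.mean(err2 >= delta ** 2))

pts = np.linspace(0.0, 1.0, 6)
K = np.exp(-(pts[:, None] - pts[None, :]) ** 2)      # Gaussian kernel, illustrative
p = np.full(6, 1.0 / 6.0)                            # uniform probability on 6 points
lam = np.cos(3.0 * pts)
n = 50

mse = mean_square_error(K, p, lam, n)
delta = 2.0 * np.sqrt(mse)                           # threshold giving a bound of 1/4
mc_mse, tail = simulate(K, p, lam, n, delta)
print(mse, mc_mse, tail, mse / delta ** 2)
```

The mean squared error decays like 1/n, which is the source of the dependence on the number of sample points; approximating a general function additionally involves its distance to the image of λ under the operator.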
Proof. By Proposition 4, the Bochner integral exists in and the linear operator is well-defined and bounded. In order to simplify the notation, considering the function defined by observe that is measurable and for each , we have
Then, we have
Since is a probability measure, we have
On the other hand, by Fubini’s theorem and the fact that the Bochner integral commutes with continuous linear operations, see Theorem 2, we have
Also, for each , and, for each ,
Integrating both sides of (49) and using all the previous equalities, we therefore have