Special Issue: Learning Theory
Density Problem and Approximation Error in Learning Theory
We study the density problem and approximation error of reproducing kernel Hilbert spaces for the purpose of learning theory. For a Mercer kernel $K$ on a compact metric space $(X, d)$, a characterization for the generated reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ to be dense in $C(X)$ is given. As a corollary, we show that the density is always true for convolution type kernels. Some estimates for the rate of convergence of interpolation schemes are presented for general Mercer kernels. These are then used to establish, for convolution type kernels, a quantitative analysis of the approximation error in learning theory. Finally, we show by the example of Gaussian kernels with varying variances that the approximation error can be improved when we adaptively change the value of the kernel parameter. This confirms the method of choosing varying parameters, which is often used in applications of learning theory.
Learning theory investigates how to find function relations or data structures from random samples. For the regression problem, one usually has some experience and would expect that the (underlying) unknown function lies in some set of functions $\mathcal{H}$ called the hypothesis space. Then one tries to find a good approximation in $\mathcal{H}$ of the underlying function $f$ (under a certain metric). The best approximation in $\mathcal{H}$ is called the target function $f_{\mathcal{H}}$. However, $f$ is unknown. What we have in hand is a set of random samples $\{(x_i, y_i)\}_{i=1}^{m}$. These samples are not given by $f$ exactly ($f(x_i) \neq y_i$). They are controlled by this underlying function $f$ with noise or some other uncertainties ($f(x_i) \approx y_i$). The most important model studied in learning theory is to assume that the uncertainty is represented by a Borel probability measure $\rho$ on $X \times Y$, and the underlying function $f$ is the regression function of $\rho$ given by $$f_\rho(x) = \int_Y y \, d\rho(y \mid x), \quad x \in X.$$ Here, $\rho(\cdot \mid x)$ is the conditional probability measure at $x$. Then, the samples $\{(x_i, y_i)\}_{i=1}^{m}$ are independent and identically distributed draws according to the probability measure $\rho$. For the classification problem, $Y = \{1, -1\}$ and $\operatorname{sign}(f_\rho)$ is the optimal classifier.
Based on the samples, one can find a function from the hypothesis space $\mathcal{H}$ that best fits the data $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m}$ (with respect to a certain loss functional). This function is called the empirical target function $f_{\mathbf{z}}$. When the number $m$ of samples is large enough, $f_{\mathbf{z}}$ is a good approximation of the target function $f_{\mathcal{H}}$ with certain confidence. This problem has been extensively investigated and well developed in the literature of statistical learning theory. See, for example, [1–4].
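To make the notion of an empirical target function concrete, the following is a minimal sketch, assuming a least-squares loss and a hypothesis space of polynomials of fixed degree on $[0, 1]$. The regression function `f_rho`, the noise level, and the helper names are hypothetical choices made only for illustration; they do not come from the paper.

```python
import numpy as np

def fit_empirical_target(x, y, degree):
    """Least-squares fit over polynomials of the given degree (the hypothesis space)."""
    # Design matrix with columns 1, x, x^2, ..., x^degree.
    A = np.vander(x, degree + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at the points x."""
    return np.vander(x, len(coeffs), increasing=True) @ coeffs

rng = np.random.default_rng(0)
m = 2000                                        # number of samples
x = rng.uniform(0.0, 1.0, size=m)

def f_rho(t):
    # Hypothetical regression function; it lies in the degree-2 hypothesis space.
    return 1.0 - 2.0 * t + 3.0 * t**2

y = f_rho(x) + 0.1 * rng.standard_normal(m)     # samples: y_i is f_rho(x_i) plus noise

coeffs = fit_empirical_target(x, y, degree=2)
grid = np.linspace(0.0, 1.0, 101)
max_err = np.max(np.abs(evaluate(coeffs, grid) - f_rho(grid)))
print("uniform error of the empirical target function:", max_err)
```

Because the hypothetical `f_rho` already belongs to the hypothesis space, the error seen here is pure sample error; the approximation error studied in this paper arises when the regression function lies outside the hypothesis space.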
What is less understood is the approximation of the underlying desired function $f_\rho$ by the target function $f_{\mathcal{H}}$. For example, if one takes $\mathcal{H}$ to be a polynomial space of some fixed degree, then $f_\rho$ can be approximated by functions from $\mathcal{H}$ only when $f_\rho$ is a polynomial in $\mathcal{H}$.
In kernel machine learning such as support vector machines, one often uses reproducing kernel Hilbert spaces or their balls as hypothesis spaces. Here, we take $X$ to be a compact metric space and $Y = \mathbb{R}$.
Definition 1. Let $K : X \times X \to \mathbb{R}$ be continuous, symmetric, and positive semidefinite; that is, for any finite set of distinct points $\{x_1, \ldots, x_\ell\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is positive semidefinite. Such a kernel is called a Mercer kernel. It is called positive definite if the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is always positive definite.
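The condition in Definition 1 can be checked numerically on any particular finite point set. The sketch below is only an illustration: it assumes a one-dimensional Gaussian kernel of the form $\exp(-|x-y|^2/\sigma^2)$ with a hypothetical width `sigma = 0.5` and four arbitrarily chosen points, and it inspects the eigenvalues of the resulting matrix. Passing such a check on one point set does not, of course, prove that a kernel is a Mercer kernel.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.5):
    """K(x, y) = exp(-|x - y|^2 / sigma^2) for 1-D inputs (broadcasts over arrays)."""
    return np.exp(-(x - y) ** 2 / sigma**2)

def gram_matrix(kernel, points):
    """Matrix (K(x_i, x_j))_{i,j} for a finite set of points."""
    return kernel(points[:, None], points[None, :])

points = np.array([0.0, 0.2, 0.5, 0.9])     # distinct points in X = [0, 1]
G = gram_matrix(gaussian_kernel, points)

# eigvalsh is appropriate because G is symmetric; eigenvalues come back in ascending order.
eigenvalues = np.linalg.eigvalsh(G)
print("smallest eigenvalue:", eigenvalues[0])
```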
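The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with a Mercer kernel $K$ is defined (see ) to be the completion of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $$\langle K_x, K_y \rangle_K = K(x, y).$$ The reproducing kernel property is given by $$\langle K_x, f \rangle_K = f(x), \quad \forall x \in X,\ f \in \mathcal{H}_K. \tag{3}$$ This space can be embedded into $C(X)$, the space of continuous functions on $X$.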
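For a finite linear combination $f = \sum_i c_i K_{x_i}$, the inner product, the reproducing property, and the embedding into $C(X)$ can all be written with kernel matrices. The following sketch illustrates this under assumptions that are not from the paper: a one-dimensional Gaussian kernel with width `sigma = 0.5` and hypothetical centers and coefficients. The embedding bound used at the end, $|f(x)| \le \sqrt{K(x,x)}\,\|f\|_K$, follows from the reproducing property and the Cauchy–Schwarz inequality.

```python
import numpy as np

def kernel(x, y, sigma=0.5):
    """Gaussian kernel on the real line (an assumed example kernel)."""
    return np.exp(-(x - y) ** 2 / sigma**2)

centers = np.array([0.1, 0.4, 0.8])     # hypothetical points x_i
c = np.array([1.0, -2.0, 0.5])          # hypothetical coefficients c_i

def f(x):
    """f = sum_i c_i K_{x_i}, evaluated at the points x."""
    x = np.atleast_1d(x)
    return kernel(x[:, None], centers[None, :]) @ c

def inner_product(c1, pts1, c2, pts2):
    """<sum_i c1_i K_{p_i}, sum_j c2_j K_{q_j}>_K = sum_{i,j} c1_i c2_j K(p_i, q_j)."""
    return c1 @ kernel(pts1[:, None], pts2[None, :]) @ c2

norm_f = np.sqrt(inner_product(c, centers, c, centers))

# Reproducing property: <K_x, f>_K equals f(x).
x0 = 0.3
reproduced = inner_product(np.array([1.0]), np.array([x0]), c, centers)

# Embedding into C(X): sup |f| <= sup sqrt(K(x, x)) * ||f||_K.
grid = np.linspace(0.0, 1.0, 201)
sup_f = np.max(np.abs(f(grid)))
bound = np.sqrt(np.max(kernel(grid, grid))) * norm_f
print(reproduced, f(x0)[0], sup_f, bound)
```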
In kernel machine learning, one often takes $\mathcal{H}_K$ or its balls as the hypothesis space. Then, one needs to know whether the desired function $f_\rho$ can be approximated by functions from the RKHS.
The first purpose of this paper is to study the density of the reproducing kernel Hilbert spaces in $C(X)$ (or in $L^2(X)$ when $X$ is a subset of the Euclidean space $\mathbb{R}^n$). This will be done in Section 2, where some characterizations will be provided. Let us mention a simple example with detailed proof given in Section 6.
Example 2. Letand let be a Mercer kernel given by wherefor each and. Set. Then, is dense inif and only if
When the density holds, we want to study the convergence rate of the approximation by functions from balls of the RKHS as the radius $R$ tends to infinity. The quantity $$\inf_{\|f\|_K \le R} \|f - f_\rho\|$$ is called the approximation error in learning theory. Some estimates have been presented by Smale and Zhou  for the $L^2$ norm and many kernels. The second purpose of this paper is to investigate the convergence rate of the approximation error with the uniform norm as well as the $L^2$ norm. Estimates will be given in Section 4, based on the analysis in Section 3 for interpolation schemes associated with general Mercer kernels. With this analysis, we can understand the approximation error with respect to the marginal probability distribution $\rho_X$ induced by $\rho$. Let us provide an example of Gaussian kernels to illustrate the idea. Notice that when the parameter $\sigma$ of the kernel is allowed to change with $R$, the rate of the approximation error may be improved. This confirms the method of adaptively choosing the parameter of the kernel, which is used in many applications (see e.g., ).
Example 3. Let There exist positive constantsandsuch that, for eachand, there holds whenis fixed; while whenmay change with, there holds
2. Density and Positive Definiteness
The density problem of reproducing kernel Hilbert spaces in $C(X)$ was raised to the author by Poggio et al. See . It can be stated as follows.
Given a Mercer kernel $K$ on a compact metric space $(X, d)$, when is the RKHS $\mathcal{H}_K$ dense in $C(X)$?
By means of the dual space of $C(X)$, we can give a general characterization. This is only a simple observation, but it does provide useful information. For example, we will show that the density is always true for convolution type kernels. Also, for dot product type kernels, we can give a complete characterization for the density, which will be given in Section 6.
Recall the Riesz Representation Theorem asserting that the dual space of $C(X)$ can be represented by the set of Borel measures on $X$. For a Borel measure $\mu$ on $X$, we define the integral operator $L_{K,\mu}$ associated with the kernel as $$L_{K,\mu}(f)(x) = \int_X K(x, y) f(y) \, d\mu(y), \quad x \in X.$$ This is a compact operator on $L^2_\mu(X)$ if $\mu$ is a positive measure.
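Theorem 4. Let $K$ be a Mercer kernel on a compact metric space $(X, d)$. Then, the following statements are equivalent. (1) $\mathcal{H}_K$ is dense in $C(X)$. (2) For any nontrivial positive Borel measure $\mu$, $\mathcal{H}_K$ is dense in $L^2_\mu(X)$. (3) For any nontrivial positive Borel measure $\mu$, $L_{K,\mu}$ has no eigenvalue zero in $L^2_\mu(X)$. (4) For any nontrivial Borel measure $\mu$, as a function in $C(X)$, $$\int_X K(\cdot, y) \, d\mu(y) \neq 0.$$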
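Proof. (1) $\Rightarrow$ (2). This follows from the fact that $C(X)$ is dense in $L^2_\mu(X)$. See, for example, .
(2) $\Rightarrow$ (3). Suppose that $\mathcal{H}_K$ is dense in $L^2_\mu(X)$, but $L_{K,\mu}$ has an eigenvalue zero in $L^2_\mu(X)$. Then, there exists a nontrivial function $f \in L^2_\mu(X)$ such that $L_{K,\mu}(f) = 0$; that is, $$\int_X K(x, y) f(y) \, d\mu(y) = 0. \tag{12}$$ The identity holds as functions in $L^2_\mu(X)$. If the support of $\mu$ is $X$, then this identity would imply that $f$ is orthogonal to each $K_x$ with $x \in X$. When the support of $\mu$ is not $X$, things are more complicated. Here, the support of $\mu$, denoted as $\operatorname{supp} \mu$, is defined to be the smallest closed subset $F$ of $X$ satisfying $\mu(X \setminus F) = 0$.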
The property of the RKHS enables us to prove the general case. As the functionis continuous, we know from (12) that, for eachin supp, This means for eachin ,in , wherehas been restricted onto supp. When we restrictonto , the new kernelis again a Mercer kernel. Moreover, by (1),. It follows that span is dense in. The latter is dense in. Therefore,is orthogonal to; hence, as a function in,is zero. This is a contradiction.
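(3) $\Rightarrow$ (4). Every nontrivial Borel measure $\mu$ can be uniquely decomposed as the difference of two mutually singular positive Borel measures: $\mu = \mu^+ - \mu^-$; that is, there exists a Borel set $A \subset X$ such that $\mu^+(A) = \mu^+(X)$ and $\mu^-(A) = 0$. With this decomposition, $$\int_X K(x, y) \, d\mu(y) = \int_X K(x, y) \left( \chi_A(y) - \chi_{X \setminus A}(y) \right) d|\mu|(y) = L_{K,|\mu|}\left( \chi_A - \chi_{X \setminus A} \right)(x).$$ Here, $\chi_A$ is the characteristic function of the set $A$, and $|\mu| = \mu^+ + \mu^-$ is the absolute value of $\mu$. As $|\mu|$ is a nontrivial positive Borel measure and $\chi_A - \chi_{X \setminus A}$ is a nontrivial function in $L^2_{|\mu|}(X)$, statement (3) implies that, as a function in $L^2_{|\mu|}(X)$, $L_{K,|\mu|}(\chi_A - \chi_{X \setminus A}) \neq 0$. Since this function lies in $C(X)$, it is nonzero as a function in $C(X)$.
The last implication (4) $\Rightarrow$ (1) follows directly from the Riesz Representation Theorem.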
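The proof of Theorem 4 also yields a characterization for the density of the RKHS in $L^2_\mu(X)$.
Corollary 5. Let $K$ be a Mercer kernel on a compact metric space $(X, d)$ and $\mu$ a positive Borel measure on $X$. Then, $\mathcal{H}_K$ is dense in $L^2_\mu(X)$ if and only if $L_{K,\mu}$ has no eigenvalue zero in $L^2_\mu(X)$.
The necessity has been verified in the proof of Theorem 4, while the sufficiency follows from the observation that an $L^2_\mu(X)$ function $f$ lying in the orthogonal complement of $\operatorname{span}\{K_x : x \in X\}$ gives an eigenfunction of $L_{K,\mu}$ with eigenvalue zero: $$L_{K,\mu}(f)(x) = \int_X K(x, y) f(y) \, d\mu(y) = \langle K_x, f \rangle_{L^2_\mu(X)} = 0, \quad \forall x \in X.$$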
Theorem 4 enables us to conclude that the density always holds for convolution type kernelswith. The density for some convolution type kernels has been verified by Steinwart . The author observed the density as a corollary of Theorem 4 whenis strictly positive. Charlie Micchelli pointed out to the author that, for a convolution type kernel, the RKHS is always dense in. So, the density problem is solved for these kernels.
Corollary 6. Letbe a nontrivial convolution type Mercer kernel onwith. Then, for any compact subsetof,onis dense in.
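Proof. It is well known that $K(x, y) = k(x - y)$ is a Mercer kernel if and only if $k$ is continuous and $\hat{k}(\xi) \ge 0$ almost everywhere. We apply the equivalent statement (4) of Theorem 4 to prove our statement.
Let $\mu$ be a Borel measure on $X$ such that $$\int_X k(x - y) \, d\mu(y) = 0, \quad \forall x \in X.$$ Then, the inverse Fourier transform yields $$\int_{\mathbb{R}^n} \hat{k}(\xi) \, \hat{\mu}(\xi) \, e^{i x \cdot \xi} \, d\xi = 0, \quad \forall x \in X.$$ Here, $\hat{\mu}(\xi) = \int_X e^{-i \xi \cdot y} \, d\mu(y)$ is the Fourier transform of the Borel measure $\mu$, which is an entire function.
Taking the integral on $X$ with respect to the measure $\mu$, we have $$\int_{\mathbb{R}^n} \hat{k}(\xi) \, |\hat{\mu}(\xi)|^2 \, d\xi = 0.$$ For a nontrivial Borel measure $\mu$ supported on $X$, $\hat{\mu}$ vanishes only on a set of measure zero. Hence, $\hat{k} = 0$ almost everywhere, which gives $k = 0$. Therefore, we must have $\mu = 0$. This proves the density by Theorem 4.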
After the first version of the paper was finished, I learned that Micchelli et al.  proved the density for a class of convolution type kernelswithbeing the Fourier transform of a finite Borel measure. Note that a large family of convolution type reproducing kernels are given by radial basis functions; see, for example, .
Now we can state a trivial fact that the positive definiteness is a necessary condition for the density.
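Corollary 7. Let $K$ be a Mercer kernel on a compact metric space $(X, d)$. If $\mathcal{H}_K$ is dense in $C(X)$, then $K$ is positive definite.
Proof. Suppose to the contrary that $\mathcal{H}_K$ is dense in $C(X)$, but there exists a finite set of distinct points $\{x_i\}_{i=1}^{\ell} \subset X$ such that the matrix $A := (K(x_i, x_j))_{i,j=1}^{\ell}$ is not positive definite. By the Mercer kernel property, $A$ is positive semidefinite. So it is singular, and we can find a nonzero vector $c = (c_i)_{i=1}^{\ell} \in \mathbb{R}^{\ell}$ satisfying $Ac = 0$. It follows that $c^T A c = 0$; that is, $$\Big\| \sum_{i=1}^{\ell} c_i K_{x_i} \Big\|_K^2 = \sum_{i,j=1}^{\ell} c_i K(x_i, x_j) c_j = 0.$$
Now, we define a nontrivial Borel measure $\mu$ supported on $\{x_1, \ldots, x_\ell\}$ as $$\mu = \sum_{i=1}^{\ell} c_i \delta_{x_i}.$$ Then, for $x \in X$, $$\int_X K(x, y) \, d\mu(y) = \sum_{i=1}^{\ell} c_i K(x, x_i) = \Big( \sum_{i=1}^{\ell} c_i K_{x_i} \Big)(x) = 0.$$ This is a contradiction to the equivalent statement (4) in Theorem 4 of the density.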
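The argument in this proof can be followed numerically on a kernel whose Gram matrix is singular. The sketch below is an illustration only, assuming the hypothetical dot product type kernel $K(x, y) = 1 + xy$ on $[0, 1]$ and the three nodes $0, 0.5, 1$, neither of which is taken from the paper. Since this kernel spans only the functions $1$ and $x$, any three-point Gram matrix has rank at most two, and a null vector yields a discrete measure that annihilates the kernel.

```python
import numpy as np

def kernel(x, y):
    """Hypothetical kernel K(x, y) = 1 + x*y; its RKHS is spanned by 1 and x."""
    return 1.0 + x * y

nodes = np.array([0.0, 0.5, 1.0])
A = kernel(nodes[:, None], nodes[None, :])      # Gram matrix (K(x_i, x_j))

# The right singular vector for the smallest singular value spans the null space of A.
_, singular_values, vt = np.linalg.svd(A)
c = vt[-1]                                      # nonzero vector with A c = 0 (up to rounding)

# mu = sum_i c_i * delta_{x_i}; its integral against K(x, .) is sum_i c_i K(x, x_i).
grid = np.linspace(0.0, 1.0, 101)
integral = kernel(grid[:, None], nodes[None, :]) @ c

print("smallest singular value:", singular_values[-1])
print("max |integral of K(x, y) d mu(y)| over the grid:", np.max(np.abs(integral)))
```

The computed null vector is proportional to $(1, -2, 1)$, so the measure is nontrivial while its integral against $K(x, \cdot)$ vanishes for every $x$, which is exactly the failure of statement (4) in Theorem 4.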
Because of the necessity given in Corollary 7, one would expect that the positive definiteness is also sufficient for the density. Steve Smale convinced the author that this is not the case in general. This motivates us to present a constructive example ofkernel. Denote as the norm in the Sobolev space.
Example 8. Let. For everyand every, choose a real-valuedfunctiononsuch that Defineonby Then, is aMercer kernel on. It is positive definite, but the constant functionis not in the closure ofin . Hence, is not dense in.
Proof. The series in (24) converges infor any. Hence, is and is a Mercer kernel on.
To prove the positive definiteness, we letbe a finite set of distinct points and a nonzero vector. Choosesuch that Then, for each, eitheror. Hence, By the construction of, there holds Then, Now, the determinant of the matrixis a Vandermonde determinant and is nonzero. Since is a nonzero vector, we know that for some. It follows that . Thus, is positive definite.
We now prove that, the constant function taking the valueeverywhere, is not in the closure ofin. In fact, the uniformly convergent series (24) and the vanishing property ofimply that Since span is dense in and is embedded in , we know that If could be uniformly approximated by a sequence in , then which would be a contradiction. Therefore,is not dense in.
Combining the previous discussion, we know that positive definiteness is a necessary condition for the density of the RKHS in $C(X)$ but is not sufficient.
3. Interpolation Schemes for Reproducing Kernel Spaces
The study of approximation by reproducing kernel Hilbert spaces has a long history; see, for example, [13, 14]. Here, we want to investigate the rate of approximation as the RKHS norm of the approximant becomes large.
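Definition 9. We say that $\{u_i\}_{i=1}^{\ell}$ is the set of nodal functions associated with the nodes $\mathbf{x} = \{x_1, \ldots, x_\ell\} \subset X$ if $u_i \in \operatorname{span}\{K_{x_j}\}_{j=1}^{\ell}$ and $$u_i(x_j) = \delta_{ij}, \quad i, j = 1, \ldots, \ell.$$
In , we show that the nodal functions $\{u_i\}_{i=1}^{\ell}$ associated with $\mathbf{x}$ exist if and only if the Gramian matrix $K[\mathbf{x}] := (K(x_i, x_j))_{i,j=1}^{\ell}$ is nonsingular. In this case, the nodal functions are uniquely given by $$u_i(x) = \sum_{j=1}^{\ell} \left( K[\mathbf{x}]^{-1} \right)_{ij} K_{x_j}(x), \quad i = 1, \ldots, \ell. \tag{33}$$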
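Remark 10. When the RKHS has finite dimension $m$, then, for any $\ell \le m$, we can find nodal functions associated with some subset $\mathbf{x} = \{x_i\}_{i=1}^{\ell} \subset X$, while for $\ell > m$, no such nodal functions exist. When $\dim \mathcal{H}_K = \infty$, then, for any $\ell \in \mathbb{N}$, we can find a subset $\mathbf{x} = \{x_i\}_{i=1}^{\ell} \subset X$ which possesses a set of nodal functions.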
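The nodal functions are used to construct an interpolation scheme: $$I_{\mathbf{x}}(f)(x) = \sum_{i=1}^{\ell} f(x_i) \, u_i(x), \quad x \in X,\ f \in C(X). \tag{34}$$ It satisfies $I_{\mathbf{x}}(f)(x_i) = f(x_i)$ for $i = 1, \ldots, \ell$. Interpolation schemes have been applied to the approximation by radial basis functions in a vast literature; see, for example, [17–20].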
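As a numerical illustration of nodal functions and the interpolation scheme, the sketch below computes $u_i(x)$ by solving a linear system with the Gramian matrix, which is equivalent to applying its inverse when the matrix is nonsingular. The Gaussian kernel with width `sigma = 0.5`, the five equally spaced nodes, and the test function $\sin(2\pi t)$ are hypothetical choices, not taken from the paper.

```python
import numpy as np

def kernel(x, y, sigma=0.5):
    """Gaussian kernel on the real line (an assumed example kernel)."""
    return np.exp(-(x - y) ** 2 / sigma**2)

nodes = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
G = kernel(nodes[:, None], nodes[None, :])          # Gramian matrix, assumed nonsingular

def nodal_functions(x):
    """Return U with U[p, i] = u_i(x_p) for the evaluation points x."""
    x = np.atleast_1d(x)
    Kx = kernel(x[:, None], nodes[None, :])         # Kx[p, j] = K(x_p, x_j)
    # u(x) = G^{-1} k_x; a linear solve avoids forming the inverse explicitly.
    return np.linalg.solve(G, Kx.T).T

def interpolate(f_values, x):
    """Interpolation scheme: sum_i f(x_i) u_i(x)."""
    return nodal_functions(x) @ f_values

def f(t):
    # Hypothetical function to interpolate.
    return np.sin(2 * np.pi * t)

U_at_nodes = nodal_functions(nodes)                 # should be the identity: u_i(x_j) = delta_ij
interp_at_nodes = interpolate(f(nodes), nodes)      # should reproduce f at the nodes
print(np.round(U_at_nodes, 8))
```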
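The error $f - I_{\mathbf{x}}(f)$ for $f \in \mathcal{H}_K$ will be estimated by means of a power function.
Definition 11. Let $K$ be a Mercer kernel on a compact metric space $(X, d)$ and $\mathbf{x} = \{x_1, \ldots, x_\ell\} \subset X$. The power function $\varepsilon_K(\mathbf{x})$ is defined on $\mathbf{x}$ as $$\varepsilon_K(\mathbf{x}) := \sup_{x \in X} \inf_{w \in \mathbb{R}^{\ell}} \Big\{ K(x, x) - 2 \sum_{i=1}^{\ell} w_i K(x, x_i) + \sum_{i,j=1}^{\ell} w_i K(x_i, x_j) w_j \Big\}^{1/2}.$$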
We know thatwhen. Ifis Lipschitzon: then Moreover, higher order regularity ofimplies faster convergence of. For details, see .
The error of the interpolation scheme for functions from RKHS can be estimated as follows.
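Theorem 12. Let $K$ be a Mercer kernel and $K[\mathbf{x}]$ nonsingular for a finite set $\mathbf{x} = \{x_1, \ldots, x_\ell\} \subset X$. Define the interpolation scheme $I_{\mathbf{x}}$ associated with $\mathbf{x}$ as (34). Then, for $f \in \mathcal{H}_K$, there holds $$\| f - I_{\mathbf{x}}(f) \|_{C(X)} \le \| f \|_K \, \varepsilon_K(\mathbf{x}) \tag{38}$$ and $\| I_{\mathbf{x}}(f) \|_K \le \| f \|_K$.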
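Proof. Let $x \in X$. We apply the reproducing property (3) of the function $f$ in $\mathcal{H}_K$: $$f(x) - I_{\mathbf{x}}(f)(x) = \Big\langle K_x - \sum_{i=1}^{\ell} u_i(x) K_{x_i}, \ f \Big\rangle_K.$$
By the Schwarz inequality in $\mathcal{H}_K$, $$\big| f(x) - I_{\mathbf{x}}(f)(x) \big| \le \Big\| K_x - \sum_{i=1}^{\ell} u_i(x) K_{x_i} \Big\|_K \, \| f \|_K.$$
As $\langle K_s, K_t \rangle_K = K(s, t)$, we have $$\Big\| K_x - \sum_{i=1}^{\ell} u_i(x) K_{x_i} \Big\|_K^2 = K(x, x) - 2 \sum_{i=1}^{\ell} u_i(x) K(x, x_i) + \sum_{i,j=1}^{\ell} u_i(x) K(x_i, x_j) u_j(x).$$
However, the quadratic function $$Q(w) = K(x, x) - 2 \sum_{i=1}^{\ell} w_i K(x, x_i) + \sum_{i,j=1}^{\ell} w_i K(x_i, x_j) w_j$$
over $\mathbb{R}^{\ell}$ takes its minimum value at $w = (u_i(x))_{i=1}^{\ell}$. Therefore, $$\Big\| K_x - \sum_{i=1}^{\ell} u_i(x) K_{x_i} \Big\|_K = \inf_{w \in \mathbb{R}^{\ell}} \{ Q(w) \}^{1/2} \le \varepsilon_K(\mathbf{x}).$$
It follows that $$\big| f(x) - I_{\mathbf{x}}(f)(x) \big| \le \| f \|_K \, \varepsilon_K(\mathbf{x}).$$
This proves (38).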
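As $I_{\mathbf{x}}(f) \in \operatorname{span}\{K_{x_i}\}_{i=1}^{\ell}$ and $(f - I_{\mathbf{x}}(f))(x_i) = 0$ for $i = 1, \ldots, \ell$, we know that $$\langle f - I_{\mathbf{x}}(f), K_{x_i} \rangle_K = 0, \quad i = 1, \ldots, \ell.$$ This means that $f - I_{\mathbf{x}}(f)$ is orthogonal to $\operatorname{span}\{K_{x_i}\}_{i=1}^{\ell}$. Hence, $I_{\mathbf{x}}(f)$ is the orthogonal projection of $f$ onto $\operatorname{span}\{K_{x_i}\}_{i=1}^{\ell}$. Thus, $\| I_{\mathbf{x}}(f) \|_K \le \| f \|_K$.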
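The two conclusions of this proof, the pointwise error bound in terms of the minimized quadratic function and the norm bound for the interpolant, can be checked numerically for a function with a known RKHS norm. The following is a sketch under assumptions that are not from the paper: a Gaussian kernel with width `sigma = 0.5`, five equally spaced nodes, and a hypothetical $f = \sum_i c_i K_{t_i}$. It uses the closed form of the minimum of the quadratic function at $w = u(x)$, namely $K(x,x) - k_x^T G^{-1} k_x$ with $k_x = (K(x, x_i))_i$ and $G$ the Gramian matrix.

```python
import numpy as np

def kernel(x, y, sigma=0.5):
    """Gaussian kernel on the real line (an assumed example kernel)."""
    return np.exp(-(x - y) ** 2 / sigma**2)

nodes = np.linspace(0.0, 1.0, 5)
G = kernel(nodes[:, None], nodes[None, :])          # Gramian matrix at the nodes

# Hypothetical f in the RKHS with explicitly computable norm.
centers = np.array([0.1, 0.45, 0.9])
c = np.array([1.0, -1.0, 2.0])

def f(x):
    return kernel(np.atleast_1d(x)[:, None], centers[None, :]) @ c

norm_f = np.sqrt(c @ kernel(centers[:, None], centers[None, :]) @ c)

grid = np.linspace(0.0, 1.0, 401)
Kx = kernel(grid[:, None], nodes[None, :])          # Kx[p, i] = K(x_p, x_i)
U = np.linalg.solve(G, Kx.T).T                      # U[p, i] = u_i(x_p)

# Minimum of the quadratic function at w = u(x): K(x, x) - k_x^T G^{-1} k_x.
power_sq = kernel(grid, grid) - np.sum(Kx * U, axis=1)
power = np.sqrt(np.clip(power_sq, 0.0, None))       # clip tiny negative rounding errors

f_nodes = f(nodes)
error = np.abs(f(grid) - U @ f_nodes)               # |f(x) - I(f)(x)| on the grid

# RKHS norm of the interpolant sum_i a_i K_{x_i} with a = G^{-1} f(nodes).
norm_interp = np.sqrt(f_nodes @ np.linalg.solve(G, f_nodes))

print("max error:", error.max(), " max bound:", (norm_f * power).max())
print("norm of interpolant:", norm_interp, " norm of f:", norm_f)
```

The grid maximum of `power` only approximates the supremum over $X$ in the definition of the power function, so this is a sanity check of the inequalities rather than a computation of $\varepsilon_K(\mathbf{x})$ itself.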
The regularity of the kernel in connection with Theorem 12 yields the rate of convergence of the interpolation scheme. As an example, from the estimate forgiven in [16, Proposition 2], we have the following.
Corollary 13. Let, , andbe aMercer kernel such thatis nonsingular for. Then, for, there holds
For convolution type kernels, the power function can be estimated in terms of the Fourier transform of the kernel function. This is of particular interest when the kernel function is analytic. Let us provide the details.
Assume thatis a symmetric function inandalmost everywhere on. Consider the Mercer kernel For, we define the following function to measure the regularity:
Remark 14. This function involves two parts. The first part is, where; hence, it decays exponentially fast asbecomes large. The second part is, whereis large. Then, the decay of(which is equivalent to the regularity of) yields the fast decay of the second part.
The power functioncan be bounded byon the regular points:
Proof. Chooseas the Lagrange interpolation polynomials on. It is a vector infor each. Then, , where
In the proof of Theorem 2 in , we showed thatfor each. Therefore,.
The estimate forin the second part was verified in the proof of Theorem 3 in .
For the Gaussian kernels it was proved in [16, Example 4] that, for, there holds
4. Approximation Error in Learning Theory
Now, we can estimate the approximation error in learning theory by means of the interpolation scheme (34).
Theorem 16. Let be a symmetric function with , and let the kernel on be . For and , we set by Then, with , one has(i); (ii); (iii).
Proof. (i) For and , expression (33) gives
Then for we have
where is the vector . It follows that
where denotes the (operator) norm of the matrix in .
We apply the previous analysis to the function satisfying Then,
Now, we need to estimate the norm . For convolution type kernels, such an estimate was given in [15, Theorem 2] by means of methods from the radial basis function literature, for example, [17, 21–24]. We have Therefore, This proves the statement in (i).
(ii) Let . Then By the Schwarz inequality, The first term is bounded by . The second term is which can be bounded by , as shown in the proof of Theorem 12. Therefore, by (52),
(iii) By the Plancherel formula, This proves all the statements in Theorem 16.
Theorem 16 provides quantitative estimates for the approximation error: with Choose such that as ; we have and the RKHS norm of is controlled by the asymptotic behavior of .
Denote by the inverse function of : Then, our estimate for the approximation error can be given as follows.
Corollary 17. Let and . Then, for , where . If , then In particular, when for some and , one has provided that with the function , satisfies
Proof. The first part is a direct consequence of Theorem 16 when we choose to be , the integer part of .
To see the second part, we note that (77) in connection with Proposition 15 implies with , Then, .
For , we can choose such that Choose such that Then, , and by Theorem 16, When there holds Hence, When satisfies (79), we know that Hence, (84) holds true. This proves our statements.
For the Gaussian kernels, we have the following.
Proposition 18. Let Denote and . If , then one has and when , for satisfying
Proof. The Fourier transform of is
For we can take with such that Here, is the inverse function of : Then, . Let . By Theorem 16, .
By Corollary 17 and (57), where . Choose such that With this choice, . Therefore, where
When there holds This yields the first estimate.
When , the same method gives the error with the uniform norm.
5. Learning with Varying Kernels
Proposition 18 in the last section shows that, for a fixed Gaussian kernel, the approximation error behaves as for functions in .
In this section, we consider learning with varying kernels. Such a method is used in many applications where we have to choose suitable parameters for the reproducing kernel. For example, in  Gaussian kernels with different parameters in different directions are considered. Here, we study the case when the variance parameter is the same in all directions. Our analysis shows that the approximation error may be improved when the kernel changes with the RKHS norm of the empirical target function.
Proposition 19. Let There exist positive constants and , depending only on and , such that for each and , one can find some satisfying