Research Article | Open Access
Su Chen, "Optimal Bandwidth Selection for Kernel Density Functionals Estimation", Journal of Probability and Statistics, vol. 2015, Article ID 242683, 21 pages, 2015. https://doi.org/10.1155/2015/242683
Optimal Bandwidth Selection for Kernel Density Functionals Estimation
The choice of bandwidth is crucial to kernel density estimation (KDE) and kernel-based regression. Various bandwidth selection methods for KDE and local least squares regression have been developed in the past decade. It is known that scale and location parameters are proportional to density functionals with appropriate choice of and, furthermore, that tests of equality of scale and location can be transformed into comparisons of the density functionals among populations. can be estimated nonparametrically via kernel density functionals estimation (KDFE). However, optimal bandwidth selection for the KDFE of has not been examined. We propose a method to select the optimal bandwidth for the KDFE. The idea underlying this method is to search for the optimal bandwidth by minimizing the mean square error (MSE) of the KDFE. Two practical bandwidth selection techniques for the KDFE of are provided: Normal scale bandwidth selection (namely, the “Rule of Thumb”) and direct plug-in bandwidth selection. Simulation studies show that the proposed bandwidth selection methods are superior to existing density estimation bandwidth selection methods in estimating density functionals.
Suppose that a random variable with a probability density function (p.d.f.) belongs to a location-scale family. Let and be the location and scale parameters of , respectively. We have for some base function . If is a symmetric function, then is usually chosen to be in the same class of distributions with mean zero. For instance, if is the p.d.f. of a Normal distribution with mean and standard deviation , then is usually chosen to be the density of the standard Normal distribution. In the nonparametric setting, is not assumed to have any prespecified distributional form. Therefore, and are unknown and cannot be estimated by any distribution-based method such as maximum likelihood estimation (MLE). Ahmad proposed a nonparametric kernel estimation of location and scale parameters via density functionals estimation with known base functions. The location and scale parameters are written in terms of density functionals as follows: Apparently, the location and scale rely only on two functionals of the unknown density , namely, and , if is known. Ahmad showed that the new kernel location and scale estimates have better asymptotic properties than the MLE. Simulation results in Ahmad and Amezziane, a subsequent work, indicated that the kernel location and scale estimators have variability comparable to that of the MLE and smaller than that of Huber’s M-estimator. However, it is usually difficult or impossible to know the base density, especially in the nonparametric setting. Moreover, can be derived in terms of and if the base density is given. In this case, the problem becomes parametric and the MLE can be considered. From this point of view, Ahmad’s scale and location estimates are not very practical in real-world applications because the base density function needs to be known first.
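As a worked illustration of why two density functionals suffice, the following derivation starts from the standard location-scale form. The symbols $\mu$, $\sigma$, $f$, and $f_0$ are notation assumed for this sketch and need not match the article's own numbered equations.

```latex
% Assumed notation: f(x) = \sigma^{-1} f_0((x - \mu)/\sigma), with u = (x - \mu)/\sigma.
\begin{align*}
\int f^2(x)\,dx
  &= \frac{1}{\sigma}\int f_0^2(u)\,du
  &&\Longrightarrow\quad
  \sigma = \frac{\int f_0^2(u)\,du}{\int f^2(x)\,dx}, \\
\int x f^2(x)\,dx
  &= \frac{\mu}{\sigma}\int f_0^2(u)\,du + \int u f_0^2(u)\,du
  &&\Longrightarrow\quad
  \mu = \frac{\int x f^2(x)\,dx - \int u f_0^2(u)\,du}{\int f^2(x)\,dx}.
\end{align*}
```

Under these assumptions, the unknown density enters only through $\int f^2(x)\,dx$ and $\int x f^2(x)\,dx$; every other term depends on the base function alone. If the base function is symmetric about zero, the term $\int u f_0^2(u)\,du$ vanishes.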
Chen proposed kernel-based nonparametric tests of equality of scale and location parameters among populations based on the kernel scale and location estimators proposed by Ahmad. Testing is equivalent to testing according to (1), where and are the scale and density function of the th population, respectively, and . Likewise, is equivalent to by (3) if homogeneous scale is assumed. This fact motivated Chen to build test statistics for equality of scale and location on the density functionals estimation of and , respectively. Chen gave new life to the two kernel density functionals estimations, which were originally introduced by Ahmad to estimate location and scale parameters. When comparing the scale (or location) parameters among populations, the differences in scale (or location) are completely determined by (or ), and becomes irrelevant if we assume the populations are from the same distributional family and differ only in location and/or scale. Thus the assumption that the base density must be known, as required in kernel scale and location estimation, was successfully dropped. Finding a good estimate of the density functionals and becomes our next concern.
Aubuchon and Hettmansperger proposed a kernel estimation of by a convolution of the kernel density estimate with the empirical CDF and showed its asymptotic equivalence to Lehmann’s estimator (see Lehmann for details) based on the Wilcoxon confidence interval. Ahmad provided two approaches to estimate and : one is similar to that of Aubuchon and Hettmansperger, and the other approximates the density function with an orthogonal series expansion and then estimates the functionals of the density. Grübel estimated the density functionals for known under certain conditions through the kernel density estimate of the unknown .
Choice of bandwidth (window width or smoothing parameter) is crucial for every kernel-based procedure, such as kernel density estimation and kernel regression. A vast literature has been devoted to choosing a practical optimal bandwidth for techniques built on kernel estimation. Representative surveys of bandwidth selection techniques can be found in Bowman, Jones et al., Loader, and Wand and Jones. Jones et al. grouped data-based bandwidth selection methods for density estimation into “first generation” and “second generation” methods. The first generation methods, including least squares cross-validation (LSCV) in Bowman and biased cross-validation in Scott and Terrell, suffer from a slow relative rate of convergence to of order . Härdle and Marron applied the least squares cross-validation idea to bandwidth selection for the Nadaraya-Watson estimator. The second generation methods are mainly based on plug-in techniques. The idea of “plug-in” is to replace with a consistent estimate, as first proposed by Nadaraya and Woodroofe; however, the practical choice of pilot bandwidth was not discussed. Sheather and Jones proposed a refined plug-in method, the so-called “solve-the-equation (STE)” plug-in, which has a faster rate of convergence of order than cross-validation estimators. Smoothed cross-validation (SCV), developed by Hall et al., is also a plug-in type method, with pilot bandwidth of the form . Müller and Staniswalis employed the idea in kernel regression. Hall et al. constructed root- bandwidth selectors and achieved the optimal relative rate of convergence by appropriate choice of the parameters in the pilot bandwidth . Gasser et al. and Ruppert et al. carried the simple direct plug-in idea over to local linear regression. Fan and Gijbels adapted the “cross-validation technique, the Normal-reference method, and the plug-in approach” from the density estimation setting to the corresponding bandwidth selectors for local polynomial regression.
However, few studies address the optimal bandwidth for the estimation of and , which are two important density functionals for estimating location and scale parameters as discussed in the preceding paragraphs. Aubuchon and Hettmansperger chose the bandwidth by removing the bias term. Grübel suggested using the MISE-optimal choice of bandwidth in kernel density estimation. Chen used the least squares cross-validation bandwidth selection method for density estimation. In this paper, we derive the optimal bandwidth for kernel location and scale estimation by minimizing the MSE of the kernel functionals estimation for and . We also propose two practical bandwidth selection methods and compare them with various bandwidth selection methods for kernel density estimation, such as Rule-of-Thumb (ROT), direct plug-in (DPI), least squares cross-validation (LSCV), and biased cross-validation (BCV).
For simplicity of illustration, a unified form (i.e., ) of the two density functionals mentioned above will be used throughout the paper. When and , it equals and , respectively. The paper is organized as follows. The optimal bandwidth for estimation in terms of the AMSE criterion is derived in Section 2.1. Two practical bandwidth selection methods for are provided in Sections 2.2 and 2.3 when and . The asymptotic distribution of the direct plug-in bandwidth for kernel functionals estimation of is given in Section 2.3 as well. Section 3 presents three simulation studies that explore the properties of the proposed bandwidth selection methods and evaluate their performance against several classical bandwidth selection methods for kernel density estimation.
2. Main Results
2.1. Optimal Bandwidth Selection
Define and . Let us write and as a more general density functional . Note that and are special cases of , where is and , respectively. Suppose are independent random variables from a distribution with density function , where is unknown. Similar to Aubuchon and Hettmansperger and Grübel, we obtain the kernel density functionals estimate of by , where is the kernel density estimate of and is the empirical CDF. Thus, a kernel density functionals estimate of is given by where and is the kernel function (details can be found in Wand’s book). The following theorem provides the mean and variance of in (4) for fixed .
Theorem 1. For in (4), the expected value and variance of are given by where and .
We prove this in Appendix A. The first term in (6) is nonnegative by Jensen’s inequality. Then the MSE of can be written as follows: Therefore, the optimal bandwidth for density functionals estimation of is , the minimizer of . To obtain a closed form of the optimal bandwidth for kernel functionals estimation of , the minimizer of the asymptotic mean square error (AMSE) of is studied instead. The optimal bandwidth for estimation of with respect to the AMSE criterion is given by where However, in (8) is not computable since and depend on the unknown function . A quick and simple guess of the AMSE-optimal bandwidth is the “Normal scale” bandwidth. It gives reasonable answers whenever the data are close to Normal. In the next section, Normal scale bandwidth selection will be studied for and , respectively.
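The estimator described in Section 2.1 (the kernel density estimate integrated against the empirical CDF) can be sketched as a double sum over the sample. This is a minimal illustration rather than the author's code: the function names `gaussian_kernel` and `kdfe`, the argument name `weight` for the function multiplying the squared density, and the choice to keep the diagonal terms are assumptions of this sketch.

```python
import math


def gaussian_kernel(u):
    """Standard Normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kdfe(sample, h, weight=lambda x: 1.0, kernel=gaussian_kernel):
    """Estimate the integral of weight(x) * f(x)^2 dx from a sample.

    Computes (1 / (n^2 h)) * sum_i sum_j weight(X_i) * K((X_i - X_j) / h),
    that is, the kernel density estimate averaged over the empirical CDF.
    The i == j terms are kept, as they are in the plain plug-in form.
    """
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    n = len(sample)
    total = 0.0
    for xi in sample:
        inner = sum(kernel((xi - xj) / h) for xj in sample)
        total += weight(xi) * inner
    return total / (n * n * h)
```

With the default `weight` the function targets the functional tied to scale; passing `weight=lambda x: x` targets the one tied to location.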
2.2. Normal Scale Bandwidth Selection
Proposition 2. If is Normal with mean and variance , then the Normal scale AMSE-optimal bandwidth selector for is given by where is some estimate of .
The proof of Proposition 2 can be found in Appendix B. If the Gaussian kernel is chosen, that is, is the density of the standard Normal distribution, then and . Hence (11) is simplified to which can be called the “Rule-of-Thumb” (ROT) bandwidth selector for kernel scale estimation.
Proposition 3. If is Normal with mean and variance , then the Normal scale AMSE-optimal bandwidth selector for is given by where is an estimate of and is an estimate of . If (sample standard deviation) and (sample mean), then (14) can be rewritten as where is the coefficient of variation (CV). In particular, when for fixed , goes to infinity.
The proof of Proposition 3 is given in Appendix C. When the kernel function is the density of the standard Normal distribution, the “Rule-of-Thumb” bandwidth selector for kernel location estimation is Both (14) and (16) imply that the larger the location of is in absolute value, the smaller the optimal bandwidth needed. In other words, the optimal-AMSE bandwidth for goes to infinity. This fact applies not only to the Normal distribution with zero mean but also extends to any distribution whose p.d.f. is an even function (a distribution symmetric around zero).
Corollary 4. For any distribution with even density function , that is, , the optimal-AMSE bandwidth selector is .
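A short heuristic may help explain why an unbounded bandwidth is sensible for a density that is even about zero. It is written for the plug-in double-sum form of the estimator with weight $x$; the symbols $\hat{\theta}_h$, $X_i$, $K$, $h$, and $n$ are notation assumed for this sketch, and the argument is an intuition under a bounded kernel, not the proof given in the appendices.

```latex
% Assumed estimator: \hat{\theta}_h = (n^2 h)^{-1} \sum_i \sum_j X_i K((X_i - X_j)/h).
\begin{align*}
\int x f^2(x)\,dx &= 0
  \quad\text{when } f(-x) = f(x), \text{ since the integrand is odd}, \\
\bigl|\hat{\theta}_h\bigr|
  &\le \frac{\sup_u K(u)}{h}\cdot\frac{1}{n}\sum_{i=1}^{n}\lvert X_i\rvert
  \;\longrightarrow\; 0
  \quad\text{as } h \to \infty .
\end{align*}
```

For a fixed sample the estimate is therefore driven toward the true value zero as the bandwidth grows, so no finite bandwidth minimizes the error in this case.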
Remarks. (1) The optimal bandwidth for estimation of is not affected by the location of , but only by the scale parameter. However, the optimal bandwidth for estimation of depends not only on scale but also on location. This fact will be illustrated by the simulation study in Section 3.1. Note that the scale parameter is determined by and the location parameter is determined by .
(2) The common choice of is the sample standard deviation, as in Silverman. However, Wand and Jones recommended the smaller of and the interquartile range . Janssen et al. also studied other, more sophisticated estimates of .
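The choice of scale estimate in Remark (2) can be sketched in a few lines. The function name `robust_scale` is hypothetical, and dividing the interquartile range by the interquartile range of a standard Normal (about 1.349), so that both candidates estimate the same quantity under normality, is an assumption of this sketch; the text above only says "interquartile range".

```python
import statistics


def robust_scale(sample):
    """Smaller of the sample standard deviation and the standardized IQR."""
    s = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)
    # IQR of a standard Normal, roughly 1.349 (assumed standardization).
    nd = statistics.NormalDist()
    normal_iqr = nd.inv_cdf(0.75) - nd.inv_cdf(0.25)
    return min(s, (q3 - q1) / normal_iqr)
```

Taking the minimum guards against a standard deviation inflated by outliers or heavy tails, which would otherwise push a Normal scale bandwidth upward.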
2.3. Direct Plug-In Bandwidth Selection
If the distribution of ’s, that is, , departs far from the Normal distribution, then the Normal scale bandwidth selector will be problematic. Note that and in (8) are unknown and need to be estimated to obtain a practical optimal bandwidth selector. A natural estimate of is Similarly, can be estimated by Replacement of and by and leads to the direct plug-in (DPI) bandwidth selector for :
Obviously, the kernel density functionals estimates in (17) and (18) rely on the choice of pilot bandwidth . Simple candidates for the pilot bandwidth are the Normal scale bandwidth selector proposed in Section 2.2 for or smoothing parameters for the traditional density estimate (e.g., ROT, LSCV, BCV, and DPI surveyed in Wand and Jones). The DPI bandwidth can be computed in practice through the following steps.
Step 1. Estimate using the Normal scale bandwidth proposed in Section 2.2 (i.e., for estimation of and for estimation of ) or a bandwidth selector for density estimation (such as ROT , LSCV , BCV , and DPI ).
Step 3. The DPI bandwidth for is then obtained from (19).
The performance of these pilot bandwidth selections is compared in terms of MSE of through Monte Carlo simulation in Section 3.2 (Simulation Study 2). Next, we will study the asymptotic distribution of . The limiting distribution of practical bandwidth selector is very important in that the rate of convergence is the chief concern.
Proposition 5. If and density function are continuous and satisfy and , then
Remarks. Particularly, when , the DPI bandwidth selector for estimation of is , where and . Likewise, when , the DPI bandwidth selector for estimation of is , where and .
3. Simulation Study
Three simulation studies are carried out to evaluate [Simulation Study 1] the accuracy of and (Normal scale bandwidth for and ) compared with and under the normality assumption; [Simulation Study 2] the optimal choices of pilot bandwidth for and in terms of the MSE of and , respectively; [Simulation Study 3] the performance of the proposed practical optimal bandwidth selection methods (ROT and DPI proposed in Sections 2.2 and 2.3) versus traditional (classical) bandwidth selection for the kernel density estimate in terms of the MSE of . As to the choice of kernel function , it has been shown in the literature that the choice of bandwidth outweighs the effect of the choice of kernel function. For simplicity, we therefore use the Gaussian kernel in all three simulation studies.
3.1. Simulation Study 1
The purpose of this study is threefold: (1) to evaluate the performance of and when samples are from the Normal distribution, (2) to study and (the optimal bandwidths that minimize the MSE of and , resp.) in terms of the location parameter of the Normal distribution, and (3) to illustrate numerically that the optimal bandwidth that minimizes the MSE of goes to infinity when the location parameter gets closer to zero.
Figure 1 plots the MSE of versus the choice of bandwidth when samples of sizes 20, 50, 100, and 200 are drawn from and , respectively (the simulation result is not sensitive to the choice of scale). The blue curve in each subplot represents the MSE() as the bandwidth ranges from 0 to 2. The minimum point of the blue curve indicates . The red vertical line in the subplot represents and is computed from (10) by replacing with the p.d.f. of , where in Figure 1(a) and in Figure 1(b). is an estimate of (an asymptotic approximation of ) under the normality assumption. Simulation results in Figure 1 show that tends to have small variance and to stabilize around the true for Normal data. The optimal bandwidth does not change with the location parameter, as shown in Figure 1(a) () and Figure 1(b) () (more simulation results based on location parameters other than 0 and 1 are available upon request).
Figure 2 plots the MSE of versus the choice of bandwidth when samples of sizes 20, 50, 100, and 200 are drawn from , , , and , respectively. As in Figure 1, the blue curve and red vertical line represent MSE() and . The boxplot of is based on the 100 sets of simulated samples of size from the Normal distribution with mean and standard deviation . The red vertical line disappears in Figure 2(a) due to the fact that as . Also, the MSE of (blue curve) strictly decreases as rises in Figure 2(a). To conclude from Figure 2(a), the optimal bandwidth for kernel functional estimation of goes to infinity when the mean of the underlying distribution is zero, which is consistent with Proposition 3. However, the boxplot in Figure 2(a) suggests that the distribution of is right-skewed with median, first quartile, and third quartile around one, which is far from the true value as well as . When deviates slightly from zero, as in Figures 2(b) and 2(c), tends to be less variable (and less skewed), to overlap with (red vertical line), and to get closer to (valley of the blue curve), especially as the sample size grows. When increases to 1 and above, as shown in Figure 2(d), the median of , , and coincide when the sample size is 50 or more. More simulation results can be found in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/242683.
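The kind of experiment described in Simulation Study 1 can be sketched as a small Monte Carlo loop over a bandwidth grid. This is an illustrative reconstruction under stated assumptions, not the code behind Figures 1 and 2: the function names, the grid, the number of replications, and the seed are all hypothetical, and the estimator is the plug-in double sum with a Gaussian kernel. The true values used are the closed forms for a Normal density with mean mu and standard deviation sigma, namely 1/(2 sigma sqrt(pi)) for the integral of f^2 and mu/(2 sigma sqrt(pi)) for the integral of x f^2.

```python
import math
import random


def kdfe(sample, h, weight):
    """(1 / (n^2 h)) * sum_i sum_j weight(X_i) * K((X_i - X_j) / h), Gaussian K."""
    n = len(sample)
    c = 1.0 / math.sqrt(2.0 * math.pi)
    total = 0.0
    for xi in sample:
        inner = sum(math.exp(-0.5 * ((xi - xj) / h) ** 2) for xj in sample)
        total += weight(xi) * c * inner
    return total / (n * n * h)


def normal_truth(mu, sigma, location=False):
    """Exact integral of f^2 (or of x * f^2) for a Normal(mu, sigma^2) density."""
    base = 1.0 / (2.0 * sigma * math.sqrt(math.pi))
    return mu * base if location else base


def mc_mse(h_grid, n, mu, sigma, location=False, reps=100, seed=0):
    """Monte Carlo MSE of the estimator at each bandwidth in h_grid."""
    rng = random.Random(seed)
    weight = (lambda x: x) if location else (lambda x: 1.0)
    truth = normal_truth(mu, sigma, location)
    sq_err = [0.0] * len(h_grid)
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        for k, h in enumerate(h_grid):
            sq_err[k] += (kdfe(sample, h, weight) - truth) ** 2
    return [e / reps for e in sq_err]


if __name__ == "__main__":
    grid = [0.1 * k for k in range(1, 21)]
    mse = mc_mse(grid, n=50, mu=1.0, sigma=1.0, reps=50)
    best = min(range(len(grid)), key=mse.__getitem__)
    print("empirical MSE-minimizing bandwidth on the grid:", grid[best])
```

Because the estimator with constant weight depends on the sample only through pairwise differences, rerunning `mc_mse` with a different `mu` and the same seed gives the same MSE curve up to rounding, which mirrors the location invariance noted for Figure 1. Setting `location=True` with `mu=0.0` gives a curve that keeps falling as the bandwidth grows, in line with the behavior reported for Figure 2(a).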
3.2. Simulation Study 2
Several candidate bandwidth selection methods are available to serve as a pilot bandwidth , such as the classical bandwidth selection methods for the kernel density estimate (described in Section 2.3, Step 1) and the optimal bandwidths for ( or ) based on Normal scale references, namely, “” and “” in Section 2.2. This subsection aims to study the pilot bandwidth for and required to estimate in (19) (note that is simplified to and when and , resp.). Five different choices of pilot bandwidth are studied in this subsection to estimate and for under five different underlying distributions: (i) Cauchy with location ( was set to 1 in the simulation study for estimation of and 5 for estimation of ) and scale 2/3, (ii) Generalized Pareto with location , scale 2/3, and shape 1, (iii) Normal with location and scale 2/3, (iv) Mix-Normal I, and (v) Mix-Normal II. Mix-Normal I and Mix-Normal II are weighted mixtures of two Normal distributions: Mix-Normal I is and Mix-Normal II is . The Cauchy distribution is a well-known fat-tailed symmetric distribution. Generalized Pareto with shape 1 is an extremely fat-tailed asymmetric (right-skewed) distribution. Density curves of the two Mix-Normal distributions are given in Figure 3. Both Mix-Normal I and Mix-Normal II are bimodal distributions, in contrast to unimodal distributions such as Cauchy, Generalized Pareto, and Normal. Mix-Normal I is a symmetric bimodal distribution and Mix-Normal II is an asymmetric (left-skewed) bimodal distribution. The motivation behind the choice of distributions is to see whether the optimal bandwidth is sensitive to skewness, extreme outliers, and complex shapes of the distributions in contrast to the Normal.
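Samples from the non-Normal families listed above can be drawn with inverse-CDF formulas, as in the following sketch. The function names are hypothetical, and the mixture weights, means, and standard deviations are supplied by the caller because the Mix-Normal parameters are not stated in the text above; nothing here is taken from the author's simulation code.

```python
import math
import random


def sample_cauchy(rng, loc, scale):
    """Inverse-CDF draw from a Cauchy(loc, scale) distribution."""
    return loc + scale * math.tan(math.pi * (rng.random() - 0.5))


def sample_gen_pareto(rng, loc, scale, shape):
    """Inverse-CDF draw from a Generalized Pareto distribution (shape != 0)."""
    u = rng.random()  # u lies in [0, 1), so 1 - u lies in (0, 1]
    return loc + scale * ((1.0 - u) ** (-shape) - 1.0) / shape


def sample_normal_mixture(rng, weights, means, sds):
    """Draw from a finite mixture of Normal components."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], sds[k])
```

For example, `sample_gen_pareto(random.Random(0), 1.0, 2.0 / 3.0, 1.0)` draws from the Generalized Pareto setting with location 1, scale 2/3, and shape 1; any mixture parameters passed to `sample_normal_mixture` should be treated as placeholders until checked against the original article.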
Figures 4 and 5 compare five candidate pilot bandwidth selection methods in terms of the boxplots of the MSE of and when the sample size is 100. The five candidates for pilot bandwidth selection considered in this paper are (1) the Rule-of-Thumb bandwidth for and proposed in Section 2.2 (“1-ROT(s)” means ; “1-ROT(L)” means ); (2) the Rule-of-Thumb bandwidth for kernel density estimation (KDE) proposed in Scott (“2-ROT(d)” means , where is the minimum of the standard deviation and the interquartile range); (3) the least squares cross-validation bandwidth for density estimation proposed in Bowman (“3-UCV(d)” means for density estimation); (4) biased cross-validation for density estimation proposed in Scott and Terrell (“4-BCV(d)” means for density estimation); (5) the direct plug-in bandwidth for density estimation reported by Sheather (“5-DPI(d)” means for density estimation). Figures 4 and 5 show that is the worst candidate for the pilot bandwidth in the density functionals estimation . The pilot bandwidth choice gives the lowest MSE() for Cauchy and Normal samples, while results in the lowest MSE() for Generalized Pareto, Mix-Normal I, and Mix-Normal II samples. Similar conclusions hold for the pilot bandwidth choice in estimation of ; however, leads to a slightly smaller MSE() than for Mix-Normal I and Mix-Normal II samples. Simulation Study 1 illustrates that the Normal reference bandwidth (including and ) is not a reliable estimate when the location parameter of the underlying distribution is close to zero. Therefore, , rather than , is recommended to serve as the pilot bandwidth in estimation of .
[Figures 4 and 5: boxplots of the MSE for the five candidate pilot bandwidths at sample size 100. Recoverable panel titles: (d) Mix-Normal I; (e) Mix-Normal II.]
3.3. Simulation Study 3
This section aims to evaluate the performance of our proposed bandwidths (or ) in Section 2.2 and (or ) in Section 2.3 and to compare them with the classical bandwidth selection methods for density estimation, that is, , , and , in estimation of (or ). Simulation Study 2 recommends and as pilot bandwidths for the direct plug-in bandwidth for estimation of and among the 5 candidate pilot bandwidth methods and 5 different underlying distributions. Therefore, (or ) with these two pilot bandwidth selection methods, and , is considered separately in this subsection. The direct plug-in bandwidth for estimation of with pilot bandwidth is denoted by “” in Table 1 and “2a-DPI(s)” in Figure 6, and that with pilot bandwidth is denoted by “” in Table 1 and “2b-DPI(s)” in Figure 6.
Notes to Table 1: (1) ROT bandwidth for estimation of given by (12); (2) DPI bandwidth for estimation of with pilot bandwidth ; (3) DPI bandwidth for estimation of with pilot bandwidth ; (4) ROT bandwidth for density estimation proposed in Silverman; (5) LSCV bandwidth for density estimation proposed in Bowman; (6) DPI bandwidth for density estimation proposed in Sheather and Jones.
The summary statistics (mean, first quartile, median, and third quartile) for the three proposed bandwidths and three classical bandwidths are provided in Tables 1 and 2 for estimation of and , respectively. Samples of size 50 and 100 from five different underlying distributions (3 unimodal distributions and 2 bimodal distributions), as in Simulation Study 2, are considered in both simulation studies of and estimation.
Notes to Table 2: (1) ROT bandwidth for estimation of given by (16); (2) DPI bandwidth for estimation of with pilot bandwidth ; (3) DPI bandwidth for estimation of with pilot bandwidth ; (4) ROT bandwidth for density estimation proposed in Silverman; (5) LSCV bandwidth for density estimation proposed in Bowman; (6) DPI bandwidth for density estimation proposed in Sheather and Jones.
In general, the optimal bandwidth for kernel density functionals estimation (estimation of and in this paper) is smaller than the one for kernel density estimation under the same sample size and underlying distribution, as shown in Tables 1 and 2, except for the least squares cross-validation bandwidth for density estimation on Generalized Pareto samples. In other words, kernel density functionals estimation requires less smoothing, which exaggerates some characteristics of the sample. For instance, location and scale estimation will be more sensitive to outliers than density estimation.
To evaluate the performance of our proposed bandwidth selection methods against classical bandwidth selection methods for density estimation in estimation of and , the MSE of (and ) is computed and compared. Figures 6 and 7 show the boxplots of the MSE of and , respectively, for the 6 bandwidths shown in Tables 1 and 2 under five different distributions: Normal, Cauchy, Generalized Pareto, Mix-Normal I, and Mix-Normal II. Both figures illustrate that (a) the MSE of and decreases as the sample size increases from 50 to 100; (b) the MSE of and is larger for samples from asymmetric distributions than from symmetric distributions, and from bimodal distributions than from unimodal distributions. Figure 6 indicates that (i) Normal scale bandwidths for both estimation of () and density estimation () lead to smaller relative to the other 4 types of bandwidth selection methods for Normal samples of size 50. When the sample size goes up to 100, outperforms with a smaller MSE; (ii) for Cauchy samples with location 1 and scale 2/3, becomes the worst choice in kernel density functionals estimation of , especially for relatively large sample sizes. Our proposed bandwidth results in the smallest MSE in this case; (iii) for Generalized Pareto samples with shape 1, location 1, and scale 2/3, both and perform very poorly. However, with pilot bandwidth gives the smallest MSE for Pareto samples, which can partly be explained by the fact that gives the second smallest MSE; (iv) for bimodally distributed samples (including Mix-Normal I and Mix-Normal II), the three proposed bandwidth selection methods for estimation of completely dominate the three classical density estimation bandwidth selection methods. Among the three proposed bandwidths, with pilot bandwidth gives the minimum .
Figure 7 compares the performance of the three proposed practical bandwidth selection methods with three classical density estimation bandwidths in terms of kernel density functionals estimation of . It shows that with pilot bandwidth outperforms the other 5 bandwidth methods in kernel density functionals estimation of for Normal and Cauchy samples. As in the estimation of , the direct plug-in bandwidth designed for estimation of , that is, with pilot bandwidth , beats the other candidates for Generalized Pareto samples. The optimal bandwidth selection (with three practical estimates , , and ) for estimation of proposed in this paper performs significantly better than the bandwidth selection for density estimation in density functionals estimation for Mix-Normal distributions. Among the three proposed bandwidth selection methods for estimation of , works better than the others for a mixture of 2 Normal distributions, as indicated in Figure 7.
The optimal bandwidth, along with three practical bandwidth selection methods for kernel density functionals estimation of the form , is discussed in this paper. The need for this study arises from the fact that and are the two core components of scale and location, respectively. Chen shed light on a novel field of nonparametric analytic methods for experimental design relying on kernel density functionals estimation of