Abstract

Kernel density estimators are often not consistent near a finite endpoint of the support of the density being estimated, owing to boundary effects. To address this, researchers have proposed using an optimal bandwidth to balance the bias-variance trade-off in the estimation of a finite population mean. This, however, does not eliminate the boundary bias. In this paper, a weighting method of compensating for nonresponse is proposed. Asymptotic properties of the proposed estimator of the population mean are derived. Under mild assumptions, the estimator is shown to be asymptotically consistent.

1. Introduction

Estimation of population parameters, for example the population mean, using kernel density estimators in the presence of nonresponse often leads to bias due to boundary effects; see, for instance, [1]. This affects the optimality of the estimators of the population parameters. To address this problem, the use of an optimal bandwidth has been suggested in the literature. However, this does not eliminate the boundary bias. A weighting method of compensating for nonresponse in two-stage cluster sampling is proposed in this study. The values of the auxiliary variables are assumed to be known for all the clusters, while the values of the survey variable are known only for the responding units in the selected sample.

Let $Y_{ij}$ denote the value of the survey variable for unit $j$ in cluster $i$. The problem is to estimate the survey values of the nonresponse component in the second stage of sampling in the selected sample. This is done by first generating data using a linear regression model applied by [2, 3]. The model is given by

$$Y_{ij} = m(X_{ij}) + e_{ij}, \tag{1}$$

where $m(\cdot)$ is a smooth function of the auxiliary variables $X_{ij}$ and $e_{ij}$ is the residual term with mean zero and variance $\sigma^2$, which is strictly positive. Auxiliary data are assumed to be known throughout the study and are therefore used to predict the nonresponse values. In the following sections, different methods of estimating the nonresponse values of the survey variable using a function of the auxiliary data, $m(X_{ij})$, are discussed.

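Let $f$ denote a probability density function with support $[0, \infty)$ and consider a nonparametric estimator of $f$ based on a random sample $X_1, \dots, X_n$ from $f$. The kernel estimator of $f$ due to [4] is given by

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right), \tag{2}$$

where $K$ is some specified density function, symmetric about zero over the interval $[-1, 1]$, with $h$ as the bandwidth such that $h \to 0$ as $n \to \infty$. The properties of $\hat{f}_h(x)$ under some smoothness assumptions for $x \ge h$ are

$$E[\hat{f}_h(x)] = f(x) + \frac{h^2}{2} f''(x) \int_{-1}^{1} t^2 K(t)\,dt + o(h^2), \qquad \operatorname{Var}[\hat{f}_h(x)] = \frac{f(x)}{nh} \int_{-1}^{1} K^2(t)\,dt + o\left(\frac{1}{nh}\right),$$

where $f''$ denotes the second derivative of $f$.

For $x \ge h$, that is, at the interior points, the bias of $\hat{f}_h(x)$ is of order $O(h^2)$. However, at the boundary points, that is, for $x \in [0, h)$, $\hat{f}_h(x)$ is not consistent. In nonparametric curve estimation problems, this phenomenon is called the “boundary effects” of the estimator in (2).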
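The boundary effect can be checked numerically. The following is a minimal Python sketch, not the authors' R code: the Epanechnikov kernel, the Exp(1) test density, and the names `kde`, `h`, `at_boundary`, and `at_interior` are all assumptions made for illustration. With a density supported on the nonnegative half-line, roughly half of the kernel window at the endpoint contains no data, so the estimate there falls to about half the true value, while an interior point is estimated well.

```python
import math
import random

def epanechnikov(u):
    """Epanechnikov kernel: a symmetric density supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(x, data, h, kernel=epanechnikov):
    """Standard kernel density estimate at x with bandwidth h."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)

random.seed(1)
# Assumed test density: f(x) = exp(-x) on [0, inf), so f(0) = 1.
data = [random.expovariate(1.0) for _ in range(20000)]
h = 0.2

at_boundary = kde(0.0, data, h)   # true value is 1, estimate is near 0.5
at_interior = kde(1.0, data, h)   # true value is exp(-1)
print(at_boundary, at_interior, math.exp(-1.0))
```

Increasing the sample size does not repair the estimate at the endpoint in this sketch, which is what inconsistency at the boundary means in practice.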

2. Methods of Reducing Boundary Bias due to Nadaraya-Watson Estimator

It is generally known that kernel density estimators are not consistent when estimating a density near the finite endpoints of its support. This is due to the boundary effects that occur in nonparametric curve estimation. The estimator proposed by [1] suffers from the boundary problem induced by the Nadaraya-Watson estimator it uses. Notably, it is often desirable to have an optimal bandwidth to balance the bias-variance trade-off. If $K$ is a symmetric density function that is fixed across the support of estimation, then inference is generally simplified for unbounded support, i.e., $(-\infty, \infty)$. But the estimator $\hat{f}_h$ is not consistent at the boundary $[0, h)$, where $h$ is the bandwidth, for such a choice of $K$; for details see [5, 6]. So for $x \in [0, h)$, the bias is of order $O(h)$ instead of order $O(h^2)$ at the boundary points. To eliminate the boundary effects in kernel density estimation, the methods described below have been proposed in the literature. A brief description of each method follows.

2.1. Reflection of Data Method

This method has been explored by [5, 7, 8]. It is also known as the “data-reflected technique”. To apply this method, one adds the reflected points $-X_1, \dots, -X_n$ to the data set. Because there are no data on the negative axis, the ordinary estimator uses a gradually reduced amount of data in its window as it approaches the boundary, which results in a boundary bias; the addition of $-X_1, \dots, -X_n$ compensates for this lack of data. The estimator of $f$ is defined by

$$\hat{f}_R(x) = \frac{1}{nh} \sum_{i=1}^{n} \left\{ K\left(\frac{x - X_i}{h}\right) + K\left(\frac{x + X_i}{h}\right) \right\}$$

for $x \ge 0$, where $X_i$ is a function of auxiliary random variables which are assumed to be known throughout the study. At $x = 0$ it can be shown that $\hat{f}_R'(0) = 0$; the method therefore performs better than other methods if the underlying density has the property $f'(0) = 0$. If this property does not hold, the method may become cumbersome to apply.
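The reflection idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the implementation used in the study: the Epanechnikov kernel, the half-normal test density (chosen because its derivative at zero is zero), and the function names `kde_plain` and `kde_reflected` are hypothetical.

```python
import math
import random

def epanechnikov(u):
    """Epanechnikov kernel: a symmetric density supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde_plain(x, data, h):
    """Ordinary kernel estimate, with no boundary correction."""
    return sum(epanechnikov((x - xi) / h) for xi in data) / (len(data) * h)

def kde_reflected(x, data, h):
    """Kernel estimate on the sample augmented with its mirror image about 0."""
    if x < 0:
        return 0.0  # the density is zero to the left of the support
    total = sum(
        epanechnikov((x - xi) / h) + epanechnikov((x + xi) / h) for xi in data
    )
    return total / (len(data) * h)

random.seed(2)
# Assumed test density: half-normal, f(0) = sqrt(2 / pi) and f'(0) = 0.
data = [abs(random.gauss(0.0, 1.0)) for _ in range(20000)]
h = 0.2

plain_at_zero = kde_plain(0.0, data, h)
reflected_at_zero = kde_reflected(0.0, data, h)
true_at_zero = math.sqrt(2.0 / math.pi)
print(plain_at_zero, reflected_at_zero, true_at_zero)
```

At the endpoint the reflected estimate is exactly twice the uncorrected one, because a symmetric kernel gives each mirrored point the same weight as its original.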

2.2. Pseudodata Method

This method was suggested by [9]. In this method, data is generated outside the interval of estimation; that is, the method generates data beyond the left endpoint of the support of the density. Those data are assumed to be linear functions of order statistics in the original sample . This transforms the data into a new set and then puts it on the negative axis. The estimator of is given bywhere andwhere linearly interpolates among in that order; for details see [9]. Though this method is simple to implement and allows a minimal variance of the usual kernel estimator, the drawback of this method is that straight data reflection corrects only for a jump in the value of the density at the ends of its support, not for discontinuities in the derivatives of the density. Therefore, the method does not adequately correct bias problems caused by the edge effects in kernel estimators of order 2 or higher order.

2.3. Boundary Kernel Method

The boundary kernel estimate at a particular point of estimation in the boundary region is obtained by first constructing the appropriate kernel for that point. Many researchers including [7, 10-12] have explored this approach. The method applies a different kernel for estimating the function at each point in the boundary region. Due to this, some kernels may not hold the symmetry property and can therefore put more weight on the positive axis. The estimator for using this method is given bywhere , , and . Besides, is such that for . The boundary kernel and related methods usually have low bias, but the price for that is an increase in variance. It has been observed, see, for instance, [11], that approaches involving only kernel modifications without regard to the data, such as the boundary kernel method, are associated with larger variance. Besides, the corresponding estimates tend to take negative values near the boundary points. This is because some kernels may not be symmetric and can therefore put more weight on the positive axis. These drawbacks limit the use of this method.

2.4. Transformation of Data Method

This technique has been discussed by [4, 13]. The original data $X_1, \dots, X_n$ are transformed to $g(X_1), \dots, g(X_n)$, while the original data are retained, where $g$ is a nonnegative, continuous, and monotonically increasing function on $[0, \infty)$. To use this method one can take a one-to-one continuous function $g: [0, \infty) \to [0, \infty)$. A regular kernel estimator is then used with the transformed data set $g(X_1), \dots, g(X_n)$. The estimator is given by

$$\hat{f}_T(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - g(X_i)}{h}\right).$$

This method gives the estimator of the probability density function of $g(X)$, not that of $X$. The strength of this method is that transformation-based boundary correction estimates are nonnegative and have low variance. The nonnegativity property is vital in practical applications, and it is therefore worth exploring methods that result in nonnegative estimators.
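A small Python sketch makes the point that a regular kernel estimator applied to transformed data estimates the density of the transformed variable. Everything specific here is an assumption for illustration: the transformation `g(x) = log(1 + x)`, the Exp(1) test data, the Epanechnikov kernel, and the name `kde_transformed` are hypothetical and are not taken from the study.

```python
import math
import random

def epanechnikov(u):
    """Epanechnikov kernel: a symmetric density supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def g(x):
    """Hypothetical transformation: nonnegative, continuous, increasing on [0, inf)."""
    return math.log1p(x)

def kde_transformed(y, data, h):
    """Regular kernel estimate applied to the transformed sample g(X_1), ..., g(X_n)."""
    transformed = [g(xi) for xi in data]
    return sum(epanechnikov((y - t) / h) for t in transformed) / (len(data) * h)

random.seed(3)
data = [random.expovariate(1.0) for _ in range(20000)]  # assumed X ~ Exp(1)
h = 0.1
y0 = 0.5

estimate = kde_transformed(y0, data, h)

# Density of Y = g(X) by change of variables: f_Y(y) = f_X(e^y - 1) * e^y.
x0 = math.expm1(y0)
target = math.exp(-x0) * math.exp(y0)
print(estimate, target)
```

The estimate is a sum of nonnegative kernel weights, so it cannot go negative, which is the property highlighted above.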

A modified version of this method is therefore proposed in this study since it is not computationally intensive and is easier to implement compared to the rest of the methods.

In the next section, the estimator of the finite population mean is modified using transformation of data technique, and further, its asymptotic properties are derived.

3. Proposed Estimator of Finite Population Mean Using Modified Transformation of Data Method

Consider a finite population of size consisting of clusters with elements in the cluster. Let denote the value of the survey variable for unit in cluster , for . To estimate the nonresponse values in the second stage of sampling a linear regression model given in (1) is used. Auxiliary data is assumed to be known throughout the study and is therefore used to predict the nonresponse values. The estimator proposed by [1] suffers from boundary bias. To obtain a nonparametric regression estimator for the finite population mean that resolves boundary bias, the function of auxiliary variables given in (10) below is used to predict the nonresponse values of the survey variable ; the estimator is defined bywhere is the function of auxiliary variables due to modified transformation of data method proposed. Following the work of [9], data should be generated beyond the left endpoint of the support of the density function such that the data provides a natural adjustment of the density outside its support. The method of data generation procedure combines the transformation and the reflection of data methods. To do this, first transform the original data to while retaining the original data where is a nonnegative continuous and monotonically increasing function from to . Secondly, reflect around the origin so that we have . Consequently, using the enlarged data sample the new estimator of the population mean is defined bywhere represents the estimator of the nonresponse units and can be rewritten aswhere are the modified weights arising from the proposed procedure. From (12), the following equation is obtained:Using (13) the expected value of the estimator of the population mean is therefore given byIn what follows, the bias and variance of the proposed estimator are derived.

3.1. Asymptotic Bias of the Proposed Estimator

Introduction. Boundary bias occurs in the interval due to lack of data following the reduction of such data at this interval. This implies that the density function has continuity on and is 0 for . Due to reduced amount of data, the resulting estimators are biased. This is possible if the selected bandwidth is greater than the value of , i.e., if . Consider the Nadaraya-Watson estimator given by in [1]. In addition, consider for where so that for one can obtainso that . Next, consider a kernel estimator given in (1) which has the support ; this means the variable must be contained in the interval , so that for we have . Since the main problem is to estimate the nonresponse component of the proposed estimator, the following theorem due to [9], under certain conditions on and outlined below, is applied.

Theorem 1. Assume that and exist and are continuous, where such that and . Assume that , , , and , , where is the inverse function of while and are the derivatives of and , respectively, for . Furthermore, let , where . Assume the kernel function K is nonnegative, symmetric function with support such that , , and .

Using Theorem 1, the expected value of the nonresponse component is given by AndUsing change of variables technique and on simplification the result becomeswhich simplifies to which on simplification yields Substituting (21) into (17) the following is obtained: Since exists and is continuous near , then for , we have Hence simplifying (22) gives Thus the bias of the estimator of the nonresponse component in (14) can be expressed asAs , it is noted that . This shows that, for the bias to reduce, the bandwidth must tend to zero as the sample size increases; that is, as , .

3.2. Asymptotic Variance of the Proposed Estimator

The variance of the estimator proposed is given aswhich can be rewritten as follows:whereandEquation (31) can be expanded to getusing change of variables technique, (32) can be expressed aswhich on simplification reduces toNext we haveSince is continuous, Taylor’s series expansion givesReplacing given by (36) in (35) yieldswhich reduces on simplification toNext is to evaluate which as earlier outlined in (30) is given byApplying Taylor’s series expansion and following the same procedure as for , would simplify toHence putting together , i.e., and we have It can be noted that the variance, , is decreasing in . This is because as , the bandwidth and hence ; that is, the bandwidth decreases but not at a faster rate than the sample size. Thus for a large sample size, the variance is reduced significantly. Both the bias and the variance must become small as for the estimator to be optimal. That means the bandwidth has to decrease but not at a faster rate than the sample size. This suffices to establish the consistency of the estimator. That is, for all , in probability as .

4. Simulation Study and Discussion of the Results

The simulation study was carried out using the R statistical package. To obtain the estimator for the finite population mean, , the auxiliary variables were generated as independent and identically distributed random variables on . The population consisted of 30 clusters. In stage one, a sample of clusters was chosen using simple random sampling with replacement (SRSWR); these clusters constituted the primary sampling units (PSUs).

In stage two, from each selected clusters, say , a sample from was selected; that is, the sample from a fixed selected cluster was selected using from a total elements.
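The two sampling stages can be sketched as follows. This is a hedged Python illustration, not the R code used in the study: only the count of 30 clusters comes from the text, while the cluster sizes, the numbers of sampled clusters and units, and the name `two_stage_srswr` are hypothetical choices. The sketch also assumes that units within a selected cluster are drawn with replacement.

```python
import random

def two_stage_srswr(clusters, n_clusters, n_units, rng):
    """Draw clusters with replacement, then units with replacement within each.

    Returns a list of (cluster_index, sampled_values) pairs.
    """
    chosen = rng.choices(range(len(clusters)), k=n_clusters)   # stage one
    return [(i, rng.choices(clusters[i], k=n_units)) for i in chosen]  # stage two

rng = random.Random(2024)

# Hypothetical population: 30 clusters of 20 to 40 units, values uniform on (0, 1).
population = [
    [rng.random() for _ in range(rng.randint(20, 40))] for _ in range(30)
]

sample = two_stage_srswr(population, n_clusters=10, n_units=5, rng=rng)
for cluster_index, units in sample[:3]:
    print(cluster_index, [round(u, 3) for u in units])
```

Because selection is with replacement, the same cluster or unit can appear more than once in the sample.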

Consider the survey variables that are known only for the respondents in the sample. Using known auxiliary variables, , nonresponse values were generated using the model using SRSWR within the cluster.

Moreover, let such that is a function of auxiliary random variables generated using linear, sine, and quadratic data functions outlined in the following subsection.

This procedure was repeated iteratively to obtain . confidence intervals (CI) were then constructed for the estimators of population means which corresponded to the proposed estimator and Nadaraya-Watson estimators of finite population means, respectively.

A normal kernel with mean 0 and variance 1 was used since it has smooth and continuous derivatives at every data point. To maintain stability in terms of the variation of the random values simulated, an optimal bandwidth obtained using the cross-validation technique was used.
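For concreteness, a Nadaraya-Watson fit with a standard normal kernel and a bandwidth chosen by leave-one-out cross-validation might look like the Python sketch below. It is an illustration under assumptions, not the study's R code: the sine data function, the noise level, the bandwidth grid, and the names `nadaraya_watson`, `loo_cv_score`, and `cv_bandwidth` are all hypothetical.

```python
import math
import random

def gaussian_kernel(u):
    """Standard normal density (mean 0, variance 1)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x0, xs, ys, h, skip=None):
    """Kernel-weighted average of ys at x0; optionally leave out observation `skip`."""
    num = den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = gaussian_kernel((x0 - x) / h)
        num += w * y
        den += w
    if den == 0.0:
        raise ValueError("bandwidth too small: all kernel weights are zero")
    return num / den

def loo_cv_score(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    errors = [
        (y - nadaraya_watson(x, xs, ys, h, skip=i)) ** 2
        for i, (x, y) in enumerate(zip(xs, ys))
    ]
    return sum(errors) / len(errors)

def cv_bandwidth(xs, ys, grid):
    """Bandwidth on the grid with the smallest leave-one-out score."""
    return min(grid, key=lambda h: loo_cv_score(xs, ys, h))

random.seed(4)
xs = [random.random() for _ in range(200)]
# Assumed data function for illustration: sine signal plus normal noise.
ys = [math.sin(2.0 * math.pi * x) + random.gauss(0.0, 0.2) for x in xs]

grid = [0.02, 0.05, 0.1, 0.2, 0.5]
h_cv = cv_bandwidth(xs, ys, grid)
fit_at_quarter = nadaraya_watson(0.25, xs, ys, h_cv)  # true signal value is 1
print(h_cv, fit_at_quarter)
```

A grid search is used here only for transparency; a continuous optimizer over the bandwidth would serve equally well.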

4.1. Equations of the Simulated Data Functions

These data functions are commonly used in statistical simulations since they are widely applicable in real life; see, for instance, [14, 15]. Sine functions are used to model periodic events such as light waves and average temperature variations throughout the year. Quadratic functions are used in physics to describe the trajectory of objects thrown upward at an angle, and in economics to develop profit and loss functions. Linear functions are widely applicable, for example, in establishing relationships between a dependent variable and two or more independent variables, e.g., analyzing the linear relationship between the price, supply, and demand of a commodity. Bump functions are used in settings such as biosurveillance, for modeling disease outbreaks or floods within a limited period of time in a given place, and can also be used in curve fitting, uncertainty analysis, and approximation of nonlinear relationships in scattered data. The equations of the simulated data functions are presented in Table 1.

4.2. Simulation Results

The results of the data simulated are presented in Tables 2, 3 and 4. For details on Nadaraya-Watson estimator see [16]. The Nadaraya-Watson estimator was used for comparison with the proposed estimator.

It can be noted in Table 2 that the values of the bias for the proposed estimator are smaller than those of the Nadaraya-Watson estimator for all the data functions simulated, except for the quadratic data function, where the values of the bias are close to each other for the two estimators. This may be attributed to the reflection of the transformed data at the boundaries of the support of the kernel density function used. The transformation of data method was proposed to address the boundary bias arising from the Nadaraya-Watson technique. Hence the proposed estimator reduced the bias due to the Nadaraya-Watson estimation technique for most of the data functions considered.

The efficiency of each estimator of the population mean was assessed by its mean squared error (MSE), which was simulated for purposes of comparison. The results are given in Table 3. The MSE of the proposed estimator is smaller than that of the Nadaraya-Watson estimator for all the data functions. Hence the proposed estimator of the finite population mean is more efficient than the Nadaraya-Watson estimator.

The lower and upper confidence limits were generated for the estimators of the finite population mean, and the confidence interval lengths were then obtained. The results for these confidence interval lengths are presented in Table 4. A good confidence interval attains its nominal coverage while being as short as possible. From Table 4, it can be observed that the confidence interval lengths for the proposed estimator are much smaller than those of the Nadaraya-Watson estimator for all the data functions simulated. Therefore, it can be concluded that the estimator developed in this paper yields tighter confidence intervals than its rival, the Nadaraya-Watson estimator.
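The three performance criteria can be computed from a set of replicated estimates as in the Python sketch below. It is a minimal illustration under assumptions: the interval is the usual normal-approximation interval (mean plus or minus a normal quantile times the standard deviation of the replicates), which may differ from the formula used in the study, and the default `level=0.95`, the function name `summarize`, and the toy inputs are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def summarize(estimates, true_mean, level=0.95):
    """Bias, MSE, and a normal-approximation confidence interval from replicates.

    Assumption: the interval is centre +/- z * sd of the replicated estimates.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    centre = mean(estimates)
    bias = centre - true_mean
    mse = mean((e - true_mean) ** 2 for e in estimates)
    half_width = z * stdev(estimates)
    return {
        "bias": bias,
        "mse": mse,
        "ci": (centre - half_width, centre + half_width),
        "ci_length": 2.0 * half_width,
    }

# Toy replicated estimates of a population mean whose true value is 2.0.
result = summarize([1.0, 2.0, 3.0], true_mean=2.0)
print(result)
```

Since the MSE equals the squared bias plus the variance of the estimator, a short interval is only informative alongside a small bias.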

5. Conclusion

The proposed estimator of the finite population mean has been shown to perform better than the Nadaraya-Watson estimator under several performance criteria: the bias, the mean squared error, and the confidence interval lengths. The results are tabulated in Tables 2, 3, and 4. Most importantly, Table 4 shows that the proposed estimator has tighter confidence intervals at the stated confidence level; hence it produces estimates that are closer to the true population values being estimated.

Data Availability

The data used to support the theoretical findings were generated via simulation using the R statistical package.

Conflicts of Interest

The authors declare that they have no conflicts of interest.