Abstract

Nonresponse is a potential source of error in sample surveys. It introduces bias and inflates the variance in the estimation of finite population parameters. Regression models that use auxiliary data are a recognized technique for reducing the bias and variance due to random nonresponse. In this study, random nonresponse is assumed to occur in the survey variable at the second stage of cluster sampling, with full auxiliary information available throughout. The auxiliary information is used at the estimation stage, via an improved Nadaraya–Watson kernel regression technique, to compensate for random nonresponse. The asymptotic bias and mean squared error of the proposed estimator are derived. In addition, a simulation study indicates that the proposed estimator has smaller bias and smaller mean squared error than existing estimators of a finite population mean. The proposed estimator is also shown to have tighter confidence interval lengths at coverage rate. The results obtained in this study are useful, for instance, in choosing efficient estimators of a finite population mean in demographic sample surveys.

1. Introduction

Many authors, such as [14], have studied estimation of a finite population mean in the presence of nonresponse under various assumptions. However, the estimators developed in these studies can be improved in terms of efficiency and bias. To that end, this study proposes an estimator based on the improved Nadaraya–Watson kernel regression technique, which was first proposed by [5]. Auxiliary information is used through this technique to compensate for random nonresponse.

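An improvement of the Nadaraya–Watson estimator [6, 7] has been proposed by [5] using a local bandwidth factor $\lambda_j$ determined using the algorithm of [8]. The improved Nadaraya–Watson estimator is given by

$$\hat{m}(x) = \frac{\sum_{j=1}^{n} (h\lambda_j)^{-1} K\!\left(\dfrac{x - X_j}{h\lambda_j}\right) Y_j}{\sum_{j=1}^{n} (h\lambda_j)^{-1} K\!\left(\dfrac{x - X_j}{h\lambda_j}\right)}, \tag{1}$$

where $h$ is a smoothing parameter while the local bandwidth factor $\lambda_j$ is given by

$$\lambda_j = \left[\frac{\hat{f}(X_j)}{g}\right]^{-\alpha}, \tag{2}$$

where $g$ is the arithmetic mean of the density estimates $\hat{f}(X_j)$, given by $g = \frac{1}{n}\sum_{j=1}^{n} \hat{f}(X_j)$, while $\alpha$ is a sensitivity parameter which satisfies $0 \le \alpha \le 1$. It has been suggested by [8] that taking $\alpha = 0.5$ produces good results.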
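A minimal sketch of a Nadaraya–Watson fit with local bandwidth factors may help fix ideas. It is written in Python for illustration only (the study itself reports using R), and it rests on several assumptions that are not taken from the text: the pilot density is a fixed-bandwidth Gaussian kernel estimate, the local factor has the form `(density / g) ** (-alpha)` with `g` the arithmetic mean of the pilot densities, each weight carries a `1 / (h * lambda)` scaling, and `alpha = 0.5` is the default. All function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density used as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def pilot_density(xs, h):
    """Fixed-bandwidth kernel density estimate at each sample point (assumed pilot)."""
    n = len(xs)
    return [sum(gaussian_kernel((xi - xj) / h) for xj in xs) / (n * h) for xi in xs]


def local_bandwidth_factors(xs, h, alpha=0.5):
    """Assumed form of the local factor: (f_hat(x_j) / g) ** (-alpha)."""
    dens = pilot_density(xs, h)
    g = sum(dens) / len(dens)  # arithmetic mean of the pilot densities
    return [(d / g) ** (-alpha) for d in dens]


def improved_nw(x0, xs, ys, h, alpha=0.5):
    """Kernel-weighted average of ys at x0 with point-specific bandwidths h * lambda_j."""
    lams = local_bandwidth_factors(xs, h, alpha)
    num = 0.0
    den = 0.0
    for xj, yj, lam in zip(xs, ys, lams):
        w = gaussian_kernel((x0 - xj) / (h * lam)) / (h * lam)
        num += w * yj
        den += w
    return num / den


if __name__ == "__main__":
    xs = [0.1, 0.2, 0.35, 0.5, 0.8, 0.9]
    ys = [1.0, 1.2, 1.5, 2.0, 2.9, 3.1]
    print(improved_nw(0.4, xs, ys, h=0.2))
```

Setting `alpha=0.0` makes every local factor equal to one, so the sketch reduces to the ordinary fixed-bandwidth Nadaraya–Watson fit; larger `alpha` widens the bandwidth where the pilot density is low.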

2. The Proposed Estimator of Finite Population Mean Using Improved Nadaraya–Watson Kernel Regression Technique

Consider a finite population of size consisting of clusters with elements in the cluster. A sample of clusters is selected so that units respond and units fail to respond. Let denote the value of the survey variable for unit in cluster , for , and let the population mean be given by

The proposed estimator is given bywhere is an estimator of the nonresponse component of the sample. Assuming auxiliary information is known throughout, can be obtained using the improved Nadaraya–Watson regression technique byso that the estimator of the finite population mean can be rewritten as

A special case where is assumed in this study. This simplifies mathematical computations so that equation (7) can be rewritten aswhere is the improved Nadaraya–Watson kernel regression estimator given in equation (1), which is a weighted sum of the values of the survey variable . Data are generated using a regression model given bywhere is an unknown smooth function of auxiliary random variables . It is assumed that the error term satisfies the following conditions:

Hence, the unspecified function of the auxiliary random variables is replaced by the improved Nadaraya–Watson kernel estimator . The estimator can be rewritten aswhere are the improved Nadaraya–Watson kernel weights, where is a given kernel function assumed to be symmetrical. Since the choice of the kernel function is not critical for the performance of the kernel regression estimator, a simplified Gaussian kernel with mean 0 and variance 1 is used in this study. This is given by

In this case, the improved Nadaraya–Watson kernel estimation at any point is given bywhere is the bandwidth while is given in equation (2) due to [5].
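For reference, a Gaussian kernel with mean 0 and variance 1 is the standard normal density. The block below writes it out together with the moment properties that a symmetric kernel of this type satisfies. The argument name $u$ is an assumption chosen for illustration, not notation taken from the study.

```latex
K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^{2}}{2}\right), \qquad -\infty < u < \infty,
\qquad
\int K(u)\,du = 1, \quad \int u\,K(u)\,du = 0, \quad \int u^{2} K(u)\,du = 1 .
```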

This provides a way of estimating the nonresponse values of the survey variable , in the cluster given the auxiliary values , for a specified kernel function.
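The imputation idea described above can be sketched as follows, under stated assumptions. The sketch assumes that each sampled cluster contributes the average of its observed values and its kernel-predicted nonresponse values, and that the population mean estimate is the simple average of these cluster means (equal cluster sample sizes). A plain fixed-bandwidth Nadaraya–Watson fit stands in for the improved version, and the function names and dictionary keys are hypothetical.

```python
import math


def nw_predict(x0, xs, ys, h):
    """Plain Gaussian-kernel Nadaraya-Watson fit (stand-in for the improved estimator)."""
    weights = [math.exp(-0.5 * ((x0 - xi) / h) ** 2) for xi in xs]
    return sum(w * yi for w, yi in zip(weights, ys)) / sum(weights)


def cluster_mean_with_imputation(x_resp, y_resp, x_nonresp, h):
    """Average of observed values and kernel-predicted values for nonrespondents."""
    imputed = [nw_predict(x, x_resp, y_resp, h) for x in x_nonresp]
    m = len(y_resp) + len(imputed)
    return (sum(y_resp) + sum(imputed)) / m


def estimate_population_mean(clusters, h):
    """Simple average of the imputed cluster means (assumes equal cluster sample sizes)."""
    means = [
        cluster_mean_with_imputation(c["x_resp"], c["y_resp"], c["x_nonresp"], h)
        for c in clusters
    ]
    return sum(means) / len(means)


if __name__ == "__main__":
    clusters = [
        {"x_resp": [0.1, 0.4, 0.7], "y_resp": [1.1, 1.9, 2.4], "x_nonresp": [0.2, 0.9]},
        {"x_resp": [0.3, 0.6], "y_resp": [1.5, 2.2], "x_nonresp": [0.5]},
    ]
    print(estimate_population_mean(clusters, h=0.3))
```

Only the auxiliary values are needed for the nonrespondents, which is why the approach requires auxiliary information to be available for every sampled unit.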

2.1. The Asymptotic Bias of the Proposed Estimator

The expected value of the proposed estimator is given by

Rewriting equation (5) using the property of symmetry associated with Nadaraya–Watson estimator,

Following the procedure by [9], equation (14) can be rewritten aswhere is the estimated marginal density of auxiliary variables . The bias of the estimator can be written aswhich reduces to

Rewriting the regression model given by asand substituting it in equation (15) gives

Hence, the first term in equation (18) before taking expectation is given as

Simplifying equation (21), the following is obtained:where

Taking conditional expectation of equation (22) leads to

The following theorem due to [10] and applied by [11] was used in obtaining asymptotic bias and variance of the estimator using conditional expectations.

Theorem 1. Let be a symmetric density function with and . Assume and increase together such that with . Besides, assume the sampled and nonsampled values of are in the interval and are obtained by densities and , respectively, where both are bounded away from zero on with continuous second derivatives. If for any variable and , then .
Using this theorem, the asymptotic bias and variance are derived in the following sections. From the conditions of the error term stated in equation (9), it follows that . Therefore, . Thus, can be obtained as follows:Using substitution and change of variable technique given byequation (26) can be simplified toUsing Taylor’s series expansion about the point , the order kernel can be derived as follows:Similarly,Therefore, expanding equation (28) up to order and simplifying givesUsing the conditions due to [10] given by , , and , the derivation in equation (31) can further be simplified to obtainHence, the expected value of the second term in equation (25) then becomesSimplifying equation (33) giveswhere and .
Using the equation of the bias given in (16) and the conditional expectation in equation (25), the following equation for the conditional bias of the estimator was obtained:In the next subsection, the asymptotic variance of the estimator is also derived.
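The change of variable and Taylor expansion step used in the bias derivation above can be illustrated in its simplest form. The sketch below is not the derivation of this study: it assumes a fixed bandwidth $h$ (the local bandwidth factor is omitted), a twice continuously differentiable mean function $m$, and a symmetric kernel $K$ with unit integral. The symbols $m$, $u$, and $d_K$ are assumptions introduced for illustration.

```latex
% Substitution u = (t - x)/h, then a second-order expansion of m about x.
\begin{aligned}
\int K(u)\, m(x + hu)\, du
  &= \int K(u) \left[ m(x) + hu\, m'(x) + \tfrac{1}{2} h^{2} u^{2} m''(x) + o(h^{2}) \right] du \\
  &= m(x) + \tfrac{1}{2} h^{2} d_K\, m''(x) + o(h^{2}),
  \qquad d_K = \int u^{2} K(u)\, du .
\end{aligned}
```

The first-order term vanishes because $\int u\,K(u)\,du = 0$ for a symmetric kernel, which is why the leading bias term is of order $h^{2}$.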

2.2. Asymptotic Variance of the Proposed Estimator

Using equation (7), the conditional variance of the estimator is given aswhere is given bywhere is the estimated marginal density of auxiliary variables ; for details see [6, 7]. Rewriting the regression model as and substituting in equation (37) leads to

From equation (24),

Hence,where . Expressing equation (40) in terms of expectation, the following equation is obtained

Using the fact that the conditional expectation , the second term in equation (41) reduces to zero. Therefore,where .

Let and and make the following substitutions:so that

Using the change of variables technique and simplifying, equation (44) reduces to

Following the same procedure for getting the variance of ,

can similarly be obtained as follows:

Equation (46) can be rewritten aswhere so that . Changing variables and applying Taylor’s series expansion about the point leads towhich gives

Following the procedure by [12] and simplifying, equation (49) reduces to

For large samples, as , and , then . Hence, the variance in equation (49) asymptotically tends to zero, i.e., , so that the variance of the estimator of the population mean reduces to

Simplifying equation (51) leads to

Substituting equation (45) in equation (52) yields the following:

2.3. Mean Squared Error of the Proposed Estimator

The conditional MSE of the estimator of the finite population mean combines the conditional squared bias and the conditional variance of the estimator, that is,which on simplification leads towhere and .

From equation (55), it is noted that if the sample size is large, that is, as and , the MSE of the proposed estimator due to the kernel tends to zero for a sufficiently small bandwidth. The estimator is therefore asymptotically consistent since its MSE converges to zero in probability.
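The decomposition used in this subsection is the standard identity linking mean squared error, bias, and variance. It is written below for a generic estimator $\hat{\theta}$ of a target $\theta$, conditional on the auxiliary values $X$ and treating $\theta$ as fixed given $X$. These symbols are assumptions for illustration and are not the notation of the study.

```latex
\operatorname{MSE}\!\left(\hat{\theta} \mid X\right)
  = E\!\left[ \left(\hat{\theta} - \theta\right)^{2} \,\middle|\, X \right]
  = \left[ \operatorname{Bias}\!\left(\hat{\theta} \mid X\right) \right]^{2}
    + \operatorname{Var}\!\left(\hat{\theta} \mid X\right),
\qquad
\operatorname{Bias}\!\left(\hat{\theta} \mid X\right) = E\!\left[\hat{\theta} \mid X\right] - \theta .
```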

3. Simulation Study

A simulation experiment was conducted in R to compare the performance of the proposed estimator in two-stage cluster sampling with the transformed estimator due to [13] and the nonparametric regression estimator due to [14]. An asymptotic framework is used in which both the number of clusters in the population and the number of clusters in the sample are large. The number of elements within each cluster is held constant so that no cluster dominates the population.

Both linear and nonlinear mean functions of auxiliary random variables due to [14] were considered in generating data, where . The equations of the mean functions used in simulating the data are given in Table 1.

The population auxiliary values of size are generated as independent and identically distributed uniform random variables. The survey values are known only for the respondents in the selected sample. The nonresponse values are generated from the auxiliary values: for every generated value , the mean survey nonresponse value is obtained as the mean function evaluated at that value plus an error term (equation (56)), where the error terms are independent and identically distributed normal random variables with mean zero and variance one. A Gaussian kernel with mean zero and variance one was used because it is smooth, with continuous derivatives at every data point. The bandwidth was chosen using the cross-validation technique due to [15], who note that this bandwidth leads to more informative estimates than other choices. The local bandwidth factor given in equation (2) was generated using the algorithm due to [8].
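Bandwidth selection by cross-validation can be sketched as a leave-one-out search over a grid of candidate bandwidths. This is a generic illustration in Python, not the specific procedure of [15]: the squared-error criterion, the candidate grid, the plain Nadaraya–Watson fit, and all function names are assumptions.

```python
import math


def loo_cv_score(xs, ys, h):
    """Mean squared leave-one-out prediction error of a Gaussian-kernel NW fit."""
    n = len(xs)
    sse = 0.0
    for i in range(n):
        ws = [math.exp(-0.5 * ((xs[i] - xs[j]) / h) ** 2) for j in range(n) if j != i]
        ys_rest = [ys[j] for j in range(n) if j != i]
        total = sum(ws)
        if total == 0.0:
            return math.inf  # bandwidth too small: no neighbour carries any weight
        pred = sum(w * y for w, y in zip(ws, ys_rest)) / total
        sse += (ys[i] - pred) ** 2
    return sse / n


def select_bandwidth(xs, ys, candidates):
    """Return the candidate bandwidth with the smallest leave-one-out score."""
    return min(candidates, key=lambda h: loo_cv_score(xs, ys, h))


if __name__ == "__main__":
    xs = [i / 19 for i in range(20)]
    ys = [1.0 + 2.0 * x for x in xs]
    print(select_bandwidth(xs, ys, [0.02, 0.05, 0.1, 0.5, 2.0]))
```

Each point is predicted from all the others, so a bandwidth that oversmooths (too large) or that leaves a point without neighbours (too small) is penalized by the score.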

At stage one, a sample of clusters is selected by simple random sampling using a sample of size . At stage two, subsamples of elements within every selected cluster are drawn by simple random sampling with replacement using a random sample of size . The nonresponse mean survey values were then generated using equation (56). The estimates of the finite population mean were then computed using the estimator in equation (7). The bias and mean squared error of each estimator were also computed, and confidence intervals were constructed for the estimators of the finite population mean for comparison.
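The simulation steps above can be sketched end to end as follows. This is a Python illustration under assumptions, not the R code of the study: the linear mean function, the population and sample sizes, the nonresponse probability, the fixed bandwidth, the plain Nadaraya–Watson stand-in, and the normal-approximation interval with `z = 1.96` are all hypothetical choices made only so that the sketch runs.

```python
import math
import random
import statistics


def mean_function(x):
    # Hypothetical linear mean function; the functions of Table 1 are not reproduced here.
    return 1.0 + 2.0 * (x - 0.5)


def nw_predict(x0, xs, ys, h):
    # Plain Gaussian-kernel Nadaraya-Watson fit, a stand-in for the improved version.
    ws = [math.exp(-0.5 * ((x0 - xi) / h) ** 2) for xi in xs]
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)


def make_population(rng, n_clusters, cluster_size):
    population = []
    for _ in range(n_clusters):
        xs = [rng.random() for _ in range(cluster_size)]  # uniform auxiliary values
        population.append([(x, mean_function(x) + rng.gauss(0.0, 1.0)) for x in xs])
    return population


def one_estimate(rng, population, n, m, p_nonresponse, h):
    cluster_means = []
    for cluster in rng.sample(population, n):                   # stage one: sample clusters
        units = [rng.choice(cluster) for _ in range(m)]         # stage two: with replacement
        flags = [rng.random() >= p_nonresponse for _ in units]  # True means the unit responded
        xs_r = [x for (x, _), f in zip(units, flags) if f]
        ys_r = [y for (_, y), f in zip(units, flags) if f]
        if not ys_r:
            continue  # no respondents, so nothing to fit in this cluster
        imputed = [nw_predict(x, xs_r, ys_r, h) for (x, _), f in zip(units, flags) if not f]
        cluster_means.append((sum(ys_r) + sum(imputed)) / m)
    if not cluster_means:
        raise ValueError("no sampled cluster had any respondents")
    return sum(cluster_means) / len(cluster_means)


def summarize(estimates, true_mean, z=1.96):
    # z is an assumed normal quantile; the confidence level is a free parameter here.
    bias = statistics.fmean(estimates) - true_mean
    mse = statistics.fmean([(e - true_mean) ** 2 for e in estimates])
    ci_length = 2.0 * z * statistics.pstdev(estimates)
    return bias, mse, ci_length


def run(reps=200, seed=1):
    rng = random.Random(seed)
    population = make_population(rng, n_clusters=50, cluster_size=40)
    true_mean = statistics.fmean([y for cluster in population for _, y in cluster])
    estimates = [one_estimate(rng, population, 10, 15, 0.3, 0.2) for _ in range(reps)]
    return summarize(estimates, true_mean)


if __name__ == "__main__":
    print(run())
```

Comparing estimators in this framework amounts to swapping the prediction function inside `one_estimate` and tabulating the three summaries for each mean function.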

4. Simulation Results

The values of the bias, mean squared error, and confidence interval lengths are given in the following tables. Note that is the estimator of the finite population mean proposed in this study and is the transformation of data method estimator of the finite population mean due to [13] whereas is the nonparametric regression estimator due to [14]. Both and were used for comparative purposes with the proposed estimator.

The biases of the estimators considered are presented in Table 2. Negative values of the bias imply underestimation, while positive values indicate overestimation of the finite population mean. The proposed estimator has relatively smaller bias values, followed by the transformation of data method estimator due to [13]. The nonparametric estimator due to [14] has larger bias values than the other two estimators. The bias values of the three estimators are closest to one another under the quadratic mean function, although the transformation of data method has a positive bias under this mean function. Overall, the proposed estimator using the improved Nadaraya–Watson kernel regression technique performs better than the other two estimators in terms of bias.

Mean squared error combines the variance and the squared bias of an estimator. The mean squared error values presented in Table 3 were simulated using the different mean functions indicated. For the proposed estimator, the quadratic mean function gives the smallest mean squared error, followed by the linear function. The estimator due to [14] has the largest value of the mean squared error under the jump function. Overall, Table 3 shows that the mean squared error values of the proposed estimator are smaller than those of the other estimators under all the mean functions considered. The transformation of data method estimator due to [13] comes second, with smaller mean squared error values than the nonparametric regression estimator due to [14]. From this comparison, it can be concluded that the proposed estimator is more efficient than the other two estimators considered.

Upper and lower confidence limits were constructed for the estimators of the finite population mean, and the confidence interval lengths were then obtained. The results are given in Table 4. The confidence interval lengths for the proposed estimator are much tighter than those of the estimators due to [13, 14]. Hence, at level of confidence, the estimator proposed in this study performs better than its rival estimators.

5. Conclusion

This study has developed an estimator of the finite population mean in two-stage cluster sampling, assuming that random nonresponse occurs in the survey variable at the second stage. Complete auxiliary information is assumed to be available at both stages of cluster sampling. Kernel weights developed using the improved Nadaraya–Watson regression technique were used in the estimation process. The theoretical properties of the proposed estimator, namely its asymptotic bias, variance, and mean squared error, were derived. Simulation results show that the proposed estimator has smaller bias, smaller mean squared error, and tighter confidence interval lengths than the other estimators. Therefore, in the simulation settings considered, the proposed estimator of the finite population mean outperforms the estimators due to [13] and [14].

Data Availability

The data used to support the theoretical findings were generated via simulation using the R statistical package.

Disclosure

The abstract was to be presented at a conference organized by the World Academy of Science, Engineering and Technology, but due to financial constraints, participation in and presentation at the conference were withdrawn and the organizer was informed accordingly.

Conflicts of Interest

The authors declare that they have no conflicts of interest.