Research Article | Open Access

Volume 2020 | Article ID 7306315 | https://doi.org/10.1155/2020/7306315

Nelson Kiprono Bii, Christopher Ouma Onyango, John Odhiambo, "Estimating a Finite Population Mean Using Transformed Data in Presence of Random Nonresponse", International Journal of Mathematics and Mathematical Sciences, vol. 2020, Article ID 7306315, 7 pages, 2020. https://doi.org/10.1155/2020/7306315

# Estimating a Finite Population Mean Using Transformed Data in Presence of Random Nonresponse

Accepted 22 Jun 2020
Published 08 Jul 2020

#### Abstract

Developing finite population estimators of parameters such as mean, variance, and asymptotic mean squared error has been one of the core objectives of sample survey theory and practice. Sample survey practitioners need to assess the properties of these estimators so that better ones can be adopted. In survey sampling, the occurrence of nonresponse affects inference and the optimality of the estimators of finite population parameters. It introduces bias and may cause samples to deviate from the distributions obtained by the original sampling technique. To compensate for random nonresponse, imputation methods have been proposed by various researchers. However, the asymptotic bias and variance of the finite population mean estimators are still high under this technique. In this paper, a weighting technique based on transformed data is suggested. The proposed estimator is observed to be asymptotically consistent under mild assumptions. Simulated data show that the proposed estimator performs better than its rival estimators for most of the mean functions simulated.

#### 1. Introduction

Considerable importance is attached to efficient and cost-effective sampling designs when estimating a finite population mean; see, for instance, [1, 2]. Samples should be designed carefully, based on random selection with known selection probabilities for the population elements. This gives a target sample of intended respondents, each of whom may answer a set of survey questions, resulting in an array of responses. Van Buuren et al. [3] observed that nonresponse occurs if some of the expected responses are missing, for instance, where a whole vector of responses is missing for some sampled units or where responses are obtained for some questions but not for others in the sample selected. As noted in , nonresponse often occurs in human population surveys because people hesitate to respond, and it increases notably when sensitive issues are studied. Moreover, the presence of nonresponse increases the bias in estimates, ultimately reducing their efficiency as observed in .

The basis for statistical inference is therefore formed by a sampling design that provides a link between a sample and the population. As observed in , good sample survey practice and efficient methods of compensating for nonresponse should be adopted.

In sample surveys, nonresponse leads to biased results in the estimation of a finite population mean. It may force samples to deviate from the distributions originally established by the sampling design. The incorporation of regression models is acknowledged as one of the methods of minimizing bias resulting from nonresponse using auxiliary data, for details see . In practice, knowledge of the study variables is unavailable for nonrespondents, whereas auxiliary data may be available. To minimize the bias and variance resulting from nonresponse, Liang and Zeger [6] noted that it is desirable to incorporate auxiliary data in the process of estimation, where the probabilities of response are mostly assumed to be correlated with certain characteristics, for instance, age, race, and income in human population surveys. Sanaullah et al. [7] studied nonresponse under stratified two-phase sampling using generalized exponential chain-ratio and chain-product estimators and concluded that their estimator is more efficient than that in [8], one of the pioneer studies of estimation of finite population parameters under nonresponse.

Continuing the effort to address the problem of nonresponse in sample surveys, Javaid et al. [9] derived a modified ratio estimator in systematic random sampling. They proposed the use of a single auxiliary variable to estimate a finite population mean. However, as a departure from the work in [7, 9], this paper proposes the use of weights obtained from transformed data to compensate for nonresponse. The weighting method is reviewed in the following section.

This paper is organized as follows: Section 1 has introduced the paper; Section 2 reviews the weighting method; the proposed estimator is derived in Section 3; in Sections 3.1–3.3, the bias, the variance, and the mean squared error of the estimator are derived, respectively; Section 4 presents results from the simulation experiment conducted; and Section 5 concludes the paper.

#### 2. Review of Weighting Method

It has been observed by authors in  that nonresponse leads to a reduced number of observations. Weighting thus implies that the weights are increased almost for all the elements that do not respond in a survey. For instance, the authors in [11, 12] explored a modified Horvitz–Thompson estimator to correct the problem of nonresponse using the weighting technique. The estimator used was defined as

where is given as

such that , is the survey value of the study variable taken from a sample selected from a finite population, , is the inclusion probability given by , and is the value of the respondent in the sample selected . The estimator adjusts the weights by an unbiased estimator , of the response probabilities of the population mean, as shown in the following equation:

An approximate bias of the estimator is thus given as

where

If tends to zero, the bias would be minimal, for more details see . The adjusted Horvitz–Thompson estimator, , is an illustration of reweighting measurements of respondents without using auxiliary information. However, in this paper, auxiliary information is used in the estimation procedure. To compensate for nonresponse, weights obtained from transformed data are used.
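As a rough illustration of the reweighting idea reviewed above, the following Python sketch computes a Horvitz–Thompson-type mean in which each respondent's design weight is inflated by the inverse of an estimated response rate. The function name `adjusted_ht_mean`, the argument names, and the particular response-rate estimate (the design-weighted share of respondents in the sample) are assumptions made for illustration; this is not claimed to be the exact estimator of [11, 12].

```python
def adjusted_ht_mean(y, pi, responded, N):
    """Horvitz-Thompson-type mean with a simple nonresponse adjustment.

    y         : survey values of the sampled units (unused for nonrespondents)
    pi        : inclusion probabilities of the sampled units
    responded : booleans, True if the unit responded
    N         : population size
    """
    # Design-weighted estimate of the response probability (assumed form).
    total_weight = sum(1.0 / p for p in pi)
    resp_weight = sum(1.0 / p for p, r in zip(pi, responded) if r)
    rho_hat = resp_weight / total_weight
    if rho_hat == 0:
        raise ValueError("no respondents in the sample")
    # Respondent values weighted by 1 / (pi_i * rho_hat).
    total = sum(v / (p * rho_hat) for v, p, r in zip(y, pi, responded) if r)
    return total / N


# Toy example: N = 10, equal inclusion probabilities 0.5, one nonrespondent.
y = [2.0, 4.0, 6.0, None, 8.0]
pi = [0.5] * 5
responded = [True, True, True, False, True]
estimate = adjusted_ht_mean(y, pi, responded, N=10)
```

With equal inclusion probabilities, the adjusted estimate reduces to the respondent mean, which is the behaviour expected of reweighting without auxiliary information.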

#### 3. Finite Population Mean Estimator Proposed

Suppose a finite population of size consists of clusters having elements in the cluster. is taken to represent the survey value of the study variable, , for the unit in the cluster, for , for details see . Let the mean of the finite population to be estimated be given by . Data are generated by a regression model used in [15, 16] and more recently in ; the model is given aswhere is a function of the auxiliary data having continuous derivatives and is a residual variable having a mean of zero and a nonnegative variance. Auxiliary information is assumed to be known in this study. To predict nonresponse values in the study variable, the following estimator due to transformed data is proposed:

The original data are first transformed to where is a nonnegative, continuous, and monotonically increasing function from to . The transformed data are then reflected around the origin to obtain , for details see . More recently, Bii et al. [14] used the same procedure while developing boundary bias correction under nonresponse. However, the weights resulting from the current study are easier to implement and give better estimates than those in . Hence, utilizing the transformed sample data , the population mean estimator is defined as

where estimates the nonresponse units. Similarly, can be represented as

where are the weights obtained by transformation of the data. These weights are given as

so that in equation (9) can be rewritten as

Hence, using equation (11), equation (8) becomes

which is equivalent to

The following section gives some properties of the estimator proposed.
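The transform-and-reflect idea can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the weights are taken to be normalised Gaussian kernel weights evaluated at each transformed point and at its mirror image about the origin, the transformation `g` defaults to the identity, and all function names are hypothetical. The exact weights of the proposed estimator are not reproduced here.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def reflection_weights(x0, x_resp, h, g=lambda t: t):
    """Normalised kernel weights at x0 from transformed, reflected data.

    g is the transformation (assumed nonnegative, continuous, increasing);
    the identity is used by default purely for illustration.
    """
    t0 = g(x0)
    raw = []
    for x in x_resp:
        t = g(x)
        # Contribution of the transformed point t and of its mirror image -t.
        raw.append(gaussian_kernel((t0 - t) / h) + gaussian_kernel((t0 + t) / h))
    total = sum(raw)
    return [r / total for r in raw]


def predict(x0, x_resp, y_resp, h, g=lambda t: t):
    """Weighted average of respondent values, used for a nonresponding unit."""
    weights = reflection_weights(x0, x_resp, h, g)
    return sum(w * y for w, y in zip(weights, y_resp))


def mean_estimate(x_resp, y_resp, x_nonresp, h, g=lambda t: t):
    """Mean of observed respondent values plus predicted nonrespondent values."""
    predicted = [predict(x, x_resp, y_resp, h, g) for x in x_nonresp]
    return (sum(y_resp) + sum(predicted)) / (len(y_resp) + len(predicted))


# Toy data generated from y = 1 + 2x without noise.
x_resp = [0.1, 0.4, 0.7, 0.9]
y_resp = [1.2, 1.8, 2.4, 2.8]
x_nonresp = [0.2, 0.5]
weights = reflection_weights(0.2, x_resp, h=0.2)
estimate = mean_estimate(x_resp, y_resp, x_nonresp, h=0.2)
```

Because the weights are nonnegative and sum to one, each predicted value is a convex combination of the respondent values, so the resulting mean estimate always lies within the range of the observed data.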

##### 3.1. The Bias of the Estimator

Assumptions in  are used in deriving the bias and the variance of the estimator proposed. More recently, these assumptions have also been used in [14] for boundary bias correction in the presence of nonresponse.

and are assumed to exist and are continuous; given such that and ; , , , , and , where is the inverse function of , whereas and are the derivatives of and , respectively, for , where is the bandwidth such that , where . Furthermore, let , where such that for . Besides, a kernel function K is assumed to be nonnegative and symmetric function with support such that , , and . From these assumptions, the following equation is thus obtained:

Simplifying equation (15) reduces to

Equation (16) is equivalent to

It is observed that is approximately equal to as and for all . Hence, the proposed estimator is asymptotically unbiased.

##### 3.2. Asymptotic Variance of the Estimator Proposed

The estimator suggested in this study has variance given as

which can similarly be expressed as

Using Taylor’s series, the asymptotic expansion and simplification of the variance becomes

As , the bandwidth , and hence, . Hence, the variance, , decreases in . Thus, the larger the sample size, the smaller the variance.

##### 3.3. Mean Squared Error of the Estimator Proposed

The mean squared error combines the variance and the squared bias terms of the estimator; that is, it equals the variance plus the square of the bias.

Therefore, combining equations (17) and (19) leads to

As can be observed in equation (22), the mean squared error approaches zero as ; this is true as the bandwidth becomes sufficiently small, that is, as , in which represents the sampled clusters, while is the size of each sampled cluster. There is also a trade-off between the variance and the bias terms of the estimator: as the bandwidth decreases, the variance increases but the bias decreases. An optimal bandwidth should therefore be chosen to balance the bias and the variance in the estimation process, see for details . The next section presents a simulation study conducted to compare the performance of the finite population mean estimator suggested in this paper with those that exist in the literature.
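The bias-variance trade-off in the bandwidth can be illustrated numerically. The Python sketch below is a small Monte Carlo experiment with a plain Nadaraya–Watson estimator at a single point, not the proposed estimator; the sine mean function, the evaluation point, the sample size, the two bandwidths, and the seed are all assumptions chosen for illustration.

```python
import math
import random


def nw_estimate(x0, xs, ys, h):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel."""
    w = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)


def bias_and_variance(h, x0=0.25, n=200, reps=300, seed=1):
    """Monte Carlo bias and variance of the estimate at x0 for bandwidth h."""
    rng = random.Random(seed)
    truth = math.sin(2 * math.pi * x0)
    estimates = []
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 1.0) for x in xs]
        estimates.append(nw_estimate(x0, xs, ys, h))
    mean_est = sum(estimates) / reps
    variance = sum((e - mean_est) ** 2 for e in estimates) / (reps - 1)
    return mean_est - truth, variance


bias_small, var_small = bias_and_variance(h=0.02)
bias_large, var_large = bias_and_variance(h=0.20)
print(f"h=0.02: bias={bias_small:.3f}, variance={var_small:.4f}")
print(f"h=0.20: bias={bias_large:.3f}, variance={var_large:.4f}")
```

In this setting the small bandwidth gives the smaller absolute bias and the larger variance, while the large bandwidth does the opposite, which is the behaviour that a bandwidth selector has to balance.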

#### 4. Simulation Study

The simulation experiment was done using R code. The mean functions in  were used for simulation. Table 1 presents the different mean functions used for data simulation.

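Table 1: Mean functions used for data simulation.

| Mean function | Equation |
| --- | --- |
| Bump | |
| Exponential | |
| Jump | |
| Linear | |
| Quadratic | |
| Sine | |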
##### 4.1. Mean Functions of Simulated Data

The following steps were followed in data simulation:

1. The auxiliary data were generated as independent and identically distributed uniform random variables on . The values of the residual term were obtained from a normal distribution having mean 0 and variance 1. The population was made up of 80 clusters.
2. A sample of clusters was selected in stage one following the simple random sampling procedure with replacement.
3. From every cluster, , picked in stage two, a sample , from was chosen from a total of observations by a simple random sampling technique with replacement. Consider the study variable , that are given for the respondents in the sample only.
4. Using the auxiliary data, , for , nonresponse data were obtained from the regression equation using simple random sampling with replacement in the cluster, where is a function of auxiliary data obtained from the different mean functions given in Table 1; , while .
5. This procedure was replicated to obtain the mean estimators, , in the cluster.
6. At level, confidence intervals were developed for the population mean estimators, , which correspond to the estimator proposed, the Nadaraya–Watson estimator in [21, 22], and the modified Nadaraya–Watson estimator in [23], respectively.
7. A Gaussian kernel together with a locally adaptive bandwidth in  was used in the simulation of data.
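The simulation steps above can be sketched as follows. The paper's simulation was run in R; this Python version is only an illustrative outline. The cluster size (50), the numbers of sampled clusters and units (20 and 15), the response probability (0.8), the fixed bandwidth (0.1), the linear mean function, and the number of replicates (200) are assumptions, since these values are not recoverable from the text, and a plain Nadaraya–Watson prediction stands in for the proposed weights.

```python
import math
import random

rng = random.Random(2020)


def mean_fn(x):
    # Assumed linear mean function; the forms in Table 1 are not reproduced here.
    return 1.0 + 2.0 * (x - 0.5)


# Step 1: population of 80 clusters (cluster size 50 is an assumption).
N_CLUSTERS, CLUSTER_SIZE = 80, 50
population = []
for _ in range(N_CLUSTERS):
    xs = [rng.random() for _ in range(CLUSTER_SIZE)]
    population.append([(x, mean_fn(x) + rng.gauss(0.0, 1.0)) for x in xs])
true_mean = sum(y for c in population for _, y in c) / (N_CLUSTERS * CLUSTER_SIZE)


def nw_predict(x0, xs, ys, h):
    """Gaussian-kernel weighted average of respondent values at x0."""
    w = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)


def one_replicate(n=20, m=15, p_respond=0.8, h=0.1):
    # Step 2: select n clusters by simple random sampling with replacement.
    chosen = rng.choices(population, k=n)
    cluster_means = []
    for cluster in chosen:
        # Step 3: select m units within the cluster, again with replacement.
        units = rng.choices(cluster, k=m)
        # Random nonresponse: each unit responds independently.
        flags = [rng.random() < p_respond for _ in units]
        if not any(flags):
            flags[0] = True  # keep at least one respondent per cluster
        xr = [x for (x, _), f in zip(units, flags) if f]
        yr = [y for (_, y), f in zip(units, flags) if f]
        # Step 4: predict nonrespondent values from their auxiliary data.
        filled = [y if f else nw_predict(x, xr, yr, h)
                  for (x, y), f in zip(units, flags)]
        cluster_means.append(sum(filled) / m)
    return sum(cluster_means) / n


# Step 5: replicate the procedure to obtain a set of mean estimates.
estimates = [one_replicate() for _ in range(200)]
avg_estimate = sum(estimates) / len(estimates)
print(f"true mean = {true_mean:.4f}, average estimate = {avg_estimate:.4f}")
```

Swapping `mean_fn` for another mean function, or `nw_predict` for another predictor, gives the corresponding comparison across estimators and populations.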

The results are discussed in the following section.

##### 4.2. Simulation Results

The simulation results for the estimator proposed, the estimator in [21, 22], and the estimator in [23] are given in Tables 2–4.

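Table 2: Bias of the estimators.

| Mean function | Estimator proposed | Nadaraya–Watson | Modified Nadaraya–Watson |
| --- | --- | --- | --- |
| Bump | −0.18757 | −0.51888 | −0.25989 |
| Exponential | −0.01228 | 0.58383 | 0.29008 |
| Jump | −0.2537 | −0.30344 | −0.15825 |
| Linear | −0.16800 | −0.40750 | −0.20447 |
| Quadratic | 0.03000 | 0.07000 | −0.17728 |
| Sine | −0.29100 | −1.17090 | −0.5999 |

Table 3: Mean squared error of the estimators.

| Mean function | Estimator proposed | Nadaraya–Watson | Modified Nadaraya–Watson |
| --- | --- | --- | --- |
| Bump | 0.06229 | 0.32087 | 0.08046 |
| Exponential | 0.00314 | 1.16734 | 0.29075 |
| Jump | 0.08790 | 0.94584 | 0.23835 |
| Linear | 0.04900 | 0.19020 | 0.04782 |
| Quadratic | 0.04400 | 0.56210 | 0.14017 |
| Sine | 0.13700 | 1.41030 | 0.36981 |

Table 4: Confidence interval lengths of the estimators.

| Mean function | Estimator proposed | Nadaraya–Watson | Modified Nadaraya–Watson |
| --- | --- | --- | --- |
| Bump | 0.97482 | 2.22051 | 1.91380 |
| Exponential | 0.21954 | 4.23531 | 2.11372 |
| Jump | 1.09441 | 3.81237 | 1.91380 |
| Linear | 0.92700 | 1.70900 | 0.85730 |
| Quadratic | 0.77000 | 3.0000 | 1.46763 |
| Sine | 1.52700 | 4.89000 | 2.54520 |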

Table 2 presents a summary of the bias simulated from the mean functions in , as shown in Table 1. Negative values imply underestimation, while positive values of the bias indicate overestimation by the estimator concerned. The proposed estimator has the smallest absolute bias for every mean function simulated except the jump function, where the modified Nadaraya–Watson estimator has a smaller absolute bias (0.15825 against 0.2537). For the exponential and quadratic mean functions, the Nadaraya–Watson estimator overestimates the finite population mean, while the proposed estimator overestimates it only for the quadratic mean function. The modified Nadaraya–Watson estimator underestimates the finite population mean for all the mean functions simulated except the exponential function.

The mean squared error values presented in Table 3 were generated using the different mean functions indicated in Table 1. The proposed estimator has smaller mean squared error values than the estimator in [21, 22] for every mean function, and smaller values than the modified Nadaraya–Watson estimator in [23] for every mean function except the linear one, where the two values are close (0.04900 against 0.04782). The Nadaraya–Watson estimator has the largest mean squared error values of the three estimators. Overall, the mean squared error comparison shows that the proposed estimator performs better than the other estimators considered in this paper.

Confidence intervals are normally constructed around point estimators to provide properly calibrated measures of the variability associated with estimators of population parameters of interest. In this paper, confidence intervals were obtained at for the finite population mean estimators. Shorter confidence intervals indicate a more precise estimator. From the results in Table 4, the proposed estimator has shorter confidence intervals than the other estimators for every mean function except the linear one, where the modified Nadaraya–Watson interval is shorter (0.85730 against 0.92700). This means at coverage rates, the proposed estimator is generally better than the Nadaraya–Watson and the modified Nadaraya–Watson estimators.
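The three performance measures discussed above can be computed from a set of replicate estimates as in the following Python sketch. The helper name `summarize` is hypothetical, the interval is taken to be a normal-theory interval of the form estimate plus or minus z times the standard deviation of the replicates, and z = 1.96 is only a placeholder, since the confidence level used in the paper is not recoverable from the text.

```python
import math


def summarize(estimates, true_mean, z=1.96):
    """Bias, mean squared error, and confidence interval length.

    estimates : replicate values of a population mean estimator
    true_mean : the finite population mean being estimated
    z         : normal quantile for the interval (placeholder value)
    """
    r = len(estimates)
    avg = sum(estimates) / r
    bias = avg - true_mean
    mse = sum((e - true_mean) ** 2 for e in estimates) / r
    sd = math.sqrt(sum((e - avg) ** 2 for e in estimates) / (r - 1))
    ci_length = 2.0 * z * sd
    return {"bias": bias, "mse": mse, "ci_length": ci_length}


# Toy replicate estimates around a true mean of 1.0.
toy = [0.9, 1.1, 1.0, 1.2, 0.8]
result = summarize(toy, true_mean=1.0)
```

A negative bias indicates underestimation and a positive bias overestimation, matching the reading of Table 2.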

#### 5. Conclusion

The estimator proposed in this paper is noted to be more desirable than the estimator in [21, 22] and the modified Nadaraya–Watson estimator in [23]. In Table 2, smaller bias values are observed for the proposed estimator for most of the mean functions simulated compared to the Nadaraya–Watson and the modified Nadaraya–Watson estimators. The mean squared error values in Table 3 indicate that the proposed estimator generally does better than the rest considered in this study. Shorter confidence interval lengths can also be observed for the proposed estimator than for the rest of the estimators for most of the mean functions, as given in Table 4. Hence, the proposed estimator provides a better estimation of the mean of a finite population compared to the estimators in [21, 22] and that in [23].

The sampling procedure used in this study and the finite population mean estimator derived can be used to estimate the average health insurance coverage in a given population.

#### Data Availability

To support the theoretical findings, data were generated using the R statistical package.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### References

1. C. Fife-Schaw, “Surveys and sampling issues,” Research Methods in Psychology, vol. 2, pp. 88–104, 2000.
2. L. P. Lago and R. G. Clark, “Imputation of household survey data using linear mixed models,” Australian & New Zealand Journal of Statistics, vol. 57, no. 2, pp. 169–187, 2015.
3. S. Van Buuren, H. C. Boshuizen, and D. L. Knook, “Multiple imputation of missing blood pressure covariates in survival analysis,” Statistics in Medicine, vol. 18, no. 6, pp. 681–694, 1999.
4. M. Ismail, M. Q. Shahbaz, and M. Hanif, “A general class of estimator of population mean in presence of non–response,” Pakistan Journal of Statistics, vol. 27, no. 4, pp. 467–476, 2011.
5. R. R. Andridge and R. J. A. Little, “A review of hot deck imputation for survey non-response,” International Statistical Review, vol. 78, no. 1, pp. 40–64, 2010.
6. K. Y. Liang and S. L. Zeger, “Regression analysis for correlated data,” Annual Review of Public Health, vol. 14, no. 1, pp. 43–68, 1993.
7. A. Sanaullah, M. Noor-ul-Amin, M. Hanif, and C. Kadilar, “Generalized exponential chain ratio and chain product estimators under stratified two-phase random sampling for non-response,” International Journal of Applied Mathematics & Statistics, vol. 55, no. 2, pp. 57–79, 2016.
8. M. H. Hansen and W. N. Hurwitz, “The problem of non-response in sample surveys,” Journal of the American Statistical Association, vol. 41, no. 236, pp. 517–529, 1946.
9. A. Javaid, M. Noor-ul-Amin, and M. Hanif, “Modified ratio estimator in systematic random sampling under non-response,” Proceedings of the National Academy of Sciences, India Section A: Physical Sciences, vol. 89, no. 4, pp. 817–825, 2019.
10. G. R. Pike, “Using weighting adjustments to compensate for survey nonresponse,” Research in Higher Education, vol. 49, no. 2, pp. 153–171, 2008.
11. J. G. Bethlehem, “Reduction of nonresponse bias through regression estimation,” Journal of Official Statistics, vol. 4, no. 3, pp. 251–260, 1988.
12. J. K. Kim and J. J. Kim, “Nonresponse weighting adjustment using estimated response probability,” Canadian Journal of Statistics, vol. 35, no. 4, pp. 501–514, 2007.
13. C. E. Särndal, “On π-inverse weighting versus best linear unbiased weighting in probability sampling,” Biometrika, vol. 67, no. 3, pp. 639–650, 1980.
14. N. K. Bii, C. O. Onyango, and J. Odhiambo, “Boundary bias correction using weighting method in presence of nonresponse in two-stage cluster sampling,” Journal of Probability and Statistics, vol. 2019, Article ID 6812795, 8 pages, 2019.
15. C. O. Onyango, R. O. Otieno, and G. O. Orwa, “Generalised model based confidence intervals in two stage cluster sampling,” Pakistan Journal of Statistics and Operation Research, vol. 6, no. 2, 2010.
16. C. Ouma and C. Wafula, “Bootstrap confidence intervals for model-based surveys,” East African Journal of Statistics, vol. 1, no. 1, pp. 84–90, 2005.
17. N. K. Bii, C. O. Onyango, and J. Odhiambo, “Estimating a finite population mean under random non-response in two stage cluster sampling with replacement,” Open Journal of Statistics, vol. 7, no. 5, pp. 834–848, 2017.
18. A. Cowling and P. Hall, “On pseudodata methods for removing boundary effects in kernel density estimation,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 3, pp. 551–563, 1996.
19. C. R. Loader, “Bandwidth selection: classical or plug-in?” The Annals of Statistics, vol. 27, no. 2, pp. 415–438, 1999.
20. J.-Y. Kim, F. J. Breidt, and J. D. Opsomer, “Nonparametric regression estimation of finite population totals under two-stage sampling,” Tech. Rep., Colorado State University, Fort Collins, CO, USA, 2003.
21. E. A. Nadaraya, “On estimating regression,” Theory of Probability & Its Applications, vol. 9, no. 1, pp. 141–142, 1964.
22. G. S. Watson, “Smooth regression analysis,” Sankhya: The Indian Journal of Statistics, Series A, vol. 26, no. 4, pp. 359–372, 1964.
23. S. Demir and Ö. Toktamiş, “On the adaptive Nadaraya-Watson kernel regression estimators,” Hacettepe Journal of Mathematics and Statistics, vol. 39, no. 3, 2010.
