Research Article | Open Access
Nelson Kiprono Bii, Christopher Ouma Onyango, John Odhiambo, "Estimating a Finite Population Mean Using Transformed Data in Presence of Random Nonresponse", International Journal of Mathematics and Mathematical Sciences, vol. 2020, Article ID 7306315, 7 pages, 2020. https://doi.org/10.1155/2020/7306315
Estimating a Finite Population Mean Using Transformed Data in Presence of Random Nonresponse
Developing finite population estimators of parameters such as the mean, variance, and asymptotic mean squared error has been one of the core objectives of sample survey theory and practice. Survey practitioners need to assess the properties of these estimators so that better ones can be adopted. In survey sampling, nonresponse affects inference and the optimality of estimators of finite population parameters. It introduces bias and may cause samples to deviate from the distributions obtained under the original sampling technique. To compensate for random nonresponse, imputation methods have been proposed by various researchers. However, the asymptotic bias and variance of the finite population mean estimators remain high under this technique. In this paper, a weighting technique based on transformed data is suggested. The proposed estimator is shown to be asymptotically consistent under mild assumptions. Simulated data show that the proposed estimator has smaller bias, smaller mean squared error, and shorter confidence intervals than its rival estimators for all the mean functions simulated.
1. Introduction
Considerable significance is attached to efficient and cost-effective sampling designs when estimating a finite population mean in sample surveys; see, for instance, [1, 2]. Samples should be designed carefully, based on random selection with known selection probabilities for the population elements. This gives a target sample of intended respondents, each of whom may respond to a set of survey questions, resulting in an array of responses. Van Buuren et al. [3] observed that nonresponse occurs if some of the expected responses are missing, for instance, where a whole vector of responses is missing for some sampled units or where responses are obtained for some questions but not for others in the selected sample. As noted in , nonresponse often occurs in human population surveys because people hesitate to respond, and it increases notably when sensitive issues are studied. Moreover, the presence of nonresponse increases the bias in estimates, ultimately reducing their efficiency, as observed in .
The basis for statistical inference is therefore formed by a sampling design that provides a link between a sample and the population. As observed in , good sample survey practice and efficient methods of compensating for nonresponse should be adopted.
In sample surveys, nonresponse leads to biased results in the estimation of a finite population mean. It may force samples to deviate from the distributions originally established by the sampling design. Incorporating regression models that use auxiliary data is acknowledged as one of the methods of minimizing the bias resulting from nonresponse, for details see . In practice, the values of the study variables are unavailable for nonrespondents, whereas auxiliary data may be available. To minimize the bias and variance resulting from nonresponse, Liang and Zeger [6] noted that it is desirable to incorporate auxiliary data in the estimation process, where the probabilities of response are mostly assumed to be correlated with certain characteristics, for instance, age, race, and income in human population surveys. Sanaullah et al. [7] studied nonresponse under stratified two-phase sampling using generalized exponential chain-ratio and chain-product estimators and concluded that their estimator is more efficient than that in [8], one of the pioneer studies of estimation of finite population parameters under nonresponse.
In addressing the problem of nonresponse in sample surveys, Javaid et al. [9] derived a modified ratio estimator in systematic random sampling. They proposed the use of a single auxiliary variable to estimate a finite population mean. However, as a departure from the work in [7, 9], this paper proposes the use of weights obtained from transformed data to compensate for nonresponse. The weighting method is highlighted in the following section.
This paper is organized as follows: Section 1 introduces the paper; Section 2 reviews the weighting method; Section 3 derives the proposed estimator, with its bias, variance, and mean squared error derived in Sections 3.1–3.3, respectively; Section 4 presents the results of the simulation experiment; and Section 5 concludes the paper.
2. Review of Weighting Method
It has been observed by authors in  that nonresponse leads to a reduced number of observations. Weighting thus implies that the weights of the elements that respond are increased to compensate for the elements that do not respond in a survey. For instance, the authors in [11, 12] explored a modified Horvitz–Thompson estimator to correct the problem of nonresponse using the weighting technique. The estimator used was defined as where is given as such that , is the survey value of the study variable taken from a sample selected from a finite population, , is the inclusion probability given by , and is the value of the respondent in the sample selected . The estimator adjusts the weights by an unbiased estimator , of the response probabilities of the population mean, as shown in the following equation:
An approximate bias of the estimator is thus given as where
If tends to zero, the bias would be minimal, for more details see . The adjusted Horvitz–Thompson estimator, , is an illustration of reweighting measurements of respondents without using auxiliary information. However, in this paper, auxiliary information is used in the estimation procedure. To compensate for nonresponse, weights obtained from transformed data are used.
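The reweighting idea reviewed above can be illustrated with a minimal sketch. It assumes a generic response-rate adjustment of the Horvitz–Thompson mean and is not necessarily the exact estimator used in [11, 12]; the function name `adjusted_ht_mean` and its arguments are hypothetical.

```python
def adjusted_ht_mean(values, inclusion_probs, responded, population_size):
    """Horvitz-Thompson mean over respondents, rescaled by an estimated response rate.

    Illustrative only: a generic weighting adjustment, not the paper's formula.
    values[i] may be None for a nonrespondent; it is never used in that case.
    """
    # Design-weighted total computed from the respondents only.
    ht_total = sum(
        y / p for y, p, r in zip(values, inclusion_probs, responded) if r
    )
    # Design-weighted estimate of the mean response probability.
    weight_respondents = sum(
        1.0 / p for p, r in zip(inclusion_probs, responded) if r
    )
    weight_sample = sum(1.0 / p for p in inclusion_probs)
    response_rate = weight_respondents / weight_sample
    # Dividing by the response rate inflates the respondents' weights.
    return ht_total / (population_size * response_rate)
```

With equal inclusion probabilities this reduces to the plain mean of the respondents, which is one way to see that no auxiliary information enters the adjustment.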
3. Finite Population Mean Estimator Proposed
Suppose a finite population of size $M$ consists of $N$ clusters having $N_i$ elements in the $i$th cluster. $Y_{ij}$ is taken to represent the survey value of the study variable, $Y$, for the $j$th unit in the $i$th cluster, for $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, N_i$, for details see . Let the mean of the finite population to be estimated be given by $\bar{Y} = \frac{1}{M}\sum_{i=1}^{N}\sum_{j=1}^{N_i} Y_{ij}$. Data are generated by a regression model used in [15, 16] and more recently in ; the model is given as
$$Y_{ij} = m(X_{ij}) + e_{ij},$$
where $m(X_{ij})$ is a function of the auxiliary data $X_{ij}$ having continuous derivatives and $e_{ij}$ is a residual variable having a mean of zero and a nonnegative variance. Auxiliary information is assumed to be known in this study. To predict nonresponse values in the study variable, the following estimator due to transformed data is proposed:
The original data are first transformed to where is a nonnegative, continuous, and monotonically increasing function from to . The transformed data are then reflected around the origin to obtain , for details see . More recently, Bii et al. [14] used the same procedure while developing boundary bias correction under nonresponse. However, the weights resulting from the current study are easier to implement and give better estimates than those in [14]. Hence, utilizing the transformed sample data , the population mean estimator is defined as where estimates the nonresponse units. Similarly, can be represented as where are the weights obtained by transformation of the data. These weights are given as so that in equation (9) can be rewritten as
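The transform-and-reflect step described above can be sketched as follows. This is a generic reflection-type kernel weighting under stated assumptions, not the paper's own weights: the Gaussian kernel, the identity default for `transform`, and the function names are all assumptions made for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel (assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def reflection_weights(x0, xs, bandwidth, transform=lambda x: x):
    """Normalised kernel weights at x0 from transformed data and their mirror image.

    Each auxiliary value x is mapped to t = transform(x) and also reflected
    about the origin to -t; both copies contribute to the weight of that unit.
    """
    raw = []
    for x in xs:
        t = transform(x)
        raw.append(
            gaussian_kernel((x0 - t) / bandwidth)
            + gaussian_kernel((x0 + t) / bandwidth)
        )
    total = sum(raw)
    return [r / total for r in raw]


def predict(x0, xs, ys, bandwidth, transform=lambda x: x):
    """Weighted average of respondents' ys, used to predict a nonrespondent at x0."""
    weights = reflection_weights(x0, xs, bandwidth, transform)
    return sum(w * y for w, y in zip(weights, ys))
```

The weights are nonnegative and sum to one, so a prediction is always a convex combination of the observed responses. Reflecting the data adds mass on the far side of the boundary at zero, which is the usual motivation for pseudodata methods near a boundary.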
The following section gives some properties of the estimator proposed.
3.1. The Bias of the Estimator
Assumptions in  are used in deriving the bias and the variance of the estimator proposed. More recently, these assumptions have also been used in  in boundary bias correction in presence of nonresponse.
and are assumed to exist and are continuous; given such that and ; , , , , and , where is the inverse function of , whereas and are the derivatives of and , respectively, for , where is the bandwidth such that , where . Furthermore, let , where such that for . Besides, a kernel function K is assumed to be nonnegative and symmetric function with support such that , , and . From these assumptions, the following equation is thus obtained:
Expanding equation (14) leads to
Simplifying equation (15) reduces to
Equation (16) is equivalent to
It is observed that is approximately equal to as and for all . Hence, the proposed estimator is asymptotically unbiased.
3.2. Asymptotic Variance of the Estimator Proposed
The estimator suggested in this study has variance given as which can similarly be expressed as
Using Taylor’s series, the asymptotic expansion and simplification of the variance becomes
As , the bandwidth , and hence, . Hence, the variance, , decreases in . Thus, the larger the sample size, the smaller the variance.
3.3. Mean Squared Error of the Estimator Proposed
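The mean squared error combines the variance and the squared bias terms of the estimator, that is,
$$\mathrm{MSE}(\hat{\bar{Y}}) = \mathrm{Var}(\hat{\bar{Y}}) + \left[\mathrm{Bias}(\hat{\bar{Y}})\right]^{2}.$$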
As can be observed in equation (22), the mean squared error approaches zero as ; this is true as the bandwidth becomes sufficiently smaller, that is, as , in which represents the sampled clusters, while is the size of each sampled cluster. It is also noted that there is a trade-off between the variance and the bias terms of the estimator since as the bandwidth decreases, the variance increases, but the bias decreases. Optimal bandwidth ought to be developed to solve the bias-variance trade-off in the estimation process, see for details . The next section presents a simulation study conducted to compare the performance of the finite population mean estimator suggested in this paper with those that exist in literature.
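The decomposition of the mean squared error into variance plus squared bias can be checked numerically on any set of replicated estimates. The sketch below is illustrative only; the function name `mse_decomposition` is hypothetical and the replicated estimates would come from a simulation such as the one in Section 4.

```python
import statistics


def mse_decomposition(estimates, true_value):
    """Return (bias, variance, mse) computed empirically over replicated estimates."""
    estimates = list(estimates)
    mean_estimate = statistics.fmean(estimates)
    # Bias: average estimate minus the target value.
    bias = mean_estimate - true_value
    # Variance of the estimates around their own mean (population form).
    variance = statistics.pvariance(estimates, mu=mean_estimate)
    # Mean squared error measured directly against the target value.
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return bias, variance, mse
```

For any input, `mse` agrees with `variance + bias ** 2` up to rounding. This makes the trade-off concrete: a bandwidth choice that lowers one term is only useful if it does not raise the other by more.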
4. Simulation Study
4.1. Mean Functions of Simulated Data
The following steps were followed in data simulation:
(1) The auxiliary data were generated as independent and identically distributed uniform random variables on . The values of the residual term were obtained from a normal distribution having mean 0 and variance 1. The population was made up of 80 clusters.
(2) A sample of clusters was selected in stage one by simple random sampling with replacement.
(3) From every cluster picked, a sample was chosen in stage two from the total number of observations in that cluster by simple random sampling with replacement. The values of the study variable are available for the respondents in the sample only.
(4) Using the auxiliary data, values for the nonrespondents were obtained within each cluster from the regression equation, under simple random sampling with replacement, where the mean function of the auxiliary data is taken from the mean functions given in Table 1; the auxiliary values and the residuals are distributed as in step (1).
(5) This procedure was replicated to obtain the mean estimators in each cluster.
(6) At a fixed confidence level, confidence intervals were developed for three population mean estimators: the proposed estimator, the Nadaraya–Watson estimator in [21, 22], and the modified Nadaraya–Watson estimator in [23].
(7) A Gaussian kernel together with a locally adaptive bandwidth in  was used in the simulation of data.
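The steps above can be sketched in a few lines of Python. This is a rough illustration under several assumptions that are not taken from the paper: a hypothetical linear mean function, the unit interval for the auxiliary variable, equal cluster sizes of 50, a 70% response rate, a fixed bandwidth, and a plain Nadaraya–Watson predictor for the nonrespondents rather than the proposed estimator.

```python
import math
import random
import statistics


def mean_function(x):
    # Hypothetical linear mean function (assumption; Table 1 is not reproduced here).
    return 1.0 + 2.0 * x


def make_population(rng, n_clusters=80, cluster_size=50):
    """Build clusters of (x, y) pairs with y = m(x) + e and e ~ N(0, 1)."""
    population = []
    for _ in range(n_clusters):
        cluster = []
        for _ in range(cluster_size):
            x = rng.random()  # uniform auxiliary value; unit interval is an assumption
            cluster.append((x, mean_function(x) + rng.gauss(0.0, 1.0)))
        population.append(cluster)
    return population


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Gaussian-kernel weighted average of ys evaluated at x0."""
    kernel = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    return sum(k * y for k, y in zip(kernel, ys)) / sum(kernel)


def estimate_mean(population, rng, n_clusters=20, n_units=25,
                  response_rate=0.7, bandwidth=0.2):
    """Two-stage sample with replacement; nonrespondents are predicted from x."""
    sample = []
    for _ in range(n_clusters):
        cluster = rng.choice(population)  # stage one: cluster drawn with replacement
        # Stage two: units drawn with replacement from the chosen cluster.
        sample.extend(rng.choice(cluster) for _ in range(n_units))
    # Random nonresponse: each sampled unit responds independently.
    responded = [rng.random() < response_rate for _ in sample]
    xs = [x for (x, _), r in zip(sample, responded) if r]
    ys = [y for (_, y), r in zip(sample, responded) if r]
    if not xs:
        raise ValueError("no respondents in the sample")
    # Keep observed values; replace missing ones by the kernel prediction.
    completed = [
        y if r else nadaraya_watson(x, xs, ys, bandwidth)
        for (x, y), r in zip(sample, responded)
    ]
    return statistics.fmean(completed)
```

Calling `estimate_mean` repeatedly with the same population gives the replicated estimates from which bias, mean squared error, and confidence interval lengths can be summarised.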
The results are discussed in the following section.
4.2. Simulation Results
Table 2 presents a summary of the bias values simulated from the mean functions shown in Table 1. Negative values of the bias imply underestimation, while positive values indicate overestimation by the estimator concerned. For the exponential and quadratic mean functions, the Nadaraya–Watson estimator overestimates the finite population mean, while the proposed estimator overestimates the finite population mean only for the quadratic mean function. The modified Nadaraya–Watson estimator underestimates the finite population mean for all the mean functions simulated except the exponential function. Overall, Table 2 shows that the proposed estimator has smaller bias values than the other estimators for all the mean functions simulated.
The mean squared error values presented in Table 3 were generated using the different mean functions indicated in Table 1. The proposed estimator has smaller mean squared error values than the estimator in [21, 22] and the modified Nadaraya–Watson estimator in [23]. The Nadaraya–Watson estimator has the largest mean squared error values of the estimators considered. Comparing the mean squared error values of the three estimators shows that the proposed estimator performs better than the other estimators considered in this paper.
Confidence intervals are normally constructed around point estimators to provide properly calibrated measures of the variability associated with estimators of population parameters of interest. In this paper, confidence intervals were obtained at a fixed confidence level for the finite population mean estimators. A shorter confidence interval indicates a more precise estimator of the population parameter. From the results in Table 4, the estimator suggested in this paper is observed to have shorter confidence intervals than the other estimators considered. This means that, at the same coverage rate, the proposed estimator is more precise than the Nadaraya–Watson and the modified Nadaraya–Watson estimators.
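A confidence interval length of the kind compared in Table 4 can be computed from replicated estimates as sketched below. The normal approximation, the default level of 0.95, and the function name `normal_interval` are assumptions made for illustration; the paper's own construction and confidence level are not reproduced here.

```python
import statistics


def normal_interval(estimates, level=0.95):
    """Normal-approximation interval centred on the mean of replicated estimates."""
    centre = statistics.fmean(estimates)
    # The spread of the replicated estimates plays the role of the standard error.
    spread = statistics.stdev(estimates)
    # Two-sided critical value of the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    return centre - z * spread, centre + z * spread
```

The interval length is `upper - lower`, so at a common level an estimator with a smaller spread across replications yields a shorter interval.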
5. Conclusion
The estimator proposed in this paper is noted to be more desirable than the estimator in [21, 22] and the modified Nadaraya–Watson estimator in [23]. In Table 2, smaller bias values are observed for the proposed estimator than for the Nadaraya–Watson and the modified Nadaraya–Watson estimators for all the mean functions simulated. The mean squared error values in Table 3 indicate that the proposed estimator performs much better than the other estimators considered in this study. Shorter confidence intervals can also be observed for the proposed estimator than for the other estimators, as given in Table 4. Hence, the proposed estimator provides a better estimate of the mean of a finite population than the estimators in [21, 22] and that in [23].
The sampling procedure used in this study and the finite population mean estimator derived can be used to estimate the average health insurance coverage in a given population.
To support the theoretical findings, data were generated using R statistical package.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
[1] C. Fife-Schaw, “Surveys and sampling issues,” Research Methods in Psychology, vol. 2, pp. 88–104, 2000.
[2] L. P. Lago and R. G. Clark, “Imputation of household survey data using linear mixed models,” Australian & New Zealand Journal of Statistics, vol. 57, no. 2, pp. 169–187, 2015.
[3] S. Van Buuren, H. C. Boshuizen, and D. L. Knook, “Multiple imputation of missing blood pressure covariates in survival analysis,” Statistics in Medicine, vol. 18, no. 6, pp. 681–694, 1999.
[4] M. Ismail, M. Q. Shahbaz, and M. Hanif, “A general class of estimator of population mean in presence of non–response,” Pakistan Journal of Statistics, vol. 27, no. 4, pp. 467–476, 2011.
[5] R. R. Andridge and R. J. A. Little, “A review of hot deck imputation for survey non-response,” International Statistical Review, vol. 78, no. 1, pp. 40–64, 2010.
[6] K. Y. Liang and S. L. Zeger, “Regression analysis for correlated data,” Annual Review of Public Health, vol. 14, no. 1, pp. 43–68, 1993.
[7] A. Sanaullah, M. Noor-ul-Amin, M. Hanif, and C. Kadilar, “Generalized exponential chain ratio and chain product estimators under stratified two-phase random sampling for non-response,” International Journal of Applied Mathematics & Statistics, vol. 55, no. 2, pp. 57–79, 2016.
[8] M. H. Hansen and W. N. Hurwitz, “The problem of non-response in sample surveys,” Journal of the American Statistical Association, vol. 41, no. 236, pp. 517–529, 1946.
[9] A. Javaid, M. Noor-ul-Amin, and M. Hanif, “Modified ratio estimator in systematic random sampling under non-response,” Proceedings of the National Academy of Sciences, India Section A: Physical Sciences, vol. 89, no. 4, pp. 817–825, 2019.
[10] G. R. Pike, “Using weighting adjustments to compensate for survey nonresponse,” Research in Higher Education, vol. 49, no. 2, pp. 153–171, 2008.
[11] J. G. Bethlehem, “Reduction of nonresponse bias through regression estimation,” Journal of Official Statistics, vol. 4, no. 3, pp. 251–260, 1988.
[12] J. K. Kim and J. J. Kim, “Nonresponse weighting adjustment using estimated response probability,” Canadian Journal of Statistics, vol. 35, no. 4, pp. 501–514, 2007.
[13] C. E. Särndal, “On π-inverse weighting versus best linear unbiased weighting in probability sampling,” Biometrika, vol. 67, no. 3, pp. 639–650, 1980.
[14] N. K. Bii, C. O. Onyango, and J. Odhiambo, “Boundary bias correction using weighting method in presence of nonresponse in two-stage cluster sampling,” Journal of Probability and Statistics, vol. 2019, Article ID 6812795, 8 pages, 2019.
[15] C. O. Onyango, R. O. Otieno, and G. O. Orwa, “Generalised model based confidence intervals in two stage cluster sampling,” Pakistan Journal of Statistics and Operation Research, vol. 6, no. 2, 2010.
[16] C. Ouma and C. Wafula, “Bootstrap confidence intervals for model-based surveys,” East African Journal of Statistics, vol. 1, no. 1, pp. 84–90, 2005.
[17] N. K. Bii, C. O. Onyango, and J. Odhiambo, “Estimating a finite population mean under random non-response in two stage cluster sampling with replacement,” Open Journal of Statistics, vol. 7, no. 5, pp. 834–848, 2017.
[18] A. Cowling and P. Hall, “On pseudodata methods for removing boundary effects in kernel density estimation,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 3, pp. 551–563, 1996.
[19] C. R. Loader, “Bandwidth selection: classical or plug-in?” The Annals of Statistics, vol. 27, no. 2, pp. 415–438, 1999.
[20] J.-Y. Kim, F. J. Breidt, and J. D. Opsomer, “Nonparametric regression estimation of finite population totals under two-stage sampling,” Tech. Rep., Colorado State University, Fort Collins, CO, USA, 2003.
[21] E. A. Nadaraya, “On estimating regression,” Theory of Probability & Its Applications, vol. 9, no. 1, pp. 141–142, 1964.
[22] G. S. Watson, “Smooth regression analysis,” Sankhya: The Indian Journal of Statistics, Series A, vol. 26, no. 4, pp. 359–372, 1964.
[23] S. Demir and Ö. Toktamiş, “On the adaptive Nadaraya–Watson kernel regression estimators,” Hacettepe Journal of Mathematics and Statistics, vol. 39, no. 3, 2010.
Copyright © 2020 Nelson Kiprono Bii et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.