An Exponential-Cum-Sine-Type Hybrid Imputation Technique for Missing Data

Bhattacharyya, D.; Singh, G.N.; Jawa, Taghreed M.; Sayed-Ahmed, Neveen; Pandey, Awadhesh K.

doi:https://doi.org/10.1155/2021/4845569

Computational Intelligence and Neuroscience


Special Issue

Artificial Intelligence and Machine Learning-Driven Decision-Making


Research Article | Open Access

Volume 2021 | Article ID 4845569 | https://doi.org/10.1155/2021/4845569


D. Bhattacharyya,¹ G.N. Singh,¹ Taghreed M. Jawa,² Neveen Sayed-Ahmed,² and Awadhesh K. Pandey³

Academic Editor: Ahmed Mostafa Khalil

Received 14 Oct 2021

Accepted 17 Nov 2021

Published 03 Dec 2021

Abstract

In this study, a new exponential-cum-sine-type hybrid imputation technique has been proposed to handle missing data when conducting surveys. The properties of the corresponding point estimator for the population mean have been examined in terms of bias and mean square error. An extensive simulation study using data generated from normal, Poisson, and Gamma distributions has been conducted to evaluate how the proposed estimator performs in comparison to several contemporary estimators. The results have been summarized, and a discussion regarding real-life applications of the estimator follows.

1. Introduction

The impracticality of measuring the entire population for any realistic project due to budgetary, time, or other constraints makes sampling indispensable for any field of study [1–12]. The widespread applications of acceptance sampling in various industries for manufacturing and other processes have been noted for a considerable period of time. Sampling can also be applied to obtain vital information on the chief characteristics of items ranging from electrical appliances to machine parts such as screws and bolts, automobiles, and computer parts such as chips. In addition, many environmental problems involve physical, geographical, economic, and other characteristics which need to be estimated prior to data analysis, model formulation, and predictions. Studies related to the amount of rainfall received annually in a flood-prone area, the quality of drinking water near an industrial zone, the soil quality of an agricultural land, etc. are some instances where estimation of the mean, median, variance, and other statistics is essential. Such information can be collected via sample surveys [4, 6, 7, 9, 13].

Missing data is a universal occurrence in sample surveys, leading to a decline in data quality and complications in making inferences. It is pivotal for survey statisticians to factor in the stochastic nature of incomplete data. This raises the question of which assumptions have to be made, or which techniques employed, to handle the problem of ignorability of the missing data mechanism. The mechanisms of missing data have been studied in detail in [9, 13], among others. Three missing data mechanisms are mostly of interest in the survey literature, namely, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR occurs when the missingness is unrelated to any variable, observed or unobserved; MAR occurs when the missingness does not depend on the variable under study (which may be unobserved) but on other variables that are fully observed; and MNAR occurs when the missingness depends on the variable under study itself.
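The three mechanisms can be made concrete with a small simulation. The following is a minimal sketch in R, not part of the study: the variable names, the 30% MCAR rate, and the logistic response models are all assumptions chosen for illustration.

```r
set.seed(1)
n <- 1000
x <- rnorm(n, mean = 50, sd = 10)   # fully observed auxiliary variable
y <- 2 * x + rnorm(n, sd = 5)       # study variable, correlated with x

# MCAR: every unit has the same chance (30%) of being missing
miss_mcar <- runif(n) < 0.3

# MAR: the chance of missingness depends only on the observed x
miss_mar <- runif(n) < plogis((x - 50) / 10 - 1)

# MNAR: the chance of missingness depends on y itself
miss_mnar <- runif(n) < plogis((y - 100) / 20 - 1)

# Respondent means of y under each mechanism, next to the full-data mean
means <- c(full = mean(y),
           mcar = mean(y[!miss_mcar]),
           mar  = mean(y[!miss_mar]),
           mnar = mean(y[!miss_mnar]))
print(round(means, 2))
```

In this sketch larger values of `x` or `y` are more likely to be missing under MAR and MNAR, so the respondent mean drifts below the full-data mean, whereas under MCAR it differs only by sampling noise.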

Numerous statistical methods have been devised over the years to overcome the problem of missing data. Subsampling of nonrespondents in surveys via mail questionnaire was pioneered in [8]. Another commonly used method is imputation, in which the missing values are filled in by a suitable function of the available values, to ensure the structural completeness of the sample before analysis begins. Popular imputation techniques include mean imputation, regression imputation, hot deck imputation, cold deck imputation, and nearest neighbor method. Imputation techniques in the survey literature are from [3, 5, 14–21], among others. Some recent works in the area of imputation and estimation of population mean have been done in [22–29] and others.

Information from an auxiliary variable can be utilized to provide an improved estimate for population characteristics. Such information may be readily available as secondary data from previous surveys or census or may be collected during the survey procedure at little to no additional cost. Some examples of such auxiliary information include the lifetime of a previous batch of bulbs when studying the life of a current lot of bulbs, the speed of cars when studying the mileage of cars, etc.

In this manuscript, a new exponential-cum-sine-type hybrid imputation technique and corresponding point estimator have been proposed for estimation of population mean. Motivation for this estimator, its properties, and its uses have been discussed in the subsequent sections. The manuscript is henceforth divided into the following sections: Section 2 introduces the sample structure and notations used in the manuscript. Section 3 discusses some conventional estimators of population mean. Section 4 discusses the proposed estimator, including its existence, consistency, properties, and implementation in R. The simulation study has been presented in Section 5, the results and discussion in Section 6, and the conclusions in Section 7.

2. Sample Structure and Notations Used

Let the character of interest be denoted by $y$. We consider the scenario in which complete information on a correlated auxiliary variable $x$ is available to the survey statisticians and its population mean $\bar{X}$ is known.

The sample structure and the notations used henceforth have been introduced in Table 1.

3. Some Conventional Estimators

Before the proposed estimator is introduced, it is important to examine some existing estimators for population mean and study their strengths and limitations. A few such estimators have been discussed in this section.

The mean estimator is a simple and traditional estimator, which makes use of the average of the responses to provide an estimate of the population mean. The ratio estimator tries to make an improvement over the mean estimator by incorporating auxiliary information from a correlated variable. Various other estimators that make innovative use of auxiliary information have been proposed, for instance, the estimator proposed in [30], regression-type estimators proposed in [10], and exponential-type estimators in [31], among others.

The structures of some of these estimators have been given in Table 2, while the expressions for their respective variances (V) or mean square errors (MSEs) have been given in Table 3.

It is to be noted that most conventional estimators make use of simple functional forms, such as linear combinations, exponential functions, and chains. Combinations of multiple mathematical functions are rarely seen. This can be attributed to the computational limitations associated with such functions. However, with the advent of supercomputers and improvements in computational power, such obstructions have been eliminated. It is worth exploring whether combinations of mathematical functions produce better estimates than traditional estimators. This has been the motivation behind the construction of the proposed estimator.

Two such functions have been used, namely, the exponential and sine functions. These particular functions were selected based on their use in real-life situations. The exponential function is usually used to model growth and decay observed in nature, such as growth and decay of microorganisms like bacteria, human population, spread of pandemics, and compound interest. The sine function is commonly utilized for modeling natural phenomena which are periodic in nature, such as sound waves, light waves, tides, sunlight intensity, and average temperature variations through the year, as well as ballistic trajectories, electrical currents, and GPS locations.

4. Formulation of the Proposed Estimator

Let $y_i$ and $x_i$ be the values of $y$ and $x$, respectively, for the $i$th unit in the population. The following imputation method may be suggested to deal with the problem of missing data:

The point estimator under an imputation method is given in

Using equation (2), under the imputation outlined in equation (1), the expression for the point estimator of is obtained as
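Read together with the R snippet in Section 4.3, equations (1)–(3) can be sketched as below. The symbols $R$ and $R^{c}$ (responding and nonresponding units of the sample $s$), $\bar{y}_r$, $\bar{x}_r$, $\bar{x}_n$, $y_{\cdot i}$, and the label $\bar{y}_{ES}$ are assumed here for illustration, and the imputed value is assumed to be built from $\bar{y}_r$ so that (2) applied to (1) returns (3).

```latex
\begin{align}
y_{\cdot i} &=
\begin{cases}
y_i, & i \in R,\\[4pt]
\dfrac{\bar{y}_r}{n-r}\left[\, n \exp\!\left(\dfrac{\sin\bar{x}_n - \sin\bar{x}_r}
{1 + \sin\bar{x}_n + \sin\bar{x}_r}\right) - r \right], & i \in R^{c},
\end{cases} \tag{1}\\
\bar{y}_s &= \frac{1}{n}\sum_{i \in s} y_{\cdot i}, \tag{2}\\
\bar{y}_{ES} &= \bar{y}_r \exp\!\left(\dfrac{\sin\bar{x}_n - \sin\bar{x}_r}
{1 + \sin\bar{x}_n + \sin\bar{x}_r}\right). \tag{3}
\end{align}
```

Under these assumptions, substituting (1) into (2) makes the respondent total $r\bar{y}_r$ cancel against the $-r\bar{y}_r$ contributed by the $n-r$ imputed values, which leaves (3).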

4.1. Existence and Consistency of the Estimator

It is important to specify the domain of values for which an estimator exists, so that survey statisticians or those working in the field can determine whether an estimator can be reasonably used in a practical scenario.

The given estimator consists of two major functions: the trigonometrical function $\sin x$ and the exponential function $e^{x}$. Both $\sin x$ and $e^{x}$ exist in $\mathbb{R}$, so the estimator built from them exists in $\mathbb{R}$.

Hence, the proposed estimator can be used for all real values of the characters under study. For real-world scenarios, most, if not all, characters of interest take only real values. For example, measurements such as length, breadth, height, weight, diameter, currencies, and number of an item do not take nonreal values. Hence, the proposed estimator can be used in all practical scenarios.

It is to be noted that the structure of the estimator is consistent for large sample approximations. As , , , , and . Hence, .

4.2. Properties of the Proposed Estimator

The “goodness” of an estimator can be measured in terms of various properties. Two such properties, namely, bias and mean squared error (MSE), have been explored here. The bias is the expected deviation of the estimator from the true value of the parameter, while the MSE is the expected squared deviation from it and so reflects both the spread and the bias. The expressions for both have been derived under large sample assumptions up to the first order of approximation. Some transformations involving error terms have been used for the purpose, indicated as follows:

The error terms have the following expectations:

To obtain the expressions for bias and MSE, in the first step, algebraic expansion of the expression of the estimator given in equation (3) is done, using the following Taylor’s series:(1)(2)(3)
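For reference, the standard Maclaurin series of the functions that appear in the estimator are listed below. That these are the three expansions intended at this step, and the generic argument $\theta$, are assumptions made for illustration.

```latex
\begin{align*}
\sin\theta &= \theta - \frac{\theta^{3}}{3!} + \frac{\theta^{5}}{5!} - \cdots,\\
e^{\theta} &= 1 + \theta + \frac{\theta^{2}}{2!} + \frac{\theta^{3}}{3!} + \cdots,\\
(1+\theta)^{-1} &= 1 - \theta + \theta^{2} - \theta^{3} + \cdots, \qquad |\theta| < 1.
\end{align*}
```

For a first-order approximation, each series is truncated after the terms needed to retain second-order products of the error terms.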

The estimator takes the following form:

In the second step, the transformations in equation (4) are applied to equation (6) to obtain the following form of the estimator:

Hence, .

Expectations taken on both sides and use of the expected values of , yield the expectations for bias and MSE , obtained up to the first order of approximations of the estimators , as follows:where .

4.3. Implementation in R

In the current day and age, most computations are carried out using a suitable software environment. The following R [32] code snippet has been developed to carry out the proposed imputation on a data set of interest and calculate the value of the corresponding point estimator:

    # Import data of respondents from file (column 1: x, column 2: y)
    dfresp <- read.table(file.choose())
    # Import data of nonrespondents from file (column 1: x)
    dfnonresp <- read.table(file.choose())

    xrbar <- mean(dfresp[, 1])
    yrbar <- mean(dfresp[, 2])
    xbarnonresp <- mean(dfnonresp[, 1])
    r <- nrow(dfresp)            # no. of respondents
    nonresp <- nrow(dfnonresp)   # no. of nonrespondents
    n <- r + nonresp             # sample size
    xnbar <- (r * xrbar + nonresp * xbarnonresp) / n
    num <- sin(xnbar) - sin(xrbar)
    den <- 1 + sin(xnbar) + sin(xrbar)

    # Imputation
    t <- numeric(n - r)
    for (i in seq_len(n - r)) {
      # Assumption: yrbar is used in place of the undefined x[i], so that the
      # mean of the completed sample equals the point estimate below
      t[i] <- n / (n - r) * yrbar * exp(num / den) - r / (n - r) * yrbar
    }

    # Point estimation
    est <- yrbar * exp(num / den)
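For use without interactive file selection, the same computation can be wrapped in a function that takes vectors. This is a minimal sketch: the function name `exp_sine_estimate`, its arguments, the constant imputed value, and the toy data are hypothetical and introduced only for illustration.

```r
# Hypothetical helper: exponential-cum-sine point estimate from vectors.
#   x_resp, y_resp : auxiliary and study values for the r respondents
#   x_nonresp      : auxiliary values for the n - r nonrespondents
exp_sine_estimate <- function(x_resp, y_resp, x_nonresp) {
  stopifnot(length(x_resp) == length(y_resp), length(x_nonresp) > 0)
  r <- length(y_resp)
  n <- r + length(x_nonresp)
  xrbar <- mean(x_resp)
  yrbar <- mean(y_resp)
  xnbar <- mean(c(x_resp, x_nonresp))
  den <- 1 + sin(xnbar) + sin(xrbar)
  # Guard against a denominator that is numerically zero
  if (abs(den) < 1e-8) stop("denominator is numerically zero")
  k <- exp((sin(xnbar) - sin(xrbar)) / den)
  # Imputed value, assumed to be the same for every nonrespondent
  imputed <- (n * yrbar * k - r * yrbar) / (n - r)
  completed <- c(y_resp, rep(imputed, n - r))
  list(estimate = yrbar * k, imputed = imputed, completed = completed)
}

# Toy data, made up for illustration only
x_resp <- c(0.2, 0.4, 0.6, 0.8)
y_resp <- c(1.1, 1.9, 3.2, 3.8)
x_nonresp <- c(0.1, 0.3)
res <- exp_sine_estimate(x_resp, y_resp, x_nonresp)
print(res$estimate)
print(mean(res$completed))   # agrees with res$estimate
```

Returning the completed sample alongside the estimate makes it easy to check that the mean of the imputed data set reproduces the point estimate.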

5. Simulation Study

Before an estimator can be used in practical scenarios, its performance must be examined, in terms of its properties. To this end, the bias of the estimator is calculated and the MSE is compared with that of the contemporary estimators given in Table 2 in terms of percentage relative efficiencies (PREs).

The PREs of the estimator with respect to the contemporary estimators are defined as follows:where the expression for the MSE of the proposed estimator is given in equation (9), while that of the contemporary estimators is given in Table 3.
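A PRE of this kind can be computed with a one-line helper. The sketch below assumes the usual convention, PRE = 100 × MSE(competitor) / MSE(proposed), so that values above 100 favour the proposed estimator; the function name `pre` and the MSE values are hypothetical.

```r
# Percentage relative efficiency of the proposed estimator with respect to
# a competitor (assumed convention: values above 100 favour the proposed one)
pre <- function(mse_competitor, mse_proposed) {
  stopifnot(all(mse_competitor > 0), all(mse_proposed > 0))
  100 * mse_competitor / mse_proposed
}

# Made-up MSE values, for illustration only
pre_values <- pre(mse_competitor = c(0.50, 0.40), mse_proposed = 0.25)
print(pre_values)   # 200 160
```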

Using R [32], an extensive simulation study has been carried out on sufficiently large fictitious populations to compute the bias and the PREs defined above. Data is generated from three different probability distributions, namely, normal and Gamma distributions (continuous distributions) and Poisson distribution (discrete distribution). Some important properties of the distributions have been summarized in Table 4. Such distributions are chosen based on their occurrence in real-life situations.

Normally distributed data are common in nature. The normal distribution can be used to model heights of individuals, test scores of students, blood pressure, daily returns of any particular stock, weights of items produced by a manufacturing process, etc. The Poisson distribution can be used to model the probability that a given number of events occur in a specific time interval, for example, the number of insurance claims filed per month, the number of network failures occurring per week, and the number of bulbs manufactured per minute. It also finds use among medical statisticians, such as for estimating the number of births that may be expected on a particular night, the number of patients with an infectious disease arriving at a clinic within a given hour, the number of mutations on a given strand of DNA per time unit, etc. The Gamma distribution can be used for modeling wait time, reliability, service time in queuing theory, etc. For example, it can be used to model the amount of rainfall that accumulates in a given reservoir, the flow of items through manufacturing as well as distribution processes, the size of loan defaults, etc. Thus, these three distributions are chosen based on their importance in practical scenarios.

It is seen through trial and error that the estimator performs well when and take small values and the variation in is greater than that in .

The steps of the simulation are as follows:
(1) The sizes of the population, the sample, and the responding part of the sample are defined. For the purpose of the study, sufficiently large values of $N$, $n$, and $r$ have been chosen.
(2) The parameters of the population are defined.
(3) Simulation is conducted for various values of the correlation coefficient $\rho$. For the purpose of the study, only positive values of $\rho$ are taken; i.e., a positively correlated auxiliary variable is considered.
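The steps above can be sketched in R as follows. This is an illustrative outline under stated assumptions, not the study's own code: the helper name `simulate_pre`, the sizes, the normal population parameters, the MCAR response mechanism, the number of replications, and the choice of the mean estimator as the only competitor are all assumptions.

```r
set.seed(2021)

# Hypothetical helper: one simulation run for a given correlation rho.
# All sizes and parameter values are illustrative, not those of the study.
simulate_pre <- function(rho, N = 5000, n = 500, r = 350, reps = 1000) {
  # Steps 1-2: normal population with corr(x, y) close to rho
  x_pop <- rnorm(N, mean = 0.5, sd = 0.2)
  z <- (x_pop - 0.5) / 0.2
  y_pop <- 1 + 0.1 * (rho * z + sqrt(1 - rho^2) * rnorm(N))
  Ybar <- mean(y_pop)

  one_draw <- function() {
    s <- sample.int(N, n)          # simple random sample without replacement
    resp <- s[sample.int(n, r)]    # respondents drawn at random (MCAR)
    xnbar <- mean(x_pop[s])
    xrbar <- mean(x_pop[resp])
    yrbar <- mean(y_pop[resp])
    k <- exp((sin(xnbar) - sin(xrbar)) / (1 + sin(xnbar) + sin(xrbar)))
    c(mean = yrbar, proposed = yrbar * k)
  }

  draws <- replicate(reps, one_draw())   # 2 x reps matrix
  mse <- rowMeans((draws - Ybar)^2)      # empirical MSE of each estimator
  c(rho = rho,
    bias = mean(draws["proposed", ]) - Ybar,
    PRE = 100 * mse[["mean"]] / mse[["proposed"]])
}

# Step 3: repeat for several positive correlations
results <- t(sapply(c(0.5, 0.7, 0.9), simulate_pre))
print(round(results, 4))
```

Each row of `results` reports the empirical bias of the proposed estimator and its PRE relative to the mean estimator for one value of `rho`; other competitors or distributions can be added by extending `one_draw` and the population step.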

The results of the simulation study related to the PREs have been presented in Tables 5–11, while the biases have been presented in Table 12.

6. Results and Discussion

The simulation study enables us to study the behavior of the proposed estimator under various scenarios involving various values of parameters. The chief conclusions are as follows:
(1) From the values of in Table 5, it is seen that the proposed estimator is more efficient than for all values of for normal data and for for Gamma and Poisson data for the various values of response rates.
(2) It is seen that the proposed estimator performs better than for all values of for normal and Gamma data and for for Poisson data for the various values of response rates from the values of in Table 6.
(3) From the values of in Table 7, it is seen that the proposed estimator dominates for all values of for normal data and for for Gamma and Poisson data for the various values of response rates.
(4) The values of in Table 8 show that the proposed estimator is more efficient than for all values of for normal data and for for Gamma and Poisson data for the various values of response rates.
(5) In Table 9, the values of show that the proposed estimator performs better than for all values of and for the various values of response rates for normal, Gamma, and Poisson data.
(6) From the values of in Table 10, it is seen that the proposed estimator dominates for all values of and for the various values of response rates for normal, Gamma, and Poisson data.
(7) It is seen that the proposed estimator is more efficient than for all values of and for the various values of response rates for normal, Gamma, and Poisson data from the values of in Table 11.
(8) From Table 12, it is seen that the estimator is negatively biased. The bias is negligible, being of the order and for various values of the parameter and for various response rates, and hence, bias correction is not needed.

7. Conclusion

The following trend in the PREs is noticed from the tables: increases with the increase in value of , while decrease with the increase in value of .

The proposed estimator is seen to be consistent, to exist for all real values of the parameters, to have negligible bias, and to be more efficient than seven contemporary estimators. Hence, the proposed estimator may be recommended for use in field work.

Data Availability

The data used in the study are generated theoretically by the equations given in this paper.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Taif University Researchers Supporting Project number TURSP-2020/318, Taif University, Taif, Saudi Arabia.

References

[1] M. Mohiuddin, H. Al Bayatti, and R. Kannan, “A new generalization of Garima distribution with application to real life data,” Applied Mathematics & Information Sciences, vol. 15, no. 5, pp. 577–592, 2021.
[2] A. El Sheikh, S. Barakat, and S. Mohamed, “New aspects on the modified group LASSO using the least angle regression and shrinkage algorithm,” Information Sciences Letters, vol. 10, no. 3, pp. 527–536, 2021.
[3] M. S. Ahmed, O. Al-Titi, Z. Al-Rawi, and W. Abu-Dayyeh, “Estimation of a population mean using different imputation methods,” Statistics in Transition, vol. 7, no. 6, pp. 1247–1264, 2006.
[4] A. M. Basheer, E. M. Almetwally, and H. M. Okasha, “Marshall-Olkin alpha power inverse Weibull distribution: non Bayesian and Bayesian estimations,” Journal of Statistics Applications & Probability, vol. 10, no. 2, pp. 327–345, 2021.
[5] G. Diana and P. Francesco Perri, “Improved estimators of the population mean for missing data,” Communications in Statistics-Theory and Methods, vol. 39, no. 18, pp. 3245–3251, 2010.
[6] F. Daghestani, A. S. Sultan, and S. Al-Moisheer, “Mixture of Lindley and Weibull distributions: properties and estimation,” Journal of Statistics Applications & Probability, vol. 10, no. 2, pp. 301–313, 2021.
[7] M. Abu-Moussa, A. Abd-Elfattah, and E. Hafez, “Estimation of stress-strength parameter for Rayleigh distribution based on progressive type-II censoring,” Information Sciences Letters, vol. 10, no. 1, pp. 101–110, 2021.
[8] M. H. Hansen and W. N. Hurwitz, “The problem of non-response in sample surveys,” Journal of the American Statistical Association, vol. 41, no. 236, pp. 517–529, 1946.
[9] D. F. Heitjan and S. Basu, “Distinguishing ‘missing at random’ and ‘missing completely at random’,” The American Statistician, vol. 50, no. 3, pp. 207–213, 1996.
[10] C. Kadilar and H. Cingi, “Estimators for the population mean in the case of missing data,” Communications in Statistics-Theory and Methods, vol. 37, no. 14, pp. 2226–2236, 2008.
[11] M. O. Mohamed and E. Mohamed, “Estimation of parameters of Burr distribution under SSALT,” Applied Mathematics & Information Sciences, vol. 15, no. 3, pp. 293–298, 2021.
[12] R. M. EL-Sagheer and M. A. A. Khder, “Estimation in K-stage step-stress partially accelerated life tests for generalized Pareto distribution with progressive type-I censoring,” Applied Mathematics & Information Sciences, vol. 15, no. 3, pp. 299–305, 2021.
[13] D. B. Rubin, “Inference and missing data,” Biometrika, vol. 63, no. 3, pp. 581–592, 1976.
[14] G. Kalton, D. Kasprzyk, and R. Santos, “Issues of nonresponse and imputation in the survey of income and program participation,” in Current Topics in Survey Sampling, pp. 455–480, Academic Press, New York, NY, USA, 1981.
[15] G. Kalton and D. Kasprzyk, “Imputing for missing survey responses,” in Proceedings of the Section on Survey Research Methods, American Statistical Association, American Statistical Association Cincinnati, Baltimore, Maryland, August 1982.
[16] G. Kalton and L. Kish, “Some efficient random imputation methods,” Communications in Statistics-Theory and Methods, vol. 13, no. 16, pp. 1919–1939, 1984.
[17] J. K. Kim and J. Shao, Statistical Methods for Handling Incomplete Data, CRC Press, Boca Raton, FL, USA, 2013.
[18] H. Liu, S.-G. Li, H.-X. Wang, and G.-J. Li, “Adaptive fuzzy synchronization for a class of fractional-order neural networks,” Chinese Physics B, vol. 26, no. 3, Article ID 030504, 2017.
[19] I. G. Sande, “A personal view of hot-deck imputation procedures,” Survey Methodology, vol. 5, no. 2, pp. 238–258, 1979.
[20] S. Singh and S. Horn, “Compromised imputation in survey sampling,” Metrika, vol. 51, no. 3, pp. 267–276, 2000.
[21] S. Singh and B. Deo, “Imputation by power transformation,” Statistical Papers, vol. 44, no. 4, pp. 555–579, 2003.
[22] A. K. Pandey, M. Usman, and G. N. Singh, “Optimality of ratio and regression type estimators using dual of auxiliary variable under non response,” Alexandria Engineering Journal, vol. 60, no. 5, pp. 4461–4471, 2021.
[23] A. K. Pandey, G. N. Singh, N. Sayed-Ahmed, and H. Abu-Zinadah, “Improved estimators for mean estimation in presence of missing information,” Alexandria Engineering Journal, vol. 60, no. 6, pp. 5977–5990, 2021.
[24] U. Shahzad, N. H. Al-Noor, M. Hanif, I. Sajjad, and M. Muhammad Anas, “Imputation based mean estimators in case of missing data utilizing robust regression and variance-covariance matrices,” Communications in Statistics-Simulation and Computation, pp. 1–20, 2020.
[25] G. N. Singh, A. K. Pandey, and A. K. Sharma, “Some improved and alternative imputation methods for finite population mean in presence of missing information,” Communications in Statistics-Theory and Methods, vol. 50, pp. 1–27, 2020.
[26] M. U. Sohail, J. Shabbir, and S. Ahmed, “A class of ratio type estimators for imputing the missing values under rank set sampling,” Journal of Statistical Theory and Practice, vol. 12, no. 4, pp. 704–717, 2018.
[27] M. U. Sohail, J. Shabbir, and C. Kadilar, “Homogeneous imputation under two phase probability proportional to size sampling,” Hacettepe Journal of Mathematics and Statistics, vol. 48, no. 5, pp. 1522–1546, 2019.
[28] M. U. Sohail, J. Shabbir, and F. Sohil, “Imputation of missing values by using raw moments,” Statistics in Transition New Series, vol. 20, no. 1, pp. 21–40, 2019.
[29] M. U. Sohail, F. Sohil, and J. Shabbir, “Comparative study of different imputation methods,” Communications in Statistics-Simulation and Computation, pp. 1–23, 2021.
[30] H. Toutenburg and V. K. Srivastava, “Amputation versus imputation of missing values through ratio method in sample surveys,” Statistical Papers, vol. 49, no. 2, pp. 237–247, 2008.
[31] G. N. Singh, S. Maurya, M. Khetan, and C. Kadilar, “Some imputation methods for missing data in sample surveys,” Hacettepe Journal of Mathematics and Statistics, vol. 45, no. 6, pp. 1865–1880, 2016.
[32] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2018, https://www.R-project.org/.

Copyright

Copyright © 2021 D. Bhattacharyya et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
