Some Classes of Logarithmic-Type Imputation Techniques for Handling Missing Data

Pandey, Awadhesh K.; Singh, G. N.; Bhattacharyya, D.; Ali, Abdulrazzaq Q.; Al-Thubaiti, Samah; Yakout, H. A.

doi:https://doi.org/10.1155/2021/8593261

Computational Intelligence and Neuroscience


Special Issue

Artificial Intelligence and Machine Learning-Driven Decision-Making


Research Article | Open Access

Volume 2021 | Article ID 8593261 | https://doi.org/10.1155/2021/8593261


Awadhesh K. Pandey,¹ G. N. Singh,² D. Bhattacharyya,² Abdulrazzaq Q. Ali,³ Samah Al-Thubaiti,⁴ and H. A. Yakout⁵

Academic Editor: Ahmed Mostafa Khalil

Received 09 Nov 2021

Accepted 30 Nov 2021

Published 20 Dec 2021

Abstract

In this manuscript, three new classes of log-type imputation techniques are proposed to handle missing data in sample surveys. The corresponding classes of point estimators are derived for estimating the population mean, and their properties (bias and Mean Square Error) are studied. An extensive simulation study using data generated from normal, Poisson, and Gamma distributions, as well as a real dataset, is conducted to evaluate how the proposed estimators perform in comparison to several contemporary estimators. The results are summarized, and a discussion of real-life applications of the estimators follows.

1. Introduction

Every project operates under constraints such as budget restrictions, time limitations, and deadlines. As a result, it is rarely feasible to study the entire population, and sampling is indispensable in any field of study [1–4]. Sampling has wide application in industries such as manufacturing and quality control, where it can be used to gather information on notable characteristics of items such as electrical and household appliances, machine parts like screws and bolts, automobiles, and computer parts like chips. Sampling also has applications in environmental problems that require the estimation of physical, geographical, economic, and other characteristics before data analysis can begin [5, 6]. The mean, median, variance, and other statistics are essential for studies involving environmental parameters, such as the amount of rainfall received in a drought-prone area or the air quality of a city with high traffic density. Sample surveys may be designed to collect such information.

Missing data occur frequently in sample surveys and are a primary contributor to declining data quality and incorrect inferences. Hence, it is crucial that survey statisticians deal with the stochastic nature of such incomplete data, and that they understand both the assumptions that have to be made about the mechanism generating the missing values and the methods available when that mechanism can or cannot be ignored. The authors of [7, 8] and many others have studied the mechanisms of missing data. Of these, the ones most relevant to the survey literature are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR occurs when the probability that a value is missing is unrelated to any variable, observed or unobserved, so that data are missing purely by chance. MAR occurs when the missingness does not depend on the variable under study (which may be unobserved), but on some other variable (which is fully observed). MNAR occurs when the missingness depends on the variable under study itself.
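The distinction between the three mechanisms can be made concrete with a small simulation. The following is a minimal sketch in R; the population values, the response probabilities, and the logistic form of the response model are all assumptions chosen for illustration and are not taken from the study.

```r
# Illustrative only: sizes, coefficients, and the logistic response model are assumptions.
set.seed(1)
n <- 1000
x <- rnorm(n, mean = 50, sd = 10)        # auxiliary variable, fully observed
y <- 20 + 0.8 * x + rnorm(n, sd = 5)     # study variable, subject to nonresponse

# MCAR: every unit responds with the same probability, unrelated to x or y
resp_mcar <- rbinom(n, 1, 0.7) == 1

# MAR: the response probability depends only on the fully observed x
resp_mar <- rbinom(n, 1, plogis((x - 50) / 10)) == 1

# MNAR: the response probability depends on the study variable y itself
resp_mnar <- rbinom(n, 1, plogis((y - mean(y)) / 10)) == 1

# Respondent means of y under each mechanism, next to the full-data mean
means <- c(full = mean(y),
           mcar = mean(y[resp_mcar]),
           mar  = mean(y[resp_mar]),
           mnar = mean(y[resp_mnar]))
print(round(means, 2))
```

Under MCAR the respondent mean stays close to the full-data mean, whereas under the MAR and MNAR settings of this sketch it drifts upward, because units with large x (and hence large y) respond more often.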

A number of statistical techniques have been developed over the past decades to handle missing data. The study in [9] was the first to suggest that a subsample of the nonrespondents to a mail survey be contacted again, by personal interview. Another widely employed technique is imputation, in which a suitable function of the variables is used to fill in the missing values. This ensures that the sample is structurally complete before statistical analysis begins. Popular imputation methods include mean, regression, hot deck, cold deck, and nearest neighbor imputation, among others. Imputation techniques in the survey literature are due to [10–27], among others.
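To illustrate what "filling in the missing values with a suitable function of the variables" means in practice, the sketch below applies mean, ratio, and regression imputation to a toy sample. The data-generating model, the sample size, and the number of missing values are assumptions made only for this example.

```r
# Illustrative toy data: the model and the number of missing values are assumptions.
set.seed(2)
n <- 200
x <- runif(n, 10, 30)                    # auxiliary variable, observed for all units
y <- 5 + 2 * x + rnorm(n, sd = 3)        # study variable
y_obs <- y
y_obs[sample(n, 60)] <- NA               # 60 responses go missing at random
miss <- is.na(y_obs)

# Mean imputation: replace each missing value by the respondent mean
y_mean <- ifelse(miss, mean(y_obs, na.rm = TRUE), y_obs)

# Ratio imputation: scale x by the ratio of respondent means
ratio <- mean(y_obs[!miss]) / mean(x[!miss])
y_ratio <- ifelse(miss, ratio * x, y_obs)

# Regression imputation: predict from a line fitted to the respondents
fit <- lm(y_obs ~ x)                     # rows with NA are dropped by default
y_reg <- ifelse(miss, predict(fit, newdata = data.frame(x = x)), y_obs)

# Means of the completed samples under each method
print(c(mean = mean(y_mean), ratio = mean(y_ratio), regression = mean(y_reg)))
```

Mean imputation leaves the respondent mean unchanged, while the ratio and regression versions shift the estimate according to the auxiliary values of the nonrespondents.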

Information from an auxiliary variable can be utilized to provide an improved estimate for population characteristics. Such information may be readily available as secondary data from previous surveys or census or may be collected during the survey procedure at little to no additional cost. Some examples of such auxiliary information include the lifetime of a previous batch of bulbs when studying the life of a current lot of bulbs and the speed of cars when studying the mileage of cars.

This manuscript proposes three novel logarithmic-type imputation methods to neutralize the nuisance effects of nonresponse in survey sampling. The corresponding classes of point estimators of the population mean are studied in detail. The subsequent sections are devoted to the theoretical analysis of the properties of the proposed estimators, in terms of bias and Mean Square Error (MSE), and to an empirical study, based on both simulated and real data, that compares their performance with that of some contemporary estimators. The manuscript is structured as follows: Sections 2 and 3 introduce, respectively, the sample structure and notations and some conventional estimators of the population mean, which are used subsequently in the manuscript. Section 4 introduces the proposed classes of estimators and comments on their existence, consistency, properties, and implementation in R. The empirical studies involving simulated data and real data are presented in Sections 5 and 6, respectively. Section 7 summarizes the main findings and conclusions.

2. Sampling Scheme and Notations Used

Let the characteristic of interest be denoted by $y$. A correlated auxiliary variable $x$, with complete information available on it and known population mean $\bar{X}$, is considered.

The sample structure as well as the notations used in the subsequent sections of the manuscript have been introduced in Table 1.

3. Some Conventional Estimators

It is crucial to conduct thorough literature review and examine the properties of some existing estimators of population mean, before new estimators can be proposed. A few such estimators have been discussed in this section.

The mean estimator is a simple and widely used estimator, which estimates the population mean by the average of the responses. The ratio estimator improves on the mean estimator by utilizing auxiliary information on a correlated variable. Numerous other estimators which make effective use of auxiliary information have been developed, for instance, the estimator proposed in [28] and the regression-type estimators proposed in [29], among others.
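As an illustration of how auxiliary information can help, the following sketch compares the mean estimator with a ratio-type estimator of the form (respondent mean of y) × (known population mean of x) / (respondent mean of x) by Monte Carlo. The population model, the sizes `N`, `n`, `r`, and the MCAR response scheme are assumptions of this example, and the ratio form used here is a textbook version that may differ in detail from the entries of Table 2.

```r
# Illustrative Monte Carlo: population model and sizes are assumptions.
set.seed(3)
N <- 5000; n <- 200; r <- 140
x_pop <- rgamma(N, shape = 16, rate = 0.8)   # positive auxiliary variable, mean about 20
y_pop <- 3 * x_pop + rnorm(N, sd = 2)        # study variable, roughly proportional to x
Xbar <- mean(x_pop)                          # known population mean of x
Ybar <- mean(y_pop)                          # target of estimation

one_draw <- function() {
  s <- sample(N, n)                          # SRSWOR sample
  resp <- sample(s, r)                       # MCAR: respondents are a random subset of s
  ybar_r <- mean(y_pop[resp])
  xbar_r <- mean(x_pop[resp])
  c(mean = ybar_r, ratio = ybar_r * Xbar / xbar_r)
}

est <- replicate(2000, one_draw())           # 2 x 2000 matrix of estimates
mse <- rowMeans((est - Ybar)^2)              # empirical MSE of each estimator
print(mse)
```

In this setting, where y is nearly proportional to x, the ratio-type estimator has a much smaller empirical MSE than the mean estimator; with weakly correlated variables the ordering can reverse.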

The structures of some of these estimators have been given in Table 2, while the expressions for their respective variances (V) or Mean Square Errors (MSEs) have been given in Table 3.

It is to be noted that most conventional estimators make use of simple functional forms, such as linear combinations, exponential functions, and chains. Logarithmic functions are rarely seen. This can be partially attributed to the computational limitations once associated with such functions; however, improvements in computing power have eliminated such obstacles. Logarithms are useful because they compress widely varying magnitudes onto a compact scale that is easy to interpret: each unit step on a logarithmic scale corresponds to multiplication by a fixed factor, so events whose magnitudes vary drastically, such as earthquakes, can be expressed on a single scale. For the same reason, logarithmic-scale graphs depict such magnitudes efficiently, and exponential changes often appear on them as straight lines, which makes them easier to interpret. Some real-life examples of the use of logarithms are decibels for measuring sound, the Richter scale for measuring earthquakes, and the pH scale for measuring acidity. Logarithms can also be used to study exponential growth and decay, such as bacterial growth in a Petri dish, interest rates (the implicit growth rate), and radioactive decay in radiocarbon dating. Hence, it is reasonable to explore the use of log-type estimators for the estimation of population parameters. This has been the motivation behind the construction of the proposed classes of logarithmic-type estimators.

4. Formulation of the Proposed Classes of Logarithmic-Type Estimators

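Let $y_i$ and $x_i$ denote, respectively, the values for the $i$th population unit of characteristics $y$ and $x$. Let $R$ and $R^c$ denote the sets of respondents and nonrespondents, respectively. The following imputation methods may be suggested to deal with the problem of missing data:

$$y_{\cdot i} = \begin{cases} y_i, & i \in R, \\ \bar{y}_r + \dfrac{\alpha\, n\, x_i}{\sum_{j \in R^c} x_j} \log\left(\dfrac{\bar{x}_n}{\bar{X}}\right), & i \in R^c, \end{cases} \tag{1}$$

$$y_{\cdot i} = \begin{cases} y_i, & i \in R, \\ \bar{y}_r + \dfrac{\beta\, n\, x_i}{\sum_{j \in R^c} x_j} \log\left(\dfrac{\bar{x}_r}{\bar{X}}\right), & i \in R^c, \end{cases} \tag{2}$$

$$y_{\cdot i} = \begin{cases} y_i, & i \in R, \\ \bar{y}_r + \dfrac{\gamma\, n\, x_i}{\sum_{j \in R^c} x_j} \log\left(\dfrac{\bar{x}_n}{\bar{x}_r}\right), & i \in R^c, \end{cases} \tag{3}$$

where $\alpha$, $\beta$, and $\gamma$ are constants, to be determined in such a way that they minimize the MSE. Here $\bar{y}_r$ and $\bar{x}_r$ are the means of $y$ and $x$ over the $r$ respondents, and $\bar{x}_n$ is the mean of $x$ over the full sample of size $n$.

The point estimator under an imputation method is given in

$$\bar{y}_s = \frac{1}{n} \sum_{i \in s} y_{\cdot i} = \frac{1}{n} \left( \sum_{i \in R} y_{\cdot i} + \sum_{i \in R^c} y_{\cdot i} \right). \tag{4}$$

Using Equation (4), under the imputation outlined in Equations (1)–(3), respectively, the expressions for the corresponding classes of logarithmic-type point estimators of $\bar{Y}$ are obtained as

$$T_1 = \bar{y}_r + \alpha \log\left(\frac{\bar{x}_n}{\bar{X}}\right), \tag{5}$$

$$T_2 = \bar{y}_r + \beta \log\left(\frac{\bar{x}_r}{\bar{X}}\right), \tag{6}$$

$$T_3 = \bar{y}_r + \gamma \log\left(\frac{\bar{x}_n}{\bar{x}_r}\right). \tag{7}$$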
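The step from an imputation rule to its point estimator can be checked numerically. The sketch below uses the first imputation rule in the form it takes in the R snippet of Section 4.3 (respondent mean of y plus a log-type adjustment proportional to each nonrespondent's x), averages the completed sample, and compares the result with the closed form "respondent mean of y plus alpha times log(sample mean of x / population mean of x)". The toy data, the value of `Xbar`, and the value of `alpha` are arbitrary assumptions used only for the check.

```r
# Numerical check on toy data; Xbar and alpha are arbitrary assumed values.
set.seed(4)
n <- 50; r <- 35
Xbar <- 20                                   # assumed known population mean of x
x <- rgamma(n, shape = 16, rate = 0.8)       # positive auxiliary values for the sample
y <- 3 * x + rnorm(n, sd = 2)
is_resp <- seq_len(n) <= r                   # first r units respond (illustrative)

ybar_r <- mean(y[is_resp])                   # respondent mean of y
xbar_n <- mean(x)                            # full-sample mean of x
alpha <- -1.5                                # arbitrary constant for the check

# Keep observed y for respondents, impute log-type values for nonrespondents
x_nr <- x[!is_resp]
y_imp <- y
y_imp[!is_resp] <- ybar_r + alpha * n * x_nr * log(xbar_n / Xbar) / sum(x_nr)

est_from_imputed <- mean(y_imp)                          # mean of the completed sample
est_closed_form  <- ybar_r + alpha * log(xbar_n / Xbar)  # closed-form estimator
print(all.equal(est_from_imputed, est_closed_form))
```

The two values agree because the adjustments summed over the nonrespondents add up to exactly n × alpha × log(sample mean of x / population mean of x), which contributes alpha × log(·) once the completed sample is averaged over n units.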

4.1. Existence and Consistency of the Estimator

The domain of values for which an estimator exists should be specified, so that survey statisticians or those working in the field are able to determine whether it is reasonable to use an estimator in a practical scenario.

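The proposed classes of estimators involve the logarithmic function, which exists only for positive values of its argument. Hence, the proposed estimators exist for all positive values of the auxiliary variable.

Hence, the proposed estimators can be used for all real, positive values of the characters under study. In real-world scenarios, many characters of interest take only positive values. For example, measurements such as length, breadth, height, weight, diameter, amounts of currency, and counts of items do not take negative values. Hence, the proposed estimators can be used in such practical scenarios.

It is to be noted that the structure of the estimators is consistent for large-sample approximations. As the sample sizes increase, the sample means of the auxiliary variable tend to its population mean and the respondent mean of the study variable tends to the population mean of the study variable, so that the argument of each logarithm tends to 1 and the logarithmic term tends to 0. Hence, each of the three proposed estimators tends to the population mean of the study variable.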

4.2. Properties of the Proposed Estimator

Various properties can be used to measure the “goodness” of an estimator. Two such properties, namely, bias and Mean Square Error (MSE), are discussed in this manuscript. Bias is the expected deviation of the estimator from the true value of a parameter, while MSE is the expected squared deviation from the true value and thus accounts for both the spread of the estimator and its bias. Large-sample assumptions have been considered for the purpose. The expressions have been derived up to the first order of approximations. Some transformations involving error terms have been employed for the purpose, given as follows:

The error terms have the following expectations:

To obtain the expressions for bias and MSE, in the first step, the transformations in Equation (8) are applied to Equations (5)–(7). In the second step, the resultant expressions are expanded algebraically, using the following Taylor series: $\log(1 + e) = e - \frac{e^2}{2} + \frac{e^3}{3} - \cdots$, valid for $|e| < 1$.

The estimators take the following forms after algebraic manipulation:

Hence,

Squaring both sides and taking expectations yields the expressions for the MSEs of the proposed estimators. They are obtained up to the first order of approximations, as follows:

As stated when introducing the imputation methods, the constants $\alpha$, $\beta$, and $\gamma$ are to be determined so that they minimize the respective MSEs of the estimators. Setting the derivative of each MSE with respect to its constant equal to zero, the respective optimal values of $\alpha$, $\beta$, and $\gamma$ are obtained as follows:

Thus, the expressions for the minimum MSE (Min M(.)) of the proposed classes of logarithmic-type estimators under optimal conditions are as follows:

The expressions for the bias, using the optimal values of $\alpha$, $\beta$, and $\gamma$, are found to be as follows:
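The two-step procedure described above can be illustrated with a short first-order sketch. Everything in it is an assumption made for illustration rather than a reproduction of the numbered equations: the error terms $e_0, e_1, e_2$, the factors $\theta_{a,b} = 1/a - 1/b$, the labels $T_1, T_2, T_3$ for the estimators built from $\log(\bar{x}_n/\bar{X})$, $\log(\bar{x}_r/\bar{X})$, and $\log(\bar{x}_n/\bar{x}_r)$, and the setting (SRSWOR with respondents forming a random subsample, i.e., MCAR). The signs of the resulting optimal constants agree with `alpha`, `beta`, and `gamma` in the R snippet of Section 4.3.

```latex
% Assumed notation: S_y^2 population mean square of y, C_y = S_y/\bar{Y}, C_x = S_x/\bar{X},
% \rho the population correlation between y and x, \theta_{a,b} = 1/a - 1/b.
\begin{align*}
\bar{y}_r &= \bar{Y}(1+e_0), \qquad \bar{x}_r = \bar{X}(1+e_1), \qquad \bar{x}_n = \bar{X}(1+e_2), \qquad E(e_0)=E(e_1)=E(e_2)=0,\\
E(e_0^2) &= \theta_{r,N} C_y^2, \qquad E(e_1^2) = \theta_{r,N} C_x^2, \qquad E(e_2^2) = \theta_{n,N} C_x^2,\\
E(e_0 e_1) &= \theta_{r,N}\,\rho\, C_y C_x, \qquad E(e_0 e_2) = \theta_{n,N}\,\rho\, C_y C_x, \qquad E(e_1 e_2) = \theta_{n,N} C_x^2.
\end{align*}

% Expanding \log(1+e) = e - e^2/2 + \cdots and keeping terms up to the second degree:
\begin{align*}
T_1 - \bar{Y} &\approx \bar{Y} e_0 + \alpha\left(e_2 - \tfrac{1}{2} e_2^2\right),\\
T_2 - \bar{Y} &\approx \bar{Y} e_0 + \beta\left(e_1 - \tfrac{1}{2} e_1^2\right),\\
T_3 - \bar{Y} &\approx \bar{Y} e_0 + \gamma\left(e_2 - e_1 - \tfrac{1}{2} e_2^2 + \tfrac{1}{2} e_1^2\right).
\end{align*}

% First-order MSEs (squares and products of first-degree terms only):
\begin{align*}
M(T_1) &\approx \theta_{r,N} S_y^2 + \alpha^2 \theta_{n,N} C_x^2 + 2\alpha\, \theta_{n,N}\, \rho\, S_y C_x,\\
M(T_2) &\approx \theta_{r,N}\left(S_y^2 + \beta^2 C_x^2 + 2\beta\, \rho\, S_y C_x\right),\\
M(T_3) &\approx \theta_{r,N} S_y^2 + \gamma^2 \theta_{r,n} C_x^2 - 2\gamma\, \theta_{r,n}\, \rho\, S_y C_x.
\end{align*}

% Minimizing each MSE over its constant:
\begin{align*}
\alpha_{\mathrm{opt}} &= \beta_{\mathrm{opt}} = -\frac{\rho S_y}{C_x}, \qquad \gamma_{\mathrm{opt}} = \frac{\rho S_y}{C_x},\\
\operatorname{Min} M(T_1) &= S_y^2\left(\theta_{r,N} - \theta_{n,N}\rho^2\right), \quad
\operatorname{Min} M(T_2) = \theta_{r,N} S_y^2\left(1-\rho^2\right), \quad
\operatorname{Min} M(T_3) = S_y^2\left(\theta_{r,N} - \theta_{r,n}\rho^2\right),\\
B(T_1) &\approx \tfrac{1}{2}\theta_{n,N}\,\rho\, S_y C_x, \quad
B(T_2) \approx \tfrac{1}{2}\theta_{r,N}\,\rho\, S_y C_x, \quad
B(T_3) \approx \tfrac{1}{2}\theta_{r,n}\,\rho\, S_y C_x \quad \text{(at the optimal constants).}
\end{align*}
```

Under these assumptions each minimum MSE is no larger than $\theta_{r,N} S_y^2$, the variance of the mean estimator based on the respondents alone, and the reduction grows with $\rho^2$.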

Remark on practicability: a primary problem in the use of the proposed classes of logarithmic-type estimators is the choice of the constants $\alpha$, $\beta$, and $\gamma$. The optimum values of $\alpha$, $\beta$, and $\gamma$ depend on population parameters. These values are seen to be overall stable when surveys are conducted repeatedly (see [30]); however, sometimes, the values remain unknown. In such situations, the following estimators of $\alpha$, $\beta$, and $\gamma$ are suggested:

$$\hat{\alpha} = -\frac{r_{yx}\, s_y}{c_x}, \qquad \hat{\beta} = -\frac{r_{yx}\, s_y}{c_x}, \qquad \hat{\gamma} = \frac{r_{yx}\, s_y}{c_x},$$

where $r_{yx}$ is the sample correlation coefficient between $y$ and $x$, $s_y^2$ is the sample mean square of $y$, and $c_x$ is the sample coefficient of variation of $x$, based on the responding part of the sample of size $r$.

4.3. Implementation in R

In today’s technologically advanced world, most computations are done in some suitable software environment. The R [31] code snippet given in the following can be used to carry out the proposed imputations on a dataset of interest and calculate the values of the corresponding point estimators:

    # Import data of respondents from file (column 1: x, column 2: y)
    dfresp <- read.table(file.choose())
    # Import data of nonrespondents from file (column 1: x)
    dfnonresp <- read.table(file.choose())

    xrbar <- mean(dfresp[, 1])                 # respondent mean of x
    yrbar <- mean(dfresp[, 2])                 # respondent mean of y
    Xbar <- XXX                                # placeholder: specify known value of Xbar here
    rhosamp <- cor(dfresp[, 1], dfresp[, 2])   # sample correlation (cor(), not corr())
    # Standard deviations are used so that cxr and cyr are coefficients of variation
    sxr <- sd(dfresp[, 1])
    syr <- sd(dfresp[, 2])
    cyr <- syr / yrbar
    cxr <- sxr / xrbar
    x <- dfnonresp[, 1]                        # x values of the nonrespondents
    xbarnonresp <- mean(x)
    r <- nrow(dfresp)                          # no. of respondents
    nonresp <- nrow(dfnonresp)                 # no. of nonrespondents
    n <- r + nonresp                           # sample size
    xnbar <- (r * xrbar + nonresp * xbarnonresp) / n
    const <- rhosamp * syr / cxr
    alpha <- -const
    beta <- -const
    gamma <- const

    # Imputation (vectorized over the n - r nonrespondents)
    t1 <- yrbar + alpha * n * x * log(xnbar / Xbar) / ((n - r) * xbarnonresp)
    t2 <- yrbar + beta * n * x * log(xrbar / Xbar) / ((n - r) * xbarnonresp)
    t3 <- yrbar + gamma * n * x * log(xnbar / xrbar) / ((n - r) * xbarnonresp)

    # Point estimation
    est1 <- yrbar + alpha * log(xnbar / Xbar)
    est2 <- yrbar + beta * log(xrbar / Xbar)
    est3 <- yrbar + gamma * log(xnbar / xrbar)

5. Empirical Study

Before an estimator can be used in practical scenarios, its performance must be examined, in terms of its properties. To this end, the biases of the estimators are calculated and the MSEs under optimal conditions are compared with those of the contemporary estimators given in Table 2 within the framework of percentage relative efficiencies (PREs).

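The PRE of a proposed class of logarithmic-type estimators w.r.t. a contemporary estimator, under optimal conditions, is the ratio of the variance or MSE of the contemporary estimator to the minimum MSE of the proposed estimator, expressed as a percentage:

$$\mathrm{PRE} = \frac{M(\text{contemporary estimator})}{\operatorname{Min} M(\text{proposed estimator})} \times 100.$$

A PRE above 100 indicates that the proposed estimator is more efficient. Here, the expressions for the minimum MSEs of the proposed classes of logarithmic-type estimators are given in Equations (16)–(18), while those of the contemporary estimators are given in Table 3.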
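A PRE computation of this kind can be sketched in a few lines of R. The function names `theta` and `pre`, the numerical values of `N`, `n`, `r`, `Sy2`, and `rho`, and the first-order minimum-MSE formulas used below are all assumptions of this example (the formulas follow from a standard first-order expansion under SRSWOR with MCAR response and are not copied from Equations (16)–(18)); the comparison is made against the mean estimator only.

```r
# Hypothetical inputs; the MSE formulas are first-order expressions assumed for illustration.
theta <- function(a, b) 1 / a - 1 / b        # finite-population factor 1/a - 1/b

# PRE of a proposed estimator relative to a contemporary one, in percent
pre <- function(mse_contemporary, mse_proposed) 100 * mse_contemporary / mse_proposed

N <- 5000; n <- 500; r <- 350                # population, sample, and respondent sizes
Sy2 <- 16                                    # population mean square of y
rho <- 0.8                                   # correlation between y and x

var_mean <- theta(r, N) * Sy2                # variance of the respondent-mean estimator
min_mse <- c(
  T1 = Sy2 * (theta(r, N) - theta(n, N) * rho^2),   # uses log(xbar_n / Xbar)
  T2 = Sy2 * theta(r, N) * (1 - rho^2),             # uses log(xbar_r / Xbar)
  T3 = Sy2 * (theta(r, N) - theta(r, n) * rho^2)    # uses log(xbar_n / xbar_r)
)

pre_vs_mean <- pre(var_mean, min_mse)
print(round(pre_vs_mean, 1))
```

Because each assumed minimum MSE subtracts a nonnegative multiple of `rho^2` from the variance of the mean estimator, every PRE in this sketch is at least 100 and increases with the correlation.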

Using R [31], an extensive simulation study has been carried out on sufficiently large fictitious populations to compute the biases and the PREs defined above. Data are generated from three different probability distributions, namely, the normal (continuous), Poisson (discrete), and Gamma (continuous) distributions. A few important properties of the distributions have been tabulated in Table 4. These distributions have been selected because they occur frequently in real-life situations.

Normal distribution has uses in modeling of heights of individuals, test scores of students, blood pressure, daily returns of any particular stock, weights of items produced by a manufacturing process, etc. Poisson distribution can be used to model the probability that a given number of events occur in a specific time interval, for example, the number of insurance claims filed per month, the number of network failures occurring per week, and the number of bulbs manufactured per minute. It also finds use in medical statistics, such as for estimating the number of births that may be expected on a particular night, the number of patients with an infectious disease arriving at a clinic within a given hour, and the number of mutations on a given strand of DNA per time unit. Gamma distribution can be used for modeling wait time, reliability, service time in queuing theory, etc. For example, it can be used to model the amount of rainfall that accumulates in a given reservoir, the flow of items through manufacturing as well as distribution processes, the size of loan defaults, etc. Thus, these three distributions are chosen based on their importance in practical scenarios.

The steps of the simulation are as follows:
(1) The sizes of the population, the sample, and the responding part of the sample are defined. For the purpose of the study, sufficiently large values of $N$, $n$, and $r$ have been chosen.
(2) The parameters of the population are defined. Data are generated from the normal, Gamma, and Poisson distributions, with separate parameter values specified for the study variable and for the auxiliary variable.
(3) Simulation is conducted for various values of the correlation coefficient $\rho$. For the purpose of the study, only positive values of $\rho$, i.e., positively correlated variables, are considered.
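The steps above can be sketched as a small Monte Carlo experiment for the normal case. The sizes `N`, `n`, `r`, the distribution parameters, the value of `rho`, the number of replications, and the MCAR response scheme are hypothetical choices for this example and are not the settings used in the study; the three estimators are computed with the constants estimated from the respondents, as in the R snippet of Section 4.3.

```r
# Illustrative simulation; all numerical settings are assumptions.
set.seed(5)
N <- 5000; n <- 500; r <- 350; rho <- 0.8

# Steps 1-2: generate a correlated bivariate normal population (x kept well above 0)
x_pop <- rnorm(N, mean = 50, sd = 5)
y_pop <- 30 + rho * (4 / 5) * (x_pop - 50) + sqrt(1 - rho^2) * 4 * rnorm(N)
Xbar <- mean(x_pop)                          # known population mean of x
Ybar <- mean(y_pop)                          # target of estimation

one_rep <- function() {
  s <- sample(N, n)                          # SRSWOR sample
  resp <- s[seq_len(r)]                      # sample() order is random, so this is an MCAR subset
  xr <- x_pop[resp]; yr <- y_pop[resp]
  xbar_n <- mean(x_pop[s]); xbar_r <- mean(xr); ybar_r <- mean(yr)
  k <- cor(xr, yr) * sd(yr) / (sd(xr) / xbar_r)   # estimated constant r_yx * s_y / c_x
  c(mean = ybar_r,
    T1 = ybar_r - k * log(xbar_n / Xbar),
    T2 = ybar_r - k * log(xbar_r / Xbar),
    T3 = ybar_r + k * log(xbar_n / xbar_r))
}

# Step 3 would repeat the following for several values of rho
est <- replicate(2000, one_rep())            # 4 x 2000 matrix of estimates
mse <- rowMeans((est - Ybar)^2)              # empirical MSEs
bias <- rowMeans(est) - Ybar                 # empirical biases
pre <- 100 * unname(mse["mean"]) / mse[c("T1", "T2", "T3")]
print(round(pre, 1))
print(signif(bias, 2))
```

Wrapping the population generation and the replication loop in a function of `rho` and calling it over a grid of positive correlations gives one row of a PRE table per value of `rho`.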

The results of the simulation study related to the PREs have been presented in Tables 5–13, while the biases have been presented in Tables 14–16.

6. Application to Real Data

Secondary data have been used to demonstrate the use of the proposed estimators under the SRSWOR sampling scheme. The dataset “Chemical Composition of Ceramic Samples Data Set” has been obtained from the UCI Machine Learning Repository [32] and is used to illustrate the estimation of the population mean in a real-world scenario. The dataset consists of 88 instances of 19 attributes and is concerned with the classification of ceramic samples depending on their chemical composition from energy-dispersive X-ray fluorescence. We use the subset of the dataset where attribute “Part” takes the value “Body.” The two variables considered are the percentage of MgO (wt%) and the percentage of CaO (wt%).

The two variables are seen to have a moderate positive correlation. The resulting PREs are given in Table 17. The MSEs of the proposed estimators and the contemporary estimators have been plotted in Figure 1.

7. Conclusions

The empirical study enables us to study the behavior of the proposed estimators under scenarios involving various values of the parameters. The chief conclusions that follow are given next:
(1) Tables 5–7 show that the proposed classes of logarithmic-type estimators are more efficient than the contemporary estimators when data are generated from the normal distribution.
(2) The PRE of the proposed classes of estimators w.r.t. the contemporary estimators is seen to increase with the value of $\rho$, i.e., the correlation coefficient between the study and the auxiliary variables, as evident from Tables 5–7.
(3) From Tables 8–10, it is observed that the proposed classes of logarithmic-type estimators dominate the contemporary estimators when data are generated from the Gamma distribution.
(4) The proposed estimators perform better than the contemporary estimators in terms of PREs when data are generated from the Poisson distribution, as seen from Tables 11–13.
(5) Tables 14–16 show that the biases of the proposed estimators are negligible when data are generated from the normal, Gamma, and Poisson distributions, respectively.
(6) Table 17 shows that, for the real data used in this manuscript, the proposed classes of logarithmic-type estimators dominate the contemporary estimators when the study and auxiliary variables have a moderate positive correlation. Furthermore, Figure 1 shows graphically that the MSEs of the proposed estimators are less than those of the contemporary estimators.

Hence, the proposed estimators are seen to be consistent, to exist for all real positive values of the characters under study, to have negligible bias, and to be more efficient than six contemporary estimators. They may therefore be recommended for use in field work.

Data Availability

The data used in the study are generated theoretically by the equations given in this paper.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University, Saudi Arabia, for funding this work through Research Groups Program under grant number RGP. 2/110/42.

References

[1] P. Brewerton and L. Millward, Organizational Research Methods, SAGE, London, UK, 2001.
[2] G. H. Brown, “A comparison of sampling methods,” Journal of Marketing, vol. 11, no. 4, pp. 331–337, 1947.
[3] A. Bryman and E. Bell, Business Research Methods, Oxford University Press, Oxford, UK, 2003.
[4] K. Sahu and R. Srivastava, “Needs and importance of reliability prediction: an industrial perspective,” Information Sciences Letters, vol. 9, pp. 33–37, 2020.
[5] M. Mahmoud, M. M. Nassar, and M. A. Aefa, “Parameter estimation for a mixture of inverse chen and inverse compound Rayleigh distributions based on type-II hybrid censoring scheme,” Journal of Statistics Applications & Probability, vol. 10, pp. 467–485, 2021.
[6] S. Kumar, S. Bhougal, V. Sharma, R. Gupta, and J. P. S. Joorel, “Estimating the problem of non-response and measurement error in sample survey,” Journal of Statistics Applications & Probability, vol. 10, pp. 665–675, 2021.
[7] D. F. Heitjan and S. Basu, “Distinguishing “missing at random” and “missing completely at random”,” The American Statistician, vol. 50, no. 3, pp. 207–213, 1996.
[8] D. B. Rubin, “Inference and missing data,” Biometrika, vol. 63, no. 3, pp. 581–592, 1976.
[9] M. H. Hansen and W. N. Hurwitz, “The problem of non-response in sample surveys,” Journal of the American Statistical Association, vol. 41, no. 236, pp. 517–529, 1946.
[10] M. S. Ahmed, O. Al-Titi, Z. Al-Rawi, and W. Abu-Dayyeh, “Estimation of a population mean using different imputation methods,” Statistics in Transition, vol. 7, no. 6, pp. 1247–1264, 2006.
[11] G. Diana and P. Francesco Perri, “Improved estimators of the population mean for missing data,” Communications in Statistics - Theory and Methods, vol. 39, no. 18, pp. 3245–3251, 2010.
[12] H. Liu, Y. Chen, G. Li, W. Xiang, and G. Xu, “Adaptive fuzzy synchronization of fractional-order chaotic (hyperchaotic) systems with input saturation and unknown parameters,” Complexity, vol. 2017, Article ID C, 16 pages, 2017.
[13] H. Liu, S.-G. Li, H.-X. Wang, and G.-J. Li, “Adaptive fuzzy synchronization for a class of fractional-order neural networks,” Chinese Physics B, vol. 26, no. 3, Article ID 030504, 2017.
[14] A. Gupta and C. S. Nazrin, “The main factors of intimate partner violence - a statistical study,” Journal of Statistics Applications & Probability, vol. 10, pp. 103–112, 2021.
[15] M. H. Abu-Moussa, A. M. Abd-Elfattah, and E. H. Hafez, “Estimation of stress-strength parameter for Rayleigh distribution based on progressive type-II censoring,” Information Sciences Letters, vol. 10, pp. 101–110, 2021.
[16] G. Kalton, D. Kasprzyk, and R. Santos, “Issues of nonresponse and imputation in the survey of income and program participation,” in Current Topics in Survey Sampling, pp. 455–480, Academic Press, Cambridge, MA, USA, 1981.
[17] G. Kalton and D. Kasprzyk, “Imputing for missing survey responses,” in Proceedings of the Section on Survey Research Methods, American Statistical Association, Cincinnati, vol. 22, p. 31, August 1982.
[18] G. Kalton and L. Kish, “Some efficient random imputation methods,” Communications in Statistics-Theory and Methods, vol. 13, no. 16, pp. 1919–1939, 1984.
[19] J. K. Kim and J. Shao, Statistical Methods for Handling Incomplete Data, Chapman and Hall/CRC, Boca Raton, FL, USA, 2nd edition, 2021.
[20] A. K. Pandey, M. Usman, and G. N. Singh, “Optimality of ratio and regression type estimators using dual of auxiliary variable under non response,” Alexandria Engineering Journal, vol. 60, no. 5, pp. 4461–4471, 2021.
[21] A. K. Pandey, G. N. Singh, N. Sayed-Ahmed, and H. Abu-Zinadah, “Improved estimators for mean estimation in presence of missing information,” Alexandria Engineering Journal, vol. 60, no. 6, pp. 5977–5990, 2021.
[22] I. G. Sande, “A personal view of hot-deck imputation procedures,” Survey Methodology, vol. 5, no. 2, pp. 238–258, 1979.
[23] A. A. El Sheikh, S. L. Barakat, and S. M. Mohamed, “New aspects on the modified group LASSO using the least angle regression and shrinkage algorithm,” Information Sciences Letters, vol. 10, pp. 527–536, 2021.
[24] M. U. Sohail, J. Shabbir, and F. Sohil, “Imputation of missing values by using raw moments,” Statistics in Transition New Series, vol. 20, no. 1, pp. 21–40, 2019.
[25] G. N. Singh, A. K. Pandey, and A. K. Sharma, “Some improved and alternative imputation methods for finite population mean in presence of missing information,” Communications in Statistics-Theory and Methods, vol. 50, no. 19, pp. 4401–4427, 2020.
[26] S. Singh and S. Horn, “Compromised imputation in survey sampling,” Metrika, vol. 51, no. 3, pp. 267–276, 2000.
[27] S. Singh and B. Deo, “Imputation by power transformation,” Statistical Papers, vol. 44, no. 4, pp. 555–579, 2003.
[28] H. Toutenburg and V. K. Srivastava, “Amputation versus imputation of missing values through ratio method in sample surveys,” Statistical Papers, vol. 49, no. 2, pp. 237–247, 2008.
[29] C. Kadilar and H. Cingi, “Estimators for the population mean in the case of missing data,” Communications in Statistics-Theory and Methods, vol. 37, no. 14, pp. 2226–2236, 2008.
[30] V. N. Reddy, “A study on the use of prior knowledge on certain population parameters in estimation,” Sankhya C, vol. 40, pp. 29–37, 1978.
[31] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2018, https://www.R-project.org/.
[32] D. Dua and C. Graff, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, USA, 2019, http://archive.ics.uci.edu/ml.

Copyright

Copyright © 2021 Awadhesh K. Pandey et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
