Abstract

In this study, a new exponential-cum-sine-type hybrid imputation technique is proposed for handling missing data in sample surveys. The properties of the corresponding point estimator of the population mean are examined in terms of bias and mean squared error. An extensive simulation study using data generated from normal, Poisson, and Gamma distributions is conducted to compare the proposed estimator with several contemporary estimators. The results are summarized, followed by a discussion of real-life applications of the estimator.

1. Introduction

Measuring an entire population is impractical for any realistic project because of budgetary, time, or other constraints, which makes sampling indispensable in every field of study [1–12]. Acceptance sampling has long been widely applied in manufacturing and other industrial processes. Sampling can also be applied to obtain vital information on the chief characteristics of items ranging from electrical appliances to machine parts such as screws and bolts, automobiles, and computer parts such as chips. In addition, many environmental problems involve physical, geographical, economic, and other characteristics which need to be estimated prior to data analysis, model formulation, and prediction. Studies of the annual rainfall in a flood-prone area, the quality of drinking water near an industrial zone, the soil quality of agricultural land, etc. are some instances where estimation of the mean, median, variance, and other statistics is essential. Such information can be collected via sample surveys [4, 6, 7, 9, 13].

Missing data is a universal occurrence in sample surveys, leading to a decline in data quality and complications in making inferences. It is pivotal for survey statisticians to factor in the stochastic nature of incomplete data. This raises the question of which assumptions have to be made, or which techniques employed, to decide whether the missingness mechanism can be ignored. The mechanisms of missing data have been studied in detail in [9, 13], among others. Three missing data mechanisms are mostly of interest in the survey literature, namely, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR occurs when the missingness is unrelated to any variable, observed or unobserved, i.e., data are missing purely by chance; MAR occurs when the missingness does not depend on the variable under study (which may be unobserved) but on other variables that are fully observed; and MNAR occurs when the missingness depends on the variable under study itself.
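The three mechanisms can be illustrated on simulated data. The following R sketch is only an illustration: the variable names, the linear model linking the study and auxiliary variables, and the logistic response models are all assumptions made here, not part of the study.

```r
# Sketch: generating the three missingness mechanisms on simulated data.
# All names and the logistic response models are illustrative assumptions.
set.seed(1)
n <- 1000
x <- rnorm(n)                   # fully observed auxiliary variable
y <- 2 + 0.8 * x + rnorm(n)     # study variable, subject to missingness

# MCAR: every unit is missing with the same probability.
miss_mcar <- runif(n) < 0.3

# MAR: the probability of missingness depends only on the observed x.
miss_mar <- runif(n) < plogis(-1 + 1.5 * x)

# MNAR: the probability of missingness depends on y itself.
miss_mnar <- runif(n) < plogis(-1 + 1.5 * (y - 2))

# Respondent means of y under each mechanism versus the full-sample mean.
resp_means <- c(
  full = mean(y),
  mcar = mean(y[!miss_mcar]),
  mar  = mean(y[!miss_mar]),
  mnar = mean(y[!miss_mnar])
)
print(round(resp_means, 3))
```

Comparing the respondent means with the full-sample mean shows how far the naive mean of the observed values drifts under each mechanism.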

Numerous statistical methods have been devised over the years to overcome the problem of missing data. Subsampling of nonrespondents in surveys via mail questionnaire was pioneered in [8]. Another commonly used method is imputation, in which the missing values are filled in by a suitable function of the available values to ensure the structural completeness of the sample before analysis begins. Popular imputation techniques include mean imputation, regression imputation, hot deck imputation, cold deck imputation, and nearest neighbor imputation. Imputation techniques in the survey literature have been discussed in [3, 5, 14–21], among others. Some recent works on imputation and estimation of the population mean are [22–29], among others.

Information from an auxiliary variable can be utilized to provide an improved estimate for population characteristics. Such information may be readily available as secondary data from previous surveys or census or may be collected during the survey procedure at little to no additional cost. Some examples of such auxiliary information include the lifetime of a previous batch of bulbs when studying the life of a current lot of bulbs, the speed of cars when studying the mileage of cars, etc.

In this manuscript, a new exponential-cum-sine-type hybrid imputation technique and corresponding point estimator have been proposed for estimation of population mean. Motivation for this estimator, its properties, and its uses have been discussed in the subsequent sections. The manuscript is henceforth divided into the following sections: Section 2 introduces the sample structure and notations used in the manuscript. Section 3 discusses some conventional estimators of population mean. Section 4 discusses the proposed estimator, including its existence, consistency, properties, and implementation in R. The simulation study has been presented in Section 5, the results and discussion in Section 6, and the conclusions in Section 7.

2. Sample Structure and Notations Used

Let the character of interest be denoted by $y$. We consider the scenario in which complete information on a correlated auxiliary variable $x$ is available to the survey statisticians and its population mean is known.

The sample structure and the notations used henceforth have been introduced in Table 1.

3. Some Conventional Estimators

Before the proposed estimator is introduced, it is important to examine some existing estimators for population mean and study their strengths and limitations. A few such estimators have been discussed in this section.

The mean estimator is a simple and traditional estimator, which uses the average of the responses to estimate the population mean. The ratio estimator tries to improve on the mean estimator by incorporating auxiliary information from a correlated variable. Various other estimators that make innovative use of auxiliary information have been proposed, for instance, the estimator proposed in [30], regression-type estimators proposed in [10], and exponential-type estimators in [31], among others.
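As a minimal sketch, the two simplest estimators can be computed as follows in R. The function names and the toy data are hypothetical, and the ratio form used here (respondent mean of y scaled by the ratio of the sample mean to the respondent mean of x) is the commonly cited ratio-type form, assumed here rather than taken from Table 2.

```r
# Sketch: mean and ratio-type estimators of the population mean of y
# when only r of the n sampled units respond.
# Function names and the toy data are hypothetical.
mean_estimator <- function(y_resp) mean(y_resp)

ratio_estimator <- function(y_resp, x_resp, x_sample) {
  # Assumed form: ybar_r * xbar_n / xbar_r
  mean(y_resp) * mean(x_sample) / mean(x_resp)
}

x_sample <- c(10, 12, 9, 11, 14, 13)   # auxiliary values for all n = 6 units
resp     <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
y_resp   <- c(20, 25, 21, 27)          # study values for the 4 respondents

est_mean  <- mean_estimator(y_resp)
est_ratio <- ratio_estimator(y_resp, x_sample[resp], x_sample)
print(c(mean = est_mean, ratio = est_ratio))
```

The ratio form adjusts the respondent mean according to how the respondents' auxiliary values compare with those of the whole sample.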

The structures of some of these estimators have been given in Table 2, while the expressions for their respective variances (V) or mean square errors (MSEs) have been given in Table 3.

It is to be noted that most conventional estimators make use of simple functional forms, such as linear combinations, exponential functions, and chains. Combinations of several mathematical functions are rarely seen, which can be attributed to the computational limitations historically associated with such functions. However, with the advent of supercomputers and improvements in computational power, such obstructions have been eliminated. It is worth exploring whether combinations of mathematical functions produce better estimates than traditional estimators. This has been the motivation behind the construction of the proposed estimator.

Two such functions have been used, namely, the exponential and sine functions. These functions were selected based on their use in real-life situations. The exponential function is usually used to model growth and decay observed in nature, such as the growth and decay of microorganisms like bacteria, human population, the spread of pandemics, and compound interest. The sine function is commonly used to model natural phenomena that are periodic, such as sound waves, light waves, tides, sunlight intensity, and average temperature variations through the year, as well as ballistic trajectories, electrical currents, and GPS locations.

4. Formulation of the Proposed Estimator

Let $y_i$ and $x_i$ be the values of $y$ and $x$, respectively, for the $i$th unit in the population. The following imputation method may be suggested to deal with the problem of missing data:

The point estimator under an imputation method is given in

Using equation (2), under the imputation outlined in equation (1), the expression for the point estimator of is obtained as
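A form consistent with the R snippet in Section 4.3 is sketched below. The notation is an assumption: $\bar{y}_r$ and $\bar{x}_r$ denote the respondent means of $y$ and $x$, $\bar{x}_n$ denotes the mean of $x$ over all $n$ sampled units, and $T$ is a label introduced here for the estimator.

```latex
\[
T \;=\; \bar{y}_r \,
\exp\!\left(
  \frac{\sin \bar{x}_n - \sin \bar{x}_r}
       {1 + \sin \bar{x}_n + \sin \bar{x}_r}
\right)
\]
```

Under full response, $\bar{x}_n = \bar{x}_r$, so the exponent vanishes and $T$ reduces to $\bar{y}_r$.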

4.1. Existence and Consistency of the Estimator

It is important to specify the domain of values for which an estimator exists, so that survey statisticians or those working in the field can determine whether an estimator can be reasonably used in a practical scenario.

The given estimator consists of two major functions: the trigonometric function $\sin(\cdot)$ and the exponential function $\exp(\cdot)$. Both $\sin(\cdot)$ and $\exp(\cdot)$ exist on $\mathbb{R}$, so the estimator exists on $\mathbb{R}$.

Hence, the proposed estimator can be used for all real values of the characters under study. For real-world scenarios, most, if not all, characters of interest take only real values. For example, measurements such as length, breadth, height, weight, diameter, currencies, and number of an item do not take nonreal values. Hence, the proposed estimator can be used in all practical scenarios.

It is to be noted that the structure of the estimator is consistent for large sample approximations. As , , , , and . Hence, .

4.2. Properties of the Proposed Estimator

The “goodness” of an estimator can be measured in terms of various properties. Two such properties, namely, bias and mean squared error (MSE), have been explored here. The bias gives an idea about the expected deviation from the true value of a parameter, while MSE deals with the degree of spread. The expressions for the same have been derived under large sample assumptions up to the first order of approximations. Some transformations involving error terms have been used for the purpose, indicated as follows:

The error terms have the following expectations:

To obtain the expressions for bias and MSE, in the first step, algebraic expansion of the expression of the estimator given in equation (3) is done, using the following Taylor’s series:(1)(2)(3)
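For reference, the standard series that an expansion of an exponential-cum-sine ratio form would typically rely on are recalled below. Which three series the derivation actually uses is an assumption; the expansions themselves are the usual Maclaurin series in a generic argument $z$.

```latex
\begin{align*}
e^{z}      &= 1 + z + \frac{z^{2}}{2!} + \frac{z^{3}}{3!} + \cdots, \\
\sin z     &= z - \frac{z^{3}}{3!} + \frac{z^{5}}{5!} - \cdots, \\
(1+z)^{-1} &= 1 - z + z^{2} - z^{3} + \cdots, \qquad |z| < 1.
\end{align*}
```

Under a first-order approximation, terms beyond the second degree in the error terms are dropped after substitution.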

The estimator takes the following form:

In the second step, the transformations in equation (4) are applied to equation (6) to obtain the following form of the estimator:

Hence, .

Expectations taken on both sides and use of the expected values of , yield the expectations for bias and MSE , obtained up to the first order of approximations of the estimators , as follows:where .

4.3. Implementation in R

In the current day and age, most computations are carried out using a suitable software environment. The following R [32] code snippet has been developed to carry out the proposed imputation on a data set of interest and calculate the value of the corresponding point estimator:

    # Import data of respondents from file (column 1: x, column 2: y)
    dfresp <- read.table(file.choose())
    # Import data of nonrespondents from file (column 1: x)
    dfnonresp <- read.table(file.choose())

    xrbar <- mean(dfresp[, 1])
    yrbar <- mean(dfresp[, 2])
    xbarnonresp <- mean(dfnonresp[, 1])
    r <- nrow(dfresp)            # no. of respondents
    nonresp <- nrow(dfnonresp)   # no. of nonrespondents
    n <- r + nonresp             # sample size
    xnbar <- (r * xrbar + nonresp * xbarnonresp) / n
    num <- sin(xnbar) - sin(xrbar)
    den <- 1 + sin(xnbar) + sin(xrbar)

    # Imputation
    # Assumption: x is taken to be the auxiliary values of the nonrespondents.
    x <- dfnonresp[, 1]
    t <- numeric(n - r)
    for (i in seq_len(n - r)) {
      t[i] <- n / (n - r) * x[i] * exp(num / den) - r / (n - r) * yrbar
    }

    # Point estimation
    est <- yrbar * exp(num / den)
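For repeated use, the point-estimation step of the snippet can be wrapped in a function that takes vectors instead of files. This is a sketch only: the function name `exp_sine_estimator`, its argument names, and the toy data are hypothetical.

```r
# Sketch: the point-estimation step of the snippet as a function.
# exp_sine_estimator and its argument names are hypothetical.
exp_sine_estimator <- function(y_resp, x_resp, x_nonresp) {
  r <- length(y_resp)
  stopifnot(length(x_resp) == r, r > 0)
  n <- r + length(x_nonresp)
  xrbar <- mean(x_resp)
  xnbar <- (sum(x_resp) + sum(x_nonresp)) / n   # mean of x over the whole sample
  num <- sin(xnbar) - sin(xrbar)
  den <- 1 + sin(xnbar) + sin(xrbar)
  mean(y_resp) * exp(num / den)
}

# Toy data, made up for illustration only.
x_resp    <- c(0.8, 1.1, 0.9, 1.3, 1.0)
y_resp    <- c(2.1, 2.6, 2.2, 3.0, 2.4)
x_nonresp <- c(1.2, 0.7, 1.4)

est <- exp_sine_estimator(y_resp, x_resp, x_nonresp)
print(est)
```

When there are no nonrespondents, the exponent is zero and the function returns the plain respondent mean, which is a convenient sanity check.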

5. Simulation Study

Before an estimator can be used in practical scenarios, its performance must be examined, in terms of its properties. To this end, the bias of the estimator is calculated and the MSE is compared with that of the contemporary estimators given in Table 2 in terms of percentage relative efficiencies (PREs).

The PREs of the estimator with respect to the contemporary estimators are defined as follows:where the expression for the MSE of the proposed estimator is given in equation (9), while that of the contemporary estimators is given in Table 3.
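A PRE computation can be sketched in R as follows, assuming the usual definition: the MSE (or variance) of the competing estimator divided by the MSE of the proposed estimator, multiplied by 100. The function name `pre` and the MSE values are hypothetical.

```r
# Sketch: percentage relative efficiency of a proposed estimator
# relative to a competitor, assuming
#   PRE = 100 * MSE(competitor) / MSE(proposed).
pre <- function(mse_competitor, mse_proposed) {
  stopifnot(all(mse_competitor > 0), all(mse_proposed > 0))
  100 * mse_competitor / mse_proposed
}

# Hypothetical MSE values, for illustration only.
pre_values <- pre(mse_competitor = c(0.50, 0.40), mse_proposed = 0.25)
print(pre_values)
```

Under this definition, a PRE above 100 indicates that the proposed estimator has the smaller MSE.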

Using R [32], an extensive simulation study has been carried out on sufficiently large fictitious populations to compute the bias and the PREs defined above. Data is generated from three different probability distributions, namely, normal and Gamma distributions (continuous distributions) and Poisson distribution (discrete distribution). Some important properties of the distributions have been summarized in Table 4. Such distributions are chosen based on their occurrence in real-life situations.

Normally distributed data are ubiquitous in nature. The normal distribution can be used to model heights of individuals, test scores of students, blood pressure, daily returns of a particular stock, weights of items produced by a manufacturing process, etc. The Poisson distribution can be used to model the probability that a given number of events occur in a specific time interval, for example, the number of insurance claims filed per month, the number of network failures occurring per week, and the number of bulbs manufactured per minute. It is also used by medical statisticians, for instance, to estimate the number of births that may be expected on a particular night, the number of patients with an infectious disease arriving at a clinic within a given hour, the number of mutations on a given strand of DNA per time unit, etc. The Gamma distribution can be used for modeling wait time, reliability, service time in queuing theory, etc. For example, it can be used to model the amount of rainfall that accumulates in a given reservoir, the flow of items through manufacturing and distribution processes, the size of loan defaults, etc. Thus, these three distributions are chosen based on their importance in practical scenarios.

It is seen through trial and error that the estimator performs well when and take small values and the variation in is greater than that in .

The steps of the simulation are as follows:
(1) The sizes of the population, the sample, and the responding part of the sample are defined. For the purpose of the study, sufficiently large values of $N$, $n$, and $r$ have been chosen.
(2) The parameters of the population are defined.
(3) Simulation is conducted for various values of the correlation coefficient $\rho$. For the purpose of the study, only positive values of $\rho$ are considered; i.e., a positively correlated auxiliary variable is considered.
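The steps above can be sketched as a small Monte Carlo experiment in R. All sizes, parameter values, the bivariate normal population, and the random (MCAR) choice of respondents are illustrative assumptions, not the settings used in the study. The estimator form is the one computed in the snippet of Section 4.3.

```r
# Sketch of a Monte Carlo version of the simulation steps.
# All sizes, parameters, and names are illustrative assumptions.
set.seed(123)

# Step 1: sizes of the population, the sample, and the responding part.
N <- 5000; n <- 500; r <- 350

# Step 2: population parameters (arbitrary) and a bivariate normal population.
rho <- 0.7                                   # Step 3 would loop over several rho
x_pop <- rnorm(N, mean = 1, sd = 0.2)
y_pop <- 2 + (rho * 0.5 / 0.2) * (x_pop - 1) +
  rnorm(N, sd = 0.5 * sqrt(1 - rho^2))
Ybar <- mean(y_pop)

one_draw <- function() {
  s <- sample.int(N, n)                      # simple random sample without replacement
  resp <- s[sample.int(n, r)]                # respondents chosen at random
  xnbar <- mean(x_pop[s])
  xrbar <- mean(x_pop[resp])
  yrbar <- mean(y_pop[resp])
  num <- sin(xnbar) - sin(xrbar)
  den <- 1 + sin(xnbar) + sin(xrbar)
  c(mean = yrbar, proposed = yrbar * exp(num / den))
}

draws <- replicate(2000, one_draw())         # 2 x 2000 matrix of estimates
bias <- rowMeans(draws) - Ybar               # empirical bias of each estimator
mse  <- rowMeans((draws - Ybar)^2)           # empirical MSE of each estimator
pre_vs_mean <- 100 * mse[["mean"]] / mse[["proposed"]]
print(round(c(bias, PRE = pre_vs_mean), 5))
```

Repeating the experiment over a grid of correlation values and response rates would give empirical counterparts of the bias and PRE tables.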

The results of the simulation study related to the PREs have been presented in Tables 5–11, while the biases have been presented in Table 12.

6. Results and Discussion

The simulation study enables us to study the behavior of the proposed estimator under various scenarios involving various values of parameters. The chief conclusions are as follows:(1)From the values of in Table 5, it is seen that the proposed estimator is more efficient than for all values of for normal data and for for Gamma and Poisson data for the various values of response rates.(2)It is seen that the proposed estimator performs better than for all values of for normal and Gamma data and for for Poisson data for the various values of response rates from the values of in Table 6.(3)From the values of in Table 7, it is seen that the proposed estimator dominates for all values of for normal data and for for Gamma and Poisson data for the various values of response rates.(4)The values of in Table 8 show that the proposed estimator is more efficient than for all values of for normal data and for for Gamma and Poisson data for the various values of response rates.(5)In Table 9, the values of show that the proposed estimator performs better than for all values of and for the various values of response rates for normal, Gamma, and Poisson data.(6)From the values of in Table 10, it is seen that the proposed estimator dominates for all values of and for the various values of response rates for normal, Gamma, and Poisson data.(7)It is seen that the proposed estimator is more efficient than for all values of and for the various values of response rates for normal, Gamma, and Poisson data from the values of in Table 11.(8)From Table 12, it is seen that the estimator is negatively biased. The bias is negligible, being of the order and for various values of the parameter and for various response rates, and hence, bias correction is not needed.

7. Conclusion

The following trend in the PREs is noticed from the tables: increases with the increase in value of , while decrease with the increase in value of .

The proposed estimator is seen to be consistent, to exist for all real values of parameters, to have negligible bias, and to be more efficient than seven contemporary estimators in the simulation settings considered. Hence, the proposed estimator may be recommended for use in field work.

Data Availability

The data used in the study are generated theoretically by the equations given in this paper.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Taif University Researchers Supporting Project number TURSP-2020/318, Taif University, Taif, Saudi Arabia.