Abstract

Because of its impact on health and quality of life, ozone pollution in Thailand has become a major concern among public health investigators. Saraburi Province is one of the areas with high air pollution levels in Thailand because it is an important industrialized area. The August 2018 Pollution Control Department (PCD) report contained missing values in the ozone concentrations for Saraburi Province. Missing data can significantly affect the data analysis process and must be handled properly before standard statistical techniques are applied. In the presence of missing data, we focus on estimating the ozone mean using an improved compromised imputation method based on the chain ratio-type exponential technique. Expressions for the bias and mean square error (MSE) of the estimator obtained from the proposed imputation method are derived by the Taylor series method. The proposed estimator is compared theoretically with existing estimators on the basis of their MSEs. In this case study, the percent relative efficiencies indicate that the proposed estimator is the best under certain conditions, and it is then applied to estimate the ozone mean for Saraburi Province in August 2018.

1. Introduction

Air pollution is a global problem with negative effects on both the environment and human health. Many researchers have found that air pollution is associated with mortality and morbidity from lung cancer, respiratory and cardiovascular diseases, and exacerbation of chronic respiratory conditions [1, 2]. Moulton and Yang [3] have shown that air pollution is correlated with Alzheimer’s disease and other neurodegenerative disorders. According to the World Health Organization’s report in 2018, air pollution caused approximately 4.2 million deaths [4]. Children are more vulnerable than adults because their lungs, hearts, and brains are still growing. Air pollution is therefore a major public health concern, and monitoring and measuring air quality is critical.

The Pollution Control Department (PCD) in Thailand is the agency that measures the amount of air pollutants such as sulphur dioxide (SO₂), carbon monoxide (CO), nitrogen oxides (NOₓ), particulate matter (PM), and ozone (O₃). As shown in the PCD’s Air Quality Management Division report in 2015,  and  concentrations were higher than standard levels in almost every province [5]. There is also a relationship between  and  [6, 7]. Naphralan subdistrict, Chaloem Phra Kiat district, is an industrialized area in Saraburi Province with traffic congestion and several stone and cement factories. According to the information from the Saraburi-based ground monitoring stations of the PCD in August 2018 [8], some O₃ concentration data were missing because of equipment malfunction or errors in measurement. Missing values can significantly affect the data analysis process, so they must be handled properly before standard statistical techniques are applied. In environmental research, a number of techniques can be employed to impute missing values in air pollutant concentration data, such as the mean top bottom method, the mean imputation method, the multiple regression method, and artificial neural network models [9–13].

In sample surveys, missing values or nonresponses often occur. There are two types of nonresponse: item nonresponse and unit nonresponse. Imputation is used to handle item nonresponse, and weighting is applied to deal with unit nonresponse. Imputation, which uses the available data to replace missing values, is the most common way to handle missing data.

Many researchers have also studied the use of auxiliary information to improve the precision of population mean estimation under simple random sampling without replacement (SRSWOR). For example, Cochran [14] applied auxiliary information at the estimation stage and proposed an estimator of the population mean. Bahl and Tuteja [15] first proposed a ratio-type exponential method for estimating the population mean using information on an auxiliary variable; their method is more efficient than the common mean and ratio methods. Later, Singh and Pal [16] proposed a chain ratio-ratio-type exponential technique, which is more efficient than the common estimators, including the mean, ratio, and ratio-type exponential estimators, under certain conditions. Their estimator is built from $\bar{y}$, the sample mean of the variable of interest $y$, and from $\bar{X}$ and $\bar{x}$, the population mean and sample mean of the auxiliary variable $x$, respectively.
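For reference, the chain ratio-ratio-type exponential estimator is commonly written in the form below, where $\bar{y}$ is the sample mean of the variable of interest and $\bar{X}$ and $\bar{x}$ are the population and sample means of the auxiliary variable. The label $t_{RRe}$ is an assumption introduced here for illustration, and the notation may differ from that of [16].

```latex
t_{RRe} \;=\; \bar{y}\,\frac{\bar{X}}{\bar{x}}\,
\exp\!\left(\frac{\bar{X}-\bar{x}}{\bar{X}+\bar{x}}\right)
```

The first factor is the usual ratio adjustment and the exponential factor is the ratio-type exponential adjustment, which is why the estimator is described as a chain of the two.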

Similarly, Lee et al. [17] applied auxiliary information for the purpose of imputation. Singh and Horn [18] suggested a compromised imputation method to estimate the population mean. In this method, $y_i$ and $x_i$ are the observed values of $y$ and $x$ for the $i$th unit, $R$ and $R^c$ are the sets of responding and nonresponding units, respectively, $n$ and $r$ are the sizes of the sample and of the response data, and $\alpha$ is a chosen constant.

Under this imputation method, the point estimator of the population mean combines $\bar{y}_r$, $\bar{x}_n$, and $\bar{x}_r$, which are the response mean of the variable of interest $y$, the sample mean of the auxiliary variable $x$, and the response mean of the auxiliary variable $x$, respectively.
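The compromised imputation idea can be sketched in a few lines of Python. The form used here (a weight `alpha` on the rescaled observed value plus a weight `1 - alpha` on a ratio prediction) is the usual textbook statement of the method and is an assumption about the notation of [18]; the function names and the toy data are hypothetical.

```python
import numpy as np

def compromised_impute(y, x, alpha):
    """Compromised imputation (assumed form).

    y: values of the study variable, np.nan marks nonresponse.
    x: fully observed auxiliary values for the same sample.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    resp = ~np.isnan(y)                      # responding set R
    n, r = y.size, int(resp.sum())
    b_hat = y[resp].sum() / x[resp].sum()    # ratio of response totals
    y_imp = (1 - alpha) * b_hat * x          # ratio part, applied to every unit
    y_imp[resp] += alpha * (n / r) * y[resp] # extra term for respondents only
    return y_imp

def compromised_estimator(y, x, alpha):
    """Point estimator implied by the imputed data (assumed form)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    resp = ~np.isnan(y)
    y_r, x_r, x_n = y[resp].mean(), x[resp].mean(), x.mean()
    return alpha * y_r + (1 - alpha) * y_r * x_n / x_r

# Hypothetical toy sample: two of five y-values are missing.
y_toy = np.array([10.0, np.nan, 14.0, 12.0, np.nan])
x_toy = np.array([5.0, 6.0, 7.0, 6.0, 8.0])
```

Averaging the imputed values over the whole sample reproduces the point estimator, which is a quick consistency check on any implementation of the method.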

The bias and mean square error of this estimator depend on $C_y$ and $C_x$, the population coefficients of variation of $y$ and $x$, respectively; on $\rho$, the population correlation coefficient between $y$ and $x$; and on $N$, $n$, and $r$, the sizes of the population, sample, and response data, respectively. Their research showed that the performance of the estimator depends on a suitably chosen constant $\alpha$.
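To first order, these quantities are usually stated as follows. This is a sketch under assumed notation: $\bar{y}_{COMP}$ is a hypothetical label for the compromised estimator, $\bar{Y}$ is the population mean of $y$, and the tag (5) follows the later in-text reference to the MSE of this estimator.

```latex
\begin{align}
\bar{y}_{COMP} &= \alpha\,\bar{y}_r + (1-\alpha)\,\bar{y}_r\,\frac{\bar{x}_n}{\bar{x}_r}, \notag\\
\operatorname{Bias}(\bar{y}_{COMP}) &\approx (1-\alpha)\left(\frac{1}{r}-\frac{1}{n}\right)\bar{Y}
  \left(C_x^2-\rho\,C_x C_y\right), \notag\\
\operatorname{MSE}(\bar{y}_{COMP}) &\approx \bar{Y}^2\left[\left(\frac{1}{r}-\frac{1}{N}\right)C_y^2
  +\left(\frac{1}{r}-\frac{1}{n}\right)\left((1-\alpha)^2C_x^2-2(1-\alpha)\rho\,C_xC_y\right)\right]. \tag{5}
\end{align}
```

Setting $\alpha = 1$ recovers the response mean and $\alpha = 0$ recovers the ratio estimator, so the constant interpolates between the two.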

In this study, we aim to use imputation to estimate the population mean when nonresponse occurs in the variable of interest only. We propose to improve the compromised imputation method by using the chain ratio-type exponential technique and its corresponding estimator. The bias and the mean square error are obtained to the first degree of approximation using the Taylor series method. The efficiency of the proposed estimator is compared with that of some existing estimators on the basis of MSE in order to obtain the conditions under which the proposed estimator should be applied. In this case study, we use the percent relative efficiency (PRE) as an indicator of the performance of each estimator. The best estimator is then applied to estimate the population mean of O₃, the variable of interest, using the concentration data of the auxiliary pollutant as the auxiliary variable, for the Saraburi Province data in August 2018.

2. Materials and Methods

2.1. Basic Setup Framework

Let $U$ be a finite population of size $N$, let $y_i$ be the value of the variable of interest $y$, and let $x_i$ be the value of the auxiliary variable $x$ for the $i$th unit. Let $\bar{Y}$ and $\bar{X}$ be the population means of $y$ and $x$, respectively; both are unknown. Let $R$ and $R^c$ be the sets of responding and nonresponding units, respectively. The value of $y_i$ is observed for every $i \in R$, whereas it is missing for every $i \in R^c$ and is replaced by an imputed value. Under the SRSWOR scheme, a sample $s$ of size $n$ with the paired variables $(x_i, y_i)$ is selected from $U$ and contains both responding and nonresponding units. Let $\bar{x}_n$, $\bar{x}_r$, and $\bar{y}_r$ be the sample mean of $x$ and the response means of $x$ and $y$, respectively.
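The setup above can be illustrated with a small simulation. Everything in this sketch is hypothetical: the synthetic population, the sample size, and the number of nonrespondents are chosen only to show how the sample mean of $x$ and the response means of $x$ and $y$ are obtained from an SRSWOR sample in which some $y$-values are missing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic population (not the PCD data).
N, n = 744, 100
x_pop = rng.gamma(5.0, 4.0, size=N)
y_pop = 30.0 - 0.5 * x_pop + rng.normal(0.0, 3.0, size=N)

# SRSWOR: draw n distinct units.
sample_idx = rng.choice(N, size=n, replace=False)
x_s = x_pop[sample_idx]
y_s = y_pop[sample_idx].copy()

# Nonresponse in y only: six sampled units lose their y-value.
nonresp = rng.choice(n, size=6, replace=False)
y_s[nonresp] = np.nan

resp = ~np.isnan(y_s)          # indicator of the responding set R
r = int(resp.sum())            # number of respondents
x_bar_n = x_s.mean()           # sample mean of x (all n units)
x_bar_r = x_s[resp].mean()     # response mean of x
y_bar_r = y_s[resp].mean()     # response mean of y
```

Only `y_s` contains missing values; `x_s` stays complete, which is what allows the auxiliary variable to be used for imputation.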

2.2. Existing Imputation Methods and Corresponding Estimation

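Under the mean imputation method, each observed value of $y$ is kept unchanged, and each missing value is replaced by the response mean $\bar{y}_r$.

Under this imputation method, the point estimator of the population mean becomes the response mean $\bar{y}_r$.

This estimator is unbiased, and its variance, equation (9), depends on the response size $r$, the population size $N$, and the population variance of $y$.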
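In standard notation, mean imputation and its estimator can be sketched as follows. The labels $y_{\cdot i}$ for the data after imputation and $\bar{y}_M$ for the estimator are assumptions made for illustration; the tag (9) follows the later in-text reference to this variance.

```latex
\begin{align}
y_{\cdot i} &= \begin{cases} y_i, & i \in R,\\[2pt] \bar{y}_r, & i \in R^c,\end{cases} \notag\\
\bar{y}_M &= \frac{1}{n}\sum_{i \in s} y_{\cdot i} = \bar{y}_r, \notag\\
\operatorname{Var}(\bar{y}_M) &= \left(\frac{1}{r}-\frac{1}{N}\right)S_y^2,
\qquad S_y^2=\frac{1}{N-1}\sum_{i=1}^{N}\left(y_i-\bar{Y}\right)^2. \tag{9}
\end{align}
```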

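Under the ratio imputation method, each observed value of $y$ is kept unchanged, and each missing value is replaced by $\hat{b}x_i$, where $\hat{b}$ is the ratio of the response total of $y$ to the response total of $x$.

Under this imputation method, the point estimator of the population mean becomes the ratio-type estimator $\bar{y}_r\bar{x}_n/\bar{x}_r$, where $\bar{x}_n$, $\bar{x}_r$, and $\bar{y}_r$ are the sample mean of $x$ and the response means of $x$ and $y$, respectively.

The bias and mean square error of this estimator, the latter being equation (13), are expressed in terms of the sizes $N$, $n$, and $r$, the coefficients of variation of $y$ and $x$, and the correlation coefficient between $y$ and $x$.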
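A first-order statement of these results, under assumed notation, is given below. The label $\bar{y}_{RAT}$ is hypothetical, $C_y$ and $C_x$ are the population coefficients of variation, $\rho$ is the population correlation coefficient, and the tag (13) follows the later in-text reference to this MSE.

```latex
\begin{align}
y_{\cdot i} &= \begin{cases} y_i, & i \in R,\\[2pt] \hat{b}\,x_i, & i \in R^c,\end{cases}
\qquad \hat{b}=\frac{\sum_{i\in R}y_i}{\sum_{i\in R}x_i}, \notag\\
\bar{y}_{RAT} &= \bar{y}_r\,\frac{\bar{x}_n}{\bar{x}_r}, \notag\\
\operatorname{Bias}(\bar{y}_{RAT}) &\approx \left(\frac{1}{r}-\frac{1}{n}\right)\bar{Y}\left(C_x^2-\rho\,C_xC_y\right), \notag\\
\operatorname{MSE}(\bar{y}_{RAT}) &\approx \bar{Y}^2\left[\left(\frac{1}{r}-\frac{1}{N}\right)C_y^2
 +\left(\frac{1}{r}-\frac{1}{n}\right)\left(C_x^2-2\rho\,C_xC_y\right)\right]. \tag{13}
\end{align}
```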

2.3. Proposed Compromised Imputation Method and Corresponding Estimator

Motivated by Singh and Pal [16] and Singh and Horn [18], we propose a new compromised imputation method based on the chain ratio-type exponential estimator. In the data after imputation, $\alpha$ denotes a suitably chosen constant such that the MSE of the resulting estimator is minimized, and $\bar{x}_n$, $\bar{x}_r$, and $\bar{y}_r$ are the sample mean of $x$ and the response means of $x$ and $y$, respectively.

Under the proposed imputation method, the point estimator of the population mean is given as follows:
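A form of the estimator that is consistent with the description in this section is sketched below. The label $\bar{y}_P$ is an assumption, and the expression is inferred from the two ingredients named in the text (the compromised weighting by $\alpha$ and the chain ratio-type exponential adjustment, with the sample mean $\bar{x}_n$ taking the place of the unknown population mean of $x$); the tag (15) follows the later in-text references.

```latex
\begin{equation}
\bar{y}_P \;=\; \alpha\,\bar{y}_r \;+\; (1-\alpha)\,\bar{y}_r\,\frac{\bar{x}_n}{\bar{x}_r}\,
\exp\!\left(\frac{\bar{x}_n-\bar{x}_r}{\bar{x}_n+\bar{x}_r}\right) \tag{15}
\end{equation}
```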

Note that if the constant $\alpha = 1$, the proposed estimator reduces to the response mean $\bar{y}_r$, and if $\alpha = 0$, it is the analogue of the estimator of the population mean proposed by Singh and Pal [16].
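These two special cases are easy to check numerically. The sketch below assumes the estimator has the compromised chain ratio-type exponential form described in this section; the function name and the toy data are hypothetical.

```python
import numpy as np

def chain_ratio_exp_estimator(y, x, alpha):
    """Compromised chain ratio-type exponential estimator (assumed form).

    y: study variable with np.nan for nonrespondents.
    x: fully observed auxiliary variable.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    resp = ~np.isnan(y)
    y_r = y[resp].mean()       # response mean of y
    x_r = x[resp].mean()       # response mean of x
    x_n = x.mean()             # sample mean of x
    # Chain of a ratio adjustment and a ratio-type exponential adjustment.
    chain = (x_n / x_r) * np.exp((x_n - x_r) / (x_n + x_r))
    return alpha * y_r + (1 - alpha) * y_r * chain

# Hypothetical toy sample.
y_toy = np.array([10.0, np.nan, 14.0, 12.0, np.nan])
x_toy = np.array([5.0, 6.0, 7.0, 6.0, 8.0])

at_one = chain_ratio_exp_estimator(y_toy, x_toy, 1.0)   # response mean only
at_zero = chain_ratio_exp_estimator(y_toy, x_toy, 0.0)  # pure chain estimator
```

With `alpha = 1` the auxiliary variable drops out entirely, while `alpha = 0` puts all the weight on the chain ratio-type exponential adjustment.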

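To find the properties of the proposed estimator, both its bias and its MSE are considered up to the first degree of approximation by using the Taylor series method. We define the relative errors of $\bar{y}_r$, $\bar{x}_r$, and $\bar{x}_n$ about the corresponding population means.

Since SRSWOR is followed, these errors have zero expectation, and their second-order moments are functions of the sizes $N$, $n$, and $r$, the coefficients of variation $C_y$ and $C_x$, and the correlation coefficient $\rho$.

Naturally, when these population parameters are unknown, they can be estimated by their sample counterparts.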
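Under the usual assumptions for this setting, the error terms and their moments can be written as below. The symbols $e_0$, $e_1$, and $e_2$ are assumed names, and the moments are the standard SRSWOR results when the respondents are treated as a simple random subsample of the sample.

```latex
\begin{align*}
e_0 &= \frac{\bar{y}_r-\bar{Y}}{\bar{Y}}, \qquad
e_1 = \frac{\bar{x}_r-\bar{X}}{\bar{X}}, \qquad
e_2 = \frac{\bar{x}_n-\bar{X}}{\bar{X}}, \qquad
E(e_0)=E(e_1)=E(e_2)=0,\\
E(e_0^2) &= \left(\tfrac{1}{r}-\tfrac{1}{N}\right)C_y^2, \quad
E(e_1^2) = \left(\tfrac{1}{r}-\tfrac{1}{N}\right)C_x^2, \quad
E(e_2^2) = \left(\tfrac{1}{n}-\tfrac{1}{N}\right)C_x^2,\\
E(e_0e_1) &= \left(\tfrac{1}{r}-\tfrac{1}{N}\right)\rho\,C_xC_y, \quad
E(e_0e_2) = \left(\tfrac{1}{n}-\tfrac{1}{N}\right)\rho\,C_xC_y, \quad
E(e_1e_2) = \left(\tfrac{1}{n}-\tfrac{1}{N}\right)C_x^2,
\end{align*}
```

where $C_y = S_y/\bar{Y}$, $C_x = S_x/\bar{X}$, and $\rho = S_{xy}/(S_xS_y)$.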

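Next, the proposed estimator in equation (15) is written in terms of the error terms, which gives equation (17), and the difference between the estimator and the population mean then follows as equation (18).

Taking expectations on both sides of equation (18) gives the bias of the proposed estimator, equation (19).

Squaring both sides of equation (18), expanding, taking expectations, and retaining the terms up to the first degree of approximation gives the MSE of the proposed estimator, equation (20).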
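These steps can be worked through as follows. This is a sketch under stated assumptions rather than a reproduction of the original displays: it assumes the estimator $\bar{y}_P = \alpha\bar{y}_r + (1-\alpha)\bar{y}_r(\bar{x}_n/\bar{x}_r)\exp\{(\bar{x}_n-\bar{x}_r)/(\bar{x}_n+\bar{x}_r)\}$, relative errors $e_0$, $e_1$, $e_2$ of $\bar{y}_r$, $\bar{x}_r$, $\bar{x}_n$, and the standard SRSWOR moments; the tags follow the in-text references.

```latex
\begin{align}
\bar{y}_P &= \bar{Y}(1+e_0)\left[\alpha+(1-\alpha)(1+e_2)(1+e_1)^{-1}
  \exp\!\left(\frac{e_2-e_1}{2+e_1+e_2}\right)\right], \tag{17}\\
\bar{y}_P-\bar{Y} &\approx \bar{Y}\left[e_0+(1-\alpha)\left(\tfrac{3}{2}(e_2-e_1)
  +\tfrac{3}{2}(e_0e_2-e_0e_1)+\tfrac{15}{8}e_1^2+\tfrac{3}{8}e_2^2-\tfrac{9}{4}e_1e_2\right)\right], \tag{18}\\
\operatorname{Bias}(\bar{y}_P) &\approx (1-\alpha)\left(\frac{1}{r}-\frac{1}{n}\right)\bar{Y}
  \left(\tfrac{15}{8}C_x^2-\tfrac{3}{2}\rho\,C_xC_y\right), \tag{19}\\
\operatorname{MSE}(\bar{y}_P) &\approx \bar{Y}^2\left[\left(\frac{1}{r}-\frac{1}{N}\right)C_y^2
  +\left(\frac{1}{r}-\frac{1}{n}\right)\left(\tfrac{9}{4}(1-\alpha)^2C_x^2-3(1-\alpha)\rho\,C_xC_y\right)\right]. \tag{20}
\end{align}
```

The factor $\tfrac{3}{2}$ in the linear term is the sum of the contribution $1$ from the ratio $\bar{x}_n/\bar{x}_r$ and the contribution $\tfrac{1}{2}$ from the exponential term, and it is the source of the coefficients $\tfrac{9}{4}$ and $3$ in the MSE.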

2.4. Estimation of Optimum Value Constant

In this section, we find the value of the constant $\alpha$ that minimizes the MSE of the proposed estimator. Since the MSE given in equation (20) is a function of the unknown constant $\alpha$, we differentiate equation (20) with respect to $\alpha$ and equate the derivative to zero, which gives equation (21).

Solving equation (21) gives the optimum constant in equation (22).
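Under the assumption that the MSE in equation (20) has the first-order form $\bar{Y}^2[(\tfrac{1}{r}-\tfrac{1}{N})C_y^2+(\tfrac{1}{r}-\tfrac{1}{n})(\tfrac{9}{4}(1-\alpha)^2C_x^2-3(1-\alpha)\rho C_xC_y)]$, the minimization can be sketched as follows; $\alpha_{opt}$ is an assumed label.

```latex
\begin{align}
\frac{\partial\,\operatorname{MSE}(\bar{y}_P)}{\partial\alpha}
 &= \bar{Y}^2\left(\frac{1}{r}-\frac{1}{n}\right)\left[-\tfrac{9}{2}(1-\alpha)C_x^2+3\rho\,C_xC_y\right]=0, \tag{21}\\
\alpha_{opt} &= 1-\frac{2}{3}\,\rho\,\frac{C_y}{C_x}, \tag{22}\\
\operatorname{MSE}_{\min}(\bar{y}_P) &\approx \bar{Y}^2C_y^2\left[\left(\frac{1}{r}-\frac{1}{N}\right)
 -\left(\frac{1}{r}-\frac{1}{n}\right)\rho^2\right]. \notag
\end{align}
```

The second derivative, $\tfrac{9}{2}\bar{Y}^2(\tfrac{1}{r}-\tfrac{1}{n})C_x^2$, is positive whenever $r < n$, so the stationary point is a minimum.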

3. Efficiency Comparison of the Proposed Estimator

Under the optimum constant in equation (22), the proposed estimator is compared with the mean imputation estimator, the ratio imputation estimator, and the compromised imputation estimator of Singh and Horn [18] on the basis of MSE, where the estimator with the smaller MSE is preferred. The efficiency of the proposed estimator is as follows:

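From equations (9) and (20), the proposed estimator has a smaller MSE than the mean imputation estimator if $\alpha$ lies in the range given by condition (23).

From equations (13) and (20), the proposed estimator has a smaller MSE than the ratio imputation estimator if $\alpha$ lies in the range given by condition (24).

From equations (5) and (20), the proposed estimator has a smaller MSE than the compromised imputation estimator if $\alpha$ lies in the range given by condition (25).

When the conditions in equations (23)–(25) are satisfied, the proposed estimator is more efficient than the mean imputation, ratio imputation, and compromised imputation estimators, respectively.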
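For orientation, the ranges of $\alpha$ implied by the standard first-order MSE expressions can be worked out explicitly. The sketch below writes $k=\rho C_y/C_x$ and assumes that the compromised estimator is evaluated at the same $\alpha$ as the proposed one; both the shorthand $k$ and that choice are assumptions, so the exact statements in the original conditions may be arranged differently.

```latex
\begin{align}
\operatorname{MSE}(\bar{y}_P)<\operatorname{Var}(\bar{y}_M)
 &\iff 1-\tfrac{4}{3}k<\alpha<1 \quad\text{or}\quad 1<\alpha<1-\tfrac{4}{3}k, \tag{23}\\
\operatorname{MSE}(\bar{y}_P)<\operatorname{MSE}(\bar{y}_{RAT})
 &\iff \tfrac{1}{3}<\alpha<\tfrac{5-4k}{3} \quad\text{or}\quad \tfrac{5-4k}{3}<\alpha<\tfrac{1}{3}, \tag{24}\\
\operatorname{MSE}(\bar{y}_P)<\operatorname{MSE}(\bar{y}_{COMP})
 &\iff 1-\tfrac{4}{5}k<\alpha<1 \quad\text{or}\quad 1<\alpha<1-\tfrac{4}{5}k. \tag{25}
\end{align}
```

In each line, the first alternative applies when the interval is non-empty in that order and the second when the endpoints are reversed, which in (23) and (25) corresponds to $k>0$ and $k<0$, respectively. Condition (24) follows from factoring $9t^2-12kt-4+8k=(3t-2)(3t+2-4k)$ with $t=1-\alpha$.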

4. Case Study

For the case study, we obtained hourly average O₃ levels (ppb) and the levels of the auxiliary pollutant from the PCD website for August 2018. The data form a population of 744 units. On examination, we found that the O₃ concentration data had missing values, so O₃ was taken as the variable of interest $y$, and the concentration of the auxiliary pollutant was taken as the auxiliary variable $x$. Summary values of the considered variables were computed from these data. We identified 6.03% of the data as missing.

The conditions on $\alpha$ under which the proposed estimator is better than the existing estimators are shown in Table 1. The table also presents the percent relative efficiency (PRE) of each estimator, which is computed as a ratio of MSEs expressed as a percentage.
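A PRE table of this kind can be computed with a few lines of Python. The sketch below is illustrative only: it uses the standard first-order MSE expressions, takes the mean imputation estimator as the baseline (an assumption), and plugs in hypothetical parameter values rather than the values from the case study.

```python
def mse_table(N, n, r, y_bar, c_y, c_x, rho, alpha):
    """First-order MSEs of four estimators (assumed standard forms)."""
    f_rN = 1.0 / r - 1.0 / N
    f_rn = 1.0 / r - 1.0 / n
    base = f_rN * c_y ** 2
    t = 1.0 - alpha
    cross = rho * c_x * c_y
    return {
        "mean": y_bar ** 2 * base,
        "ratio": y_bar ** 2 * (base + f_rn * (c_x ** 2 - 2 * cross)),
        "compromised": y_bar ** 2
        * (base + f_rn * (t ** 2 * c_x ** 2 - 2 * t * cross)),
        "proposed": y_bar ** 2
        * (base + f_rn * (2.25 * t ** 2 * c_x ** 2 - 3 * t * cross)),
    }

def pre_table(mse, baseline="mean"):
    """Percent relative efficiency of each estimator w.r.t. the baseline."""
    return {name: 100.0 * mse[baseline] / value for name, value in mse.items()}

# Hypothetical inputs, NOT the case-study values.
params = dict(N=744, n=200, r=188, y_bar=13.0, c_y=0.8, c_x=0.6, rho=-0.47)

# Optimum constant for the proposed estimator under the assumed MSE form.
alpha_opt = 1.0 - (2.0 / 3.0) * params["rho"] * params["c_y"] / params["c_x"]

pre = pre_table(mse_table(alpha=alpha_opt, **params))
for name, value in pre.items():
    print(f"{name:12s} PRE = {value:8.2f}")
```

A PRE above 100 means the estimator has a smaller MSE than the baseline. At `alpha_opt` the proposed estimator attains its minimum MSE, so under these formulas its PRE is at least as large as that of the other three entries.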

The scatter plot (Figure 1) indicates that the O₃ concentration data and the auxiliary concentration data have a negative relationship. The correlation coefficient between the two is −0.47. From Table 1, considering both the conditions on $\alpha$ and the PREs, the proposed estimator is more efficient than the others and is therefore the most suitable for estimating the mean O₃ concentration in this case study. The mean O₃ concentration obtained with the proposed estimator is 13.00 ppb/hr, or 0.01 ppm/hr, which does not exceed the average standard of the PCD in Thailand. The mean square error of the proposed estimator is 0.07, which is close to zero and indicates that the proposed estimator is effective in this case.

5. Conclusions

In this case study, for missing data in the variable of interest, we propose an improved compromised imputation method that uses the chain ratio-type exponential technique to estimate the population mean. The mean square error of the proposed estimator was studied under general and optimum situations. The proposed method is useful for estimating the population mean in the presence of common, real-world nonresponse, and it is efficient when applied to a real dataset with missing values under certain conditions on the constant $\alpha$. In this case study, we applied the proposed method to the whole process of estimating the ozone population mean from real data. By contrast, the authors of the common methods did not apply their techniques to estimate a population mean from real data. In addition, when a dataset contains missing values, the proposed method can complete the data and can save both time and research budget. The proposed method is therefore a good strategy to apply in practice when a missing data problem occurs. This study focused on missing values in the variable of interest only, but a similar method could be applied to cases where data are missing in both the variable of interest and the auxiliary variable.

Data Availability

The data to support this study are available on the website of the Pollution Control Department (PCD) of Thailand (http://air4thai.pcd.go.th/webV2/history/).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank the Pollution Control Department (PCD) of Thailand for providing the data used in this research.