Journal of Sensors

Volume 2018, Article ID 7465026, 11 pages

https://doi.org/10.1155/2018/7465026

## A Novel Method for Air Quality Data Imputation by Nuclear Norm Minimization

^{1}Automotive Engineering Research Institute, Jiangsu University, Zhenjiang 212013, China
^{2}School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China
^{3}School of Chemistry and Chemical Engineering, Jiangsu University, Zhenjiang 212013, China

Correspondence should be addressed to Xiaobo Chen; xbchen82@gmail.com

Received 23 December 2017; Accepted 10 April 2018; Published 26 April 2018

Academic Editor: Fanli Meng

Copyright © 2018 Xiaobo Chen and Yan Xiao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Missing data is a frequently encountered problem in the environmental research community. To facilitate the analysis and management of air quality data, for example, the PM_{2.5} concentration studied here, a commonly adopted strategy for handling missing values is to generate a complete data set using imputation methods. Many imputation methods based on temporal or spatial correlation have been developed for this purpose in the existing literature. These methods differ in how they characterize the dependence relationship among data samples with different mathematical models, which is crucial for missing data imputation. In this paper, we propose two novel and principled imputation methods based on the nuclear norm of a matrix, since it measures such dependence in a global fashion. The first method, termed global nuclear norm minimization (GNNM), imputes missing values by directly minimizing the nuclear norm of the whole sample matrix, thereby maximizing the linear dependence of the samples. The second method, called local nuclear norm minimization (LNNM), concentrates on each sample and its most similar samples, which are estimated from the imputation results of the first method. In this way, nuclear norm minimization is performed on highly correlated samples instead of the whole sample matrix as in GNNM, which reduces the adverse impact of irrelevant samples. The two methods are evaluated on a data set of PM_{2.5} concentrations measured every hour by 22 monitoring stations. Missing values are simulated at different percentages, and the imputed values are compared with the ground truth values to evaluate the imputation performance of the different methods. The experimental results verify the effectiveness of our methods, especially LNNM, for missing air quality data imputation.

#### 1. Introduction

During the last decades, large amounts of air quality data reflecting significant pollutant concentrations have been collected by air quality monitoring stations distributed over a certain area. Due to the adverse effects of pollutants on the environment and human health, the analysis of air quality data plays an important role in environmental protection and pollution treatment. However, because of many uncontrollable factors, such as instrument faults and communication and processing errors, these data often suffer from missing values or incomplete samples [1, 2] in different proportions, causing serious difficulties for subsequent data analysis and decision making. For instance, many standard data analysis methods, such as neural networks [3, 4] and support vector machines [5, 6], are not applicable since they can only work on complete data.

According to [7, 8], missing data mechanisms can be categorized into three cases: (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR). Under MCAR, the missing values are completely independent of each other and thus appear as a few isolated points. Under MAR, the missing values are related to each other within a neighborhood and thus appear as a group of values lost at the same time. Under MNAR, the occurrence of missing values follows specific patterns, for example, the pattern caused by a long-lasting malfunction of a monitoring station. In this paper, we mainly focus on the first two cases since the last is too restrictive in reality [9].
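The two missingness patterns considered here are easy to simulate. The sketch below (function names, the missing fraction, and the gap length are illustrative choices of ours, not taken from the paper) drops isolated random entries for the MCAR case and consecutive runs of values in each row for the MAR-style grouped gaps:

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_mcar(data, frac):
    """Remove a fraction of entries completely at random (isolated points)."""
    out = data.copy()
    out[rng.random(data.shape) < frac] = np.nan
    return out


def simulate_mar(data, frac, block=4):
    """Remove consecutive blocks of values in each row (grouped gaps)."""
    out = data.copy()
    n_rows, n_cols = data.shape
    n_blocks = int(frac * n_cols / block)      # blocks per row
    for i in range(n_rows):
        starts = rng.choice(n_cols - block, size=n_blocks, replace=False)
        for s in starts:
            out[i, s:s + block] = np.nan
    return out


data = rng.random((22, 456))   # toy matrix shaped like the PM2.5 data set
mcar = simulate_mcar(data, 0.2)
mar = simulate_mar(data, 0.2)
print(np.isnan(mcar).mean(), np.isnan(mar).mean())
```

Note that in `simulate_mar` the randomly chosen blocks may overlap, so the realized missing fraction is slightly below the nominal one.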

The traditional way to handle missing data is to discard the samples containing missing values, which is generally called listwise deletion. However, this approach causes information loss because the observed values in the incomplete samples are actually informative. Moreover, the analytical results drawn after listwise deletion may be biased since the data distribution can be altered by the deletion, especially when the proportion of incomplete samples is high. Therefore, instead of deleting all incomplete samples, data imputation, which replaces the missing values with probable values estimated by different methods, has recently attracted much attention in the air quality research community.

Imputation methods can be roughly divided into two categories: single imputation [10, 11] and multiple imputation [12, 13]. Single imputation methods estimate one value for each missing element, whereas multiple imputation methods generate several possible values for each missing element, thus reflecting the uncertainty of the estimation. In this study, we concentrate on single imputation because it is more convenient to integrate with popular data analysis tools. So far, many single imputation methods have been explored in various application fields, such as probabilistic principal component analysis (PPCA) [14, 15], expectation-maximization (EM) [16], and neural networks [17, 18]. PPCA is based on a probabilistic latent variable model and assumes that the observed high-dimensional data are sampled from a low-dimensional intrinsic subspace; the intrinsic subspace and the missing values can then be jointly solved through maximum likelihood estimation (MLE). Neural networks provide a regression-based imputation approach in which the relationship between the observed values and the missing values is characterized by a neural network model. In addition, station mean (SM) [19] is a typical single imputation method that fills in a missing value with the mean of the values measured at the same time by the other monitoring stations. This method is based on spatial correlation, exploiting the relatedness of air quality samples measured at different monitoring stations. On the other hand, nearest neighbor (NN) imputation sets a missing value equal to the value of the closest sample in time. It assumes that daily air quality time series have strong local temporal correlation, which can be used to impute the missing values. Extensions of NN imputation include linear interpolation and cubic spline imputation, which characterize the local temporal relationship between nearby air quality samples with more complex models.
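As an illustration, the SM (spatial) and linear interpolation (temporal) baselines described above can be implemented in a few lines of NumPy. This is a minimal sketch with function names of our own choosing; rows are stations and columns are time points, with missing entries marked as NaN:

```python
import numpy as np


def station_mean_impute(X):
    """Spatial baseline (SM): replace each missing entry with the mean of
    the values observed at the same time point by the other stations."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)          # per-time-point mean over stations
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X


def linear_interp_impute(X):
    """Temporal baseline: linearly interpolate each station's time series."""
    X = X.copy()
    t = np.arange(X.shape[1])
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if miss.any() and (~miss).any():
            X[i, miss] = np.interp(t[miss], t[~miss], X[i, ~miss])
    return X


X = np.array([[1.0, np.nan, 3.0],
              [2.0, 2.0,    2.0]])
print(station_mean_impute(X))   # missing entry filled with the other station's 2.0
print(linear_interp_impute(X))  # missing entry filled by interpolating 1.0 and 3.0
```

In this toy example both baselines happen to give the same imputed value (2.0), although they rely on different correlation assumptions.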

It can be observed from existing work that missing data imputation closely depends on certain prior assumptions about the air quality data. For example, the NN imputation method supposes that the local variation of the air quality time series is constant, linear, or cubic in time. In this paper, we suppose, as in the NN and SM methods, that air quality samples measured at different time points or at different stations within a certain area are dependent on each other. This dependence likewise imposes a prior structure on the air quality data. When data are missing, we can recover both this prior structure and the missing values from the data that are observable. Therefore, we aim to impute the missing values by making the linear dependence among the rows and columns of the resulting complete data matrix as strong as possible. From this perspective, characterizing the linear dependence of the rows and columns of a data matrix plays an important role in missing data imputation.

As is well known, the rank of a matrix [20] equals the number of its linearly independent rows or columns. The lower the rank, the more linearly dependent the rows and columns are. A matrix whose rank is less than its number of rows and columns is said to be rank deficient, or not of full rank. Rank is therefore an informative quantity when comparing different low-rank matrices, but it does not sufficiently describe the strength of dependence for general matrices. In addition, minimizing the rank of a matrix is in general an NP-hard problem [21, 22] that is difficult to solve. In contrast, the nuclear norm of a matrix, defined as the sum of its singular values, also provides a good measure of the linear dependence of the rows and columns. Moreover, the nuclear norm is the best convex approximation of the matrix rank over the unit ball of matrices, leading to efficient optimization algorithms and a preferable globally optimal solution [21, 23, 24]. Owing to these advantages, the nuclear norm has been widely applied in various fields, such as image processing [25, 26] and bioinformatics [27].
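The relationship between rank and nuclear norm is easy to verify numerically. The illustrative snippet below (not from the paper) builds a matrix with strongly dependent rows and columns, and checks that the nuclear norm computed by NumPy equals the sum of the singular values:

```python
import numpy as np

rng = np.random.default_rng(1)

# A rank-2 matrix of the same shape as the PM2.5 data: only 2 of its
# 22 rows (and 2 of its 456 columns) are linearly independent.
A = rng.standard_normal((22, 2)) @ rng.standard_normal((2, 456))

rank = np.linalg.matrix_rank(A)
nuclear_norm = np.linalg.norm(A, ord='nuc')    # sum of singular values

sv = np.linalg.svd(A, compute_uv=False)
print(rank)                                    # 2
print(np.isclose(nuclear_norm, sv.sum()))      # True
```

Unlike the rank, which jumps discontinuously when an entry is perturbed, the nuclear norm varies smoothly with the matrix entries, which is what makes it amenable to convex optimization.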

Inspired by the above discussion, in this paper we propose two new and principled imputation methods for air quality data that use the nuclear norm of a matrix to measure the inherent dependence among samples. The first method, called global nuclear norm minimization (GNNM), minimizes the nuclear norm directly over the whole air quality data matrix in order to impute the missing values. This method is relatively simple but may produce suboptimal estimates of the missing values, especially when the data are not strongly dependent in a global way. To address this problem, we further propose local nuclear norm minimization (LNNM), which introduces local similarity in order to improve the imputation accuracy. Specifically, LNNM consists of two steps. The first step obtains a rough estimate of the missing values using GNNM. We then concentrate on each air quality sample and select its most similar samples, that is, its nearest neighbors, based on this rough estimate. Nuclear norm minimization is performed on the highly correlated subset comprising the sample and its most similar samples so as to refine the estimate of its missing values. This refinement is conducted for every air quality sample, yielding the final estimate of all missing data. Note that although nuclear norm minimization has been employed for some missing value imputation problems, such as traffic flow data [28], it has seldom been investigated for air quality data imputation.
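A standard way to carry out the kind of nuclear norm minimization GNNM relies on is iterative singular-value soft-thresholding (the SoftImpute scheme of Mazumder et al.). The sketch below, including an LNNM-style neighborhood refinement, is our illustration of the two-step idea under that assumption; it is not necessarily the exact solver derived in the paper, and for simplicity each matrix row is treated as one sample:

```python
import numpy as np


def soft_impute(X, lam=1.0, n_iter=200):
    """Nuclear-norm-regularized completion by iterative singular-value
    soft-thresholding. X contains NaNs at missing entries."""
    mask = ~np.isnan(X)
    Z = np.where(mask, X, 0.0)                 # initialize missing entries to 0
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)           # shrink singular values
        Z = np.where(mask, X, (U * s) @ Vt)    # keep observed values fixed
    return Z


def lnnm_refine(X, X0, k=5, lam=1.0):
    """LNNM-style second step: re-impute each incomplete row together with
    its k most similar rows, measured on the GNNM estimate X0."""
    out = X0.copy()
    for i in range(X.shape[0]):
        if not np.isnan(X[i]).any():
            continue
        d = np.linalg.norm(X0 - X0[i], axis=1)
        nbrs = np.argsort(d)[:k + 1]           # includes row i itself (d=0)
        sub = soft_impute(X[nbrs], lam=lam)
        out[i] = sub[np.where(nbrs == i)[0][0]]
    return out
```

A typical pipeline would be `Z = soft_impute(X, lam)` for the GNNM stage followed by `lnnm_refine(X, Z, k, lam)` for the local refinement; the shrinkage parameter `lam` and neighborhood size `k` would be tuned on validation data.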

The paper is organized as follows: in Section 2, the real-world PM_{2.5} concentration data is described and the proposed GNNM and LNNM methods are presented. In Section 3, we report and analyze the imputation performance of different methods. Finally, we conclude this paper in Section 4 by summarizing the main results of this study.

#### 2. Data and Methods

##### 2.1. Data

In this study, we consider PM_{2.5} concentrations measured every hour by 22 air quality monitoring stations distributed over the metropolitan area of Beijing, China, over 19 days in April 2013 [29, 30]. The whole data matrix consists of 22 rows and 456 columns, giving 10,032 measurements in total. Each row corresponds to one monitoring station, and each column contains the readings of all monitoring stations at a particular time point. The structure of the data is shown in Table 1. The PM_{2.5} concentrations measured at one station in one day are viewed as a single air quality sample, comprising 24 measurements.
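The layout described above can be checked with a toy matrix of the same shape (illustrative code only; the real PM_{2.5} values are not reproduced here). Reshaping the 22 × 456 station-by-hour matrix yields one 24-dimensional sample per station-day:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for the PM2.5 matrix: 22 stations x (19 days x 24 hours).
X = rng.random((22, 19 * 24))
assert X.shape == (22, 456) and X.size == 10032

# Each station-day becomes one 24-measurement air quality sample:
samples = X.reshape(22, 19, 24).reshape(22 * 19, 24)
print(samples.shape)   # (418, 24)
```

This gives 22 × 19 = 418 daily samples, each holding the 24 hourly readings of one station on one day.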