Journal of Sensors

Volume 2018, Article ID 7465026, 11 pages

https://doi.org/10.1155/2018/7465026

## A Novel Method for Air Quality Data Imputation by Nuclear Norm Minimization

^{1}Automotive Engineering Research Institute, Jiangsu University, Zhenjiang 212013, China
^{2}School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China
^{3}School of Chemistry and Chemical Engineering, Jiangsu University, Zhenjiang 212013, China

Correspondence should be addressed to Xiaobo Chen; xbchen82@gmail.com

Received 23 December 2017; Accepted 10 April 2018; Published 26 April 2018

Academic Editor: Fanli Meng

Copyright © 2018 Xiaobo Chen and Yan Xiao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Missing data is a frequently encountered problem in the environmental research community. To facilitate the analysis and management of air quality data, for example, the PM_{2.5} concentration studied here, a commonly adopted strategy for handling missing values is to generate a complete data set using imputation methods. Many imputation methods based on temporal or spatial correlation have been developed for this purpose in the existing literature. These methods differ in how they characterize the dependence relationship among data samples with different mathematical models, which is crucial for missing data imputation. In this paper, we propose two novel and principled imputation methods based on the nuclear norm of a matrix, since it measures such dependence in a global fashion. The first method, termed global nuclear norm minimization (GNNM), imputes missing values by directly minimizing the nuclear norm of the whole sample matrix, thereby maximizing the linear dependence of the samples. The second method, called local nuclear norm minimization (LNNM), concentrates on each sample and its most similar samples, which are estimated from the imputation results of the first method. In this way, nuclear norm minimization is performed on highly correlated samples instead of the whole sample matrix as in GNNM, which reduces the adverse impact of irrelevant samples. The two methods are evaluated on a data set of PM_{2.5} concentrations measured every hour by 22 monitoring stations. Missing values are simulated at different percentages, and the imputed values are compared with the ground truth values to evaluate the imputation performance of the different methods. The experimental results verify the effectiveness of our methods, especially LNNM, for missing air quality data imputation.

#### 1. Introduction

During the last decades, large amounts of air quality data reflecting significant pollutant concentrations have been collected by air quality monitoring stations distributed over a certain area. Due to the adverse effects of pollutants on the environment and human health, the analysis of air quality data plays an important role in environmental protection and pollution treatment. However, because of many uncontrollable factors, such as instrument faults and communication and processing errors, these data often suffer from missing values or incomplete samples [1, 2] in different proportions, causing serious difficulties for subsequent data analysis and decision making. For instance, many standard data analysis methods, such as neural networks [3, 4] and support vector machines [5, 6], are not applicable since they can only work on complete data.

According to [7, 8], missing data mechanisms can be categorized into three cases: (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR). Under MCAR, the missing values are completely independent of each other and thus appear as a few isolated points. Under MAR, the missing values are related to each other within a neighborhood and thus appear as a group of values lost at the same time. Under MNAR, the occurrence of missing values follows specific patterns, for example, the pattern caused by a long-lasting malfunction of a monitoring station. In this paper, we mainly focus on the first two cases since the last is too restrictive in reality [9].
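The two missingness patterns considered here are easy to simulate. The sketch below (function names, the missing fraction, and the gap length are illustrative choices of ours, not taken from the paper) drops isolated random entries for the MCAR case and consecutive runs of values in each row for the MAR-style grouped gaps:

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_mcar(data, frac):
    """Remove a fraction of entries completely at random (isolated points)."""
    out = data.copy()
    out[rng.random(data.shape) < frac] = np.nan
    return out


def simulate_mar(data, frac, block=4):
    """Remove consecutive blocks of values in each row (grouped gaps)."""
    out = data.copy()
    n_rows, n_cols = data.shape
    n_blocks = int(frac * n_cols / block)      # blocks per row
    for i in range(n_rows):
        starts = rng.choice(n_cols - block, size=n_blocks, replace=False)
        for s in starts:
            out[i, s:s + block] = np.nan
    return out


data = rng.random((22, 456))   # toy matrix shaped like the PM2.5 data set
mcar = simulate_mcar(data, 0.2)
mar = simulate_mar(data, 0.2)
print(np.isnan(mcar).mean(), np.isnan(mar).mean())
```

Note that in `simulate_mar` the randomly chosen blocks may overlap, so the realized missing fraction is slightly below the nominal one.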

The traditional way to handle missing data is to discard the samples containing missing values, which is generally called listwise deletion. However, this approach causes information loss because the observed values in the incomplete samples are actually informative. Moreover, the analytical results drawn after listwise deletion may be biased since the data distribution can be altered by the deletion, especially when the proportion of incomplete samples is high. Therefore, instead of deleting all incomplete samples, data imputation, which replaces the missing values with probable values estimated by different methods, has recently attracted much attention in the air quality research community.

Imputation methods can be roughly divided into two categories: single imputation [10, 11] and multiple imputation [12, 13]. Single imputation methods estimate one value for each missing element, whereas multiple imputation methods generate several possible values for each missing element, thus reflecting the uncertainty of the estimation. In this study, we concentrate on single imputation because it is more convenient to integrate with popular data analysis tools. So far, many single imputation methods have been explored in various application fields, such as probabilistic principal component analysis (PPCA) [14, 15], expectation-maximization (EM) [16], and neural networks [17, 18]. PPCA is based on a probabilistic latent variable model and assumes that the observed high-dimensional data are sampled from a low-dimensional intrinsic subspace; the intrinsic subspace and the missing values can then be jointly solved through maximum likelihood estimation (MLE). Neural networks provide a regression-based imputation approach in which the relationship between the observed values and the missing values is characterized by a neural network model. In addition, station mean (SM) [19] is a typical single imputation method that fills in a missing value with the mean of the values measured at the same time by the other monitoring stations. This method is based on spatial correlation, exploiting the relatedness of air quality samples measured at different monitoring stations. On the other hand, nearest neighbor (NN) imputation sets a missing value equal to the value of the closest sample in time. It assumes that daily air quality time series have strong local temporal correlation, which can be used to impute the missing values. Extensions of NN imputation include linear interpolation and cubic spline imputation, which characterize the local temporal relationship between nearby air quality samples with more complex models.
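As an illustration, the SM (spatial) and linear interpolation (temporal) baselines described above can be implemented in a few lines of NumPy. This is a minimal sketch with function names of our own choosing; rows are stations and columns are time points, with missing entries marked as NaN:

```python
import numpy as np


def station_mean_impute(X):
    """Spatial baseline (SM): replace each missing entry with the mean of
    the values observed at the same time point by the other stations."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)          # per-time-point mean over stations
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X


def linear_interp_impute(X):
    """Temporal baseline: linearly interpolate each station's time series."""
    X = X.copy()
    t = np.arange(X.shape[1])
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if miss.any() and (~miss).any():
            X[i, miss] = np.interp(t[miss], t[~miss], X[i, ~miss])
    return X


X = np.array([[1.0, np.nan, 3.0],
              [2.0, 2.0,    2.0]])
print(station_mean_impute(X))   # missing entry filled with the other station's 2.0
print(linear_interp_impute(X))  # missing entry filled by interpolating 1.0 and 3.0
```

In this toy example both baselines happen to give the same imputed value (2.0), although they rely on different correlation assumptions.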

It can be observed from existing work that missing data imputation closely depends on certain prior assumptions about the air quality data. For example, the NN imputation method supposes that the local variation of the air quality time series is constant, linear, or cubic in time. In this paper, we suppose, as in the NN and SM methods, that air quality samples measured at different time points or at different stations within a certain area are dependent on each other. This dependence likewise imposes a prior structure on the air quality data. When data are missing, we can recover both this prior structure and the missing values from the data that are observable. Therefore, we aim to impute the missing values by making the linear dependence among the rows and columns of the resulting complete data matrix as strong as possible. From this perspective, characterizing the linear dependence of the rows and columns of a data matrix plays an important role in missing data imputation.

As is well known, the rank of a matrix [20] equals the number of its linearly independent rows or columns. The lower the rank, the more linearly dependent the rows and columns are. A matrix whose rank is less than its number of rows and columns is said to be rank deficient, or not of full rank. Rank is therefore an informative quantity when comparing different low-rank matrices, but it does not sufficiently describe the strength of dependence for general matrices. In addition, minimizing the rank of a matrix is in general an NP-hard problem [21, 22] that is difficult to solve. In contrast, the nuclear norm of a matrix, defined as the sum of its singular values, also provides a good measure of the linear dependence of the rows and columns. Moreover, the nuclear norm is the best convex approximation of the matrix rank over the unit ball of matrices, leading to efficient optimization algorithms and a preferable globally optimal solution [21, 23, 24]. Owing to these advantages, the nuclear norm has been widely applied in various fields, such as image processing [25, 26] and bioinformatics [27].
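The relationship between rank and nuclear norm is easy to verify numerically. The illustrative snippet below (not from the paper) builds a matrix with strongly dependent rows and columns, and checks that the nuclear norm computed by NumPy equals the sum of the singular values:

```python
import numpy as np

rng = np.random.default_rng(1)

# A rank-2 matrix of the same shape as the PM2.5 data: only 2 of its
# 22 rows (and 2 of its 456 columns) are linearly independent.
A = rng.standard_normal((22, 2)) @ rng.standard_normal((2, 456))

rank = np.linalg.matrix_rank(A)
nuclear_norm = np.linalg.norm(A, ord='nuc')    # sum of singular values

sv = np.linalg.svd(A, compute_uv=False)
print(rank)                                    # 2
print(np.isclose(nuclear_norm, sv.sum()))      # True
```

Unlike the rank, which jumps discontinuously when an entry is perturbed, the nuclear norm varies smoothly with the matrix entries, which is what makes it amenable to convex optimization.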

Inspired by the above discussion, in this paper we propose two new and principled imputation methods for air quality data that use the nuclear norm of a matrix to measure the inherent dependence among samples. The first method, called global nuclear norm minimization (GNNM), minimizes the nuclear norm directly over the whole air quality data matrix in order to impute the missing values. This method is relatively simple but may produce suboptimal estimates of the missing values, especially when the data are not strongly dependent in a global way. To address this problem, we further propose local nuclear norm minimization (LNNM), which introduces local similarity in order to improve the imputation accuracy. Specifically, LNNM consists of two steps. The first step obtains a rough estimate of the missing values using GNNM. We then concentrate on each air quality sample and select its most similar samples, that is, its nearest neighbors, based on this rough estimate. Nuclear norm minimization is performed on the highly correlated subset comprising the sample and its most similar samples so as to refine the estimate of its missing values. This refinement is conducted for every air quality sample, yielding the final estimate of all missing data. Note that although nuclear norm minimization has been employed for some missing value imputation problems, such as traffic flow data [28], it has seldom been investigated for air quality data imputation.
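A standard way to carry out the kind of nuclear norm minimization GNNM relies on is iterative singular-value soft-thresholding (the SoftImpute scheme of Mazumder et al.). The sketch below, including an LNNM-style neighborhood refinement, is our illustration of the two-step idea under that assumption; it is not necessarily the exact solver derived in the paper, and for simplicity each matrix row is treated as one sample:

```python
import numpy as np


def soft_impute(X, lam=1.0, n_iter=200):
    """Nuclear-norm-regularized completion by iterative singular-value
    soft-thresholding. X contains NaNs at missing entries."""
    mask = ~np.isnan(X)
    Z = np.where(mask, X, 0.0)                 # initialize missing entries to 0
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)           # shrink singular values
        Z = np.where(mask, X, (U * s) @ Vt)    # keep observed values fixed
    return Z


def lnnm_refine(X, X0, k=5, lam=1.0):
    """LNNM-style second step: re-impute each incomplete row together with
    its k most similar rows, measured on the GNNM estimate X0."""
    out = X0.copy()
    for i in range(X.shape[0]):
        if not np.isnan(X[i]).any():
            continue
        d = np.linalg.norm(X0 - X0[i], axis=1)
        nbrs = np.argsort(d)[:k + 1]           # includes row i itself (d=0)
        sub = soft_impute(X[nbrs], lam=lam)
        out[i] = sub[np.where(nbrs == i)[0][0]]
    return out
```

A typical pipeline would be `Z = soft_impute(X, lam)` for the GNNM stage followed by `lnnm_refine(X, Z, k, lam)` for the local refinement; the shrinkage parameter `lam` and neighborhood size `k` would be tuned on validation data.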

The paper is organized as follows: in Section 2, the real-world PM_{2.5} concentration data is described and the proposed GNNM and LNNM methods are presented. In Section 3, we report and analyze the imputation performance of different methods. Finally, we conclude this paper in Section 4 by summarizing the main results of this study.

#### 2. Data and Methods

##### 2.1. Data

In this study, we consider PM_{2.5} concentrations measured every hour by 22 air quality monitoring stations distributed over the metropolitan area of Beijing, China, over 19 days in April 2013 [29, 30]. The whole data matrix consists of 22 rows and 456 columns, giving 10,032 measurements in total. Each row corresponds to one monitoring station, and each column contains the readings of all monitoring stations at a particular time point. The structure of the data is shown in Table 1. The PM_{2.5} concentrations measured at one station in one day are viewed as a single air quality sample, comprising 24 measurements.
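The layout described above can be checked with a toy matrix of the same shape (illustrative code only; the real PM_{2.5} values are not reproduced here). Reshaping the 22 × 456 station-by-hour matrix yields one 24-dimensional sample per station-day:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for the PM2.5 matrix: 22 stations x (19 days x 24 hours).
X = rng.random((22, 19 * 24))
assert X.shape == (22, 456) and X.size == 10032

# Each station-day becomes one 24-measurement air quality sample:
samples = X.reshape(22, 19, 24).reshape(22 * 19, 24)
print(samples.shape)   # (418, 24)
```

This gives 22 × 19 = 418 daily samples, each holding the 24 hourly readings of one station on one day.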