Intelligent Control Approaches for Modeling and Control of Complex Systems
Research Article | Open Access
Neural Models for Imputation of Missing Ozone Data in Air-Quality Datasets
Abstract
Ozone is one of the pollutants with the most negative effects on human health and, in general, on the biosphere. Many data-acquisition networks collect data about ozone values in both urban and background areas. These data are usually incomplete or corrupt, and the imputation of the missing values is a priority in order to obtain complete datasets, solving the uncertainty and vagueness of existing problems to manage complexity. In the present paper, multiple-regression techniques and Artificial Neural Network models are applied to approximate the absent ozone values from five explanatory variables containing air-quality information. To compare the different imputation methods, real-life data from six data-acquisition stations in the region of Castilla y León (Spain) are gathered in different ways and then analyzed. The results obtained in the estimation of the missing values by applying these techniques and models are compared, analyzing the possible causes of the given response.
1. Introduction and Related Work
Ozone (O_{3}) is an odorless, colorless, and highly reactive gas composed of three oxygen atoms. It is formed both in the Earth’s upper atmosphere (stratospheric ozone) and at ground level (tropospheric ozone). It can be “good” or “bad” for people’s health and for the environment, depending on its concentration levels and location in the atmosphere [1].
Stratospheric O_{3} is formed naturally through the interaction of solar ultraviolet (UV) radiation with molecular oxygen (O_{2}). Ground-level or “bad” ozone is not emitted directly into the air. In the 1950s, hydrocarbons and nitrogen oxides (NO_{x}) were identified as the two key chemical precursors of photochemical smog and its concomitant high concentrations of O_{3} and other photochemical oxidants [2]. The majority of ground-level O_{3} is formed from the photochemical oxidation of Volatile Organic Compounds (VOCs) in the presence of NO and other NO_{x}. Significant sources of VOCs are chemical plants, gasoline pumps, oil-based paints, auto-body shops, and print shops. NO_{x} results primarily from high-temperature combustion, and its most significant sources are power plants, industrial furnaces and boilers, and motor vehicles [3].
1.1. Importance of Ozone
Exposure to O_{3} can cause damage in different ways. In the stratosphere, reduced O_{3} levels as a result of O_{3} layer depletion mean less protection from the sun’s rays and more exposure to ultraviolet B (shortwave, UVB) radiation at the Earth’s surface [4]. The effects of O_{3} layer depletion on human health, caused by the increased amount of UVB that reaches the Earth’s surface, have been much analyzed. UVB causes nonmelanoma skin cancer and plays a major role in malignant melanoma development. In addition, UVB has been linked to the development of certain cataracts, negative effects in patients with asthma, and other chronic respiratory disease. With respect to ground-level O_{3} and its effects on human health, breathing O_{3} can trigger a variety of health problems. People with asthma and other chronic respiratory diseases are a large and growing segment of the population and are known to be especially susceptible to the effects of O_{3} exposure. On days with high levels of O_{3}, people with asthma tend to experience increased respiratory symptoms [3]. O_{3} layer depletion also has negative effects on the development of plants, on marine ecosystems (such as a direct reduction in phytoplankton production), on materials such as biopolymers, and so forth. Tropospheric O_{3} does not provide the protective function that it fulfills in the stratosphere, and it is highly reactive. Its strong oxidizing capacity, when its levels rise above the natural background, can cause adverse effects on materials (derived from its corrosive effects), vegetation, and ecosystems.
The present work focuses on tropospheric O_{3}, which is a risk for the air quality [3]. Given the increase in O_{3} levels in the troposphere, it is currently considered one of the most important atmospheric pollutants.
1.2. Ozone Level Monitoring
Around the world there are numerous data-acquisition networks for the measurement of O_{3} levels and other pollutants. They consist of many stations in different locations, where different sensors measure the corresponding magnitudes. These network stations acquire data at periodic intervals (periods between ten and fifteen minutes are the most frequent), but missing or corrupted data frequently appear. In Europe, data are considered corrupted when they do not meet Council Decision 97/101/EC of January 27, 1997 [5], which establishes a reciprocal exchange of information and data from networks and individual stations measuring ambient air pollution within the Member States. Some of these networks provide information about the validity of the data, indicating through codes whether a value is correct, could not be acquired, or is corrupt; on other occasions this type of information is not provided while the data are still missing. Some reasons for such failures have been pinpointed [6], namely, a damaged cable, the loss of proper electrical grounding, half-melted frost or snow on the dome, communications failure, and so forth. Some of these causes are temporary and may disappear spontaneously, but others require the intervention of a maintenance task force, and therefore errors persist for different periods of time. The absence of valid data may also be due to reasons such as mishandling of samples, low signal-to-noise ratio, measurement error, nonresponse, or deleted aberrant values [7]. This is a problem for the analysis of the information coming from the measurement networks, and the imputation of these missing data [8] is necessary. Any of the variables acquired in network stations may suffer from the absence of data.
If many data variables are omitted or corrupted in the same record, the whole sample must be withdrawn when some models are applied [9] for subsequent tasks such as control, classification, or forecasting. Alternatively, if data for the same pollutant are missing in several adjacent rows, removing that variable may also be a solution. In conclusion, having a complete set of data is necessary to perform a reliable study and to apply some models that cannot deal with missing data.
1.3. Missing Values and Related Work
The standard classification of the missing data phenomenon [10] includes different situations:
(i) Missing Completely At Random (MCAR), when the probability of an instance (case) having a missing value for a variable does not depend on either the known values or the missing data.
(ii) Missing At Random (MAR), when the probability of an instance having a missing value for a variable may depend on the known values but not on the value of the missing data itself.
(iii) Not Missing At Random (NMAR), when the probability of an instance having a missing value for a variable could depend on the value of that variable.
As previous authors have pointed out, the complexity varies between these patterns of missing data [11]. Usually, in the case of air-quality data, missing values are associated with MAR or MCAR. The circumstances that may interfere with the acquisition of the data are many and not easily predictable [12].
To solve the missing data problem, a wide variety of methods have been applied up to now [8, 10, 13]. These imputation methods (IMs) are usually classified as follows:
(i) Single imputation (SI): the method fills in one value for each missing one [12].
(ii) Multiple imputation (MI): multiple simulated values are generated at the same time [14].
Univariate and multivariate imputation methods differ in whether the approximation of the missing values of the variable under study is calculated from the rest of the values of the very same variable (univariate) or from the values of the rest of the variables (multivariate) [12].
With the aim of reducing the complexity of other applied MI methods [11], the present paper focuses on single and multivariate imputation for the O_{3} magnitude in air pollution datasets. To do so, multiple-regression (linear and nonlinear) techniques together with Artificial Neural Networks (ANN) are applied to real-life datasets obtained from public air-quality networks.
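The three mechanisms can be illustrated with a small simulation. The following is a minimal sketch, not part of the original study: the field names `o3` and `no2`, the 0.5 threshold, and the missing rates are all hypothetical choices made only to show how the probability of a gap depends on nothing (MCAR), on an observed variable (MAR), or on the missing value itself (NMAR).

```python
import random


def make_missing(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows where some 'o3' values are replaced by None.

    rows: list of dicts with numeric keys 'o3' and 'no2' (hypothetical names).
    mechanism: 'MCAR', 'MAR' or 'NMAR'.
    """
    rng = random.Random(seed)
    out = []
    for row in rows:
        if mechanism == "MCAR":
            # Probability is independent of every value in the row.
            p = rate
        elif mechanism == "MAR":
            # Probability depends only on an observed variable (no2).
            p = 2 * rate if row["no2"] > 0.5 else 0.0
        elif mechanism == "NMAR":
            # Probability depends on the value that goes missing (o3).
            p = 2 * rate if row["o3"] > 0.5 else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        new_row = dict(row)
        if rng.random() < p:
            new_row["o3"] = None
        out.append(new_row)
    return out


# Synthetic, normalized toy data (not real measurements).
_rng = random.Random(1)
data = [{"o3": _rng.random(), "no2": _rng.random()} for _ in range(1000)]
mcar = make_missing(data, "MCAR")
mar = make_missing(data, "MAR")
nmar = make_missing(data, "NMAR")
```

In the MAR and NMAR cases every gap falls in a row where the driving variable exceeds the threshold, which is what makes these patterns harder to handle than MCAR.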
Up to now, different Artificial Intelligence (AI) techniques have been applied for the imputation of missing data. In [7] imputation methods based on six different techniques are compared: K-Nearest Neighbors (KNN), Fuzzy K-Means (FKM), Singular Value Decomposition, Bayesian Principal Component Analysis (bPCA) and Multiple Imputations by Chained Equations. These methods are applied to four datasets split into two groups of various sizes: small datasets (Iris and E. coli) and large datasets (breast cancers 1 and 2). bPCA and FKM appeared to be the most robust imputation methods in the tested conditions.
In [15] the accuracy of different imputation methods is evaluated: MissForest (MF) and Multiple Imputation based on Expectation-Maximization (MIEM), along with two other imputation methods: Sequential Hot-Deck and Multiple Imputation based on Logistic Regression (MILR). The models are applied over fourteen binary datasets, with a range of missing data rates between 5% and 50%. The results from 10-fold Cross-Validation (CV) show that the performance of the imputation methods varies substantially between different classifiers and at different rates of missing values.
Although many imputation methods have been proposed up to now, scant attention has been paid to validating ANN for such a task, taking advantage of their regression capability [16]. Among these previous studies, ANN have been applied for the estimation of lost values in [17], where the main goal is identifying Learning Disabilities (LD) in children at early stages. In [18], the authors proposed an SI approach relying on a Multilayer Perceptron (MLP) whose training is conducted with different learning rules, and an MI approach based on the combination of MLP and KNN. A total of 24 real and simulated datasets from the UCI repository, the Promise repository, and mldata.org were exposed to a perturbation experiment with random generation of a monotone missing data pattern.
In [19] six different types of ANN are proposed as IM: MLP and its variations (the Time-Lagged Feedforward Network (TLFN)), the Generalized Radial-Basis-Function (GRBF) network, the Recurrent Neural Network (RNN), and its variations (the Time Delay Recurrent Neural Network (TDRNN)). Additionally, the Counterpropagation Fuzzy-Neural Network (CFNN) along with different optimization methods is applied for infilling missing daily total precipitation and extreme temperature series from 15 weather stations. The standard MLP and TLFN appear to provide the most accurate reconstruction of missing precipitation and daily extreme temperature records, with results for the R correlation coefficient between the observed and the reconstructed daily series close to 1.
In [20] a novel nonparametric algorithm named Generalized regression neural network Ensemble for Multiple Imputation (GEMI) is proposed. Additionally, an SI version of this approach (GESI) is proposed. The algorithms were tested on 98 synthetic and real-world datasets. All simulation results show the advantages of GEMI as compared with conventional algorithms. GEMI has heavy memory storage requirements but outperformed other SI algorithms.
In [21] fifteen real and simulated datasets are exposed to a perturbation experiment, based on the random generation of missing values. Several architectures and learning algorithms for the MLP are tested and compared with three classic imputation procedures: mean/mode imputation, regression, and hot-deck [22].
In [23] a methodology based on Gaussian Mixture Model (GMM) and Extreme Learning Machine (ELM) is developed and tested on some datasets from the UCI Machine Learning Repository and the LIACC regression repository. GMM is used to model the data distribution which is adapted to handle missing values, while ELM enables devising a Multiple Imputation strategy for final estimation. The combination of GMM and ELM is shown to be superior in almost all tested cases over the method based on conditional mean imputation.
In [24] an SI approach relying on an MLP and an MI approach based on the combination of MLP and KNN are proposed. The models are applied to 18 real and simulated datasets from domains such as biology, medicine, chemistry, electronics, social surveys, census, and business. For datasets with only quantitative variables the MIMLP model provided the best results, with IMLP being the best method for datasets with categorical variables.
In [25] a two-stage hybrid model for filling the missing values using fuzzy c-means clustering and MLP is proposed. It is applied to a Wine dataset with 1% to 5% of generated missing values, and the accuracy of the model is checked using the Mean Absolute Percentage Error (MAPE). The MAPE obtained for stage 2 (MLP regression on the dataset obtained by applying fuzzy c-means in stage 1) is 4.95% for 1% missing-value records and 8.36% for 5% missing-value records.
In the case of air-quality data, few imputation methods have been proposed up to now. In [13], an important set of SI methods (Listwise, Unconditional mean, Modified Median, Principal Component-based, and Expectation-Maximization (EM) (Regularized-EM)) and MI methods are applied to three datasets with the most important pollutant variables (NO, NO_{2}, NO_{x}, CO, O_{3}, PM10, and PM2.5) and a percentage of missing data between 3.85% and 23.52% depending on the year. Missing data of the eight variables are imputed in order to assess the effectiveness of the methods applied. In general, MI tends to yield more scattered values than its counterparts, mainly when the variables have many voids and correlate poorly with the other variables, like CO with 43.5% of missing data in 2006.
In [11] some methods for the imputation of missing air-quality data are compared in the context of SI (linear, spline, and nearest neighbor interpolations), MI (regression-based imputation, multivariate nearest neighbor, Self-Organizing Maps (SOM), and Multilayer Backpropagation (MLBP) nets), and hybrid methods of the aforementioned. The dataset uses the most common pollutants: NO_{x}, NO_{2}, O_{3}, PM10, SO_{2}, and CO concentrations, all on an hourly timescale (hourly averaged), together with four meteorological parameters. The performance of the proposed univariate missing data interpolation methods was limited, and in general they were able to fill only very short gaps of contiguous missing data. The general performance of the applied imputation methods was fairly good for the pollutants (NO_{x}, NO_{2}, O_{3}, PM10, SO_{2}, and CO), which are the most important ones in terms of air-quality modelling, but not so good for the meteorological variables. The results suggested that SOM and MLBP are the methods of choice for air-quality data imputation and that even better results can be achieved by using MI.
1.4. Main Contributions
The main contributions of this work are as follows:
(i) Deep study of the real-life human health protection task in the Spanish region of Castilla y León.
(ii) Multisensor O_{3} data analysis.
(iii) Experimental evaluation of the proposed approach based on multiple-regression techniques together with ANN models.
To the best of the authors’ knowledge, this is the first approach to the imputation of O_{3} based on both MLP and Radial-Basis-Function Networks.
The rest of this paper is organized as follows. Section 2 presents the techniques and models applied. Section 3 details the real-life case study that is addressed in the present work, while Section 4 describes the experiments and results. Finally, Section 5 sets out the main conclusions and future work.
2. Regression Techniques and ANN Models
In order to fill missing or corrupted values of O_{3} in high-dimensional datasets with air-quality information, two regression techniques and two ANN models have been applied in the present study. This set of techniques applied as imputation methods is described in this section.
2.1. Regression Techniques
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable [26].
The general purpose of multiple regressions [27] is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable.
2.1.1. Multiple Linear Regression
Multiple linear regression (MLR) attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data [28]. Every value of the independent variable $x$ is associated with a value of the dependent variable $y$. The population regression line for $p$ explanatory variables $x_1, x_2, \ldots, x_p$ is defined to be
$$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p.$$
This line describes how the mean response $\mu_y$ changes with the explanatory variables. The observed values for $y$ vary about their means $\mu_y$ and are assumed to have the same standard deviation $\sigma$. The fitted values $b_0, b_1, \ldots, b_p$ estimate the parameters $\beta_0, \beta_1, \ldots, \beta_p$ of the population regression line.
Since the observed values for $y$ vary about their means $\mu_y$, the multiple-regression models include a term for this variation. The model is expressed as DATA = FIT + RESIDUAL, where the “FIT” term represents the expression $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$. The “RESIDUAL” term represents the deviations of the observed values $y$ from their means $\mu_y$, which are normally distributed with mean 0 and variance $\sigma^2$. The notation for the model deviations is $\varepsilon$.
Formally, the model for multiple linear regression, given $n$ observations, is [28]
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$
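As an illustration of how such a model can serve as an imputation method, the sketch below estimates the coefficients of a linear model by ordinary least squares (normal equations) on complete samples and then predicts a missing response from its explanatory variables. It is a minimal example under stated assumptions: the helper names (`solve`, `fit_mlr`, `predict`) and the toy data are hypothetical, and the source does not state which fitting routine was actually used.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(X, y):
    """Least-squares estimates [b_0, b_1, ..., b_p] via the normal equations."""
    A = [[1.0] + list(row) for row in X]  # prepend the intercept column
    p = len(A[0])
    ata = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(p)]
    return solve(ata, aty)


def predict(b, x):
    """Evaluate b_0 + b_1 x_1 + ... + b_p x_p for one sample."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))


# Toy complete samples generated from y = 1 + 2*x1 - 3*x2 (no noise).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 3]]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in X]
b = fit_mlr(X, y)

# Impute the response of a sample whose explanatory values are known.
imputed = predict(b, [2, 2])
```

In the setting of this paper, the response would be the O_{3} value and the explanatory variables the five remaining pollutants.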
2.1.2. Multiple Nonlinear Regression
A Multiple Nonlinear Regression (MNLR) is a form of regression analysis in which observational data are modelled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables [29]. The data are fitted by a method of successive approximations.
The parameters can take the form of an exponential, trigonometric, power, or any other nonlinear function. To determine the nonlinear parameter estimates, an iterative algorithm is typically used. In general form, the model can be written as
$$y = f(x_1, x_2, \ldots, x_p; \boldsymbol{\beta}) + \varepsilon,$$
where $\boldsymbol{\beta}$ represents the nonlinear parameter estimates to be computed, $y$ is the dependent or criterion variable, and $\varepsilon$ represents the error terms.
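The idea of fitting by successive approximations can be sketched with a Gauss-Newton iteration on a simple exponential model $y = a\,e^{bx}$. This is only an illustrative choice: the model form, the function name `gauss_newton_exp`, the starting values, and the toy data are assumptions, and the source does not specify which nonlinear function or iterative algorithm was used.

```python
import math


def gauss_newton_exp(xs, ys, a=1.0, b=1.0, iterations=20):
    """Fit y = a * exp(b * x) by Gauss-Newton successive approximations."""
    for _ in range(iterations):
        # Residuals and Jacobian columns at the current estimate.
        res = [y - a * math.exp(b * x) for x, y in zip(xs, ys)]
        ja = [math.exp(b * x) for x in xs]          # d f / d a
        jb = [a * x * math.exp(b * x) for x in xs]  # d f / d b
        # 2x2 normal equations (J^T J) delta = J^T r, solved by Cramer's rule.
        saa = sum(v * v for v in ja)
        sab = sum(u * v for u, v in zip(ja, jb))
        sbb = sum(v * v for v in jb)
        ra = sum(u * r for u, r in zip(ja, res))
        rb = sum(u * r for u, r in zip(jb, res))
        det = saa * sbb - sab * sab
        if abs(det) < 1e-12:
            break  # singular system: stop updating
        a += (ra * sbb - sab * rb) / det
        b += (saa * rb - sab * ra) / det
    return a, b


# Toy data generated from y = 2 * exp(0.5 * x) (no noise).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_hat, b_hat = gauss_newton_exp(xs, ys)
```

Each pass linearizes the model around the current estimates and solves a small least-squares problem for the correction, so the estimates are refined step by step rather than obtained in closed form as in the linear case.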
2.2. Artificial Neural Networks
Artificial Neural Networks (ANN), also known as Artificial Neural Systems (ANS), connectionist systems, adaptive networks, and distributed and parallel processing, are simplified models of natural neural systems. The following definition, given by Hecht-Nielsen in 1989 [30], formalizes the concept of ANN:
An ANN is a parallel processing computer system distributed, consisting of a set of elementary processing units equipped with a small local memory and interconnected in a network through connections with associated weights. Each processing unit has one or more input connections and a single output connection that links to many collateral connections as desired. All processing associated with an elementary unit is a local, i.e. depends only on the values that take input signals from the unit and the internal state of the same.
2.2.1. Multilayer Perceptron (MLP)
The MLP consists of a system of simple interconnected neurons or nodes. The nodes are connected by weights and output signals which are a function of the sum of the inputs to the node modified by a simple nonlinear transfer, or activation, function. The architecture consists of several layers of neurons; the input layer serves to pass the input vector to the network. The terms “input vectors” and “output vectors” refer to the inputs and outputs of the MLP and can be represented as single vectors [31]. A MLP may have one or more hidden layers and finally an output layer. MLP are fully connected, with each node connected to every node in the next and previous layer.
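The forward computation described above can be sketched for a network with a single hidden layer. The function name `mlp_forward`, the tanh activation, the linear output node, and all weight values below are hypothetical choices for illustration; the source does not state the architecture or activation functions used in the experiments.

```python
import math


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: one tanh hidden layer followed by a linear output node."""
    # Each hidden node applies the activation to its weighted input sum.
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(hidden_w, hidden_b)
    ]
    # The output node combines the hidden activations linearly.
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b


# Input vector with five explanatory variables (made-up normalized values).
x = [0.2, -0.1, 0.4, 0.0, 0.3]
hidden_w = [[1.0, 0.0, 0.0, 0.0, 0.0],  # weights of hidden node 1
            [0.0, 0.0, 0.0, 0.0, 0.0]]  # weights of hidden node 2
hidden_b = [0.0, 0.0]
out_w = [2.0, 5.0]
out_b = 0.1
estimate = mlp_forward(x, hidden_w, hidden_b, out_w, out_b)
```

Training then consists of adjusting the weights and biases so that the output approximates the known O_{3} values of the training samples.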
To perform a comprehensive comparison, the MLP is trained with the following algorithms:
(1) Levenberg-Marquardt backpropagation (LM)
(2) Gradient Descent with momentum and adaptive learning rate backpropagation (GDX) [32]
(3) Batch Training with weight and bias learning rules (TB)
(4) Scaled Conjugate Gradient backpropagation (SCG)
(5) Bayesian Regularization backpropagation (BR).
2.2.2. Radial-Basis-Function Networks (RBFN)
In an RBFN [33], each unit in the hidden layer has its own centroid and, for each input vector $x$, computes the distance between $x$ and its centroid. The output of the unit is calculated as a nonlinear function of this distance.
Assuming that there are $r$ input nodes and $m$ output nodes, the overall response function without considering nonlinearity in an output node has the following form [34]:
$$f(x) = \sum_{i=1}^{k} w_i\, K\!\left(\frac{x - z_i}{\sigma_i}\right) = \sum_{i=1}^{k} w_i\, g\!\left(\frac{\lVert x - z_i \rVert}{\sigma_i}\right),$$
where $k$ is the number of units in the hidden layer, $w_i$ is the vector of weights linking the $i$th hidden-layer unit to the output nodes, $x$ is an input vector, $K$ is a radially symmetric kernel function of a unit in the hidden layer, $z_i$ and $\sigma_i$ are the centroid and smoothing factor of the $i$th kernel node, respectively, and $g$ is a function called the activation function, which characterizes the kernel shape.
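A response of this kind can be sketched for a single output node as a weighted sum of kernel activations, each depending on the distance between the input and a centroid scaled by a smoothing factor. The Gaussian kernel shape, the function name `rbfn_output`, and the centroid, smoothing-factor, and weight values below are assumptions made for illustration only.

```python
import math


def rbfn_output(x, centroids, sigmas, weights):
    """Weighted sum of Gaussian kernel activations for one output node."""
    total = 0.0
    for z, sigma, w in zip(centroids, sigmas, weights):
        # Euclidean distance between the input and the unit's centroid.
        dist = math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))
        # Assumed Gaussian activation of the scaled distance.
        total += w * math.exp(-((dist / sigma) ** 2))
    return total


centroids = [[0.0, 0.0], [3.0, 4.0]]  # one centroid per hidden unit
sigmas = [1.0, 5.0]                   # smoothing factors
weights = [2.0, -1.0]                 # hidden-to-output weights
at_first_centroid = rbfn_output([0.0, 0.0], centroids, sigmas, weights)
```

At the first centroid the first unit is fully active (activation 1), while the second unit, at scaled distance 1, contributes its weight times $e^{-1}$.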
3. Case Study
In the present study, data from air-quality stations in Castilla y León (CyL) are analyzed. CyL is a Spanish region located in the north-center of the Iberian Peninsula. It is composed of nine provinces and is the most extensive region of Spain, with a total surface of 94,226 square kilometers, and the sixth most populated, with 2,435,797 inhabitants. The Gross Domestic Product (GDP) of CyL represents 5.3% of the country’s GDP [35]. The climate in CyL approaches what is known as the continental ocean, characterized by cold winters and hot summers with short spring and autumn periods.
The CyL region provides a wide network of stations [36] for the acquisition of air-quality data. These data are publicly available according to the Open Data Initiative of the Spanish Government [37].
Stations from this network have some interesting characteristics:
(1) Stations are classified in types: urban, background, and oriented to vegetation protection [36].
(2) These stations collect the fundamental air-quality pollutants, among them O_{3}, which is the target pollutant of this study. Daily average data [38] for each pollutant are provided at each location.
(3) The data contain empty or corrupted values in all of the variables in some rows, in a percentage reasonable enough to be estimated.
In the present study, pollutant data recorded in six different stations from the CyL network are analyzed. Daily data averages from years 2000 to 2008 have been selected. For some periods of time within the selected time window, data are not available for all the variables and, thus, the whole example is rejected for the study. Three of the stations are located in the center of cities and labeled as urban stations; these stations are oriented to the protection of human health. The other three stations are background stations and are also oriented to the protection of human health. Stations of these two types measure a greater number of pollutants than the other type of stations; these pollutants are the most important ones in terms of air quality, and many of them are not collected at the stations for vegetation protection. This fact is important for the determination of the O_{3} missing values, as this gas is especially harmful to human health.
The three urban stations considered in the present study are as follows:
(1) Ávila. “Bus Station” station. Geographical coordinates: 40.65914, −4.68237; 1150 meters above sea level (masl).
(2) Aranda de Duero. “Jardines de Don Diego” station. Geographical coordinates: 41.67111, −3.68388; 801 masl.
(3) León. “Avda. San Ignacio de Loyola” station. Geographical coordinates: 42.60388, −5.58722; 838 masl.
The three background stations are as follows:
(1) Burgos. “Fuentes Blancas” station. Geographical coordinates: 42.33611, −3.63611; 929 masl.
(2) Segovia. “Acueducto” station. Geographical coordinates: 40.95555, −4.11055; 951 masl.
(3) Medina del Campo (Valladolid). “Bus Station” station. Geographical coordinates: 41.31638, −4.90916; 721 masl.
Figure 1 shows the location of the six selected stations that have been studied in the present paper.
The pollutants gathered in the above-mentioned stations and analyzed in the present study are as follows:
(1) Ozone (O_{3}), μg/m^{3}, secondary pollutant. See Section 1.
(2) Carbon monoxide (CO), mg/m^{3}, primary pollutant. It is an odorless, colorless gas formed by the incomplete combustion of fuels. When people are exposed to CO gas, the CO molecules displace the oxygen in their bodies and lead to poisoning [39].
(3) Nitric oxide (NO), μg/m^{3}, primary pollutant. NO is a colorless gas which reacts with ozone, undergoing rapid oxidation to NO_{2}, which is predominant in the atmosphere [39].
(4) Nitrogen dioxide (NO_{2}), μg/m^{3}, primary pollutant. From the standpoint of health protection, exposure limits are set for nitrogen dioxide for both long and short durations [39].
(5) Particulate matter (PM10), μg/m^{3}, primary pollutant. These particles remain stable in the air for long periods of time without falling to the ground and can be moved significant distances by the wind. It is defined by the ISO as follows: “particles which pass through a size-selective inlet with a 50% efficiency cut-off at 10 μm aerodynamic diameter. PM10 corresponds to the ‘thoracic convention’ as defined in ISO 7708:1995, Clause 6” [40].
(6) Sulphur dioxide (SO_{2}), μg/m^{3}, primary pollutant. It is a gas with a suffocating smell, similar to burnt matches. SO_{2} is produced by volcanoes and in various industrial processes. In the food industry, it is also used to protect wine from oxygen and bacteria [39].
Primary pollutants are injected into the atmosphere directly. Secondary pollutants are formed in the atmosphere through chemical and photochemical reactions from the primary pollutants [36].
All data from these six variables were normalized for the study. In addition, the variables are largely uncorrelated with one another. Table 1 shows the correlation matrix of the six pollutants of the case study.
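The two operations mentioned here, normalization and the computation of a correlation matrix, can be sketched as follows. The source does not state which normalization was applied, so z-score standardization is an assumption, as are the function names and the toy values, which are made up and are not measurements from the CyL stations.

```python
import math


def zscore(values):
    """Standardize a list of values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    za, zb = zscore(a), zscore(b)
    return sum(u * v for u, v in zip(za, zb)) / len(a)


def correlation_matrix(columns):
    """Nested dict with the pairwise correlations of all named columns."""
    names = list(columns)
    return {p: {q: pearson(columns[p], columns[q]) for q in names}
            for p in names}


# Made-up daily averages for three of the six pollutants.
columns = {
    "O3": [60.0, 72.0, 55.0, 80.0, 66.0],
    "NO": [12.0, 8.0, 15.0, 5.0, 10.0],
    "CO": [0.4, 0.5, 0.3, 0.6, 0.2],
}
matrix = correlation_matrix(columns)
```

Coefficients close to zero in the row of a pollutant indicate that it is weakly related, in the linear sense, to the other variables.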

It is worth mentioning that O_{3} is the most independent pollutant, as its correlation coefficients with the rest of the variables are close to zero.
There are a total of 13,526 samples, as one sample per day (daily average) was collected for the twelve months of every year, between years 2000 and 2008, in the six stations analyzed in this study. Missing or corrupted data appear in all the variables in some rows, which are omitted for the study.
Table 2 shows the percentage of missing or corrupted data present in each variable in the whole dataset.

All the samples with at least one missing or corrupted value were removed from the dataset.
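This filtering step (listwise deletion) can be sketched as below. Representing a missing or corrupted value as `None` and the sample layout as a dict per day are assumptions made for the example; the values are made up.

```python
def drop_incomplete(samples):
    """Keep only the samples in which every variable has a valid value."""
    return [s for s in samples if all(v is not None for v in s.values())]


# Made-up daily samples; None marks a missing or corrupted value.
samples = [
    {"O3": 61.0, "CO": 0.4, "NO": 12.0},
    {"O3": None, "CO": 0.5, "NO": 9.0},
    {"O3": 58.0, "CO": None, "NO": 14.0},
    {"O3": 70.0, "CO": 0.3, "NO": 7.0},
]
complete = drop_incomplete(samples)
```

Working only with complete samples makes it possible to hide known O_{3} values and compare each imputed estimate against the true measurement.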
4. Experiments, Results, and Discussion
The main target of this paper is to fill missing O_{3} values in air pollution datasets. To do so, several imputation methods are comprehensively compared as described below.
4.1. Experimental Settings
The imputation methods described in Section 2 are applied to different datasets, all of them with the six variables described in Section 3:
(1) The Whole Dataset (WD), comprising the 13,526 samples: results for this dataset are shown in Section 4.2.
(2) The Season Dataset (SD): samples in WD are split into four subsets according to the four seasons of the year: spring (3,453 samples), summer (3,349 samples), autumn (3,295 samples), and winter (3,429 samples). Results for this dataset are shown in Section 4.3.
(3) The Type station Dataset (TD): samples in WD are split into two subsets according to the type of the station where the data come from: “urban” (6,763 samples) or “background” (6,763 samples). Results for this dataset are shown in Section 4.4.
For the three datasets, both statistical and neural imputation methods were applied, and the performance was calculated through n-fold Cross-Validation (CV). The main idea behind CV is to split the data, normally many times, to estimate the risk, error, or performance of each algorithm. Part of the data (the training samples) is used for training each algorithm, and the remaining part (the validation samples) is used for validating it. Then, CV selects the algorithm with the smallest estimated risk [41]. CV helps prevent overfitting because the training sample is independent of the validation sample. The number of folds (data partitions), n, was 10 for all the experiments in the present study, which means that in each fold 90% of the data are used for training and 10% for validation. In the case of neural models, the training process is repeated ten times (once for each fold). In the case of MLP, training is also repeated for each training algorithm (see Section 2.2). For all the experiments, the Mean and the Standard Deviation (STD) of the Mean Square Error (MSE) over the ten folds are presented in Tables 3–11. The Mean and the STD of the execution time (in seconds) over the 10 folds are also presented in Tables 3–11.
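The evaluation protocol can be sketched as follows: the samples are split into ten folds, each fold is used once for validation while the rest is used for training, and the Mean and STD of the per-fold MSE are reported. The function names, the shuffling seed, the use of the sample standard deviation, and the placeholder model (which always predicts the training mean) are assumptions for illustration; any of the regression or neural models could be plugged in through `fit` and `predict`.

```python
import random
import statistics


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=10):
    """Return the mean and sample STD of the MSE over k validation folds."""
    scores = []
    for val in k_fold_indices(len(xs), k):
        val_set = set(val)
        train = [i for i in range(len(xs)) if i not in val_set]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in val]
        scores.append(sum(errors) / len(errors))
    return statistics.mean(scores), statistics.stdev(scores)


def fit_mean(train_x, train_y):
    """Placeholder model: remember the mean of the training targets."""
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    """Placeholder prediction: the stored training mean, ignoring x."""
    return model


# Made-up data: 50 samples, so each fold validates on 5 and trains on 45.
xs = [[float(i)] for i in range(50)]
ys = [float(i % 5) for i in range(50)]
folds = k_fold_indices(len(xs))
mse_mean, mse_std = cross_validate(xs, ys, fit_mean, predict_mean)
```

With k = 10, every sample is used for validation exactly once and for training in the other nine folds, which matches the 90%/10% split described above.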








