Complexity

Volume 2018 (2018), Article ID 7238015, 14 pages

https://doi.org/10.1155/2018/7238015

## Neural Models for Imputation of Missing Ozone Data in Air-Quality Datasets

^{1}Department of Civil Engineering, University of Burgos, Burgos, Spain

^{2}Department of Physics, University of Burgos, Burgos, Spain

^{3}Departamento de Informática y Automática, University of Salamanca, Salamanca, Spain

^{4}Department of Systems and Computer Networks, Wrocław University of Science and Technology, Wrocław, Poland

Correspondence should be addressed to Ángel Arroyo

Received 5 December 2017; Accepted 31 January 2018; Published 8 March 2018

Academic Editor: Eloy Irigoyen

Copyright © 2018 Ángel Arroyo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Ozone is one of the pollutants with the most negative effects on human health and, in general, on the biosphere. Many data-acquisition networks collect data about ozone values in both urban and background areas. Usually, these data are incomplete or corrupt, and the imputation of the missing values is a priority in order to obtain complete datasets, solving the uncertainty and vagueness of existing problems to manage complexity. In the present paper, multiple-regression techniques and Artificial Neural Network models are applied to approximate the absent ozone values from five explanatory variables containing air-quality information. To compare the different imputation methods, real-life data from six data-acquisition stations in the region of Castilla y León (Spain) are gathered in different ways and then analyzed. The results obtained in the estimation of the missing values by applying these techniques and models are compared, and the possible causes of the observed responses are analyzed.

#### 1. Introduction and Related Work

Ozone (O_{3}) is an odorless, colorless, and highly reactive gas composed of three oxygen atoms. It is formed both in the Earth’s upper atmosphere (stratospheric ozone) and at ground level (tropospheric ozone). It can be “good” or “bad” for people’s health and for the environment, depending on its concentration levels and location in the atmosphere [1].

Stratospheric O_{3} is formed naturally through the interaction of solar ultraviolet (UV) radiation with molecular oxygen (O_{2}). Ground-level or “bad” ozone is not emitted directly into the air. In the 1950s, hydrocarbons and nitrogen oxides (NO_{x}) were identified as the two key chemical precursors of photochemical smog and its concomitant high concentrations of O_{3} and other photochemical oxidants [2]. The majority of ground-level O_{3} is formed from the photochemical oxidation of Volatile Organic Compounds (VOCs) in the presence of NO and other NO_{x}. Significant sources of VOCs are chemical plants, gasoline pumps, oil-based paints, autobody shops, and print shops. NO_{x} result primarily from high-temperature combustion, and their most significant sources are power plants, industrial furnaces and boilers, and motor vehicles [3].

##### 1.1. Importance of Ozone

O_{3} can cause damage in different ways. In the stratosphere, reduced O_{3} levels resulting from O_{3} layer depletion mean less protection from the sun’s rays and more exposure to ultraviolet B (UVB, shortwave) radiation at the Earth’s surface [4]. The effects of O_{3} layer depletion on human health, which stem from the increased amount of UVB reaching the Earth’s surface, have been widely analyzed. UVB causes nonmelanoma skin cancer and plays a major role in malignant melanoma development. In addition, UVB has been linked to the development of certain cataracts and to negative effects in patients with asthma and other chronic respiratory diseases. With respect to ground-level O_{3} and its effects on human health, breathing O_{3} can trigger a variety of health problems. People with asthma and other chronic respiratory diseases are a large and growing segment of the population and are known to be especially susceptible to the effects of O_{3} exposure. On days with high levels of O_{3}, people with asthma tend to experience increased respiratory symptoms [3]. O_{3} layer depletion also has negative effects on plant development, on marine ecosystems (such as a direct reduction in phytoplankton production), on materials such as biopolymers, and so forth. Tropospheric O_{3} does not provide the protective function that it fulfills in the stratosphere, and it is highly reactive. Its strong oxidizing capacity, when its levels rise above the natural background, can cause adverse effects on materials (derived from its corrosive effects), on vegetation, and on ecosystems.

The present work focuses on tropospheric O_{3}, which poses a risk to air quality [3]. Given the increase in O_{3} levels in the troposphere, it is currently considered one of the most important atmospheric pollutants.

##### 1.2. Ozone Level Monitoring

Around the world there are numerous data-acquisition networks for the measurement of O_{3} levels and other pollutants. They consist of many stations in different locations, where different sensors measure the corresponding magnitudes. These network stations acquire data at periodic intervals (periods between ten and fifteen minutes are the most frequent), but missing or corrupted data appear frequently. In Europe, data are considered corrupted when they do not meet Council Decision 97/101/EC of January 27, 1997 [5], which establishes a reciprocal exchange of information and data from networks and individual stations measuring ambient air pollution within the Member States. Some of these networks provide information about the validity of the data, indicating through codes whether a value is correct, could not be acquired, or is corrupt; on other occasions no such information is provided even though data are missing. Some reasons for such failures have been pinpointed [6], namely, a damaged cable, the loss of proper electrical grounding, half-melted frost or snow on the dome, communications failure, and so forth. Some of these causes are temporary and may disappear spontaneously, but others require the intervention of a maintenance task force, and therefore errors persist for different periods of time. The absence of valid data may also be due to reasons such as mishandling of samples, low signal-to-noise ratio, measurement error, nonresponse, or deleted aberrant values [7]. This is a problem for the analysis of the information coming from the measurement networks, and the imputation of these missing data [8] is necessary. Any of the variables acquired in network stations may suffer from the problem of the absence of data.

If many variables are missing or corrupted in the same record, the whole sample must be withdrawn when some models are applied [9] for subsequent tasks such as control, classification, or forecasting. Alternatively, if data for the same pollutant are missing in several adjacent rows, removing that variable may also be a solution. In conclusion, having a complete set of data is necessary to perform a reliable study and to apply models that cannot deal with missing data.
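The record-removal option described above (often called listwise deletion) can be illustrated with a minimal Python sketch. The record layout and the variable names in the comment are hypothetical and are not taken from any dataset in this paper; missing values are assumed to be encoded as `None` or NaN.

```python
import math

# Hypothetical daily records: (O3, NO, NO2, SO2, PM10).
# None or NaN marks a missing or corrupted value (an assumed encoding).
records = [
    (61.0, 4.0, 12.0, 3.0, 18.0),
    (None, 5.0, 14.0, 2.0, 21.0),
    (58.0, 3.0, float("nan"), 2.0, 17.0),
    (70.0, 6.0, 15.0, 4.0, 25.0),
]

def is_missing(value):
    """Return True if the value is None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def listwise_delete(rows):
    """Keep only the rows in which every variable is present."""
    return [row for row in rows if not any(is_missing(v) for v in row)]

complete = listwise_delete(records)
```

Here half of the records are lost because of a single missing value each, which is the kind of information loss that imputation tries to avoid.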

##### 1.3. Missing Values and Related Work

The standard classification of the missing data phenomenon [10] includes different situations:

(i) Missing Completely At Random (MCAR), when the probability of an instance (case) having a missing value for a variable does not depend on either the known values or the missing data.

(ii) Missing At Random (MAR), when the probability of an instance having a missing value for a variable may depend on the known values but not on the value of the missing data itself.

(iii) Not Missing At Random (NMAR), when the probability of an instance having a missing value for a variable could depend on the value of that variable.

As previous authors have pointed out, the complexity varies between these patterns of missing data [11]. Usually, in the case of air-quality data, missing values are associated with MAR or MCAR. The circumstances that may interfere with the acquisition of the data are many and not easily predictable [12].
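To make the three mechanisms concrete, the following Python sketch generates a missingness mask under each of them. The function name `make_mask`, the median threshold, and the doubling of the rate above the threshold are illustrative assumptions; they are one simple way to simulate the mechanisms and are not part of the cited classification [10].

```python
import random

def make_mask(values, covariate, mechanism, rate=0.3, seed=0):
    """Return a list of booleans (True = missing) for `values`.

    MCAR: every entry is dropped with the same probability.
    MAR:  the drop probability depends only on an observed covariate.
    NMAR: the drop probability depends on the value that goes missing.
    """
    rng = random.Random(seed)
    cov_median = sorted(covariate)[len(covariate) // 2]
    val_median = sorted(values)[len(values) // 2]
    high = min(1.0, 2 * rate)  # assumed drop probability above the median
    mask = []
    for v, c in zip(values, covariate):
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = high if c > cov_median else 0.0
        elif mechanism == "NMAR":
            p = high if v > val_median else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        mask.append(rng.random() < p)
    return mask
```

Under MAR the mask could be predicted from the observed covariate alone, whereas under NMAR it cannot be reconstructed without the very values that are missing, which is why NMAR is the hardest case to handle.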

To solve the missing data problem, a wide variety of methods have been applied up to now [8, 10, 13]. These imputation methods (IMs) are usually classified as follows:

(i) Single imputation (SI): the method fills in one value for each missing one [12].

(ii) Multiple imputation (MI): multiple simulated values are generated at the same time [14].

Univariate and multivariate imputation methods differ in whether the approximation of the missing values of the variable under study is calculated from the rest of the values of the very same variable (univariate) or from the values of the rest of the variables (multivariate) [12].
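The difference between the two families can be sketched in a few lines of Python. Linear interpolation stands in for a univariate method and a one-predictor least-squares line for a multivariate one; both choices, as well as the function names, are illustrative assumptions rather than the methods evaluated in this paper.

```python
def univariate_impute(series):
    """Fill interior gaps (None) by linear interpolation of the same variable."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)
            filled[i] = series[left] + t * (series[right] - series[left])
    return filled

def multivariate_impute(target, predictor):
    """Fill gaps in `target` from a least-squares line fitted on `predictor`."""
    pairs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y if y is not None else intercept + slope * x
            for x, y in zip(predictor, target)]
```

The univariate version only needs the gapped series itself but cannot fill gaps at the ends of the series, while the multivariate version requires the predictor to be observed wherever the target is missing.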

With the aim of reducing the complexity of other MI applied methods [11], the present paper focuses on single and multivariate imputation for the O_{3} magnitude in air pollution datasets. To do so, multiple-regression (linear and nonlinear) techniques together with Artificial Neural Networks (ANN) are applied to real-life datasets obtained from public air-quality networks.

Up to now, different Artificial-Intelligence (AI) techniques have been applied for the imputation of missing data. In [7], imputation methods based on six different techniques are compared, including *K*-Nearest Neighbors (KNN), Fuzzy *K*-Means (FKM), Singular Value Decomposition, Bayesian Principal Component Analysis (bPCA), and Multiple Imputations by Chained Equations. These methods are applied to four datasets split into two groups of various sizes: small datasets (Iris and *E. coli*) and large datasets (breast cancers 1 and 2). bPCA and FKM appeared to be the most robust imputation methods in the tested conditions.

In [15] the accuracy of different imputation methods is evaluated: MissForest (MF) and Multiple Imputation based on Expectation-Maximization (MIEM), along with two other imputation methods: Sequential Hot-Deck and Multiple Imputation based on Logistic Regression (MILR). The models are applied over fourteen binary datasets, with a range of missing data rates between 5% and 50%. The results from 10-fold Cross-Validation (CV) show that the performance of the imputation methods varies substantially between different classifiers and at different rates of missing values.

Although many imputation methods have been proposed up to now, scant attention has been paid to validating ANN for such a task, taking advantage of their regression capability [16]. Among these previous studies, ANN have been applied for the estimation of lost values in [17], where the main goal is identifying Learning Disabilities (LD) in children at early stages. In [18], the authors proposed a SI approach relying on a Multilayer Perceptron (MLP) whose training is conducted with different learning rules, and a MI approach based on the combination of MLP and KNN. A total of 24 real and simulated datasets from the UCI repository, the Promise repository, and mldata.org were exposed to a perturbation experiment with random generation of a monotone missing data pattern.

In [19] six different types of ANN are proposed as IM: MLP and its variations (the Time-Lagged Feedforward Network (TLFN)), the Generalized Radial-Basis-Function (GRBF) network, the Recurrent Neural Network (RNN), and its variations (the Time Delay Recurrent Neural Network (TDRNN)). Additionally, the Counterpropagation Fuzzy-Neural Network (CFNN) along with different optimization methods is applied for infilling missing daily total precipitation and extreme temperature series from 15 weather stations. The standard MLP and TLFN appear to provide the most accurate reconstruction of missing precipitation and daily extreme temperature records, with values of the correlation coefficient *R* between the observed and the reconstructed daily series close to 1.

In [20] a novel nonparametric algorithm named Generalized regression neural network Ensemble for Multiple Imputation (GEMI) is proposed. Additionally, a SI version of this approach (GESI) is proposed. The algorithms were tested on 98 synthetic and real-world datasets. All simulation results show the advantages of GEMI as compared with conventional algorithms. GEMI has heavy memory storage requirements but outperformed other SI algorithms.

In [21] fifteen real and simulated datasets are exposed to a perturbation experiment, based on the random generation of missing values. Several architectures and learning algorithms for the MLP are tested and compared with three classic imputation procedures: mean/mode imputation, regression, and hot-deck [22].

In [23] a methodology based on Gaussian Mixture Model (GMM) and Extreme Learning Machine (ELM) is developed and tested on some datasets from the UCI Machine Learning Repository and the LIACC regression repository. GMM is used to model the data distribution which is adapted to handle missing values, while ELM enables devising a Multiple Imputation strategy for final estimation. The combination of GMM and ELM is shown to be superior in almost all tested cases over the method based on conditional mean imputation.

In [24] a SI approach relying on a MLP and a MI approach based on the combination of MLP and *K*-NN are proposed. The models are applied to 18 real and simulated datasets from domains such as biology, medicine, chemistry, electronics, social surveys, census, and business. For datasets with only quantitative variables, the MIMLP model provided the best results, with IMLP being the best method for datasets with categorical variables.

In [25] a two-stage hybrid model for filling the missing values using fuzzy *c*-means clustering and MLP is proposed. It is applied to a Wine dataset with 1% to 5% of generated missing values, and the accuracy of the model is checked using the Mean Absolute Percentage Error (MAPE). The MAPE obtained for stage 2 (MLP regression on the dataset obtained by applying fuzzy *c*-means in stage 1) is 4.95% for 1% missing-value records and 8.36% for 5% missing-value records.
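For reference, MAPE is conventionally computed as the mean of the absolute relative errors between observed and imputed values, expressed as a percentage. The sketch below follows that standard definition; the function name `mape` is illustrative, and the observed values are assumed to be nonzero.

```python
def mape(observed, imputed):
    """Mean Absolute Percentage Error, in percent.

    Assumes every observed value is nonzero, since each error is
    divided by the corresponding observed value.
    """
    if len(observed) != len(imputed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((o - p) / o) for o, p in zip(observed, imputed))
    return 100.0 * total / len(observed)
```

For example, imputing 90 and 220 where 100 and 200 were observed gives relative errors of 10% each, so the MAPE is about 10%.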

In the case of air-quality data, few imputation methods have been proposed up to now. In [13], an important set of SI methods (Listwise, Unconditional mean, Modified Median, Principal Component-based, and Expectation-Maximization (EM) (Regularized-EM)) and MI methods are applied to three datasets with the most important pollutant variables (NO, NO_{2}, NO_{x}, CO, O_{3}, PM10, and PM2.5) and a percentage of missing data between 3.85% and 23.52% depending on the year. Missing data of the eight variables are imputed in order to assess the effectiveness of the methods applied. In general, MI tends to yield more scattered values than its counterparts, mainly when the variables have many voids and correlate poorly with the other variables, like CO with 43.5% of missing data in 2006.

In [11] some methods for the imputation of missing air-quality data are compared: SI methods (linear, spline, and nearest neighbor interpolations), MI methods (regression-based imputation, multivariate nearest neighbor, Self-Organizing Maps (SOM), and Multilayer Backpropagation (MLBP) nets), and hybrids of the aforementioned. The dataset uses the most common pollutants: NO_{x}, NO_{2}, O_{3}, PM10, SO_{2}, and CO concentrations, all on a time-scale of one value per hour (hourly averaged), together with four meteorological parameters. The performance of the proposed univariate missing data interpolation was limited, and in general these methods were able to fill only very short gaps of contiguous missing data. The general performance of the applied imputation methods was fairly good for the pollutants (NO_{x}, NO_{2}, O_{3}, PM10, SO_{2}, and CO), which are the most important ones in terms of air-quality modelling, but not so good for the meteorological variables. The results suggested that SOM and MLBP are the methods of choice for air-quality data imputation and that even better results can be achieved by using MI.

##### 1.4. Main Contributions

The main contributions of this work are as follows:

(i) In-depth study of the real-life human health protection task in the Spanish region of Castilla y León.

(ii) Multisensor O_{3} data analysis.

(iii) Experimental evaluation of the proposed approach based on multiple-regression techniques together with ANN models.

To the best of the authors’ knowledge, this is the first approach to O_{3} imputation based on both MLP and Radial-Basis-Function Networks.

The rest of this paper is organized as follows. Section 2 presents the techniques and models applied. Section 3 details the real-life case study that is addressed in the present work, while Section 4 describes the experiments and results. Finally, Section 5 sets out the main conclusions and future work.

#### 2. Regression Techniques and ANN Models

In order to fill missing or corrupted values of O_{3} in high-dimensional datasets with air-quality information, two regression techniques and two ANN models have been applied in the present study. This set of techniques, applied as imputation methods, is described in this section.

##### 2.1. Regression Techniques

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable [26].

The general purpose of multiple regressions [27] is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable.

###### 2.1.1. Multiple Linear Regression

Multiple linear regression (MLR) attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data [28]. Every value of the independent variable $x$ is associated with a value of the dependent variable $y$. The population regression line for $p$ explanatory variables $x_1, x_2, \ldots, x_p$ is defined to be

$$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p.$$

This line describes how the mean response $\mu_y$ changes with the explanatory variables. The observed values for $y$ vary about their means $\mu_y$ and are assumed to have the same standard deviation $\sigma$. The fitted values $b_0, b_1, \ldots, b_p$ estimate the parameters $\beta_0, \beta_1, \ldots, \beta_p$ of the population regression line.

Since the observed values for $y$ vary about their means $\mu_y$, the multiple-regression model includes a term for this variation. The model is expressed as DATA = FIT + RESIDUAL, where the “FIT” term represents the expression $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$. The “RESIDUAL” term represents the deviations of the observed values $y$ from their means $\mu_y$, which are normally distributed with mean 0 and variance $\sigma^2$. The notation for the model deviations is $\varepsilon$.

Formally, the model for multiple linear regression, given $n$ observations, is [28]

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$
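As an illustration of how such a model can be fitted and then used to impute a value, the following Python sketch estimates the coefficients by ordinary least squares through the normal equations. It is a minimal, standard-library-only sketch under stated assumptions: the function names are hypothetical, the predictors are assumed not to be collinear, and nothing here is claimed to be the implementation used in this study.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fit_mlr(X, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    design = [[1.0] + list(row) for row in X]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict_mlr(coeffs, row):
    """Evaluate b0 + b1*x1 + ... + bp*xp for one record."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))
```

In an imputation setting, `fit_mlr` would be called on the complete records only, and `predict_mlr` would then supply the response for each record in which it is missing.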

###### 2.1.2. Multiple Nonlinear Regression

A Multiple Nonlinear Regression (MN-LR) is a form of regression analysis in which observational data are modelled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables [29]. The data are fitted by a method of successive approximations.

The model function can take the form of an exponential, trigonometric, power, or any other nonlinear function. To determine the nonlinear parameter estimates, an iterative algorithm is typically used. The model can be written as

$$y = f(X, \beta) + \varepsilon,$$

where $\beta$ represents the nonlinear parameter estimates to be computed, $y$ is the dependent or criterion variable, $X$ denotes the independent variables, and $\varepsilon$ represents the error terms.
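The iterative estimation can be sketched with a Gauss-Newton scheme, here for the two-parameter exponential model y = a·exp(b·x). Both the choice of Gauss-Newton with step halving and the exponential model are assumptions made for illustration; the text does not state which iterative algorithm or functional form is used in this study.

```python
import math

def fit_exponential(xs, ys, a=1.0, b=0.1, iterations=50, tol=1e-12):
    """Fit y = a * exp(b * x) by Gauss-Newton iterations with step halving."""
    def sse(pa, pb):
        return sum((y - pa * math.exp(pb * x)) ** 2 for x, y in zip(xs, ys))

    for _ in range(iterations):
        # Residuals and Jacobian columns at the current estimates.
        r = [y - a * math.exp(b * x) for x, y in zip(xs, ys)]
        ja = [math.exp(b * x) for x in xs]          # d f / d a
        jb = [a * x * math.exp(b * x) for x in xs]  # d f / d b
        # Normal equations (J^T J) delta = J^T r, solved in closed form (2x2).
        saa = sum(u * u for u in ja)
        sab = sum(u * v for u, v in zip(ja, jb))
        sbb = sum(v * v for v in jb)
        ra = sum(u * e for u, e in zip(ja, r))
        rb = sum(v * e for v, e in zip(jb, r))
        det = saa * sbb - sab * sab
        if abs(det) < 1e-300:
            raise ValueError("singular normal equations")
        da = (sbb * ra - sab * rb) / det
        db = (saa * rb - sab * ra) / det
        # Halve the step until the sum of squared errors decreases.
        step, current = 1.0, sse(a, b)
        while step > 1e-8 and sse(a + step * da, b + step * db) >= current:
            step /= 2
        a, b = a + step * da, b + step * db
        if abs(step * da) + abs(step * db) < tol:
            break
    return a, b
```

Each pass linearizes the model around the current estimates and solves a small linear least-squares problem, which is the sense in which the data are fitted by successive approximations.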

##### 2.2. Artificial Neural Networks

Artificial Neural Networks (ANN), also known as Artificial Neural Systems (ANS), connectionist systems, adaptive networks, and distributed and parallel processing systems, are simplified models of natural neural systems. The following definition, given by Hecht-Nielsen in 1989 [30], formalizes the concept of ANN:

An ANN is a parallel, distributed processing system consisting of a set of elementary processing units, each equipped with a small local memory, interconnected in a network through connections with associated weights. Each processing unit has one or more input connections and a single output connection that branches into as many collateral connections as desired. All processing associated with an elementary unit is local; that is, it depends only on the values of the unit’s input signals and on its internal state.

###### 2.2.1. Multilayer Perceptron (MLP)

The MLP consists of a system of simple interconnected neurons or nodes. The nodes are connected by weights and output signals which are a function of the sum of the inputs to the node modified by a simple nonlinear transfer, or activation, function. The architecture consists of several layers of neurons; the input layer serves to pass the input vector to the network. The terms “input vectors” and “output vectors” refer to the inputs and outputs of the MLP and can be represented as single vectors [31]. An MLP may have one or more hidden layers and finally an output layer. MLP are fully connected, with each node connected to every node in the next and previous layers.

To perform a comprehensive comparison, the MLP is trained with the following algorithms:

(1) Levenberg-Marquardt backpropagation (LM)

(2) Gradient Descent with momentum and adaptive learning rate backpropagation (GDX) [32]

(3) Batch Training with weight and bias learning rules (TB)

(4) Scaled Conjugate Gradient backpropagation (SCG)

(5) Bayesian Regularization backpropagation (BR).
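The forward computation just described (weighted sum of the inputs to each node, followed by an activation function, layer by layer) can be sketched as follows. The layer sizes, weights, and the choice of a hyperbolic tangent in the hidden layer with a linear output are hypothetical values chosen only for illustration.

```python
import math

def mlp_forward(x, layers):
    """Propagate input vector `x` through fully connected layers.

    `layers` is a list of (weights, biases, activation) triples, where
    weights[j][i] links input i to node j of that layer.
    """
    signal = list(x)
    for weights, biases, activation in layers:
        signal = [
            activation(sum(w * s for w, s in zip(row, signal)) + b)
            for row, b in zip(weights, biases)
        ]
    return signal

def identity(v):
    """Linear activation, commonly used at the output for regression."""
    return v

# Hypothetical 2-2-1 network: two inputs, two hidden nodes, one output.
hidden = ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0], math.tanh)
output = ([[1.0, 2.0]], [0.5], identity)
y = mlp_forward([1.0, 1.0], [hidden, output])
```

Training algorithms differ only in how they adjust the weights and biases; the forward pass itself is the same for all of them.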

###### 2.2.2. Radial-Basis-Function Networks (RBFN)

In an RBFN [33] each unit in the hidden layer has its own centroid and, for each input vector $x$, it computes the distance between $x$ and its centroid. The output of the unit is calculated as a nonlinear function of this distance.

Assuming that there are $r$ input nodes and $m$ output nodes, the overall response function without considering nonlinearity in an output node has the following form [34]:

$$f(x) = \sum_{i=1}^{M} W_i\, K(x, z_i) = \sum_{i=1}^{M} W_i\, g\!\left(\frac{\lVert x - z_i \rVert}{\sigma_i}\right),$$

where $M$ is the number of units in the hidden layer, $W_i$ is the vector of weights linking the $i$th hidden-layer unit to the output nodes, $x$ is an input vector, $K$ is a radially symmetric kernel function of a unit in the hidden layer, $z_i$ and $\sigma_i$ are the centroid and smoothing factor of the $i$th kernel node, respectively, and $g$ is a function called the activation function, which characterizes the kernel shape.
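A minimal Python sketch of this response function is given below: each hidden unit measures the distance from the input to its centroid, scales it by its smoothing factor, passes it through the activation function, and the results are combined through the output weights. The Gaussian activation and all numeric values are assumptions chosen for illustration; the text leaves the kernel shape unspecified.

```python
import math

def gaussian(u):
    """Gaussian activation g(u) = exp(-u^2), one common (assumed) kernel shape."""
    return math.exp(-u * u)

def rbfn_response(x, centroids, sigmas, weights, g=gaussian):
    """Sum over hidden units of W_i * g(||x - z_i|| / sigma_i).

    weights[i] is the weight vector linking hidden unit i to the m outputs.
    """
    m = len(weights[0])
    out = [0.0] * m
    for z, sigma, w in zip(centroids, sigmas, weights):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
        activation = g(dist / sigma)
        for j in range(m):
            out[j] += w[j] * activation
    return out

# Hypothetical network with two hidden units and one output node.
response = rbfn_response(
    x=[0.0, 0.0],
    centroids=[[0.0, 0.0], [3.0, 4.0]],
    sigmas=[1.0, 5.0],
    weights=[[2.0], [1.0]],
)
```

In this example the first unit sits exactly on the input and contributes its full weight, while the second unit is one smoothing factor away and contributes a damped share.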

#### 3. Case Study

In the present study, data from air-quality stations in Castilla y León (CyL) are analyzed. CyL is a Spanish region located in the north-center of the Iberian Peninsula. It is composed of nine provinces and is the largest region of Spain, with a total surface of 94,226 square kilometers, and the sixth most populous, with 2,435,797 inhabitants. The Gross Domestic Product (GDP) of CyL represents 5.3% of the country’s GDP [35]. The climate in CyL approaches what is known as the continental ocean climate, characterized by cold winters and hot summers with short spring and autumn periods.

The CyL region provides a wide network of stations [36] for the acquisition of air-quality data. These data are publicly available according to the Open Data Initiative from the Spanish Government [37].

Stations from this network have some interesting characteristics:

(1) Stations are classified in types: urban, background, and oriented to vegetation protection [36].

(2) These stations collect the fundamental air-quality pollutants, among them O_{3}, which is the target pollutant of this study. Daily average data [38] of each pollutant are provided for each location.

(3) The data contain empty or corrupted values in all of their variables, affecting some rows and amounting to a percentage that can reasonably be estimated.

In the present study, pollutant data recorded at six different stations from the CyL network are analyzed. Daily data averages from years 2000 to 2008 have been selected. For some periods of time within the selected time window, data are not available for all the variables and, thus, the whole sample is rejected for the study. Three of the stations are located in city centers and labeled as urban stations; these stations are oriented to the protection of human health. The other three stations are background stations and are also oriented to the protection of human health. Both types of stations measure a greater number of pollutants than the vegetation-protection stations; these pollutants are the most important ones in terms of air quality, and many of them are not collected at the stations for vegetation protection. This fact is important for the determination of the O_{3} missing values, as this gas is especially harmful for human health.

The three urban stations considered in the present study are as follows:

(1) Ávila. “Bus Station” station. Geographical coordinates: 40.65914, −4.68237; 1150 meters above sea level (masl).

(2) Aranda de Duero. “Jardines de Don Diego” station. Geographical coordinates: 41.67111, −3.68388; 801 masl.

(3) León. “Avda. San Ignacio de Loyola” station. Geographical coordinates: 42.60388, −5.58722; 838 masl.

The three background stations are as follows:

(1) Burgos. “Fuentes Blancas” station. Geographical coordinates: 42.33611, −3.63611; 929 masl.

(2) Segovia. “Acueducto” station. Geographical coordinates: 40.95555, −4.11055; 951 masl.

(3) Medina del Campo (Valladolid). “Bus Station” station. Geographical coordinates: 41.31638, −4.90916; 721 masl.

Figure 1 shows the location of the six selected stations that have been studied in the present paper.