The Influence of Data Length on the Performance of Artificial Intelligence Models in Predicting Air Pollution

AlOmar, Mohamed Khalid; Khaleel, Faidhalrahman; AlSaadi, Abdulwahab Abdulrazaaq; Hameed, Mohammed Majeed; AlSaadi, Mohammed Abdulhakim; Al-Ansari, Nadhir

doi:https://doi.org/10.1155/2022/5346647

Advances in Meteorology

Special Issue

Computational Algorithms for Climatological and Hydrological Applications

Research Article | Open Access

Volume 2022 | Article ID 5346647 | https://doi.org/10.1155/2022/5346647

Mohamed Khalid AlOmar,¹ Faidhalrahman Khaleel,¹ Abdulwahab Abdulrazaaq AlSaadi,² Mohammed Majeed Hameed,¹,³ Mohammed Abdulhakim AlSaadi,⁴ and Nadhir Al-Ansari⁵

Academic Editor: Upaka Rathnayake

Received 04 Mar 2022

Revised 08 Aug 2022

Accepted 22 Aug 2022

Published 30 Sept 2022

Abstract

Air pollution is one of humanity's most critical environmental issues and is considered contentious in several countries worldwide. Accurate prediction is therefore critical for human health management and for government decision-making in environmental management. In this study, three artificial intelligence (AI) approaches, namely the group method of data handling neural network (GMDHNN), the extreme learning machine (ELM), and the gradient boosting regression (GBR) tree, are used to predict the hourly concentration of PM_2.5 at the Dorset station in Canada. The investigation quantifies the effect of data length on AI modeling performance. Accordingly, nine ratios (50/50, 55/45, 60/40, 65/35, 70/30, 75/25, 80/20, 85/15, and 90/10) are employed to split the data into training and testing datasets for assessing the performance of the applied models. The results showed that the data division significantly affected the models' capacity and that the 60/40 ratio was the most suitable for developing predictive models. Furthermore, the ELM model provided more precise predictions of PM_2.5 concentrations than the other models. A further important feature of the ELM model is its ability to adapt to changes in the training and testing data ratio. In summary, the results demonstrate an efficient method for selecting the optimal dataset ratio and the best AI model, which would be helpful in designing accurate models for other environmental problems.

1. Introduction

1.1. Background

The impacts of urbanization and industrialization have resulted in increased air pollution, considered one of the most pressing public health challenges of our time [1]. Pollution can occur both indoors and outdoors [2, 3], and both are equally dangerous despite their differing sources. The main difference is that indoor pollution may be addressed using air filters and odour absorbers, whereas no comparably effective means exist for monitoring and detecting outdoor air pollution so that it can then be prevented [2]. Several studies suggest that by 2050 most of the global population will reside in urban areas [4]. Therefore, an effective method of monitoring and predicting air pollution, particularly fine particulate matter (FPM), is critical [5–7]. PM_2.5 is atmospheric particulate matter whose small equivalent diameter (2.5 μm or less) enables it to remain suspended in the atmosphere for an extended period. Furthermore, PM_2.5 generally consists of carbon, nitrate compounds, sulfur, heavy metals, and other substances such as sea salt and sand [8], and exposure to it leads to various respiratory diseases, nervous system damage, cancer, cardiovascular diseases, etc. [9–12]. Air pollution becomes more severe as the PM_2.5 concentration increases. Worldwide, about 3.15 million premature deaths each year are caused by exposure to high concentrations of PM_2.5. Overall, outdoor pollution causes 3.3 million deaths yearly [13]. Consequently, accurate predictions of PM_2.5 concentrations are critical for enhancing the public health system and developing an early warning system that predicts pollutant levels. Such a warning system can significantly help people, especially those suffering from chronic diseases, to avoid exposure to air pollutants at peak times when pollution reaches levels that affect their health.

1.2. Previous Works

In the last decades, several studies have been conducted to predict PM_2.5. These studies can roughly be categorized into conventional (deterministic and statistical) and artificial intelligence (AI) approaches. The deterministic approach is based on the Weather Research and Forecasting (WRF) and Community Multiscale Air Quality (CMAQ) models [14]. Calculations based on the deterministic model can account for abrupt changes in weather phenomena that cause the diffusion of atmospheric particles, and they perform well over extended periods [15]. The deterministic approaches rely on numerical simulation to obtain large-scale results. However, these models are time-consuming because they require many computational resources, which limits their comprehensive implementation [16]. On the other hand, statistical models such as nonlinear regression [17], classification and regression trees linear model-Kalman filter analog combination [18], autoregressive integrated moving average [19], exponential smoothing with drift model [20], and combination model [21] are more efficient, quicker, and easier to apply than deterministic models [22]. However, the performance of the statistical models is relatively poor since the characteristics of PM_2.5 are volatile across different samples [23, 24]. Zhang et al. [19] used an autoregressive integrated moving average model to evaluate and predict the trend of PM_2.5 concentrations. However, the results indicated that the model was outdated, which reduced its accuracy, so it could only predict the general trend. Furthermore, several factors influence the complicated formation of PM_2.5, such as meteorological factors (e.g., wind speed, humidity), population, and road network. The relation between these factors is highly nonlinear and complicated, making it almost impossible to capture using conventional methods [25, 26].

Machine learning has made tremendous progress in recent years in solving numerous engineering problems in general [27–32] and in predicting PM_2.5 concentration in particular [33–42]. Machine learning combines data science, statistics, and computing in an interdisciplinary fashion. Regarding PM_2.5 concentration prediction, machine learning methods have been shown to perform better than traditional statistical models since they can handle nonlinear relationships and interactions between variables [43, 44]. These methods are also valuable tools for tracking pollution baselines and have been proven to identify pollution hotspots accurately. Moreover, many variables from air quality and meteorological data can be analyzed using these techniques to enhance the understanding of their patterns and predict weather phenomena such as haze, air pollution, and visibility. Shang and He [45] combined random forest and an ensemble neural network to predict hourly PM_2.5 concentrations, and the proposed modeling method performed well. Wang et al. [46] used a machine learning model, multiple linear regression (MLR), and an ensemble model that combines the two to forecast indoor hourly PM_2.5 concentrations. Different meteorological and air quality parameters were considered to develop the proposed models. The results showed that the ensemble model provided better accuracy than the stand-alone models. Additionally, the results showed that the modeling approach has significant potential for predicting PM_2.5 concentrations. Murillo et al. [47] proposed three machine learning models, namely an artificial neural network (ANN), support vector regression (SVR), and a hybrid model that couples one of these models with a particle swarm optimization algorithm, to predict PM_2.5 concentrations one day in advance. The models were developed using various air quality and meteorological parameters. The results showed that the hybrid model performed better in predicting PM_2.5 concentrations than the other models. Hybrid models can find more efficient solutions than traditional ones [48]. In other words, researchers usually combine bio-inspired algorithms with classical models to enhance the capability of these models and hence achieve excellent predictive results [49, 50]. These algorithms are frequently given particular roles, such as optimizing the hyperparameters of the model, which are very difficult to compute via traditional methods.

Moisan et al. [51] compared the performance of three models, namely dynamic multiple equations, a seasonal model with exogenous variables, and a third approach, in predicting hourly PM_2.5 concentrations. For model development, different historical pollution and meteorological parameters were considered as inputs. The results showed that one of the models performed better than the others during severe episodes. For more examples, Table 1 gives a brief summary of the approaches applied to PM_2.5 concentration prediction. Based on the papers reviewed in that table, researchers did not pay considerable attention to data division when training AI and statistical models. Only a few ratios (70/30, 80/20, and 90/10) were employed to split the data into training and testing datasets to assess the applied models' performance. However, a proper division of the data into training and testing datasets can significantly influence model efficiency. On the one hand, increasing the length of the training dataset may make the model overfit the data. On the other hand, insufficient training data may significantly reduce prediction accuracy, dramatically lowering the chance of obtaining valid estimates.

1.3. Research Motivation

Because accurate prediction of PM_2.5 is very important for managers to stay alert, establish a robust early warning system, and minimize adverse health effects and associated costs, this study investigates the influence of data partitioning on the models' efficiency. To the best of the authors' knowledge, the selection of the best training and testing data ratio has not yet been investigated. The approaches reported in Table 1 show that researchers preferred to predict air pollution using ANN-based models. However, newer versions of ANN modeling approaches such as ELM have not been applied to air pollution forecasting. In addition, models such as ELM, GMDHNN, and GBR, despite their wide popularity in solving complex engineering problems [27, 29, 60–62], were not used in previous works to predict the concentration of PM_2.5. Therefore, these modeling approaches and their capacity are explored in more detail in this study.

2. Methodology

2.1. Case Study and Data Collection

In this study, hourly PM_2.5 concentration data from the Dorset station are collected for the period from January 01, 2011 to December 31, 2020. The Dorset station is located in the province of Ontario, Canada. The location of the study area and of the studied station, and the distribution of pollution over Canada, are provided in Figures 1(a) and 1(b), respectively. More information about the studied station and the statistical description of the data is presented in Table 2. Furthermore, Figure 2 shows the Dorset ambient air monitoring station.

2.2. Data Cleaning

Pollutant data such as PM_2.5 are usually measured using several instruments or sensors. However, sensors are susceptible to problems such as power failures, maintenance outages, and unstable network equipment, which lead to missing measurements, zero values, negative values, null values, or values that exceed the normal range. Consequently, the accuracy of model predictions may be affected if defective data are used directly as input.

In this study, the percentage of missing PM_2.5 data is low (1.78%). To compensate for the missing values, two methods were tested before making predictions: linear interpolation between neighboring values and piecewise cubic spline interpolation. However, the piecewise cubic spline approach produced unrealistic and negative values, making it unreliable for filling missing air pollution data. This finding is consistent with other studies regarding the unreliability of piecewise cubic spline interpolation for compensating missing values [64]. As a result, linear interpolation is adopted to replace the missing values. This method is also chosen because the range of missing values is small, making it easy to recover the conditions of a missing hour from the neighboring data. The adopted approach can be described as follows:

$$\hat{y}_t = y_a + \frac{y_b - y_a}{b - a}\,(t - a), \quad a < t < b,$$

where $y$ is the time-series target, $t$ is the time index of the series, and $\hat{y}_t$ is the estimate of the missing value at time $t$. Moreover, $y_a$ corresponds to the last normal observation before the range of missing points, and $y_b$ to the first normal observation after it.
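The gap-filling step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `fill_missing_linear`, the use of `None` to mark missing hours, and the choice to copy the nearest valid value into gaps at the very start or end of the record are all assumptions made here.

```python
def fill_missing_linear(series):
    """Fill None gaps by linear interpolation between the nearest valid neighbours.

    Gaps at the start or end of the record have only one neighbour, so they
    are filled with that nearest valid value (an assumption of this sketch).
    """
    filled = list(series)
    valid = [i for i, v in enumerate(filled) if v is not None]
    if not valid:
        raise ValueError("series has no valid observations")
    # Edge gaps: copy the nearest valid value.
    for i in range(valid[0]):
        filled[i] = filled[valid[0]]
    for i in range(valid[-1] + 1, len(filled)):
        filled[i] = filled[valid[-1]]
    # Interior gaps: straight line between the bracketing valid points a and b.
    for a, b in zip(valid, valid[1:]):
        slope = (filled[b] - filled[a]) / (b - a)
        for t in range(a + 1, b):
            filled[t] = filled[a] + slope * (t - a)
    return filled
```

Because each filled value lies on the straight line between two measured values, it can never fall outside the range of its neighbours, which is the property that made linear interpolation preferable to the spline approach for non-negative concentration data.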

2.3. Extreme Learning Machine

The ELM is a robust and simple learning algorithm designed by Huang et al. [65] for single hidden layer feedforward neural networks. Unlike gradient-based algorithms, its learning speed is significantly faster while providing better generalization, since it avoids the complexities of local minima, learning rate, and epochs, which are considerable drawbacks of other models. Furthermore, the ELM model is user-friendly, easy to comprehend, and provides minimum training errors with the smallest norm of weights [66, 67]. The ELM network consists of input, hidden, and output layers. The data are provided to the network in the input layer. The hidden layer is considered the most fundamental of the three since the computations are carried out in it, and it serves as a bridge between the input layer and the output layer, in which the results are organized. Given $N$ samples $(x_j, z_j)$ of a training dataset, the output function of the ELM with $L$ hidden nodes and activation function $g(\cdot)$ is as follows:

$$\sum_{i=1}^{L} \beta_i\, g(w_i \cdot x_j + b_i) = z_j, \quad j = 1, \ldots, N.$$

The input weights $w_i$ and biases $b_i$ are assigned randomly for the hidden nodes, while the output weights $\beta_i$ are calculated analytically. The equation above can be written compactly as

$$H\beta = Z,$$

where $H$ is the hidden layer output matrix, $\beta$ is the vector of output weights, and $Z$ is the output matrix. The output weights are then obtained as

$$\hat{\beta} = H^{\dagger} Z, \quad H^{\dagger} = (H^{T}H)^{-1}H^{T},$$

where $H^{T}$ refers to the transpose of the matrix $H$. Figure 3 shows the main structure of the ELM.
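The training procedure described above (random hidden-layer weights and biases, output weights computed analytically) can be sketched with NumPy as follows. This is a hedged illustration and not the authors' implementation: the class name `ELMRegressor`, the sigmoid activation, the standard normal initialization, and the default of 20 hidden nodes are assumptions, and `np.linalg.pinv` is used here as one common way to compute the generalized inverse.

```python
import numpy as np


class ELMRegressor:
    """Single hidden layer network trained in one least-squares step."""

    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Hidden layer output matrix H with a sigmoid activation (assumed).
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, z):
        X = np.asarray(X, dtype=float)
        z = np.asarray(z, dtype=float)
        # Input weights and biases are drawn at random and never updated.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # Output weights: least-squares solution of H beta = z.
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ z
        return self

    def predict(self, X):
        return self._hidden(np.asarray(X, dtype=float)) @ self.beta
```

A typical call would be `ELMRegressor(n_hidden=20).fit(X_train, z_train).predict(X_test)`, where each row of `X_train` holds the lagged PM_2.5 inputs for one hour. Since no iterative optimization is involved, there is no learning rate or epoch count to tune; the number of hidden nodes is the main hyperparameter.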

2.4. Group Method of Data Handling

Ivakhnenko first proposed the group method of data handling (GMDH) approach as a polynomial neural network to capture the complex relationship between the input and output of a nonlinear system [68]. Since prior knowledge of the mathematical model is usually unavailable, the neural network is utilized to overcome this issue [27]. As a result, the GMDH model can simulate complex systems without any prior specialized knowledge. The primary notion of the GMDH model is to establish an analytical function within a feedforward network, which is achieved by using the coefficients of a quadratic node transfer function derived through regression. A standard formula can be expressed as follows:

$$y = a_0 + a_1 x_i + a_2 x_j + a_3 x_i x_j + a_4 x_i^2 + a_5 x_j^2,$$

where $y$ is the output, and $x_i$ and $x_j$ are the model's inputs. The terms $a_0, a_1, \ldots, a_5$ refer to the polynomial coefficients, which are obtained from the training dataset. Each layer contains a set of input-processing components known as nodes, and the outcome of each layer is used as new input to the following layer. To optimize the weights, the least squares method is adopted to minimize the residual between the actual and predicted values. Figure 4 shows the structure of the GMDH model.
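A single GMDH node, as described above, is a quadratic polynomial of two inputs whose coefficients are fitted by least squares. The sketch below shows only that building block; the function names are hypothetical, the ordering of the six terms is an assumption of this example, and the layer-by-layer selection of the best nodes that a full GMDH network performs is not shown.

```python
import numpy as np


def quad_features(x1, x2):
    """Design matrix [1, x1, x2, x1*x2, x1^2, x2^2] for one two-input node."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])


def fit_gmdh_neuron(x1, x2, y):
    """Least-squares estimate of the six polynomial coefficients."""
    A = quad_features(x1, x2)
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef


def predict_gmdh_neuron(coef, x1, x2):
    """Evaluate the fitted quadratic polynomial on new inputs."""
    return quad_features(x1, x2) @ coef
```

In a complete network, one such node would be fitted for every pair of inputs in a layer, and the outputs of the best-performing nodes would become the inputs of the next layer.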

2.5. Gradient Boosting Regression Trees

The gradient boosting regression (GBR) tree combines the advantages of the boosting approach and decision trees to solve classification and regression problems. The general notion of GBR is to combine, through boosting, a series of decision trees known as weak learners into an ensemble of multiple decision trees (a strong learner), which in turn increases the accuracy and performance of the model. The boosting approach adds extra trees to the sequence, without altering the parameters of the trees that have already been added, in order to minimize the loss function of the model. In other words, the training samples' weights are modified in accordance with the last iteration: the weights are increased for the observations that are hard to predict and decreased for those that are handled well. Assuming $F(x)$ is the approximation function and $x$ is the set of predictors, and utilizing additive functions, the ensemble tree model can be written as follows [69, 70]:

$$F(x) = \sum_{m=0}^{M} \beta_m\, h(x; a_m),$$

where $a_m$ and $\beta_m$ represent the terminal nodes' mean and the given weights in the regression tree, respectively, and $h(x; a_m)$ represents the basis function of the additive expansion. Using the forward stagewise approach, the parameters $a_m$ and $\beta_m$ are optimized. The estimate function after $m$ iterations is given by (7), and the optimal parameters are obtained using (8):

$$F_m(x) = F_{m-1}(x) + \beta_m\, h(x; a_m), \tag{7}$$

$$(\beta_m, a_m) = \arg\min_{\beta,\, a} \sum_{i=1}^{N} L\big(y_i,\; F_{m-1}(x_i) + \beta\, h(x_i; a)\big), \tag{8}$$

where $L$ is the loss function, $N$ represents the number of observations, $x_i$ represents the predictor set for a given observation, and $y_i$ represents the response variable for a given observation. Figure 5 shows the structure of the GBR model.
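The stagewise idea described above can be illustrated with a deliberately small Python sketch. Several simplifications are assumed here and are not taken from the paper: the loss is squared error (so each new tree is fitted to the current residuals), every tree is a one-split stump on a single feature, and a fixed `learning_rate` plays the role of the tree weight. The function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature (squared error).

    Requires at least two distinct values in x.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_gbr(x, y, n_trees=50, learning_rate=0.1):
    """Add stumps one at a time; earlier stumps are never modified."""
    f0 = sum(y) / len(y)  # initial constant model
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared loss the negative gradient is the plain residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        pred = [
            p + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, p in zip(x, pred)
        ]
    return f0, stumps, learning_rate


def predict_gbr(model, x):
    """Sum the initial constant and the scaled contribution of every stump."""
    f0, stumps, learning_rate = model
    return [
        f0 + learning_rate * sum(lm if xi <= t else rm for t, lm, rm in stumps)
        for xi in x
    ]
```

Production libraries use deeper trees, several features, and other loss functions, but the loop structure (fit a weak learner to what the current ensemble still gets wrong, then add it with a small weight) is the same.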

2.6. Model Development

Three artificial intelligence models, namely the extreme learning machine (ELM), the group method of data handling neural network (GMDHNN), and the gradient boosting regression (GBR) tree, are used to predict the hourly concentration of PM_2.5 at the Dorset station in Canada. Before training the AI models, it is crucial to replace the missing records and determine the proper lagged input vectors. The missing records were replaced using the two methods described in Section 2.2. The input lags are identified using the autocorrelation function (ACF) and the partial autocorrelation function (PACF), which are fundamental tools in the analysis of linear time series. The ACF measures the correlation between the series' current value and its values at various earlier time points; that is, it indicates how similar observations are as a function of their time lag. The PACF measures the same correlation after partially removing the effects of the intermediate values. According to Figure 6, three input combinations, each built from lagged PM_2.5 values, can be used to predict PM_2.5 one hour ahead.

The ACF suggests that many lagged variables could be used as inputs; however, these variables are significantly correlated with each other. Thus, the PACF was used to select the most significant inputs.
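The lag-selection step can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the PACF is computed with the Durbin-Levinson recursion (one standard estimator; the paper does not say which one was used), and `make_lagged` simply arranges the previous `n_lags` hourly values as inputs for a one-hour-ahead target.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


def partial_autocorrelation(series, max_lag):
    """PACF for lags 1..max_lag via the Durbin-Levinson recursion."""
    rho = [autocorrelation(series, k) for k in range(max_lag + 1)]
    pacf = []
    phi_prev = []
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
            phi = [phi_kk]
        else:
            num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
            phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
            phi.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi
    return pacf


def make_lagged(series, n_lags):
    """Rows [y(t-1), ..., y(t-n_lags)] paired with the target y(t)."""
    X = [[series[t - k] for k in range(1, n_lags + 1)]
         for t in range(n_lags, len(series))]
    y = [series[t] for t in range(n_lags, len(series))]
    return X, y
```

Lags whose partial autocorrelation is clearly different from zero are candidates for the input vector; `make_lagged(series, 2)`, for instance, builds a two-input combination from the two preceding hours.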

After selecting the input groups, it is essential to determine the possible training/testing ratios. The dataset's length considerably affects the AI models’ performances. This study employed nine different ratios (see Figure 7) to split the data into training and testing datasets to assess the applied model's performance. It is worth mentioning that the hyperparameters of the applied models were selected using the trial-and-error method because there is no straightforward approach to compute these critical parameters, which have considerable effects on the estimation accuracy. Figure 8 shows the main processes that are used in this research. The block diagram in Figure 8(a) shows the seven essential steps related to the research methodology, whereas more details on the models' development are given in Figure 8(b).
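The nine training/testing ratios can be generated with a few lines of Python. The sketch assumes a chronological split, in which the earliest records are used for training and the most recent ones for testing; the paper does not state whether the data were shuffled, so this is an assumption, as are the names `TRAIN_PERCENTAGES` and `chronological_split`.

```python
# Training shares of the nine scenarios: 50/50, 55/45, ..., 90/10.
TRAIN_PERCENTAGES = [50, 55, 60, 65, 70, 75, 80, 85, 90]


def chronological_split(records, train_percent):
    """Split time-ordered records into (train, test) without shuffling."""
    cut = len(records) * train_percent // 100
    return records[:cut], records[cut:]
```

Looping over `TRAIN_PERCENTAGES` and the candidate input combinations then gives one training and evaluation run per scenario, with the hyperparameters of each model adjusted by trial and error inside the loop.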

2.7. Performance Evaluation

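To assess the performance of the proposed models in predicting PM_2.5, different statistical metrics are employed, as shown below [71–73]:

(i) Mean absolute error (MAE):
$$MAE = \frac{1}{n}\sum_{i=1}^{n} \left| O_i - P_i \right|$$

(ii) Root mean square error (RMSE):
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( O_i - P_i \right)^2}$$

(iii) Correlation coefficient (R):
$$R = \frac{\sum_{i=1}^{n} (O_i - \bar{O})(P_i - \bar{P})}{\sqrt{\sum_{i=1}^{n} (O_i - \bar{O})^2 \sum_{i=1}^{n} (P_i - \bar{P})^2}}$$

(iv) Willmott index (WI):
$$WI = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} \left( |P_i - \bar{O}| + |O_i - \bar{O}| \right)^2}$$

(v) Nash-Sutcliffe efficiency (NSE):
$$NSE = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2}$$

where $\bar{O}$ and $\bar{P}$ are the averages of the measured and predicted values, respectively; $O_i$ and $P_i$ represent the measured and predicted values of PM_2.5 for $n$ total observations; and $|O_i - \bar{O}|$ is the deviation of a measured value from the measured mean.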
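The five metrics listed above can be computed together in plain Python. The sketch uses the standard textbook definitions of MAE, RMSE, the Pearson correlation coefficient, the Willmott index, and the Nash-Sutcliffe efficiency; the function name `evaluate` and the dictionary return format are choices made for this example.

```python
import math


def evaluate(observed, predicted):
    """Return MAE, RMSE, R, WI and NSE for paired observed/predicted values."""
    n = len(observed)
    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    pairs = list(zip(observed, predicted))

    sq_err = sum((o - p) ** 2 for o, p in pairs)
    o_dev = sum((o - o_mean) ** 2 for o in observed)
    p_dev = sum((p - p_mean) ** 2 for p in predicted)
    cov = sum((o - o_mean) * (p - p_mean) for o, p in pairs)
    wi_den = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2 for o, p in pairs)

    return {
        "MAE": sum(abs(o - p) for o, p in pairs) / n,
        "RMSE": math.sqrt(sq_err / n),
        "R": cov / math.sqrt(o_dev * p_dev),
        "WI": 1.0 - sq_err / wi_den,
        "NSE": 1.0 - sq_err / o_dev,
    }
```

MAE and RMSE are in the units of the data and should be as small as possible, whereas R, WI, and NSE are dimensionless and approach 1 for a perfect model. Note that R is undefined when either series is constant, so the function would raise a division error in that case.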

3. Results and Discussion

This section discusses the performance of the proposed models in forecasting the hourly concentration of PM_2.5 over a long period (from 01/01/2011 to 31/12/2020). Three input combinations and nine data length scenarios have been used to train and validate the models (ELM, GMDHNN, and GBR). The performance of the models through the training phase is given in Table 3. According to the training results, the ELM provides the most efficient predictions with the lowest forecasting errors (MAE: 0.9710 to 1.1099; RMSE: 1.6088 to 1.8329). In contrast, the general performance of the GBR is unsatisfactory compared to the other models, with higher errors (MAE: 3.7064 to 7.5851; RMSE: 4.8536 to 9.7894). The third model (GMDHNN) yields a better prediction capacity (MAE: 0.9898 to 1.1187; RMSE: 1.6495 to 1.8372) than the GBR model, but its performance is still lower than that of the ELM model through the training phase. The statistical parameters (i.e., RMSE and MAE) show that the ELM has an outstanding capability, providing excellent estimates despite considerable changes in the input variables and the length of the training dataset. On the other hand, the GBR model shows poor performance and an inability to deal with the extensive dataset. A further notable observation is that the performance of the ELM and GMDHNN models decreased when the training dataset was 50% of the total observations. In that case, both models had difficulty estimating PM_2.5 using a single input parameter.

After evaluating the forecasting errors, it is essential to analyze how the estimated values of PM_2.5 correlate with their corresponding observed values. In this regard, several performance metrics are computed, namely the Willmott index (WI), the correlation coefficient (R), and the Nash-Sutcliffe efficiency (NSE), as presented in Table 3. Overall, the results demonstrate that the ELM model provides more accurate estimates than the other models in all cases. In other words, the similarity between the observed values and those predicted by the ELM approach is promising. The R, WI, and NSE criteria for all cases range from 0.906 to 0.9171, 0.9489 to 0.9553, and 0.942 to 0.938, respectively. Similarly, the GMDHNN model yields good predictions, although slightly less accurate than those of the ELM model. On the other hand, the GBR model cannot mimic the fluctuation of PM_2.5 concentrations over time, providing poor estimates, with two of the agreement criteria ranging from 0.6228 to 0.8109 and from 0.798 to 0.547, respectively.

As the ELM approach presents excellent performance through the training phase, it is essential to validate this model using the testing data points. Several studies have emphasized that comparable models can be evaluated more effectively through the testing phase [28, 74]. The reason is that, in the training phase, the model is trained in the presence of both the input points and their corresponding target values, whereas in the testing phase it receives only the input vectors. Table 4 shows the performance evaluation of the proposed models through the testing phase, and it can be seen that the ELM model outperforms the other proposed models. In other words, the ELM provides estimates that are very close to the actual values (R ≈ 0.9001 to 0.9297; WI ≈ 0.9461 to 0.9573; NSE ≈ 0.9371 to 0.9281) with lower forecasting errors (RMSE ≈ 1.4049 to 1.5327; MAE ≈ 0.9001 to 0.9207) compared to the GBR and GMDHNN models. The results also show that the second-best model is GMDHNN, but it is not as accurate as the ELM model in dealing with fluctuating data such as PM_2.5. The GBR model, however, has problems capturing the dynamics of PM_2.5 over the time period.

3.1. The Effect of Data Length on the Predictive Models' Performances

This part of the study shows how the input variables and the length of the testing dataset affect the applied models' capacity to predict PM_2.5. In general, AI models require sufficient records and enough input vectors to provide accurate estimations. In this regard, this study considers forty-five different scenarios of input parameters and dataset length, as shown in Table 4. The results of the testing phase are used to analyze the models' performances. According to the statistical parameters, the ELM is more flexible than the other models with respect to changes in data size and lagged inputs. Moreover, the ELM requires only two input vectors when the testing data size ranges from 50% to 25% and adapts very well to changes in the data length. The results also demonstrate that if the testing data are reduced below 25%, the model requires more input vectors. As the length of the testing data decreases (i.e., 20%, 15%, and 10%), the training data employed in the model increases, and thus the training algorithm requires more inputs to complete the training and calibration processes efficiently. Accordingly, the proposed model is highly flexible with respect to changes in the length of the data and the number of inputs. Based on the ELM results, this model provides more accurate results when the testing data size ranges from 40% to 45% of the entire dataset.

The other comparable models, such as the GBR model, do not have a reasonable or deducible pattern in dealing with cases where there is a change in the percentage of training data and the number of inputs. On the contrary, the last model (GMDHNN) tends to have a single pattern that can be deduced by evaluating its performance through statistical coefficients. This model needs, in most cases, the largest possible number of inputs, and therefore it does not show any flexibility for small and large changes that occur in the volume of data used.

For further assessment, the 95% uncertainty criterion (U_95) is a very effective tool for selecting the most effective and reliable model [28]. Taking into consideration the RMSE and the SD (standard deviation of the computed errors), the mathematical expression of U_95 is as follows [75]:

$$U_{95} = 1.96\,\sqrt{SD^2 + RMSE^2}$$
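The uncertainty criterion can be computed as in the following sketch, which uses the common form U_95 = 1.96 * sqrt(SD^2 + RMSE^2). The function name `u95` is hypothetical, and the use of the population standard deviation of the errors (dividing by n rather than n - 1) is an assumption, since the paper does not specify which estimator was applied.

```python
import math


def u95(observed, predicted):
    """95% uncertainty: 1.96 * sqrt(SD^2 + RMSE^2) of the prediction errors."""
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    mean_err = sum(errors) / n
    # Population standard deviation of the errors (assumed estimator).
    sd = math.sqrt(sum((e - mean_err) ** 2 for e in errors) / n)
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    return 1.96 * math.sqrt(sd ** 2 + rmse ** 2)
```

A smaller U_95 indicates a narrower band around the predictions, so the model with the lowest value is preferred when several candidates are compared on the same testing data.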

For different splits and input lags of the testing dataset, Figure 9 shows the evaluation of the proposed models using U_95. The results demonstrate that the ELM provides the smallest value of U_95 compared to the other models. Furthermore, Figure 9 is consistent with the findings from the statistical parameters, which indicate that the effectiveness and accuracy of the ELM model reach their maximum when the testing data represent 40% to 45% of the total data. In contrast, the GBR recorded the highest uncertainty (U_95), followed by the GMDHNN model.

It is essential to check the reliability analysis (RA) of the comparable models. This type of analysis is considered very effective in evaluating the consistency and performance of the models. This novel statistical metric provides essential information for determining whether the suggested models achieve the minimum allowable level of accuracy. Therefore, the RA is quite helpful in deciding on and nominating the best model for air quality prediction. The formula below shows the mathematical expression for calculating the RA [76]:

$$RA = \frac{100}{n} \sum_{i=1}^{n} S_i \tag{16}$$

In (16), $n$ is the total number of PM_2.5 samples, and $S_i$ is the equivalent factor for each sample, whose value is either 1 or 0. $S_i$ depends mainly on the percentage relative error (RE), which is defined in the following equation:

$$RE_i = \left| \frac{O_i - P_i}{O_i} \right| \times 100$$

where $O_i$ and $P_i$ are the measured and predicted PM_2.5 values of sample $i$. If $RE_i$ falls within the allowable range, $S_i$ is set to 1; otherwise, it is set to 0. The allowable range is

Using the RA technique, this study evaluated the prediction accuracy of the applied models. Table 5 shows the results of this metric for every model throughout the training and testing stages. In this work, two data division combinations are selected. The first combination uses 60% of the data points for training and 40% for testing; in the second combination, 55% of the data records are used for training and the rest for validation (testing). According to the obtained results, the ELM generally performs better than the other models, achieving the highest RA value in both the training and testing phases. For example, in the first combination, the ELM obtains the highest RA value of 77.34%, followed by the GBR with 76.21%, while the GMDHNN produces the lowest RA of 58.123%. As a result, both the ELM and GBR models perform better than the GMDHNN model during the training stage. Concerning the testing stage, the results confirm that the ELM is the best model for estimating PM_2.5, with the highest RA value of 75.78%, followed by GBR (71.19%) and GMDHNN (55.16%). The RA results show that the ELM is more efficient in estimating hourly PM_2.5 than the other models. In addition, the appraisal of the models with the help of RA revealed that the best combination is the one in which the training data records make up 60% of the dataset.
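The RA metric discussed above counts the share of samples whose percentage relative error stays within an allowable limit. The sketch below illustrates this under explicit assumptions: the function name is hypothetical, the limit is passed in as `threshold_percent` because the allowable range is not recoverable from the text, and samples with a measured value of zero (for which the relative error is undefined) are counted as reliable only when the prediction is also zero.

```python
def reliability_analysis(observed, predicted, threshold_percent):
    """Percentage of samples whose relative error is within the threshold."""
    hits = 0
    for o, p in zip(observed, predicted):
        if o == 0:
            # Relative error is undefined; accept only an exact match (assumed rule).
            reliable = p == 0
        else:
            relative_error = abs(o - p) / abs(o) * 100.0
            reliable = relative_error <= threshold_percent
        hits += 1 if reliable else 0
    return 100.0 * hits / len(observed)
```

For instance, with an illustrative limit of 20%, `reliability_analysis([10, 10, 10, 10], [10, 11, 13, 9], 20)` gives 75.0, because three of the four predictions are within 20% of the measured value.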

The proposed models are also evaluated graphically using the box plot, violin diagram, and Taylor diagram (see Figures 10–12). According to Figure 10, the ELM provides more precise estimates of PM_2.5 than the other models. Furthermore, its median and interquartile range (IQR) errors are lower than those of the GMDHNN and GBR models.

Figure 11 presents the violin diagram, which integrates a boxplot and a density plot to illustrate the shape of the data distribution. This figure is created for visual comparison using the testing dataset for the best combination (training 60% and testing 40%). According to the violin diagram, the ELM model can efficiently mimic the actual data distribution and provides closer agreement between the actual and predicted records. Although the GMDHNN model outcomes are similar to the actual data distribution, it generates negative values that affect the model's performance. On the other hand, the GBR model performs poorly in mimicking the actual data distribution and also generates negative values. Figure 12 presents the Taylor diagram, a polar plot based on the correlation coefficient, standard deviation, and root mean square error that is used to evaluate the models' performance. According to Figure 12, the ELM model simulates PM_2.5 more closely to the actual values than the other models.

Overall, the results obtained in this study show that the ELM model is more reliable in estimating hourly PM_2.5 and more robust to changes in the data features and the length of the training data.

Figure 13 gives valuable insight into the practical implementation of the proposed approaches. In particular, data such as PM_2.5, as well as temperature, humidity, and air pressure, are collected from air pollution stations. Next, the data are processed on cloud servers using machine learning approaches. Finally, the prediction results can be accessed through a software application interface in real time. A complete implementation example is shown in Figure 14, which shows the City Air Quality Management application from Siemens [77]. It is an AI-based, multi-platform application available worldwide that combines the latest air pollution measurements with the latest AI approaches in order to predict pollutant concentrations for the coming days. The application can predict pollution three days ahead with 90% accuracy and five days ahead with 80% accuracy.

4. Conclusion

Three AI models, namely ELM, GMDHNN, and GBR, have been used to predict the hourly PM_2.5 concentrations at the Dorset station, located in Canada. The case study covers the period from 2011 to 2020. Accurate estimation of hourly air pollutants via AI models requires proper input features and enough data records for model training. In this study, three input combinations are selected via the partial autocorrelation function (PACF), and nine data length scenarios are used to validate the models and to select the model that adapts most efficiently to the changes. The findings of this study can be summarized as follows:
(i) The ELM model generally performs better in estimating PM_2.5 than the comparable models, producing fewer errors (MAE: 0.9710 to 1.1099; RMSE: 1.6088 to 1.8329).
(ii) The flexibility of the ELM model in dealing with changes in the size of the training data and different training conditions is remarkable. The results showed that the ELM model demands fewer input vectors when the testing data size ranges from 50% to 25% of the entire data observations. However, the model requires additional input features in other cases, primarily when the training data represent 80% to 90%.
(iii) None of the models except the ELM shows a rational pattern consistent with the changes that occur in the training process.
(iv) The results of this study reveal that the optimal training data, which provide the most accurate estimates, represent 60% of the obtained records.

This study recommends the following:
(i) Applying the proposed methodology to find the optimal training and testing ratios for other pollution series, such as ozone, nitrogen dioxide, sulfur dioxide, and carbon monoxide
(ii) Using a feature selection method instead of PACF and ACF to select the best inputs
(iii) Investigating deep learning models (e.g., LSTM) to predict the PM_2.5 concentrations
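For context on recommendation (ii), PACF-based input selection of the kind used in this study can be sketched as follows. This is a minimal sketch under stated assumptions: the PACF at lag k is estimated as the last coefficient of an ordinary least squares AR(k) fit, lags are kept when they exceed the approximate 95% bound of 1.96/sqrt(N), and the series is a synthetic AR(1) process. The function names `pacf_ols` and `significant_lags` are hypothetical, and the paper does not state which estimator or threshold the authors used.

```python
import numpy as np


def pacf_ols(series, max_lag):
    """Estimate the PACF at lags 1..max_lag as the last coefficient of an AR(k) OLS fit."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    values = []
    for k in range(1, max_lag + 1):
        # Columns are x[t-1], ..., x[t-k] for t = k, ..., n-1.
        lagged = np.column_stack([x[k - j:n - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(lagged, x[k:], rcond=None)
        values.append(coef[-1])
    return np.array(values)


def significant_lags(series, max_lag=10):
    """Keep lags whose PACF magnitude exceeds the approximate 95% bound."""
    bound = 1.96 / np.sqrt(len(series))
    values = pacf_ols(series, max_lag)
    return [k + 1 for k, v in enumerate(values) if abs(v) > bound]


# Synthetic AR(1) series with coefficient 0.7 (an assumption for illustration).
rng = np.random.default_rng(0)
n_obs = 5000
noise = rng.normal(size=n_obs)
ar1 = np.zeros(n_obs)
for t in range(1, n_obs):
    ar1[t] = 0.7 * ar1[t - 1] + noise[t]

pacf_values = pacf_ols(ar1, 10)
lags = significant_lags(ar1, 10)
print("PACF at lag 1:", round(float(pacf_values[0]), 3))
print("Selected input lags:", lags)
```

A feature selection method, as recommended above, would replace the fixed significance threshold with a search over candidate input sets scored by model performance.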

Abbreviations

XGBoost:	Extreme gradient boosting
RPE:	Relative prediction error
SVM:	Support vector machine
ANN:	Artificial neural network
R:	Correlation coefficient
R^2:	Coefficient of determination
MSE:	Mean square error
MAE:	Mean absolute error
RMSE:	Root mean square error
IA:	Index of agreement
MAPE:	Mean absolute percentage error
RMSPE:	Root mean square prediction error
MPE:	Mean prediction error
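EEMD-GRNN:	Ensemble empirical mode decomposition-general regression neural network
ANFIS:	Adaptive neuro-fuzzy inference system
MLR:	Multiple linear regression
GTWR:	Geographically and temporally weighted regression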
LR:
GRNN:	General regression neural network
RBF:	Radial basis function
SVR:	Support vector regression
PCR:	Principal component regression
ARIMA:	Autoregressive integrated moving average
NELRM:
RF:	Random forest.

Data Availability

The data are available upon request to the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank Al-Maarif University College for funding this research.

References

M. Zamani Joharestani, C. Cao, X. Ni, B. Bashir, and S. Talebiesfandarani, “PM_2.5 prediction based on random forest, XGBoost, and deep learning using multisource remote sensing data,” Atmosphere, vol. 10, no. 7, p. 373, 2019.
M. Ghoneim and S. M. Hamed, “Towards a smart sustainable city: air pollution detection and control using internet of things,” in 2019 5th International Conference on Optimization and Applications (ICOA), pp. 1–6, Kenitra, Morocco, 2019.
M. M. Aljumaily, M. A. Alsaadi, N. A. Binti Hashim et al., “Superhydrophobic nanocarbon-based membrane with antibacterial characteristics,” Biotechnology Progress, vol. 36, no. 3, p. e2963, 2020.
J. I. R. Molano, L. M. O. Bobadilla, and M. P. R. Nieto, “Of cities traditional to smart cities,” in 2018 13th Iberian Conference on Information Systems and Technologies (CISTI), pp. 1–6, Caceres, Spain, 2018.
W. Sun and J. Sun, “Daily PM2.5 concentration prediction based on principal component analysis and LSSVM optimized by cuckoo search algorithm,” Journal of Environmental Management, vol. 188, pp. 144–152, 2017.
K. Gan, S. Sun, S. Wang, and Y. Wei, “A secondary-decomposition-ensemble learning paradigm for forecasting PM2.5 concentration,” Atmospheric Pollution Research, vol. 9, no. 6, pp. 989–999, 2018.
J. Du, F. Qiao, and L. Yu, “Temporal characteristics and forecasting of PM2.5 concentration based on historical data in Houston, USA,” Resources, Conservation and Recycling, vol. 147, pp. 145–156, 2019.
T. Washington, HM and EU of the Cost of Air Pollution: Strengthening the Economic Case for Action, Evaluation University of Washington, Seattle, Washington DC, USA, 2016.
Y. Qi, Q. Li, H. Karimian, and D. Liu, “A hybrid model for spatiotemporal forecasting of PM2.5 based on graph convolutional neural network and long short-term memory,” Science of the Total Environment, vol. 664, pp. 1–10, 2019.
A. H. Al Hanai, D. S. Antkiewicz, J. D. Hemming et al., “Seasonal variations in the oxidative stress and inflammatory potential of PM2.5 in Tehran using an alveolar macrophage model; the role of chemical composition and sources,” Environment International, vol. 123, pp. 417–427, 2019.
J. Evans, A. van Donkelaar, R. V. Martin et al., “Estimates of global mortality attributable to particulate air pollution using satellite imagery,” Environmental Research, vol. 120, pp. 33–42, 2013.
D. Rojas-Rueda, A. de Nazelle, O. Teixidó, and M. J. Nieuwenhuijsen, “Health impact assessment of increasing public transport and cycling use in Barcelona: a morbidity and burden of disease approach,” Preventive Medicine, vol. 57, no. 5, pp. 573–579, 2013.
J. Lelieveld, J. S. Evans, M. Fnais, D. Giannadaki, and A. Pozzer, “The contribution of outdoor air pollution sources to premature mortality on a global scale,” Nature, vol. 525, no. 7569, pp. 367–371, 2015.
Y. Sathe, S. Kulkarni, P. Gupta, A. Kaginalkar, S. Islam, and P. Gargava, “Application of moderate resolution imaging spectroradiometer (MODIS) aerosol optical depth (AOD) and weather research forecasting (WRF) model meteorological data for assessment of fine particulate matter (PM2.5) over India,” Atmospheric Pollution Research, vol. 10, no. 2, pp. 418–434, 2019.
G. Zhou, J. Xu, Y. Xie et al., “Numerical air quality forecasting over eastern China: an operational application of WRF-Chem,” Atmospheric Environment, vol. 153, pp. 94–108, 2017.
P. Perez and E. Gramsch, “Forecasting hourly PM2.5 in Santiago de Chile with emphasis on night episodes,” Atmospheric Environment, vol. 124, pp. 22–27, 2016.
B. Lv, W. G. Cobourn, and Y. Bai, “Development of nonlinear empirical models to forecast daily PM2.5 and ozone levels in three large Chinese cities,” Atmospheric Environment, vol. 147, pp. 209–223, 2016.
B. Lyu, Y. Zhang, and Y. Hu, “Improving PM2.5 air quality model forecasts in China using a bias-correction framework,” Atmosphere, vol. 8, no. 12, p. 147, 2017.
L. Zhang, J. Lin, R. Qiu et al., “Trend analysis and forecast of PM2.5 in Fuzhou, China using the ARIMA model,” Ecological Indicators, vol. 95, pp. 702–710, 2018.
S. Mahajan, L.-J. Chen, and T.-C. Tsai, “Short-term PM2.5 forecasting using exponential smoothing method: a comparative analysis,” Sensors, vol. 18, no. 10, p. 3223, 2018.
A. B. Chelani, “Estimating PM2.5 concentration from satellite derived aerosol optical depth and meteorological variables using a combination model,” Atmospheric Pollution Research, vol. 10, no. 3, pp. 847–857, 2019.
H. J. Fernando, M. Mammarella, G. Grandoni et al., “Forecasting PM10 in metropolitan areas: efficacy of neural networks,” Environmental Pollution, vol. 163, pp. 62–67, 2012.
W. Qiao, W. Tian, Y. Tian, Q. Yang, Y. Wang, and J. Zhang, “The forecasting of PM2.5 using a hybrid model based on wavelet transform and an improved deep learning algorithm,” IEEE Access, vol. 7, pp. 142814–142825, 2019.
H. Liu and C. Chen, “Prediction of outdoor PM2.5 concentrations based on a three-stage hybrid neural network model,” Atmospheric Pollution Research, vol. 11, no. 3, pp. 469–481, 2020.
T. Li, H. Shen, Q. Yuan, and L. Zhang, “Deep learning for ground-level PM_2.5 prediction from satellite remote sensing data,” in IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 7581–7584, Valencia, Spain, 2018.
Y. Wang, “Regional-level prediction model with advection PDE model and fine particulate matter (PM 2.5) concentration data,” Physica Scripta, vol. 95, no. 3, Article ID 035204, 2020.
M. M. Hameed, F. Khaleel, M. A. Abed, D. Khaleel, and M. K. Alomar, “An effective predictive model for daily evapotranspiration based on a limited number of meteorological parameters,” in 2021 3rd International Sustainability and Resilience Conference: Climate Change, pp. 495–499, Sakheer, Bahrain, 2021.
M. M. Hameed, M. K. AlOmar, F. Khaleel, and N. Al-Ansari, “An extra tree regression model for discharge coefficient prediction: novel, practical applications in the hydraulic sector and future research directions,” Mathematical Problems in Engineering, vol. 2021, Article ID 7001710, 19 pages, 2021.
M. M. Hameed, F. Khaleel, and D. Khaleel, “Employing a robust data-driven model to assess the environmental damages caused by installing grouted columns,” in 2021 Third International Sustainability and Resilience Conference: Climate Change, pp. 305–309, Sakheer, Bahrain, 2021.
A. Dikshit, B. Pradhan, and M. Santosh, “Artificial neural networks in drought prediction in the 21st century–A scientometric analysis,” Applied Soft Computing, vol. 114, Article ID 108080, 2022.
G. Hinton, “Deep belief networks,” Scholarpedia, vol. 4, no. 5, p. 5947, 2009.
F. Khaleel, M. M. Hameed, D. Khaleel, and M. K. AlOmar, “Applying an efficient AI approach for the prediction of bearing capacity of shallow foundations,” in International Conference on Emerging Technology Trends in Internet of Things and Computing, pp. 310–323, Berlin, Germany, 2022.
P. Du, J. Wang, W. Yang, and T. Niu, “A novel hybrid fine particulate matter (PM2.5) forecasting and its further application system: case studies in China,” Journal of Forecasting, vol. 41, no. 1, pp. 64–85, 2022.
A. Stojić, G. Jovanovic, S. Stanisic et al., “The PM2.5-bound polycyclic aromatic hydrocarbon behavior in indoor and outdoor environments, part II: explainable prediction of benzo [a] pyrene levels,” Chemosphere, vol. 289, Article ID 133154, 2022.
P. Liu, E. Yao, T. Liu, L. Kong, X. Tang, and G. Tan, “Improvement of AI forecast of gridded PM2.5 forecast in China through ConvLSTM and Attention,” CCF Transactions on High Performance Computing, vol. 4, pp. 104–119, 2022.
P. Zhang, L. Yang, W. Ma, N. Wang, F. Wen, and Q. Liu, “Spatiotemporal estimation of the PM2.5 concentration and human health risks combining the three-dimensional landscape pattern index and machine learning methods to optimize land use regression modeling in Shaanxi, China,” Environmental Research, vol. 208, Article ID 112759, 2022.
Q. Xiao, G. Geng, J. Cheng et al., “Evaluation of gap-filling approaches in satellite-based daily PM2.5 prediction models,” Atmospheric Environment, vol. 244, Article ID 117921, 2021.
I. Yeo, Y. Choi, Y. Lops, and A. Sayeed, “Efficient PM2.5 forecasting using geographical correlation based on integrated deep learning algorithms,” Neural Computing & Applications, vol. 33, no. 22, pp. 15073–15089, 2021.
M. H. Nguyen, P. Le Nguyen, K. Nguyen, V. A. Le, T. H. Nguyen, and Y. Ji, “PM2.5 prediction using genetic algorithm-based feature selection and encoder-decoder model,” IEEE Access, vol. 9, pp. 57338–57350, 2021.
S. Chae, J. Shin, S. Kwon, S. Lee, S. Kang, and D. Lee, “PM10 and PM2.5 real-time prediction models using an interpolated convolutional neural network,” Scientific Reports, vol. 11, no. 1, pp. 11952–11959, 2021.
E. Kristiani, H. Lin, J.-R. Lin, Y.-H. Chuang, C.-Y. Huang, and C.-T. Yang, “Short-term prediction of PM2.5 using LSTM deep learning methods,” Sustainability, vol. 14, no. 4, p. 2068, 2022.
J. Ni, Y. Chen, Y. Gu, X. Fang, and P. Shi, “An improved hybrid transfer learning-based deep learning model for PM2.5 concentration prediction,” Applied Sciences, vol. 12, no. 7, p. 3597, 2022.
K. Huang, Q. Xiao, X. Meng et al., “Predicting monthly high-resolution PM2.5 concentrations with random forest model in the North China Plain,” Environmental Pollution, vol. 242, pp. 675–683, 2018.
A. Masood and K. Ahmad, “A review on emerging artificial intelligence (AI) techniques for air pollution forecasting: fundamentals, application and performance,” Journal of Cleaner Production, vol. 322, Article ID 129072, 2021.
Z. Shang and J. He, “Predicting hourly PM2.5 concentrations based on random forest and ensemble neural network,” in Proceedings 2018 Chinese Automation Congress, CA 2018, pp. 2341–2345, 2019.
W. Yuchi, E. Gombojav, B. Boldbaatar et al., “Evaluation of random forest regression and multiple linear regression for predicting indoor fine particulate matter concentrations in a highly polluted city,” Environmental Pollution, vol. 245, pp. 746–753, 2019.
J. Murillo-Escobar, J. P. Sepulveda-Suescun, M. A. Correa, and D. Orrego-Metaute, “Forecasting concentrations of air pollutants using support vector regression improved with particle swarm optimization: case study in Aburrá Valley, Colombia,” Urban Climate, vol. 29, Article ID 100473, 2019.
M. Najafzadeh and F. Saberi-Movahed, “GMDH-GEP to predict free span expansion rates below pipelines under waves,” Marine Georesources & Geotechnology, vol. 37, no. 3, pp. 375–392, 2019.
M. Najafzadeh and H. M. Azamathulla, “Neuro-fuzzy GMDH to predict the scour pile groups due to waves,” Journal of Computing in Civil Engineering, vol. 29, no. 5, Article ID 4014068, 2015.
M. M. Hameed, M. A. Abed, N. Al-Ansari, and M. K. Alomar, “Predicting compressive strength of concrete containing industrial waste materials: novel and hybrid machine learning model,” Advances in Civil Engineering, vol. 2022, Article ID 5586737, 19 pages, 2022.
S. Moisan, R. Herrera, and A. Clements, “A dynamic multiple equation approach for forecasting PM2.5 pollution in Santiago, Chile,” International Journal of Forecasting, vol. 34, no. 4, pp. 566–581, 2018.
A. Masood and K. Ahmad, “A model for particulate matter (PM2.5) prediction for Delhi based on machine learning approaches,” Procedia Computer Science, vol. 167, pp. 2101–2110, 2020.
J. Amanollahi and S. Ausati, “PM2.5 concentration forecasting using ANFIS, EEMD-GRNN, MLP, and MLR models: a case study of Tehran, Iran,” Air Quality, Atmosphere & Health, vol. 13, no. 2, pp. 161–171, 2020.
M. Mirzaei, J. Amanollahi, and C. G. Tzanis, “Evaluation of linear, nonlinear, and hybrid models for predicting PM2.5 based on a GTWR model and MODIS AOD data,” Air Quality, Atmosphere & Health, vol. 12, no. 10, pp. 1215–1224, 2019.
S. S. Ganesh, P. Arulmozhivarman, and V. S. N. R. Tatavarti, “Prediction of PM2.5 using an ensemble of artificial neural networks and regression models,” Journal of Ambient Intelligence and Humanized Computing, 2018.
Q. Zhou, H. Jiang, J. Wang, and J. Zhou, “A hybrid model for PM2.5 forecasting based on ensemble empirical mode decomposition and a general regression neural network,” Science of the Total Environment, vol. 496, pp. 264–274, 2014.
X. Mao, T. Shen, and X. Feng, “Prediction of hourly ground-level PM2.5 concentrations 3 days in advance using neural networks with satellite data in eastern China,” Atmospheric Pollution Research, vol. 8, no. 6, pp. 1005–1015, 2017.
Z.-Y. Chen, T. H. Zhang, R. Zhang et al., “Extreme gradient boosting model to estimate PM2.5 concentrations with missing-filled satellite data in China,” Atmospheric Environment, vol. 202, pp. 180–189, 2019.
X. Hu, J. H. Belle, X. Meng et al., “Estimating PM2.5 concentrations in the conterminous United States using the random forest approach,” Environmental Science and Technology, vol. 51, no. 12, pp. 6936–6944, 2017.
Z. A. Al Sudani and G. S. A. Salem, “Evaporation rate prediction using advanced machine learning models: a comparative study,” Advances in Meteorology, vol. 2022, Article ID 1433835, 13 pages, 2022.
M. M. Hameed, M. K. AlOmar, W. J. Baniya, and M. A. AlSaadi, “Prediction of high-strength concrete: high-order response surface methodology modeling approach,” Engineering with Computers, vol. 38, no. S2, pp. 1655–1668, 2022.
N. Nabipour, S. N. Qasem, E. Salwana, and A. Baghban, “Evolving LSSVM and ELM models to predict solubility of non-hydrocarbon gases in aqueous electrolyte systems,” Measurement, vol. 164, Article ID 107999, 2020.
“Health impacts of air pollution in Canada,” 2021, https://www.canada.ca/en/health-canada/services/publications/healthy-living/2021-health-effects-indoor-air-pollution.html.
M. K. AlOmar, M. M. Hameed, and M. A. AlSaadi, “Multi hours ahead prediction of surface ozone gas concentration: robust artificial intelligence approach,” Atmospheric Pollution Research, vol. 11, no. 9, pp. 1572–1587, 2020.
G. Bin Huang, Q. Y. Zhu, and C. K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, 2006.
D. Wang and G.-B. Huang, “Protein sequence classification using extreme learning machine,” in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, pp. 1406–1411, Montreal, Que, 2005.
G.-B. Huang, Q.-Y. Zhu, and C. K. Siew, “Real-time learning capability of neural networks,” IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 863–878, 2006.
A. G. Ivakhnenko, “Heuristic self-organization in problems of engineering cybernetics,” Automatica, vol. 6, no. 2, pp. 207–219, 1970.
J. H. Friedman, “Greedy function approximation: a gradient boosting machine 1 function estimation 2 numerical optimization in function space,” North, vol. 1, no. 3, pp. 1–10, 1999.
J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
M. K. Alomar, M. M. Hameed, N. Al-Ansari, and M. A. Alsaadi, “Data-Driven model for the prediction of total dissolved gas: robust artificial intelligence approach,” Advances in Civil Engineering, vol. 2020, Article ID 6618842, 20 pages, 2020.
M. Despotovic, V. Nedic, D. Despotovic, and S. Cvetanovic, “Review and statistical analysis of different global solar radiation sunshine models,” Renewable and Sustainable Energy Reviews, vol. 52, pp. 1869–1880, 2015.
A. Yafouz, A. N. Ahmed, N. Zaini, M. Sherif, A. Sefelnasr, and A. El-Shafie, “Hybrid deep learning model for ozone concentration prediction: comprehensive evaluation and comparison with various machine and deep learning algorithms,” Engineering Applications of Computational Fluid Mechanics, vol. 15, no. 1, pp. 902–933, 2021.
M. M. Hameed, M. K. AlOmar, S. F. Mohd Razali et al., “Application of artificial intelligence models for evapotranspiration prediction along the southern coast of Turkey,” Complexity, vol. 2021, Article ID 8850243, 20 pages, 2021.
M. M. Hameed, F. Khaleel, M. K. AlOmar, S. F. Mohd Razali, and M. A. AlSaadi, “Optimising the selection of input variables to increase the predicting accuracy of shear strength for deep beams,” Complexity, vol. 2022, Article ID 6532763, 23 pages, 2022.
F. Saberi-Movahed, M. Najafzadeh, and A. Mehrpooya, “Receiving more accurate predictions for longitudinal dispersion coefficients in water pipelines: training group method of data handling using extreme learning machine conceptions,” Water Resources Management, vol. 34, no. 2, pp. 529–561, 2020.
“Artificial intelligence improves air quality,” 2019, https://new.siemens.com/global/en/company/stories/infrastructure/2019/artificial-intelligence-improves-air-quality.html.

Copyright

Copyright © 2022 Mohamed Khalid AlOmar et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
