Air pollution is one of humanity's most critical environmental issues and is a contentious topic in several countries worldwide. As a result, accurate prediction is critical for human health management and for government decision-making in environmental management. In this study, three artificial intelligence (AI) approaches, namely group method of data handling neural network (GMDHNN), extreme learning machine (ELM), and gradient boosting regression (GBR) tree, are used to predict the hourly concentration of PM2.5 at the Dorset station in Canada. The investigation quantifies the effect of data length on AI modeling performance. Accordingly, nine different ratios (50/50, 55/45, 60/40, 65/35, 70/30, 75/25, 80/20, 85/15, and 90/10) are employed to split the data into training and testing datasets for assessing the performance of the applied models. The results showed that the data division significantly affected the models' capacity, and the 60/40 ratio was found to be the most suitable for developing predictive models. Furthermore, the results showed that the ELM model provides more precise predictions of PM2.5 concentrations than the other models. A vital feature of the ELM model is also its ability to adapt to changes in the training and testing data ratio. To summarize, the results reported in this study demonstrate an efficient method for selecting the optimal dataset ratio and the best AI model, which would be helpful in designing accurate models for solving different environmental issues.

1. Introduction

1.1. Background

Urbanization and industrialization have increased air pollution, which is considered one of the most pressing public health challenges of our time [1]. Pollution can occur both indoors and outdoors [2, 3]. Both are equally dangerous, despite their differing sources. The main difference between them is that indoor pollution may be addressed using air filters and odour absorbers, whereas there are no equally effective means of monitoring and detecting outdoor air pollution so that it can be prevented [2]. Several studies suggest that, by 2050, most of the global population will reside in urban areas [4]. Therefore, an effective method of monitoring and predicting air pollution, particularly fine particulate matter (FPM), is critical [5–7]. PM2.5 is atmospheric particulate matter whose small equivalent diameter enables it to remain suspended in the atmosphere for an extended period. Furthermore, PM2.5 generally consists of carbon, nitrate compounds, sulfur, heavy metals, and other substances such as sea salt and sand [8], and it leads to various respiratory diseases, nervous system damage, cancer, cardiovascular diseases, etc. [9–12]. Air pollution becomes more severe as the PM2.5 concentration increases. Worldwide, about 3.15 million premature deaths each year are caused by exposure to high concentrations of PM2.5. Overall, outdoor pollution causes 3.3 million deaths yearly [13]. Consequently, accurate predictions of PM2.5 concentrations are critical for enhancing the public health system and developing an early warning system that predicts pollutant levels. Such a warning system can significantly help people, especially those suffering from chronic diseases, to avoid exposure to air pollutants at peak times when pollution reaches levels that affect their health.

1.2. Previous Works

In the last decades, several studies have been conducted to predict PM2.5 concentrations. These studies can roughly be categorized into conventional (deterministic and statistical) and artificial intelligence (AI) approaches. The deterministic approach is based on the weather research and forecasting and community multiscale air quality models [14]. Calculations based on the deterministic model can account for abrupt changes in weather phenomena that cause the diffusion of atmospheric particles, and they perform well over extended periods [15]. The deterministic approaches rely on numerical simulation to obtain large-scale results. However, these models are time-consuming because they require many computational resources, which limits their comprehensive implementation [16]. On the other hand, statistical models such as nonlinear regression [17], classification and regression trees linear model-Kalman filter analog combination [18], autoregressive integrated moving average [19], exponential smoothing with drift model [20], and combination model [21] are more efficient, quicker, and easier to apply than deterministic models [22]. However, the performance of the statistical models is relatively poor since the characteristics of PM2.5 are volatile across different samples [23, 24]. Zhang et al. [19] used an autoregressive integrated moving average to evaluate and predict the trend of PM2.5 concentrations. However, the result showed that the model was outdated, which reduced its accuracy. Thus, the model could only predict the trend. Furthermore, several factors influence the complicated formation of PM2.5, such as meteorological factors (e.g., wind speed, humidity), population, and road network. The relation between these factors is highly nonlinear and complicated, making it almost impossible to capture using conventional methods [25, 26].

Machine learning (ML) has made tremendous progress in recent years in solving numerous engineering problems in general [27–32] and in predicting PM2.5 concentration in particular [33–42]. ML combines data science, statistics, and computing in an interdisciplinary fashion. Regarding PM2.5 concentration prediction, ML methods have been shown to perform better than traditional statistical models since they can handle nonlinear relationships and interactions between variables [43, 44]. Furthermore, ML methods are valuable tools for tracking the pollution baseline and have been proven to identify pollution hotspots accurately. Moreover, many variables from air quality and meteorological data can be analyzed using these techniques to enhance the understanding of their patterns and predict weather phenomena such as haze, air pollution, and visibility. Shang and He [45] combined random forest and an ensemble neural network to predict hourly PM2.5 concentrations, and the proposed modeling method performed well. Wang et al. [46] used two stand-alone models, one of which was multiple linear regression, and an ensemble model that combines them to forecast indoor hourly PM2.5 concentrations. Different meteorological and air quality parameters were considered to develop the proposed models. The results showed that the ensemble model provided better accuracy than the stand-alone models. Additionally, the results showed that the model has significant potency in predicting PM2.5 concentrations. Murillo et al. [47] proposed three machine learning models, namely artificial neural network (ANN), support vector regression (SVR), and a hybrid model that couples one of these models with a particle swarm optimization algorithm, to predict PM2.5 concentrations one day in advance. The models were developed using various air quality and meteorological parameters. The results showed that the hybrid model performed better in predicting PM2.5 concentrations than the other models. Hybrid models can find more efficient solutions than traditional ones [48]. In other words, researchers usually combine bio-inspired algorithms with classical models to enhance the capability of these models and hence achieve excellent predictive results [49, 50]. Furthermore, these algorithms are frequently given particular roles, such as optimizing the hyperparameters of the model, which are very difficult to compute via traditional methods.

Moisan et al. [51] compared the performance of three models, including a dynamic multiple equations model and a seasonal model with exogenous variables, in predicting hourly PM2.5 concentrations. For model development, different historical pollution and meteorological parameters were considered as inputs for the proposed models. The results showed that one of the models performed better than the others during severe episodes. For more examples, Table 1 gives a brief summary of the approaches applied to PM2.5 concentration prediction. Based on the papers reviewed in that table, researchers did not pay considerable attention to the data division when training the AI and statistical models. Only a few ratios (70/30, 80/20, and 90/10) were employed to split the data into training and testing datasets to assess the performance of the applied models. However, the division of the data into training and testing datasets can significantly influence model efficiency. In other words, an excessively long training dataset may make the model overfit the data, whereas insufficient training data may significantly reduce prediction accuracy, dramatically lowering the chance of obtaining valid estimates.

1.3. Research Motivation

Because accurate prediction of PM2.5 is very important for managers to stay alert, establish a robust early warning system, and minimize adverse health effects and associated costs, this study investigates the influence of data partitioning on the models' efficiency. To the best of the authors' knowledge, the selection of the best training and testing data ratio has not been investigated yet. The approaches reported in Table 1 show that researchers preferred to predict air pollution using ANN-based models. However, newer ANN modeling approaches such as ELM have not been applied to air pollution forecasting. In addition, the models explored in this study, despite their wide popularity in solving complex engineering problems [27, 29, 60–62], were not used in previous works to predict the concentration of PM2.5. Therefore, these modeling approaches and their capacity are explored here in more detail.

2. Methodology

2.1. Case Study and Data Collection

In this study, the hourly PM2.5 concentration data from Dorset station from January 01, 2011 to December 31, 2020 are collected. Dorset station is located in the province of Ontario, Canada. The location of the study area and of the studied station and the distribution of pollution over Canada are provided in Figures 1(a) and 1(b), respectively. More information about the studied station and the statistical description of the data is presented in Table 2. Furthermore, Figure 2 shows the Dorset ambient air monitoring station.

2.2. Data Cleaning

Pollutant data such as PM2.5 are usually measured using several instruments or sensors. However, sensors are susceptible to issues such as power failures, maintenance outages, and unstable network equipment, which lead to missing measurements, zero values, negative values, null values, or values that exceed the normal range. Consequently, the accuracy of model predictions may be affected if data containing such defects are used directly as input.

In this study, the percentage of missing PM2.5 data is low (1.78%). To compensate for the missing values, linear interpolation between neighboring values and piecewise cubic spline interpolation were both tested before making predictions. However, the piecewise cubic spline interpolation produced unrealistic and negative values, making it unreliable for filling missing air pollution data. This finding is consistent with other studies regarding the unreliability of piecewise cubic spline interpolation for compensating for missing values [64]. As a result, linear interpolation was adopted to replace the missing values. This method is also chosen because the runs of missing values are short, making it easy to recover the hourly conditions from the surrounding data. The adopted formula is defined in terms of the time-series target, the time-series duration, and the predicted value of the missing item, and it starts from the last normal observation before the range of missing points.
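The gap-filling step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `fill_missing_linear` and the sample values are hypothetical, and missing records are assumed to be encoded as NaN.

```python
import numpy as np


def fill_missing_linear(values):
    """Fill NaN gaps by linear interpolation between the nearest valid neighbors.

    Assumption: missing hourly records are encoded as NaN.
    """
    y = np.asarray(values, dtype=float).copy()
    missing = np.isnan(y)
    if missing.all():
        raise ValueError("series has no valid observations")
    idx = np.arange(y.size)
    # np.interp draws a straight line between valid neighbors; gaps at the
    # edges of the series are filled with the nearest valid value.
    y[missing] = np.interp(idx[missing], idx[~missing], y[~missing])
    return y


# Hypothetical hourly readings with two short gaps.
series = [4.0, np.nan, np.nan, 10.0, 8.0, np.nan, 6.0]
filled = fill_missing_linear(series)
print(filled)
```

Because the interpolated values always lie between the two neighboring observations, this approach cannot produce the negative concentrations reported for the cubic spline.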

2.3. Extreme Learning Machine

The extreme learning machine (ELM) is a robust and simple learning algorithm designed by Huang et al. [65] for the single hidden layer feedforward neural network (SLFN). Unlike gradient-based algorithms, the ELM learns significantly faster while providing better generalization, since it avoids the complexities of local minima, learning rates, and epochs, which are considerable drawbacks of other models. Furthermore, the ELM model is user-friendly, easy to comprehend, and provides minimum training errors with small-norm weights [66, 67]. The ELM network consists of input, hidden, and output layers. In the input layer, the data are provided to the network. Among the three layers, the hidden layer is considered the most fundamental since the computations are carried out in it, and it serves as a bridge between the input layer and the output layer, in which the results are organized. Given the samples of a training dataset, the ELM output function is expressed in terms of the hidden nodes and the activation function.
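For reference, the ELM output function is commonly written in the following form in the ELM literature. The symbols are assumptions introduced here for illustration and may differ from the authors' notation: N training samples, L hidden nodes, activation function g, input weights w_i, biases b_i, and output weights β_i.

```latex
f(\mathbf{x}_j) = \sum_{i=1}^{L} \beta_i \, g\!\left(\mathbf{w}_i \cdot \mathbf{x}_j + b_i\right),
\qquad j = 1, \dots, N
```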

The input weights and biases of the hidden nodes are assigned randomly, while the output weights are calculated analytically. The output function can be written compactly in matrix form, where Z is the output matrix, and the output weights are then computed with the help of the transpose of the hidden-layer output matrix. Figure 3 shows the main structure of the ELM.
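The training procedure described above (random hidden layer, analytical output weights) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the class name `TinyELM`, the sigmoid activation, the number of hidden nodes, and the synthetic data are all hypothetical, and the output weights are obtained with the Moore-Penrose pseudoinverse, which is the usual choice in the ELM literature.

```python
import numpy as np


class TinyELM:
    """Single-hidden-layer ELM sketch (hypothetical helper, sigmoid activation)."""

    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Hidden-layer output matrix H, one row per sample.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, z):
        X = np.asarray(X, dtype=float)
        # Input weights and biases are drawn at random and never updated.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights solve H @ beta = z in the least-squares sense.
        self.beta = np.linalg.pinv(H) @ np.asarray(z, dtype=float)
        return self

    def predict(self, X):
        return self._hidden(np.asarray(X, dtype=float)) @ self.beta


# Synthetic example: learn a smooth nonlinear function.
X = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
z = np.sin(3.0 * X[:, 0])
model = TinyELM(n_hidden=20, seed=0).fit(X, z)
pred = model.predict(X)
rmse = float(np.sqrt(np.mean((z - pred) ** 2)))
print(f"training RMSE: {rmse:.4f}")
```

No iterative optimization is involved: the only fitted quantity is `beta`, which is why training is fast compared with gradient-based networks.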

2.4. Group Method of Data Handling

Ivakhnenko first proposed the group method of data handling (GMDH) approach as a polynomial neural network to capture the complex relationship between the input and output in a nonlinear system [68]. Since prior knowledge of the system is often unavailable when building a mathematical model, the neural network is utilized to overcome this issue [27]. As a result, the GMDH model can simulate complex systems without any prior specialized knowledge. The primary notion of the GMDH model is to establish an analytical function within a feedforward network, which is achieved by utilizing the coefficients of a quadratic node transfer function derived through the regression approach. In the standard formula, the output of a node is a quadratic polynomial of the model's inputs, and the polynomial coefficients are obtained from the training dataset. Each layer involves a set of input processing components known as nodes, and the outcome of each layer is used as new input for the following layer. To optimize the weights, the least squares method is adopted to minimize the residual between the actual and the predicted values. Figure 4 shows the structure of the GMDH model.
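A single GMDH node can be illustrated with a least-squares fit of a quadratic polynomial. The sketch below assumes the commonly used two-input form y = a0 + a1·xi + a2·xj + a3·xi·xj + a4·xi² + a5·xj²; the helper names, coefficient ordering, and synthetic data are hypothetical and not taken from the paper.

```python
import numpy as np


def quad_features(xi, xj):
    """Design matrix for the assumed two-input quadratic node."""
    xi = np.asarray(xi, dtype=float)
    xj = np.asarray(xj, dtype=float)
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])


def fit_gmdh_node(xi, xj, y):
    """Estimate the six polynomial coefficients by ordinary least squares."""
    coef, *_ = np.linalg.lstsq(quad_features(xi, xj), y, rcond=None)
    return coef


def eval_gmdh_node(coef, xi, xj):
    return quad_features(xi, xj) @ coef


# Synthetic check: recover known coefficients from noise-free data.
rng = np.random.default_rng(1)
xi = rng.uniform(-1.0, 1.0, 200)
xj = rng.uniform(-1.0, 1.0, 200)
true_coef = np.array([0.5, 1.0, -2.0, 0.3, 0.7, -0.4])
y = quad_features(xi, xj) @ true_coef
coef = fit_gmdh_node(xi, xj, y)
print(np.round(coef, 4))
```

In a full GMDH network, one such node would be fitted for each pair of inputs, the best nodes would be kept, and their outputs would feed the next layer.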

2.5. Gradient Boosting Regression Trees

The gradient boosting regression (GBR) tree combines the advantages of the boosting approach and decision trees to solve classification and regression problems. The general notion of GBR is to combine, through boosting, a series of decision trees known as weak learners into an ensemble of multiple decision trees (a strong learner), which in turn increases the accuracy and performance of the model. The boosting approach adds extra trees to the sequence, without altering the parameters of the trees that have already been added, in order to minimize the loss function of the model. In other words, the weights of the training samples are modified in accordance with the last iteration: the weights are increased for observations that are hard to predict and decreased for those that are already well handled. Given an approximation function of a set of predictors, and utilizing additive functions, the ensemble tree model is expressed as a weighted sum of regression trees [69, 70], in which each tree contributes the mean of its end nodes scaled by a given weight; this sum is the additive expansion of the basis functions. The parameters are optimized using the forward approach. The estimate function is built up over a number of iterations, and the optimal parameters are obtained by minimizing the loss function over all observations, each consisting of a set of predictors and a response variable. Figure 5 shows the structure of the GBR model.
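The additive, stage-by-stage construction described above can be illustrated with a deliberately small implementation. This is a sketch under stated assumptions rather than the authors' model: it uses depth-one trees (stumps) on a single feature, the squared-error loss, and a fixed learning rate; all function names and hyperparameter values are hypothetical.

```python
import numpy as np


def fit_stump(x, r):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for t in np.unique(x)[:-1]:  # every threshold leaves both sides non-empty
        left = x <= t
        lm, rm = r[left].mean(), r[~left].mean()
        sse = ((r[left] - lm) ** 2).sum() + ((r[~left] - rm) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    t, lm, rm = stump
    return np.where(x <= t, lm, rm)


def fit_gbr(x, y, n_trees=50, lr=0.1):
    """Forward stagewise boosting: each stump is fitted to current residuals."""
    f0 = float(y.mean())
    pred = np.full_like(y, f0, dtype=float)
    stumps = []
    for _ in range(n_trees):
        residual = y - pred  # negative gradient of 0.5 * (y - F)^2
        stump = fit_stump(x, residual)
        pred += lr * predict_stump(stump, x)  # earlier trees stay unchanged
        stumps.append(stump)
    return f0, lr, stumps


def predict_gbr(model, x):
    f0, lr, stumps = model
    out = np.full(x.shape, f0, dtype=float)
    for stump in stumps:
        out += lr * predict_stump(stump, x)
    return out


# Synthetic example.
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2.0 * np.pi * x)
gbr = fit_gbr(x, y, n_trees=50, lr=0.1)
mse_initial = float(np.mean((y - y.mean()) ** 2))
mse_final = float(np.mean((y - predict_gbr(gbr, x)) ** 2))
print(f"MSE before boosting: {mse_initial:.4f}, after: {mse_final:.4f}")
```

Production implementations grow deeper trees over many predictors and add regularization, but the residual-fitting loop is the same idea.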

2.6. Model Development

Three artificial intelligence models, namely extreme learning machine (ELM), group method of data handling neural network (GMDHNN), and gradient boosting regression (GBR) tree, are used to predict the hourly concentration of PM2.5 at the Dorset station in Canada. Before training the AI models, it is crucial to fill the missing records and determine the proper lagged input vectors. Notably, two methods were tested for replacing the missing records, as shown in the previous section of this study. Furthermore, the autocorrelation function (ACF) and partial autocorrelation function (PACF) are used. These functions are fundamental tools in the analysis of linear time series. The ACF measures the correlation between the current value of the series and its values at various earlier time points. More specifically, it indicates how similar the observations are as a function of their time lag. The PACF measures the same correlation after removing the effects of the intermediate values. According to Figure 6, three input combinations of lagged values can be used to predict PM2.5 one hour ahead.

According to the ACF, many lagged variables could be used as inputs; however, these variables were found to be significantly correlated with each other. Thus, the PACF was used to select the most significant inputs.
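Once the significant lags have been identified, the series has to be rearranged into input vectors and one-hour-ahead targets. The sketch below shows one way to do this; the function name `make_lagged` is hypothetical, and it assumes that an input combination consists of the k most recent consecutive lags, which may differ from the exact combinations read from Figure 6.

```python
import numpy as np


def make_lagged(series, n_lags):
    """Build (X, y) where row t of X holds the n_lags values preceding y[t].

    Column 0 is lag 1 (previous hour), column 1 is lag 2, and so on.
    """
    s = np.asarray(series, dtype=float)
    if n_lags < 1 or n_lags >= s.size:
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    X = np.column_stack(
        [s[n_lags - k - 1 : s.size - k - 1] for k in range(n_lags)]
    )
    y = s[n_lags:]
    return X, y


# Toy series 0, 1, ..., 5 with two lagged inputs.
X_demo, y_demo = make_lagged(range(6), n_lags=2)
print(X_demo)
print(y_demo)
```

The first `n_lags` observations have no complete history and are therefore dropped from the targets.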

After selecting the input groups, it is essential to determine the possible training/testing ratios. The dataset's length considerably affects the AI models’ performances. This study employed nine different ratios (see Figure 7) to split the data into training and testing datasets to assess the applied model's performance. It is worth mentioning that the hyperparameters of the applied models were selected using the trial-and-error method because there is no straightforward approach to compute these critical parameters, which have considerable effects on the estimation accuracy. Figure 8 shows the main processes that are used in this research. The block diagram in Figure 8(a) shows the seven essential steps related to the research methodology, whereas more details on the models' development are given in Figure 8(b).
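The nine data divisions can be generated programmatically. The sketch below assumes a chronological split, in which the earliest records are used for training and the most recent ones for testing; the paper does not state how the records were ordered, so this is an assumption, and the function name and sample count are hypothetical.

```python
# Training percentages for the nine scenarios: 50, 55, ..., 90.
TRAIN_PERCENTAGES = list(range(50, 95, 5))


def chronological_split(n_samples, train_pct):
    """Return index slices for the training and testing portions.

    Assumption: records are in time order and are not shuffled.
    """
    if not 0 < train_pct < 100:
        raise ValueError("train_pct must be strictly between 0 and 100")
    n_train = n_samples * train_pct // 100
    return slice(0, n_train), slice(n_train, n_samples)


n_samples = 1000  # illustrative size, not the length of the Dorset record
for pct in TRAIN_PERCENTAGES:
    train_idx, test_idx = chronological_split(n_samples, pct)
    n_train = train_idx.stop - train_idx.start
    n_test = test_idx.stop - test_idx.start
    print(f"{pct}/{100 - pct}: train={n_train}, test={n_test}")
```

Keeping the test block at the end of the record avoids training on observations that occur after the ones being predicted.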

2.7. Performance Evaluation

To assess the performance of the proposed models in PM2.5 prediction, the following statistical metrics are employed [71–73]:
(i) Mean absolute error (MAE)
(ii) Root mean square error (RMSE)
(iii) Correlation coefficient (R)
(iv) Willmott index (WI)
(v) Nash-Sutcliffe efficiency (NSE)
These metrics are computed from the measured and predicted values of PM2.5 over the n total observations, together with the averages of the measured and predicted values and the mean deviation of the measured values.
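The five metrics listed above can be computed as in the following sketch. It uses the conventional textbook definitions, which are assumed to match the ones adopted in the paper; the function name `evaluate` and the sample values are hypothetical.

```python
import numpy as np


def evaluate(obs, pred):
    """Return MAE, RMSE, R, WI and NSE using their conventional definitions."""
    o = np.asarray(obs, dtype=float)
    p = np.asarray(pred, dtype=float)
    err = o - p
    o_mean = o.mean()
    sse = float((err ** 2).sum())
    return {
        "MAE": float(np.abs(err).mean()),
        "RMSE": float(np.sqrt((err ** 2).mean())),
        # Pearson correlation between measured and predicted values.
        "R": float(np.corrcoef(o, p)[0, 1]),
        # Willmott index of agreement.
        "WI": 1.0 - sse / float(((np.abs(p - o_mean) + np.abs(o - o_mean)) ** 2).sum()),
        # Nash-Sutcliffe efficiency.
        "NSE": 1.0 - sse / float(((o - o_mean) ** 2).sum()),
    }


# Predictions with a constant bias of +1 relative to the measurements.
scores = evaluate([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The example shows why several metrics are reported together: a constant bias leaves R at 1 while MAE, RMSE, WI and NSE all register the error.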

3. Results and Discussion

This section discusses the performance of the proposed models in forecasting the hourly concentration of PM2.5 over a long period (from 01/01/2011 to 31/12/2020). Three input combinations and nine data length scenarios have been used to train and validate the models (ELM, GMDHNN, and GBR). The performance of the models in the training phase is given in Table 3. According to the training results, the ELM provides the most efficient predictions with the lowest forecasting errors (MAE 0.9710 to 1.1099; RMSE 1.6088 to 1.8329). In contrast, the general performance of the GBR is unsatisfactory compared to the other models, with higher errors (MAE 3.7064 to 7.5851; RMSE 4.8536 to 9.7894). The third model (GMDHNN) yields a better prediction capacity (MAE 0.9898 to 1.1187; RMSE 1.6495 to 1.8372) than the GBR model, but its performance is still lower than that of the ELM model in the training phase. The statistical parameters (i.e., RMSE and MAE) show that the ELM has an outstanding capability, providing excellent estimates despite considerable changes in the input variables and the length of the training dataset. On the other hand, the GBR model shows poor performance and an inability to deal with the extensive dataset. A further remarkable observation is that the performance of the ELM and GMDHNN models was reduced when the training dataset was 50% of the total observations. In that case, both models illustrate the difficulty of estimating PM2.5 using a single input parameter. After evaluating the forecasting errors, it is essential to analyze how the estimated PM2.5 values are correlated with their corresponding observed values. In this regard, several performance metrics are computed, namely the Willmott index (WI), correlation coefficient (R), and Nash-Sutcliffe efficiency (NSE), as presented in Table 3.
Overall, the results demonstrate that the ELM model provides more accurate estimates in all cases than the other models. In other words, the similarity between the observed values and those predicted by the ELM approach is promising. The R, WI, and NSE criteria for all cases range from 0.906 to 0.9171, 0.9489 to 0.9553, and 0.938 to 0.942, respectively. Similarly, the GMDHNN model yields good predictions, although slightly less accurate than those of the ELM model. On the other hand, the GBR model cannot mimic the fluctuation of the PM2.5 concentrations over time, providing poor estimates, with two of these criteria ranging from 0.6228 to 0.8109 and from 0.547 to 0.798.

As the ELM approach presents excellent performance in the training phase, it is essential to validate this model using the testing data points. Several studies emphasized that comparable models can be evaluated more effectively in the testing phase [28, 74]. The reason is that, in the training phase, the model is trained in the presence of the input points and their corresponding target values, whereas in the testing phase the model receives only the input vectors. Table 4 shows the performance evaluation of the proposed models in the testing phase, and it can be seen that the ELM model outperforms the other proposed models. In other words, the ELM provides estimates that are very similar to the actual values (R ≈ 0.9001 to 0.9297; WI ≈ 0.9461 to 0.9573; NSE ≈ 0.9281 to 0.9371) with lower forecasting errors (RMSE ≈ 1.4049 to 1.5327; MAE ≈ 0.9001 to 0.9207) compared to the GBR and GMDHNN models. The results also show that the second-best model is GMDHNN, but it is not as accurate as the ELM model in dealing with fluctuating data such as PM2.5. The GBR model, however, has difficulty capturing the dynamics of PM2.5 over the study period.

3.1. The Effect of Data Length on the Predictive Models' Performances

This part of the study shows how the input variables and the length of the testing dataset affect the capacity of the applied models to predict PM2.5. In general, AI models require sufficient records and enough input vectors to provide accurate estimations. In this regard, this study considers forty-five different scenarios of input parameters and dataset length, as shown in Table 4. The results of the testing phase are used to analyze the models' performances. According to the statistical parameters, the ELM is more flexible than the other models with respect to changes in data size and lagged inputs. Moreover, the ELM requires only two input vectors when the testing data size ranges from 50% to 25% and adapts very well to increasing changes in the data length. The results also demonstrate that if the testing data are reduced below 25%, the model requires more input vectors. As the length of the testing data decreases (i.e., 20%, 15%, and 10%), the training data employed in the model increase, and thus the training algorithm requires more inputs to complete the training and calibration processes efficiently. Accordingly, the proposed model is highly flexible with respect to changes in the length of data and the number of inputs used. Based on the results obtained from the ELM model, this model provides the most accurate results when the testing data size ranges from 40% to 45% of the entire dataset.

The other comparable models, such as the GBR model, do not have a reasonable or deducible pattern in dealing with cases where there is a change in the percentage of training data and the number of inputs. On the contrary, the last model (GMDHNN) tends to have a single pattern that can be deduced by evaluating its performance through statistical coefficients. This model needs, in most cases, the largest possible number of inputs, and therefore it does not show any flexibility for small and large changes that occur in the volume of data used.

For further assessment, the 95% uncertainty criterion (U95) is a very effective tool for selecting the most effective and reliable model [28]. The U95 is computed from the RMSE and the SD (standard deviation of the computed errors) [75].
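The uncertainty criterion can be computed as in the sketch below. It assumes the form commonly used in the literature, U95 = 1.96 · sqrt(SD² + RMSE²), which should be checked against [75]; the function name and the use of the population standard deviation are assumptions.

```python
import numpy as np


def u95(obs, pred):
    """95% uncertainty, assuming U95 = 1.96 * sqrt(SD^2 + RMSE^2)."""
    err = np.asarray(obs, dtype=float) - np.asarray(pred, dtype=float)
    rmse = np.sqrt(np.mean(err ** 2))
    sd = np.std(err)  # population standard deviation of the errors (assumption)
    return float(1.96 * np.sqrt(sd ** 2 + rmse ** 2))


# A constant error of -1 has SD = 0 and RMSE = 1, so U95 equals 1.96.
value = u95([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
print(f"U95 = {value:.2f}")
```

Smaller values indicate a narrower error band and therefore a more reliable model.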

For different splits and input lags of the testing dataset, Figure 9 shows the evaluation of the proposed models using U95. The results demonstrate that the ELM provides the smallest value of U95 compared to the other models. Furthermore, Figure 9 is consistent with the findings from the statistical parameters, which indicate that the effectiveness and accuracy of the ELM model reach their maximum when the testing data represent 40% to 45% of the total data. In contrast, the GBR recorded the highest uncertainty (U95), followed by the GMDHNN model.

It is also essential to carry out a reliability analysis (RA) of the comparable models. This type of analysis is considered very effective in evaluating the consistency and performance of the models. This statistical metric indicates whether the suggested models achieve the minimum allowable level of accuracy. Therefore, the RA is quite helpful in selecting the best model for air quality prediction. The RA is calculated following [76].

In the RA formula, n is the total number of PM2.5 samples, and Si is the indicator assigned to each sample, whose value is either 1 or 0. The value of Si depends on the percentage relative error (RE) of the sample.

If the RE of a sample falls within the allowable range, Si is set to 1; otherwise, it is set to 0. The allowable range is
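The reliability analysis can be sketched as follows, assuming the conventional form in which RE is the absolute error divided by the measured value, expressed in percent, and RA is the percentage of samples whose indicator Si equals 1. The allowable RE threshold is not restated here, so it is left as a required argument (`max_re_percent`, a hypothetical name), and the value used in the example is purely illustrative.

```python
import numpy as np


def reliability_analysis(obs, pred, max_re_percent):
    """Percentage of samples whose relative error is within the allowed limit.

    Assumptions: RE_i = |obs_i - pred_i| / obs_i * 100 and S_i = 1 when
    RE_i <= max_re_percent. The threshold must be supplied by the caller.
    """
    o = np.asarray(obs, dtype=float)
    p = np.asarray(pred, dtype=float)
    if np.any(o == 0):
        raise ValueError("relative error is undefined for zero measurements")
    re = np.abs(o - p) / np.abs(o) * 100.0
    s = (re <= max_re_percent).astype(int)  # S_i is either 1 or 0
    return float(100.0 * s.sum() / s.size)


# Relative errors are 10%, 30%, 10% and 0%; the 20% limit is illustrative only.
ra = reliability_analysis([10.0, 10.0, 10.0, 10.0], [11.0, 13.0, 9.0, 10.0], 20.0)
print(f"RA = {ra:.1f}%")
```

Measurements equal to zero need separate handling, since the relative error is undefined for them; this sketch simply rejects them.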

This study evaluated the prediction accuracy of the applied models using the RA. Table 5 shows the results of this metric for every model in the training and testing stages. In this work, we select two data division combinations. In the first combination, 60% of the data points are used for training and 40% for testing; in the second combination, 55% of the data records are used for training, and the rest are used for validation (testing). According to the obtained results, the ELM generally performs better than the other models, achieving the highest RA value in both the training and testing phases. For example, in the first combination the ELM obtains the highest RA value of 77.34%, followed by the GBR with 76.21%, while the GMDHNN produces the lowest RA with 58.123%. As a result, both the ELM and GBR models perform better than the GMDHNN model during the training stage. Concerning the testing stage, the results confirm that the ELM is the best model for estimating PM2.5, with the highest RA value of 75.78%, followed by GBR (71.19%) and GMDHNN (55.16%). The RA results show that the ELM is more efficient in estimating the hourly PM2.5 than the other models. The appraisal of the models with the help of RA also revealed that the best combination is when the training data records make up 60% of the dataset.

The proposed models are also evaluated graphically using the box plot, violin diagram, and Taylor diagram (see Figures 10–12). According to Figure 10, the ELM provides more precise estimates of PM2.5 than the other models. Furthermore, its median and interquartile range (IQR) errors are lower than those of the GMDHNN and GBR models.

Figure 11 presents the violin diagram, which integrates a box plot and a density plot to illustrate the shape of the data distribution. This figure provides a further visual comparison using the testing dataset for the best data division (training 60% and testing 40%). According to the violin diagram, the ELM model can efficiently mimic the actual data distribution and provides closer agreement between the actual and the predicted records. Although the GMDHNN model outcomes are similar to the actual data distribution, it generates negative values that affect the model's performance. On the other hand, the GBR model performs poorly in mimicking the actual data distribution and also generates negative values. Figure 12 presents the Taylor diagram, a polar plot based on the correlation coefficient, standard deviation, and root mean square error that is used to evaluate the models' performance. According to Figure 12, the ELM model simulates the actual PM2.5 more closely than the other models.

Overall, the results obtained in this study show that the ELM model is more reliable in estimating the hourly PM2.5 and more robust to changes in the data features and the length of the training data.

Figure 13 gives valuable insight into the practical implementation of the proposed approaches. In particular, data such as PM2.5, as well as temperature, humidity, and air pressure, are extracted from air pollution stations. Next, the data are processed on cloud servers using machine learning approaches. Finally, the prediction results can be accessed through a software application interface in real time. A complete implementation example is shown in Figure 14, which shows the City Air Quality Management application from Siemens [77]. It is an AI-based worldwide application that can be used on multiple platforms and combines the latest air pollution measurements with the latest AI approach in order to predict pollutant concentrations for the coming days. The application can predict the pollution three days ahead with 90% accuracy and five days ahead with 80% accuracy.

4. Conclusion

Three AI models, namely ELM, GMDHNN, and GBR, have been used to predict the hourly PM2.5 concentrations at the Dorset station in Canada. The case study covers the period from 2011 to 2020. Accurate estimation of hourly air pollutants via AI models requires proper input data features and enough data records for model training. In this study, three input combinations are selected via the partial autocorrelation function (PACF), and nine data length scenarios are used to validate the models and select the model that can most efficiently adapt to the changes. The findings of this study are as follows:
(i) The ELM model generally performs better in estimating PM2.5 than the comparable models, producing lower errors (MAE 0.9710 to 1.1099; RMSE 1.6088 to 1.8329).
(ii) The flexibility of the ELM model in dealing with changes in the size of the training data and different training conditions is remarkable. The results showed that the ELM model demands fewer input vectors when the testing data size ranges from 50% to 25% of the entire dataset. However, the model requires additional input features in other cases, primarily when the training data represent 80% to 90%.
(iii) None of the models except the ELM shows a rational pattern consistent with the changes that occur in the training process.
(iv) The results of this study reveal that the optimal training data, which provide the most accurate estimates, represent 60% of the obtained records.

This study recommends the following:
(i) Applying the proposed methodology to find the optimal training and testing ratios for other pollutant series such as ozone, nitrogen dioxide, sulfur dioxide, and carbon monoxide
(ii) Using a feature selection method instead of PACF and ACF to select the best inputs
(iii) Investigating deep learning models (e.g., LSTM) for predicting PM2.5 concentrations


Abbreviations

XGBoost: Extreme gradient boosting
RPE: Relative prediction error
SVM: Support vector machine
ANN: Artificial neural network
R: Correlation coefficient
R2: Coefficient of determination
MSE: Mean square error
MAE: Mean absolute error
RMSE: Root mean square error
IA: Index of agreement
MAPE: Mean absolute percentage error
RMSPE: Root mean square prediction error
MPE: Mean prediction error
RBF: Radial basis function
SVR: Support vector regression
PCR: Principal component regression
RF: Random forest

Data Availability

The data are available upon request to the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

The authors thank Al-Maarif University College for funding this research.