Abstract
Accurate prediction models for air pollutants are crucial for forecasting and for issuing health alerts to local inhabitants. In recent literature, discrete wavelet transform (DWT) was employed to decompose a series of air pollutant levels, followed by modeling using support vector machine (SVM). This combination of DWT and SVM was reported to produce a more accurate prediction model for air pollutants by investigating different levels of frequency bands. However, DWT significantly increases the model complexity, namely, the training time and the model size of the prediction model. In this paper, a new method called variation-oriented filtering (VF) is proposed to remove the data with low variation, which can be considered as noise to a prediction model. By VF, the noise and the size of the series of air pollutant levels can be reduced simultaneously, and hence so are the training time and the model size. The SO_{2} (sulfur dioxide) level in Macau was selected as a test case. Experimental results show that VF can effectively and efficiently reduce the model complexity while improving predictive accuracy.
1. Introduction
Rapid urban development, if inappropriately managed, may lead to increased pollution. Many studies [1–3] reported that the amount of sulfur dioxide (SO_{2}) is associated with adverse human health effects. Moreover, the WHO (World Health Organization) [4] reported that these health problems may in turn increase the burden on healthcare systems in the long run. Therefore, one possibility for reducing pollutant-related sick leave is an early warning: authorities may provide the general public with an early warning based on a reliable short-term prediction during adverse pollution conditions.
There are two mainstream approaches to air pollution prediction: deterministic and statistical methods. Deterministic models [5] are, however, costly to develop (requiring the establishment of various inventories) and difficult to operate in real time. Even if adequate data and resources were to become available to implement the deterministic approach, Gardner and Dorling [6] and Kukkonen et al. [7] pointed out that the complexity of a problem in general increases when the spatial interactions between systems (regional and urban backgrounds) are ill defined. Since statistical methods are usually derived from empirical relationships between air pollution and other related parameters, they are simple to develop and have been widely used in short-term prediction of air pollution [8]. Statistical methods applied to air pollution prediction include multiple linear regression, nonlinear regression, autoregressive moving average (ARMA), and artificial neural networks (ANNs). Among these statistical methods, ANNs are regarded as a cost-effective and reliable method for air pollution forecasting. Comrie [9] and Ando et al. [10] showed that ANNs are better suited for air pollution forecasting than linear and nonlinear models. Recent approaches combined ANNs with other machine learning methods, for example, genetic algorithms (GA) [11], in order to improve the predictability of ANN models [12–15]. However, ANNs suffer from some inherent drawbacks [16, 17] such as local minima, overfitting, poor generalization, and long training time. Recently, support vector machine (SVM) has attracted considerable attention for prediction problems [18–20]. This approach is based on structural risk minimization, which provides a good generalization capability [21]. SVM is therefore very resistant to the underfitting and overfitting that are common when ANNs are used, and is expected to improve on the generalization of ANNs while obtaining global solutions.
For example, Lu and Wang [22] reported that an SVM model provided better air pollution predictions than an ANN model.
In [23], various data filtering techniques, such as moving average, exponential smoothing, and total variation (TV) denoising, were applied to further improve the accuracy of a prediction model by filtering or smoothing the noise in the training dataset. However, these data filtering techniques usually filter or smooth acute values, treating them as noise or outliers. In many categories of time-series prediction [24], such as financial data, environmental parameter estimation (the current application belongs to this category), electric utility load, and machine reliability, the acute values are those of great significance and interest. Filtering out these acute values is likely to incur critical information loss, in a warning system in particular, and hence a deterioration in prediction accuracy, as shown in Figure 1. It can be seen that a significant proportion of the information with high variation is lost after the TV denoising process. Therefore, this class of data filtering techniques, which attenuate the acute values, is not applicable to the current application.
Furthermore, time-frequency decomposition tools such as discrete wavelet transform (DWT) have been reported [25] to aid time-series prediction. For example, Osowski and Garanty [26] integrated DWT with SVM to obtain a higher prediction accuracy in the application of air pollution forecasting. Moreover, many studies [27–30] in different applications have also reported that DWT improves the performance of prediction models by decomposing the time series into various levels of frequency bands for forecasting. DWT decomposes a time series into several subseries corresponding to various frequency bands. The frequency bands with relatively low variations are usually considered insignificant because they carry little or even no information and hardly contribute to forecasting; the subseries corresponding to these frequency bands can therefore be filtered out. For every remaining subseries, a corresponding SVM submodel is constructed. Juhos et al. [31] reported that an SVM model led to higher accuracy in their study for forecasting NO and NO_{2} after optimizing the SVM hyperparameters through a time-consuming grid search. Nevertheless, the time-consuming grid search can be replaced by efficient, direct search methods such as genetic algorithms (GA). There is, however, usually a degradation of prediction accuracy because GA easily returns locally suboptimal hyperparameters.
To achieve the best prediction accuracy, every SVM submodel requires a time-consuming optimization process for its hyperparameters, so that the training process using DWT and SVM can easily take several hours (as in the current application). The current application focuses on daily forecasting of SO_{2} levels, but hourly forecasting may become necessary in practice for dominant diurnal activities (i.e., every hour, the dataset is updated and the prediction model is retrained), so time is another critical issue. The time issue becomes even more critical when more factors associated with the pollutant level (i.e., input variables) are available. In short, DWT significantly increases the model complexity (i.e., training time and model size) of the prediction model for air pollution forecasting. From this viewpoint, the objective of this study is to design an efficient algorithm that can improve the prediction accuracy without significantly increasing the model complexity as DWT does.
To design such an efficient algorithm, direct manipulation of the time series is preferred to time-frequency decomposition as in DWT. A possible solution may involve reducing the number of training data by clustering similar data points into various clusters. However, the application of SO_{2} level prediction has a stochastic nature; that is, two similar inputs can produce two highly different outputs. Hence, clustering of data points may even cause performance degradation and thus does not necessarily work in the current application.
In fact, the reduction of training data can be based on the determination of variation from a statistical viewpoint. It can be observed that a series of data points (x_t, y_t) carries little information if there is low variation among its successive points, where x_t is the vector of input variables and y_t indicates the corresponding SO_{2} level, for t = 1 to n. For example, consider a series of n = 5 data points whose first four output values are close to each other. The first four points carry only little information, so the second and third data points (with output values 12 and 11, resp.) can be discarded because they hardly contribute any extra information; n is thus reduced to 3. Furthermore, the low variation among these data points may even be considered as noise. For a real example, Figure 2(a) shows the trend of SO_{2} levels in Macau from year 2003 to year 2008. It can be seen that high daily variation of SO_{2} levels occurred on some days (Figure 2(b)). The input data from these days carry more valuable information for SO_{2} level prediction. In Figure 2(c), some successive days show low daily variation of SO_{2} levels. Such data carry little practical information for a warning system of adverse air quality. Therefore, these data can be discarded without significantly affecting the accuracy of SO_{2} prediction models from a statistical perspective, where data or dimensions with low variation can be discarded (as mentioned in the theory of principal component analysis (PCA) [32]). With a daily variation threshold of 0.15 (to be explained in Section 2), about 30% (662) of the days from year 2003 to year 2008 exhibit low daily variation of SO_{2} levels and can be discarded. Based on this idea, we propose a novel algorithm called variation-oriented filtering (VF) which can effectively and efficiently reduce the number of data points of low variation while retaining most of the intrinsic information (i.e., high variation and acute values) of the training data. Therefore, compared to DWT, VF can reduce the model complexity for SO_{2} level prediction without an obvious sacrifice of prediction accuracy.
[Figure 2: (a) trend of SO_{2} levels in Macau from 2003 to 2008; (b) a period of high variation; (c) a period of low variation.]
Moreover, since VF reduces the number of training data points, the number of support vectors #SV (selected from the training data points) of the final SVM regression model [21], shown in the following, is very likely to decrease or at least remain the same:

f(x) = Σ_{i=1}^{#SV} (α_i − α_i^*) K(x_i, x) + b, (1)

where α_i and α_i^* are the support values corresponding to the ith support vector x_i, K is a kernel function, and b is a bias constant. With fewer training data points, the training time can be shortened, while with fewer support vectors, less memory is necessary for storing the model and execution is faster. Since DWT and SVM have been well-known techniques for the past two decades, interested readers may refer to the classical textbooks [21, 25] for technical details.
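As an illustration, the prediction in (1) can be evaluated directly from the support vectors and their support values. The sketch below is a minimal Python version assuming an RBF kernel; the function names and sample values are illustrative, not from the paper.

```python
import math

def rbf_kernel(x_i, x, gamma):
    """RBF kernel K(x_i, x) = exp(-gamma * ||x_i - x||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-gamma * sq_dist)

def svr_predict(support_vectors, support_values, b, x, gamma):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b.

    support_values holds the differences (alpha_i - alpha_i^*), one per
    support vector, so both model size and prediction cost grow with #SV.
    """
    return b + sum(v * rbf_kernel(sv, x, gamma)
                   for sv, v in zip(support_vectors, support_values))
```

For example, with support vectors (0,) and (1,), support values 1.0 and −0.5, bias 0.1, and γ = 1, the prediction at x = (0,) is 1.1 − 0.5e^{−1}.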
In Section 2, the algorithm of VF and the framework of modeling for SO_{2} level prediction are described. In Section 3, an illustrative application of SO_{2} level prediction is presented, followed by the simulation results and discussion in Section 4. Finally a conclusion (and future work) is drawn in Section 5.
2. Methods
In this section, the description of the proposed algorithm is presented, followed by the framework of constructing SO_{2} prediction models.
2.1. Algorithm of VariationOriented Filtering (VF)
The trend of SO_{2} levels consists of a series of data points (x_t, y_t), where x_t is the vector of normalized input variables and y_t indicates the corresponding output variable (the normalized SO_{2} level), for t = 1 to n (the number of days). Data normalization is described in Section 3.2. VF interprets the series as different segments of data points. During the interpretation of a segment, a data point (x_t, y_t) is compared with the reference point (x_s, y_s) at the beginning of the current segment. If the difference between y_t and y_s exceeds or equals a user-defined variation threshold δ, that is,

|y_t − y_s| ≥ δ, (2)

then (x_t, y_t) is considered significant and is assigned to the high variation class. Otherwise, (x_t, y_t) belongs to the low variation class and is discarded. After the interpretation, a reduced series of data points containing only the high variation class is produced. The procedure is described in detail as follows.
A data point (x_t, y_t) is assumed to be the beginning of a segment. It is compared with the next data point (x_{t+1}, y_{t+1}) using (2) under an optimized threshold of daily variation δ. If (2) is false, the data point (x_{t+1}, y_{t+1}) belongs to the low variation class and is discarded. The comparison then continues with (x_{t+2}, y_{t+2}) and so on, until either (2) is true for a data point (x_{t+k}, y_{t+k}), for some k ≥ 1, or the end of the series is reached, that is, t + k = n. In order to retain the trend of a segment, the preceding and current data points, namely, (x_{t+k−1}, y_{t+k−1}) and (x_{t+k}, y_{t+k}), are both retained instead of (x_{t+k}, y_{t+k}) alone. Finally, (x_{t+k}, y_{t+k}) is marked as the end point of the current segment and simultaneously the beginning of the next segment, where the interpretation of the next segment begins.
An example of applying VF is shown in Figure 3. Assume the data point (x_1, y_1) at Day 1 is the beginning of Segment 1. It is compared with the following data point (x_2, y_2) at Day 2 using (2) under the threshold δ. Since (2) is true, (x_2, y_2) is retained and marked as the end of Segment 1 and the beginning of Segment 2. For Segment 2, the data point (x_3, y_3) at Day 3 is compared with (x_2, y_2) and (2) is false. Then, (x_3, y_3) is discarded. Similarly, the data points at Day 4 and Day 5 are discarded. The comparison with (x_2, y_2) continues until the data point (x_6, y_6) at Day 6 is reached, where (2) becomes true. In order to maintain the trend of the segment, the data point (x_6, y_6) and its previous one (x_5, y_5) are retained. Then (x_6, y_6) is marked as the end of Segment 2 and also the beginning of Segment 3. The dash-dot line in Figure 3 shows that the trend of SO_{2} levels is not significantly deformed after applying VF. In addition, the data points of Day 3 and Day 4 carry little information and can be considered as noise. The algorithm of VF described above is outlined in Algorithm 1.
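The segment-based procedure above can be sketched in a few lines of Python. The function name is illustrative; each candidate point is compared against the output value at the beginning of the current segment, and the predecessor of a triggering point is re-retained to preserve the trend:

```python
def variation_filter(points, delta):
    """Variation-oriented filtering (VF) sketch.

    points: list of (x, y) pairs ordered in time; delta: variation threshold.
    A point whose output differs from the current segment's start by at
    least delta ends the segment; it and its immediate predecessor are
    retained, while intermediate low-variation points are discarded.
    """
    if not points:
        return []
    keep = [0]                       # indices of retained points
    start = 0                        # index of the current segment's start
    for j in range(1, len(points)):
        if abs(points[j][1] - points[start][1]) >= delta:
            if j - 1 > start:        # re-retain the predecessor for the trend
                keep.append(j - 1)
            keep.append(j)
            start = j                # triggering point opens the next segment
    if keep[-1] != len(points) - 1:  # always close the series with its last point
        keep.append(len(points) - 1)
    return [points[i] for i in keep]
```

Applied to the Figure 3-style example (six days, with Days 3–5 varying little from Day 2), the sketch keeps Days 1, 2, 5, and 6 and discards Days 3 and 4.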

2.2. Workflow of Modeling
The workflow of modeling employs the techniques of VF, SVM, and GA. For accurate SO_{2} level prediction, VF is proposed to filter out the noise in the input data (i.e., the data points with low variation). In practice, the variation threshold δ for VF is difficult to define and depends on the training data. If δ is set too high, some informative data points for prediction will be discarded. On the contrary, if δ is set too low, the redundant data points cannot be effectively filtered out. Hence, an optimization for δ is required.
SVM was employed as the modeling technique in this study. The radial basis function (RBF) kernel is most commonly selected in SVM for modeling problems. Under this setup, SVM includes two hyperparameters C and γ, which are the regularization factor and the RBF kernel parameter, respectively. Therefore, there are three hyperparameters (δ, C, and γ) in total to be optimized in the current modeling framework. Although there are numerous optimization techniques in the literature, GA is regarded as one of the most powerful and has been widely used in many problems [12, 13, 33] to optimize model performance. In this study, GA is employed as the optimization tool; the details of the GA setup can be found in Section 3.4. Other optimization techniques may be tried in the future.
The workflow of modeling is depicted in Figure 4. The first objective is to determine the optimal hyperparameters (δ, C, and γ) using GA, as shown in Figure 4(a). An initial population of hyperparameters (δ, C, γ) is randomly generated. For every instance of (δ, C, γ) from the population, the steps of VF and SVM are applied as follows. Given a training dataset TRAIN, a set of data points of the high variation class is selected by applying VF with the variation threshold δ. Then SVM with C and γ is used to build a provisional SVM model. If there are 100 instances of (δ, C, γ) in the population, there are 100 provisional SVM models. Every provisional SVM model is then evaluated using a fitness function over an independent validation set VALID (more details can be found in Section 3.4). The above procedure is repeated until the stopping criterion is satisfied. Finally, the optimal hyperparameters (δ, C, γ) are returned. The second step is simply the construction of a prediction model of SO_{2} levels from TRAIN using the optimal hyperparameters, as shown in Figure 4(b).
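The search loop of Figure 4(a) can be sketched as follows. The paper uses GA; for brevity this hypothetical sketch instead draws candidate triples (δ, C, γ) at random from the ranges given in Section 3.4 and keeps the best-scoring one. The fitness function (e.g., CWIA of a provisional SVM model on VALID) is supplied by the caller.

```python
import random

def tune_hyperparameters(train, valid, fitness, n_iter=100, seed=0):
    """Sketch of the hyperparameter search; lower fitness is better.

    fitness(train, valid, delta, C, gamma) is a caller-supplied function,
    for example, the validation CWIA of a provisional SVM model trained
    on the VF-filtered data.
    """
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(n_iter):
        delta = rng.uniform(0.0, 0.99)   # VF variation threshold
        C = rng.uniform(0.0, 9.9)        # SVM regularization factor
        gamma = rng.uniform(0.0, 9.9)    # RBF kernel parameter
        score = fitness(train, valid, delta, C, gamma)
        if score < best_score:
            best, best_score = (delta, C, gamma), score
    return best, best_score
```

For the actual workflow, the random sampling would be replaced by the GA selection, crossover, and mutation operations of Section 3.4.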
[Figure 4: (a) optimization of the hyperparameters using GA; (b) construction of the final prediction model.]
3. Application
In this section, the environment and monitoring sites in Macau are briefly introduced, and the data preparation and representation in this study are described. In addition, the details of the experimental setup are given.
3.1. The Environment and Monitoring Sites in Macau
Macau, located on the southern coast of China with merely 26.8 square miles land area, comprises three land zones: Macau peninsula, Taipa, and Coloane (Figure 5). Similar to many coastal cities in Mainland China, Macau has experienced rapid urban development over the past decades. Therefore, the air pollution data and meteorological data in Macau were used as a case study.
The Macau government meteorological center (DSMG) [34] has established general and roadside meteorological stations for collecting air pollution data, such as suspended particulate matter (SPM), nitrogen dioxide (NO_{2}), sulfur dioxide (SO_{2}), and ozone (O_{3}), and meteorological data, such as temperature, wind direction, wind speed, and relative humidity. Since the land area of Macau is relatively small, the data obtained at the general meteorological station at Taipa Grande (at an elevation of approximately 150 m above sea level) may be considered representative for the entire region of Macau. Therefore, air pollution data and meteorological data from the Taipa Grande general meteorological station were used in this study.
3.2. Data Preparation and Representation
The daily average values for air pollution data and meteorological data from year 2003 to year 2008 were extracted from the website of DSMG. These data were considered representative measures and were adopted in this study of SO_{2} modeling. The choice of the study periods was based primarily on the completeness of both air pollution data and meteorological data available at the time of the experiment. In the study periods, the percentage of missing data, possibly due to maintenance or calibration work, is less than 3%. Outside of these periods, either air pollution data or meteorological data are unavailable.
It is necessary to provide commensurate data ranges so that the SO_{2} model will not be dominated by variables with large values. Moreover, the normalization procedure usually leads to more stable and accurate prediction results. In this study, the training data sets (TRAIN) and test data sets (TEST), including the input variables x and the output variable y, were normalized to zero mean and unit variance, as shown in the following equation:

z = (v − μ) / σ, (3)

where v is either an input variable in x or the output variable y, and μ and σ are the mean and the standard deviation of that variable, respectively.
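A minimal sketch of the normalization in (3) follows (the function name is illustrative). Returning μ and σ allows TEST to be scaled with the statistics computed on TRAIN:

```python
def zscore_normalize(series):
    """Normalize a variable to zero mean and unit variance: z = (v - mu) / sigma.

    Returns the normalized series together with mu and sigma so that
    another dataset (e.g., TEST) can be scaled with the same statistics.
    """
    n = len(series)
    mu = sum(series) / n
    var = sum((v - mu) ** 2 for v in series) / n   # population variance
    sigma = var ** 0.5
    return [(v - mu) / sigma for v in series], mu, sigma
```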
In order to maintain parsimony of the input variables so that the model does not become overly complex, it is necessary to sort out and select only the highly related input variables. The selection of input variables is based on the Pearson correlation coefficient [35]. In short, the Pearson correlation coefficient is a measure of the linear dependence between two variables X and Y, giving a value between +1 and −1 inclusive. A value close to +1 indicates positive correlation, while a value close to −1 indicates negative correlation. A value close to zero indicates no dependence between the two variables.
Pearson’s correlation coefficient between two variables X and Y is commonly denoted by r:

r = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / sqrt(Σ_{i=1}^{n} (X_i − X̄)^2 · Σ_{i=1}^{n} (Y_i − Ȳ)^2). (4)

In this study, the variables X and Y are time series, each containing a series of data points. X_i and Y_i are the ith data points of X and Y, respectively, for i = 1 to n, where n is the number of data points. X̄ and Ȳ are the means of X and Y, respectively.
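Equation (4) translates directly into code; the sketch below is illustrative and uses plain Python lists:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

For a lagged coefficient such as that between NO_{2}(d) and SO_{2}(d+1), the second series is simply shifted by one day, e.g., `pearson_r(no2[:-1], so2[1:])`.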
In this case study, the candidate input variables are the air pollutants (SPM, SO_{2}, NO_{2}, and O_{3}) and the available meteorological data (atmospheric pressure (AP), temperature (TEMP), mean relative humidity (mRH), wind speed (WS), rainfall (RF), and sunshine hours (SHr)) on previous days (e.g., the current day d, the previous day d−1); the output variable is the SO_{2} level on the following day d+1. The Pearson correlation coefficients between the output variable SO_{2}(d+1) and the candidate input variables on previous days are calculated based on (4) and summarized in Table 1. For example, in order to find the Pearson correlation coefficient r between X and Y, namely, NO_{2}(d) and SO_{2}(d+1), respectively, d is set as 1 January 2003. Then the series of NO_{2} levels from 1 January 2003 to 30 December 2008 is retrieved and n is set to the number of days in this period. Each NO_{2} level in this period is considered as X_i, for i = 1 to n. Similarly, the series of SO_{2} levels from 2 January 2003 to 31 December 2008 is selected because one day is shifted for SO_{2}(d+1), and each SO_{2} level in that period is Y_i. The corresponding data are put into (4) to calculate r.
The level of significance is assumed to be 0.50; that is, two variables have positive correlation if the corresponding Pearson correlation coefficient is greater than 0.50 and negative correlation if the coefficient is smaller than −0.50; otherwise, the two variables have no relationship. Hence, with the level of significance set to 0.50, the following can be summarized from Table 1. (i) SO_{2}(d+1) seems positively correlated with SPM, SO_{2}, NO_{2}, and AP, and negatively correlated with TEMP, on different days (d, d−1, and so on). Therefore, the variables SPM, SO_{2}, NO_{2}, AP, and TEMP on different days were selected as input variables in this study. (ii) SO_{2}(d+1) seems poorly correlated with O_{3}, mean relative humidity (mRH), wind speed (WS), rainfall (RF), and sunshine hours (SHr) on all days. These candidate input variables have weak influence on the prediction of the SO_{2} level on the following day and thus are not selected.
Furthermore, Macau is characterized by a subtropical monsoon climate. Airborne industrial pollutants may be carried through Macau by the northward prevailing wind from Mainland China in winter, while the southeastern prevailing wind from the sea in summer usually carries pollutants away. Wind direction (WD), a significant factor for SO_{2} levels in Macau, is therefore included in the SO_{2} prediction model. In the current study, WD is separated into 16 discrete directions. In order to make WD more amenable to generalization and to avoid unnecessary input variables, only the relevant wind directions were selected: out of the 16 wind directions, the eight from the north and the south are related to SO_{2} level prediction in Macau and were thus selected as input variables as well. To represent these wind directions without incurring any bias, a set of Boolean variables, one for each of the eight directions, was used instead of a single number. In addition, the wind directions from the current day and the previous day, namely, WD(d) and WD(d−1), can adequately be represented in the input, rather than using wind directions from more than one day ago.
Finally, the data representation in this study is defined as a pair (x, y), where x is the vector of input variables and y indicates the corresponding SO_{2} level. According to Table 1, x is composed of the selected variables, namely, SPM, SO_{2}, NO_{2}, AP, and TEMP on the selected previous days, together with the Boolean wind direction variables for WD(d) and WD(d−1). Note that the output y is defined as the SO_{2} level on the following day d+1.
3.3. Construction of Prediction Models
In this study, four SO_{2} models using combinations of SVM, VF, GA, and DWT (shown in Table 2) were constructed to evaluate the effectiveness and efficiency of VF against DWT. The hyperparameters C and γ of the SVM model and the SVM-DWT model were optimized using an exhaustive grid search, while the hyperparameters of SVM-GA and SVM-VF were optimized using GA. The ranges of both C and γ are defined as 0.0~9.9 for a demonstrative trial; in practice, wider ranges of these hyperparameters are preferred. Since SVM-VF employs GA to search for hyperparameters, it works much faster than SVM, which uses an exhaustive grid search. In order to allow a fair comparison between the models SVM and SVM-VF, the model SVM-GA is therefore required to assess the efficiency of VF.
Each model was independently trained 10 times to generate more reliable results. For a more comprehensive comparison of model stability, the experimental data (from year 2003 to year 2008) were divided into three groups of TRAIN and TEST (as shown in Table 3). Each group of data was employed to construct the four SO_{2} models. Therefore, there are in total 12 SO_{2} models in this study.
3.4. GA Setup
As mentioned in Section 2.2, the hyperparameters δ, C, and γ are optimized using GA. The ranges of δ, C, and γ are defined as 0.00~0.99, 0.0~9.9, and 0.0~9.9, respectively. Hence, it is enough to employ 2 digits (0 to 9) for each of the hyperparameters; that is, an individual of GA is a real-coded string of 6 digits. An example of an individual with the string “865407” is shown in Figure 6, which means δ = 0.86, C = 5.4, and γ = 0.7, respectively.
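Under the digit layout described above, decoding an individual is straightforward. The sketch below assumes the first two digits encode δ with an implied leading decimal point, and the remaining two pairs encode C and γ with one decimal place each; the function name is illustrative:

```python
def decode_individual(chromosome):
    """Decode a 6-digit GA individual into (delta, C, gamma).

    Assumed layout: digits 1-2 give delta in 0.00-0.99, digits 3-4 give
    C in 0.0-9.9, and digits 5-6 give gamma in 0.0-9.9.
    """
    assert len(chromosome) == 6 and chromosome.isdigit()
    delta = int(chromosome[0:2]) / 100.0
    C = int(chromosome[2:4]) / 10.0
    gamma = int(chromosome[4:6]) / 10.0
    return delta, C, gamma
```

For the example string “865407”, this yields δ = 0.86, C = 5.4, and γ = 0.7.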
The generation of populations of the hyperparameters is controlled by the operations of selection, crossover, and mutation. The selection procedure in this study was divided into two parts. First, in order to apply the elitism strategy [36], which retains the best individuals, the two individuals with the best fitness were directly selected from the current population without further processing. Second, the other 98 individuals were selected from the current population by tournament selection and passed to crossover and mutation. The predefined probabilities of crossover and mutation are 0.8 and 0.05, respectively, according to [37].
The fitness of an individual is evaluated as the accuracy of the SVM model, built using that individual (i.e., the hyperparameters), over an independent validation dataset (VALID). The accuracy of the SVM model is evaluated using the complementary Willmott's index of agreement (CWIA), which is further discussed in Section 4.1. VALID consists of 36 data points randomly selected from TEST. The stopping criterion for the GA process is that no better fitness is achieved for 50 successive iterations. Finally, the individual with the best fitness is returned as the optimal hyperparameters (δ, C, γ).
3.5. DWT Setup
For comparison with VF, DWT was also applied to the training dataset. The family of Daubechies (db) wavelets is one of the most popular mother wavelets and was employed in this study. Ten kinds of Daubechies wavelets (from db1 to db10) were applied to construct different SVM models as illustrated in Figure 7. Given a training dataset TRAIN = {(x_t, y_t)} for t = 1 to n, DWT decomposes TRAIN into several levels k = 1 to L. For k = 1 to L, every decomposed series is used to construct a submodel SVM_{k}, respectively, where y_t^{(k)} is the kth decomposed value of y_t. In other words, the series of SO_{2} levels, y_1 to y_n, is decomposed into the series y_1^{(k)} to y_n^{(k)} for k = 1 to L. The final SO_{2} prediction is simply the summation of the predictions of all submodels SVM_{k}.
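To illustrate the decomposition idea without a wavelet library, the sketch below uses a one-level Haar (db1) transform instead of the db7 wavelet employed in the paper: the series splits into a low-frequency approximation and a high-frequency detail part, and recombining them recovers the original series, mirroring how the final prediction sums the submodel outputs.

```python
def haar_decompose(series):
    """One-level Haar DWT: pairwise averages (approximation) and
    pairwise half-differences (detail). Assumes an even-length series."""
    approx = [(series[i] + series[i + 1]) / 2 for i in range(0, len(series) - 1, 2)]
    detail = [(series[i] - series[i + 1]) / 2 for i in range(0, len(series) - 1, 2)]
    return approx, detail

def haar_reconstruct(approx, detail):
    """Invert haar_decompose: each (a, d) pair yields (a + d, a - d)."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out
```

For instance, the series [1, 3, 2, 0] decomposes into approximation [2, 1] and detail [-1, 1], and recombining them recovers the original series exactly.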
The SVM model accuracy using the different kinds of Daubechies wavelets can be evaluated through CWIA (discussed in Section 4.1) over an independent validation set. From simulation results (not shown here), db7 was found to be the most suitable for this study. In addition, it is also important to decide the decomposition level L. The number of decomposition levels can be estimated using the function “wmaxlev” provided in the MATLAB wavelet toolbox. In this study, L was found to be 6.
4. Performance Evaluation and Results
4.1. Error Measures
In order to effectively measure the accuracy of the SVM models, three statistical error measures are used in this study: mean absolute error (MAE) for value difference, root mean squared error (RMSE) for sensitivity, and the complementary Willmott's index of agreement (CWIA) for curve fitting. Both MAE and RMSE are common measures for estimating the average error of models. However, neither of them provides information about the relative size of the average difference or the nature of the differences. Willmott [38] reported that the index of agreement is a standardized measure that can be easily interpreted and allows cross-comparison of its magnitudes for a variety of models, regardless of units. Owing to its dimensionless nature, relationships described by CWIA tend to complement the information contained in RMSE. Therefore, CWIA was also used to evaluate the accuracy of the SVM models in this study. The range of CWIA is from 0 to 1, where a value closer to 0 indicates a better performance. The three measures are described by the following formulas, where P_i and O_i, respectively, represent the predicted and observed values of SO_{2}(d+1) on the ith day and m is the size of TEST:

MAE = (1/m) Σ_{i=1}^{m} |P_i − O_i|, (6)

RMSE = sqrt((1/m) Σ_{i=1}^{m} (P_i − O_i)^2), (7)

CWIA = Σ_{i=1}^{m} (P_i − O_i)^2 / Σ_{i=1}^{m} (|P_i'| + |O_i'|)^2, (8)

where P_i' = P_i − Ō, O_i' = O_i − Ō, and Ō is the mean of the observed values.
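The three measures in (6)-(8) can be sketched as follows (function names illustrative):

```python
def mae(pred, obs):
    """Mean absolute error, (6)."""
    m = len(obs)
    return sum(abs(p - o) for p, o in zip(pred, obs)) / m

def rmse(pred, obs):
    """Root mean squared error, (7)."""
    m = len(obs)
    return (sum((p - o) ** 2 for p, o in zip(pred, obs)) / m) ** 0.5

def cwia(pred, obs):
    """Complementary Willmott's index of agreement, (8): 0 is a perfect fit."""
    o_mean = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for p, o in zip(pred, obs))
    den = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2 for p, o in zip(pred, obs))
    return num / den
```

A perfect prediction gives 0 for all three measures, while cwia approaches 1 as the model degenerates toward predicting the observed mean.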
4.2. Stability and Errors of Models
In addition to the statistical errors MAE, RMSE, and CWIA, model stability is another important criterion for evaluating the performance of models. Therefore, the standard deviation (σ), as shown in (9), is employed to evaluate the stability of the models in this study, where R is the number of runs, e_r represents the statistical error of a model in the rth run, and ē represents the mean error over all runs:

σ = sqrt((1/(R − 1)) Σ_{r=1}^{R} (e_r − ē)^2). (9)

Based on (6) to (9), the statistical errors and standard deviations of the four combinations of SVM models can be obtained. The experimental results in Figure 8 show that the standard deviation of each statistical error for all models is very small (or even close to zero). This indicates that the results produced in each run for all models were very close to their corresponding means. Therefore, the performance of these models is stable and hence the statistical errors of these models are reliable.
In Figure 8, the statistical errors MAE, RMSE, and CWIA of the four models SVM, SVM-GA, SVM-VF, and SVM-DWT are shown. Each model was trained with three different datasets according to the grouping in Table 3, so that 12 different models were obtained. Note that the statistical errors in Figure 8 are the means of the errors predicted by each model over 10 independent runs. From the results, the SVM-VF model (Figure 8) produced the lowest statistical errors among the four models in all three groups of experimental data. Despite being generated from fewer training data, SVM-VF improves the accuracy of the prediction by filtering out unimportant low-variation data points. Compared to the SVM and SVM-GA models, the SVM-VF model shows a relative improvement of about 5% and 9%, respectively, in all statistical errors. Conversely, the SVM-DWT model produced significantly worse accuracy than the other models. It seems that DWT does not show any superiority for improving the accuracy of SVM in this study. The decline in accuracy may arise from the fact that DWT decomposes a time series into different subseries, each of which has an independent prediction model as shown in Figure 7. Each of these models incurs an error, and hence the accumulated error of the final SVM-DWT model becomes relatively large. Furthermore, the SVM-DWT model took the longest training time (almost 5 hours) among the compared models, while the training time of the SVM-VF model was the shortest (<0.5 hour). The time issue is further discussed in Section 4.4.
4.3. Seasonal Variation of SO_{2} Levels
In order to visualize the performance of the studied methods, the trends of the predicted SO_{2} levels were compared with the observed SO_{2} levels from 2006 to 2008. Figure 9 shows that the trend of the predicted SO_{2} levels of each model is generally close to the trend of the observed levels, suggesting that all four models have a good capability of prediction.
In addition, representative cases of predicted and observed SO_{2} levels can more clearly show a comparison of the prediction capability of the four models. According to the climatic characteristics of Macau (mentioned in Section 3.2), the trends of SO_{2} levels in winter and summer can represent the general yearly trend in Macau. Hence, the predicted and observed SO_{2} levels in winter and summer for each experiment are shown in Figures 10 and 11, respectively. These two figures illustrate that the predicted SO_{2} levels produced by the SVM-VF model are closer to the observed SO_{2} levels than those of the other three models. In particular, SVM-VF appears to fit the more prominent high SO_{2} levels during winter seasons (Figure 11), when health warnings are most needed in this case study.
4.4. Model Complexity
The model complexities of the three combinations of SVM models are shown in Table 4, where the SVM-DWT model is observed to have a relatively large number of support vectors (SVs) (about 6800) and a very long training time (about 290 minutes). Therefore, in the current application of air pollution prediction, DWT decomposition for SO_{2} level forecasting comes at the cost of relatively high model complexity (i.e., long training time and large model size).
For the SVM-VF models, the size of the training dataset TRAIN was reduced to different extents after applying VF for data filtering. The highest data reduction rate achieved by VF is 13.33% (for TRAIN in years 2003 to 2005). In addition, the number of SVs for SVM-VF is the smallest among the three models, especially when compared with SVM-DWT. Furthermore, the execution time of the SVM-VF model is much shorter than those of the SVM model (about 50 minutes) and the SVM-DWT model (about 290 minutes) (Table 4).
Although the SVM-GA model requires even shorter training time than the SVM-VF model, it suffers from performance degradation, as shown in Figure 8, because GA can easily settle on suboptimal SVM hyperparameters. Hence, the proposed VF algorithm can resolve the high complexity of model construction in both time and model size while even improving the prediction accuracy. In summary, the SVM-VF model not only effectively reduces the complexity of the prediction model for the SO_{2} level, but also improves the accuracy of the prediction models by filtering out the unimportant low-variation data points.
4.5. Discussion
From the experimental results, VF was verified to effectively reduce the size of the training dataset TRAIN and also the number of SVs of the prediction models. The reductions in training data points and in the number of SVs lead to shorter training time and smaller model size, which can benefit real large-scale modeling with hundreds of input variables and millions of data points. Despite using fewer training data, the SVM-VF model can produce higher accuracy than the SVM model with significantly shorter training time. This is because VF only discards the unimportant data points with low variation and hence does not harm the accuracy of the SO_{2} prediction model. Moreover, the unimportant low variation in the discarded data points may act as noise in the modeling. By filtering out this noise, the accuracy of the SO_{2} prediction model can be improved, as shown in the results. In addition, the training time of the SVM-VF model is significantly reduced compared to SVM.
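The filtering step itself can be sketched compactly. The rule below (keep a point only if it deviates from the last retained point by at least a variation threshold) is a hypothetical simplification of VF for illustration; the exact criterion is defined in the method section of the paper:

```python
import numpy as np

def variation_filter(series, threshold):
    """Hypothetical VF sketch: drop low-variation points.

    A point is retained only if its absolute difference from the
    last retained point is at least `threshold`; the first point
    is always kept. Returns the indices of retained points.
    """
    kept = [0]
    for i in range(1, len(series)):
        if abs(series[i] - series[kept[-1]]) >= threshold:
            kept.append(i)
    return np.asarray(kept)
```

Acute changes (high variation) always survive such a filter, while flat stretches are thinned out, which is what shrinks TRAIN and, in turn, the number of SVs.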
Using the accuracy of SVM as a baseline, the SVM-VF model takes longer training time than the SVM-GA model but produces up to 9% better prediction accuracy. This is also credited to the reduction of modeling noise. Since the optimization of the SVM hyperparameters by GA is driven by the SVM prediction accuracy, the hyperparameters become even more suboptimal under a noisy training dataset (i.e., one containing unimportant low-variation data), so that the prediction accuracy of the SVM-GA model deteriorates further.
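To make this argument concrete, a toy real-coded GA is sketched below; the operators, population size, and bounds are illustrative assumptions, not the actual GA configuration of the SVM-GA model. Because selection ranks candidates purely by the fitness value (e.g., validation error), any noise in that value propagates directly into the chosen hyperparameters:

```python
import random

def ga_tune(fitness, bounds, pop=20, gens=30, seed=0):
    """Toy real-coded GA for minimizing `fitness` over box `bounds`.

    For SVM tuning, the vector would hold (C, gamma, epsilon) and
    `fitness` would be a validation error. Illustrative only.
    """
    rng = random.Random(seed)
    popn = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=fitness)           # selection: rank by fitness
        elite = popn[: pop // 2]         # elitism: keep the better half
        children = []
        while len(elite) + len(children) < pop:
            a, b = rng.sample(elite, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]   # averaging crossover
            j = rng.randrange(len(child))                 # mutate one gene
            lo, hi = bounds[j]
            child[j] = min(hi, max(lo, child[j] + rng.gauss(0, (hi - lo) * 0.1)))
            children.append(child)
        popn = elite + children
    return min(popn, key=fitness)
```

For example, `ga_tune(lambda v: (v[0] - 3) ** 2 + (v[1] - 7) ** 2, [(0.0, 10.0)] * 2)` converges near (3, 7); replacing that clean objective with a noisy one biases the returned parameters in the same way a noisy training dataset biases GA-tuned SVM hyperparameters.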
Although many studies have reported that DWT decomposition can improve the performance of prediction models in many signal and image processing applications, DWT does not necessarily work in time-series prediction; at least, DWT did not establish its superiority in this study. The cause may lie in the DWT decomposition and the accumulated error, as explained in the following. DWT decomposes a time series into several subseries, as shown in Figure 7. Each subseries is employed to train a submodel SVM_{k}, for k = 1, 2, …, n. Every SVM_{k} produces a prediction of the output for its subseries, which may incur a certain (small but measurable) error e_{k}. Since the final prediction is simply the accumulation of the predictions of all SVM_{k}, the total error of the final prediction is also the accumulation of the e_{k}, which may become large. Moreover, DWT significantly increases the training time and model size because n submodels SVM_{k} are produced. Hence, the SVM-DWT model did not outperform the other models in this study. In comparison with the SVM-DWT model, the SVM-VF model not only resolves the issue of high model complexity, but also produces better performance. In a nutshell, Table 5 summarizes and quickly compares the models in the current application. It is concluded that the proposed VF method produces better performance, particularly in training time, in the current application.
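The error-accumulation argument can be checked numerically. The sketch below assumes an additive DWT-style reconstruction (the final prediction is the sum of the subseries predictions, as described above); the subseries and per-submodel errors are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sub, length = 4, 200                    # toy: 4 subseries from a DWT-style split
subseries = [rng.normal(size=length) for _ in range(n_sub)]
sub_errors = [rng.normal(scale=0.1, size=length) for _ in range(n_sub)]
sub_preds = [s + e for s, e in zip(subseries, sub_errors)]   # each SVM_k's output

observed = sum(subseries)                 # additive reconstruction of the series
final_pred = sum(sub_preds)               # SVM-DWT-style final prediction
total_error = final_pred - observed

# the final error is exactly the sum of the per-submodel errors e_k,
# so its magnitude grows with the number of submodels
assert np.allclose(total_error, sum(sub_errors))
```

With independent per-submodel errors, the RMSE of the combined prediction grows roughly with the square root of the number of subseries, while a single model incurs only one error term.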
5. Conclusion
The current application of air pollutant prediction aims to predict significant acute pollutant levels as a warning message. However, most existing data filtering or smoothing techniques are likely to attenuate the acute pollutant levels in such a time series (Figure 1) and hence cause a deterioration in time-series prediction accuracy. Therefore, these existing methods are not necessarily applicable to the current application. In the literature on air pollutant level prediction, DWT was used to decompose the series of pollutant levels into different subseries for modeling in order to obtain higher prediction accuracy. However, DWT incurs high complexity in training time and model size, in addition to possible performance degradation.
This research proposes a new algorithm, variation-oriented filtering (VF), which filters out only the unimportant low-variation data while the acute data (i.e., those with high variation) are retained. VF can resolve the issue of high model complexity without sacrificing (or even while improving) the prediction accuracy. The SO_{2} level in Macau was used as a case study, and four different combinations of SVM models and 12 scenarios using GA, DWT, and VF were compared. Experimental results reveal that the models using VF outperform those using GA and DWT for SO_{2} level prediction in Macau. Moreover, with VF, the amount of training data and the model construction time can be significantly reduced, so that hourly prediction of pollutant levels (the current application is daily) becomes more feasible. The proposed VF method can also be applied to the prediction of other air pollutants, or to other time-series prediction tasks where acute values are of interest and significance (such as financial data, environmental parameter estimation, electric utility load, and machine reliability), to filter out low-variation data and improve performance.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The research is supported by the University of Macau Research Grant, Grant no. MYRG141 (Y1L2)FST11IWF.