Abstract

Forecasting of air pollution has become a popular and important topic in recent years because of the health impact of air pollution. It is necessary to build an early warning system that provides forecasts and enables medical practitioners and the local government to issue health alerts to local inhabitants. Meteorological and pollution data collected daily at monitoring stations in Macau are used in this study to build a forecasting system. Support vector machines (SVMs), a type of machine learning technique based on statistical learning theory, can be used for regression and time series prediction. SVMs are capable of good generalization, but the performance of an SVM model often hinges on an appropriate choice of kernel.

1. Introduction

Air pollution is often the result of economic development and population growth. It is particularly prevalent in developing cities, notably in China and India. Many epidemiologic studies [1–3] have reported that air pollution is often associated with adverse effects on human respiratory health, particularly in susceptible individuals. For example, ozone has been reported to cause airway inflammation and to elevate the airway response to inhaled allergens. It may increase the risk of developing asthma among children taking part in outdoor sports [4]. The WHO [5] reported that these health problems may in turn increase the burden on health care systems in the long run and be detrimental to the economy. To reduce the burden on health care due to diseases caused by atmospheric pollutants, the establishment of an early warning system is necessary. The success of an early warning system, which provides forecasts and alerts local inhabitants, depends on the reliability and availability of up-to-date meteorological information and pollution data. For instance, medical practitioners can use the predictions of the early warning system to advise patients to minimize outdoor activities on days with high levels of pollution and smog.

The meteorological and pollutant data of Macau are used as a case study for testing the forecasting model on a representative developing city. Macau, located on the southern coast of China with a land area of merely 26.8 square miles, comprises three land zones: Macau peninsula, Taipa, and Coloane (Figure 1). Macau peninsula has the characteristics of a hybridized, urbanized area; Taipa has mainly residential areas; Coloane has a power station and is largely undeveloped, with the largest green areas. The population density in 2008 reached 20,493 people per square mile [6], which is one of the highest in the world. The resident population of Macau is projected to increase at an average annual rate of 1.9%, from 513,000 in 2006 to 829,000 in 2031 [7]. The number of vehicles reached 188,668 by the end of 2009, more than triple the number in 1999 [8]. Since 1995, the Macau Government has regulated the import of lead-free petroleum products in line with developed countries, and the sulfur content of gasoline must be lower than 0.05% by weight. Furthermore, in 2004, the sulfur content of petroleum products used in the power station was also regulated. The implementation of these policies reduces the local emission of pollutants. In addition, construction and infrastructure projects have been transforming the landscape of a reclaimed land area between Taipa and Coloane since 2004.

Monitoring and forecasting of ambient air pollutant levels involve a variety of approaches, for example, on-site measurement, computational fluid dynamics (CFD) simulation, and computational intelligence. The artificial neural network (ANN) method is regarded as cost-effective and has been employed by environmental researchers to construct prediction models for a variety of cities [9–12]. The practical application of these models, however, suffers from several drawbacks, for example, local minima, overfitting, poor generalization, and the need to determine an appropriate network architecture. Support vector machines (SVMs), developed by Vapnik [13], offer an effective alternative that improves on the generalization performance of neural networks while achieving a global solution. SVMs can overcome most drawbacks of ANNs and have been reported to show promising results [14–16]. However, the performance of the resulting SVM often hinges on an appropriate choice of kernel, and several kernels are commonly used in SVMs for regression. Therefore, another aim of this study is to determine which kernel is most suitable for air pollution prediction.

2. Meteorological Data

The meteorological information and pollutant data measured at the Taipa Grande automatic meteorological station (see Figure 1) from 2003 to 2006 were selected as the experimental data set and were extracted from the Macau Government's centre. Since the land area of Macau is relatively small, the data obtained at Taipa Grande (at an elevation of approximately 150 m above sea level) may be considered representative of the entire region of Macau. The meteorological stations record pollutant data, such as nitrogen dioxide (NO2), sulfur dioxide (SO2), suspended particulate matter (SPM), and ozone (O3), and climatic data, such as temperature, humidity, rainfall, wind direction, wind speed, and precipitation. The daily average value of the air pollutants and meteorological data is considered a more representative measure and is adopted in this study.

In addition, the recorded levels of SPM, SO2, NO2, and O3 in January and July 2006 were selected as special cases in this study. These two months were chosen because January and July represent the winter and summer seasons in Macau, respectively. January is typified by dry weather and a dominant northeastern wind, whereas July is typified by humid, hot weather and a prevailing southeastern wind. The temperature in these two seasons in Macau may range from 10 to 30°C. In winter, due to the dominant northeastern wind, air-borne industrial pollutants from mainland China may be blown through Macau. In contrast, the prevailing southeastern wind from the sea in summer usually carries pollutants away.

Moreover, these meteorological data are closely associated with the presence and dispersion of pollutants. In order to discern the relationship between meteorological data and pollutants, an unadjusted, crude method of bivariate analysis using Pearson correlation coefficients was applied. The resulting Pearson correlation coefficients between the various meteorological and pollutant parameters are presented in Figure 2. The results illustrate a critical problem with using a limited amount of data to determine the relationship between pairs of variables. For instance, atmospheric pressure appeared to have a positive correlation with each of SPM, NO2, and SO2 over a period of one year, whereas the same parameters showed no relationship over a three-year period. To limit the computational cost of the regression model, parameters with a Pearson correlation coefficient greater than 0.5 were selected as inputs to the model. An exception was made for parameters related to O3, for which a Pearson correlation coefficient greater than or equal to 0.4 was used. Apart from the physical significance of some meteorological variables, such as sunshine rate, to the production of O3, this exception was necessary to avoid having too few inputs in the model, which might then fail to account for the fluctuation of O3 levels. The available input variables included in the regression model are summarized in Table 1.
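The selection rule described above can be illustrated with a minimal Python sketch. The function names, the `inclusive` flag, the use of the absolute value of the coefficient, and the sample numbers are assumptions made for illustration; they are not taken from the study's data or code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_inputs(target, candidates, threshold=0.5, inclusive=False):
    """Keep candidate series whose correlation with the target passes the threshold.

    Assumption: the absolute value of r is compared, so strongly negative
    correlations are also kept. inclusive=True models the ">= 0.4" rule for O3.
    """
    selected = []
    for name, series in candidates.items():
        r = abs(pearson(target, series))
        if r > threshold or (inclusive and r == threshold):
            selected.append(name)
    return selected

# Made-up daily averages, for illustration only.
no2 = [40, 55, 62, 48, 70, 66]
candidates = {
    "SPM": [80, 95, 110, 90, 130, 120],
    "Hum": [70, 72, 69, 71, 70, 73],
}
print(select_inputs(no2, candidates))
```

With these invented numbers, only the SPM series is strongly correlated with NO2 and is retained.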

Table 2 shows the Pearson correlation coefficients between pollutant levels at different time lags. For the three time series of SPM, SO2, and O3, the coefficients are at, or even fall below, the threshold of 0.5 after lag 2. Hence, in order to improve the accuracy of the models while limiting their computational cost, only the air pollutant and meteorological data of the current day and the previous day were used in this study to predict air pollutant levels on the following day.

On the other hand, wind direction $W$ is only available in the form of general directions, such as N, NE, and E. Therefore, the Pearson correlation coefficient cannot be applied to find the relationship between air pollutants and wind direction. However, as mentioned above, wind direction is closely associated with the presence and dispersion of pollutants. Hence, wind direction is also selected as an available input variable in this study. The wind direction is separated into 16 discrete directions $\{\mathrm{N}, \mathrm{NNE}, \mathrm{NE}, \mathrm{ENE}, \mathrm{E}, \ldots\}$. After applying corresponding analysis, it was found that only 7 of the 16 wind directions are related to pollutant levels, namely, $\{\mathrm{N}, \mathrm{NNE}, \mathrm{E}, \mathrm{ESE}, \mathrm{SE}, \mathrm{NW}, \mathrm{NNW}\}$. To represent these directions, a Boolean variable $W_i \in \{0, 1\}$, where $i = 1$ to $7$, was used for each of them, rather than a single number $W \in \{1, 2, \ldots, 7\}$, so that no artificial ordering among the directions is introduced.
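The Boolean encoding of wind direction can be sketched as follows. The function name and the choice to map the nine remaining directions to an all-zero vector are assumptions; the text only states that one Boolean variable is used per relevant direction.

```python
# The seven directions reported as related to pollutant levels.
RELEVANT_DIRECTIONS = ["N", "NNE", "E", "ESE", "SE", "NW", "NNW"]

def encode_wind(direction):
    """Return the indicators W_1..W_7 for a compass direction string.

    Assumption: any direction outside the seven relevant ones is encoded
    as all zeros.
    """
    return [1 if direction == d else 0 for d in RELEVANT_DIRECTIONS]

print(encode_wind("E"))   # the third indicator is set
print(encode_wind("SW"))  # not a relevant direction, so no indicator is set
```

An integer code from 1 to 7 would imply, for example, that NNW is "larger" than N; separate indicators avoid that implied ordering.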

In addition, since most air pollutants are soluble, rainfall might be expected to have a critical impact on the output. However, after including rainfall in the models, it was found that its influence on model accuracy is very low. Hence, rainfall was not selected as an input variable in this study.

3. Methodology

3.1. Support Vector Machines

Support vector machines (SVMs) are known as an excellent tool for classification and regression problems [17–19], producing good generalization. The basic principle of SVM is to apply a linear model to represent nonlinear class boundaries by means of a nonlinear mapping of the input vector into a high-dimensional feature space. Details of the working concept of SVM can be found in [13].

3.2. Kernel Selection

Kernel selection is a crucial issue for support vector machines. A kernel introduces nonlinearity into the SVM problem by mapping new input data, $X$, implicitly into a Hilbert space via a function $\Phi$, where it may then be linearly separable. Since SVM only requires inner products of the nonlinearly mapped features $\Phi(X)$, a kernel becomes an efficient way to compute such an inner product and provides the same scalar output $k(X, X_t) = \Phi(X)^T \Phi(X_t)$, where $k$ is a predefined kernel and $X_t$ is the support vector. Different kernels accommodate different nonlinear mappings, and the performance of the resulting SVM often hinges on an appropriate choice of kernel [20]. Several kernels are commonly used in SVMs for regression. The Linear, Polynomial, Radial Basis Function (RBF), Sigmoid, and Wavelet kernels were used in this study to build SVM models for comparison. These kernel functions are listed as follows, where $X, X_t \in R^m$:

$$\text{linear:} \quad k(X, X_t) = X^T X_t, \tag{1}$$

$$\text{polynomial:} \quad k(X, X_t) = \left(X^T X_t + 1\right)^n, \tag{2}$$

$$\text{RBF:} \quad k(X, X_t) = \exp\left(-\frac{\|X - X_t\|^2}{2\sigma^2}\right), \tag{3}$$

$$\text{sigmoid:} \quad k(X, X_t) = \tanh\left(X^T X_t + 1\right), \tag{4}$$

$$\text{wavelet:} \quad k(X, X_t) = \prod_{i=1}^{m} \varphi\left(\frac{X_i - X_{t,i}}{\sigma}\right), \tag{5}$$

where $X_i$ and $X_{t,i}$ denote the $i$th components of $X$ and $X_t$. In (5), $\varphi$ can be any mother wavelet. In this study, the Morlet function was selected.
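To make the five kernels of (1)–(5) concrete, the following is a minimal pure-Python sketch. The function names and default parameter values are illustrative. The explicit Morlet form $\varphi(u) = \cos(1.75u)\exp(-u^2/2)$ is an assumption, since the text names the Morlet function without stating its formula; this form is one commonly used in wavelet kernels.

```python
import math

def dot(x, xt):
    """Inner product X^T X_t of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, xt))

def linear_kernel(x, xt):
    return dot(x, xt)                                  # equation (1)

def polynomial_kernel(x, xt, n=2):
    return (dot(x, xt) + 1.0) ** n                     # equation (2)

def rbf_kernel(x, xt, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xt))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))     # equation (3)

def sigmoid_kernel(x, xt):
    return math.tanh(dot(x, xt) + 1.0)                 # equation (4)

def morlet(u):
    # Assumed mother wavelet; the constant 1.75 is not given in the text.
    return math.cos(1.75 * u) * math.exp(-u * u / 2.0)

def wavelet_kernel(x, xt, sigma=1.0):
    """Product of the mother wavelet over all m components, equation (5)."""
    value = 1.0
    for a, b in zip(x, xt):
        value *= morlet((a - b) / sigma)
    return value

x, xt = [1.0, 2.0], [0.5, -1.0]
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel,
               sigmoid_kernel, wavelet_kernel):
    print(kernel.__name__, kernel(x, xt))
```

The RBF and wavelet kernels depend only on the difference $X - X_t$ and equal 1 when the two vectors coincide, whereas the linear, polynomial, and sigmoid kernels depend on the inner product.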

4. Experiment Workflow

4.1. Data Sampling

The raw air pollutants and meteorological data from year 2003 to year 2006 were obtained from the website of DSMG (http://www.smg.gov.mo/www/e_index.php (last access: March 2012)). These data were divided into three groups for three experiments as shown in Table 3.

4.2. Data Normalization

Prior to modelling, it is necessary to normalize all selected features into the same range to avoid domination by any feature with large values. This normalization process leads to more stable and accurate predicted results. The features in the training data and test data were normalized by subtracting and dividing by the feature means, that is,

$$\mathbf{x}_i \leftarrow \frac{\mathbf{x}_i - \bar{\mathbf{x}}_i}{\bar{\mathbf{x}}_i}, \tag{6}$$

where $\bar{\mathbf{x}}_i$ is the mean of the $i$th parameter of $\mathbf{x}$.
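A minimal sketch of the normalization in (6) is given below. The function names and sample values are illustrative. Computing the means on the training data and reusing them for the test data is an assumption; the text does not state which means were applied to the test set. The sketch also assumes that every feature has a nonzero mean.

```python
def feature_means(rows):
    """Column-wise means of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]

def normalize(rows, means):
    """Apply x_i <- (x_i - mean_i) / mean_i to every feature of every row."""
    return [[(v - m) / m for v, m in zip(row, means)] for row in rows]

# Made-up values: two features on very different scales.
train = [[10.0, 200.0], [30.0, 400.0]]
test = [[20.0, 300.0]]

means = feature_means(train)           # assumed: means come from training data
print(normalize(train, means))
print(normalize(test, means))
```

After this step each feature is expressed as a relative deviation from its mean, so features measured in large units no longer dominate the kernel computation.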

4.3. Modeling and Data Representation

As mentioned in Section 2, air pollutant and meteorological data of the previous day and the current day were used in this study to predict the air pollutant level on the following day. In order to apply SVM to pollutant level forecasting, the representation of a pollutant level is defined as a pair $(\mathbf{x}, y)$. Generally, the following features for a specific pollutant $P \in \{\mathrm{SPM}, \mathrm{SO_2}, \mathrm{NO_2}, \mathrm{O_3}\}$ are chosen for the representation of $\mathbf{x}$:
(i) pollutant level at previous day: $P(d-1)$
(ii) pollutant level at current day: $P(d)$
(iii) correlated pollutant level at previous day: $\mathrm{Corr}P(d-1)$
(iv) correlated pollutant level at current day: $\mathrm{Corr}P(d)$
(v) correlated meteorological level at previous day: $\mathrm{Corr}M(d-1)$
(vi) correlated meteorological level at current day: $\mathrm{Corr}M(d)$.

For example, if pollutant $P = \mathrm{NO_2}$, then according to Table 1, its correlated pollutants are $\mathrm{Corr}P = \{\mathrm{SPM}, \mathrm{SO_2}\}$ and its correlated meteorological parameters are $\mathrm{Corr}M = \{W, T, \mathrm{Hum}\}$, denoting the levels of SPM, SO2, wind direction, temperature, and humidity at the previous day and the current day, respectively. The representation of $\mathbf{x}$ is then defined as

$$\mathbf{x} = \left[\mathrm{NO_2}(d-1), \mathrm{NO_2}(d), \mathrm{SPM}(d-1), \mathrm{SPM}(d), \mathrm{SO_2}(d-1), \mathrm{SO_2}(d), W_i(d-1), W_i(d), T(d-1), T(d), \mathrm{Hum}(d-1), \mathrm{Hum}(d)\right] \quad \text{for } i = 1 \text{ to } 7. \tag{7}$$

Finally, the output $y = P(d+1)$ is the corresponding pollutant level of $P$ (i.e., the predicted pollutant level) on the following day.
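The construction of one training pair for NO2 following (7) can be sketched as below. The record layout (a list of per-day dictionaries), the function names, and the sample values are hypothetical. Placing the seven wind indicators of day $d-1$ as one block followed by those of day $d$ is also an assumption about the ordering.

```python
WIND_DIRS = ["N", "NNE", "E", "ESE", "SE", "NW", "NNW"]

def wind_flags(direction):
    """Seven Boolean indicators W_1..W_7 for one day's wind direction."""
    return [1 if direction == d else 0 for d in WIND_DIRS]

def build_no2_sample(days, d):
    """Return (x, y) for day index d, with y the NO2 level of day d + 1."""
    prev, cur, nxt = days[d - 1], days[d], days[d + 1]
    x = []
    for key in ("NO2", "SPM", "SO2"):           # pollutant and correlated pollutants
        x += [prev[key], cur[key]]
    x += wind_flags(prev["W"]) + wind_flags(cur["W"])
    for key in ("T", "Hum"):                    # correlated meteorological levels
        x += [prev[key], cur[key]]
    return x, nxt["NO2"]

# Made-up daily records, for illustration only.
days = [
    {"NO2": 40.0, "SPM": 80.0, "SO2": 10.0, "W": "N", "T": 15.0, "Hum": 70.0},
    {"NO2": 55.0, "SPM": 95.0, "SO2": 12.0, "W": "NE", "T": 16.0, "Hum": 72.0},
    {"NO2": 62.0, "SPM": 110.0, "SO2": 14.0, "W": "E", "T": 17.0, "Hum": 69.0},
]
x, y = build_no2_sample(days, 1)
print(len(x), y)
```

Each sample therefore has 24 inputs: 6 pollutant values, 14 wind indicators, and 4 meteorological values.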

This set of training data $(\mathbf{x}, y)$ is then passed to the SVM models. The concept is illustrated in Figure 3. For simplicity, the SVM models were named according to the kernel used in the model. Accordingly, the five kinds of models for each pollutant in this study were as follows: Linear model, Polynomial model, RBF model, Sigmoid model, and Wavelet model. With four pollutants $P \in \{\mathrm{SPM}, \mathrm{SO_2}, \mathrm{NO_2}, \mathrm{O_3}\}$, five modelling methods, and three different experiments, 60 different trained models were developed in total.
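The count of 60 models follows from enumerating every combination of pollutant, kernel, and experiment, as the short sketch below shows. The experiment labels are taken from the names used in the Results section; the tuple layout is illustrative.

```python
from itertools import product

pollutants = ["SPM", "SO2", "NO2", "O3"]
kernels = ["Linear", "Polynomial", "RBF", "Sigmoid", "Wavelet"]
experiments = ["1-year", "winter", "summer"]

# One trained model per (pollutant, kernel, experiment) combination.
model_grid = list(product(pollutants, kernels, experiments))
print(len(model_grid))  # 4 * 5 * 3 combinations
```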

4.4. Experiment Environment

Modelling was performed on the MATLAB 2007a platform, where the LIBSVM toolbox [21] and the SVM Matlab toolbox [22] were employed to construct the models. The hyperparameters (c and g) of SVM and the options of the different kernels were optimized.

5. Results

5.1. Error Measures

In order to effectively compare the accuracy of the models, four error measures were used in this study: mean absolute error (MAE), root mean squared error (RMSE), complementary Willmott's index of agreement (CWIA), and relative error (RE). The RE is necessary because, in a warning system, attention is usually focused on levels exceeding a particular dangerous level. The success of a forecasting system may be defined by whether the predicted value falls within an accepted error range relative to the true value [23]. In the following formulas, $P_i$ and $O_i$ represent the predicted level and observed level of the $i$th day, respectively. $O_{\max}$ and $O_{\min}$ represent the maximum and minimum of the observed levels within each test set, and $n$ is the number of data in the test set:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left|P_i - O_i\right|, \quad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(P_i - O_i\right)^2}, \quad \mathrm{CWIA} = \frac{\sum_{i=1}^{n} \left(P_i - O_i\right)^2}{\sum_{i=1}^{n} \left(\left|P'_i\right| + \left|O'_i\right|\right)^2}, \tag{8}$$

where $P'_i = P_i - \bar{O}$, $O'_i = O_i - \bar{O}$, and $\bar{O} = (1/n)\sum_{i=1}^{n} O_i$,

$$\mathrm{RE} = \frac{1}{n}\sum_{i=1}^{n} E_i, \tag{9}$$

where

$$E_i = \begin{cases} 0 & \text{if } \left|P_i - O_i\right| < \left(O_{\max} - O_{\min}\right) \times 15\%, \\ 1 & \text{otherwise.} \end{cases} \tag{10}$$
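The four error measures of (8)–(10) can be computed as in the following minimal sketch. The function names and the sample values are illustrative and are not results from the study.

```python
import math

def mae(pred, obs):
    """Mean absolute error."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)

def rmse(pred, obs):
    """Root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def cwia(pred, obs):
    """Complementary Willmott's index of agreement (0 is a perfect match)."""
    mean_obs = sum(obs) / len(obs)
    numerator = sum((p - o) ** 2 for p, o in zip(pred, obs))
    denominator = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
                      for p, o in zip(pred, obs))
    return numerator / denominator

def relative_error(pred, obs, tolerance=0.15):
    """Fraction of days whose error is not within 15% of the observed range."""
    band = (max(obs) - min(obs)) * tolerance
    misses = sum(0 if abs(p - o) < band else 1 for p, o in zip(pred, obs))
    return misses / len(obs)

# Made-up observed and predicted levels, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 30.0]
print(mae(predicted, observed), rmse(predicted, observed))
print(cwia(predicted, observed), relative_error(predicted, observed))
```

In this invented example the accepted band is 15% of the observed range of 30, that is 4.5, so only the last day, with an error of 10, counts as a miss.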

5.2. Prediction Results

Table 4 presents the results of the SVM models with different kernels in the 1-year experiment for SPM, SO2, NO2, and O3. The bolded values indicate the best performance among the five tested models. The Linear model and RBF model produced satisfactorily low errors for all pollutants, and the results of these two models were comparable. The results of the Polynomial model and Wavelet model were poorer, with errors 3–18% higher than those of the Linear model and RBF model. The Sigmoid model produced the highest errors. The predicted results of the seasonal experiments showed the same pattern as in the 1-year experiment (see Table 5 (winter experiment) and Table 6 (summer experiment)). These results indicate that the Linear model and RBF model generalize much better than the other three models. In addition, the poor results of the Polynomial model and Sigmoid model may be caused by their use of more hyperparameters, which are difficult to optimize.

5.3. Matching of Predicted and Observed Pollution Levels

Exemplary plots of the predicted and observed levels of NO2 in the 1-year experiment, SPM in the winter experiment, and O3 in the summer experiment are depicted in Figures 4 to 6, respectively. Although there were some lags and underestimations, the predicted levels (Figure 4) produced by the Linear model and RBF model followed the trend of the observed NO2 level well in the 1-year experiment. However, the other three models failed to follow the observed level at all. In the winter experiment (Figure 5), the Linear model performed the best, while the other four models failed to follow the trend of the observed level, especially the Sigmoid model. In the summer experiment (Figure 6), the Linear model, RBF model, and Polynomial model showed good performance. However, the Polynomial model could not match the peaks of the observed levels. Both the Sigmoid model and Wavelet model produced poor predictions compared with the other three models. It is clear that the Linear model and RBF model performed the best and that their predicted results were the closest to the observed levels, in both the 1-year experiment and the seasonal experiments.

6. Conclusion

Using observed meteorological and pollutant data, SVM models for forecasting daily ambient air pollutant levels were constructed. The prediction results of the Linear model and RBF model showed a relatively good fit to the observed test set of over one year of data, particularly for SO2 and NO2. In the seasonal experiments, the Linear model and RBF model also outperformed the other three tested models, although some lags and underestimations occurred for these two models in the winter experiment. Comparing the five studied models, it was evident that using the Linear kernel or the RBF kernel in an SVM model for air pollutant forecasting in Macau produced superior results with relatively lower errors. It is believed that an SVM model with a Linear kernel or RBF kernel can also perform well for air pollutant forecasting in other similar developing cities, or even for other time series prediction in similar situations.

Although the Linear model and RBF model outperformed the other three tested models, both still underestimate high levels of pollutants. Solving this problem to improve the accuracy of the prediction model is left for future work. Some literature [24] has attempted to integrate the discrete wavelet transform (DWT) with SVM for higher accuracy. Similarly, we will attempt to integrate other machine learning methods, for example, the genetic algorithm (GA), with SVM to improve the accuracy and efficiency of the model in the future.

Acknowledgment

The research was sponsored by research committee, University of Macau, under Grant no. MYRG141(Y2-L2)-FST11-IWF.