Journal of Control Science and Engineering

Volume 2012 (2012), Article ID 518032, 11 pages

http://dx.doi.org/10.1155/2012/518032

## Short-Term Prediction of Air Pollution in Macau Using Support Vector Machines

^{1}Department of Computer and Information Science, University of Macau, Macau

^{2}Faculty of Science and Technology, University of Macau, Macau

^{3}Department of Electromechanical Engineering, University of Macau, Macau

Received 16 January 2012; Accepted 29 March 2012

Academic Editor: Qingsong Xu

Copyright © 2012 Chi-Man Vong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Forecasting of air pollution has become a popular and important topic in recent years because of the health impact of air pollution. It is necessary to build an early warning system that provides forecasts and allows medical practitioners and the local government to issue health alerts to local inhabitants. Meteorological and pollution data collected daily at monitoring stations in Macau are used in this study to build a forecasting system. Support vector machines (SVMs), a novel type of machine learning technique based on statistical learning theory, can be used for regression and time series prediction. SVM is capable of good generalization, but the performance of an SVM model often hinges on the appropriate choice of the kernel.

#### 1. Introduction

Air pollution is often the result of economic development and population growth. The problem is particularly pronounced in developing cities, especially those in China and India. Many epidemiologic studies [1–3] have reported that air pollution is often associated with adverse effects on human respiratory health, particularly in susceptible individuals. For example, ozone has been found to cause airway inflammation and to heighten the airway response to inhaled allergens. It may increase the risk of developing asthma among children taking part in outdoor sports [4]. WHO [5] reported that these health problems may in turn increase the burden on health care systems in the long run and be detrimental to the economy. To reduce the burden on health care due to diseases caused by atmospheric pollutants, the establishment of an early warning system is necessary. The success of an early warning system, which provides forecasts and alerts local inhabitants, depends on the reliability and availability of up-to-date meteorological information and pollution data. For instance, medical practitioners can advise patients to minimize outdoor activities on days with high levels of pollution and smog, depending on the prediction of the early warning system.

The meteorological and pollutant data in Macau are used as a case study for testing the forecasting model on a representative developing city. Macau, located on the southern coast of China with a land area of merely 26.8 square miles, comprises three land zones: Macau peninsula, Taipa, and Coloane (Figure 1). Macau peninsula has the characteristics of a hybridized, urbanized area; Taipa has mainly residential areas; Coloane has a power station and is largely undeveloped with the largest green areas. The population density in 2008 reached 20,493 people per square mile [6], which is one of the highest in the world. The resident population of Macau is projected to increase at an average annual rate of 1.9%, from 513,000 in 2006 to 829,000 in 2031 [7]. The number of vehicles reached 188,668 by the end of 2009, more than triple the number in 1999 [8]. Since 1995, the Macau Government has required imported petroleum products to be lead-free, in line with developed countries, and the sulfur content of gasoline must be lower than 0.05% by weight. Furthermore, in 2004, the sulfur content of petroleum products used in the power station was also regulated. Implementation of these policies reduces the emission of pollutants locally. In addition, construction and infrastructure projects have been transforming the landscape of a reclaimed land area between Taipa and Coloane since 2004.

Monitoring and forecasting of ambient air pollutant levels involve a variety of approaches, for example, on-site measurement, computational fluid dynamics (CFD) simulation, and computational intelligence. The artificial neural network (ANN) method is regarded as cost-effective and has been employed by environmental researchers to construct prediction models for a variety of cities [9–12]. The practical application of these models, however, suffers from several drawbacks, for example, local minima, overfitting, poor generalization, and the need to determine an appropriate network architecture. Support vector machines (SVM), developed by Vapnik [13], provide an effective novel approach that improves on the generalization performance of neural networks while achieving a global solution. SVM can overcome most drawbacks of ANN and has been reported to show promising results [14–16]. However, the performance of the resulting SVM often hinges on the appropriate choice of the kernel, and several kernels are commonly used in SVM for regression. Therefore, another aim of this study is to determine which kernel is more suitable for air pollution prediction.

#### 2. Meteorological Data

The meteorological information and pollutant data measured at the Taipa Grande automatic meteorological station (see Figure 1) from 2003 to 2006 were selected as the experimental data set, which was extracted from Macau Government’s centre. Since the land area of Macau is relatively small, the data obtained at Taipa Grande (at an elevation of approximately 150 m above sea level) may be considered representative of the entire region of Macau. The meteorological stations record pollutant data, such as nitrogen dioxide (NO_{2}), sulfur dioxide (SO_{2}), suspended particulate matter (SPM), and ozone (O_{3}), and climatic data, such as temperature, humidity, rainfall, wind direction, wind speed, and precipitation. The daily average value of the air pollutant and meteorological data is considered a more representative measure and is adopted in this study.

In addition, the recorded levels of SPM, SO_{2}, NO_{2}, and O_{3} in January and July 2006 were selected as special cases in this study. These two months were chosen because January and July represent the winter and summer seasons in Macau, respectively. January is typified by dry weather and a dominant northeastern wind, whereas July is typified by hot, humid weather and a prevailing southeastern wind. Temperatures across these two seasons in Macau may range from 10 to 30°C. In the winter season, due to the dominant northeastern wind, airborne industrial pollutants from mainland China may be blown through Macau. In contrast, the prevailing southeastern wind from the sea in summer usually carries pollutants away.

Moreover, these meteorological data are closely associated with the presence and dispersion of pollutants. In order to discern the relationship between meteorological data and pollutants, a crude, unadjusted bivariate analysis using Pearson correlation coefficients was applied. The resulting Pearson correlation coefficients between the various meteorological and pollutant parameters are presented in Figure 2. The results illustrate some critical problems in using a limited amount of data to determine the relationship between pairs of variables. For instance, atmospheric pressure appeared to have a positive correlation with each of SPM, NO_{2}, and SO_{2} over a one-year period, whereas the same parameters over a three-year period showed no relationship. To reduce the computational load of the regression model, only parameters with a Pearson correlation coefficient greater than 0.5 were selected as model inputs. An exception was made for the parameters related to O_{3}, for which a Pearson correlation coefficient greater than or equal to 0.4 was accepted. Apart from the physical significance of meteorological variables such as sunshine rate to the production of O_{3}, this exception was necessary to avoid having too few inputs in the model, which might then fail to account for the fluctuation of O_{3} levels. The available input variables included in the regression model are summarized in Table 1.
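
The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' procedure: the function names `pearson` and `select_inputs`, the toy values, and the comparison on the absolute value of the coefficient are all assumptions. Only the thresholds (0.5 in general, 0.4 for O_{3}) come from the text.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def select_inputs(target, candidates, threshold=0.5):
    """Keep candidate variables whose |r| with the target exceeds the threshold.

    Assumption: the absolute value of r is compared, so strongly negative
    correlations also qualify.
    """
    return [name for name, series in candidates.items()
            if abs(pearson(series, target)) > threshold]

# Toy daily series (hypothetical values, for illustration only).
no2 = [40.0, 55.0, 62.0, 48.0, 70.0, 66.0]
candidates = {
    "SPM": [80.0, 95.0, 110.0, 90.0, 120.0, 115.0],
    "humidity": [85.0, 70.0, 66.0, 80.0, 60.0, 63.0],
    "pressure": [1012.0, 1009.0, 1013.0, 1010.0, 1012.0, 1011.0],
}
selected = select_inputs(no2, candidates, threshold=0.5)
```

For O_{3}, the same function would be called with the looser threshold, adjusted to an inclusive comparison if the boundary case matters.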

Table 2 shows the Pearson correlation coefficients between pollutants at different time lags. The coefficients for the time series of SPM, SO_{2}, and O_{3} are at, or even fall below, the selection threshold of 0.5 after lag 2. Hence, in order to improve the accuracy of the models while limiting their computational load, only air pollutant and meteorological data at the current day and the previous day were used in this study to predict air pollutant levels at the following day.

On the other hand, wind direction *W* is only available in the form of general directions, such as N, NE, and E. Therefore, the Pearson correlation coefficient cannot be applied to find the relationship between air pollutants and wind direction. However, as mentioned above, wind direction is closely associated with the presence and dispersion of pollutants. Hence, wind direction is also selected as an available input variable in this study. The wind direction is separated into 16 discrete directions. After the corresponding analysis, it was found that only 7 of the 16 wind directions are related to pollutant levels. To represent these directions, a Boolean variable $W_i$, where $i = 1$ to 7, was used for each of them, rather than a single number, so that no bias is incurred.
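
The Boolean encoding of wind direction can be illustrated as follows. This is only a sketch: the seven directions in `RELEVANT_DIRECTIONS` are a hypothetical placeholder, since the directions actually retained in the study are not listed here, and `encode_wind` is an illustrative name.

```python
# Hypothetical subset of compass directions; the seven directions actually
# retained in the study are not reproduced here.
RELEVANT_DIRECTIONS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE"]

def encode_wind(direction, relevant=RELEVANT_DIRECTIONS):
    """Return one Boolean (0/1) flag per relevant direction.

    A direction outside the relevant set maps to all zeros.
    """
    return [1 if direction == d else 0 for d in relevant]

flags_ne = encode_wind("NE")   # flag set only at the position of "NE"
flags_sw = encode_wind("SW")   # "SW" is not in the set, so no flag is set
```

Coding the direction as a single integer (say 1 to 16) would impose an artificial ordering and distance between directions, which separate flags avoid.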

In addition, since most air pollutants are soluble, rainfall might be expected to have a critical impact on the output. However, after rainfall was included in the modelling, its influence on the accuracy of the models was found to be very low. Hence, rainfall was not selected as an input variable in this study.

#### 3. Methodology

##### 3.1. Support Vector Machines

Support vector machines (SVMs) are known as an excellent tool for classification and regression problems [17–19], producing good generalization. The basic principle of SVM is to use a linear model to implement nonlinear class boundaries through a nonlinear mapping of the input vectors into a high-dimensional feature space. Details of the working concept of SVM can be found in [13].

##### 3.2. Kernel Selection

Kernel selection is a crucial issue for support vector machines. A kernel introduces nonlinearity into the SVM problem by mapping the input data $x$ implicitly into a Hilbert space via a function $\phi(x)$, where the data may then be linearly separable. Since SVM only requires inner products of the nonlinearly mapped features $\phi(x)$, a kernel is an efficient way to compute such an inner product and provides the same scalar output $K(x, x_i) = \langle \phi(x), \phi(x_i) \rangle$, where $K$ is a predefined kernel and $x_i$ is a support vector. Different kernels accommodate different nonlinear mappings, and the performance of the resulting SVM often hinges on the appropriate choice of the kernel [20]. Several kernels are commonly used in SVM for regression. Five of them, namely *Linear*, *Polynomial*, *Radial Basis Function* (*RBF*), *Sigmoid*, and *Wavelet*, were used in this study to build SVM models for comparison. The *Wavelet* kernel can be built from any mother wavelet; in this study, the *Morlet* function was selected.
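
For reference, the five kernels named above are commonly written as follows. This is a sketch in conventional notation, not necessarily the authors' exact formulation: the parameter symbols $\gamma$, $r$, $d$, and $a$, the feature count $m$, and the Morlet form with the constant 1.75 are assumptions taken from common usage.

```latex
\begin{align}
K_{\text{Linear}}(x_i, x_j) &= x_i^{T} x_j \\
K_{\text{Polynomial}}(x_i, x_j) &= \left(\gamma\, x_i^{T} x_j + r\right)^{d}, \quad \gamma > 0 \\
K_{\text{RBF}}(x_i, x_j) &= \exp\!\left(-\gamma \lVert x_i - x_j \rVert^{2}\right), \quad \gamma > 0 \\
K_{\text{Sigmoid}}(x_i, x_j) &= \tanh\!\left(\gamma\, x_i^{T} x_j + r\right) \\
K_{\text{Wavelet}}(x_i, x_j) &= \prod_{k=1}^{m} h\!\left(\frac{x_i^{(k)} - x_j^{(k)}}{a}\right),
\qquad h(u) = \cos(1.75\,u)\, e^{-u^{2}/2}
\end{align}
```

Here $x^{(k)}$ denotes the $k$th component of an input vector, $a > 0$ is the dilation parameter of the wavelet, and $h$ is the Morlet mother wavelet.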

#### 4. Experiment Workflow

##### 4.1. Data Sampling

The raw air pollutants and meteorological data from year 2003 to year 2006 were obtained from the website of DSMG (http://www.smg.gov.mo/www/e_index.php (last access: March 2012)). These data were divided into three groups for three experiments as shown in Table 3.

##### 4.2. Data Normalization

Prior to modelling, it is necessary to normalize all selected features into the same range so that no feature with large values dominates. This normalization leads to more stable and accurate predictions. The features in the training and test data were normalized by subtracting and then dividing by the feature means, that is, $x_i' = (x_i - \bar{x}_i)/\bar{x}_i$, where $\bar{x}_i$ is the mean of the $i$th parameter of $x$.
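
The normalization step can be sketched as follows, reading "subtracting and dividing by the feature means" as x' = (x − mean) / mean. The function names and toy values are illustrative. Computing the means on the training data and reusing them for the test data is an assumption, since the text does not say which means were used.

```python
def feature_means(rows):
    """Column-wise means of a list of equal-length feature rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def normalize(rows, means):
    """Apply x' = (x - mean) / mean to every feature of every row."""
    return [[(x - m) / m for x, m in zip(row, means)] for row in rows]

# Toy data (hypothetical values): two features on very different scales.
train = [[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]]
test = [[25.0, 300.0]]

means = feature_means(train)          # assumption: means come from training data
train_norm = normalize(train, means)
test_norm = normalize(test, means)
```

A feature whose mean is zero would need special handling, because the division is then undefined.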

##### 4.3. Modeling and Data Representation

As mentioned in Section 2, air pollutant and meteorological data at the previous day and the current day were used in this study to predict the air pollutant level at the following day. In order to apply SVM to pollutant level forecasting, the representation of a pollutant level is defined as an input-output pair. Generally, the following features are chosen as the input for a specific pollutant *P*:

- level of *P* at the previous day;
- level of *P* at the current day;
- levels of the pollutants correlated with *P* at the previous day;
- levels of the pollutants correlated with *P* at the current day;
- levels of the meteorological parameters correlated with *P* at the previous day;
- levels of the meteorological parameters correlated with *P* at the current day.

For example, if pollutant *P* = NO_{2}, then according to Table 1, its correlated pollutants are SPM and SO_{2}, and its correlated meteorological parameters are wind direction, temperature, and humidity. The input for NO_{2} then consists of the levels of NO_{2}, SPM, SO_{2}, wind direction, temperature, and humidity at the previous day and the current day. Finally, the output is the level of *P* (i.e., the predicted pollutant level) at the following day.
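
The construction of the training pairs described above can be sketched as follows. The function name `build_samples`, the ordering of the features inside the input vector, and the toy values are assumptions made for illustration; only the use of the previous day and the current day as input and the following day as output comes from the text.

```python
def build_samples(target, correlated):
    """Build (input, output) pairs for next-day forecasting.

    target:     daily levels of the pollutant to predict
    correlated: dict of name -> daily series of correlated variables
    Input for day t: target and correlated values at days t-1 and t.
    Output: target level at day t+1.
    """
    names = sorted(correlated)
    samples = []
    for t in range(1, len(target) - 1):
        x = [target[t - 1], target[t]]
        for name in names:
            x += [correlated[name][t - 1], correlated[name][t]]
        samples.append((x, target[t + 1]))
    return samples

# Toy daily series (hypothetical values).
no2 = [40.0, 55.0, 62.0, 48.0, 70.0]
others = {
    "SPM": [80.0, 95.0, 110.0, 90.0, 120.0],
    "temperature": [18.0, 17.0, 16.0, 19.0, 15.0],
}
samples = build_samples(no2, others)
```

A series of N days yields N − 2 pairs, because the first day has no previous day and the last day has no following day.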

This set of training data is then passed to the SVM models. The concept is illustrated in Figure 3. For simplicity, the SVM models were named according to the kernel used. Consequently, there were five kinds of model for each pollutant in this study: *Linear model*, *Polynomial model*, *RBF model*, *Sigmoid model*, and *Wavelet model*. With four pollutants, five modelling methods, and three experiments, 60 trained models (4 × 5 × 3) were developed in total.

##### 4.4. Experiment Environment

Modelling was performed on the MATLAB 2007a platform, where the LIBSVM toolbox [21] and the SVM Matlab toolbox [22] were employed to construct the models. The SVM hyperparameters (*c* and *g*) and the options of the different kernels were optimized.
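
The text does not state how *c* and *g* were tuned. One common approach with LIBSVM, where *c* is the cost parameter and *g* is the kernel gamma, is a grid search over exponentially spaced values. The sketch below assumes that approach; `grid_search` and `toy_cv_error` are hypothetical names, and the toy error function merely stands in for training and cross-validating an SVM.

```python
import itertools
import math

def grid_search(cv_error, c_exponents=range(-5, 16, 2), g_exponents=range(-15, 4, 2)):
    """Return (error, c, g) with the lowest error over a log2-spaced grid.

    cv_error is a hypothetical callable (c, g) -> validation error; in practice
    it would train an SVM with those settings and cross-validate it.
    """
    best = None
    for c_exp, g_exp in itertools.product(c_exponents, g_exponents):
        c, g = 2.0 ** c_exp, 2.0 ** g_exp
        err = cv_error(c, g)
        if best is None or err < best[0]:
            best = (err, c, g)
    return best

# Toy stand-in for a real cross-validation routine: minimum at c = 2^3, g = 2^-5.
def toy_cv_error(c, g):
    return (math.log2(c) - 3) ** 2 + (math.log2(g) + 5) ** 2

best_err, best_c, best_g = grid_search(toy_cv_error)
```

Kernels with extra options, such as the degree of the Polynomial kernel, would add further dimensions to the grid, which is one reason they are harder to tune.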

#### 5. Results

##### 5.1. Error Measures

In order to compare the accuracy of the models effectively, four error measures were used in this study: mean absolute error (MAE), root mean squared error (RMSE), complementary Willmott’s index of agreement (CWIA), and relative error (RE). RE is needed because, in a warning system, attention is usually focused on levels exceeding a particular dangerous level. The success of a forecasting system may be defined by whether the predicted value falls within an accepted error range relative to the true value [23]. In these measures, $P_i$ and $O_i$ represent the predicted level and observed level on the $i$th day, respectively, $O_{\max}$ and $O_{\min}$ represent the maximum and minimum observed levels within each test set, and *n* is the number of data points in the test set.
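
Three of the error measures named in this section can be sketched as follows. MAE and RMSE follow their standard definitions. CWIA is assumed here to be one minus Willmott's index of agreement, so that lower values are better, in line with the other measures; this reading is an assumption. RE is left out because its exact form is not reproduced here.

```python
import math

def mae(pred, obs):
    """Mean absolute error: (1/n) * sum |P_i - O_i|."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)

def rmse(pred, obs):
    """Root mean squared error: sqrt((1/n) * sum (P_i - O_i)^2)."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def cwia(pred, obs):
    """Complementary Willmott index (assumed form).

    sum (P_i - O_i)^2 / sum (|P_i - mean(O)| + |O_i - mean(O)|)^2
    """
    mean_obs = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for p, o in zip(pred, obs))
    den = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
              for p, o in zip(pred, obs))
    # den is zero only when every value equals the observed mean (num is 0 too).
    return num / den if den else 0.0

# Toy example (hypothetical values): a flat forecast against a rising series.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 2.0]
scores = (mae(predicted, observed), rmse(predicted, observed), cwia(predicted, observed))
```

A perfect forecast gives zero for all three measures, while the flat forecast in the toy example reaches the worst CWIA value of 1.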

##### 5.2. Prediction Results

Table 4 presents the results of SVM models under different kernels in the 1-year experiment for SPM, SO_{2}, NO_{2}, and O_{3}. The bolded values indicate the best performance among the five tested models. *Linear model* and *RBF model* produced satisfactorily low errors for all pollutants, and the results of these two models were comparable. The results of *Polynomial model* and *Wavelet model* were poorer, with errors 3–18% higher than those of *Linear model* and *RBF model*. *Sigmoid model* produced the highest errors. The results of the seasonal experiments showed the same pattern as the 1-year experiment (see Table 5 (winter experiment) and Table 6 (summer experiment)). From these results, *Linear model* and *RBF model* generalized much better than the other three models. In addition, the poor results of *Polynomial model* and *Sigmoid model* may be caused by their larger number of hyperparameters, which are difficult to optimize.

##### 5.3. Matching of Predicted and Observed Pollution Levels

Example plots of the predicted and observed levels of NO_{2} in the 1-year experiment, SPM in the winter experiment, and O_{3} in the summer experiment are depicted in Figures 4 to 6, respectively. Although there were some lagging and underestimation, the predicted levels (Figure 4) produced by *Linear model* and *RBF model* followed the trend of the observed NO_{2} level well in the 1-year experiment, whereas the other three models failed to follow the observed level at all. In the winter experiment (Figure 5), *Linear model* performed the best, while the other four models failed to follow the trend of the observed level, especially *Sigmoid model*. In the summer experiment (Figure 6), *Linear model*, *RBF model*, and *Polynomial model* performed well, although *Polynomial model* could not match the peaks of the observed levels. Both *Sigmoid model* and *Wavelet model* produced poor predictions compared with the other three models. Overall, *Linear model* and *RBF model* performed the best, and their predicted results were the closest to the observed levels, in both the 1-year experiment and the seasonal experiments.

#### 6. Conclusion

Using observed meteorological and pollutant data, SVM models for forecasting daily ambient air pollutant levels were constructed. The predictions of *Linear model* and *RBF model* showed a relatively good fit to the observed test set of over one year of data, particularly for SO_{2} and NO_{2}. In the seasonal experiments, *Linear model* and *RBF model* also outperformed the other three tested models, although some lagging and underestimation occurred for these two models in the winter experiment. Comparing the five studied models, it was evident that using the *Linear* kernel or the *RBF* kernel in the SVM model for air pollutant forecasting in Macau produced superior results with relatively lower errors. It is believed that an SVM model with the *Linear* or *RBF* kernel can also perform well for air pollutant forecasting in other similar developing cities, or even for other time series prediction in similar situations.

Although *Linear model* and *RBF model* outperformed the other three tested models, both still underestimate high pollutant levels. Solving this problem to improve the accuracy of the prediction model is left for future work. Some literature [24] has attempted to integrate the discrete wavelet transform (DWT) with SVM for higher accuracy. Similarly, we will attempt to integrate other machine learning methods, for example, genetic algorithms (GA), with SVM to improve the accuracy and efficiency of the model in the future.

#### Acknowledgment

The research was sponsored by the Research Committee, University of Macau, under Grant no. MYRG141(Y2-L2)-FST11-IWF.

#### References

1. J. A. Bernstein, N. Alexis, C. Barnes et al., “Health effects of air pollution,” *Journal of Allergy and Clinical Immunology*, vol. 114, no. 5, pp. 1116–1123, 2004.
2. A. Seaton, W. MacNee, K. Donaldson, and D. Godden, “Particulate air pollution and acute health effects,” *Lancet*, vol. 345, no. 8943, pp. 176–178, 1995.
3. C. A. Pope III, D. V. Bates, and M. E. Raizenne, “Health effects of particulate air pollution: time for reassessment?” *Environmental Health Perspectives*, vol. 103, no. 5, pp. 472–480, 1995.
4. R. McConnell, K. Berhane, F. Gilliland et al., “Air pollution and bronchitic symptoms in Southern California children with asthma,” *Environmental Health Perspectives*, vol. 107, no. 9, pp. 757–760, 1999.
5. WHO, “WHO Air quality guidelines for particulate matter, ozone, nitrogen dioxide and sulfur dioxide,” 2005.
6. DSEC, “Population Estimate of Macao,” 2008, http://www.dsec.gov.mo/getAttachment/9cc40c96-fa67-400b-b66f-d77af509754a/E_POP_FR_2008_Y.aspx.
7. DSEC, “Macao Resident Population Projections 2007–2031,” 2008, http://www.dsec.gov.mo/getAttachment/e92b4427-6057-491a-80b7-f4255b65f1ca/E_PPRM_PUB_2007_Y.aspx.
8. DSEC, “Transport and Communications Statistics 2009,” 2009, http://www.dsec.gov.mo/getAttachment/7c480491-15cf-4018-a0b8-19fa951be8a2/E_ETC_FR_2009_M12.aspx.
9. S. L. Reich, D. R. Gomez, and L. E. Dawidowski, “Artificial neural network for the identification of unknown air pollution sources,” *Atmospheric Environment*, vol. 33, no. 18, pp. 3045–3052, 1999.
10. M. W. Gardner and S. R. Dorling, “Artificial neural networks (the multilayer perceptron): a review of applications in the atmospheric sciences,” *Atmospheric Environment*, vol. 32, no. 14-15, pp. 2627–2636, 1998.
11. M. Kolehmainen, H. Martikainen, and J. Ruuskanen, “Neural networks and periodic components used in air quality forecasting,” *Atmospheric Environment*, vol. 35, no. 5, pp. 815–825, 2001.
12. W. Z. Lu, H. Y. Fan, A. Y. T. Leung, and J. C. K. Wong, “Analysis of pollutant levels in central Hong Kong applying neural network method with particle swarm optimization,” *Environmental Monitoring and Assessment*, vol. 79, no. 3, pp. 217–230, 2002.
13. V. N. Vapnik, *The Nature of Statistical Learning Theory*, Springer, New York, NY, USA, 1st edition, 1995.
14. W. Z. Lu and W. J. Wang, “Potential assessment of the “support vector machine” method in forecasting ambient air pollutant trends,” *Chemosphere*, vol. 59, no. 5, pp. 693–701, 2005.
15. I. Juhos, L. Makra, and B. Tóth, “Forecasting of traffic origin NO and NO_{2} concentrations by Support Vector Machines and neural networks using Principal Component Analysis,” *Simulation Modelling Practice and Theory*, vol. 16, no. 9, pp. 1488–1502, 2008.
16. W. C. Wang, K. W. Chau, C. T. Cheng, and L. Qiu, “A comparison of performance of several artificial intelligence methods for forecasting monthly discharge time series,” *Journal of Hydrology*, vol. 374, no. 3-4, pp. 294–306, 2009.
17. P. K. Wong, C. M. Vong, L. M. Tam, and K. Li, “Data preprocessing and modelling of electronically-controlled automotive engine power performance using kernel principal components analysis and least squares support vector machines,” *International Journal of Vehicle Systems Modelling and Testing*, vol. 3, no. 4, pp. 312–330, 2008.
18. P. K. Wong, L. M. Tam, K. Li, and C. M. Vong, “Engine idle-speed system modelling and control optimization using artificial intelligence,” *Proceedings of the Institution of Mechanical Engineers D*, vol. 224, no. 1, pp. 55–72, 2010.
19. C. M. Vong and P. K. Wong, “Engine ignition signal diagnosis with Wavelet Packet Transform and Multi-class Least Squares Support Vector Machines,” *Expert Systems with Applications*, vol. 38, no. 7, pp. 8563–8570, 2011.
20. T. Jebara, “Multi-task feature and kernel selection for SVMs,” in *21st International Conference on Machine Learning (ICML '04)*, pp. 433–440, Banff, Canada, July 2004.
21. C. C. Chang and C. J. Lin, “LIBSVM: A Library for Support Vector Machines,” http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
22. “SVM and Kernel Methods Matlab Toolbox,” http://asi.insa-rouen.fr/enseignants/~arakotom/toolbox/index.html.
23. W. F. Ip, C. M. Vong, J. Y. Yang, and P. K. Wong, “Least squares support vector prediction for daily atmospheric pollutant level,” in *9th IEEE/ACIS International Conference on Computer and Information Science (ICIS '10)*, pp. 23–28, August 2010.
24. S. Osowski and K. Garanty, “Forecasting of the daily meteorological pollution using wavelets and support vector machine,” *Engineering Applications of Artificial Intelligence*, vol. 20, no. 6, pp. 745–755, 2007.