Abstract

Air quality prediction is an important research issue due to the increasing impact of air pollution on the urban environment. However, existing methods often fail to forecast high-polluting air conditions, which is precisely what should be highlighted. In this paper, a novel multiple kernel learning (MKL) model that embodies the characteristics of ensemble learning, kernel learning, and representation learning is proposed to forecast the near-future air quality (AQ). The centered alignment approach is used for learning kernels, and a boosting approach is used to determine the proper number of kernels. The proposed MKL model is compared with the classical autoregressive integrated moving average (ARIMA) model; widely used parametric models like random forest (RF) and support vector machine (SVM); popular neural network models like the multiple layer perceptron (MLP); and the long short-term memory (LSTM) neural network. Datasets acquired from a coastal city, Hong Kong, and an inland city, Beijing, are used to train and validate all the models. Experiments show that the MKL model outperforms the other models. Moreover, the MKL model has better forecasting ability for the high-health-risk AQ categories.

1. Introduction

With the development of the economy and society all over the world, most metropolitan cities are experiencing elevated concentrations of ground-level air pollutants, especially in fast-developing countries like India and China. Exposure to air pollution can affect everyone, but it can be particularly harmful to people with heart or lung conditions, elderly people, and children. Studies show that long-term exposure to fine particulate air pollution or traffic-related air pollution is associated with environmental-cause mortality, even at concentration ranges well below the standard annual mean limit value [1, 2]. Therefore, an early warning system that provides precise forecasts and issues health alerts to local inhabitants would offer valuable protection against the damage caused by air pollution.

Currently, three major approaches are used to forecast real-time air quality: simple empirical approaches, advanced physically based approaches, and machine learning approaches.

Simple empirical approaches like the persistence method and the climatology method are based on assumptions or hypotheses, for example, that thresholds of forecasted meteorological variables can indicate future pollution levels [3]. They are computationally fast but have low accuracy and are primarily used as references by other methods. Advanced physically based approaches like chemical transport models (CTMs) simulate the formation and accumulation of air pollutants by solving the conservation equations and transformation relationships among the mass of various chemical species and physical states. They can provide valuable insights for understanding pollutant diffusion mechanisms. However, they are computationally expensive, demand reliable meteorological predictions, and require a high level of expertise [4].

Machine learning methods are computationally fast and cost-effective and can provide promising prediction accuracy. Various machine learning methods have been applied to air quality prediction. Widely used methods include classical autoregressive moving average (ARMA) methods like the autoregressive integrated moving average (ARIMA) [5], support vector machine (SVM) methods like the support vector classifier (SVC) [6, 7], ensemble methods like the random forest (RF) [8, 9], artificial neural network (ANN) methods like the multiple layer perceptron (MLP) [10, 11], and deep learning methods like the long short-term memory neural network (LSTM NN) [12, 13].

Among the models mentioned above, ARIMA is a time series model and is often used as a baseline model. The performance of the SVM model often hinges on the appropriate choice of the kernel. A kernel in SVM introduces nonlinearity into the problem by mapping input data implicitly into a Hilbert space where it may then be linearly separable [14]. Neural network models, especially deep neural networks, can automatically learn representations from raw data, but it takes a long time and a large volume of data to train a well-behaved network.

Multiple kernel learning (MKL) has been proposed as an alternative to cross validation, feature selection, metric learning, and ensemble methods. MKL refers to using multiple kernels instead of a single one; most algorithms that make use of the kernel trick can take advantage of MKL, such as SVM and kernel ridge regression (KRR). In MKL, feature combination and classifier training are done simultaneously, and different data formats can be used in the same formulation. In addition, the inherent ability to combine linear kernels and nonlinear kernels makes MKL more promising for information fusion problems. There is a significant amount of work in the literature on combining multiple kernels [15, 16]. Various applications indicate that performance gains can be achieved by linear and nonlinear kernel combinations using MKL methods [17–19].

In this paper, a novel multiple kernel learning-based air quality prediction approach that can inherently capture the characteristics of the heterogeneous time, meteorology, and air pollutant data is proposed. Real datasets from a coastal city, Hong Kong, and an inland city, Beijing, are used to demonstrate the effectiveness of the proposed approach. Comprehensive comparison experiments with ARIMA, RF, SVCs, MLP, and LSTM are conducted. Though some of the algorithms can automatically learn representative features of the data, pretraining feature engineering is still necessary and significantly affects the models' performance. In addition, hyperparameter tuning is critical for all the parametric models. Therefore, in this paper, special attention is paid to the feature engineering and parameter tuning processes. The methodologies applied to the Hong Kong and Beijing datasets are similar; therefore, Hong Kong is used for demonstration in most of the paper. The main contributions of this paper are as follows:
(1) A multiple kernel learning approach is introduced into the domain of air quality prediction for the first time. Multiscale predictions over the next 1, 3, 6, 9, and 12 hours' air quality of an inland city, Beijing, and a coastal city, Hong Kong, are presented.
(2) The proposed method can effectively capture the air quality features from the hybrid time, meteorology, and air pollutant data. The experimental results demonstrate the advantages of this approach over some widely used models, especially in the prediction of severe air pollution conditions.

The rest of the paper is organized as follows: Section 2 presents the methodology of the multiple kernel learning algorithm; data preparation is introduced in Section 3; in Section 4, extensive experimentation results and necessary discussions are presented; and Section 5 concludes this paper.

2. Methodology

While classical kernel-based classifiers such as SVCs are based on a single kernel, in practice, it is often desirable to base classifiers on combinations of multiple kernels, since the data typically come from multiple heterogeneous sources. A kernel implicitly represents a notion of similarity for the data, and different kernels accommodate different nonlinear mappings; MKL provides a way to combine different notions of similarity. Using a specific kernel may be a source of bias, and MKL provides a way to select optimal kernels and parameters from a larger set of kernels. In the air quality prediction case, the source data come from different modalities. Therefore, in this paper, instead of using just a single kernel, which is usually more suitable for a homogeneous data source, multiple kernels are combined, and the classical and empirically successful support vector classifier is used as the base learner. A detailed introduction to the kernel support vector machine is given in Appendix A. In this section, the multiple kernel learning approach is described first, and then the centered alignment method is introduced for learning kernels.

2.1. Multiple Kernel Learning

MKL is conceptually similar to single kernel learning; in fact, single kernel learning is a special case of MKL. In MKL, the final kernel is learnt as a combination (linear or nonlinear) of many base kernels from the data itself:

$$k_\eta(x_i, x_j) = f_\eta\left(\{k_m(x_i^m, x_j^m)\}_{m=1}^{P}\right),$$

where $f_\eta$ is the combination function, $k_m$ is the $m$-th kernel function, $x_i^m \in \mathbb{R}^{D_m}$ with $D_m$ the dimensionality of the corresponding feature representation, and $\eta$ parameterizes the combination function.

It is also possible to integrate $\eta$ into the kernel functions, where it is optimized during training.

Most of the existing MKL algorithms fall into the first category and try to combine predefined kernels in an optimal way. Commonly used kernels are linear, polynomial, radial basis function (RBF), and sigmoid.

The kernels can be combined in different ways, and each has its own combination parameter characteristics. Generally, linear combination methods are used, and they fall into two basic categories: unweighted sum (i.e., using the sum or mean of the kernels as the combined kernel) and weighted sum. In the weighted sum case, the combination function is linearly parameterized:

$$k_\eta(x_i, x_j) = \sum_{m=1}^{P} \eta_m k_m(x_i^m, x_j^m),$$

where $\eta = (\eta_1, \ldots, \eta_P)$ denotes the kernel weights. Different versions of this approach differ in the restrictions they put on $\eta$: the linear sum allows arbitrary real values ($\eta_m \in \mathbb{R}$), the conic sum requires $\eta_m \geq 0$, and the convex sum requires $\eta_m \geq 0$ with $\sum_{m=1}^{P} \eta_m = 1$.

The conic sum and convex sum are special cases of the linear sum, but the former two are used more often because the relative importance of the combined kernels can be extracted by looking at the kernel weights. Furthermore, the kernel weights of the conic and convex sum correspond to scaling the feature spaces when they are nonnegative [20].

In this paper, the conic sum restriction is used, as the convex sum is a special case of the conic sum. The resulting decision function of the multiple kernel support vector classifier (MKSVC) is defined as

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i y_i \sum_{m=1}^{P} \eta_m k_m(x_i^m, x^m) + b\right).$$

There are four important parameters: the number of kernels ($P$), the inner kernel coefficients of each kernel, the features to use for each kernel, and the weight ($\eta_m$) of each kernel. In this paper, the inner kernel coefficients are obtained by optimizing the single kernel-based learners. $\eta_m$ is obtained by the centered alignment approach proposed in [32]. $P$ is obtained through the boosting approach by iteratively adding a new kernel until the performance stops improving (kernels are added in order of the weights learned by the centered alignment approach, highest weight first). As for the features used by each kernel, for simplicity, the canonical multiple kernel learning approach is used, namely, one kernel combination over all feature representations. The pseudocode of the MKSVC is described in Algorithm 1.

Input: dataset $\{(x_i, y_i)\}_{i=1}^{n}$, n samples
Output: decision function $f(x)$ of MKSVC
Start
  First, get the inner kernel coefficients by optimizing the single kernel-based learners ($k_m$)
  Second, get the weight of each kernel by the centered kernel alignment algorithm ($\eta_m$)
  Third, get the number of kernels by the boosting approach ($P$)
  Fourth, get the combined optimized kernel $k_\eta = \sum_{m=1}^{P} \eta_m k_m$
  Then, use SVC as the base learner and optimize it with a general optimization algorithm
Return $f(x)$
Stop
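To make the procedure concrete, the following is a minimal Python sketch of the kernel-combination step of Algorithm 1, using scikit-learn with precomputed kernels. The weights, kernel coefficients, and variable names are illustrative placeholders and not values from the actual experiments:

```python
# A minimal sketch of the MKSVC combination step, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel, polynomial_kernel

def combined_kernel(X, Y, eta, gammas):
    """Conic (weighted) sum of base kernels: K = sum_m eta_m * K_m."""
    kernels = [
        linear_kernel(X, Y),
        rbf_kernel(X, Y, gamma=gammas['rbf']),
        polynomial_kernel(X, Y, degree=3, gamma=gammas['poly']),
    ]
    return sum(w * K for w, K in zip(eta, kernels))

# eta: kernel weights from centered alignment (Section 2.2); placeholders here
eta = np.array([0.5, 0.3, 0.2])
# inner kernel coefficients tuned beforehand on single-kernel SVCs; placeholders
gammas = {'rbf': 0.5, 'poly': 0.1}

# X_train, y_train, X_test are assumed prepared, normalized feature arrays:
# clf = SVC(kernel='precomputed', C=1.0)
# clf.fit(combined_kernel(X_train, X_train, eta, gammas), y_train)
# y_pred = clf.predict(combined_kernel(X_test, X_train, eta, gammas))
```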
2.2. Centered Alignment Method for Learning Kernels

Centered alignment is used as a similarity measure between kernels or kernel matrices. Given kernel matrices $K_1, \ldots, K_P$, centered kernel alignment learns a linear combination of kernels resulting in a combined kernel matrix:

$$K_\eta = \sum_{m=1}^{P} \eta_m K_m^c,$$

where $P$ is the number of kernels, $\eta_m$ is the centered kernel weight, and $K_m^c$ is the centered kernel:

$$K^c = \left(I - \frac{\mathbf{1}\mathbf{1}^{T}}{n}\right) K \left(I - \frac{\mathbf{1}\mathbf{1}^{T}}{n}\right),$$

where $I$ is the identity matrix, $\mathbf{1}$ denotes the vector with all entries equal to one, and $K$ is the original kernel matrix.

The alignment between two kernel functions $k_1$ and $k_2$ is defined by

$$\rho(K_1, K_2) = \frac{\langle K_1^c, K_2^c \rangle_F}{\|K_1^c\|_F \, \|K_2^c\|_F},$$

where $K_1^c$ and $K_2^c$ are the centered kernel matrices of $k_1$ and $k_2$, and $\langle \cdot, \cdot \rangle_F$ and $\|\cdot\|_F$ denote the Frobenius product and the Frobenius norm, defined by

$$\langle A, B \rangle_F = \sum_{i,j} A_{ij} B_{ij}, \qquad \|A\|_F = \sqrt{\langle A, A \rangle_F},$$

and $|\rho(K_1, K_2)| \leq 1$ by definition.

Using the independent alignment-based algorithm proposed in [32], the alignment between each kernel matrix $K_m$ and the target kernel ($K_y = yy^{T}$, where $y$ is the vector of labels) can be computed independently using the training samples, and the centered kernel weight can be chosen proportional to that alignment. Thus, the resulting kernel matrix is defined by

$$\eta_m \propto \rho(K_m, yy^{T}), \qquad K_\eta = \sum_{m=1}^{P} \eta_m K_m^c.$$
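A direct numpy implementation of the centering, alignment, and weighting formulas above might look as follows (a sketch; the function names are ours):

```python
# A sketch of independent centered-kernel alignment, assuming numpy arrays.
import numpy as np

def center_kernel(K):
    """K_c = (I - 11^T/n) K (I - 11^T/n)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def alignment(K1, K2):
    """rho(K1, K2) = <K1_c, K2_c>_F / (||K1_c||_F ||K2_c||_F)."""
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    num = np.sum(K1c * K2c)  # Frobenius inner product
    return num / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

def alignment_weights(kernels, y):
    """Weight each kernel by its alignment with the target kernel yy^T."""
    Ky = np.outer(y, y)
    rho = np.array([alignment(K, Ky) for K in kernels])
    return rho / rho.sum()  # normalize the weights to a convex sum
```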

3. Data Preparation

In this paper, two datasets are used: one is from Hong Kong, a coastal city whose air quality is relatively good, and the other is from Beijing, an inland city whose air quality is relatively poor. The HK dataset contains two years' hourly meteorology and pollutant data between 1 February 2013 and 31 January 2015, collected from HK's Sha Tin air quality monitoring station [21] and weather forecast station [22]. The Beijing dataset contains five years' hourly PM2.5 and meteorology data between 1 January 2010 and 31 December 2014, collected from the UCI machine learning repository [23].

3.1. Prediction Target and Performance Metric
3.1.1. Prediction Target

The prediction targets in this paper are the air quality health index (AQHI) in Hong Kong and the PM2.5 individual air quality level (IAQL) in Beijing. AQHI and IAQL are scales designed to help understand the impact of air quality on health. Unlike air quality concentrations, these air quality indices provide the public with advice on how to protect their health during air quality levels associated with low, moderate, high, and very high health risks. They also provide advice on how to improve air quality by proposing behavioral change to reduce the environmental footprint [24, 25].

For any given hour, the AQHI is calculated from the sum of the percentage excess risk of daily hospital admissions attributing to the 3-hour moving average concentrations of four criteria air pollutants: ozone (O3), nitrogen dioxide (NO2), sulphur dioxide (SO2), and particulate matter (PM) (respirable suspended particulates (RSP or PM10) or fine suspended particulates (FSP or PM2.5), whichever poses a higher health risk).

The IAQL is classified based on the individual air quality index (IAQI), which is calculated according to a formula published by China's Ministry of Environmental Protection (MEP) [26]. The pollutant with the highest IAQI among SO2, NO2, O3, carbon monoxide (CO), PM2.5, and PM10 at a given time is called the primary or dominant pollutant, and its IAQI is chosen as the overall AQI value. In China, PM2.5 is the primary pollutant most of the time; therefore, its IAQI is usually the overall AQI.

The detailed information of calculating AQHI and IAQI is given in Appendix B. These indices are health protection tools used to make decisions to reduce short-term exposure to air pollution by adjusting activity levels during increased levels of air pollution. Table 1 shows the health risks with corresponding air quality classifications.

3.1.2. Performance Metric

In this paper, accuracy, mean square error (mse), weighted precision (wp), weighted recall (wr), and weighted f1-score (wf) are used to evaluate the effectiveness of all the algorithms. The precision (P) is calculated as $P = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$, where TP is the number of true positives and FP is the number of false positives. Recall (R) is the proportion of instances of a given class that are correctly classified, $R = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$, where FN is the number of false negatives. The F1-score is the harmonic mean of precision and recall [27].

For accuracy and mse,

$$\text{accuracy}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} 1(\hat{y}_i = y_i), \qquad \text{mse}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,$$

where $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value.

For wp, wr, and wf,

$$\text{wp} = \frac{1}{\sum_{l \in L} |y_l|} \sum_{l \in L} |y_l|\, P(y_l, \hat{y}_l), \quad \text{wr} = \frac{1}{\sum_{l \in L} |y_l|} \sum_{l \in L} |y_l|\, R(y_l, \hat{y}_l), \quad \text{wf} = \frac{1}{\sum_{l \in L} |y_l|} \sum_{l \in L} |y_l|\, F_1(y_l, \hat{y}_l),$$

where $\hat{y}$ is the set of predicted (sample, label) pairs, $y$ is the set of true (sample, label) pairs, $L$ is the set of classes, and $y_l$ is the subset of $y$ with class $l$; similarly, $\hat{y}_l$ is the subset of $\hat{y}$. $P(A, B) := |A \cap B| / |B|$ and $R(A, B) := |A \cap B| / |A|$ (conventions vary on handling $B = \emptyset$; this implementation uses $P(A, B) := 0$, and similar for $R$). $F_1(A, B) := 2\, P(A, B)\, R(A, B) / (P(A, B) + R(A, B))$.
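These are the standard scikit-learn definitions, so all five metrics can be computed directly; the following sketch assumes integer class labels:

```python
# Computing the five evaluation metrics with scikit-learn (a sketch).
from sklearn.metrics import (accuracy_score, mean_squared_error,
                             precision_score, recall_score, f1_score)

def evaluate(y_true, y_pred):
    """Return the metrics used in this paper for one set of predictions."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'mse': mean_squared_error(y_true, y_pred),
        'wp': precision_score(y_true, y_pred, average='weighted'),
        'wr': recall_score(y_true, y_pred, average='weighted'),
        'wf': f1_score(y_true, y_pred, average='weighted'),
    }
```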

3.2. Featured Data

Take the HK dataset for example. The following air pollutant data features are included: FSP, NO2, NOx, O3, RSP, and SO2 (the unit of measurement for all the air pollutants is μg/m3). Air pollutant data samples are shown in Table 2.

The following meteorology data features are included: T, P0, P1, , H, WD, WP, and dew. Meteorological samples are shown in Table 3.

The following time stamp features are included: month, the day of the week (week), the day of the month (day), and the hour of the day (hour). There may be a yearly trend in the air quality, but only a limited number of years of data are available, so "year" is not included in the feature set.

3.3. Feature Engineering
3.3.1. Feature Transformation

(1) Encoding Wind Direction. Among the data obtained, the wind direction is nonnumeric (i.e., "east," "east-southeast"). It has to be converted to numerical values so that the algorithms can make use of it. One-hot encoding (e.g., "east" is encoded as [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]) and label encoding (e.g., "east" is encoded as 1, "south" is encoded as 2, etc.) were tried in this paper. Figure 1 shows the forecast performances of the RF, MLP, and SVC_linear (SVC with linear kernel) algorithms when the wind direction was encoded by one-hot encoding and label encoding, respectively, with the parameters of the algorithms unchanged. From the figure, it is obvious that label encoding is superior to one-hot encoding on this dataset. Therefore, in this paper, the wind direction was label encoded.
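A minimal sketch of the two encodings with pandas follows (the direction vocabulary shown is abbreviated for illustration):

```python
# Label encoding vs. one-hot encoding of wind direction (a sketch).
import pandas as pd

wd = pd.Series(['east', 'east-southeast', 'south', 'east'])

# Label encoding: each direction is mapped to a single integer code
wd_label = wd.astype('category').cat.codes

# One-hot encoding: one binary column per direction
wd_onehot = pd.get_dummies(wd, prefix='WD')
```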

(2) Missing Data Imputation. Linear interpolation was used in this paper to interpolate the missing values in the two datasets:

$$x_{t+i} = x_t + \frac{i}{k}\left(x_{t+k} - x_t\right), \quad 0 < i < k,$$

where $x_{t+i}$ denotes the missing value at time $t+i$ and $k$ is the time gap between the interval endpoints $t$ and $t+k$ with known values.
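With pandas, this corresponds to a one-line call (a sketch with made-up values):

```python
# Linear interpolation of missing hourly values with pandas (a sketch).
import numpy as np
import pandas as pd

s = pd.Series([12.0, np.nan, np.nan, 18.0])  # two missing hours between 12 and 18
s_filled = s.interpolate(method='linear')     # -> 12.0, 14.0, 16.0, 18.0
```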

(3) Data Normalization. Normalization or standardization of either input or target variables tends to make the training process better behaved. Normalization scales the feature values into the range [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}.$$

Standardization transforms the feature values to have zero mean and unit variance:

$$x' = \frac{x - \mu}{\sigma},$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the feature.

To see whether normalization or standardization helps, both were tried and compared with leaving the data unprocessed. Again, RF, MLP, and SVC_linear were used as the validation algorithms. Results are shown in Figure 2. The figure shows that models generally benefit from normalization or standardization, especially the neural network model. Normalization is slightly better than standardization. Therefore, in this paper, the data were normalized.
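The two preprocessing options correspond to MinMaxScaler and StandardScaler in scikit-learn; a sketch follows, assuming X_train and X_test are the prepared feature matrices:

```python
# Normalization vs. standardization with scikit-learn (a sketch).
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = MinMaxScaler()      # normalization to [0, 1], used in this paper
# scaler = StandardScaler()  # standardization: zero mean, unit variance

# Fit on the training set only, then apply to both sets to avoid leakage:
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)
```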

3.3.2. Feature Selection

Take Hong Kong for example. The source dataset contains 18 features, and they are as follows:
Meteorological (M) data features: <T, P0, P1, , H, WD, WP, dew>
Air pollutant (AP) data features: <FSP, NO2, NOx, O3, RSP, SO2>
Time data features: <month, week, day, hour>

The target is to forecast the near-future AQHI. However, not all of the features above are related to the AQHI; identifying the features that are correlated with the target would be beneficial. Historical pollutant and meteorology values may influence the future air quality, as the simple empirical approaches assume; identifying the influential historical time lag is therefore important as well.

(1) Feature Correlation Analysis. In this paper, Spearman's correlation analysis was used due to the possible nonlinear relationships between variables. Spearman's rank correlation coefficient measures the monotonic association between two variables and relies on the rank order of values [28]. The formula for Spearman's coefficient is

$$r_s = \frac{\operatorname{cov}(rg_x, rg_y)}{\sigma_{rg_x} \sigma_{rg_y}},$$

where $rg_x$ and $rg_y$ are the ranked (sorted) values of variables $x$ and $y$, $\operatorname{cov}(\cdot,\cdot)$ is the covariance, and $\sigma$ is the standard deviation. Figure 3 shows the Spearman correlation coefficients between the features of the HK dataset. Correlation scores range from −1 to 1: perfect positive correlation is 1, and perfect negative correlation is −1. The figure shows that FSP, O3, RSP, SO2, P0, and P1 have strong positive correlations with the AQHI, while T, H, and dew have strong negative correlations with the AQHI. Cohen's standard [29] was used in this paper to select the correlated features: features with an association weaker than 0.30 are discarded. The picked features are as follows:

<FSP, NO2, NOx, O3, RSP, SO2, T, P0, P1, , H, dew, WP, WD, month, hour>
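This screening step can be expressed compactly with pandas; the DataFrame df and the column name 'AQHI' below are assumptions for illustration:

```python
# Feature screening by Spearman correlation with the target (a sketch;
# df is assumed to hold the candidate features plus the 'AQHI' column).
import pandas as pd

def select_features(df, target='AQHI', threshold=0.30):
    """Keep features whose |Spearman correlation| with the target >= threshold."""
    corr = df.corr(method='spearman')[target].drop(target)
    return corr[corr.abs() >= threshold].index.tolist()
```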

(2) Temporal Correlation Analysis. Intuitively, historical data from different periods have different effects on future time lags. More recent events have a stronger influence on the current status, while earlier events have a weaker influence. Denote the current time as $t$, the historical time lag as $h$, and the future time lag as $f$; then the prediction time is $t + f$ ($f = 1, 3, 6, 9, 12$) and the influential historical time is $[t - h, t]$. The multiscale prediction task is represented in Figure 4. In this paper, the LSTM NN model, which is capable of learning long time series, was used to select the appropriate influential historical time lag [30].

The network architecture of the LSTM model used in the paper is shown in Figure 5, which is the same as the LSTM-extended network proposed in [13]. The main input is the air pollutant data, and the auxiliary input is the time and meteorology data. There are two LSTM layers and one output layer, which is a fully connected layer with 11 neurons corresponding to the number of classes. The number of neurons in the LSTM layers has to be tuned. For simplicity, the number of neurons in each LSTM layer was set to the same value, chosen from a candidate set of {50, 100, 200, 500, 1000, 2000}. Based on several comparative experiments, the setting that yielded the best performance was chosen. The LSTM achieved the best performance with 1000 neurons; therefore, in this paper, the number of neurons in each LSTM layer was set to 1000.
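A Keras sketch of this two-input architecture is given below; the merge by concatenation and the exact input shapes are our assumptions based on the description above, not the authors' published code:

```python
# A sketch of the two-input LSTM, assuming Keras and merge by concatenation.
from keras.layers import Input, LSTM, Dense, concatenate
from keras.models import Model

h = 9  # historical time lag (hours), chosen in the experiments below

main_in = Input(shape=(h, 6), name='pollutants')   # FSP, NO2, NOx, O3, RSP, SO2
aux_in = Input(shape=(h, 10), name='time_meteo')   # selected time + meteorology
                                                   # features (count assumed)
x = concatenate([main_in, aux_in])
x = LSTM(1000, return_sequences=True)(x)           # first LSTM layer
x = LSTM(1000)(x)                                  # second LSTM layer
out = Dense(11, activation='softmax')(x)           # one neuron per AQHI class

model = Model(inputs=[main_in, aux_in], outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])                # expects one-hot labels
```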

The future 1, 3, 6, 9, and 12 hours' AQHIs were predicted in this paper. For each future time lag, the influence of different historical time lags was examined. The results are given in Table 4. The evaluation metric is the weighted f1-score (f1 in Table 4). The corresponding curve graph is given in Figure 6. The results show that different future time lags (F-lag in Table 4) correspond to slightly different optimal historical time lags (H-lag in Table 4). The general influential time of historical data for a specific future time's AQHI is around 9 hours.

Notably, the results show that prediction performance is poor for future time lags larger than 6, indicating that long-term prediction tasks are inherently more difficult. A small historical time lag cannot guarantee enough long-term memory inputs for the LSTM model, while a large time lag admits an increased number of unrelated inputs, which increases the model's complexity and the difficulty of learning useful features. According to the above experiments, for simplicity, 9 was selected as the influential historical time lag for all future time lags.

4. Results and Discussion

Algorithms used in the experiments are ARIMA, RF, MLP, SVC_linear (SVC with the linear kernel), SVC_rbf (SVC with the RBF kernel), SVC_sig (SVC with the sigmoid kernel), SVC_poly (SVC with the polynomial kernel), LSTM, and MKSVC. ARIMA was used as a baseline model. RF, MLP, and the SVCs are widely used air quality forecast models and were fine-tuned in this paper to make a fair comparison with MKSVC. The LSTM in this paper has the same structure as the LSTM-extended model proposed in [13]. Figure 7 shows the experimental flow. All algorithms were designed and tested in the same operating environment (Python 3.5.3, Windows 10, Intel® Core™ i7-5500U CPU @ 2.40 GHz, 16.0 GB RAM).

4.1. Parameter Optimization

Parameter optimization refers to finding the optimal parameters for a machine learning algorithm. This is important since the performance of any machine learning algorithm depends to a large extent on the parameter values. The parameters differ for each algorithm and for each prediction time lag, which means an optimal model needs to be tuned for each prediction task and each algorithm. The way the parameters of MKSVC are obtained is detailed in Section 2, and that of the LSTM in Section 3.3.2. For the other algorithms, the parameter tuning process for the one-hour future time lag prediction task is presented below; the multiscale prediction tasks follow identical fine-tuning processes.

First, the grid search interval of a parameter is narrowed by analyzing the influence curve of that single parameter on the training score and the validation score. For instance, by varying the kernel coefficient $\gamma$ of the RBF kernel in SVC_rbf, the $\gamma$-score curve shown in Figure 8 can be obtained. The yellow line denotes the score over the training set. The purple line represents the score on the validation set, and the shadow represents the variance.

The figure shows that, at first, both the training and validation scores rise as $\gamma$ increases. However, when $\gamma$ reaches around 0.5, a further increase results in a rising training score but a falling validation score, which signifies that the model is overfitting. According to this influence curve, the grid search interval of $\gamma$ in the next step can be narrowed to between 0.0 and 1.0.

Based on the influence curves, the grid search intervals of the main parameters of ARIMA, RF, MLP, and the SVCs are shown in Table 5. RF, MLP, and the SVCs used in this paper are implemented in the scientific toolbox scikit-learn [31], and ARIMA is implemented in statsmodels [32]. The unlisted parameters are set to their defaults.

Then, a grid search with 5-fold cross validation was applied to find the optimum parameters. By exhaustively considering all parameter combinations in Table 5, the optimal parameter settings of ARIMA, RF, MLP, and the SVCs were obtained, as shown in Table 6. After obtaining the inner kernel coefficients of all the base kernels, the centered kernel alignment method described in Section 2.2 was used to get the optimal weight for each kernel.
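This search maps directly onto scikit-learn's GridSearchCV; the grid values below are illustrative, and the actual intervals are those in Table 5:

```python
# Grid search with 5-fold cross validation for SVC_rbf (a sketch; the gamma
# interval reflects the narrowing from Figure 8, the values are illustrative).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.1, 0.25, 0.5, 0.75, 1.0],  # narrowed to (0, 1] by the curve
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid,
                      scoring='f1_weighted', cv=5)
# search.fit(X_train, y_train)
# optimal setting: search.best_params_
```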

4.2. Comparison

For HK, one year's data was used for training, and the other year's data was used for testing. For Beijing, the first two years' data was used for training, and the remaining three years' data was used for testing. The comparisons of the predictions for the future 1, 3, 6, 9, and 12 hours are given below.

4.2.1. Predict the AQHI of Hong Kong

Tables 7–11 show the performances of the algorithms for forecasting the future 1, 3, 6, 9, and 12 hours' AQHI in Hong Kong. From the tables, the following conclusions can be drawn:
(1) MKSVC performs best on all the prediction tasks. SVC models with linear, RBF, and polynomial kernels perform better than the other models except the MKSVC. The sigmoid kernel SVC always makes the worst predictions, which shows that the sigmoid kernel is unable to capture the characteristics of the dataset.
(2) Time series models like ARIMA and LSTM fail to compete with the widely used parametric models like RF, MLP, and the SVCs. As the future time lag increases, the time series models' performances decrease, while the parametric models keep achieving very satisfying results.
(3) Among the well-performing SVC models, the linear kernel model performs best, which demonstrates that the relation between the target and the input information has a large linear component; however, some factors influence the future air quality in a nonlinear way, as the RBF and polynomial kernels also achieve promising performance.
(4) Models like MKSVC, MLP, and the SVCs (except SVC_sig) present very satisfying performance for short-term air quality prediction, with accuracy above 90% for the future 1, 3, and 6 hours. However, the performance for longer-term predictions drops sharply, from an MKSVC accuracy of 0.976 at 6 hours to 0.630 at 12 hours, which demonstrates that long-term air quality prediction is difficult.

4.2.2. Predict the PM2.5 IAQL of Beijing

Tables 12–16 show the performances of the algorithms for forecasting the next 1, 3, 6, 9, and 12 hours' PM2.5 IAQL in Beijing. Conclusions similar to those for HK can be drawn: MKSVC is superior to the other models, SVC_sig and LSTM perform worst, and the SVCs behave relatively better than the other parametric models. However, the overall performance of all the models on this dataset is much worse than on the HK dataset. One possible reason is that there are fewer features in the Beijing dataset and that they have a weaker correlation with the target. Another reason may be the generally worse air conditions in Beijing, because higher-polluting air conditions are harder to predict, as demonstrated in the next part.

4.2.3. Comparison of Severe Air Pollution Prediction

Severe pollution prediction is a difficult task; however, it is critical, as high-polluting air conditions do far more damage to human health. Therefore, even a small improvement in the prediction of severe pollution is more meaningful than a large improvement in predicting good or less-polluting air conditions.

As the SVCs performed better than the other algorithms except the MKSVC, the best-performing SVC was chosen to compare with the MKSVC in terms of forecasting severe air pollution. An AQHI greater than 6 is considered severe pollution in HK, and an IAQL greater than 4 is considered severe air pollution in Beijing. Figures 9 and 10 are the confusion matrices of MKSVC and SVC_linear when predicting the next hour's AQHI of HK.

The x-axis denotes the predicted value, the y-axis denotes the true value, and the values on the diagonal of the matrix denote the probability of a correct prediction. The figures show that the linear kernel SVC performs well in forecasting less-polluting air conditions, as does the MKSVC. However, the MKSVC performs far better than the linear kernel SVC when the AQHI is larger than 8.

Figures 11 and 12 are the confusion matrices of MKSVC and SVC_rbf when forecasting the next hour's PM2.5 IAQL of Beijing. The same conclusion can be drawn as for HK. Generally, all the models make better predictions for light pollution than for severe pollution due to the bias towards majority classes, which demonstrates that severe air pollution prediction is a challenging task.

5. Conclusions

In this paper, a novel multiple kernel learning-based approach with SVC as the base learner was proposed for near-future air quality prediction. It was the first time that a multiple kernel learning method was applied to air quality forecasting. Special attention was given to the feature engineering process. MKSVC is capable of learning the optimal combination of different kernels, with which information coming from multiple sources can be captured simultaneously. Extensive experiments were conducted to compare the performance of MKSVC with the baseline model ARIMA, the widely used parametric air quality forecasting models RF, MLP, and SVCs, and a deep recurrent neural network model, LSTM. Historical air pollutant concentration data, meteorological data, and time stamp data of a coastal city, Hong Kong, and an inland city, Beijing, were used to validate the models. Based on the experiments, a number of conclusions can be drawn:
(1) The proposed MKSVC algorithm offers better predictive ability than the other models.
(2) The proposed MKSVC algorithm is capable of forecasting severe air pollution much better than the other models.
(3) The widely used parametric models RF, MLP, and SVC exhibit better prediction performance than the time series models ARIMA and LSTM.
(4) Feature transformation and feature selection play a significant role in making better air quality forecasts.

As can be seen from the experiments, the long-term prediction task is difficult, as is the task of predicting severe air pollution. Although the proposed multiple kernel learning-based approach demonstrated relatively good performance in terms of both long-term prediction and severe air pollution prediction, more sophisticated methods need to be explored in order to build a more comprehensive and effective air quality forecasting system.

Appendix

A. Kernel SVM

Given a dataset with training instances $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a vector in the input space and $y_i$ denotes the class label taking a value of +1 or −1, SVM aims at minimizing an upper bound of the generalization error by maximizing the margin between the separating hyperplane and the data in the input space. In real-world problems, it is often not possible to determine an exact separating hyperplane dividing the data within the input space, and a curved decision boundary may be needed. In such cases, the original input space can be mapped to a higher-dimensional feature space (Hilbert space) using nonlinear functions called feature functions $\Phi$. The resulting discriminant function is

$$f(x) = \langle w, \Phi(x) \rangle + b.$$

The classifier can be trained by solving the following quadratic optimization problem:

$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i \left( \langle w, \Phi(x_i) \rangle + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n,$$

where $w$ is the vector of weight coefficients, $b$ is the bias term of the separating hyperplane, $C$ is a predefined positive trade-off parameter between model simplicity and classification error, and the slack variables $\xi_i$ represent parameters for handling nonseparable data. Instead of solving this optimization problem directly, the Lagrangian dual function enables us to obtain the following dual formulation:

$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle \Phi(x_i), \Phi(x_j) \rangle \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C,$$

where $\alpha = (\alpha_1, \ldots, \alpha_n)$ is the vector of dual variables corresponding to each separation constraint. Even though the feature space is high dimensional, it may not be practically feasible to use the feature functions directly for classification. In such cases, the nonlinear mapping induced by the feature functions is computed through special nonlinear functions called kernels:

$$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle,$$

where $k$ is named the kernel function. By solving the above dual problem, we get $\alpha$, and the maximum margin separating hyperplane function can be rewritten as

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i k(x_i, x) + b \right).$$

The multiclass support can be handled according to a one-versus-one or one-versus-rest scheme. The kernel trick allows SVMs to form nonlinear boundaries [14].

B. Calculation of AQHI in Hong Kong and IAQI in Mainland China

B.1. Calculation of AQHI

The AQHI of the current hour is calculated from the sum of the percentage added health risk (%AR) of daily hospital admissions attributable to the 3-hour moving average concentrations of four criteria air pollutants: ozone (O3), nitrogen dioxide (NO2), sulphur dioxide (SO2), and particulate matter (PM) (respirable suspended particulates (RSP or PM10) or fine suspended particulates (FSP or PM2.5), whichever poses a higher health risk).

The %AR of each pollutant depends on its concentration and a risk factor derived from local health statistics and air pollution data. The %AR is then compared with a scale to obtain the appropriate banding of the AQHI. The equations are as follows:

$$\%AR = \%AR(\mathrm{NO_2}) + \%AR(\mathrm{SO_2}) + \%AR(\mathrm{O_3}) + \%AR(\mathrm{PM}),$$

where %AR (PM) = %AR (PM10) or %AR (PM2.5), whichever is higher, and

$$\%AR(p) = \left( e^{\beta(p) \times C(p)} - 1 \right) \times 100\%,$$

where %AR (NO2), %AR (SO2), %AR (O3), %AR (PM), %AR (PM10), and %AR (PM2.5) are the added health risks of NO2, SO2, O3, PM, PM10, and PM2.5, respectively;

C(NO2), C(SO2), C(O3), C(PM10), and C(PM2.5) are the 3-hour moving average concentrations of the respective pollutants in micrograms per cubic meter (µg/m3); and β(NO2) = 0.0004462559, β(SO2) = 0.0001393235, β(O3) = 0.0005116328, β(PM10) = 0.0002821751, and β(PM2.5) = 0.0002180567 are the added health risk factors (technically known as regression coefficients) of the respective pollutants [24].

B.2. Calculation of IAQI

Each pollutant's individual AQI is called its IAQI. The pollutant with the highest IAQI among the six pollutants at a given time is called the primary or dominant pollutant, and its IAQI is chosen as the overall AQI value:

$$IAQI_p = \frac{IAQI_{Hi} - IAQI_{Lo}}{BP_{Hi} - BP_{Lo}} \left( C_p - BP_{Lo} \right) + IAQI_{Lo},$$

where $C_p$ is the mass concentration value of air pollutant $p$, $BP_{Hi}$ is the high value of the concentration limit, which can be checked in the reference table from [25], $BP_{Lo}$ is the low value of the concentration limit from the same table, and $IAQI_{Hi}$ and $IAQI_{Lo}$ are the index values corresponding to $BP_{Hi}$ and $BP_{Lo}$ in that table, respectively. The detailed breakdown of the China AQI for PM2.5 concentrations is shown in Table 17.
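As a worked example, the formula can be applied to the PM2.5 breakpoints of the national standard (a sketch; the breakpoint values shown are for the 24-hour PM2.5 scale, cf. Table 17):

```python
# A worked sketch of the IAQI piecewise-linear formula.
def iaqi(c, bp_lo, bp_hi, iaqi_lo, iaqi_hi):
    """Linear interpolation of concentration c between two breakpoints."""
    return (iaqi_hi - iaqi_lo) / (bp_hi - bp_lo) * (c - bp_lo) + iaqi_lo

# Example: PM2.5 = 60 ug/m3 falls in the (35, 75] band, mapped to IAQI (50, 100]
print(iaqi(60, 35, 75, 50, 100))  # -> 81.25
```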

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

Hong Zheng is the group leader and she is responsible for the project management and in charge of revising this manuscript. Haibin Li is responsible for data analysis and planning and performing the experiments. Xingjian Lu and Tong Ruan provided valuable advice about the revised manuscript.

Acknowledgments

The authors are pleased to acknowledge the National Natural Science Foundation of China under Grant nos. 61103115 and 61103172; the National Natural Science Youth Foundation of China under Grant no. 61602175; the special fund for Software and Integrated Circuit Industry Development of Shanghai under Grant no. 150809; and the “Action Plan for Innovation on Science and Technology” Projects of Shanghai (Project no. 16511101000).