Abstract

The present study aims to develop an efficient dynamic statistical model to describe the daily behavior of boundary layer ozone in Macau. Four types of Kalman-filter-based models were proposed and applied to model the daily maximum of the 8 hr averaged ozone concentrations within a decade (2000–2009). First, the boundary layer ozone was modelled with the time-varying autoregressive model of order p, TVAR(p), which is a pure time series model hindcasting the ozone concentration by a weighted sum of the ozone histories of the previous p days. Then, it was modelled with the time-varying autoregressive model with linear exogenous input, TVAREX-Lin, which combines the TVAR model and the exogenous input of key meteorological variables in a linear fashion. Next, the nonlinear TVAREX model (TVAREX-NLin) which assumes the nonlinear influence of individual meteorological variable on ozone was adopted. Finally, a semiempirical TVAREX model (TVAREX-O3) was proposed to address the coastal nature of Macau and the interaction between the input variables. It was found that the proposed TVAREX-O3 model was the most efficient one among the model candidates in terms of the general modelling performance and the capability of modelling the episode situation.

1. Introduction

The increasing trend of boundary layer ozone concentrations in the Pearl River Delta (PRD) region was observed in recent decades due to increasing anthropogenic emissions of ozone precursors such as oxides of nitrogen (NO𝑥) and volatile organic compounds (VOCs) [1, 2]. This problem has been receiving greater attention due to increasing evidences discovered for the adverse effects of ozone exposure (respiratory and ocular damage, crop yield reduction, climate change, etc.) [35]. Therefore, taking actions towards its proper management is necessary and urgent. Developing a representative model to investigate the variation of the ozone concentrations measured in the existing air quality monitoring network is obviously an initial but essential step.

In the past decade, the statistical models developed to model the boundary layer ozone behavior of the Pearl River Delta region are based on offline techniques such as the artificial neural network and the support vector machine [4, 6, 7]. These techniques assume that the model coefficients should be time-invariant after model training. However, air quality system changes with time, making the offline models underperform [8]. Therefore, the objective of this study is to develop a representative dynamic statistical model to describe the daily behavior of boundary layer ozone of Macau, a fast developing city in the PRD, for improvement. A dynamic statistical model differs from a static one with its time-varying model coefficients. Through changing the model coefficients at every time step, it is believed that the model can adapt to the change in the actual system. In this study, the extended Kalman filter [9, 10] was adopted to estimate the time-varying model coefficients for the proposed models. This technique has been implemented on similar models used to predict the daily averaged PM10 concentrations of Macau before [8, 11, 12]. Therefore, the present study only focuses on describing the proposed models for the ground-level ozone. Detailed procedures of employing the Kalman filter for the state estimation of the statistical air quality models can be referred to in [11].

2. Data and Methodology

2.1. Study Area and Data

Being Asia’s well-known gaming mecca, Macau has been experiencing rapid economic growth in the last decade due to the booming of its gaming and tourism industry. The increasing energy consumption has caused degradation on the ambient air quality as well as the indoor one. In view of the public concern, the Macau Meteorological and Geophysical Bureau has been developing automatic air quality monitoring network spanning over the Macau peninsula (urban core), the Taipa Island (suburban area), and the Coloane Island (rural area) to monitor the variation of key air pollutants including ozone since 1999. In this study, the boundary layer ozone concentrations recorded at the ambient (Taipa) station from January 2000 to December 2009 are studied since this station has the longest dataset of all the monitoring stations. In addition, the boundary layer ozone is a more dominant problem in the Taipa island due to more biogenic emissions of VOCs and the lack of NO𝑥 titration [1315]. Finally, this station is located at the peak of the Taipa Grande Hill, which has an altitude of 110 m above the sea level. Therefore, the ozone concentrations monitored at this station are expected to represent the general ambient condition.

2.2. Model Candidates

Four adaptive statistical air quality models are proposed here. They can be classified into two major types, namely, the time-varying autoregressive (TVAR) model and the time-varying autoregressive model with exogenous inputs (TVAREX). These model candidates are further described below.

2.2.1. TVAR(𝑝) Model

The first model candidate TVAR(p) is defined as the time-varying autoregressive model of order 𝑝, which has the following form: 𝑥𝑘=𝜙1,𝑘1𝑥𝑘1++𝜙𝑝,𝑘1𝑥𝑘𝑝+𝑓𝑘1.(1) In it, the daily maximum of the 8 hr averaged ozone concentration on the 𝑘th day 𝑥𝑘 is predicted by a weighed combination of the ozone past histories of the previous 𝑝 days [8]. The weights are given by the uncertain time-varying model coefficients 𝜙𝑖,𝑘1,𝑖=1,,𝑝, which can be estimated by Kalman filter [11]. The last term 𝑓𝑘1 denotes the modelling error which represents the unmodeled system dynamics that affect the ozone concentration and is modeled as a Gaussian-independent and identically distributed (i.i.d.) process with zero mean and the process noise variance 𝜎2𝑓.

2.2.2. TVAREX-Lin Model

The second model candidate TVAREX-Lin is defined as the time-varying autoregressive model with linear exogenous inputs: 𝑥𝑘=𝜙1,𝑘1𝑥𝑘1+𝜙2,𝑘1𝑥𝑘1+𝜙3,𝑘1𝑠𝑘+𝜙4,𝑘1𝑘+𝜙5,𝑘1𝑢𝑘+𝜙6,𝑘1||𝜃𝑘||+𝜙7,𝑘1𝐼𝑘+𝑓𝑘1.(2) Unlike the TVAR(p) model, the TVAREX-Lin model assumes that the ground-level ozone does not only evolve according to its past histories, but is also influenced by other conditions occurring on the day of prediction. These conditions include (i) the photochemical reactions that transform the precursors into ozone, (ii) the nature of the replenishing air masses, and (iii) the dispersion condition. These mechanisms are reflected by using the linear combination of the influencing meteorological variables as the exogenous inputs. In this model, the symbols 𝑥𝑘1 and 𝑥𝑘1 denote the daily maximum of the 8 hr averaged ozone concentration on the (𝑘1)th day and the 8 hr averaged ozone concentration measured before the (𝑘1)th day, respectively. These two variables are applied to represent the initial ozone condition. The variable 𝑠𝑘 is the solar radiation index which is defined as the product of the daily sum of the 8 hr averaged solar radiation and the duration of sunshine on the 𝑘th day. It is used to reflect the amount of energy that propels the photochemical reaction for ozone formation. The symbol 𝑘 is the daily average of the 8 hr averaged relative humidity, and this factor can limit the ozone generation since high humidity encourages the chemical reaction of the NO2 and the NaCl particles brought by sea breeze [16] on the 𝑘th day. The parameters 𝑢𝑘 and |𝜃𝑘| are the magnitude and the absolute angle of the resultant wind vector which is calculated by the vector sum of the 8 hr averaged wind velocity vectors on the 𝑘th day. It is noted that the 8-hr averaged wind velocity vector is calculated by the vector sum of the hourly wind velocity vectors of the previous 8 hours. The magnitude 𝑢𝑘 reflects the dispersion condition, and the absolute angle |𝜃𝑘| represents the leading wind direction which influences the source of the replenishing air masses on the 𝑘th day. For example, a resultant angle of 0° refers to the prevailing northerly wind, while 180° indicates the prevailing southerly wind. The input 𝐼𝑘 is the index of wind reversion on the 𝑘th day: 𝐼𝑘=max𝑗||𝜃cos𝑗1||||𝜃cos𝑗||,(3) where |𝜃𝑗| is the absolute angle of the resultant wind vector at the 𝑗th hour. This index is positively correlated with the maximum of the ozone concentration on the day of prediction. It is defined to capture the sudden reversion of wind direction which may recirculate the contaminants and cause ozone enhancement [17]. The symbol 𝜙𝑖,𝑘1 is the time-varying coefficient for the associated input variable on the (𝑘1)th day. Finally, the term 𝑓𝑘1 represents the modelling error and it is modeled as Gaussian i.i.d. with zero mean and the process noise variance 𝜎2𝑓.

2.2.3. TVAREX-NLin Model

The third model candidate TVAREX-NLin is defined as the time-varying autoregressive model with nonlinear exogenous inputs:𝑥𝑘=𝜙1,𝑘1𝑥𝑞𝑘11,𝑘1+𝜙2,𝑘1𝑥𝑞𝑘12,𝑘1+𝜙3,𝑘1𝑠𝑘𝑞3,𝑘1+𝜙4,𝑘1𝑘𝑞4,𝑘1+𝜙5,𝑘1𝑢𝑘𝑞5,𝑘1+𝜙6,𝑘1||𝜃𝑘||𝑞6,𝑘1+𝜙7,𝑘1𝐼𝑘𝑞7,𝑘1+𝑓𝑘1.(4) As the formation of ozone is a nonlinear process, it is attempted to investigate the effect of nonlinearity on the modelling performance by using the same set of input variables of the TVAREX-Lin model and modelling the effect of each input variable with the power law. The symbol 𝑞𝑖,𝑘1 denotes the time-varying exponent of the input variable on the (𝑘1)th day.

2.2.4. TVAREX-𝑂3 Model

Although the TVAREX-Lin model and the TVAREX-NLin model address the important exogenous inputs which represent the influencing photochemical and physical mechanisms of ozone, their functional forms are solely empirical and may not completely reflect the nature of the study area. In addition, these two models do not account for the interaction between the input variables. Therefore, the semiempirical TVAREX-O3 model is proposed below to address the coastal characteristics of Macau and the interaction of some input variables: 𝑥𝑘=𝜙1,𝑘1𝑥𝑘1+𝜙2,𝑘1𝑥𝑘1+𝜙3,𝑘1𝑠𝑘exp𝜙4,𝑘1𝑘𝜙5,𝑘1||𝜃𝑘||+𝜙6,𝑘1𝐼𝑘×exp𝜙7,𝑘1𝑢𝑘+𝑓𝑘1.(5) The first two terms are still used to reflect the background ozone conditions, while the third term is used to represent the photochemically formed ozone by the transboundary ozone precursors within the Pearl River Delta region. Since Macau is a coastal city located at the southwest of the PRD, the ozone precursors can be easily transported to Macau from its upwind cities nearby when the prevailing winds are blowing from the northerly directions. On the contrary, Macau is facing the South China Sea located to its south. It is expected that Macau is less influenced by the transboundary air pollution from the cities located to its North when the prevailing wind direction is southerly. Therefore, the absolute angle in the third term is used to represent the effect of the replenishing air masses brought by the prevailing wind on the day of prediction. However, the concentrations of the precursors in the replenishing air masses can be also increased when there is a reversion in the wind direction that causes recirculation and accumulation due to this momentum swing. On the other hand, the amount of water present in the replenishing air masses may affect the concentration of the nitrogen dioxide. Therefore, the variables 𝑘 and 𝐼𝑘 also appear in the exponential term to account for these effects on the ozone precursors. Then, the solar radiation index 𝑠𝑘 outside the exponential term is adopted to represent the available energy to propel the ozone formation for the precursors present in the incoming air masses due to the photochemical reactions. After the photochemical formation process, the original background concentration together with the photochemically formed one will be dispersed by wind, and the exponential term containing the magnitude of the resultant wind velocity vector is used as a discounting factor on the modelled ground-level ozone concentration.

3. Results and Discussion

3.1. General Performance of the Proposed Models

In this section, the proposed models are evaluated based on the measured daily maximum of the 8-hr averaged ozone concentrations recorded at the ambient (Taipa) station between 2001 and 2009. In order to avoid the transient state of Kalman filter, the data in 2000 are not included in the model evaluation. First of all, the general (global) performances of these models are evaluated with some well-known error indices including the root-mean-square error (RMSE): RMSE=1𝑁𝑁𝑘=1𝑧𝑘̂𝑥𝑘𝑘12,(6) the mean absolute percentage error (MAPE): 1MAPE=𝑁𝑁𝑘=1||𝑧𝑘̂𝑥𝑘𝑘1||𝑧𝑘×100%,(7) the coefficient of determination (𝑅2): 𝑅2=𝑁𝑘=1𝑧𝑘𝜇𝑧̂𝑥𝑘𝑘1𝜇𝑥𝑁𝑘=1𝑧𝑘𝜇𝑧2𝑁𝑘=1̂𝑥𝑘𝑘1𝜇𝑥2,(8) and the index of agreement (IA): IA=1𝑁𝑘=1𝑧𝑘̂𝑥𝑘𝑘12𝑁𝑘=1||𝑧𝑘𝜇𝑧||+||̂𝑥𝑘𝑘1𝜇𝑧||2,(9) where N, 𝑧𝑘, ̂𝑥𝑘𝑘1, 𝜇𝑥, and 𝜇𝑧 represent the number of measured days, the measured daily maximum of the 8-hr averaged ozone concentration on the 𝑘th day, the corresponding hindcast by the model on the 𝑘th day, the averaged value of the modelled concentration time history and the averaged value of the measured concentration time history, respectively. Generally, an efficient model associates with small values of RMSE and MAPE as well as large values of 𝑅2 and IA which are close to unity. Table 1 shows the general performance of the TVAR(p) models, 𝑝=1,,7. It is noted that the RMSE of the TVAR models decreases slightly by 2.4%, and the value of 𝑅2 increases by 4.2% when 𝑝 is increased from 1 to 7. However, the value of IA remains constant, and the MAPE increases by 3.8%. The contradiction of the observed trends for different error statistics implies that the prediction performance may not be necessarily improved by using a higher-order TVAR model. This finding agrees with the short half-life time of ozone (~12 hours) in the atmosphere. Therefore, the most parsimonious TVAR(1) model was chosen as the most plausible model among TVAR(p) model candidates. Table 2 shows the general performance of the TVAR(1) model as well as the TVAREX models in different seasons. First of all, it is observed that TVAR(1) model is the least efficient among four models with the largest RMSE and MAPE as well as smallest values of 𝑅2 and IA throughout the years. It is possibly due to the lack of the information about the meteorological conditions which may highly influence the ozone production. In addition, it is interesting to find that the TVAREX-Lin model generally outperforms the TVAREX-NLin model. It should be noted that both models share the set of input variables. The former is a linear model, and the latter is a nonlinear one with more adjustable time-varying parameters. However, the extra flexibility given to the TVAREX-NLin model may be unnecessary and just make it to be more sensitive to the local fluctuation in the data. Therefore, this flexibility makes the TVAREX-NLin become more difficult to tune with the general behavior of the concentration time history and hence leads to its underperformance. On the contrary, the proposed TVAREX-O3 model is also a nonlinear model which has just the same number of uncertain parameters as the TVAREX-Lin model. However, it is more efficient compared to the other two empirical TVAREX models. Furthermore, it is noted that the degree of performance improvement of the TVAREX-O3 model compared to TVAREX-Lin model in Summer and Autumn is larger than that in Spring and Winter as shown in Table 3, meaning that TVAREX-O3 model is especially efficient in the episode seasons. Therefore, this demonstrates that deciding the relevant input variables before the model construction is important. However, designing a meaningful functional form which reflects the correct role played by each input variable in the model is also the key leading to the success. Once the global performance of the proposed models has been evaluated and compared, it is also worthy to investigate more details from the local modelling performance of the TVAR models and the TVAREX models. In the following section, the local modelling performance of each model is evaluated based on graphical comparisons of the measured concentrations as well as their hindcasts.

3.2. Local Performance of the Proposed Models

Figure 1(a) shows the scatter plot of daily maximum values of the 8-hr averaged ozone concentrations modelled by the TVAR(1) model against the measurements. A 45° line was also drawn on the same figure for comparison. When a data point falls on the 45° line, it means that the hindcast by the model is exactly equal to the measurement. It is noted that a large number of points are still lying close to the 45° line, meaning that the TVAR(1) is capable to capture the general trend of the time history. For the cases of high predicted ozone concentrations or high measured ozone concentrations, those points are far above or below the 45° line. This is possibly due to the time delay problem, that is, the modelled time history generally lags behind the measured one. This problem may cause large modelling error when there is significant variation in the daily maximum value, and those points are associated with the onset and retreat of the ozone episode.

Figures 1(b)1(d) show the scatter plots of the modelled daily maximum of the 8-hr averaged ozone concentrations versus their measurements for TVAREX-Lin, TVAREX-NLin, and TVAREX-O3 models, respectively. It is obvious to see that the points in these figures are lying closer to the 45° line compared to the points in Figure 1(a); hence signifying the improvement of the time delay problem. By a detailed comparison of these three figures, it is noted that a portion of the points in Figures 1(b) and 1(c) are lying far above the 45° line compared to the same region in Figure 1(d).

This indicates that both the TVAREX-Lin and the TVAREX-NLin models may produce more false alarms compared to the TVAREX-O3 model. In addition, it is noted that the points of TVAREX-O3 model are more concentrated around the 45° line compared to those of the TVAREX-Lin and TVAREX-NLin models even under the conditions of high ozone concentrations. Therefore, the TVAREX-O3 model has the most efficient local performance, and this further supports the observations in the previous subsection.

3.3. Episode Capturing Capability of the Proposed Models

As indicated in the previous graphical comparison of the local modelling performance, the time-varying model candidates can have significantly different capabilities of capturing the pollution episodes. In this section, it is aimed to further investigate this diversity more quantitatively by using two more numerical indicators, namely, the probability of detection (POD) and the probability of false alarm (PFA). The probability of detection (POD) is defined as the probability of issuing a successful warning for a given episode day: POD=𝑃̂𝑥𝑘𝑘1𝑇𝑧𝑘,𝑇(10) where 𝑇 is the threshold concentration of the ozone episode. High POD value usually indicates that the model is capable to warn the public for the bad air quality. However, it should be kept in mind that overestimated predictions of the model may also associate with a high POD. Therefore, the probability of false alarm (PFA) is commonly used together with POD to check the episode capturing capability of a model. The PFA is defined as the probability of having a nonepisode day given that a warning has been issued: 𝑧PFA=𝑃𝑘<𝑇̂𝑥𝑘𝑘1.𝑇(11) For example, a perfect model should have a POD of 1 and a PFA of 0. Figure 2(a) shows the PODs of the TVAR(1) model and the TVAREX models for different thresholds ranging from 0 to 160 μg m−3, respectively. It is noted that the PODs of the TVAR(1) model are significantly lower than those of the TVAREX models when the threshold 𝑇 is larger than 80 μg m−3. This implies that the onset of ozone episodes in Macau is closely linked to the meteorological conditions on the day of prediction and may explain why the TVAR(1) model underperforms in this situation. For the TVAREX models, it is noted that the PODs of the TVAREX-O3 model are slightly higher than those of the TVAREX-Lin and TVAREX-NLin model when the threshold 𝑇 is larger than 120 μg m−3. This finding reinforces similar observation made in the graphical comparison of the local performances between the TVAR(1) model and the TVAREX models. Figure 2(b) shows the PFAs of the TVAR(1) model and the TVAREX models against different threshold concentrations. First of all, it is noted that the TVAR(1) model produces more frequent false alarm compared to the TVAREX models. Since the TVAR(1) model predicts the daily maximum value of the 8-hr averaged ozone concentration tomorrow based on the filtered estimate of today only, there is a high probability of predicting it is an episode day tomorrow given that today is an ozone episode day. However, the episode may retreat tomorrow due to the swing of prevailing wind direction from north to south, or the decrease in the solar radiation. Therefore, false alarm will be produced by the TVAR(1) model under this situation. In addition to the PODs, it is noted that the PFAs of the TVAREX-O3 model are significantly less than those of the TVAREX-Lin and TVAREX-NLin models when the threshold concentration is larger than 100 μg m−3. This confirms that the improvement of the episode capturing capability in the TVAREX-O3 model is not done by sacrificing the false alarm rate. Finally, it should be concluded that the TVAREX-O3 model is the most efficient model to capture the ozone pollution episodes in this study.

4. Conclusion

Four types of time-varying statistical models, namely, the TVAR(p) model, the TVAREX-Lin model, the TVAREX-NLin model, and the TVAREX-O3 model were proposed to model the daily maximum of the 8-hr averaged ozone concentrations observed at the ambient (Taipa) station of Macau within a decade (2000–2009). Throughout comparing the general performance of the TVAR(p) model with the order ranging from 1 to 7, it was concluded that the daily behavior of the ground level ozone system in Macau has the Markovian property [18], and this coincides with its short half-life time. Further comparison of the TVAR(1) model and the TVAREX models in terms of the general modelling performance, the local modelling performance and the episode capturing capability indicates that the TVAR(1) model underperforms. Therefore, it was concluded that the daily fluctuation of the ozone time history in Macau is dominated by the meteorological conditions on the day of prediction. In addition, it was found that the TVAREX-Lin model outperforms the TVAREX-NLin model, and this might be due to the unnecessary model parameters of the TVAREX-NLin model, which makes the model itself become more difficult to tune with the general behavior of the data. Therefore, a complex model is not necessarily better than a simple one. Finally, it was found that the TVAREX-O3 model, which considers the coastal nature of Macau and the interaction of the input variables in its functional form, outperforms the other two empirical TVAREX models. Therefore, it was concluded that selection of the relevant input variables during the model construction is important. However, designing a meaningful functional form which reflects the correct role played by each input variable in the model is also the key leading to the success.

Acknowledgments

This work is supported by the Macau Science and Technology Development Fund (025/2011/A) and the University of Macau (UL019/08-Y4/CEE/MKM01/FST). The Macau Meteorological and Geophysical Bureau is thanked for supplying the data of the ambient station.