Abstract

This study presents a combined Long Short-Term Memory and Extreme Gradient Boosting (LSTM-XGBoost) method for predicting flight arrival flow at an airport. Correlation analysis is conducted between the historic arrival flow and the input features, and the XGBoost method is applied to identify the relative importance of the variables. The historic time series of airport arrival flow and the selected features are taken as input variables, and the subsequent flight arrival flow is the output variable. The model parameters are sequentially updated based on the most recently collected data and the new prediction results. Incorporating the meteorological features is found to improve the prediction accuracy considerably. The data analysis results indicate that the developed method characterizes the dynamics of the airport arrival flow well, thereby providing satisfactory prediction results. The prediction performance is compared with benchmark methods including the backpropagation neural network, LSTM neural network, support vector machine, gradient boosting regression tree, and XGBoost. The results show that the proposed LSTM-XGBoost model outperforms the baseline and state-of-the-art neural network models.

1. Introduction

The airport is the terminal for aircraft taking off and landing and the transfer point for passenger distribution. The daily air traffic flow exhibits both strong periodicity and randomness. Many factors influence the airport arrival flow, among which the most widely acknowledged are meteorological factors: severe weather such as thunderstorms in summer and blizzards in winter causes short-term changes in arrival flow, as do unfavorable weather conditions that affect visibility [1, 2]. Real-time, high-precision arrival flow prediction at the airport is of great significance for identifying similar patterns, implementing passenger evacuation strategies, alleviating airport congestion, and improving air transportation management systems [3–5]. It can also help passengers make better travel mode choices. Therefore, it is necessary to take meteorological factors into account when forecasting the short-term arrival flow at the airport.

Recently, a series of studies have addressed short-term traffic flow prediction based on time-series data. The commonly used methods fall into two groups: parametric algorithms, such as linear regression, time-series models, and Kalman filtering, and nonparametric algorithms, such as the k-nearest neighbor method, support vector regression, deep-learning methods based on neural networks (e.g., convolutional neural networks and recurrent neural networks), and combinations of these methods [6–11]. The parametric algorithms are easy to implement and directly reflect the relation between the independent variables and the dependent variable, while the nonparametric algorithms, especially the deep-learning methods, offer higher prediction accuracy and less computation time for large datasets. For example, Lu et al. proposed a combined method for short-term highway traffic flow prediction based on a recurrent neural network [12]. Asadi and Regan presented a spatiotemporal decomposition-based deep neural network for time-series forecasting, using highway traffic flow data from the Bay Area of California as a case study; a multikernel convolutional layer is designed to maintain the network structure and extract short-term and spatial patterns [13]. Li et al. proposed an adaptive real-time prediction model under uncontainable conditions. The model consists of two stages: an online sequence extreme learning machine with a forgetting factor for noise processing and a hidden Markov model for traffic flow prediction [14].

Compared with highway traffic flow prediction, the short-term prediction of airport arrival flow tends to be more complicated because of the stochastic and dynamic nature of air traffic flow under various influencing factors such as weather conditions [15–17]. The short-term prediction of air traffic flow therefore remains an active research topic. Although different statistical approaches have been used in past studies, each has suggested that there are meaningful relationships between various input variables and the traffic flow rate [18–20]. Further development is still needed to advance the predictive aspects of the linkage between airport arrival flow and the input variables, including meteorological variables, and then to predict future arrival flow using data mining techniques.

The primary objectives of this paper are, first, to determine whether there are significant relationships between airport arrival flow and various meteorological variables; second, to identify which factors can be used as inputs to estimate airport arrival flow; and third, to select an appropriate model that predicts the airport arrival flow with satisfactory performance. To this end, the correlation between the historic arrival flow and various features is calculated. Then, a combined Long Short-Term Memory and Extreme Gradient Boosting (LSTM-XGBoost) method is proposed for airport arrival flow prediction. The selected features, including meteorological variables, are input into the network.

The rest of the paper is organized as follows. Section 2 describes the data collection and preparation procedure. Section 3 presents the proposed framework, which incorporates the long short-term memory neural network and the extreme gradient boosting algorithm. Section 4 reports the data analysis results by comparing the performance of the proposed method with that of commonly used benchmark methods. Section 5 presents the conclusions and future work.

2. Data Preparation

To meet the research objective, the airport performance data and various factors required in the data mining procedure are collected. The data sources for analysis can be divided into two categories: flight arrival data and airport meteorological information.

2.1. Flight Arrival Data

This paper selects the flight arrival data of Nanjing Lukou International Airport (NKG) from January 1, 2018, to December 31, 2018, with a total of 113,243 records of information for data extraction and analysis. The specific flight information includes flight ID, aircraft type, departure airport, destination airport, estimated departure time, estimated arrival time, actual departure time, actual arrival time, and status of flight for that day.

Each day of flight information is divided into 48 records, with 30 minutes as the time horizon of a record. From the flight information provided, the flight date, the planned and actual arrival times of the aircraft, and the final status of the flight are used to calculate the planned and actual flow of each time slice of the day. Flights that were canceled or changed on that day are excluded. Figure 1 illustrates the daily arrival and canceled flights in 2018. The number of flight arrivals fluctuates periodically, while the number of canceled flights tends to be stochastic and unscheduled. In addition to cancellations, other cases may cause a difference between the scheduled and actual flight counts, namely, changes of flight routes, diversions to alternate airports, and missing values. For the 30 min data records, the difference between the scheduled flight counts and the actual flight counts ranges from 0.014 to 6.803 with a mean value of 2.027, which accounts for 17.56% to 88.47% with an average of 34.94%.
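The slicing of a day into 48 half-hour records can be sketched as follows. This is a minimal illustration rather than the authors' processing code: the `(arrival_time, status)` record layout, the `"CANCELLED"` label, and the function names are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

SLOT_MINUTES = 30  # one record covers 30 minutes, so a day has 48 slices


def slot_index(ts: datetime) -> int:
    """Map a timestamp to its 30 min slice of the day (0..47)."""
    return (ts.hour * 60 + ts.minute) // SLOT_MINUTES


def count_arrivals(records):
    """Count arrivals per (date, slice), skipping cancelled flights.

    `records` is an iterable of (arrival_time, status) pairs; both the
    layout and the "CANCELLED" label are hypothetical.
    """
    counts = Counter()
    for arrival_time, status in records:
        if status == "CANCELLED":
            continue
        counts[(arrival_time.date(), slot_index(arrival_time))] += 1
    return counts


flights = [
    (datetime(2018, 6, 28, 10, 5), "ARRIVED"),
    (datetime(2018, 6, 28, 10, 29), "ARRIVED"),
    (datetime(2018, 6, 28, 10, 30), "ARRIVED"),
    (datetime(2018, 6, 28, 23, 59), "CANCELLED"),
]
actual_flow = count_arrivals(flights)
```

Running the same function once on the planned arrival times and once on the actual arrival times would give the planned and actual flow series for each slice.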

2.2. Airport Meteorological Information

The airport meteorological information comes from OGIMET [21], which provides local weather conditions. Data from the Meteorological Report of Aerodrome Conditions (METAR) of Nanjing airport in 2018 are collected, including the four-character code of the airport, UTC time, wind direction, wind speed, wind gusts, temperature, dew point temperature, visibility (runway visual range), air pressure, cloud height, cloud cover, humidity, pressure, and weather phenomena such as precipitation, thunderstorm, fog, snowfall, and haze. Variables about some weather phenomena are set as dummy variables. Taking rainfall as an example, 1 indicates the presence of rainfall and 0 indicates no rainfall. The collected METAR messages are summarized. Table 1 presents partial data of the real-time meteorological indicators of Nanjing Lukou International Airport from 10:00 to 14:00 on June 28, 2018, for illustration.
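As a rough illustration of the dummy-variable encoding, the sketch below turns METAR present-weather groups into 0/1 indicators. The two-letter codes RA, TS, FG, SN, and HZ are standard METAR abbreviations; the mapping onto the five phenomena and the helper names are assumptions, since the paper does not describe its parsing.

```python
# Standard METAR present-weather codes mapped to the phenomena named in the
# text; the choice of exactly these five flags is an assumption.
PHENOMENA = {
    "RA": "rain",
    "TS": "thunderstorm",
    "FG": "fog",
    "SN": "snowfall",
    "HZ": "haze",
}


def weather_codes(token):
    """Split one present-weather group (e.g. '+TSRA') into 2-letter codes."""
    body = token.lstrip("+-")      # drop the intensity prefix
    if body.startswith("VC"):      # drop the "in the vicinity" qualifier
        body = body[2:]
    return [body[i:i + 2] for i in range(0, len(body), 2)]


def dummy_variables(weather_tokens):
    """Return a 0/1 flag per phenomenon for one METAR report."""
    flags = {name: 0 for name in PHENOMENA.values()}
    for token in weather_tokens:
        for code in weather_codes(token):
            if code in PHENOMENA:
                flags[PHENOMENA[code]] = 1
    return flags


# A heavy thunderstorm with rain, plus mist (BR), which is not tracked here.
flags = dummy_variables(["+TSRA", "BR"])
```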

As the METAR information is issued roughly hourly, interpolation is used to obtain meteorological data at 30 min granularity to match the flow data of 48 time slices per day. The meteorological information includes not only continuous factors such as wind speed, temperature, and visibility but also discrete factors such as rain, snow, and thunderstorm. The piecewise linear interpolation method is therefore used to interpolate the hourly continuous meteorological data, while the weather phenomena are assumed to remain unchanged within the current one-hour period. Figure 2 illustrates the daily arrival flights as well as the duration of rain and thunderstorm at NKG in May 2018.
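The two interpolation rules can be sketched as below, assuming exactly one reading per hour on the hour; real METAR series are only roughly hourly and would need to be aligned to the clock first. The function names are illustrative.

```python
def to_half_hourly(hourly_values):
    """Piecewise linear interpolation: add the midpoint between each pair
    of consecutive hourly readings (the value at hh:30)."""
    out = []
    for left, right in zip(hourly_values, hourly_values[1:]):
        out.append(left)
        out.append((left + right) / 2.0)
    out.append(hourly_values[-1])
    return out


def hold_half_hourly(hourly_flags):
    """Weather phenomena: repeat each hourly 0/1 flag for the hh:30 slice."""
    out = []
    for flag in hourly_flags[:-1]:
        out.extend([flag, flag])
    out.append(hourly_flags[-1])
    return out


# Readings at 10:00, 11:00 and 12:00 become slices at 10:00, 10:30, ..., 12:00.
visibility_km = to_half_hourly([10.0, 6.0, 8.0])
rain_flag = hold_half_hourly([0, 1, 0])
```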

2.3. Data Preprocessing

The collected data are preprocessed by filtering, normalizing, and reconstructing, which effectively improves the convergence speed and prediction accuracy of the model. The final dataset includes one actual inflow as the output variable and twelve features, comprising eleven real-time weather features and one planned flow volume, as the input variables. All the variables are normalized using the following equation to transform them into dimensionless values ranging from 0 to 1:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \tag{1}$$

where $x'$ represents the normalized dimensionless value, $x$ represents the original value, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the variable. The model is calibrated using data from January to September with a total of 13,104 30 min records and then validated using data from October to December with a total of 4,416 30 min records.
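A minimal sketch of the normalization and of the chronological split is given below. Taking the minimum and maximum from the calibration data only is an assumption of this sketch, not something the paper states; the record counts follow directly from 48 slices per day.

```python
from datetime import date

SLOTS_PER_DAY = 48


def fit_min_max(train_values):
    """Return the (min, max) pair used for scaling."""
    return min(train_values), max(train_values)


def apply_min_max(values, lo, hi):
    """Scale values to [0, 1]; a constant feature maps to 0."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Chronological split: January to September for calibration,
# October to December for validation.
train_days = (date(2018, 10, 1) - date(2018, 1, 1)).days
test_days = (date(2019, 1, 1) - date(2018, 10, 1)).days
train_records = train_days * SLOTS_PER_DAY
test_records = test_days * SLOTS_PER_DAY

wind_speed = [2.0, 4.0, 6.0, 10.0]      # toy values, not from the dataset
lo, hi = fit_min_max(wind_speed)
scaled = apply_min_max(wind_speed, lo, hi)
```

The computed record counts reproduce the 13,104 and 4,416 records quoted in the text.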

3. Methodology

In this section, a combined LSTM-XGBoost method is constructed for short-term airport arrival flow prediction. The proposed LSTM-XGBoost method contains two components, the long short-term memory neural network and the extreme gradient boosting algorithm. The methods used in each component are briefly discussed.

3.1. The LSTM Method

LSTM is an important variant of the Recurrent Neural Network (RNN). It has been shown that LSTM works well on sequence-based tasks with long-term dependencies. Compared with the traditional artificial neural network, the LSTM network combines long-term and short-term memory through special structures such as the forget gate, input gate, and output gate [22]. In recent years, the LSTM method has been frequently applied in short-term prediction with good performance [23, 24].

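As shown in Figure 3, $x_t$ is the input variable and $h_t$ is the output variable at time t. σ and tanh are the activation functions of the network, where σ represents the sigmoid function and tanh is the hyperbolic tangent function. Their role is to introduce nonlinear transformations so that the network has stronger nonlinear expression capability. A unit in the LSTM network structure processes data as follows. First, $x_t$ is input into the network together with the output at the previous time, $h_{t-1}$. Then, the long-term memory state variable is selectively retained through the forget gate, and a new memory state variable is formed through the input gate by superposing the current state on the long-term state at the previous time. Finally, the output variable at time t is obtained from the long-term memory state variable through the output gate:

$$f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right), \tag{2}$$

$$i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right), \tag{3}$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tanh\left(W_c [h_{t-1}, x_t] + b_c\right), \tag{4}$$

$$o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right), \tag{5}$$

$$h_t = o_t \odot \tanh\left(C_t\right), \tag{6}$$

where $f_t$, $i_t$, and $o_t$ are the activations of the forget gate, input gate, and output gate, respectively, and $C_t$ is the long-term memory state variable.

In equations (2) to (6), $W_f$, $W_i$, $W_c$, $W_o$, and $b$ (the bias terms $b_f$, $b_i$, $b_c$, and $b_o$) are learning parameters. σ and tanh are two commonly used nonlinear activation functions.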
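The gate sequence described above (forget, input, memory update, output) can be illustrated with a single scalar LSTM unit. The sketch follows the standard LSTM formulation; the dictionary layout of the parameters and the toy weights are assumptions for illustration and are not the trained values of the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a scalar LSTM unit.

    `params` maps a gate name ("f", "i", "c", "o") to a tuple
    (weight on h_prev, weight on x_t, bias). This layout is hypothetical.
    """
    def pre_activation(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("f"))           # forget gate
    i_t = sigmoid(pre_activation("i"))           # input gate
    c_candidate = math.tanh(pre_activation("c")) # current (candidate) state
    c_t = f_t * c_prev + i_t * c_candidate       # new long-term memory state
    o_t = sigmoid(pre_activation("o"))           # output gate
    h_t = o_t * math.tanh(c_t)                   # output at time t
    return h_t, c_t


# Toy weights and a short normalized input sequence.
params = {
    "f": (0.5, 0.1, 0.0),
    "i": (0.3, 0.2, 0.0),
    "c": (0.4, 0.6, 0.0),
    "o": (0.2, 0.1, 0.0),
}
h_t, c_t = 0.0, 0.0
for x_t in [0.2, 0.5, 0.9]:
    h_t, c_t = lstm_step(x_t, h_t, c_t, params)
```

With all weights set to zero, every gate evaluates to 0.5 and the unit simply halves its memory state at each step, which is a convenient sanity check.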

3.2. The XGBoost Method

The extreme gradient boosting (XGBoost) method is an improved method based on the Gradient Boosted Decision Tree (GBDT), proposed by Chen and Guestrin (2016) [25]. The salient features that distinguish XGBoost from other gradient boosting algorithms include clever penalization of trees, proportional shrinking of leaf nodes, Newton boosting, and an extra randomization parameter. In this paper, the XGBoost method is used to extract features and evaluate relative feature importance. The procedures are presented as follows.

For a given dataset with n samples and m characteristics, represented as $D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), and assuming that the XGBoost model has K decision trees, the flight flow prediction model is represented as follows:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i),$$

where $\hat{y}_i$ is the predicted value at time i; $x_i$ is the corresponding input variables for $\hat{y}_i$; and $f_k$ is the prediction function corresponding to the kth decision tree, which is defined as follows:

$$f_k(x) = w_{q(x)}, \quad q: \mathbb{R}^m \to \{1, \ldots, M\}, \quad w \in \mathbb{R}^M,$$

where $q$ represents the structure function that maps $x$ to the corresponding leaf node of the kth decision tree; $w$ is the quantization weight vector of the leaf nodes; and M is the number of leaf nodes in the tree.

The loss function L of the XGBoost algorithm includes an error term l and a regularization term Ω. The prediction model is learned by minimizing the loss function, in which the root-mean-square error is selected as the error term l:

$$L = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right).$$

In the formula, the regularization term Ω prevents the model from overfitting.
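The additive tree model and its regularized loss can be sketched in a few lines. The `Tree` class, the toy split rules, and the penalty form (a per-leaf cost `gamma` plus an L2 penalty `lam` on the leaf weights, as in the original XGBoost formulation) are assumptions of this illustration; the paper does not give the exact form of Ω.

```python
import math


class Tree:
    """A tree as a structure function q (sample -> leaf index) plus leaf weights w."""

    def __init__(self, structure, weights):
        self.structure = structure
        self.weights = weights

    def predict(self, x):
        return self.weights[self.structure(x)]


def ensemble_predict(trees, x):
    """Additive model: the prediction is the sum of the K tree outputs."""
    return sum(tree.predict(x) for tree in trees)


def regularization(tree, gamma, lam):
    """Assumed penalty: gamma per leaf plus half of lam times the squared weights."""
    return gamma * len(tree.weights) + 0.5 * lam * sum(w * w for w in tree.weights)


def objective(trees, samples, targets, gamma=0.0, lam=1.0):
    """Error term (RMSE, as selected in the text) plus the summed penalties."""
    preds = [ensemble_predict(trees, x) for x in samples]
    rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(targets, preds)) / len(targets))
    return rmse + sum(regularization(t, gamma, lam) for t in trees)


# Toy features: (scheduled flights in the slice, visibility in km).
tree_schedule = Tree(lambda x: 0 if x[0] < 10 else 1, [6.0, 12.0])
tree_visibility = Tree(lambda x: 0 if x[1] < 3.0 else 1, [-2.0, 1.0])
trees = [tree_schedule, tree_visibility]

pred_low_visibility = ensemble_predict(trees, (14, 1.5))
pred_clear = ensemble_predict(trees, (14, 9.0))
```

In this toy ensemble the second tree lowers the prediction when visibility is poor, which mirrors how later trees correct the residual of earlier ones.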

3.3. The Combined LSTM-XGBoost Method

As mentioned above, the daily air traffic flow has strong periodicity and randomness. Data analysis indicates that there are several peak periods of arrival flights: from 8:30 to 11:00, from 12:30 to 13:30, and from 17:00 to 19:00. The airport arrival flow is influenced by many external factors, among which meteorological factors are commonly recognized as potentially significant. The LSTM model has been widely used for time-series problems because it can capture the temporal correlation of time-series data. However, the traditional LSTM lacks the ability to extract the external features that may affect the predicted variables. To this end, this paper proposes an LSTM-XGBoost model, which characterizes both the temporal correlation and the influence of external characteristics.

The structure of the LSTM-XGBoost model is shown in Figure 4. The input data of the LSTM cell consist of two parts, the scheduled flight flow data and the historic flight flow data, which together constitute the input matrix; T represents the prediction timestep. After the LSTM layer, the Rectified Linear Unit (ReLU) is used as the activation function to output the predicted value at time T + i.

Then, the XGBoost model is used to predict the arrival flow at time T + i from an input feature vector that incorporates the predicted value from the LSTM at time T + i and the external meteorological characteristics.
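The data flow between the two components can be sketched as below. The two stand-in models are deliberately trivial placeholders (a blend of recent flow and schedule, and a fixed thunderstorm penalty) and are not the trained LSTM or XGBoost models; the function names and feature order are assumptions.

```python
def relu(z):
    return max(0.0, z)


def two_stage_predict(flow_window, scheduled_window, weather, stage1, stage2):
    """Stage 1 models the time series; stage 2 adds the external features."""
    # Stage 1 (LSTM in the paper): sees historic and scheduled flow only.
    temporal_pred = relu(stage1(flow_window, scheduled_window))
    # Stage 2 (XGBoost in the paper): sees the stage 1 output plus weather.
    features = [temporal_pred] + list(weather)
    return stage2(features)


def stand_in_stage1(flow, scheduled):
    """Placeholder: average of the mean recent flow and the latest schedule."""
    return 0.5 * (sum(flow) / len(flow)) + 0.5 * scheduled[-1]


def stand_in_stage2(features):
    """Placeholder: subtract 3 flights when the thunderstorm flag is set."""
    temporal_pred, _visibility_km, thunderstorm = features
    return temporal_pred - 3.0 * thunderstorm


flow_window = [10.0, 12.0, 14.0]        # historic flow over the last slices
scheduled_window = [11.0, 12.0, 15.0]   # scheduled flow, latest value last
stormy = two_stage_predict(flow_window, scheduled_window, (2.0, 1),
                           stand_in_stage1, stand_in_stage2)
calm = two_stage_predict(flow_window, scheduled_window, (9.0, 0),
                         stand_in_stage1, stand_in_stage2)
```

The design point is that the second stage never sees the raw time series, only the first stage's prediction, so the weather features act as a correction on top of the temporal forecast.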

3.4. Evaluation Metrics

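To evaluate the performance of the proposed model, the mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) are calculated for each method. The equations are shown as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2},$$

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,$$

where $y_i$ represents the actual value of sample i; $\hat{y}_i$ represents the predicted value of sample i; $\bar{y}$ represents the average value of the real data; and n is the sample size.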
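The three metrics can be computed as in the sketch below. Skipping samples whose actual value is zero in MAPE is an assumption of this sketch (the ratio is undefined there, and time slices with no arrivals are plausible); the paper does not say how such slices are handled.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Samples with an actual value of 0 are skipped (an assumption of this
    sketch), so at least one nonzero actual value is required.
    """
    pairs = [(a, p) for a, p in zip(y_true, y_pred) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


actual = [10.0, 20.0, 40.0]      # toy arrival counts
predicted = [12.0, 18.0, 40.0]
scores = (mae(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```

MAE and RMSE are expressed in flights and grow with the magnitude of the flow, whereas MAPE is relative, which is why the metrics can move in opposite directions when the prediction horizon changes.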

4. Data Analysis Results

4.1. Correlation Analysis of Input Features

As mentioned above, twelve features are collected and incorporated in the proposed model, including scheduled flights, wind speed, temperature, dew point temperature, visibility, atmospheric pressure at nautical height (QNH), cloud, rain, thunderstorm, fog, snowfall, and haze. To identify the relationships among these factors, the Pearson correlation coefficient (r) is calculated between the actual arrival flow and the explanatory variables as well as between different explanatory variables. The equation is shown as follows:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}.$$

In this formula, $x$ is the independent variable; $y$ is the dependent variable; $\bar{x}$ is the mean of the independent variable; and $\bar{y}$ is the mean of the dependent variable. The Pearson correlation coefficient (r) ranges from −1 to 1 and represents the strength of the linear correlation between two variables. The results are shown in Figure 5.
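A direct implementation of the Pearson correlation coefficient is sketched below with toy values; the example series are invented for illustration and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (dev_x * dev_y)


# Toy series: temperature against dew point, and temperature against QNH.
temperature = [1.0, 2.0, 3.0]
dew_point = [2.0, 4.0, 6.0]
qnh = [6.0, 4.0, 2.0]
r_positive = pearson_r(temperature, dew_point)
r_negative = pearson_r(temperature, qnh)
```

A dummy variable that is 0 for every sample (for example, snowfall in a snow-free period) has zero variance, so its correlation is undefined rather than zero; the sketch raises an error in that case.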

As shown in Figure 5, besides being highly related to scheduled flights, the actual flights are positively related to visibility, wind speed, and temperature and negatively related to fog. In addition, visibility is positively related to temperature, wind speed, scheduled flights, and dew point temperature and negatively related to fog, cloud, rain, QNH, and haze. It should also be noted that although thunderstorm and snowfall have a weak correlation with the other features in the current data, this does not mean that these two factors can be excluded from consideration. On the contrary, as rare events, these extreme weather conditions may seriously affect the arrival of flights. Because temperature is highly positively related to dew point temperature and highly negatively related to QNH, these two variables (dew point temperature and QNH) are removed from the input features in the subsequent models.

4.2. Analysis of Variable Importance

With the selected features, the XGBoost method is applied to identify the relative importance of the variables. The results are shown in Figures 6(a)–6(c) for the 30 min, 60 min, and 120 min prediction time horizons, respectively. Generally, the meteorological variables have a similar impact on the arrival flow in all three scenarios. The most influential feature is scheduled flights, which is consistent with common sense. The other two important features are temperature and visibility. The importance of temperature has two explanations. First, the collected data indicate that, in general, people prefer to travel on warmer days, except for the traditional holidays. Second, there are more flights in the daytime, when the temperature is higher, than at nighttime. As for visibility, the operation of aircraft is subject to visibility requirements, and flights tend to be delayed under poor visibility until conditions return to normal.

The relative importance of the variables differs slightly among the prediction models with different time periods. The ranking is temperature, followed by visibility, wind speed, cloud, and snow for the 30 min prediction model; visibility, temperature, wind speed, cloud, and snow for the 60 min prediction model; and visibility, temperature, wind speed, snow, and thunderstorm for the 120 min prediction model.

It is also found that the F-scores for the meteorological features are relatively low, even though extreme weather conditions may have strong impacts on the actual flight arrival rate. The collected data indicate that the difference between the actual flow rate and the scheduled flow rate fluctuates more under bad weather conditions. The reason for the small F-scores is that almost all the extreme weather conditions are rare events, and the feature importance is generated according to the degree of influence of the feature on the accuracy of the prediction during the process of generating the model. Besides, some weather conditions occur at specific times of day. For example, fog usually appears in the early morning, when the arrival flow rate is lower. Thus, the calculated importance of the feature will be small according to the collected data. In addition, most of the meteorological features are associated with visibility. The impacts of bad weather conditions are therefore reflected, to a certain extent, through the visibility feature rather than through the dummy variables for the occurrence of snow, thunderstorm, rain, haze, fog, and so on.
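One way to see why rare events receive small scores is to treat the F-score as a split count, that is, the number of times a feature is chosen to split a node across all trees. This is a common definition in gradient boosting tooling, but whether it is the definition behind Figure 6 is an assumption; the toy trees below are invented.

```python
from collections import Counter


def split_count_importance(trees):
    """Count how often each feature is used as a split across all trees.

    Each tree is given as the list of features used at its internal nodes,
    a simplified and hypothetical representation.
    """
    counts = Counter()
    for tree in trees:
        counts.update(tree)
    return counts


toy_trees = [
    ["scheduled_flights", "temperature", "visibility"],
    ["scheduled_flights", "visibility", "wind_speed"],
    ["scheduled_flights", "temperature", "thunderstorm"],
]
importance = split_count_importance(toy_trees)
top_feature, top_count = importance.most_common(1)[0]
```

A thunderstorm dummy that is 1 in only a handful of time slices can improve the fit at few nodes, so it is rarely selected and its count stays low even when its effect in those slices is large.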

4.3. Comparison of Prediction Results

With the selected features as inputs, the LSTM-XGBoost model is constructed. The hyperparameters are tuned, including the number of hidden layers, the number of neurons in each hidden layer, and the timestep for the LSTM component, as well as the tree depth, learning rate, and number of decision trees for the XGBoost component. The values used are shown in Table 2.

To verify the performance of the proposed LSTM-XGBoost model, several benchmark methods are also tested and compared. The selected benchmark methods include the backpropagation (BP) neural network, LSTM neural network, support vector machine (SVM), gradient boosting regression tree (GBRT), and XGBoost, which were commonly used in previous studies of short-term traffic flow prediction. The hyperparameters for BP and LSTM are selected in the same way as for the LSTM-XGBoost model. All the benchmark methods are trained and tested with the same data and input variables to ensure that the models are comparable. The results are summarized in Table 3.

As shown in Table 3, six short-term arrival flow prediction models are developed for each method, combining three prediction time horizons (30 min, 60 min, and 120 min) with two sets of input features: historic and scheduled flights only, and historic and scheduled flights together with meteorological variables. Based on the data analysis results, the following findings can be obtained.

First, for each method, MAE, MSE, and RMSE increase sharply with the increase in prediction time horizon, while MAPE slightly decreases. Specifically, MAE and RMSE are the lowest for the 30 min prediction time horizon, as the two metrics increase with the magnitude of the original arrival flow data, while in terms of MAPE, the model exhibits the best performance for the 120 min prediction time horizon.

Second, for all the five methods, the model performance is improved by incorporating meteorological variables, especially for the 120 min prediction time horizon, which indicates that these factors, and extreme weather conditions in particular, may have a significant impact on airport arrival flow. The improvement is the most prominent for the proposed LSTM-XGBoost method.

Third, the proposed LSTM-XGBoost method generally outperforms all the other machine learning techniques in terms of lower MAE, MSE, RMSE, and MAPE, followed by XGBoost, GBRT, and LSTM. This confirms the superiority and feasibility of the proposed model, which can successfully capture both the temporal features and influencing factors.

To further investigate the performance of the proposed model affected by various meteorological factors, the prediction accuracy of the airport arrival flow for different weather conditions is tested and compared, as shown in Figure 7.

In Figure 7, the x-axis represents the randomly selected samples with 30 min data for each sample. The y-axis represents the number of flights. The prediction results from LSTM, XGBoost, and LSTM-XGBoost methods are compared with the actual data. It is found that the proposed LSTM-XGBoost model outperforms the other two methods for all scenarios. The results further demonstrate the robustness and applicability of the proposed model.

5. Conclusions

This paper proposed a combined Long Short-Term Memory and Extreme Gradient Boosting (LSTM-XGBoost) method for arrival flow prediction at the airport. The traditional Long Short-Term Memory (LSTM) network and the XGBoost model are incorporated by taking both the time-series information and the meteorological features into account. The Pearson correlation coefficients are calculated to describe the strength of the linear correlation between two variables, and the importance of variables is identified. The prediction results are compared with some benchmark methods, including BP, LSTM, SVM, GBRT, and XGBoost. The proposed algorithm improves the accuracy and stability of short-term airport arrival flow prediction.

Even though the proposed LSTM-XGBoost approach has exhibited great potential for short-term prediction of airport arrival flow, several limitations of this study still need to be addressed. First, this study focuses on incorporating meteorological factors in airport arrival flow prediction, whereas the real-time airport arrival flow is affected by a series of other factors. Future research is still needed to identify the impacts of other significant variables. Second, the paper used the data from Nanjing Lukou International Airport as a case study. Data from other airports, especially those with extreme weather conditions, can also be applied to further investigate the robustness and applicability of the proposed model. The authors recommend that future studies focus on these issues.

Data Availability

The Flight Data.rar file is provided as supplementary materials, containing all the flight arrival data for Nanjing Lukou Airport in 2018. The airport meteorological information is collected from OGIMET (http://ogimet.com/metars.phtml.en).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was sponsored by the Fundamental Research Funds for the Central Universities of China (NS2020046), National Natural Science Foundation of China (51608268, U1933119, and 71971112), and Science and Technology Innovation Project for College Students (2020CX00760 and 2020CX00753).