#### Abstract

Reliable prediction of short-term passenger flow could greatly support metro authorities’ decision processes, help passengers adjust their travel schedules, and, in extreme cases, assist emergency management. The inflow and outflow of a metro station are strongly associated with the travel demand within the metro network, and the purpose of this paper is to predict them. We first extract origin-destination information from smart-card data and explore the passenger flow patterns in a metro system. We then propose a data-driven framework for short-term metro passenger flow prediction that utilizes both spatial and temporal information. The approach adopts two forecasts as basic models and then uses a probabilistic model selection method, random forest classification, to combine the two outputs into a better forecast. In the experiments, we compare the proposed model with four other prediction models, i.e., autoregressive moving average, neural networks, support vector regression, and an averaging ensemble model, as well as the basic models. The results indicate that the proposed approach outperforms the others in most cases. The origin-destination flows extracted from smart-card data can be successfully exploited to describe different metro travel patterns, and the proposed framework, especially the probabilistic combination method, can improve the performance of short-term transportation prediction.

#### 1. Introduction

Many cities, due to increasing travel demand and ever-extending city coverage, rely more and more on metro systems. With reliable, efficient, and safe service, metro networks are experiencing a sharp rise in ridership [1]. Successfully serving such a large number of passengers requires a high level of operational service, in which short-term passenger flow prediction plays a key role. Short-term passenger flow prediction is the prerequisite and foundation for the adaptive control of traffic conditions in intelligent transit systems. Predicted passenger information can be applied extensively in advanced traffic management systems and advanced travel information systems to help with facility improvement, operation planning, revenue management, and even emergency evacuation.

Short-term traffic prediction and short-term passenger flow prediction are successful applications of short-term transportation prediction in the literature [2]. Unlike metro passenger demand forecasting, traffic forecasting has a long research history and a variety of analytic, statistical, and simulation-based models. Parametric models include ARMA, seasonal ARMA, Kalman filtering, etc., while the k-nearest neighbors (kNN) approach and spectral analysis are frequently used nonparametric models [3]. In recent years, various data-driven methods and machine learning models have attracted much attention. Many researchers have moved from what can be considered a classical statistical perspective (the ARMA family of models) to neural and evolutionary computational approaches. This significant leap from analytical to data-driven modeling has been marked by an overwhelming increase of computational intelligence and data mining approaches to analyzing the data [4]. Machine learning algorithms such as neural networks (NN) [5], support vector regression (SVR) [6], and Gaussian processes [7] have been adopted frequently to perform short-term traffic prediction.

Compared to traffic prediction for ground transportation, short-term passenger demand prediction for transit systems is a relatively new research field. Tsai et al. [8] constructed two types of improved neural network models. The first is the multiple temporal units neural network (MTUNN), which deals with distinctive input information via designated connections in the network. The second is the parallel ensemble neural network (PENN), which deals with different input information in several individual models. The results show that both MTUNN and PENN outperform conventional multilayer perceptron neural networks. Wei and Chen [2] developed a hybrid forecasting approach that combines empirical mode decomposition and back-propagation neural networks and found that the approach performs well and stably in forecasting short-term metro passenger flow. Sun et al. [9] proposed a hybrid wavelet support vector machine (SVM) model. The method first decomposes the passenger flow data into high-frequency and low-frequency series by wavelet, then predicts these series separately by SVM, and finally reconstructs the predicted sequences by wavelet. This method is claimed to have the best forecasting performance compared with the state-of-the-art techniques as of 2015. Li et al. [10] presented a multiscale radial basis function network for forecasting the irregular fluctuation of metro passenger flows. Its prediction performance is reported to exceed that of prevailing computational intelligence methods for nonregular demand forecasting at least 30 min ahead. Silva et al. [11] proposed an approach to analyzing massive transportation systems with the goals of quantifying the effects of shocks in the system, such as line and station closures, and predicting traffic volumes. They used past disruptions to predict unseen scenarios by relying on simple physical assumptions of passenger flow and a system-wide model for origin-destination (OD) movement.

Forecast combination takes advantage of the availability of both multiple information sources and computing resources for data-intensive forecasting to improve forecasting accuracy [12]. Combinations are widely used in a postprocessing phase to improve the stability and accuracy of the individual models [12–16]. Research on these methods started around the 1960s and now ranges from the robust simple average to far more theoretically complex approaches, such as state-space methods that attempt to model nonstationarity in the combining weights. Most of the literature documents that a combined forecast improves the overall accuracy to a great extent and is often better than the forecast of each component model [17]. A majority of the methods form a weighted linear combination of the component forecasts. The statistical averaging techniques, e.g., simple average, trimmed mean, Winsorized mean, and median, are the most basic ensemble methods, as they do not explicitly determine the combining weights. Many studies found that these fairly simple methods outperformed a number of more advanced combining schemes [18].

In this work, we propose a framework to select and combine forecasts to achieve better prediction performance. In the first step, two basic models, k-nearest neighbors (kNN) and adaptive boosting (Adaboost), are adopted to obtain two forecasts. Due to the complexity of the travel patterns and flows, these individual prediction models may perform differently in different situations. Thus we compared the results based on the fitted regression relationships between forecasting accuracies and features for each basic model. Finally, probabilistic combination is applied to combine the refined results into a final forecasting value. The subsequent validation indicates that the proposed model outperforms some widely used forecasting models, i.e., ARMA, NN, SVR, and an averaging ensemble model using the same basic models, in terms of prediction accuracy and stability.

The remainder of the paper is organized as follows. Section 2 presents the dataset used in this study. Section 3 provides an intuitive analysis of the data. Section 4 describes the complete method and the setup of a case study of passenger flow prediction using real-world data. Section 5 reports the results, including comparative analyses and visualizations of the prediction performances. Finally, conclusions are drawn and future research directions are indicated in Section 6.

#### 2. Data Description

Metro Line 1 in Zhengzhou City, China, is selected for a case study. The city lies in the middle of China and has a prosperous and promising metro demand from its population of 9.568 million. During the period covered by our dataset, Line 1 was the only metro line in the city with an average daily passenger volume about . This dataset reflects a typical growth pattern of a newly built metro system, and exploring its demand pattern helps policymakers understand the passenger volume fluctuation of a medium-sized city.

The datasets were collected from the automatic fare collection (AFC) system and cover 667 days, from July 20th, 2014, to May 16th, 2016. AFC records comprise all journeys made by metro. The useful information in each record for this case consists of a time stamp, a location code, a transaction type, and a card ID. The location code uniquely identifies each of the 19 stations of the system that were active during the period covered by our data. The two transaction types of interest are generated when a passenger touches the smart-card reader at the entrance (“tap-in” event) or at the exit (“tap-out” event) of a station. Card IDs allow us to pair the tap-in and tap-out records, from which the ODs are obtained. We discarded all tap-in records that are not matched to a tap-out and vice versa. Paired OD flows are aggregated into time intervals of 15 minutes, so each day is composed of 96 time intervals. Our analysis covers all weekdays and weekends. Some statistics of instation flow and outstation flow are shown in Table 1, including the average, the 50th and 80th percentile values, and the standard deviation of the 15-minute flow over all days for every station. Most stations have balanced instation and outstation flows. It is noteworthy that Table 1 shows a high volatility of the flows, which indicates the difficulty of our prediction task.
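
The pairing and aggregation steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the record layout `(timestamp, station, kind, card_id)`, the function names, and the choice to key each trip by its tap-in time are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def pair_trips(records):
    """Pair each tap-in with the next tap-out of the same card.

    records: iterable of (timestamp, station, kind, card_id), kind in {"in", "out"}.
    Tap-ins without a later tap-out, and tap-outs without a pending tap-in,
    are discarded.
    """
    pending = {}  # card_id -> (timestamp, station) of the unmatched tap-in
    trips = []
    for ts, station, kind, card in sorted(records, key=lambda r: r[0]):
        if kind == "in":
            pending[card] = (ts, station)  # a newer tap-in replaces an unmatched one
        elif card in pending:
            t_in, origin = pending.pop(card)
            trips.append((t_in, origin, station))
    return trips


def interval_index(ts, minutes=15):
    """Index of the interval within the day (0..95 for 15-minute intervals)."""
    return (ts.hour * 60 + ts.minute) // minutes


def aggregate_od(trips):
    """Count trips per (date, interval, origin, destination).

    Assumption: a trip is assigned to the interval of its tap-in time.
    """
    counts = Counter()
    for t_in, origin, dest in trips:
        counts[(t_in.date(), interval_index(t_in), origin, dest)] += 1
    return counts


records = [
    (datetime(2015, 3, 2, 8, 5), 4, "in", "A"),
    (datetime(2015, 3, 2, 8, 31), 8, "out", "A"),
    (datetime(2015, 3, 2, 8, 10), 4, "in", "B"),
    (datetime(2015, 3, 2, 8, 40), 8, "out", "B"),
    (datetime(2015, 3, 2, 9, 0), 12, "out", "C"),  # unmatched tap-out, discarded
]
od_counts = aggregate_od(pair_trips(records))
```

Both example trips start between 08:00 and 08:15, so they fall into the same interval of OD 4–8.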

#### 3. Travel Demand Analysis

##### 3.1. In- and Outstation Flow Patterns

We investigate the patterns of the inflow and outflow at each station using the average counts in each timeslot, as shown in Figure 1. The stations can be split into the following four groups according to their flow patterns. (i) Stations that display a two-peak pattern in both the inflow and the outflow on weekdays, e.g., Station 12. These stations are located in zones that serve as both residential and nonresidential areas, so the number of people departing from these zones is comparable with the number of those arriving during the peak hours. (ii) Stations where the inflow shows only one strong peak during the morning peak hour, while the outflow shows one strong peak during the evening peak hour, e.g., Station 4. Such stations are located in zones that mainly serve as residential areas. Most of the travelers in these zones are commuters, who usually leave home for work in the morning peak hour and return home in the evening. (iii) In contrast to the second pattern, stations with a strong morning peak in the outflow and an evening peak in the inflow, e.g., Station 16. These stations are in nonresidential areas, including business or industrial areas, which provide a great number of jobs. (iv) Stations that display peaks around noon in both the inflow and the outflow, such as Station 19. Such stations may be located in popular commercial areas, which are the destination of many noncommuters. In addition, some stations, like Station 10, show mixed characteristics of these groups, specifically of types (iii) and (iv). They are mostly located in the central district of the city and surrounded by big shopping malls and public facilities such as hospitals and theaters.

##### 3.2. Metro Travel Distance Distribution

The distribution of metro trip displacements computed from the dataset is shown in Figure 2. A metro trip refers to the travel of a passenger from the boarding station to the alighting station, which reflects the movement of an individual within the metro system. The displacement between any OD pair is calculated from the station longitudes and latitudes. The probability increases quickly from the beginning and reaches the peak at about km, and another peak is reached at about 9 km. This may result from the pricing structure and travel purposes. With a flat fare of 2 RMB and some time needed to enter and exit the metro stations, taking the metro for very short trips is neither economical nor time efficient. Moreover, most metro trips serve daily travel such as commuting, which is unlikely to cover a very long distance, and the upper bound of the travel distance is restricted by the city scale. We select Weibull distributions to fit the displacements. The likelihood indicated that the Weibull distribution fits the metro trip displacements slightly better, which is different from the result in [19]. The difference may lie in the structure of the urban layout.
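
As a rough illustration of the two computations mentioned above, the sketch below derives a displacement from station coordinates and evaluates a Weibull log-likelihood. The text does not state which distance formula was used, so the great-circle (haversine) distance is an assumption, as are the function names and the two-parameter Weibull form with `shape` and `scale`.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def weibull_loglik(distances, shape, scale):
    """Log-likelihood of positive distances under a two-parameter Weibull density
    f(x) = (shape/scale) * (x/scale)**(shape-1) * exp(-(x/scale)**shape)."""
    total = 0.0
    for x in distances:
        total += (
            math.log(shape) - math.log(scale)
            + (shape - 1) * (math.log(x) - math.log(scale))
            - (x / scale) ** shape
        )
    return total
```

Comparing the log-likelihoods of candidate distributions fitted to the same displacements is one common way to judge which fits better; the fitting procedure itself is not shown here.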

##### 3.3. Spatial Correlation

The relationship between OD pairs in a complex network, especially the correlation among spatially neighboring ODs, has always been of research interest. Related OD flows could potentially be used as features to improve the prediction. We explored this possibility by examining the interdependencies between geographically close OD flows. In Figure 3, OD 4–8 (passenger flow from Station 4 to Station 8) is chosen as the target OD pair; the neighboring OD pairs include all OD pairs starting from Stations 2–6 and ending at Stations 6–10. Each individual panel shows the correlation between a neighboring OD pair and the target OD pair. It can be seen that those ODs with correlation higher than are OD 3–8, OD 3–9, OD 4–7, OD 4–9, and OD 5–8. Most of these are one station away from the target OD. After conducting the analysis on all OD flows, we choose a spatial window of 1 (one station away) to construct features for prediction. Thus, for one target OD flow, 8 neighboring OD flows are used as features in prediction.
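
To make the spatial window concrete, the following sketch enumerates the neighboring OD pairs of a target OD and computes a Pearson correlation between two flow series. It assumes that station numbers follow the order of stations along the single line and that pairs with identical origin and destination are excluded; both the helper names and these conventions are illustrative assumptions.

```python
import math


def neighboring_ods(origin, dest, window=1, n_stations=19):
    """OD pairs whose origin and destination are each within `window` stations
    of the target pair, excluding the target itself."""
    neighbors = []
    for o in range(origin - window, origin + window + 1):
        for d in range(dest - window, dest + window + 1):
            if (o, d) == (origin, dest):
                continue
            # keep only valid station numbers and real trips (origin != destination)
            if 1 <= o <= n_stations and 1 <= d <= n_stations and o != d:
                neighbors.append((o, d))
    return neighbors


def pearson(x, y):
    """Pearson correlation coefficient of two equally long flow series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


window_one = neighboring_ods(4, 8, window=1)
```

For the target OD 4–8 and a window of 1, the enumeration yields the 8 neighboring OD pairs used as features.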

#### 4. Travel Demand Prediction

##### 4.1. Basic Model Formulation

The datasets are split into three parts to train and test the prediction models. The first 80% of the data are used for training, and the remaining 20% are used as the *testing set* to examine the performance of the proposed combined prediction model. For training, we randomly permuted the whole sequence (the first 80% of all data) and then divided the permuted sequence in a 5:3 ratio to obtain *Training Set 1* and *Training Set 2*. The random permutation excludes the long-term trend from the training data. *Training Set 1* (50% of all data) is used to train the two basic prediction models, Adaboost and kNN; *Training Set 2* (30% of all data) is used to train the probabilistic classifier, RF.
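
A minimal sketch of this three-way split is given below, assuming the samples are stored in chronological order in a list. The function name, the fixed random seed, and the integer rounding are assumptions made for the illustration.

```python
import random


def split_dataset(samples, train_frac=0.8, ratio=(5, 3), seed=0):
    """Return (training_set_1, training_set_2, testing_set).

    The chronologically first `train_frac` of the samples is shuffled and
    divided by `ratio`; the remaining samples keep their order and form the
    testing set.
    """
    n_train = int(len(samples) * train_frac)
    train, test = list(samples[:n_train]), list(samples[n_train:])
    random.Random(seed).shuffle(train)  # removes temporal order within the training part
    n_first = n_train * ratio[0] // sum(ratio)
    return train[:n_first], train[n_first:], test


set1, set2, test_set = split_dataset(list(range(1000)))
```

With 1000 samples this gives 500, 300, and 200 samples, i.e., 50%, 30%, and 20% of all data.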

It is noteworthy that weekdays and weekends are treated separately. Throughout the computational process of the present method, the OD flow in the next timeslot is selected as the target. The forecasts of instation and outstation passenger flows can then be obtained by aggregating the OD flows by their origin or destination station. Historical values of the target OD flow and of some neighboring OD flows are selected as features, as shown in Figure 4(a).
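
The aggregation from OD forecasts to station-level flows can be sketched as follows. The dictionary layout and the function name are assumptions; the sketch also simply sums all OD forecasts of one interval, since the text does not specify how entry and exit times are assigned to intervals.

```python
from collections import defaultdict


def station_flows(od_forecast):
    """Aggregate OD forecasts of one interval into station inflow and outflow.

    od_forecast: dict mapping (origin, destination) to a predicted count.
    Inflow sums over all ODs leaving a station, outflow over all ODs ending there.
    """
    inflow, outflow = defaultdict(float), defaultdict(float)
    for (origin, dest), value in od_forecast.items():
        inflow[origin] += value
        outflow[dest] += value
    return dict(inflow), dict(outflow)


inflow, outflow = station_flows({(4, 8): 10.0, (4, 9): 5.0, (3, 8): 2.0})
```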

As we select the spatiotemporal flows as the inputs of the prediction model, the number of features is 45 when the spatial window is 1 (8 neighboring ODs and the target OD) and the time lag is 5, i.e., $9 \times 5 = 45$. Such a number of features contains redundant information, which increases the uncertainty of prediction. Therefore, principal component analysis (PCA) is adopted to decompose the original OD flow into a number of principal components and scores, and it also plays the role of data filtering. In this way, the systematic variations in OD flows are captured in lower dimensions. Consequently, the resulting principal components have the desirable property of being uncorrelated with each other, and the weights of the principal components can serve as predictors in the prediction process. A detailed description of PCA can be found in [20]. In our case, the size of the feature matrix $X$ is $n \times p$. The variable $n$ represents the number of samples; i.e., each row of the data matrix is the OD flow vector in one of the intervals. The variable $p$ represents the number of features. The principal component directions can then be determined by performing singular value decomposition on the data matrix:

$$X = U D V^{T},$$

where $D$ is an $n \times p$ rectangular-diagonal matrix with nonnegative entries called singular values; $U$ is an $n \times n$ matrix with orthogonal column vectors called the left singular vectors; and $V$ is a $p \times p$ matrix with orthogonal column vectors called the right singular vectors. The columns of the matrix $V$ are the principal component directions. Alternatively, the columns of $V$ can be interpreted as the eigenvectors of the matrix $X^{T}X$, which is proportional to the sample covariance matrix when the columns of $X$ are centered.

Let the individual principal component directions be represented by $v_1, v_2, \ldots, v_p$. Here, $v_1$ represents the principal component direction with the largest sample variance, $v_2$ represents the principal component direction with the largest sample variance subject to being orthogonal to $v_1$, and so on. Assume that only the first $m$ directions explain a majority of the variance in the OD flow vector. The first $m$ principal component directions can be represented using a matrix as

$$V_m = [v_1, v_2, \ldots, v_m].$$

Then the principal component vector $z$ of the OD flow vector $x$ can be written as

$$z = V_m^{T} x,$$

and the OD flow vector can be approximately reconstructed as

$$\hat{x} = V_m z.$$

The extracted principal components are sorted by the percentage of the total variance explained by each principal component, from high to low, and reveal various patterns of OD flow. In this case, we use the first 10 principal components.
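
The projection onto the leading principal components can be sketched with NumPy as below. The function names are hypothetical, and centering the columns before the decomposition is an assumption of this sketch; the text does not state whether the data were centered or scaled.

```python
import numpy as np


def pca_fit(X, n_components=10):
    """Fit PCA by singular value decomposition of the column-centered data."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    directions = Vt[:n_components].T      # p x m matrix of leading directions
    explained = s ** 2 / np.sum(s ** 2)   # share of total variance per component
    return mean, directions, explained[:n_components]


def pca_transform(X, mean, directions):
    """Project samples onto the retained principal component directions."""
    return (X - mean) @ directions


def pca_reconstruct(Z, mean, directions):
    """Approximate the original features from the component scores."""
    return Z @ directions.T + mean


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 45))            # synthetic stand-in for 45 flow features
mean, directions, explained = pca_fit(X, n_components=10)
Z = pca_transform(X, mean, directions)    # 200 x 10 matrix of predictors
```

The explained-variance shares are returned in decreasing order, which matches the sorting of components described above; keeping all 45 components would reconstruct the input exactly.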

The two basic models, Adaboost and kNN, are widely used forecasting approaches, so we do not describe them in detail. For Adaboost, we used 500 trees with a maximum depth of 4. Interested readers can refer to [21] for an introduction. For kNN, we adopt .
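
For readers unfamiliar with kNN regression, a minimal sketch is given below. It is a generic textbook version, not the authors' implementation: the Euclidean distance, the unweighted average of the neighbors, and the function name are assumptions, and the number of neighbors `k` is left as a parameter because its value is not recoverable from the text.

```python
import math


def knn_predict(train_X, train_y, query, k):
    """Predict the target of `query` as the mean target of its k nearest
    training samples under the Euclidean distance."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


train_X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
train_y = [10.0, 20.0, 100.0]
prediction = knn_predict(train_X, train_y, (0.2, 0.0), k=2)
```

In the example, the two nearest samples have targets 10 and 20, so the prediction is their mean.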

##### 4.2. Probabilistic Model Selection

Two basic forecasts are obtained from the previous stage. In the following stage, we combine the prediction results of the two individual basic models with a probabilistic model selection approach, a random forest classifier, as shown in Figure 4(b). In contrast with deterministic classification, which selects the better of the individual outputs, probabilistic classification yields the probability that each output is the better one. This probabilistic mechanism reduces the bias of prediction when the outputs of the individual models diverge, for example, when one overestimates the flow while the other underestimates it.

Suppose we have a training set $\{(x_i, y_i)\}$, $i = 1, \ldots, N$, where $N$ is the number of samples in the training set. In our metro OD flow prediction problem, $x_i$ is the projected feature vector of the $i$th sample, as shown in Figure 4(a), and $y_i \in \{0, 1\}$ is the binary label of the $i$th sample. $y_i = 1$ implies that the prediction of Adaboost is closer to the actual observation than that of kNN; $y_i = 0$ implies the contrary. In this way, once we have applied the trained Adaboost and kNN to the samples in *Training Set 2*, the labels can be collected by comparing the predictions with the ground truth. Consequently, we can adopt a supervised classification method, random forest, to learn the relationship between samples and their labels. Random forest is an ensemble tree model introduced by Leo Breiman [22] that consists of multiple classification trees. The basic tree models are trained on data randomly selected from the original training data set, namely, by bootstrapping. Usually, the predicted label of a testing sample is the label that appears most frequently among the outputs of the basic tree models.
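
The label collection step can be sketched as follows. The encoding (1 when Adaboost is closer, 0 when kNN is closer) and the handling of ties in favor of Adaboost are assumptions of this illustration, as is the function name.

```python
def selection_labels(y_true, pred_ada, pred_knn):
    """Binary labels marking which basic model is closer to the observation.

    Returns 1 where the Adaboost prediction has the smaller (or equal)
    absolute error and 0 where the kNN prediction is closer.
    """
    return [
        1 if abs(a - t) <= abs(k - t) else 0
        for t, a, k in zip(y_true, pred_ada, pred_knn)
    ]


labels = selection_labels(
    y_true=[100, 50, 80],
    pred_ada=[98, 60, 70],
    pred_knn=[110, 52, 90],
)
```

These labels, together with the projected feature vectors, would form the training data of the classifier.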

For the probabilistic classification, we want to estimate the probability of each label, that is, $P(y = c \mid x)$. The simplest way is to count the proportion of trees that output class $c$ when observation $x$ is passed down the trees, namely, voting. In this work, we adopt an out-of-bag method. First, we denote the set of samples used for training a basic tree by $B$ and its complementary set within the training data by $\bar{B}$, a.k.a. the out-of-bag samples. In our work, the training data are generated by running the individual predictors on *Training Set 2*. By construction, for each basic tree model, $\bar{B}$ is not used for training. As a result, the terminal nodes are generally not pure when the samples in $\bar{B}$ are passed down the trained tree. Therefore, $\bar{B}$ gives information about the underlying class probabilities. Suppose that $n_t$ samples in $\bar{B}$ are assigned to the terminal node $t$ of a basic tree and these samples contain $C$ classes; we use the relative frequency of class $c$ as the classification probability of this terminal node. For the terminal node $t$,

$$P_t(c) = \frac{1}{n_t} \sum_{i=1}^{n_t} I_i(c),$$

where $I_i(c) = 1$ if the $i$th sample of $\bar{B}$ in node $t$ is labeled as class $c$ and 0 otherwise. Finally, each terminal node in the basic trees is associated with a set of classification probabilities, $\{P_t(1), \ldots, P_t(C)\}$, for a $C$-class classification problem.
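
The relative-frequency estimate for the terminal nodes of a single tree can be sketched as below. The sketch assumes that the terminal node reached by each out-of-bag sample is already known (for instance from a fitted tree), and the function name and data layout are hypothetical.

```python
from collections import Counter, defaultdict


def leaf_probabilities(leaf_ids, labels, classes=(0, 1)):
    """Class probabilities per terminal node from out-of-bag samples.

    leaf_ids[i] is the terminal node reached by the i-th out-of-bag sample and
    labels[i] is its class; each node gets the relative class frequencies of
    the samples assigned to it.
    """
    counts = defaultdict(Counter)
    for leaf, label in zip(leaf_ids, labels):
        counts[leaf][label] += 1
    return {
        leaf: {c: cnt[c] / sum(cnt.values()) for c in classes}
        for leaf, cnt in counts.items()
    }


probs = leaf_probabilities(leaf_ids=[0, 0, 0, 1], labels=[1, 1, 0, 0])
```

A terminal node that receives no out-of-bag sample has no entry in the result, so a practical implementation would need a fallback for such nodes.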

When a testing sample is passed down the trained random forest model, we assign the sample to a terminal node in each basic tree and obtain a set of classification probabilities. We then average the probabilities over all basic trees to infer the final classification probabilities. In the context of metro flow prediction, we use two basic prediction models and thus train a binary random forest model to estimate the probability of each prediction model. Once we have applied the probabilistic classification model to the *testing set*, the final prediction can be calculated using the following linear combination:

$$\hat{y} = p_{A}\,\hat{y}_{A} + p_{K}\,\hat{y}_{K},$$

where $\hat{y}_{A}$ is the predicted inflow or outflow count of Adaboost and $\hat{y}_{K}$ is that of kNN; $p_{A}$ is the probability that Adaboost performs better than kNN on the active sample; and $p_{K} = 1 - p_{A}$ is the probability that kNN performs better than Adaboost.
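
The averaging over trees and the final linear combination can be illustrated with a few lines of code. The function names are hypothetical, and the per-tree probabilities are assumed to be the probabilities that Adaboost is the better model at the terminal node reached in each tree.

```python
def average_tree_probability(tree_probs):
    """Average the per-tree probabilities that Adaboost is the better model."""
    return sum(tree_probs) / len(tree_probs)


def combine_forecasts(pred_ada, pred_knn, p_ada):
    """Probability-weighted combination of the two basic forecasts.

    The kNN weight is 1 - p_ada because the two probabilities sum to one.
    """
    return p_ada * pred_ada + (1.0 - p_ada) * pred_knn


p_ada = average_tree_probability([0.8, 0.6, 0.7])
final = combine_forecasts(pred_ada=120.0, pred_knn=100.0, p_ada=0.75)
```

Because the weights are nonnegative and sum to one, the combined value always lies between the two basic forecasts, and a probability of 0.5 reduces the scheme to simple averaging.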

#### 5. Results Analysis

We implemented several widely used prediction models, i.e., ARMA, neural network (NN), and support vector regression (SVR), using the same dataset to compare their predictive performances with the proposed model. It is noteworthy that the proposed combined model is applied to the OD flows first, and the prediction results are then aggregated to obtain the inflow and outflow, while the three reference methods are applied directly to the aggregated inflow and outflow data, as analyzed in Table 1 and shown in Figure 1, rather than to the OD flows. The reason is that we aim to validate not only the proposed combination approach but also the effect of OD flow aggregation. The orders of the ARMA model are calibrated for each time series (the inflow or outflow of every station) with the Bayesian information criterion. For NN, we configure two hidden layers with 5 and 2 nodes, respectively, and the model is solved with the limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) algorithm. For SVR, we choose the radial basis function (RBF) as the kernel, and the two key parameters of RBF are chosen by grid search, with and .

In addition, to compare the probabilistic combination with other combining methods, we implement an averaging combination model. Averaging is one of the most commonly used model combination methods [23]. The model is implemented in the same way as our probabilistic combining model, except that the basic models are combined by simple averaging.

To assess the performance of the proposed combined approach, we use two metrics, Mean Absolute Percentage Error (MAPE) and Variance of Absolute Percentage Error (VAPE). MAPE is used to measure the prediction accuracy, and VAPE is used to examine the stability of the proposed and reference models. The formulations of MAPE and VAPE are given as follows:

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i} \times 100\%, \qquad \mathrm{VAPE} = \operatorname{Var}\left(\frac{|y_i - \hat{y}_i|}{y_i}\right) \times 100\%,$$

where $y_i$ denotes the actual count of inflow or outflow at the metro station (here we only consider counts larger than 100, as policymakers are more concerned with large travel flows); $N$ is the total number of samples during the testing phase; and $\hat{y}_i$ is the predicted value produced by the prediction models.
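
The two metrics can be computed as in the sketch below. The function names are hypothetical, and two details are assumptions rather than facts from the text: the variance is the population variance of the absolute percentage errors, and both metrics are scaled by 100.

```python
def ape_values(actual, predicted, min_count=100):
    """Absolute percentage errors, restricted to actual counts above min_count."""
    return [abs(a - p) / a for a, p in zip(actual, predicted) if a > min_count]


def mape(actual, predicted, min_count=100):
    """Mean absolute percentage error in percent."""
    apes = ape_values(actual, predicted, min_count)
    return 100.0 * sum(apes) / len(apes)


def vape(actual, predicted, min_count=100):
    """Population variance of the absolute percentage errors, scaled by 100."""
    apes = ape_values(actual, predicted, min_count)
    mean = sum(apes) / len(apes)
    return 100.0 * sum((e - mean) ** 2 for e in apes) / len(apes)


actual = [200, 200, 50]
predicted = [180, 240, 10]   # the third pair is ignored because 50 <= 100
```

In the example, the two retained errors are 10% and 20%, so the mean error is 15%.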

Tables 2, 3, 6, and 7 show the MAPE of the proposed combined model and the reference models. Among the total of 76 forecasts (inflow and outflow at 19 stations on weekdays and weekends), the proposed combined model outperforms the others in 43. By comparing the MAPE of inflow and outflow, we observe that the outflow is generally more predictable than the inflow. This is because the outflow is tied to the historical inflow, which may already be known in the inputs of the prediction models.

The comparison of VAPE is shown in Tables 4, 5, 8, and 9. Among the total of 76 forecasts (inflow and outflow at 19 stations on weekdays and weekends), the proposed combined model outperforms the others in 47. The improvements are significant compared with ARMA, NN, SVR, and the two basic models. The combined model attains the lowest VAPE at most of the stations, for inflow or outflow, on weekdays or weekends. However, our probabilistic combined model and the averaging combined model yield close VAPEs in many cases. This demonstrates that the prediction ability of the combined methods is more stable than that of the other reference models. Another interesting finding is that, overall, for the same station, prediction on weekdays is better than on weekends, and prediction of outflow is better than that of inflow.

Figures 5 and 6 compare the prediction results visually with the actual data at selected stations during the operating hours of one weekday and one weekend day, respectively. In contrast with the reference models, our model shows a relatively stable performance both at peaks and under small perturbations. Among the four prediction models, ARMA yields more undesired peaks than the others. Although the combined model cannot follow every bump in the actual curve, it gives the most promising prediction among the four models. However, we observe that the combined model underestimates the evening peaks on weekdays at Stations 4, 12, and 16, as shown in Figures 5(a), 5(c), and 5(d). This shortcoming leaves room for further improvement. To understand more clearly how well the combined model predicts peak hours, we summarize the MAPEs for peak hours in Table 10. Peak hours are defined as 7–8 a.m. and 5–6 p.m.

Another aspect of this method that we discovered during this study is that, when combining two basic models, the combination may not be effective if one basic model performs significantly worse than the other, because the final prediction is a linear combination of the outputs of the two basic models. At the opposite extreme, when the two basic models have very similar characteristics, the combination may also fail to improve the performance, because the predicted results of the basic models tend to be the same. Therefore, the basic models need to be selected carefully: the two models should be expert in different scenarios and have balanced overall performance. In this case, the combination gives the best performance.

#### 6. Conclusion

Accurate short-term travel flow prediction in metro systems helps policymakers manage metro operations more efficiently and makes proactive control possible. With robust and accurate prediction, various control measures, for example, volume restriction and timetable rescheduling, can be prepared in advance. In this paper, we first analyzed the travel patterns of passengers. We then proposed a practical approach in which a probabilistic model combines two basic models to improve metro passenger flow forecasting performance. The effective combination benefits from the intermodel diversity, mitigates the risks of using an isolated model, and compensates for the drawbacks of the individual models. More importantly, the combination rules can be applied to most individual prediction models. In addition, our model predicts the inflow and outflow at each metro station in a disaggregated way by utilizing OD flows. That is, the OD flows are predicted in a spatiotemporal fashion and then aggregated to obtain the inflow and outflow at each station. Our assumption is that OD flows have more predictive power because they essentially reflect travel behaviors. Overall, the experimental results indicate that the proposed model outperforms the baseline models, ARMA, neural networks, and support vector regression, in accuracy and stability. The whole framework is generalizable to other complex transportation systems with origin-destination information and therefore offers important insights for future research.

In future research, we would like to improve the use of spatiotemporal information for flow prediction in more complex metro networks with a number of transfer stations. Another interesting direction is to expand the prediction problem to a multicriteria one. In past research, the relative performance of various combinations has generally been assessed under an accuracy criterion, expressed in terms of MSE, MAE, or MAPE. However, robustness deserves more attention, so the variance should also be considered as one of the criteria when developing models. Finally, building a method to predict short-term travel demand under disruptions is also part of our future plan.

#### Data Availability

The AFC dataset is not open for public access currently.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

The authors are thankful to the Zhengzhou Metro Authority for providing the AFC data used in this study. This work is partly funded by the China Scholarship Council. The authors appreciate Dr. Rao for his advice.