#### Abstract

Passenger flow forecasting plays an important role in urban rail transit (URT) management. However, complex spatial and temporal correlations make this task extremely challenging. Previous work has addressed it by capturing the spatiotemporal correlations in historical data. However, existing studies mostly limit the relationship between stations to geospatial adjacency and do not analyze station correlation from other perspectives. To fully capture the spatiotemporal correlations, we propose MDGCN, a deep learning model based on graph convolutional neural networks. First, a Multi-graph convolutional layer identifies the heterogeneity of stations under two spaces. Second, a Diff-graph convolutional layer identifies the changing trend of the heterogeneous features, and an attention mechanism unit combined with an LSTM unit adaptively fuses the multiple features and models the temporal correlation. We evaluate the model on two real-world datasets. Compared with the best baselines, MDGCN reduces the root-mean-square error by 1%–15% across different prediction intervals.

#### 1. Introduction

With the worsening of urban traffic congestion, URT has become one of the important solutions for alleviating it. As an important research topic in URT, short-term passenger flow forecasting can help decision-makers make timely emergency plans and strengthen security forces [1], and it provides an important reference for optimizing bus line networks.

Accurate, real-time passenger flow forecasting in URT is challenging because forecasting accuracy is affected by many factors, such as the development of the URT network, its topology, and spatiotemporal correlation. Short-term prediction models for URT passenger flow range from early mathematical-statistical models such as the Autoregressive Integrated Moving Average (ARIMA) [2], through traditional machine learning models such as Support Vector Machines (SVM) [3], to various deep learning models. However, mathematical-statistical models cannot mine the nonlinear features of traffic flow data, and traditional machine learning models rely on feature extraction, so their adaptability to data is unstable. Compared with both, deep learning models perform better at extracting spatiotemporal features and fall mainly into three categories: CNN-based, RNN-based, and GCN-based.

In the earliest stage, many scholars treated traffic data as image pixels and used a convolutional neural network (CNN) for local spatiotemporal feature extraction. For example, Luo et al. [4] modeled traffic flow as a two-dimensional spatiotemporal matrix, treated as an image of the spatiotemporal relationships of traffic flow, and used a CNN to predict traffic speed. Similarly, after converting passenger flow data into images, Zhang et al. [5] introduced residual connections into the CNN. However, applying a CNN to passenger flow prediction cannot accurately capture the spatiotemporal correlation. To address this problem, Zhang et al. [6] proposed a multitask learning prediction model called MTL-TCNN that considers the relevance of multiple regions, but CNN is only suitable for Euclidean space, whereas traffic data is typically non-Euclidean. Recurrent neural networks (RNN) and their variants perform well on time-series data because of their recurrent mechanism. For example, Fu et al. [7] were the first to use the gated recurrent unit (GRU) for traffic flow prediction. Ma et al. [8] and Yang et al. [9] used long short-term memory (LSTM) to capture long- and short-term trends in time series. Abduljabbar et al. [10] extended the unidirectional LSTM into a bidirectional LSTM, which further improved accuracy by processing the data in both the forward and backward directions. However, neither RNN nor CNN can adequately extract correlations on a non-Euclidean spatial structure, which limits their suitability for traffic flow data.

Researchers have gradually recognized the advantages of GCN in dealing with non-Euclidean data such as traffic flow. Many scholars apply GCN to traffic flow prediction together with other models that perform well in capturing temporal correlations. Some studies [11–13] integrated GCN- and RNN-based models for prediction. For example, Zhao et al. [14] and Lv et al. [15] combined GCN with GRU; Geng et al. [16] combined GCN with RNN; Zhang et al. [1], Chai et al. [17], and He et al. [18] combined GCN with LSTM to compensate for the weakness of GCN in capturing temporal dependence. Zhang et al. [19] proposed a model that includes an attention mechanism, GCN, and sequence-to-sequence learning for multistep prediction. Other studies proposed novel models: Wu et al. [20] proposed an adaptive adjacency matrix to change the structure of a fixed graph and used dilated convolution to capture temporal correlation. Park et al. [21] used the transformer model [22] and the self-attention mechanism with an encoder-decoder architecture. Hao et al. [23] constructed a sequence-to-sequence architecture with attention mechanisms. In addition, a hybrid prediction model [24] based on kernel ridge regression (KRR) and Gaussian process regression (GPR) was proposed to predict short-term passenger flow. However, these studies consider only the topological structure of the URT network, whereas exploring further correlations among stations could improve prediction performance. For example, Geng et al. [16], Chai et al. [17], Wang et al. [25], and Wang et al. [26] reconstructed the relationship graph from different perspectives. However, they ignored the correlation hidden in historical data and the changing trend, both of which could help capture the movement patterns of passenger flow.

The above review indicates that two problems remain in current research on URT passenger flow prediction: the correlation between stations and the extraction of changing trends. Specifically, the relationship between stations is not limited to adjacency in geographical space, yet station correlation has rarely been analyzed from other perspectives. In addition, an appropriate model is needed to mine the changing trends.

In this paper, we propose MDGCN (**M**ulti-**G**raph **D**ifferential **C**onvolutional **N**etwork). It can mine the correlations between stations in heterogeneous space for URT passenger flow prediction. The contributions of this paper are as follows:

(i) The correlation of boarding and alighting passenger flows between stations that are physically distant from each other is also important for passenger flow prediction. Therefore, to more accurately represent the connectivity between stations, we reconstruct the connectivity of stations by historical passenger flow data and station geographic information.

(ii) We design a layer of Multi-GCN that can jointly mine the correlation between stations in physical space and nonphysical space to gain the heterogeneous spatial correlation between stations.

(iii) We introduce the concept of difference and construct the difference feature extraction layer called Diff-GCN to extract the changing trends of heterogeneous spatial features.

(iv) We conducted experiments on two datasets. The experimental results show that the prediction error of the proposed model is reduced by 1%–15% compared with the optimal baseline.

The rest of the paper is organized as follows. Section 2 introduces the basic problem definition of passenger flow forecasting. Section 3 presents the model framework. Section 4 discusses the experimental results. Section 5 concludes the paper.

#### 2. Preliminary

In this section, we define some key concepts and give the problem definition of the studied content.

*Definition 1 (urban rail transit network).* We define the URT network as a graph $G = (V, E, A)$, where the set of stations $V = \{v_1, v_2, \ldots, v_N\}$ is used as the nodes of *G*, and $N$ is the number of stations. Figure 1 illustrates sample lines related to the URT network. The connection relationships between stations are used as the edges $E$ of *G*. We use the matrix $A \in \mathbb{R}^{N \times N}$ to record the adjacency relationships between stations, and each value represents the strength of the association between two stations. We reconstruct the spatial relationships of the URT network from two perspectives (Section 3.1.1).

*Definition 2 (node characteristics).* We take the passenger flow of a station as the characteristic of a network node; i.e., given a station $v_i$ and a period $t$, the passenger flow of the station at that period is denoted as $x_t^i$, and the passenger flow of all stations at period $t$ is denoted as $X_t = (x_t^1, x_t^2, \ldots, x_t^N)$; then the temporal characteristics of the set of stations in the past $T$ periods can be written as $X = (X_{t-T+1}, \ldots, X_{t-1}, X_t)$.

*Problem Definition.* The problem can be defined as learning a mapping function $f$ that predicts the passenger flow $X_{t+1}$ at the next moment, given the URT network $G$ and the historical passenger flow data $X$. The process can be written as

$$X_{t+1} = f(G; X). \tag{1}$$
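To make the input and output of the mapping function concrete, the following minimal sketch slices a flow series into (history, next period) training pairs. The function name `make_samples`, the window length, and the toy numbers are illustrative assumptions, not part of the original method description.

```python
def make_samples(flows, history):
    """Build supervised pairs from a passenger-flow series.

    flows   : list of per-period vectors (one value per station), oldest first
    history : number of past periods used as input (an assumed hyperparameter)
    Returns (inputs, targets); each target is the period right after its window.
    """
    inputs, targets = [], []
    for end in range(history, len(flows)):
        inputs.append(flows[end - history:end])  # the past `history` periods
        targets.append(flows[end])               # the next period to predict
    return inputs, targets


# Toy data (made up): 5 periods, 2 stations.
flows = [[10, 4], [12, 5], [15, 7], [11, 6], [9, 3]]
inputs, targets = make_samples(flows, history=3)
```

Each element of `inputs` plays the role of the historical flow data, and the matching element of `targets` is the next-period flow that the learned mapping should reproduce.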

#### 3. Study Methodology

This section introduces the framework of our model, research methods, and pseudocode.

##### 3.1. Modeling Framework

This model (Figure 2) can be divided into four main parts: the reconstruction of the station relationships, the Multi-GCN layer, the Diff-GCN layer, and the output layer.

(i) The first part reconstructs the relationships between stations in the physical space and the nonphysical space and combines them with the historical passenger flow data as the input of the subsequent model.

(ii) The second part is the Multi-GCN (**M**ultiple-**G**raph **C**onvolutional **N**etwork) layer, which jointly models the correlation between stations in both the physical space and the nonphysical space; i.e., the Multi-GCN layer is constructed to capture the heterogeneous spatial correlation between stations.

(iii) The third part is the Diff-GCN (**D**ifferential **G**raph **C**onvolutional **N**etwork) layer, a differential feature extraction module that extracts the changing trends of the heterogeneous spatial features. In addition, an attention mechanism adaptively fuses the features of the Multi-GCN layer and the Diff-GCN layer.

(iv) The fourth part consists of the LSTM unit for global temporal correlation extraction and the fully connected layer that outputs the prediction result.

###### 3.1.1. Reconstruction of the Correlation of Stations

The input data is key to whether a neural network can be trained well. The input to a GCN consists of two parts: the adjacency matrix and the node features. Whether the adjacency matrix correctly encodes the relationships among the network nodes directly affects the performance of the model. Therefore, we interpret the URT network from two perspectives. Stations in close geographical proximity may have similar passenger flow patterns. Given this, we construct the topology of the physical space based on the physical adjacency and the distance between station *i* and station *j*.

Firstly, we obtain the adjacency matrix (in (2)) between stations using the physical adjacency relationship of stations. Then, we build the distance matrix (in (3)) based on the inverse of the distance between stations. Finally, we reconstruct the topology matrix using the Hadamard product formula [1] (in (4)).

However, the flow pattern can associate two stations that are not spatially adjacent, and the strength of the association is determined by the passenger flow between them. This flow pattern directly reflects how passengers move between stations. Therefore, we describe the movement pattern and movement intensity between stations using historical passenger flow data (in (5)) to construct the associated graph, whose entries are derived from the passenger flow from station *i* to station *j*.
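The two reconstructions can be sketched as follows. This is a minimal illustration under stated assumptions: the physical topology is taken as the element-wise (Hadamard) product of a 0/1 adjacency matrix and an inverse-distance matrix, as described above, while the row normalization used for the flow-based graph is an assumption made here for illustration and is not necessarily the form used in (5). All names and numbers are hypothetical.

```python
def physical_topology(adjacency, distance):
    """Hadamard product of the 0/1 adjacency matrix and the inverse-distance matrix."""
    n = len(adjacency)
    topo = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and distance[i][j] > 0:
                topo[i][j] = adjacency[i][j] * (1.0 / distance[i][j])
    return topo


def flow_graph(od_counts):
    """Row-normalise origin-destination counts (ASSUMED normalisation).

    od_counts[i][j] is the historical passenger flow from station i to station j.
    """
    graph = []
    for row in od_counts:
        total = sum(row)
        graph.append([count / total if total else 0.0 for count in row])
    return graph


# Toy 3-station line: 0 - 1 - 2 (distances in km, made up).
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
distance = [[0, 2, 5], [2, 0, 4], [5, 4, 0]]
od_counts = [[0, 30, 10], [20, 0, 20], [0, 0, 0]]

topo = physical_topology(adjacency, distance)
flow = flow_graph(od_counts)
```

In this toy example, stations 0 and 2 are not adjacent, so their physical weight is zero, yet the flow-based graph still links them because passengers travel between them.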

###### 3.1.2. Multi-GCN Network

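Two different network topologies are obtained from the spatial relationship reconstruction of stations. We propose the Multi-GCN layer to extract the heterogeneous spatial correlation with GCN. The propagation rule of GCN can be expressed as

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \tag{6}$$

where $H^{(l)}$ is the input of the $l$-th layer and its initial value is $H^{(0)} = X$; $\tilde{A} = A + I$ is the adjacency matrix of the graph with added self-connections; $\tilde{D}$ is the degree matrix of $\tilde{A}$; $\sigma(\cdot)$ is the sigmoid function, which makes the model nonlinear; and $W^{(l)}$ is the weight matrix of the current layer.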

In particular, GCN can model higher-order neighborhood interactions by stacking multiple layers. As shown in Figure 3, taking the red station as an example, the first GCN layer captures the influence of its two neighboring stations. As the network deepens, the correlation with all stations can be captured. The spatial features of the URT network, i.e., the topology based on the physical space and the station association matrix in the passenger space, are extracted by the stacked GCN layers. We therefore feed both matrices into the Multi-GCN layer to obtain the outputs of the two graph convolutional branches (in (7) and (8)), respectively.
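The propagation step and the effect of stacking can be sketched in plain Python. This is a minimal illustration of the standard symmetric-normalized GCN layer with a sigmoid activation, as described above; the helper names, the identity-like weights, and the toy graph are assumptions, and a real implementation would use learned weights in a deep learning framework.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Add self-connections, then scale each entry by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gcn_layer(a_hat, h, w):
    """One propagation step: sigmoid(a_hat @ h @ w)."""
    z = matmul(matmul(a_hat, h), w)
    return [[sigmoid(v) for v in row] for row in z]


# Toy 3-station line 0 - 1 - 2 with one feature per station (made-up values).
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_hat = normalized_adjacency(adjacency)
features = [[1.0], [0.0], [0.0]]
weights = [[1.0]]  # placeholder for a learned weight matrix

layer1 = gcn_layer(a_hat, features, weights)  # station 2 sees nothing of station 0 yet
layer2 = gcn_layer(a_hat, layer1, weights)    # a second layer widens the neighborhood
```

After one layer, station 2 receives no signal from station 0 because they are two hops apart; stacking a second layer lets information from station 0 reach it through station 1.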

###### 3.1.3. Diff-GCN Network

The difference reflects the variation between two discrete quantities. We introduce differencing to construct the Diff-GCN layer, which extracts the changing trend of the heterogeneous spatial features. As shown in Figure 4, the passenger flow data at two periods and the adjacency matrix are fed into Multi-GCN; the feature matrix of each period is obtained from Multi-GCN; the difference between the two feature matrices is calculated; and the final difference factor (in (9)) is obtained by applying GCN to the difference for spatial feature extraction.

Here, we obtain two difference factors (in (10) and (11)), one for each of the two spatial relationships.
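The differencing idea can be sketched as below. This is a simplified stand-in, not the layer itself: it assumes the two feature matrices come from two consecutive periods, and it replaces the GCN applied to the difference with a single neighborhood-aggregation step without learned weights or activation. All names and values are hypothetical.

```python
def feature_difference(h_now, h_prev):
    """Element-wise difference between the feature matrices of two periods."""
    return [[a - b for a, b in zip(row_now, row_prev)]
            for row_now, row_prev in zip(h_now, h_prev)]


def aggregate(a_hat, h):
    """One propagation step a_hat @ h (weights and activation omitted for brevity)."""
    n_features = len(h[0])
    return [[sum(a_hat[i][k] * h[k][f] for k in range(len(h))) for f in range(n_features)]
            for i in range(len(a_hat))]


# Toy example: 2 stations, 2 features each (made-up values).
a_hat = [[0.5, 0.5], [0.5, 0.5]]    # an already normalized adjacency matrix
h_prev = [[1.0, 2.0], [3.0, 4.0]]   # features at the earlier period
h_now = [[2.0, 2.0], [3.0, 6.0]]    # features at the later period

diff = feature_difference(h_now, h_prev)
trend = aggregate(a_hat, diff)
```

Positive entries in `trend` indicate features that are rising in a station's neighborhood, which is the kind of changing-trend signal the Diff-GCN layer is meant to expose.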

###### 3.1.4. Attention

For the two features obtained from the Multi-GCN layer and the two obtained from the Diff-GCN layer, we use the attention mechanism to learn the importance of these four features (in (12)). First, we calculate the score of each feature's hidden state vector using a nonlinear transformation and one shared attention vector $q$ (in (13)). Here, $W$ is the weight matrix and $b$ is the bias vector. The same operation is applied to each of the other embeddings to obtain their scores. Second, we perform a normalization operation (in (14)) to obtain the attention coefficient of each embedding vector. Finally, a linear combination weighted by these coefficients merges the four components into the fused feature (in (15)).
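The three steps (scoring, normalization, weighted combination) can be sketched for a single station as follows. This is a minimal illustration that assumes a tanh nonlinearity for the transformation and a softmax for the normalization; neither choice is confirmed by the text above, and the weights shown are placeholders rather than learned parameters.

```python
import math


def score(z, w, b, q):
    """Score one embedding: q . tanh(W z + b) (tanh is an ASSUMED nonlinearity)."""
    hidden = [math.tanh(sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + b_i)
              for row, b_i in zip(w, b)]
    return sum(q_i * h_i for q_i, h_i in zip(q, hidden))


def softmax(values):
    """Normalize scores into coefficients that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(embeddings, w, b, q):
    """Attention-weighted sum of the embeddings of one station."""
    alphas = softmax([score(z, w, b, q) for z in embeddings])
    dim = len(embeddings[0])
    fused = [sum(a * z[d] for a, z in zip(alphas, embeddings)) for d in range(dim)]
    return alphas, fused


# Four made-up 2-dimensional embeddings (two from Multi-GCN, two from Diff-GCN).
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
w = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weight matrix
b = [0.0, 0.0]                # placeholder bias vector
q = [1.0, 1.0]                # placeholder shared attention vector

alphas, fused = fuse(embeddings, w, b, q)
```

Because the attention vector is shared across the four embeddings, their scores are directly comparable, and the coefficients express the relative importance of each feature for the station.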

###### 3.1.5. Prediction

After obtaining the fused feature by attention, we use the LSTM shown in Figure 5 to capture temporal features. LSTM is one of the most common neural network structures in time series forecasting. It uses gates to control the selective passage of information. As can be seen in (16) and Figure 5, it has three gates: the Forget Gate, the Input Gate, and the Output Gate. The Forget Gate decides which information from the previous cell state is discarded, the Input Gate decides what new information is stored in the memory, and the Output Gate controls how much of the cell state is exposed as the output. The processing of each gate is given in (16).

As can be seen in Figure 5, after the fused feature is fed into the LSTM, the hidden output is obtained. We then obtain the predicted output through a fully connected layer whose number of neurons equals the number of stations.
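The gate logic described above can be sketched as one step of a standard LSTM cell. For brevity this illustration uses a scalar input and a scalar hidden state, and the parameter layout (a dictionary of `(w_x, w_h, bias)` triples) is an assumption made here; the model itself uses the LSTM layer of its deep learning framework.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard LSTM cell with scalar input and scalar state.

    params maps a gate name to a (w_x, w_h, bias) triple (hypothetical layout).
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))         # how much old memory to keep
    write = sigmoid(pre_activation("input"))           # how much new content to store
    expose = sigmoid(pre_activation("output"))         # how much memory to output
    candidate = math.tanh(pre_activation("candidate"))  # proposed new content

    c = forget * c_prev + write * candidate  # updated cell state
    h = expose * math.tanh(c)                # updated hidden state
    return h, c


# With all-zero parameters every gate is 0.5 and the candidate is 0.
zero_params = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
```

In the zero-parameter example, the cell keeps half of its previous memory and adds nothing new, which shows how the Forget Gate and Input Gate jointly determine the updated cell state.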

##### 3.2. Pseudocode

The pseudocode of MDGCN is shown in Algorithm 1. To evaluate the complexity of this deep learning algorithm, we use two indicators, FLOPs and Params. The time complexity is determined by the number of operations of the model, i.e., FLOPs (0.0126 G), and the space complexity is determined by the number of parameters, i.e., Params (2,147,895).

#### 4. Results and Discussion

##### 4.1. Experimental Settings

The model is implemented in TensorFlow and Keras. All experiments were run on a graphics processing unit (GPU) platform with an NVIDIA GeForce GTX 1070 Ti graphics card and 8 GB of GPU memory. We train the model using the Adam optimizer with a learning rate of 0.0001 for 150 epochs. Two evaluation metrics are adopted, the root mean squared error (RMSE) and the mean absolute error (MAE), which are defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}, \tag{17}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|y_i - \hat{y}_i\right|, \tag{18}$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
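The two metrics follow their standard definitions and can be computed as in this short sketch; the function names and sample values are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error over paired true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error over paired true and predicted values."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Made-up passenger counts for four samples.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
```

RMSE penalizes large errors more heavily than MAE because each error is squared before averaging, so reporting both gives a fuller picture of the error distribution.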

##### 4.2. Dataset

We used two real-world datasets from China to evaluate our model, i.e., data collected from the Automatic Fare Collection (AFC) systems of the Xiamen and Shanghai subways.

The SH dataset was collected from 1 April to 30 April 2015. It includes records from 14 lines and 313 stations. An example of this dataset is shown in Table 1. Each record contains the card number (ID), travel date (DATE), card-swiping time (TIME), the name of the station (STATION_NAME), the line number (LINE_ID), and TYPE. Passenger flow is calculated by formula (19), where $x_t^i$ represents the passenger flow of station $i$ in period $t$. Here, $i$ is determined by “STATION_NAME” and $t$ is determined by “DATE” and “TIME.”

The XM dataset was collected from 1 December to 31 December 2019. It includes records from two lines and 52 stations. An example of this dataset is shown in Table 2. We use the same data preprocessing procedure to obtain the passenger flow.
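The preprocessing step that turns card-swiping records into per-station, per-period passenger flow can be sketched as a counting operation. This is a minimal illustration under assumptions: the `HH:MM:SS` time format, the 15-minute interval, the dictionary record layout, and the sample rows are all hypothetical, and the sketch simply counts records rather than reproducing formula (19) exactly.

```python
from collections import Counter


def passenger_flow(records, interval_minutes):
    """Count card swipes per (station, date, time slot).

    records          : iterable of dicts with DATE, TIME ("HH:MM:SS", assumed), STATION_NAME
    interval_minutes : length of one period in minutes (an assumed setting)
    """
    counts = Counter()
    for rec in records:
        hour, minute, _second = (int(part) for part in rec["TIME"].split(":"))
        slot = (hour * 60 + minute) // interval_minutes  # index of the period within the day
        counts[(rec["STATION_NAME"], rec["DATE"], slot)] += 1
    return counts


# Made-up records with placeholder station names.
records = [
    {"ID": "c1", "DATE": "2015-04-01", "TIME": "07:02:10", "STATION_NAME": "Station A"},
    {"ID": "c2", "DATE": "2015-04-01", "TIME": "07:14:59", "STATION_NAME": "Station A"},
    {"ID": "c3", "DATE": "2015-04-01", "TIME": "07:15:00", "STATION_NAME": "Station A"},
    {"ID": "c4", "DATE": "2015-04-01", "TIME": "07:03:00", "STATION_NAME": "Station B"},
]
flow_counts = passenger_flow(records, interval_minutes=15)
```

The station index comes from STATION_NAME and the period index from DATE and TIME, matching the description of how the station and the period are determined.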

##### 4.3. Experiment Results

In this section, we conduct four groups of experiments to demonstrate the validity of the proposed model and the soundness of each module.

###### 4.3.1. Evaluation of the Number of GCN Layers

To determine the optimal number of stacked GCN layers, we conduct experiments with different numbers of stacked layers and obtain the corresponding predictions; the results are shown in Figure 6. The model performed best with three stacked layers, so we set the number of stacks to 3 in the experiments.

###### 4.3.2. The Results in Baselines

To evaluate the competitive performance of the proposed method (i.e., MDGCN), we compared it with three types of models: basic mathematical-statistical models, shallow machine learning models, and deep learning models. All baselines are tuned to give their best performance.

(i) HA [27]: the historical average model, which uses the average of several previous periods as the prediction. In this paper, we use the average of the last ten periods as the forecast value.

(ii) ARIMA [28]: we used SPSS software to make the predictions and select the best-fitting model.

(iii) SVR: we used support vector regression (SVR) to predict passenger flow. Here, the kernel function is rbf and epsilon is 0.2.

(iv) GBDT: we used a gradient boosting decision tree (GBDT) to predict passenger flow. Here, n_estimators is 100, min_samples_split is 2, and learning_rate is 0.1.

(v) LSTM: Ma et al. [8] used this method to predict traffic flow. In this paper, we use the same method to predict passenger flow. We used a stacked LSTM with 128 and 276 neurons in the first and second layers, respectively.

(vi) GCN + LSTM: Chai et al. [17] used this method to predict bike flow. Likewise, we use the same method to predict passenger flow. We used a 2-layer GCN and one LSTM layer with 64 neurons.

(vii) ResLSTM: the method in [29], which combines ResNet and GCN with LSTM to predict traffic flow. We use the same settings as the original, except that branch 4 is omitted.

(viii) Conv-GCN: the method in [1], which combines 3D-Conv with GCN to predict traffic flow. We use the same settings as the original.
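As a concrete reference point, the HA baseline described in (i) can be sketched in a few lines. The function name and the toy series are illustrative; the window of ten periods follows the setting stated above.

```python
def historical_average(series, window=10):
    """Forecast the next period as the mean of the last `window` observations."""
    recent = series[-window:]  # uses fewer values if the series is shorter than the window
    return sum(recent) / len(recent)


# Made-up passenger counts for 20 consecutive periods at one station.
station_series = list(range(1, 21))
forecast = historical_average(station_series)
```

Because HA ignores both the spatial structure and any trend within the window, it serves as a lower bound that the learned models are expected to beat.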

The results are shown in Table 3, from which the following trends can be seen. First, the deep learning models (LSTM, Conv-GCN, ResLSTM, GCN + LSTM, and MDGCN) perform better than the traditional mathematical-statistical models (HA and ARIMA) and the shallow machine learning models (SVR and GBDT). This result shows that deep learning methods can better capture nonlinear spatiotemporal correlations. Second, the recurrent neural network variants (LSTM and ResLSTM) outperform the machine learning methods (SVR and GBDT), which indicates that these time-series models capture temporal correlation more deeply. In turn, the spatial deep learning models (Conv-GCN, GCN + LSTM, and MDGCN) outperform the LSTM variants, which demonstrates the value of spatial correlation in passenger flow prediction. Third, almost all models perform much better on XM than on SH. The most likely reason is the complexity of the dataset: the SH dataset has more lines and stations than XM, so the traffic patterns in Shanghai are much more complex than those in Xiamen. Finally, MDGCN has the best performance on both datasets. Compared with the optimal baseline, it reduces the error by 1% to 15%. This shows that capturing various correlations (both spatial and temporal) and deeply mining the historical passenger flow lead to better performance.

###### 4.3.3. Evaluation of Spatial Relationship Reconstruction of MDGCN

In the third experiment, we evaluate the effectiveness of the spatial relationship reconstruction. Specifically, we compare the full model with three variants of MDGCN that differ in the adjacency matrix fed into the Multi-GCN layer:

(i) MDGCN-A: it includes only the most basic adjacency matrix.

(ii) Physical-topology variant: it includes only the physical topology matrix, i.e., the adjacency matrix recoded with the inverse distance.

(iii) Passenger-space variant: it includes only the topological matrix based on the passenger space.

(iv) MDGCN: it combines the topological matrices of both the physical space and the passenger space.

The experiment results are shown in Table 4. On the one hand, MDGCN has the best overall performance and MDGCN-A has the worst, which demonstrates the importance and effectiveness of the recoding. On the other hand, the passenger-space variant outperforms MDGCN-A and the physical-topology variant, and it reaches almost the same level of accuracy as MDGCN. This suggests that the topology of the passenger space describes the correlations between stations more accurately.

###### 4.3.4. Evaluation of the Diff-GCN Layer of MDGCN

In the last experiment, we evaluate the contribution of the Diff-GCN layer. We compare MDGCN with a variant called MGCN, which does not include the Diff-GCN module. The experiment results are shown in Figure 7.

[Figure 7: panels (a) and (b)]

The results on the two datasets are consistent with each other. The Diff-GCN layer reduces the prediction error because it extracts the trend information in the passenger flow.

#### 5. Conclusions

In this paper, we propose a deep learning model called MDGCN to predict passenger flow. We construct the Multi-GCN layer to extract heterogeneous correlations under two spatial relationships, and we use the Diff-GCN layer to extract the changing trend of the heterogeneous features so as to fully capture the spatiotemporal correlations. The model is validated on real-world datasets and achieves better results than the baselines. However, there is still room for improvement, for example, how to construct dynamic connectivity of stations using historical passenger flow data and how to account for the impact of factors such as weather and unexpected accidents on the prediction results. Future work will focus on addressing these issues.

#### Data Availability

The original data set cannot be provided due to the confidentiality of the data.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This work was supported by the General Project of Natural Science Foundation of Fujian Province of China (2017J01111), the Science and Technology Plan Guiding Project of Fujian Province (2020H0016), the National Natural Science Foundation of China (61802133), and the Fundamental Research Funds for the Central Universities (ZQN-910).