Research Article  Open Access
Estimation Method of Path-Selecting Proportion for Urban Rail Transit Based on AFC Data
Abstract
With the successful application of automatic fare collection (AFC) systems in urban rail transit (URT), passengers' travel time information is recorded, which makes it possible to analyze passengers' path selection with AFC data. In this paper, the distribution characteristics of the components of travel time are analyzed, and an estimation method of path-selecting proportion is proposed. The method uses the travel time data of single-path ODs from the AFC system to estimate the distribution parameters of the components of travel time, mainly entry walking time (ewt), exit walking time (exwt), and transfer walking time (twt). Then, for multipath ODs, the distribution of each path's travel time can be calculated once the distributions of its components are known. After that, each path's path-selecting proportion can be estimated. Finally, simulation experiments are designed to verify the estimation method, and the results show that the error rate is less than 2%. Compared with traditional flow assignment models, the estimation method can significantly reduce the cost of manual surveys and provides a new way to calculate the path-selecting proportion for URT.
1. Introduction
As the basis of network flow assignment, the path-selecting proportion is directly related to the operation and management of urban rail transit (URT), including the calculation of operation indicators, train planning, and fare clearing. Many studies have addressed flow assignment for URT under network operation, most of them proposing multipath models based on path utility.
Nguyen et al. [1] developed a graph-theoretic framework for the passenger assignment problem, which encompassed simultaneously the departure time and the route choice dimensions. Also, a passenger equilibrium flow model was defined and a mathematical formulation was suggested. The research results can be used to solve the passenger flow distribution problem of URT network.
Poon et al. [2] also studied the dynamic traffic assignment model for congested networks and used time-increment simulation to calculate the passenger flow on the network. The model assumed that the vehicles ran in full accordance with the schedule and that the passengers had full predictive information about the present and future network conditions, so the model can be used to simulate the performance of an existing transit system operating with preannounced schedules or to evaluate the effects of changes in schedules, lines, or passenger demand on system performance.
Xu et al. [3] established a multipath assignment model of URT network, using travel time as paths’ impedance. The basic characteristics of rail network and passenger travel behavior are fully considered in the model. With its high accuracy and strong practicality, the model has been successfully applied in Beijing Subway network. The survey results show that its error rate of section flow is less than 5%.
Si et al. [4] constructed the generalized cost function of URT network, considering the major factors (including the travel time and times of transfer) influencing the passenger flow assignment, and then proposed a mathematical optimization model of passenger flow assignment based on the stochastic user equilibrium principle. The results of numerical example using Beijing URT network data showed that the model was feasible and effective.
The basic idea of the models above can be summarized as follows: (1) determine the impedance of each path, which can be time cost or mileage; (2) design the utility function based on the path impedance; (3) calculate the flow assignment proportion of each path; (4) distribute each OD's total daily passenger flow to the paths between the origin station and the destination station (OD).
Such models can basically guarantee the accuracy of flow assignment. However, the parameters of these models, including the entry and exit walking time at stations and the transfer time, are calibrated by manual survey, which is time-consuming and costly. In addition, when the network structure or operation organization changes, the parameters need to be recalibrated to remain accurate. Therefore, it is necessary to study a new method of network flow assignment.
In recent years, the AFC system has been widely used in URT; it accurately records passengers' entry time at the origin station and exit time at the destination station. After many years of application, the AFC system has accumulated vast amounts of passenger travel information. However, such information has rarely been used to study travel and path-selecting behavior.
Chapleau et al. [5] and Rahbee [6] used AFC data to predict OD matrices and network flow distribution, but only the entry records were used, which could not support the analysis of passengers' path-selecting behavior.
Lee and Hickman [7] analyzed a large amount of AFC data and found that activity and travel patterns differ significantly across farecard types in aspects such as travel period, travel time, and travel region (urban areas to the suburbs, urban areas to urban areas, etc.). The research results can be used to grasp the characteristics of different cardholders.
Kusakabe et al. [8] developed a methodology and an algorithm for estimating which train is boarded by each smart card holder based on long-term transaction data. The proposed method made a number of assumptions and distributed the uncertain smart card data evenly over the possible trains, which needs further research.
Zhou and Xu [9] considered the process of passengers travelling together with running trains and set up a model for calculating each passenger's path from AFC data. The proposed method could estimate each passenger's travel route and train choice. However, the model needed a lot of manual investigation work and still has to be verified further by mathematical statistics.
Sun and Xu [10] analyzed travel time reliability and estimated passenger route choice behavior from AFC data. The proposed model could be used to calibrate the parameters of traditional passenger flow distribution models. However, the model required the walking time of all stations to be obtained by manual survey and did not mine the AFC data further, for example, by ticket type and travel period.
Based on the above background, this paper analyzes passengers' travel time data recorded by the AFC system with the theory of mathematical statistics and proposes a new estimation method of path-selecting proportion. This method provides a new approach to network flow assignment for URT.
2. Analysis of Travel Time
Travel time of passengers in a URT network mainly consists of the following six parts: (1) entry walking time (the time of passengers walking from the AFC gate to the platform in the origin station, denoted by ewt); (2) entry platform waiting time (the time of passengers waiting for the train on the platform in the origin station, denoted by epwt); (3) on-train time (the time of passengers travelling on the train, denoted by ott); (4) transfer walking time (the time of passengers walking from the arrival platform to the departure platform in the transfer station, denoted by twt); (5) transfer platform waiting time (the time of passengers waiting for the train on the platform in the transfer station, denoted by tpwt); (6) exit walking time (the time of passengers walking from the platform to the AFC gate in the destination station, denoted by exwt).
twt and tpwt only occur in the transfer station. The components of travel time are shown in Figure 1.
Many random factors can affect passengers' travel time, so this paper makes the following assumptions: (1) passengers arrive at the station randomly, dispersedly, and stably; (2) all passengers can board the first train after arriving at the platform; (3) the trains run strictly according to the plan at a certain speed level and no abnormal condition occurs.
2.1. Walking Time (ewt, twt, and exwt)
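For one station, suppose the walking distances of different passengers are the same; then the distribution of walking time depends only on the walking speed. Most research results [11, 12] show that walking speed can be considered to follow a normal distribution. Thus, walking time can be considered to follow a normal distribution as well; that is,

$$ewt \sim N(\mu_{ewt}, \sigma_{ewt}^2), \quad twt \sim N(\mu_{twt}, \sigma_{twt}^2), \quad exwt \sim N(\mu_{exwt}, \sigma_{exwt}^2) \tag{1}$$

where $\mu_{ewt}$, $\mu_{twt}$, and $\mu_{exwt}$ are the means and $\sigma_{ewt}^2$, $\sigma_{twt}^2$, and $\sigma_{exwt}^2$ are the variances of ewt, twt, and exwt.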
2.2. Entry Platform Waiting Time (epwt)
epwt is a random variable between 0 and the interval of the trains. Based on the assumption that passengers arrive at the origin station randomly and dispersedly, they arrive at the platform randomly and dispersedly too. Thus, epwt can be considered to follow a uniform distribution; that is,

$$epwt \sim U(0, I) \tag{2}$$

where $I$ is the interval of the trains.
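As a quick worked check of the uniform model (a sketch only, writing $I$ for the train interval, a symbol assumed here), the first two moments follow directly from the density $1/I$ on $[0, I]$:

```latex
\begin{aligned}
E(epwt) &= \int_0^I \frac{t}{I}\,dt = \frac{I}{2}, \\
D(epwt) &= \int_0^I \frac{t^2}{I}\,dt - \left(\frac{I}{2}\right)^2
         = \frac{I^2}{3} - \frac{I^2}{4} = \frac{I^2}{12}.
\end{aligned}
```

For an interval of 180 s, this gives a mean waiting time of 90 s and a variance of 2700 s².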
The distribution can be verified by the following simulation experiment. The experimental environment is as follows: the interval of the trains at Station is 180 s, between 8:00:00 and 10:00:00; .
The experimental procedure is as follows.
Generate 20,000 passengers randomly and evenly between 8:00:00 and 9:50:00; for passenger $i$, let $t_i$ be the time of arriving at the station and let $ewt_i$ be the entry walking time; then the time of arriving at the platform is $t_i + ewt_i$.
Search for the first train arriving at the station after $t_i + ewt_i$. Let $d_i$ be the departure time of that train at the station; then $epwt_i = d_i - (t_i + ewt_i)$.
MATLAB is used to run the simulation experiment. Assuming $epwt \sim U(0, 180)$, SPSS is used to perform the Kolmogorov-Smirnov test (K-S test for short) on the simulated waiting time data, with the significance level $\alpha$ taken as 0.05. The frequency statistics are shown in Figure 2. The resulting significance is 0.626 (> $\alpha$), so the assumption is tenable.
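The experiment above can be sketched in a few lines of Python instead of MATLAB and SPSS. This is a minimal illustration under stated assumptions, not the authors' code: the walking-time parameters (mean 60 s, standard deviation 15 s) are hypothetical, trains are assumed to depart at exact multiples of the 180 s interval, and the K-S statistic is computed by hand without a p-value.

```python
import math
import random


def simulate_epwt(n=20000, headway=180.0, horizon=6600.0,
                  ewt_mean=60.0, ewt_sd=15.0, seed=1):
    """Simulate entry platform waiting times (seconds).

    horizon=6600 s corresponds to the 8:00:00-9:50:00 entry window.
    ewt_mean and ewt_sd are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    waits = []
    for _ in range(n):
        gate_time = rng.uniform(0.0, horizon)            # arrival at the station
        ewt = max(0.0, rng.gauss(ewt_mean, ewt_sd))      # entry walking time
        platform_time = gate_time + ewt                  # arrival at the platform
        # Assumption: trains depart at 0, headway, 2*headway, ...
        next_departure = math.ceil(platform_time / headway) * headway
        waits.append(next_departure - platform_time)
    return waits


def ks_statistic_uniform(sample, upper):
    """One-sample K-S statistic against the uniform distribution U(0, upper)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = min(max(x / upper, 0.0), 1.0)
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


if __name__ == "__main__":
    waits = simulate_epwt()
    print("mean wait:", sum(waits) / len(waits))
    print("K-S statistic:", ks_statistic_uniform(waits, 180.0))
```

A small K-S statistic and a sample mean near half the interval are what the uniform assumption predicts; a formal test would compare the statistic with the critical value for the chosen significance level.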
2.3. Transfer Platform Waiting Time (tpwt)
Like epwt, tpwt is a random variable between 0 and the interval of the trains. However, their distributions are different, because passengers arrive randomly and dispersedly at the origin station but in concentrated groups at the transfer station.
Generally speaking, when the interval of the transfer line is large, the faster passengers walk, the longer they may wait; when the interval of the transfer line is small, the faster passengers walk, the shorter they may wait. It can be seen that there is a strong correlation between twt and tpwt. Thus, this paper analyzes the distribution of their sum, twt + tpwt, referred to as the transfer time hereinafter.
The factors affecting distribution of are as follows: the interval of the beforetransfer line (); the interval of the aftertransfer line (); the coordination time between the beforetransfer line and the aftertransfer line (), as shown in Figure 3; the distribution of .
Obviously, is related to and , and its value has certain periodicity. To facilitate the presentation, let be the first coordination time, and it can be calculated by the following formula:where is the minimum coordination time and is the maximum one.
Let be the least common multiple of , . Then, the calculation formula of is designed as follows:
Suppose that the passengers arrive at the transfer station with the coordination time . Then, if the transfer walking time of one passenger is smaller than , his/her is ; otherwise, the passenger will wait for one or some more intervals of Line 2 and his/her will be . As a result of , the probability distribution of can be obtained by formula (5) as follows:
In the above formula, can be infinite theoretically, but, in fact, the probability is almost 0 when .
2.4. On-Train Time (ott)
Based on the assumption, all the trains run strictly according to the timetable at a certain speed level. So, between two given stations $i$ and $j$ on the same line, the total running time of different trains is a constant, which is denoted by $T_{ij}$. Obviously, the on-train time ott of passengers traveling between station $i$ and station $j$ is equal to $T_{ij}$ and is a constant as well.
2.5. Analysis of Independence
Whether the components of travel time are independent is very important to analyze distribution characteristics of the path’s travel time. According to the analysis in the previous section, is a constant, so is independent of other components. divides travel time into three independent parts: and , , and . Thus, only the independence of and needs to be analyzed. In fact, one passenger’s waiting time is not related to his/her walking speed. As the passengers arrive at the station randomly and dispersedly, they arrive at the platform randomly and dispersedly too, which makes their waiting time random variable. That is the reason why the passengers walking fast may wait longer at the origin station. Therefore, and can be considered independent.
Based on the above analysis, all the components of travel time are independent.
3. Model and Algorithm
3.1. AFC Data
In China, smart cards and AFC system are applied in most cities, which can record part of the passengers’ traveling information on the URT network. The basic structure of AFC data is shown in Table 1.

Thus, based on the AFC data structure, any passenger’s travel time can be calculated.
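As a minimal sketch of this calculation, the snippet below derives travel times from entry and exit timestamps and groups them by OD. The record fields (`origin`, `entry_time`, `destination`, `exit_time`), the station labels, and the timestamp format are hypothetical placeholders, since the actual layout is the one given in Table 1.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def travel_time_seconds(entry_time, exit_time, fmt=TIME_FORMAT):
    """Travel time = exit gate time minus entry gate time, in seconds."""
    entry = datetime.strptime(entry_time, fmt)
    exit_ = datetime.strptime(exit_time, fmt)
    if exit_ < entry:
        raise ValueError("exit time precedes entry time")
    return (exit_ - entry).total_seconds()


def group_by_od(records):
    """Collect travel-time samples per (origin, destination) pair."""
    samples = defaultdict(list)
    for rec in records:
        od = (rec["origin"], rec["destination"])
        samples[od].append(travel_time_seconds(rec["entry_time"], rec["exit_time"]))
    return dict(samples)


# Hypothetical records for illustration only.
records = [
    {"origin": "S1", "entry_time": "2014-05-12 08:01:10",
     "destination": "S4", "exit_time": "2014-05-12 08:25:40"},
    {"origin": "S1", "entry_time": "2014-05-12 08:03:00",
     "destination": "S4", "exit_time": "2014-05-12 08:26:30"},
]

if __name__ == "__main__":
    print(group_by_od(records))
```

The per-OD sample lists produced this way are the input to the moment estimates discussed in the following sections.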
3.2. Distribution Parameter Estimation for Components of Travel Time
According to the analysis in Section 2, the distribution characteristics and parameters of the components of travel time are shown in Table 2.

From Table 2, it can be seen that the parameters to be estimated are the mean and variance of ewt of each station ($\mu_{ewt}$, $\sigma_{ewt}^2$); the mean and variance of twt in each transfer direction of each transfer station ($\mu_{twt}$, $\sigma_{twt}^2$); and the mean and variance of exwt of each station ($\mu_{exwt}$, $\sigma_{exwt}^2$).
3.2.1. Estimation Method of $\mu_{ewt}$, $\sigma_{ewt}^2$ and $\mu_{exwt}$, $\sigma_{exwt}^2$
Take an OD with a single path and no transfer (the origin station and the destination station are on the same line) as the research object; its travel time comprises only ewt, epwt, ott, and exwt. Let $i$ be the origin station and let $j$ be the destination station. Then, large samples of passengers' travel time data ($t_1, t_2, \ldots, t_n$) can be obtained from the AFC system, where $t_k$ is the actual travel time of the $k$th passenger and $n$ is the sample size.
Based on ($t_1, t_2, \ldots, t_n$), the mean and variance of the path's travel time ($\mu_{ij}$, $\sigma_{ij}^2$) can be estimated by moment estimation. Moment estimation is a commonly used method of parameter estimation, proposed by K. Pearson in 1894 [13]. According to the Wiener-Khinchin law of large numbers, the sample moment converges to the population moment when the sample size is large. The principle of moment estimation is as follows: estimate the corresponding population moment by the sample moment and estimate the parameters by making use of the relationship between the unknown parameters and the population moment.
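Because the components of travel time are independent, the following equations are established:

$$\mu_{ij} = \mu_{ewt,i} + \frac{I_i}{2} + T_{ij} + \mu_{exwt,j}, \qquad \sigma_{ij}^2 = \sigma_{ewt,i}^2 + \frac{I_i^2}{12} + \sigma_{exwt,j}^2 \tag{6}$$

where $\mu_{ewt,i}$ is the mean of ewt of station $i$; $\sigma_{ewt,i}^2$ is the variance of ewt of station $i$; $I_i$ is the interval of station $i$; $T_{ij}$ is the train running time between station $i$ and station $j$; $\mu_{exwt,j}$ is the mean of exwt of station $j$; $\sigma_{exwt,j}^2$ is the variance of exwt of station $j$; $I_i$ and $T_{ij}$ can be obtained from the timetable directly.
Equations (6) hold for any two different stations on the same line and can be converted to the following:

$$\mu_{ewt,i} + \mu_{exwt,j} = \mu_{ij} - \frac{I_i}{2} - T_{ij}, \qquad \sigma_{ewt,i}^2 + \sigma_{exwt,j}^2 = \sigma_{ij}^2 - \frac{I_i^2}{12}, \qquad i, j \in S_l,\ i \neq j \tag{7}$$

where $S_l$ is the station set of line $l$.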
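The moment relations can be illustrated with a short Python sketch. It assumes that the entry walking parameters of the origin station are already known (for example, from a survey) and solves for the exit walking parameters of the destination station, using the mean $I/2$ and variance $I^2/12$ of the uniform waiting time. All numeric values and function names are hypothetical, and the simulated travel times stand in for AFC samples.

```python
import random
import statistics


def estimate_exit_walking(travel_times, ewt_mean, ewt_var, headway, run_time):
    """Solve the mean/variance relations for the exit walking parameters.

    mean:     mu_ij  = mu_ewt  + headway / 2     + run_time + mu_exwt
    variance: var_ij = var_ewt + headway**2 / 12            + var_exwt
    """
    mu_ij = statistics.fmean(travel_times)             # first sample moment
    var_ij = statistics.pvariance(travel_times, mu_ij)  # second central moment
    exwt_mean = mu_ij - ewt_mean - headway / 2.0 - run_time
    exwt_var = var_ij - ewt_var - headway ** 2 / 12.0
    return exwt_mean, exwt_var


def simulate_travel_times(n, ewt, exwt, headway, run_time, seed=0):
    """Hypothetical stand-in for AFC data: ewt and exwt are (mean, sd) pairs."""
    rng = random.Random(seed)
    return [
        rng.gauss(*ewt) + rng.uniform(0.0, headway) + run_time + rng.gauss(*exwt)
        for _ in range(n)
    ]


if __name__ == "__main__":
    times = simulate_travel_times(20000, (60.0, 15.0), (80.0, 20.0), 180.0, 600.0)
    print(estimate_exit_walking(times, 60.0, 15.0 ** 2, 180.0, 600.0))
```

Because the variance is obtained as a difference of estimates, it is noticeably noisier than the mean, which is one reason large samples matter here.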
Therefore, for any station on line $l$, if the distribution parameters (mean and variance) of its ewt or exwt are known (by survey), the distribution parameters of all the stations on line $l$ can be calculated by (7). According to the theory of moment estimation, the larger the sample size is, the more accurately the parameters are estimated. Thus, in order to improve the accuracy of parameter estimation, the sequence in which the stations on line $l$ are processed (called the PES problem hereinafter) can be determined according to the passenger flow.
To describe and solve PES problem, define as the set of distribution parameters of all stations on line , and thenwhere is the set of ’s distribution parameters of station on line and ; is the set of ’s distribution parameters of station on line and .
Take the five stations of line of the network in Figure 4(a); for example, the relationship between the sets of each station’s distribution parameters can be established as the undirected graph in Figure 4(b). In the figure, two sets are connected by an edge if there is only one path between the two stations; otherwise, two sets are unconnected. is the passenger flow between station and station .
(a) Part of URT network
(b) Undirected graph of parameter sets of line
In order to estimate the parameters accurately, the sample size should be as large as possible. Therefore, the PES problem can be summarized to seek the maximum spanning tree in Figure 4(b), and the model is described as follows:where is the spanning tree of the undirected graph and is the weight value of the edge between vertex and .
The most common algorithm used to calculate the optimal spanning tree is the Kruskal algorithm [14]. Let $V$ be the set of all vertices, let $E$ be the set of all edges, let $m$ be the number of all edges, let $T$ be the spanning tree, let $E_T$ be the set of edges of the spanning tree, and let $k$ be the number of edges of the spanning tree. Then, the steps of the Kruskal algorithm for solving the maximum spanning tree are as follows.
Step 1. Set $E_T = \emptyset$, $k = 0$.
Step 2. Choose the edge $e$ with the maximum weight in the graph, $e = \arg\max_{e' \in E} w(e')$.
Step 3. If there is any loop in the graph $(V, E_T \cup \{e\})$, then $E = E \setminus \{e\}$ and go back to Step 2; otherwise, $E_T = E_T \cup \{e\}$, $E = E \setminus \{e\}$, $k = k + 1$.
Step 4. If $k = |V| - 1$, the algorithm ends; otherwise, repeat Step 2.
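The steps above can be sketched in Python as follows. This is an illustrative implementation of Kruskal's algorithm for a maximum spanning tree that uses a union-find structure for the loop check; the vertex labels and passenger flows are hypothetical and are not the values of Figure 5.

```python
def maximum_spanning_tree(vertices, weighted_edges):
    """Kruskal's algorithm with edges visited in descending weight order.

    weighted_edges: iterable of (u, v, weight) tuples.
    Returns the list of tree edges in the order they were accepted.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Find the component root, compressing the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # adding this edge would create a loop
        parent[root_u] = root_v
        tree.append((u, v, w))
        if len(tree) == len(vertices) - 1:
            break  # a spanning tree has |V| - 1 edges
    return tree


# Hypothetical passenger flows between parameter sets of five stations.
stations = ["A", "B", "C", "D", "E"]
flows = [
    ("A", "B", 500), ("A", "C", 800), ("A", "D", 300), ("B", "C", 650),
    ("C", "D", 900), ("C", "E", 400), ("D", "E", 700), ("B", "E", 200),
]

if __name__ == "__main__":
    print(maximum_spanning_tree(stations, flows))
```

Sorting once and checking components with union-find replaces the repeated "choose the maximum edge, test for a loop" iteration, but it accepts exactly the same edges.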
For example, suppose the passenger flows of ODs in Figure 4(b) are shown as in Figure 5(a) and the maximum spanning tree calculated by Kruskal algorithm is shown in Figure 5(b).
(a) Weighted graph of parameter sets of line
(b) Maximum spanning tree
According to the maximum spanning tree, the node incident to the largest number of edges is chosen as the one whose parameters are estimated by manual survey. Then, the parameters of the other nodes are obtained in succession by formula (7). Therefore, the estimation sequence of parameters can be arranged as shown in Figure 6.
Based on the above analysis, the estimation method of $\mu_{ewt}$, $\sigma_{ewt}^2$ and $\mu_{exwt}$, $\sigma_{exwt}^2$ of each station on one line is summarized as follows.
Step 1. Build the relationship graph of parameter sets, in which the weight of each edge is the passenger flow between its two vertices collected by the AFC system.
Step 2. Use the Kruskal algorithm to calculate the maximum spanning tree of the relationship graph.
Step 3. Find the node with the maximum number of edges and use a manual survey to collect samples of its walking time (ewt or exwt); then, its parameters ($\mu_{ewt}$, $\sigma_{ewt}^2$) or ($\mu_{exwt}$, $\sigma_{exwt}^2$) can be estimated by moment estimation.
Step 4. According to the maximum spanning tree, estimate the parameters ($\mu_{ewt}$, $\sigma_{ewt}^2$, $\mu_{exwt}$, and $\sigma_{exwt}^2$) of the other stations in succession by formula (7).
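Steps 3 and 4 amount to picking the highest-degree node of the spanning tree and then walking outward from it. The sketch below illustrates one way to derive such an estimation sequence with a breadth-first traversal; the tree edges are hypothetical, and the traversal order is an assumption made for illustration rather than the exact sequence of Figure 6.

```python
from collections import defaultdict, deque


def estimation_sequence(tree_edges):
    """Return (hub, steps) for a spanning tree given as (u, v, weight) edges.

    hub   : node with the most tree edges, to be calibrated by survey.
    steps : list of (known_node, next_node) pairs; the parameters of
            next_node are derived from known_node via the moment relations.
    """
    adjacency = defaultdict(list)
    for u, v, _weight in tree_edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    # Ties are broken alphabetically because max() keeps the first maximum.
    hub = max(sorted(adjacency), key=lambda node: len(adjacency[node]))

    steps = []
    seen = {hub}
    queue = deque([hub])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                steps.append((node, neighbour))
                queue.append(neighbour)
    return hub, steps


# Hypothetical maximum spanning tree (labels and weights are illustrative).
tree_edges = [("C", "D", 900), ("A", "C", 800), ("D", "E", 700), ("B", "C", 650)]

if __name__ == "__main__":
    hub, steps = estimation_sequence(tree_edges)
    print("survey node:", hub)
    print("derivation order:", steps)
```

Each pair in the returned list corresponds to one application of formula (7) along a tree edge.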
3.2.2. Estimation Method of $\mu_{twt}$, $\sigma_{twt}^2$
In a URT system, a transfer path is the path from the platform of one line to the platform of another line at a transfer station. Therefore, for a two-line transfer station, there are four transfer paths in total. Based on the analysis of most URT networks in China, for any transfer path of a transfer station, an OD with a single path and one transfer can always be found that contains the transfer path; the travel time of such an OD includes only ewt, epwt, ott, the transfer time (twt + tpwt), and exwt.
Suppose that a certain OD () with single path and one transfer contains transfer path , as the components of travel time are independent. Formulas (10) and (11) are established as follows:where , are the origin station and the destination station; , are the mean and the variance of ’s travel time; , are the mean and the variance of of transfer path ; is the train running time from station to the transfer station; is the train running time from the transfer station to station .
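The mean and variance of the OD's travel time can be estimated by moment estimation of AFC data; the two train running times and the interval can be obtained from the timetable; $\mu_{ewt}$, $\sigma_{ewt}^2$ of the origin station and $\mu_{exwt}$, $\sigma_{exwt}^2$ of the destination station can be estimated as in Section 3.2.1. Therefore, only the mean and variance of the transfer time (twt + tpwt) of the transfer path are unknown in formulas (10) and (11), and they can be calculated easily.
Once the mean and variance of the transfer time are known, $\mu_{twt}$ and $\sigma_{twt}^2$ can be calculated by formula (5).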
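3.3. Estimation Method of Path-Selecting Proportion
Let $(i, j)$ be a multipath OD and let $t$ be the variable of travel time. Then, the probability density function of the travel time of $(i, j)$ can be described as follows:

$$f_{ij}(t) = \sum_{k=1}^{K_{ij}} p_{ij}^{k} f_{ij}^{k}(t) \tag{12}$$

$$\sum_{k=1}^{K_{ij}} p_{ij}^{k} = 1, \qquad i, j \in S \tag{13}$$

where $p_{ij}^{k}$ is the path-selecting proportion of the $k$th path between station $i$ and station $j$; $f_{ij}^{k}(t)$ is the probability density function of the $k$th path's travel time; $K_{ij}$ is the number of paths between station $i$ and station $j$; $S$ is the set of stations on the URT network.
Based on (12) and (13), the following can be deduced as well:

$$\mu_{ij} = \sum_{k=1}^{K_{ij}} p_{ij}^{k} \mu_{ij}^{k}, \qquad \sigma_{ij}^2 = \sum_{k=1}^{K_{ij}} p_{ij}^{k} \left[ (\sigma_{ij}^{k})^2 + (\mu_{ij}^{k})^2 \right] - \mu_{ij}^2 \tag{14}$$

where $\mu_{ij}$ is the mean of the travel time of $(i, j)$; $\sigma_{ij}^2$ is the variance of the travel time of $(i, j)$; $\mu_{ij}^{k}$ is the mean of the $k$th path's travel time; $(\sigma_{ij}^{k})^2$ is the variance of the $k$th path's travel time.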
and can be estimated from the AFC data; while and can be estimated by the following equations, in which each parameter can be obtained by the methods described in Section 3.2:where is the mean of transfer station ’s ; is the variance of transfer station ’s ; is the set of transfer stations in the th path.
Combining (13)~(14), each path's path-selecting proportion can be calculated. It is worth noting that only moment statistics of order $\leq 2$ (mean and variance) are used in the derivation above. Together with the condition that the proportions sum to one, these provide three equations, so the estimation method applies only to ODs with no more than three paths. When there are more than three paths, the idea in this paper is still applicable, but central moment statistics of order $\geq 3$ should be introduced.
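For the two-path case, the calculation reduces to solving the mean equation of the mixture together with the sum-to-one condition. The sketch below illustrates this special case only; it is not the authors' implementation. The path means and standard deviations are hypothetical, and the simulated path travel times are drawn from normal distributions purely for brevity.

```python
import random
import statistics


def two_path_proportion(travel_times, path_means):
    """Estimate (p1, p2) from mu = p1 * mu1 + p2 * mu2 with p1 + p2 = 1."""
    mu1, mu2 = path_means
    if mu1 == mu2:
        raise ValueError("path means must differ to identify the proportion")
    mu = statistics.fmean(travel_times)
    p1 = (mu - mu2) / (mu1 - mu2)
    p1 = min(1.0, max(0.0, p1))  # clip sampling noise into [0, 1]
    return p1, 1.0 - p1


def mixture_variance(proportions, path_means, path_vars):
    """Variance of a mixture: sum p_k * (var_k + mu_k**2) - mu**2."""
    mu = sum(p * m for p, m in zip(proportions, path_means))
    second_moment = sum(
        p * (v + m * m) for p, m, v in zip(proportions, path_means, path_vars)
    )
    return second_moment - mu * mu


def simulate_two_paths(n, p1, path_means, path_sds, seed=0):
    """Hypothetical stand-in for AFC travel times of a two-path OD."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        k = 0 if rng.random() < p1 else 1  # path chosen by this passenger
        times.append(rng.gauss(path_means[k], path_sds[k]))
    return times


if __name__ == "__main__":
    times = simulate_two_paths(10000, 0.7, (1500.0, 1800.0), (60.0, 80.0))
    print(two_path_proportion(times, (1500.0, 1800.0)))
```

With two paths, the variance equation is not needed for identification and can serve as a consistency check by comparing `mixture_variance` at the estimated proportions with the sample variance. The estimate becomes unstable when the two path means are close.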
4. Numerical Example
The OD “” in Figure 4(a) is used as an example to verify the model and algorithm. Obviously, there are two paths between the OD, as shown in Figure 7 and Table 3.

Based on the method proposed in this paper, the process for estimating the proportion of two paths is as follows.
Step 1 (estimate the distribution parameters of each station's walking time, $\mu_{ewt}$, $\sigma_{ewt}^2$ and $\mu_{exwt}$, $\sigma_{exwt}^2$). Build the relationship graph of parameter sets and use the Kruskal algorithm to calculate the maximum spanning tree of the graph. Based on the maximum spanning tree, the parameter estimation sequence of stations is obtained.
Take line and its five stations; for example, as shown in Figure 6, of should be estimated by artificial survey, and then other parameters can be estimated by formula (7).
To verify the parameter estimation method, a simulation of station and is designed as follows: generate randomly 10,000 passengers from to with the entry time between 8:00:00 and 9:50:00; ; the interval of line is s; train running time is s; .
MATLAB is used to run the simulation experiment. The simulated entry walking time data are taken as the manual survey data, and the simulated entry and exit time data are taken as the AFC data. The parameters of ewt are estimated by moment estimation, while those of exwt are calculated by formula (7). The results are shown in Table 4.
From Table 4, we can see that the error rate of parameter estimation is less than 0.2%, and the estimation method is verified.

Step 2 (estimate $\mu_{twt}$, $\sigma_{twt}^2$ of each transfer path of all transfer stations concerned). With the walking time parameters of all the stations on the URT network known, calculate the mean and variance of the transfer time (twt + tpwt) of each transfer path of the transfer stations from the AFC travel time data by (10) and (11). Then, $\mu_{twt}$ and $\sigma_{twt}^2$ can be estimated by (5). The parameter estimation method in this step can also be verified by simulation experiments similar to those in Step 1.
Step 3 (estimate each path's path-selecting proportion). Suppose the parameter values estimated in Step 1 and Step 2 are those partly shown in Table 5.
Design the simulation experiment as follows: generate randomly 10,000 passengers from to with the entry time between 8:00:00 and 9:50:00; the interval of each line is s; the train running time of Path 1 is min, while min of Path 2.
Run the simulation three times with different path-selecting proportions between Path 1 and Path 2 (0.7 : 0.3, 0.5 : 0.5, 0.3 : 0.7). Again, the simulated entry and exit time data are taken as the AFC data.
Then, each path's path-selecting proportion can be estimated by formula (13)~(14). The results are shown in Table 6.
The results in Table 6 show that the error rate of the path-selecting proportion estimation is less than 2%, which verifies the model and algorithm in this paper.


5. Conclusion
In the network operation phase of URT, the path-selecting proportion is the key to network flow assignment and fare clearing. This paper analyzed the distribution characteristics of the components of travel time and then proposed an estimation method of path-selecting proportion that makes use of the travel time data from the AFC system. Simulation experiments were designed to verify the estimation method, and the results show that the error rate is less than 2% and the method is reliable.
Compared with the traditional models based on path utility, the estimation method of path-selecting proportion has the following advantages: (1) by making full use of the AFC data, the sample size for parameter estimation is large and the results are accurate; (2) the estimation method relies on data analysis and processing, which significantly reduces the cost of manual surveys.
The estimation method in this paper is being used to analyze and validate the URT network flow assignment results in Shanghai.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported by China Postdoctoral Science Foundation (Project no. 2014M551454). The authors also wish to acknowledge Shanghai Shentong Metro Group Co., Ltd., for providing basic data during the research.
References
[1] S. Nguyen, S. Pallottino, and F. Malucelli, “A modeling framework for passenger assignment on a transport network with timetables,” Transportation Science, vol. 35, no. 3, pp. 238–249, 2001.
[2] M. H. Poon, S. C. Wong, and C. O. Tong, “A dynamic schedule-based model for congested transit networks,” Transportation Research Part B: Methodological, vol. 38, no. 4, pp. 343–368, 2004.
[3] R.-H. Xu, Q. Luo, and P. Gao, “Passenger flow distribution model and algorithm for urban rail transit network based on multi-route choice,” Journal of the China Railway Society, vol. 31, no. 2, pp. 110–114, 2009.
[4] B.-F. Si, B.-H. Mao, and Z.-L. Liu, “Passenger flow assignment model and algorithm for urban railway traffic network under the condition of seamless transfer,” Journal of the China Railway Society, vol. 29, no. 6, pp. 12–18, 2007.
[5] R. Chapleau, M. Trepanier, and K. K. Chu, “The ultimate survey for transit planning: complete information with smart card data and GIS,” in Proceedings of the 8th International Conference on Survey Methods in Transport: Harmonisation and Data Comparability, Annecy, France, May 2008.
[6] A. B. Rahbee, “Farecard passenger flow model at Chicago Transit Authority, Illinois,” Transportation Research Record, vol. 2072, pp. 3–9, 2008.
[7] S. G. Lee and M. D. Hickman, “Travel pattern analysis using smart card data of regular users,” in Proceedings of the 90th Annual Meeting of the Transportation Research Board, Washington, DC, USA, 2011.
[8] T. Kusakabe, T. Iryo, and Y. Asakura, “Estimation method for railway passengers' train choice behavior with smart card transaction data,” Transportation, vol. 37, no. 5, pp. 731–749, 2010.
[9] F. Zhou and R.-H. Xu, “Model of passenger flow assignment for urban rail transit based on entry and exit time constraints,” Transportation Research Record, no. 2284, pp. 57–61, 2012.
[10] Y. Sun and R. Xu, “Rail transit travel time reliability and estimation of passenger route choice behavior,” Transportation Research Record, no. 2275, pp. 58–67, 2012.
[11] L. F. Henderson, “The statistics of crowd fluids,” Nature, vol. 229, no. 5284, pp. 381–383, 1971.
[12] D.-W. Li, Modeling and Simulation of Microscopic Pedestrian Flow in MTR Hubs, Beijing Jiaotong University, Beijing, China, 2007.
[13] Z. Huang, W. Feng, and Z. Hu, Probability and Statistics, People's Education Press, Beijing, China, 1982.
[14] M.-Z. Li, Graph Theory and Its Algorithms, China Machine Press, Beijing, China, 2010.
Copyright
Copyright © 2015 Feng Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.