Abstract

The automated fare collection (AFC) system has gained increasing popularity among transit systems worldwide. The AFC system is usually an entry-only system that records only the serial number of the smart card and the transaction time of each use. Neither the AFC data nor the bus global positioning system (GPS) data reveals the passenger’s alighting information, namely, the alighting time and station. Hence, the station-to-station origin-destination (OD) trip information cannot be obtained directly from the available data sources. To address this problem, this paper proposes a methodology that estimates the OD matrix using smart card and GPS data. The characteristics of the basic data sources are first analyzed. The bus arrival time is then generated using a density-based clustering algorithm and a time correction strategy, from which the passenger’s boarding station is identified. The alighting stations are inferred based on the characteristics of bus trip chaining, which identifies over 80% of the alighting stations on average. Finally, the proposed methodology is verified by a comprehensive field survey in Suzhou, China, with a 100% sample rate.

1. Introduction

The bus origin-destination (OD) matrix estimation is one of the most fundamental steps for urban transit planning, operation, and management strategies, such as passenger demand forecasting [1, 2], network design [3], transit pricing [4, 5], scheduling [6, 7], and bus operation [8, 9]. Traditionally, the acquisition of OD trip information and the analysis of passengers’ spatial-temporal travel behavior rely largely on travel survey data, which has the drawbacks of small sample size, high cost, and long collection time [10–14]. With the help of advanced automatic fare collection (AFC) systems, smart card data has received increasing interest as a new and reliable data source for the collection and analysis of passenger trip information [15]. Smart card data outperforms field survey data in providing comprehensive spatial-temporal information about the transit system and in capturing the dynamics of passenger trips [16, 17]. Smart card data can also be used to detect and correct potential sources of sampling bias [18].

There has been some work on mining the smart card data and other types of data collected during bus operation, such as bus global positioning system (GPS) data and Wi-Fi/Bluetooth probe data. Pelletier et al. [19] have conducted an extensive review of the cutting-edge techniques of smart card data mining in public transit systems. Early studies on smart card data analysis mainly focus on the statistical analysis of urban public transport passenger flow. For example, both Kiyohiro et al. [20] and Ouyang et al. [21] use smart card data to analyze the travel patterns of urban residents. Ebadi et al. [22] utilize smart card data to construct students’ activity-mobility trajectories in the time-space dimension. Ji et al. [23] propose a Bayesian model to estimate trip-level OD flow matrices utilizing the data collected by Wi-Fi sensors and boarding data provided by automatic vehicle location (AVL) systems. However, compared to the smart card data, the stability of Wi-Fi sensors and the reliability of the collected data are relatively low. Ma et al. [24] investigate the smart card transactions and GPS data, and their results show relatively high accuracy of the passenger’s origin information. Ma et al. [25] continue to mine the smart card data and propose an efficient method to identify passengers’ travel patterns using their historical travel data.

It has been widely recognized that data fusion techniques which consider various correlated data sources can help to improve the accuracy of OD trip matrix estimation [26–29]. In such a data fusion process, the following two issues have to be addressed: (1) the smart card data contains only transaction-related information and does not record the boarding and alighting locations, so it is useful only when it can be matched with the vehicle’s operational data [30]; (2) the information recorded by the smart card system might not match the data of the GPS system perfectly. To identify the boarding station, two kinds of techniques are mainly adopted, namely, clustering analysis and GPS data matching. Specifically, the swiping times of the smart card are first clustered and then matched with the AVL data to identify bus stops. Many efforts have been devoted to combining the smart card data with the bus dispatching information. Zhang et al. [31] identify the commuting trips of passengers by using their smart card data and use the clustering analysis method to identify commuting passengers’ boarding stations according to the time recorded in the smart card data. Farzin [32] uses GPS time matching to identify the boarding location through the correlation matching of bus IC card swiping information, GPS positioning information, and bus station locations in São Paulo, Brazil.

The identification of the alighting station is more challenging because of the absence of available alighting location information [19]. Two kinds of methods have been proposed to estimate passenger’s alighting station: aggregate and disaggregate methods. In the aggregate method, passengers are assumed to alight from the bus according to a specific probability distribution with respect to the travel distance and station attractiveness [33]. To obtain more reliable estimations, a large number of studies focus on the disaggregate method based on the passenger’s trip chain by combining the smart card data with other sources of data, e.g., AVL data, which provides the detailed trip and corresponding geographic information.

The trip chain is generated to describe the same individual’s daily travel, which is composed of several bus trips, with the assumption that the passenger’s boarding station of a trip is close to the alighting station of the previous trip [34]. Wang et al. [35] apply trip chaining to obtain the OD information using smart card transactions and AVL data in London. It is also the first attempt to validate the results with manual passenger survey data. However, the sample rate of some bus lines is about 60%. Barry et al. [13] infer the alighting station based on the data collected by the AFC system in New York, USA. Two specific assumptions are made to define the individual’s trip chain: (1) a large proportion of passengers return to the previous station to start the next trip and (2) passengers begin their first trip of the day at the station where they ended their trip of the previous day. Zhao [36] examines the rail-to-bus trip chain by integrating the AFC and AVL data, as well as the GIS technology. It was demonstrated that the passenger’s travel behavior can be obtained rigorously by using the advanced automated data collection system. Seaborn et al. [37] use the smart card fare payment data to investigate passengers’ multimodal trips in London. They illustrate that passengers transferring between intersecting routes can be identified from the smart card data. Cui [38] first estimates the OD matrix of a single route and then extends it to the network-level OD matrix estimation. A case study of a full-size network of Chicago is conducted. The results show that the maximum likelihood estimation is suitable for the estimation of the single-route OD matrix, while the proportional distribution method is recommended for the estimation of the transfer flow OD matrix at the network level. Lu et al. [39] use the AFC data from the Beijing metro with the trip chaining method and the k-means clustering method to infer the actual destination, an approach which could also be applied to inferring bus passengers’ alighting stations from smart card data. In sum, the above-mentioned studies on the passenger’s trip chain are based on two assumptions: (1) the closest stop and (2) the daily symmetry rules. The identification rate of the passenger’s alighting station using the trip chain method is between 67% and 71% [33, 36, 38].

In the literature, the bus OD matrix is estimated in three ways, i.e., field survey, analysis of smart card data only, and the fusion of multisource data including smart card data and bus GPS data. Although the existing literature has extensively studied the characteristics of different types of data, dealing with the heterogeneity and systematic error of multisource data is still an open question. Specifically, the gaps of current studies are summarized as follows: (1) the fusion of multisource data does not account for the internal correlation in both spatial and temporal dimensions; (2) the inference of alighting stations is accurate at the aggregate level but inaccurate at the individual level; and (3) the estimated results cannot be verified comprehensively because of the absence of real data and the inadequate sample rate of field experiments.

To sum up, the contribution of this study is threefold. First, this paper proposes a spatiotemporal correction method for smart card data, bus GPS data, and bus station data to address the incompleteness and complexity of multisource data. A density-based clustering method is applied to correct the difference between the trajectories recorded by the GPS devices and the bus station data. Second, the bus arrival time is obtained through a data clustering technique in the spatial dimension, and an additional correction algorithm in the temporal dimension is proposed to calibrate the timestamps in the smart card data to match the bus arrival time. To identify the alighting station at the individual level, this paper divides the trips into a chained dataset and an unchained dataset and infers the alighting stations of individuals for the chained dataset, through which the identification rate of the alighting station can be improved by about 10%. Third, the effectiveness and feasibility of the proposed method are verified on the data collected by a large-scale field survey in Suzhou, China.

The rest of this paper is organized as follows. The problem statement and the data utilized in this paper are introduced in Section 2. Section 3 discusses the method for identifying boarding stations by incorporating the heterogeneity in multisource data. Section 4 presents the categories of trips and the trip-chaining-based inference method for alighting stations. Section 5 conducts a case study in Suzhou, China, to validate the performance of the proposed algorithm. Finally, we conclude the paper with some remarks and perspectives.

2. Data Description

Three crucial datasets can be obtained in the urban transit system, i.e., smart card data, bus GPS data, and bus station information data. Smart card data records the swiping time, vehicle number, and other transaction-related information of passengers, which contains mainly temporal information. In most cases, the station at which a passenger boards is not recorded. Bus GPS data records the trajectories of bus vehicles, which contain both spatial and temporal information. Bus station data records the static positions of bus stations of each bus line, which only contain the spatial information. The raw data of bus GPS and bus station location sometimes do not match because of different coordinate systems. In the rest of this section, the detailed description of these datasets is given.

2.1. Smart Card Data

The smart card data is recorded by the card swiping terminal equipped on each bus. When passengers board the vehicle, their trip information will be recorded. The recorded fields and corresponding description are listed in Table 1, including the record ID, card ID, card type, card swiping time, bus line number, and vehicle number, which are mainly temporal information.

2.2. Bus GPS Data

In practice, most buses in large- and medium-sized cities have been equipped with GPS terminals, which accurately collect real-time positioning information, along with the bus line number and operating direction, and therefore contain both spatial and temporal information. As shown in Table 2, the data fields include the record ID, vehicle number, bus type, bus line number, bus line name, operating status, timestamp, longitude, latitude, altitude, running speed, and running angle. The systematic error of the GPS data is about 10–30 meters. However, in practice, some of the GPS equipment may be obsolete and cannot be corrected in time, in which case the systematic error can reach 50 meters.

2.3. Bus Station Location Data

The bus network information is stored in a relational database in the form of the bus route information and the location information of bus stations (see Table 3). In the relational database, bus lines and bus stations are regarded as entities. A bus station belongs to at least one bus line, and a bus line contains at least one bus station. The major data fields include the line number, line name, line direction, starting and end locations, and line type.

3. Passenger Boarding Station Extraction

The first step of bus OD matrix estimation is to obtain the passenger’s boarding station, i.e., identifying the origin of each trip. As mentioned in the previous section, the smart card data only records the transaction time, while the spatial information of the boarding station is absent. As for the GPS data, only the vehicle trajectories are recorded, while the spatial relationship between trajectories and the location of stations is also unclear. Hence, it is necessary to match the smart card data to the bus GPS data to extract the passenger’s boarding station. In this section, we first use the clustering technique to obtain the bus arrival time, which can be further matched to the smart card’s transaction time, and then extract the passenger’s boarding station.

3.1. Obtaining Bus Arrival Time Based on Density Clustering Algorithm
3.1.1. Feasibility Analysis of Density Clustering Algorithm

Through the spatial matching of the bus station data and bus GPS data, the bus arrival time at each station can be obtained. However, it is difficult to map the coordinates in bus trajectories directly to the locations of bus stations, because the two sources use different geographic coordinate systems and the bus GPS data can suffer from interference. These uncertainties and disturbances can be formulated as follows:

$$x^{G}_{i,j,k} = x^{S}_{i,j} + \Delta x^{c} + \varepsilon^{x}_{i,j,k}, \qquad y^{G}_{i,j,k} = y^{S}_{i,j} + \Delta y^{c} + \varepsilon^{y}_{i,j,k}, \tag{1}$$

where $x^{S}_{i,j}$ and $y^{S}_{i,j}$ represent the coordinates of bus station j on bus line i recorded in the bus station data; $x^{G}_{i,j,k}$ and $y^{G}_{i,j,k}$ represent the coordinates of the k-th sample of bus station j on bus line i in the bus GPS data; $\Delta x^{c}$ and $\Delta y^{c}$ represent the conversion errors between the geographic coordinate systems adopted by the two different data sources; and $\varepsilon^{x}_{i,j,k}$ and $\varepsilon^{y}_{i,j,k}$ represent the positioning errors of the k-th sample of bus station j on bus line i in the bus GPS data. Due to the existence of these errors, it is essential to ensure the correctness of the spatial clustering and matching of data from the two sources.

The clustering algorithm is an unsupervised learning method which is capable of classifying the data into several groups by analyzing the similarity and mutuality between observed samples. The density-based clustering algorithm can further investigate the connectivity of the data from the perspective of the local density of samples. For the bus GPS data, a vehicle usually decelerates when approaching the station and then stops if this station is not skipped. This motion pattern leads to the spatial aggregation of trajectory points near bus stations, which is suitable for the use of density-based clustering algorithm. In the following section, the density-based spatial clustering of applications with noise (DBSCAN) algorithm is applied to obtain the bus station based on the GPS data [25], while the errors corresponding to the direct data fusion can therefore be avoided.

3.1.2. Obtaining the Bus Arrival Time Based on DBSCAN Algorithm

The DBSCAN algorithm is first applied to the bus GPS data, whereby the locations of bus stations in the bus GPS data can be identified. Then, the bus GPS data is matched with the bus station data based on the connectivity between trajectory points to generate the bus arrival timetable without being affected by the errors from different data sources. The detailed steps of the proposed method for bus timetable generation are as follows:

Step 1: Density-based clustering

The DBSCAN algorithm involves two steps, namely, searching for core samples and generating clusters. Before executing the algorithm, the bus station data is first matched with the bus GPS data. In the DBSCAN algorithm, two key parameters need to be defined: Eps, the neighborhood distance, and MinPts, the minimum number of points. Eps defines the density-reachable range, within which points are considered neighbors belonging to the same cluster. MinPts sets the minimum number of points that must fall within this range for a sample to count as a core sample. Through DBSCAN, the samples are split into two categories, i.e., core samples and noncore samples. Instead of randomly selecting a core sample from the complete dataset, we only select the sample from the bus GPS data as a clustering seed. The cluster grows from the seed to its neighboring samples, and the seed is then labelled as the primary core sample. By repeating these steps, the clusters and their primary core samples can be generated.

Step 2: Searching for the bus stations

After finding all the clusters in the dataset, the bus station corresponding to each sample needs to be determined. Firstly, if a sample is directly density reachable from a primary core sample, it is labelled as a core sample of the cluster that the primary core sample belongs to and is connected with the same bus station as the primary core sample. Secondly, if a sample is density reachable from exactly one primary core sample, it is also matched with the bus station of that primary core sample. Thirdly, if a sample is density reachable from multiple primary core samples, the core samples falling in its Eps-neighborhood are placed in a candidate set. Then, we count the number of samples that are density reachable from each candidate core sample, and the candidate with the most density-reachable samples is matched with the selected sample. The bus station of that sample is the same as the bus station of that core sample.

Step 3: Obtaining the bus arrival time

The primary core samples identified by the DBSCAN algorithm and the corresponding bus stations can be used to generate the bus timetable. The time recorded by a primary core sample is the arrival time at its bus station. Based on the algorithm above, the matching of the bus GPS data and bus station data in the spatial dimension can be realized, and an accurate bus timetable can be generated. This timetable records the exact time when each vehicle arrives at each bus station and is fundamental to bus OD estimation.
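The clustering idea behind these steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the seeded procedure described above: it runs a plain DBSCAN over the GPS points of a single vehicle run, assigns each cluster to the nearest station by centroid distance, and takes the earliest timestamp in the cluster as the arrival time. The function names, the planar coordinates in meters, and the values `eps=30.0` and `min_pts=3` are all hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        xi, yi = points[i]
        return [j for j in range(n)
                if math.hypot(points[j][0] - xi, points[j][1] - yi) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core sample: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:  # j is itself a core sample, keep growing
                queue.extend(nb)
    return labels

def arrival_times(gps, stations, eps=30.0, min_pts=3):
    """gps: list of (t_seconds, x_m, y_m) for one run; stations: {name: (x_m, y_m)}.
    Assumption: each cluster maps to the station nearest to its centroid."""
    labels = dbscan([(x, y) for _, x, y in gps], eps, min_pts)
    clusters = {}
    for sample, label in zip(gps, labels):
        if label != -1:
            clusters.setdefault(label, []).append(sample)
    timetable = {}
    for members in clusters.values():
        cx = sum(m[1] for m in members) / len(members)
        cy = sum(m[2] for m in members) / len(members)
        name = min(stations,
                   key=lambda s: math.hypot(stations[s][0] - cx, stations[s][1] - cy))
        timetable[name] = min(m[0] for m in members)
    return timetable

# Illustrative data: the bus dwells near x = 0 and x = 500, and moves fast in between.
gps = [(0, 0, 0), (5, 2, 0), (10, 3, 1), (15, 50, 0), (20, 150, 0), (25, 250, 0),
       (30, 350, 0), (35, 450, 0), (40, 498, 0), (45, 500, 1), (50, 501, 0), (55, 600, 0)]
stations = {"A": (5, 10), "B": (505, -8)}  # offset from the track, mimicking coordinate error
timetable = arrival_times(gps, stations)
print(timetable)
```

Because the station coordinates are used only to name clusters that were found from the GPS points alone, a moderate offset between the two coordinate systems does not change which samples are grouped together.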

3.2. Boarding Station Identification Based on Temporal Matching

Through the effective matching of the bus GPS data and the bus station data, the bus arrival time has been obtained. Subsequently, the boarding station can be identified by building connections between the smart card data and the bus arrival time in the temporal dimension.

3.2.1. Temporal Matching Algorithm

The premise of applying the above matching idea is to align the recorded time of the smart card data with the bus arrival time. Since the card swiping system and the GPS terminal installed on the bus operate independently, a fixed time difference between the recorded time values, referred to as the system time difference, might exist. Therefore, before identifying the boarding stations, the data from the two sources should be adjusted to ensure that they are consistent in the temporal dimension.

Generally, the time recorded in the smart card data is merely the transaction time rather than the true boarding time. As shown in Figure 1, the swiping behavior of bus passengers regularly occurs during the time the bus takes from the boarding station to the next station.

The system time difference between the smart card data and the bus GPS data will lead to the following two situations:

(1) The system time of the smart card system is earlier than that of the bus GPS system. In this case, it is possible that the boarding station is mistakenly identified as the previous station, resulting in a lower number of boardings at the current station and a higher number of boardings at the previous station.

(2) The system time of the smart card system is later than that of the bus GPS system. In this case, it is possible that the boarding station is mistakenly identified as the next station, resulting in a lower number of boardings at the current station and a higher number of boardings at the next station.

Assume that the system time difference between the smart card system and the bus GPS system is $\Delta t$ and that this difference is constant for the two data sources. The system time difference can be expressed as

$$\Delta t = s \cdot \max_{\Delta t_{i,d} \in T} \left| \Delta t_{i,d} \right|, \tag{2}$$

where $\Delta t_{i,d}$ and $s$ are subject to

$$\Delta t_{i,d} = t^{c}_{i,d} - t^{a}_{i,d}, \tag{3}$$

$$\operatorname{sgn}\left(\Delta t_{i,d}\right) = s, \quad \forall\, \Delta t_{i,d} \in T, \tag{4}$$

$$\left| \Delta t_{i,d} \right| \le \Delta t_{\max}, \tag{5}$$

where $\Delta t_{i,d}$ represents the system time difference of the i-th bus route on day d, $t^{c}_{i,d}$ represents the time of the first transaction of the i-th bus route on day d, $t^{a}_{i,d}$ represents the time of the first bus arrival of the i-th bus route on day d, T represents the collection of $\Delta t_{i,d}$, $\Delta t_{\max}$ represents the threshold of the maximum system time difference, and $s$ indicates whether the value is positive or negative.

Equation (2) takes the maximum time difference between the card swiping time of the first bus on that day and the first bus arrival time as the system time difference. Constraint (4) guarantees that all calculated time difference values take the same sign. Constraint (5) ensures that the first transaction recorded in the smart card data corresponds to the boarding at the first station of that bus route.

In practice, a minor lag might exist between the transaction time of the first passenger and the bus arrival time at the terminus. This can be addressed by simply adding a lag coefficient to equation (3), formulated as

$$\Delta t_{i,d} = t^{c}_{i,d} - t^{a}_{i,d} - \tau, \tag{6}$$

where $\tau$ represents a lag coefficient satisfying a specific probability distribution, which can be calibrated according to the actual condition. Finally, using the system correction algorithm, the time of the smart card data can be revised as

$$\hat{t}_{i,d} = t_{i,d} - \Delta t, \tag{7}$$

where $t_{i,d}$ represents a card swiping time on day d of bus line i and $\hat{t}_{i,d}$ represents the same card swiping time after correction.
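The correction procedure can be illustrated with a short Python sketch. It is a hedged reading of the text rather than a definitive implementation: the first-swipe minus first-arrival differences are computed per route and day, a fixed lag is subtracted, differences beyond a threshold are discarded, only the differences sharing the dominant sign are kept (one possible way to enforce the same-sign requirement), and the one with the largest magnitude is used to shift all swipe times. The names `system_time_difference`, `correct_swipe_times`, `max_diff`, and `lag`, and the use of a constant lag instead of a calibrated distribution, are assumptions.

```python
def system_time_difference(first_swipes, first_arrivals, max_diff=600.0, lag=0.0):
    """first_swipes, first_arrivals: {(route, day): time in seconds}.
    Returns the signed system time difference (card clock minus GPS clock)."""
    diffs = []
    for key, swipe in first_swipes.items():
        if key not in first_arrivals:
            continue
        d = swipe - first_arrivals[key] - lag
        # Threshold check: a very large gap suggests the first swipe
        # did not happen at the first station, so it is skipped.
        if abs(d) <= max_diff:
            diffs.append(d)
    positives = [d for d in diffs if d > 0]
    negatives = [d for d in diffs if d < 0]
    # Assumed same-sign rule: keep the dominant sign only.
    kept = positives if len(positives) >= len(negatives) else negatives
    if not kept:
        return 0.0
    return max(kept, key=abs)

def correct_swipe_times(swipe_times, delta):
    """Shift raw card swiping times onto the GPS clock."""
    return [t - delta for t in swipe_times]

# Illustrative values in seconds since midnight.
first_swipes = {("146", "d1"): 1095.0, ("146", "d2"): 1110.0,
                ("178", "d1"): 1080.0, ("178", "d2"): 5000.0}
first_arrivals = {("146", "d1"): 1000.0, ("146", "d2"): 1000.0,
                  ("178", "d1"): 1000.0, ("178", "d2"): 1000.0}
delta = system_time_difference(first_swipes, first_arrivals, max_diff=600.0, lag=20.0)
corrected = correct_swipe_times([1200.0, 1500.0], delta)
print(delta, corrected)
```

In this example the entry for route 178 on day d2 exceeds the threshold and is ignored, so it does not distort the estimated offset.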

3.2.2. Framework of Boarding Station Recognition

After correcting the system time difference based on the characteristics of multisource data, the overall framework of the boarding station recognition algorithm proposed in this paper can be summarized in two steps.

Step 1: System time correction

The transaction time of the first passenger recorded in the smart card data and the first bus arrival time of each day are extracted for all bus lines for system time correction. The corrected smart card data is then obtained with the system time correction algorithm introduced in the previous section.

Step 2: Boarding station identification

The boarding stations are identified based on the corrected smart card data and the bus arrival timetable, according to the condition that the card swiping behavior happens in the time interval between two stations. In other words, the transaction time should be later than the bus arrival time at one station but earlier than the bus arrival time at the next station, and the former station is regarded as the boarding station of that passenger.
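The interval condition in Step 2 amounts to a sorted lookup, which can be sketched as follows. The function name `boarding_station` and the timetable layout (a list of arrival time and station pairs for one vehicle run, sorted by time) are assumptions made for illustration. A swipe that coincides exactly with an arrival is attributed to that station, and a swipe after the last recorded arrival is attributed to the last station.

```python
from bisect import bisect_right

def boarding_station(swipe_time, timetable):
    """timetable: list of (arrival_time, station) for one vehicle run, sorted by time.
    Returns the station whose arrival most recently precedes the corrected swipe time."""
    times = [t for t, _ in timetable]
    idx = bisect_right(times, swipe_time) - 1
    if idx < 0:
        return None  # the swipe precedes the first recorded arrival
    return timetable[idx][1]

# Illustrative timetable in seconds.
timetable = [(100, "S1"), (220, "S2"), (350, "S3")]
matches = [boarding_station(t, timetable) for t in (90, 130, 220, 400)]
print(matches)
```

Returning `None` for a swipe before the first arrival makes such records easy to filter out or inspect, since they usually indicate a residual clock offset.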

4. Estimation Method of the Alighting Station Based on Trip Chaining

4.1. Definition of Bus Trip Chaining

Relevant research indicates that a single trip of urban residents is the basic unit of trip chaining. Bus trip chaining can be further defined as a process in which residents travel by bus and form at least one spatial connection between neighboring trips. A bus trip chain can be either closed or unclosed. As shown in Figure 2, a typical and complete bus trip chain consists of the origin station, bus lines, middle stations, transfer stations, transfer bus lines, and the destination station. It has the following two features: (1) the alighting station of each non-last bus trip of the passenger is spatially connected to the boarding station of his/her next bus trip on the same day and (2) the alighting station of the passenger’s last trip is spatially connected to the boarding station of his/her first trip.

However, in a multimodal transit system, passengers tend to use a combined travel pattern including different travel modes (e.g., bus, metro, taxi). Hence, the bus trip chain of a passenger is sometimes unclosed. As shown in Figure 3, the “alighting station” and “transfer station” in the cycle may not have a clear spatial connection within the same day, which indicates that the passenger may choose other travel modes between these two stations. In this regard, the unclosed trip chain can be restored from the passenger’s historical trip data, which distinguishes his/her regular trips from occasional ones.

In a bus trip chain, a typical passenger boards bus line $l_1$ at bus station $o_1$ and gets off at station $d_1$. Then, he/she transfers to bus line $l_2$ via station $o_2$ and gets off at station $d_2$. After n transfers, he/she arrives at the destination station $d_{n+1}$. The distance between the k-th alighting station $d_k$ and the (k+1)-th boarding station $o_{k+1}$ should be shorter than the acceptable walking distance. In summary, bus trip chaining should meet the following constraints: (1) component constraint: a bus trip chain must consist of at least two complete bus trips; (2) temporal constraint: the boarding time of the previous trip must be earlier than the boarding time of the next trip; (3) spatial constraint: the distance between the alighting station of the previous trip and the boarding station of the next trip should be shorter than the typically acceptable walking distance (e.g., 500–1500 meters); and (4) transfer constraint: the number of transfers in a bus trip chain should be lower than the acceptable frequency.
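The four constraints can be written as a simple validity check. The sketch below is illustrative only: the trip record layout, the planar coordinates in meters, the function name `is_valid_chain`, and the default limits are assumptions, and both limits are treated as inclusive upper bounds. Each consecutive pair of trips is counted as one transfer.

```python
import math

def is_valid_chain(trips, max_walk_m=500.0, max_transfers=3):
    """trips: time-ordered list of dicts with 'board_time' (s),
    'board_xy' and 'alight_xy' (planar coordinates in meters)."""
    if len(trips) < 2:                   # component constraint
        return False
    if len(trips) - 1 > max_transfers:   # transfer constraint
        return False
    for prev, nxt in zip(trips, trips[1:]):
        if prev["board_time"] >= nxt["board_time"]:  # temporal constraint
            return False
        walk = math.dist(prev["alight_xy"], nxt["board_xy"])
        if walk > max_walk_m:                        # spatial constraint
            return False
    return True

# Illustrative outbound and return trips of one card on one day.
morning = {"board_time": 100, "board_xy": (0, 0), "alight_xy": (1000, 0)}
evening = {"board_time": 900, "board_xy": (1000, 300), "alight_xy": (0, 100)}
print(is_valid_chain([morning, evening]), is_valid_chain([morning]))
```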

4.2. Alighting Stations Estimation Algorithm

On the one hand, it is difficult to estimate the unchained trips because of their irregularity and unpredictability. On the other hand, bus network planning usually focuses on the main travel behavior such as commuting. Therefore, estimating the alighting station for the chained trips can not only extract the spatiotemporal characteristics and OD matrix of most passengers’ bus travel, but also meet the needs of bus network planning and design. By splitting and regrouping the above dataset, the closed bus trip chain dataset and the unbroken part of the unclosed bus trip chain dataset are integrated to form the dataset of chained trips. Meanwhile, the broken part of the unclosed bus trip chain data and the remaining data are added to the dataset of unchained trips.

4.2.1. Alighting Stations Estimation of Non-Last Bus Travel

According to the result of boarding station identification, it is assumed that a passenger boards at station $o_k$ of bus line $l_k$ and that his/her next boarding station is station $o_{k+1}$ of bus line $l_{k+1}$. Based on the spatial constraint and the minimum station distance assumption, the passenger would choose the alighting station that is nearest to the next boarding station. Thus, among the stations downstream of station $o_k$ on line $l_k$, the one nearest to station $o_{k+1}$ of line $l_{k+1}$ is chosen as the alighting station of the current trip.

To ensure the effectiveness of data processing, it is recommended to consider only those stations on bus line $l_k$ whose distance from station $o_{k+1}$ on bus line $l_{k+1}$ is shorter than the acceptable walking distance. A table storing such transfer topology information is calculated in advance for subsequent selection and processing.
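A minimal sketch of this rule is given below. It assumes planar station coordinates in meters and an ordered stop list per line direction; `infer_alighting`, `transfer_table`, and the 500 m default are hypothetical names and values. The second function illustrates one way the transfer topology table could be precomputed.

```python
import math

def infer_alighting(line_stops, board_idx, next_board_xy, max_walk_m=500.0):
    """line_stops: ordered list of (name, (x, y)) along the direction of travel.
    Returns the downstream stop nearest to the next boarding location,
    or None if even the nearest one is beyond walking distance."""
    best, best_d = None, float("inf")
    for name, xy in line_stops[board_idx + 1:]:
        d = math.dist(xy, next_board_xy)
        if d < best_d:
            best, best_d = name, d
    return best if best_d <= max_walk_m else None

def transfer_table(line_a, line_b, max_walk_m=500.0):
    """For every stop of line_b, list the stops of line_a within walking distance."""
    return {nb: [na for na, xa in line_a if math.dist(xa, xb) <= max_walk_m]
            for nb, xb in line_b}

# Illustrative line with four stops, 400 m apart.
line = [("A", (0, 0)), ("B", (400, 0)), ("C", (800, 0)), ("D", (1200, 0))]
other = [("P", (850, 100)), ("Q", (5000, 0))]
print(infer_alighting(line, 0, (850, 100)))
print(transfer_table(line, other))
```

Restricting the search to downstream stops matters: in the example, a passenger who boards at C cannot be assigned to C itself, even though it is the closest stop to the next boarding location.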

4.2.2. Alighting Stations Estimation of Last Bus Travel

According to the result of boarding station identification, it is assumed that a passenger boards at station $o_N$ of bus line $l_N$ in his/her last trip of the day, and that his/her boarding station in the first trip of the day is station $o_1$ of bus line $l_1$. Based on the minimum station distance assumption and commuting rules, the passenger would choose the alighting station that is nearest to the first boarding station of that day. Thus, among the stations downstream of station $o_N$ on line $l_N$, the one nearest to station $o_1$ of line $l_1$ is chosen as the alighting station of the last trip of the day.
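Combining the non-last and last trip rules, a closed daily chain can be processed in one pass by letting the last trip wrap around to the first boarding station. The sketch below is an illustration under assumptions: trips are given as (line, boarding stop index) pairs in time order, coordinates are planar meters, and the names `nearest_downstream` and `infer_day` are hypothetical.

```python
import math

def nearest_downstream(stops, board_idx, target_xy, max_walk_m):
    """Nearest stop after board_idx to target_xy, or None if too far or no stop left."""
    candidates = stops[board_idx + 1:]
    if not candidates:
        return None
    name, xy = min(candidates, key=lambda s: math.dist(s[1], target_xy))
    return name if math.dist(xy, target_xy) <= max_walk_m else None

def infer_day(trips, lines, max_walk_m=500.0):
    """trips: time-ordered list of (line_id, board_idx) for one card on one day.
    lines: {line_id: ordered list of (name, (x, y))}.
    Returns one inferred alighting stop (or None) per trip."""
    if len(trips) < 2:
        return [None] * len(trips)  # a single trip cannot form a chain
    result = []
    for k, (line_id, board_idx) in enumerate(trips):
        # The target is the next boarding stop; the last trip wraps to the first one.
        next_line, next_idx = trips[(k + 1) % len(trips)]
        target_xy = lines[next_line][next_idx][1]
        result.append(nearest_downstream(lines[line_id], board_idx, target_xy, max_walk_m))
    return result

# Illustrative commute: outbound on line "out", return on line "back".
lines = {
    "out": [("H", (0, 0)), ("M", (500, 0)), ("W", (1000, 0))],
    "back": [("W2", (1000, 50)), ("M2", (500, 50)), ("H2", (0, 50))],
}
trips = [("out", 0), ("back", 0)]
print(infer_day(trips, lines))
```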

4.2.3. Alighting Stations Estimation of Unclosed Bus Trip Chain

An unclosed bus trip chain can be split into broken and unbroken parts. If the broken part occurs in the middle of the chain, it is usually recovered from the historical data, while the remaining unbroken part can be regarded as a continuous bus trip chain. If the broken part is at the head or tail of the chain, the alighting stations of non-last bus travel can be estimated in the same way as in a closed bus trip chain. Instead of the recovery method based on historical data, another method named chaining extension, which tries to establish a spatial connection between neighboring bus travels on neighboring days, is preferred because of its higher confidence.

After reasonably dividing the bus datasets that have been effectively identified on the boarding station, we use the proposed algorithm based on the bus trip chain to estimate the alighting station and obtain the OD matrix.

5. Experiment and Verification

In order to verify the effectiveness of the proposed methodology, a survey was conducted in Suzhou, China, to collect the real bus travel OD information. As shown in Figure 4, four typical bus lines with different functions are selected (see Table 4). The ticket recycling survey method is used to collect the real number of passengers from station to station.

5.1. Ticket Recycling Survey Method

The ticket recycling survey method is a classical follow-up survey which has been widely applied to obtain the real passenger distribution along a bus route [40]. As shown in Figures 5 and 6, each passenger receives a ticket at the front door that records the number of the station where he/she boards the bus. This ticket is collected before the passenger alights from the back door. Though this method is time consuming and costly, it has a 100% sample rate and therefore obtains the real station-to-station trip information. The survey contains 72 round trips of these four lines. The total number of valid tickets is 10,551.

5.2. Accuracy Evaluation

By counting the numbers of boarding and alighting passengers surveyed, the mean absolute error (MAE) of each class (i.e., from the origin station to the destination station) can be calculated. As a result, the MAE of each class is 1.326 passengers/station for the number of boarding passengers and 1.299 passengers/station for the number of alighting passengers. The detailed statistical description of the MAE is presented in Table 5.

Figure 7 shows the verification results of bus line 146 (departure time: 8:25 am, October 24) and bus line 178 (departure time: 7:45 am, October 22). In order to find out whether the algorithm is accurate for most stations, this paper also proposes another checking indicator, namely, the percentage of stations whose error is less than a certain threshold (number of passengers) relative to the total number of stations. The result is considered accurate when the estimated value differs from the actual value by two or less. As shown in Figure 8, on average 81.8% of the stations are accurately estimated for boarding, and this indicator is 80.9% for the alighting station estimation.
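Both indicators are straightforward to compute from per-station counts. The following sketch uses made-up numbers purely for illustration; the function names are hypothetical, and the threshold of two passengers follows the accuracy definition above.

```python
def mean_absolute_error(estimated, observed):
    """Average absolute difference per station, in passengers/station."""
    return sum(abs(e - o) for e, o in zip(estimated, observed)) / len(observed)

def share_within(estimated, observed, threshold=2):
    """Fraction of stations whose estimate differs from the survey by at most threshold."""
    hits = sum(1 for e, o in zip(estimated, observed) if abs(e - o) <= threshold)
    return hits / len(observed)

# Illustrative boarding counts at five stations of one bus run (not survey data).
estimated = [5, 3, 0, 8, 2]
observed = [4, 3, 3, 8, 1]
mae = mean_absolute_error(estimated, observed)
share = share_within(estimated, observed)
print(mae, share)
```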

Furthermore, there is a certain systematic error between the sampling time of the card reader and the POS time, which is difficult to obtain. Table 6 therefore lists the percentage of accurate inferences for each bus line together with the limited dynamic time warping (LDTW) distance. LDTW is usually used to evaluate time series similarity. In this study, a constraint on the warping distance is added to the standard dynamic time warping. This means that a value in one time series can only be matched with the values in the adjacent intervals of the other. This indicator is more robust than an absolute error because it accounts for the correlation between adjacent stations in attracting passenger flow.
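The exact LDTW formulation is not spelled out here, so the sketch below is only one plausible reading: standard dynamic time warping restricted to a band of width one, so that the count at station i may be aligned only with stations i-1, i, or i+1 of the other series. The function name `ldtw` and the `window` parameter are assumptions, and the absolute difference is assumed as the local cost.

```python
def ldtw(a, b, window=1):
    """Band-constrained DTW distance between two equal-length count series.
    window=0 reduces to the plain sum of absolute errors."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells within the band around the diagonal are filled.
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Five passengers assigned one station too early: large absolute error, zero LDTW.
estimated = [0, 5, 0, 0]
observed = [0, 0, 5, 0]
print(ldtw(estimated, observed), ldtw(estimated, observed, window=0))
```

The example shows why such a measure is more forgiving than a per-station error when passengers are attributed to a neighboring stop.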

Another practical benefit of using LDTW for evaluation is that it demonstrates how well the algorithm predicts the overall trend. Because adjacent stations are usually close enough to be treated as one small district, LDTW assesses the estimates at the level of a small area rather than a single station. The verification results show that the accuracy of line 5 and line 146 is above 80%, and the corresponding LDTW is less than 10. The performance on line 301 is comparatively worse, but the accurate percentage is still above 70%, and the LDTW is less than 15. Tables 7 and 8 present a detailed comparison between the estimated and observed results for the 10 stations with the highest passenger volume. The estimation errors for boarding passengers are almost all less than 10 passengers, while those for alighting passengers are less than 15 passengers.

5.3. Accuracy Evaluation of OD Matrix Estimation

By combining the results of boarding and alighting station estimation, the accuracy of the OD matrix for each bus class can be calculated (the statistics exclude OD pairs whose estimated and actual values are both 0). The mean absolute error of the OD estimation is 0.581. Figure 9 shows the distribution of OD estimation accuracy. In Figure 9(a), the accuracy threshold is a difference of 2 passengers between the inferred and actual OD values, and the average accuracy is 94.3%. In Figure 9(b), the threshold is tightened to 1 passenger, and the average accuracy is 72.8%. Table 9 lists the MAE of OD estimation for the different bus lines. The table shows that the accurate percentage of each line (error of less than 2 passengers) is above 90%, while the MAE is generally below 0.7. Comparing the estimated values with the real data from the bus travel survey shows that the proposed algorithm performs well both in the identification of boarding and alighting stations and in the overall OD matrix estimation. It can effectively obtain the passenger flow at each station and the OD flows of different bus lines.
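The OD-level evaluation can be sketched in the same way. The function name and the sample matrices below are hypothetical, and the sketch assumes that an OD pair counts as accurate when its absolute error does not exceed the threshold; pairs whose estimated and actual values are both 0 are skipped, as described above.

```python
def od_metrics(estimated, observed, threshold=2):
    """Return (MAE, accurate share) over OD pairs that are not zero in both matrices."""
    errors = []
    for est_row, obs_row in zip(estimated, observed):
        for est, obs in zip(est_row, obs_row):
            if est == 0 and obs == 0:
                continue  # excluded from the statistics
            errors.append(abs(est - obs))
    if not errors:
        raise ValueError("no OD pair with a non-zero value")
    mae = sum(errors) / len(errors)
    share = sum(err <= threshold for err in errors) / len(errors)
    return mae, share


# Hypothetical 3-station OD matrices for one class (rows: origin, columns: destination).
estimated_od = [[0, 3, 1],
                [0, 0, 4],
                [0, 0, 0]]
observed_od = [[0, 2, 1],
               [0, 0, 7],
               [0, 0, 0]]

# Three OD pairs are evaluated, with absolute errors 1, 0, and 3.
od_mae, od_share = od_metrics(estimated_od, observed_od)
```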

6. Conclusions

The accurate estimation of the bus OD matrix is essential for the planning and management of urban bus systems. This paper proposes a framework for estimating the OD matrix, including a boarding station identification algorithm and an alighting station inference algorithm. The boarding station identification algorithm allows for the mismatch of system time between the smart card system and the bus GPS system. These two datasets are aligned in the temporal dimension through the density-based clustering algorithm. Then, based on the identified chained and unchained trips, an alighting station inference algorithm based on trip chaining is designed. More than 80% of alighting stations can be identified by using the proposed estimation algorithm, which increases the identification rate by 10% compared with previous studies [33, 36, 38]. Finally, the accuracy of the proposed algorithm is evaluated on the data collected by the field survey in Suzhou, China. The results show that the proposed methodology can achieve an accuracy level above 90%.

The algorithm can be further improved in the following aspects. First, the proposed framework works only within the bus network. However, with the development of multimodal urban transport systems, travelers’ trip chains can include other travel modes such as light rail, subways, and even shared bicycles. Second, more data sources can be introduced to compensate for the incomplete market share of smart cards.

Data Availability

The data used to support the findings of this study have not been made available because of confidentiality concerns.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported by the National Key Research and Development Program of China (No. 2018YFB1600900), the Key Project (No. 51638004) of the National Natural Science Foundation of China, the Scientific Research Foundation of Graduate School of Southeast University (No. YBPY1835), and DiDi Gaia Research Collaboration Initiative. The authors would like to thank Cheng Lyu and Yunyang Shi for their efforts in data preprocessing.