#### Abstract

The planning and operation of urban buses depend heavily on the time-varying origin-destination (OD) matrix for bus passengers. In most cities, however, only boarding information is recorded, while alighting information is not available. This paper proposes a novel method to predict the destination of a single bus passenger based on bus smartcard data, metro smartcard data, and global positioning system (GPS) bus data. First, the attractiveness of each bus stop on a bus line was evaluated, considering the attractiveness of nearby metro stations. Then, the exploration and preferential return (EPR) model was employed to estimate the probability that a bus stop is the alighting stop, i.e., the destination, of a passenger. The estimation result was obtained through a simulation based on the Monte Carlo (MC) algorithm. The effectiveness of the method was demonstrated through a case study on the bus network in Shenzhen, China.

#### 1. Introduction

The origin-destination (OD) estimation of bus passengers is essential to the network planning and operation of buses. Traditionally, the ODs of bus passengers were estimated by questionnaire surveys, which are small in sample size and low in precision. In recent years, the OD estimation of bus passengers has progressed rapidly, owing to the proliferation of geoinformation systems and smartcard techniques, and GPS and bus IC card data are now widely used. Bus IC cards provide a large amount of low-cost, high-accuracy data for analyzing the travel characteristics of bus passengers. However, they provide only the information related to passengers' boarding. To obtain the passenger travel OD, the alighting stops must be inferred.

Many scholars have investigated the OD estimation problem. The early studies mainly derived the OD matrix of public transit passengers from the passenger flow at each station. For instance, Ben-Akiva et al. [1, 2] proposed an OD derivation method based on the survey data on ODs. After analyzing the passenger flow at each bus stop, Navick et al. [3, 4] estimated the travel patterns of bus passengers and constructed the OD matrix of these passengers. Tsygalnitsky [5] predicted the destinations of public transit passengers under the assumption that all the passengers boarding the same type of public transit vehicle at the same station have the same probability to alight at another station.

Moreover, Li and Cassidy [6] designed an algorithm that does not need a seed matrix to estimate the ODs of public transit passengers. Based on boarding and alighting counts at each stop along the route, the designed algorithm deduces an OD matrix for the entire trip and forecasts the probabilities for passengers to board and alight at each stop along the route. These probabilities tend to remain fixed throughout the trip. Compared with Tsygalnitsky’s prediction technique, Li and Cassidy’s algorithm is highly suitable for general use. Using boarding and alighting counts, Li [7] developed an efficient statistical inference method for a closed-form OD matrix of a travel route: the Markov chain model was adopted to capture the correlations between matrix elements and reduce the number of unknown parameters; then, the unknown parameters of the Markov chain model were inferred by Bayesian analysis.

In practice, the OD estimation is often realized through iterative proportional fitting, in which the OD matrix is adjusted continuously based on survey data, and the passenger flow of each bus stop is obtained by adding up the row and column vectors of the OD matrix [8, 9]. This approach is efficient and easy to implement but costly in terms of labor and money. In addition, very few cities have adopted an automatic passenger counter (APC) system to collect the information of bus passenger flows automatically.
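As a rough illustration of iterative proportional fitting, the sketch below alternately rescales the rows and columns of a seed matrix until they match given boarding and alighting totals. The function name `ipf`, the fixed iteration count, and the toy numbers are assumptions made for this example; they are not taken from [8, 9].

```python
def ipf(seed, row_targets, col_targets, iterations=50):
    """Scale a seed OD matrix so its row and column sums approach the targets."""
    matrix = [row[:] for row in seed]  # work on a copy
    for _ in range(iterations):
        # Row step: match the boarding total of each origin stop.
        for r, target in enumerate(row_targets):
            total = sum(matrix[r])
            if total > 0:
                matrix[r] = [v * target / total for v in matrix[r]]
        # Column step: match the alighting total of each destination stop.
        for c, target in enumerate(col_targets):
            total = sum(row[c] for row in matrix)
            if total > 0:
                for row in matrix:
                    row[c] *= target / total
    return matrix


# Toy example: uniform seed, 100 passengers in total.
fitted = ipf([[1.0, 1.0], [1.0, 1.0]], row_targets=[30, 70], col_targets=[40, 60])
```

The row and column totals must describe the same number of passengers for the procedure to settle on a consistent matrix.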

Recently, new methods have emerged to estimate the destinations of bus passengers based on global positioning system (GPS) data and smartcard data. Using the APC data, Barry et al. [10] estimated the destinations of bus passengers in New York City based on the trip chain theory. Zhao [11] derived the destinations of passengers that transfer between metro lines or between metro and bus. Seaborn et al. [12] explored the time span for transfer between public transit modes and relied on the time span to identify multimode transfers. Hofmann and O’Mahony [13] determined the transfer stations according to the time difference between two card swipes.

In addition, based on GPS data, Giannotti et al. [14] researched the detailed trajectory of vehicles and the frequent mode of urban residents’ travel, the prediction method of traffic intensive areas, and the description method of traffic congestion; Jiang et al. [15] found that human travel distance obeyed power-law distribution using GPS data. Based on the travel survey data, Garske et al. [16] studied the residents’ travel in two cities with different economic development levels in China. Kölbl and Helbing [17] found that under different travel modes, people’s travel distance follows a general distribution law. In [18], the daily travel logs of 230 volunteers in Frauenfeld, Switzerland, were analyzed. The author found that the travel distance of the group obeyed a power-law distribution with exponential truncation, which was very close to the empirical research results based on mobile phone data, and most individual travel distance does not conform to power-law distribution. Cui [19] deduced the boarding stations of passengers from the data collected by the automatic fare collection system and the automatic vehicle location system [20]. Farzin [21] analyzed the boarding stations of passengers in São Paulo, Brazil, referring to information of integrated circuit (IC) bus cards, GPS data, and bus stop data. Xu et al. [22] estimated the destinations of public transit passengers, in the light of travel distance distribution and bus stop features. Xu et al. [23] clustered smart card data before estimating the alighting stops of passengers.

The current research focuses on two aspects. First, the distribution characteristics of passenger travel distance are analyzed to estimate the alighting stop on a single bus route. Second, the OD is obtained by using the time difference between card swipes to identify transfers between different modes of transportation. In terms of individual travel characteristics, it has been found that the average travel distance differs significantly among individuals, that the frequency of an individual's visits to a location follows a power-law distribution, and that individuals with different average travel distances are highly similar in the spatial distribution of their locations [24].

Therefore, it is reasonable to use the distribution characteristics of travel distance to estimate the alighting stop on a single bus line without metro transfer. However, the study in [24] is based on mobile phone signaling data, whose travel modes include walking and nonmotorized vehicles; there, the travel distance is a continuous variable measured from the origin to the destination rather than to the next stop. When passengers travel by bus, the travel distance is a discrete variable, and the next stop may not be the destination, so transfers must be considered. In addition, according to the habits of human travel activities, travelers tend to return to places that they have visited frequently, such as home and office [25]. For passengers with a large number of historical travel records, it is therefore more accurate to predict the next stop from their historical travel patterns.

In general, the destinations of bus passengers on a single bus line are estimated from the trip chain of boarding stations and the historical smartcard data, or from the land use attributes of all bus stops. In this paper, the historical travel features of a single passenger and the attractiveness of the metro stations near a bus stop are analyzed in detail, and the alighting stop, i.e., the destination, of that passenger is estimated by using the exploration and preferential return (EPR) model [26] and the Monte Carlo (MC) algorithm. We also try to estimate the destinations of bus passengers in a future connected and autonomous vehicle environment [27–31].

#### 2. Data Sources

The research data can be divided into five datasets: bus line data, bus smartcard data, bus GPS data, metro smartcard data, and road network data.

##### 2.1. Bus Line Data

Bus line data contain the information of each bus stop on a bus line, including but not limited to coordinates, name, and bus line number. In total, the data were collected from 1,516 bus lines in Shenzhen.

##### 2.2. Bus Smartcard Data

The bus smartcard data refer to the transaction information captured by the smartcard fare collector once a passenger swipes his/her smartcard upon boarding a bus. The data cover passenger identity (ID), boarding time, bus ID, and bus line number. Each bus smartcard carries a unique passenger ID that can be identified easily. In this research, the bus smartcard data were provided by ShenZhenTong, the largest public transit service provider in Shenzhen, and collected in the 21 days from October 11^{th} to 31^{st}, 2014. These data are for the three weeks after the National Day holiday. The three weeks include 18 working days and 4 nonworking days. As shown in Figure 1, the average number of card swipes per week in this period is the largest in the whole year, so the data are representative.

##### 2.3. Bus GPS Data

Each bus has a GPS tracker that records the bus position in real time. The bus ID and bus line number recorded by the GPS tracker are unique, allowing us to match bus GPS data with bus smartcard data.

##### 2.4. Metro Smartcard Data

Metro smartcard data include the name of boarding and alighting metro stations, passenger ID, and boarding and alighting time. In this research, the metro smartcard data were also provided by ShenZhenTong and collected in the same period.

##### 2.5. Road Network Data

The road network data of Shenzhen (N: 22.46°–22.83°; E: 113.78°–114.59°) were simplified into 21,115 sections.

#### 3. Methodology

Based on the above data, this paper aims to estimate the destination for every single passenger boarding a bus line. The estimation process can be broken down into three tasks. The first task is to determine the arrival time and location of the bus at each stop of the bus line by matching bus line data with bus GPS data. The second task is to identify the boarding station of the passenger by matching GPS data with bus smartcard data. The third and final task is to estimate the destination of the passenger by using the EPR model, with metro smartcard data as the basis for exploration.

##### 3.1. Data Matching

The bus smartcard only records the boarding time and vehicle of the passenger, but not the boarding location (station). To determine the boarding station, it is necessary to match bus smartcard data with bus GPS data, i.e., complete the first two tasks mentioned at the beginning of this section.

Task 1: matching bus line data with bus GPS data. In general, a bus line is composed of an upline and a downline; that is, a bus moves in one of the two opposite directions at a time. Considering the inevitable errors in bus GPS data and the close proximity between stops with similar names, it is infeasible to match bus line data with bus GPS data based solely on the distance between the position reported by the GPS tracker on the bus and the position of each stop. Instead, the moving direction of the bus should be identified from its GPS trajectory, and then the locations should be matched based on the direction. The matching process is illustrated in Figure 2, where *L*_{u} and *L*_{d} are the upline and downline, respectively. Let *S* and *S*′ be the departure stations of *L*_{u} and *L*_{d}, respectively. Ten consecutive tracking points were chosen from the GPS data of a bus *M*_{1} that operates along line *L*. Then, the distance from each tracking point to *S* or *S*′ was calculated. If the distance of these points from station *S* (*S*′) increases, then the bus is on the up (down) line, that is, *L*_{M1} = *L*_{u} (*L*_{d}). Next, the bus GPS data were matched with the stops along the identified direction. The location matching was deemed successful when the GPS location fell within 100 m of a stop, and the matching time was taken as the arrival time at that stop. In this way, the arrival times of bus *M*_{1} at all stops along line *L* can be obtained as *T*_{i}, *i* = 1, 2, 3, …, *n*.

Task 2: matching bus smartcard data with bus GPS data. Suppose a passenger boards *M*_{1} at station *S*_{i} and swipes his/her smartcard at time *T*_{p}. Then, *T*_{p} was taken as the boarding time of the passenger. Comparing *T*_{p} with *T*_{i}, the boarding stop *S*_{0}, i.e., the origin of the passenger, can be identified as the station *S*_{i} with the lowest time difference from *T*_{p} (Figure 3). Any of the stops following the origin could be the destination of the passenger.
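The two matching tasks can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the tuple layouts (`(lat, lon)` for stops and `(time, lat, lon)` for GPS points), the haversine distance, and the choice of the first GPS point inside the 100 m radius as the arrival time are all choices made for this example rather than details given in the paper.

```python
import math


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))


def detect_direction(track, up_start, down_start):
    """Return 'up' if the track moves away from S, 'down' if away from S', else None."""
    def moving_away(origin):
        d = [haversine_m(pt, origin) for pt in track]
        return all(later > earlier for earlier, later in zip(d, d[1:]))

    if moving_away(up_start):
        return "up"
    if moving_away(down_start):
        return "down"
    return None


def arrival_times(gps, stops, radius_m=100.0):
    """First timestamp at which the bus is within radius_m of each stop (None if never)."""
    return [
        next((t for t, lat, lon in gps
              if haversine_m((lat, lon), stop) <= radius_m), None)
        for stop in stops
    ]


def boarding_stop(arrivals, swipe_time):
    """Index of the stop whose arrival time is closest to the card swipe time."""
    candidates = [(abs(t - swipe_time), i)
                  for i, t in enumerate(arrivals) if t is not None]
    return min(candidates)[1] if candidates else None
```

In use, `detect_direction` would be called on ten consecutive tracking points, `arrival_times` on the stops of the detected direction, and `boarding_stop` once per card swipe.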

##### 3.2. Destination Estimation of a Single Bus Passenger

The destination of a bus passenger depends on various factors, namely, travel distance and attractiveness of each stop. The latter, referring to the possibility that a passenger alights at the stop, encompasses the attractiveness of the stop itself and extra attractiveness. Here, extra attractiveness is measured by the attractiveness of metro stations near the stop because buses often serve as the feeders for metro in the large and dense transit network in Shenzhen. The two mechanisms of the EPR model and the MC algorithm are introduced to estimate the destination of a single bus passenger.

###### 3.2.1. Explore Mechanism

The explore mechanism applies to the scenario where there is no card swiping record at the previous boarding stops of the passenger. For such a passenger, the possibility that he/she alights at a stop is the probability for the stop to be the destination. According to the definition and composition of stop attractiveness, the probability covers two items, namely, the probability arising from the attractiveness of the stop itself and the probability stemming from the attractiveness of nearby metro stations.

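The travel distance can be approximated by the number of stops passed by the bus. Without considering the probability stemming from nearby metro stations, the number of stops passed by the bus follows the Poisson distribution. In this case, the probability *P*_{s}(*j*) arising from the attractiveness of the stop itself can be expressed as

$$P_s(j) = \frac{\lambda^{\,j-i}\, e^{-\lambda}}{(j-i)!}, \qquad j = i+1, \ldots, n, \tag{1}$$

where *i* and *j* are the serial numbers of the boarding and alighting stops, respectively (the serial numbers were assigned from the departure station of the bus line); *n* is the number of stops on the bus line; and *λ* is the mean number of stops in a bus line (*λ* was set to 10 for Shenzhen [32]). If the number of remaining stops after the boarding stop is fewer than *λ*, then *λ* is reduced accordingly.

Because only the stops after the boarding stop are candidate destinations, the Poisson distribution can be normalized as

$$\hat{P}_s(j) = \frac{P_s(j)}{\sum_{k=i+1}^{n} P_s(k)}. \tag{2}$$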
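To make the normalization concrete, the following sketch computes Poisson weights over the stops after the boarding stop and rescales them to sum to one. The function name `own_attractiveness` is hypothetical, and capping the Poisson mean at the number of remaining stops is an assumption of this example, since the adjustment rule for short remaining segments is not spelled out here.

```python
import math


def own_attractiveness(i, n, mean_stops=10.0):
    """Normalized Poisson probabilities for stops i+1..n, given boarding stop i."""
    remaining = n - i
    if remaining <= 0:
        return {}
    # Assumption: cap the Poisson mean at the number of remaining stops.
    lam = min(mean_stops, remaining)
    raw = {
        j: math.exp(-lam) * lam ** (j - i) / math.factorial(j - i)
        for j in range(i + 1, n + 1)
    }
    total = sum(raw.values())
    return {j: w / total for j, w in raw.items()}


# Boarding at stop 3 of a 30-stop line.
probs = own_attractiveness(3, 30)
```

With a mean of 10, the largest probabilities fall on the stops nine or ten stops after boarding, which is the behavior expected of a Poisson distribution.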

However, another determinant of bus passengers’ destinations is the nature of land use. Stops with shopping malls and entertainment sites nearby are more attractive, and stops near transportation hubs have the largest numbers of boarding and alighting passengers. Owing to the round-trip nature of residents’ public transit travel, trip generation and attraction at each stop are roughly balanced; that is, the more people board at a stop, the more people alight there. The attraction intensity of each stop is therefore calculated by counting the total number of passengers at the stop, based on the boarding stops identified in the preceding data-matching step.

The metro records near the boarding stop were statistically analyzed. A bus stop was considered near a metro station if the distance between them is smaller than 1,000 m. This distance leads to a time difference between the bus record and the metro record of a transfer passenger. Here, the time difference is set to 30 min; that is, the metro records generated within 30 min after the passenger swiped his/her smartcard on the target bus are counted. The records at the nearby metro stations were used to measure the attractiveness of the bus stop. If there is no metro station nearby, the attractiveness of the bus stop was set to zero.
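The counting rule above can be illustrated with a short Python sketch. The data layout (a list of dicts with `distance_m` and `entry_times` keys) and the function name `metro_record_count` are hypothetical; only the 1,000 m radius and the 30 min window come from the text.

```python
from datetime import datetime, timedelta

NEAR_RADIUS_M = 1000                     # stop-to-station distance threshold
TRANSFER_WINDOW = timedelta(minutes=30)  # transfer time threshold


def metro_record_count(bus_swipe_time, stations):
    """Count metro records at nearby stations inside the transfer window.

    `stations` is a hypothetical list of dicts with the keys 'distance_m'
    (distance to the bus stop) and 'entry_times' (a list of datetimes).
    """
    count = 0
    for station in stations:
        if station["distance_m"] >= NEAR_RADIUS_M:
            continue  # this station is not "near" the bus stop
        for t in station["entry_times"]:
            if timedelta(0) <= t - bus_swipe_time <= TRANSFER_WINDOW:
                count += 1
    return count
```

A stop with no station inside the radius yields a count of zero, which matches the rule that its metro-related attractiveness is zero.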

Then, the probability of bus stop *j* stemming from the attractiveness of nearby metro stations can be computed from the number of records at the nearby metro stations generated within 30 min after the passenger swiped his/her smartcard on the target bus.

According to the literature [33], which analyzed more than 230,000 records of transfers between conventional buses and the subway at 5-minute intervals, most transfers take less than 20 minutes, and only 2–4% of all transfers exceed 20 minutes (Figure 4). To ensure that transfer samples are identified as completely as possible, 30 min is selected as the transfer time threshold.

According to the agglomeration effect of public transit stations, the attractiveness of a metro station decreases with its distance to the bus stop. Hence, the attractiveness is taken to decay exponentially with *s*, the distance from a metro station to the bus stop, at a rate given by the intensity of the agglomeration effect, where a value of 1 denotes the peak agglomeration effect.

Then, the metro-related attractiveness of the bus stop can be obtained from the distance-weighted records of its nearby metro stations.
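As an illustration only, the distance decay and the resulting metro-related probability could be written as follows. The symbols $\beta$ (decay intensity), $s_{jk}$ (distance from bus stop $j$ to metro station $k$), $M_k$ (record count at station $k$ inside the transfer window), $A_j$, and the normalization over the candidate stops $i+1, \ldots, n$ are assumptions made for this sketch, not the authors' notation.

$$w(s) = e^{-\beta s}, \qquad A_j = \sum_{k \,\in\, \text{stations near } j} M_k \, e^{-\beta s_{jk}}, \qquad P_m(j) = \frac{A_j}{\sum_{l=i+1}^{n} A_l}.$$

Under this form, a bus stop with no nearby metro station has $A_j = 0$ and therefore contributes no metro-related probability.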

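On this basis, the probability of a stop being the destination for a single bus passenger can be calculated as a weighted combination of the probability arising from the stop itself and the probability arising from nearby metro stations, where *α* is the coefficient of the attractiveness of nearby metro stations, i.e., the weight of the metro-related probability, and the weight of the stop's own probability is 1. The value of *α* is positively correlated with the proportion of passengers taking metro instead of bus.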
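One simple way to realize such a weighted combination is sketched below: the stop's own probability gets weight 1, the metro-related probability gets weight `alpha`, and the result is rescaled to sum to one. The function name `combine`, the dictionary inputs, and the final renormalization are assumptions of this example; the exact combined formula may differ.

```python
def combine(p_own, p_metro, alpha=0.7):
    """Weighted combination of two per-stop probability tables.

    p_own and p_metro map stop numbers to probabilities; a stop missing
    from p_metro is treated as having no nearby metro station.
    """
    raw = {j: p + alpha * p_metro.get(j, 0.0) for j, p in p_own.items()}
    total = sum(raw.values())
    if total == 0:
        return {}
    return {j: w / total for j, w in raw.items()}


# Two candidate stops; only stop 1 has a metro station nearby.
combined = combine({1: 0.5, 2: 0.5}, {1: 1.0}, alpha=0.7)
```

Because the combined values are rescaled, they can be used directly as sampling weights in a simulation.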

###### 3.2.2. Preferential Return Mechanism

According to the literature [25], as the historical travel data of a resident accumulate over time *t*, the number of distinct places visited grows as *S*(*t*) ∼ *t*^{μ}, *μ* ≈ 0.6, and the visit frequency of the *k*-th most visited place follows *f*_{k} ∼ *k*^{−ξ}, *ξ* ≈ 1.2; the authors also pointed out that the continuous-time random walk (CTRW) model does not reproduce these patterns accurately. The main idea of the EPR model is that an individual explores a new location with probability *ρS*^{−γ} and returns to a previously visited location with probability 1 − *ρS*^{−γ}; in the latter case, the probability of returning to a given place is directly proportional to the number of previous visits to that place, as shown in Figure 5.

Next, the preferential return mechanism was employed to predict the destination under the scenario that there is no card swiping record on the bus line boarded by the passenger. In general, passengers prefer to alight at frequently visited places, such as home and workplace. Thus, the basic idea of the preferential return mechanism is that passengers tend to alight at stops with more historical card swiping records. In other words, the probability for a stop to be the destination is proportional to the historical record count of the passenger at that stop.

To eliminate the interference of stops with similar names, the smartcard records within 100 m were counted as the records of one stop, where 100 m is the return range. A stop with many historical records has a high probability of being returned to, and this probability is directly proportional to the number of historical records.

Based on the historical records of a passenger, the probability of a stop to be the destination can be described as

$$P_r(i) = \frac{h_i}{\sum_{k=1}^{m} h_k},$$

where *i* is the serial number of stops following the boarding stop; *h*_{i} is the number of historical records of stop *i*; and *m* is the number of stops with a probability of being returned to.
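The proportional rule can be sketched as follows. The function name `return_probabilities` and the dictionary input (stop number mapped to the passenger's historical record count, after records within 100 m have been merged) are assumptions of this example.

```python
def return_probabilities(history_counts):
    """Probability of returning to each stop, proportional to past record counts.

    history_counts maps each downstream stop to the passenger's number of
    historical records there; stops without records are dropped.
    """
    total = sum(history_counts.values())
    if total == 0:
        return {}  # no history: the explore mechanism would apply instead
    return {stop: count / total
            for stop, count in history_counts.items() if count > 0}


# A passenger with 12 past records at stop 8 and 4 at stop 11.
probs = return_probabilities({8: 12, 9: 0, 11: 4})
```

In this toy case, stop 8 receives probability 0.75 and stop 11 receives 0.25, while stop 9 is excluded because the passenger has never used it.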

###### 3.2.3. MC Algorithm

Finally, the destination of a single bus passenger was predicted by using the MC algorithm. Based on probability theory, the MC algorithm relies on a random probability model to approximate the probability through simulation and statistical testing on random variables. As shown in Figure 6, the MC algorithm is implemented in the following steps [20]:

Step 1: construct and describe the probability process as formulae (2) and (6).

Step 2: determine the sample size, draw samples from the probability distribution, and set the number of simulations to 1,000 for each passenger.

Step 3: confirm the estimation, i.e., the alighting stop.

The estimated destination is the stop with the largest number of occurrences in the 1,000 simulations. If a passenger boards at stop *S*_{0} of bus line *L*, then the stops after *S*_{0} are numbered as *S*_{1}, *S*_{2}, …, *S*_{N} in turn. The number of occurrences of each stop in the 1,000 simulations is denoted as *c*_{1}, *c*_{2}, …, *c*_{N}, respectively, and the estimated destination as *S*_{i}, where *c*_{i} = max(*c*_{1}, *c*_{2}, …, *c*_{N}).
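The simulation step can be sketched with the Python standard library. The function name `estimate_destination`, the dictionary input, and the optional `seed` argument (added only for reproducibility) are assumptions of this example; the 1,000 draws per passenger follow the text.

```python
import random
from collections import Counter


def estimate_destination(probabilities, simulations=1000, seed=None):
    """Draw alighting stops at random and return the most frequent one.

    probabilities maps each candidate stop to its alighting probability
    (or any non-negative weight); returns (estimated stop, occurrence counts).
    """
    rng = random.Random(seed)
    stops = list(probabilities)
    weights = [probabilities[s] for s in stops]
    draws = rng.choices(stops, weights=weights, k=simulations)
    counts = Counter(draws)
    best_stop, _ = counts.most_common(1)[0]
    return best_stop, counts


# Three candidate stops, with stop 6 clearly the most likely.
stop, counts = estimate_destination({5: 0.1, 6: 0.8, 7: 0.1}, seed=0)
```

With well-separated probabilities the result coincides with the most probable stop; when several stops have similar probabilities, repeated runs may return different stops.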

#### 4. Case Study

##### 4.1. Estimation of the Boarding Station

The proposed method was applied to predict the destination of every single bus passenger in the 1,516 bus lines across Shenzhen. The smartcard bus data, bus GPS data, and smartcard metro data were collected by ShenZhenTong in 21 days from October 11th to 31st, 2014, including 14,109 trip records for 23 passengers, as well as the trajectories and stop locations of the 1,516 bus lines.

During the 21 days from October 11 to 31, 2014, the total number of card swipes was 268,623 (excluding subway passengers), made by 113,625 passengers. The sample was collected by random sampling, and the sample size is calculated as shown in Table 1.

Sample size *n* is shown below:

As mentioned before, the road network data of Shenzhen (N: 22.46°–22.83°; E: 113.78°–114.59°) were simplified into 21,115 sections.

For simplicity, the downtown of Shenzhen was divided into zones, each of which is 555 m by latitude and 515 m by longitude. Then, the smartcard records, GPS bus data, and stop locations were fused to obtain spatial distribution of the boarding stops for bus passengers in Shenzhen during the five days from October 27^{th} to 31^{st}, 2014. As shown in Figure 7, most passengers board buses at the center of the city; that is, the bus stops at the central area of Shenzhen have relatively high attractiveness.
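The zoning step can be illustrated with a simple grid. A cell of 0.005° spans roughly 556 m north-south and, at Shenzhen's latitude, roughly 514 m east-west, which is close to the quoted zone size; the use of a 0.005° grid is therefore an inference made for this sketch, and the function names and spherical-Earth constant are likewise assumptions.

```python
import math
from collections import Counter

CELL_DEG = 0.005  # assumed cell size in degrees (inferred, not stated in the text)
METERS_PER_DEG = math.pi * 6371000.0 / 180.0  # spherical-Earth approximation


def cell_size_m(lat_deg, cell_deg=CELL_DEG):
    """Approximate north-south and east-west extent of one cell in metres."""
    north_south = cell_deg * METERS_PER_DEG
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


def grid_cell(lat, lon, cell_deg=CELL_DEG):
    """Integer (row, column) index of the cell containing a point."""
    return math.floor(lat / cell_deg), math.floor(lon / cell_deg)


def boardings_per_cell(points, cell_deg=CELL_DEG):
    """Count boarding records per grid cell from (lat, lon) pairs."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)
```

The per-cell counts returned by `boardings_per_cell` are the kind of quantity a boarding heat map would display.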

To differentiate the estimated results of exploration mechanism from those of the preferential return mechanism, the boarding stops of the 293 bus passengers with and without historical records are displayed in Figures 8 and 9, respectively.

##### 4.2. Destination Estimation

###### 4.2.1. Determination of the *α* Value

The value of *α*, that is, the weight of the metro-related probability or the coefficient of the attractiveness of nearby metro stations, was set to 0.5 and 0.7 in turn during the destination estimation. The results show that the *α* value has a limited effect on the estimation. Since the ratio of metro trips to bus trips in Shenzhen is 3 : 7, the value of *α* was set to 0.7.

###### 4.2.2. Destination Estimation

Our method was adopted to estimate the alighting stop, i.e., destination, of every single bus passenger from October 27^{th} to 31^{st}, 2014. The estimated destinations are displayed as the heat map in Figure 10. It can be seen that the destinations are concentrated in the central area of the city, revealing a correlation between boarding and alighting locations.

Next, the exploration mechanism and the preferential return mechanism were separately adopted to estimate the destinations of each of the 293 passengers without historical records on the bus line, with *α* = 0.7. The estimated results of the two mechanisms are presented in Figures 11 and 12, respectively.

##### 4.3. OD Distributions in Different Periods

Based on the above estimation, the distribution of origins (boarding stops) and destinations (alighting stops) was illustrated for different periods of the day (Figures 13−18).

As shown in Figures 13−18, the origin distribution in the morning peak is consistent with the destination distribution in the evening peak, and both are clustered in residential areas. This suggests that the bus trips of passengers in the morning and evening peaks leave from and return to their homes, respectively.

Besides, the destinations in the morning peak mostly fell in commercial and school areas, indicating that most passengers go to work or school in the morning. The origins in the evening peak were slightly scattered, yet mainly located in commercial and school areas, which are the main destinations for business and school travels.

Finally, the origins and destinations were relatively decentralized in the off-peak hours, suggesting that a certain portion of the travels are nondomestic.

Through our data analysis, 3,023 destinations (82.8%) were derived by the preferential return mechanism from the data of 3,651 passengers, while 628 (17.2%) were derived by the exploration mechanism. This suggests that urban residents tend to return to places they have visited before. With the growing number of trips, the residents are more likely to prefer the historical locations over new locations, which agrees with the rule of preferential return. Our estimation shows that the travels of bus passengers are concentrated in the morning and evening peaks: going downtown in the morning peak and returning home in the evening peak. This is clearly in line with the situation of urban commuters.

#### 5. Conclusions

This paper fully integrates bus line data, bus smartcard data, bus GPS data, and metro smartcard data with road network data and introduces the EPR model and the MC algorithm to estimate the alighting stops of bus passengers in Shenzhen. First, the boarding location and time of every single passenger were estimated based on the integrated data. Then, the alighting station of the passenger was predicted under the exploration mechanism and the preferential return mechanism, which are based on the features of travel activities. During the prediction, the metro smartcard data were innovatively employed to evaluate the extra attractiveness of each bus stop. By considering the features of historical trips and bus-metro transfer, the proposed method was found, through a case study, to solve the destination estimation problem effectively.

Future research will further explore multimode traffic and the OD estimation of multitransfer trips. Based on the results of this study, the origin and destination of passengers can be obtained and combined with multimode traffic and multiple transfers to estimate the travel OD, so as to obtain the relevant characteristics of passengers’ travel, including travel time, travel OD, travel distance, and station passenger flow. The travel distance and station passenger flow can provide more accurate data support for urban bus dispatching and help schedule more reasonable departure times and intervals for peak and off-peak vehicles. In addition, the passenger flow and travel OD can provide the basis for public transportation network adjustment.

#### Data Availability

The data used to support the findings of this study are included within the article.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

The authors thank Shenzhen Urban Transport Planning Center for providing part of the research data and thank other scholars and research teams for their scientific contributions. This project was supported by the Open Fund of the Engineering Research Center of Catastrophic Prophylaxis and Treatment of Road & Traffic Safety of the Ministry of Education (Grant no. kfj180401, Changsha University of Science & Technology), the Natural Science Foundation of Hunan Province, China (Grant no. 2019JJ40306), and the National Natural Science Foundation of China (Grant no. 61773077).