Abstract

Mobile phone data have become a critical data source for transportation research. While a cell-id trajectory was routinely reorganized by International Mobile Subscriber Identity (IMSI), it potentially allows to analyze transportation behaviors and social interaction of total population, with a full temporal coverage at low cost. However, cell-id trajectory is often sparse due to low reporting frequency and uncertainness of mobile holders’ position. So, the cell-id trajectory refinement has been recognized as challenging work to further facilitate trajectory data mining. This paper presents a comprehensive approach to identify cell-id trajectories of public service vehicles (PSVs) from large volume of trajectories and further refines these cell-id trajectories by a heuristic global optimization approach. The modified longest common subsequence (LCSS) method is used to match a cell-id trajectory and a public transportation route (PTR) and correspondingly calculates their similarities for determining whether the trajectory is PSV mode or not. Taking full advantages of the nature of a PSV tends to move on the PTR in uniform motion to meet a prescript visit to stops, a heuristic global optimization approach is deployed to build a spatiotemporal model of a PSV motion, which estimates new locations of cell-id trajectories on the PTR. The approach was finally tested using Beijing cellular network signaling datasets. The precision of PSV trajectory detection is 90%, and the recall is 88%. Evaluated by our GNSS-logged trajectories, the mean absolute error (MAE) of refined PSV trajectories is 144.5 m and the standard deviation (St. Dev) is 81.8 m. It shows a significant improvement in comparison of traditional interpolation methods.

1. Introduction

Cellular network-based data are emerging as a great data source for urban transportation application due to the advantage in the large geographic coverage of cellular networks and the comprehensive penetration in a population [13]. Generally, cellular network-based data collected by mobile network operators can be reorganized into a mobile phone user’s trajectory formed as a sequence of time-stamped cell-ids (i.e., cell-id trajectory), illustrating motion characteristics of corresponding mobile objects [4].

The location in cell-id trajectory at a particular time is assigned with the coordinate of an occupied base transceiver station (BTS). The spatial resolution of cell-id trajectory data depends on the service radius of each BTS, which varies in different areas, e.g., of hundred meters in metropolitan cities, and several kilometers in rural areas [5]. Meanwhile, the records of a cell-id trajectory are collected in a relatively long-time interval, which inevitably results in the problem of trajectory discontinuity or sparsity [610]. Therefore, due to the inconformity between the spatiotemporal sparsity of the cell-id trajectory data and the requirement on fine-scale footprints, the refinement of cell-id trajectories becomes a research hotspot due to its extensive application prospects.

The trajectory refinement that refers to filling the spatiotemporal gaps in the data is an approach to mitigate the sparsity of time-series trajectories [11]. Existing refinement method based on mobile phone location data mainly includes an interpolation-based method and a map-matching-based method. The former mainly uses spatial-temporal correlations among records to interpolate missing points. During interpolation, trajectory points are calculated based on the distances and time spans between each missing point and its contextual points by an estimation function (e.g., nearest-neighbor function, linear function, or Gaussian function). Ficek and Kencl [12] proposed an intercall mobility model which combines Gaussian mixtures to refine CDRs. Hoteit et al. [13, 14] compared the reconstruction performance of various interpolation methods (linear, cubic, nearest, and spline interpolations) on trajectories with different sampling interval and radius of gyration. Yu et al. [15] simply used a spatially-linear-interpolated method to estimate exposure in air pollution of cell phone user. Csaji et al. [16] interpolated between home and office to infer user’s location and analyze the population distribution. These refinement methods might meet the requirement for pollution exposure or census estimation; however, it is unfeasible for transportation study. Perera et al. [17] provided a method to compute the location of phone user within a cell, but it requires extra speed information which is generally unavailable in cell-id data.

As for map-matching-based methods, the basic assumption is that vehicle movement behaviors always occur along road networks. Thus, a sequence of trajectory points can be aligned to a sequence of road segments to form a complete path. Hidden Markov model (HMM) is the most popular one among map-matching-based methods. Jagadeesh and Srikanthan [18] used a pretraining route choice model to generate partial map-matched paths and identify the most likely one. Another HMM model proposed by Jagadeesh and Srikanthan is in [19] which considered the tailored transition probabilities for the type of data. Algizawy et al. [20] used HMM to generate a road-level traffic density, at an hourly granularity, for each mobile trajectory. Xiao et al. [21] used contextual relationships between trajectory points as features of the CDR trajectories in a conditional random field model to reconstruct individual trajectories. Chen et al. [22] proposed two approaches for completing CDRs adaptively to reduce the sparsity and mitigate the problems the latter raises. However, the basic assumption that underlies the map-matching method is questionable. That is, individuals in urban space can travel by public transportation, which limits the performance of such methods.

This paper aims to refine sparse cell-id trajectory of public service vehicles (PSVs) by combining vehicle transport model and mobile cell-id trajectories, in which an LCSS-based SVM classifier took full advantage of the similarity between cell-id trajectories and designed public transportation routes (PTRs) to separate PSVs from those with other transport modes (e.g., walk or private car) in a large-scale unlabeled trajectory dataset. Then, each trajectory of PSV was modeled to fit with its mobile behaviors as much as possible at stops, junctions, and roads and be consistent to a spatial cell cover and bus travel speed by a heuristic global optimization. We evaluated our proposed trajectory refinement method by using an encrypted cell-id trajectory dataset and a GNSS-logged bus trajectory collection. The results show that our approach delivers a state-of-the-art achievement in refining cell-id trajectories.

2. Method

2.1. Conceptual Model

The cell-id trajectory explored in this paper is a sequence of time-stamped cell-ids. Each cell-id is a unique identification of a base transceiver station (BTS) with geographical coordinates and sectoral signal transmission coverage. So, we can imagine that a refined cell-id trajectory must occur within the overlap area among a road network and a BTS sectorial area. Figure 1 gives an overview of the architecture of our approach, including two phases: PSV trajectory detection and trajectory refinement or reconstruction.

PSV cell-id trajectory detection is to identify PSV-generated trajectories from huge volume of cell-id trajectories with various transportation modes, i.e., walk, cycle, and private car. An LCSS alignment algorithm is used to match a cell-id trajectory to a public transportation route (PTR) according to a number of PTR-nearby BTS sectoral coverages and chronological order of timestamps. LCCS generates a sequence of corresponding points (also called anchor points), in which an anchor point is represented to two different locations on the PTR route and cell-id trajectory, respectively.

The similarities calculated based on the LCSS sequence are sufficient conditions to recognize PSV trajectory by a support vector machine classifier. Anchor points are initial inputs to heuristic optimization model for further estimating precise locations of anchors on PTR routes, and consequently, high-quality trajectories are generated by interpolating among optimized anchor points.

2.2. PSV Cell-ID Trajectory Detection

The PSV such as buses, subways, and trams always run following the transportation mode with fixed routes and prescript schedules, providing transportation services for public passengers. To identify cell-id trajectories from mass datasets, it is essential to establish a set of specific measures to quantify the correlation between cell-id trajectories and the PTR. Based on LCSS, the longest common sequence among a cell-id trajectory and PTR is matched and thus a set of similarity measures are proposed to measure the spatiotemporal correlation between them. Taking use of the similarity measures, a SVM binary classifier is deployed to recognize cell-id trajectories with the PSV mode.

2.2.1. BTS Sector

In a cell-id trajectory, a phone holder's location is roughly represented as the installation position of the cell-id marked BTS. It is not the identical location of the phone holder as the BTS actually covers a large geographical area. It is impossible to catch the exact location of mobile users as they can be anywhere inside the mobile network cell.

There are two popular mobile network cell models. Most researches represent mobile network cells as Voronoi areas centered on BTSs (see Figure 2(a)). In this work, BTS sectors are used that offer a finer spatial scale due to limiting a mobile phone location to only parts of the cell, as shown in Figure 2(b). Generally, there are three antenna sectors per BTS and the sector’s orientation is same as the direction of BTS antenna. We construct sectors with an average diameter of 500 meters in urban areas.

2.2.2. LCSS Matching

Due to the uncertainness of spatial positioning and irregular time interval logging at a cell-id tracking point, it is difficult to match a PSV cell-id trajectory to a PTR route. An LCSS alignment algorithm that originates in the field of string matching, where two strings are given to find characters that appear left-to-right, not necessarily consecutively, in both strings [23], is applied to find the longest common subsequence of two sequences. The longer an LCCS is, the higher the probabilities that the trajectory was generated by a PSV.

The LCCS is extended to support periodic matching in this work. Following pilot studies [24, 25] that processed cell-id trajectories as strings, we discretized a continuous PTR route into a cyclic sequence of discrete points with an interval of 50 meters. Given a cell-id trajectory and a PSV route ,where is the th point on trajectory , is the timestamp of , and is the th point on route .

Two points and may be considered to be matched if located within the BTS sector of , and it can be represented aswhere is the BTS sector of .

Let denote the first points of trajectory and denote the first points of , the LCSS between and .

Dynamic programming algorithm deployed to discover optimal alignment for and is shown in Figure 3. A matrix is created to save the LCSS for every subsequence pair of and .T‖ and ‖R‖ are the point numbers of and , respectively. Considering a PSV periodically moves along PTR, is designed as a periodic cyclic sequence. is the search space on for to find matched The entries of are gradually filled as the dynamic programming proceeds (lines 4–10), and the last entry stores the LCSS of aligning and . Finally, we decode to find all the aligning cell-id log pairs of the optimal alignment (lines 12–22). The extended LCSS model can be viewed as a modified version of the models [26, 27], which not only finds the longest common subsequence in terms of the accumulate number of matched anchors but also the cycle numbers of the cell-id trajectory.

2.2.3. Similarity Measures

(a) Anchors Ratio. An often used similarity measure for an LCSS matching is calculated as the ratio between the number of points in LCSS and the number of cell-ids in original cell-id sequences size [26, 27], called anchors ratio in this work. where is the number of the matched anchor point pairs, and is a subset of that is the common part between and . Using instead of is because contains points beyond on-duty period.

(b) Traverse Completeness. Traverse completeness refers to the ration between the length of the LCSS and the length of a PTR. Anchor points of split and , resulting in a set of subtrajectories and a set of subroute . Applying Hausdorff distance [28] to determine whether is matched with each other, all geometrically closed pair could be regarded as traversed partial on the PTR; correspondingly, traverse completeness is defined as follows:where is the union of matched subroutes , is the length of union-routes , and is the total length of route .

(c) PSV Cycles or Round-Trips. PSV cycles denote to the cycles of periodic matching of cell-id trajectory on the PTR, indicating how many round-trips the PSV run along the route. where and are the mileages of the first and last anchor points, respectively, represents the total distance during on-duty time, and is the total length of route .

2.3. Trajectory Refinement
2.3.1. Trips Partition

A PSV cell-id trajectory often includes several round-trips along a PTR route, each alternating with a long-time stay at terminal stations as drivers have access to toilet facilities at rest, fuel, and food establishments. The stop time cannot be directly obtained from original data, so we introduced a spatiotemporal kernel density estimation (STKDE) method to identify these stays and therefore enable to partition trips. Trajectory refinement or reconstruction is thus able to be performed trip by trip because those short trajectories generated at terminal stay time are eliminated.

We first transform the cell-id trajectory into a time-distance relationship by calculating the distance between start station and each BTS. Then, the kernel density along the distance axis is estimated aswhere is the accumulated mileage of the PSV, is a kernel function for the spatial domain, and is the spatial bandwidth. Trajectory point is weighted on a univariate kernel density function as follows:

The kernel density is estimated based on the travel duration when the PSV pass through the spatial domain. Points on which the PSV has met traffic congestion, junctions, or stops often have high-density values. In particular, the long-time stay at terminal stations causes an extremely high-density peak, which can be used to partition trips [29].

2.3.2. Heuristic Global Optimization

For a round-trip matching, we deploy a heuristic global optimization approach to estimate the precise locations of anchor points on the route. It tends to match a time-stamped point sequence to a monotonically increased mileage sequence. For each anchor point of the LCSS from a cell-id trajectory and a PTR, it must satisfy (a) locating within the intersection of the PTR and BTS sectors, (b) having a longer mileage than previous points’ one, and (c) having new location nearby the initial.

A heuristic optimization model, as a commonly used model on finding approximate global optima problems, is used to search new location of anchor points naturally. Assuming a PSV always tends to move on a PTR in uniform motion to meet a prescript visit to stops, an objective function is defined as equation (8) to minimize the standard deviation of bus speed along a route in which the speeds among two consecutive anchors totally depend on their locations and time interval.where is the mileage of the ith anchor along the PTR, is the timestamp, is the total number of anchor points, and , are the upper and lower limits of the decision variables being optimized.

As illustrated in Figure 4, by iteratively adjusting the location of anchor points in a search space, a minimum objective function value is reached by leveraging dwelling times on bus stops and cross-roads besides the abovementioned constraints. The steps of the method are listed as follows.

3. Study Area and Datasets

3.1. Study Area

The study was conducted at Huilongguan town to evaluate the proposed public transit mode detection and cell-id trajectory refining method. Huilongguan town is one of the largest townships in the northern part of Beijing, China. The town has a total population of about 450,000 people and an area of 34.5 square kilometers. Being one of the most populated residential areas in Beijing, choked traffic has been an archenemy to urban transportation system in this area. As shown in Figure 5, the town streets and road maps were sourced from OpenStreetMap and total 20 bus routes and related more than 400 bus stops covering this area were also retrieved from Beijing Public Transport Corporation (BPTC).

3.2. Datasets

Cellular phone network signaling datasets, covering the town extension area on early August, 2016, were collected, about 670,000 mobile phone trajectories, 240 million records with an average phone-station interaction interval of 280 seconds. Each record includes timestamp (TS), International Mobile Equipment Identity (IMEI), tracking area code (TAC), and cell identity (CI), representing an interaction event between a mobile phone (IMEI) and a base station (TAC plus CI) at a dedicated time (TS). All private information in mobile phone datasets has been encrypted to protect privacy.

Ground truth datasets were collected by deploying an android device-based cell signal monitor program. Following GNSS positioning (longitude, latitude, and time), cellular towers information along bus routes was also acquired, including cellular network (GPRS/EDGE/UMTS/LTE), current cell identity (CID), current area identity (LAC/RNC/TAC), signal strength (RSSI and RSRP for LTE networks), and cells that were used by the mobile device. This tracking dataset was logged with a sampling interval of 1 second and a GPS positioning error of 5–10 meters. It is mainly used to calibrate and validate our proposed model.

3.3. Data Preprocessing

Ground truth datasets were reorganized as follows. First, the information about bus round-trips was extracted from GNSS-measured bus trajectories. Then, bus cell-id trajectories were prepared by resampling raw cellular tacking datasets with a long-time interval of 280 seconds to be consistent with our big cellular collection in 2016.

As shown in Figure 6, a subset of ground truth datasets at route no. 307, totally 3,8273 GNSS points and 136 cell towers of BTS, were illustrated, among which each BTS might be deployed many times when our android devices passed through its covered area. Figure 6 displays both a GNSS trajectory and a cell-id trajectory from 8 : 00 to 20 : 00 within 4 separate round-trips. GNSS points overlap bus route very well but cell-id points are quite poor.

4. Results

4.1. PSV Cell-ID Trajectory Detection by LCSS

Applying our revised LCSS algorithm to register cell-id trajectories datasets to 20 bus routes in Huilongguan, the similarities between trajectories and bus routes were calculated. Total 843 candidate cell-id trajectories with 16,732 time-stamped track points (or towers) were identified with an anchor ratio threshold of 0.2. In this work, the cell-id trajectories that are from bus drivers or conductors are called “PSV mode.” By human interpretation, 456 cell-id trajectories (phone holders) were labelled as “PSV mode” and the rest were “non-PSV mode.” The statistics of similarity measures of PSV mode trajectories is listed in Table 1.

Support vector machine (SVM), which may maximize the margin by separating two classes of samples, was used to identify PSV cell-id trajectories from others in this work. K-fold cross-validation was used for SVM parameters tuning such that the model with most optimal value of hyperparameters can be trained. The interpreted candidates are divided into 5 folds, out of which four folds are for training and one for testing. The results of five repeated classifications are shown in Table 2. For PSV cell-id trajectory detection, the precision is about 90% while the recall is from 88.60% to 92.32%. For non-PSV mode detection, the precision is around 88%, and the recall is from 88.37% to 93.54%.

4.2. Trajectory Refinement by Heuristic Optimization

Once the trajectories with a PSV mode were identified, our proposed heuristic optimization method is used to determine the precise location of anchor points on a bus route at a specific timestamp by concerning its corresponding BTS sector and other contextual information. To evaluate the performance of our heuristic globe optimization approaches, the ground truth datasets were deployed. As the result of trajectory matching shown in Figure 7, 81 out of 136 BTS along bus route no. 307 were detected that they have counterparts anchor points existing on the route.

Five optimization algorithms including particle swarm optimization (PSO), augmented Lagrangian (AL), compass search (CS), and artificial bee colony (ABC) were tested to solve the optimization problem of spatiotemporal modelling of a cell-id trajectory. For our PSV’s location optimization, it is actually a continuous, constrained, and single-objective problem. That is, an anchor point of a BTS needs to be found in optimization space which is the common segment part of the BTS sector and the bus route.

A python library PyGMO was deployed for implementing this iterative process, and all results were evaluated by the indicators of MAE (mean absolute errors) and St. Dev (standard deviation) by referring to ground truth GNSS tracking points. As shown in Figure 8, the compass search algorithm achieved the best performance in comparison of others. The estimated locations of anchor points at their BTS-communication time have the smallest errors. Based on the optimized cell-id trajectories, we further make a spatiotemporal interpolation among anchor points using method in [13] and reconstruct a high-quality PSV trajectory.

Figure 9 presents the iterative process of a trip of PSV cell-id trajectory data using compass search algorithm. A monotonic basin hopping (MBH) meta-algorithm is applied.

To avoid local optimization and accelerate convergence velocity of iteration, a monotonic basin hopping meta-algorithm is applied in the abovementioned optimization processes. The optimization space threshold roughly set to 500 m to avoid an explicitly unreasonable location was tested in the iteration. As shown in Figure 10, the objective decreases quickly at the beginning and finally converges at minimization after 120,000 iterations. Generally, an acceptable result could be reached after 8,000 iterations.

4.3. Comparison with Other Spatiotemporal Interpolations

To evaluate the advantages of our proposed method, we selected three most popular trajectory reconstruction methods, nearest sampling (nearest), linear interpolation (L), and cubic hermit interpolation (CH), [13] to make a comparison. These popular interpolations have two models, constrained (C) and unconstrained (U), depending on whether ancillary transportation line datasets are used or not. Calculating the difference between the estimated location and the GNSS measured location of a tracking point at a specific timestamp, the performances of the abovementioned methods are shown in Figure 10, among which our method (compass search algorithm) has the smallest error on its refined cell-id trajectory.

The box of CS errors is compact with a small median of 166 m. Among traditional interpolation methods, constrained linear interpolation seems introducing a relatively good result with a median of 207 m, but the range of errors widely varies. Moreover, CS is also robust and the better performance is due to the fact that we accounted for the unique transportation model of PSV and the constraints of public transport line.

5. Discussion

5.1. Shift in LCSS Matching

Assuming a bus route crosses a BTS sector that is a wireless-signal coverage area (Figure 11), it leads to a piece of road intersection on which a mobile holder (a bus driver) moves when the phone communicates with the corresponding BTS at a particular time. The end-points of the interaction may be called, entry point and exit point. All communication events with this BTS must happen on the interactions with big probabilities.

The revised LCSS for trajectory matching is a forward matching process. This process is able to find the longest common subsequence, but cannot guarantee an appropriate anchor point that is expected to be the identical location of the bus when moving on roads at a specific time. Among the discretized points on the intersection part, the closer a point is to the entry, the higher the probabilities of the point are to be matched. This is because the effect of mismatch and process variation results in shifting in the process of LCSS.

Our heuristic optimization with compass search algorithm further adjusts the initial positions of cell-id trajectory points within the intersection part and estimates appropriate positions to match a PSV-mode transportation along the bus route. In comparison of four GNSS trajectories, the average distance between initial points and estimated points is about 197 m with a deviation of 57 m.

5.2. How a Time Interval Affects the Spatiotemporal Modeling in Trajectory Refinement?

From the comparison of heuristic optimization and traditional interpolations in the above section, our proposed method did improve the accuracy of trajectory refinement. Moreover, the inherent quality of cell-id trajectories, particularly the time intervals of tracking points on a cell-id trajectory, also has big effects on the refinement.

Based on the ground truth datasets collected in four round-trips at bus route 307, the cellular signaling data originally logged with one second interval were resampled into a number of trajectories with intervals ranging from 10 s to 600 s. The BTS (cell-id) number in these resampled cell-id trajectories varies from six hundred to two dozen, as shown in Figure 12(a). The errors of estimated trajectory to the GNSS trajectory increase from thirty meters to six hundred meters, as shown in Figure 12(b). The interval of 5 minutes that is close to the average interval of our big cell-id trajectory collection generates an error of 180 m. Once the interval is greater than 10 minutes, our proposed method will not support acceptable trajectory refinement any more.

6. Conclusion

This work proposed a novel approach to reconstruct the spatiotemporal location of vehicles between sparse updates in a cell-id trajectory. First, an LCSS-based machine learning method was proposed to detect PSV-mode cell-id trajectories from the huge volume of cellular network signaling datasets. Then, a heuristic global optimization method was deployed to estimate the precise locations along bus routes of these detected cell-id trajectories. Evaluated with ground truth datasets, our proposed method has achieved very good performance in both accuracy and robustness in comparison with traditional interpolations.

Our heuristic global optimization with compass search algorithm overcomes the issue of location shifting in LCSS when matching a cell-id trajectory and a bus route. This leads to a set of high-quality anchor points, that is, a spatial position at a road network that is originally corresponding to a cell-id tracking point at a particular time is estimated in the common intersection part of the BTS sector and the road network. The experiment indicates that, by taking advantage of the nature of PSV-mode cell-id trajectories, our approach works well on cellular network signaling datasets with five-minute sampling interval, but the performance decreases sharply after a ten-minute sampling.

Data Availability

The 4G-LTE mobile phone data used to support the findings of this study were supplied by China Mobile Communications Group Co., Ltd., under license and so cannot be made freely available. Requests for access to these data should be made to Kemin, Xianfeng ([email protected]).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key Research and Development Foundation of China (nos. 2017YFB0503702 and 2017YFB0503605), 973 Program (no. 2013CB733402), and National Natural Science Foundation of China (nos. 41601486 and 40771167).