Abstract

Traditional intercity trip distribution models are derived mainly from household travel surveys, which provide only partial or inaccurate information. With the development of information infrastructure, reliable historical data can now be collected easily from different sources, such as sensors and statistical records. In this study, a data-driven method based on Poisson distribution theory is proposed to estimate intercity trip distribution using sensor data and various city features. A Poisson model, which captures the correlation between city feature variables and trip distribution, is first formulated. L1-norm regularization is then adopted to select relevant features and reduce the complexity of the model, and the coordinate descent algorithm is used to estimate the model parameters. Finally, a k-means clustering method is used to analyze the latent correlation between city features and improve the availability of the model. The methodology is tested on a realistic dataset containing the highway trips of 17 cities in Shandong Province, China. The city feature variables have 66 dimensions, including economic indices and population indicators. In contrast to the traditional gravity model, which regards population as the most important factor affecting city attraction, our results show that one of the core positive factors is the economic feature, such as gross regional domestic product. Moreover, the number of city features in the developed model decreases from 66 to 13. The developed model performs well in replicating the observed intercity origin-destination matrix.

1. Introduction

Intercity trip distribution analysis estimates the number of trips between origin and destination cities, considering differences in the cities’ mass (their capacity to generate and attract trips) and the degree of impedance between them. It is a crucial part of transportation analysis of city interaction and helps us gain a better understanding of intercity human mobility.

Lei et al. [1] divided intercity trip distribution analysis and modeling methods into two groups, namely, top-down and bottom-up approaches. Top-down approaches begin with city socioeconomic, land use, and demographic information, and only a small amount of data is used to calibrate model parameters. These models rely on expert knowledge and behavior postulates and are robust to changes in inputs, so they can estimate intercity trip distribution with little detailed data. However, this group of models considers only a few factors, whereas the actual situation is considerably more complicated, with different effects in different regions. Conversely, bottom-up approaches, also called data-driven methods, estimate trip distribution flows directly from available data sources. How the relevant data are collected is a critical issue in these approaches.

The traditional dominant methodology follows the top-down approach and includes gravity models and destination choice models, which are built on different behavior postulates and theories. The gravity model, one of the best-known models, is widely used to estimate flows among different regions [24]. It is formulated on the postulates that the observed trip distribution is governed by a power decay effect and is proportional to the populations of the origin-destination (OD) cities. Several studies have considered other factors that affect a city’s production-attraction capability and decay effects, such as generalized cost [2], tertiary percentage, tourism revenue, the convenience of other traffic modes [3], city size, and trip purpose [4]. Other types of models, such as destination choice models, are based on discrete choice and random utility theories at the disaggregate level; examples include the intervening opportunity, radiation, and population weight opportunity models. These models conclude that distance has no effect on the travelers’ destination choice and that only intervening opportunities play a surrogate role. Thus, the models estimate commuting flows from population alone. However, conclusions about the performance of these two types of models vary across studies. Mishra [5] showed that destination choice models perform well in a statewide situation, whereas Lenormand et al. [6] showed that the gravity model performs well in estimating commuting flows. Yan et al. [7] developed a general model that requires only population as an input to predict human mobility at different scales.

In previous studies, household travel survey sample data have been widely used in analyzing trip distribution and continue to be a mainstay of trip distribution modeling [8]. Pitombo [9] applied nonparametric decision tree algorithms to analyze individual destination choices based on OD survey dataset. However, the information obtained from these surveys is limited and partial. Conversely, information derived from sensors and statistics is reliable and sufficiently wide to cover an entire area. Rasouli and Nikraz [10] estimated the distribution of journey to work trips using census data sourced from the Australian Bureau of Statistics based on a generalized regression neural network algorithm. Perrakis et al. [11] also utilized census data to estimate OD flows based on a statistical Bayesian approach. Bekhor [8] applied cellular phone technology to analyze long-distance trips.

Recent developments in intelligent transportation allow passive data collection from many travelers and offer a wide variety of data, including GPS and fare data. Mario [12] reviewed studies utilizing passively collected data and summarized the applications based on these data, which include OD estimation. Moreover, city statistical data have become richer and more accurate, and multidimensional data, such as population, economic, and traffic network data, can be considered. Traditional top-down methods are limited when detailed data are absent, and they do not fit the complex reality sufficiently well, which motivates us to consider more factors and different scenarios. Reliable historical traffic data and multidimensional city statistical data allow us to estimate and analyze trip distribution using a data-driven method without behavior postulates.

The present work proposes a data-driven method to estimate intercity trip distribution from reliable historical traffic data and statistical data on multidimensional urban characteristics. The method follows two main steps (Figure 1). In the first step, a Poisson model is proposed to describe the correlation between city features and trip distribution according to statistical theory; the L1-norm approach is applied to select features, and the coordinate descent algorithm is used to estimate the model parameters, which addresses the problem raised by the complexity of the model. In the second step, a k-means clustering method is used to analyze and interpret all the city features in light of the feature selection result, and an attribute-optional trip distribution model is developed to enable the replacement of similar attributes in the same cluster. In summary, the main contributions of this paper include the following:
(1) Considering that survey samples are biased and insufficient even when the sampling rates are large, we utilize historical OD data to estimate the trip distribution.
(2) This theoretical attempt is one of the first to consider such complicated and comprehensive features in estimating intercity trip distribution. These features include not only the population and distance that intercity trip distribution models always consider but also urban economic indices and transportation capacity. Meanwhile, we propose a solution to the problem that a high feature dimension causes overfitting and poor generalization of the model. Besides, considering that some features are hard to obtain, the features selected by the method are replaceable.
(3) According to the assumption that OD flows are independently Poisson distributed [11], we develop the model based on statistical theory and big data science without relying on expert knowledge.
(4) We evaluate the methodology with real-world OD data collected from 17 cities in Shandong Province, and the proposed model outperforms the traditional gravity model, which serves as the baseline approach.

As a case study, the new approach is applied to highway intercity trips among the 17 cities of Shandong Province in China, using expressway networking toll data and Shandong statistical data sourced from the Shandong Bureau of Statistics. Two different models are developed, namely, the trip distribution model based on Poisson theory and the gravity model. A quantified evaluation and comparison of the two models is performed. The result shows that the Poisson model performs better than the traditional gravity model (the R2 score improves from 0.39 to 0.69). The results also indicate that the main source of city attraction is not population, which traditional intercity trip distribution models always consider, but economic features, such as gross regional domestic product.

The rest of the paper is arranged as follows. Section 2 presents the model formulation and estimation. Section 3 shows the clustering method used to find the latent classes in city features. Section 4 describes the data, and the case study and result analysis are presented in Section 5. Finally, Section 6 presents the conclusions and provides directions for further research.

2. Trip Distribution Model Based on Poisson

In this section, the model is formulated with the Poisson distribution based on statistical theory. It describes the relationship between intercity trip distribution and city features, as shown by Step 1 of the methodology in Figure 1. The L1-norm and the coordinate descent algorithm are used to select city features and estimate parameters, which addresses the problem raised by the high complexity of the model.

For computational convenience, the flows of intercity trip distribution are represented as a vector. Considering the nature of the variables, we divide them into two groups: one group represents the city features, and the other represents the relationship between two different cities. Let $n$ denote the number of trip distribution samples, $p$ the number of dimensions of city attraction features, and $q$ the number of variables representing the relationship between two different cities. Let $y = (y_1, \ldots, y_n)^{T}$ denote the vector of observed flows of intercity trip distribution; $\beta$ the vector of parameters; and $X$ the design matrix of dimensionality $n \times (2p + q + 1)$ containing the $p$ origin-city feature variables, $p$ destination-city feature variables, $q$ relationship variables between origin and destination cities, and the intercept, with $x_i$ as the $i$th row of $X$ related to flow $y_i$ and $i = 1, \ldots, n$.
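The layout of the design matrix can be illustrated with a minimal Python sketch. The function name `build_design_matrix`, the array layouts, the column order (intercept first), and the exclusion of intracity pairs are assumptions made for illustration; they are not taken from the original implementation.

```python
import numpy as np

def build_design_matrix(city_features, relation):
    """Stack one row per ordered origin-destination pair.

    city_features: (m, p) array, one row of p features per city (assumed layout).
    relation:      (m, m, q) array of pairwise variables such as travel time (assumed layout).
    Returns X with 2p + q + 1 columns and the list of (origin, destination) index pairs.
    """
    m, p = city_features.shape
    rows, pairs = [], []
    for o in range(m):
        for d in range(m):
            if o == d:
                continue  # assumption: intracity pairs are not modeled
            # Column order: intercept, origin features, destination features, pair variables.
            rows.append(np.concatenate(([1.0], city_features[o], city_features[d], relation[o, d])))
            pairs.append((o, d))
    return np.asarray(rows), pairs
```

With m cities this sketch yields m(m - 1) rows, one per ordered pair of distinct cities.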

2.1. Model Formulation

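Under the assumption that intercity trips occur at a known constant rate and independently of the time since the last trip, the Poisson distribution describes the probability of a given number of intercity trips occurring in a fixed interval of time. Thus, $y_i \sim \mathrm{Poisson}(\mu_i)$ for $i = 1, \ldots, n$, where $y_i$ is the $i$th observed flow. In the context of generalized linear models, the common link function for the Poisson distribution is the logarithm. With the logarithmic link function $g(\mu_i) = \log \mu_i$, the Poisson mean of $y_i$ is formulated as $\log \mu_i = x_i \beta$ for $i = 1, \ldots, n$, where $x_i$ is the row of the design matrix related to flow $y_i$ and $\beta$ is the parameter vector. Thus, $\mu_i = \exp(x_i \beta)$. The probability of observing $y_i$ is defined as follows:
$$P(y_i \mid x_i; \beta) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}$$
According to the definition of the Poisson distribution, the expected value of $y_i$ is equal to $\mu_i$, that is, $E(y_i \mid x_i) = \mu_i = \exp(x_i \beta)$. Using maximum likelihood estimation for $\beta$, the likelihood function is
$$\mathcal{L}(\beta) = \prod_{i=1}^{n} \frac{\exp(y_i x_i \beta) \exp\left(-\exp(x_i \beta)\right)}{y_i!}$$
In practice, working with the natural logarithm of the likelihood function, called the log-likelihood, is often convenient:
$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i x_i \beta - \exp(x_i \beta) - \log(y_i!) \right]$$
$\beta$ must be estimated to develop the trip distribution model.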
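The Poisson log-likelihood under the logarithmic link can be evaluated numerically as in the following minimal sketch. The function names are hypothetical, and `math.lgamma(v + 1)` is used for the logarithm of the factorial.

```python
import math
import numpy as np

def poisson_mean(X, beta):
    """Expected flows under the log link: mu_i = exp(x_i . beta)."""
    return np.exp(np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float))

def poisson_log_likelihood(X, y, beta):
    """Sum over samples of y_i * x_i.beta - exp(x_i.beta) - log(y_i!)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    eta = X @ np.asarray(beta, dtype=float)
    log_factorial = np.array([math.lgamma(v + 1.0) for v in y])
    return float(np.sum(y * eta - np.exp(eta) - log_factorial))
```

Maximizing this quantity over beta gives the unpenalized maximum likelihood estimate; the factorial term does not depend on beta and can be dropped during optimization.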

2.2. Feature Selection and Parameter Estimation

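According to the formulation in Section 2.1, the parameters can be estimated by using a regression method. An L1 regularization term is added to the loss function to improve the prediction accuracy and interpretability of the model. This method, called lasso regression, retains only a subset of the provided covariates in the final model. The loss function is expressed as follows:
$$L(\beta) = \sum_{i=1}^{n} \left[ \exp(x_i \beta) - y_i x_i \beta \right] + \lambda \lVert \beta \rVert_1$$
where $x_i$ is the row of the design matrix related to the observed flow $y_i$, the first term is the negative Poisson log-likelihood up to a constant that does not depend on $\beta$, and $\lambda \ge 0$ controls the strength of the L1 penalty. The coordinate descent algorithm is used to minimize the loss function.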

Step 1 (initialization). Set the starting value $\beta^{(0)}$. The number in parentheses denotes the iteration count.

Step 2 (loss minimization).
for each iteration $t = 1, 2, \ldots$ do
    for each coordinate $j$ of $\beta$ do
        Consider only $\beta_j$ as a variable and hold the other dimensions of $\beta$ constant:
        $$\beta_j^{(t)} = \arg\min_{\beta_j} L\left(\beta_1^{(t)}, \ldots, \beta_{j-1}^{(t)}, \beta_j, \beta_{j+1}^{(t-1)}, \ldots\right)$$
        Update $\beta_j$ iteratively until this one-dimensional problem converges.
    end for
end for

Step 3 (calculation). Calculate the difference between $\beta^{(t)}$ and $\beta^{(t-1)}$, and the change of $L$. If both are sufficiently small, then the algorithm has converged. Otherwise, go back to Step 2.
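The three steps can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: each coordinate is updated with a single Newton step followed by soft-thresholding and a simple backtracking check, the intercept is assumed to be the first column of X and is left unpenalized, and the names `fit_poisson_lasso` and `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator associated with the L1 penalty."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_poisson_loss(X, y, beta, lam):
    """Negative Poisson log-likelihood (without the factorial term) plus L1 penalty."""
    eta = X @ beta
    return float(np.sum(np.exp(eta) - y * eta) + lam * np.sum(np.abs(beta[1:])))

def fit_poisson_lasso(X, y, lam, max_iter=200, tol=1e-8):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    d = X.shape[1]
    # Step 1: starting value (intercept matched to the mean flow, other entries zero).
    beta = np.zeros(d)
    beta[0] = np.log(max(y.mean(), 1e-12))
    loss = lasso_poisson_loss(X, y, beta, lam)
    for _ in range(max_iter):
        beta_prev, loss_prev = beta.copy(), loss
        # Step 2: update one coordinate at a time, holding the others fixed.
        for j in range(d):
            mu = np.exp(X @ beta)
            grad = X[:, j] @ (mu - y)
            hess = (X[:, j] ** 2) @ mu
            if hess <= 0.0:
                continue
            penalty = 0.0 if j == 0 else lam  # assumption: intercept is not penalized
            step = soft_threshold(hess * beta[j] - grad, penalty) / hess - beta[j]
            for _ in range(30):  # backtracking: accept only steps that do not increase the loss
                trial = beta.copy()
                trial[j] += step
                trial_loss = lasso_poisson_loss(X, y, trial, lam)
                if trial_loss <= loss:
                    beta, loss = trial, trial_loss
                    break
                step *= 0.5
        # Step 3: stop when both the parameters and the loss have stabilized.
        if np.max(np.abs(beta - beta_prev)) < tol and abs(loss_prev - loss) < tol:
            break
    return beta
```

Coefficients that end exactly at zero correspond to features removed by the L1 penalty; larger values of `lam` remove more features.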

3. Feature Analysis and Interpretation

In Section 2, a trip distribution model is developed according to the Poisson distribution. However, the model is difficult to explain, and the coefficient estimates are not unique if covariates are collinear. The result must therefore be further analyzed and explained, which is Step 2 in Figure 1. The k-means method is applied to classify all the relevant features, including those selected and those not selected in Section 2, to analyze and explain the result of feature selection. The clustering result, called latent classes, can be used to improve the availability of the model. Figure 2 illustrates the relationship between the explicit classes of city features and the latent classes derived from the k-means clustering method. If the nonzero features are unavailable, then they can be replaced with other features in the same latent classes. The experiment is presented in Section 5.2.

For computational convenience, let $F = \{f_1, \ldots, f_p\}$ denote all the city features considered in the trip distribution model, and let $F'$, which is a subset of $F$, denote the nonzero features in the lasso regression. $m$ is the number of cities; thus, the number of trip distribution samples $n$ equals $m(m-1)$, and the dimension of each $f_i$ is $m$.

The k-means clustering method is widely used for cluster analysis in data mining. However, the number of clusters $k$ and the initial set of centroids $C$, which are input parameters, are difficult to determine, and an inappropriate choice may yield unsatisfactory results. In this study, $k$ and $C$ are determined by the result of lasso regression, which provides a good initial set for the k-means clustering method. The nonzero feature set $F'$ is the initial set of centroids, and the number of nonzero features is $k$.

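Input:
    Feature set: all the city features ($p \times m$), $F = \{f_1, \ldots, f_p\}$; the dimension of each $f_i$ is $m$
    Initial set of centroids: $C = F'$
    Number of clusters: $k = |F'|$

Output: Class number of each city feature

for each $f_i$ in feature set do
    Normalize $f_i$
end for

repeat until convergence
    for each $f_i$ in $F$ do
        Assign $f_i$ to the class of its nearest centroid in $C$
    end for
    for each centroid $c_j$ in $C$ do
        Update $c_j$ to the mean of the features assigned to its class
    end for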
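A minimal Python sketch of k-means seeded with the lasso-selected features is given below. The z-score normalization, the squared Euclidean distance, and the function name `kmeans_seeded` are assumptions made for illustration.

```python
import numpy as np

def kmeans_seeded(features, seed_idx, max_iter=100):
    """Cluster city features, using the lasso-selected features as initial centroids.

    features: (p, m) array, one row per city feature and one column per city.
    seed_idx: indices of the nonzero (selected) features; k = len(seed_idx).
    Returns the class label of each feature and the final centroids.
    """
    F = np.asarray(features, dtype=float)
    # Assumption: z-score normalization of each feature across cities.
    std = F.std(axis=1, keepdims=True)
    std[std == 0.0] = 1.0
    Z = (F - F.mean(axis=1, keepdims=True)) / std
    centroids = Z[list(seed_idx)].copy()
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        dist = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = Z[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return labels, centroids
```

Seeding with the selected features fixes both k and the starting centroids, so the result is deterministic for a given feature set.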

4. Data

As one of the richest provinces in China, Shandong comprises 17 cities, 15 of which were on the list of the top 100 cities in China in 2015. The balanced economic development of these cities is convenient for the analysis of intercity trip distribution. Shandong is a coastal province located in the lower reaches of the Yellow River. The Grand Canal of China enters Shandong from the northwest and leaves on the southwest. A convenient network of water channels, roads, railways, and air transportation covers all the cities. All the cities are connected by more than 5,000 kilometers of expressways in total. Most cities can reach each other within half a day.

The analysis addresses the intraprovince highway trip distribution among all 17 cities in Shandong. It is based on the expressway networking toll dataset collected in the 17 cities and on statistical data sourced from the Shandong Bureau of Statistics. The expressway networking toll data consist of approximately 280,000 records, which include information such as entry time, departure time, name of toll gate, and vehicle type. Using the additional geographical information of the toll gates, we can determine the city to which each toll gate belongs and calculate the number of intercity trips and the average travel time.
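The aggregation from toll records to intercity trip counts and average travel times can be sketched as follows. The record layout (entry gate, exit gate, entry time, exit time in hours), the gate-to-city lookup, and the function name `aggregate_od` are hypothetical; the actual toll data schema is not described in detail here.

```python
from collections import defaultdict

def aggregate_od(records, gate_to_city):
    """Count intercity trips and average their travel time per ordered city pair.

    records:      iterable of (entry_gate, exit_gate, entry_time, exit_time) tuples,
                  with times in hours (hypothetical layout).
    gate_to_city: dict mapping a toll gate name to its city (hypothetical lookup).
    """
    counts = defaultdict(int)
    total_time = defaultdict(float)
    for entry_gate, exit_gate, t_in, t_out in records:
        origin, dest = gate_to_city[entry_gate], gate_to_city[exit_gate]
        if origin == dest:
            continue  # keep intercity trips only
        counts[(origin, dest)] += 1
        total_time[(origin, dest)] += t_out - t_in
    avg_time = {od: total_time[od] / counts[od] for od in counts}
    return dict(counts), avg_time
```

The counts form the OD matrix entries, and the average travel times can serve as the impedance variable between city pairs.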

4.1. Trip Distribution OD Matrix

The initial data of the OD city matrix are sourced from the Shandong intraprovince expressway toll records. Figure 3 shows a chord diagram of the intercity trip distribution, which illustrates the intercity OD matrix.

4.2. Feature Variables

Variables are classified into city characteristics and impedance measurements between the OD pairs. Most studies [9] use distance to describe impedance, but we consider the intercity average travel time a better measure of intercity travel cost, because average travel time depends not only on the distance between cities but also on traffic conditions. The city characteristics included in our research have 66 dimensions (listed in Table 1) and are derived from the Shandong Bureau of Statistics. All variables are transformed to the logarithmic scale, so that the multiplicative interpretation of the models remains on the natural scale.

5. Case Study

This section initially presents the performance of the intercity trip distribution model and clustering method, followed by the result analysis of model and clusters.

5.1. Estimation of Distribution Model

K-fold cross-validation is applied to the intercity trip distribution model, with the data divided into 10 groups. The R2 score between the observed and predicted outputs is used to assess the goodness-of-fit of the model according to the following equation:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $y_i$ is the observed flow, $\hat{y}_i$ is the predicted flow, and $\bar{y}$ is the mean of the observed flows. The proposed model and the traditional gravity model are compared. Table 2 summarizes the modeling results for the cross-validation and the final model. It shows that the proposed model provides better results than the gravity model. Figure 4 illustrates the goodness-of-fit results of the two models; the measured value is on the horizontal axis, and the predicted value is on the vertical axis. The points predicted by the proposed model are more densely concentrated than those of the gravity model. Figure 5 shows the density distribution of the R2 scores over the 10 cross-validation groups. The peak of the gravity model’s curve is at approximately 0.3, and that of the intercity trip distribution model is at approximately 0.7.
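The evaluation procedure can be sketched in Python as follows. The helper names, the random shuffling with a fixed seed, and the `fit`/`predict` callables are assumptions for illustration; any model, such as the Poisson model or a gravity model, can be plugged in through those two callables.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination between observed and predicted flows."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def k_fold_indices(n, k=10, seed=0):
    """Split the sample indices 0..n-1 into k shuffled, nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return the R2 score on each held-out fold."""
    scores = []
    for test_idx in k_fold_indices(len(y), k, seed):
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        model = fit(X[train_mask], y[train_mask])
        scores.append(r2_score(y[test_idx], predict(model, X[test_idx])))
    return scores
```

The list of per-fold scores is what a density plot over the 10 cross-validation groups would summarize.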

5.2. Validation of Parameter Selectivity

A random selection policy is applied to evaluate the clustering method and to validate the selectivity of the parameters. According to the clustering results, features in the same clusters as the nonzero features are randomly selected to take the place of the nonzero features, the model is refitted, and the R2 score is used to evaluate each random sample. This process is repeated 500 times. Table 3 and Figure 6 indicate that the mean score of the k-means method is high and that the variance is low.
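One draw of the random replacement can be sketched as follows. The function name `sample_replacements` is hypothetical, and this sketch allows a feature to be drawn as its own replacement, which is an assumption; the refitting and scoring of the model with the replaced features are omitted.

```python
import random

def sample_replacements(selected, labels, rng):
    """For each selected feature, draw a feature index from the same cluster.

    selected: indices of the nonzero features.
    labels:   cluster label of every feature (e.g. from k-means).
    rng:      random.Random instance, so that draws are reproducible.
    """
    by_cluster = {}
    for idx, lab in enumerate(labels):
        by_cluster.setdefault(lab, []).append(idx)
    # Assumption: the original feature itself may be drawn as its replacement.
    return [rng.choice(by_cluster[labels[s]]) for s in selected]
```

Repeating this draw, refitting the model on the replaced feature set, and recording the R2 score each time gives the distribution of scores used for validation.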

5.3. Case-Specific Inferences

The results of the model are applied to the nonzero parameters, and the effects of these parameters on the expected trip distribution are additive on the logarithmic scale. Thus, the interpretation of the result is explicit. The positive parameter values correspond to an increasing effect, whereas values less than 0 mean a decreasing effect. Table 4 shows the parameters and classes of nonzero features and Table 5 shows the k-means cluster results. The effects are analyzed through the combination of the results of the model and cluster.
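Because the variables are log-transformed and the link function is logarithmic, each coefficient acts as an elasticity: multiplying a feature by a factor c multiplies the expected flow by c raised to the coefficient. The short sketch below illustrates this arithmetic; the function name and the numeric values in the note are hypothetical and are not estimates from the model.

```python
def flow_multiplier(beta_j, feature_ratio):
    """Factor by which the expected flow changes when one feature is scaled.

    With log-transformed covariates, log(mu) = ... + beta_j * log(z_j), so scaling
    z_j by feature_ratio scales mu by feature_ratio ** beta_j.
    """
    return feature_ratio ** beta_j
```

For instance, with a hypothetical coefficient of 0.5, quadrupling the feature doubles the expected flow, whereas with a hypothetical coefficient of -1, doubling the feature halves it.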

Class I, which represents the city’s economic level, has a positive effect on the intercity trip distribution for both origin and destination cities. This result is intuitive because cities with developed economies interact actively with other cities. Class IV, which relates to the total and highway freight volumes, also shows positive effects. By contrast, the passenger volume has almost no effect (Classes IX and XII). This suggests that a large part of the intercity trip distribution on the Shandong freeway is freight, which is related to the highly developed rail network and even some roadway passenger volume (Class X). Class XIV, which represents the freight turnover volume, shows a negative effect because the negative effect of distance has greater influence than the positive effect of freight volume. The number of counties (Class VII) shows a positive effect, which corresponds to the city size. The result most inconsistent with previous research is Class IV, which represents the population and has no effect on the intercity trip distribution. The result can be explained by the complexity of the intercity trip distribution, and it suggests that the real driving factors are city economy and city size. The average time cost is the core traffic impedance factor. Similar to previous studies, the trip distribution shows a power-law decay effect with time cost.

6. Conclusions

In this study, a trip distribution modeling method from complex city characteristics was investigated based on Poisson distribution theory, and the application of lasso regression for feature selection and parameter estimation was presented. Finally, a clustering algorithm was applied on the model. The conclusion drawn was that the nonzero features could be replaced with other relevant features.

Traditional approaches, namely, gravity and destination choice models, consider the population of departure and destination cities. However, population was not a statistically significant variable in our models. The methodology addresses the problem of selecting explanatory variables that previous studies faced. Moreover, it provides a new approach in which the variables are selected by machine learning algorithms rather than by experts. Besides, the model generalizes well because the selected features are replaceable. A comparison of different models shows that our method is more effective than the gravity model. Furthermore, our method has no prerequisites. The trip distribution model based on the Poisson model, which has good fitness and wide inferential capabilities, can be an effective alternative to the gravity model.

Because data on intercity trip distribution are difficult to collect, the validation of the proposed method remains limited, and the city characteristics can be further diversified. Future research may collect more features, further verify the availability of the model, and quantitatively analyze the relationship between features.

Data Availability

The Shandong statistical data were sourced from the Shandong Bureau of Statistics.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Key R&D program under Grant no. 2016YFC0801700, Beijing Municipal Science and Technology Project no. Z171100000917016, and the National Natural Science Foundation Project under Grant no. U1636208.