Abstract
A crucial task in traffic data analysis is similarity pattern discovery, which is of great importance to understanding urban mobility and to traffic management. A wide range of methods for similarity discovery have recently been proposed, and most of them assume that the traffic data are complete. However, missing data are inevitable in the traffic data collection process for a variety of reasons. In this paper, we propose Bayesian nonparametric tensor decomposition (BNPTD) to achieve incomplete traffic data imputation and similarity pattern discovery simultaneously. BNPTD is a hierarchical probabilistic model that comprises Bayesian tensor decomposition and a Dirichlet process mixture model. Furthermore, we develop an efficient variational inference algorithm to learn the model. Extensive experiments were conducted on a smart card dataset collected in Guangzhou, China, demonstrating the effectiveness of our method. The proposed BNPTD is general and can also be applied to other spatiotemporal traffic data.
1. Introduction
Recent advances in data acquisition technologies and mobile computing have led to the collection of large quantities of urban traffic data from various sources, such as loop detector data, GPS data, and smart card data. These datasets capture rich spatiotemporal information about the whole transportation system and enable a range of traffic analyses. A crucial task in a data-driven transportation system is similarity pattern discovery. For example, as shown in Figure 1(a), two vehicles have different trajectories but share the same trip origin and trip destination. Figure 1(b) shows that the passenger inflow time series of subway station A and subway station B are similar over a week: both have peaks in the morning, and the peak values on weekends are smaller than those on weekdays.
Figure 1: panels (a) and (b).
These similarities are beneficial for understanding urban mobility patterns and for the authorities' policymaking. For example, at the aggregate level, classified management can be adopted in metro systems, and managers should pay more attention to station A and station B to prevent congestion during the morning peak. At the individual level, travelers with similar trip regularities, that is, familiar strangers [1], can be identified so that the authorities can perform more precise demand control and provide personalized travel services. In addition, the similarities can be used for anomaly detection and, as prior knowledge, for improving traffic prediction [2–4].
In general, similarity patterns can be extracted by clustering methods, such as the K-means algorithm and the density-based spatial clustering of applications with noise (DBSCAN) algorithm [5]. However, conventional clustering algorithms require complete data, while missing data are inevitable in the traffic data collection process. For example, some observations will not be recorded if the sensors are broken. Traditionally, the missing data are imputed first, and a clustering approach is then applied to the imputed data [6]. Missing data imputation and clustering are thus conducted separately, which is complex and time-consuming. Besides, frequently used clustering methods such as the K-means algorithm require the number of clusters to be set in advance, which necessitates a tedious sensitivity analysis.
For missing data imputation, a variety of methods have been proposed, such as multi-output Gaussian processes [7], deep generative models [8], and Bayesian tensor decomposition [9], among which Bayesian tensor decomposition has been reported to be more effective and efficient than the other methods. A tensor (i.e., a multiway array) is a generalization of a matrix that is widely used for modeling multidimensional traffic data and can capture multidimensional structural dependencies; for example, the passenger flow of subway stations can be represented as a third-order tensor (station × day × time of day). Tensor decomposition can then be employed to project station, day, and time into a latent space, represented by the so-called factor matrices. Finally, the missing data can be filled in through the interaction of the factor matrices. Recently, many studies have formulated tensor decomposition in a Bayesian framework [9,10], which can provide more robust estimates and uncertainty measures for the missing data.
Inspired by recent work on Bayesian tensor decomposition, we propose a novel framework named Bayesian nonparametric tensor decomposition (BNPTD) to achieve incomplete traffic data imputation and similarity pattern discovery simultaneously. BNPTD consists of two components: (1) Bayesian tensor decomposition and (2) a Dirichlet process mixture model (DPMM). The DPMM is a nonparametric Bayesian mixture model that has shown great promise for data clustering and allows for the automatic determination of an appropriate number of clusters. These two components are combined in a hierarchical probabilistic model, and a variational inference algorithm is presented to derive the posterior distributions of all the model parameters and hyperparameters. The combination not only finds groups of similar objects in the presence of missing data but also offers an adaptive prior to the Bayesian tensor decomposition model, which can further improve imputation performance.
In summary, our contributions are as follows:
(i) We propose a Bayesian nonparametric tensor decomposition model to achieve incomplete traffic data imputation and similarity pattern discovery simultaneously.
(ii) We present a variational inference algorithm to learn the BNPTD model. Variational inference tends to be faster and easier to scale to large-scale datasets than classical methods such as Markov chain Monte Carlo sampling.
(iii) Extensive experiments were conducted on a smart card dataset from Guangzhou, China. The results demonstrate that our approach finds interpretable similarity patterns and recovers the missing values well in the case of incomplete traffic data.
The rest of this paper is structured as follows. Section 2 presents a literature review. Section 3 describes the materials and methods. The results and discussion are presented in Section 4. Finally, Section 5 summarizes the main conclusions of this work.
2. Literature Review
In this section, we review previous approaches to missing data imputation and similarity pattern discovery in traffic data analysis.
2.1. Missing Data Imputation
Missing data are inevitable in the traffic data collection process, for many reasons. For example, some observations will be lost if the sensors are broken. Besides, when mobile sensors such as floating cars are used, some locations will not be covered at some times. Missing data imputation has therefore long been an active research topic, and the problem has been tackled with a wide range of methods. The traditional approach is to use time series methods such as the autoregressive integrated moving average (ARIMA) model and its variants [11]. Recently, machine learning methods that consider spatiotemporal correlations have achieved state-of-the-art performance. For example, Rodrigues et al. [7] proposed the use of multi-output Gaussian processes to model the complex spatial and temporal patterns for crowdsourced traffic data imputation. Yoon et al. [12] adapted the well-known generative adversarial nets (GAN) framework and proposed generative adversarial imputation nets (GAIN).
Collaborative filtering methods such as tensor decomposition have been shown to be more effective and efficient than other methods. A tensor (i.e., a multiway array) is a generalization of a matrix. The two most popular tensor decomposition frameworks are Tucker and CANDECOMP/PARAFAC (CP), both of which can capture the underlying multilinear factors. Kolda et al. [13] gave a comprehensive overview of tensor decomposition and its applications. Tan et al. [14] introduced tensors to preserve the multiway nature of traffic data and developed a tensor-decomposition-based imputation method for missing traffic data completion. Zhao et al. [10] formulated CP decomposition as a hierarchical probabilistic model and incorporated a sparsity-inducing prior over the factor matrices, resulting in automatic rank determination. Chen et al. [9] extended Bayesian probabilistic matrix factorization to higher-order tensors to recover missing values. Zhang et al. [15] proposed an iterative tensor decomposition approach and demonstrated its effectiveness under different missing cases.
2.2. Similarity Pattern Discovery
A crucial task in traffic data analysis is similarity pattern discovery, and the similarities can support demand control, personalized travel services, and anomaly detection. Similarity patterns can be extracted by clustering methods. Zhao et al. [2] proposed an unsupervised clustering-based approach to classify passengers in terms of the similarity of their travel patterns and to identify abnormal passengers in metro systems. Gan et al. [16] used the k-means algorithm to cluster metro stations and identified urban mobility patterns from a spatiotemporal perspective.
Besides, nonparametric clustering such as the Dirichlet process mixture model (DPMM) is widely used for similarity pattern discovery. The DPMM is a nonparametric Bayesian mixture model that allows for the automatic determination of an appropriate number of clusters. The most commonly used description of the DPMM is the Chinese restaurant process, for which inference can be performed by MCMC sampling [17]. Blei et al. [18] presented an alternative stick-breaking representation and adopted a variational inference algorithm for the DPMM. Variational inference is generally considered more efficient than MCMC sampling and better suited to large-scale datasets. In traffic data analysis, Ngan et al. [19] and Ngan et al. [20] applied the DPMM to outlier detection and validated its effectiveness via numerical experiments. However, the DPMM applies only to complete data, while real-world traffic data are inevitably incomplete.
In this paper, we propose Bayesian nonparametric tensor decomposition, which combines Bayesian tensor decomposition and the Dirichlet process mixture model in a hierarchical probabilistic model to achieve incomplete traffic data imputation and similarity pattern discovery simultaneously. In contrast to previous studies, the DPMM is applied to the low-rank features extracted by Bayesian tensor decomposition rather than to the raw data, which makes similarity pattern discovery feasible for incomplete traffic data. Moreover, the DPMM benefits missing data imputation by acting as an adaptive prior distribution.
3. Materials and Methods
The proposed BNPTD consists of two components: a Bayesian tensor decomposition model and a Dirichlet process mixture model. In the following subsections, we first describe these two components in detail. We then provide an overview of our Bayesian nonparametric tensor decomposition model. Finally, we derive variational approximations to learn the proposed BNPTD. The description below uses smart card data as an example, but it also applies to other spatiotemporal traffic data.
3.1. Bayesian Tensor Decomposition Model
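A smart card data record usually has multidimensional attributes such as origin station, destination station, day, and time of day. We organize the smart card data into a third-order tensor $\mathcal{X} \in \mathbb{R}^{M \times N \times T}$, where $M$, $N$, and $T$ represent the size of the station dimension, day dimension, and time dimension, respectively. Each entry in the tensor is denoted as $x_{ijt}$ and represents the inflow (or outflow) of the $i$th station in the $t$th time interval of the $j$th day.
$\mathcal{X}$ could be an incomplete tensor when traffic data are missing. We assume the element $x_{ijt}$ is observed if $(i, j, t) \in \Omega$, where $\Omega$ denotes a set of indices. Furthermore, we define a binary tensor $\mathcal{B}$ as an indicator of the observed entries, which has the same size as $\mathcal{X}$.
Then, CP decomposition factorizes the tensor into a sum of rank-one tensors; that is,
$$\mathcal{X} \approx \sum_{r=1}^{R} \mathbf{u}_{:r} \circ \mathbf{v}_{:r} \circ \mathbf{w}_{:r}, \tag{1}$$
where $\mathbf{U} \in \mathbb{R}^{M \times R}$, $\mathbf{V} \in \mathbb{R}^{N \times R}$, and $\mathbf{W} \in \mathbb{R}^{T \times R}$ represent the factor matrices of each dimension, respectively, and $\mathbf{u}_{:r}$, $\mathbf{v}_{:r}$, and $\mathbf{w}_{:r}$ are the $r$th column vectors of the factor matrices. The symbol $\circ$ denotes the outer product of vectors and $R$ is the tensor rank. Figure 2 is a visualization of the CP decomposition of a third-order tensor.
Elementwise, (1) is written as
$$x_{ijt} \approx \sum_{r=1}^{R} u_{ir} v_{jr} w_{tr}.$$
CP decomposition can be solved by the alternating least squares method [13]. To obtain a more robust solution and avoid overfitting, we resort to a Bayesian formulation and assume Gaussian observations as follows:
$$x_{ijt} \sim \mathcal{N}\Bigl(\sum_{r=1}^{R} u_{ir} v_{jr} w_{tr},\; \tau^{-1}\Bigr), \quad (i, j, t) \in \Omega,$$
where $\mathcal{N}(\cdot)$ denotes the Gaussian distribution and $\tau$ is the precision.
Maximum likelihood estimation under this observation model is equivalent to minimizing the squared loss $\sum_{(i,j,t) \in \Omega} \bigl(x_{ijt} - \sum_{r=1}^{R} u_{ir} v_{jr} w_{tr}\bigr)^{2}$. To avoid overfitting, we place conjugate prior distributions over the factor matrices and hyperparameters, which correspond to regularization terms: where $\mathbf{u}_{i}$ is the $i$th row vector of the factor matrix $\mathbf{U}$, and $\mathbf{V}$ and $\mathbf{W}$ have similar prior distributions.
We also place a conjugate prior distribution over the precision; that is,
$$\tau \sim \mathrm{Gamma}(a_{0}, b_{0}),$$
where $a_{0}$ and $b_{0}$ denote the shape and rate hyperparameters.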
The graphical model of Bayesian tensor decomposition is illustrated in Figure 3.
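The CP reconstruction and the squared loss over observed entries described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the authors' implementation: the function names, the tensor sizes, and the 30% missing rate are assumptions chosen for the example.

```python
import numpy as np

def cp_reconstruct(U, V, W):
    """Rebuild a third-order tensor from CP factor matrices.

    Rows of U, V, W index stations, days, and time slots; columns index the rank.
    """
    # x_hat[i, j, t] = sum_r U[i, r] * V[j, r] * W[t, r]
    return np.einsum("ir,jr,tr->ijt", U, V, W)

def masked_squared_loss(X, B, U, V, W):
    """Squared reconstruction error summed over observed entries only (B == 1)."""
    residual = np.where(B == 1, X - cp_reconstruct(U, V, W), 0.0)
    return float(np.sum(residual ** 2))

# Hypothetical sizes: stations, days, time slots, CP rank
rng = np.random.default_rng(0)
M, N, T, R = 5, 4, 6, 2
U = rng.normal(size=(M, R))
V = rng.normal(size=(N, R))
W = rng.normal(size=(T, R))

X = cp_reconstruct(U, V, W)                      # noiseless synthetic tensor
B = (rng.random(X.shape) > 0.3).astype(int)      # 1 = observed, 0 = missing

loss = masked_squared_loss(X, B, U, V, W)
# Impute: keep observed entries, fill the rest from the factor matrices
X_filled = np.where(B == 1, X, cp_reconstruct(U, V, W))
```

Because the synthetic tensor is generated from the same factors without noise, the loss here is zero; with real data the factors would be estimated, for example by alternating least squares or by the Bayesian inference developed later in the paper.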
3.2. Dirichlet Process Mixture Model
The smart card data can be represented as a third-order tensor (station × day × time of day); Bayesian tensor decomposition then factorizes the tensor into the factor matrices $\mathbf{U}$, $\mathbf{V}$, and $\mathbf{W}$. The factor matrices also encode the features of each attribute; for example, the row vector $\mathbf{u}_{i}$ reflects the passenger departure (or arrival) demand pattern of the $i$th station in the specific time interval. In order to find groups of stations that have similar demand patterns, we place a Dirichlet process mixture prior over the factor matrix $\mathbf{U}$.
The Dirichlet process mixture model is a Bayesian nonparametric technique that allows for the automatic determination of an appropriate number of mixture components. In this paper, we adopt the stick-breaking construction of the DPMM [18]. Specifically, let $z_{i}$ be a categorical variable that indicates the demand pattern category of the $i$th station, and let $K$ be the truncation level, which could be chosen large relative to the number of stations. The generative process of the DPMM is given as follows:
(1) For each mode $k = 1, \ldots, K$:
(a) Draw $\beta_{k} \sim \mathrm{Beta}(1, \alpha)$
(b) Compute $\pi_{k} = \beta_{k} \prod_{l=1}^{k-1} (1 - \beta_{l})$
(c) Draw $\boldsymbol{\mu}_{k} \sim G_{0}$
(2) For each station $i = 1, \ldots, M$:
(a) Draw $z_{i} \sim \mathrm{Categorical}(\boldsymbol{\pi})$
(b) Draw $\mathbf{u}_{i}$ from the mixture component of mode $z_{i}$
where $\alpha$ is a hyperparameter that can be set in advance, $\pi_{k}$ is the probability of mode $k$, and $G_{0}$ is the base distribution, which is chosen to be Gaussian in this work. is identity matrix and is hyperparameter. The graphical model of DPMM is illustrated in Figure 4.
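The stick-breaking generative process can be simulated as follows. This is a sketch under stated assumptions rather than the paper's exact model: the last stick is forced to take the remaining mass so that the truncated weights sum to one, the base distribution is taken to be a standard Gaussian, and the component noise scale of 0.1 as well as all sizes are hypothetical.

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng):
    """Truncated stick-breaking: returns K mixture weights that sum to 1."""
    beta = rng.beta(1.0, alpha, size=K)
    beta[-1] = 1.0  # truncation (assumption): the last stick takes the remainder
    # Length of stick left before each break: 1, (1-b1), (1-b1)(1-b2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta[:-1])))
    return beta * remaining

rng = np.random.default_rng(0)
alpha, K = 1.0, 10        # concentration and truncation level (hypothetical)
M, R = 50, 3              # number of stations and feature size (hypothetical)

pi = stick_breaking_weights(alpha, K, rng)        # step (1)(a)-(b)
mu = rng.normal(size=(K, R))                      # step (1)(c): means from a Gaussian base
z = rng.choice(K, size=M, p=pi)                   # step (2)(a): mode of each station
u = mu[z] + 0.1 * rng.normal(size=(M, R))         # step (2)(b): features near the mode mean

n_used = np.unique(z).size                        # modes that actually received stations
```

Even though `K` modes are available, only `n_used` of them are occupied in a given draw, which is the mechanism that lets the model settle on a number of clusters without fixing it in advance.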
3.3. The Proposed Bayesian Nonparametric Tensor Decomposition Model
BNPTD comprises Bayesian tensor decomposition and the DPMM, and these components are coupled via a hierarchical probabilistic model. The graphical model of BNPTD is illustrated in Figure 5.
The advantages of the combination are as follows: (1) missing data imputation and similarity pattern discovery are achieved simultaneously, which is more efficient and avoids error accumulation, and (2) the DPMM offers an adaptive prior distribution to Bayesian tensor decomposition, which corresponds to an adaptive regularization term and can further improve imputation performance.
3.4. Model Learning via Variational Inference
The goal of BNPTD model learning is to derive the posterior distributions of the parameters and hyperparameters. Variational inference approximates the posterior distributions through optimization [21]. Specifically, for a group of random variables $\boldsymbol{\theta}$, variational inference seeks a variational distribution $q(\boldsymbol{\theta})$ to approximate the true posterior distribution $p(\boldsymbol{\theta} \mid \mathcal{X}_{\Omega})$, where $\mathcal{X}_{\Omega}$ denotes the observed entries. The quality of the approximation is measured by the Kullback–Leibler (KL) divergence, and a KL divergence of 0 indicates that the two distributions are identical:
$$\mathrm{KL}\bigl(q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathcal{X}_{\Omega})\bigr) = \mathbb{E}[\ln q(\boldsymbol{\theta})] - \mathbb{E}[\ln p(\boldsymbol{\theta}, \mathcal{X}_{\Omega})] + \ln p(\mathcal{X}_{\Omega}),$$
where all expectations are taken with respect to $q(\boldsymbol{\theta})$. Because the KL divergence is difficult to compute, variational inference optimizes an alternative objective: minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO).
We further focus on the mean-field variational family, in which the variational distribution factorizes over the variables $\boldsymbol{\theta}_{j}$; that is,
$$q(\boldsymbol{\theta}) = \prod_{j} q_{j}(\boldsymbol{\theta}_{j}).$$
On the basis of the mean-field variational family, the optimal variational posterior distribution of each $\boldsymbol{\theta}_{j}$ can be derived in turn by maximizing the ELBO, and the optimal form is given by
$$\ln q_{j}^{*}(\boldsymbol{\theta}_{j}) = \mathbb{E}_{-j}[\ln p(\mathcal{X}_{\Omega}, \boldsymbol{\theta})] + \mathrm{const},$$
where the notation $\mathbb{E}_{-j}[\cdot]$ denotes an expectation with respect to the distributions over all variables except $\boldsymbol{\theta}_{j}$. The derivation of the optimal variational posterior distributions of the parameters and hyperparameters can be found in the Supplementary File. Variational inference updates the model parameters and hyperparameters alternately. Algorithm 1 outlines the training process of BNPTD.
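The equivalence between minimizing the KL divergence and maximizing the ELBO, which this subsection relies on, follows from a short rearrangement. The notation below is an assumption chosen for illustration: $\boldsymbol{\theta}$ collects all latent variables, $\mathcal{X}_{\Omega}$ denotes the observed data, and $q$ is the variational distribution.

```latex
\begin{aligned}
\mathrm{ELBO}(q) &= \mathbb{E}_{q}\bigl[\ln p(\boldsymbol{\theta}, \mathcal{X}_{\Omega})\bigr]
                  - \mathbb{E}_{q}\bigl[\ln q(\boldsymbol{\theta})\bigr], \\
\ln p(\mathcal{X}_{\Omega}) &= \mathrm{ELBO}(q)
                  + \mathrm{KL}\bigl(q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathcal{X}_{\Omega})\bigr).
\end{aligned}
```

Since $\ln p(\mathcal{X}_{\Omega})$ does not depend on $q$ and the KL divergence is nonnegative, any increase in the ELBO lowers the KL divergence by the same amount, and the ELBO never exceeds the log evidence.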

4. Results and Discussion
To evaluate the performance of our method, we conduct extensive experiments on a real-world smart card dataset collected from subway stations in Guangzhou, China. This dataset contains a large number of tap-in/tap-out records from 3/7/2017 to 16/7/2017. Because the subway does not operate around the clock, we focus on the smart card data from 6 a.m. to 10 p.m., which covers the main travel period of the day. We organize the dataset into a third-order tensor (station × day × time of day). Each entry of this tensor denotes the passenger inflow. The length of each time interval is set to 10 minutes, and the size of the tensor is .
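The tensor construction described above can be sketched as a simple counting pass over tap-in records. The record layout, a pair of station index and timestamp, and the function name are hypothetical, since the raw data format is not given here. The study period of 14 days and the 96 ten-minute slots between 6 a.m. and 10 p.m. follow from the description; the number of stations is left as a parameter.

```python
import numpy as np
from datetime import datetime

START_DAY = datetime(2017, 7, 3)       # first day of the study period
N_DAYS = 14                            # 3/7/2017 to 16/7/2017 inclusive
FIRST_HOUR, LAST_HOUR = 6, 22          # service window: 6 a.m. to 10 p.m.
SLOT_MIN = 10                          # interval length in minutes
N_SLOTS = (LAST_HOUR - FIRST_HOUR) * 60 // SLOT_MIN   # 96 slots per day

def build_inflow_tensor(records, n_stations):
    """Count tap-ins per (station, day, 10-minute slot).

    `records` is an iterable of (station_index, tap_in_datetime) pairs
    (a hypothetical layout used only for this sketch).
    """
    X = np.zeros((n_stations, N_DAYS, N_SLOTS))
    for station, ts in records:
        day = (ts.date() - START_DAY.date()).days
        slot = ((ts.hour - FIRST_HOUR) * 60 + ts.minute) // SLOT_MIN
        # Skip records outside the study period or the service window
        if day in range(N_DAYS) and slot in range(N_SLOTS):
            X[station, day, slot] += 1
    return X

records = [
    (0, datetime(2017, 7, 3, 6, 5)),
    (0, datetime(2017, 7, 3, 6, 9)),
    (1, datetime(2017, 7, 16, 21, 59)),
    (1, datetime(2017, 7, 4, 23, 0)),   # outside the service window, ignored
]
X = build_inflow_tensor(records, n_stations=2)
```

In this toy example the two early records of station 0 fall into the first slot of the first day, and the late-night record is dropped.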
4.1. The Missing Data Imputation Performance
To show the effectiveness of the adaptive prior distribution, we compare BNPTD with several baselines in terms of missing data imputation: DA (daily average), BTD (Bayesian tensor decomposition), and GAIN (a deep generative model [8]). BTD is also trained via the variational inference algorithm, and BTD and BNPTD share the same initialization. We run experiments in two missing scenarios: a random missing scenario and a continuous missing scenario. Besides, we set (the maximum clusters), (CP rank).
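The two missing scenarios can be simulated with binary masks over a complete tensor. The sketch below is an assumption about how such masks might be built, since the exact construction is not specified here: random missing drops entries independently, while continuous missing is modeled by dropping whole station-day series so that each gap spans consecutive time slots. Function names and sizes are hypothetical.

```python
import numpy as np

def random_missing_mask(shape, ratio, rng):
    """Drop each entry independently with probability `ratio` (1 = observed)."""
    return (rng.random(shape) >= ratio).astype(int)

def continuous_missing_mask(shape, ratio, rng):
    """Drop whole (station, day) series, so every gap covers a full day of slots.

    This is one possible reading of "continuous missing" (assumption).
    """
    n_stations, n_days, n_slots = shape
    keep = rng.random((n_stations, n_days)) >= ratio
    # Repeat the station-day decision along the time-slot axis
    return np.repeat(keep[:, :, None], n_slots, axis=2).astype(int)

rng = np.random.default_rng(0)
shape = (50, 14, 96)                       # hypothetical tensor size
B_random = random_missing_mask(shape, 0.3, rng)
B_continuous = continuous_missing_mask(shape, 0.3, rng)

observed_share = B_random.mean()           # close to 0.7 for a 30% missing ratio
```

Imputation accuracy is then evaluated on the entries where the mask is 0, using the held-out true values.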
In our experiments, we evaluate the imputation accuracy with two widely applied metrics, namely, root mean square error (RMSE) and mean absolute error (MAE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{i} - \hat{x}_{i})^{2}}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert x_{i} - \hat{x}_{i} \rvert,$$
where $x_{i}$ and $\hat{x}_{i}$ denote the actual and imputed values of the $i$th of the $n$ missing entries.
The missing ratio ranges from 10% to 50%. Table 1 compares the performance of the proposed method with the baselines in the random missing scenario, and Table 2 shows the results for the continuous missing scenario. The proposed BNPTD outperforms the baselines, achieving the lowest RMSE and MAE for most missing ratios. In some cases, the performance of BNPTD in Tables 1 and 2 is close to that of BTD; a possible reason is that the adaptive prior distribution is numerically close to a single prior distribution. Nevertheless, unlike a single prior distribution, the adaptive prior distribution enables similarity pattern discovery.
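Both metrics are straightforward to compute on the held-out entries. The following sketch uses made-up values purely to illustrate the arithmetic; the function names are not taken from the paper.

```python
import numpy as np

def rmse(x_true, x_hat):
    """Root mean square error between actual and imputed values."""
    x_true, x_hat = np.asarray(x_true, float), np.asarray(x_hat, float)
    return float(np.sqrt(np.mean((x_true - x_hat) ** 2)))

def mae(x_true, x_hat):
    """Mean absolute error between actual and imputed values."""
    x_true, x_hat = np.asarray(x_true, float), np.asarray(x_hat, float)
    return float(np.mean(np.abs(x_true - x_hat)))

# Illustrative passenger counts for four held-out entries
x_true = [100, 200, 300, 400]
x_hat = [110, 190, 300, 420]

score_rmse = rmse(x_true, x_hat)   # sqrt((100 + 100 + 0 + 400) / 4) = sqrt(150)
score_mae = mae(x_true, x_hat)     # (10 + 10 + 0 + 20) / 4 = 10
```

RMSE penalizes the single large error of 20 more heavily than MAE does, which is why the two metrics are usually reported together.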
4.2. The Similarity Pattern Discovery
The CP rank determines the size of the low-rank feature, and a large CP rank may lead to redundant features, which are detrimental to similarity pattern discovery. In this subsection, we set CP rank and the maximum clusters empirically. The random missing ratio of the incomplete smart card data is 10%. The result of similarity pattern discovery is illustrated in Figure 6, where the passenger flow time series that share a departure pattern are plotted in the same subfigure. There are 7 clusters, which represent different departure patterns. Moreover, Figure 7 visualizes the spatial distribution of the stations on Google Maps. The patterns are described in detail as follows:
(a) In Cluster 1, the subway stations have large passenger volumes every day, with no noticeable difference between working days and nonworking days. From Figure 7, we observe that these stations are adjacent to passenger transportation hubs such as train stations and high-speed railway stations.
(b) The second departure pattern is anomalous: the passenger volume on the nonworking days from 8/7/2017 to 9/7/2017 is larger than that on working days. A web search indicates that an exhibition was being held near these subway stations. The proposed BNPTD can therefore also detect special events and can be used for anomaly detection.
(c) In Cluster 3, the passenger volume of the stations is lower than 600 per 10 minutes. These stations are not usually crowded, and their spare capacity could be further exploited to serve more passengers.
(d) Cluster 4 has a larger passenger volume than Cluster 3. The stations belonging to Cluster 4 have small peaks during the morning or evening rush hour. Besides, as shown in Figure 7, these stations are mainly situated in the urban fringe, far from the central business area.
(e) Cluster 5 is a morning peak departure pattern, and the peak values on weekends are smaller than those on weekdays. As shown in Figure 7, the stations are mainly close to residential areas. In the morning, people leave home and go to work by subway.
(f) The stations with the sixth departure pattern have larger passenger volumes in the evening rush hour, and the peak values on weekends are smaller than those on weekdays. The stations are mainly located in work districts, and many people working in these areas take the subway home after work.
(g) The seventh departure pattern has an extremely large passenger volume in the evening rush hour, and, as shown in Figure 7, these stations are situated in the middle of the central business district (CBD).
Figure 6: panels (a) to (g).
4.3. The Clustering Stability with Different Missing Ratios
In this subsection, an evaluation measure called purity is adopted to show the clustering stability of BNPTD under different missing ratios. Purity is an external index that measures the similarity of the formed clusters to external clusters such as the ground truth [22]. Let $\mathcal{G} = \{g_{1}, g_{2}, \ldots\}$ be the external clusters and $\mathcal{C} = \{c_{1}, c_{2}, \ldots\}$ be the clusters produced by BNPTD. To compute purity, each cluster in $\mathcal{C}$ is assigned to the class that is most frequent in the cluster, and the accuracy of this assignment is measured by counting the number of correctly assigned stations and dividing by the number of stations $M$; that is,
$$\mathrm{purity}(\mathcal{C}, \mathcal{G}) = \frac{1}{M} \sum_{k} \max_{j} \lvert c_{k} \cap g_{j} \rvert.$$
Purity lies between 0 and 1, and the clustering is perfect if purity is equal to 1. We take the clustering result at a missing ratio of 10% as the external clusters and run experiments with missing ratios from 15% to 50% in steps of 5%. Figure 8 shows the purity for different missing ratios. We observe that purity is greater than 0.6 in all cases, which indicates that the clustering result is reasonably stable.
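The purity computation can be sketched directly from its definition. The label vectors below are invented for illustration; in the setting described above, `reference` would hold the cluster labels obtained at a 10% missing ratio and `predicted` the labels obtained at a higher missing ratio.

```python
import numpy as np

def purity(reference, predicted):
    """Share of items that belong to the majority reference class of their cluster."""
    reference = np.asarray(reference)
    predicted = np.asarray(predicted)
    correct = 0
    for cluster in np.unique(predicted):
        members = reference[predicted == cluster]
        # Size of the most frequent reference class inside this cluster
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()
    return correct / reference.size

reference = [0, 0, 0, 1, 1, 2, 2, 2]   # hypothetical labels at 10% missing
predicted = [5, 5, 1, 1, 1, 2, 2, 5]   # hypothetical labels at a higher ratio

score = purity(reference, predicted)   # (2 + 2 + 2) / 8 = 0.75
```

Purity does not require the label values themselves to match, only the grouping, so it is unaffected by the arbitrary numbering of clusters between runs.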
5. Conclusions
In this paper, a novel Bayesian nonparametric tensor decomposition model called BNPTD is proposed. The model combines Bayesian tensor decomposition and the Dirichlet process mixture model via a hierarchical probabilistic model, which achieves incomplete traffic data imputation and similarity pattern discovery simultaneously. Moreover, we derive a variational inference algorithm to learn the model efficiently. Experiments on a real-world smart card dataset show the effectiveness of the proposed model.
The similarity patterns extracted in this paper not only help us understand urban mobility patterns but can also improve traffic prediction as prior knowledge. In future work, we plan to develop multitask prediction based on these similarities.
Data Availability
The data used to support the results of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by the project of National Natural Science Foundation of China (no. U1811463) and the Natural Science Foundation of Guangdong Province, China (no. 20187616042030004).
Supplementary Materials
The derivation of the optimal variational posterior distributions of the parameters and hyperparameters can be found in the Supplementary File.