Abstract

We study a matrix factorization problem, that is, to find two factor matrices $U$ and $V$ such that $X \approx U^\top V$, where $X$ is a matrix composed of the values of $n$ objects at $m$ consecutive time points $t_1, t_2, \ldots, t_m$. We first present MAFED, a constrained optimization model for this problem, which straightforwardly performs factorization on $X$. Then, based on the interplay of the data in $X$, $U$, and $V$, a probabilistic graphical model with the same optimization objective is constructed, in which the structural dependencies of the data in these matrices are revealed. Finally, we present a fitting algorithm to solve the proposed MAFED model, which produces the desired factorization. Empirical studies on real-world datasets demonstrate that our approach outperforms state-of-the-art comparison algorithms.

1. Introduction

The essence of the matrix factorization (MF) problem is to find two factor matrices $U$ and $V$ such that their product can approximate a given matrix $X$; that is, $X \approx U^\top V$. As a fundamental model in machine learning and data mining, MF methods have been widely used in various applications, such as collaborative filtering [1, 2], social network analysis [3], text analysis [4], image analysis [5, 6], and biological analysis [7].

In this work, we study a special variety of the MF problem. Let the matrix $X \in \mathbb{R}^{n \times m}$ consist of the values of $n$ objects $o_1, o_2, \ldots, o_n$ at a series of time points $t_1 < t_2 < \cdots < t_m$, where the entry $x_{ij}$ is the value of $o_i$ at $t_j$. Our goal is to find suitable factor matrices $U$ and $V$ that can not only approximate $X$ but also depict the evolution of the objects over time.

A typical application of this work is to estimate missing historical traffic speed data, which is necessary for many transportation information systems [8–10]. Given an urban road network with roads $r_1, r_2, \ldots, r_n$, we let the $i$th row of $X$ correspond to the traffic speed of $r_i$ and let the entry $x_{ij}$ be the speed of $r_i$ at $t_j$; then $X$ is composed of the speed values of the $n$ roads at the $m$ time points. Especially, if the value of $x_{ij}$ was not collected, we say it is missing. With this representation, to estimate the missing values, we can first fit $U$ and $V$ with the nonmissing entries of $X$, and then for each missing $x_{ij}$, take its estimation as $u_i^\top v_j$, where $u_i$ is the $i$th column of $U$ and $v_j$ is the $j$th column of $V$.
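To make the imputation step concrete, the following minimal NumPy sketch (our illustration, not code from the paper; we assume missing entries are marked as NaN) fills the missing entries of a speed matrix from fitted factors:

import numpy as np

def impute_missing(X, U, V):
    # U is k x n and V is k x m; the product u_i^T v_j estimates entry (i, j).
    X_hat = U.T @ V
    # Keep the observed values; replace the missing (NaN) entries with estimates.
    return np.where(np.isnan(X), X_hat, X)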

A main difference between this work and existing MF models is that we take time evolution effects into account. Well-studied MF models, such as nonnegative matrix factorization [5], max-margin matrix factorization [11, 12], and probabilistic matrix factorization [13], are based on the i.i.d. assumption, which implies that the behavior of the objects evolving with time is ignored and treated as independent of time; thus their ability to describe time-varying data is poor. To tackle this issue, the factors that affect the evolution of the objects are explicitly modeled in our contribution, which enables our model to interact with the time-dependent information. Furthermore, our empirical studies confirm that the proposed model scales well with the size of the data.

The remainder of this paper is organized as follows. Section 2 briefly summarizes the notations used in the paper. Section 3 reviews related work on matrix factorization. Section 4 presents our matrix factorization model as well as its statistical mechanism. The proposed algorithm is introduced in Section 5. Section 6 is devoted to the analysis of the experiments. Our conclusions and future work are presented in Section 7.

2. Notations

For a vector $x \in \mathbb{R}^k$, we denote its 2-norm as $\|x\| = \sqrt{\sum_{i=1}^{k} x_i^2}$. For a matrix $A \in \mathbb{R}^{p \times q}$, we let $a_j$ be the $j$th column of $A$ and denote its Frobenius norm as $\|A\|_F = \sqrt{\sum_{i=1}^{p} \sum_{j=1}^{q} a_{ij}^2}$.

3. Matrix Factorization

The essence of matrix factorization is to find suitable factor matrices $U \in \mathbb{R}^{k \times n}$ and $V \in \mathbb{R}^{k \times m}$ such that their product can approximate a given matrix $X \in \mathbb{R}^{n \times m}$. In principle, the MF problem can be formulated as the optimization model below:

  $\min_{U,V} \; \ell(X, U^\top V)$,  (1)

where the function $\ell$ is used to measure the closeness of the approximation $U^\top V$ to the target $X$. In general, $\ell$ can be decomposed into the sum of the pairwise losses between the entries of $X$ and $U^\top V$; that is, $\ell(X, U^\top V) = \sum_{i,j} \ell(x_{ij}, u_i^\top v_j)$. The most commonly used forms of the loss function include the square loss ($\ell(x, y) = (x - y)^2$) [1, 13, 14], the 0-1 loss ($\ell(x, y) = \mathbb{1}[x \neq y]$) [11], and the divergence loss ($\ell(x, y) = x \log(x/y) - x + y$) [6].

Notably, if $(U, V)$ is a solution of (1), then for any scalar $c \neq 0$, $(cU, (1/c)V)$ is another solution, and hence the problem is ill posed. To overcome this obstacle, various constraints on $U$ and $V$ have been introduced, such as constraints on the entries [5], constraints on the sparseness [15, 16], constraints on the norms [13, 17], and constraints on the ranks [18, 19]. All these constraints, from the perspective of statistical learning theory, can be regarded as the length of the model to be fitted. According to the minimum description length principle [20, 21], a shorter description length indicates a better model, and thus most of them can be incorporated into (1) as additional regularization terms; that is,

  $\min_{U,V} \; \ell(X, U^\top V) + \lambda\, r(U, V)$,  (2)

where the regularization term $r(U, V)$ with factor $\lambda$ corresponds to the constraints on $U$ and $V$.
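As a concrete instance of (2), the following NumPy fragment (our illustration; the square loss and the Frobenius-norm regularizer are one common choice, not the only one) evaluates the regularized objective:

import numpy as np

def mf_objective(X, U, V, lam):
    # Square loss between X and U^T V plus Frobenius-norm regularization.
    fit = np.sum((X - U.T @ V) ** 2)
    reg = np.sum(U ** 2) + np.sum(V ** 2)
    return fit + lam * reg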

As a transductive model, (2) has many appealing mathematical properties, such as a generalization error bound [22] and exactness guarantees [14, 23]. However, as is well known, compared with a generative model, a main weakness of a transductive model is that it can hardly be used to describe the relations in the data. In particular, for our studied problem, even though the model may work well, it is difficult to express the interplay between the model and the evolution of the objects.

4. The Proposed Model

In this section the proposed model of matrix factorization for evolution data (MAFED) is presented. We first formalize MAFED as a constrained optimization model and then present a probabilistic graphical model whose solution is equivalent to that of the MAFED model. From this we obtain the generative interpretation of MAFED, with which we elaborate the ability of MAFED to describe the data relations. The fitting algorithm for MAFED is presented in Section 5.

4.1. The Model

Let $X \in \mathbb{R}^{n \times m}$ be the matrix generated by $n$ independent objects $o_1, \ldots, o_n$ at $m$ consecutive time points $t_1 < t_2 < \cdots < t_m$, and let $\Omega$ denote the index set of the observed entries of $X$. Then our MAFED model can be formulated as follows:

  $\min_{U,V} \; \sum_{(i,j)\in\Omega} (x_{ij} - u_i^\top v_j)^2$  (3a)
  s.t. $\|U\|_F^2 \le \rho_U$,  (3b)
     $\|v_1\|^2 \le \rho_1$,  (3c)
     $\|v_j - v_{j-1}\|^2 \le \theta_j, \quad j = 2, \ldots, m$.  (3d)

Here $u_i$ and $v_j$ correspond to the $i$th column of $U$ and the $j$th column of $V$, respectively.

From constraints (3b)–(3d), we can see that the roles of $U$ and $V$ in (3a) are asymmetric. We call $U$ the object matrix, of which the column $u_i$ ($1 \le i \le n$) is the invariant feature vector of $o_i$. Similarly, we call $V$ the environment matrix, of which the column $v_j$ ($1 \le j \le m$) is the time-varying environment feature vector at $t_j$. Note that for each $v_j$, the range in which it takes effect is global; that is, at each time point $t_j$, all the objects $o_1, o_2, \ldots, o_n$ share the same environment feature vector $v_j$. As a result, every entry $x_{ij}$ is composed from the object's intrinsic feature vector $u_i$ and the environment feature vector $v_j$, and hence for each object $o_i$, its evolution over time is controlled by $V$. To be more illustrative, let us consider the following example. Let $X$ be the speed matrix of the $n$ roads at the time points $t_1, \ldots, t_m$, let $U$ be the feature matrix of the roads, and let $V$ be the feature matrix of the environment. Then $u_i$ corresponds to the feature description of $r_i$, and the entries of $u_i$ may consist of the intrinsic features of the road, such as the surface roughness, the number of lanes, and the role in the traffic network; similarly, for $V$, every $v_j$ corresponds to the description of the environment at $t_j$, and the entries of $v_j$ can be the time, the weather, the visibility, and so on. Therefore, the product $u_i^\top v_j$ can be interpreted as follows: the speed of $r_i$ at $t_j$ is composed of the road feature $u_i$ and the environment feature $v_j$.

The ability of MAFED to describe the time evolution effect lies in constraint (3d). As is well known, a fundamental characteristic of evolution, for all objects, is that when the interval between two adjacent time points, for example, $t_{j-1}$ and $t_j$, is small enough, the corresponding values of the objects, that is, $x_{i,j-1}$ and $x_{ij}$, should tend to remain the same. To satisfy this constraint, in (3d) we introduce the tunable values $\theta_j$. When $t_j - t_{j-1} \to 0$, we let $\theta_j \to 0$. Now, by the Cauchy–Schwarz inequality, we have

  $|u_i^\top v_j - u_i^\top v_{j-1}| = |u_i^\top (v_j - v_{j-1})| \le \|u_i\|\, \|v_j - v_{j-1}\|$.  (4)

When $u_i$ and $v_{j-1}$ are fixed, $\theta_j \to 0$ leads to $v_j \to v_{j-1}$, and so via (3d) it is guaranteed that $u_i^\top v_j \to u_i^\top v_{j-1}$.

On the other hand, when the time interval $t_j - t_{j-1} \to \infty$, the values that $x_{i,j-1}$ and $x_{ij}$ take will tend to be independent. For this case, we let $\theta_j \to \infty$, indicating that $v_j$ can take its value regardless of $v_{j-1}$.

With constraints (3c) and (3d), we can control the value of $\|v_j\|$ via the triangle inequality:

  $\|v_j\| \le \|v_1\| + \sum_{l=2}^{j} \|v_l - v_{l-1}\| \le \sqrt{\rho_1} + \sum_{l=2}^{j} \sqrt{\theta_l}$,  (5)

and hence

  $\sum_{j=1}^{m} \|v_j\| \le m\sqrt{\rho_1} + \sum_{j=2}^{m} (m - j + 1)\sqrt{\theta_j}$.  (6)

The result above shows that we can bound the sum $\sum_{j=1}^{m} \|v_j\|$ by selecting appropriate parameters $\rho_1$ and $\theta_2, \ldots, \theta_m$. Besides, from constraint (3b), the sum $\sum_{i=1}^{n} \|u_i\|^2 = \|U\|_F^2$ is bounded by $\rho_U$. According to [12, 24], under some suitable conditions, the sum $\frac{1}{2}(\|U\|_F^2 + \|V\|_F^2)$ is the convex envelope of $\operatorname{rank}(U^\top V)$, and hence MAFED accommodates the ability to control the rank of the model. On the other hand, as shown by Candès and Recht [14] and Candès and Tao [23], when the target matrix is of low rank, it is possible to exactly recover it from only a few of its observations. This suggests that MAFED can achieve high accuracy in missing-data imputation applications.
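For reference, the variational characterization from [12, 24] behind this remark can be written as

  $\|A\|_* \;=\; \min_{U, V:\, A = U^\top V} \frac{1}{2}\big(\|U\|_F^2 + \|V\|_F^2\big)$,

where the trace norm $\|A\|_*$ (the sum of the singular values of $A$) is the convex envelope of $\operatorname{rank}(A)$ on the spectral-norm unit ball.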

4.2. The Generative Interpretation

In this section, to investigate how MAFED interplays with the evolution of the data from the view of generative modeling, we first present a probabilistic graphical model (PGM [25]) and then show that the maximum a posteriori (MAP) solution to the proposed PGM is exactly the solution of (3a).

First of all, the Lagrangian relaxation of (3a)–(3d) is

  $\min_{U,V} \; \sum_{(i,j)\in\Omega} (x_{ij} - u_i^\top v_j)^2 + \lambda_U \|U\|_F^2 + \lambda_1 \|v_1\|^2 + \lambda_V \sum_{j=2}^{m} \|v_j - v_{j-1}\|^2$.  (7)

Here $\lambda_U$, $\lambda_1$, and $\lambda_V$ are tunable parameters, corresponding to the upper bounds $\rho_U$, $\rho_1$, and $\theta_j$ of constraints (3b)–(3d), respectively, where greater bounds correspond to smaller parameters, and vice versa.

Consider the PGM presented in Figure 1. We make the following assumptions on $U$, $V$, and $X$.

(1) The columns of $U$ are drawn from the same Gaussian distribution with mean $\mathbf{0}$ and covariance matrix $\sigma_U^2 I_k$; that is, for $1 \le i \le n$, we have (here we use $k$ to denote the dimension of $u_i$)

  $p(u_i \mid \sigma_U^2) = \mathcal{N}(u_i \mid \mathbf{0}, \sigma_U^2 I_k)$.  (8)

(2) The columns of $V$ are linearly dependent in the order of their subscripts with respect to the prespecified priors $\sigma_1^2$ and $\sigma_V^2$; that is,

  $v_j = v_{j-1} + \varepsilon_j, \quad j = 1, 2, \ldots, m$.  (9)

Here, for clarity, we let $v_0 = \mathbf{0}$ and $\varepsilon_1 = v_1$. We also assume $v_1$ is a Gaussian random vector with mean $\mathbf{0}$ and covariance $\sigma_1^2 I_k$, and $\varepsilon_j$ ($2 \le j \le m$) is a Gaussian random vector with mean $\mathbf{0}$ and covariance $\sigma_V^2 I_k$; that is,

  $p(v_1 \mid \sigma_1^2) = \mathcal{N}(v_1 \mid \mathbf{0}, \sigma_1^2 I_k), \quad p(v_j \mid v_{j-1}, \sigma_V^2) = \mathcal{N}(v_j \mid v_{j-1}, \sigma_V^2 I_k)$.  (10)

(3) The entry $x_{ij}$ of $X$ ($1 \le i \le n$, $1 \le j \le m$) is a Gaussian random variable with mean $u_i^\top v_j$ and variance $\sigma^2$; that is,

  $p(x_{ij} \mid u_i, v_j, \sigma^2) = \mathcal{N}(x_{ij} \mid u_i^\top v_j, \sigma^2)$.  (11)

With these assumptions, we show in the following part of this section that, for a given matrix $X$ and priors $\sigma^2$, $\sigma_U^2$, $\sigma_1^2$, and $\sigma_V^2$, model (7) is equivalent to the following MAP fitting problem:

  $\max_{U,V} \; p(U, V \mid X, \sigma^2, \sigma_U^2, \sigma_1^2, \sigma_V^2)$.  (12)

In fact, according to Bayes' theorem, we have

  $p(U, V \mid X, \sigma^2, \sigma_U^2, \sigma_1^2, \sigma_V^2) = \dfrac{p(X \mid U, V, \sigma^2)\, p(U \mid \sigma_U^2)\, p(V \mid \sigma_1^2, \sigma_V^2)}{p(X \mid \sigma^2, \sigma_U^2, \sigma_1^2, \sigma_V^2)}$.  (13)

Since $X$ is observed and $\sigma^2$, $\sigma_U^2$, $\sigma_1^2$, and $\sigma_V^2$ are prespecified, the denominator can be treated as a constant. Therefore, we get

  $p(U, V \mid X, \sigma^2, \sigma_U^2, \sigma_1^2, \sigma_V^2) \propto p(X \mid U, V, \sigma^2)\, p(U \mid \sigma_U^2)\, p(V \mid \sigma_1^2, \sigma_V^2)$.  (14)

Combining Figure 1 and assumptions (1)–(3), we have

  $p(U, V \mid X, \sigma^2, \sigma_U^2, \sigma_1^2, \sigma_V^2) \propto \prod_{(i,j)\in\Omega} \mathcal{N}(x_{ij} \mid u_i^\top v_j, \sigma^2) \prod_{i=1}^{n} \mathcal{N}(u_i \mid \mathbf{0}, \sigma_U^2 I_k)\, \mathcal{N}(v_1 \mid \mathbf{0}, \sigma_1^2 I_k) \prod_{j=2}^{m} \mathcal{N}(v_j \mid v_{j-1}, \sigma_V^2 I_k)$.  (15)

Taking the logarithm on the two sides of the equation, we get, up to an additive constant,

  $\ln p(U, V \mid X, \sigma^2, \sigma_U^2, \sigma_1^2, \sigma_V^2) = -\dfrac{1}{2\sigma^2} \sum_{(i,j)\in\Omega} (x_{ij} - u_i^\top v_j)^2 - \dfrac{1}{2\sigma_U^2} \|U\|_F^2 - \dfrac{1}{2\sigma_1^2} \|v_1\|^2 - \dfrac{1}{2\sigma_V^2} \sum_{j=2}^{m} \|v_j - v_{j-1}\|^2$.  (16)

Comparing (7) and (16), the PGM is equivalent to the optimization model (3a) if we let $\lambda_U = \sigma^2 / \sigma_U^2$, $\lambda_1 = \sigma^2 / \sigma_1^2$, and $\lambda_V = \sigma^2 / \sigma_V^2$.
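To make the generative process concrete, the following NumPy sketch (our illustration; the sizes and prior variances are arbitrary choices, not values from the paper) samples $U$, $V$, and $X$ according to assumptions (1)–(3):

import numpy as np

rng = np.random.default_rng(0)
n, m, k = 100, 200, 10                          # objects, time points, latent features
sigma_u, sigma_1, sigma_v, sigma = 1.0, 1.0, 0.1, 0.05

U = sigma_u * rng.normal(size=(k, n))           # columns u_i ~ N(0, sigma_u^2 I_k)
V = np.empty((k, m))
V[:, 0] = sigma_1 * rng.normal(size=k)          # v_1 ~ N(0, sigma_1^2 I_k)
for j in range(1, m):
    # Chain prior: v_j = v_{j-1} + eps_j with eps_j ~ N(0, sigma_v^2 I_k).
    V[:, j] = V[:, j - 1] + sigma_v * rng.normal(size=k)
X = U.T @ V + sigma * rng.normal(size=(n, m))   # x_ij ~ N(u_i^T v_j, sigma^2)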

5. The Algorithm

We present in this section a fitting algorithm to solve (7). For clarity, we only discuss the case in which all the time intervals are of equal length; that is, $t_j - t_{j-1} = \Delta t$ for $j = 2, \ldots, m$. Under this hypothesis, we take $\theta_2 = \cdots = \theta_m$, so that a single chain parameter $\lambda_V$ suffices in model (7). Cases with time intervals of unequal length can be handled by weighting the chain terms individually and removing the $j$th constraint if $t_j - t_{j-1} \to \infty$.

Let

  $E(U, V) = \sum_{(i,j)\in\Omega} (x_{ij} - u_i^\top v_j)^2 + \lambda_U \|U\|_F^2 + \lambda_1 \|v_1\|^2 + \lambda_V \sum_{j=2}^{m} \|v_j - v_{j-1}\|^2$.  (17)

It is straightforward to verify that $E$ is convex with respect to $U$ and to $V$ separately (though not jointly). Therefore we can reach a local minimum of $E$ via coordinate descent [26]. We first calculate the partial derivatives with respect to $u_i$ and $v_j$, respectively. Then we have the following.

For $1 \le i \le n$,

  $\dfrac{\partial E}{\partial u_i} = -2 \sum_{j:(i,j)\in\Omega} (x_{ij} - u_i^\top v_j)\, v_j + 2\lambda_U u_i$,  (18)

and for $j = 1$,

  $\dfrac{\partial E}{\partial v_1} = -2 \sum_{i:(i,1)\in\Omega} (x_{i1} - u_i^\top v_1)\, u_i + 2\lambda_1 v_1 - 2\lambda_V (v_2 - v_1)$.  (19)

Similarly, for $2 \le j \le m$,

  $\dfrac{\partial E}{\partial v_j} = -2 \sum_{i:(i,j)\in\Omega} (x_{ij} - u_i^\top v_j)\, u_i + 2\lambda_V (v_j - v_{j-1}) - 2\lambda_V (v_{j+1} - v_j)$,  (20)

where the term involving $v_{j+1}$ is dropped when $j = m$. With the equations above, Algorithm 1 is presented.

Input: matrix $X$; number of the latent features $k$; learning rates $\eta_U$ and $\eta_V$;
  regularization parameters $\lambda_U$, $\lambda_1$, and $\lambda_V$; threshold $\epsilon$
Output: the estimated matrix $\hat{X} = U^\top V$
  // Initialize $U$ and $V$.
(1) Generate random vectors $u_1, \ldots, u_n$ with (8);
(2) for $j = 1; j \le m; j{+}{+}$ do
(3)   Generate $\varepsilon_j$ with (10);
(4)   Let $v_j = v_{j-1} + \varepsilon_j$;
(5) end
  // Coordinate descent.
(6) $e_{\mathrm{old}} = +\infty$;
(7) $e_{\mathrm{new}} = E(U, V)$;
(8) while $e_{\mathrm{old}} - e_{\mathrm{new}} > \epsilon$ do
(9)   $e_{\mathrm{old}} = e_{\mathrm{new}}$;
(10)  for $i = 1, \ldots, n$ do
(11)    Let $u_i' = u_i - \eta_U\, \partial E / \partial u_i$ with (18);
(12)  end
(13)  for $j = 1, \ldots, m$ do
(14)    Let $v_j' = v_j - \eta_V\, \partial E / \partial v_j$ with (19) and (20);
(15)  end
(16)  Replace all the $u_i$ with $u_i'$ and all the $v_j$ with $v_j'$; recompute $e_{\mathrm{new}} = E(U, V)$;
(17) end
(18) return $\hat{X} = U^\top V$;
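For readers who prefer runnable code, the following NumPy sketch implements Algorithm 1 under our reconstruction of objective (17). The function and parameter names (mafed_fit, the default step sizes, the NaN missing-value convention) are our own illustrative choices, and a practical implementation may need tuned learning rates or a line search.

import numpy as np

def mafed_fit(X, mask, k, eta_u=1e-3, eta_v=1e-3,
              lam_u=0.1, lam_1=0.1, lam_v=0.1,
              eps=1e-6, max_iter=1000, seed=0):
    # X: (n, m) data matrix; mask: boolean (n, m), True on observed entries.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.normal(size=(k, n))                      # step (1)
    V = np.cumsum(rng.normal(size=(k, m)), axis=1)   # steps (2)-(5): random walk

    def energy(U, V):
        R = mask * (X - U.T @ V)                     # residuals on observed entries
        D = V[:, 1:] - V[:, :-1]                     # differences v_j - v_{j-1}
        return (np.sum(R ** 2) + lam_u * np.sum(U ** 2)
                + lam_1 * np.sum(V[:, 0] ** 2) + lam_v * np.sum(D ** 2))

    e_old, e_new = np.inf, energy(U, V)
    for _ in range(max_iter):                        # steps (8)-(17)
        if e_old - e_new <= eps:
            break
        e_old = e_new
        R = mask * (X - U.T @ V)
        grad_U = -2 * (V @ R.T) + 2 * lam_u * U      # eq. (18), all columns at once
        grad_V = -2 * (U @ R)                        # data term of eqs. (19)-(20)
        D = V[:, 1:] - V[:, :-1]
        grad_V[:, 1:] += 2 * lam_v * D               # d/dv_j of ||v_j - v_{j-1}||^2
        grad_V[:, :-1] -= 2 * lam_v * D              # d/dv_{j-1} of the same term
        grad_V[:, 0] += 2 * lam_1 * V[:, 0]          # d/dv_1 of lam_1 * ||v_1||^2
        U = U - eta_u * grad_U                       # simultaneous replacement,
        V = V - eta_v * grad_V                       # as in step (16)
        e_new = energy(U, V)
    return U.T @ V

With missing entries encoded as NaN, one would call mafed_fit(np.nan_to_num(X), ~np.isnan(X), k) and read the imputed values off the returned matrix.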

6. Experiments

In this section, experimental evaluations on a finance dataset and a traffic speed dataset are presented. The proposed MAFED algorithm is compared with two other state-of-the-art approaches.

6.1. Evaluation Methodology

To evaluate the algorithms, we perform missing imputation on an incomplete matrix $X$. Similar to many other matrix completion problems, the testing protocol adopted is the Given X protocol [27]; that is, for $X$, we only reveal a fraction $x$ of its observed entries to the algorithms, while holding out the remaining observations to evaluate the trained model. For example, when $x = 10\%$, Given X means that the algorithm is trained with 10% of the nonmissing entries and the remaining 90% of the nonmissing ones are held out to be recovered.
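The following small NumPy helper (an illustration of the protocol, not code from [27]) produces such a train/test split of the observed entries:

import numpy as np

def given_x_split(observed_mask, x, seed=0):
    # Reveal a fraction x of the observed entries for training;
    # hold out the rest of the observed entries for evaluation.
    rng = np.random.default_rng(seed)
    reveal = rng.random(observed_mask.shape) < x
    train_mask = observed_mask & reveal
    test_mask = observed_mask & ~reveal
    return train_mask, test_mask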

With the Given X setting, (16) can be rewritten by restricting the residual sum to the set $\Omega_{\mathrm{train}}$ of revealed training entries:

  $\max_{U,V} \; -\dfrac{1}{2\sigma^2} \sum_{(i,j)\in\Omega_{\mathrm{train}}} (x_{ij} - u_i^\top v_j)^2 - \dfrac{1}{2\sigma_U^2} \|U\|_F^2 - \dfrac{1}{2\sigma_1^2} \|v_1\|^2 - \dfrac{1}{2\sigma_V^2} \sum_{j=2}^{m} \|v_j - v_{j-1}\|^2$.

The first selected comparison algorithm is the probabilistic principal component analysis model (PPCA [28]), which achieves state-of-the-art performance on the missing traffic flow data imputation problem [29]. The second is the probabilistic matrix factorization model (PMF [13]), which is one of the most popular algorithms for the Netflix matrix completion problem.

The evaluation criterion employed is the root mean square error (RMSE). More formally, for a test set $T$ of held-out entries and their estimations $\hat{x}_{ij}$ (where $\hat{x}_{ij}$ is the estimation of $x_{ij}$), the RMSE of the estimation is given by

  $\mathrm{RMSE} = \sqrt{\dfrac{1}{|T|} \sum_{(i,j)\in T} (x_{ij} - \hat{x}_{ij})^2}$.
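Computed over the held-out mask from the split above, the criterion reads (our helper, matching the formula):

import numpy as np

def rmse(X, X_hat, test_mask):
    # Root mean square error over the held-out (test) entries only.
    diff = (X - X_hat)[test_mask]
    return np.sqrt(np.mean(diff ** 2))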

In all our experiments, for every $x$ value, the data partition is repeated five times, and the average results as well as the standard deviations are recorded.

6.2. Experiments on the Finance Dataset

The finance dataset is a matrix $X$ consisting of the opening prices of leading U.S. companies on 245 consecutive trading days from August 21, 2009 to August 20, 2010. Each row of $X$ corresponds to a company and each column corresponds to a day.

To characterize the evolution of the opening prices, we introduce the term incremental rate: for the $i$th company, the incremental rate of its opening price on the $j$th day is given by

  $\delta_{ij} = \dfrac{x_{ij} - x_{i,j-1}}{x_{i,j-1}}$.

Intuitively, the incremental rate quantifies how strong the evolution is. A smaller $|\delta_{ij}|$ implies a closer connection between $x_{i,j-1}$ and $x_{ij}$, and hence we say the evolution from $x_{i,j-1}$ to $x_{ij}$ is gradual, while a larger $|\delta_{ij}|$ indicates a looser relation between $x_{i,j-1}$ and $x_{ij}$, and thus the evolution is considered to be saltatory.
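Under this definition (the symbol $\delta_{ij}$ is our notation), the rates for all companies and days can be computed column-wise:

import numpy as np

def incremental_rates(X):
    # delta_{ij} = (x_{ij} - x_{i,j-1}) / x_{i,j-1}, for j = 2, ..., m.
    return (X[:, 1:] - X[:, :-1]) / X[:, :-1]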

We calculate the incremental rates for all companies on the days $t_2, \ldots, t_m$. Figure 2 illustrates the cumulative probability distribution of these rates. We can observe that almost all the incremental rates lie in a very narrow interval (−10%, 10%). This implies that the changes of the opening prices are very slight. In other words, the evolution of the opening prices is gradual.

We evaluate the parametric sensitivity of the algorithm by tuning the regularization parameters. In our experiments, we first fix one parameter and expand the other through a sequence of increasing multiples, and then we do the reverse. For the Given X protocol setting, we take $x = 50\%$; that is, 50% of the data in $X$ is randomly selected as training data, with which the algorithm is trained and then recovers the remaining 50% as test data. The average RMSEs of the experiments are shown in Figure 3.

As shown in Figure 3, the RMSE values remain stable even when the first parameter is expanded by more than 200 times. A similar result also appears in the experiments on the second parameter: significant changes of the RMSE value occur only under the most extreme expansions.

To study the prediction ability of the proposed algorithm, we increase the value $x$ in the Given X protocol. Then, for each setting, missing imputations are performed using MAFED and the comparison algorithms under two settings of the latent feature number $k$. As illustrated in Table 1, MAFED outperforms PPCA significantly in all settings. For every $x$ value, the RMSE of MAFED is at most a fraction of that of PPCA; in one setting, the RMSE of MAFED is even only 10% of that of PPCA. More notably, the value of $k$ has strikingly different impacts on MAFED and PPCA. When $k$ changes from the smaller setting to the larger one, almost all the RMSEs of PPCA dramatically increase, while for MAFED the RMSEs decrease in all but one case. This observation suggests that, with a careful setting of the parameter $k$, it is possible to further improve the prediction accuracy of MAFED. We can also observe that, compared with PMF, MAFED achieves smaller RMSEs in all settings; in the best case, the RMSE of MAFED is only half of that of PMF. Another interesting finding is that, for PMF, the value of $k$ has little impact on the predictions.

The convergence rate of MAFED is also explored in this work. We record the RMSEs of the data recovered by MAFED at fixed iteration intervals and plot them in Figure 4, where the x-axis is the number of iterations and the y-axis represents the RMSEs. We can observe that, for all $x$ values, the corresponding curves drop dramatically in the first iterations and remain stable afterwards. From the results presented in Figure 3 and Table 1, we can conclude that, in our experiments, MAFED converges to a local optimum within a small number of iterations.

6.3. Experiments on the Traffic Speed Dataset

To reconfirm the recovery performance of MAFED, we in this section conduct another evaluation on a traffic speed dataset collected in the urban road network of Zhuhai City [30], China, from April 1, 2011 to April 30, 2011. Again we adopt a matrix $X$ to represent the data. Each row of $X$ corresponds to a road and each column corresponds to a 5-minute-long time interval; all columns are arranged in ascending order of time. The entry $x_{ij}$ of $X$ is the aggregate mean traffic speed of the $i$th road in the $j$th interval. As the data in $X$ are collected by probing vehicles, the value of $x_{ij}$ might be missing if there is no probing vehicle on the $i$th road during the $j$th time interval. In this dataset, nearly half of the entries of $X$ are such missing values.

We calculate the incremental rates of the nonmissing data in $X$ and present the cumulative density distribution of the incremental rates in Figure 5. The evolution of the traffic speeds differs remarkably from that of the stock opening prices. As aforementioned, the incremental rates of the finance dataset mainly concentrate in the narrow interval (−10%, 10%), while for the traffic speed data, as demonstrated in Figure 5, the range of the incremental rates spreads far more widely. This indicates that plenty of sudden changes of the road speeds occur in adjacent time intervals. In other words, many of the speed evolutions are saltatory. A possible cause of this observation could be the effect of the traffic lights in the urban road network. When the traffic light of a road turns from green to yellow (and then red), the traffic speed on the road immediately drops toward zero. On the other hand, when the light turns from red to green, vehicles restart instantly, resulting in a significant acceleration of the speed on the road.

This experiment is conducted under the Given X protocol with the same evaluation methodology as before. Table 2 shows that, in all settings, MAFED again outperforms PPCA and PMF. In particular, when there are few observations (i.e., for small $x$), the RMSEs of MAFED are 33% lower than those of PPCA and 10% lower than those of PMF. As $x$ grows, the RMSE differences between PPCA and MAFED tend to be slight; despite that, the overall errors of PPCA remain higher than those of MAFED, and the RMSEs of PMF also remain higher than those of MAFED.

7. Conclusion and Future Works

Matrix factorization models are fundamental in machine learning and data mining. In this paper, we present MAFED, a matrix factorization model for evolution data. In MAFED, the two factor matrices are treated as being composed of the intrinsic features of the objects and the time-varying features of the environment, respectively; hence, their product accommodates the ability to describe the evolution of the objects over time. Besides, we construct a probabilistic graphical model, with which the statistical nature of MAFED is elaborated. We also present a fitting algorithm to find the solution of MAFED. Finally, we evaluate MAFED through missing data imputation on two real-world datasets. The experimental results indicate that the proposed model outperforms the comparison algorithms in both the gradual-evolution and the saltatory-evolution cases.

According to the generative interpretation of MAFED, its success is mainly attributed to the introduction of the linear-chain structural prior; as a result, a natural extension of the present work is to deal with factorizations of matrices with more complex structural priors. In particular, as has been shown in the empirical studies, much of the data in real-world scenarios is highly incomplete, so it is important and interesting to study the factorization of an incomplete matrix with no prespecified structural information.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors’ Contribution

Huang and Xiang contributed equally to this work; Huang is the corresponding author.

Acknowledgments

This work is supported in part by National High-Tech R&D Program (863 Program) of China under Grant no. 2012AA12A203.