Abstract

This paper investigates a homotopy-based method for embedding hundreds of thousands of data items that yields a parallel algorithm suitable for running on a distributed system. Current eigenvalue-based embedding algorithms attempt to use a sparsification of the distance matrix to approximate a low-dimensional representation when handling large-scale data sets. The main reason for resorting to approximation is the eigendecomposition bottleneck for high-dimensional matrices in the embedding process. In this study, a homotopy continuation algorithm is applied to improve this embedding model by parallelizing the corresponding eigendecomposition. The eigenvalue problem is converted to the solution of ordinary differential equations with initial values, and all isolated positive eigenvalues and corresponding eigenvectors can be obtained in parallel by tracking the predicted eigenpaths. Experiments on real data sets show that the homotopy-based approach has the potential to be implemented for data sets with millions of items.

1. Introduction

Dimensionality reduction has become an important technology for data visualization and analysis in the age of big data. One family of these methods is based on eigendecomposition and is widely used for observing intrinsic associations between data items in 2D graphics. In this context, the solution of eigenvalues is an essential step, regarded as a multidimensional scaling (MDS) problem, for establishing low-dimensional embedding spaces from a distance matrix. The capacity of the implemented eigendecomposition determines the scale of data that can be handled. However, eigendecomposition is a bottleneck when embedding large-scale data sets directly [13]. Eigenvalue-based approaches have developed into two categories: direct solutions, which acquire the eigenpairs (eigenvalues and associated eigenvectors) immediately from the distance matrix [4], and indirect solutions, which select a small set of landmark points to calculate the eigenvalues and then feed these estimated eigenpairs to approximate the entire embedding [1]. All of them operate in a serial computational mode [5], which has negative consequences when processing massive data sets: firstly, the indirect methods must select a landmark set, which distorts the embedded space through the error of the approximate calculation [1, 3]; secondly, both classes are inefficient and must balance representational precision against computational cost [4, 5]; thirdly, both are prone to omitting some positive eigenvalues, which reduces the maximum number of dimensions available for dimensionality reduction [2].

This work investigates the parallelization of eigenvalue-based dimensionality reduction in the context of large-scale data sets, aiming to provide a parallel solution for data reduction in practice. The general frameworks are often divided into two categories [4]: distance scaling and classical scaling. Methods based on the spring model of distance scaling [6] are prone to local minima and can only be used to embed data in two or three dimensions [7], while classical scaling, the eigenvalue-based approach [8], has been considered a useful replacement for large graphs. To overcome the eigenvalue problem in big data processing, several algebraic techniques have been developed. One of the most useful is matrix approximation [1, 2, 9], which samples a small matrix from the distance matrix, solves its eigenvalues, and then approximates the objective embedding. The segments of this workflow are closely coupled, and some studies on the selection strategy [3, 10-12] make the approach harder to parallelize for cutting the time consumption. In today's eigencalculation technology, it is well established that the eigendecomposition of large sparse matrices can be handled by novel numerical solutions such as Block Lanczos [13], PCG Lanczos [14], and homotopy [15, 16]. Resorting to these high-performance approaches may considerably expand the applications of dimensionality reduction.

As for the homotopy continuation method, it has been suggested as an alternative for finding all isolated eigenvalues of a large matrix. It was demonstrated that there are distinct smooth curves connecting trivial solutions to the desired eigenvalues [17], and a proof of the existence of homotopy curves for eigenvalue problems was then presented in [18]. Reference [19] reports numerical experience with the homotopy method for computing the eigenvalues of real symmetric tridiagonal matrices; it enables the eigenvalues needed for embedding to be calculated in parallel. Among recent studies, [16] combines the method with a divide-and-conquer approach to find all the eigenvalues and eigenvectors of a diagonal-plus-semiseparable matrix, and [20] establishes bounds that justify the Newton method in constructing the homotopy curves.

In this paper, a parallelization of eigenvalue-based dimensionality reduction (HC-MDS, Homotopy Continuation Multidimensional Scaling) via the homotopy continuation method is proposed. Unlike the approximation-based approaches, HC-MDS can directly solve most positive eigenvalues in parallel from large distance matrices (transformed from more than 60,000 data items). Building on this theory, we construct a homotopy continuation function to link a trivial problem with the objective eigenpairs. In this way, the eigenvalue problem is converted to the solution of ordinary differential equations (ODEs). Each eigenpair is obtained by tracing its ordinary differential function from the corresponding initial values, which means the eigenvalue solutions can run in independent computational threads. This model brings the advantage of allowing computational resources to be scheduled per eigenpair calculation, so that it can be accelerated on parallel or distributed systems.

The remainder of this report is organized as follows. Section 2 reviews eigenvalue-based MDS. Section 3 outlines the methodologies used in this paper. The experiments are presented in Section 4 and discussed in Section 5. Finally, we draw conclusions in Section 6.

2. Background

Most eigenvalue-based dimensionality reductions involve a spectral decomposition step for solving eigenvalues, including LLE [21], Laplacian Eigenmaps [22], Isomap [23], and L-Isomap [1]. A more fundamental MDS using the power method for eigenpairs is illustrated in [4]. The aim of the eigenvalue calculation in embedding is to obtain the eigenpairs of the inner product matrix, which is converted from the distance matrix by "double-centering." A low-dimensional space is then built from the positive eigenvalues and their corresponding eigenvectors. Without loss of generality, the general eigenvalue-based dimensionality reduction is summarized as follows.

Procedure. The eigenvalue-related steps in embedding are as follows.

Step 1. Let D denote the n × n distance matrix, which is symmetric and real valued. An inner product matrix B can be derived from D by double-centering as B = -(1/2) J D^(2) J, where J = I - (1/n)11^T is the centering matrix and D^(2) contains the squared distances.

Step 2. Implement one of the eigendecomposition methods on B, as B = QΛQ^T, where Λ is a diagonal matrix of eigenvalues and Q is a matrix whose columns are the corresponding orthonormal eigenvectors.

Step 3. The embedded space X is generated as X = Q₊Λ₊^(1/2), where (Λ₊, Q₊) is the combination of the positive eigenpairs among (Λ, Q). The negative and zero eigenpairs must be discarded because of the arithmetic rule for extracting square roots.
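To make Steps 1-3 concrete, the following minimal NumPy sketch is our own illustration (the function name classical_mds and the dense eigh call are assumptions suited to small inputs, not the paper's implementation):

import numpy as np

def classical_mds(D, d):
    """Steps 1-3: double-centering, eigendecomposition, embedding."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # Step 1: inner product matrix
    evals, evecs = np.linalg.eigh(B)           # Step 2: eigendecomposition
    order = np.argsort(evals)[::-1][:d]        # d largest eigenvalues
    lam = evals[order]
    keep = lam > 0                             # Step 3: positive eigenpairs only
    return evecs[:, order[keep]] * np.sqrt(lam[keep])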
If the approximate method is applied to approximate the embedding [2], the following steps are also performed.

Step 4. Assume that the landmark inner product matrix B_l is a partition of the original inner product matrix B. The remaining embedded space X_r is approximated from the landmark eigenpairs and the distances to the landmarks, together with an estimation of the approximation error.

Step 5. Let X_r be incorporated with the landmark embedding X_l; we finally obtain the entire embedded space as X = [X_l; X_r]. A sketch of the landmark extension of Step 4 is given below.
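Under the common Nyström-style reading of Steps 4 and 5 (cf. [1, 2]), the extension can be sketched as follows; the function name lmds_extend and the argument layout are our assumptions, and the error estimation of Step 4 is omitted:

import numpy as np

def lmds_extend(D2_land, D2_rest, evals, evecs, d):
    # Place the m non-landmark points from their squared distances to the
    # k landmarks, given the positive eigenpairs (evals descending, evecs
    # columns matching) of the landmark inner-product matrix.
    Lsharp = evecs[:, :d] / np.sqrt(evals[:d])    # k x d pseudo-inverse map
    mean_col = D2_land.mean(axis=0)               # mean squared distances
    return -0.5 * (D2_rest - mean_col) @ Lsharp   # m x d remainder embedding

The full space X is then the concatenation of the landmark embedding X_l and this remainder X_r, as stated in Step 5.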

For embedding with eigenpairs, most investigations do not discuss how the eigenpairs are generated but commonly conclude that the eigendecomposition of large, symmetric, dense matrices is an expensive calculation. Currently, the embedding technologies applied to large-scale data sets are mainly used for data visualization. Since this kind of application needs only two or three eigenvalues, the largest and the second largest are solved by existing eigenvalue methods, for example, the power iteration method [4]. However, this eigenvalue approach operates serially and its computation is considerably inefficient [4]. To capture more positive eigenvalues for embedding, QR has been justified for small to medium matrices [24], but it solves all the eigenvalues, including the negative and zero ones that are useless and consume CPU time. Lanczos, a family of eigenpair solutions based on the Lanczos process, has been demonstrated to handle millions of data items in real engineering applications [14], but these methods may fail to complete the computing procedure once the related errors accumulate beyond a certain range [24]. Therefore, we investigate an alternative, the homotopy continuation method, for the parallelization of embedding. Compared to the presented works, we employ a Runge-Kutta (R-K) method with step estimation to control the tracking of the homotopy curves, which relaxes the constraints on eigenpair tracking used in [20].

3. Methodology

The homotopy continuation method consists of two parts: how to convert the eigendecomposition in embedding into the solution of ODEs with initial values (Section 3.1) and how to calculate the ODE of each eigenpair using numerical methods (Section 3.2).

3.1. Homotopy for Eigenpairs

Before constructing the homotopy function, the symmetric square matrix B should be transformed to a symmetric tridiagonal matrix T by the Householder transformation [20]. The eigenvalues of T are equal to the eigenvalues of B. We obtain the eigenvalues of B through the eigendecomposition of T, and the corresponding eigenvectors of B can then be recovered from those eigenvalues. For the eigendecomposition of the transformed matrix T, we consider the eigenvalue problem as

Tx = λx, (1)

where x is the eigenvector corresponding to the eigenvalue λ of T. The orthogonality of the eigenvectors is expressed by x^T x = 1.
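As a concrete illustration of this reduction, SciPy's hessenberg routine returns the tridiagonal form of a symmetric matrix together with the accumulated orthogonal factor. This is a stand-in sketch on a toy matrix, not the paper's code:

import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
B = M + M.T                        # toy symmetric inner-product matrix

# for symmetric input the Hessenberg form is tridiagonal; P accumulates
# the Householder reflections, so B = P @ T @ P.T
T, P = hessenberg(B, calc_q=True)

# the similarity transform preserves the eigenvalues
assert np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(B))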

Associating (1) with the orthogonality of the eigenvectors, we convert the eigenvalue problem to the root-finding problem

f(x, λ) = (Tx − λx, (x^T x − 1)/2) = 0, (2)

where 0 is the corresponding zero vector of length n + 1.

To find the solution set of this objective function, the homotopy method is recommended. We thus construct a homotopy function linking the objective function with an auxiliary function by the Newton method [20]:

H(x, λ, t) = (1 − t)f₀(x, λ) + t f(x, λ), (3)

where t ∈ [0, 1], f(x, λ) is given by (2), f₀(x, λ) = (Cx − λx, (x^T x − 1)/2), and C is a constant symmetric matrix. The constant matrix C needs to be user-defined such that its eigenpairs are readily obtained; for example, when T is 3 × 3, we often set C = diag(t11, t22, t33). Obviously, the diagonal elements t11, t22, and t33 are exactly the eigenvalues of C, and the unit vectors e1, e2, and e3 are the corresponding eigenvectors, respectively.

Since there are distinct smooth curves, called eigenpaths, connecting the trivial solutions of f₀ to the desired eigenpairs of f [17], each eigenpath is expressed by an ODE with a pair of initial values as

dz/dt = −[∂H/∂z]^(−1) ∂H/∂t, z(t) = (x(t), λ(t)), z(0) = (x₀, λ₀), (4)

where the interval of t is [0, 1] and the initial values (x₀, λ₀) are given by one eigenpair of C. Solving (4) with the initial values given by each eigenpair of C, the eigendecomposition of T can be implemented in parallel. We show the derivation of (4) from (3) as follows.

We take the derivative of (3) along H(x(t), λ(t), t) = 0:

((1 − t)C + tT − λI) x′(t) − λ′(t) x = (C − T)x,
x^T x′(t) = 0. (5)

Adding the initialized values, we obtain the ODE of the homotopy curve as (4). The solving procedure for (4) is the solution of the ODE over t ∈ [0, 1]. When t increases to 1, each eigenpair of T is obtained as (x(1), λ(1)).
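The paths defined by (4) and (5) can be traced with a stock ODE solver. The sketch below is our illustration (the 3 × 3 toy matrix Ttoy, the start matrix C = diag(Ttoy), and the use of scipy.integrate.solve_ivp are assumptions, not the paper's solver):

import numpy as np
from scipy.integrate import solve_ivp

def eigenpath_rhs(t, y, C, T):
    # right-hand side of the Davidenko ODE (4)/(5) for one eigenpath
    n = len(y) - 1
    x, lam = y[:n], y[n]
    A_t = (1.0 - t) * C + t * T
    J = np.zeros((n + 1, n + 1))         # Jacobian of H w.r.t. (x, lambda)
    J[:n, :n] = A_t - lam * np.eye(n)
    J[:n, n] = -x
    J[n, :n] = x                         # row from the orthogonality condition
    rhs = np.zeros(n + 1)
    rhs[:n] = (C - T) @ x                # -dH/dt (top block); bottom entry is 0
    return np.linalg.solve(J, rhs)

Ttoy = np.array([[4.0, 1.0, 0.0],        # small symmetric tridiagonal example
                 [1.0, 3.0, 1.0],
                 [0.0, 1.0, 2.0]])
C = np.diag(np.diag(Ttoy))               # start matrix with obvious eigenpairs

eigenpairs = []
for i in range(3):                       # each path depends only on its own start
    x0 = np.zeros(3); x0[i] = 1.0        # eigenvector e_i of C
    y0 = np.append(x0, Ttoy[i, i])       # initial values (x(0), lambda(0))
    sol = solve_ivp(eigenpath_rhs, (0.0, 1.0), y0, args=(C, Ttoy), rtol=1e-8)
    eigenpairs.append((sol.y[-1, -1], sol.y[:-1, -1]))

Because each loop iteration uses only its own initial values, the iterations can be dispatched to separate threads, processes, or machines, which is exactly the source of parallelism exploited by HC-MDS.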

We need to transform the calculated eigenvector because the eigenvector obtained from the homotopy method cannot be used to build the embedded space directly. The transforming formula is

v = Px, (6)

where P is the orthogonal matrix accumulated from the Householder transformation and v is the eigenvector of B corresponding to λ.
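Continuing the Householder sketch above (the 6 × 6 toy B, T, and P), the back-transformation (6) can be checked in a few lines:

lam_T, X_T = np.linalg.eigh(T)
x = X_T[:, -1]                  # eigenvector of T for its largest eigenvalue
v = P @ x                       # formula (6): v = P x
assert np.allclose(B @ v, lam_T[-1] * v)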

Since it is proved that the eigenpaths of the homotopy function are continuous and monotonic [17, 18], the trend of each eigenpath can be forecast: the negative eigenvalues and undersized positive eigenvalues can be identified early by tracking the homotopy curves, as in Figure 1. Specifically, each eigenvalue trace is computed in an independent computational thread; when the trend of an ODE is predicted to be negative (the red tracks), the associated computation is regarded as ineffective and shut down. This feature enables dynamic allocation of computing resources.
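Continuing the tracking sketch above, this scheduling idea can be imitated with a probe at mid-path: abandon a path whose (monotone) eigenvalue is already negative and still falling. The probe point t_probe, the slope test, and the thread pool are our own hypothetical choices, not the paper's scheduler:

from concurrent.futures import ThreadPoolExecutor

def track_or_abort(i, C, T, t_probe=0.5):
    n = T.shape[0]
    x0 = np.zeros(n); x0[i] = 1.0
    y0 = np.append(x0, T[i, i])
    head = solve_ivp(eigenpath_rhs, (0.0, t_probe), y0, args=(C, T), rtol=1e-8)
    y_mid = head.y[:, -1]
    lam, slope = y_mid[-1], eigenpath_rhs(t_probe, y_mid, C, T)[-1]
    if lam < 0 and slope <= 0:           # predicted negative: shut the work down
        return None
    tail = solve_ivp(eigenpath_rhs, (t_probe, 1.0), y_mid, args=(C, T), rtol=1e-8)
    return tail.y[-1, -1], tail.y[:-1, -1]

with ThreadPoolExecutor() as pool:
    paths = list(pool.map(lambda i: track_or_abort(i, C, Ttoy), range(3)))
positive_pairs = [p for p in paths if p is not None]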

3.2. Numerical Calculation of ODE

To avoid overflow when calculating a massive matrix, we normalize the matrix T of Section 3.1 as

t̃_ij = (t_ij − t_min)/(t_max − t_min), (7)

where t_max and t_min are the maximum and minimum entries of T, respectively.

The specific steps are as shown in Algorithm 1.

Input: T, C, an eigenpair (x₀, λ₀) of C, tol, h₀
Output: (x(1), λ(1))
(1) t ← 0
(2) initialize step h ← h₀
(3) y ← (x₀, λ₀)
While t < 1 do
  (4) h ← min(h, 1 − t)
  (5) compute y₁ from y by one R-K step of size h using (8)-(9)
  (6) compute y₂ from y by two R-K steps of size h/2
  (7) E ← ‖y₁ − y₂‖
  (8) compute the scaling factor s from E and tol by (10)
  If E ≤ tol then
    t ← t + h
    y ← y₂
  End
  h ← s · h
End

For solving the ODE problem in (4), we adopt an automatic step method based on error estimation [25, 26]. Let y = (x, λ); y is iterated along each homotopy curve by

y_(n+1) = y_n + (h/6)(k₁ + 2k₂ + 2k₃ + k₄), (8)

where f is the right-hand side of the ODE (4) and the parameters k_i are calculated by

k₁ = f(t_n, y_n),
k₂ = f(t_n + h/2, y_n + (h/2)k₁),
k₃ = f(t_n + h/2, y_n + (h/2)k₂),
k₄ = f(t_n + h, y_n + h·k₃). (9)

For estimating the error of each eigenpath tracking, the step size h is adjusted by

h ← s·h, s = (tol/E)^(1/5), (10)

where E is the estimated local error of the current step, the exponent follows the fourth-order accuracy of the scheme, s is the step scaling factor, and tol is an allowable error set to an empirical value.
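A self-contained sketch of Algorithm 1 with a step-doubling error estimate follows; it reflects our reading of (8)-(10), and constants such as the 0.9 safety factor and the [0.1, 2.0] clamp are conventional choices rather than values from the paper:

import numpy as np

def rk4_step(f, t, y, h, *args):
    # one classical fourth-order Runge-Kutta step, formulas (8)-(9)
    k1 = f(t, y, *args)
    k2 = f(t + h / 2, y + h / 2 * k1, *args)
    k3 = f(t + h / 2, y + h / 2 * k2, *args)
    k4 = f(t + h, y + h * k3, *args)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def track_adaptive(f, y0, args=(), h=1e-2, tol=1e-8):
    # follow one eigenpath from t = 0 to t = 1 with automatic step control
    t, y = 0.0, np.asarray(y0, dtype=float)
    while t < 1.0:
        h = min(h, 1.0 - t)
        y_full = rk4_step(f, t, y, h, *args)             # one step of size h
        y_half = rk4_step(f, t + h / 2,
                          rk4_step(f, t, y, h / 2, *args), h / 2, *args)
        E = np.max(np.abs(y_full - y_half))              # local error estimate
        if E <= tol:                                     # accept the step
            t, y = t + h, y_half
        s = 0.9 * (tol / max(E, 1e-16)) ** 0.2           # scaling factor (10)
        h *= min(2.0, max(0.1, s))                       # keep h in a sane band
    return y                                             # (x(1), lambda(1))

With the earlier sketch in scope, track_adaptive(eigenpath_rhs, y0, args=(C, Ttoy)) reproduces one eigenpath without relying on a library solver.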

4. Simulation

In this section, we experiment on two large data sets, embedding a series of low-dimensional spaces (2, 3, 50, and 100 dimensions) to compare our parallelized approach with the typical serial mode of dimensionality reduction. The assessments are carried out in terms of scale-independent quality criteria [27], relative distance error [2], and CPU time. The first two reflect the precision of the resulting embedded spaces indirectly and directly, respectively; the third is an empirical evaluation of the running time.

4.1. Experiments Setup

Our experiments compare the following embedding methods under the three eigendecomposition modes: QR, Lanczos, and homotopy. Embedding with the first two modes is a typical serial workflow run in a single process, while the third mode operates in parallel:
(i) classical metric multidimensional scaling [4] (CMDS),
(ii) landmark multidimensional scaling employing the matrix approximation [2] (LMDS),
(iii) isometric feature mapping [23] (Isomap),
(iv) landmark isometric feature mapping [1] (L-Isomap),
(v) t-distributed stochastic neighbor embedding [7] (t-SNE).

As the performance reference for the third part of the evaluation, t-SNE and its variants are well established and applicable to massive data [7]; they are based on distance scaling. We randomly initialize the embedding of t-SNE. Both the Euclidean distance and the geodesic distance are used; specifically, the geodesic distances are approximated over 6-ary neighborhoods of the Euclidean graph, as sketched below. Parameters of the comparative methods are set to typical values, with no further optimization. All experiments are carried out in a distributed computing environment built from four computers (20 GB of memory in total).
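The geodesic approximation described here can be reproduced with standard tools; the following is a sketch assuming scikit-learn and SciPy, with the function name geodesic_distances being ours:

from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=6):
    # Euclidean k-nearest-neighbor graph, then all-pairs shortest paths
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    return shortest_path(G, method="D", directed=False)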

For the quality criteria, we assess the embeddings in terms of the rank-based criteria of [27]: Q_local indicates the local preservation of neighborhoods, Q_global shows a global consensus of the embedding, and Q_NX(K) measures the preservation of K-ary neighborhoods in a straightforward way. The relative distance error (dist-err) [2] is computed as the average deviation between the original distance matrix and the matrix reconstructed from the embedding (a sketch follows), and CPU time records the computing time of the eigendecomposition procedure in embedding.
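One plausible reading of dist-err is the mean relative deviation between the two distance matrices; the following sketch is our interpretation of [2], and the exact normalization may differ:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def dist_err(D, X):
    # compare original distances with those reconstructed from embedding X
    D_hat = squareform(pdist(X))              # distances in the embedded space
    off = ~np.eye(D.shape[0], dtype=bool)     # ignore the zero diagonal
    return np.mean(np.abs(D[off] - D_hat[off]) / np.maximum(D[off], 1e-12))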

4.2. Corel Image Features

The Corel Image Features is a UCI KDD data set consisting of features extracted from 68,040 images. Four sets of features are provided: 32-dimensional color histograms, 32-dimensional color layout histograms, 9-dimensional color moments, and 16-dimensional texture features. Images with missing features are ignored, leaving 64,433 images. We computed the Euclidean distance between image feature vectors within each feature set and combined them into a single distance following [2]. The first 5000 data items are selected as the pivot points used for the matrix approximation.

For the comparison of local and global preservation and mean distance error, Table 1 summarizes the results. The choice between the serial and parallel computational modes clearly impacts the embedded results when operating on hundreds of thousands of items. To find an embedding space, the QR-based methods need the approximation to reduce the scale of the matrices in the intermediate steps; the corresponding rows of Table 1 show that these approximate methods achieve acceptable precision, with mean distance errors of 0.422 and 0.481 for the 100-dimensional spaces, respectively. The homotopy-based and Lanczos-based methods both succeed in obtaining sufficient numbers of positive eigenpairs for embedding at all the desired numbers of dimensions. The Lanczos-based approaches generally lead the homotopy-based ones on the quality criteria, although embedding with the homotopy is slightly better than the Lanczos-based and QR-based embeddings in distance error. Embedding with the geodesic distance shows better results than embedding with the Euclidean distance for the same embedding type, and the distance errors of all competing models decrease as the number of embedding dimensions increases.

For a more intuitive comparison, Figures 2 and 3 show the results of two-dimensional embedding, referenced by the pair of scalar values Q_local and Q_global. In Figure 2, the local preservation of the homotopy-based embedding with the Euclidean distance falls below the other methods, while its geodesic version reaches third place. This implies that the geodesic distance may be more suitable for this kind of eigenvalue-based embedding, and a similar result occurs on the next data set. With the geodesic distance, t-SNE achieves the highest result in the (Q_local, Q_global) coordinates of Figure 3. The homotopy method embedding the Euclidean distance matrix ranks last, which suggests problems with the solving accuracy of its eigenpairs, while its geodesic version with Isomap shows a moderate performance.

For the comparison of CPU time, the items on the x-axis of Figure 4 correspond to the first column of Table 1. Apart from the approximation methods, the homotopy approaches (the red bars in Figure 4) consume the least CPU time, at 12 min 58 s and 12 min 26 s, by taking advantage of the parallel computational design. The time consumption of the embedding approximations is far less than that of the direct methods (homotopy and Lanczos), completing the embedding tasks in 3 min 52 s and 3 min 45 s (the first and eighth bars), even though the homotopy-based approach is boosted by the parallel computational environment. The two versions of t-SNE (geodesic and Euclidean distance) are the slowest, at 36 min 53 s and 35 min 18 s. The CPU times of the direct QR-based methods are not recorded because the QR method cannot be applied directly to the eigendecomposition of large square matrices exceeding 8,000 dimensions.

4.3. PNAS Titles

The title data was obtained by crawling the PNAS website, yielding 79,801 titles. Each title is represented by the standard tf-idf vector representation, and the distance matrix is obtained from the cosine distances between all title vectors. We employ the approximation with the first 5000 data items as pivots.
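The text preprocessing can be sketched with scikit-learn; load_pnas_titles is a hypothetical loader standing in for the crawled corpus, which is not reproduced here:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

titles = load_pnas_titles()                      # hypothetical: 79,801 titles
tfidf = TfidfVectorizer().fit_transform(titles)  # standard tf-idf vectors
D = cosine_distances(tfidf)                      # n x n cosine distance matrix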

The results of assessing local and global preservation and mean distance error are given in Table 2. The experiments on the text data set demonstrate once again that the QR eigendecomposition cannot be applied to a large-scale data set directly; the corresponding results are absent, denoted by "-". In both the CMDS and Isomap models, the Lanczos-based methods run ahead of the homotopy-based ones in terms of Q_local, Q_global, and dist-err, although the lead is slight. The QR-based LMDS with the matrix approximation is better than L-Isomap as the scale of the data increases, but both fall behind the homotopy method on the three criteria. The homotopy-based Isomap outperforms the homotopy-based and Lanczos-based CMDS but runs slightly behind the Lanczos-based Isomap.

From Figures 5 and 6, the results on the text data set also show that the embeddings constructed from the geodesic distance represent the original matrix better than those using the Euclidean distance; the top three methods are all based on the geodesic distance. In this case, the approximation with landmarks or pivots seems poor at preserving the global structure of the entire embedding space. The embedding of t-SNE with the Euclidean distance declines slightly in both the local and the global parts.

The results on time consumption are shown in Figure 7. The LMDS and L-Isomap are the fastest two, at 3 min 46 s and 4 min 01 s, respectively (the first and eighth bars). The homotopy-based methods form the second echelon (the red bars), at 17 min 23 s and 17 min 38 s. The Lanczos-based approaches are in the third tier, at 26 min 18 s and 32 min 33 s. The t-SNE-based methods consume the most time, at 57 min 32 s and 61 min 28 s. Leaving aside the approximation embeddings, a clear advantage of the homotopy is the boost it receives from the distributed system.

5. Discussion

5.1. QR

Combining the results of the two data sets, the capability of the QR-based method is limited to a data scale of around 10,000 items, even when improved by both the approximation and the geodesic measuring approach [23]. Although the QR approach is restricted to small and medium problems, it is still widely used [2] in combination with matrix reduction, which strikes a balance among time consumption, computational resources, and embedding precision. Furthermore, most QR algorithms are designed for dense square matrices, which makes QR-based models harder to extend to distributed systems for decentralized computation and storage than sparse-matrix methods.

5.2. Homotopy versus Lanczos

For data scales of several hundreds of thousands of items, we should consider employing eigendecomposition technologies designed for very large matrices; the homotopy continuation method and the Lanczos family are the feasible options. However, the result in Figure 2 shows that the homotopy-based method may suffer from numerical instability, which decreases the precision of the embedded space. The Lanczos method has a similar problem with some forms of matrices [26]. Overcoming this impediment requires further substantial work in these fields.

The superiorities of the homotopy approach have three aspects. First, the calculation of each eigenpair is a separate process, so the solution of multiple eigenpairs can be accelerated simply by adding computational cells; this is the core idea for parallelizing eigenvalue-based dimensionality reduction. Second, the trend of each eigenpath is predictable, so negative eigenpairs can be identified before they are completely calculated; this provides a basis for scheduling computational resources by aborting meaningless processes. Third, it offers a way to estimate the maximum available dimensionality of the embedding space by predicting the number of positive eigenvalues before the eigenpair calculations complete.

6. Conclusions and Future Work

This paper focuses on the parallelization of eigenvalue-based dimensionality reduction via the homotopy continuation method. A novel embedding method based on the homotopy is presented. It transforms the eigendecomposition step of embedding into an ODE problem with initial values; by tracking the eigenpaths separately, all isolated positive eigenpairs can be solved in parallel. The experiments show that the homotopy-based approach is applicable to embedding tasks with hundreds of thousands of data items, which suggests it can be extended toward approximate embedding of millions of items.

From the analysis of the experiments, two problems remain for the homotopy-based eigendecomposition in embedding: first, the accuracy of the eigenpairs still needs to be improved; second, the numerical instability that may occur during the computational procedure calls for further work investigating the convergence of the homotopy method.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper was supported by the National Natural Science Foundation of China under Grant no. 61302170.