Abstract

Nonnegative matrix factorization (NMF) decomposes a high-dimensional nonnegative matrix into the product of two reduced-dimensional nonnegative matrices. However, conventional NMF neither scales to large datasets, since it keeps all data in memory, nor preserves the geometrical structure of the data, which is needed in many practical tasks. In this paper, we propose a parallel NMF with manifold regularization method (PNMF-M) to overcome both deficiencies by parallelizing manifold regularized NMF on distributed computing systems. In particular, PNMF-M distributes both the data samples and the factor matrices to multiple computing nodes instead of loading the whole dataset onto a single node, and it updates both factor matrices locally on each node. In this way, PNMF-M relieves the memory pressure caused by large-scale datasets and speeds up the computation through parallelization. To construct the adjacency matrix used by the manifold regularization, we propose a two-step distributed graph construction method, which is proved to be equivalent to the batch construction method. Experimental results on popular text corpora and image datasets demonstrate that PNMF-M significantly improves both the scalability and the time efficiency of conventional NMF thanks to the parallelization on distributed computing systems; meanwhile, it significantly enhances the representation ability of conventional NMF thanks to the incorporated manifold regularization.

1. Introduction

Data representation is a fundamental problem in data analysis. A good representation typically uncovers the latent structure of a dataset by reducing the dimensionality of the data. Several methods, including principal component analysis (PCA), linear discriminant analysis (LDA), and vector quantization (VQ), have addressed this issue. More recently, nonnegative matrix factorization (NMF) [1] has incorporated a nonnegativity constraint to obtain a parts-based representation of data, and it has therefore been widely applied in many applications, such as document clustering [2, 3], image recognition [4, 5], audio processing [6], and video processing [7].

However, conventional NMF suffers from two deficiencies. First, it usually works in batch mode and requires all data to reside in memory, which leads to tremendous storage overhead as the number of data samples increases. Second, it ignores the geometrical structure embedded in the data, which limits its representation ability. To overcome the first deficiency, parallel and distributed NMF algorithms have been proposed to handle large-scale datasets. Kanjani [8] utilized multithreading to develop a parallel NMF (PNMF) on a multicore machine. Robila and Maciak [9] introduced two thread-level parallel versions of the traditional multiplicative solution and the adaptive projected gradient method. However, these methods remain prohibited for large-scale datasets due to the memory limitation of a single computer. Liu et al. [10] proposed a distributed NMF (DNMF) to analyze large-scale web dyadic data and verified its effectiveness on distributed computing systems. Dong et al. [11] also designed a PNMF for distributed memory platforms using the message passing interface library. Although all the above parallel NMF algorithms achieve considerable speedup and scalability, they do not consider the geometric structure of the dataset. To overcome the second deficiency, Cai et al. [12] proposed graph-regularized NMF (GNMF), which extends conventional NMF by constructing an affinity graph [13] to encode the geometrical information of the data and thus enhances representation ability. Gu et al. [14] further extended GNMF to avoid the trivial solution and scale transfer problems by imposing a normalized-cut-like constraint on the cluster assignment matrix. Lu et al. [15] incorporated manifold regularization into NMF for hyperspectral unmixing and obtained desirable unmixing performance. These improved algorithms achieve better representation ability but work inefficiently on large-scale datasets. Liu et al. [16] introduced the geometric structure of data into incremental NMF [17, 18] and utilized two efficient sparse approximations, buffering and random projected trees, to process large-scale datasets. Yu et al. [19] also presented an incremental GNMF algorithm to improve scalability. However, these algorithms only perform well for incremental or streaming datasets and cannot deal with large-scale batch datasets. In addition, Guan et al. [20] and Liu et al. [21] introduced the Manhattan distance and a large-cone penalty into NMF, respectively, to improve its representation and generalization ability.

In conclusion, none of the above works overcomes both deficiencies simultaneously, because of the heavy computation required for the decomposition and the storage requirement of the adjacency matrix. In this paper, we combine the advantages of parallel NMF and manifold regularized NMF and design a parallel NMF with manifold regularization method (PNMF-M) by parallelizing manifold regularized NMF on distributed computing systems. In particular, PNMF-M distributes both the data samples and the factor matrices to multiple computing nodes in a balanced way and parallelizes the updates of both factor matrices locally on each node. Since graph construction is the computational bottleneck of PNMF-M, we adopt a two-step distributed graph construction method to compute the adjacency matrix and obtain an adjacency graph equivalent to that constructed in batch mode. Experimental results on popular text corpora and image datasets show that PNMF-M not only outperforms conventional NMF in terms of both scalability and time efficiency through parallelization, but also significantly enhances the representation ability of conventional NMF by incorporating manifold regularization.

2. Related Work

This section gives a brief review of PNMF and GNMF and of their representative algorithms. Details are as follows.

2.1. PNMF

Because it works in batch mode, conventional NMF has to perform a large number of large matrix operations, which inevitably incurs high time overhead. Parallel processing is an effective way to speed up such computation-intensive algorithms. In general, the parallelism of an algorithm can be exploited in two aspects. The first aspect is the computation flow; unfortunately, the computing steps of conventional NMF are highly dependent on one another and cannot be decomposed into independent tasks. Therefore the second aspect, the data structure, is taken into account: a feasible plan is to divide the data into small blocks and assign them to different processing units.

Robila and Maciak [9] implement a typical parallel NMF. Let $X \in \mathbb{R}_+^{m \times n}$ be the input data matrix, $W \in \mathbb{R}_+^{m \times r}$ the basis matrix, and $H \in \mathbb{R}_+^{r \times n}$ the coefficient matrix. Given $p$ processing units, and assuming the matrix dimensions are exactly divisible by $p$, they divide the above three matrices into $p$ equally sized submatrices by column, that is, $X = [X_1, \dots, X_p]$, $W = [W_1, \dots, W_p]$, and $H = [H_1, \dots, H_p]$. They still minimize the cost function of conventional NMF and obtain the update rules for $W$ and $H$ as follows:
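The displayed rules are not reproduced here; a plausible reconstruction in the notation above (ours, not necessarily the authors'), in which each unit updates its own block $H_i$ while the sums in the $W$ update are accumulated across all units, is

$$H_i^{t+1} = H_i^t \odot \frac{(W^t)^{\mathsf T} X_i}{(W^t)^{\mathsf T} W^t H_i^t}, \qquad W^{t+1} = W^t \odot \frac{\sum_{i=1}^{p} X_i \,(H_i^{t+1})^{\mathsf T}}{W^t \sum_{i=1}^{p} H_i^{t+1} (H_i^{t+1})^{\mathsf T}},$$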

where $t$ denotes the iteration number, $i = 1, \dots, p$, and $\odot$ and the fraction bar denote elementwise multiplication and division, respectively.

By parallelizing conventional NMF in this way, PNMF achieves the same representation ability while reducing the computation time.

2.2. GNMF

Commonly, conventional NMF performs the factorization in Euclidean space, which ignores the intrinsic geometry of the data and often fails to give a satisfactory solution. Based on the manifold assumption [22], GNMF embeds the intrinsic geometric structure of the data space into conventional NMF to overcome this defect.

One of the representative algorithms for GNMF is proposed by Cai et al. [12]. They define the graph weight matrix $S$, whose entries encode the weight relationship between any two data samples in the input data matrix $X$. Furthermore, two other matrices are derived from $S$: one is the diagonal matrix $D$ whose entries are the row (or column) sums of $S$, that is, $D_{ii} = \sum_j S_{ij}$, and the other is $L = D - S$, which is called the graph Laplacian. GNMF penalizes the NMF cost function with the regularizer $\operatorname{Tr}(H L H^{\mathsf T})$, and Cai et al. utilize the Lagrangian multiplier method to solve this problem and derive the following update rules:
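The displayed rules are not reproduced here; rewritten in the $X \approx WH$ notation used above (the original paper factorizes $X \approx UV^{\mathsf T}$) and referred to later as the update rule (2), they take the form

$$W^{t+1} = W^t \odot \frac{X \,(H^t)^{\mathsf T}}{W^t H^t (H^t)^{\mathsf T}}, \qquad H^{t+1} = H^t \odot \frac{(W^{t+1})^{\mathsf T} X + \lambda H^t S}{(W^{t+1})^{\mathsf T} W^{t+1} H^t + \lambda H^t D}, \tag{2}$$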

where $\lambda \geq 0$ is the regularization parameter, $t$ denotes the iteration number, and $S$ and $D$ are the weight and degree matrices defined above.

By incorporating the geometric structure, GNMF can significantly improve representation ability in comparison to conventional NMF.

3. Parallel Manifold Regularized Nonnegative Matrix Factorization

In conventional GNMF, the intrinsic low-dimensional manifold structure embedded in the dataset is expressed by the graph weight matrix $S$. From the update rule (2), we can easily see that the manifold terms in conventional GNMF give rise to considerable additional computation compared with NMF. In this section we therefore describe the proposed PNMF-M, which alleviates this issue by distributing both the data samples and the factor matrices to multiple computing nodes and by parallelizing the updates of both factor matrices as well as the manifold construction.

3.1. Data Partition and Distributed Computing

We still denote by $X \in \mathbb{R}_+^{m \times n}$, $W \in \mathbb{R}_+^{m \times r}$, $H \in \mathbb{R}_+^{r \times n}$, $S \in \mathbb{R}^{n \times n}$, and $p$ the input data matrix, the basis matrix, the coefficient matrix, the graph weight matrix, and the number of processors, respectively.

PNMF-M divides $X$ and $H$ into $p$ blocks in the column direction, namely, $X = [X_1, \dots, X_p]$ and $H = [H_1, \dots, H_p]$, as shown in Figures 1(a) and 1(b); thus each of the first $p-1$ blocks consists of $\lceil n/p \rceil$ columns and the last one consists of the remaining columns. Similarly, PNMF-M divides $X$ and $W$ into $p$ blocks in the row direction, as shown in Figures 2(a) and 2(b), that is, $X = [\tilde{X}_1; \dots; \tilde{X}_p]$ and $W = [W_1; \dots; W_p]$, so each of the first $p-1$ blocks consists of $\lceil m/p \rceil$ rows and the last one consists of the remaining rows.

As the scale of the $n \times n$ weight matrix $S$ is much larger than those of $X$, $W$, and $H$, $S$ should be distributed across the $p$ processors; the same is true for $D$. To this end, PNMF-M divides $S$ and $D$ into $p$ blocks in the row direction, namely, $S = [S_1; \dots; S_p]$ and $D = [D_1; \dots; D_p]$. In this way, each of the first $p-1$ blocks is an $\lceil n/p \rceil \times n$ matrix and the last one holds the remaining rows, as described in Figures 1(c) and 1(d).
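As an illustration of this balanced partition, the following sketch (our own, using numpy; the function name and the toy sizes are not from the paper) computes the column blocks assigned to each node; the row partition of $X$, $W$, $S$, and $D$ is obtained in the same way on the row indices.

```python
import numpy as np

def column_blocks(n, p):
    """Split n column indices into p contiguous blocks: the first p - 1
    blocks get ceil(n / p) columns and the last block gets the rest."""
    size = -(-n // p)  # ceil(n / p)
    return [np.arange(i * size, min((i + 1) * size, n)) for i in range(p)]

# Toy example: partition X (m x n) and H (r x n) by columns.
m, n, r, p = 6, 10, 3, 4
X = np.abs(np.random.rand(m, n))
H = np.abs(np.random.rand(r, n))
X_parts = [X[:, idx] for idx in column_blocks(n, p)]  # X_i held by node i
H_parts = [H[:, idx] for idx in column_blocks(n, p)]  # H_i held by node i
```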

According to [12], to take advantage of manifold regularization, PNMF-M optimizes the following objective:
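In the notation above, this is the standard graph-regularized objective of [12] (reconstructed here, since the display is omitted):

$$\min_{W \geq 0,\; H \geq 0} \; \|X - WH\|_F^2 + \lambda \operatorname{Tr}\!\left(H L H^{\mathsf T}\right),$$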

where $\lambda \geq 0$ denotes the regularization parameter and $L = D - S$ is the graph Laplacian defined in Section 2.2.

Since each update for $H$ (or $W$) needs the latest value of $W$ (or $H$), it is difficult to distribute the updates for them over the $p$ processors. According to the aforementioned data partitioning strategy, PNMF-M reformulates the multiplicative update rules in [12], at the $t$th iteration, as follows:
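A plausible reconstruction of the two rules, obtained by restricting the GNMF updates (2) to the blocks of Section 3.1 (our notation; $\tilde{X}_i$ denotes the row block of $X$ held by processor $P_i$), is

$$H_i^{t+1} = H_i^t \odot \frac{(W^t)^{\mathsf T} X_i + \lambda \,(H^t S)_i}{(W^t)^{\mathsf T} W^t H_i^t + \lambda \,(H^t D)_i}, \tag{5}$$

$$W_i^{t+1} = W_i^t \odot \frac{\tilde{X}_i \,(H^{t+1})^{\mathsf T}}{W_i^t \, H^{t+1} (H^{t+1})^{\mathsf T}}, \tag{6}$$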

where $t$ denotes the iteration number and $i = 1, \dots, p$ (note that $(H^t S)_i$ and $(H^t D)_i$ are the $i$th column blocks of $H^t S$ and $H^t D$, and $H^t S = \sum_{j=1}^{p} H_j^t S_j$, so the manifold term in (5) needs contributions from all processors), with $W^t = [W_1^t; \dots; W_p^t]$ and $H^{t+1} = [H_1^{t+1}, \dots, H_p^{t+1}]$ assembled from the blocks held by the individual processors.

In each iteration round, PNMF-M updates each block of $H$ and $W$ locally by (5) and (6), respectively, and inserts a reduction of the whole factors among the processors between (5) and (6). Since our algorithm is developed on a distributed computing system with a 40 Gb/s InfiniBand interconnection network, whose interprocessor communication overhead is relatively low, this updating strategy significantly speeds up the calculation of manifold regularized NMF through distributed computing.
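A minimal sketch of this reduction step (our illustration, assuming one MPI process per node and using mpi4py; none of these names come from the paper) gathers the locally updated blocks so that every process holds the full factor needed by the next local update:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD  # one MPI process per computing node

def gather_columns(H_i):
    """Assemble the full coefficient matrix H from the column blocks
    H_1, ..., H_p held by the p processes."""
    return np.hstack(comm.allgather(np.ascontiguousarray(H_i)))

def gather_rows(W_i):
    """Assemble the full basis matrix W from the row blocks
    W_1, ..., W_p held by the p processes."""
    return np.vstack(comm.allgather(np.ascontiguousarray(W_i)))
```

Each iteration then reads: update $H_i$ by (5), gather the columns of $H$, update $W_i$ by (6), and gather the rows of $W$.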

3.2. Manifold Construction

Besides data partition and distributed computing, manifold construction is another key factor in PNMF-M. In the update rule (5), updating $H_i$ in processing unit $P_i$ requires the products $H_j S_j$ computed in all processing units. However, directly constructing the global adjacency graph from the input data matrix takes time that grows quadratically with the number of samples and heavily prohibits manifold regularized NMF for large-scale datasets. In this subsection, we design an efficient construction method by means of communication between the different processing units.

In Figure 1(c), we can see that processor $P_i$ holds only $S_i$, which is a part of the global manifold and represents the weight relationship between the data samples in $X_i$ and all the data samples in $X$. Originally, the entire data matrix is needed to construct $S_i$, and this tremendous storage requirement becomes the bottleneck that restricts the scalability of the algorithm. It is therefore critical to construct the manifold in an efficient way. To overcome this drawback, we design a two-step manifold construction method that exploits the high-throughput interconnection network.

Figure 3 depicts the details of the construction method. Without loss of generality, we take processor $P_i$ as an example to demonstrate both steps. In the first step, PNMF-M first constructs locally an adjacency subgraph from $X_i$, relative to the adjacency graph constructed from the whole of $X$, and thus obtains an index set containing the local neighbor indices of each data sample in $X_i$. Then, $P_i$ uses the high-throughput interconnection network to receive, in turn, the data samples from the other remote processors. When $X_j$ arrives, for each $j \neq i$, PNMF-M constructs an adjacency subgraph from $X_i$ and $X_j$ and obtains an index set containing the corresponding remote neighbor indices of each data sample in $X_i$. In the second step, PNMF-M computes the weight relationship between $X_i$ and the data samples indicated by each index set and picks out the final neighbor indices of each data sample in $X_i$ to construct its block $S_i$ of the manifold.
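The following sketch illustrates the two steps for a $k$-NN graph with 0/1 weights (our own illustration under assumed choices: squared Euclidean distances, no exclusion of self-neighbors; all function names are ours):

```python
import numpy as np

def knn_candidates(X_local, X_other, k):
    """For each column (sample) of X_local, return the distances to and the
    within-block indices of its k nearest columns of X_other."""
    d = (np.sum(X_local ** 2, axis=0)[:, None]
         + np.sum(X_other ** 2, axis=0)[None, :]
         - 2.0 * X_local.T @ X_other)           # squared Euclidean distances
    idx = np.argsort(d, axis=1)[:, :k]          # k nearest within this block
    return np.take_along_axis(d, idx, axis=1), idx

def two_step_knn_block(X_blocks, i, k):
    """Step 1: local candidates from X_i; step 2: candidates from each remote
    block X_j; then keep the k globally nearest per sample to build S_i."""
    offsets = np.cumsum([0] + [b.shape[1] for b in X_blocks])
    dists, cols = [], []
    for j, X_j in enumerate(X_blocks):          # in PNMF-M the blocks with
        d, idx = knn_candidates(X_blocks[i], X_j, k)   # j != i arrive over
        dists.append(d)                                # the network
        cols.append(idx + offsets[j])
    d_all, c_all = np.hstack(dists), np.hstack(cols)
    keep = np.argsort(d_all, axis=1)[:, :k]     # merge the candidate sets
    neighbours = np.take_along_axis(c_all, keep, axis=1)
    S_i = np.zeros((X_blocks[i].shape[1], offsets[-1]))
    np.put_along_axis(S_i, neighbours, 1.0, axis=1)    # 0/1 k-NN weights
    return S_i
```

Because the $k$ nearest neighbors of a sample over the whole dataset are always contained in the union of its $k$ nearest neighbors within each block, merging the candidate sets reproduces the batch $k$-NN graph.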

In our method, we adopt the $k$-nearest neighbor ($k$-NN) mode in [22] to construct the graph, and thus the proposed construction method can be proved to yield a manifold equivalent to that of the batch construction method. Since each processor can discard the data sample matrices received from remote processors once the corresponding adjacency subgraphs have been constructed, PNMF-M greatly reduces the storage overhead and improves scalability.

3.3. PNMF-M Algorithm

We are now ready to alternately apply the update rules (5) and (6) to solve PNMF-M. Algorithm 1 shows the pseudocode on distributed computing systems. Given that $p$ processes exist in PNMF-M, Algorithm 1 first partitions the data samples and factor matrices according to Figures 1 and 2 and assigns the associated submatrices to each process $P_i$ (Statement 1). Then, $P_i$ begins to construct its private weight matrix block in parallel: it first computes the local neighbor index set from $X_i$ in the local process (Statement 2) and then exchanges data blocks with the other processes to obtain the remote neighbor index sets (Statements 3 and 4). Next, it merges all index sets to construct $S_i$ (Statement 5). Then, $P_i$ alternately employs the update rules (5) and (6) to obtain its private coefficient submatrix $H_i$ (Statement 7) and basis submatrix $W_i$ (Statement 10) until the termination condition of the iteration is satisfied.

Input: data matrix $X$, reduced dimensionality $r$, regularization parameter $\lambda$, tolerance $\epsilon$, parallelism $p$

Output: basis matrix $W$, coefficient matrix $H$
(1) Assign the column and row blocks of $X$, together with $W_i$ and $H_i$, to process $P_i$ as described in Figures 1 and 2
(2) Compute the local neighbor index set from $X_i$
(3) Send $X_i$ to the other processes
(4) Receive $X_j$ from the other processes to obtain the remote neighbor index sets
(5) Merge all index sets to construct $S_i$
(6) repeat
(7)  Apply (5) to update $H_i$
(8)  Send $H_i$ to the other processes
(9)  Receive $H_j$ from the other processes to constitute $H$
(10)  Apply (6) to update $W_i$
(11)  Send $W_i$ to the other processes
(12)  Receive $W_j$ from the other processes to constitute $W$
(13) until convergence
(14) return $W$ and $H$

Notice that updating $H_i$ requires the basis matrix $W$, which consists of the submatrices $W_j$ distributed across the processes $P_j$, $j = 1, \dots, p$, as shown in (5). Hence, Algorithm 1 has to insert interprocess communication to implement the reduction for $W$ before updating $H_i$ (Statements 11 and 12). Similarly, Algorithm 1 also requires interprocess communication to achieve the reduction for $H$ before updating $W_i$ (Statements 8 and 9).

4. Experimental Evaluation

In this section, we verify the proposed PNMF-M by comparing it with PNMF in [11] and GNMF in [12], which incorporate parallel computing and manifold regularization into NMF, respectively.

4.1. Datasets and Evaluation Metrics

We choose Reuters and TDT2 text corpora widely used for document clustering and MNIST and COIL100 image databases used for pattern recognition and machine learning as our benchmark datasets. Table 1 lists their detailed statistics.

In this experiment, we utilize the canonical $k$-means clustering method to transform the coefficient matrix obtained by each algorithm into labels of the data samples, then compare them with the labels provided by the benchmark datasets, and finally use the accuracy (AC) and the normalized mutual information (NMI) to evaluate the clustering performance of each tested algorithm. Xu et al. [2] and Cai et al. [12] give detailed descriptions of how to compute AC and NMI, so we do not repeat them here.
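As a sketch of this protocol (our illustration using scikit-learn; AC additionally requires matching predicted to true labels, e.g., by the Hungarian algorithm, which is omitted here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(H, true_labels, n_clusters, seed=0):
    """Cluster the columns of the coefficient matrix H (one column per
    sample) with k-means and score the result with NMI."""
    pred = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(H.T)
    return normalized_mutual_info_score(true_labels, pred)
```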

To evaluate the scalability of our proposed algorithm, we record the execution time of each tested algorithm and utilize the speedup (SP) and time cost (TC) as metrics, where SP denotes the speedup obtained by PNMF-M relative to GNMF and TC denotes the time cost introduced by PNMF-M relative to PNMF; that is,
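The displayed definitions are omitted here; from the description above, a natural reading (our reconstruction) is

$$\mathrm{SP} = \frac{T_{\mathrm{GNMF}}}{T_{\text{PNMF-M}}}, \qquad \mathrm{TC} = \frac{T_{\text{PNMF-M}}}{T_{\mathrm{PNMF}}},$$

where $T_{(\cdot)}$ denotes the execution time of the corresponding algorithm on the same input.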

4.2. Clustering Performance Comparisons

We collect data samples from multiple clusters randomly selected from the benchmark datasets and provide them, together with the corresponding cluster number, to each tested algorithm. All the tested algorithms are run 50 times on different data subsets, each of which consists of a subset of clusters randomly selected from all the clusters. We report the mean over these runs for each given cluster number as the final clustering performance. At the same time, we also compute the mean of SP and TC to verify the effectiveness of PNMF-M.

Table 2 lists the values of the main experimental parameters for the clustering performance evaluation. In our experiment, we set the regularization parameter $\lambda$ and the tolerance used as the termination condition of each iteration to the values given in Table 2. In addition, we employ the $k$-NN definition to construct the graph weight matrix $S$, with the neighborhood size also listed in Table 2. For the parallel algorithms, we adaptively select the parallelism $p$ according to the number of samples, rounding the resulting value up to an integer.

Figures 4–7 report AC and NMI versus the cluster number on the four benchmark datasets, respectively. Figure 8 reports SP and TC of our algorithm versus the cluster number on the four benchmark datasets.

Through comparative analysis, we can obtain the following observations:

Since the manifold constructed in PNMF-M is equivalent to that constructed in GNMF, both obtain the same clustering performance on all four benchmark datasets.

PNMF-M gets better clustering performance than PNMF in terms of both AC and NMI, because PNMF-M incorporates the intrinsic geometric structure of data and obtains the same clustering performance as GNMF, while PNMF ignores the manifold structure of the dataset.

GNMF directly constructs the manifold from the entire data matrix on a single processing node, which incurs tremendous computation and storage overhead. In contrast, PNMF-M uses the two-step distributed method to accelerate the manifold construction. In addition, PNMF-M parallelizes the update procedure to eliminate large matrix operations in the update rules and further reduces the computation and storage overhead. According to Figure 8, PNMF-M achieves about a 4–10 times speedup over GNMF on the four benchmark datasets.

Compared with PNMF, PNMF-M has to construct the manifold in parallel and introduce the manifold term into the update rule for the coefficient matrix, which inevitably increases the computation overhead of the algorithm to some extent. As shown in Figure 8, PNMF-M pays about 2–4 times the cost of PNMF on the four benchmark datasets.

4.3. Scalability Comparisons

We collect data subsets of increasing scale, drawn from all the clusters randomly selected from the benchmark datasets, as the input of each tested algorithm; run them under the same hardware configuration and software environment listed in Table 3; and verify the scalability by comparing their execution times and evaluating the SP achieved by PNMF-M. In our experiments, we set the parallelism of PNMF-M to 2 (PNMF-M2), 4 (PNMF-M4), 8 (PNMF-M8), and 16 (PNMF-M16), respectively.

Figure 9 reports the SP achieved by PNMF-M with different parallelism on the four benchmark datasets relative to GNMF. We can draw the following conclusions from it:

When the number of input data samples is small (below a certain threshold), the SP obtained by PNMF-M is less than the corresponding parallelism. This is because all processing nodes need to communicate with each other to construct the manifold during the execution of PNMF-M, which introduces communication latency into the parallel computation.

When the number of input data samples is large (above that threshold), the SP achieved by PNMF-M exceeds the corresponding parallelism. The reason is that all the input data samples reside in external storage before the execution of each tested algorithm, but GNMF requires all data to be loaded into memory during its execution; this exceeds the memory capacity of a single processing node, so large amounts of data have to be paged through virtual memory. To address the problem, PNMF-M divides the input data into multiple subsets and stores them on different processing nodes, which reduces the memory access overhead caused by loading data.

In the case of small input datasets, PNMF-M achieves better SP on MNIST and COIL100 than on Reuters and TDT2 when the parallelism is 2, 4, or 8, but the result is the opposite as the number of input data samples increases. The main reason is twofold: one part follows from the two analyses above, and the other is that the memory and communication overhead per data sample is larger on Reuters and TDT2 because their data dimensionality is higher than that of MNIST and COIL100. When the parallelism of PNMF-M increases to 16, the overhead caused by the high dimensionality is spread over the processing nodes, so the SP obtained by PNMF-M16 differs little among the four benchmark datasets.

In addition, we also compare the execution time of all the tested algorithms on the four benchmark datasets, as shown in Figure 10. It is obvious that an increasingly widening gap in execution time exists between the serial algorithm (GNMF) and the parallel algorithms (PNMF-M2, PNMF-M4, PNMF-M8, and PNMF-M16). This is because GNMF involves many large matrix operations during the manifold construction and updating procedure, and the time they consume grows nonlinearly as the numbers of data samples and attributes increase, whereas PNMF-M transforms them into relatively small matrix operations and utilizes multiple processing nodes to perform them in parallel. Hence, PNMF-M achieves better scalability than GNMF.

5. Conclusion

In this paper, we proposed a parallel nonnegative matrix factorization with manifold regularization method (PNMF-M), which introduces manifold regularization into conventional NMF and parallelizes it on distributed computing systems. In PNMF-M, besides the balanced data partition strategy, we conduct the update procedure in parallel. In particular, we presented a two-step distributed graph construction method that obtains a manifold equivalent to that constructed in batch mode. Experimental results showed that PNMF-M not only outperforms conventional NMF in terms of both scalability and time efficiency thanks to the parallelization, but also significantly enhances the representation ability of conventional NMF thanks to the manifold regularization.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.