Research Article  Open Access
Sai Kiranmayee Samudrala, Jaroslaw Zola, Srinivas Aluru, Baskar Ganapathysubramanian, "Parallel Framework for Dimensionality Reduction of LargeScale Datasets", Scientific Programming, vol. 2015, Article ID 180214, 12 pages, 2015. https://doi.org/10.1155/2015/180214
Parallel Framework for Dimensionality Reduction of LargeScale Datasets
Abstract
Dimensionality reduction refers to a set of mathematical techniques used to reduce complexity of the original highdimensional data, while preserving its selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and highdimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in reallife applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of largescale data. We identify key components underlying the spectral dimensionality reduction techniques, and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000core cluster, which is beyond the reach of currently available methods. To further demonstrate applicability of our framework we perform dimensionality reduction of 75,000 images representing morphology evolution during manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.
1. Introduction
Computational analysis of highdimensional data continues to be a challenging problem, spurring the development of numerous computational techniques. An important and emerging class of methods for dealing with such data is dimensionality reduction. In many applications, features of interest can be preserved while mapping the high dimensionality data to a small number of dimensions. These mappings include popular techniques such as principle component analysis (PCA) [1] and complex nonlinear maps such as Isomap [2] and kernel PCA [3].
Linear manifold learning techniques, for example, PCA or multidimensional scaling [4–7], existed as orthogonalization methods for several decades. Nonlinear methods like Isomap, LLE (locally linear embedding) [8], and Hessian LLE [9] were discovered recently. Another class of methods that emerged in the past few years are the unsupervised learning techniques, including artificial neural networks for Sammon’s nonlinear mapping [10], Kohenen’s or self organizing maps (SOM) [11], and curvilinear component analysis [12]. Modifications to the existing algorithms of manifold learning, to improve either their efficiency or performance, were another area where efforts were focused [13–16]. For example, Landmark Isomap [17] is a modification to the original Isomap method to extend its usage to larger datasets by picking a few representative points and applying Isomap technique to them. Along with the emergence of new manifold learning techniques, different sequential implementations of these techniques, targeting various hardware platforms and various programming languages, have been developed [18, 19].
Dimensionality reduction techniques are often compute intensive and do not easily scale to large datasets. Recent advances in highthroughput measurements using physical entities, such as sensors, or results of complex numerical simulations are generating data of extremely high dimensionality. It is becoming increasingly difficult to process such data sequentially.
In this paper, we propose a parallel framework for dimensionality reduction. Rather than focusing on a particular method, we consider the class of spectral dimensionality reduction methods. Till date, few efforts have been made in developing parallel implementations of these methods, other than development of a parallel version of PCA [20, 21], parallelization of multidimensional scaling (MDS) for genomic data [22], and algorithm development for GPU platforms [23, 24].
We perform a systematic analysis of spectral dimensionality reduction techniques and provide their unified view that can be exploited by dimensionality reduction algorithm designers. We identify common computational building blocks required for implementing spectral dimensionality reduction methods and use these abstractions to derive a common parallel framework. We implement such a framework and show that it can handle large datasets and it scales to thousands of processors. We demonstrate advantages of our software by analyzing 75,000 images of morphology evolution during manufacturing of organic solar cells, which enables us to visually inspect and correlate fabrication parameters with morphology.
The remainder of this paper is organized as follows. In Section 2 we introduce the dimensionality reduction problem and describe basic spectral dimensionality reduction techniques, highlighting their computational kernels. In Section 3 we provide a detailed description of our parallel framework including algorithmic solutions. Finally, in Section 4 we present experimental results, and we conclude the paper in Section 5.
2. Definitions and Methods Overview
The problem of dimensionality reduction can be formulated as follows: Consider a set of points, where , and . We are interested in finding a set , such that , , and . Here, denotes a specific norm that captures properties we want to preserve during dimensionality reduction [25]. For instance, by defining as Euclidean norm we preserve Euclidean distance, thus obtaining a reduction equivalent to the standard technique of principal component analysis (PCA) [1]. Similarly, defining to be the angular distance (or conformal distance [26]) results in locally linear embedding (LLE) [8] that preserves local angles between points. In a typical application [27, 28], represents a state of the analyzed system, for example, temperature field and concentration distribution. Such state description can be derived from experimental sensor data or can be the result of a numerical simulation. However, irrespective of the source, it is characterized by high dimensionality, that is, is typically of the order of [29]. While represents just a single state of the system, common data acquisition setups deliver large collections of such observations, which correspond to the temporal or parametric evolution of the system [27]. Thus, the cardinality of the resulting set is usually large (). Intuitively, information obfuscation increases with data dimensionality. Therefore, in the process of dimensionality reduction (DR) we seek as small a dimension as possible, given constraints induced by the norm [25]. Routinely, as it permits, for instance, visualization of the set .
DR techniques have been extensively researched over the last decade [25]. In particular, methods based on spectral data decomposition have been very successful [1, 2, 9] and have been widely adopted. Early approaches in this category exploited simple linear structure of the data, for example, PCA or multidimensional scaling (MDS) [30]. More recently, techniques that can unravel complex nonlinear structures in the data, for example, Isomap [2], LLE, and kernel PCA [3], have been developed. While all these methods have been proposed taking into account specific applications [19, 25], their underlying formulations share similar algorithmic mechanisms. In what follows we provide a more detailed overview of spectral DR techniques and we identify their common computational kernels that form the basis for our parallel framework.
2.1. Spectral Dimensionality Reduction
The goal of DR is to identify a lowdimensional representation of the original dataset , that preserves certain predefined properties. The key idea underpinning spectral DR can be explained as follows. We encode desired information about , that is, topology or distance, in its entirety by considering all pairs of points in . This encoding is represented as a matrix . Next, we subject matrix to unitary transformation , that is, transformation that preserves norm of , to obtain its sparsest form , where . Here, is a diagonal matrix with rapidly diminishing entries. As a result, it is sufficient to consider only entries of to capture all the information encoded in . These entries constitute the set . The above procedure hinges on the fact that unitary transformations preserve original properties of [31]. Note also that it requires a method to construct matrix in the first place. Indeed, what differentiates various spectral methods is the way information is encoded in .
We summarize the general idea of spectral DR in Algorithm 1. In the first four steps we construct the matrix . As indicated, this matrix encodes information about the property that we wish to preserve in the process of DR. To obtain we first identify the nearest neighbors (NN) of each point . Note that currently all studied methods use NN defined in Euclidean space. This enables us to define a weighted graph that encapsulates, both distance and topological, properties of the set . Given graph , we can construct a function to isolate the desired property. For instance, consider the Isomap algorithm in which the geodesic distance is maintained. In this case, returns the length of the shortest path between and in . Note that for some methods is very simple; for example, for PCA it is equivalent to a distance measure , , while for other methods can be more involved. Differences between various DR methods and their corresponding function are outlined in Table 1. The property extracted by function is stored in an auxiliary matrix , which is next normalized to obtain matrix . This process of normalization is a simple algebraic transformation, which ensures that is centered and hence that the final lowdimensional set of points contains the origin and is not an affine translation [31]. Subsequently, is spectrally decomposed into its eigenvalues that constitute the sparsest representation of . Resulting eigenvectors and eigenvalues are then postprocessed to extract the set of lowdimensional points.
 
Note: is the identity matrix, and . 

The abstract representation of spectral DR methods in Algorithm 1 is based on a thorough analysis of existing techniques [1, 2, 8]. While this representation is compact, it offers sufficient flexibility to, for instance, design new dimensionality reduction procedures. At the same time it provides clear separation of individual computational steps and explicates data flow in any DR process. We exploit both these facts when designing our parallel framework.
2.2. Performance Analysis of Dimensionality Reduction Methods
We used the above presentation of DR methods to identify their basic computational kernels. To better understand how these kernels contribute to the overall performance of different DR methods, we performed a set of experiments using domain specific implementation in Matlab. Experiments were carried out for varying and a fixed on a workstation with 8 GB of RAM and an Intel 3.2 GHz processor. Obtained results are presented in Table 2.

As can be seen, the run time of analyzed methods is dominated by two steps, namely, NN and construction of the auxiliary matrix . Together they account for 99.8% of the total execution time for . In our implementation the NN procedure depends on allversusall distance calculations. This is justified taking into account that is very large and thus efficient algorithmic strategies for NN, for example, based on hierarchical space decomposition [32], are infeasible. Consequently, complexity of this step is . The cost of computing matrix depends explicitly on the definition of function . Among existing DR techniques this function is the most complex for the Isomap method. Recall that in the process of DR we are interested in preserving either distance or local topology characteristics. Local topology properties can be directly obtained from NN [8, 9, 33], inducing computationally efficient definition of . Conversely, distance characteristics must conform to global constraints and therefore have higher computational complexity [34]. In case of Isomap, pairwise geodesic distances can be efficiently derived from allpairs shortest path distances using, for example, FloydWarshall algorithm, with worstcase complexity.
Another significant DR component is normalization. Although implementation of this step varies between different methods it is invariably dominated by matrixmatrix multiplication. Therefore, we assume overall normalization complexity to be . The last important component is the eigenvalue solver. In general, complexity of this kernel varies depending on the particular solver used. Commonly employed algorithms include Lanczos method [35], Krylov subspace methods [36], or deflationbased power methods [37, 38]. The choice of method is driven by the structure of the matrix and the number of required eigenvalues. Standard distance preserving DR methods operate on dense symmetric matrices, while topology preserving methods involve sparse symmetric matrices. Accordingly, complexity of these techniques is usually , where is the number of desired eigenvalues.
A final key factor we have to consider is memory complexity of the described kernels. Here, the main contributing structures are matrices and . These matrices are most often dense and in the majority of cases require storage. Because NN directly depends on distances between all pairs, it utilizes an matrix as well. Finally, input dataset requires memory.
One important caveat that affects the above analysis is the relationship between and . In many applications is significantly greater than . This is not surprising taking into account that acquiring high resolution data (hence highdimensional) is resource intensive. Therefore one may expect that with increasing there is usually decrease of . In our applications [29, 39] it is not uncommon that or even . Consequently, the NN step in Algorithm 1 becomes the most compute intensive while memory requirement is dominated by the input data. Observe that this trend is reflected in our experimental data.
3. Parallel Framework for Dimensionality Reduction
Dimensionality reduction very quickly becomes both memory and compute intensive, irrespective of the particular method. Memory consumption arises from the size of input data and the auxiliary matrices created in the process. The computational cost is dominated by pairwise computations and weight matrix construction. The goal of our framework is to scale DR methods to very large datasets that could be analyzed on large parallel machines.
We designed our parallel DR package following the general outline presented in Algorithm 1. Taking into account significant memory and computational complexity, we focused on distributed memory machines with MPI. To ensure modularity of the framework without sacrificing performance and scalability, we decided to employ a scheme in which processors are organized into a logical 2D mesh. In what follows, we assume a simple pointtopoint communication model with latency and bandwidth .
3.1. Constructing Graph
The graph construction procedure is based on identifying nearest neighbors of each input point. Because of the high dimensionality of the input data it is advantageous to implement NN in two steps, where we first compute all pairwise distances and then we identify neighbors in a simple scan. Note that these pairwise distances actually represent the distance measure (see Algorithm 1). Therefore we will consider to be an distance matrix. Parallel pairwise computation is a well studied problem [40]. Here, we benefit from our earlier experience with accelerating pairwise computations on heterogeneous parallel processors [41].
Let denote the number of processors conceptualized as organized into a virtual mesh. We decompose into blocks of elements. Processor with coordinates is responsible for computing elements of within block . This scheme requires that each processor stores two blocks of points of the input dataset that correspond to rowvectors and columnvectors used to compute respective part of the matrix . In our implementation, the distribution of the input dataset is performed by parallel I/O with initially preprocessed . Note that to obtain a single element of we perform norm computations, which are particularly well suited for vectorization. Therefore, we handtuned our code to benefit from SIMD extensions of modern processors.
Given pairwise distances, the second step is to identify neighbors of individual points (i.e., vertices of ). This step is executed only for methods where , which virtually involves all methods other than PCA (see Table 1). As in the case of pairwise computations it can be efficiently parallelized using the following scheme. Initially, each processor creates a set of candidate neighbors with respect to the block of matrix it stores. Specifically, processor with coordinates searches for neighbors of the set of points by analyzing rows of its local block of . Because is very small this operation can be performed using a simple linear scan. Next, all processors within the same row perform alltoall communication to aggregate candidate neighbors. Here, the candidate neighborhood of point is assembled on the processor with coordinates . This processor merges candidate neighborhood lists into the final set of nearest neighbors. Observe that this completes the graph construction phase—graph is now stored in the form of adjacency list distributed over processors.
The computational complexity of the entire procedure can be decomposed into cost of computing distance matrix, cost of performing local NN search, and cost of distributing local NN information to all processors sharing a row. Hence, the total complexity is bounded by , which is optimal for .
3.2. Building Auxiliary Matrix
Given graph we proceed to the next step, which involves constructing the auxiliary matrix from the information encapsulated in . As is the case of we distribute over mesh of processors.
Recall that the formulation of DR methods proposed in Algorithm 1 ensures that the only step that is method dependent is the construction of matrix . Consequently, any parallel implementation of this step will vary but it will reflect limitations inherent to the sequential counterpart. Specifically, topology preserving methods, such as, LLE, will involve only local data and hence will be amenable to embarrassing parallelism with limited or no communication. Conversely, distance preserving methods will inevitably require a global data view and thus potentially more sophisticated parallelization strategies. Following our previous claim regarding complexity of Isomap we focus our presentation on the parallel implementation of this particular method.
The function used in Isomap is based on the geodesic distance, which has been mathematically shown to be asymptotically equivalent to a graph distance in (i.e., shortest path distance) [42]. However, the geodesic distance is a metric, while allpairs shortest paths in directed graph do not have to satisfy the symmetry condition. Therefore, to obtain , special attention must be paid to how shortest path distances are used. More precisely, graph must be transformed to ensure that it is symmetric. Note that after such transformation the graph is no longer regular; that is, certain nodes may have more than neighbors (see Figure 1).
(a)
(b)
Taking into account the above requirements we obtain the following procedure of constructing in parallel. First, all processors within the same row perform alltoall communication to replicate graph . As a result, each column of processors stores a copy of the entire graph , that is, rowwise distributed between processors in that column. Thanks to this, each processor can initialize its local part of without further communication. After initialization represents the distributed adjacency matrix of , where if is a neighbor of , and otherwise. In the next step symmetrization procedure is executed. Processors with coordinates and , where , exchange respective blocks of and select elementwise minimum value between blocks. Similar operation is performed locally by processors on the diagonal, that is, processors for which . At this stage can be used to identify allpairs shortest paths. Several parallel algorithms have been proposed to address this problem, targeting the PRAM model [43–45], shared memory architectures [46], and multi/many cores [47–49], as well as distributed memory machines [45, 50]. Among the existing parallel strategies we decided to adopt the checkerboard version of the parallel Floyd algorithm [46]. Briefly, the method proceeds in iterations, where in each iteration every processor performs operations to update its local block of . All processors are synchronized in each iteration, owing to the fact that in iteration , th row and th column of have to be broadcasted. The broadcast is performed between processors that share the same row/column. After iterations, matrix stores allpairs shortest path, which concludes the entire procedure.
Complexity of this phase is and is dominated by the parallel Floyd’s algorithm. While replication and symmetrization of can be executed efficiently in time, allpairs path searching involves extensive communication that slightly hinders scalability. Nevertheless, the algorithm remains scalable as long as , which is true in the majority of reallife cases.
3.3. Matrix Normalization
The goal of normalization is to transform matrix such that the resulting matrix is both row and column centered; that is, and . The normalization stage in all cases consists of matrixmatrix multiplication (see Table 1). However, in certain situations, especially in distancepreserving methods, explicit matrix multiplication can be avoided by taking advantage of structural properties of one of the matrices (e.g., the matrix in Table 1). This is the case for, for example, PCA and Isomap, where we exploit the fact that matrices and are given analytically and thus can be generated inplace on each processor that requires them to perform multiplication. Consequently, the communication pattern inherent to the standard parallel matrixmatrix multiplication algorithms is simplified to one parallel reduction in the final dotproduct operation. The complexity of this approach is .
3.4. Finding Eigenvalues
Computing eigenvalues is the final step in the dimensionality reduction process. Although parallel eigensolvers are readily available, they are usually designed for shared memory and multi/many core architectures [51–54]. This unfortunately makes them impractical for our purposes. At the same time, existing distributed memory solutions are not scalable and cannot handle large and dense data. For instance, one of the more popular packages, SLEPc [55], uses a simple 1D decomposition and in our tests did not scale to more than 4096 processors. A more recent library, elemental [56], which is still under development, offers 2D blockcyclic decomposition, but relies on a fixed block size (private communication). Finally, ELPA [57], which is the most promising library in this category, is also under development.
For these reasons we decided to implement a custom eigenvalue solver that exploits special properties of matrix (symmetric, positive semidefinite) and computes only the first eigenvalues. Our solver is based on the power method [31] and matrix deflation and is outlined in Algorithm 2. Note that power methods are considered easy but not efficient to parallelize. At the same time, however, they are at heart of several important reallife systems, for instance, Google’s PageRank [58].

In general, our approach follows the standard scheme of the power method (lines 3–18), repeated times to identify first largest eigenvalues. After identification of an eigenvalue and its associated eigenvector, the matrix is deflated—the contribution of the vector is removed from (line 19). Observe that power method involves nested matrixvector product (lines 6–11) that under normal circumstances would require parallel vector transposition. However, our parallel implementation benefits from the fact that is symmetric, hence eliminating need for vector transposition. Indeed, the entire procedure consists of local matrixvector product followed by allreduce operation. Here, the reduction operation alternates between columns and rows as required to ensure that vector is stored properly. Note that the power method is bounded by convergence criteria (line 5). In our case we use one of several popular conditions, which involves checking relative error between the current and previous estimate of the eigenvalue that can be performed every several iterations. We also note that convergence is significantly improved by using a matrix shifting strategy in the form , where is a positive number [59].
Extracting eigenvalue and eigenvector in iteration (lines 11–18) depends on vectors and , while deflation step involves . Therefore, it is advantageous to replicate both and in their entirety on each processor. We achieve this with alltoall communication executed by processors within the same row. This allows us to execute the deflation step in parallel, with each processor updating its local block of matrix . Thus, the complexity of a single iteration of the power method is , while the deflation step is .
To conclude this section we would like to emphasize that our solver operates under the same assumptions as any power method. It requires that the first eigenvectors of are linearly independent, the initial vector generated in th iteration is not orthogonal to the eigenvector , and finally, the first eigenvalues are nondegenerate [31]. Note that these conditions are not restrictive and are easily satisfied in the context of dimensionality reduction.
4. Experimental Results
To assess scalability of our framework and test its performance in reallife applications, we performed a set of experiments using the TACC Ranger cluster [60]. A single node of this machine is based on AMD processors working at 2.3 GHz and provides 16 cores with 32 GB of DDR2 RAM and 512 KB of L2 cache per core. Nodes are connected by a multistage Infiniband network that offers 1 GB/s bandwidth. To compile all test programs and the framework, we used the Intel C++ 10.1 compiler with standard optimization flags and MVAPICH 1.0.1 MPI implementation. In every test we ran one MPI process per CPU core, which we refer to as processor.
4.1. Scalability Tests
In the first set of experiments we measured how problem size influences performance of our solution. We created a collection of synthetic datasets consisting of points with . Next, we performed Isomap dimensionality reduction using different numbers of processors. Obtained results are summarized in Table 3 and Figure 2.

The results show that our framework provides very good scalability for large problem sizes in the entire range of tested processor configurations. The superlinear speedup observed for is naturally explained by cache performance. Observe that the dominating computational factors in our framework are operations like matrixmatrix and matrixvector products, which are well suited to exploit memory hierarchy. In our current implementation, we use direct SIMDbased implementation of these routines. A slightly weaker performance for small problem sizes and large number of processors can be attributed to network latency that offsets computational gains.
To further understand how different components of the framework perform, we measured their run time obtained for changing problem sizes. Table 4 shows that all modules scale as we would expect based on their theoretical complexity. The most time consuming stages are construction of the auxiliary matrix for Isomap and normalization. This is not surprising taking into account that both components scale as and the parallel Floyd’s algorithm involves rounds of communication. The abrupt performance decrease in the normalization stage, which can be observed for , can be attributed to cache performance. Recall that normalization depends on matrixmatrix multiplication and hence is inherently sensitive to data locality. The final remark concerns nearest neighbors module and eigenvalue solver. The NN scales linearly with the data dimension (see Table 5) and both modules can be used as standalone replacements whenever NN or largest eigenvalues problem has to be solved.


In the final test we compared our parallel eigensolver with SLEPc [55], one of the most popular and widely used libraries providing eigensolvers. SLEPc is an efficient and portable framework that offers an intuitive user interface. In many cases it is the first choice for solving largescale sparse eigenvalue problems.
Table 6 shows that our implementation systematically outperforms SLEPc. This can be explained by two main factors: first, unlike SLEPc, our implementation follows 2D data decomposition scheme, which offers better scalability, and second, we are seeking only the largest eigenvalues.

4.2. Using Dimensionality Reduction to Explore Manufacturing Pathways
Solar cells, or plastic solar cells, manufactured from organic blends, that is, a blend of two polymers, represent a promising lowcost, rapidly deployable strategy for harnessing solar energy. While highly costeffective and flexible, their low power conversion efficiency makes them less competitive on a commercial scale in comparison with conventional inorganic solar cells. A key aspect determining the power conversion efficiency of organic solar cells is the morphological distribution of the two polymer regions in the device. Recent studies reveal that significant improvement in power conversion efficiency is possible through better morphology control of the organic thin film layer during the manufacturing process [29, 61–66]. Highthroughput exploration of the various manufacturing parameters, for example, evaporation rate, blend ratio, substrate patterning frequency, substrate patterning intensity, and solvent, can potentially unravel processmorphology relationships that can help tailor processing pathways to obtain enhanced morphologies. Note that such highthroughput analysis results in datasets that are too large to visually inspect for trends and relationships. A promising approach towards unraveling processmorphology relationships in this highthroughput data is via dimensionality reduction. Here, we showcase our parallel framework on this pressing scientific problem. In particular, we focus on using dimensionality reduction to understand the effects of substrate patterning [67, 68], described by patterning frequency and intensity, on morphology evolution. We note that nanotip photolithography patterning of the substrate has shown significant potential to guide morphology evolution [69].
The input dataset consists of morphologies. Each morphology is a 2dimensional snapshot which is vectorized to have dimensionality . Figure 3 shows several representative final morphologies obtained by varying the patterning frequency, , from to , and the intensity of the attraction/repulsion, , from to (in the remaining we omit leading 1 for clarity). In all these cases, the lower surface of the domain is patterned to attract and/or repel specific classes of polymers, thus affecting the morphology. We performed dimensionality reduction on this dataset using processors on TACC Ranger. The total run time was 1058 seconds.
Figure 4 plots the first 10 eigenvalues of the data. Note that the first three eigenvalues, and hence the first three principle components of the data, represent ~70% of the information content of the entire data. We therefore characterize each morphology in terms of this threedimensional representation.
(a)
(b)
Figure 5 represents all the morphologies in such reduced space. In the plot to the left, the points are color coded according to the patterning frequency used, while in the plot to the right, the points are color coded according to the patterning intensity. This plot provides significant visual insight into the effects of patterning frequency and intensity. There exists a central plane of patterning frequency, where the morphology evolution is highly regulated irrespective of the patterning intensity (). This is particularly valuable information as the patterning frequency is much easier to control than patterning intensity from a manufacturing perspective. For patterning frequencies above , the morphologies are highly sensitive to slight variations in both frequency and intensity. This is also clearly seen in Figure 6, where slight variations in the intensity give dramatically different final morphologies. Notice also that higher intensity does not necessarily give different morphologies. This is a very important insight that allows us to preclude further, potentially expensive, exploration of the phase space of increasing patterning intensity.
(a)
(b)
Finally, the lowdimensional plots illustrate the ability to achieve the same morphology using different processing conditions. For instance, in Figure 7, we isolate the morphology evolution under two processing conditions that result in identical morphologies. Such correlations, most sensitive regions versus least sensitive regions (Figure 6) and configurations resulting in identical morphologies, are enormously useful as we tailor processing pathways to achieve designer morphologies. This analysis illustrates the power of parallel dimensionality reduction methods to achieve this goal. We defer a comprehensive physicsbased analysis of this dataset to a subsequent publication.
5. Conclusions
In this work we illustrate a systematic analysis of dimensionality reduction techniques and recast them into a unified view that can be exploited by dimensionality reduction algorithm designers. We subsequently identify the common computational building blocks required to implement a spectral dimensionality reduction method. We use this insight to design and implement a parallel framework for dimensionality reduction that can handle large datasets and scales to thousands of processors. We demonstrate the capability and scalability of this framework on several test datasets. We finally showcase the applicability and potential of the framework towards unraveling complex processmorphology relationships in the manufacturing of plastic solar cells.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported in part by the National Science Foundation through XSEDE resources provided by TACC under Grant no. TGCTS110007 and supported in part by NSF PHY0941576 and NSF1149365.
References
 J. L. Lumley, “Atmospheric turbulence and radio wave propagation,” in The Structure of Inhomogeneous Turbulent Flows, 1967. View at: Google Scholar
 J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000. View at: Publisher Site  Google Scholar
 K.R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction to kernelbased learning algorithms,” IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 181–201, 2001. View at: Publisher Site  Google Scholar
 M. L. Davison, Multidimensional Scaling, John Wiley & Sons, New York, NY, USA, 1983. View at: MathSciNet
 J. B. Kruskal, “Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis,” Psychometrika, vol. 29, no. 1, pp. 1–27, 1964. View at: Google Scholar  MathSciNet
 R. N. Shepard, “The analysis of proximities: multidimensional scaling with an unknown distance function. I,” Psychometrika, vol. 27, no. 2, pp. 125–140, 1962. View at: Google Scholar  MathSciNet
 W. S. Torgerson, “Multidimensional scaling. I. Theory and method,” Psychometrika, vol. 17, no. 4, pp. 401–419, 1952. View at: Google Scholar  MathSciNet
 S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000. View at: Publisher Site  Google Scholar
 D. L. Donoho and C. Grimes, “Hessian eigenmaps: new locally linear embedding techniques for highdimensional data,” Proceedings of the National Academy of Sciences, vol. 100, pp. 5591–5596, 2003. View at: Google Scholar
 J. W. Sammon Jr., “A nonlinear mapping for data structure analysis,” IEEE Transactions on Computers, vol. C18, no. 5, pp. 401–409, 1969. View at: Publisher Site  Google Scholar
 T. Kohonen, SelfOrganizing Maps, Springer, 2001.
 P. Demartines and J. Hérault, “Curvilinear component analysis: a selforganizing neural network for nonlinear mapping of data sets,” IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 148–154, 1997. View at: Publisher Site  Google Scholar
 V. de Silva and J. B. Tenenbaum, “Sparse multidimensional scaling using landmark points,” Tech. Rep., Stanford University, 2004. View at: Google Scholar
 A. Talwalkar, S. Kumar, and H. Rowley, “Largescale manifold learning,” in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), pp. 1–8, 8, June 2008. View at: Publisher Site  Google Scholar
 M. H. Yang, “Face recognition using extended Isomap,” in Proceedings of the International Conference on Image Processing, vol. 2, pp. 117–120, 2002. View at: Publisher Site  Google Scholar
 J. Zhang, S. Z. Li, and J. Wang, “Nearest manifold approach for face recognition,” in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FGR '04), pp. 223–228, May 2004. View at: Google Scholar
 V. Silva and J. B. Tenenbaum, “Global versus local methods in nonlinear dimensionality reduction,” in Advances in Neural Information Processing Systems, vol. 15, pp. 705–712, 2002. View at: Google Scholar
 L. van der Maaten and G. Hinton, “Visualizing data using tSNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2625, 2008. View at: Google Scholar
 L. J. P. van der Maaten, E. O. Postma, and H. J. van den Herik, “Dimensionality reduction: a comparative review,” Tech. Rep., Maastricht University, 2009. View at: Google Scholar
 M. Andrecut, “Parallel GPU implementation of iterative PCA algorithms,” Journal of Computational Biology, vol. 16, no. 11, pp. 1593–1599, 2009. View at: Publisher Site  Google Scholar  MathSciNet
 F. Zhong, D. W. Capson, and D. C. Schuurman, “Parallel architecture for PCA image feature detection using FPGA,” in Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering (CCECE '08), pp. 1341–1344, May 2008. View at: Publisher Site  Google Scholar
 S.H. Bae, J. Y. Choi, J. Qiu, and G. C. Fox, “Dimension reduction and visualization of large highdimensional data via interpolation,” in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10), pp. 203–214, June 2010. View at: Publisher Site  Google Scholar
 S. Ingram, T. Munzner, and M. Olano, “Glimmer: multilevel MDS on the GPU,” IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 2, pp. 249–261, 2009. View at: Publisher Site  Google Scholar
 T. T. Yeh, T.Y. Chen, Y.C. Chen, and W.K. Shih, “Efficient parallel algorithm for nonlinear dimensionality reduction on GPU,” in Proceedings of the IEEE International Conference on Granular Computing, pp. 592–597, August 2010. View at: Publisher Site  Google Scholar
 J. A. Lee and M. Verleysen, Nonlinear Dimensionality Reduction, Information Science and Statistics, Springer, New York, NY, USA, 2007. View at: Publisher Site  MathSciNet
 S. Bergman, The Kernel Function and Conformal Mapping, American Mathematical Society, New York, NY, USA, 1950. View at: MathSciNet
 A. Fontanini, M. G. Olsen, and B. Ganapathysubramanian, “Thermal comparison between ceiling diffusers and fabric ductwork diffusers for green buildings,” Energy and Buildings, vol. 43, no. 11, pp. 2973–2987, 2011. View at: Publisher Site  Google Scholar
 Y. Xie, H. Hu, and B. Ganapathysubramanian, “Phase transitions in vortex shedding in the wake of a heated circular cylinder at low Reynolds number,” in Proceedings of the 3rd Annual Graduate Research Symposium, 2011. View at: Google Scholar
 O. Wodo, S. Tirthapura, S. Chaudhary, and B. Ganapathysubramanian, “A graphbased formulation for computational characterization of bulk heterojunction morphology,” Organic Electronics, vol. 13, no. 6, pp. 1105–1113, 2012. View at: Publisher Site  Google Scholar
 J. B. Kruskal and M. Wish, Multidimensional Scaling, Sage Publications, Beverly Hills, Calif, USA, 1978.
 G. H. Golub and C. F. van Loan, Matrix Computations, The John Hopkins University Press, 1996.
 B. Hariharan and S. Aluru, “Efficient parallel algorithms and software for compressed octrees with applications to hierarchical methods,” Parallel Computing, vol. 31, no. 34, pp. 311–331, 2005. View at: Publisher Site  Google Scholar
 M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003. View at: Publisher Site  Google Scholar
 L. K. Saul and S. T. Roweis, “Think globally, fit locally: unsupervised learning of low dimensional manifolds,” Journal of Machine Learning Research, vol. 4, no. 2, pp. 119–155, 2004. View at: Google Scholar  MathSciNet
 C. Lanczos, “An iteration method for the solution of the eigenvalue problem of linear differential and integral operators,” Journal of Research of the National Bureau of Standards, vol. 45, pp. 255–282, 1950. View at: Publisher Site  Google Scholar  MathSciNet
 Y. Saad, “Krylov subspace methods for solving large unsymmetric linear systems,” Mathematics of Computation, vol. 37, no. 155, pp. 105–126, 1981. View at: Publisher Site  Google Scholar  MathSciNet
 B. N. Parlett and W. G. Poole Jr., “A geometric theory for the QR, LU and power iterations,” SIAM Journal on Numerical Analysis, vol. 10, no. 2, pp. 389–412, 1973. View at: Google Scholar
 J. H. Wilkinson, Algebraic Eigenvalue Problem, Clarendon Press, Oxford, UK, 1965. View at: MathSciNet
 Q. Guo, D. Rajewski, E. Takle, and B. Ganapathysubramanian, “Constructing lowdimensional stochastic wind models from meteorology data,” Renewable Energy. Submitted. View at: Google Scholar
 B. Hendrickson and S. Plimpton, “Parallel manybody simulations without alltoall communication,” Journal of Parallel and Distributed Computing, vol. 27, no. 1, pp. 15–25, 1995. View at: Publisher Site  Google Scholar
 A. Sarje, J. Zola, and S. Aluru, “Accelerating pairwise computations on cell processors,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 1, pp. 69–77, 2011. View at: Publisher Site  Google Scholar
 M. Bernstein, V. de Silva, J. C. Langford, and J. B. Tenenbaum, “Graph approximations to geodesics on embedded manifolds,” Tech. Rep., Stanford University, 2000. View at: Google Scholar
 A. Crauser, K. Mehlhorn, U. Meyer, and P. Sanders, “A parallelization of Dijkstra's shortest path algorithm,” in Mathematical Foundations of Computer Science, vol. 1450 of Lecture Notes in Computer Science, pp. 722–731, Springer, Berlin, Germany, 1998. View at: Publisher Site  Google Scholar  MathSciNet
 K. Madduri, D. A. Bader, J. W. Berry, and J. R. Crobak, “An experimental study of a parallel shortest path algorithm for solving largescale graph instances,” in Proceedings of the Workshop on Algorithm Engineering and Experiments, pp. 23–35, 2007. View at: Google Scholar
 U. Meyer and P. Sanders, “$\delta $stepping: a parallel single source shortest path algorithm,” in Proceedings of the 6th Annual European Symposium on Algorithms, pp. 393–404, Venice, Italy, August 1998. View at: Google Scholar
 J. F. Jenq and S. Sahni, “All pairs shortest paths on a hypercube multiprocessor,” in Proceedings of the International Conference on Parallel Processing, pp. 713–716, August 1987. View at: Google Scholar
 A. Buluç, J. R. Gilbert, and C. Budak, “Solving path problems on the GPU,” Parallel Computing, vol. 36, no. 56, pp. 241–253, 2010. View at: Publisher Site  Google Scholar
 D. Delling, A. V. Goldberg, A. Nowatzyk, and R. F. Werneck, “PHAST: hardwareaccelerated shortest path trees,” Journal of Parallel and Distributed Computing, vol. 73, no. 7, pp. 940–952, 2013. View at: Publisher Site  Google Scholar
 G. J. Katz and J. T. Kider Jr., “Allpairs shortestpaths for large graphs on the GPU,” in Proceedings of the 23rd SIGGRAPH/Eurographics Conference on Graphics Hardware, pp. 47–55, June 2008. View at: Google Scholar
 P. A. Humblet, “Another adaptive distributed shortest path algorithm,” IEEE Transactions on Communications, vol. 39, no. 6, pp. 995–1003, 1991. View at: Publisher Site  Google Scholar
 S. Balay, J. Brown, K. Buschelman et al., PETSc Users Manual Revision 3.5, 2010.
 E. Breitmoser and A. Sunderland, “An overview of eigensolvers for HPCx,” Tech. Rep., HPCx Consortium, 2003. View at: Google Scholar
 J. Choi, J. Demmel, I. Dhillon et al., “ScaLAPACK: a portable linear algebra library for distributed memory computers—design issues and performance,” Computer Physics Communications, vol. 97, no. 12, pp. 1–15, 1996. View at: Publisher Site  Google Scholar
 I. S. Dhillon, B. N. Parlett, and C. Vömel, “The design and implementation of the MRRR algorithm,” ACM Transactions on Mathematical Software, vol. 32, no. 4, pp. 533–560, 2006. View at: Publisher Site  Google Scholar  MathSciNet
 V. Hernandez, J. E. Roman, and V. Vidal, “SLEPc: a scalable and flexible toolkit for the solution of eigenvalue problems,” ACM Transactions on Mathematical Software, vol. 31, no. 3, pp. 351–362, 2005. View at: Publisher Site  Google Scholar  MathSciNet
 J. Poulson, B. Marker, R. A. van de Geijn, J. R. Hammond, and N. A. Romero, “Elemental: a new framework for distributed memory dense matrix computations,” ACM Transactions on Mathematical Software, vol. 39, no. 2, pp. 1–24, 2013. View at: Publisher Site  Google Scholar  MathSciNet
 T. Auckenthaler, H.J. Bungartz, T. Huckle, L. Krämer, B. Lang, and P. Willems, “Developing algorithms and software for the parallel solution of the symmetric eigenvalue problem,” Journal of Computational Science, vol. 2, no. 3, pp. 272–278, 2011. View at: Publisher Site  Google Scholar
 K. Bryan and T. Leise, “The $25,000,000,000 eigenvector: the linear algebra behind Google,” SIAM Review, vol. 48, no. 3, pp. 569–581, 2006. View at: Publisher Site  Google Scholar  MathSciNet
 K. Meerbergen, A. Spence, and D. Roose, “Shiftinvert and Cayley transforms for detection of rightmost eigenvalues of nonsymmetric matrices,” BIT Numerical Mathematics, vol. 34, no. 3, pp. 409–423, 1994. View at: Publisher Site  Google Scholar  MathSciNet
 D. S. Katz, S. Callaghan, R. Harkness et al., “Science on the TeraGrid,” Computational Methods in Science and Technology, vol. 1, pp. 81–97, 2010. View at: Publisher Site  Google Scholar
 L.M. Chen, Z. Hong, G. Li, and Y. Yang, “Recent progress in polymer solar cells: Manipulation of polymer: fullerene morphology and the formation of efficient inverted polymer solar cells,” Advanced Materials, vol. 21, no. 1415, pp. 1434–1449, 2009. View at: Publisher Site  Google Scholar
 C. Deibel, V. Dyakonov, and C. J. Brabec, “Organic bulkheterojunction solar cells,” IEEE Journal on Selected Topics in Quantum Electronics, vol. 16, no. 6, pp. 1517–1527, 2010. View at: Publisher Site  Google Scholar
 H. Hoppe and N. S. Sariciftci, “Morphology of polymer/fullerene bulk heterojunction solar cells,” Journal of Materials Chemistry, vol. 16, no. 1, pp. 45–61, 2006. View at: Publisher Site  Google Scholar
 S. H. Park, A. Roy, S. Beaupré et al., “Bulk heterojunction solar cells with internal quantum efficiency approaching 100%,” Nature Photonics, vol. 3, no. 5, pp. 297–303, 2009. View at: Publisher Site  Google Scholar
 J. Peet, A. J. Heeger, and G. C. Bazan, “‘Plastic’ solar cells: selfassembly of bulk heterojunction nanomaterials by spontaneous phase separation,” Accounts of Chemical Research, vol. 42, no. 11, pp. 1700–1708, 2009. View at: Publisher Site  Google Scholar
 S. E. Shaheen, C. J. Brabec, N. S. Sariciftci, F. Padinger, T. Fromherz, and J. C. Hummelen, “2.5% efficient organic plastic solar cells,” Applied Physics Letters, vol. 78, no. 6, pp. 841–843, 2001. View at: Publisher Site  Google Scholar
 O. Wodo and B. Ganapathysubramanian, “Computationally efficient solution to the CahnHilliard equation: adaptive implicit time schemes, mesh sensitivity analysis and the 3D isoperimetric problem,” Journal of Computational Physics, vol. 230, no. 15, pp. 6037–6060, 2011. View at: Publisher Site  Google Scholar  MathSciNet
 O. Wodo and B. Ganapathysubramanian, “Modeling morphology evolution during solventbased fabrication of organic solar cells,” Computational Materials Science, vol. 55, pp. 113–126, 2012. View at: Publisher Site  Google Scholar
 D. C. Coffey and D. S. Ginger, “Patterning phase separation in polymer films with dippen nanolithography,” Journal of the American Chemical Society, vol. 127, no. 13, pp. 4564–4565, 2005. View at: Publisher Site  Google Scholar
Copyright
Copyright © 2015 Sai Kiranmayee Samudrala et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.