Scientific Programming

Volume 2015, Article ID 180214, 12 pages

http://dx.doi.org/10.1155/2015/180214

## Parallel Framework for Dimensionality Reduction of Large-Scale Datasets

^{1}Department of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA 30080, USA

^{2}Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY 14620, USA

^{3}Department of Biomedical Informatics, University at Buffalo, Buffalo, NY 14620, USA

^{4}School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA

^{5}Department of Mechanical Engineering, Iowa State University, Ames, IA 50011, USA

Received 9 March 2013; Accepted 1 August 2013

Academic Editor: Boleslaw Szymanski

Copyright © 2015 Sai Kiranmayee Samudrala et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Dimensionality reduction refers to a set of mathematical techniques used to reduce the complexity of the original high-dimensional data while preserving selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify key components underlying the spectral dimensionality reduction techniques and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate the applicability of our framework, we perform dimensionality reduction of 75,000 images representing morphology evolution during manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.

#### 1. Introduction

Computational analysis of high-dimensional data continues to be a challenging problem, spurring the development of numerous computational techniques. An important and emerging class of methods for dealing with such data is dimensionality reduction. In many applications, features of interest can be preserved while mapping the high-dimensional data to a small number of dimensions. These mappings include popular techniques such as principal component analysis (PCA) [1] and complex nonlinear maps such as Isomap [2] and kernel PCA [3].

Linear manifold learning techniques, for example, PCA or multidimensional scaling [4–7], have existed as orthogonalization methods for several decades. Nonlinear methods such as Isomap, LLE (locally linear embedding) [8], and Hessian LLE [9] were developed more recently. Another class of methods that emerged in the past few years consists of unsupervised learning techniques, including artificial neural networks for Sammon’s nonlinear mapping [10], Kohonen’s self-organizing maps (SOM) [11], and curvilinear component analysis [12]. Efforts have also focused on modifying existing manifold learning algorithms to improve either their efficiency or their performance [13–16]. For example, Landmark Isomap [17] modifies the original Isomap method to extend its use to larger datasets by picking a few representative points and applying the Isomap technique to them. Along with the emergence of new manifold learning techniques, different sequential implementations of these techniques, targeting various hardware platforms and programming languages, have been developed [18, 19].

Dimensionality reduction techniques are often compute-intensive and do not easily scale to large datasets. Recent advances in high-throughput measurements using physical devices, such as sensors, and in complex numerical simulations are generating data of extremely high dimensionality. It is becoming increasingly difficult to process such data sequentially.

In this paper, we propose a parallel framework for dimensionality reduction. Rather than focusing on a particular method, we consider the class of spectral dimensionality reduction methods. To date, few efforts have been made to develop parallel implementations of these methods, other than a parallel version of PCA [20, 21], parallelization of multidimensional scaling (MDS) for genomic data [22], and algorithm development for GPU platforms [23, 24].

We perform a systematic analysis of spectral dimensionality reduction techniques and provide a unified view of them that designers of dimensionality reduction algorithms can exploit. We identify common computational building blocks required for implementing spectral dimensionality reduction methods and use these abstractions to derive a common parallel framework. We implement such a framework and show that it can handle large datasets and scales to thousands of processors. We demonstrate the advantages of our software by analyzing 75,000 images of morphology evolution during manufacturing of organic solar cells, which enables us to visually inspect and correlate fabrication parameters with morphology.

The remainder of this paper is organized as follows. In Section 2 we introduce the dimensionality reduction problem and describe basic spectral dimensionality reduction techniques, highlighting their computational kernels. In Section 3 we provide a detailed description of our parallel framework including algorithmic solutions. Finally, in Section 4 we present experimental results, and we conclude the paper in Section 5.

#### 2. Definitions and Methods Overview

The problem of dimensionality reduction can be formulated as follows. Consider a set $X = \{x_1, x_2, \ldots, x_n\}$ of $n$ points, where $x_i \in \mathbb{R}^D$ and $D \gg 1$. We are interested in finding a set $Y = \{y_1, y_2, \ldots, y_n\}$ such that $y_i \in \mathbb{R}^d$, $d \ll D$, and $\|x_i - x_j\|_h = \|y_i - y_j\|_h$ for all $i, j$. Here, $\|\cdot\|_h$ denotes a specific norm that captures the properties we want to preserve during dimensionality reduction [25]. For instance, by defining $\|\cdot\|_h$ as the Euclidean norm we preserve Euclidean distance, thus obtaining a reduction equivalent to the standard technique of principal component analysis (PCA) [1]. Similarly, defining $\|\cdot\|_h$ to be the angular distance (or conformal distance [26]) results in locally linear embedding (LLE) [8], which preserves local angles between points. In a typical application [27, 28], $x_i$ represents a state of the analyzed system, for example, a temperature field or a concentration distribution. Such a state description can be derived from experimental sensor data or can be the result of a numerical simulation. However, irrespective of the source, it is characterized by high dimensionality; that is, $D$ is typically very large [29]. While $x_i$ represents just a single state of the system, common data acquisition setups deliver large collections of such observations, which correspond to the temporal or parametric evolution of the system [27]. Thus, the cardinality $n$ of the resulting set $X$ is usually large. Intuitively, information obfuscation increases with data dimensionality. Therefore, in the process of dimensionality reduction (DR) we seek as small a dimension $d$ as possible, given the constraints induced by the norm $\|\cdot\|_h$ [25]. Routinely, $d \le 3$, as it permits, for instance, visualization of the set $Y$.
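The distance-preservation condition in this formulation can be checked numerically for the Euclidean case. The following is a minimal sketch, not the implementation described in this paper: it assumes NumPy, uses hypothetical helper names (`pairwise_distances`, `pca_reduce`), and builds a synthetic point set that lies exactly in a low-dimensional subspace so that PCA can preserve all pairwise distances.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance between every pair of rows of P."""
    return np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)

def pca_reduce(X, d):
    """Project the rows of X onto their d leading principal directions."""
    Xc = X - X.mean(axis=0)                        # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                           # n x d coordinates

# Synthetic data (an assumption for illustration): n points in R^D that
# lie exactly in a d-dimensional linear subspace.
rng = np.random.default_rng(0)
n, D, d = 200, 50, 2
latent = rng.normal(size=(n, d))
basis, _ = np.linalg.qr(rng.normal(size=(D, d)))   # D x d, orthonormal columns
X = latent @ basis.T                               # high-dimensional set
Y = pca_reduce(X, d)                               # low-dimensional set

# Largest violation of ||x_i - x_j|| = ||y_i - y_j|| over all pairs.
max_err = np.max(np.abs(pairwise_distances(X) - pairwise_distances(Y)))
print(f"largest pairwise distance mismatch: {max_err:.2e}")
```

For data that only approximately lies in a subspace, the mismatch would be nonzero and would shrink as more dimensions are retained.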

DR techniques have been extensively researched over the last decade [25]. In particular, methods based on spectral data decomposition have been very successful [1, 2, 9] and have been widely adopted. Early approaches in this category exploited simple linear structure of the data, for example, PCA or multidimensional scaling (MDS) [30]. More recently, techniques that can unravel complex nonlinear structures in the data, for example, Isomap [2], LLE, and kernel PCA [3], have been developed. While all these methods have been proposed taking into account specific applications [19, 25], their underlying formulations share similar algorithmic mechanisms. In what follows we provide a more detailed overview of spectral DR techniques and we identify their common computational kernels that form the basis for our parallel framework.

##### 2.1. Spectral Dimensionality Reduction

The goal of DR is to identify a low-dimensional representation $Y$ of the original dataset $X$ that preserves certain predefined properties. The key idea underpinning spectral DR can be explained as follows. We encode the desired information about $X$, that is, topology or distance, in its entirety by considering all pairs of points in $X$. This encoding is represented as an $n \times n$ matrix $W$. Next, we subject matrix $W$ to a unitary transformation $V$, that is, a transformation that preserves the norm of $W$, to obtain its sparsest form $\Lambda$, where $W = V \Lambda V^{T}$. Here, $\Lambda$ is a diagonal matrix with rapidly diminishing entries. As a result, it is sufficient to consider only $d$ entries of $\Lambda$ to capture all the information encoded in $W$. These $d$ entries constitute the set $Y$. The above procedure hinges on the fact that unitary transformations preserve the original properties of $W$ [31]. Note also that it requires a method to construct matrix $W$ in the first place. Indeed, what differentiates various spectral methods is the way information is encoded in $W$.
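The decomposition and truncation step can be illustrated on a small symmetric matrix. This is a sketch under stated assumptions rather than the paper's code: the matrix (named `W` here) is a hypothetical Gram matrix of nearly three-dimensional points, the function name `spectral_truncate` is invented for illustration, and the scaling of eigenvectors by the square roots of the eigenvalues follows the classical MDS convention, which the text above does not specify.

```python
import numpy as np

def spectral_truncate(W, d):
    """Return the d leading eigenpairs of symmetric W, plus all eigenvalues."""
    evals, evecs = np.linalg.eigh(W)          # eigh returns ascending order
    order = np.argsort(evals)[::-1]           # reorder to descending
    evals, evecs = evals[order], evecs[:, order]
    return evals[:d], evecs[:, :d], evals

# Hypothetical W: Gram matrix of centered points that are almost 3-dimensional.
rng = np.random.default_rng(1)
n, d = 100, 3
P = rng.normal(size=(n, d)) @ np.diag([5.0, 3.0, 2.0])
P = np.hstack([P, 1e-3 * rng.normal(size=(n, 7))])   # small extra components
P -= P.mean(axis=0)
W = P @ P.T

lam_d, V_d, lam_all = spectral_truncate(W, d)
W_d = (V_d * lam_d) @ V_d.T                   # rank-d part of V Lambda V^T
rel_err = np.linalg.norm(W - W_d) / np.linalg.norm(W)
captured = lam_d.sum() / lam_all.sum()        # share of the trace retained
Y = V_d * np.sqrt(lam_d)                      # n x d coordinates (assumed scaling)

print(f"relative truncation error: {rel_err:.2e}, trace captured: {captured:.6f}")
```

Because the eigenvalues decay rapidly in this example, keeping only the `d` leading ones reproduces `W` almost exactly, which is the property the spectral approach relies on.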

We summarize the general idea of spectral DR in Algorithm 1. In the first four steps we construct the matrix $W$. As indicated, this matrix encodes information about the property that we wish to preserve in the process of DR. To obtain $W$ we first identify the $k$ nearest neighbors (NN) of each point $x_i \in X$. Note that currently all studied methods use NN defined in Euclidean space. This enables us to define a weighted graph $G$ that encapsulates both distance and topological properties of the set $X$. Given graph $G$, we can construct a function $f$ to isolate the desired property. For instance, consider the Isomap algorithm, in which the geodesic distance is maintained. In this case, $f$ returns the length of the shortest path between $x_i$ and $x_j$ in $G$. Note that for some methods $f$ is very simple; for example, for PCA it is equivalent to a simple distance measure between $x_i$ and $x_j$, while for other methods $f$ can be more involved. Differences between various DR methods and their corresponding function $f$ are outlined in Table 1. The property extracted by function $f$ is stored in an auxiliary matrix $A$, which is next normalized to obtain matrix $W$. This process of normalization is a simple algebraic transformation, which ensures that $W$ is centered and hence that the final low-dimensional set of points $Y$ contains the origin and is not an affine translation [31]. Subsequently, $W$ is spectrally decomposed into its eigenvectors $V$ and eigenvalues $\Lambda$, the latter constituting the sparsest representation of $W$. The resulting eigenvectors and eigenvalues are then postprocessed to extract the set $Y$ of low-dimensional points.
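The sequence of steps described above (neighbor search, graph construction, property extraction, normalization, spectral decomposition, postprocessing) can be sketched sequentially for the Isomap case. This is an illustrative, dense, single-process sketch and not the parallel framework proposed in this paper. All function names are hypothetical, the normalization is assumed to be the double centering of squared distances used in classical MDS, the neighbor graph is assumed to be connected, and the input points are assumed to be distinct.

```python
import numpy as np

def knn_graph(X, k):
    """Weighted k-NN graph as a dense matrix; np.inf marks a missing edge."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:k + 1]   # skip the point itself
        G[i, nbrs] = dist[i, nbrs]
        G[nbrs, i] = dist[i, nbrs]            # keep the graph symmetric
    return G

def shortest_paths(G):
    """All-pairs shortest paths (Floyd-Warshall): the Isomap choice of f."""
    A = G.copy()
    for m in range(A.shape[0]):
        A = np.minimum(A, A[:, m:m + 1] + A[m:m + 1, :])
    return A

def normalize(A):
    """Double centering of squared distances (assumed normalization)."""
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * H @ (A ** 2) @ H

def spectral_dr(X, k, d):
    """k-NN graph -> f -> normalization -> eigendecomposition -> coordinates."""
    W = normalize(shortest_paths(knn_graph(X, k)))
    evals, evecs = np.linalg.eigh(W)
    top = np.argsort(evals)[::-1][:d]         # d largest eigenvalues
    return evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))

# Usage: points sampled along a helix in R^3 are unrolled onto a line.
t = np.linspace(0.0, 3.0 * np.pi, 120)
X = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])
Y = spectral_dr(X, k=4, d=1)
print(Y[:3, 0], Y[-3:, 0])
```

The one-dimensional coordinates follow the position along the helix, up to an arbitrary sign. The dense neighbor search and the Floyd-Warshall loop are cubic or quadratic in the number of points, which hints at why these kernels become the bottleneck for large datasets.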