Abstract

Isomap is a widely used nonlinear method for dimensionality reduction, and Landmark-Isomap (L-Isomap) was proposed to improve its scalability. In this paper, we focus on two important issues that were not taken into account in L-Isomap: landmark point selection and topological stability. First, we present a novel landmark point selection method. It uses a greedy strategy to select some points as landmark candidates and then removes the candidates that are neighbours of other candidates; the remaining candidates are the landmark points. This selection method improves computational efficiency without sacrificing accuracy. For topological stability, we define an edge density for each edge in the neighbourhood graph. Based on the geometric characteristics of short-circuit edges, we provide a method to eliminate them without breaking the integrity of the data. The approach that integrates L-Isomap with these two improvements is referred to as Robust L-Isomap (RL-Isomap). The effectiveness of RL-Isomap is confirmed through several numerical experiments.

1. Introduction

Real-world data such as voice signals, gene microarrays, or hyperspectral imagery usually have high dimensionality, which makes them difficult to analyze. This is known as the “curse of dimensionality” [1]. In order to analyze and process high-dimensional real-world data adequately, dimensionality reduction has attracted significant interest. Real-world data are generally found to lie on a low-dimensional manifold embedded in the high-dimensional observation space. Dimensionality reduction is the transformation of high-dimensional data into a meaningful representation of reduced dimensionality. It removes redundant information from the original data and alleviates the “curse of dimensionality” problem.

In the last few decades, many dimensionality reduction algorithms have been proposed. Principal component analysis (PCA) [2] and multidimensional scaling (MDS) [3] are traditional linear methods; however, they fail to deal with complex nonlinear data. A number of nonlinear dimensionality reduction methods are now available, for example, isometric feature mapping (Isomap) [4], local linear embedding (LLE) [5], and local tangent space alignment (LTSA) [6]. Among these, Isomap is representative of the global manifold learning methods, which attempt to preserve the global geometric features of the data set in the embedding space. It substitutes geodesic distance for the Euclidean distance used in MDS. Because of the MDS and geodesic distance computations, Isomap becomes very time-consuming as the amount of input data increases. To improve the scalability of Isomap, L-Isomap was proposed by de Silva and Tenenbaum [7], in which the time-consuming computation is performed on a subset of points referred to as landmark points. However, two main problems still exist with L-Isomap.

The first problem is how to select landmarks for L-Isomap. Several landmark selection methods have been proposed. In [7], landmarks are selected randomly from the given data. An interesting approach called Fast-Isomap, based on integer optimization, was proposed in [8]. In [9], landmarks are chosen from an approximately central part of the given data. In this paper, we propose a novel method for landmark point selection. First, the method selects some points as candidates using a greedy strategy. It then discards the candidates that are neighbours of other candidates, and the remaining candidates are taken as the landmark points. Experiments show that our method is more efficient without sacrificing accuracy.

Another critical problem of the L-Isomap algorithm is its topological instability [10], which is caused by short-circuit edges [11]. In order to compute the geodesic distance, a neighbourhood graph is constructed by connecting every data point with its $k$-nearest neighbours. The short-circuiting problem occurs when an oversized neighbourhood is chosen, so that the neighbourhood distance is larger than the distance between folds of the manifold. Short-circuit edges severely damage the manifold structure; even a single short-circuit edge may produce many errors in the computed geodesic distances, which in turn severely impairs the performance of L-Isomap. A simple way to avoid short-circuit edges is to decrease the neighbourhood size, but a very small neighbourhood size will fragment the manifold into a large number of disconnected regions [10]. Thus, choosing an appropriate neighbourhood size requires a priori information about the global geometry of the high-dimensional manifold, which is difficult to know beforehand [4, 10]. H. Choi and S. Choi [12] addressed the short-circuiting problem by removing data points with extremely large total flows when computing the shortest paths. In this paper, we define an edge density for every edge in the neighbourhood graph using multivariate kernel density estimation (KDE) and propose a method to make L-Isomap robust to short-circuit edges. Unlike the former method, which deletes abnormal points, our method targets the short-circuit edges themselves. Because the regions that short-circuit edges pass through contain very few data points, an edge with an extraordinarily low edge density is identified as a short-circuit edge, and our method removes such edges from the neighbourhood graph. Numerical experiments on several data sets demonstrate the effectiveness of our method.

This paper is organized as follows. Section 2 briefly reviews the Isomap and L-Isomap algorithms. A novel landmark selection method is given in Section 3. Section 4 presents a method for eliminating short-circuit edges. Experimental results on several data sets are given in Section 5 to validate the performance of RL-Isomap. Finally, conclusions and future extensions are discussed in Section 6.

2. Isomap and L-Isomap

For given input data $X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^D$ with $N$ samples and $D$ dimensions, Isomap attempts to embed the samples into a lower-dimensional space $\mathbb{R}^d$ ($d < D$) while preserving the geodesic distances between all input points as faithfully as possible. Isomap first constructs a weighted undirected neighbourhood graph $G$, where each node $v_i$, corresponding to the point $x_i$, is connected with its $k$-nearest neighbours, and each edge $(v_i, v_j)$ is assigned a weight equal to the Euclidean distance between points $x_i$ and $x_j$. Second, Isomap computes the shortest path between every two points in the graph to approximate the geodesic distance, using Dijkstra’s or Floyd’s shortest-path algorithm [13, 14]. The geodesic distances between all pairs of data points form the geodesic distance matrix $D_G$. Finally, Isomap applies MDS to the matrix $D_G$ to find the low-dimensional embedding.
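To make these three steps concrete, the following minimal sketch (illustrative only, not the original implementation) reproduces them with standard scikit-learn and SciPy routines; the parameter values and the explicitly written classical-MDS step are our own choices, and the sketch assumes the neighbourhood graph is connected.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, k=7, d=2):
    """Minimal Isomap sketch: k-NN graph -> geodesic distances -> classical MDS."""
    # Step 1: weighted k-NN graph whose edge weights are Euclidean distances.
    G = kneighbors_graph(X, n_neighbors=k, mode='distance')
    # Step 2: approximate geodesic distances by graph shortest paths (Dijkstra).
    D_G = shortest_path(G, method='D', directed=False)
    # Step 3: classical MDS on the geodesic distance matrix D_G.
    N = X.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N              # centering matrix
    B = -0.5 * J @ (D_G ** 2) @ J                    # double-centred squared distances
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:d]                    # d largest eigenpairs
    return V[:, top] * np.sqrt(np.maximum(w[top], 0))
```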

So far, Isomap has been successfully applied in many different fields. Unfortunately, when the number of input points $N$ is large, Isomap becomes too time-consuming because of the shortest-path construction ($O(kN^2 \log N)$ with Dijkstra’s algorithm) and the MDS eigenvalue decomposition ($O(N^3)$). In order to speed up these two computations, L-Isomap was proposed. L-Isomap randomly selects $n$ points ($n \ll N$) from $X$, denoted as landmark points. Instead of computing the shortest paths between all data points, L-Isomap only computes the shortest paths from each data point to the landmark points. Classical MDS is then applied to the resulting landmark geodesic distance matrix to find the low-dimensional embedding of the landmark points. The embeddings of the remaining points are obtained by a fixed linear transformation of their geodesic distances to the landmark points (landmark MDS, LMDS). In this way, the time complexities of the shortest-path and MDS computations are reduced to $O(knN \log N)$ and $O(n^2 N)$, respectively.
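The landmark step can be sketched as follows, under the assumption that the geodesic distance matrix from the $n$ landmarks to all $N$ points has already been computed and that its first $n$ columns correspond to the landmarks themselves; the triangulation of the non-landmark points follows de Silva and Tenenbaum’s distance-based formula, but the function and variable names are ours.

```python
import numpy as np

def landmark_mds(D_ln, d=2):
    """Landmark MDS sketch.
    D_ln : (n, N) geodesic distances from the n landmarks to all N points,
           with the first n columns assumed to be the landmarks themselves."""
    n = D_ln.shape[0]
    D_ll = D_ln[:, :n]                                  # landmark-to-landmark distances
    # Classical MDS on the landmarks only.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D_ll ** 2) @ J
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:d]
    L = V[:, top] * np.sqrt(np.maximum(w[top], 0))      # (n, d) landmark embedding
    # Fixed linear map (distance-based triangulation) for every point.
    L_pinv_T = np.linalg.pinv(L).T                      # (n, d)
    delta_mean = (D_ll ** 2).mean(axis=1, keepdims=True)
    return -0.5 * ((D_ln ** 2) - delta_mean).T @ L_pinv_T   # (N, d) embedding
```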

3. Landmark Selection

3.1. Landmark Candidate

A very important step of L-Isomap is to build the $k$-nearest-neighbour graph. However, many neighbourhoods are similar because they share common points, so some neighbourhoods can be removed to obtain a simpler neighbourhood graph. Based on this idea, we select the candidate landmark points by simplifying the neighbourhood graph. Formally, let $\mathcal{F} = \{S_1, S_2, \ldots, S_N\}$ be the collection of neighbourhood sets, one for each point $x_i$, where $S_i$ consists of $x_i$ and its $k$-nearest neighbours, and each set $S_i$ is assigned a nonnegative cost $c(S_i)$. The goal is to find a set cover $\mathcal{C} \subseteq \mathcal{F}$ satisfying $\bigcup_{S \in \mathcal{C}} S = X$ and minimizing $\sum_{S \in \mathcal{C}} c(S)$. Once the cover $\mathcal{C}$ is obtained, the corresponding landmark candidates are determined. The problem can be solved by a greedy strategy. Let $M_j$ be the number of uncovered points in $S_j$; the ratio $M_j / c(S_j)$ counts the number of points covered by $S_j$ per unit cost, and the likelihood of including $S_j$ in $\mathcal{C}$ increases with this ratio. The landmark candidate selection problem is an unweighted case, where $c(S_j) = 1$ for all $j$. A sketch of the landmark candidate selection method is summarized as Algorithm 1.

(1) Let $\mathcal{C} = \emptyset$.
(2) If $S_j = \emptyset$ for all $j$, then stop: $\mathcal{C}$ is the cover. Otherwise find a subscript $j$,
maximizing the ratio $M_j / c(S_j)$, and proceed to Step (3).
(3) Add $S_j$ to $\mathcal{C}$, replace each $S_i$ by $S_i \setminus S_j$, and return to Step (2).

Algorithm 1 runs in time $O(\sum_{i=1}^{N} |S_i|)$ and achieves an approximation ratio of $H(s) = \sum_{i=1}^{s} 1/i$, where $s$ is the size of the largest set in $\mathcal{F}$ [15, 16]. We apply Algorithm 1 to obtain the neighbourhood subset $\mathcal{C}$ and the corresponding landmark candidate set $C$, where each candidate corresponds to one neighbourhood in $\mathcal{C}$.
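A minimal Python sketch of this candidate selection step, assuming the unweighted case ($c(S_j)=1$): each $S_i$ is the index set containing $x_i$ and its $k$-nearest neighbours, and the greedy rule repeatedly picks the neighbourhood covering the most uncovered points. Function and variable names are illustrative, and the direct re-scan in every iteration omits the bookkeeping that gives the $O(\sum_i |S_i|)$ bound.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_landmark_candidates(X, k=7):
    """Greedy set-cover sketch of Algorithm 1 (unweighted case)."""
    N = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)                  # row i holds x_i itself plus its k nearest neighbours
    sets = [set(row) for row in idx]             # the sets S_i as index sets
    uncovered = set(range(N))
    candidates = []
    while uncovered:
        # Greedy choice: the neighbourhood covering the most still-uncovered points.
        j = max(range(N), key=lambda i: len(sets[i] & uncovered))
        candidates.append(j)                     # x_j becomes a landmark candidate
        uncovered -= sets[j]
    return candidates, sets
```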

3.2. Landmark Selection Algorithm

The landmark candidates may be neighbours of each other, and some of them are therefore unnecessary [17]. Thus we can further prune the set of landmark candidates to obtain landmark points that are mutually non-neighbouring. Formally, for each candidate $c_i \in C$, let $R(c_i)$ denote the set of points whose neighbourhood includes $c_i$. The landmark point selection method is summarized as Algorithm 2 (a code sketch follows the algorithm).

(1) Let $L = C$.
(2) for each $c_i \in L$:
if some $c_j \in L$ ($j \neq i$) is a neighbour of $c_i$, then for all such $c_j$, delete $c_j$ from $L$,
end
(3) $L$ is the landmark point set.
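A hedged Python sketch of one plausible reading of Algorithm 2: the candidates are scanned in the order in which the greedy procedure produced them, and a candidate is kept only if it is not a neighbour of any candidate kept so far. The exact bookkeeping in the original algorithm may differ; `sets` is the list of neighbourhood index sets from the previous sketch.

```python
def refine_landmarks(candidates, sets):
    """Keep only mutually non-neighbouring candidates (one plausible reading of Algorithm 2)."""
    landmarks = []
    for c in candidates:
        # keep c only if it is neither in the neighbourhood of any kept landmark
        # nor has a kept landmark inside its own neighbourhood
        if all(c not in sets[l] and l not in sets[c] for l in landmarks):
            landmarks.append(c)
    return landmarks
```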

In order to test the effectiveness of our landmark selection method, we run RL-Isomap and three other methods on Swiss roll data, where the amount of input data varies from 500 to 7500. Tenenbaum et al. [4] used the residual variance to characterize how well the embedding result preserves the geometric features of the high-dimensional manifold: the smaller the residual variance, the better the embedding. As shown in Figure 1, EL-Isomap has the worst performance of the four methods in this experiment because it always tends to select landmarks situated around the circumcenter [9]. The random scheme performs well but is not stable enough because of its random, unpredictable strategy. Fast-Isomap has performance similar to RL-Isomap; however, the number of landmark points selected by RL-Isomap is much smaller than that of Fast-Isomap, as shown in Figure 2(a). In this experiment, Fast-Isomap selects on average 48.55% more landmark points than RL-Isomap. As a result, RL-Isomap is faster than Fast-Isomap when computing the low-dimensional embedding; as shown in Figure 2(b), RL-Isomap is on average 33.56% faster than Fast-Isomap. Thus RL-Isomap performs best in both speed and accuracy in this experiment.
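For reference, the residual variance of Tenenbaum et al. [4] is one minus the squared correlation between the graph geodesic distances and the pairwise Euclidean distances in the embedding; a minimal sketch (names are ours):

```python
import numpy as np
from scipy.spatial.distance import pdist

def residual_variance(D_geodesic, Y):
    """1 - R^2 between geodesic distances and pairwise distances in the embedding Y."""
    iu = np.triu_indices_from(D_geodesic, k=1)
    d_graph = D_geodesic[iu]          # upper-triangular geodesic distances
    d_embed = pdist(Y)                # same pair ordering as the upper triangle
    r = np.corrcoef(d_graph, d_embed)[0, 1]
    return 1.0 - r ** 2
```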

4. Robust L-Isomap

4.1. Short-Circuit Edge Elimination

As pointed out in Section 1, L-Isomap suffers from topological instability because of short-circuit edges. An oversized neighbourhood may produce short-circuit edges that destroy the manifold structure. As shown in Figure 3, a short-circuit edge, denoted by a black solid line, directly links two points that are supposed to be very far apart on the manifold. A simple way to avoid these short-circuit edges is to decrease the neighbourhood size; however, a neighbourhood that is too small will break the connectivity of the neighbourhood graph, as shown in Figure 4. Thus it is not easy to choose an appropriate neighbourhood size. The previous method [12] deletes some outliers to mitigate the short-circuiting problem, but this breaks the integrity of the data set. In this section, we present a novel method that directly deletes the short-circuit edges from the neighbourhood graph.

As shown in Figure 3, the regions that short-circuit edges pass through usually contain very few data points. Based on this fact, we regard an edge that passes through a region of extremely low data density as a short-circuit edge. In order to quantify the data density of the region an edge passes through, we introduce an edge density for each edge in the neighbourhood graph using the KDE method.

KDE is a nonparametric tool for estimating the distribution of data [18]. The multivariate KDE is a direct extension of the univariate estimator. For a $D$-variate random sample $x_1, x_2, \ldots, x_n$ drawn from a density $f$, the $D$-dimensional KDE is defined as $$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i),$$ where $H$ is a symmetric positive definite $D \times D$ matrix called the bandwidth matrix and $$K_H(x) = |H|^{-1/2} K\bigl(H^{-1/2} x\bigr),$$ where $|H|$ is the determinant of $H$ and $K$ is a $D$-dimensional kernel function that satisfies $\int K(x)\,dx = 1$; $K_H$ is shorthand for the scaled kernel above.
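A direct transcription of this estimator into Python is shown below, assuming a Gaussian kernel; the Cholesky factor of $H$ is used in place of $H^{1/2}$, which leaves the quadratic form $(x-x_i)^{T}H^{-1}(x-x_i)$ unchanged. Names are illustrative.

```python
import numpy as np

def kde(x, samples, H):
    """Multivariate Gaussian KDE at point x, given the sample array (n, D)
    and a symmetric positive definite bandwidth matrix H."""
    n, D = samples.shape
    L = np.linalg.cholesky(H)                      # H = L L^T
    u = np.linalg.solve(L, (x - samples).T).T      # rows satisfy u_i^T u_i = (x - x_i)^T H^{-1} (x - x_i)
    kernel = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2 * np.pi) ** (D / 2)
    return kernel.sum() / (n * np.sqrt(np.linalg.det(H)))
```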

However, KDE is a linear (Euclidean) method, so it cannot be applied directly to complex nonlinear data. In mathematics, a manifold is a topological space that locally resembles Euclidean space near each point; more precisely, each point of a $d$-dimensional manifold has a neighbourhood that is homeomorphic to $d$-dimensional Euclidean space [5]. Therefore, the points within the same neighbourhood can be regarded as having an approximately linear structure. For the input data $X$, L-Isomap uses the $k$-NN rule to build the neighbourhood graph $G$ with edge set $E$, where $N(x_i)$ is the neighbourhood of point $x_i$ and $e_{ij} \in E$ denotes the edge connecting points $x_i$ and $x_j$. Let $\Omega_i = \{x_i\} \cup N(x_i)$; according to the local linearity property of the manifold, the data points in $\Omega_i$ have a nearly linear structure. In this way, we define the point density of the manifold at point $x_i$ by applying the KDE above to the points in $\Omega_i$, where $|\Omega_i|$ denotes the number of points included in $\Omega_i$.

Given an edge $e_{ij} \in E$, the quartile points of the segment between $x_i$ and $x_j$ can be calculated as follows:

After that, we have

Then let

We define the edge density $\rho(e_{ij})$ of each edge $e_{ij} \in E$ as follows:

In this paper, we take the standard $D$-variate normal density $K(x) = (2\pi)^{-D/2}\exp(-\tfrac{1}{2}x^{T}x)$ as the kernel function. The choice of the bandwidth matrix $H$ is a difficult problem in KDE [19]. In the univariate case, too large a bandwidth makes the estimate too smooth and may hide structural features, whereas too small a bandwidth yields a “wiggly” estimate and shows spurious features [20]. In the multivariate case, the choice of the bandwidth matrix faces the same dilemma. Fortunately, our experiments show that the choice of $H$ has little effect on the results, because the edge density values of short-circuit edges are much lower than those of normal edges. For simplicity, we take the identity matrix as the bandwidth matrix.
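Because the explicit edge density formula was lost from this copy, the sketch below shows only one plausible realization consistent with the description: the local KDE (Gaussian kernel, identity bandwidth) is evaluated at the quartile points of the segment between $x_i$ and $x_j$, using the points of $\Omega_i \cup \Omega_j$, and the edge density is taken as the minimum of these values. The authors’ exact combination may differ; `kde` is the function from the previous sketch.

```python
import numpy as np

def edge_density(xi, xj, local_points):
    """Illustrative edge density: minimum of the local KDE at the quartile
    points of the edge x_i--x_j (one plausible reading of the definition)."""
    H = np.eye(xi.shape[0])                                  # identity bandwidth matrix
    quartiles = [xi + t * (xj - xi) for t in (0.25, 0.5, 0.75)]
    return min(kde(q, local_points, H) for q in quartiles)
```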

We compute the edge density for every edge in $E$ by (9) and collect the values in the set $P$. As mentioned above, short-circuit edges have extremely low edge density; therefore, if the edge density of an edge is less than a certain threshold, we regard it as a short-circuit edge and remove it from the neighbourhood graph. As an example, edge density values for Swiss roll data are illustrated in Figure 5 for two different neighbourhood sizes. In Figure 5(c), the larger neighbourhood size produces a few short-circuit edges in the neighbourhood graph, and correspondingly a few edges have extremely low edge density values, as shown in Figure 5(d). For the smaller neighbourhood size, the neighbourhood graph in Figure 5(a) is healthy and the corresponding edge density values are well distributed, as shown in Figure 5(b). From Figure 5, it is clear that the edge density values of short-circuit edges are much lower than those of normal edges. In order to determine the threshold, we sort all the elements of $P$ in ascending order and use the value at which the maximum increment occurs within the first half of the sequence as our threshold. The threshold can be obtained adaptively as follows.

Arrange all elements of $P$ in ascending order as $\rho_{(1)} \le \rho_{(2)} \le \cdots \le \rho_{(|E|)}$ and take the first half of the sequence, $\rho_{(1)}, \ldots, \rho_{(m)}$, where $m = \lfloor |E|/2 \rfloor$.

The threshold is $T = \rho_{(l)}$, where $l = \arg\max_{1 < i \le m} \bigl(\rho_{(i)} - \rho_{(i-1)}\bigr)$.
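A small sketch of this adaptive rule, assuming the edge set is large enough that the lower half of the sorted sequence contains at least two values:

```python
import numpy as np

def density_threshold(edge_densities):
    """Adaptive threshold: the value at which the largest increment occurs
    within the lower half of the sorted edge densities."""
    rho = np.sort(np.asarray(edge_densities))
    m = max(len(rho) // 2, 2)            # inspect only the lower half
    l = int(np.argmax(np.diff(rho[:m]))) + 1
    return rho[l]
```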

The algorithm that integrates L-Isomap with the landmark selection method and the short-circuit edge elimination method is called RL-Isomap. The main procedure of RL-Isomap is summarized as Algorithm 3.

(1) Define a neighbourhood graph by using the $k$-NN rule.
(2) Eliminate the edges whose edge density is below the threshold $T$.
(3) Select landmark candidates by Algorithm 1.
(4) Determine the landmark points by Algorithm 2.
(5) Apply LMDS to find a low-dimensional embedding.

The previous method proposed by H. Choi and S. Choi [12] must recompute the neighbourhood graph because the points with extremely high total flows are eliminated from the original input data set. In contrast, RL-Isomap directly eliminates the short-circuit edges from the neighbourhood graph and does not destroy the integrity of the data.
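Assuming the cleaned neighbourhood graph is stored as a SciPy sparse adjacency matrix with Euclidean edge weights, the geodesic distances required by the LMDS step can be obtained with Dijkstra’s algorithm restricted to the landmark sources; a minimal sketch whose names are ours:

```python
from scipy.sparse.csgraph import dijkstra

def geodesic_from_landmarks(graph, landmarks):
    """Shortest-path (geodesic) distances from the landmark points to every point.
    graph     : sparse adjacency matrix of the cleaned neighbourhood graph.
    landmarks : list of landmark indices returned by Algorithm 2.
    Returns an (n_landmarks, N) distance matrix suitable for LMDS."""
    return dijkstra(graph, directed=False, indices=landmarks)
```

The returned matrix can be passed directly to the `landmark_mds` sketch given in Section 2.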

4.2. Complexity of RL-Isomap

The running time of the RL-Isomap algorithm is the sum of six terms: the cost of constructing the neighbourhood graph of all input data; the cost of computing the edge density for the neighbourhood graph in order to eliminate the short-circuit edges; the cost of obtaining the landmark candidate points with Algorithm 1; the cost of determining the landmark points with Algorithm 2, which depends on the number of candidate points; the cost of computing the geodesic distances with Dijkstra’s algorithm; and the cost of computing the embedding of the input data.

5. Numerical Experiments

In this section, we conduct several experiments on different data sets (Swiss roll data, Toroidal Helix data, and face image data) to test RL-Isomap. After that, we use RL-Isomap to analyze the nonlinear structure of high-dimensional Internet traffic matrices.

5.1. Experiment on Swiss Roll Data

In Figure 5(c), 1000 data points were used to build the neighbourhood graph with the larger neighbourhood size, and several short-circuit edges appear in the graph. First, we apply Robust Kernel Isomap [12] to this neighbourhood graph. After deleting the two points considered outliers by Robust Kernel Isomap, the new neighbourhood graph shown in Figure 6(a) still contains some short-circuit edges, and the embedding result produced by Robust Kernel Isomap, shown in Figure 6(e), is therefore not correct. The distribution of edge density values for the neighbourhood graph in Figure 5(c) is shown in Figure 5(d), and it is clear that some values are extremely low. We then apply RL-Isomap to compute the threshold, marked by an orange line in Figure 5(d); the edges whose edge density values are below the threshold are deleted, yielding the new neighbourhood graph shown in Figure 6(b), in which the short-circuit edges disappear. We then recalculate the edge density values for the edges of the new neighbourhood graph, and the results, shown in Figure 6(c), are now well distributed. Figure 6(d) shows the landmark candidate points (black solid points) and the landmark points (marked by red circles) obtained by Algorithms 1 and 2, respectively. As shown in Figure 6(f), RL-Isomap finds the correct two-dimensional embedding. In this experiment, RL-Isomap outperforms Robust Kernel Isomap, which demonstrates the effectiveness of RL-Isomap.

5.2. Experiment on Toroidal Helix Data

The Toroidal Helix data set is shown in Figure 7(a). The neighbourhood graph is constructed, and Figure 7(b) gives an overhead view of it; in this case there are a few short-circuit edges. First, we use Robust Kernel Isomap, which removes 8 points and reconstructs the neighbourhood graph; however, short-circuit edges still exist, as shown in Figure 7(c), and the corresponding embedding result in Figure 7(g) is not correct. We then use (9) to compute the edge density for each edge of the neighbourhood graph in Figure 7(b); the values, together with the threshold, are illustrated in Figure 8(a). The edges whose edge density is below the threshold are removed from the neighbourhood graph, giving a new neighbourhood graph without short-circuit edges, shown in Figure 7(d). The edge density values of the new neighbourhood graph are distributed more evenly than those of the original graph (see Figure 8(b)). After that, we run Algorithm 1 on the new neighbourhood graph to obtain the landmark candidate points, denoted by black solid points in Figure 7(e); Algorithm 2 is then applied to the candidates, and the resulting landmark points, marked by red circles, are illustrated in Figure 7(f). Finally, RL-Isomap finds the correct two-dimensional embedding, shown in Figure 7(h), which faithfully preserves the geometric features of the original manifold. This experiment further validates the effectiveness of RL-Isomap.

5.3. Experiments on Face Data

In addition to the previous experiments on synthetic data sets, we apply RL-Isomap to real-world data: face images. The input consists of 64-pixel-by-64-pixel images of a face with different lighting directions and poses, represented as a sequence of 4096-dimensional vectors; the data set is thus a 3-dimensional manifold embedded in the 4096-dimensional observation space. We apply L-Isomap and RL-Isomap, respectively, to the data to detect its intrinsic geometric structure. As shown in Figure 9, representative faces are shown next to circled points in different parts of the space, where each coordinate axis of the embedding corresponds to one degree of freedom underlying the original data: up-down pose and left-right pose. The grayscale of the square bar below each representative face represents the lighting direction. In Figure 9(a), the embedding result of L-Isomap is not correct because some points representing the left pose are projected close to points representing the right pose, which is clearly caused by the short-circuiting problem. In Figure 9(b), RL-Isomap correctly finds the 3-dimensional embedding of the input data.

5.4. Nonlinear Structure Analysis of Traffic Matrix by RL-Isomap

The Internet traffic matrix is a useful traffic data model for understanding the Internet from a whole-network perspective, but it usually has high dimensionality [21]. The experiments above have demonstrated the effectiveness of RL-Isomap, so in this part we apply RL-Isomap to analyze the intrinsic nonlinear structure of the Internet traffic matrix.

5.4.1. Traffic Matrix Modelling

An Origin-Destination (OD) flow comprises all traffic originating from a given source and destined for a given destination. Let $P$ denote the number of sources and destinations in a network. A traffic matrix is naturally represented by a three-dimensional, nonnegative hypermatrix $X$, with $(i, j, t)$th entry $x_{ijt}$. Each entry $x_{ijt}$ represents the traffic volume, measured in bytes or packets, from source $i$ to destination $j$ in time interval $t$, with the full measurement period consisting of $T$ intervals. It is intractable to monitor the traffic in real time; thus the measurements are restricted to the average traffic in a discrete interval, whose length depends on the applications and available measurements [22]. In this paper, the data set we use was sampled from the Abilene backbone network over time intervals of 5 minutes.

The data set used in this paper was downloaded from http://www.cs.utexas.edu/~yzhang/research/AbileneTM/, a publicly available collection of traffic matrices [23]. The measured network contains 12 PoPs; hence there are 144 PoP pairs and 144 OD flows, and the traffic matrix has a dimensionality of 144.
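For illustration, each 5-minute snapshot of the 12-by-12 traffic matrix can be flattened into a 144-dimensional vector before applying RL-Isomap; the array below is a random placeholder standing in for the measured hypermatrix $x_{ijt}$, and the number of intervals is an assumed value.

```python
import numpy as np

P, T = 12, 2016                          # 12 PoPs; one week of 5-minute intervals (assumed length)
X = np.random.rand(P, P, T)              # placeholder for the measured traffic hypermatrix x_{ijt}

# One row per time interval: each row is the 144-dimensional OD-flow vector at time t.
traffic_vectors = X.reshape(P * P, T).T  # shape (T, 144), the input handed to RL-Isomap
```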

5.4.2. Intrinsic Dimensionality Analysis

The original traffic matrix data has 144 dimensions, which makes it very difficult to analyze directly. As shown above, RL-Isomap offers a simple way to analyze high-dimensional input data and find its low-dimensional structure. In this section, we explore the underlying nonlinear structure hidden in the high-dimensional traffic matrix data using RL-Isomap. We apply the algorithm to the 24 data sets to learn the relation between the residual variance and the embedding dimension, and the results are illustrated in Figure 10 (due to limited space, we present only 4 data sets in this paper). As mentioned above, the residual variance characterizes how well the low-dimensional Euclidean embedding preserves the geodesic distances estimated from the neighbourhood graph. The intrinsic dimension of the data can be estimated by looking for the “elbow” at which the residual variance curve stops decreasing significantly as dimensions are added. The intrinsic dimensions of the 24 data sets are shown in Figure 11; they fluctuate between 3 and 8, far less than 144.
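A hedged sketch of the elbow criterion used to read Figure 10: compute the residual variance (using the `residual_variance` sketch from Section 3.2) for each candidate dimension and report the smallest dimension beyond which the decrease becomes negligible; the tolerance value is an illustrative choice, not the one used in the experiments.

```python
def estimate_intrinsic_dim(D_geodesic, embed_fn, max_dim=10, tol=0.01):
    """Intrinsic dimension as the elbow of the residual-variance curve.
    embed_fn(d) should return a d-dimensional embedding (e.g., produced by RL-Isomap)."""
    rv = [residual_variance(D_geodesic, embed_fn(d)) for d in range(1, max_dim + 1)]
    for d in range(1, max_dim):
        if rv[d - 1] - rv[d] < tol:      # curve stops decreasing significantly
            return d
    return max_dim
```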

Why does the OD traffic matrix exhibit such low dimensionality? Two causes bring about this kind of low dimensionality. First, if the magnitude of variation among dimensions in the original data differs greatly, the data may have a low effective dimension for that reason alone; this is the case if variation along a small set of dimensions in the original data is dominant [24]. Second, it is caused by the spatial correlation of Internet traffic. The Internet consists of core networks and edge networks, and traffic flows entering and leaving the core network through different ingresses and egresses may originate from the same edge network, so the traffic of different OD pairs shares common patterns or trends.

As discussed above, a nonlinear manifold structure is indeed hidden behind the original traffic matrix data. This low-dimensional structure can help analyze the flow characteristics of the network, including traffic variation trends and traffic anomalies, and motivates us to explore the nonlinear structure of the traffic matrix further in the future.

6. Conclusion

In this paper, we present the RL-Isomap algorithm. Two methods are given in RL-Isomap that address, respectively, the landmark selection problem and the topological instability problem of L-Isomap. Experiments on synthetic and real-world data sets demonstrate the effectiveness of RL-Isomap.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grants nos. 61233003, 61202285, 61572463, and 61673361, in part by Research Fund for the Doctoral Program of Higher Education of China under Grant no. 20123402110029, in part by Natural Science Research Program of the Anhui High Education Bureau of China under Grant no. KJ2012A286, in part by the grant of the Open Project Program of the State Key Lab of CAD & CG under Grant A1709, Zhejiang University, and in part by the grant of the Shanghai Key Laboratory of Intelligent Information Processing, China under Grant IIPL-2016-003.