Information Analysis of High-Dimensional Data and Applications
Review Article | Open Access
Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework
When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some data structure. A fundamental piece of information needed to this aim is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually called intrinsic dimension (id), can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, in the last decades the concept of intrinsic dimension has gained considerable attention in the scientific community, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open, since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension that are nonlinearly embedded in higher dimensional spaces. This paper surveys some of the most interesting, widely used, and advanced state-of-the-art methodologies. Unfortunately, since no benchmark database exists in this research field, an objective comparison among different techniques is not possible. Consequently, we suggest a benchmark framework and apply it to comparatively evaluate relevant state-of-the-art estimators.
Since the 1950s, the rapid pace of technological advances has allowed us to measure and record increasing amounts of data, motivating the urgent need for dimensionality reduction systems to be applied to real datasets comprising high-dimensional points.
To this aim, a fundamental piece of information is provided by the intrinsic dimension (id), defined by Bennett as the minimum number of parameters needed to generate a data description while maintaining the “intrinsic” structure characterizing the dataset, so that the information loss is minimized.
More recently, a quite intuitive definition, employed by several authors in the past, has been reported by Bishop (p. 314), who writes that “a set in D dimensions is said to have an id equal to d if the data lies entirely within a d-dimensional subspace of R^D.”
Though more specific and different definitions have been proposed in different research fields [3–5], throughout the pattern recognition literature the presently prevailing definition views a point set as a sample set uniformly drawn from an unknown smooth (or locally smooth) manifold, possibly embedded in a higher dimensional space through a nonlinear smooth mapping; in this case, the id to be estimated is the manifold’s topological dimension.
Due to the importance of the id in several theoretical and practical application fields, in the last two decades a great deal of research effort has been devoted to the development of effective id estimators. Though several techniques have been proposed in the literature, the problem is still open for the following main reasons.
First, it must be highlighted that, though Lebesgue’s definition of topological dimension (see Section 3.2) is quite clear, in practice its estimation is difficult if only a finite set of points is available. Therefore, estimation techniques proposed in the literature are either founded on different notions of dimension (e.g., fractal dimensions, Section 3.2.1) approximating the topological one or on various techniques aimed at preserving the characteristics of data-neighborhood distributions, which reflect the topology of the underlying manifold. Besides, the estimated value markedly changes as the scale used to analyze the input dataset changes (see an example in Figure 1) and, with the number of available points being practically limited, several methods underestimate the id when its value is sufficiently high. Other serious problems arise when the dataset is embedded in a higher dimensional space through a nonlinear map. Finally, the too high computational complexity of most estimators makes them impractical when the need is to process datasets comprising huge amounts of high-dimensional data.
In this work, after recalling the application domains of interest, we survey some of the most interesting, widely used, and advanced id estimators. Unfortunately, since each method has been evaluated on different datasets, it is difficult to compare them solely by analyzing the results reported by the authors. This highlights the need for a benchmark framework, such as the one proposed in this work, to objectively assess and compare different techniques in terms of robustness with respect to parameter settings, high-dimensional datasets, datasets characterized by a high id, and noisy datasets.
The paper is organized as follows: in Section 2 the usefulness of knowing the id is motivated and interesting application domains profitably exploiting it are recalled; in Section 3 we survey notable state-of-the-art id estimators, grouping them according to the employed methods; in Section 4 we summarize the most commonly used experimental settings, propose a benchmark framework, and employ it to objectively assess and compare relevant estimators; in Section 5 conclusions and open research problems are reported.
2. Application Domains
In this section we motivate the increasing research interest in the development of automatic id estimators, and we recall different application contexts where knowing the id of the available input datasets is profitable.
In the field of pattern recognition, the id is one of the first and fundamental pieces of information required by several dimensionality reduction techniques [8–12], which try to represent the data in a more compact, but still informative, way to reduce the effects of the “curse of dimensionality”. Furthermore, when using an autoassociative neural network to perform a nonlinear feature extraction, the id value can suggest a reasonable value for the number of hidden neurons. Indeed, a network with a single hidden layer of neurons with linear activation functions has an error function with a unique global minimum and, at this minimum, the network performs a projection on the subspace spanned by the first principal components estimated on the dataset, their number being equal to the number of hidden neurons. Besides, according to statistical learning theory, the capacity and generalization capability of a given classifier may depend on the id. More specifically, in the particular case of linear classifiers where the data are drawn from a manifold embedded through an identical map, the Vapnik-Chervonenkis (VC) dimension of the separating hyperplane is id + 1 (see pp. 156–158 of the cited work). Since the generalization error depends on the VC dimension, it follows that the generalization capability may depend on the id value. Moreover, some authors remark that, in order to balance a classifier’s generalization ability and its empirical error, the complexity of the classification model should also be related to the id of the available dataset. Furthermore, since complex objects can be considered as structures composed of multiple manifolds that must be clustered to be processed separately, the knowledge of the local ids characterizing the considered object is fundamental to obtain a proper clustering.
These observations motivate applications employing global or local id estimates to discover some structure within the data. In the following we summarize, or simply recall, some interesting examples [19–25].
In one of these works, the authors introduce a fractal dimension estimator, called the correlation dimension (CD) estimator (see Section 3.2.1), and show that the estimate it computes is a reliable approximation of the strange attractor dimension in chaotic dynamical systems.
In the field of gene expression analysis, one work shows that the id estimate computed by the nearest neighbor estimator (described in Section 3.2.2) is a lower bound for the number of genes to be used in supervised and unsupervised class separation of cancer and other diseases. This information is important since the generally used datasets contain a large number of genes and the classification results strongly depend on the number of genes employed to learn the separation criteria.
In one comparative work, the authors show that id estimation methods derived from the basic theory of fractal dimensions ([7, 19, 27]; see Section 3.2.1) can be successfully used to evaluate the model order of signals and time series, which is the number of past samples required to model the time series adequately and is crucial to make reliable predictions. This work employs fractal dimension estimators since the domain of attraction of nonlinear dynamic systems has a very complex geometric structure, which can be captured by closely related studies on fractal geometry and fractal dimensions.
A noteworthy research work in the field of crystallography employs a fractal id estimator followed by a correction method that, according to the authors, “is needed because the estimator, to give correct estimations of the id, requires an unrealistically large number of points.” Anyway, the experimental results show that the id is a useful piece of information to be exploited when analyzing crystal structures. This study not only proves that id estimates are especially useful when dealing with practical tasks concerning real data, but also underlines the need to compute reliable id estimates on datasets drawn from manifolds characterized by a high id and embedded in spaces of much greater dimensionality.
The work of Carter et al. is very interesting and notable because it is one of the first considering that the input data might be drawn from a multimanifold structure, where each submanifold has a (possibly) different id. To separate the manifolds, the authors compute local id estimates by applying both a fractal dimension estimator (see Section 3.2.1) and a nearest neighbor-based estimator (described in [29, 30]; see Section 3.2.2) on properly defined data neighborhoods. The authors then show that the computed local ids might be helpful for the following interesting applications: (1) “Debiasing global estimates”: the negative bias caused both by the limited number of available sample points and by the curse of dimensionality is reduced by computing global id estimates through a weighted average of the local ones, which assigns greater importance to the points away from the boundaries. However, the authors themselves note that this method is only applicable to data with a relatively low id, since in high dimensions most points lie near the boundaries. (2) “Statistical manifold learning”: the local id estimates are used to reduce the dimension of statistical manifolds, that is, manifolds whose points represent probability density functions. When this step is applied as the first step of document classification applications, and of the analysis of patients’ samples acquired in the field of flow cytometry, it allows us to obtain lower dimensional points showing a good class separation. (3) “Network anomaly detection”: considering that the overall complexity of a router network is decreased when few sources account for a disproportionate amount of traffic, a decrease in the id of the entire network is searched for. (4) “Clustering”: problems of data clustering and image segmentation are dealt with by assuming that different clusters and image patches belong to manifold structures characterized by different complexities (and ids).
In another work, with the aim of analyzing gene expression time series, the authors compare the id estimates computed by a fractal id estimator and by a nearest neighbor (NN) id estimator. The results on both simulated and real data show that one of the two estimators proves more robust than the other with respect to the nonlinear embedding and the underlying time-series model.
In the field of geophysical signal processing, hyperspectral images, whose pixels represent spectra generated by the combination of an unknown set of independent contributions called endmembers, often require estimating the number of endmembers. To this aim, one proposal is to substitute the state-of-the-art algorithms specifically designed to solve this task with id estimators. After motivating the idea by describing the relation between the id of a dataset and the number of endmembers, the authors choose to experiment with two fractal id estimators [7, 19] and a nearest neighbor-based one. They obtain the most reliable results with the latter, after opportunely tuning the number of nearest neighbors to be considered.
Finally, other noteworthy examples of research works that profitably exploit the id, usually estimating it with fractal dimension estimators, concern financial time series prediction, biomedical signal analysis [35–37], analysis of ecological time series, radar clutter identification, speech analysis, data mining and low dimensional representation of (biomedical) time series, and plant traits representation.
3. Intrinsic Dimension Estimators
In this section we survey some of the most notable, recent, and effective state-of-the-art id estimators, grouping them according to the main ideas they are based on.
Specifically, in Section 3.1 we describe projective estimators, which basically process a dataset to identify a somehow appealing lower dimensional subspace onto which to project the data; the dimension of the identified subspace is viewed as the id estimate.
More recent projective estimators exploit the assumption that the dataset is uniformly drawn from a smooth (or locally smooth) manifold, embedded into a higher dimensional space through a nonlinear map; this is also the basic assumption of all the other groups of methods, which will be referred to as topological estimators (see Section 3.2) and graph-based estimators (see Section 3.3).
We note that the taxonomy we use to group the reviewed methods differs from the one commonly adopted by several authors in the past, which viewed methods as global, when estimation is performed by considering the dataset as a whole, or local, when all the data neighborhoods are analyzed separately and an estimate is computed by combining all the local results. All the recent methods have abandoned the purely global approach, since it is now clear that analyzing a dataset at its biggest scale cannot produce reliable results; they thus estimate the global id by somehow combining local ids. This way of proceeding comes from the assumption that the underlying manifold is locally smooth.
3.1. Projective Estimators
The first projective estimators introduced in the literature explicitly compute the mapping that projects the input points to the subspace minimizing the information loss [27, 43] and therefore view the id as the minimal number of vectors linearly spanning that subspace. It must be noted that, since these methods were originally designed for exploratory data analysis and dimensionality reduction, they often require the dimensionality of the subspace (the id to be estimated) to be provided as an input parameter. However, their extensions and variants include methodologies to automatically estimate it.
Most of the projective estimators can be grouped into two main categories: projection techniques based on Multidimensional Scaling (MDS) [44, 45] or its variants, which tend to preserve as much as possible the pairwise distances among the data, and projection techniques based on Principal Component Analysis (PCA) [15, 46] and its variants, which search for the best projection subspace minimizing the projection error.
Some of the best known examples of MDS algorithms are those in [47–52], Bennett’s algorithm [1, 53], Sammon’s mapping, Curvilinear Component Analysis (CCA), ISOMAP, and Local Linear Embedding (LLE). As shown by the experiments reported in [56, 57], ISOMAP and variants of LLE compute the most reliable id estimates. We believe that their better performance is due to the fact that both ISOMAP and LLE have been the first projective methods based on the assumption that the input points are drawn from an underlying manifold, whose curvature might affect the precision of data neighborhoods computed by employing the Euclidean distance. However, as noted in [58, 59], these algorithms suffer from all the major drawbacks affecting MDS-based algorithms, which are too tightly bound to the preservation of the pairwise distance values. Besides, as highlighted in the literature, ISOMAP as an id estimator, as well as other spectral methods like LLE, relies on a specific estimated eigenstructure that may not exist in real data. Regarding LLE, it either requires the id value to be known in advance or may automatically estimate it by analyzing the eigenvalues of the data neighborhoods; however, as outlined in [7, 27], id estimates computed by means of eigenvalue analysis are as unreliable as those computed by most PCA-based approaches. Moreover, it has been noted that methods such as LLE are based on the solution of a sparse eigenvalue problem under the unit covariance constraint; due to this imposed constraint, the global shape of the embedded data cannot reflect the underlying manifold.
PCA [15, 46] is one of the most cited and well-known projective id estimators, often used as the first step of several pattern recognition systems to compute low dimensional representations of the available datasets. When PCA is used for id estimation, the id estimate is the number of “most relevant” eigenvectors of the sample covariance matrix, also called principal components (PCs). Due to the promising dimensionality reduction results, several PCA-based approaches, both deterministic and probabilistic, have been published. Among the deterministic approaches, we recall Kernel PCA (KPCA) and local PCA with its extensions to automatically select the number of PCs [63, 64]. We observe that one of these works is among the first to estimate the id by considering an underlying topological structure; it applies PCA on data neighborhoods represented by an Optimally Topology Preserving Map (OTPM) built on clustered data. (Given an input dataset, its OTPM is usually computed through Topology Representing Networks (TRNs); these are unsupervised neural networks developed to map the dataset to a set of neurons whose learnt connections define proximities in the data. These proximities correspond to the optimal topology preserving Voronoi tessellation and the corresponding Delaunay triangulation. In other words, TRNs compute connectivity structures that define and preserve the topology originally present in the data, forming a discrete path-preserving representation of the inner (topological) structure of the topological manifold underlying the dataset.) The authors of this method state that their approach is more efficient and less sensitive to noise than PCA-based approaches. However, they do not show any experimental comparison and, besides, their algorithm employs critical thresholds and a data clustering technique whose result heavily influences the precision of the computed id estimate.
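To make the eigenvalue-counting idea concrete, the following sketch (not drawn from any of the surveyed papers; the 95% retained-variance threshold is an arbitrary illustrative rule) estimates the id as the number of principal components needed to explain most of the variance:

```python
import numpy as np

def pca_id_estimate(X, variance_threshold=0.95):
    """Estimate the id as the number of principal components needed to
    retain `variance_threshold` of the total variance (illustrative rule)."""
    Xc = X - X.mean(axis=0)                       # center the data
    # singular values of the centered data give the covariance eigenvalues
    s = np.linalg.svd(Xc, full_matrices=False, compute_uv=False)
    eigvals = s ** 2 / (len(X) - 1)
    explained = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(explained, variance_threshold) + 1)

# a 3-dimensional linear subspace rotated into 10 ambient dimensions
rng = np.random.default_rng(0)
Z = np.hstack([rng.standard_normal((1000, 3)), np.zeros((1000, 7))])
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))  # random rotation
X = Z @ Q
print(pca_id_estimate(X))  # → 3
```

On real data the eigenvalue spectrum rarely shows such a clean cut-off, which is exactly the unreliability of threshold-based eigenvalue analysis discussed above.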
The usage of a probabilistic approach was firstly introduced by Tipping and Bishop. Considering that deterministic methodologies lack an associated probabilistic model for the observed data, their Probabilistic PCA (PPCA) reformulates PCA as the maximum likelihood solution of a specific latent variable model. PPCA and its extensions to both mixture and hierarchical mixture models have been successfully applied to several real problems, but they still provide an id-estimation mechanism depending on critical thresholds. This motivates its subsequent variants and developments, examples of which are Bayesian PCA (BPCA) and two Bayesian model order selection methods introduced in [69, 70]. The asymptotic consistency of id estimation by a (constrained) isotropic version of PPCA has also been shown with numerical experiments on simulated and real datasets.
While the aforementioned methods have been simply recalled, since their id estimation results have proven unreliable [7, 27], in the following recent and promising proposals are described in more detail.
The Simple Exponential Family PCA (SePCA) has been developed to overcome the assumption of Gaussian-distributed data, which makes it difficult to handle all types of practical observations, for example, integers and binary values. SePCA achieves promising results by using exponential family distributions; however, it is highly influenced by critical parameter settings and it is successful only if the data distribution is known, which is often not the case, especially when highly nonlinear manifold structures must be treated.
In another work the authors propose the Sparse Probabilistic PCA (SPPCA) as a probabilistic version of the Sparse PCA (SPCA). Precisely, SPCA selects the id by forcing the sparsity of the projection matrix, that is, the matrix containing the PCs. However, based on the consideration that the level of sparsity is not automatically determined by SPCA, SPPCA employs a Bayesian formulation of PCA, achieving sparsity through a different prior and automatically learning the hyperparameter related to the constraint weight through Evidence Approximation (see Section 3.5 of the cited work). The authors’ results, and also the results of the comparative evaluation proposed in the literature, show that this method seems to be less affected by the problems of the aforementioned projective schemes.
An alternative method applies Singular Value Decomposition (SVD), basically a variant of PCA, locally and in a multiscale fashion to estimate the id of datasets drawn from nonlinearly embedded manifolds corrupted by Gaussian noise. Precisely, exploiting the same ideas of a theoretical SVD-based estimator presented in previous work, the authors note that the best way to avoid the effects of the curvature (induced by the nonlinearity of the embedding) is to apply SVD locally, that is, in hyperspheres centered on the data points and having radius r. However, the choice of r is constrained by the following considerations: (1) r must be big enough for the hypersphere to contain a sufficient number of neighbors, (2) r must be small enough to ensure that the manifold restricted to the hypersphere is linear (or at least smooth), and (3) r must be big enough to ensure that the effects of noise are negligible. When these three constraints are met, the tangent space computed by applying SVD on the neighbors is a good approximation of the tangent space of the manifold, and the number of its relevant eigenvalues corresponds to the (local) id. To find a proper value for r, the authors propose a multiscale approach that applies SVD on neighborhoods whose radius varies in a range. This allows computing scale-dependent, local singular values; using a least squares fitting procedure the singular values can be expressed as functions of r, whose analysis allows identifying the range of scales not influenced by either noise or curvature. Finally, within this range the squared singular values are analyzed to get the id estimate that maximizes the gap between consecutive squared singular values for the largest range of scales. The proposed algorithm has been evaluated on unit hyperspheres and cubes nonlinearly embedded in higher dimensional spaces and affected by Gaussian noise. The reported results are very good, while ten other well-known methods [19, 23, 27, 30, 79–81] produce unreliable id estimates on the same datasets even in the absence of noise.
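The local, scale-dependent use of SVD described above can be illustrated with a toy sketch (the radius, the noise level, and the largest-gap rule below are illustrative simplifications of the authors' multiscale procedure, not their exact algorithm):

```python
import numpy as np

def local_singular_values(X, center, r):
    """Singular values of the (centered) neighborhood of `center`
    within radius r; they summarize the local tangent structure."""
    nbrs = X[np.linalg.norm(X - center, axis=1) <= r]
    return np.linalg.svd(nbrs - nbrs.mean(axis=0), compute_uv=False)

def gap_id(sv):
    """Local id estimate: position of the largest gap between
    consecutive squared singular values."""
    sq = sv ** 2
    return int(np.argmax(sq[:-1] - sq[1:]) + 1)

# a noisy circle (a 1-dimensional manifold) embedded in R^3
rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, 2000)
X = np.c_[np.cos(t), np.sin(t), np.zeros_like(t)]
X += 0.01 * rng.standard_normal(X.shape)

# at a moderate scale the neighborhood is nearly linear, so the
# tangent direction dominates and the local estimate is 1
print(gap_id(local_singular_values(X, X[0], 0.3)))  # → 1
```

At much smaller radii the noise dominates the singular spectrum, and at much larger radii the curvature inflates it, which is exactly why the authors analyze the singular values across a whole range of scales.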
3.2. Topological Approaches
Topological approaches for id estimation consider a manifold embedded in a higher dimensional space through a proper (locally) smooth map and assume that the given dataset consists of independent identically distributed (i.i.d.) points drawn from the manifold through a smooth probability density function (pdf).
Under this assumption the id to be estimated is the manifold’s topological dimension, defined either through the firstly proposed Brouwer Large Inductive Dimension or the equivalent Lebesgue Covering Dimension. Since Brouwer’s definition was soon neglected by mathematicians for its difficult proof, the commonly adopted topological dimension definition is Lebesgue’s Covering Dimension, reported in the following.
Definition 1 (cover). Given a topological space X, a cover of a set S ⊆ X is a countable collection of open sets {U_i} such that S is contained in their union, S ⊆ ∪_i U_i.
Definition 2 (refinement of a cover). A refinement of a cover C of a set S is another cover C′ of S such that each set in C′ is contained in some set of C.
Definition 3 (topological dimension (Lebesgue Covering Dimension)). Given the aforementioned definitions, the topological dimension of the topological space X, also called Lebesgue Covering Dimension, is the minimal integer d such that every finite cover of X admits a refinement in which no point of X belongs to more than d + 1 intersecting open sets. If no such minimal integer value exists, X is said to be of infinite topological dimension.
To our knowledge, at the state of the art only two id estimators have been explicitly designed to estimate the topological dimension.
One of them, the Tensor Voting Framework (TVF) and its variants, relies on an iterative information diffusion process based on Gestalt principles of perceptual organization. TVF iteratively diffuses local information describing, for each point, the tangent space approximating its underlying neighborhood. To this aim, the information diffused at each iteration consists of second-order symmetric positive definite tensors whose eigenvectors span the local tangent space. Practically, during the initialization step a ball tensor, which is an identity matrix representing the absence of orientation, is used to initialize a token for each point. During each iteration, each token “generates” a set of tensors that act as votes cast to the neighboring tokens; precisely, the vote sent to each neighbor encodes information related to the current local tangent space estimate and decays as the curvature and the distance from that neighbor increase. On the other side, at each iteration each token sums the received votes to update its own tensor; this essentially refines the estimate of the local tangent space at the point. Based on the definition of topological dimension provided by Brouwer, it has been noted that TVF can be employed to estimate the local ids by identifying the number of most relevant eigenvalues of the computed second-order tensors. Although interesting, this method has a too high computational cost, which makes it unfeasible for spaces of high dimension.
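A drastically simplified, hypothetical sketch of the underlying idea (a single voting pass with an isotropic Gaussian decay, no curvature term and no iterations; sigma and the eigenvalue threshold are arbitrary choices, not part of the actual framework):

```python
import numpy as np

def structure_tensor_id(X, i, sigma=0.3, thresh=0.1):
    """Accumulate distance-weighted outer products of the unit
    displacement vectors around point i, then count the eigenvalues
    above a fraction `thresh` of the largest one as the local id."""
    diffs = np.delete(X, i, axis=0) - X[i]
    dists = np.linalg.norm(diffs, axis=1)
    dirs = diffs / dists[:, None]                 # unit directions
    w = np.exp(-(dists / sigma) ** 2)             # vote strength decays
    T = np.einsum('n,ni,nj->ij', w, dirs, dirs)   # second-order tensor
    ev = np.linalg.eigvalsh(T)[::-1]              # descending eigenvalues
    return int((ev > thresh * ev[0]).sum())

# points on a plane (a 2-dimensional manifold) in R^3
rng = np.random.default_rng(2)
P = np.c_[rng.uniform(-1, 1, (1000, 2)), np.zeros(1000)]
print(structure_tensor_id(P, 0))  # → 2
```

The real framework refines these tensors over several iterations and shapes each vote by curvature, but the eigenvalue-counting step at the end is the same in spirit.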
From the definition of the Lebesgue Covering Dimension it can be derived that the topological dimension of a manifold coincides with the affine dimension of a finite simplicial complex covering it (a simplicial complex has affine dimension d if it is a collection of affine simplexes having at most dimension d, that is, having at most d + 1 vertices). This essentially means that a d-dimensional manifold can be approximated by a collection of d-dimensional simplexes, each having at most d + 1 vertices; therefore, the topological dimension could be practically estimated by analyzing the number of vertices of the simplexes estimated on the dataset. To this aim, a method has been proposed to find the number of relevant positive coefficients that are needed to reconstruct each point from a linear combination of its k nearest neighbors, where k is a parameter to be manually set. This algorithm is based on the fact that the neighbors with positive reconstruction coefficients are the vertices of a simplex whose dimension equals the topological dimension of the manifold. Practically, to ensure that enough neighbors are available, k is set to a sufficiently large value, the reconstruction coefficients are calculated by means of an optimization problem constrained to nonnegative solutions, and the coefficients bigger than a user-defined threshold are considered as the relevant ones. The id estimate is then computed by employing two alternative approaches: the first one simply computes the mode of the number of relevant coefficients over all the neighborhoods; the second one sorts the coefficients computed for each neighborhood in descending order, computes the mean of the sorted coefficient vectors, and estimates the id as the number of relevant values in this mean vector. Note that, since the neighborhoods must be relatively large, this method is strongly affected by the curvature of the manifold when the id is big enough.
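The reconstruction step can be sketched with nonnegative least squares (a hypothetical reimplementation; k, the threshold, and the final mode rule are illustrative choices):

```python
import numpy as np
from scipy.optimize import nnls

def relevant_coefficients(X, i, k=20, thresh=1e-3):
    """Reconstruct point i as a nonnegative combination of its k nearest
    neighbors and count the coefficients above `thresh`."""
    d = np.linalg.norm(X - X[i], axis=1)
    nbrs = np.argsort(d)[1:k + 1]            # skip the point itself
    coeffs, _ = nnls(X[nbrs].T, X[i])
    return int((coeffs > thresh).sum())

# points on the plane z = 1 (a 2-manifold) in R^3: an interior point is
# a convex combination of 3 neighbors, i.e. the vertices of a 2-simplex
rng = np.random.default_rng(3)
X = np.c_[rng.uniform(size=(500, 2)), np.ones(500)]
counts = [relevant_coefficients(X, i) for i in range(60)]
mode = int(np.bincount(counts).argmax())
print(mode, mode - 1)  # a d-simplex has d + 1 vertices, so id = mode - 1
```

Since the plane does not pass through the origin, reconstructing a point forces the coefficients to sum to one, so the nonnegative solution is a convex combination of (generically) three vertices, recovering id = 2.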
Indeed, the results of the reported experimental evaluation lead the authors to assert that the method works well only on noise-free data of low id, under the assumption that the sampling process is uniform and the data points are sufficiently many.
Though interesting, both approaches have shown to be effective only for manifolds of low curvature as well as of low id.
In the following we survey other id estimators employing two different estimation approaches, which allow us to categorize them. More precisely, in Section 3.2.1 fractal estimators are described, which estimate different fractal dimensions, since these are good approximations of the topological one; in Section 3.2.2 nearest neighbors-based (NN) estimators are recalled, which are often based on the statistical analysis of the distribution of points within small neighborhoods.
3.2.1. Fractal Estimators
Since the topological dimension cannot be practically estimated, several authors implicitly assume that the manifold has a somehow fractal structure and estimate the id by employing fractal dimension estimators, the most relevant of which are surveyed in this section.
Very roughly, since the basic concept behind all fractal dimensions is that the volume of a d-dimensional ball of radius r scales with its size as r^d [90, 91], all fractal dimension estimators are based on the idea of counting the number of observations in a neighborhood of radius r to (somehow) estimate the rate of growth of this number. If the estimated growth is r^d, then the estimated fractal dimension of the data is d.
Note that all the derived estimators share a fundamental limitation: in order to get an accurate estimation of a dimension d, the size N of the dataset has to satisfy the inequality proved by Eckmann and Ruelle for the correlation dimension estimator (see below), d < 2 log₁₀ N, that is, N > 10^{d/2}. For instance, estimating d = 20 would require more than 10^10 points; this leads to prohibitively large values of N even for datasets with a moderate id.
Among the fractal dimension estimators, one of the most cited algorithms is the one by Grassberger and Procaccia, which will be referred to as GP in the following. It is an estimator of the correlation dimension (CD), whose formal definition is as follows.
Definition 4 (correlation dimension). Given a finite sample set X = {x_1, …, x_N}, let

C(r) = (2 / (N(N − 1))) Σ_{i<j} I(‖x_i − x_j‖ ≤ r),

where ‖·‖ is the Euclidean norm and I(·) is the step function used to simulate a closed ball of radius r centered on each point (I = 1 if its argument is true, and 0 otherwise). Then, the correlation dimension is defined as

CD = lim_{r→0} lim_{N→∞} log C(r) / log r.

In practice, GP computes an estimate of the CD by computing C(r) for several values of r and applying least squares to fit a line through the points (log r, log C(r)). It has to be noted that, to produce correct estimates, this estimator needs a very large number of data points, which is never available in practical applications; however, the resulting unreliable estimations can be corrected by the correction method proposed in the literature.
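The procedure just described can be sketched as follows (a hypothetical minimal implementation; the sample data and the range of radii are illustrative, and a real implementation must pick the fitting range with care, as discussed above):

```python
import numpy as np

def correlation_dimension(X, radii):
    """GP-style estimate: slope of log C(r) versus log r, where C(r) is
    the fraction of distinct point pairs at distance below r."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pairwise = d[np.triu_indices(len(X), k=1)]   # distinct pairs only
    log_c = np.log([(pairwise < r).mean() for r in radii])
    slope, _ = np.polyfit(np.log(radii), log_c, 1)
    return slope

rng = np.random.default_rng(4)
X = rng.uniform(size=(1500, 2))              # unit square, id = 2
print(correlation_dimension(X, np.logspace(-2, -1, 8)))  # close to 2
```

Even on this easy example the slope comes out slightly below 2 because of boundary effects, a small-sample symptom of the Eckmann-Ruelle limitation recalled earlier.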
The relevance of the GP estimator is shown by its several variants and extensions. An example is a work where the authors propose a normalized estimator for binary data and achieve estimates approximating those computed by GP.
Since GP is heavily influenced by the setting of the scale parameters, another approach estimates the CD by computing its expectation value through a maximum likelihood estimate applied to the distribution of the distances among points. The estimated CD is computed as

CD ≈ −((1/|Q|) Σ_{q∈Q} log(q/r))⁻¹,

where Q is the set of pairwise Euclidean distances between data points that are smaller than r, and r is a real value called the cut-off radius.
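A minimal sketch of this maximum-likelihood idea, in the style of the Takens estimator (the dataset and the cut-off radius are illustrative choices; the closed form assumes distances below the cut-off follow the local power law):

```python
import numpy as np

def ml_dimension(X, r):
    """Maximum-likelihood id estimate from the pairwise distances q
    below the cut-off radius r: -1 / mean(log(q / r))."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    q = d[np.triu_indices(len(X), k=1)]
    q = q[(q > 0) & (q < r)]                 # keep distances below cut-off
    return -1.0 / np.mean(np.log(q / r))

rng = np.random.default_rng(5)
X = rng.uniform(size=(1200, 3))              # unit cube, id = 3
print(ml_dimension(X, 0.2))  # close to 3 (slightly low: boundary effects)
```

The cut-off radius plays the same role as the scale range in GP, so the method trades the least squares fit for a closed-form estimate but does not remove scale dependence entirely.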
To develop an estimator more efficient than GP, other authors choose a different notion of dimension, namely, the Information Dimension:

ID = lim_{r→0} (Σ_i p_i log p_i) / log r,

where the sum runs over the minimum number of r-sized hypercubes covering the topological space and p_i is the probability of finding a point in the i-th hypercube. Noting that when the scale r is big enough the different coverings used to estimate the ID could produce different values, the authors look for the covering composed by the minimum number of nonempty sets. Similarly to the GP algorithm, the ID estimate is the average slope of the curve obtained by fitting the points with coordinates (log r, Σ_i p_i log p_i).
This algorithm is compared with the GP estimator, and the experimental tests show that both methods compute the same estimates; however, its computation time is much lower than that of GP.
Considering that can severely underestimate the topological dimension if the data distribution on the manifold is nonuniform, in  the author proposes the Packing Number (), a fractal dimension estimator that approximates the Capacity Dimension (). This choice is motivated by the fact that does not depend on the data distribution on the manifold and that, if both and the topological dimension exist (which is certainly the case in machine learning applications), the two dimensions agree. To formally define , the -covering number of a set must be introduced; it is the minimum number of open balls whose union covers , where is a distance metric. The definition of is based on the observation that the covering number of a -dimensional set is proportional to :
Since the estimation of is computationally expensive, based on the relation , the author employs the -Packing Number , defined in  as the maximum cardinality of an -separated set, so that the estimate of is computed as

To compute a greedy algorithm is used; however, as the author notes, the dependency of on the order in which the points are visited by the greedy algorithm introduces a high variance. To avoid this problem, the algorithm iterates the procedure times on random permutations of the data and takes the average as the final estimate. The comparative evaluation with the estimator leads the author to assert that “seems more reliable if data contains noise or the distribution on the manifold is not uniform.” Unfortunately, this method is also scale-dependent.
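The greedy construction and the permutation averaging described above can be sketched in a few lines; this is an illustrative numpy reading of the procedure, with names, scales, and parameter values of our choosing:

```python
import numpy as np

def packing_number(X, eps, rng):
    """Greedy eps-separated set: visit points in random order and keep
    a point only if it is at distance >= eps from every kept point."""
    order = rng.permutation(len(X))
    centers = X[order[:1]]
    for i in order[1:]:
        if np.linalg.norm(centers - X[i], axis=1).min() >= eps:
            centers = np.vstack([centers, X[i]])
    return len(centers)

def packing_dimension(X, eps1, eps2, n_perm=5, seed=0):
    """Two-scale packing estimate: average the greedy packing numbers
    over random permutations to damp the order-dependent variance,
    then take the slope between the two scales."""
    rng = np.random.default_rng(seed)
    m1 = np.mean([packing_number(X, eps1, rng) for _ in range(n_perm)])
    m2 = np.mean([packing_number(X, eps2, rng) for _ in range(n_perm)])
    return -(np.log(m2) - np.log(m1)) / (np.log(eps2) - np.log(eps1))
```

The returned slope depends on the two chosen radii, which is exactly the scale-dependence lamented above.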
To avoid any scale-dependency, in  the authors propose an estimator () based on the asymptotes of a smoothed version of (2), obtained by replacing the step function with a suitable kernel function. Precisely, they define

where is a kernel function with bandwidth and is the assumed dimensionality of the manifold from which the points are sampled. Note that, to guarantee the convergence of (8), the bandwidth has to fulfill the constraint . For this reason the authors formalize as a function of and, to achieve scale-independency, propose a method that estimates the by analyzing the convergence of when the parameters and vary. Precisely, the dataset is subsampled to create sets of different cardinalities and the curves whose points have coordinates () are considered. Employing this information, the following estimator is proposed:

This work is notable since the empirical tests are performed on synthetic datasets specifically designed to study the influence of high curvature as well as noise on the proposed estimator. The usefulness of these datasets is confirmed by the fact that they have also been employed to assess several subsequent methods [76, 96].
In  the authors present a fractal dimension estimator derived from the analysis of a vector quantizer applied to a dataset. Considering the codebook containing code-vectors , a k-point quantizer is defined by a measurable function , which maps each data point to one of the code-vectors in . This partitions the dataset into so-called quantizer cells , where is called the rate of the quantizer. Letting be a random vector distributed according to a probability distribution , the quantization error is , where and is the Euclidean norm in . Given the set of all -dimensional -point quantizers, the performance achieved by an optimal -point quantizer on is . When the quantizer rate is high, the quantizer cells can be well approximated by -dimensional hyperspheres with radius equal to and centered on each code-vector . In this case, the regularity of ensures that the probability of such balls is proportional to , and it can be shown  that . This is referred to as the high-rate approximation and motivates the definition of the quantization dimension of order :

The theory of high-rate quantization  confirms that, for a regular supported on the manifold , exists for each and equals the intrinsic dimension of . Furthermore, the limit allows us to motivate the relation between the quantization dimension and the Capacity Dimension. Indeed, according to the theory of high-rate quantization [98, 99], there exists a decreasing sequence such that, for sufficiently large values of (i.e., in the high-rate regime, that is, when ), the ratio can be approximated increasingly finely, both from below and from above, by quantities converging to the common value . To practically compute an estimate of the quantization dimension, having fixed the value of , the authors select a range of codebook sizes and design a set of quantizers giving good approximations of over the chosen range of . An estimate is obtained by fitting the points with coordinates and measuring the average slope over the chosen range .
Though the authors mention that their algorithm is less affected by underestimation biases than neighborhood-based methods (see Section 3.2.2), in  this statement is refuted with theoretical arguments.
3.2.2. Nearest Neighbors-Based Estimators
In this section we consider estimators, referred to as estimators in the following, that describe the distributions of data neighborhoods as functions of . They usually assume that close points are uniformly drawn from small -dimensional balls (hyperspheres) of radius centered on (a small radius guarantees that the samples included in , being less influenced by the curvature induced by the map , approximate well enough the intrinsic structure of the underlying portion of ).
Practically, given an input dataset , the value of these functions is computed by approximating the sampling process related to through the -nearest neighbor algorithm ().
Among estimators, Trunk’s method  is often cited as one of the first. It formulates the distribution function with an ad hoc statistic based on geometric considerations concerning angles; in practice, having fixed a threshold and a starting value for the parameter , it applies to find the neighbors of each and calculates the angle between the th nearest neighbor and the subspace spanned by the nearest neighbors. If , then is taken as the estimate; otherwise, is incremented by and the process is repeated. The main limitation of this method is the difficult choice of a proper value for .
The work presented by Pettis et al.  is notable since it is one of the first works providing a mathematical motivation for the use of nearest-neighbor distances.
Indeed, for an i.i.d. sample drawn from a density distribution p in , the following approximation holds:

k(r, x̄)/N ≈ p(x̄) V_d r^d,

where k(r, x̄) is the number of nearest neighbors to x̄ within the hypersphere of radius r centered on x̄ and V_d is the volume of the unit d-dimensional ball in .
This means that the proportion of sample points falling in is roughly approximated by times the volume of . Since this volume grows as , assuming the density to be constant, it follows that the number of samples in is proportional to . From the relationship in (11), and assuming that the samples are locally uniformly distributed, the authors derive an estimator for the :

where is the average of the distances from each sample point to its th nearest neighbor; defining as the distance between and its th nearest neighbor, this average is expressed as .
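Because the mean distance to the k-th neighbor grows roughly as k^(1/d), the inverse slope of its logarithm against log k estimates the id. The sketch below is a simplified illustration of this idea (the original algorithm adds an iterative bias-correction scheme not reproduced here); names are ours:

```python
import numpy as np

def nn_distance_id(X, k_max=10):
    """Estimate the id as the inverse slope of log(mean distance to the
    k-th nearest neighbor) regressed on log k, for k = 1..k_max."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    D.sort(axis=1)                        # column j holds j-th NN distances
    T_bar = D[:, 1:k_max + 1].mean(axis=0)
    k = np.arange(1, k_max + 1)
    slope = np.polyfit(np.log(k), np.log(T_bar), 1)[0]
    return 1.0 / slope
```

Without the bias correction this simplified version slightly underestimates even on uniform data, which anticipates the underestimation issue discussed next.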
Since this algorithm is limited by the choice of a suitable value for the parameter , in  the authors propose a variant that considers a range of neighborhood sizes . However, in the same work the authors themselves show that this technique generally underestimates the when its value is high.
Taking into account relation (11), in  the number of data points in is described by a polynomial of degree . In practice, a scale parameter is adaptively estimated as the radius value corresponding to the first significant peak of the histogram of the interpoint distances; a set of radius values is then selected and used to calculate pairs, each associating a radius with the number of interpoint distances strictly lower than it. To estimate the polynomial coefficients, the computed pairs are fit by a least squares procedure. Since by hypothesis the degree of the polynomial is , the significance test described in  is used to estimate this degree, which is taken as the estimate. The comparative evaluation of this algorithm with the well-known Maximum Likelihood Estimator ()  and its improved version , both described below, has shown it to be more robust when dealing with high-dimensional datasets.
, one of the most cited estimators, treats the neighbors of each point as events in a Poisson process and the distance between the query point and its th nearest neighbor as the event’s arrival time. Since this process depends on , the method estimates it by maximizing the log-likelihood of the observed process. In practice, a local estimate is computed as

m̂_k(x_i) = [ (1/(k − 1)) Σ_{j=1}^{k−1} log( T_k(x_i) / T_j(x_i) ) ]^{−1},

where T_j(x_i) denotes the distance between x_i and its j-th nearest neighbor. Averaging the local estimates, the global estimate is m̂ = (1/N) Σ_{i=1}^{N} m̂_k(x_i).
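A compact numpy sketch of this maximum likelihood estimator (our own naming; the harmonic-average variant discussed below is included for comparison) could read:

```python
import numpy as np

def mle_id(X, k=10):
    """Maximum likelihood id estimate with k neighbors. Returns both the
    arithmetic average of the local estimates and the harmonic-average
    variant (inverse of the mean inverse local estimate)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    D.sort(axis=1)
    T = D[:, 1:k + 1]                       # distances to the k nearest neighbors
    # local estimate: inverse mean log-ratio of the k-th to the j-th distance
    m_local = 1.0 / np.log(T[:, -1:] / T[:, :-1]).mean(axis=1)
    m_avg = m_local.mean()                  # arithmetic average
    m_harm = 1.0 / (1.0 / m_local).mean()   # harmonic-average variant
    return m_avg, m_harm
```

On a 5-dimensional Gaussian sample both combinations return values close to 5, with the harmonic average typically slightly lower.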
The theoretical stability of the proposed estimator, for data living in a submanifold of and for data in an affine subspace of , has been proved, respectively, in [103, 104]. Though the authors’ comparative evaluation shows the superior performance of the proposed estimator with respect to the estimator  (see Section 3.2.1) and the estimator , they further improve it by removing its dependency on the parameter ; to this end, different values of are adopted and the computed results are averaged to obtain the final estimate: .
Considering that, in practice, the estimator is highly biased for both large and small values of , a variant is proposed in , where the arithmetic mean is substituted with the harmonic average, leading to the estimator m̂ = [ (1/N) Σ_{i=1}^{N} m̂_k(x_i)^{−1} ]^{−1}, where m̂_k(x_i) denotes the local estimate at x_i.
Though the proposal in  seems to achieve more accurate results, it is based on the assumption that the neighbors surrounding each point are independent, which is clearly incorrect. To cope with this problem, in  an interesting regularized version of applies a regularized maximum likelihood technique to the distances between neighbors. The comparative evaluation with the aforementioned methods [27, 102] leads the authors to state that, though their method might be the first to converge to the actual value given enough data points, its estimation accuracy is comparable to that achieved by the competing schemes.
Despite the good results achieved by -based approaches, these techniques have been shown to be affected by the curvature induced by on the manifold neighborhoods approximated by . To reduce this effect, various estimators have been proposed in [76, 107]; here, we review those achieving the most promising experimental results.
In  the authors firstly propose a family of estimators (), which exploit the distribution describing the distance between the center of and its nearest neighbor. Briefly, formulating this distribution as a function of the value (), the estimator is computed by a maximum likelihood approach.
After noting that this algorithm is still affected by a bias causing underestimation when the dataset dimensionality becomes sufficiently high (i.e., ), the authors present theoretical considerations relating the bias to the fact that estimators based on nearest-neighbor distances are often founded on statistics derived under the assumption that the amount of available data is unlimited, which is never the case in practical applications. Based on these considerations, two different estimators, named and , are presented.
compares the empirical of the neighborhood distances computed on the dataset () with the distribution of the neighborhood distances computed from points uniformly drawn from hyperspheres of known increasing dimensionality (). The estimate is the dimensionality that minimizes their Kullback-Leibler divergence , which is evaluated by means of the data-driven technique proposed in .
relies on the authors’ observation that the quantities are distributed according to the beta distribution with parameters and , respectively. Therefore, since , a consistent estimator equals

To reduce the effect of the aforementioned bias, finally applies an asymptotic correction step that, inspired by the correction method presented in , models the underestimation error by considering both the base algorithm and the given dataset.
Motivated by the promising results achieved by , in  the authors propose its extension, namely, ; it further reduces the underestimation effect by combining an estimator employing normalized nearest-neighbor distances with one employing mutual angles. More precisely, compares the statistics estimated on with those estimated on (uniformly drawn) synthetic datasets of known . The comparisons are performed by two Kullback-Leibler divergences applied to the distribution of normalized nearest-neighbor distances , having , and the distribution of pairwise angles , being the von Mises-Fisher distribution ()  with parameters and .
The estimate is the one minimizing the sum of the two divergences:

A fast implementation of (Fast-) is also developed. Comparative evaluations show that this algorithm achieves promising results (as shown in  and Section 4).
Another notable work is proposed in : the authors not only prove the consistency in probability of the presented estimators but also derive upper bounds (see (19) below) on the probability of the estimation error for finite, and large enough, values of . More precisely, the authors introduce two estimators by firstly defining a function slowly varying near the origin (see  for a detailed description and motivation of this assumption). The function is then used to express the logarithm of the probability of a point being in the hypersphere : , having .
Considering that for big enough, the authors derive the following system of equations:

and solve it for to obtain a local estimate:

The two proposed estimators are then computed either by averaging () or by voting ():

where denotes the number of points for which is verified.
Under differentiability assumptions on the function and regularity assumptions on , the authors prove the consistency in probability of their estimators and provide upper bounds (see (19)) on the probability of the estimation error for finite, and large enough, values of . However, the derived bounds depend on unknown universal constants :
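The local estimate above can be read, in simplified form, as comparing nearest-neighbor radii at two scales. The following is a rough numpy sketch of one such two-scale rule together with the two combination strategies (averaging and voting); it is our own simplification under assumed parameter choices, not the authors' exact equations:

```python
import numpy as np

def two_scale_local_id(X, k=20):
    """Local id estimates from the ratio of the k-th to the (k/2)-th
    nearest-neighbor distance: d(x) = ln 2 / ln(r_k(x) / r_{k/2}(x))."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    D.sort(axis=1)                 # column j: distance to the j-th neighbor
    r_k, r_half = D[:, k], D[:, k // 2]
    return np.log(2.0) / np.log(r_k / r_half)

def combine(local, d_max=20):
    """Combine local estimates by averaging and by majority vote on the
    nearest integer dimension."""
    avg = local.mean()
    votes = np.bincount(np.clip(np.rint(local).astype(int), 0, d_max))
    return avg, int(votes.argmax())
```

The voting rule returns an integer dimension, which is often the desired output, while the average retains fractional information.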
3.3. Graph-Based Estimators
As noted in , the work of  has made clear that graph theory can be applied to solve a variety of statistical problems; indeed, various types of graph structures have also been proposed [29, 96, 111, 112] and used for estimation. Examples are the graph () , the Minimum Spanning Tree ()  and its variation, the geodesic () , and the sphere of influence graph ()  together with its generalization, the generalized sphere of influence graph () .
Given a sample set , a graph built on , usually denoted by , employs the sample points as nodes (vertices) and connects them with weighted arcs (edges) .
A is built by employing a distance function, which commonly is the Euclidean one, to weight the arcs connecting each to its s.
A is the spanning tree minimizing the sum of the edge weights. When the weights approximate geodesic distances , a is obtained.
A is defined by connecting nodes and iff , where is the distance between and its nearest neighbor in . Essentially, two vertices and are connected if the corresponding hyperspheres intersect. A generalization of is , where nodes and are connected iff , being the distance between and its in . This means that the hyperspheres centered on and intersect.
In the following we recall interesting estimators based on , , and .
In [29, 30], after defining the length functional , , used to build either the or the of , graph theory is exploited to estimate both the of the underlying manifold and its intrinsic Rényi -entropy . To this aim, the authors derive the linear model , , , being an unknown constant, and exploit it to define an estimator of both quantities. Briefly, a set of cardinalities is chosen and, for each , the is constructed on the set , which contains points randomly sampled from , to obtain a set of pairs . Fitting them with a least squares procedure, the estimates and are computed. Recalling that , the is calculated as . This process is iterated to produce the final estimate as the average of the obtained results.
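With edge weights raised to the power 1 the model gives slope a = (d − 1)/d, so d = 1/(1 − a); the subsample-and-fit loop can then be illustrated with a small numpy sketch (Prim's algorithm for the MST; names, sizes, and the choice of Euclidean rather than geodesic weights are ours):

```python
import numpy as np

def mst_length(D):
    """Total edge length of the MST of a distance matrix D (Prim, O(n^2))."""
    n = len(D)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = D[0].copy()            # min distance from the tree to each node
    total = 0.0
    for _ in range(n - 1):
        best[in_tree] = np.inf
        j = int(np.argmin(best))
        total += best[j]
        in_tree[j] = True
        best = np.minimum(best, D[j])
    return total

def gmst_id(X, sizes, n_rep=5, seed=0):
    """Fit log(MST length) against log n over random subsamples of the
    given sizes; the slope a yields the id estimate 1 / (1 - a)."""
    rng = np.random.default_rng(seed)
    logL = []
    for n in sizes:
        tot = 0.0
        for _ in range(n_rep):
            S = X[rng.choice(len(X), size=n, replace=False)]
            D = np.sqrt(((S[:, None, :] - S[None, :, :]) ** 2).sum(-1))
            tot += mst_length(D)
        logL.append(np.log(tot / n_rep))
    a = np.polyfit(np.log(sizes), logL, 1)[0]
    return 1.0 / (1.0 - a)
```

Averaging over several random subsamples per cardinality, as above, mirrors the iteration used to stabilize the final estimate.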
The aforementioned -based algorithm [29, 30] is exploited in , where the authors consider datasets sampled from a union of disjoint manifolds with possibly different s. To estimate the local s, the authors propose a heuristic, not described here, to automatically determine local neighborhoods with similar geometric structures without any prior knowledge of the number of manifolds, their s, or their sampling distributions.
In  the authors present three estimation approaches, defined as “graph theoretic methods” since the statistics they compute are functions only of graph properties (such as vertex degrees and vertex eccentricities) and do not directly depend on the interpoint distances.
The first statistic, denoted by in the following, is based on the reach of vertices in the (the reach in steps of a node is the total number of vertices connected to it by a path composed of arcs or fewer in ). Considering that the reach of each vertex grows as the increases, in  the average reach in steps of vertices in is employed: .
The second statistic, denoted by , is computed by considering the degree of vertices in the . Recalling that, for datasets obtained from a continuous distribution on , the ratio of nodes with a given degree in converges a.s. to a limit depending only on and  and that the average degree in a tree is a constant depending only on the number of vertices, the authors empirically observe a dependency between the average degree and the . This leads to the definition of an estimator employing the statistic .
The third statistic, denoted by , is motivated by studies in the literature  showing that the expected number of neighbors shared by a given pair of points depends on the of the underlying manifold. Accordingly, calling the number of samples in the intersection of the two hyperspheres centered on and , intuitions similar to those considered for lead to the definition of .
Based on their theoretical results and on empirical tests on synthetically generated datasets characterized by values in a finite range (where in the reported experiments), the authors propose an approximate Bayesian estimator that can employ any of the three statistics , , and , denoted by in the following. To this aim, they assume that each statistic can be approximated by a Gaussian density ; to estimate its parameters and , for each , datasets of large size are synthetically generated by random sampling from the uniform distribution on the unit -cube. These datasets are then used to estimate the parameters and defining the Gaussian approximation of the statistic computed on a generic sample set of size .
At this stage, given a new input dataset having unknown , the statistic is computed and used to calculate the approximated value , . Assuming equal a priori probability for all the , the posterior probability is given by

and employed to compute an “a posteriori expected value” of the :

The authors evaluate the performance of their methods on synthetic datasets, some of which have been used by similar studies in the literature , while the others (challenging ones) are proposed by the authors and involve manifolds with nonconstant curvature. The comparison of the achieved results with those obtained by the estimators proposed in [27, 33, 112, 118] leads to the conclusion that none of the methods performs well on all the tested datasets. However, graph theoretic approaches appear to behave better when manifolds of nonconstant curvature are processed.
This interesting comparison strengthens the need for a benchmark framework allowing an objective and reproducible comparative evaluation of estimators. For this reason, in Section 4 we describe our proposal in this direction.
4. A Benchmark Proposal
At present, an objective comparison of different estimators is not possible due to the lack of a standardized benchmark framework; therefore, in this section, after recalling experimental datasets and evaluation procedures introduced in the literature (see Sections 4.1 and 4.2), we choose some of them to propose a benchmark framework (see Section 4.3) that allows for reproducible and comparable experimental setups. The usefulness of the proposed benchmark is then shown by employing it to compare relevant state-of-the-art estimators whose code is publicly available (see Section 4.4).
4.1. Datasets

The datasets employed in the literature comprise both synthetically generated and real datasets. In the following sections we describe those we chose for our benchmark study.
4.1.1. Synthetic Datasets
Synthetic datasets are generated by drawing samples from manifolds () linearly or nonlinearly embedded in higher dimensional spaces.
The publicly available tool (http://www.mL.uni-saarland.de/code/IntDim/IntDim.htm) proposed by Hein and Audibert in  allows us to generate several kinds of synthetic datasets by uniformly drawing samples from manifolds of known ; they are schematically described in Table 1, where they are referred to as . These manifolds are embedded in higher dimensional spaces through both linear and nonlinear maps and are characterized by different curvatures. We note that manifold is particularly challenging because of its high curvature; indeed, when it is used for testing, most relevant estimators produce pronounced overestimates (see also the results reported in ).
Another interesting dataset  is generated by sampling a -dimensional paraboloid, , nonlinearly embedded in a higher () dimensional space, according to a multivariate Burr distribution with parameter . Tests on this dataset are particularly challenging since the underlying manifold is characterized by a nonconstant curvature.
To perform tests on datasets generated by employing a smooth nonuniform , we propose the dataset , obtained as follows: we sample points in according to a beta distribution with parameters and , respectively (high skewness), and store them in a matrix ; we multiply each point of () by , thus obtaining a matrix ; we multiply each point of by , thus obtaining another matrix ; we append and to generate a matrix ; finally, we append to its duplicate to generate a test dataset containing points in .
To further test estimators’ performance on nonlinearly embedded manifolds of high , we propose to generate two datasets, referred to as and in the following (a tool to generate the datasets sampled from -dimensional paraboloids, the dataset, the dataset, and the dataset is available at http://security.di.unimi.it/~fox721/dataset_generator.m). Precisely, to generate we uniformly draw points in , we transform each point by means of , where , we obtain points in by appending each transformed to , and we duplicate the coordinates of each point to finally generate points in . The of is , and its points are drawn from a manifold nonlinearly embedded in . To generate , containing points in , we apply the same procedure to vectors sampled in .
4.1.2. Real Datasets
Real datasets employed in the literature generally concern problems in the fields of image analysis, signal processing, time series prediction, and biochemistry. Among them, the best known and most used datasets are the face database , the database , the dataset , the  dataset, and the time series . Recently, the Crystal Fingerprint space for the chemical compound silicon dioxide has also been proposed .
The face database consists of gray-level images of size depicting the face of a sculpture. This dataset has three degrees of freedom: two for the pose and one for the lighting direction (see Figure 2(a)).
The database consists of gray-level images of size of hand-written digits (see Figure 2(b)). The real of this database is not actually known, but some works [79, 122] propose similar estimates for the different digits; as an example, the proposed values for the digit “” are in the range .
The dataset has been generated as follows: subjects spoke the name of each letter of the alphabet twice, thus producing about training examples per speaker, for a total of samples. The speakers are grouped into sets of speakers each, referred to as , , , , and . The real value characterizing this dataset is not actually known, but a study reported in  suggests that the correct estimate could lie in the range .
The version of the dataset is a time series of one-dimensional points having nine degrees of freedom () and generated by a simulation of particle motion. In order to estimate the attractor dimension of this time series, it is possible to employ the method of delays described in , which generates -dimensional vectors by partitioning the original series into blocks of consecutive values; as an example, by choosing , a dataset containing points in is obtained.
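The delay construction described above amounts to reshaping the scalar series into blocks of consecutive values; a trivial sketch, with our own names:

```python
import numpy as np

def method_of_delays(series, E):
    """Partition a scalar time series into non-overlapping blocks of E
    consecutive values, yielding E-dimensional delay vectors."""
    series = np.asarray(series)
    n = (len(series) // E) * E          # drop the incomplete final block
    return series[:n].reshape(-1, E)
```

For instance, a series of length 10 with E = 3 yields 3 three-dimensional vectors, discarding the leftover sample.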
is a time series composed of samples measured from a hardware realization of Chua’s circuit . Employing the method of delays with , a dataset containing points in is obtained. The characterizing this dataset is ~2.26 .
Crystal Fingerprint spaces, or Crystal Finger spaces, have recently been proposed in crystallography  with the aim of representing crystalline structures; these spaces are built starting from the measured distances between atoms in the crystalline structure. The theoretical of one Crystal Finger space corresponds to the crystal degrees of freedom, where is the number of atoms in the crystalline unit cell.
4.2. Experimental Procedures and Evaluation Measures
In the literature, two approaches have mainly been used to assess estimators on datasets of known .
The first one subsamples the test dataset to obtain subsets of fixed cardinality and computes the percentage of correct estimations. To analyze estimators’ behavior with respect to the cardinality of input datasets, this procedure may be repeated by using different cardinality values [29, 30, 79, 122], thus obtaining a distinct performance evaluation measure for each cardinality.
The second approach estimates the on permutations of the same dataset and averages the estimates to obtain the final one [27, 76, 107, 126]. This value is then compared with the real one to assess the estimator.
To also test the estimator’s robustness with respect to its parameter settings, in [27, 107, 126] the authors apply a further test, originally proposed by Levina and Bickel in . Precisely, sample sets with different cardinalities are drawn from the standard Gaussian in and, for each set, the estimator is applied varying the values of its parameters in fixed ranges; this allows us to analyze the behavior of the estimate as a function of both the dataset’s cardinality and the parameter settings.
Note that, since estimators are usually tested on different datasets to evaluate their reliability when confronted with different dataset structures and configurations, in  an overall evaluation measure is proposed. This indicator, called Mean Percentage Error (), summarizes all the obtained results in a unique value computed as

MPE = (100 / M) Σ_{i=1}^{M} |d̂_i − d_i| / d_i,

where M is the number of tested datasets, d̂_i is the estimated on the th dataset, and d_i is its real . To apply this technique to real datasets whose is only known to belong to a range, the same authors propose to compute the associated term by taking d_i as the mean of the range.
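The indicator is straightforward to compute; a minimal sketch (our naming):

```python
import numpy as np

def mean_percentage_error(d_est, d_true):
    """MPE = (100 / M) * sum_i |d_est_i - d_i| / d_i over M datasets.
    For a real dataset whose id is only known to lie in a range, pass
    the mean of that range as d_i, as suggested above."""
    d_est = np.asarray(d_est, dtype=float)
    d_true = np.asarray(d_true, dtype=float)
    return 100.0 * np.mean(np.abs(d_est - d_true) / d_true)
```

For example, estimates (2, 5) against true values (2, 4) give an MPE of 12.5.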
In this section we propose an evaluation approach that can be used as a standard framework to assess estimator performance, and we employ it to compare relevant estimators whose code is publicly available. In this benchmark, we suggest using the following estimators (see Section 3): , , , , , , , and (the source code of the mentioned methods is available at : http://www.mL.uni-saarland.de/code.shtml, : http://dept.stat.lsa.umich.edu/~elevina/mledim.m, : http://web.eecs.umich.edu/~hero/IntrinsicDim/, : http://www.math.duke.edu/~mauro/code.html#MSVD, : http://research.microsoft.com/en-us/um/cambridge/projects/infernet/blogs/bayesianpca.aspx, : http://cseweb.ucsd.edu/lvdmaaten/dr/download.php, , and : http://www.mathworks.it/matlabcentral/fileexchange/40112-intrinsic-dimensionality-estimation-techniques). Note that these estimators cover all the groups described in Section 3, that is, Projective, Fractal, Nearest Neighbors-based, and Graph-based estimators.
The benchmark is composed of the following steps:

(1) Test all the considered estimators on both the synthetic and real datasets described below. We highlight that the synthetic datasets whose is a user-defined parameter should be created with sufficiently high values ().

(2) Perform the comparative evaluation as follows:
(a) compute the indicator for both synthetic and real datasets;
(b) compute a ranking test with control methods; to this aim, we suggest the Friedman test with Bonferroni-Dunn post hoc analyses ;
(c) perform the tests proposed in  to evaluate robustness with respect to different cardinalities and parameter settings.
The synthetic datasets used in the benchmark, referred to as in the following, are listed in Table 2 with their relevant characteristics (, , and ). The first datasets are generated with the tool proposed in ; they include instances, , of dataset , which are drawn from after its embedding in by setting . Note that we did not include the dataset sampled from (see Table 1) since relevant and recent estimators have produced similarly and highly overestimated results when tested on it . Indeed, dealing with highly curved manifolds remains quite a challenging problem in the field.