Applied Mathematics and Statistical Mechanics and their Applications
Computing Simplicial Depth by Using Importance Sampling Algorithm and Its Application
Simplicial depth (SD) plays an important role in discriminant analysis, hypothesis testing, machine learning, and engineering computations. However, computing SD is highly challenging because the exact algorithm is an NP problem with the dimension and sample size as input arguments, and existing approximate algorithms have extremely low efficiency, especially in high-dimensional cases. In this study, we design an importance sampling algorithm for the computation of SD. As an advanced Monte Carlo method, the proposed algorithm outperforms other approximate and exact algorithms in accuracy and efficiency, as shown by simulated and real data experiments. Furthermore, we illustrate the robustness of SD in regression analysis through a concrete physical data experiment.
With the development of computer technology and multivariate statistical analysis, scientists deal with a large amount of multidimensional data in many fields, such as biogenetics and industrial engineering. The demand for multivariate data analysis tools has become increasingly urgent. As a powerful multivariate nonparametric and robust statistical tool, the statistical depth function extends the concept of one-dimensional data order statistics and provides the central-outward sorting of multivariate data [1–4]. In recent years, the interest of researchers in statistical depth has increased due to the extensive application of the statistical depth function in multivariate statistical analysis, robust estimation, discriminant analysis, hypothesis testing, machine learning, economics, and hydrological data analysis [5, 6].
The first statistical depth function, proposed by Tukey in 1975, is the halfspace depth (also known as the Tukey depth) [7–9]. Other concepts of the statistical depth function include the projection depth [3, 10], simplicial depth (SD) [11, 12], and regression depth [13, 14]. Zuo and Serfling defined the general structural properties of the statistical depth function. Among the many depth concepts, SD is a relatively attractive one not only because of its simple form, its ability to achieve the maximum depth value at the center, and its monotonicity but also because of its important applications in the sign test and centralization test [1, 12].
However, the computation of SD is complicated. The exact calculation of SD is an NP problem, which is feasible only when the dimension is no higher than three. Serfling and Wang emphasized that the computation of SD for higher-dimensional data still requires further study. The computation and application of the statistical depth function remain active research topics.
Meanwhile, Monte Carlo (MC) methods have become important statistical and computational tools that are widely used in finance, engineering computation, genetic biology, computational chemistry, and other related fields [15–18]. As a critical MC strategy, the importance sampling (IS) method concentrates most of the test samples in the important area of the objective function by introducing a transfer probability density function [15, 19]. This method dramatically improves the computational efficiency and is an important MC acceleration algorithm. In this study, we apply an efficient IS algorithm to the approximate computation of SD and demonstrate the advantages of such an algorithm over other MC methods and exact algorithms through simulated and real data examples. Furthermore, we extend SD to regression analysis and obtain a robust estimation for regression. The results of a real physical data experiment show that the estimation based on the SD method is more robust than that based on the traditional least squares (LS) method.
The remainder of this paper is organized as follows. In Section 2, we review the preliminary concept and existing algorithms for SD. Section 3 describes the IS algorithm used for the computation of SD. The advantages of the IS algorithm are illustrated through simulated data examples in Section 4. The extension of SD to regression analysis and a real data experiment are presented in Section 5. Lastly, the conclusions are provided in Section 6.
2. Preliminaries of SD and the State of the Art
In this section, we present the preliminaries of SD and the existing algorithms for its computation.
Consider a sample set 𝒳 = {X_1, X_2, …, X_n} in ℝ^d, where X_i is one sample of the set of size n in ℝ^d and x is a given point in ℝ^d. The sample version SD_n(x) of the SD of x with respect to the sample set 𝒳 is expressed as

SD_n(x) = C(n, d+1)^(-1) Σ_{1 ≤ i_1 < ⋯ < i_{d+1} ≤ n} I(x ∈ S[X_{i_1}, …, X_{i_{d+1}}]),  (1)

where I(A) denotes the indicator function of the event A, C(n, d+1) is the binomial coefficient, and S[X_{i_1}, …, X_{i_{d+1}}] denotes the simplex determined by the d + 1 sample points X_{i_1}, …, X_{i_{d+1}}.
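To make the definition concrete, the containment test and the exact enumeration can be sketched as follows (a minimal Python sketch, not the authors' R/MATLAB implementation; the barycentric-coordinate test, the tolerance, and all function names are our own illustrative choices):

```python
import numpy as np
from itertools import combinations
from math import comb

def in_simplex(x, vertices, tol=1e-12):
    """Check whether x lies in the (closed) simplex spanned by d+1 vertices
    in R^d by solving for its barycentric coordinates."""
    V = np.asarray(vertices, dtype=float)          # shape (d+1, d)
    d = V.shape[1]
    # Barycentric coordinates lam solve [V^T; 1...1] lam = [x; 1]
    A = np.vstack([V.T, np.ones(d + 1)])
    b = np.append(np.asarray(x, dtype=float), 1.0)
    lam = np.linalg.solve(A, b)
    return np.all(lam >= -tol)                     # all coords nonnegative

def simplicial_depth_exact(x, X):
    """Exact sample simplicial depth by enumerating all (d+1)-subsets."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    hits = sum(in_simplex(x, X[list(idx)])
               for idx in combinations(range(n), d + 1))
    return hits / comb(n, d + 1)
```

This brute-force enumeration is exactly why the exact computation is infeasible beyond low dimensions: the number of simplices grows as C(n, d+1).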
Serfling and Wang stated that, when the dimension is higher than three, no known algorithm is faster than simply generating all simplices and counting the ones enclosing the given point (using O(n^(d+1)) time). Therefore, designing an efficient approximate algorithm for the computation of SD is necessary.
A direct MC method for the computation of SD consists of two steps: (1) randomly select d + 1 points from 𝒳 and check whether the resulting simplex encloses the given point x, and (2) take the average of the indicators over the random simplices (i.e., use the sample fraction to estimate the true SD value SD_n(x)):

SD_MC(x) = (1/N) Σ_{j=1}^{N} I(x ∈ S[X_{j_1}, …, X_{j_{d+1}}]),  (2)

where each simplex S[X_{j_1}, …, X_{j_{d+1}}] is randomly chosen from 𝒳 and N is the trying number for the estimation.
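The two-step MC estimator above can be sketched as follows (a hedged Python sketch, not the authors' code; the barycentric containment test and the seeding are our own implementation choices):

```python
import numpy as np

def mc_simplicial_depth(x, X, N=1000, rng=None):
    """Plain Monte Carlo estimate of SD: draw N random (d+1)-subsets and
    average the indicator that the resulting simplex contains x."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    n, d = X.shape
    hits = 0
    for _ in range(N):
        idx = rng.choice(n, size=d + 1, replace=False)
        V = X[idx]
        # barycentric coordinates of x w.r.t. the simplex vertices
        A = np.vstack([V.T, np.ones(d + 1)])
        b = np.append(x, 1.0)
        try:
            lam = np.linalg.solve(A, b)
        except np.linalg.LinAlgError:      # degenerate (flat) simplex
            continue
        hits += np.all(lam >= -1e-12)
    return hits / N
```

In high dimensions most random simplices miss x, so almost every indicator is zero and the estimator's relative error blows up; this is the inefficiency the IS method targets.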
Another approach for the computation of SD is the use of the IS algorithm, which is the proposed method in this study. The computation of SD is an expectation computation. Therefore, SD can be estimated by the IS algorithm. The simple MC method uses the randomly selected points to estimate the SD, whereas the IS approach selects points with a high probability that they contain the given point . Theoretically, the results of the latter will have a smaller variance than those of the former. The simulated data examples in Section 4 illustrate the advantage of the IS algorithm over the MC method.
3. New Algorithm for SD in ℝ^d
3.1. Overview of the IS Algorithm
Many engineering problems can be expressed as the computation of a multidimensional integral. Using the MC method to compute the integral involves drawing samples from a uniform distribution on a regular area and using the sample mean to approximate the true integral. In higher-dimensional cases, the efficiency of the MC method is extremely low if the region where the target function is nonzero is extraordinarily sparse. By contrast, the IS algorithm draws most samples in the important area. This strategy improves the efficiency of the integral computation. The IS algorithm plays an important role in the fields of statistical physics, molecular simulation, and Bayesian statistics.
For example, suppose we want to compute the integral of a function f on a region D; that is,

θ = ∫_D f(x) dx,  (3)

and the integral computation (3) can be treated as an expectation calculation:

θ = E[f(X)/p(X)],  (4)

where X is a random variable (r.v.) with its own probability density function (p.d.f.) p(x) supported on D; that is, X ∼ p. If x_1, …, x_N denote samples of size N from p, the sample mean can be used to estimate the expectation. The MC method draws from the uniform distribution on the region D; from the Law of Large Numbers, the expectation in (4) can be estimated as

θ_MC = (|D|/N) Σ_{i=1}^{N} f(U_i),  (5)

where |D| is the area of D and U_i is an r.v. from the uniform distribution on D (i = 1, …, N).
However, the efficiency of the MC method (5) will be extremely low if the region where f is nonzero is extremely wide or sparse (especially in high-dimensional cases). By contrast, the IS method uses a special p.d.f. g(x) instead of p(x) in (4) to compute the mean and utilizes the corresponding sample mean to estimate the expectation in (4):

θ_IS = (1/N) Σ_{i=1}^{N} f(Y_i)/g(Y_i),   Y_i ∼ g,  (6)

with variance Var(θ_IS) = (1/N) Var(f(Y)/g(Y)), which means that we can choose g appropriately close to f/θ to reduce the variance of θ_IS. In the extreme situation in which we select g(x) = f(x)/θ (where f(x) ≥ 0), the variance of θ_IS drops to zero, and θ_IS is equal to the exact value θ. However, we cannot directly use the IS method defined in (6) in such an extreme situation because we do not know the exact value of θ in advance. Nevertheless, it gives us a significant hint: the closer g and f/θ are, the more accurate the IS method's result is. The steps of the IS method for the computation of integral (3) are listed as follows: (1) draw the samples y_1, …, y_N from g; (2) compute the importance weights w_i = f(y_i)/g(y_i); (3) use the mean of the computed weights to estimate the integral in (3):

θ_IS = (1/N) Σ_{i=1}^{N} w_i.  (7)
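A small numerical illustration of steps (1)–(3) (a hedged sketch; the target integrand, the trial density Beta(5, 1), and the sample size are our own illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Target: theta = integral over [0, 1] of f, with f the indicator of (0.95, 1].
# The true value is 0.05, and the "important area" is a thin sliver near 1.
f = lambda x: (x > 0.95).astype(float)

# Plain MC: uniform samples rarely land in the important region.
u = rng.uniform(0.0, 1.0, N)
theta_mc = f(u).mean()

# IS: sample from g(x) = 5 x^4 (the Beta(5, 1) density), which piles mass
# near 1, and correct each sample with the weight f(y) / g(y).
y = rng.beta(5.0, 1.0, N)
w = f(y) / (5.0 * y**4)
theta_is = w.mean()
```

Both estimates are unbiased for 0.05, but the IS weights have a markedly smaller variance because far more samples land where f is nonzero.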
The following theorem shows that the IS estimator in (7) is unbiased.
Theorem 1. The IS estimator θ_IS in (7) is an unbiased estimator of θ.
Proof. To prove that the IS estimator is unbiased, we only need to show that the expectation of θ_IS is equal to θ:

E[θ_IS] = E[(1/N) Σ_{i=1}^{N} f(Y_i)/g(Y_i)] = E[f(Y)/g(Y)].  (8)

Because Y is an r.v. with p.d.f. g,

E[f(Y)/g(Y)] = ∫_D (f(y)/g(y)) g(y) dy = ∫_D f(y) dy = θ.  (9)

We obtain E[θ_IS] = θ, which verifies that the IS estimator in (7) is unbiased. This completes the proof of the theorem.
Aside from being an unbiased estimator of the integral presented in (3), the IS estimator exhibits a more efficient and powerful integral computation than the MC estimator defined in (5), especially in higher-dimensional cases.
3.2. IS Algorithm for SD Computation
We use the previously described IS method to compute the SD. Directly using the definition of SD in (1) is impractical for computing the SD value of a data point with respect to a dataset, and the direct MC method of Section 2 becomes extremely inefficient when the dimension d or sample size n is excessively large because the proportion of simplices containing the given data point decreases with the increase in d or n.
The IS algorithm transforms the original p.d.f. into a highly efficient one that tends to construct simplices containing the given data point. In the computation of SD, the MC method randomly selects the data points used to construct the simplex, whereas the IS method chooses data points that are likely to leave the given data point inside the simplex. Figure 1 shows a 2D example composed of 20 sample data points, in which the SD value of a given point x is computed. After two data points are sampled, only a small number of the remaining points can serve as the final vertex of a triangle that contains x. In this illustrated example, we do not need to examine all the simplices after the first two vertices are obtained; only the candidate points that place x inside the triangle are considered as the final vertex of the simplices containing x.
We now list the details of the IS algorithm for the computation of SD in high-dimensional cases. Suppose that 𝒳 = {X_1, …, X_n} is a sample of size n in ℝ^d and x is a given point in ℝ^d. The data points are assumed to be in general position (i.e., any d data points define a unique (d − 1)-dimensional hyperplane in ℝ^d). The procedure of using the IS algorithm to compute SD (i.e., the computation of SD_n(x)) is summarized as follows:
(1) Set the IS parameters, including the number of sample tries N.
(2) For j = 1, …, N, compute the importance weight w_j for the j-th sample try.
(i) Randomly choose d sample points from 𝒳, and denote them as X_{j_1}, …, X_{j_d}.
(ii) For k = 1, …, d, compute the simplex data point set 𝒮_k (i.e., the set of the possible data points that can construct, together with the chosen points, a simplex containing the given data point x). Replace the k-th chosen data point X_{j_k} with x to obtain a dataset of size d (i.e., {X_{j_1}, …, X_{j_{k−1}}, x, X_{j_{k+1}}, …, X_{j_d}}). Compute the unique direction u_k that is perpendicular to the hyperplane determined by this dataset. Project all data points and X_{j_k} along u_k, and compute the projected values t_m = u_k^T (X_m − x), where t_{j_k} = u_k^T (X_{j_k} − x) is the projected value of X_{j_k} along u_k. Compute the simplex data point set 𝒮_k = {X_m : t_m t_{j_k} < 0}, the points on the opposite side of the hyperplane from X_{j_k}.
(iii) Let 𝒮 = 𝒮_1 ∩ ⋯ ∩ 𝒮_d, and set w_j = |𝒮|/(n − d), where |𝒮| is the number of data points in 𝒮.
(3) The sample mean of the weights can be treated as the IS estimator of SD_n(x); that is,

SD_IS(x) = (1/N) Σ_{j=1}^{N} w_j.  (10)
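Under the reading of steps (i)–(iii) above, one try draws d points, intersects the d half-space conditions, and uses the fraction of admissible final vertices as the weight. A Python sketch under that interpretation (the SVD-based normal computation and all names are our own choices, not the authors' code):

```python
import numpy as np

def is_simplicial_depth(x, X, N=1000, rng=None):
    """IS-style estimate of SD: draw d points, then average, over the
    remaining points, the exact fraction that completes a simplex
    containing x (one conditional-probability weight per try)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    n, d = X.shape
    total = 0.0
    for _ in range(N):
        idx = rng.choice(n, size=d, replace=False)
        P = X[idx]                           # the d chosen vertices
        rest = np.delete(np.arange(n), idx)  # candidate final vertices
        ok = np.ones(rest.size, dtype=bool)
        for k in range(d):
            # hyperplane through {X_{j_1},...,X_{j_d}} with the k-th
            # vertex replaced by x
            H = P.copy()
            H[k] = x
            E = H[1:] - H[0]                 # (d-1, d) edge vectors
            _, _, Vt = np.linalg.svd(E)      # u spans the null space of E
            u = Vt[-1]
            side_k = u @ (P[k] - x)          # side of the removed vertex
            t = (X[rest] - x) @ u            # side of each candidate
            ok &= t * side_k < 0             # keep the opposite side
        total += ok.sum() / (n - d)
    return total / N
```

Each weight equals the conditional probability that a uniformly chosen final vertex yields a simplex containing x, so the estimator averages fractions in [0, 1] rather than raw 0/1 indicators, which is where the variance reduction comes from.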
Theorem 2. The computational complexity of using the IS algorithm to calculate SD is

O(N(d^4 + nd^2)),  (11)

where N is the number of sample tries of the IS algorithm, d is the dimension of the sample data, and n is the sample size.
Proof. According to the steps for computing SD using the IS algorithm, we need to compute every w_j for j = 1, …, N. For every w_j, each of the d selected sample data points must be replaced in turn. The computational complexity of finding the unique direction perpendicular to a hyperplane is O(d^3), whereas that of projecting all data points onto this direction is O(nd). The total computational complexity is therefore O(Nd(d^3 + nd)) = O(N(d^4 + nd^2)). This completes the proof of the theorem.
Theorem 2 shows that the computational complexity of the IS algorithm for the computation of SD is a polynomial with the dimension and sample size as its input arguments. By contrast, all exact algorithms for the computation of SD are NP problems; in particular, when the dimension is higher than three, no known algorithm runs faster than simply generating all simplices and computing the exact SD value (i.e., using O(n^(d+1)) time). According to the definition of the IS algorithm in (7) and Theorem 1, the IS estimator defined in (10) is an unbiased estimator of SD_n(x).
4. Performance Comparison
This section presents simulated and real data examples of SD computation. All results were obtained using R (version x64 3.6.2) and MATLAB (R2017a) on a Lenovo K42-80 laptop computer (Intel(R) Core(TM) i7-6500U CPU @ 2.50 GHz, RAM 16.00 GB, Windows 10). The R and MATLAB codes for the results in this section are available upon request from the corresponding author.
4.1. 2D Simulated Data Example
In the simulated data experiment, we compare the computed SD results of the IS, exact, and approximate algorithms, including the MC method. The simulated dataset is sampled from a 2D multivariate normal distribution (i.e., N(0, I_2), where 0 is the 2D zero vector and I_2 is the 2 × 2 identity matrix), and the sample size is 100.
We used the exact algorithm, the MC method, and the IS algorithm to compute the SD of three selected points with respect to the dataset. The number of random simplices was set to 100 for the MC and IS algorithms. All computations were repeated 50 times. The computed results (mean, standard deviation (sd), and total CPU time (s)) are summarized in Table 1 and Figure 2.
Because there is an exact algorithm for the SD computation in the 2D case, we can evaluate the accuracy of the IS and MC methods through their mean and sd values. Moreover, the total CPU time consumed by each algorithm reflects its efficiency. Therefore, in this experiment, we use these three indicators (mean, sd, and total CPU time) to compare the performances of the exact, MC, and IS methods for the computation of SD.
The results reveal that (1) the exact algorithm consumes little CPU time (approximately 0.1 s); (2) the approximate algorithms (MC and IS) achieve accurate results because their means are extremely close to the exact value; (3) IS performs better than MC, as indicated by the smaller sd of the results of the former under the same CPU time; (4) all computed SD results from the exact and approximate algorithms are zero at the outermost point, which means that this point lies outside the data cloud; and (5) compared with the exact algorithm, the simulated example also indicates that the IS algorithm can obtain highly accurate results.
4.2. Higher-Dimensional Simulated Data Example
In this subsection, we compute the SD of different data points by using the MC and IS algorithms on 3D and five-dimensional simulated datasets. We did not use the exact algorithm because it cannot obtain any result within three hours.
In the 3D case, the dataset was sampled from N(0, I_3), and the sample size was 1000. We used the MC and IS methods to compute the SD of several selected points. We set the number of random simplices to 100 and repeated the computation 50 times. The computed results are summarized in Table 2 and Figure 3.
Because the exact algorithm cannot produce any computed SD results within three hours when the dimension is higher than two, we can only use the MC and IS methods for the computation of SD in this subsection. Three indicators (mean, sd, and total CPU time) are summarized for the evaluation of the approximate methods. The mean values can be regarded as the final computed SD results, and the sd reflects the accuracy of a method (the smaller, the more accurate). The total CPU time reflects the efficiency of a method: a method is more efficient if it consumes less CPU time for the same computation of SD.
Table 2 and Figure 3 indicate that (1) the computed SD results decrease as the data points move away from the center; a point closer to the center is deeper than one farther away with respect to the dataset; (2) the two methods have similar computational efficiencies because they consume almost the same total CPU time; and (3) the sd obtained by the IS method is smaller than that obtained by the MC method, which means that the former is more accurate than the latter in this case.
In the five-dimensional case, the dataset was sampled from N(0, I_5), and the sample size was 1000. We used the MC and IS methods to compute the SD of several selected points. The number of random simplices was 100, and the computations were repeated 50 times. The computed results (mean, sd, and total CPU time in s) are presented in Table 3 and Figure 4.
Table 3 and Figure 4 show that (1) the computed SD values decrease as the data points move away from the center, thereby suggesting that the nearer points are deeper than the farther ones; (2) the SD values in the five-dimensional examples are slightly smaller than those in the 3D examples because the sparsity of the data points increases when the dimension is increased from three to five; (3) the IS algorithm performs better than the MC approach, as indicated by the smaller sd of the results of the former; (4) the two approximate algorithms consume almost the same CPU time; and (5) even after using 100 random simplices, the MC algorithm cannot find any simplex containing the outermost point, whereas the IS algorithm can identify many such simplices. In conclusion, the IS method outperforms the MC method in terms of accuracy in these simulated examples.
We also evaluated the MC and IS methods with other numbers of random sample tries on different datasets. The findings show that the accuracy of the results increases with the number of random sample tries, and the IS method works with modest numbers of tries across many datasets. In our experiments, with the number of random sample tries set to N = 1000, the IS method obtained the computed SD results within one second when the dimension d ≤ 10 and the sample size n ≤ 10000.
5. Application to Regression and Real Data Example
One of the most important extensions of SD is the robust estimation of regression based on SD. To demonstrate the relevant concept, we consider the following linear regression model:

y = β_0 + β_1 x + ε,  (12)

where the random variables x and y are in ℝ, ε is the random error term, and β_0 and β_1 are unknown parameters.
Considering that SD_n(0) can measure the depth of the origin 0 with respect to a dataset, we extend the definition of SD to the regression model (12) and determine the simplicial regression depth:

RD(β) = SD_n(0; r_1(β), …, r_n(β)),  (13)

where β = (β_0, β_1) are the parameters, (x_i, y_i), i = 1, …, n, are the samples of the model defined in (12), and

r_i(β) = y_i − β_0 − β_1 x_i  (14)

is the residual based on the i-th sample and β.
The SD-based estimator of (12) can be defined as the maximizer of the simplicial regression depth; that is,

β_SD = arg max_β RD(β).  (15)
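For the simple linear model (12), the residuals are one-dimensional, so the simplicial regression depth above reduces to the 1D simplicial depth of the origin with respect to the residuals, i.e., the fraction of residual pairs whose segment covers zero. A hedged sketch (the coarse grid search is our own stand-in for the quasi-Newton + IS search used by the authors; residuals exactly equal to zero are ignored here):

```python
import numpy as np
from itertools import product

def regression_simplicial_depth(beta, x, y):
    """1D simplicial depth of the origin w.r.t. the residuals r_i(beta):
    the fraction of residual pairs [r_i, r_j] that straddle zero."""
    b0, b1 = beta
    r = y - b0 - b1 * x
    n_pos = np.sum(r > 0)
    n_neg = np.sum(r < 0)
    n = r.size
    return 2.0 * n_pos * n_neg / (n * (n - 1))

def sd_regression_fit(x, y, b0_grid, b1_grid):
    """Maximize the regression depth over a parameter grid (a crude
    illustration; the depth is piecewise constant in beta, so smooth
    gradient methods alone cannot be applied directly)."""
    best, best_depth = None, -1.0
    for b0, b1 in product(b0_grid, b1_grid):
        dep = regression_simplicial_depth((b0, b1), x, y)
        if dep > best_depth:
            best, best_depth = (b0, b1), dep
    return np.array(best), best_depth
```

The deepest fit balances positive and negative residuals, which is why a single grossly contaminated observation shifts the objective very little.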
We consider the physical experiment data concerning the relationship between atmospheric pressure and the boiling point of water, which were discussed by the Scottish physicist James D. Forbes. In the mid-nineteenth century, this experiment was used to examine whether the simple measurement of the boiling point of water could substitute for a direct reading of the barometric pressure. The dataset was collected in the Alps and in Scotland (Table 4 and Figure 5).
The linear regression model in (12) was used to fit the Forbes dataset. We used the LS and SD methods to estimate the parameters of the model in (12). The function "lm" in the R stats package can be used to determine the LS estimator of the model in (12). For the SD-based method, we combined the quasi-Newton and IS methods to find the maximum point of (15). Moreover, we performed three statistical tests (i.e., the R-square value, the normality test, and the goodness-of-fit test) for every fitted regression model to obtain a more insightful analysis. The R-square (or adjusted R-square) value gives the percentage of the variance of the dependent variable (y) that can be explained by the fitted model (see (12)). The normality test checks whether the residuals of the fitted model follow a normal distribution, which is the basis of the other statistical tests. For example, under the assumption of normality, the F statistic in the goodness-of-fit test can be used to determine whether the fitted regression model makes sense.
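The LS side of this comparison, together with the R-square value and the overall F statistic, can be sketched as follows (a minimal numpy sketch for illustration only; the paper's actual computations use R's "lm" and its built-in tests):

```python
import numpy as np

def ls_fit_with_r2(x, y):
    """Ordinary least squares fit y ~ b0 + b1 * x, returning the
    coefficients, the R-square value, and the F statistic of the
    overall regression."""
    X = np.column_stack([np.ones_like(x), x])      # design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    ss_res = np.sum((y - yhat) ** 2)               # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(y), 1                               # one predictor
    f_stat = (ss_tot - ss_res) / p / (ss_res / (n - p - 1))
    return beta, r2, f_stat
```

Because the LS objective squares every residual, a single contaminated point pulls these coefficients strongly, which is the behavior the contaminated-data experiment below exposes.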
We first used the LS method and SD approaches to compute the linear regression model with the original Forbes dataset (Table 4, denoted as original data in this section). The computed regression results are summarized in Table 5 and Figure 5(a); their corresponding statistical tests are summarized in Table 6 and Figure 6.
Table 5 and Figure 5(a) show that the LS and SD estimators obtained very similar intercept and slope parameters. This finding suggests that the SD method can produce regression results as accurate as those of the LS method.
The statistical test results confirm this finding because the results from the LS and SD methods were also very similar. Both have very high R-square values, which indicates that most of the variance of the dependent variable can be explained by the fitted models. Under the significance level 0.01, we accept the assumption of normality, and both models pass the goodness-of-fit test (i.e., the p value of the F statistic is almost zero). In addition, if a higher level of significance (such as 0.05) is required in this example, some statistical techniques (e.g., the Box-Cox transformation or the detection of strong influence points) can be used to improve the regression model. However, this is another research topic, and the number of sample points in this example is small; we focus only on the robustness of the regression models computed by the different methods, especially when the dataset is contaminated, which is what we examine in the next experiment.
In the following experiment, we worked with a contaminated version of the Forbes data. We intentionally changed the pressure of the 16th data point from 29.88 to 59.76. The new dataset is denoted as the contaminated data (Figure 5(b)). We compared the performances of the SD and LS methods in the linear regression model with the contaminated dataset. The regression results are presented in Table 5 and Figure 5(b), and their corresponding statistical tests are summarized in Table 6 and Figure 7.
The results show that the LS estimator is greatly influenced by the contaminated data point, whereas the SD estimator maintains satisfactory performance. The slope parameter estimated by the LS method changes from 0.5229 to 1.0266, which cannot reflect the actual variation trend of the pressure-temperature curve. By contrast, the SD estimator is not affected by the contaminated data point and still provides the actual variation trend: the slope parameters estimated by the SD method for the two datasets are 0.5086 and 0.5085, respectively. The statistical test results show that, under the influence of the contaminated data point, the residuals of the fitted models from the two methods do not pass the normality test. However, the R-square (or adjusted R-square) value from the SD method (0.9917) is much larger than that of the LS method (0.7650), which means that the regression line from the SD method can explain a larger percentage of the variance of the dependent variable than that from the LS method. These results imply that the SD estimator outperforms the LS estimator in terms of robustness in the contaminated dataset experiment.
The concept of statistical depth plays an important role in the mathematical sciences, engineering, regression analysis, and the life sciences. In this study, we computed the SD using the IS method and found that this new approach performs better than the exact and MC methods in terms of accuracy and efficiency. The simulated and real data examples illustrated the advantages of the new method. Finally, we tested the SD-based regression analysis through a concrete physical data example. The results indicated the excellent robustness of the proposed method compared with the LS estimation.
Given the many favorable properties of the proposed method, further research can be conducted from several angles. First, the IS parameter (i.e., the number of sample tries N) plays an important role in the computation of SD, so the determination of N before running the IS algorithm is yet to be thoroughly investigated. Second, the IS method for the SD computation may be improved by sampling the data points via other, more important simplices (not only via the final data point of the possible simplices). Third, with the development of modern computer science, multicore high-performance computers are gaining popularity; therefore, the IS method can be extended to a parallel-computation version. Lastly, approximate algorithms (advanced MC methods) for other statistical depths (e.g., the halfspace depth, projection depth, and regression depth) can be further explored.
Data Availability
The experimental data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was partially supported by the National Natural Science Foundation of China (11501320 and 11701318), the Natural Science Foundation of Shandong Province (ZR2021MA090 and ZR2019PG005), and the Postdoctoral Science Foundation of China (2020M682143).
J. W. Tukey, “Mathematics and the picturing of data,” in Proceedings of the International Congress of Mathematicians, vol. 2, pp. 523–531, Vancouver, Canada, August 1974.
J. S. Liu, Monte Carlo Strategies in Scientific Computing, Springer, New York, USA, 2001.
F. Liang, C. Liu, and R. J. Carroll, Advanced Markov Chain Monte Carlo Methods: Learning from Past Samples, John Wiley & Sons, NJ, USA, 2011.
J. Shao, Mathematical Statistics, Springer, Berlin, 2003.
S. Weisberg, Applied Linear Regression, John Wiley & Sons, New York, USA, 1980.
R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2020.