Michele Lungaroni, Andrea Murari, Emmanuele Peluso, Pasqualino Gaudio, Michela Gelfusa, "Geodesic Distance on Gaussian Manifolds to Reduce the Statistical Errors in the Investigation of Complex Systems", Complexity, vol. 2019, Article ID 5986562, 24 pages, 2019. https://doi.org/10.1155/2019/5986562
Geodesic Distance on Gaussian Manifolds to Reduce the Statistical Errors in the Investigation of Complex Systems
Michele Lungaroni,^{1}Andrea Murari,^{2}Emmanuele Peluso,^{1}Pasqualino Gaudio,^{1} and Michela Gelfusa^{1}
^{1}Department of Industrial Engineering, University of Rome Tor Vergata, Via del Politecnico 1, 00133 Rome, Italy
^{2}Consorzio RFX (CNR, ENEA, INFN, Universita’ di Padova, Acciaierie Venete SpA), Corso Stati Uniti 4, 35127 Padova, Italy
Academic Editor: Dan Selişteanu
Received 18 Mar 2019
Revised 17 Jul 2019
Accepted 21 Jul 2019
Published 18 Aug 2019
Abstract
In recent years the reputation of medical, economic, and scientific expertise has been strongly damaged by a series of false predictions and contradictory studies. The lax application of statistical principles has certainly contributed to the uncertainty and loss of confidence in the sciences. Various assumptions, generally held as valid in statistical treatments, have shown their limits. In particular, it has been quite clear for some time that even slight departures from normality and homoscedasticity can significantly affect classic significance tests. Robust statistical methods have been developed, which can provide much more reliable estimates. On the other hand, they do not address an additional problem typical of the natural sciences, whose data are often the output of delicate measurements. The data can therefore not only be sampled from a nonnormal pdf but also be affected by significant levels of additive Gaussian noise of various amplitudes. To tackle this additional source of uncertainty, this paper shows how already developed robust statistical tools can be usefully complemented with the Geodesic Distance on Gaussian Manifolds. This metric is conceptually more appropriate and practically more effective, in handling noise of Gaussian distribution, than the traditional Euclidean distance. The results of a series of systematic numerical tests show the advantages of the proposed approach in all the main aspects of statistical inference, from measures of location and scale to effect sizes and hypothesis testing. Particularly relevant is the reduction of up to 35% in Type II errors, which demonstrates the important improvement in power obtained by applying the methods proposed in the paper. It is worth emphasizing that the proposed approach provides a general framework, in which noise of different statistical distributions can also be dealt with.
1. Robust Statistics and Measurement Errors in the Science of Complex Systems
In the last decades the number of contradictory, inaccurate, and/or misleading scientific pronouncements reported in the media about complex systems has increased exponentially. From medicine to finance, very often opposing studies and findings have generated a quite high level of confusion. Moreover various expectations, predictions, and evaluations have been very often completely contradicted by reality; a very evident example is the financial crisis of 2008. This situation has considerably corroded the public confidence in expert knowledge [1] and even more generally in the sciences. The status of peer reviewed journals in various fields is not more reassuring. In a famous paper and subsequent works, it has been shown how many studies, published in the most highly respected medical journals, are refuted by other investigations in a matter of months or at the most a few years [2].
There are many causes to the worrying situation previously described, from the reduction in public funding to the corporate takeover of public functions and institutions. However, also the misguided application of inadequate statistical tools has certainly played a role and contributed to exacerbate the problem. Indeed classic significance tests are the main statistical tools in many disciplines, ranging from medicine and education to economics and psychology. In the last decades, hypothesis tests have been increasingly used also in the so called exact sciences, such as physics, chemistry and engineering, particularly when complex systems have to be studied. These classic methods are based on certain specific assumptions, which have to be reasonably satisfied; otherwise the accuracy of the results can be seriously compromised. One fundamental assumption is that the data are randomly sampled from a Gaussian distribution. In case of hypothesis testing and comparison between independent groups, it is also typically assumed that the distributions have a common variance (even if they present different means); this is the so called homogeneity of variance or homoscedasticity hypothesis. Conversely, in practical applications the data available do not necessarily conform to these assumptions. The probability density functions sampled in experiments are not necessarily Gaussians and can present heavy tails or be skewed. Homoscedasticity is even more frequently violated.
Some of the classic parametric tests are quite robust against violation of the normality assumption, in the sense that they can preserve the expected rates of both Type I and Type II errors, even if they have quite low power compared to their nonparametric counterparts. This has been reported extensively in Sawilowsky [3] and Sawilowsky and Blair [4]. On the contrary, in the last decades a lot of evidence has emerged showing how a blind reliance on the assumption of homoscedasticity can produce rather inaccurate results, as reported in Wilcox [5, 6]; in this respect rank-based nonparametric statistics can fare even worse. In Wilcox [6] it is shown that the violation of homoscedasticity significantly affects Type I errors, which can reach 0.5 at a significance level of 0.05. The power of the tests can be even more severely compromised by violation of the previous assumptions; indeed for distributions characterized by small departures from normality, as reported in Wilcox [5], the power of the test can be reduced from 0.96 to 0.28. Therefore a lot of effort has recently been devoted to developing robust tools, which compare well with popular statistical techniques when the classic hypotheses are satisfied, but which are not unduly affected by departures from model assumptions. Developed in the framework of robust statistics, as reported in Huber and Ronchetti [7], these techniques provide quite accurate, even if slightly suboptimal, results when the assumptions of normality and homoscedasticity are correct, but are not compromised if the data have been sampled from a different distribution. However, even if they are quite successful in terms of descriptive statistics, robust techniques can be affected by a significant increase in Type I errors when converted into inferential statistics. The developments proposed in this paper are indeed meant to improve robust statistics explicitly in this direction.
The tools developed in the framework of robust statistics in the last decades have achieved very impressive results but they do not address at all a problem typical of the experimental sciences. In the vast majority of natural science applications, the data are measurements affected by error bars. Therefore, the available data can be not only sampled from a distribution, which is quite far from a Gaussian, but they can also be affected by significant additive noise. Robust statistical methods address only the first of these two problems, the fact that the assumptions about the probability density function (pdf) generating the data are not satisfied. However, the additional uncertainty introduced by the additive noise in the measurements is not considered.
In all the formulas developed in the framework of robust statistics, the Euclidean distance is implicitly assumed to be the proper metric to adopt. The Euclidean distance, however, implicitly requires considering all data as single infinitely precise values. This assumption can be appropriate in other applications but it is obviously not so in the natural sciences, since all the measurements are typically affected by noise. As will be shown later, an inappropriate evaluation of the uncertainties in the measurements can have a major impact even on the determination of the basic statistical measures such as the mean. An alternative approach is to use a new distance between data, which would take into account the measurement uncertainties. The idea, behind the method proposed in this paper, consists of considering the measurements not as points, but as Gaussian distributions. This is a valid assumption in many scientific applications, because the measurements are affected by a wide range of noise sources, which, from a statistical point of view, can be considered random variables.
Modelling measurements not as point values, but as Gaussian distributions, requires defining a distance between Gaussians. This distance is the Geodesic Distance on the Gaussian Manifold (GDGM) and can be expressed as a closed formula (see Section 3). As shown in the rest of the paper, adopting this geodesic distance significantly increases the accuracy of robust statistical tools, even when the data are affected by a limited level of noise, particularly for hypothesis testing.
With regard to the structure of the paper, the next section is devoted to an introductory discussion about the importance of a proper evaluation of the errors associated with the experimental measurements in the framework of a modern theory of uncertainty. Section 3 provides the background on the main mathematical tool introduced in the paper: the Geodesic Distance on Gaussian Manifolds. The probability density functions used in the paper to test the proposed new approach are reviewed in Section 4. The main ideas behind robust statistics are introduced in Section 5. The proposed new method is applied first to the measures of location and scale, as described in Sections 6 and 7. The impact of using the GDGM in hypothesis testing is described in Section 8. The performance of the GDGM when the sampled pdf is asymmetric and the additive noise is heteroscedastic is exemplified in Section 9. Conclusions and lines of future work are provided in Section 10, the last of the paper.
2. The Theory of Uncertainty and the Experimental Measurements
In the science of complex systems, measurements are the basic inputs required to provide quantitative knowledge about phenomena. However, all measurements provide limited information about the measurand, since they are affected by uncertainties. At the end of the last century, the theory of uncertainty was consolidated and became the dominant paradigm with the publication of the IEC-ISO “Guide to the Expression of Uncertainty in Measurement” (GUM) in 1992 [8]. Assuming that all the systematic errors have been eliminated, the uncertainty to be associated with the results of a series of measurements is their standard deviation, defined as

$$\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2} \qquad (1)$$

where $\bar{x}$ is the average of the $N$ available observations $x_i$.
In the previous equation, the standard deviation is expressed in terms of the average of the available observations. This emphasizes how the calculation of the average is at the very basis of the first, most elementary statistical steps, required to quantify even the most elementary quality of the available measurements. Therefore any error in the determination of the mean has the potential to invalidate the entire statistical analysis. To illustrate the potential impact of the issue, let us analyse a case for which the traditional homoscedasticity assumption is not valid. In this simple, introductory example, the measurements are generated by a Gaussian distribution of μ = 3 and σ = 1. The data are considered affected by Gaussian noise of zero mean but varying standard deviation. In particular, the noise increases almost linearly with the amplitude of the measurements; this is not an uncommon case in practice because in many instruments the measurement errors are a fixed percentage of the measured value. This situation is illustrated in Figure 1, which summarises a numerical experiment with 1000 data points.
The accuracy of the traditional estimates of location, mean and mode, is reported in Table 1, together with the values of the robust indicators introduced later in the paper (see Sections 3–5). The values obtained by application of our methodology, using the GDGM, are also shown. In Table 1, as in all the other tables reported either in the main text or in the Appendixes, the numerical values are the results obtained by averaging over 1000 independent realisations. The details of the method proposed in this paper are given later; for the moment what should be retained is that the traditional statistical measures of location fare very badly for this kind of noise. The robust statistical indicators also produce very serious overestimates. On the contrary, our approach of combining robust indicators with the GDGM reduces the inaccuracy in the estimate of location by as much as an order of magnitude.
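Table 1

| Estimator | No Noise | Noise |
|---|---|---|
| Mode | 2.975 | 4.728 |
| Mean | 3.000 | 4.664 |
| Mean (GDGM) | 2.335 | 3.352 |
| Trimmed Mean | 2.999 | 4.687 |
| Trimmed Mean (GDGM) | 1.965 | 2.939 |
| Winsorized Mean | 3.000 | 4.680 |
| Winsorized Mean (GDGM) | 2.122 | 3.088 |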
3. Geodesic Distance on Gaussian Manifolds
As mentioned in the previous section, in the natural sciences the available data are typically the result of experimental measurements, which are affected by uncertainties referred to as noise. The sources of this uncertainty are normally several, independent, and additive; as a consequence, it is reasonable to assume that the pdf of the noise is normal. Each measurement can therefore be modelled as a probability density function of the Gaussian type, determined by its mean μ and its standard deviation σ:

$$p(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \qquad (2)$$

The set of normal distributions can be seen as a two dimensional space, or better a two dimensional manifold, parameterised by μ and σ. Modelling measurements not as point values, but as Gaussian distributions, requires defining a distance between Gaussians. The most appropriate definition of distance between Gaussian distributions is the geodesic distance on the probabilistic manifold containing the data, which is not a Euclidean but a Riemannian space. This geodesic distance on the Gaussian manifold can be calculated using the Fisher-Rao metric, as has been done in Cannas et al. [9] and Murari et al. [10]. For two univariate Gaussian distributions $p_1(x \mid \mu_1, \sigma_1)$ and $p_2(x \mid \mu_2, \sigma_2)$, parameterised by their means $\mu_i$ and standard deviations $\sigma_i$ ($i = 1, 2$), the geodesic distance GD is given by

$$GD(p_1 \,\|\, p_2) = \sqrt{2}\,\ln\frac{1+\delta}{1-\delta} = 2\sqrt{2}\,\tanh^{-1}\delta, \qquad \delta = \left[\frac{(\mu_1-\mu_2)^2 + 2(\sigma_1-\sigma_2)^2}{(\mu_1-\mu_2)^2 + 2(\sigma_1+\sigma_2)^2}\right]^{1/2} \qquad (3)$$
A pictorial view of the potential impact of the GDGM is provided in Figure 2, which shows the distance between two pairs of Gaussian distributions having the same difference in means. The standard deviations of the two Gaussians in the bottom plot are much larger than those of the two Gaussians in the top plot. Using the Euclidean distance results in attributing the same distance to the two pairs of normal distributions. On the contrary, according to the GDGM, the distance between the two Gaussians with a larger σ is much smaller, which makes intuitive sense since they overlap much more. As will be shown in detail in the next sections, the replacement of the Euclidean distance with the GDGM significantly improves all the robust estimators, reducing the effects of noise and outliers.
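The behaviour just described can be checked numerically with a minimal sketch. It assumes the standard closed-form Fisher-Rao expression for univariate Gaussians, GD = 2√2 artanh(δ) with δ² = [(μ1−μ2)² + 2(σ1−σ2)²] / [(μ1−μ2)² + 2(σ1+σ2)²]; the function name `geodesic_distance` and the numerical values are illustrative choices, not taken from the paper.

```python
import math


def geodesic_distance(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao geodesic distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    Assumed closed form: GD = 2*sqrt(2)*artanh(delta).
    """
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    dmu2 = (mu1 - mu2) ** 2
    num = dmu2 + 2.0 * (sigma1 - sigma2) ** 2
    den = dmu2 + 2.0 * (sigma1 + sigma2) ** 2
    delta = math.sqrt(num / den)  # 0 <= delta < 1 for positive sigmas
    return 2.0 * math.sqrt(2.0) * math.atanh(delta)


# Two pairs of Gaussians with the same difference in means (illustrative values).
narrow = geodesic_distance(0.0, 0.5, 2.0, 0.5)  # small sigma: little overlap
wide = geodesic_distance(0.0, 3.0, 2.0, 3.0)    # large sigma: strong overlap
euclid = abs(0.0 - 2.0)                         # identical for both pairs

print(f"Euclidean: {euclid:.3f}  GD narrow: {narrow:.3f}  GD wide: {wide:.3f}")
```

The Euclidean distance between the means is the same for both pairs, whereas the geodesic distance is smaller for the pair with the larger standard deviations.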
4. Nonnormal Distributions
To exemplify and prove the usefulness of the method proposed in this paper, a series of numerical tests has been performed, with probability density functions of families different from the Gaussian. To simulate realistic situations in the natural sciences, Gaussian noise of various amplitudes has been added to the data sampled from the nonnormal pdfs. The tests performed in the rest of the paper have been obtained from the pdfs belonging to the families: log-normal, exponential, contaminated χ^{2} and the so called g-h pdf. The analysed distributions are defined in the rest of this section (more details are provided in Wilcox 2005 [11]):
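The traditional log normal distribution corresponds to the pdf

$$f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}}\exp\left[-\frac{(\ln x-\mu)^2}{2\sigma^2}\right], \quad x > 0 \qquad (4)$$

The pdf of the exponential distribution can be written as

$$f(x) = \lambda\, e^{-\lambda x}, \quad x \ge 0 \qquad (5)$$

The contaminated χ^{2} distribution is the sum of two χ^{2} distributions, the first sampled with probability 1 − ε and the second with probability ε. The χ^{2} pdf is defined as

$$f(x; k) = \frac{x^{k/2-1}\, e^{-x/2}}{2^{k/2}\,\Gamma(k/2)}, \quad x > 0 \qquad (6)$$

In (6) k indicates the number of degrees of freedom of the distribution and Γ is the gamma function.

The g-h distribution is a family of pdfs parameterised by the two values g and h. It consists of a normal distribution modified by the parameters g and h, which determine its asymmetry and the relevance of the tails, respectively. A variable X following the g-h distribution is obtained from a standard normal variable Z as

$$X = \frac{\exp(gZ) - 1}{g}\,\exp\left(\frac{hZ^2}{2}\right) \qquad (7)$$

If g = 0 then

$$X = Z\,\exp\left(\frac{hZ^2}{2}\right) \qquad (8)$$

When both parameters are zero, the standard normal is recovered.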
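As a rough illustration of how such test data can be generated, the sketch below draws samples from the four families using only the Python standard library and then adds Gaussian noise. All parameter values are hypothetical. Two modelling choices are assumptions made here rather than statements of the paper: the contaminating χ^{2} component differs from the main one in its degrees of freedom (`k_cont`), and the noise standard deviation is a fixed fraction (`noise_frac`) of the sampled value.

```python
import math
import random


def sample_lognormal(rng, mu=0.0, sigma=1.0):
    return rng.lognormvariate(mu, sigma)


def sample_exponential(rng, lam=1.0):
    return rng.expovariate(lam)


def sample_contaminated_chi2(rng, k=4, k_cont=20, eps=0.1):
    # Mixture: chi2(k) with probability 1 - eps, chi2(k_cont) with probability eps.
    # k_cont is a hypothetical choice for the contaminating component.
    dof = k_cont if rng.random() < eps else k
    # A chi2 variable with dof degrees of freedom is Gamma(shape=dof/2, scale=2).
    return rng.gammavariate(dof / 2.0, 2.0)


def sample_g_h(rng, g=0.5, h=0.1):
    # Transformation of a standard normal variable z.
    z = rng.gauss(0.0, 1.0)
    tail = math.exp(h * z * z / 2.0)
    if g == 0:
        return z * tail
    return (math.exp(g * z) - 1.0) / g * tail


def add_gaussian_noise(rng, x, noise_frac):
    # Assumption: zero-mean noise whose sigma is a fixed fraction of |x|.
    return x + rng.gauss(0.0, noise_frac * abs(x))


rng = random.Random(42)
clean = [sample_lognormal(rng) for _ in range(1000)]
noisy = [add_gaussian_noise(rng, x, 0.3) for x in clean]
print(sum(clean) / len(clean), sum(noisy) / len(noisy))
```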
5. The Approach of Robust Statistics and the GDGM
In this section the need for a special treatment of pdfs with heavy tails is reviewed. The potential of the GDGM to address the issue of noise is introduced.
5.1. The Issues Presented by Non-Gaussian Pdfs
A typical case, discussed in detail to illustrate the main ideas behind the approach proposed in this paper, is the calculation of the sample mean. When the data are not sampled from a Gaussian distribution or are affected by a significant fraction of outliers, it is well known that a trimmed mean can produce a much better estimate of location as it is reported in Wilcox [11]. An example is the case of the log normal distribution shown in Figure 3.
This is the typical example of a nonsymmetric pdf, with significant asymmetric tails. Contrary to the case of the symmetric normal distribution, in the case of the log normal the mean does not correspond to the maximum of the pdf. Indeed, as can be appreciated from Figure 3, the mean value of the distribution is considerably displaced to the right of the maximum. Therefore the mean value is no longer representative of the most probable value to be expected when the log normal is sampled. As a consequence, if the measures of location are meant to determine the typical value of a quantity, as is typically the case in the science of complexity, the traditional sample mean is not really representative. The median provides a much better estimate of central tendency for distributions with tails but it can provide quite wrong estimates if the data are indeed sampled from a normal distribution.
Various strategies can be adopted to alleviate this problem. In the work of Wilcox [11] they are referred to as robust measures of location. A very important family of these methods is based on some form of trimming of the available data. A trimmed mean or truncated mean consists of calculating the mean of the available data after discarding the high and low end parts of the samples (typically discarding an equal amount of both). The number of samples to be discarded is usually given as a percentage of the total number of samples. For most statistical applications, 5 to 25 percent of the ends are discarded; the 25% trimmed mean (when the lowest 25% and the highest 25% are discarded) is known as the interquartile mean. The trimmed mean is obviously more robust to the presence of outliers than the traditional mean. Similar considerations apply to the alternative methods to calculate robust measure of location, such as the winsorized mean discussed in the paper of Huber et al. (2009).
5.2. Application of the Geodesic Distance on Gaussian Manifolds to Measures of Location
The trimmed mean assumes that the data obtained are perfectly sampled from a distribution and they are not affected by noise. As mentioned, this is not realistic in the experimental investigations of complex systems, whose data are typically the results of complicated measurements always affected by a certain level of noise. The noise can often be modelled by a normal distribution of zero mean. The variance of the noise can often be determined experimentally but if that is not the case, the method proposed in this paper provides also a quite robust estimate of this important quantity (see Section 6.2).
To attack this second complication posed by the measurement noise, a good starting point is the observation that the mean can be considered the point having the minimum distance from the available data. This point can be calculated on the basis of the GDGM, which is an appropriate metric once the experimental values are affected by Gaussian noise. So instead of implicitly adopting the Euclidean metric to determine the trimmed and winsorized means, the more appropriate GDGM is used, which can properly take into account additive Gaussian noise.
The approach briefly described for the trimmed mean can be adopted for the other robust statistical techniques considered to determine the central tendency of pdfs. An appropriate metric, the Geodesic Distance on Gaussian Manifolds, is applied to the data already manipulated with the most appropriate robust statistical methods. So the robust statistical methods remedy the issue that the sampled distribution is not a Gaussian, and the GDGM handles in a principled way the additional uncertainties due to the fact that the data are affected by noise. The details will be discussed in the next sections but, as far as the estimates of location are concerned, Tables 2 and 3 report a comparison between the traditional and robust estimators using the Euclidean distance and the GDGM for the case of data sampled from a log normal distribution. For the case with noise, the GDGM always improves the estimates of central tendency, by providing values closer to the mode compared to the robust statistics indicators. As already mentioned, all the results reported either in the main text or in the Appendixes have been obtained by averaging over 1000 independent realisations.
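Table 2

| Estimator | Lognormal | Exponential | Contaminated χ^{2} | G-h |
|---|---|---|---|---|
| Mode | 6.56209 | 0.99761 | 3.80171 | 0.30349 |
| Mean | 12.28140 | 1.99360 | 7.59350 | 0.70637 |
| Mean (GDGM) | 10.77281 | 1.80658 | 6.24093 | 0.44912 |
| Trimmed Mean | 8.56378 | 1.56370 | 4.19754 | 0.07971 |
| Trimmed Mean (GDGM) | 8.30072 | 1.52316 | 4.07220 | 0.05915 |
| Winsorized Mean | 9.13209 | 1.65190 | 4.53569 | 0.12788 |
| Winsorized Mean (GDGM) | 8.77492 | 1.59667 | 4.34010 | 0.09893 |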
Table 3

| Location | 20 points | 50 points | 100 points | 1000 points |
|---|---|---|---|---|
| Mode | 8.5394 | 6.5722 | 5.7292 | 3.75215 |
| Mean | 12.24879 | 12.2451 | 12.19511 | 12.20451 |
| Mean (GDGM) | 10.78797 | 10.79425 | 10.75084 | 10.80813 |
| Trimmed Mean | 8.9453 | 8.9077 | 8.78285 | 8.75854 |
| Trimmed Mean (GDGM) | 8.68679 | 8.63359 | 8.48676 | 8.44375 |
| Winsorized Mean | 9.40048 | 9.43643 | 9.3558 | 9.36122 |
| Winsorized Mean (GDGM) | 9.03414 | 9.00968 | 8.89143 | 8.86864 |
6. Measures of Location
To show the potential of the proposed method, the GDGM has been applied first to the most common robust measures of location. In this section, it is assumed that the variance of the Gaussian noise has been already determined experimentally and it is known. In the next subsection an extension of the methodology is introduced to determine the variance of the noise directly from the data, without any a priori information.
6.1. Measures of Location in the Case the Variance of the Noise Is Known
The most common robust statistical measures of central tendency are the trimmed mean and the winsorized mean. The trimmed mean has already been introduced in the previous sections. The winsorized mean is a variant of the trimmed mean, which involves the calculation of the mean after replacing given parts of the available samples, at the high and low ends, with the most extreme remaining values. To summarise, the traditional mean and the trimmed and winsorized means are calculated according to the following formulas, in which $x_i$ indicate the values sampled from the various pdfs.
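The traditional mean is defined as

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

where N is the number of available samples. The trimmed mean is defined as

$$\bar{x}_t = \frac{1}{N-2g}\sum_{i=g+1}^{N-g} x_{(i)}$$

where g corresponds to the number of trimmed points at each end and $x_{(1)} \le x_{(2)} \le \dots \le x_{(N)}$ are the samples sorted in ascending order. The winsorized mean is defined as

$$\bar{x}_w = \frac{1}{N}\sum_{i=1}^{N} W_i$$

where

$$W_i = \begin{cases} x_{(g+1)} & \text{if } x_i \le x_{(g+1)} \\ x_i & \text{if } x_{(g+1)} < x_i < x_{(N-g)} \\ x_{(N-g)} & \text{if } x_i \ge x_{(N-g)} \end{cases}$$

It is worth mentioning that the results reported in the rest of the paper have been obtained for symmetric versions of trimming and winsorization, but the proposed alternative geodesic distance could be applied equally well to asymmetric versions of these measures.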
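The symmetric trimmed and winsorized means can be illustrated with a short sketch. The function names and the convention g = floor(frac · N) for the number of points handled at each end are assumptions made for the example.

```python
def trimmed_mean(data, frac=0.2):
    """Mean after discarding the g lowest and g highest values, g = floor(frac * N)."""
    xs = sorted(data)
    n = len(xs)
    g = int(frac * n)
    if n - 2 * g <= 0:
        raise ValueError("trimming fraction too large for the sample size")
    kept = xs[g:n - g]
    return sum(kept) / len(kept)


def winsorized_mean(data, frac=0.2):
    """Mean after replacing the g lowest/highest values by the nearest remaining ones."""
    xs = sorted(data)
    n = len(xs)
    g = int(frac * n)
    if n - 2 * g <= 0:
        raise ValueError("winsorizing fraction too large for the sample size")
    low, high = xs[g], xs[n - g - 1]
    clipped = [min(max(x, low), high) for x in xs]
    return sum(clipped) / n


sample = [1.0, 2.0, 3.0, 10.0, 100.0]  # one large outlier
print(sum(sample) / len(sample))       # traditional mean, dominated by the outlier
print(trimmed_mean(sample, 0.2))       # mean of 2, 3, 10
print(winsorized_mean(sample, 0.2))    # mean of 2, 2, 3, 10, 10
```

With one point handled at each end, the outlier no longer dominates either estimate, while the winsorized mean still uses all N positions in the average.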
As already briefly discussed, to apply the GDGM, which is a distance, the mean is considered the point with the minimum distance to the available samples. To this end, the data used by the robust techniques, trimmed or winsorized, are used to calculate, with an iterative process, the value with the minimum GDGM. The value of the mean is scanned over a reasonable range using the traditional mean as the first guess. The value which minimises the mean calculated with (2) is then chosen. Of course the only precaution to take is that the scan must cover a sufficiently wide range to include this minimum, but such a condition is easy to implement. To perform the iteration, the expression for the GDGM is (3), which depends on two quantities, μ and σ. As mentioned, the standard deviation of the noise is assumed known or can be found with the technique described in Section 6.2. For the sample mean μ, since the noise is assumed Gaussian, the measured value is taken as the most probable value and therefore μ is identified with the experimental measured value. A series of numerical tests and theoretical considerations support this choice, as can be seen in Verdoolaege [12].
In more detail, the starting point is either the traditional sample mean, the trimmed mean, or the winsorized mean (depending on the context) of the available data sampled from the pdf. The iterative process consists of finding the value of the parameter p_{mean} which minimises the GDGM. Here p_{mean} is the parameter scanned to obtain the measure of location by minimising the distance to the available experimental points. As mentioned earlier, the parameters necessary to compute the GDGM are set from the measured values and from the percentage of Gaussian noise added to the individual samples; the subscript model indicates the type of mean (traditional, trimmed, or winsorized).
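The scan can be sketched as follows. This is only an illustration under explicitly assumed choices, none of which is confirmed by the text: each sample x is modelled as a Gaussian with mean x and standard deviation `noise_frac * |x|`; the candidate `p` is modelled in the same way; the quantity minimised is the sum of squared geodesic distances; and the scan is a plain grid centred on the traditional mean. The names `gdgm_location`, `noise_frac`, `half_width`, `steps`, and `floor` are hypothetical.

```python
import math


def geodesic_distance(mu1, sigma1, mu2, sigma2):
    # Assumed closed form of the Fisher-Rao distance between univariate Gaussians.
    dmu2 = (mu1 - mu2) ** 2
    num = dmu2 + 2.0 * (sigma1 - sigma2) ** 2
    den = dmu2 + 2.0 * (sigma1 + sigma2) ** 2
    # Clamp guards against rounding to exactly 1 for very small sigmas.
    delta = min(math.sqrt(num / den), 1.0 - 1e-12)
    return 2.0 * math.sqrt(2.0) * math.atanh(delta)


def gdgm_location(data, noise_frac, half_width=None, steps=2001, floor=1e-9):
    """Grid scan for the value p with minimum total squared GD to the samples.

    Assumptions (not from the paper): sigma_i = noise_frac * |x_i| for the
    samples, sigma_p = noise_frac * |p| for the candidate, squared-distance cost.
    """
    n = len(data)
    start = sum(data) / n                              # first guess: traditional mean
    if half_width is None:
        half_width = (max(data) - min(data)) or 1.0    # scan range around the guess
    sigmas = [max(noise_frac * abs(x), floor) for x in data]
    best_p, best_cost = start, math.inf
    for j in range(steps):
        p = start - half_width + 2.0 * half_width * j / (steps - 1)
        sigma_p = max(noise_frac * abs(p), floor)
        cost = sum(geodesic_distance(x, s, p, sigma_p) ** 2
                   for x, s in zip(data, sigmas))
        if cost < best_cost:
            best_p, best_cost = p, cost
    return best_p


sample = [1.0, 2.0, 3.0, 4.0, 100.0]
print(sum(sample) / len(sample), gdgm_location(sample, noise_frac=0.2))
```

In practice the input would be the trimmed or winsorized data rather than the raw samples, and the scan range should be checked to be wide enough to contain the minimum.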
To prove the potential of the proposed method in solving realistic problems, a series of numerical tests have been performed. The data have been sampled from the pdfs reviewed in Section 4. Various levels of Gaussian noise have been added to the individual points. A series of realisations, up to 1000, have been generated for each case and the results reported in Tables 4–9 are the averages over these realisations. The results of the systematic tests for the measures of location and scale are reported in Appendix A. In the rest of this section only the example of sampling from a log normal pdf is provided. In order to condense the information, the percentage variation between the various estimates is reported in the main text (all the details are in Appendix A). Therefore Tables 4–9 report percentage variations, computed from absolute values, in which the estimates using the GDGM are compared to the ones of the various robust indicators. As mentioned in the previous sections, the robust techniques are meant to provide a better estimate of central tendency, compared to traditional techniques; in this context this means that they tend to give results closer to the mode of the distributions from which the data are sampled (with the term mode we indicate the value where the pdf presents its main peak). The estimators using the GDGM are considered to improve the measures of location if they provide values closer to the mode of the sampled pdf compared to the robust method using the Euclidean distance.
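Table 4: Error reduction for the log normal distribution (20 and 100 samples)

| Method | Error Reduction [%] (20 points) | Error Reduction [%] (100 points) |
|---|---|---|
| Mean | 20.09 | 18.73 |
| Trimmed Mean | 0.99 | 0.54 |
| Winsorized Mean | 5.14 | 5.62 |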
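Table 5: Scale estimates for the log normal distribution

| Scale s | 20 points | 50 points | 100 points | 1000 points |
|---|---|---|---|---|
| Mean | 14.59655 | 15.23205 | 15.68975 | 16.38328 |
| Mean (GDGM) | 14.67788 | 15.30456 | 15.75851 | 16.44366 |
| Trimmed Mean | 1.53488 | 0.90977 | 0.64213 | 0.20385 |
| Trimmed Mean (GDGM) | 2.14418 | 1.28424 | 0.88579 | 0.27558 |
| Winsorized Mean | 5.63982 | 5.73328 | 5.74412 | 5.78392 |
| Winsorized Mean (GDGM) | 5.67598 | 5.75673 | 5.76635 | 5.80524 |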
Table 6: Confidence intervals (low limit and upper limit) for the classic, trimmed, and winsorized estimators.
Table 7

| Type I errors [%] | 20 points | 50 points | 100 points | 1000 points |
|---|---|---|---|---|
| Mean | -56.0 | -71.8 | -67.6 | -63.2 |
| Trimmed Mean | 2.24 | -10.6 | -4.3 | -8.7 |
| Winsorized Mean | -12.1 | -21.5 | -16.1 | -17.8 |
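Table 8

| Power | 20 points | 50 points | 100 points | 1000 points |
|---|---|---|---|---|
| Mean | 0.015 | 0.024 | 0.023 | 0.025 |
| Mean (GDGM) | 0.021 | 0.038 | 0.045 | 0.285 |
| Trimmed Mean | 0.164 | 0.236 | 0.307 | 0.885 |
| Trimmed Mean (GDGM) | 0.194 | 0.288 | 0.335 | 0.957 |
| Winsorized Mean | 0.159 | 0.204 | 0.256 | 0.722 |
| Winsorized Mean (GDGM) | 0.182 | 0.278 | 0.351 | 0.957 |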
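Table 9

| Estimator | No Noise | Noise |
|---|---|---|
| Mode | 2.576 | 2.731 |
| Mean | 3.250 | 3.248 |
| Mean (GDGM) | 2.765 | 3.257 |
| Trimmed Mean | 3.052 | 3.158 |
| Trimmed Mean (GDGM) | 2.079 | 2.372 |
| Winsorized Mean | 3.100 | 3.216 |
| Winsorized Mean (GDGM) | 2.249 | 2.646 |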
For the case of the log normal distribution, the results for 20 and 100 samples are reported in Table 4, where the percentage improvement with respect to the traditional robust indicators is reported.
From this table, it emerges very clearly that, even at a level of noise of 30%, which is typical of many applications, the improvement provided by the GDGM is significant. Indeed the quantities calculated with the help of the GDGM are closer to the peak of the log normal pdf compared to the results obtained with the trimmed and winsorized means. In Appendix A, the results of a series of systematic tests are reported, showing how the GDGM always improves the estimates of location, over a wide range of samples and noise levels. As can be seen in the Tables in Appendix A, the GDGM allows outperforming the various robust techniques also for a quite high number of points sampled from the distribution (in the order of thousands). Appendix A also shows how this performance is not limited to the case of the log normal distribution but is equally appreciable for all the other pdfs tested.
6.2. Measures of Location in the Case the Variance of the Noise Is Not Known
In the investigation of complex systems, it is possible that the level of Gaussian noise cannot be precisely quantified. Typically some experimental evidence is available but sometimes the uncertainties on the level of additive noise can be substantial. Another advantage of the GDGM is that it allows determining the level of normal noise, in particular its sigma, directly from the data. This can be achieved by scanning the sigma of the noise in the GDGM and by repeating the procedure for the identification of the mean described in the previous subsection. In the scan, the estimated mean remains constant for values of the assumed σ lower than the actual sigma of the noise. The mean then starts decreasing drastically when the assumed σ is higher than the actual sigma of the noise. The inflection point is a very good estimate of the actual sigma of the additive Gaussian noise.
The approach just described is illustrated graphically in Figure 4 for the case of the log normal distribution. Data have been sampled from the pdf and then additive normal noise of zero mean and a sigma of 20% of the mean has been added. The mean of the data has then been calculated with the GDGM for a wide range of σ. As can be seen from the figure, the inflection point in the mean corresponds very well to the added level of noise.
Figure 4: (a) 3D view of GD plane; (b) zoom of 2D view; (c) 2D view.
The proposed procedure has been verified for all the pdfs used in the paper and it has typically provided a reasonable estimate of the noise sigma.
The capability to derive information about the level of noise affecting the available data is of course a significant added value of the proposed technique, which can have very significant practical applications in the experimental investigations of complex systems.
7. Measures of Scale
The robust statistical methods developed in the last decades allow improving not only the estimates of location but also those of scale. The scale measures tested in this paper are reported in the following (see [5]). As a reference, the classic standard deviation is defined as

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

When the location is calculated with the trimmed mean, the standard deviation is defined in terms of the number n of points sampled from the pdf, the percentage of trimming, the trimmed data, and the trimmed mean.
A similar definition applies to the standard deviation of the winsorized mean.
To apply the GDGM to the standard deviations previously defined, the same formulas are used with the location given by the GDGM estimate of either the traditional mean, the trimmed mean, or the winsorized mean of the available data.
The improvement in the determination of the scale using the GDGM is exemplified in Table 5 for the case of the log normal distribution. Again, the impact on the other pdfs is reported in the Appendices. The results are very similar to the ones obtained by the other robust techniques, although they are systematically slightly better. The advantages of the GDGM will become more evident in the case of hypothesis testing, described in the next section, where it will be shown how adopting the GDGM values of location and scale significantly reduces the Type I errors and improves the power.
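For orientation, the Euclidean (non-GDGM) robust measures discussed in this section can be sketched as follows. The definitions follow the usual textbook forms of the trimmed mean, the winsorized mean, and the winsorized standard deviation; the function names, the n - 1 normalisation, and the example data are assumptions made for illustration and are not taken from the paper.

```python
import math
import statistics


def trimmed_mean(data, gamma):
    """Mean after discarding the fraction gamma of points from each tail."""
    x = sorted(data)
    g = int(math.floor(gamma * len(x)))
    kept = x[g:len(x) - g]
    return sum(kept) / len(kept)


def winsorize(data, gamma):
    """Pull the fraction gamma of points in each tail in to the nearest kept value."""
    x = sorted(data)
    n = len(x)
    g = int(math.floor(gamma * n))
    low, high = x[g], x[n - g - 1]
    return [min(max(v, low), high) for v in x]


def winsorized_mean(data, gamma):
    w = winsorize(data, gamma)
    return sum(w) / len(w)


def winsorized_std(data, gamma):
    """Sample standard deviation (n - 1 normalisation) of the winsorized data."""
    return statistics.stdev(winsorize(data, gamma))


# Hypothetical sample with one large outlier.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
print("classic mean/std:   ", statistics.mean(sample), statistics.stdev(sample))
print("trimmed mean (20%): ", trimmed_mean(sample, 0.2))
print("winsorized mean/std:", winsorized_mean(sample, 0.2), winsorized_std(sample, 0.2))
```

In this example the single outlier dominates the classic mean and standard deviation, while the trimmed and winsorized quantities stay close to the bulk of the data.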
8. Hypothesis Testing
In this section, the problems of computing confidence intervals and testing hypotheses are considered. The population variance is to be estimated from the sample variance, and the standard deviation of the noise is considered known, either experimentally or by applying the method described in Section 6.2. The null hypothesis is, as usual, the one which assumes that the measure of location has a certain value: H_{0}: μ = μ_{0}. The alternative hypothesis is therefore H_{1}: μ ≠ μ_{0}. In the case of the classic significance tests, the null hypothesis is rejected at the confidence level α/2 if the absolute value of the test statistic exceeds the corresponding critical value of T, where T is the Student's t distribution and n the number of degrees of freedom. Robust statistical methods have been developed to test hypotheses when the usual assumptions of normal distribution of the sampled pdf and homoscedasticity are not verified. They are based on the robust estimators of location and scale introduced in the previous sections. An exhaustive treatment of these techniques can be found in Wilcox [11]. For the various robust indicators, the null hypothesis can be rejected at the confidence level α/2 if the conditions of the corresponding inequalities are satisfied.
In the case of the trimmed location estimate, the inequalities to evaluate are those of the Yuen method described in Wilcox [5]. In the case of the winsorized mean, the corresponding Student's test is used. The confidence intervals for these estimates can be derived from the corresponding relations; their extremes have been derived using the relations of Table 6. In the previous formulas and in Table 6, the subscript clas indicates the values calculated with the traditional methods and the Euclidean distance.
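As an illustration of the test statistics discussed above, the following sketch computes the classic one-sample Student statistic and the standard one-sample trimmed-mean statistic (the Tukey-McLaughlin form on which Yuen's method builds), both with the Euclidean estimates. It is a minimal sketch under stated assumptions: the function names are hypothetical, the critical value must be supplied by the caller (for example from a t table), and the exact inequalities used in the paper may differ in detail.

```python
import math
import statistics


def classic_t(data, mu0):
    """Return (statistic, degrees of freedom) of the classic one-sample t test."""
    n = len(data)
    t = math.sqrt(n) * (statistics.mean(data) - mu0) / statistics.stdev(data)
    return t, n - 1


def trimmed_t(data, mu0, gamma=0.2):
    """Return (statistic, degrees of freedom) of the trimmed-mean test."""
    x = sorted(data)
    n = len(x)
    g = int(math.floor(gamma * n))
    kept = x[g:n - g]
    t_mean = sum(kept) / len(kept)
    # The winsorized sample supplies the scale estimate.
    winsorized = [min(max(v, x[g]), x[n - g - 1]) for v in x]
    s_w = statistics.stdev(winsorized)
    t = (1.0 - 2.0 * gamma) * math.sqrt(n) * (t_mean - mu0) / s_w
    return t, n - 2 * g - 1


def reject(t, t_crit):
    """Two-sided decision: reject H0 when |t| reaches the critical value."""
    return abs(t) >= t_crit


# Hypothetical sample with one large outlier, testing H0: location = 4.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
t_c, df_c = classic_t(sample, 4.0)
t_t, df_t = trimmed_t(sample, 4.0, gamma=0.2)
# Two-sided 5% critical values from a t table: 2.262 (9 dof), 2.571 (5 dof).
print(f"classic: t = {t_c:.3f}, dof = {df_c}, reject = {reject(t_c, 2.262)}")
print(f"trimmed: t = {t_t:.3f}, dof = {df_t}, reject = {reject(t_t, 2.571)}")
```

Replacing the location and scale inside these functions with their GDGM counterparts would give the corresponding GDGM versions of the tests; that substitution is not shown because it depends on formulas defined elsewhere in the paper.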
In order to verify the potential of the method proposed in this paper to help cope with Gaussian noise, all the previous tests have also been computed using the estimates of location and scale obtained with the GDGM. Two main types of test have been performed. First, the data have been sampled from the same pdf and noise has been added. In this case, the objective is to verify the resilience of the various tests to Type I error, i.e., to the wrong rejection of the null hypothesis (see Table 7).
In the main body of the paper this analysis has been particularised for the log-normal distribution. Further examples are reported in Appendix A. Secondly, the data have been sampled from two different distributions before adding the noise. In this case, the objective consists of determining whether the GDGM can help in reducing Type II errors, i.e., the failure to reject a false null hypothesis (see Table 8). As an example, in the following the data have been sampled from a log normal and a normal distribution. This is one of the most difficult cases since one of the two pdfs is a Gaussian. The results of similar tests for the other pdfs are again reported in Appendix A.
Some representative results of the tests for the Type I errors are reported in Table 7, where a minus sign indicates the reduction of errors achieved when using the GDGM. The use of the GDGM significantly improves the situation, reducing the Type I errors by as much as 20% with respect to the robust statistics techniques.
The effect of the GDGM is even more significant on the power, as can be seen in Table 8, again for some representative tests. The increase in power with the GDGM methods can indeed reach values of the order of 35% compared to conventional methods.
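The way Type I error rates and power are estimated numerically can be sketched with a small Monte Carlo experiment. The code below is a simplified stand-in, not the study reported in Tables 7 and 8: it uses a one-sample classic test, a normal approximation to the Student critical value (reasonable only for moderately large samples), and hypothetical samplers, sample sizes, and noise levels.

```python
import math
import random
import statistics


def t_statistic(sample, mu0):
    """Classic one-sample Student statistic."""
    n = len(sample)
    return math.sqrt(n) * (statistics.mean(sample) - mu0) / statistics.stdev(sample)


def rejection_rate(sampler, mu0, n, noise_sigma, trials, alpha=0.05, seed=0):
    """Fraction of simulated noisy samples for which H0: location = mu0 is rejected."""
    rng = random.Random(seed)
    # Normal approximation to the two-sided critical value (an assumption).
    crit = statistics.NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejected = 0
    for _ in range(trials):
        sample = [sampler(rng) + rng.gauss(0.0, noise_sigma) for _ in range(n)]
        if abs(t_statistic(sample, mu0)) >= crit:
            rejected += 1
    return rejected / trials


# H0 true: the rejection rate estimates the Type I error.
type_one = rejection_rate(lambda r: r.gauss(1.0, 1.0), 1.0, 50, 0.2, 1000)
# H0 false: the rejection rate estimates the power (1 - Type II error).
power = rejection_rate(lambda r: r.gauss(1.5, 1.0), 1.0, 50, 0.2, 1000)
print(f"estimated Type I error: {type_one:.3f}")
print(f"estimated power:        {power:.3f}")
```

Running the same loop with a robust or GDGM-based statistic in place of `t_statistic` and comparing the two rejection rates is the kind of comparison summarised by the percentage changes quoted above.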
9. Nonnormal Distributions and Heteroscedasticity
This section of the paper is meant to cover the combined effects of sampling from an asymmetric distribution in the presence of heteroscedastic noise. As a reference case of particular importance, the log normal distribution is analysed in detail. The data are sampled from a log normal distribution, and Gaussian noise is then added to the sampled data. The noise has zero mean and a standard deviation equal to 50% of the value of the data. It is therefore a heteroscedastic noise of a kind quite common in practice, since the uncertainties in the measurements are often expressed as a percentage of their value. The effects of different types of noise are reported in Appendix B. Again the objective consists of estimating the mode, the most probable value of the data. The results of the various approaches are summarised in Table 9. From the table it is easily seen how the GDGM allows recovering values of central tendency much closer to the original distribution than the other indicators, even the robust ones. Such significant variations in the measures of location of course reverberate on the rest of the statistical quantities, from scale to hypothesis testing, again significantly reducing the errors committed by traditional robust indicators in the presence of Gaussian heteroscedastic noise.
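The data generation for this heteroscedastic case can be sketched as follows. The log normal parameters, the sample size, the seed, and the function names are hypothetical; only the noise model (zero-mean Gaussian noise whose standard deviation is a fixed fraction of each datum) and the analytical mode of the log normal, exp(mu - s^2), are taken as given.

```python
import math
import random
import statistics


def lognormal_with_proportional_noise(n, mu, s, fraction, seed=0):
    """Sample a log normal and add Gaussian noise with sigma = fraction * value."""
    rng = random.Random(seed)
    clean = [rng.lognormvariate(mu, s) for _ in range(n)]
    noisy = [x + rng.gauss(0.0, fraction * x) for x in clean]
    return clean, noisy


def lognormal_mode(mu, s):
    """Analytical mode (most probable value) of a log normal distribution."""
    return math.exp(mu - s * s)


clean, noisy = lognormal_with_proportional_noise(2000, 0.0, 0.5, 0.5)
print("analytical mode:    ", round(lognormal_mode(0.0, 0.5), 3))
print("mean of clean data: ", round(statistics.mean(clean), 3))
print("mean of noisy data: ", round(statistics.mean(noisy), 3))
print("median of noisy data:", round(statistics.median(noisy), 3))
```

Comparing the printed indicators with the analytical mode gives a quick feeling for how far the usual Euclidean measures of location sit from the most probable value in this setting.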
10. Conclusions
In the science of complex systems, the data are often the output of delicate measurements and therefore are typically affected by Gaussian additive noise. Since the data are also not necessarily sampled from normal pdfs, it is important to develop tools which can handle both these problems. The Geodesic Distance on Gaussian Manifolds is a principled way to address the issue of Gaussian noise. In this paper, it has been shown how the GDGM can improve the estimates of robust statistical methods, ranging from the evaluation of location and scale to hypothesis testing. The improvements obtained with the GDGM for the estimates of location and scale are not negligible for all the non-Gaussian distributions tested. In the case of hypothesis testing, the advantages provided by the GDGM are quite substantial; in particular, the power is significantly improved, since the Type II errors can typically be reduced by more than 30% for all the pdfs tested. Moreover, since the GDGM is a distance on a Gaussian manifold, it does not introduce unacceptable errors if the data are sampled from a normal pdf. A specific development of the technique also allows estimating the level of noise associated with the measurements, when this information cannot be determined experimentally.
In terms of future developments, it is planned to investigate whether the GDGM can also improve other robust techniques, in particular those belonging to the class of the so-called M-estimators, which are considered better performing than the ones based on trimming [Huber et al. (2009)]. First preliminary tests indicate that there is no reason to expect that the impact of the GDGM will be less positive for this class of estimators. It is therefore considered more urgent to apply the GDGM also to the methods of robust regression, to assess whether progress can be achieved for this class of problems as well. Recently, new regression methods, indicated collectively as symbolic regression, have allowed relaxing the constraints of linear regression and obtaining formulas in non-power-law form, as in the works of Murari et al. [13, 14] and in the paper of Peluso et al. [15]. The use of the GDGM seems particularly promising in this context too, as has been shown in Murari et al. [16, 17]. The proposed methodology could therefore be profitably used in the field of Thermonuclear Fusion to help in the development of more robust scenarios [18–20].