Probability and Statistics with Applications in Finance and EconomicsView this Special Issue
Research Article | Open Access
The Lambert Way to Gaussianize Heavy-Tailed Data with the Inverse of Tukey’s h Transformation as a Special Case
I present a parametric, bijective transformation to generate heavy tail versions of arbitrary random variables. The tail behavior of this heavy tail Lambert random variable depends on a tail parameter : for , , for has heavier tails than . For being Gaussian it reduces to Tukey’s distribution. The Lambert W function provides an explicit inverse transformation, which can thus remove heavy tails from observed data. It also provides closed-form expressions for the cumulative distribution (cdf) and probability density function (pdf). As a special case, these yield analytic expression for Tukey’s pdf and cdf. Parameters can be estimated by maximum likelihood and applications to S&P 500 log-returns demonstrate the usefulness of the presented methodology. The R package LambertW implements most of the introduced methodology and is publicly available on CRAN.
Statistical theory and practice are both tightly linked to Normality. In theory, many methods require Gaussian data or noise: (i) regression often assumes Gaussian errors; (ii) many time series models are based on Gaussian white noise [1–3]. In such cases, a model , parameter estimates and their standard errors, and other properties are then studied, all based on the ideal(istic) assumption of Normality.
In practice, however, data/noise often exhibits asymmetry and heavy tails, for example, wind speed data , human dynamics , or Internet traffic data . Particularly notable examples are financial data [7, 8] and speech signals , which almost exclusively exhibit heavy tails. Thus a model developed for the Gaussian case does not necessarily provide accurate inference anymore.
One way to overcome this shortcoming is to replace with a new model , where is a heavy tail distribution: (i) regression with Cauchy errors ; (ii) forecasting long memory processes with heavy tail innovations [11, 12], or ARMA modeling of electricity loads with hyperbolic noise . See also Adler et al.  for a wide range of statistical applications and methodology for heavy-tailed data.
While such fundamental approaches are attractive from a theoretical perspective, they can become unsatisfactory from a practical point of view. Many successful statistical models and techniques assume Normality, their theory is very well understood, and many algorithms are implemented for the simple and often much faster Gaussian case. Thus developing models based on an entirely unrelated distribution is like throwing out the (Gaussian) baby with the bathwater.
It would be very useful to transform a Gaussian random variable to a heavy-tailed random variable and vice versa and thus rely on knowledge and algorithms for the well-understood Gaussian case, while still capturing heavy tails in the data. Optimally such a transformation should (a) be bijective, (b) include Normality as a special case for hypothesis testing, and (c) be parametric so the optimal transformation can be estimated efficiently.
Figure 1 illustrates this pragmatic approach: researchers can make their observations as Gaussian as possible () before making inference based on their favorite Gaussian model . This avoids the development of, or the data analysts waiting for, a whole new theory of and new implementations based on a particular heavy-tailed distribution , while still improving statistical inference from heavy-tailed data . For example, consider from a standard Cauchy distribution in Figure 2(a): modeling heavy tails by a transformation makes it even possible to Gaussianize this Cauchy sample (Figure 2(c)). This “nice” data can then be subsequently analyzed with common techniques. For example, the location can now be estimated using the sample average (Figure 2(d)). For details see Section 6.1.
(a) Random sample
(b) Box-Cox transformed with
(c) Gaussianized with
(d) Cumulative sample average
Liu et al.  use a semiparametric approach, where has a nonparanormal distribution if and is an increasing smooth function; they estimate using nonparametric methods. This leads to a greater flexibility in the distribution of , but it suffers from two drawbacks: (i) nonparametric methods have slower convergence rates and thus need large samples, and (ii) for identifiability of , and must hold. While (i) is inherent to nonparametric methods, point (ii) requires to have finite mean and variance, which is often not met for heavy-tailed data. Thus here we use parametric transformations which do not rely on restrictive identifiability conditions and also work well for small sample sizes.
The main contributions of this work are threefold: (a) a metafamily of heavy tail Lambert W × distributions (see also ) with Tukey’s distribution  as a special case, (b) a bijective transformation to “Gaussianize” heavy-tailed data (Section 2), and (c) simple expressions for the cumulative distribution function (cdf) and probability density function (pdf) (Section 2.4). In particular, analytic expressions for the pdf and cdf for Tukey’s (Section 3) are presented here, to the best of the author’s knowledge, for the first time in the literature.
Section 4 introduces a method of moments estimator and studies the maximum likelihood estimator (MLE). Section 5 shows their finite sample properties. As has been shown in many case studies, Tukey’s distribution (heavy tail Lambert W × Gaussian) is useful to model data with unimodal, heavy-tailed densities. Applications to S&P 500 log-returns confirm the usefulness of the Lambert W framework (Section 6). Finally, we discuss the new methodology and future work in Section 7. Detailed derivations and proofs are given in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/909231.
Computations, figures, and simulations were done in R . The R package LambertW implements most of the presented methodology and is publicly available on CRAN.
1.1. Multivariate Extensions
While this work focuses on the univariate case, multivariate extensions of the presented methods can be defined component-wise, analogously to the multivariate version of Tukey’s distribution . While this may not make the transformed random variables jointly Gaussian, it still provides a good starting point for more well-behaved multivariate estimation.
1.2. Box-Cox Transformation
A popular method to deal with skewed, high variance data is the Box-Cox transformation A major limitation of (1) is the nonnegativity constraint on , which prohibits its use in many applications. To avoid this limitation it is common to shift the data, , which restricts to a half-open interval. If, however, the underlying process can occur on the entire real line, such a shift undermines statistical inference for yet unobserved data (see ). Even if out-of-sample prediction is not important for the practitioner, Figure 2(b) shows that the Box-Cox transformation in fact fails to remove heavy tails from the Cauchy sample. (We use and use boxcox from the MASS R package; .)
Moreover, the main purpose of the Box-Cox transformation is to stabilize variance [21–23] and remove right tail skewness ; a lower empirical kurtosis is merely a by-result of the variance stabilization. In contrast, the Lambert W framework is designed primarily to model heavy-tailed random variables and remove heavy tails from data and has no difficulties with negative values.
2. Generating Heavy Tails Using Transformations
Random variables exhibit heavy tails if more mass than for a Gaussian random variable lies at the outer end of the density support. A random variable has a tail index if its cdf satisfies , where is a slowly varying function at infinity, that is, for all . (There are various similar definitions of heavy, fat, or long tails; for this work these differences are not essential.) The heavy tail index is an important characteristic of ; for example, only moments up to order can exist.
2.1. Tukey’s Distribution
A parametric transformation is the basis of Tukey’s random variables where is standard Normal random variable and is the heavy tail parameter. The random variable has tail parameter  and reduces to the Gaussian for . Morgenthaler and Tukey  extend the distribution to the skewed, heavy-tailed family of random variables where again . Here and shape the left and right tail of , respectively; thus transformation (3) can model skewed and heavy-tailed data; see Figure 3(a). For the sake of brevity let .
(a) transformation (3)
(b) Inverse in (80) in the supplementary material
(c) Inverse (9) as a function of and
However, despite their great flexibility, Tukey’s and distributions are not very popular in statistical practice, because expressions for the cdf or pdf have not been available in closed form. Although Morgenthaler and Tukey  express the pdf of (2) as they fall short of making explicit. So far the inverse of (2) or (3) has been considered analytically intractable [4, 27]. Thus parameter inference relied on matching empirical and theoretical quantiles [4, 17, 26], or by the method of moments . Only recently Headrick et al.  provided numerical approximations to the inverse. However, numerical approximations can be slow and prohibit analytical derivations. Thus a closed form, analytically tractable pdf, which can be computed efficiently, is essential for a widespread use of Tukey’s (and variants).
In this work I present this long sought closed-form inverse, which has a large body of literature in mathematics and is readily available in standard statistics software. For ease of notation and concision main results are shown for ; analogous results for will be stated without details.
2.2. Heavy Tail Lambert W Random Variables
Tukey’s transformation (2) is strongly related to the approach taken by Goerg  to introduce skewness in continuous random variables . In particular, if Tukey’s , then skewed Lambert W × with skew parameter .
Adapting the skew Lambert W × input/output idea (see Figure 1), Tukey’s random variables can be generalized to heavy-tailed Lambert W × random variables. (Most concepts and methods from the skew Lambert W × case transfer one-to-one to the heavy tail Lambert W random variables presented here. Thus for the sake of concision I refer to Goerg  for details of the Lambert W framework.)
Definition 1. Let be a continuous random variable with cdf , pdf , and parameter vector . Then, is a noncentral, nonscaled heavy tail Lambert W × random variable with parameter vector , where is the tail parameter.
Tukey’s distribution results for being a standard Gaussian .
Definition 2. For a continuous location-scale family random variable define a location-scale heavy-tailed Lambert W × random variable with parameter vector , where and and are mean and standard deviation of , respectively.
The input is not necessarily Gaussian (Tukey’s ) but can be any other location-scale continuous random variable, for example, from a uniform distribution, (see Figure 4).
(a) Lambert W × with
(b) Lambert W × with
(c) Lambert W × with
(d) Lambert W × with
Definition 3. Let be a continuous scale-family random variable, with scale parameter and standard deviation ; let . Then, is a scaled heavy-tailed Lambert W × random variable with parameter .
Let define transformation (6). (For noncentral, nonscale input set ; for scale-family input .) The shape parameter governs the tail behavior of : for values further away from are increasingly emphasized, leading to a heavy-tailed version of ; for , , and for values far away from the mean are mapped back again closer to . For and , also . For and , also .
Remark 4 (only nonnegative ). Although gives interesting properties for , it defines a nonbijective transformation and leads to parameter-dependent support and nonunique input. Thus for the remainder of this work assume , unless stated otherwise.
2.3. Inverse Transformation: “Gaussianize” Heavy-Tailed Data
Transformation (6) is bijective and its inverse can be obtained via the Lambert W function, which is the inverse of , that is, that function which satisfies . It has been studied extensively in mathematics, physics, and other areas of science [29–31] and is implemented in the GNU Scientific Library (GSL) . Only recently the Lambert W function received attention in the statistics literature [16, 33–35]. It has many useful properties (see Appendix A in the supplementary material and Corless et al. ), in particular, is bijective for .
Lemma 5. The inverse transformation of (6) is where and is the sign of . is bijective for all and all .
Lemma 5 gives for the first time an analytic, bijective inverse of Tukey’s transformation: of Morgenthaler and Tukey  is now analytically available as (8). Bijectivity implies that for any data and parameter , the exact input can be obtained.
In view of the importance and popularity of Normality, we clearly want to back-transform heavy-tailed data to data from a Gaussian rather than yet another heavy-tailed distribution. Tail behavior of random variables is typically compared by their kurtosis , which equals if is Gaussian. Hence for the future when we “normalize ” we cannot only center and scale but also transform it to with (see Figure 2(c)). While does not guarantee that is Gaussian, it is a good baseline for a Gaussian sample. Furthermore, it puts different data not only on the same scale, but also on the same tail.
This data-driven view of the Lambert W framework can also be useful for kernel density estimation (KDE), where multivariate data is often prescaled to unit variance, so the same bandwidth can be used in each dimension [36, 37]. Thus “normalizing” the Lambert Way can also improve KDE for heavy-tailed data (see also [38, 39]).
Remark 6 (generalized transformation). Transformation (2) can be generalized to The inner term guarantees bijectivity for all . The inverse is For comparison with Tukey’s I consider only. For transformation (10) is closely related to skewed Lambert W × distributions.
2.4. Distribution and Density
For ease of notation let
Theorem 7. The cdf and pdf of a location-scale heavy tail Lambert W × random variable equal Clearly, and , since and for all .
For scale family or noncentral, nonscale input set or , .
2.5. Quantile Function
Quantile fitting has been the standard technique to estimate , , and of Tukey’s . In particular, the medians of and are equal. Thus for symmetric, location-scale family input the sample median of is a robust estimate of for any (see also Section 5). General quantiles can be computed via  where are the -quantiles of .
3. Tukey’s Distribution: Gaussian Input
For Gaussian input Lambert W × equals Tukey’s , which has been studied extensively. Dutta and Babbel  show that which, in particular, implies that  Thus the kurtosis of equals (see Figure 5) For , (18) and (19) reduce to the familiar Gaussian values.
Corollary 8. The cdf of Tukey’s equals where is the cdf of a standard Normal. The pdf equals (for )
Proof. Take in Theorem 7.
3.1. Tukey’s versus Student’s
Student’s -distribution with degrees of freedom is often used to model heavy-tailed data [41, 42], as its tail index equals . Thus the th moment of a Student’s random variable exists if . In particular, and kurtosis
Comparing (24) and (25) with (18) and (19) shows a natural association between and and a close similarity between the first four moments of Student’s and Tukey’s (Figure 5). By continuity and monotonicity, the first four moments of a location-scale -distribution can always be exactly matched by a corresponding location-scale Lambert W × Gaussian. Thus if Student’s is used to model heavy tails and not as the true distribution of a test statistic it might be worthwhile to also fit heavy tail Lambert W × Gaussian distributions for an equally valuable “second opinion.” For example, a parallel analysis on S&P 500 -returns in Section 6.2 leads to divergent inference regarding the existence of fourth moments.
4. Parameter Estimation
Due to the lack of a closed form pdf of , has typically been estimated by matching quantiles or a method of moments estimator [4, 26, 28]. These methods can now be replaced by the, fast and usually efficient, maximum likelihood estimator (MLE). Rayner and MacGillivray  introduce a numerical MLE procedure based on quantile functions, but they conclude that “sample sizes significantly larger than should be used to obtain reliable estimates.” Simulations in Section 5 show that the MLE using the closed form Lambert W × distribution converges quickly and is accurate even for sample sizes as small as .
4.1. Maximum Likelihood Estimation (MLE)
For an i.i.d. sample the log-likelihood function equals The MLE is that which maximizes (26); that is, Since is a function of , the MLE depends on the specification of the input density. Equation (26) can be decomposed as where is the log-likelihood of the back-transformed data (via (8)) and where Note that only depends on and (and ), but not necessarily on every coordinate of .
Decomposition (28) shows the difference between the exact MLE based on and the approximate MLE based on alone: if we knew beforehand, then we could back-transform to and estimate from (maximize (29) with respect to ). In practice, however, must also be estimated and this enters the likelihood via the additive term . A little calculation shows that for any , if , with equality if and only if . Thus can be interpreted as a penalty for transforming the data. Maximizing (28) faces a trade-off between transforming the data to follow (and increasing ) and the penalty of a more extreme transformation (and decreasing ); see Figure 6(b).
(a) Penalty (31) as a function of and ( and )
(b) Sample of Lambert W Gaussian with (left); log-likelihood (solid, black) decomposes in input log-likelihood (dotted, green) and penalty (dashed, red)
Figure 6(a) shows a contour plot of as a function of and : it increases (in absolute value) either if gets larger (for fixed ) or for larger (for fixed ). In both cases, increasing makes the transformed data get closer to , which in turn increases its input likelihood. For , the penalty disappears since input equals output; for there is no penalty since for all .
Figure 6(b) shows an i.i.d. sample () Lambert W × Gaussian with and the decomposition of the log-likelihood as in (28). Since is known, the likelihood and penalty are only functions of . Theorem 9 shows that the convexity of the penalty (decreasing, red) and concavity of the input likelihood (increasing, green) as a function of holds true in general for any data , and their sum (solid, black) has a unique maximum; here (blue, dashed vertical line). See Theorem 9 below for details.
The maximization of (28) can be carried out numerically. Here I show existence and uniqueness of assuming that and are known. Further theoretical results for remain for future work. Given the “nice” form of , which is continuous, twice differentiable (under the assumption that is twice differentiable), the MLE for should have the usual optimality properties, such as being consistent and efficient .
4.1.1. Properties of the MLE for the Heavy Tail Parameter
Without loss of generality let and . In this case
Theorem 9 (see ). Let have a Lambert W × Gaussian distribution, where and are assumed to be known and fixed. Also consider only . (While for some samples the MLE also exists for , it cannot be guaranteed for all . If (and ), then is either not unique in (principal and nonprincipal branch) or does not have a real-valued solution in if .)(a)If then . If (34) does not hold, then(b) exists and is a positive solution to (c)There is only one such satisfying (35); that is, is unique.
Proof. See Appendix B in the supplementary material.
Condition (34) says that only if the data is heavy-tailed enough. Points (b) and (c) guarantee that there is no ambiguity in the heavy tail estimate. This is an advantage over Student’s -distribution, for example, which has numerical problems and local maxima for unknown (and small) [45, 46]. On the contrary, is always a global maximum.
Given the heavy tails in one might expect convergence issues for larger . However, for the true and close to a standard Gaussian if . Since the log-likelihood and its gradient depend on and only via the performance of the MLE should thus not get worse for large as long as the initial estimate is close enough to the truth. Simulations in Section 5 support this conjecture, even for .
4.2. Iterative Generalized Method of Moments (IGMM)
A disadvantage of the MLE is the mandatory a priori specification of the input distribution. Especially for heavy-tailed data the eye is a bad judge to choose a particular parametric . It would be useful to directly estimate without the intermediate step of estimating first.
Goerg  presented an estimator for based on iterative generalized methods of moments (IGMM). The idea of IGMM is to find a such that the back-transformed data has desired properties, for example, being symmetric or having kurtosis . An estimator for , , and can be constructed entirely analogous to the IGMM estimator for the skewed Lambert W × case. See the Supplementary Material, Appendix C for details.
IGMM requires less specific knowledge about the input distribution and is usually also faster than the MLE. Once has been obtained, the back-transformed can be used to check if has characteristics of a known parametric distribution . It must be noted though that testing for a particular distribution is too optimistic as has “nicer” properties regarding than the true would have. However, estimating the transformation requires only three parameters and for a large enough sample, losing three degrees of freedom should not matter in practice.
This section explores finite sample properties of estimators for and under Gaussian input . In particular, it compares Gaussian MLE (estimation of and only), IGMM and Lambert W × Gaussian MLE, and, for a heavy tail competitor, the median. (For IGMM, optimization was restricted to .) All results below are based on replications.
5.1. Estimating Only
Here I show finite sample properties of for , where and are known and fixed. Theorem 9 shows that is unique: either at the boundary or at the globally optimal solution to (35). Results in Table 1 were obtained by numerical optimization restricted to () using the nlm function in .
Table 1 suggests that the MLE is asymptotically unbiased for every and converges quickly (at about ) to its asymptotic variance, which is increasing with . Assuming and to be known is unrealistic and thus these finite sample properties are only an indication of the behavior of the joint MLE, . Nevertheless they are very remarkable for extremely heavy-tailed data (), where classic average-based methods typically break down. One reason lies in the particular form of the likelihood (32) and its gradient (35) (Theorem 9): although both depend on , they only do so through . Hence as long as is sufficiently close to the true , (32) and (35) are functions of almost Gaussian random variables and standard asymptotic results should still apply.
5.2. Estimating All Parameters Jointly
Here we consider the realistic scenario where and are also unknown. We consider various sample sizes (, , and ) and different degrees of heavy tails, , each one representing a particularly interesting situation: (i) Gaussian data (does additional, superfluous, estimation of affect other estimates?), (ii) fourth moments do not exist, (iii) nonexisting mean, and (iv) extremely heavy-tailed data (can we get useful estimates at all?)
The convergence tolerance for IGMM was set to . Table 2 summarizes the simulation.
|(a) Truly Gaussian data:|