Linear Models with Response Functions Based on the Laplace Distribution: Statistical Formulae and Their Application to Epigenomics
The statistical application considered here arose in epigenomics, linking the DNA methylation proportions measured at specific genomic sites to characteristics such as phenotype or birth order. It was found that the distribution of errors in the proportions of chemical modification (methylation) on DNA, measured at CpG sites, may be successfully modelled by a Laplace distribution perturbed by a Hermite polynomial. We use a linear model with such a response function. Hence, the response function is known, or assumed well estimated, but fails to be differentiable in the classical sense due to the modulus function. Our problem was to estimate coefficients for the linear model and the corresponding covariance matrix and to compare models with varying numbers of coefficients. The linear model coefficients may be found using the (derivative-free) simplex method, as in quantile regression. However, this theory does not yield a simple expression for the covariance matrix of the coefficients of the linear model. Assuming response functions which are twice continuously differentiable except where the modulus function attains zero, we derive simple formulae for the covariance matrix and a log-likelihood ratio statistic, using generalized calculus. These original formulae enable a generalized analysis of variance and further model comparisons.
1. Introduction and Motivation
This work arose in a biological context, in epigenomics, namely, the modelling of the distribution of errors in the proportions of chemical modification (methylation) on DNA, measured at specific genomic sites (CpG sites). It was observed that this error distribution may be suitably modelled by a truncated Laplace distribution perturbed by a Hermite polynomial.
This error distribution was first noticed in Sequenom measurements but has wider application. A survey of data generated by measurements on the Infinium, Illumina, Affymetrix, and MeDIP2 machines showed characteristics similar to those of the Sequenom, where such an amended Laplace distribution was required to properly describe the probability density function. It is thought that variation in the scattering angle of light in the measurement processes common to all of these platforms is responsible for the frequencies in the tails of the measurement distributions not conforming to a simple Laplace density and requiring our proposed amendment. Without the amendment, the Laplace density gives tail probabilities for the deviations that are too high, potentially leading to an incorrect failure to reject a null hypothesis. Because the observed frequency distribution of epigenomic and gene expression measurements appears to be a common feature of molecular biology, it is important that estimation and inference under the amended Laplace probability density be studied; this paper reports results from such a study.
We extend the standard theory of linear models to deal with response variables with distributions more general than the exponential family. We consider the Laplace distribution, and amendments thereof, with probability density functions which have abrupt changes in gradient due to the presence of the modulus function. The theory in this paper corresponds to least absolute error (LAE) (or least absolute deviation (LAD)) regression [2–4], also called median regression, when the response function is the Laplace distribution without modification. We focus on coefficient estimation for a linear model with a response variable distribution assumed to be a truncated and/or perturbed version of the Laplace distribution, estimating the standard errors of these coefficients and understanding the asymptotic theory.
The theory of generalized linear models covers the case of distributions from the exponential family. These distributions have probability density functions which are twice continuously differentiable everywhere on their support. The usual expressions for the standard errors of the model coefficients in generalized linear models are derived using Taylor series and assume distributions with probability density functions which are twice continuously differentiable everywhere on their support. They cannot be applied to our model due to the presence of the modulus function.
The modified Laplace probability density functions considered here have a sharp peak at the maximum. Maximum likelihood estimation (MLE) of coefficients may be done by non-gradient methods, such as the simplex method. However, the usual classical expressions for the standard errors of the coefficients, the information matrix, and the log-likelihood ratio statistic do not apply due to lack of differentiability. We derive expressions for generalized versions of these quantities using generalized functions. Consequently, we show that our MLE is asymptotically normal.
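In the simplest special case, an intercept-only model with an unmodified Laplace response, the MLE reduces to the sample median, and the asymptotic normality asserted above can be checked by simulation. The sketch below is purely illustrative; the parameter values and variable names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(42)
b, n, reps = 1.0, 400, 2000

# For an intercept-only model with Laplace(0, b) errors, the MLE of the
# intercept is the sample median; its asymptotic variance is
# 1 / (4 n f(0)^2) = b^2 / n, since f(0) = 1 / (2 b).
medians = np.median(rng.laplace(scale=b, size=(reps, n)), axis=1)

sd_theory = b / np.sqrt(n)
sd_sample = medians.std()
# Under approximate normality, about 68% of estimates fall within one sd.
frac_within_1sd = np.mean(np.abs(medians) < sd_theory)
```

The observed spread of the replicated medians matches the asymptotic standard deviation closely, and roughly the normal fraction of estimates lands within one standard deviation of zero.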
The method we present to estimate these statistics could in principle be applied to other probability density functions exhibiting abrupt changes in gradient. Response function parameters are assumed known or previously estimated. The theory is applied to find the standard errors for coefficients of a linear model, assuming the response function has a truncated Laplace distribution with added kurtosis, due to perturbation by a Hermite polynomial. To illustrate the application, we show how birth order can be linked to methylation status at two CpG sites in the promoter of the H19 gene.
2. The Model
2.1. The Expectation Is Linear
Let y be an n × 1 vector of response variables; let X be an n × p matrix of explanatory variables (real-valued). The subscripts denote the dimensions and will be omitted when these are assumed fixed (in Sections 2 and 3). Let β be a p × 1 vector of coefficients for our linear model and assume that E(y) = Xβ. Then each component of the deviation (or error) vector e = y − Xβ has expectation zero. The explanatory variables may be continuous or discrete. We assume n ≥ p. Let r denote the rank of X; then r ≤ p. In practice, we usually have n, the number of data points, much larger than p, the number of coefficients. Our goal is estimating the components of β by ML principles, and determining their standard errors, given a set of response variables y, explanatory variables X, and a response variable distribution based on the Laplace distribution as described below. In terms of generalized linear models, the link function is assumed to be the identity.
2.2. The Distribution of the Deviations—A Modified Laplace Distribution
Example 1. Let f be defined by f(x) = exp(−|x|/b)/c(b), x ∈ ℝ, where b > 0 is a real-valued scale parameter and c(b) is a real-valued normalizing function defined so that f integrates to one; here c(b) = 2b. Then f is the probability density function for the Laplace distribution, with scale parameter b, centred at the origin, and with unbounded support. It is not differentiable at the origin in the classical sense.
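The density of Example 1 can be written down and checked numerically. In the minimal sketch below, the names `laplace_pdf`, `b`, and the normalizing constant 2b follow the standard Laplace parameterization; the naming and parameter values are our choices.

```python
import math

def laplace_pdf(x, b=1.0):
    # Laplace density centred at the origin with scale b: exp(-|x|/b) / (2 b).
    return math.exp(-abs(x) / b) / (2.0 * b)

def integrate(f, lo, hi, n=200_000):
    # Plain trapezoidal rule; adequate here since the kink at 0 lies on a node.
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

# Total probability mass; tails beyond +/-20 are negligible for b = 0.5.
mass = integrate(lambda x: laplace_pdf(x, b=0.5), -20.0, 20.0)
```

The numerical mass agrees with one to high accuracy, and the density is symmetric about the origin with peak value 1/(2b), illustrating the sharp (non-differentiable) maximum at zero.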
The method of MLE for the response function (6) corresponds to least absolute error (LAE) regression [2–4]. However, the theory of LAE regression is not sufficiently general for our epigenomic modelling problem. We next describe the more general response functions we require.
Example 2. Now consider the case of bounded support. For finite T > 0, define f by f(x) = exp(−|x|/b)/c(b, T), x ∈ [−T, T], where b > 0 is a real-valued scale parameter and c(b, T) is a real-valued normalizing function defined so that f integrates to one over [−T, T]; here c(b, T) = 2b(1 − exp(−T/b)). Then f is the probability density function for the truncated Laplace distribution with scale parameter b, centred at the origin, and with bounded support [−T, T].
Example 3. More generally, consider perturbations of the truncated Laplace probability density function of the following form. Let where real-valued is equal to the constant map plus a perturbation, is a vector of parameters for , and parameter vector , . If , has no parameters. Here, is a real-valued normalizing function defined so that We assume that there exists some such that is in on and that on , for fixed parameter vector . Note that, as a consequence of using the modulus function, will not be differentiable with respect to , at , in general. As in Example 1, is not differentiable at the origin due to the use of the modulus function.
Example 4. We could allow unbounded support if is finite.
Example 5. Now consider our motivating example, a truncated Laplace distribution with bounded support , perturbed by adding kurtosis. Such a distribution is used to model the deviations in the proportions of methylation measured at gene promoter CpG sites. Specifically, to fit with observations, kurtosis is added to a Laplace probability density function with bounded support by adding a third-order Hermite polynomial to give an amended version.
Consider Here, , is small, , and is the third-order Hermite polynomial. Solving for yields
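Since the displayed formula (13) has not survived in this text, the sketch below illustrates only the general recipe of Example 5: a symmetric, Hermite-type perturbation of the truncated Laplace kernel, normalized numerically. The kernel form, the probabilists' convention H3(u) = u³ − 3u, and all parameter values are illustrative assumptions, not the paper's exact amendment.

```python
import math

def H3(u):
    # Probabilists' third-order Hermite polynomial (our convention assumption).
    return u**3 - 3.0 * u

def amended_kernel(x, b=0.05, eps=5e-4, T=0.5):
    # Illustrative symmetric perturbation of the truncated Laplace kernel on
    # [-T, T]; NOT the paper's exact equation (13), whose display is lost here.
    u = abs(x) / b
    return math.exp(-u) * (1.0 + eps * H3(u))

def normalizer(kernel, T=0.5, n=100_000):
    # Numerical normalizing constant c such that c * kernel integrates
    # to one over [-T, T] (trapezoidal rule; the kink at 0 lies on a node).
    h = 2.0 * T / n
    total = 0.5 * (kernel(-T) + kernel(T))
    for i in range(1, n):
        total += kernel(-T + i * h)
    return 1.0 / (total * h)
```

For a small perturbation weight the kernel stays strictly positive on the support, remains symmetric, and the numerically computed normalizer is stable under grid refinement, which is all the general theory of Example 3 requires.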
Example 6. The functions and , assuming small positive and bounded support, could be used in (10) to model distributions similar to the Laplace but with thinner tails.
We restrict to symmetric distributions, that is, those whose probability density functions satisfy f(−x) = f(x).
3. Maximum Likelihood Estimation
3.1. The Log-Likelihood Function
Let be a probability density function with parameter vector , as described in Section 2. Let be a sequence of independent and identically distributed deviations with joint probability density function This is also the likelihood function which may be regarded as a function of or ; here, the subscript reflects our point of view. We use the log-likelihood function in the estimation of , where, using various notation, Substituting measured values of and known inputs into , we obtain a function of and . The parameters are assumed known, but if not, may be estimated separately. In our biological application, they are estimated by MLE and are assumed fixed for a particular measuring process. Hence, we have a function of , the coefficients of our linear model.
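Concretely, for the unmodified Laplace response the log-likelihood of the linear-model coefficients is simply the sum of log-densities of the residuals. A minimal sketch (all names and values are ours):

```python
import math

def laplace_logpdf(e, b):
    # Log of the unmodified Laplace density with scale b (Example 1).
    return -abs(e) / b - math.log(2.0 * b)

def log_likelihood(beta, X, y, b):
    """Log-likelihood of linear-model coefficients beta, given i.i.d.
    Laplace deviations e_i = y_i - x_i . beta."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        e_i = y_i - sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))
        total += laplace_logpdf(e_i, b)
    return total
```

When the coefficients fit the data exactly, every residual is zero and the log-likelihood attains its maximum value of −n log(2b), consistent with the sharp peak of the Laplace density at the origin.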
Our aim is to find a maximum likelihood estimator (MLE) denoted , that is, some point at which attains its maximum value. The subscript corresponds to the number of deviations. Now is a continuous function. If is finite, it has compact support in . Since a continuous function on a compact set attains its supremum, the existence of a MLE for is guaranteed. Even if , we may consider truncations with finite bounds , . Since is maximized when the are small, truncating to for large enough will not affect the set of points at which attains its maximum. We show in Section 3.3 that a MLE is not necessarily unique.
Case 1. If and so is the Laplace probability density function with parameter as in (6) or (8), then, for such that every is in the support of ,
where if the support of is . If the support of is then
The theory of LAE regression (corresponding to MLE using Laplace distributions without modification as response functions) may be found in various texts. There, it is proved that there exists a MLE corresponding to at least zero errors. We are concerned with the extension of these ideas to the case of perturbed and truncated Laplace response functions. For the truncated Laplace distribution, we prove that there exists a MLE corresponding to at least zero errors. Consider the following more general case.
Case 2. If is a perturbed Laplace probability density function with perturbing function and bounded support as in (10), then for such that , ,
Note since is strictly positive on , so is and so is well-defined on , .
In Section 3.3, we show that if the perturbing function is such that is convex and non-increasing on , there exists a MLE corresponding to at least data points. We give an upper bound on on which, if not exceeded, ensures that there exists a MLE corresponding to at least one data point. We apply these results to our motivating example, the Laplace distribution with added kurtosis, described below.
Example 7. If and so is a Laplace probability density function with scale parameter , with added kurtosis and bounded support as in (13), then for such that , , Now is not differentiable in the classical sense with respect to the linear model coefficients when any . Hence, we cannot assume that is differentiable at a MLE. This paper addresses this issue firstly by proposing a non-gradient method of coefficient estimation and secondly (in Section 4) by using generalized functions to calculate statistical estimates including estimates of standard errors. In Section 5, we discuss the asymptotic theory of our MLE.
3.2. Coefficient Estimation Dealing with Abrupt Changes in Gradient
Although and are continuous functions of , their first derivatives are not. Consider the geometry of the coefficient space , where . For each index , since the response function distribution is defined in terms of absolute values, we can find a hyperplane in on which and are not differentiable, defined by setting . Let denote the th row of ; it is never the zero vector since , . Choose so that . Then yields . Let be the set of all such . For example, for , for each error term there is a line in on which and have a sharp ridge. By inspection of the geometry, we would expect the values of which maximize to be either on the union of the hyperplanes or very close to intersections of the hyperplanes , . Imagine searching in -space near the hyperplanes . Even if has a local maximum near but not on the union of the hyperplanes, it would be difficult to use a method based on the gradient of either or , since the gradient changes sharply whenever we cross one of the . The simplex method of coefficient estimation, which does not require any partial derivatives, suits this geometry.
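A derivative-free fit of the kind described above can be sketched with a standard Nelder-Mead (simplex) routine; here we minimize the negative Laplace log-likelihood, which up to an additive constant is a scaled sum of absolute deviations. Data, parameter values, and names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
beta_true = np.array([0.3, -0.7])
y = X @ beta_true + rng.laplace(scale=0.05, size=n)

def neg_loglik(beta, b=0.05):
    # Up to an additive constant, the Laplace negative log-likelihood is
    # the sum of absolute deviations scaled by 1/b.
    return np.abs(y - X @ beta).sum() / b

# Nelder-Mead uses no gradients, so the sharp ridges where some residual
# vanishes cause it no difficulty.
res = minimize(neg_loglik, x0=np.zeros(p), method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-6, "maxiter": 2000})
beta_hat = res.x
```

Starting from the origin, the simplex search recovers coefficients close to the generating values despite the non-smooth objective.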
3.3. The Maximum Likelihood Estimator Corresponds to a Data Point
3.3.1. Convex and Non-Increasing Perturbations of the Laplace Probability Density Function
Consider the probability density function with support , for some finite as described in Section 2.2 (Example 3). Recall we assume that there exists some such that is in on and that on , for fixed parameter . Let We consider the log-likelihood function (conditional on ) with its domain restricted to that is, constrained to or equivalently, consider the log-likelihood function constrained to , where is defined as . Note that and similarly . Also, if the function is defined by then .
Lemma 8. If is convex and non-increasing on , then there exists a maximum of corresponding to at least data points. That is, there exists such that the constrained attains its maximum at , and there exist at least indices such that , .
Corollary 9. Let be the Laplace probability density function (8) with support and parameter . Then there exists a maximum of corresponding to at least data points.
Note that Corollary 9 could be proved directly by linear programming theory. Linear programming has been applied to the problem of minimising the sum of absolute errors in various applications (see , a survey paper, and also ). The convex analysis results we require are in Appendix A.
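The linear-programming route mentioned here can be sketched directly: minimizing the sum of absolute errors is the classical LP with one auxiliary variable per observation. The sketch also checks the vertex property of Corollary 9, namely that the fitted hyperplane passes through at least p data points. All names and parameter values are ours.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
y = X @ np.array([0.3, -0.7]) + rng.laplace(scale=0.05, size=n)

# Minimize sum(u) over (beta, u) subject to u_i >= +(y_i - x_i.beta)
# and u_i >= -(y_i - x_i.beta); at the optimum u_i = |y_i - x_i.beta|.
c = np.concatenate([np.zeros(p), np.ones(n)])
A_ub = np.block([[-X, -np.eye(n)],
                 [X, -np.eye(n)]])
b_ub = np.concatenate([-y, y])
bounds = [(None, None)] * p + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)

beta_hat = res.x[:p]
residuals = y - X @ beta_hat
# A vertex solution interpolates at least p data points (Corollary 9).
n_interpolated = int(np.sum(np.abs(residuals) < 1e-6))
```

The solver returns a basic (vertex) solution, so at least p residuals vanish to numerical precision, matching the lemma's prediction that the maximum corresponds to data points.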
Proof of Lemma 8. To begin, assume that has full rank , recall and that has bounded support. Then is a compact convex subset of an -dimensional affine subspace of . Since the mapping is linear and has full rank , the inverse image is a compact convex subset of . We partition , which is the support of , into a finite collection of compact convex sets, so that, on each subset, is convex.
Let be the hyperplane in corresponding to the error term . Then and are the hyperplanes in corresponding to errors and , respectively. It follows that the log-likelihood function is a convex function in between the hyperplanes , , and , . These hyperplanes divide the domain in into at most open sets bounded by (but not intersecting) the hyperplanes. Each such open set (and hence its closure) may be labelled by a set of signs. For any such that , ; labels the open set containing .
Next, consider as the union of its orthants, which we denote , . We assume the orthants are closed sets. For example, the non-negative orthant is . In the interior of any , the sign of does not change, . Relabel the open subsets , where , so that . Let .
Now, , where denotes the closure of the set. Since is bounded by hyperplanes, it is convex. It is closed and bounded and hence compact. Choose , such that is non-empty. Since continuous functions are bounded on compact sets, the supremum of , when restricted to , must be attained at one or more points in . By Corollary A.3, the supremum (in our case the maximum) of on is attained on the whole set, on a union of faces of dimension less than , or at a vertex. Since there are a finite number of sets to consider, the maximum of must occur at a vertex but might also occur, for example, on the whole of a set or on a union of faces. This is important to consider when using search algorithms such as the simplex method, as repeated application with different starting points may give a set of solutions which, for example, lie on a line segment. Note the following points.
(i) Assuming that , any vertex of the set in at which attains its maximum must correspond to data points, possibly more (degeneracy). This is due to the fact that in , the gradient points in the direction of the boundary of the corresponding orthant and away from the boundary of .
(ii) A MLE is not necessarily unique.
(iii) If , then we may apply the same reasoning to a subspace of of dimension on which the error mapping has full rank .
(iv) Since at the MLE the absolute values of the deviations will all be small, this proof for finite may be extended to .
Lemma 8 is useful but we need to know what happens for more general perturbing functions . First, we consider the Laplace distribution without perturbation.
3.3.2. The Truncated Laplace Probability Density Function
For x ∈ [−T, T], let f be the Laplace probability density function (8) with scale parameter b and with support [−T, T]. Then f(x) = exp(−|x|/b)/c(b, T) and log f(x) = −|x|/b − log c(b, T). The latter is a piecewise affine function in x. It has a maximum value of −log c(b, T) when x = 0. More generally, for independent deviations (error terms) e_1, …, e_n in [−T, T], the log-likelihood is −(1/b) Σ_i |e_i| − n log c(b, T). Hence, the log-likelihood function is, up to a constant term, a piecewise linear function in the error terms, with a maximum attained when all the errors are zero. However, we must restrict our domain to the support of the errors, or equivalently to the corresponding set of coefficients. The log-likelihood function is a piecewise linear function, up to a constant term, in between the hyperplanes , , and , . Let x(j) denote the jth column of X, j = 1, …, p. Let V denote the span of the columns of X.
Now has a critical point at if and only if the gradient (evaluated at ) is orthogonal to . The columns of are tangent vectors to at this point. The gradient is constant in the interior of any orthant. If we travel along a straight line path in any orthant, either always increases, always decreases, or remains constant. Hence, we will not find an isolated local maximum or minimum in , an open set, for the constrained .
We need to be aware of the case where lies in or very close to a level set of . We might need to test for this. This happens when the sign vector is orthogonal to , or nearly so. In the former case, the constrained is constant on . In the latter case, the constrained will differ very little around the maximum on . If this is the case for all the , then the ML values for the coefficients will not be sharply defined (will have large variance).
3.3.3. More General Perturbations of the Laplace Probability Density Function
The question is, given a nontrivial perturbing function , does the log-likelihood function attain its maximum at a data point? We have given conditions on in Lemma 8 which are sufficient to ensure the maximum is attained at a data point. We give a more general criterion in Lemma 10.
Assume that is non-linear in any orthant. Otherwise, we can write in the form of a scaled Laplace distribution and apply Lemma 8. Then is the sum of an affine function and a non-linear function in any orthant. This affine function is the log-likelihood function corresponding to the Laplace distribution. The non-linear function is Then , and so , where
It is possible that there exist orthants in which the set is orthogonal to the gradient . It is possible that attains its global maximum (with respect to ) on the whole of such an . Hence, it is important to consider the behaviour of in such orthants.
Lemma 10. Assume bounded support and let where denotes supremum. Then, if , the supremum or maximum of is attained at a data point. In the special case that is orthogonal to in any orthant, it may be that the supremum is also attained elsewhere.
Proof of Lemma 10. Choose , where , such that the set is non-empty, choose (an open set relative to ), and let be the projection of
onto . Consider the case is non-zero. Then the affine function is not constant on . Given any in , there exists some and a map
which is increasing at , since
Since is continuous and strictly positive on , is finite. We show that if , at , that is, at . Specifically,
if and is non-zero. Then the supremum of , constrained to , must be attained on the boundary of , relative to , and this must be at a data point due to the direction of .
In the limiting case, where , but we still have , we find that the supremum of , constrained to (relative to ), must be attained at a data point but may be attained at other points in as well. Assume this is not the case, that is, there exists no data point at which attains its supremum when constrained to . We find a contradiction as follows. Assume there exists some , an open set (relative to ), such that where denotes boundary. Imagine perturbing the columns of slightly (and continuously) to obtain so that , a continuous function, conditional on , changes by at most , that is, for all , and so that now . We also require the perturbation small enough that Now for some unique . Let . If the perturbation of is small enough, then and so .
Now, . Hence, which is not possible since . Hence, we have a contradiction. Now, for completeness, assume there exists some , not a data point (that is some ) such that Then there exists , such that Hence, The contradiction follows as above. Hence, we have proved the lemma.
3.3.4. An Amended Laplace Probability Density Function with Added Kurtosis
We may apply Lemma 10 to show that, for the amended Laplace probability density function with added kurtosis (Example 5) and for realistic values of and , the maximum of the log-likelihood function must be attained at a data point. Here, on , that is, on . In dimensions, Consider, for , When , is negative when and positive when , and so the continuous monotonic function has a point of inflection, where in , at say . Hence, and are both concave and convex on . When is small, the concavity occurs close to the origin and the functions are convex on most of . Moreover, is always negative on , and its limit as is also negative, so we will not get isolated local maxima in one dimension. The function and its first and second derivatives are plotted in Figures 1, 2, and 3, respectively; setting , a typical value. Reading from Figure 2, the upper bound is approximately 0.075, when . Since typically , the criterion () for Lemma 10 is satisfied.
3.3.5. Non-Increasing Perturbations Both Concave and Convex
In certain situations we might find the criteria for both Lemmas 8 and 10 are not satisfied but that is non-increasing on and convex on , where is small. Since a sum of convex functions is convex, will be convex wherever is convex for . For example, for and , we can prove a partial result as follows. By restricting to the domain , or equivalently by substituting for each set the subset , and applying convex function theory as for Lemma 8, we can prove that there exists at which attains its maximum, and there exists at least one index such that . Hence, there exists some at which attains its maximum, and there exists at least one index such that . In other words, there exists some point at which has a maximum at which at least one of the errors is small. It makes sense to search for the maxima of near or at vertices.
For , we may apply Lemma 10 and so do not need this partial result, but for a perturbation similar in shape to , non-increasing everywhere on , with convex everywhere on , concave everywhere on , for some small positive , and with steep slope at the point of inflection (), such analysis would be useful.
4. Statistics for Linear Model Coefficients Assuming Perturbed and Truncated Laplace Response Functions
4.1. Dealing with Abrupt Changes in Gradient
The inclusion of the modulus (absolute value) function in the Laplace probability density function (6) (and variations thereof) is the cause of abrupt changes in the gradient of the log-likelihood function (see Section 4.2). This section is devoted to dealing with the problems which are associated with these abrupt changes, encountered when deriving statistical formulae, for example, for standard errors.
The fact that is not differentiable in the classical sense at a local maximum means that the assumptions made in the derivation of the usual classical formulae for the information matrix, the expected value of the Hessian of the log-likelihood function, and the variance-covariance matrix for the model coefficients are not met. For probability density functions (and log-likelihood functions), these formulae are derived using Taylor series. We find alternative expressions for these quantities assuming the truncated and/or perturbed Laplace response functions (as defined in Section 2), which are twice continuously differentiable where the modulus function is non-zero. In Section 5 these expressions will be used to prove the asymptotic convergence of our MLE to a random variable with a normal distribution.
4.2. Differentiation in a Generalized Sense
The following generalized functions are required to determine the first and second partial derivatives of the log-likelihood function with respect to the coefficients. These derivatives are needed for the calculation of the standard errors. We require the sign function sgn(x) and the delta function δ(x), which is zero except at x = 0 and satisfies ∫ δ(x) dx = 1. These expressions and the modulus function are connected by d|x|/dx = sgn(x) and d sgn(x)/dx = 2δ(x), where the differentiation is taken in the generalized sense (see [9, 10]). Hence, for a continuously differentiable function g, the generalized derivative of g(|x|) is sgn(x) g′(|x|). Also, the derivative of the delta function may be defined via integration by parts: assuming φ is continuously differentiable, we have ∫ δ′(x) φ(x) dx = −φ′(0).
In Section 5, we investigate the behaviour of our model as , and so use subscripts to clarify the variables under consideration ( or ) and/or the dimension of the space(s) under consideration. Using (10) and (59), it follows that Since (see (5)), Letting denote the gradient of and letting denote the gradient of , we have In addition, using (60) and omitting the dependence of upon its parameters for brevity, and, if ,
Let denote the generalized Hessian of , where and let denote its expected value. Then and are diagonal matrices. Since the diagonal elements of are all equal (see Section 4.6), this matrix is a multiple of the identity. We have where denotes the identity matrix. Let denote the generalized Hessian of , where and let denote its expected value. Then and If any , then the th diagonal element of is infinite. In this case has infinite components. We prove in Section 4.6 that is finite.
4.3. A Classical Relation to Be Generalized
Let I denote the Fisher information matrix, where I = E[∇l (∇l)ᵀ]. The components of our MLE depend on the errors and have a distribution whose variance-covariance matrix we denote Σ. If the log-likelihood function were sufficiently smooth around the region of interest (i.e., around its maximum value), then Taylor series expansions could be used to derive a relationship between I, the expected Hessian E[H], and Σ, namely, Σ = I⁻¹ = (−E[H])⁻¹ (75). However, our log-likelihood function is not sufficiently smooth, and so we cannot make use of this relationship without further justification. In general, (75) does not hold, assuming a truncated (and possibly perturbed) Laplace distribution.
In Sections 4.5 and 4.6 we calculate the expected values of the first and second partial derivatives of the log-likelihood function using generalized functions; this enables us to derive a generalized Taylor series expansion for the log-likelihood function about a maximum even when the maximum is, for example, on a ridge or at a vertex. Also, this enables us to derive an expression for the generalized variance-covariance matrix for the MLEs of the model coefficients and an expression for the generalized log-likelihood ratio statistic. These formulae differ from the standard formulae for the case of smooth log-likelihood functions, although their form is similar. Specifically, in our case, we prove that is a multiple of , but that the multiple is not −1, rather, it is a negative real number that depends on , the perturbation , its parameters and the bound . We prove that our generalized is a multiple of , where the multiple is a positive real number depending on , , , and . We assume independent error distributions.
4.4. The Mean and Variance of the Partial Derivatives of the Log-Likelihood Function
The mean and the variance of are required in the calculation of . Recall is the joint probability density function for the independent deviations (errors), and that so Let ; then is independent of index . Since , , and are symmetric about the origin, and so for any choice of , , , and . Let where denotes variance. Using (63) for , where, for , and . Since is independent of index , we omit the subscript. Hence, the information matrix , defined with respect to , is
If , then , and so For the nontrivial perturbing function , for fixed and , one can show that depends on by direct calculation.
4.5. The Information Matrix
We calculate the information matrix (conditional on , , , and ). We are trying to quantify the steepness of the slope of around a maximum, in the directions represented by the coefficients . If is very flat in one direction, then the model coefficient representing that direction is not well defined (will have large variance). When calculating , we are taking into account the behaviour of the gradient of on a whole neighbourhood of the MLE (how it differs from the expected value) and discontinuities on sets of measure zero can be accommodated. Recall (64), for , Hence, omitting some subscripts on for brevity, we obtain