Linear Models with Response Functions Based on the Laplace Distribution: Statistical Formulae and Their Application to Epigenomics
The statistical application considered here arose in epigenomics, linking the DNA methylation proportions measured at specific genomic sites to characteristics such as phenotype or birth order. It was found that the distribution of errors in the proportions of chemical modification (methylation) on DNA, measured at CpG sites, may be successfully modelled by a Laplace distribution perturbed by a Hermite polynomial. We use a linear model with such a response function. Hence, the response function is known, or assumed well estimated, but fails to be differentiable in the classical sense due to the modulus function. Our problem was to estimate coefficients for the linear model and the corresponding covariance matrix and to compare models with varying numbers of coefficients. The linear model coefficients may be found using the (derivative-free) simplex method, as in quantile regression. However, this theory does not yield a simple expression for the covariance matrix of the coefficients of the linear model. Assuming response functions which are twice continuously differentiable except where the modulus function attains zero, we derive simple formulae for the covariance matrix and a log-likelihood ratio statistic, using generalized calculus. These original formulae enable a generalized analysis of variance and further model comparisons.
1. Introduction and Motivation
This work arose in a biological context, in epigenomics, namely, the modelling of the distribution of errors in the proportions of chemical modification (methylation) on DNA, measured at specific genomic sites (CpG sites). It was observed that this error distribution may be suitably modelled by a truncated Laplace distribution perturbed by a Hermite polynomial.
This error distribution was first noticed in Sequenom measurements but has wider application. A survey of data generated by measurements on the Infinium, Illumina, Affymetrix, and MeDIP2 machines showed characteristics similar to those of the Sequenom, where such an amended Laplace distribution was required to properly describe the probability density function. It is thought that variation in the scattering angle of light in the measurement processes common to all of these platforms is responsible for the frequencies in the tails of the measurement distributions not conforming to a simple Laplace density and requiring our proposed amendment. Without the amendment, the Laplace density gives tail probabilities for the deviations that are too high, potentially leading to an incorrect failure to reject a null hypothesis. Because the observed frequency distribution of epigenomic and gene expression measurements appears to be a common feature of molecular biology, it is important that estimation and inference under the amended Laplace probability density be studied; this paper reports results from such a study.
We extend the standard theory of linear models to deal with response variables with distributions more general than the exponential family. We consider the Laplace distribution, and amendments thereof, with probability density functions which have abrupt changes in gradient due to the presence of the modulus function. The theory in this paper corresponds to least absolute error (LAE) (or least absolute deviation (LAD)) regression [2–4], also called median regression, when the response function is the Laplace distribution without modification. We focus on coefficient estimation for a linear model with a response variable distribution assumed to be a truncated and/or perturbed version of the Laplace distribution, estimating the standard errors of these coefficients and understanding the asymptotic theory.
The theory of generalized linear models covers the case of distributions from the exponential family. These distributions have probability density functions which are twice continuously differentiable everywhere on their support. The usual expressions for the standard errors of the model coefficients in generalized linear models are derived using Taylor series and assume distributions with probability density functions which are twice continuously differentiable everywhere on their support. They cannot be applied to our model due to the presence of the modulus function.
The modified Laplace probability density functions considered here have a sharp peak at the maximum. Maximum likelihood estimation (MLE) of coefficients may be done by non-gradient methods, such as the simplex method. However, the usual classical expressions for the standard errors of the coefficients, the information matrix, and the log-likelihood ratio statistic do not apply due to lack of differentiability. We derive expressions for generalized versions of these quantities using generalized functions. Consequently, we show that our MLE is asymptotically normal.
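In the simplest special case, an intercept-only model with an unmodified Laplace response, the MLE reduces to the sample median, and the asymptotic normality asserted above can be checked by simulation. The sketch below is purely illustrative; the parameter values and variable names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(42)
b, n, reps = 1.0, 400, 2000

# For an intercept-only model with Laplace(0, b) errors, the MLE of the
# intercept is the sample median; its asymptotic variance is
# 1 / (4 n f(0)^2) = b^2 / n, since f(0) = 1 / (2 b).
medians = np.median(rng.laplace(scale=b, size=(reps, n)), axis=1)

sd_theory = b / np.sqrt(n)
sd_sample = medians.std()
# Under approximate normality, about 68% of estimates fall within one sd.
frac_within_1sd = np.mean(np.abs(medians) < sd_theory)
```

The observed spread of the replicated medians matches the asymptotic standard deviation closely, and roughly the normal fraction of estimates lands within one standard deviation of zero.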
The method we present to estimate these statistics could in principle be applied to other probability density functions exhibiting abrupt changes in gradient. Response function parameters are assumed known or previously estimated. The theory is applied to find the standard errors for coefficients of a linear model, assuming the response function has a truncated Laplace distribution with added kurtosis, due to perturbation by a Hermite polynomial. To illustrate the application, we show how birth order can be linked to methylation status at two CpG sites in the promoter of the H19 gene.
2. The Model
2.1. The Expectation Is Linear
Let y be an n × 1 vector of response variables; let X be an n × p matrix of explanatory variables (real-valued). The subscripts denote the dimensions and will be omitted when these are assumed fixed (in Sections 2 and 3). Let β be a p × 1 vector of coefficients for our linear model and assume that E(y) = Xβ. Then each component of the deviation (or error) vector e = y − Xβ has expectation zero. The explanatory variables may be continuous or discrete. We assume n ≥ p. Let r denote the rank of X; then r ≤ p. In practice, we usually have n, the number of data points, much larger than p, the number of coefficients. Our goal is estimating the components of β by ML principles, and determining their standard errors, given a set of response variables y, explanatory variables X, and a response variable distribution based on the Laplace distribution as described below. In terms of generalized linear models, the link function is assumed to be the identity.
2.2. The Distribution of the Deviations—A Modified Laplace Distribution
Example 1. Let f be defined by f(x) = exp(−|x|/b)/c(b), x ∈ ℝ, where b > 0 is a real-valued scale parameter and c(b) is a real-valued normalizing function defined so that f integrates to one; here c(b) = 2b. Then f is the probability density function for the Laplace distribution, with scale parameter b, centred at the origin, and with unbounded support. It is not differentiable at the origin in the classical sense.
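The density of Example 1 can be written down and checked numerically. In the minimal sketch below, the names `laplace_pdf`, `b`, and the normalizing constant 2b follow the standard Laplace parameterization; the naming and parameter values are our choices.

```python
import math

def laplace_pdf(x, b=1.0):
    # Laplace density centred at the origin with scale b: exp(-|x|/b) / (2 b).
    return math.exp(-abs(x) / b) / (2.0 * b)

def integrate(f, lo, hi, n=200_000):
    # Plain trapezoidal rule; adequate here since the kink at 0 lies on a node.
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

# Total probability mass; tails beyond +/-20 are negligible for b = 0.5.
mass = integrate(lambda x: laplace_pdf(x, b=0.5), -20.0, 20.0)
```

The numerical mass agrees with one to high accuracy, and the density is symmetric about the origin with peak value 1/(2b), illustrating the sharp (non-differentiable) maximum at zero.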
The method of MLE for the response function (6) corresponds to least absolute error (LAE) regression [2–4]. However, the theory of LAE regression is not sufficiently general for our epigenomic modelling problem. We next describe the more general response functions we require.
Example 2. Now consider the case of bounded support. For finite T > 0, define f by f(x) = exp(−|x|/b)/c(b, T), x ∈ [−T, T], where b > 0 is a real-valued scale parameter and c(b, T) is a real-valued normalizing function defined so that f integrates to one over [−T, T]; here c(b, T) = 2b(1 − exp(−T/b)). Then f is the probability density function for the truncated Laplace distribution with scale parameter b, centred at the origin, and with bounded support [−T, T].
Example 3. More generally, consider perturbations of the truncated Laplace probability density function of the following form. Let where real-valued is equal to the constant map plus a perturbation, is a vector of parameters for , and parameter vector , . If , has no parameters. Here, is a real-valued normalizing function defined so that We assume that there exists some such that is in on and that on , for fixed parameter vector . Note that, as a consequence of using the modulus function, will not be differentiable with respect to , at , in general. As in Example 1, is not differentiable at the origin due to the use of the modulus function.
Example 4. We could allow unbounded support if is finite.
Example 5. Now consider our motivating example, a truncated Laplace distribution with bounded support , perturbed by adding kurtosis. Such a distribution is used to model the deviations in the proportions of methylation measured at gene promoter CpG sites. Specifically, to fit with observations, kurtosis is added to a Laplace probability density function with bounded support by adding a third-order Hermite polynomial to give an amended version.
Consider Here, , is small, , and is the third-order Hermite polynomial. Solving for yields
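Since the displayed formula (13) has not survived in this text, the sketch below illustrates only the general recipe of Example 5: a symmetric, Hermite-type perturbation of the truncated Laplace kernel, normalized numerically. The kernel form, the probabilists' convention H3(u) = u³ − 3u, and all parameter values are illustrative assumptions, not the paper's exact amendment.

```python
import math

def H3(u):
    # Probabilists' third-order Hermite polynomial (our convention assumption).
    return u**3 - 3.0 * u

def amended_kernel(x, b=0.05, eps=5e-4, T=0.5):
    # Illustrative symmetric perturbation of the truncated Laplace kernel on
    # [-T, T]; NOT the paper's exact equation (13), whose display is lost here.
    u = abs(x) / b
    return math.exp(-u) * (1.0 + eps * H3(u))

def normalizer(kernel, T=0.5, n=100_000):
    # Numerical normalizing constant c such that c * kernel integrates
    # to one over [-T, T] (trapezoidal rule; the kink at 0 lies on a node).
    h = 2.0 * T / n
    total = 0.5 * (kernel(-T) + kernel(T))
    for i in range(1, n):
        total += kernel(-T + i * h)
    return 1.0 / (total * h)
```

For a small perturbation weight the kernel stays strictly positive on the support, remains symmetric, and the numerically computed normalizer is stable under grid refinement, which is all the general theory of Example 3 requires.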
Example 6. The functions and , assuming small positive and bounded support, could be used in (10) to model distributions similar to the Laplace but with thinner tails.
We restrict to symmetric distributions, that is, those whose probability density functions satisfy f(−x) = f(x).
3. Maximum Likelihood Estimation
3.1. The Log-Likelihood Function
Let be a probability density function with parameter vector , as described in Section 2. Let be a sequence of independent and identically distributed deviations with joint probability density function This is also the likelihood function which may be regarded as a function of or ; here, the subscript reflects our point of view. We use the log-likelihood function in the estimation of , where, using various notation, Substituting measured values of and known inputs into , we obtain a function of and . The parameters are assumed known, but if not, may be estimated separately. In our biological application, they are estimated by MLE and are assumed fixed for a particular measuring process. Hence, we have a function of , the coefficients of our linear model.
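Concretely, for the unmodified Laplace response the log-likelihood of the linear-model coefficients is simply the sum of log-densities of the residuals. A minimal sketch (all names and values are ours):

```python
import math

def laplace_logpdf(e, b):
    # Log of the unmodified Laplace density with scale b (Example 1).
    return -abs(e) / b - math.log(2.0 * b)

def log_likelihood(beta, X, y, b):
    """Log-likelihood of linear-model coefficients beta, given i.i.d.
    Laplace deviations e_i = y_i - x_i . beta."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        e_i = y_i - sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))
        total += laplace_logpdf(e_i, b)
    return total
```

When the coefficients fit the data exactly, every residual is zero and the log-likelihood attains its maximum value of −n log(2b), consistent with the sharp peak of the Laplace density at the origin.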
Our aim is to find a maximum likelihood estimator (MLE) denoted , that is, some point at which attains its maximum value. The subscript corresponds to the number of deviations. Now is a continuous function. If is finite, it has compact support in . Since a continuous function on a compact set attains its supremum, the existence of a MLE for is guaranteed. Even if , we may consider truncations with finite bounds , . Since is maximized when the are small, truncating to for large enough will not affect the set of points at which attains its maximum. We show in Section 3.3 that a MLE is not necessarily unique.
Case 1. If and so is the Laplace probability density function with parameter as in (6) or (8), then, for such that every is in the support of ,
where if the support of is . If the support of is then
The theory of LAE regression (corresponding to MLE using Laplace distributions without modification as response functions) may be found in various texts. There, it is proved that there exists a MLE corresponding to at least zero errors. We are concerned with the extension of these ideas to the case of perturbed and truncated Laplace response functions. For the truncated Laplace distribution, we prove that there exists a MLE corresponding to at least zero errors. Consider the following more general case.
Case 2. If is a perturbed Laplace probability density function with perturbing function and bounded support as in (10), then for such that , ,
Note since is strictly positive on , so is and so is well-defined on , .
In Section 3.3, we show that if the perturbing function is such that is convex and non-increasing on , there exists a MLE corresponding to at least data points. We give an upper bound on on which, if not exceeded, ensures that there exists a MLE corresponding to at least one data point. We apply these results to our motivating example, the Laplace distribution with added kurtosis, described below.
Example 7. If and so is a Laplace probability density function with scale parameter , with added kurtosis and bounded support as in (13), then for such that , , Now is not differentiable in the classical sense with respect to the linear model coefficients when any . Hence, we cannot assume that is differentiable at a MLE. This paper addresses this issue firstly by proposing a non-gradient method of coefficient estimation and secondly (in Section 4) by using generalized functions to calculate statistical estimates including estimates of standard errors. In Section 5, we discuss the asymptotic theory of our MLE.
3.2. Coefficient Estimation Dealing with Abrupt Changes in Gradient
Although and are continuous functions of , their first derivatives are not. Consider the geometry of the coefficient space , where . For each index , since the response function distribution is defined in terms of absolute values, we can find a hyperplane in on which and are not differentiable, defined by setting . Let denote the th row of ; it is never the zero vector since , . Choose so that . Then yields . Let be the set of all such . For example, for , for each error term there is a line in on which and have a sharp ridge. By inspection of the geometry, we would expect the values of which maximize to be either on the union of the hyperplanes or very close to intersections of the hyperplanes , . Imagine searching in -space near the hyperplanes . Even if has a local maximum near but not on the union of the hyperplanes, it would be difficult to use a method based on the gradient of either or , since the gradient changes sharply whenever we cross one of the . The simplex method of coefficient estimation, which does not require any partial derivatives, suits this geometry.
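A derivative-free fit of the kind described above can be sketched with a standard Nelder-Mead (simplex) routine; here we minimize the negative Laplace log-likelihood, which up to an additive constant is a scaled sum of absolute deviations. Data, parameter values, and names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
beta_true = np.array([0.3, -0.7])
y = X @ beta_true + rng.laplace(scale=0.05, size=n)

def neg_loglik(beta, b=0.05):
    # Up to an additive constant, the Laplace negative log-likelihood is
    # the sum of absolute deviations scaled by 1/b.
    return np.abs(y - X @ beta).sum() / b

# Nelder-Mead uses no gradients, so the sharp ridges where some residual
# vanishes cause it no difficulty.
res = minimize(neg_loglik, x0=np.zeros(p), method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-6, "maxiter": 2000})
beta_hat = res.x
```

Starting from the origin, the simplex search recovers coefficients close to the generating values despite the non-smooth objective.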
3.3. The Maximum Likelihood Estimator Corresponds to a Data Point
3.3.1. Convex and Non-Increasing Perturbations of the Laplace Probability Density Function
Consider the probability density function with support , for some finite as described in Section 2.2 (Example 3). Recall we assume that there exists some such that is in on and that on , for fixed parameter . Let We consider the log-likelihood function (conditional on ) with its domain restricted to that is, constrained to or equivalently, consider the log-likelihood function constrained to , where is defined as . Note that and similarly . Also, if the function is defined by then .
Lemma 8. If is convex and non-increasing on , then there exists a maximum of corresponding to at least data points. That is, there exists such that the constrained attains its maximum at , and there exist at least indices such that , .
Corollary 9. Let be the Laplace probability density function (8) with support and parameter . Then there exists a maximum of corresponding to at least data points.
Note that Corollary 9 could be proved directly by linear programming theory. Linear programming has been applied to the problem of minimising the sum of absolute errors in various applications (see , a survey paper, and also ). The convex analysis results we require are in Appendix A.
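The linear-programming route mentioned here can be sketched directly: minimizing the sum of absolute errors is the classical LP with one auxiliary variable per observation. The sketch also checks the vertex property of Corollary 9, namely that the fitted hyperplane passes through at least p data points. All names and parameter values are ours.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
y = X @ np.array([0.3, -0.7]) + rng.laplace(scale=0.05, size=n)

# Minimize sum(u) over (beta, u) subject to u_i >= +(y_i - x_i.beta)
# and u_i >= -(y_i - x_i.beta); at the optimum u_i = |y_i - x_i.beta|.
c = np.concatenate([np.zeros(p), np.ones(n)])
A_ub = np.block([[-X, -np.eye(n)],
                 [X, -np.eye(n)]])
b_ub = np.concatenate([-y, y])
bounds = [(None, None)] * p + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)

beta_hat = res.x[:p]
residuals = y - X @ beta_hat
# A vertex solution interpolates at least p data points (Corollary 9).
n_interpolated = int(np.sum(np.abs(residuals) < 1e-6))
```

The solver returns a basic (vertex) solution, so at least p residuals vanish to numerical precision, matching the lemma's prediction that the maximum corresponds to data points.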
Proof of Lemma 8. To begin, assume that has full rank , recall and that has bounded support. Then is a compact convex subset of an -dimensional affine subspace of . Since the mapping is linear and has full rank , the inverse image is a compact convex subset of . We partition , which is the support of , into a finite collection of compact convex sets, so that, on each subset, is convex.
Let be the hyperplane in corresponding to the error term . Then and are the hyperplanes in corresponding to errors and , respectively. It follows that the log-likelihood function is a convex function in between the hyperplanes , , and , . These hyperplanes divide the domain in into at most open sets bounded by (but not intersecting) the hyperplanes. Each such open set (and hence its closure) may be labelled by a set of signs. For any such that , ; labels the open set containing .
Next, consider as the union of its orthants, which we denote , . We assume the orthants are closed sets. For example, the non-negative orthant is . In the interior of any , the sign of does not change, . Relabel the open subsets , where , so that . Let .
Now, , where denotes the closure of the set. Since is bounded by hyperplanes, it is convex. It is closed and bounded and hence compact. Choose , such that is non-empty. Since continuous functions are bounded on compact sets, the supremum of , when restricted to , must be attained at one or more points in . By Corollary A.3, the supremum (in our case the maximum) of on is attained on the whole set, on a union of faces of dimension less than , or at a vertex. Since there are a finite number of sets to consider, the maximum of must occur at a vertex but might also occur, for example, on the whole of a set or on a union of faces. This is important to consider when using search algorithms such as the simplex method, as repeated application with different starting points may give a set of solutions which, for example, lie on a line segment. Note the following points.
(i) Assuming that , any vertex of the set in at which attains its maximum must correspond to data points, possibly more (degeneracy). This is due to the fact that in , the gradient points in the direction of the boundary of the corresponding orthant and away from the boundary of .
(ii) A MLE is not necessarily unique.
(iii) If , then we may apply the same reasoning to a subspace of of dimension on which the error mapping has full rank .
(iv) Since at the MLE the absolute values of the deviations will all be small, this proof for finite may be extended to .
Lemma 8 is useful but we need to know what happens for more general perturbing functions . First, we consider the Laplace distribution without perturbation.
3.3.2. The Truncated Laplace Probability Density Function
For x ∈ [−T, T], let f be the Laplace probability density function (8) with scale parameter b and with support [−T, T]. Then f(x) = exp(−|x|/b)/c(b, T) and log f(x) = −|x|/b − log c(b, T). The latter is a piecewise affine function in x. It has a maximum value of −log c(b, T) when x = 0. More generally, for independent deviations (error terms) e_1, …, e_n in [−T, T], the log-likelihood is −(1/b) Σ_i |e_i| − n log c(b, T). Hence, the log-likelihood function is, up to a constant term, a piecewise linear function in the error terms, with a maximum attained when all the errors are zero. However, we must restrict our domain to the support of the errors, or equivalently to the corresponding set of coefficients. The log-likelihood function is a piecewise linear function, up to a constant term, in between the hyperplanes , , and , . Let x(j) denote the jth column of X, j = 1, …, p. Let V denote the span of the columns of X.
Now has a critical point at if and only if the gradient (evaluated at ) is orthogonal to . The columns of are tangent vectors to at this point. The gradient is constant in the interior of any orthant. If we travel along a straight line path in any orthant, either always increases, always decreases, or remains constant. Hence, we will not find an isolated local maximum or minimum in , an open set, for the constrained .
We need to be aware of the case where lies in or very close to a level set of . We might need to test for this. This happens when the sign vector is orthogonal to , or nearly so. In the former case, the constrained is constant on . In the latter case, the constrained will differ very little around the maximum on . If this is the case for all the , then the ML values for the coefficients will not be sharply defined (will have large variance).
3.3.3. More General Perturbations of the Laplace Probability Density Function
The question is, given a nontrivial perturbing function , does the log-likelihood function attain its maximum at a data point? We have given conditions on in Lemma 8 which are sufficient to ensure the maximum is attained at a data point. We give a more general criterion in Lemma 10.
Assume that is non-linear in any orthant. Otherwise, we can write in the form of a scaled Laplace distribution and apply Lemma 8. Then is the sum of an affine function and a non-linear function in any orthant. This affine function is the log-likelihood function corresponding to the Laplace distribution. The non-linear function is Then , and so , where
It is possible that there exist orthants in which the set is orthogonal to the gradient . It is possible that attains its global maximum (with respect to ) on the whole of such an . Hence, it is important to consider the behaviour of in such orthants.
Lemma 10. Assume bounded support and let where denotes supremum. Then, if , the supremum or maximum of is attained at a data point. In the special case that is orthogonal to in any orthant, it may be that the supremum is also attained elsewhere.
Proof of Lemma 10. Choose , where , such that the set is non-empty, choose (an open set relative to ), and let be the projection of
onto . Consider the case is non-zero. Then the affine function is not constant on . Given any in , there exists some and a map
which is increasing at , since
Since is continuous and strictly positive on , is finite. We show that if , at , that is, at . Specifically,
if and is non-zero. Then the supremum of , constrained to , must be attained on the boundary of , relative to , and this must be at a data point due to the direction of .
In the limiting case, where , but we still have , we find that the supremum of , constrained to (relative to ), must be attained at a data point but may be attained at other points in as well. Assume this is not the case, that is, there exists no data point at which attains its supremum when constrained to . We find a contradiction as follows. Assume there exists some , an open set (relative to ), such that where denotes boundary. Imagine perturbing the columns of slightly (and continuously) to obtain so that , a continuous function, conditional on , changes by at most , that is, for all , and so that now . We also require the perturbation small enough that Now for some unique . Let . If the perturbation of is small enough, then and so .
Now, . Hence, which is not possible since . Hence, we have a contradiction. Now, for completeness, assume there exists some , not a data point (that is some ) such that Then there exists , such that Hence, The contradiction follows as above. Hence, we have proved the lemma.
3.3.4. An Amended Laplace Probability Density Function with Added Kurtosis
We may apply Lemma 10 to show that, for the amended Laplace probability density function with added kurtosis (Example 5) and for realistic values of and , the maximum of the log-likelihood function must be attained at a data point. Here, on , that is, on . In dimensions, Consider, for , When , is negative when and positive when , and so the continuous monotonic function has a point of inflection, where in , at say . Hence, and are both concave and convex on . When is small, the concavity occurs close to the origin and the functions are convex on most of . Moreover, is always negative on , and its limit as is also negative, so we will not get isolated local maxima in one dimension. The function and its first and second derivatives are plotted in Figures 1, 2, and 3, respectively; setting , a typical value. Reading from Figure 2, the upper bound is approximately 0.075, when . Since typically , the criterion () for Lemma 10 is satisfied.
3.3.5. Non-Increasing Perturbations Both Concave and Convex
In certain situations we might find the criteria for both Lemmas 8 and 10 are not satisfied but that is non-increasing on and convex on , where is small. Since a sum of convex functions is convex, will be convex wherever is convex for . For example, for and , we can prove a partial result as follows. By restricting to the domain , or equivalently by substituting for each set the subset , and applying convex function theory as for Lemma 8, we can prove that there exists at which attains its maximum, and there exists at least one index such that . Hence, there exists some at which attains its maximum, and there exists at least one index such that . In other words, there exists some point at which has a maximum at which at least one of the errors is small. It makes sense to search for the maxima of near or at vertices.
For , we may apply Lemma 10 and so do not need this partial result, but for a perturbation similar in shape to , non-increasing everywhere on , with convex everywhere on , concave everywhere on , for some small positive , and with steep slope at the point of inflection (), such analysis would be useful.
4. Statistics for Linear Model Coefficients Assuming Perturbed and Truncated Laplace Response Functions
4.1. Dealing with Abrupt Changes in Gradient
The inclusion of the modulus (absolute value) function in the Laplace probability density function (6) (and variations thereof) is the cause of abrupt changes in the gradient of the log-likelihood function (see Section 4.2). This section is devoted to dealing with the problems which are associated with these abrupt changes, encountered when deriving statistical formulae, for example, for standard errors.
The fact that is not differentiable in the classical sense at a local maximum means that the assumptions made in the derivation of the usual classical formulae for the information matrix, the expected value of the Hessian of the log-likelihood function, and the variance-covariance matrix for the model coefficients are not met. For probability density functions (and log-likelihood functions), these formulae are derived using Taylor series. We find alternative expressions for these quantities assuming the truncated and/or perturbed Laplace response functions (as defined in Section 2), which are twice continuously differentiable where the modulus function is non-zero. In Section 5 these expressions will be used to prove the asymptotic convergence of our MLE to a random variable with a normal distribution.
4.2. Differentiation in a Generalized Sense
The following generalized functions are required to determine the first and second partial derivatives of the log-likelihood function with respect to the coefficients. These derivatives are needed for the calculation of the standard errors. We require the sign function sgn(x) and the delta function δ(x), which is zero except at x = 0 and satisfies ∫ δ(x) dx = 1. These expressions and the modulus function are connected by d|x|/dx = sgn(x) and d sgn(x)/dx = 2δ(x), where the differentiation is taken in the generalized sense (see [9, 10]). Hence, for a continuously differentiable function g, the generalized derivative of g(|x|) is sgn(x) g′(|x|). Also, the derivative of the delta function may be defined via integration by parts: assuming φ is continuously differentiable, we have ∫ δ′(x) φ(x) dx = −φ′(0).
In Section 5, we investigate the behaviour of our model as , and so use subscripts to clarify the variables under consideration ( or ) and/or the dimension of the space(s) under consideration. Using (10) and (59), it follows that Since (see (5)), Letting denote the gradient of and letting denote the gradient of , we have In addition, using (60) and omitting the dependence of upon its parameters for brevity, and, if ,
Let denote the generalized Hessian of , where and let denote its expected value. Then and are diagonal matrices. Since the diagonal elements of are all equal (see Section 4.6), this matrix is a multiple of the identity. We have where denotes the identity matrix. Let denote the generalized Hessian of , where and let denote its expected value. Then and If any , then the th diagonal element of is infinite. In this case has infinite components. We prove in Section 4.6 that is finite.
4.3. A Classical Relation to Be Generalized
Let I denote the Fisher information matrix, where I = E[∇l (∇l)ᵀ]. The components of our MLE depend on the errors and have a distribution whose variance-covariance matrix we denote Σ. If the log-likelihood function were sufficiently smooth around the region of interest (i.e., around its maximum value), then Taylor series expansions could be used to derive a relationship between I, the expected Hessian E[H], and Σ, namely, Σ = I⁻¹ = (−E[H])⁻¹ (75). However, our log-likelihood function is not sufficiently smooth, and so we cannot make use of this relationship without further justification. In general, (75) does not hold, assuming a truncated (and possibly perturbed) Laplace distribution.
In Sections 4.5 and 4.6 we calculate the expected values of the first and second partial derivatives of the log-likelihood function using generalized functions; this enables us to derive a generalized Taylor series expansion for the log-likelihood function about a maximum even when the maximum is, for example, on a ridge or at a vertex. Also, this enables us to derive an expression for the generalized variance-covariance matrix for the MLEs of the model coefficients and an expression for the generalized log-likelihood ratio statistic. These formulae differ from the standard formulae for the case of smooth log-likelihood functions, although their form is similar. Specifically, in our case, we prove that is a multiple of , but that the multiple is not −1, rather, it is a negative real number that depends on , the perturbation , its parameters and the bound . We prove that our generalized is a multiple of , where the multiple is a positive real number depending on , , , and . We assume independent error distributions.
4.4. The Mean and Variance of the Partial Derivatives of the Log-Likelihood Function
The mean and the variance of are required in the calculation of . Recall is the joint probability density function for the independent deviations (errors), and that so Let ; then is independent of index . Since , , and are symmetric about the origin, and so for any choice of , , , and . Let where denotes variance. Using (63) for , where, for , and . Since is independent of index , we omit the subscript. Hence, the information matrix , defined with respect to , is
If , then , and so For the nontrivial perturbing function , for fixed and , one can show that depends on by direct calculation.
4.5. The Information Matrix
We calculate the information matrix (conditional on , , , and ). We are trying to quantify the steepness of the slope of around a maximum, in the directions represented by the coefficients . If is very flat in one direction, then the model coefficient representing that direction is not well defined (will have large variance). When calculating , we are taking into account the behaviour of the gradient of on a whole neighbourhood of the MLE (how it differs from the expected value) and discontinuities on sets of measure zero can be accommodated. Recall (64), for , Hence, omitting some subscripts on for brevity, we obtain