Abstract

The method of maximum entropy is quite a powerful tool to solve the generalized moment problem, which consists in determining the probability density of a random variable from the knowledge of the expected values of a few functions of the variable. In actual practice, such expected values are determined from empirical samples, leaving open the question of the dependence of the solution upon the sample. It is the purpose of this note to take a few steps towards the analysis of such dependence.

1. Introduction and Preliminaries

To state what the generalized moment problem is about, let $(\Omega,\mathcal{F},P)$ be a probability space and let $(S,\mathcal{S},m)$ be a measure space, with $m$ a finite or $\sigma$-finite measure. Let $X$ be an $S$-valued random variable such that its distribution has a density with respect to the measure $m$. The generalized moment problem consists in determining a density $f$ such that
$$\int_S g_i(x) f(x)\,dm(x) = \mu_i, \quad i = 0, 1, \dots, M, \qquad (1)$$
where $\{g_i : i = 1, \dots, M\}$ is a collection of measurable functions and $\mu_1, \dots, \mu_M$ are given real numbers, and we set $g_0 \equiv 1$ and $\mu_0 = 1$ to take care of the natural requirement $\int_S f\,dm = 1$ on $f$. A typical example is the following. $X$ stands for a positive random variable (a stopping time or, perhaps, a total risk severity), and we can compute $\mu_i = E\bigl[e^{-\alpha_i X}\bigr]$ by some Monte Carlo procedure at a finite number of points $\alpha_1, \dots, \alpha_M$. The problem that we need to solve then amounts to inverting the Laplace transform of $X$ from such a finite collection of values of the transform parameter.
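For concreteness, here is a minimal sketch (a hypothetical setup: a lognormal law stands in for the unknown distribution of $X$, and the transform points $\alpha_i$ are chosen arbitrarily) of how such Laplace-transform data are typically produced by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical positive "severity" variable X; a lognormal stands in for the unknown law.
def sample_X(n):
    return rng.lognormal(mean=0.0, sigma=0.75, size=n)

# Transform parameters alpha_1, ..., alpha_M at which the Laplace transform is probed.
alphas = np.array([0.5, 1.0, 1.5, 2.0, 2.5])

def laplace_moments(x, alphas):
    """Monte Carlo estimates of mu_i = E[exp(-alpha_i * X)]."""
    return np.array([np.exp(-a * x).mean() for a in alphas])

mu_hat = laplace_moments(sample_X(10_000), alphas)
print(mu_hat)  # the (estimated) moment data fed into the maxentropic inversion
```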

This Laplace inversion problem is of much interest in the banking and insurance industries, where the density $f$ is needed to compute risk premia and regulatory capital of various types; there, samples may be small, and the estimation of the $\mu_i$ reflects that. We direct the reader to Gomez-Gonçalves et al. [1], where this issue was addressed in the context of risk modeling and Laplace transform inversion.

Let us denote by $f^*$ the solution to problem (1) obtained by the maximum entropy method, as explained in Section 3 below, when the moments $\mu_i$ on the right-hand side are known exactly. As in many situations the moments have to be estimated empirically, as detailed in (2) below, it is to be expected that the maxentropic density obtained will depend on the sample used to compute them. We shall denote this maxentropic density by $f^*_n$ to emphasize its dependence on the empirical moments $\hat{\mu}_n$ computed from a sample of size $n$.

The problem that we address in this note is the convergence of the $f^*_n$ to $f^*$, as well as the oscillations of mean values computed with $f^*_n$ about the corresponding mean values computed with respect to $f^*$.

When we have a sample $X_1, \dots, X_n$ of $X$, the empirical generalized moments (the sample averages) are given by
$$\hat{\mu}_{i,n} = \frac{1}{n}\sum_{j=1}^{n} g_i(X_j), \quad i = 1, \dots, M, \qquad (2)$$
which fluctuate around the exact moments $\mu_i$; we thus expect the output of the maxentropic procedure to somehow reflect this variability. We write $\hat{\mu}_n$ for the vector with components $\hat{\mu}_{i,n}$.

For that, in the next section we recall in a (short) historical survey the notion of entropy of a density, and in the following section we present the basics of the maximum entropy method.

In Section 5, after gathering some auxiliary results in Section 4, we take up the main theme of this work: the variability of $f^*_n$ that comes in through $\hat{\mu}_n$. There we prove that $f^*_n$ converges pointwise and in $L_1(dm)$ to the maxentropic density $f^*$ obtained from the exact data, and we examine how $f^*_n$ deviates from $f^*$ in terms of the difference between the true and the estimated (sample) moments. We examine as well the deviations of expected values like $\int h f^*_n\,dm$ from $\int h f^*\,dm$. That the density reconstructed from empirical moments has to depend on the sample seems intuitive, but neither the behavior of the maxentropic density as the sample size increases nor the fluctuations of the expected values computed with these densities seems to have been studied before.

2. The Entropy of a Density

As there seem to be several notions of entropy, it is the aim of this section to point out that they are all variations on the theme of one single definition. Let us begin by spelling out what it is that we call the entropy of a density. Let $m$ be a measure on $(S,\mathcal{S})$. Suppose that $P \ll m$ and let $\rho = dP/dm$ denote its density. The entropy $S_m(P)$ (when we want to emphasize the density we write $S_m(\rho)$) is defined by
$$S_m(P) = -\int_S \rho \ln \rho \, dm \qquad (3)$$
whenever $\rho\ln\rho$ is $m$-integrable, or by $-\infty$ if not. $S_m(P)$ is called the entropy of $P$ (with respect to $m$) and $S_m(\rho)$ is called the entropy of $\rho$. Actually, we can also define the entropy of $P$ relative to a reference measure $Q \ll m$ as follows. When $Q$ is not necessarily a probability measure, having a density $q = dQ/dm$ with respect to $m$, (3) is to be modified as follows:
$$S_Q(P) = -\int_S \rho \ln\frac{\rho}{q}\, dm. \qquad (4)$$
When $Q$ is a probability measure, denote it by $P_2$, write $P_1$ for $P$, and suppose that both $P_1$ and $P_2$ are equivalent to the measure $m$, with densities given, respectively, by $\rho_1$ and $\rho_2$; then (4) becomes
$$S_{P_2}(P_1) = -\int_S \rho_1 \ln\frac{\rho_1}{\rho_2}\, dm, \qquad (5)$$
and we call it the entropy of $P_1$ with respect to $P_2$.

Comment. For the applications that we shall be dealing with, $S$ will stand for a closed, convex subset of some $\mathbb{R}^d$, and $m$ will be the usual Lebesgue measure. We also mention that, when $m$ is a discrete measure, the integrals become sums.
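As a small numerical illustration of definition (3), the following sketch computes the entropy of an (assumed, purely illustrative) density on $[0,1]$ with respect to Lebesgue measure:

```python
import numpy as np
from scipy.integrate import quad

# Entropy S_m(f) = -int f ln f dm of an illustrative density on [0, 1],
# with m the Lebesgue measure (assumed example, not taken from the paper).
f = lambda y: 2.0 * y                          # density of a Beta(2, 1) law
S_f, _ = quad(lambda y: -f(y) * np.log(f(y)), 0.0, 1.0)
print(S_f)                                     # = 1/2 - ln 2, about -0.193
```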

The expression (3) seems to have made its first appearance in the work of Boltzmann in the last quarter of the nineteenth century. There it was defined on the one-particle phase space, where $f(x,v)\,d^3x\,d^3v$ was to be interpreted as the number of particles with position within $d^3x$ and velocity within $d^3v$. The function happened to be a Lyapunov functional for the dynamics that Boltzmann proposed for the evolution of a gas, which grew as the gas evolved towards equilibrium. Not much later Gibbs used the same function, but now defined on the phase space of a system of $N$ particles, whose points denote the joint positions and momenta of the particles. This time $f$ is a probability density, and $f\,dq\,dp$ is the probability of finding the system within the specified “volume” element. Motivated by earlier work in thermodynamics, it was postulated that in equilibrium the density of the system yielded a maximum value of the entropy $S(f)$. These remarks explain the name of the method.

The expression (3) (with a reversed sign) made its appearance in the field of information transmission under the name of the information content of the density $\rho$; that is why it is sometimes called the Shannon-Boltzmann entropy. Also, expression (5) appears in the statistical literature under the name of the (Kullback) divergence of the density $\rho_1$ with respect to the density $\rho_2$; it is denoted by $K(\rho_1,\rho_2)$ and is equal to $-S_{P_2}(P_1) = \int_S \rho_1\ln(\rho_1/\rho_2)\,dm$. See Cover and Thomas [2] or Kullback [3] for a detailed study of the properties of the entropy functions.

Having made those historical remarks and having stated those equivalent definitions, we mention that we shall be working mostly with (3). In what comes below we make use of some interesting and well-known properties of (3) and (5), which we gather under the following.

Theorem 1. With the notation introduced above, one has the following:
(i) The function $\rho \mapsto S_m(\rho)$ is strictly concave.
(ii) For any two densities $\rho_1$ and $\rho_2$, $K(\rho_1,\rho_2) \ge 0$, and $K(\rho_1,\rho_2) = 0$ if and only if $\rho_1 = \rho_2$ a.e. with respect to $m$.
(iii) For any two densities $\rho_1, \rho_2$ such that $K(\rho_1,\rho_2)$ is finite, one has (Kullback's inequality)
$$\|\rho_1 - \rho_2\|_1^2 \le 2\,K(\rho_1,\rho_2). \qquad (6)$$

The reader is directed to either Cover and Thomas [2] or Kullback [3] for proofs.
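For a quick numerical sanity check of items (ii) and (iii), the following sketch evaluates the divergence and the $L_1$ distance for two illustrative densities on $[0,1]$ (an assumed example, chosen only because both integrals are easy to compute):

```python
import numpy as np
from scipy.integrate import quad

# Two illustrative densities on [0, 1] (assumed for the example).
f1 = lambda y: 1.0                              # uniform density
f2 = lambda y: np.exp(y) / (np.e - 1.0)         # another density on [0, 1]

# Item (ii): K(f1, f2) = int f1 ln(f1/f2) dy >= 0, with equality iff f1 = f2 a.e.
K, _ = quad(lambda y: f1(y) * np.log(f1(y) / f2(y)), 0.0, 1.0)

# Item (iii), Kullback's inequality: ||f1 - f2||_1^2 <= 2 K(f1, f2).
L1, _ = quad(lambda y: abs(f1(y) - f2(y)), 0.0, 1.0)

print(K, K >= 0.0, L1**2 <= 2.0 * K)            # ~0.041, True, True
```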

3. The Standard Maximum Entropy Method

Here we recall some well-known results about the standard maximum entropy (SME) method, along with some historical remarks. Even though the core idea seems to have first appeared in the work of Esscher [4], where he introduced what nowadays is called the Esscher transform, it was not until the mid-1950s that it became part of the methods used in statistics, through the work of Kullback [3]. It seems to have been first formulated as a variational procedure by Jaynes [5] to solve the (inverse) problem consisting in finding a probability density $f$ (on the phase space of a mechanical system) satisfying the following integral constraints:
$$\int_S g_i(x) f(x)\,dm(x) = \mu_i, \quad i = 1, \dots, M, \qquad (7)$$
where the $\mu_i$ are observed (measured) expected values of some functions $g_i$ (“observables” in the physicist's terminology) of the random variable $X$. That problem appears in many fields; see Kapur [6] and Jaynes [7], for example.

Usually, we set $g_0 \equiv 1$ and $\mu_0 = 1$ to take care of the natural normalization requirement on $f$. It takes a standard computation to see that, when the problem has a solution, it is of the type
$$f(x) = \exp\left(-\sum_{i=0}^{M}\lambda_i g_i(x)\right), \qquad (8)$$
in which the number $M$ of moments appears explicitly. It is customary to write $e^{-\lambda_0} = 1/Z(\lambda)$, where $\lambda$ is the $M$-dimensional vector with components $\lambda_1, \dots, \lambda_M$. Clearly, the generic form of the normalization factor is given by
$$Z(\lambda) = \int_S e^{-\langle\lambda, g(x)\rangle}\,dm(x). \qquad (9)$$
With this notation the generic form of the solution can be rewritten as
$$f_\lambda(x) = \frac{e^{-\langle\lambda, g(x)\rangle}}{Z(\lambda)}. \qquad (10)$$
Here $\langle\cdot,\cdot\rangle$ denotes the standard Euclidean scalar product in $\mathbb{R}^M$, and $g(x)$ is the vector with components $g_i(x)$, $i = 1, \dots, M$. At this point we mention that the simple-minded proof appearing in many applied mathematics or physics textbooks is not really correct, because the set of densities is not open in $L_1(dm)$. There are many alternative proofs; consider, for example, the work by Csiszar [8] and Cherny and Maslov [9].
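To fix ideas, here is a minimal numerical sketch of (9) and (10) for an assumed fractional-moment setup on $[0,1]$ with $g_i(y) = y^{\alpha_i}$ (the form the Laplace-transform problem takes after the change of variables $y = e^{-x}$); it is only an illustration of the formulas, not the implementation used in [1]:

```python
import numpy as np
from scipy.integrate import quad

# Assumed moment functions g_i(y) = y**alpha_i on [0, 1] (illustrative choice).
alphas = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
g = lambda y: y ** alphas

def Z(lam):
    """Normalization factor (9): Z(lambda) = int_0^1 exp(-<lambda, g(y)>) dy."""
    return quad(lambda y: np.exp(-np.dot(lam, g(y))), 0.0, 1.0)[0]

def f_lambda(y, lam):
    """Generic maxent density (10): exp(-<lambda, g(y)>) / Z(lambda)."""
    return np.exp(-np.dot(lam, g(y))) / Z(lam)

# Any lambda yields a bona fide density; matching given moments is the dual step below.
lam = np.array([0.3, -0.2, 0.1, 0.4, -0.1])
print(quad(lambda y: f_lambda(y, lam), 0.0, 1.0)[0])   # ~ 1.0, as it should be
```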

The heuristics behind (10) and what comes next are the following. If in statement (ii) of Theorem 1 we take $\rho_2$ to be any member of the exponential family (10) and $\rho_1 = \rho$ to be any density satisfying the constraints (7), the inequality becomes
$$S_m(\rho) \le \ln Z(\lambda) + \langle\lambda, \mu\rangle, \qquad (11)$$
which suggests that if we find a minimizer $\lambda^*$ of the right-hand side such that the inequality becomes an equality, then by Theorem 1 we may conclude that (10), with $\lambda = \lambda^*$, is the desired solution. This dualization argument seems to have been first proposed in Mead and Papanicolaou [10] and is expounded in full rigor in Borwein and Lewis [11]. The vector $\lambda^*$ can be found by minimizing the dual entropy
$$\Sigma(\lambda, \mu) = \ln Z(\lambda) + \langle\lambda, \mu\rangle, \qquad (12)$$
where $\mu$ is the $M$-vector with components $\mu_i$, and obviously the dependence of the solution on $\mu$ is through the minimizer $\lambda^*(\mu)$. We add that, technically speaking, the minimization of (12) is over the domain of $Z$, which is a convex set with a nonempty interior, and usually the minimum is achieved in its interior. In many applications the domain of $Z$ is all of $\mathbb{R}^M$. And, for the record, we state the result of the duality argument as follows.

Lemma 2. With the notations introduced above, if the minimizer $\lambda^*$ of (12) lies in the interior of the domain of $Z$, then
$$S_m\bigl(f_{\lambda^*}\bigr) = \Sigma(\lambda^*, \mu) = \ln Z(\lambda^*) + \langle\lambda^*, \mu\rangle. \qquad (13)$$

The proof goes as follows. Note that if $\lambda^*$ is a minimizer of (12), the first-order condition is $\nabla_\lambda \Sigma(\lambda^*, \mu) = 0$, which, written explicitly, states that (10) with $\lambda = \lambda^*$ satisfies the constraints (7). Since the entropy of this density is given by the right-hand side of (13), it must, on account of (11), be the density that maximizes the entropy.
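A minimal sketch of the dual step, under the same assumed fractional-moment setup as above: minimize $\Sigma(\lambda,\mu)$ in (12) with a generic optimizer and check that the first-order condition reproduces the input moments, in line with the proof of Lemma 2 (the target moments here come from a known density chosen only for illustration):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

# Assumed fractional-moment setup on [0, 1], as in the previous sketch.
alphas = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
g = lambda y: y ** alphas

def Z(lam):
    return quad(lambda y: np.exp(-np.dot(lam, g(y))), 0.0, 1.0)[0]

def moments_under(lam):
    """E_{f_lambda}[g_i]; note that the gradient of ln Z(lambda) is minus this vector."""
    z = Z(lam)
    return np.array([quad(lambda y: y**a * np.exp(-np.dot(lam, g(y))), 0.0, 1.0)[0] / z
                     for a in alphas])

def solve_lambda(mu):
    """Minimize the dual entropy (12); its gradient is mu - E_{f_lambda}[g]."""
    res = minimize(lambda lam: np.log(Z(lam)) + np.dot(lam, mu),
                   x0=np.zeros(len(mu)),
                   jac=lambda lam: mu - moments_under(lam),
                   method="BFGS")
    return res.x

# Target moments of an illustrative known density on [0, 1], namely f(y) = 2y:
mu = 2.0 / (alphas + 2.0)
lam_star = solve_lambda(mu)
print(moments_under(lam_star) - mu)   # ~ 0: the maxent density reproduces the constraints
```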

4. Mathematical Complement

In this section we gather some results about $Z(\lambda)$ and the minimizer $\lambda^*(\mu)$ that we need in what follows.

Proposition 3. With the notations introduced above, suppose that the matrix $C(\lambda)$, with entries $C_{ij}(\lambda) = E_{f_\lambda}[g_i g_j] - E_{f_\lambda}[g_i]E_{f_\lambda}[g_j]$, which we use to denote the covariance of $g$ computed with respect to the density $f_\lambda$, is strictly positive definite. Suppose as well that the domain of $Z$ is an open set. Then one has the following:
(1) The function $Z(\lambda)$ defined above is log-convex; that is, $\ln Z(\lambda)$ is convex.
(2) $Z(\lambda)$ is continuously differentiable as many times as one needs.
(3) If one sets $G(\lambda) = -\nabla_\lambda \ln Z(\lambda) = E_{f_\lambda}[g]$, then $G$ is invertible and $\lambda^*(\mu) = G^{-1}(\mu)$ is continuously differentiable in $\mu$.
(4) The Jacobian of $\lambda^*(\mu)$ at $\mu$ equals the negative of the inverse of the covariance matrix of $g$ computed with respect to $f_{\lambda^*(\mu)}$, that is, $-C(\lambda^*(\mu))^{-1}$.

The first two assertions are proved in Kullback's book. Actually, the log-convexity of $Z(\lambda)$ is a consequence of Hölder's inequality, and the analyticity of $Z(\lambda)$ involves a systematic estimation procedure. The third drops out from the inverse function theorem of calculus; see Fleming [12]. The last one follows from the fact that the Jacobian of $\lambda^*(\mu)$ equals the negative of the inverse of the Hessian matrix of $\ln Z(\lambda)$, and that this Hessian is the covariance matrix $C(\lambda)$.
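The fact just used, namely that the Hessian of $\ln Z(\lambda)$ is the covariance matrix $C(\lambda)$, or equivalently that the Jacobian of $\lambda \mapsto E_{f_\lambda}[g]$ is $-C(\lambda)$, is easy to check numerically. Here is a sketch for the assumed fractional-moment family used above:

```python
import numpy as np
from scipy.integrate import quad

# Assumed fractional-moment family on [0, 1], as in the earlier sketches.
alphas = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
g = lambda y: y ** alphas

def Z(lam):
    return quad(lambda y: np.exp(-np.dot(lam, g(y))), 0.0, 1.0)[0]

def moments_under(lam):
    """E_{f_lambda}[g_i], i.e. minus the gradient of ln Z(lambda)."""
    z = Z(lam)
    return np.array([quad(lambda y: y**a * np.exp(-np.dot(lam, g(y))), 0.0, 1.0)[0] / z
                     for a in alphas])

def covariance(lam):
    """C_ij(lambda) = Cov_{f_lambda}(g_i, g_j), the Hessian of ln Z(lambda)."""
    z = Z(lam)
    m = moments_under(lam)
    second = np.array([[quad(lambda y: y**(a + b) * np.exp(-np.dot(lam, g(y))), 0.0, 1.0)[0] / z
                        for b in alphas] for a in alphas])
    return second - np.outer(m, m)

lam = np.array([0.3, -0.2, 0.1, 0.4, -0.1])   # an arbitrary interior point
eps = 1e-4
# Finite-difference Jacobian of lambda -> E_{f_lambda}[g]; it should equal -C(lambda).
jac = np.column_stack([(moments_under(lam + eps * e) - moments_under(lam - eps * e)) / (2 * eps)
                       for e in np.eye(len(alphas))])
print(np.max(np.abs(jac + covariance(lam))))  # small, of the order of the finite-difference error
```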

As a simple consequence of item (4) in Proposition 3 we have the following result, which is relevant for the arguments in the next section.

Theorem 4. With the notations introduced above, set $C = C(\lambda^*)$. Then the following assertions hold. The change in $\lambda^*$ as $\mu \to \mu + \delta\mu$, up to terms $O(\|\delta\mu\|^2)$, is given by
$$\delta\lambda^* = -C^{-1}\delta\mu, \qquad (14)$$
and, more importantly, using (10) and, again, up to terms $O(\|\delta\mu\|^2)$,
$$f_{\lambda^*+\delta\lambda^*}(x) = f_{\lambda^*}(x)\Bigl(1 + \bigl\langle C^{-1}\delta\mu,\; g(x) - \mu\bigr\rangle\Bigr). \qquad (15)$$

To sketch a proof of (15) we proceed as follows. Let $\delta\lambda^* = -C^{-1}\delta\mu$; then
$$f_{\lambda^*+\delta\lambda^*}(x) = \frac{e^{-\langle\lambda^*+\delta\lambda^*,\, g(x)\rangle}}{Z(\lambda^*+\delta\lambda^*)}.$$
Now, neglecting terms of second order in $\delta\lambda^*$, we approximate the numerator by
$$e^{-\langle\lambda^*, g(x)\rangle}\bigl(1 - \langle\delta\lambda^*, g(x)\rangle\bigr)$$
and the denominator by
$$Z(\lambda^*)\bigl(1 - \langle\delta\lambda^*, \mu\rangle\bigr),$$
where we used the fact that at the minimum
$$-\nabla_\lambda \ln Z(\lambda^*) = E_{f_{\lambda^*}}[g] = \mu.$$
Therefore
$$f_{\lambda^*+\delta\lambda^*}(x) = f_{\lambda^*}(x)\,\frac{1 - \langle\delta\lambda^*, g(x)\rangle}{1 - \langle\delta\lambda^*, \mu\rangle},$$
from which the desired result readily drops out after neglecting terms of second order in $\delta\lambda^*$.
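The linearization used in this argument is easy to test numerically. The sketch below, again for the assumed fractional-moment family, compares $f_{\lambda^*+\delta\lambda^*}$ with $f_{\lambda^*}\bigl(1 - \langle\delta\lambda^*, g - \mu\rangle\bigr)$ for a small perturbation of the multipliers:

```python
import numpy as np
from scipy.integrate import quad

# Assumed fractional-moment family on [0, 1], as in the earlier sketches.
alphas = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
g = lambda y: y ** alphas

def Z(lam):
    return quad(lambda y: np.exp(-np.dot(lam, g(y))), 0.0, 1.0)[0]

def density_on_grid(lam, ys):
    z = Z(lam)
    return np.array([np.exp(-np.dot(lam, g(y))) for y in ys]) / z

def moments_under(lam):
    z = Z(lam)
    return np.array([quad(lambda y: y**a * np.exp(-np.dot(lam, g(y))), 0.0, 1.0)[0] / z
                     for a in alphas])

lam = np.array([0.3, -0.2, 0.1, 0.4, -0.1])            # reference multipliers (arbitrary)
mu = moments_under(lam)                                 # exact moments of f_lambda
dlam = 1e-2 * np.array([1.0, -1.0, 0.5, -0.5, 0.25])    # a small perturbation

ys = np.linspace(0.01, 0.99, 50)
exact = density_on_grid(lam + dlam, ys)
linear = density_on_grid(lam, ys) * np.array([1.0 - np.dot(dlam, g(y) - mu) for y in ys])
print(np.max(np.abs(exact - linear)))                   # of order ||dlam||**2
```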

5. Sample Dependence

Throughout this section we shall consider a sample $X_1, \dots, X_n$ of size $n$ of the random variable $X$. Here we shall relate the fluctuations of $\hat{\mu}_n$ around its mean $\mu$ to the fluctuations of the maxentropic density. The following is obtained from an application of the strong law of large numbers.

Theorem 5. Suppose that $g(X)$ is square integrable, with mean $\mu$ and covariance matrix $V$. Then, for each $i$, the estimator (2) is an unbiased estimator of $\mu_i$ and $\hat{\mu}_n \to \mu$ almost surely as $n \to \infty$.
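A quick illustration of Theorem 5, under the assumed lognormal law for $X$ used in the sketch of Section 1: the empirical Laplace-transform moments computed from ever larger samples settle down on the exact values (here approximated by a very large Monte Carlo sample) roughly at the usual $1/\sqrt{n}$ rate:

```python
import numpy as np

rng = np.random.default_rng(7)
alphas = np.array([0.5, 1.0, 1.5, 2.0, 2.5])

def empirical_moments(n):
    """Empirical generalized moments (2) with g_i(x) = exp(-alpha_i x), X lognormal (assumed)."""
    x = rng.lognormal(mean=0.0, sigma=0.75, size=n)
    return np.exp(-np.outer(alphas, x)).mean(axis=1)

# A very large sample serves as a stand-in for the exact moments E[g(X)].
mu = empirical_moments(2_000_000)

for n in (100, 1_000, 10_000, 100_000):
    print(n, np.max(np.abs(empirical_moments(n) - mu)))   # shrinks roughly like 1/sqrt(n)
```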

Consider now the following.

Proposition 6. Define the empirical moments $\hat{\mu}_n$ as in (2). Denote by $\lambda^*_n$ the Lagrange multiplier determined from $\hat{\mu}_n$ as explained in Section 3. Then, as $n \to \infty$, $\hat{\mu}_n \to \mu$ (a.s.) and therefore $\lambda^*_n \to \lambda^*$ (a.s.).

If $f^*_n = f_{\lambda^*_n}$ and $f^* = f_{\lambda^*}$ are the maxentropic densities given by (10), corresponding, respectively, to $\lambda^*_n$ and $\lambda^*$, then $f^*_n(x) \to f^*(x)$ pointwise for every $x$, almost surely.

The proof hinges on the following arguments. From Theorems 5 and 4 we obtain the first assertion. The rest follows from the continuous dependence of the densities on the parameter $\lambda$. Also, taking limits as $n \to \infty$ in (15), we obtain another proof of the convergence of $f^*_n$ to $f^*$.
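The whole chain (sample $\to$ empirical moments $\to$ multipliers $\to$ density) can be put together in a few lines. The sketch below, under the same assumed setup as before, recomputes the maxentropic density from samples of increasing size and reports its distance to the density built from a proxy for the exact moments, both in the sup norm and in $L_1$; the distances typically shrink as $n$ grows, in line with Proposition 6 and with Theorem 7 below:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

rng = np.random.default_rng(1)
alphas = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
g = lambda y: y ** alphas

def Z(lam):
    return quad(lambda y: np.exp(-np.dot(lam, g(y))), 0.0, 1.0)[0]

def moments_under(lam):
    z = Z(lam)
    return np.array([quad(lambda y: y**a * np.exp(-np.dot(lam, g(y))), 0.0, 1.0)[0] / z
                     for a in alphas])

def solve_lambda(mu):
    """Multipliers via minimization of the dual entropy (12)."""
    return minimize(lambda lam: np.log(Z(lam)) + np.dot(lam, mu),
                    x0=np.zeros(len(mu)),
                    jac=lambda lam: mu - moments_under(lam),
                    method="BFGS").x

def density_on_grid(lam, ys):
    z = Z(lam)
    return np.array([np.exp(-np.dot(lam, g(y))) for y in ys]) / z

def empirical_mu(n):
    """Empirical fractional moments of Y = exp(-X), X lognormal (assumed), as in (2)."""
    y = np.exp(-rng.lognormal(mean=0.0, sigma=0.75, size=n))
    return np.array([np.mean(y ** a) for a in alphas])

ys = np.linspace(0.01, 0.99, 99)
f_exact = density_on_grid(solve_lambda(empirical_mu(2_000_000)), ys)  # proxy for f*

for n in (100, 1_000, 10_000, 100_000):
    f_n = density_on_grid(solve_lambda(empirical_mu(n)), ys)
    diff = np.abs(f_n - f_exact)
    print(n, diff.max(), diff.sum() * (ys[1] - ys[0]))   # sup and (approximate) L1 distances
```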

The next result concerns the convergence of $f^*_n$ to $f^*$ in $L_1(dm)$.

Theorem 7. With the notations introduced above, one has
$$\lim_{n\to\infty}\bigl\|f^*_n - f^*\bigr\|_1 = 0 \quad \text{almost surely.}$$

Proof. The proof is a consequence of the continuous dependence of $\Sigma(\lambda,\mu)$ on its arguments, of the identity (13), and of item (iii) in Theorem 1, with $f^*_n$ playing the role of $\rho_1$ and $f^*$ playing the role of $\rho_2$. In this case $K(f^*_n, f^*)$ happens to be
$$K\bigl(f^*_n, f^*\bigr) = \Sigma\bigl(\lambda^*, \hat{\mu}_n\bigr) - \Sigma\bigl(\lambda^*_n, \hat{\mu}_n\bigr),$$
which, as mentioned, tends to $0$ as $n \to \infty$.

To continue, consider the following.

Theorem 8. With the notations introduced above, one has the following:
(1) Up to the terms of order $O(\|\hat{\mu}_n - \mu\|^2)$ neglected in (15), $f^*_n(x)$ is an unbiased estimator of $f^*(x)$.
(2) For any bounded, Borel measurable function $h$, and up to the same order,
$$\left|\int h f^*_n\,dm - \int h f^*\,dm\right| \le \left(\int h^2 f^*\,dm\right)^{1/2}\bigl\langle \hat{\mu}_n - \mu,\; C^{-1}(\hat{\mu}_n - \mu)\bigr\rangle^{1/2}.$$

The proof follows from (15). Multiply both sides of that identity by $h(x)$ and integrate with respect to $m$; then invoke the Cauchy-Schwarz inequality in $L_2(f^*\,dm)$, together with the fact that $E_{f^*}\bigl[\langle a, g - \mu\rangle^2\bigr] = \langle a, Ca\rangle$ applied to $a = C^{-1}(\hat{\mu}_n - \mu)$, to obtain the inequality.

What is interesting about (2) in Theorem 8 is the possibility of combining it with Chebyshev's inequality to obtain rates of convergence. It is not hard to verify that
$$\bigl\langle \hat{\mu}_n - \mu,\; C^{-1}(\hat{\mu}_n - \mu)\bigr\rangle \le \frac{\|\hat{\mu}_n - \mu\|^2}{\gamma},$$
where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^M$ and $\gamma$ is the smallest eigenvalue of $C$, and that
$$E\bigl[\|\hat{\mu}_n - \mu\|^2\bigr] = \frac{\operatorname{tr}(V)}{n}.$$

Corollary 9. With the notations introduced in Theorem 8 and the two lines above,
$$P\left(\left|\int h f^*_n\,dm - \int h f^*\,dm\right| > \epsilon\right) \le \frac{\bigl(\int h^2 f^*\,dm\bigr)\operatorname{tr}(V)}{\gamma\, n\, \epsilon^2}.$$

Comment. If we take $h = \mathbf{1}_A$, the indicator of a Borel set $A$, we obtain a simple estimate of the speed of decay of $P\bigl(\bigl|\int_A f^*_n\,dm - \int_A f^*\,dm\bigr| > \epsilon\bigr)$ to zero, or of the speed of convergence of $\int_A f^*_n\,dm$ to $\int_A f^*\,dm$ if you prefer. Regarding the fluctuations around the mean, consider the following two possibilities.

Theorem 10. With the notations introduced in Theorem 4 and in the identity (15), one has
$$\sqrt{n}\,\bigl(f^*_n(x) - f^*(x)\bigr) \to N\bigl(0, \sigma^2(x)\bigr) \quad \text{in law as } n \to \infty.$$
Also, for any bounded, Borel measurable $h$,
$$\sqrt{n}\left(\int h f^*_n\,dm - \int h f^*\,dm\right) \to N\bigl(0, \sigma_h^2\bigr) \quad \text{in law.}$$
Above, $\sigma^2(x) = f^*(x)^2\,\bigl\langle g(x) - \mu,\; C^{-1}VC^{-1}\bigl(g(x) - \mu\bigr)\bigr\rangle$ and $\sigma_h^2 = \bigl\langle b,\; C^{-1}VC^{-1}b\bigr\rangle$, where $V$ is the covariance matrix of $g(X)$ introduced in Theorem 5 and $b = \int h\,(g - \mu)\,f^*\,dm$.

The proof of the assertions involves applying the central limit theorem to the vector variable $\sqrt{n}\,(\hat{\mu}_n - \mu)$ and then invoking the expansion (15).

Final Comments. Some numerical results illustrating the results presented here appear in the paper by Gomez-Gonçalves et al. [1]. There, the authors display graphically a “cloud of densities” corresponding to samples of various sizes and show how it shrinks onto the plot of the true (or exact) density as the sample size increases.

Competing Interests

The author declares that they have no competing interests.