Research Article | Open Access
Calculation of Precise Constants in a Probability Model of Zipf’s Law Generation and Asymptotics of Sums of Multinomial Coefficients
Let $\{a_0, a_1, \ldots, a_n\}$ be a full set of outcomes (symbols) and let positive $p_i$, $i = 0, \ldots, n$, be their probabilities ($\sum_{i=0}^{n} p_i = 1$). Let us treat $a_0$ as a stop symbol; it can occur in sequences of symbols (we call them words) only once, at the very end. The probability of a word is defined as the product of probabilities of its symbols. We consider the list of all possible words sorted in the nonincreasing order of their probabilities. Let $p(r)$ be the probability of the $r$th word in this list. We prove that if at least one of the ratios $\ln p_i / \ln p_j$, $i, j \in \{1, \ldots, n\}$, is irrational, then the limit $\lim_{r \to \infty} p(r)/r^{-1/\gamma}$ exists and differs from zero; here $\gamma$ is the root of the equation $\sum_{i=1}^{n} p_i^{\gamma} = 1$. The limit constant can be expressed (rather easily) in terms of the entropy of the distribution $(p_1^{\gamma}, \ldots, p_n^{\gamma})$.
1. Introduction: The Statement of the Main Theorem
1.1. Brief Literature Overview
The wide presence of power laws in real networks, biology, economics, and linguistics can be explained in the framework of various mathematical models (see, e.g., [1, 2]). According to Zipf’s law, in a list of word forms ordered by the frequency of occurrence, the frequency of the $r$th word form obeys a power function of $r$ (the value $r$ is called the rank of the word form). One can easily explain this law with the help of the so-called monkey model.
Recall that the word forms “the”, “of”, and “and” are used most frequently in English texts. According to Zipf’s law, the word “the” is used in texts twice as often as “of” and three times as often as “and”; in other words, the occurrence frequency of a word form obeys a power function of its rank (the position of the word form in the frequency-ordered list) whose exponent is approximately $-1$. It should be noted that further surveys showed that Zipf’s law holds only roughly, and only for the most frequent words. At present, researchers try to describe the main part of the lexicon using a power law with a different exponent. Zipf explained his law on the basis of the principle of least effort. In accordance with this principle, authors aim to minimize the length of the text required to convey their thoughts, even if this introduces ambiguities. On the other hand, readers want to minimize the effort required to understand the text.
Another explanation of Zipf’s law was suggested by Mandelbrot, who slightly modified the law by introducing a translation constant into the argument of the power function. The important thing for our case is that later he hypothesized the existence of a simpler explanation of Zipf’s law, associated with a simple probability model in which all symbols in the text (including the white space) appear independently of each other with certain probabilities. Moreover, he analysed the Markovian dependence between these symbols and, on the basis of special cases, wrote out a formula (correct in a typical case) that determines the parameter of the law from the transition probability matrix of the Markov model.
First, we will consider the model thoroughly described by Miller and Li for a special case of Mandelbrot’s experiment in which the monkey types the keys with uniform probability. For other important references on the monkey model, we recommend the recent article by Richard Perline and Ron Perline (see also the references in the next subsection).
1.2. Statement of the Main Theorem and Its Connection with Other Results
Assume that a monkey types any of 26 Latin letters or the space on a keyboard with the same probability of $1/27$. We understand a word as a sequence of symbols typed by the monkey before the space. Let us sort the list of possible words with respect to probabilities of their occurrence (the empty word, whose probability equals $1/27$, will go first in this list, followed by 26 one-letter words whose probabilities equal $1/27^2$, then by $26^2$ possible two-letter words, and so on). We can prove (see [7, 8]) that the probability $p(r)$ of a word with the rank of $r$ satisfies the inequality $$C_1 r^{-\alpha} \le p(r) \le C_2 r^{-\alpha}, \qquad (1)$$ where $C_1$ and $C_2$ are positive constants and $\alpha = \log 27/\log 26$ (here and below we use the symbol $\log$ if the base of the logarithm is not significant; but for the natural logarithm we use the symbol $\ln$).
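The equiprobable case can be checked directly. The following minimal sketch assumes the ranking described above (rank 1 is the empty word, ranks 2 to 27 are the one-letter words, and so on) and the exponent $\alpha = \log 27/\log 26$; the function names are illustrative and not taken from the cited works.

```python
import math

LETTERS = 26
SYMBOLS = LETTERS + 1  # 26 letters plus the space


def word_length_at_rank(r: int) -> int:
    """Length of the word with rank r (rank 1 is the empty word)."""
    length, last_rank = 0, 1  # last_rank: rank of the last word of this length
    while r > last_rank:
        length += 1
        last_rank += LETTERS ** length
    return length


def word_probability(r: int) -> float:
    """Probability of the word with rank r: (1/27) ** (length + 1)."""
    return SYMBOLS ** -(word_length_at_rank(r) + 1)


# Exponent of the power law in the equiprobable monkey model.
ALPHA = math.log(SYMBOLS) / math.log(LETTERS)


def scaled_probability(r: int) -> float:
    """p(r) * r ** ALPHA, which stays between two positive constants."""
    return word_probability(r) * r ** ALPHA
```

The scaled value oscillates between two positive bounds and does not settle, which is why the equiprobable case gives only a two-sided inequality and not an exact power asymptotics.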
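Relatively recently inequality (1) was generalized to the case of nonequiprobable letters. Let $p_0$ be the probability that the monkey types the space, let $p_i$, $i = 1, \ldots, n$, denote probabilities of choosing the $i$th letter from the set of $n$ letters ($p_i > 0$, $\sum_{i=0}^{n} p_i = 1$), and let $p(r)$ be, as above, the probability of a word with a rank of $r$. Then, as is proved in [10, 11], the following inequality analogous to (1) takes place; namely, there exist constants $C_1, C_2 > 0$ such that $$C_1 r^{-1/\gamma} \le p(r) \le C_2 r^{-1/\gamma}, \qquad (2)$$ and $\gamma$ is the root of the equation $\sum_{i=1}^{n} p_i^{\gamma} = 1$ (evidently, $0 < \gamma < 1$). Note that inequality (2) is equivalent to the boundedness of the difference $\log p(r) + \gamma^{-1} \log r$.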
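The exponent parameter is easy to compute numerically. The sketch below assumes the defining equation is $\sum_{i=1}^{n} p_i^{\gamma} = 1$ over the letter probabilities only (the stop symbol excluded) and uses plain bisection; `solve_gamma` is a hypothetical helper name.

```python
def solve_gamma(letter_probs, tol=1e-12):
    """Root of sum(p ** g for p in letter_probs) == 1 on (0, 1), by bisection.

    Assumes at least two letters and 0 < sum(letter_probs) < 1, so that the
    left-hand side exceeds 1 at g = 0 and is below 1 at g = 1.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(p ** mid for p in letter_probs) > 1.0:
            lo = mid  # sum still too large: the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2
```

For 26 equiprobable letters with probability $1/27$ each, the root is $\log 26/\log 27$, so $1/\gamma$ reproduces the exponent of the equiprobable model.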
In the case when the probability of each letter is not fixed but depends on the previous one, words represent trajectories of a Markov chain with the absorbing state and transient states . Then the value is the probability of the th trajectory in the list of possible trajectories sorted in the nonincreasing order of probabilities. In this case, the asymptotic behavior of does not necessarily have a power order. Namely, in this case one of the two alternatives takes place [12, 13]. The first variant is that there exists the limitwhere is some positive integer constant value that depends on the structure of the transition probability matrix and the structure of states, where the initial distribution of the Markov chain is concentrated. The second variant is that independently of the initial distribution there exists the following nonzero limit (the so-called weak power law):This limit equals , where is now defined with the help of the substochastic matrix of transition probabilities where the row and the column that correspond to the absorbing state are deleted. Namely, raising all elements of the mentioned matrix to the power of would equate its spectral radius to 1.
These results were obtained independently in [12, 14] and later refined in . Namely, as appeared, the first alternative means the subexponential order of the asymptotics; that is, in this case , such that
The case of the second alternative is much more difficult. If the matrix does not have the block-diagonal structure with coinciding powers such that raising elements of blocks to these powers makes the spectral radius equal 1, then one can replace the weak power law with a strong one. Namely, in this case the asymptotic behavior of has the power order; that is, inequality (2) is valid (with “matrix” defined above). Therefore, inequality (2) takes place in a “typical” case of letter probabilities.
However, one more natural question still remains without an answer.
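Inequality (2) means that the asymptotic form has a power order but does not imply the exact power asymptotics. In a general case, as follows from the first example given in this section, useful properties can be established neither when letters in words are Markov-dependent nor when they are independent. However, as we prove later in this paper, in a “typical” case, for words composed of independent letters, the function $p(r)$ has exact power asymptotics. The following theorem is valid.
Theorem 1 (main). Let at least one of the ratios $\ln p_i / \ln p_j$, $i, j \in \{1, \ldots, n\}$, be irrational and let $\gamma$ be the root of the equation $\sum_{i=1}^{n} p_i^{\gamma} = 1$. Then the limit $$\lim_{r \to \infty} \frac{p(r)}{r^{-1/\gamma}} \qquad (6)$$ exists and equals $p_0 \left( H(\mathbf{p}^{\gamma}) \right)^{-1/\gamma}$, where $H(\mathbf{p}^{\gamma})$ is the entropy of $\mathbf{p}^{\gamma} = (p_1^{\gamma}, \ldots, p_n^{\gamma})$; that is, $H(\mathbf{p}^{\gamma}) = -\sum_{i=1}^{n} p_i^{\gamma} \ln p_i^{\gamma}$.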
Here and below we always write the function under consideration in the numerator and do the norming (defined analytically) function in the denominator of the fraction, whose limit is to be calculated. In intermediate calculations it may be more convenient to do the opposite, but since this results only in the trivial raising of the limit constant to the power of , we sacrifice the convenience of calculations for the clarity of statements of results. Evidently, the theorem asserts that under certain assumptions there exists the nonzero limit (where ) as . It is equal to .
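A numerical experiment can illustrate the statement of the main theorem. The sketch below lists words in nonincreasing order of probability with a best-first search and compares $p(r)\, r^{1/\gamma}$ with a candidate constant. The form $p_0 H^{-1/\gamma}$ used here, with $H$ the natural-logarithm entropy of $(p_1^{\gamma}, \ldots, p_n^{\gamma})$, is an assumption of this sketch, and all function names are illustrative.

```python
import heapq
import math


def solve_gamma(letter_probs, tol=1e-12):
    """Root of sum(p ** g) == 1 on (0, 1) by bisection (needs >= 2 letters)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(p ** mid for p in letter_probs) > 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def ranked_word_probabilities(p_stop, letter_probs, count):
    """First `count` word probabilities in nonincreasing order.

    Words form a tree: appending a letter multiplies the probability by a
    factor below 1, so a best-first search pops words in ranked order.
    """
    heap = [-1.0]  # negated product of letter probabilities (empty word: 1)
    result = []
    while len(result) < count:
        product = -heapq.heappop(heap)
        result.append(p_stop * product)  # the stop symbol closes the word
        for p in letter_probs:
            heapq.heappush(heap, -(product * p))
    return result


def candidate_constant(p_stop, letter_probs):
    """Assumed limit constant p_0 * H ** (-1/gamma), returned with gamma."""
    gamma = solve_gamma(letter_probs)
    weights = [p ** gamma for p in letter_probs]
    entropy = -sum(w * math.log(w) for w in weights)
    return p_stop * entropy ** (-1.0 / gamma), gamma


if __name__ == "__main__":
    probs = ranked_word_probabilities(0.2, [0.5, 0.3], 3000)
    constant, gamma = candidate_constant(0.2, [0.5, 0.3])
    for r in (10, 100, 1000, 3000):
        print(r, probs[r - 1] * r ** (1.0 / gamma) / constant)
```

With letter probabilities 0.5 and 0.3 the ratio $\ln 0.5/\ln 0.3$ is irrational, so the printed ratios are expected to approach 1, although convergence may be slow.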
Let us describe the structure of the remaining part of the paper. In Section 2 we state the main theorem in terms of multinomial coefficients (of the Pascal pyramid). The proof of the theorem is reduced to the estimation of the limit behavior of the sum of these coefficients over some simplex. In Section 3 we prove an analog of this theorem with an integral in place of the sum. In this section we essentially use the Stirling formula which allows us to reduce calculations to the evaluation of a multivariate Gaussian integral. We establish an explicit formula for the determinant of the matrix of the quadratic form that defines the integrand. Finally, in Section 4 we prove that the ratio of the integral to the sum tends to 1. Here we use the general properties of the Riemann integral and uniformly distributed sequences. In conclusion we discuss possible generalizations and unsolved problems.
2. Equivalent Statements of the Main Theorem and the Pascal Pyramid
Let us first note that if , then . Reducing the nominator of fraction (6) by , we write the following statement in this case:
Theorem 2 (the case of $\gamma = 1$). Let $p_i$ be the probability of the symbol $a_i$, $i = 1, \ldots, n$, while $\sum_{i=1}^{n} p_i = 1$ (there is no stop symbol). Assume that at least one of the ratios $\ln p_i / \ln p_j$, $i, j \in \{1, \ldots, n\}$, is irrational. Let us consider all possible finite words (including the empty one) and sort them in the nonincreasing order of probabilities (we equate the probability of the empty word to 1 and calculate the probability of any other word as the product of probabilities of its letters). Let $p(r)$ be the probability of the $r$th word in the list (the word with the rank of $r$). Then the limit $\lim_{r \to \infty} p(r)/r^{-1}$ exists and equals $1/H(\mathbf{p})$, where $H(\mathbf{p})$ is the entropy of the vector $\mathbf{p} = (p_1, \ldots, p_n)$; that is, $H(\mathbf{p}) = -\sum_{i=1}^{n} p_i \ln p_i$.
In the statement of Theorem 2, as well as in Theorem 1, we use the bold font for the vector whose components are denoted by the same letter with the index ranging from 1 to $n$. In what follows we use the bold font for analogous denotations without mentioning this fact.
One can easily see that Theorem 2 is not just a particular case of Theorem 1, but these theorems are equivalent. Namely, the replacement of probabilities $p_i$ with new ones $p_i^{\gamma}$ turns the general case into the particular one. Therefore, in what follows we neglect $p_0$, assuming (without loss of generality) that $\gamma = 1$.
Fix some probability $q$ and denote by $r(q)$ the rank of the last word whose probability is not less than $q$ in the list of all words sorted in the nonincreasing order of their probabilities. Let us redefine the function $p(r)$ for noninteger $r$ as $p(r) = p([r])$ (here $[r]$ is the integer part of a number). Evidently, functions $p(r)$ and $r(q)$ are inverse (more exactly, quasi-inverse); namely, the graph of one of the hyperbola-shaped, decreasing stepwise functions turns into another one when the axes switch roles (in the first case, the rank is the argument and the probability is the value, and vice versa in the second case).
It can be clearly seen that $p(r) \sim c\, r^{-1}$ as $r \to \infty$ is equivalent to $$r(q) \sim c\, q^{-1}, \quad q \to 0. \qquad (7)$$ Therefore the equality in the assertion of Theorem 2 is equivalent to that $$\lim_{q \to 0} \frac{r(q)}{q^{-1}} = \frac{1}{H(\mathbf{p})}. \qquad (8)$$
Denote the logarithm of the denominator in the last fraction by $z$ (i.e., $z = -\ln q$) and let $Q(z) = r(e^{-z})$. In view of considerations in the above paragraph the equality in the assertion of Theorem 2 is equivalent to that $$\lim_{z \to \infty} \frac{Q(z)}{e^{z}} = \frac{1}{H(\mathbf{p})}. \qquad (9)$$
Recall the proof of inequality (2). It is reduced to the proof of the boundedness of the difference $\ln Q(z) - z$ for the introduced function $Q(z)$ with $z \ge 0$. Nonnegative values of $z$ form the definition domain of the function $Q(z)$ because $q \le 1$. For convenience we redefine the function by putting $Q(z) = 0$ for $z < 0$.
Let . Considering all possible variants of the last letters in words, whose quantity equals the value of the function , we obtain the functional equation , where is the Heaviside step (i.e., the function that vanishes with negative values of the argument and equals 1 with nonnegative values). For we get the following recurrent correlation:where .
The equality implies that the function satisfies (10). Since the function takes a finite number of positive values within interval, there exist positive and such thatfor all .
Replacing terms in the right-hand side of the recurrent correlation (10) with their lower (upper) bounds, we extend the solution set of inequality (11) to the domain , where . Repeating this procedure several times, in a finite number of steps we prove that the inequality is valid for any arbitrarily large . Performing the logarithmic transformation of the inequality, we conclude that is bounded, and then so is the difference .
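The functional equation obtained by considering the last letter of a word can be checked on small examples. The sketch below reads the equation as $Q(z) = \theta(z) + \sum_i Q(z - a_i)$ with $a_i = -\ln p_i$ and $Q(z) = 0$ for $z < 0$ (this notation is an assumption of the sketch) and compares it with a direct enumeration of words whose total cost does not exceed $z$.

```python
import math


def count_recursive(z, costs):
    """Q(z) = theta(z) + sum_i Q(z - a_i), with Q(z) = 0 for z < 0."""
    if z < 0:
        return 0
    # theta(z) = 1 counts the empty word; each term counts the words that
    # end with the i-th letter.
    return 1 + sum(count_recursive(z - a, costs) for a in costs)


def count_by_enumeration(z, costs):
    """Number of words whose total cost sum of a_i does not exceed z."""
    stack, total = [0.0], 0  # each stack entry is the cost of one word
    while stack:
        cost = stack.pop()
        if cost <= z:
            total += 1
            stack.extend(cost + a for a in costs)  # append one more letter
    return total


# Example costs a_i = -ln p_i for letter probabilities 0.5, 0.3, 0.2.
EXAMPLE_COSTS = [-math.log(p) for p in (0.5, 0.3, 0.2)]
```

Both functions take time proportional to the number of counted words, which grows like $e^{z}$, so they are meant only for small $z$.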
Let us return to Theorem 2. As was mentioned above, Theorem 2 asserts (under certain assumptions) not only the boundedness of $\ln Q(z) - z$ but also the validity of equality (9). Let us recall the combinatory sense of the function $Q(z)$. Evidently, all words that contain $k_1$ letters of the 1st kind, $k_2$ letters of the 2nd kind, $\ldots$, and $k_n$ letters of the $n$th kind have one and the same probability of $\prod_{i=1}^{n} p_i^{k_i}$ (i.e., $e^{-(\mathbf{a}, \mathbf{k})}$ with $a_i = -\ln p_i$); ranks of these words are consecutive. The quantity of such words is defined by the multinomial coefficient $$M(\mathbf{k}) = \frac{(k_1 + \cdots + k_n)!}{k_1! \cdots k_n!}. \qquad (12)$$
Considering the nonnegative part of the $n$-dimensional integer grid and associating the point $\mathbf{k}$ with the number $M(\mathbf{k})$, we get one of the variants of the Pascal pyramid. By the definition of the function $Q(z)$, the value $Q(z)$ equals the sum of multinomial coefficients over all integer vectors $\mathbf{k}$ that lie inside the $n$-dimensional simplex $S(z) = \{\mathbf{x} \ge 0 : (\mathbf{a}, \mathbf{x}) \le z\}$: $$Q(z) = \sum_{\mathbf{k} \in S(z)} M(\mathbf{k}). \qquad (13)$$
As a result, we obtain one more equivalent statement of the main theorem, which we are going to prove.
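Theorem 3 (the multinomial statement). Let $a_i$, $i = 1, \ldots, n$, be arbitrary positive numbers such that at least one of the ratios $a_i/a_j$, $i, j \in \{1, \ldots, n\}$, is irrational and $\sum_{i=1}^{n} p_i = 1$, where $p_i = e^{-a_i}$. Let a function $Q(z)$ obey formula (13). Then $$\lim_{z \to \infty} \frac{Q(z)}{e^{z}} = \frac{1}{H}, \qquad (14)$$ where $H = \sum_{i=1}^{n} a_i e^{-a_i}$.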
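The multinomial statement lends itself to a direct numerical check. The sketch below sums multinomial coefficients over the integer points of the simplex $\sum_i a_i k_i \le z$ and normalizes by $e^{z}/H$ with $H = \sum_i a_i e^{-a_i}$; this normalization is the reading of the theorem assumed here, and the function names are illustrative.

```python
import math


def multinomial(ks):
    """(k_1 + ... + k_n)! / (k_1! ... k_n!) as a product of binomials."""
    remaining, coeff = sum(ks), 1
    for k in ks:
        coeff *= math.comb(remaining, k)
        remaining -= k
    return coeff


def simplex_sum(z, costs):
    """Sum of multinomial coefficients over integer k >= 0 with (a, k) <= z."""
    def rec(i, budget, ks):
        if i == len(costs):
            return multinomial(ks)
        total, k = 0, 0
        while k * costs[i] <= budget:
            total += rec(i + 1, budget - k * costs[i], ks + (k,))
            k += 1
        return total
    return rec(0, z, ())


def normalized_sum(z, costs):
    """simplex_sum(z) * H / e^z, expected to approach 1 as z grows."""
    entropy = sum(a * math.exp(-a) for a in costs)
    return simplex_sum(z, costs) * entropy * math.exp(-z)


# Costs a_i = -ln p_i for p = (0.5, 0.3, 0.2); the p_i sum to 1.
EXAMPLE_COSTS = [-math.log(p) for p in (0.5, 0.3, 0.2)]
```

For these costs the normalized sum can be evaluated at, say, $z = 12$ in a fraction of a second, because the number of lattice points grows only polynomially in $z$ even though the sum itself grows like $e^{z}$.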
3. The Proof of an Analog of Theorem 3 with Integration instead of Summation
3.1. Reduction of the Integration to the Calculation of a Gaussian Integral
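The multinomial coefficient (12) is defined for integer nonnegative vectors $\mathbf{k} = (k_1, \ldots, k_n)$. Let us redefine it for noninteger vectors $\mathbf{x} = (x_1, \ldots, x_n)$ by replacing (in this case) each factorial $k!$ in Definition (12) with $\Gamma(x + 1)$. In what follows we use the denotation $f(\mathbf{x})$ (or $f(x_1, \ldots, x_n)$) for the corresponding function, which is continuous for nonnegative $\mathbf{x}$. Further we consider this function and study its properties only for such (nonnegative) $\mathbf{x}$.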
In this section we prove the following theorem.
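Theorem 4 (on the integral). Let $a_i$, $i = 1, \ldots, n$, be arbitrary positive numbers such that $\sum_{i=1}^{n} p_i = 1$, where $p_i = e^{-a_i}$. Let a function $F(z)$ obey the formula $F(z) = \int_{S(z)} f(\mathbf{x})\, d\mathbf{x}$, where $S(z) = \{\mathbf{x} \ge 0 : (\mathbf{a}, \mathbf{x}) \le z\}$ and $f$ is the continuous extension of the multinomial coefficient. Then $$\lim_{z \to \infty} \frac{F(z)}{e^{z}} = \frac{1}{H}, \quad H = \sum_{i=1}^{n} a_i e^{-a_i}.$$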
Proof. Let us first recall some evident properties of the integrand. Note that the existence of the (Riemann) integral of over the compact set evidently follows from the continuity of in the domain under consideration.
If all components of the vector $\mathbf{x}$, possibly, except one component $x_i$, equal zero, then by definition we have $f(\mathbf{x}) = 1$. Let us prove that otherwise the function $f$ is strictly increasing in $x_i$. Since the gamma function is positive for positive arguments, it suffices to prove that in this case the partial derivative of $\ln f(\mathbf{x})$ with respect to $x_i$ is positive. It equals $$\psi(x_1 + \cdots + x_n + 1) - \psi(x_i + 1), \quad \psi(x) = \frac{d}{dx} \ln \Gamma(x).$$ The positiveness of this difference follows from the fact that the function $\psi$ is increasing; this property, in turn, follows from the logarithmic convexity of the gamma function (it is well known that $(\ln \Gamma(x))'' > 0$ with $x > 0$).
The proved assertion implies that the function attains its maximum in the domain at the boundary , where . Let us calculate the exact asymptotics of the maximal value of the function in the domain with . For the vector we denote by the sum of its components and parameterize by the value and ratios :Let us use one simplest corollary of the Stirling formula , namely, the fact that with a nonnegative argument the value of the difference is bounded. We obtain that, with any ,where (this correlation is closely connected with the so-called entropy inequality for multinomial coefficients).
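The connection with the entropy inequality for multinomial coefficients can be illustrated numerically. The sketch below assumes the usual form of that estimate: with $s$ the sum of the components and $t_i = x_i/s$, the logarithm of the (extended) multinomial coefficient equals $s H(\mathbf{t})$ up to a term of order $\ln s$. The function names are illustrative.

```python
import math


def log_multinomial(xs):
    """ln of Gamma(s + 1) / prod Gamma(x_i + 1), where s = sum(xs)."""
    return math.lgamma(sum(xs) + 1) - sum(math.lgamma(x + 1) for x in xs)


def entropy(ts):
    """Natural-logarithm entropy of a probability vector."""
    return -sum(t * math.log(t) for t in ts if t > 0)


def entropy_gap(xs):
    """s * H(t) - ln f(x); nonnegative for integer xs, of order ln s."""
    s = sum(xs)
    return s * entropy([x / s for x in xs]) - log_multinomial(xs)
```

For three components the gap behaves like $\ln s$ plus a constant, so multiplying all components by 10 should increase it by about $\ln 10$.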
We seek for the maximum of this function with under one additional condition (namely, the requirement that the maximum is attained at the boundary) , where , , and . Since , we get . Moreover, the condition with mentioned gives the correlationwhere . Substituting this expression in (18), we conclude that the maximum of (accurate to ) is attained at a vector such that the fraction takes on the maximal value. Recall that the difference takes on only nonnegative values and is called the Kullback–Leibler distance (divergence) between distributions and (see ). The minimum of this difference is attained at only one value of ; evidently, an analogous assertion is also true for : if Consequently, the maximum of the function in the domain is attained (accurate to ) at the intersection of the hyperplane with the straight line , , where it equals .
Let us now immediately prove Theorem 4. Note first that by using the L’Hopital rule we can reduce the proof to that of the formula obtained by differentiating numerator and denominator with respect to and to the proof of the equalitywhere and is the delta function.
Let be a real arbitrarily small positive value. Denote by the sector consisting of points , , and , such thatWith fixed on the hyperplane correlations (18) and (19) take the formLet us now strengthen inequality (20); namely, let us prove that if for correlations (22) are violated, thenwhere is a positive constant independent of .
Since is a convex combination of , it evidently is bounded:Consequently, formula (24) is equivalent to the inequalityThe latter correlation follows from the well-known property of the Kullback-Leibler divergence(see, e.g., lemma 3.6.10 in ).
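The properties of the Kullback-Leibler divergence used here can be checked numerically. The sketch below verifies nonnegativity, vanishing at equal distributions, and Pinsker's inequality $D(\mathbf{t} \,\|\, \mathbf{p}) \ge \frac{1}{2} \|\mathbf{t} - \mathbf{p}\|_1^2$. Pinsker's inequality is one standard lower bound of this kind; whether it is exactly the lemma cited above is an assumption of this sketch.

```python
import math


def kl_divergence(ts, ps):
    """Kullback-Leibler divergence sum t_i ln(t_i / p_i), natural logarithm.

    Assumes all p_i > 0; terms with t_i = 0 contribute zero.
    """
    return sum(t * math.log(t / p) for t, p in zip(ts, ps) if t > 0)


def pinsker_lower_bound(ts, ps):
    """Half the squared L1 distance, the lower bound in Pinsker's inequality."""
    l1 = sum(abs(t - p) for t, p in zip(ts, ps))
    return 0.5 * l1 * l1
```

A quadratic lower bound of this type is what turns a deviation of the ratios $t_i$ from $p_i$ into an exponentially small factor.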
The proved inequality (24) (in view of formula (23)) implies that outside the domain the function is exponentially small in comparison to the maximal value inside the domain which equals . More precisely, with and , we getNote that the condition of the exponential smallness in comparison to remains valid, even if depends on and tends to 0 as increases, though not too fast. In what follows we assume that
One can easily see that the same exponential upper bound as in (28) also takes place not only for function but also for its integral over the domain whose volume grows according to a power law:with . Therefore in limit (21) we can treat as the integral
Let us define the asymptotics (18) of the function in the domain more precisely. Let us use the standard Stirling formula, namely, the fact that with it holds that , where . We obtain that, in the domain ,Here, as usual, ; . Therefore, we conclude that when considering the asymptotics of function (31) we can treat as follows:
In the latter formula we can write the exponent asLet us write the Taylor expansion up to second-order terms near the maximum point in the plane , that is, near the point (in what follows we denote by coordinates of the point and do by the sum of these coordinates which evidently equals ).
First of all, note thatOne can easily calculate second derivatives of expression (34):(note that we do not use first derivatives in the Taylor expansion near the maximum point).
If , then by formula (19) we have (in the latter inequality we use the continuity of the function ). Consequently,
In particular, with chosen we have . We obtain that, in the domain , Here the term contains both the remainder of terms of the series whose order exceeds 2 and the value of added by some omitted second-order terms. With we can neglect the term of . Therefore, in integral (31) in place of we should substitute the function which differs from in the fact that its exponent does not contain the term of .
Let us change variables in the integral as follows: . Since the degree of homogeneity of the delta-function equals , we obtain that limit (21) coincides withwhere is matrix, whose all elements equal , except diagonal components which are greater by .
3.2. Calculation of the Determinant
Lemma 5. Let . Consider matrix , where all nondiagonal elements equal 1, while . Then(1)the determinant of this matrix equals(2)the algebraic complement of the element with indices , equals
Corollary 6. The matrix in formula (39) is degenerate.
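The determinant and cofactor formulas, as well as the degeneracy stated in the corollary, can be verified in exact arithmetic. The sketch below assumes the diagonal entries are written as $1 + x_i$, so that the determinant is $\prod_i x_i + \sum_i \prod_{j \ne i} x_j$ and the cofactor of an off-diagonal element $(i, j)$ is $-\prod_{k \ne i, j} x_k$; this parameterization and the function names are assumptions of the sketch.

```python
from fractions import Fraction
from math import prod


def det(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    n, sign, result = len(m), 1, Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            sign = -sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return sign * result


def ones_plus_diagonal(xs):
    """Matrix with all off-diagonal entries 1 and diagonal entries 1 + x_i."""
    n = len(xs)
    return [[1 + xs[i] if i == j else 1 for j in range(n)] for i in range(n)]


def closed_form_det(xs):
    """prod x_i + sum_i prod_{j != i} x_j."""
    n = len(xs)
    return prod(xs) + sum(prod(xs[j] for j in range(n) if j != i) for i in range(n))


def cofactor(matrix, i, j):
    """Algebraic complement of the element with indices (i, j)."""
    minor = [[v for c, v in enumerate(row) if c != j]
             for r, row in enumerate(matrix) if r != i]
    return (-1) ** (i + j) * det(minor)
```

With $x_i = -1/p_i$ and $\sum_i p_i = 1$ the closed form equals $\prod_i x_i \, (1 - \sum_i p_i) = 0$, which is one way to see why a matrix of this type is degenerate.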
Proof of Lemma 5. Note that the first item of Lemma 5 defines the value of the algebraic complement of the diagonal element of such a matrix. Let us prove the lemma by induction.
With in the formula in item () we get the product over the empty set; it is accepted that this product equals 1. The formula in item () remains valid with . In the induction step we assume that the formula in item () is proved for all dimensions less than and has to be proved for the case when the dimension equals , while the formula in item () is proved for all dimensions not greater than and has to be proved for matrix.
For proving item () we can use the expansion by the last row. Multiplying the algebraic complement by the diagonal element , we get the sumThe expansion by the entire last row, taking into account the induction hypothesis for item (), make the third part in row (42) vanish. First two terms in formula (42) together give the desired sum.
In order to prove item (), let us expand the determinant considered in this item (algebraic complement of the element with indices of the matrix with dimension) by the row whose number in the initial matrix of was equal to . Generally speaking, for clarity, we use the same indices as in the numeration of the initial matrix. Since the algebraic complement considered in this item and the occurring algebraic complement for the element with indices (obtained by the expansion by a row of the determinant under consideration) have opposite signs, the value added by the element with indices equals(here we have used the induction hypothesis for item ()). The difference from the desired formula consists in the last term which equals (taking into account the first multiplier)It vanishes, when taking into account the contribution of the remaining elements in the th row of the considered matrix.
Lemma 7. Let be the matrix mentioned in Lemma 5 (its dimension is , ). Assume that , , where are arbitrary nonzero numbers. Denote by a matrix of the same dimension in the form , where is an arbitrary numeric row and is the transposition sign. Let be an arbitrary real number. Then
Corollary 8. Let a vector satisfy additional constraints , (i.e., ), while . Then
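The key structural fact behind Lemma 7, that the determinant of a matrix perturbed by a multiple of a rank-one matrix $\mathbf{b}^{T}\mathbf{b}$ is a linear function of the multiplier, can be checked in exact arithmetic. The sketch below uses an arbitrary example matrix and row; the names `perturbed_det` and `lam` are illustrative.

```python
from fractions import Fraction


def det(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    n, sign, result = len(m), 1, Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            sign = -sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return sign * result


def perturbed_det(a, b, lam):
    """det(A + lam * b^T b) for a square matrix A and a numeric row b."""
    n = len(a)
    return det([[a[i][j] + lam * b[i] * b[j] for j in range(n)]
                for i in range(n)])


def slope(a, b):
    """Coefficient at lam: sum over (i, j) of b_i * b_j * cofactor_ij(A)."""
    n = len(a)
    total = Fraction(0)
    for i in range(n):
        for j in range(n):
            minor = [[v for c, v in enumerate(row) if c != j]
                     for r, row in enumerate(a) if r != i]
            total += b[i] * b[j] * (-1) ** (i + j) * det(minor)
    return total
```

The function `slope` implements the recipe from the proof: multiply each element of the rank-one matrix by the algebraic complement of the corresponding element of the unperturbed matrix and add the products.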
Proof of Lemma 7. By the differentiation rule for determinants, the derivative of the determinant of matrix equals the sum of determinants of matrices such that in the th one all elements of the th row are replaced with their derivatives. We obtain that is the sum of determinants of matrices each one of which contains either the zero row or two various rows of the matrix . Since rank , we get .
Thus, is a linear function of , whose free term evidently equals . It is clear that for calculating the coefficient at it suffices to summate products of each element of the matrix by the algebraic complement of the corresponding element of the matrix . If an element has indices , , then by item () of Lemma 5 this product equals .
Let us explain the positive sign in the last formula. We calculate an algebraic complement of the matrix element. The matrix has dimension, and therefore the found algebraic complement differs from the algebraic complement of the corresponding matrix element for times. According to item () of Lemma 5, the algebraic complement of the corresponding matrix element is a “minus” product of multipliers . In the given case each of factors is negative (equals ) which results in positive sign of the last formula in the above paragraph.
Assume that this formula is valid for all . Then we get the sumHowever by item () of Lemma 5 the algebraic complement of the diagonal element of the matrix equals(here and below we omit the evident requirement that values of all indices belong to the set ).
Multiplying the first term in parentheses, that is, , by and summing over all , we get . Let us multiply the resting term in parentheses (48) by , sum over all , and subtract the valuefrom the obtained result (note that the subtrahend was “illegally” included in formula (47)). It gives the overall contribution of the second term in formula (48), which equalsTaking into account all the calculation elements of the determinant allows completing the proof of Lemma 7.
For completing the proof of Theorem 4 let us use Corollary 8. Let us replace the -function in integral (39) (as was proved earlier, this integral equals the limit considered in Theorem 4): . Treating the limit multiplied by the coefficient at the exponent as a multiplier in the integral, we come to the limit of the Gaussian integralthat is,Immediately applying Corollary 8, we get desired . This completes the proof.
4. The Ratio between the Sum and the Integral
What remains is to prove that, under assumptions of Theorem 3, the ratio of the integral of the function calculated over the domain to the sum of values of this function at integer points of this domain tends to 1 as . For comparing the integral of the function and the sum of its values in the same domain one usually applies the Koksma-Hlawka inequality (see ). Note that usually one considers the integral over a fixed domain (as a rule, the cube ), whereas the domain in the case under consideration is varying. However, we intend only to prove the convergence of the fraction to 1 and do not need to estimate the asymptotic difference between the integral and the sum, which simplifies the task.
Evidently, it suffices to calculate the limit of the ratio for an arbitrary infinite increasing sequence , such that .
Theorem 9. Let be a sequence of Jordan measurable sets such that for all . Assume that , , where , is an integrable and bounded on each of the domains function such that and as . Assume also that is a countable set of points from such that each of the sets is finite. Then if for any sufficiently small there exists a partition of onto a countable number of Jordan measurable sets , , such that for some , whilethen in this case there exists the limit
Proof. Evidently,Therefore, in view of (53) we conclude that starting with some it holds thatwith . In accordance with (54) we conclude that for all , except a finite number of values of the index. Therefore, there exists such that, with all ,Representing this correlation as a double inequality and summing it over all from to , we obtainwith .
Note that by condition the numerator in the latter fraction (different from the integral by a constant value) tends to infinity. Then the same is true for the denominator. Note that the denominator differs from by a constant value.
Therefore we conclude that all limit points of the sequence lie inside the interval . Due to the arbitrariness of the choice of positive Theorem 9 is proved.
Proof. For clarity we denote by the parameter that defines the boundary of the considered domain, and do by the corresponding parameter of the hyperplane that contains a certain interior point of this domain; that is, .
First of all, note that considerations in Section 3.1 imply that both in the sum and in the integral we can replace with the domainand replace the function with defined by formula (34). Therefore, we need to prove that(or that the difference of logarithms of the numerator and denominator tends to zero).
In view of Theorem 4 the logarithm of the numerator in the latter fraction is a uniformly continuous function of , while the logarithm of the denominator evidently is a nondecreasing function. Therefore for proving the existence of the limit with it suffices to prove the existence of the limit for a sequence in the form , , where is an arbitrarily small positive value (as the difference between the numerator and denominator of the logarithms in an arbitrary point slightly differs from the value of difference in the nearest points in this sequence). Namely, just for this fixed sequence we consider the ratio from the right-hand side of (62).
In order to apply Theorem 9, for an arbitrary sufficiently small positive we construct a partition of onto domains satisfying assumptions of the theorem. Namely, we construct this partition by dividing of an infinite quantity of “flapjacks” located between neighboring hyperplanes in the forms and , , where , onto a finite number of domains .
Evidently, for any we can choose a sequence such thatTo this end, it suffices to put , where (here is an upward rounding to the nearest integer).
Let . Denote by the th “flapjack” . We are going to “cut” onto a finite number of domains . We numerate the countable number of domains , , so as to make domains obtained by “cutting” with the least have lesser numbers, while the order of numbering inside the partition of plays no role.
Since , with , it holds that (cf. with (37)). Consequently, with we get and .
By formula (33),where We get and Using expansion in a series with evaluation of the second-order terms and considerations of the previous paragraph we obtain the following important observation. If andthen with sufficiently large it holds that
Since as , with sufficiently large it holds thatAs a result, we obtain that with sufficiently small , starting with some , it holds that
Therefore, dividing onto domains so as to fulfill correlation (67) for all points that belong to one domain, we guarantee the validity of assumption (53) in Theorem 9. Note that it suffices to fulfill condition (67) for all indices except one, because the validity of this condition for the remaining index follows from the fact that .
Finally, let us use the irrationality of for some . Let us denote by the set and do by the set . We are going to prove that, defining domains by inequalitieswe fulfill condition (54) (with ). Here, as usual, is a sufficiently small real positive value, though in this case we can choose as any number in the interval (roughly speaking, it is sufficient that the radius of the pieces used to divide “flapjacks” tends to infinity at ).
Evidently, we can divide “almost all” onto domains so as to simultaneously fulfill inequalities (67) and conditions (71) on and (the remaining “cuttings” on the edges of the domain which occur due to the inconsistency between the inequality and the definition of the boundary of the domain are asymptotically small).
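The role of irrationality can be illustrated with the classical equidistribution of fractional parts. The sketch below counts how often the fractional part of $k\alpha$ falls into a given interval; for an irrational $\alpha$ (here $\ln 2/\ln 3$, an arbitrary example) the frequency approaches the interval length, while for a rational $\alpha$ it does not. This illustrates the general property of uniformly distributed sequences only and is not the partition construction of the proof.

```python
import math


def fraction_in_interval(alpha, count, left, right):
    """Share of k = 1..count with left <= frac(k * alpha) < right."""
    hits = sum(1 for k in range(1, count + 1)
               if left <= (k * alpha) % 1.0 < right)
    return hits / count


# An irrational ratio of logarithms, chosen only for illustration.
IRRATIONAL_RATIO = math.log(2) / math.log(3)
```

For example, `fraction_in_interval(IRRATIONAL_RATIO, 20000, 0.0, 0.25)` should be close to 0.25, whereas `fraction_in_interval(0.5, 1000, 0.0, 0.25)` equals 0.5 because the fractional parts take only the values 0 and 0.5.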