Special Issue: Actuarial and Financial Risks: Models, Statistical Inference, and Case Studies
Research Article | Open Access
Zenga's New Index of Economic Inequality, Its Estimation, and an Analysis of Incomes in Italy
For at least a century, academics and governmental researchers have been developing measures that would aid them in understanding income distributions, their differences with respect to geographic regions, and changes over time periods. It is a fascinating area for a number of reasons, one of them being that different measures, or indices, are needed to reveal different features of income distributions. Keeping also in mind that the notions of poor and rich are relative to each other, Zenga (2007) proposed a new index of economic inequality. The index is remarkably insightful and useful, but deriving statistical inferential results has been a challenge. For example, unlike many other indices, Zenga's new index does not fall into the classes of $L$-, $U$-, and $V$-statistics. In this paper we derive the desired statistical inferential results, explore their performance in a simulation study, and then use the results to analyze data from the Bank of Italy Survey on Household Income and Wealth (SHIW).
Measuring and analyzing incomes, losses, risks, and other random outcomes, which we denote by $X$, has been an active and fruitful research area, particularly in the fields of econometrics and actuarial science. The Gini index is arguably the most popular measure of inequality, with a number of extensions and generalizations available in the literature. Keeping in mind that the notions of poor and rich are relative to each other, Zenga (2007) constructed an index that reflects this relativity. We will next recall the definitions of the Gini and Zenga indices.
Let $F$ denote the cumulative distribution function (cdf) of the random variable $X$, which we assume to be nonnegative throughout the paper. Let $F^{-1}$ denote the corresponding quantile function. The Lorenz curve is given by the formula
$$L_F(p) = \frac{1}{\mu_F}\int_0^p F^{-1}(s)\,ds, \qquad 0 \le p \le 1, \tag{1.1}$$
where $\mu_F = \mathbf{E}[X]$ is the unknown true mean of $X$. Certainly, from the rigorous mathematical point of view we should call $L_F$ the Lorenz function, but this would deviate from the widely accepted usage of the term “Lorenz curve”. Hence, curves and functions are viewed as synonyms throughout this paper.
The classical Gini index can now be written as follows:
$$G_F = \int_0^1 \left(1 - \frac{L_F(p)}{p}\right)\psi(p)\,dp, \tag{1.2}$$
where $\psi(p) = 2p$. Note that $\psi$ is a density function on $[0,1]$. Given the usual econometric interpretation of the Lorenz curve $L_F(p)$, the function
$$G_F(p) = 1 - \frac{L_F(p)}{p}, \tag{1.3}$$
which we call the Gini curve, is a relative measure of inequality. Indeed, $L_F(p)/p$ is the ratio between (i) the mean income of the poorest $p \times 100\%$ of the population and (ii) the mean income of the entire population: the closer to each other these two means are, the lower is the inequality.
Zenga's index of inequality is defined by the formula
$$Z_F = \int_0^1 Z_F(p)\,dp, \tag{1.4}$$
where the Zenga curve $Z_F(p)$ is given by
$$Z_F(p) = 1 - \frac{L_F(p)}{p}\cdot\frac{1-p}{1-L_F(p)}. \tag{1.5}$$
The Zenga curve measures the inequality between (i) the poorest $p \times 100\%$ of the population and (ii) the richer remaining $(1-p) \times 100\%$ part of the population by comparing the mean incomes of these two disjoint and exhaustive subpopulations. We will elaborate on this interpretation later, in Section 5.
The Gini and Zenga indices $G_F$ and $Z_F$ are (weighted) averages of the Gini and Zenga curves $G_F(p)$ and $Z_F(p)$, respectively. However, while in the case of the Gini index the weight function (i.e., the density) $\psi(p) = 2p$ is employed, in the case of the Zenga index the uniform weight function (identically equal to $1$) is used. As a consequence, the Gini index underweights comparisons between the very poor and the whole population, and emphasizes comparisons which involve almost identical population subgroups. From this point of view, the Zenga index is more impartial: it is based on all comparisons between complementary disjoint population subgroups and gives the same weight to each comparison. Hence, the Zenga index detects, with the same sensitivity, all deviations from equality in any part of the distribution.
To illustrate the Gini curve $G_F(p)$ and its weighted version $G_F(p)\psi(p)$, and to also facilitate their comparisons with the Zenga curve $Z_F(p)$, we choose the Pareto distribution
$$F(x) = 1 - \left(\frac{x_0}{x}\right)^{\theta}, \qquad x > x_0, \tag{1.6}$$
where $x_0 > 0$ and $\theta > 0$ are parameters. Later in this paper, we will use this distribution in a simulation study, setting $x_0 = 1$ and $\theta = 2.06$. Note that when $\theta > 2$, then the second moment of the distribution is finite. The “heavy-tailed” case $\theta \in (1,2)$ is also of interest, especially when modeling incomes of countries with very high economic inequality. We will provide additional details on the case $\theta \in (1,2)$ in Section 5.
Note. Pareto distribution (1.6) is perhaps the oldest model for income distributions. It dates back to Pareto. Pareto's original empirical research suggested to him that the number of taxpayers with income $x$ is roughly proportional to $x^{-(\theta+1)}$, where $\theta$ is a parameter that measures inequality. For historical details on the interpretation of this parameter in the context of measuring economic inequality, we refer to Zenga. We can view the parameter $x_0$ as the lowest taxable income. In addition, besides being the greatest lower bound of the distribution support, $x_0$ is also the scale parameter of the distribution and thus does not affect our inequality indices and curves, as we will see in formulas below.
Note. The Pareto distribution is positively supported, $x > x_0 > 0$. In real surveys, however, in addition to many positive incomes we may also observe some zero and negative incomes. This happens when evaluating net household incomes, which are the sums of payroll incomes (net wages, salaries, fringe benefits), pensions and net transfers (pensions, arrears, financial assistance, scholarships, alimony, gifts). Paid alimony and gifts are subtracted in forming the incomes. However, negative incomes usually occur for very few statistical units. For example, in the 2006 Bank of Italy survey we observe only four households with nonpositive incomes, out of the total of 7,766 households. Hence, it is natural to fit the Pareto model to the positive incomes and keep in mind that we are actually dealing with a conditional distribution. If, however, it is desired to deal with negative, null, and positive incomes, then instead of the Pareto distribution we may switch to different ones, such as Dagum distributions with three or four parameters [8–10].
Corresponding to Pareto distribution (1.6), the Lorenz curve is given by the formula $L_F(p) = 1 - (1-p)^{1-1/\theta}$, and thus the Gini curve becomes $G_F(p) = \big(p - 1 + (1-p)^{1-1/\theta}\big)/p$. In Figure 1(a) we have depicted the Gini and weighted Gini curves. The corresponding Zenga curve is equal to $Z_F(p) = \big(1 - (1-p)^{1/\theta}\big)/p$ and is depicted in Figure 1(b), alongside the Gini curve for an easy comparison. Figure 1(a) allows us to appreciate how the Gini weight function $\psi(p) = 2p$ disguises the high inequality between the mean income of the very poor and that of the whole population, and overemphasizes comparisons between almost identical subgroups. The outcome is that the Gini index underestimates inequality. In Figure 1(b) we see the difference between the Gini and Zenga inequality curves. For example, with $\theta = 2.06$, $G_F(p)$ for $p = 0.8$ yields $0.296$, which tells us that the mean income of the poorest $80\%$ of the population is $29.6\%$ lower than the mean income of the whole population, while the corresponding ordinate of the Zenga curve is $Z_F(0.8) = 0.678$, which tells us that the mean income of the poorest $80\%$ of the population is $67.8\%$ lower than the mean income of the remaining (richer) part of the population.
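The Pareto curves discussed above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the standard closed-form Pareto Lorenz curve $L_F(p) = 1 - (1-p)^{1-1/\theta}$ and the shape value $\theta = 2.06$ used later in the simulation study, and the function names and the midpoint integration rule are illustrative choices.

```python
import math


def pareto_lorenz(p, theta):
    """Lorenz curve of the Pareto distribution (free of the scale parameter)."""
    return 1.0 - (1.0 - p) ** (1.0 - 1.0 / theta)


def gini_curve(p, theta):
    """Gini curve: 1 - L(p)/p."""
    return 1.0 - pareto_lorenz(p, theta) / p


def zenga_curve(p, theta):
    """Zenga curve: 1 - (L(p)/p) * ((1-p)/(1-L(p)))."""
    lorenz = pareto_lorenz(p, theta)
    return 1.0 - (lorenz / p) * ((1.0 - p) / (1.0 - lorenz))


def midpoint_integral(func, steps=100_000):
    """Midpoint rule on (0, 1); the end points are never evaluated."""
    return sum(func((k + 0.5) / steps) for k in range(steps)) / steps


def gini_index(theta):
    """Average of the Gini curve with weight (density) 2p."""
    return midpoint_integral(lambda p: gini_curve(p, theta) * 2.0 * p)


def zenga_index(theta):
    """Average of the Zenga curve with uniform weight."""
    return midpoint_integral(lambda p: zenga_curve(p, theta))


if __name__ == "__main__":
    theta = 2.06  # assumed shape parameter
    print(round(gini_curve(0.8, theta), 3), round(zenga_curve(0.8, theta), 3))
    print(round(gini_index(theta), 3), round(zenga_index(theta), 3))
```

Under these assumptions the Zenga index comes out at about 0.6, while the Gini index of the same distribution is $1/(2\theta - 1) \approx 0.32$, which illustrates how differently the two weight functions summarize the same curve-level comparisons.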
The rest of this paper is organized as follows. In Section 2 we define two estimators of the Zenga index $Z_F$ and develop statistical inferential results. In Section 3 we present results of a simulation study, which explores the empirical performance of the two Zenga estimators, $\widehat{Z}_n$ and $\widetilde{Z}_n$, including coverage accuracy and length of several types of confidence intervals. In Section 4 we present an analysis of the Bank of Italy Survey on Household Income and Wealth (SHIW) data. In Section 5 we further contribute to the understanding of the Zenga index by relating it to lower and upper conditional expectations, as well as to the conditional tail expectation (CTE), which has been widely used in insurance. In Section 6 we provide a theoretical background of the aforementioned two empirical Zenga estimators. In Section 7 we justify the definitions of several variance estimators as well as their uses in constructing confidence intervals. In Section 8 we prove Theorem 2.1 of Section 2, which is the main technical result of the present paper. Technical lemmas and their proofs are relegated to Section 9.
2. Estimators and Statistical Inference
Unless explicitly stated otherwise, our statistical inferential results are derived under the assumption that data are outcomes of independent and identically distributed (i.i.d.) random variables.
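Hence, let $X_1, \dots, X_n$ be independent copies of $X$. We use two nonparametric estimators for the Zenga index $Z_F$. The first one is given by the formula
$$\widehat{Z}_n = 1 - \frac{1}{n}\sum_{i=1}^{n-1} \frac{i^{-1}\sum_{k=1}^{i} X_{k:n}}{(n-i)^{-1}\sum_{k=i+1}^{n} X_{k:n}}, \tag{2.1}$$
where $X_{1:n} \le \dots \le X_{n:n}$ are the order statistics of $X_1, \dots, X_n$. With $\bar{X}$ denoting the sample mean of $X_1, \dots, X_n$, the second estimator of the Zenga index $Z_F$ is given by the formula
$$\widetilde{Z}_n = -\sum_{i=2}^{n} \frac{\sum_{k=1}^{i-1} X_{k:n} - (i-1)X_{i:n}}{\sum_{k=i+1}^{n} X_{k:n} + iX_{i:n}} \log\left(\frac{i}{i-1}\right) + \sum_{i=1}^{n-1}\left(\frac{\bar{X}}{X_{i:n}} - 1 - \frac{\sum_{k=1}^{i-1} X_{k:n} - (i-1)X_{i:n}}{\sum_{k=i+1}^{n} X_{k:n} + iX_{i:n}}\right)\log\left(1 + \frac{X_{i:n}}{\sum_{k=i+1}^{n} X_{k:n}}\right). \tag{2.2}$$
The two estimators $\widehat{Z}_n$ and $\widetilde{Z}_n$ are asymptotically equivalent. However, despite the fact that the estimator $\widetilde{Z}_n$ is more complex, it will nevertheless be more convenient to work with when establishing asymptotic results later in this paper.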
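The two estimators of this section can be sketched in a few lines of code. The sketch below is an illustration rather than the authors' implementation: it assumes strictly positive incomes and at least two observations, it takes the first estimator to be one minus the average ratio of lower to upper group means, and it takes the second one to be the exact integral of the Zenga curve of the empirical distribution (the construction described in Section 6). The function names are hypothetical.

```python
import math
from itertools import accumulate


def zenga_hat(sample):
    """First estimator: 1 minus the average ratio of lower to upper group means."""
    xs = sorted(sample)
    n = len(xs)
    prefix = list(accumulate(xs))      # prefix[i-1] = sum of the i smallest values
    total = prefix[-1]
    ratio_sum = 0.0
    for i in range(1, n):              # i = 1, ..., n-1
        lower_mean = prefix[i - 1] / i
        upper_mean = (total - prefix[i - 1]) / (n - i)
        ratio_sum += lower_mean / upper_mean
    return 1.0 - ratio_sum / n


def zenga_tilde(sample):
    """Second estimator: closed-form integral of the empirical Zenga curve."""
    xs = sorted(sample)
    n = len(xs)
    prefix = list(accumulate(xs))
    total = prefix[-1]
    mean = total / n
    z = 0.0
    for i in range(1, n + 1):
        x_i = xs[i - 1]
        below = prefix[i - 1] - x_i    # sum of order statistics with index < i
        above = total - prefix[i - 1]  # sum of order statistics with index > i
        frac = (below - (i - 1) * x_i) / (above + i * x_i)
        if i >= 2:
            z -= frac * math.log(i / (i - 1))
        if i <= n - 1:
            z += (mean / x_i - 1.0 - frac) * math.log1p(x_i / above)
    return z


if __name__ == "__main__":
    incomes = [12.0, 18.5, 21.0, 25.0, 40.0, 95.0]  # made-up illustrative data
    print(zenga_hat(incomes), zenga_tilde(incomes))
```

Both functions use prefix sums, so after sorting each runs in linear time. On very small samples the two values can differ noticeably; the equivalence stated above is asymptotic.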
Unless explicitly stated otherwise, we assume throughout that the cdf $F$ of $X$ is a continuous function. We note that continuous cdf's are natural choices when modeling income distributions, insurance risks, and losses.
Theorem 2.1. If the moment is finite for some , then one has the asymptotic representation where denotes a random variable that converges to in probability when , and with the weight function
The latter expression of the variance is particularly convenient when working with distributions for which the first derivative (when it exists) of the quantile function $F^{-1}$ is a relatively simple function, as is the case for a large class of distributions. However, irrespective of which expression for the variance we use, the variance is unknown since the cdf $F$ is unknown, and thus needs to be estimated empirically.
2.1. One Sample Case
with the following expressions for the summands and first,
Furthermore, for every ,
With the just defined estimator of the variance , we have the asymptotic result:
where denotes convergence in distribution.
2.2. Two Independent Samples
We now discuss a variant of statement (2.14) in the case of two populations when samples are independent. Namely, let the random variables and be independent within and between the two samples. Just like in the case of the cdf , here we also assume that the cdf is continuous and for some . Furthermore, we assume that the sample sizes and are comparable, which means that there exists such that
when both and tend to infinity. From statement (2.3) and its counterpart for we then have that the quantity is asymptotically normal with mean zero and the variance . To estimate the variances and , we use and , respectively, and obtain the following result:
2.3. Paired Samples
Consider now the case when the two samples and are paired. Thus, we have that , and we also have that the pairs are independent and identically distributed. Nothing is assumed about the joint distribution of . As before, the cdf's and are continuous and both have finite moments of order , for some . From statement (2.3) and its analog for we have that is asymptotically normal with mean zero and the variance . The latter variance can of course be written as . Having already constructed estimators and , we are only left to construct an estimator for . (Note that when and are independent, then and thus the expectation vanishes.) To this end, we write the equation
Replacing the cdf's and everywhere on the right-hand side of the above equation by their respective empirical estimators and , we have (Theorem 7.3)
where are the induced (by ) order statistics of . (Note that when , then and so the sum is equal to ; hence, estimator (2.18) coincides with estimator (2.8), as expected.) Consequently, is an empirical estimator of , and so we have that
We conclude this section with a note that the above established asymptotic results (2.14), (2.16), and (2.19) are what we typically need when dealing with two populations, or two time periods, but extensions to more populations and/or time periods would be a worthwhile contribution. For hints and references on the topic, we refer to Jones et al.  and Brazauskas et al. .
3. A Simulation Study
Here we investigate the numerical performance of the estimators $\widehat{Z}_n$ and $\widetilde{Z}_n$ by simulating data from Pareto distribution (1.6) with $x_0 = 1$ and $\theta = 2.06$. These choices give the value $Z_F = 0.6$, which is approximately seen in real income distributions. As to the (artificial) choice $x_0 = 1$, we note that since $x_0$ is the scale parameter in the Pareto model, the inequality indices and curves are invariant to it. Hence, all results to be reported in this section concerning the coverage accuracy and size of confidence intervals will not be affected by the choice $x_0 = 1$.
Following Davison and Hinkley [17, Chapter 5], we compute four types of confidence intervals: normal, percentile, BCa, and $t$-bootstrap. For normal and studentized bootstrap confidence intervals we estimate the variance using empirical influence values. For the estimator $\widetilde{Z}_n$, the influence values are obtained from Theorem 2.1, and those for the estimator $\widehat{Z}_n$ are obtained using numerical differentiation as in Greselin and Pasquazzi.
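Of the four interval types, the percentile interval is the simplest to sketch. The code below is a hedged illustration, not the procedure used to produce the tables: it assumes the first (ratio-of-means) estimator, 999 resamples, Pareto data with shape 2.06 and scale 1, and the order-statistic convention that takes the $(B+1)\alpha/2$-th and $(B+1)(1-\alpha/2)$-th ordered replicates. All names and default values are illustrative.

```python
import random
from itertools import accumulate


def zenga_hat(sample):
    """Ratio-of-means Zenga estimator (assumes positive data, n >= 2)."""
    xs = sorted(sample)
    n = len(xs)
    prefix = list(accumulate(xs))
    total = prefix[-1]
    ratio_sum = 0.0
    for i in range(1, n):
        ratio_sum += (prefix[i - 1] / i) / ((total - prefix[i - 1]) / (n - i))
    return 1.0 - ratio_sum / n


def percentile_ci(sample, stat, level=0.95, resamples=999, seed=0):
    """Percentile bootstrap interval from ordered bootstrap replicates."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and recompute the statistic each time.
    reps = sorted(stat(rng.choices(sample, k=n)) for _ in range(resamples))
    alpha = 1.0 - level
    lo_idx = round((resamples + 1) * alpha / 2) - 1
    hi_idx = round((resamples + 1) * (1 - alpha / 2)) - 1
    return reps[lo_idx], reps[hi_idx]


if __name__ == "__main__":
    data_rng = random.Random(1)
    # random.paretovariate(a) draws from a Pareto law with shape a and minimum 1.
    incomes = [data_rng.paretovariate(2.06) for _ in range(200)]
    print(zenga_hat(incomes), percentile_ci(incomes, zenga_hat))
```

The normal, BCa, and studentized intervals additionally need a variance estimate or an acceleration constant, which this sketch does not attempt.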
In Table 1 we report coverage percentages of confidence intervals, for each of the four types: normal, percentile, BCa, and $t$-bootstrap. Bootstrap-based approximations have been obtained from resamples of the original samples. As suggested by Efron, we have approximated the acceleration constant for the BCa confidence intervals by one-sixth times the standardized third moment of the influence values. In Table 2 we report summary statistics concerning the size of the confidence intervals. As expected, the confidence intervals based on $\widehat{Z}_n$ and $\widetilde{Z}_n$ exhibit similar characteristics. We observe from Table 1 that all confidence intervals suffer from some undercoverage. For example, with sample size 800, about of the studentized bootstrap confidence intervals with nominal confidence level contain the true value of the Zenga index. It should be noted that the higher coverage accuracy of the studentized bootstrap confidence intervals (when compared to the other ones) comes at the cost of their larger sizes, as seen in Table 2. Some of the studentized bootstrap confidence intervals extend beyond the range $[0,1]$ of the Zenga index $Z_F$, but this can easily be fixed by taking the minimum between the currently recorded upper bounds and 1, which is the upper bound of the Zenga index $Z_F$ for every cdf $F$. We note that for the BCa confidence intervals, the number of bootstrap replications of the original sample has to be increased beyond if the nominal confidence level is high. Indeed, for samples of size , it turns out that the upper bound of (out of ) of the BCa confidence intervals based on and with nominal confidence level is given by the largest order statistics of the bootstrap distribution. For the confidence intervals based on , the corresponding figure is .
4. An Analysis of Italian Income Data
In this section we use the Zenga index $Z_F$ to analyze data from the Bank of Italy Survey on Household Income and Wealth (SHIW). The sample of the 2006 wave of this survey contains 7,768 households, with 3,957 of them being panel households. For detailed information on the survey, we refer to the Bank of Italy publication. In order to treat data correctly in the case of different household sizes, we work with equivalent incomes, which we have obtained by dividing the total household income by an equivalence coefficient, which is the sum of weights assigned to each household member. Following the modified Organization for Economic Cooperation and Development (OECD) equivalence scale, we give weight $1$ to the household head, $0.5$ to the other adult members of the household, and $0.3$ to the members under $14$ years of age. It should be noted, however, that, as is the case in many surveys concerning income analysis, households are selected using complex sampling designs. In such cases, statistical inferential results are quite complex. To alleviate the difficulties, in the present paper we follow the commonly accepted practice and treat income data as if they were i.i.d.
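The equivalent-income computation can be sketched as follows. This is an illustration only: it assumes the usual weights of the modified OECD scale (1 for the household head, 0.5 for each other member aged 14 or over, 0.3 for each member under 14), and the function and argument names are hypothetical rather than taken from the survey documentation.

```python
def oecd_modified_coefficient(n_adults, n_children_under_14):
    """Equivalence coefficient: sum of the weights of all household members."""
    if n_adults < 1:
        raise ValueError("a household needs at least one adult (the head)")
    # Assumed weights: 1 for the head, 0.5 per other adult, 0.3 per child under 14.
    return 1.0 + 0.5 * (n_adults - 1) + 0.3 * n_children_under_14


def equivalent_income(total_income, n_adults, n_children_under_14):
    """Total household income divided by the equivalence coefficient."""
    return total_income / oecd_modified_coefficient(n_adults, n_children_under_14)


if __name__ == "__main__":
    # Two adults and two young children: coefficient 1 + 0.5 + 0.6 = 2.1.
    print(equivalent_income(42000.0, 2, 2))
```

A single-person household has coefficient 1, so its equivalent income equals its total income.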
In Table 3 we report the values of $\widehat{Z}_n$ and $\widetilde{Z}_n$ according to the geographic area of the households, and we also report confidence intervals for $Z_F$ based on the two estimators. We note that two households in the sample had negative incomes in 2006, and so we have not included them in our computations.
Note. Removing the negative incomes from our current analysis is important as otherwise we would need to develop a much more complex methodology than the one offered in this paper. To give a flavour of technical challenges, we note that the Gini index may overestimate the economic inequality when negative, zero, and positive incomes are considered. In this case the Gini index needs to be renormalized as demonstrated by, for example, Chen et al. . Another way to deal with the issue would be to analyze the negative incomes and their concentration separately from the zero and positive incomes and their concentration.
Consequently, the point estimates of are based on equivalent incomes with and . As pointed out by Maasoumi , however, good care is needed when comparing point estimates of inequality measures. Indeed, direct comparison of the point estimates corresponding to the five geographic areas of Italy would lead us to the conclusion that the inequality is higher in the central and southern areas when compared to the northern area and the islands. But as we glean from pairwise comparisons of the confidence intervals, only the differences between the estimates corresponding to the northwestern and southern areas and perhaps to the islands and the southern area may be deemed statistically significant.
Moreover, we have used the paired samples of the 2004 and 2006 incomes of the 3,957 panel households in order to check whether during this time period there was a change in inequality among households. In Table 4 we report the estimates of the Zenga index based on the panel households for these two years, and the confidence intervals for the difference between the values of the Zenga index for the years 2006 and 2004. These computations have been based on formula (2.19). Having removed the four households with at least one negative income in the paired sample, we were left with a total of 3,953 observations. We see that even though we deal with large sample sizes, the point estimates alone are not reliable. Indeed, for Italy as a whole and for all geographic areas except the center, the point estimates suggest that the Zenga index decreased from the year 2004 to 2006. However, the confidence intervals in Table 4 suggest that this change is not significant.
5. An Alternative Look at the Zenga Index
In various contexts we have notions of rich and poor, large and small, risky and secure. They divide the underlying populations into two parts, which we view as subpopulations. The quantile $F^{-1}(p)$, for some $p \in (0,1)$, usually serves as a boundary separating the two subpopulations. For example, we may define rich if $X > F^{-1}(p)$ and poor if $X \le F^{-1}(p)$. Calculating the mean value of the former subpopulation gives rise to the upper conditional expectation $\mathbf{E}[X \mid X > F^{-1}(p)]$, which is known in actuarial risk theory as the conditional tail expectation (CTE). Calculating the mean value of the latter subpopulation gives rise to the lower conditional expectation $\mathbf{E}[X \mid X \le F^{-1}(p)]$, which is known in the econometric literature as the absolute Bonferroni curve, as a function of $p$.
Clearly, the ratio
$$R_F(p) = \frac{\mathbf{E}[X \mid X \le F^{-1}(p)]}{\mathbf{E}[X \mid X > F^{-1}(p)]} \tag{5.1}$$
of the lower and upper conditional expectations takes on values in the interval $[0,1]$. When $X$ is equal to any constant, which can be interpreted as the egalitarian case, then $R_F(p)$ is equal to $1$. The ratio $R_F(p)$ is equal to $0$ for all $p \in (0,1)$ when the lower conditional expectation is equal to $0$ for all $p \in (0,1)$. This means extreme inequality in the sense that, loosely speaking, there is only one individual who possesses the entire wealth. Our wish to associate the egalitarian case with $0$ and the extreme inequality with $1$ leads to the function $1 - R_F(p)$, which coincides with the Zenga curve (1.5) when the cdf $F$ is continuous. The area
$$\int_0^1 \big(1 - R_F(p)\big)\,dp \tag{5.2}$$
beneath the function $1 - R_F(p)$ is always in the interval $[0,1]$. Quantity (5.2) is a measure of inequality and coincides with the earlier defined Zenga index $Z_F$ when the cdf $F$ is continuous, which we assume throughout the paper.
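Note that under the continuity of $F$, the lower and upper conditional expectations are equal to the absolute Bonferroni curve $p^{-1}\mathrm{AL}_F(p)$ and the dual absolute Bonferroni curve $(1-p)^{-1}\big(\mu_F - \mathrm{AL}_F(p)\big)$, respectively, where
$$\mathrm{AL}_F(p) = \int_0^p F^{-1}(t)\,dt \tag{5.3}$$
is the absolute Lorenz curve. This leads us to the expression of the Zenga index $Z_F$ given by (1.4), which we now rewrite in terms of the absolute Lorenz curve as follows:
$$Z_F = \int_0^1 \left(1 - \left(\frac{1}{p} - 1\right)\frac{\mathrm{AL}_F(p)}{\mu_F - \mathrm{AL}_F(p)}\right)dp. \tag{5.4}$$
We will extensively use expression (5.4) in the proofs below. In particular, we will see in the next section that the empirical Zenga index $\widetilde{Z}_n$ is equal to $Z_F$ with the population cdf $F$ replaced by the empirical cdf $F_n$.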
We are now in a position to provide additional details on the earlier noted Pareto case $\theta \in (1,2)$, when the Pareto distribution has finite $\mathbf{E}[X]$ but infinite $\mathbf{E}[X^2]$. The above derived asymptotic results and thus the statistical inferential theory fail in this case. The required adjustments are serious and rely on the use of extreme value theory, instead of the classical central limit theorem (CLT). Specifically, the task can be achieved by first expressing the absolute Lorenz curve in terms of the conditional tail expectation (CTE):
$$\mathrm{AL}_F(p) = \mu_F - (1-p)\,\mathrm{CTE}_F(p), \tag{5.5}$$
using the equation $\mathrm{CTE}_F(p) = (1-p)^{-1}\int_p^1 F^{-1}(t)\,dt$. Hence, (5.4) becomes
$$Z_F = \int_0^1 \frac{1}{p}\left(1 - \frac{\mu_F}{\mathrm{CTE}_F(p)}\right)dp, \tag{5.6}$$
where $\mu_F$ is of course the mean of $X$. Note that replacing the population cdf $F$ by its empirical counterpart $F_n$ on the right-hand side of (5.6) would not lead to an estimator that would work when $\mathbf{E}[X^2] = \infty$, and thus when the Pareto parameter $\theta$ is below $2$. A solution to this problem is provided by Necir et al., who have suggested a new estimator of the conditional tail expectation for heavy-tailed distributions. Plugging in that estimator instead of the CTE on the right-hand side of (5.6) produces an estimator of the Zenga index when $\mathbf{E}[X^2] = \infty$. Establishing asymptotic results for the new “heavy-tailed” Zenga estimator would, however, be a complex technical task, well beyond the scope of the present paper, as can be seen from the proofs of Necir et al.
6. A Closer Look at the Two Zenga Estimators
Since samples are “discrete populations”, (5.2) and (5.4) lead to slightly different empirical estimators of $Z_F$. If we choose (5.2) and replace all population-related quantities by their empirical counterparts, then we will arrive at the estimator $\widehat{Z}_n$, as seen from the proof of the following theorem.
Theorem 6.1. The empirical Zenga index $\widehat{Z}_n$ is an empirical estimator of $Z_F$.
Proof. Let be a uniform on random variable independent of . The cdf of is . Hence, we have the following equations: Replacing every on the right-hand side of (6.1) by , we obtain which simplifies to This is the estimator .
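If, on the other hand, we choose (5.4) as the starting point for constructing an empirical estimator for $Z_F$, then we first replace the quantile $F^{-1}(t)$ by its empirical counterpart
$$F_n^{-1}(t) = X_{i:n} \quad \text{when } t \in \left(\frac{i-1}{n}, \frac{i}{n}\right]$$
in the definition of $\mathrm{AL}_F(p)$, which leads to the empirical absolute Lorenz curve $\mathrm{AL}_n(p) = \int_0^p F_n^{-1}(t)\,dt$, and then we replace each $\mathrm{AL}_F(p)$ on the right-hand side of (5.4) by the just constructed $\mathrm{AL}_n(p)$. (Note that $\mu_F = \mathrm{AL}_F(1)$ is thereby replaced by $\mathrm{AL}_n(1) = \bar{X}$.) These considerations produce the empirical Zenga index $\widetilde{Z}_n$, as seen from the proof of the following theorem.
Theorem 6.2. The empirical Zenga index $\widetilde{Z}_n$ is an estimator of $Z_F$.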
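The plug-in construction just described can be checked numerically. The sketch below is illustrative and rests on stated assumptions: the empirical quantile function is taken to be the $i$-th order statistic on the interval $((i-1)/n, i/n]$, the empirical absolute Lorenz curve is its integral (a piecewise linear function), and the Zenga integrand written in terms of the absolute Lorenz curve is integrated by a plain midpoint rule instead of the closed-form sum. All names are hypothetical.

```python
import math
from itertools import accumulate


def empirical_abs_lorenz(sample):
    """Return the piecewise linear empirical absolute Lorenz curve p -> AL_n(p)."""
    xs = sorted(sample)
    n = len(xs)
    prefix = [0.0] + list(accumulate(xs))   # prefix[j] = sum of the j smallest values

    def abs_lorenz(p):
        i = min(n, max(1, math.ceil(n * p)))  # p lies in ((i-1)/n, i/n]
        return (prefix[i - 1] + (n * p - (i - 1)) * xs[i - 1]) / n

    return abs_lorenz


def zenga_from_abs_lorenz(sample, steps=20_000):
    """Midpoint-rule integral of 1 - (1/p - 1) * AL_n(p) / (mean - AL_n(p))."""
    abs_lorenz = empirical_abs_lorenz(sample)
    mean = sum(sample) / len(sample)
    total = 0.0
    for k in range(steps):
        p = (k + 0.5) / steps               # midpoints avoid p = 0 and p = 1
        al = abs_lorenz(p)
        total += 1.0 - (1.0 / p - 1.0) * al / (mean - al)
    return total / steps


if __name__ == "__main__":
    print(zenga_from_abs_lorenz([1.0, 3.0]))
```

For the two-point sample $\{1, 3\}$ the integrand is $1/(2-p)$ on $(0, 1/2]$ and $1/(3p)$ on $(1/2, 1]$, so the exact value is $\log(4/3) + \tfrac{1}{3}\log 2 \approx 0.519$, which the numerical integral reproduces.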
Proof. By construction, the estimator is given by the equation: Hence, the proof of the lemma reduces to verifying that the right-hand sides of (2.2) and (6.5) coincide. For this, we split the integral in (6.5) into the sum of integrals over the intervals for . For every , we have , where Hence, (6.5) can be rewritten as , where with Consider first the case . We have and thus , which implies Next, we consider the case . We have and thus , which implies When , then the integrand in the definition of does not have any singularity, since due to almost surely. Hence, after simple integration we have that, for , With the above formulas for we easily check that the sum is equal to the right-hand side of (2.2). This completes the proof of Theorem 6.2.
7. A Closer Look at the Variances
Following the formulation of Theorem 2.1 we claimed that the asymptotic distribution of is centered normal with the finite variance . The following theorem proves this claim.
Theorem 7.1. When for some , then converges in distribution to the centered normal random variable where is the Brownian bridge on the interval . The variance of is finite and equal to .
Proof. Note that can be written as , where is the empirical process based on the uniform on random variables . We will next show that The proof is based on the well-known fact that, for every , the following weak convergence of stochastic processes takes place: Hence, in order to prove statement (7.2), we only need to check that the integral is finite. For this, by considering, for example, the two cases and separately, we first easily verify the bound . Hence, for every , there exists a constant such that, for all , Bound (7.5) implies that integral (7.4) is finite when , which is true since the moment is finite for some and the parameter can be chosen as small as desired. Hence, with denoting the integral on the right-hand side of statement (7.2). The random variable is normal because the Brownian bridge is a Gaussian process. Furthermore, has mean zero because has mean zero for every . The variance of is equal to because for all . We are left to show that . For this, we write the bound: Since , the finiteness of the integral on the right-hand side of bound (7.6) follows from the earlier proved statement that integral (7.4) is finite. Hence, as claimed, which concludes the proof of Theorem 7.1.
Theorem 7.2. The empirical variance is an estimator of .
Proof. We construct an empirical estimator for by replacing every on the right-hand side of (2.6) by the empirical . Consequently, we replace the function by its empirical version We denote the resulting estimator of by . The rest of the proof consists of verifying that this estimator coincides with the one defined by (2.8). Note that when and/or . Hence, the just defined is equal to Since when , we therefore have that Furthermore, where, using notations (6.6) and (6.8), the summands on the right-hand side of (7.10) are for all , and for all . When , then . Hence, we immediately arrive at the expression for given by (2.10). When , then and, after some algebra, we arrive at the right-hand side of (2.11). When , then we have the expression which, after some algebra, becomes the expression recorded in (2.12). When , then , and so we see that is given by (2.13). This completes the proof of Theorem 7.2.
Theorem 7.3. The empirical mixed moment is an estimator of .
Proof. We proceed similarly to the proof of Theorem 7.2. We estimate the integrand using After some rearrangement of terms, estimator (7.15) becomes When and , then estimator (7.16) is equal to , which leads us to the estimator . This completes the proof of Theorem 7.3.
8. Proof of Theorem 2.1
Throughout the proof we use the notation for the dual absolute Lorenz curve , which is equal to . Likewise, we use the notation for the empirical dual absolute Lorenz curve.
Proof. Simple algebra gives the equations with the remainder terms We will later show (Lemmas 9.1 and 9.2) that the remainder terms and are of the order . Hence, we now proceed with our analysis of the first two terms on the right-hand side of (8.1), for which we use the (general) Vervaat process and its dual version For mathematical and historical details on the Vervaat process, see Zitikis , Davydov and Zitikis , Greselin et al. , and references therein. Since and , adding the right-hand sides of (8.3) and (8.4) gives the equation . Hence, whatever upper bound we have for , the same bound holds for . In fact, the absolute value can be dropped from since is always nonnegative. Furthermore, we know that does not exceed . Hence, with the notation