Abstract

Nonparametric estimators for average and quantile treatment effects are constructed using Fractile Graphical Analysis, under the identifying assumption that selection to treatment is based on observable characteristics. The proposed method has two steps: first, the propensity score is estimated, and, second, a blocking estimation procedure using this estimate is used to compute treatment effects. In both cases, the estimators are proved to be consistent. Monte Carlo results show a better performance than other procedures based on the propensity score. Finally, these estimators are applied to a job training dataset.

1. Introduction

Econometric methods for estimating the effects of certain programs (such as job search assistance or classroom teaching programs) have been widely developed since the pioneering work of Ashenfelter [1], LaLonde [2], and others. In this case, a treatment refers to a certain program whose benefits are potentially obtainable by those selected for participation (treated), and it has no effect on a control group (nontreated).

Estimating average treatment effects (ATEs), which refer to the mean effect of the program on a given outcome variable, in parametric and nonparametric environments (see [3, 4]) has been a central issue in the literature. Lehmann [5] and Doksum [6] introduced the concept of quantile treatment effects (QTEs) as the difference of the quantiles of the treated and control outcome distributions. In this case, it is implicitly assumed that individuals have an intrinsic heterogeneity which cannot be controlled for using observables. Bitler et al. [7] discuss the costs of focusing on average treatment estimation instead of other statistics.

Provided that in nonexperimental settings selection into treatment is not random, ordinary least squares (OLS) and quantile regression techniques are inconsistent. As stated by Heckman and Navarro-Lozano [8], three different approaches have been used to overcome this problem. First, the control function approach explicitly models the selection mechanism and its relation to the outcome equation; second, instrumental variables; third, local estimation and aggregation. In the latter, under the unconfoundedness assumption, which states that conditional on a given set of exogenous covariates (observables) treatment occurrence is statistically independent of the potential outcomes, local unbiased estimates can be obtained by conditioning on this set of covariates. The identification strategy that we follow relies on this assumption. Rosenbaum and Rubin [9, 10] show that adjusting solely for differences between treated and control units in a scalar function of the pretreatment covariates, the propensity score, also removes the entire bias associated with differences in pretreatment variables.

Several estimation methods have been proposed for estimating ATE by conditioning on the propensity score. Matching estimators, and in particular propensity score matching, are widely used in empirical settings. In this case, each treated (nontreated) individual is matched to a nontreated (treated) individual (or aggregate of individuals) by means of their proximity in terms of the propensity score. Only in a few cases has matching on more than one dimension been used (see, e.g., [11]) because of the computational burden that multivariate matching requires. Moreover, Hirano et al.'s [12] method uses a series estimator of the propensity score to obtain efficient (in the sense of Hahn [13]) ATE estimators.

Estimation of QTE has been developed using the minimization of convex check functions as in Koenker and Bassett [14]. Abadie et al. [15] and Chernozhukov and Hansen [16, 17] develop this methodology using instrumental variables. On the other hand, Firpo [18] does not require instrumental variables, and his methodology follows a two-step procedure: in the first stage, he estimates the propensity score using a series estimator, while, in the second, he uses a weighted quantile regression method. Bitler et al. [7] compute QTE using the empirical distribution function and derive an equivalent estimator. Diamond [19] uses matching to construct comparable treated and nontreated groups and then computes the difference between the matched sample quantiles.

An alternative source of heterogeneity comes from the consideration of observables only. Treatment effects may vary depending on the amount of human capital or on the income and job status of the individuals' families. Differences in terms of these covariates imply that one may be interested in the conditional treatment effect, that is, the effect conditional on some value of the observables. For instance, in terms of the propensity score, individuals who are more likely to receive a treatment may experience a different effect than those who are less likely to receive it. As we show in this paper, how observables are treated determines differences in the parameter of interest for QTE but not for ATE. We define the average conditional quantile treatment effect as our parameter of interest, which can be described as the average of local QTEs. This parameter is equivalent to the standard unconditional QTE only in the case that the quantile treatment effect is constant.

In many cases, one would be more interested in the dependence of the outcome variable on the fractiles (i.e., quantiles) of the covariates rather than the covariates themselves. Mahalanobis's [20] fractile graphical analysis (FGA) methodology was developed to account for this heterogeneity in observables. This method has attracted renewed interest in the literature as a nonparametric regression technique [21, 22].

For our purposes, this methodology can be used as an alternative to matching, and it allows not only for estimating average but also quantile treatment effects. The idea is simple: divide the covariate space into fractiles, and obtain the conditional regression (or quantile) by a step function. Provided that the number of fractile groups increases with the number of observations, we obtain consistent estimates of these functions, as the local estimators would satisfy the unconfoundedness assumption (quoting Koenker and Hallock [23, page 147]: “(...) segmenting the sample into subsets defined according to the conditioning covariates is always a valid option. Indeed, such local fitting underlies all nonparametric quantile regression approaches. In the most extreme cases, we have $p$ distinct cells corresponding to different settings of the covariate vector, $x$, and quantile regression reduces simply to computing univariate quantiles for each of these cells.”).

FGA can be viewed as a histogram-type smoother, and it shares the convergence rate of histograms as opposed to kernel-based methods that have a better performance. In the classification of Imbens [4], it can be associated with the “blocking on the propensity score” methods. An advantage of this procedure is that only the number of fractile groups needs to be chosen as a smoothing parameter.

In spirit, this method is very similar to matching. The latter matches every treated individual to a control (nontreated) individual whose characteristics are similar. Then, using the unconfoundedness assumption, it integrates over the covariates as the matched sample is similar to the treated. FGA decomposes the covariates distribution into fractiles. Then within each fractile, treated and nontreated individuals are compared. Finally, it integrates over the covariates (in this case over the fractile groups) as matching does. However, this nonparametric technique allows us to recover the complete graph for the conditional expectation or quantiles. In the latter, we show that the graph contains more information than the comparison of treated and nontreated separately.

The propensity score FGA estimators are compared to other estimators based on the propensity score. In particular, we compare them to propensity score matching estimators and Hirano et al.'s [12] estimator for ATE and to Firpo's [18] estimator for QTE.

The paper is organized as follows. Section 2 describes the general framework and defines the parameters of interest. Section 3 reviews the literature on FGA. Section 4 derives ATE estimators, and Section 5 does it for QTE. Section 6 presents Monte Carlo evidence on the performance of these estimators, while Section 7 applies them to a well-known job training dataset. Conclusions appear in Section 8.

2. A General Setup for Nonrandom Experiments and Main Estimands

2.1. Unconditional Treatment Effects

To characterize the model more formally, we follow the potential-outcome notation used in Imbens [4], which dates back to Fisher [24], Splawa-Neyman [25], and Rubin [26–28], and is standard in the literature.

Consider $N$ individuals indexed by $i = 1, 2, \dots, N$ who may receive a certain “treatment” (e.g., receiving job training), indicated by the binary variable $W_i = 0, 1$. Each individual has a pair of potential outcomes $(Y_{1i}, Y_{0i})$ that correspond to the outcome with and without treatment, respectively. The fundamental problem, of course, is the inability to observe the same individual both with and without the treatment at the same time; that is, we only observe $Y_i = W_i \times Y_{1i} + (1 - W_i) \times Y_{0i}$ and a set of exogenous variables $X_i$. We are interested in measuring the “effect” of the $W$-treatment (e.g., whether job training increases salaries or the chances of being employed).

A parameter of interest is the average treatment effect, ATE,

$$\delta = E[Y_1 - Y_0], \tag{2.1}$$

which tells us whether, on average, the $W$-treatment has an effect on the population.

The key identification assumption is the unconfoundedness assumption [9, 28] (Rosenbaum and Rubin [9] call it the strongly ignorable treatment assignment assumption, while Heckman et al. [29] and Lechner [30, 31] call it the conditional independence assumption), which states that, conditional on the exogenous variables, the treatment indicator is independent of the potential outcomes. More formally, see the following assumption.

Assumption 2.1 (unconfoundedness). Consider

$$W \perp (Y_1, Y_0) \mid X, \tag{2.2}$$

where $\perp$ denotes statistical independence. Under this assumption we can identify the ATE (see [4]) if both treated and nontreated have a common support, that is, comparable $X$-values:

$$\begin{aligned}
\delta = E[Y_1 - Y_0] &= E_X\big[E[Y_1 - Y_0 \mid X]\big] \\
&= E_X\big[E[Y_1 \mid X, W = 1]\big] - E_X\big[E[Y_0 \mid X, W = 0]\big] \\
&= E_X\big[E[Y \mid X, W = 1]\big] - E_X\big[E[Y \mid X, W = 0]\big].
\end{aligned} \tag{2.3}$$

In some cases, we are interested not only in the average effect but also in the effect on a subgroup of the population. Average treatment effects do not fully describe all the distributional features of the $W$-treatment. For instance, high-ability individuals may benefit differently from program participation than low-ability ones, even if they have the same value of covariates. This implies that the effect of a certain treatment may vary with unobservable characteristics. A parameter of interest in the presence of heterogeneous treatment effects is the quantile treatment effect (QTE). As originally defined in the studies by Doksum [6] and Lehmann [5], the QTE corresponds, for any fixed percentile, to the horizontal distance between two cumulative distribution functions. Let $F_0$ and $F_1$ be the control and treated distributions of a certain outcome, and let $\Delta(y)$ denote the horizontal distance at $y$ between $F_0$ and $F_1$, that is, $F_0(y) = F_1(y + \Delta(y))$ or $\Delta(y) = F_1^{-1}(F_0(y)) - y$. We can express this effect not in terms of $y$ but in terms of the quantiles of the same variable, and the QTE is then

$$\delta_\tau = \Delta\big(F_0^{-1}(\tau)\big) = F_1^{-1}(\tau) - F_0^{-1}(\tau) \equiv Q_{\tau 1} - Q_{\tau 0}, \tag{2.4}$$

where $Q_{\tau j}$, $j = 0, 1$, are the quantiles of the treated and nontreated outcome distributions.

The key identification assumption here is the rank invariance assumption (which is implied by the unconfoundedness assumption): in both treatment statuses, all individuals would maintain their rank in the distribution (see [29] for a general discussion of this assumption). Therefore, using a similar argument as in the ATE case, Firpo [18] shows that this assumption provides a way of identifying the QTE:

$$\begin{aligned}
\tau &= E\big[P[Y_1 \le Q_{\tau 1} \mid X]\big] = E\big[P[Y_0 \le Q_{\tau 0} \mid X]\big] \\
&= E\big[P[Y \le Q_{\tau 1} \mid X, W = 1]\big] = E\big[P[Y \le Q_{\tau 0} \mid X, W = 0]\big],
\end{aligned} \tag{2.5}$$

where the last two expectations can be estimated from the observable data.

In both cases, Assumption 2.1 suggests that, by constructing cells of homogeneous values of $X$, we would be able to get an unbiased estimate of the treatment effect. However, this becomes increasingly difficult, and eventually computationally infeasible, as the dimension of $X$ increases. Rosenbaum and Rubin [9] argue that the unconfoundedness assumption can be restated in terms of the propensity score, $p(x) \equiv P[W = 1 \mid X = x]$, under the following assumption.

Assumption 2.2 (common support). For all $x \in \mathrm{domain}(X)$, we have that

$$0 < \underline{p} \le p(x) \le \bar{p} < 1. \tag{2.6}$$

In this case, we have the following lemma.

Lemma 2.3. Assumptions 2.1 and 2.2 imply that

$$W \perp (Y_1, Y_0) \mid p(X). \tag{2.7}$$

Proof. See the work by Rosenbaum and Rubin [9].

Therefore, the problem can be reduced to the dimension of $p(X)$. Throughout this paper we consider estimators based only on the propensity score.

2.2. Conditional Treatment Effects

Let $Y_j(X) = E[Y_j \mid X]$ and $F_j(\cdot \mid X)$, $j = 0, 1$, be the outcome distribution functions conditional on $X$, and let $H(\cdot)$ be the distribution function of $X$. Then the ATE can be defined as

$$\int\!\left[\int Y_1(X)\,dH(X)\right] dF_1(Y_1) - \int\!\left[\int Y_0(X)\,dH(X)\right] dF_0(Y_0) = \int\!\left[\int Y_1(X)\,dF_1(Y_1 \mid X) - \int Y_0(X)\,dF_0(Y_0 \mid X)\right] dH(X). \tag{2.8}$$

Therefore, ATE can be obtained by comparing the unconditional mean outcome for the treated and nontreated or by obtaining first the conditional ATE and then integrating over the covariates space.

Now define

$$Q_{\tau j}(x) = F_j^{-1}(\tau \mid X = x) \equiv \inf\{Y_j : F_j(Y_j \mid X = x) \ge \tau\}, \quad j = 0, 1, \tag{2.9}$$

as the conditional $\tau$th quantile. In general

$$E_X\big[Q_{\tau j}(X)\big] \ne Q_{\tau j}, \quad j = 0, 1. \tag{2.10}$$

In other words, the above equivalence cannot be applied to QTE: comparing the unconditional quantiles of the outcome distributions is not equivalent to computing the conditional quantiles and then aggregating. Chernozhukov and Hansen [16, 17] define the conditional quantile treatment effect (CQTE) as

$$\delta_\tau(x) = Q_{\tau 1}(x) - Q_{\tau 0}(x). \tag{2.11}$$

Define the average conditional quantile treatment effect (ACQTE) as

$$\delta_\tau = E_X\big[Q_{\tau 1}(X) - Q_{\tau 0}(X)\big]. \tag{2.12}$$

Strictly speaking, differences in $Q_{\tau 1}(X) - Q_{\tau 0}(X)$ can be attributed either to differences in the treatment effect or to differences in the effect of the $X$'s on the treated and nontreated. For instance, in a linear regression setup, we may have $Q_{\tau j}(X) = \alpha(\tau, X)\,j + \beta(\tau, j)X$, $j = 0, 1$. In the job training example, we may have that training increases salaries and returns to schooling, where years of schooling are $X$. However, in general, both parameters cannot be identified separately, and the literature often attributes to the treatment the whole conditional difference, that is, $\beta(\tau, j) = \beta(\tau)$, $j = 0, 1$.

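In order to see these differences, consider the following simple example with one outcome variable. Let $X$ be a uniform random variable on $(0, 1)$, and let

$$Y(X) = \begin{cases}
\begin{cases} 0 & \text{with prob. } 0.5, \\ 0.5 & \text{with prob. } 0.5, \end{cases} & \text{if } X \le 0.5, \\[1ex]
\begin{cases} 0.5 & \text{with prob. } 0.5, \\ 1 & \text{with prob. } 0.5, \end{cases} & \text{if } X > 0.5.
\end{cases} \tag{2.13}$$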

Here note that $E[Y] = E_X[E[Y \mid X]]$ by the Law of Iterated Expectations. Let $Q_\tau$ be the quantile of the $Y$ distribution, and let $Q_\tau(X)$ be the conditional quantile of $Y$ conditional on $X$. In this case,

$$Q_\tau = \begin{cases} 0 & \text{if } \tau < 0.25, \\ 0.5 & \text{if } 0.25 \le \tau < 0.75, \\ 1 & \text{if } \tau \ge 0.75. \end{cases} \tag{2.14}$$

But,

$$E_X\big[Q_\tau(X)\big] = \begin{cases} 0.25 & \text{if } \tau < 0.5, \\ 0.75 & \text{if } \tau \ge 0.5. \end{cases} \tag{2.15}$$
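
The gap between the unconditional quantile and the average of conditional quantiles in this example can be checked numerically. The following is a minimal sketch in plain Python; the function names, the sample size, the seed, and the choice $\tau = 0.4$ are illustrative assumptions and not part of the original analysis.

```python
import math
import random


def empirical_quantile(values, tau):
    """Return inf{y : F_n(y) >= tau} for the empirical CDF F_n of `values`."""
    ordered = sorted(values)
    k = max(0, math.ceil(tau * len(ordered)) - 1)
    return ordered[k]


def simulate(n, rng):
    """Draw (X, Y) pairs from the two-cell example in (2.13)."""
    draws = []
    for _ in range(n):
        x = rng.random()
        base = 0.0 if x <= 0.5 else 0.5      # lower support point of the cell
        y = base + (0.5 if rng.random() < 0.5 else 0.0)
        draws.append((x, y))
    return draws


def compare_quantiles(draws, tau):
    """Unconditional quantile vs. share-weighted average of cell quantiles."""
    unconditional = empirical_quantile([y for _, y in draws], tau)
    low = [y for x, y in draws if x <= 0.5]
    high = [y for x, y in draws if x > 0.5]
    average_conditional = (
        len(low) * empirical_quantile(low, tau)
        + len(high) * empirical_quantile(high, tau)
    ) / len(draws)
    return unconditional, average_conditional


rng = random.Random(0)                        # fixed seed, illustrative only
sample = simulate(20000, rng)
q_uncond, q_avg_cond = compare_quantiles(sample, tau=0.4)
print(q_uncond, round(q_avg_cond, 3))
```

For $\tau = 0.4$ the unconditional quantile should sit at the middle support point, while the averaged conditional quantile should be close to 0.25, in line with (2.14) and (2.15).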

This implies that recovering the complete graph $\{X, Q_{\tau j}(X)\}$, $j = 0, 1$, provides additional information that cannot be recovered by computing unconditional quantiles. The estimators of Firpo [18], Bitler et al. [7], and Diamond [19] obtain unconditional quantiles because they compute the difference between the treated and nontreated quantiles.

If we add $X$ to the model and the treatment effect is constant across $X$, then we have the following expression:

$$Q_\tau(X) = \alpha(\tau) + \beta(\tau)\,\mathbf{1}[X > 0.5] = 0.5 + 0.5 \times \mathbf{1}[X > 0.5], \quad \forall \tau. \tag{2.16}$$

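However, in this case, we would be attributing no difference across quantiles. If we consider differences in the treatment effect across $X$, then

$$Q_\tau(X) = \alpha(\tau, X) = \begin{cases}
\begin{cases} 0 & \text{if } \tau < 0.5, \\ 0.5 & \text{if } \tau \ge 0.5, \end{cases} & \text{if } X \le 0.5, \\[1ex]
\begin{cases} 0.5 & \text{if } \tau < 0.5, \\ 1 & \text{if } \tau \ge 0.5, \end{cases} & \text{if } X > 0.5.
\end{cases} \tag{2.17}$$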

We assume that $Y_j(\cdot)$, $Q_{\tau j}(\cdot)$, $j = 0, 1$, can be expressed as functions of $p$. In particular, for QTE, we assume that the CQTE is of the form $Q_{\tau 1}(p) - Q_{\tau 0}(p) = \alpha(\tau, p)$, and therefore the ACQTE becomes

$$\delta_\tau = E_p\big[Q_{\tau 1}(p) - Q_{\tau 0}(p)\big] = E_p\big[\delta_\tau(p)\big], \tag{2.18}$$

which is our parameter of interest.

3. Fractile Graphical Analysis

Fractile graphical analysis (FGA) is a nonparametric estimation method first developed by Mahalanobis [20], based on conditioning on the fractiles of the $X$'s. It was specifically designed to compare two populations, where the $X$ variable was influenced by inflation and therefore not directly comparable. It has the same properties as other histogram-type estimators [32]. Moreover, Bhattacharya [33] developed a conditional quantile estimation method based on FGA. Our proposal is to use FGA to develop estimators for both ATE and QTE. FGA produces a histogram-type smoother by blocking on the fractiles (i.e., quantiles) of the propensity score.

FGA was originally developed for one covariate (i.e., $\dim(X) = 1$), but Bhattacharya [33] and others showed that it can be extended to more covariates. However, we will only consider FGA based on a single covariate, the propensity score. One-dimensional FGA allows us to recover the graphs $\{p, \gamma(p)\}$, where $\gamma$ is any function of the propensity score.

Assume first that the propensity score is known and that it has a distribution function $H(p)$. Further, assume that $H(\cdot)$ is continuous and strictly increasing, and that $p$ satisfies Assumption 2.2. Construct $R$ fractile groups (indexed by $r$) on the propensity score:

$$\mathfrak{I}_r^{p} = \left\{ p \in [\underline{p}, \bar{p}] : \xi_{(r-1)/R} < p \le \xi_{r/R},\ \xi_{(r-1)/R} = H^{-1}\!\left(\frac{r-1}{R}\right),\ \xi_{r/R} = H^{-1}\!\left(\frac{r}{R}\right) \right\}, \quad r = 1, 2, \dots, R, \tag{3.1}$$

where $H^{-1}(\tau) = \inf\{p : H(p) \ge \tau\}$.

Each fractile group contains a similar number of observations (i.e., about $N/R$), and it has an associated interval on the domain of $p$ defined by the order statistics $(\xi_{(r-1)/R}, \xi_{r/R}]$, such that $P[p \in (\xi_{(r-1)/R}, \xi_{r/R}]] \simeq 1/R$. As the number of fractiles increases, the divergence in terms of $p$ for all observations within the same fractile group becomes smaller, and therefore we would be gradually constructing groups with the same $p$-characteristics. In that case, estimates within each fractile group asymptotically satisfy the unconfoundedness assumption, provided that the conditioning set converges to a single propensity score value.
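
The construction of the fractile groups can be sketched in a few lines of plain Python, with the empirical distribution of the scores standing in for $H$. The helper name `fractile_groups`, the uniform scores, and the seed are hypothetical choices made only for this illustration.

```python
import random
from collections import Counter


def fractile_groups(scores, n_groups):
    """Assign each score to a fractile group in 1..n_groups.

    The observation with rank k (1 = smallest) goes to group
    ceil(k * R / N), so group r holds the ranks in ((r-1)N/R, rN/R].
    Ties are broken by input order, which is harmless when the score
    distribution is continuous.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    groups = [0] * n
    for rank, i in enumerate(order, start=1):
        groups[i] = (rank * n_groups + n - 1) // n   # integer ceiling
    return groups


rng = random.Random(1)                  # fixed seed, illustrative only
n_obs, n_groups = 1000, 10              # 10 equals N**(1/3) for N = 1000
pscore = [rng.uniform(0.1, 0.9) for _ in range(n_obs)]
groups = fractile_groups(pscore, n_groups)

sizes = Counter(groups)
# Largest within-group spread of the score; it shrinks as R grows.
spread = max(
    max(p for p, g in zip(pscore, groups) if g == r)
    - min(p for p, g in zip(pscore, groups) if g == r)
    for r in range(1, n_groups + 1)
)
print(sorted(sizes.items()), round(spread, 3))
```

Each group ends up with $N/R$ observations, and the within-group spread of the score gives a rough sense of how close the group is to conditioning on a single propensity score value.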

The following lines provide a short review of the asymptotic properties of FGA, which can be found in the studies by Bhattacharya and Müller [32] and Bera and Gosh [21]. Let $g(p) = E[Y \mid P = p]$ and $\sigma^2(p) = \mathrm{VAR}[Y \mid P = p]$ be the conditional expectation and variance in terms of the propensity score, and consider the following notation: $h(t) = g \circ H^{-1}(t)$ and $k(t) = \sigma^2 \circ H^{-1}(t)$ for $t = (r - 1 + \alpha)/R$ with $0 \le \alpha \le 1$. Suppose that $h(\cdot)$ has bounded second derivative and $k(\cdot)$ has bounded first derivative. Then, as $N \to \infty$ and $R \to \infty$ so that $R/N \to 0$, for fixed $t$ the bias and the variance of an FGA estimator of $h(t)$, $\hat{h}(t)$, become

$$\begin{aligned}
\text{BIAS:}\quad & E\hat{h}(t) - h(t) = -(2R)^{-1} h'(t)\,[1 + o(1)] = O\!\left(\frac{1}{R}\right), \\
\text{VARIANCE:}\quad & \mathrm{VAR}\,\hat{h}(t) = \frac{R}{N}\big\{(1 - \alpha)^2 + \alpha^2\big\}\,k(t)\,[1 + o(1)] = O\!\left(\frac{R}{N}\right),
\end{aligned} \tag{3.2}$$

so that the mean-squared error of $\hat{h}$ is

$$\text{MSE:}\quad \mathrm{MSE}\,\hat{h}(t) = \left[(4R^2)^{-1}\big(h'(t)\big)^2 + \frac{R}{N}\big\{(1 - \alpha)^2 + \alpha^2\big\}\,k(t)\right][1 + o(1)], \tag{3.3}$$

where $0 \le \alpha = Rt - [Rt] < 1$. Therefore, the best rate of convergence of fractile graphs is obtained by letting $R = O(N^{1/3})$, which yields a rate of $O(N^{-2/3})$ for the integrated MSE.
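
A short calculation suggests where the $N^{1/3}$ rule comes from. The constants $A$ and $B$ below are shorthand introduced only for this sketch: they stand for the integrals over $t$ of the squared derivative of $h$ and of the variance term multiplying $R/N$ in the mean-squared error, and the $[1 + o(1)]$ factor is dropped.

```latex
\begin{align*}
M(R) &= \frac{A}{4R^{2}} + \frac{B\,R}{N}, \qquad A, B > 0, \\
M'(R) &= -\frac{A}{2R^{3}} + \frac{B}{N} = 0
  \;\Longrightarrow\; R^{*} = \left(\frac{A\,N}{2B}\right)^{1/3}, \\
M(R^{*}) &= 3 \cdot 2^{-4/3}\, A^{1/3} B^{2/3}\, N^{-2/3}.
\end{align*}
```

Under these simplifications the squared-bias and variance terms are balanced when $R$ grows like $N^{1/3}$, and the minimized criterion is of order $N^{-2/3}$.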

If $p$ is not known, then it has to be estimated. In practice any estimate $\hat{p} = p + o_p(1)$ removes the bias. However, such estimates will differ in the variance of the resulting estimator, since the first stage (i.e., the estimation of the propensity score) needs to be taken into account. Hahn [13] shows that, by using the estimated propensity score instead of the true propensity score, efficiency is achieved. Hirano et al. [12] and Firpo [18] use a semiparametric series estimator of the propensity score which produces this result.

We impose the following assumption regarding the use of the estimated propensity score.

Assumption 3.1 (convergence of propensity score fractile groups). Let $\hat{p}$ be an estimator of the propensity score. Then, for fixed $R$ and for all $r$,

$$\lim_{N \to \infty} P\big[\mathfrak{I}_r^{p} = \mathfrak{I}_r^{\hat{p}}\big] = 1. \tag{3.4}$$

4. ATE Estimators

FGA ATE estimators are based on imputing the unobserved outcome in each fractile group. Let

$$\widehat{Y}_{1i} = \begin{cases} Y_i & \text{if } W_i = 1, \\ \widetilde{Y}_{1i} & \text{if } W_i = 0, \end{cases} \tag{4.1}$$

where $\widetilde{Y}_{1i} = \sum_{k=1}^{N} W_k Y_k \mathbf{1}[\hat{p}_k \in \mathfrak{I}_{r_i}^{\hat{p}}] \big/ \sum_{k=1}^{N} W_k \mathbf{1}[\hat{p}_k \in \mathfrak{I}_{r_i}^{\hat{p}}]$, and

$$\widehat{Y}_{0i} = \begin{cases} Y_i & \text{if } W_i = 0, \\ \widetilde{Y}_{0i} & \text{if } W_i = 1, \end{cases} \tag{4.2}$$

where $\widetilde{Y}_{0i} = \sum_{k=1}^{N} (1 - W_k) Y_k \mathbf{1}[\hat{p}_k \in \mathfrak{I}_{r_i}^{\hat{p}}] \big/ \sum_{k=1}^{N} (1 - W_k) \mathbf{1}[\hat{p}_k \in \mathfrak{I}_{r_i}^{\hat{p}}]$.

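Therefore, the FGA ATE estimator is

$$\hat{\delta} = \frac{1}{N}\sum_{i=1}^{N}\big(\widehat{Y}_{1i} - \widehat{Y}_{0i}\big) = \frac{1}{N}\sum_{i=1}^{N}\big(\widetilde{Y}_{1i} - \widetilde{Y}_{0i}\big). \tag{4.3}$$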

Similarly, it can be expressed as

$$\hat{\delta} = \frac{1}{R}\sum_{r=1}^{R}\hat{\delta}(r), \tag{4.4}$$

where

$$\hat{\delta}(r) = \frac{\sum_{i=1}^{N} W_i Y_i \mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}]}{\sum_{i=1}^{N} W_i \mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}]} - \frac{\sum_{i=1}^{N} (1 - W_i) Y_i \mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}]}{\sum_{i=1}^{N} (1 - W_i) \mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}]}. \tag{4.5}$$

The logic of this estimator is based on Hahn's [13] “nonparametric imputation.” In this case, within each fractile group, $E[WY \mid \mathfrak{I}_r^{p}]$, $E[(1 - W)Y \mid \mathfrak{I}_r^{p}]$, and $E[W \mid \mathfrak{I}_r^{p}]$ are estimated nonparametrically using the previously estimated propensity score ($\hat{p}$).
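
The blocking estimator in (4.4) and (4.5) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the data-generating design (selection on a single uniform covariate, true ATE equal to 2), the use of the true propensity score in place of an estimate, the helper names, and the seed are all hypothetical.

```python
import random


def fractile_groups(scores, n_groups):
    """Rank-based assignment of scores to groups 1..n_groups."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    groups = [0] * n
    for rank, i in enumerate(order, start=1):
        groups[i] = (rank * n_groups + n - 1) // n   # integer ceiling
    return groups


def mean(values):
    return sum(values) / len(values)


def fga_ate(y, w, pscore, n_groups):
    """Average over fractile groups of (treated mean - control mean)."""
    groups = fractile_groups(pscore, n_groups)
    effects = []
    for r in range(1, n_groups + 1):
        treated = [yi for yi, wi, g in zip(y, w, groups) if g == r and wi == 1]
        control = [yi for yi, wi, g in zip(y, w, groups) if g == r and wi == 0]
        if not treated or not control:
            raise ValueError(f"fractile group {r} lacks treated or control units")
        effects.append(mean(treated) - mean(control))   # delta_hat(r)
    return mean(effects)                                # (1/R) * sum over r


# Hypothetical design: selection depends on x only, true ATE = 2.
rng = random.Random(42)
n_obs = 3000
x = [rng.random() for _ in range(n_obs)]
pscore = [0.2 + 0.6 * xi for xi in x]        # treated as known in this sketch
w = [1 if rng.random() < p else 0 for p in pscore]
y = [2.0 * wi + 3.0 * xi + rng.gauss(0.0, 1.0) for wi, xi in zip(w, x)]

naive = mean([yi for yi, wi in zip(y, w) if wi == 1]) - mean(
    [yi for yi, wi in zip(y, w) if wi == 0]
)
fga = fga_ate(y, w, pscore, n_groups=14)     # 14 is close to N**(1/3)
print(round(naive, 2), round(fga, 2))
```

In this design the raw difference in means is contaminated by selection on $x$, whereas the blocked estimate should be close to the true effect because treated and control units are compared only within narrow propensity score bands.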

Alternatively, we construct a similar estimator using the weighting technique described in the study by Hirano et al. [12]. Let

$$\widetilde{Y}_{1i} = \begin{cases} Y_i & \text{if } W_i = 1, \\ \breve{Y}_{1i} & \text{if } W_i = 0, \end{cases} \tag{4.6}$$

where $\breve{Y}_{1i} = \sum_{k=1}^{N} (W_k Y_k / \hat{p}_k)\,\mathbf{1}[\hat{p}_k \in \mathfrak{I}_{r_i}^{\hat{p}}]$, and

$$\widetilde{Y}_{0i} = \begin{cases} Y_i & \text{if } W_i = 0, \\ \breve{Y}_{0i} & \text{if } W_i = 1, \end{cases} \tag{4.7}$$

where $\breve{Y}_{0i} = \sum_{k=1}^{N} \big((1 - W_k) Y_k / (1 - \hat{p}_k)\big)\,\mathbf{1}[\hat{p}_k \in \mathfrak{I}_{r_i}^{\hat{p}}]$.

Then

$$\tilde{\delta} = \frac{1}{R}\sum_{r=1}^{R}\tilde{\delta}(r), \tag{4.8}$$

where

$$\tilde{\delta}(r) = \frac{R}{N}\sum_{i=1}^{N}\frac{W_i}{\hat{p}_i}\,Y_i\,\mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}] - \frac{R}{N}\sum_{i=1}^{N}\frac{1 - W_i}{1 - \hat{p}_i}\,Y_i\,\mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}]. \tag{4.9}$$

This estimator suffers from the same problems as Hirano et al.'s [12] estimator; that is, the presence of occasional very high or very low values of the propensity score can produce very poor empirical performance.
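
The sensitivity of the weighted estimator to extreme scores can be seen in a toy calculation. The sketch below follows the form of (4.8) and (4.9) with the fractile groups supplied directly; the four-observation data set and the function name are made up for illustration only.

```python
def fga_weighted_ate(y, w, pscore, groups, n_groups):
    """Weighted FGA estimate: average of per-group terms as in (4.8)-(4.9)."""
    n = len(y)
    total = 0.0
    for r in range(1, n_groups + 1):
        group_sum = 0.0
        for yi, wi, pi, gi in zip(y, w, pscore, groups):
            if gi == r:
                group_sum += wi * yi / pi - (1 - wi) * yi / (1.0 - pi)
        total += (n_groups / n) * group_sum      # delta_tilde(r)
    return total / n_groups                      # (1/R) * sum over r


# Toy data: two fractile groups with one treated and one control unit each.
y = [1.0, 2.0, 3.0, 4.0]
w = [1, 0, 1, 0]
groups = [1, 1, 2, 2]

balanced = fga_weighted_ate(y, w, [0.5, 0.5, 0.5, 0.5], groups, n_groups=2)
# Same data, but the first treated unit now has a score close to zero.
extreme = fga_weighted_ate(y, w, [0.01, 0.5, 0.5, 0.5], groups, n_groups=2)
print(balanced, extreme)
```

A single treated observation with a score near zero receives a weight of roughly $1/\hat{p}_i$ and dominates the estimate, which is the mechanism behind the instability described above. Note also that, because the groups partition the sample, averaging the per-group terms gives the same number as the overall inverse-probability-weighted sum divided by $N$.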

The following theorem shows that the FGA ATE estimators are consistent. The intuition behind the proof is that, as $N$ increases and $R$ increases at a slower rate, each fractile group will contain individuals with similar propensity score values. In the limit, the differences among them are negligible, and therefore the unconfoundedness assumption can be applied. In this case, the local (i.e., for a given propensity score value) ATE can be obtained by taking the difference between the average outcomes of treated and control individuals with that propensity score value.

Theorem 4.1 (consistency of ATE estimator). Consider Assumptions 2.1, 2.2, and 3.1, and assume that (1) the distribution functions of $p$ and $(Y_1, Y_0) \mid p$ are continuous and strictly increasing; (2) $E[Y_1^2] < \infty$, $E[Y_0^2] < \infty$.
Then, $\hat{\delta} \xrightarrow{P} \delta$ and $\tilde{\delta} \xrightarrow{P} \delta$ as $N, R \to \infty$, $R/N \to 0$.

Proof. See Appendix A.1.

5. QTE Estimators

Define the within-fractile conditional quantiles:

$$\begin{aligned}
\widehat{Q}^{(r)}_{\tau 1} &= \arg\min_{q} \frac{\sum_{i=1}^{N} \mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}]\, W_i\, (Y_i - q)\big(\tau - \mathbf{1}[Y_i \le q]\big)}{\sum_{i=1}^{N} \mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}]\, W_i}, \\
\widehat{Q}^{(r)}_{\tau 0} &= \arg\min_{q} \frac{\sum_{i=1}^{N} \mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}]\, (1 - W_i)\, (Y_i - q)\big(\tau - \mathbf{1}[Y_i \le q]\big)}{\sum_{i=1}^{N} \mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}]\, (1 - W_i)}.
\end{aligned} \tag{5.1}$$

Therefore, the QTE estimator is

$$\hat{\delta}_\tau = \frac{1}{R}\sum_{r=1}^{R}\hat{\delta}_\tau(r) = \frac{1}{R}\sum_{r=1}^{R}\Big(\widehat{Q}^{(r)}_{\tau 1} - \widehat{Q}^{(r)}_{\tau 0}\Big). \tag{5.2}$$
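
A minimal sketch of the unweighted QTE estimator follows, in plain Python. With equal weights inside a fractile group, an order statistic of the group's outcomes minimizes the check-function sum, so the sketch uses sample quantiles directly. The simulated design (constant treatment effect of 2, true propensity score used in place of an estimate), the helper names, and the seed are assumptions made for illustration.

```python
import math
import random


def fractile_groups(scores, n_groups):
    """Rank-based assignment of scores to groups 1..n_groups."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    groups = [0] * n
    for rank, i in enumerate(order, start=1):
        groups[i] = (rank * n_groups + n - 1) // n   # integer ceiling
    return groups


def sample_quantile(values, tau):
    """Order statistic number ceil(tau * n); it minimizes the check-function sum."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(tau * len(ordered)) - 1)]


def check_loss(values, q, tau):
    """Sum of (Y - q) * (tau - 1[Y <= q]) over the given outcomes."""
    return sum((v - q) * (tau - (1.0 if v <= q else 0.0)) for v in values)


def fga_qte(y, w, pscore, n_groups, tau):
    """Average over fractile groups of treated minus control tau-quantiles."""
    groups = fractile_groups(pscore, n_groups)
    total = 0.0
    for r in range(1, n_groups + 1):
        treated = [yi for yi, wi, g in zip(y, w, groups) if g == r and wi == 1]
        control = [yi for yi, wi, g in zip(y, w, groups) if g == r and wi == 0]
        if not treated or not control:
            raise ValueError(f"fractile group {r} lacks treated or control units")
        total += sample_quantile(treated, tau) - sample_quantile(control, tau)
    return total / n_groups


# Hypothetical design with a constant effect of 2 at every quantile.
rng = random.Random(7)
n_obs = 3000
x = [rng.random() for _ in range(n_obs)]
pscore = [0.2 + 0.6 * xi for xi in x]        # treated as known in this sketch
w = [1 if rng.random() < p else 0 for p in pscore]
y = [2.0 * wi + xi + rng.gauss(0.0, 1.0) for wi, xi in zip(w, x)]

estimates = {tau: fga_qte(y, w, pscore, 14, tau) for tau in (0.25, 0.5, 0.75)}
print({tau: round(est, 2) for tau, est in estimates.items()})
```

The `check_loss` helper is included only so that the equivalence between the order statistic and the minimizer of the check-function sum can be verified on small examples.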

Similarly, we define

$$\begin{aligned}
\widetilde{Q}^{(r)}_{\tau 1} &= \arg\min_{q} \sum_{i=1}^{N} \frac{1}{\hat{p}_i}\,\mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}]\, W_i\, (Y_i - q)\big(\tau - \mathbf{1}[Y_i \le q]\big), \\
\widetilde{Q}^{(r)}_{\tau 0} &= \arg\min_{q} \sum_{i=1}^{N} \frac{1}{1 - \hat{p}_i}\,\mathbf{1}[\hat{p}_i \in \mathfrak{I}_r^{\hat{p}}]\, (1 - W_i)\, (Y_i - q)\big(\tau - \mathbf{1}[Y_i \le q]\big), \\
\tilde{\delta}_\tau &= \frac{1}{R}\sum_{r=1}^{R}\tilde{\delta}_\tau(r) = \frac{1}{R}\sum_{r=1}^{R}\Big(\widetilde{Q}^{(r)}_{\tau 1} - \widetilde{Q}^{(r)}_{\tau 0}\Big).
\end{aligned} \tag{5.3}$$

The following theorem proves the consistency of both QTE estimators.

Theorem 5.1 (consistency of QTE estimator). Consider Assumptions 2.1, 2.2, and 3.1, and assume that (1) the distribution function of $p$ is continuous and strictly increasing; (2) the distribution function of $(Y_1, Y_0) \mid p$ is continuous, strictly increasing, and continuously differentiable.
Then, for $\tau \in (0, 1)$, $\hat{\delta}_\tau \xrightarrow{P} \delta_\tau$ and $\tilde{\delta}_\tau \xrightarrow{P} \delta_\tau$ as $N, R \to \infty$, $R/N \to 0$.

Proof. See Appendix A.2.

6. Monte Carlo Experiments

We evaluate the performance of the proposed estimators with respect to other estimators based on the propensity score. We compute propensity score matching estimators using nearest-neighbor procedures (with 1, 2, and 4 matches per observation), kernel and spline estimates. These estimators were designed by Barbara Sianesi for STATA 9.1, and they are available in the psmatch2 package. Additionally, we compute the Hirano et al. [12] semiparametric efficient estimator. In the case of QTE we compute the Firpo [18] and Bitler et al. [7] estimators. We also compute QTE matching estimators following Diamond [19]. In this case, for each observation, the matching procedure constructs the corresponding matched pair (i.e., imputes the “closest” observation with the opposite treatment status). Then, we compute the unconditional quantiles of the imputed treated and nontreated distributions. A succinct description of some estimators appears in Appendix B.

Our baseline model is

$$\begin{aligned}
& X_1, X_2, X_3, u, e \sim N(0, 1), \\
& W = \mathbf{1}[X_1 - X_2 + X_3 + e > 0], \\
& Y_1 = \delta + X_1 + X_2 + u, \\
& Y_0 = X_1 + X_2 + X_3 + u.
\end{aligned} \tag{6.1}$$

In this simple model QTEs are equal to ATE for all quantiles. We set $\delta = 2$. We generate 1000 replications of the baseline model for sample sizes in $\{100, 200, 500, 1000, 2000\}$, and we compute mean square error (MSE) and mean absolute error (MAE). Table 1 reports ATE estimators, while Table 2 shows QTE estimators for $\tau$ in $\{.10, .25, .50, .75, .90\}$. For FGA the number of fractile groups is $R = [N^{1/3}]$, which minimizes the integrated MSE (see [32]), and we also consider doubling the number of fractile groups (i.e., $R \times 2$). We consider the two FGA estimators discussed above, that is, $\hat{\delta}$ and $\tilde{\delta}$.
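
The mechanics of such an experiment can be sketched in plain Python. The sketch assumes that the baseline design reads $W = \mathbf{1}[X_1 - X_2 + X_3 + e > 0]$, $Y_1 = \delta + X_1 + X_2 + u$, and $Y_0 = X_1 + X_2 + X_3 + u$ with independent standard normal draws; this reading, the function names, the reduced number of replications, and the use of a simple difference in means as a placeholder estimator are all assumptions made for illustration.

```python
import math
import random


def draw_sample(n, delta, rng):
    """One draw of (Y, W, p) from the baseline design as read in the lead-in."""
    y, w, p = [], [], []
    for _ in range(n):
        x1, x2, x3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        u, e = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        index = x1 - x2 + x3
        wi = 1 if index + e > 0 else 0
        y1 = delta + x1 + x2 + u
        y0 = x1 + x2 + x3 + u
        y.append(y1 if wi == 1 else y0)
        w.append(wi)
        # True propensity score P[W = 1 | X] = Phi(index), Phi the N(0,1) CDF.
        p.append(0.5 * (1.0 + math.erf(index / math.sqrt(2.0))))
    return y, w, p


def difference_in_means(y, w, p):
    """Placeholder estimator that ignores the propensity score."""
    treated = [yi for yi, wi in zip(y, w) if wi == 1]
    control = [yi for yi, wi in zip(y, w) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def monte_carlo(estimator, n, delta, replications, seed):
    """Return (MSE, MAE) of `estimator` for the ATE over repeated samples."""
    rng = random.Random(seed)
    errors = []
    for _ in range(replications):
        y, w, p = draw_sample(n, delta, rng)
        errors.append(estimator(y, w, p) - delta)
    mse = sum(err ** 2 for err in errors) / replications
    mae = sum(abs(err) for err in errors) / replications
    return mse, mae


mse, mae = monte_carlo(difference_in_means, n=500, delta=2.0,
                       replications=200, seed=0)
print(round(mse, 3), round(mae, 3))
```

Any estimator with the same `(y, w, p)` signature can be passed to `monte_carlo` in place of the placeholder; in practice the propensity score would be estimated in each replication rather than taken from the design.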

The FGA ATE estimator has reasonably good performance in terms of both MSE and MAE. In almost every case, doubling the number of fractile groups results in a better performance of the $\hat{\delta}$ estimator. However, the opposite holds for the $\tilde{\delta}$ estimator. FGA ATE $\hat{\delta}(R \times 2)$ achieves the same values as the best matching estimators (using 4 neighbors and splines). Increasing the sample size reduces both MSE and MAE at similar rates in all estimators. Overall, the Hirano et al. [12] and FGA ATE $\tilde{\delta}$ estimators show extremely high values, mainly because a random draw may contain occasional values of the propensity score very close to the boundary (i.e., 0 or 1).

FGA QTE $\hat{\delta}$ estimators outperform that of Firpo [18] for all sample sizes and quantiles. All the estimators show consistency, although FGA QTE reduces both MSE and MAE at higher rates than Firpo's estimator. As in the ATE case, doubling the number of fractile groups improves the estimator performance, and FGA QTE $\hat{\delta}$ outperforms $\tilde{\delta}$. As expected, better estimates are found in the median case than in the extreme quantiles. Matching estimators show a relatively good performance. However, only in a few cases do they outperform the FGA QTE estimator. In particular, the spline matching estimator shows an outstanding performance for $\tau = 0.9$.

Overall, nonparametric FGA estimators, where the propensity score is reestimated nonparametrically (i.e., $\hat{\delta}$), show the best performance.

7. Empirical Application

We apply the estimators proposed in the paper to a widely used job training dataset first analyzed by LaLonde [2], the “National Supported Work Program” (NSW). The same database was used in other applications such as those of Heckman and Hotz [34], Dehejia and Wahba [35, 36], Abadie and Imbens [11], and Firpo [18], among others.

The program was designed as a randomized experiment for applicants who, if selected, would have received work experience (treatment) in a wide range of possible activities, such as learning to operate a restaurant or a child care center, or construction work, for a period not exceeding twelve months. Eligible participants were targeted from recipients of AFDC, former addicts, former offenders, and young school dropouts. Candidates eligible for the NSW were randomized into the program between March 1975 and July 1977. The NSW data set consists of information on earnings and employment in 1978 (outcome variables), whether treated or not, information on earnings and employment in 1974 and 1975, and background characteristics such as education, ethnicity, marital status, and age. We use the database provided by Guido Imbens (http://www.economics.harvard.edu/faculty/imbens/software_imbens/), which consists of 445 individuals: 185 treated and 260 control observations. This particular subset is the one constructed by Dehejia and Wahba [35] and described there in more detail.

We will focus on the possible effect on participants' earnings in 1978 (if any); that is, we answer the following question: what is the effect of this particular training program on future earnings? Provided that earnings is a continuous variable, we would be able to apply quantile analysis. A main drawback of this variable is that those unemployed in 1978 report earnings of zero. In 1978, 92 control and 45 treated individuals were unemployed. The average (standard deviation) of earnings in 1978 is $5300 ($6631), which breaks into $6349 ($578) for treated and $4554 ($340) for control individuals. Without considering covariates, the difference between treated and nontreated is $1794 ($671), which in a two-sample t-test rejects the null hypothesis of equal values (t-stat 2.67, P value 0.0079). We also observe differences in terms of the percentiles in the earnings distribution. The 10th percentile for the treated (control) is $0 ($0); the 25th percentile $485 ($0); the median is $4232 ($3139); the 75th percentile $9643 ($7292); and the 90th percentile is $14582 ($11551). Therefore, assuming the rank invariance property discussed above, higher quantiles of the earnings distribution seem to be associated with larger treatment effects.

The propensity score is estimated by a probit model, where the dependent variable is participation and the covariates used are the individual characteristics and employment and earnings in 1974 and 1975. Note that the propensity score is of no particular interest by itself, provided that participants were randomly selected in the experiment. In this case, no particular covariate is individually significant, and a likelihood ratio test of joint significance gives chi-squared(8) = 8.30, P value = 0.4050.

As mentioned above, a common support in the propensity score domain is necessary to make meaningful comparisons between treated and nontreated individuals. The empirical relevance of this assumption was pointed out by Heckman et al. [37], and it was identified as one of the major sources of bias. In our case, this has special importance since consistent estimation of treatment effects requires that both the number of treated and the number of control units be large enough to apply large sample theory. Moreover, if there are no treated (control) units in a given fractile group, no within-fractile estimate can be obtained. We use two different trimming procedures. First, provided that we may assume that $F_1(p) \le F_0(p)$, we only consider propensity score values in the range
$$p_* = \min_{p}\{p_i, W_i = 1\} \le p \le \max_{p}\{p_i, W_i = 0\} = p^*. \tag{7.1}$$

By doing this we drop 8 observations, and we refer to this sample as Trim 1. We also trim 2.5% in each tail of the propensity score distribution (Trim 2) dropping 23 observations.
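Both trimming rules are simple to express in code. The sketch below is illustrative only: the function names, the toy data, and the rounding rule for the number of units dropped per tail are assumptions, so it is not meant to reproduce the exact counts reported above.

```python
def trim_common_support(scores, treated):
    """Trim 1: keep units with min treated score <= p <= max control score."""
    lower = min(p for p, w in zip(scores, treated) if w == 1)
    upper = max(p for p, w in zip(scores, treated) if w == 0)
    return [i for i, p in enumerate(scores) if lower <= p <= upper]

def trim_tails(scores, fraction=0.025):
    """Trim 2: drop the lowest and highest `fraction` of units ranked by score."""
    n = len(scores)
    k = int(n * fraction)  # units dropped per tail (rounding rule is assumed)
    order = sorted(range(n), key=lambda i: scores[i])
    return sorted(order[k:n - k])

# Toy data (hypothetical): estimated propensity scores and treatment flags.
scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
treated = [0, 0, 1, 0, 1, 1, 0, 1]

keep1 = trim_common_support(scores, treated)   # indices kept under Trim 1
keep2 = trim_tails(scores, fraction=0.125)     # indices kept under Trim 2
print(keep1, keep2)
```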

Table 3 reports the estimates obtained by applying the propensity-score-based estimators used in the Monte Carlo simulation to LaLonde's data set. The first column contains the ATE estimate, while the second and third contain the average and standard deviation from a bootstrap experiment with 1000 random samples drawn with replacement from the original database. The last column reports the ATE estimator for the two different trimming procedures discussed above. Table 4 reports QTE estimates for the same quantiles analyzed in Table 2. The results confirm a positive average impact of training on earnings. The FGA ATE estimators yield $1572 and $1537, which are of the same magnitude as the kernel and spline propensity score matching estimates and the Hirano et al. [12] estimates. However, nearest-neighbor estimates are below these estimates by $100.
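The bootstrap average and standard deviation can be produced with a generic resampling loop. The following is a minimal sketch, assuming whole units are resampled with replacement; `diff_in_means` is a hypothetical placeholder for whichever ATE estimator is being evaluated, and the toy data are made up.

```python
import random
import statistics

def bootstrap(estimator, rows, n_boot=1000, seed=0):
    """Mean and standard deviation of `estimator` over resamples of `rows`."""
    rng = random.Random(seed)
    n = len(rows)
    draws = []
    for _ in range(n_boot):
        # Draw n units with replacement from the original sample.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        draws.append(estimator(sample))
    return statistics.mean(draws), statistics.stdev(draws)

def diff_in_means(rows):
    """Placeholder estimator: treated mean minus control mean of (y, w) rows."""
    y1 = [y for y, w in rows if w == 1]
    y0 = [y for y, w in rows if w == 0]
    return statistics.mean(y1) - statistics.mean(y0)

# Toy sample (hypothetical): 20 treated and 20 control units.
rows = [(10.0 + i, 1) for i in range(20)] + [(float(i), 0) for i in range(20)]
boot_mean, boot_sd = bootstrap(diff_in_means, rows)
print(round(boot_mean, 2), round(boot_sd, 2))
```

In an application, a resample that happens to contain no treated or no control units would need separate handling; with balanced groups of this size the event is negligible.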

QTE estimates show considerable variability across quantiles (see Table 4). For the 10th quantile, estimates are not statistically different from zero. The median effect is almost two-thirds of the ATE estimates, reflecting the presence of outliers in the sample or different distributional properties. Finally, for the 90th quantile, the estimates produce up to a $3000 impact, twice the ATE. In other words, those who benefit more are those with a high level of unobservables. Unfortunately, all the estimators show high bootstrap standard errors.

8. Conclusion

FGA provides a simple methodology for constructing nonparametric estimators of average and quantile treatment effects, under the assumption of selection on observables. In this paper we develop estimators using the estimated propensity score and we prove their consistency. Moreover, FGA QTE estimators show a better performance than that of Firpo's [18] QTE estimator, which constitutes the most relevant estimator in the literature using the propensity score.

Similar estimators can be derived for FGA in more than one dimension (see, for instance, the discussion in [33]), although their computational burden is unknown. Moreover, more efficient estimators may be obtained by applying smoothing techniques within or between fractiles [22].

Appendices

A. Proof of Theorems

A.1. Proof of Theorem 4.1

Proof. Let $N \to \infty$, with $R$ and $r$ fixed. Then,
$$
\begin{aligned}
\operatorname*{plim}_{N\to\infty}\hat{\delta}(r)
&= \operatorname*{plim}_{N\to\infty}\left(\frac{\sum_{i=1}^{N} W_i Y_i \mathbf{1}\big[\hat{p}_i \in \Im_r^{\hat{p}}\big]}{\sum_{i=1}^{N} W_i \mathbf{1}\big[\hat{p}_i \in \Im_r^{\hat{p}}\big]} - \frac{\sum_{i=1}^{N} (1-W_i) Y_i \mathbf{1}\big[\hat{p}_i \in \Im_r^{\hat{p}}\big]}{\sum_{i=1}^{N} (1-W_i) \mathbf{1}\big[\hat{p}_i \in \Im_r^{\hat{p}}\big]}\right) \\
&= \frac{E\big[W \times Y \mid \Im_r^{p}\big]}{P\big[W \mid \Im_r^{p}\big]} - \frac{E\big[(1-W) \times Y \mid \Im_r^{p}\big]}{P\big[(1-W) \mid \Im_r^{p}\big]} \quad \text{(by Law of Large Numbers and Assumptions 2.2 and 3.1)} \\
&= \frac{E\big[E[W \times Y \mid p] \mid \Im_r^{p}\big]}{P\big[W \mid \Im_r^{p}\big]} - \frac{E\big[E[(1-W) \times Y \mid p] \mid \Im_r^{p}\big]}{P\big[(1-W) \mid \Im_r^{p}\big]} \quad \text{(by Law of Iterated Expectations)} \\
&= \frac{E\big[E[W \mid p] \times E[Y_1 \mid p] \mid \Im_r^{p}\big]}{P\big[W \mid \Im_r^{p}\big]} - \frac{E\big[E[(1-W) \mid p] \times E[Y_0 \mid p] \mid \Im_r^{p}\big]}{P\big[(1-W) \mid \Im_r^{p}\big]} \quad \text{(by Assumption 1.2)}.
\end{aligned}
\tag{A.1}
$$
Let $E[W \mid p] = p$, $P[W \mid \Im_r^{p}] \equiv p(r)$, $E[Y_1 \mid p] = g_1(p)$, $E[Y_0 \mid p] = g_0(p)$, $\delta(r) = E[Y_1 - Y_0 \mid \Im_r^{p}]$.
Then
$$
\begin{aligned}
\Big|\operatorname*{plim}_{N\to\infty}\hat{\delta}(r) - \delta(r)\Big|
&= \left|\frac{E\big[(p - p(r))\, g_1(p) \mid \Im_r^{p}\big]}{p(r)} + \frac{E\big[(p - p(r))\, g_0(p) \mid \Im_r^{p}\big]}{1 - p(r)}\right| \\
&= \left|\frac{\operatorname{COV}\big[p; g_1(p) \mid \Im_r^{p}\big]}{p(r)} + \frac{\operatorname{COV}\big[p; g_0(p) \mid \Im_r^{p}\big]}{1 - p(r)}\right| \\
&\le \sqrt{\operatorname{VAR}\big[p \mid \Im_r^{p}\big]} \times C(r),
\end{aligned}
\tag{A.2}
$$
where
$$
C(r) = \left(\frac{\sqrt{\operatorname{VAR}\big[g_1(p) \mid \Im_r^{p}\big]}}{p(r)} + \frac{\sqrt{\operatorname{VAR}\big[g_0(p) \mid \Im_r^{p}\big]}}{1 - p(r)}\right).
\tag{A.3}
$$
Now let $R, N \to \infty$, $R/N \to 0$. Then
$$
\begin{aligned}
\Big|\operatorname*{plim}_{N\to\infty}\hat{\delta} - \delta\Big|
&= \left|\operatorname*{plim}_{N\to\infty}\frac{1}{R(N)}\sum_{r=1}^{R(N)}\big(\hat{\delta}(r) - \delta(r)\big)\right| \\
&\le \lim_{N\to\infty}\frac{1}{R(N)}\sum_{r=1}^{R(N)}\sqrt{\operatorname{VAR}\big[p \mid \Im_r^{p}\big]}\, C(r) \\
&\le \lim_{N\to\infty}\max_{r}\sqrt{\operatorname{VAR}\big[p \mid \Im_r^{p}\big]}\, C(r).
\end{aligned}
\tag{A.4}
$$
By assumptions $C(r)$ is bounded and $\sqrt{\operatorname{VAR}[p \mid \Im_r^{p}]} \le \sup_{p \in \Im_r^{p}}(p) - \inf_{p \in \Im_r^{p}}(p) \le 1/R$, for all $r$. Then $\hat{\delta} - \delta = O_p(1/R)$.
The consistency of $\tilde{\delta}$ can be easily proved by noting that, within each fractile group, the estimator is equivalent to that of Hirano et al. [12].

A.2. Proof of Theorem 5.1

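Proof. Note that as $N \to \infty$, with $R$ and $r$ fixed, by convergence of sample quantiles
$$
\hat{Q}^{(r)}_{\tau 1} \xrightarrow{\;p\;} F^{-1}_{r,W=1}(\tau),
\tag{A.5}
$$
where
$$
F_{r,W=1}(q) \equiv P\big[Y \le q \mid \Im_r^{p}, W = 1\big] = E\big[\mathbf{1}[Y \le q] \mid \Im_r^{p}, W = 1\big] = \frac{E\big[W \mathbf{1}[Y \le q] \mid \Im_r^{p}\big]}{p(r)},
\tag{A.6}
$$
and $p(r) = P[W = 1 \mid \Im_r^{p}]$ is defined in the proof of Theorem 4.1.
Therefore, writing $\bar{Q}^{(r)}_{\tau 1} \equiv F^{-1}_{r,W=1}(\tau)$ for this limit,
$$
\tau = E\left[\frac{W}{p(r)}\mathbf{1}\big[Y \le \bar{Q}^{(r)}_{\tau 1}\big] \,\Big|\, \Im_r^{p}\right] = E\left[\frac{p}{p(r)} E\big[\mathbf{1}[Y_1 \le \bar{Q}^{(r)}_{\tau 1}] \mid p\big] \,\Big|\, \Im_r^{p}\right].
\tag{A.7}
$$
However, in general,
$$
\tau \ne E\Big[E\big[\mathbf{1}[Y_1 \le \bar{Q}^{(r)}_{\tau 1}] \mid p\big] \,\Big|\, \Im_r^{p}\Big] = E\big[\mathbf{1}[Y_1 \le \bar{Q}^{(r)}_{\tau 1}] \mid \Im_r^{p}\big].
\tag{A.8}
$$
This divergence can be expressed as
$$
\Big|\tau - E\big[\mathbf{1}[Y_1 \le \bar{Q}^{(r)}_{\tau 1}] \mid \Im_r^{p}\big]\Big| = \Big|\tau - F_{r,1}\big(\bar{Q}^{(r)}_{\tau 1}\big)\Big| = \frac{1}{p(r)}\Big|\operatorname{COV}\Big[p, E\big[\mathbf{1}[Y_1 \le \bar{Q}^{(r)}_{\tau 1}] \mid p\big] \,\Big|\, \Im_r^{p}\Big]\Big| \le \frac{1}{R} \times K_1(r),
\tag{A.9}
$$
where $F_{r,1}(q) = P[Y_1 \le q \mid \Im_r^{p}]$ and $K_1(r) = \sqrt{\operatorname{VAR}\big[\mathbf{1}[Y_1 \le \bar{Q}^{(r)}_{\tau 1}] \mid \Im_r^{p}\big]}\,/\,p(r)$ is bounded by assumptions (see Theorem 4.1).
How does this translate into the divergence between $\bar{Q}^{(r)}_{\tau 1}$ and the within-fractile quantile $Q^{(r)}_{\tau 1} \equiv F^{-1}_{r,1}(\tau)$? By Taylor's theorem, with $f_{r,1}$ the density of $F_{r,1}$,
$$
\bar{Q}^{(r)}_{\tau 1} - Q^{(r)}_{\tau 1} = \frac{F_{r,1}\big(\bar{Q}^{(r)}_{\tau 1}\big) - \tau}{f_{r,1}\big(Q^{(r)}_{\tau 1}\big)} + o_p\left(\frac{1}{R} K_1(r)\right) = O_p\left(\frac{1}{R}\right).
\tag{A.10}
$$
Consider now the case that $N, R \to \infty$, $R/N \to 0$:
$$
\left|\frac{1}{R(N)}\sum_{r=1}^{R(N)} \hat{Q}^{(r)}_{\tau 1} - E\big[Q_{\tau 1}(p)\big]\right| = O_p\left(\frac{1}{R}\right),
\tag{A.11}
$$
where $E\big[\mathbf{1}[Y_1 \le Q_{\tau 1}(p)] \mid p\big] = \tau$ for all $p \in [\underline{p}, \overline{p}]$. The same argument can be applied to show the consistency of $\hat{Q}_{\tau 0}$.
Therefore,
$$
\hat{\delta}_{\tau} = \frac{1}{R(N)}\sum_{r=1}^{R(N)}\Big(\hat{Q}^{(r)}_{\tau 1} - \hat{Q}^{(r)}_{\tau 0}\Big) = \delta_{\tau} + o_p(1).
\tag{A.12}
$$
The consistency of $\tilde{\delta}_{\tau}$ can be easily proved by noting that, within each fractile group, the estimator is equivalent to that of Firpo [18].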
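The quantile estimator in (A.12) replaces within-fractile means with within-fractile sample quantiles. The following is a minimal sketch under stated assumptions: fractile groups are equal-sized blocks ranked by the estimated propensity score, and the sample quantile is the smallest observation whose empirical distribution function reaches τ. The names `sample_quantile` and `fga_qte` and the toy data are illustrative.

```python
import math

def sample_quantile(values, tau):
    """Smallest observed value whose empirical CDF is at least tau."""
    ordered = sorted(values)
    k = max(math.ceil(tau * len(ordered)) - 1, 0)
    return ordered[k]

def fga_qte(y, w, p_hat, n_groups, tau):
    """Average over fractile groups of treated-minus-control tau-quantiles."""
    n = len(y)
    order = sorted(range(n), key=lambda i: p_hat[i])
    effects = []
    for r in range(n_groups):
        block = order[r * n // n_groups:(r + 1) * n // n_groups]
        y1 = [y[i] for i in block if w[i] == 1]
        y0 = [y[i] for i in block if w[i] == 0]
        if not y1 or not y0:
            raise ValueError(f"fractile group {r} lacks treated or control units")
        effects.append(sample_quantile(y1, tau) - sample_quantile(y0, tau))
    return sum(effects) / n_groups

# Toy data (hypothetical): outcomes, treatment flags, estimated scores.
y = [1.0, 3.0, 2.0, 5.0, 4.0, 8.0, 6.0, 9.0]
w = [0, 1, 0, 1, 0, 1, 0, 1]
p_hat = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

qte_median = fga_qte(y, w, p_hat, n_groups=2, tau=0.5)
print(qte_median)
```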

B. Other ATE and QTE Estimators

Hirano et al.'s [12] semiparametric efficient ATE estimator is
$$
\frac{1}{N}\sum_{i=1}^{N}\left(\frac{W_i Y_i}{\hat{p}_i} - \frac{(1 - W_i) Y_i}{1 - \hat{p}_i}\right),
\tag{B.1}
$$
where $\hat{p}$ is a semiparametric series estimator of the propensity score.
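As an illustration, the inverse-probability-weighted average can be computed in a few lines. This is a minimal sketch that assumes the usual 1/N normalization and takes the estimated propensity scores as given (the series estimation step is not shown); the function name `ipw_ate` and the toy data are hypothetical.

```python
def ipw_ate(y, w, p_hat):
    """Inverse-probability-weighted ATE with 1/N normalization (assumed)."""
    n = len(y)
    total = 0.0
    for yi, wi, pi in zip(y, w, p_hat):
        # Treated units are weighted by 1/p, control units by 1/(1 - p).
        total += wi * yi / pi - (1 - wi) * yi / (1.0 - pi)
    return total / n

# Toy data (hypothetical) with a constant propensity score of 0.5,
# in which case the estimator equals the difference in group means.
y = [2.0, 4.0, 1.0, 3.0]
w = [1, 1, 0, 0]
p_hat = [0.5, 0.5, 0.5, 0.5]

ate_ipw = ipw_ate(y, w, p_hat)
print(ate_ipw)
```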

The Bitler et al. [7] QTE estimator is obtained by finding the empirical quantiles of the weighted empirical distributions:
$$
\hat{F}_0(q) = \frac{\sum_{i=1}^{N}\big((1 - W_i)\mathbf{1}[Y_i \le q]/(1 - \hat{p}_i)\big)}{\sum_{i=1}^{N}\big((1 - W_i)/(1 - \hat{p}_i)\big)}, \qquad
\hat{F}_1(q) = \frac{\sum_{i=1}^{N}\big(W_i \mathbf{1}[Y_i \le q]/\hat{p}_i\big)}{\sum_{i=1}^{N}\big(W_i/\hat{p}_i\big)},
\tag{B.2}
$$
that is, $\hat{F}_0^{-1}(\tau)$ and $\hat{F}_1^{-1}(\tau)$.

Firpo [18] obtains the same results by minimizing weighted convex check functions:
$$
\hat{F}_0^{-1}(\tau) = \arg\min_{q}\sum_{i=1}^{N}\frac{1 - W_i}{1 - \hat{p}_i}\,(Y_i - q)\big(\tau - \mathbf{1}[Y_i \le q]\big), \qquad
\hat{F}_1^{-1}(\tau) = \arg\min_{q}\sum_{i=1}^{N}\frac{W_i}{\hat{p}_i}\,(Y_i - q)\big(\tau - \mathbf{1}[Y_i \le q]\big).
\tag{B.3}
$$
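The equivalence between inverting the weighted empirical distribution and minimizing the weighted check function can be illustrated numerically. The sketch below is a simplified illustration under stated assumptions: propensity scores are taken as given, the check function is minimized by a search over observed outcomes only, and all function names and data are hypothetical.

```python
def weighted_quantile(y, weights, tau):
    """Smallest y whose normalized weighted empirical CDF reaches tau."""
    pairs = sorted((yi, wi) for yi, wi in zip(y, weights) if wi > 0)
    total = sum(wi for _, wi in pairs)
    acc = 0.0
    for yi, wi in pairs:
        acc += wi
        if acc >= tau * total:
            return yi
    return pairs[-1][0]  # guards against floating-point shortfall near tau = 1

def check_loss(q, y, weights, tau):
    """Weighted check-function objective evaluated at candidate q."""
    return sum(wi * (yi - q) * (tau - (1.0 if yi <= q else 0.0))
               for yi, wi in zip(y, weights))

def ipw_qte(y, w, p_hat, tau):
    """Difference of inverse-probability-weighted tau-quantiles."""
    w1 = [wi / pi for wi, pi in zip(w, p_hat)]
    w0 = [(1 - wi) / (1.0 - pi) for wi, pi in zip(w, p_hat)]
    return weighted_quantile(y, w1, tau) - weighted_quantile(y, w0, tau)

# Toy data (hypothetical).
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [0, 1, 0, 1, 0, 1]
p_hat = [0.5] * 6

qte = ipw_qte(y, w, p_hat, tau=0.5)
w1 = [wi / pi for wi, pi in zip(w, p_hat)]
q1_cdf = weighted_quantile(y, w1, 0.5)
q1_check = min(y, key=lambda q: check_loss(q, y, w1, 0.5))
print(qte, q1_cdf, q1_check)
```

In this toy sample the minimizer of the check function is unique; when the weighted distribution function equals τ exactly at an observation, the objective is flat over an interval and the two approaches may select different points of it.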

Acknowledgment

The author is grateful to Anil Bera, Antonio Galvao, and Todd Elder for helpful comments.