Special Issue: Artificial Intelligence and Data Mining 2014
Moment Conditions Selection Based on Adaptive Penalized Empirical Likelihood
Empirical likelihood is a very popular method and has been widely used in the fields of artificial intelligence (AI) and data mining as tablets, mobile applications, and social media dominate the technology landscape. This paper proposes an empirical likelihood shrinkage method to efficiently estimate unknown parameters and select correct moment conditions simultaneously when the model is defined by moment restrictions, some of which are possibly misspecified. We show that our method enjoys oracle-like properties; that is, it consistently selects the correct moment conditions and, at the same time, its estimator is as efficient as the empirical likelihood estimator obtained from all correct moment conditions. Moreover, unlike the GMM, our proposed method allows us to construct confidence regions for the parameters included in the model without estimating the covariances of the estimators. For empirical implementation, we provide some data-driven procedures for selecting the tuning parameter of the penalty function. The simulation results show that the method works remarkably well in terms of correct moment selection and the finite sample properties of the estimators. A real-life example is also presented to illustrate the new methodology.
As Xie et al. show, growing attention is being paid to the fields of artificial intelligence (AI) and data mining as tablets, mobile applications, and social media dominate the technology landscape. Moment conditions often appear in the study of AI and data mining, and empirical likelihood is a very practical tool for working with them. When a parametric likelihood function is not specified for a model, estimating equations may provide an alternative instrument for statistical inference. For example, let $x_1, \ldots, x_n$ be independent and identically distributed random vectors from a distribution $F$, and let $\theta \in \mathbb{R}^p$ be a vector of unknown parameters. Suppose that the information of the distribution is available in the form of an unbiased estimating function $g(x, \theta) = (g_1(x, \theta), \ldots, g_r(x, \theta))^T$ satisfying $E\{g(x, \theta_0)\} = 0$ and $r \ge p$, where $\theta_0$ denotes the true value of $\theta$. When $r = p$, $\theta$ can be estimated by solving the estimating equations $\sum_{i=1}^{n} g(x_i, \theta) = 0$. Allowing $r > p$ provides a useful device to combine available information for improving estimation efficiency, but directly solving $\sum_{i=1}^{n} g(x_i, \theta) = 0$ may not be feasible.
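To make the over-identification issue concrete, the following minimal sketch uses a toy setting that is not taken from the paper: one unknown parameter and two hypothetical estimating functions (the second one assumes a unit variance). Solving the first sample equation exactly leaves the second one nonzero, which is why the full system cannot in general be solved directly.

```python
# Toy illustration (hypothetical data and moment functions, not from the paper).
# One unknown parameter theta, two estimating functions:
#   g1(x, theta) = x - theta                (the mean is theta)
#   g2(x, theta) = x**2 - theta**2 - 1.0    (the variance is assumed to be 1)

def sample_moments(xs, theta):
    """Return the two sample averages (g1_bar, g2_bar) evaluated at theta."""
    n = len(xs)
    g1_bar = sum(x - theta for x in xs) / n
    g2_bar = sum(x * x - theta * theta - 1.0 for x in xs) / n
    return g1_bar, g2_bar

xs = [0.2, 1.1, -0.4, 0.9, 1.6, 0.3, 0.8, -0.1]

# The sample mean solves the first equation exactly ...
theta_from_g1 = sum(xs) / len(xs)
g1_bar, g2_bar = sample_moments(xs, theta_from_g1)

# ... but the second sample moment is then generally not zero.
print("g1_bar:", g1_bar)
print("g2_bar:", g2_bar)
```

With more equations than parameters, some rule for combining the equations is needed, which is the role played by the GMM and the EL.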
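The generalized method of moments (GMM) and empirical likelihood (EL) are two popular methodologies for estimating the parameters in structural equations. As introduced by Hansen, the GMM estimator is defined as
$$\hat{\theta}_{\mathrm{GMM}} = \arg\min_{\theta \in \Theta} \bar{g}_n(\theta)^T W \bar{g}_n(\theta), \qquad \bar{g}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} g(x_i, \theta), \tag{1}$$
where $\Theta$ is the parameter space where $\theta$ lies and $W$ is a given weight matrix. Unlike the GMM, the EL uses likelihood to optimally combine the information given in the estimating equations. More specifically, the estimator is defined by maximizing the following empirical likelihood:
$$L(\theta) = \sup\left\{ \prod_{i=1}^{n} p_i : p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1,\ \sum_{i=1}^{n} p_i\, g(x_i, \theta) = 0 \right\}. \tag{2}$$
The estimator is optimal in the sense of Godambe and Heyde. It is known that maximizing (2) is equivalent to minimizing the empirical likelihood ratio
$$l(\theta) = \sum_{i=1}^{n} \log\{1 + t^T g(x_i, \theta)\}, \tag{3}$$
where $t = t(\theta)$ satisfies $\sum_{i=1}^{n} g(x_i, \theta) / \{1 + t^T g(x_i, \theta)\} = 0$.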
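As an illustration of the empirical likelihood ratio, here is a minimal sketch for the simplest case of a scalar moment function, the deviation of an observation from a candidate mean. The function name, the bisection solver, and the data are assumptions made for this example; they are not the authors' implementation, and a vector-valued moment function would need a multivariate solver instead.

```python
import math

def el_log_ratio(xs, theta, iters=200):
    """Empirical log-likelihood ratio for the scalar moment g(x, theta) = x - theta.

    Solves sum_i g_i / (1 + t * g_i) = 0 for the scalar multiplier t by
    bisection, then returns sum_i log(1 + t * g_i).
    """
    g = [x - theta for x in xs]
    g_min, g_max = min(g), max(g)
    if not (g_min < 0.0 < g_max):
        return math.inf  # zero is not inside the convex hull of the g_i

    # All implied weights stay positive only for t in (-1/g_max, -1/g_min).
    lo = (-1.0 / g_max) * (1.0 - 1e-10)
    hi = (-1.0 / g_min) * (1.0 - 1e-10)

    def score(t):
        return sum(gi / (1.0 + t * gi) for gi in g)

    # score is strictly decreasing in t on this interval, so bisection works.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return sum(math.log1p(t * gi) for gi in g)

xs = [0.2, 1.1, -0.4, 0.9, 1.6, 0.3, 0.8, -0.1]
at_sample_mean = el_log_ratio(xs, sum(xs) / len(xs))
away_from_mean = el_log_ratio(xs, 0.3)
outside_hull = el_log_ratio(xs, 5.0)
```

The ratio is essentially zero at the sample mean, positive elsewhere, and infinite once the candidate value leaves the range of the data, so minimizing it over the candidate value returns the sample mean in this just-identified case.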
Both the GMM and the EL have been successfully used for parameter estimation and variable selection in general estimating equations. The statistical properties of the GMM and the EL estimators rely heavily on the quality of the moment conditions. Strong and valid moment conditions help to reduce finite-sample bias and improve the efficiency of the GMM and the EL estimators. However, when some moment conditions are misspecified, the GMM and the EL estimators may be inconsistent. In this paper, we are interested in estimating some unknown parameter $\theta_0$ identified by a set, set-1, of moment restrictions which can be used to estimate $\theta_0$ consistently. Meanwhile, it is supposed that there is another set, set-2, of possibly misspecified moment conditions. When the moment conditions in set-2 (or some of them) are correctly specified, including them in the estimating equations can improve the asymptotic efficiency of the estimator of $\theta_0$. However, if they are misspecified, then using these moment conditions will lead to inconsistent estimation. Hence, whenever an empirical researcher has a set of moment conditions and there is no prior information about their validity, it is important to have some procedure to select the correctly specified moment conditions in that set and include them in the estimating equations.
Note that both the GMM estimators and the EL estimators are defined through moment restrictions. They generally have the same asymptotic distributions but possibly different higher order asymptotic properties; see Newey and Smith and Schennach. As discussed in Newey and Smith, the small sample performance of the GMM is poor in some applications, and the EL has several advantages over the GMM. First, unlike the GMM, the asymptotic bias of the EL estimator does not grow with the number of moment restrictions. Consequently, with many moment conditions, the bias of the EL will be less than the bias of the GMM. The relatively low asymptotic bias of the EL indicates that it is an important alternative to the GMM. Second, unlike the GMM, the EL does not require weight matrix estimation and is invariant to nonsingular linear transformations of the moment conditions. Third, after bias correction, the EL estimator is higher order efficient relative to the bias-corrected GMM estimators, because the bias-corrected EL estimator inherits the higher order property of maximum likelihood estimators.
Inspired by the idea of Liao and considering the above advantages of the EL estimators relative to the GMM estimators, we propose a novel method for simultaneous moment selection and parameter estimation. The new method attaches a penalty function to the EL criterion, and the resulting estimator of $\theta_0$ is called the EL shrinkage estimator. Our method embeds the moment selection in the EL estimation: once a certain moment condition is selected, it is automatically included in estimating $\theta_0$. Hence, our method not only selects the correct moment conditions in set-2 in one step but also deals with moment selection and efficient estimation simultaneously. Under some regularity conditions, we show that the EL shrinkage estimator of $\theta_0$ is root-$n$ consistent and asymptotically normal. Moreover, we show that consistent moment selection is automatically achieved in the penalized EL estimation and that the EL shrinkage estimator of $\theta_0$ is asymptotically oracle-efficient (i.e., as efficient as the oracle EL estimator based on all valid moment conditions). Unlike the GMM, our proposed method allows us to construct confidence regions for the parameters included in the model without estimating the covariances of the estimators.
The rest of the paper is organized as follows. In Section 2, based on the EL and penalty method, the parameter estimation and moment condition selection are introduced. The theoretical properties of the EL shrinkage estimators and the empirical likelihood ratio are presented in Section 3. Section 4 provides simple and data-driven procedures of selecting the tuning parameters. Simulation studies and a real-life example are given in Section 5. Proofs and the technical derivations are included in the appendix.
Suppose that we are interested in estimating some unknown parameter $\theta_0$ identified by the following moment restrictions:
$$E[g_q(Z, \theta_0)] = 0, \tag{4}$$
where $Z$ is a $d_z$-dimensional random vector, $\theta_0$ is a $d_\theta$-dimensional parameter vector, the subscript $q$ of $g_q$ denotes the number of moment conditions, and $q \ge d_\theta$. The moment conditions in (4) can be used to estimate $\theta_0$ consistently. Suppose there is another set of possibly misspecified moment conditions,
$$E[g_k(Z, \theta_0)] \overset{?}{=} 0, \tag{5}$$
where "$\overset{?}{=}$" signifies that equality may hold for some elements but not others and the subscript $k$ of $g_k$ denotes the number of moment conditions. The goal of this paper is to consistently select the correct moment conditions in set-2 and automatically include them in the empirical likelihood estimation to improve the efficiency of estimating $\theta_0$.
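To incorporate moment selection into the estimation procedure, we first introduce a set of auxiliary unknown parameters $\beta_0 = E[g_k(Z, \theta_0)]$ and reparametrize the moment conditions in set-2 as
$$E[g_k(Z, \theta_0) - \beta_0] = 0. \tag{6}$$
From (6), we see that if the $j$th moment condition in (5) is correctly specified (or misspecified), then $\beta_{0,j} = 0$ (or $\beta_{0,j} \ne 0$). Hence, the zero/nonzero components in $\beta_0$ can be used to identify the correctly specified/misspecified moment conditions in set-2, and consistent moment selection is equivalent to consistent selection of the zero components in $\beta_0$. We thus stack the moment conditions in (4) and (6) to get
$$E[g(Z, \alpha_0)] = E\begin{bmatrix} g_q(Z, \theta_0) \\ g_k(Z, \theta_0) - \beta_0 \end{bmatrix} = 0, \qquad \alpha_0 = (\theta_0^T, \beta_0^T)^T. \tag{7}$$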
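The reparametrization can be sketched in code for a toy linear instrumental-variable setting. Everything specific here is hypothetical and chosen only for illustration: the model with residual `y - x * theta`, the instrument `z1` trusted a priori (set-1), the instrument `z2` under scrutiny (set-2), and the data values.

```python
def stacked_moments(y, x, z1, z2, theta, beta):
    """Per-observation stacked moments for a toy IV model.

    First entry: the set-1 moment z1 * u.
    Second entry: the shifted set-2 moment z2 * u - beta,
    where u = y - x * theta is the residual.
    """
    u = y - x * theta
    return (z1 * u, z2 * u - beta)

def average_stacked_moments(data, theta, beta):
    """Sample average of the stacked moments over rows (y, x, z1, z2)."""
    n = len(data)
    rows = [stacked_moments(y, x, z1, z2, theta, beta) for (y, x, z1, z2) in data]
    return tuple(sum(r[j] for r in rows) / n for j in range(2))

data = [
    (1.0, 0.5, 1.0, 0.2),
    (2.1, 1.0, -1.0, 0.7),
    (0.4, 0.1, 0.5, -0.3),
    (1.5, 0.9, 2.0, 1.1),
]
theta = 2.0

# Setting beta to the sample mean of z2 * u zeroes the second average, so a
# nonzero beta absorbs any misspecification of the set-2 condition.
beta_hat = sum(z2 * (y - x * theta) for (y, x, z1, z2) in data) / len(data)
g_bar = average_stacked_moments(data, theta, beta_hat)
```

Because the auxiliary parameter can always absorb the set-2 moment, the shifted condition carries no information about the structural parameter unless the auxiliary parameter is forced to zero, which is what the penalty is designed to do for valid moment conditions.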
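Let $\{Z_i\}_{i=1}^{n}$ be a sample of $Z$. The EL shrinkage estimator of $\alpha_0 = (\theta_0^T, \beta_0^T)^T$ is defined as
$$\hat{\alpha}_n = \arg\min_{\alpha \in \mathcal{A}} \left[ \sum_{i=1}^{n} \log\{1 + t^T g(Z_i, \alpha)\} + n \sum_{j=1}^{k} P_{\lambda_n}(\beta_j) \right], \tag{8}$$
where $g(Z, \alpha)$ is the stacked moment function in (7) with $\alpha = (\theta^T, \beta^T)^T$, $t$ satisfies $\sum_{i=1}^{n} g(Z_i, \alpha) / \{1 + t^T g(Z_i, \alpha)\} = 0$, $\mathcal{A}$ is the parameter space where $\alpha$ lies, and $\lambda_n$ is the tuning parameter in some general penalty function $P_{\lambda_n}(\cdot)$. The success of our method in simultaneous moment selection and efficient estimation relies on the "oracle properties" of the shrinkage techniques. That is to say, if $\beta_{0,j} = 0$ for some $j$, our method will estimate $\beta_{0,j}$ as zero with probability approaching 1 (w.p.a.1). When $\beta_{0,j}$ is estimated as zero w.p.a.1, the information contained in the $j$th moment condition of (5) is automatically used in estimating $\theta_0$ w.p.a.1. On the other hand, the nonzero components in $\beta_0$ are consistently estimated and their estimators are nonzero w.p.a.1. Hence, our method can consistently distinguish the zero and nonzero components in $\beta_0$ and is consistent in moment selection. Moreover, it estimates $\theta_0$ as if we knew all potentially correct moment conditions in set-2.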
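There are many popular choices for the penalty function $P_{\lambda_n}(\cdot)$. For example, the bridge penalty is defined as $P_{\lambda_n}(\beta_j) = \lambda_n |\beta_j|^{\gamma}$, where $\gamma \in (0, 1)$; the adaptive Lasso penalty is defined as $P_{\lambda_n}(\beta_j) = \lambda_n |\tilde{\beta}_j|^{-\omega} |\beta_j|$, where $\omega > 0$ and $\tilde{\beta}_j$ is some first-step consistent estimator of $\beta_{0,j}$; and the smoothly clipped absolute deviation (SCAD) penalty is defined as
$$P_{\lambda_n}(\beta_j) = \begin{cases} \lambda_n |\beta_j|, & |\beta_j| \le \lambda_n, \\ \dfrac{2a\lambda_n |\beta_j| - \beta_j^2 - \lambda_n^2}{2(a - 1)}, & \lambda_n < |\beta_j| \le a\lambda_n, \\ \dfrac{(a + 1)\lambda_n^2}{2}, & |\beta_j| > a\lambda_n, \end{cases}$$
where $a$ is some positive real number strictly larger than 2. The above penalty functions differ in their empirical implementations, although the related EL shrinkage estimators may have the same asymptotic properties (see the results in Section 3). We focus on the SCAD penalty in this paper.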
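The three penalties can be written down directly. The sketch below is illustrative only: the function names and default arguments are assumptions, with the SCAD default of 3.7 for the shape parameter being the value commonly suggested by Fan and Li rather than a choice reported in this paper.

```python
def bridge_penalty(beta, lam, gamma=0.5):
    """Bridge penalty lam * |beta|**gamma, with 0 < gamma < 1."""
    return lam * abs(beta) ** gamma

def adaptive_lasso_penalty(beta, lam, beta_first_step, omega=1.0):
    """Adaptive Lasso penalty lam * |beta_first_step|**(-omega) * |beta|.

    beta_first_step is a first-step consistent estimate and must be nonzero.
    """
    return lam * abs(beta_first_step) ** (-omega) * abs(beta)

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty with shape parameter a > 2 (piecewise form)."""
    b = abs(beta)
    if b <= lam:
        return lam * b                      # Lasso-like near zero
    if b <= a * lam:
        return (2.0 * a * lam * b - b * b - lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0      # flat: no shrinkage of large values

lam = 0.5
values = [scad_penalty(b, lam) for b in (0.0, 0.5, 1.85, 10.0)]
```

The SCAD penalty grows linearly near zero, which pushes small coefficients to exactly zero, and is constant beyond `a * lam`, which leaves large coefficients unshrunk; the two quadratic joins make it continuous.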
3. Asymptotic Theory
This section establishes the oracle property of the adaptive empirical likelihood (EL) shrinkage estimator. We state our theorems here, but their proofs are relegated to the appendix.
Let and be the index set of the nonzero components in and , respectively. For ease of notation, we sort the elements in in the following way: , where and . Let denote the number of valid moment conditions in the set-2. By definition, we know that and are and dimensional vectors, respectively. We define and for any . We use to denote the Euclidean norm in the Euclidean space.
We first present and discuss the sufficient conditions for consistency of .
Assumption 1. (i) is positive definite for ;
(ii) and are continuous in a neighborhood of the true value for ;
(iii) , , and are bounded by some integrable function in this neighborhood of the true value for , and the rank of is ;
(iv) for some when is large.
Assumption 1 is similar to those of Qin and Lawless . We emphasize that the dimensionality cannot exceed because the convex hull of is at most at a subset in .
Assumption 2. (i) is continuous in and for any there exists some such that
(ii) the following uniform law of large numbers (ULLN) holds: for ;
(iii) is positive definite; is a symmetric and real matrix and its eigenvalues are bounded from below and above by some fixed finite positive constants for all ;
(iv) the penalty function is nonnegative and for .
Condition (11) in Assumption 2(i) is the identifiable uniqueness condition for . By definition, ; thus, is locally uniquely identified under (11) and the continuity of . Assumptions 2(ii) and (iii) are widely applicable because they do not specify the data structure, the properties of the moment functions, or the form of the weight matrix . It is clear that when is an identity matrix this assumption holds automatically. We choose usually. Assumption 2(iv) implies that the shrinkage effect of the penalty function on the moment selection coefficients (i.e. ) converges in probability to zero as . It states that the nonzero parameters cannot converge to zero too fast. This is reasonable because otherwise the noise is too strong. This condition includes the case that for as a special example.
Assumption 3. (i) The following functional central limit theorem (FCLT) holds:
(ii) is continuously differentiable in some neighborhood of for ;
(iii) has full column rank;
(iv) the penalty function satisfies and is continuously twice differentiable at for any with
Assumption 3(i) can be verified by applying Donsker’s theorem in specific models. Assumption 3(ii) imposes a local differentiability condition on the expectation of the moment function , . Assumption 3(iii) is a local identification condition for . If this assumption fails, the resulting estimator may not be -consistent. Assumption 3(iv) imposes some local smoothness conditions on the penalty function . Intuitively, this condition implies that attaching a penalty function to the empirical likelihood criterion function does not cause any local identification problem for the unknown parameter . It can be verified that the bridge, adaptive Lasso, and SCAD penalty functions satisfy Assumption 3(iv).
Theorem 4. Under Assumptions 1, 2, and 3, as and with probability tending to 1, the EL shrinkage estimator defined in (8) satisfies(a) and(b),where and .
It is clear from Theorem 4 that, by choosing a proper , there exists a consistent EL shrinkage estimator whose convergence rate is of the order . We now show that this estimator must possess the sparsity property for all , which is stated in Theorem 6.
Assumption 5. (i) The tuning parameter satisfies
(ii) for any and any random sequence with a.e. for all and , there is where is some nonnegative sequence such that .
Assumption 5(i) indicates that the convergence rate of for all is faster than . Under this assumption, Theorem 4 implies that that is, the convergence rate of is . Assumption 5(ii) is a generalized version of condition (3.5) in Fan and Li . Intuitively, Assumption 5(ii) implies that the shrinkage estimator of is the minimizer of w.p.a.1. From Assumptions 2(iv) and 3(iv), we know that is locally minimized at 0. Hence, Assumption 5(ii) is the key condition needed for showing consistent moment selection. It can be verified that the bridge, adaptive Lasso, and SCAD penalty functions satisfy Assumption 5.
Theorem 6. Under Assumptions 1, 2, 3, and 5, one has
From the consistency of and Theorem 6, we can immediately get that is, the consistent moment selection. We next provide the conditions needed for deriving the limiting distribution of the EL shrinkage estimator.
Assumption 7. (i) There exists a symmetric, nonrandom, and positive definite matrix such that
(ii) the following central limit theorem (CLT) holds: where is some Gaussian random vector.
Assumption 7(i) is a regularity condition. Assumption 7(ii) can be verified by applying CLTs in models with specific moment functions and data structure.
Next we will consider the oracle properties of the EL shrinkage estimation. The oracle properties state that the EL shrinkage estimation can consistently identify all potentially valid moment conditions in set-2 and its estimator of is as efficient as the oracle EL estimator based on all valid moment conditions. As the consistent moment selection is directly implied by Theorems 4 and 6, the oracle properties follow if we can show that the asymptotic variance-covariance matrix of coincides with that of the oracle EL estimator.
Let and denote the potentially valid and misspecified moment functions in set-2, respectively. We define If we had prior information about the validity of the moment conditions in set-2, then there would be moment conditions to estimate . We can stack these moment conditions as From the moment conditions in (23), we can compute the asymptotic variance-covariance matrix of the optimally weighted oracle EL estimator as where is the leading submatrix of .
In the EL shrinkage estimation, if we choose a weight matrix such that then an interesting question is whether the resulting empirical likelihood (EL) shrinkage estimator of could be as efficient as the optimally weighted oracle EL estimator. The answer to the above question is affirmative, as illustrated in the following theorem.
Theorem 8 (Oracle Property). Under Assumptions 1–5, one has Furthermore, if the weight matrix satisfies (25) and Assumption 7 holds, then one has where is defined in (24).
The empirical likelihood method can be used to find estimators, construct confidence regions, and test hypotheses. The following theorem, a generalization of the Wilks theorem, allows us to carry out inference for the parameters included in the model without estimating the covariance of their estimators.
Theorem 9. Suppose Assumptions 1–7 hold. The empirical likelihood ratio statistic for testing is where is given by . Under Assumptions 1–7, as , where is true.
Theorem 9 allows us to use the EL ratio statistic for testing or obtaining confidence limits for parameters in a completely analogous way to that for parametric likelihood. The asymptotic confidence region of level for is where is the quantile of the chi-square distribution with degrees of freedom.
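The chi-square calibration can be illustrated in the classical unpenalized case of a scalar mean, where twice the empirical log-likelihood ratio is compared with a chi-square quantile with one degree of freedom. This is a minimal sketch under that simplification, not the penalized statistic of Theorem 9; the function names, the grid search, and the data are assumptions made for the example.

```python
import math

CHI2_1_095 = 3.841458820694124  # 0.95 quantile of chi-square with 1 df

def el_stat(xs, theta, iters=200):
    """Twice the empirical log-likelihood ratio for H0: E[x] = theta."""
    g = [x - theta for x in xs]
    if not (min(g) < 0.0 < max(g)):
        return math.inf  # theta outside the range of the data
    lo = (-1.0 / max(g)) * (1.0 - 1e-10)
    hi = (-1.0 / min(g)) * (1.0 - 1e-10)
    for _ in range(iters):  # bisection on the decreasing score in t
        t = 0.5 * (lo + hi)
        if sum(gi / (1.0 + t * gi) for gi in g) > 0.0:
            lo = t
        else:
            hi = t
    t = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(t * gi) for gi in g)

def el_confidence_interval(xs, grid_size=2001):
    """Grid approximation to {theta : el_stat(theta) <= chi-square cutoff}."""
    lo, hi = min(xs), max(xs)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    inside = [th for th in grid if el_stat(xs, th) <= CHI2_1_095]
    return min(inside), max(inside)

xs = [0.2, 1.1, -0.4, 0.9, 1.6, 0.3, 0.8, -0.1]
lower, upper = el_confidence_interval(xs)
```

No variance estimate appears anywhere in the computation: the interval is read off the likelihood ratio itself, which is the practical content of a Wilks-type result.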
4. Adaptive Selection of Tuning Parameter
From the results of the previous sections, we see that the tuning parameter plays an important role in deriving the oracle properties of the EL shrinkage estimator. Assumptions 2(iv), 3(iv), and 5(i)-(ii) are sufficient conditions imposed on for the oracle properties to hold. However, these conditions do not provide a straightforward mechanism for choosing the tuning parameter in finite samples. For practical implementation of the shrinkage techniques, it is important to have some procedures of selecting such that the EL shrinkage estimator not only enjoys the oracle properties asymptotically but also has good finite-sample properties.
To choose the penalty parameter , we propose three data-driven criteria based on empirical likelihood: an AIC-type criterion (EmAIC), a BIC-type criterion (EmBIC), and a Hannan-Quinn information criterion (HQIC) type criterion (EmHQIC). They are defined, respectively, as where is the index set of nonzero elements in , is the complement set of , and denotes the cardinality of the index set of and it stands for the number of moment conditions selected by the EL shrinkage method given , . is the generalized empirical likelihood (GEL) statistic proposed in Hong et al., which is defined as , where is some concave function and its domain contains 0, is some dimensional vector, and is transferred from by setting the elements of whose index belongs to to be zero.
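A grid search over candidate tuning parameters could be organized as in the sketch below. The exact form of the three criteria is not reproduced here; the sketch assumes the usual information-criterion pattern in which a fit statistic is offset by a reward proportional to the number of selected moment conditions, with weights 2, log n, and a multiple of log log n. The function name, the weights, the constant `q`, and the toy numbers are all assumptions.

```python
import math

def select_tuning_parameter(candidates, n, rule="bic", q=2.1):
    """Return the tuning parameter minimizing stat - kappa_n * n_selected.

    candidates: list of (lam, stat, n_selected) triples, where stat is a fit
    statistic computed at the shrinkage estimate for lam and n_selected is
    the number of set-2 moment conditions kept (coefficient estimated as 0).
    The weights kappa_n are assumptions patterned on AIC, BIC, and HQIC.
    """
    kappa = {
        "aic": 2.0,
        "bic": math.log(n),
        "hqic": q * math.log(math.log(n)),
    }[rule]
    best = min(candidates, key=lambda c: c[1] - kappa * c[2])
    return best[0]

# Hypothetical grid: a tiny penalty selects nothing, a moderate one keeps two
# moment conditions at little cost in fit, a large one forces in invalid ones.
candidates = [(0.001, 0.4, 0), (0.01, 1.1, 2), (0.1, 25.0, 5)]
chosen = {rule: select_tuning_parameter(candidates, n=500, rule=rule)
          for rule in ("aic", "bic", "hqic")}
```

In this toy grid all three rules pick the moderate value, mirroring the situation reported later for the real data example, where the three criteria agree.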
5. Numerical Studies
In this section, we first carry out simulations to demonstrate the finite sample performance of our method. We then apply our method to a real dataset. We compare our proposed adaptive EL shrinkage method with the GMM shrinkage method. We find that both the adaptive EL and the GMM shrinkage methods can consistently select the correct moment conditions in set-2 and automatically include them in the estimation to improve the efficiency of estimating . However, the adaptive EL shrinkage estimators are more efficient than the adaptive GMM estimators because of the advantages of the EL relative to the GMM.
5.1. Simulation Example
Example 1. In this simulation study, the data are generated from the following linear model:
for all . The available IVs are , where is a scalar random variable and is a random vector. There are two elements in which denote the potentially valid IVs and there are eight elements in which are misspecified IVs.
In (33), we take . The random variables , , , , and are generated from the following joint normal distribution: where the diagonal elements of are 1, , , and all other elements in are zero. is generated by the following equation: where is a vector of ones. The correlation of and measures the signal strength of the IV about the endogenous variable . There is one specification of used in the simulation; that is, .
We assume the econometrician knows that is a valid IV, while being unsure about validity of the IVs in . Hence, the moment conditions in set-1 are while the moment conditions in set-2 are The SCAD penalty is used in the empirical likelihood shrinkage estimation, where the first-step estimators of the moment selection coefficients are from the empirical likelihood estimation using the moment conditions in (37) and the reparametrized moment conditions in (38).
For the specification of , we use simulated samples with sample sizes $n = 100$ and $n = 500$, respectively, and for each sample size, 2000 simulated samples are drawn from the data generating mechanism. With each simulated sample, we calculate four different types of estimators: the oracle estimator, the empirical likelihood estimator, the empirical likelihood shrinkage estimator using the tuning parameter selected by EmAIC, and the EL shrinkage estimator using the tuning parameter selected by EmBIC. The oracle estimator is an EL estimator based on the moment conditions in set-1 and all valid moment conditions in set-2. The EL estimator is an EL estimator based only on the moment conditions in set-1. Given the specification of and the sample size $n$, we obtain 2000 estimates of for each type of estimator from the 2000 simulated samples. Hence, we can estimate the finite sample marginal densities of the different estimators for , and the simulation results are presented in Figures 1 and 2. Tables 1 and 2 contain the selection probabilities of the correct, underselected, and overselected sets of moment conditions in EL (GMM) shrinkage estimation using the tuning parameters from EmAIC (GMM-AIC) and EmBIC (GMM-BIC), respectively.
There are several remarks we can make based on the simulation results presented in Tables 1 and 2 and Figures 1 and 2. First, when the signal strength of the moment conditions in set-1 is , the EL shrinkage method selects all valid moment conditions in set-2 with high probability and selects the overselected sets of moment conditions with low probability. Second, when the sample size is increased from 100 to 500, the probability of selecting the set of valid moment conditions in set-2 increases greatly and the probability of selecting the overselected or underselected sets of moment conditions decreases sharply. Third, if we compare the EL shrinkage estimators based on different data-driven procedures of selecting the tuning parameter, we see that the EL shrinkage estimation using the tuning parameters from EmAIC has lower probability of selecting inconsistent sets of moment conditions. But it has nontrivial probability of selecting underselected sets of moment conditions, even when the sample size is increased from 100 to 500. On the other hand, the EL shrinkage estimation using the tuning parameters from EmBIC has lower probability of selecting the underselected sets of moment conditions and higher probability of selecting the set of correct moment conditions, but its probability of selecting the overselected sets of moment conditions is higher. Fourth, the finite sample densities of the EL shrinkage estimators behave much better than those of the EL estimators in all scenarios of this simulation study. Comparing the EL shrinkage estimator with the EL estimator, the most obvious improvement is the reduction of the variance, as we can see from the finite sample densities depicted in Figures 1 and 2. Also note that when the sample size is increased, the finite sample densities of the EL shrinkage estimators approach those of the oracle EL estimators.
Finally, when the moment conditions in (37) are , the densities of the EL shrinkage estimators of are almost the same as those of the oracle estimators and the EL estimators. This is because the moment conditions in set-2 only contain the information about . Hence, when could be reliably estimated using the set-1 moment conditions, the extra valid moment conditions in set-2 do not help to reduce the variances of the estimators of .
5.2. Real Data Example
Example 2. We apply the EL shrinkage method to study the following labor supply equation in the life-cycle labor supply model [6, 10, 11]:
where is the annual hours working for money, is the hourly wage rate of individual at period , is a time varying constant, is the time varying error term, and measures the intertemporal substitution elasticity of labor supply with respect to the evolutionary wage changes and the theoretical prediction for its sign is positive.
Due to the measurement errors in , the OLS estimator of (39) may be inconsistent. MaCurdy  proposes to use a set of family background variables to construct the set-1 moment conditions; we only use the parents’ economic status as the credibly valid IV and include the rest of them into set-2. We also include the alternative measure of wage and the wage itself into set-2. Our sample is constructed from the Michigan Panel Study of Income Dynamics (PSID) dataset from year 1970 to year 1981.
We next apply the EL shrinkage estimation to the labor supply equation (39). The estimators of the moment selection coefficients are included in Table 3. As a comparison, we also include the EL estimators of the moment selection coefficients under different specifications of in Table 3. In the first two rows of Table 3, the constant term in (39) is treated as time varying, while in the last two rows, is taken to be a time invariant constant. From Table 3, we see that the EL estimators of the moment selection coefficients are nonzero and it is hard to determine which moment conditions are valid (misspecified) based on these estimators. On the other hand, the EL shrinkage estimation gives the same moment selection result under the different specifications of . The moment conditions constructed from the IVs by MaCurdy and Altonji are picked up by our shrinkage method, while the moment condition constructed using the imputed wage is not selected, which implies that is an endogenous variable in the labor supply equation (39).
The results of the EL shrinkage estimation of the labor supply equation (39) are contained in Table 4. As a comparison, we also include the EL estimators of based on the moment condition in set-1 and the postmoment selection EL (PEL) estimators of in Table 4. Columns (1)-(2) of Table 4 present the EL estimators of based on the following IV: parent’s economic status when individual was young, which provides the moment condition in set-1. Compared with other estimators in Table 4, the EL estimators in columns (1)-(2) are larger in magnitude and have larger standard errors. On the other hand, the EL shrinkage estimators in columns (3)-(4) have much smaller standard errors, because some moment conditions in set-2 are selected and automatically included into estimation by the EL shrinkage method.
From Table 4, we see that, compared with the EL estimators, the EL shrinkage estimators of are closer to zero, which implies that part of the information in the moment condition constructed by is indeed used in the EL shrinkage estimation. Based on the above reasoning, we can deduce that the shrinkage effect of the penalty function on the estimators of may introduce some bias to the estimator of . To get rid of this bias, we conduct another EL estimation based on the moment condition in set-1 and the moment conditions in set-2 selected by our method. These PEL estimators are included in columns (5)-(6) of Table 4. We can see that the PEL estimators are slightly larger in magnitude than the EL shrinkage estimators and their standard errors are almost the same.
Remark 3. We now explain the notes to Table 3.
(1) Standard errors are in parentheses and .
(2) EL (GMM) estimation with the time dummy variables.
(3) EL (GMM) estimation without time dummy variables.
(4) EL (GMM) shrinkage estimation with time dummy variables, where the penalty function is the SCAD and the tuning parameter equals 0.000374 (selected by EmAIC (GMM-AIC), EmBIC (GMM-BIC), and EmHQIC (GMM-HQ)).
(5) EL (GMM) shrinkage estimation without time dummy variables, where the penalty function is the SCAD and the tuning parameter equals 0.000948 (selected by EmAIC (GMM-AIC), EmBIC (GMM-BIC), and EmHQIC (GMM-HQ)).
Remark 4. We now explain the notes to Table 4.
(1) Standard errors are in parentheses and sample size .
(2) refers to the set of time dummy variables for the years from 1971 to 1981.
(3) EL (GMM) is the EL (GMM) estimation only using the moment conditions in set-1.
(4) SEL (SGMM) denotes the EL (GMM) shrinkage estimation based on the SCAD penalty. In column (3) the tuning parameter equals 0.000949 and in column (4) the tuning parameter equals 0.000374. EmAIC (GMM-AIC), EmBIC (GMM-BIC), and EmHQIC (GMM-HQ) produce the same value of the tuning parameter in each case.
(5) PEL (P-GMM) denotes the EL (GMM) estimation based on the moment conditions selected by the EL (GMM) shrinkage estimation. The results in columns (5) and (6) are based on the moment conditions selected in (3) and (4), respectively.
Proofs. We first introduce some notations and definitions. Let and . Let denote the empirical process indexed by some function . Suppose that is a sequence of random vectors; then for a given sequence of nonnegative constants , we write to mean that, for any constant , there is a finite constant such that eventually; we write to mean that, for any constants , there is eventually.
In this appendix, we prove two lemmas which are useful for deriving the asymptotic properties of the EL shrinkage estimator. Define where .
Lemma 5. Under Assumption 2(iii), one has for all , where
Proof. By Assumption 2(iii), we deduce that for all , which implies that Note that Assumption 2(iii) also implies for all , which implies that From the inequalities in (A.6) and (A.8), we immediately get the claimed results in (A.3).
Lemma 6. Under Assumptions 2(iii) and 3(ii)-(iii), one has for all in shrinking neighborhoods of , where , are generic positive finite constants.
Proof. Denote First note that by Assumption 3(ii) where lies between and and is identity matrix. As is in the shrinking neighborhood of and , is continuous in ; we deduce that By (A.11), (A.13), and Cauchy-Schwarz inequality, we have Using Assumption 2(iii), the result in (A.14), and Cauchy-Schwarz inequality, we get As has full column rank and is strictly positive definite, has full rank and is strictly positive definite. Let and denote the smallest and largest eigenvalues of ; then by Assumptions 2(iii) and 3(iii), we have which together with (A.15) implies that The right inequality in (A.9) is implied by (A.17). The left inequality in (A.9) can be similarly derived. This finishes the proof.
Lemma 7. Under Assumption 1, for any with and as , for all , and for all .
Proof. Following Owen, Assumption 1(iv) implies that
Lemma 8. Let ; one has where .
The detailed proof of this lemma can be found in Hjort et al., so we omit it here.
Proof of Theorem 4. (a) By the definition of , one has
Applying Lemma 7, one can deduce that
Applying Lemma 5 and Assumption 2(iv), one can deduce from (A.1) that
where is defined in Lemma 5.
From Assumption 2(ii) and the definition of , one gets By the triangle inequality, the ULLN in (A.24), and Assumptions 2(iii)-(iv), one has From Assumption 2(iii) and the results in (A.23) and (A.25), one can deduce that Now, the first result in (A.26) and Assumption 1(i) imply that . From the second result in (A.26), the triangle inequality, consistency of , and Assumption 2(i), one has which implies that . So one gets the result that as and with probability tending to 1, the EL shrinkage estimator defined in (8) satisfies .
(b) Using the inequalities in (A.3) and (A.22) and Lemma 7, one gets where is defined in Lemma 5. By Assumptions 2(iv) and 3(iv) and the inequality in (A.28), one has Next, by Assumption 3(iv), Taylor expansion, the triangle inequality, and Cauchy-Schwarz inequality, one gets where lies between and for . From Lemma 6, one obtains where is a finite constant. The inequality in (A.31), together with Assumption 3(iv), and the inequalities in (A.29) and (A.30) imply that By Assumption 3(i), one has . When the sample size is large enough, by definition, the probability that is strictly smaller than any given small number , together with the inequality in (A.32), implies that where . By definition, . Hence, one can choose some large enough number such that This and the results in (A.33) immediately imply that eventually, which gives us .
Proof of Theorem 6. We know that
On the event for some , we have the following KKT optimality condition:
where denotes the th row of the weight matrix .
By Assumption 3(ii) and the consistency of , there is where is defined in (A.11) and and lie between and . From Assumption 3(i), we have . By Theorem 4 and Assumption 5(i), we have . By the triangle inequality, Assumption 3(ii), and the consistency of , we deduce that Hence, we have which combined with Assumption 2(iii) implies that By Theorem 4 and Assumption 5(i), we have for all . Hence, conditional on the event for some , we can invoke Assumption 5(ii) to deduce that for . Now, using (A.38), (A.41), and (A.42), we deduce that as for any .
Proof of Theorem 8. Let and denote the potentially valid and misspecified moment functions in set-2, respectively. We define
where denotes a identity matrix. If we define
then under Assumptions 2(iii) and 3(iii), we know that is a nonsingular matrix.
Recall that and accordingly . For any compact subset in , we denote any element as , where are the first elements in and are the last elements in . Denote where . From Theorem 6, we know that w.p.a.1. Thus, is the minimizer of w.p.a.1.
If we define then by Assumption 3(i) we know that is a Donsker class. As is compact, there exists some constant , such that . Now we can use Lemma 2.17 in Pakes and Pollard to deduce that uniformly over .
By Assumption 3(ii) and the compactness of , we have uniformly over . Thus, (A.47) and (A.48) imply that, uniformly over , there is Now, we can use the result in (A.49), Assumptions 3(i) and 7(i), and the compactness of to deduce that uniformly over . If , then by Assumptions 3(iv) and 5(i) and the compactness of we have uniformly over .
Using Assumption 7(ii), the results in (A.50), and (A.51), we get