#### Abstract

We study the identification and estimation of graphical models with nonignorable nonresponse. An observable variable correlated to nonresponse is added to identify the mean of response for the unidentifiable model. An approach to estimating the marginal mean of response is proposed, based on simulation imputation methods which are introduced for a variety of models including linear, generalized linear, and monotone nonlinear models. The proposed mean estimators are -consistent, where *N* is the sample size. Finite sample simulations confirm the effectiveness of the proposed method. Sensitivity analysis for the untestable assumption on our augmented model is also conducted. A real data example is employed to illustrate the use of the proposed methodology.

#### 1. Introduction

The problem of missing data in practice has attracted much attention for decades. Rubin [1] gave the weakest general conditions under which the nonresponse is ignorable; that is, ignoring the process that causes missing can still result in correct inference when both missing and observing are at random. Alternatively, the nonresponse is said to be nonignorable (Little and Rubin [2]) if it depends on the value of the possibly unobserved outcome. Groves, Presser, and Dipko [3] illustrated that if the missing mechanism is not ignorable, then a complete-case analysis which excludes missing data could result in highly biased estimates. For more discussions about nonresponse bias, one can refer to the work of Little and Rubin [2], Ibrahim and Lipsitz [4], and Goves [5], among others.

To address the problems of nonignorable nonresponse, weighting adjustments, which adjust the estimates by rescaling each unit’s sample weight proportionally to the inverse of its response probability, are frequently used. The methods used for adjusting include poststratification by Holt and Smith [6], Calibration by Kott [7], and raking-ratio estimation by Deville, Särndal, and Sautory [8]. Weighting adjustments are model-based approaches, that is, the population values are treated as realizations of random variables that are distributed according to a superpopulation [9], and auxiliary information is incorporated into various models to describe nonrespondent behavior with respect to the variables of target. For instance, Greenlees, Reece, and Zieschang [10] conducted linear regression to analyze the unobserved income data assuming that nonresponse income depends on the unobserved value. Fay [11] and Baker and Laird [12] provided a family of estimable hierarchical log-linear models for the joint distribution of the data and the response indicator.

As pointed out by Little [13], the fully parametric approach is sensitive to failure of the assumed parametric model. Based on the exponential tilting model, Kim and Yu [14] proposed a semiparametric estimation method of mean functionals with nonignorable missing data. Riddles, Kim and Im [15] presented an approach of maximum likelihood estimation that uses parametric model assumptions about the variable of interest among the respondents only. Zhao, Tang, Qu, and Jiang [16] considered the parametric propensity model and studied semiparametric estimating equations inference by the nonparametric imputation method. Guo, Ma, and Wang [17] generalized the propensity model to semiparametric form and investigated the estimation of the parametric copula model. There are also some works using graphic models to depict nonresponse mechanisms in the literature on the modeling approach. For examples, Ma, Geng, and Hu [18] used graphic models with temporal structure to describe nonresponse mechanisms for binary income in a longitudinal study. Wang, Chen, Geng, and Zhou [19] extended the work of Ma, Geng, and Hu [18] to derive the maximum likelihood estimator for the parameter of the binomial proportion and its associated variance. More existing works along this line can be found in the work of Little [13], Fay [11], and Forster and Smith [20].

In this paper, we present a new method to tackle the problem of nonignore nonresponse. A completely observable variable correlated to *Y* is added to identify the mean of the response *Y*. Adjustments are applied by a fixing response approach which divides the population into two strata: one consists of respondents and the other of nonrespondents [21]. Then, an approach to estimating the marginal mean of response *Y* is proposed based on simulation imputation methods under linear, generalized linear, and monotone nonlinear models, respectively. It is shown that the proposed estimators are -consistent, where *N* is the sample size. The effectiveness of the method is demonstrated via simulations. Application to real data of the method is also considered. Simulations show that our new method is successful in modeling the mean of response *Y*. Since the assumptions on the models are untestable, we propose to assess sensitivity to the assumption of our methods.

The outline of this paper is as follows. In Section 2, we introduce the graphical models. Section 3 develops the simulation imputation methods for linear, generalized linear, and monotone nonlinear models. In Section 4, the generalized linear model is extended to cope with other covariates. Section 5 introduces empirical standard errors to monitor the accuracy of bootstrap imputations. A simulation study is conducted in Section 6. Sensitivity analysis is given in Section 7. Section 8 illustrates an application to the mental health dataset. Proofs are given in Appendix.

#### 2. Graphical Models

The response variable is denoted by *Y*. We assume that are independent and identically distributed response variables for *N* subjects in the study and are the respective status indexes for the responses, where *R*_{i} = 1 or 0, which depends on *Y*_{i} response or nonresponse, respectively. Identifying the marginal mean *E*(*Y*) of *Y* is an important problem in itself, and the treatment effect can be obtained once the marginal mean is available. However, if one uses only observed *Y*_{i}’s with *R*_{i} = 1 to estimate *E*(*Y*), it may result in a large biased estimator.

*Example 1. *We consider the mixture model with *Y*_{i} *=* *Y*_{i1}*R*_{i} + *Y*_{i2}(1 − *R*_{i}), where *Y*_{i1} ∼ *N*(0*,* 1), *Y*_{i2} ∼ *N*(−4*,* 1), and *R*_{i} is a binary variable independent of *Y*_{i1} and *Y*_{i2} and satisfies that *P*(*R*_{i} *=* 1) = 1 − *P*(*R*_{i} *=* 0) = 0*.*5, for *i* *=* 1*, . . ., n*. Then, the marginal mean of *Y* is *E*(*Y*) = −2, and the mean of observed *Y* is 0. Thus, if one ignores those nonresponses with *R*_{i} *=* 0, then the bias is 2.

To deal with this problem, we introduce the following three graphical models:

The graphical models have the following statistical meanings:(i)Model (a): *,* which means that *Y*_{i} is uncorrelated to *R*_{i}.(ii)Model (b): ; that is, *Y*_{i} is correlated to *R*_{i}.(iii)Model (c) (model (c) is untestable; i.e., one cannot test if *E*(*Z*_{i}*Y*_{i}*, R*_{i}) = *E*(*Z*_{i}*Y*_{i}) holds. The assumption is proposed based on the information from specific experts. We will perform sensitivity analysis on the assumption for our results): *,* which shows that *Z*_{i} is uncorrelated to *R*_{i} conditional on *Y*_{i}*.*The marginal mean of the response *Y* is identifiable for model (a), but not for model (b), since one observes only those values of response variable with *R*_{i} = 1 (the response group). To solve this problem, we now introduce a surrogate *Z* of Y, which is completely observable in model (c), to identify the marginal mean of *Y*. We call model (c) the augmented model of model (b), which is identifiable for the marginal mean under certain conditions. The motivation for adding such a completely observed variable *Z* is from a study conducted by us for the income of the inhabitants in an area of Beijing. Since the income variable *Y* is relevant to private matters, it is subject to informative missing. Neglecting of the missing values may result in very biased estimation for the income levels in the area. This motivates us to add a completely observed variable *Z*, the size of houses that every inhabitant possesses. Note that, conditional on the income variable *Y*, the missing mechanism can be regarded as uncorrelated to the housing variable *Z* so that the model (c) holds in this example.

In the following, we consider the identification of model (c) and propose a method to estimate the marginal mean of *Y*_{i}.

#### 3. Simulation Imputation

Since the nonresponse subjects are nonignorable, biased estimation may be resulted if only the subjects in the response group with *R*_{i} = 1 are considered. It is necessary to use the information from the nonresponse so that a consistent estimator of the marginal mean of *Y* can be obtained. To this end, we introduce an approach to achieving this objective. Our idea is to impute the nonresponse by simulation.

##### 3.1. Linear Models

Suppose that *Z*_{i} and *Y*_{i} satisfy the following linear model:

If *β* = 0, then model (c) is equivalent to the following model (d).

Since model (c) includes model (d), using a standard argument in hypothesis testing for general linear models, we can test if *H*_{0}: *β* = 0 holds via constructing testing statistic, *T* (*Z, Y*), say, based on the estimator of *β* from model (c). If (*d*) holds, then *E*(*Y*_{i}) is unidentifiable, and one must find a variable *Z* correlated with *Y* to solve this problem. If *β* = 0, then it follows from (1) that

Thus, identifying *E*(*Y*_{i}) is equivalent to estimating *α* and *β* because *E*(*Z*_{i}) is estimatable. From model (c), we know that then, regression of *Z*_{i} on *Y*_{i} for those observed *Y*_{i} with *R*_{i} = 1, we get the -consistent estimators of *α* and *β* (for example, the simple least squares estimators). We denote them by and, respectively. Let the estimator of *E*(*Z*_{i}) be . Then, the marginal mean of *Y* can be estimated as

The abovementioned estimation method is useful for the linear model, but it is inflexible for other models, for example, the generalized linear model and nonlinear model considered later. We here introduce another flexible estimating method, which is based on bootstrap. It uses the fact

To estimate *E*(*Y*), we need to estimate *E*(*Y* |*R* = 1) and *E*(*Y* |*R* = 0), where the former can be estimated by the average of *Y*_{i} in the response group and the latter can be estimated by the average of imputations of *Y*_{i} for the nonresponse group. This method is referred to as “*simulation imputation*,” which is detailed in the following:(i)Fitting model (1) with a data subset: the subjects are partitioned into two groups, one is of *R*_{i} = 1 (the response group), the other *R*_{i} = 0 (the nonresponse group). Without loss of generality, assume *R*_{i} = 1 for *i* = 1, . . .*, N*_{1}, and 0 others. Based on those subjects with responses observed (i.e., *i* = 1*, . . ., N*_{1}), we regress{*Z*_{i}} on {*Y*_{i}} using model (1) and obtain the estimators of *α* and *β*, denoted by and , respectively.(ii)Bootstrap residuals (The bootstrap based on residuals was proposed by Efron [22]. Because ’s are of mean zero, the empirical distribution function of the residuals is centered at 0): the residuals are denoted by for *i* = 1*, . . , N*_{1}. Let . We resample *M* samples with replacement, each with size *N*_{2} = *N* − *N*_{1}, from the empirical distribution function of the centered residuals{− *ε*¯*, i* = 1, *N*_{1}}, and denote the *M* samples by(iii)Prediction of *Y*_{j} in the nonresponse group: let , for . We predict as the average of over *m* = 1, *. . .,M.* The average is denoted by **.**(iv)Estimator of the marginal mean of *Y*:

Theoretical justification of the abovementioned “*simulation imputation*” method can be built by using standard bootstrap theory. Especially, we have the following consistency result for the estimator , which is proved in Appendix.

Theorem 1. *If model (1) is correct and β = 0, then the estimator in (4) for the marginal mean of is -consistent.*

##### 3.2. Logistic Regression Models

As an alternative to model (1), the following generalized linear model is used to model the relation between binary *Z*_{i} and *Y*_{i}:where *>* 0 is an unknown dispersion parameter, *η*_{i} = *η*(*Y*_{i}) = *α* + *βY*_{i} is a linear predictor, *µ*(*η*) = *e*^{η}*/*(1 + *e*^{η}) is a known differentiable function with derivative *,* and is the logit link defined as (see the work of Nelder and Wedderburn [23] and McCullagh and Nelder [24]):

In particular, when *Z*_{i} is binary with values 1 and 0, *E*(*Z*_{i}|*Y*_{i}) = *P* (*Z*_{i} = 1|*Y*_{i}), and one reasonable choice for (·) is the logit link with (*p*) = log{*p/*(1 − *p*)}. From model (7), we obtain thatif *β* is not zero. Note that by the conditional uncorrelation in model (c), we have *E*(*Z*_{i}|*Y*_{i}) = *E*(*Z*_{i}|*Y*_{i}*, R*_{i} = 1) and *V ar*(*Z*_{i}|*Y*_{i}) = *V ar*(*Z*_{i}|*Y*_{i}*, R*_{i} = 1). Therefore, model (7) can be estimated by those data in the response group. Let () be such estimator**s o**f (*α, β*) for model (7). Then, *µ*(*η*(*y*)) ≡ *E*(*Z*|*Y* = *y*) is nonparametrically estimable via regressing *Z*_{i} on *Y*_{i} for those observed *Y*s with *R* = 1. We denote the resulting nonparametric estimator by , for example, using the local linear smoothing in the work of Fan and Gijbels [25] and Jiang [26]. Then, by (6), we estimate *E*(*Y*) by

However, the estimator may suffer from substantial loss of efficiency since it uses only a portion of sample. Furthermore, in finite sample settings, the bias of nonparametric estimator may yield a highly biased estimate of the marginal mean of *Y*, since it can be magnified by the nonlinear link function. However, this approach can be used to compare with the following estimation method and justify the appropriateness of the parametric form of *µ*(·) in (7) for modeling real data.

Note that ; it follows that we can model the missing mechanism through

We here extend the previous “*simulation imputation*” approach to model (7), from which one can develop imputation of nonresponse *Y*_{i} from equation (9). Let . Then, ’s are white noises of mean zero and variance and .

Specifically, it proceeds as follows:(i)Fitting model (7) with a data subset: based on those observations from the response group, we regress {*Z*_{i}} on {*Y*_{i}} using model (7) and obtain the estimators of (*α, β*), the fitted values with , and Pearson’s residuals.(ii)Bootstrap residuals: we resample *M* samples, each with size *N*2 = N –N1, from the empirical distribution function of centered Pearson’s residuals (where *ε*¯ is the average of ’s), and denote the *M* samples by.(iii)Prediction of *Y*_{j} in the nonresponse group: , and we find *η*_{j} such that or equivalently Then, if and if . The solution is denoted by: The compute , and we use the average as prediction of .(iv)Estimator of the marginal mean of *Y*:

The abovementioned bootstrap Pearson’s residuals procedure in (i) and (ii) is standard in the generalized linear models (see, for example, page 341 of the work of Shao and Tu [27]). The “*simulation imputation*” method uses bootstrap data from Pearson’s residuals to yield predicted values for the nonresponses.

Theorem 2. *If model (7) is correct and , then the result in Theorem 1 continues to hold.*

##### 3.3. Monotone Nonlinear Models

We consider the following monotone nonlinear relation between *Z*_{i} and *Y*_{i}:where *f* (·) is a known monotone nonlinear function and *E*(*ε*_{i}|*Y*_{i}) = 0. By reversing the relation in (16), we get *Y*_{i} = *f*^{−1}(*Z*_{i} − *ε*_{i})*.* Using a simulation imputation method similar to that in Section 3.1, we can obtain the estimator of *E*(*Y*_{i}). This model requires one to find a surrogate *Z* for *Y* such that *E*(*Z*|*Y*) is monotone.

Up to now, one may wonder if the models in (1), (7), and (16) are appropriate in practice. This involves in model selection and diagnostic analysis. Traditional model diagnostic tools are useful for this problem. The problem can also be addressed by nonparametrically modeling those responses with *R*_{i} = 1 in the exploration analysis stage if a moderate sample is available, so that one can test if some parametric model holds, based on a nonparametric testing statistics such as the generalized likelihood ratio test in the work of Fan, Zhang, and Zhang [28], Fan and Jiang [29], and Fan and Jiang [30].

#### 4. Extension

Previous results cannot cope well with the cases in the presence of other covariates. We here extend them in the framework of the generalized linear models. The extension to other models can be similarly made. In parallel with model (7), we consider the following model with completely observed covariates *X* of dimension *d*:where *η*_{i} = *α*+*βY*_{i} + *γ*^{T}*X*_{i} is a linear predictor, is a canonical link function, and *µ*(·) is a known differentiable function with derivative (·) > 0*.*

As in Section 3.2, the same “simulation imputation” approach can be used to estimate the mean of *Y*, but with in Section 3.2 replaced by in the present setting. This approach will be readdressed in our real example.

#### 5. Empirical Standard Errors

The empirical standard error (*ESE*) is used to monitor the accuracy of convergence. For the *simulation imputation* methods mentioned above, the sampling number *M* is generally required to be large enough to ensure the prediction for the nonresponses with accuracy in a reasonable range. For a specific application, what should *M* be? Naturally, a good choice of *M* should yield an accurate predicted value of the nonresponse. This motivates us to use the *ESE*s of the simulation imputations to assess the accuracy of the predicted value of the nonresponse. In our real data analysis, *M* is taken as 1000, and the *ESE*s of the imputations are reported.

#### 6. Simulations

To investigate the performance of our procedure, we compare our estimators with the observed average of the responses in the following three models. The number of simulations is 600 and the number of bootstrapping samples for each simulation is taken as *M* = 1000*.* The deviation, , of each estimator from the true mean in the 600 simulations is computed and displayed via box plots, which depict the distributions of the deviations. The more the deviation concentrates at 0, the better the corresponding estimator is.

*Example 2. *We consider a linear model with *P* (*R*_{i} *=* 1) = 0*.*5, *Y*_{i1} ∼ *N*(0*,* 2^{2}), *Y*_{i2} ∼ *N*(1*,* 2^{2}), *Y*_{i} *=* *Y*_{i1}*R*_{i} + *Y*_{i2}(1 − *R*_{i}), *Z*_{i} *=* *α* + *βY*_{i} + *ε*_{i}, and *ε*_{i} ∼ *N*(0*,* 0*.*5^{2}). Sample size is *N* *=* 100, where *α* *=* 1 and *β* *=* 0*.*5 Thus, the true mean *E*(*Y*) equals 0*.*5.

*Example 3. *Let *P* (*R*_{i} *=* 1) = 0*.*5, *Y*_{i1} ∼ *N*(0*,* 2^{2}), *Y*_{t2} ∼ *N*(−4*,* 2^{2}), and *Y*_{i} *=* *Y*_{i1}*R*_{i} + *Y*_{i2}(1 − *R*_{i}),*i* *=* 1, · · ·, 100*,* which is mixed normal. Then, *E*(*Y*) = −2. We generate binary *Z*_{i} with outcomes 0 and 1 from the logistic model.where *α* = 3 and *β* = 1.

*Example 4. *Consider the monotone nonlinear model with *f* (*y*) = *a* + *y/*(*b* + *y*), where *a* *=* 0*.*5 and *b* *=* 1. For *i* *=* 1*,* 100, let *P*(*R*_{i} *=* 1) = 0*.*5, *Y*_{i1} ∼ *N*(0*,* 1), *Y*_{i2} ∼ *N*(2*,* 1), and *Y*_{i} *=* *Y*_{i1}*R*_{i} + *Y*_{i2}(1 − *R*_{i}), which is mixed normal. The marginal mean of *Y* is *E*(*Y*) = 1*.*

The boxplots of the deviations among 600 simulations for Examples 1–3 are reported in Figures 1(a)–1(c), respectively. The abovementioned three examples show that the proposed estimator is consistent, but the observed average of the response (with *R*_{i} = 1) is very biased because it ignores the information from the nonresponse subjects.

**(a)**

**(b)**

**(c)**

**(d)**

#### 7. Sensitivity Analysis

For identification of model (c), we have assumed *R*_{i} is uncorrelated to *Z*_{i} conditional on *Y*_{i}. This assumption is untestable and made with knowledge of the specific experts. Naturally, one may ask if our method exhibits robustness to some extent against the assumption. This motivates us to assess the sensitivity of our estimators to the assumption.

*Example 5. *To assess the sensitivity of our estimators to the assumption on uncorrelation of *Z*_{i} with *R*_{i} conditional on *Y*_{i}, we set samples size *N* *=* 100, *P*(*R*_{i} *=* 1) = 0*.*5, *Y*_{i1} ∼ *N*(0*,* 1), *Y*_{i2} ∼ *N*(3*,* 1), *Y*_{i} *=* *Y*_{i1}*R*_{i} + *Y*_{i2}(1 − *R*_{i}), *ε*_{i} *=* 0*.*5 *N*(0*,* 1), and *Z*_{i} *=* 1 + 0*.*5*Y*_{i} − *S ∗ R*_{i} + *ε*_{i}. We consider different values of *S* at grid points on the interval [−0*.*2*,* 0*.*2]. For each given *S*, the deviations of the estimators among 600 simulations were computed. The median of the deviations for each disturbing magnitude *S* is reported in Figure 1(d). When *S* increases, the conditional correlation gets stronger. Our estimator seems to perform reasonably well against the appropriate departure from the conditional uncorrelation, but the sample average of observed response does not.

#### 8. Real Data Analysis

We now use the mental health dataset to demonstrate how the proposed procedure works in a typical application.

This dataset is from a study of mental health of children in Connecticut. It was previously analyzed by Ibrahim, Lipsitz, and Horton (2001). There are totally 2486 subjects in study and six related variables:

*Father*: parental status of the household (father figure present = 0; no father figure present = 1).

*health*: physical health of the child (no health problem = 0; fair or poor health = 1). *t*_{rept}: teacher’s report of the psychopathology of the child (normal = 0; abnormal = 1; and missing = .); *p*_{rept}: parents’ report of the psychopathology of the child (normal = 0; abnormal = 1; and missing = .); *counts*: # of observations. *pctage*: percentage(=count/total sample size; total sample size = 2486).

Of the six variables, we choose the first five variables for analysis, since the 6th variable *pctage* is uniquely determinated by the 5th variable *counts*. The outcome of main interest, “*t*_{rept},” is missing for 1061(42*.*7%) subjects, but the variable “*p*_{rept},” which relates to “*t*_{rept}” and can serve as a surrogate for it, is observed for all subjects. Note that, as discussed in the abovementioned paper, once conditional on the surrogate “*p*_{rept},” the missing mechanism can be regarded as independent of the outcome “*t*_{rept},” although unconditionally the missing mechanism “*R*” (equals 0 for the missing and 1 for the observed) seems to depend on the outcome. Then, the model (c) seems reasonable to depict the relation among *t*_{rept}, *p*_{rept}, and *R*.

Our interest is to estimate the marginal mean of the response variable “*t*_{rept}.” We use model (5) and the estimation approach in Section 3.2 and set *Z* = *p*_{rept}, *Y* = *t*_{rept}, and the number of bootstrap sampling *M* = 1000*.* The estimated mean, variance of the response *t*_{rept}, and the average of the *ESE*s of the imputations *Y*_{j}^{∗} for the nonresponses are reported in Table 1. By incorporating the information on the first two covariates, *father* and *health*, we get the estimated mean and variance of the response from model (10), which can be found in third column of Table 1.

However, if only using the observations of *t*_{rept}, its marginal mean is computed to be 0*.*1832 with variance 0*.*1497. Obviously, our estimator not only rectifies the value of the marginal mean, but is more efficient than the simple average of the observed values of *Y*. The small values of the *ESE*s reflect high accuracy of the simulation imputations.

#### Appendix

#### Proofs of Theorems

*Proof. *of Theorem 1. The empirical distribution function for *i* *=* 1*,* · · ·*, N*_{1} is denoted by *F*_{N1} (*ε*). Since *ε*_{j}^{∗(m)}’s are drawn from *F*_{N1} (*ε*), conditional on the original sample points with *R*_{i} *=* 1, *ε*_{j}^{∗(m)} ∼ *F*_{N1} (*ε*) for *j* *=* *N*_{1} + 1*, …, N*_{1} + *N*_{2}. Hence, , since . Let . Then,Note thatUsing the standard argument for bootstrap consistency, it can be shown thatBy calculating the mean and variance, it is easy to see thatIt follows thatThis, combined with (4), yields consistency of .

*Proof. *of Theorem 2. The result follows from a similar argument as Theorem 1.

#### Data Availability

The data are public and available in the reference. They are also available upon request via emailing to the first author.

#### Conflicts of Interest

The authors declare no conflicts of interest.