Advances in Statistics


Research Article | Open Access

Volume 2014 |Article ID 504325 | https://doi.org/10.1155/2014/504325

Georges Nguefack-Tsague, Ingo Bulla, "A Focused Bayesian Information Criterion", Advances in Statistics, vol. 2014, Article ID 504325, 8 pages, 2014. https://doi.org/10.1155/2014/504325

A Focused Bayesian Information Criterion

Academic Editor: Vito M. R. Muggeo
Received 31 May 2014
Revised 18 Sep 2014
Accepted 25 Sep 2014
Published 14 Oct 2014

Abstract

A myriad of model selection criteria (Bayesian and frequentist) have been proposed in the literature, aiming at selecting a single model regardless of its intended use. A notable exception in the frequentist perspective is the “focused information criterion” (FIC), which aims at selecting a model based on the parameter of interest (focus). This paper takes the same view in the Bayesian context; that is, a model may be good for one estimand but bad for another. The proposed method exploits the Bayesian model averaging (BMA) machinery to obtain a new criterion, the focused Bayesian model averaging (FoBMA), for which the best model is the one whose estimate is closest to the BMA estimate. In particular, for two models, this criterion reduces to the classical Bayesian model selection scheme of choosing the model with the highest posterior probability. The new method is applied in linear regression, logistic regression, and survival analysis. This criterion is especially important in epidemiological studies, in which the objective is often to determine a risk factor (focus) for a disease while adjusting for potential confounding factors.

1. Introduction

A variety of model selection criteria (Bayesian or frequentist) have been proposed in the literature; most of them aim at selecting a single model for all purposes. For an overview of model selection criteria, see the studies by Leeb and Poetscher [1] and Zucchini [2]; for inference after model selection, see the studies by Nguefack-Tsague [3], Nguefack-Tsague and Zucchini [4], Zucchini et al. [5], Behl et al. [6], and Nguefack-Tsague [7–9]. Allen [10], within the context of Mallows’ $C_p$ [11], developed a criterion that depends on a given prediction. In a frequentist approach, Claeskens and Hjort [12] developed a focused information criterion (FIC) for model selection which, unlike common model selection criteria that lead to a single model for all purposes, selects different models for different purposes. Allen’s criterion can thus be considered an early precursor of the FIC. The FIC is gaining in popularity, as evidenced by its applications in various fields and to specific models. These applications include missing response (Sun et al. [13]), energy substitution (Behl et al. [6]), economic applications (Behl et al. [14]), the Tobit model (Zhang et al. [15]), additive partial models (Zhang and Liang [16]), volatility forecasting (Brownlees and Gallo [17]), and Cox proportional hazard regression models (Hjort and Claeskens [18]). The combination of the focused information criterion with model averaging is treated in the studies by Sueishi [19] and Sun et al. [13]. More recent developments of the FIC for quantile regression can be found in the studies by Du et al. [20], Behl et al. [21], and Xu et al. [22]. The motivation for the new method is that this concept appears, to our knowledge, to be virtually unknown in Bayesian model selection; thus, there is a need to develop a Bayesian counterpart.

The present paper is organized as follows. Section 2 presents the concepts of Bayesian model averaging and model selection, while Section 3 introduces the new criterion. Section 4 provides practical examples, and Section 5 provides a discussion. The paper ends with concluding remarks.

2. Bayesian Model Selection and Model Averaging

2.1. Framework

Consider a situation in which some quantity of interest, $\Delta$, is to be estimated from a sample of observations that can be regarded as realizations from some unknown probability distribution and in which, in order to do so, it is necessary to specify a model for the distribution. There are usually many alternative plausible models available and, in general, they each lead to different estimates of $\Delta$.

Consider a sample of data, $y$, and a set of $K$ models $\mathcal{M} = \{M_1, \ldots, M_K\}$, which we will assume to contain the true model $M_t$. Each $M_k$ consists of a family of distributions $\{p(y \mid \theta_k, M_k)\}$, where $\theta_k$ represents a parameter (or vector of parameters).

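The prior probability that $M_k$ is the true model is denoted by $P(M_k)$ and the prior distribution of the parameters of $M_k$ (given that $M_k$ is true) by $p(\theta_k \mid M_k)$. Conditioning on the data $y$ and integrating out the parameter $\theta_k$, one obtains the following posterior model probabilities:
$$P(M_k \mid y) = \frac{p(y \mid M_k)\, P(M_k)}{\sum_{j=1}^{K} p(y \mid M_j)\, P(M_j)}, \tag{1}$$
where
$$p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k \tag{2}$$
is the integrated likelihood under $M_k$. If $p(\theta_k \mid M_k)$ is a discrete distribution, the integral in (2) is replaced by a sum.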
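As a minimal sketch of how posterior model probabilities follow from the integrated likelihoods and the model priors, the following Python function normalizes the products of the two on the log scale. The function name and the numerical values are hypothetical and serve only as an illustration; in practice the log integrated likelihoods would come from an exact computation or an approximation.

```python
import math


def posterior_model_probs(log_marginals, priors=None):
    """Posterior model probabilities from log integrated likelihoods.

    log_marginals[k] is the log integrated likelihood of model k and
    priors[k] its prior probability. A uniform prior over the models is
    assumed when priors is None.
    """
    n_models = len(log_marginals)
    if priors is None:
        priors = [1.0 / n_models] * n_models
    # Work on the log scale and subtract the maximum for numerical stability.
    log_terms = [lm + math.log(p) for lm, p in zip(log_marginals, priors)]
    shift = max(log_terms)
    unnormalized = [math.exp(t - shift) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical log integrated likelihoods for three candidate models.
probs = posterior_model_probs([-104.2, -102.9, -107.5])
```

Working with logarithms avoids numerical underflow, since integrated likelihoods are typically extremely small numbers.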

2.2. Bayesian Model Selection

Within this framework, classical Bayesian model selection involves selecting the model with the highest posterior probability. Sometimes Bayes factors are used. The Bayes factor (Kass and Raftery [23]) for model $M_k$ versus model $M_j$ is defined by
$$B_{kj} = \frac{p(y \mid M_k)}{p(y \mid M_j)}. \tag{3}$$

Model $M_k$ is chosen if $B_{kj} > 1$. Under certain assumptions and approximations (in particular the Laplace approximation) and taking all candidate models as a priori equally likely to be true, this leads to the Bayesian information criterion (BIC), also known as the Schwarz criterion [24]. More information on Bayesian model selection and applications can be found in the studies by Guan and Stephens [25], Nguefack-Tsague [26], Carvalho and Scott [27], Fridley [28], Robert [29], Liang et al. [30], and Bernardo and Smith [31].
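The link between the Bayes factor and the BIC can be illustrated with a short sketch. It assumes the common convention BIC = −2 log L + d log n (smaller is better), under which the log Bayes factor of one model against another is approximately minus half the BIC difference. The function names and the numbers are hypothetical.

```python
import math


def bic(log_lik, n_params, n_obs):
    """BIC in the convention -2 log L + d log n (smaller is better)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)


def approx_log_bayes_factor(bic_k, bic_j):
    """Approximate log Bayes factor of model k versus model j from their BICs."""
    return -0.5 * (bic_k - bic_j)


# Hypothetical maximized log-likelihoods for two models fitted to 50 observations.
bic_1 = bic(log_lik=-120.0, n_params=3, n_obs=50)
bic_2 = bic(log_lik=-118.5, n_params=5, n_obs=50)
log_b12 = approx_log_bayes_factor(bic_1, bic_2)
# A positive value favours model 1, a negative value favours model 2.
```

Here the larger model fits slightly better, but the penalty for its two extra parameters outweighs the gain, so the approximation favours the smaller model.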

2.3. Focused Information Criterion (FIC)

As one can notice, in classical Bayesian model selection, a single (selected) model is used to explain all aspects of the data, regardless of the purpose of the selection and irrespective of the inference to follow (Claeskens and Hjort [12]). Allen [10] first developed the idea of focusing on a parameter of interest, in a prediction problem in which the target is the prediction at a particular value of the regressor vector that differs from the values in the sample. In Allen’s method, the steps in the derivation of Mallows’ $C_p$ are repeated for this target; the resulting criterion depends on that particular value and is therefore an early precursor of the FIC. Geisser [32] likewise argued that, under many circumstances, prediction rather than estimation is the major inferential goal. In the FIC framework, a parameter of interest, say, $\mu$, must have a definition that makes it meaningful across competing models. The FIC methodology uses general parametric models and maximum likelihood as the estimation method within a general large-sample theory. The FIC is derived by establishing an unbiased estimate of the limiting risk of any submodel-based estimator of $\mu$. It is based on the (crucial) assumption that the true data-generating mechanism is contained in the largest parametric model considered. The candidate model with the smallest value of FIC is chosen. The FIC is thus a frequentist criterion; to our knowledge, a Bayesian equivalent has not yet been considered.

2.4. Bayesian Model Averaging

Bayesian model averaging (BMA) is used to deal with the problem of model uncertainty. A discussion on the issue of model uncertainty is given in the study by Clyde and George [33]. Let $\Delta$ be a quantity of interest depending on $y$, for example, a future observation from the same process that generated $y$. The idea is to use a weighted average of the estimates of $\Delta$ obtained using each of the alternative models, rather than the estimate obtained using any single model. More precisely, the posterior distribution of $\Delta$ is given by
$$p(\Delta \mid y) = \sum_{k=1}^{K} p(\Delta \mid y, M_k)\, P(M_k \mid y). \tag{4}$$
Note that $p(\Delta \mid y)$ is a weighted average of the posterior distributions $p(\Delta \mid y, M_k)$, where the $k$th weight, $P(M_k \mid y)$, is the posterior probability that $M_k$ is the true model.

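The posterior distribution of $\Delta$, conditioned on $M_k$ being true, is given by
$$p(\Delta \mid y, M_k) = \int p(\Delta \mid \theta_k, M_k, y)\, p(\theta_k \mid y, M_k)\, d\theta_k.$$
The posterior mean and posterior variance are given by
$$E[\Delta \mid y] = \sum_{k=1}^{K} \hat{\Delta}_k\, P(M_k \mid y),$$
$$\operatorname{Var}[\Delta \mid y] = \sum_{k=1}^{K} \left(\operatorname{Var}[\Delta \mid y, M_k] + \hat{\Delta}_k^2\right) P(M_k \mid y) - E[\Delta \mid y]^2,$$
where $\hat{\Delta}_k = E[\Delta \mid y, M_k]$. Raftery et al. [34] call this averaging scheme Bayesian model averaging. Leamer [35] and Draper [36] advocate the same idea. Madigan and Raftery [37] note that BMA provides better predictive performance than any single model if the measure of performance is Good’s [38] logarithmic scoring rule, under the posterior distribution of $\Delta$ given $y$.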
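The posterior mean and variance under BMA can be computed directly from the per-model posterior means, per-model posterior variances, and posterior model probabilities. The sketch below assumes these three lists are already available; the function name and the toy numbers are hypothetical.

```python
def bma_mean_var(means, variances, weights):
    """BMA posterior mean and variance of the quantity of interest.

    means[k] and variances[k] are the posterior mean and variance under
    model k; weights[k] is the posterior probability of model k.
    """
    bma_mean = sum(w * m for w, m in zip(weights, means))
    # Mixture second moment: sum_k w_k * (Var_k + mean_k^2).
    second_moment = sum(
        w * (v + m * m) for w, m, v in zip(weights, means, variances)
    )
    return bma_mean, second_moment - bma_mean ** 2


# Two equally probable models that disagree about the quantity of interest.
mean, var = bma_mean_var(means=[1.0, 2.0], variances=[0.25, 0.25], weights=[0.5, 0.5])
# mean is 1.5; var is 0.5, larger than either within-model variance of 0.25
```

The BMA variance exceeds the within-model variances here because it also accounts for the disagreement between the models, which is the model uncertainty that a single selected model ignores.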

Hoeting et al. [39] give an extensive framework of BMA methodology and applications for different statistical models. Various real data and simulation studies have investigated the predictive performance of BMA (Clyde [40]; Clyde and George [33]). Nguefack-Tsague [41] uses BMA in the context of estimating a multivariate mean.

2.5. Challenges

Implementing BMA is demanding, especially the computation of the integrated likelihood. Software for BMA implementation, as well as some BMA papers, can be found at “http://www.research.att.com/~volinsky/bma.html”. An R [42] package for BMA is now available for computational purposes; this package provides ways for carrying out BMA for linear regression, generalized linear models, and survival analysis using Cox proportional hazard models. For computations, Monte Carlo methods, or approximating methods, are used; thus, many BMA applications are based on the BIC, an asymptotic approximation of the log posterior odds when the prior odds are all equal.

Another problem is the selection of priors for both models and parameters. In most cases, a uniform prior is used for each model; that is, $P(M_k) = 1/K$, $k = 1, \ldots, K$. When the number of models is large, model search strategies are sometimes used to reduce the set of models, by eliminating those that seem comparatively less compatible with the data.
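The two practical devices mentioned above, BIC-based approximate posterior probabilities under a uniform model prior and the reduction of the model set, can be sketched as follows. The sketch assumes the convention in which a smaller BIC is better. The pruning rule (drop models whose weight is more than a factor of 20 below the best) is only one illustrative choice of threshold, and all names and numbers are hypothetical.

```python
import math


def bic_weights(bics):
    """Approximate posterior model probabilities from BIC values.

    Assumes a uniform prior over models and the convention that a smaller
    BIC is better, so each weight is proportional to exp(-BIC / 2).
    """
    best = min(bics)
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]


def prune_models(weights, ratio=20.0):
    """Keep models whose weight is within a factor `ratio` of the best one.

    Returns the kept indices and their renormalized weights.
    """
    top = max(weights)
    kept = [k for k, w in enumerate(weights) if w * ratio >= top]
    norm = sum(weights[k] for k in kept)
    return kept, [weights[k] / norm for k in kept]


bics = [251.7, 253.1, 256.6, 262.4]  # hypothetical BIC values
weights = bic_weights(bics)
kept, renormalized = prune_models(weights)
```

With these values the fourth model is discarded, and the remaining three weights are rescaled so that they again sum to one.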

3. Using BMA for Model Selection

The purpose of this section is to define the focused Bayesian model averaging (FoBMA).

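Definition 1. For a set of models $\mathcal{M} = \{M_1, \ldots, M_K\}$, with focus parameter $\Delta$, under the BMA framework, the selection criterion FoBMA consists of choosing the model $M_k$ whose estimate $\hat{\Delta}_k = E[\Delta \mid y, M_k]$ is closest (in terms of squared error) to the BMA estimate $\hat{\Delta}_{\mathrm{BMA}} = E[\Delta \mid y]$.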

Proposition 2. Under squared error loss and the weighted posterior distribution of $\Delta$ given in (4), FoBMA is an optimal model choice.

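Proof. Conditioning on all models, that is, under (4), the optimal choice is the model $M_k$ whose estimate $\hat{\Delta}_k$ minimizes the posterior expected squared error loss
$$E\left[(\Delta - \hat{\Delta}_k)^2 \mid y\right] = \sum_{j=1}^{K} P(M_j \mid y)\, E\left[(\Delta - \hat{\Delta}_k)^2 \mid y, M_j\right].$$
Denoting the BMA estimate by $\hat{\Delta}_{\mathrm{BMA}} = E[\Delta \mid y]$ and adding and subtracting it inside the square, this can be rearranged as (see also the study by Bernardo and Smith [31])
$$E\left[(\Delta - \hat{\Delta}_k)^2 \mid y\right] = \operatorname{Var}[\Delta \mid y] + 2\,(\hat{\Delta}_{\mathrm{BMA}} - \hat{\Delta}_k)\, E\left[\Delta - \hat{\Delta}_{\mathrm{BMA}} \mid y\right] + (\hat{\Delta}_{\mathrm{BMA}} - \hat{\Delta}_k)^2.$$
Since $E[\Delta - \hat{\Delta}_{\mathrm{BMA}} \mid y] = 0$, the cross term vanishes, and since $\operatorname{Var}[\Delta \mid y]$ does not depend on $k$, the only term that depends on $k$ is $(\hat{\Delta}_{\mathrm{BMA}} - \hat{\Delta}_k)^2$. Hence, the preferred $M_k$ is the one whose estimate $\hat{\Delta}_k$ is closest to the BMA estimate $\hat{\Delta}_{\mathrm{BMA}}$.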
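The selection rule of Definition 1 is simple to state in code. The following sketch takes the per-model estimates of the focus parameter and the posterior model probabilities as given and returns the index of the model whose estimate is closest to the BMA estimate. The function name and the numbers are hypothetical.

```python
def fobma_select(estimates, weights):
    """Index of the model whose focus estimate is closest to the BMA estimate.

    estimates[k] is the estimate of the focus parameter under model k and
    weights[k] the posterior probability of model k.
    """
    bma = sum(w * e for w, e in zip(weights, estimates))
    squared_distances = [(e - bma) ** 2 for e in estimates]
    best = min(range(len(estimates)), key=squared_distances.__getitem__)
    return best, bma


# Three hypothetical models; the first has the highest posterior probability.
estimates = [0.10, 0.45, 0.80]
weights = [0.5, 0.3, 0.2]
best, bma = fobma_select(estimates, weights)
# best is 1 (the second model) and bma is 0.345
```

In this toy example FoBMA picks the second model even though the first has the highest posterior probability, which shows that with more than two models the two rules need not agree.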

In particular, for two models, Corollary 3 shows that FoBMA reduces to the classical Bayesian model selection scheme of choosing the model with the highest posterior probability.

Corollary 3. For two models, the selected model is the one with the highest posterior probability.

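Proof. From Proposition 2, let us find the distance between the estimate of each model and the BMA estimate. Write $p_k = P(M_k \mid y)$, so that $p_1 + p_2 = 1$ and $\hat{\Delta}_{\mathrm{BMA}} = p_1 \hat{\Delta}_1 + p_2 \hat{\Delta}_2$. Consider
$$\left|\hat{\Delta}_1 - \hat{\Delta}_{\mathrm{BMA}}\right| = \left|(1 - p_1)\hat{\Delta}_1 - p_2 \hat{\Delta}_2\right| = p_2 \left|\hat{\Delta}_1 - \hat{\Delta}_2\right|.$$
Similarly,
$$\left|\hat{\Delta}_2 - \hat{\Delta}_{\mathrm{BMA}}\right| = p_1 \left|\hat{\Delta}_1 - \hat{\Delta}_2\right|.$$
Therefore, $M_1$ is selected if $p_2 |\hat{\Delta}_1 - \hat{\Delta}_2| < p_1 |\hat{\Delta}_1 - \hat{\Delta}_2|$, that is, if $P(M_2 \mid y) < P(M_1 \mid y)$. Hence, FoBMA is equivalent to selecting the model with the highest posterior probability.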
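The two-model case can also be checked numerically. The sketch below, with hypothetical estimates and posterior probabilities, returns the model whose estimate is closer to the BMA estimate; for any probability other than one half, this is the model with the higher posterior probability.

```python
def closer_to_bma(est1, est2, p1):
    """Return 1 or 2, the model whose estimate is closer to the BMA estimate.

    p1 is the posterior probability of model 1; model 2 has probability 1 - p1.
    """
    p2 = 1.0 - p1
    bma = p1 * est1 + p2 * est2
    d1 = abs(est1 - bma)  # equals p2 * |est1 - est2|
    d2 = abs(est2 - bma)  # equals p1 * |est1 - est2|
    return 1 if d1 < d2 else 2


# Hypothetical estimates 1.0 and 3.0 under several posterior probabilities.
choices = {p1: closer_to_bma(1.0, 3.0, p1) for p1 in (0.1, 0.3, 0.7, 0.9)}
```

The tie at a posterior probability of exactly one half, and the degenerate case of identical estimates, are not handled specially in this sketch.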

4. Applications

In this section, we apply the methodology to three models: linear regression, logistic regression, and survival analysis. The following three examples have been widely used in Bayesian model averaging (Raftery et al. [43]) and are available in the R packages MASS and survival. They are also used as tutorials in the R package BMA. Since these are parametric models, the focus parameter, $\Delta$, is in every case a regression coefficient ($\beta_j$). FoBMA is compared to the classical, well-known Bayesian information criterion (BIC). All computations were performed with R [42].

4.1. Linear Regression

In this subsection, FoBMA is applied to data on the effect of punishment regimes on crime rates (Ehrlich [44]), also used in the study by Raftery [45]. The data are available in the R package MASS.

4.1.1. Data Description

Criminologists are interested in the effect of punishment regimes on crime rates. This is a dataset on per capita crime rates in 47 U.S. states in 1960 originally published in the study by Ehrlich [44]. The variables have been rescaled to convenient numbers.

Table 1 describes all variables used for linear regression. The dependent variable is the rate of crimes in a particular category per head; there are 15 potential independent variables perceived to be associated with crime rates (Ehrlich [44]).


| Label | Description |
| --- | --- |
| M | Percentage of males aged 14–24 |
| So | Indicator variable for a Southern state (South = 1 and others = 0) |
| Ed | Mean years of schooling |
| Po1 | Police expenditure in 1960 |
| Po2 | Police expenditure in 1959 |
| LF | Labour force participation rate |
| M.F | Number of males per 1000 females |
| Pop | State population |
| NW | Number of nonwhites per 1000 people |
| U1 | Unemployment rate of urban males 14–24 |
| U2 | Unemployment rate of urban males 35–39 |
| GDP | Gross domestic product per head |
| Ineq | Income inequality |
| Prob | Probability of apprehension and imprisonment |
| Time | Average time served in state prisons (years) |
| y | Rate of crimes in a particular category per head of population |

4.1.2. Results

Table 2 reports the FoBMA selection for each focus alongside the classical BIC (no focus), whose selected model has −BIC = 55.91. Since there are only 15 explanatory variables, the model space ($2^{15} = 32{,}768$ models) is small enough to allow full enumeration. If the focus is the probability of imprisonment (the number of offenders imprisoned per offense known), the model selected by FoBMA is ranked 25 by BIC. If the focus is the average time served in state prisons, the selected model is ranked 18 by BIC. The other focus parameters likewise show a great discrepancy between FoBMA and the classical BIC.


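| Focus | Best model (FoBMA) | −BIC | Rank | $P(\beta \neq 0 \mid y)$ (%) | Mean | SD |
| --- | --- | --- | --- | --- | --- | --- |
| M |  | 51.6 | 34 | 97.5 | 1.4 | 0.53 |
| So |  | 52.1 | 30 | 6.3 | 0.007 | 0.04 |
| Ed |  | 52.2 | 29 | 100 | 2.13 | 0.51 |
| Po1 |  | 52.5 | 23 | 75.4 | 0.67 | 0.42 |
| Po2 |  | 51.2 | 45 | 24.6 | 0.22 | 0.40 |
| LF |  | 50.9 | 46 | 2.1 | 0.003 | 0.08 |
| M.F |  | 52.3 | 27 | 5.2 | −0.07 | 0.45 |
| Pop |  | 52.9 | 18 | 28.9 | −0.02 | 0.04 |
| NW |  | 52.5 | 22 | 89.6 | 0.09 | 0.05 |
| U1 |  | 52.3 | 28 | 11.9 | −0.04 | 0.14 |
| U2 |  | 54.7 | 3 | 83.9 | 0.27 | 0.19 |
| GDP |  | 51.4 | 37 | 33.8 | 0.20 | 0.35 |
| Ineq |  | 53.1 | 14 | 100.0 | 1.38 | 0.33 |
| Prob |  | 52.4 | 25 | 98.8 | −0.25 | 0.10 |
| Time |  | 52.9 | 18 | 44.5 | −0.13 | 0.18 |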
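Full enumeration of the model space and a per-variable summary of the kind reported in the results table can be sketched as follows. The sketch assumes that the percentage column of the table is the posterior probability that a coefficient is nonzero, obtained by summing the posterior probabilities of all models that include the variable; this reading, the function names, and the toy weights are assumptions made for illustration.

```python
from itertools import combinations


def all_subsets(variables):
    """Yield every subset of the candidate variables (2**p models in total)."""
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            yield subset


def inclusion_probabilities(models, weights, variables):
    """Posterior probability that each variable is in the model.

    models[k] is a tuple of included variables and weights[k] the posterior
    probability of model k.
    """
    return {
        v: sum(w for m, w in zip(models, weights) if v in m) for v in variables
    }


# With 15 candidate regressors the model space has 2**15 = 32768 members.
n_models = sum(1 for _ in all_subsets(range(15)))

# Toy example with two variables and hypothetical posterior model probabilities.
toy_models = list(all_subsets(["a", "b"]))  # (), ("a",), ("b",), ("a", "b")
toy_weights = [0.1, 0.2, 0.3, 0.4]
incl = inclusion_probabilities(toy_models, toy_weights, ["a", "b"])
```

For much larger numbers of regressors the enumeration becomes infeasible, which is where the model search strategies mentioned in Section 2.5 come in.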

4.2. Logistic Regression

In this subsection, FoBMA is applied to data on risk factors associated with low infant birth weight (Hosmer and Lemeshow [46]). These data were also used in the study by Raftery [45] and are available in the R package MASS.

4.2.1. Data Description

The “birthwt” data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Massachusetts, during 1986.

4.2.2. Results

Table 3 describes all variables used for logistic regression. The outcome variable is low (an indicator of birth weight (BWT) less than 2.5 kg); there are 7 potential independent variables perceived to be associated with low birth weight (Hosmer and Lemeshow [46]).


| Label | Description |
| --- | --- |
| low | Indicator of birth weight (BWT) less than 2.5 kg |
| age | Mother’s age in years |
| lwt | Mother’s weight in pounds at last menstrual period |
| race | Mother’s race (1 = white, 2 = black, and 3 = other) |
| smoke | Smoking status during pregnancy |
| ptl | Number of previous premature labours |
| ht | History of hypertension |
| ui | Presence of uterine irritability |
| ftv | Number of physician visits during the first trimester |
| bwt | Birth weight in grams |

Table 4 reports the FoBMA selections alongside the model selected by BIC (−BIC = 753). For one focus, the model selected by FoBMA is ranked 38 by BIC; for another, it is ranked 39. The other focus parameters likewise show a great discrepancy between FoBMA and BIC. The model selected when hypertension (HT) is the focus is ranked 20 under one other focus and 4, 11, 14, 23, and 5 under five further focuses; it is not the best (rank 1) model for estimating any of them. Other applications of FIC for logistic regression, in dental restoration, can be found in the study by Candolo [47].


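| Focus | Best model (FoBMA) | −BIC | Rank | $P(\beta \neq 0 \mid y)$ (%) | Mean | SD |
| --- | --- | --- | --- | --- | --- | --- |
|  |  | 749.8 | 20 | 12.6 | 0.008 | 0.024 |
|  |  | 748.6 | 38 | 68.7 | −0.012 | 0.010 |
|  |  | 748.1 | 46 | 9.6 | 0.115 | 0.390 |
|  |  | 748.2 | 44 | 9.6 | 0.092 | 0.320 |
|  |  | 748.5 | 40 | 33.2 | 0.256 | 0.43 |
|  |  | 748.5 | 40 | 35.1 | 0.62 | 0.089 |
|  |  | 748.5 | 40 | 35.1 | 0.170 | 0.61 |
|  |  | 749.3 | 29 | 35.1 | −4.91 | 520 |
|  |  | 748.1 | 46 | 65.4 | 1.167 | 1 |
|  |  | 748.5 | 39 | 33.8 | 0.310 | 0.51 |
|  |  | 748.7 | 33 | 1.5 | −0.001 | 0.024 |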

4.3. Survival Analysis

In this subsection, FoBMA is applied to the Mayo Clinic Primary Biliary Cirrhosis Data (Therneau and Grambsch [48]). It was used in the study by Volinsky et al. [49] and can be downloaded in the R package survival.

4.3.1. Data Description

The data are from the Mayo Clinic trial on primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984. A total of 424 PBC patients, referred to Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo controlled trial of the drug D-penicillamine. The first 312 cases in the dataset participated in the randomized trial and contain largely complete data. The additional 112 cases did not participate in the clinical trial but consented to have basic measurements recorded and to be followed for survival. Six of those cases were lost to follow-up shortly after diagnosis, so the data here are on an additional 106 cases as well as the 312 randomized participants.

4.3.2. Results

Table 5 describes all variables used for survival analysis. The dependent variable was the survival time while the 18 others were the independent variables.


| Label | Description |
| --- | --- |
| age | Age of patients in years |
| albumin | Serum albumin (mg/dL) |
| alk.phos | Alkaline phosphatase (U/liter) |
| ascites | Presence of ascites |
| ast | Aspartate aminotransferase, once called SGOT (U/mL) |
| bili | Serum bilirubin (mg/dL) |
| chol | Serum cholesterol (mg/dL) |
| copper | Urine copper (ug/day) |
| edema | 0 no edema, 0.5 untreated or successfully treated edema, and 1 edema despite diuretic therapy |
| hepato | Presence of hepatomegaly or enlarged liver |
| platelet | Platelet count |
| protime | Standardized blood clotting time |
| sex | Male/female |
| spiders | Blood vessel malformations in the skin |
| stage | Histologic stage of disease (which needs biopsy) |
| status | Status at endpoint, 0/1/2 for censored, transplant, or dead |
| time | Number of days between the date of registration and the earlier date of death or transplantation |
| trt | 1/2/NA for D-penicillamine, placebo, not randomized |
| trig | Triglycerides (mg/dL) |

Table 6 reports the FoBMA selections alongside the model selected by BIC. For one focus, the model selected by FoBMA is ranked 14 by BIC; for another, it is ranked 7. The other focus parameters likewise show a great discrepancy between FoBMA and BIC. If the focus is age, BIC and FoBMA select the same model.


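| Focus | Best model (FoBMA) | −BIC | Rank | $P(\beta \neq 0 \mid y)$ (%) | Mean | SD |
| --- | --- | --- | --- | --- | --- | --- |
|  |  | 174.8 | 1 | 100 | 0.031 | 0.009 |
|  |  | 168.9 | 34 | 7.8 | 0.000 | 0.000 |
|  |  | 168.9 | 24 | 1.7 | 0.003 | 0.045 |
|  |  | 170.8 | 14 | 100 | 0.78 | 0.13 |
|  |  | 170.8 | 15 | 84.7 | 0.74 | 0.43 |
|  |  | 170.7 | 16 | 2.8 | 0.006 | 0.05 |
|  |  | 168.8 | 39 | 5.4 | 0.000 | 0.000 |
|  |  | 171.0 | 13 | 78 | 2.5 | 1.7 |
|  |  | 170.4 | 18 | 5.6 | −0.018 | 0.1 |
|  |  | 169.6 | 28 | 20.9 | 0.097 | 0.23 |
|  |  | 170.0 | 27 | 1.4 | 0.000 | 0.026 |
|  |  | 169.5 | 29 | 36.6 | 0.1 | 0.16 |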