Abstract

The application of Bayesian methods is increasing in modern epidemiology. Although parametric Bayesian analysis has penetrated the population health sciences, flexible nonparametric Bayesian methods have received less attention. A goal in nonparametric Bayesian analysis is to estimate unknown functions (e.g., density or distribution functions) rather than scalar parameters (e.g., means or proportions). For instance, ROC curves are obtained from the distribution functions corresponding to continuous biomarker data taken from healthy and diseased populations. Standard parametric approaches to Bayesian analysis involve distributions with a small number of parameters, where the prior specification is relatively straightforward. In the nonparametric Bayesian case, the prior is placed on an infinite-dimensional space of all distributions, which requires special methods. A popular approach to nonparametric Bayesian analysis that involves Polya tree prior distributions is described. We provide example code to illustrate how models that contain Polya tree priors can be fit using SAS software. The methods are used to evaluate the covariate-specific accuracy of the biomarker, soluble epidermal growth factor receptor, for discerning lung cancer cases from controls using a flexible ROC regression modeling framework. The application highlights the usefulness of flexible models over a standard parametric method for estimating ROC curves.

1. Introduction

Bayesian analysis is often used to support epidemiologic research [1–7]. A contemporary area of research in the population health sciences involves the development and application of statistical methods for evaluating the accuracy of medical tests. With binary outcome data, statistical methods focus on estimating sensitivity and specificity, while with quantitative data, standard objects of interest are the receiver operating characteristic (ROC) curve and the area under the curve (AUC).

The ROC curve can be regarded as a graphical portrayal of the degree of separation between the distributions of test outcomes for “diseased” and nondiseased populations. The formula for an ROC curve depends on the sensitivity and specificity of the test at each classification threshold c. We proceed under the innocuous assumption that test outcomes tend to be higher for individuals in the diseased population. Let Y denote a continuous test outcome for a disease D, where disease status is labeled as D = 0 for disease absent and D = 1 for disease present. In general, Y can be any continuously measured classifier that varies according to a cumulative distribution function F0 for nondiseased individuals and varies according to F1 for diseased individuals. The ROC curve is a plot of true positive probability versus false positive probability across all classification thresholds, that is, a plot of the pairs (1 − F0(c), 1 − F1(c)) for all thresholds c. Since the true positive probability is 1 − F1(c) and the false positive probability is 1 − F0(c), accurate direct estimation of the ROC curve depends crucially on accurate estimation of the distribution functions F0 and F1.
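As a concrete illustration (not part of the paper; the samples below are hypothetical), the ROC curve can be traced empirically by sweeping a threshold c and plotting the pairs (1 − F0(c), 1 − F1(c)) computed from empirical CDFs:

```python
# Illustrative sketch: the ROC curve as the pairs (1 - F0(c), 1 - F1(c))
# over all thresholds c, using empirical CDFs. Sample data are hypothetical.

def ecdf(sample):
    """Return an empirical CDF function for a sample."""
    xs = sorted(sample)
    n = len(xs)
    def F(c):
        # proportion of observations <= c
        return sum(x <= c for x in xs) / n
    return F

nondiseased = [1.2, 2.3, 1.8, 2.9, 2.1, 1.5]   # hypothetical F0 sample
diseased    = [2.8, 3.6, 4.1, 3.0, 2.5, 3.9]   # hypothetical F1 sample

F0, F1 = ecdf(nondiseased), ecdf(diseased)

# Sweep thresholds over the pooled sample space.
thresholds = sorted(set(nondiseased + diseased))
roc = [(1 - F0(c), 1 - F1(c)) for c in thresholds]  # (FPF, TPF) pairs
```

Smoother, model-based versions of these two CDFs are exactly what the Bayesian machinery described below provides.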

This motivates the use of flexible data-driven methods for estimating functions, which is a central goal of nonparametric Bayesian analysis. The standard parametric approach imposes the strong condition that F0 and F1 be members of the normal family (i.e., the normal-normal model), while a flexible nonparametric approach treats them as arbitrary continuous distribution functions. A flexible method is desirable because F0 and/or F1 could exhibit unanticipated right or left skewness, multimodality, or they could be symmetric but not bell-shaped. As a result, the methodology we present has the flexibility to handle data from many different settings, and notably it will be designed to include the normal-normal model as a special case.

The development of modern nonparametric and semiparametric Bayesian procedures for ROC analysis is an active area of research [8–17]. The main goals of this paper are (i) to use flexible Bayesian models and ROC curves to evaluate the accuracy of a biomarker that may depend on patient characteristics and (ii) to provide a nontechnical description of Polya tree priors and show how they can be implemented in epidemiological research using standard statistical software.

2. Materials and Methods

Robust inference for ROC curves stems from allowing F0 and F1 to be members of broad classes of distributions. Our approach is to model them using mixtures of finite Polya tree (MFPT) priors, which have been carefully discussed in the statistics literature [18, 19]. We avoid a full technical description because it requires considerable mathematical detail. Instead, a conceptual introduction appears in Appendix A.

Briefly, a major advantage beyond flexibility is that MFPT priors for F0 and F1 can be “centered” on separate parametric families. In part, this means that the flexible model generalizes the standard parametric model, since the normal-normal model is allowed as a special case. Additionally, it means that the expected value of Fd under the prior will be the distribution function that corresponds to the centering family. In our illustrations, the centering families will be N(μd, σd²), for d = 0, 1. The MFPT priors will also have positive weight parameters, c0 and c1, and positive integers J0 and J1 that define the (finite) lengths of the trees. Large values for the weights indicate higher prior confidence in the centering distributions, while values near zero allow for considerable flexibility around the centering family. To aid the selection of a weight parameter, consider that the parametric model is the special case obtained as cd → ∞. Hence, using a large weight value is similar to fitting the parametric centering model to the data (possible oversmoothing). The opposite extreme of setting cd near zero produces empirical estimates (possible undersmoothing). In our experience, a weight parameter between 0.5 and 1 is often a reasonable compromise between these extremes. Tree lengths generally range between 4 and 10, depending on the sample size. Small values for Jd tend to oversmooth the data. A fully nonparametric prior is obtained as Jd → ∞, but large values of Jd can substantially increase computing time. Finally, there will be priors on the centering parameters (μd, σd); write them as p(μd, σd), for d = 0, 1. We assume F0 and F1 are independent, with priors denoted by MFPT(Jd, cd, N(μd, σd²)), for d = 0, 1.

2.1. Models

In the absence of covariates, consider two independent samples of continuous biomarker measurements, where x1, …, xn0 constitute a random sample from an unknown distribution F0, and y1, …, yn1 are realizations from F1. We regard the x’s as biomarker outcomes from nondiseased individuals and the y’s as outcomes from diseased individuals. Our model is represented as xi | F0 ~ F0, independently, with F0 ~ MFPT(J0, c0, N(μ0, σ0²)), and similarly for the y’s. The prior p(μd, σd) in our illustrations involves a normal distribution on μd and an independent uniform distribution on σd.

There are many possible extensions of this two-group model depending on the complexity of the data. We describe a semiparametric regression model that can be easily adapted to handle a variety of scenarios. The model specifies separate linear regressions with arbitrary residual distributions for the data from the nondiseased and diseased populations:

xi = z0i′β0 + ε0i, where ε0i | G0 ~ G0 and G0 ~ MFPT(J0, c0, N(μ0, σ0²)),
yi = z1i′β1 + ε1i, where ε1i | G1 ~ G1 and G1 ~ MFPT(J1, c1, N(μ1, σ1²)).
There are no intercepts in the regression part of the models since each residual distribution has an arbitrary unknown mean. The covariate vectors z0 and z1 can be distinct, have some overlap (e.g., both might include age), or overlap completely (z0 = z1). In the absence of available prior information, independent normal prior distributions with mean 0 and large variance can be used for the regression coefficients. Alternatively, when prior information is available, we recommend conditional means priors [20]. Special cases include a model with no covariates for the nondiseased population (no z0 variables), a model with no covariates for the diseased population (no z1 variables), and a model without covariates (the two-group model).

Our model can be used to estimate covariate-specific ROC curves and AUC. Let z0* and z1* define two particular subpopulations of nondiseased and diseased people. Then, the ROC curve and its corresponding AUC are measures of the test’s accuracy when applied to nondiseased people with covariate vector z0* and diseased people with covariate vector z1* (often, the researcher will set z0* equal to z1*). Approximate posterior inference for ROC curves can be made by calculating posterior summaries of the pairs (1 − F0(c | z0*), 1 − F1(c | z1*)) across a fine grid of values for c that spans the sample space.
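To make the grid evaluation concrete, the sketch below computes a covariate-specific ROC curve under the location-shift regression model. Everything here is a hypothetical stand-in: the coefficients are invented, and fixed normal distributions play the role of (posterior draws of) the MFPT residual distributions G0 and G1.

```python
# Sketch of covariate-specific ROC evaluation under the regression model.
# Coefficients and residual distributions below are hypothetical stand-ins
# for posterior draws of the MFPT residual distributions.
from statistics import NormalDist

beta0, beta1 = [0.5], [1.2]          # hypothetical regression coefficients
G0 = NormalDist(mu=2.0, sigma=1.0)   # stand-in residual distribution, nondiseased
G1 = NormalDist(mu=3.5, sigma=1.0)   # stand-in residual distribution, diseased

def roc_curve(z0, z1, grid):
    """(FPF, TPF) pairs at covariate values z0, z1 over a grid of thresholds c."""
    m0 = sum(b * z for b, z in zip(beta0, z0))
    m1 = sum(b * z for b, z in zip(beta1, z1))
    # F_d(c | z) = G_d(c - z'beta_d) under the location-shift regression
    return [(1 - G0.cdf(c - m0), 1 - G1.cdf(c - m1)) for c in grid]

grid = [i / 10 for i in range(-20, 101)]   # fine grid spanning the sample space
curve = roc_curve(z0=[1.0], z1=[1.0], grid=grid)
```

In the Bayesian analysis this computation would be repeated for each posterior draw of (β0, β1, G0, G1), and the curves summarized pointwise.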

3. Results

3.1. Fitting MFPT Models in SAS

The SAS 9.3 (SAS Institute, Inc., Cary, North Carolina) code that was used to fit a one-sample MFPT model to the simulated data described in Appendix A is in Appendix B.

3.2. ROC Analysis of sEGFR Data

We investigated a soluble isoform of the epidermal growth factor receptor (sEGFR) as a biomarker for lung cancer in premenopausal and postmenopausal women. sEGFR also has been associated with ovarian cancer [21, 22]. The data were collected from a case-control study conducted at the Mayo Clinic in Rochester, Minnesota between 1998 and 2003. There were 140 controls and 101 lung cancer cases, and 92 premenopausal and 149 postmenopausal women enrolled in the study.

In a preliminary analysis, we found no clear evidence of a difference in the distributions of sEGFR for premenopausal versus postmenopausal lung cancer cases (Wilcoxon rank-sum test). However, there was evidence for a statistical difference in the distribution of sEGFR for controls based on menopausal status (Wilcoxon rank-sum test). We thus included menopausal status as a covariate in models for the control data. A flexible approach is supported over the standard parametric approach because normal quantile plots (not shown) indicated a right skew in the distributions of sEGFR for cases and postmenopausal controls.

We compared two analyses. The first analysis had menopausal status as a covariate for the control group in an ROC regression model. The second analysis modeled outcomes among pre- and postmenopausal controls with completely separate (flexible) distributions, without regression structure. We compared these models using the log pseudomarginal likelihood (LPML) [23], [24, Section 4.9].

For all models that used Polya trees, we set the weight parameters (the c’s) equal to 0.5 and the tree lengths (the J’s) equal to 5. In this setting, values of sEGFR tended to be higher among controls than among cases, so we reversed the roles for cases and controls. Let z = 0 for premenopausal women and z = 1 for postmenopausal women. For both models under consideration, the data from cases were modeled as independently distributed according to an unknown distribution F with an MFPT prior centered at the N(μ, σ²) family. The prior on μ was normal with mean 50, with μ selected to be independent of σ. The prior mean of 50 for μ was chosen because that was approximately the mean of sEGFR for male lung cancer cases in a similar study, but we allowed for a high degree of uncertainty about the value of μ through a large prior variance.

For controls, model 1 for the data was the regression xi = βzi + εi, where, independently, εi | G ~ G, G was assigned an MFPT prior centered at the N(μ, σ²) family, and β was given a diffuse normal prior.

Model 2 specified separate distributions for data from premenopausal and postmenopausal controls. Denote the distributions by G0 and G1, respectively, where each Gz was assigned an MFPT prior centered at N(μz, σz²), with a normal prior on μz and an independent uniform prior on σz, for z = 0, 1. Using the same priors, we also considered the standard parametric analysis (model 3) in which F, G0, and G1 were normal distributions with distinct means and variances.

The LPML statistics were similar for models 1 and 2. We therefore selected the more flexible model 2 that had separate distributions across menopausal status. For the standard parametric model 3, the LPML was substantially lower, indicating a strong preference for the more flexible models (pseudo-Bayes factor favoring model 2 over model 3).

Estimated ROC curves were obtained from model 2, together with estimates from the parametric normal model 3 and from nonparametric empirical distribution functions. The estimated ROC curve from the parametric analysis differs drastically from the estimates obtained from the MFPT and empirical distribution functions for postmenopausal women (Figure 1). Not surprisingly, the parametric model fails to track either of the data-driven estimates. Note that although the parametric model is the prior expectation of the MFPT model, the MFPT model has the flexibility to allow the data to reallocate probability mass away from a bell curve to produce an appropriately smooth and flexible estimate of the ROC curve. For premenopausal women, the parametric and Polya tree models also give different inferences, including at clinically important cutoffs corresponding to false positive probabilities up to 10% (Figure 2). The empirical partial AUC over that region was 0.026, while the estimate from the parametric analysis was roughly 58% higher, at 0.041. A 90% posterior interval for the partial AUC from the parametric model was (0.027, 0.055), which does not contain the empirical estimate. The Polya tree analysis is a middle ground between these extremes, with an estimated ROC curve that is essentially a smoothed version of the empirical curve and an estimated partial AUC of 0.019 (0.009, 0.036). Similar results were obtained from a sensitivity analysis that placed diffuse Gamma priors with mean 1 and variance 10000 on the c’s and J’s.
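For readers who want to reproduce the empirical partial AUC computation, a minimal trapezoidal sketch follows (a simplification, not the paper’s exact procedure: it anchors the curve at the origin and integrates only over the ROC points that fall at or below the false positive limit).

```python
# Hypothetical sketch: partial AUC (area under the ROC curve restricted to
# false positive probabilities up to 10%) by trapezoidal integration.
def partial_auc(roc_points, fpf_max=0.10):
    """roc_points: (FPF, TPF) pairs; integrates TPF over FPF in [0, fpf_max].

    Approximate: points above fpf_max are dropped rather than interpolated,
    so the result slightly underestimates the area when no point falls
    exactly at fpf_max.
    """
    pts = sorted(p for p in roc_points if p[0] <= fpf_max)
    pts = [(0.0, 0.0)] + pts          # anchor the curve at the origin (assumption)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Applying this to posterior draws of the model-based ROC points yields a posterior sample of partial AUC values, from which intervals such as the (0.009, 0.036) reported above can be read off.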

4. Conclusions

We have described the use of MFPT priors in Bayesian analysis. Even when a parametric model is thought to hold, a Polya tree analysis offers a way to assess parametric assumptions and to perform a sensitivity analysis to address deviations from them. For example, a simple approach to sensitivity analysis involves comparing estimated ROC curves and AUC from the normal-normal and semiparametric models.

A finite Polya tree prior is not technically nonparametric. Bayesian nonparametric models are characterized by having an infinite number of parameters. Such models often use finite approximations, as in our case. It is thus more accurate to view models that contain finite Polya tree priors as parametric, with a large number of parameters. In fact, Bayesian statistical analysis with finite Polya tree priors has been called “parametric nonparametric statistics,” because it uses parametric models (with many parameters) that maintain flexibility [25]. In our experience, finite approximations of Polya trees are sufficiently flexible and scalable to perform well for a wide variety of complex settings.

Appendices

A. Polya Tree Priors

A.1. Finite Polya Tree Priors

A random sample is obtained from an unknown distribution G. We describe informally what it means to place a finite Polya tree prior on G. Polya tree priors place a distribution on a collection of cumulative distribution functions. The first step in constructing a finite Polya tree prior is to create nested partitions of the sample space. The first partition splits the sample space into two nonoverlapping intervals. In the second partition, those two intervals are each split, yielding a finer partition that contains four intervals. Then, these four intervals are each split to give an 8-interval third-level partition of the sample space. This sequential process of splitting the sample space into finer and finer partitions continues up to a certain (finite) level J.

An analogy is to a convention center with J floors. The rooftop encompasses the entire sample space. Down one level is a floor with two large meeting rooms. Two levels down is a floor with four meeting rooms, then a floor with 8 rooms, and so on until we get to the ground floor, J levels down, which contains 2^J meeting rooms (Figure 3 contains an example with J = 3). Suppose the unknown distribution G assigns multiple people (the y’s in this analogy) to rooms on the ground floor, and our task is to use the observed distribution of people to rooms to estimate G. Although all levels (floors) of the tree (convention center) are important for the purpose of estimating G, of primary importance is level J (the ground floor).

The goal is to produce a data-driven estimate of G that assigns high probability to intervals (rooms) that contain lots of data (people), assigns low probability to empty intervals, and assigns midrange probability to intervals that contain some (but not a lot) of the data. Consider first the simplest case where the convention center has only one floor (J = 1). Then, people are assigned to one of two meeting rooms on floor 1. Call them rooms B_{1,1} and B_{1,2}. The probability of being assigned to the first room on floor 1 is G(B_{1,1}), which, since G is a cumulative distribution function and B_{1,1} is an interval of the form (−∞, a], is to be interpreted as G(a). Denote this unknown probability by the parameter π_{1,1}. Then, by the complement rule, 1 − π_{1,1} is the probability of being assigned to room 2 on floor 1. In notation, π_{1,1} = G(B_{1,1}) and π_{1,2} = 1 − π_{1,1} = G(B_{1,2}). Since G is unknown, π_{1,1} and π_{1,2} are unknown. To help interpret these parameters, suppose the data arise from a right-skewed distribution (lots more people in room 1 than room 2); then π_{1,1} would be large and hence π_{1,2} would be small, and vice versa for left-skewed data. In practice, J = 1 is never used because it would lead to a crude estimate of the density function g, much like estimating a density using a relative frequency histogram that contains only two bins. In our experience, setting J equal to 4, 5, or 6 often works well in practice. Another option is to select J so that 2^J is roughly equal to the sample size [18].

Now suppose we return a year later to find that the convention center has expanded and contains a second floor (J = 2). Suppose the new ground floor has four rooms, B_{2,1}, B_{2,2}, B_{2,3}, and B_{2,4} (third row of Figure 3). Moreover, suppose that room assignments are made based on whether the individual was in room B_{1,1} or B_{1,2} during the previous visit. If the previous assignment was to room B_{1,1}, then the individual is assigned again to the left side of the convention center, namely, to either room B_{2,1} with unknown probability π_{2,1} or to room B_{2,2} with probability π_{2,2} = 1 − π_{2,1}. Similarly, define π_{2,3} and π_{2,4} (= 1 − π_{2,3}) to be the probability of being assigned to room B_{2,3} or B_{2,4}, respectively, given that the previous assignment was to room B_{1,2}.

The π’s are conditional probability parameters, since π_{2,1} = P(Y ∈ B_{2,1} | Y ∈ B_{1,1}) and π_{2,3} = P(Y ∈ B_{2,3} | Y ∈ B_{1,2}). To relate the π’s to G, we must determine the marginal probability of assignment to the various ground-floor rooms. Observe that interval B_{2,1} is a subset of interval B_{1,1} (see Figure 3), so the marginal probability of interval B_{2,1} is

G(B_{2,1}) = P(Y ∈ B_{1,1}) P(Y ∈ B_{2,1} | Y ∈ B_{1,1}) = π_{1,1} π_{2,1}.

Similar steps lead to G(B_{2,2}) = π_{1,1} π_{2,2}, G(B_{2,3}) = π_{1,2} π_{2,3}, and G(B_{2,4}) = π_{1,2} π_{2,4}. Suppose again that G is right skewed. Then the data will estimate π_{1,1} to be large, and it will estimate π_{2,1} to be large since most of the people will be assigned to room B_{2,1}. Therefore, the estimate of G(B_{2,1}) will be (relatively) large.
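These branch products can be sketched in miniature (hypothetical probabilities; Python rather than SAS). At J = 2 each marginal interval probability is the product of the conditional splitting probabilities along its branch, mirroring the p1–p16 assignments in the Appendix B program, which multiply four π’s at J = 4:

```python
# Sketch (hypothetical numbers): marginal level-2 interval probabilities as
# products of the conditional splitting probabilities.
pi11 = 0.7            # P(Y in B_{1,1}); large for right-skewed data
pi21 = 0.8            # P(Y in B_{2,1} | Y in B_{1,1})
pi23 = 0.5            # P(Y in B_{2,3} | Y in B_{1,2})

p = [pi11 * pi21,              # G(B_{2,1})
     pi11 * (1 - pi21),        # G(B_{2,2})
     (1 - pi11) * pi23,        # G(B_{2,3})
     (1 - pi11) * (1 - pi23)]  # G(B_{2,4})

assert abs(sum(p) - 1.0) < 1e-12   # the four marginal probabilities sum to 1
```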

Notice that level 1 has only one unique parameter (π_{1,1}) associated with it because π_{1,2} is completely determined by π_{1,1}; similarly, level 2 has two unique parameters (π_{2,1} and π_{2,3}) associated with it. We can continue the convention center storyline to any level J. For J = 3 (Figure 3), we add conditional probability parameters π_{3,1}, π_{3,2}, π_{3,3}, π_{3,4}, π_{3,5}, π_{3,6}, π_{3,7}, and π_{3,8}, but only 4 of these are unique. For instance, π_{3,1} is the probability that Y is in interval B_{3,1} given that it is in interval B_{2,1}, and π_{3,2} = 1 − π_{3,1}.

The key point is that if we can estimate all of the π’s, then we can estimate the probability that is allocated by G to each interval at level J. Estimating a continuous G exactly requires continuing ad infinitum as J grows. However, in practice we truncate the tree at a fixed J, hence the term finite Polya tree. But this is incomplete for the purpose of estimating a continuous G, because we have not modeled how probability mass is distributed within each interval at level J. In terms of the convention center analogy, we have modeled the probability of assignment to each of the ground-floor rooms, but we have not modeled how people are distributed within those rooms. For instance, they could all be clustered in the middle of the room (all the y’s clumped together in the center of the interval). Alternatively, the data could be uniformly distributed, clumped to the right or left side of the interval, or have any other dispersion pattern within each interval. To address this issue, we model the data according to how a user-specified distribution F allocates probability mass within the intervals at level J.

The distribution F is important in two other ways. First, it is used to determine the lower and upper endpoints of all intervals in the tree. The median of F is used to split the sample space into two intervals at level 1 of the tree. The quartiles of F define cutpoints for intervals at level 2. Writing the 25th percentile as q_{0.25}, the median as q_{0.5}, and the 75th percentile as q_{0.75}, we have for a sample space that covers the real line

B_{1,1} = (−∞, q_{0.5}], B_{1,2} = (q_{0.5}, ∞),
B_{2,1} = (−∞, q_{0.25}], B_{2,2} = (q_{0.25}, q_{0.5}], B_{2,3} = (q_{0.5}, q_{0.75}], B_{2,4} = (q_{0.75}, ∞).

In general, the kth interval at level j is B_{j,k} = (F^{−1}((k − 1)/2^j), F^{−1}(k/2^j)], for j = 1, …, J and k = 1, …, 2^j.
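The quantile construction above is easy to compute directly. The sketch below (illustrative only; a standard normal stands in for the centering distribution F) uses Python’s standard-library NormalDist to recover the endpoints of B_{j,k}:

```python
# Sketch: partition endpoints at level j from the quantiles of the centering
# distribution F (here a standard normal, via the Python stdlib).
from statistics import NormalDist

F = NormalDist(mu=0.0, sigma=1.0)

def interval(j, k):
    """Endpoints of B_{j,k} = (F^{-1}((k-1)/2^j), F^{-1}(k/2^j)]."""
    lo = float("-inf") if k == 1 else F.inv_cdf((k - 1) / 2 ** j)
    hi = float("inf") if k == 2 ** j else F.inv_cdf(k / 2 ** j)
    return lo, hi

# Level 1 splits at the median; level 2 splits at the quartiles.
```

By construction, F assigns probability 1/2^j to every interval at level j, which is what makes F the prior center of the tree.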

Second, F is the prior expectation of the unknown distribution function G. This provides guidance for selecting F; we select it based on our best prior assessment of the data-generating distribution G. If our prior assessment is that the data will be governed by a certain normal distribution, we use that normal for F. If our prior assessment is accurate, the posterior estimate of G from the finite Polya tree model will take the shape of that normal distribution. But the real advantage of Polya tree priors is their flexibility to allow for data-driven prior-to-posterior reallocation of probability mass away from the shape of F to any shape supported by the data. This happens by estimating the π’s, which would all equal 1/2 if G coincided exactly with F.

The collection of π’s constitutes the unknown parameters corresponding to G. We need to specify a prior distribution over this collection. Recall that when k is an even number between 2 and 2^j, π_{j,k} = 1 − π_{j,k−1}. Therefore, priors are needed only on the π_{j,k} with k odd. It is standard to use independent beta priors, specifically π_{j,k} ~ Beta(c j², c j²) for odd k and all j. The prior mean of 0.5 gives equal weight to the left and right intervals formed by splitting a parent interval. The positive constant c is often set equal to 0.5 or 1. The value of c reflects our confidence in the prior estimate F for the unknown G. Large values of c translate into higher prior confidence in the particular F, so a relatively larger amount of data is needed in order for the posterior estimate to stray from F if the data conflict with it. A low value of c will often lead to an estimate of G that is similar to the empirical distribution function. We have found that c = 0.5 or c = 1 performs well in practice because it often is a middle ground between a purely data-based and a purely model-based estimate.
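To see the role of c, one can simulate prior draws of the unique splitting probabilities (an illustrative sketch; the Beta(c j², c j²) form matches the beta(c*j**2, c*j**2) priors in the Appendix B program):

```python
# Sketch: the canonical Polya tree prior draws each odd-indexed pi_{j,k}
# from Beta(c*j^2, c*j^2); larger c concentrates the draws near 1/2,
# pulling the prior toward the centering distribution F.
import random

random.seed(1)

def draw_pis(J, c):
    """One prior draw of the unique (odd-k) splitting probabilities."""
    return {(j, k): random.betavariate(c * j ** 2, c * j ** 2)
            for j in range(1, J + 1)
            for k in range(1, 2 ** j, 2)}

pis_flexible = draw_pis(J=4, c=0.5)   # diffuse: splits can stray far from 1/2
pis_rigid    = draw_pis(J=4, c=50)    # concentrated near 1/2
```

The j² factor in the beta parameters also makes deeper splits progressively more concentrated at 1/2, which is what guarantees a well-behaved limit as J grows.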

Once we have selected F, J, and c, we have all the elements needed to specify a finite Polya tree prior for G. Formulas for G and its corresponding density function g are

g(y) = 2^J p_{k(y)} f(y),
G(y) = Σ_{m < k(y)} p_m + p_{k(y)} [2^J F(y) − k(y) + 1],

where k(y) is an integer between 1 and 2^J that identifies the interval at level J containing y, f is the density function of F, and p_k is the probability of the kth interval at level J (it is the product of J of the π’s) [19]. The interval that contains y at level J can be determined using the formula k(y) = int(2^J F(y)) + 1, where the int function returns the integer portion of a decimal number (e.g., int(3.7) = 3).
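The index and density formulas above translate directly into code; the sketch below mirrors the llike line of the SAS program in Appendix B (F and the p_k values here are illustrative, with J = 2, and the index is clamped as a numerical guard for y in the extreme upper tail, where F(y) rounds to 1):

```python
# Sketch of the level-J interval index and finite Polya tree density:
# g(y) = 2^J * p_{k(y)} * f(y), with k(y) = int(2^J F(y)) + 1.
from statistics import NormalDist

J = 2
F = NormalDist(mu=0.0, sigma=1.0)
p = [0.56, 0.14, 0.15, 0.15]   # hypothetical level-2 interval probabilities

def interval_index(y):
    """k(y) = int(2^J F(y)) + 1, the level-J interval containing y."""
    return min(int(2 ** J * F.cdf(y)) + 1, 2 ** J)  # clamp at the top interval

def g(y):
    """Finite Polya tree density: 2^J * p_{k(y)} * f(y)."""
    return 2 ** J * p[interval_index(y) - 1] * F.pdf(y)
```

Taking logs of g(y) gives exactly the J*log(2) + log(p[k]) + log f(y) decomposition used for llike in the SAS code.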

To motivate an extension to mixtures of finite Polya tree priors, consider Figure 4(a), where G is estimated using a finite Polya tree analysis of 200 simulated data values from a mixture of two normal distributions (70% of observations from one normal component and 30% from the other). The prior mean F was a fixed normal distribution, and we set J = 4. The estimate from the finite Polya tree analysis is choppy, which turns out to be the result of using a fixed F. These bumps get smoothed over if instead we replace F with a parametric distribution F_θ, where θ denotes a collection of unknown parameters that is assigned a prior distribution. The posterior estimate of G in Figure 4(b) was obtained using an N(μ, σ²) family for F_θ, with μ ~ N(50, 10²) and σ ~ U(0, 20). This is an example of our preferred prior in nonparametric Bayesian statistics, namely, the mixture of finite Polya trees prior. The model is y_i | G ~ G, G | θ ~ FPT(J, c, F_θ), θ ~ p(θ), where p(θ) is a prior distribution on θ. With large c, the model for the data is essentially the parametric family F_θ and the analysis becomes completely parametric. For a more elaborate but still elementary description, see [25].

B. SAS Code

SAS 9.3 code to fit the one-sample MFPT model to simulated data from the mixture of two normal distributions described in Appendix A is presented in Algorithm 1. The SAS data step is not shown in the code; in it, the variable y is read into a data set named MFPT, where y contains 200 evenly spaced percentiles from the true mixture distribution. The first line instructs the mcmc procedure to use the data in MFPT and specifies options for the number of burn-in iterates (nbi), the number of iterates to simulate post burn-in (nmc), the thinning rate (thin=10, which retains every 10th iterate), and the parameters to monitor for posterior inference. Constants that are used in the program appear in line 2; in this example we set J = 4 and c = 1. Each π_{j,k} parameter is coded as pijk, with the appropriate numbers substituted for j and k. The marginal probabilities at level J = 4 are named pk, where k ranges from 1 to 16; these variables represent the interval probabilities G(B_{4,k}) as defined in Appendix A. Priors on the π’s, μ, and σ follow. SAS code to fit the finite Polya tree model described in Appendix A can be obtained by omitting the prior and parms lines associated with μ and σ and setting mu and sigma equal to constants in line 2 of the code. The variable k identifies the set each observation y belongs to at level 4. The log likelihood (llike) is obtained by taking the natural log of the formula for g(y) that appears in (A.3). The remaining code is used to estimate the distribution function (G) and the density function (coded as littleg since SAS is not case sensitive).

proc mcmc data=MFPT nbi=5000 nmc=50000 thin=10 monitor=(G littleg mu sigma);
begincnst; J=4; c=1; endcnst;
parms pi11 pi21 pi23;
parms pi31 pi33 pi35 pi37;
parms pi41 pi43 pi45 pi47 pi49 pi411 pi413 pi415;
parms mu sigma / t(3);
array p[16];
p1 = pi11*pi21*pi31*pi41; p2 = pi11*pi21*pi31*(1-pi41);
p3 = pi11*pi21*(1-pi31)*pi43; p4 = pi11*pi21*(1-pi31)*(1-pi43);
p5 = pi11*(1-pi21)*pi33*pi45; p6 = pi11*(1-pi21)*pi33*(1-pi45);
p7 = pi11*(1-pi21)*(1-pi33)*pi47; p8 = pi11*(1-pi21)*(1-pi33)*(1-pi47);
p9 = (1-pi11)*pi23*pi35*pi49; p10 = (1-pi11)*pi23*pi35*(1-pi49);
p11 = (1-pi11)*pi23*(1-pi35)*pi411; p12 = (1-pi11)*pi23*(1-pi35)*(1-pi411);
p13 = (1-pi11)*(1-pi23)*pi37*pi413; p14 = (1-pi11)*(1-pi23)*pi37*(1-pi413);
p15 = (1-pi11)*(1-pi23)*(1-pi37)*pi415; p16 = (1-pi11)*(1-pi23)*(1-pi37)*(1-pi415);
prior pi11 ~ beta(c,c); prior pi21 pi23 ~ beta(c*2**2,c*2**2);
prior pi31 pi33 pi35 pi37 ~ beta(c*3**2,c*3**2);
prior pi41 pi43 pi45 pi47 pi49 pi411 pi413 pi415 ~ beta(c*4**2,c*4**2);
prior mu ~ normal(50,sd=10); prior sigma ~ uniform(0,20);
k = int(2**J * cdf("normal", y, mu, sigma) + 1);
llike=J*log(2) + log(p[k]) + lpdfnorm(y, mu, sigma);
model general(llike);
array setID[52]; array G[52]; array littleg[52];
do i = 1 to dim(G) by 1;
setID[i] = int(2**J * cdf("normal", 39 + i/2, mu, sigma) + 1);
G[i] = p1*(setID[i] ge 2) + p2*(setID[i] ge 3) + p3*(setID[i] ge 4)
+ p4*(setID[i] ge 5) + p5*(setID[i] ge 6) + p6*(setID[i] ge 7)
+ p7*(setID[i] ge 8) + p8*(setID[i] ge 9) + p9*(setID[i] ge 10)
+ p10*(setID[i] ge 11) + p11*(setID[i] ge 12) + p12*(setID[i] ge 13)
+ p13*(setID[i] ge 14) + p14*(setID[i] ge 15) + p15*(setID[i] ge 16)
+ p[setID[i]]*(2**J * cdf("normal", 39 + i/2, mu, sigma) - setID[i] + 1);
littleg[i] = 2**J * p[setID[i]] * pdf("normal", 39 + i/2, mu, sigma);
end; run;

The model is fit using Gibbs sampling with block updating occurring for the groups of parameters that are defined in the parms statements. The posterior distributions of G and littleg are evaluated at the grid of points 39.5, 40.0, …, 65. Figure 4(b) plots the posterior means of littleg, and a continuous density is obtained by interpolation.

Disclosure

Andre Baron is a coinventor of patents related to sEGFR and cofounder of Tumor Biology Investment Group, Inc., a biotechnology company that holds the rights to several sEGFR patents. Of note, Dr. Baron was not involved in the statistical analyses or interpretation of the statistical results of these analyses.

Acknowledgments

This work was supported by the National Institutes of Health (Grants K07 CA76170, R21 CA82520, and R03 CA82091 to A.T.B.). The authors thank both referees for their helpful and encouraging comments.