
International Journal of Biomedical Imaging

Volume 2010 (2010), Article ID 618747, 11 pages

http://dx.doi.org/10.1155/2010/618747

## Statistical Evaluations of the Reproducibility and Reliability of 3-Tesla High Resolution Magnetization Transfer Brain Images: A Pilot Study on Healthy Subjects

^{1}Pfizer Inc., New York, NY, USA; ^{2}NorthShore University HealthSystem, Evanston, IL, USA; ^{3}Albany Medical College, Albany, NY, USA; ^{4}University of Florida, Florida, FL, USA; ^{5}Northwestern University, Chicago, IL, USA; ^{6}University of Chicago, Chicago, IL, USA

Received 29 September 2009; Accepted 4 December 2009

Academic Editor: Shan Zhao

Copyright © 2010 Kelly H. Zou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Magnetization transfer imaging (MT) may have considerable promise for early detection and monitoring of subtle brain changes before they are apparent on conventional magnetic resonance images. At 3 Tesla (T), MT affords higher resolution and increased tissue contrast associated with macromolecules. The reliability and reproducibility of a new high-resolution MT strategy were assessed in brain images acquired from 9 healthy subjects. Repeated measures were taken for 12 brain regions of interest (ROIs): genu, splenium, and the left and right hemispheres of the hippocampus, caudate, putamen, thalamus, and cerebral white matter. Spearman's correlation coefficient, coefficient of variation, and intraclass correlation coefficient (ICC) were computed. Multivariate mixed-effects regression models were used to fit the mean ROI values and to test the significance of the effects due to region, subject, observer, time, and manual repetition. A sensitivity analysis of various model specifications and the corresponding ICCs was conducted. Our statistical methods may be generalized to many similar evaluative studies of the reliability and reproducibility of various imaging modalities.

#### 1. Introduction

Magnetization transfer (MT) imaging is a quantitative approach for detecting subtle or occult abnormalities in brain tissue. In previous studies, the Magnetization Transfer Ratio (MTR), an index of MT imaging, was sensitive to brain changes in patients with mild cognitive impairment, an Alzheimer’s disease prodrome [1, 2], to new lesions in patients with multiple sclerosis [3], and to changes associated with progression in chronic neurological disorders [4]. The higher magnetic field strength afforded by 3T allows MT image resolution to be augmented compared with conventional MT acquisition at 1.5T [5–7]. We developed a high resolution MT technique to detect subtle changes in anatomically small, functionally eloquent brain structures. The increased field strength affords whole-brain coverage with considerably thinner slices, potentially reducing partial volume artifacts. However, even among healthy subjects, numerous factors, such as static field signal dropout and RF nonuniformity, may introduce variability in measures derived from magnetic resonance (MR) data. Measurement variation may also be introduced by scan repetitions, repositioning at different time points, and image post-processing. Moreover, 3T may be susceptible to variation associated with increased field strength [8]. Such variability may pose limitations when conducting clinical comparisons to differentiate normal and diseased brains or when developing statistically predictive algorithms.

To validate high resolution MT for detecting early disease or for monitoring progression in chronic neurological disease, it is necessary to collect information on normative values and to evaluate the reliability and reproducibility of the measurements when measured across time in healthy controls. This investigation evaluated observer-agreement of high-resolution MT measurements determined from repeated brain scans of 9 healthy volunteers. We postulated that MT values would remain stable during the one month study interval. We evaluated the reliability and reproducibility of the high resolution MT measurements in 12 brain regions of interest (ROIs), applied statistical measures to the data and used complex multivariate mixed-effects models to test the statistical significance of several effects due to region, subject, observer, time, and manual repetition.

#### 2. Materials and Methods

##### 2.1. Study Subjects

The study was approved by the IRB at the NorthShore University Health System and was conducted following the ethical principles outlined in the Declaration of Helsinki. Eleven healthy adult volunteers were randomly selected from a database maintained at the Center for Advanced Imaging, Radiology Department, NorthShore University Health System; they provided written informed consent and were evaluated against the eligibility criteria. To protect the subjects’ confidentiality, all data were de-identified and handled according to the guidelines specified by the Health Insurance Portability and Accountability Act (HIPAA) in the USA.

##### 2.2. Image Acquisition

Brain images were acquired using a 3T General Electric (GE) HDx system (Waukesha, WI, USA). Each volunteer was scanned twice, with a randomly selected interval of 1 to 4 weeks between the scans. Methods for reducing random errors in image acquisition included the use of a body coil for excitation to control B1 nonuniformities and an 8-channel quadrature receive-only coil [9]. Images were acquired with ($M_s$) and without ($M_0$) MT saturation pulses, which were applied at an offset frequency from the water resonance. To accelerate the scan for whole-brain coverage while maintaining thin slices, the image protocol was optimized for 3T using a three-dimensional spoiled gradient recalled (3D SPGR) acquisition [5]. The Gaussian sinc MT pulse was applied for 8 ms at a 1200 Hz offset. The stability of the scanner and set-up procedure was addressed by using a fixed set of parameters per subject. The image protocol included the following parameters: TR 34 to 35 ms, TE 4 to 8 ms, imaging FA , bandwidth 15.6 kHz, 0.75 NEX, phase FOV 0.75, voxel dimensions 0.9 × 0.9 × 0.9 to 1.3 mm^{3}. The whole brain was covered in 90 to 140 slices, with acquisition time ranging from 7 minutes 40 seconds to 10 minutes 20 seconds using a partial $k$-space acquisition.

##### 2.3. Image Analysis

MTR maps were generated off-line on a General Electric AW Workstation (General Electric, Milwaukee, WI, USA) using the standard equation:

$$\mathrm{MTR} = \frac{M_0 - M_s}{M_0} \times 100\%, \tag{1}$$

where $M_s$ and $M_0$ were the signal intensities in a given voxel obtained with and without the MT saturation pulse, respectively. MTR maps generated from the high resolution MT images are shown in Figure 1. The 12 ROIs were: genu, splenium, and the left and right hemispheres of the hippocampus, caudate, putamen, thalamus, and cerebral white matter. Figure 2 illustrates the 12 ROIs that were investigated. Each ROI was approximately 30 to 43 mm^{2} in size and was manually and independently placed by Observers 1 and 2 (Authors S.S. and Y.W.) following procedures used in classical and standard agreement studies [10]. After an initial consensus was reached regarding the sizes and locations of the 12 ROIs, the observers performed manual segmentations of the ROIs independently on each set of images. This ROI placement procedure was repeated by each observer in the following week.
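As an illustration of the MTR definition and of averaging within an ROI, the following is a minimal sketch in plain Python. The function names `mtr_map` and `roi_mean`, the toy intensities, and the choice to return `None` for background voxels are assumptions made for this example; they are not the workstation software used in the study.

```python
from typing import List, Optional


def mtr_map(m0: List[List[float]], ms: List[List[float]]) -> List[List[Optional[float]]]:
    """Voxel-wise MTR in percent for two same-shaped 2D slices.

    m0: signal intensities acquired without the MT saturation pulse
    ms: signal intensities acquired with the MT saturation pulse
    Voxels where m0 <= 0 (e.g., background) are returned as None.
    """
    if len(m0) != len(ms) or any(len(a) != len(b) for a, b in zip(m0, ms)):
        raise ValueError("m0 and ms must have the same shape")
    return [
        [100.0 * (a - b) / a if a > 0 else None for a, b in zip(row0, row_s)]
        for row0, row_s in zip(m0, ms)
    ]


def roi_mean(mtr: List[List[Optional[float]]], mask: List[List[bool]]) -> float:
    """Mean MTR over the voxels selected by a boolean ROI mask."""
    vals = [
        v
        for row_v, row_m in zip(mtr, mask)
        for v, m in zip(row_v, row_m)
        if m and v is not None
    ]
    if not vals:
        raise ValueError("ROI mask selects no valid voxels")
    return sum(vals) / len(vals)


# Toy 2 x 2 slice (hypothetical intensities, arbitrary units).
m0 = [[1000.0, 800.0], [0.0, 500.0]]
ms = [[600.0, 480.0], [0.0, 350.0]]
mtr = mtr_map(m0, ms)
mask = [[True, True], [True, True]]
print(mtr)                  # [[40.0, 40.0], [None, 30.0]]
print(roi_mean(mtr, mask))  # mean of 40, 40 and 30
```

The mean ROI value returned by `roi_mean` corresponds to one entry of the repeated-measures dataset analyzed in the statistical sections.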

MTR values were extracted using the manually defined ROIs for each combination of observer, time point, and repetition (Table 1). The means and SDs of the ROI values were calculated. Meta-data were stored in a SAS 9.1 (SAS, Cary, NC, USA) dataset, with individual volunteer identification numbers withheld and replaced by a sequence of 1 to 9 for each subject.

##### 2.4. Statistical Methods

Statistical analyses were performed using SAS 9.1 (SAS Institute, Cary, NC, USA; http://www.sas.com). The SAS analytic procedures conducted included “Proc Univariate,” “Proc Means,” “Proc Corr,” and “Proc Mixed.” Bar diagrams were constructed using Microsoft Excel (http://www.microsoft.com). Age and gender were not controlled for in analyses.

###### 2.4.1. Descriptive Statistics

Let $Y_{rijkl}$, having the indices described in Table 1, be a random variable representing the mean ROI value, where $r$ indexes the ROI, $i$ the subject, $j$ the observer, $k$ the time point, and $l$ the repetition. For the $r$th ROI, we first computed the sample mean and standard deviation of all mean ROI values:

$$\bar{Y}_r = \frac{Y_{r\cdot\cdot\cdot\cdot}}{n}, \qquad \mathrm{SD}_r = \sqrt{\frac{1}{n-1}\sum_{i,j,k,l}\left(Y_{rijkl} - \bar{Y}_r\right)^2}, \tag{2}$$

where $n = I \times J \times K \times L = 9 \times 2 \times 2 \times 2 = 72$ measurements and the operator “$\cdot$” means the marginal sum over the particular index.

The 95-percentile normality range was approximately the interval with the following lower and upper bounds:

$$\left(\bar{Y}_r - 2\,\mathrm{SD}_r,\ \bar{Y}_r + 2\,\mathrm{SD}_r\right). \tag{3}$$

The term “normality range,” as used in Europe, could be arbitrarily defined according to the number of standard deviations away from the mean [11]. Thus, it should not be viewed as the range of the entire dataset, but rather as an interval useful for estimating the population value by one or several standard deviations away from the mean. Here the critical value of 2 was chosen as recommended by Bland and Altman [12].

Additionally, we justified this multiplier using a Student’s $t$-distribution with $n - 1 = 71$ degrees of freedom. For any tail probability of $\alpha/2$ (e.g., 0.025 for a 95-percent normality range), we used the quantile of the corresponding $t$-distribution, such that

$$t_{n-1,\,1-\alpha/2} = t_{71,\,0.975} \approx 1.99.$$

This value happened to be close to the recommended multiplier of 2. Therefore, we rounded it to 2 in (3) for convenience.
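The descriptive summary for one ROI can be sketched in a few lines of standard-library Python. The helper name `normality_range` and the five example values are hypothetical; the study itself pooled 72 measurements per ROI and used SAS. The default multiplier of 2 follows the rounding described above.

```python
import statistics
from typing import Sequence, Tuple


def normality_range(values: Sequence[float], multiplier: float = 2.0) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean, SD, (lower, upper)) for the mean ROI values of one region.

    The SD uses the n - 1 denominator, and the interval is
    mean +/- multiplier * SD.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD with n - 1 in the denominator
    return mean, sd, (mean - multiplier * sd, mean + multiplier * sd)


# Hypothetical mean ROI values (percent MTR) for a single region.
values = [48.0, 50.0, 52.0, 49.0, 51.0]
mean, sd, (lower, upper) = normality_range(values)
print(mean, sd, lower, upper)
```

Passing a different `multiplier` (for example, an exact $t$ quantile) changes only the width of the interval, not the mean or SD.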

###### 2.4.2. Concordance Using Spearman’s Rank Correlation Coefficients

We first explored and measured the concordance between the various measurements fully nonparametrically via Spearman’s rank correlation coefficient. Suppose that we correlated the ROI values by Observers $j = 1$ and $j = 2$, and denoted the marginal ranks by $a_i$ and $b_i$, respectively, for all paired measurements $i = 1, \ldots, m$. The sample version of Pearson’s product-moment correlation coefficient between the ranks of the data was equivalent to Spearman’s rank correlation coefficient [13]:

$$\rho = \frac{\sum_{i=1}^{m}\left(a_i - \bar{a}\right)\left(b_i - \bar{b}\right)}{\sqrt{\sum_{i=1}^{m}\left(a_i - \bar{a}\right)^2 \sum_{i=1}^{m}\left(b_i - \bar{b}\right)^2}},$$

where $\bar{a}$ denotes $\sum_{i} a_i / m$ and $\bar{b}$ denotes $\sum_{i} b_i / m$.

Assuming that there were no ties, since the ROI values were continuous random variables, Spearman’s rank correlation coefficient between Observers 1 and 2 was

$$\rho = 1 - \frac{6\sum_{i=1}^{m} d_i^2}{m\left(m^2 - 1\right)},$$

where the difference of an arbitrary pair of marginal ranks for Observers 1 and 2 was denoted by $d_i = a_i - b_i$, for all $i$. Consequently, all of the raw mean ROI values were converted to their marginal ranks, and the differences between the ranks of each observation on the two variables were computed. Spearman’s rank correlation coefficient was also computed for the ROI values between the two time points $k = 1$ and $k = 2$.

The strength of the concordance and the benchmark values have been discussed [14]. Bar diagrams were made to display the Spearman’s rank correlation coefficients between observers or time points for each ROI.
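The equivalence between Pearson’s correlation of the ranks and the rank-difference shortcut can be checked numerically. The sketch below assumes no ties, as in the text; the function names and the two short lists of observer values are hypothetical.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Marginal ranks (1 = smallest); raises if any ties are present."""
    if len(set(values)) != len(values):
        raise ValueError("ties present; the shortcut formula assumes none")
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman_shortcut(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho via 1 - 6 * sum(d^2) / (m * (m^2 - 1))."""
    m = len(x)
    if m != len(y) or m < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (m * (m * m - 1))


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Sample Pearson product-moment correlation."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


# Hypothetical mean ROI values from two observers for five subjects.
obs1 = [41.2, 39.8, 43.5, 40.1, 42.0]
obs2 = [40.9, 40.3, 43.0, 39.5, 42.4]
rho_shortcut = spearman_shortcut(obs1, obs2)
rho_pearson = pearson(ranks(obs1), ranks(obs2))
print(rho_shortcut, rho_pearson)  # both approximately 0.9
```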

###### 2.4.3. Reproducibility Using Coefficients of Variations

We used the normalized measure of dispersion of a distribution to evaluate the reproducibility of the measurement [15]. The measure was the coefficient of variation (CV), defined as the ratio of the SD to the mean:

$$\mathrm{CV}_r = \frac{\mathrm{SD}_r}{\bar{Y}_r}, \tag{7}$$

where both the numerator (i.e., the sample SD, $\mathrm{SD}_r$) and the denominator (i.e., the sample mean, $\bar{Y}_r$) for the $r$th ROI are provided in (2). For skewed data, such as those generated by an exponential distribution, the underlying population mean and standard deviation are equal, and thus the CV is 1. Hence, CV < 1 would generally represent low variability, and CV > 1 would represent high variability. As in (4) and (6), further stratified computations of CV for different observers, time points, or repetitions were achieved using formulae similar to (7).
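A stratified CV computation might look like the following sketch, which groups measurements by observer before applying the SD-to-mean ratio. The record layout `(observer, value)` and the numbers are assumptions for illustration only.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cv_percent(values: List[float]) -> float:
    """Coefficient of variation (sample SD / sample mean), in percent."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * statistics.stdev(values) / mean


def cv_by_stratum(records: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """CV within each stratum (here, observer) of (stratum, value) records."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for stratum, value in records:
        groups[stratum].append(value)
    return {stratum: cv_percent(vals) for stratum, vals in groups.items()}


# Hypothetical (observer, mean ROI value) records for one region.
records = [(1, 48.0), (1, 50.0), (1, 52.0), (2, 49.0), (2, 50.0), (2, 51.0)]
cv = cv_by_stratum(records)
print(cv)  # observer 1: about 4%, observer 2: about 2%
```

The same grouping key could be a time point or a repetition index instead of an observer.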

###### 2.4.4. Normality and Significance Tests for the Effects via a Multivariate Regression Analysis

As overall variability was likely a result of the effects illustrated in Table 1, we employed a multivariate mixed-effects regression analysis to directly model the ROI values.

A variance-component approach has advantages over many stratified analyses, especially in studies with a limited sample size. Here, because of the novel imaging modality using MT and 3T acquisitions with labor-intensive manual segmentation procedures, a large number of subjects would not have been feasible. To conduct an analysis of variance (ANOVA) based on the various effects, a distributional assumption of normality was necessary and convenient. Therefore, we conducted marginal normality tests using the Shapiro-Wilk test [16]. As shown in Section 3.4, the normality assumption was generally satisfied.

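Thus, we could then consider adopting a linear random-effects model with all pairwise interactions, in addition to a third-order interaction term:

$$Y_{ijkl} = \mu + S_i + O_j + T_k + R_l + (\text{all pairwise interactions}) + (\text{third-order interaction}) + \varepsilon_{ijkl}.$$

The effects represented the following: $\mu$ as the intercept, $S_i$ as subjects, $O_j$ as observers, $T_k$ as time points, $R_l$ as repetitions, and $\varepsilon_{ijkl}$ as the error term. A random-effects model assumed that each of the effects would have an independent normal distribution with mean 0 and its own variance.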

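If normality had failed, and because the data were mean ROI values that were positive, we would recommend a Box-Cox transformation, $Y^{(\lambda)}$, of the outcome variable with an optimal power coefficient $\lambda$ [17–19]. Note that the log-normal becomes a special case when the power coefficient $\lambda = 0$. This normality transformation is given by:

$$Y^{(\lambda)} = \begin{cases} \dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log Y, & \lambda = 0. \end{cases}$$

A profile log-likelihood, $\mathrm{llik}(\lambda \mid y_1, \ldots, y_n)$, of $\lambda$ given the observations $y_1, \ldots, y_n$ would be maximized to estimate an optimal Box-Cox transformation via a nonlinear minimization routine applied to its negative, where the log-likelihood was

$$\mathrm{llik}(\lambda \mid y_1, \ldots, y_n) = -\frac{n}{2}\log \hat{\sigma}^2(\lambda) + (\lambda - 1)\sum_{i=1}^{n} \log y_i + C,$$

where $\hat{\sigma}^2(\lambda)$ was the maximum likelihood estimate of the residual variance of the transformed observations and $C$ was a constant free of the power coefficient to be optimized.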
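To make the profile-likelihood idea concrete, here is a minimal sketch that evaluates the Box-Cox profile log-likelihood (without the constant) on a grid of power coefficients and picks the maximizer. It assumes the simplest case of a common mean rather than the full regression model, and it uses a grid search in place of a nonlinear optimizer; the function names and the toy data are hypothetical.

```python
import math
from typing import List, Sequence


def boxcox(y: Sequence[float], lam: float) -> List[float]:
    """Box-Cox transform of strictly positive data."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:  # treat lambda = 0 as the log transform
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def profile_loglik(y: Sequence[float], lam: float) -> float:
    """Profile log-likelihood of lambda, up to an additive constant."""
    n = len(y)
    z = boxcox(y, lam)
    zbar = sum(z) / n
    sigma2 = sum((v - zbar) ** 2 for v in z) / n  # ML variance estimate
    return -0.5 * n * math.log(sigma2) + (lam - 1.0) * sum(math.log(v) for v in y)


def best_lambda(y: Sequence[float], grid: Sequence[float]) -> float:
    """Grid value of lambda with the largest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(y, lam))


# Toy data whose logarithms are symmetric about zero, so the log
# transform (lambda = 0) is expected to be selected.
y = [math.exp(v) for v in (-1.0, -0.5, 0.0, 0.5, 1.0)]
grid = [k / 100.0 for k in range(-200, 201)]
lam_hat = best_lambda(y, grid)
print(lam_hat)
```

A grid is coarse but transparent; a numerical optimizer over the same function would refine the estimate between grid points.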

Due to the limited number of subjects, however, even with an optimal normality transformation, over-fitting and non-convergence might be issues. Alternatively, we could regard all of the observers, time points, and repetitions as fixed and specify a mixed-effects model. The significance of each source of variability was tested via a restricted maximum likelihood (REML) approach. For our multivariate analysis, a significance threshold was set for two-tailed $p$-values.

###### 2.4.5. Interobserver Reliability Using the ICCs

Stratified by the time points within each ROI, a two-way ANOVA was performed by regarding all of the observers, time points, and repetitions as fixed. We specified a mixed-effects model for simplicity. Due to the complexity of the variance components, we instead adopted a hybrid approach by considering two effects at once. For example, all subjects were segmented by the same observers who were from an entire population of observers. In other words, the subject effect was always assumed to be random, while the remaining effect (e.g., here the observer) was assumed to be fixed. We computed the Case-3 ICCs, accordingly [20].

We simplified our notation by keeping only the indices for the subject and observer effects of interest. We decomposed the data as follows:

$$Y_{ij} = \mu + S_i + o_j + (So)_{ij} + \varepsilon_{ij},$$

where the subject effect $S_i$ was assumed to be random (denoted by an upper-case letter), with a normal distribution with mean 0 and variance $\sigma_S^2$, for all $i = 1, \ldots, I$ (here $I = 9$); the observer effect $o_j$ was considered to be fixed (denoted by a lower-case letter), with the constraint $\sum_{j} o_j = 0$ and with the parameter corresponding to the variance being $\theta_o^2 = \sum_{j} o_j^2 / (J - 1)$, for all $j = 1, \ldots, J$ (here $J = 2$); the interaction term $(So)_{ij}$ between the subject and the observer was the degree to which the $j$th observer departed from his or her usual rating tendencies for the $i$th subject, with a normal distribution with a mean of 0 and variance $\sigma_{So}^2$; and the error terms $\varepsilon_{ij}$ were assumed to be independent and identically distributed (iid) normal with a mean of 0 and variance $\sigma_{\varepsilon}^2$. For the same $i$th subject, the interaction effects were further assumed to be subject to the constraint $\sum_{j} (So)_{ij} = 0$ over all of the observers. The corresponding two-way ANOVA table is given in Table 3.

Shrout and Fleiss defined the ICC as the ratio of the subject variance to the total variance, with its estimated version obtained from the mean squares of the ANOVA (Table 3) [20]:

$$\widehat{\mathrm{ICC}} = \frac{\mathrm{BMS} - \mathrm{EMS}}{\mathrm{BMS} + (J - 1)\,\mathrm{EMS}}, \tag{12}$$

where BMS was the between-subjects mean square and EMS was the residual (error) mean square.
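The Case-3 estimate can be sketched directly from the two-way ANOVA mean squares, assuming one value per subject-observer cell. The function name `icc_case3` and the ratings are hypothetical. The example is chosen so that the second observer reads exactly one unit higher than the first: because the observer effect is treated as fixed, such a constant offset does not reduce the Case-3 ICC.

```python
from typing import Sequence


def icc_case3(ratings: Sequence[Sequence[float]]) -> float:
    """Single-measure Case-3 ICC: (BMS - EMS) / (BMS + (J - 1) * EMS).

    ratings[i][j] is the value for subject i recorded by observer j.
    """
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 subjects and observers")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_observers = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_subjects - ss_observers
    bms = ss_subjects / (n - 1)            # between-subjects mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)


# Hypothetical values: observer 2 is always exactly 1 unit above observer 1.
offset_only = [[40.0, 41.0], [42.0, 43.0], [45.0, 46.0], [39.0, 40.0]]
icc_offset = icc_case3(offset_only)
print(icc_offset)  # approximately 1.0
```

The same function applies to the intraobserver setting by placing time points, rather than observers, in the columns.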

###### 2.4.6. Intraobserver Reliability Using the ICCs

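Similar to the analysis described above, we adopted a hybrid approach by considering two effects at once, with the subject effect always assumed to be random and the time point assumed to be fixed. The associated model was given by

$$Y_{ik} = \mu + S_i + t_k + (St)_{ik} + \varepsilon_{ik}.$$

As in (12), the intraobserver agreement was estimated by

$$\widehat{\mathrm{ICC}} = \frac{\mathrm{BMS} - \mathrm{EMS}}{\mathrm{BMS} + (K - 1)\,\mathrm{EMS}},$$

where $K = 2$ was the number of time points, BMS and EMS were the between-subjects and residual mean squares of the corresponding two-way ANOVA, and the interaction term $(St)_{ik}$ between the subject and the time point had a normal distribution with a mean of 0 and variance $\sigma_{St}^2$.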

###### 2.4.7. Sensitivity Analyses of the ICCs under Various Models

We performed a sensitivity analysis by computing the 6 different ICC values corresponding to the assumptions previously proposed by Shrout and Fleiss (Table 4) [20]. A SAS macro, written by Professor Robert Hamer, University of North Carolina School of Medicine, Chapel Hill, NC, USA (http://www.bios.unc.edu/~hamer), was run to perform the various ICC computations.
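For readers without access to the SAS macro, the six Shrout-Fleiss forms can be sketched from four ANOVA mean squares. This is an independent illustration rather than a port of that macro; the function names are hypothetical, and the small ratings table (6 targets by 4 raters) is illustrative and is not data from this study.

```python
from typing import Dict, Sequence, Tuple


def mean_squares(x: Sequence[Sequence[float]]) -> Tuple[float, float, float, float]:
    """Return (BMS, WMS, JMS, EMS) for a complete targets-by-raters table."""
    n, k = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (n * k)
    row_means = [sum(row) / k for row in x]
    col_means = [sum(row[j] for row in x) / n for j in range(k)]
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)                   # between targets
    wms = (ss_cols + ss_err) / (n * (k - 1))  # within targets
    jms = ss_cols / (k - 1)                   # between raters
    ems = ss_err / ((n - 1) * (k - 1))        # residual
    return bms, wms, jms, ems


def six_iccs(x: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Single- and average-measure ICCs for Cases 1, 2, and 3."""
    n, k = len(x), len(x[0])
    bms, wms, jms, ems = mean_squares(x)
    return {
        "ICC(1,1)": (bms - wms) / (bms + (k - 1) * wms),
        "ICC(2,1)": (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n),
        "ICC(3,1)": (bms - ems) / (bms + (k - 1) * ems),
        "ICC(1,k)": (bms - wms) / bms,
        "ICC(2,k)": (bms - ems) / (bms + (jms - ems) / n),
        "ICC(3,k)": (bms - ems) / bms,
    }


# Illustrative ratings: rows are targets, columns are raters.
table = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
iccs = six_iccs(table)
for name, value in iccs.items():
    print(name, round(value, 2))
```

Comparing the six values for the same table shows how strongly the reliability estimate depends on whether raters are treated as random or fixed and on whether single or averaged ratings are of interest.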

#### 3. Results

##### 3.1. Descriptive Statistics

Eleven healthy adults provided written informed consent to be evaluated and 9 underwent brain scans. Mean age of participants who received scans was years; 7 participants were men and 2 were women.

The mean ROI values varied across different regions (Table 5). The left and right hemispheres tended to yield similar results when the average over these healthy subjects was considered.

##### 3.2. Concordance Using Spearman’s Rank Correlation Coefficients

Spearman’s rank correlation coefficients showed that a majority of correlations within each observer was above 0.5, suggesting a moderate to high concordance (Figure 3). Time point 2 tended to yield higher concordance between the observers, which suggested a possible learning effect over time (Figure 4). Due to limited sample sizes in this pilot study, in Figures 3 and 4, we demonstrated the effect of observers by averaging over repetitions by each observer. Similarly, we demonstrated the effect of time points by averaging over repetitions at each time point.

##### 3.3. Reproducibility Using Coefficients of Variations

Overall, CVs ranged from 1.2% in the genu for Observer 2 to 7.0% in the right hippocampus for Observer 1 (Table 6). Since all of the CVs were at most 7.0%, and thus below 10%, the reproducibility was reasonably high.

##### 3.4. Normality and Significance Tests via a Multivariate Analysis

Marginal tests of the normality assumption using the Shapiro-Wilk test indicated that this assumption was not met only occasionally (e.g., for the left caudate, left and right putamen, and right hippocampus; see Table 7). Therefore, it was reasonable to specify the linear mixed-effects models and two-way ANOVAs reported in Sections 3.5 and 3.6.

##### 3.5. Interobserver Reliability Using the ICCs

At time point 1, ICCs were greater than 0.7 in regions of genu, left and right putamen, whereas ICCs were from 0.5 to 0.7 in regions of splenium, left and right hippocampus, left caudate, and right cerebral white matter (Table 8). These results indicated moderate to strong interobserver reliability. In comparison, at time point 2, ICCs were greater than 0.7 in regions of genu, splenium, left and right caudate, putamen and cerebral white matter, and left hippocampus and thalamus, while ICCs were from 0.5 to 0.7 in right hippocampus and thalamus. These results suggested a learning effect over time. However, for some ROIs such as the left cerebral white matter, right caudate, right thalamus, ICCs increased from 0.2 (at time point 1) to 0.9 (at time point 2), making it difficult to determine whether this represents a learning effect.

##### 3.6. Intraobserver Reliability Using the ICCs

At each time point, intraobserver agreement was at least 0.5 for a majority of the regions (Table 9).

##### 3.7. Sensitivity Analyses of the ICCs under Various Models

Six different methods for generating ICCs exhibited similar patterns for high vs. low reliability results in different ROIs (Table 10). Thus, reliability appeared to be sensitive to ROI.

#### 4. Conclusions and Discussion

We present statistical methods for evaluating high resolution MT brain images acquired at 3T. Our image analysis may provide useful pilot information for future investigations. These mathematical and statistical methods may easily be generalized to practical studies with larger sample sizes or to studies of patients with active disease.

We acquired repeat brain measurements based on a high resolution MT imaging protocol at 3T in 9 healthy adults. Our results indicate moderate to high reproducibility, supporting the validity of this method for further studies. Overall, higher intraobserver reliability was observed at the second time point than at the initial time point, suggesting a possible learning curve effect for both observers. Interobserver reliability was generally lower than intraobserver reliability, suggesting a strong observer effect in this comparison, which may be a factor in future investigations using MT imaging.

Our analyses examined different aspects of a typical observer-agreement study, using measures for concordance, reproducibility, reliability, variance-component analysis, and multivariate analysis. In other studies, all or some of these methods may be considered. A simpler study, with either several observers or one observer with several repetitions at different sessions or time points, may require only some of our methods. Only a small sample of healthy volunteers was evaluated in this initial pilot study. Therefore, the generalization of the 95-percentile normality range may be limited with respect to the wider spectrum of brain mechanisms represented in the broader population. For instance, demonstrating summary measures using all possible observer and time point combinations may not lead to meaningful interpretations in all cases. Nevertheless, since the technology is new, this research may provide useful pilot information for future investigations. Moreover, the statistical methods employed and illustrated here may easily be generalized to studies with larger sample sizes and diseased subjects.

Another limitation was that this study aimed to evaluate only the reproducibility and reliability, rather than the accuracy in a more comprehensive validation study. In the absence of a true gold standard, such as one based on digital phantoms where realistic variability may still not be simulated, or on histopathology, improved reliability may not be equated with improved accuracy [21]. Both sensitivity and specificity are of interest. Further research would benefit from a useful algorithm to perhaps statistically and optimally estimate the underlying spatial “ground truth” [22, 23].

Finally, future research may be directed to evaluating the diagnostic utility of high resolution MT for early detection of Alzheimer’s disease, multiple sclerosis or other neurological disorders and for monitoring progression across the clinical course.

#### Acknowledgments

None of the authors of this study had any conflict of interest. This study was partially supported by research Grants 1R01MH080636-01A2, NorthShore University Health System Pilot Grant EH07-267, and Alzheimer’s Drug Discovery Foundation (ISOA 271222). The authors are grateful for the assistance of Fiona Malone and Yuyuan Ouyang. In addition, they gratefully acknowledge the SAS macro for computing various ICCs, developed by Dr. Robert M. Hamer, Professor of Psychiatry and Research Professor of Biostatistics, University of North Carolina School of Medicine, Chapel Hill, NC, USA. Dr. DeTora is a paid employee of Novartis Vaccines and Diagnostics, Cambridge, MA, USA.

#### References

1. N. J. Kabani, J. G. Sled, A. Shuper, and H. Chertkow, “Regional magnetization transfer ratio changes in mild cognitive impairment,” *Magnetic Resonance in Medicine*, vol. 47, no. 1, pp. 143–148, 2002.
2. W. M. van der Flier, D. M. J. van den Heuvel, A. W. E. Weverling-Rijnsburger, et al., “Magnetization transfer imaging in normal aging, mild cognitive impairment, and Alzheimer's disease,” *Annals of Neurology*, vol. 52, no. 1, pp. 62–67, 2002.
3. F. Agosta, M. Rovaris, E. Pagani, M. P. Sormani, G. Comi, and M. Filippi, “Magnetization transfer MRI metrics predict the accumulation of disability 8 years later in patients with multiple sclerosis,” *Brain*, vol. 129, no. 10, pp. 2620–2627, 2006.
4. J. T. Chen, D. L. Collins, H. L. Atkins, et al., “Magnetization transfer ratio evolution with demyelination and remyelination in multiple sclerosis lesions,” *Annals of Neurology*, vol. 63, no. 2, pp. 254–262, 2008.
5. M. Cercignani, M. R. Symms, M. Ron, and G. J. Barker, “3D MTR measurement: from 1.5 T to 3.0 T,” *NeuroImage*, vol. 31, no. 1, pp. 181–186, 2006.
6. G. Helms, B. Draganski, R. Frackowiak, J. Ashburner, and N. Weiskopf, “Improved segmentation of deep brain grey matter structures using magnetization transfer (MT) parameter maps,” *NeuroImage*, vol. 47, no. 1, pp. 194–198, 2009.
7. Y. Wu, P. Storey, A. Carrillo, et al., “Whole brain and localized magnetization transfer measurements are associated with cognitive impairment in patients infected with human immunodeficiency virus,” *American Journal of Neuroradiology*, vol. 29, no. 1, pp. 140–145, 2008.
8. R. R. Edelman, “MR imaging of the pancreas: 1.5T versus 3T,” *Magnetic Resonance Imaging Clinics of North America*, vol. 15, no. 3, pp. 349–353, 2007.
9. P. S. Tofts, S. C. A. Steens, M. Cercignani, et al., “Sources of variation in multi-centre brain MTR histogram studies: body-coil transmission eliminates inter-centre differences,” *Magnetic Resonance Materials in Physics, Biology and Medicine*, vol. 19, no. 4, pp. 209–222, 2006.
10. P. Graham, “Modelling covariate effects in observer agreement studies: the case of nominal scale agreement,” *Statistics in Medicine*, vol. 14, no. 3, pp. 299–310, 1995.
11. S. R. Filipović and V. S. Kostić, “Utility of auditory P300 in detection of presenile dementia,” *Journal of the Neurological Sciences*, vol. 131, no. 2, pp. 150–155, 1995.
12. J. M. Bland and D. G. Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” *The Lancet*, vol. 1, no. 8476, pp. 307–310, 1986.
13. T. S. Hettmansperger, *Statistical Inference Based on Ranks*, Krieger, Malabar, Fla, USA, 1991.
14. K. H. Zou, K. Tuncali, and S. G. Silverman, “Correlation and simple linear regression,” *Radiology*, vol. 227, no. 3, pp. 617–622, 2003.
15. A. P. Zijdenbos, B. M. Dawant, R. A. Margolin, and A. C. Palmer, “Morphometric analysis of white matter lesions in MR images: method and validation,” *IEEE Transactions on Medical Imaging*, vol. 13, no. 4, pp. 716–724, 1994.
16. S. S. Shapiro and M. B. Wilk, “An analysis of variance test for normality (complete samples),” *Biometrika*, vol. 52, pp. 591–611, 1965.
17. G. E. P. Box and D. R. Cox, “An analysis of transformations,” *Journal of the Royal Statistical Society. Series B*, vol. 26, pp. 211–252, 1964.
18. K. H. Zou and A. J. O'Malley, “A Bayesian hierarchical non-linear regression model in receiver operating characteristic analysis of clustered continuous diagnostic data,” *Biometrical Journal*, vol. 47, no. 4, pp. 417–427, 2005.
19. A. J. O'Malley and K. H. Zou, “Bayesian multivariate hierarchical transformation models for ROC analysis,” *Statistics in Medicine*, vol. 25, no. 3, pp. 459–479, 2006.
20. P. E. Shrout and J. L. Fleiss, “Intraclass correlations: uses in assessing rater reliability,” *Psychological Bulletin*, vol. 86, no. 2, pp. 420–428, 1979.
21. K. H. Zou, W. M. Wells III, R. Kikinis, and S. K. Warfield, “Three validation metrics for automated probabilistic image segmentation of brain tumours,” *Statistics in Medicine*, vol. 23, no. 8, pp. 1259–1282, 2004.
22. S. K. Warfield, K. H. Zou, and W. M. Wells, “Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation,” *IEEE Transactions on Medical Imaging*, vol. 23, no. 7, pp. 903–921, 2004.
23. S. K. Warfield, K. H. Zou, and W. M. Wells, “Validation of image segmentation by estimating rater bias and variance,” *Philosophical Transactions of the Royal Society A*, vol. 366, no. 1874, pp. 2361–2375, 2008.