International Journal of Biomedical Imaging

Special Issue: Mathematical Methods for Images and Surfaces

Research Article | Open Access

Volume 2010 | Article ID 618747 | https://doi.org/10.1155/2010/618747

Kelly H. Zou, Hongyan Du, Shawn Sidharthan, Lisa M. DeTora, Yunmei Chen, Ann B. Ragin, Robert R. Edelman, Ying Wu, "Statistical Evaluations of the Reproducibility and Reliability of 3-Tesla High Resolution Magnetization Transfer Brain Images: A Pilot Study on Healthy Subjects", International Journal of Biomedical Imaging, vol. 2010, Article ID 618747, 11 pages, 2010. https://doi.org/10.1155/2010/618747

Statistical Evaluations of the Reproducibility and Reliability of 3-Tesla High Resolution Magnetization Transfer Brain Images: A Pilot Study on Healthy Subjects

Academic Editor: Shan Zhao
Received: 29 Sep 2009
Accepted: 04 Dec 2009
Published: 09 Feb 2010

Abstract

Magnetization transfer imaging (MT) may have considerable promise for early detection and monitoring of subtle brain changes before they are apparent on conventional magnetic resonance images. At 3 Tesla (T), MT affords higher resolution and increased tissue contrast associated with macromolecules. The reliability and reproducibility of a new high-resolution MT strategy were assessed in brain images acquired from 9 healthy subjects. Repeated measures were taken for 12 brain regions of interest (ROIs): genu, splenium, and the left and right hemispheres of the hippocampus, caudate, putamen, thalamus, and cerebral white matter. Spearman's correlation coefficient, coefficient of variation, and intraclass correlation coefficient (ICC) were computed. Multivariate mixed-effects regression models were used to fit the mean ROI values and to test the significance of the effects due to region, subject, observer, time, and manual repetition. A sensitivity analysis of various model specifications and the corresponding ICCs was conducted. Our statistical methods may be generalized to many similar evaluative studies of the reliability and reproducibility of various imaging modalities.

1. Introduction

Magnetization transfer (MT) imaging is a quantitative approach for detecting subtle or occult abnormalities in brain tissue. In previous studies, the Magnetization Transfer Ratio (MTR), an index of MT imaging, was sensitive to brain changes in patients with mild cognitive impairment, an Alzheimer’s disease prodrome [1, 2], to new lesions in patients with multiple sclerosis [3], and to changes associated with progression in chronic neurological disorders [4]. The higher magnetic field strength afforded by 3T allows MT image resolution to be augmented compared with conventional MT acquisition at 1.5T [5–7]. We developed a high resolution MT technique to detect subtle changes in anatomically small, functionally eloquent brain structures. The increased field strength affords whole-brain coverage with considerably thinner slices, potentially reducing partial volume artifacts. However, even among healthy subjects, numerous factors may introduce variability in measures derived from magnetic resonance (MR) data, such as static field signal dropout and RF nonuniformity. Measurement variation may be introduced by scan repetitions, repositioning at different time points, and image post-processing. Moreover, 3T may be susceptible to variation associated with increased field strength [8]. Such variability may pose limitations when conducting clinical comparisons to differentiate normal and diseased brains or in developing statistically predictive algorithms.

To validate high resolution MT for detecting early disease or for monitoring progression in chronic neurological disease, it is necessary to collect normative values and to evaluate the reliability and reproducibility of the measurements when repeated across time in healthy controls. This investigation evaluated observer agreement of high-resolution MT measurements determined from repeated brain scans of 9 healthy volunteers. We postulated that MT values would remain stable during the one-month study interval. We evaluated the reliability and reproducibility of the high resolution MT measurements in 12 brain regions of interest (ROIs), applied statistical measures to the data, and used multivariate mixed-effects models to test the statistical significance of effects due to region, subject, observer, time, and manual repetition.

2. Materials and Methods

2.1. Study Subjects

The study was approved by the IRB at NorthShore University HealthSystem and conducted following the ethical principles outlined in the Declaration of Helsinki. Eleven healthy adult volunteers, randomly selected from a database maintained at the Center for Advanced Imaging, Radiology Department, NorthShore University HealthSystem, provided written informed consent and were evaluated against the eligibility criteria. To protect the subjects’ confidentiality, all data were de-identified and handled according to the guidelines specified by the Health Insurance Portability and Accountability Act (HIPAA) in the USA.

2.2. Image Acquisition

Brain images were acquired using a 3T General Electric (GE) HDx system (Waukesha, WI, USA). Each volunteer was scanned twice, with a randomly selected interval of 1 to 4 weeks between scans. Methods for reducing random errors in image acquisition included the use of a body-coil for excitation to control B1 nonuniformities and an 8-channel quadrature receive-only coil [9]. MT pulses with (M_s) and without (M_0) saturation were applied at an offset frequency from the water resonance. To accelerate the scan for whole-brain coverage while maintaining thin slices, the image protocol was optimized for 3T using 3D SPGR [5]. The Gaussian sinc MT pulse was applied in 8 ms at a 1200 Hz offset. The stability of the scanner and set-up procedure were addressed with a fixed set of parameters per subject. The MT pulse was based on a three-dimensional spoiled gradient recalled (3D SPGR) acquisition. The image protocol included the following parameters: TR 34 to 35 ms, TE 4 to 8 ms, imaging FA, bandwidth 15.6 kHz, 0.75 NEX, phase FOV 0.75, voxel dimensions 0.9 × 0.9 × 0.9–1.3 mm³. The whole brain was covered in 90 to 140 slices, with acquisition time ranging from 7 minutes 40 seconds to 10 minutes 20 seconds using a partial k-space acquisition.

2.3. Image Analysis

MTR maps were generated off-line on a General Electric AW Workstation (General Electric, Milwaukee, WI, USA) using the standard equation:

MTR = (M_0 − M_s) / M_0 × 100%,

where M_s and M_0 were the signal intensities in a given voxel obtained with and without the MT saturation pulse, respectively. MTR maps generated based on the high resolution MT are demonstrated in Figure 1. The 12 ROIs were: genu, splenium, and the left and right hemispheres of the hippocampus, caudate, putamen, thalamus, and cerebral white matter. Figure 2 illustrates the 12 ROIs that were investigated. Each ROI was sized approximately 30 to 43 mm² and manually and independently placed by Observers 1 and 2 (Authors S.S. and Y.W.) following procedures in classical and standard agreement studies [10]. After an initial consensus decision was reached regarding the sizes and locations of the 12 ROIs, the observers performed manual segmentations of each ROI independently on each set of images. This ROI placement procedure was repeated by each observer in the following week.
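The voxel-wise MTR computation above is straightforward to script. The following is a minimal Python/NumPy sketch; the array names and toy intensity values are illustrative, not data from the study:

```python
import numpy as np

def mtr_map(m0, ms, eps=1e-6):
    """Voxel-wise magnetization transfer ratio in percent:
    MTR = (M0 - Ms) / M0 * 100."""
    m0 = np.asarray(m0, dtype=float)
    ms = np.asarray(ms, dtype=float)
    out = np.zeros_like(m0)
    mask = m0 > eps  # skip background voxels to avoid division by zero
    out[mask] = (m0[mask] - ms[mask]) / m0[mask] * 100.0
    return out

# toy 2x2 "image": 50% and 75% saturation, plus a zero background voxel
m0 = np.array([[100.0, 200.0], [100.0, 0.0]])
ms = np.array([[50.0, 100.0], [25.0, 0.0]])
print(mtr_map(m0, ms))
```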

MTR values were extracted using the manually-defined ROIs with the combinations of observer, time point, and repetition (Table 1). The mean and SDs of the ROI values were calculated. Meta-data were stored in a SAS 9.1 (SAS, Cary, NC, USA) dataset, with individual volunteer identification numbers withheld and replaced by a sequence of 1 to 9 for each subject.


Outcome Variable: Mean ROI Value via Manual Segmentations

Effect in the Variance-Component Analysis | Type of Effect | Index | Maximum of the Index
Subject | Random | i | 9
Observer | Fixed or Random | j | 2
Time Point | Fixed or Random | k | 2
Repetition | Fixed or Random | l | 2
Region of Interest | Fixed | r | 12
Interaction Terms | Generally mixed, based on the appropriate model specification | |


Absolute Value of the Correlation Coefficient | Strength of the Concordance Between Samples

0.0 | None
0.2 | Weak
0.5 | Moderate
0.8 | Strong
1.0 | Perfect

2.4. Statistical Methods

Statistical analyses were performed using SAS 9.1 (SAS Institute, Cary, NC, USA; http://www.sas.com). The SAS analytic procedures conducted included “Proc Univariate,” “Proc Means,” “Proc Corr,” and “Proc Mixed.” Bar diagrams were constructed using Microsoft Excel (http://www.microsoft.com). Age and gender were not controlled for in analyses.

2.4.1. Descriptive Statistics

Let Y, with the indices described in Table 1, be a random variable representing the mean ROI value. For the rth ROI, we first computed the sample mean and standard deviation of all mean ROI values:

\bar{Y}_{\cdot\cdot\cdot r} = \frac{1}{n} \sum_{i,j,k,l} Y_{ijklr}, \qquad S_r = \sqrt{\frac{1}{n-1} \sum_{i,j,k,l} \left( Y_{ijklr} - \bar{Y}_{\cdot\cdot\cdot r} \right)^2},

where n = 9 × 2 × 2 × 2 = 72 measurements and the operator “·” denotes the marginal sum over the particular index.

The 95-percentile normality range was approximately within the following interval, with the following lower and upper bounds:

\bar{Y}_{\cdot\cdot\cdot r} - 2 S_r \quad \text{and} \quad \bar{Y}_{\cdot\cdot\cdot r} + 2 S_r.

The term “normality range,” as used in Europe, may be arbitrarily defined according to the number of standard deviations away from the mean [11]. Thus, it should not be viewed as the range of the entire dataset, but rather as an interval useful for estimating the population value within one or several standard deviations of the mean. Here the critical value of 2 was chosen as recommended by Bland and Altman [12].

Additionally, we justified this multiplier using a Student’s t-distribution with n − 1 = 71 degrees of freedom. For any tail probability α/2 (e.g., 0.025 for a 95-percent normality range), we used the corresponding quantile of the t-distribution, such that

t_{71, 0.975} \approx 1.99.

This value happened to be close to the recommended multiplier of 2. Therefore, we rounded it to 2 in (3) for convenience.
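The normality range and the t-based justification of the multiplier can be checked numerically. A minimal sketch using SciPy follows; the sample values are hypothetical:

```python
import numpy as np
from scipy import stats

def normality_range(values, mult=2.0):
    """Mean +/- mult * sample SD over the pooled measurements for one ROI."""
    v = np.asarray(values, dtype=float)
    m, s = v.mean(), v.std(ddof=1)  # ddof=1 gives the sample standard deviation
    return m - mult * s, m + mult * s

# hypothetical pooled values with mean 77 and sample SD 2
lo, hi = normality_range([75.0, 77.0, 79.0])
print(lo, hi)  # 73.0 81.0

# With n = 72 pooled measurements there are 71 degrees of freedom, and the
# exact two-sided 95% t multiplier is close to the rounded value of 2:
t_mult = stats.t.ppf(0.975, df=71)
print(round(t_mult, 3))  # approximately 1.994
```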

2.4.2. Concordance Using Spearman’s Rank Correlation Coefficients

We first explored and measured the concordance between the various measurements fully nonparametrically via Spearman’s rank correlation coefficient. Suppose that we correlated the ROI values of Observers j = 1 and j = 2, and denote the marginal ranks R_{i1} and R_{i2}, respectively, for all subjects i = 1, …, n. The sample version of Pearson’s product-moment correlation coefficient between the ranks of the data is equivalent to Spearman’s rank correlation coefficient [13]:

\rho_s = \frac{\sum_i (R_{i1} - \bar{R}_1)(R_{i2} - \bar{R}_2)}{\sqrt{\sum_i (R_{i1} - \bar{R}_1)^2 \sum_i (R_{i2} - \bar{R}_2)^2}},

where \bar{R}_1 and \bar{R}_2 denote the mean ranks for Observers 1 and 2.

Assuming no ties, since the ROI values were continuous random variables, Spearman’s rank correlation coefficient between the two observers reduces to

\rho_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},

where d_i = R_{i1} − R_{i2} denotes the difference of the marginal ranks for an arbitrary subject i. Consequently, all of the raw mean ROI values were converted to their marginal ranks, and the differences between the ranks of each observation on the two variables were computed. Spearman’s rank correlation coefficient was also computed for the ROI values between the two time points k = 1 and k = 2.

The strength of the concordance and the benchmark values have been discussed [14]. Bar diagrams were made to display the Spearman’s rank correlation coefficients between observers or time points for each ROI.
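For readers replicating this step outside SAS, Spearman’s coefficient can be computed either from the no-ties rank formula or with a library routine, and the two agree. A sketch with simulated observer pairs (all values hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
obs1 = rng.normal(60.0, 2.0, size=36)        # hypothetical mean ROI values, Observer 1
obs2 = obs1 + rng.normal(0.0, 1.0, size=36)  # Observer 2: same subjects, rating noise

rho, p = stats.spearmanr(obs1, obs2)

# equivalent "by hand": 1 - 6*sum(d_i^2)/(n*(n^2-1)) when there are no ties
r1, r2 = stats.rankdata(obs1), stats.rankdata(obs2)
d = r1 - r2
n = len(obs1)
rho_hand = 1.0 - 6.0 * np.sum(d**2) / (n * (n**2 - 1))
print(round(rho, 3), round(rho_hand, 3))
```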

2.4.3. Reproducibility Using Coefficients of Variation

We used a normalized measure of dispersion of a distribution to evaluate the reproducibility of the measurement [15]. The measure was the coefficient of variation (CV), defined as the ratio of the SD to the mean:

CV_r = S_r / \bar{Y}_{\cdot\cdot\cdot r},

where both the numerator (the sample SD) and the denominator (the sample mean) are provided in (2). For skewed data, such as those generated by an exponential distribution, the underlying population mean and standard deviation are equal, and thus the CV becomes 1. Hence, CV < 1 would generally represent low variability, and CV > 1 would represent high variability. As in (4) and (6), further stratified computations of the CV for different observers, time points, or repetitions were achieved using formulae similar to (7).
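A short sketch of the CV computation; the sample values are hypothetical, chosen to resemble pooled genu MTRs:

```python
import numpy as np

def coefficient_of_variation(values):
    """CV = sample SD / sample mean, reported here as a percentage."""
    v = np.asarray(values, dtype=float)
    return v.std(ddof=1) / v.mean() * 100.0

# hypothetical genu-like values: mean 77.0, sample SD ~0.79, so CV ~1%
genu = np.array([76.0, 77.0, 78.0, 76.5, 77.5])
print(round(coefficient_of_variation(genu), 2))  # 1.03
```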

2.4.4. Normality and Significance Tests for the Effects via a Multivariate Regression Analysis

Overall variability was likely a result of the effects illustrated in Table 1. We therefore employed a multivariate mixed-effects regression analysis to directly model the ROI values.

A variance-component approach has advantages over many stratified analyses, especially in studies with a limited sample size. Here, because of the novel imaging modality using MT and 3T acquisitions with labor-intensive manual segmentation procedures, a large number of subjects would not have been feasible. To conduct an analysis of variance (ANOVA) based on the various effects, a distributional assumption of normality was necessary and convenient. Therefore, we conducted marginal normality tests using the Shapiro-Wilk test [16]. We demonstrate (see Section 3.4) that the normality assumption was generally satisfactory.
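A marginal Shapiro-Wilk test of one stratum can be run as follows; the simulated values below are hypothetical stand-ins for the mean ROI values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# hypothetical mean ROI values for one stratum (one region/observer/time cell)
roi_values = rng.normal(loc=61.6, scale=2.3, size=18)

w, p = stats.shapiro(roi_values)
print(round(w, 3), round(p, 3))  # reject marginal normality when p < .05
```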

Thus, we could then consider adopting a linear random-effects model with all pairwise interactions, in addition to a third-order interaction term:

Y_{ijkl} = \mu + a_i + b_j + c_k + d_l + (\text{pairwise interaction terms}) + (\text{a third-order interaction term}) + e_{ijkl}.

The effects represented the following: \mu the intercept, a_i the subjects, b_j the observers, c_k the time points, d_l the repetitions, and e_{ijkl} the error term. A random-effects model assumed that each of the effects had an independent normal distribution with mean 0 and its own variance component.

If normality had failed, and because the data were mean ROI values that were positively valued, we would recommend a Box-Cox transformation, Y^{(\lambda)}, of the outcome variable with an optimal power coefficient \lambda [17–19]. Note that the log-normal becomes a special case when the power coefficient \lambda = 0. This normality transformation is given by:

Y^{(\lambda)} = \begin{cases} (Y^{\lambda} - 1)/\lambda, & \lambda \neq 0, \\ \log Y, & \lambda = 0. \end{cases}

A profile log-likelihood, llik(\lambda), of \lambda given the observations would be maximized to estimate an optimal Box-Cox transformation via a nonlinear minimization routine, where the log-likelihood was

llik(\lambda) = -\frac{n}{2} \log \hat{\sigma}^2(\lambda) + (\lambda - 1) \sum_{i=1}^{n} \log Y_i + C,

where \hat{\sigma}^2(\lambda) was the residual variance of the transformed data and C was a constant free of the power coefficient to be optimized.
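The Box-Cox step can be reproduced with SciPy, which maximizes the same profile log-likelihood over the power coefficient. A sketch on simulated positive, right-skewed data (the distributional parameters are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=4.0, sigma=0.3, size=500)  # positive, right-skewed outcome

# scipy.stats.boxcox maximizes the profile log-likelihood over lambda;
# for log-normal data the optimum lies near 0 (the log special case)
y_bc, lam_hat = stats.boxcox(y)
print(round(lam_hat, 2))

def boxcox_transform(y, lam):
    """The transformation itself, with the log special case at lam = 0."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam
```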

Due to the limited number of subjects, however, even with an optimal normality transformation, over-fitting and non-convergence might be issues. Alternatively, we could regard all of the observers, time points, and repetitions as fixed and specify a mixed-effects model. The significance of each source of variability was tested via a restricted maximum likelihood (REML) approach. For our multivariate analysis, statistical significance was defined by a two-tailed p-value below .05.

2.4.5. Interobserver Reliability Using the ICCs

Stratified by the time points within each ROI, a two-way ANOVA was performed by regarding all of the observers, time points, and repetitions as fixed. We specified a mixed-effects model for simplicity. Due to the complexity of the variance components, we instead adopted a hybrid approach by considering two effects at once. For example, all subjects were segmented by the same observers who were from an entire population of observers. In other words, the subject effect was always assumed to be random, while the remaining effect (e.g., here the observer) was assumed to be fixed. We computed the Case-3 ICCs, accordingly [20].

We simplified our notation by keeping only the indices for the subject and observer effects of interest. We decomposed the data as follows:

Y_{ij} = \mu + A_i + b_j + (Ab)_{ij} + e_{ij},

where the subject effect A_i (an upper-case letter, indicating a random effect) had a normal distribution with mean 0 and variance \sigma_A^2, for all i = 1, …, 9; the observer effect b_j (a lower-case letter, indicating a fixed effect) satisfied the constraint \sum_j b_j = 0, with a corresponding variance-like parameter \theta_b^2, for all j = 1, 2; the interaction term (Ab)_{ij} between the subject and the observer was the degree to which the jth observer departed from his or her usual rating tendencies for the ith subject, and had a normal distribution with mean 0 and variance \sigma_{Ab}^2; and the error terms e_{ij} were assumed to be independent and identically distributed (iid) normal with mean 0 and variance \sigma_e^2. For the same ith subject, the interaction effects were further assumed to satisfy the constraint \sum_j (Ab)_{ij} = 0 over all of the observers. The corresponding two-way ANOVA table is listed in Table 3.


Source of Variation | Degrees of Freedom | Mean Squares

(A) Between Subjects | n − 1 | BSMS
(B) Within Subjects | n(k − 1) | WSMS
(B.1) Between Observers | k − 1 | OMS
(B.2) Error | (n − 1)(k − 1) | EMS

Note: BSMS: Between Subjects Mean Squares; WSMS: Within Subject Mean Squares; OMS: Observer Mean Squares; EMS: Error Mean Squares.

Shrout and Fleiss defined the ICC as the variance ratio of the subject variance over the total variance, with its estimated version using the quantities via ANOVA (Table 3) [20]:

\widehat{ICC}(3,1) = \frac{BSMS - EMS}{BSMS + (k-1)\,EMS},

where k is the number of observers (here k = 2).
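The Case-3 ICC estimator can be computed directly from the two-way ANOVA mean squares. A minimal NumPy sketch, with a hypothetical toy matrix; the constant observer offset illustrates that ICC(3,1) measures consistency rather than absolute agreement:

```python
import numpy as np

def icc_3_1(Y):
    """Case-3 ICC estimate, (BSMS - EMS) / (BSMS + (k-1)*EMS).

    Y: (n subjects) x (k observers) array of mean ROI values.
    """
    Y = np.asarray(Y, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    bsms = k * np.sum((Y.mean(axis=1) - grand) ** 2) / (n - 1)  # between-subjects MS
    resid = (Y - Y.mean(axis=1, keepdims=True)
               - Y.mean(axis=0, keepdims=True) + grand)
    ems = np.sum(resid ** 2) / ((n - 1) * (k - 1))              # error MS
    return (bsms - ems) / (bsms + (k - 1) * ems)

# hypothetical mean ROI values for 3 subjects; Observer 2 rates a constant
# 2 units higher, so consistency is perfect and ICC(3,1) = 1
Y = np.array([[77.0, 79.0], [73.0, 75.0], [52.0, 54.0]])
print(icc_3_1(Y))
```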

2.4.6. Intraobserver Reliability Using the ICCs

Similar to the analysis described above, we adopted a hybrid approach by considering two effects at once, with the subject effect always assumed to be random and the time point assumed to be fixed. The associated model was given by

Y_{ik} = \mu + A_i + c_k + (Ac)_{ik} + e_{ik}.

As in (12), the intraobserver agreement and its estimate were provided by

\widehat{ICC} = \frac{BSMS - EMS}{BSMS + (K-1)\,EMS},

where K is the number of time points (here K = 2) and the interaction term (Ac)_{ik} between the subject and the time point had a normal distribution with mean 0 and variance \sigma_{Ac}^2.

2.4.7. Sensitivity Analyses of the ICCs under Various Models

We performed a sensitivity analysis by computing 6 different ICC values under the modeling assumptions that Shrout and Fleiss previously proposed for ICCs (Table 4) [20]. A SAS macro, written by Professor Robert Hamer, University of North Carolina School of Medicine, Chapel Hill, NC, USA (http://www.bios.unc.edu/~hamer), was run to perform the various ICC computations.


Notation for the ICC Measure | Multivariate Modeling Assumptions

ICC(1,1) | Each subject is rated by multiple observers; the observers are assumed to be randomly assigned to the subjects; all subjects have the same number of observers.
ICC(2,1) | All subjects are rated by the same observers, who are assumed to be a random subset of all possible observers.
ICC(3,1) | All subjects are rated by the same observers, who are assumed to be the entire population of observers.
ICC(1,2) | Same assumptions as ICC(1,1), but reliability for the mean of 2 ratings.
ICC(2,2) | Same assumptions as ICC(2,1), but reliability for the mean of 2 ratings.
ICC(3,2) | Same assumptions as ICC(3,1), but reliability for the mean of 2 ratings; additionally assumes there is no subject × observer interaction.

3. Results

3.1. Descriptive Statistics

Eleven healthy adults provided written informed consent to be evaluated and 9 underwent brain scans. Mean age of participants who received scans was years; 7 participants were men and 2 were women.

The mean ROI values varied across the different regions (Table 5). The left and right hemispheres tended to yield similar results when averaged over these healthy subjects.


Region of Interest | Mean ± SD | 95% Normality Range (Mean ± 2 SD)

Genu | 77.0 ± 1.0 | 75.0–79.0
Splenium | 72.8 ± 1.5 | 69.9–75.7
Left Hippocampus | 51.5 ± 2.5 | 46.6–56.4
Left Caudate | 59.5 ± 2.2 | 55.2–63.8
Left Putamen | 62.0 ± 2.0 | 58.1–65.9
Left Thalamus | 61.6 ± 2.3 | 57.1–66.1
Left Cerebral White Matter | 73.2 ± 1.2 | 70.8–75.6
Right Hippocampus | 52.0 ± 3.3 | 45.5–58.5
Right Caudate | 61.3 ± 1.7 | 58.0–64.6
Right Putamen | 62.8 ± 1.5 | 59.9–65.7
Right Thalamus | 61.1 ± 2.5 | 56.2–66.0
Right Cerebral White Matter | 73.0 ± 1.3 | 70.5–75.5

Note: Results were pooled among all 72 observations within each region of interest. SD: standard deviation.
3.2. Concordance Using Spearman’s Rank Correlation Coefficients

Spearman’s rank correlation coefficients showed that a majority of correlations within each observer were above 0.5, suggesting moderate to high concordance (Figure 3). Time point 2 tended to yield higher concordance between the observers, suggesting a possible learning effect over time (Figure 4). Due to the limited sample size of this pilot study, Figures 3 and 4 demonstrate the effect of observers by averaging over repetitions by each observer, and, similarly, the effect of time points by averaging over repetitions at each time point.

3.3. Reproducibility Using Coefficients of Variation

Overall, CVs ranged from 1.2% in the genu for Observer 2 to 7.0% in the right hippocampus for Observer 1 (Table 6). Since all of the CVs were at most 7%, well below 10%, the reproducibility was reasonably high.


Region of Interest | Observer 1: Mean ± SD (n = 36), CV (%) | Observer 2: Mean ± SD (n = 36), CV (%)

Genu | 76.9 ± 1.0, 1.3 | 77.1 ± 0.9, 1.2
Splenium | 73.1 ± 1.4, 1.9 | 72.6 ± 1.5, 2.1
Left Hippocampus | 51.3 ± 2.4, 4.7 | 51.6 ± 2.7, 5.2
Left Caudate | 59.7 ± 1.9, 3.2 | 59.3 ± 2.5, 4.2
Left Putamen | 61.9 ± 2.2, 3.6 | 62.1 ± 1.9, 3.1
Left Thalamus | 59.9 ± 1.5, 2.5 | 63.3 ± 1.7, 2.7
Left Cerebral White Matter | 73.3 ± 1.3, 1.8 | 73.1 ± 1.2, 1.6
Right Hippocampus | 52.5 ± 3.7, 7.0 | 51.5 ± 2.7, 5.2
Right Caudate | 61.2 ± 1.9, 3.1 | 61.5 ± 1.4, 2.3
Right Putamen | 62.7 ± 1.5, 2.4 | 62.8 ± 1.5, 2.4
Right Thalamus | 59.7 ± 1.7, 2.8 | 62.5 ± 2.5, 4.0
Right Cerebral White Matter | 73.2 ± 1.2, 1.6 | 72.8 ± 1.4, 1.9

Note. SD: standard deviation.
3.4. Normality and Significance Tests via a Multivariate Analysis

Marginal tests of the normal distribution assumption using the Shapiro-Wilk test indicated that this assumption was not met only occasionally (e.g., for the left caudate, left and right putamen, and right hippocampus; see Table 7). Therefore, it was reasonable to specify the linear mixed-effects models and two-way ANOVAs reported in Sections 3.5 and 3.6.


Region of Interest | p-value (Time 1, Observer 1) | p-value (Time 1, Observer 2) | p-value (Time 2, Observer 1) | p-value (Time 2, Observer 2)

Genu | .29 | .17 | .70 | .36
Splenium | .31 | .06 | .93 | .61
Left Hippocampus | .14 | .81 | .45 | .99
Left Caudate | .97 | .000* | .49 | .92
Left Putamen | .20 | .06 | .00* | .00*
Left Thalamus | .86 | .51 | .63 | .13
Left Cerebral White Matter | .82 | .43 | .21 | .02
Right Hippocampus | .54 | .86 | .00* | .58
Right Caudate | .49 | .80 | .60 | .89
Right Putamen | .07 | .00* | .25 | .00*
Right Thalamus | .50 | .68 | .82 | .13
Right Cerebral White Matter | .79 | .78 | .16 | .54

*Normal distribution was not met.
3.5. Interobserver Reliability Using the ICCs

At time point 1, ICCs were greater than 0.7 in the genu and the left and right putamen, whereas ICCs were between 0.5 and 0.7 in the splenium, left and right hippocampus, left caudate, and right cerebral white matter (Table 8). These results indicated moderate to strong interobserver reliability. In comparison, at time point 2, ICCs were greater than 0.7 in the genu, splenium, left and right caudate, putamen, and cerebral white matter, and the left hippocampus and thalamus, while ICCs were between 0.5 and 0.7 in the right hippocampus and thalamus. These results suggested a learning effect over time. However, for some ROIs, such as the left cerebral white matter, right caudate, and right thalamus, ICCs increased from approximately 0.2–0.4 (at time point 1) to 0.6–0.9 (at time point 2), making it difficult to determine whether this represents a learning effect.


Region of Interest | Inter-Reader ICC, Time Point 1 | Inter-Reader ICC, Time Point 2

Genu | 0.866 | 0.726
Splenium | 0.537 | 0.758
Left Hippocampus | 0.693 | 0.796
Left Caudate | 0.580 | 0.902
Left Putamen | 0.869 | 0.962
Left Thalamus | 0.410 | 0.855
Left Cerebral White Matter | 0.378 | 0.929
Right Hippocampus | 0.653 | 0.656
Right Caudate | 0.209 | 0.872
Right Putamen | 0.725 | 0.882
Right Thalamus | 0.264 | 0.572
Right Cerebral White Matter | 0.637 | 0.896

3.6. Intraobserver Reliability Using the ICCs

At each time point, intraobserver agreement was at least 0.5 for a majority of the regions (Table 9).


Region of Interest | Intraobserver ICC, Observer 1 | Intraobserver ICC, Observer 2

Genu | 0.537 | 0.555
Splenium | 0.598 | 0.756
Left Hippocampus | 0.520 | 0.596
Left Caudate | 0.709 | 0.362
Left Putamen | 0.940 | 0.784
Left Thalamus | 0.479 | 0.622
Left Cerebral White Matter | 0.560 | 0.703
Right Hippocampus | 0.411 | 0.826
Right Caudate | 0.473 | 0.436
Right Putamen | 0.659 | 0.657
Right Thalamus | 0.687 | 0.308
Right Cerebral White Matter | 0.570 | 0.770

3.7. Sensitivity Analyses of the ICCs under Various Models

Six different methods for generating ICCs exhibited similar patterns of high versus low reliability across the different ROIs (Table 10). Thus, reliability appeared to be sensitive to the ROI.


Region of Interest | ICC(1,1) | ICC(2,1) | ICC(3,1) | ICC(1,2) | ICC(2,2) | ICC(3,2)

Interobserver ICC at Time 1

Genu | 0.870 | 0.879 | 0.866 | 0.931 | 0.935 | 0.928
Splenium | 0.497 | 0.463 | 0.537 | 0.664 | 0.633 | 0.699
Left Hippocampus | 0.653 | 0.605 | 0.693 | 0.790 | 0.754 | 0.819
Left Caudate | 0.562 | 0.542 | 0.580 | 0.719 | 0.703 | 0.734
Left Putamen | 0.871 | 0.874 | 0.869 | 0.931 | 0.933 | 0.930
Left Thalamus | | 0.114 | 0.410 | | 0.205 | 0.581
Left Cerebral White Matter | 0.382 | 0.385 | 0.378 | 0.553 | 0.556 | 0.549
Right Hippocampus | 0.660 | 0.669 | 0.653 | 0.795 | 0.802 | 0.790
Right Caudate | 0.178 | 0.180 | 0.209 | 0.302 | 0.306 | 0.346
Right Putamen | 0.725 | 0.732 | 0.720 | 0.840 | 0.845 | 0.837
Right Thalamus | | 0.079 | 0.264 | | 0.146 | 0.417
Right Cerebral White Matter | 0.630 | 0.621 | 0.637 | 0.773 | 0.766 | 0.779

Interobserver ICC at Time 2

Genu | 0.722 | 0.715 | 0.726 | 0.838 | 0.834 | 0.841
Splenium | 0.758 | 0.757 | 0.758 | 0.862 | 0.862 | 0.863
Left Hippocampus | 0.792 | 0.785 | 0.796 | 0.884 | 0.880 | 0.886
Left Caudate | 0.905 | 0.909 | 0.902 | 0.950 | 0.952 | 0.949
Left Putamen | 0.961 | 0.959 | 0.962 | 0.980 | 0.979 | 0.980
Left Thalamus | 0.297 | 0.239 | 0.855 | 0.458 | 0.385 | 0.922
Left Cerebral White Matter | 0.928 | 0.926 | 0.929 | 0.963 | 0.962 | 0.963
Right Hippocampus | 0.640 | 0.620 | 0.656 | 0.781 | 0.765 | 0.793
Right Caudate | 0.876 | 0.884 | 0.872 | 0.934 | 0.938 | 0.932
Right Putamen | 0.884 | 0.887 | 0.882 | 0.938 | 0.940 | 0.937
Right Thalamus | 0.419 | 0.347 | 0.572 | 0.591 | 0.516 | 0.728
Right Cerebral White Matter | 0.889 | 0.876 | 0.896 | 0.941 | 0.934 | 0.945


Region of Interest | ICC(1,1) | ICC(2,1) | ICC(3,1) | ICC(1,2) | ICC(2,2) | ICC(3,2)

Intraobserver ICC for Observer 1

Genu | 0.537 | 0.537 | 0.537 | 0.699 | 0.699 | 0.699
Splenium | 0.590 | 0.579 | 0.598 | 0.742 | 0.733 | 0.749
Left Hippocampus | 0.531 | 0.544 | 0.520 | 0.694 | 0.705 | 0.684
Left Caudate | 0.704 | 0.696 | 0.709 | 0.826 | 0.821 | 0.830
Left Putamen | 0.942 | 0.946 | 0.940 | 0.970 | 0.972 | 0.969
Left Thalamus | 0.481 | 0.484 | 0.479 | 0.650 | 0.653 | 0.647
Left Cerebral White Matter | 0.550 | 0.539 | 0.560 | 0.710 | 0.701 | 0.718
Right Hippocampus | 0.426 | 0.439 | 0.411 | 0.597 | 0.610 | 0.582
Right Caudate | 0.470 | 0.467 | 0.473 | 0.640 | 0.637 | 0.643
Right Putamen | 0.657 | 0.654 | 0.659 | 0.793 | 0.791 | 0.795
Right Thalamus | 0.696 | 0.711 | 0.687 | 0.821 | 0.831 | 0.814
Right Cerebral White Matter | 0.582 | 0.596 | 0.570 | 0.736 | 0.747 | 0.727

Intraobserver ICC for Observer 2

Genu | 0.563 | 0.572 | 0.555 | 0.720 | 0.728 | 0.714
Splenium | 0.760 | 0.767 | 0.756 | 0.864 | 0.868 | 0.861
Left Hippocampus | 0.607 | 0.623 | 0.596 | 0.756 | 0.767 | 0.747
Left Caudate | 0.365 | 0.367 | 0.362 | 0.535 | 0.537 | 0.531
Left Putamen | 0.790 | 0.800 | 0.784 | 0.883 | 0.889 | 0.879
Left Thalamus | 0.632 | 0.645 | 0.622 | 0.774 | 0.784 | 0.767
Left Cerebral White Matter | 0.712 | 0.726 | 0.703 | 0.832 | 0.841 | 0.826
Right Hippocampus | 0.829 | 0.835 | 0.826 | 0.907 | 0.910 | 0.905
Right Caudate | 0.432 | 0.429 | 0.436 | 0.603 | 0.601 | 0.607
Right Putamen | 0.667 | 0.682 | 0.657 | 0.800 | 0.811 | 0.793
Right Thalamus | 0.298 | 0.294 | 0.308 | 0.459 | 0.455 | 0.471
Right Cerebral White Matter | 0.777 | 0.789 | 0.770 | 0.875 | 0.882 | 0.870

4. Conclusions and Discussion

We presented statistical methods for evaluating high resolution MT brain images acquired at 3T. Our image analysis may provide useful pilot information for future investigations. These mathematical and statistical methods may easily be generalized to similar evaluative studies with larger sample sizes or to studies of patients with active disease.

We acquired repeat brain measurements based on a high resolution MT imaging protocol at 3T in 9 healthy adults. Our results indicate moderate to high reproducibility, supporting the validity of this method for further studies. Overall, higher intraobserver reliability was observed at the second time point than at the initial time point, suggesting a possible learning curve effect for both observers. Interobserver reliability was generally lower than intraobserver reliability, suggesting a strong observer effect in this comparison, which may be a factor in future investigations using MT imaging.

Our analyses examined different aspects of a typical observer-agreement study, using measures of concordance, reproducibility, and reliability, together with variance-component and multivariate analyses. In other studies, all or some of these methods may be considered. A simpler study, with either several observers or one observer performing several repetitions at different sessions or time points, may require only a subset of our methods. Only a small sample of healthy volunteers was evaluated in this initial pilot study. Therefore, the generalization of the 95-percentile normality range may be limited with respect to the wider spectrum of brain mechanisms represented in the broader population. For instance, demonstrating summary measures using all possible observer and time point combinations may not lead to meaningful interpretations in all cases. Nevertheless, since the technology is new, this research may provide useful pilot information for future investigations. Moreover, the statistical methods employed and illustrated here may easily be generalized to studies with larger sample sizes and diseased subjects.

Another limitation was that this study aimed to evaluate only reproducibility and reliability, rather than accuracy as in a more comprehensive validation study. In the absence of a true gold standard, such as one based on digital phantoms (where realistic variability may still not be simulated) or on histopathology, improved reliability may not be equated with improved accuracy [21]. Both sensitivity and specificity are of interest. Further research would benefit from an algorithm to statistically and optimally estimate the underlying spatial “ground truth” [22, 23].

Finally, future research may be directed to evaluating the diagnostic utility of high resolution MT for early detection of Alzheimer’s disease, multiple sclerosis or other neurological disorders and for monitoring progression across the clinical course.

Acknowledgments

None of the authors had any conflict of interest. This study was partially supported by research Grant 1R01MH080636-01A2, NorthShore University HealthSystem Pilot Grant EH07-267, and the Alzheimer’s Drug Discovery Foundation (ISOA 271222). The authors are grateful for the assistance of Fiona Malone and Yuyuan Ouyang. In addition, they gratefully acknowledge the SAS macro for computing various ICCs, developed by Dr. Robert M. Hamer, Professor of Psychiatry and Research Professor of Biostatistics, University of North Carolina School of Medicine, Chapel Hill, NC, USA. Dr. DeTora is a paid employee of Novartis Vaccines and Diagnostics, Cambridge, MA, USA.

References

  1. N. J. Kabani, J. G. Sled, A. Shuper, and H. Chertkow, “Regional magnetization transfer ratio changes in mild cognitive impairment,” Magnetic Resonance in Medicine, vol. 47, no. 1, pp. 143–148, 2002. View at: Publisher Site | Google Scholar
  2. W. M. van der Flier, D. M. J. van den Heuvel, A. W. E. Weverling-Rijnsburger et al., “Magnetization transfer imaging in normal aging, mild cognitive impairment, and Alzheimer's disease,” Annals of Neurology, vol. 52, no. 1, pp. 62–67, 2002. View at: Publisher Site | Google Scholar
  3. F. Agosta, M. Rovaris, E. Pagani, M. P. Sormani, G. Comi, and M. Filippi, “Magnetization transfer MRI metrics predict the accumulation of disability 8 years later in patients with multiple sclerosis,” Brain, vol. 129, no. 10, pp. 2620–2627, 2006. View at: Publisher Site | Google Scholar
  4. J. T. Chen, D. L. Collins, H. L. Atkins et al., “Magnetization transfer ratio evolution with demyelination and remyelination in multiple sclerosis lesions,” Annals of Neurology, vol. 63, no. 2, pp. 254–262, 2008. View at: Publisher Site | Google Scholar
  5. M. Cercignani, M. R. Symms, M. Ron, and G. J. Barker, “3D MTR measurement: from 1.5 T to 3.0 T,” NeuroImage, vol. 31, no. 1, pp. 181–186, 2006. View at: Publisher Site | Google Scholar
  6. G. Helms, B. Draganski, R. Frackowiak, J. Ashburner, and N. Weiskopf, “Improved segmentation of deep brain grey matter structures using magnetization transfer (MT) parameter maps,” NeuroImage, vol. 47, no. 1, pp. 194–198, 2009. View at: Publisher Site | Google Scholar
  7. Y. Wu, P. Storey, A. Carrillo et al., “Whole brain and localized magnetization transfer measurements are associated with cognitive impairment in patients infected with human immunodeficiency virus,” American Journal of Neuroradiology, vol. 29, no. 1, pp. 140–145, 2008. View at: Publisher Site | Google Scholar
  8. R. R. Edelman, “MR imaging of the pancreas: 1.5T versus 3T,” Magnetic Resonance Imaging Clinics of North America, vol. 15, no. 3, pp. 349–353, 2007. View at: Publisher Site | Google Scholar
  9. P. S. Tofts, S. C. A. Steens, M. Cercignani et al., “Sources of variation in multi-centre brain MTR histogram studies: body-coil transmission eliminates inter-centre differences,” Magnetic Resonance Materials in Physics, Biology and Medicine, vol. 19, no. 4, pp. 209–222, 2006. View at: Publisher Site | Google Scholar
  10. P. Graham, “Modelling covariate effects in observer agreement studies: the case of nominal scale agreement,” Statistics in Medicine, vol. 14, no. 3, pp. 299–310, 1995. View at: Google Scholar
  11. S. R. Filipović and V. S. Kostić, “Utility of auditory P300 in detection of presenile dementia,” Journal of the Neurological Sciences, vol. 131, no. 2, pp. 150–155, 1995. View at: Publisher Site | Google Scholar
  12. J. M. Bland and D. G. Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” The Lancet, vol. 1, no. 8476, pp. 307–310, 1986. View at: Google Scholar
  13. T. S. Hettmansperger, Statistical Inference Based on Ranks, Krieger, Malabar, Fla, USA, 1991.
  14. K. H. Zou, K. Tuncali, and S. G. Silverman, “Correlation and simple linear regression,” Radiology, vol. 227, no. 3, pp. 617–622, 2003. View at: Publisher Site | Google Scholar
  15. A. P. Zijdenbos, B. M. Dawant, R. A. Margolin, and A. C. Palmer, “Morphometric analysis of white matter lesions in MR images: method and validation,” IEEE Transactions on Medical Imaging, vol. 13, no. 4, pp. 716–724, 1994. View at: Publisher Site | Google Scholar
  16. S. S. Shapiro and M. B. Wilk, “An analysis of variance test for normality (complete samples),” Biometrika, vol. 52, pp. 591–611, 1965. View at: Google Scholar
  17. G. E. P. Box and D. R. Cox, “An analysis of transformations,” Journal of the Royal Statistical Society. Series B, vol. 26, pp. 211–252, 1964. View at: Google Scholar
  18. K. H. Zou and A. J. O'Malley, “A Bayesian hierarchical non-linear regression model in receiver operating characteristic analysis of clustered continuous diagnostic data,” Biometrical Journal, vol. 47, no. 4, pp. 417–427, 2005. View at: Publisher Site | Google Scholar
  19. A. J. O'Malley and K. H. Zou, “Bayesian multivariate hierarchical transformation models for ROC analysis,” Statistics in Medicine, vol. 25, no. 3, pp. 459–479, 2006. View at: Publisher Site | Google Scholar
  20. P. E. Shrout and J. L. Fleiss, “Intraclass correlations: uses in assessing rater reliability,” Psychological Bulletin, vol. 86, no. 2, pp. 420–428, 1979. View at: Publisher Site | Google Scholar
  21. K. H. Zou, W. M. Wells III, R. Kikinis, and S. K. Warfield, “Three validation metrics for automated probabilistic image segmentation of brain tumours,” Statistics in Medicine, vol. 23, no. 8, pp. 1259–1282, 2004. View at: Publisher Site | Google Scholar
  22. S. K. Warfield, K. H. Zou, and W. M. Wells, “Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation,” IEEE Transactions on Medical Imaging, vol. 23, no. 7, pp. 903–921, 2004. View at: Publisher Site | Google Scholar
  23. S. K. Warfield, K. H. Zou, and W. M. Wells, “Validation of image segmentation by estimating rater bias and variance,” Philosophical Transactions of the Royal Society A, vol. 366, no. 1874, pp. 2361–2375, 2008. View at: Publisher Site | Google Scholar

Copyright © 2010 Kelly H. Zou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

