#### Abstract

Sensitivity and specificity are often used to assess the performance of a diagnostic test with binary outcomes. Wald-type test statistics have been proposed for testing sensitivity and specificity individually. In the presence of a gold standard, simultaneous comparison between two diagnostic tests for noninferiority of sensitivity and specificity based on an asymptotic approach has been studied by Chen et al. (2003). However, the asymptotic approach may suffer from unsatisfactory type I error control as observed from many studies, especially in small to medium sample settings. In this paper, we compare three unconditional approaches for simultaneously testing sensitivity and specificity. They are approaches based on estimation, maximization, and a combination of estimation and maximization. Although the estimation approach does not guarantee type I error, it has satisfactory performance with regard to type I error control. The other two unconditional approaches are exact. The approach based on estimation and maximization is generally more powerful than the approach based on maximization.

#### 1. Introduction

Sensitivity and specificity are often used to summarize the performance of a diagnostic or screening procedure. Sensitivity is the probability of a positive diagnostic result given that the subject has the disease, and specificity is the probability of a negative diagnostic result in the nondiseased group. Diagnostic tests with high sensitivity and specificity are preferred, and both quantities can be estimated in the presence of a gold standard. For example, two diagnostic tests, the technetium-99m methoxyisobutylisonitrile single photon emission computed tomography (Tc-MIBI SPECT) and the computed tomography (CT), were compared for diagnosing recurrent or residual nasopharyngeal carcinoma (NPC) from benign lesions after radiotherapy in the study by Kao et al. [1]. The gold standard in their study is the biopsy method. The sensitivity and specificity are 73% and 88% for the CT test and 73% and 96% for the Tc-MIBI SPECT test.
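As a minimal sketch of these two definitions, the following computes sensitivity and specificity from counts cross-classified against the gold standard. The function name and the counts in the usage line are hypothetical and are not data from Kao et al. [1].

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) for one diagnostic test.

    Disease status is taken from the gold standard:
    sensitivity = P(test positive | diseased),
    specificity = P(test negative | nondiseased).
    """
    diseased = true_pos + false_neg
    nondiseased = true_neg + false_pos
    if diseased == 0 or nondiseased == 0:
        raise ValueError("both groups need at least one subject")
    return true_pos / diseased, true_neg / nondiseased


# Hypothetical counts, for illustration only.
sens, spec = sensitivity_specificity(true_pos=8, false_neg=2, true_neg=18, false_pos=2)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```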

Traditionally, noninferiority of sensitivity and specificity between two diagnostic procedures is tested individually using the McNemar test [2–6]. Recently, Tange et al. [7] developed an approach to simultaneously test sensitivity and specificity in noninferiority studies. Lu and Bean [2] were among the first researchers to propose a Wald-type test statistic for testing a nonzero difference in sensitivity or specificity between two diagnostic tests for paired data. Later, Nam [3] pointed out that the test statistic by Lu and Bean [2] has unsatisfactory type I error control. A new test statistic based on a restricted maximum likelihood method was then proposed by Nam [3] and was shown to perform well, with actual type I error rates closer to the desired rates. This test statistic was used by Chen et al. [8] to compare sensitivity and specificity simultaneously in the presence of a gold standard. Actual type I error rates for a compound asymptotic test were evaluated at some specific points in the sample space. It is well known that the asymptotic method behaves poorly when the sample size is small. Therefore, it is necessary to comprehensively evaluate the type I error rate [9–14].

An alternative to an asymptotic approach is an exact approach conducted by enumerating all the possible tables for given total sample sizes of diseased and nondiseased subjects. The first commonly used unconditional approach is a method based on maximization [15]. In the unconditional approach, only the numbers of subjects in the diseased and nondiseased groups are fixed, not the total number of responses from both groups. Fixing the latter as well, so that both margins of the table are treated as fixed, gives the usual conditional approach. The p value of the unconditional approach based on maximization is calculated as the maximum of the tail probability over the range of a nuisance parameter [15]. This approach has been studied for many years and it can be conservative, with an actual type I error rate smaller than the test size in small sample settings. One possible reason for the conservativeness of this approach is the spikes in the tail probability curve. Storer and Kim [16] proposed another unconditional approach based on estimation, which is also known as the parametric bootstrap approach. The maximum likelihood estimate (MLE) is plugged into the null likelihood for the nuisance parameter. Other estimates may be considered if the MLE is not available [7]. Although this estimation-based approach is often shown to have type I error rates closer to the desired size than asymptotic approaches, it still does not respect the test size.

A combination of the two approaches based on estimation and maximization has been proposed by Lloyd [4, 17] for testing noninferiority with binary matched-pairs data, which can be obtained from a case-control study and a twin study. The p value of the approach based on estimation is used as a test statistic in the following maximization step. It should be noted that there could be multiple estimation steps before the final maximization step. The final step must be a maximization step in order to make the test exact. This approach has been successfully extended to testing trend with binary endpoints [5, 18]. The rest of this paper is organized as follows. Section 2 presents relevant notation and testing procedures for simultaneously testing sensitivity and specificity. In Section 3, we extensively compare the performance of the competing tests. A real example is given in Section 4 to illustrate the application of the asymptotic and exact procedures. Section 5 is given to discussion.

#### 2. Testing Approaches

Each subject in a study is evaluated by two dichotomous diagnostic tests, $T_1$ and $T_2$, in the presence of a gold standard. Suppose the status of each subject, either diseased or nondiseased, was already determined by the gold standard before the two diagnostic tests were performed. Within the diseased group, $n_{ij}$ ($i, j = 0, 1$) is the number of subjects with diagnostic results $T_1 = i$ and $T_2 = j$, where $0$ and $1$ represent negative and positive diagnostic results from the $k$th test $T_k$ ($k = 1, 2$), respectively, with $p_{ij}$ being the associated probability. The total number of diseased subjects is $n$. Similarly, $m_{ij}$ ($i, j = 0, 1$) is the number of subjects with diagnostic results $T_1 = i$ and $T_2 = j$ in the nondiseased group, $q_{ij}$ is the associated probability, and $m$ is the total number of nondiseased patients. Such data can be organized in a contingency table (Table 1), where $n = \sum_{i,j} n_{ij}$ and $m = \sum_{i,j} m_{ij}$. It is reasonable to assume that the diseased group is independent of the nondiseased group.

In a study with given total sample sizes and in the diseased and the nondiseased groups, respectively, sensitivities of diagnostic tests and are estimated as and . Similarly, and are specificities for and , respectively. The estimated difference between their sensitivities is and the estimated difference between their specificities is

The hypotheses for noninferiority of sensitivity and specificity between and are given in the format of compound hypotheses as against where and are the clinically meaningful differences between and in sensitivity and specificity, and . For example, investigators may consider a difference in sensitivity of less than 0.2 not clinically important .

A test statistic for the hypotheses versus is where is the estimated difference in sensitivities and is the estimated standard error of . The estimate of based on a restricted maximum likelihood estimation approach [3, 19, 20] is used, and the associated form is , where There are two reasons for using this estimate instead of some other estimates [2]. First, it has been shown to perform well [8, 20]. Second, it is applicable to a contingency table with off-diagonal zero cells. We are going to consider the exact approaches by enumerating all possible tables, some of which have zero cells in the off-diagonals. The traditional estimate for does not provide a reasonable estimate of variance for such tables.

The test statistic for sensitivity in (5) follows a normal distribution asymptotically. The null hypothesis would be rejected if the test statistic in (5) is greater than or equal to , where is the upper percentile of the standard normal distribution.
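The asymptotic decision rule can be sketched as follows. Because the displayed forms of the statistic and its variance are not reproduced in this text, the code makes its own assumptions: `x10` and `x01` are the discordant counts in the diseased group (first test positive with second negative, and the reverse), the difference is taken as first minus second, `0 < delta < 1`, and the restricted estimate is obtained by maximizing the trinomial likelihood under the boundary constraint. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def restricted_p01(x10, x01, n, delta):
    """Restricted MLE of p01 under the boundary null p10 - p01 = -delta.

    It is the larger root of 2*p^2 + b*p + c = 0, the score equation of
    the trinomial likelihood with p10 replaced by p01 - delta.
    """
    theta_hat = (x10 - x01) / n
    p01_hat = x01 / n
    b = -theta_hat * (1 - delta) - 2 * (p01_hat + delta)
    c = delta * (1 + delta) * p01_hat
    disc = max(b * b - 8 * c, 0.0)  # guard against rounding below zero
    return (-b + math.sqrt(disc)) / 4


def z_statistic(x10, x01, n, delta):
    """(theta_hat + delta) divided by the restricted standard error."""
    p01_tilde = restricted_p01(x10, x01, n, delta)
    variance = (2 * p01_tilde - delta * (1 + delta)) / n
    return ((x10 - x01) / n + delta) / math.sqrt(variance)


def asymptotic_reject(x10, x01, n, delta, alpha=0.05):
    """Reject when the statistic reaches the upper-alpha normal quantile."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return z_statistic(x10, x01, n, delta) >= z_alpha


# Hypothetical discordant counts, for illustration only.
print(z_statistic(3, 1, 10, 0.2), asymptotic_reject(3, 1, 10, 0.2))
```

Note that the variance stays positive even when both discordant counts are zero, which is the property the text cites as a reason for preferring the restricted estimate.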

As mentioned by many researchers, the asymptotic approach has unsatisfactory type I error control, especially in small or medium sample settings. An alternative is an exact approach that enumerates all possible tables for the given sample sizes. The first exact unconditional approach considered here is a method based on maximization (referred to as the M approach) [15]. The p value of this approach is calculated as the maximum of the tail probability. In this approach, the worst possible value for the nuisance parameter is found in order to calculate the p value, where is the observed data of . The tail set based on the test statistic for this approach is

It is easy to show that follows a trinomial distribution with parameters . Then, the p value is expressed as where is the search range for the nuisance parameter and is the probability mass function for a trinomial distribution.

The M approach, based on maximization, could be conservative when the actual type I error is much less than the test size [5, 9]. To overcome this disadvantage of exact unconditional approaches, Lloyd [21] proposed a new exact unconditional approach based on estimation and maximization (referred to as the E+M approach). The first step in this approach is to compute the p value for each table based on the estimation approach [16], also known as parametric bootstrap. We refer to this approach as the E approach. The nuisance parameter in the null likelihood is replaced by the maximum likelihood estimate and the p value is calculated as It should be noted that the E approach does not guarantee type I error rate. Once the E p values are calculated for each table, they will be used as a test statistic in the next step for the p value calculation. The E+M p value is then given by where is the tail set. The refinement from the E step in the E+M approach could possibly increase the actual type I error rate of the testing procedure, which may lead to a power increase for exact tests.
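The three unconditional p values for one group can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the ordering statistic is a restricted-likelihood Z statistic derived here from the constrained trinomial likelihood, the discordant counts `(x10, x01)` define each table, the nuisance parameter `p01` is searched on an evenly spaced grid along the boundary `p10 = p01 - delta`, and the estimation step plugs in the restricted MLE. All names and the grid size are hypothetical.

```python
import math


def restricted_p01(x10, x01, n, delta):
    """Restricted MLE of p01 under the boundary null p10 - p01 = -delta."""
    theta_hat, p01_hat = (x10 - x01) / n, x01 / n
    b = -theta_hat * (1 - delta) - 2 * (p01_hat + delta)
    c = delta * (1 + delta) * p01_hat
    return (-b + math.sqrt(max(b * b - 8 * c, 0.0))) / 4


def z_statistic(x10, x01, n, delta):
    variance = (2 * restricted_p01(x10, x01, n, delta) - delta * (1 + delta)) / n
    return ((x10 - x01) / n + delta) / math.sqrt(variance)


def tail_probability(tail, n, p01, delta):
    """Trinomial probability of a set of tables on the null boundary."""
    p10 = max(p01 - delta, 0.0)
    p_rest = max(1.0 - p10 - p01, 0.0)
    return sum(
        math.comb(n, a) * math.comb(n - a, b)
        * p10 ** a * p01 ** b * p_rest ** (n - a - b)
        for a, b in tail
    )


def sup_tail_probability(tail, n, delta, grid=200):
    """Maximization step: grid search over p01 in [delta, (1 + delta) / 2]."""
    lo, hi = delta, (1 + delta) / 2
    return max(
        tail_probability(tail, n, lo + (hi - lo) * k / grid, delta)
        for k in range(grid + 1)
    )


def exact_p_values(x10, x01, n, delta, grid=200):
    """Return the E, M, and E+M p values for the observed table."""
    tables = [(a, b) for a in range(n + 1) for b in range(n + 1 - a)]
    z = {t: z_statistic(t[0], t[1], n, delta) for t in tables}
    eps = 1e-12  # tolerance so that tied tables stay in the tail set

    def p_e(t):
        # Estimation step: tail probability at the plug-in nuisance value.
        tail = [u for u in tables if z[u] >= z[t] - eps]
        return tail_probability(tail, n, restricted_p01(t[0], t[1], n, delta), delta)

    obs = (x10, x01)
    p_m = sup_tail_probability([u for u in tables if z[u] >= z[obs] - eps], n, delta, grid)
    pe = {t: p_e(t) for t in tables}
    # E+M: small E p values are "extreme", then maximize over the nuisance.
    tail_em = [u for u in tables if pe[u] <= pe[obs] + eps]
    p_em = sup_tail_probability(tail_em, n, delta, grid)
    return {"E": pe[obs], "M": p_m, "E+M": p_em}


# Hypothetical table, for illustration only.
print(exact_p_values(x10=6, x01=1, n=15, delta=0.2))
```

A grid search only approximates the supremum, so a finer grid or a numerical optimizer may be preferable when the tail probability curve has narrow spikes.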

Monotonicity is an important property in exact testing procedures to reduce the computation time and guarantee that the maximum of the tail probability is attained at the boundary for noninferiority hypotheses. Berger and Sidik [22] showed that monotonicity is satisfied for paired data for testing one-sided hypotheses based on the McNemar test. Most importantly, the dimension of nuisance parameters is reduced from two to one [17]. We provide the following theorem to show the monotonicity of the test statistic .

Theorem 1. *Monotonicity property is satisfied for under the null hypothesis: and .*

*Proof. *Let and . For a given , Under the null hypothesis, . In order to show , we only need to prove that . From (6), we know that where and are given in (6). It is obvious that is a decreasing function of and is a positive constant number when is fixed and is an increasing function of , which leads to . It follows that .

For a given , similar proof will lead to a result of .

The probability of the tail set for either exact approach (M or E+M) has two nuisance parameters, and . Applying the theorem for the monotonicity property, type I error of the test occurs on the boundary of the two-dimensional nuisance parameter space, . Therefore, there is only one nuisance parameter, , in the definition of the two exact p values.

For testing the specificity, the asymptotic approach and the E, M, and E+M approaches (based on estimation, maximization, and their combination) can be similarly applied to test the hypotheses against . The test statistic [3, 19, 20] would be where is the estimated standard error of , , , and . Under the null hypothesis, one can show that the monotonicity of is in a similar way to .

When there are two diagnostic tests available, we may want to simultaneously confirm the noninferiority of sensitivity and specificity for the two tests. The population from the diseased group and the nondiseased group can be reasonably assumed to be independent of each other. Then, the joint probability is a product of two probabilities: where is the rejection region. Let and be the significance levels for testing sensitivity and specificity separately. We can reject the compound null hypothesis or at the significance level of when the sensitivity null hypothesis is rejected at the level of and the specificity null is rejected at the level of , where . For simplicity, we assume .
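The compound decision and the product form of the joint rejection probability can be sketched as follows. The function names are hypothetical, and the component levels are passed in explicitly rather than derived from an overall level, since that relation is not reproduced in this text.

```python
def joint_rejection_probability(prob_reject_sen, prob_reject_spe):
    """Probability that both component tests reject.

    The product form relies on the diseased and nondiseased groups
    being independent.
    """
    return prob_reject_sen * prob_reject_spe


def reject_compound_null(p_value_sen, p_value_spe, alpha_sen, alpha_spe):
    """Reject the compound null only when both component nulls are rejected."""
    return p_value_sen <= alpha_sen and p_value_spe <= alpha_spe


# Hypothetical p values and levels, for illustration only.
print(reject_compound_null(0.01, 0.03, alpha_sen=0.05, alpha_spe=0.05))
```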

#### 3. Numerical Study

We already know that neither the asymptotic approach nor the estimation-based E approach guarantees the type I error rate; however, it is still interesting to compare type I error control for the following four approaches: the asymptotic approach, the E approach, the maximization-based M approach, and the combined E+M approach. We select three commonly used values of and , 0.05, 0.1, and 0.2. For each configuration of and , actual type I error rates are presented in Table 2 for sample size and in Table 3 for sample size at the significance level of . It can be seen from both tables that the asymptotic approach generally has inflated type I error rates. Both the M approach and the E+M approach are exact tests and respect the test size as expected. Although the E approach does not guarantee type I error rate, its performance is much better than that of the asymptotic approach regarding type I error control. Even for large sample size, the M approach is still conservative. The approach has an actual type I error rate which is very close to the nominal level when .

The asymptotic approach will not be included in the power comparison due to inflated type I error rates. We include the estimation-based E approach in the power comparison with the maximization-based M approach and the combined E+M approach due to the good type I error control of the E approach. The power is a function of four parameters: , , , and where and approaches and and are the rejection region for the diseased group and the nondiseased group at a significance level of based on the approach. Given the two parameters and in the nondiseased group, the power is a function of for a given . We compared multiple configurations of the parameters. Typical comparison results for balanced data are presented in Figure 1. The power difference between the E approach and the E+M approach is often negligible and both are generally more powerful than the M approach. We also compared the power for unbalanced data with sample size ratios of 1/2, 1/3, 2, and 3. Results similar to those for balanced data are observed; see Figure 2. We also observe similar results in comparing the power as a function of for the given , , and .

#### 4. An Example

Kao et al. [1] compared diagnostic tests to detect recurrent or residual NPC in the presence of a gold standard test. Simultaneous comparison of sensitivity and specificity is conducted between the CT test () and the Tc-MIBI SPECT test , with and . The diagnostic results using these two tests are presented in Table 4. The sensitivity and specificity are 73% and 88% for the CT test and 73% and 96% for the Tc-MIBI SPECT test. The clinically meaningful difference in sensitivity and specificity is assumed to be and , respectively. Four testing procedures are used to calculate the p value: the asymptotic approach; the estimation-based E approach; the maximization-based M approach; and the combined E+M approach. The p values based on the asymptotic, E, M, and E+M approaches are 0.0677, 0.0317, 0.0764, and 0.0418, respectively. Both the E approach and the E+M approach reject the null hypothesis at a 5% significance level, while the asymptotic approach and the M approach do not. It should be noted that the two tests have the same sensitivity, which may contribute to the significant result even with a small difference between the two tests.

#### 5. Discussion

In this paper, the asymptotic approach and three unconditional approaches, based on estimation (E), maximization (M), and their combination (E+M), are considered for testing sensitivity and specificity simultaneously in the presence of a gold standard. Although the E approach does not guarantee type I error rate, it has good performance regarding type I error rate control and the difference between the E approach and the E+M approach is negligible. Since computational time is not an issue for this problem and the E+M approach is an exact method, the E+M approach is recommended for use in practice due to its power gain as compared to the M approach.

Tang [9] has studied the approach and the approach for comparing sensitivity and specificity when combining two diagnostic tests. The approach has been shown to be a reliable testing procedure. We would consider comparing the approach with the approach in this context as a future work. The intersection-union method may be considered for testing sensitivity and specificity [8].

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The authors would like to thank the Associate Editor and the three reviewers for their valuable comments and suggestions. The authors also thank Professor Michelle Chino for her valuable comments. Shan’s research is partially supported by the NIH Grant 5U54GM104944.