Abstract

The present review describes and validates a new ratio created for matching predictability and balance between true positives (TP) and true negatives (TN). The validity of the new ratio was studied in a three-step process: (i) it was applied to the data of a past study predicting cardiac output response to fluid bolus from response to passive leg raise (PLR); (ii) it was compared with traditional ratios by modeling different 2 × 2 contingency tables in 1000 hypothetical patients; (iii) its precision was compared with that of other ratios by computing random fluctuations in the same patients. In comparison to other ratios, the new ratio performs better in predicting the cardiac response to fluid bolus and supports the clinical conclusions more directly. When the proportion of false responses is high, the new ratio is close to the correlation coefficient (CC). When the proportion of true responses is high, the new ratio is the only ratio that identifies the categorization that balances the proportion of TP and TN. Its precision is close to that of CC. In conclusion, the new ratio should be considered for creating categories from quantitative variables, especially when matching predictability with balance between TP and TN is a concern.

1. Introduction

Numerous metrics and receiver operating characteristic (ROC) curves are used to test the performance of prediction methods [1]. However, these traditional tools are weighted only mildly, if at all, according to the balance between TP (sign predicts event) and TN (nonsign predicts nonevent). This can be a limitation when categorizing a quantitative event with the objective of reaching occurrences (event and nonevent) of comparable probability. Indeed, predicting an event whose prevalence is close to 0 or 1 may have no clinical impact. This is especially the case when the prediction method is used to change patients’ treatments and/or to include a patient in one arm of a controlled study.

For example, in patients with both severe lung injury and circulatory shock, it has been suggested to test the cardiac output response during a passive leg raise (PLR) for predicting the effects of a fluid bolus (not reversible and potentially harmful) [2–4]. Basically, we expect a linear relationship between PLR response and fluid bolus response since PLR acts like an internal transfusion. However, if we want to create protocols driving fluid therapy in this setting, we must determine the PLR response that best discriminates a relevant fluid bolus response. To achieve this, we can use an existing database and sequentially test all possible PLR response cut-off thresholds versus all possible fluid response cut-off thresholds. In this situation, traditional ratios reach their best values in two cases: (i) with high PLR thresholds (e.g., >25%) predicting high fluid bolus thresholds (e.g., >50%) and (ii) with low PLR thresholds predicting low fluid bolus thresholds. However, these two pairs of thresholds have little practical interest. In the first case, the prevalence of high fluid response is low and most patients are classified as TN (no high PLR response predicts no high fluid bolus response). A protocol based on this would almost never recommend a fluid bolus. Conversely, in the second case, the prevalence of low fluid response is high and most patients are classified as TP (PLR response over a low threshold predicts fluid bolus response over a low threshold), and a protocol would always suggest giving fluid. Thus, current recommendations do not result from systematic statistical analyses but from clinical and metrological (least significant change) considerations.

We present in this paper the formula of a new ratio and show that it better matches predictability with balance between TP and TN than traditional statistics. More generally, the new ratio would be of interest when dichotomizing a quantitative variable in existing databases, for creating protocols like the one cited in our example, but also for decision trees and study inclusion criteria.

2. Methods

Measuring the quality of any categorization is a particular case of the general approach to prediction-performance assessment. These methods first define the predictor (P) and the event (E), and then determine the matrix cells: TP (P predicts E), TN (non P predicts non E), FP (P predicts non E), and FN (non P predicts E). The best method for analyzing prediction performance is to consider the whole matrix. However, it is not immediately clear whether a matrix gives a proper answer to the question asked, or whether a given matrix is superior to another one. Different metrics have been suggested to measure the distance between P and E using a single number.

2.1. Traditional Ratios

Herein, two situations are clearly different.

First, when P and E are naturally binary, the most widely used ratios are as follows: (i) sensitivity: $Se = TP/(TP + FN)$; (ii) specificity: $Sp = TN/(TN + FP)$; (iii) positive predictive value: $PPV = TP/(TP + FP)$; (iv) negative predictive value: $NPV = TN/(TN + FN)$.

The quality of the matrix may be analyzed using other ratios providing close information when P is a risk factor, a sign, or a treatment and E is a natural category like a disease or an outcome: (i) risk ratio or relative risk: $RR = (TP/(TP + FP))/(FN/(TN + FN))$; (ii) Yule’s coefficient: $(TP \times TN - FP \times FN)/(TP \times TN + FP \times FN)$; (iii) Youden coefficient: $Se + Sp - 1$; (iv) likelihood ratio positive: $LR+ = Se/(1 - Sp)$; (v) likelihood ratio negative: $LR- = (1 - Se)/Sp$; (vi) odds ratio: $OR = (TP \times TN)/(FP \times FN)$.
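As an illustration, the ratios listed above can be computed from the four cells of a 2 × 2 matrix. The following Python sketch uses the standard textbook definitions of each ratio; the function name `traditional_ratios` and the dictionary keys are illustrative choices, and the example matrix (33 TP, 32 TN, 6 FP, 4 FN) is one of the matrices reported in the Results.

```python
def traditional_ratios(tp, tn, fp, fn):
    """Return the traditional ratios of a 2 x 2 matrix (standard definitions).

    Raises ZeroDivisionError when a denominator is null, which is the
    practical weakness of several of these ratios.
    """
    se = tp / (tp + fn)  # sensitivity (true positive rate)
    sp = tn / (tn + fp)  # specificity (true negative rate)
    return {
        "Se": se,
        "Sp": sp,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "RR": (tp / (tp + fp)) / (fn / (tn + fn)),
        "Yule": (tp * tn - fp * fn) / (tp * tn + fp * fn),
        "Youden": se + sp - 1,
        "LR+": se / (1 - sp),
        "LR-": (1 - se) / sp,
        "OR": (tp * tn) / (fp * fn),
    }


ratios = traditional_ratios(tp=33, tn=32, fp=6, fn=4)
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

Calling the function on a matrix with a null cell (for instance 0 FP and 0 FN) raises a division error for LR+, RR, or OR, which illustrates why these ratios are awkward when scanning many pairs of thresholds.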

The best choice depends on the clinical context and the underlying question to be answered. For example, OR is more suitable for case-control or retrospective studies; RR is preferably used in randomized controlled trials and cohort studies [5].

In contrast, when P and E are quantitative, a binary classification requires conventional cut-off thresholds. There is consequently a different matrix for each possible pair of P and E thresholds. A series of categories can be created for each variable, based on incremental thresholds. To determine which matrix works best, the use of LRs, RR, Yule’s coefficient, Youden’s coefficient, and OR suffers from one to three weaknesses. (i) LRs, RR, and OR cannot be derived when their denominator is null and when TP or TN is null. (ii) All these ratios, except Yule’s and Youden’s coefficients, range from zero to infinity. This nonlinearity is poorly intuitive for quantifying practical usefulness. (iii) These ratios are only marginally weighted according to the balance between TP and TN. The different methods used for assessing the matrix performance in the case of dichotomized quantitative E and P were reviewed by Baldi et al. in 2000 [1]. The best practical solution is the Pearson product-moment correlation coefficient (CC) for two binary categories, also called the Matthews correlation coefficient or phi coefficient [6]. This coefficient is derived as follows: $CC = (TP \times TN - FP \times FN)/\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}$ (1). It ranges from −1 to +1 and is weighted according to the proportion of observations in each category, since it turns out to be the square root of a chi-square divided by the total number of observations $n$. However, this relationship is complex and poorly related to the balance between TP and TN when the proportion of errors tends towards zero.
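The link between CC and the chi-square statistic can be checked numerically. The sketch below computes the phi (Matthews) coefficient of a 2 × 2 matrix and the Pearson chi-square without continuity correction, then compares the absolute value of CC with the square root of chi-square divided by the number of observations. Function and variable names are illustrative.

```python
import math


def correlation_coefficient(tp, tn, fp, fn):
    """Phi (Matthews) coefficient of a 2 x 2 matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator


def chi_square(tp, tn, fp, fn):
    """Pearson chi-square of the same matrix, without continuity correction."""
    n = tp + tn + fp + fn
    margins = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    return n * (tp * tn - fp * fn) ** 2 / margins


cells = (33, 32, 6, 4)  # TP, TN, FP, FN
n = sum(cells)
cc = correlation_coefficient(*cells)
phi_from_chi2 = math.sqrt(chi_square(*cells) / n)
print(abs(cc), phi_from_chi2)  # the two quantities coincide
```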

2.2. ROC Space

The ROC space is defined by the false positive rate ($1 - Sp$) as the x-axis and by the true positive rate ($Se$) as the y-axis. Each matrix represents one plot in the ROC space. Therefore, a specific ROC curve is created by a set of matrices using a given fixed E threshold and all possible P thresholds. If E is quantitative, different curves can be created using different E thresholds. The areas under ROC curves can be calculated with their confidence intervals and compared using nonparametric tests [7, 8]. The predictability increases when the area under the curve becomes significantly different from 0.5, towards 1 or 0. In the latter case, P is predictive when under, rather than above, the considered threshold. ROC curves are based on a set of matrices and cannot be compared to traditional ratios assessing one single matrix. However, the best possible matrix would yield a point in the upper left corner of the ROC space with coordinates (0, 1), obtained when $Se = 1$ and $Sp = 1$, leading to an area under the curve = +1. The distance between an observed ROC plot and this optimal value can be measured by triangulation, leading to $d = \sqrt{(1 - Se)^2 + (1 - Sp)^2}$ (2). The smallest distance is obtained with 100% of TP. Similarly, with 100% of TN, the ROC plot would reach the lower right corner and the area under the curve would be 0. However, a dichotomy based on these perfect predictions would not lead to a discriminative decision, as seen previously. Therefore, neither ROC curves nor ROC plots are adequate answers to the issue of matching prediction with a balance between TP and TN.
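To make the triangulation concrete, the sketch below places a 2 × 2 matrix in the ROC space and measures its Euclidean distance to the upper left corner (0, 1). The function names are illustrative. The second example shows that a matrix made almost entirely of TP also reaches a null distance, although it offers no balance between TP and TN.

```python
import math


def roc_point(tp, tn, fp, fn):
    """Coordinates (false positive rate, true positive rate) of one matrix."""
    false_positive_rate = fp / (fp + tn)  # equals 1 - Sp
    true_positive_rate = tp / (tp + fn)   # equals Se
    return false_positive_rate, true_positive_rate


def distance_to_best_corner(tp, tn, fp, fn):
    """Euclidean distance between the ROC point and the corner (0, 1)."""
    fpr, tpr = roc_point(tp, tn, fp, fn)
    return math.hypot(fpr, 1 - tpr)


print(distance_to_best_corner(33, 32, 6, 4))  # balanced matrix with some errors
print(distance_to_best_corner(74, 1, 0, 0))   # error-free but unbalanced matrix
```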

2.3. The New Ratio

After a careful literature review, it was not possible to find an adequate solution matching predictability with a balance between TP and TN. To create a new ratio reaching this complementary objective, we first consider as numerator the difference between the highest possible quantity from right classifications, obtained by the product $TP \times TN$, and wrong classifications, obtained by the product $FP \times FN$. We standardized this quantity by using as denominator the optimal quantity obtained from the best possible balanced prediction. If $n$ is the total number of observations, the best possible balanced prediction is given by $TP = TN = 0.5\,n$ and the optimal quantity is obtained by $(0.5\,n)^2$. Therefore, the new ratio is $(TP \times TN - FP \times FN)/(0.5\,n)^2$ (3). The ratio is symmetric, always derivable, and ranges from −1 to +1. When the expected probability $p$ of the event E is not exactly 0.5, the ratio can be corrected by replacing 0.5 by $p$ and $(1 - p)$, leading to $(TP \times TN - FP \times FN)/(p\,(1 - p)\,n^2)$ (4). This however does not change the linearity of the ratio, just the scale and the limits, ranging from $-0.25/(p\,(1 - p))$ to $+0.25/(p\,(1 - p))$. In this situation, the ratio becomes equivalent to Youden’s coefficient, very close to the CC and to the distance from the best ROC plot. Since the new ratio is used in situations where we expect a balance between TP and TN, $p$ approximates 0.5. In this situation, the range of the ratio changes only marginally. For instance, if $p = 0.4$, the ratio ranges from −1.04 to +1.04. Therefore, it is always suitable to use the standardized formula (3), ranging between −1 and +1.
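The construction described above can be sketched in a few lines of Python. The numerator is the product TP × TN minus the product FP × FN, and the denominator is the optimal quantity obtained when TP and TN each represent half of the observations. The function name `balanced_ratio` and the parameter `p` (expected probability of the event, 0.5 by default) are illustrative names, not taken from the original study.

```python
def balanced_ratio(tp, tn, fp, fn, p=0.5):
    """New ratio: (TP * TN - FP * FN) divided by the optimal balanced quantity.

    With p = 0.5 the denominator is (0.5 * n) ** 2 and the ratio ranges
    from -1 to +1. With another p, 0.5 is replaced by p and (1 - p).
    """
    n = tp + tn + fp + fn
    return (tp * tn - fp * fn) / (p * (1 - p) * n * n)


# Limits of the standardized ratio in 1000 hypothetical patients.
print(balanced_ratio(500, 500, 0, 0))   # perfectly balanced, no error
print(balanced_ratio(0, 0, 500, 500))   # perfectly balanced, all errors

# Effect of the correction when the expected probability is not 0.5.
print(round(balanced_ratio(500, 500, 0, 0, p=0.4), 2))
```

Because the denominator depends only on the number of observations and never on a cell count, the ratio can be computed for any matrix, including those with null cells.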

2.4. The Validation

For validation, we first applied the new ratio to the real clinical challenge presented in the introduction. We used the data of a recently published study on the performance of cardiac output response during PLR tests for predicting cardiac output response to a 500 mL fluid bolus in 75 patients [9]. In this study, CC was used to determine the best pair of PLR and fluid bolus response cut-points. We reproduced here the complete CC table and applied the new ratio to the same data to compare the two ratios.

Second, we made a complete analysis of the utility of the new ratio as compared to traditional ratios, by modeling all possible proportions of TP, TN, FP, and FN in 2 × 2 matrices, in a hypothetical population of 1000 patients. Traditional ratios were those listed above (including RR, LR+, LR−, OR, and CC) plus other quantities listed by Rosner [10], including accuracy, contingency coefficient, and Equitable Threat Score. Then, we computed different proportions of total true responses (TP + TN). For each proportion of total true responses, we plotted the values of the new ratio and of the other ratios when the balance between TP and TN was changing. We were therefore able to compare the impact of the TP and TN balance on each ratio, for each level of total true responses.
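A minimal sketch of this kind of modeling is given below. It fixes the proportion of total true responses in 1000 hypothetical patients, then shifts the balance between TP and TN step by step and records the new ratio and CC at each step. How the false responses were split between FP and FN in the original modeling is not stated; the sketch assumes an equal split. All function names and the `step` parameter are illustrative.

```python
import math


def balanced_ratio(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    return (tp * tn - fp * fn) / (0.5 * n) ** 2


def correlation_coefficient(tp, tn, fp, fn):
    margins = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    return (tp * tn - fp * fn) / math.sqrt(margins)


def sweep_balance(n=1000, true_share=0.97, step=5):
    """Vary the TP/TN balance for a fixed share of true responses.

    Assumption: false responses are split equally between FP and FN.
    """
    n_true = round(n * true_share)
    n_false = n - n_true
    fp = n_false // 2
    fn = n_false - fp
    rows = []
    for tp in range(0, n_true + 1, step):
        tn = n_true - tp
        rows.append((tp, tn,
                     balanced_ratio(tp, tn, fp, fn),
                     correlation_coefficient(tp, tn, fp, fn)))
    return rows


rows = sweep_balance()
best = max(rows, key=lambda row: row[2])  # row with the highest new ratio
print(best)
```

Under these assumptions, the new ratio peaks where TP equals TN and falls off on both sides, whereas CC stays comparatively flat across the sweep when the share of true responses is high.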

Lastly, we derived the standard deviation (SD), precision (2SD/mean), and 95% confidence interval (mean ± 2SD) of the new ratio by computing a random 5% fluctuation in the different 2 × 2 matrices of 1000 hypothetical observations. These quantities were then compared with those obtained with other ratios.
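The precision estimate can be sketched by simulation. The exact way the 5% random fluctuation was generated is not described; the sketch below assumes that each cell of the matrix is multiplied by an independent uniform random factor between 0.95 and 1.05, and that the ratio is recomputed for each draw. The function names, the number of draws, and the seed are illustrative.

```python
import random
import statistics


def balanced_ratio(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    return (tp * tn - fp * fn) / (0.5 * n) ** 2


def simulate_precision(tp, tn, fp, fn, rel=0.05, draws=2000, seed=0):
    """Mean, SD, precision (2SD/mean), and mean +/- 2SD interval of the ratio.

    Assumption: each cell fluctuates independently and uniformly within
    +/- rel of its value. Precision is undefined when the mean is zero.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(draws):
        cells = [c * (1 + rng.uniform(-rel, rel)) for c in (tp, tn, fp, fn)]
        values.append(balanced_ratio(*cells))
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean, sd, 2 * sd / mean, (mean - 2 * sd, mean + 2 * sd)


mean, sd, precision, interval = simulate_precision(485, 485, 15, 15)
print(f"mean={mean:.3f} SD={sd:.4f} precision={precision:.2%}")
```

The same function can be applied to any other ratio by swapping `balanced_ratio` for another formula, which allows a like-for-like comparison of precision.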

3. Results

3.1. First Step (Clinical Challenge)

Table 1(a) shows the performance of PLR tests for predicting fluid bolus responses using CC, as done in the original study, and reproduces the complete incremental analysis. Table 1(b) shows the values of the new ratio applied to the same patients and the same cut-off points.

In Table 1(a), CC reaches high predictivity at the two extremities of the table. A perfect CC value of 1 is observed at the low-threshold extremity of the table; it is obtained from 74 TP, 1 TN, 0 FP, and 0 FN. When PLR response ≥0% is tested for predicting fluid bolus response ≥0%, the CC value is obtained from 50 TP, 18 TN, 0 FP, and 7 FN. On the other hand, when PLR response ≥25% is tested for predicting fluid bolus response ≥50%, the CC value is obtained from 1 TP, 73 TN, 1 FP, and 0 FN.

In contrast, in Table 1(b), high values of the new ratio are concentrated in the middle of the table. The best value is reached when PLR response ≥5% is tested for predicting fluid bolus response ≥10%. This is obtained from 33 TP, 32 TN, 6 FP, and 4 FN. In addition, Table 1(b) shows that the new ratio isolates the highest values better than CC: in Table 1(a), 3 values are within 15% of the best score (0.78), while in Table 1(b) no values are within 15% of the best score (0.73).
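As a worked check, the new ratio can be recomputed from the cell counts reported above for three pairs of thresholds in the 75 patients. The sketch uses the standardized formula, (TP × TN − FP × FN)/(0.5 n)²; the function name and the dictionary labels are illustrative.

```python
def balanced_ratio(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    return (tp * tn - fp * fn) / (0.5 * n) ** 2


# Cell counts (TP, TN, FP, FN) reported in the text for three threshold pairs.
reported = {
    "PLR >= 0%, fluid >= 0%": (50, 18, 0, 7),
    "PLR >= 5%, fluid >= 10%": (33, 32, 6, 4),
    "PLR >= 25%, fluid >= 50%": (1, 73, 1, 0),
}

values = {label: balanced_ratio(*cells) for label, cells in reported.items()}
for label, value in values.items():
    print(f"{label}: {value:.2f}")

best_label = max(values, key=values.get)
print("best pair:", best_label)
```

The middle pair of thresholds gives about 0.73, in line with the best score quoted for Table 1(b), while the pair dominated by TN falls close to zero.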

3.2. Step Two (Comparative Analysis)

Ratios such as RR, LR+, LR−, and OR range from zero to infinity, so that the new ratio cannot be directly compared with them. Figure 1 compares the new ratio with these ratios when the proportions of TP and TN are changing, for two different levels of total true responses (97% in Figure 1(a) and 60% in Figure 1(b)).

Other ratios, including CC, are normalized from −1 to +1; thus, they can be directly compared to the new ratio (Figure 2). The Equitable Threat Score, not represented on this figure for clarity, is very close to CC. All curves in Figures 1 and 2 have a shape comparable to that of the new ratio when the predictability is poor (Figures 1(b) and 2(b)). When the predictability is good (Figures 1(a) and 2(a)), all ratios except the new ratio have a relatively flat curve, indicating independence from the balance between TP and TN.

Figures 1 and 2 compare the new ratio with other ratios using only two proportions of total true responses. A complete comparison between CC and the new ratio is given in Figures 3 and 4.

3.3. Step Three (Precision)

The SD, precision, and 95% CI of the new ratio are very close to those of CC, depending on the ratio value (Figure 5). For example, the average precision of the new ratio in Figure 5(b) is 58% versus 56% for CC (NS). In the middle of Figure 5(a), when the new ratio and CC are close to 1, the precision is <1% for both ratios.

4. Discussion

A natural cut-off for categorizing two quantitative variables may be found when there is a clear inflexion point in their relationship. This is observed for the pressure-flow relationship in a Starling resistor [11] or the cellular oxygen supply and demand relationship [12]. In these two examples, a clear inflexion point allows two clear-cut categories to be determined (pressure-flow dependency or not, and oxygen supply and demand dependency or not). Alternatively, the existence of a threshold of possible diagnostic interest may also be suspected from a clear inflexion point in a ROC curve, very close to the upper left or the lower right corner of the ROC space.

In other situations, dichotomization most often leads to a loss of information, hiding the dose-response effect present in most biological processes. Nevertheless, it is sometimes necessary to create conventional categories for various reasons. In this case, it is often of interest to create categories of comparable proportions and sizes, for instance, when generalizing the results of a study for creating therapeutic tests or protocols, or for determining the cut-off point of an inclusion criterion in a two-arm study. Ideally, in 50% of patients, the presence of an indicator will predict the event (TP) and, in 50%, the absence of this indicator will predict the nonevent (TN).

The ratio described above is close to the correlation coefficient (CC) and other normalized ratios (from −1 to +1) when the predictability is poor. In contrast, when the predictive value is high, which usually is the first objective of any classification, the new ratio is the only ratio weighted by the proportion of patients classified as TP and TN. Therefore, the new ratio appears to be an interesting tool for matching good predictability with two arms of comparable sizes.

The new ratio has a few limitations. When the proportion of false responses is balanced between FN and FP, the ratio is lower than when FP and FN are clearly different, since the product FP × FN is subtracted in its numerator. Unless homogeneity in the false responses is viewed as interesting, which seems to be a hypothetical situation, this decrease in value does not indicate a real loss of clinical interest. However, this effect is also observed in other ratios that share the same numerator as the new ratio, including CC. Moreover, this effect is mostly apparent when there is a high proportion of false responses, thus relating to poor predictability.

In the present validation study, it was not possible to derive a specific mathematical formula for estimating the standard deviation of the new ratio. It would have been controversial to compare a modeled confidence interval with those of other ratios based on constant chi-square boundaries or specific formulas that are often approximate. In our view, modeling a 5% random error in 1000 hypothetical patients, as done here, is suitable for comparing the precision of the new ratio with that of other ratios. Nevertheless, determining an appropriate standard deviation formula remains a subject for future research.

The best balance between TP and TN influences decision outcome. If we create a therapeutic protocol or recommendations by generalizing the conclusion of the clinical study presented here [9], the best cut-off, identified from CC values, will lead to giving fluid to all patients with positive PLR tests (we excluded negative thresholds predicting harmful fluid bolus, as done in the published results; see Table 1(a)). If the studied population is representative of future patients, 76% of patients will receive fluid, among whom 67% will probably benefit (TP) and 9% will not. Patients with a negative PLR will receive no fluid with low risk since there was no FP. In contrast, if we consider the best dichotomization based on the new ratio, fluid infusion will be restricted to those patients with a PLR test ≥5%, therefore to 49% of the patients, among whom 44% will benefit from it and 5% will not. The other 51% of the patients will receive no fluid and, for 8% of these, it will be inappropriate.

In this example, use of the new ratio leads to a more conservative protocol (49% of interventions instead of 76%) but with a small increase in apparent inappropriate categorization (13% versus 9%). This illustrates the fact that, by using the new ratio, a better balance between TP and TN may come at the cost of a small decrease in the absolute predictability. The final choice is a clinical decision depending on the cost and side effects of treatments as well as on the consequences of miscategorizations. In this example, we can imagine that, in a specific unit with a low incidence of lung injury, a protocol based on CC would be preferred since the risk of overfilling would be limited. However, in this type of unit, a PLR test can be viewed as unnecessary since a fluid challenge can be proposed safely. In contrast, in a unit treating acute lung injuries, the risk of overfilling would lead to the use of the new ratio. In their conclusions, although the researchers did not derive the new ratio, they recommended using the specific threshold it identifies (PLR > 4% for predicting fluid response > 9%) after considering the least significant change of their measurements and the balance between the risks of overtreating and undertreating the patients [9]. This illustrates the fact that using CC and looking at the best balance between TP and TN may provide the same information as using the new ratio. However, the new ratio holds the advantage of allowing an easier, faster, and more reproducible approach.

The fact that the new ratio reached the same conclusion as clinicians in this example is not due to chance. We can speculate that in this population of patients, the probability of nonoptimal filling was close to 0.5. The best pair of thresholds identified by the new ratio represents the highest real link, including physiological and random variations. Notably, the least significant change (the change that has a 95% chance of being real and not due to fluctuations of measurements) of the cardiac output monitoring system used here was 10%. In other words, the new ratio confirms that, using this specific cardiac output monitoring system, a real increase in cardiac output after fluid bolus was 10%, and it was optimally predicted by a PLR response over 5%.

We conclude that the new ratio deserves consideration when creating conventional categories from quantitative variables, especially when a balance between true positives and true negatives is expected.

Conflict of Interests

The author declares that he has no conflict of interests.