#### Abstract

Many problem domains apply artificial intelligence and machine learning to discriminant analysis tasks such as classification, prediction, and diagnosis. However, the results are rarely perfect, and errors can cause significant losses. Hence, end users are best served when they have performance information relevant to their needs. Starting with the most basic questions, this study considers eight summary statistics often seen in the literature and evaluates their end user efficacy. The results lead to proposed criteria necessary for end user efficacious summary statistics. Testing the same eight summary statistics shows that none satisfies all of the criteria. Hence, two criteria-compliant summary statistics are introduced. To show how end users can benefit, the utility of these measures is demonstrated on two problems. A key finding of this study is that researchers can make their test outcomes more relevant to end users with minor changes in their analyses and presentation.

#### 1. Introduction

Artificial intelligence and machine learning are effective tools for discriminant analysis (DA), for example, classification, prediction, and diagnosis [1–3]. These tools are becoming increasingly popular for speeding up applied research and product development. A number of DA tool evaluation measures are seen in the literature, with acceptance varying by research domain. For example, the receiver operating characteristic is common in intrusion detection studies and the F-score is common in information retrieval. Researchers use measures because they provide actionable information. This paper explicitly addresses the question of whether these measures also provide actionable information to end users.

As an example of the problem end users face, two commonly seen summary statistics [4, 5], *total accuracy rate* (TAR) [5] and F-score [5], have opposite responses to relative class size (also known as class imbalance and prevalence) [6]. (In the current context, summary statistics are formulae that take multiple joint probability table (JPT) based values as input and output a single value that represents the target DA tool’s composite utility. A key characteristic of summary statistics is that they are not monotonic. When they are plotted against the boundary, they have optima. A common summary statistic, the receiver operating characteristic area under the curve, is boundary invariant. Since it neither increases nor decreases, it is monotonic. Since there is only a single value, it is also an optimum.) The F-score monotonically decreases as relative class size increases, while TAR monotonically increases.
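The opposite responses described above can be reproduced with a small sketch. It rests on assumptions that are not taken from the text: relative class size is modeled as the ratio of negative to positive events, the DA tool has a fixed true positive rate of 0.7 and true negative rate of 0.9, and the F-score is computed with equal weighting. The helper names are hypothetical, and other settings may respond differently.

```python
def jpt_proportions(tpr, tnr, neg_per_pos):
    """Proportional JPT for a tool with fixed rates (illustrative assumption)."""
    p_pos = 1.0 / (1.0 + neg_per_pos)  # share of positive events
    p_neg = 1.0 - p_pos                # share of negative events
    return {
        "tp": tpr * p_pos,
        "fn": (1.0 - tpr) * p_pos,
        "tn": tnr * p_neg,
        "fp": (1.0 - tnr) * p_neg,
    }


def tar(jpt):
    """Total accuracy rate: proportion of correctly labeled events."""
    return jpt["tp"] + jpt["tn"]


def f1(jpt):
    """F-score with equal weighting of precision and recall."""
    return 2 * jpt["tp"] / (2 * jpt["tp"] + jpt["fp"] + jpt["fn"])


# Assumed class-size ratios (negatives per positive) to sweep over.
ratios = [0.5, 1, 2, 5, 10]
tables = [jpt_proportions(tpr=0.7, tnr=0.9, neg_per_pos=r) for r in ratios]
tars = [tar(t) for t in tables]
f1s = [f1(t) for t in tables]

for r, a, f in zip(ratios, tars, f1s):
    print(f"ratio={r:>4}  TAR={a:.3f}  F={f:.3f}")
```

Under these assumed rates, TAR rises and the F-score falls as the negative class grows, although the classifier itself is unchanged.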

End users want to know how a specific DA tool will impact their problem. An informative measure for any stakeholder must be sensitive to relevant problem domain characteristics and insensitive to irrelevant (confounding) characteristics. DA tool stakeholders can be partitioned into three groups.

(i) *Basic researchers* focus on developing new DA algorithms. This group expects that an effective new DA algorithm will be useful in many problem domains, so their evaluations need to be application agnostic; specific problem domain characteristics are, in fact, confounding. Basic researchers introduced DA techniques such as nearest neighbor [7], neural net [8], and support vector machine [9].

(ii) *Applied researchers* use DA algorithms on specific problem domains to create tools and code libraries useful for that domain; specific problem domain characteristics are important. Examples of tools incorporating DA algorithms are anomaly based intrusion detectors for cyber security [10, 11] and document classifiers for enterprise information retrieval systems [12]. In this context, the focus is on the DA tool; data sets are used to develop the tool/library.

(iii) *End users* use DA tools to solve problems in their domain. Domain-specific characteristics are important, as are operational aspects like impact sensitivity to class boundary settings (the setting that determines to which class each observation is allocated). Fields of study include medicine [13], molecular biology [14, 15], finance [16], and so forth. The end user’s context is the opposite of the other two groups; the DA tool classifies the data, rather than known data being used to evaluate the tool.

The researcher definitions specialize those published by the National Science Board [17].

Jamain and Hand comment on user need in their DA tool meta-analysis:

The real question a user generally wants to answer is “which classification methods [are] best for me to use on my problem with my data…” [18].

In artificial intelligence, Russell and Norvig express a similar sentiment:

As a general rule, it is better to design performance measures according to what one actually wants in the environment, rather than according to how one thinks the agents should behave [19].

In the published studies reviewed, proposed measures may be mapped to specific problem domains, but a general means by which end users can quantify DA tool effectiveness in their setting has not been identified. Indeed, Jamain and Hand generalize Duin’s sentiment regarding comparing automated, heavily parametrized DA tools (also known as classifiers):

“It is difficult to compare these types of classifiers in a fair and objective way [20].”

Seemingly, end user needs are viewed as too complex and diverse to address. End user issues, when discussed, have been constrained to specific problem domains. A literature search found no work that quantifies end user impact.

End users face a daunting challenge when selecting a DA tool summary statistic. Sokolova et al. comment “…the measures in use now do not fully meet the needs of learning problems in which the classes are *equally important* and where *several algorithms are compared*” [21]. These sentiments are echoed in the text by Japkowicz and Shah:

Although considerable effort has been made by researchers in both developing novel learning methods and improving existing models and approaches, these same researchers have not been completely successful in alleviating the users’ skepticism with regard to the worth of these developments. This is due, in part, to the lack of both depth and focus in what has become a ritualized evaluation method used to compare different approaches [5].

These authors also observe that problem domains have *de facto* standard summary statistics. By starting with the most basic questions (the approach used by Artzner et al. to address a similar situation in the financial risk domain [22]) and specifically addressing users’ issues, this study goes directly to the heart of the matter as perceived by Japkowicz and Shah.

The balance of this paper is organized as follows. Section 2 presents the lexicon. Section 3 discusses relevant work. Section 4 identifies some questions end users have regarding DA tool efficacy. Section 5 explains the research protocol used. Section 6 reports this study’s results and summary statistic recommendations. Section 7 assesses the recommended changes. Section 8 summarizes findings and suggests future work.

#### 2. Lexicon

Although this paper applies well-established stochastic concepts, not all discussions use the same terminology [1, 2, 4, 5, 23, 24]. To avoid confusion, a lexicon and alternate terms seen in the literature are provided:

: it is the population which, when observed, satisfy the characteristics defining .

: it is the population which, when observed, do not satisfy the characteristics defining .

: it is boundary, the DA tool vector that defines the partition between and .

*Class imbalance*: see relative class size.

*Confidence interval (CI)*: it is a range within which % of the observations lie. A 90% CI would exclude the 5% on each pdf tail.

*Confusion matrix*: see joint probability table.

*Contingency table*: see joint probability table.

*Discriminant analysis (DA)*: it is a process whereby observations are tagged as being members of a class.

*Diagnostic Odds Ratio (DOR)*: it is a DA tool evaluation measure [25].

*Diagnostic Power (DP)*: it is a DA tool evaluation measure [26].

*Error matrix*: see joint probability table.

F-score: it is a DA tool evaluation measure [27].

*Frequency table*: see joint probability table.

: it is false positive, class events incorrectly flagged as class (“Type I error”).

: it is false negative, class events incorrectly flagged as class (“Type II error”).

*Ground Truth*: it is the actual class to which an observation belongs.

: it is the gain or loss (positive or negative) associated with each element output. Each and every element output will affect the end user by the element of applicable to the category to which the element is binned. (Typically, gains are viewed as positive values and losses are negative values, although there are exceptions.)

, , , , or ( denotes the specific element of the source set, , , , , and ): they are individual element impacts in the raw data.

: it is the vector of JPT category impacts, expressed as statistical expectations (*expected individual element impact*) by category or class.

*Joint probability table (JPT)*: it is an table in which one axis represents ground truth ( and ) and the other axis represents DA output ( and ). The cells contain the counts for observations based on their ground truth and DA output. The DA tool under test, configured with boundary vector (, a “surface” that partitions the problem space), bins into and . The JPT() bin counts are snapshots of DA tool labeling versus ground truth at . Frequently, these counts are presented as proportions of . In a JPT of proportions, each cell in Table 1 is divided by . The difference between the two table types is that the cell entries in Table 1 are integers, with the total of all four categories equaling , whereas the cell entries in a proportional JPT are rational numbers that sum up to one. Additionally, the proportional values represent the probability that, for a given relative class size (), any randomly selected DA tool output will be a member of that particular JPT category.

*Matthews Correlation Coefficient (MCC)*: it is a DA tool evaluation measure [28].

*Mutual Information Coefficient (IC)*: it is a DA tool evaluation measure [29].

*Probability density function (pdf)*: it is the probability of a particular value, within the range of all possible values, being observed.

*Proportion*: see relative class size.

*Receiver Operating Characteristic Area Under the Curve (ROC-AUC)*: it is a DA tool evaluation measure [30–32].

*Relative class size*: [5].

: it is the uncategorized data set.

: it is true positive, correctly identified events in class .

: it is true negative, correctly identified events of class , the other class.

*Total accuracy rate (TAR)*: it is a DA tool evaluation measure. (TAR has been in use so long that its source is not found cited.)

: it is actual class events in the data set.

: it is actual class events in the data set.

*Youden index* (): it is a DA tool evaluation measure [33].

: it is events flagged by the DA tool as class .

: it is events flagged by the DA tool as class .

and constitute as partitioned according to ground truth. and constitute as partitioned by the DA tool output. , , , and cardinalities can be presented in a joint probability table (JPT). These are displayed in lower case: , , , and .
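As a minimal sketch of the lexicon in practice, the following shows a four-cell JPT of counts, its proportional form, and several of the measures named above in their commonly published forms. The class name `JPT`, the field names, and the example counts are hypothetical, and the formulas are the standard textbook definitions rather than anything specific to this paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class JPT:
    """Four-cell joint probability table of counts (hypothetical helper)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def proportions(self) -> dict:
        """Divide each cell by the total so the four cells sum to one."""
        n = self.total
        return {"tp": self.tp / n, "fp": self.fp / n,
                "fn": self.fn / n, "tn": self.tn / n}


def tar(t: JPT) -> float:
    """Total accuracy rate."""
    return (t.tp + t.tn) / t.total


def f_score(t: JPT, beta: float = 1.0) -> float:
    """F-score; beta weights recall relative to precision."""
    b2 = beta ** 2
    return (1 + b2) * t.tp / ((1 + b2) * t.tp + b2 * t.fn + t.fp)


def youden(t: JPT) -> float:
    """Youden index: sensitivity + specificity - 1."""
    return t.tp / (t.tp + t.fn) + t.tn / (t.tn + t.fp) - 1


def dor(t: JPT) -> float:
    """Diagnostic odds ratio."""
    return (t.tp * t.tn) / (t.fp * t.fn)


def mcc(t: JPT) -> float:
    """Matthews correlation coefficient."""
    num = t.tp * t.tn - t.fp * t.fn
    den = math.sqrt((t.tp + t.fp) * (t.tp + t.fn) * (t.tn + t.fp) * (t.tn + t.fn))
    return num / den


example = JPT(tp=40, fp=10, fn=20, tn=30)  # made-up counts
print(example.proportions())
print(tar(example), f_score(example), youden(example), dor(example), mcc(example))
```

Every measure here takes only the four JPT cells as input; none has a slot for problem-specific impacts.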

#### 3. Related Work

Summary statistics are a subset of the measures used for DA evaluation. Measures such as sensitivity, specificity, and true positive predictive rate are often seen in studies using DA. These measures are monotonic (do not have optima by which we identify optimum DA boundary settings); hence, they are not summary statistics. Their value is providing greater detail by quantifying particular aspects of DA tool performance. It should come as no surprise that some summary statistics are functions of monotonic JPT-based measures such as sensitivity, specificity, and true positive predictive rate.

Sokolova and Lapalme tested both monotonic measures and summary statistics used for DA tool evaluation for invariance to various JPT perturbations [34]. Their work’s value is based on assuming that measure selection should consider their invariance relative to the problem’s need for invariance. This work tightens their constraints by requiring that measures provide actionable information to the stakeholder, in this case, end users.

It might be fair to state that each end user’s need is, in some way, unique. However, uniqueness does not mean that useful common problem characteristics do not exist. Starting with very basic questions, this paper identifies a general DA tool problem framework. Using that framework, this work identifies criteria which end user efficacious measures for DA tool evaluation must satisfy and proposes two summary statistics that meet those criteria. End user efficacious summary statistic optima indicate the best overall DA tool utility across the observed boundary range. Ideally, these summary statistics also quantify some efficacious aspect of DA tool output for end users. Efficacious measure values enable end users to directly estimate how the DA tool will affect their situation.

The state of the art in DA evaluation has inspired many classifier evaluation reviews. With the exception of Baldi et al.’s narrowly scoped study [35], none address end user interests. In fact, stakeholders were generally not identified. In some cases, measure characterizations were presented with the expectation that knowledgeable stakeholders would be able to apply the information. Relevant aspects of a few of the many characterization studies in the literature are summarized:

(i) Parker analyzed five measures [36]. His analysis does not identify the stakeholder. However, his recommendation that an integrated measure should be used when possible suggests that his evaluation addresses researchers.

(ii) Japkowicz reviews machine learning evaluation methods and suggests the need for “a framework that would link our various evaluation tools to the different types of problems they address and those that they fail to address” [37]. Japkowicz’s work focuses on issues within the machine learning domain but does not differentiate between research and user interests.

(iii) Recognizing the large number of comparative studies, Jamain and Hand executed a meta-analysis [18]. By mining relevant studies, they hoped to integrate their findings and gain insight into the problem. Their analysis did provide some insights. However, in closing, they note that the investigation did not shed light on end user issues. Their paper also indicated that they felt the problem was intractable.

(iv) Caruana and Niculescu-Mizil investigate nine measures, partitioned into three different types [38]. Their results lead them to propose a measure suite which outperforms the individual measures. Their study, however, does not specify the stakeholders, nor does it address the end user’s potential need to address unequal event impacts.

(v) In the same vein as Caruana and Niculescu-Mizil, Seliya et al. empirically compared twenty-two measures [39]. Their work identified strongly correlated measures with the intent of letting investigators select measure suites with minimal potential redundancies. Their work did not search for root causes or identify stakeholders.

(vi) Baldi et al. assess measures, restricting their scope to two bioinformatics problems [35]. They review eleven measures and conclude that MCC may be the best overall for their domain. Baldi et al. also observed a slight difference in between MCC and IC. For their intended users, impacts are apparently not an issue, as they were not mentioned.

(vii) Sokolova has been actively addressing the classifier performance measure problem [21, 34, 40]. In [34], Sokolova and Lapalme compare eight measures. They conclude that there is a subset of measures that are suitable for problems with restricted access to data, the need to compare several classifiers and equally weighted classes. In [21, 40], she reviews measures for their invariant properties. Thus, potential users can select measures that are suitable for their particular need.

The literature also includes two recent, relevant books. Witten et al. discuss cost-based analysis and present a reasonable case for its use and a means of classifier selection based on [41]. However, they do not identify the root problem and so stop short of identifying and addressing some of the end user impact factors addressed here. Japkowicz and Shah may have the most comprehensive discussion of DA evaluation [5]. Their invaluable book covers many challenges faced by evaluators. Like Witten et al., they “treat the symptoms” but do not identify the root problem.

The literature review found no common understanding of what constitutes a good DA tool evaluation summary statistic, a glaring absence. Nor were any “good summary statistic” criteria for DA tool evaluation found. A key contribution of this paper is establishing four criteria for end user efficacious DA tool summary statistics.

#### 4. End User Measure Efficacy Considerations

The focus here is end user interests. For DA tool selection and deployment, texts such as Clemen [42] lay the foundation for three questions informed end users must answer:

(1) *What is the DA tool’s impact on my problem?* To answer this question, consider what the summary statistic quantifies and how that relates to end users.

(2) *What is the boundary that provides the optimum impact?* Only a boundary sensitive summary statistic can provide this information.

(3) *How sensitive is the impact to boundary selection?* Only a boundary sensitive summary statistic can provide this information.

Measurement theory provides additional insight on summary statistic efficacy: numbers are used in different ways. These uses constrain their information content and hence their utility. This work uses the scale-type definitions proposed by Stevens [43]. Stevens defined four scale types, nominal, ordinal, interval, and ratio. Ratio scales have the least functional constraints, so summary statistics using ratio scales are the most information rich ones. Ratio scales have two unique and readily identifiable characteristics.

*They Have Meaningful Zeros*. A meaningful zero for end users indicates that the statistical expectation of the DA tool’s output has no effect on their problem.

*They Have a “Standard Unit.”* This means that . One implication of having a standard unit is that there is no upper bound and the lower bound can be either zero or negative infinity. A DA tool’s output could negatively impact an end user, so the most generally useful measure’s scale range must be $(-\infty, +\infty)$.

Reflecting on the end user interests, only a ratio scale summary statistic will satisfy point (1) above. A recurring topic in studies using DA tools is class imbalance or relative class size (a similar concept, prevalence, is seen in medical research [1, 2]). Japkowicz and Shah quantify class imbalance as [5]. is used to evaluate summary statistics regarding (i) the question answered, (ii) scale type, and (iii) sensitivity to environmental factors, , and pdf.

#### 5. Research Protocol

This study develops and tests a mitigation for the end user DA tool assessment gap introduced in Section 1.

##### 5.1. Problem Mitigation Plan

In a literature search, no foundation upon which users can base a measure selection to evaluate DA tools was found. However, the financial domain had a similar situation in assessing market risk. Artzner et al. tackled the problem by first establishing the need, then defining performance criteria, and ending with measurement recommendations [22]. This work applies Artzner et al.’s protocol by the following:

*(1) Defining Explicitly the End User’s DA Tool Evaluation Problem*. This is addressed in Section 4, presented as three questions.

*(2) Identifying an Efficacious Measure’s Necessary Properties*. Measure values that provide end users with actionable information (measure values with which end users can answer the three questions posed) must have certain properties. Section 6.2 proposes criteria necessary for efficacious measures.

*(3) Testing Measures for the Existence of These Properties*. Section 6.1 shares insights gained evaluating the eight measures relative to Section 4 questions.

*(4) Recommending Adjustments to Have Conformant Measures*. From Sections 6.4 to 7, this paper applies the insights gained, making specific recommendations regarding end user efficacious measures.

##### 5.2. Mitigation Plan Analysis

DA tool evaluation studies can be partitioned into two groups: those that use “real-world” data and those that use simulated data; both are used here.

Characterizing DA tool evaluation measures requires observing how the summary statistics respond as DA tool output varies. Observing the effect of incremental changes on real-world data is difficult. For this purpose, simulated DA tool output was used.

To preserve generality, distribution insensitive analytic procedures are used here. The analysis is nonparametric; medians are used instead of means and quantiles are used instead of standard deviations. Monte Carlo method-based tests use well-defined pdfs.
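A nonparametric summary of the kind described above can be sketched with the Python standard library. The function name and the 5% default tail are assumptions chosen to match the 90% interval described in the lexicon. The percentile method shown is one reasonable choice, not necessarily the one used in this study.

```python
import statistics


def nonparametric_summary(values, tail_percent=5):
    """Median plus the central interval that excludes `tail_percent` per tail."""
    # 99 cut points; index i - 1 holds the i-th percentile.
    percentiles = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "median": statistics.median(values),
        "lower": percentiles[tail_percent - 1],
        "upper": percentiles[100 - tail_percent - 1],
    }


# Made-up sample: the integers 0 through 100.
summary = nonparametric_summary(list(range(101)))
print(summary)
```

Because only order statistics are used, the summary does not assume any particular distribution shape.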

DA tool quality is based on supervised tests. This test framework is well established in artificial intelligence circles [4, 5, 19]. Supervised tests provide the ability to compare ground truth (the actual class membership of objects in the test set) to the DA tool’s class membership determination of the test set objects, that is, the DA tool’s output [44, 45].

One risk with using simulated data is that the situations may be unrealistic. Consequently, studies using solely simulated data suffer from the perception of being unproven. Often DA tests use real-world data from repositories, for example, the University of California, Irvine Machine Learning Repository (UCI). Alas, UCI does not have JPT category impact, which seriously diminishes that analytical approach’s sense of realism. Therefore, two published studies are reanalyzed to compare classifiers in two separate domains.

##### 5.3. Study Scope

There is a wide variety of classification problem types. Limiting this study’s scope reduces the risk of obscuring results. This work is limited to problems where

(i) the events mapped to each ground truth class ( and ) are independent;

(ii) each event’s impact () is independent;

(iii) the problem is restricted to a $2 \times 2$ matrix;

(iv) the end user’s problem either treats each input set element individually (as in the case of a medical diagnosis or intrusion detection) or the problem is based on the cumulative effect of the elements in the input stream (as in the case of bank loan application decisions or information retrieval).

#### 6. Results

This section summarizes how well the eight selected measures address the questions posed. Drawing on these findings, four criteria are presented to address end user’s needs. This section closes by using the criteria to make recommendations for end user efficacious DA evaluation measures.

##### 6.1. Evaluating the End User Efficacy of Existing Measures

While investigating the current state of the art, eight DA tool evaluation measures were found to be very common: *total accuracy rate*, the *Receiver Operating Characteristic Area Under the Curve*, *F-score*, *Youden index*, two related measures, *Diagnostic Odds Ratio* and *Diagnostic Power*, *Matthews Correlation Coefficient*, and *Mutual Information Coefficient*. Each measure’s ability to answer the three questions posed in Section 4 was evaluated. Due to space constraints, findings are summarized here. The full analysis is online [46]. Key findings were as follows:

(i) Most are ordinal scale measures and none are ratio scale measures. Answering the questions requires ratio scale values, so none could answer the questions.

(ii) The summary statistics measured characteristics that were at best niche problems.

(iii) None included end user impact; however, the F-score did address category importance.

(iv) Excepting ROC-AUC and TAR, all of the summary statistics reviewed define classifier quality equal to a fair coin as “zero.” This may be reasonable in research. However, a fair coin may not have a zero impact on an end user.

Additional measure characterizations are also available in Eiland and Liebrock [6].

##### 6.2. Efficacious End User Summary Statistics

Efficacious end user measures of DA tools must reflect the DA tool’s performance in the end user’s environment and their problem’s context. There are many types of performance, but this paper focuses on DA tool output utility, the ability to approach the optimal response for an end user. This section frames insights gained as DA-specific criteria and measurement theory principles. After presenting each criterion, eight commonly seen summary statistics are evaluated for compliance.

###### 6.2.1. Criterion 1, Category Impact

Consider a physician making two treatment decisions, one treating a potential cold and the other treating potential rheumatoid arthritis (RA). Someone with a cold, who is treated for it (TP), will experience a minimal negative impact on their quality of life (low QoL impact). However, someone with an untreated cold (FN) can be quite miserable (high negative QoL impact). Someone without a cold and untreated (TN) will experience no effect (no QoL impact); someone without a cold but treated for it (FP) will experience a small impact from medication side effects and cost (low QoL impact). In this situation, the best strategy may be to minimize FNs and accept high FP, by treating anyone with the slightest indication of a cold.

The situation is different with RA. Someone with RA and treated (TP) will experience a minimal negative impact on their quality of life (low QoL impact). However, someone with untreated RA (FN) will be significantly debilitated (high negative QoL impact). Someone without RA and untreated (TN) will be unaffected (no QoL impact). Someone without RA but treated (FP) will experience significant disability (high QoL impact). Faced with this decision, the cold treatment strategy is inappropriate: both FNs and FPs must be minimized. A conscientious physician must consider category impacts.

An efficacious end user summary statistic must be sensitive to the same factors, to the same degree, as end users are to their respective problems. With regard to utility, the end user’s context is defined by the importance, or impact, of elements from each JPT category on the end user. The criterion follows directly from the previous discussion. A small change to any element of will generate corresponding changes in a compliant measure. For example, TAR and the F-score can be interpreted as being the same base ratio, differing only by (recall that ): When , then ratio 1 is TAR. Likewise, when , then ratio 1 is F-score. The response of these two measures to the same input JPTs is distinctly different. The commonality between TAR and F-score leads to the first criterion.

*Criterion 1 (category importance).* An end user efficacious summary statistic must be a function of problem specific impact set , where each element of .

A summary statistic that complies with Criterion 1 is sensitive to the impact set. Thus, end users can tune the measure’s output to suit their problem. Criterion 1 provides a direct answer to the end user’s question “What is the DA tool’s impact on my problem?” Since the end user’s other two questions also address impact, Criterion 1 also addresses them.

None of the summary statistics reviewed satisfy Criterion 1. The F-score, conditioned by , provides some ability to incorporate impact. However, since regardless of , it fails. The other summary statistics considered (TAR, Youden index, ROC-AUC, DOR/DP, MCC, and IC) do not have a provision for setting impacts; all have fixed impacts: implicitly, .
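To make the role of category impacts concrete, the sketch below weights proportional JPT cells by per-category impacts and sums them into an expected impact per element. This is a generic expected-value illustration, not the summary statistics this paper proposes. All numbers are hypothetical: the impact values loosely follow the cold and RA discussion, and the two JPTs stand for an assumed lenient and an assumed strict boundary setting.

```python
CATEGORIES = ("tp", "fp", "fn", "tn")


def expected_impact(proportions, impacts):
    """Sum of JPT cell proportions weighted by per-category impacts."""
    return sum(proportions[c] * impacts[c] for c in CATEGORIES)


# Hypothetical QoL impacts per element (negative values are losses).
cold_impacts = {"tp": -1.0, "fn": -10.0, "tn": 0.0, "fp": -1.0}
ra_impacts = {"tp": -1.0, "fn": -10.0, "tn": 0.0, "fp": -10.0}

# Hypothetical proportional JPTs for two boundary settings.
lenient = {"tp": 0.28, "fn": 0.02, "fp": 0.30, "tn": 0.40}  # treat on slight indication
strict = {"tp": 0.22, "fn": 0.08, "fp": 0.05, "tn": 0.65}   # treat on strong indication

for name, impacts in (("cold", cold_impacts), ("RA", ra_impacts)):
    for label, jpt in (("lenient", lenient), ("strict", strict)):
        print(f"{name:>4} {label:>7}: {expected_impact(jpt, impacts):+.2f}")
```

With these assumed numbers the lenient boundary has the smaller expected loss for the cold, while the strict boundary has the smaller expected loss for RA. The same two JPTs therefore rank differently once impacts differ.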

###### 6.2.2. Criterion 2, pdf Sensitivity

When making loan decisions, banks rely heavily on credit scores. In a stable economy, the credit score distribution (pdf) may also remain stable. However, disruptions, such as a reduction in force by a major employer, can cause the pdf to shift. Then, the number of loan defaults by applicants with low but acceptable credit scores may become unacceptable, requiring increasing the acceptable credit score. Finding the new optimum threshold will depend upon the new pdf’s shape.

DA problems may exist where the optimum boundary is not sensitive to input class pdfs. However, end users need to identify an appropriate DA tool boundary for their problem. Consider an end user environment with two classes, defined by and , where and are the distribution means and and are the distribution standard deviations. Let the end user have optimum boundary . If both distributions are changed by adding to and , then the optimum boundary becomes and remains constant. There are myriad permutations that can be made to this simple end user environment. Most will affect and/or ; some will not. The point is that class distribution invariance is not an end-user-efficacious summary statistic characteristic; end users are better served by summary statistics that are class distribution sensitive.

*Criterion 2 (pdf sensitivity).* With a change in , (e.g., , , and ), where describes a perturbation in ’s and ’s source population, for all boundaries within , there exists . The same is true for a change in and for any .

For any boundary within the interval affected by a probability distribution change, an effective summary statistic’s expected output will reflect that change. A summary statistic compliant with Criterion 2 will reflect how the target DA tool impacts the end user when provided with different inputs. Criterion 2 addresses the second and third end user questions posed, “What is the boundary that provides the optimum impact?” and “How sensitive is the impact to boundary selection?”, which are both related to classifier output pdfs. One end user’s optimum boundary may not be optimum for another.
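The two-class shift example can be sketched numerically. The setup is an assumption made for illustration: both classes are Gaussian with unit standard deviation and equal size, scores above the boundary are flagged positive, and TAR stands in for a pdf-sensitive summary statistic. The function names, the grid, and the shift value are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def tar_at(boundary, mu_neg, mu_pos, sigma=1.0, p_pos=0.5):
    """Expected TAR when scores above `boundary` are flagged positive."""
    tn = (1.0 - p_pos) * normal_cdf(boundary, mu_neg, sigma)
    tp = p_pos * (1.0 - normal_cdf(boundary, mu_pos, sigma))
    return tp + tn


def best_boundary(mu_neg, mu_pos, grid):
    """Grid search for the boundary with the highest expected TAR."""
    return max(grid, key=lambda b: tar_at(b, mu_neg, mu_pos))


GRID = [i / 100 for i in range(-500, 1501)]  # candidate boundaries
DELTA = 3.0                                  # assumed shift added to both means

b_before = best_boundary(0.0, 2.0, GRID)
b_after = best_boundary(0.0 + DELTA, 2.0 + DELTA, GRID)

print("optimum before shift:", b_before, tar_at(b_before, 0.0, 2.0))
print("optimum after shift: ", b_after, tar_at(b_after, 3.0, 5.0))
print("old boundary after shift:", tar_at(b_before, 3.0, 5.0))
```

In this setup the optimum boundary moves by the same amount as the means and the optimum value is unchanged, while the value at the old boundary drops. A boundary-sensitive statistic makes that change visible to the end user.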

Of the summary statistics considered, only TAR, MCC, and IC comply fully. AUC fails Criterion 2: it is boundary invariant. The Youden index and DOR/DP, being invariant, fail for sensitive problems. The F-score fails because it is invariant.

###### 6.2.3. Criterion 3: DA Tool Output Basis

In the loan decision problem mentioned in Section 6.2.2, after the decision is made, the lender has knowledge of the loans made (, , and ) but knows only for the loan applications rejected. Given this information, the lender can verify the quality of their model. By comparing actual results against predictions, the lender can determine if their loan decision process is working as expected.

End users have limited visibility into a DA tool’s process. They have knowledge of inputs and outputs but not ground truth. Thus, end users may find DA tool evaluation measures that can be calculated from and more useful than others. Given the proper measure, end users can better assess their DA tool options and monitor DA tool effectiveness. These end user visibility observations lead to the third criterion.

*Criterion 3 (DA tool output basis).* An end user efficacious summary statistic must be quantifiable with information known and visible to the end user ( and ).

One advantage Criterion 3 confers on end users is the ability to compare predicted outcomes to field observations. Criterion addresses information relevance, and Criterion 3 addresses information availability. Availability is not explicitly mentioned in the three questions but is implicit; none of the questions can be answered if the information is not available. Hence, Criterion 3 is relevant to all three questions.

The invariant measures, ROC-AUC and the Youden index, have as their basis the measure suite [6]. Both of these measures are conditioned by ground truth ( and ), not the DA tool outputs visible to the end user ( and ); they do not satisfy Criterion 3. However, the invariant measure pair DOR/DP do. Criterion 3 compliance is demonstrated by substituting the four conditional ratios for , , , and in the measures. In the DOR equation, multiplying the numerator and denominator by results in the original DOR equation. , and thus DP also satisfies Criterion 3.

Using the same substitution above in the equations for TAR, -score, AUC, MCC, and IC shows none are equivalent to the original equations. Of the commonly seen summary statistics considered, only the DOR/DP satisfies Criterion .

###### 6.2.4. Criterion 4: Measure Value Appropriateness

Lenders are interested in loans that maximize profit. Physicians are interested in treatments that maximize patient QoL. Dairymen are interested in breeding cows to maximize milk production. If a dairyman, instead of being given data on pounds of milk produced per cow, was given data on the variation in milk produced per cow, using that information to optimize milk production would be difficult. Outputs from measures such as ROC-AUC and DOR/DP are not mappable to these users’ needs. For a measure to be end user efficacious, the end user must be able to map measure output to their problem. The end user wants to avoid making a decision based on unrelated information. An informed end user may know what the values presented quantify. But, if the values are not mappable to their problem, the end user must rely on “soft” evaluations, such as expert opinion, which may incur considerable uncertainty.

*Criterion 4 (measure value appropriateness).* The summary statistic output must quantify the DA tool’s impact on the end user’s characteristic of interest.

Criterion may seem self-evident, but not all measures satisfy it. For example, the ROC-AUC quantifies the probability that a randomly selected member of class will have a lower test value than a randomly selected member of class . Thus, the ROC-AUC value assumes prior knowledge of ground truth, which, if an end user knew, would mean no DA tool was needed. Criterion may seem similar to Criterion , but it addresses the DA tool function, while Criterion addresses the input’s information content.

DOR/DP quantify the odds of two randomly selected elements of the test set being one each and , rather than one each and . DOR/DP does not require prior knowledge of ground truth, but it is a very specific scenario. It requires output pairs, rather than considering individual outputs. Secondly, there are ten possible pairings (e.g., two s) and ninety unique ratios. Thus, DOR/DP output is not broadly applicable.

From the end user perspective, the ROC-AUC, DOR/DP, TAR, and -score all share another failing: all have lower bounds of zero and cannot quantify a negative impact.

###### 6.2.5. Preconditions from Measurement Theory

Measurement theory has addressed end user measure efficacy [47]. The scale-type definitions proposed by Stevens [43] are used here. Stevens defined four scale types: nominal, ordinal, interval, and ratio. Ratio scales have the fewest functional constraints, so measures using ratio scales are preferred. Of the three end user questions asked, measurement theory is relevant to two: “what is the DA tool’s impact on my problem?” and “how sensitive is the impact to boundary selection?” Both need measurement scales with meaningful zeros and standard sequences. The remaining question, “what is the boundary that provides the optimum impact?”, is answerable on an ordinal scale.

Of the measures reviewed (TAR, -score, MCC, Youden index, DOR/DP, IC, and ROC-AUC), none are quantified on a ratio scale.

##### 6.3. The Necessity of the Criteria

First, the root problem was identified as a means of identifying measure characteristics that satisfy end users’ needs. Table 2 summarizes that mapping. There is a strong relationship between the measure properties proposed and the end user questions posed. In the table, “yes” means the criterion is necessary and “no” means the criterion is not.

The end user questions posed are topics addressed in business management programs and operations research, so they are generally applicable. Hence, the measure properties are generally applicable as well. Each of these criteria addresses at least one end user need. A measure that does not satisfy every criterion fails to provide some information needed by end users. Hence, these criteria are necessary for problems within the scope of this work. Investigating sufficiency will be a topic of future work.

##### 6.4. Two End User Efficacious DA Tool Evaluation Measures

Table 3 recaps how each summary statistic tested conforms to the criteria and ratio scale properties. None satisfy all of the criteria. Sokolova and Lapalme tested invariance to JPT perturbations [34]. Where their tests are relevant to the proposed criteria, their results corroborate these results. As noted in Section 1 and supported in Section 6.2, the commonly seen DA tool evaluation measures do not quantify end user impact well. This section proposes suitable measures. Two problem types are considered separately.

When the impact is cumulative, there will be either a gain or loss (impact, ) associated with each element output. can be expressed as a statistical, not necessarily unique, expectation for each JPT category; (). Thus, each and every element output will affect the end user by the element of applicable to the category to which the element is binned. (The elements of can be defined in different ways, depending upon the information known about each element of . For instance, if elements were bank loan applications, then the impact could be quantified per dollar requested. Alternatively, impact could be based on the statistical expectation for the category. Using bank loan applications, the impact could be per loan.) An end user can expect the net gain or loss () to be the sum of the individual element gains and losses. For problems where impact is cumulative, can also be expressed on and , the outputs actually observed by the end user:

For the test set ,* estimated total impact* is . “Profit,” a measure for customer churn prediction models, was introduced by Verbraken et al. [48]. It differs from in two ways: (i) profit’s costs and benefits must all be positive values and (ii) misclassification costs are deducted, and correct classification gains are added. Intuitively, Verbraken et al.’s constraints seem correct: gains are positive and losses are negative. However, this is not universally true. Recasting a problem to fit profit’s requirements could cause the measurement scale to no longer have a meaningful zero resulting in an interval scale and invalidation of analysis such as Verbraken et al.’s proposed cost-benefit ratio. , as seen in (5), is not susceptible to measure degradation.
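For readers who prefer code, the cumulative case can be sketched as a sum of per-category impacts weighted by category counts. This is only an illustration: the category names (`tp`, `fp`, `fn`, `tn`), the counts, and the impact values are hypothetical assumptions, and impacts are allowed to be negative, in line with the contrast drawn with “profit” above.

```python
CATEGORIES = ("tp", "fp", "fn", "tn")


def total_impact(counts, impact):
    """Net gain or loss: each element contributes the impact of its JPT category."""
    return sum(impact[c] * counts[c] for c in CATEGORIES)


def impact_by_output(counts, impact):
    """Split the same total by what the end user observes: the tool's two outputs."""
    from_positive = impact["tp"] * counts["tp"] + impact["fp"] * counts["fp"]
    from_negative = impact["tn"] * counts["tn"] + impact["fn"] * counts["fn"]
    return from_positive, from_negative


# Hypothetical test-set counts and per-element impacts (signed values are allowed)
counts = {"tp": 40, "fp": 10, "fn": 20, "tn": 130}
impact = {"tp": 100.0, "fp": -500.0, "fn": -50.0, "tn": -1.0}

net = total_impact(counts, impact)
pos_part, neg_part = impact_by_output(counts, impact)
```

The two output-based parts always add up to the net value, so the total can be reported either way without losing information.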

There are occasions when the impact is not cumulative, but each output is important individually, for example, a medical diagnosis. In these situations, is confounding, so normalized JPTs are used; normalization mathematically balances the relative class sizes, and thus it mitigates any skew resulting from (JPT normalization’s mitigation is limited by the strong law of large numbers; if the minor class sample is too small, then skew exerts a significant influence [6]). A normalized JPT is shown in Table 4. To facilitate comparison with nonnormalized JPTs, the sum of all categories is kept at one () and the individual input class values ( and ) add up to 0.5.
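The normalization described above can be sketched as follows, assuming the JPT is given as four counts keyed by the conventional names `tp`, `fn`, `fp`, and `tn` (an assumption for illustration; Table 4 gives the paper's own expressions). Each ground-truth class is rescaled so that its two cells sum to 0.5, which keeps the whole table summing to one.

```python
def normalize_jpt(tp, fn, fp, tn):
    """Rescale a JPT so each ground-truth class contributes exactly 0.5."""
    positive_class = tp + fn  # all elements whose ground truth is positive
    negative_class = fp + tn  # all elements whose ground truth is negative
    if positive_class == 0 or negative_class == 0:
        raise ValueError("both classes need at least one element")
    return {
        "tp": 0.5 * tp / positive_class,
        "fn": 0.5 * fn / positive_class,
        "fp": 0.5 * fp / negative_class,
        "tn": 0.5 * tn / negative_class,
    }


# Hypothetical imbalanced test set: 60 positives, 140 negatives
normalized = normalize_jpt(tp=40, fn=20, fp=10, tn=130)
```

Within each class the cell proportions are unchanged; only the relative weight of the two classes is equalized.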

The end user’s concern, “given that a result is rendered, how am I affected?,” however, can be partitioned into two questions: “given that the result is positive, how am I affected?” and “given that the result is negative, how am I affected?” (Problems where DA tool results are cumulative can be partitioned in the same way. However, when results are cumulative, question is the most useful for an end user; questions and may be of secondary importance. When individual outputs are important, questions and are primary.)

These questions indicate that the values of interest are weighted conditional expectations: Output is independent, thus the expected outcome impact equals its average: Substituting the normalized expressions from Table 4, the expected impact becomes

The two problem types (individual impact and cumulative impact) have their unique characteristics, resulting in different sets of relevant measures. For problems where impact is cumulative, the summary statistic provides actionable information to the end user. The monotonic measures upon which it is based and which provide more insight into DA tool impact are the four individual category impacts, , where

Both and have optima, so they are summary statistics. In the case of , the associated measure suite (measures that provide insight into a specific aspect of the DA tool output) consists of and , the two outputs observable by end users. differs from the usual summary statistic in that it directly quantifies the characteristic of interest to end users. ’s measure suite consists of the conditional expectations of the two outputs observable by an end user, and . For problem domains where is appropriate, the summary statistic does contain less information. Consider, for instance, the example where a person receives a medical diagnosis. If the test result is positive, then that person’s impact will be *either* or . Likewise, if the test result is negative, then that person’s impact will be *either* or . The three composite measures, , , and , may have little utility for the patient. The values are, however, useful to diagnosticians in assessing diagnostic and treatment strategies.

and are suitable for many DA problems and extensible to DA problems with more than two classes. Since they are additive, all that is needed is to sum up the impact adjusted JPT values. For , the JPT must be normalized; then, each category is conditioned by its DA tag. Then, the impact adjusted values can be summed up.
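For the noncumulative case, the conditioning step can be sketched as below. This is an illustrative reading of “each category is conditioned by its DA tag”: within each output (positive or negative), the impact-weighted cells are divided by that output's total. The cell names, the normalized JPT, and the impact values are hypothetical assumptions, not figures from the paper.

```python
def expected_impact_by_output(jpt, impact):
    """Expected impact given a positive output and given a negative output.

    `jpt` is a normalized JPT (cells sum to one, each class sums to 0.5).
    """
    tagged_pos = jpt["tp"] + jpt["fp"]  # share of elements tagged positive
    tagged_neg = jpt["tn"] + jpt["fn"]  # share of elements tagged negative
    given_pos = (impact["tp"] * jpt["tp"] + impact["fp"] * jpt["fp"]) / tagged_pos
    given_neg = (impact["tn"] * jpt["tn"] + impact["fn"] * jpt["fn"]) / tagged_neg
    return given_pos, given_neg


# Hypothetical normalized JPT; zero impact is assigned to correct results
jpt = {"tp": 0.4, "fn": 0.1, "fp": 0.1, "tn": 0.4}
impact = {"tp": 0.0, "tn": 0.0, "fp": -100.0, "fn": -200.0}

given_pos, given_neg = expected_impact_by_output(jpt, impact)
```

With these inputs, a fifth of each output is wrong, so each conditional expectation is one fifth of the corresponding error impact. How the two values are combined into a single summary statistic follows the paper's own equations, which this sketch does not attempt to restate.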

###### 6.4.1. Comparing and against the Criteria

Commonly seen summary statistics do not satisfy these four criteria. How well do the impact measures and satisfy the criteria?

*Criterion 1 (category importance).* A measure that reflects category importance will exhibit sensitivity to change in category impact. If in (5) is replaced by , where represents a small change, then becomes , as changes, with a corresponding change in . (The correspondence is true for . If , then has no effect. If elements do not occur, then they have no impact.) A similar situation exists for . in (9) is replaced by , and then becomes . Other than the trivial case when a JPT cell cardinality equals zero, an change in the corresponding element of will change the value of the ratio. and satisfy Criterion .

*Criterion 2 (pdf sensitivity).* The discussion in Section 6.2 shows that JPT category cardinalities are sensitive to changes in pdf. Hence, Criterion compliant measures must be sensitive to changes in JPT category cardinalities. and are sensitive to changes in all four JPT category cardinalities, and thus they satisfy Criterion .

*Criterion 3 (DA tool output basis).* is based on the weighted conditional expectations (Equation (7)) on and , the outputs visible to the end user. Thus, can be calculated from data available to end users and can be deconstructed into and relevant components. , as written in (6), is the sum of the estimated impact of and ; can be deconstructed into and relevant components. Thus both satisfy Criterion .

*Criterion 4 (measure value appropriateness).* For Criterion , needs to be quantified in a unit appropriate for end user impact, and then and satisfy Criterion .

*Measurement Theory: Outputting Ratio Scale Values*. When either or , then the DA tool has no noticeable overall effect on the end user; the measures have meaningful zeros.

The discussion for Criterion shows that and values have standard intervals. Regardless of the value of any , a change in its value will cause a corresponding change in and .

Table 5 summarizes this study’s results. From the end user perspective, and have all of the desired characteristics.

#### 7. Discussion

Starting from a well-defined problem, impact measures for end user problems have been devised. Here, measure usage and differences in the fundamental nature of the impact measures and the commonly seen ratio measures are discussed. Following this, measure outputs are compared on simulated classifier output. Finally a reanalysis of published studies shows additional insights end users can gain by using impact measures.

##### 7.1. Measure Comparisons

There are two key differences between the additive measures and and the commonly seen measures (all of which are ratio based):

(i) The ratio-based measures are either unitless or use units with weak utility for end users. and have units defined in . For end user efficacy, the units must be relevant to the problem and provide end users actionable information.

(ii) The ratio-based measures are mostly measured on ordinal scales, which limits comparison to rank ordering. and are measured on a ratio scale, where DA impacts are ordered, but difference and magnitude are also valid comparisons. An end user can determine not only which DA tool is better, but also *how much better*.

As noted in Section 1, measures differ in their response to changes in and accuracy. Generally, the differences are monotonically consistent, so rank ordering would not change. There are exceptions: (i) the sensitive measure TAR monotonically increases as increases, whereas MCC and IC monotonically decrease; (ii) ’s response to is subject to : it can either monotonically increase or decrease. When DA tools are evaluated with the same test data (for , must remain the same as well), this will not cause a change in rank ordering. If test sets with different s are used to evaluate DA tool performances, TAR would rank them differently from MCC or IC.

##### 7.2. Simulation Tests

Some use cases illustrate the value of the impact measures. Symmetrical class probability distributions can mask some differences, so for all measures class samples have the beta (1.5, 5.0) distribution.

Regarding ROC-AUC, , DOR/DP, TAR, MCC, and IC, testing took into consideration the following:

(i) They all implicitly use .

(ii) ROC-AUC, , and DOR/DP are invariant, so they can be used on test sets with any . Their invariance, however, forces . These measures are compared in use cases with .

(iii) TAR, MCC, and IC are considered inappropriate for data with . These measures are compared in use cases with .

For , , and -score, there are a limited number of variations. For category impacts, JPTs are bilaterally symmetrical, so there are seven cases. The following list shows the category impact relationships and used:

(i) One impact three impacts: (a) , (Case A); (b) , (Case B).

(ii) Two impacts two impacts: (a) , (Case C); (b) , (Case D); (c) , (Case E); (d) , (Case F); (e) , (Case G: this is the additive measure equivalent of for ratio measures).

This gives a total of twenty use cases, seven each for the two *impact* measures and six for -score (since for -score, cases A and C are identical).

Mapping to is not exact. However, from ’s somewhat subjective definition, it may be viewed as . This reduces -score’s test cases to three: , , and . Interestingly, and result in the same -score, so the seven distinct cases resolve to two distinct -scores.

Figure 1 compares the common measures suitable for and . The dots on each curve indicate the optimum boundary, . One important difference between the measures in Figures 1(a) and 1(b) is that the -axis units in Figure 1(b) are meaningful to end users. For the dairyman’s problem, perhaps the units would be tons of milk per day per herd. For the banker, perhaps annual profit. The -axis values in Figure 1(a) have no such relevance to the user. The variation in the optima and curve shapes in Figure 1(b) shows ’s sensitivity to . This is a benefit for end users, as they can determine and sensitivity. Cases A and E appear identical left of ; otherwise, all of the cases tested are significantly different. Alas, sensitivity is confounding for DA research. Fortunately, there is commonality between one common measure and : TAR and , Case G, (the additive measure equivalent of for ratio measures) are extremely similar. The two measures’ output similarities may mean that and TAR are functionally equivalent (in an operational, not mathematical sense).

**(a)** Each of these commonly seen measures identifies similar optimum boundaries. However, the values differ significantly.

**(b)** is tunable to the end user’s situation. End users can readily identify , sensitivity, and the expected impact the classifier will have on their problem.

DOR/DP and ROC-AUC are not shown in these figures. DOR/DP is measured on an interval scale, so to show it would require a separate figure. Its characteristics, as defined by and curve shape, were far different from the other six measures evaluated. Its graph is included in the on-line technical report [46]. ROC-AUC is boundary invariant; its “curve” is a horizontal line, so ROC-AUC is not included in Figure 1(a).

Figure 2 shows the comparison between the -score and for . As with the previous comparisons, the important difference between the measures in Figures 2(a) and 2(b) is that the -axis units in Figure 2(b) are meaningful to end users. Although varies by an order of magnitude in the two tests, is essentially the same. Users for whom the importance of and is equal will be less satisfied than users for whom there is an order of magnitude difference in and ’s importance. exhibits a much greater sensitivity to . In every case, there are differences in the curves and . In all cases, the impact measures provide information on , sensitivity, and expected impact that are not available using the other measures.

**(a)** Varying does impact the *F*-score. However, are essentially the same.

**(b)** is tunable to the end user’s situation. End users can readily identify , sensitivity, and the expected impact the classifier will have on their problem. In some cases, the end user could experience a negative result. This is reflected in ; *F*-score cannot show this information.

##### 7.3. Real-World Problem 1: Evaluating Rheumatoid Arthritis Diagnostic Tests

Nishimura et al. [13] published a meta-analysis [49] of evaluations of two rheumatoid arthritis (RA) diagnostic tests. The meta-analysis is quite thorough and accounts for many potential variations between studies. The team concludes that one test is better than the other; however, they do so without using a summary statistic. This reanalysis adds , the appropriate impact measure identified in Section 6.4. Using , end users can identify , impact, and sensitivity. Analysis is limited because results at the predefined are given, but not the underlying data. Information difference is shown between and the original results, but additional insights are not gained.

Nishimura et al.’s study uses two measures, positive likelihood ratio () and negative likelihood ratio (). These measures are not summary statistics, but they allow calculation of the underlying normalized JPT. The authors observe that RA treatment is harmful to and costly for persons with false positive results. Regardless of the diagnosis, a correct diagnosis maximizes the subject’s quality of life. Accordingly, the meaningful zero is defined as the cost associated with a correct diagnosis: . An incorrect diagnosis results in reduced quality of life, so Lajas et al.’s reported costs [50] are rounded to two significant digits, setting the misdiagnosis costs at 7,900 and 13,000. Hence, means “the test results have no negative effect on patient QoL” and the lower the value, the worse the impact on the patient. Since has a meaningful zero, a practitioner using would seem to have a direct mapping from test result to the patient’s expected experience.
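The step from likelihood ratios to a normalized JPT can be sketched with the standard definitions of the positive likelihood ratio, sensitivity / (1 − specificity), and the negative likelihood ratio, (1 − sensitivity) / specificity. Solving the two equations for sensitivity and specificity gives the cells directly. The function names and the input values below are hypothetical assumptions and are not taken from Nishimura et al.

```python
def rates_from_likelihood_ratios(lr_pos, lr_neg):
    """Recover (sensitivity, specificity) from the two likelihood ratios.

    Uses lr_pos = sens / (1 - spec) and lr_neg = (1 - sens) / spec.
    Assumes an informative test, that is, lr_pos > 1 > lr_neg >= 0.
    """
    if not (lr_pos > 1.0 > lr_neg >= 0.0):
        raise ValueError("expected lr_pos > 1 > lr_neg >= 0")
    spec = (lr_pos - 1.0) / (lr_pos - lr_neg)
    sens = lr_pos * (1.0 - spec)
    return sens, spec


def normalized_jpt_from_rates(sens, spec):
    """Normalized JPT: each ground-truth class is given a total weight of 0.5."""
    return {
        "tp": 0.5 * sens,
        "fn": 0.5 * (1.0 - sens),
        "tn": 0.5 * spec,
        "fp": 0.5 * (1.0 - spec),
    }


# Hypothetical likelihood ratios, for illustration only
sens, spec = rates_from_likelihood_ratios(lr_pos=9.0, lr_neg=0.25)
jpt = normalized_jpt_from_rates(sens, spec)
```

Recomputing the likelihood ratios from the returned rates reproduces the inputs, which is a quick sanity check on the algebra.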

In contrast, means the expected test result is somewhere between “a positive test result is never correct” and “a negative test result is never correct.” Similarly, means the expected test result is somewhere between “a positive test is always correct” and “a negative test is always correct.” Both and are related to patient QoL, but the mapping is not clear.

In an extension to Nishimura et al.’s report, this reanalysis calculates , , and on the pooled test data. Table 6 shows the original likelihood ratios reported by Nishimura et al. and the proposed new measures. (The parenthesized range is the 95% confidence interval. On a single tailed test, only one bound is relevant, and thus the bound indicates a 97.5% confidence.) The conclusions reached using the proposed measures and the original measures match: the anti-CCP test is better than the RF test. Comparing for each test, given that the end user context requires invariance, the anti-CCP test’s estimated annual economic impact on patients is four hundred dollars less than the RF test’s estimated annual economic impact.

Inspecting the raw JPT values in Table 7, anti-CCP has a substantially lower rate than the RF test. This might lead an end user to place substantially more trust in the RF test’s negative result than in a negative result from the anti-CCP test. However, that trust does not result in a better outcome for the patient; both tests have statistically equivalent negative impacts. Because and are measured in the same units, with the same meaningful zero, comparisons can be made between them. For example, the end user can see that, in contrast to a positive test result, a negative result can have a substantial negative annual cost: around 9,600 USD per year. An end user may not want to conclude a patient with a negative test result is RA-free without strong corroboration.

Assessing the RA test’s impacts is useful to the researcher but is substantially more valuable to the end user. Table 8 shows the JPTs for both raw proportions and impacts.

The raw test data were not available. If the raw data were available, it would be interesting to see how the curves for and compare to and . Likelihood ratios are not summary statistics, so they cannot directly provide ; it would also be interesting to compare derived from the likelihood ratios with that predicted by .

##### 7.4. Real-World Problem 2: Bank Loan Decisions

Optimizing bank loan decisions is a “cumulative output” DA problem type; thus end user impact is best quantified by . There is a body of credit scoring algorithm tests; none with sufficient data was available for a full reanalysis. The Abdou work [16] included sufficient details to compare peak outputs identified by the algorithms tested.

Abdou provides normalized JPTs and reports a misclassification cost ratio (MCR) of 5 : 1; MCR = cost of Type II errors () / cost of Type I errors (). (MCR considers direct costs only. It does not include the opportunity cost, the lost income attributable to qualified applicants not being funded.) Abdou does not provide loan amount information, so a “standard loan unit” was defined as some arbitrary number of Egyptian pounds (EGP) and impact was calculated per loan unit. The JPT categories are defined as follows.

*Good Applicants (**)*. These are loans that are made and pay off as expected.

*Known Deadbeats (**)*. These are applicants rejected where ground truth is a known default. All applicants, including those rejected, incur an application processing cost; hence, is negative.

*Unknown Deadbeats (**).* These are loans made that defaulted. The value is based on MCR = 5 : 1.

*Unknown Good Applicants (**)*. These are rejected applicants that would have proven to be good.

In order to limit complexity, the standard loan unit has a defined annual profit expectation. Intuitively, a user could expect . In Abdou’s scenario, has a slight negative impact. This is due to application processing costs incurred for all applications.

Abdou also normalizes his data . However, the problem is sensitive, so JPT tuning is used to adjust to the reported value, . Abdou ran a sensitivity analysis on EMC; JPT tuning use illustrates how an end user can run a sensitivity analysis. (Such a sensitivity analysis can test results at the identified boundary; however, causes the optimum boundary to shift. So, without the actual data, JPT tuning cannot be used to estimate the peak impact [6].) JPT tuning is used for two other relative class sizes: and . Table 9 compares results with the estimated misclassification cost, EMC, reported by Abdou.
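JPT tuning itself is specified in [6]; the sketch below only illustrates one plausible reading of it, namely rescaling each ground-truth class of a JPT so that the classes have a chosen relative size while the within-class proportions stay fixed. The function name `tune_jpt`, the cell names, and the numbers are assumptions made for illustration.

```python
def tune_jpt(jpt, positive_share):
    """Rescale a JPT so the positive class has total weight `positive_share`.

    Within-class proportions are preserved, so the result describes the
    same boundary applied to a population with a different class mix.
    """
    if not 0.0 < positive_share < 1.0:
        raise ValueError("positive_share must lie strictly between 0 and 1")
    pos_total = jpt["tp"] + jpt["fn"]
    neg_total = jpt["fp"] + jpt["tn"]
    pos_scale = positive_share / pos_total
    neg_scale = (1.0 - positive_share) / neg_total
    return {
        "tp": jpt["tp"] * pos_scale,
        "fn": jpt["fn"] * pos_scale,
        "fp": jpt["fp"] * neg_scale,
        "tn": jpt["tn"] * neg_scale,
    }


# Hypothetical normalized JPT retuned to an 80/20 class mix
normalized = {"tp": 0.40, "fn": 0.10, "fp": 0.10, "tn": 0.40}
tuned = tune_jpt(normalized, positive_share=0.8)
```

As the parenthetical above cautions, a retuned table of this kind describes the reported boundary only; it cannot locate a shifted optimum without the underlying data.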

Abdou concludes that the WOE model performs best, based on EMC. shows that genetic programming performs best. Abdou does not report confidence intervals, but since probit analysis (PA) and results are so similar, the difference is likely statistically insignificant. Based on , WOE seems to perform worst. WOE has a substantial negative impact on the lender compared to either PA or GP. This reanalysis indicates that GP is at least equivalent to the best nonartificial intelligence method tested; this is consistent with other tests comparing artificial intelligence (AI) and non-AI methods.

Using the Egyptian banking assumptions presented here, sensitivity analysis of GP shows that, for , the annual profit per loan unit would range from fifty-six to sixty-seven percent of the amount that would be received if loan decisions were perfect. Thus, by using , the bank decision-makers receive valuable information that can be used to define loan application scoring policy and procedures. The banking environment assumptions used are probably not extensible to a wide bank pool. Thus, will be most useful when each institution tunes the values to their specific environment.

If Abdou’s raw data were available, it would be interesting to test identification and sensitivity. This problem’s meaningful zero is the loan portfolio with a net zero profit; is almost certainly different from those determined by Abdou. The hypothetical profits are probably greater than those determined using Abdou’s results.

#### 8. Conclusion

An important characteristic for DA tool end user efficacious summary statistics has been identified: impact (). First, an impact vector sets the end user impact of each joint probability table category: . Next, four criteria for end user efficacious summary statistics and measurement theory were evaluated and applied.

*Criterion 1 (category importance).* An end user efficacious summary statistic must be a function of problem specific, rational number impacts for each JPT category.

*Criterion 2 (pdf sensitivity).* An end user efficacious summary statistic must be sensitive to differences between end user environments and changes in an end user’s environment, as expressed in the DA tool input population and pdf.

*Criterion 3 (DA tool output basis).* An end user efficacious summary statistic must be quantifiable with information known and visible to the end user ( and ).

*Criterion 4 (measure value appropriateness).* An end user efficacious summary statistic must quantify the DA tool’s impact on the characteristic of interest.

*Measurement Theory*. Ratio scales allow the most extensive analysis for end users.

Eight commonly seen DA tool summary statistics, total accuracy rate, -score, Youden index, Diagnostic Odds Ratio (and associated measure, discriminant power), ROC area under the curve, Matthews Correlation Coefficient, and Mutual Information Coefficient, fail to satisfy these criteria. Two criterion-compliant end user efficacious summary statistics were identified along with their measure suites.

(i) For cumulative DA tool output impact, the summary statistic is . The end user efficacious measure suite consists of .

(ii) For noncumulative DA tool output impact, the summary statistic is . The end user efficacious measure suite consists of .

Generally, the intent of publishing DA tool performance data is to inform a broad readership, including potential end users. Using specific would be too restrictive, unless the report is for a specific audience, such as the rheumatoid arthritis study. Otherwise, end users would be better served if DA tool performance reports provide data that enables them to calculate their own impacts.

One suggestion is for researchers to publish and with balanced , on normalized JPT(), and the normalized JPT() for a range of boundaries. Researchers can use the published and values and end users have the values necessary to calculate actionable information for their specific problem.

The three-step process for end users is as follows:

(1) Identify the appropriate measure, or .

(2) If is appropriate, then use JPT tuning to compensate for their problem domain’s . If is appropriate, then condition the published JPTs by and .

(3) Calculate the selected measure and select the JPT() with the best impact.
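The three steps can be sketched for the cumulative case as follows. Everything named here is a hypothetical assumption made for illustration: the published normalized JPTs keyed by boundary, the impact values, the class mix, and the helper names. The tuning helper rescales each ground-truth class to the end user's class mix, which is only one plausible reading of the JPT tuning described in [6].

```python
def tune(jpt, positive_share):
    """Step 2: rescale each ground-truth class to the end user's class mix."""
    pos_scale = positive_share / (jpt["tp"] + jpt["fn"])
    neg_scale = (1.0 - positive_share) / (jpt["fp"] + jpt["tn"])
    return {"tp": jpt["tp"] * pos_scale, "fn": jpt["fn"] * pos_scale,
            "fp": jpt["fp"] * neg_scale, "tn": jpt["tn"] * neg_scale}


def impact_per_element(jpt, impact):
    """Step 3a: expected impact per element for one tuned JPT."""
    return sum(impact[c] * jpt[c] for c in ("tp", "fn", "fp", "tn"))


def best_boundary(jpts_by_boundary, impact, positive_share):
    """Step 3b: score every published boundary and return the best one."""
    scores = {b: impact_per_element(tune(j, positive_share), impact)
              for b, j in jpts_by_boundary.items()}
    return max(scores, key=scores.get), scores


# Hypothetical published normalized JPTs for three boundaries
published = {
    0.3: {"tp": 0.45, "fn": 0.05, "fp": 0.20, "tn": 0.30},
    0.5: {"tp": 0.40, "fn": 0.10, "fp": 0.10, "tn": 0.40},
    0.7: {"tp": 0.30, "fn": 0.20, "fp": 0.03, "tn": 0.47},
}
# Hypothetical end user impacts per element, in the user's own units
impact = {"tp": 10.0, "tn": 0.0, "fp": -50.0, "fn": -10.0}

choice_balanced, _ = best_boundary(published, impact, positive_share=0.5)
choice_skewed, _ = best_boundary(published, impact, positive_share=0.8)
```

In this toy example the preferred boundary differs between the balanced and the skewed class mix, which illustrates why publishing JPTs for a range of boundaries lets each end user find the boundary that suits their own environment.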

With these values, end users can also determine the DA output’s sensitivity to .

Future work includes loosening problem constraints to include unsupervised tests, exploring the potential relationship between -scores and and studying the relationship between and DA tool output. The cost curves discussed by Witten et al. [41] may provide a means to avoid publishing multiple JPTs for end users with cumulative DA tool type problems.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgment

The authors gratefully acknowledge Dr. Andrew Barnes (General Electric Global Research, Niskayuna, NY). His insights during this study’s formative stage were invaluable.