Research Article  Open Access
A Semi-Automated Usability Evaluation Framework for Interactive Image Segmentation Systems
Abstract
For complex segmentation tasks, the achievable accuracy of fully automated systems is inherently limited. Specifically, when a precise segmentation result is desired for a small amount of given data sets, semi-automatic methods exhibit a clear benefit for the user. The optimization of human-computer interaction (HCI) is an essential part of interactive image segmentation. Nevertheless, publications introducing novel interactive segmentation systems (ISS) often lack an objective comparison of HCI aspects. It is demonstrated that even when the underlying segmentation algorithm is the same throughout interactive prototypes, their user experience may vary substantially. As a result, users prefer simple interfaces as well as a considerable degree of freedom to control each iterative step of the segmentation. In this article, an objective method for the comparison of ISS, based on extensive user studies, is proposed. A summative qualitative content analysis is conducted via abstraction of visual and verbal feedback given by the participants. A direct assessment of the segmentation system is executed by the users via the system usability scale (SUS) and AttrakDiff2 questionnaires. Furthermore, an approximation of the findings regarding usability aspects in those studies is introduced, computed solely from the system-measurable user actions during their usage of the interactive segmentation prototypes. The prediction of all questionnaire results has an average relative error of 8.9%, which is close to the expected precision of the questionnaire results themselves. This automated evaluation scheme may significantly reduce the resources necessary to investigate each variation of a prototype’s user interface (UI) features and segmentation methodologies.
1. Introduction
To the best of our knowledge, there is not one publication in which user-based scribbles are combined with standardized questionnaires in order to assess an interactive image segmentation system’s quality. This type of synergetic usability measure is a contribution of this work. In order to provide a guideline for an objective comparison of interactive image segmentation approaches, a prototype providing a semi-manual pictorial user input, introduced in Section 2.2.1, is compared to a prototype with a guiding menu-driven UI, described in Section 2.2.2. Both evaluation results are analyzed with respect to a joint prototype, defined in Section 2.2.3, incorporating aspects of both interface techniques. All three prototypes are built utilizing modern web technologies. An evaluation of the interactive prototypes is performed utilizing pragmatic usability aspects described in Section 4.2, as well as hedonic usability aspects analyzed in Section 4.3. These aspects are evaluated via two standardized questionnaires (System Usability Scale and AttrakDiff2) which form the ground truth for a subsequent prediction of the questionnaires’ findings via a regression analysis outlined in Section 3.3. The outcome of questionnaire result prediction from interaction log data only is detailed in Section 4.4. This novel automatic assessment of pragmatic as well as hedonic usability aspects is a contribution of this work. Our source code release for the automatic usability evaluation from user interaction log data can be found at https://github.com/mamrehn/interactive_image_segmentation_evaluation.
1.1. Image Segmentation Systems
Image segmentation can be defined as the partitioning of an image into a finite number of semantically non-overlapping regions. A semantic label can be assigned to each region. In medical imaging, each individual region of a patient’s abdominal tissue might be regarded as healthy or cancerous. Segmentation systems can be grouped into three principal categories, each differing in the degree of involvement of an operating person (user): manual, automatic, and interactive. During manual tumor segmentation, a user provides all elements of the image grid which have neighboring elements with a different label, i.e. the closed contour line of the object. The system then utilizes this closed contour line information to infer the labels of the remaining image elements via simple region growing. This minimal assistance by the system causes the overall segmentation process of one lesion to take up to several minutes of user interaction time. However, reaching an appropriate or even perfect segmentation result (despite noteworthy inter-observer differences [1]) is feasible [2, 3]. In practice, a few time-consuming manual segmentations are performed by domain experts in order to utilize the results as a reference standard in radiotherapy planning [4]. A fully automated approach does not involve a user’s interference with the system. The resulting deficiency in domain knowledge for accurately labeling regions may be compensated partially by the automated segmentation approach itself. The maximum accuracy of the segmentation result is therefore highly dependent on the individual set of rules or the amount of training data available. If the segmentation task is sufficiently complex, a perfect result may not be reachable. Interactive approaches aim at a fast and exact segmentation by combining substantial assistance by the system with knowledge about a very good estimate of the true tumor extent, provided by trained physicians during the segmentation process [5].
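The label inference step of manual segmentation described above can be sketched as a flood fill: the background is grown inward from the image border, and everything the fill cannot reach lies inside the closed contour. This is an illustrative sketch only (function and array names are not from the article):

```python
import numpy as np
from collections import deque

def labels_from_contour(contour_mask):
    """Infer a binary segmentation from a closed contour via simple
    region growing: flood-fill the background from the image border;
    every element not reached belongs to the object."""
    h, w = contour_mask.shape
    background = np.zeros((h, w), dtype=bool)
    # Start from all non-contour border pixels.
    queue = deque((y, x) for y in range(h) for x in range(w)
                  if (y in (0, h - 1) or x in (0, w - 1)) and not contour_mask[y, x])
    for y, x in queue:
        background[y, x] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not background[ny, nx] and not contour_mask[ny, nx]:
                background[ny, nx] = True
                queue.append((ny, nx))
    return ~background  # True on and inside the contour
```

The returned mask labels the contour itself as part of the object, a common convention for closed-curve input.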
In contrast to fully automated solutions, prior knowledge is (also) provided during the segmentation process. Although interactive approaches are also costly in terms of manual labor to some extent, they can surpass fully automated techniques in terms of accuracy. Due to their exact segmentation capabilities, interactive segmentation techniques are frequently chosen to outline pathologies during imaging-assisted medical procedures, like hepatocellular carcinomata during transcatheter arterial chemoembolization (see Section 1.6).
1.2. Evaluation of Image Segmentation Systems
Performance evaluation is one of the most important aspects of the continuous improvement of systems and methodologies. With non-interactive computer vision and machine learning systems for image segmentation, an objective comparison of systems can be achieved by evaluating them on preselected data sets for training and testing. Similarity measures between the segmentation outcome and ground truth images are utilized to quantify the quality of the segmentation result.
With interactive segmentation systems (ISS), a complete ground truth data set would also consist of the adaptive user interactions which advance the segmentation process. Therefore, when comparing ISS, the user needs to be involved in the evaluation process. User interaction data, however, is highly dependent on (1) the user’s domain knowledge and the unique learning effect of the human throughout a period of exposure to the problem domain, (2) the system’s underlying segmentation method and the user’s preferences towards this technique, and (3) the design and usability (the user experience [6, 7]) of the interface which is presented to the user during the interactive segmentation procedure [3, 8]. This includes users’ differing preferences towards diverse interaction systems and tolerances for unexpected system behavior. Considering (1)–(3), an analytically expressed objective function for an interactive system is hard to define. Intuitively, the user wants to achieve a satisfying result in a short amount of time with ease [9]. A direct assessment of a system’s usability is enabled via standardized questionnaires, as described in Section 2.3. Individual usage of ISS can be evaluated via the segmentation result’s similarity to the ground truth labeling according to the Sørensen-Dice coefficient (Dice) [10] after each interaction. The interaction data utilized for these segmentations has to be representative in order to generalize the evaluation results.
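The per-interaction Dice evaluation mentioned above can be computed with a few lines of NumPy; this is a straightforward sketch (names are illustrative):

```python
import numpy as np

def dice_coefficient(segmentation: np.ndarray, ground_truth: np.ndarray) -> float:
    """Sørensen-Dice coefficient between two binary masks:
    2 * |A ∩ B| / (|A| + |B|), in [0, 1]."""
    seg = segmentation.astype(bool)
    gt = ground_truth.astype(bool)
    intersection = np.logical_and(seg, gt).sum()
    total = seg.sum() + gt.sum()
    if total == 0:
        return 1.0  # both masks empty: define as perfect agreement
    return 2.0 * intersection / total
```

Evaluating this score after every logged interaction yields the accuracy-over-time curves used to compare systems.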
1.3. Types of User Interaction
As described by Olabarriaga et al. [11] as well as Zhao and Xie [12], user interactions can be categorized with regard to the type of interface an ISS provides. The following categories are emphasized. A pictorial mask image is the most intuitive form of user input. Humans use this technique when transferring knowledge via a visual medium [13]. The mask, overlayed on the visualization of the image to segment, consists of structures called scribbles, defined on the w × h grid of the 2D image, where w is the width and h is the height of the image in pixels. Scribbles are seed points, lines, and complex shapes, each represented as a set of individual seed points. One seed point is a tuple (p, l), where p describes the position of the seed in image space and l its class label; in a binary segmentation system, l is either foreground or background. Scribbles need to be defined by the user in order to act as a representative subset of the ground truth segmentation.
A menu-driven user input scheme as in [14, 15] limits the user’s scope of action. Users trade distinct control over the segmentation outcome for more guidance provided by the system. The locations or the shapes of newly created scribbles are fixed before presentation to the user. It is challenging to achieve an exact segmentation result using a method from this category. Rupprecht et al. [14] describe significant deficits in finding small objects and outline a tendency of the system to automatically choose seed point locations near the object border, which cannot be labeled via most users’ visual inspection and would therefore not have been selected by the users themselves. An advantage of menu-driven user input is the high level of abstraction of the process, enabling efficient guidance for inexperienced users in their decision of which action to perform for an optimal segmentation outcome (regarding accuracy over time or number of interactions) [11, 16].
1.4. Generation of Representative User Input
Nickisch et al. [17] describe crowd sourcing and user studies as two methods to generate plausible user input data. The cost-efficient crowd sourcing method often lacks control and knowledge of the users’ motivation. Missing context information for crucial aspects of the data acquisition procedure makes objectifying the evaluation results a challenging task. Specialized fraud detection methods are commonly used in an attempt to prefilter the recorded corpus and extract a usable subset of data. McGuinness and O’Connor [18] proposed an evaluation of ISS via extensive user experiments. In these experiments, users are shown images with descriptions of the objects they are required to extract. Then, users mark foreground and background pixels utilizing a platform designed for this purpose. These acquisitions are more time-consuming and cost-intensive than crowd sourcing, since they require the constant involvement of users. However, the study’s creators are able to control many aspects of the data recording process, which enables detailed observations of user reactions. The data samples recorded are a representative subset of the focus group of the finalized system. A user study aims at maximizing the repeatability of its results. In order to increase the objectivity of the evaluation in this work, a user study is conducted. The study is described in Section 3.2.
1.5. State-of-the-Art Evaluation of Interactive Segmentation Systems
1.5.1. Segmentation Challenges
In segmentation challenges like SLIVER07 [19], (mainly) fully automated approaches are competing for the highest score regarding a predefined image quality metric. Semi-automatic methods are allowed for submission if the manual interaction with the test data is strictly limited to preprocessing and (single seed point) initialization of an otherwise fully automated process. ISS may be included in the contests’ final ranking, but are regarded as non-competing, since the structure of the challenges is solely designed for automated approaches. The PROMISE12 challenge [20] had a separate category for proposed interactive approaches, where the user (in this case, the person also describing the algorithm) may add an unlimited number of hints during segmentation, without observing the experts’ ground truth for the test set. No group of experts was provided to operate the interactive method for comparative results. The submitted interactive methods’ scores in the challenge’s ranking are therefore highly dependent on the domain knowledge of single operating users and cannot be regarded as an objective measure.
1.5.2. Comparisons for Novel Segmentation Approaches
In principle, with every new proposal of an interactive segmentation algorithm or interface, the authors have to demonstrate the new method’s capabilities in an objective comparison with already established techniques. The effort spent on these comparisons by the original authors varies substantially. According to [9], many evaluation methods only consider a fixed input. This approach is especially unsuited for evaluation without simultaneously defining an appropriate interface, which actually validates that a real person utilizing this UI is capable of generating input patterns similar to the ones provided. Although there are some overview publications which compare several approaches [11, 18, 21–23], the number of publications outlining new methods is disproportionately greater, leaving comparisons insufficiently covered. The main contribution of Olabarriaga et al. [11] is the proposition of criteria to evaluate interactive segmentation methods: accuracy, repeatability, and efficiency. McGuinness et al. [18] utilized a unified user interface with multiple underlying segmentation methods for the survey they conducted. They recorded the current segmentation masks after each interaction to gauge segmentation accuracy over time. Instead of utilizing a standardized questionnaire, users were asked to rate the difficulty and perceived accuracy of the segmentation tasks on a scale of 1 to 5. Their main contribution is an empirical study in which subjects segmented with four different segmentation methods, in order to conclude that one of the four methods is best, given their data and participants. Their ranking is primarily based on the mean accuracy over time achieved per segmentation method. McGuinness et al. [22] define a robot user in order to simulate user interactions during an automated interactive segmentation system evaluation. However, they do not investigate the similarity of their rule-based robot user to the seed input patterns of individual human subjects. Zhao et al. [21] concluded in their overview of interactive medical image segmentation techniques that there is a clear need for well-defined performance evaluation protocols for interactive systems.
In Table 1, a clustering of popular publications describing novel interactive segmentation techniques is depicted. The evaluation methods can be compared by the type of data utilized as user input. Note that there is a trend towards more elaborate evaluations in more recent publications. The intent and perception of the interacting user are a valuable resource worth considering when comparing interactive segmentation systems [24]. However, only two of the related publications listed in Table 1 make use of insights into the complex thought processes of a human utilizing an interactive segmentation system for the ranking of novel interactive segmentation methods. Ramkumar et al. [25, 26] acquire these data via well-designed questionnaires, but do not automate their evaluation method. We propose an automated, i.e. scalable, system to approximate pragmatic as well as hedonic usability aspects of a given interactive segmentation system.

1.6. Clinical Application for Interactive Segmentation
Hepatocellular carcinoma (HCC) is among the most prevalent malignant tumors worldwide [63, 64]. Only a minority of cases are curable via surgery. Both a patient’s HCC and hepatic cirrhosis in advanced stages may lead to the necessity of alternative treatment methods. For these inoperable cases, transcatheter arterial chemoembolization (TACE) [65] is a promising and widely used minimally invasive intervention technique [66, 67]. During TACE, extrahepatic collateral vessels are occluded, which previously supplied the HCC with oxygenated blood. To locate these vessels, it is crucial to find the exact shape as well as the position of the tumor inside the liver. Interventional radiology is utilized to generate a volumetric cone-beam C-arm computed tomography (CBCT) [68] image of the patient’s abdomen, which is processed to precisely outline and label the lesion. The toxicity of TACE decreases the less healthy tissue is labeled as pathologic. The efficacy of the therapy increases the less cancerous tissue is falsely labeled as healthy [69]. However, precisely outlining the tumor is challenging, especially due to its variations in size and shape, as well as a high diversity in the X-ray attenuation coefficient values representing the lesion, as illustrated in Figure 1. While fully automated systems may yield insufficiently accurate segmentation results, ISS tend to be well suited for an application during TACE.
2. Methods
In the following, the segmentation method underlying the user interface prototypes is described in Section 2.1, in order to subsequently outline the different characteristics of each novel interface prototype in Section 2.2. The usability evaluation methods utilized are detailed regarding questionnaires in Section 2.3, semi-structured feedback in Section 2.4, and the test environment in Section 2.5.
2.1. Segmentation Method
GrowCut [59] is a seeded image segmentation algorithm based on cellular automaton theory. The automaton is a tuple (G, Q, δ), where G is the data the automaton operates on. In this case, G is the graph of image I, where the pixels/voxels act as nodes n. The nodes are connected by edges on a grid defined by the Moore neighborhood system. Q defines the automaton’s possible states and δ the state transition function utilized. As detailed in Equation (1), Q is the set of each node’s state (l_{p,t}, θ_{p,t}, c_p, e_{p,t}), where p is the node’s position in image space and l_{p,t} is the class label of the node at GrowCut iteration t. θ_{p,t} is the strength of the node at iteration t. The feature vector c_p describes the node’s characteristics. The pixel value I(p) at image location p is typically utilized as feature vector [59]. Here, we additionally define e_{p,t} as a counter for the accumulated label changes of the node during the GrowCut iterations, as described in [31], with e_{p,0} = 0. Note that this extension of GrowCut is later utilized for seed location suggestion in two of the three prototypes tested. A node’s strength is initialized with θ_{p,0} = 1 for scribbles, i.e. p ∈ S, and θ_{p,0} = 0 otherwise.
Iterations are performed utilizing the local state transition rule δ: starting from the initial seeds, labels are propagated based on the local intensity features c. At each discrete time step t, each node p attempts to conquer its direct neighbors q. A node q is conquered if the condition in Equation (2) is true, i.e. if the attacking node’s strength, attenuated by a monotonously decreasing function g of the feature distance between p and q, exceeds θ_{q,t}. If node q is conquered, the automaton’s state set is updated according to Equation (4): q adopts the attacker’s label and the attenuated strength. If q is not conquered, the node’s state remains unchanged, i.e. (l_{q,t+1}, θ_{q,t+1}) = (l_{q,t}, θ_{q,t}). The process is guaranteed to converge with positive and bounded node strengths (θ ∈ [0, 1]) monotonously decreasing (since g < 1). The image’s final segmentation mask after convergence is encoded as part of the state set Q, specifically in the label l of each node.
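The transition rule above can be sketched in a few dozen lines. This is an illustrative, unoptimized implementation, not the authors' code: it assumes a 4-neighborhood (the article uses the Moore neighborhood) and scalar intensities, and it also tracks the label-change counters e used later for seed suggestion.

```python
import numpy as np

def growcut(image, seed_labels, max_iterations=100):
    """Minimal GrowCut sketch: labeled nodes attack their neighbors; a
    neighbor is conquered if the attacker's strength, attenuated by the
    intensity-distance term g, exceeds the defender's strength."""
    img = image.astype(np.float64)
    labels = seed_labels.copy()
    strength = (seed_labels > 0).astype(np.float64)  # theta: 1 for seeds, 0 otherwise
    changes = np.zeros(img.shape, dtype=int)         # e: accumulated label changes
    max_diff = float(np.ptp(img)) or 1.0             # normalizer so that g <= 1
    h, w = img.shape
    for _ in range(max_iterations):
        new_labels, new_strength = labels.copy(), strength.copy()
        for y in range(h):
            for x in range(w):
                if labels[y, x] == 0:
                    continue  # unlabeled nodes cannot attack
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if not (0 <= ny < h and 0 <= nx < w):
                        continue
                    g = 1.0 - abs(img[y, x] - img[ny, nx]) / max_diff
                    attack = g * strength[y, x]
                    if attack > new_strength[ny, nx]:
                        if new_labels[ny, nx] != labels[y, x]:
                            changes[ny, nx] += 1  # count the label change
                        new_labels[ny, nx] = labels[y, x]
                        new_strength[ny, nx] = attack
        if np.array_equal(new_labels, labels) and np.array_equal(new_strength, strength):
            break  # converged: no node changed state
        labels, strength = new_labels, new_strength
    return labels, changes
```

With two seeds of different labels in two homogeneous regions, the labels spread until they meet at the intensity edge, where g (and thus the attack strength) drops to zero.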
2.2. Interactive Segmentation Prototypes
Three interactive segmentation prototypes with different UIs were implemented for usability testing. The segmentation technique applied in all prototypes is based on the GrowCut approach as described in Section 2.1. GrowCut allows for an efficient and parallelizable computation of image segmentations while providing an acceptable accuracy from only a few initial seed points. The method is also chosen due to its tendency to benefit from the careful placement of large quantities of seed points. It is therefore well suited for an integration into a highly interactive system. A learning-based segmentation system was not utilized for usability testing due to the inherent dependence of its segmentation quality on the characteristics of prior training data, which potentially adds a significant bias to the test results, given only a small data set as utilized in the scope of this work.
All three user interfaces provided include an undo button to reverse the effects of the user’s latest action. A finish button is used to define the stopping criterion for the interactive image partitioning. The transparency of both the contour line and the seed mask displayed is adjustable to one of five fixed values via the opacity toggle button. The image contrast and brightness (windowing) can be adapted with standard control sliders for the window width and the window center, operating on the image intensity value range [70]. All prototypes incorporate a help button used to provide additional guidance for the prototype’s usage during the segmentation task. The segmentation process starts with a set of predefined background labels along the edges of the image, since an object is assumed to be located in its entirety inside the displayed region of the image.
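The windowing operation controlled by these sliders maps the selected intensity window to the display range; a common formulation (a sketch, not taken from the article) is:

```python
import numpy as np

def apply_windowing(image, window_center, window_width):
    """Map raw intensities to the display range [0, 255]: values below
    the window are clipped to black, values above to white, and values
    inside the window are scaled linearly."""
    low = window_center - window_width / 2.0
    high = window_center + window_width / 2.0
    clipped = np.clip(image.astype(np.float64), low, high)
    return ((clipped - low) / (high - low) * 255.0).astype(np.uint8)
```

Narrowing the window width increases the displayed contrast within the window, which helps users judge subtle lesion boundaries.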
2.2.1. Semi-Manual Segmentation Prototype
The UI of the semi-manual prototype, depicted in Figure 2, provides several interaction elements. A user can add seed points as an overlay mask displayed on top of the image. These seed points have a predefined label of either foreground for the object or background used for all other image elements. The label of the next brush strokes (scribbles) can be altered via the buttons named object seed and background seed. After each interaction, a new iteration of the seeded segmentation is started, given the image as well as the updated set of seeds as input.
2.2.2. Guided Segmentation Prototype
The system selects two seed point locations p1 and p2, each with the lowest label certainty values assigned by the previous segmentation process. The seed point locations are shown to the user in each iteration, as depicted in Figure 3. There are four possible labeling schemes for those points in the underlying two-class classification problem, since each seed point has a label of either foreground or background. The interface providing advanced user guidance displays the four alternative segmentation contour lines, which are the result of the four possible next steps of the iterative interactive segmentation with respect to the labeling of the new seed points p1 and p2. The user selects the only correct labeling, where all displayed object and background seeds are inside the object of interest and the image background, respectively. The alternative views on the right act as four buttons to define a selection. To further assist the user in their decision making, the region of interest, defined by p1 and p2, is zoomed in for the four option views on the right and displayed as a cyan rectangle in the overview image on the left of the UI. The differences between the previous iteration’s contour line and each of the four new options are highlighted by dotted areas in the four overlay mask images. After the user selects one of the labelings, the two new seed points are added to the current set of scribbles. The scribbles are utilized as input for the next iteration, on the basis of which two new locations p1 and p2 are computed.
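The four labeling options shown to the user correspond to the Cartesian product of the two class labels over the two seed points; as a short sketch:

```python
from itertools import product

def candidate_labelings(labels=("background", "foreground")):
    """All label assignments for two suggested seed points (p1, p2) in a
    two-class problem; exactly one of the four options is correct."""
    return list(product(labels, repeat=2))
```

For each of the four assignments, one GrowCut run yields one candidate contour line to display.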
The system-defined locations of the additional seed points can be determined by the label-change counters e introduced in Section 2.1, i.e. the locations with the maximum number of label changes during the GrowCut segmentation. Frequent changes identify specific image elements and areas in which the GrowCut algorithm indicates uncertainty in finding the correct labels. The two locations which exhibited the most changes in labeling during the previous segmentation, given the input image and seeds, are then selected as p1 and p2.
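Selecting the most uncertain locations from a label-change counter map might look as follows (an illustrative sketch; the function name is not from the article):

```python
import numpy as np

def suggest_seed_locations(change_counts, n=2):
    """Return the n grid locations with the most GrowCut label changes,
    i.e. the locations where the algorithm was least certain."""
    # Sort flat indices by descending change count and keep the top n.
    flat = np.argsort(change_counts, axis=None)[::-1][:n]
    return [tuple(idx) for idx in np.array(np.unravel_index(flat, change_counts.shape)).T]
```

The two returned (row, column) tuples serve as p1 and p2 for the next guided iteration.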
2.2.3. Joint Segmentation Prototype
The joint prototype depicted in Figure 4 is a combination of a pictorial interaction scheme and a menu-driven approach. A set of preselected new seeds is displayed in each iteration. The seeds’ initial labels are set automatically, based on whether their position is inside (foreground) or outside (background) the current segmentation mask. The user may toggle the label of each of the new seeds, which also provides an intuitive undo functionality. The automated suggestion process for new seed point locations is depicted in Figure 5. The seed points are suggested deterministically, based on the indices of the maximum values in an element-wise sum of three approximated influence maps. These maps are the gradient magnitude image of I, the previous label changes per element (the counters e) weighted by an empirically determined factor, and an influence map based on the distance of each element to the current contour line. Note that for the guided prototype (see Section 2.2.2), only e was used for the selection of suggested seed point locations. This scheme was extended for the joint prototype, since extracting more than only the top two points solely from e potentially introduces suggested point locations forming impractical local clusters, instead of spreading out with higher variance in the image domain. This process approximates the true influence or entropy (information gain) of each possible location for a new seed.
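The element-wise combination can be sketched as follows, assuming the three maps are precomputed; `change_weight` stands in for the empirically determined factor, whose value is not restated here, and distances are inverted so locations near the contour score high:

```python
import numpy as np

def suggest_joint_seeds(gradient_magnitude, change_counts, contour_distance,
                        n=4, change_weight=0.5):
    """Sum the three influence maps element-wise and return the n grid
    locations with the largest combined influence."""
    proximity = 1.0 / (1.0 + contour_distance)  # high near the contour line
    influence = gradient_magnitude + change_weight * change_counts + proximity
    flat = np.argsort(influence, axis=None)[::-1][:n]
    return [tuple(idx) for idx in np.array(np.unravel_index(flat, influence.shape)).T]
```

In practice, the gradient magnitude map could come from `np.gradient` of the image and the distance map from a distance transform of the contour mask.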
When all seed points presented to the user are toggled to their correct label, the user may click on the new points button to initiate the next iteration with the updated set of seed points. Another set of suggested seed points is then generated and displayed.
In addition to the preselected seeds, a single new seed point can be added manually via a user’s long-press on any location in the image. This user action is interpreted as a desired change in the current labeling of the region. Therefore, the new seed point’s initial label is set by inverting the current label of the given location. A new segmentation is initiated by this interaction. Note that the labels of the suggested seed points are still subject to change via toggle interactions until the new points button is pressed.
2.3. Questionnaires
2.3.1. System Usability Scale (SUS)
The SUS [71, 72] is a widely used, reliable, and low-cost survey to assess the overall usability of a prototype, product, or service [73]. Its focus is on pragmatic quality evaluation [74, 75]. The survey is technology agnostic, which enables an evaluation of the usability of many types of user interfaces and ISS [76]. The questionnaire consists of ten statements and a unipolar five-point Likert scale [77]. This allows for an assessment in a time span of about three minutes per participant. The statements are as follows:
(1) I think that I would like to use this system frequently.
(2) I found the system unnecessarily complex.
(3) I thought the system was easy to use.
(4) I think that I would need the support of a technical person to be able to use this system.
(5) I found the various functions in this system were well integrated.
(6) I thought there was too much inconsistency in this system.
(7) I would imagine that most people would learn to use this system very quickly.
(8) I found the system very cumbersome to use.
(9) I felt very confident using the system.
(10) I needed to learn a lot of things before I could get going with this system.
The Likert scale provides a fixed choice response format to these statements. The middle choice of an odd-point Likert scale is always the neutral element. Using the scale, subjects are asked to define their degree of consent to each given statement. The fixed choices for the five-point scale are named strongly disagree, disagree, undecided, agree, and strongly agree. During the evaluation of the survey, these names are assigned values per subject, in the order presented, for each statement with index i. SUS scores enable simple interpretation schemes, understandable also in multidisciplinary project teams. The result of the SUS survey is a single scalar value in the range of zero to one hundred, as a composite measure of the overall usability. The score is computed according to Equation (5), as outlined in [71], given the responses of all participants. A neutral participant (answering undecided throughout) would produce a SUS score of 50. Although the SUS score allows for a straightforward comparison of the usability throughout different systems, there is no simple intuition associated with the resulting scalar value. SUS scores do not provide a linear mapping of a system’s quality in terms of overall usability. In practice, a low SUS score is often interpreted as an indicator of a substantial usability problem with the system. Bangor et al. [76, 78] proposed an interpretation of the score on a seven-point scale. They added an eleventh question to the surveys they conducted. Here, participants were asked to describe the overall system as one of seven items of an adjective rating scale: worst imaginable, awful, poor, OK, good, excellent, and best imaginable. The resulting SUS scores could then be correlated with the adjectives. The mapping from scores to adjectives resulting from their evaluation is depicted in Figure 6. This mapping also enables an absolute interpretation of a single SUS score.
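The standard SUS scoring scheme [71] can be written compactly for one participant's raw responses on the 1-5 scale (a sketch):

```python
def sus_score(responses):
    """SUS score from one participant's ten responses (1 = strongly
    disagree ... 5 = strongly agree). Odd-numbered statements are
    positively worded and contribute (response - 1); even-numbered
    statements are negatively worded and contribute (5 - response).
    The sum of contributions is scaled by 2.5 to reach [0, 100]."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5
```

Averaging these per-participant scores over all subjects yields the study-level SUS score; an all-undecided participant lands exactly at 50.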
2.3.2. Semantic Differential AttrakDiff2
A semantic differential is a technique for the measurement of meaning as defined by Osgood et al. [79, 80]. Semantic differentials are based on the theory that the implicit anticipatory response of a person to a stimulus object is regarded as the object’s meaning. Since these implicit responses themselves cannot be recorded directly, more apparent responses like verbal expressions have to be considered [81, 82]. These verbal responses have to be sensitive to and maximally dependent on meaningful states, while independent from each other [80]. Hassenzahl et al. [83, 84] defined a set of pairs of verbal expressions suitable to represent a subject’s opinion on the hedonic as well as pragmatic quality (both aspects of perception) and attractiveness (an aspect of assessment) of a given interactive system separately [85]. During evaluation, the pairs of complementary adjectives are clustered into four groups, each associated with a different aspect of quality. Pragmatic quality (PQ) is defined as the perceived usability of the interactive system, which is the ability to assist users to reach their goals by providing utile and usable functions [86]. The attractiveness (ATT) quantifies the overall appeal of the system [87]. The hedonic quality (HQ) [88] is separable into hedonic identity (HQI) and hedonic stimulus (HQS). HQI focuses on a user’s identification with the system and describes the ability of a product to communicate with other persons, benefiting the user’s self-esteem [89]. HQS describes the perceived novelty of the system. HQS is associated with the desire to advance one’s knowledge and proficiencies. The clustering of the word pairs into these four groups is defined as depicted in Table 2.

For each participant, the order of word pairs and the order of the two elements of each pair are randomized prior to the survey’s execution. A bipolar [90] seven-point Likert scale is presented to the subjects to express their relative tendencies towards one of the two opposing statements (poles) of each expression pair, where the middle index denotes the neutral element. For the questionnaire’s evaluation, each of the seven adjective pairs per group is assigned a score by each participant, reflecting their tendency towards the positive of the two adjectives. The overall rating per group is defined in [83] as the mean score computed over all subjects and statements, as depicted in Equation (6), where the number of participants in the survey acts as the normalizer. Therefore, a neutral participant would produce an AttrakDiff2 score of four. The final averaged score of each group ranges from one (worst) to seven (best rating).
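The per-group mean of Equation (6) reduces to a single average over all participants and word pairs; as a minimal sketch (names illustrative):

```python
def attrakdiff_group_score(ratings):
    """Mean AttrakDiff2 score for one quality group (PQ, ATT, HQI, or
    HQS). `ratings` is a list of per-participant lists, each holding the
    seven word-pair scores on the 1-7 scale (7 = tendency towards the
    positive adjective)."""
    n = len(ratings)
    return sum(sum(participant) for participant in ratings) / (7.0 * n)
```

Applying this to each of the four groups yields the PQ, ATT, HQI, and HQS values; HQ is then the mean of HQI and HQS for the portfolio representation.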
An overall evaluation of the AttrakDiff2 results can be conducted in the form of a portfolio representation [86]. HQ is the mean of a system’s HQI and HQS scores. The PQ and HQ scores of a specific system and user are visualized as a point in a two-dimensional graph. The confidence interval is an estimate of plausible values for rating scores from additional study participants and determines the extension of the rectangle around the described data point in each dimension. A small rectangle area represents a more homogeneous rating among the participants than a larger area. If a rectangle lies completely inside one of the seven fields with associated adjectives defined in [86], this adjective is regarded as the dominant descriptor of the system. Otherwise, systems can be characterized by the overlapping fields’ adjectives. If the confidence rectangles of two systems overlap in their one-dimensional projection on either HQ or PQ, their difference in AttrakDiff2 scores with regard to this dimension is not significant.
2.4. Qualitative Measures
In order to collect, normalize, and analyze visual and verbal feedback given by the participants, a summative qualitative content analysis is conducted via abstraction [91, 92]. The abstraction method reduces the overall transcript material while preserving its substantial contents by summarization. The corpus retains a valid mapping of the recording. An essential part of abstraction is the formulation of macro operators like elimination, generalization, construction, integration, selection, and bundling. The abstraction of statements is increased iteratively by the use of macro operators, which map statements of the current level of abstraction to the next, while clustering items based on their similarity [93].
2.5. HCI Evaluation
A user study is the most precise method for the evaluation of the quality of different interactive segmentation approaches [17]. Analytical measures as well as subjective measures can be derived from standardized user tests [94]. From interaction data recorded during the study, the reproducibility of segmentation results as well as the achievable accuracy with a given system per time can be estimated. The complexity and novelty of the system can be expressed via the observed convergence to the ground truth over the time spent by the participants segmenting multiple images each. The users' satisfaction with the interactive approaches is expressed by the analysis of questionnaires, which the study participants fill out immediately after their tests are conducted and before any discussion or debriefing has started. The respondents are asked to fill in the questionnaire as spontaneously as possible. Intuitive answers are desired as user feedback instead of well-thought-out responses for each item in the questionnaire [71].
For the randomized A/B study, individuals are selected to approximate a representative sample of the intended users of the final system [95]. During the study, subjects are given multiple interactive segmentation tasks to fulfill, each within a limited time frame. Each user segments all provided images with two different methods (A and B). All subjects are given the tasks in a randomized order to prevent a learning effect bias, which would allow for higher-quality outcomes for the later tasks. Video and audio data of the subjects are recorded. Every user interaction recognized by the system and its time of occurrence are logged.
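Such a randomized assignment of task and method order can be sketched as follows. The subject count, image identifiers, and seeding scheme are illustrative assumptions, not the study's actual protocol code:

```python
import random

def task_orders(subject_ids, tasks, methods=("A", "B"), seed=0):
    """Assign each subject a shuffled task order and a method order,
    mitigating learning-effect bias across the A/B study."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    plan = {}
    for subject in subject_ids:
        order = list(tasks)
        rng.shuffle(order)               # per-subject random task order
        first = rng.choice(methods)      # which method the subject starts with
        second = "B" if first == "A" else "A"
        plan[subject] = {"tasks": order, "method_order": (first, second)}
    return plan

# Hypothetical example: ten subjects, three recorded task images.
plan = task_orders(range(10), ["img_a", "img_c", "img_d"])
```

Counterbalancing the method order across subjects (instead of drawing it independently) would be a natural refinement for small samples.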
3. Experiments
3.1. Data Set for the Segmentation Tasks
The data set used for the usability test is depicted in Figure 7. For this evaluation, the RGB color images are converted to grayscale in order to increase similarity to the segmentation process of medical images acquired from CBCT. The conversion is performed in accordance with the ITU-R BT.709-6 recommendation [96] for the extraction of true luminance, as defined by the International Commission on Illumination (CIE), from contemporary cathode ray tube (CRT) phosphors via Equation (7), where R, G, and B are the linear red, green, and blue color channels of the image, respectively. The image in Figure 7(b) is initially presented to the study participants in order to familiarize themselves with the upcoming segmentation process. The segmentation tasks associated with the images in Figures 7(a), 7(c), and 7(d) are then displayed sequentially to the subjects in randomized order. The images are chosen to fulfill two goals of the study. First, the ambiguity of the ground truth has to be minimized in order to suppress noise in the quantitative data: each test person should have the same understanding of, and agreement on, the correct outline of the object to segment. Therefore, clinical images could only be utilized with groups of specialized domain experts. Second, the degree of complexity should vary between the images displayed to the users. Image (b), depicted in Figure 7, of moderate complexity with regard to its disagreement coefficient [97], is displayed first to learn the process of segmentation with the given prototype. Users are asked to initially test a prototype's features on this image without any time pressure. The subsequent interactions during the segmentation of the remaining three images are recorded for each prototype and participant. The complexity increases from (a) to (d), according to the ground truths' Minkowski-Bouligand dimensions [98]. The varying complexity enables a more objective and extended differentiation of the subjects' performances with the given prototypes.
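The BT.709 luminance conversion referenced as Equation (7) can be sketched as follows; the standard weights 0.2126, 0.7152, and 0.0722 for the linear R, G, and B channels are assumed here:

```python
import numpy as np

def rgb_to_luminance_bt709(rgb):
    """Convert a linear RGB image to grayscale luminance.

    Uses the ITU-R BT.709 channel weights; `rgb` is an (H, W, 3)
    array of linear red, green, and blue values.
    """
    weights = np.array([0.2126, 0.7152, 0.0722])  # sum to 1.0
    return np.asarray(rgb, dtype=float) @ weights

# Pure white maps to full luminance; pure green maps to its weight alone.
white = np.ones((1, 1, 3))
green = np.array([[[0.0, 1.0, 0.0]]])
print(rgb_to_luminance_bt709(white), rgb_to_luminance_bt709(green))
```

Note that BT.709 distinguishes luminance computed from linear channels from luma computed from gamma-encoded ones; the text's mention of "true luminance" suggests the linear variant shown here.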
3.2. Usability Test Setup
Two separate user studies are conducted to test all prototypes described in Section 2.2, in order to keep the time for each test short (less than minutes per prototype), thus retaining the focus of the participants while minimizing the occurrence of learning effect artifacts in the acquired data. Note that the participants use this time not only to finish the segmentation tasks, but also to familiarize themselves with the novel interaction system and to form opinions about the system while testing the provided interaction features. The first user test is a randomized A/B test of the semi-manual prototype (Section 2.2.1) and the guided prototype (Section 2.2.2). Ten individuals are selected as test subjects due to their advanced domain knowledge in the fields of medical image processing and mobile input devices. The subjects are given the task to segment different images of varying complexity, described in Section 3.1, in random order. A fourth input image of medium complexity is provided for the users to familiarize themselves with the ISS before the tests. As an interaction device, a mobile tablet computer is utilized, since the final segmentation method is intended for usage on such a medium. The small WUXGA display and the fingers utilized as a multi-touch pointing device further exacerbate the challenge for the participants to fabricate an exact segmentation [99]. The user study environment is depicted in Figure 8. Audio and video recordings are evaluated via a qualitative content analysis, described in Section 2.4, in order to detect possible improvements for the tested prototypes and their interfaces. After segmentation, each participant fills out the SUS (Section 2.3.1) and AttrakDiff2 (Section 2.3.2) questionnaires.
The second user test is conducted for the joint segmentation prototype (Section 2.2.3). The data set and test setup are the same as in the first user study, and all test persons of the first study also participated in the second. One additional subject participated only in the second study. Two months passed between the two studies, in which the former participants were not exposed to any of the prototypes. Therefore, the learning effect bias for the second test is negligible.
3.3. Prediction of Questionnaire Results
The questionnaires' PQ, HQ, HQI, HQS, ATT, and SUS results are predicted based on features extracted from the interaction log data. For the prediction, a regression analysis is performed. Stochastic Gradient Boosting Regression Forests (GBRF) are an additive model for regression analysis [100–102]. In several stages, shallow regression trees are generated. Each such tree is a weak base learner with high bias and low variance. These regression trees are utilized to minimize an arbitrary differentiable loss function, each tree being fitted on the negative gradient of the previous stage's outcome, thus reducing the overall bias via boosting [103]. The Huber loss function [104] is utilized for this evaluation due to its increased robustness to outliers in the data compared to the squared error loss.
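The robustness of the Huber loss can be illustrated directly: it is quadratic for small residuals and linear beyond a threshold, so outliers contribute less than under squared error. This is a generic sketch (the threshold delta is an assumed parameter, not a value from the paper):

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails, which
    damps the influence of outliers compared to squared error."""
    r = np.abs(np.asarray(residual, dtype=float))
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

print(huber_loss([0.5, 3.0]))  # small residual: 0.125; outlier grows linearly: 2.5
```

Under squared error, the residual of 3.0 would contribute 4.5 instead of 2.5, which is why a few badly predicted user logs would dominate the fit.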
The collected data set of user logs is split randomly into a training and a testing set. An exhaustive grid search over parameter combinations is performed for each of the six GBRF estimators (one for each questionnaire result), with scorings based on an eight-fold cross-validation on the training set.
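The train/test split and grid search can be sketched with scikit-learn as follows. The synthetic data, the parameter grid, and its size are illustrative assumptions; only the Huber loss and the eight-fold cross-validation follow the text:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the interaction-log feature matrix; the real
# study predicts one questionnaire score (e.g. SUS) per user log.
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = X.sum(axis=1) + 0.05 * rng.randn(60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Huber loss for robustness to outliers; the grid here is a toy example.
grid = GridSearchCV(
    GradientBoostingRegressor(loss="huber", random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=8,  # eight-fold cross-validation on the training set
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```

One such estimator would be fitted per questionnaire result (PQ, HQ, HQI, HQS, ATT, SUS), each with its own grid search.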
3.3.1. Feature Definition
The collected data contains samples with possible features each. The questionnaire results (PQ, HQ, HQS, HQI, ATT, SUS) are predicted based on features extracted from the interaction log data of the four images segmented with the system. Four features are the relative median seed positions per user and their standard deviation in two dimensions. Additional features, like the number of undo operations (#Undos) and the number of interactions (#Interactions), the overall computation time (Computation_time), overall interaction time (Interaction_time), elapsed real time (Wall_time), Final_Rand_index, and Final_Dice_score, are reduced to one scalar value each by taking the mean and median over the four segmentations per prototype and user, to obtain the base features. Since each of these features only correlates weakly with the questionnaire results, composite features are added in order to assist the model's learning of feature relations. Added features are composed of one base feature value divided by (the mean or median of) the computation time, interaction time, or elapsed real time. The relations between those time values themselves are also added, yielding the set of features directly related to the interaction log data. In addition, a principal component analysis (PCA) is performed in order to add % (22) features with maximized variance to the directly assessed ones, to further assist the feature selection step via GBRFs.
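The construction of the ratio-based composite features can be sketched as follows; the column layout and example values are hypothetical, only the idea of dividing base features by a time measure comes from the text:

```python
import numpy as np

def add_ratio_features(base, time_cols):
    """Augment base features with ratios to each time measure, since
    e.g. #Undos per second of wall time is more comparable across
    users than the raw count."""
    cols = [base]
    for t in time_cols:
        denom = base[:, [t]]
        # Guard against division by zero for degenerate logs.
        cols.append(base / np.where(denom == 0, 1.0, denom))
    return np.hstack(cols)

# Hypothetical columns: #Undos, Final_Dice_score, Wall_time (seconds).
base = np.array([[4.0, 0.9, 10.0],
                 [1.0, 0.8, 20.0]])
features = add_ratio_features(base, time_cols=[2])
print(features.shape)  # (2, 6)
```

Dividing the time column by itself yields a constant column; in practice one would drop it, or keep only the time/time ratios the text mentions between distinct time measures.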
3.3.2. Feature Selection for SUS Prediction
For the approximation of SUS results, a feature selection step is added, which decreases the prediction error by an additional three percentage points: after the initial grid search described above, the % (205) of the GBRF estimators with the lowest mean deviance from the ground truth are selected to approximate the most important features. From those estimators, the most important features for the GBRFs are extracted via a weighted feature importance vote. This feature importance voting over many estimators ensures a more robust selection than deriving the feature ranking from only a single trained GBRF. After the voting, a second grid search over the same parameter combinations, but restricted to only the most important features, is performed.
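The weighted importance vote can be sketched as follows; the importance values and the weighting by validation performance are hypothetical, the aggregation idea follows the text:

```python
import numpy as np

def vote_feature_importance(importances, weights):
    """Aggregate per-estimator feature importances by a weighted vote.

    Averaging over many well-performing GBRFs yields a more robust
    feature ranking than trusting a single trained model."""
    imp = np.asarray(importances, dtype=float)  # (n_estimators, n_features)
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ imp                  # one aggregated score per feature

# Two hypothetical estimators' importances over three features; the first
# model is weighted higher, e.g. due to its lower validation deviance.
agg = vote_feature_importance([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]], [0.8, 0.2])
ranking = np.argsort(agg)[::-1]  # feature indices, most important first
```

With scikit-learn GBRFs, the per-estimator rows would come from each fitted model's `feature_importances_` attribute.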
4. Results
4.1. Overall Usability
The resulting SUS scores are depicted in Figure 9. According to the mapping (Figure 6) introduced in Section 2.3.1, the adjective ratings of the semi-manual and joint prototypes are excellent, and the adjective associated with the guided prototype is good (67).
A graph representation of the similarity of individual usability aspects, based on the acquired questionnaire data, is depicted in Figure 10. Based on the Pearson correlation coefficients utilized as a metric for similarity, the SUS score has the most similarity to the pragmatic (PQ) and attractiveness (ATT) usability aspects provided by the AttrakDiff2 questionnaire.
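The pairwise similarity of usability aspects can be sketched as follows; the per-participant scores below are hypothetical, only the use of the Pearson correlation coefficient as the similarity metric follows the text:

```python
import numpy as np

# Hypothetical per-participant scores for two questionnaire aspects.
sus = np.array([85.0, 90.0, 72.5, 95.0, 80.0])   # SUS, range 0..100
pq = np.array([6.1, 6.4, 5.2, 6.8, 5.9])         # AttrakDiff2 PQ, range 1..7

# Pearson correlation as the edge weight between two aspects in the
# similarity graph; scale differences do not matter for correlation.
similarity = np.corrcoef(sus, pq)[0, 1]
```

Computing this coefficient for every pair of aspects yields the weighted similarity graph of Figure 10.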
4.2. Pragmatic Quality
The PQ results of the AttrakDiff2 questionnaire are illustrated in Figure 11. The PQ scores for the semi-manual, guided, and joint prototypes are %, %, and % of the maximum score, respectively. Since the % confidence intervals are non-overlapping, the prototypes' ranking regarding PQ is significant.
The quantitative evaluation of the recorded interaction data is depicted in Figure 12. Dice scores before the first interaction are zero, except for the guided prototype, where a few fixed seed points had to be provided to initialize the system. Utilizing the semi-manual prototype and starting from zero, a Dice measure similar to the guided prototype's initialization is reached after about seven interactions, which takes seconds on average. The median values of the final Dice scores per prototype are (semi-manual), (guided), and (joint). The mean overall elapsed wall time in seconds spent for the interactive segmentations per prototype is (semi-manual), (guided), and (joint). Since segmenting with the guided version takes the longest time and does not yield the highest final Dice scores, the initial advantage from pre-existing seed points does not bias the top ranking of a prototype in this evaluation.
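The Dice coefficient used throughout this evaluation can be sketched for binary masks as follows (a generic implementation of the standard definition 2|A∩B| / (|A| + |B|), not the study's code):

```python
import numpy as np

def dice_score(segmentation, ground_truth):
    """Dice coefficient of two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    a = np.asarray(segmentation, dtype=bool)
    b = np.asarray(ground_truth, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:  # both masks empty: define as perfect agreement
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / denom

seg = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 1, 0], [0, 0, 0]])
print(dice_score(seg, gt))  # 2 * 2 / (3 + 2) = 0.8
```

Evaluating this score after every logged interaction yields the convergence curves of Figure 12.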
4.3. Hedonic Quality
4.3.1. Identity and Stimulus
The AttrakDiff2 questionnaire provides a measure for the HQ of identity and stimulus introduced in Section 2.3.2. The HQ scores for semimanual, guided, and joint prototypes are %, %, and % of the maximum score, respectively. Since the % confidence intervals are overlapping for all three prototypes, no system ranks significantly higher than the others. An overall evaluation of the AttrakDiff2 results is conducted in the form of a portfolio representation depicted in Figure 13.
4.3.2. Qualitative Content Analysis
A summative qualitative content analysis as described in Section 2.4 is conducted on the audio and video data recorded during the study. After generalization and reduction of given statements, the following user feedback is extracted with respect to three problem statements: positive usability aspects, negative usability aspects, and user suggestions concerning existing functions or new functions.
Feedback for Multiple Prototypes
(1) Responsiveness: the most common statement concerning the semi-manual and joint versions is that users expected the zoom function to be more responsive and thus more time efficient.
(2) Visibility: % of the participants had difficulties distinguishing between the segmentation contour line and either the background image or the foreground scribbles in the overlay mask, due to the proximity of their assigned color values.
(3) Feature suggestion: deletion of individual seed points instead of all seeds from the last interaction when using undo.
Semi-Manual Segmentation Prototype
(1) Mental model: % of test persons suggested a clearly visible indication of whether the label for the next drawn scribble will be foreground or background.
(2) Visibility: hide previously drawn seed points, in order to prevent confusion with the current contour line and occlusion of the underlying image.
Guided Segmentation Prototype
(1) Responsiveness: % of test persons suggested an indicator for ongoing computations during their time of waiting.
(2) Control: users would like to influence the location of new seed points, support for manual image zoom, and fine-grained control over the undo function.
Joint Prototype
(1) Visibility: % of users intuitively found the toggle functionality for seed labels without prior explanation.
(2) Visibility: % of participants suggested visible instructions for manual seed generation.
4.4. Prediction of Questionnaire Results from Log Data
The questionnaires' results are predicted via a regression analysis, based on features extracted from the interaction log data. A visualization of the feature importances for the regression analysis with respect to the GBRF is depicted in Figure 14. An evaluation with the test set is conducted as depicted in Table 3. The mean prediction errors for the questionnaires' results are % for PQ and % for HQ. In both cases, the error of these (first) estimates is larger than, but close to, the average % confidence intervals of % (PQ) and % (HQ) for the overall questionnaire results in the portfolio representation.

The similarity graph for the acquired usability aspects introduced in Figure 10 can be extended to outline the direct relationship between questionnaire results and recorded features. Such a graph is depicted in Figure 15. Notably, there is no individual feature that strongly correlates with one of the questionnaire results. However, as the results of the regression analysis in Table 3 show, there is a noteworthy dependence between the usability aspects measured by the SUS and AttrakDiff2 questionnaires and combinations of the recorded features. The most important features for the approximation of the questionnaire results are depicted in Table 4.

5. Discussion
5.1. Usability Aspects
Although the underlying segmentation algorithm is the interactive GrowCut method for all three prototypes tested, the measured user experiences varied significantly. In terms of hedonic stimulus (HQS), a more innovative interaction system like the joint prototype is preferred over a traditional one. Pragmatic quality aspects, evaluated by the SUS as well as AttrakDiff2's PQ, clearly show that the semi-manual approach has an advantage over the other two techniques. This conclusion also manifests in the fast convergence of the Dice coefficient values towards their maximum for this prototype. The normalized median Wall_time spent for the overall segmentation of each image is % (semi-manual), % (guided), and % (joint). As a result, users prefer the simple, pragmatic interface as well as a substantial degree of freedom to control each iterative step of the segmentation; the less cognitively challenging approach is preferred [26]. The other methods provide more guidance on aspects which the users aim to control themselves. In order to improve the productivity of an ISS, less guidance should be imposed in these cases, while more guidance should be provided on aspects of the process outside the users' focus of attention [105].
5.2. Usability Aspects Approximation
For ATT and HQI, the most discriminative features selected by the GBRFs are the receiver operating characteristic area under the curve (ROC_AUC) of the final interactive segmentations divided by the elapsed real time which passed during segmentation (Wall_time). The Jaccard index [106] as well as the relative absolute area/volume difference (RAVD), each divided by the computation time, are most relevant for HQ and HQS, respectively. The dominant features of the pragmatic quality (PQ) are composed of final Dice scores and time measurements per segmentation. The SUS results, quantifying the overall usability of a prototype, are mainly predicted based on the features with the highest level of abstraction used. In the top % (22) selected features, % of the top SUS features are PCA values, as indicated in Table 4 and Figure 14 (left). In comparison, the shares are PQ %, HQ %, HQI %, ATT %, and HQS %.
6. Conclusion
For sufficiently complex tasks like the accurate segmentation of lesions during TACE, fully automated systems are, due to their lack of domain knowledge, inherently limited in the achievable quality of their segmentation results. ISS may supersede fully automated systems in certain niches by cooperating with the human user in order to reach the common goal of an exact segmentation result in a short amount of time. The evaluation of interactive approaches is more demanding and less automated than that of fully automated approaches, due to complex human behavior.
However, there are methods like extensive user studies to assess the quality of a given system. It was shown that even a suitable approximation of a study's results regarding pragmatic as well as hedonic usability aspects is achievable from an analysis of the users' interaction recordings alone. Those records are straightforward to acquire during normal (digital) prototype usage and can lead to a good first estimate of the system's usability aspects, without significantly increasing the temporal demands on each participant through the mandatory completion of questionnaires after each system usage.
This mapping from quantitative low-level features, which are exclusively based on measurable interactions with the system (like the final Dice score, computation times, or relative seed positions), may allow for a fully automated assessment of an interactive system's quality.
7. Outlook
For the proposed automation, a rule-based user model (robot user) like [27, 34] or a learning-based user model could interact with the prototype system instead of a human user. This evaluation scheme may significantly reduce the amount of resources necessary to investigate each variation of a prototype's UI features and segmentation methodologies. An estimate of a system's usability can therefore be acquired fully automatically, depending only on the chosen user model. In addition, the suitable approximation of a usability study's result can be used as a descriptor, i.e., a feature vector, for a user. These features can be utilized for a clustering of users, which is a necessary step towards a personalized segmentation system. Such an interactive segmentation system might benefit from prior knowledge about a user's preferences and input patterns in order to achieve accurate segmentations from fewer interactions.
Appendix
A. Example for SUS Evaluation Equation (5)
The result of the SUS survey is a single scalar value in the range of zero to 100, serving as a composite measure of the overall usability. The score is computed according to Equation (5), as outlined in [71], from each subject's responses to the ten statements. Let the participants answer the questions (listed in Section 2.3.1) of the SUS questionnaire as given by the rows of the response matrix. Note that the factor of 2.5 in (5) normalizes the SUS score to a value between zero and 100.
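The standard SUS scoring, which Equation (5) follows, can be sketched as follows. This is a generic implementation of Brooke's scheme, assuming the usual ordering in which odd-numbered statements are positively and even-numbered statements negatively worded:

```python
import numpy as np

def sus_score(responses):
    """Standard SUS scoring. `responses` is an (n_subjects, 10) array
    with Likert values in 1..5. Odd-numbered statements contribute
    (x - 1), even-numbered ones (5 - x); the factor 2.5 scales the
    per-subject sum to the range [0, 100]."""
    r = np.asarray(responses, dtype=float)
    odd = r[:, 0::2] - 1.0    # statements 1, 3, 5, 7, 9
    even = 5.0 - r[:, 1::2]   # statements 2, 4, 6, 8, 10
    per_subject = 2.5 * (odd.sum(axis=1) + even.sum(axis=1))
    return per_subject.mean()

# One maximally satisfied subject: positive items rated 5, negative items 1.
print(sus_score([[5, 1] * 5]))  # 100.0
```

A subject answering all ten statements neutrally (value 3) yields a score of 50.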
B. Example for AttrakDiff Evaluation (6)
For the questionnaire's evaluation, each of the seven adjective pairs per group is assigned a score by each participant, reflecting their tendency towards the positive of the two adjectives. The overall rating per group is defined in [83] as the mean score computed over all subjects and statements, as depicted in (6). Let the participants fill in the choices (listed in Table 2) of the AttrakDiff2 questionnaire as given by the rows of the following matrices:
Group PQ:
Group ATT:
Group HQI:
Group HQS:
After evaluation via (6), the scores for the four groups are obtained.
The confidence intervals can then be extracted via the percent point function (also called quantile function or inverse cumulative distribution function) for the selected % confidence interval. Note that the mean and standard deviation operators flatten the input matrix to a vector first, such that both are computed from a single list of values and the outcome is one scalar value per function. The confidence intervals for the example data are computed accordingly.
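The confidence interval half-width can be sketched with the standard library's normal percent point function as follows; the use of a normal approximation (rather than a t distribution) is an assumption consistent with the inverse-CDF approach described above:

```python
import math
from statistics import NormalDist, stdev

def confidence_interval_halfwidth(scores, confidence=0.95):
    """Half-width of the confidence interval around a mean rating via
    the percent point function (inverse CDF) of the normal distribution.
    `scores` is the flattened list of all per-item ratings in one group."""
    z = NormalDist().inv_cdf(0.5 * (1.0 + confidence))  # e.g. ~1.96 for 95%
    return z * stdev(scores) / math.sqrt(len(scores))

# Hypothetical flattened group ratings from four subjects, seven items each.
halfwidth = confidence_interval_halfwidth([4, 5, 4, 5] * 7)
```

The resulting half-width determines the extension of a system's rectangle in the portfolio representation along the corresponding axis.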
Data Availability
The interaction log data used to support the findings of this study can be requested from the corresponding author.
Disclosure
The concept and software presented in this paper are based on research and are not commercially available. Due to regulatory reasons its future availability cannot be guaranteed.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
Thanks are due to Christian Kisker and Carina Lehle for their hard work with the data collection.
References
 A. S. Becker, B. K. Barth, P. H. Marquez et al., “Increased interreader agreement in diagnosis of hepatocellular carcinoma using an adapted LIRADS algorithm,” European Journal of Radiology, vol. 86, pp. 33–40, 2017. View at: Publisher Site  Google Scholar
 Y. S. Kim, J. W. Kim, W. S. Yoon et al., “Interobserver variability in gross tumor volume delineation for hepatocellular carcinoma,” Strahlentherapie und Onkologie, vol. 192, no. 10, pp. 714–721, 2016. View at: Publisher Site  Google Scholar
 T. S. Hong, W. R. Bosch, S. Krishnan et al., “Interobserver Variability in Target Definition for Hepatocellular Carcinoma With and Without Portal Vein Thrombus: Radiation Therapy Oncology Group Consensus Guidelines,” International Journal of Radiation Oncology • Biology • Physics, vol. 89, no. 4, pp. 804–813, 2014. View at: Publisher Site  Google Scholar
 J. H. Moltz, S. Braunewell, J. Rühaak et al., “Analysis of variability in manual liver tumor delineation in CT scans,” in Proceedings of the 2011 8th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, ISBI'11, pp. 1974–1977, 2011. View at: Google Scholar
 S. D. Olabarriaga and A. W. Smeulders, “Setting the mind for intelligent interactive segmentation: Overview, requirements, and framework,” in Proceedings of the Biennial International Conference on Information Processing in Medical Imaging, pp. 417–422, Springer, 1997. View at: Google Scholar
 M. Hassenzahl and N. Tractinsky, “User experience—a research agenda,” Behaviour & Information Technology, vol. 25, no. 2, pp. 91–97, 2006. View at: Publisher Site  Google Scholar
 E. L.C. Law, V. Roto, M. Hassenzahl, A. P. O. S. Vermeeren, and J. Kort, “Understanding, scoping and defining user experience: a survey approach,” Human Factors in Computing Systems (CHI), pp. 719–728, 2009. View at: Google Scholar
 M. Young, G. Dank, R. Roper, and T. Caro, “InterObserver Reliability,” Behaviour, vol. 69, no. 34, pp. 303–315, 1979. View at: Publisher Site  Google Scholar
 P. Kohli, H. Nickisch, C. Rother, and C. Rhemann, “Usercentric learning and evaluation of interactive segmentation systems,” Computer Vision (IJCV), vol. 100, no. 3, pp. 261–274, 2012. View at: Google Scholar
 L. R. Dice, “Measures of the amount of ecologic association between species,” Ecology, vol. 26, pp. 297–302, 1945. View at: Publisher Site  Google Scholar
 S. Olabarriaga and A. Smeulders, “Interaction in the segmentation of medical images: A survey,” Medical Image Analysis, vol. 5, no. 2, pp. 127–142, 2001. View at: Publisher Site  Google Scholar
 F. Zhao and X. Xie, “Interactive segmentation of medical images: A survey,” in Proceedings of the Medical Image Understanding and Analysis, 2012. View at: Google Scholar
 C. S. Puranik and C. J. Lonigan, “From scribbles to scrabble: Preschool childrenâ™s developing knowledge of written language,” Reading and Writing, vol. 24, no. 5, pp. 567–589, 2011. View at: Publisher Site  Google Scholar
 C. Rupprecht, L. Peter, and N. Navab, “Image segmentation in twenty questions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR '15, pp. 3314–3322, 2015. View at: Google Scholar
 J. Udupa, L. Wei, S. Samarasekera, Y. Miki, M. van Buchem, and R. Grossman, “Multiple sclerosis lesion quantification using fuzzyconnectedness principles,” IEEE Transactions on Medical Imaging, vol. 16, no. 5, pp. 598–609, 1997. View at: Publisher Site  Google Scholar
 S. D. Olabarriaga, Humancomputer interaction for the segmentation of medical images [dissertation], Advanced School for Computing and Imaging, 1999.
 H. Nickisch, C. Rother, P. Kohli, and C. Rhemann, “Learning an interactive segmentation system,” in Computer Vision, Graphics and Image Processing (ICVGIP), pp. 274–281, ACM, 2010. View at: Google Scholar
 K. McGuinness and N. E. O’Connor, “A comparative evaluation of interactive segmentation algorithms,” Pattern Recognition, vol. 43, no. 2, pp. 434–444, 2010. View at: Publisher Site  Google Scholar
 B. Van Ginneken, T. Heimann, and M. Styner, “3D segmentation in the clinic: A grand challenge,” in Medical Image Computing and ComputerAssisted Intervention (MICCAI), pp. 7–15, 2007. View at: Google Scholar
 G. Litjens, R. Toth, W. van de Ven et al., “Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge,” Medical Image Analysis, vol. 18, no. 2, pp. 359–373, 2014. View at: Publisher Site  Google Scholar
 F. Zhao and X. Xie, “An overview of interactive medical image segmentation,” Annals of the BMVA, vol. 2013, no. 7, pp. 1–22, 2013. View at: Google Scholar
 K. McGuinness and N. E. O’Connor, “Toward automated evaluation of interactive segmentation,” Computer Vision and Image Understanding, vol. 115, no. 6, pp. 868–884, 2011. View at: Publisher Site  Google Scholar
 M. Amrehn, J. Glasbrenner, S. Steidl, and A. Maier, “Comparative evaluation of interactive segmentation approaches,” in Bildverarbeitung für die Medizin (BVM), pp. 68–73, 2016. View at: Publisher Site  Google Scholar
 W. Yang, J. Cai, J. Zheng, and J. Luo, “UserFriendly Interactive Image Segmentation Through Unified Combinatorial User Inputs,” IEEE Transactions on Image Processing, vol. 19, no. 9, pp. 2470–2479, 2010. View at: Publisher Site  Google Scholar
 A. Ramkumar, P. J. Stappers, W. J. Niessen et al., “Using GOMS and NASATLX to evaluate humancomputer interaction process in interactive segmentation,” in Human–Computer Interaction (IHC), pp. 1–12, 2016. View at: Google Scholar
 A. Ramkumar, J. Dolz, H. A. Kirisli et al., “User Interaction in SemiAutomatic Segmentation of Organs at Risk: a Case Study in Radiotherapy,” Journal of Digital Imaging, vol. 29, no. 2, pp. 264–277, 2016. View at: Publisher Site  Google Scholar
 M. P. Amrehn, M. Strumia, M. Kowarschik, and A. Maier, “Interactive neural network robot user investigation for medical image segmentation,” in Bildverarbeitung für die Medizin (BVM), pp. 56–61, Springer, 2019. View at: Google Scholar
 N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang, “Deep interactive object selection,” in Computer Vision and Pattern Recognition (CVPR), pp. 373–381, IEEE, 2016. View at: Google Scholar
 G. Wang, M. A. Zuluaga, W. Li et al., “DeepIGeoS: A Deep Interactive Geodesic Framework for Medical Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. View at: Google Scholar
 D.J. Chen, H.T. Chen, and L.W. Chang, “Swipecut: Interactive segmentation with diversified seed proposals,” https://arxiv.org/abs/1812.07260, 2018. View at: Google Scholar
 M. P. Amrehn, M. Strumia, S. Steidl, T. Horz, M. Kowarschik, and A. Maier, “Ideal seed point location approximation for GrowCut interactive image segmentation,” in Bildverarbeitung für die Medizin (BVM), pp. 210–215, Springer, 2018. View at: Google Scholar
 J. H. Liew, Y. Wei, W. Xiong, S.H. Ong, and J. Feng, “Regional interactive image segmentation networks,” in Computer Vision (ICCV), pp. 2746–2754, IEEE, 2017. View at: Google Scholar
 G. Wang, W. Li, M. A. Zuluaga et al. et al., “Interactive medical image segmentation using deep learning with imagespecific finetuning,” Transactions on Medical Imaging (TMI), vol. 37, no. 7, pp. 1562–1573, 2018. View at: Google Scholar
 M. P. Amrehn, S. Gaube, M. Unberath et al., “UInet: Interactive artificial neural networks for iterative image segmentation based on a user model,” in Visual Computing for Biology and Medicine (VCBM), pp. 143–147, 2017. View at: Google Scholar
 M. P. Amrehn, S. Steidl, M. Kowarschik, and A. Maier, “Robust seed mask generation for interactive image segmentation,” in Proceedings of the Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), pp. 1–3, IEEE, 2017. View at: Google Scholar
 B. Jiang, T. Ren, and J. Bei, “Automatic scribble simulation for interactive image segmentation evaluation,” in Multimedia Modeling (MMM), pp. 596–608, Springer, 2016. View at: Google Scholar
 D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Computer Vision (ICCV), vol. 2, pp. 416–423, IEEE, 2001. View at: Google Scholar
 D.J. Chen, H.T. Chen, and L.W. Chang, “Interactive segmentation from 1bit feedback,” in Computer Vision (ACCV), pp. 261–274, Springer, 2016. View at: Google Scholar
 F. Andrade and E. V. Carrera, “Supervised evaluation of seedbased interactive image segmentation algorithms,” in Signal Processing, Images and Computer Vision (STSIVA), pp. 1–7, IEEE, 2015. View at: Google Scholar
 J. Bai and X. Wu, “Errortolerant scribbles based interactive image segmentation,” in Computer Vision and Pattern Recognition (CVPR), pp. 392–399, IEEE, 2014. View at: Google Scholar
 S. D. Jain and K. Grauman, “Predicting sufficient annotation strength for interactive foreground segmentation,” in Computer Vision (ICCV), pp. 1313–1320, IEEE, 2013. View at: Google Scholar
 J. He, C.S. Kim, and C.C. J. Kuo, Interactive Segmentation Techniques: Algorithms and Performance Evaluation, Springer Science & Business Media, 2013.
 Y. Zhao, X. Nie, Y. Duan, Y. Huang, and S. Luo, “A benchmark for interactive image segmentation algorithms,” in PersonOriented Vision (POV), pp. 33–38, IEEE, 2011. View at: Google Scholar
 A. Top, G. Hamarneh, and R. Abugharbieh, “Active learning for interactive 3d image segmentation,” in Medical Image Computing and ComputerAssisted Intervention (MICCAI), pp. 603–610, Springer, 2011. View at: Google Scholar
 V. Gulshan, C. Rother, A. Criminisi, A. Blake, and A. Zisserman, “Geodesic star convexity for interactive image segmentation,” in Computer Vision and Pattern Recognition (CVPR), pp. 3129–3136, IEEE, 2010.
 D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, “iCoseg: Interactive co-segmentation with intelligent scribble guidance,” in Computer Vision and Pattern Recognition (CVPR), pp. 3169–3176, IEEE, 2010.
 J. Ning, L. Zhang, D. Zhang, and C. Wu, “Interactive image segmentation by maximal similarity based region merging,” Pattern Recognition, vol. 43, no. 2, pp. 445–456, 2010.
 B. L. Price, B. Morse, and S. Cohen, “Geodesic graph cut for interactive image segmentation,” in Computer Vision and Pattern Recognition (CVPR), pp. 3161–3168, IEEE, 2010.
 D. Singaraju, L. Grady, and R. Vidal, “P-brush: Continuous valued MRFs with normed pairwise distributions for image segmentation,” in Computer Vision and Pattern Recognition (CVPR), pp. 1303–1310, IEEE, 2009.
 C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive foreground extraction using iterated graph cuts,” in Transactions on Graphics (TOG), vol. 23, pp. 309–314, ACM, 2004.
 E. Moschidis and J. Graham, “A systematic performance evaluation of interactive image segmentation methods based on simulated user interaction,” in Biomedical Imaging (ISBI), pp. 928–931, IEEE, 2010.
 E. Moschidis and J. Graham, “Simulation of user interaction for performance evaluation of interactive image segmentation methods,” in Medical Image Understanding and Analysis (MIUA), pp. 209–213, 2009.
 O. Duchenne, J.-Y. Audibert, R. Keriven, J. Ponce, and F. Ségonne, “Segmentation by transduction,” in Computer Vision and Pattern Recognition (CVPR), pp. 1–8, IEEE, 2008.
 A. Levin, D. Lischinski, and Y. Weiss, “A closed-form solution to natural image matting,” Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 228–242, 2008.
 S. Vicente, V. Kolmogorov, and C. Rother, “Graph cut based image segmentation with connectivity priors,” in Computer Vision and Pattern Recognition (CVPR), pp. 1–8, IEEE, 2008.
 A. Protiere and G. Sapiro, “Interactive Image Segmentation via Adaptive Weighted Distances,” IEEE Transactions on Image Processing, vol. 16, no. 4, pp. 1046–1057, 2007.
 Y. Boykov and G. Funka-Lea, “Graph cuts and efficient N-D image segmentation,” International Journal of Computer Vision, vol. 70, no. 2, pp. 109–131, 2006.
 L. Grady, “Random walks for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1768–1783, 2006.
 V. Vezhnevets and V. Konouchine, “GrowCut: Interactive multi-label N-D image segmentation by cellular automata,” in Computer Graphics and Applications (Graphicon), vol. 8, pp. 150–156, Citeseer, 2005.
 J. E. Cates, R. T. Whitaker, and G. M. Jones, “Case study: an evaluation of user-assisted hierarchical watershed segmentation,” Medical Image Analysis, vol. 9, no. 6, pp. 566–578, 2005.
 Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum, “Lazy snapping,” in Transactions on Graphics (ToG), vol. 23, pp. 303–308, ACM, 2004.
 A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr, “Interactive image segmentation using an adaptive GMMRF model,” in Computer Vision (ECCV), pp. 428–441, Springer, 2004.
 J. W. Chung, H. Kim, J. Yoon et al., “Transcatheter Arterial Chemoembolization of Hepatocellular Carcinoma: Prevalence and Causative Factors of Extrahepatic Collateral Arteries in 479 Patients,” Korean Journal of Radiology, vol. 7, no. 4, pp. 257–266, 2006.
 K. A. McGlynn and W. T. London, “The global epidemiology of hepatocellular carcinoma: present and future,” Clinics in Liver Disease, vol. 15, no. 2, pp. 223–243, 2011.
 R. J. Lewandowski, J.-F. Geschwind, E. Liapi, and R. Salem, “Transcatheter intra-arterial therapies: Rationale and overview,” Radiology, vol. 259, no. 3, pp. 641–657, 2011.
 J. Bruix and M. Sherman, “Management of Hepatocellular carcinoma,” Hepatology, vol. 42, no. 5, pp. 1208–1236, 2005.
 J. Bruix and M. Sherman, “Management of hepatocellular carcinoma: an update,” Hepatology, vol. 53, no. 3, pp. 1020–1022, 2011.
 N. Strobel, O. Meissner, J. Boese, T. Brunner et al., “3D imaging with flat-detector C-arm systems,” Multislice CT, pp. 33–51, 2009.
 C. Lo, H. Ngan, W. Tso et al., “Randomized controlled trial of transarterial Lipiodol chemoembolization for unresectable hepatocellular carcinoma,” Hepatology, vol. 35, no. 5, pp. 1164–1171, 2002.
 Y. Jin, L. M. Fayad, and A. F. Laine, “Contrast enhancement by multiscale adaptive histogram equalization,” in Optical Science and Technology, pp. 206–213, International Society for Optics and Photonics, 2001.
 J. Brooke, “SUS – A quick and dirty usability scale,” Usability Evaluation In Industry, pp. 189–194, 1996.
 J. R. Lewis and J. Sauro, “The factor structure of the system usability scale,” in Proceedings of the International Conference on Human Centered Design (HCD), pp. 94–103, 2009.
 P. T. Kortum and A. Bangor, “Usability Ratings for Everyday Products Measured With the System Usability Scale,” International Journal of Human-Computer Interaction, vol. 29, no. 2, pp. 67–76, 2013.
 ISO Central Secretary, “Ergonomic requirements for office work with visual display terminals (VDTs) – Part 11: Guidance on usability,” International Organization for Standardization (ISO), Geneva, CH, Standard ISO 9241-11:1998, 1998.
 ISO Central Secretary, “Ergonomics of human-system interaction – Part 11: Usability: Definitions and concepts,” International Organization for Standardization (ISO), Geneva, CH, Standard ISO 9241-11:2018, Mar. 2018.
 A. Bangor, P. Kortum, and J. Miller, “Determining what individual SUS scores mean: adding an adjective rating scale,” Journal of Usability Studies, vol. 4, no. 3, pp. 114–123, 2009.
 R. Likert, “A technique for the measurement of attitudes,” Archives of Psychology, vol. 22, no. 140, pp. 3–55, 1932.
 A. Bangor, P. T. Kortum, and J. T. Miller, “An empirical evaluation of the system usability scale,” International Journal of Human-Computer Interaction, vol. 24, no. 6, pp. 574–594, 2008.
 C. E. Osgood, “The nature and measurement of meaning,” Psychological Bulletin, vol. 49, no. 3, pp. 197–237, 1952.
 C. E. Osgood, G. J. Suci, and P. H. Tannenbaum, The Measurement of Meaning, University of Illinois Press, 1957.
 A. Mehrabian and J. A. Russell, An Approach to Environmental Psychology, MIT Press, 1974.
 M. Fishbein and I. Ajzen, Belief, Attitude, Intention, and Behavior: An Introduction to Theory and Research, AddisonWesley, 1975.
 M. Hassenzahl, M. Burmester, and F. Koller, “AttrakDiff: Ein Fragebogen zur Messung wahrgenommener hedonischer und pragmatischer Qualität,” in Mensch & Computer (MC), pp. 187–196, Springer, 2003.
 M. Hassenzahl, A. Platz, M. Burmester, and K. Lehner, “Hedonic and ergonomic quality aspects determine a software's appeal,” in Human Factors in Computing Systems (CHI), SIGCHI, pp. 201–208, ACM, 2000.
 M. Hassenzahl, “The effect of perceived hedonic quality on product appealingness,” International Journal of Human-Computer Interaction, vol. 13, no. 4, pp. 481–499, 2001.
 M. Hassenzahl, M. Burmester, and F. Koller, “Der user experience (UX) auf der Spur: Zum Einsatz von www.attrakdiff.de,” in User Experience Professionals Association International (UXPA), vol. 17, pp. 78–82, 2008.
 M. Hassenzahl, R. Kekez, and M. Burmester, “The importance of a software's pragmatic quality depends on usage modes,” in Proceedings of the 6th International Conference on Work With Display Units (WWDU), ERGONOMIC Institut für Arbeits- und Sozialforschung, pp. 275–276, Berlin, Germany, 2002.
 S. Diefenbach and M. Hassenzahl, “Give me a reason: hedonic product choice and justification,” in Human Factors in Computing Systems (CHI), pp. 3051–3056, 2008.
 M. Hassenzahl, “The hedonic/pragmatic model of user experience,” Towards a UX Manifesto, pp. 16–20, 2007.
 J. McCroskey, “Bipolar scales,” in Measurement of Communication Behavior, P. Emmert and L. L. Barker, Eds., pp. 154–167, Longman Publishing Group, White Plains, NY, USA, 1989.
 H.-F. Hsieh and S. E. Shannon, “Three approaches to qualitative content analysis,” Qualitative Health Research, vol. 15, no. 9, pp. 1277–1288, 2005.
 S. Elo and H. Kyngäs, “The qualitative content analysis process,” Journal of Advanced Nursing, vol. 62, no. 1, pp. 107–115, 2008.
 P. Mayring, Qualitative Content Analysis: Theoretical Foundation, Basic Procedures and Software Solution, GESIS, 2014.
 Q. Gao, Y. Wang, F. Song, Z. Li, and X. Dong, “Mental workload measurement for emergency operating procedures in digital nuclear power plants,” Ergonomics, vol. 56, no. 7, pp. 1070–1085, 2013.
 D. Siroker and P. Koomen, A/B Testing: The Most Powerful Way to Turn Clicks Into Customers, Wiley Publishing, 1st edition, 2013.
 Recommendation ITU-R BT.709-6 (Broadcasting service (television)), “Basic parameter values for the HDTV standard for the studio and for international programme exchange,” International Telecommunication Union Radiocommunication Assembly (ITU-R), 1990.
 S. Hanneke, “A bound on the label complexity of agnostic active learning,” in Proceedings of the International Conference on Machine Learning (ICML), pp. 353–360, 2007.
 B. B. Mandelbrot, “How long is the coast of Britain? Statistical self-similarity and fractional dimension,” Science, vol. 156, no. 3775, pp. 636–638, 1967.
 D. A. Norman and J. Nielsen, “Gestural interfaces: a step backward in usability,” Interactions, vol. 17, no. 5, pp. 46–49, 2010.
 J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
 J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Analysis (CSDA), vol. 38, no. 4, pp. 367–378, 2002.
 T. Hastie, R. Tibshirani, and J. H. Friedman, “Boosting and additive trees,” The Elements of Statistical Learning, pp. 337–387, 2009.
 L. Breiman, “Using adaptive bagging to debias regressions,” Tech. Rep., University of California, Berkeley, Calif, USA, 1999.
 P. J. Huber, “Robust estimation of a location parameter,” Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73–101, 1964.
 W. Heron, “Perception as a Function of Retinal Locus and Attention,” The American Journal of Psychology, vol. 70, no. 1, pp. 38–48, 1957.
 P. Jaccard, “The distribution of the flora in the alpine zone,” New Phytologist, vol. 11, no. 2, pp. 37–50, 1912.
Copyright
Copyright © 2019 Mario Amrehn et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.