Complexity

Volume 2017 (2017), Article ID 7120691, 11 pages

https://doi.org/10.1155/2017/7120691

## Sparse Learning of the Disease Severity Score for High-Dimensional Data

^{1}Signals and Systems Department, School of Electrical Engineering, University of Belgrade, Bulevar Kralja Aleksandra 73, 11120 Belgrade, Serbia

^{2}Center for Data Analytics and Biomedical Informatics, College of Science and Technology, Temple University, 1925 North 12th Street, Philadelphia, PA 19122, USA

Correspondence should be addressed to Zoran Obradovic

Received 11 May 2017; Revised 6 November 2017; Accepted 27 November 2017; Published 18 December 2017

Academic Editor: Sergio Gómez

Copyright © 2017 Ivan Stojkovic and Zoran Obradovic. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Learning disease severity scores automatically from collected measurements may aid the quality of both healthcare and scientific understanding. Some steps in that direction have been taken, and machine learning algorithms for extracting scoring functions from data have been proposed. Given the rapid increase in both the quantity and diversity of measured and stored data, the large amount of information is becoming one of the challenges for learning algorithms. In this work, we investigated the setting where the dimensionality of the measured variables is large. Learning the severity score in such cases raises the issue of determining which measured features are relevant. We propose a novel approach that combines desirable properties of existing formulations and compares favorably to alternatives in accuracy and especially in the robustness of the learned scoring function. The proposed formulation has a nonsmooth penalty that induces sparsity. This problem is solved by addressing a dual formulation which is smooth and allows efficient optimization. The proposed approach may be used as an effective and reliable tool for both scoring-function learning and biomarker discovery, as demonstrated by identifying a stable set of genes related to the severity of influenza symptoms, which are enriched in immune-related processes.

#### 1. Introduction

Diseases and other health conditions require continuous monitoring and assessment of the subject’s state. The severity of the condition needs to be quantified so that it can guide medical decisions and allow appropriate and timely interventions. Disease severity scoring functions are typically used to quantify a patient’s condition. However, disease severity and health are often difficult to quantify, because they are essentially latent concepts that are not directly accessible or observable. In the absence of a direct measurement of health, the severity of a condition is estimated from the values of surrogate variables that are observable and, hopefully, informative about the condition of interest. In clinical practice, commonly tracked variables include temperature, heart rate, blood pressure, and responsiveness, to name a few among a myriad of other possible variables. A severity score is then calculated from such observable quantities using heuristic rules. Prominent examples of such rules are the SOFA [1] score for sepsis and more general ICU scoring systems like APACHE II [2]. Both the relevant variables and the associated heuristic rules are established by consensus of expert bodies and relevant institutions, based on experience and the current understanding of a condition. That process is long and tedious and often results in excessively coarse scoring rules and a nonoptimal set of relevant observable variables.

Although utilizing data was always part of this process, it was recently acknowledged that it might be improved or complemented by machine learning methods that can automatically extract both rules and relevant variables directly from the data. There are already a number of approaches for automatically learning severity scores/rules from data. One way is to use discrete class labels for building classifiers and subsequently use the probability of a sample belonging to a certain class as a quantification of severity [3]. Another supervised approach is to learn the severity score function in a regression manner [4] from some surrogate of severity highly associated with undesired downstream outcomes. A downside is that this already requires a good candidate for the scoring function. An additional issue is that it might be sensitive to censoring due to treatment, where the severity of a state is not acknowledged because treatment prevented undesired outcomes from happening. Some of these drawbacks are addressed in a more recent approach [5, 6] based on the clever observation that comparing two cases according to severity is easier than directly quantifying the severity of a particular case. It was built upon existing work on learning scoring functions for information retrieval tasks [7]. However, even this approach might be inappropriate in some cases, since it learns the severity score as a function of all measured variables, which affects its performance when there are irrelevant features or when the number of features is much higher than the number of samples [8]. In essence, features unrelated to severity will be present even in small sets of measured variables, and, in high-throughput measurements like gene expression, this can be an even larger obstacle.

In this paper, we present an approach to the problem of learning disease severity scores in the presence of irrelevant or high-dimensional measurements. We build on top of existing efforts by simultaneously selecting the features that are most relevant for severity score learning. In particular, we introduce the $\ell_1$ norm into the formulation of the ranking SVM [7], along with the temporal smoothness constraint [6]. The attractive regularization properties of the $\ell_1$ norm have been well acknowledged and exploited in a number of statistical learning methods since its introduction [9]. The proposed formulation of sparse severity score learning forces the weights of (most of) the features to be exactly zero, therefore effectively performing feature selection by learning a sparse linear scoring function. This novel severity score objective function is convex but nonsmooth, which precludes the direct use of convenient optimization tools like gradient-based methods. Therefore, in this work, we also provide a reformulation of the problem into its dual, which is smooth and allows efficient optimization. Besides learning the severity score from data, which is an important instrument for assessing severity, the methodology may also be used to discover the variables/features most relevant to the disease severity phenotype. Such findings might further be used to suggest novel (testable) hypotheses about the causal relations leading to disease manifestation and to inspire novel therapeutic approaches.

The rest of the article is structured as follows. Methods begins with an introduction to related work and continues with the new formulation and the derivation of its solution. Results begins with an evaluation on intuitive synthetic examples where the advantages of the sparse severity score framework over the nonsparse one are apparent. It continues with an assessment on a gene expression dataset of H3N2 viral infection responses, where the efficacy and robustness of the proposed method compare favorably against multiple alternative methods. Results concludes with a gene ontology overrepresentation analysis of the discovered subset of genes most useful for the scoring function.

#### 2. Methods

##### 2.1. Previous and Related Work

As mentioned in Introduction, some of the first proposed severity score learning methods are supervised approaches that solve classification or regression tasks and whose solution provides a way to calculate a severity score.

For example, in [4] Alzheimer’s Disease severity, as measured by cognitive scores, was modeled as (temporal) multitask regression using the fused sparse group lasso approach. That approach was mostly concerned with the progression of the disease, hence the multitask formulation. However, as we are primarily interested in mapping a single time-point set of measurements to a severity score, here we present its more influential ancestor, the LASSO model [9]:

$$\min_{w} \; \frac{1}{2} \lVert y - X w \rVert_2^2 + \lambda \lVert w \rVert_1. \tag{1}$$

Here, $y$ is a column vector of $n$ given numeric scores, associated with the $n \times d$-dimensional measurement matrix $X$, while $w$ denotes the solution in the form of a $d$-dimensional column weight vector. We will use this model as one of the baselines for comparison, as it is one of the main workhorses of biomarker selection [10] and even of statistical learning in general.
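As an illustration of how the LASSO objective yields a sparse weight vector, the following is a minimal Python sketch (not the authors' implementation) solving it by proximal gradient descent (ISTA); the function names and the choice of solver are ours.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X w||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)             # gradient of the smooth quadratic part
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

Run on data where only a few features carry signal, the irrelevant coefficients are driven to exactly zero by the soft-thresholding step.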

Another approach used the sparsity-inducing $\ell_1$ norm in combination with a classical classification loss function for learning a disease severity scoring function [3]. The authors proposed using an $\ell_1$-regularized Logistic Regression model (among others) to model the severity scores for the abnormality of the skull in craniosynostosis cases:

$$\min_{w} \; \sum_{i=1}^{n} \log\Bigl(1 + \exp\bigl(-y_i\, x_i^T w\bigr)\Bigr) + \lambda \lVert w \rVert_1. \tag{2}$$

This Sparse Logistic Regression formulation is another related model, as it also results in a sparse vector of feature weights that essentially regresses the decision boundary between the severity classes and might be used as a mapping function for severity scores. In (2), $y_i \in \{-1, +1\}$ is the binary label for the $i$th row $x_i$ of the data matrix $X$.

As outlined previously, these forms of supervision, where estimates of severity score values (or severity classes) are needed, might be hard to obtain for training the severity score automatically. On the other hand, obtaining pairs of comparisons is an easier task. Seminal work on learning scoring functions from comparison labels was proposed in [7]. In that work, the ranking SVM formulation (see (3)) was developed to learn better document retrieval from click-through data. This insight came from noticing that the clicked links automatically have greater ranks than the ones not clicked, and such data is much more abundant than user-provided rankings. Using the Hinge loss $\ell_h(u) = \max(0, 1 - u)$, the ranking SVM can be written as

$$\min_{w} \; \frac{1}{2}\lVert w \rVert_2^2 + C \sum_{(i,j) \in P} \ell_h\bigl(w^T(x_i - x_j)\bigr). \tag{3}$$

The set $P$ is composed of ordered comparison pairs $(i, j)$, where sample $i$ has a higher rank than sample $j$, corresponding to rows $x_i$ and $x_j$ of the measurement matrix $X$, respectively. More recently, the approach was adapted for learning the Sepsis Disease Severity Score [5]. In it (see (4)), a constraint that the scoring function should evolve gradually over time was introduced, and hence a temporal smoothness term was added. In addition, the nonsmooth Hinge loss ($\ell_h$) is replaced with its smooth approximation, the Huber loss ($\ell_{hu}$), to obtain the formulation of the (linear) Disease Severity Score Learning (DSSL) framework:

$$\min_{w} \; \frac{1}{2}\lVert w \rVert_2^2 + C_o \sum_{(i,j) \in P} \ell_{hu}\bigl(w^T(x_i - x_j)\bigr) + C_s \sum_{(t, t+1) \in D} \left( \frac{w^T x_{t+1}^{s} - w^T x_{t}^{s}}{\tau_{t+1}^{s} - \tau_{t}^{s}} \right)^2. \tag{4}$$

The temporal smoothness term in (4) penalizes high rates of change of severity between consecutive time steps $t$ and $t+1$ of a single subject $s$, where $\tau_t^s$ denotes the time of the $t$th measurement of subject $s$. The set of all consecutive pairs over all subjects is denoted by $D$, and the constants $C_o$ and $C_s$ are hyperparameters determining the cost of the respective loss terms.
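For concreteness, the following Python sketch contrasts the nonsmooth Hinge loss with a Huberized hinge of the kind used to smooth it. The exact smoothing used in DSSL is not reproduced here; the smoothing width `h` and the piecewise form are illustrative assumptions.

```python
import numpy as np

def hinge(u):
    """Nonsmooth Hinge loss max(0, 1 - u)."""
    return np.maximum(0.0, 1.0 - u)

def huber_hinge(u, h=0.5):
    """A smoothed (Huberized) hinge: quadratic on [1-h, 1+h], linear below,
    zero above. The smoothing width h is illustrative, not the DSSL value."""
    u = np.asarray(u, dtype=float)
    out = np.zeros_like(u)
    quad = np.abs(1.0 - u) <= h
    out[quad] = (1.0 + h - u[quad]) ** 2 / (4.0 * h)   # quadratic cap of the corner
    out[u < 1.0 - h] = 1.0 - u[u < 1.0 - h]            # matches hinge slope -1
    return out
```

The two losses agree far from the corner at $u = 1$; the quadratic piece removes the nondifferentiability that blocks second-order gradient methods.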

The DSSL framework has been adopted and extended in different ways. A multitask DSSL was proposed in [11], which utilizes matrix norm regularization to couple multiple distinct tasks. A nonlinear version of the DSSL framework, solved in the form of gradient boosted regression trees, was also proposed in [6]. Nevertheless, the mentioned DSSL approaches are dense in the sense that they operate on all variables (in the linear version, all coefficients are typically nonzero). The approach in [11] is based on an expensive proximal gradient optimization algorithm, which makes it unsuitable for high-dimensional problems. The utility of the approaches in [6] was demonstrated on an application with a moderately small number of clinical variables (clinical information, vitals, and laboratory analyses), and it is not clear how the approach would perform with the high-dimensional data common in high-throughput techniques (genetic, genomic, epigenetic, proteomic, and so on).

Yet, high-throughput data is also a very rich source of useful biomarkers that could serve diagnostic and prognostic purposes, as well as provide insight into causal relations [12]. Therefore, we propose an approach that learns a (temporally smooth) scoring function from comparison data while simultaneously selecting the most relevant (important) variables.

##### 2.2. Proposed Model Formulation

In our Sparse Learning of Disease Severity Score (SLDSS) formulation, we combine attractive properties (and terms) of the previously mentioned approaches: the ranking SVM (see (3)) [7], the temporal smoothness constraint (see (4)) [6], and the $\ell_1$ norm from sparse methods (see (1) and (2)) [3, 9]:

$$\min_{w} \; \frac{1}{2}\lVert w \rVert_2^2 + C_o \sum_{(i,j) \in P} \ell_h\bigl(w^T(x_i - x_j)\bigr) + C_s \sum_{(t, t+1) \in D} \left( \frac{w^T x_{t+1}^{s} - w^T x_{t}^{s}}{\tau_{t+1}^{s} - \tau_{t}^{s}} \right)^2 + \lambda \lVert w \rVert_1. \tag{5}$$

In fact, since the model imposes both $\ell_1$ and $\ell_2$ norms on the feature weight vector $w$, it resembles the elastic net regularization [13], which has the advantage of achieving higher stability with respect to random sampling [14].

The solution $w$ of the optimization objective defined in (5) serves as a sparse linear function that may be applied to measurements from a new patient to obtain a scalar value of severity, which can be compared to previously assessed cases and inform further actions. The sparse vector $w$ may also serve as an indicator of which features are the most influential for pairwise comparison. The formulation contains two nonsmooth terms, the $\ell_1$ norm and the Hinge loss, and therefore it is not directly solvable using off-the-shelf gradient methods. In the DSSL formulation, the (nondifferentiable) Hinge loss is approximated with the twice differentiable Huber loss, thus making the optimization criterion solvable using second-order gradient methods (e.g., Newton and Quasi-Newton). In order to provide an efficient solution for the proposed nonsmooth objective, we will solve the smooth dual problem instead of relying on a smooth approximation or nonsmooth optimization tools.

First we rewrite (5) into a form more suitable for deriving the smooth dual problem. We aggregate the differences of measurements into a single data matrix $A \in \mathbb{R}^{p \times d}$, with rows $a_i = x_i - x_j$, where $p$ is the number of pairs in the comparison set $P$. Similarly, we express the measurement and temporal difference ratios as a matrix $B \in \mathbb{R}^{q \times d}$, whose rows are $b_k = (x_{t+1}^{s} - x_{t}^{s})/(\tau_{t+1}^{s} - \tau_{t}^{s})$, where $q$ is the number of pairs in the consecutive measurements set $D$. We aggregate the $\ell_2$ norm and temporal smoothness terms (they essentially weight the square of the optimization parameters) into a single weighted quadratic term $\frac{1}{2} w^T Q w$, where $Q = I_d + 2 C_s B^T B$, $I_d$ being the $d$-dimensional identity matrix. The first two terms, the weighted quadratic norm and the Hinge loss, resemble the well-known SVM criterion function, which we rewrite in its “soft" form with additional slack variables $\xi_i$ and their associated constraints. An additional set of “dummy variables" $z$ is introduced in the $\ell_1$ term, with the trivial constraint $z = w$. The rewritten SLDSS now reads

$$\begin{aligned} \min_{w,\, \xi,\, z} \quad & \frac{1}{2} w^T Q w + C_o \sum_{i=1}^{p} \xi_i + \lambda \lVert z \rVert_1 \\ \text{s.t.} \quad & a_i^T w \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, p, \\ & z = w. \end{aligned} \tag{6}$$
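The aggregation step above can be sketched in Python as follows, assuming a (hypothetical) data layout of a measurement matrix plus per-sample subject ids and time stamps, and using the names `A`, `B`, and `Q` for the aggregated quantities.

```python
import numpy as np

def build_matrices(X, pairs, subjects, times, C_s):
    """Build A (pairwise measurement differences), B (temporal difference
    ratios), and the weighting matrix Q = I + 2*C_s*B^T B.

    X        : (n, d) measurement matrix
    pairs    : list of (i, j) with sample i more severe than sample j
    subjects : length-n array of subject ids
    times    : length-n array of measurement times
    """
    n, d = X.shape
    A = np.array([X[i] - X[j] for i, j in pairs])      # rows a = x_i - x_j
    rows = []
    for s in np.unique(subjects):
        idx = np.where(subjects == s)[0]
        idx = idx[np.argsort(times[idx])]              # chronological order
        for t0, t1 in zip(idx[:-1], idx[1:]):          # consecutive measurements
            rows.append((X[t1] - X[t0]) / (times[t1] - times[t0]))
    B = np.array(rows)
    Q = np.eye(d) + 2.0 * C_s * (B.T @ B)
    return A, B, Q
```

`Q` is symmetric positive definite by construction (identity plus a Gram matrix), so it is invertible, which the dual derivation below relies on.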

Now we turn this constrained problem with inequalities and equalities into its Lagrangian dual. The constraints are moved into the criterion function as penalty terms weighted by the Lagrange multipliers $\alpha \ge 0$, $\beta \ge 0$, and $\gamma$. The SLDSS dual problem is

$$\max_{\alpha \ge 0,\, \beta \ge 0,\, \gamma} \; \min_{w,\, \xi,\, z} \; \frac{1}{2} w^T Q w + C_o \mathbf{1}^T \xi + \lambda \lVert z \rVert_1 - \sum_{i=1}^{p} \alpha_i \bigl(a_i^T w - 1 + \xi_i\bigr) - \beta^T \xi + \gamma^T (w - z). \tag{7}$$

Given that the optimization criterion is convex and the problem is feasible (Slater’s condition holds [15]), strong duality allows switching the order of maximization and minimization in (7), and the minimization in the primal variables can safely be performed first. We now analyze the expression with respect to the primal variables $w$, $\xi$, and $z$ and find the minimizing conditions for each of them.

The Lagrangian is a quadratic function of the parameters $w$, and we can find its optimal form as a function of the new free parameters introduced in the dual (by equating its gradient with zero):

$$w = Q^{-1}\bigl(A^T \alpha - \gamma\bigr). \tag{8}$$

Similarly, the expression in the slack variables $\xi$ is a linear combination of the dual variables, and it is minimized when its gradient is equated with the zero vector, giving the optimality condition in the form of an equality constraint:

$$C_o \mathbf{1} - \alpha - \beta = 0. \tag{9}$$

The resulting equality constraint, in combination with the inequalities $\alpha \ge 0$ and $\beta \ge 0$, can be reduced to the single constraint $0 \le \alpha \le C_o \mathbf{1}$, which removes $\beta$ from further consideration.

For the minimization over the dummy variables $z$, we use the convex (Fenchel) conjugate of the expression $\lambda \lVert z \rVert_1 - \gamma^T z$ [15] and obtain the optimality condition as an inequality constraint on the infinity norm of the dual variable:

$$\lVert \gamma \rVert_\infty \le \lambda. \tag{10}$$
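The conjugacy argument can be checked numerically: the inner minimum of $\lambda \lVert z \rVert_1 - \gamma^T z$ over $z$ is zero when $\lVert \gamma \rVert_\infty \le \lambda$ and unbounded below otherwise. A small, purely illustrative Python check follows (the coarse finite grid stands in for the unbounded domain, so "unbounded" shows up as "very negative").

```python
import numpy as np

def inner_min(gamma, lam, grid=np.linspace(-100.0, 100.0, 20001)):
    """Numerically minimize lam*||z||_1 - gamma^T z over a finite grid.
    The objective is separable, so each coordinate is minimized independently."""
    return sum(np.min(lam * np.abs(grid) - g * grid) for g in gamma)
```

When every $|\gamma_k| \le \lambda$ the minimum sits at $z = 0$ with value 0; when some $|\gamma_k| > \lambda$ the objective decreases without bound along that coordinate.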

When the optimal (minimizing) conditions (see (8), (9), and (10)) are substituted into the dual formulation (7), it becomes

$$\max_{\alpha,\, \gamma} \; \mathbf{1}^T \alpha - \frac{1}{2} \bigl(A^T \alpha - \gamma\bigr)^T Q^{-1} \bigl(A^T \alpha - \gamma\bigr), \quad \text{s.t. } 0 \le \alpha \le C_o \mathbf{1}, \; \lVert \gamma \rVert_\infty \le \lambda. \tag{11}$$

After negating (11) to turn it into a minimization problem and simplifying the expression, the final problem formulation is

$$\min_{\alpha,\, \gamma} \; \frac{1}{2} \bigl(A^T \alpha - \gamma\bigr)^T Q^{-1} \bigl(A^T \alpha - \gamma\bigr) - \mathbf{1}^T \alpha, \quad \text{s.t. } 0 \le \alpha \le C_o \mathbf{1}, \; \lVert \gamma \rVert_\infty \le \lambda. \tag{12}$$

The original nonsmooth problem is thus turned into a smooth dual problem, which can be solved for its two sets of parameters $\alpha$ and $\gamma$. Since strong duality holds, a solution to the dual is a solution to the original problem, and the optimal weight vector $w$ can be retrieved by plugging the solution of the dual, $\alpha$ and $\gamma$, into (8).

A similar dual formulation, just without the dummy variables and the associated multipliers $\gamma$, might be used for DSSL with the exact Hinge loss, instead of the originally proposed DSSL, which uses the Huber loss approximation [6].

##### 2.3. Optimization Algorithm

The differentiable dual from (12) is, in fact, a quadratic optimization problem with box constraints. With the stacked variable $u = [\alpha; \gamma]$, it reads

$$\min_{u} \; \frac{1}{2} u^T H u - c^T u, \quad H = \begin{bmatrix} A Q^{-1} A^T & -A Q^{-1} \\ -Q^{-1} A^T & Q^{-1} \end{bmatrix}, \quad c = \begin{bmatrix} \mathbf{1} \\ 0 \end{bmatrix}, \quad 0 \le \alpha \le C_o \mathbf{1}, \; -\lambda \mathbf{1} \le \gamma \le \lambda \mathbf{1}. \tag{13}$$

There are ready-to-use tools for solving the problem in (13); we utilized the built-in Matlab “quadprog" solver, which is implemented as a projection method with an active set.
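Since the constraints in (13) are simple boxes, the problem can also be solved without a QP package. The following Python sketch uses projected gradient descent on the dual and recovers the primal weights via the stationarity condition; it is an illustrative stand-in for the Matlab `quadprog` solver, with `C` playing the role of the hinge-cost hyperparameter and `lam` the $\ell_1$ weight.

```python
import numpy as np

def solve_sldss_dual(A, Q, C, lam, n_iter=5000):
    """Projected-gradient solver for the box-constrained dual QP.
    Each step follows the negative dual gradient, then clips back into
    the box 0 <= alpha <= C, -lam <= gamma <= lam."""
    p, d = A.shape
    Qinv = np.linalg.inv(Q)
    # crude Lipschitz bound for the dual gradient: ||[A, -I]||^2 * ||Q^-1||
    L = (np.linalg.norm(A, 2) ** 2 + 1.0) * np.linalg.norm(Qinv, 2)
    step = 1.0 / L
    alpha = np.zeros(p)
    gamma = np.zeros(d)
    for _ in range(n_iter):
        v = A.T @ alpha - gamma
        Qv = Qinv @ v
        alpha = np.clip(alpha - step * (A @ Qv - 1.0), 0.0, C)  # box on alpha
        gamma = np.clip(gamma + step * Qv, -lam, lam)           # inf-norm ball
    return Qinv @ (A.T @ alpha - gamma)                         # primal recovery
```

On a toy problem whose pairwise differences lie along one coordinate, the recovered `w` is nonzero only in that coordinate, matching the sparsifying effect of the $\gamma$ box.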

#### 3. Results

##### 3.1. Severity Score Characterization on Synthetic Data

For the initial assessment of the proposed framework, we have generated a synthetic example with properties that motivated the approach. If a large number of variables are measured, many are expected to be irrelevant for the assessment of severity.

We defined the severity score as a linear combination of the intensities of the first 10 features out of a total set of 100. In addition, we set the coefficients to have different magnitudes, as the contributions of different variables are expected to vary (Figure 1(a)). The remaining ninety features do not affect the severity score at all; they are irrelevant and only introduce uncertainty into the problem. For training purposes, the values of all features were randomly sampled from a uniform distribution for 10 fictitious subjects with 10 measurements each. Severity scores were assigned by the linear function with the weights depicted in Figure 1. Comparison labels (pairs) were generated as all possible pairs in which the first element (sample) has a substantially higher severity score than the second element. This requirement of a substantial severity gap between pair members serves to mimic the case where a doctor could claim, with high confidence, that one patient is in a more severe condition than another. The generated training data was used to fit the Sparse LDSS, the (dense) DSSL, and a DSSL model trained on exactly the relevant features, which we named Ideal DSSL in Table 1.
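A data-generation protocol of this kind can be sketched as follows; the random seed, the exact coefficient magnitudes, and the severity-gap threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

d, n_relevant = 100, 10
n_subjects, n_visits = 10, 10
n = n_subjects * n_visits

# true scoring weights: 10 nonzero coefficients of varying magnitude, 90 zeros
w_true = np.zeros(d)
w_true[:n_relevant] = np.linspace(1.0, 0.1, n_relevant)

# uniformly sampled feature values and the resulting "true" severity scores
X = rng.uniform(0.0, 1.0, size=(n, d))
score = X @ w_true

# comparison pairs (i, j): keep only pairs with a substantial severity gap,
# mimicking comparisons a clinician could make with high confidence
gap = 0.5  # illustrative threshold
pairs = [(i, j) for i in range(n) for j in range(n)
         if score[i] - score[j] > gap]
```

The resulting `pairs` list is the comparison set $P$; a severity-score learner never sees `score` directly, only the pairwise orderings.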