Computational and Mathematical Methods in Medicine

Volume 2017 (2017), Article ID 5271091, 17 pages

https://doi.org/10.1155/2017/5271091

## Nonparametric Subgroup Identification by PRIM and CART: A Simulation and Application Study

Institute of Medical Statistics and Epidemiology, Technische Universität München, Ismaninger Str. 22, 81675 Munich, Germany

Correspondence should be addressed to Armin Ott

Received 25 January 2017; Accepted 2 April 2017; Published 22 May 2017

Academic Editor: Olaf Gefeller

Copyright © 2017 Armin Ott and Alexander Hapfelmeier. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Two nonparametric methods for the identification of subgroups with outstanding outcome values are described and compared to each other in a simulation study and an application to clinical data. The Patient Rule Induction Method (PRIM) searches for box-shaped areas in the given data which exceed a minimal size and average outcome. This is achieved via a combination of iterative peeling and pasting steps, where small fractions of the data are removed or added to the current box. As an alternative, Classification and Regression Trees (CART) prediction models perform sequential binary splits of the data to produce subsets which can be interpreted as subgroups of heterogeneous outcome. PRIM and CART were compared in a simulation study to investigate their strengths and weaknesses under various data settings, taking different performance measures into account. PRIM was shown to be superior in rather complex settings such as those with few observations, a smaller signal-to-noise ratio, and more than one subgroup. CART showed the best performance in simpler situations. A practical application of the two methods was illustrated using a clinical data set. For this application, both methods produced similar results but the higher amount of user involvement of PRIM became apparent. PRIM can be flexibly tuned by the user, whereas CART, although simpler to implement, is rather static.

#### 1. Introduction

Subgroup identification, especially in high-dimensional data situations, is a common problem. The aim is to find subsets of the whole data set defined by covariates in which the outcome of interest is distributed differently than in other regions. Especially in the medical domain, there are many possibilities for applications of methods that address this problem. For example, in the context of personalized medicine, subgroup identification can be of interest if a treatment effect is enhanced or reduced for groups of patients defined by the baseline covariates (cf. [1, 2]) or it may be desirable to find subgroups of patients with a high risk of mortality (cf. [3]). In addition to applications in medicine, there are also other fields in which such methods are useful such as industrial process control (cf. [4]).

The Patient Rule Induction Method (PRIM) and Classification and Regression Trees (CART) are two popular nonparametric methods for subgroup identification. They employ two different strategies, both of which are described in this paper; PRIM, being less commonly used, is explained in more detail. It formulates the research question as an optimization problem in which some target function has to be maximized or minimized. A simple solution is to find specific values or regions for a set of variables (covariates) conditioned on which another variable (outcome) takes extreme values. In this way, one tries to identify subgroups in the whole data set in which the mean outcome (or another criterion) is high or low. By contrast, CART provides an empirical description of the conditional distribution of an outcome as it splits the data into disjoint subsets. Some of these subsets may represent subgroups of interest for a focused research question. To assess the performance of PRIM and CART in subgroup identification, they were compared in different data settings in a simulation study and in an application to clinical data. The corresponding R code is given in Appendices C–G of the Supplementary Material available online at https://doi.org/10.1155/2017/5271091.

#### 2. The Patient Rule Induction Method (PRIM)

A PRIM model consists of boxes that define subsets (subgroups) with extreme outcome values. Boxes are defined by lower and upper threshold values for continuous covariates and subsets of the levels of categorical covariates. They are mainly characterized by their “target” and “support,” with the former being the result of the target function evaluated within the box and the latter describing the proportion of observations lying inside the box. Later in this section it will be shown that there is always a trade-off between those two values. A combination of two algorithms called “peeling” and “pasting” is used to fit the model in an iterative way (cf. [5, 6]).
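To make the two characteristics concrete, the following minimal Python sketch computes the support and the target (here the arithmetic mean of the outcome) of a given box. The dictionary representation of a box and the names `in_box` and `support_and_target` are illustrative assumptions and are not taken from the authors' R code; categorical covariates are omitted for brevity.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical box representation: covariate index -> (lower, upper) threshold.
Box = Dict[int, Tuple[float, float]]


def in_box(x: Sequence[float], box: Box) -> bool:
    """True if observation x satisfies every threshold of the box."""
    return all(lo <= x[j] <= hi for j, (lo, hi) in box.items())


def support_and_target(
    X: List[Sequence[float]], y: List[float], box: Box
) -> Tuple[float, float]:
    """Return (support, target) of the box.

    support: proportion of all observations lying inside the box
    target:  mean outcome of the observations inside the box
    """
    inside = [yi for xi, yi in zip(X, y) if in_box(xi, box)]
    support = len(inside) / len(y)
    target = sum(inside) / len(inside) if inside else float("nan")
    return support, target
```

Shrinking a box typically raises the target while lowering the support, which is the trade-off referred to above.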

##### 2.1. Peeling

The main component of PRIM is the so-called top-down peeling. This iterative algorithm starts with a large box that contains all observations of a data set. Within every peeling step, small fractions (subboxes) are removed (peeled) from the margins of the current box, one at a time. Out of all these possible subboxes, the one which maximizes the target function on the remaining observations in the box is chosen for removal. If the goal is to minimize the target function, the algorithm acts the same way after multiplying the outcome by $-1$ at the beginning, so that the minimization problem is transformed into a maximization problem.

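For most applications, the arithmetic mean is a useful choice for the target function:

$$\bar{y}_{B_{k+1}} = \frac{1}{n_{B_{k+1}}} \sum_{i:\, x_i \in B_{k+1}} y_i.$$

Here, $n_{B_{k+1}}$ is the number of observations in the box

$$B_{k+1} = B_k \setminus b_k^{*},$$

which results from the $k$th iterative step after a subbox $b_k^{*}$ is chosen for removal out of the class $C(b)$ of all possible subboxes such that

$$b_k^{*} = \arg\max_{b \in C(b)} \bar{y}_{B_k \setminus b}.$$

In cases with only continuous covariates $x_1, \ldots, x_p$, the set of possible subboxes is composed as follows:

$$C(b) = \{ b_{j-}, b_{j+} \mid j = 1, \ldots, p \},$$

with

$$b_{j-} = \{ x \mid x_j < x_{j(\alpha)} \}, \qquad b_{j+} = \{ x \mid x_j > x_{j(1-\alpha)} \},$$

where $x_{j(\alpha)}$ describes the $\alpha$-quantile of the observations of variable $x_j$ which lie in the current box $B_k$.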

Therefore, observations below the $\alpha$-quantile or above the $(1-\alpha)$-quantile are peeled off, and $\alpha$ can be seen as a metaparameter which is able to influence the result. Usually one chooses small values (0.05–0.1), which introduce the “patience” to the algorithm. The value of $\alpha$ should be small enough that a potentially suboptimal step does not have too much impact on the result, but also not too small, because otherwise the boxes would depend strongly on the random variability in the data.
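A single peeling step for continuous covariates can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `quantile` and `peel_once` are hypothetical, the quantile is computed by linear interpolation between order statistics, and `alpha` denotes the peeling fraction discussed above.

```python
from typing import List, Optional, Sequence, Tuple


def quantile(values: Sequence[float], q: float) -> float:
    """Empirical q-quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def peel_once(
    X: List[Sequence[float]], y: List[float], alpha: float
) -> Optional[Tuple[float, Tuple[int, str, float], List[int]]]:
    """One peeling step on the observations currently inside the box.

    Returns (mean of remaining outcomes, (covariate index, side, cut point),
    indices of remaining observations), or None if no candidate removes anything.
    """
    n, p = len(X), len(X[0])
    best = None
    for j in range(p):
        col = [x[j] for x in X]
        q_lo, q_hi = quantile(col, alpha), quantile(col, 1 - alpha)
        candidates = [
            # Peel observations below the alpha-quantile of covariate j.
            ((j, "lower", q_lo), [i for i in range(n) if col[i] >= q_lo]),
            # Peel observations above the (1 - alpha)-quantile of covariate j.
            ((j, "upper", q_hi), [i for i in range(n) if col[i] <= q_hi]),
        ]
        for desc, keep in candidates:
            if not keep or len(keep) == n:
                continue  # skip candidates that remove everything or nothing
            mean_rest = sum(y[i] for i in keep) / len(keep)
            if best is None or mean_rest > best[0]:
                best = (mean_rest, desc, keep)
    return best
```

Each covariate contributes two candidate subboxes, so a step compares at most twice as many candidates as there are covariates and keeps the one with the highest mean outcome among the remaining observations.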

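The peeling procedure is repeated until the support of the current box falls below some threshold $\beta_0$, such that

$$\beta_{B_k} = \frac{1}{n} \sum_{i=1}^{n} I(x_i \in B_k) < \beta_0,$$

where $n$ is the total number of observations and $I(\cdot)$ denotes the indicator function, which returns the value 1 if the condition in brackets is true and 0 otherwise.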
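As a rough guide to how many peeling steps this stopping rule allows: if every step removes approximately a fraction $\alpha$ of the observations remaining in the box (which holds approximately for continuous covariates without ties), the support after $k$ steps is about $(1-\alpha)^k$. The short Python sketch below computes the largest $k$ for which the support stays at or above the minimum support; the function name `max_peeling_steps` is hypothetical.

```python
import math


def max_peeling_steps(alpha: float, beta0: float) -> int:
    """Largest k with (1 - alpha)**k >= beta0 under the idealised decay
    of the support by a factor (1 - alpha) per peeling step."""
    return math.floor(math.log(beta0) / math.log(1 - alpha))
```

With a peeling fraction of 0.05 and a minimum support of 0.1, for instance, this approximation allows 44 steps, which illustrates why small peeling fractions lead to long sequences of boxes.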

The minimum support $\beta_0$ is another metaparameter which has to be determined by the user. The choice of this parameter depends on the analytic aims, but it should not be chosen too small, because very small boxes depend strongly on the random noise in the data. Such a result would be very sensitive to small changes in the data set and prone to overfitting.
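The complete peeling sequence can be sketched in Python as shown below. This is a simplified illustration and not the authors' R code: the function names are hypothetical, only continuous covariates are handled, and the sketch assumes that a peel is rejected when it would push the support below the minimum support `beta0`, so that the final box retains at least that proportion of the data.

```python
import random
from typing import List, Sequence, Tuple


def quantile(values: Sequence[float], q: float) -> float:
    """Empirical q-quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def peel(
    X: List[Sequence[float]], y: List[float], alpha: float = 0.05, beta0: float = 0.1
) -> Tuple[List[List[float]], List[Tuple[float, float]]]:
    """Top-down peeling; returns the final box bounds and the (support, mean) trajectory."""
    n, p = len(X), len(X[0])
    inside = list(range(n))  # indices of observations in the current box
    bounds = [[min(x[j] for x in X), max(x[j] for x in X)] for j in range(p)]
    trajectory = [(1.0, sum(y) / n)]
    while True:
        best = None
        for j in range(p):
            col = [X[i][j] for i in inside]
            for side, cut in ((0, quantile(col, alpha)), (1, quantile(col, 1 - alpha))):
                if side == 0:  # peel below the alpha-quantile
                    keep = [i for i in inside if X[i][j] >= cut]
                else:          # peel above the (1 - alpha)-quantile
                    keep = [i for i in inside if X[i][j] <= cut]
                # Reject no-op peels and peels that violate the minimum support.
                if len(keep) == len(inside) or len(keep) / n < beta0:
                    continue
                mean_rest = sum(y[i] for i in keep) / len(keep)
                if best is None or mean_rest > best[0]:
                    best = (mean_rest, j, side, cut, keep)
        if best is None:
            break  # no admissible peel is left
        mean_rest, j, side, cut, inside = best
        bounds[j][side] = cut
        trajectory.append((len(inside) / n, mean_rest))
    return bounds, trajectory


if __name__ == "__main__":
    rng = random.Random(0)  # artificial data, for illustration only
    X = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(400)]
    y = [1 if (2 <= a <= 8 and -6 <= b <= 0) else 0 for a, b in X]
    bounds, trajectory = peel(X, y, alpha=0.1, beta0=0.1)
    print(bounds, trajectory[-1])
```

The returned trajectory lists the support and mean outcome of every box in the sequence, which makes the trade-off between the two quantities visible and lets the user inspect the effect of different choices of the minimum support.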

*Example 1.* A simple example of the peeling algorithm and the resulting sequence of boxes is illustrated in Figure 1. Here we have a binary outcome $y$ and two metric covariates $x_1$ and $x_2$, which are sampled from uniform distributions between −10 and 10. There is one obvious box in which the outcome is more frequent; therefore the mean outcome (0/1 coded) is much higher there than for the rest of the data. To improve the appearance, $\alpha$ is chosen very high in this example.
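Data of this kind can be generated with a few lines of Python. The sketch below is only an assumption-laden illustration: the sample size, the location of the box, and the outcome probabilities inside and outside the box are hypothetical values chosen for demonstration and are not the settings used for Figure 1.

```python
import random
from typing import List, Tuple


def simulate_example(
    n: int = 500,
    box: Tuple[Tuple[float, float], Tuple[float, float]] = ((-2.0, 6.0), (-4.0, 3.0)),
    p_in: float = 0.9,
    p_out: float = 0.1,
    seed: int = 1,
) -> Tuple[List[Tuple[float, float]], List[int]]:
    """Two uniform covariates on [-10, 10] and a binary outcome that is
    more frequent inside one rectangular box (all defaults are hypothetical)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = (rng.uniform(-10, 10), rng.uniform(-10, 10))
        inside = all(lo <= v <= hi for v, (lo, hi) in zip(x, box))
        prob = p_in if inside else p_out
        X.append(x)
        y.append(1 if rng.random() < prob else 0)
    return X, y
```

Because the outcome is 0/1 coded, its mean within any region equals the observed event frequency there, so a box with a high event probability shows up as a region of high mean outcome.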

In the upper left panel, only the initial box containing all data points and the four candidate subboxes for the first peeling step are shown. The second and third panels illustrate the first two steps of the algorithm, with the two subboxes $b_1^{*}$ and $b_2^{*}$ peeled off the current box. The fourth panel shows the result of the algorithm, which is continued until the minimum support $\beta_0$ is reached, so the final box contains at least a fraction $\beta_0$ of all observations. It is also evident that the subboxes become smaller with each step, because the $\alpha$- and $(1-\alpha)$-quantiles refer only to the data that are included in the current box. In this case, the final box is determined as