Journal of Quality and Reliability Engineering

Volume 2015, Article ID 795154, 9 pages

http://dx.doi.org/10.1155/2015/795154

## Variable Selection Methods for Right-Censored Time-to-Event Data with High-Dimensional Covariates

Department of Mechanical and Industrial Engineering, Northeastern University, Boston, MA 02115, USA

Received 1 October 2014; Revised 13 April 2015; Accepted 16 April 2015

Academic Editor: Christian Kirchsteiger

Copyright © 2015 Keivan Sadeghzadeh and Nasser Fard. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Advancement in technology has led to greater accessibility of massive and complex data in many fields such as quality and reliability. The proper management and utilization of valuable data could significantly increase knowledge and reduce cost through preventive actions, whereas erroneous and misinterpreted data could lead to poor inference and decision making. On the other hand, it has become more difficult to process streaming high-dimensional time-to-event data with traditional approaches, specifically in the presence of censored observations. This paper presents a multipurpose analytic model and practical nonparametric methods to analyze right-censored time-to-event data with high-dimensional covariates. In order to reduce redundant information and to facilitate practical interpretation, variable inefficiency in failure time is determined for the specific field of application. To investigate the performance of the proposed methods, they are compared with recent relevant approaches through numerical experiments and simulations.

#### 1. Introduction

Time-to-event data such as failure or survival times have been extensively studied in reliability engineering. With the advent of modern data collection technologies, a huge amount of this type of data includes high-dimensional covariates. This massive amount of data is increasingly accessible from various sources such as transaction-based information, information-sensing devices, remote sensing technologies, machines and logistics statistics, wireless sensor networks, and analytics in quality engineering, manufacturing, service operations, and many other segments. Unlike traditional datasets with few explanatory variables, the analysis of datasets with a high number of variables requires different approaches. In this situation, variable selection techniques could be used to determine a subset of variables that are significantly more valuable for analyzing high-dimensional time-to-event datasets. If data is compiled and processed correctly, it can enable informed decision making [1–7].

In many professional areas and activities, as well as manufacturing and services, decision making is increasingly based on the type and size of data, as well as analytic methods, rather than on experience and intuition. It has been suggested that businesses should discover new ways to collect and use data every day and develop the ability to interpret data to increase the performance of their decision makers [8, 9]. As stated in a broad survey [10], advanced analytics are among the most popular techniques used in high-dimensional and massive data analysis and in the decision making process. As an analytical approach, decision making is the process of finding the best option from all feasible alternatives [11, 12].

The opportunity for manufacturing and services in the era of data is to analyze their performance to enhance the quality. Quality dimensions of products and services, defined by quality experts [13, 14] or perceived by customers [15], are summarized as performance, availability, reliability, maintainability, durability, serviceability, conformance, warranty, and aesthetics and reputation. Access to valuable data for sophisticated analytics can substantially improve the management decision making process. In reliability analysis, failure time is determined by the variables contributing to products’ failure time. Complex, high-dimensional, and censored time-to-event data provide an excellent chance for manufacturers to reduce costs, improve efficiency, and ultimately improve the quality of their products by detecting failure causes faster [16, 17].

Most of the traditional variable selection methods, such as those based on the Akaike information criterion (AIC) [18] or the Bayesian information criterion (BIC) [19], involve computational algorithms that are nondeterministic polynomial-time hard (NP-hard), and their computational cost makes these procedures infeasible for high-dimensional data. Also, recently developed methods for identifying variable efficiency may operate faster, but their robustness is not consistent. These methods involve different estimation methods and assumptions such as the Cox proportional hazard model [20], accelerated failure time [21], Buckley-James estimator [22], random survival forests [23], additive risk models [24], weighted least squares [25], or classification and regression tree (CART) [26]. Due to the presence of censoring, analyzing time-to-event data with high-dimensional covariates and recognizing efficient covariates in terms of predictive power of survival is more challenging [24, 25]. This study is motivated by the importance of the aforementioned variable selection issue.

The objective of this study is to propose a combinational methodology for the variable reduction via determining variable inefficiency in right-censored high-dimensional time-to-event data. The aim of the proposed logical analytic model as well as methods and algorithms is also to reduce the volume of the failure time data and to identify a set of the most influential variables on failure time. Variable efficiency refers to the effect of a variable on failure or survival time in a right-censored time-to-event dataset with high-dimensional covariates. This paper presents two multipurpose nonparametric methods to analyze the aforementioned class of data.

The concept of time-to-event analysis and the commonly used and relevant data mining tools and techniques are presented in Section 2. The logical model for the transformation of the explanatory variable dataset to reach the logical representation of the original covariate dataset as a sort of binary variables is defined in Section 3, where each variable is represented by a Boolean vector to verify and to prove the sustainability of the transformation. Section 4 presents hybrid nonparametric variable selection methods and algorithms through variable efficiency. The validity of proposed methods is verified by results obtained in comparison to those from well-known methods through simulation patterns and by using different collected and simulated time-to-event datasets. The performance and verification of the proposed methods are presented in Section 5. Concluding remarks, including the advantages of the proposed methods, are discussed in Section 6.

The dynamic change of variables in time-dependent explanatory data streaming is of interest in the complementary level of this research. The computer software used in this research is the MATLAB R2011b programming environment.

#### 2. Basic Definitions

In this section, an applied introduction of time-to-event data analysis and a brief review of prominent data mining tools and techniques relevant to this study are presented.

##### 2.1. Time-to-Event Data Analysis

Time-to-event data analysis methods consider the time until the occurrence of an event. This time can be measured in any unit such as days, weeks, or years, and this analysis is widely used in reliability engineering. In time-to-event data, subjects are usually followed over a specified time period. The study of time-to-event data focuses on predicting the probability of survival or failure. Examples of time-to-event data are the lifetimes of mechanical devices, electronic components, or complex systems [21, 27, 28].

Standard regression models do not perform effectively in the presence of censored observations [27, 29]. Censored data occurs when the information about the event time is incomplete or missing for any reason. Right censoring occurs when a test subject does not remain under the test for a full test period or until it fails. In this paper, we focus on time-to-event data with this type of censoring.

Available methods to analyze time-to-event data and to find a relationship between survival time and other variables can be categorized as parametric, semiparametric, and nonparametric methods. Parametric survival analysis is based on survival function distributions such as the exponential distribution. Semiparametric models do not assume knowledge of absolute risk; they estimate relative risk rather than absolute risk, under what is called the proportional hazards assumption. In this category, the Cox proportional hazards regression analysis is by far the most popular model for survival data analysis. For moderate- to high-dimensional covariates, it is difficult to apply semiparametric methods [25]. Nonparametric methods, which are useful when the underlying distribution of the problem is unknown, do not require statistical assumptions. These methods are commonly used to describe survivorship of a study population or compare two or more study populations. The Kaplan-Meier product limit estimate is a commonly used nonparametric method in estimating the survival function. This estimator has clear advantages since it does not require an approximation of the follow-up time assumption [27, 30].

The probability density of the failure time $T$ at time $t$ is

$$f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t}.$$

In time-to-event or survival analysis, the information on an event status and follow-up time are used to estimate a survival function, $S(t)$, which is defined as the probability that an object survives at least until time $t$:

$$S(t) = P(T > t).$$

From the definition of the cumulative distribution function (or failure function) $F(t)$,

$$S(t) = 1 - P(T \le t) = 1 - F(t).$$

Accordingly, the survival function is calculated by a probability density function as

$$S(t) = \int_{t}^{\infty} f(u)\, du.$$

In most applications, the survival function is shown as a step function rather than a smooth curve. The nonparametric estimate of $S(t)$ according to the Kaplan-Meier (KM) estimator for distinct ordered event times $t_1$ to $t_m$ is as follows:

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right),$$

where at each event time $t_i$ there are $n_i$ subjects at risk and $d_i$ is the number of subjects which experienced the event, for example, failed. Let $c_i$ denote the number of subjects censored between $t_i$ and $t_{i+1}$. Then the likelihood function takes the following form:

$$L = \prod_{i=1}^{m} \left[S(t_{i-1}) - S(t_i)\right]^{d_i} S(t_i)^{c_i}, \qquad S(t_0) = 1.$$

For the conditional probability of surviving, if we define $p_i = S(t_i)/S(t_{i-1})$, then the maximum-likelihood estimation of $p_i$ is as follows:

$$\hat{p}_i = \frac{n_i - d_i}{n_i} = 1 - \frac{d_i}{n_i}.$$

Graphically, the Kaplan-Meier estimate is a step function with discontinuities that drops at observed failure times. It has been shown [31] that the KM estimator is consistent. The completely nonparametric nature of this estimator assures little or no loss in efficiency. A quick review of commonly used data mining tools and techniques in this study is presented next.
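To make the product-limit estimate concrete, the following is a minimal pure-Python sketch rather than the authors' MATLAB implementation. The function name `kaplan_meier` and the sample data are illustrative assumptions; the code multiplies the conditional survival fractions (at risk minus failed, over at risk) across the distinct observed failure times.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t_i, S_hat(t_i))] at each distinct observed failure time.

    times  -- observed follow-up times
    events -- 1 if the failure was observed, 0 if right-censored
    """
    # d_i: number of observed failures at each distinct failure time
    failures = Counter(t for t, e in zip(times, events) if e == 1)
    curve = []
    survival = 1.0
    for t in sorted(failures):
        at_risk = sum(1 for u in times if u >= t)  # n_i: subjects still at risk
        survival *= 1.0 - failures[t] / at_risk    # multiply by (1 - d_i / n_i)
        curve.append((t, survival))
    return curve

# Hypothetical sample: seven units, three of them right-censored (event = 0)
times = [2, 3, 3, 5, 8, 8, 9]
events = [1, 1, 0, 1, 1, 0, 0]
km = kaplan_meier(times, events)
for t, s in km:
    print(t, round(s, 4))
```

Censored units never trigger a drop in the curve, but they do count in the at-risk set until their censoring time, which is how the estimator uses their partial information.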

##### 2.2. Data Mining Tools and Techniques

To analyze time-to-event data, when the size and dimensions are large, advanced analytics are advantageous. Data reduction techniques are categorized into three main strategies: dimensionality reduction, numerical reduction, and data compression [32, 33]. Dimensionality reduction is the most efficient strategy for large-scale data; it deals with reducing the number of random variables or attributes under the special circumstances of the problem. Dimensionality reduction methods are mainly wavelet transformations and principal components analysis (PCA) [34, 35]. The transformation and projection of the original data eliminate a subset of the original data in terms of the variables’ covariance.

All dimensionality reduction techniques are also classified as feature extraction and feature selection approaches. Feature extraction is defined as transforming the original data into a new lower dimensional space through some functional mapping such as PCA and SVD [36, 37]. Most unsupervised dimensionality reduction techniques are closely related to PCA, which is one of the oldest and most well-known multivariate analysis techniques, but this technique is not applicable to large complex datasets [38]. Feature selection denotes selecting a subset of the original data (features) without a transformation in order to filter out irrelevant or redundant features, using approaches such as filter methods, wrapper methods, and embedded methods [39, 40]. The next section presents a proposed analytic logical model for the transformation of the explanatory variable dataset of time-to-event data.

#### 3. Proposed Analytic Model

The direction for developing an analytic model and analyzing datasets depends on the type and size of the data. A multipurpose, flexible, and innovative model for a type of right-censored time-to-event data with a large number of variables when the correlation between variables is complicated or unknown provides the motivation to find an applicable solution for this type of data. For such data, we propose a model to simplify the original covariate dataset into a logical dataset by transformation lemma. In order to select the most significant variables in terms of efficiency, variable reduction methods and clustering algorithms are proposed in Section 4. The analytic model [41] and its following methods and algorithms are potentially applicable solutions for many problems in a vast area of science and technology.

The original right-censored high-dimensional time-to-event dataset may include any type of explanatory data, such as binary, continuous, categorical, or ordinal data. The concept of this proposed analytic model is that many variables are either binary or interchangeable with a binary variable, such as dichotomous and Bernoulli variables. Also, the interpretation of a binary variable is simple and comprehensible. In addition, the model is appropriate for fast and low-cost calculation, which makes time-dependent analysis with data streaming possible.

The random variables $Y$ and $C$ represent the time-to-event and censoring time, respectively. Time-to-event data is represented by $(T, \Delta, U)$, where the observed time is $T = \min(Y, C)$ and the censoring indicator is $\Delta = 1$ if the event occurred (for instance, failure is observed) and $\Delta = 0$ otherwise. The observed covariate $U$ represents a set of variables. Let any observation be denoted by $(t_i, \delta_i, u_{ij})$, $i = 1, \ldots, n$, $j = 1, \ldots, p$. It is assumed that the hazard at time $t$ depends only on the survivals at time $t$, which assures the independent right censoring assumption [21].
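As a small illustration of this data representation, the sketch below builds right-censored observations from latent event times and censoring times: the recorded time is the smaller of the two, and the indicator is 1 only when the event is actually seen. The function name `observe` and the numeric values are hypothetical and serve only as an example.

```python
def observe(event_times, censor_times):
    """Pair each unit's observed time with its censoring indicator."""
    observed = []
    for y, c in zip(event_times, censor_times):
        t = min(y, c)               # observed time: event or censoring, whichever comes first
        delta = 1 if y <= c else 0  # 1 = failure observed, 0 = right-censored
        observed.append((t, delta))
    return observed

# Hypothetical latent failure times and censoring times for four units
latent_failure = [4.0, 7.5, 2.0, 9.0]
censoring = [6.0, 5.0, 3.0, 9.5]
data = observe(latent_failure, censoring)
print(data)
```

Only the second unit is censored here, because its censoring time precedes its latent failure time.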

In order to simplify the original complex high-dimensional time-to-event dataset, we propose a transformed logical model. Based on the concept of this model, for any $n$-by-$p$ dataset matrix $U = [u_{ij}]$, there are $n$ independent observations and $p$ variables. Each array of the $p$ variable vectors will take only two possible values, canonically 0 and 1. Therefore, it is required to define a Bernoulli criterion to split all arrays as a set of binary outcomes and reach a logical dataset. Transforming numerical attributes to binary variables has been well studied [42]. In order to construct the abovementioned transformed logical time-to-event dataset as a simplified and applied representation of the original one, we define $B_j$ as an initial Bernoulli criterion for the $j$th variable. For any array $u_{ij}$ in the $n$-by-$p$ dataset matrix $U$, assign a substituting array $v_{ij}$ as

$$v_{ij} = \begin{cases} 1, & \text{if } u_{ij} \text{ satisfies the criterion } B_j, \\ 0, & \text{otherwise.} \end{cases}$$

Note that for each of the $p$ variable vectors the criterion could be defined by an expert using experimental or historical data as well (8). The proposed model assumes any array with a value of 1 as desired for an expert and 0 otherwise. In other words, $v_{ij} = 0$ represents the lack of the $j$th variable in the $i$th observation. In this fashion, only desired variables will be considered in each variable vector. The transformed dataset is used in the proposed methods and algorithms.

Therefore, the result of the transformation is an $n$-by-$p$ dataset matrix $V = [v_{ij}]$ which will be used in the following methods and algorithms. Also, we define $T$ as the time-to-event vector including all observed failure times and $S$ as the survival function. The proposed logical model validation and verification of the robustness were presented comprehensively in [41, 43]. The logical model initially could be satisfied by the proper design of the data collection process based on Boolean logic to generate binary attributes.
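The transformation to a logical dataset can be sketched as below. This is only an illustration under stated assumptions: the default criterion (the column mean as a threshold, with values at or above it mapped to 1) is a placeholder chosen for the example, not the criterion defined in the paper, and the name `to_logical` is hypothetical. An expert-supplied list of thresholds can be passed instead.

```python
def to_logical(U, criteria=None):
    """Map an n-by-p numeric dataset to an n-by-p 0/1 dataset.

    criteria -- optional list of p thresholds (for example, set by an expert).
                Assumption for this sketch: defaults to each column's mean.
    """
    n, p = len(U), len(U[0])
    if criteria is None:
        criteria = [sum(row[j] for row in U) / n for j in range(p)]
    # Assumption: a value at or above its threshold is the "desired" state (1)
    return [[1 if row[j] >= criteria[j] else 0 for j in range(p)] for row in U]

# Hypothetical dataset: 4 observations, 3 covariates (the third is already binary)
U = [[0.2, 10, 1],
     [0.8, 30, 0],
     [0.5, 20, 1],
     [0.9, 40, 0]]
V = to_logical(U)                          # data-driven thresholds
V_expert = to_logical(U, [0.5, 35, 1])     # expert-defined thresholds
print(V)
print(V_expert)
```

Because each column is thresholded independently, the transformation is a single pass over the data, which fits the low-cost calculation the model aims for.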

#### 4. Proposed Methods and Heuristic Algorithms

In order to design appropriate methods and algorithms, a test is performed on the efficiency of a cluster of variables as a subdataset of the complete time-to-event covariates dataset. The test is done by comparing this subdataset with the complete transformed logical dataset. A key assumption in this approach is that a variable that is completely inefficient by itself can provide a significant performance improvement when combined with others, and two variables that are inefficient by themselves can be efficient together [40]. By extending these assumptions to a large number of covariates with complex correlation, the effect of an efficient subset of variables on the time-to-event outcome does not differ meaningfully from the effect of the whole body of variables. Therefore, by comparison through a nonparametric test, any subset from the complete dataset could be determined to be an efficient or inefficient selection. Based on these assumptions, we design two methods and hybrid algorithms for the proposed analytic model for selecting inefficient variables in right-censored time-to-event datasets with high-dimensional covariates.

We use the Kaplan-Meier estimator in this study to estimate and graph survival probabilities as a function of time. In addition, a nonparametric method is used to test a null hypothesis of whether two samples are drawn from the same distribution as compared to a given alternative hypothesis. Among many nonparametric tests for comparing survival functions for the aforementioned purpose, we use a log-rank test in our methods as the best fit for the comparison of two nonparametric distributions. This test is the most commonly used one for a typical study under different models for the relationship between the groups [27, 30].
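For readers who want to see the comparison step in code, here is a minimal sketch of the standard two-sample log-rank test in pure Python. It is not the authors' implementation; the function name `logrank_test` and the sample data are assumptions. The p-value uses the fact that a chi-square variable with one degree of freedom is a squared standard normal, so its upper tail equals `erfc(sqrt(x / 2))`.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-sample log-rank test; returns (chi-square statistic, p-value)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    failure_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in failure_times:
        n_a = sum(1 for u in times_a if u >= t)   # at risk in group A
        n_b = sum(1 for u in times_b if u >= t)   # at risk in group B
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e == 1)
        d_b = sum(1 for u, e in zip(times_b, events_b) if u == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n        # observed minus expected failures in A
        if n > 1:                                 # hypergeometric variance term
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))    # chi-square upper tail, 1 d.o.f.
    return chi2, p_value

# Hypothetical groups: early failures versus late failures, no censoring
early = [1, 2, 3, 4, 5]
late = [10, 11, 12, 13, 14]
all_failed = [1, 1, 1, 1, 1]
chi2, p = logrank_test(early, all_failed, late, all_failed)
same_chi2, same_p = logrank_test(early, all_failed, early, all_failed)
print(round(chi2, 2), round(p, 4))
```

A small p-value rejects the hypothesis that the two samples share one survival distribution; identical samples give a statistic of zero.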

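The matrix $V_k$ is constructed by $k$ observation vectors that correspond to each of the $k$ selected variables as an $n$-by-$k$ matrix, a selected subset of $V$. The number of observations in any subset of $V_k$ is denoted by $n_k$, where $n_k \le n$. We also define the vector $T_k$, which is constructed by all nonzero arrays $t_{k_i}$.

The preliminary step for the highest efficiency in the proposed methods is to cluster the variables based on the correlation coefficient matrix of the original dataset, choose a representative variable from each highly correlated cluster, and then eliminate the other variables from the dataset. Let $\operatorname{cov}(j, l)$ denote the covariance of variables $j$ and $l$. The correlation coefficient matrix $R = [r_{jl}]$ is defined as follows:

$$r_{jl} = \frac{\operatorname{cov}(j, l)}{\sqrt{\operatorname{cov}(j, j)\,\operatorname{cov}(l, l)}} = \frac{\sum_{i=1}^{n} (u_{ij} - \bar{u}_j)(u_{il} - \bar{u}_l)}{\sqrt{\sum_{i=1}^{n} (u_{ij} - \bar{u}_j)^2 \sum_{i=1}^{n} (u_{il} - \bar{u}_l)^2}},$$

where $u_{ij}$ and $\bar{u}_j$ represent the value of variable $j$ in observation $i$ and the mean of variable $j$, respectively, and the second parenthesis is defined similarly for variable $l$.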

Suppose, for instance, that applying this lemma to a given dataset identifies three highly correlated variables. Only one of them is selected randomly and the other two are eliminated from the dataset. The outcome of this process ensures that the remaining variables used in the methods and heuristic algorithms are not highly correlated.
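The correlation-based preliminary step could be sketched as follows. The threshold of 0.9 and the function names are assumptions made for this example, and for reproducibility the sketch keeps the first variable encountered in each highly correlated group, whereas the text selects the representative randomly.

```python
import math

def correlation_matrix(U):
    """Pearson correlation coefficients between the columns of U."""
    n, p = len(U), len(U[0])
    means = [sum(row[j] for row in U) / n for j in range(p)]

    def cross(j, l):
        # Sum of centered cross-products; the 1/n factors cancel in the ratio
        return sum((row[j] - means[j]) * (row[l] - means[l]) for row in U)

    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for l in range(p):
            denom = math.sqrt(cross(j, j) * cross(l, l))
            R[j][l] = cross(j, l) / denom if denom > 0 else 0.0
    return R

def drop_correlated(U, threshold=0.9):
    """Return indices of variables kept after removing highly correlated ones."""
    R = correlation_matrix(U)
    kept = []
    for j in range(len(R)):
        # Keep variable j only if it is not highly correlated with a kept one
        if all(abs(R[j][l]) < threshold for l in kept):
            kept.append(j)
    return kept

# Hypothetical data: column 1 is an exact linear function of column 0
U = [[1, 3, 5],
     [2, 5, 1],
     [3, 7, 4],
     [4, 9, 2]]
R = correlation_matrix(U)
kept = drop_correlated(U)
print(kept)
```

Using the absolute value of the coefficient treats strong negative correlation as redundancy too, which is a design choice of this sketch.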

##### 4.1. Method I: Nonparametric Test Score Variable Clustering

The log-rank test score variable selection method is applied for selecting a subset of the best and the worst variables in terms of efficiency. The nonparametric test score (NTS) method is a variable clustering technique which selects a set of $k$ variables from the transformed logical dataset and calculates the score of each variable in two levels. The first level is to determine the priority of the variable efficiency via the scores. The scores are obtained from the frequency of each variable in the rejected subsets from comparison with the original time-to-event vector $T$. We code this level of calculation with the letter F. The second level rates the variables by the cumulative score of each variable from comparisons of selected subsets of all nonparametric test results with the original time-to-event vector $T$. This level acts as a searching procedure to detect the less efficient variables and is denoted by the letter C. The randomization (RN) algorithm randomly chooses a defined subset of $k$ variables from the transformed logical dataset of $p$ variables. We define a randomization dataset matrix in which each row is formed by the variable identification numbers of one selected subset, over all selected subsets. The heuristic algorithm of NTS method level F is as in Algorithm 1.
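The randomization step and the level F frequency count described above can be illustrated with a short sketch. It is not Algorithm 1 itself: the function names, the parameters `k`, `p`, and `n_subsets`, and the `rejected` flags (which would come from the log-rank comparisons) are all hypothetical inputs chosen for this example.

```python
import random

def randomization_matrix(p, k, n_subsets, seed=None):
    """Each row holds the IDs (1..p) of k variables drawn without replacement."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(1, p + 1), k)) for _ in range(n_subsets)]

def frequency_scores(rn_matrix, rejected, p):
    """Count how often each variable appears in subsets flagged as rejected."""
    scores = [0] * p
    for row, is_rejected in zip(rn_matrix, rejected):
        if is_rejected:
            for var_id in row:
                scores[var_id - 1] += 1
    return scores

# Random subsets of 3 variables out of 10 (seeded only for repeatability)
RN = randomization_matrix(p=10, k=3, n_subsets=5, seed=0)

# Tiny fixed example: subsets {1,2} and {1,3} are rejected, {2,3} is not
demo_scores = frequency_scores([[1, 2], [2, 3], [1, 3]], [True, False, True], p=3)
print(demo_scores)
```

In the fixed example, variable 1 appears in both rejected subsets and therefore receives the highest frequency score.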