Abstract

Advances in technology have led to greater accessibility of massive and complex data in many fields, such as quality and reliability. The proper management and utilization of valuable data can significantly increase knowledge and reduce costs through preventive actions, whereas erroneous and misinterpreted data can lead to poor inference and decision making. On the other hand, it has become more difficult to process streaming high-dimensional time-to-event data with traditional approaches, specifically in the presence of censored observations. This paper presents a multipurpose analytic model and practical nonparametric methods to analyze right-censored time-to-event data with high-dimensional covariates. In order to reduce redundant information and to facilitate practical interpretation, variable inefficiency with respect to failure time is determined for the specific field of application. To investigate the performance of the proposed methods, they are compared with recent relevant approaches through numerical experiments and simulations.

1. Introduction

Time-to-event data such as failure or survival times have been extensively studied in reliability engineering. With the advent of modern data collection technologies, a huge amount of this type of data includes high-dimensional covariates. This massive amount of data is increasingly accessible from various sources such as transaction-based information, information-sensing devices, remote sensing technologies, machine and logistics statistics, wireless sensor networks, and analytics in quality engineering, manufacturing, service operations, and many other segments. Unlike traditional datasets with few explanatory variables, the analysis of datasets with a large number of variables requires different approaches. In this situation, variable selection techniques can be used to determine a subset of variables that is significantly more valuable for analyzing high-dimensional time-to-event datasets. If data are compiled and processed correctly, they can enable informed decision making [17].

In many professional areas and activities, including manufacturing and services, decision making is increasingly based on the type and size of data, as well as on analytic methods, rather than on experience and intuition. It has been suggested that businesses should discover new ways to collect and use data every day and develop the ability to interpret data to increase the performance of their decision makers [8, 9]. As stated in a broad survey [10], advanced analytics is among the most popular techniques used in high-dimensional and massive data analysis and in the decision-making process. As an analytical approach, decision making is the process of finding the best option from all feasible alternatives [11, 12].

The opportunity for manufacturing and services in the era of data is to analyze their performance in order to enhance quality. Quality dimensions of products and services, defined by quality experts [13, 14] or perceived by customers [15], are summarized as performance, availability, reliability, maintainability, durability, serviceability, conformance, warranty, and aesthetics and reputation. Access to valuable data for sophisticated analytics can substantially improve the management decision-making process. In reliability analysis, failure time is explained by the variables contributing to it. Complex, high-dimensional, and censored time-to-event data provide an excellent chance for manufacturers to reduce costs, improve efficiency, and ultimately improve the quality of their products by detecting failure causes faster [16, 17].

Most traditional variable selection methods, such as those based on the Akaike information criterion (AIC) [18] or the Bayesian information criterion (BIC) [19], involve computational algorithms in the class of nondeterministic polynomial-time hard (NP-hard) problems, and their computational cost makes these procedures infeasible. Also, more recently developed methods for identifying variable efficiency may operate faster, but their robustness is not consistent. These methods involve different estimation methods and assumptions, such as the Cox proportional hazards model [20], accelerated failure time [21], the Buckley-James estimator [22], random survival forests [23], additive risk models [24], weighted least squares [25], or classification and regression trees (CART) [26]. Due to the presence of censoring, analyzing time-to-event data with high-dimensional covariates and recognizing efficient covariates in terms of the predictive power of survival is more challenging [24, 25]. This study is motivated by the importance of the aforementioned variable selection issue.

The objective of this study is to propose a combined methodology for variable reduction via the determination of variable inefficiency in right-censored high-dimensional time-to-event data. The aim of the proposed logical analytic model, as well as its methods and algorithms, is to reduce the volume of the failure time data and to identify the set of variables most influential on failure time. Variable efficiency refers to the effect of a variable on the failure or survival time in a right-censored time-to-event dataset with high-dimensional covariates. This paper presents two multipurpose nonparametric methods to analyze this class of data.

The concept of time-to-event analysis and the commonly used data mining tools and techniques relevant to this study are presented in Section 2. The logical model for transforming the explanatory variable dataset into a logical representation of the original covariate dataset as a set of binary variables is defined in Section 3, where each variable is represented by a Boolean vector in order to verify and prove the sustainability of the transformation. Section 4 presents hybrid nonparametric variable selection methods and algorithms based on variable efficiency. The validity of the proposed methods is verified by comparing their results with those of well-known methods through simulation patterns and by using different collected and simulated time-to-event datasets. The performance and verification of the proposed methods are presented in Section 5. Concluding remarks, including the advantages of the proposed methods, are discussed in Section 6.

The dynamic change of variables in time-dependent streaming explanatory data is of interest in a complementary stage of this research. The computer software used in this research is the MATLAB R2011b programming environment.

2. Basic Definitions

In this section, an applied introduction of time-to-event data analysis and a brief review of prominent data mining tools and techniques relevant to this study are presented.

2.1. Time-to-Event Data Analysis

Time-to-event data analysis methods consider the time until the occurrence of an event. This time can be measured in any unit, such as days, weeks, or years, and this type of analysis is widely used in reliability engineering. In time-to-event data, subjects are usually followed over a specified time period. The study of time-to-event data focuses on predicting the probability of survival or failure. Examples of time-to-event data are the lifetimes of mechanical devices, electronic components, or complex systems [21, 27, 28].

Regression models cannot perform effectively in the presence of censored observations [27, 29]. Censored data occur when the information about the event time is incomplete or missing for any reason. Right censoring occurs when a test subject does not remain under test for the full test period or until it fails. In this paper, we focus on time-to-event data with this type of censoring.

Available methods to analyze time-to-event data and to find a relationship between survival time and other variables can be categorized into parametric, semiparametric, and nonparametric methods. Parametric survival analysis is based on survival function distributions such as the exponential distribution. Semiparametric models do not assume knowledge of the absolute risk; these models estimate relative risk rather than absolute risk, under what is called the proportional hazards assumption. In this category, Cox proportional hazards regression is by far the most popular model for survival data analysis. For moderate- to high-dimensional covariates, it is difficult to apply semiparametric methods [25]. Nonparametric methods, which are useful when the underlying distribution of the problem is unknown, do not require statistical assumptions. These methods are commonly used to describe the survivorship of a study population or to compare two or more study populations. The Kaplan-Meier product limit estimate is a commonly used nonparametric method for estimating the survival function. This estimator has clear advantages since it does not require distributional assumptions about follow-up times [27, 30].

The probability of the failure time $T$ occurring at time $t$ is

$$ f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t}. $$

In time-to-event or survival analysis, the information on an event status and follow-up time is used to estimate a survival function, $S(t)$, which is defined as the probability that an object survives at least until time $t$:

$$ S(t) = P(T > t). $$

From the definition of the cumulative distribution function (or failure function),

$$ F(t) = P(T \le t) = 1 - S(t). $$

Accordingly, the survival function is calculated from the probability density function as

$$ S(t) = \int_t^{\infty} f(u)\, du. $$

In most applications, the survival function is shown as a step function rather than a smooth curve. The nonparametric estimate of $S(t)$ according to the Kaplan-Meier (KM) estimator for distinct ordered event times $t_1 < t_2 < \cdots < t_D$ is as follows:

$$ \hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right), $$

where at each event time $t_i$ there are $n_i$ subjects at risk and $d_i$ is the number of subjects which experienced the event, for example, failed. Let $c_i$ denote the number of subjects censored between $t_i$ and $t_{i+1}$. Then the likelihood function takes the following form:

$$ L = \prod_{i=1}^{D} \left[ S(t_{i-1}) - S(t_i) \right]^{d_i} \left[ S(t_i) \right]^{c_i}. $$

For the conditional probability of surviving, if we define $p_i = P(T > t_i \mid T > t_{i-1})$, so that $S(t_i) = \prod_{j \le i} p_j$, then the maximum-likelihood estimate of $p_i$ is as follows:

$$ \hat{p}_i = \frac{n_i - d_i}{n_i}. $$

Graphically, the Kaplan-Meier estimate is a step function with discontinuities (drops) at the observed failure times. It has been shown [31] that the KM estimator is consistent. The completely nonparametric nature of this estimator assures little or no loss in efficiency. A quick review of the data mining tools and techniques commonly used in this study is presented next.
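As an illustration of the product-limit formula, the following minimal Python sketch (an illustrative addition; the study itself used MATLAB) computes $\hat{S}(t)$ at the distinct event times from hypothetical `times` and `events` arrays.

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate S(t) at each distinct event time.

    times  : observed times (event or right-censoring), shape (n,)
    events : 1 if the event (failure) was observed, 0 if right-censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])      # distinct ordered event times t_1 < ... < t_D
    surv = []
    s = 1.0
    for t in event_times:
        n_i = np.sum(times >= t)                      # subjects at risk just before t
        d_i = np.sum((times == t) & (events == 1))    # observed failures at t
        s *= 1.0 - d_i / n_i                          # multiply the conditional survival (n_i - d_i)/n_i
        surv.append(s)
    return event_times, np.array(surv)

# Hypothetical toy data: 0 marks a right-censored observation
t = [2, 3, 3, 5, 6, 8, 9, 12]
e = [1, 1, 0, 1, 0, 1, 1, 0]
print(kaplan_meier(t, e))
```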

2.2. Data Mining Tools and Techniques

To analyze time-to-event data when the size and dimensionality are large, advanced analytics are advantageous. Data reduction techniques are categorized into three main strategies: dimensionality reduction, numerosity reduction, and data compression [32, 33]. Dimensionality reduction, which reduces the number of random variables or attributes depending on the circumstances of the problem, is the most effective strategy when dealing with large-scale data. Dimensionality reduction methods are mainly wavelet transformations and principal component analysis (PCA) [34, 35]. The transformation and projection of the original data eliminate a subset of the original data in terms of the variables' covariance.

Dimensionality reduction techniques can also be classified into feature extraction and feature selection approaches. Feature extraction transforms the original data into a new lower-dimensional space through some functional mapping, such as PCA and SVD [36, 37]. Most unsupervised dimensionality reduction techniques are closely related to PCA, which is one of the oldest and most well-known multivariate analysis techniques, but this technique is not applicable to large complex datasets [38]. Feature selection refers to selecting a subset of the original data (features) without a transformation, in order to filter out irrelevant or redundant features; examples include filter methods, wrapper methods, and embedded methods [39, 40]. The next section presents the proposed analytic logical model for the transformation of the explanatory variable dataset of time-to-event data.

3. Proposed Analytic Model

The direction for developing an analytic model and analyzing datasets depends on the type and size of the data. The need for a multipurpose, flexible, and innovative model for right-censored time-to-event data with a large number of variables, when the correlation between variables is complicated or unknown, provides the motivation to find an applicable solution for this type of data. For such data, we propose a model that simplifies the original covariate dataset into a logical dataset via a transformation lemma. In order to select the most significant variables in terms of efficiency, variable reduction methods and clustering algorithms are proposed in Section 4. The analytic model [41] and its accompanying methods and algorithms are potentially applicable solutions for many problems in a vast area of science and technology.

The original right-censored high-dimensional time-to-event dataset may include any type of explanatory data, such as binary, continuous, categorical, or ordinal data. The premise of the proposed analytic model is that many variables are already binary or can be exchanged for a binary variable, such as dichotomous and Bernoulli variables. Also, the interpretation of a binary variable is simple, understandable, and comprehensible. In addition, the model allows fast and low-cost calculation, which makes time-dependent analysis with data streaming possible.

The random variables $T$ and $C$ represent the time-to-event and the censoring time, respectively. Time-to-event data are represented by $(Y, \delta)$, where $Y = \min(T, C)$ and the censoring indicator $\delta = 1$ if the event occurred, for instance a failure is observed, and $\delta = 0$ otherwise. The observed covariate vector $X = (x_1, \ldots, x_p)$ represents a set of $p$ variables. Let $(Y_i, \delta_i, X_i)$, $i = 1, \ldots, n$, denote any of the $n$ observations. It is assumed that the hazard at time $t$ only depends on the subjects still at risk (surviving) at time $t$, which assures the independent right-censoring assumption [21].

In order to simplify the original complex high-dimensional time-to-event dataset, we propose a transformed logical model. Based on the concept of this model, for any $n$-by-$p$ dataset matrix $X$, there are $n$ independent observations and $p$ variables. Each array of the transformed variable vectors will take only two possible values, canonically 0 and 1. Therefore, it is required to define a Bernoulli criterion to split all arrays into a set of binary outcomes and reach a logical dataset. Transforming numerical attributes to binary variables has been well studied [42]. In order to construct the abovementioned transformed logical time-to-event dataset as a simplified and applied representation of the original one, we define an initial Bernoulli criterion $c_j$ for each variable $j$. For any array $x_{ij}$ in the $n$-by-$p$ dataset matrix $X$, assign a substituting array $b_{ij}$ as

$$ b_{ij} = \begin{cases} 1, & \text{if } x_{ij} \text{ satisfies the criterion } c_j, \\ 0, & \text{otherwise.} \end{cases} $$

Note that for each of the variable vectors the criterion $c_j$ could also be defined by an expert using experimental or historical data. The proposed model treats any array with a value of 1 as desired and 0 otherwise. In other words, $b_{ij} = 0$ represents the lack of the $j$th variable in the $i$th observation. In this fashion, only desired variables will be considered in each variable vector. The transformed dataset is used in the proposed methods and algorithms.

Therefore, the result of the transformation is an $n$-by-$p$ logical dataset matrix $B = [b_{ij}]$ which will be used in the following methods and algorithms. Also, we define the time-to-event vector $T$, which includes all observed event times, and $S(t)$ as the corresponding survival function. The proposed logical model validation and verification of its robustness were presented comprehensively in [41, 43]. The logical model could also be satisfied from the outset by a proper design of the data collection process based on Boolean logic to generate binary attributes.
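The sketch below illustrates the transformation step under one possible choice of criterion; the per-variable thresholds `criteria` are a hypothetical stand-in for the Bernoulli criterion $c_j$, which in practice may be supplied by a domain expert.

```python
import numpy as np

def to_logical(X, criteria):
    """Transform an n-by-p covariate matrix into a logical (0/1) matrix B.

    X        : n-by-p array of original covariates
    criteria : length-p array of per-variable Bernoulli thresholds; here an
               entry x_ij is coded 1 ("desired") when x_ij >= criteria[j].
               The thresholding rule is an illustrative assumption.
    """
    X = np.asarray(X, dtype=float)
    return (X >= np.asarray(criteria)).astype(int)

# Hypothetical example: use each variable's median as the initial criterion
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
B = to_logical(X, np.median(X, axis=0))
print(B)
```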

4. Proposed Methods and Heuristic Algorithms

In order to design appropriate methods and algorithms, a test is performed on the efficiency of a cluster of variables as a subdataset of the complete time-to-event covariate dataset. The test is done by comparing this subdataset with the complete transformed logical dataset. A key assumption in this approach is that a variable which is completely inefficient by itself can provide a significant performance improvement when engaged with others, and that two variables that are inefficient by themselves can be efficient together [40]. Extending these assumptions to the presence of a large number of covariates with complex correlation, the effect of an efficient subset of variables does not differ meaningfully from the effect of the whole body of variables on the time-to-event outcome. Therefore, by comparison through a nonparametric test, any subset of the complete dataset can be determined to be an efficient or inefficient selection. Based on these assumptions, we design two methods and hybrid algorithms for the proposed analytic model for selecting inefficient variables in right-censored time-to-event datasets with high-dimensional covariates.

We use the Kaplan-Meier estimator in this study to estimate and graph survival probabilities as a function of time. In addition, a nonparametric method is used to test the null hypothesis that two samples are drawn from the same distribution against a given alternative hypothesis. Among the many nonparametric tests for comparing survival functions for this purpose, we use the log-rank test in our methods as the best fit for the comparison of two nonparametric distributions. This test is the most commonly used one for a typical study under different models of the relationship between the groups [27, 30].

$B_k$ is constructed from the observation vectors that correspond to each of the $k$ selected variables as an $n$-by-$k$ matrix, a selected subset of $B$. $n_k$ is defined as the number of observations in any subset of $B_k$, where $n_k \le n$. We also define the vector $T_k$ as follows:

$$ T_k = \{\, t_i : \text{the } i\text{th row of } B_k \text{ contains at least one nonzero array} \,\}. $$

Vector $T_k$ is constructed from all observed times whose corresponding arrays in $B_k$ are nonzero. The preliminary step, for the highest efficiency of the proposed methods, is to cluster the variables based on the correlation coefficient matrix of the original dataset, choose a representative variable from each highly correlated cluster, and then eliminate the other variables from the dataset. Let $\sigma_{jl}$ denote the covariance of variables $j$ and $l$. The correlation coefficient matrix $\rho = [\rho_{jl}]$ is defined as follows:

$$ \rho_{jl} = \frac{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{il} - \bar{x}_l)}{\sqrt{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}\,\sqrt{\sum_{i=1}^{n}(x_{il} - \bar{x}_l)^2}}, $$

where $x_{ij}$ and $\bar{x}_j$ represent the value of variable $j$ in observation $i$ and the mean of variable $j$, and the second parenthesis is defined similarly for variable $l$.

Applying this preliminary step, for instance, to a given dataset may identify three highly correlated variables in $X$. Only one of them is selected randomly and the other two are eliminated from the dataset. The outcome of this process ensures that the variables remaining for the methods and heuristic algorithms are not highly correlated.
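A minimal sketch of this preliminary step is given below; the 0.9 correlation cut-off and the random choice of representative are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

def drop_highly_correlated(X, threshold=0.9, seed=0):
    """Keep one randomly chosen representative from each group of highly
    correlated variables and drop the rest.

    threshold : illustrative absolute-correlation cut-off
    returns   : sorted indices of the retained columns of X
    """
    rng = np.random.default_rng(seed)
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    remaining = list(range(R.shape[0]))
    kept = []
    while remaining:
        j = remaining.pop(rng.integers(len(remaining)))   # random representative
        kept.append(j)
        # drop every still-unprocessed variable highly correlated with j
        remaining = [l for l in remaining if abs(R[j, l]) < threshold]
    return sorted(kept)

# Hypothetical usage on a matrix X with n rows and p columns:
# keep_idx = drop_highly_correlated(X); X_reduced = X[:, keep_idx]
```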

4.1. Method I: Nonparametric Test Score Variable Clustering

The log-rank test score variable selection method is applied for selecting a subset of the best and the worst variables in terms of efficiency. The nonparametric test score (NTS) method is a variable clustering technique which selects sets of $k$ variables from the transformed logical dataset and calculates a score for each variable at two levels. The first level determines the priority of variable efficiency via these scores. The scores are obtained from the frequency of each variable in the rejected subsets resulting from comparison with the original time-to-event vector $T$. We code this level of calculation with the letter F. The second level rates the variables by the cumulative score of each variable over the comparisons of all selected subsets, using the nonparametric test results against the original time-to-event vector $T$. This level acts as a searching procedure to detect the less efficient variables and is denoted by the letter C. The randomization (RN) algorithm randomly chooses a subset of $k$ variables from the $p$ variables of the transformed logical dataset. We define a randomization dataset matrix $R$ where each row is formed by the variable identification numbers of one selected subset, for $q$ subsets overall. The heuristic algorithm of the NTS method, level F, is given in Algorithm 1.

Step  1. for i = 1 to q do
Step  2.  Compose the dataset B_i for variable set i in R, including the k variables j = 1 to k
Step  3.  Calculate S_i(t) over the dataset B_i
Step  4.  Compare S(t) and S_i(t) with the log-rank test
Step  5.  Save the test score (p-value) for the variables in subset i as p_i
Step  6. end for
Step  7. Eliminate the rows of R whose test score p_i has a value of more than 0.05,
   and call this new set R*
Step  8. Count the contribution (presence) of each variable, based on its identification
   number, in the first k columns of R* and save the counts as F
Step  9. Return F as the variable efficiency vector
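The following Python sketch approximates Algorithm 1 under several assumptions: the `lifelines` package is used for the two-sample log-rank test, the survival sample of a subset is formed from the observations whose rows in the subset matrix contain nonzero arrays (as defined above), and the test score is taken to be the p-value. Function and variable names are illustrative.

```python
import numpy as np
from lifelines.statistics import logrank_test

def nts_level_f(B, times, events, k, q, alpha=0.05, seed=0):
    """NTS method, level F (sketch): frequency of each variable in the rejected subsets.

    B      : n-by-p transformed logical (0/1) covariate matrix
    times  : observed times, shape (n,)
    events : censoring indicators (1 = failure observed), shape (n,)
    k, q   : subset size and number of random subsets
    """
    rng = np.random.default_rng(seed)
    B = np.asarray(B, dtype=int)
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    n, p = B.shape
    freq = np.zeros(p)                                   # efficiency vector F
    for _ in range(q):
        subset = rng.choice(p, size=k, replace=False)    # one row of the randomization matrix R
        mask = B[:, subset].any(axis=1)                  # observations "present" for this subset
        if mask.sum() < 2:
            continue
        res = logrank_test(times, times[mask],
                           event_observed_A=events,
                           event_observed_B=events[mask])
        if res.p_value <= alpha:                         # subset rejected by the log-rank test
            freq[subset] += 1                            # count each member variable's contribution
    return freq

# Hypothetical usage: freq = nts_level_f(B, times, events, k=3, q=500)
```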

The heuristic algorithm of the NTS method, level C, is presented in Algorithm 2. Experimental results for these algorithms follow in Section 5.

Step  1. for i = 1 to q do
Step  2.  Compose the dataset B_i for variable set i in R, including the k variables j = 1 to k
Step  3.  Calculate S_i(t) over the dataset B_i
Step  4.  Compare S(t) and S_i(t) with the log-rank test
Step  5.  Save the test score for the variables in subset i as p_i
Step  6. end for
Step  7. Assume C = (C_1, ..., C_p) as the reverse variable efficiency vector, where initially each array
   C_v, the cumulative contribution score corresponding to variable v, is zero
Step  8. for i = 1 to q do
Step  9.  for j = 1 to k do
Step  10.   Add the value of p_i to the cumulative contribution score C_v of the variable
      whose identification number is v = R(i, j)
     end for
Step  11. end for
Step  12. Return C = (C_1, ..., C_p) as the variable inefficiency vector
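Level C differs from level F only in how the subset scores are aggregated: every subset's score is added to the cumulative score of each of its member variables. A correspondingly small sketch, under the same assumptions as the level F sketch, is shown below.

```python
import numpy as np
from lifelines.statistics import logrank_test

def nts_level_c(B, times, events, k, q, seed=0):
    """NTS method, level C (sketch): cumulative test score per variable over q subsets."""
    rng = np.random.default_rng(seed)
    B = np.asarray(B, dtype=int)
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    n, p = B.shape
    cum_score = np.zeros(p)                              # reverse efficiency vector C
    for _ in range(q):
        subset = rng.choice(p, size=k, replace=False)
        mask = B[:, subset].any(axis=1)
        if mask.sum() < 2:
            continue
        res = logrank_test(times, times[mask],
                           event_observed_A=events,
                           event_observed_B=events[mask])
        cum_score[subset] += res.p_value                 # add the subset's score to every member
    return cum_score                                     # read as the variable inefficiency vector
```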

4.2. Method II: Splitting Semigreedy Clustering Algorithm

The splitting semigreedy (SSG) method is proposed to select an inefficient variable subset. It incorporates a clustering procedure with a random splitting approach to select the best local subset according to a defined criterion. The concept of this method is inspired by the semigreedy heuristic [44, 45] and tabu search [46]. The criterion of this search is similar to that of Method I, namely, to collect the most inefficient variable subset via the log-rank test score. At each of $q$ trials, all $p$ variables from the transformed logical dataset are randomly clustered into subsets of $k$ variables, where one cluster possibly contains fewer than $k$ variables and the number of clusters is equal to $\lceil p/k \rceil$. To calculate the score summation for each variable over all trials, we define a randomization dataset matrix $R$ where each row is formed by the variable identification numbers of a selected subset, for all $q$ selected subsets. The heuristic algorithm of SSG is presented in Algorithm 3. Comprehensive experimental results for the validation of the proposed methods by comparison with similar methods are presented next.

Step  1. for i = 1 to q do
Step  2.  Split the p variables into equally sized subsets of k variables
Step  3.  Compose the dataset B_l for each subset l
Step  4.  Calculate S_l(t) over B_l for each subset
Step  5.  Compare S(t) and S_l(t) with the log-rank test and save the test score for each subset one by one
Step  6.  Select the subset with the highest test score
Step  7.   Save the test score for the variables in the selected subset as p_i
Step  8. end for
Step  9. Assume C = (C_1, ..., C_p) as the reverse variable efficiency vector, where initially each array
   C_v, the cumulative contribution score corresponding to variable v, is zero
Step  10. for i = 1 to q do
Step  11.  for j = 1 to k do
Step  12.  Add the value of p_i to the cumulative contribution score C_v of the variable
     whose identification number is v = R(i, j)
    end for
Step  13. end for
Step  14. Return C = (C_1, ..., C_p) as the variable inefficiency vector
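A sketch of Algorithm 3 follows, under the same assumptions as the NTS sketches; in particular, the "highest test score" is interpreted here as the largest log-rank p-value, which is an assumption of this illustration.

```python
import numpy as np
from lifelines.statistics import logrank_test

def ssg(B, times, events, k, q, seed=0):
    """Splitting semigreedy (SSG) sketch: at each trial, randomly partition all p
    variables into clusters of size k, keep the cluster with the highest test score,
    and add that score to each of its member variables."""
    rng = np.random.default_rng(seed)
    B = np.asarray(B, dtype=int)
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    n, p = B.shape
    cum_score = np.zeros(p)
    for _ in range(q):
        perm = rng.permutation(p)
        clusters = [perm[i:i + k] for i in range(0, p, k)]   # last cluster may hold fewer than k variables
        best_score, best_cluster = -np.inf, None
        for cl in clusters:
            mask = B[:, cl].any(axis=1)
            if mask.sum() < 2:
                continue
            res = logrank_test(times, times[mask],
                               event_observed_A=events,
                               event_observed_B=events[mask])
            if res.p_value > best_score:                     # "test score" taken as the p-value here
                best_score, best_cluster = res.p_value, cl
        if best_cluster is not None:
            cum_score[best_cluster] += best_score            # accumulate into the inefficiency vector
    return cum_score
```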

5. Experimental Results and Analysis

To evaluate the performance of the proposed methods, the designed algorithms are investigated on different types of datasets, including collected and simulated data. In order to obtain an estimate of $k$, the desired number of variables in any selected subset of the NTS (F and C levels) and SSG methods for the calculations in the hybrid algorithms (Algorithms 1, 2, and 3), we use the principal component analysis (PCA) scree plot criterion [47]. The criterion for this evaluation is based on the eigenvalues of the components of a dataset. The cut-off in the scree plot is interpreted as the set of significant eigenvalues among all components, which leads to determining $k$, the number of efficient variables in a given dataset (see Figure 1). We apply this technique on the original dataset to determine $k$.
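A small sketch of how $k$ could be estimated from the eigenvalues is shown below; the automatic eigenvalue-greater-than-one cut-off is only an illustrative proxy for reading the elbow of the scree plot by eye.

```python
import numpy as np

def scree_k(X):
    """Eigenvalues used for a scree-plot read-off of k.

    Returns the descending eigenvalues of the correlation matrix of X and an
    illustrative automatic cut-off (count of eigenvalues above 1).
    """
    X = np.asarray(X, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # standardize -> correlation-matrix PCA
    eigvals = np.linalg.eigvalsh(np.cov(Xs, rowvar=False))[::-1]
    k = int(np.sum(eigvals > 1.0))                        # illustrative proxy for the scree elbow
    return eigvals, k

# In practice the cut-off ("elbow") is read visually from a plot of eigvals.
```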

5.1. PBC Data Numerical Experiment

First, the well-known primary biliary cirrhosis (PBC) dataset (Fleming and Harrington, 1991) is considered as the sample collected dataset. These data are from a double-blinded randomized trial including 312 observations. This right-censored dataset contains 17 variables in addition to censoring information, an identification number, and the event time for each observation. Experimental results for the 276 observations with complete records are presented. For the original PBC dataset, the approximate value of $k$ is 3. This value is the subset size used in the algorithms applied to the transformed logical dataset in the $n$-by-$p$ matrix, as shown in Figure 1.

To verify the performance of the proposed methods, the results of the methods and algorithms for the transformed logical PBC dataset are compared with the results of the random survival forests (RSF) method [23, 48], the additive risk model (ADD) [24], and weighted least squares (LS) [25] for variable selection in the PBC dataset, given in Table 1. RSF is a method for the analysis of right-censored survival data using splitting rules for growing survival trees; it is based on the random forests (RF) ensemble learning approach, which can be improved further by injecting randomization into the base learning process. Therefore, the RSF methodology is an extension of the RF method in which randomization is used in two forms: by randomly drawing a bootstrap sample of the data and by randomly selecting a subset of variables [23]. The ADD model uses a principal component regression to give unique and numerically stable estimators. This model assumes that the hazard function is the summation of the baseline hazard function and the regression function of covariates [24]. The LS estimator uses KM weights to account for censoring in the least squares criterion and makes the adaptation of this approach to multiple covariate settings computationally feasible [25]. A comprehensive comparison of the performance of RSF, ADD, and LS with other relevant methods in high-dimensional time-to-event data analysis, such as Cox's proportional hazards model, LASSO, and PCR, has been presented in [23-25].

Each number in Tables 1 and 2 represents a specific variable in the experiment dataset. For example, in Table 1, variable 15 is among the less efficient variables selected by the NTS (F), SSG, RSF, and LS methods.

From the results shown in Table 1, the NTS (F) method has the performance closest to that of SSG. An advantage of these methods is that more than 80% of the inefficient variables detected by the other methods (RSF, LS, and ADD) are also collected by the proposed algorithms in significantly shorter computation times. The robustness of this class of methods has been examined in this study on several collected time-to-event data samples with high-dimensional covariates.

5.2. Simulation Numerical Experiment

Continuing the validation of the proposed methods, we set $n = 400$ observations and $p = 25$ variables and simulated event times from a pseudorandom algorithm. In addition to constant and periodic binary numbers, normally and exponentially distributed pseudorandom numbers are generated as independent values of the explanatory variables. The censored observations were also generated independently of the events. Additionally, some variables are intentionally set as a linear function of the event time. The results of the methods and algorithms applied to the simulated data are presented in Table 2. These results are compared with the defined simulation pattern, and the comparison verifies the performance of all proposed methods and algorithms.

Inefficiency analysis results for the simulation experiment show that variables with identification numbers 5, 10, and 25 are detected as less efficient variables among all 25 simulated variables by NTS (F), NTS (C), and SSG methods. In this example, one could eliminate these less efficient identified variables if the objective is to reduce the number of variables in the dataset for further analysis.

In addition, a set of numerical experiments is presented to evaluate the performance of the proposed methods, as shown in Table 3. Six different simulation models are defined. These models are set up as explained above, for $n = 400$ observations and $p = 25$ variables, following the same data generating algorithm. The number of significant inefficient variables in each simulation model is set as presented in the Definition column of Table 3. The simulations are repeated 100 times independently. For each method, $\bar{m}$ represents the integer average number of determined and selected inefficient variables, and $P$ denotes the performance of the method in a trial, computed as the fraction of the defined inefficient variables detected:

$$ P = \frac{\text{number of defined inefficient variables detected by the method}}{\text{number of defined inefficient variables in the simulation model}}, $$

and $\bar{P}$ is the average performance of each method over the 100 trials. The censoring rates are adjusted to the same level for all simulation numerical experiments.
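Under the reconstruction of $P$ given above (fraction of the defined inefficient variables recovered in a trial), a single trial's performance could be computed as in the short sketch below; the variable sets are hypothetical.

```python
def performance(selected, defined):
    """Fraction of the defined inefficient variables recovered by a method
    in one simulation trial (the reconstruction of P assumed above)."""
    selected, defined = set(selected), set(defined)
    return len(selected & defined) / len(defined)

# Hypothetical trial: the simulation defines {5, 10, 25} as inefficient,
# and a method selects {5, 25, 7}
print(performance({5, 25, 7}, {5, 10, 25}))   # 0.666...
```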

5.3. Hybrid Scattering for Variable Inefficiency

Figure 2 displays a graphical interpretation of the results from the proposed methods: a hybrid scatter plot of variable inefficiency for three criteria, namely, (a) the NTS (F) score, (b) the NTS (C) score, and (c) the SSG score, for the PBC data. Each variable is represented by a circle, where the two axes represent the standardized criteria (a) and (b) and the diameter of each circle represents the standardized criterion (c). Therefore, the hybrid scatter plot is interpreted as follows: a variable with a larger diameter and a greater distance from the center has less efficiency and is an ideal candidate to remove from the dataset if desired.
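A matplotlib sketch of such a hybrid scatter plot is given below; the standardization and the mapping of the SSG score to marker area are illustrative choices, not the exact scaling used for Figure 2.

```python
import numpy as np
import matplotlib.pyplot as plt

def hybrid_scatter(nts_f, nts_c, ssg_score, labels=None):
    """Hybrid scatter for variable inefficiency: axes are the standardized
    NTS (F) and NTS (C) scores; marker area follows the standardized SSG score."""
    z = lambda v: (np.asarray(v, float) - np.mean(v)) / np.std(v)
    x, y, s = z(nts_f), z(nts_c), z(ssg_score)
    plt.scatter(x, y, s=100 * (s - s.min() + 0.2), facecolors="none", edgecolors="b")
    if labels is not None:
        for xi, yi, lab in zip(x, y, labels):
            plt.annotate(str(lab), (xi, yi))     # variable identification numbers
    plt.axhline(0, color="gray", lw=0.5)
    plt.axvline(0, color="gray", lw=0.5)
    plt.xlabel("standardized NTS (F) score")
    plt.ylabel("standardized NTS (C) score")
    plt.show()
```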

6. Conclusions

This paper presents applied procedures beneficial to covariate reduction through an inefficient variable selection approach, which enables one to obtain an appropriate variable subset in right-censored, high-dimensional, large-scale time-to-event data. Two hybrid methods are introduced, and the experimental results and comparisons demonstrate their performance. By using such novel methods in the field of reliability, data analysis and decision-making processes will be faster, simpler, and more accurate.

We aim to apply median rank estimates in the proposed nonparametric methods and to develop weighted score and penalized nonparametric maximum likelihood and least squares estimators for the proposed logical time-to-event model under the accelerated failure time model. A further challenge in this research is to address data streaming circumstances, in which variables are time-dependent and real-time analysis of the data is more complicated.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.