Abstract
High dimensionality and noise have made it difficult to detect related biomarkers in omics data. Through previous study, penalized maximum trimmed likelihood estimation is effective in identifying mislabeled samples in highdimensional data with mislabeled error. However, the algorithm commonly used in these studies is the concentration step (Cstep), and the Cstep algorithm that is applied to robust penalized regression does not ensure that the criterion function is gradually optimized iteratively, because the regularized parameters change during the iteration. This makes the Cstep algorithm runs very slowly, especially when dealing with highdimensional omics data. The ARCstep (Cstep combined with an acceptancerejection scheme) algorithm is proposed. In simulation experiments, the ARCstep algorithm converged faster (the average computation time was only 2% of that of the Cstep algorithm) and was more accurate in terms of variable selection and outlier identification than the Cstep algorithm. The two algorithms were further compared on triple negative breast cancer (TNBC) RNAseq data. ARCstep can solve the problem of the Cstep not converging and ensures that the iterative process is in the direction that improves criterion function. As an improvement of the Cstep algorithm, the ARCstep algorithm can be extended to other robust models with regularized parameters.
1. Introduction
The first challenge presented by omics data is the high dimension, which far exceeds the sample size. The second challenge is the presence of noise in the omics data. This noise may be caused by misdiagnosis, mislabelling, recording errors, technical problems in the laboratory, or sample heterogeneity [1, 2]. Penalized regression is a common method to solve the problem of variable selection and prediction for a highdimensional dataset. It has been applied to omics data such as gene expression (4), GWAS [3], and DNA methylation [4]. However, the outliers in the data make the estimation of penalized regression inaccurate, so biomarkers cannot be properly screened. Additionally, the identification and further investigation of these outliers can correct the errors during the experiment or investigation. Therefore, it is very important to develop robust statistical methods for penalized regression.
A robust estimation method, least trimmed square (LTS), was proposed by Rousseeuw [5]. LTS is highly robust to outliers in both the response and predictors. It is effective for identifying outliers and can solve the problem of the masking phenomenon caused by the coexistence of multiple outliers [5, 6]. Alfons et al. [6] applied LTS to LASSOtype penalized linear regression to solve the problem of robust highdimensional variable selection when the dependent variable is quantitative data. Kurnaz et al. [7] applied LTS to elastic net (EN) type penalized linear and logistic regression to solve the problem of robust highdimensional variable selection when the dependent variable is quantitative and binary data (enetLTS).
Both studies adopted the concentration step (Cstep) in the FASTLTS algorithm proposed by Rousseeuw and Van Driessen [8]. The basic ideas were an inequality involving order statistics and sums of squared residuals. This inequality guarantees that the criterion function declines monotonically as the iteration progresses. However, when it is applied to penalized regression based on trimming, the inequality does not necessarily hold due to the change of the regularized parameters. Thus, the criterion function cannot be guaranteed to decrease. Through our previous simulation study [9], we have found that enetLTS is effective in identifying mislabeled samples in highdimensional data with mislabeled error. However, it is also found that for a dataset with , , and an outlier ratio of 10%, it takes nearly 2 hours (Intel Core i76500U @2.50GHz) to run enetLTS once. For the omics data in real data analysis with and , enetLTS running time is about 77.8 hours (Intel Xeon Silver 4112 @2.60GHZ), which obviously does not meet the requirements for efficient data processing.
Therefore, the Cstep algorithm needs to be improved to adapt to highdimensional data. In this study, the ARCstep algorithm is proposed to solve the estimation of robust penalized regression based on trimming, which combines the Cstep algorithm with the acceptancerejection algorithm proposed by Chakraborty and Chaudhuri [10] . Two algorithms are compared in terms of variable selection and outlier identification accuracy and computation speed in simulation study. An RNAseq dataset for triple negative breast cancer (TNBC) [1] that contains 28 samples with discordant labels obtained from different tests (immunohistochemical (IHC) method or fluorescence in situ hybridization (FISH)) is used to illustrate the application of the two algorithms.
The structure of this paper is as follows: In results section, simulation experiments are described that compare the MTLEN (elastic nettype maximum trimmed likelihood) estimation using the ARCstep algorithm with enetLTS. The results of enetLTS and MTLEN applied to a triple negative breast cancer (TNBC) RNAseq dataset are compared. Then, the results are discussed and concluded.
In this article, a robust penalized logistic regression model based on trimming is introduced in Section 2. And the ARCstep algorithm is proposed and described in Section 3. In Section 4, simulation experiments are described that compare the MTLEN (elastic nettype maximum trimmed likelihood) estimation using the ARCstep algorithm with enetLTS. The results of enetLTS and MTLEN applied to a triple negative breast cancer (TNBC) RNAseq dataset are compared in Section 5. We conclude with a discussion in Section 6 and a conclusion in Section 7.
2. Robust Penalized Logistic Regression Model Based on Trimming
Kurnaz et al. [7] proposed an ENtype penalized logistic regression based on trimming. where , where is the ordered deviance. ( means rounding down to the nearest integer) and where is the trimmed portion. Compared with EN, enetLTS only retains observations with the smallest deviances, whereas  least likely observations under the given model are excluded.
Robust penalized logistic regression model based on trimming was denoted as enetLTS (robust EN based on the LTS), and Cstep algorithm was adopted. We denote it as in this paper. The estimate of the same model obtained by the ARCstep algorithm is recorded as the ENtype maximum trimmed likelihood estimate .
3. Algorithm
3.1. CStep Algorithm
Kurnaz et al. [7] adopted the Cstep algorithm in enetLTS. This algorithm was described below.
Let be the criterion function of the penalized logistic regression based on the subsample , where . Thus,
Additionally, represents .
When the regularized parameters and are fixed, at the th step of the iteration, is the current subset with observations, and is the solution of the penalized logistic regression based on . The negative loglikelihood functions corresponding to observations can be derived from . The subsample consists of the smallest negative loglikelihood observations, that is, where ,.
Thus, can be obtained. is the subset that minimizes the criterion function under the solution . Then, penalized logistic regression is applied to subset . If and are unchanged, we get the solution which minimize the solution of criterion function under the regularization parameters and . Thus holds. Therefore, when is fixed,
The definition of makes the first equation hold. The definition of makes the second inequality hold.
For the Cstep algorithm, the candidate subset is constructed by sorting out samples with the smallest negative loglikelihood contribution to . Then, the Cstep algorithm continues until .
Therefore, when and remain unchanged, as the number of iterations increases, the criterion function decreases. Because the criterion function is nonnegative and the number of subsets with sample size is limited, the Cstep algorithm must converge to the subset with the smallest criterion function after a limited number of steps.

The Cstep algorithm is described in Algorithm 1, where “continueCstep” is set so that the absolute value of the difference between the likelihood functions of two iterations is less than some small value.
However, when penalized regression is performed on the subset , the regularized parameters and are not fixed. The regularized parameters are usually determined by data, such as by crossvalidation. The regularized parameters determined for penalized regression performed on two different subsets are often different, which leads to the second inequality of [11] not necessarily being true.
A way to solve the problem is to set all and values firstly. For a certain combination of and , perform the Cstep algorithm until convergence. Then, compare the convergent subsets under different regularized parameters, and select the subset that minimizes the criterion function. If the number of values is 40 and that of values is 20, there are 800 parameter combinations. This means running the Cstep algorithm 800 times, which will undoubtedly make the algorithm very slow.
3.2. ARCstep Algorithm
In this study, the ARCstep algorithm is proposed to solve the estimation of the robust penalized regression based on trimming, which combines the Cstep algorithm with the acceptancerejection algorithm, which was proposed by Chakraborty and Chaudhuri [10].
3.2.1. AcceptanceRejection Algorithm
The acceptancerejection algorithm is similar to that of MetropoliseHastings in MCMC. Let represent the subset at the th step of the iteration. Then, a randomly selected sample outside of replaces one of the samples in to form . The corresponding likelihood function is obtained after penalized regression is performed on . If the criterion function corresponding to is better than that corresponding to the current subset , then is accepted as with probability one, and . Otherwise, is accepted as with a probability of , so that the algorithm can escape the local optimal value.
In the acceptancerejection algorithm, the candidate sample at each step is randomly selected from the remaining samples other than the current subset . Thus, whether the candidate subset can improve the criterion function better is completely random, which leads to the slower convergence of the iteration. The advantage of this algorithm is that, whether the criterion function corresponding to the candidate subset is better than that of the current subset is examined at each step. Moreover, the subset with the optimal criterion function up to the current step is recorded at each step.
3.2.2. ARCstep Algorithm
The changes of the regularized parameters λ and make the Cstep algorithm hardly gradually converge to the subset with the smallest criterion function. Suppose the current subset is , and we obtain , and corresponding criterion function after the penalized regression is performed on . The smallest negative loglikelihood observations constitute the subset , so that holds. Then, penalized regression is performed on , is obtained, and the corresponding regularized parameters changed to and . The corresponding criterion function of is not necessarily less than . The ARCstep algorithm adds the step of comparing the criterion function of the candidate subset with that of the current subset . If , to avoid falling into a local optimum, is a random number that follows the Bernoulli distribution with , where . If , then . If , then , that is, no replacement. The criterion function corresponding to the initial subset is recorded as the optimal subset, that is, , and . At each step of the iteration, the criterion function is compared with . If , then and . in the last step is the solution.
To make the proportion of samples with in the candidate subset consistent with that in the full set, the samples constituting the candidate subset are selected in the following manner. consists of observations with the smallest among observations with (set a total of observations), and observations with the smallest among observations with (set a total of observations), where , and means round down. is the trimming ratio and . In comparison with the acceptancerejection algorithm, for which consists of samples selected randomly from the complementary set, of ARCstep is composed of observations with the smallest deviance; that is, each sample of contains information that improves the criterion function; hence, the algorithm converges to the subset with the optimal criterion function faster. The ARCstep algorithm is described in Algorithm 2.
The acceptance probability . It is inversely proportional to the absolute value of the difference between the two likelihood functions and . The acceptance probability is also related to . According to , the acceptance probability is inversely proportional to , which is the th step of the iteration. Similar to the study of Farcomeni and Viviani [12], , and the acceptance probability is inversely proportional to the sample size of the subset. When other features remain unchanged, the larger the sample size of the subset, the smaller the probability of being accepted. Additionally, if the current subset is not replaced after iterations, the iteration process is stopped.

To ensure that the initial subset does not contain outliers, the sample size should be smaller. The initial subset consisted of six observations, three of which were randomly selected from groups and , respectively. In order to make the algorithm reach the global optimal value, multiple initial subsets were selected.
First, the twostep iteration of ARCstep was performed on 500 initial subsets, and 500 updated subsets were obtained. Then, the 10 subsets with the smallest criterion function were retained. Then, ARCstep was performed on these 10 subsets until convergence. Among the 10 convergent subsets, the subset with the smallest criterion function was selected, denoted by. The penalized regression was performed on, and was obtained.
3.2.3. Reweighted Step
In this article, we choose the subset of size where . So is the initial guess that less than 25% of outliers contained in the data. This is a rather conservative estimation of proportion of outliers. There may not be so many outliers in the data. Therefore, reweighted step is considered to detect outliers via . Then, these outliers are excluded, and a new subset is obtained. Then, ENtype penalized logistic regression is applied to to get the solution . Usually, the size of is larger than , such that more samples can improve the performance of compared to . We called reweighted MTLEN (Rwt MTLEN). To distinguish them, the unweighted is called Raw MTLEN.
3.2.4. Choice of the Regularized Parameters and Standardization of Predictors
We select λ over a grid of values in the interval (] as discussed by Breheny and Huang [13]. where is the dependent variable and is the th independent variable. In iteration step of ARCstep, we take a grid with steps of size 0.05 and to reduce the computational burden. In the reweighted step, we take a grid with steps of size 0.01 of to derive the solution . The choice of is selected by crossvalidation in the interval [0.1,1] with a step size of 0.1.
It would be better to standardize predictors before applying the penalized regression. Standardization mainly is aimed at eliminating the influence of dimension and quantity of a predictor. However, the mean and standard deviation computed from all sample are not robust with outliers. In the algorithm described above, penalized regression is applied to the subset in every iteration step of ARCstep. So we firstly, respectively, compute mean and standard deviation from subsamples. Then, we standardize all samples with this mean and standard deviation before applying penalized regression.
4. Simulation Study
4.1. Comparison of MTLEN and enetLTS on Outlier Detection and Variable Selection
Simulation settings were the same as Sun et al. [9]. The parameter of both enetLTS and MTLEN was both set to , which meant the trimmed rate is 25%. The parameters in Ensemble followed Lopes et al. [1].
In the simulation experiment, we compared the two methods enetLTS and MTLEN using Cstep and ARCstep algorithms, respectively. Through our previous research [9] and subsequent simulation experiments, we can see that enetLTS is good at identifying outliers. However, the FDR of its variable selection is high, and many unrelated variables are identified. When encountering mislabeled omics data, we can combine enetLTS with Ensemble. Running Ensemble on a subset of data after removing the outliers identified by enetLTS improved the variable selection accuracy. Then, we added the third method Ensemble to the simulation experiment. A detailed description of Ensemble is provided in our previous study [9].
The performances of the three methods are summarized in Figure 1.
The outlier detection accuracy of the three methods is shown in Figure 1. Here, we used two indicators Sn (sensitivity) and FPR (False Positive Rate) [14]. Sn represents the proportion of true misclassified individuals identified as misclassified ones among all true misclassified observations. FPR represents the proportion of individuals with correct labels that are wrongly categorized as misclassified ones.
The outliers identified by MTLEN had the higher Sn than enetLTS. When the proportion of outliers were 10% and 15%, the gap between them further widened. MTLEN FPRs were close to enetLTS. Ensemble has the lowest Sn and FPRs among the three methods. Therefore, MTLEN had the best accuracy in identifying outliers.
The variable selection accuracy of the three methods is shown in Figure 1. PSR (Positive Selection Rate) indicates the proportion of true diseaserelated biomarkers identified in all true diseaserelated biomarkers. FDR (False Discovery Rate) represents the proportion of biomarkers that are not related to disease among all the screened biomarkers. A comprehensive indicator GM [15, 16] for the accuracy of variable selection was used, which is the geometric mean of PSR and (1 − FDR). High accuracy of variable selection is indicated by a high GM.
MTLEN variable selection accuracy was very similar to enetLTS with high PSR and FDR. As also shown in our previous study [9], Ensemble had the highest variable selection accuracy with much low FDR; however, Ensemble missed some associated variables when the proportion of outliers was 10% or 15%.
In terms of variable selection, when there were a small proportion of outliers, Ensemble performed best. However, its accuracy was greatly decreased when the proportion of outliers was large. In terms of outlier detection, regardless of the portion of outliers, MTLEN had the highest outlier detection accuracy among the three methods.
4.2. Combining with Ensemble to Improve the Accuracy of Variable Selection
In our previous study [9], we considered a twostep procedure when the proportion of outliers was relatively large. We found that it improved the variable selection accuracy by applying Ensemble on a subset with outliers identified by enetLTS removed. In this study, we also used MTLEN to detect outliers and then applied Ensemble on the subset with outliers removed. The results of MTLEN and enetLTS were compared by simulation, which is shown in Table 1.
From Table 1, compared with the results in the original data, the PSR of Ensemble raised from 0.533 to 0.644, and the GM was improved from 0.714 to 0.786 for subset after removing outliers identified by enetLTS. For subset with outliers identified by MTLEN removed, the results of Ensemble were also improved with PSR increased from 0.533 to 0.708 and GM increased from 0.714 to 0.828. It can be seen that after removing the outliers identified by MTLEN, the accuracy of Ensemble variable selection is the highest.
4.3. The Computation Times of enetLTS and MTLEN
From Table 2, the computation time of enetLTS is 39 times that of MTLEN (Intel Core i76500U @2.50GHz); that is, the computation time of MTLEN was 2% of that of enetLTS. This is because the Cstep algorithm used by enetLTS does not take into account the regularized parameters that need to be determined at each step of the iteration. The criterion function cannot be guaranteed to gradually decrease, which makes the algorithm converge slowly. The ARCstep algorithm adopted by MTLEN solves this problem well, which greatly improves the convergence speed.
5. Case Study
In the previous study [9], we compared the application of enetLTS, Ensemble, and Rlogreg on a TNBC dataset from the TCGABRCA data collection. The results showed that enetLTS identified 68 outliers, seven of which were individuals with inconsistent labels. After removing the outliers identified by enetLTS, the prediction accuracy of the three Ensemble models was improved, and the number of associated genes identified increased from 5 to 9. In this study, we applied MTLEN to this TNBC dataset. The outliers identified by MTLEN were compared with those by enetLTS, and we also compared the performances of Ensemble after removing the outliers identified by MTLEN and enetLTS, respectively.
From Tables 3 and 4, among the 68 outliers identified by enetLTS, 3 of them were labeled as TNBC, which were also identified by MTLEN; among them, 65 individuals with nonTNBC labels included 35 nonTNBC patients identified by MTLEN. In other words, 38 of the 47 outliers with nonTNBC labels identified by MTLEN were also identified by enetLTS. However, nine patients with TNBC labels were not identified by enetLTS. These 9 TNBC patients were highly expressed in one or more of the three genes, suggesting that they were likely to be nonTNBC patients or misclassified individuals. For example, TCGABHA42U (HER2 38.37), TCGAE2A1L7 (ER 29.61, PR 22.98), TCGAOLA97C (PR 8.56), TCGAA2A1G6 (ER 23.90, PR 21.45, HER2 29.74), TCGAA2A0EQ (ER 2.13, HER2 30.15), TCGAEWA1OV (HER2 28.91), TCGAOLA5D6 (HER2 72.13), TCGAC8A26X (HER2 60.12), and TCGALLA740 (HER2 68.56), with high expression in one or more of three receptors, were more likely not to be a TNBC patients; that is, his/her labels were probably wrong. Seven of the 47 outliers identified by MTLEN were suspect individuals with inconsistent HER2 labels. Six of them were labeled as nonTNBC, which were also detected by enetLTS. The remaining one “TCGAA2A0EQ” was labeled as TNBC, which was not detected by enetLTS.
A total of 213 genes were identified by MTLEN, and 40 genes with the largest absolute value are listed in Table 5. Among them, FOXA1 [17], ERBB2 [18], GRB7 [19], KRT16 [20], CXXC5 [21], FOXC1 [22], TFF3 [23], COL9A3 [24], FABP7 [25], CCNE1 [26], GZMB [27], and MIEN1 [28] were reported to be related to TNBC.
In our previous study [9], we combined the advantages of enetLTS and Ensemble and removed 68 outliers identified by enetLTS, then ran Ensemble on a subset (856 samples), to improve the accuracy of gene selection. In this study, we removed 47 misclassification samples detected by MTLEN and then ran Ensemble in the remaining 877 samples. The results are shown in Tables 6 and 7.
From Table 6, for the subset with outliers detected by enetLTS removed, the prediction index MR of the three models in Ensemble was much lower than that on the original TNBC dataset; the MR of EN decreased from 0.012 to 0, the SPLSDA MR reduced from 0.064 to 0.008, and the SGPLS MR reduced from 0.059 to 0.015. When Ensemble was run on a subset of 47 outliers identified by MTLEN, the prediction accuracy MR of the three models in Ensemble also decreased greatly, to 0.001, 0.014, and 0.013, respectively.
For subset with 68 outliers detected by enetLTS removed, the intersection of variables selected using the three Ensemble models increased from five to nine genes, namely, CA12 [29], GABRP [30], VGLL1 [31], AGR2 [32], GATA3 [17], FOXA1 [17], TFF3 [23], AGR3 [33], and KRT16 [20], were reported to be related to TNBC.
From Table 7, for subset with 47 outliers detected by MTLEN removed, the intersection of variables selected using the three Ensemble models was 12 genes. Among them, ESR1, one of three key variables, and FOXC1 [22], AGR2 [32], FOXA1 [17], TFF3 [23], TFF1 [34], AGR3 [33], KRT6B [35], and KRT16 [20] have been reported to be related to TNBC. KLK6 [36], FDCSP [37], and PPP1R14C [38] have been reported to be related to other types of tumors. Their association with TNBC needs further study.
6. Discussion
Through our previous research [9], we have found that in highdimensional data with mislabeled error, robust trimmed penalized regression is a recommended method in identifying mislabeled samples. However, the Cstep algorithm to implement this method (enetLTS) is too slow to meet the requirement of data analysis for highdimensional omics data. The reason is that for LTS without regularized parameters, the inequality that guarantees the convergence of the Cstep algorithm is established. However, for the robust trimmed penalized regression with regularized parameters, the inequality does not necessarily hold due to the change of the regularized parameters.
In the ARCstep algorithm, penalized regression is repeatedly performed on the subset at each step to concentrate on the individuals who fit the model best gradually; that is, the idea of the Cstep algorithm is still adopted. However, ARCstep can solve the problem of the Cstep algorithm not converging because the regularized parameters change during the iteration. A comparison of the likelihood function of the current subset and that of the candidate subset is used to determine whether to replace the current subset with the candidate subset in ARCstep, thereby ensuring that the iterative process is in the direction that improves the criterion function. To avoid falling into a local optimum, the Metropolistype probabilistic acceptancerejection algorithm is combined.
Through simulation experiments, it is found that MTLEN using ARCstep algorithm was more accurate than enetLTS using Cstep algorithm in outlier identification. In particular, the accuracy of Ensemble variable selection on the subset after removing outliers identified by MTLEN was higher than the result of Ensemble running on the subset after removing outliers identified by enetLTS. The ARCstep algorithm adopted by MTLEN greatly improved the convergence speed; that is, the computation time of MTLEN was 2% of that of enetLTS.
If a misclassified sample identified by a certain method is labeled as nonTNBC, it means that the expression of the key genes ER, PR, or HER2 is false positive in this patient. Similarly, if a misclassified sample identified is labeled as TNBC, it implies that the expression of ER, PR, or HER2 is a false negative in the patient. In the analysis of the TNBC dataset, there are 153 individuals labeled as TNBC in this TNBC dataset. There are 3 samples identified by enetLTS that were labeled as TNBC patients with false negative rate 2% (3/153). Twelve individuals labeled as TNBC patients were identified as mislabeled samples by MTLEN with false negative rate 7.8% (12/153). In the TNBC dataset, IHC test of ER and PR was adopted for all patients. For HER2 detection, the results of IHC were for 507 patients. According to previous studies, the false negative rates of IHC test for ER, PR, and HER2 were not low, 15.1% ~21.8% for ER [39], 6.8% (4/58) for PR [40], and 6.2% (4/65) for HER2 [41], respectively. Therefore, the false negative misclassified samples identified by MTLEN were more likely to be close to the reality than enetLTS.
A large class of computational problems in robust statistics can be formulated as the selection of the optimal subset of data based on some criterion function [10]. ARCstep algorithm, as the improvement of Cstep algorithm, can be extended to other robust models with regularized parameters. It is an effective algorithm for finding the most suitable subset of regularized models, such as robust Adaptive LASSO, Group LASSO, SCAD, and MCP. The ARCstep algorithm can be extended to other generalized linear models, such as penalized multiclass logistic regression and penalized Poisson regression.
7. Conclusion
ARCstep can solve the problem of the Cstep algorithm not converging because the regularized parameters change during the iteration. It provides an idea for developing the efficient algorithm of robust penalized regression based on trimming. The ARCstep algorithm can be extended to other robust models with regularized parameters. In practice, MTLEN using ARCstep algorithm is the recommended method for mislabeled sample identification in omics data because of its high accuracy and high operation speed. When the proportion of mislabeled samples is relatively low and ≤5%, Ensemble can be used for variable selection. When the proportion of mislabeled samples is >5%, Ensemble can be used for variable selection on a subset of data after removing mislabeled samples identified by MTLEN.
Data Availability
Code is available on Github (https://github.com/hwsun2000/ARCstep). The BRCA RNASeq FPKM dataset was imported using the “brca.data” R package (https://github.com/averissimo/brca.data/releases/download/1.0/brca.data_1.0.tar.gz).
Disclosure
The funders played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Conflicts of Interest
The authors declare that they have no conflicts of interests.
Acknowledgments
This research work was funded by the National Natural Science Foundation for Young Scholars of China (Grant No. 81502891) and the National Natural Science Foundation of China (Grant No. 81872715).