An Efficient Algorithm for the Detection of Outliers in Mislabeled Omics Data

Sun, Hongwei; Wang, Jiu; Zhang, Zhongwen; Hu, Naibao; Wang, Tong

doi:https://doi.org/10.1155/2021/9436582

Computational and Mathematical Methods in Medicine

Research Article | Open Access

Volume 2021 | Article ID 9436582 | https://doi.org/10.1155/2021/9436582

Hongwei Sun,¹,² Jiu Wang,¹ Zhongwen Zhang,¹ Naibao Hu,¹ and Tong Wang²

Academic Editor: Po-Hsiang Tsui

Received 13 May 2021

Accepted 30 Nov 2021

Published 22 Dec 2021

Abstract

High dimensionality and noise make it difficult to detect disease-related biomarkers in omics data. Our previous study showed that penalized maximum trimmed likelihood estimation is effective in identifying mislabeled samples in high-dimensional data with labeling errors. However, the algorithm commonly used in these studies is the concentration step (C-step), and when the C-step algorithm is applied to robust penalized regression, it does not ensure that the criterion function improves from one iteration to the next, because the regularized parameters change during the iteration. This makes the C-step algorithm run very slowly, especially on high-dimensional omics data. The AR-Cstep algorithm (C-step combined with an acceptance-rejection scheme) is proposed. In simulation experiments, the AR-Cstep algorithm converged faster (the average computation time was only 2% of that of the C-step algorithm) and was more accurate in terms of variable selection and outlier identification than the C-step algorithm. The two algorithms were further compared on triple negative breast cancer (TNBC) RNA-seq data. AR-Cstep can solve the problem of the C-step not converging and ensures that the iterative process moves in the direction that improves the criterion function. As an improvement of the C-step algorithm, the AR-Cstep algorithm can be extended to other robust models with regularized parameters.

1. Introduction

The first challenge presented by omics data is the high dimension, which far exceeds the sample size. The second challenge is the presence of noise in the omics data. This noise may be caused by misdiagnosis, mislabeling, recording errors, technical problems in the laboratory, or sample heterogeneity [1, 2]. Penalized regression is a common method for variable selection and prediction in high-dimensional datasets. It has been applied to omics data such as gene expression [11], GWAS [3], and DNA methylation [4]. However, outliers in the data make the estimates of penalized regression inaccurate, so biomarkers cannot be properly screened. Additionally, identifying and further investigating these outliers can help correct errors made during the experiment or investigation. Therefore, it is very important to develop robust statistical methods for penalized regression.

A robust estimation method, least trimmed squares (LTS), was proposed by Rousseeuw [5]. LTS is highly robust to outliers in both the response and the predictors. It is effective for identifying outliers and can handle the masking phenomenon caused by the coexistence of multiple outliers [5, 6]. Alfons et al. [6] applied LTS to LASSO-type penalized linear regression to solve the problem of robust high-dimensional variable selection when the dependent variable is quantitative. Kurnaz et al. [7] applied LTS to elastic net- (EN-) type penalized linear and logistic regression to solve the problem of robust high-dimensional variable selection when the dependent variable is quantitative or binary (enetLTS).

Both studies adopted the concentration step (C-step) in the FAST-LTS algorithm proposed by Rousseeuw and Van Driessen [8]. The basic idea is an inequality involving order statistics and sums of squared residuals. This inequality guarantees that the criterion function declines monotonically as the iteration progresses. However, when the C-step is applied to penalized regression based on trimming, the inequality does not necessarily hold because the regularized parameters change. Thus, the criterion function cannot be guaranteed to decrease. Through our previous simulation study [9], we found that enetLTS is effective in identifying mislabeled samples in high-dimensional data with labeling errors. However, we also found that for a simulated dataset with an outlier ratio of 10%, it takes nearly 2 hours (Intel Core i7-6500U @2.50GHz) to run enetLTS once. For the omics data in the real data analysis, the running time of enetLTS is about 77.8 hours (Intel Xeon Silver 4112 @2.60GHz), which does not meet the requirements for efficient data processing.

Therefore, the C-step algorithm needs to be improved to adapt to high-dimensional data. In this study, the AR-Cstep algorithm is proposed to solve the estimation of robust penalized regression based on trimming. It combines the C-step algorithm with the acceptance-rejection algorithm proposed by Chakraborty and Chaudhuri [10]. The two algorithms are compared in terms of variable selection accuracy, outlier identification accuracy, and computation speed in a simulation study. An RNA-seq dataset for triple negative breast cancer (TNBC) [1] that contains 28 samples with discordant labels obtained from different tests (immunohistochemical (IHC) method or fluorescence in situ hybridization (FISH)) is used to illustrate the application of the two algorithms.

The remainder of this article is organized as follows. A robust penalized logistic regression model based on trimming is introduced in Section 2. The AR-Cstep algorithm is proposed and described in Section 3. In Section 4, simulation experiments are described that compare the MTL-EN (elastic net-type maximum trimmed likelihood) estimation using the AR-Cstep algorithm with enetLTS. The results of enetLTS and MTL-EN applied to a triple negative breast cancer (TNBC) RNA-seq dataset are compared in Section 5. We end with a discussion in Section 6 and a conclusion in Section 7.

2. Robust Penalized Logistic Regression Model Based on Trimming

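Kurnaz et al. [7] proposed an EN-type penalized logistic regression based on trimming:

$$\hat{\beta}_{\mathrm{enetLTS}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{h} d_{(i)}(\beta) + h\lambda P_{\alpha}(\beta) \right\},$$

where $P_{\alpha}(\beta) = \alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2$ is the elastic net penalty, and where $d_{(1)}(\beta) \le \dots \le d_{(n)}(\beta)$ are the ordered deviances $d_i(\beta) = -y_i x_i^{T}\beta + \log\left(1+\exp(x_i^{T}\beta)\right)$. The subset size $h \le n$ is obtained by rounding down ($\lfloor\cdot\rfloor$ means rounding down to the nearest integer) the number of observations retained after trimming, so that roughly $1 - h/n$ is the trimmed portion. Compared with EN, enetLTS only retains the $h$ observations with the smallest deviances, whereas the $n-h$ least likely observations under the given model are excluded.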

This robust penalized logistic regression model based on trimming was denoted as enetLTS (robust EN based on the LTS) by Kurnaz et al. [7], who adopted the C-step algorithm. We denote the resulting estimate as $\hat{\beta}_{\mathrm{enetLTS}}$ in this paper. The estimate of the same model obtained by the AR-Cstep algorithm is recorded as the EN-type maximum trimmed likelihood estimate $\hat{\beta}_{\mathrm{MTL\text{-}EN}}$.
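As a minimal sketch of the trimmed criterion, the following Python functions evaluate the sum of the h smallest logistic deviances plus an elastic net penalty for a given coefficient vector. The deviance form, the h·λ scaling of the penalty, the toy data, and all function names are assumptions made for illustration; they are not taken from the enetLTS or AR-Cstep implementations.

```python
import math

def deviance(xb, y):
    """Negative log-likelihood of one observation for linear predictor xb."""
    # log(1 + exp(xb)) computed in a numerically stable way
    if xb > 0:
        softplus = xb + math.log1p(math.exp(-xb))
    else:
        softplus = math.log1p(math.exp(xb))
    return -y * xb + softplus

def en_penalty(beta, alpha):
    """Elastic net penalty: alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return alpha * l1 + 0.5 * (1.0 - alpha) * l2

def trimmed_criterion(X, y, beta, h, lam, alpha):
    """Sum of the h smallest deviances plus h * lam * penalty (assumed scaling)."""
    devs = [deviance(sum(xj * bj for xj, bj in zip(xi, beta)), yi)
            for xi, yi in zip(X, y)]
    kept = sorted(devs)[:h]  # trimming: drop the n - h largest deviances
    return sum(kept) + h * lam * en_penalty(beta, alpha)

# Toy data (hypothetical): the last observation looks mislabeled (large x, y = 0).
X = [[-2.0], [-1.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1, 0]
full = trimmed_criterion(X, y, [1.0], h=5, lam=0.0, alpha=0.5)
trimmed = trimmed_criterion(X, y, [1.0], h=4, lam=0.0, alpha=0.5)
```

In this toy case the only difference between `full` and `trimmed` is the deviance of the suspicious last observation, which is the one that trimming discards.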

3. Algorithm

3.1. C-Step Algorithm

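Kurnaz et al. [7] adopted the C-step algorithm in enetLTS. This algorithm is described below.

Let $Q(H,\beta)$ be the criterion function of the penalized logistic regression based on the subsample $H \subseteq \{1,\dots,n\}$, where $|H| = h$. Thus,

$$Q(H,\beta) = \sum_{i\in H} d_i(\beta) + h\lambda P_{\alpha}(\beta),$$

where $d_i(\beta)$ is the negative log-likelihood (deviance) contribution of observation $i$ and $P_{\alpha}(\beta)$ is the elastic net penalty. Additionally, $\hat{\beta}_H$ represents $\arg\min_{\beta} Q(H,\beta)$.

When the regularized parameters $\lambda$ and $\alpha$ are fixed, at the $k$th step of the iteration, $H_k$ is the current subset with $h$ observations, and $\hat{\beta}_{H_k}$ is the solution of the penalized logistic regression based on $H_k$. The negative log-likelihood functions corresponding to all $n$ observations can be derived from $\hat{\beta}_{H_k}$. The subsample $H_{k+1}$ consists of the $h$ observations with the smallest negative log-likelihood, that is,

$$H_{k+1} = \{\pi(1),\dots,\pi(h)\},$$

where $d_{\pi(1)}(\hat{\beta}_{H_k}) \le \dots \le d_{\pi(n)}(\hat{\beta}_{H_k})$ and $\pi$ is a permutation of $\{1,\dots,n\}$.

Thus, $H_{k+1}$ can be obtained. $H_{k+1}$ is the subset that minimizes the criterion function under the solution $\hat{\beta}_{H_k}$. Then, penalized logistic regression is applied to subset $H_{k+1}$. If $\lambda$ and $\alpha$ are unchanged, we get the solution $\hat{\beta}_{H_{k+1}}$, which minimizes the criterion function on $H_{k+1}$ under the regularization parameters $\lambda$ and $\alpha$. Thus $Q(H_{k+1},\hat{\beta}_{H_{k+1}}) \le Q(H_{k+1},\hat{\beta}_{H_k})$ holds. Therefore, when $(\lambda,\alpha)$ is fixed,

$$Q(H_k,\hat{\beta}_{H_k}) \ge Q(H_{k+1},\hat{\beta}_{H_k}) \ge Q(H_{k+1},\hat{\beta}_{H_{k+1}}).$$

The definition of $H_{k+1}$ makes the first inequality hold. The definition of $\hat{\beta}_{H_{k+1}}$ makes the second inequality hold.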

For the C-step algorithm, the candidate subset $H_{k+1}$ is constructed by selecting the $h$ samples with the smallest negative log-likelihood contribution under the current solution $\hat{\beta}_{H_k}$. Then, the C-step algorithm continues until the criterion function no longer decreases.

Therefore, when $\lambda$ and $\alpha$ remain unchanged, the criterion function does not increase as the number of iterations grows. Because the criterion function is nonnegative and the number of subsets with sample size $h$ is finite, the C-step algorithm must converge after a finite number of steps, in general to a local minimum of the criterion function.

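Algorithm 1: C-step algorithm.
While (continueCstep)
do
    Penalized logistic regression is applied to the current subset $H_k$ to get $\hat{\beta}_{H_k}$.
    For $\hat{\beta}_{H_k}$, the deviance $d_i(\hat{\beta}_{H_k})$ of every observation, $i = 1,\dots,n$, is computed, and the observations are sorted according to their deviances:
    $d_{\pi(1)}(\hat{\beta}_{H_k}) \le \dots \le d_{\pi(n)}(\hat{\beta}_{H_k})$.
    The $h$ observations with the smallest negative log-likelihood are retained to form the subset $H_{k+1}$.
end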

The C-step algorithm is described in Algorithm 1, where “continueCstep” remains true until the absolute value of the difference between the likelihood functions of two successive iterations is less than some small value.
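The loop in Algorithm 1 can be sketched in Python as follows. To keep the example dependency-free, a toy location model (subset mean as the fit, squared residual as the deviance) is swapped in for penalized logistic regression; this stand-in, the tolerance, and the function names are assumptions for illustration only. With no regularized parameters to retune, the recorded criterion values cannot increase.

```python
def concentration_step(deviances, h):
    """Indices of the h observations with the smallest deviances."""
    order = sorted(range(len(deviances)), key=lambda i: deviances[i])
    return sorted(order[:h])

def c_step(fit, deviance, data, subset, h, tol=1e-8, max_iter=100):
    """Iterate fit -> deviances -> concentration until the criterion stalls."""
    history, previous = [], None
    for _ in range(max_iter):
        model = fit([data[i] for i in subset])
        devs = [deviance(model, obs) for obs in data]
        criterion = sum(devs[i] for i in subset)
        history.append(criterion)
        if previous is not None and abs(previous - criterion) < tol:
            break  # "continueCstep" becomes false
        previous = criterion
        subset = concentration_step(devs, h)
    return subset, history

# Toy stand-in model (assumption): fit = subset mean, deviance = squared residual.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

def fit_mean(values):
    return sum(values) / len(values)

def sq_resid(mean, x):
    return (x - mean) ** 2

final_subset, history = c_step(fit_mean, sq_resid, data, [2, 3, 4], h=3)
```

Starting from a subset that contains the gross outlier 100.0, the iteration moves to a clean subset after one concentration step and then stops.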

However, when penalized regression is performed on the subset $H_{k+1}$, the regularized parameters $\lambda$ and $\alpha$ are not fixed. They are usually determined from the data, for example by cross-validation. The regularized parameters determined for penalized regression on two different subsets are often different, so the second inequality above, which relies on the refitted solution minimizing the same criterion function, does not necessarily hold.

One way to solve the problem is to fix a grid of $\lambda$ and $\alpha$ values in advance. For each combination of $\lambda$ and $\alpha$, the C-step algorithm is run until convergence. Then, the convergent subsets under the different regularized parameters are compared, and the subset that minimizes the criterion function is selected. If one parameter takes 40 values and the other takes 20, there are 800 parameter combinations. This means running the C-step algorithm 800 times, which makes the algorithm very slow.

3.2. AR-Cstep Algorithm

In this study, the AR-Cstep algorithm is proposed to solve the estimation of the robust penalized regression based on trimming. It combines the C-step algorithm with the acceptance-rejection algorithm proposed by Chakraborty and Chaudhuri [10].

3.2.1. Acceptance-Rejection Algorithm

The acceptance-rejection algorithm is similar to the Metropolis-Hastings algorithm in MCMC. Let $H_k$ represent the subset at the $k$th step of the iteration. Then, a randomly selected sample outside of $H_k$ replaces one of the samples in $H_k$ to form the candidate subset $H_c$. The corresponding likelihood function is obtained after penalized regression is performed on $H_c$. If the criterion function corresponding to $H_c$ is better than that corresponding to the current subset $H_k$, then $H_c$ is accepted as $H_{k+1}$ with probability one, that is, $H_{k+1} = H_c$. Otherwise, $H_c$ is accepted as $H_{k+1}$ only with a certain probability $p$, so that the algorithm can escape a local optimum.

In the acceptance-rejection algorithm, the candidate sample at each step is randomly selected from the samples outside the current subset $H_k$. Thus, whether the candidate subset improves the criterion function is completely random, which slows the convergence of the iteration. The advantage of this algorithm is that each step checks whether the criterion function of the candidate subset is better than that of the current subset. Moreover, the subset with the best criterion function found so far is recorded at each step.
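The random proposal described above can be sketched as a single swap: one member of the current subset leaves and one outside sample enters. The function name and the seeded generator are hypothetical choices for illustration, not part of the original implementation.

```python
import random

def propose_swap(subset, n, rng):
    """Replace one random member of `subset` by one random outside index."""
    inside = list(subset)
    members = set(inside)
    outside = [i for i in range(n) if i not in members]
    leaving = rng.choice(inside)    # sample that leaves the current subset
    entering = rng.choice(outside)  # sample drawn from the complementary set
    return sorted([i for i in inside if i != leaving] + [entering])

rng = random.Random(0)  # fixed seed so the sketch is reproducible
current = [0, 1, 2]
candidate = propose_swap(current, n=6, rng=rng)
```

Because the entering sample is chosen blindly, nothing in this proposal favors candidates that improve the criterion function, which is the slowness discussed above.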

3.2.2. AR-Cstep Algorithm

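The changes of the regularized parameters $\lambda$ and $\alpha$ prevent the C-step algorithm from converging steadily to the subset with the smallest criterion function. Suppose the current subset is $H_k$, and we obtain $\hat{\beta}_{H_k}$, $\lambda_k$, $\alpha_k$, and the corresponding criterion function $Q_k = Q(H_k,\hat{\beta}_{H_k})$ after the penalized regression is performed on $H_k$. The $h$ observations with the smallest negative log-likelihood constitute the candidate subset $H_c$, so that $Q(H_c,\hat{\beta}_{H_k}) \le Q(H_k,\hat{\beta}_{H_k})$ holds. Then, penalized regression is performed on $H_c$, $\hat{\beta}_{H_c}$ is obtained, and the corresponding regularized parameters change to $\lambda_c$ and $\alpha_c$. The corresponding criterion function $Q_c = Q(H_c,\hat{\beta}_{H_c})$ is not necessarily less than $Q_k$. The AR-Cstep algorithm adds the step of comparing the criterion function of the candidate subset $H_c$ with that of the current subset $H_k$. If $Q_c < Q_k$, the candidate is accepted, that is, $H_{k+1} = H_c$. If $Q_c \ge Q_k$, to avoid falling into a local optimum, a random number $U$ is drawn from the Bernoulli distribution with parameter $p$, where $p$ is the acceptance probability. If $U = 1$, then $H_{k+1} = H_c$. If $U = 0$, then $H_{k+1} = H_k$, that is, no replacement. The initial subset and its criterion function are recorded as the optimal subset $H_{\mathrm{opt}}$ and the optimal value $Q_{\mathrm{opt}}$. At each step of the iteration, the criterion function $Q_{k+1}$ is compared with $Q_{\mathrm{opt}}$. If $Q_{k+1} < Q_{\mathrm{opt}}$, then $H_{\mathrm{opt}} = H_{k+1}$ and $Q_{\mathrm{opt}} = Q_{k+1}$. $H_{\mathrm{opt}}$ in the last step is the solution.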

To keep the proportion of samples with $y=1$ in the candidate subset consistent with that in the full set, the samples constituting the candidate subset $H_c$ are selected in the following manner. $H_c$ consists of the $h_0$ observations with the smallest deviances among the observations with $y=0$ ($n_0$ observations in total) and the $h_1$ observations with the smallest deviances among the observations with $y=1$ ($n_1$ observations in total), where $h_0$ and $h_1$ are obtained from $n_0$ and $n_1$ by applying the trimming ratio and rounding down, and $h = h_0 + h_1$. In comparison with the acceptance-rejection algorithm, for which $H_c$ is formed with samples selected randomly from the complementary set, $H_c$ of AR-Cstep is composed of the observations with the smallest deviances; that is, each sample of $H_c$ contains information that improves the criterion function; hence, the algorithm converges faster to the subset with the optimal criterion function. The AR-Cstep algorithm is described in Algorithm 2.
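A minimal sketch of this class-stratified selection is given below. The group size rule `floor((n_g + 1) * (1 - trim))`, the cap at the group size, and the function name are assumptions made for illustration; the source text only states that the group sizes are rounded down according to the trimming ratio.

```python
import math

def candidate_subset(deviances, labels, trim=0.25):
    """Keep the smallest-deviance observations separately within y = 0 and y = 1."""
    chosen = []
    for group in (0, 1):
        idx = [i for i, lab in enumerate(labels) if lab == group]
        # Assumed group size: floor((n_g + 1) * (1 - trim)), capped at n_g.
        h_g = min(len(idx), math.floor((len(idx) + 1) * (1.0 - trim)))
        idx.sort(key=lambda i: deviances[i])
        chosen.extend(idx[:h_g])
    return sorted(chosen)

# Hypothetical deviances: observations 1 and 5 fit the model poorly.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
devs = [0.1, 5.0, 0.2, 0.3, 0.4, 9.0, 0.1, 0.2]
subset = candidate_subset(devs, labels, trim=0.25)
```

Selecting within each class keeps the class balance of the candidate subset close to that of the full data, which a single global ranking of deviances would not guarantee.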

The acceptance probability $p$ is inversely proportional to the absolute value of the difference between the two likelihood functions $Q_c$ and $Q_k$. It is also inversely proportional to $k$, the index of the current iteration step, so worse candidates are accepted less often as the iteration proceeds. Similar to the study of Farcomeni and Viviani [12], the acceptance probability is also inversely proportional to the sample size of the subset: when other features remain unchanged, the larger the sample size of the subset, the smaller the probability of being accepted. Additionally, if the current subset is not replaced after a fixed number of consecutive iterations (tracked by $r$ in Algorithm 2), the iteration process is stopped.

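Algorithm 2: AR-Cstep algorithm.
k represents the kth iteration, and r counts the consecutive iterations in which the current subset has not been replaced.
While (k<=kmax & r<=2)
do
    Penalized regression is applied to the current subset $H_k$ to get $\hat{\beta}_{H_k}$.
    Under $\hat{\beta}_{H_k}$, the deviance $d_i$ corresponding to each sample is derived. The current criterion function is $Q_k = Q(H_k,\hat{\beta}_{H_k})$.
    Candidate subset $H_c = H_c^{0} \cup H_c^{1}$, where
    $H_c^{0}$ contains the $h_0$ observations with the smallest $d_i$ in $I_0$, and $I_0$ is the index of individuals with $y=0$;
    $H_c^{1}$ contains the $h_1$ observations with the smallest $d_i$ in $I_1$, and $I_1$ is the index of individuals with $y=1$;
    $h_0$ and $h_1$ are the trimmed group sizes, and $h = h_0 + h_1$.
    Penalized regression is applied to $H_c$ to get $\hat{\beta}_{H_c}$.
    Under $\hat{\beta}_{H_c}$, the deviance $d_i$ corresponding to each sample is derived. The corresponding criterion function is $Q_c = Q(H_c,\hat{\beta}_{H_c})$.
    If $Q_c < Q_k$ then
        $H_{k+1} = H_c$
    If $Q_c \ge Q_k$, then
        U is a random number that obeys the Bernoulli distribution with the parameter p.
        if U=1 then
            $H_{k+1} = H_c$
        else
            $H_{k+1} = H_k$
        end
    end
end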
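The control flow of Algorithm 2 can be sketched as below. The acceptance probability used here, exp(-log(k + 1) · |Q_c − Q_k| / scale), is an assumed Metropolis-type form that shrinks with the iteration index and with the gap between the two criterion values; it is not the paper's formula. The `evaluate` and `propose` callbacks, the `scale` constant, and the toy location model are likewise hypothetical stand-ins for penalized regression.

```python
import math
import random

def ar_cstep(evaluate, propose, subset, k_max=50, r_max=2, scale=1.0, seed=0):
    """evaluate(subset) -> (criterion, deviances); propose(deviances) -> subset."""
    rng = random.Random(seed)
    q_cur, devs = evaluate(subset)
    best_subset, best_q = subset, q_cur
    k, r = 1, 0
    while k <= k_max and r <= r_max:
        cand = propose(devs)
        q_cand, devs_cand = evaluate(cand)
        if q_cand < q_cur:
            accept = True
        else:
            # Assumed Metropolis-type probability, shrinking with k and the gap.
            p = math.exp(-math.log(k + 1) * abs(q_cand - q_cur) / scale)
            accept = rng.random() < p
        if accept and cand != subset:
            subset, q_cur, devs = cand, q_cand, devs_cand
            r = 0
        else:
            r += 1  # the current subset was not replaced
        if q_cur < best_q:
            best_subset, best_q = subset, q_cur  # best subset seen so far
        k += 1
    return best_subset, best_q

# Toy stand-in (assumption): subset mean as the model, squared residuals as deviances.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

def evaluate(subset):
    mean = sum(data[i] for i in subset) / len(subset)
    devs = [(x - mean) ** 2 for x in data]
    return sum(devs[i] for i in subset), devs

def propose(devs, h=3):
    return sorted(sorted(range(len(devs)), key=lambda i: devs[i])[:h])

best_subset, best_q = ar_cstep(evaluate, propose, [2, 3, 4])
```

Tracking `best_subset` separately from the current subset means that an accepted worse candidate can never overwrite the best solution found so far.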

To ensure that the initial subset does not contain outliers, its sample size should be small. The initial subset consisted of six observations, three randomly selected from each of the groups $y=0$ and $y=1$. To help the algorithm reach the global optimum, multiple initial subsets were used.

First, two iterations of AR-Cstep were performed on each of 500 initial subsets, and 500 updated subsets were obtained. Then, the 10 subsets with the smallest criterion function were retained, and AR-Cstep was performed on these 10 subsets until convergence. Among the 10 convergent subsets, the subset with the smallest criterion function was selected, denoted by $H_{\mathrm{opt}}$. The penalized regression was performed on $H_{\mathrm{opt}}$, and the raw estimate $\hat{\beta}_{\mathrm{raw}}$ was obtained.
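The multi-start strategy above (a few cheap iterations from many starts, keep the best few, then iterate those to convergence) can be sketched generically. The `step` and `criterion` callbacks and the toy location model are hypothetical stand-ins; in the setting of this paper, `step` would be one AR-Cstep iteration.

```python
def multi_start(starts, step, criterion, short_steps=2, keep=10, max_iter=100):
    """Screen many initial subsets cheaply, then fully iterate the best ones."""
    # Stage 1: a few iterations from every initial subset.
    refined = []
    for subset in starts:
        for _ in range(short_steps):
            subset = step(subset)
        refined.append(subset)
    # Stage 2: keep the subsets with the smallest criterion function.
    refined.sort(key=criterion)
    survivors = refined[:keep]
    # Stage 3: iterate each survivor until its subset stops changing.
    finished = []
    for subset in survivors:
        for _ in range(max_iter):
            nxt = step(subset)
            if nxt == subset:
                break
            subset = nxt
        finished.append(subset)
    return min(finished, key=criterion)

# Toy stand-in (assumption): location model with two gross outliers.
data = [1.0, 2.0, 3.0, 4.0, 100.0, 101.0]

def criterion(subset):
    mean = sum(data[i] for i in subset) / len(subset)
    return sum((data[i] - mean) ** 2 for i in subset)

def step(subset, h=3):
    mean = sum(data[i] for i in subset) / len(subset)
    order = sorted(range(len(data)), key=lambda i: (data[i] - mean) ** 2)
    return sorted(order[:h])

best = multi_start([[0, 4, 5], [2, 3, 4], [0, 1, 5]], step, criterion, keep=2)
```

One of the three starts is trapped by the outliers, and the screening stage discards it before the more expensive final iterations.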

3.2.3. Reweighted Step

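In this article, we choose a subset size $h$ that corresponds to a trimmed portion of 25%, which reflects the initial guess that fewer than 25% of the observations are outliers. This is a rather conservative estimate of the proportion of outliers, and there may not be so many outliers in the data. Therefore, a reweighted step is used to detect outliers via the raw estimate obtained on the optimal subset, denoted $\hat{\beta}_{\mathrm{raw}}$. These outliers are then excluded, and a new subset $H_{\mathrm{rwt}}$ is obtained. EN-type penalized logistic regression is applied to $H_{\mathrm{rwt}}$ to get the solution $\hat{\beta}_{\mathrm{rwt}}$. Usually, the size of $H_{\mathrm{rwt}}$ is larger than $h$, so the additional samples can improve the performance of $\hat{\beta}_{\mathrm{rwt}}$ compared with $\hat{\beta}_{\mathrm{raw}}$. We call $\hat{\beta}_{\mathrm{rwt}}$ reweighted MTL-EN (Rwt MTL-EN). To distinguish them, the unweighted $\hat{\beta}_{\mathrm{raw}}$ is called Raw MTL-EN.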

3.2.4. Choice of the Regularized Parameters and Standardization of Predictors

We select $\lambda$ over a grid of values in the interval $(0, \lambda_{\max}]$, as discussed by Breheny and Huang [13], where $\lambda_{\max}$ is computed from the dependent variable $y$ and the $j$th independent variable $x_j$. In the iteration steps of AR-Cstep, we take a coarse grid with steps of size 0.05 to reduce the computational burden. In the reweighted step, we take a finer grid with steps of size 0.01 to derive the reweighted solution. The value of $\alpha$ is selected by cross-validation in the interval [0.1, 1] with a step size of 0.1.

It is advisable to standardize the predictors before applying penalized regression. Standardization mainly aims to eliminate the influence of the scale and units of a predictor. However, the mean and standard deviation computed from all samples are not robust to outliers. In the algorithm described above, penalized regression is applied to a subset in every iteration step of AR-Cstep. Therefore, we first compute the mean and standard deviation of each predictor from the subsample. Then, we standardize all samples with this mean and standard deviation before applying penalized regression.
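The subset-based standardization can be sketched as follows. The use of the sample standard deviation, the guard for constant columns, and the function name are assumptions for illustration.

```python
import statistics

def standardize_with_subset(X, subset):
    """Center and scale every row using column statistics of the subset rows only."""
    n_cols = len(X[0])
    means, sds = [], []
    for j in range(n_cols):
        col = [X[i][j] for i in subset]
        means.append(statistics.mean(col))
        sd = statistics.stdev(col)          # sample SD (assumed choice of estimator)
        sds.append(sd if sd > 0 else 1.0)   # guard against constant columns
    return [[(row[j] - means[j]) / sds[j] for j in range(n_cols)] for row in X]

# Hypothetical single-predictor data: the last row is an outlier outside the subset.
X = [[1.0], [2.0], [3.0], [100.0]]
Z = standardize_with_subset(X, subset=[0, 1, 2])
```

Because the outlying row does not contribute to the mean or standard deviation, it remains clearly extreme after scaling instead of distorting the scale of the clean rows.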

4. Simulation Study

4.1. Comparison of MTL-EN and enetLTS on Outlier Detection and Variable Selection

Simulation settings were the same as in Sun et al. [9]. The trimming parameter of both enetLTS and MTL-EN was set so that the trimmed rate was 25%. The parameters in Ensemble followed Lopes et al. [1].

In the simulation experiment, we compared the two methods enetLTS and MTL-EN using C-step and AR-Cstep algorithms, respectively. Through our previous research [9] and subsequent simulation experiments, we can see that enetLTS is good at identifying outliers. However, the FDR of its variable selection is high, and many unrelated variables are identified. When encountering mislabeled omics data, we can combine enetLTS with Ensemble. Running Ensemble on a subset of data after removing the outliers identified by enetLTS improved the variable selection accuracy. Then, we added the third method Ensemble to the simulation experiment. A detailed description of Ensemble is provided in our previous study [9].

The performances of the three methods are summarized in Figure 1.

The outlier detection accuracy of the three methods is shown in Figure 1. Here, we used two indicators Sn (sensitivity) and FPR (False Positive Rate) [14]. Sn represents the proportion of true misclassified individuals identified as misclassified ones among all true misclassified observations. FPR represents the proportion of individuals with correct labels that are wrongly categorized as misclassified ones.
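The two indicators defined above can be computed from index sets as in the following sketch; the function name and the example indices are hypothetical.

```python
def outlier_metrics(true_outliers, flagged, n):
    """Sensitivity and false positive rate of an outlier detection result."""
    true_outliers, flagged = set(true_outliers), set(flagged)
    # Sn: share of truly mislabeled samples that were flagged.
    sn = len(true_outliers & flagged) / len(true_outliers)
    # FPR: share of correctly labeled samples that were wrongly flagged.
    fpr = len(flagged - true_outliers) / (n - len(true_outliers))
    return sn, fpr

# Hypothetical example: 10 samples, 4 truly mislabeled, 4 flagged (3 correctly).
sn, fpr = outlier_metrics(true_outliers={0, 1, 2, 3}, flagged={0, 1, 2, 9}, n=10)
```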

The outliers identified by MTL-EN had a higher Sn than those identified by enetLTS. When the proportion of outliers was 10% or 15%, the gap between them widened further. The FPRs of MTL-EN were close to those of enetLTS. Ensemble had the lowest Sn and the lowest FPRs among the three methods. Overall, MTL-EN had the best accuracy in identifying outliers.

The variable selection accuracy of the three methods is shown in Figure 1. PSR (Positive Selection Rate) indicates the proportion of true disease-related biomarkers identified in all true disease-related biomarkers. FDR (False Discovery Rate) represents the proportion of biomarkers that are not related to disease among all the screened biomarkers. A comprehensive indicator GM [15, 16] for the accuracy of variable selection was used, which is the geometric mean of PSR and (1 − FDR). High accuracy of variable selection is indicated by a high GM.
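The three variable selection indicators follow directly from the sets of true and selected biomarkers, as in this sketch; the function name and the placeholder gene labels are hypothetical.

```python
import math

def selection_metrics(true_vars, selected):
    """PSR, FDR, and their combined indicator GM = sqrt(PSR * (1 - FDR))."""
    true_vars, selected = set(true_vars), set(selected)
    psr = len(true_vars & selected) / len(true_vars)
    fdr = len(selected - true_vars) / len(selected) if selected else 0.0
    gm = math.sqrt(psr * (1.0 - fdr))
    return psr, fdr, gm

# Hypothetical example: 4 true biomarkers, 4 selected, 3 of them correct.
psr, fdr, gm = selection_metrics({"g1", "g2", "g3", "g4"}, {"g1", "g2", "g3", "x9"})
```

GM is high only when PSR is high and FDR is low at the same time, so a method that selects nearly every variable is penalized through its FDR.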

The variable selection accuracy of MTL-EN was very similar to that of enetLTS, with both high PSR and high FDR. As also shown in our previous study [9], Ensemble had the highest variable selection accuracy with a much lower FDR; however, Ensemble missed some associated variables when the proportion of outliers was 10% or 15%.

In terms of variable selection, Ensemble performed best when the proportion of outliers was small. However, its accuracy decreased greatly when the proportion of outliers was large. In terms of outlier detection, regardless of the proportion of outliers, MTL-EN had the highest accuracy among the three methods.

4.2. Combining with Ensemble to Improve the Accuracy of Variable Selection

In our previous study [9], we considered a two-step procedure for the case in which the proportion of outliers was relatively large. We found that applying Ensemble to the subset with the outliers identified by enetLTS removed improved the variable selection accuracy. In this study, we also used MTL-EN to detect outliers and then applied Ensemble to the subset with those outliers removed. The results for MTL-EN and enetLTS were compared by simulation and are shown in Table 1.

From Table 1, compared with the results on the original data, the PSR of Ensemble rose from 0.533 to 0.644, and the GM improved from 0.714 to 0.786 for the subset obtained after removing the outliers identified by enetLTS. For the subset with the outliers identified by MTL-EN removed, the results of Ensemble also improved, with PSR increasing from 0.533 to 0.708 and GM increasing from 0.714 to 0.828. Thus, the variable selection accuracy of Ensemble was highest after removing the outliers identified by MTL-EN.

4.3. The Computation Times of enetLTS and MTL-EN

From Table 2, the computation time of enetLTS is 39 times that of MTL-EN (Intel Core i7-6500U @2.50GHz); that is, the computation time of MTL-EN was about 2% to 3% (1/39 ≈ 2.6%) of that of enetLTS. This is because the C-step algorithm used by enetLTS does not take into account that the regularized parameters need to be determined at each step of the iteration. The criterion function cannot be guaranteed to decrease gradually, which makes the algorithm converge slowly. The AR-Cstep algorithm adopted by MTL-EN solves this problem well, which greatly improves the convergence speed.

5. Case Study

In the previous study [9], we compared the application of enetLTS, Ensemble, and Rlogreg on a TNBC dataset from the TCGA-BRCA data collection. The results showed that enetLTS identified 68 outliers, seven of which were individuals with inconsistent labels. After removing the outliers identified by enetLTS, the prediction accuracy of the three Ensemble models was improved, and the number of associated genes identified increased from 5 to 9. In this study, we applied MTL-EN to this TNBC dataset. The outliers identified by MTL-EN were compared with those by enetLTS, and we also compared the performances of Ensemble after removing the outliers identified by MTL-EN and enetLTS, respectively.

From Tables 3 and 4, among the 68 outliers identified by enetLTS, 3 were labeled as TNBC, all of which were also identified by MTL-EN; the other 65 individuals had non-TNBC labels, and 35 of them were also identified by MTL-EN. In other words, 38 of the 47 outliers identified by MTL-EN were also identified by enetLTS. However, nine patients with TNBC labels identified by MTL-EN were not identified by enetLTS. These nine patients showed high expression of one or more of the three key genes (ER, PR, and HER2): TCGA-BH-A42U (HER2 38.37), TCGA-E2-A1L7 (ER 29.61, PR 22.98), TCGA-OL-A97C (PR 8.56), TCGA-A2-A1G6 (ER 23.90, PR 21.45, HER2 29.74), TCGA-A2-A0EQ (ER 2.13, HER2 30.15), TCGA-EW-A1OV (HER2 28.91), TCGA-OL-A5D6 (HER2 72.13), TCGA-C8-A26X (HER2 60.12), and TCGA-LL-A740 (HER2 68.56). They were therefore more likely to be non-TNBC patients; that is, their labels were probably wrong. Seven of the 47 outliers identified by MTL-EN were suspect individuals with inconsistent HER2 labels. Six of them were labeled as non-TNBC and were also detected by enetLTS. The remaining one, “TCGA-A2-A0EQ”, was labeled as TNBC and was not detected by enetLTS.

A total of 213 genes were identified by MTL-EN, and the 40 genes with the largest absolute coefficients are listed in Table 5. Among them, FOXA1 [17], ERBB2 [18], GRB7 [19], KRT16 [20], CXXC5 [21], FOXC1 [22], TFF3 [23], COL9A3 [24], FABP7 [25], CCNE1 [26], GZMB [27], and MIEN1 [28] have been reported to be related to TNBC.

In our previous study [9], we combined the advantages of enetLTS and Ensemble: we removed the 68 outliers identified by enetLTS and then ran Ensemble on the remaining subset (856 samples) to improve the accuracy of gene selection. In this study, we removed the 47 misclassified samples detected by MTL-EN and then ran Ensemble on the remaining 877 samples. The results are shown in Tables 6 and 7.

From Table 6, for the subset with the outliers detected by enetLTS removed, the prediction index MR of the three models in Ensemble was much lower than that on the original TNBC dataset: the MR of EN decreased from 0.012 to 0, the MR of SPLS-DA decreased from 0.064 to 0.008, and the MR of SGPLS decreased from 0.059 to 0.015. When Ensemble was run on the subset with the 47 outliers identified by MTL-EN removed, the MR of the three models in Ensemble also decreased greatly, to 0.001, 0.014, and 0.013, respectively.

For the subset with the 68 outliers detected by enetLTS removed, the intersection of variables selected by the three Ensemble models increased from five to nine genes, namely, CA12 [29], GABRP [30], VGLL1 [31], AGR2 [32], GATA3 [17], FOXA1 [17], TFF3 [23], AGR3 [33], and KRT16 [20], all of which have been reported to be related to TNBC.

From Table 7, for subset with 47 outliers detected by MTL-EN removed, the intersection of variables selected using the three Ensemble models was 12 genes. Among them, ESR1, one of three key variables, and FOXC1 [22], AGR2 [32], FOXA1 [17], TFF3 [23], TFF1 [34], AGR3 [33], KRT6B [35], and KRT16 [20] have been reported to be related to TNBC. KLK6 [36], FDCSP [37], and PPP1R14C [38] have been reported to be related to other types of tumors. Their association with TNBC needs further study.

6. Discussion

Through our previous research [9], we found that robust trimmed penalized regression is a recommended method for identifying mislabeled samples in high-dimensional data with labeling errors. However, the C-step algorithm used to implement this method (enetLTS) is too slow to meet the requirements of data analysis for high-dimensional omics data. The reason is that the inequality that guarantees the convergence of the C-step algorithm holds for LTS without regularized parameters. However, for robust trimmed penalized regression with regularized parameters, the inequality does not necessarily hold because the regularized parameters change.

In the AR-Cstep algorithm, penalized regression is repeatedly performed on the subset at each step to concentrate gradually on the individuals who fit the model best; that is, the idea of the C-step algorithm is still adopted. However, AR-Cstep can solve the problem of the C-step algorithm not converging, which arises because the regularized parameters change during the iteration. In AR-Cstep, the likelihood function of the current subset is compared with that of the candidate subset to determine whether to replace the current subset with the candidate subset, thereby ensuring that the iterative process moves in the direction that improves the criterion function. To avoid falling into a local optimum, a Metropolis-type probabilistic acceptance-rejection step is incorporated.

In the simulation experiments, MTL-EN using the AR-Cstep algorithm was more accurate than enetLTS using the C-step algorithm in outlier identification. In particular, the variable selection accuracy of Ensemble on the subset obtained after removing the outliers identified by MTL-EN was higher than that of Ensemble on the subset obtained after removing the outliers identified by enetLTS. The AR-Cstep algorithm adopted by MTL-EN greatly improved the convergence speed; that is, the computation time of MTL-EN was about 2% of that of enetLTS.

If a sample identified as misclassified by a certain method is labeled as non-TNBC, it means that the expression of the key genes ER, PR, or HER2 is a false positive in this patient. Similarly, if a sample identified as misclassified is labeled as TNBC, it implies that the expression of ER, PR, or HER2 is a false negative in the patient. There are 153 individuals labeled as TNBC in the TNBC dataset. Three of them were identified by enetLTS, a false negative rate of 2% (3/153). Twelve individuals labeled as TNBC were identified as mislabeled samples by MTL-EN, a false negative rate of 7.8% (12/153). In the TNBC dataset, the IHC test of ER and PR was adopted for all patients. For HER2 detection, IHC results were available for 507 patients. According to previous studies, the false negative rates of the IHC test for ER, PR, and HER2 are not low: 15.1% to 21.8% for ER [39], 6.8% (4/58) for PR [40], and 6.2% (4/65) for HER2 [41]. Therefore, the false negative misclassified samples identified by MTL-EN were more likely to be close to reality than those identified by enetLTS.

A large class of computational problems in robust statistics can be formulated as the selection of the optimal subset of data based on some criterion function [10]. AR-Cstep algorithm, as the improvement of C-step algorithm, can be extended to other robust models with regularized parameters. It is an effective algorithm for finding the most suitable subset of regularized models, such as robust Adaptive LASSO, Group LASSO, SCAD, and MCP. The AR-Cstep algorithm can be extended to other generalized linear models, such as penalized multiclass logistic regression and penalized Poisson regression.

7. Conclusion

AR-Cstep can solve the problem of the C-step algorithm not converging, which arises because the regularized parameters change during the iteration. It provides an approach for developing efficient algorithms for robust penalized regression based on trimming. The AR-Cstep algorithm can be extended to other robust models with regularized parameters. In practice, MTL-EN using the AR-Cstep algorithm is the recommended method for mislabeled sample identification in omics data because of its high accuracy and high operation speed. When the proportion of mislabeled samples is relatively low (≤5%), Ensemble can be used for variable selection. When the proportion of mislabeled samples is >5%, Ensemble can be used for variable selection on a subset of data after removing the mislabeled samples identified by MTL-EN.

Data Availability

Code is available on GitHub (https://github.com/hwsun2000/AR-Cstep). The BRCA RNA-Seq FPKM dataset was imported using the “brca.data” R package (https://github.com/averissimo/brca.data/releases/download/1.0/brca.data_1.0.tar.gz).

Disclosure

The funders played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research work was funded by the National Natural Science Foundation for Young Scholars of China (Grant No. 81502891) and the National Natural Science Foundation of China (Grant No. 81872715).

Copyright

Copyright © 2021 Hongwei Sun et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
