One-Step Robust Low-Rank Subspace Segmentation for Tumor Sample Clustering

Liu, Jian; Cheng, Yuhu; Wang, Xuesong; Ge, Shuguang

doi:https://doi.org/10.1155/2021/9990297

Computational Intelligence and Neuroscience


Special Issue

Artificial Intelligence and Machine Learning-Driven Decision-Making


Research Article | Open Access

Volume 2021 | Article ID 9990297 | https://doi.org/10.1155/2021/9990297


Jian Liu,¹ Yuhu Cheng,¹ Xuesong Wang,¹ and Shuguang Ge¹

Academic Editor: Wei Xiang

Received 23 Aug 2021

Revised 13 Nov 2021

Accepted 19 Nov 2021

Published 08 Dec 2021

Abstract

Clustering of tumor samples can help identify cancer types and discover new cancer subtypes, which is essential for effective cancer treatment. Although many traditional clustering methods have been proposed for tumor sample clustering, advanced algorithms with better performance are still needed. Low-rank subspace clustering has become a popular approach in recent years. In this paper, we propose a novel one-step robust low-rank subspace segmentation method (ORLRS) for clustering tumor samples. For a gene expression data set, we seek its lowest-rank representation matrix and the noise matrix. By imposing the discrete constraint on the low-rank matrix, ORLRS learns the cluster indicators of subspaces directly, without performing spectral clustering, i.e., it performs the clustering task in one step. To improve the robustness of the method, the capped norm is adopted to remove the extreme data outliers in the noise matrix. Furthermore, we derive an efficient algorithm to solve the ORLRS problem. Experiments on several tumor gene expression data sets demonstrate the effectiveness of ORLRS.

1. Introduction

A tumor is a group of cells that have undergone unregulated growth and often form a mass or lump. Analyzing tumor gene expression data is critical for revealing the pathogenesis of cancer. Advances in sequencing technologies have made it possible to measure the expression levels of thousands of genes simultaneously [1]. Increasingly, one challenge is how to interpret these gene expression data to gain insights into the mechanisms of tumors [2]. Many advanced machine learning algorithms [3–9] have thus been proposed to analyze various data. Among them, clustering can be used for discovering tumor samples with similar molecular expression patterns [10, 11].

Many traditional clustering methods, such as hierarchical clustering (HC) [12, 13], self-organizing maps (SOM) [14], nonnegative matrix factorization (NMF) [15, 16], and principal component analysis (PCA) [17–20], have been used for gene expression data clustering. Gene expression data often contain structures that can be represented and processed by parametric models. Linear subspaces are a natural choice for characterizing a given set of data since they are easy to calculate and often effective in real applications. Subspace methods, such as NMF, are essentially based on the assumption that the data are approximately drawn from a low-dimensional subspace. In recent years, these methods have been gaining much attention. For example, Yu et al. proposed a correntropy-based hypergraph regularized NMF (CHNMF) method for clustering and feature selection [21]. Specifically, correntropy is used in the loss term of CHNMF instead of the Euclidean norm to improve the robustness of the algorithm. CHNMF also uses hypergraph regularization to explore the high-order geometric information among more sample points. Jiao et al. proposed a hypergraph regularized constrained nonnegative matrix factorization (HCNMF) method for selecting differentially expressed genes and tumor sample classification [22]. HCNMF incorporates a hypergraph regularization constraint to consider the higher-order relationships among data samples. A nonnegative matrix factorization framework based on multisubspace cell similarity learning for unsupervised scRNA-seq data analysis (MscNMF) was proposed by Wang et al. [23]. MscNMF can learn the gene features and cell features of different subspaces, and the correlation and heterogeneity between cells are more prominent in multiple subspaces, so that the final cell similarity learning is more satisfactory.

However, real data can rarely be well represented by a single subspace. A more reasonable model is to assume that the data lie near multiple subspaces (i.e., the data are considered as samples approximately drawn from a mixture of multiple low-dimensional subspaces). Subspace clustering (or segmentation) has been proposed to improve clustering accuracy. It assumes that the data points are drawn from the combination of multiple low-dimensional subspaces. The goal of subspace clustering is to obtain such multiple low-dimensional subspaces, with each subspace corresponding to a cluster. Subspace clustering has obtained promising results in previous studies, and subspace clustering methods have found widespread applications in many areas, such as pattern recognition [24], image processing [25], and bioinformatics [26].

When the data are clean, i.e., the samples can be strictly drawn from multiple subspaces, several existing methods, such as sparse subspace clustering (SSC) [27], low-rank representation (LRR) [5], and the low-rank model with discrete group structure constraint (LRS) [28], are able to solve the subspace clustering problem. SSC clusters the data drawn from multiple low-dimensional subspaces based on sparse representation (SR) [29]. Since low-rank structure performs well in matrix recovery, the multiple subspaces can be exactly recovered by LRR. Recently, many excellent works based on low-rank representation have been published. For example, Tang et al. proposed a multiview subspace clustering model that learns a joint affinity graph based on low-rank representation with diversity regularization and a rank constraint [30]. This method can effectively suppress redundancy and enhance the diversity of different feature views. In addition, the cluster number is used to promote affinity graph learning through a rank constraint. In [31], an unsupervised linear feature selective projection (FSP) method was proposed for feature extraction with low-rank embedding and dual Laplacian regularization. FSP can take advantage of the inherent relationship between data and can effectively suppress the influence of noise. LRR has two steps in the clustering task: building the affinity matrix and performing spectral clustering. How to define a good affinity matrix is crucial. Furthermore, spectral clustering transforms the clustering problem into a graph segmentation problem, and the choice of segmentation criteria directly affects the clustering results. To address the above concerns, LRS directly learns the indicators of different subspaces via the discrete constraint. As a result, multiple low-rank subspaces can be obtained clearly. Furthermore, Nie et al. introduced a piecewise function to relax the rank constraint, which makes LRS better at handling noisy data sets than the preliminary version [32].

As pointed out in [33], one major challenge of subspace clustering is to deal with the outliers that exist in data. Therefore, robust subspace clustering has become an active research topic. To address the robustness issue, the main idea is to explore the L_2,1-norm based objective functions since the nonsquared residuals of L_2,1-norm can reduce the effects of data outliers. In [34, 35], the L_2,1-norm is adopted in robust PCA (RPCA) for detecting outliers. In [33], Liu et al. proposed a robust LRR model via L_2,1-norm for subspace clustering. Although the L_2,1-norm is robust to outliers, it still suffers from the extreme data outliers. The L_2,1-norm just reduces, not completely removes, the effects of the outliers. Capped norm is a more robust strategy than L_2,1-norm due to the fact that it can remove the effects of the outliers. It has been recently studied in many applications [36, 37].

In this paper, a one-step robust low-rank subspace segmentation (ORLRS) method via the discrete constraint and capped norm is proposed for clustering tumor samples. For a data set $X \in \mathbb{R}^{m \times n}$ with $m$ genes and $n$ samples, a low-rank representation matrix $Z$ and a noise matrix $E$, i.e., $X = Z + E$, are sought. The low-rank representation of the $i$-th subspace can be denoted as $Z_i$. Here, we impose the discrete constraint on a diagonal matrix $Q_i$ to obtain the low-rank representation $Z_i = ZQ_i$, where $Q_i \in \{0, 1\}^{n \times n}$ and $\sum_{i=1}^{c} Q_i = I$ ($c$ is the number of total subspaces and $I$ is an identity matrix). The indicators of the $i$-th cluster are included in $Q_i$. In contrast to traditional low-rank based models, we can directly learn the cluster indicators. To avoid trivial solutions and approximate the low-rank constraint, the ranks of all subspaces can be minimized simultaneously as $\min \sum_{i=1}^{c} \|ZQ_i\|_{S_p}^{p}$, where $\|\cdot\|_{S_p}$ denotes the Schatten $p$-norm, which is a better relaxation than the nuclear norm [38]. For the noise matrix $E$, capped norm is used to improve the robustness. We define $\varepsilon$ as a thresholding parameter for choosing the extreme data outliers, and then the capped norm of $E$ can be formulated as $\sum_{j=1}^{n} \min(\|e_j\|_2, \varepsilon)$, where $e_j$ is the $j$-th column of $E$. This function treats $e_j$ equally if $\varepsilon$ is smaller than $\|e_j\|_2$. Hence, it is more robust to outliers than the L_2,1-norm. Meanwhile, we derive an efficient optimization algorithm to solve ORLRS with a rigorous theoretical analysis.

The main contributions of our paper are given as follows: ① Compared with traditional low-rank representation-based methods, ORLRS can obtain the clustering result directly by learning a subspace indicator matrix from the low-rank representation matrix without spectral clustering. This avoids the graph construction process in spectral clustering and makes the clustering process simpler. ② We introduced the capped norm into our model and formed a novel objective function for the gene expression data clustering task. Capped norm is used to constrain the noise matrix to improve the robustness of ORLRS. ③ Optimizing the objective function of ORLRS is a nontrivial problem, thus we derive a new optimization algorithm to solve the problem. Furthermore, we have also given a rigorous convergence analysis of ORLRS.

The remainder of the paper is structured as follows. In Section 2, the proposed ORLRS is presented, and the theoretical analysis of the proposed method is provided. Experimental results are presented in Section 3. In Section 4, the conclusions are given.

2. Methods

We start with a brief introduction of several classical clustering methods. Then, the proposed ORLRS is presented, and the optimal solution and convergence analysis of ORLRS is provided.

2.1. Subspace Clustering via LRR

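Denote $X \in \mathbb{R}^{m \times n}$ as a data set with $m$ features and $n$ samples. LRR can be defined as

$$\min_{Z, E} \|Z\|_{*} + \lambda \|E\|_{2,1} \quad \text{s.t.} \quad X = AZ + E, \tag{1}$$

where $\|Z\|_{*}$ is the nuclear norm of $Z$ [33], $\|E\|_{2,1}$ can detect outliers with column-wise sparsity, $A$ is a dictionary, and $\lambda$ is a balance parameter.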

A brief explanation of the LRR subspace clustering process is provided as follows. Firstly, the low-rank problem is solved by equation (1). Then, the optimal solution $Z^{*}$ to equation (1) is used to calculate the affinity matrix by $W = |Z^{*}| + |Z^{*}|^{T}$, where $|\cdot|$ is the absolute value function. Finally, the data are clustered by using spectral clustering [39].
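The second and third steps of this pipeline can be sketched in a few lines of NumPy. This is only an illustrative sketch: it assumes a representation matrix `Z` has already been obtained from equation (1), the function names are hypothetical, and the normalized-Laplacian embedding shown here is one common form of spectral clustering rather than necessarily the variant used in [39].

```python
import numpy as np

def affinity_from_representation(Z):
    """Build a symmetric, nonnegative affinity W = |Z| + |Z|^T from an n x n matrix Z."""
    A = np.abs(np.asarray(Z, dtype=float))
    return A + A.T

def spectral_embedding(W, k):
    """Return the k eigenvectors of the normalized Laplacian with the smallest eigenvalues."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard against isolated points
    L = np.eye(len(W)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues are returned in ascending order
    return vecs[:, :k]

# Toy representation with two obvious groups: samples {0, 1} and {2, 3}.
Z = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
W = affinity_from_representation(Z)
embedding = spectral_embedding(W, k=2)
```

The rows of `embedding` would then be grouped by K-means to produce the final labels, which is the separate second stage that a one-step method avoids.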

2.2. One-Step Robust Low-Rank Subspace Clustering

In this paper, we propose the one-step robust low-rank subspace clustering (ORLRS) method via discrete constraint and capped norm. Different from LRR, ORLRS was proposed for clustering the data by learning the indicators.

Suppose the data matrix $X$ has $c$ subspaces $X_1, \ldots, X_c$; the low-rank representation of each subspace needs to be optimized. In the clustering task, we want each subspace to belong to its own cluster. Minimizing the rank of each subspace directly has a trivial solution. Therefore, we need to solve the problem in another way. We define a cluster indicator matrix $Q \in \{0, 1\}^{c \times n}$: $q_{ij} = 1$ if the $j$-th sample belongs to the $i$-th subspace, and $q_{ij} = 0$ otherwise. The diagonal matrices are defined as $Q_i \in \mathbb{R}^{n \times n}$ with $\sum_{i=1}^{c} Q_i = I$, where the diagonal elements of $Q_i$ are formed by the $i$-th row of $Q$ and $I$ is the identity matrix. Then, $XQ_i$ can be represented as the $i$-th subspace of $X$. That is, $X_i$ can be rewritten as $XQ_i$. We can get the clustering label in one step by directly optimizing $Q$ [28].
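As a small illustration of the indicator construction described above, the sketch below builds the indicator matrix and its diagonal matrices from a label vector and checks that they sum to the identity. The function name and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def indicator_to_diagonals(labels, c):
    """Return the c x n indicator matrix Q and the list of n x n diagonal matrices Q_i."""
    labels = np.asarray(labels)
    n = labels.shape[0]
    Q = np.zeros((c, n))
    Q[labels, np.arange(n)] = 1.0          # Q[i, j] = 1 iff sample j is in subspace i
    diagonals = [np.diag(Q[i]) for i in range(c)]
    return Q, diagonals

labels = np.array([0, 1, 0, 2])             # four samples, three subspaces
Q, Qs = indicator_to_diagonals(labels, c=3)

X = np.arange(8.0).reshape(2, 4)            # toy data: 2 features x 4 samples
X0 = X @ Qs[0]                              # keeps columns 0 and 2, zeros the rest
```

Multiplying by a diagonal indicator matrix keeps the columns of the selected samples and zeroes out all others, so each product has the same shape as the original data matrix.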

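Finally, the problem of the one-step low-rank subspace clustering method can be defined as

$$\min_{Q} \sum_{i=1}^{c} \|XQ_i\|_{S_p}^{p} \quad \text{s.t.} \quad Q_i \in \{0, 1\}^{n \times n}, \ \sum_{i=1}^{c} Q_i = I, \tag{2}$$

where $\|XQ_i\|_{S_p}$ is the Schatten $p$-norm of $XQ_i$. The clustering indicators of each subspace can be obtained from the optimized diagonal matrix $Q_i$ directly.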

However, equation (2) is sensitive to data outliers in practical problems since it does not consider the noise in data. To address the robustness problem, we represent the gene expression data $X$ with $m$ genes and $n$ samples as the addition of the low-rank representation matrix $Z$ and the noise matrix $E$, i.e., $X = Z + E$, which is the same strategy as in RPCA. Our one-step low-rank subspace clustering problem can be written as

$$\min_{Z, E, Q} \sum_{i=1}^{c} \|ZQ_i\|_{S_p}^{p} + \lambda \mathcal{R}(E) \quad \text{s.t.} \quad X = Z + E, \ Q_i \in \{0, 1\}^{n \times n}, \ \sum_{i=1}^{c} Q_i = I, \tag{3}$$

where $\lambda$ is a balance parameter and $\mathcal{R}(E)$ indicates a certain regularization strategy. Note that the Schatten $p$-norm is used to approximate the low-rank problem in equation (3) since it is a better relaxation of the rank constraint than the nuclear norm [38]. The Schatten $p$-norm of a matrix $Z \in \mathbb{R}^{m \times n}$ is defined as $\|Z\|_{S_p} = \left( \sum_{i=1}^{\min(m, n)} \sigma_i^{p} \right)^{1/p}$, where $\sigma_i$ is the $i$-th singular value of $Z$. In [38], convergence is proved for Schatten $p$-norm minimization over a bounded range of $p$. Here, $p$ is restricted to a range that guarantees the convergence of the first term in equation (3).
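The Schatten $p$-norm term can be evaluated directly from singular values. The following is a minimal sketch, with a hypothetical function name, that computes the $p$-th power of the Schatten $p$-norm and illustrates the two familiar special cases: $p = 1$ gives the nuclear norm and $p = 2$ gives the squared Frobenius norm.

```python
import numpy as np

def schatten_p_norm_pow(M, p):
    """Return the p-th power of the Schatten p-norm, i.e. the sum of singular values ** p."""
    singular_values = np.linalg.svd(np.asarray(M, dtype=float), compute_uv=False)
    return float(np.sum(singular_values ** p))

M = np.array([[3.0, 0.0],
              [0.0, 4.0]])                  # singular values are 4 and 3

nuclear = schatten_p_norm_pow(M, 1.0)       # 3 + 4
frobenius_sq = schatten_p_norm_pow(M, 2.0)  # 9 + 16
half = schatten_p_norm_pow(M, 0.5)          # sqrt(3) + sqrt(4)
```

Smaller values of $p$ penalize small singular values relatively more, which is why the quantity is used as a surrogate for the rank.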

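To seek a better robustness strategy for the outliers, we adopt the capped norm to regularize the noise matrix $E$, i.e., $\sum_{j=1}^{n} \min(\|e_j\|_2, \varepsilon)$ with $e_j$ the $j$-th column of $E$. Then, equation (3) becomes

$$\min_{Z, E, Q} \sum_{i=1}^{c} \|ZQ_i\|_{S_p}^{p} + \lambda \sum_{j=1}^{n} \min(\|e_j\|_2, \varepsilon) \quad \text{s.t.} \quad X = Z + E, \ Q_i \in \{0, 1\}^{n \times n}, \ \sum_{i=1}^{c} Q_i = I, \tag{4}$$

where $\varepsilon$ is a thresholding parameter for choosing the data outliers. If the data point satisfies $\|e_j\|_2 > \varepsilon$, we consider $e_j$ an extreme outlier, and it is capped as $\varepsilon$. In this way, the influence of extreme outliers is fixed. For the other data points with $\|e_j\|_2 \le \varepsilon$, equation (4) will minimize $\|e_j\|_2$, i.e., the L_2,1-norm. That is, if $\varepsilon$ is set as $\infty$, the capped norm is equivalent to $\|E\|_{2,1}$. Thus, the capped norm is a more robust strategy than the L_2,1-norm.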
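The effect of the cap is easy to see numerically. The sketch below, with a hypothetical function name and toy values, evaluates the capped norm column by column and compares it with the uncapped L_2,1-norm obtained by setting the threshold to infinity.

```python
import numpy as np

def capped_l21(E, eps):
    """Sum over columns of min(||e_j||_2, eps); eps = inf recovers the L_2,1-norm."""
    column_norms = np.linalg.norm(np.asarray(E, dtype=float), axis=0)
    return float(np.minimum(column_norms, eps).sum())

# Column norms are 5, 1, and 50; the third column plays the role of an extreme outlier.
E = np.array([[3.0, 0.0, 30.0],
              [4.0, 1.0, 40.0]])

capped = capped_l21(E, eps=10.0)      # 5 + 1 + 10
uncapped = capped_l21(E, eps=np.inf)  # 5 + 1 + 50
```

Once a column exceeds the threshold, making it even larger no longer changes the objective, so an extreme outlier contributes only a fixed amount.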

As a result, ORLRS provides a more robust low-rank subspace clustering model by using capped norm. And, the clustering indicators of each subspace can be obtained from the optimized diagonal matrix directly. We will propose an efficient optimization algorithm to solve equation (4) in Section 2.3.

2.3. Optimization Algorithm

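The objective function in equation (4) of ORLRS is nonconvex, thus jointly optimizing $Z$, $E$, and $Q$ is extremely difficult. The augmented Lagrange multiplier (ALM) algorithm is used to optimize equation (4). The Lagrangian function of equation (4) can be written as

$$\mathcal{L}(Z, E, Q) = \sum_{i=1}^{c} \|ZQ_i\|_{S_p}^{p} + \lambda \sum_{j=1}^{n} \min(\|e_j\|_2, \varepsilon) + \langle \Lambda, X - Z - E \rangle + \frac{\mu}{2} \|X - Z - E\|_F^2 + \varphi(Q), \tag{5}$$

where $\Lambda$ is a Lagrange multiplier, $\mu$ is a penalty parameter, $\|\cdot\|_F$ is the Frobenius norm, and $\varphi(Q)$ encodes the constraints of $Q$. Dropping the term that does not depend on $Z$, $E$, or $Q$, we rewrite equation (5) as follows:

$$\sum_{i=1}^{c} \|ZQ_i\|_{S_p}^{p} + \lambda \sum_{j=1}^{n} \min(\|e_j\|_2, \varepsilon) + \frac{\mu}{2} \left\| X - Z - E + \frac{\Lambda}{\mu} \right\|_F^2 + \varphi(Q). \tag{6}$$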

We divide equation (6) into three subproblems: optimizing $Z$ while fixing $E$ and $Q$, optimizing $E$ while fixing $Z$ and $Q$, and optimizing $Q$ while fixing $Z$ and $E$.

2.3.1. Fixing $E$ and $Q$ to Optimize $Z$

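Equation (6) can be simplified to

$$\min_{Z} \sum_{i=1}^{c} \|ZQ_i\|_{S_p}^{p} + \frac{\mu}{2} \|Z - H\|_F^2, \tag{7}$$

where $H = X - E + \Lambda / \mu$, with $\Lambda$ the Lagrange multiplier and $\mu$ the penalty parameter.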

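Lemma 1. (Araki-Lieb-Thirring [40, 41]). For any positive semidefinite matrices $A, B \in \mathbb{R}^{n \times n}$ and $q > 0$, the following inequality holds when $0 \le r \le 1$:

$$\mathrm{Tr}\left( A^{r/2} B^{r} A^{r/2} \right)^{q} \le \mathrm{Tr}\left( A^{1/2} B A^{1/2} \right)^{rq}. \tag{8}$$

While for $r \ge 1$, the inequality is reversed.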

Following , [28] and Lemma 1, the first term in equation (7) can be denoted as since . According to , we convert the first term in equation (7) to

Then, equation (7) can be represented as

Taking derivative w.r.t and setting to zero, the above formula becomeswhere . So, we can achieve the optimal :

2.3.2. Fixing $Z$ and $Q$ to Optimize $E$

Here, we can denote equation (6) aswhere . It can be easily verified that the derivative of equation (13) is equivalent to the derivative ofwhere

Equation (14) can be formulated aswhere is a diagonal matrix with . The problem of equation (16) can be optimized by using the iterative reweighted optimization strategy.

When fixing , taking derivative w.r.t and setting it to zero, the above formulation can be written as

So, we can obtain the optimal :

When fixing , the updating rule for is as follows:

2.3.3. Fixing $Z$ and $E$ to Optimize $Q$

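We can rewrite equation (6) as

$$\min_{Q} \sum_{i=1}^{c} \|ZQ_i\|_{S_p}^{p} \quad \text{s.t.} \quad Q_i \in \{0, 1\}^{n \times n}, \ \sum_{i=1}^{c} Q_i = I. \tag{20}$$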

Taking derivative w.r.t and setting to zero, the above formulation can be written aswhere .

Since depends on , an iteration-based algorithm is used to obtain the solution of equation (21). Firstly, we calculate by using the current solution of . If is given, the solution of to the following objective function will satisfy equation (21):

The current solution of can be updated according to the optimal solution to equation (22).

Denote that , equation (22) can be written as

Due to are diagonal matrices, the above formulation becomeswhere is the -th diagonal element of matrix and is the -th diagonal element of matrix . We can optimize equation (24) by

The algorithm to solve the problem of ORLRS is summarized in Algorithm 1.

	Input: data matrix $X \in \mathbb{R}^{m \times n}$, number of subspaces $c$, the low-rank constraint parameter $p$, balance parameter $\lambda$, threshold parameter $\varepsilon$.
	Initialize:, ,,, , , , as the identity matrix, such that the discrete constraints in equation (16) are satisfied.
	Output: the optimal $Q_i$ for the $i$-th cluster.
	while not converge do
(1)	Fix the others and update : ① calculate , ② calculate , ③update by .
(2)	Fix the others and update : ① calculate , ② update by , ③ calculate .
(3)	Fix the others and update : ① calculate , ② calculate , ③ update , where the -th diagonal element of matrix is updated by equation (33).
(4)	Update the multiplier: .
(5)	Update the parameter by .
(6)	Check the convergence condition
	.
	end while
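Steps (4) to (6) of the loop follow the usual ALM pattern. Because the exact update formulas are not legible in this copy, the sketch below shows only the textbook form for a constraint $X = Z + E$: a dual ascent step on the multiplier, a geometric increase of the penalty, and a residual-based stopping check. The names `Lam`, `mu`, `rho`, and `mu_max`, and the default values, are assumptions for illustration rather than the authors' settings.

```python
import numpy as np

def alm_dual_update(X, Z, E, Lam, mu, rho=1.1, mu_max=1e6):
    """One textbook ALM bookkeeping step for the constraint X = Z + E (assumed form)."""
    residual = X - Z - E
    Lam_new = Lam + mu * residual            # dual ascent on the multiplier
    mu_new = min(rho * mu, mu_max)           # grow the penalty, but keep it bounded
    max_violation = float(np.abs(residual).max())
    return Lam_new, mu_new, max_violation

X = np.ones((2, 3))
Z = 0.5 * X
E = np.zeros_like(X)                         # constraint violated by 0.5 everywhere
Lam, mu = np.zeros_like(X), 1.0

Lam, mu, violation = alm_dual_update(X, Z, E, Lam, mu)
converged = violation < 1e-6                 # hypothetical tolerance
```

In a full solver, this step would run after the three block updates in every pass of the while loop, and the loop would stop once the violation falls below the chosen tolerance.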

2.4. Convergence Analysis

In this section, the convergence of the proposed algorithm is analyzed.

Theorem 1. At each iteration, the updating rule in Algorithm 1 for matrix while fixing others will monotonically decrease the objective value in equation (4) when .

Proof. It can be verified that equation (12) is the solution to the following problem:Then, at the iterationThat is,Equation (28) can be converted toaccording to Lemma 2 in [38].

Lemma 2. For any positive definite matrices , the following inequality holds when .Note that, here, we set , so equation (30) is equivalent toThen, we have

Combining equations (29) and (32), we have

That is to say,

Thus, the updating rule for matrix in Algorithm 1 will not increase the objective value of the problem in equation (10) at each iteration when .

Theorem 2. At each iteration, the updating rule in Algorithm 1 for matrix $E$ while fixing others will monotonically decrease the objective value in equation (4).

Proof. We first prepare the following lemma from [37].

Lemma 3. Given , we have the following inequality:

It can be verified that equation (18) is the solution to the following problem:

Suppose the updated in Algorithm 1 is while fixing others. Since is the optimal solution to equation (4), we have

According to the definition of in equation (19) and Lemma 3, we have

Summing over equations (37) and (38) at both sides, we can obtain

Therefore, at each iteration, the updating rule in Algorithm 1 for matrix $E$ while fixing others will monotonically decrease the objective value in equation (4).

Theorem 3. At each iteration, the updating rule in Algorithm 1 for while fixing others will monotonically decrease the objective value in equation (4) when .

Proof. It can be easily verified that equation (25) is the solution to the following problem: Assume the updated in Algorithm 1 is . Since is the optimal solution to the equation (22), we can haveAccording to the definition of in Algorithm 1, equation (41) can be written asAccording to the Cauchy–Schwarz inequality, it can be proved that, when , we haveThus, combining inequations (42) and (43), we can obtainEquation (44) indicates that the updating rule in Algorithm 1 for while fixing others will monotonically decrease the objective value in equation (4) during the iteration until the algorithm converges when . In practice, the algorithm is also converged when . If the objective function of equation (40) is changed to ( in Algorithm 1 becomes ), the convergence is also observed [28].
As a result, the objective of equation (4) is nonincreasing under the updates of $Z$, $E$, and $Q$ according to Theorems 1–3, respectively. Therefore, the iterative updating in Algorithm 1 converges to a local optimum.

2.5. Complexity Analysis

In Algorithm 1, the most complicated calculations are and in Step 3. We suppose in the low-rank representation matrix . Firstly, needs to be computed. Denoting the SVD of is . Computing needs SVD of , which takes . can be decomposed as by SVD, which takes . So, computing takes and computing takes , where c is the number of clusters. For , we only need to compute the diagonal elements, which takes . And, computing takes . In summary, the computational complexity of Algorithm 1 is , where t is the iteration number.

3. Results and Discussion

We test ORLRS on six publicly available gene expression data sets, i.e., Leukemia [42], DLBCL [43], Colon cancer [44], Brain_Tumor1 [43], Brain_Tumor2 [43], and 9_Tumors [43].

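Following [28, 45–47], clustering accuracy (ACC) is a widely used evaluation method for tumor clustering. Given a data point $x_i$, suppose $r_i$ is the label produced by clustering and $s_i$ is the truth label. ACC can be denoted as [45]

$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta(s_i, \mathrm{map}(r_i))}{n},$$

where $\delta(x, y) = 1$ if $x = y$ and $\delta(x, y) = 0$ if $x \neq y$, $\mathrm{map}(r_i)$ maps $r_i$ to the equivalent label from the raw data, and $n$ is the number of tumor samples.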
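The only nontrivial part of ACC is the label mapping, since cluster labels are arbitrary. The sketch below uses a brute-force search over one-to-one mappings, which is adequate for the small numbers of clusters considered here; the Hungarian algorithm is the usual choice for larger problems. The function name is hypothetical, and this is an illustration rather than the evaluation code used in the paper.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """Best fraction of matching labels over all one-to-one cluster-to-class mappings."""
    n = len(true_labels)
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # Pad with placeholder targets so that every cluster can receive a distinct target.
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    best_hits = 0
    for perm in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, s in zip(true_labels, cluster_labels) if mapping[s] == t)
        best_hits = max(best_hits, hits)
    return best_hits / n

perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
partial = clustering_accuracy([0, 0, 1, 1], [0, 0, 0, 1])
```

In the first call the clusters are a pure relabeling of the classes, so the accuracy is 1.0; in the second, one sample is misassigned under the best mapping, giving 0.75.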

We also evaluate the clustering performance by normalized mutual information (NMI) [48]. NMI normalizes the mutual information $I(C, S)$ between the true class label $C$ and the clustering label $S$ by the entropies $H(C)$ and $H(S)$, where $I(\cdot, \cdot)$ is the mutual information function and $H(\cdot)$ is the entropy function. The larger the NMI value is, the better the clustering result is.
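A minimal pure-Python sketch of NMI is given below. The normalization is an assumption: several conventions exist (geometric mean, arithmetic mean, or maximum of the two entropies), and the exact form used in [48] is not legible in this copy, so the geometric mean is used here purely for illustration. All function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((k / n) * log(k / n) for k in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label sequences of equal length."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((k / n) * log(k * n / (count_a[x] * count_b[y]))
               for (x, y), k in count_ab.items())

def nmi(a, b):
    """NMI with geometric-mean normalization (an assumed convention)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # a constant labeling carries no information
    return mutual_information(a, b) / sqrt(h_a * h_b)

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # relabeled but identical grouping
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # groupings share no information
```

Unlike ACC, NMI needs no explicit label matching, because mutual information is invariant to how the clusters are named.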

3.1. Gene Expression Data Sets

A brief introduction of the six gene expression data sets is presented, and the detailed information of these data sets is summarized in Table 1. The Leukemia data contain 25 cases of AML and 47 cases of ALL, packaged into a 7129 × 72 matrix [42]. The DLBCL data consist of 5469 genes and 77 samples. These samples include 58 patients with diffuse large B-cell lymphoma (DLBCL) and 19 patients with follicular lymphoma (FL) [43]. The Colon cancer data [44] consist of a matrix that includes 2000 genes and 62 tissues. These tissues are divided into 22 normal and 40 colon tumor samples. The Brain_Tumor1 data set consists of 5920 genes in 90 patient samples. These samples contain 5 types of histological diagnoses, i.e., 60 cases of medulloblastoma, 10 cases of malignant glioma, 10 cases of atypical teratoid/rhabdoid tumors (AT/RTs), 4 cases of normal cerebellum, and 6 cases of primitive neuroectodermal tumors (PNETs). The Brain_Tumor2 data set contains 10367 genes in 50 samples. It contains 4 types of malignant glioma, i.e., classic glioblastomas (CG), classic anaplastic oligodendrogliomas (CAO), nonclassic glioblastomas (NCG), and nonclassic anaplastic oligodendrogliomas (NCAO) [34]. The 9_Tumors data set integrates 9 tumor types to develop a genomics-based approach to the prediction of drug response. It contains 5726 genes in 60 samples. The numbers of samples of the 9 tumor types are as follows: 9 samples of non-small-cell lung carcinoma (NSCLC), 7 samples of colon cancer, 8 samples of breast cancer, 6 samples of ovary cancer, 6 samples of leukemia, 8 samples of renal cancer, 8 samples of melanoma, 2 samples of prostate cancer, and 6 samples of central nervous system cancer (CNS).

3.2. Comparison Algorithms

We compare LRS [28], Ext-LRR [32], RPCA [3], PLRR [47], robust LRR [33], LatLRR [49], robust NMF [50], and K-means [51] with the proposed method for tumor clustering. Among these methods, LRS is the basic version of our method for one-step clustering, and Ext-LRR is a simpler and more effective extension of LRS; RPCA is a classic robust learning algorithm; PLRR (projection LRR) is one of the latest subspace clustering methods for tumor sample clustering; robust LRR and LatLRR are state-of-the-art low-rank subspace segmentation algorithms; robust NMF is a classic NMF-based method and is widely used for tumor clustering. K-means is the most commonly used clustering method and is embedded into many methods, including PLRR, robust LRR, and LatLRR, to achieve better performance. Since our proposed method is a novel one-step robust low-rank subspace clustering model, we choose these methods as our comparison algorithms.

3.3. Parameter Setting

Since gene expression data are high-dimensional with small sample sizes, we use PCA to perform dimensionality reduction. We use the K-means method to initialize $Q$ in the proposed ORLRS. Here, three parameters, i.e., the threshold parameter $\varepsilon$, the balance parameter $\lambda$, and the low-rank constraint parameter $p$, need to be determined. In the experiment, we investigated one parameter at a time by fixing the other two. Since the initialization of $Q$ brings some uncertainty, the proposed ORLRS method is run 100 times, and the average accuracy over the 100 runs is reported. The choices of parameters in the following are heuristic and might not be the best for tumor clustering.
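The PCA preprocessing step can be sketched with a singular value decomposition. This is a generic illustration that assumes genes are rows and samples are columns; the function name and the number of retained components `k` are hypothetical, since the reduced dimension used in the experiments is not stated here.

```python
import numpy as np

def pca_reduce(X, k):
    """Project an m x n (genes x samples) matrix onto its top-k principal directions."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=1, keepdims=True)   # center each gene across samples
    U, _, _ = np.linalg.svd(X_centered, full_matrices=False)
    return U[:, :k].T @ X_centered                   # k x n matrix of component scores

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                         # toy data: 50 genes, 8 samples
Y = pca_reduce(X, k=3)
gram = Y @ Y.T                                       # should be diagonal for PCA scores
```

The reduced matrix keeps one column per sample, so the clustering step that follows still operates on the same samples in a much smaller feature space.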

3.3.1. Determination of Threshold Parameter

In the ORLRS model, data outliers are not heuristically determined based on their magnitude. They are selected during the optimization process. The data outliers may differ between iterations (with the same thresholding parameter) as we iteratively optimize the objective function of ORLRS. When the algorithm converges, the extreme data outliers are likely to have been identified correctly. So, we just need to determine one value of $\varepsilon$ for each data set.

Figure 1 presents the results of ORLRS with different values of $\varepsilon$. Since the gene expression levels in different data sets are very different, the values of extreme data outliers are also very different. So, the value of $\varepsilon$ spans a large range across the six data sets. From Figure 1, we can observe that ORLRS obtains its best performance at a data-set-specific value of $\varepsilon$ on the Leukemia, DLBCL, Colon cancer, Brain_Tumor1, Brain_Tumor2, and 9_Tumors data. The results indicate that the value of $\varepsilon$ should be determined appropriately. If the value of $\varepsilon$ is too large, we will miss some extreme outliers. If the value of $\varepsilon$ is too small, some important information may be removed, thereby affecting the clustering performance.

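[Figure 1, panels (a)–(f): clustering results of ORLRS with different values of the threshold parameter on the six data sets.]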

3.3.2. Determination of Balance Parameter

Figure 2 presents the results of ORLRS with different values of $\lambda$. ORLRS obtains its best results at a data-set-specific value of $\lambda$ on the Leukemia, DLBCL, Colon cancer, Brain_Tumor1, Brain_Tumor2, and 9_Tumors data. According to the experimental results on each data set, the clustering accuracy shows an overall upward trend as $\lambda$ increases before reaching the best result and an overall downward trend as $\lambda$ increases afterwards. So, we suggest a rough range for the choice of $\lambda$.

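[Figure 2, panels (a)–(f): clustering results of ORLRS with different values of the balance parameter on the six data sets.]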

3.3.3. Determination of Schatten P-Norm Parameter

Since convergence of the algorithm is established for a restricted range of the Schatten $p$-norm parameter $p$, we determine the value of $p$ in this range. Figure 3 presents the results of ORLRS with different values of $p$. ORLRS achieves its best performance at a data-set-specific value of $p$ on the Leukemia, DLBCL, Colon cancer, Brain_Tumor1, Brain_Tumor2, and 9_Tumors data. So, general guidance is given on the choice of $p$.

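[Figure 3, panels (a)–(f): clustering results of ORLRS with different values of the Schatten p-norm parameter on the six data sets.]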

3.4. Experimental Results

In this section, experimental results of our proposed method and eight comparison algorithms, i.e., LRS, Ext-LRR, RPCA, PLRR, robust LRR, LatLRR, robust NMF, and K-means, are reported. ORLRS, LRS, and Ext-LRR use K-means to initialize the indicator matrix $Q$. PLRR, robust LRR, and LatLRR use the normalized cuts method to segment data, which clusters data points by using the K-means method. For the robust NMF method, we initialize the coefficient matrix and basis matrix randomly. To reduce the effect of randomness, we run all methods 100 times, and the mean and standard error of the clustering accuracies over the 100 runs are shown in Table 2. The best result on each data set is indicated in bold.
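Summarizing repeated runs by their mean and standard error can be sketched as follows. The standard error is taken here as the sample standard deviation divided by the square root of the number of runs, which is the usual definition but is an assumption about how the reported values were computed; the function name and accuracy values are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean for a list of run scores."""
    m = mean(values)
    # stdev uses the n - 1 denominator; a single run has no spread to estimate.
    se = stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
    return m, se

accuracies = [0.8, 0.9, 1.0]          # hypothetical accuracies from three runs
avg, se = mean_and_standard_error(accuracies)
```

With 100 runs, the standard error is one tenth of the sample standard deviation, so it shrinks as more runs are added even if the run-to-run spread stays the same.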

Based on the results reported in Table 2, we have the following observations and discussions. ORLRS extends LRS by adding a noise matrix to the objective function to enhance the robustness, which contributes to the observation that ORLRS outperforms LRS. From the results shown in Table 2, it can be observed that ORLRS achieves 8%–19% higher clustering accuracy than LRS on four data sets, i.e., the Leukemia, DLBCL, Colon cancer, and Brain_Tumor1 data. On the Brain_Tumor2 and 9_Tumors data sets, ORLRS performs slightly better than LRS. ORLRS has better results than Ext-LRR on all data sets. Compared to the three classical low-rank based methods, PLRR, robust LRR, and LatLRR, the clustering accuracy of ORLRS is 1%–9% higher on all six data sets. The main reason is that we use the capped norm to remove the extreme outliers in the noise matrix and the Schatten p-norm to better approximate the low-rank representation. Compared with the traditional clustering methods RPCA, robust NMF, and K-means, ORLRS achieves better results on all six data sets.

The NMI results on five gene expression data sets are shown in Table 3. The best result on each data set is indicated in bold. Because the NMI results of all the methods on the Colon cancer data are less than 0.1, we only report the results on the remaining five data sets. From Table 3, we can observe that ORLRS has better results on all five data sets than PLRR, robust LRR, LatLRR, robust NMF, RPCA, and Ext-LRR. Our method also outperforms LRS and K-means on every data set except 9_Tumors.

3.5. Convergence Curves and Running Time

We plotted the convergence curves of ORLRS on the different data sets. The convergence curves can be found in Figure 4. They show that our method converges around the 10th iteration on all six data sets. In Table 4, we also report the running time of ORLRS on the six gene expression data sets without dimensionality reduction by PCA. We implemented our experiments in MATLAB R2020b on an ordinary computer configured with an Intel i9-10900KF processor (up to 3.70 GHz), 8 GB RAM, and the Windows 10 operating system.

[Figure 4, panels (a)–(f): convergence curves of ORLRS on the six data sets.]

4. Conclusions

In this paper, a novel one-step robust low-rank subspace clustering method (ORLRS) is proposed for tumor clustering, where the gene expression data set is represented by a low-rank matrix and a noise matrix. By using the Schatten $p$-norm and the discrete constraint, the low-rank representation of each subspace can be well obtained. Different from traditional low-rank-based methods, such as LRR and LatLRR, ORLRS learns the indicators directly and performs the clustering process in one step by using the discrete constraint. The capped norm is used to improve the robustness of ORLRS since it can effectively remove the extreme data outliers in the noise matrix. Furthermore, we propose an efficient algorithm to solve the proposed subspace clustering model, and the convergence of the proposed algorithm is proved. We can thus discover the clusters of tumor data from the optimal cluster indicators. We tested the proposed ORLRS method on six tumor data sets. The results show that ORLRS is an effective method for clustering tumor samples.

There remain several interesting directions for future work. First, it might be better to learn a dictionary for ORLRS since some low-rank subspace segmentation methods achieve significant improvements by learning a dictionary. Second, ORLRS may be extended to solve other problems, such as matrix recovery and classification. Third, ORLRS may be employed in other applications, such as gene clustering and coclustering.

Data Availability

The data used to support the findings of this study are available from the first author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant nos. 61906198, 61976215, and 61772532) and the Natural Science Foundation of Jiangsu Province (Grant no. BK20190622).

References

M. J. Heller, “DNA microarray technology: devices, systems, and applications,” Annual Review of Biomedical Engineering, vol. 4, no. 1, pp. 129–153, 2002.
J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, “Metagenes and molecular pattern discovery using matrix factorization,” Proceedings of the National Academy of Sciences, vol. 101, no. 12, pp. 4164–4169, 2004.
E. J. Candes, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM, vol. 58, no. 3, 2009.
W. Hua and L. Mo, “Clustering ensemble model based on self-organizing map network,” Computational Intelligence and Neuroscience, vol. 2020, Article ID 2971565, 11 pages, 2020.
G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” Proceedings of the 27th International Conference on Machine Learning, vol. 3, pp. 663–670, 2010.
J. Zhang and Z. Ma, “Hybrid fuzzy clustering method based on FCM and enhanced logarithmical PSO (ELPSO),” Computational Intelligence and Neuroscience, vol. 2020, Article ID 1386839, 12 pages, 2020.
X. Chang, F. Nie, S. Wang, Y. Yang, X. Zhou, and C. Zhang, “Compound rank-$k$ projections for bilinear analysis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 7, pp. 1502–1513, 2016.
D. Yuan, X. Chang, P.-Y. Huang, Q. Liu, and Z. He, “Self-supervised deep correlation tracking,” IEEE Transactions on Image Processing, vol. 30, pp. 976–985, 2021.
H. Wang, Z. Li, Y. Li, B. B. Gupta, and C. Choi, “Visual saliency guided complex image retrieval,” Pattern Recognition Letters, vol. 130, pp. 64–72, 2020.
C.-H. Zheng, L. Zhang, V. T. Ng, S. C. Shiu, and D.-S. Huang, “Molecular pattern discovery based on penalized matrix decomposition,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 6, pp. 1592–1603, 2011.
J. Qiang, W. Ding, M. Kuijjer, J. Quackenbush, and P. Chen, “Clustering sparse data with feature correlation with application to discover subtypes in cancer,” IEEE Access, vol. 8, pp. 67775–67789, 2020.
M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proceedings of the National Academy of Sciences, vol. 95, no. 25, pp. 14863–14868, 1998.
C. M. Perou, T. Sørlie, M. B. Eisen et al., “Molecular portraits of human breast tumours,” Nature, vol. 490, no. 7418, pp. 61–70, 2000.
C. Scheidegger, L. Sigg, and R. Behra, “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Brain Research, vol. 501, no. 2, pp. 205–214, 1999.
C.-H. Zheng, D.-S. Huang, L. Zhang, and X.-Z. Kong, “Tumor clustering using nonnegative matrix factorization with gene selection,” IEEE Transactions on Information Technology in Biomedicine, vol. 13, no. 4, pp. 599–607, 2009.
Y. Gao and G. Church, “Improving molecular cancer class discovery through sparse non-negative matrix factorization,” Bioinformatics, vol. 21, no. 21, pp. 3970–3975, 2005.
I. T. Jolliffe, “Principal component analysis,” Journal of Marketing Research, vol. 87, no. 100, p. 513, 2002.
K. Y. Yeung and W. L. Ruzzo, “Principal component analysis for clustering gene expression data,” Bioinformatics, vol. 17, no. 9, pp. 763–774, 2001.
H.-Q. Wang, D.-S. Huang, X.-M. Zhao, and X. Huang, “A novel clustering analysis based on PCA and SOMs for gene expression patterns,” Lecture Notes in Computer Science, vol. 3174, pp. 476–481, 2004.
C. Meng, O. A. Zeleznik, G. G. Thallinger, B. Kuster, A. M. Gholami, and A. C. Culhane, “Dimension reduction techniques for the integrative analysis of multi-omics data,” Briefings in Bioinformatics, vol. 17, no. 4, pp. 628–641, 2016.
N. Yu, M.-J. Wu, J.-X. Liu, C.-H. Zheng, and Y. Xu, “Correntropy-based hypergraph regularized NMF for clustering and feature selection on multi-cancer integrated data,” IEEE Transactions on Cybernetics, vol. 51, no. 8, pp. 3952–3963, 2021.
C.-N. Jiao, Y.-L. Gao, N. Yu, J.-X. Liu, and L.-Y. Qi, “Hyper-graph regularized constrained NMF for selecting differentially expressed genes and tumor classification,” IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 10, pp. 3002–3011, 2020.
C. Wang, Y.-L. Gao, X.-Z. Kong, J.-X. Liu, and C.-H. Zheng, “Unsupervised cluster analysis and gene marker extraction of scRNA-seq data based on non-negative matrix factorization,” IEEE Journal of Biomedical and Health Informatics, 2021, In press.
R. Vidal, “Subspace clustering,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 52–68, 2011.
J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman, “Clustering appearances of objects under varying illumination conditions,” IEEE Computer Society Conference on Computer Vision & Pattern Recognition, vol. 1, p. I, 2003.
Y. Cui, C.-H. Zheng, and J. Yang, “Identifying subspace gene clusters from microarray data using low-rank representation,” PLoS One, vol. 8, no. 3, Article ID e59377, 2013.
E. Elhamifar and R. Vidal, “Sparse subspace clustering,” Proceedings of IEEE Conference on Computer Vision & Pattern Recognition, vol. 35, no. 11, pp. 2790–2797, 2009.
F. Nie and H. Huang, “Subspace clustering via new low-rank model with discrete group structure constraint,” in Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1874–1880, New York, NY, USA, July 2016.
B. Nasihatkon and R. Hartley, “Graph connectivity in sparse subspace clustering,” in Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, pp. 2137–2144, Colorado Springs, CO, USA, June 2011.
C. Tang, X. Zhu, X. Liu et al., “Learning a joint affinity graph for multiview subspace clustering,” IEEE Transactions on Multimedia, vol. 21, no. 7, pp. 1724–1736, 2019.
C. Tang, X. Liu, X. Zhu et al., “Feature selective projection with low-rank embedding and dual Laplacian regularization,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 9, 2020.
F. Nie, W. Chang, Z. Hu, and X. Li, “Robust subspace clustering with low-rank structure constraint,” IEEE Transactions on Knowledge and Data Engineering, p. 1, 2020, In press.
G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184, 2013.
H. Xu, C. Caramanis, and S. Sanghavi, “Robust PCA via outlier pursuit,” IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3047–3064, 2010.
W. Siming and L. Zhouchen, “Analysis and improvement of low rank representation for subspace segmentation,” 2011, https://arxiv.org/abs/1107.1561.
H. Gao, F. Nie, W. Cai, and H. Huang, “Robust capped norm nonnegative matrix factorization: capped norm NMF,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 871–880, Toronto, Canada, October 2015.
F. Nie, Z. Huo, and H. Huang, “Joint capped norms minimization for robust matrix recovery,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2557–2563, Melbourne, Australia, August 2017.
F. Nie, H. Huang, and C. H. Ding, “Low-rank matrix recovery via efficient Schatten p-norm minimization,” in Proceedings of the 26th AAAI Conference on Artificial Intelligence, pp. 655–661, Toronto, Canada, July 2012.
Z. Zhang, X. Liu, and L. Wang, “Spectral clustering algorithm based on improved Gaussian kernel function and beetle antennae search with damping factor,” Computational Intelligence and Neuroscience, vol. 2020, Article ID 1648573, 9 pages, 2020.
E. H. Lieb and W. E. Thirring, Inequalities for the Moments of the Eigenvalues of the Schrödinger Hamiltonian and Their Relation to Sobolev Inequalities, Springer, Berlin, Germany, 2005.
H. Araki, “On an inequality of Lieb and Thirring,” Letters in Mathematical Physics, vol. 19, no. 2, pp. 167–170, 1990.
L. Yu, C. Ding, and S. Loscalzo, “Stable feature selection via dense feature groups,” Proceedings of ACM Sigkdd International Conference on Knowledge Discovery & Data Mining, vol. 40, no. 1, pp. 803–811, 2008.
A. Statnikov, I. Tsamardinos, Y. Dosbayev, and C. F. Aliferis, “GEMS: A system for automated cancer diagnosis and biomarker discovery from microarray gene expression data,” International Journal of Medical Informatics, vol. 74, no. 7, pp. 491–503, 2005.
G. Getz, H. Gal, I. Kela, D. A. Notterman, and E. Domany, “Coupled two-way clustering analysis of breast cancer and colon cancer gene expression data,” Bioinformatics, vol. 19, no. 9, pp. 1079–1089, 2003.
D. Cai, X. He, X. Wu, and J. Han, “Non-negative matrix factorization on manifold,” in Proceedings of the 8th IEEE International Conference on Data Mining, pp. 63–72, Pisa, Italy, December 2008.
X. Chen and C. Jian, “Gene expression data clustering based on graph regularized subspace segmentation,” Neurocomputing, vol. 143, no. 16, pp. 44–50, 2014.
X. Chen, M. Liao, and X. Ye, “Projection subspace clustering,” Journal of Algorithms & Computational Technology, vol. 11, no. 3, pp. 224–233, 2017.
P. A. Estévez, M. Tesmer, C. A. Perez, and J. M. Zurada, “Normalized mutual information feature selection,” IEEE Transactions on Neural Networks, vol. 20, no. 2, pp. 189–201, 2009.
G. Liu and S. Yan, “Latent low-rank representation for subspace segmentation and feature extraction,” in Proceedings of the International Conference on Computer Vision, no. 4, pp. 1615–1622, Barcelona, Spain, November 2011.
D. Kong, H. Huang, and H. Huang, “Robust nonnegative matrix factorization using L21-norm,” in Proceedings of the ACM International Conference on Information and Knowledge Management, pp. 673–682, Scotland, UK, October 2011.
J. A. Hartigan, “Direct clustering of a data matrix,” Journal of the American Statistical Association, vol. 67, no. 337, pp. 123–129, 1972.

Copyright

Copyright © 2021 Jian Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
