#### Abstract

Nonnegative Matrix Factorization (NMF) is a significant big data analysis technique. However, standard NMF regularized by a simple graph has no discriminative power, and traditional graph models cannot accurately capture the high-order geometric relationships among data. To solve these problems, this paper proposes a new method called Hypergraph Regularized Discriminative Nonnegative Matrix Factorization (HDNMF), which captures the intrinsic geometry by constructing hypergraphs rather than simple graphs. The hypergraph allows high-order relationships between samples to be considered, and the introduction of label information gives the method discriminative power. Both the hypergraph Laplacian and the discriminative label information are used together to learn the projection matrix of the standard method. In addition, we provide a corresponding multiplicative update solution for the optimization. Experiments indicate that the proposed method is more effective than earlier methods.

#### 1. Introduction

With the development of sequencing technology [1] and gene detection technology [2], large amounts of genomic data have been collected. Genomic data are typically high-dimensional with small sample sizes, and extracting useful information from massive genomic data has become a highly challenging task. To increase the processing efficiency of such high-dimensional data, a series of dimensionality reduction techniques [3] have been proposed. Among the various dimensionality reduction methods, NMF and its improved NMF-based variants are widely used in the field of gene data processing.

There is physiological and psychological evidence that humans rely on part-based representations for some object recognition tasks [4]. The Nonnegative Matrix Factorization (NMF) [5] method is capable of learning the various parts of faces and the semantic features of text. NMF is a powerful technique for component-based data analysis: it seeks two nonnegative matrices that provide a parts-based representation of the original data. NMF has been popular for decades and has been successfully applied in a wide range of fields, including robotics control [6], image analysis [7], and biomedical engineering [8]. Below we provide a brief introduction to the relevant methods.

A variant of NMF called Graph Regularized Nonnegative Matrix Factorization (GNMF) [9] was proposed by Cai et al.; it takes the geometric structure of the data into consideration and uses a K-nearest-neighbor graph to encode the geometry of the data. The method works well in clustering applications but achieves mediocre results in classification problems. To improve the classification performance of GNMF, a method named Graph Regularized Discriminative Nonnegative Matrix Factorization (GDNMF) [10] was proposed by Long et al., which considers both the geometry of the data and the label information. The discriminating power between different classes is increased by considering the label information. In the GDNMF method, the introduction of the dictionary matrix [11] has achieved excellent performance in the feature selection and classification of genomic data. The GNMF and GDNMF methods construct a simple graph based on the geometric relationships between the sample data, so the high-order relationships between samples may be ignored. Zeng et al. therefore integrated hypergraph regularization into the standard NMF, yielding Hypergraph Regularized Nonnegative Matrix Factorization (HNMF) [12]. The hypergraph regularization takes the intrinsic manifold structure of the sample data into account: the method encodes the geometric information of the data space by constructing a hypergraph rather than a simple graph. The data representations discovered by HNMF are not only parts-based but also sparse, so HNMF can show better performance. In addition, Peng et al. proposed an NMF variant called Parallel Vector Field Regularized Nonnegative Matrix Factorization for Image Representation [13], which can effectively improve the calculation speed. To increase the adaptability of the method and avoid the ambiguity of manual selection, Flexible Nonnegative Matrix Factorization with Adaptively Learned Graph Regularization was proposed by Peng et al. [14], which can effectively solve the above problems. In summary, the various variants of NMF have their own unique advantages in feature selection or clustering and classification.

On the one hand, discriminative Nonnegative Matrix Factorization has also been well applied in other areas, such as image representation [15], image classification [16], and diesel engine fault diagnosis [17]. On the other hand, hypergraph regularization has become more and more popular in recent years, for example, in image click prediction [18], image ranking [19], and image restoration [20].

Inspired by the above work, we propose a novel method called Hypergraph Regularized Discriminative Nonnegative Matrix Factorization (HDNMF), which takes the intrinsic manifold structure of the sample data into consideration. The geometry of the data space is encoded by constructing a hypergraph rather than a simple graph. To account for both the parts-based structure of the data and its spatial characteristics, the label information is treated as a significant factor. We construct a K-nearest-neighbor graph [21] to encode the geometry of the data space and increase the discriminative power between different classes. In this paper, an optimization scheme is presented in detail, and the objective function is solved by multiplicative iterative updates. Experiments demonstrate that our proposed method achieves better results than several NMF variants.

#### 2. Materials and Methods

Let the input matrix X have m rows and n columns, where rows represent genes and columns represent samples. Usually the value of m is large, which may make data processing inaccurate. Therefore, dimensionality reduction of the data matrix has become a crucial step. The standard NMF and its improved methods are popular techniques for dimensionality reduction. We introduce the standard NMF method and several of its improved variants in this section.

##### 2.1. Related Work

###### 2.1.1. Standard NMF (NMF)

Nonnegative factorization was introduced by Melvyn W. Jeter and Wallace C. Pye [22] in 1984, and positive matrix factorization by Paatero and Tapper [23] in 1995. Lee and Seung [24, 25] continued research on it. It is a matrix factorization method that analyzes high-dimensional data matrices with nonnegative factors. Assuming there are three nonnegative matrices X ∈ ℝ^(m×n), W ∈ ℝ^(m×k), and H ∈ ℝ^(k×n), standard NMF establishes the following approximation: X ≈ WH.

The objective function of standard NMF minimizes the Euclidean distance [26] between X and WH by using multiplicative update rules:

O_NMF = ‖X − WH‖_F²,  s.t. W ≥ 0, H ≥ 0,

where ‖·‖_F is the F-norm (Frobenius norm) of a matrix. X is called the input data matrix, W is called the basis matrix, and H is called the coefficient matrix; all elements involved are nonnegative. The update rules given in Lee and Seung's paper are as follows:

W ← W ⊙ (XHᵀ) ⁄ (WHHᵀ),  H ← H ⊙ (WᵀX) ⁄ (WᵀWH),  (2)

where ⊙ and the division are element-wise.
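As an illustration, the Lee-Seung updates can be implemented in a few lines of NumPy (a minimal sketch; the function name, the random initialization, and the small constant added for numerical stability are our own choices):

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Standard NMF via multiplicative updates, minimizing ||X - WH||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)  # H <- H * (W^T X) / (W^T W H)
        W *= (X @ H.T) / (W @ H @ H.T + eps)  # W <- W * (X H^T) / (W H H^T)
    return W, H
```

Each update multiplies the current factor element-wise by a nonnegative ratio, so nonnegativity is preserved automatically and the reconstruction error decreases monotonically.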

###### 2.1.2. Graph Regularized NMF (GNMF)

The GNMF [9] method constructs the geometric structure by using the nearest-neighbor graph. It minimizes the following objective function:

O_GNMF = ‖X − WH‖_F² + λ Tr(H L Hᵀ),  s.t. W ≥ 0, H ≥ 0,

where Tr(·) is the trace of a matrix, λ is the regularization parameter which controls the smoothness of the new representation, L is the graph Laplacian matrix (L = D − S), S is the weight matrix of the nearest-neighbor graph, and D is the diagonal degree matrix.
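The nearest-neighbor graph underlying this regularizer can be built as follows (an illustrative sketch using a binary p-nearest-neighbor weighting over the sample columns; GNMF also admits heat-kernel or dot-product weightings):

```python
import numpy as np

def knn_graph_laplacian(X, p=5):
    """Binary p-nearest-neighbor weight matrix S over the columns (samples)
    of X, and the graph Laplacian L = D - S."""
    n = X.shape[1]
    # Pairwise squared Euclidean distances between sample columns.
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:p + 1]  # skip the point itself
        S[i, nbrs] = 1.0
    S = np.maximum(S, S.T)                 # symmetrize
    D = np.diag(S.sum(axis=1))             # diagonal degree matrix
    return D - S, S
```

By construction L is symmetric and its rows sum to zero, the defining properties of a graph Laplacian.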

###### 2.1.3. Graph Regularized Discriminative NMF (GDNMF)

The GDNMF method applies both the intrinsic simple-graph geometry of the data and the discriminative label information to design its objective function:

O_GDNMF = ‖X − WH‖_F² + λ Tr(H L Hᵀ) + μ ‖Q − AH‖_F²,  s.t. W, H, A ≥ 0,

where L is the graph Laplacian matrix (L = D − S), S is the weight matrix of the nearest-neighbor graph, D is the diagonal degree matrix, and Q is the label matrix. A is initialized randomly in this method. W, H, and A are nonnegative matrices; λ and μ are nonnegative regularization parameters.

###### 2.1.4. Hypergraph Regularized NMF (HNMF)

The Laplacian eigenmaps (LE) method is a classical manifold method based on simple graphs. That method mainly considers relationships between pairs of vertices, while a hypergraph considers relationships among three or more vertices. In the HNMF method, the hypergraph and NMF are combined; its objective function is

O_HNMF = ‖X − WH‖_F² + λ Tr(H L_hyper Hᵀ),  s.t. W ≥ 0, H ≥ 0,

where Tr(·) is the trace of a matrix, λ is the regularization parameter, L_hyper is the hypergraph Laplacian matrix (L_hyper = D_v − S), S is the hypergraph weight matrix, and D_v is the diagonal vertex degree matrix.

##### 2.2. Methodology

Inspired by the GDNMF method and the HNMF method, we add discriminative label information to the HNMF method. The definition and multiplicative update rules for HDNMF are given below.

###### 2.2.1. Hypergraph Regularized Discriminative NMF (HDNMF)

In a simple graph, two vertices are connected by an edge, and the weight of the edge represents the affinity between the two vertices. In fact, the interrelationships among multiple vertices are also essential. To address this, the hypergraph [27, 28] was introduced; its hyperedges can link two or more vertices.

The hypergraph G = (V, E, W_e) consists of a vertex set V, a hyperedge set E, and a set of hyperedge weights W_e, where the weight of hyperedge e is denoted w(e). The incidence matrix H_c of the hypergraph is defined as [28]

h(v, e) = 1 if v ∈ e, and 0 otherwise.

The degree of a vertex v is

d(v) = Σ_{e∈E} w(e) h(v, e),

and the degree of a hyperedge e is

δ(e) = Σ_{v∈V} h(v, e).

Let D_v, D_e, and W be the diagonal matrices whose entries are the vertex degrees, the hyperedge degrees, and the hyperedge weights, respectively; then the unnormalized hypergraph Laplacian matrix is

L_hyper = D_v − S,  where S = H_c W D_e⁻¹ H_cᵀ.
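A common way to instantiate these definitions is to let every sample generate one hyperedge containing itself and its p nearest neighbors; the sketch below follows that construction (the heat-kernel edge weighting, the parameter names, and the defaults are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

def hypergraph_laplacian(X, p=4, sigma=1.0):
    """Unnormalized hypergraph Laplacian L = Dv - Hc W De^{-1} Hc^T,
    where each sample column of X spawns one hyperedge containing
    itself and its p nearest neighbors."""
    n = X.shape[1]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    Hc = np.zeros((n, n))   # incidence matrix: Hc[v, e] = 1 iff vertex v is in hyperedge e
    w = np.zeros(n)         # hyperedge weights
    for e in range(n):
        members = np.argsort(d2[e])[:p + 1]           # vertex e plus its p neighbors
        Hc[members, e] = 1.0
        w[e] = np.exp(-d2[e, members] / sigma ** 2).sum()  # heat-kernel weight
    Dv = np.diag(Hc @ w)                       # vertex degrees d(v) = sum_e w(e) h(v, e)
    De_inv = np.diag(1.0 / Hc.sum(axis=0))     # inverse hyperedge degrees 1 / delta(e)
    S = Hc @ np.diag(w) @ De_inv @ Hc.T
    return Dv - S, S, Dv
```

As with the simple-graph Laplacian, L_hyper is symmetric with zero row sums, since the row sums of S equal the vertex degrees.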

A label matrix Q ∈ ℝ^(c×n) is defined as follows:

Q_ij = 1 if the jth sample of X belongs to class i, and 0 otherwise,

where c is the number of categories in the training set.
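The label matrix is straightforward to construct (a minimal sketch; the helper name is our own):

```python
import numpy as np

def label_matrix(y, c):
    """Indicator matrix Q (c x n) with Q[i, j] = 1 iff sample j has label i."""
    y = np.asarray(y)
    Q = np.zeros((c, len(y)))
    Q[y, np.arange(len(y))] = 1.0
    return Q
```

Each column of Q is a one-hot encoding of the corresponding sample's class, so the columns sum to one.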

We obtain the following minimization problem:

O_HDNMF = ‖X − WH‖_F² + λ Tr(H L_hyper Hᵀ) + μ ‖Q − AH‖_F²,  s.t. W, H, A ≥ 0,  (11)

where Tr(·) is the trace of a matrix, L_hyper is the hypergraph Laplacian matrix (L_hyper = D_v − S), S is the hypergraph weight matrix, and D_v is the diagonal vertex degree matrix. A is initialized randomly in this method. W, H, and A are nonnegative matrices; λ and μ are nonnegative regularization parameters.

###### 2.2.2. The Update Rules of HDNMF

The multiplicative update rules are extended according to the standard NMF's F-norm [29] to find a local optimum. Equation (11) can be expanded as

O = Tr(XXᵀ) − 2Tr(XHᵀWᵀ) + Tr(WHHᵀWᵀ) + λ Tr(H L_hyper Hᵀ) + μ[Tr(QQᵀ) − 2Tr(QHᵀAᵀ) + Tr(AHHᵀAᵀ)].  (12)

Equation (12) can be written as a Lagrange function:

𝓛 = O + Tr(ΨWᵀ) + Tr(ΦHᵀ) + Tr(ΩAᵀ),

where Ψ, Φ, and Ω are the Lagrange multipliers.

The partial derivatives of 𝓛 with respect to W, H, and A, respectively, are

∂𝓛/∂W = −2XHᵀ + 2WHHᵀ + Ψ,
∂𝓛/∂H = −2WᵀX + 2WᵀWH + 2λH L_hyper − 2μAᵀQ + 2μAᵀAH + Φ,
∂𝓛/∂A = −2QHᵀ + 2AHHᵀ + Ω.

The following formulas can be obtained by using the KKT conditions (Ψ ⊙ W = 0, Φ ⊙ H = 0, Ω ⊙ A = 0):

(XHᵀ − WHHᵀ) ⊙ W = 0,
(WᵀX + λHS + μAᵀQ − WᵀWH − λHD_v − μAᵀAH) ⊙ H = 0,
(QHᵀ − AHHᵀ) ⊙ A = 0,

where L_hyper = D_v − S. We can then obtain the updating rules for W, H, and A:

W ← W ⊙ (XHᵀ) ⁄ (WHHᵀ),  (20)
H ← H ⊙ (WᵀX + λHS + μAᵀQ) ⁄ (WᵀWH + λHD_v + μAᵀAH),  (21)
A ← A ⊙ (QHᵀ) ⁄ (AHHᵀ),  (22)

where ⊙ and the division are element-wise. The algorithm of HDNMF is shown in Algorithm 1.
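Putting the update rules together gives the following minimal NumPy sketch (the initialization scheme and the small stabilizing constant are our own choices; S and Dv are the hypergraph weight and vertex-degree matrices so that the Laplacian is Dv − S, and Q is the label matrix):

```python
import numpy as np

def hdnmf(X, Q, S, Dv, k, lam=0.1, mu=0.1, n_iter=200, eps=1e-10, seed=0):
    """HDNMF multiplicative updates for the objective
    ||X - WH||_F^2 + lam*Tr(H (Dv - S) H^T) + mu*||Q - AH||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    c = Q.shape[0]
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    A = rng.random((c, k)) + eps
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X + lam * H @ S + mu * A.T @ Q) / \
             (W.T @ W @ H + lam * H @ Dv + mu * A.T @ A @ H + eps)
        A *= (Q @ H.T) / (A @ H @ H.T + eps)
    return W, H, A
```

Splitting the Laplacian term into its positive part (H Dv, in the denominator) and its negative part (H S, in the numerator) keeps every ratio nonnegative, so the factors stay nonnegative throughout.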

###### 2.2.3. Complexity Analysis

In this subsection, we compare the computational complexity of NMF and HDNMF based on the multiplicative update rules. The operation count of each iteration follows from (2) and (20)-(22). The results and parameters are listed in Tables 1 and 2.

For the HDNMF method, the weight matrix S is sparse: if each sample is linked to roughly p neighbors, computing HS needs approximately npk multiplications and additions rather than n²k. As for the label matrix Q, computing AᵀQ needs approximately cnk multiplications and additions, and computing AᵀAH needs approximately (c + n)k² multiplications and additions.

In addition to the multiplicative updates, HDNMF requires O(n²m) operations to set up the weight matrix S and O(cn) operations to set up the indicator matrix Q. Assuming t multiplicative iterations, the total cost of NMF is O(tmnk), while the total cost of HDNMF is approximately O(tmnk + tnpk + tcnk + n²m).

#### 3. Results and Discussion

On the one hand, HDNMF considers the high-order relationship about samples; on the other hand, the HDNMF method uses the label information to make the method discriminative while constructing the internal geometry of the data. To evaluate the effectiveness and discrimination of the method, the HDNMF method was compared with the other methods (NMF, DNMF, LNMF, GNMF, GDNMF, and HNMF).

##### 3.1. Datasets Description

The Cancer Genome Atlas (TCGA), the largest cancer genome database, holds immeasurably valuable information. The data include cholangiocarcinoma (CHOL), esophageal cancer (ESCA), pancreatic cancer (PAAD), colorectal cancer (COAD), and head and neck squamous cell carcinoma (HNSC) data. All of these data can be obtained from the TCGA database at https://cancergenome.nih.gov/. Each dataset consists of two classes, normal samples and diseased samples, and has 20502 dimensions. The rows of the data represent gene features, and the columns represent samples.

Firstly, the CHOL data contained 9 normal samples and 36 diseased samples, the PAAD data contained 4 normal samples and 176 diseased samples, the HNSC data contained 20 normal samples and 198 diseased samples, the ESCA data contained 9 normal samples and 183 diseased samples, and the COAD data contained 19 normal samples and 262 diseased samples. Then, the normal samples were removed. Finally, we integrated the PAAD, ESCA, and CHOL data into a dataset of dimensions 20502 × 395 (INTA 1) and integrated the PAAD, COAD, HNSC, ESCA, and CHOL data into one dataset of dimensions 20502 × 1055 (INTA 2).

##### 3.2. Implementation Issues

###### 3.2.1. Parameter Selection

In the experiments, the datasets used were INTA 1 and INTA 2. For each test, the parameters are tuned by 10-fold cross validation on the training data; an advantage of cross validation is that all test sets remain independent. The regularization parameters directly affect the classification results of the HDNMF method. To select the optimal parameters, we adjust the regularization parameters exponentially over a specific range, and tenfold cross validation on the corresponding training datasets automatically selects the best regularization parameters within that range. The results are shown in Figure 1.
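The exponential grid search described above can be sketched as follows (a minimal illustration; `fit_score`, which should run the 10-fold cross validation for one parameter pair and return its accuracy, is a placeholder supplied by the caller):

```python
from itertools import product

def grid_search(fit_score, lams, mus):
    """Exhaustive search over an exponential parameter grid.
    fit_score(lam, mu) returns the cross-validation accuracy
    for one (lambda, mu) pair; the best pair is returned."""
    best, best_acc = None, float("-inf")
    for lam, mu in product(lams, mus):
        acc = fit_score(lam, mu)
        if acc > best_acc:
            best, best_acc = (lam, mu), acc
    return best, best_acc
```

Because both parameters are swept on a logarithmic grid, the search covers several orders of magnitude with a modest number of evaluations.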

As shown in Figure 1, the warmer the color, the higher the classification accuracy (ACC). Each dataset achieves better results in a particular region of the (λ, μ) grid: the warm regions mark the best-performing parameter combinations for INTA 1 and INTA 2. In particular, over part of the range the classification result is not sensitive to one of the parameters.

###### 3.2.2. The Influence of Dimensions on Classification ACC

In this part, we run the HDNMF method with the dimension ranging from 1 to 30 to explore the effect of the dimension on classification. Based on the results in Figure 2, the following points can be summarized:

(1) When the dimension is too small, the classification accuracy of all methods is not ideal, mainly because aggressive dimensionality reduction loses a large amount of useful information

(2) As the dimension increases, the classification accuracy gradually improves and then stabilizes beyond a certain point, mainly because a sufficiently large dimension avoids heavy information loss

###### 3.2.3. HDNMF Time Comparison

We compare the average runtime of all the methods on the two datasets in Table 4. The experimental results were obtained by running all methods on an Intel(R) Core(TM) i7-7700 CPU @ 3.6 GHz Windows server with 64 GB RAM. The time costs of the classification process for all the methods are listed. Evidently, NMF methods with constraints usually require more runtime, and our proposed HDNMF method requires the longest runtime. The main reason is that the HDNMF method spends a lot of time constructing the hypergraph and updating the hyperedges in each iteration. Therefore, how to speed up the proposed algorithm is interesting future work.

###### 3.2.4. The Convergence Analysis

The updating rules for the HDNMF method in (20)-(22) are easy to implement, and it can be demonstrated that these rules converge. Figure 3 shows the convergence on the two datasets; the red dashed lines are the error values of the HDNMF method. We can see that our method converges within 100 iterations.

##### 3.3. Classification

Since HDNMF introduces label information into the NMF method, it is necessary to test its classification performance [28]. Many improved NMF methods exist for classification, such as the methods mentioned in [30, 31].

Classification is one of the most widely used technologies in the field of data mining. It is a technique for constructing a classifier based on the characteristics of experimental data and then using the classifier to label samples of unknown categories. A classifier generally works in two stages, training and testing: in the training stage, the characteristics of the training datasets are analyzed, and an accurate description of the corresponding datasets is generated for each category; in the testing stage, the model classifies the test set to measure its classification accuracy. In the experimental part, we use the KNN method to classify the samples. For a given set of training samples with class labels, a similarity measure is used to find the nearest neighbors most similar to the sample to be classified, and the decision is made according to a majority-voting principle.

The specific classification process (K-nearest neighbors, KNN) is as follows: (1) for the training sets, calculate the distance from the unknown sample to all known samples; (2) select the parameter k (the most basic parameter in KNN, representing the number of neighbors); (3) for the test sets, assign each unknown sample to the category holding the majority among its k nearest neighbors, according to the majority-voting rule.
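The three steps above can be sketched as a minimal majority-vote implementation operating on columns of the reduced representation (the function name and the tie-breaking toward the smaller label are our own choices):

```python
import numpy as np

def knn_predict(H_train, y_train, H_test, k=5):
    """Majority-vote k-NN: columns of H_train / H_test are samples
    in the reduced space; y_train holds their class labels."""
    y_train = np.asarray(y_train)
    preds = []
    for j in range(H_test.shape[1]):
        # Step 1: squared distances from the unknown sample to all known samples.
        d2 = ((H_train - H_test[:, [j]]) ** 2).sum(axis=0)
        # Step 2: take the k nearest neighbors.
        nbrs = np.argsort(d2)[:k]
        # Step 3: majority vote among their labels.
        vals, counts = np.unique(y_train[nbrs], return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)
```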

###### 3.3.1. HDNMF for INTA 1 and INTA 2 Classification

With the dimension equal to 3 for INTA 1 and 5 for INTA 2, the classification accuracies are listed in Table 3. The optimal results are in italics, and the experiments show that our HDNMF method achieves better classification results than the other methods.

It can be concluded that the introduction of label information is the main reason for the significant improvement in classification accuracy. Therefore, the HDNMF method not only improves the interpretability of the NMF method but also overcomes the ambiguity of unsupervised learning on the training samples.

##### 3.4. Co-Differentially Expressed Gene Selection

High-dimensional small-sample data contain many unrelated or redundant features. In biological processes, most genes serve the standard functions of life support, so researchers focus on the small number of differentially expressed genes that play a critical role in life. Feature selection [32] is the most direct way to address these problems and identify differentially expressed genes: it can replace the original data by choosing the most informative features without losing much information. Here we show the experimental results and analysis of different methods for selecting common differentially expressed genes (co-differentially expressed genes) on the integrated datasets. The relevant score (RS) refers to the correlation between genes and diseases; the higher the relevant score, the higher the correlation. The number of genes (NUM) refers to the number of co-differentially expressed genes selected by each method that match the proven disease genes in GeneCards.
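The paper does not spell out its exact selection criterion here; one common scheme in NMF-based gene selection ranks each gene (a row of the basis matrix W) by its row norm and keeps the top-scoring genes. A minimal sketch under that assumption:

```python
import numpy as np

def select_genes(W, n_genes=500):
    """Rank genes (rows of the basis matrix W) by the sum of absolute
    loadings and return the indices of the top-scoring ones
    (an illustrative scoring scheme; the paper's criterion may differ)."""
    scores = np.abs(W).sum(axis=1)
    order = np.argsort(scores)[::-1]  # descending by score
    return order[:n_genes]
```

The returned indices can then be matched against validated disease genes (e.g., from GeneCards) to compute NUM, TRS, and ARS.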

How the GeneCards were used is described as follows: we visited the official website of GeneCards (https://www.genecards.org/). On the one hand, all types of diseased genes that have been validated can be obtained by entering the type of cancer at this web address (exporting the table for the determination of NUM). On the other hand, enter the gene name at this web address to get the details of the gene.

###### 3.4.1. INTA 1 Co-Differentially Expressed Gene Selection Results

The HDNMF method and the NMF series of methods are used to select co-differentially expressed genes of the integrated datasets. We selected 500 genes for each method to evaluate the co-differentially expressed genes selected by using different methods.

For the 500 co-differentially expressed genes selected by the seven methods, we compare the three diseases that make up the integrated INTA 1. The comparison data are from GeneCards (https://www.genecards.org/). The co-differentially expressed genes excavated by different methods are listed in Table 5: the number of genes (NUM), the total related score (TRS), and the average related score (ARS). Higher NUM, TRS, and ARS values indicate better method performance. From the table, we can see that the HDNMF method can achieve better experimental results.

We summarize the co-differentially expressed genes in the shared part in Table 6, including the co-differentially expressed genes of each method and the number of unique co-differentially expressed genes mined by each method. The unique co-differentially expressed genes selected by each method are indicated in italics. From the table, we can clearly see that the DNMF, LNMF, GNMF, and GDNMF methods mined 0 unique co-differentially expressed genes; the NMF and HNMF methods mined 2; and our HDNMF method mined 7. This indicates that our method works well.

In Table 7, we summarize the detailed information of the co-differentially expressed genes excavated only by the HDNMF method in the table by GeneCards, including the official name, related diseases, related GO annotations, RS, and paralog gene. All genes in the table have a high related score, which means that these genes can be regarded as important pathogenic genes to help disease research. Medical research in this area will also provide us with unexpected new discoveries. Therefore, the unique co-differentially expressed genes excavated by the HDNMF method can promote further research on these three diseases.

From Table 7, we can see the relevant information of the unique co-differentially expressed genes selected by the HDNMF method. ERBB3 has the highest relevance score. ERBB3 is a protein-coding gene; when it mutates, the tumor suppressor protein does not form properly. In pancreatic cancer, ERBB3 is a preferred dimerization partner for EGFR; ERBB3 protein expression levels are directly related to the antiproliferative effects of erlotinib (an EGFR-specific tyrosine kinase inhibitor), and transient knockdown of ERBB3 expression confers resistance to EGFR-targeted therapy. This contributes to the onset of cancer. ERBB3 has been shown to be associated with PAAD [33]. ERBB3 forms heterodimers with other kinase-active EGF (epidermal growth factor) receptor family members. Heterodimerization leads to the activation of pathways causing cell proliferation or differentiation. Amplification of this gene and/or overexpression of its protein has been reported in many cancers, such as PAAD, ESCA [34], and CHOL [35]. Therefore, mutations in one gene may be involved in the production of multiple cancers, which suggests that biologists can further study the links between different cancers.

###### 3.4.2. INTA 2 Co-Differentially Expressed Gene Selection Results

Similar to Table 5, Table 8 was obtained by the same treatment, giving the NUM, TRS, and ARS of the INTA 2 co-differentially expressed genes. From Table 8, we can see that our method outperforms the other methods.

After the intersection of the co-differentially expressed genes selected by the seven methods is removed, the co-differentially expressed genes mined by each method are given in Table 9. As shown in Table 9, the numbers of unique public genes mined by the NMF, DNMF, LNMF, GNMF, GDNMF, and HNMF methods are 0, 0, 4, 0, 0, and 0, respectively, while the HDNMF method mined 5. Comparing the obtained co-differentially expressed genes, we can identify unique co-differentially expressed genes obtained only by our method, and these genes merit future disease research. Therefore, our method is more suitable for mining co-differentially expressed genes. The unique co-differentially expressed genes selected by each method are indicated in italics.

Some of the results from these and other studies are summarized below:

(1) More unique co-differentially expressed genes are mined by HDNMF on the INTA 1 datasets than on the INTA 2 datasets. This difference is due to the differing complexity of the datasets: INTA 1 comprises three diseases, while INTA 2 comprises five

(2) The genes AGT, ANPEP, CRP, and ERBB3 in Tables 8 and 10 have higher correlation scores with the experimental datasets, and these genes are ignored by most mining methods. This reflects the accuracy of the HDNMF method

##### 3.5. The Pathway Analysis

The joint biological processes of the genes selected in the experiment can be attributed to pathways, which help us understand the advanced functions of biology and biological systems at the molecular level. We used the Kyoto Encyclopedia of Genes and Genomes (KEGG) online analysis tool to analyze the co-differentially expressed genes identified by HDNMF. In this experiment, the 500 identified genes were submitted to KEGG to obtain the corresponding disease pathways. The FDR values of the HDNMF method and the other methods are shown in Table 11 (INTA 1) and Table 12 (INTA 2). The smaller the FDR, the better the result, so the HDNMF method achieves better results.

Taking ECM-receptor interaction as an example, the literature has been shown to be closely related to PAAD [36], ESCA [37], and CHOL [38]. Other pathways can also prove their rationality by consulting the literature.

#### 4. Conclusions

We presented a novel matrix factorization method called Hypergraph Regularized Discriminative Nonnegative Matrix Factorization (HDNMF). The method introduces the hypergraph and discriminative label information into the standard NMF method. On the one hand, the hypergraph can find high-order geometric relations that are neglected by simple graphs; on the other hand, the discriminative label information gives the method a supervisory function. Experiments have shown that HDNMF can achieve better results than standard NMF and its improved methods. Since constructing the hypergraph and its hyperedges takes considerable time, the program runs longer than the other methods. How to accelerate the proposed algorithm is interesting future work.

#### Data Availability

The datasets that support the findings of this study are available in https://cancergenome.nih.gov/.

#### Conflicts of Interest

There are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61872220 and 61572284.