Abstract

Metabolic pathway is an important type of biological pathways. It produces essential molecules and energies to maintain the life of living organisms. Each metabolic pathway consists of a chain of chemical reactions, which always need enzymes to participate in. Thus, chemicals and enzymes are two major components for each metabolic pathway. Although several metabolic pathways have been uncovered, the metabolic pathway system is still far from complete. Some hidden chemicals or enzymes are not discovered in a certain metabolic pathway. Besides the traditional experiments to detect hidden chemicals or enzymes, an alternative pipeline is to design efficient computational methods. In this study, we proposed a powerful multilabel classifier, called iMPTCE-Hnetwork, to uniformly assign chemicals and enzymes to metabolic pathway types reported in KEGG. Such classifier adopted the embedding features derived from a heterogeneous network, which defined chemicals and enzymes as nodes and the interactions between chemicals and enzymes as edges, through a powerful network embedding algorithm, Mashup. The popular RAndom k-labELsets (RAKEL) algorithm was employed to construct the classifier, which incorporated the support vector machine (polynomial kernel) as the basic classifier. The ten-fold cross-validation results indicated that such a classifier had good performance with accuracy higher than 0.800 and exact match higher than 0.750. Several comparisons were done to indicate the superiority of the iMPTCE-Hnetwork.

1. Introduction

Metabolic pathway is an essential type of biological pathways in living organisms. It generates necessary molecules and energies to maintain the life of the organisms [1]. In each metabolic pathway, there are several continuous chemical reactions, which change one molecule to another with the help of some enzymes. Chemicals and enzymes are the two major components in each metabolic pathway. Identification of chemicals and enzymes in each pathway as full as possible is helpful to understand its mechanism. To date, several public databases, such as KEGG [2, 3], provide detailed information of validated metabolic pathways. However, the completeness of each metabolic pathway is still a problem. There still exist undiscovered chemicals or enzymes for some metabolic pathways. The traditional experiment is a solid pipeline to determine novel chemicals or enzymes of metabolic pathways. However, it is always time-consuming and expensive. Thus, it is urgent to design novel approaches to accelerate the detection procedures and decrease the costs.

With the development of computer science and technique, it becomes more and more popular to tackle different biological and medical problems with advanced computational methods. Among them, the machine learning-based method is always an important choice. For the problem addressed in this study, several such methods have been proposed in the recent ten years. Most of them were designed to assign chemicals to corresponding metabolic pathway types. Cai et al. [4] first built a nearest neighbor algorithm- (NNA-) based model to predict metabolic pathway type of chemicals, where chemicals were represented by functional group compositions. Later, Lu et al. [1] improved this model by adopting a more powerful classification algorithm, AdaBoost. These two methods can only deal with chemicals participating in only one metabolic pathway type. In fact, several chemicals can belong to two or more pathway types, inducing the limitation of the above methods. After that, investigators began to design models that can deal with chemicals in multiple metabolic pathway types. Hu et al. [5] gave a computational method with chemical-chemical interaction (CCI) information, which can rank the candidate pathway types for a given chemical. The chemical has the highest probability to participate in the first pathway type, followed by the second pathway type, and so on. Chen et al. [6] adopted the same scheme to list the candidate pathway types. Chemicals were encoded by their molecular fragment features, and a support vector machine (SVM) [7] was adopted to give a score to each pathway type. Although the above methods can process chemicals with multiple pathway types, they were not pure multilabel classifiers because they cannot determine which pathway types are the predicted pathways. Recently, Baranwal et al. [8] presented a powerful multilabel classifier to assign chemicals to multiple pathways, which adopted graph convolutional networks for obtaining the molecular shape features of chemicals. Jia et al. [9] built a multilabel web server, iMPT-FRAKEL, to predict metabolic pathways of chemicals. This web server had wide applications because it only needed the SMILEs strings of chemicals as the input. Besides, some other studies tackled the problem in different ways. Fang and Chen [10] deemed the pairs of chemicals and pathway types as samples. In this case, the multilabel classification problem was transformed into a binary classification problem. Jia et al. [11] extended the above model to an actual metabolic pathway rather than a pathway type. The concept of “similarity” was adopted to extract essential features for each pair of chemical and pathway. Guo et al. [12] constructed a SVM-based model for each pathway type, where chemicals were represented by embedding features extracted from multiple chemical networks. From the above descriptions, we can see that they only tackled one component, chemicals, in metabolic pathways. As for the other component, enzyme, only one study is involved to our knowledge. Gao et al. [13] generalized Hu et al.’s method [5] by employing chemical-protein interaction (CPI) and protein-protein interaction (PPI) information, thereby giving a pathway type rank for a given chemical or enzyme. As mentioned above, such method was not a pure multilabel classifier and it directly used the linkage between chemicals and proteins but not deeply mine the hidden information behind the linkage.

In this study, we adopted the metabolic pathway information reported in KEGG, where chemicals and enzymes are classified into 11 metabolic pathway types. A heterogeneous network was built to organize the chemicals, enzymes, CCI, CPI, and PPI information, where chemicals and enzymes defined nodes and three types of interaction information determined edges. To fully mine deep information in the heterogeneous network, a powerful network embedding algorithm, Mashup [14], was applied to such network. Informative features were obtained for each chemical and enzyme. These features and labels, representing metabolic pathway types, were fed into the RAndom k-labELsets (RAKEL) [15] algorithm to build the classifier, iMPTCE-Hnetwork. SVM (polynomial kernel) [7] was adopted as the basic classifier. The effects of heterogeneous networks and the merits of combining the information of chemicals and enzymes were elaborated. Furthermore, the comparisons of other multilabel classifiers with Binary Relevance (BR) [16], features derived from other network embedding algorithms, or other basic classifiers were done to indicate the superiority of the iMPTCE-Hnetwork.

2. Materials and Methods

2.1. Materials

The chemicals and enzymes (human) in metabolic pathways were retrieved from the KEGG PATHWAY (https://www.genome.jp/kegg/pathway.html, accessed in September 2019) [2, 3]. 5682 chemicals, encoded by KEGG IDs, and 792 enzymes, represented by EC numbers, were obtained. For the same representation of chemicals and enzymes in the constructed network, KEGG IDs of chemicals were map onto their PubChem IDs and specific human proteins of obtained EC numbers were extracted, which were further converted into Ensembl IDs. According to KEGG PATHWAY, these chemicals and enzymes were classified into 11 metabolic pathway types, which are listed in column 1 of Table 1. All above-obtained chemicals and enzymes were used as nodes in the constructed heterogeneous network that was described in “Heterogeneous Network Construction.” However, some nodes were isolated in the network and were discarded. As a result, we obtained 2329 chemicals (PubChem IDs) and 1124 human enzymes (Ensembl IDs). The number of chemicals and enzymes in each metabolic pathway type is listed in Table 1. Detailed chemicals and enzymes in each metabolic pathway type are provided in Table S1.

It is easy to see from Table 1 (last two rows) that the total number of chemicals/enzymes in eleven metabolic pathway types was larger than the number of different chemicals/enzymes. Thus, the problem of assigning chemicals and enzymes to metabolic pathway types was a multilabel classification problem. In this study, a uniform multilabel classifier was built to correctly predict metabolic pathway types of chemicals and enzymes.

2.2. Heterogeneous Network Construction

The chemicals and enzymes are the major components in metabolic pathways. The classic feature extraction methods always pick up essential features from the properties of themselves. With the development of the network technique, it provides another pipeline to access important features of chemicals and enzymes. Here, we adopted a network scheme to organize chemicals and enzymes.

To construct the network, we downloaded the information of CCIs and CPIs from STITCH (http://stitch.embl.de/, version 4.0) [17, 18]. Furthermore, the information of PPIs was retrieved from STRING (https://string-db.org/, version 10.0) [19, 20]. For CCIs, the file “chemical_chemical.links.v4.0.tsv.gz” was downloaded, from which we extracted the CCIs between 2329 chemicals. As a result, 82368 CCIs were obtained. Each CCI contained two chemicals, represented by PubChem IDs, and one confidence score with a range between 1 and 999. Such score integrated several types of associations derived from different aspects of chemicals, including structures, activities, reactions, and literature occurrence. Thus, such score can widely measure the associations of chemicals. For formulation, let us denote this score on the CCI between chemicals and as . For CPIs, we downloaded the file, named “9606.protein_chemical.links.v4.0.tsv.gz.” From this file, the CPIs between 2329 chemicals and 1124 enzymes were picked up, resulting in 41066 CPIs. Each CPI consists of one chemical and one protein, denoted by PubChem ID and Ensembl ID, respectively, and one confidence score with a range between 1 and 999. Such score is also obtained by evaluating several aspects of chemicals and proteins. The score between chemical and protein was denoted by . As for PPIs, the file “9606.protein.links.v10.txt.gz” in STRING was downloaded. We extracted PPIs between 1124 enzymes, obtaining 59868 PPIs. Two proteins, represented by Ensembl IDs, and one confidence score comprised each PPI. Likewise, the score integrated several types of associations of proteins and can widely measure their linkage, and its range is also between 1 and 999. For convenience, the score of proteins and was denoted by . Because all the above confidence score is between 1 and 999, we refined them by dividing it by 1000 so that the refined confidence score was between 0 and 1.

According to CCIs retrieved from STITCH, we constructed a chemical network. Such network defined 2329 chemicals as nodes. Two nodes were adjacent if and only if their corresponding chemicals can comprise a CCI with the refined confidence score larger than zero. Moreover, the refined confidence score was assigned to the corresponding edge as its weight. For convenience, such network was denoted by . A bipartite network was built according to CPIs retrieved from STITCH. Each edge connected a chemical node and an enzyme node if they can comprise a CPI with the refined confidence score higher than zero. Likewise, the refined confidence score was defined as the weight of the corresponding edge. This network was denoted as . The third protein network was constructed with the PPIs obtained from STRING. Two nodes were connected by an edge if and only if their corresponding proteins can comprise a PPI with the refined confidence score higher than zero. Also, the refined score was assigned to the edge as its weight. Such network was denoted by .

The above-constructed three networks were combined to build a large heterogeneous network. For an easy description, this network was denoted as . The construction procedures of are illustrated in Figure 1.

2.3. Network Embedding Algorithm

A heterogeneous network was built in the above section. Informative relationship between chemicals and enzymes was contained in such network. In this study, a powerful network embedding algorithm, Mashup [14], was employed to extract informative features of chemicals and enzymes. This algorithm has been adopted to deal with several biological and medical problems [12, 2127]. Its brief description was as follows.

The Mashup consists of two stages to extract embedding features of nodes in a network, say . In the first stage, each node in is picked up as the seed node of the random walk with restart (RWR) algorithm [28, 29]. When the RWR algorithm stops, probabilities assigned to all nodes are aligned together to comprise a raw feature vector of , denoted as . However, such vector has a high dimension. A dimensionality reduction procedure is necessary, which is done in the second stage. Let be the final feature vector of and be the context feature vector of in . The purpose of the second stage is to determine the optimal components in these two vectors. Thus, an optimization problem is set up as follows: where stands for the number of nodes in network , denotes the function of KL-divergence (relative entropy), the components in are defined as below

The outcome was selected as the feature vector of , which would be used to construct the classification model.

This study adopted the Mashup program retrieved from http://cb.csail.mit.edu/cb/mashup/. For convenience, default parameters were used.

2.4. Multilabel Classifier (iMPTCE-Hnetwork)

Because some chemicals/enzymes can belong to two or more metabolic pathway types, the problem of assigning chemicals/enzymes to metabolic pathway types is evidently a multilabel classification problem. Generally, to deal with such problem, there are two types of schemes. The first one is problem transformation, that is, the original multilabel classification problem is transformed into several single-label classification problems. The second one is the algorithm adaption. This scheme extends the existing one single-label classification algorithm so that the new algorithm can process multilabel problems. In this study, we adopted the first scheme to build the classification model.

RAndom k-labELsets (RAKEL) [15] is a classic method to build multilabel classifiers, which can be deemed as the generalization of the Label Powerset (LP) algorithm. To date, this method has been applied to build several multilabel classifiers for tackling different biological and medical problems [9, 21, 3034]. For a multilabel problem involving labels ( in this study), let be its label set. Select an integer with and construct all k-subsets of . For one randomly selected -subset , an LP classifier is constructed. In detail, members in the power set of are defined as the new labels. And each sample is assigned a new label according to its original labels. For example, if a sample has two labels, say and , a new label, representing , is assigned to this sample. After that, each sample has only one new label, and a single-label classifier, called LP classifier, can be built based on a single-label classification algorithm. Clearly, multiple LP classifier with different -subsets should be constructed because one -subset cannot cover all labels. Thus, there are another parameters, , of RAKEL to determine the number of LP classifiers. The final multilabel classifier integrates the above-constructed LP classifiers.

Given a query sample, each LP classifier gives its binary decision on each label. The average of binary decisions on each label is computed. If the average is larger than a predefined threshold (generally, it is set to 0.5), the corresponding label is assigned to the sample.

In this study, we adopted the tool “RAKEL” in Meka (http://waikato.github.io/meka/) [35], which implements the RAKEL described above. As mentioned above, and are two major parameters of RAKEL. Several values were set to them for building an optimal multilabel classifier. For an easy description, the classifier constructed by RAKEL was called the RAKEL classifier.

2.5. Classification Algorithm

The RAKEL was used to construct a multi-label classifier. One basic single-label classification algorithm was necessary to set up multiple LP classifiers. Here, we selected one of the most classic classification algorithms, SVM [7], which has wide applications in bioinformatics [21, 30, 31, 3640].

SVM is a statistical learning theory-based classification algorithm. The key point is to find out an optimal hyperplane, which can separate samples into two classes with maximum margin. However, in most cases, samples in their original space are not linearly separable, that is, such hyperplane cannot be found out. A kernel function is employed to map samples into another space with higher dimensions, in which samples are linearly separable. After the optimal hyperplane has been discovered, the class of a new sample is determined according to the side of the hyperplane it belongs to. The original SVM can only deal with binary problems. Some schemes (e.g., one-versus-others, one-versus-one) can be adopted so that it can tackle problems with multiple classes.

In this study, we selected the SVM whose training procedures are optimized by the sequential minimal optimization algorithm [41]. Such type of SVM is integrated in Meka. The tool “RAKEL” can directly invoke it. Two kernel functions (polynomial kernel, RBF kernel) were tried to select the best one.

2.6. Performance Evaluation

All constructed multilabel classifiers were evaluated by ten-fold cross-validation [42]. In such test, samples are divided into ten parts. Each part is singled out one by one as the test dataset, whereas the rest nine parts are used to training the classifier. As a result, each sample is tested exactly once.

According to the results of ten-fold cross-validation, some measurements can be computed. Here, we selected three measurements: accuracy, exact match, and hamming loss [9, 21, 30, 31]. Their definitions are as below where stands for the total number of samples, represents the number of labels, and denote the actual and predicted label set of the -th sample, indicates the symmetric difference operation, and is defined as follows:

Evidently, for accuracy and exact match, the higher they are, the better performance the classifier has. On the contrary, the lower the hamming loss is, the better the performance is. Furthermore, to give a uniform evaluation, we used the following equation to integrate the above three measurements

The classifier with a high integrated score indicated its high performance. We tried to construct a classifier with an integrated score as high as possible.

3. Results and Discussion

In this study, we proposed a multilabel classifier, called the iMPTCE-Hnetwork, to identify metabolic pathway types of chemicals and enzymes. The entire procedures are illustrated in Figure 2. This section gave evaluation results of such a classifier and elaborated its high utility.

3.1. Performance of the iMPTCE-Hnetwork

The constructed classifier used the feature vectors by applying Mashup on a heterogeneous network . However, the dimension of the vector was a problem. Here, we tried six dimensions varying between 50 and 300 with an interval of 50. As for the main parameters and in RAKEL, was set to 5 and 10, was tried on each value between 2 and 11. Furthermore, SVM was selected as the basic classification algorithm. The kernel was set as the polynomial kernel, where the exponent (E) was set to 1, 2, and 3. Three values, including 1, 2, and 3, of regularization parameter were tried. A lot of classifiers with all possible settings were constructed, which were further evaluated by ten-fold cross-validation.

The test results indicated that when , , , , and , the iMPTCE-Hnetwork yielded the highest integrated score, which was 0.591 (Table 2). The accuracy, exact match, and hamming loss were 0.818, 0.754, and 0.042 (Table 2), respectively. The accuracy was higher than 0.800, and the exact match exceeded 0.750, suggesting the good performance of the iMPTCE-Hnetwork.

Besides, to fully evaluate the performance of iMPTCE-Hnetwork with ten-fold cross-validation, we further did the ten-fold cross-validation 100 times. The obtained values of accuracy, exact match, hamming loss, and integrated score induced four violin plots, as shown in Figure 3. It can be observed that accuracy varied between 0.805 and 0.825, the exact match changed between 0.740 and 0.760, the hamming loss was between 0.040 and 0.045, and the integrated score varied between 0.570 and 0.600. These results implied that the iMPTCE-Hnetwork was quite stable for different divisions of samples.

3.2. Analysis of the Effects of Heterogeneous Network

The proposed classifier, iMPTCE-Hnetwork, adopted the chemical and enzyme features derived from the heterogeneous network . Clearly, the accuracy of is an essential factor which can influence the performance of the classifier. In this section, an analysis would be given to indicate the importance of the heterogeneous network. Furthermore, we also analyzed the contributions of chemical and protein networks for building the classifier.

For the heterogeneous network , a permutation was done for nodes (representing enzymes) and nodes (denoting chemicals), respectively. Obtained features (250-D) were fed into the RAKEL to construct the classifier. The same parameter setting (, , , ) was used. Such a classifier was assessed by ten-fold cross-validation. To make the results more reliable, the above procedures were done 100 times, resulting in 100 values of accuracy, exact match, hamming loss, and integrated score. These values are shown in Figure 4, in which the performance of iMPTCE-Hnetwork is also listed. It can be observed that the permutation made the performance of the classifier very poor. Compared with the performance of iMPTCE-Hnetwork, the accuracy, exact match, and integrated score reduced about 0.667, 0.630, and 0.570, respectively, whereas hamming loss increased about 0.145. It is indicated that the accuracy of heterogeneous network is very important to construct the efficient classifier.

In addition, we also analyzed the importance of chemical and protein networks for constructing the classifier. First, we only permutated the protein (enzyme) nodes. Features (250-D) yielded by Mashup were used to construct the RAKEL classifier (, , , ), which was also evaluated by ten-fold cross-validation. Such procedures were also done 100 times. The average performance is shown in Figure 4. Evidently, the classifier became worse. The accuracy, exact match, and integrated score declined about 0.230, 0.220, and 0.306, respectively, whereas hamming loss increased about 0.056. Second, the above procedures were done for chemical nodes. Results are illustrated in Figure 4. Also, the performance of the classifier decreased. In detail, accuracy, exact match, and integrated score were about 0.478, 0.448, and 0.497, respectively, lower than those of the iMPTCE-Hnetwork, and the hamming loss was about 0.097 higher than that of the iMPTCE-Hnetwork. It can be observed that when chemical nodes were permutated, the performance of the classifier was much lower than that of the classifier with enzyme permutation, suggesting that the chemical network gave more contribution to build the classifier.

3.3. The Superiority of Combining the Information of Chemicals and Enzymes

As mentioned above, iMPTCE-Hnetwork provided good performance for the identification of metabolic pathway types of chemicals and enzymes. From its construction procedures, we can see that the information of chemicals and enzymes was poured into a uniform system, that is, the information of chemicals can be used to predict metabolic pathway types of enzymes and vice versa. This fact may lead to the superiority of the classifier. Here, we gave some analyses.

There were two essential stages that the information of chemicals and enzymes was utilized with each other. The first stage was the feature extraction. Because chemicals and enzymes have several different points, an ordinary method may only consider chemicals (enzymes) when extracting chemical (enzyme) features and excluding the information of enzymes (chemicals). In our classifier, chemicals and enzymes were all deemed as nodes in the heterogeneous network , that is, they were combined together to extract features. The second essential stage was the classification. In the iMPTCE-Hnetwork, the enzyme features were used to predict the metabolic pathway types of chemicals and vice versa. Ordinary methods may separate the classification procedures, i.e., the prediction of metabolic pathway types of chemicals (enzymes) only used the chemical (enzyme) features. Considering the above two stages, we did the following two tests.

For the first test, we extracted chemical features from the chemical network and enzyme features from the protein network with Mashup. In this case, the feature extraction procedures separated the information of chemicals and enzymes. Because we did not know which dimension was best, six dimensions from 50 to 300 with an interval of 50 were obtained. With a given dimension, a multilabel classifier was set up for chemicals and enzyme, respectively. We used the same parameter setting of iMPTCE-Hnetwork. Each classifier was evaluated by ten-fold cross-validation. For each combination of dimensions of chemicals and enzymes, we combined the cross-validation results to compute four measurements mentioned in “Performance Evaluation.” As a result, when the dimensions of chemicals and enzymes were all 100, we obtained the highest integrated score (0.120). The accuracy was 0.390, the exact match was 0.352, and hamming loss was 0.130, which are shown in Figure 5. It can be observed that such performance was much lower than that of the iMPTCE-Hnetwork, implying feature extraction with the combination of chemicals and enzymes improved the quality of chemical and enzyme features.

The second test was for the classification procedure. We used the 250-D feature vectors of the iMPTCE-Hnetwork. However, the classification procedures strictly separated chemicals and enzymes. The predicted results of chemicals and enzymes were combined to calculate four measurements. The integrated score was 0.583, the accuracy was 0.814, the exact match was 0.749, and the hamming loss was 0.043, as shown in Figure 5. This performance was slightly lower than that of the iMPTCE-Hnetwork, implying the classification procedure by combining chemical and enzyme features can also improve the performance of the classifier. However, its influence was much smaller than that of the feature extraction.

With the above arguments, the combination of chemicals and enzymes is an important aspect to cause the good performance of our classifier.

3.4. Comparison of the RAKEL Classifiers with Different Classification Algorithms

The classifier, iMPTCE-Hnetwork, was built based on SVM (polynomial kernel). In fact, we also tried SVM (RBF kernel) and random forest (RF) [43]. Like SVM, RF is also a widely used and powerful classification algorithm [8, 11, 22, 4447]. For SVM (RBF kernel), the same values of regularization parameter were tried, and was set to 0.01, 0.02, and 0.03. As for RF, the main parameter, the number of decision trees, was set to different values from 10 to 100 with an interval of 10. Same dimensions and parameters of RAKEL ( and ), which were tried when constructing iMPTCE-Hnetwork, were also used. Each classifier was also assessed by ten-fold cross-validation. The best performance for SVM (RBF kernel) and RF is listed in Table 2. Four measurements for SVM (RBF kernel) were 0.757, 0.670, 0.055, and 0.479, respectively, whereas they were 0.803, 0.743, 0.045, and 0.570, respectively, when the basic classifier was RF. Compared with the performance of iMPTCE-Hnetwork, also listed in Table 2, their performance was less or more lower. For the integrated score, they were about 0.112 and 0.021 lower, respectively. The RAKEL classifier with RF was slightly inferior to iMPTCE-Hnetwork, but the performance of RAKEL classifier with SVM (RBF kernel) was much lower than that of the iMPTCE-Hnetwork. It is suggested that the selection of SVM (polynomial kernel) as the basic classifier was a relatively proper choice.

3.5. Comparison of the BR Classifiers

The RAKEL algorithm is an efficient scheme for tackling multilabel classification problems. The Binary Relevance (BR) [16] method is another widely used scheme. This scheme adopted the one-against-all strategy to construct several binary classifiers for each label and integrated them together. In fact, if the parameter of RAKEL is set to 1, RAKEL is the same as BR. Here, we compared the classifiers based on BR with RAKEL classifiers. For convenience, the classifier based on BR was called the BR classifier. Three basic classifiers: SVM (polynomial kernel), SVM (RBF kernel), and RF, were adopted to construct the BR classifiers. All parameter settings mentioned above were tried. Each BR classifier was evaluated by ten-fold cross-validation. The best performance for each basic classifier is listed in Table 2.

The BR classifier with SVM (polynomial kernel) yielded the accuracy of 0.786, the exact match of 0.690, the hamming loss of 0.043, and the integrated score of 0.519. When the basic classifier was SVM (RBF kernel), the BR classifier provided the accuracy of 0.598, the exact match of 0.533, the hamming loss of 0.058, and the integrated score of 0.300. As for the RF, its BR classifier generated the accuracy of 0.666, the exact match of 0.602, the hamming loss of 0.052, and the integrated score of 0.380. Each measurement was lower than the corresponding one of the iMPTCE-Hnetwork, suggesting that RAKEL was more efficient than BR for the identification of the metabolic pathway types of chemicals and enzymes. Furthermore, three basic classifiers gave the same strength for building the BR classifiers as for building the RAKEL classifiers, that is, the SVM (polynomial kernel) was the best, followed by RF and SVM (RBF kernel).

3.6. Comparison of Classifiers with Other Embedding Features

The iMPTCE-Hnetwork was constructed using the chemical and enzyme features derived from the heterogeneous network by Mashup. To date, several network embedding algorithms have been proposed and applied to deal with some realistic problems. Here, we selected two of them to make comparisons. They were DeepWalk [48] and Node2vec [49]. These two algorithms adopted quite different schemes to extract features for representing nodes. They always produce a lot of paths for each node. Each path is deemed as a sentence, and nodes in the paths are termed as words. Then, Word2vec [50] is applied to these sentences to extract features. The main difference of these two algorithms is the way to produce paths. Node2vec adopts a more advanced scheme and thus is considered to be more powerful than DeepWalk. We downloaded the DeepWalk program at https://github.com/phanein/deepwalk, and the program of Node2vec was retrieved from https://snap.stanford.edu/node2vec/. They were all applied to the heterogeneous network with their default parameters. Likewise, the dimension was set to 50 to 300 with an interval of 50.

Features yielded by DeepWalk and Node2vec were fed into RAKEL with different values of and and different basic classifiers (SVM (polynomial kernel), SVM (RBF kernel), and RF) to construct RAKEL classifiers. All classifiers were evaluated by ten-fold cross-validation. For features yielded by DeepWalk, the best performance of RAKEL classifiers with one of the three basic classifiers is shown in Table 3. It can be seen that these classifiers all gave a poor performance. The accuracy was all lower than 0.350, the exact match was lower than 0.310, the hamming loss was higher than 0.140, and the integrated score was lower than 0.090. Compared with the measurements of RAKEL classifiers using features produced by Mashup (Table 2), they were much lower. As for the features generated by Node2vec, RAKEL classifiers gave better performance. Four measurements are listed in Table 4. The accuracy was higher than 0.700, the exact match was higher than 0.650, the hamming loss was lower than 0.060, and the integrated score was higher than 0.460. However, they were still inferior to those of the RAKEL classifiers using features produced by Mashup (Table 2). It can be concluded that the features yielded by the Mashup were more informative than those produced by DeepWalk and Node2vec for the identification of metabolic pathway types of chemicals and enzymes.

Besides, to fully elaborate the conclusion in the above paragraph, we also used the features yielded by DeepWalk and Node2vec to build BR classifiers. The ten-fold cross-validation results are listed in Tables 3 and 4, respectively. For BR classifiers using features yielded by DeepWalk, their performance was still very poor. Given the same basic classifier, such BR classifier was much inferior to that with features yielded by Mashup. The BR classifier with features yielded by Node2vec gave much better performance. When the basic classifier was SVM (RBF kernel), the BR classifier was slightly superior to the BR classifier using features yielded by Mashup. However, the other two basic classifiers still generated lower performance. In addition, the best BR classifier using features yielded by Mashup was much better than the best BR classifier using features yielded by Node2vec. These results further confirmed that the features yielded by the Mashup were more efficient, which was an important reason why iMPTCE-Hnetwork can provide such high performance.

4. Conclusions

This study proposed an efficient multilabel classifier for the identification of metabolic pathway types of chemicals and enzymes. We did several tests to elaborate its rationality, including parameter settings, selection of basic classifiers, scheme for tackling multilabel problems, and network embedding algorithms. The main merit of the proposed classifier is the integration of chemical and enzyme information. Their information is utilized with each other in both feature extraction and classification procedures. It is hopeful that this classifier can be a useful tool for investigating the metabolic pathway system.

Data Availability

The original data used to support the findings of this study are available at the KEGG PATHWAY and in the supplementary information files.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Authors’ Contributions

Yuanyuan Zhu and Bin Hu contributed equally to this work.

Acknowledgments

This study was supported by the Natural Science Foundation of Shanghai (17ZR1412500), National Natural Science Foundation of China (61772028), Key-Area Research and Development Program of Guangdong Province (2018B020203003), Guangzhou Science and Technology Planning Project (201707020007), and Science and Technology Planning Project of Guangdong Province (2017A010405039).

Supplementary Materials

Table S1: Chemicals and enzymes in each metabolic pathway. (Supplementary Materials)