Similarity-Based Method with Multiple-Feature Sampling for Predicting Drug Side Effects

Wu, Zixin; Chen, Lei

doi:https://doi.org/10.1155/2022/9547317

Computational and Mathematical Methods in Medicine


Special Issue

Machine Learning and Network Methods for Biology and Medicine 2021


Research Article | Open Access

Volume 2022 | Article ID 9547317 | https://doi.org/10.1155/2022/9547317


Zixin Wu¹ and Lei Chen¹

Academic Editor: Rafik Karaman

Received 20 Jul 2021

Revised 18 Sept 2021

Accepted 15 Mar 2022

Published 01 Apr 2022

Abstract

Drugs can treat different diseases but also bring side effects. Undetected and unacceptable side effects of approved drugs can greatly harm the human body and bring huge risks for pharmaceutical companies. Traditional experimental methods used to determine side effects have several drawbacks, such as low efficiency and high cost. One alternative is to design computational methods. Previous studies modeled a binary classification problem by pairing drugs and side effects; however, their classifiers can only extract one feature from each type of drug association. The present work proposes a novel multiple-feature sampling scheme that can extract several features from one type of drug association. Thirteen classification algorithms were employed to construct classifiers with the features yielded by this scheme. Their performance was greatly improved compared with that of the classifiers that use the features yielded by the original scheme. The best performance was observed for the classifier based on random forest, with an MCC of 0.8661, AUROC of 0.969, and AUPR of 0.977. Finally, one key parameter in the multiple-feature sampling scheme was analyzed.

1. Introduction

Drugs are important in treating various diseases; however, their therapeutic effects are accompanied by negative effects called side effects. In the pharmaceutical field, a drug side effect is classified as an adverse drug reaction (ADR), that is, a harmful or unintended reaction to a qualified drug that is unrelated to the purpose of its use and occurs under normal usage and dosage. Some market-approved drugs may generate unacceptable side effects that can be harmful to the human body and bring high risks to pharmaceutical companies. For example, fluconazole and atorvastatin have potential hepatotoxicity and nephrotoxicity that can increase transaminase when used in specific patients such as those with liver disease. Side effects are one of the major obstacles to launching new drugs and can delay their development. Thus, determining all the side effects of a given drug is an important topic in drug development. Although solid clinical trials are effective in identifying side effects, they are time consuming and expensive and thus cannot meet the demand of large-scale tests. Rapid and cheap methods for the identification of drug side effects must therefore be developed.

Many advanced computational algorithms have been proposed [1–5] to provide strong technical support for various medical problems. Several computational methods have been developed for the identification of drug side effects. Most of them are machine learning-based techniques that mine current information on drug side effects and learn patterns that can be used to predict side effects for a given new drug. Some early methods consisted of an individual binary classifier for each side effect [6–10]; hence, they contain many binary classifiers that must all be executed to determine all side effects for a given drug. In view of this situation, some other techniques were directly built with multilabel classifiers [11–16] that treat side effects as labels and drugs as samples. Recommender systems were also proposed to predict drug side effects [17–19]. Recent works paired drugs and side effects as samples to convert the original problem into a binary classification problem [20–22]. A key step in developing such binary classifiers is to extract essential properties from each drug–side effect pair. Some researchers used a similarity-based scheme to extract features [21, 22]; for convenience, they extracted only one feature from each type of drug association, a process called the single-feature sampling scheme. However, some essential information may be omitted. A novel feature extraction scheme that can retain essential information for each drug–side effect pair is therefore needed.

In this study, an efficient binary classifier was proposed for the identification of drug side effects. As in previous work [20–22], drugs and side effects were paired as samples. The single-feature sampling scheme [21, 22] was generalized to extract essential features from each pair. Named the multiple-feature sampling scheme, this newly proposed strategy can generate multiple features from each type of drug association. A classic machine learning algorithm, random forest (RF) [23], was adopted as the prediction engine. According to the 10-fold cross-validation results, the performance of this classifier was better than that of the previous classifier that uses the original single-feature sampling scheme for feature extraction. Further tests suggested that classifiers built with other classification algorithms and the features yielded by the multiple-feature sampling scheme were all superior to those built with the same classification algorithm and the features generated by the original scheme. This finding indicated the power of the features generated by the proposed feature extraction scheme.

2. Materials and Methods

2.1. Benchmark Dataset

Data on 841 drugs and their side effects (824) [20–22] were extracted from SIDER (http://sideeffects.embl.de/) [24], a public database collecting the information of marketed drugs and their ADRs. The original data contained 888 drugs and 1385 side effects. The side effects that were annotated to no more than five drugs were excluded. Furthermore, drugs without the properties mentioned in Section 2.2 were discarded. From the remaining 841 drugs and 824 side effects, 57,058 drug–side effect pairs were obtained. Each pair indicated that the specific drug in the pair has the side effect in the same pair. Given that these pairs indicate the relationship between one drug and one side effect, they were termed as positive samples and comprised the positive dataset (PDS).

In addition to the PDS, a negative dataset (NDS) was necessary for building an efficient binary classifier. A total of 57,058 drug–side effect pairs were produced by randomly pairing one drug and one side effect [20, 21], under the condition that none of these pairs was a positive sample. These pairs constituted one NDS. Because different NDSs may influence the performance of the classifier, four other NDSs were generated in the same way. Finally, five datasets, each containing the PDS and one NDS, were produced and denoted by DS₁, DS₂, DS₃, DS₄, and DS₅.
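The construction of the negative datasets can be illustrated with a minimal sketch. The function name `sample_negative_pairs`, the toy drug and side effect identifiers, and the choice to reject duplicate negative pairs are assumptions made for illustration; the text only states that random pairs were drawn and that none of them was a positive sample.

```python
import random

def sample_negative_pairs(drugs, side_effects, positives, n, seed=0):
    """Draw n distinct random (drug, side effect) pairs that are not positive."""
    rng = random.Random(seed)
    positives = set(positives)
    max_negatives = len(drugs) * len(side_effects) - len(positives)
    if n > max_negatives:
        raise ValueError("not enough unlabeled pairs to sample from")
    negatives = set()
    while len(negatives) < n:
        pair = (rng.choice(drugs), rng.choice(side_effects))
        if pair not in positives:   # reject known positive pairs
            negatives.add(pair)     # the set silently rejects duplicates
    return sorted(negatives)

# Toy data (hypothetical identifiers, not taken from SIDER).
drugs = ["d1", "d2", "d3"]
side_effects = ["s1", "s2"]
pds = [("d1", "s1"), ("d2", "s2")]

# Five datasets, each pairing the same positives with a different negative set.
datasets = [
    (pds, sample_negative_pairs(drugs, side_effects, pds, len(pds), seed=i))
    for i in range(5)
]
```

Using a different seed for each negative set mirrors the idea of building DS₁ to DS₅ so that results do not hinge on one particular random draw.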

2.2. Drug Association Obtained from Different Drug Properties

Two drugs with strong associations often share similar functions [25–29]. Side effects can be deemed one type of drug function. Thus, classifiers can be constructed by adopting features derived from drug associations. Several types of drug associations can be measured and quantified from different aspects of drugs. For ease of comparison, the drug associations adopted in a previous study [21] were used here, and their brief descriptions are as follows.

2.2.1. Drug Fingerprint Association

Simplified molecular input line entry specification (SMILES) string [30] is a widely used scheme for drug representation. Fingerprints can be extracted from this string using existing software, such as RDKit [31]. The association of two drugs can be evaluated by comparing their fingerprints. Here, ECFP_4 fingerprints and the Tanimoto coefficient were used to measure such association between any two drugs. For formulation, this association for drugs $d_1$ and $d_2$ was denoted by $S_f(d_1, d_2)$.
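As a hedged illustration of the Tanimoto coefficient, the sketch below treats each fingerprint as the set of indices of its "on" bits. The function name `tanimoto` and the toy bit sets are assumptions; in practice the fingerprints would come from a cheminformatics toolkit such as RDKit, which is not called here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:          # two empty fingerprints: define the score as 0
        return 0.0
    return len(a & b) / union

# Toy fingerprints (hypothetical bit positions).
fp1 = {1, 5, 9}
fp2 = {5, 9, 12}
score = tanimoto(fp1, fp2)  # 2 shared bits out of 4 distinct bits
```

The score lies between 0 and 1, with 1 reached only when both fingerprints set exactly the same bits.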

2.2.2. Drug Structural Association

In addition to the SMILES string, another popular drug representation scheme is the graph-based method. Here, each drug is represented by a graph with nodes depicting atoms and edges indicating bonds. The association of two drugs can be assessed by considering the similarity of the two corresponding graphs. “SIMCOMP” (https://www.genome.jp/tools/simcomp/) reported in the KEGG [32, 33] was built on this idea. This tool can output the associations of a given drug with other drugs as scores between 0 and 1. Such association for drugs $d_1$ and $d_2$ was denoted by $S_s(d_1, d_2)$.

2.2.3. Drug Anatomical Therapeutic Chemical (ATC) Code Association

The ATC system is widely accepted and used in drug classification. Each drug in this system is assigned five-level ATC codes that indicate its essential properties. For two drugs, their association can be measured according to their ATC codes. This study used the same method as in [21] to evaluate drug association based on ATC codes. For convenience, the association of drugs $d_1$ and $d_2$ was denoted by $S_a(d_1, d_2)$.

2.2.4. Drug Literature Association

Given the extensive literature on drugs, the association of two drugs can be measured from their co-occurrence in the literature by using natural language processing methods. The well-known public database, STITCH (version 4.0, http://stitch4.embl.de/) [34], provides such associations, which were directly employed in this study. The “Textmining” score was extracted from the downloaded file “chemical_chemical.links.detailed.v4.0.tsv.” For drugs $d_1$ and $d_2$, their literature association was denoted by $S_l(d_1, d_2)$.

2.2.5. Drug Target Protein Association

Target proteins are a basic property of drugs. Hence, the association of two drugs can be estimated by comparing their target proteins. In this study, the target proteins of drugs were retrieved from DrugBank (https://go.drugbank.com/) [35]. Each drug was encoded into a binary vector by applying the one-hot scheme to its target proteins. The cosine of the angle between two such vectors was defined as the association of the two drugs. For formulation, this association between drugs $d_1$ and $d_2$ was denoted as $S_p(d_1, d_2)$.
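For binary one-hot vectors, the cosine reduces to the number of shared targets divided by the geometric mean of the two target counts. The sketch below relies on that identity; the function name `target_cosine` and the protein identifiers are hypothetical, and returning 0 for a drug without targets is an assumption.

```python
import math

def target_cosine(targets_a, targets_b):
    """Cosine similarity of two one-hot target vectors, given as sets of targets."""
    a, b = set(targets_a), set(targets_b)
    if not a or not b:      # a zero vector has no direction; use 0 by convention
        return 0.0
    # dot product = |a & b|; vector norms = sqrt(|a|) and sqrt(|b|)
    return len(a & b) / math.sqrt(len(a) * len(b))

# Toy target sets (hypothetical protein identifiers).
drug_x = {"P1", "P2"}
drug_y = {"P2", "P3"}
similarity = target_cosine(drug_x, drug_y)  # 1 shared target, norms sqrt(2) each
```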

2.3. Feature Engineering

In Section 2.2, five types of drug associations were described; they have been used to extract features to represent drug–side effect pairs [21, 22]. These features indicate the linkage between the drug and the side effect in a drug–side effect pair. However, the previous studies extracted only one feature from each type of drug association and thus cannot fully capture the essential linkage between the drug and the side effect. This study proposed a novel feature extraction scheme called the multiple-feature sampling scheme, which can extract multiple features from one type of drug association. For a clear description, some notation is necessary. For one drug–side effect pair $(d, s)$, where $d$ and $s$ indicate one drug and one side effect, respectively, let $D_s$ be the set consisting of drugs having side effect $s$ that have been extracted from the training dataset. If $d$ also has side effect $s$, then it is not included in $D_s$. For one type of drug association, all values between $d$ and the drugs in $D_s$ are collected. A candidate feature list for $(d, s)$, denoted by $L_t(d, s)$ (where $t$ represents the type of drug association used to construct the list), is then constructed by sorting these values in decreasing order. The top value in this list has previously been chosen as the exclusive feature [21, 22]. Selecting several values from this list can retain more information to represent the linkage of drug $d$ and side effect $s$. On the basis of different selection models, two strategies were proposed, namely, the discrete and continuous strategies. Their procedures are shown in Figure 1.

2.3.1. Discrete Strategy

In this strategy, several values from the candidate feature list are selected to indicate the distribution of values in the list. In this way, the selected values can indicate the linkage between drug $d$ and side effect $s$ more fully. This is achieved by selecting some discrete values in the list. For example, the value at the first place or that at the top $k\%$ place can be selected. These values comprise a set of features from one type of drug association.

2.3.2. Continuous Strategy

This strategy differs from the first one. Given that the linkage of drug $d$ and side effect $s$ is most strongly indicated by the top values in the list, these values are selected together because they may contain the essential information. For an integer $k$ between 1 and 100, the top $k\%$ of the values in the list were selected as features.
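The two strategies can be sketched as follows for one type of drug association. All function names are hypothetical, and several details are assumptions rather than the authors' stated procedure: positions are rounded up, at least one value is always kept, and an empty candidate list yields zeros in the discrete case.

```python
def candidate_list(drug, side_effect, train_positives, association):
    """Association values between `drug` and the training drugs that have
    `side_effect` (excluding `drug` itself), sorted in decreasing order."""
    others = {d for d, s in train_positives if s == side_effect and d != drug}
    return sorted((association(drug, o) for o in others), reverse=True)

def _top_count(length, k):
    """Number of list entries covered by the top k% (rounded up, at least 1)."""
    return max((length * k + 99) // 100, 1)

def discrete_features(values, percents=(5, 10, 15, 20)):
    """Top value plus the value located at each top-k% place."""
    if not values:
        return [0.0] * (1 + len(percents))
    feats = [values[0]]
    for k in percents:
        feats.append(values[_top_count(len(values), k) - 1])
    return feats

def continuous_features(values, k=20):
    """All values in the top k% of the list."""
    if not values:
        return []
    return values[:_top_count(len(values), k)]

# Toy candidate list with ten association values, already sorted.
values = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
disc = discrete_features(values)
cont = continuous_features(values, k=20)
```

With a short list, several top-k% places fall on the same entry, so the discrete features repeat values. The number of continuous features also depends on the list length, so a real pipeline would need some rule to obtain fixed-length vectors; the text above does not specify one.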

2.4. Classification Algorithm

A proper classification algorithm is important in building an efficient classifier. In this study, RF [23] was adopted to construct the classifier. RF is one of the most classic classification algorithms and has been used to set up many classifiers in bioinformatics [36–41].

RF is an ensemble classification algorithm containing several decision trees, each of which is constructed by two random selection procedures. The first procedure selects samples: given a dataset with $n$ samples, $n$ samples are randomly selected with replacement from the dataset. The second procedure selects the features used to split each node; the number of selected features should be much smaller than the total number of features. After the predefined number of decision trees has been constructed, RF integrates them by majority voting. For a query sample, each decision tree gives its prediction, and the majority prediction is the predicted result of RF. Although a decision tree is a relatively weak classification algorithm, RF is powerful and is often an important candidate for building different classifiers.
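The two random selection procedures and the voting step can be sketched in isolation, without growing actual trees. The helper names and the feature subset size `m=5` are illustrative assumptions, not settings taken from Weka or from this study.

```python
import random
from collections import Counter

def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement (first random procedure)."""
    n = len(samples)
    return [samples[rng.randrange(n)] for _ in range(n)]

def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices to consider at one node (second procedure)."""
    return sorted(rng.sample(range(n_features), m))

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
data = list(range(10))                       # stand-in for ten training samples
bag = bootstrap_sample(data, rng)            # some samples repeat, some are left out
subset = random_feature_subset(n_features=25, m=5, rng=rng)
label = majority_vote([1, 0, 1, 1, 0])       # three of five trees vote for class 1
```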

In this study, “RandomForest” in Weka [42] was directly used to implement the abovementioned RF. Default parameters were adopted, and the number of decision trees was set to 100.

In addition to RF, the following classification algorithms were used to build corresponding classifiers: support vector machine (SVM) (polynomial kernel, RBF kernel) [43], Adaboost M1 [44], Bagging [45], Bayesian network [46], Naive Bayes [47], $k$-nearest neighbor (KNN) [48], decision tree (C4.5) [49], PART [50], logistic regression [51], multilayer perceptron (MLP) [52], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [53]. The goal is to confirm that the features yielded by the multiple-feature sampling scheme are more effective than those yielded by the single-feature sampling scheme. For convenience, the corresponding tools in Weka were used to implement these classification algorithms under default parameters. These classification algorithms adopt different principles and procedures for classification; therefore, using them tests the utility of the proposed feature sampling scheme broadly. If, for each of these classification algorithms, the classifier with features yielded by the multiple-feature sampling scheme is superior to that with the previous features, then the robustness of the novel features is confirmed.

2.5. Accuracy Measurement

Ten-fold cross-validation [54–59] was adopted to evaluate the performance of all constructed classifiers. Such method randomly divides the original dataset into ten parts. Each part is singled out one by one as the test set, and the remaining parts constitute the training set. Samples in the test set are predicted by the classifier based on the training set. Thus, each sample is tested exactly once.
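A minimal sketch of the splitting step is given below. The function name `k_fold_indices` and the way folds are formed (shuffle once, then deal indices round-robin) are assumptions; any partition into ten disjoint parts of nearly equal size fits the description.

```python
import random

def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds disjoint parts.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

splits = list(k_fold_indices(25, n_folds=10, seed=1))
```

Each of the ten `(train, test)` pairs would be used to fit a classifier on `train` and predict the samples in `test`.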

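For a binary classification problem, four entries can be counted by comparing the predicted and true classes of each sample, that is, true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The following measurements were based on these four entries: sensitivity (SN) (also called recall), specificity (SP), prediction accuracy (ACC), Matthews correlation coefficient (MCC) [20, 21, 37, 60–63], precision, and $F_1$-measure. Their definitions are as follows:

$$\mathrm{SN} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$

$$\mathrm{SP} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}$$

$$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$

$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}$$

$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$

$$F_1\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{SN}}{\mathrm{Precision} + \mathrm{SN}}$$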
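The six measurements named above can be computed from the four counts with the standard textbook formulas, as in the sketch below. The function name `classification_metrics`, the dictionary keys, and the convention of returning 0 when a denominator is zero are assumptions for illustration.

```python
import math

def _ratio(numerator, denominator):
    """Safe division: return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, tn, fn):
    """Compute SN, SP, ACC, MCC, precision, and F1 from confusion-matrix counts."""
    sn = _ratio(tp, tp + fn)                       # sensitivity (recall)
    sp = _ratio(tn, tn + fp)                       # specificity
    acc = _ratio(tp + tn, tp + fp + tn + fn)       # prediction accuracy
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * precision * sn, precision + sn)
    mcc = _ratio(tp * tn - fp * fn,
                 math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc,
            "Precision": precision, "F1": f1}

# Toy confusion matrix (hypothetical counts, not results from this study).
m = classification_metrics(tp=90, fp=10, tn=80, fn=20)
```

Unlike ACC, MCC stays near zero for a classifier that simply predicts the majority class, which is one common reason to prefer it as the main measurement.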

ACC and MCC use all four entries, and the $F_1$-measure combines precision and recall; thus, these three are more comprehensive than the other three measurements. The receiver operating characteristic (ROC) curve [64] and the precision-recall (PR) curve were further employed to fully assess the performance of the constructed classifiers. These curves indicate the performance of classifiers under different thresholds. The ROC curve takes 1-SP as the $x$-axis and SN as the $y$-axis, and the PR curve takes recall as the $x$-axis and precision as the $y$-axis. The areas under these two curves (AUROC and AUPR) are important measurements for evaluating the performance of classifiers. Among the abovementioned measurements, MCC was selected as the main one.
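The area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this equivalence instead of tracing the curve threshold by threshold; the function name `auroc` and the toy scores are assumptions, and the pairwise loop is meant for small illustrations only.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count positive-negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy classifier scores (hypothetical values).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
area = auroc(scores, labels)  # 3 of the 4 positive-negative pairs are ordered correctly
```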

3. Results and Discussion

A novel feature extraction method was proposed to extract essential features from drug–side effect pairs. On the basis of these features, efficient classifiers to predict drug side effects were established. All procedures are illustrated in Figure 2.

3.1. Performance of the RF Classifiers with Discrete Strategy

The discrete strategy picks some discrete values in the candidate feature list. Given that the top value in the list is the most important and has previously been selected as the exclusive feature [21, 65], this top value is always picked as one feature. As mentioned in Section 2.3, the value located at the top $k\%$ place in the list was also selected. In this study, $k$ was set to 5, 10, 15, and 20. Values with high ranks in the candidate feature list are more important than those with low ranks; that is, the top value is the most important, followed by the values at 5%, 10%, 15%, and 20%. Incremental feature selection was adopted to generate four feature subsets, as listed in column 1 of Table 1. With each feature subset derived from the five types of drug associations, an RF classifier was built on each of the five datasets and evaluated by 10-fold cross-validation. The average performance is listed in Table 1. MCC followed an increasing trend as the values at the top 5%, 10%, 15%, and 20% were added. The other five measurements generally followed the same trend. The RF classifiers with all selected features (the top value and those at 5%, 10%, 15%, and 20%) generated the highest MCC of 0.7172. This finding indicated that the features yielded by the multiple-feature sampling scheme were quite effective for the identification of drug side effects.

The ROC and PR curves of these four RF classifiers were investigated, and the results are shown in Figure 3. All AUROCs and AUPRs were higher than 0.900 and 0.910, respectively, further suggesting the good performance of RF classifiers with the discrete strategy.

Figure 3: ROC and PR curves of the RF classifiers with discrete strategy (panels (a) and (b)).

3.2. Performance of RF Classifiers with Continuous Strategy

Different from the discrete strategy, the continuous strategy selects values from the candidate feature list in a continuous way. As mentioned in Section 2.3, the top $k\%$ of the values in the candidate feature list can be chosen as features. Here, four values of $k$ (10, 20, 30, and 40), giving four feature subsets, were tested. An RF classifier was also built on each of the five datasets by using the feature subsets derived from the five types of drug associations. Each classifier was assessed by 10-fold cross-validation, and the average performance is listed in Table 2. When $k = 20$ (top 20%), the RF classifier yielded the highest MCC of 0.8661, with an ACC of 0.9312, $F_1$-measure of 0.9278, SN of 0.8852, SP of 0.9771, and precision of 0.9747. Compared with the RF classifiers with the discrete strategy, the best RF classifier with the continuous strategy had higher measurements, particularly for MCC (by 15%), ACC (by 7%), and $F_1$-measure (by 7%). These results indicated that the features obtained by the continuous strategy were more powerful in identifying drug side effects than those yielded by the discrete strategy.

The ROC and PR curves of RF classifiers with continuous strategy were plotted as shown in Figure 4. All ROC curves were close to the point (0, 1), and all PR curves were close to the point (1, 1). The AUROCs and AUPRs were all quite high. Compared with AUROCs and AUPRs for discrete strategy, those for continuous strategy were generally higher. This finding further confirmed that the features yielded by continuous strategy were more powerful than those yielded by discrete strategy.

Figure 4: ROC and PR curves of the RF classifiers with continuous strategy (panels (a) and (b)).

3.3. Comparison of RF Classifiers with Single- and Multiple-Feature Sampling

A multiple-feature sampling scheme was proposed to extract essential features from each drug–side effect pair. Previous studies [21, 22] only picked up the top value as the feature, and this technique was called single sampling scheme. This section compares the RF classifiers with these two feature sampling schemes.

The average performance of the RF classifiers with the single-feature sampling scheme is listed in Table 3. The MCC was 0.5997, the ACC was 0.7999, and the $F_1$-measure was 0.7988. The other three measurements (SN, SP, and precision) were 0.7948, 0.8049, and 0.8030, respectively. The best performing (highest MCC) RF classifiers with the discrete and continuous strategies were selected for comparison and are also listed in Table 3. The MCCs for the two strategies were 0.7172 and 0.8661, both higher than that of the RF classifier with the single-feature sampling scheme. The same conclusion holds for the other five measurements. The ROC and PR curves of the RF classifier with the single-feature sampling scheme were also plotted (Figure 3) and were always under those of the RF classifiers with the discrete strategy. The AUROC and AUPR of the RF classifier with the single-feature sampling scheme were 0.870 and 0.878, respectively, which were also lower than those of the RF classifiers with the discrete strategy. For the RF classifiers with the continuous strategy, the AUROCs and AUPRs (Figure 4) were even better than those of the RF classifiers with the discrete strategy and were also higher than those of the RF classifier with the single-feature sampling scheme. All these results implied that the features yielded by the multiple-feature sampling scheme contained more essential information about drug–side effect pairs than those obtained by the single-feature sampling scheme. These features provide RF with improved performance.

3.4. Performance of Other Classifiers with Multiple-Feature Sampling Scheme

The RF classifiers with features yielded by multiple sampling (discrete strategy) were superior to those with features yielded by single sampling, and the RF classifiers with the continuous strategy were better than those with the discrete strategy. However, whether this result depends on the choice of classification algorithm must be explored. In this section, the 12 other classification algorithms mentioned in Section 2.4 were tested. Classifiers with these algorithms and all feature subsets used for RF were constructed and evaluated by 10-fold cross-validation. The predicted results are listed in Tables S1–S24.

The performance of the classifiers with single sampling and the best performance of the classifiers with multiple sampling are listed in Table 4. The classifiers with multiple sampling (discrete strategy) were generally better than those with single sampling, and those with the continuous strategy were superior to those with the discrete strategy and single sampling. For a visual confirmation, a radar graph was plotted for each of ACC, MCC, and $F_1$-measure, as illustrated in Figure 5. For each measurement, the area enclosed by the curve of the classifiers with multiple sampling (continuous strategy) was the largest, followed by that of the classifiers with multiple sampling (discrete strategy); the area enclosed by the curve of the classifiers with single sampling was the smallest. On the basis of these results, the multiple sampling scheme captures the essential properties of drug–side effect pairs more effectively than the single sampling scheme, and the continuous strategy is better than the discrete strategy.

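Figure 5: Radar graphs of ACC, MCC, and $F_1$-measure for the classifiers with single sampling and multiple sampling (panels (a)–(c)).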

3.5. Analysis of the Parameter of Continuous Strategy

For the continuous strategy, the parameter $k$ is a key factor that determines the number of features selected from the candidate feature list. Here, its influence on the performance of classifiers was investigated.

For RF classifiers, the highest MCC of 0.8661 was achieved when $k = 20$ (Table 2). For the classifiers with other classification algorithms, $k = 20$ also most often yields the best performance, as shown in Figure 6. Among the 13 classifiers with different classification algorithms, 10 (76.92%) provided the best performance when $k = 20$. Meanwhile, two yielded the best performance at a different value of $k$. This phenomenon is reasonable. When $k$ is extremely small, some essential information of drug–side effect pairs cannot be included. When $k$ is large, noise may be introduced. The current investigation suggests that $k$ can be taken in the interval [20, 30].

4. Conclusions

This study presents a novel investigation on drug side effects. The contributions comprise two aspects. One is the multiple-feature sampling scheme that can extract essential features from drug–side effect pairs, and the other is a set of novel computational methods for the identification of drug side effects based on the features yielded by the multiple-feature sampling scheme. Classifiers were built on the basis of different classification algorithms. By comparison, the classifiers using features yielded by the multiple-feature sampling scheme performed better than those using features yielded by the single-feature sampling scheme. The proposed classifiers can be useful tools to identify drug side effects, and the novel feature extraction scheme can be applied to other similar biological or medical problems.

Data Availability

The original data used to support the findings of this study are available at SIDER and in supplementary information files.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Shanghai (17ZR1412500).

Supplementary Materials

Table S1: performance of SVM (polynomial kernel) classifier with discrete strategy. Table S2: performance of SVM (polynomial kernel) classifier with continuous strategy. Table S3: performance of SVM (RBF kernel) classifier with discrete strategy. Table S4: performance of SVM (RBF kernel) classifier with continuous strategy. Table S5: performance of Adaboost M1 classifier with discrete strategy. Table S6: performance of Adaboost M1 classifier with continuous strategy. Table S7: performance of Bagging classifier with discrete strategy. Table S8: performance of Bagging classifier with continuous strategy. Table S9: performance of Bayesian network classifier with discrete strategy. Table S10: performance of Bayesian network classifier with continuous strategy. Table S11: performance of Naive Bayes classifier with discrete strategy. Table S12: performance of Naive Bayes classifier with continuous strategy. Table S13: performance of KNN classifier with discrete strategy. Table S14: performance of KNN classifier with continuous strategy. Table S15: performance of decision tree classifier with discrete strategy. Table S16: performance of decision tree classifier with continuous strategy. Table S17: performance of PART classifier with discrete strategy. Table S18: performance of PART classifier with continuous strategy. Table S19: performance of logistic regression classifier with discrete strategy. Table S20: performance of logistic regression classifier with continuous strategy. Table S21: performance of multilayer perceptron classifier with discrete strategy. Table S22: performance of multilayer perceptron classifier with continuous strategy. Table S23: performance of RIPPER classifier with discrete strategy. Table S24: performance of RIPPER classifier with continuous strategy.

References

A. Onan, S. Korukoğlu, and H. Bulut, “A multiobjective weighted voting ensemble classifier based on differential evolution algorithm for text sentiment classification,” Expert Systems with Applications, vol. 62, pp. 1–16, 2016.
View at: Publisher Site | Google Scholar
A. Onan, S. Korukoğlu, and H. Bulut, “Ensemble of keyword extraction methods and classifiers in text classification,” Expert Systems with Applications, vol. 57, pp. 232–247, 2016.
View at: Publisher Site | Google Scholar
A. Onan and S. Korukoğlu, “Exploring performance of instance selection methods in text sentiment classification,” Artificial Intelligence Perspectives in Intelligent Systems, Springer, pp. 167–179, 2016.
View at: Publisher Site | Google Scholar
A. Onan, S. Korukoğlu, and H. Bulut, “A hybrid ensemble pruning approach based on consensus clustering and multi- objective evolutionary algorithm for sentiment classification,” Information Processing & Management, vol. 53, no. 4, pp. 814–833, 2017.
View at: Publisher Site | Google Scholar
A. Onan and M. A. Toçoğlu, “A term weighted neural language model and stacked bidirectional LSTM based framework for sarcasm identification,” IEEE Access, vol. 9, pp. 7701–7722, 2021.
View at: Publisher Site | Google Scholar
E. Pauwels, V. Stoven, and Y. Yamanishi, “Predicting drug side-effect profiles: a chemical fragment-based approach,” BMC Bioinformatics, vol. 12, no. 1, p. 169, 2011.
View at: Publisher Site | Google Scholar
S. Jamal, S. Goyal, A. Shanker, and A. Grover, “Predicting neurological adverse drug reactions based on biological, chemical and phenotypic properties of drugs using machine learning models,” Scientific Reports, vol. 7, no. 1, p. 872, 2017.
View at: Publisher Site | Google Scholar
Y. Zheng, H. Peng, S. Ghosh, C. Lan, and J. Li, “Inverse similarity and reliable negative samples for drug side-effect prediction,” BMC Bioinformatics, vol. 19, Suppl 13, p. 554, 2019.
View at: Publisher Site | Google Scholar
M. Liu, Y. Wu, Y. Chen et al., “Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs,” Journal of the American Medical Informatics Association, vol. 19, no. e1, pp. e28–e35, 2012.
View at: Publisher Site | Google Scholar
S. Dey, H. Luo, A. Fokoue, J. Hu, and P. Zhang, “Predicting adverse drug reactions through interpretable deep learning framework,” BMC Bioinformatics, vol. 19, Suppl 21, p. 476, 2018.
View at: Publisher Site | Google Scholar
L. Chen, T. Huang, J. Zhang et al., “Predicting drugs side effects based on chemical-chemical interactions and protein-chemical interactions,” BioMed Research International, vol. 2013, Article ID 485034, 8 pages, 2013.
View at: Publisher Site | Google Scholar
W. Zhang, F. Liu, L. Luo, and J. Zhang, “Predicting drug side effects by multi-label learning and ensemble learning,” BMC Bioinformatics, vol. 16, no. 1, p. 365, 2015.
View at: Publisher Site | Google Scholar
N. Atias and R. Sharan, “An algorithmic framework for predicting side effects of drugs,” Journal of Computational Biology, vol. 18, no. 3, pp. 207–218, 2011.
View at: Publisher Site | Google Scholar
E. Muñoz, V. Novácek, and P. Y. Vandenbussche, “Facilitating prediction of adverse drug reactions by using knowledge graphs and multi-label learning models,” Briefings in Bioinformatics, vol. 20, no. 1, pp. 190–202, 2019.
View at: Publisher Site | Google Scholar
W. Zhang, Y. Chen, S. Tu, F. Liu, and Q. Qu, “Drug side effect prediction through linear neighborhoods and multiple data source integration,” in IEEE International Conference on Bioinformatics and Biomedicine, pp. 427–434, Shenzhen, Guangdong, China, 2016.
View at: Google Scholar
E. Munoz, V. Novacek, and P. Y. Vandenbussche, “Using drug similarities for discovery of possible adverse reactions,” in American Medical Informatics Association Annual Symposium Proceedings, pp. 924–933, USA, 2016.
View at: Google Scholar
Y. J. Ding, J. J. Tang, and F. Guo, “Identification of drug-side effect association via multiple information integration with centered kernel alignment,” Neurocomputing, vol. 325, pp. 211–224, 2019.
X. Guo, W. Zhou, Y. Yu, Y. Ding, J. Tang, and F. Guo, “A novel triple matrix factorization method for detecting drug-side effect association based on kernel target alignment,” BioMed Research International, vol. 2020, Article ID 4675395, 11 pages, 2020.
Y. Ding, J. Tang, and F. Guo, “Identification of drug-side effect association via semi-supervised model and multiple kernel learning,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 6, pp. 2619–2632, 2019.
X. Zhao, L. Chen, Z. H. Guo, and T. Liu, “Predicting drug side effects with compact integration of heterogeneous networks,” Current Bioinformatics, vol. 14, no. 8, pp. 709–720, 2019.
X. Zhao, L. Chen, and J. Lu, “A similarity-based method for prediction of drug side effects with heterogeneous information,” Mathematical Biosciences, vol. 306, pp. 136–144, 2018.
H. Liang, L. Chen, X. Zhao, and X. Zhang, “Prediction of drug side effects with a refined negative sample selection strategy,” Computational and Mathematical Methods in Medicine, vol. 2020, Article ID 1573543, 16 pages, 2020.
L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
M. Kuhn, M. Campillos, I. Letunic, L. J. Jensen, and P. Bork, “A side effect resource to capture phenotypic effects of drugs,” Molecular Systems Biology, vol. 6, no. 1, p. 343, 2010.
L. L. Hu, C. Chen, T. Huang, Y. D. Cai, and K. C. Chou, “Predicting biological functions of compounds based on chemical-chemical interactions,” PLoS One, vol. 6, no. 12, article e29491, 2011.
L. Chen, W. M. Zeng, Y. D. Cai, K. Y. Feng, and K. C. Chou, “Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities,” PLoS One, vol. 7, no. 4, article e35254, 2012.
L. Chen, J. Lu, N. Zhang, T. Huang, and Y. D. Cai, “A hybrid method for prediction and repositioning of drug anatomical therapeutic chemical classes,” Molecular BioSystems, vol. 10, no. 4, pp. 868–877, 2014.
L. Chen, T. Liu, and X. Zhao, “Inferring anatomical therapeutic chemical (ATC) class of drugs using shortest path and random walk with restart algorithms,” Biochimica et Biophysica Acta-Molecular Basis of Disease, vol. 1864, no. 6, pp. 2228–2240, 2018.
H. Y. Liang, B. Hu, L. Chen, S. Wang, and Aorigele, “Recognizing novel chemicals/drugs for anatomical therapeutic chemical classes with a heat diffusion algorithm,” Biochimica et Biophysica Acta-Molecular Basis of Disease, vol. 1866, no. 11, article 165910, 2020.
D. Weininger, “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules,” Journal of Chemical Information and Computer Sciences, vol. 28, no. 1, pp. 31–36, 1988.
G. Landrum, “RDKit: open-source cheminformatics,” 2006, http://www.rdkit.org.
M. Kanehisa, M. Furumichi, M. Tanabe, Y. Sato, and K. Morishima, “KEGG: new perspectives on genomes, pathways, diseases and drugs,” Nucleic Acids Research, vol. 45, no. D1, pp. D353–D361, 2017.
M. Kanehisa and S. Goto, “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
M. Kuhn, D. Szklarczyk, S. Pletscher-Frankild et al., “STITCH 4: integration of protein–chemical interactions with user data,” Nucleic Acids Research, vol. 42, no. D1, pp. D401–D407, 2014.
D. S. Wishart, Y. D. Feunang, A. C. Guo et al., “DrugBank 5.0: a major update to the DrugBank database for 2018,” Nucleic Acids Research, vol. 46, no. D1, pp. D1074–D1082, 2018.
C. Montes, Z. Kapelan, and J. Saldarriaga, “Predicting non-deposition sediment transport in sewer pipes using random forest,” Water Research, vol. 189, article 116639, 2021.
Y. Jia, R. Zhao, and L. Chen, “Similarity-based machine learning model for predicting the metabolic pathways of compounds,” IEEE Access, vol. 8, pp. 130687–130696, 2020.
D. V. Urista, D. B. Carrué, I. Otero et al., “Prediction of antimalarial drug-decorated nanoparticle delivery systems with random forest models,” Biology, vol. 9, no. 8, p. 198, 2020.
Z. B. Lv, J. Zhang, H. Ding, and Q. Zou, “RF-PseU: a random forest predictor for RNA pseudouridine sites,” Frontiers in Bioengineering and Biotechnology, vol. 8, p. 10, 2020.
M. Baranwal, A. Magner, P. Elvati, J. Saldinger, A. Violi, and A. O. Hero, “A deep learning architecture for metabolic pathway prediction,” Bioinformatics, vol. 36, no. 8, pp. 2547–2553, 2020.
Y. Yang and L. Chen, “Identification of drug–disease associations by using multiple drug and disease networks,” Current Bioinformatics, vol. 17, no. 1, pp. 48–59, 2022.
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, 2nd edition, 2005.
C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” in Thirteenth International Conference on Machine Learning, Citeseer, 1996.
L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
S. Lee and S. Shimoji, “BAYESNET: Bayesian Classification Network Based on Biased Random Competition Using Gaussian Kernels,” in IEEE International Conference on Neural Networks, San Francisco, CA, USA, 1993.
I. Rish, “An empirical study of the naive Bayes classifier,” in IJCAI 2001 workshop on empirical methods in artificial intelligence, IBM New York, USA, 2001.
T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, USA, 1993.
E. Frank and I. H. Witten, “Generating accurate rule sets without global optimization,” in 15th International Conference on Machine Learning, pp. 144–151, San Francisco, CA, USA, 1998.
M. Sumner, E. Frank, and M. Hall, “Speeding up logistic model tree induction,” in European Conference on Principles of Data Mining and Knowledge Discovery, Springer, 2005.
S. K. Pal and S. Mitra, Multilayer perceptron, fuzzy sets, classification, IEEE, 1992.
W. W. Cohen, “Fast effective rule induction,” in Machine Learning Proceedings 1995, Morgan Kaufmann Publishers, Inc, 1995.
R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in International Joint Conference on Artificial Intelligence, Lawrence Erlbaum Associates Ltd., 1995.
Y.-H. Zhang, Z. Li, T. Zeng et al., “Detecting the multiomics signatures of factor-specific inflammatory effects on airway smooth muscles,” Frontiers in Genetics, vol. 11, article 599970, 2021.
Y. H. Zhang, H. Li, T. Zeng et al., “Identifying transcriptomic signatures and rules for SARS-CoV-2 infection,” Frontiers in Cell and Developmental Biology, vol. 8, article 627302, 2021.
X. Pan, H. Li, T. Zeng et al., “Identification of protein subcellular localization with network and functional embeddings,” Frontiers in Genetics, vol. 11, article 626500, 2021.
Y. Zhu, B. Hu, L. Chen, and Q. Dai, “iMPTCE-Hnetwork: a multi-label classifier for identifying metabolic pathway types of chemicals and enzymes with a heterogeneous network,” Computational and Mathematical Methods in Medicine, vol. 2021, Article ID 6683051, 12 pages, 2021.
J.-P. Zhou, L. Chen, and Z.-H. Guo, “iATC-NRAKEL: an efficient multi-label classifier for recognizing anatomical therapeutic chemical classes of drugs,” Bioinformatics, vol. 36, no. 5, pp. 1391–1396, 2020.
B. Matthews, “Comparison of the predicted and observed secondary structure of T4 phage lysozyme,” Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.
Y.-H. Zhang, T. Zeng, L. Chen, T. Huang, and Y. D. Cai, “Determining protein-protein functional associations by functional rules based on gene ontology and KEGG pathway,” Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, vol. 1869, no. 6, article 140621, 2021.
L. Chen, C. Chu, Y. H. Zhang et al., “Identification of drug-drug interactions using chemical interactions,” Current Bioinformatics, vol. 12, no. 6, pp. 526–534, 2017.
L. Chen, S. Wang, Y. H. Zhang et al., “Identify key sequence features to improve CRISPR sgRNA efficacy,” IEEE Access, vol. 5, pp. 26582–26590, 2017.
J. Egan, Signal Detection Theory and ROC Analysis, Academic Press, New York, 1975.
Z. Liu, F. Guo, J. Gu et al., “Similarity-based prediction for anatomical therapeutic chemical classification of drugs by integrating multiple data sources,” Bioinformatics, vol. 31, no. 11, pp. 1788–1795, 2015.

Copyright

Copyright © 2022 Zixin Wu and Lei Chen. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.