Abstract

Drug-target interactions provide useful information for biomedical drug discovery as well as drug development. However, it is costly and time consuming to find drug-target interactions by experimental methods. As a result, developing computational approaches for this task is necessary and has practical significance. In this study, we establish a novel dual Laplacian graph regularized logistic matrix factorization model for drug-target interaction prediction, referred to as DLGrLMF. Specifically, DLGrLMF regards the task of drug-target interaction prediction as a weighted logistic matrix factorization problem, in which the experimentally validated interactions are assigned larger weights. Meanwhile, by considering that drugs with similar chemical structures should interact with similar targets and that targets with similar genomic sequences should in turn interact with similar drugs, the pairwise chemical structure similarities between drugs as well as the pairwise genomic sequence similarities between targets are exploited in the matrix factorization problem through a dual Laplacian graph regularization term. In addition, we design a gradient descent algorithm to solve the resultant optimization problem. Finally, the efficacy of DLGrLMF is validated on various benchmark datasets, and the experimental results demonstrate that DLGrLMF performs better than other state-of-the-art methods. Case studies are also conducted to validate that DLGrLMF can successfully predict most of the experimentally validated drug-target interactions.

1. Introduction

It is well known that drug discovery is a difficult and expensive process, and identifying potential drug-target interactions (DTIs) plays an important role in yielding successful candidate compounds for drug development. Predicting interactions between different drugs and targets can provide critical information by discovering off-target effects. Accurate prediction of DTIs can also substantially accelerate lead generation. Drug-target interaction prediction can also be regarded as a useful step in biomedical research and precision medicine [1–9]. However, it is still time consuming for traditional experimental approaches to identify potential DTIs, and the success rates are also very low. In addition, only a very limited number of DTIs have been experimentally validated. Therefore, it is necessary to develop computational methods for DTI prediction, which can significantly reduce both the time and labor costs and improve the efficiency of drug discovery. Furthermore, various databases contain experimentally validated interactions between drugs and targets, such as KEGG [10], DrugBank [11], and GenBank [12], which also benefit the prediction of potential DTIs by computational techniques.

In recent years, a large number of computational methods for DTI prediction have been proposed. These methods are often based on machine learning and data mining models, e.g., logistic regression [13, 14], support vector machine (SVM) [15–17], Bayesian classifiers [18], matrix completion [9], matrix factorization [19, 20], kernel learning [21, 22], and network inference [23–25]. Classification-based methods treat interacting and noninteracting drug-target pairs as positive and negative instances, respectively, and convert the DTI prediction problem into a classification task [14, 17]. In [15], a genetic algorithm is used to screen related compounds, and the drug-target pairs with strong binding capacity are found with SVM and particle swarm optimization. García-Sosa et al. [13, 18] used logistic regression and naive Bayesian classifiers for the classification of compounds. In [26], the experimentally validated targets are employed to train an SVM model and find potential proteins with similar structure. Network-based methods [23, 27, 28] utilize network and graph theory [29] and combine the drug and target similarities with experimentally validated interactions to infer potential unknown drug-target interactions. Owing to its strong capability for learning data representations, matrix factorization has also been used for drug-target interaction prediction, such as Bayesian matrix factorization [30], collaborative matrix factorization [31], and robust graph regularized matrix factorization [20]. The principle of these matrix factorization-based methods is that a high-dimensional drug-target interaction matrix can be decomposed into a product of low-dimensional matrices, and the intrinsic properties of the original data can be well captured by these low-dimensional matrices. Owing to their powerful feature representation and nonlinear relation learning capability, deep neural network-based methods have also been proposed to learn the relationship between drugs and targets [32, 33].

Although previous computational methods have achieved great success, they still have some limitations and leave much room for improvement. Firstly, in previous methods, the experimentally validated and unknown interactions are treated equally during the learning process, which could introduce noisy information. Secondly, in some learning models, drug features and target features are difficult to select. In order to solve the above issues, we propose a novel drug-target interaction prediction model based on logistic matrix factorization with a dual Laplacian graph regularization term that uses experimentally validated interactions, referred to as DLGrLMF. In our DLGrLMF model, the chemical structure similarities between drug pairs, the genomic sequence similarities between target pairs, and the experimentally validated interactions are integrated together. The similarities between drug neighbors and between target neighbors are exploited to constrain the latent factor vectors of the factorized matrices, and the potential interactions are determined by a probability score computed through the logistic function.

The efficacy of the proposed DLGrLMF was evaluated on five benchmark datasets. We compared DLGrLMF with several other state-of-the-art drug-target interaction prediction approaches using 10-fold cross-validation, and the results demonstrate that the proposed DLGrLMF clearly outperforms the other methods. In addition, in order to validate the ability to predict potential drug-target interactions, case studies were also performed, and the results demonstrate that DLGrLMF can accurately predict most of the experimentally validated drug-target interactions.

2. Materials and Methods

2.1. Datasets Used in Experiments

In this work, four small-scale benchmark datasets and a large-scale dataset are used in the experiments to evaluate the DTI prediction performance of the proposed DLGrLMF model. The four small-scale datasets are nuclear receptors (NRs), G protein-coupled receptors (GPCRs), ion channels (ICs), and enzymes (Es) [34], which correspond to four different types of target proteins and are publicly available at http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/. The large-scale dataset, DrugBank (DB) [35], is a unique bioinformatics and cheminformatics resource which combines detailed drug data with comprehensive drug-target information. We use the data released on July 3, 2018 (version 5.1.1), in our experiments. The drug and target data were extracted from the DrugBank database website at http://www.drugbank.ca/. We only use the approved drug-target interactions for experiments. In total, this yields 1936 drugs, 1609 targets, and 7019 approved drug-target interactions. We download the approved drug structures and approved target sequences from https://www.drugbank.ca/releases/latest#structures and https://www.drugbank.ca/releases/latest#target-sequences, respectively.

In Table 1, we present the detailed statistics of the five datasets. Three types of information are summarized for each dataset: the similarities between drug pairs, the similarities between target pairs, and the experimentally validated DTIs. Specifically, the validated DTIs are obtained from public databases including KEGG BRITE [36], BRENDA [37], DrugBank [38], and SuperTarget [39].

2.2. Problem Formulation of DTI Prediction

To make the subsequent presentation clearer, we first give a brief problem formulation of DTI prediction. Throughout this paper, we use two sets $D = \{d_i\}_{i=1}^{m}$ and $T = \{t_j\}_{j=1}^{n}$ to represent drugs and targets, respectively. The experimentally validated DTIs are denoted as a binary matrix $Y \in \{0, 1\}^{m \times n}$. If a drug $d_i$ has been experimentally validated to interact with a target $t_j$, then $y_{ij} = 1$; otherwise, $y_{ij} = 0$. The elements with a value of “1” in $Y$ represent the “known interactions” and can be regarded as positive observations, while the zero elements in $Y$ are “unknown interactions” and are regarded as negative observations. In addition, the drug similarities are denoted as $S^{d} \in \mathbb{R}^{m \times m}$, and the target similarities are represented as $S^{t} \in \mathbb{R}^{n \times n}$. DTI prediction aims to discover potential interactions among the negative observations by using prior information about drugs and targets. The candidate drug-target pairs are ranked by their predicted probabilities in descending order, and the top-ranked pairs are chosen as predicted interactions.
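The ranking step described above can be sketched in a few lines of Python. This is only an illustration under assumed names: `Y` stands for the binary interaction matrix (rows are drugs, columns are targets), `P` holds predicted probabilities from some model, and `rank_candidates` is a hypothetical helper rather than part of any released code. The numbers are made up.

```python
def rank_candidates(Y, P, top_k=None):
    """Rank the unknown pairs (Y[i][j] == 0) by predicted probability.

    Returns a list of (drug_index, target_index, probability) tuples,
    sorted from the most to the least likely interaction.
    """
    candidates = [
        (i, j, P[i][j])
        for i in range(len(Y))
        for j in range(len(Y[0]))
        if Y[i][j] == 0  # only unknown pairs are candidates
    ]
    candidates.sort(key=lambda item: item[2], reverse=True)
    return candidates if top_k is None else candidates[:top_k]


# Illustrative toy data: 2 drugs, 3 targets.
Y = [[1, 0, 0],
     [0, 1, 0]]
P = [[0.95, 0.40, 0.10],
     [0.70, 0.90, 0.20]]

ranked = rank_candidates(Y, P)
top_two = rank_candidates(Y, P, top_k=2)
```

Known interactions are excluded from the ranking because they are already labeled; only the zero entries can yield new predictions.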

For each dataset, three matrices $Y$, $S^{d}$, and $S^{t}$ are provided, which represent the drug-target interactions, drug similarities, and target similarities, respectively. Each entry of $S^{d}$ represents the similarity between a drug pair, which is measured by SIMCOMP [40], a score that describes the chemical structure similarity between drugs. In SIMCOMP, the similarity between two compounds $c$ and $c'$ can be computed as $s_c(c, c') = |c \cap c'| / |c \cup c'|$. As to $S^{t}$, genomic sequence similarity is used to denote the similarity score between two proteins, which is obtained from the KEGG GENES database [36]. The sequence similarity between two proteins $g$ and $g'$ is computed via a normalized version of the Smith–Waterman score [41], which is defined as $s_g(g, g') = SW(g, g') / \left(\sqrt{SW(g, g)}\,\sqrt{SW(g', g')}\right)$, where $SW(\cdot, \cdot)$ represents the original unnormalized Smith–Waterman score.
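As a small illustration of the sequence similarity normalization, the sketch below assumes the usual form in which each raw Smith–Waterman score is divided by the geometric mean of the two self-alignment scores. The function name `normalize_sw` and the raw scores are hypothetical; in practice the raw scores would come from an alignment tool.

```python
import math


def normalize_sw(raw):
    """Normalize a square matrix of raw Smith-Waterman scores.

    raw[i][j] is the unnormalized alignment score between proteins i and j.
    The result is raw[i][j] / (sqrt(raw[i][i]) * sqrt(raw[j][j])), so every
    protein has similarity 1 with itself.
    """
    n = len(raw)
    self_scores = [math.sqrt(raw[i][i]) for i in range(n)]
    return [
        [raw[i][j] / (self_scores[i] * self_scores[j]) for j in range(n)]
        for i in range(n)
    ]


# Made-up raw scores for three proteins (illustrative values only).
raw_scores = [
    [100.0, 30.0, 10.0],
    [30.0, 225.0, 45.0],
    [10.0, 45.0, 400.0],
]
sim = normalize_sw(raw_scores)
```

The normalization removes the dependence on sequence length, since long proteins tend to produce large raw scores even for weak matches.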

2.3. Proposed DLGrLMF Model

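Matrix factorization was originally developed for recommender systems. It decomposes the observation matrix $Y \in \mathbb{R}^{m \times n}$ into two low-dimensional matrices $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$, where $r$ is the number of hidden factors. Then, $U$ and $V$ can be regarded as the latent representations of drugs and targets in the hidden space. In recent years, matrix factorization has also been used for predicting lncRNA-miRNA relationships [42] and drug-target interactions [20]. In this work, we propose a drug-target interaction prediction model via logistic matrix factorization with dual Laplacian graph regularization. Here, the probability $p_{ij}$ that drug $d_i$ interacts with target $t_j$ is calculated from the inner product of the latent factor vectors of the drug and the target. Specifically, $p_{ij}$ can be formulated as follows:

$$p_{ij} = \frac{\exp(u_i v_j^{T})}{1 + \exp(u_i v_j^{T})}, \tag{1}$$

where $u_i \in \mathbb{R}^{1 \times r}$ is a latent row vector representing drug $d_i$, and $v_j \in \mathbb{R}^{1 \times r}$ is a latent row vector representing target $t_j$. Then, $U$ and $V$, whose rows are the vectors $u_i$ and $v_j$, represent the latent characteristics of all drugs and targets, respectively. In this work, $U$ and $V$ are assigned zero-mean spherical Gaussian priors as follows:

$$p(U \mid \sigma_d^{2}) = \prod_{i=1}^{m} \mathcal{N}(u_i \mid 0, \sigma_d^{2} I), \qquad p(V \mid \sigma_t^{2}) = \prod_{j=1}^{n} \mathcal{N}(v_j \mid 0, \sigma_t^{2} I), \tag{2}$$

where $\mathcal{N}(x \mid \mu, \sigma^{2})$ represents a Gaussian probability density function with a mean of $\mu$, a variance of $\sigma^{2}$, and an independent variable $x$.

In drug-target interaction prediction studies, known DTIs are experimentally verified and should therefore be more reliable than unknown pairs. Therefore, we allocate higher weights to the known DTIs [43, 44]. Specifically, both the known DTI pairs and the negative samples are used for training, where the constant $c$ determines the importance of the known interaction pairs. Then, the logarithm of the posterior distribution can be written as follows:

$$\log p(U, V \mid Y, \sigma_d^{2}, \sigma_t^{2}) = \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ c\, y_{ij}\, u_i v_j^{T} - (1 + c\, y_{ij} - y_{ij}) \log\left(1 + \exp(u_i v_j^{T})\right) \right] - \frac{1}{2\sigma_d^{2}} \sum_{i=1}^{m} \|u_i\|_2^{2} - \frac{1}{2\sigma_t^{2}} \sum_{j=1}^{n} \|v_j\|_2^{2} + C, \tag{3}$$

where $C$ is a constant.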
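To make the weighted logistic model concrete, the following NumPy sketch computes the interaction probabilities and the data-dependent part of the log posterior (the Gaussian prior terms are left out). It is written under assumed notation: `U` and `V` are the drug and target latent matrices with one row per drug or target, `Y` is the binary interaction matrix, and `c` is the weight given to known interactions. The function names are hypothetical.

```python
import numpy as np


def interaction_probabilities(U, V):
    """Logistic function of the inner products u_i . v_j for all pairs."""
    return 1.0 / (1.0 + np.exp(-(U @ V.T)))


def weighted_log_likelihood(Y, U, V, c):
    """Log-likelihood in which each known pair counts as c positive examples
    and each unknown pair counts as one negative example."""
    X = U @ V.T
    # log(1 + exp(x)) evaluated in a numerically stable way
    softplus = np.logaddexp(0.0, X)
    return float(np.sum(c * Y * X - (1.0 + c * Y - Y) * softplus))


# Toy example: 3 drugs, 2 targets, 2 latent factors (illustrative values).
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((3, 2))
V = 0.1 * rng.standard_normal((2, 2))

P = interaction_probabilities(U, V)
ll = weighted_log_likelihood(Y, U, V, c=5.0)
```

With `c = 1` the expression reduces to the ordinary Bernoulli log-likelihood; larger values of `c` make a misfit on a known interaction cost more than a misfit on an unknown pair.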

Although maximizing Equation (3) can exploit the latent vectors globally to predict potential DTIs, the local similarity information among drugs and targets is not taken into consideration. Therefore, we use the drug similarities $S^{d}$ and the target similarities $S^{t}$ to boost the prediction performance. Instead of using all of the similarities in $S^{d}$ and $S^{t}$, we extract the local neighbor information of drugs and targets. For drugs, the local neighbor similarity matrix $A$ can be obtained from $S^{d}$ by

$$a_{i\mu} = \begin{cases} s^{d}_{i\mu}, & \text{if } d_\mu \in N(d_i), \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$

where $N(d_i)$ denotes the neighbors of drug $d_i$. In a similar way, the local neighbor similarity matrix $B$ can be obtained from $S^{t}$ by

$$b_{j\nu} = \begin{cases} s^{t}_{j\nu}, & \text{if } t_\nu \in N(t_j), \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$

where $N(t_j)$ denotes the neighbors of target $t_j$.

The interactions between drugs and targets are complex. In contrast to previous graph regularized methods, which only use first-order connections to reflect the local pairwise proximity between vertices in a graph [45–48], we use second-order connections to enforce that similar drugs should be connected with similar targets. Accordingly, the similarity affinity matrices are recomputed in Equation (6): the affinity between two drugs is computed from the two corresponding columns of the original drug matrix, and the affinity between two targets is computed from the two corresponding columns of the original target matrix.
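The neighbor extraction step can be sketched as follows. The sketch assumes that the neighbors of an item are its `k` most similar other items; both `k` and the helper name `knn_affinity` are assumptions made for illustration, and the second-order refinement described above is not reproduced here.

```python
import numpy as np


def knn_affinity(S, k):
    """Keep, for every row i of S, only the k largest similarities to
    other items; all remaining entries (including the diagonal) are zero."""
    n = S.shape[0]
    A = np.zeros((n, n), dtype=float)
    for i in range(n):
        scores = S[i].astype(float).copy()
        scores[i] = -np.inf                      # an item is not its own neighbor
        neighbors = np.argsort(scores)[::-1][:k]  # indices of the k best scores
        A[i, neighbors] = S[i, neighbors]
    return A


# Illustrative similarity matrix for four drugs.
S = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.4, 0.8, 1.0],
])
A1 = knn_affinity(S, k=1)
A2 = knn_affinity(S, k=2)
```

The resulting matrix is not symmetric in general, because being among the nearest neighbors is not a mutual relation. A common remedy, if a symmetric affinity is needed for a graph Laplacian, is to average the matrix with its transpose.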

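The main idea of our proposed DLGrLMF model rests on the assumption that if the chemical structures of two drugs are similar, their latent representations should also be close to each other. Similarly, the latent representations of two targets should be close to each other if their genomic sequences are similar. Let $a_{i\mu}$ and $b_{j\nu}$ denote the entries of the drug and target similarity affinity matrices $A$ and $B$. For drugs, we can minimize the following problem:

$$\min_{U} \; \frac{1}{2} \sum_{i=1}^{m} \sum_{\mu=1}^{m} a_{i\mu} \, \|u_i - u_\mu\|_2^{2}. \tag{7}$$

For targets, we have the following similar minimization problem:

$$\min_{V} \; \frac{1}{2} \sum_{j=1}^{n} \sum_{\nu=1}^{n} b_{j\nu} \, \|v_j - v_\nu\|_2^{2}. \tag{8}$$

By some simple algebra, Equations (7) and (8) can be transformed into the following form:

$$\min_{U} \; \mathrm{tr}\left(U^{T} L^{d} U\right), \tag{9}$$

$$\min_{V} \; \mathrm{tr}\left(V^{T} L^{t} V\right), \tag{10}$$

where $L^{d} = D^{d} - A$ is the Laplacian matrix of drugs, and $D^{d}$ is the diagonal matrix with $D^{d}_{ii} = \sum_{\mu} a_{i\mu}$. Likewise, $L^{t} = D^{t} - B$ is the Laplacian matrix of targets, and $D^{t}$ is the diagonal matrix with $D^{t}_{jj} = \sum_{\nu} b_{j\nu}$.

By combining Equations (9) and (10) with the maximization of Equation (3), we have our final DLGrLMF model as follows:

$$\max_{U, V} \; \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ c\, y_{ij}\, u_i v_j^{T} - (1 + c\, y_{ij} - y_{ij}) \log\left(1 + \exp(u_i v_j^{T})\right) \right] - \frac{1}{2\sigma_d^{2}} \sum_{i=1}^{m} \|u_i\|_2^{2} - \frac{1}{2\sigma_t^{2}} \sum_{j=1}^{n} \|v_j\|_2^{2} - \frac{\alpha}{2} \mathrm{tr}\left(U^{T} L^{d} U\right) - \frac{\beta}{2} \mathrm{tr}\left(V^{T} L^{t} V\right), \tag{11}$$

which is equivalent to the following problem:

$$\min_{U, V} \; \sum_{i=1}^{m} \sum_{j=1}^{n} (1 + c\, y_{ij} - y_{ij}) \log\left(1 + \exp(u_i v_j^{T})\right) - \sum_{i=1}^{m} \sum_{j=1}^{n} c\, y_{ij}\, u_i v_j^{T} + \frac{1}{2} \mathrm{tr}\left[U^{T} \left(\lambda_d I_m + \alpha L^{d}\right) U\right] + \frac{1}{2} \mathrm{tr}\left[V^{T} \left(\lambda_t I_n + \beta L^{t}\right) V\right], \tag{12}$$

where $\lambda_d = 1/\sigma_d^{2}$, $\lambda_t = 1/\sigma_t^{2}$, and $I_m$ and $I_n$ represent two identity matrices of size $m \times m$ and $n \times n$, respectively. $\alpha$ and $\beta$ are two nonnegative constants that balance the regularization terms. In Equation (12), the first and second terms constitute the logistic matrix factorization model that formulates the drug-target interaction probability. The third term is the Laplacian regularization term that captures the local relationship between drug pairs, and the fourth term is the Laplacian regularization term that captures the local relationship between target pairs.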
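The "simple algebra" that turns a pairwise smoothness penalty into a trace with a graph Laplacian can be checked numerically. The sketch below assumes a symmetric affinity matrix `W` and a factor of one half in front of the pairwise sum; under these assumptions the two quantities coincide. All names and values are illustrative.

```python
import numpy as np


def graph_laplacian(W):
    """Laplacian D - W, where D is the diagonal matrix of row sums of W."""
    return np.diag(W.sum(axis=1)) - W


rng = np.random.default_rng(0)
n_items, n_factors = 5, 3

# Random symmetric affinity matrix with a zero diagonal.
W = rng.random((n_items, n_items))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)

# Random latent vectors, one row per item.
U = rng.standard_normal((n_items, n_factors))

# Pairwise form: 0.5 * sum_ij W_ij * ||u_i - u_j||^2
pairwise = 0.5 * sum(
    W[i, j] * np.sum((U[i] - U[j]) ** 2)
    for i in range(n_items)
    for j in range(n_items)
)

# Trace form: tr(U^T L U)
trace_form = np.trace(U.T @ graph_laplacian(W) @ U)
```

The trace form is what makes the regularizer convenient for gradient-based optimization, since its gradient with respect to `U` is simply `2 * L @ U` for a symmetric `L`.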

As can be seen from Equation (12), DLGrLMF models the interaction probability of a drug-target pair by a logistic function and decomposes the probability matrix into drug-specific and target-specific latent vectors. In DLGrLMF, a biologically validated drug-target pair is treated as $c$ positive examples, while an unknown pair is treated as a single negative example. In this manner, DLGrLMF assigns higher weights to positive observations than to negative ones. Since the positive pairs are biologically validated and thus usually more trustworthy, while the negative pairs could contain potential DTIs and are thus unreliable, our method can fully exploit the useful information in the validated interaction pairs.

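In this work, we use gradient descent to optimize Equation (12). Denoting the objective function as $\mathcal{L}$, the partial derivatives of $\mathcal{L}$ with respect to $U$ and $V$ can be obtained as follows:

$$\frac{\partial \mathcal{L}}{\partial U} = PV + (c - 1)(Y \odot P)V - cYV + \left(\lambda_d I_m + \alpha L^{d}\right) U, \tag{13}$$

$$\frac{\partial \mathcal{L}}{\partial V} = P^{T}U + (c - 1)(Y \odot P)^{T}U - cY^{T}U + \left(\lambda_t I_n + \beta L^{t}\right) V, \tag{14}$$

where $\odot$ denotes the Hadamard product of two matrices, $L^{d}$ and $L^{t}$ are the drug and target Laplacian matrices, and $\lambda_d$, $\lambda_t$, $\alpha$, and $\beta$ are the regularization coefficients in Equation (12). Each element of $P$ (i.e., $p_{ij}$) is given by Equation (1) and denotes the probability of interaction between drug $d_i$ and target $t_j$. $U$ and $V$ are randomly initialized. During the optimization process, $U$ and $V$ are updated until they become stable. After we obtain the final solution $U$ and $V$, the final probability of interaction between drug $d_i$ and target $t_j$ can be calculated as follows:

$$\hat{p}_{ij} = \frac{\exp(u_i v_j^{T})}{1 + \exp(u_i v_j^{T})}. \tag{15}$$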
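A plain gradient descent loop for a Laplacian-regularized weighted logistic matrix factorization can be sketched as follows. This is a simplified illustration and not the authors' implementation: it assumes symmetric affinity matrices, a fixed step size, and a fixed number of iterations. The hyperparameter names (`c`, `lam_d`, `lam_t`, `alpha`, `beta`, `lr`) and their default values are placeholders chosen for the toy example.

```python
import numpy as np


def laplacian(W):
    """Graph Laplacian D - W of a symmetric affinity matrix W."""
    return np.diag(W.sum(axis=1)) - W


def objective(Y, U, V, Ld, Lt, c, lam_d, lam_t, alpha, beta):
    """Weighted logistic loss plus the two regularization terms."""
    X = U @ V.T
    loss = np.sum((1.0 + c * Y - Y) * np.logaddexp(0.0, X) - c * Y * X)
    reg_d = 0.5 * np.trace(U.T @ (lam_d * U + alpha * (Ld @ U)))
    reg_t = 0.5 * np.trace(V.T @ (lam_t * V + beta * (Lt @ V)))
    return float(loss + reg_d + reg_t)


def gradients(Y, U, V, Ld, Lt, c, lam_d, lam_t, alpha, beta):
    """Partial derivatives of the objective with respect to U and V."""
    P = 1.0 / (1.0 + np.exp(-(U @ V.T)))
    G = P + (c - 1.0) * (Y * P) - c * Y  # derivative w.r.t. U V^T
    grad_U = G @ V + lam_d * U + alpha * (Ld @ U)
    grad_V = G.T @ U + lam_t * V + beta * (Lt @ V)
    return grad_U, grad_V


def fit(Y, Ld, Lt, rank=2, c=5.0, lam_d=0.1, lam_t=0.1,
        alpha=0.1, beta=0.1, lr=0.02, n_iter=500, seed=0):
    """Run fixed-step gradient descent from a small random start."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    U = 0.1 * rng.standard_normal((m, rank))
    V = 0.1 * rng.standard_normal((n, rank))
    params = (c, lam_d, lam_t, alpha, beta)
    history = [objective(Y, U, V, Ld, Lt, *params)]
    for _ in range(n_iter):
        grad_U, grad_V = gradients(Y, U, V, Ld, Lt, *params)
        U = U - lr * grad_U
        V = V - lr * grad_V
        history.append(objective(Y, U, V, Ld, Lt, *params))
    return U, V, history


# Toy problem: 4 drugs, 3 targets (illustrative values only).
Y = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]], dtype=float)
Wd = np.array([[0.0, 0.8, 0.0, 0.0],
               [0.8, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.6],
               [0.0, 0.0, 0.6, 0.0]])
Wt = np.array([[0.0, 0.5, 0.0],
               [0.5, 0.0, 0.3],
               [0.0, 0.3, 0.0]])

U, V, history = fit(Y, laplacian(Wd), laplacian(Wt))
P_hat = 1.0 / (1.0 + np.exp(-(U @ V.T)))  # predicted interaction probabilities
```

In practice, an adaptive step size and a convergence test on the change of the objective would usually replace the fixed `lr` and `n_iter` used here.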

3. Experimental Results

3.1. Evaluation Metrics

In order to validate the efficacy of our proposed method, experiments are conducted on the five datasets described in Section 2.1. Similar to several previous works [49–51], precision-recall (PR) curves and the Area Under the Precision-Recall curve (AUPR) [52] are utilized for performance evaluation. Since we intend to avoid incorrect predictions being recommended by the prediction algorithms [52], AUPR is desirable for evaluation because it penalizes false positives more heavily.
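One common way to approximate AUPR is average precision, which averages the precision measured at the rank of each true interaction. The sketch below uses this step-wise estimate as an assumption; other interpolation schemes for the area under the PR curve give slightly different values, and ties between scores are not treated specially. The function name and the toy data are illustrative.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 1 for a true interaction, 0 otherwise.
    scores: predicted probabilities, higher means more likely.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    true_positives = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall level
    return precision_sum / n_pos


# Five test pairs: true interactions are ranked first and third.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
ap = average_precision(labels, scores)
```

Here the precision is 1/1 at the first true interaction and 2/3 at the second, so the estimate is their mean, 5/6. A false positive ranked above a true interaction lowers every subsequent precision value, which is why this metric is sensitive to highly ranked false positives.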

3.2. Experiment Settings

In our experiments, we compare DLGrLMF with seven other drug-target interaction prediction techniques: bipartite local model using neighbor-based interaction-profile inferring (BLMNII) [49], weighted nearest neighbor profile (WNN) [50], collaborative matrix factorization (CMF) [31], graph regularized matrix factorization (GRMF) [51], neighborhood regularized logistic matrix factorization (NRLMF) [45], label propagation with linear neighborhood information (LPLNI) [53], and dual Laplacian graph regularized matrix completion (DLGRMC). For each method, we perform 5 repetitions of 10-fold cross-validation (CV) on each dataset. In each repetition, the observed DTI indicator matrix is divided into 10 folds. Each fold is used in turn for testing while the remaining 9 folds are used for training; the final AUPR score is the average over the 5 repetitions.

Similar to previous works [31, 54, 55], we conduct CV under the following three different settings:
(i) CV on drug-target pairs: we randomly select some entries of the interaction matrix (i.e., drug-target pairs) for testing, which evaluates the DTI prediction method on new (unknown) drug-target pairs.
(ii) CV on drugs: we randomly select several rows of the interaction matrix (i.e., drugs) for testing, which corresponds to DTI prediction for new drugs.
(iii) CV on targets: we randomly select a portion of the columns of the interaction matrix (i.e., targets) for testing, which corresponds to DTI prediction for new targets.

Under settings (i), (ii), and (iii), we use 90% of the entries, 90% of the rows, and 90% of the columns of the interaction matrix, respectively, as training data and the remaining data as testing data in each round.
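The three ways of holding out test data can be sketched as below. The setting names `"pairs"`, `"drugs"`, and `"targets"`, the function `make_folds`, and the matrix size are assumptions made for this illustration. Each fold lists the (drug, target) index pairs hidden from training in that round.

```python
import random


def make_folds(n_drugs, n_targets, setting, n_folds=10, seed=0):
    """Split the interaction matrix into n_folds test sets.

    setting = "pairs":   individual entries are held out.
    setting = "drugs":   whole rows (all pairs of a drug) are held out.
    setting = "targets": whole columns (all pairs of a target) are held out.
    """
    if setting == "pairs":
        groups = [[(i, j)] for i in range(n_drugs) for j in range(n_targets)]
    elif setting == "drugs":
        groups = [[(i, j) for j in range(n_targets)] for i in range(n_drugs)]
    elif setting == "targets":
        groups = [[(i, j) for i in range(n_drugs)] for j in range(n_targets)]
    else:
        raise ValueError(f"unknown setting: {setting!r}")

    rng = random.Random(seed)
    rng.shuffle(groups)
    # Deal the shuffled groups round-robin into the folds.
    return [
        [pair for group in groups[k::n_folds] for pair in group]
        for k in range(n_folds)
    ]


# Illustrative matrix with 20 drugs and 30 targets.
folds_pairs = make_folds(20, 30, "pairs")
folds_drugs = make_folds(20, 30, "drugs")
folds_targets = make_folds(20, 30, "targets")
```

Repeating the procedure with different values of `seed` gives the independent repetitions over which the scores are averaged. Holding out whole rows or columns matters because a new drug or target has no known interactions at all during training.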

3.3. DTI Prediction Results

In Tables 2–4, we report the AUPR values of the different methods on the different datasets under the three CV settings. As can be seen, our proposed DLGrLMF consistently outperforms the other methods on all of the datasets. Considering that drug discovery and development aim to serve the treatment of disease, and in order to predict new targets with which the drugs interact, we plot the precision-recall (PR) curves for all of the datasets under the corresponding CV setting. The PR curves are shown in Figure 1; the results also demonstrate the superiority of our proposed DLGrLMF.

3.4. Case Study

In order to validate the capacity of DLGrLMF for potential DTI prediction, we randomly choose a drug from each dataset and report the top 10 interactions predicted by the different methods under a fixed CV setting. The predicted results are reported in Tables 5–8. As can be seen from the results, our proposed DLGrLMF successfully predicts a larger number of the experimentally validated DTIs than the other methods, which also indicates that DLGrLMF is capable of predicting novel DTIs for drug development.

4. Discussion and Conclusions

In this paper, we propose a novel dual Laplacian graph regularized logistic matrix factorization model for drug-target interaction prediction, i.e., DLGrLMF. Specifically, DLGrLMF regards the task of drug-target interaction prediction as a weighted logistic matrix factorization problem, in which the experimentally validated interactions are assigned larger weights. Meanwhile, by considering that drugs with similar chemical structures should interact with similar targets and that targets with similar genomic sequences should in turn interact with similar drugs, the pairwise chemical structure similarities between drugs as well as the pairwise genomic sequence similarities between targets are exploited in the matrix factorization problem through a dual Laplacian graph regularization term. Extensive experiments validate the efficacy of the proposed method, and case studies demonstrate that it is effective at predicting potential novel drug-target interactions.

In addition, the experimental results also show that there is still much room for improvement, since some interactions are still missed in the case studies. In this work, only one type of representation is used for drugs and for targets. In practice, each drug or target often has multiple representations. For example, a drug can be represented by its chemical structure or by its chemical response in different cells. A protein target can be represented by its sequence or by its gene expression values in different cells. In our future work, we will try to integrate multiple representations for drug-target interaction prediction, and we believe that the prediction results can be improved by a large margin.

Data Availability

The datasets used in this work are publicly available at http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to thank Dr. Yong Liu for providing their code for implementing the logistic matrix factorization model.