Abstract

The past few decades have witnessed a boom in pharmacology as well as a dilemma in drug development. Playing a crucial role in drug design, the screening of potential human drug target proteins from open-access databases with well-measured physical and chemical properties is a challenging but significant task. In this paper, we study the screening of potential drug target proteins (DTPs) from a carefully collected dataset containing 5376 unlabeled proteins and 517 known DTPs; our objective is to screen potential DTPs from the 5376 unlabeled proteins. We propose two strategies for constructing a dataset of reliable nondrug target proteins (NDTPs), after which bagging of decision trees is employed for the final prediction. These two-stage algorithms show effective and superior performance on the testing set. Both algorithms maintain high recall ratios of DTPs, 93.5% and 97.4%, respectively. In one turn of experiments, the Strategy 1-based bagging of decision trees screened about 558 possible DTPs, while 1782 potential DTPs were predicted by the second algorithm. Moreover, the two strategy-based algorithms largely agree in their predictions, with approximately 442 potential DTPs in common. These selected DTPs provide reliable candidates for further verification by biomedical experiments.

1. Background

In biotechnology, pharmacology, and drug development, the identification of drug targets aims to discover new candidate molecules that are active in the therapeutic process. As noted in [1], the drug target is a broad concept ranging from molecular entities such as ribonucleic acids (RNAs), genes, and proteins to biological phenomena like phenotypes or pathways.

The history of drug development confirms that most failures in drug exploration can be attributed to the pursuit of inappropriate targets [2, 3]. It is widely acknowledged that identifying potential targets for intervention is the first and foremost step in a modern drug campaign [1, 4-7], and it has attracted increasing attention from both academia and industry. Once a molecule is predicted as a drug target, drug design engineering can proceed toward clinical trials. Since such programs involve huge investments from pharmaceutical corporations and governments and are extremely time-consuming and labor-intensive, the choice of potential targets for experiments is crucial.

Our collected dataset represents a special case in which a limited number of drug target proteins are known while the labels of the remaining proteins are uncertain, so screening potential drug target proteins from the unlabeled set is complicated. A piece of prior information supporting our research is the low ratio of "druggable" genes in the human genome, approximately 10% [8]. In light of this, nondrug target proteins (NDTPs) should dominate the unlabeled set. For more detailed information about our dataset, see Materials and Methods; our ultimate objective is to screen reliable drug target proteins (DTPs) from the unlabeled set. Previous methodologies for the identification of drug target proteins (IDTPs) required specific biological hypotheses such as side-effect similarity [9] or chemical structure and genomic sequence information [10]; for a further review, refer to [4]. To overcome the limits on the reliability of such hypotheses and to explore a robust way to address the problem, we developed a paradigm combining protein biochemical characteristics with data mining techniques. Figure 1 shows the process of drug discovery using data mining techniques. Inspired by the family of algorithms for positive and unlabeled learning, we transferred this existing knowledge into the domain of bioinformatics. A two-stage paradigm was adopted for the screening task, and the final results show the effectiveness of our algorithms.

2. Materials and Methods

2.1. Data Collection and Preliminary Analysis
2.1.1. Data Collection

Proteins, as one of the main sources of drug targets, have been a lasting research topic across various domains. Some of them interact with each other, forming the basis of signal transduction pathways and transcriptional regulatory networks. As the focus of our research, drug target proteins are functional biomolecules addressed and controlled by active compounds. In this paper, we collected proteins from the DrugBank Database (Version 3.0), in which 1604 proteins were annotated as drug targets [11]. Further data cleaning was performed by removing nonhuman proteins as well as sequences with pairwise identity greater than 20% using PISCES [12]. As compounds of atoms and molecules, whether a protein can be a drug target candidate is frequently determined by factors such as water solubility, hydrogen ion concentration (pH), basicity, and structure. Though interaction relations provide additional information for the screening, they are not entirely reliable, and other protein properties essentially originate from the basic chemical or physical properties of proteins. The properties selected in this research are therefore basic chemical and physical properties of proteins, extracted following the process in [13]. Properties of significance for our task were extracted, such as peptide cleavages [14], N-glycosylation [15], O-glycosylation [16], low complexity regions [17], transmembrane helices [18], and other influential physical or chemical characteristics; these properties are important clues for the biological activity of proteins. We used pepstats, a tool from EMBOSS [19], to calculate the property statistics. In this article, we also refer to the unlabeled proteins as uncertain NDTPs because of the prior information about the proportion of DTPs in the dataset; they are "uncertain" in the sense that we do not know whether any of them is a drug target candidate. Finally, a dataset with 517 known DTPs and 5376 uncertain NDTPs was employed for the screening task: some of the 5376 proteins are to be recommended as the most likely DTPs from the set of uncertain NDTPs. Further information about the dataset is illustrated in Figure 2, and supporting materials are available at http://pan.baidu.com/s/1pLDCkcF.

2.1.2. Preliminary Analysis

To eliminate the effect of scales, we first impose normalization on each continuous property. The detailed process is as follows:
$$x' = \frac{x - \mu}{\sigma},$$
where $x'$ is the normalized value of some property $x$, $\mu$ is the mean of the population, and $\sigma$ is the standard deviation of the property. After the preprocessing, we apply hypothesis tests to check whether the information of each property is beneficial for our screening task. More specifically, the two-sided Kolmogorov-Smirnov test was used, with the DTPs and the unlabeled proteins treated as two classes. Since the unlabeled set is dominated by NDTPs, it is reasonable to consider that the traits of the NDTPs can be well approximated by the distribution of the unlabeled proteins, with some noise from the potential DTPs. We denote the list of properties in the following order: Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr, Tiny, Small, Aliphatic, Aromatic, Nonpolar, Polar, Charged, Basic, Acidic, Hydrophobicity, SignalP, LowComplexityRegions, Ogly_S, Ogly_T, Ngly, and Trans_Helices. All of these are elaborated, together with the detailed property extraction process, in our former work [13]. The results in Table 1 show significant differences between the two classes, suggesting that almost all of these properties are discriminating; this effectiveness of the properties supports the following experiments.
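As an illustration, the following sketch performs the z-score normalization and the per-property two-sided Kolmogorov-Smirnov test. The matrices and file names (X_dtp, X_unl, the .npy files) are hypothetical placeholders for the 517 known DTPs and 5376 unlabeled proteins, not artifacts from the original pipeline.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical feature matrices: rows = proteins, columns = continuous properties.
X_dtp = np.load("dtp_properties.npy")        # known DTPs (assumed file name)
X_unl = np.load("unlabeled_properties.npy")  # unlabeled proteins (assumed file name)

# z-score normalization using the statistics of the whole population
X_all = np.vstack([X_dtp, X_unl])
mu, sigma = X_all.mean(axis=0), X_all.std(axis=0)
X_dtp_n = (X_dtp - mu) / sigma
X_unl_n = (X_unl - mu) / sigma

# two-sided Kolmogorov-Smirnov test for each property: DTPs vs. unlabeled proteins
for j in range(X_all.shape[1]):
    stat, p_value = ks_2samp(X_dtp_n[:, j], X_unl_n[:, j])
    print(f"property {j}: KS statistic = {stat:.3f}, p-value = {p_value:.2e}")
```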

Another factor that would affect the predicting performance is the correlation between properties. After computing the correlation values, the covariance matrix is visualized in Figure 3; the names on the axes follow the order of the property list above, from top to bottom and from left to right. As shown in the figure, the properties have weak correlations with each other, indicating little information redundancy. Up to now, the task appears learnable since the properties are informative. It must be emphasized that we only use continuous properties in our experiments to avoid the dimensionality problems introduced by nominal properties. Further experimental results confirm this induction.
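A minimal sketch of this correlation check, reusing the hypothetical normalized matrices from the previous sketch, is:

```python
import numpy as np
import matplotlib.pyplot as plt

# normalized properties of all proteins (assumed from the previous sketch)
X_all_n = np.vstack([X_dtp_n, X_unl_n])

corr = np.corrcoef(X_all_n, rowvar=False)   # properties are columns
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="Pearson correlation")
plt.title("Correlation between the continuous properties")
plt.show()
```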

2.2. Two-Stage Methodologies

Considering the details, the identification itself is an arduous task. It is a one-class classification problem, which can also be viewed as a type of transductive learning [20]. To establish a classifier, some negative samples, namely NDTPs, are necessary. Here, we employ anomaly detection techniques [21] to convert the one-class classification problem into a general binary classification problem. The task is therefore addressed in a two-stage paradigm: the first stage screens reliable negative samples to form the training dataset for binary classification, and the second stage constructs a classifier from the obtained dataset. The flowchart in Figure 4 illustrates our framework in detail.

2.2.1. Strategies in the First Stage

The construction of the negative set from the collected unlabeled proteins is a nontrivial task; in some sense, it amounts to screening reliable NDTPs. Though prior knowledge indicates that NDTPs occupy most of the unlabeled set, discriminating criteria between the two classes are hard to formulate. Here, statistical analyses with proper techniques are employed for the extraction of NDTPs, and we devise two strategies for the choice of reliable NDTPs from the perspective of statistical anomaly detection. Both of them mine the inner discrimination between the distributions of DTPs and the uncertain NDTPs.

Strategy 1. This strategy is nonparametric, and the computations in the initial process rely only on the known DTPs. Here the 31 continuous properties are the decisive factors, and each property provides a measurable criterion for selecting reliable NDTPs. An intuitive approach is to characterize the extent to which a sample violates the statistical indexes or patterns displayed by the known DTPs. The range of a property in which DTPs occur with high probability can be restricted based on quantile information from the cumulative distribution of the known DTPs, and proteins whose property values fall outside this range are more likely to share patterns with reliable NDTPs.

In our experiments, the reliable range of the DTPs for one continuous property is defined as the interval $[q_{\alpha}, q_{\beta}]$, where $q_{\alpha}$ and $q_{\beta}$ are quantiles of that property; $\alpha$ is set to 10% and $\beta$ to 90%. As displayed in Figure 5, any unlabeled sample with a property value lower than the lower threshold or higher than the upper threshold is judged to violate the frequent DTP pattern for that property. Another crucial definition is the extent of an unknown sample's violation of the frequent DTP pattern, which is complicated to determine. To simplify the process while maintaining the anomaly information, we count, for each unlabeled protein, the number of the 31 property values that do not conform to their reliable ranges, and use this count as an index measuring the reliability of that sample being an NDTP.

After these computations, a statistical summary is given in Figure 6. For the selection algorithm based on the reliable intervals, the threshold used to screen likely NDTPs is chosen to balance the class balance of the training dataset against the reliability of the NDTPs. In this way, 441 proteins are selected from the unlabeled set as the most likely NDTPs for further training.
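A minimal NumPy sketch of Strategy 1 is given below, assuming the normalized matrices X_dtp_n and X_unl_n from the preliminary analysis; the violation threshold T used here is a hypothetical value, since the paper chooses it from Figure 6 to balance class sizes.

```python
import numpy as np

def strategy1_reliable_ndtps(X_dtp, X_unl, low=0.10, high=0.90, T=10):
    """Select reliable NDTPs by counting quantile-interval violations.

    The reliable DTP range of each property is its [10%, 90%] quantile
    interval estimated on the known DTPs; unlabeled proteins violating at
    least T of the intervals are kept as reliable NDTPs.  T = 10 is a
    placeholder, not the value used in the paper."""
    q_low = np.quantile(X_dtp, low, axis=0)
    q_high = np.quantile(X_dtp, high, axis=0)
    violations = ((X_unl < q_low) | (X_unl > q_high)).sum(axis=1)
    return np.where(violations >= T)[0], violations

# usage: indices of the unlabeled proteins selected as reliable NDTPs
# rn_idx, counts = strategy1_reliable_ndtps(X_dtp_n, X_unl_n)
```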

Strategy 2. As noted, the unlabeled dataset can approximate the distribution of NDTPs, but this approximation is biased due to the presence of potential DTPs. Meanwhile, the distribution of DTPs can be captured from the 517 known DTPs. When the unlabeled data are combined with the labeled data, a semisupervised learning framework can exploit the additional information in the unlabeled set, reducing the bias of the probability density estimation.

Expectation maximization (EM) [22] is the algorithm we employ to learn the mixture of probability distributions. Gaussian distributions are frequently used as the component distributions in mixture models.

The model can be described as
$$p(x) = \sum_{k=1}^{K} \pi_k\, p(x \mid \theta_k),$$
where the mixture coefficients $\pi_k$ lie in the interval $[0, 1]$ with the constraint $\sum_{k=1}^{K} \pi_k = 1$, and $\theta_k$ is the parameter set of the $k$th probability distribution. The mixture coefficients can be interpreted as the prior weights of the mixed distributions.

The objective of the model is to maximize the likelihood of the whole dataset,
$$L = \prod_{i=1}^{N} p(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k\, p(x_i \mid \theta_k).$$

An equivalent objective is the maximization of the log likelihood:
$$\log L = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k\, p(x_i \mid \theta_k).$$

For our problem, we denote $\theta_1$ and $\theta_2$, respectively, as the parameters of the DTP and NDTP distributions, with mixture coefficients $\pi_1$ and $\pi_2$.

Since some samples are already determined to be DTPs, it is better to incorporate this partial label information into the model. Denoting by $P$ the set of known DTPs and by $U$ the unlabeled set, the objective in our problem is adapted as
$$\log L = \sum_{x_i \in P} \log\big(\pi_1\, p(x_i \mid \theta_1)\big) + \sum_{x_j \in U} \log\big(\pi_1\, p(x_j \mid \theta_1) + \pi_2\, p(x_j \mid \theta_2)\big).$$

Applying the EM algorithm to optimize this objective, we obtain the final parameters, and the mixture model is derived. As a generative model, it yields the probability of assigning each sample to each class. The probability of assigning a sample to the NDTP class, which we care about most, is calculated as
$$p(\mathrm{NDTP} \mid x) = \frac{\pi_2\, p(x \mid \theta_2)}{\pi_1\, p(x \mid \theta_1) + \pi_2\, p(x \mid \theta_2)}.$$

This calculation is simply the posterior probability obtained by Bayesian inference.

Ranking the above probabilities in decreasing order, the top 441 proteins are selected as reliable NDTPs, to maintain the same number as in Strategy 1.
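A minimal sketch of Strategy 2 under these definitions follows: a two-component Gaussian mixture fitted by EM, with the responsibilities of the known DTPs clamped to the DTP component. The function and variable names are ours, and this is one standard way to implement the semisupervised EM above, not necessarily the authors' exact code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def semisupervised_em(X_pos, X_unl, n_iter=100, reg=1e-6):
    """Component 0 ~ DTP, component 1 ~ NDTP.  Known DTPs are clamped to
    component 0 (the partial-label constraint).  Returns p(NDTP | x) for
    each unlabeled protein.  For high-dimensional data a log-domain
    implementation would be numerically safer; this is only a sketch."""
    X = np.vstack([X_pos, X_unl])
    n_pos = len(X_pos)
    n, d = X.shape
    # initialization: DTP component from the labeled data, NDTP from the unlabeled
    mu = np.array([X_pos.mean(axis=0), X_unl.mean(axis=0)])
    cov = np.array([np.cov(X_pos.T) + reg * np.eye(d),
                    np.cov(X_unl.T) + reg * np.eye(d)])
    pi = np.array([0.1, 0.9])                 # prior: roughly 10% druggable
    resp = np.zeros((n, 2))
    for _ in range(n_iter):
        # E-step: posterior responsibilities
        for k in range(2):
            resp[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=cov[k])
        resp /= resp.sum(axis=1, keepdims=True)
        resp[:n_pos] = [1.0, 0.0]             # clamp known DTPs to the DTP component
        # M-step: update weights, means, covariances
        Nk = resp.sum(axis=0)
        pi = Nk / n
        for k in range(2):
            mu[k] = resp[:, k] @ X / Nk[k]
            diff = X - mu[k]
            cov[k] = (resp[:, k][:, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
    return resp[n_pos:, 1]

# usage: rank the unlabeled proteins and keep the top 441 as reliable NDTPs
# p_ndtp = semisupervised_em(X_dtp_n, X_unl_n)
# rn_idx = np.argsort(-p_ndtp)[:441]
```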

2.2.2. Classifier Establishment in the Second Stage

In the first stage, a set of reliable NDTPs is screened to constitute part of the training dataset. Then bagging of decision trees [23], a traditional but efficient model, is developed for the identification. Bagging takes advantage of the bootstrap [24] over the training dataset to generate a series of meta models with variance; benefiting from this randomness, the learned meta decision trees are aggregated to capture a complex concept boundary. For our task in particular, each extracted property has been shown to be discriminative between the classes and the information redundancy is rather low, so meta decision trees built on random subsets of the training data are beneficial and effective in practice. In the experiments, bagging is performed with the scikit-learn package [25].

In our experiments, the partition criterion was the Gini index, defined as follows.

Define the Gini value (impurity) of the dataset $D$ as
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2,$$
where $p_k$ is the proportion of samples in $D$ belonging to class $k$ ($k = 1, \dots, |\mathcal{Y}|$).

Then the Gini index of a property $a$ can be computed as
$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, \mathrm{Gini}(D^v),$$
where $D^v$ corresponds to the samples belonging to the branch node derived from the $v$th value of property $a$.

The property minimizing the Gini index is chosen for partitioning. Besides, the minimum number of samples required for splitting was set to 2 and the minimum number of samples per leaf was 1.
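A minimal scikit-learn sketch of this second-stage classifier is shown below, assuming X_train / y_train hold the continuous properties and labels (1 = DTP, 0 = reliable NDTP) produced by one of the first-stage strategies; the value of n_estimators is a placeholder, since the paper tunes it between 5 and 2000.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# meta decision tree with the settings described above
base_tree = DecisionTreeClassifier(
    criterion="gini",        # Gini-index partition criterion
    min_samples_split=2,     # minimum samples required to split a node
    min_samples_leaf=1,      # minimum samples required at a leaf
)

# bagging ensemble; the base estimator is passed positionally because the
# keyword name differs across scikit-learn versions (estimator / base_estimator)
model = BaggingClassifier(
    base_tree,
    n_estimators=500,        # placeholder; tuned from 5 to 2000 in the paper
    bootstrap=True,          # bootstrap resampling of the training set
    random_state=0,
)
# model.fit(X_train, y_train)
# candidate DTPs are the unlabeled testing proteins predicted as positive (FP)
# candidates = X_test[model.predict(X_test) == 1]
```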

3. Experiments and Results Analysis

3.1. Experimental Settings and Some Metrics

A standard practice in the experiments is to partition the dataset into a training set and a testing set. Here, 70% of the known DTPs, selected at random (361 proteins), and the reliable NDTPs picked in the first stage (441 proteins) were merged into the training set. The rest of the dataset, comprising 156 known DTPs as positives and 4935 uncertain NDTPs as negatives, was used to evaluate the two-stage models; that is, the 4935 uncertain NDTPs were the pool for the final screening of potential DTPs. To eliminate the randomness of the partition, we averaged the results over 10 independent turns during the result analysis. Algorithm 1 uses the reliable intervals to detect reliable NDTPs; Algorithm 2 forms the reliable-NDTP dataset in a semisupervised style; Algorithm 3 shows the construction of a meta decision tree; and Algorithm 4 illustrates the bagging method. (A sketch of the data partition follows the algorithm listings below.)

Input: The positive dataset Pos, the unlabeled dataset U, the threshold T to measure the extent of violation
(1)  Initialize the reliable negative dataset RN = NULL
(2)  For each property p:
(3)    Compute the reliable interval [down_p, up_p] of Pos corresponding to p
(4)  End for; obtain the series of reliable intervals
(5)  For each sample u in U:
(6)    count = 0
(7)    For each property p of u:
(8)      If u_p lies outside the corresponding reliable interval [down_p, up_p]:
(9)        count = count + 1
(10)   If count >= T:
(11)     RN = RN ∪ {u}
Output: The set of reliable negative samples RN
Input: The unlabeled dataset U, the positive dataset P, the number of selections L
Initialize the reliable negative set RN = NULL
Run EM on the mixture model using U and P to derive the mixture probability distributions
             p(x) = π1 p(x | θ1) + π2 p(x | θ2)
For each sample u in U:
  Compute the probability of the sample being assigned to the negative class
       p(NDTP | u) = π2 p(u | θ2) / (π1 p(u | θ1) + π2 p(u | θ2))
Rank the above probabilities in decreasing order
Select the top L samples and append them to RN
Output: The reliable negative samples RN
Input: The training dataset D and the property set P
Process: Function Tree_Generator(D, P)
 Generate a node;
 If all of the samples in D belong to the same class C then
   Assign the node as a leaf node of class C; Return
 End if
 If P = ∅ or all samples in D take the same values on P then
   Assign the node as a leaf node of the class C to which most samples in D belong; Return
 End if
 Choose the best partition property from P as p*;
 For each value v of p*:
   Generate a branch for the node; let D_v be the subset of D whose samples hold the value v on p*;
   If D_v = ∅ then
     Assign the branch node as a leaf node of the class C to which most samples in D belong; Return
   Else:
     Set Tree_Generator(D_v, P \ {p*}) as the node of the branch
   End if
 End for
Output: A decision tree rooted at the generated node
Input: The training dataset D, the meta learning model L (decision tree), the number of meta models K
For k = 1, ..., K do:
  Bootstrap on D to obtain D_k
  Train a meta decision tree h_k = L(D_k)
End for
Ensemble the meta models as
             H(x) = argmax_{y} Σ_{k=1}^{K} I(h_k(x) = y)
  where I(·) is the indicator function
Output: the ensemble model H
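For reference, the data partition described above can be sketched as follows; the index arrays and the rn_idx variable (the reliable NDTPs from the first stage) are hypothetical names, not part of the original pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical index arrays for the 517 known DTPs and 5376 unlabeled proteins
dtp_idx = np.arange(517)
unl_idx = np.arange(5376)

# 70% of the known DTPs (361 proteins) go into the training set ...
train_dtp = rng.choice(dtp_idx, size=361, replace=False)
test_dtp = np.setdiff1d(dtp_idx, train_dtp)        # remaining 156 DTPs

# ... together with the 441 reliable NDTPs picked in the first stage
reliable_ndtp = rn_idx                             # assumed output of Strategy 1 or 2
test_unl = np.setdiff1d(unl_idx, reliable_ndtp)    # 4935 uncertain NDTPs for screening
```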

In our research, we accomplished the screening task by learning directly in a supervised style, so the metrics of binary classification can be employed for performance evaluation. The confusion matrix in (9) summarizes the results in an intuitive way:
$$\begin{pmatrix} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \end{pmatrix} \tag{9}$$
where the rows correspond to the actual DTPs and uncertain NDTPs and the columns to the predicted DTPs and NDTPs.

FN stands for the number of DTPs mistakenly identified as nontargets; the other entries can be understood similarly (TP: correctly identified DTPs; FP: uncertain NDTPs predicted as DTPs; TN: uncertain NDTPs predicted as NDTPs).

Of greatest importance in our task is the recall ratio of DTPs, defined as
$$\mathrm{Recall}_{\mathrm{DTP}} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$

To keep the ratio of incorrectly recognized NDTPs low, the recall ratio of the NDTPs is also important:
$$\mathrm{Recall}_{\mathrm{NDTP}} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}.$$

Meanwhile, the precision of the NDTPs should be monitored as well:
$$\mathrm{Precision}_{\mathrm{NDTP}} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FN}}.$$

Besides, the accuracy is estimated as
$$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}.$$

In some sense, due to the dominance of the uncertain NDTPs, the accuracy is not as informative as the former two metrics. The uncertainty of the testing set also leaves some tolerance for the precision of the negative class; more specifically, a relatively (but not extremely) high recall ratio of NDTPs supports the final decision on DTP screening. During bagging, the decisive parameter is the number of meta decision trees, denoted n_estimators in scikit-learn [25]. To explore optimal parameters for the bagging of decision trees, we varied n_estimators from 5 to 2000 with a step of 5; the criterion for choosing the optimal n_estimators was the recall ratio of DTPs.
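As an illustration of the metric computation and the n_estimators search described above, the following sketch is provided; X_train, y_train, X_test, and y_test are hypothetical arrays from the partition, and the exhaustive sweep is shown only schematically since it is computationally heavy.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

def screening_metrics(y_true, y_pred):
    """Recall of DTPs, recall of NDTPs, precision of NDTPs, and accuracy
    (positive class = DTP = 1, negative class = NDTP = 0)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "recall_DTP": tp / (tp + fn),
        "recall_NDTP": tn / (tn + fp),
        "precision_NDTP": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# sweep n_estimators from 5 to 2000 with step 5, keeping the value that
# maximizes the recall ratio of DTPs on the testing set
best_n, best_recall = None, -1.0
for n in range(5, 2001, 5):
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n, random_state=0)
    clf.fit(X_train, y_train)
    recall_dtp = screening_metrics(y_test, clf.predict(X_test))["recall_DTP"]
    if recall_dtp > best_recall:
        best_n, best_recall = n, recall_dtp
```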

The prediction process is our main concern. Since the prior information indicates a small ratio of DTPs in the unlabeled set, the predicted FP samples in the testing set can be taken as the main source of candidate DTPs. The mechanism behind this prediction pipeline is that the known DTPs and the potential DTPs share the same statistical distribution, so the FP set should contain most of the candidate DTPs as long as the recall ratio of DTPs remains high.

3.2. Analysis of Results
3.2.1. Case Analysis in One Turn

In one turn of the experiments, we derived the confusion matrices (14a) and (14b), which follow the same layout as (9), with the DTPs as the positive class and the NDTPs as the negative class. Equation (14a) is the result using Strategy 1 and (14b) the result using Strategy 2; in this turn, Strategy 1 predicted about 558 unlabeled proteins as potential DTPs while Strategy 2 predicted about 1782.

It is significant that both strategy-based bagging of decision trees models achieved high recall ratios of DTPs, reaching 93.5% and 97.4%, respectively. It is also worth noting that, with the Strategy 1-based bagging method, the recall ratio of the uncertain NDTPs reached about 88.7%, which conforms to the prior information that actual NDTPs dominate the dataset. However, the confusion matrix of Strategy 2 shows a relatively lower recall ratio of NDTPs, approximately 63.9%, indicating that the Strategy 2-based method provides a broad but rough scope for the final recommendation.

During the prediction process, we directly took the FP samples of the confusion matrix as the potential DTPs. The consistency of the two strategies was verified in this turn; Figure 7 is a Venn diagram of the proportions of DTPs predicted by the two strategies. About 442 proteins were predicted as potential DTPs by both, accounting for most of the potential DTPs predicted by the Strategy 1-based method.

Detailed information about the potential drug target proteins commonly predicted by the two strategy-based bagging of decision trees models has been uploaded to http://pan.baidu.com/s/1c1SB2EG.

3.2.2. Sensitivity Analysis to Data Partition

As a comparison to our strategies, a random sampling method for the negative set construction was also performed, which is a prevailing practice [26]: 441 proteins were randomly picked to form the set of most likely NDTPs in the training dataset, and bagging of decision trees was again used for classification.

Table 2 illustrates the results of the above experiments over 10 turns, including the fit on the training dataset. By averaging the metrics and computing the variances, an evident but valuable conclusion is that bagging of decision trees using our strategies works steadily with low variance. In contrast to RS-bagging, S1-bagging and S2-bagging achieved higher recall ratios of DTPs, recall ratios of NDTPs, and precisions of NDTPs, suggesting that S1-bagging can finely detect potential drug target proteins while S2-bagging offers a broader range for further screening. Another notable fact in Table 2 is that RS-bagging overfits the training dataset, which severely damages the testing results and leads to low recall ratios of DTPs. This confirms that randomly sampling NDTPs for the training dataset does not yield reliable negatives, even though the actual DTPs occupy only a small proportion of the unlabeled set.

Besides, the high performance of the random sampling technique on the training dataset introduces an inevitable bias in the prediction process; this did not happen with S1-bagging and S2-bagging, whose performance on the training dataset was also superior. In Table 3, we carried out Student's paired t-test to check the significance of the results: for each metric, the 10 independent results were compared in pairs among S1-bagging, S2-bagging, and RS-bagging. As shown in the table, both strategies were significantly superior to RS-bagging in the three metrics, namely, recall ratio of DTPs, recall ratio of NDTPs, and precision of NDTPs. Interestingly, for the precision of NDTPs, S2-bagging was not significantly better than S1-bagging. Overall, the results of Student's paired t-test verify the effectiveness of the two proposed methods in the sense of statistical significance.
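The significance check can be reproduced with a paired t-test over the 10 per-turn metric values. A minimal sketch follows; the two arrays hold hypothetical example values, not the paper's reported results.

```python
from scipy.stats import ttest_rel

# ten per-turn recall ratios of DTPs for two methods (hypothetical example values)
s1_recall = [0.94, 0.93, 0.95, 0.92, 0.94, 0.93, 0.94, 0.95, 0.93, 0.94]
rs_recall = [0.71, 0.68, 0.74, 0.70, 0.69, 0.72, 0.67, 0.73, 0.70, 0.71]

stat, p_value = ttest_rel(s1_recall, rs_recall)
print(f"paired t statistic = {stat:.2f}, p-value = {p_value:.3g}")
```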

Figure 8 provides the 10 independent experimental results for the recall ratios and precisions on the testing dataset. The three radar plots further support the stability of the algorithms using our strategies: both S1-bagging and S2-bagging are robust across the various testing datasets.

4. Conclusions

In conclusion, we designed two strategies, combined with bagging of decision trees as the classifier, to accomplish the screening task. With 517 known DTPs available, we wish to screen potential DTPs from a well-collected unlabeled set of 5376 proteins. The main challenge is to generate a proper training set for data mining techniques when only one label exists in the collected dataset and the class distribution is highly imbalanced [27]. In the initial stage, two strategies motivated by the ideology of anomaly detection screen reliable NDTPs for the construction of the negative training set. Bagging of decision trees is then carried out for the final screening task.

The outstanding performance over 10 independent turns demonstrates the effectiveness and robustness of our algorithms. Finally, 552 and 1782 proteins derived by running the two models in one turn were suggested as potential DTPs; in particular, 441 proteins were predicted as common potential drug targets by both strategy-based methods for further verification. Though the suggested candidates vary slightly due to random sampling, the stability of the algorithms ensures the reliability of the results.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.