Abstract

Previous methods for predicting protein-protein interactions (PPIs) focused mainly on PPIs within a single species, but PPIs across different species have recently emerged as an important issue in areas such as viral infection. The primary focus of this study is to predict PPIs between a virus and its targeted host, which are involved in viral infection. We developed a general method that predicts interactions between virus and host proteins using the repeat patterns and composition of amino acids. In independent testing with PPIs of new viruses and hosts, the method showed a high performance comparable to the best performance of other methods for single virus-host PPIs. In a comparison with other methods on the same datasets, our method outperformed the others. The repeat patterns and composition of amino acids are simple, yet powerful features for predicting virus-host PPIs. The method developed in this study will help in finding new virus-host PPIs for which little information is available.

1. Introduction

Viral infection involves a large number of protein-protein interactions (PPIs) between a virus and its targeted host. These interactions range from the initial binding of viral coat proteins to host membrane receptors to the hijacking of the host transcription machinery by virus proteins. Various viral diseases are caused by infection with pathogenic viruses. For instance, Ebola virus disease is a highly contagious and fatal disease caused by infection with Ebola virus. During the 2014 Ebola epidemic, the world witnessed over 28,000 cases and over 11,000 deaths [1]. So far, there is no specific vaccine or effective treatment for Ebola virus disease [2]. Despite the increased number of known virus-host PPIs, the mechanism of viral infection is not fully understood. Thus, identifying interactions between virus proteins and host proteins helps to understand the mechanism of viral infection and to develop treatments and vaccines.

So far, many computational methods have been developed to predict PPIs. However, most of these methods predict PPIs within a single species and cannot be used to predict PPIs between different species because they do not distinguish interactions between proteins of the same species from those of different species. Recently, a few computational methods have been developed to predict virus-host PPIs using machine learning methods. For instance, a homology-based method [3] predicts PPIs between H. sapiens and M. tuberculosis H37Rv. Support vector machine (SVM) models developed by Cui et al. [4] and Kim et al. [5] predicted PPIs between human and two types of viruses (hepatitis C virus and human papillomavirus). However, these methods are intended for PPIs between a virus of a single type and a host of a single type. Recent computational methods developed for predicting virus-host PPIs [6–8] are also limited to PPIs between human and the human immunodeficiency virus 1 (HIV-1) and cannot predict PPIs of new viruses or new hosts for which the methods have no known PPIs. An exception is DeNovo, a recent SVM model that can predict PPIs of new viruses with a shared host [9].

In this paper, we present a new method for predicting virus-host PPIs, which is applicable to new viruses or hosts using amino acid repeat patterns and composition. Proteins in a variety of species contain significant amino acid repeats, with more abundance of repeats in eukaryotic proteins than in prokaryotic proteins [10, 11]. It has been found that proteins with a large number of amino acid repeats have a greater number of interacting partners compared to those without [12]. Experimental results of our method show that the repeat patterns and local composition of amino acids are simple, yet powerful features for predicting virus-host PPIs. The rest of this paper discusses the details of the method and its experimental results.

2. Materials and Methods

2.1. Features and Representation

Proteins are of different lengths and have different amino acid compositions. Many features of proteins have been used to predict PPIs from protein sequences. In this study, we represent a virus-host PPI by three features (F1, F2, and F3):

F1: sum of squared length of single amino acid repeats (SARs) in the entire protein sequence
F2: maximum of the sum of squared length of SARs in a window of 6 residues
F3: composition of amino acids in 5 partitions of the protein sequence

F1, which is the sum of squared length of SARs in the protein sequence, is defined by (1). Since SAR of length 1 is also included in F1, the F1 score reflects global composition of amino acids as well as amino acid repeats. Figure 1 shows an example of how we compute F1.
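As a minimal sketch of the F1 computation described above (not the authors' code), the following Python function scans a sequence for maximal runs of identical residues and accumulates the squared run length for each amino acid. The function name `f1_scores` and the ordering of the 20 standard amino acids are assumptions made for illustration.

```python
from itertools import groupby

# The 20 standard amino acids; this ordering is an assumption for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def f1_scores(sequence):
    """Return, per amino acid, the sum of squared lengths of its SARs."""
    scores = {aa: 0 for aa in AMINO_ACIDS}
    # groupby yields maximal runs of identical residues (runs of length 1 included).
    for residue, run in groupby(sequence):
        if residue in scores:
            scores[residue] += len(list(run)) ** 2
    return scores

example = f1_scores("SWWWWRSSSRRRRRRSSSWW")
print(example["S"], example["W"], example["R"])  # 19 20 37
```

In this sequence, serine occurs in runs of length 1, 3, and 3, so its score is 1 + 9 + 9 = 19. Amino acids that do not occur score 0, which gives 20 values per protein.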

Feature F2 is defined by (2). It appears to be similar to F1, but there are two differences: (1) for F2, the sum of squared length of SARs is computed for every window of size 6 instead of the whole protein sequence, and (2) the maximum of the sum of squared length of SARs over all windows is selected for F2. For example, the protein sequence SWWWWRSSSRRRRRRSSSWW has 15 possible windows of size 6, as shown in Figure 2. For each amino acid, we compute its F2 score by selecting the maximum of the sum of squared length of its SARs in a window of size 6.
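The windowed computation can be sketched in Python as follows. This is an illustrative reading of the description above rather than the authors' implementation: it assumes that a repeat extending beyond a window is counted only by the part inside the window, and that a sequence shorter than the window is treated as a single window. The name `f2_scores` is hypothetical.

```python
from itertools import groupby

# The 20 standard amino acids; this ordering is an assumption for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def f2_scores(sequence, window=6):
    """Return, per amino acid, the maximum over all windows of the
    sum of squared SAR lengths inside the window."""
    scores = {aa: 0 for aa in AMINO_ACIDS}
    # Assumption: a sequence shorter than the window forms one window.
    n_windows = max(len(sequence) - window + 1, 1)
    for start in range(n_windows):
        local = {}
        # Runs are measured inside the window only, so long repeats are truncated.
        for residue, run in groupby(sequence[start:start + window]):
            local[residue] = local.get(residue, 0) + len(list(run)) ** 2
        for residue, value in local.items():
            if residue in scores and value > scores[residue]:
                scores[residue] = value
    return scores

example = f2_scores("SWWWWRSSSRRRRRRSSSWW")
print(example["S"], example["W"], example["R"])  # 9 16 36
```

For the example sequence, the window RRRRRR gives arginine its maximum of 36, while no window contains more than one serine run, so the serine score is 9.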

The reason that we use a window of size 6 for F2 is because a window larger than 6 residues generates a same score for different repeat patterns. For example, with a window of size 7, we may obtain a same value of F2 even for different patterns of single amino acid repeats, whereas with a window of size 6, we obtain all different values of F2 for different patterns of single amino acid repeats (Figure 3).
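The window-size argument can be checked by enumeration. The sketch below assumes that a "repeat pattern" means the multiset of run lengths of one amino acid inside a window, with consecutive runs separated by at least one other residue; this interpretation and the function names are assumptions, not taken from the paper. It lists every pattern that fits in a window and reports the scores shared by more than one pattern.

```python
def partitions(n, max_part=None):
    """Yield the partitions of n as non-increasing tuples of run lengths."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def run_patterns(window):
    """List run-length patterns of one amino acid that fit in the window."""
    patterns = []
    for total in range(1, window + 1):
        for pattern in partitions(total):
            # k runs need at least k - 1 separating residues between them.
            if total + len(pattern) - 1 <= window:
                patterns.append(pattern)
    return patterns

def collisions(window):
    """Map each score to its patterns, keeping only scores shared by several."""
    by_score = {}
    for pattern in run_patterns(window):
        score = sum(length ** 2 for length in pattern)
        by_score.setdefault(score, []).append(pattern)
    return {s: ps for s, ps in by_score.items() if len(ps) > 1}

print(collisions(6))          # {}
print(sorted(collisions(7)))  # [4, 9]
```

Under this interpretation, a window of 7 residues gives the same score of 9 to a single repeat of length 3 and to repeats of lengths 2, 2, and 1, whereas all 14 patterns that fit in a window of 6 residues receive distinct scores.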

While feature F1 represents the repeat patterns and global composition of amino acids in the whole protein sequence, feature F3 represents the local composition of amino acids. For feature F3, we partition a protein sequence into 5 segments of equal length except the last one and compute the composition of amino acids in each of the 5 segments. Since the three features, F1, F2, and F3, are computed for each amino acid, every pair of virus and host proteins is represented in a feature vector with 280 elements (140 for a virus protein and 140 for a host protein).
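A minimal sketch of the F3 computation follows. Two details are assumptions, since the text does not fix them: the first four segments each have length floor(L/5) with the last segment absorbing the remainder, and "composition" is taken to be the fraction of each amino acid within a segment. The name `f3_composition` is hypothetical.

```python
# The 20 standard amino acids; this ordering is an assumption for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def f3_composition(sequence, n_segments=5):
    """Return n_segments * 20 values: amino acid fractions of each segment."""
    # Assumption: equal segments of length len // n_segments,
    # with the last segment taking any remaining residues.
    base = len(sequence) // n_segments
    features = []
    for i in range(n_segments):
        start = i * base
        end = (i + 1) * base if i < n_segments - 1 else len(sequence)
        segment = sequence[start:end]
        for aa in AMINO_ACIDS:
            # Fraction of this amino acid in the segment (0.0 if the segment is empty).
            features.append(segment.count(aa) / len(segment) if segment else 0.0)
    return features

vector = f3_composition("SWWWWRSSSRRRRRRSSSWW")
print(len(vector))  # 100
```

Together with 20 values for F1 and 20 values for F2, the 100 values of F3 give 140 elements per protein, and concatenating the virus and host vectors gives the 280 elements mentioned above.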

2.2. Datasets of Virus-Host PPIs

We constructed several datasets of virus-host PPIs to examine the applicability of the prediction method to new viruses and hosts. The first training dataset consists of PPIs of human with positive-sense single-stranded RNA (+ssRNA) viruses except hepatitis C virus (HCV) and severe acute respiratory syndrome (SARS) virus. The SVM model trained with the training dataset was tested on PPIs of five new viruses: HCV, SARS virus, influenza A (H1N1) virus, human papillomavirus (HPV-16), and human immunodeficiency virus 1 (HIV-1). Both HCV and SARS virus are +ssRNA viruses, but the remaining three viruses are of different types. H1N1 virus is a negative-sense single-stranded RNA (−ssRNA) virus, HPV-16 is a double-stranded DNA (dsDNA) virus, and HIV-1 is a retrovirus.

The second training dataset is composed of PPIs between human and +ssRNA viruses, including HCV and SARS virus. The SVM model trained on the second training dataset was tested on PPIs of new hosts: Mus musculus, Bos taurus, Rattus norvegicus, Sus scrofa, and Escherichia coli K-12.

Data of virus-host PPIs were collected from IntAct [13] and VirusMentha [14]. PPIs of HCV with human, however, were obtained from the Hepatitis C Virus Protein Interaction Database (HCVpro) [15] because HCVpro has more human-HCV PPIs than IntAct. The sequences of the proteins involved in the virus-host PPIs were obtained from the UniProt database [16].

The training and test datasets constructed in our study can be summarized as follows.

1. Training (TR) and Test (TS) Datasets for Assessing the Applicability of the Prediction Model to New Viruses

TR1: 638 PPIs between 499 human proteins and 25 +ssRNA virus proteins
TS1: 515 PPIs between 423 human proteins and 11 HCV proteins
TS2: 30 PPIs between 27 human proteins and 12 SARS virus proteins
TS3: 377 PPIs between 307 human proteins and 10 H1N1 virus proteins
TS4: 319 PPIs between 298 human proteins and 11 HPV-16 proteins
TS5: 1,578 PPIs between 1,056 human proteins and 46 HIV-1 proteins

2. Training (TR) and Test (TS) Datasets for Assessing the Applicability of the Prediction Model to New Hosts

TR2: 689 PPIs between 522 human proteins and 35 +ssRNA virus proteins
TS6: 191 PPIs between 141 Mus musculus proteins and 116 virus proteins
TS7: 125 PPIs between 87 Bos taurus proteins and 34 virus proteins
TS8: 86 PPIs between 79 Rattus norvegicus proteins and 24 virus proteins
TS9: 57 PPIs between 38 Sus scrofa proteins and 10 virus proteins
TS10: 78 PPIs between 64 Escherichia coli K-12 proteins and 27 virus proteins

Machine learning-based approaches to PPI prediction require both positive and negative PPI data, but negative data are not available in databases. Constructing a negative dataset of PPIs is not straightforward because there is no experimentally verified noninteracting pair [17]. Eid et al. [9], for example, used negative sampling for their negative dataset. In our study, we constructed a negative dataset with human proteins whose sequence similarity is lower than 40% to any human protein in the positive dataset by running CD-HIT [18]. Our negative dataset includes 2,819 interactions between 90 virus proteins and 2,819 human proteins. The training and test datasets constructed in this study are available in Additional files 1 and 2.

2.3. Prediction Models of Virus-Host PPIs

We built several support vector machine (SVM) models using LIBSVM [19] to evaluate our approach. The radial basis function (RBF) was used as a kernel of the SVM models, and the best values of parameters C and γ were obtained by running the grid search of LIBSVM on training datasets. Unless specified otherwise, the results shown in this paper were obtained with C = 2 and γ = 0.5. The SVM models take a pair of virus and host protein sequences as input. As output, the SVM models classify whether or not the virus protein interacts with the host protein. The SVM models and supporting data are available at http://www.bclab.inha.ac.kr/VHPPI.
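For reference, the RBF kernel as parameterized in LIBSVM is K(x, y) = exp(−γ‖x − y‖²). The short Python sketch below evaluates it with the γ = 0.5 reported above. The function name `rbf_kernel` and the toy vectors are illustrative only; in the study, x and y would be 280-element feature vectors of virus-host protein pairs.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Toy vectors for illustration, not real feature vectors.
print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(rbf_kernel([1.0, 0.0], [0.0, 0.0]))  # exp(-0.5)
```

A larger γ makes the kernel value fall off more quickly with distance, so each training pair influences a narrower region of the feature space, while C controls the penalty for misclassified training pairs.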

3. Results and Discussion

3.1. Performance Measures

The performance of the SVM models was evaluated by several measures: sensitivity (Sn), specificity (Sp), accuracy (Acc), positive predictive value (PPV), negative predictive value (NPV), and Matthews correlation coefficient (MCC), which are defined by the following equations:

In (3)–(8), true positives (TP) are host proteins that are correctly predicted as interacting with a virus protein. True negatives (TN) are noninteracting host proteins that are correctly predicted as noninteracting with a virus protein. False positives (FP) are noninteracting host proteins that are incorrectly predicted as interacting with a virus protein. False negatives (FN) are interacting host proteins that are incorrectly predicted as noninteracting with a virus protein.
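The six measures follow their standard definitions in terms of TP, TN, FP, and FN, which can be written as a short Python sketch. The function name `performance_measures` and the example counts are hypothetical, and the sketch assumes that every class and every prediction outcome occurs at least once, so that no denominator of Sn, Sp, PPV, or NPV is zero.

```python
import math

def performance_measures(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc, PPV, NPV, and MCC from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn),                    # sensitivity
        "Sp": tn / (tn + fp),                    # specificity
        "Acc": (tp + tn) / (tp + tn + fp + fn),  # accuracy
        "PPV": tp / (tp + fp),                   # positive predictive value
        "NPV": tn / (tn + fn),                   # negative predictive value
        # MCC is reported as 0.0 when its denominator is zero (a common convention).
        "MCC": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }

# Hypothetical counts for illustration only.
measures = performance_measures(tp=90, tn=80, fp=20, fn=10)
print(measures["Sn"], measures["Sp"], measures["Acc"])  # 0.9 0.8 0.85
```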

3.2. Results of Cross Validation

We performed 10-fold cross validation of the SVM model with several datasets which contain different ratios (1 : 1, 1 : 2, and 1 : 3) of positive to negative PPIs between +ssRNA viruses and human. As shown in Table 1, the best performance of the SVM model was observed in the balanced dataset with 1 : 1 ratio of positive to negative data. As expected, running the SVM model on unbalanced datasets resulted in lower performances than running it on the balanced dataset with 1 : 1 ratio of positive to negative data. Datasets are available in Additional file 3.

We also examined the contribution of the features to the prediction performance of the SVM model. Table 2 shows the results of using different combinations of features in 10-fold cross validation of the SVM model with the 1 : 1 dataset of Table 1. Among the single features, F3, which is the local composition of amino acids, was the best in all performance measures. With F3 alone, the SVM model achieved an accuracy above 92% and an MCC above 0.86, indicating that F3 is a very powerful feature in predicting virus-host PPIs. The best performance of the SVM model was observed when F1 and F3 were used. We also tried double amino acid repeats (DARs) instead of single amino acid repeats (SARs) for F1 and F2. For F2 with DARs, we used a window of 10 residues rather than 6 residues, because 10 residues is the largest window size that yields a different value for every pattern of double amino acid repeats in it, just as 6 residues is the largest such window size for single amino acid repeats.

For features F1 and F2, we tried both single amino acid repeats (SARs) and double amino acid repeats (DARs) along with different partitions of a protein sequence. As shown in Table 3, SAR resulted in a better performance than DAR.

For feature F3, we tried several different partitions of a protein sequence in several datasets. Table 4 shows the performance of our SVM model in three different datasets of virus-host PPIs. All the results shown in Table 4 were obtained by using SAR for features F1 and F2, but with different partitions for feature F3. On average, partitioning a protein sequence into 5 segments showed the best performance in all performance measures except sensitivity. In addition to the performance gain, partitioning a protein sequence into 5 segments is more advantageous than 7 or 9 segments with respect to the size of a feature vector that represents the sequence. When we partition a protein sequence into 5 segments, every pair of virus and host proteins is encoded in a feature vector with 280 elements (20 elements for F1, 20 elements for F2, and 20 × 5 = 100 elements for F3 for each of the virus and host proteins). If we partition a protein sequence into 7 or 9 segments, a feature vector will require 360 elements (20 elements for F1, 20 elements for F2, and 20 × 7 = 140 elements for F3 for each of the virus and host proteins) or 440 elements (20 elements for F1, 20 elements for F2, and 20 × 9 = 180 elements for F3 for each of the virus and host proteins). However, the larger feature vectors did not result in performance improvement in predicting virus-host PPIs.
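The vector sizes quoted above follow from simple arithmetic, sketched here in Python with a hypothetical helper name `pair_vector_size`:

```python
def pair_vector_size(n_segments, n_amino_acids=20):
    """Number of elements encoding one virus-host protein pair."""
    # Per protein: F1 (20) + F2 (20) + F3 (20 per segment).
    per_protein = n_amino_acids + n_amino_acids + n_amino_acids * n_segments
    # A pair concatenates the virus vector and the host vector.
    return 2 * per_protein

for k in (5, 7, 9):
    print(k, pair_vector_size(k))
# 5 280
# 7 360
# 9 440
```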

3.3. Results of Independent Testing on PPIs of New Viruses

As discussed earlier, we trained the SVM model with the training dataset TR1 consisting of PPIs of human with +ssRNA viruses except hepatitis C virus (HCV) and SARS virus and tested it on PPIs of new viruses which were not used in training the model. The test datasets include PPIs of five viruses (HCV, SARS virus, H1N1 virus, HPV-16, and HIV-1) with human. H1N1 virus is a negative-sense single-stranded RNA (−ssRNA) virus, and HPV-16 is a double-stranded DNA (dsDNA) virus. HIV-1 is a retrovirus, which is a +ssRNA virus with a DNA intermediate.

In addition to species difference, we also analyzed the sequence similarity between the training datasets and test datasets using the EMBOSS Needle tool [20] to assess the independence of the test data from the training data. As shown in Table 5, target virus proteins in the test datasets showed a very low average sequence similarity, ranging from 3.12% to 5.20%, to the virus proteins in the training dataset (see Additional file 4 for the similarity of every sequence pair between the training and test datasets).

Table 6 shows the results of testing the prediction model on 5 independent datasets of PPIs of new viruses. Despite such a low sequence similarity and species difference, the SVM model showed a high performance in independent testing. In particular, the SVM model showed a higher sensitivity (94.37% and 96.67%) for HCV and SARS virus, which are +ssRNA viruses. It is interesting to note that HPV-16, which is a dsDNA virus, showed the highest specificity of 94.04% and accuracy of 87.93%. Figure 4 shows the ROC curves of independent testing of the SVM model on PPIs of five new viruses.

3.4. Results of Independent Testing on PPIs of New Hosts

In order to examine the applicability of the SVM model to new hosts, we tested it on PPIs of viruses with new hosts, which were not used in training the model. As described earlier, the model trained with PPIs of human with +ssRNA viruses was tested on PPIs of five new hosts (Mus musculus, Bos taurus, Rattus norvegicus, Sus scrofa, and Escherichia coli K-12) with the viruses. As shown earlier in Table 5, the average sequence similarity of the human proteins in the training dataset to the new hosts is low, ranging between 8.04% and 9.76%. Despite the low sequence similarity and species difference, testing the model on PPIs of new hosts showed a relatively good performance (Table 7). Figure 5 shows the ROC curves of independent testing of the SVM model on PPIs of five new hosts.

It is interesting to note that proteins of new hosts have a higher average sequence similarity to those in training datasets than proteins of new viruses, but the SVM model showed a lower performance for new hosts. This can be explained by the number of partner proteins of the target proteins shared by training and test datasets. As shown in Table 8, the number of common proteins between the test datasets for new viruses (TS1-TS5) and their training dataset TR1 is larger than the number of common proteins between the test datasets for new hosts (TS6-TS10) and their training dataset TR2. These results corroborate the known problem with pair-input methods, which was first reported by Park and Marcotte [21]: prediction methods that operate on pairs of objects such as PPIs perform much better for test pairs that share components with a training set than for those that do not. Thus, our prediction model showed a better performance in testing for new viruses, which share more partner proteins (i.e., host proteins) with training datasets, than in testing for new hosts, which share fewer partner proteins (i.e., virus proteins) with training datasets.
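Counting the proteins that a training dataset and a test dataset have in common is a set intersection. The sketch below assumes each dataset is a list of (virus protein, host protein) identifier pairs. The function name and the identifiers in the example are hypothetical and are not taken from the datasets of this study.

```python
def shared_proteins(train_pairs, test_pairs):
    """Count virus and host proteins that occur in both datasets."""
    train_viruses = {virus for virus, _ in train_pairs}
    train_hosts = {host for _, host in train_pairs}
    test_viruses = {virus for virus, _ in test_pairs}
    test_hosts = {host for _, host in test_pairs}
    return {
        "virus": len(train_viruses & test_viruses),
        "host": len(train_hosts & test_hosts),
    }

# Hypothetical identifiers for illustration only.
train = [("V1", "H1"), ("V1", "H2"), ("V2", "H3")]
test = [("V9", "H1"), ("V9", "H3"), ("V8", "H7")]
print(shared_proteins(train, test))  # {'virus': 0, 'host': 2}
```

In this toy case the test viruses are all new, but two of the three test host proteins also appear in training, which is the situation of the new-virus test datasets described above.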

3.5. Comparison to Other Methods

For comparison, we ran our SVM model on the datasets of two other methods for virus-host PPIs: Barman’s method [22] and DeNovo [9]. In Barman’s study [22], three machine learning methods (SVM, Naive Bayes, and Random Forest) were used to predict virus-host PPIs using several features such as domain-domain association in interacting protein pairs and composition of methionine, serine, and valine in virus proteins. In a 5-fold cross validation with virus-host PPIs from VirusMINT [23], their Random Forest (RF) and SVM models showed a better performance than Naive Bayes. Thus, we tested our SVM model on the same dataset used in Barman’s study, which contains 1,035 positive and 1,035 negative interactions between 160 virus proteins of 65 types and 667 human proteins. As shown in Table 9, our SVM model outperformed Barman’s SVM model in all performance measures and outperformed Barman’s RF model in all performance measures except specificity and PPV. The dataset used for comparison of our SVM model with Barman’s SVM and Random Forest models is available in Additional file 5.

For comparison with DeNovo’s SVM model, we tested our SVM model on DeNovo’s SLiM testing set, which contains 425 positive and 425 negative PPIs (Supplementary file S12 used in DeNovo’s study ST6). As shown in Table 10, our SVM model was better than DeNovo in all performance measures (sensitivity of 86%, specificity of 87%, and accuracy of 86%). The dataset used for comparison of our SVM model with DeNovo is available in Additional file 6.

4. Conclusions

Amino acid repeats are prevalent in a variety of proteins but are rarely used in predicting PPIs. We developed a new method that predicts potential interactions between virus and host proteins using global and local compositions of amino acids as well as amino acid repeat patterns.

We tested the prediction model on independent datasets of virus-host PPIs, which were not used in training the model and have a very low sequence similarity to any protein in the training datasets of the model. Despite the low sequence similarity between proteins in training datasets and target proteins in test datasets, the prediction model showed a high performance comparable to the best performance of other methods for single virus-host PPIs. In a comparison with other methods on the same datasets, our method outperformed the others. Experimental results demonstrate that the repeat patterns and composition of amino acids are simple, yet powerful features for predicting virus-host PPIs. The method can be used to find potential PPIs of new viruses or hosts, for which little information is known.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by Inha University Research Grant.

Supplementary Materials

The supporting data and program are available at http://www.bclab.inha.ac.kr/VHPPI. Additional file 1: training and test datasets for examining the applicability of the SVM model to new viruses. Additional file 2: training and test datasets for examining the applicability of the SVM model to new hosts. Additional file 3: training datasets with different ratios of positive to negative instances (1 : 1, 1 : 2, and 1 : 3). Additional file 4: the sequence similarity of each protein pair (x, y) for x in the training dataset and y in the test dataset. Additional file 5: training and test datasets of virus-host PPIs from Barman’s study [22]. Additional file 6: training and test datasets of SLiMs from DeNovo [9].