Integrated Analysis of Multiscale Large-Scale Biological Data for Investigating Human DiseaseView this Special Issue
Improving the Understanding of Pathogenesis of Human Papillomavirus 16 via Mapping Protein-Protein Interaction Network
The human papillomavirus 16 (HPV16) has high risk to lead various cancers and afflictions, especially, the cervical cancer. Therefore, investigating the pathogenesis of HPV16 is very important for public health. Protein-protein interaction (PPI) network between HPV16 and human was used as a measure to improve our understanding of its pathogenesis. By adopting sequence and topological features, a support vector machine (SVM) model was built to predict new interactions between HPV16 and human proteins. All interactions were comprehensively investigated and analyzed. The analysis indicated that HPV16 enlarged its scope of influence by interacting with human proteins as much as possible. These interactions alter a broad array of cell cycle progression. Furthermore, not only was HPV16 highly prone to interact with hub proteins and bottleneck proteins, but also it could effectively affect a breadth of signaling pathways. In addition, we found that the HPV16 evolved into high carcinogenicity on the condition that its own reproduction had been ensured. Meanwhile, this work will contribute to providing potential new targets for antiviral therapeutics and help experimental research in the future.
Human papillomavirus (HPV) has been tantamount to cervical cancer which ranked as the third most common cancer and the fourth most common cause of cancer death, but its actual footprint is much bigger [1, 2]. Persistent infection with mucosal HPV types, especially with HPV16, can also lead to the form of penile, vulvar, vaginal, anal, and oropharyngeal cancer, recurrent respiratory papillomatosis, and certain head afflictions [3, 4]. Furthermore, some data show that the actual number of cases of anal and oropharyngeal cancers is increasing and may have already exceeded (or will soon exceed) that of cervical cancer. HPVs were divided into five different genera: Alpha, Beta, Gamma, Mu, and Nu [5, 6]. HPVs were also classified as cutaneous or mucosal according to their tropism. There are both cutaneous and mucosal HPV for Alphapapillomavirus. Other genera are cutaneous. In addition, 12 mucosal HPVs (HPV16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, and 59) were classified as high-risk (HR) HPV types by the International Agency for Research on Cancer (IARC) in 2009 [7, 8]. More than 96.6% of cervical cancer is caused by HR HPVs, while about 54.4% is caused by HPV16. In all HPV-positive noncervical cancers, HPV16 is also the most common HPV type detected. The HPV16 encodes eight proteins: E1, E2, E4, E5, E6, E7, L1, and L2 [9, 10]. These proteins are classified as adaptive proteins which have high carcinogenicity (E5, E6, and E7) and core set (E1, E2, L2, and L1). The E4 protein is embedded within the E2 protein .
HPV16 appears to be extraordinary: how can such a small amount of proteins do so much ? Protein-protein interaction (PPI) network is a feasible strategy to improve our understanding of its pathogenesis. Several human-pathogen interaction networks have been reported, such as Plasmodium falciparum, Yersinia pestis, hepatitis C virus (HCV), and Epstein-Barr virus (EBV) [13–16]. Dyer et al. integrated and compared publicly available human-pathogen PPIs from 190 different pathogens to provide a global view of pathogenesis strategies . Unfortunately, it is very limited that PPI pairs between HPV16 and human are obtained by experiment. Therefore, computational methods to predict PPIs have an important role . The SVM with 217-dimensional vector was employed to predict the interactions of HPV16 and HPV18 proteins with human proteins by Cui et al. at the same time . But it is easy to lead overfitting for small sample. In this paper, a new method was employed to represent protein sequence. A support vector machine (SVM) model with sequence and topological features was built to predict new interactions between HPV16 and human proteins. Subsequently, all interactions were filtered and further analyzed by some strategies.
2.1. Data Sources
We collected human PPIs from large-scale high-throughput screens [20–22] and several interaction databases [23–26], which contained 193,801 interactions among 13,306 proteins. The Pathway Interaction Database (PID) is a growing collection of human signaling and regulatory pathways curated from peer-reviewed literature . As a source of reliable information we extracted about 224 different pathways from the PID. Then the interactions between HPV16 and human proteins were extracted from IntAct , APID , and VirHostNet . After removing redundancy, a total of 174 interactions were identified and used as positive training set (see Table S1 in the supplementary material available online at http://dx.doi.org/10.1155/2015/890381).
We collected 254 new nonredundant interaction pairs from the literature (see Table S2 in the supplementary material). Finally, the 254 interaction pairs were used as positive test set. It should be noted that whether it was positive training set or positive test set, the interactions were centered on E6 protein and E7 protein because of experimental biases.
2.2. Choosing of Negative Set
As a 2-class classification, both positive set and negative set are needed . Positive set is interacting pairs and negative set is noninteracting pairs. Unfortunately, the noninteracting pairs were not readily available. In the absence of negative set, the following strategy was adopted to choose negative set. This strategy was based on such an assumption that proteins locating different subcellular localizations do not interact . First, the all human proteins of human PPI network were grouped into eight subsets based on the eight main types of localization—cytoplasm, nucleus, mitochondrion, endoplasmic reticulum, golgi apparatus, peroxisome, cytoplasm&nucleus and secreted. Then we totaled subsets of human proteins which were targeted by a kind of HPV16 protein denoted as . Therefore, other proteins that did not appear in those subsets were made as candidates who did not interact with . Finally, the same amounts of proteins with targeted human proteins of were randomly picked as negative set of . For example, eight human proteins targeted by E5 protein occupied cytoplasm subset and nucleus subset in positive training set; thus, other human proteins which did not appear in those two subsets were made as candidates and eight proteins of candidates were randomly picked as negative training set of E5 protein.
2.3. Feature Extraction
The sequence compositions of the protein pair and the topological features of corresponding human protein were employed to represent protein interaction between HPV16 and human.
In accordance with Shen et al. , a protein sequence was represented by three consecutive amino acids. On account of limited sample, however, another class of amino acids was used to reduce the dimension of the vector space of feature vectors. Based on the chemical nature of the side chain of the amino acid, twenty amino acids were classified into five categories: , , , , and . The third category and the fourth category were incorporated into one category, and four categories were considered in total. So there are 4 × 4 × 4 = 64 possible amino acid combinations. The frequency of a combination in a protein was defined as , where was the occurrences of combination in protein . An interaction between a HPV protein and a human protein was represented by their frequency difference, . The parameter was normalized bywhere is the frequency difference of the th combination. The numerical value of ranges from −1 to 1.
Besides the standardized frequency difference, degree and betweenness of the human proteins were also used as features. Ultimately, a 66-dimensional vector was built to represent each protein pair. Each interaction was labeled +1 and noninteraction was labeled −1.
The classification model for predicting PPIs was based on support vector machine (SVM) using LIBSVM  with the radial basis function (RBF).
There are three differences between our representation and that of Cui et al. . First, twenty amino acids were classified into six classes by Cui et al.: , , , , , and . So there are 6 × 6 × 6 = 216 possible amino acid combinations. Second, standardization was done by
Third, a feature element was used to represent the types of virus proteins and was included in a feature vector.
2.4. Tissue Specificity Filtering
To ensure utmost biological relevance, tissue specificity filtering was adopted. It has been known that HPVs infect epithelial cells in oral mucosa or skin . In addition, HPVs also lead to recurrent respiratory papillomatosis, head afflictions, and cancers of the cervix uteri, vulva, anus, and oropharynx (including base of the tongue and tonsils) and interact with basal cell and the immune system [3, 35]. We extracted proteins in those cells, tissues, and systems from HPRD . Finally, interactions were filtered by selecting interaction pairs which only contain those proteins.
2.5. Enrichment and Pathway Participation Coefficient
Proteins were grouped according to their degree in integrated human PPI network. Each group where each protein has at least interactions was represented by . In each group the number of human proteins that were targeted by HPV16, , was calculated. As a null hypothesis, we randomly sampled protein set from the integrated human PPI network and then calculated corresponding number of targeted proteins, . Finally the enrichment of targeted proteins was defined as . In addition to degree, the same calculation was performed for betweenness. It was noted that points to an enrichment and vice versa.
For each protein that was involved in pathways and the integrated human PPI network, the corresponding pathway participation coefficient (PPC) in the total set of pathways was defined as , where was the set of interaction partners of in the pathway . If a protein predominantly interacted with partners that were members of the same pathway, PPC tended toward 1. Otherwise PPC tended to 0.
2.6. GO Term Enrichment
The Gene Ontology (GO) is a hierarchically organized, controlled vocabulary to consistently describe and annotate gene products . GO term enrichment was performed using the DAVID Functional Annotation Chart tool [38, 39]. GO terms are controlled vocabularies that form a directed acyclic graph (DAG), whereby individual terms are represented as nodes connected to more specific nodes by directed edges, such that each term is a more specific child of one or more parents. Therefore, to avoid very general and uninformative GO terms, only GO level 5 terms were considered. The values were corrected for multiple testing using the Bonferroni procedure and transformed by taking the for easier visualization [40, 41].
3. Results and Discussion
We extracted 174 interactions between HPV16 and human proteins and integrated a human PPI network including 193,801 interactions. A flowchart of the whole experiment is shown in Figure 1.
3.1. Choosing of Negative Training Set and Evaluating of Model
The 174 interactions between virus and human proteins were used as positive training set. The selection of negative training set was fundamental to the reliability of the prediction model . Based on a rational assumption, the negative training set was chosen (see Methods section). The SVM with 5-fold cross-validation was employed to optimize the parameters and check the reliability of randomly selected negative training set. Repeating such random trials 1,000 times and calculating average accuracy (%), we chose a result approaching average accuracy to build model and plot ROC curve (Figure 2) which allowed for a true positive rate TPR = 74.71%, a false positive rate FPR = 8.62%, and area under the curve AUC = 0.8627. Other results were dotted clouds. It was demonstrated that the method of choosing negative training set was significantly reliable and robust.
To evaluate expansibility of the model, a positive test set was collected. Negative test set was selected by the same method with choosing negative training set. Repeating trials 1,000 times, this model, on average, achieved an accuracy AC = %, TPR = 78.7%, and FPR = %. For comparison, we tested the method of Cui et al. on whole modeling and evaluating. Our method outperformed the method of Cui et al., which, on average, achieved AC = %, TPR = 63.4%, and FPR = %.
3.2. Inferring and Filtering of Candidate Interactions
To find candidate interactions, we ran BLAST with the known targeted human proteins as query sequences against the human proteins in integrated human PPI network. Specifically, we considered a pair of proteins with homology if their -value was <10−6. A candidate interaction was detected between a HPV16 protein and homologous protein of targeted human protein. The final set contained 3022 candidate interactions between 8 virus and 1,950 human proteins.
The model built by SVM was applied to evaluate candidate interactions. The 1,121 interactions between 8 virus and 701 human proteins were finally obtained. The 701 human proteins were refined further by selecting human proteins that have the same GO cellular component terms with homologous human proteins from the positive training set. 1,015 interactions were obtained by this refinement. To ensure utmost biological relevance for the 1,015 interactions, tissue specificity filtering was adopted (see Methods section). Filtering interactions provided 644 interactions between 8 HPV16 proteins and 405 human proteins. For simplicity of reference, the filtering result was named as predicted set. Meanwhile, positive set including training set and test set was also filtered by tissue specificity. Finally, all filtering results were combined, providing a total of 877 interactions between 8 virus and 603 human proteins. This set was called as all set.
3.3. Distribution of Targeted Human Proteins Based on Host-Virus Interaction
Now we paid more attention to the all set. The frequency of human proteins that interacted with the same number of viral proteins was calculated. We observed that most human proteins (69.52%) merely interacted with a virus protein in Figure 3(a). The positive training set and the predicted set were addressed by the same calculation method, and their results illustrated similar trend with all set. It suggested that HPV16 interacted with human proteins as much as possible to enlarge its scope of influence by its limited proteins. In order to provide all necessary cellular proteins required for viral replication, the virus has to keep its host cell in cycle . At the molecular level, virus proteins interact with many key cell cycle regulatory proteins, including cyclin-dependent kinase (CDK), cyclin-dependent kinase inhibitors, and cyclin proteins (Figure 3(b)). Among them, CDK2 and CDK7 are the most prominent. The two proteins simultaneously interact with five virus proteins in all set and the five virus proteins are L2, E4, E5, E6, and E7. Combination of CDK2 and some cyclins regulates G1/S transition. CDK7 is both a CDK-activating kinase (CAK), which is able to phosphorylate and activate CDK1, CDK2, CDK4, and CDK6 within the activation segment (T-loop) [43–46], and an essential component of the transcription factor TFIIH, which phosphorylates the C-terminal domain (CTD) at Ser 5 of the largest subunit of Pol II [47–49]. These interactions, together with other proteins that bind to HPV16, alter a broad array of cell cycle progression; for example, they block cellular proliferation by causing cell cycle arrest in S-phase [12, 50, 51]. The myosin light chain kinase (MLCK) is also targeted by five virus proteins. It has been proven that MLCK plays a role in the regulation of epithelial cell survival  and modulates hypotonicity-induced Ca2+ entry and Cl− channel activity in human cervical cancer cells . In addition, HPV16 may be similar to arrest defective-1 that controls tumor cell behavior by MLCK .
3.4. Statistical Implications of Targeted Host Proteins Based on Human PPI Network
We calculated the enrichment of targeted human proteins as a function of the degree of human proteins (see Methods section). With an average over 1,000 randomizations, we observed that whether it was all set, predicted set, or positive training set, HPV16 preferred to interact with hub proteins (proteins interacting with a large number of partners) in the integrated human PPI network (Figure 4(a)). Subsequently, we calculated the enrichment of targeted proteins as a function of the betweenness and consistent trend has shown that bottleneck proteins (proteins that are central to many paths in the network) were more affected by virus (Figure 4(b)). Testing the significance that HPV16 tended to interact with hub and bottleneck proteins, we used Fisher’s exact test, allowing us to find a statistically significant tendency that HPV16 is indeed highly prone to interact with hub proteins and bottleneck proteins (Figure 4(c)).
We speculated that virus interacted with human proteins as much as possible while tending to influence many signaling pathways to mediate the infection. PPC was adopted to measure this tendency (see Methods section). Focusing on the positive training set, we observed that most human proteins occurred in a variety of pathways through its interaction partners in integrated human PPI network (Figure 4(d)). The predicted set and the all set showed more enforced maxima around low values of PPC. As a comparison, we randomly selected a subset of equal size with human proteins in all set from integrated human PPI network and repeated 1,000 times to calculate average value of PPC. Ignoring the last bar, we found that the random set obeyed the normal distribution, but the all set was linear relationship. Such results strongly indicated that the HPV16 effectively affected a breadth of signaling pathways [13, 55, 56].
3.5. Functional Analysis of Targeted Host Proteins
GO term enrichment was employed to perform the comprehensive functional analysis for human proteins of the all set. The main advantage of this approach is that we can make use of term-term relationships, in which joint terms may contain unique biological meaning for a given study .
For all targeted human proteins, significant enrichment was observed in the processes of phosphorylation, metabolism, signaling, cell death and apoptosis, gene expression, and positive or negative regulation terms (Figure 5(a)). This observation was also reflected on the functions which include kinase activity, receptor activity, promoter, DNA binding, and so on (Figure 5(b)). MAPK is a particularly important component in protein kinase phosphorylation cascade. It can enter the nucleus and phosphorylate serine/threonine residues of substrate proteins which contain transcription factors of regulating the cell cycle and cell differentiation. Notably, viral proteins strongly interacted with members of the MAPK family (MAPK1, 3, 6, 7, 8, 9, 11, and 14). Besides MAPK family, partial members of MAP2K and MAP3K family were also targeted (Figure 3(b)). HPV16 controls phosphorylation cascade so that cell behaviors including cell proliferation and differentiation, cell survival, and apoptosis are broken.
Five proteins (E1, E2 (and E4), L1, and L2) are encoded by all known PVs. There is a hypothesis that the ancestral papillomavirus did not contain adaptive proteins and only need the core set to meet the basic requirements of a viral infection . In the process of evolution, HPV16 produced all of the adaptive proteins. It was surprising that the top four of biological process enrichment of all adaptive proteins were the same as core set’s top four, and then processes involving apoptosis and death were enriched for core set (Figures 6(a) and 6(c)). This showed that HPV16 would evolve carcinogenicity, but only on the condition that its own reproduction had been ensured. The E4 protein has the functions of adaptive proteins and core set (Figure 6) but prefers the latter. In other words, as a part of the proteins encoded by all known PVs E4 must first guarantee viral reproduction and then together with adaptive proteins enhance the carcinogenicity of HPV16.
Significant challenges currently impair experiments to get a more complete map of interactions between HPV16 and human proteins, facilitating computational methods to detect potential interactions. Sequence features are popular because of its simplicity and availability. SVM has been shown to perform well in multiple areas including detecting remote protein homologies, evaluating microarray expression data, and checking new interactions [33, 58, 59]. On the basis of facts above we predicted new interactions between HPV16 and human proteins. The predicted set and other known interactions were integrated and filtered, providing a total of 877 interactions between 8 virus and 603 human proteins. According to the interactions between the virus and human proteins, we plotted the distribution of targeted host proteins. The distribution showed that the virus enlarged its scope of influence by interacting with host proteins as much as possible. HPV16 alters a broad array of cell cycle progression by a number of PPIs. Utilizing integrated human PPI network the enrichment of targeted host proteins as a function of their degree or betweenness was calculated. Results suggested that HPV16 was highly prone to interact with hub proteins and bottleneck proteins, perhaps because these proteins control critical processes in the human cell . PPC was used as a measure of diversity. In the light of their distributions, targeted human proteins effectively mediated the diversity of influenced signaling pathways which helps virus mediate the infection. GO term enrichment was utilized to perform the comprehensive functional analysis. We found that cell behaviors of host cell were broken; the HPV16 produced many other functions by evolution, but it was based on the premise that its own reproduction has been guaranteed.
The integration and analysis of virus-host interactions boosts our knowledge about the function of HPV16 proteins and relations between virus and human proteins. These results improve our understanding of HPV16 pathogenesis and provide potential new targets for interfering with either HPV16 or human at key points in the infection. Our results may point to important areas of research to guide further experimental studies.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work is supported by the National Natural Science Foundation of China (21375095) and the National Natural Science Foundation of China (21305096).
The Supplementary Material provides three tables which describe positive training set, positive test set and all set, respectively. The Table S1 lists the positive training set which contains 174 interactions. The Table S2 lists the positive test set based on literature curation with indication of Pubmed ID. Column 1, virus protein; column 2, targeted human protein/gene description; column3, Pubmed ID supported this interaction. The Table S3 lists the all set which consists of filtering results of predicted set, positive test set and training set.
Humans IWGotEoCRt, “Biological agents. Volume 100 B. A review of human carcinogens,” IARC monographs on the evaluation of carcinogenic risks to humans/World Health Organization, International Agency for Research on Cancer. Part B, vol. 100, pp. 1–441, 2012.View at: Google Scholar
M. Habig, H. Smola, V. S. Dole, R. Derynck, H. Pfister, and S. Smola-Hess, “E7 proteins from high- and low-risk human papillomaviruses bind to TGF-β-regulated Smad proteins and inhibit their transcriptional activity,” Archives of Virology, vol. 151, no. 10, pp. 1961–1972, 2006.View at: Publisher Site | Google Scholar
M.-R. Shen, P. Furla, C.-Y. Chou, and C. J. Ellory, “Myosin light chain kinase modulates hypotonicity-induced Ca2+ entry and Cl- channel activity in human cervical cancer cells,” Pflugers Archiv European Journal of Physiology, vol. 444, no. 1-2, pp. 276–285, 2002.View at: Publisher Site | Google Scholar
C. Pérez-Plasencia, G. Vázquez-Ortiz, R. López-Romero, P. Piña-Sanchez, J. Moreno, and M. Salcedo, “Genome wide expression analysis in HPV16 cervical cancer: identification of altered metabolic pathways,” Infectious Agents and Cancer, vol. 2, no. 1, article 16, 2007.View at: Publisher Site | Google Scholar
T. Jaakkola, M. Diekhans, and D. Haussler, “Using the Fisher kernel method to detect remote protein homologies,” Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB '99), pp. 149–158, 1999.View at: Google Scholar