Machine Learning and Network Methods for Biology and Medicine 2020
Xinnan Xu, Rui Kong, Xiaoqing Liu, Pingan He, Qi Dai, "Prediction of High-Risk Types of Human Papillomaviruses Using Reduced Amino Acid Modes", Computational and Mathematical Methods in Medicine, vol. 2020, Article ID 5325304, 10 pages, 2020. https://doi.org/10.1155/2020/5325304
Prediction of High-Risk Types of Human Papillomaviruses Using Reduced Amino Acid Modes
Abstract
Human papillomavirus (HPV) typing plays an important role in the early diagnosis of cervical cancer. Most prediction methods use protein sequence and structure information, but reduced amino acid modes have not been used until now. In this paper, we introduce the modes of reduced amino acids to predict high-risk HPV types. We first reduced the 20 amino acids into several nonoverlapping groups and calculated their structural and physicochemical modes for high-risk HPV prediction; the method was tested and compared with existing methods on 68 samples of known HPV types. The experimental results indicate that the proposed method achieves better performance, with an accuracy of 96.49%, suggesting that reduced amino acid modes may be used to improve the prediction of high-risk HPV types.
1. Introduction
Cervical cancer has high morbidity and mortality among women worldwide [1]. About 500,000 new cases of cervical cancer and 280,000 deaths occur each year [2], making it the second most common cancer in women [3, 4]. Studies have indicated that human papillomavirus (HPV) infection is closely related to the occurrence and development of cervical cancer, and certain types of HPV cause abnormal tissue growth in the form of papilloma [5–7].
Human papillomavirus belongs to the papillomavirus family. It is an icosahedral, non-enveloped particle containing double-stranded DNA of approximately 8,000 base pairs [8, 9]. The particle, which encloses the circular DNA, is about 55 nm in diameter [10–13]. To date, more than 150 types of HPV have been identified, and a new type is defined when it shows significant homologous differences from the defined HPV types [14–16]. Epidemiological studies have shown a strong correlation between genital HPV and cervical cancer. Genital HPV can be divided into three types according to its relative malignancy: low-risk, intermediate-risk, and high-risk. Clinical association studies usually use only two types: high-risk and low-risk. Low-risk types are associated with low-grade lesions, while high-risk types are more closely related to high-grade cervical lesions and cancer [17]. High-risk types include HPV16, HPV18, HPV26, HPV31, HPV33, HPV35, HPV39, HPV45, HPV51–53, HPV56, HPV58, HPV59, HPV66, HPV68, HPV70, HPV73, HPV82, and HPV85 [18]. HPV16 and HPV18 account for 62.6% and 15.7% of cervical cancers, respectively [19]. Therefore, the identification of high-risk HPV has become an important part of the diagnosis and treatment of cervical cancer.
To date, many epidemiological and experimental methods can identify HPV types [5, 20–22]; they mainly use polymerase chain reaction (PCR) technology and can be applied to the rapid detection of clinical samples. With the rapid growth of HPV data and increasing sensitivity requirements, a reliable and effective computational method is needed to predict the high-risk types of HPV directly.
In recent years, several computational models have been proposed to predict high-risk HPV types. Eom et al. studied sequence fragments and introduced genetic algorithms to predict HPV types [23]. Joung et al. used support vector machines to predict HPV types based on the hidden Markov model [24, 25]. Park et al. proposed using decision trees to predict human papillomavirus types [26]. Kim and Zhang calculated the distance of amino acid pairs and then predicted the risk types of HPV based on E6 proteins [7, 9]. Kim et al. proposed a set of support vector machines (GSVM) for the classification of HPV types using the differential molecular sequence of protein secondary structure [13]. Esmaeili et al. used ROC to classify HPV types based on Chou’s pseudo amino acid composition [27]. Alemi et al. compared the physicochemical properties of the high- and low-risk HPV types and used support vector machines to predict the high-risk HPV types [28].
These methods have performed well in the prediction of high-risk HPV types, but the challenge of extracting HPV information remains. The information widely used in the prediction of high-risk types of HPV is based on sequence information, whereas information derived from the characteristics of the 20 amino acids and their reduced groups has not been explored so far. In this paper, we propose a novel method to predict high-risk types of HPVs based on reduced amino acid modes. We classified the 20 amino acids into several groups and extracted their structural and chemical properties. These extracted features were used to predict the high-risk types of HPVs with a support vector machine. Through experiments and comparative analysis, we evaluate the efficiency of the proposed method, as well as the efficiency of the various reduced amino acid modes.
2. Materials and Methods
2.1. Datasets
Eight open reading frames encode the early and late genes of the HPVs [11]. The early and late genes have polyA signal 1 and polyA signal 2. The products of the late genes are the L1 and L2 proteins, which affect the viral capsid structure [12], while the early genes are translated into the E1–E7 proteins. We constructed seven protein databases of the HPVs, whose sequences were downloaded from the Los Alamos National Laboratory (LANL). Each protein dataset covers 72 HPV types. If the sequence of a protein was missing for a certain HPV type, we downloaded it from the National Center for Biotechnology Information (NCBI). Since one E4 sequence could not be found at NCBI, the E4 dataset contains 71 sequences. According to an HPV compendium, seventeen HPV types are classified as high-risk types (HPV16, HPV18, HPV31, HPV33, HPV35, HPV39, HPV45, HPV51, HPV52, HPV56, HPV58, HPV59, HPV61, HPV66, HPV67, HPV68, and HPV72), and the remaining types are low-risk [13].
2.2. Reduced Amino Acids (RedAAs)
The 20 amino acids differ subtly, but some of them have similar basic structures and functions. AAindex is a database of physical and biochemical indices of amino acids established by Tomii and Kanehisa [29]. It mainly includes three parts: AAindex 1, AAindex 2, and AAindex 3. AAindex 1 describes the physicochemical and biological properties of amino acids, AAindex 2 contains amino acid mutation matrices, and AAindex 3 contains statistical protein contact potentials. These data are taken from published articles. We mainly used AAindex 1 and calculated the correlation coefficient as the distance between two indices. AAindex 1 currently contains 544 indices, of which this article selected 522. These 522 indices are further divided into six categories: (A) alpha and turn propensities, (B) beta propensity, (C) composition, (H) hydrophobicity, (P) physicochemical properties, and (O) other properties [29].
Here, we introduced BLOSUM62 to classify amino acids to simplify sequence analysis [30]. We denote the th group as and denote its th amino acid as . Using BLOSUM62, we calculated the similarity score between and the th amino acid as follows: where denotes the substitution value between and . Then, we summed up all scores of different groups as the score between and : where is the th group size of , is the th group size of , is the total number of occurrences in , and is the group size. measures the degree of retention of parent sequence information. Given a size group, we analyzed all amino acid groups and calculated the similarity score between the parent sequence and the reduced sequence. The reduced alphabets were selected according to their scores. For example, 20 AAs are reduced into 9 RedAAs ({C}, {G}, {P}, {IMLV}, {AST}, {NH}, {YFW}, {DEQ}, and {RK}) in the BLOSUM62 matrix.
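As a small illustration of the reduction step, the sketch below rewrites a protein sequence in the nine-group alphabet quoted above. It is not the authors' implementation: the choice of the first letter of each group as its representative symbol, the `X` placeholder for non-standard residues, and the names `GROUPS`, `AA_TO_RED`, and `reduce_sequence` are assumptions made for this example.

```python
# Nine-group reduced alphabet quoted in the text (derived there from BLOSUM62).
GROUPS = ["C", "G", "P", "IMLV", "AST", "NH", "YFW", "DEQ", "RK"]

# Assumption: each group is represented by its first letter.
AA_TO_RED = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues outside the 20 standard amino acids are mapped to 'X'
    (an assumption of this sketch, not something the paper specifies).
    """
    return "".join(AA_TO_RED.get(aa, "X") for aa in seq.upper())


if __name__ == "__main__":
    print(reduce_sequence("MKVLA"))
```

With this mapping, every downstream feature is computed on a string over at most nine symbols instead of twenty.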
2.3. Reduced Amino Acid Modes (RedAA Modes)
20 amino acids were divided into the following nonoverlapping groups according to their physicochemical properties in AAindex, and four types of the reduced amino acid modes were calculated as protein structural and physicochemical features.
2.3.1. Content Modes
The first mode is associated with the contentspecific features, including the distribution of the RedAA and RedAA pattern in protein sequences.
(1) k-mer. Protein sequences and peptides can be seen as strings of symbols, and their characteristics can be analyzed through the frequencies of their short fragments. $k$-mers are $k$ consecutive characters in reduced proteins, and a sliding window of length $k$ can be used to calculate their frequencies [31–33], moving from position 1 to $L - k + 1$ one residue at a time, where $L$ is the sequence length. Overlaps between $k$-mers are allowed, and the frequency is calculated as
$$f(w_i) = \frac{n(w_i)}{\sum_{w \in W} n(w)},$$
where $n(w_i)$ is the occurrence number of the $k$-mer $w_i$ and $W$ is the $k$-mer set of the RedAAs.
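The sliding-window count described above can be sketched as follows. The function name `kmer_frequencies` and the toy three-letter alphabet are assumptions for illustration; the sequence is assumed to be already written in the reduced alphabet.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k, alphabet):
    """Relative frequencies of all overlapping k-mers over `alphabet`.

    The window moves one residue at a time, so a sequence of length L
    yields L - k + 1 windows.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    freqs = {}
    for letters in product(alphabet, repeat=k):
        word = "".join(letters)
        freqs[word] = counts[word] / total if total else 0.0
    return freqs


if __name__ == "__main__":
    # Toy reduced sequence over a hypothetical three-symbol alphabet.
    for word, freq in kmer_frequencies("IRIIA", 2, "IRA").items():
        print(word, freq)
```

The feature vector has one entry per possible k-mer, so its length grows as the alphabet size raised to the power k; this is one reason a reduced alphabet keeps the feature space small.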
(2) RCTD. “Composition (C),” “Transition (T),” and “Distribution (D)” are three descriptors of RedAAs, which are defined as follows [34, 35]:
Composition: it can be regarded as a single monomer of the reduced sequence, and the sequence components are described by calculating the percentage of each RedAA.
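Transition: it describes the conversion between two RedAAs $r$ and $s$ by calculating the frequency of $r$ followed by $s$ or $s$ followed by $r$:
$$T_{rs} = \frac{n_{rs} + n_{sr}}{N - 1},$$
where $n_{rs}$ and $n_{sr}$ are the numbers of “$rs$” and “$sr$” pairs, respectively, in the reduced sequence with length $N$.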
Distribution: it describes the RedAA distribution in the reduced sequence, including the specified coding categories: 25%, 50%, 75%, and 100%.
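The three descriptors can be sketched in a few lines. This is a minimal reading of the definitions above, not the authors' code: the function names are hypothetical, the transition is taken as the share of adjacent pairs that are "rs" or "sr", and the distribution is taken as the relative position at which 25%, 50%, 75%, and 100% of the occurrences of a RedAA have been seen.

```python
import math
from itertools import combinations


def composition(seq, alphabet):
    """Fraction of the sequence occupied by each reduced symbol."""
    return {a: seq.count(a) / len(seq) for a in alphabet}


def transition(seq, alphabet):
    """Share of adjacent pairs that switch between two different symbols."""
    pairs = list(zip(seq, seq[1:]))
    result = {}
    for r, s in combinations(alphabet, 2):
        hits = sum(1 for p in pairs if p == (r, s) or p == (s, r))
        result[r + s] = hits / (len(seq) - 1)
    return result


def distribution(seq, alphabet, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Relative position where each fraction of a symbol's occurrences is reached."""
    result = {}
    for a in alphabet:
        positions = [i + 1 for i, c in enumerate(seq) if c == a]
        if not positions:
            # Assumption: absent symbols get an all-zero descriptor.
            result[a] = [0.0] * len(fractions)
            continue
        result[a] = [
            positions[max(1, math.ceil(f * len(positions))) - 1] / len(seq)
            for f in fractions
        ]
    return result


if __name__ == "__main__":
    seq, alphabet = "IRIIA", "IRA"
    print(composition(seq, alphabet))
    print(transition(seq, alphabet))
    print(distribution(seq, alphabet))
```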
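(3) PRseAAC. Type I PRseAAC and type II PRseAAC are widely used pseudo-reduced AA compositions (PRseAAC) [36–38].
Type I PRseAAC was proposed by Kuo-Chen Chou and is defined as follows:
$$X_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{n} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 1 \le u \le n, \\ \dfrac{w\,\theta_{u-n}}{\sum_{i=1}^{n} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & n < u \le n + \lambda, \end{cases}$$
where $f_u$ is the RedAA frequency and $w$ is the weighting factor. $\theta_j$ is calculated as
$$\theta_j = \frac{1}{L - j}\sum_{i=1}^{L-j}\left(H(R_i) - H(R_{i+j})\right)^2,$$
where $H$ is the RedAAs’ property, $R_i$ is the RedAA at position $i$, $n$ is the RedAA size, and $L$ is the sequence length.
Type II PRseAAC can be calculated as
$$X_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{n} f_i + w\sum_{j=1}^{\lambda}\tau_j}, & 1 \le u \le n, \\ \dfrac{w\,\tau_{u-n}}{\sum_{i=1}^{n} f_i + w\sum_{j=1}^{\lambda}\tau_j}, & n < u \le n + \lambda, \end{cases} \qquad \tau_j = \frac{1}{L - j}\sum_{i=1}^{L-j} H(R_i)\,H(R_{i+j}),$$
where $f_u$ is the RedAA frequency, $w$ is the weighting factor, $H$ is the RedAAs’ property, $n$ is the RedAA size, and $L$ is the sequence length.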
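A minimal sketch of a type I pseudo composition over a reduced alphabet is given below. It follows the commonly used form of Chou's pseudo amino acid composition with a single property and squared differences as the correlation function; whether the paper uses exactly this form is an assumption. The function name `pse_composition`, the defaults `lam=2` and `w=0.05`, and the property values are all hypothetical.

```python
def pse_composition(seq, prop, lam=2, w=0.05):
    """Type I pseudo composition for a sequence over a reduced alphabet.

    prop : dict mapping each reduced symbol to one numeric property value
    lam  : number of sequence-order correlation factors (must be < len(seq))
    w    : weighting factor for the correlation factors
    """
    alphabet = sorted(prop)
    length = len(seq)
    freq = [seq.count(a) / length for a in alphabet]
    theta = []
    for j in range(1, lam + 1):
        # Mean squared property difference between residues j positions apart.
        diffs = [(prop[seq[i]] - prop[seq[i + j]]) ** 2 for i in range(length - j)]
        theta.append(sum(diffs) / (length - j))
    denom = sum(freq) + w * sum(theta)
    return [f / denom for f in freq] + [w * t / denom for t in theta]


if __name__ == "__main__":
    toy_property = {"I": 1.0, "R": -1.0, "A": 0.5}  # hypothetical values
    print(pse_composition("IRIIA", toy_property))
```

The first n entries carry composition and the last `lam` entries carry sequence-order information; by construction the whole vector sums to 1.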
2.3.2. Correlation Mode
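The second RedAA mode is based on the characteristics of correlation, which describes the correlation among the RedAAs. In the proposed RedAA mode, three different autocorrelation features are implemented: normalized Moreau–Broto autocorrelation (NMB) [39], Moran autocorrelation (M) [40], and Geary autocorrelation (G) [41].
(1) NMB. The RedAA NMB is defined as
$$NMB(d) = \frac{1}{L - d}\sum_{i=1}^{L-d} P_i P_{i+d},$$
where $P_i$ denotes the RedAA property at position $i$ of the sequence, $d$ is the autocorrelation lag, and $L$ is the sequence length.
(2) M. The RedAA M can be calculated as
$$M(d) = \frac{\frac{1}{L - d}\sum_{i=1}^{L-d}\left(P_i - \bar{P}\right)\left(P_{i+d} - \bar{P}\right)}{\frac{1}{L}\sum_{i=1}^{L}\left(P_i - \bar{P}\right)^2},$$
where $P_i$ denotes the RedAA property at position $i$ of the sequence, $\bar{P}$ is the mean of the property over the sequence, $d$ is the autocorrelation lag, and $L$ is the sequence length.
(3) G. The RedAA G is defined as
$$G(d) = \frac{\frac{1}{2(L - d)}\sum_{i=1}^{L-d}\left(P_i - P_{i+d}\right)^2}{\frac{1}{L - 1}\sum_{i=1}^{L}\left(P_i - \bar{P}\right)^2},$$
where $P_i$ denotes the RedAA property at position $i$ of the sequence, $\bar{P}$ is the mean of the property over the sequence, $d$ is the autocorrelation lag, and $L$ is the sequence length.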
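The three autocorrelation features can be sketched using their standard textbook definitions, which are assumed here to match the ones used in the paper. The input `values` is the list of property values along the reduced sequence; the function names are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def moreau_broto(values, d):
    """Normalized Moreau-Broto autocorrelation at lag d."""
    n = len(values)
    return sum(values[i] * values[i + d] for i in range(n - d)) / (n - d)


def moran(values, d):
    """Moran autocorrelation at lag d (undefined for a constant profile)."""
    n = len(values)
    m = _mean(values)
    num = sum((values[i] - m) * (values[i + d] - m) for i in range(n - d)) / (n - d)
    den = sum((v - m) ** 2 for v in values) / n
    return num / den


def geary(values, d):
    """Geary autocorrelation at lag d (undefined for a constant profile)."""
    n = len(values)
    m = _mean(values)
    num = sum((values[i] - values[i + d]) ** 2 for i in range(n - d)) / (2 * (n - d))
    den = sum((v - m) ** 2 for v in values) / (n - 1)
    return num / den


if __name__ == "__main__":
    profile = [1.0, 2.0, 3.0, 4.0]  # toy property profile
    print(moreau_broto(profile, 1), moran(profile, 1), geary(profile, 1))
```

Moreau-Broto uses raw products, Moran centers the values on their mean, and Geary uses squared differences, so the three respond differently to the same property profile.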
2.3.3. Order Mode
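The order mode reflects the physical and chemical interaction among the RedAA pairs. There are two kinds of order modes: sequence coupling score and quasi-sequence score [42].
(1) Sequence Coupling Score. The sequence coupling score is calculated as
$$\tau_k = \sum_{i=1}^{L-k}\left(d_{i,i+k}\right)^2,$$
where $d_{i,i+k}$ is the Schneider-Wrede physicochemical distance or Grantham chemical distance between the RedAAs at positions $i$ and $i + k$, $L$ is the sequence length, and $k = 1, 2, \ldots, K$, with $K$ the maximum lag.
(2) Quasi-Sequence Score. For the RedAA $r$, the quasi-sequence score is defined as
$$X_r = \frac{f_r}{\sum_{r=1}^{n} f_r + w\sum_{k=1}^{K}\tau_k}, \quad r = 1, 2, \ldots, n,$$
where $f_r$ is the RedAA frequency and $w$ denotes the weighting factor.
For the remaining components, the quasi-sequence score can be calculated as
$$X_{n+k} = \frac{w\,\tau_k}{\sum_{r=1}^{n} f_r + w\sum_{k=1}^{K}\tau_k}, \quad k = 1, 2, \ldots, K,$$
where $\tau_k$ is the sequence coupling score, $f_r$ is the RedAA frequency, and $w$ denotes the weighting factor.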
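The two order-mode features can be sketched as below, following the usual quasi-sequence-order construction. The distance table here is a toy stand-in built from made-up values, not the Schneider-Wrede or Grantham matrix; the function names and the defaults `maxlag=2` and `w=0.1` are likewise assumptions.

```python
def coupling_scores(seq, dist, maxlag):
    """Sequence coupling score for each lag 1..maxlag.

    dist[a][b] is the distance between reduced symbols a and b.
    """
    return [
        sum(dist[seq[i]][seq[i + k]] ** 2 for i in range(len(seq) - k))
        for k in range(1, maxlag + 1)
    ]


def quasi_sequence_scores(seq, alphabet, dist, maxlag=2, w=0.1):
    """Frequencies and weighted coupling scores under a common normalization."""
    freq = [seq.count(a) / len(seq) for a in alphabet]
    tau = coupling_scores(seq, dist, maxlag)
    denom = sum(freq) + w * sum(tau)
    return [f / denom for f in freq] + [w * t / denom for t in tau]


if __name__ == "__main__":
    # Hypothetical one-dimensional property used only to build a toy distance table.
    toy = {"I": 0, "R": 1, "A": 2}
    toy_dist = {a: {b: abs(toy[a] - toy[b]) for b in toy} for a in toy}
    print(coupling_scores("IRIIA", toy_dist, 2))
    print(quasi_sequence_scores("IRIIA", "IRA", toy_dist))
```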
2.3.4. Position Mode
The position mode represents the distribution of RedAA positions in protein sequences based on the coefficient of variation [32, 43]. First, we converted the protein sequence into a digital sequence and calculated the probabilities $p(d)$ of the separation distance $d$ between two adjacent occurrences of a RedAA. The mean and variance are defined as
$$\mu = \sum_{d} d\,p(d), \qquad \sigma^2 = \sum_{d}\left(d - \mu\right)^2 p(d).$$
We then calculated the positional information $PI$:
$$PI = \frac{\mu}{\sigma},$$
where $PI$ is the reciprocal of the coefficient of variation (CV), which compares the degree of variation between two datasets even if their means differ greatly. In this paper, it is used as the RedAA position characteristic.
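The position feature can be sketched as follows, reading "separation distance" as the gap between consecutive occurrences of the same RedAA. Returning 0.0 when a symbol has fewer than two occurrences or when all gaps are equal is an assumption of this sketch, as is the function name.

```python
from collections import Counter


def position_feature(seq, symbol):
    """Reciprocal coefficient of variation of the gaps between occurrences of `symbol`."""
    positions = [i for i, c in enumerate(seq) if c == symbol]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return 0.0  # assumption: fewer than two occurrences carries no gap information
    probs = {d: count / len(gaps) for d, count in Counter(gaps).items()}
    mean = sum(d * p for d, p in probs.items())
    var = sum((d - mean) ** 2 * p for d, p in probs.items())
    if var == 0:
        return 0.0  # assumption: perfectly regular spacing is mapped to 0.0
    return mean / var ** 0.5


if __name__ == "__main__":
    print(position_feature("IRIIAI", "I"))
```

For the toy sequence the gaps between occurrences of `I` are 2, 1, and 2, giving a mean of 5/3 and a variance of 2/9.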
2.4. Prediction Algorithm
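$Y = \{+1, -1\}$ is the HPV label set, where $+1$ denotes the high-risk type and $-1$ denotes the low-risk type. We used $x_{ij}$ to represent the $j$th feature of the RedAA modes of the $i$th HPV sample, where $i = 1, \ldots, N$ and $j = 1, \ldots, m$. All of the features of the RedAA modes for all HPV samples are denoted as
$$X = \left(x_{ij}\right)_{N \times m},$$
whose $i$th row is the feature vector $\mathbf{x}_i$ with label $y_i \in Y$.
We used a support vector machine (SVM) to predict the HPV type, which is expressed as follows:
$$f(\mathbf{x}) = \operatorname{sign}\left(\mathbf{w}^{T}\varphi(\mathbf{x}) + b\right),$$
where $\mathbf{w}^{T}\varphi(\mathbf{x})$ is a linear combination of a set of nonlinear data transformations $\varphi$, obtained by solving
$$\min_{\mathbf{w}, b, \xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i\left(\mathbf{w}^{T}\varphi(\mathbf{x}_i) + b\right) \ge 1 - \xi_i,\ \ \xi_i \ge 0,$$
where $b$ denotes the bias term, $C$ denotes the regularization parameter, and $\xi_i$ is the training error. The above problem can be expressed in dual form:
$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\,\varphi(\mathbf{x}_i)^{T}\varphi(\mathbf{x}_j) \quad \text{subject to} \quad \sum_{i=1}^{N}\alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C.$$
Here, the Gaussian kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma\|\mathbf{x}_i - \mathbf{x}_j\|^2\right)$ is used instead of $\varphi(\mathbf{x}_i)$ and $\varphi(\mathbf{x}_j)$. The separation problem can be expressed as
$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\,K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{subject to} \quad \sum_{i=1}^{N}\alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C.$$
The trained model predicts the risk type of a test sample $\mathbf{x}$ according to the following formula:
$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N}\alpha_i y_i\,K(\mathbf{x}_i, \mathbf{x}) + b\right).$$
$f(\mathbf{x}) = +1$ indicates that the sample belongs to the high-risk type; otherwise, it belongs to the low-risk type. To obtain a better model, we used a simple grid search strategy based on 10-fold cross-validation to find the optimal model for each dataset.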
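The prediction step of a kernel SVM can be sketched without any library: given support vectors, their labels and multipliers, and a bias, the sign of the kernel expansion gives the class. Training (finding the multipliers and the bias, and tuning the penalty and kernel width by grid search) is left to an SVM package and is not shown. All names and numbers below are hypothetical.

```python
import math


def gaussian_kernel(x, z, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def predict_risk(x, support_vectors, labels, alphas, bias, gamma):
    """Return +1 (high-risk) or -1 (low-risk) for a feature vector x."""
    score = bias + sum(
        alpha * y * gaussian_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Toy model with one support vector per class (made-up values).
    svs = [[0.0, 0.0], [1.0, 1.0]]
    ys = [1, -1]
    alphas = [1.0, 1.0]
    print(predict_risk([0.1, 0.1], svs, ys, alphas, bias=0.0, gamma=1.0))
    print(predict_risk([0.9, 0.9], svs, ys, alphas, bias=0.0, gamma=1.0))
```

A larger `gamma` makes each support vector's influence more local, which is why the kernel width is usually tuned together with the penalty parameter.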
3. Results and Discussion
3.1. Evaluation Measures
There are three popular methods to evaluate the efficiency of prediction models: the subsampling test, the independent test, and the jackknife test. Since the jackknife test can evaluate the efficiency of various predictor variables, we used it to evaluate the efficiency of the proposed method and calculated the class accuracies and the overall accuracy:
$$Acc_{high} = \frac{TP}{TP + FN}, \qquad Acc_{low} = \frac{TN}{TN + FP}, \qquad Acc = \frac{TP + TN}{TP + FP + TN + FN},$$
where $TP$ denotes true positives, $FP$ denotes false positives, $TN$ denotes true negatives, and $FN$ denotes false negatives, with the high-risk type taken as the positive class.
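A jackknife (leave-one-out) evaluation and the resulting accuracies can be sketched as follows. The classifier is passed in as a callable, so any model can be plugged in; the majority-vote stub in the usage example is only a placeholder, and all names are assumptions. Labels are +1 for high-risk and -1 for low-risk.

```python
def jackknife_predictions(samples, labels, train_and_predict):
    """Leave each sample out in turn, train on the rest, and predict the held-out one."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(train_and_predict(train_x, train_y, samples[i]))
    return predictions


def accuracies(labels, predictions):
    """Per-class and overall accuracy from true and predicted labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == -1)
    tn = sum(1 for y, p in pairs if y == -1 and p == -1)
    fp = sum(1 for y, p in pairs if y == -1 and p == 1)
    return {
        "high": tp / (tp + fn) if tp + fn else 0.0,
        "low": tn / (tn + fp) if tn + fp else 0.0,
        "overall": (tp + tn) / len(pairs),
    }


if __name__ == "__main__":
    def majority_vote(train_x, train_y, x):
        # Placeholder classifier: predicts the majority training label.
        return 1 if sum(train_y) > 0 else -1

    xs = [[0.0], [0.1], [1.0], [1.1], [0.9]]
    ys = [1, 1, -1, -1, -1]
    preds = jackknife_predictions(xs, ys, majority_vote)
    print(preds, accuracies(ys, preds))
```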
3.2. HPV Classification
We used the jackknife test to evaluate the performance of the proposed RedAA modes. We divided the 20 amino acids into 5 to 19 groups and calculated their RedAA modes as protein features and then input them into the support vector machine to predict the HPV type. Table 1 shows the tagged HPV types and the predicted results.

Table 1 shows that the predictions of our method agree with the actual types for 65 HPV types. However, HPV72 is predicted to be low-risk but is actually high-risk, and HPV30 is predicted to be high-risk but is actually low-risk. For further comparison, we compared our results with those of Kim et al. [13]. Kim et al. predicted HPV56 to be potentially high-risk, whereas we predicted it to be high-risk; they predicted HPV53 and HPV73 to be potentially high-risk, whereas our method predicted them to be low-risk. Phylogenetic analysis showed that HPV30 is closely related to the established oncogenic type HPV56, suggesting that HPV30 is more likely to be a high-risk type. These results show that the proposed method is more consistent with the actual risk types.
We further compared our method with the following methods: the SVM based on the mismatch kernel [24], the SVM classifier based on the linear kernel [13], the SVM based on the gap-spectrum kernel (Gap) [7], the BLAST model [13], the integrated SVM (Ensemble) [13], and two text prediction methods based on AdaCost [26] and naive Bayes [26]. The accuracy of our method reaches 96.49%, while the accuracy of the integrated SVM is 94.12%, the accuracy of the SVM based on the mismatch kernel is 92.70%, the accuracy of the SVM based on the linear kernel is 90.28%, and the accuracy of BLAST is 91.18%. As for the text prediction methods, AdaCost [26] has an accuracy of 93.05%, while naive Bayes [26] has an accuracy of 81.94%. The comparison also shows that the RedAA modes are more effective in classifying the risk types of human papillomaviruses.
3.3. The Performance of the Early and Late Proteins in HPV Type Prediction
Early HPV proteins comprise E1, E2, E4, E5, E6, and E7, and late proteins comprise L1 and L2 [3, 5]. Information commonly used for high-risk and low-risk HPV prediction includes protein sequences, secondary structure, and pseudo-amino acid composition, and most methods use the E6, E7, or L1 protein [23–28]. In this paper, we used the protein datasets of early and late proteins in HPV type prediction and compared their performance. Figure 1 compares the accuracy of each category and the overall accuracy based on early and late proteins.
Figure 1 shows that the prediction accuracy of low-risk types is higher than that of high-risk types, except for the E5 protein. L1 protein outperforms other HPV proteins in the prediction of low-risk types. L2 protein performs best in high-risk type predictions. The above research shows that E6, E7, L1, and L2 proteins are closely related to high-risk HPV and play an important role in the occurrence and development of disease [14]. The function of L1 protein in low-risk and high-risk types is not exactly the same. L1 protein in the high-risk type exists in integrated form, and the self-assembly efficiency of the L1 gene product is low. L1 protein in the low-risk type exists in free form, with high self-assembly efficiency. In high-risk types, if L1 protein mutates, it cannot combine with L2 protein to form the capsid and therefore cannot assemble infectious HPV particles. When HPV enters the host cell, the viral DNA replicates in large quantities and can integrate with the host cell DNA, resulting in host cell infection, unlimited proliferation, and cell immortalization. The results show that L1 protein performs better in the prediction of high-risk HPV types, while L2 protein is more suitable for low-risk HPV types.
3.4. Influence of the Physicochemical Properties of Amino Acids
The proposed method reduces the 20 AAs into several nonoverlapping groups, a step that relies heavily on the physical and biochemical indices of amino acids. The 522 indices of AAindex are divided into six categories according to their physical and biochemical features [29]. The largest group is hydrophobicity and the second largest group is alpha and turn propensities, and the other four groups are relatively small. For each HPV protein, we used the 522 physicochemical properties to calculate six kinds of reduced AA modes. For each class of physicochemical properties, we calculated the mean of the overall accuracies of HPV type prediction. The comparison of different physicochemical property classes and the RedAA modes is shown in Figure 2.
Figure 2 shows that the proposed prediction has no obvious preference among the six classes of physicochemical properties for E1 proteins. For E2 proteins, the composition class performs best across the six reduced AA modes. For E4 proteins, the beta and composition classes perform better: for the reduced AA modes position and RCTD, the beta class gives better predictions, whereas composition is better for the other four modes. The results for E5, E6, E7, L1, and L2 proteins are similar to those of E2 proteins, and the six reduced AA modes show better performance with beta physicochemical properties. These results indicate that E5, E6, E7, L1, and L2 proteins favor beta physicochemical properties for reducing amino acids and calculating the six reduced AA modes in HPV type prediction.
3.5. Comparison of the Reduced Amino Acid Modes
To evaluate the performance of the different modes, we used the 522 physicochemical properties to calculate the RedAA modes of all the early and late proteins and calculated the average of the overall accuracies of HPV type prediction, which is shown in Figure 2. Figure 2 shows that the six RedAA modes have the same preference trend among the six classes of physicochemical properties. For E1, E2, E4, E5, and E7 proteins, PRseAAC is better than the other RedAA modes, and its average accuracy in HPV type prediction is also significantly higher than the average of the other RedAA modes. For E6, L1, and L2 proteins, RCTD outperforms the other five RedAA modes. In addition, PRseAAC and RCTD show better performance with beta physicochemical properties of the amino acids.
3.6. Influence of the Number of Reduced Amino Acids
The proposed method uses the structural and physicochemical features of reduced amino acids, which reduces the dimension of the input information and improves the efficiency of the prediction model. However, it should be noted that the RedAA modes depend on the number of reduced amino acids. To examine the influence of the RedAA size, we reduced the 20 amino acids into 5–19 classes based on the 522 physicochemical properties and calculated the RedAA modes PRseAAC and RCTD for all the early and late proteins. The average accuracies of the RedAA modes PRseAAC and RCTD with 5–19 RedAAs are summarized in Figure 3.
Figure 3 shows how the accuracy of HPV type prediction changes as the number of reduced amino acids increases. For E1 proteins, combining PRseAAC with the physicochemical properties of amino acids, the best-performing PRseAAC achieves 95.378% accuracy with 19 reduced amino acids. For E2 proteins, the prediction model achieves the best performance with PRseAAC and the physicochemical properties of the composition class when the amino acids are reduced to 14 classes. For E5 and E7, PRseAAC achieves 87.18% and 75.07% accuracies when the 20 amino acids are reduced to 7 and 12 classes, respectively. For E6, L1, and L2 proteins, the combination of RCTD and beta physicochemical properties achieves the best performance with 8, 15, and 11 reduced amino acids, respectively.
4. Conclusion
Genital papillomavirus, especially high-risk HPV, is closely related to cervical cancer. Therefore, the identification of the HPV risk type is of great significance for cervical cancer. We proposed a computational method for the prediction of high-risk HPV based on the RedAA modes. With the help of the physicochemical properties of the amino acids, we reduced the 20 amino acids into several nonoverlapping groups and calculated the structural and physicochemical characteristics of the reduced AAs (RedAAs) as the RedAA modes. We used this reduced sequence information to predict high-risk types of HPV. Experiments with 68 known HPV types show that the proposed method performs better than previous methods.
The first finding is that L1 protein performs better in the prediction of high-risk HPV types, while L2 protein is more suitable for low-risk HPV types. The second finding concerns the influence of the physicochemical properties of amino acids: E5, E6, E7, L1, and L2 proteins favor beta physicochemical properties for reducing amino acids. The third finding comes from the comparison of the reduced amino acid modes: PRseAAC and RCTD outperform the other four RedAA modes and show better performance with beta physicochemical properties of the amino acids. The final finding concerns the number of reduced amino acids: the combination of RCTD and beta physicochemical properties achieves the best performance with 8, 15, and 11 reduced amino acids for E6, L1, and L2 proteins, respectively.
Data Availability
All the data used to support the findings of this study are available from the Los Alamos National Laboratory (https://pave.niaid.nih.gov/lanlarchives).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (61772028) and research grants from the Zhejiang Provincial Natural Science Foundation of China (LY20F020016).
References
[1] E. K. Yim and J. S. Park, “Role of proteomics in translational research in cervical cancer,” Expert Review of Proteomics, vol. 3, no. 1, pp. 21–36, 2014.
[2] O. Peralta-Zaragoza, V. H. Bermúdez-Morales, C. Pérez-Plasencia, J. Salazar-León, C. Gómez-Cerón, and V. Madrid-Marina, “Targeted treatments for cervical cancer: a review,” OncoTargets and Therapy, vol. 5, pp. 315–328, 2012.
[3] A. Jemal, F. Bray, M. M. Center, J. Ferlay, E. Ward, and D. Forman, “Global cancer statistics,” CA: A Cancer Journal for Clinicians, vol. 61, no. 2, pp. 69–90, 2011.
[4] D. Forman, C. de Martel, C. J. Lacey et al., “Global burden of human papillomavirus and related diseases,” Vaccine, vol. 30, no. 5, pp. F12–F23, 2012.
[5] F. X. Bosch, M. M. Manos, N. Munoz et al., “Prevalence of human papillomavirus in cervical cancer: a worldwide perspective,” Journal of the National Cancer Institute, vol. 87, no. 11, pp. 796–802, 1995.
[6] M. H. Schiffman, H. M. Bauer, R. N. Hoover et al., “Epidemiologic evidence showing that human papillomavirus infection causes most cervical intraepithelial neoplasia,” Journal of the National Cancer Institute, vol. 85, no. 12, pp. 958–964, 1993.
[7] S. Kim and J.-H. Eom, “Prediction of the human papillomavirus risk types using gap-spectrum kernels,” LNCS, vol. 3973, pp. 710–715, 2006.
[8] C. L. Pang and F. Thierry, “Human papillomavirus proteins as prospective therapeutic targets,” Microbial Pathogenesis, vol. 58, pp. 55–65, 2013.
[9] S. Kim and B.-T. Zhang, “Human papillomavirus risk type classification from protein sequences using support vector machines,” LNCS, vol. 3907, pp. 57–66, 2006.
[10] J. Haedicke and T. Iftner, “Human papillomaviruses and cancer,” Radiotherapy and Oncology, vol. 108, no. 3, pp. 397–402, 2013.
[11] J. Peng, L. Gao, J. Guo et al., “Type-specific detection of 30 oncogenic human papillomaviruses by genotyping both E6 and L1 genes,” Journal of Clinical Microbiology, vol. 51, no. 2, pp. 402–408, 2013.
[12] M. S. Longworth and L. A. Laimins, “Pathogenesis of human papillomaviruses in differentiating epithelia,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 362–372, 2004.
[13] S. Kim, J. Kim, and B. T. Zhang, “Ensembled support vector machines for human papillomavirus risk type prediction from protein secondary structures,” Computers in Biology and Medicine, vol. 39, no. 2, pp. 187–193, 2009.
[14] E. M. de Villiers, C. Fauquet, T. R. Broker, H. U. Bernard, and H. zur Hausen, “Classification of papillomaviruses,” Virology, vol. 324, no. 1, pp. 17–27, 2004.
[15] K. Münger, A. Baldwin, K. M. Edwards et al., “Mechanisms of human papillomavirus-induced oncogenesis,” Journal of Virology, vol. 78, no. 21, pp. 11451–11460, 2004.
[16] M. L. Eide and H. Debaque, “HPV detection methods and genotyping techniques in screening for cervical cancer,” Annales de Pathologie, vol. 32, no. 6, pp. e15–e23, 2012.
[17] M. F. Janicek and H. E. Averette, “Cervical cancer: prevention, diagnosis, and therapeutics,” CA: A Cancer Journal for Clinicians, vol. 51, no. 2, pp. 92–114, 2001.
[18] M. D. Kaspersen, P. B. Larsen, H. J. Ingerslev et al., “Identification of multiple HPV types on spermatozoa from human sperm donors,” PLoS One, vol. 6, no. 3, article e18095, 2011.
[19] P. Guan, R. Howell-Jones, N. Li et al., “Human papillomavirus types in 115,789 HPV-positive women: a meta-analysis from cervical infection to cancer,” International Journal of Cancer, vol. 131, no. 10, pp. 2349–2359, 2012.
[20] H. Furumoto and M. Irahara, “Human papilloma virus (HPV) and cervical cancer,” Journal of Medical Investigation, vol. 49, no. 3-4, pp. 124–133, 2002.
[21] R. D. Burk, G. Y. F. Ho, L. Beardsley, M. Lempa, M. Peters, and R. Bierman, “Sexual behavior and partner characteristics are the predominant risk factors for genital human papillomavirus infection in young women,” The Journal of Infectious Diseases, vol. 174, no. 4, pp. 679–689, 1996.
[22] N. Muñoz, F. X. Bosch, S. de Sanjosé et al., “Epidemiologic classification of human papillomavirus types associated with cervical cancer,” New England Journal of Medicine, vol. 348, no. 6, pp. 518–527, 2003.
[23] J.-H. Eom, S.-B. Park, and B.-T. Zhang, “Genetic mining of DNA sequence structures for effective classification of the risk types of human papillomavirus (HPV),” in Neural Information Processing, N. R. Pal, N. Kasabov, R. K. Mudi, S. Pal, and S. K. Parui, Eds., pp. 1334–1343, Springer, Berlin, Heidelberg, 2004.
[24] J.-G. Joung, O. Sok June, and B.-T. Zhang, “Prediction of the risk types of human papillomaviruses by support vector machines,” in PRICAI 2004: Trends in Artificial Intelligence, pp. 723–731, Springer, Berlin, Heidelberg, 2004.
[25] J.-G. Joung, O. Sok June, and B.-T. Zhang, “Protein sequence-based risk classification for human papillomaviruses,” Computers in Biology and Medicine, vol. 36, no. 6, pp. 656–667, 2006.
[26] S. B. Park, S. H. Wang, and B. T. Zhang, “Mining the risk types of human papillomavirus (HPV) by AdaCost,” in Lecture Notes in Computer Science, pp. 403–412, Springer, Berlin, Heidelberg, 2003.
[27] M. Esmaeili, H. Mohabatkar, and S. Mohsenzadeh, “Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses,” Journal of Theoretical Biology, vol. 263, no. 2, pp. 203–209, 2010.
[28] M. Alemi, H. Mohabatkar, and M. Behbahani, “In silico comparison of low- and high-risk human papillomavirus proteins,” Applied Biochemistry and Biotechnology, vol. 172, no. 1, pp. 188–195, 2014.
[29] K. Tomii and M. Kanehisa, “Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins,” Protein Engineering, vol. 9, no. 1, pp. 27–36, 1996.
[30] T. Li, K. Fan, J. Wang, and W. Wang, “Reduction of protein sequence complexity by residue grouping,” Protein Engineering Design and Selection, vol. 16, no. 5, pp. 323–330, 2003.
[31] M. Bhasin and G. P. S. Raghava, “Classification of nuclear receptors based on amino acid composition and dipeptide composition,” The Journal of Biological Chemistry, vol. 279, no. 22, pp. 23262–23266, 2004.
[32] Q. Dai, Y. Li, X. Q. Liu, Y. H. Yao, Y. J. Cao, and P. He, “Comparison study on statistical features of predicted secondary structures for protein structural class prediction: from content to position,” BMC Bioinformatics, vol. 14, no. 1, p. 152, 2013.
[33] Q. Dai, L. Wu, and L. H. Li, “Improving protein structural class prediction using novel combined sequence information and predicted secondary structural features,” Journal of Computational Chemistry, vol. 32, no. 16, pp. 3393–3398, 2011.
[34] J. Cui, L. Han, H. Lin et al., “Prediction of MHC-binding peptides of flexible lengths from sequence-derived structural and physicochemical properties,” Molecular Immunology, vol. 44, no. 5, pp. 866–877, 2007.
[35] L. Y. Han, C. J. Zheng, B. Xie et al., “Support vector machines approach for predicting druggable proteins: recent progress in its exploration and investigation of its usefulness,” Drug Discovery Today, vol. 12, no. 7-8, pp. 304–313, 2007.
[36] Y. L. Chen and Q. Z. Li, “Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo amino acid composition,” Journal of Theoretical Biology, vol. 248, no. 2, pp. 377–381, 2007.
[37] H. B. Shen and K. C. Chou, “Using ensemble classifier to identify membrane protein types,” Amino Acids, vol. 32, no. 4, pp. 483–488, 2007.
[38] X. Q. Yu, X. Q. Zheng, T. G. Liu, Y. Dou, and J. Wang, “Predicting subcellular location of apoptosis proteins with pseudo amino acid composition: approach from amino acid substitution matrix and auto covariance transformation,” Amino Acids, vol. 42, no. 5, pp. 1619–1625, 2012.
[39] Z. P. Feng and C. T. Zhang, “Prediction of membrane protein types based on the hydrophobic index of amino acids,” Journal of Protein Chemistry, vol. 19, no. 4, pp. 269–275, 2000.
[40] D. S. Horne, “Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities,” Biopolymers, vol. 27, no. 3, pp. 451–477, 1988.
[41] R. R. Sokal and B. A. Thomson, “Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population,” American Journal of Physical Anthropology, vol. 129, no. 1, pp. 121–131, 2006.
[42] K. C. Chou, “Prediction of protein subcellular locations by incorporating quasi-sequence-order effect,” Biochemical and Biophysical Research Communications, vol. 278, no. 2, pp. 477–483, 2000.
[43] S. L. Zhang, Y. Y. Liang, and X. G. Yuan, “Improving the prediction accuracy of protein structural class: approached with alternating word frequency and normalized Lempel-Ziv complexity,” Journal of Theoretical Biology, vol. 341, pp. 71–77, 2014.
Copyright
Copyright © 2020 Xinnan Xu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.