Abstract

Cervical cancer is one of the leading causes of cancer death worldwide. Diseases such as atypical hyperplasia of the cervical epithelium and invasive cervical cancer are closely related to persistent infection with high-risk types of human papillomavirus (HPV). Identifying the high-risk types of HPV is therefore key to the prevention and treatment of cervical cancer. With the accumulation of high-throughput and clinical data, systematic and quantitative approaches to mathematical modeling and computational prediction have become increasingly important. This paper summarizes the mathematical models and prediction methods for HPV risk types, organized around the key steps of feature extraction, feature selection, and prediction algorithms. We summarize and discuss the advantages and disadvantages of existing algorithms, providing a theoretical basis for follow-up research.

1. Introduction

Human papillomavirus (HPV) usually causes benign papillomas and epithelial malignancies [1]. About 30 years ago, researchers found a close relationship between HPV and cervical cancer; since then, DNA of specific HPV types has been found in almost all cervical cancer biopsies [2]. Epidemiological studies also consistently show that HPV is the main cause of cervical cancer. HPV is a papovavirus belonging to the family Papillomaviridae. It is a spherical DNA virus whose icosahedral, three-dimensionally symmetric nucleocapsid carries 72 capsomers on its surface, with a diameter of about 45-55 nm [3]. Its genome contains about 8,000 base pairs, roughly 88% of which code for viral proteins. According to function, the HPV genome can be divided into three parts: the noncoding region, the early gene region, and the late gene region. The noncoding region includes promoters, enhancers, and silencers, which play an important role in DNA replication and transcriptional regulation. The early gene region contains six genes, E1, E2, E4, E5, E6, and E7, encoding the proteins and oncogenes required during replication [4]. The late gene region contains the L1 and L2 genes, which encode the structural proteins of the viral capsid [5].

The early region encodes six proteins with the following functions. E1 mainly controls viral replication and its inhibition and is expressed in the early stage of infection. E2 is a sequence-specific DNA-binding protein [6] that not only regulates mRNA transcription and DNA replication but also represses the expression of the E6 and E7 proteins. E4 is expressed during infection and plays an important role in viral replication and mutation. E5 is the smallest transforming protein [7]; it consists of two functional regions, a hydrophobic amino-terminal region, which determines where E5 inserts into the membrane, and a hydrophilic carboxyl-terminal region. If the carboxyl terminus is introduced into resting cells, it can induce cellular DNA synthesis, and it can also induce the expression of a variety of oncogenes. E6 and E7 are the two most important proteins: they not only regulate the cell cycle [8] but also play an important role in the proliferation of cancer cells.

More than 200 HPV types have been identified, and about 40 types can invade the female reproductive tract [9]. When an isolate shows significant sequence divergence from the defined HPV types, a new type is defined. Epidemiological studies have shown a strong association between genital human papillomavirus and cervical cancer that is independent of other risk factors. According to relative malignancy, HPV types can be divided into two or three classes: low-risk, medium-risk, and high-risk. The high-risk HPVs include HPV16, 18, 31, 33, 34, 35, 39, and 45, and the low-risk HPVs include HPV6, 11, 42, 43, and 44 [10-13].

The distribution of HPV types shows clear geographical patterns. Worldwide, among the high-risk types, HPV16 accounts for 51% of cases [14], followed by HPV18 at 16%; others, such as HPV45, HPV31, and HPV33, account for 9%, 6%, and 3%, respectively. Together, the four most common HPV types exceed 80%. In Latin America, however, HPV33 is the most common, followed by HPV39 and 59. In Asia, in addition to HPV16 and 18, which remain the most common [15], HPV52 and 58 account for a far larger share of cervical cancers than in Western and African countries.

Different HPV types have different clinical manifestations. According to the tissues they invade, HPV types can be divided into cutaneous high- or low-risk types and mucosal high- or low-risk types [16]. Low-risk cutaneous types such as HPV1, 2, 3, and 4 are generally associated with common warts, flat warts, and related lesions. High-risk cutaneous types such as HPV5, 8, 14, and 17 are associated with anal cancer, prostate cancer, bladder cancer, and other malignancies. Low-risk mucosal types, including HPV6, 11, 13, 32, 34, 40, and 42, are generally associated with genital, anal, oropharyngeal, and esophageal mucosal infections. High-risk mucosal types such as HPV16, 18, 30, 31, 33, 35, and 39 can lead to cervical cancer, rectal cancer, tonsillar cancer, and other diseases [17].

Many epidemiological and experimental methods are now available to identify HPV types [18-21]; they mainly use polymerase chain reaction (PCR) technology for rapid detection in clinical samples. With the rapid accumulation of HPV data, reliable and effective computational methods are needed to predict high-risk HPV, and in recent years many computational models have been proposed. Eom et al. designed a genetic algorithm to predict HPV types from the distribution of sequence fragments [22]. Joung et al. built HPV type predictors based on the support vector machine and the hidden Markov model [23, 24]. Through systematic analysis, Park et al. suggested using decision trees to construct an HPV typing model [25]. Kim and Zhang calculated the distance distribution of amino acid pairs and predicted HPV risk types from the E6 protein [26, 27]. Kim et al. extracted discriminative features of the protein secondary structure and designed a set of support vector machines (GSVM) to classify HPV types [28]. Esmaeili et al. calculated Chou's pseudoamino acid composition of HPV sequences and classified HPV types, evaluating performance with ROC analysis [29]. Alemi et al. systematically compared the physicochemical properties of high-risk and low-risk HPV and built a high-risk HPV prediction model based on the support vector machine [30]. Wang et al. used the protein "sequence space" to predict high-risk HPV types [31]. Xu et al. proposed an HPV high-risk prediction model based on reduced amino acids: they divided the 20 amino acids into several nonoverlapping groups and used the structural and chemical patterns of each group to distinguish high-risk HPV [32].

In this paper, we review the development of HPV typing for cervical cancer, focusing on sequence features and predicted secondary structure features. Feature fusion methods and feature selection algorithms are then discussed. Finally, we review the classification schemes and the various machine learning algorithms used for HPV typing in cervical cancer.

2. Benchmark Datasets

The HPV database comes from the Los Alamos National Laboratory (LANL) and contains 72 HPV types. HPV risk types were manually determined based on the HPV profiles [33]: if an HPV type is cutaneous or belongs to the skin group, it is classified as low-risk; on the other hand, it is classified as high-risk if it is known to be a high-risk type for cervical cancer [28]. Some researchers instead build their own HPV datasets from SWISS-PROT and apply their own preprocessing to the sequences. Here, we focus on mathematical models for detecting HPV risk types, so the detailed processing of the datasets is beyond the scope of this review.

3. Methods

3.1. Feature Extraction

The extraction of HPV sequence information is shifting from working with the complex structure to extracting mathematical features [34]. Extracting information directly from the complex structure of an HPV protein introduces large errors, and the extracted information may not be comprehensive. In contrast, mathematical features are easy to compute directly from the protein sequence, are not affected by structural errors, and in principle the sequence carries the full information of the protein.

3.1.1. Sequence Features

(1) K-Mer. Protein sequences and peptides can be regarded as strings over the 20-letter amino acid alphabet [35], and the frequency of each k-mer can be used to characterize the difference between two HPV sequences. When $k$ is 1, the k-mer profile is simply the amino acid composition, calculated as follows [36]:

$$f_i = \frac{n_i}{L}, \quad i = 1, 2, \ldots, 20,$$

where $n_i$ denotes the number of occurrences of the $i$-th amino acid in the protein sequence of length $L$.
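As a concrete illustration, the following Python sketch computes k-mer frequencies over the standard amino acid alphabet (the helper name and the toy sequence are our own, not taken from the cited studies):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_frequencies(sequence, k=1):
    """Frequency of every k-mer over the 20-letter amino acid alphabet."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = max(len(sequence) - k + 1, 1)
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word in counts:              # skip ambiguous residues such as 'X'
            counts[word] += 1
    return {w: c / total for w, c in counts.items()}

# k = 1 reduces to the amino acid composition f_i = n_i / L
aac = kmer_frequencies("MHQKRTAMFQDPQERPRKLPQLCTELQTTIHDIILE", k=1)
```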

(2) Order-Based Features. Order-based features reflect the physicochemical interactions among amino acid pairs, which can be described by sequence-order coupling scores and quasi-sequence-order scores [37].

Suppose $S = R_1 R_2 \cdots R_L$ is a protein sequence of length $L$. The $d$-th tier sequence-order coupling factor $\tau_d$, describing the sequence-order effect, is denoted as

$$\tau_d = \sum_{i=1}^{L-d} (J_{i,i+d})^2, \quad d = 1, 2, \ldots, \lambda,$$

where $J_{i,i+d}$ is the physicochemical distance between the amino acids $R_i$ and $R_{i+d}$, which can be calculated as:

$$J_{i,i+d}^2 = \frac{1}{3}\Big[\big(H(R_{i+d}) - H(R_i)\big)^2 + \big(P(R_{i+d}) - P(R_i)\big)^2 + \big(V(R_{i+d}) - V(R_i)\big)^2\Big].$$

The calculation of $J$ can take into account the hydrophobicity value $H$, polarity value $P$, and side-chain volume $V$ of the amino acids. Thus, the quasi-sequence-order descriptor $x_u$ can be defined by the following features:

$$x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w\sum_{d=1}^{\lambda}\tau_d}, & 1 \le u \le 20,\\[1.5ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w\sum_{d=1}^{\lambda}\tau_d}, & 21 \le u \le 20+\lambda. \end{cases}$$

The first 20 features are the k-mer (composition) features, and the latter $\lambda$-dimensional features represent the sequence-order interaction information of the amino acid sequence.

(3) Amino Acid Distribution. The amino acid distribution descriptor is composed of composition, transition, and distribution terms. The composition term is the amino acid composition of the HPV sequence, which can be computed on the original sequence or a reduced sequence [38]. The transition term describes the number of conversions between one residue class and the subsequent one and can be calculated by the following formula:

$$T_{uv} = \frac{n_{uv} + n_{vu}}{L - 1},$$

where $L$ is the number of amino acids, and $n_{uv}$ is the number of conversions from class $u$ to a subsequent class $v$.
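A minimal sketch of the transition term, assuming a three-class hydrophobicity reduction (the grouping shown is a common illustrative choice, not necessarily the alphabet used in [38]):

```python
def ctd_transition(sequence, classes):
    """T_uv = (n_uv + n_vu) / (L - 1) for each unordered pair of classes."""
    labels = [classes[aa] for aa in sequence if aa in classes]
    pairs = list(zip(labels, labels[1:]))
    names = sorted(set(classes.values()))
    result = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            n = sum(1 for p in pairs if p in ((u, v), (v, u)))
            result[(u, v)] = n / max(len(labels) - 1, 1)
    return result

groups = {aa: "polar" for aa in "RKEDQN"}
groups.update({aa: "neutral" for aa in "GASTPHY"})
groups.update({aa: "hydrophobic" for aa in "CLVIMFW"})
print(ctd_transition("MHQKRTAMFQDPQERPRKLPQ", groups))
```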

(4) Pseudoamino Acid Composition (PseAAC). In order to avoid losing the sequence-order information hidden in HPV sequences, Chou proposed the pseudoamino acid composition (PseAAC), which augments the simple AAC representation of protein samples [39]. PseAAC is a vector of size $20 + \lambda$, where the first 20 entries are the amino acid composition and the latter $\lambda$ dimensions carry the pseudoamino acid (sequence-order) information [29]. Its components can be calculated by the following formula:

$$x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 1 \le u \le 20,\\[1.5ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 20 < u \le 20+\lambda, \end{cases}$$

where $f_u$ is the frequency of the $u$-th (reduced) amino acid, $w$ is a weight factor, and the $j$-th tier correlation factor $\theta_j$ can be calculated by the following formula:

$$\theta_j = \frac{1}{L-j}\sum_{i=1}^{L-j}\Theta(R_i, R_{i+j}), \qquad \Theta(R_i, R_{i+j}) = \big[M(R_{i+j}) - M(R_i)\big]^2,$$

where $M(R_i)$ represents the molecular weight of amino acid residue $R_i$, and $M(R_{i+j})$ is the molecular weight of amino acid residue $R_{i+j}$. In this way, the protein sequence is mapped to the feature vector $(x_1, x_2, \ldots, x_{20+\lambda})$.
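The following sketch implements a PseAAC of this form; the molecular-weight correlation function and the scaling of $\theta$ are our own assumptions, so the exact values may differ from those in [29]:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
# Residue masses (Da), standard values rounded to two decimals.
MW = dict(zip(AA, [71.08, 103.14, 115.09, 129.12, 147.18, 57.05, 137.14,
                   113.16, 128.17, 113.16, 131.19, 114.10, 97.12, 128.13,
                   156.19, 87.08, 101.10, 99.13, 186.21, 163.18]))

def pseaac(seq, lam=5, w=0.05):
    """(20 + lam)-dimensional PseAAC with a molecular-weight correlation."""
    seq = [a for a in seq if a in MW]
    L = len(seq)
    f = np.array([seq.count(a) for a in AA], dtype=float) / L
    theta = np.array([
        np.mean([(MW[seq[i + j]] - MW[seq[i]]) ** 2 for i in range(L - j)])
        for j in range(1, lam + 1)
    ])
    theta /= theta.max() or 1.0        # scale so theta does not swamp f
    denom = f.sum() + w * theta.sum()
    return np.concatenate([f / denom, w * theta / denom])
```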

3.1.2. Evolutionary Profile

(1) Position-Specific Scoring Matrix (PSSM). The PSSM is developed on the basis of the position frequency matrix [40] and quantitatively describes the evolutionary characteristics of HPV sequences. It is currently the most commonly used matrix feature model, especially for protein structural class prediction [41]. When a protein $s$ with length $L$ is given, its PSSM is defined as

$$\text{PSSM}(s) = \big(p_{i,j}\big)_{L \times 20},$$

where $p_{i,j}$ is the score for the residue at the $i$-th position of $s$ mutating into amino acid type $j$ during the course of biological evolution, and the 20 ordered amino acids are indexed $1, 2, \ldots, 20$. The close evolutionary relationships among HPV sequences can be described by their PSSMs [42]. We search homologous sequences in the SWISS-PROT database with the help of PSI-BLAST [43-45] and normalize the PSSM scores using the following sigmoid function:

$$p'_{i,j} = \frac{1}{1 + e^{-p_{i,j}}}.$$

(2) Structural Properties of PSSM. Let $A$ and $A'$ be the finite alphabet sets of the original and reduced sequences, respectively, and let $L$ be the sequence length. The PSSM of a sequence $S$ over $A$ is transformed into the RedPSSM of its reduced sequence $S'$ over $A'$, defined as

$$\text{RedPSSM}(i, c) = \frac{1}{|G_c|}\sum_{j \in G_c} p_{i,j},$$

where $|G_c|$ represents the size of the reduced group $G_c$ made up of the amino acids merged into the reduced amino acid $c$, and $c$ belongs to $A'$.

Auto covariance (AC) is a correlation factor that has been widely used in various fields of bioinformatics [46]. It links residues that are adjacent along the protein sequence and converts the PSSM into a fixed-length vector [47]. The AC variable represents the average interaction between residues separated by a series of lags $g$ and is defined as

$$\text{AC}(j, g) = \frac{1}{L-g}\sum_{i=1}^{L-g}\big(p_{i,j} - \bar{p}_j\big)\big(p_{i+g,j} - \bar{p}_j\big),$$

where $\bar{p}_j$ is the mean of column $j$, and $g$ indicates the spacing between the two residues. However, AC only captures residue interactions within the same column [48]. In order to solve this problem, the total correlations among all pairs of columns are calculated by the following formula:

$$\text{CC}(j_1, j_2, g) = \frac{1}{L-g}\sum_{i=1}^{L-g}\big(p_{i,j_1} - \bar{p}_{j_1}\big)\big(p_{i+g,j_2} - \bar{p}_{j_2}\big), \quad j_1 \neq j_2.$$
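A numpy sketch of the AC features (the lag range and the toy PSSM are illustrative):

```python
import numpy as np

def pssm_autocovariance(pssm, max_lag=4):
    """AC(j, g) for every column j of an L x 20 PSSM and lag g = 1..max_lag."""
    L, n_cols = pssm.shape
    mean = pssm.mean(axis=0)
    feats = []
    for g in range(1, max_lag + 1):
        for j in range(n_cols):
            col = pssm[:, j] - mean[j]
            feats.append(np.dot(col[:L - g], col[g:]) / (L - g))
    return np.array(feats)

# Toy PSSM normalized with the sigmoid 1 / (1 + e^(-p)) described above.
raw = np.random.default_rng(0).normal(size=(120, 20))
pssm = 1.0 / (1.0 + np.exp(-raw))
ac = pssm_autocovariance(pssm)          # 4 lags x 20 columns = 80 features
```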

3.1.3. Secondary Structure Features

In order to improve the accuracy of predicting HPV high-risk types, some researchers analyzed the protein secondary structures of HPV, extracted their characteristics, and constructed prediction models [28, 49]. Given a protein sequence, its secondary structure sequence is first predicted. The secondary structure content can be calculated as

$$C_s = \frac{n_s}{L}, \quad s \in \{H, E, C\},$$

where $H$ is $\alpha$-helix, $E$ is $\beta$-strand, $C$ is coil, and $n_s$ is the number of residues in state $s$. The first- and second-order composition moment vectors (CMV) were calculated as structure features [50], defined by the following formula:

$$\text{CMV}_s^{(k)} = \frac{\sum_{j=1}^{n_s} (x_{sj})^k}{\prod_{d=1}^{k}(L-d+1)},$$

where $x_{sj}$ represents the position of the $j$-th occurrence of structural element $s$ in the secondary structure sequence of length $L$, and $k$ is the vector order.

In order to further study the arrangement of the different structural elements, a number of structural fragments and patterns have been proposed [51]. The length of the longest segment of a structural element $s$ is defined by the following formula:

$$\text{MaxSeg}_s = \max_{\text{Seg}_s}\, \text{len}(\text{Seg}_s),$$

where $\max$ represents the maximal segment length function, and $\text{Seg}_s$ ranges over the segments composed of the structural element $s$ [52]. In order to reduce the length effect, a normalized length of the longest segment is defined:

$$\text{NMaxSeg}_s = \frac{\text{MaxSeg}_s}{L},$$

where $L$ is the sequence length. In addition to the maximal segment length, the average segment length is also an important feature of protein secondary structure, defined as

$$\text{AvgSeg}_s = \frac{1}{N_s}\sum_{\text{Seg}_s} \text{len}(\text{Seg}_s),$$

where $N_s$ represents the number of segments of $s$, and $\text{len}$ is the segment length function. The normalized average segment length is

$$\text{NAvgSeg}_s = \frac{\text{AvgSeg}_s}{L},$$

where $L$ represents the length of the protein sequence.
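These segment features are straightforward to compute from a predicted secondary structure string; a small sketch (the function name and example string are our own):

```python
import itertools

def segment_features(ss, element="H"):
    """Longest and average run length of `element`, raw and length-normalized."""
    runs = [len(list(g)) for ch, g in itertools.groupby(ss) if ch == element]
    L = len(ss)
    if not runs:
        return {"max": 0, "max_norm": 0.0, "avg": 0.0, "avg_norm": 0.0}
    avg = sum(runs) / len(runs)
    return {"max": max(runs), "max_norm": max(runs) / L,
            "avg": avg, "avg_norm": avg / L}

print(segment_features("CCHHHHECCHHECC", "H"))   # helix runs of length 4 and 2
```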

Zhang et al. analyzed the predicted secondary structure of proteins, considering only the helix ($H$) and strand ($E$) structural units and ignoring the irregular coil [53]. According to this simplification rule, they obtained a simplified protein secondary structure sequence and defined the transition probability matrix (TPM) as follows:

$$\text{TPM} = \begin{pmatrix} p_{HH} & p_{HE} \\ p_{EH} & p_{EE} \end{pmatrix},$$

where

$$p_{uv} = \frac{n_{uv}}{\sum_{v'} n_{uv'}},$$

where $u$ and $v$ belong to $\{H, E\}$, the event that $u$ is followed by $v$ is counted over the simplified sequence, and the size of this event is described by $n_{uv}$.

Most researchers pay attention to the content of structural units, so the location of these units within the protein structure is sometimes ignored. Dai et al. therefore proposed some location-based structural features [54]. If the distance between two occurrences of a given structural element is 1, they are grouped into one structural domain; if not, they are divided into two different structural domains. Given the position probability distribution $P(x)$ of a structural element $X$ with mean $m_X = \sum_x x\,P(x)$, the digital features semi-mean $\mu_X$ and semi-variance $\sigma_X^2$ are defined as follows:

$$\mu_X = \frac{1}{2}\,m_X, \qquad \sigma_X^2 = \frac{1}{2}\sum_x (x - m_X)^2\,P(x).$$

The ratio $r_X$ of the standard deviation to $\mu_X$ can be calculated by:

$$r_X = \frac{\sigma_X}{\mu_X}.$$

$r_X$ is used to describe the position distribution of the predicted secondary structural elements.

3.2. Feature Selection

Feature selection (FS) is one of the important steps in data processing and plays a key role in data mining, pattern recognition, and machine learning [32]. It addresses the problem of how to select the input features that yield the optimal prediction results. Through feature selection, the complexity of the problem can be reduced, and the prediction accuracy, robustness, and interpretability of the learning algorithm can be improved [55]. Therefore, feature selection can be used to select features that are characteristic of HPV risk types [56]. More importantly, analyzing the selected features helps us gain a deeper understanding of HPV sequences. This section summarizes some common feature selection methods.

3.2.1. Mutual Information

The feature selection method based on mutual information can capture nonlinear relationships between variables, which makes it well suited to complex classification problems [57]. Let $F = \{f_1, f_2, \ldots, f_n\}$ be the feature matrix and $c$ the relevant category. The mutual information between a feature $f_i$ and $c$ is defined as

$$I(f_i; c) = \sum_{f_i}\sum_{c} p(f_i, c)\,\log\frac{p(f_i, c)}{p(f_i)\,p(c)}.$$

When a subset $S$ of features has been selected, its redundancy is

$$R(S) = \frac{1}{|S|^2}\sum_{f_i, f_j \in S} I(f_i; f_j).$$

As can be seen from the above definitions, mutual-information-based selection aims to choose the features that are most relevant to the target category while having the least redundancy among the selected features [58]. Therefore, feature selection can be carried out directly according to the mutual information values.
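In practice, relevance-based selection by mutual information is available off the shelf; a hedged sklearn sketch with placeholder data standing in for an HPV feature table (note that SelectKBest ranks by relevance only and does not apply the redundancy penalty above):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((72, 200))                  # 72 HPV types x 200 features
y = rng.integers(0, 2, size=72)            # 1 = high-risk, 0 = low-risk

selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)
top_features = selector.get_support(indices=True)
```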

3.2.2. Support Vector Machine Recursive Feature Elimination (SVM-RFE)

Support vector machines have been widely used in pattern recognition and are very suitable for small samples with high-dimensional data. SVM-RFE is a sequential backward selection (SBS) algorithm based on the maximum-margin principle of the SVM [59]. SVM-RFE can be divided into linear SVM-RFE and nonlinear SVM-RFE.

(1) Linear SVM-RFE. Given a training sample set $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$, the linear SVM decision function can be calculated by the following formula:

$$f(x) = w \cdot x + b,$$

where $w$ is the weight vector of the optimal hyperplane, and $b$ is the threshold.

By introducing Lagrange multipliers, the optimization problem of the SVM can be transformed into the following dual programming problem:

$$\max_{\alpha}\;\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,x_i\cdot x_j,$$

where the $\alpha_i$ are obtained by solving this maximization under the constraints $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n}\alpha_i y_i = 0$. The weight vector can then be written as:

$$w = \sum_{i=1}^{n}\alpha_i\,y_i\,x_i.$$

The ranking criterion score of the $k$-th feature is defined as the following equation:

$$c_k = w_k^2.$$

During the training process, the feature with the smallest score of the ranking criterion is removed, and the remaining features are used to train the SVM for the next iteration.
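This iterative elimination is exactly what sklearn's RFE performs when wrapped around a linear SVM; a sketch reusing the placeholder X, y from the mutual information example:

```python
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# At each step the feature with the smallest ranking score w_k^2 is removed.
rfe = RFE(estimator=SVC(kernel="linear", C=1.0),
          n_features_to_select=20, step=1)
rfe.fit(X, y)
surviving = rfe.get_support(indices=True)   # indices of the retained features
```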

(2) Nonlinear SVM-RFE. When the sample size is larger than the number of features, nonlinear SVM-RFE can obtain better results [60]. Features are mapped into a higher-dimensional space by a nonlinear map:

$$x \mapsto \phi(x).$$

After being mapped into the new space, the samples can be separated linearly. The dual problem can be calculated with the Lagrangian formulation:

$$\max_{\alpha}\;\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K(x_i, x_j).$$

Here, the inner product $\phi(x_i)\cdot\phi(x_j)$ can be replaced by a Gaussian kernel $K$:

$$K(x_i, x_j) = \exp\big(-\gamma\,\|x_i - x_j\|^2\big).$$

The ranking criterion of feature $k$ can be described by the following formula:

$$c_k = \frac{1}{2}\,\alpha^{T} H\,\alpha - \frac{1}{2}\,\alpha^{T} H^{(-k)}\,\alpha,$$

where $H_{ij} = y_i y_j K(x_i, x_j)$, and $H^{(-k)}$ represents the same matrix computed with feature $k$ dropped.

3.2.3. Genetic Algorithm

The genetic algorithm is an adaptive search strategy whose principle resembles the survival of the fittest in nature [61]. It is highly adaptive, largely independent of domain knowledge, and amenable to massive parallel computation, which makes it well suited to large-scale complex data and, in particular, to multiobjective optimization problems [62]. Given an HPV dataset, we construct its feature matrix and use a genetic algorithm to obtain a feature-indicator vector representing an optimal feature subset [63].

3.2.4. Kurtosis and Skewness

Kurtosis and skewness are two characteristics describing a distribution [64] and can distinguish the distributions of different features. Therefore, many studies calculate the skewness and kurtosis of each feature and select certain features according to their values. Given the distribution of a feature $x$ with mean $\mu$ and standard deviation $\sigma$, its skewness and kurtosis are defined as follows:

$$\text{Skew}(x) = \mathbb{E}\!\left[\Big(\frac{x-\mu}{\sigma}\Big)^{3}\right], \qquad \text{Kurt}(x) = \mathbb{E}\!\left[\Big(\frac{x-\mu}{\sigma}\Big)^{4}\right].$$
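A short scipy sketch of selecting features by these moments (the thresholds are illustrative, not values from the cited studies; X is the earlier placeholder feature table):

```python
import numpy as np
from scipy.stats import kurtosis, skew

sk = skew(X, axis=0)
ku = kurtosis(X, axis=0)         # excess kurtosis: 0 for a normal distribution
keep = np.where((np.abs(sk) > 0.5) | (np.abs(ku) > 1.0))[0]
```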

3.2.5. ReliefF Algorithm

ReliefF is an individual evaluation method that evaluates each feature separately and assigns a weight to it [65, 66]. At each step, it randomly selects a sample $R$ from the training set and then finds the nearest instances of $R$ in the same class (nearest hit $H$) and in each different class (nearest miss $M(c)$); the weight of feature $A$ is then updated as follows:

$$W(A) \leftarrow W(A) - \frac{\text{diff}(A, R, H)}{m} + \sum_{c \neq \text{class}(R)} \frac{P(c)}{1 - P(\text{class}(R))}\cdot\frac{\text{diff}(A, R, M(c))}{m},$$

where $m$ is the number of sampling iterations, and $\text{diff}(A, I_1, I_2)$ represents the difference between samples $I_1$ and $I_2$ on feature $A$, which for numerical features can be calculated by the following formula:

$$\text{diff}(A, I_1, I_2) = \frac{|I_1[A] - I_2[A]|}{\max(A) - \min(A)}.$$
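A simplified two-class Relief sketch of this weight update (one nearest hit and one nearest miss per iteration, features scaled to [0, 1]; full ReliefF averages over k neighbors per class):

```python
import numpy as np

def relief(X, y, n_iter=100, seed=0):
    """Per-feature weights; larger values mean better class separation."""
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = (X - X.min(axis=0)) / span
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(Xn))
        d = np.abs(Xn - Xn[i]).sum(axis=1)
        d[i] = np.inf                          # exclude the sample itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))
        miss = np.argmin(np.where(y != y[i], d, np.inf))
        w += (np.abs(Xn[i] - Xn[miss]) - np.abs(Xn[i] - Xn[hit])) / n_iter
    return w
```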

3.3. Prediction Algorithms
3.3.1. Support Vector Machines (SVM)

SVM is a discriminative classifier defined by a separating hyperplane [27]. In other words, labeled training samples are used to train the model, which then classifies test samples by means of the learned optimal hyperplane [67]. When the prediction problem is nonlinear, the decision function of the SVM can be defined as

$$f(x) = \text{sgn}\left(\sum_{i=1}^{n}\alpha_i\,y_i\,K(x_i, x) + b\right),$$

where $K$ is a kernel function, $x_i$ is a support vector, and $\alpha_i$ belongs to $[0, C]$. The Gaussian radial basis kernel, which has strong learning ability and small error, is often selected as the kernel function; it is defined as

$$K(x_i, x) = \exp\big(-\gamma\,\|x_i - x\|^2\big),$$

where $\gamma$ is the coefficient of the kernel function and provides high flexibility.
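An RBF-kernel SVM of this form is a one-liner in sklearn; a sketch on the placeholder HPV features from above (hyperparameters are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # Gaussian radial basis kernel
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                # fraction correctly typed
```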

3.3.2. Principal Component Analysis (PCA)

PCA is a statistical analysis method that converts multiple indicators into a few comprehensive indicators. The idea of PCA is to replace the original, more numerous variables with fewer comprehensive variables [68]. The first step of PCA is to normalize the data matrix $X = (x_{ij})_{n \times p}$, which can be expressed by the following mathematical formula:

$$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},$$

where $\bar{x}_j$ and $s_j$ are the mean and standard deviation of the $j$-th variable. The second step of PCA is to obtain the correlation coefficient matrix $R$ of $\tilde{X}$:

$$R = \frac{1}{n-1}\,\tilde{X}^{T}\tilde{X},$$

whose eigenvalues and eigenvectors then define the principal components.

3.3.3. K-Nearest Neighbor Algorithm (KNN)

KNN is an important nonparametric classification method that assigns a sample to the majority class among its $k$ nearest neighbors in the feature space [69]. However, the basic KNN algorithm requires a global search, which has high computational complexity and is slow. In order to reduce the equal influence of all features in the traditional KNN algorithm, different weights can be assigned to the features in the distance formula used to measure similarity. For example, in the weighted Euclidean distance,

$$d(x, y) = \sqrt{\sum_{j=1}^{p} w_j\,(x_j - y_j)^2},$$

different weights $w_j$ are assigned to different features.
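A minimal numpy sketch of KNN with this weighted distance (the names and the majority-vote tie-breaking are our own choices):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, weights, k=5):
    """Classify x by majority vote among its k nearest weighted neighbors."""
    d = np.sqrt(((X_train - x) ** 2 * weights).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return np.bincount(y_train[nearest]).argmax()
```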

3.3.4. Partial Least Square Discriminant Analysis (PLS-DA)

PLS-DA is a standard high-dimensional data analysis method; it is especially suitable when there are many explanatory variables [70], multicollinear samples, few observations, and large interference noise. PLS-DA first encodes the sample category with dummy variables, which can be calculated by the following formula:

$$y_{ic} = \begin{cases} 1, & \text{if sample } i \text{ belongs to class } c,\\ 0, & \text{otherwise,} \end{cases}$$

where $c$ indexes the categorical variables and $y_{ic}$ is a dummy variable. A relationship model between the explanatory variables and the response variables is then established by least squares regression, and the category of each sample is determined by comparing the predicted values of the model response variables: if the predicted value of a dummy-variable component is the largest, the sample is assigned to the category corresponding to that dummy variable.
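PLS-DA can be sketched with sklearn's PLSRegression by regressing one-hot dummy variables on the features (the component count is illustrative; X, y are the earlier placeholders):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

Y_dummy = np.eye(2)[y]                 # dummy (one-hot) class encoding
pls = PLSRegression(n_components=5)
pls.fit(X, Y_dummy)
# Assign each sample to the class with the largest predicted dummy component.
pred_class = pls.predict(X).argmax(axis=1)
```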

3.3.5. Classification and Regression Tree (CART)

CART is a nonparametric statistical procedure for data analysis. Its characteristic is to make full use of a binary tree structure: the root node contains all samples and is divided into two children under a given partition rule [71]; this splitting process is then repeated recursively on each child node until the nodes can no longer be divided and become leaf nodes [72]. When the samples at node $t$ fall into classes $j = 1, \ldots, J$, the impurity of node $t$ can be expressed by the following formula:

$$i(t) = 1 - \sum_{j=1}^{J} p(j \mid t)^2,$$

where $p(j \mid t)$ is the proportion of samples at node $t$ that belong to class $j$. If node $t$ is split into children $t_L$ and $t_R$ that receive proportions $p_L$ and $p_R$ of its samples, the decrease in impurity is expressed by the following formula:

$$\Delta i(t) = i(t) - p_L\,i(t_L) - p_R\,i(t_R).$$
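A small sketch of the Gini impurity and the impurity decrease used to score CART splits:

```python
import numpy as np

def gini(labels):
    """i(t) = 1 - sum_j p(j|t)^2 for the samples at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def impurity_decrease(parent, left, right):
    """Delta i(t) for a candidate binary split of a node."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))
```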

3.3.6. Linear Discriminant Analysis (LDA)

LDA is a statistical analysis method used to determine the type of a sample [73] and has been widely used in the prediction of protein structural classes. By finding a projection vector $w$, the original data are mapped onto a lower-dimensional direction, and classification is then carried out on the samples in the new space.

$S_b$ is the interclass deviation matrix:

$$S_b = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^{T}.$$

$S_w$ is the intraclass deviation matrix:

$$S_w = \sum_{x \in C_1}(x - \mu_1)(x - \mu_1)^{T} + \sum_{x \in C_2}(x - \mu_2)(x - \mu_2)^{T},$$

where $C_1$ represents class 1 and $C_2$ represents class 2, and $\mu_1$ and $\mu_2$, respectively, represent the means of class 1 and class 2. The optimal projection $w$ maximizes the Fisher criterion $J(w) = \dfrac{w^{T}S_b\,w}{w^{T}S_w\,w}$.

3.3.7. Extreme Gradient Boosting (XGBoost)

The gradient boosting framework supports a variety of loss functions [74]: in addition to the exponential loss, these include the mean squared error loss and the logarithmic loss. XGBoost is an enhanced version of the gradient boosting algorithm that is more efficient, flexible, and portable [75]. It uses training data $\{(x_i, y_i)\}_{i=1}^{n}$ to predict a target variable $y$.

First of all, the objective function can be written as

$$\text{Obj} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i\big) + \sum_{k=1}^{K}\Omega(f_k),$$

where $K$ represents the number of trees, $l$ is the training loss function, and $\Omega$ is the regularization term.

Then, XGBoost takes the Taylor expansion of the loss function up to the second order and removes all the constants, so the specific objective at step $t$ can be described as

$$\text{Obj}^{(t)} \simeq \sum_{i=1}^{n}\Big[g_i\,f_t(x_i) + \tfrac{1}{2}\,h_i\,f_t^2(x_i)\Big] + \Omega(f_t),$$

where $g_i$ and $h_i$ can be described as follows:

$$g_i = \partial_{\hat{y}^{(t-1)}}\,l\big(y_i, \hat{y}^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\,l\big(y_i, \hat{y}^{(t-1)}\big).$$

The value of the objective function thus depends only on $g_i$ and $h_i$.

XGBoost can optimize arbitrary differentiable loss functions, including logistic regression and pairwise ranking, is simple to parallelize, and can greatly enhance program efficiency through fast model exploration [76].
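A minimal XGBoost sketch on the placeholder split from the SVM example (the hyperparameters are illustrative defaults, not tuned values from the cited studies):

```python
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=4,
                      learning_rate=0.1, objective="binary:logistic")
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```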

3.4. Evaluation Measure

After building an HPV prediction model, a statistical testing method is needed to evaluate its efficiency. Leave-one-out cross-validation (LOO-CV) is a commonly used evaluation method and a special case of $k$-fold cross-validation [77]: when $k$ equals the sample size $n$, $n$-fold cross-validation is exactly LOO-CV.

HPV typing is a two- or three-class problem. If the HPV model predicts a positive result and the true label is also positive, the case is a true positive ($TP$); if the prediction is positive but the true label is negative, it is a false positive ($FP$); false negatives ($FN$) and true negatives ($TN$) are defined analogously. The results can then be summarized by the following formulas:

$$\text{Sn} = \frac{TP}{TP + FN}, \qquad \text{Sp} = \frac{TN}{TN + FP}, \qquad \text{Acc} = \frac{TP + TN}{TP + TN + FP + FN}.$$
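These measures drop out directly from an LOO-CV confusion matrix; a hedged sklearn sketch on the placeholder data used throughout:

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVC

# Each sample is predicted by a model trained on the remaining n - 1 samples.
y_pred = cross_val_predict(SVC(kernel="rbf"), X, y, cv=LeaveOneOut())
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
```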

4. Conclusion

High-risk HPV accounts for a high proportion of cervical cancer worldwide, so the identification of high-risk HPV is of great significance for the treatment of major cancers such as cervical cancer [78]. At present, common epidemiological detection techniques for HPV typing [79] include nucleic acid in situ hybridization and dot blot hybridization, whose detection rate of infection is about 10-20% [80]. In addition, there are test methods such as the polymerase chain reaction (PCR): all HPV types can be detected by PCR, with a detection rate of more than 40%. Beyond experimental methods, a number of computational methods have been proposed to predict HPV typing.

We have summarized the mathematical models and prediction methods for HPV risk types, particularly around the key steps of feature extraction, feature selection, and prediction algorithms. From the studies reviewed above, the prediction accuracy for low-risk types is often higher than that for high-risk types. E7 performed better than the other HPV proteins in low-risk experiments, whereas E6 performs best among all HPV proteins in high-risk prediction and in all-type prediction experiments, which is consistent with experimental studies: the E5, E6, and E7 proteins of high-risk HPV play an important role in disease progression and carcinogenesis [81]. Therefore, the E6 protein sequence is more suitable for HPV high-risk prediction, and the E7 protein sequence is more suitable for HPV low-risk prediction. There are, however, exceptions: in the reduced amino acid models, the L1 protein performs better for predicting high-risk HPV, while L2 is more suitable for low-risk HPV. These conclusions provide a theoretical basis for subsequent research and can guide the establishment of HPV typing models.

Mutations usually lead to cell dysfunction, cell death, and even cancer in higher organisms. Mutations in HPV proteins may therefore make the virus more strongly inductive or carcinogenic and increase its chances of reinfecting the host or escaping the host immune system. For example, the carcinogenicity of HPV is mainly controlled by the E6 and E7 proteins, which frequently acquire internal variations: the mutation frequency of E6 in cancer is 20%~90%, and that of E7 is 60%~90% [82]. A study in Japan showed that the aspartate-to-glutamate mutation at position 25 (D25E) of the HPV16 E6 protein was associated with the HLA class II DRB1*1502 allele. This mutation is considered to be an important mutation in invasive cancer and cervical intraepithelial neoplasia. In the Netherlands, the frequency of the HLA DRB1*07 allele was increased in cancer patients carrying the L83V mutation. Incorporating such mutation information into HPV classification models is thus another promising way to improve HPV typing.

Data Availability

The HPV database can be downloaded from the Los Alamos National Laboratory (LANL, https://lanl.gov).

Conflicts of Interest

The authors declare no conflicts of interest, financial, or otherwise.

Acknowledgments

The authors thank all the anonymous referees for their valuable suggestions and support. This work is supported by the Natural Science Foundation of Heilongjiang Province of China (LH2019F046).