Abstract

Swine influenza viruses (SIVs) can unpredictably cross species barriers and directly infect humans, which poses major challenges for public health and creates pandemic risk at irregular intervals. Computational tools are needed to predict the infection phenotype and early pandemic risk of SIVs. For this purpose, we propose a feature representation algorithm to predict cross-species infection of SIVs. We built a high-quality dataset of 1902 viruses. A feature representation learning scheme was applied to learn feature representations from 64 well-trained random forest models with multiple feature descriptors of mutant amino acids in the viral proteins, including compositional information, position-specific information, and physicochemical properties. Class and probabilistic information were integrated into the feature representations, and redundant features were removed by feature space optimization. High performance was achieved using 20 features based on class information and 22 features based on probabilistic information. The proposed method will facilitate the characterization of SIV transmission phenotypes.

1. Introduction

The influenza A virus (family: Orthomyxoviridae) genome contains eight RNA segments [1]. The hemagglutinin (HA) gene is located in the fourth segment, and the neuraminidase (NA) gene is in the sixth segment. According to the antigenic characteristics of HA and NA, the influenza A virus has 18 HA subtypes and 11 NA subtypes [2–4]. Besides the fast mutation rates of viral genes, reassortment of genome segments facilitates the emergence of novel viruses that can overcome host barriers [5, 6]. The 1957 pandemic was caused by an influenza A (H2N2) virus whose genome contained segments 2, 4, and 6 from an avian influenza virus, whereas the 1968 pandemic was caused by an influenza A (H3N2) virus whose genome contained segments 2 and 4 from an avian influenza virus [1].

Swine influenza virus (SIV) is an influenza A virus that causes acute respiratory infectious disease in swine [7]. Three main SIV subtypes circulate worldwide (H1N1, H1N2, and H3N2), and these subtypes can be subdivided into different genotypes, including the classical H1N1, avian-like H1N1, human-like H3N2, reassorted H3N2, and H1N2 subgroups [8–10]. SIV genotypes are diverse, and their gene pool in nature is heterogeneous. SIVs can cross species barriers unpredictably and directly infect humans. The 2009 H1N1 pandemic killed more than 18,000 people, and the causative virus was a novel SIV produced by genome reassortment between reassortant swine viruses from America and Europe [11–13]. H3N2 variant viruses (H3N2v), which carry segment 7 from the 2009 H1N1 pandemic virus in their genomes, were identified in swine in 2010 and first detected in people in 2011 [14]. Clinical cases of H1N1 variant viruses (H1N1v) and H1N2 variant viruses (H1N2v) have also been reported since 2011 [15]. SIVs are a major challenge for human public health and may trigger pandemics.

Computational bioinformatics tools are needed to predict the transmission phenotype and pandemic risk of SIVs. For this purpose, machine-learning methods may be ideal tools [16–18]. Machine-learning techniques have great potential for virus screening because they can use viral protein sequences as input without the need for prior knowledge. In this paper, we propose a feature representation algorithm to predict cross-species infection of SIVs. The algorithm combines several sequence-based feature descriptors to build a comprehensive predictive model with sufficient information from different aspects. Sequence-based features carrying class information or probabilistic information are learnt from well-trained random forest (RF) classifiers. The dimension of the feature space was reduced using the minimum redundancy maximum relevance (mRMR) method to obtain the most informative features and distinguish SIVs with different transmission phenotypes.

To identify SIVs capable of interspecies transmission, we constructed a predictor with two predictive models that were trained using 20 features based on class information or 22 features based on probabilistic information under the RF classifier. The predictor with the feature representation learning achieved a high prediction performance. This study provides an important tool in predicting cross-species infection of SIVs for public health.

2. Materials and Methods

2.1. Data

Viral sequences of influenza viruses isolated from swine and human were downloaded from the GISAID EpiFlu public database (http://platform.gisaid.org/epi3/frontend) [2, 3]. GISAID deposits high-quality genomic sequences along with their clinical information in the database. Since sequence redundancy was very high and genome coverage varied greatly, raw data were filtered using public bioinformatics tools and algorithms (Table S1).

We obtained 5860 SIVs and 44,623 human influenza viruses from the GISAID database on 21 March 2019. The dataset included all of the 11 influenza virus proteins (PB2, PB1, PB1-F2, PA, HA, NP, NA, M1, M2, NS1, and NEP) encoded in eight genome segments. Strains lacking any of the 11 protein sequences or lacking subtype information were excluded. Amino acid positions in the 11 proteins were determined using the multiple sequence alignment tool MUSCLE [19]. Strains with more than three amino acids missing at the terminal ends of the viral proteins were removed; if only a few residues were missing, they were filled in from the viral protein with the highest sequence identity. We used the fast-clustering algorithm of the CD-Hit tool to reduce the redundancy in the dataset [20]. Ambiguous amino acid residues, such as X and B, were likely caused by sequencing errors and were replaced by the corresponding residues in the viral protein with the highest identity. Strains with large numbers of ambiguous residues in a viral protein were also removed.

The final dataset for predicting cross-species infection contained two categories of viruses: (1) 769 viruses isolated from humans (positive samples; H1N1, H1N2, H2N2, and H3N2 subtypes); (2) 1133 influenza viruses isolated from swine (negative samples; H1N1, H1N2, and H3N2 subtypes). The positive samples were composed of seasonal human influenza viruses, the 2009 pandemic swine virus, and variant swine viruses isolated from humans. Since these viruses could also be isolated from swine [1], they were excluded from the negative samples according to the similarity of genome sequence. Information about the 1902 strains is summarized in Table S1.

2.2. Signature Amino Acid Positions Based on Entropy

Most of the amino acid residues in the viral proteins were conserved. To reduce the computing complexity, amino acid residues were filtered by the entropy measure at each position of the 11 viral proteins. For a given position $i$, the entropy value $H(i)$ was computed using the formula [21]: $H(i) = -\sum_{a} p_{i,a} \log p_{i,a}$, where $p_{i,a}$ is the observed probability of amino acid $a$ at position $i$. High entropy values indicate high amino acid mutation rates at the corresponding position. We set the threshold of entropy difference as 1.5 and obtained 36 signature positions, and therefore, each strain was represented by a list of 36 amino acid residues in the screened positions.
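The entropy screening can be illustrated with a minimal Python sketch. It assumes base-2 logarithms, treats a gap as an ordinary symbol, and thresholds the per-position entropy directly; these choices, the function names, and the toy alignment are assumptions made for illustration and may differ from the exact criterion used to obtain the 36 signature positions.

```python
import math
from collections import Counter

def position_entropy(column):
    """Shannon entropy (in bits) of one alignment column; a gap counts as a symbol."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def signature_positions(aligned_seqs, threshold):
    """Return the indices of alignment columns whose entropy exceeds the threshold."""
    columns = zip(*aligned_seqs)  # transpose: one tuple per alignment column
    return [i for i, col in enumerate(columns) if position_entropy(col) > threshold]

# Toy alignment (hypothetical sequences, not taken from the dataset)
toy = ["MKAT", "MRAT", "MKGT", "MR-T"]
selected = signature_positions(toy, threshold=0.9)  # the two variable columns
```

Conserved columns have zero entropy and are dropped, so only variable positions survive the filter.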

2.3. Representation of Signature Amino Acid Set

Mutations in the viral proteins determine the pathogenicity or virulence of SIVs [1]. After obtaining the entropy ranking for each position, 36 significant amino acids were screened. Six encoding algorithms for compositional information, position-specific information, and physicochemical properties of amino acids were used to explore the key information required for high-quality predictions [16]. The encoding algorithms that transform the signature amino acid set of each SIV into fixed-length vectors are detailed below.

2.3.1. Amino Acid Composition

The amino acid composition (AAC) is usually a 20-dimensional vector. Because gaps (deletions or insertions) in viral proteins occurred frequently during the evolution of SIV, we defined the AAC as a 21-dimensional vector to represent the frequency of the 20 amino acid residues and one gap in the 36 signature positions of the viral proteins. For example, if amino acid type $a$ occurs $n_a$ times in the amino acid set of a specific virus, the frequency of $a$ is denoted as $f_a = n_a / 36$. A 21-dimensional feature vector that represents the frequencies of the 20 different amino acids and one gap was obtained for each strain.
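A minimal sketch of the 21-dimensional AAC encoding is given here. The ordering of the alphabet and the use of `-` for a gap are assumptions made for illustration; the toy input has four residues, whereas each strain in this study has 36.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 standard residues plus '-' for a gap (assumed order)

def aac(residues):
    """21-dimensional composition vector of the signature residues."""
    n = len(residues)
    return [residues.count(symbol) / n for symbol in ALPHABET]

vec = aac("AAK-")  # hypothetical 4-residue set
```

The entries of the vector sum to 1 because every residue in the set belongs to exactly one of the 21 symbols.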

2.3.2. Parallel Correlation-Based Pseudo-Amino-Acid Composition

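The parallel correlation-based pseudo-amino-acid composition (PC-PseAAC) method was used to compute the parallel correlation of any two amino acids in the 36 signature amino acid positions in the viral protein sequences [22]. For a virus $D$, the PC-PseAAC feature vector was defined by

$$D = [x_1, x_2, \ldots, x_{21}, x_{21+1}, \ldots, x_{21+\lambda}]^{T},$$

where

$$x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{21} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 21, \\ \dfrac{w\,\theta_{u-21}}{\sum_{i=1}^{21} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 21 + 1 \le u \le 21 + \lambda, \end{cases}$$

where $\lambda$ is an integer, $f_u$ represents the normalized occurrence frequency of the 20 amino acids and one gap in virus $D$, $w$ is the weight factor, $\lambda$ is the highest tier of the correlation along $D$, and $\theta_j$ is the correlation function that measures the $j$-tier sequence-order correlation between all the $j$-th most contiguous residues along $D$. The function is given as

$$\theta_j = \frac{1}{L - j} \sum_{i=1}^{L-j} \Theta(R_i, R_{i+j}), \qquad \Theta(R_i, R_j) = \frac{1}{5} \sum_{k=1}^{5} \left[ H_k(R_i) - H_k(R_j) \right]^2,$$

where $L$ is the number of residues and $H_k(R_i)$ ($k = 1, \ldots, 5$) represents the five amino acid factors that correspond to the $i$-th amino acid $R_i$ in virus $D$, respectively [23]. If is , then equals . The five factors for a gap (deletion or insertion) were simply set to zero.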

2.3.3. G-Gap Dipeptide Composition

The G-gap dipeptide composition (GGAP) is the dipeptide composition coupled with local order information of any two interval residues among the 36 amino acid residues of the 11 viral proteins for each virus. The GGAP is a commonly used feature descriptor for sequence analysis and model construction. In this paper, GGAP is a 441-dimensional vector that represents the frequency of dipeptides comprising 20 amino acid residues and one gap. It is defined as $F^{g} = [f_1^{g}, f_2^{g}, \ldots, f_{441}^{g}]^{T}$, where $f_i^{g}$ is the occurrence frequency of the $i$-th ($i = 1, 2, \ldots, 441$) G-gap dipeptide, which is defined as $f_i^{g} = n_i^{g} / \sum_{j=1}^{441} n_j^{g}$, where $n_i^{g}$ is the occurrence number of the $i$-th G-gap dipeptide in the 36 signature amino acid residues. The dimension of the GGAP feature vector is $21 \times 21 = 441$. Deletions and insertions are also counted.
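The GGAP encoding can be sketched as follows. The sketch assumes the common convention that `g` is the number of residues lying between the two members of a pair (so `g = 0` gives adjacent dipeptides); the alphabet order, the function name, and the toy input are likewise assumptions for illustration.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus '-' for a gap (assumed order)
DIPEPTIDES = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 21 * 21 = 441 pairs

def ggap(residues, g):
    """Frequencies of residue pairs separated by g intervening positions."""
    n_pairs = len(residues) - g - 1
    if n_pairs <= 0:
        raise ValueError("g is too large for this residue set")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        counts[residues[i] + residues[i + g + 1]] += 1
    return [counts[d] / n_pairs for d in DIPEPTIDES]

v = ggap("ACAC-", g=1)  # hypothetical input; pairs are A.A, C.C and A.-
```

Each choice of `g` yields a separate 441-dimensional descriptor, which is how one encoding can contribute several entries to a feature pool.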

2.3.4. Twenty-Bit Features

Position-specific information and physicochemical properties were used to encode the 36 amino acid residues for each virus. Five physicochemical property descriptors of the standard amino acids were constructed, namely, polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge [23]. For each descriptor, the standard amino acids were classified into three groups, and the deletion/insertion (indel) was regarded as the fourth group. The 20 standard amino acids and one indel were thus represented according to the five physicochemical properties. Each residue was encoded as a 20-bit vector (five descriptors with four groups each) comprising 0/1 elements, where a bit was set to 1 if the residue belongs to the corresponding group; otherwise, it was 0. Given the amino acid augment approach, the top $N$ residues with the highest entropy values were selected. The dimension of the feature vector was $20 \times N$.

2.3.5. Twenty-One-Bit Features

The twenty-one-bit feature is a one-hot encoding. In this algorithm, each amino acid residue is transformed into a 21-bit 0/1 vector (e.g., Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0; indel by 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1). Given the amino acid augment approach, each strain with the top $N$ residues was represented by a $21 \times N$-dimensional feature vector.
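A minimal sketch of the 21-bit encoding follows. It places Ala first and the indel last, as in the example above; the order of the remaining residues, the function name, and the parameter name `n_top` are assumptions, and the input is assumed to be already sorted by descending entropy.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # Ala first, indel last; the rest is an assumed order

def bit21(residues, n_top):
    """One-hot encode the first n_top residues (assumed sorted by descending entropy)."""
    vector = []
    for residue in residues[:n_top]:
        bits = [0] * len(ALPHABET)
        bits[ALPHABET.index(residue)] = 1
        vector.extend(bits)
    return vector

v = bit21("A-K", n_top=2)  # hypothetical input: 2 residues give 42 bits
```

Exactly one bit is set per encoded residue, so the vector length grows by 21 for every additional residue kept.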

2.3.6. Overlapping Property Features

This algorithm divided the 20 standard amino acids and one gap (indel) into 11 different groups according to physicochemical properties. The distribution of the 20 standard amino acids in the first 10 groups can overlap, and these 10 groups follow the classification given in [24]. Indels form the 11th group. Each amino acid residue was represented by an 11-dimensional 0/1 vector. A position of the vector was set to 1 if the residue belongs to the corresponding physicochemical property group; otherwise, it was 0. Given the amino acid augment approach, the top $N$ residues with the highest entropy values were selected. The amino acid augment was encoded with an $11 \times N$-dimensional feature vector.

2.4. Framework of Feature Representation Learning

The framework of the feature representation learning algorithm, which includes two main steps, feature representation learning and feature representation optimization, is shown in Figure 1. Firstly, feature representations from a set of feature descriptors are generated using the RF classifier systems. Secondly, the feature representations learnt from the first step are optimized to yield informative feature subsets. The two-step feature representation learning procedure was as follows [16].

2.4.1. Feature Representation Learning

The six feature encoding algorithms were AAC, PC-PseAAC, GGAP, 20-bit features (BIT20), 21-bit features (BIT21), and overlapping property features (OLP), all of which are described above. A feature pool was built with different parameters to supply as much information as possible to the predictive models. For example, the number of top-ranked residues, $N$, is a common parameter for BIT20, BIT21, and OLP. Because the 36 significant amino acids were screened after the entropy ranking was obtained, we set $N$ from 4 to 36 in steps of 4. The maximum value was set as 36 because there were 36 signature positions, and therefore, a total of 27 feature descriptors were obtained for BIT20, BIT21, and OLP. A similar procedure was used for PC-PseAAC and GGAP. With the use of different parameters, a total of 64 feature descriptors were in the feature pool. Information about all the feature descriptors is provided in Table 1.
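The count of 27 descriptors for the three bit-based encodings can be checked with a short sketch. The variable names are hypothetical, and the per-residue widths (20, 21, and 11 bits) are taken from the descriptions of the encodings above.

```python
top_n_values = list(range(4, 37, 4))  # 4, 8, ..., 36: nine settings
bits_per_residue = {"BIT20": 20, "BIT21": 21, "OLP": 11}

# One descriptor per (encoding, number of top residues), with its vector length
descriptors = [
    (name, n, width * n)
    for name, width in bits_per_residue.items()
    for n in top_n_values
]
```

Nine settings for each of three encodings give 27 descriptors; the smallest BIT20 descriptor, for instance, has 20 × 4 = 80 dimensions.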

Before the optimization of feature representation, two types of predictions were used to fulfill the learning. All the 64 descriptors in the feature pool were used to train and predict with the RF models, and two types of predictions were achieved. The first prediction type was the class label (positive or negative): positive samples (swine viruses with the phenotype of cross-species infection) were marked as 1, and negative samples (swine viruses without the phenotype of cross-species infection) were marked as 0. The second prediction type was the pseudo probability that a sample belongs to a certain class (positive or negative). For each prediction type, all 64 outputs computed by the 64 RF models were concatenated as a new feature vector. Each swine virus was eventually represented by two 64-dimensional feature vectors, which were marked “class” and “prob,” respectively. Feature vector “class” comprised the class information learnt from the original feature pool, and feature vector “prob” comprised the probabilistic information. Fast speed for computation was expected for the first type models while high performance for prediction accuracy was expected for the second type models.
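The construction of the two learnt vectors can be sketched as below. The base models are stand-in callables that return a pseudo probability for the positive class, not trained RF models, and deriving the class label with a 0.5 cut-off is an assumption; all names are hypothetical.

```python
def represent(sample_views, models, threshold=0.5):
    """Build the 'class' and 'prob' vectors for one virus.

    sample_views[i] is the virus encoded by descriptor i, and models[i] is the
    classifier trained on that descriptor, returning P(positive) for the view.
    """
    prob = [model(view) for model, view in zip(models, sample_views)]
    cls = [1 if p >= threshold else 0 for p in prob]
    return cls, prob

# Two stand-in models instead of 64 trained random forests (illustration only)
models = [lambda view: 0.8, lambda view: 0.3]
views = ["encoded-by-descriptor-1", "encoded-by-descriptor-2"]
cls_vec, prob_vec = represent(views, models)
```

With 64 base models, each virus would receive one 64-dimensional binary vector and one 64-dimensional probability vector.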

2.4.2. Feature Representation Optimization

The two prediction types were further optimized to improve their feature representation ability. A well-known feature selection method, mRMR, was used to rank the features of the “class” and “prob” information [25]. The mRMR method uses mutual information to maximize the relevance between the selected features and the class labels while minimizing the redundancy among the selected features. The mRMR method was used to optimize the feature representations and obtain the feature list ranked by importance score. The sequential forward search (SFS) strategy was then used to add features from the ranked list one by one [16]. After training the RF classifier on each candidate subset, the feature subset with the best performance was taken as the optimal subset. We obtained 20 optimal features for “class” and 25 optimal features for “prob.”
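The ranking and search steps can be sketched in plain Python. The sketch uses the mutual-information difference form of mRMR on discrete features and a generic `evaluate` callback in place of RF cross-validation; the criterion variant, the function names, and the toy data are assumptions rather than the exact procedure of [25] or [16].

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (bits) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def mrmr_rank(features, labels):
    """Greedy mRMR ranking: relevance to the labels minus mean redundancy."""
    relevance = [mutual_information(f, labels) for f in features]
    remaining, ranked = list(range(len(features))), []
    while remaining:
        def score(j):
            if not ranked:
                return relevance[j]
            redundancy = sum(mutual_information(features[j], features[k]) for k in ranked)
            return relevance[j] - redundancy / len(ranked)
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked

def forward_search(ranked, evaluate):
    """Grow the subset along the ranking and keep the best-scoring prefix."""
    scores = [evaluate(ranked[:k]) for k in range(1, len(ranked) + 1)]
    best_k = max(range(len(scores)), key=scores.__getitem__) + 1
    return ranked[:best_k], scores[best_k - 1]

# Toy data: feature 1 duplicates feature 0, so it is ranked last
y = [0, 0, 0, 0, 1, 1, 1, 1]
feats = [[0, 0, 0, 1, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 0, 0, 1, 1, 0, 1]]
order = mrmr_rank(feats, y)
```

In practice `evaluate` would train a classifier on the selected columns and return a cross-validated score such as ACC or MCC.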

2.5. RF Algorithm

An RF algorithm was used to obtain two types of feature vectors and construct models of prediction for cross-species infection of SIVs. RF machine-learning algorithms are robust and have been used widely to model biology data [4]. The RF behaves like an ensemble algorithm and proposes a set of decision trees by random feature selection. We used the RF algorithm in the R environment in this study [26]. All the experiments were done using version 3.5.0 of R with the default parameters (tree ).

2.6. Evaluation Metrics

We used four commonly used metrics to evaluate the model performance, namely, sensitivity (SN), specificity (SP), accuracy (ACC), and Matthews correlation coefficient (MCC), as follows:

$$\mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where $TP$ indicates true positive, the number of correctly predicted true strains with the phenotype of human infection; $TN$ indicates true negative, the number of correctly predicted true strains without the phenotype of human infection; $FP$ indicates false positive, the number of strains without the phenotype of human infection predicted to be strains with the phenotype of human infection; and $FN$ is false negative, the number of strains with the phenotype of human infection predicted to be strains without the phenotype of human infection. The sensitivity and specificity metrics measure the predictive ability of a model in positive and negative cases, respectively. The other two measures, ACC and MCC, were used to evaluate the overall performance of the models. For these four metrics, high scores indicate high performance of the models.
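The four metrics can be computed from confusion-matrix counts using their standard definitions, as in the sketch below. The function name and the example counts are hypothetical, and MCC is set to 0 when its denominator vanishes, which is a convention chosen here.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}

# Hypothetical counts for one test fold of 77 positives and 114 negatives
m = confusion_metrics(tp=70, tn=110, fp=4, fn=7)
```

Unlike ACC, MCC uses all four counts symmetrically, which makes it informative when the classes are imbalanced.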

The receiver operating characteristic (ROC) curve, which is used to evaluate the overall performance of a binary classifier system [27], was also used in this study. The ROC curve is generated by plotting the true positive rate against the false positive rate under different classification thresholds. We also calculated the area under the ROC curve (AUC) to evaluate the predictive performance of the models. AUC values range from 0 to 1; a value of 0.5 corresponds to random classification, and a value of 1 corresponds to perfect classification.
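The AUC can also be computed without drawing the curve, through its rank interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses this pairwise form with ties counted as one half; the function name and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
mixed = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

The pairwise form is quadratic in the number of samples, which is acceptable for a dataset of about 1900 strains.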

2.7. Tenfold Cross-Validation Method

The 10-fold cross-validation method was used to evaluate the predictive performance of the models. In each round, the models were trained on 90% of the samples (692 positive and 1019 negative) selected randomly from the cleaned dataset, and the remaining 10% of samples (77 positive and 114 negative) were held out as a test set to assess the performance of the classifiers. This process was repeated 10 times, and the 10 results were averaged to obtain the final evaluation of prediction performance.
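A split with these fold sizes can be produced by a stratified assignment, sketched below. Whether the original partition was stratified by class is an assumption, as are the function name and the fixed random seed.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, preserving class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal indices round-robin across folds
    return folds

labels = [1] * 769 + [0] * 1133  # class sizes of the final dataset
folds = stratified_folds(labels)
```

Each fold serves once as the test set while the other nine are used for training; the first fold here contains 77 positive and 114 negative samples.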

3. Results and Discussion

3.1. Signature Position of SIV

After elimination of redundancy and other necessary cleaning of viral data from the GISAID database, the final dataset for the prediction of cross-species infection contained two categories of viruses: 769 viruses isolated from human and 1133 viruses isolated from swine. The 769 human viruses were considered as positive samples because they were verified to have the ability of infection among humans. The 1133 swine viruses were considered as negative samples. Information about these virus strains is summarized in Table S1.

To screen the signature positions, the entropy at each position of the 11 viral proteins was calculated. As shown in Table 2, the HA protein contained the highest number of selected amino acid residues (25/36), which is consistent with the known role of HA in receptor binding and fusion activity during cross-species infection of SIVs. Positions HA102-HA290 are located in or close to the host receptor binding region [28, 29], and HA163 and HA189 are related to the specificity of receptor binding [30, 31]. The signature positions were thus related to the mechanism of interspecies transmission or to efficient transmission among humans, which supports the rationale of the model and benefits prediction accuracy.

3.2. Optimal Feature Representations

The mRMR feature ranking algorithm was used to select the 64-dimensional feature vector, which comprised the predictions from the 64 RF models (Figure 1). A ranked feature list for the 64 features was generated after sorting their importance scores from the mRMR algorithm. The sequential forward search strategy was proposed to explore the optimal feature representations from the ranked list of 64 features. The features were increased one by one according to the sequence in the list, and the RF classifiers were trained. The influenza virus data were tested with the 10-fold cross-validation method. The sequential forward search curves for the ACC and MCC metrics were drawn to find the optimized feature (Figure 2). For the class features, the RF classifier performed best with maximum ACC and MCC of 95.69% and 91.03%, respectively, when the feature number 20 was selected (Figures 2(a) and 2(b)). This result indicates that the first 20 features from the ranked feature list had the optimal representation ability to distinguish swine viruses with the ability of cross-species infection. For the probabilistic features, the RF classifier performed best with the first 25 features (ACC of 96.37% and MCC of 92.46%; Figures 2(a) and 2(b)). The screened 20 class and 25 probabilistic features were used to build the predictive classifiers of cross-species infection.

3.3. Comparison of Optimal Feature Representations with Individual Descriptors

Using the class feature, optimal representation vectors with 20 dimensions were obtained from 20 individual feature descriptors. The predictive performance of the optimal features was compared with that of the six top individual descriptors to evaluate the learning ability of the feature representation. The 10-fold cross-validation tests were performed on the dataset.

The performances of the optimal class features and the compared individual features are illustrated in Figure 3(a), and the ROC curves are shown in Figure 3(b). The optimal features gave the best predictive performance with maximum ACC and MCC of 95.69% and 91.03%, respectively, which are higher than the values obtained with the second-best feature descriptor, BIT20 with the top four residues (Figure 3(a)). The AUC (0.97) obtained using our feature descriptor was better than that of this BIT20 descriptor. Notably, only 20 features were used for the optimal feature representation, whereas the BIT20 descriptor used 80 features. Moreover, the 25 optimal feature representations based on probabilistic information were compared with the individual feature descriptors. The performances of the optimal probabilistic features and the compared individual features are shown in Figure 3(c), and the ROC curves are shown in Figure 3(d). The results were consistent and indicated that the probabilistic feature representations outperformed the six individual feature descriptors. The optimal features gave the best predictive performance with maximum ACC and MCC of 96.37% and 92.46%, respectively, which are higher than those of the six individual features.

3.4. Comparison of Class and Probabilistic Information

Influenza viruses were represented by class and probabilistic information, and their feature vectors comprised the predictions of the 64 RF models to encode signature positions of 36 amino acids. The performance of different information to predict cross-species infection of swine influenza virus was evaluated. As shown in Table 3, the feature vector using probabilistic information outperformed the feature vector using class information. The overall performance based on probabilistic information had ACC and MCC values of 95.95% and 91.59%, respectively, whereas the overall performance based on class information had ACC and MCC values of 95.22% and 90.05%, respectively. The performances of the two optimal feature vectors also are shown in Table 3. The overall 64-dimensional feature vectors encoded with class and probabilistic information were compared with the optimal features. After ranking by the mRMR algorithm, the performance based on the optimal probabilistic information increased from 95.95% to 96.37% for ACC and from 91.59% to 92.46% for MCC, and the performance based on class information increased from 95.22% to 95.69% for ACC and from 90.05% to 91.03% for MCC. These results confirmed that the probabilistic feature identified infection better than the class feature. However, both feature types had predictive power for cross-species infection and were used to construct predictive models.

3.5. Comparison of Feature Representation Learning with Ensemble Learning

Traditional ensemble learning methods combine the predictions of multiple models to make classification decisions. The feature representation learning used and optimized the predictions of the 64 RF models to obtain a predictive model that is similar to traditional ensemble learning models. Two types of feature representation learning (class information learning and probabilistic information learning) were compared with two classical ensemble learning methods (majority voting and probability averaging). Majority voting takes the class predictions of the 64 RF models and predicts according to the majority rule. Probability averaging averages the probabilistic values of the 64 RF models and makes a prediction based on a threshold. As shown in Table 4, both types of feature representation learning gave better performances than the two traditional ensemble learning methods. With the probabilistic information, the feature learning strategy had maximum ACC and MCC of 96.37% and 92.46%, respectively. The ACC and MCC obtained with our strategy were about 2% and 3% higher, respectively, than those obtained with the ensemble strategies. Based on class information, the feature learning strategy had maximum ACC and MCC of 95.69% and 91.03%, respectively. The ACC and MCC obtained with our strategy were about 1% and 3% higher, respectively, than those obtained with the ensemble strategies. Notably, our feature learning strategy achieved a clear improvement, even though ensemble learning is considered an effective way to improve predictive performance.
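The two baseline ensemble rules can be written in a few lines. The 0.5 threshold, the handling of ties, and the function names are assumptions made for this sketch.

```python
def majority_vote(class_votes):
    """Predict 1 when more than half of the base models vote positive."""
    return 1 if sum(class_votes) * 2 > len(class_votes) else 0

def probability_average(probs, threshold=0.5):
    """Predict 1 when the mean predicted probability reaches the threshold."""
    return 1 if sum(probs) / len(probs) >= threshold else 0

# Three hypothetical base models: one confident positive, two weak negatives
probs = [0.9, 0.45, 0.45]
votes = [1 if p >= 0.5 else 0 for p in probs]
by_vote = majority_vote(votes)
by_average = probability_average(probs)
```

The toy case shows that the two rules can disagree: the vote is negative, while the averaged probability of 0.6 is positive. Both rules weight every base model equally, whereas a classifier trained on the concatenated outputs can learn which models to trust.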

3.6. Comparison of Our Predictor with Classical Classifiers

We used the RF algorithm and class or probabilistic information to construct a predictor for SIVs. To evaluate the predictive performance of the RF method, we compared our predictor with three traditional classifiers, support vector machine (SVM), Naïve Bayes (NB), and K-nearest neighbor (KNN), on our dataset with 10-fold cross-validation. The parameters for these classifiers were the same as those in references [4] and [32]. The results showed that the RF method gave the best overall predictive performance based on the class information with maximum ACC and MCC of 95.69% and 91.03%, which were 1.32% and 2.71% higher, respectively, than those obtained with the NB method (Figures 4(a) and 4(b)). Our AUC (0.97) was better than that of KNN. We also compared our predictor with the traditional classifiers based on probabilistic information. The probabilistic feature representation with RF outperformed the other three classifiers. The RF method gave the best overall predictive performance based on the probabilistic information with maximum ACC and MCC of 96.37% and 92.46%, which were 2.58% and 5.38% higher, respectively, than those obtained with the NB method (Figures 4(c) and 4(d)). Our AUC (0.98) was better than that of NB. Overall, the results show that the RF method produced better predictions of infection than the SVM, NB, and KNN methods.

4. Conclusions

A model for predicting cross-species infection of SIVs was described in this paper. The major contribution of this predictor is the set of informative features of viral proteins that were learned from a total of 64 feature descriptors, including compositional, position-specific, and physicochemical information. A feature representation learning scheme was proposed. We integrated class and probabilistic information into our feature representations and removed redundant and irrelevant features by feature space optimization to improve the feature representation ability. The 10-fold cross-validation results showed that a high predictive performance was achieved using 20 features based on class information and 22 features based on probabilistic information. We compared the feature representation learning scheme with other learning strategies and confirmed that the feature representation learning scheme gave better predictions. We anticipate that our method will be a useful tool for large-scale identification of swine influenza viruses, will facilitate the characterization of their transmission phenotype, and will accelerate its application in virology.

Data Availability

After registration (https://www.gisaid.org/registration/register/), the public sequences of influenza viruses used in this paper can be downloaded from the GISAID EpiFlu database (http://platform.gisaid.org/epi3/frontend) under the database access agreement (https://platform.epicov.org/epi3/frontend#5aa0ce) and with acknowledgment of the GISAID data contributors (https://www.gisaid.org/help/publish-with-data-from-gisaid/). We used the Python programming language to create an easy-to-use tool that implements our predictor and handles massive data, which is freely accessible via https://github.com/kouzheng/SIVPred-FL.

Ethical Approval

No animals or humans were used in the studies that form the basis of this research.

Conflicts of Interest

The authors declare no conflict of interest, financial or otherwise.

Authors’ Contributions

Xiaoli Qiang designed the framework of analysis. Zheng Kou, Junjie Li, and Saeed Kosari performed all computational work. Zheng Kou and Xinyue Fan implemented the code. Zheng Kou and Xiaoli Qiang wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgments

We would like to acknowledge the originating and submitting laboratories of the viral sequences from the GISAID EpiFlu public database. This work was supported by the National Natural Science Foundation of China (61972109, 62172114).

Supplementary Materials

Information about the final 1902 strains is summarized in Table S1.