Research Article

A Deep Learning Approach for Predicting Antigenic Variation of Influenza A H3N2

Table 3

The prediction performance of our deep learning approach and other existing approaches.

| Approach | Training set | Validation set | Agreement^a (%) | Sensitivity^a (%) | Specificity^a (%) | MCC^a |
| --- | --- | --- | --- | --- | --- | --- |
| Multiple regression [18] | 181 HI experiments | 31878 pairs in Smith’s dataset^b | 89.89 | — | — | — |
| Multiple regression on physicochemical properties [20] | 394 HI experiments | 31878 pairs in Smith’s dataset^b | 96.96 | 99.55 | 82.30 | 0.877 |
| Decision tree [19] | 181 HI experiments | 31878 pairs in Smith’s dataset^b | 96.20 | — | — | — |
| Joint random forest method^c [23] | 28690 pairs in Smith’s dataset | 31878 pairs in Smith’s dataset^b | 96.4 | 98.1 | 77.7 | 0.758 |
| Stacked autoencoder^d [32] | 80% of the 8097 pairs in a concise version of Smith’s dataset | 20% of the 8097 pairs in a concise version of Smith’s dataset | 95 | 95 | 93 | — |
| Our deep learning approach^e | The filtered virus pairs formed by 70% of 253 strains in Smith’s dataset | The filtered virus pairs formed by 30% of 253 strains in Smith’s dataset | 97.16 | 96.85 | 97.34 | 0.939 |

^a The mark “—” means that no relevant data are available in the literature.
^b Smith’s dataset contains 31878 pairwise comparisons among 253 viral strains that belong to 11 clusters; of these 31878 virus pairs, the 27098 pairs composed of strains from different clusters contain antigenic variations, whereas the 4780 pairs composed of strains from the same cluster possess similar antigens [36].
^c Yao et al. performed 10-fold cross-validation on Smith’s dataset.
^d The stacked autoencoder model was developed on a concise dataset obtained by removing from Smith’s dataset the sequence pairs that contain more than 9 antigenic variation-causing mutations and then removing the redundant pairs.
^e Our deep learning method was developed on a more concise dataset built from Smith’s dataset (for details of constructing the dataset, see Section 2.1); the advantage of our dataset is that the strains forming the virus pairs in the training set do not overlap with the strains forming the virus pairs in the validation set.
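
The four metrics in Table 3 can be related to one another through a confusion matrix. The sketch below is illustrative only and is not taken from the original work: it assumes the standard definitions of agreement (accuracy), sensitivity, specificity, and the Matthews correlation coefficient (MCC), assumes that the "positive" class is the set of antigenic-variant pairs, and assumes that the rates reported for the multiple regression on physicochemical properties [20] were measured on all 31878 pairs in Smith’s dataset. The function name `binary_metrics` and the reconstructed counts are hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (agreement, sensitivity, specificity, mcc) from confusion counts."""
    total = tp + fn + tn + fp
    agreement = (tp + tn) / total      # fraction of pairs classified correctly
    sensitivity = tp / (tp + fn)       # recall on the positive class
    specificity = tn / (tn + fp)       # recall on the negative class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return agreement, sensitivity, specificity, mcc


# Class sizes stated in footnote b of Table 3.
n_variant = 27098   # pairs from different clusters (assumed positive class)
n_similar = 4780    # pairs from the same cluster (assumed negative class)

# Assumption: rebuild approximate counts from the sensitivity (99.55%)
# and specificity (82.30%) reported for reference [20].
tp = round(0.9955 * n_variant)
tn = round(0.8230 * n_similar)
fn = n_variant - tp
fp = n_similar - tn

agreement, sensitivity, specificity, mcc = binary_metrics(tp, fn, tn, fp)
print(f"agreement={agreement:.4f}  sensitivity={sensitivity:.4f}  "
      f"specificity={specificity:.4f}  MCC={mcc:.3f}")
```

Under these assumptions, the recomputed agreement and MCC come out close to the 96.96% and 0.877 reported in that row. The same row also shows why MCC is informative here: with 27098 variant pairs against 4780 similar pairs, agreement is dominated by the larger class, so a specificity of 82.30% barely lowers the agreement, while it lowers the MCC noticeably.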