Computational and Mathematical Methods in Medicine

Research Article

A Deep Learning Approach for Predicting Antigenic Variation of Influenza A H3N2

Table 3

The prediction performance of our deep learning approach and other existing approaches.


Approaches	Training set	Validation set	Agreement^a (%)	Sensitivity^a (%)	Specificity^a (%)	MCC^a

Multiple regression [18]	181 HI experiments	31878 pairs in Smith’s dataset^b	89.89	—	—	—
Multiple regression on physicochemical properties [20]	394 HI experiments	31878 pairs in Smith’s dataset^b	96.96	99.55	82.30	0.877
Decision tree [19]	181 HI experiments	31878 pairs in Smith’s dataset^b	96.20	—	—	—
Joint random forest method^c [23]	28690 pairs in Smith’s dataset	31878 pairs in Smith’s dataset^b	96.4	98.1	77.7	0.758
Stacked autoencoder^d [32]	80% of the 8097 pairs in a concise version of Smith’s dataset	20% of the 8097 pairs in a concise version of Smith’s dataset	95	95	93	—
Our deep learning approach^e	The filtered virus pairs formed by 70% of 253 strains in Smith’s dataset	The filtered virus pairs formed by 30% of 253 strains in Smith’s dataset	97.16	96.85	97.34	0.939

^aThe mark “—” indicates that the value is not reported in the literature.
^bSmith’s dataset contains 31878 pairwise comparisons among 253 viral strains belonging to 11 clusters. Of these, 27098 pairs consist of strains from different clusters and exhibit antigenic variation, whereas 4780 pairs consist of strains from the same cluster and are antigenically similar [36].
^cYao et al. performed 10-fold cross-validation on Smith’s dataset.
^dThe stacked autoencoder model was developed on a concise dataset obtained by removing from Smith’s dataset the sequence pairs that contain more than 9 antigenic variation-causing mutations and then removing the redundant pairs.
^eOur deep learning method was developed on a more concise dataset built from Smith’s dataset (see Section 2.1 for details of its construction); the advantage of this dataset is that the strains forming the virus pairs in the training set and those forming the pairs in the validation set do not overlap.
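
For readers who wish to check the figures in Table 3, the four metrics can be sketched from a binary confusion matrix as follows. This is a minimal illustration that assumes agreement is the overall fraction of correctly classified pairs and that pairs with antigenic variation are the positive class. The function name `binary_metrics` and the counts used below are hypothetical: the counts are not reported in the source and were chosen only so that the class totals match footnote b (27098 variant pairs, 4780 similar pairs) and the results come out close to the values listed for [20].

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute agreement, sensitivity, specificity and MCC from confusion counts.

    tp/fn: antigenic-variant pairs predicted as variant / as similar.
    tn/fp: antigenically similar pairs predicted as similar / as variant.
    """
    total = tp + fn + tn + fp
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("both classes must contain at least one pair")

    agreement = (tp + tn) / total        # fraction of pairs classified correctly
    sensitivity = tp / (tp + fn)         # recall on variant pairs
    specificity = tn / (tn + fp)         # recall on similar pairs

    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "agreement": agreement,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts (an assumption, not data from the paper):
# 26976 + 122 = 27098 variant pairs, 3934 + 846 = 4780 similar pairs.
m = binary_metrics(tp=26976, fn=122, tn=3934, fp=846)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because Smith’s dataset is strongly imbalanced (about 85% of the pairs are variant pairs), agreement and sensitivity can be high even when specificity is modest, which is why MCC is the more informative single figure when comparing the rows of the table.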