Research Article

Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm

Figure 6

Phylogenetic trees of the amino acid sequences of V-exons for each iteration step. (a) Positively identified V-exon sequences are classified into their respective loci; the clearly delineated clades (i.e., IGHV, IGLV, IGKV, TRAV/D, TRBV, and TRGV) indicate that this classification is correct. The V-exon sequences were aligned with Clustal Omega [27]. The phylogenetic trees were constructed with a maximum likelihood algorithm using the WAG matrix, and 500 bootstrap replicates were performed for validation. The trees were rooted at the midpoint, and the linearization provided by MEGA [29] was applied to improve their visualization. In the initial iteration (b), only known V-exon sequences from human and mouse were used in the training set. From this training, predictions were made by processing 14 primate WGS; the sequences discovered in these primates were then used to retrain the Random Forests, thereby improving the chance of detecting V-genes that are more distant in homology. In the third iteration (c), the program VgeneFinder uncovered 15 times more sequences than at the start of the iteration process. For illustration, a small section of the TRAV clade is magnified (inset). More details of the branch distances can be found in the Supplementary Materials.