Research Article

Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm

Figure 1

Iterative workflow for predicting the V-gene repertoire from WGS datasets. The algorithm bootstraps from a small set of initial V-gene sequences (step 1). These sequences are translated from nucleotide to amino acid sequences, and a multiresolution (MR) feature vector is constructed from each. A Random Forest is trained for each MR level, and the training matrix for each level is saved. In the prediction phase, the collection of exons, obtained from different unconnected contigs of the WGS files, is processed with the Random Forests (one per MR level) to identify those with a sufficient probability of being V-genes (that is, those homologous to the training sets). The result is a set of V-exons classified into their respective loci.
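
The first two stages of the workflow (translation and MR feature construction) can be illustrated with a minimal sketch. The caption does not specify how the MR feature vector is built, so everything here is an assumption made for illustration: each residue is mapped to its Kyte-Doolittle hydropathy value, and at level k the resulting signal is averaged over 2^k contiguous segments. The names `translate` and `mr_features` are hypothetical and are not taken from the authors' software. The Random Forest training step is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Kyte-Doolittle hydropathy scale (assumed here as the per-residue signal;
# the source does not state which property is used).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def translate(nt):
    """Translate a nucleotide string in frame 0; unknown codons become 'X'."""
    nt = nt.upper()
    return "".join(
        CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3)
    )


def mr_features(aa, levels=3):
    """Return {level: features}, where level k holds 2**k segment means.

    Hypothetical multiresolution scheme: coarse levels summarise the whole
    sequence, finer levels summarise progressively shorter segments.
    """
    signal = [KD.get(residue, 0.0) for residue in aa]
    n = len(signal)
    features = {}
    for level in range(levels):
        bins = 2 ** level
        means = []
        for b in range(bins):
            lo, hi = b * n // bins, (b + 1) * n // bins
            segment = signal[lo:hi]
            # Empty segments (sequence shorter than bin count) contribute 0.0.
            means.append(sum(segment) / len(segment) if segment else 0.0)
        features[level] = means
    return features


if __name__ == "__main__":
    protein = translate("ATGGCCTGT")
    print(protein)
    print(mr_features(protein, levels=2))
```

Under this scheme, one classifier per level would receive the fixed-length vector for that level, which matches the caption's description of a separate Random Forest and training matrix for each MR level.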