Research Article

Iterative Variable Gene Discovery from Whole Genome Sequencing with a Bootstrapped Multiresolution Algorithm

Figure 2

Process of obtaining candidate exon sequences. (a) Definition of an in-frame exon sequence between the -AG- start motif and the canonical RSS -CAC- motif. (b) Identification of all candidate sequences between the AG and CAC motifs. (c) Examples of overlapping exon intervals; candidates are reduced with an interval tree, and the best candidate V-genes are chosen by maximum probability. (d) Multiscale decomposition of a sequence, stored as a recursive tree structure. (e) High-level flow diagram of the iterative bootstrap training process: n is the iteration step; a set of V-exons is used to train Random Forests for each level (using more than 100 trees and otherwise default parameters from the sklearn library); the new exons discovered at step n are added to the training set for subsequent iterations; and maximum likelihood criteria are applied to the exon intervals and training matrices.
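The candidate search in panels (a) to (c) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `candidate_exons` and `pick_best`, the length bounds, the requirement that the length be a multiple of three, and the stop-codon filter are all assumptions made for illustration. The overlap step also uses a simple greedy pass rather than the interval tree named in the caption, and the probabilities are taken as given rather than computed by a trained classifier.

```python
import re

# Assumption: a candidate is rejected if it contains an in-frame stop codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_exons(seq, min_len=240, max_len=360):
    """Enumerate half-open (start, end) intervals between AG and CAC motifs.

    start is the first base after an AG; end is the index of the first
    base of a CAC. The length bounds are illustrative assumptions.
    """
    seq = seq.upper()
    # Lookahead patterns so that overlapping motif occurrences are all found.
    starts = [m.start() + 2 for m in re.finditer(r"(?=AG)", seq)]
    ends = [m.start() for m in re.finditer(r"(?=CAC)", seq)]
    found = []
    for s in starts:
        for e in ends:
            length = e - s
            if length < min_len or length > max_len or length % 3:
                continue
            codons = (seq[i:i + 3] for i in range(s, e, 3))
            if any(c in STOP_CODONS for c in codons):
                continue
            found.append((s, e))
    return found


def pick_best(scored):
    """Keep the highest-probability candidate among overlapping intervals.

    scored is a list of (start, end, probability) tuples. This greedy pass
    stands in for the interval-tree reduction described in the caption.
    """
    kept = []
    for s, e, p in sorted(scored, key=lambda t: -t[2]):
        # Accept only if the interval is disjoint from everything kept so far.
        if all(e <= ks or s >= ke for ks, ke, _ in kept):
            kept.append((s, e, p))
    return sorted(kept)
```

In use, every interval returned by `candidate_exons` would be scored by a classifier, and `pick_best` would then keep one interval per overlapping group. The quadratic pairing of start and end motifs is acceptable for a short window but would need an indexed search on whole-genome input.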