Research Article

A Novel Bioinformatics Method for Efficient Knowledge Discovery by BLSOM from Big Genomic Sequence Data

Figure 1

BLSOMs for 100 kb sequences derived from 10 vertebrate genomes. (a) DegPenta. Lattice points containing sequences from multiple species are indicated in black, and those containing sequences from a single species are indicated in color as shown in the keys. (b) G + C%. For each lattice point in the DegPenta BLSOM, G + C% was calculated and divided into 21 categories with an equal number of lattice points. The lattice points belonging to the categories of the highest, middle, and lowest G + C% are shown in wine red, white, and green, respectively. (c) Diagnostic pentanucleotides responsible for species-specific clustering. The occurrence of each pentanucleotide at each lattice point was calculated and normalized by the occurrence expected from the mononucleotide composition of the respective lattice point [16, 17]. This observed/expected ratio is shown using the color scale presented under the panel. This ratio has been shown to be useful in unveiling genome signatures because the oligonucleotide composition can be analyzed independently of a simplex effect reflecting the mononucleotide composition of genomic sequences [16–18].
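The observed/expected normalization described for panel (c) can be illustrated with a minimal sketch. It assumes a zero-order model in which the expected count of a pentanucleotide is the product of its mononucleotide frequencies multiplied by the number of pentanucleotide windows. For simplicity it operates on a single sequence string rather than on a lattice point of the map, and the function name `observed_expected_ratios` is hypothetical, not taken from the article.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def observed_expected_ratios(seq, k=5):
    """Return {k-mer: observed/expected} under a mononucleotide (zero-order) model.

    Illustrative assumption: expected count = (product of base frequencies)
    * (number of valid k-mer windows in the sequence).
    """
    seq = seq.upper()
    # Keep only windows made entirely of unambiguous bases.
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    windows = [w for w in windows if set(w) <= set(BASES)]
    total_windows = len(windows)
    observed = Counter(windows)

    # Mononucleotide frequencies of the same sequence.
    base_counts = Counter(b for b in seq if b in BASES)
    n_bases = sum(base_counts.values())
    freq = {b: base_counts[b] / n_bases for b in BASES}

    ratios = {}
    for kmer in map("".join, product(BASES, repeat=k)):
        prob = 1.0
        for b in kmer:
            prob *= freq[b]
        expected = prob * total_windows
        if expected > 0:  # skip k-mers containing a base absent from seq
            ratios[kmer] = observed[kmer] / expected
    return ratios

if __name__ == "__main__":
    demo = observed_expected_ratios("ACGT" * 50)
    # Over-represented k-mers have ratios above 1, absent ones have ratio 0.
    print(demo["ACGTA"], demo["AAAAA"])
```

A ratio above 1 marks a pentanucleotide that is over-represented relative to what the base composition alone predicts, and a ratio below 1 marks an under-represented one. This is why the measure separates oligonucleotide preferences from a plain G + C% effect.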