Identification and Quantification of Genomic Repeats and Sample Contamination in Assemblies of 454 Pyrosequencing Reads
Identifying contigs from contaminating bacteria in cyanobacterial genome assemblies derived from DNA from nonaxenic cultures. (a) and (b): scatter plots showing, for each contig (minimum length 500 bp), the GC percentage and read depth (log scale) for the P. rubescens NIVA CYA98 (a) and A. flos-aquae (b) assemblies. Contigs with low read depths lie to the left of the dotted line at 10x read depth. The low read depth contigs fall into two clusters based on GC percentage. (c) and (d): MEGAN comparisons of the low (green) and high (red) read depth contigs from the P. rubescens NIVA CYA98 (c) and A. flos-aquae (d) assemblies. Trees are collapsed at the Family taxonomic level. Numbers next to the taxon names are the number of hits summarized to that node and all nodes below it in the NCBI taxonomic tree. Circle sizes are log-scaled relative to the number of hits. For the contigs with low read depths, it is indicated whether they fall into the high or low GC% cluster of contigs. “Not assigned”: contigs that were not assigned to any branch of the tree because their bit score was too low (cutoff at 100) or because they were the only contig assigned to a particular taxon.
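The screen shown in panels (a) and (b) can be illustrated with a minimal sketch: compute the GC percentage of each contig, discard contigs shorter than 500 bp, and split the rest at the 10x read depth line. This is not the authors' code. The function names `gc_percent` and `split_by_depth`, the input layout (a dictionary mapping contig name to sequence and mean read depth), and the choice to treat a depth of exactly 10x as high are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, float, float]  # (contig name, GC %, read depth)


def gc_percent(seq: str) -> float:
    """Return the GC percentage of a sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def split_by_depth(
    contigs: Dict[str, Tuple[str, float]],
    min_length: int = 500,       # minimum contig length from the caption
    depth_cutoff: float = 10.0,  # dotted line at 10x read depth
) -> Tuple[List[Row], List[Row]]:
    """Split contigs into low and high read depth groups.

    Assumption: a contig with depth exactly equal to the cutoff is
    placed in the high group; the caption does not specify this.
    """
    low: List[Row] = []
    high: List[Row] = []
    for name, (seq, depth) in contigs.items():
        if len(seq) < min_length:
            continue  # contigs below the minimum length are not plotted
        row = (name, gc_percent(seq), depth)
        if depth < depth_cutoff:
            low.append(row)
        else:
            high.append(row)
    return low, high
```

The low read depth group returned here corresponds to the contigs left of the dotted line. Separating that group into the two GC% clusters would need a further step, for example a threshold chosen by inspecting the scatter plot, which the caption does not specify.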