Research Article

Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads

Figure 2

The ability of BLAST to discern the correct strain genome (red dashed), the correct species genome (green dashed), and the correct genus label (blue) for 25 bp reads: 63,500 known reads (100 randomly selected reads from each of 635 known genomes) plus 10,200 unknown reads (100 randomly selected reads from each of 102 novel genomes). The ROC curves are obtained by comparing BLAST's bit scores against a varying threshold. The plot demonstrates that BLAST predicts most “known” genomes correctly at the optimal operating point but performs poorly at detecting “unknown” genomes. For strain-level detection, the area under the curve (AUC) is 60.1%, with the best threshold yielding a sensitivity of 99.8% and a specificity of 20.4%. For species-level detection, the AUC is 65%, with a sensitivity of 99.1% and a specificity of 34.7%. For genus-level detection, the AUC is 78.9%, with the best threshold yielding a sensitivity of 98.6% and a specificity of 59.3%. The red line represents the 50% chance line.
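As a rough illustration of how a ROC curve of this kind can be computed from bit scores, the sketch below sweeps a threshold over a list of scores and reports sensitivity, specificity, and AUC. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the convention that a read is called "known" when its bit score meets the threshold, the use of Youden's J to pick the "best" threshold, and the toy scores are all hypothetical.

```python
# Minimal ROC sketch. Assumptions (not from the paper):
#   label 1 = read from a known genome, label 0 = read from a novel genome;
#   a read is called "known" when its bit score >= threshold;
#   the "best" threshold maximizes Youden's J = sensitivity + specificity - 1.

def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for every candidate threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        points.append((t, tp / pos, tn / neg))
    # A threshold above every score calls nothing "known".
    points.append((float("inf"), 0.0, 1.0))
    return points

def auc(points):
    """Trapezoidal area under the curve of sensitivity vs. (1 - specificity)."""
    xy = sorted((1 - spec, sens) for _, sens, spec in points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(xy, xy[1:]))

def best_threshold(points):
    """Pick the operating point with the largest Youden's J."""
    return max(points, key=lambda p: p[1] + p[2] - 1)

# Toy bit scores, invented for illustration only.
scores = [50, 48, 46, 40, 30, 44, 28, 26, 24]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]

points = roc_points(scores, labels)
threshold, sensitivity, specificity = best_threshold(points)
print(f"AUC={auc(points):.3f} threshold={threshold} "
      f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

A high sensitivity paired with a low specificity, as reported in the caption, corresponds in this sketch to an operating point where nearly all known reads clear the threshold but many novel reads clear it too.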