Research Article

Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads

Table 2

Comparison of overall classification accuracies on the 275-strain test set. Overall accuracy is the number of reads identified as “known” and classified into their correct class, plus the number of “unknown” reads that are correctly rejected, divided by the total number of reads. Using 5-fold cross-validation, the maximum standard deviation is 1%. NBC, BLAST, and PhymmBL in their native form cannot detect “unknown” classes, whereas the methods combined with a detector can. Performance is also compared with that of MEGAN and SOrt-ITEMS. SOrt-ITEMS is marked N/A at the species level because it did not classify anything below the genus level. SOrt-ITEMS obtains the highest genus-level accuracy for 500 bp reads (71.0%), but its margin over PhymmBL + detector (70.8%) is smaller than the 1% maximum standard deviation, so the difference is not statistically significant. WebCarma was not included because its overall performance for 500 bp reads was 50% at the species level and 37% at the genus level. Note that adding a detector to NBC and PhymmBL raises overall classification accuracy substantially, for example from 27.5% to 78.0% for NBC on 500 bp reads at the species level.

Species
| Fragment length | NBC | BLAST | PhymmBL | MEGAN | SOrt-ITEMS | NBC + detector | PhymmBL + detector |
|---|---|---|---|---|---|---|---|
| 500 bp | 27.5% | 28.1% | 28.0% | 63.2% | N/A | 78.0% | 78.6% |
| 100 bp | 25.3% | 26.1% | 26.9% | 69.4% | N/A | 78.3% | 81.1% |
| 25 bp | 20.9% | 22.8% | 23.5% | 68.1% | N/A | 74.7% | 73.6% |

Genus
| Fragment length | NBC | BLAST | PhymmBL | MEGAN | SOrt-ITEMS | NBC + detector | PhymmBL + detector |
|---|---|---|---|---|---|---|---|
| 500 bp | 43.4% | 49.2% | 51.4% | 68.8% | 71.0% | 53.6% | 70.8% |
| 100 bp | 37.6% | 42.8% | 44.4% | 66.5% | 64.0% | 54.9% | 67.4% |
| 25 bp | 30.0% | 32.7% | 33.5% | 54.8% | 40.1% | 45.3% | 60.3% |
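
The overall accuracy metric defined in the caption of Table 2 can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `overall_accuracy`, the `UNKNOWN` sentinel, and the toy labels are hypothetical and chosen only for illustration. The sketch assumes each read carries one true label and one predicted label, and that a classifier signals rejection by predicting the sentinel.

```python
from typing import Iterable, Sequence

# Hypothetical sentinel a classifier would emit when it rejects a read as novel.
UNKNOWN = "unknown"


def overall_accuracy(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    known_classes: Iterable[str],
) -> float:
    """Return (correct known reads + correctly rejected unknown reads) / all reads."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have equal length")
    if not true_labels:
        raise ValueError("at least one read is required")

    known = set(known_classes)
    correct = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth in known:
            # A read from a known class counts only if assigned to that class.
            correct += prediction == truth
        else:
            # A read from a class absent from training counts only if rejected.
            correct += prediction == UNKNOWN
    return correct / len(true_labels)


# Toy example: two known species, two reads from novel species.
known = {"E. coli", "B. subtilis"}
truth = ["E. coli", "E. coli", "B. subtilis", "novel sp. 1", "novel sp. 2"]
predicted = ["E. coli", "B. subtilis", "B. subtilis", UNKNOWN, "E. coli"]
accuracy = overall_accuracy(truth, predicted, known)
print(f"overall accuracy: {accuracy:.1%}")
```

Under this definition, a classifier that never outputs the sentinel scores zero on every read from a novel class, which is consistent with the lower accuracies of native NBC, BLAST, and PhymmBL in the table.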