Review Article

The A, C, G, and T of Genome Assembly

Table 4

Some common assembly statistics. Here an upward arrow indicates that higher is better, while a downward arrow implies that lower is better.

Description

N50: the length of the scaffold at which 50% of the total assembled size of the sequence is covered, with scaffolds counted from longest to shortest.

NG50: evaluated in a way similar to N50, except that the 50% threshold is taken with respect to the length of the genome, which is either known or predicted [1, 29].

NA50 and NGA50: these metrics deal with aligned blocks rather than contigs [35].

Continuity: analogous to N50, NA50, NG50, and NGA50, there are other metrics such as N75, NA75, NG75, NGA75, N90, NA90, NG90, and NGA90.

Number of Genes: an assembly that contains more of the highly conserved core eukaryotic genes is considered better [29].

Accuracy: if an assembly reports at least 90% of its bases with a minimum of 5× coverage, it is considered accurate.

Choppiness: the average contig length should be greater than a certain threshold; otherwise, the assembly needs to be redrafted.

Validity: the fraction of the assembly that can be confirmed by a reference sequence [29].

Completeness: an assembly is considered complete if the scaffolds cover more than 90% of the actual genome.

Length of the Longest Scaffold: typically, the greater the length, the better the assembly. The same holds for the shortest scaffold.

Number of Scaffolds > X, where X is a user-defined length; similarly, the percentage of scaffolds > X.

Total Length of the Scaffolds and Total Scaffold Length as Percentage of Estimated Genome Size: the closer the latter is to 100%, the better.

Percentage of Contigs Scaffolded: the percentage of contigs that were connected with one another during the scaffolding process [1].

Number of Gaps in the Assembly: by aligning paired-read data onto scaffolds, one may determine scaffolding errors [1].

Number of Scaffolds: an assembly with fewer scaffolds is assumed to be better. For example, the optimum assembly would be one continuous sequence depicting the true sequence.

LG50 Scaffold Count: the number of scaffolds counted in reaching the NG50 threshold. LG75 and LG90 are defined in the same way.

Percentage of Unscaffolded Contigs: the percentage of contigs that remain unscaffolded, since some contigs may not be joined into any scaffold.
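The length-based statistics in this table (N50, NG50, N75, and the LG50 scaffold count) can all be computed from a list of scaffold lengths. The following is a minimal sketch under stated assumptions, not a reference implementation: the function name `nx_and_lx`, its parameters, and the toy scaffold lengths are hypothetical and chosen only for illustration. NA50 and NGA50 are not covered, because they require alignments to a reference.

```python
from typing import Optional, Sequence, Tuple


def nx_and_lx(
    lengths: Sequence[int],
    fraction: float = 0.5,
    genome_size: Optional[int] = None,
) -> Optional[Tuple[int, int]]:
    """Return (Nx, Lx) for a set of scaffold lengths.

    Nx is the length of the scaffold at which `fraction` of the target
    size is covered when scaffolds are summed from longest to shortest;
    Lx is the number of scaffolds needed to get there.

    If `genome_size` is given, the target is the known or estimated
    genome size (NGx / LGx); otherwise it is the total assembled
    length (Nx / Lx). Returns None when the assembly is too short to
    reach the threshold, in which case NGx is undefined.
    """
    target = genome_size if genome_size is not None else sum(lengths)
    threshold = fraction * target
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    return None


# Toy scaffold lengths (made-up numbers), total assembled length = 300.
scaffolds = [80, 70, 50, 40, 30, 20, 10]
estimated_genome_size = 400  # hypothetical estimate

n50, l50 = nx_and_lx(scaffolds)                                       # (70, 2)
ng50, lg50 = nx_and_lx(scaffolds, genome_size=estimated_genome_size)  # (50, 3)
n75, l75 = nx_and_lx(scaffolds, fraction=0.75)                        # (40, 4)

# Number of scaffolds longer than a user-defined length X.
x = 45
scaffolds_over_x = sum(1 for s in scaffolds if s > x)                 # 3

# Total scaffold length as a percentage of the estimated genome size.
percent_of_genome = 100.0 * sum(scaffolds) / estimated_genome_size    # 75.0
```

In this toy case NG50 (50) is smaller than N50 (70) because the assembly covers only 75% of the estimated genome, which illustrates why NG50 is the fairer basis for comparing assemblies of the same genome.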