Research Article

Linking De Novo Assembly Results with Long DNA Reads Using the dnaasm-link Application

Figure 2

The process of generating scaffolds from contigs and long DNA reads. (a) In the presented example, the input set of contigs consists of five sequences: ACCGAAT, ACTGAAA, GACTTTACGATAACTG, TGGATCTAGC, and ACTGGGACAAAT. The set of long reads contains seven sequences: ACCGAATAAAGACTTTACGATAACT, GCCGAACAACGACTTTACGAT, ACTGATTCCCTTTACAACT, TAACTGAAAATGGATC, TAACTGCCCCTGGATC, TAACTGAAAAACTGGG, and TAACTGCCCCACTGGG. First, a set of k-mer pairs is generated from each long DNA read. The parameters k (k-mer size), d (distance between the beginnings of the k-mers in a pair), and t (sliding step) are set to 5, 10, and 1, respectively. For example, two k-mer pairs are generated from the TAACTGAAAATGGATC read: (TAACT,TGGAT) and (AACTG,GGATC). This step yields a set of 30 k-mer pairs. (b) The connection graph built from the 30 k-mer pairs from the previous step and the five contigs listed above. Each contig forms a separate vertex. K-mer pairs form the edges of the connection graph, depending on the contigs on which their k-mers are located. The three numbers above each edge give the support for that edge, in order: (i) the number of DNA reads, (ii) the number of k-mer pairs, and (iii) the number of DNA reads, where a read is counted only if the number of k-mer pairs in that read is greater than a threshold value (in the presented example, this threshold is equal to 1). (c) The filtered connection graph. The applied filter rejects edges for which there is no DNA read with more than one k-mer pair (the third number above an edge must be greater than 0 for the edge to be kept). The values of all three parameters in the filtering step can be set by the user. (d) The result of the algorithm.
The resulting set of scaffolds consists of four sequences; the only connection in the example joins the ACCGAAT and GACTTTACGATAACTG contigs into the ACCGAATNNNGACTTTACGATAACTG scaffold. This scaffold has not been extended to the right because the connections are ambiguous. The ratio of the number of k-mer pairs related to the source contig (GACTTTACGATAACTG) is smaller than the threshold value (in the example, the threshold value is equal to 0.3). The ratio, in both cases (GACTTTACGATAACTG with TGGATCTAGC and GACTTTACGATAACTG with ACTGGGACAAAT), is equal to 0.5. Note that the run of “N” characters in ACCGAATNNNGACTTTACGATAACTG has length 3, which follows from the positions at which the related k-mer pairs, (ACCGA,GACTT), (CCGAA,ACTTT), (CGAAT,CTTTA), and (CCGAA,ACTTT), map to the contigs. In gap-filling mode, the “NNN” sequence would be replaced with “AAA”, taken from the ACCGAATAAAGACTTTACGATAACT read.
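
The k-mer pair generation in panel (a) and the gap length in panel (d) can be reproduced with a short Python sketch. It is not the dnaasm-link implementation: the function names `kmer_pairs` and `gap_evidence` are hypothetical, k-mers are located by exact matching with `str.find` (first occurrence only), and the gap formula `d - (len(left) - pos_left) - pos_right` is an assumption derived from the mapping positions described in the caption. Only the two contigs that are joined in the example are considered.

```python
def kmer_pairs(read, k=5, d=10, t=1):
    """Return (offset, first_kmer, second_kmer) for each pair that fits in the read.

    k: k-mer size, d: distance between the starts of the two k-mers,
    t: sliding step (same meaning as in the caption).
    """
    return [(i, read[i:i + k], read[i + d:i + d + k])
            for i in range(0, len(read) - d - k + 1, t)]


def gap_evidence(read, left, right, k=5, d=10, t=1):
    """Collect (gap_length, read_bases_in_gap) for pairs spanning left -> right.

    Assumption: a pair supports the connection when its first k-mer occurs in
    `left` and its second k-mer occurs in `right` (exact match).
    """
    evidence = []
    for i, first, second in kmer_pairs(read, k, d, t):
        pos_left, pos_right = left.find(first), right.find(second)
        if pos_left < 0 or pos_right < 0:
            continue  # this pair does not link the two contigs
        # Bases between the end of `left` and the start of `right`.
        gap = d - (len(left) - pos_left) - pos_right
        if gap < 0:
            continue  # overlapping contigs are not handled in this sketch
        start = i + len(left) - pos_left  # read offset where the gap begins
        evidence.append((gap, read[start:start + gap]))
    return evidence


left, right = "ACCGAAT", "GACTTTACGATAACTG"
reads = ["ACCGAATAAAGACTTTACGATAACT", "GCCGAACAACGACTTTACGAT"]

# Example from panel (a): two pairs from a 16-base read.
example = [(a, b) for _, a, b in kmer_pairs("TAACTGAAAATGGATC")]
print(example)  # [('TAACT', 'TGGAT'), ('AACTG', 'GGATC')]

# Every supporting pair from the two reads gives the same gap length.
gaps = [gap for read in reads for gap, _ in gap_evidence(read, left, right)]
scaffold = left + "N" * gaps[0] + right
print(gaps, scaffold)  # [3, 3, 3, 3] ACCGAATNNNGACTTTACGATAACTG
```

Under these assumptions, the four supporting pairs are the ones listed in the caption, and the bases that the first read places in the gap are “AAA”, which matches the gap-filling remark.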