Linking De Novo Assembly Results with Long DNA Reads Using the dnaasm-link Application
Currently, third-generation sequencing techniques, which make it possible to obtain much longer DNA reads compared to the next-generation sequencing technologies, are becoming more and more popular. There are many possibilities for combining data from next-generation and third-generation sequencing. Herein, we present a new application called dnaasm-link for linking contigs, the result of de novo assembly of second-generation sequencing data, with long DNA reads. Our tool includes an integrated module to fill gaps with a suitable fragment of an appropriate long DNA read, which improves the consistency of the resulting DNA sequences. This feature is very important, in particular for complex DNA regions. Our implementation is found to outperform other state-of-the-art tools in terms of speed and memory requirements, which may enable its usage for organisms with a large genome, something which is not possible in existing applications. The presented application has many advantages: (i) it significantly optimizes memory and reduces computation time; (ii) it fills gaps with an appropriate fragment of a specified long DNA read; (iii) it reduces the number of spanned and unspanned gaps in existing genome drafts. The application is freely available to all users under GNU Library or Lesser General Public License version 3.0 (LGPLv3). The demo application, Docker image, and source code can be downloaded from project homepage.
High-throughput sequencing devices, called next-generation sequencers, have provided lots of DNA sequences for various organisms. However, a very large number of draft genome sequences are still incomplete. For example, in GenBank, 90% of bacterial genomes are incomplete . In order to improve the consistency and completeness of the draft of reference genomes, which are produced based on short reads obtained from second-generation sequencers, third-generation long read sequencing can be used. Due to this, third-generation sequencing technologies are becoming more popular; for example, in 2018 the de novo human genome assembled from only long DNA reads was published .
Third-generation sequencing makes it possible to obtain much longer DNA reads compared to second-generation sequencing technologies. However, the error rate in long reads from third-generation devices compared to short DNA reads from second-generation sequencers is significantly higher [3, 4]. Moreover, the cost per sample of third-generation sequencing is higher than second-generation sequencing .
An obvious concept of using both types of reads in de novo assembly, hybrid assembly is currently being explored [6, 7]. There are many possibilities for combining data from second-generation sequencing and third-generation sequencing. The four most popular are listed below.(1)Long DNA reads could be mapped directly onto the de Bruijn graph, which is built from short DNA reads. Then, dedicated algorithms allow us to resolve some ambiguity in the de Bruijn graph, which can improve the consistency of the resulting DNA sequences. Such an approach is implemented in some de novo DNA assemblers for second-generation reads, e.g., Velvet , ABySS , and SPAdes .(2)Long DNA reads could be de novo assembled with dedicated assemblers, e.g., Canu , Falcon , and miniasm . The created DNA sequences can be improved in terms of quality by mapping short DNA reads and correcting assembly errors using Pilon  or quiver  applications.(3)Short DNA reads could be used to correct long DNA reads, for example, with CoLoRMap  or Nanocorr  tools. Then, long and corrected DNA reads could be assembled with assemblers for third-generation sequencing data (as depicted in the previous point).(4)Short DNA reads could be de novo assembled using assemblers dedicated to second-generation sequencing data (as depicted in point 1). Then, long DNA reads could be used to link the resulting DNA sequences (contigs), for example, with LINKS  or SSPACE-LongRead  applications.
In this paper, we present a new application called dnaasm-link for combining the output of a de novo assembler with long DNA reads (point 4 of the above list). Our software contains a module for filling the gaps between contigs with a specified sequence from an appropriate long DNA read. This feature is very important, in particular for complex DNA regions. What is more, our method has a much shorter calculation time as well as much lower memory requirements in comparison to other tools. Significant memory optimization and reduction of computation time may enable the usage of the application for organisms with a large genome, which may be cumbersome or even impossible for existing applications (estimated resources required for scaffolding of a human genome (~3 Gbp): dnaasm-link, 8 h/600GB; LINKS, 2 days/6TB; SSPACE-LongReads, 5 days/130GB).
The presented algorithm was implemented as a new extension of the dnaasm assembler . The demo application, Docker image, and source code are available at the project homepage: http://dnaasm.sourceforge.net.
2. Materials and Methods
The presented algorithm efficiently finds and joins adjacent contigs using long reads. The contigs are produced by a de novo DNA assembler from short and high quality reads from second-generation sequencers. In our approach, the contigs are created using the de Bruijn graph algorithm implemented in a dnaasm assembler . The new algorithm, called dnaasm-link, checks which contigs have a subsequence similar to a subsequence in a long read, then finds adjacent contigs, calculates the distance between contigs, and fills the gap with a sequence from the appropriate long DNA read. The presented approach and details of implementation are described below.
2.1. Finding Adjacent Contigs
The algorithm uses k-mer similarity to find adjacent contigs. This algorithm consists of several stages.
Firstly, a set of k-mers is generated from the input set of contigs, each of them being inserted into a Bloom filter . A Bloom filter is a probabilistic data structure that efficiently tests whether a k-mer is present in a set. The length of analysed k-mers (the value of parameter ) can be set by the user based on the error rate of long DNA reads: the higher the error rate, the lower the value. The default value is 15. This step is depicted in Figure 1(a).
Secondly, a set of long DNA reads begins; a set of k-mer pairs with the distance is generated (paired k-mers that map onto two different contigs are used to link these contigs in the next step of the algorithm). The default value is . It should be mentioned that we do not generate a full k-spectrum here; we rather use the step value , set by default to . This step is depicted in Figure 1(b). The pairs in which both k-mers are in the previously generated Bloom filter are processed further, as depicted in Figure 1(c).
Thirdly, a set of unique k-mers is determined. This process consists of counting the number of instances of a given k-mer in the input set of contigs. K-mers which occur more than once are treated as nonunique. All pairs of k-mers containing at least one nonunique k-mer are removed from further considerations, as depicted in Figure 1(d).
Next, a connection graph is built. This graph is composed of vertices that represent contigs and edges that represent connections between contigs derived from pairs of k-mers from long DNA reads. Each edge contains three parameters that define the strength of the specified connection. These parameters are(i)the number of connections between a given pair of contigs defined as the number of k-mers pairs;(ii)the number of connections between a given pair of contigs defined as the number of DNA reads;(iii)the number of connections between a given pair of contigs defined as the number of DNA reads, where a specified DNA read is taken into consideration if the number of k-mer pairs in this read is greater than the threshold value specified by the user.
After building the connection graph a set of filters is applied to remove some edges representing connections. Filters remove the edges where at least one of the three parameters mentioned above is lower than the corresponding thresholds set by the user.
Finally, the process of generating the resulting set of scaffolds from the connection graph is performed. At first, a list of all vertices from the connection graph is prepared. The list should represent contigs sorted by their lengths in descending order. Next, all vertices on the list are marked as unseen. In each iteration an unseen vertex pointing to the longest contig becomes a seed for a new scaffold. The seed is expanded to the right and to the left by attaching consecutive contigs to both ends, based on the connection graph. During the expansion, two situations can occur: (i) the specified contig is connected only with a single vertex in a contig graph, then, considered contigs are joined; (ii) the specified contig is connected with more than a single vertex. In this situation, the vertex with the largest number of connecting pairs of k-mers is preferred. At this stage, two adjacent contigs are not joined into a single sequence, but rather are separated by a gap/overlap placeholder that will be replaced with a proper sequence by the gap-filling algorithm described in the next section. All vertices used to construct a single scaffold are marked as seen and are not taken into consideration in the next iterations of the algorithm. The process is repeated until there are no unseen vertices. The main steps of the described above algorithm are presented in Figure 2.
2.2. Gap-Filling Algorithm
When generating scaffolds, two contigs may overlap. In this case, a single “N” sign is inserted between them. However, the contigs may be separated by a gap. The final step of the presented algorithm aims to estimate the gap size and to fill it with a fragment of a long DNA read. For each linking k-mer pair, the gap size is calculated based on (1) the fixed distance between k-mers in the long read, (2) lengths of contigs, and (3) k-mers’ offsets on contigs. Sequencing errors in long reads may cause distances computed for each k-mer pair to be different. An average value is taken as an estimate. In the same manner, if the offset of each k-mer pair extracted from the long read is known, it is possible to determine the offset of a subsequence of a read corresponding to each of the gaps in scaffolds. Contigs are covered by reads containing multiple errors, and consequently, multiple different gap sequences may be generated. In the presented application, a gap sequence is taken directly from the read which covers the considered contigs with the greatest number of k-mer pairs.
Numerical experiments were performed to compare the presented application with other available tools and to indicate the advantages of filling gaps in scaffolds using long DNA reads. Briefly, the first experiment compares the quality of results obtained in the presented method with other tools for hybrid assembly. The second experiment was carried out on artificially generated data and it indicates the benefits of using both short and long DNA reads over using only the output from second-generation sequencers. Finally, the calculation time and memory usage of the application compared to other tools were measured.
To evaluate the quality of resulting DNA sequences in experiments we used QUAST  ver. 4.1. We compared DNA sequences in terms of(i)the number of resulting DNA sequences longer than 1000 bp;(ii)the number of misassemblies: sum of relocations, translocations, and inversions;(iii)N50 statistic: the length of the DNA sequence for which the sum of lengths of all sequences of that length or longer is greater than half of an assembly;(iv)NA50 statistic: the same as N50, but not for all resulting DNA sequences, only for a set of aligned blocks which are the results of breaking input DNA sequences at misassembly events;(v)the largest DNA sequence;(vi)the largest alignment, the length of the largest continuous alignment in the resulting DNA sequences;(vii)the average number of mismatches per 100 kbp;(viii)the average number of indels per 100 kbp;(ix)the average number of uncalled bases (Ns) per 100 kbp.
Moreover, we used the BUSCO  ver. 2.0 tool to compare the DNA sequence in terms of the number of reconstructed core genes, genes present as single-copy in at least 90% of the species from the selected group. As part of the evaluation of DNA sequences, we distinguished four groups: (i) complete and single-copy, (ii) complete and duplicated, (iii) fragmented, and (iv) missing core genes. A detailed description of the experiments and the results obtained can be found in the next parts of this section.
3.1. Comparison with Other Tools
We compared the results of our application with other tools for hybrid assembly that connect contigs using long reads. The main objective was comparison in terms of linking the contigs with long DNA reads and filling in the resulting gaps. For the above experiment we used publicly available data for Escherichia coli (4,641,652 bp) and Saccharomyces cerevisiae (12,157,105 bp). Both of the datasets on which we worked came from Nanocorr’s  research (http://schatzlab.cshl.edu/data/nanocorr); the names of the files are provided in Supplementary Materials (available here). The above files are the result of de novo assembly of short DNA reads and the correction of Oxford Nanopore Technologies (ONT) reads by short DNA reads. Basic parameters of the input set of long DNA reads and contigs are presented in Table 1.
We compared our approach with two state-of-the-art tools used to join contigs into scaffolds with long reads: LINKS  ver. 1.8.5 and SSPACE-LongRead  ver. 1.1.0. For additional comparison we used scaffolders designed to operate on another kind of read: paired-end tags (PETs) and mate-pairs (MPs). These were OPERA-LG  ver. 2.0.6, BOSS  and ScaffMatch  ver. 0.9.0. We prepared input data for these scaffolders using the Fast-SG  tool. This application generates paired DNA reads from long reads and maps such paired reads onto preassembled contigs. Parameter values for the applications and the appropriate commands are provided in Supplementary Materials, while the results of the evaluation are presented in Tables 2 and 3.
Our experiment indicates that the dnaasm-link application gives slightly better results than existing tools in terms of the quantity and quality of the resulting DNA sequences. Looking at N50 and the largest DNA sequence, it seems that dnaasm-link largely improves the assembly. In terms of mismatches and core genes, dnaasm-link seems to be in line with the other approaches. What is more, de novo assembly by tools that treat short and long reads differently (LINKS, SSPACE-LongRead, dnaasm-link) gives better results than converting long reads into short reads to increase sequencing coverage followed by de novo assembly.
3.2. Impact of Adding Long DNA Reads to Contigs Generated from Short DNA Reads
We examined how the combination of short and long DNA reads affects the length and quantity of the resulting DNA sequences. In this study we used the Saccharomyces cerevisiae (GenBank NC_001133 … NC_001148, NC_001224) reference genome. From this genome, we generated nine sets of short DNA reads using the pIRS  ver. 1.1.1 application and five sets of long reads using the NanoSim  ver. 1.0.0 tool, where each set had a different depth of coverage. The details of application used and dataset parameters are provided in Supplementary Materials.
The generated short reads were de novo assembled by ABySS ver. 2.0.1, then contigs were linked using long reads. The results, presented in Figure 4, prove that combining long DNA reads with short ones can significantly increase the consistency of the resulting assemblies by reducing the final number of scaffolds. Moreover, increasing the coverage of any sequencing technology above a certain level does not improve the results further.
Next, we investigated how the use of long DNA reads affects the reconstruction of complex DNA structures such as long tandem repeats. We compared our method to a technique where gaps are filled with short DNA reads. In this experiment we generated an input set of reads for two organisms: Escherichia coli (GenBank NC_000913) and Saccharomyces cerevisiae (GenBank NC_001133 … NC_001148, NC_001224). We used the same applications, pIRS and NanoSim as before. Their parameters are provided in Supplementary Materials. The short reads were de novo assembled by ABySS . Next, we linked contigs with long DNA reads using the dnaasm-link tool in two modes: with and without gap filling. Then, the scaffolds produced by dnaasm-link without gap filling were treated by three tools for filling gaps with short DNA reads: GapFiller  ver. 1.10.0, Sealer  ver. 1.9.0, and SOAPdenovo2 GapCloser  ver. 1.12.0. Finally, we compared a number of detected tandem repeats using the Tandem Repeats Finder application . This application was also launched on the reference genomes, to determine ground truth data for this study. The results presented in Table 4 depict the advantage of filling gaps using dnaasm-link over other existing methods.
3.3. Time and Memory Usage
We examined the dnaasm-link application in terms of performance, as this can be crucial in the analysis of large volume sequencing data. Our application was compared with LINKS  and SSPACE-LongRead  in terms of time and memory usage. The results are presented in Figure 5.
As expected, combining contigs in applications with accurate mapping takes much more time than in k-mer based tools, in particular, because of the time required to map long DNA reads to preassembled contigs. For example, the calculation time of the SSPACE-LongRead application, for which BLASR  software is used in the mapping process, is over 15 times longer than for tools using a k-mer approach, like the dnaasm-link tool. Our tool is significantly faster than the LINKS application, because LINKS, which uses a similar algorithm, is implemented in Perl. In addition, the LINKS application requires much more RAM memory; for example, for a genome of size 100 Mbp and coverage of long reads equal to 30x, the LINKS application uses over 200 GB of RAM memory, and our application only 18.3 GB.
The dnaasm-link application is a new tool for both connecting contigs and filling the gaps between them with long DNA reads. The presented results indicate that the application works similarly to existing tools in terms of the quality of the resulting DNA sequences. However, it works significantly faster with much less RAM memory usage, which can be crucial for large volume sequencing data. Moreover, the presented software contains a module for filling the gaps between contigs with a specified sequence from an appropriate long DNA read, which is not implemented in similar tools.
The procedure of filling the gaps with an appropriate fragment of a specified long DNA read can significantly increase the parameters of the resulting DNA sequences (in the resulting DNA sequences there will be fewer gaps, which may lead to more detailed analyses, e.g., genome annotation). In the presented study we indicated that a very large number of complex DNA structures, especially tandem repeats, could not be properly reproduced without using long DNA reads. Moreover, the addition of long DNA reads, even with very low coverage, can significantly reduce the number of resulting DNA sequences and improve their consistency in relation to the results obtained only from short DNA reads.
In the presented application, a gap within scaffolds could be optionally filled with a fragment of a single long DNA read. However, this solution is not ideal, because such a read may contain many errors, especially if the long reads are raw, i.e., if errors have not been corrected before. In order to control this issue, in the future we plan to add a module to create consensus from several DNA reads. The result of the consensus of several long reads would be inserted into the gap instead of the raw fragment of a single long read, which would significantly reduce the number of errors in the considered DNA fragments. However, the preliminary study shows a big increase in time complexity when consensus is calculated with the use of a multialignment dynamic programming algorithm.
In the future, we also plan to add a module for analysing the similarity of k-mers, which would take into account the fact that the k-mers may contain errors. The presented tool is based on k-mers, which should contain as few errors as possible, because each single error in the specified DNA sequence causes the creation of erroneous k-mers in the k-spectrum. To deal with this problem, in the next version of the software, we will add a module which will investigate the profile of a specified k-mer and compare it to the profiles of other k-mers. The profile will contain several pieces of information, e.g., number of specified 2-mers and their location in the investigated k-mer.
The presented application is available under GNU Library or Lesser General Public License version 3.0 (LGPLv3). In order to easily use the software, the demo application with web interface as well as the Docker  container with the dnaasm-link tool is available. What is more, the user can download binary files as well as source code and compile the application with any changes in the algorithm.
As more and more genomes are sequenced, it becomes desirable to correctly reproduce their DNA sequences, especially, from short and long DNA reads. Here we have presented dnaasm-link, a tool for linking contigs, the result of de novo assembly of second-generation sequencing data, with long DNA reads.
dnaasm-link is implemented in C++ and is freely available under GNU Library or Lesser General Public License version 3.0 (LGPLv3). It and related materials can be downloaded from project homepage http://dnaasm.sourceforge.net.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Funding was provided by Polish National Science Centre grant no 2016/21/B/ST6/01471.
File with commands to reproduce experiments. (Supplementary Materials)
X. Li, L. Kui, J. Zhang et al., “Improved hybrid de novo genome assembly of domesticated apple (Malus x domestica),” GigaScience, vol. 5, 2016.View at: Google Scholar
C. Yang, J. Chu, R. L. Warren, and I. Birol, “NanoSim: nanopore sequence read simulator based on statistical characterization,” GigaScience, vol. 6, no. 4, 2017.View at: Google Scholar
D. Paulino, R. L. Warren, B. P. Vandervalk, A. Raymond, S. D. Jackman, and I. Birol, “Sealer: a scalable gap-closing application for finishing draft genomes,” BMC Bioinformatics, vol. 16, no. 1, p. 230, 2015.View at: Google Scholar
D. Merkel, “Docker: lightweight Linux containers for consistent development and deployment,” Linux Journal, vol. 2014, no. 239, 2014.View at: Google Scholar