International Journal of Genomics

Volume 2017, Article ID 6120980, 12 pages

https://doi.org/10.1155/2017/6120980

## HaVec: An Efficient de Bruijn Graph Construction Algorithm for Genome Assembly

Department of CSE, BUET, ECE Building West Palasi, Dhaka 1205, Bangladesh

Correspondence should be addressed to M. Sohel Rahman; msrahman@cse.buet.ac.bd

Received 9 April 2017; Revised 19 July 2017; Accepted 26 July 2017; Published 27 August 2017

Academic Editor: Brian Wigdahl

Copyright © 2017 Md Mahfuzer Rahman et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

*Background*. The rapid advancement of sequencing technologies has made it possible to regularly produce millions of high-quality reads from the DNA samples in the sequencing laboratories. To this end, the *de Bruijn graph* is a popular data structure in the genome assembly literature for efficient representation and processing of data. Because a de Bruijn graph can contain a very large number of nodes, the main barriers here are memory consumption and running time. Therefore, this area has received significant attention in contemporary literature. *Results*. In this paper, we present an approach called HaVec that attempts to achieve a balance between the memory consumption and the running time. HaVec uses a hash table along with an auxiliary vector data structure to store the de Bruijn graph, thereby improving the total memory usage and the running time. A critical and noteworthy feature of HaVec is that it produces no false positives. *Conclusions*. In general, the graph construction procedure takes the major share of the time involved in an assembly process. HaVec can be seen as a significant advancement in this aspect. We anticipate that HaVec will be extremely useful in de Bruijn graph-based genome assembly.

#### 1. Background

The rapid advancement of next-generation sequencing technologies has made it possible to regularly produce numerous reads from DNA samples in sequencing laboratories. In particular, the number of reads is now in the range of hundreds of millions. Hence, a current challenge is the efficient processing of this data, which may reach a couple of hundred GB. To this end, the *de Bruijn graph* is a popular data structure in the genome assembly literature for efficient representation and processing of data. In a de Bruijn graph, the nodes represent the distinct *k*-mers that occur in the reads, and there is an edge from one node to another if the (*k*–1)-length suffix of the first *k*-mer equals the (*k*–1)-length prefix of the second. Because a de Bruijn graph can contain a huge number of nodes, researchers have been motivated to devise compact representations of this graph. Examples of such work include, but are not limited to, [1–8].
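To make the definition concrete, the following is a minimal Python sketch of an explicit (and deliberately memory-hungry) de Bruijn graph built from a handful of reads. The function name `build_de_bruijn` and the toy reads are illustrative assumptions, not part of any assembler discussed here.

```python
def build_de_bruijn(reads, k):
    """Return {kmer: set of successor k-mers} for the distinct k-mers in the reads."""
    # Collect every distinct k-mer that occurs in some read.
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    # Connect x -> y when the (k-1)-suffix of x equals the (k-1)-prefix of y.
    graph = {kmer: set() for kmer in kmers}
    for kmer in kmers:
        for base in "ACGT":
            candidate = kmer[1:] + base
            if candidate in kmers:
                graph[kmer].add(candidate)
    return graph


# Toy input (hypothetical reads, chosen only for illustration).
graph = build_de_bruijn(["ACGTAC", "GTACG"], k=3)
```

Storing every *k*-mer as a string together with an explicit successor set is exactly the kind of representation that does not scale to hundreds of millions of reads, which is what motivates the compact encodings discussed in the literature.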

The *Bloom filter* is a popular data structure that represents a set and can test whether a given element is present in it, efficiently in terms of both memory and speed. The base data structure of a Bloom filter consists of an *m*-bit array, initialized to zero, together with a fixed number of hash functions. To insert an element or test its membership, one array position is computed with each of the hash functions. To insert, all corresponding positions in the bit array are set to 1. Similarly, the membership operation returns yes if and only if all of these bit positions are set to 1. Note that Bloom filters are probabilistic data structures: a negative response to a membership test guarantees that the element is absent; a positive response, however, does not guarantee that the element is present in the set. Such an incorrect positive response is referred to as a “false positive.”
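The behaviour described above can be sketched in a few lines of Python. This is an illustrative toy, not the implementation used by any of the cited tools: the class name, the use of seeded SHA-256 digests as hash functions, and the parameter values are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: an m-bit array plus several seeded hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at zero

    def _positions(self, item):
        # One array position per hash function (seeded SHA-256 is an assumption).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # "Yes" only if every probed bit is set; may still be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bloom = BloomFilter(num_bits=1024, num_hashes=3)
bloom.add("ACGTA")

# A degenerate 1-bit filter makes the false-positive effect easy to see:
# after a single insertion, every query answers "possibly present".
tiny = BloomFilter(num_bits=1, num_hashes=3)
tiny.add("ACGTA")
```

An inserted element is always reported as present, whereas the 1-bit filter reports `"TTTTT"` as present even though it was never inserted.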

Designing lightweight implementations of de Bruijn graphs has been the focus of attention in recent times. For example, minimum-information de Bruijn graphs, pioneered by [3], are kept lightweight by not recording read locations and paired-end information. A distributed de Bruijn graph is implemented by [4], which reduces the memory usage per node. On the other hand, Conway and Bromage [5] have proposed storing an implicit, immutable graph representation by applying sparse bit array structures. In these methods, portions of the de Bruijn graph are greedily extended to compute local assemblies around sequences of interest, and these methods use negligible memory. Interestingly, Ye et al. [6] proved that a graph roughly equivalent to the de Bruijn graph can be obtained by storing only a sparse subset of the nodes.

Pell et al. [7] have employed a Bloom filter to devise the probabilistic de Bruijn graph. Using their method, the graph encoding can be achieved with as little as 4 bits per node. However, the inherent limitation of the Bloom filter is that it can report false positives, which in their approach results in the introduction of false nodes and false branching. Still, it can be shown that the global structure of the graph is approximately preserved, up to a certain false positive rate. Notably, the authors of [7] do not perform the assembly directly by traversing the probabilistic graph. Instead, the graph is first used to partition the set of reads into smaller sets, and subsequently a classical assembler is used for assembly purposes.

Recently, Chikhi and Rizk [8] have proposed a new encoding of the de Bruijn graph that is again based on a Bloom filter. They have introduced an additional structure that is instrumental in removing critical false positives. One drawback of their approach is the use of auxiliary memory, that is, its strong dependence on the free space on the hard disk. This can severely affect the performance of their approach, and it becomes clearly evident when the number of unique *k*-mers in a file skyrockets. For example, when the number of unique *k*-mers in a file reaches $2 \times 10^{9}$, it takes more than 10 hours to complete the critical false positive calculation. To summarize, their approach, in addition to the RAM usage, requires the free hard disk space to be used over and over again, which ultimately makes the runtime prohibitively high. Another limitation of this approach is that it cannot handle *k*-mers of even length.

According to the present state of the art, memory-efficient Bloom filter representations of de Bruijn graphs have two critical issues, namely, the high running time and the task of false positive computation. On the other hand, other traditional approaches that do not have these issues need much higher memory.

In this paper, we make an effort to alleviate these problems. In particular, we present a new algorithm based on *hashing* and *auxiliary vector data structures* and call this algorithm *HaVec*. The key features of HaVec, which can be seen as the main contributions of this paper, are as follows:

(1) HaVec introduces a novel graph construction approach that has all three desired properties: it is error free, its running time is low, and it is relatively memory efficient.

(2) It introduces the idea of using a hash table along with an auxiliary vector data structure to store the *k*-mers along with their neighbour information.

(3) It constructs a graph representation that generates no false positives. As a result, only true neighbours are found when traversing the whole graph.

We note that some preliminary results of this research work were presented at the 17th International Conference on Computer and Information Technology (ICCIT 2014) [9].

#### 2. Methods

##### 2.1. General Overview

Let us consider the genome assembly process when a de Bruijn graph is used. Because of the high memory requirement, traditional graph representation approaches do not scale well. This is especially true for large graphs having millions of nodes and edges. A Bloom filter can offer a memory-efficient alternative. In this option, edges are not stored explicitly; rather, a presence bit is used for every node. The procedure is well known and briefly described below for completeness. For each node in the graph, a hash value is produced, which along with the table size produces an index in the table. The most popular and easiest method to produce this index is to divide the hash value by the table size and take the remainder. Now, if the node is present, the bit at the index calculated above is set to 1. Similarly, to check the presence (absence) of a node in the graph, we do the same calculation and simply check whether the bit at the corresponding index is 1 (0). At this point, recall that a Bloom filter may produce false positives. Hence, if the bit at the corresponding index is 0, then the node is definitely absent; otherwise, the node is possibly present.

Now the question is how we can compute the edges. Again, the procedure is simple. Recall that a node corresponds to a *k*-mer. So, from a node (say *x*), all possible neighbours can be easily generated. We can then check whether a generated possible neighbour (say *y*) is indeed present in the same way as described above. If *y* is absent in the Bloom filter, we can conclude that the edge (*x*, *y*) is definitely absent in the graph; otherwise, the edge is possibly present.

Now the problem with using the Bloom filter to represent the graph lies in the possibility that more than one node may generate the same index: when divided by the table size, the hash values of more than one node may produce the same remainder. So, there is a chance for a false edge to be created in the graph if a neighbour node is reported falsely, that is, if the corresponding bit has been set by a different node generating the same remainder. This is why we may have false positives when using a Bloom filter.
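The false-edge effect can be reproduced with a tiny example. The sketch below compares neighbour lookup against an exact set with lookup against a single-hash presence-bit table indexed by the remainder. The 2-bit base encoding used as the hash value, the table size of 5, and all names are assumptions chosen so that a collision occurs; they are not taken from any of the tools discussed.

```python
def possible_neighbours(kmer):
    """Drop the first symbol and append each of the four nucleotides."""
    return [kmer[1:] + base for base in "ACGT"]


def encode(kmer):
    """Stand-in hash value: the k-mer read as a base-4 number (assumption)."""
    value = 0
    for base in kmer:
        value = value * 4 + "ACGT".index(base)
    return value


class PresenceBits:
    """One presence bit per table index; index = hash value mod table size."""

    def __init__(self, table_size):
        self.table_size = table_size
        self.bits = [0] * table_size

    def add(self, kmer):
        self.bits[encode(kmer) % self.table_size] = 1

    def __contains__(self, kmer):
        return self.bits[encode(kmer) % self.table_size] == 1


def neighbours(kmer, present):
    """Keep only the candidate neighbours that the membership test accepts."""
    return [n for n in possible_neighbours(kmer) if n in present]


true_nodes = {"ACG", "CGT"}
table = PresenceBits(table_size=5)
for node in true_nodes:
    table.add(node)

exact = neighbours("ACG", true_nodes)  # only the real neighbour
approx = neighbours("ACG", table)      # also contains a false neighbour
```

Here `ACG` encodes to 6 and the absent *k*-mer `CGG` encodes to 26; both leave remainder 1 when divided by 5, so the bit set by `ACG` makes `CGG` appear present and the false edge (`ACG`, `CGG`) is reported.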

If the false positives are eliminated, then the Bloom filter will undoubtedly be one of the best candidates (if not the best) to represent a de Bruijn graph. Note that an increase in the table size of a Bloom filter surely decreases the false positive rate; however, the rate will never become zero. In this paper, we present a crucial observation to tackle this issue: even if the same remainder is produced from more than one node by the abovementioned division operation (i.e., dividing the hash value by the table size), the quotients of those divisions must differ, provided the hash values themselves are distinct. So, if two nodes point to the same index in the hash table, by examining the respective quotient values, we can easily verify which one is falsely generated and which one is the real one. This works like a fairy tale! However, there is a catch: now, for each index in the table, we have to keep track of a mapping between hash values and quotients.
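The observation follows from the identity hash value = quotient × table size + remainder: two distinct hash values cannot agree on both parts. A minimal sketch, assuming a table that stores one quotient per index (the dictionary-based table and the specific numbers are illustrative only):

```python
def slot_and_quotient(hash_value, table_size):
    """Split a hash value into (table index, quotient)."""
    quotient, remainder = divmod(hash_value, table_size)
    return remainder, quotient


TABLE_SIZE = 5
table = {}  # index -> stored quotient (illustrative stand-in for the hash table)

# Store the node whose hash value is 6: index 1, quotient 1.
index, quotient = slot_and_quotient(6, TABLE_SIZE)
table[index] = quotient


def is_present(hash_value):
    """Present only if the index is occupied AND the stored quotient matches."""
    index, quotient = slot_and_quotient(hash_value, TABLE_SIZE)
    return table.get(index) == quotient
```

The hash value 26 maps to the same index as 6 (both leave remainder 1), but its quotient is 5 rather than 1, so `is_present(26)` returns `False` where a plain presence bit would have answered "possibly present".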

Our approach is quite simple and is described below. We use several different hash functions, tried in a fixed order, so that several hash values can be produced for each node. At first, we attempt to store the node at the index generated by the first hash function. If that fails, that is, if some other node has already occupied that index, we use the second hash function, and so on. However, it may very well happen that all of the hash functions fail to provide a free index. In that case, being out of options, we have to resort to our auxiliary vector data structure. We now use the index value generated by the last hash function to select a position in the vector data structure. Note that the same problem of multiple index values pointing to the same position can happen here as well. This is handled by maintaining a list of indices at that position. A (second-level) vector structure is maintained for each index in that list, where all the collided nodes for that index are stored. For a detailed description, please refer to Section 2.3.
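The fallback scheme can be sketched as follows. This is a simplified illustration under stated assumptions rather than HaVec's actual implementation (which is detailed in Section 2.3): the three linear hash functions, the rule `index % vector_size` for choosing a vector position, and all names are hypothetical, and nodes are represented by non-negative integers.

```python
MULTIPLIERS = (3, 5, 7)  # hypothetical hash functions h_i(x) = x * m_i + 1


def hash_values(x):
    return [x * m + 1 for m in MULTIPLIERS]


class HashWithVector:
    def __init__(self, table_size, vector_size):
        self.table_size = table_size
        self.table = [None] * table_size  # index -> (hash function id, quotient)
        # position -> list of (index, [quotients of the nodes collided there])
        self.vector = [[] for _ in range(vector_size)]

    def insert(self, x):
        hashes = hash_values(x)
        for hash_id, h in enumerate(hashes):
            quotient, index = divmod(h, self.table_size)
            if self.table[index] == (hash_id, quotient):
                return  # already stored
            if self.table[index] is None:
                self.table[index] = (hash_id, quotient)
                return
        # Every hash function hit an occupied index: fall back to the vector,
        # using the index produced by the last hash function.
        quotient, index = divmod(hashes[-1], self.table_size)
        bucket = self.vector[index % len(self.vector)]  # assumed position rule
        for stored_index, quotients in bucket:
            if stored_index == index:
                if quotient not in quotients:
                    quotients.append(quotient)
                return
        bucket.append((index, [quotient]))

    def __contains__(self, x):
        hashes = hash_values(x)
        for hash_id, h in enumerate(hashes):
            quotient, index = divmod(h, self.table_size)
            if self.table[index] == (hash_id, quotient):
                return True
        quotient, index = divmod(hashes[-1], self.table_size)
        for stored_index, quotients in self.vector[index % len(self.vector)]:
            if stored_index == index and quotient in quotients:
                return True
        return False


# A deliberately undersized table forces most nodes into the vector.
store = HashWithVector(table_size=4, vector_size=2)
for x in range(10):
    store.insert(x)
```

Because each stored pair (hash function id, quotient) at a given index identifies one hash value exactly, and the toy hash functions are one-to-one, a lookup in this sketch never reports a node that was not inserted.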

##### 2.2. de Bruijn Graphs, Hash Tables, and Auxiliary Vector Structures

As has been mentioned above, HaVec does not maintain an explicit graph structure; rather, it uses the *k*-mer’s information to construct the de Bruijn graph. And it stores the information of the *k*-mers using the hash table and if needed using the auxiliary vector data structures. Given a *k*-mer (i.e., a node), HaVec can generate its correct neighbours simply by examining its neighbour bits. In what follows, we will describe the procedure in detail.

###### 2.2.1. Hash Table Structure

HaVec uses hashing for faster access. In the hash table, for each index, HaVec uses 40 bits, that is, 5 bytes of memory as will be evident shortly (please see also Table 1).
(1) Because we are working on DNA sequences, each node (i.e., *k*-mer) cannot have more than four neighbouring *k*-mers. To compute a possible neighbour of a given *k*-mer, we just need to remove its first symbol and append one of the four nucleotides. Now, there are a total of 16 possible ways in which one *k*-mer can have neighbours:

(i) It can have no neighbours (we have only one possibility).

(ii) Or it can have only one neighbour (we have 4 possibilities).

(iii) Or it can have only 2 neighbours (we have $\binom{4}{2} = 6$ possibilities).

(iv) Or it can have 3 neighbours (we have $\binom{4}{3} = 4$ possibilities).

(v) Or it can have all 4 neighbours (we have only one possibility).

Hence, HaVec employs 4 bits for this purpose, where a particular bit corresponds to a particular nucleotide.

(2) HaVec uses 3 bits to keep track of the hash functions, thereby accommodating a maximum of 8 hash functions (in this setting).

(3) The quotient value can therefore be stored in the remaining 33 bits.
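The 40-bit budget (4 neighbour bits, 3 hash-function bits, 33 quotient bits) can be illustrated with simple bit packing. The field order within the 40 bits, the assignment of nucleotides to bit positions, and the function names below are assumptions made for this sketch; the paper's Table 1 is the authoritative layout.

```python
NEIGHBOUR_BITS = 4   # one bit per nucleotide
HASH_ID_BITS = 3     # identifies one of up to 8 hash functions
QUOTIENT_BITS = 33   # remaining bits of the 40-bit entry

# Assumed bit assignment for the neighbour mask.
NUCLEOTIDE_BIT = {"A": 1, "C": 2, "G": 4, "T": 8}


def pack_entry(neighbours, hash_id, quotient):
    """Pack the three fields into one 40-bit integer (assumed field order)."""
    if not 0 <= neighbours < (1 << NEIGHBOUR_BITS):
        raise ValueError("neighbour mask must fit in 4 bits")
    if not 0 <= hash_id < (1 << HASH_ID_BITS):
        raise ValueError("hash function id must fit in 3 bits")
    if not 0 <= quotient < (1 << QUOTIENT_BITS):
        raise ValueError("quotient must fit in 33 bits")
    return ((quotient << (HASH_ID_BITS + NEIGHBOUR_BITS))
            | (hash_id << NEIGHBOUR_BITS)
            | neighbours)


def unpack_entry(entry):
    """Return (neighbour mask, hash function id, quotient)."""
    neighbours = entry & ((1 << NEIGHBOUR_BITS) - 1)
    hash_id = (entry >> NEIGHBOUR_BITS) & ((1 << HASH_ID_BITS) - 1)
    quotient = entry >> (HASH_ID_BITS + NEIGHBOUR_BITS)
    return neighbours, hash_id, quotient


# A node with neighbours via A and G, stored by hash function 2.
mask = NUCLEOTIDE_BIT["A"] | NUCLEOTIDE_BIT["G"]
entry = pack_entry(mask, hash_id=2, quotient=123456789)
raw = entry.to_bytes(5, "little")  # exactly 5 bytes per table index
```

The 16 neighbour configurations enumerated above correspond to the 16 possible values of the 4-bit mask, and the largest packed entry, `pack_entry(15, 7, 2**33 - 1)`, equals `2**40 - 1`, so every entry fits in 5 bytes.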