Abstract

A haplotype is the pattern of single nucleotide polymorphisms (SNPs) on a single chromosome. Constructing a pair of haplotypes from aligned and overlapping, but intermixed and erroneous, fragments of the chromosomal sequences is a nontrivial problem. The minimum error correction approach aims to minimize the number of errors to be corrected so that the pair of haplotypes can be constructed through consensus of the fragments. We give a heuristic algorithm (HMEC) that searches through alternative solutions using a gain measure and stops whenever no better solution can be achieved. The time complexity of each iteration is $O(m^3 n)$ for an $m \times n$ SNP matrix, where $m$ and $n$ are the number of fragments (number of rows) and the number of SNP sites (number of columns), respectively. An alternative gain measure is also given to reduce the running time. We have compared our algorithm with other methods in terms of accuracy and running time on both simulated and real data, and our extensive experimental results indicate the superiority of our algorithm over others.

1. Introduction

A single DNA molecule is a long chain of nucleotides (base pairs). There are four such nucleotides, represented by the set of symbols $\{A, C, G, T\}$. It is generally accepted that the genomes of two humans are almost 99% identical at the DNA level. However, at certain specific sites, variation is observed across the human population; such a site is commonly known as a “single nucleotide polymorphism,” abbreviated as “SNP” [1]. The nucleotide involved at a SNP site is called an allele. If a SNP site can have only two nucleotides, it is called biallelic; if it can have more than two alleles, it is called a multiallelic SNP [2]. From now on, we will consider the simplest case where only biallelic SNPs occur in a specific pair of DNA.

The single nucleotide polymorphism (SNP) is believed to be the most widespread form of genetic variation [3]. The sequence of all SNPs in a given chromosome is called a haplotype. Haplotyping an individual means determining a pair of haplotypes, one for each copy of a given chromosome. (A chromosome is a complicated structure of a DNA molecule bound by proteins.) This pair of haplotypes completely defines the SNP fingerprint of an individual for a specific pair of chromosomes. Given the two complete sequences of bases, haplotyping is straightforward: one just iterates through both sequences and removes all the sites at which the alleles are common. But haplotyping becomes difficult when we want to construct haplotypes from sequencing data for higher reliability. Sequencing data for a genome does not contain the complete sequence of bases for a specific chromosome; rather, it provides a set of fragments of arbitrary length for the whole genome. Therefore, the actual problem of haplotyping is to find two haplotypes from the set of overlapping fragments of both copies of the chromosome, where the fragments might contain errors and it is not known which copy of the chromosome a particular fragment belongs to.

The problem of haplotyping has been studied extensively. The general minimum error correction (MEC) problem was proved to be NP-hard [4]. It was also proved to be NP-hard even if the SNP matrix is gapless, using a reduction from the MAX-CUT problem [1]. A method based on a genetic algorithm has been proposed to solve this problem [5]. Several heuristic methods have also been proposed to find haplotypes efficiently. HapCUT [6] and ReFHap [7] are two of the most accurate algorithms in this regard.

In this paper, we give a heuristic algorithm for individual haplotyping based on minimum error correction. The complexity of each iteration is $O(m^3 n)$ for an SNP matrix of dimension $m \times n$. The algorithm is inspired by the famous Fiduccia and Mattheyses (FM) algorithm for bipartitioning a hypergraph minimizing the cut size [8]. Extensive simulations indicate that HMEC outperforms the genetic algorithm of Wang et al. [5] in terms of both reconstruction rate and running time, and it has better (in most cases) or comparable accuracy and significantly smaller running time than HapCUT [6], which is the most accurate heuristic algorithm available. We also compared HMEC with some other algorithms such as SpeedHap [9], FastHare [10], MLF [11], 2-distance MEC [12], and SHR-3 [13] using the HapMap-based instance generator and comparison framework [14, 15].

The rest of the paper is organized as follows. In Section 2, we present some definitions and preliminary ideas. In Section 3, we present our algorithm for individual haplotyping. We describe the data structure and complexity of our algorithm in Section 4. We report on an extensive performance study evaluating HMEC with other available techniques in Section 5. Finally, we conclude in Section 6 by suggesting some future research directions. An earlier version of this paper was accepted for presentation at BMEI 2008 [16].

2. Preliminaries

In this section, we give some definitions and preliminary ideas.

Let $S = \{s_1, s_2, \ldots, s_n\}$ be the set of biallelic SNP sites. Let $F = \{f_1, f_2, \ldots, f_m\}$ be the set of fragments produced from the two copies of the chromosome. Each fragment contains information on a nonzero number of SNPs in $S$. Because the SNPs are biallelic, let the two possible alleles for each SNP site be denoted by 0 and 1, where they can be any two elements of the set $\{A, C, G, T\}$. Since all the nucleotides are the same at the sites other than the SNP sites, we can remove these extraneous sites from all the fragments and consider the fragments as sequences over the SNP sites only. Thus each fragment is a string of symbols of length $n$ over the alphabet $\{0, 1, -\}$, where “$-$” denotes an undetermined SNP which we call a hole. All the fragments can be arranged in an $m \times n$ matrix $M$, $M[i][j] \in \{0, 1, -\}$, $1 \le i \le m$, $1 \le j \le n$, where row $i$ is a fragment from $F$ and column $j$ is a SNP from $S$. This matrix is called the SNP matrix; an example is shown in (1):

$$M = \begin{pmatrix}
0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 & 1\\
1 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0\\
1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 & 1 & 0\\
0 & 1 & - & - & 1 & 0 & 1 & 0 & - & -\\
- & - & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0\\
1 & 0 & 1 & - & - & - & 0 & 1 & 0 & 0
\end{pmatrix} \qquad (1)$$

A consecutive sequence of “$-$”s that lies between two nonhole symbols is called a gap. A gapless SNP matrix is one that has no gap in any of its fragments. In (1), the first, second, and third rows have no gaps, while each of the fourth and sixth rows has one gap.
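
As a concrete illustration, the following Python sketch (ours, not part of the paper) counts the gaps in a fragment string, treating “-” as a hole and ignoring leading and trailing holes, which are not gaps by the definition above:

```python
def count_gaps(fragment: str) -> int:
    """Count maximal runs of '-' lying strictly between two non-hole symbols."""
    # Leading/trailing holes are not gaps, so trim them first.
    core = fragment.strip("-")
    gaps = 0
    in_run = False
    for ch in core:
        if ch == "-":
            if not in_run:          # a new run of holes starts here
                gaps += 1
                in_run = True
        else:
            in_run = False
    return gaps
```

For example, `count_gaps("01--1010--")` is 1 (the trailing holes do not count), while `count_gaps("--11010100")` is 0.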

A SNP matrix can be viewed as an ordered set of fragments, where a fragment is an ordered set of alleles. A fragment $f_i$ is said to cover the SNP $s_j$ if $M[i][j] \in \{0, 1\}$ and to skip the SNP $s_j$ if $M[i][j] = -$. Let $f_i$ and $f_j$ be two fragments. The distance between the two fragments, $D(f_i, f_j)$, is defined as the number of SNPs that are covered by both of the fragments and have different alleles. Hence,

$$D(f_i, f_j) = \sum_{k=1}^{n} d(M[i][k], M[j][k]),$$

where $d(x, y)$ is defined as

$$d(x, y) = \begin{cases} 1 & \text{if } x \ne -,\ y \ne -,\ \text{and } x \ne y,\\ 0 & \text{otherwise.} \end{cases}$$
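
The distance function can be transcribed directly into Python (our illustrative sketch; the names `d` and `distance` mirror the notation above, with fragments encoded as strings over "0", "1", and "-"):

```python
def d(x: str, y: str) -> int:
    """Per-site distance: 1 iff both sites are covered (non-hole) and differ."""
    return 1 if x != "-" and y != "-" and x != y else 0

def distance(f1: str, f2: str) -> int:
    """D(f1, f2): number of SNP sites covered by both fragments with
    different alleles."""
    return sum(d(x, y) for x, y in zip(f1, f2))
```

Sites where either fragment has a hole contribute nothing, so `distance("10-1", "11-0")` is 2.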

In (1), the distance between the second and the third fragment is two, as they differ in the seventh and ninth SNP sites (columns).

Two fragments $f_i$ and $f_j$ are said to be conflicting if $D(f_i, f_j) > 0$. Let $P = (C_1, C_2)$ be a partition of $F$, where $C_1$ and $C_2$ are two sets of fragments taken from $F$ so that $C_1 \cup C_2 = F$ and $C_1 \cap C_2 = \emptyset$ [5]. In Figure 1(b), an arbitrary partition corresponding to the SNP matrix of Figure 1(a) is shown. A SNP matrix $M$ is an error-free matrix if and only if there exists a partition $P = (C_1, C_2)$ of $F$ such that for any two fragments $f_i, f_j \in C_k$ ($k = 1, 2$), $f_i$ and $f_j$ are nonconflicting, that is, $D(f_i, f_j) = 0$. Such a partition is called an error-free partition. The partition in Figure 1(b) is not error free, since both $C_1$ and $C_2$ contain conflicting pairs of fragments. For an error-free SNP matrix, a haplotype $h_k$, $k \in \{1, 2\}$, is constructed from its corresponding fragment class $C_k$ using the following formula:

$$h_k[j] = \begin{cases} M[i][j] & \text{for any } f_i \in C_k \text{ that covers } s_j,\\ - & \text{if no fragment in } C_k \text{ covers } s_j, \end{cases}$$

where $C_k$ is called the defining class of haplotype $h_k$, and $h_k[j]$, where $1 \le j \le n$ and $k \in \{1, 2\}$, denotes the $j$th element of the haplotype $h_k$.

We now describe the general minimum error correction problem. If a matrix $M$ is not error-free, there is no error-free partition of $F$. For such a matrix $M$, there will be at least one conflicting pair of fragments in at least one of the classes for every possible partition. Therefore it is impossible to construct haplotypes that are nonconflicting with all the fragments in their defining classes. If we are given a partition $P = (C_1, C_2)$ and two haplotypes $h_1$ and $h_2$ constructed from $P$, then the number of errors that need to be corrected can be calculated by the following formula:

$$E(P) = \sum_{f_i \in C_1} D(f_i, h_1) + \sum_{f_i \in C_2} D(f_i, h_2),$$

where the distance $D(f_i, h_k)$ between a fragment and a haplotype is defined exactly as the distance between two fragments.
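
A direct Python transcription of the error function (our illustrative sketch; the partition is given as two lists of fragment strings and the two haplotypes are given explicitly):

```python
def site_mismatch(x: str, y: str) -> int:
    """1 iff both sites are covered (non-hole) and the alleles differ."""
    return 1 if x != "-" and y != "-" and x != y else 0

def frag_hap_distance(frag: str, hap: str) -> int:
    """D(f, h): corrections needed to make fragment f agree with haplotype h."""
    return sum(site_mismatch(x, y) for x, y in zip(frag, hap))

def error(partition, haplotypes):
    """E(P): sum of D(f, h1) over C1 plus sum of D(f, h2) over C2.
    partition = (C1, C2) as lists of strings, haplotypes = (h1, h2)."""
    return sum(frag_hap_distance(f, h)
               for frags, h in zip(partition, haplotypes)
               for f in frags)
```

An error-free partition with its consensus haplotypes yields `error(...) == 0`; every disagreement between a fragment and its class haplotype adds one correction.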

The MEC problem asks to find a partition $P$ that minimizes the error function $E(P)$ over all such partitions of an SNP matrix $M$.

3. A Heuristic Algorithm

In this section, we give our heuristic algorithm based on minimum error correction which we call HMEC.

Construction of a haplotype from an erroneous class requires the correction of SNP values, that is, alleles, in the fragments. We want to minimize the number of error corrections. Therefore, for each SNP site, the haplotype should take the allele that is present in the majority of the fragments. Let $N_0(C_k, j)$ be the number of fragments of a collection $C_k$ that have 0 in the $j$th SNP. Similarly, $N_1(C_k, j)$ denotes the number of 1s [5]. Therefore, to minimize the number of errors for a specific partition $P = (C_1, C_2)$, the haplotype $h_k$ should be constructed according to the following rule:

$$h_k[j] = \begin{cases} 0 & \text{if } N_0(C_k, j) > N_1(C_k, j),\\ 1 & \text{if } N_0(C_k, j) < N_1(C_k, j),\\ 0 \text{ or } 1 \text{ (arbitrarily)} & \text{otherwise,} \end{cases}$$

where $k \in \{1, 2\}$ and $1 \le j \le n$. In Figure 1(c), two haplotypes $h_1$ and $h_2$, associated with the partition in Figure 1(b), are constructed in this way.
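
A minimal Python sketch of this majority-vote construction (ours, for illustration; ties are broken toward 0 here, which is one arbitrary but fixed choice):

```python
def majority_haplotype(frags, n):
    """Build h for a fragment class: at each site take the majority allele.
    Holes ('-') do not vote; ties are broken toward '0' (an arbitrary choice)."""
    hap = []
    for j in range(n):
        n0 = sum(1 for f in frags if f[j] == "0")   # N0(C, j)
        n1 = sum(1 for f in frags if f[j] == "1")   # N1(C, j)
        hap.append("0" if n0 >= n1 else "1")
    return "".join(hap)
```

For a class `["010", "01-", "0-0"]` the site-wise votes are 3–0, 0–2, and 2–0, giving the haplotype "010".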

We will use a heuristic search to find the best partition. The algorithm starts with a current partition and iteratively searches for a better one. In each iteration, the algorithm performs a sequence of transfers of fragments from their present collection to the other one so that the partition becomes less erroneous. The transfer of a fragment from one collection to the other can increase or decrease the error function $E$. Let the partition before transferring a fragment $f$ be $P$ and the partition that results be $P'$. We define the gain of the transfer as $g(f) = E(P) - E(P')$. Figure 2 demonstrates an example calculation of the gain measure. Let $O = \langle f_{\pi(1)}, f_{\pi(2)}, \ldots, f_{\pi(m)} \rangle$ be an ordering of all the fragments of a partition $P$ such that all the fragments before $f_{\pi(i)}$ in $O$ have already been transferred to form an intermediate partition $P_{i-1}$, and the gain $g_i$ of transferring $f_{\pi(i)}$ is maximum over $P_{i-1}$ among the fragments not yet transferred. Hence, $P_0 = P$ at the start of each iteration. We also define the cumulative gain of a fragment ordering up to the $i$th fragment as $G_i = \sum_{j=1}^{i} g_j$. Here $G_0 = 0$. The maximum cumulative gain is defined as

$$G_{\max} = \max_{1 \le i \le m} G_i.$$
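
The cumulative-gain bookkeeping can be sketched as follows (our illustrative Python; given the per-transfer gains, it returns the maximum cumulative gain and the 1-based position of the last transfer to keep, with 0 returned when no prefix of transfers is an improvement):

```python
def max_cumulative_gain(gains):
    """Return (G_max, i): the best prefix sum of per-transfer gains g_1..g_m
    and the position i of the last kept transfer (f_max).
    (0, 0) means no prefix improves the partition."""
    best, best_i, running = 0, 0, 0
    for i, g in enumerate(gains, start=1):
        running += g                 # G_i = G_{i-1} + g_i
        if running > best:
            best, best_i = running, i
    return best, best_i
```

Note that individual gains may be negative while a longer prefix is still profitable: for gains 2, -1, 3, -4 the cumulative gains are 2, 1, 4, 0, so the first three transfers are kept.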

In Section 4, we shall describe these terms with an example.

In each iteration, the algorithm finds the current ordering $O$ of $P$ and transfers only those fragments of $O$ that together achieve $G_{\max}$; the fragment that is the last to be transferred is referred to as $f_{\max}$. Thus the algorithm moves from one partition to another, reducing the error function by an amount of $G_{\max}$. See Algorithm 1 for a basic description of HMEC. The algorithm continues as long as $G_{\max} > 0$ and stops whenever $G_{\max} \le 0$.

FREE_LOCKS()
CLEAR_LOG()
repeat always
 while there is an unlocked fragment in P do
  begin
   find a free fragment f so that g(f) is maximum
   transfer f to the other class
   update the haplotypes after the transfer
   LOCK(f)
   LOG_RECORD(f, G_i)
  end do
 FREE_LOCKS()
 check the log and find G_max and f_max
 if G_max > 0
  begin
   set new P by rolling back the transfers that occurred
   after the transfer of f_max
   calculate haplotypes of P
   CLEAR_LOG()
   continue with the loop
  end if
 else
  begin
   terminate the algorithm and output current haplotypes
  end else
end repeat
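
For concreteness, the following self-contained Python sketch (ours) re-implements the loop above in a naive way: it recomputes the error function from scratch for every candidate transfer instead of using the efficient data structures of Section 4, so it illustrates the control flow (tentative transfers, log table, maximum cumulative gain, rollback) rather than the stated complexity:

```python
def majority(frags, n):
    """Majority-vote haplotype of a class (ties broken toward '0')."""
    hap = []
    for j in range(n):
        n0 = sum(f[j] == "0" for f in frags)
        n1 = sum(f[j] == "1" for f in frags)
        hap.append("0" if n0 >= n1 else "1")
    return "".join(hap)

def dist(f, h):
    """Mismatches at sites covered by both strings ('-' is a hole)."""
    return sum(1 for x, y in zip(f, h) if x != "-" and y != "-" and x != y)

def error(c1, c2, n):
    """E(P): corrections needed to make each class agree with its haplotype."""
    h1, h2 = majority(c1, n), majority(c2, n)
    return sum(dist(f, h1) for f in c1) + sum(dist(f, h2) for f in c2)

def hmec(fragments, n):
    m = len(fragments)
    side = [i % 2 for i in range(m)]           # arbitrary starting partition

    def classes(s):
        return ([f for f, k in zip(fragments, s) if k == 0],
                [f for f, k in zip(fragments, s) if k == 1])

    while True:
        s, free, log, cum = side[:], set(range(m)), [], 0
        while free:                            # tentative transfers
            base = error(*classes(s), n)
            best_i, best_g = None, None
            for i in sorted(free):             # exact gain of moving fragment i
                s[i] ^= 1
                g = base - error(*classes(s), n)
                s[i] ^= 1
                if best_g is None or g > best_g:
                    best_i, best_g = i, g
            s[best_i] ^= 1                     # transfer and lock the best one
            free.remove(best_i)
            cum += best_g
            log.append((best_i, cum))          # log table: (fragment, G_i)
        g_max = max(c for _, c in log)
        if g_max <= 0:                         # no improving prefix: stop
            c1, c2 = classes(side)
            return majority(c1, n), majority(c2, n)
        for idx, c in log:                     # keep transfers up to f_max only
            side[idx] ^= 1
            if c == g_max:
                break
```

On an easy instance drawn from two haplotypes, this sketch recovers an error-free partition and its haplotype pair; on noisy input it returns the haplotypes of the local optimum at which the search stops.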

4. Data Structures and Complexity

This section deals with the data structures and the complexity of our algorithm. Here we also suggest an approximation to improve the running time of our algorithm.

First, to find $G_{\max}$ in each iteration, the algorithm repeatedly transfers the fragment that has not been transferred previously in this iteration and has maximum gain over all such fragments. To accomplish this, we use a locking mechanism. At the beginning of each iteration, all the fragments are set free. The free fragment with maximum gain is found and tentatively transferred to the other collection. After the transfer, the fragment is locked in its new collection. This tentative transfer creates the first intermediate partition $P_1$. The algorithm then finds the next free fragment with maximum gain in $P_1$ and transfers and locks that fragment to create $P_2$. Thus, free fragments are transferred until all the fragments are locked, and the order of the transfers ($f_{\pi(1)}, \ldots, f_{\pi(m)}$) is stored in the log table along with the cumulative gains ($G_1, \ldots, G_m$). $G_{\max}$ is the maximum of these cumulative gains, and $f_{\max}$ is the corresponding fragment in the log table.

After finishing all such tentative transfers, $P_m$ is generally not a desirable partition. To obtain the current partition of the next iteration, the algorithm checks the log to find $G_{\max}$ and $f_{\max}$, and rolls back the transfers of all the fragments that were transferred after $f_{\max}$. When the rollback completes, the partition is ready for the next iteration.

While tentatively transferring a free fragment, the algorithm needs to find the fragment with maximum gain among the free fragments (those not yet transferred). This requires calculating the gain for each of them. To calculate the gain $g(f)$ of a fragment $f$, we need to calculate the error values of two different partitions: the present intermediate partition $P_i$ and the next partition $P_{i+1}$ that would result if $f$ were transferred. Each of these error functions requires the calculation of two haplotypes from the corresponding collections (see Figure 2). Although $E(P_i)$ and the haplotypes of $P_i$ can be found from the previous transfer, the calculation of $E(P_{i+1})$ requires the construction of the haplotypes of $P_{i+1}$. Since the difference between $P_i$ and $P_{i+1}$ is only one transfer, we can introduce a differential calculation of the haplotypes $h_1'$, $h_2'$ of the next partition from the haplotypes $h_1$, $h_2$ of the present partition. For this purpose, the algorithm stores the $N_0$ and $N_1$ values of the present partition. After a transfer, these values either remain the same or are incremented or decremented by 1. That is why it is possible to construct $h_1'$, $h_2'$ in $O(n)$ time. Computing $E(P_{i+1})$ from the haplotypes requires $O(mn)$ time. Thus the running time to compute $E(P_{i+1})$, as well as the gain $g(f)$, is $O(mn)$.
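
The differential bookkeeping can be sketched as follows (our illustrative Python; `counts_from` and `counts_to` hold the per-site $(N_0, N_1)$ pairs of the two classes, and a transfer touches each site of the moved fragment once, hence the $O(n)$ update):

```python
def apply_transfer(counts_from, counts_to, frag):
    """Move fragment `frag` between classes by updating the per-site
    (N0, N1) count pairs of both classes in O(n) time."""
    for j, ch in enumerate(frag):
        if ch == "0":
            counts_from[j][0] -= 1
            counts_to[j][0] += 1
        elif ch == "1":                # holes ('-') change no counts
            counts_from[j][1] -= 1
            counts_to[j][1] += 1

def haplotype_from_counts(counts):
    """Re-derive the majority haplotype from the counts in O(n) time
    (ties broken toward '0', an arbitrary choice)."""
    return "".join("0" if n0 >= n1 else "1" for n0, n1 in counts)
```

Because only the moved fragment's covered sites change, no other fragment needs to be re-read to keep the haplotypes current.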

For each intermediate partition $P_i$, $0 \le i \le m-1$, we need to compute the gain measures of the $m-i$ unlocked fragments to find the maximum one, which takes $O(m \cdot mn) = O(m^2 n)$ time. The transfer of this fragment requires updating $N_0$, $N_1$, the haplotypes, and $E$, which also needs $O(mn)$ time. Finally, there are $m$ such transfers in each iteration and at most $m$ rollbacks. Thus each iteration requires $O(m^3 n)$ running time.

We now give an example illustrating a single iteration of our algorithm. Figure 3 demonstrates an example iteration of HMEC. We consider that the current partition is the partition given in Figure 1(b) for the SNP matrix of Figure 1(a). All the intermediate partitions $P_i$ are shown sequentially, and the gains of the fragments over the intermediate partitions are shown on the right of each partition. The free fragment with maximum gain is marked in each intermediate partition; for example, the fragment with the maximum gain over $P_0$ has gain two. After each transfer, the transferred fragment is shown locked by a circle. The resulting ordering of the fragments, which is also the order in which they are locked, is stored along with the cumulative gains $G_i$ in the log table. Figure 4 demonstrates the resulting log table of the illustrated iteration. All the tentative transfers after $f_{\max}$ have to be rolled back so that $P_m$ becomes the next current partition.

We now give an approximate gain measure to make our algorithm faster. For a large SNP matrix, running time is critical to the performance of the algorithm. We can approximate the gain of transferring a fragment $f$ by using only the fragment $f$ itself and not the other fragments. The approximate gain is

$$\tilde{g}(f) = D(f, h) - D(f, h').$$

Here $h$ is the haplotype of $f$'s present collection in partition $P$, and $h'$ is the haplotype of $f$'s destination collection in the next partition $P'$. This function ignores the effect of fragments other than $f$ on $E$, but reduces the running time of the gain calculation to $O(n)$. Therefore, the total running time of each iteration becomes $O(m^2 n)$.
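
A sketch of the approximate gain in Python (ours, for illustration; `h_present` and `h_other` stand for the current majority haplotypes of the fragment's own class and of the other class):

```python
def dist(f: str, h: str) -> int:
    """Mismatches at sites covered by both strings ('-' is a hole)."""
    return sum(1 for x, y in zip(f, h) if x != "-" and y != "-" and x != y)

def approx_gain(frag: str, h_present: str, h_other: str) -> int:
    """g~(f): corrections saved on f's current side minus corrections
    incurred on the other side, ignoring f's own effect on the haplotypes."""
    return dist(frag, h_present) - dist(frag, h_other)
```

A fragment that disagrees with its own haplotype but matches the other one gets a large positive approximate gain, which is exactly the kind of fragment the exact measure would also move.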

5. Performance Comparison

In this section, we demonstrate the performance of our algorithm using both real biological and simulated datasets. We compared our algorithm with GMEC [5] and HapCUT [6]. We performed the simulation using data from the angiotensin-converting enzyme (ACE) [17] and the public Daly set [18] to compare with GMEC [5], and used the HuRef data [19] to compare with HapCUT. We also compared HMEC with some other algorithms (SpeedHap [9], FastHare [10], MLF [11], 2-distance MEC [12], and SHR-3 [13]) using the ReHap web interface [15], a HapMap-based instance generator and comparison framework [14].

5.1. Comparison with GMEC

In this section, we compare the performance of our algorithm with the genetic algorithm described in [5], which we call GMEC.

We first sample the original haplotype pair into many fragments with different coverage and error rates. Each fragment works as a distinct sample of the same specimen. Here the coverage rate indicates the percentage of the total columns of the SNP matrix that have been sampled; the remaining slots are holes. We then introduce a specific amount of error into these samples. The simulation was controlled in several ways: the error rate was varied while the number of fragments and the coverage rate were kept constant, and the coverage rate was varied while the number of fragments and the error rate were kept constant.
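
One way such a sampling step might look in Python (ours, for illustration only; the exact generation procedure used in the experiments is not specified at this level of detail, so the parameter names and choices here are assumptions):

```python
import random

def sample_fragments(h1, h2, m, coverage, error_rate, seed=0):
    """Generate m fragments: each one copies a randomly chosen parent
    haplotype, keeps a `coverage` fraction of sites (the rest become
    holes '-'), and flips each kept allele with probability `error_rate`."""
    rng = random.Random(seed)
    n = len(h1)
    frags = []
    for _ in range(m):
        parent = h1 if rng.random() < 0.5 else h2
        keep = rng.sample(range(n), max(1, int(coverage * n)))
        frag = ["-"] * n
        for j in keep:
            allele = parent[j]
            if rng.random() < error_rate:            # sequencing error
                allele = "1" if allele == "0" else "0"
            frag[j] = allele
        frags.append("".join(frag))
    return frags
```

The fixed seed makes a run reproducible; varying `coverage` and `error_rate` independently mirrors the two experimental controls described above.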

Notice that the way we introduced errors and controlled the coverage rate is not necessarily the same as that of [5]. Therefore, the reconstruction rates of the branch-and-bound algorithm described in [5], which is an exact algorithm, should not be compared with those of HMEC.

5.1.1. Experiment on Angiotensin-Converting Enzyme (ACE)

Angiotensin-converting enzyme catalyses the conversion of angiotensin I to the physiologically active peptide angiotensin II, which controls fluid-electrolyte balance and systemic blood pressure. Because it has a key function in the renin-angiotensin system, many association studies have been performed with DCP1 (which encodes angiotensin-converting enzyme) [20]. Rieder et al. completed the genomic sequencing of the DCP1 gene from 11 individuals and reported 78 SNP sites in 22 chromosomes [17].

We take six pairs of haplotypes to perform the simulation. We generate 50 fragments from each of these haplotype pairs with varying coverage and error rates. We perform the simulation for three different coverage rates (25%, 50%, and 75%). For each of these coverage rates, we perform our simulation for several error rates: 5%, 10%, 15%, 20%, 25%, 30%, and 50%. In every case, we compare our algorithm with GMEC. Figure 5 illustrates the comparison, which bears clear testimony to the superiority of our algorithm. For most instances, the reconstruction rate achieved by our algorithm is either 100% or greater than 98%. Only for a few cases with a very high error rate and a low coverage rate does the reconstruction rate fall below 95%. We also perform the experiment for different coverage rates while keeping the error rate constant. Figure 6 illustrates the performance of the two algorithms for various coverage rates. Here also, our algorithm clearly outperforms the genetic algorithm except for very low (and thus unrealistic) coverage values.

5.1.2. Simulation on Data from Chromosome 5q31

In this section, we discuss our simulation results on the data from public Daly set. Daly et al. reported a high-resolution analysis of a haplotype structure across 500 kb on chromosome 5q31 using 103 SNPs in a European derived population which consists of 129 trios [18, 20].

We performed the experiment in exactly the same way as for the angiotensin-converting enzyme. Figure 7 demonstrates the results. The experimental results suggest that HMEC is much better than the genetic algorithm. Again, for most cases, the reconstruction rate achieved by our algorithm is either 100% or greater than 98%. For every instance, our algorithm exhibits better performance than GMEC.

5.1.3. Experiment on Simulated Data

We used simulated data for further evaluation of HMEC. One important advantage of our algorithm is that it takes very little time to reconstruct the haplotypes. Table 1 shows the running times of HMEC and GMEC. Here, we performed the simulation by varying the length of the haplotypes (the number of SNP sites in the haplotype pair) while fixing both the coverage rate and the error rate at 50%. Since real haplotypes of such varying lengths are not available, we rely on simulated data. Clearly, HMEC is much faster than GMEC: for example, while HMEC can reconstruct a haplotype with 936 sites in a fraction of a second, GMEC takes 72 seconds.

5.2. Comparison with HapCUT

HapCUT [6] is one of the most accurate heuristic algorithms for individual haplotyping. HapCUT starts from a random initial haplotype configuration and builds a graph; it computes a max-cut on the graph to find the positions to flip and iterates until no improvement in the MEC score is achieved. We have performed extensive experiments to compare the performance of HMEC and HapCUT. The experimental results suggest that although HapCUT is reliable, its running time is too large for it to be a realistic choice for whole-genome haplotyping. Our algorithm computes the haplotypes significantly faster than HapCUT without losing accuracy.

We used the filtered HuRef data from Levy et al. [19] to evaluate the performance. We generated several test data sets varying the coverage and error rate, and tested the performance of HMEC and HapCUT on these data sets. The results are shown in Tables 2, 3, 4, and 5.

Experimental results indicate that the reconstruction rates of both HapCUT and HMEC are reasonably good. For very low (and thus unrealistic) coverage values (5% and 10%), the reconstruction rate of HapCUT is slightly better than that of HMEC. However, as the coverage rate increases, HMEC begins to outperform HapCUT, and at higher coverage HMEC consistently performs better. Furthermore, HapCUT is significantly slower than HMEC. For instance, with 35% coverage and a 40% error rate, HapCUT takes 215.5 seconds whereas HMEC takes only a fraction of a second (see Table 5). Therefore, although HapCUT generally provides a reliable reconstruction rate, it is an unrealistic choice on large datasets due to its time-consuming operations. HMEC, on the other hand, provides high accuracy in much less time.

We also used some sample data from the ReHap project [14, 15]. We created the allele matrix from the ReHap error matrix and fed the matrix to both the HMEC and HapCUT algorithms. HMEC consistently produces a higher reconstruction rate than HapCUT (see Table 6). The running time of HMEC is also clearly much better than that of HapCUT.

5.3. Comparison Using ReHap

We also compared HMEC with some other well-known algorithms, namely SpeedHap [9], FastHare [10], MLF [11], 2-distance MEC [12], and SHR-3 [13], using the HapMap-based instance generator and comparison framework [14, 15]. The results, shown in Tables 7 and 8, suggest that no single method clearly outperforms the others in all cases. However, the reconstruction rates achieved by HMEC are the highest or very close to the highest. This bears clear testimony to its suitability as a practical tool for individual haplotyping.

6. Conclusion

In this paper, we present a heuristic algorithm (HMEC) based on minimum error correction that computes highly accurate haplotypes significantly faster than the known algorithms for haplotyping. The algorithm is inspired by the famous Fiduccia and Mattheyses (FM) algorithm for bipartitioning a hypergraph minimizing the cut size [8]. We report on an extensive performance study evaluating our approach against other available techniques using both real and simulated datasets. This comprehensive performance study shows that our algorithm outperforms (in most cases) or matches the accuracy of other well-known methods, but runs in a fraction of the time needed by other techniques. High accuracy and a very fast running time make our technique suitable for genome-wide data.

The accuracy of the algorithm can be improved by incorporating some prior knowledge. For example, small groups of fragments that are known to belong to the same haplotype can be identified. Probabilistic methods like expectation maximization (EM) also deserve consideration for such optimization problems. In the near future, we intend to consider these issues to make further improvements.

Acknowledgments

The authors thank Rui-Sheng Wang (one of the authors of [5]) for providing them with the ACE and chromosome 5q31 datasets.