Objective and Comprehensive Evaluation of Bisulfite Short Read Mapping Tools
Table 1
Detailed comparison of different bisulfite short reads mapping tools.
| Programs | Year | Algorithmic technique used | Language | Aligner | Input | Output | Min./Max. read length | Mismatches | Indels | Gaps | Single/Paired-end | Multi-threaded | Nondirectional |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERNE-bs5 | 2012 | Hash genome indexing uses a 5-letter alphabet (Cm, Cu) for storing methylation information and uses a weighted context-aware Hamming distance to identify a T coming from an unmethylated C. | C++ | None | gz/bz2/fastq/fasta | BAM/SAM | up to 600 bp | 1 every 15 bp (-errors arg) | Yes | Yes | both | Yes | No |
| BatMeth | 2012 | FM index integrates mismatch counting, list filtering and mismatch stage filtering and fast mapping onto two indexes. | Perl/C++ | None | fasta | NA | NA | up to 5 (-) in a read | No | No | Yes | Yes | Yes |
| BiSS | 2012 | Reference genome hashing, local Smith-Waterman alignment | Perl | None | fasta/fastq/gz/SAM/BAM | SAM/BAM/Next GenMap | up to 4096 bp | (- from 0 to 1) in a read Default | Yes | Yes | Yes | Yes | No |
| Bismark | 2011 | FM-Index enumerates all possible T to C conversion | Perl | Bowtie/Bowtie2 | fasta/fastq | BAM/SAM | Bowtie: up to 1000 bp; Bowtie 2: unlimited | 0 or 1 in a seed (-) | Yes | Yes | both | Yes | Yes |
| BS-Seeker2 | 2013 | FM-Index enumerates all possible T to C conversion | Python | Bowtie2/Bowtie/SOAP/RMAP | fasta, fastq, qseq, pure sequence | BAM/SAM/BS-Seeker | 50–500 bp | up to 4 per read (-) | Yes | Yes | Single | No | Yes |
| BS-Seeker | 2010 | FM-Index, enumerates all possible T to C conversion, converts the genome to 3 letters, and uses Bowtie to align reads | Python | Bowtie | fasta, fastq, qseq, pure sequence | BAM/SAM/BS_Seeker | 50–250 bp | up to 3 per read (-) | Yes | No | Single | No | Yes |
| BSMAP | 2009 | Hashing of reference genome and bitwise masking tries all possible T to C combinations for reads | Python | SOAP | fasta/fastq/SAM | SAM/txt | up to 144 bp | up to 15 in a read (-) | | up to 3 bp | both | Yes | Yes |
| RMAP | 2008 | Wildcard matching for mapping Ts, incorporates the use of quality scores directly into the mapping process | C++ | | fastq/fasta | BED | unlimited | up to 10 in a read (-) | No | No | both | No | No |
| BRAT-BW | 2012 | Converts a TA reference and CG reference; two FM indices are built on the positive strand of the reference genome | C++ | | Text file with input file names in fastq, sequence only | txt | 32 bp-unlimited | unlimited | No | No | both | Yes | Yes |
| MAQ | 2008 | Builds multiple hash tables to index the reads, scans the reference genome against the hash tables to find hits | Perl/C/C++ | | fastq | maq | up to 63 bp | up to 3 per read | Yes, - | No | both | No | No |
| PASH | 2010 | Implements k-mer level alignment using multipositional hash tables | C | | fastq | Txt/SAM | NA | Yes | Yes | No | Single | No | No |
| Novo-align | 2010 | Hashing genome | C/C++ | | fastq | SAM/BAM | | up to 8 per read, 16 for paired end reads | Yes | Yes, up to 7 bp on single end reads | both | No | Yes |
| Methyl-coder | 2011 | FM-Index, all Cs converted to Ts | C/C++/Python | GSNAP/bowtie | fastq/fasta | BAM/SAM | Bowtie: up to 1000 bp | Yes | No | Yes | both | No | No |
| GSNAP | 2005 | k-mer hashing of reference genome | C/Perl | | gzip/fastq, fasta/bzip2 | SAM/GSNAP | 14–250 bp | Yes | Yes | Yes | both | Yes | No |
| BFAST | 2009 | Uses multiple indexing strategies: hashing and suffix array of the reference genome | C | | fastq/bz2/gzip | SAM | NA | Yes | Yes | Yes | both | Yes | Yes |
| Segemehl | 2008 | Enhanced suffix arrays to find exact and inexact matches. Align to read using Myers bitvector algorithm | C/C++ | | fasta | SAM | unlimited | Yes | (-) | Yes | both | Yes | No |
BFAST does not have a direct option for bisulfite mapping; users must convert Cs to Ts in both the reference genome and the reads, and then align the converted reads to the converted reference genome.
*Parentheses in the Mismatches column indicate the program's parameter for mismatches. *1A min percentages of matches per read.
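The manual conversion described in the BFAST note can be sketched in a few lines of Python. This is a minimal illustration, not part of BFAST or of any tool in Table 1: the function names `c_to_t`, `convert_fasta`, and `convert_fastq` are hypothetical, and only the C-to-T conversion mentioned in the note is shown. It assumes uncompressed input with four lines per FASTQ record.

```python
# Hypothetical helper names; a sketch of in silico C-to-T conversion only.
_C_TO_T = str.maketrans("Cc", "Tt")


def c_to_t(seq: str) -> str:
    """Replace every C with T (and c with t) in a nucleotide string."""
    return seq.translate(_C_TO_T)


def convert_fasta(lines):
    """Yield FASTA lines with sequence lines converted; headers are untouched."""
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else c_to_t(line)


def convert_fastq(lines):
    """Yield FASTQ lines with only the sequence line of each record converted.

    Assumes exactly four lines per record (header, sequence, '+', qualities),
    so the sequence is always the second line of each group of four.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield c_to_t(line) if i % 4 == 1 else line


if __name__ == "__main__":
    reference = [">chr1", "ACCGTC"]
    reads = ["@read1", "CCGA", "+", "IIII"]
    print(list(convert_fasta(reference)))
    print(list(convert_fastq(reads)))
```

The converted reads would then be aligned to the converted reference with the aligner's ordinary options. The original, unconverted reads should be kept, since they are needed afterwards to tell which aligned positions still carry a C.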