Advances in Bioinformatics

Research Article

Objective and Comprehensive Evaluation of Bisulfite Short Read Mapping Tools

Table 1

Detailed comparison of different bisulfite short read mapping tools.

Programs

Year

Algorithmic Technique used

Language

Aligner

Input

Output

Min./Max. read length

Mismatches

Indels

Gaps

Single/Paired-end

Multi-threaded

Nondirectional

ERNE-bs5

2012

Hash-based genome indexing; uses a 5-letter alphabet (with Cm and Cu) to store methylation information and a weighted context-aware Hamming distance to identify a T coming from an unmethylated C.

C++

None

gz/bz2/fastq/fasta

BAM/SAM

up to 600 bp

1 every 15 bp (-errors arg)

Yes

both

Yes

BatMeth

2012

FM-index; integrates mismatch counting, list filtering, mismatch stage filtering, and fast mapping onto two indexes.

Perl/C++

None

fasta

up to 5 (-) in a read

Yes

BiSS

2012

Reference genome hashing, local Smith-Waterman alignment

Perl

None

fasta/fastq/gz/SAM/BAM

SAM/BAM/NextGenMap

up to 4096 bp

(- from 0 to 1) in a read Default

Yes

Bismark

2011

FM-index; enumerates all possible T-to-C conversions

Perl

Bowtie/Bowtie2

fasta/fastq

BAM/SAM

Bowtie: up to 1000 bp Bowtie 2: unlimited

0 or 1 in a seed (-)

Yes

both

Yes

BS-Seeker2

2013

FM-index; enumerates all possible T-to-C conversions

Python

Bowtie2/Bowtie/SOAP/RMAP

fasta, fastq, qseq, pure sequence

BAM/SAM/BS-Seeker

50–500 bp

up to 4 per read (-)

Yes

Single

Yes

BS-Seeker

2010

FM-index; enumerates all possible T-to-C conversions, converts the genome to 3 letters, and uses Bowtie to align reads

Python

Bowtie

fasta, fastq, qseq, pure sequence

BAM/SAM/BS_Seeker

50–250 bp

up to 3 per read (-)

Yes

Single

Yes

BSMAP

2009

Hashing of the reference genome; bitwise masking tries all possible T-to-C combinations for reads

Python

SOAP

fasta/fastq/SAM

SAM/txt

up to 144 bp

up to 15 in a read (-)

up to 3 bp

both

Yes

RMAP

2008

Wildcard matching for mapping Ts, incorporates the use of quality scores directly into the mapping process

C++

fastq/fasta

BED

unlimited

up to 10 in a read (-)

both

BRAT-BW

2012

Converts the reference into a TA reference and a CG reference; two FM indices are built on the positive strand of the reference genome

C++

Text file with input file names in fastq, sequence only

txt

32 bp-unlimited

unlimited

both

Yes

MAQ

2008

Builds multiple hash tables to index the reads, scans the reference genome against the hash tables to find hits

Perl/C/C++

fastq

maq

up to 63 bp

up to 3 per read

Yes, -

both

PASH

2010

Implements k-mer level alignment using multipositional hash tables

fastq

Txt/SAM

Yes

Single

Novo-align

2010

Hashing genome

C/C++

fastq

SAM/BAM

up to 8 per read, 16 for paired end reads

Yes

up to 7 bp on single end reads

both

Yes

Methyl-coder

2011

FM-Index, all Cs converted to Ts

C/C++/Python

GSNAP/bowtie

fastq/fasta

BAM/SAM

Bowtie: up to 1000 bp

Yes

both

GSNAP

2005

k-mer hashing of reference genome

C/Perl

gzip/fastq, fasta/bzip2

SAM/GSNAP

14–250 bp

Yes

both

Yes

BFAST

2009

Uses multiple indexing strategies: hashing and suffix array of the reference genome

fastq/bz2/gzip

SAM

Yes

both

Yes

Segemehl

2008

Enhanced suffix arrays to find exact and inexact matches; aligns reads using Myers' bit-vector algorithm

C/C++

fasta

SAM

unlimited

Yes

(-)

Yes

both

Yes

BFAST does not have a direct option for bisulfite mapping; users must convert Cs to Ts in both the reference genome and the reads, then align the converted reads to the converted reference genome.
*Parentheses in the Mismatches column indicate the program's parameter for mismatches.
^*1A min percentages of matches per read.
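
The manual C-to-T conversion described in the BFAST note, and the idea shared by several tools in the table of tolerating a read T aligned to a reference C, can be illustrated with a minimal Python sketch. The function names (`c_to_t`, `g_to_a`, `bisulfite_mismatches`) are hypothetical and are not taken from any of the listed programs; the sketch assumes plain uppercase or lowercase DNA strings and an ungapped, same-length comparison, which is much simpler than what any real mapper does.

```python
def c_to_t(seq: str) -> str:
    """Forward-strand in-silico bisulfite conversion: every C becomes T."""
    return seq.upper().replace("C", "T")


def g_to_a(seq: str) -> str:
    """Reverse-strand counterpart: every G becomes A."""
    return seq.upper().replace("G", "A")


def bisulfite_mismatches(read: str, ref: str) -> int:
    """Count mismatches between a read and an equal-length reference
    window, without penalizing a read T aligned to a reference C
    (the signature of an unmethylated cytosine)."""
    if len(read) != len(ref):
        raise ValueError("read and reference window must have equal length")
    mismatches = 0
    for r, g in zip(read.upper(), ref.upper()):
        if r == g:
            continue
        if r == "T" and g == "C":
            continue  # consistent with bisulfite conversion
        mismatches += 1
    return mismatches


# Illustrative (made-up) sequences: two Cs in the reference appear as Ts
# in the read, one C is retained (as a methylated C would be).
reference = "ACGTCCGA"
read = "ATGTTCGA"

# After converting both, the read matches the converted reference exactly.
fully_converted_match = c_to_t(read) == c_to_t(reference)

# The asymmetric comparison works on the unconverted sequences.
tolerant_count = bisulfite_mismatches(read, reference)
```

Converting both the reads and the reference discards the methylation signal during alignment, so a pipeline built this way would need to look back at the original read bases afterwards to call methylation. The asymmetric comparison avoids that step but requires an aligner that supports it.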