International Journal of Genomics

Review Article

A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data

Table 1


Computational tools	Description	Website	References

Alignment tools
Burrows-Wheeler Aligner (BWA)	Performs short-read alignment against a reference genome using the Burrows-Wheeler transform (BWT), allowing for gaps and mismatches.	http://bio-bwa.sourceforge.net/	[8]
Bowtie (1 & 2)	Performs short read alignment using the Burrows-Wheeler index in order to be memory efficient, while still maintaining an alignment speed of over 25 million 35 bp reads per hour.	http://bowtie-bio.sourceforge.net/index.shtml	[9, 10]
ELAND	Short read aligner that achieves speed by splitting reads into equal lengths and applying seed templates to guarantee hits with only 2 mismatches.	http://www.illumina.com/	Illumina, Inc.
GEM	Short read aligner using string matching instead of BWT to deliver precision and speed.	http://algorithms.cnag.cat/wiki/The_GEM_library	[11]
GSNAP	Performs short and long read alignment, detects short- and long-distance splicing and SNPs, and can align bisulfite-treated DNA for methylation studies.	http://research-pub.gene.com/gmap/	[12]
MAQ	Short read aligner compatible with Illumina-Solexa and ABI SOLiD data; performs ungapped alignment allowing 2-3 mismatches for single-end reads and one mismatch for paired-end reads.	http://maq.sourceforge.net/	[13]
mrFAST	Performs short read alignment allowing for INDELs up to 8 bp, for Illumina generated data. Paired-end mapping using a one end anchored algorithm allows for detection of novel insertions.	http://mrfast.sourceforge.net/	[14]
Novoalign	Aligns paired-end or single-end sequences and supports methylation studies. Allows mismatches in up to 50% of the read length and has built-in adapter and base quality trimming.	http://www.novocraft.com/products/novoalign/	http://www.novocraft.com/
SOAP (1 & 2)	SOAP2 improved speed by an order of magnitude over SOAP1 and can align a wide range of read lengths at the speed of 2 minutes for one million single-end reads using a two-way BWT algorithm.	http://soap.genomics.org.cn/	[15, 16]
SSAHA	Uses a hashing algorithm to find exact or close to exact matching in DNA and protein databases, analogous to doing a BLAST search for each read.	https://www.vectorbase.org/glossary/ssaha-sequence-search-and-alignment-hashing-algorithm/	[17]
Stampy	Uses a hashing algorithm and a statistical model to align Illumina reads for genome, RNA, and ChIP sequencing, allowing for a large number of variations including insertions and deletions.	http://www.well.ox.ac.uk/project-stampy	[18]
YOABS	Uses an O(n) algorithm combining hash- and trie-based methods that is effective in aligning sequences over 200 bp, using 3 times less memory and running ten times faster than SSAHA.	Available by request for noncommercial use	[19]
HTSeq	Python based package with many functions to facilitate several aspects of sequencing studies.	http://www-huber.embl.de/HTSeq/doc/overview.html

Auxiliary tools
FastUniq	Imports, sorts, and identifies PCR duplicates of short sequences from sequencing data.	https://sourceforge.net/projects/fastuniq/	[23]
Picard	Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.	http://picard.sourceforge.net/
SAMtools	Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files.	http://www.htslib.org/	[7]

SNV and SV calling
GATK	Variant calling of SNPs and small INDELs; can also be used on nonhuman and nondiploid organisms.	https://www.broadinstitute.org/gatk/	[4–6]
SAMtools	Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files.	http://www.htslib.org/	[7]
VCMM	Detection of SNVs and INDELs using the multinomial probabilistic method in WES and WGS data.	http://emu.src.riken.jp/VCMM/	[25]
FreeBayes	Detection of SNPs, MNPs, INDELs, and structural variants (SVs) from sequencing alignments using Bayesian statistical methods.	https://github.com/ekg/freebayes	[27]
indelMINER	Split-read algorithm to identify breakpoints of INDELs from paired-end sequencing data.	https://github.com/aakrosh/indelMINER	[32]
Pindel	Detection of INDELs using a pattern growth approach with anchor points to provide nucleotide-level resolution.	http://gmt.genome.wustl.edu/packages/pindel/	[30]
Platypus	Detection of SNPs, MNPs, INDELs, replacements, and structural variants (SVs) from sequencing alignments using local realignment and local assembly to achieve high specificity and sensitivity.	http://www.well.ox.ac.uk/platypus	[26]
Splitread	Detection of INDELs less than 50 bp long from WES or WGS data, using a split-read algorithm.	http://splitread.sourceforge.net/	[31]
Sprites	Detection of INDELs using a split-read and soft-clipping approach that is especially sensitive in datasets with low coverage.	https://github.com/zhangzhen/sprites	[33]

VCF annotation
ANNOVAR	Provides up-to-date annotation of VCF files by gene, region, and filters from several other databases.	http://annovar.openbioinformatics.org/	[34]
MuTect	Postprocesses variants to eliminate artifacts from hybrid capture, short read alignment, and next-generation sequencing.	http://www.broadinstitute.org/cancer/cga/mutect	[35]
SnpEff	Uses 38,000 genomes to predict and annotate the effects of variants on genes.	http://snpeff.sourceforge.net/	[36]
SnpSift	Tools to manipulate VCF files, including filtering, annotation, case-control analysis, transition/transversion rates, and more.	http://snpeff.sourceforge.net/SnpSift.html	[37]
VAT	Annotation of variants by functionality in a cloud computing environment.	http://vat.gersteinlab.org/	[38]

Database filtration
1000 Genomes Project	Genotype information from a population of 1000 healthy individuals.	http://www.1000genomes.org/	[41]
dbSNP	Database of genomic variants from 53 organisms.	https://www.ncbi.nlm.nih.gov/projects/SNP/	[39]
LOVD	Open source database of freely available gene-centered collection of DNA variants and storage of patient and NGS data.	http://www.lovd.nl/3.0/home	[40]
COSMIC	Database containing somatic mutations from human cancers separated into expert curated data and genome-wide screen published in scientific literature.	http://cancer.sanger.ac.uk/cosmic	[42]
NHLBI GO Exome Sequencing Project (ESP)	Database of genes and mechanisms that contribute to blood, lung, and heart disorders through NGS data in various populations.	http://evs.gs.washington.edu/EVS/
Exome Aggregation Consortium (ExAC)	Database of 60,706 unrelated individuals from disease and population exome sequencing studies.	http://exac.broadinstitute.org/	[3]
SeattleSeq Annotation	Part of the NHLBI sequencing project; this database contains novel and known SNVs and INDELs, including accession number, function of the variant, HapMap frequencies, clinical association, and PolyPhen predictions.	http://snp.gs.washington.edu/SeattleSeqAnnotation137/

Functional predictors
CADD	Machine learning algorithm to score all 8.6 billion possible substitutions in the human reference genome from 1 to 99 based on known and simulated functional variants.	http://cadd.gs.washington.edu/info	[49]
FATHMM	Uses Hidden Markov Models to predict the functional consequences of coding and noncoding SNVs through a web server.	http://fathmm.biocompute.org.uk/	[46]
LRT	Uses the Likelihood Ratio statistical test to compare a variant to known variants and determine if they are predicted to be benign, deleterious, or unknown.	http://genome.cshlp.org/content/19/9/1553.long	[45]
PolyPhen-2	Predicts potential impact of a nonsynonymous variant using comparative and physical characteristics.	http://genetics.bwh.harvard.edu/pph2/	[44]
SIFT	Uses PSI-BLAST to predict the effect of a nonsynonymous mutation within a protein.	http://sift.jcvi.org/	[43]
VEST	Machine learning approach to determine the probability that a missense mutation will impair the functionality of a protein.	http://karchinlab.org/apps/appVest.html	[48]
MetaSVM & MetaLR	Use a Support Vector Machine and Logistic Regression, respectively, to integrate nine deleteriousness prediction scores of missense mutations.	https://sites.google.com/site/jpopgen/dbNSFP	[47]

Significant somatic mutations
SomaticSniper	Using two BAM files as input, this tool uses the genotype likelihood model of MAQ to calculate the probability that the tumor and normal samples are different, thus identifying somatic variants.	http://gmt.genome.wustl.edu/packages/somatic-sniper/	[50]
MuTect	Uses statistical analysis with two Bayesian approaches to predict the likelihood of a somatic mutation.	https://www.broadinstitute.org/cancer/cga/mutect	[35]
VarSim	Leveraging previously reported mutations, a random mutation simulation is performed to predict somatic mutations.	http://bioinform.github.io/varsim/	[51]
SomVarIUS	Identification of somatic variants from unpaired tissue samples with a sequencing depth of 150x and 67% precision, implemented in Python.	https://github.com/kylessmith/SomVarIUS	[52]

Copy number alteration
Control-FREEC	Detects copy number changes and loss of heterozygosity (LOH) from paired SAM/BAM files by computing and normalizing copy number and beta allele frequency.	http://bioinfo-out.curie.fr/projects/freec/	[59]
CNV-seq	Mapped read count is calculated over a sliding window in Perl and R to determine copy number from HTS studies.	http://tiger.dbs.nus.edu.sg/cnv-seq/	[53]
SegSeq	Using 14 million aligned sequence reads from cancer cell lines, equal copy number alterations are calculated from sequencing data.	https://www.broadinstitute.org/cancer/cga/segseq	[54]
VarScan2	Determines copy number changes in matched or unmatched samples using read ratios, followed by postprocessing with a circular binary segmentation algorithm.	http://dkoboldt.github.io/varscan/using-varscan.html	[61]
ExomeAI	Detects allele imbalance including LOH in unmatched tumor samples using a statistical approach that is capable of handling low-quality datasets.	http://gqinnovationcenter.com/index.aspx	[64]
CNVseeqer	Exon coverage between matched sequences is calculated using ratios, followed by the circular binary segmentation algorithm.	http://icb.med.cornell.edu/wiki/index.php?title=Elementolab/CNVseeqer&redirect=no	[60]
EXCAVATOR	Detects copy number variants from WES data in 3 steps using a Hidden Markov Model algorithm.	https://sourceforge.net/projects/excavatortool/	[57]
ExomeCNV	R package used to detect copy number variants and loss of heterozygosity from WES data.	https://secure.genome.ucla.edu/index.php/ExomeCNV_User_Guide	[58]
ADTEx	Detection of aberrations in tumor exomes by detecting B-allele frequencies and implemented in R.	http://adtex.sourceforge.net/	[55]
CONTRA	Uses normalized depth of coverage to detect copy number changes from targeted resequencing data including WES.	https://sourceforge.net/projects/contra-cnv/	[56]

Driver prediction tools
CHASM	Machine learning method that predicts the functional significance of somatic mutations.	http://karchinlab.org/apps/appChasm.html	[65]
Dendrix	De novo drivers are discovered from cancer only mutational data including genes, nucleotides, or domains that have high exclusivity and coverage.	http://compbio.cs.brown.edu/projects/dendrix/	[66]
MutSigCV	Gene-specific and patient-specific mutation frequencies are incorporated to find mutations in genes that are mutated more often than would be expected by chance.	http://www.broadinstitute.org/cancer/software/genepattern/modules/docs/MutSigCV	[67]

Pathway analysis tools and resources
KEGG	Database using maps of known biological processes that allows searching for genes and color coding of results.	http://www.genome.jp/kegg/	[68]
DAVID	Allows for users to input a large set of genes and discover the functional annotation of the gene list including pathways, gene ontology terms, and more.	https://david.ncifcrf.gov/	[69]
STRING	Network visualization of protein-protein interactions of over 2,031 organisms.	http://string-db.org/	[70]
BEReX	Uses biomedical knowledge to allow users to search for relationships between biomedical entities.	http://infos.korea.ac.kr/berex/	[71]
DAPPLE	Uses a list of genes to determine physical connectivity among proteins according to protein-protein interactions.	http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001273	[72]
SNPsea	Uses linkage disequilibrium to determine pathways and cell types that are likely to be affected based on SNP data.	http://www.broadinstitute.org/mpg/snpsea/	[73]

Tools and resources for linking variants to therapeutics
cBioPortal	Database that allows the download, analysis, and visualization of cancer sequencing studies, including providing patient and clinical data for samples.	http://www.cbioportal.org/	[78]
My Cancer Genome	Database for cancer research that provides linkage of mutational status to therapies and available clinical trials.	https://www.mycancergenome.org/	http://www.mycancergenome.org/
ClinVar	Database of relationships between human variations and phenotypes, showing how variations relate to health status and their known implications.	https://www.ncbi.nlm.nih.gov/clinvar/	[74]
DSigDB	Database of drug signatures that includes 19,531 genes and 17,389 compounds that can in part help identify compounds for drug repurposing studies in translational research.	http://tanlab.ucdenver.edu/DSigDB	[77]
PharmGKB	Knowledge base allowing visualization of a variety of drug-gene knowledge.	https://www.pharmgkb.org/	[75]
DrugBank	Contains detailed drug information with comprehensive drug target information for 8,206 drugs.	http://www.drugbank.ca/	[76]

WES data analysis pipelines
Fastq2vcf	Whole exome sequencing pipeline that starts with raw sequencing (fastq) files and ends with a VCF file, suitable for both novice and expert users.	http://fastq2vcf.sourceforge.net/	[80]
SeqMule	WES or WGS pipeline that combines the information from over ten alignment and analysis tools to arrive at a VCF file that can be used in both Mendelian and cancer studies.	http://seqmule.openbioinformatics.org/en/latest/	[79]
IMPACT	WES data analysis pipeline that starts with raw sequencing reads and analyzes SNVs and CNAs and links this data to a list of prioritized drugs from clinical trials and DSigDB.	http://tanlab.ucdenver.edu/IMPACT/	[81]
Genomes on the Cloud (GotCloud)	Automated sequencing pipeline that performs in part alignment, variant calling, and quality control that can be run on Amazon Web Services EC2 as well as local machines and clusters.	http://genome.sph.umich.edu/wiki/GotCloud
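
Several aligners listed under "Alignment tools" (BWA, Bowtie, SOAP2) are described as using the Burrows-Wheeler transform (BWT). As a minimal illustration of the transform itself, and not of the indexing code used by any of these tools, it can be sketched in Python as follows. The function names `bwt` and `inverse_bwt` and the `$` end-of-text sentinel are assumptions made for this example; production aligners use compressed index structures rather than sorting all rotations.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text.

    Naive version for illustration: builds and sorts every rotation,
    which is far too slow and memory-hungry for a real genome.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # All cyclic rotations of the sentinel-terminated string, sorted.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        # Prepend the transformed column to each row, then re-sort.
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row ending in the sentinel is the original string.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

For example, `bwt("banana")` gives `"annb$aa"`, and `inverse_bwt(bwt("GATTACA"))` returns `"GATTACA"`. The transform groups identical characters that share a following context, which is the property that BWT-based indexes rely on for memory-efficient matching.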