Review Article

A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data

Table 1


Computational tools | Description | Website | References

Alignment tools
Burrows-Wheeler Aligner (BWA) | Performs short read alignment against a reference genome using the BWT approach, allowing for gaps and mismatches. | http://bio-bwa.sourceforge.net/ | [8]
Bowtie (1 & 2) | Performs short read alignment using the Burrows-Wheeler index in order to be memory efficient, while still maintaining an alignment speed of over 25 million 35 bp reads per hour. | http://bowtie-bio.sourceforge.net/index.shtml | [9, 10]
ELAND | Short read aligner that achieves speed by splitting reads into equal lengths and applying seed templates to guarantee hits with only 2 mismatches. | http://www.illumina.com/ | Illumina, Inc.
GEM | Short read aligner using string matching instead of BWT to deliver precision and speed. | http://algorithms.cnag.cat/wiki/The_GEM_library | [11]
GSNAP | Performs short and long read alignment, detects long and short distance splicing and SNPs, and can align reads from bisulfite-treated DNA for methylation studies. | http://research-pub.gene.com/gmap/ | [12]
MAQ | Short read aligner compatible with Illumina-Solexa and ABI SOLiD data; performs ungapped alignment allowing 2-3 mismatches for single-end reads and one mismatch for paired-end reads. | http://maq.sourceforge.net/ | [13]
mrFAST | Performs short read alignment allowing for INDELs up to 8 bp for Illumina-generated data. Paired-end mapping using a one-end-anchored algorithm allows for detection of novel insertions. | http://mrfast.sourceforge.net/ | [14]
Novoalign | Aligns paired-end or single-end sequences and also supports methylation studies. Allows mismatches in up to 50% of a read length and has built-in adapter and base quality trimming. | http://www.novocraft.com/products/novoalign/ | http://www.novocraft.com/
SOAP (1 & 2) | SOAP2 improved speed by an order of magnitude over SOAP1 and can align a wide range of read lengths at a speed of 2 minutes per one million single-end reads using a two-way BWT algorithm. | http://soap.genomics.org.cn/ | [15, 16]
SSAHA | Uses a hashing algorithm to find exact or close to exact matches in DNA and protein databases, analogous to doing a BLAST search for each read. | https://www.vectorbase.org/glossary/ssaha-sequence-search-and-alignment-hashing-algorithm/ | [17]
Stampy | Aligns Illumina reads for genome, RNA, and ChIP sequencing using a hashing algorithm and statistical model, allowing for a large number of variations including insertions and deletions. | http://www.well.ox.ac.uk/project-stampy | [18]
YOABS | Uses a 0() algorithm combining hash- and trie-based methods that is effective in aligning sequences over 200 bp, using 3 times less memory and running ten times faster than SSAHA. | Available by request for noncommercial use | [19]
HTSeq | Python-based package with many functions to facilitate several aspects of sequencing studies. | http://www-huber.embl.de/HTSeq/doc/overview.html |

Auxiliary tools
FastUniq | Imports, sorts, and identifies PCR duplicates of short sequences from sequencing data. | https://sourceforge.net/projects/fastuniq/ | [23]
Picard | Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. | http://picard.sourceforge.net/ |
SAMtools | Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files. | http://www.htslib.org/ | [7]

SNV and SV calling
GATK | Variant calling of SNPs and small INDELs; can also be used on nonhuman and nondiploid organisms. | https://www.broadinstitute.org/gatk/ | [46]
SAMtools | Suite of tools capable of viewing, indexing, editing, writing, and reading SAM, BAM, and CRAM formatted files. | http://www.htslib.org/ | [7]
VCMM | Detection of SNVs and INDELs using the multinomial probabilistic method in WES and WGS data. | http://emu.src.riken.jp/VCMM/ | [25]
FreeBayes | Detection of SNPs, MNPs, INDELs, and structural variants (SVs) from sequencing alignments using Bayesian statistical methods. | https://github.com/ekg/freebayes | [27]
indelMINER | Split-read algorithm to identify INDEL breakpoints from paired-end sequencing data. | https://github.com/aakrosh/indelMINER | [32]
Pindel | Detection of INDELs using a pattern growth approach with anchor points to provide nucleotide-level resolution. | http://gmt.genome.wustl.edu/packages/pindel/ | [30]
Platypus | Detection of SNPs, MNPs, INDELs, replacements, and structural variants (SVs) from sequencing alignments using local realignment and local assembly to achieve high specificity and sensitivity. | http://www.well.ox.ac.uk/platypus | [26]
Splitread | Detection of INDELs less than 50 bp long from WES or WGS data, using a split-read algorithm. | http://splitread.sourceforge.net/ | [31]
Sprites | Detection of INDELs using a split-read and soft-clipping approach that is especially sensitive in datasets with low coverage. | https://github.com/zhangzhen/sprites | [33]

VCF annotation
ANNOVAR | Provides up-to-date annotation of VCF files by gene, region, and filters from several other databases. | http://annovar.openbioinformatics.org/ | [34]
MuTect | Postprocesses variants to eliminate artifacts from hybrid capture, short read alignment, and next-generation sequencing. | http://www.broadinstitute.org/cancer/cga/mutect | [35]
SnpEff | Uses 38,000 genomes to predict and annotate the effects of variants on genes. | http://snpeff.sourceforge.net/ | [36]
SnpSift | Tools to manipulate VCF files, including filtering, annotation, case controls, transition and transversion rates, and more. | http://snpeff.sourceforge.net/SnpSift.html | [37]
VAT | Annotation of variants by functionality in a cloud computing environment. | http://vat.gersteinlab.org/ | [38]

Database filtration
1000 Genomes Project | Genotype information from a population of 1000 healthy individuals. | http://www.1000genomes.org/ | [41]
dbSNP | Database of genomic variants from 53 organisms. | https://www.ncbi.nlm.nih.gov/projects/SNP/ | [39]
LOVD | Open source database providing a freely available, gene-centered collection of DNA variants and storage of patient and NGS data. | http://www.lovd.nl/3.0/home | [40]
COSMIC | Database containing somatic mutations from human cancers, separated into expert-curated data and genome-wide screens published in the scientific literature. | http://cancer.sanger.ac.uk/cosmic | [42]
NHLBI GO Exome Sequencing Project (ESP) | Database of genes and mechanisms that contribute to blood, lung, and heart disorders through NGS data in various populations. | http://evs.gs.washington.edu/EVS/ |
Exome Aggregation Consortium (ExAC) | Database of 60,706 unrelated individuals from disease and population exome sequencing studies. | http://exac.broadinstitute.org/ | [3]
SeattleSeq Annotation | Part of the NHLBI sequencing project; this database contains novel and known SNVs and INDELs, annotated with accession number, function of the variant, HapMap frequencies, clinical association, and PolyPhen predictions. | http://snp.gs.washington.edu/SeattleSeqAnnotation137/ |

Functional predictors
CADD | Machine learning algorithm to score all possible 8.6 billion substitutions in the human reference genome from 1 to 99 based on known and simulated functional variants. | http://cadd.gs.washington.edu/info | [49]
FATHMM | Uses Hidden Markov Models to predict the functional consequences of coding and noncoding SNVs through a web server. | http://fathmm.biocompute.org.uk/ | [46]
LRT | Uses the Likelihood Ratio statistical test to compare a variant to known variants and determine whether it is predicted to be benign, deleterious, or unknown. | http://genome.cshlp.org/content/19/9/1553.long | [45]
PolyPhen-2 | Predicts potential impact of a nonsynonymous variant using comparative and physical characteristics. | http://genetics.bwh.harvard.edu/pph2/ | [44]
SIFT | Uses PSI-BLAST to predict the effect of a nonsynonymous mutation on a protein. | http://sift.jcvi.org/ | [43]
VEST | Machine learning approach to determine the probability that a missense mutation will impair the functionality of a protein. | http://karchinlab.org/apps/appVest.html | [48]
MetaSVM & MetaLR | Support Vector Machine and Logistic Regression models that integrate nine deleteriousness prediction scores for missense mutations. | https://sites.google.com/site/jpopgen/dbNSFP | [47]

Significant somatic mutations
SomaticSniper | Using two BAM files as input, this tool uses the genotype likelihood model of MAQ to calculate the probability that the tumor and normal samples are different, thus identifying somatic variants. | http://gmt.genome.wustl.edu/packages/somatic-sniper/ | [50]
MuTect | Uses statistical analysis with two Bayesian approaches to predict the likelihood of a somatic mutation. | https://www.broadinstitute.org/cancer/cga/mutect | [35]
VarSim | Leverages previously reported mutations to perform a random mutation simulation that predicts somatic mutations. | http://bioinform.github.io/varsim/ | [51]
SomVarIUS | Identification of somatic variants from unpaired tissue samples with a sequencing depth of 150x and 67% precision, implemented in Python. | https://github.com/kylessmith/SomVarIUS | [52]

Copy number alteration
Control-FREEC | Detects copy number changes and loss of heterozygosity (LOH) from paired SAM/BAM files by computing and normalizing copy number and beta allele frequency. | http://bioinfo-out.curie.fr/projects/freec/ | [59]
CNV-seq | Calculates mapped read counts over a sliding window, using Perl and R, to determine copy number from HTS studies. | http://tiger.dbs.nus.edu.sg/cnv-seq/ | [53]
SegSeq | Using 14 million aligned sequence reads from cancer cell lines, equal copy number alterations are calculated from sequencing data. | https://www.broadinstitute.org/cancer/cga/segseq | [54]
VarScan2 | Determines copy number changes in matched or unmatched samples using read ratios, then postprocesses them with a circular binary segmentation algorithm. | http://dkoboldt.github.io/varscan/using-varscan.html | [61]
ExomeAI | Detects allele imbalance including LOH in unmatched tumor samples using a statistical approach that is capable of handling low-quality datasets. | http://gqinnovationcenter.com/index.aspx | [64]
CNVseeqer | Calculates exon coverage ratios between matched sequences, followed by the circular binary segmentation algorithm. | http://icb.med.cornell.edu/wiki/index.php?title=Elementolab/CNVseeqer&redirect=no | [60]
EXCAVATOR | Detects copy number variants from WES data in 3 steps using a Hidden Markov Model algorithm. | https://sourceforge.net/projects/excavatortool/ | [57]
ExomeCNV | R package used to detect copy number variants and loss of heterozygosity from WES data. | https://secure.genome.ucla.edu/index.php/ExomeCNV_User_Guide | [58]
ADTEx | Detects aberrations in tumor exomes using B-allele frequencies; implemented in R. | http://adtex.sourceforge.net/ | [55]
CONTRA | Uses normalized depth of coverage to detect copy number changes from targeted resequencing data including WES. | https://sourceforge.net/projects/contra-cnv/ | [56]

Driver prediction tools
CHASM | Machine learning method that predicts the functional significance of somatic mutations. | http://karchinlab.org/apps/appChasm.html | [65]
Dendrix | Discovers de novo drivers from cancer-only mutational data, including genes, nucleotides, or domains that have high exclusivity and coverage. | http://compbio.cs.brown.edu/projects/dendrix/ | [66]
MutSigCV | Incorporates gene-specific and patient-specific mutation frequencies to find genes that are mutated more often than would be expected by chance. | http://www.broadinstitute.org/cancer/software/genepattern/modules/docs/MutSigCV | [67]

Pathway analysis tools and resources
KEGG | Database using maps of known biological processes that allows searching for genes and color coding of results. | http://www.genome.jp/kegg/ | [68]
DAVID | Allows users to input a large set of genes and discover the functional annotation of the gene list, including pathways, gene ontology terms, and more. | https://david.ncifcrf.gov/ | [69]
STRING | Network visualization of protein-protein interactions for over 2,031 organisms. | http://string-db.org/ | [70]
BEReX | Uses biomedical knowledge to allow users to search for relationships between biomedical entities. | http://infos.korea.ac.kr/berex/ | [71]
DAPPLE | Uses a list of genes to determine physical connectivity among proteins according to protein-protein interactions. | http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001273 | [72]
SNPsea | Uses linkage disequilibrium to determine pathways and cell types that are likely to be affected based on SNP data. | http://www.broadinstitute.org/mpg/snpsea/ | [73]

Tools and resources for linking variants to therapeutics
cBioPortal | Database that allows the download, analysis, and visualization of cancer sequencing studies, including patient and clinical data for samples. | http://www.cbioportal.org/ | [78]
My Cancer Genome | Database for cancer research that links mutational status to therapies and available clinical trials. | https://www.mycancergenome.org/ | http://www.mycancergenome.org/
ClinVar | Database of relationships between human variations and phenotypes, showing how variations relate to health status and their known implications. | https://www.ncbi.nlm.nih.gov/clinvar/ | [74]
DSigDB | Database of drug signatures that includes 19,531 genes and 17,389 compounds and can in part help identify compounds for drug repurposing studies in translational research. | http://tanlab.ucdenver.edu/DSigDB | [77]
PharmGKB | Knowledge base allowing visualization of a variety of drug-gene knowledge. | https://www.pharmgkb.org/ | [75]
DrugBank | Contains detailed drug information with comprehensive drug target information for 8,206 drugs. | http://www.drugbank.ca/ | [76]

WES data analysis pipelines
fastq2VCF | Whole Exome Sequencing pipeline that starts with raw sequencing (fastq) files and ends with a VCF file; suitable for both novice and expert users. | http://fastq2vcf.sourceforge.net/ | [80]
SeqMule | WES or WGS pipeline that combines the information from over ten alignment and analysis tools to arrive at a VCF file that can be used in both Mendelian and cancer studies. | http://seqmule.openbioinformatics.org/en/latest/ | [79]
IMPACT | WES data analysis pipeline that starts with raw sequencing reads, analyzes SNVs and CNAs, and links these data to a list of prioritized drugs from clinical trials and DSigDB. | http://tanlab.ucdenver.edu/IMPACT/ | [81]
Genomes on the Cloud (GotCloud) | Automated sequencing pipeline that performs, in part, alignment, variant calling, and quality control, and can be run on Amazon Web Services EC2 as well as local machines and clusters. | http://genome.sph.umich.edu/wiki/GotCloud |
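
Several copy number tools in Table 1 (for example CNV-seq, VarScan2, and CNVseeqer) are described as comparing mapped read counts or read ratios between samples. The Python sketch below illustrates that general idea only and is not the implementation of any listed tool. It assumes fixed, non-overlapping windows instead of a true sliding window, takes plain lists of read start positions instead of BAM input, and omits the segmentation step. The function names `window_counts` and `log2_ratios` are hypothetical.

```python
import math
from collections import Counter


def window_counts(positions, window_size):
    """Count read start positions per fixed-size window (window index = pos // size)."""
    return Counter(pos // window_size for pos in positions)


def log2_ratios(tumor_positions, normal_positions, window_size=1000):
    """Return {window_index: log2(tumor / normal)} after library-size scaling.

    Assumption: inputs are read start coordinates on a single chromosome.
    """
    if not tumor_positions or not normal_positions:
        raise ValueError("both samples need at least one mapped read")

    tumor = window_counts(tumor_positions, window_size)
    normal = window_counts(normal_positions, window_size)

    # Scale tumor counts by the ratio of total reads so that a difference in
    # overall sequencing depth is not mistaken for a gain or loss.
    scale = len(normal_positions) / len(tumor_positions)

    ratios = {}
    for window in sorted(set(tumor) | set(normal)):
        t = tumor.get(window, 0)
        n = normal.get(window, 0)
        if t == 0 or n == 0:
            continue  # ratio is undefined without reads in both samples
        ratios[window] = math.log2(t * scale / n)
    return ratios
```

Under these assumptions, a window with a log2 ratio near 0 suggests no copy number change, while clearly positive or negative values suggest a gain or loss. Real tools additionally correct for biases such as GC content and apply segmentation (for example circular binary segmentation) before calling alterations.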