The Promise of Genomic Studies on Human Diseases: From Basic Science to Clinical ApplicationView this Special Issue
Review Article | Open Access
Jennifer D. Hintzsche, William A. Robinson, Aik Choon Tan, "A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data", International Journal of Genomics, vol. 2016, Article ID 7983236, 16 pages, 2016. https://doi.org/10.1155/2016/7983236
A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data
Whole Exome Sequencing (WES) is the application of the next-generation technology to determine the variations in the exome and is becoming a standard approach in studying genetic variants in diseases. Understanding the exomes of individuals at single base resolution allows the identification of actionable mutations for disease treatment and management. WES technologies have shifted the bottleneck in experimental data production to computationally intensive informatics-based data analysis. Novel computational tools and methods have been developed to analyze and interpret WES data. Here, we review some of the current tools that are being used to analyze WES data. These tools range from the alignment of raw sequencing reads all the way to linking variants to actionable therapeutics. Strengths and weaknesses of each tool are discussed for the purpose of helping researchers make more informative decisions on selecting the best tools to analyze their WES data.
Recent advances in next-generation sequencing technologies provide revolutionary opportunities to characterize the genomic landscapes of individuals at single base resolution for identifying actionable mutations for disease treatment and management [1, 2]. Whole Exome Sequencing (WES) is the application of the next-generation technology to determine the variations in the exome, that is, all coding regions of known genes in a genome. For example, more than 85% of disease-causing mutations in Mendelian diseases are found in the exome, and WES provides an unbiased approach to detect these variants in the era of personalized and precision medicine. Next-generation sequencing technologies have shifted the bottleneck in experimental data production to computationally intensive informatics-based data analysis. For example, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unrelated individuals from various disease-specific and population genetic studies . To gain insights in WES, novel computational algorithms and bioinformatics methods represent a critical component in modern biomedical research to analyze and interpret these massive datasets.
Genomic studies that employ WES have increased over the years, and new bioinformatics methods and computational tools have developed to assist the analysis and interpretation of this data (Figure 1). The majority of WES computational tools are centered on the generation of a Variant Calling Format (VCF) file from raw sequencing data. Once the VCF files have been generated, further downstream analyses can be performed by other computational methods. Therefore, in this review we have classified bioinformatics methods and computational tools into Pre-VCF and Post-VCF categories. Pre-VCF workflows include tools for aligning the raw sequencing reads to a reference genome, variant detection, and annotation. Post-VCF workflows include methods for somatic mutation detection, pathway analysis, copy number alterations, INDEL identification, and driver prediction. Depending on the nature of the hypothesis, beyond VCF analysis can also include methods that link variants to clinical data as well as potential therapeutics (Figure 2).
Computational tools developed to align raw sequencing data to an annotated VCF file have been well established. Most studies tend to follow workflows associated with GATK [4–6], SAMtools , or a combination of these. In general, workflows start with aligning WES reads to a reference genome and noting reads that vary. The most common of these variants are single nucleotide variants (SNVs) but also include insertions, deletions, and rearrangements. The location of these variants is used to annotate them to a specific gene. After annotation, the SNVs found can be compared to databases of SNVs found in other studies. This allows for the determination of frequency of a particular SNV in a given population. In some studies, such as those relating to cancer, rare somatic mutations are of interest. However, in Mendelian studies, the germline mutational landscape will be of more interest than somatic mutations. Before a final VCF file is produced for a given sample, software can be used to predict if the variant will be functionally damaging to the protein for prioritizing candidate genes for further study.
Bioinformatics methods developed beyond the establishment of annotated VCF files are far less established. In cancer research, the most established types of beyond VCF tools are focused on the detection of somatic mutations. However, there are strides being made to develop other computational tools including pathway analysis, copy number alteration, INDEL identification, driver mutation predictions, and linking candidate genes to clinical data and actionable targets.
Here, we will review recent computational tools in the analysis and interpretation of WES data, with special focus on the applications of these methods in cancer research. We have surveyed the current trends in next-generation sequencing analysis tools and compared their methodology so that researches can better determine which tools are the best for their WES study and the advancement of precision medicine. In addition, we include a list of publicly available bioinformatics and computational tools as a reference for WES studies (Table 1).
2. Computational Tools in Pre-VCF Analyses
Alignment, removal of duplicates, variant calling, annotation, filtration, and prediction are all parts of the steps leading up to the generation of a filtered and annotated VCF file. Here we review each one of these steps, as shown in Figure 2, and compare and contrast some of the tools that can be used to perform the Pre-VCF analysis steps.
2.1. Alignment Tools
The first step in any analysis of next-generation sequencing is to align the sequencing reads to a reference genome. The two most common reference genomes for humans currently are hg18 and hg19. Several aligning algorithms have been developed including but not limited to BWA , Bowtie 1  and 2 , GEM , ELAND (Illumina, Inc.), GSNAP , MAQ , mrFAST , Novoalign (http://www.novocraft.com/), SOAP 1  and 2 , SSAHA , Stampy , and YOABS . Each method has its own unique features and many papers have reviewed the differences between them [20–22], and we will not review these tools in depth here. The three most commonly used of these algorithms are BWA, Bowtie (1 and 2), and SOAP (1 and 2).
2.2. Auxiliary Tools
Some auxiliary tools have been developed to filter aligned reads to ensure higher quality data for downstream analyses. PCR amplification can introduce duplicate reads of paired-end reads in sequencing data. These duplicate reads can influence the depth of the mapped reads and downstream analyses. For example, if a variant is detected in duplicate reads, the proportion of reads containing a variant could pass the threshold needed for variant calling, thus calling a falsely positive variant. Therefore, removing duplicate reads is a crucial step in accurately representing the sequencing depth during downstream analyses. Several tools have been developed to detect PCR duplicates including Picard (http://picard.sourceforge.net./), FastUniq , and SAMtools . SAMtools rmdup finds reads that start and end at the same position, find the read with the highest quality score, and mark the rest of the duplicates for removal. Picard finds identical 5′ positions for both reads in a mate pair and marks them as duplicates. In contrast, FastUniq takes a de novo approach to quickly identify PCR duplicates. FastUniq imports all reads, sorts them according to their location, and then marks duplicates. This allows FastUniq not to require complete genome sequences as prerequisites. Due to the different algorithms each of these tools use, these tools can remove PCR duplicates individually or in combination.
2.3. Methods for Single Nucleotide Variants (SNVs) Calling
After sequences have been aligned to the reference genome, the next step is to perform variant detection in the WES data. There are four general categories of variant calling strategies: germline variants, somatic variants, copy number variations, and structural variants. Multiple tools that perform one or more of these variant calling techniques were recently compared to each other . Some common SNV calling programs are GATK [4–6], SAMtools , and VCMM . The actual SNV calling mechanisms of GATK and SAMtools are very similar. However, the context before and after SNV calling represents the differences between these tools. GATK assumes each sequencing error is independent while SAMtools believes a secondary error carries more weight. After SNV calling GATK learns from data while SAMtools relies on options of the user, Variant Caller with Multinomial probabilistic Model (VCMM) is another tool developed to detect SNVs and INDELs from WES and Whole Genome Sequencing (WGS) studies using a multinomial probabilistic model with quality score and a strand bias filter . VCMM suppressed the false-positive and false-negative variant calls when compared to GATK and SAMtools. However, the number of variant calls was similar to previous studies. The comparison done by the authors of VCMM demonstrated that while all three methods call a large number of common SNVs, each tool also identifies SNVs not found by the other methods . The ability of each method to call SNVs not found by the others should be taken into account when choosing a SNV variant calling tool(s).
2.4. Methods for Structural Variants (SVs) Identification
Structural Variants (SVs) such as insertions and deletions (INDELs) in high-throughput sequencing data are more challenging to identify than single nucleotide variants because they could involve an undefined number of nucleotides. The majority of WES studies follow SAMtools  or GATK [4–6] workflows which will identify INDELs in the data. However, other software has been developed to increase the sensitivity of INDEL discovery while simultaneously decreasing the false discovery rate.
Platypus  was developed to find SNVs, INDELs, and complex polymorphisms using local de novo assembly. When compared to SAMtools and GATK, Platypus had the lowest Fosmid false discovery rate for both SNVs and INDELs in whole genome sequencing of 15 samples. It also had the shortest runtime of these tools. However, GATK and SAMtools had lower Fosmid false discovery rates than Platypus when finding SNVs and INDELs in WES data . Therefore, Platypus seems to be appropriate for whole genome sequencing but caution should be used when utilizing this tool with WES data.
FreeBayes uses a unique approach to INDEL detection compared to other tools. The method utilizes haplotype-based variant detection under a Bayesian statistics framework . This method has been used in several studies in combination with other approaches for the identifying of unique INDELs [28, 29].
Pindel was one of the first programs developed to address the issue of unidentified large INDELs due to the short length of WGS reads . In brief, after alignment of the reads to the reference genome, Pindel identifies reads where one end was mapped and the other was not . Then, Pindel searches the reference genome for the unmapped portion of this read over a user defined area of the genome . This split-read algorithm successfully identified large INDELs. Other computational tools developed after Pindel still utilize this algorithm as the foundation in their methods for detecting INDELs.
Splitread  was developed to specifically identify structural variants and INDELs in WES data from 1 bp to 1 Mbp building on the split-read approach of Pindel . The algorithms used by SAMtools and GATK limit the size of structural variants, with variants greater than 15 bp rarely being identified . Splitread anchors one end of a read and clusters the unanchored ends to identify size, content, and location of structural variants . When compared to GATK, Splitread called 70% of the same INDELs but identified 19 more unique INDELs, 13 of which were verified by sanger sequencing . The unique ability of Splitread to identify large structural variants and INDELs merits it being used in conjunction with other INDEL detecting software in WES analysis.
Recently developed indelMINER is a compilation of tools that takes the strengths of split-read and de novo assembly to determine INDELs from paired-end reads of WGS data . Comparisons were done between SAMtools, Pindel, and indelMINER on a simulated dataset with 7,500 INDELs . SAMtools found the least INDELs with 6,491, followed by Pindel with 7,239 and indelMINER with 7,365 INDELs identified. However, indelMINER’s false-positive percentage (3.57%) was higher than SAMtools (2.65%) but lower than Pindel (4.53%). Conversely, indelMINER did have the lowest number of false-negatives with 398 compared to 589 and 1,181 for Pindel and SAMtools, respectively. Each of these tools has their own strengths and weaknesses as demonstrated by the authors of indelMINER . Therefore, it can be predicted that future tools developed for SV detection will take an approach similar to indelMINER in trying to incorporate the best methods that have been developed thus far.
Most of the recent SV detection tools rely on realigning split-reads for detecting deletions. Instead of a more universal approach like indelMINER, Sprites  aims to solve the problem of deletions with microhomologies and deletions with microinsertions. Sprites algorithm realigns soft-clipping reads to find the longest prefix or suffix that has a match in the target sequence. In terms of the -score, Sprites performed better than Pindel using real and simulated data .
All of these tools use different algorithms to address the problem of structural variants, which are common in human genomes. Each of these tools has strengths and weaknesses in detecting SVs. Therefore, it is suggested to use several of these tools in combination to detect SVs in WES.
2.5. VCF Annotation Methods
Once the variants are detected and called, the next step is to annotate these variants. The two most popular VCF annotation tools are ANNOVAR  and MuTect  which is part of the GATK pipeline. ANNOVAR was developed in 2010 with the aim to rapidly annotate millions of variants with ease and remains one of the popular variant annotation methods to date . ANNOVAR can use gene, region, or filter-based annotation to access over 20 public databases for variants annotation. MuTect is another method that uses Bayesian classifiers for detecting and annotating variants [34, 35]. MuTect has been widely used in cancer genomics research, especially in The Cancer Genome Atlas projects. Other VCF annotation tools are SnpEff  and SnpSift . SnpEff can perform annotation for multiple variants and SnpSift allows rapid detection of significant variants from the VCF files . The Variant Annotation Tool (VAT) distinguishes itself from other annotation tools in one aspect by adding cloud computing capabilities . VAT annotation occurs at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene is affected. VAT is dynamic in that it also annotates Multiple Nucleotide Polymorphisms (MNPs) and can be used on more than just the human species.
2.6. Database and Resources for Variant Filtration
During the annotation process, many resources and databases could be used as filtering criteria for detecting novel variants from common polymorphisms. These databases score a variant by its minor allelic frequency (MAF) within a specific population or study. The need for filtration of variants based on this number is subject to the purpose of the study. For example, Mendelian studies would be interested in including common SNVs while cancer studies usually focus on rare variants found in less than 1% of the population. NCBI dbSNP database, established in 2001, is an evolving database containing both well-known and rare variants from many organisms . dbSNP also contains additional information including disease association, genotype origin, and somatic and germline variant information .
The Leiden Open Variation Database (LOVD) developed in 2005 links its database to several other repositories so that the user can make comparisons and gain further information . One of the most popular SNV databases was developed in 2010 from the 1000 Genomes Project that uses statistics from the sequencing of more than 1000 “healthy” people of all ethnicities . This is especially helpful for cancer studies, as damaging mutations found in cancer are often very rare in a healthy population. Another database essential for cancer studies is the Catalogue of Somatic Mutations In Cancer (COSMIC) . This database of somatic mutations found in cancer studies from almost 20,000 publications allows for identification of potentially important cancer-related variants. More recently, the Exome Aggregation Consortium (ExAC) has assembled and reanalyzed WES data of 60,706 unrelated individuals from various disease-specific and population genetic studies . The ExAC web portal and data provide a resource for assessing the significance of variants detected in WES data .
2.7. Functional Predictors of Mutation
Besides knowing if a particular variant has been previously identified, researchers may also want to determine the effect of a variant. Many functional prediction tools have been developed that all vary slightly in their algorithms. While individual prediction software can be used, ANNOVAR provides users with scores from several different functional predictors including SIFT, PolyPhen-2, LRT, FATHMM, MetaSVM, MetaLR, VEST, and CADD .
SIFT determines if a variant is deleterious using PSI-BLAST to determine conservation of amino acids based on closely related sequence alignments . PolyPhen-2 uses a pipeline involving eight sequence based methods and three structure based methods in order to determine if a mutation is benign, probably deleterious, or known to be deleterious . The Likelihood Ratio Test (LRT) uses conservation between closely related species to determine a mutations functional impact . When three genomes underwent analysis by SIFT, PolyPhen-2, and LRT, only 5% of all predicted deleterious mutations were agreed to be deleterious by all three methods . Therefore, it has been shown that using multiple mutational predictors is necessary for detecting a wide range of deleterious SNVs. FATHMM employs sequence conservation within Hidden Markov Models for predicting the functional effects of protein missense mutation . FATHMM weighs mutations based on their pathogenicity by the predicted interaction of the protein domain .
MetaSVM and MetaLR represent two ensemble methods that combine 10 predictor scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, and PhyloP) and the maximum frequency observed in the 1000 genomes populations for predicting the deleterious variants . MetaSVM and MetaLR are based on the ensemble Support Vector Machine (SVM) and Logistic Regression (LR), respectively, for predicting the final variant scores .
The Variant Effect Scoring Tool (VEST) is similar to MetaSVM and MetaLR in that it uses a training set and machine learning to predict functionality of mutations . The main difference in the VEST approach is that the training set and prediction methodology are specifically designed for Mendelian studies . The Combined Annotation Dependent Depletion (CADD) method differentiates itself by integrating multiple variants with mutations that have survived natural selection as well as simulated mutations .
While all of these methods predict the functionality of a mutation, they all vary slightly in their methodological and biological assumptions. Dong et al. have recently tested the performance of these prediction algorithms on known datasets . They pointed out that these methods rarely unanimously agree on if a mutation is deleterious. Therefore, it is important to consider the methodology of the predictor as well as the focus of the study when interpreting deleterious prediction results.
3. Computational Methods for Beyond VCF Analyses
After a VCF file has been generated, annotated, and filtered, there are several types of analyses that can be performed (Figure 2). Here we outline six major types of analyses that can be performed after the generation of a VCF file, with special focus on WES in cancer research: (i) significant somatic mutations, (ii) pathway analysis, (iii) copy number estimation, (iv) driver prediction, (v) linking variants to clinical information and actionable therapies, and (vi) emerging applications of WES in cancer research.
3.1. Methods to Determine Significant Somatic Mutations
After VCF annotation, a WES sample can have thousands of SNVs identified; however, most of them will be silent (synonymous) mutations and will not be meaningful for follow-up study. Therefore, it is important to identify significant somatic mutations from these variants. Several tools have been developed to do this task for the analysis of cancer WES data, including SomaticSniper , MuTect , VarSim , and SomVarIUS .
SomaticSniper is a computational program that compares the normal and tumor samples to find out which mutations are unique to the tumor sample, hence predicted as somatic mutations . SomaticSniper uses the genotype likelihood model of MAQ (as implemented in SAMtools) and then calculates the probability that the tumor and normal genotypes are different. The probability is reported as a somatic score which is the Phred-scaled probability. SomaticSniper has been applied in various cancer research studies to detect significant somatic variants.
Another popular somatic mutation identification tool is MuTect , developed by the Broad Institute. MuTect, like SomaticSniper, uses paired normal and cancer samples as input for detecting somatic mutations. After removing low-quality reads, MuTect uses a variant detection statistic to determine if a variant is more probable than a sequencing error. MuTect then searches for six types of known sequencing artifacts and removes them. A panel of normal samples as well as the dbSNP database is used for comparison to remove common polymorphisms. By doing this, the number of somatic mutations is not only identified but also reduced to a more probable set of candidate genes. MuTect has been widely used in Broad Institute cancer genomics studies.
While SomaticSniper and MuTect require data from both paired cancer and normal samples, VarSim  and SomVarIUS  do not require a normal sample to call somatic mutations. Unlike most programs of its kind, VarSim  uses a two-step process utilizing both simulation and experimental data for assessing alignment and variant calling accuracy. In the first step, VarSim simulates diploid genomes with germline and somatic mutations based on a realistic model that includes SNVs and SVs. In the second step, VarSim performs somatic variant detection using the simulated data and validates the cancer mutations in the tumor VCF. SomVarIUS is another recent computational method to detect somatic variants in cancer exomes without a normal paired sample . In brief, SomVarIUS consists of 3 steps for somatic variant detection. SomVarIUS first prioritizes potential variant sites, estimates the probability of a sequencing error followed by the probability that an observed variant is germline or somatic. In samples with greater than 150x coverage, SomVarIUS identifies somatic variants with at least 67.7% precision and 64.6% recall rates, when compared with paired-tissue somatic variant calls in real tumor samples . Both VarSim and SomVarIUS will be useful for cancer samples that lack the corresponding normal samples for somatic variant detection.
3.2. Computational Tools for Estimating Copy Number Alteration
One active research area in WES data analysis is the development of computational methods for estimating copy number alterations (CNAs). Many tools have been developed for estimating CNAs from WES data based on paired normal-tumor samples such as CNV-seq , SegSeq , ADTEx , CONTRA , EXCAVATOR , ExomeCNV , Control-FREEC (control-FREE Copy number caller) , and CNVseeqer . For example, VarScan2  is a computational tool that can estimate somatic mutations and CNAs from paired normal-tumor samples. VarScan2 utilizes a normal sample to find Somatic CNAs (SCNAs) by first comparing Q20 read depths between normal and tumor samples and normalizes them based on the amount of input data for each sample . Copy number alteration is inferred from the log2 of the ratio of tumor depth to normal depth for each region . Lastly, the circular segmentation (CBS) algorithm  is utilized to merge adjacent segments to call a set of SCNAs. These SCNAs could be further classified as large-scale (>25% of chromosome arm) or focal (<25%) events in the WES data .
Recently, ExomeAI was developed to detect Allelic Imbalance (AI) from WES data . Utilizing heterozygous sites, ExomeAI finds deviations from the expected 1 : 1 ratio between an A- and B-allele in multiple tumor samples without a normal comparison. Absolute deviation of B-allele frequency from .05 is calculated and similar to VarScan2; the CBS algorithm is applied to each chromosomal arm . In order to reduce the number of false positives, a database was created with 500 (and counting) normal samples to filter out known AIs. This represents a novel tool to analyze WES for the detection of recurrent AI events without matched normal samples.
A systematic evaluation of somatic copy number estimation tools for WES data has been recently published . In this study, six computational tools for CNAs detection (ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2) were evaluated using WES data from three TCGA datasets. Using a SNP array as the reference, this study found that these algorithms gave highly variable results. The authors found that ADTEx and EXCAVATOR had the best performance with relatively high precision and sensitivity when compared to the reference set. The study showed that the current CNA detection tools for WES data still have limitations and called for more robust algorithms for this challenging task.
3.3. Computational Tools for Predicting Drivers in Cancer Exomes
Cancer is a disease driven by genetic variations and copy number alterations. These genetic events can be classified into two classes, “driver” and “passenger” mutations. Driver mutations are the key mutation that drive the development of cancer and provide a survival advantage, whereas passenger mutations are “by-stander” alterations that happen to be altered in the primary cells but do not provide a survival advantage. As the cancer exomes tend to have high mutational burdens, identifying the “driver” mutations from the “passenger” mutations is one of the key analyses in cancer research. Several tools have been developed to find driver mutations including but not limited to CHASM , Dendrix , and MutSigCV .
CHASM (Cancer-specific High-throughput Annotation of Somatic Mutations) uses random forest as the machine learning approach to distinguish the difference between driver and passenger mutations in cancer . CHASM was trained on the curated driver mutations obtained from the COSMIC database (“positive examples”) and synthetic passenger mutations generated according to the background of base substitution frequencies observed in specific tumor types (“negative examples”). CHASM can achieve high sensitivity and specificity when discriminating between known driver missense mutations and randomly generated missense mutations when tested in real tumor samples. This method has been one of the popular driver detection prediction tools for cancer researchers and has been applied in various cancer genomic studies.
Another common driver mutation tool is MutSigCV developed to resolve the problem of extensive false-positive findings that overshadow true driver mutations . As the size of cancer genomes sequenced has increased implausible genes (such as TTN) have been falsely reported as being related to cancer when in fact their large size just makes the probability they would be mutated by chance increase . MutSigCV takes into account patient-specific mutation frequency and spectrum as well as gene-specific background mutation rates, expression level, and replication time. By pooling all of this available data into one tool, MutSigCV has become a standard tool used for driver mutation identification in cancer studies.
De novo Driver Exclusivity (Dendrix) is a novel computational tool to determine de novo driver pathways (gene sets) from somatic mutations in patient data . The main goal of the Dendrix algorithm is to find gene sets with high coverage and high exclusivity properties from the somatic data. The high coverage property assumes most patients have at least one driver mutation in the gene set, whereas the high exclusivity property assumes that these driver mutations are rarely mutated together in the same patient. Two algorithms were developed in Dendrix, one based on a greedy algorithm and one based on the Markov Chain Monte Carlo (MCMC) algorithm, to measure sets of genes that exhibit both criteria. When Dendrix was applied to the TCGA data, the algorithms identified groups of genes that were mutated in large subsets of patients and these mutations were mutually exclusive. This tool provides an opportunity to analyze WES data to identify driver pathways in cancer genomic studies.
3.4. Methods for Pathway Analysis
After candidate somatic mutations have been identified; one common type of analysis is to determine which pathways are affected by these mutations. Common pathway resources and tools used for these types of analysis include KEGG , DAVID , STRING , BEReX , DAPPLE , and SNPsea .
KEGG represents one of the most popular databases for pathway analysis. DAVID is a popular online tool for performing functional enrichment analysis based on user defined gene lists. STRING is the largest protein-protein interactions database for querying and searching for interactions between user defined gene lists. BEReX integrates STRING, KEGG, and other data sources to explore biomedical interactions between genes, drugs, pathways, and diseases. Both STRING and BEReX allow users to perform functional enrichment analysis and the flexibility to explore the interactions between user defined gene lists by expanding the networks.
DAPPLE (Disease Association Protein-Protein Link Evaluator) uses literature reported protein-protein interactions to identify significant physical connectivity among the genes of interest . DAPPLE hypothesizes that genetic variation affects underlying mechanisms only detectable by protein-protein interactions . SNPsea is another pathway analysis tool that requires specific SNP data . SNPsea calculates linkage disequilibrium between involved genes and uses a sampling approach to determine conditions that are affected by these interactions.
3.5. Computational Tools for Linking Variants to Treatments
The ability to link variants with actionable drug targets is an emerging research topic in precision medicine. Databases such as My Cancer Genome have provided the framework for these studies (https://www.mycancergenome.org/). My Cancer Genome provides a bridge between genomic data and clinical therapeutic treatments. Similarly, ClinVar provides information on the relationship between variants and clinical therapy . By collecting both the variants and the clinical significance related to these variants, ClinVar offers a database for researchers to explore the significance of sequencing findings in the clinical setting . Pharmacological databases such as PharmGKB , DrugBank , and DSigDB  provide the link between drug and drug targets (variants). For example, by querying a list of variants to one of these databases, it allows users to identify actionable targets via enrichment analysis for the repurposing of drugs.
Similarly, the ability to incorporate clinical data into sequencing studies is vital to the advancement of personalized medicine. However, due to the lack of integration between electronic health records (EHR) and molecular analysis, this remains one of the challenges in translating WES data analysis into clinical practice. Projects such as cBioPortal provide a framework for incorporating sequencing data with available clinical data . New methods for addressing this task are urgently needed to take advantage of the important applications of WES data within the clinic in order to advance precision medicine.
4. WES Analysis Pipelines
WES data analysis pipelines integrate computational tools and methods described in the previous sections in a single analysis workflow. Here, we review three recent sequencing pipelines SeqMule , Fastq2vcf , and IMPACT  that assimilate some of the tools described in previous sections.
SeqMule stands out in part due to the use of five alignment tools (BWA, Bowtie 1 and 2, SOAP2, and SNAP) and five different variant calling algorithms (GATK, SAMtools, VarScan2, FreeBayes, and SOAPsnp) . SeqMule contains at least one feature that performs Pre-VCF analyses to generate a filtered VCF file. SeqMule also generates an accompanying HTML-based report and images to show an overview of every step in the pipeline. Fastq2vcf also performs the Pre-VCF analyses using BWA as an alignment tool and variant calling by GATK, UnifiedGenotyper, HaplotypeCaller, SAMtools, and SNVer resulting in a filtered VCF after implementation of ANNOVAR and VEP . Fastq2vcf can be used in a single or parallel computing environment on variety of sequencing data.
Both SeqMule and Fastq2vcf pipelines focus on taking raw sequencing data and converting it into a filtered VCF file. IMPACT (Integrating Molecular Profiles with ACtionable Therapeutics) WES data analysis pipeline was developed to take this analysis a step further by linking a filtered VCF to actionable therapeutics . The IMPACT pipeline contains four analytical modules: detecting somatic variants; calling copy number alterations; predicting drugs against the deleterious variants; and tumor heterogeneity analysis. IMPACT has been applied to longitudinal samples obtained from a melanoma patient and identified novel acquired resistance mutations to treatment. IMPACT analysis revealed loss of CDKN2A as a novel resistance mechanism to the combination of dabrafenib and trametinib treatment and predicted potential drugs for further pharmacological and biological studies .
To compare the strengths and weaknesses between these three WES pipelines, SeqMule allows the use of different alignment algorithms in its pipeline whereas IMPACT and Fastq2vcf only utilize BWA as the sequencing alignment algorithm. SAMtools is the common tool used by IMPACT, Fastq2vcf, and SeqMule to call variants. In addition, Fastq2vcf and SeqMule employ GATK and other variant calling algorithms for variants detection. Fastq2vcf and IMPACT both annotate the variants with ANNOVAR. Fastq2vcf also utilizes VEP and IMPACT utilizes SIFT and PolyPhen-2 as the primary variants prediction methods. For Post-VCF analysis, IMPACT pipeline has more options as compared to SeqMule and Fastq2vcf. In particular, IMPACT performs copy number analysis, tumor heterogeneity, and linking of actionable therapeutics to the molecular profiles. However, IMPACT is only designed to be performed on tumor samples while SeqMule and Fastq2vcf are designed for any WES dataset. Therefore, it is advisable for the users to consider the analytic needs to select the appropriate WES data analysis pipeline for their research.
As recently discussed by Altman et al., part of the U.S. Precision Medicine Initiative (PMI) includes being able to define a gold standard of pipelines and tools for specific sequencing studies to enable a new era of medicine . Automated pipelines such as these will accelerate the analysis and interpretation of WES data. Future development of data analysis pipeline will be needed to incorporate newer and wider tools tailored for specific research questions.
In summary, we have reviewed several computational tools for the analysis and interpretation of WES data. These computational methods were developed to generate VCF files from raw sequencing data, as well as tools that perform downstream analyses in WES studies. Each tool has specific strengths and weaknesses, and it appears that using several of them in combination would lead to more accurate results. Currently, there are still challenges for bioinformaticians at every step in analyzing WES data. However, the greatest area of need is in the development of tools that can link the information found in a VCF file to clinical databases and therapeutics. Research in this area will help to advance precision medicine by providing user-friendly and informative knowledge to transcend the laboratory.
The authors declare that there are no competing interests regarding the publication of this paper.
The authors would like to thank the Translational Bioinformatics and Cancer Systems Biology Lab members for their constructive comments on this manuscript. This work is partly supported by the National Institutes of Health P50CA058187, Cancer League of Colorado, the David F. and Margaret T. Grohne Family Foundation, the Rifkin Endowed Chair (WAR), and the Moore Family Foundation.
- M. L. Metzker, “Sequencing technologies—the next generation,” Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
- S. Goodwin, J. D. McPherson, and W. R. McCombie, “Coming of age: ten years of next-generation sequencing technologies,” Nature Reviews Genetics, vol. 17, no. 6, pp. 333–351, 2016.
- M. Lek, K. J. Karczewski, E. V. Minikel et al., “Analysis of protein-coding genetic variation in 60,706 humans,” Nature, vol. 536, no. 7616, pp. 285–291, 2016.
- A. McKenna, M. Hanna, E. Banks et al., “The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data,” Genome Research, vol. 20, no. 9, pp. 1297–1303, 2010.
- M. A. DePristo, E. Banks, R. Poplin et al., “A framework for variation discovery and genotyping using next-generation DNA sequencing data,” Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011.
- G. A. Van der, M. O. Auwera, C. Hartl et al., “From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline,” Current Protocols in Bioinformatics, vol. 11, no. 1110, pp. 11.10.1–11.10.33, 2013.
- H. Li, B. Handsaker, A. Wysoker et al., “The Sequence Alignment/Map format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
- H. Li and R. Durbin, “Fast and accurate long-read alignment with Burrows-Wheeler transform,” Bioinformatics, vol. 26, no. 5, Article ID btp698, pp. 589–595, 2010.
- B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, article R25, 2009.
- B. Langmead and S. L. Salzberg, “Fast gapped-read alignment with Bowtie 2,” Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.
- S. Marco-Sola, M. Sammeth, R. Guigó, and P. Ribeca, “The GEM mapper: fast, accurate and versatile alignment by filtration,” Nature Methods, vol. 9, no. 12, pp. 1185–1188, 2012.
- T. D. Wu and S. Nacu, “Fast and SNP-tolerant detection of complex variants and splicing in short reads,” Bioinformatics, vol. 26, no. 7, pp. 873–881, 2010.
- H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
- C. Alkan, J. M. Kidd, T. Marques-Bonet et al., “Personalized copy number and segmental duplication maps using next-generation sequencing,” Nature Genetics, vol. 41, no. 10, pp. 1061–1067, 2009.
- R. Li, Y. Li, K. Kristiansen, and J. Wang, “SOAP: short oligonucleotide alignment program,” Bioinformatics, vol. 24, no. 5, pp. 713–714, 2008.
- R. Li, C. Yu, Y. Li et al., “SOAP2: an improved ultrafast tool for short read alignment,” Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
- Z. Ning, A. J. Cox, and J. C. Mullikin, “SSAHA: a fast search method for large DNA databases,” Genome Research, vol. 11, no. 10, pp. 1725–1729, 2001.
- G. Lunter and M. Goodson, “Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads,” Genome Research, vol. 21, no. 6, pp. 936–939, 2011.
- V. L. Galinsky, “YOABS: yet other aligner of biological sequences—an efficient linearly scaling nucleotide aligner,” Bioinformatics, vol. 28, no. 8, Article ID bts102, pp. 1070–1077, 2012.
- H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing,” Briefings in Bioinformatics, vol. 11, no. 5, Article ID bbq015, pp. 473–483, 2010.
- J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nature Biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
- T. J. Treangen and S. L. Salzberg, “Repetitive DNA and next-generation sequencing: computational challenges and solutions,” Nature Reviews Genetics, vol. 13, no. 1, pp. 36–46, 2012.
- H. Xu, X. Luo, J. Qian et al., “FastUniq: a fast de novo duplicates removal tool for paired short reads,” PLoS ONE, vol. 7, no. 12, Article ID e52249, 2012.
- S. Pabinger, A. Dander, M. Fischer et al., “A survey of tools for variant analysis of next-generation genome sequencing data,” Briefings in Bioinformatics, vol. 15, no. 2, pp. 256–278, 2014.
- D. Shigemizu, A. Fujimoto, S. Akiyama et al., “A practical method to detect SNVs and indels from whole genome and exome sequencing data,” Scientific Reports, vol. 3, article 2161, 2013.
- A. Rimmer, H. Phan, I. Mathieson et al., “Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications,” Nature Genetics, vol. 46, no. 8, pp. 912–918, 2014.
- E. Garrison and G. Marth, “Haplotype-based variant detection from short-read sequencing,” https://arxiv.org/abs/1207.3907.
- G. Ellison, S. Huang, H. Carr et al., “A reliable method for the detection of BRCA1 and BRCA2 mutations in fixed tumour tissue utilising multiplex PCR-based targeted next generation sequencing,” BMC Clinical Pathology, vol. 15, no. 1, article 5, 2015.
- A. K. Talukder, S. Ravishankar, K. Sasmal et al., “XomAnnotate: analysis of heterogeneous and complex exome—a step towards translational medicine,” PLoS ONE, vol. 10, no. 4, Article ID e0123569, 2015.
- K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads,” Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009.
- E. Karakoc, C. Alkan, B. J. O'Roak et al., “Detection of structural variants and indels within exome data,” Nature Methods, vol. 9, no. 2, pp. 176–178, 2012.
- A. Ratan, T. L. Olson, T. P. Loughran, and W. Miller, “Identification of indels in next-generation sequencing data,” BMC Bioinformatics, vol. 16, no. 1, article 42, 2015.
- Z. Zhang, J. Wang, J. Luo et al., “Sprites: detection of deletions from sequencing data by re-aligning split reads,” Bioinformatics, vol. 32, no. 12, pp. 1788–1796, 2016.
- K. Wang, M. Li, and H. Hakonarson, “ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data,” Nucleic Acids Research, vol. 38, no. 16, article e164, 2010.
- K. Cibulskis, M. S. Lawrence, S. L. Carter et al., “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples,” Nature Biotechnology, vol. 31, no. 3, pp. 213–219, 2013.
- P. Cingolani, A. Platts, L. L. Wang et al., “A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3,” Fly, vol. 6, no. 2, pp. 80–92, 2012.
- P. Cingolani, V. M. Patel, M. Coon et al., “Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift,” Frontiers in Genetics, vol. 3, article 35, 2012.
- L. Habegger, S. Balasubramanian, D. Z. Chen et al., “VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment,” Bioinformatics, vol. 28, no. 17, pp. 2267–2269, 2012.
- S. T. Sherry, M.-H. Ward, M. Kholodov et al., “DbSNP: the NCBI database of genetic variation,” Nucleic Acids Research, vol. 29, no. 1, pp. 308–311, 2001.
- I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, “LOVD: easy creation of a locus-specific sequence variation database using an ‘LSDB-in-a-box’ approach,” Human Mutation, vol. 26, no. 2, pp. 63–68, 2005.
- 1000 Genomes Project Consortium, G. R. Abecasis, D. Altshuler et al., “A map of human genome variation from population-scale sequencing,” Nature, vol. 467, no. 7319, pp. 1061–1073, 2010.
- S. A. Forbes, D. Beare, P. Gunasekaran et al., “COSMIC: exploring the world's knowledge of somatic mutations in human cancer,” Nucleic Acids Research, vol. 43, no. 1, pp. D805–D811, 2015.
- P. Kumar, S. Henikoff, and P. C. Ng, “Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm,” Nature Protocols, vol. 4, no. 7, pp. 1073–1081, 2009.
- I. A. Adzhubei, S. Schmidt, L. Peshkin et al., “A method and server for predicting damaging missense mutations,” Nature Methods, vol. 7, no. 4, pp. 248–249, 2010.
- S. Chun and J. C. Fay, “Identification of deleterious mutations within three human genomes,” Genome Research, vol. 19, no. 9, pp. 1553–1561, 2009.
- H. A. Shihab, J. Gough, D. N. Cooper et al., “Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models,” Human Mutation, vol. 34, no. 1, pp. 57–65, 2013.
- C. Dong, P. Wei, X. Jian et al., “Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies,” Human Molecular Genetics, vol. 24, no. 8, pp. 2125–2137, 2015.
- H. Carter, C. Douville, P. D. Stenson, D. N. Cooper, and R. Karchin, “Identifying Mendelian disease genes with the variant effect scoring tool,” BMC Genomics, vol. 14, p. S3, 2013.
- M. Kircher, D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper, and J. Shendure, “A general framework for estimating the relative pathogenicity of human genetic variants,” Nature Genetics, vol. 46, no. 3, pp. 310–315, 2014.
- D. E. Larson, C. C. Harris, K. Chen et al., “Somaticsniper: identification of somatic point mutations in whole genome sequencing data,” Bioinformatics, vol. 28, no. 3, Article ID btr665, pp. 311–317, 2012.
- J. C. Mu, M. Mohiyuddin, J. Li et al., “VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications,” Bioinformatics, vol. 31, no. 9, pp. 1469–1471, 2014.
- K. S. Smith, V. K. Yadav, S. Pei, D. A. Pollyea, C. T. Jordan, and S. De, “SomVarIUS: somatic variant identification from unpaired tissue samples,” Bioinformatics, vol. 32, no. 6, pp. 808–813, 2015.
- C. Xie and M. T. Tammi, “CNV-seq, a new method to detect copy number variation using high-throughput sequencing,” BMC Bioinformatics, vol. 10, no. 1, article 80, 2009.
- D. Y. Chiang, G. Getz, D. B. Jaffe et al., “High-resolution mapping of copy-number alterations with massively parallel sequencing,” Nature Methods, vol. 6, no. 1, pp. 99–103, 2009.
- K. C. Amarasinghe, J. Li, S. M. Hunter et al., “Inferring copy number and genotype in tumour exome data,” BMC Genomics, vol. 15, no. 1, article 732, 2014.
- J. Li, R. Lupat, K. C. Amarasinghe et al., “CONTRA: copy number analysis for targeted resequencing,” Bioinformatics, vol. 28, no. 10, Article ID bts146, pp. 1307–1313, 2012.
- A. Magi, L. Tattini, I. Cifola et al., “EXCAVATOR: detecting copy number variants from whole-exome sequencing data,” Genome Biology, vol. 14, no. 10, article R120, 2013.
- J. F. Sathirapongsasuti, H. Lee, B. A. J. Horst et al., “Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV,” Bioinformatics, vol. 27, no. 19, Article ID btr462, pp. 2648–2654, 2011.
- V. Boeva, A. Zinovyev, K. Bleakley et al., “Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,” Bioinformatics, vol. 27, no. 2, pp. 268–269, 2011.
- Y. Jiang, D. Redmond, K. Nie et al., “Deep sequencing reveals clonal evolution patterns and mutation events associated with relapse in B-cell lymphomas,” Genome Biology, vol. 15, no. 8, article 432, 2014.
- D. C. Koboldt, Q. Zhang, D. E. Larson et al., “VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing,” Genome Research, vol. 22, no. 3, pp. 568–576, 2012.
- A. B. Olshen, H. Bengtsson, P. Neuvial, P. T. Spellman, R. A. Olshen, and V. E. Seshan, “Parent-specific copy number in paired tumor-normal studies using circular binary segmentation,” Bioinformatics, vol. 27, no. 15, Article ID btr329, pp. 2038–2046, 2011.
- J.-Y. Nam, N. K. Kim, S. C. Kim et al., “Evaluation of somatic copy number estimation tools for whole-exome sequencing data,” Briefings in Bioinformatics, vol. 17, no. 2, pp. 185–192, 2015.
- J. Nadaf, J. Majewski, and S. Fahiminiya, “ExomeAI: detection of recurrent allelic imbalance in tumors using whole-exome sequencing data,” Bioinformatics, vol. 31, no. 3, pp. 429–431, 2014.
- H. Carter, J. Samayoa, R. H. Hruban, and R. Karchin, “Prioritization of driver mutations in pancreatic cancer using cancer-specific high-throughput annotation of somatic mutations (CHASM),” Cancer Biology & Therapy, vol. 10, no. 6, pp. 582–587, 2010.
- F. Vandin, E. Upfal, and B. J. Raphael, “De novo discovery of mutated driver pathways in cancer,” Genome Research, vol. 22, no. 2, pp. 375–385, 2012.
- M. S. Lawrence, P. Stojanov, P. Polak et al., “Mutational heterogeneity in cancer and the search for new cancer-associated genes,” Nature, vol. 499, no. 7457, pp. 214–218, 2013.
- M. Kanehisa and S. Goto, “KEGG: kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.
- D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists,” Nucleic Acids Research, vol. 37, no. 1, pp. 1–13, 2009.
- D. Szklarczyk, A. Franceschini, S. Wyder et al., “STRING v10: protein-protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. 1, pp. D447–D452, 2015.
- M. Jeon, S. Lee, K. Lee, A.-C. Tan, and J. Kang, “BEReX: biomedical entity-relationship eXplorer,” Bioinformatics, vol. 30, no. 1, pp. 135–136, 2014.
- E. J. Rossin, K. Lage, S. Raychaudhuri et al., “Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology,” PLoS Genetics, vol. 7, no. 1, Article ID e1001273, 2011.
- K. Slowikowski, X. Hu, and S. Raychaudhuri, “SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci,” Bioinformatics, vol. 30, no. 17, pp. 2496–2497, 2014.
- M. J. Landrum, J. M. Lee, G. R. Riley et al., “ClinVar: public archive of relationships among sequence variation and human phenotype,” Nucleic Acids Research, vol. 42, no. 1, pp. D980–D985, 2014.
- M. Whirl-Carrillo, E. M. McDonagh, J. M. Hebert et al., “Pharmacogenomics knowledge for personalized medicine,” Clinical Pharmacology and Therapeutics, vol. 92, no. 4, pp. 414–417, 2012.
- V. Law, C. Knox, Y. Djoumbou et al., “DrugBank 4.0: shedding new light on drug metabolism,” Nucleic Acids Research, vol. 42, no. 1, pp. D1091–D1097, 2014.
- M. Yoo, J. Shin, J. Kim et al., “DSigDB: drug signatures database for gene set analysis,” Bioinformatics, vol. 31, no. 18, pp. 3069–3071, 2014.
- E. Cerami, J. Gao, U. Dogrusoz et al., “The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data,” Cancer Discovery, vol. 2, no. 5, pp. 401–404, 2012.
- Y. Guo, X. Ding, Y. Shen, G. J. Lyon, and K. Wang, “SeqMule: automated pipeline for analysis of human exome/genome sequencing data,” Scientific Reports, vol. 5, article 14283, 2015.
- X. Gao, J. Xu, and J. Starmer, “Fastq2vcf: a concise and transparent pipeline for whole-exome sequencing data analyses,” BMC Research Notes, vol. 8, no. 1, p. 72, 2015.
- J. Hintzsche, J. Kim, V. Yadav et al., “IMPACT: a whole-exome sequencing analysis pipeline for integrating molecular profiles with actionable therapeutics in clinical samples,” Journal of the American Medical Informatics Association, vol. 23, no. 4, pp. 721–730, 2016.
- R. B. Altman, S. Prabhu, A. Sidow et al., “A research roadmap for next-generation sequencing informatics,” Science Translational Medicine, vol. 8, no. 335, Article ID 335ps10, 2016.
Copyright © 2016 Jennifer D. Hintzsche et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.