Abstract

Copy number variations (CNVs) are structural variants associated with human diseases. Recent studies have identified disease-related genes based on the extraction of rare de novo and transmitted CNVs from exome sequencing data. The need for more efficient and accurate CNV detection methods has therefore increased, yet detection remains a challenging problem due to coverage biases as well as the sparse, small, and noncontiguous nature of exome target regions. In this study, we developed a new CNV detection method, ExCNVSS, based on read coverage depth evaluation and scale-space filtering to resolve these problems. We also developed ExCNVSSnoRatio, a version of ExCNVSS for cases in which only test data are available and a matched control need not be considered. To evaluate the performance of our method, we tested it with 11 different simulated data sets and 10 real HapMap samples' data. The results demonstrated that ExCNVSS outperformed three other state-of-the-art methods, corrected for coverage biases, and detected CNVs of all sizes even without matched control data.

1. Introduction

Recent technological advances in next-generation sequencing (NGS) and massively accumulating exome sequencing data highlight the need to detect disease-related genes and genetic variations from exome sequencing. The analysis of exome sequencing data has become feasible even in small-scale laboratories because of its lower memory requirements and decreased computational complexity compared to whole-genome sequencing data. Furthermore, recent developments in web-based and/or cloud-based pipelines for exome sequencing data analysis facilitate analyses such as preprocessing, alignment, variant detection, and functional study, especially in small-scale laboratories [1, 2].

However, these pipelines are restricted to the extraction of simple variants, such as SNPs and short indels, and are not suitable for detecting structural variants (SVs), such as copy number variations (CNVs) and large indels. A CNV is defined as a DNA segment of 50 bp or larger that is present at a variable copy number in comparison with a reference genome. CNVs are important variants associated with human diseases such as autism, intellectual disability, epilepsy, schizophrenia, obesity, and cancer [3–6]. Specifically, researchers have verified disease-causing genes based on the extraction of rare, de novo, and transmitted CNVs from exome sequencing data [7–9].

However, exome-based CNV detection still remains a challenging problem due to two obstacles: the presence of coverage biases introduced by the capture and sequencing of exomes, and the sparse, small, and noncontiguous nature of target regions [10]. There are publicly available CNV detection methods based on read depth approaches, including ExomeCNV [11], Contra [12], CoNIFER [13], XHMM [14], and Excavator [15]. Each of these methods implements key strategies to mitigate the coverage biases caused by the capture and sequencing of exomes. ExomeCNV uses the Geary-Hinkley transformation to obtain normally distributed read coverage data. Contra adopts a normalization method that uses base-level log-ratios and corrects for imbalanced library sizes. Both CoNIFER and XHMM combine read coverage data with singular value decomposition (SVD) and principal component analysis (PCA) to identify and remove experimental noise. Excavator adopts a median normalization procedure to reduce systematic biases due to GC content, mappability, and exon size. While some of these methods reduce systematic biases in test data by efficiently utilizing many samples, their application may be limited to sequencing experiments that deal with a large number of samples. CoNIFER and XHMM require many samples at once in order to normalize the test data by SVD and PCA procedures. The baseline control suggested by Contra also requires many samples to generate a pooled model.

To detect the boundaries of variant regions, some of these CNV detection methods adopt a simple or modified circular binary segmentation algorithm [16], which usually performs well for subdividing a continuous region. However, this may result in missing larger or smaller variants due to sparsely targeted regions in exome sequencing data [15].

To overcome the obstacles presented by conventional methods, we developed a new CNV detection method, ExCNVSS, based on read coverage depth evaluation and scale-space filtering [17]. Our key strategies are to correct the coverage biases introduced by capture and sequencing through read coverage depth evaluation and to account for the sparse, small, and noncontiguous nature of target regions through the multiresolution framework of scale-space filtering. This enables the detection of the type and exact location of CNVs of all sizes. Furthermore, ExCNVSSnoRatio, a version of ExCNVSS designed to operate on test data alone, can detect copy number gains and losses of all sizes over concatenated, arbitrary-sized exonic regions even when a matched control is not available.

Our method can be summarized as follows. (1) It extracts the base-level read coverage depth within each targeted exonic region from the read alignment data and merges these depths to generate concatenated base-level read coverage data. (2) To reduce the effect of coverage biases, the base-level read coverage data are normalized by our four-step normalization protocol. In each step, the target exon read coverage data are evaluated either from the test data alone or from the ratio of test to control data, depending on whether the input consists of test data only or of both test and control data. (3) Scale-space filtering is then applied to the normalized base-level read coverage data using Gaussian convolution over various scales determined by a scaling parameter. By differentiating the scale-space filtered data twice and finding the zero-crossing points of the second derivatives, the inflection points of the scale-space filtered data are calculated per scale. (4) Finally, the types and exact locations of the CNVs in the test data are obtained by using parametric baselines, which are evaluated from the normalized base-level coverage data, and by analyzing the fingerprint map, which is the collection of contours of the zero-crossing points over the various scales.

We carried out simulation experiments to assess the performance of ExCNVSS and to extract optimal values of the parametric baselines from the results. The performance of ExCNVSS was assessed by evaluating the false negative rate (FNR) and false positive rate (FPR) on the basis of the number of detected target-level CNV regions in transcript coordinates, and was then compared with that of conventional methods. In addition, ExCNVSS was validated by experiments with 10 individual HapMap samples using the optimal parametric baselines. The results showed a reasonable trade-off between FNR and FPR, even when an artificial data set was used as a pseudo-control, demonstrating that ExCNVSS can precisely detect CNVs of various types and sizes.

2. Materials and Methods

Figure 1 shows a flowchart of the overall process of our method, which includes two procedures: data preprocessing and CNV estimation. A new four-step normalization protocol was implemented for the data preprocessing procedure. Scale-space filtering, which consists of Gaussian convolution, fingerprint mapping, baseline adjustment, interval search, and CNV detection, was applied in the CNV estimation procedure [18].

2.1. Preprocessing Data

The normalization protocol implemented in the preprocessing procedure consists of four steps: evaluation of base-level read coverage data, segmentation, estimation of segment-level normalized mean read coverage data, and estimation of the base-level normalized distribution of read coverage data. Its purpose is to minimize the effects of sources of variation such as GC content bias [19], library size effect [20], and exon edge bias [21]. In each step, the read coverage data are evaluated either from the test data alone or from the ratio of test to control data, depending on whether the input consists of test data only or of both test and control data. The details of each step are described in the following subsections, where the case of input with both test and control data is considered.

2.1.1. Evaluation of Base-Level Read Coverage Data

Read coverage data $D^t$ and $D^c$ were extracted from the input data $B^t$ and $B^c$, which were the read alignment results of the test and control genomic sequencing data, respectively. The concatenated target exon read coverage data $E^t = \{e^t_1, \ldots, e^t_N\}$ and $E^c = \{e^c_1, \ldots, e^c_N\}$ were extracted from the read coverage data $D^t$ and $D^c$, respectively, where $e^t_i$ and $e^c_i$ are the $i$th target exon read coverage data with length $l_i$ of the test and control data, respectively, and $N$ is the total number of target exons. Then, the base-level read coverage data were obtained by evaluating the target exon read coverage ratio data $X = \{x_1, \ldots, x_N\}$, where $x_i$ is the sequence of the ratio $x_i[n]$, $n = 1, \ldots, l_i$, of the $i$th target exon read coverage data $e^t_i$ and $e^c_i$, multiplied by the parameter $\alpha$ for correcting the imbalanced library size effect between the test and control read coverage data. The parameter $\alpha = S^c / S^t$ is the ratio of the total sums $S^c$ and $S^t$ of the control and test read coverage data over all target exons. In the estimation of the ratio sequence $x_i[n]$, $n = 1, \ldots, l_i$, read coverage values of nearly zero are handled by substituting for them a very small nonnegative number $\epsilon$, so that the ratio remains finite:

$$x_i[n] = \alpha \cdot \frac{\max(e^t_i[n], \epsilon)}{\max(e^c_i[n], \epsilon)}, \qquad n = 1, \ldots, l_i.$$
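For concreteness, the following minimal sketch implements this step under stated assumptions: the per-exon coverage arrays, the direction of the library-size factor (control total over test total), and the value of the epsilon guard are illustrative choices, not taken from the ExCNVSS implementation.

```python
import numpy as np

# Minimal sketch of step 2.1.1 (not the authors' code). Assumptions:
# coverage is given per target exon as 1-D numpy arrays, the
# library-size factor alpha is control total over test total, and
# near-zero coverage is clamped to a small epsilon.

EPS = 1e-6  # "very small nonnegative number"; the exact value is assumed

def coverage_ratio(test_exons, control_exons, eps=EPS):
    """Return one base-level ratio array per target exon."""
    total_t = sum(float(e.sum()) for e in test_exons)
    total_c = sum(float(e.sum()) for e in control_exons)
    alpha = total_c / total_t  # corrects the imbalanced library sizes

    ratios = []
    for et, ec in zip(test_exons, control_exons):
        num = np.maximum(et.astype(float), eps)  # guard near-zero test depth
        den = np.maximum(ec.astype(float), eps)  # guard near-zero control depth
        ratios.append(alpha * num / den)
    return ratios  # concatenated downstream into one base-level track
```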

2.1.2. Segmentation

The sequence $x_i[n]$, $n = 1, \ldots, l_i$, of each target exon is partitioned into segment sequences $s_{i,j}$, $j = 1, \ldots, m_i$, of equal size $w$ starting from $n = 1$, where the remnant sequence between $n = w m_i + 1$ and $n = l_i$ is neglected.

2.1.3. Estimation of Segment-Level Normalized Mean Read Coverage Data

The mean $\mu_{i,j}$, $j = 1, \ldots, m_i$, of each segment $s_{i,j}$ was adjusted to the normalized mean $\hat{\mu}_{i,j} = (\mu_{i,j} - \mu)/\sigma$, $i = 1, \ldots, N$, $j = 1, \ldots, m_i$, by using its $z$-score, where $\mu$ and $\sigma$ are the mean and standard deviation of the means of all segments in all target exons. Then, the normalized mean $\hat{\mu}_{i,j}$ of each segment was shifted by the minimum value of the normalized means in order to ensure that the mean of each segment was not less than zero.

2.1.4. Estimation of the Base-Level Normalized Distribution of Read Coverage Data

The sequence $s_{i,j}[n]$, $n = 1, \ldots, w$, $j = 1, \ldots, m_i$, of each segment of the $i$th target exon read coverage ratio data was readjusted into $\hat{s}_{i,j}[n]$, $n = 1, \ldots, w$, $j = 1, \ldots, m_i$, so as to be normally distributed within the segment, based on the normalized mean $\hat{\mu}_{i,j}$ and the standard deviation of the segment.
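A compact sketch of steps 2.1.2–2.1.4 follows, under stated assumptions: the segment size $w$ is not given here and is illustrative, and the within-segment readjustment is realized by recentering each segment around its normalized mean while preserving its internal variation, which is our reading of the text.

```python
import numpy as np

# Sketch of steps 2.1.2-2.1.4 (our reading, not the authors' code).
# Assumptions: a fixed segment size w, and a readjustment rule that
# recenters each segment's bases around its normalized mean.

def normalize_segments(ratio_exons, w=30):
    # 2.1.2 Segmentation: equal-size segments per exon; the remnant
    # shorter than w at each exon's tail is neglected.
    segments = []
    for x in ratio_exons:
        segments += [x[j * w:(j + 1) * w] for j in range(len(x) // w)]

    # 2.1.3 Normalized segment means: z-score against all segments,
    # then shift by the minimum so no segment mean is below zero.
    means = np.array([s.mean() for s in segments])
    sd = means.std() if means.std() > 0 else 1.0
    norm_means = (means - means.mean()) / sd
    norm_means -= norm_means.min()

    # 2.1.4 Readjust each segment around its normalized mean.
    return [m + (s - s.mean()) for s, m in zip(segments, norm_means)]
```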

2.2. Copy Number Estimation

The CNV estimation procedure includes five steps: Gaussian convolution, fingerprint mapping, baseline adjustment, interval search, and CNV detection, as described in our previous work [18]. Some changes were necessary in the baseline adjustment and CNV detection steps to reduce the effects of the sources of variation. The description of the CNV estimation procedure in this section therefore concerns mainly these changed parts.

In the Gaussian convolution step, the sequence $\hat{x}[n]$ of readjusted target exon read coverage ratio data obtained in the preprocessing procedure was decomposed into $L$ layers by Gaussian convolution with increasing scale, as in the following equation:

$$F(n, \sigma_l) = \hat{x}[n] * g(n, \sigma_l) = \sum_{k=-w/2}^{w/2} \hat{x}[n-k]\, g(k, \sigma_l),$$

where $F(n, \sigma_l)$ is the scale-space image of $\hat{x}[n]$, $l = 1, \ldots, L$, is the index of the layer of the scale-space image, $\sigma_l$ is the value of the scale parameter at layer $l$, and $w$ is the window size of the Gaussian kernel $g(n, \sigma_l)$, which is set in proportion to $\sigma_l$. The scale parameter $\sigma_l$ is the standard deviation of the Gaussian kernel and increases with $l$, its range being chosen in consideration of the range of detectable CNV sizes and the time complexity. Here, we obtained the scale-space image of $\hat{x}[n]$ by applying a discrete Fourier transform and multiplying in the frequency domain to reduce the computational complexity. Let $\hat{X}(f)$ and $G(f, \sigma_l)$ be the discrete Fourier transforms of $\hat{x}[n]$ and $g(n, \sigma_l)$, respectively. The scale-space image was then obtained by $F(n, \sigma_l) = \mathcal{F}^{-1}\{\hat{X}(f)\,G(f, \sigma_l)\}$, where $\mathcal{F}^{-1}$ is the inverse discrete Fourier transform operator.

Then, in the fingerprint mapping step, the zero-crossing points of the second-order derivative of the scale-space image were searched for in each layer. Here, the second derivative of $F(n, \sigma_l)$ was approximated by the second-order difference $D(n, \sigma_l) = F(n+1, \sigma_l) - 2F(n, \sigma_l) + F(n-1, \sigma_l)$. A zero-crossing signal $Z(n, \sigma_l)$ was defined as follows:

$$Z(n, \sigma_l) = \begin{cases} +1, & D(n-1, \sigma_l) < 0 < D(n+1, \sigma_l), \\ -1, & D(n-1, \sigma_l) > 0 > D(n+1, \sigma_l), \\ 0, & \text{otherwise}, \end{cases}$$

where the condition $Z(n, \sigma_l) = +1$ represents a zero-crossing point at which $D(n, \sigma_l)$ crosses zero from minus to plus and the condition $Z(n, \sigma_l) = -1$ a crossing from plus to minus.

Next, in the baseline adjustment step, two parametric baselines, $b^{gain}_l$ and $b^{loss}_l$, were calculated for each layer that had more than two nonzero elements in the zero-crossing signal, using the empirical cumulative distribution function of the scale-space image $F(n, \sigma_l)$. The parametric baseline $b^{gain}_l$ was estimated as the lowest value of the scale-space image among those ranked within a given threshold $T_{gain}$ from the top at layer $l$. Similarly, the parametric baseline $b^{loss}_l$ was estimated as the highest value of the scale-space image among those ranked within a given threshold $T_{loss}$ from the bottom at layer $l$, where the threshold $T_{loss}$ was decided with particular consideration of the portion of the test read coverage data with a value of zero.

In the interval search step, intervals were searched from the zero-crossing signal using the parametric baselines $b^{gain}_l$ and $b^{loss}_l$ for each layer. The $k$th interval at layer $l$ was defined as a closed interval $[n^s_{l,k}, n^e_{l,k}]$ in the position index of the zero-crossing signal $Z(n, \sigma_l)$, that is, the set of position indices between $n^s_{l,k}$ and $n^e_{l,k}$ inclusive, satisfying the following three conditions to be a putative CNV region. First, the interval does not include position indices corresponding to any region of a CNV already declared at layers above layer $l$. Second, $Z(n^s_{l,k}, \sigma_l) \neq 0$, $Z(n^e_{l,k}, \sigma_l) \neq 0$, and $Z(n, \sigma_l) = 0$ for all position indices $n$ strictly between $n^s_{l,k}$ and $n^e_{l,k}$. Third, the average of the scale-space image over the position indices between $n^s_{l,k}$ and $n^e_{l,k}$ inclusive is beyond one of the given parametric baselines, $b^{gain}_l$ or $b^{loss}_l$. Once we had the $k$th interval as a putative CNV region, we traced the zero-crossing signal from the positions $n^s_{l,k}$ and $n^e_{l,k}$ at layer $l$ down the layers until we obtained the corresponding bounding positions $n^s_{1,k}$ and $n^e_{1,k}$ at the bottom layer, where the closed interval $[n^s_{1,k}, n^e_{1,k}]$ is declared as a CNV.

Finally, the CNV detection step proceeded by searching for intervals from the top layer to the bottom layer sequentially. When searching for intervals at layer $l$, the union of the sets corresponding to all regions of CNVs already declared at the layers above $l$, that is, $\bigcup_{l' > l} \bigcup_{k=1}^{K_{l'}} [n^s_{l',k}, n^e_{l',k}]$, where $K_{l'}$ is the total number of CNVs detected at layer $l'$, was excluded, as described in a previous work [18]. The type and localization of a CNV were determined by using the results of the interval search. An interval identifies a region where a statistically significant variation occurred in the input sequence and a CNV gain or loss is to be detected; that is, a CNV gain or loss was identified if the average of the scale-space image over the interval was above $b^{gain}_l$ or below $b^{loss}_l$, respectively. The localization of a CNV was then defined by tracing the interval to the corresponding region as the scale converges to zero.
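To make the mechanics concrete, the following sketch implements a Gaussian scale space via the FFT, the second-difference zero-crossing signal, and ecdf-based parametric baselines. The scale schedule, kernel construction, and ranking details are plausible assumptions for illustration, not the published ExCNVSS parameterization.

```python
import numpy as np

# Sketch of the CNV estimation core (illustrative parameterization):
# FFT-based Gaussian scale space, second-difference zero crossings,
# and baselines from the empirical CDF of each layer.

def scale_space(x, sigmas):
    """Return one smoothed copy of x per scale via circular FFT convolution."""
    n = len(x)
    X = np.fft.rfft(x, n)
    layers = []
    for s in sigmas:
        k = np.arange(n) - n // 2
        g = np.exp(-k**2 / (2.0 * s**2))
        g /= g.sum()  # unit-area Gaussian kernel
        G = np.fft.rfft(np.roll(g, -(n // 2)), n)  # peak moved to index 0
        layers.append(np.fft.irfft(X * G, n))
    return layers

def zero_crossing_signal(f):
    """+1 where the second difference crosses minus-to-plus, -1 plus-to-minus."""
    d2 = np.convolve(f, [1.0, -2.0, 1.0], mode="same")
    z = np.zeros(len(f), dtype=int)
    sign = np.sign(d2)
    flips = np.where(sign[:-1] * sign[1:] < 0)[0]
    z[flips] = np.where(sign[flips] < 0, 1, -1)
    return z

def baselines(f, t_gain=0.9875, t_loss=0.0125):
    """Gain/loss baselines from the empirical CDF of one layer."""
    q = np.sort(f)
    b_gain = q[int(np.ceil(t_gain * (len(q) - 1)))]   # lowest of the top ranks
    b_loss = q[int(np.floor(t_loss * (len(q) - 1)))]  # highest of the bottom ranks
    return b_gain, b_loss
```

With a geometric scale schedule such as `sigmas = 2.0 ** np.arange(1, 11)` (an assumed choice), coarse layers isolate large CNVs first and fine layers refine their boundaries, mirroring the top-down interval search described above.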

2.3. Materials

For generating a simulated exome read data set, we used TargetedSim [http://sourceforge.net/projects/targetedsim/], a simulation tool developed by the Contra project group [12] that creates paired-end reads from targeted regions in a chromosome and can also simulate CNV gains or losses at random locations within targeted regions. Test and control data were simulated as Illumina paired-end reads using chromosome 1 of the human reference assembly (hg19). The read length and median insert size of the simulated data were 36 bp and 200 bp, respectively. The simulated data covered 21,881 target regions (a total length of 5,256,986 bp, an average length of 240 bp, a minimum length of 115 bp, and a maximum length of 8,551 bp) in chromosome 1, the same target regions used by the Agilent SureSelect Human All Exon 50 Mb V3 capture platform [https://earray.chem.agilent.com/suredesign/]. We generated 11 test data sets, each of which contained approximately 20 to 30 CNV regions, corresponding to approximately 70–100 target regions of appropriate size (an average length of 222 bp, a minimum length of 120 bp, and a maximum length of 8,260 bp in transcript coordinates). We aligned the test and control read data to the human reference assembly using BWA [http://bio-bwa.sourceforge.net/] and obtained test and control BAM files with an average coverage level of 40x, which is assumed to be the lower limit of a reasonable amount of sequence for variant calling.

The exome sequencing data downloaded from the 1000 Genomes Project website (http://www.1000genomes.org) were used for the experiment with real human data. The downloaded data were BAM format files of 10 HapMap samples: NA12843 (47x), NA12842 (182.6x), NA12748 (49.5x), NA12718 (102.9x), NA12275 (86.9x), NA12273 (77.2x), NA12272 (92.6x), NA11843 (50.9x), NA10847 (99.2x), and NA06984 (54x), each of which is a member of the Utah residents (CEU) population, was sequenced at the same genome center (BI), and was captured using the same assay (Agilent SureSelect Human All Exon V2). One individual sample, NA19152 (101.6x), a member of the Yoruba (YRI) population sequenced and captured using the same technology as the test data sets, was also downloaded for use as a control data set. The capture platform covered 20,258 target regions (a total length of 4,775,342 bp, an average length of 235 bp, a minimum length of 115 bp, and a maximum length of 8,551 bp) in chromosome 1.

As the downloaded 10 germline data sets were generated without matched control data sets, a pseudo-control data set had to be created to serve as the control. Two methods have been used to generate control sample data: one derives a matched control data set from a pool of other samples by averaging the depth of coverage of each exon across all exomes; the other uses a specific, different germline sample as the control. In general, generating a pooled sample is tedious and time-consuming work that entails preprocessing tens or hundreds of samples captured and sequenced on the same platform. When a specific individual sample is used as a pseudo-control, the selection is made carefully such that (1) the pseudo-control sample is a member of another population with a different background (not genetically related), (2) it is captured using the same probe set and capture method and sequenced in the same manner as the test samples, and (3) it has the same gender as the test samples.
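A minimal sketch of the pooled pseudo-control idea described above follows. It assumes every sample lists its per-exon depth arrays in the same target order; the plain per-base mean is illustrative, and real pipelines typically also rescale each sample by its library size first.

```python
import numpy as np

# Sketch of a pooled pseudo-control (illustrative): average the
# base-level depth of each target exon across a panel of exomes
# captured and sequenced on the same platform.

def pooled_control(panel):
    """panel: list of samples; each sample is a list of 1-D depth arrays,
    one per target exon, in the same exon order across samples."""
    n_exons = len(panel[0])
    return [np.mean([sample[i] for sample in panel], axis=0)
            for i in range(n_exons)]
```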

However, even if a pooled sample is generated from many well-selected independent samples or a specific sample is selected from unrelated individuals, such as from a different population, we cannot ascertain that this pseudo-control data actually has an average genomic normal copy number of 2 and does not share common CNV regions with the test sample data [11].

The pseudo-control should capture the technical variation of a platform, but not the CNV variations in the test sample. With these considerations, we propose using artificially simulated data as pseudo-control data. Various simulation methods are currently available that generate reads emphasizing different characteristics of real sequencing data for various applications; Wessim [22] particularly aims at realistic exome sequencing simulation. As effective pseudo-control read data, we adopted simulated exome data generated by Wessim. Wessim emulates conventional exome capture technologies, such as Agilent's SureSelect and Roche/NimbleGen's SeqCap, and generates realistic synthetic exome sequencing data in which fragment length and GC content are rigorously considered to reproduce accurate coverage biases. We aligned the pseudo-control read data to the human reference assembly using BWA and generated a BAM file with an average coverage level of 40x.

The BAM file of each of the test and control samples was processed, sorted, and filtered with SAMtools [http://samtools.sourceforge.net/]. After removing PCR duplicate reads with MarkDuplicates of Picard [http://picard.sourceforge.net], local realignment around indels was performed using the RealignerTargetCreator and IndelRealigner tools of GATK [https://software.broadinstitute.org/gatk/].

The performance of ExCNVSS was assessed by estimating the FNR and FPR on the basis of the number of detected target-level CNV regions in transcript coordinates. Each region was considered validated if an algorithm's calls covered more than 30% of the synthetic or known CNV region. ExCNVSS was compared with three conventional CNV detection methods: ExomeCNV, Contra, and Excavator. Furthermore, the performances of all four methods were assessed and compared with that of ExCNVSSnoRatio.
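As a concrete reading of this 30% criterion, the sketch below counts a region as detected when calls cover more than 30% of its length; the FPR convention used here (the fraction of calls that cover no truth region beyond the same threshold) is an assumption, since several conventions exist in the CNV literature.

```python
# Sketch of the validation rule as we read it. Regions are
# (start, end) tuples in transcript coordinates, and the call set is
# assumed non-overlapping so overlaps can be summed directly.

def covered_fraction(region, others):
    start, end = region
    covered = sum(max(0, min(end, oe) - max(start, os_))
                  for os_, oe in others)
    return covered / (end - start)

def fnr_fpr(truth, calls, min_frac=0.30):
    tp = sum(covered_fraction(r, calls) > min_frac for r in truth)
    fnr = (len(truth) - tp) / len(truth)
    # Assumed FPR convention: calls validating no truth region.
    fp = sum(covered_fraction(c, truth) <= min_frac for c in calls)
    fpr = fp / len(calls) if calls else 0.0
    return fnr, fpr
```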

The experiments were carried out in Windows 7 and CentOS 6.2 on an Intel Core i7 3.5 GHz CPU with 32 GB of main memory and a 2 TB hard drive. The programming language used for the development of ExCNVSS was MATLAB.

3. Results and Discussion

3.1. Experiments with Simulated Data

The first experiment was carried out to assess the performance of ExCNVSS for various values of the thresholds $T_{gain}$ and $T_{loss}$, which were used for determining the parametric baselines. Performance was assessed by estimating FNRs and FPRs on the basis of the number of detected target-level CNV regions. The experiments for each pair of threshold values were performed with 11 different simulated data sets, and the results were averaged for the assessment. The overall FNRs and FPRs were in the ranges of 10.93–13.76% and 3.11–62.21%, respectively, over the various values of $T_{gain}$ and $T_{loss}$. The best performance was obtained at $T_{gain} = 0.9875$ and $T_{loss} = 0.0125$, where the FNR and FPR were 13.76% and 3.11%, respectively. Therefore, the threshold values $T_{gain} = 0.9875$ and $T_{loss} = 0.0125$ were used as the defaults for ExCNVSS. Similarly, ExCNVSSnoRatio showed FNRs and FPRs in the ranges of 26.73–29.33% and 5.94–46.66%, respectively. The best performance of ExCNVSSnoRatio was obtained at $T_{gain} = 0.9875$ and $T_{loss} = 0.04$, where the FNR and FPR were 26.73% and 5.94%, respectively. Therefore, $T_{gain} = 0.9875$ and $T_{loss} = 0.04$ were used as the defaults for ExCNVSSnoRatio.

The performance of ExCNVSS was compared with that of ExCNVSSnoRatio, Contra, Excavator, and ExomeCNV on the 11 simulated data sets, where the parameters of each method were set according to the instructions in its manual. The adjustable parameters are as follows: Contra (numBin, minReadDepth, minNBases, pval, and nomultimapped); Excavator (ω, θ, Norm, and seg, among others); ExomeCNV (coverage.cutoff, admix, sdundo, alpha, min.spec, and min.sens). We calculated the performance in order to examine the increase in FPR and the change in FNR while varying pval (the $p$ value threshold for filtering) for Contra, the baseline probability parameter for Excavator, and min.spec (the desired minimum specificity) for ExomeCNV, each of which appeared to be directly related to the FPR.

The overall FNR of Contra was between 23.11% and 33.88% and its FPR between 0.02% and 93.83% for various pval values; with its default setting, Contra achieved an average FNR of 33.88% and FPR of 0.02%. The overall FNR of Excavator was between 27.99% and 52.60% and its FPR between 0.23% and 82.11% for various values of the baseline probability; with its default parameter settings, Excavator showed an average FNR of 52.60% and FPR of 0.23%. The overall FNR of ExomeCNV was between 45.52% and 88.69% and its FPR between 0.02% and 94.02% for various min.spec values; with its default setting, ExomeCNV showed an average FNR of 88.69% and FPR of 0.02%.

Figure 2 presents the receiver operating characteristic (ROC) curves of ExCNVSS, ExCNVSSnoRatio, Contra, Excavator, and ExomeCNV for comparison. As shown in the ROC curves, the performance of ExCNVSS was better than those of the other four methods. The conventional methods, Contra, Excavator, and ExomeCNV, were very conservative in calling a region significant, resulting in high FNRs and low FPRs with default parameter settings. Although some parameters can be varied to relax the specificity of these methods, no remarkable improvement in FNR was observed. In contrast, ExCNVSSnoRatio provided good FNR performance with little increase in FPR, even though control data to compensate for inherent coverage biases were not used. These results suggest that both ExCNVSS and ExCNVSSnoRatio are robust in error-prone environments, performing well even on relatively low-coverage data.

The second experiment was carried out to assess the performance of ExCNVSS with respect to the size of CNVs. The size of a target region in most exome capture platforms is typically small and approximately 90% of target regions are <300 bp in length. For the experiment, we simulated 861 loss and gain target regions (minimum length of 120 bp, maximum length of 8,260 bp, and an average length of 222 bp), including single exon losses and gains, as well as variations spanning multiple exons in the test data sets. We also generated control data sets with no CNVs and the same mean coverage (40x) as the test data sets.

Table 1 shows the performance of methods on simulated data sets, representing the total number of correctly detected instances of small (100–159 bp), medium (160–299 bp), and large (300–8260 bp) variants, along with the fraction of gain/loss regions of each in parentheses. The second and third columns for each method represent false negative and false positive rates and the range of detected gain/loss region sizes (min/max), respectively.

The results show that ExCNVSS is superior to the other four methods in detecting CNVs of various sizes. Because ExCNVSS detects larger CNVs at higher scales and smaller CNVs at lower scales, its FNR can be reduced across CNV sizes compared to conventional methods that use target-level log-ratio detection and segmentation. Additionally, compared with the other methods, ExCNVSS detects more CNV loss regions, which may represent severe mutations in Mendelian diseases. It has been acknowledged that CNV losses are usually more harmful because a great deal of genetic information is missing, whereas CNV gains involve repeated nucleotide units.

ExCNVSSnoRatio showed slightly lower performance in detecting larger CNVs than small or medium-sized CNVs. This could be because biases may not be compensated sufficiently over large CNV regions by our segmentation and normalization method in the absence of control data. Contra achieved good performance in detecting medium-sized CNVs, while its FNR increased for smaller and larger CNVs. As previously mentioned, Contra, Excavator, and ExomeCNV are conservative in calling a region significant and show relatively high FNRs and low FPRs with default parameter settings. We conclude that ExCNVSS and ExCNVSSnoRatio are effective in detecting CNVs of various sizes by reducing the inherent noise in exome read coverage data.

3.2. Experiments with HapMap Samples

The downloaded BAM files of the 10 HapMap samples were used for the experiments with real human data. The performance assessment was based on the FNR and FPR evaluated against the Phase 3 variant list of the 1000 Genomes Project released in 2014. Each region was considered validated if the algorithm's calls covered more than 30% of the known CNV region profiled in the Phase 3 variant list. However, it should be noted that a true gold-standard CNV list for these HapMap samples is still not available, and this list does not have 100% sensitivity and specificity [23].

As previously mentioned, ExCNVSS, Excavator, Contra, and ExomeCNV require two input samples, test and control, to identify CNV variants. In this real data experiment, two different types of pseudo-control data were used: one was an artificial data set simulating realistic synthetic exome sequencing data, and the other was a specific sample selected from unrelated individuals, such as from a different population.

In the first experiment, we used artificial exome data generated by Wessim [http://sak042.github.io/Wessim/]. Wessim provides two distinct approaches to exome read generation: the ideal target approach and the probe hybridization approach. The probe hybridization approach is recommended when the probe sequence is available, as it is much more realistic and recovers the statistics of real data with default parameter settings. Table 2 presents a quantitative analysis of the experimental results on the whole of chromosome 1 of the 10 HapMap samples, in which the performance of ExCNVSS is compared with those of ExCNVSSnoRatio, ExomeCNV, Contra, and Excavator. Here, ExomeCNV, Contra, and Excavator adopted the same Wessim data set as the control.

In general, selecting the optimal parameters to match the characteristics of the data is crucial for increasing an algorithm's performance. However, finding the optimal parameters of each algorithm requires an exact understanding of the mathematical model underlying the interaction between multiple parameters, which is an extremely difficult and delicate task.

Therefore, we referred to the performance evaluations on HapMap samples in the original paper of each algorithm and adopted the parameter values reported there as the optimal parameters for our real data experiments on HapMap samples. In the Excavator paper [15], the reported parameters (including Norm = $10^5$) were used for the analysis of 20 HapMap samples, and in the Contra paper [12], the parameter pval = 0.01 was used for the analysis of 5 HapMap samples. Only in the case of ExomeCNV were the parameters used for the analysis of HapMap samples not specified [11]. As a result, for Excavator and Contra the parameter values given in the original papers were used as the optimal parameters in our experiments, and for ExomeCNV the default parameters used in the performance tests of previous papers were used.

Table 2 shows the results of the performance evaluation of each algorithm using real data with the optimal parameters. In Table 2, the first column for each method represents the total number of correctly detected instances and the fraction of gain/loss regions is given in parentheses. The second column represents false negative and false positive rates, respectively.

In the 10 HapMap samples, ExCNVSS obtained the best FNR, followed by ExomeCNV, Excavator, ExCNVSSnoRatio, and Contra. Contra was the best for FPR, followed by Excavator, ExCNVSS, ExCNVSSnoRatio, and ExomeCNV. Among the five methods, ExCNVSS and Excavator showed the best overall performances. The overall FNRs for ExCNVSS and Excavator were between 25.64% and 45.54% and between 33.68% and 54.81%, respectively, and their FPRs were between 10.45% and 13.21% and between 4.43% and 6.10%, respectively. However, Excavator produced poor results in identifying CNV loss regions, detecting only a small number of such CNV events. ExomeCNV obtained relatively good FNR performance because it returned a large number of CNV instances; in contrast, Contra showed poor FNR performance because it returned only a small number. Collectively, ExCNVSS showed a reasonable trade-off between FNR and FPR and efficiently detected CNVs of various types and sizes.

In the second experiment, we used the exome read data of the individual sample NA19152 (a member of the YRI population) as the control data set, which was also used as a control for the analysis of HapMap samples in [15]. Table 3 describes the experimental results, obtained with the optimal parameters, on the 10 HapMap samples, each of which is a member of the CEU population. In the 10 HapMap samples, all methods gave poor results, with the exception of ExCNVSSnoRatio. Even ExCNVSS and Excavator performed poorly in FNR while preserving FPR performance similar to that of the first experiment. However, although no control data were used, ExCNVSSnoRatio performed better than the other methods.

From these results, we can see that even with careful efforts to select a proper pseudo-control sample from other individual samples, the biases introduced by capture and sequencing could not be removed completely. The choice of control data is a crucial step in variant calling, as it helps to manage the inherent noise of the test data and affects the overall performance of the methods. These results show that a well-made simulated data set can be a good alternative control for reducing the coverage biases of the test data, compared to using real data. Furthermore, ExCNVSSnoRatio can be an alternative to ExCNVSS in the absence of proper matched control data.

4. Conclusions

As advanced NGS technologies produce large numbers of short reads at lower cost and higher speed and exome sequencing data continue to accumulate, the need to detect even small disease-related genetic variations directly from exome sequencing is expected to increase drastically. We have developed an exon-based CNV detection method using read coverage depth evaluation and scale-space filtering. Our method corrects coverage biases and accounts for the sparse, small, and noncontiguous nature of target regions. We tested the method on both simulated and real data, and the results show that it can be applied to relatively low-coverage data with practical specificity and sensitivity. We have also developed a method that can be applied to cases with test data only, and the results show that it can detect CNV gains and losses of all sizes over concatenated, arbitrary-sized exonic regions, even when a matched control is not available.

Our methods show excellent FNRs and relatively fair FPRs compared to conventional methods, as well as superiority in detecting CNVs of various sizes. Especially in the assessment using the 10 real HapMap samples' data, one of our methods showed the best FNR performance and fairly good FPR performance compared to the conventional methods ExomeCNV, Excavator, and Contra. This suggests that our method can reliably detect CNVs of all sizes from exome sequencing data regardless of the availability of a matched control. ExCNVSS and ExCNVSSnoRatio are freely accessible at http://dblab.hallym.ac.kr/ExCNVSS/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

Unjoo Lee and Jeehee Yoon contributed equally to this work.

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT and Future Planning (2014R1A2A1A11052141) and by the Ministry of Education (NRF-2014R1A1A2057199).