Computational and Mathematical Methods in Medicine

Volume 2013 (2013), Article ID 179761, 13 pages

http://dx.doi.org/10.1155/2013/179761

## The Number of Candidate Variants in Exome Sequencing for Mendelian Disease under No Genetic Heterogeneity

^{1}Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Research Organization of Information and Systems, 1111 Yata, Mishima, Shizuoka 411-8540, Japan

^{2}Department of Mathematical Analysis and Statistical Inference, The Institute of Statistical Mathematics, Research Organization of Information and Systems, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan

Received 30 January 2013; Revised 25 March 2013; Accepted 29 March 2013

Academic Editor: Shigeyuki Matsui

Copyright © 2013 Jo Nishino and Shuhei Mano. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

There has been recent success in identifying disease-causing variants in Mendelian disorders by exome sequencing followed by simple filtering techniques. Studies generally assume complete or high penetrance. However, there are likely many failed and unpublished studies due in part to incomplete penetrance or phenocopy. In this study, the expected number of candidate single-nucleotide variants (SNVs) in exome data for autosomal dominant or recessive Mendelian disorders was investigated under the assumption of “no genetic heterogeneity.” All variants were assumed to be under the “null model,” and sample allele frequencies were modeled using standard population genetics theory. To investigate the properties of pedigree data, full-sibs were considered in addition to unrelated individuals. In both cases, and particularly for full-sibs, the number of SNVs remained very high without controls. The high efficacy of controls was also confirmed. When controls were used with a relatively large total sample size, filtering that incorporates incomplete penetrance and phenocopy efficiently reduced the number of candidate SNVs. This suggests that filtering is useful when the assumption of “no genetic heterogeneity” is appropriate and could provide general guidelines for sample size determination.

#### 1. Introduction

Understanding associations between human genetic variations and phenotypes, including risk of disease, is important for successful realization of personalized medicine. Such variants can be used as biomarkers. Recent advances in high-throughput sequencing technology (“next-generation DNA sequencing” (NGS)) enable exploration of human genetic variations on genome-wide and individual levels.

The international “1000 Genomes Project,” which uses NGS technology, was launched in 2008. The project aims to create a detailed catalog of human genetic variations by sequencing at least 1,000 individuals [1]. This type of catalog would provide a basis for studies on disease-causing variants or genes. In the last decade, genome-wide association studies (GWAS) using single-nucleotide polymorphism (SNP) genotyping arrays have been successful, although genetic variants identified by GWAS explain only a small proportion of heritability for many complex diseases [2]. A major reason for this limitation is that the “common disease, common variant” hypothesis is a prerequisite for GWAS [2]. The hypothesis that many common diseases are caused by “common variants” (i.e., variants present in more than 1–5% of a population), as detected by SNP genotyping arrays, is unlikely to be realistic. Attention has gradually turned to “rare variants,” which can be detected by NGS technology.

The cost of DNA sequencing is continuously being reduced. However, whole genome sequencing is still too expensive. Recently, sequencing the exome (all protein-coding regions in the genome) has been considered for identifying disease-causing genes or variants. The human exome consists of approximately 30 Mb (megabase pairs), corresponding to approximately 1% of the total genome. Thus, exome sequencing is cost effective. Ng et al. [3] provided a proof of concept that exome sequencing can be used to identify disease-causing genes or variants using a simple filtering approach. To date, more than 100 disease-causing genes for Mendelian disorders have been identified using exome sequencing [4].

Analyses of exome data for Mendelian disorders are conducted in a simple, intuitive manner. For example, Ng et al. [3] “reidentified” the MYH3 gene, which is known to cause the rare autosomal dominant disorder Freeman-Sheldon syndrome, as follows: (1) retention of genes in which at least one nonsynonymous single-nucleotide variant (SNV), splice-site variant, or indel was present in four unrelated affected individuals and (2) filtering out (removing) variants present in the exomes of eight control individuals or in a public database (dbSNP). As an example of using whole genome sequencing for a single patient in a pedigree, Sobreira et al. [5] identified the causative gene of the rare autosomal dominant disease metachondromatosis. Linkage analysis using SNP genotyping arrays was conducted in advance, and the whole genomes of a single patient and eight unrelated controls were sequenced. The researchers focused on regions with high positive LOD scores and used sequences from the eight controls and dbSNP data as filters to remove variants. They then identified a patient-specific deletion in an exon of PTPN11.

Exome sequencing is an effective method for identifying disease-causing variants in Mendelian disorders. However, there are likely a large number of failed and unpublished studies due to incomplete penetrance, phenocopy, or genotyping error (including sequencing error). Is exome analysis for Mendelian disease actually applicable under incomplete penetrance and phenocopy? What is the necessary sample size? Theoretical studies based on simple models are well suited to answering such questions. However, theoretical studies of exome analysis in Mendelian disease are rare, even for the case of complete penetrance and no phenocopy.

In exome sequencing, short reads produced by NGS are mapped to the reference sequence, which is the standard human genome sequence, and variants are detected against the reference (Figure 1(a)). Disease-causing variants are searched for among the variants detected in affected individuals. In this study, the number of candidate SNVs for diseases following Mendelian inheritance modes, including autosomal dominant and recessive, was investigated under the assumption of “no genetic heterogeneity” (i.e., neither allelic nor locus heterogeneity: the disease is caused by a single variant in a single gene rather than by several variants in one or more genes). It was assumed that allelic types of all variants are independent of the affected status (i.e., all variants are under the “null model”). This is valid because there is only one disease-causing variant. Allelic frequencies in a sample were modeled using standard population genetics theory. Exome sequences with and without controls were considered, and incomplete penetrance and phenocopy were incorporated as filtering conditions (Figures 1(b), 1(c), and 1(d)). Differences between data from unrelated individuals and pedigrees were also evaluated (Figures 1(e) and 1(f)). Public databases (e.g., dbSNP or the 1000 Genomes Project database), which can include errors and generally do not provide phenotype information, are often used to filter out SNVs in exome analysis, but were not considered in this study.

Zhi and Chen [6] modeled an analysis of exome sequencing. The authors investigated the power under various conditions, including the number of mutations identified after filtering (corresponding to the number of SNVs after filtering in this study), inheritance modes of disease (i.e., autosomal dominant and recessive), locus heterogeneity, gene length, sample size, and others. Common or low-quality variants were filtered out in advance, and disease-causing genes were explored under genetic heterogeneity. The authors treated the number of SNVs after filtering as a known constant. In contrast, we directly filtered disease-causing variants according to modes of inheritance under the assumption of “no genetic heterogeneity” and evaluated the number of candidate SNVs after filtering. In addition, although the term “SNV” means “single-nucleotide variant” as shown in Figure 1(a), it can be interpreted simply as a “variant,” including “splice-site variant” or “indel.” The term “SNV” is used in this study because there are fewer splice-site variants or indels than SNVs in exome sequences [3].

#### 2. Method

There are roughly 20,000 SNVs in a single human exome [3]. That is, a diploid exome sequence (two haploid exome sequences) differs in allelic type (alternative type, $A$) from the haploid reference sequence (reference type, $R$) at ~20,000 DNA sites (Figure 1(a)). According to the population genetics theory described below, the expected number of SNVs with $i$ mutant and $n-i$ ancestral alleles in $n$ haploid sequences randomly sampled from a population can be obtained using a simple formula. In Section 2.1, we used this formula to derive an expression for the expected number of SNVs with $i$ alternative and $n-i$ reference alleles in $n$ haploid sequences randomly sampled from a population. In Section 2.2, exome sequences of $n$ unrelated affected individuals (Figure 1(e)) were considered, and the expected number of SNVs in which the numbers of individuals with genotypes $RR$, $RA$, and $AA$ are $n_{RR}$, $n_{RA}$, and $n_{AA}$, respectively, was obtained. This enabled calculation of the expected number of SNVs after filtering, as illustrated in Figures 1(b) and 1(c). In Section 2.3, a case with additional controls was considered (Figure 1(d)). In Section 2.4, we considered data from full-sibs with and without controls in a nuclear family to investigate the properties of the expected number of SNVs using exome sequences from a pedigree (Figure 1(f)).

##### 2.1. Site Frequency Spectrum of the Alternative Allele

We considered $n$ haploid sequences randomly sampled from a population under the Wright-Fisher diffusion model. The infinite-site model of neutral mutations was assumed. We denoted the diploid population size and the mutation rate per haploid sequence per generation by $N$ and $u$, respectively. Let $\xi_i$ be the number of SNVs with $i$ mutant (derived) and $n-i$ ancestral alleles in the $n$ haploid sequences; $(\xi_1, \ldots, \xi_{n-1})$ is the “site frequency spectrum” of the mutant (derived) allele in a sample. According to Fu [7], the expectation of $\xi_i$ is

$$E[\xi_i] = \frac{\theta}{i}, \qquad i = 1, \ldots, n-1, \tag{1}$$

where $\theta = 4Nu$. This simple formula does not include the sample size $n$. As described in the following section, the point estimate of $\theta$ for the human exome is ~13,333. For example, when considering four haploid exomes (equivalent to two unrelated diploid exomes), the number of SNVs with one, two, and three mutant alleles is expected to be 13,333, 6,666.50, and 4,444.33, respectively.
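The worked numbers above can be reproduced with a few lines of Python. This is only an illustrative sketch: the names `THETA` and `expected_sfs` are not from the paper, and the code simply evaluates "theta divided by the number of mutant alleles" for each class, using the point estimate 13,333 quoted in the text.

```python
from fractions import Fraction

THETA = 13333  # point estimate of theta for the human exome quoted in the text


def expected_sfs(n, theta=THETA):
    """Expected number of SNVs with i mutant alleles (i = 1..n-1)
    in n haploid sequences under the neutral infinite-site model."""
    return {i: Fraction(theta, i) for i in range(1, n)}


# Four haploid exomes (two unrelated diploid exomes)
sfs = expected_sfs(4)
for i, value in sfs.items():
    print(i, round(float(value), 2))
```

Exact fractions are used so that rounding happens only when the values are printed.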

However, in practice it is often not known whether the DNA type at a segregating site is mutant or ancestral. In exome analysis, DNA types are generally expressed as “reference ($R$)” or “alternative ($A$)” because variants in exome sequences are detected by comparison with a reference genome sequence (Figure 1(a)). This study was also carried out in terms of “reference type ($R$)” and “alternative type ($A$).” Thus, as a first step, we defined $\eta_i$ in place of $\xi_i$ and derived the expression for $E[\eta_i]$.

In addition to the $n$ haploid sequences, we considered that a reference sequence was also randomly sampled from the population ($n+1$ sequences in total). We defined $\eta_i$ as the number of SNVs with $i$ alternative and $n-i$ reference alleles in the $n$ haploid sequences. At a segregating site in the $n+1$ sequences, the reference allele is either mutant or ancestral. The expected number of SNVs in which the reference allele is mutant and there are $i$ alternative and $n-i$ reference alleles in the $n$ haploid sequences is the product of the expected number of SNVs with $n-i+1$ mutant alleles in $n+1$ sequences, based on (1), and the probability that a mutant allele is chosen as the reference from $n+1$ alleles with $n-i+1$ mutant alleles, $(n-i+1)/(n+1)$. This is represented as

$$\frac{\theta}{n-i+1}\cdot\frac{n-i+1}{n+1} = \frac{\theta}{n+1}. \tag{2}$$

Similarly, the expected number of SNVs in which the reference allele is ancestral and there are $i$ alternative and $n-i$ reference alleles in the $n$ haploid sequences was obtained. The expectation is the product of the expected number of SNVs with $i$ mutant alleles in $n+1$ sequences, based on (1), and the probability that an ancestral allele is chosen as the reference from $n+1$ alleles with $i$ mutant alleles, $(n-i+1)/(n+1)$. The resulting equation is

$$\frac{\theta}{i}\cdot\frac{n-i+1}{n+1}. \tag{3}$$

The expectation of $\eta_i$, $E[\eta_i]$, is equal to the sum of (2) and (3), resulting in

$$E[\eta_i] = \frac{\theta}{n+1} + \frac{\theta}{i}\cdot\frac{n-i+1}{n+1} = \frac{\theta}{i}, \qquad i = 1, \ldots, n. \tag{4}$$

The formula does not include the sample size. Interestingly, this is the result that would be obtained from (1) by assuming that the alternative alleles are mutant. Note that $i$ can be at most $n$ in (4) (the $n$ alleles are all alternative at a particular DNA site).
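The cancellation of the sample size can be checked symbolically with exact rational arithmetic. In this sketch (all function names are illustrative, not from the paper), both contributions are expressed in units of theta: the case where the reference allele is mutant and the case where it is ancestral. Their sum should equal 1/i for every sample size n and every i from 1 to n.

```python
from fractions import Fraction


def ref_is_mutant(i, n):
    """Reference allele is mutant: the n+1 sequences carry n-i+1 mutant
    alleles, and one of those mutant alleles is the reference."""
    k = n - i + 1
    return Fraction(1, k) * Fraction(k, n + 1)


def ref_is_ancestral(i, n):
    """Reference allele is ancestral: the n+1 sequences carry i mutant
    alleles, and one of the n+1-i ancestral alleles is the reference."""
    return Fraction(1, i) * Fraction(n + 1 - i, n + 1)


def alt_spectrum(i, n):
    """Expected number of SNVs with i alternative alleles, in units of theta."""
    return ref_is_mutant(i, n) + ref_is_ancestral(i, n)


# The sum should be 1/i regardless of n
check = all(
    alt_spectrum(i, n) == Fraction(1, i)
    for n in range(1, 30)
    for i in range(1, n + 1)
)
print(check)
```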

##### 2.2. Unrelated Affected Individuals

Next, consider exome sequences of $n$ unrelated affected individuals under the Wright-Fisher diffusion model (Figure 1(e)). The infinite-site model of neutral mutations was assumed again. Assuming that $n$ diploid exome sequences and a reference sequence are “randomly sampled” from the population, we obtained the expected number of SNVs in which the number of individuals with genotypes $RR$, $RA$, and $AA$ is $n_{RR}$, $n_{RA}$, and $n_{AA}$, respectively. Here, “randomly sampled” means that $n$ diploid exome sequences and a reference sequence are drawn at random, which is equivalent to drawing $2n+1$ haploid exome sequences at random and then choosing one of the $2n+1$ sequences as the reference. The remaining $2n$ sequences are randomly joined to form $n$ diploids. The latter description is used for illustrative purposes.

The conditions on the variables are collected here. As in Section 2.1, $2n-i$ and $i$ denote the number of reference and alternative alleles at a site, respectively. One has

$$1 \le i \le 2n, \qquad n_{RR},\, n_{RA},\, n_{AA} \in \{0, 1, \ldots, n\}, \tag{5a}$$

$$n_{RR} + n_{RA} + n_{AA} = n, \qquad 2n_{RR} + n_{RA} = 2n - i, \qquad n_{RA} + 2n_{AA} = i. \tag{5b}$$

Note that given $i$ or $2n-i$ (and constant $n$), there is only one independent variable among $n_{RR}$, $n_{RA}$, and $n_{AA}$. For example, if $i$ and $n_{AA}$ are fixed, the other two variables, $n_{RR}$ and $n_{RA}$, are automatically determined.

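Let $\zeta(n_{RR}, n_{RA}, n_{AA})$ be the number of SNVs in which the number of reference and alternative alleles is $2n-i$ and $i$, respectively, and the number of individuals with genotypes $RR$, $RA$, and $AA$ is $n_{RR}$, $n_{RA}$, and $n_{AA}$, respectively, in a total of $n$ individuals. The expected number of SNVs, $E[\zeta(n_{RR}, n_{RA}, n_{AA})]$, is defined only when all conditions of (5a) and (5b) are met. First, we considered $2n$ haploid exome sequences and a reference to be “randomly sampled.” The number of SNVs with $i$ alternative and $2n-i$ reference alleles in the $2n$ haploid samples can be readily obtained by (4). The probability of the genotype configuration $(n_{RR}, n_{RA}, n_{AA})$, given that a DNA site has $i$ alternative alleles, was denoted as $P(n_{RR}, n_{RA}, n_{AA} \mid i)$. The number of distinct permutations of the $2n$ alleles is $(2n)!/\{i!\,(2n-i)!\}$. How many of these permutations result in the genotype configuration $(n_{RR}, n_{RA}, n_{AA})$? The number of ways to assign a genotype to each of the $n$ distinct individuals so as to generate the configuration is $n!/(n_{RR}!\,n_{RA}!\,n_{AA}!)$. The genotype $RA$ can be generated from two orderings, $RA$ and $AR$. Therefore, the number of permutations that generate the configuration is $2^{n_{RA}}\,n!/(n_{RR}!\,n_{RA}!\,n_{AA}!)$, and

$$P(n_{RR}, n_{RA}, n_{AA} \mid i) = \frac{2^{n_{RA}}\, n!\; i!\,(2n-i)!}{n_{RR}!\, n_{RA}!\, n_{AA}!\,(2n)!}.$$

This expression was shown elsewhere and is used to perform the exact test of Hardy-Weinberg equilibrium [8]. Let us give a proof of the following proposition.

Proposition 1. *$E[\zeta(n_{RR}, n_{RA}, n_{AA})] = E[\eta_i]\, P(n_{RR}, n_{RA}, n_{AA} \mid i)$.*

*Proof.* $E[\zeta(n_{RR}, n_{RA}, n_{AA})] = E_d\big[E_b[\zeta(n_{RR}, n_{RA}, n_{AA}) \mid \eta_i]\big]$, where $E_d$ is the expectation with respect to the diffusion model and $E_b$ is the expectation with respect to the binomial sampling. The binomial sampling consists of $\eta_i$ Bernoulli trials, each addressing whether a site shows the genotype counts $(n_{RR}, n_{RA}, n_{AA})$. The success probability of each Bernoulli trial is $P(n_{RR}, n_{RA}, n_{AA} \mid i)$. Therefore, $E_b[\zeta(n_{RR}, n_{RA}, n_{AA}) \mid \eta_i] = \eta_i\, P(n_{RR}, n_{RA}, n_{AA} \mid i)$. $E_d[\eta_i]$ is given by (4), and the proposition follows.

Then, we have

$$E[\zeta(n_{RR}, n_{RA}, n_{AA})] = \frac{\theta}{i}\cdot\frac{2^{n_{RA}}\, n!\; i!\,(2n-i)!}{n_{RR}!\, n_{RA}!\, n_{AA}!\,(2n)!}. \tag{6}$$

Here, $E[\zeta(n_{RR}, n_{RA}, n_{AA})]$ is not defined if (5a) and (5b) are not satisfied. For example, in the case of $n = 2$ (4 haploid sequences), the expected number of SNVs for $(n_{RR}, n_{RA}, n_{AA})$ = $(1,1,0)$, $(0,2,0)$, $(1,0,1)$, $(0,1,1)$, and $(0,0,2)$, which satisfy (5a) and (5b), is $\theta$, $\theta/3$, $\theta/6$, $\theta/3$, and $\theta/4$, respectively. If we use 13,333 as the human exome $\theta$, these are 13,333, 4,444.33, 2,222.17, 4,444.33, and 3,333.25, respectively. If both individuals are affected by a certain recessive disease with the genotype $AA$ at a causal DNA site, we can use a filter to retain variants in which both individuals have the genotype $AA$. The expected number of SNVs after filtering is $E[\zeta(0,0,2)]$ = 3,333.25. Similarly, when both individuals are affected by a dominant disease with genotypes $RA$ or $AA$ at a causal DNA site, the expected number of SNVs after filtering is $E[\zeta(0,2,0)] + E[\zeta(0,1,1)] + E[\zeta(0,0,2)]$ = 4,444.33 + 4,444.33 + 3,333.25 = 12,221.91. In this way, by summing $E[\zeta(n_{RR}, n_{RA}, n_{AA})]$ over all sets $(n_{RR}, n_{RA}, n_{AA})$ that satisfy (5a), (5b), and a filtering condition, the expected number of SNVs after filtering can be calculated.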
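The two-individual example can be reproduced by enumerating genotype configurations directly. The sketch below is an illustration rather than the authors' implementation: `expected_config` multiplies theta divided by the number of alternative alleles by the probability of the genotype configuration given that allele count (the same combinatorial probability used in exact Hardy-Weinberg tests), and `configs` is a hypothetical helper that lists every (RR, RA, AA) count triple for a given number of individuals.

```python
from fractions import Fraction
from math import comb, factorial

THETA = 13333  # point estimate quoted in the text


def expected_config(n_rr, n_ra, n_aa, theta=THETA):
    """Expected number of SNVs with the given genotype counts (RR, RA, AA)."""
    n = n_rr + n_ra + n_aa
    i = n_ra + 2 * n_aa  # number of alternative alleles
    if i == 0:
        return Fraction(0)  # not a variant site
    ways = factorial(n) // (factorial(n_rr) * factorial(n_ra) * factorial(n_aa))
    ways *= 2 ** n_ra  # each heterozygote can be ordered RA or AR
    return Fraction(theta, i) * Fraction(ways, comb(2 * n, i))


def configs(n):
    """All (n_rr, n_ra, n_aa) with n_rr + n_ra + n_aa = n."""
    for n_aa in range(n + 1):
        for n_ra in range(n - n_aa + 1):
            yield n - n_ra - n_aa, n_ra, n_aa


n = 2
# Recessive, stringent: both individuals are AA
recessive = sum(expected_config(*c) for c in configs(n) if c[2] == n)
# Dominant, stringent: no individual is RR
dominant = sum(expected_config(*c) for c in configs(n) if c[0] == 0)
print(float(recessive), float(dominant))
```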

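In some cases, factors such as reduced penetrance, phenocopy (including misdiagnosis), or genotyping errors should be taken into account. So, consider filtering to retain only SNVs in which at least $m$ of $n$ affected individuals ($1 \le m \le n$) have $AA$ in cases of recessive disease or $RA$ or $AA$ in cases of dominant disease (Figure 1(c)). At the disease-causing variant site, this allows the phenocopy (or genotyping error) from genotype $RR$ or $RA$ to $AA$ in cases of recessive disease or from genotype $RR$ to $RA$ or $AA$ in cases of dominant disease. The following are detailed methods of calculating the expected number of SNVs after filtering.

As noted, given $i$ or $2n-i$ (and constant $n$), there is only one independent variable in the conditions of (5b). In the case of recessive disease, we can express $E[\zeta(n_{RR}, n_{RA}, n_{AA})]$ as a function of $i$ and $n_{AA}$ using (5b), denoted by $E[\zeta_{\mathrm{rec}}(i, n_{AA})]$. Specifically, with $n_{RA} = i - 2n_{AA}$ and $n_{RR} = n - i + n_{AA}$, this can be expressed as

$$E[\zeta_{\mathrm{rec}}(i, n_{AA})] = \frac{\theta}{i}\cdot\frac{2^{\,i-2n_{AA}}\, n!\; i!\,(2n-i)!}{(n-i+n_{AA})!\,(i-2n_{AA})!\, n_{AA}!\,(2n)!}. \tag{7}$$

$E[\zeta_{\mathrm{rec}}(i, n_{AA})]$ is not defined if (5a) is not satisfied. After filtering, the expected number of SNVs in which at least $m$ of $n$ affected individuals have $AA$ is calculated by

$$\sum_{n_{AA}=m}^{n}\;\sum_{i=2n_{AA}}^{n+n_{AA}} E[\zeta_{\mathrm{rec}}(i, n_{AA})]. \tag{8}$$

In cases of dominant disease, denoting by $n_D = n_{RA} + n_{AA}$ the number of individuals with genotypes $RA$ or $AA$, $E[\zeta(n_{RR}, n_{RA}, n_{AA})]$ can be expressed as a function of $i$ and $n_D$ using (5b), denoted by $E[\zeta_{\mathrm{dom}}(i, n_D)]$. With $n_{RR} = n - n_D$, $n_{RA} = 2n_D - i$, and $n_{AA} = i - n_D$, this results in

$$E[\zeta_{\mathrm{dom}}(i, n_D)] = \frac{\theta}{i}\cdot\frac{2^{\,2n_D-i}\, n!\; i!\,(2n-i)!}{(n-n_D)!\,(2n_D-i)!\,(i-n_D)!\,(2n)!}. \tag{9}$$

$E[\zeta_{\mathrm{dom}}(i, n_D)]$ is not defined if (5a) is not satisfied. After filtering, the expected number of SNVs in which at least $m$ of $n$ affected individuals have $RA$ or $AA$ is calculated by

$$\sum_{n_D=m}^{n}\;\sum_{i=n_D}^{2n_D} E[\zeta_{\mathrm{dom}}(i, n_D)]. \tag{10}$$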
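The threshold filters without controls can be sketched as a sum over genotype configurations. The code below is an assumption-laden illustration, not the authors' program: instead of the double sum over allele count and carrier count, it enumerates every (RR, RA, AA) triple and keeps those with enough carriers, which covers the same set of terms. The names `filtered` and `expected_config` and the `mode` strings are hypothetical, and the threshold `m` is assumed to be at least 1.

```python
from fractions import Fraction
from math import comb, factorial

THETA = 13333  # point estimate quoted in the text


def expected_config(n_rr, n_ra, n_aa, theta=THETA):
    """Expected number of SNVs with the given genotype counts (RR, RA, AA)."""
    n = n_rr + n_ra + n_aa
    i = n_ra + 2 * n_aa
    if i == 0:
        return Fraction(0)
    ways = factorial(n) // (factorial(n_rr) * factorial(n_ra) * factorial(n_aa))
    return Fraction(theta, i) * Fraction(ways * 2 ** n_ra, comb(2 * n, i))


def filtered(n, m, mode, theta=THETA):
    """Expected number of SNVs in which at least m of n affected individuals
    carry AA (mode='recessive') or RA/AA (mode='dominant')."""
    total = Fraction(0)
    for n_aa in range(n + 1):
        for n_ra in range(n - n_aa + 1):
            n_rr = n - n_ra - n_aa
            carriers = n_aa if mode == "recessive" else n_ra + n_aa
            if carriers >= m:
                total += expected_config(n_rr, n_ra, n_aa, theta)
    return total


# Stringent dominant filter for n = 1..4 unrelated affected individuals
for n in range(1, 5):
    print(n, round(float(filtered(n, n, "dominant")), 2))
```

A nonstringent filter is obtained by lowering `m`, for example `filtered(50, 45, "dominant")` for "at least 90% of 50 individuals."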

##### 2.3. Unrelated Affected Individuals with Controls

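Consider exome sequences of $n$ unrelated individuals consisting of $n_a$ affected individuals and $n_c$ controls ($n = n_a + n_c$). In cases of recessive disease, we considered a filter to retain only SNVs in which at least $m_a$ of $n_a$ affected and at most $m_c$ of $n_c$ control individuals have $AA$ (Figure 1(d), left). This allows the phenocopy (or genotyping error) from genotype $RR$ or $RA$ to $AA$ and/or the reduced penetrance of $AA$ at a disease-causing variant site. Similarly, in cases of dominant disease, we considered a filter to retain only SNVs in which at least $m_a$ of $n_a$ affected and at most $m_c$ of $n_c$ control individuals have $RA$ or $AA$ (Figure 1(d), right).

First, we did not distinguish affected individuals from controls among the total of $n$ individuals. The expected number of SNVs in which the number of alternative alleles is $i$ and the number of individuals with genotype $AA$ is $n_{AA}$ is still given by (7). Next, we assumed that $n_a$ affected individuals and $n_c$ controls were randomly selected from the $n$ individuals. Considering recessive diseases, for a given $n_{AA}$, the number of individuals with genotype $AA$ among the $n_a$ affected individuals, $a_{AA}$, follows a hypergeometric distribution. As a result, the expected number of SNVs in which the number of alternative alleles is $i$, the number of individuals with genotype $AA$ is $n_{AA}$, and $a_{AA}$ of those are among the $n_a$ affected individuals is represented as

$$E[\zeta_{\mathrm{rec}}(i, n_{AA}, a_{AA})] = E[\zeta_{\mathrm{rec}}(i, n_{AA})]\;\frac{\binom{n_{AA}}{a_{AA}}\binom{n-n_{AA}}{n_a-a_{AA}}}{\binom{n}{n_a}}. \tag{11}$$

After filtering, the expected number of SNVs in which at least $m_a$ of $n_a$ affected individuals and at most $m_c$ of $n_c$ control individuals have the genotype $AA$ is obtained by summing (11):

$$\sum_{i}\;\sum_{n_{AA}}\;\sum_{a_{AA}} E[\zeta_{\mathrm{rec}}(i, n_{AA}, a_{AA})], \tag{12}$$

where the sum over $a_{AA}$ is over the values satisfying the filtering condition, $a_{AA} \ge m_a$ and $c_{AA} \le m_c$. Here, $c_{AA} = n_{AA} - a_{AA}$ denotes the number of individuals with genotype $AA$ in the controls.

Similarly, considering dominant diseases, the expected number of SNVs in which the number of alternative alleles is $i$, the number of individuals with genotypes $RA$ or $AA$ is $n_D$, and $a_D$ of those are among the $n_a$ affected individuals is represented by

$$E[\zeta_{\mathrm{dom}}(i, n_D, a_D)] = E[\zeta_{\mathrm{dom}}(i, n_D)]\;\frac{\binom{n_D}{a_D}\binom{n-n_D}{n_a-a_D}}{\binom{n}{n_a}}. \tag{13}$$

After filtering, the expected number of SNVs in which at least $m_a$ of $n_a$ affected individuals and at most $m_c$ of $n_c$ control individuals have the $RA$ or $AA$ genotypes is obtained by summing (13):

$$\sum_{i}\;\sum_{n_D}\;\sum_{a_D} E[\zeta_{\mathrm{dom}}(i, n_D, a_D)], \tag{14}$$

where the sum over $a_D$ is over the values satisfying the filtering condition, $a_D \ge m_a$ and $c_D \le m_c$. Here, $c_D = n_D - a_D$ denotes the number of individuals with $RA$ or $AA$ genotypes in the controls.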
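The case with controls can be sketched by attaching a hypergeometric weight to each genotype configuration. This is a hedged illustration under the assumptions stated in the text (affected individuals and controls are a random split of the unrelated individuals); the function names and the `mode` strings are hypothetical. As before, configurations are enumerated by (RR, RA, AA) counts rather than by allele count and carrier count.

```python
from fractions import Fraction
from math import comb, factorial

THETA = 13333  # point estimate quoted in the text


def expected_config(n_rr, n_ra, n_aa, theta=THETA):
    """Expected number of SNVs with the given genotype counts (RR, RA, AA)."""
    n = n_rr + n_ra + n_aa
    i = n_ra + 2 * n_aa
    if i == 0:
        return Fraction(0)
    ways = factorial(n) // (factorial(n_rr) * factorial(n_ra) * factorial(n_aa))
    return Fraction(theta, i) * Fraction(ways * 2 ** n_ra, comb(2 * n, i))


def filtered_with_controls(n_a, n_c, m_a, m_c, mode, theta=THETA):
    """Expected number of SNVs carried by at least m_a of n_a affected and
    at most m_c of n_c controls (carrier = AA if recessive, RA/AA if dominant)."""
    n = n_a + n_c
    total = Fraction(0)
    for n_aa in range(n + 1):
        for n_ra in range(n - n_aa + 1):
            n_rr = n - n_ra - n_aa
            carriers = n_aa if mode == "recessive" else n_ra + n_aa
            expected = expected_config(n_rr, n_ra, n_aa, theta)
            # k = number of carriers that fall among the affected individuals
            lo = max(0, n_a - (n - carriers))
            hi = min(carriers, n_a)
            for k in range(lo, hi + 1):
                if k >= m_a and carriers - k <= m_c:
                    weight = Fraction(
                        comb(carriers, k) * comb(n - carriers, n_a - k),
                        comb(n, n_a),
                    )
                    total += expected * weight
    return total


# One affected individual and one control, stringent filters
print(float(filtered_with_controls(1, 1, 1, 0, "recessive")))
print(float(filtered_with_controls(1, 1, 1, 0, "dominant")))
```

Setting `m_c = n_c` places no restriction on the controls, in which case the result reduces to the filter on the affected individuals alone.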

##### 2.4. Full-Sibs with and without Controls

To investigate the properties of the number of SNVs using exomes from a pedigree, we considered full-sibs with and without controls in a nuclear family (Figure 1(f)). We assumed that the four haploid exome sequences of both parents and a reference sequence were randomly sampled from a population under the Wright-Fisher diffusion model. The infinite-site model of neutral mutations was also assumed.

The expected number of SNVs with a particular genotype configuration of both parents is obtained by (6). Alternatively, using formula (4), the expected number is obtained as follows: the expected number of SNVs in which both parents have genotypes $RR \times RA$, $RA \times AA$, and $AA \times AA$ is readily obtained by substituting $i = 1$, $3$, and $4$ into (4) to be $\theta$, $\theta/3$, and $\theta/4$, respectively. Here, $RR \times RA$ indicates that the genotype of one parent is $RR$ and that of the other is $RA$, and so on. Although the expected number of SNVs with $i = 2$ in the four haploid sequences is $\theta/2$ by substituting $i = 2$ into (4), such SNVs can result in two genotype configurations, $RA \times RA$ and $RR \times AA$. Considering random combinations of the four alleles, the expected number of SNVs with $RA \times RA$ and $RR \times AA$ is $(\theta/2)(2/3) = \theta/3$ and $(\theta/2)(1/3) = \theta/6$, respectively. Given the genotype configuration of both parents, the number of sibs with genotypes $RR$, $RA$, and $AA$ follows a multinomial distribution. For the possible genotype configurations of both parents, Table 1 shows the expected number of SNVs and the probabilities that a sib with a particular genotype would be born (i.e., the parameters of the multinomial distribution).

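The expected number, $E[\zeta_{\mathrm{sib}}(s_{RR}, s_{RA}, s_{AA})]$, of SNVs in which the number of sibs with genotypes $RR$, $RA$, and $AA$ is $s_{RR}$, $s_{RA}$, and $s_{AA}$, respectively, is represented as

$$E[\zeta_{\mathrm{sib}}(s_{RR}, s_{RA}, s_{AA})] = \sum_{g} E_g\,\frac{n!}{s_{RR}!\,s_{RA}!\,s_{AA}!}\;p_{g,RR}^{\,s_{RR}}\;p_{g,RA}^{\,s_{RA}}\;p_{g,AA}^{\,s_{AA}}, \tag{15}$$

where $g$ runs over the five parental genotype configurations $RR \times RA$, $RA \times RA$, $RR \times AA$, $RA \times AA$, and $AA \times AA$; $E_g = \theta$, $\theta/3$, $\theta/6$, $\theta/3$, and $\theta/4$ is the corresponding expected number of SNVs; $(p_{g,RR}, p_{g,RA}, p_{g,AA}) = (1/2, 1/2, 0)$, $(1/4, 1/2, 1/4)$, $(0, 1, 0)$, $(0, 1/2, 1/2)$, and $(0, 0, 1)$ are the corresponding sib genotype probabilities; and $0^0 = 1$. Being simplified, this is shown as

$$E[\zeta_{\mathrm{sib}}(s_{RR}, s_{RA}, s_{AA})] = \frac{\theta\, n!}{s_{RR}!\,s_{RA}!\,s_{AA}!}\left[\frac{\mathbf{1}\{s_{AA}=0\}}{2^{n}} + \frac{1}{3\cdot 2^{\,n+s_{RR}+s_{AA}}} + \frac{\mathbf{1}\{s_{RA}=n\}}{6} + \frac{\mathbf{1}\{s_{RR}=0\}}{3\cdot 2^{n}} + \frac{\mathbf{1}\{s_{AA}=n\}}{4}\right], \tag{16}$$

where $\mathbf{1}\{\cdot\}$ equals 1 if the condition holds and 0 otherwise.

Here, $s_{RR}$, $s_{RA}$, and $s_{AA}$ are nonnegative integers and $s_{RR} + s_{RA} + s_{AA} = n$. Using (16), the expected number of SNVs after filtering is calculated as follows. This calculation is easier than that for unrelated individuals. In recessive diseases, the expected number of SNVs in which at least $m$ of $n$ affected sibs have $AA$ after filtering is calculated as

$$\sum_{(s_{RR},\, s_{RA},\, s_{AA})} E[\zeta_{\mathrm{sib}}(s_{RR}, s_{RA}, s_{AA})], \tag{17}$$

where the summation is over $(s_{RR}, s_{RA}, s_{AA})$ satisfying the filter condition $s_{AA} \ge m$. Similarly, in cases of dominant disease, the expected number of SNVs in which at least $m$ of $n$ affected sibs have an $RA$ or $AA$ genotype after filtering is calculated using (17), where, with $s_D = s_{RA} + s_{AA}$, the summation is over $(s_{RR}, s_{RA}, s_{AA})$ satisfying the filter condition $s_D \ge m$.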
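The full-sib calculation without controls can be sketched as a mixture of multinomial distributions over the five possible parental genotype pairs. The weights (in units of theta) and the Mendelian segregation probabilities in `PARENTS` follow the derivation in the text; the function names and the `mode` strings are illustrative assumptions rather than the authors' code.

```python
from fractions import Fraction
from math import factorial

THETA = 13333  # point estimate quoted in the text
H, Q, Z, ONE = Fraction(1, 2), Fraction(1, 4), Fraction(0), Fraction(1)

# (expected SNVs in units of theta, (P(RR), P(RA), P(AA)) for a sib)
PARENTS = [
    (Fraction(1), (H, H, Z)),        # RR x RA
    (Fraction(1, 3), (Q, H, Q)),     # RA x RA
    (Fraction(1, 6), (Z, ONE, Z)),   # RR x AA
    (Fraction(1, 3), (Z, H, H)),     # RA x AA
    (Fraction(1, 4), (Z, Z, ONE)),   # AA x AA
]


def multinomial_pmf(counts, probs):
    """Multinomial probability of the genotype counts (0**0 is taken as 1)."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    p = Fraction(coef)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p


def sib_filtered(n, m, mode, theta=THETA):
    """Expected number of SNVs in which at least m of n affected sibs carry
    AA (mode='recessive') or RA/AA (mode='dominant')."""
    total = Fraction(0)
    for s_aa in range(n + 1):
        for s_ra in range(n - s_aa + 1):
            counts = (n - s_ra - s_aa, s_ra, s_aa)
            carriers = s_aa if mode == "recessive" else s_ra + s_aa
            if carriers >= m:
                total += theta * sum(
                    w * multinomial_pmf(counts, probs) for w, probs in PARENTS
                )
    return total


print(float(sib_filtered(100, 100, "dominant")))
```

For a single sib the result coincides with that for a single unrelated individual, which is a convenient sanity check on the parental weights.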

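We considered $n_a$ affected sibs with $n_c$ control sibs. Given the genotype configuration of both parents at a site, the numbers of affected sibs and control sibs with genotypes $RR$, $RA$, and $AA$ at the site follow independent multinomial distributions. Let $a_{RR}$, $a_{RA}$, and $a_{AA}$ be the numbers of $RR$, $RA$, and $AA$, respectively, in affected sibs, and let $c_{RR}$, $c_{RA}$, and $c_{AA}$ be the numbers of $RR$, $RA$, and $AA$, respectively, in control sibs. The expected number of SNVs with the genotype configuration $(a_{RR}, a_{RA}, a_{AA};\, c_{RR}, c_{RA}, c_{AA})$ of sibs is represented as

$$E[\zeta_{\mathrm{sib}}(a_{RR}, a_{RA}, a_{AA};\, c_{RR}, c_{RA}, c_{AA})] = \sum_{g} E_g\, M(a_{RR}, a_{RA}, a_{AA};\, n_a, p_g)\, M(c_{RR}, c_{RA}, c_{AA};\, n_c, p_g),$$

where $g$ runs over the five parental genotype configurations; $E_g$ is the expected number of SNVs with parental configuration $g$ ($\theta$, $\theta/3$, $\theta/6$, $\theta/3$, and $\theta/4$ for $RR \times RA$, $RA \times RA$, $RR \times AA$, $RA \times AA$, and $AA \times AA$); $M(\cdot\,; n, p_g)$ is the multinomial probability with the sib genotype probabilities $p_g$ of Table 1; $a_{RR}$, $a_{RA}$, $a_{AA}$, $c_{RR}$, $c_{RA}$, and $c_{AA}$ are nonnegative integers; and $a_{RR} + a_{RA} + a_{AA} = n_a$, $c_{RR} + c_{RA} + c_{AA} = n_c$. The expected number of SNVs after filtering is calculated as follows. In cases of recessive disease, the expected number of SNVs in which at least $m_a$ of $n_a$ affected and at most $m_c$ of $n_c$ control sibs have the genotype $AA$ after filtering is obtained by

$$\sum E[\zeta_{\mathrm{sib}}(a_{RR}, a_{RA}, a_{AA};\, c_{RR}, c_{RA}, c_{AA})], \tag{18}$$

where the summation is over $(a_{RR}, a_{RA}, a_{AA};\, c_{RR}, c_{RA}, c_{AA})$ satisfying the filter condition $a_{AA} \ge m_a$ and $c_{AA} \le m_c$. Similarly, in cases of dominant disease, the expected number of SNVs in which at least $m_a$ of $n_a$ affected and at most $m_c$ of $n_c$ control sibs have $RA$ or $AA$ after filtering is calculated using (18), where, with $a_D = a_{RA} + a_{AA}$ and $c_D = c_{RA} + c_{AA}$, the summation is over configurations satisfying the filter condition $a_D \ge m_a$ and $c_D \le m_c$.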
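For sibs with control sibs, the filter depends only on how many affected and control sibs are carriers, so the multinomial sums can be collapsed to binomial tail probabilities. The sketch below uses that simplification, which is a choice made for this illustration rather than necessarily the form used by the authors; the table `PARENTS`, the function names, and the `mode` strings are hypothetical. It assumes the thresholds satisfy 0 ≤ m_a ≤ n_a and 0 ≤ m_c ≤ n_c.

```python
from fractions import Fraction
from math import comb

THETA = 13333  # point estimate quoted in the text

# (expected SNVs in units of theta, P(sib is AA), P(sib is RA or AA))
PARENTS = [
    (Fraction(1), Fraction(0), Fraction(1, 2)),        # RR x RA
    (Fraction(1, 3), Fraction(1, 4), Fraction(3, 4)),  # RA x RA
    (Fraction(1, 6), Fraction(0), Fraction(1)),        # RR x AA
    (Fraction(1, 3), Fraction(1, 2), Fraction(1)),     # RA x AA
    (Fraction(1, 4), Fraction(1), Fraction(1)),        # AA x AA
]


def binom_pmf(k, n, p):
    """Probability of k carriers among n sibs, each a carrier with probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def sib_filtered_with_controls(n_a, n_c, m_a, m_c, mode, theta=THETA):
    """Expected number of SNVs carried by at least m_a of n_a affected sibs
    and at most m_c of n_c control sibs."""
    total = Fraction(0)
    for weight, p_aa, p_carrier in PARENTS:
        p = p_aa if mode == "recessive" else p_carrier
        p_affected = sum(binom_pmf(k, n_a, p) for k in range(m_a, n_a + 1))
        p_controls = sum(binom_pmf(k, n_c, p) for k in range(0, m_c + 1))
        total += theta * weight * p_affected * p_controls
    return total


# Ten sibs, half of them controls; at least 80% of affected and
# at most 20% of control sibs carry RA or AA
print(round(float(sib_filtered_with_controls(5, 5, 4, 1, "dominant")), 2))
```

Affected and control sibs are independent given the parental genotypes, which is why the two tail probabilities are simply multiplied for each parental configuration.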

#### 3. Results and Discussion

##### 3.1. An Estimator of $\theta$ for the Human Exome

According to Table 2 in Ng et al. [3], there are roughly 20,000 SNVs in a single human exome, including synonymous and non-synonymous variants. All results in this study are based on the estimate $\hat{\theta}$ = 13,333, which was obtained from 20,000 SNVs per individual as follows: using (4) with two haploid sequences, the expected number of SNVs detected in one human is $\theta/1 + \theta/2 = 1.5\,\theta$, summing over the 2 possible values of $i$ ($i = 1, 2$). If the observed number of SNVs detected in one human is 20,000, then $1.5\,\hat{\theta}$ = 20,000 is used to obtain $\hat{\theta} \approx$ 13,333. Note that the number of SNVs per single human exome (20,000) varies between populations and depends on the methods of exome capture, mapping to a reference genome, and genotype calling, as well as on the definition of an exome. The results of this study also vary slightly depending on the estimate used.
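The arithmetic behind the estimate is short enough to write out. The sketch below is illustrative only (variable names are not from the paper): it divides the observed per-exome SNV count by the sum of 1/i over the two possible alternative-allele counts in one diploid individual.

```python
observed_snvs = 20000  # approximate SNVs per human exome, from Ng et al. [3]

# One diploid individual = two haploid sequences, so i takes the values 1 and 2
expected_in_theta_units = sum(1 / i for i in (1, 2))  # 1 + 1/2

theta_hat = observed_snvs / expected_in_theta_units
print(int(theta_hat))  # truncated to the integer value used in the text

# Plugging the truncated estimate back in gives the expected SNVs per exome
print(13333 * expected_in_theta_units)
```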

##### 3.2. Unrelated Individuals without Controls in Dominant Disease

The expected number of SNVs after filtering in cases of dominant disease and unrelated individuals without controls is plotted in Figure 2(a). Several values used in Figure 2 are listed in Table 2. When a stringent filter (i.e., one set to retain only SNVs in which 100% of individuals sampled have the genotype $RA$ or $AA$) was used, the number of SNVs appeared to decay exponentially with sample size $n$. However, the decrease in the number of SNVs became slower as $n$ increased. As shown in Table 2, the expected numbers of SNVs for $n$ = 1, 2, 3, 4, 50, and 51 were 19999.50, 12221.92 (61.11%), 9333.10 (76.36%), 7761.71 (83.16%), 1808.56, and 1789.36 (98.94%), respectively, with the ratios of the expected SNVs for $n$ to those for $n-1$ shown in parentheses. The first few individuals were highly effective in removing SNVs, but additional individuals were not. This was even more evident when nonstringent filters (i.e., retaining SNVs in which at least 90% or 80% of individuals have the genotype $RA$ or $AA$) were used. In those cases, certain asymptotic values likely exist. For example, the expected number of SNVs after ≥90% filtering was 5540.39 for $n$ = 50 and still 5306.36 for $n$ = 100. From the perspective of identifying disease-causing variants, it is clear that nonstringent filters that take phenocopy into account do not work well even if the sample size is very large. Even with stringent filters, the expected number of SNVs remains high when the sample size is large (1249.75 SNVs for $n$ = 100). This shows that it is generally difficult to identify a disease-causing variant by filtering without a control.

##### 3.3. Unrelated Individuals with Controls in Dominant Disease

As shown in Figure 2(b), filtering with controls is highly effective in removing SNVs. When half of the samples were controls and a stringent filter was used, the expected number of SNVs was less than one at $n$ = 14 and 0.001 at $n$ = 20. Even with a single control, the situation changed drastically compared to cases without controls. For example, for $n$ = 10, the expected number of SNVs was 4450.2 without a control, which dropped to 284.27 with one control. Using nonstringent filters that take phenocopy into account (i.e., retaining SNVs in which at least 80% of affected individuals have the genotype $RA$ or $AA$), an asymptotic value of approximately 700 may occur with one control, but filtering efficiency is improved if the number of controls totals 3 (21.74 SNVs for $n$ = 53). In addition to phenocopy, filters that take reduced penetrance into account also work reasonably well if half of the exome samples ($n_a = n_c$) are controls. For example, the expected number of SNVs in which at least 80% of affected individuals and at most 20% of controls have the genotype $RA$ or $AA$ was 28.51 and 0.02 for $n$ = 20 and 50, respectively.

##### 3.4. Unrelated Individuals and Recessive Disease

The number of SNVs after filtering in recessive disease shows a similar tendency to that in dominant disease, as shown in Figures 3(a) and 3(b). Table 3 lists some of the values used in Figure 3. Without controls, filtering does not work well, particularly when phenocopy is taken into account. With controls, filtering efficiency is highly improved even when phenocopy and reduced penetrance are considered. However, filtering efficiency for recessive disease can be more than ten times higher than for dominant disease. For example, stringent filtering of $n$ = 100 without a control resulted in an expected number of SNVs of 1249.75 for dominant disease, but only 66.67 for recessive disease. Using stringent filtering, the expected number of SNVs for recessive disease was

$$\frac{\theta}{2n},$$

which is derived from (7) or directly from (4) by substituting $i = 2n$. In contrast, the expected number of SNVs for dominant disease is represented as

$$\theta \sum_{i=n}^{2n} \frac{2^{\,2n-i}\; n!\,(i-1)!}{(i-n)!\,(2n)!},$$

which is derived from (9) and (10).
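The two stringent-filter expressions can be compared numerically. The sketch below is an illustration under the model described in the Method section, with hypothetical function names: the recessive case keeps only sites where every allele is alternative, and the dominant case sums, over the possible alternative-allele counts, the expected number of sites at which no individual is homozygous for the reference allele.

```python
from fractions import Fraction
from math import factorial

THETA = 13333  # point estimate quoted in the text


def recessive_stringent(n, theta=THETA):
    """All n unrelated individuals are AA: only i = 2n contributes."""
    return Fraction(theta, 2 * n)


def dominant_stringent(n, theta=THETA):
    """All n unrelated individuals are RA or AA (no RR individual)."""
    total = Fraction(0)
    for i in range(n, 2 * n + 1):
        numerator = 2 ** (2 * n - i) * factorial(n) * factorial(i - 1)
        denominator = factorial(i - n) * factorial(2 * n)
        total += Fraction(numerator, denominator)
    return theta * total


for n in (1, 2, 3, 4, 100):
    print(n, float(dominant_stringent(n)), float(recessive_stringent(n)))
```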

##### 3.5. Full-Sibs with and without Controls

The expected number of SNVs after filtering in the case of full-sibs for dominant disease is shown in Figure 4. Table 4 lists several of the values used in Figure 4. Filtering efficacy in sibs was clearly worse than that in unrelated individuals (cf. Figure 4(a) with Figure 2(a)). For a given sample size $n$, the expected numbers of SNVs for 100%, 90%, and 80% filtering were more similar to one another than in unrelated individuals. There was also a higher asymptotic value for 100%, 90%, and 80% filtering. The asymptotic value was 9999.75, as explained below for 100% filtering. When the sample size is large, DNA sites in which the parents have genotypes $RR \times RA$ or $RA \times RA$ are removed by filtering because a certain proportion of sibs have the genotype $RR$ (Table 1). In contrast, even if the sample size is large, DNA sites in which the parents have genotypes $RR \times AA$, $RA \times AA$, or $AA \times AA$ are not removed, and the expected number of such sites is $\theta/6 + \theta/3 + \theta/4 = 3\theta/4$ (Table 1). The same holds for 90% and 80% filtering.

However, the situation drastically improved when we used controls (Figure 4(b)). For a given sample size $n$, the expected number of SNVs in sibs was comparable to the expected number in unrelated individuals. For example, if half of the exome samples were controls, the expected numbers of SNVs in which at least 80% of affected individuals and at most 20% of controls have the genotype $RA$ or $AA$ were 487.69 ($n$ = 10), 28.51 ($n$ = 20), and 0.02 ($n$ = 50) when unrelated exomes were used and 512.68 ($n$ = 10), 40.85 ($n$ = 20), and 0.06 ($n$ = 50) when full-sib exomes were used.

The number of SNVs after filtering in sibs for recessive disease shows a similar tendency to that for dominant disease, as shown in Figures 5(a) and 5(b). Table 5 lists some of the values used in Figure 5. Without controls, the efficiency of filtering in sibs was clearly worse than in unrelated individuals. The asymptotic value for recessive disease was 3333.25 ($\theta/4$), which was obtained in the same way as for dominant disease. The number of SNVs for recessive disease reached the asymptotic values for 100%, 90%, and 80% filtering faster than for dominant disease. As in dominant disease, the effect of controls in recessive disease was large. For a given sample size $n$, the expected number of SNVs in sibs was comparable to that in unrelated individuals. For example, if half of the exome samples were controls, the expected numbers of SNVs in which at least 80% of affected individuals and at most 20% of controls have the genotype $AA$ were 203.70 ($n$ = 10), 11.86 ($n$ = 20), and 0.01 ($n$ = 50) when unrelated exomes were used and 200.09 ($n$ = 10), 14.26 ($n$ = 20), and 0.02 ($n$ = 50) when full-sib exomes were used.

##### 3.6. Assumptions

We assumed that haploid sequences were randomly sampled from a population under the Wright-Fisher diffusion model with a constant population size, with $2n+1$ sequences in unrelated individuals and 5 in full-sibs (Figures 1(e) and 1(f)). The infinite-site model of neutral mutations was also assumed. The expected frequency spectrum of the sequences is represented by formula (1). All of the results derived from this method are based on this formula. However, human populations have expanded, and mutations at non-synonymous sites are not strictly neutral and may be deleterious on average, which may skew the frequency spectrum toward rare variants (e.g., see [9] for population expansion and [10] for non-synonymous mutations). The skew is more pronounced when the sample size is large (e.g., 500), but not when the sample size is small [9]. In addition, the reference sequence is known to be a mosaic of DNA from a number of individuals. This fact does not affect the expected number of candidate SNVs, since any small chromosomal region or any DNA site of the reference sequence is still a haploid sample from a population. On the other hand, our results may be affected by the fact that the reference sequence and the exome sequences can have different ethnic backgrounds. Nevertheless, both are certainly derived from a human population. As a whole, although the expected frequency spectrum given by (1) is a rough approximation, the effect of various filtering schemes, which incorporate modes of inheritance, incomplete penetrance or phenocopy, and controls, on the number of candidate SNVs can be assessed as described above.

#### 4. Conclusions and Practical Implications

Using a standard population genetics model, we modeled exome analysis for Mendelian disease and developed a method for calculating the expected number of candidate SNVs after filtering under a “no genetic heterogeneity” assumption. Exome sequences of unrelated individuals and full-sibs were considered with and without controls for dominant and recessive diseases. Without controls, particularly for full-sibs, the filtering approach had poor efficiency in reducing the number of candidate SNVs even when using a stringent filter (Figures 2(a), 3(a), 4(a), and 5(a)). With controls, the filtering efficacy was considerably improved, even when incorporating phenocopy or incomplete penetrance (Figures 2(b), 3(b), 4(b), and 5(b)). This was true in cases of unrelated individuals and full-sibs for dominant and recessive diseases.

For rare dominant diseases, it is plausible that affected individuals in a pedigree share one disease-causing variant, even if the disease shows genetic heterogeneity. The assumption of “no genetic heterogeneity” is then appropriate because variants causing a rare disease are themselves rare in the population, so only one founder in the pedigree is likely to carry one of the disease-causing variants (e.g., see Sobreira et al. [5] or Wang et al. [11]). For rare recessive diseases, affected members of a pedigree generally do not share a single disease-causing variant: they may be “compound heterozygotes” at the disease locus, that is, heterozygous for two different disease-causing variants in the same gene (e.g., Lalonde et al. [12]). For a consanguineous pedigree with a rare recessive disease, however, the assumption of “no genetic heterogeneity” is still appropriate, in that affected individuals in the pedigree are expected to be autozygous for the disease-causing variant (e.g., see Walsh et al. [13]).

As described in Section 3.5 and shown in Figure 4(b), filtering that incorporates incomplete penetrance and phenocopy can efficiently reduce the number of candidate SNVs when the sample size is relatively large. If the results for full-sibs can be extrapolated to general pedigrees, the filtering approach should work well for pedigree data in dominant disease, or consanguineous pedigree data in recessive disease, even in the presence of incomplete penetrance and phenocopy. The approach presented in this study could provide general guidelines for sample size determination in exome sequencing for Mendelian disease.

#### Acknowledgments

The authors are grateful to Shigeki Nakagome for valuable comments and discussions. This work was supported by a Health Labour Sciences Research Grant from the Ministry of Health, Labour and Welfare (H23-jituyouka(nanbyou)-006 and H23-kannen-005) and by a Grant of the National Center for Global Health and Medicine (H22-302).

#### References

1. The 1000 Genomes Project Consortium, “An integrated map of genetic variation from 1,092 human genomes,” *Nature*, vol. 491, pp. 56–65, 2012.
2. T. A. Manolio, F. S. Collins, N. J. Cox et al., “Finding the missing heritability of complex diseases,” *Nature*, vol. 461, no. 7265, pp. 747–753, 2009.
3. S. B. Ng, E. H. Turner, P. D. Robertson et al., “Targeted capture and massively parallel sequencing of 12 human exomes,” *Nature*, vol. 461, no. 7261, pp. 272–276, 2009.
4. B. Rabbani, N. Mahdieh, K. Hosomichi, H. Nakaoka, and I. Inoue, “Next-generation sequencing: impact of exome sequencing in characterizing Mendelian disorders,” *Journal of Human Genetics*, vol. 57, no. 10, pp. 621–632, 2012.
5. N. L. Sobreira, E. T. Cirulli, D. Avramopoulos et al., “Whole-genome sequencing of a single proband together with linkage analysis identifies a Mendelian disease gene,” *PLoS Genetics*, vol. 6, no. 6, Article ID e1000991, 2010.
6. D. Zhi and R. Chen, “Statistical guidance for experimental design and data analysis of mutation detection in rare monogenic Mendelian diseases by exome sequencing,” *PLoS ONE*, vol. 7, no. 2, Article ID e31358, 2012.
7. Y. X. Fu, “Statistical properties of segregating sites,” *Theoretical Population Biology*, vol. 48, no. 2, pp. 172–197, 1995.
8. B. S. Weir, *Genetic Data Analysis II*, Sinauer Associates, Sunderland, Mass, USA, 1996.
9. A. Keinan and A. G. Clark, “Recent explosive human population growth has resulted in an excess of rare genetic variants,” *Science*, vol. 336, no. 6082, pp. 740–743, 2012.
10. Y. Li, N. Vinckenbosch, G. Tian et al., “Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants,” *Nature Genetics*, vol. 42, no. 11, pp. 969–972, 2010.
11. J. L. Wang, X. Yang, K. Xia et al., “TGM6 identified as a novel causative gene of spinocerebellar ataxias using exome sequencing,” *Brain*, vol. 133, pp. 3510–3518, 2010.
12. E. Lalonde, S. Albrecht, K. C. H. Ha et al., “Unexpected allelic heterogeneity and spectrum of mutations in Fowler syndrome revealed by next-generation exome sequencing,” *Human Mutation*, vol. 31, pp. 1–6, 2010.
13. T. Walsh, H. Shahin, T. Elkan-Miller et al., “Whole exome sequencing and homozygosity mapping identify mutation in the cell polarity protein GPSM2 as the cause of nonsyndromic hearing loss DFNB82,” *American Journal of Human Genetics*, vol. 87, no. 1, pp. 90–94, 2010.