Abstract

Though pleiotropy, which refers to the phenomenon of a gene affecting multiple traits, has long played a central role in genetics, development, and evolution, estimation of the number of pleiotropy components remains a hard mission to accomplish. In this paper, we report a newly developed software package, Genepleio, to estimate the effective gene pleiotropy from phylogenetic analysis of protein sequences. Since this estimate can be interpreted as the minimum pleiotropy of a gene, it is used to play a role of reference for many empirical pleiotropy measures. This work would facilitate our understanding of how gene pleiotropy affects the pattern of genotype-phenotype map and the consequence of organismal evolution.

1. Introduction

Understanding the role of gene pleiotropy in the map from genotypes to phenotypes has been one of the central topics for biologists, which refers to the phenomenon of a gene affecting multiple traits As a major measure for the functional importance of a gene, this concept has played a fundamental role in genetics, development, and evolution (see [13] for recent reviews and comments). However, the degree of gene pleiotropy remains largely unknown. Historically, proposed the concept of universal pleiotropy; that is, a single mutation can potentially affect all phenotypic traits. Though Fisher’s model has been widely used as a theoretical basis for exploring the evolutionary interplay between the genotype and phenotype, the notion of universal pleiotropy has not been well tested.

Indeed, compared with the wide availability of genomics data, the whole-range phenotype recourses, or “phenomics,” are highly limited. Nevertheless, recent advances have been able to bring high throughput data to bear on the nature and extent of pleiotropy [46]. These experiments showed that the number of phenotypic traits that may be affected by a gene may be actually limited implying the role of modularity in shaping the degree of gene pleiotropy. Many controversial issues such as the problem of arbitrary number of correlated traits may directly affect the count of phenotypes that are predicted to be affected by a gene. On the other hand, a new approach has emerged in the past decade, to estimate the gene pleiotropy from genetics or sequence data, rather than from the affected phenotypes [713] (Chen et al., 2013). In particular, Gu [8] developed a practically feasible approach. It proposed that molecular evolution of a gene occurs in a multidimensional space corresponding to the same canonical number of molecular phenotypes. Random mutations of the gene could affect these molecular phenotypes constrained by the stabilizing selection. Moreover, Gu [8] developed a statistical method to estimate the “effective pleiotropy” () of a gene from the multiple sequence alignment of protein sequences. Most genes have in the range between 1 and 20 [11], with the medium of of these estimates that is comparable to some empirical pleiotropy measures [1, 3]. Yet the relationship between these two approaches remains complex. As the degree of gene pleiotropy is a basic parameter for understanding the complexity of genotype-phenotype map, to facilitate this line of research we develop a software package Genepleio (freely available at http://www.xungulab.com) to estimate the effective gene pleiotropy from the protein sequences.

2. Material and Methods

2.1. Sequences

Three groups of datasets include eight vertebrates, twelve fruit flies, and seven yeast species. Each dataset contains 300 random selected orthologous sets. Eight vertebrates are fugu, zebrafish, xenopus, chicken, dog, cow, mouse, and human. Twelve Drosophila species are D. melanogaster, D. pseudoobscura, D. sechellia, D. simulans, D. yakuba, D. erecta, D. ananassae, D. persimilis, D. willistoni, D. mojavensis, D. virilis, and D. grimshawi. Seven yeast species are Candida glabrata, Debaryomyces hansenii, Kluyveromyces lactis, Saccharomyces bayanus, Yarrowia lipolytica, Saccharomyces cerevisiae, and Schizosaccharomyces pombe. Vertebrate CDS and protein sequences were extracted from Ensmart, Drosophila CDS and protein sequences were extracted from FlyBase, and yeast CDS was extracted from ORNA Man’s dataset (corresponding protein sequences were translated by Bioperl).

2.2. Sequences Alignment

Multiple protein sequence alignment for each orthologous group was obtained by ClustalW at default settings.

2.3. Estimation of

The number of synonymous substitutions per synonymous site () and the number of nonsynonymous substitutions per nonsynonymous site () between human and mouse orthologs were calculated by CODEML of PAML package [14]. When calculating the variance of and , we changed “getSE = 1” in CODEML control file; otherwise, we used the default CODEML parameters. We used the estimates of between human and mouse for vertebrates, those between D. melanogaster (dmel) and D. sechellia (dsec) for Drosophila, and those between S. bayanus and S. cerevisiae for yeasts, respectively.

3. Results and Discussion

3.1. Estimation of Effective Gene Pleiotropy

Gu [8] analyzed the pleiotropy model of molecular evolution under the following assumptions. (i) -dimensional molecular phenotypes of the gene are under Gaussian-like stabilizing selection, indicating a single fitness optima for multiple functions. Any deviation from the optima is under the purifying selection. (ii) The fitness optima of may shift randomly during the course of evolution, according to a multivariate normal distribution. It generates the process of microadaptation that could be caused by the external (environmental) or internal (physiological) perturbations or the functional compensation for the previously fixed slightly deleterious mutation. (iii) And the distribution of mutational effects, , follows a multivariate normal distribution.

The estimation procedure implemented in the software Genepleio is summarized in Figure 1. We address several key issues concisely to help in understanding how the software works. One may see Gu [8], Su et al. [11], and Gu (2014) for technical details.

3.1.1. Calculation of -Measure

Calculation of is the key step to estimate . Biologically, measures the strength of rate variation among sites: when , and when . After the gene phylogeny is given or inferred by the NJ option, the software implemented the methods of Gu and Zhang [15] to infer the number of amino acid changes along the phylogeny at each site, after correcting the multiple hits. The next step is to calculate the mean () and variance () of the estimated number of changes over sites. Under the Poisson-based evolutionary model, can be estimated by .

3.1.2. Estimation of Effective Gene Pleiotropy ()

Genepleio has implemented the following procedure to estimate . (i) Calculate the ratio (the ratio of nonsynonymous to synonymous rates) used as an empirical measure for the mean sequence conservation. (ii) Calculate the -function . (iii) And the effective gene pleiotropy () can be estimated by numerically solving the following equation: where .

3.1.3. Estimation of Selection Intensities

There are two types of selection intensity measures. The first one is the (overall) selection intensity of the gene under study, , for the overall strength of purifying selection imposed on the protein sequence; the negative sign indicates the negative (purifying) selection. The second one is the baseline selection intensity, , which is a scaled measure for the contribution of a single pleiotropy component. The relationship between and is . When is obtained, the software estimates the effective selection intensity for and the effective baseline selection intensity () for the baseline selection intensity .

3.1.4. Calculation of Sampling Variance of

The sampling variance of can be approximately calculated by the delta method. Numerical analysis of (1) found that the following formula is accurate enough in practice:

In (2), can be estimated by the delta method so that , where , are the variances of and , respectively. The sampling variance is difficult to compute analytically. Genepleio has implemented a bootstrapping approach to calculate the sampling variance of , , whereas sampling variances of and , and , depend on the users’ input; their default values are set to be zero. Using this method, we can bootstrap 100 times within 1~2 minutes.

We use triosephosphate isomerase gene (TPI1, SWISSPROT P60174) for illustration. (i) Infer the phylogenetic tree from the multiple alignment of vertebrate homologous TPI1 protein sequences (human, mouse, dog, cow, chicken, xenopus, fugu, and zebrafish), which is consistent with the known vertebrate phylogeny. (ii) Estimate between the human and mouse genes by the likelihood method using PAML. (iii) Estimate -index for the rate variation among sites; for TPI1 gene. (iv) And we estimated and the mean selection intensity . Then, the baseline selection intensity is given by .

The estimated effective gene pleiotropy varies among different treatments but the scale of variation is small. On the other hand, we found that when the number of changes at each site is estimated by the parsimony method without any correction, gene pleiotropy tends to be overestimated. At any rate, we conclude that these 5–10% estimation differences should not affect the general pattern about the degree of gene pleiotropy.

3.2. Biological Interpretation of

The key question is how one can acquire the number of pleiotropy components of a gene without biologically knowing each component (Su et al., 2010) [12, 16]. Gu (2014) [17] addressed this issue, showing that the method of Gu [8] actually aims to estimate the rank () of genotype-phenotype map. The main result can be concisely represented by the following simple formula: , where is the minimum pleiotropy among all legitimate pleiotropy measures and is the rank of mutational effects. In short, the meaning of “effective gene pleiotropy” () estimated by Gu-2007 method is as follows. (i) is an estimate of , the rank of genotype-phenotype map. (ii) is an estimate for the minimum pleiotropy only if . (iii) Gu-2007 method attempted to estimate the pleiotropy of amino acid sites, a conserved proxy to the true minimum pleiotropy. (iv) With a sufficiently large phylogeny such that the rank of mutational effect at an amino acid site is (the number of amino acid types minus one), one can estimate in the range from 1 to 19 by this method. And () is a conserved estimate of because those pleiotropy components that have small effects on fitness would be effectively removed by the estimation procedure.

3.3. Software Overview

We have developed Genepleio, a GUI-based software package that estimates the effective gene pleiotropy from the phylogenetic sequence analysis of amino acids. Genepleio has three inputs. (i) Input the file of multisequence alignment (MSA) of protein sequences: Genepleio supports the alignment format of CLUSTALW. As required by the method, the multiple alignment file should contain at least four sequences with reasonably large sequence divergences between them. (ii) Input two values (nonsynonymous distance) and (synonymous distance). Several methods such as PAML can be used to obtain these estimates. We suggest choosing closely related sequences, say, , to avoid large sampling variance when calculating the ratio. Note that the ratio should be less than 1; otherwise, the gene may not be suitable to do this type of analysis. (iii) Input the tree file in the Phylip format. Alternatively, one may use the neighbor-joining (NJ) method implemented in the software to infer the gene phylogeny. As illustrated in Figure 2, the interface of Genepleio includes three main tab pages: the first page is for the MSA input and values, the second page is for the input or the inference of the phylogenetic tree, and the third page will output the results of estimation.

We have conducted a preliminary analysis and found that the 75% quantile of estimated is typically within , suggesting that estimation as a measure of gene pleiotropy is statistically reliable. Besides, we notice that the contributions from and are nontrivial. In other words, the sampling variance of would be severely underestimated if the user has no input for the sampling variances of and .

There are some notices about usage of Genepleio. First, the multiple alignment file should contain more than four sequences; second, the value should be within ; third, the sequences similarity >90% should be cautious because of the lack of statistical power; fourth, in order to shorten the time consumed, we do not give the mean of value through bootstrapping in Genepleio. Nevertheless, according to much simulation, the mean value is close to the estimated value, so we only give the estimated value.

3.4. Computer Simulations

We have carried computer simulations to evaluate the software performance. We set to vary from 1 to 20, with the fixed baseline selection intensity , 1.0, or 2.0, respectively. In particular, we consider three simulation models. (a) Independent-equal model: pleiotropy components are identical and independent of each other. (b) Independent-unequal model: pleiotropy components are independent of each other but have different strengths. (c) Random-matrix model: the strengths and correlations between pleiotropy components are randomly drawn from a specified random matrix model. Our main results are summarized in Table 1. In general, the estimation bias of decreases with the increasing of the baseline selection intensity . For instance, in model (a), underestimation of is considerable only when is very small, say, 0.5 or less. The estimation bias becomes intermediate when and becomes negligible when (not shown). Moreover, the estimation bias of may increase when the simulation model becomes more complex. Indeed, in model (c), only describes the canonical number of pleiotropy components that could be much less than the number of pleiotropy components used in the simulation model.

3.5. Case Studies

To validate the performance of the newly developed software for the estimation of , we analyzed three datasets, each of which includes eight vertebrates, twelve fruit flies, or seven yeast species, respectively (Figure 3). Each dataset contains 300 random selected (one-to-one) orthologous sets. We calculated , (selection intensity), and (the baseline selection intensity). Our analysis shows that all estimates are in a reasonable range with only a few outliners. Interestingly, the distribution of estimates is similar across these distantly related species. The underlying reason to explain this similarity remains unclear. is an important parameter for evolutionary analysis. Indeed, the square of coefficient correlation () between and is 0.64 in vertebrates, 0.94 in yeasts, and 0.55 in fruit flies, suggesting that gene pleiotropy may be an important evolutionary constraint in molecular evolution. In short, in this paper, we have reported a new software package Genepleio and demonstrated the steps of gene pleiotropy () estimation. We also examined the extent to which the statistical properties of and affect the estimation efficiency of and . Comparison among three different groups of species validates the stability of estimation procedure.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors’ Contribution

Wenhai Chen and Dandan Chen have equal contributions.

Acknowledgments

The authors are grateful to Yong Huang for the involvement of early work and to Zhixi Su and Wei Huang for critical comments on the early version of the paper.