Abstract

We characterized ectopic gene conversions in the genomes of ten hemiascomycete yeast species. Of the ten species, three diverged prior to the whole genome duplication (WGD) event present in the yeast lineage and seven diverged after it. We analyzed gene conversions from three separate datasets: paralogs from the three pre-WGD species, paralogs from the seven post-WGD species, and common ohnologs from the seven post-WGD species. Gene conversions have similar lengths and frequency and occur between sequences having similar degrees of divergence, in paralogs from pre- and post-WGD species. However, the sequences of ohnologs are both more divergent and less frequently converted than those of paralogs. This likely reflects the fact that ohnologs are more often found on different chromosomes and are evolving under stronger selective pressures than paralogs. Our results also show that ectopic gene conversions tend to occur more frequently between closely linked genes. They also suggest that the mechanisms responsible for the loss of introns in S. cerevisiae are probably also involved in the 3'-end gene conversion bias observed between the paralogs of this species.

1. Introduction

The repair of double-strand DNA breaks is a critical biological process which maintains genome stability. Double-strand DNA breaks are primarily repaired by homologous recombination; this process requires a repair template gene which provides a copy of the information lost at the break. The repair template can either be an allele (allelic recombination) or a paralog (ectopic recombination). An end product of the homologous recombination pathway is the replacement of the broken part of the damaged gene by a homologous portion of the repair template gene. The damaged gene is therefore converted by the template gene (reviewed in [1]).

The factors affecting, and the characteristics of, ectopic and allelic gene conversions have been the focus of many studies, and sequence similarity has been shown to have a profound effect on gene conversion propensity between paralogs. In Escherichia coli, a 2%–4% decrease in sequence similarity between a damaged gene and its repair template can cause a 10- to 40-fold decrease in recombination frequency [2, 3]. Similarly, in Saccharomyces cerevisiae, larger gene conversions are limited to more similar sequences [4]. Chromosomally linked genes are converted more frequently than dispersed genes in Drosophila and humans [5, 6]. In S. cerevisiae, increasing distance between paralogs located on the same chromosome tends to decrease their conversion frequency [4, 7, 8]. In some genomes, different regions of genes are converted at different rates. For example, in S. cerevisiae, gene conversions between dispersed paralogs are more frequent at their 3'-ends [4]. This 3'-bias is likely the result of gene conversion with incomplete cDNA molecules [9].

The availability of ten hemiascomycete genomes provides the opportunity to study ectopic gene conversions within a clade with as much sequence divergence as the entire Chordate phylum [10]. The evolution of several hemiascomycete species was affected by a whole genome duplication event (WGD) which occurred some 150 million years ago (MYA; [11–14]). The genomes of Kluyveromyces lactis, Debaryomyces hansenii, and Yarrowia lipolytica all diverged before the whole genome duplication event that occurred in the yeast lineage (pre-WGD species; [10]). The S. cerevisiae, Saccharomyces paradoxus, Saccharomyces mikatae, Saccharomyces bayanus, Saccharomyces kudriavzevii, Saccharomyces castellii, and Candida glabrata genomes all diverged after this whole genome duplication event (post-WGD species; [15–17]).

The advantage of separating these genomes into two groups is that we are able to perform two comparisons. The first compares the characteristics of ectopically converted ohnologs and paralogs between the post-WGD species. The post-WGD ohnologs are composed of the duplicated gene pairs that resulted from the whole genome duplication [11, 18]. The post-WGD paralogs data set is composed of the genes from multigene families containing at least three members in the genome of the seven post-WGD species but excluding all ohnologs (Figure 1). The second comparison involves the contrast of the characteristics of ectopically converted paralogs between pre- and post-WGD species. The pre-WGD paralogs data set is composed of the genes from multigene families containing at least three members in the genome of the three pre-WGD species.

Previous studies have shown that many ohnologs are still found in yeast genomes because they provide a selective advantage [19, 20]. Ohnologs are maintained by selection either because they carry out a subset of the functions that were previously assumed by their preduplication ancestor (subfunctionalization), assume new functions (neofunctionalization), or provide increased gene product dosage. We therefore expect that most ectopic gene conversions between ohnolog genes will be deleterious and removed by selection. If so, ectopic gene conversions between ohnologs should be less frequent than those between paralogs. In addition, based on previous studies, we expect that gene conversion frequency should decrease as the distance between related genes increases (and be least frequent for genes situated on different chromosomes), that the length of gene conversion tracts should be positively correlated with sequence similarity, and that converted regions should be more frequent at the 3'-end of genes [4].

2. Materials and Methods

2.1. Genome Sequences

The S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus, S. kudriavzevii, and S. castellii genome sequences were retrieved from the Saccharomyces Genome Database (SGD; ftp://genome-ftp.stanford.edu/pub/yeast/sequence/). The C. glabrata, K. lactis, D. hansenii, and Y. lipolytica genome sequences and distance files (*.ptt files) were retrieved from the NCBI ftp website (ftp://ftp.ncbi.nih.gov/).

2.2. Gene Family Data Sets

We used three different data sets of protein coding genes. To retrieve the post-WGD ohnologs from the seven post-WGD species, we used the 551 S. cerevisiae duplicated gene pairs (1102 ohnologs) identified by Byrne and Wolfe [21] as queries. Sequences from C. glabrata and S. castellii were retrieved using the Yeast Gene Order Browser (http://wolfe.gen.tcd.ie/ygob/), and those from the other 4 species were retrieved from the Saccharomyces Genome Database (ftp://genome-ftp.stanford.edu/pub/yeast/sequence/fungal_genomes/Multiple_species_align/other/fungalAlignCorrespondance.txt). Our data set of ohnologs in post-WGD species is therefore only composed of the ohnolog pairs also found in S. cerevisiae. We used this subset of ohnologs because the efficient detection of gene conversion events using the GENECONV method requires that at least three sequences be available [4]. To detect gene conversions in ohnologs, we therefore needed ohnologs from at least two species and we used the ohnologs of S. cerevisiae to retrieve ohnolog pairs from the other 6 post-WGD species. Retrieving common ohnologs also allowed us to study gene conversions between similar genes in seven different genomes.

The post-WGD paralog data set was constructed using the BLASTCLUST program available at the NCBI FTP site. Gene families were defined as being composed of sequences having at least 60% amino acid identity over at least 50% of their length. If genes previously identified as ohnologs were grouped into paralog multigene families, then these genes were removed from the family to ensure that there was no redundancy between the ohnolog and paralog data sets (see Figure 1). The pre-WGD paralog data set was also constructed using the BLASTCLUST program, and gene families were also defined as being composed of sequences having at least 60% amino acid identity over at least 50% of their length.
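The family definition above can be illustrated with a minimal sketch. It assumes single-linkage grouping of pairwise protein hits, which is how BLASTCLUST is generally described; it is not the BLASTCLUST code itself. The function names, the hit tuple layout, the rule that the alignment must cover at least 50% of the longer protein, and the choice to remove ohnologs before applying the three-member minimum are all illustrative assumptions.

```python
def cluster_families(lengths, hits, min_identity=60.0, min_coverage=0.5):
    """Single-linkage grouping of genes from pairwise protein hits.

    lengths: dict mapping gene ID -> protein length (residues)
    hits:    iterable of (gene_a, gene_b, percent_identity, aligned_length)
    """
    parent = {g: g for g in lengths}

    def find(g):
        # Follow parent links to the family representative (with path halving).
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, identity, aligned_len in hits:
        # Assumed coverage rule: alignment spans >= 50% of the longer protein.
        covered = aligned_len >= min_coverage * max(lengths[a], lengths[b])
        if identity >= min_identity and covered:
            parent[find(a)] = find(b)

    families = {}
    for g in lengths:
        families.setdefault(find(g), []).append(g)
    return sorted((sorted(f) for f in families.values()),
                  key=lambda f: (-len(f), f))


def paralog_families(families, ohnologs, min_size=3):
    """Drop known ohnologs, then keep families that still have >= min_size genes."""
    kept = []
    for family in families:
        members = [g for g in family if g not in ohnologs]
        if len(members) >= min_size:
            kept.append(members)
    return kept


# Hypothetical toy data, not taken from the study.
lengths = {"g1": 200, "g2": 200, "g3": 250, "g4": 300, "g5": 300, "g6": 300}
hits = [
    ("g1", "g2", 75.0, 180),  # passes both thresholds
    ("g2", "g3", 62.0, 150),  # passes both thresholds
    ("g3", "g4", 90.0, 60),   # fails the coverage threshold
    ("g4", "g5", 55.0, 290),  # fails the identity threshold
    ("g5", "g6", 80.0, 280),  # passes both thresholds
]
families = cluster_families(lengths, hits)
```

With these toy inputs, g1, g2, and g3 form one family even though g1 and g3 were never directly compared, which is the defining property of single-linkage grouping.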

2.3. Sequence Alignments and Gene Conversion Detection

ClustalW was used to align the protein sequences of multigene families’ members [22]. DNA sequences were then fitted to the protein alignments using a PERL script.
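Fitting DNA sequences to a protein alignment amounts to replacing each aligned residue by its codon and each gap by three gap characters. The sketch below is in Python rather than PERL and is not the script used in this study; the function name and the handling of a trailing stop codon (ignored) are assumptions.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map an ungapped coding sequence onto its gapped protein alignment row."""
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than the protein requires")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)          # one protein gap = three DNA gaps
        else:
            codons.append(cds[pos:pos + 3])  # next codon of the coding sequence
            pos += 3
    # Any nucleotides left over (e.g., a stop codon) are not emitted.
    return "".join(codons)


aligned_dna = thread_codons("M-KT", "ATGAAAACG")
```

Because gaps are always inserted in multiples of three, the reading frame of every sequence is preserved in the resulting DNA alignment.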

Gene conversions were detected using the GENECONV method [23]. Redundant gene conversions within a multigene family were detected by examining the phylogenetic tree of each family and removed from the analysis [4]. If the same gene conversion was detected at the same location in the multigene family alignment in closely related descendants of a common ancestor, then the most parsimonious explanation is that the conversion event occurred within the common ancestor. Therefore, only one of the conversions detected in the set of descendants was retained for further analysis. To control for false positives, gene conversions between sequences having less than 80% maximum flanking similarity were removed from the analysis [24].
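The false-positive filter can be sketched as follows. This is an illustration under stated assumptions, not the in-house script: similarity is taken as percent identity over gap-free alignment columns, flanks are measured in alignment columns (100 on each side of the converted tract), and the larger of the two flank values is compared to the 80% threshold. All function names are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def max_flanking_similarity(seq_a, seq_b, start, end, flank=100):
    """Largest identity of the two flanks around a converted tract.

    start, end: 0-based, end-exclusive alignment coordinates of the tract.
    """
    left_from = max(0, start - flank)
    left = percent_identity(seq_a[left_from:start], seq_b[left_from:start])
    right = percent_identity(seq_a[end:end + flank], seq_b[end:end + flank])
    return max(left, right)


def keep_conversion(seq_a, seq_b, start, end, threshold=80.0, flank=100):
    """True if the conversion passes the flanking-similarity filter."""
    return max_flanking_similarity(seq_a, seq_b, start, end, flank) >= threshold


# Hypothetical aligned pair: converted tract at columns 10-14.
seq_a = "A" * 10 + "C" * 5 + "G" * 10
seq_b = "A" * 8 + "TT" + "C" * 5 + "G" * 5 + "T" * 5
similarity = max_flanking_similarity(seq_a, seq_b, 10, 15, flank=10)
```

In this toy pair the left flank is 80% identical and the right flank 50% identical, so the pair is retained at the 80% threshold.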

2.4. Gene Conversion Characteristics

The gene conversion frequency for each species was calculated using two different methods. The first method calculates the conversion frequency as the ratio of the number of conversions divided by the total number of gene comparisons between multigene family members. The second method calculates the frequency as the ratio of the number of gene conversions divided by the total number of multigene family members. Intra- and interchromosomal gene conversion frequencies were calculated for the S. cerevisiae and C. glabrata ohnolog and paralog multigene families. In addition, intra- and interchromosomal conversion frequencies were calculated for the paralog multigene families of the K. lactis, D. hansenii, and Y. lipolytica genomes. These frequencies are calculated as the ratio of intra- (or inter-) chromosomal conversions divided by the total number of intra- (or inter-) chromosomal gene comparisons. The gene conversion length was obtained from the GENECONV output. The maximum similarity for the flanking 100 nucleotides was calculated for each converted gene pair using an in-house PERL script. The locations of the converted regions were calculated as the correlation between the positions of each conversion with respect to the length of the converted genes. A positive correlation indicates a bias towards the 3'-end of genes, and a negative correlation indicates a bias towards the 5'-end of genes. The distance between converted genes was calculated only for conversions detected within S. cerevisiae, C. glabrata, K. lactis, D. hansenii, and Y. lipolytica because position data for the other five species were not available. Data tabulation and analysis were done using Microsoft Excel (Microsoft, Redmond, WA, USA) and S-plus v 7.0 (Insightful, Seattle, WA, USA). The G-Power program was used to calculate the power of the ANOVA tests [25]. Power calculations for correlation tests were done using an online application (http://calculators.stat.ucla.edu/powercalc/correlation/) and SAS 9.1.3 (SAS Institute Inc., Cary, NC, USA).
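The two frequency definitions, and the intra- versus interchromosomal split, can be written out as a short sketch. It assumes that every pair of genes within a family counts as one comparison and that a conversion is recorded as an unordered gene pair; the function names and the toy data are hypothetical.

```python
from itertools import combinations


def conversion_frequencies(families, n_conversions):
    """Return (per-comparison, per-gene) conversion frequencies."""
    n_genes = sum(len(f) for f in families)
    # All pairwise comparisons within each family: n(n - 1)/2.
    n_comparisons = sum(len(f) * (len(f) - 1) // 2 for f in families)
    return n_conversions / n_comparisons, n_conversions / n_genes


def chromosomal_frequencies(families, chromosome, converted_pairs):
    """Conversions / comparisons, separately for intra- and interchromosomal pairs."""
    counts = {"intra": [0, 0], "inter": [0, 0]}  # [conversions, comparisons]
    converted = {frozenset(pair) for pair in converted_pairs}
    for family in families:
        for a, b in combinations(family, 2):
            key = "intra" if chromosome[a] == chromosome[b] else "inter"
            counts[key][1] += 1
            if frozenset((a, b)) in converted:
                counts[key][0] += 1
    # None marks a class with no comparisons at all.
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}


# Hypothetical toy data, not taken from the study.
families = [["a", "b", "c"], ["d", "e"]]
chromosome = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 3}
converted_pairs = [("a", "b"), ("d", "e")]

per_comparison, per_gene = conversion_frequencies(families, len(converted_pairs))
by_location = chromosomal_frequencies(families, chromosome, converted_pairs)
```

Here the five genes give four comparisons, so two conversions correspond to a frequency of 0.5 per comparison but 0.4 per gene; the two measures diverge further as families grow.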

2.5. Numbers of Substitutions per Site and Gene Ontology

The number of nonsynonymous substitutions per nonsynonymous site (Ka) and synonymous substitutions per synonymous site (Ks) and their ratio (Ka/Ks) were calculated for the protein coding regions (excluding the converted regions) of each pair of converted genes using the YN00 program from the PAML software [26, 27].

The processes in which the S. cerevisiae ohnologs and paralogs are involved were analyzed using the gene ontology annotations of the Saccharomyces Genome Database at http://db.yeastgenome.org/cgi-bin/GO/goTermFinder.pl [28].

3. Results

3.1. Ohnolog and Paralog Multigene Families

Ohnolog and paralog multigene families were analyzed to determine whether the number and size of these two types of families were different in different yeast genomes (Table 1, Figure 2). The genomes of the six post-WGD species from which we retrieved ohnolog pairs using the S. cerevisiae ohnologs contain an average of 372 ohnolog pairs. The number of ohnolog pairs found in each of these six different genomes is not significantly different from this average when using a Bonferroni-corrected α-value of 0.0083 (Wilcoxon rank sum test; [29]).

For post-WGD paralogs, only the S. mikatae genome has significantly more paralog families than average ( ; Wilcoxon rank sum test, ) and only the S. kudriavzevii genome has significantly fewer paralog families than average ( ). The mean size of the paralog families ( genes/family) is similar in all post-WGD genomes except that of C. glabrata which has significantly smaller paralog families than average ( genes/family, Wilcoxon rank sum test, ).

For the pre-WGD paralogs, the numbers of paralog families in the three pre-WGD genomes are not significantly different from the population mean ( ; Wilcoxon rank sum test, ). The mean size of all paralog families in these three genomes ( paralogs per family) is similar to the mean family size of each pre-WGD genome (Wilcoxon rank sum test, ). Finally, neither the number (Wilcoxon rank sum test, ) nor the mean size of paralog families (Wilcoxon rank sum test, ) differs significantly between pre- and post-WGD species.

3.2. Organization of Gene Families

The organization of the multigene families can be measured as the proportion of multigene family members located on the same chromosome. Since most paralogs originate from unequal crossover events, they are expected to be most often found on the same chromosome. In contrast, since ohnologs are remnants of ancient genome duplication events, they are expected to be most often found on different chromosomes. The higher percentage of paralogs found on the same chromosome is therefore consistent with the likely mode of origin of these two types of duplicated genes (Table 2). The percentage of paralogs found on the same chromosome is also similar between pre- and post-WGD genomes (Table 2).

3.3. Gene Conversion Frequency and Distance between Converted Genes

In post-WGD genomes, intrachromosomal gene conversions tend to occur more frequently than interchromosomal conversions. In the paralog families of S. cerevisiae and C. glabrata, genes located on the same chromosome are converted 2 to 10 times more frequently than genes found on different chromosomes (Table 3). Similarly, in the ohnolog families of S. cerevisiae, genes located on the same chromosome are converted 4 times more frequently than genes found on different chromosomes (Table 3). In contrast, there is an almost complete absence of gene conversions between the ohnologs found within the C. glabrata genome (Table 3).

In pre-WGD genomes, the paralogs found on the same chromosomes of K. lactis and Y. lipolytica are not converted more frequently than paralogs found on different chromosomes but the D. hansenii paralogs found on the same chromosomes are converted roughly 3 times more frequently than those found on different chromosomes (Table 3).

The mean number (±S.D.) of conversions detected within the paralog gene families of the pre- ( ) and post-WGD ( ) genomes is not statistically different ( -test, ; Table 4). Although the ohnolog families of post-WGD genomes contain only an average of conversions, this number is also not significantly different from the average number of conversions found in post-WGD paralog families ( -test, ).

When considering gene conversion frequencies with respect to the total number of comparisons, gene conversions of post-WGD species are either equally frequent in paralog and ohnolog families (in the S. paradoxus, S. mikatae, and S. bayanus genomes) or significantly more frequent in paralog than in ohnolog families in the four other post-WGD families ( -test, ; Table 4).

When considering gene conversion frequencies with respect to the total number of multigene family members, the mean conversion frequency for paralogs ( ) is significantly larger than the frequency for ohnologs ( ; Wilcoxon two sample test, ).

We believe that using gene conversion frequencies with respect to the total number of multigene family members is more appropriate to compare gene conversion frequencies between ohnologs and paralogs because it better reflects the much larger number of conversions found in paralogs when compared to ohnologs. For example, in the case of S. cerevisiae with 13 conversions between ohnologs and 110 conversions between paralogs (Table 4), the conversion frequency for ohnologs is 2.35% (13/551) and 5.71% for paralogs (110/1930) when frequencies are calculated with respect to the total number of comparisons. However, these frequencies do not take into account the fact that 1102 ohnolog sequences were compared (551 pairs) whereas only 212 paralog sequences (i.e., less than one fifth of the number of ohnolog sequences) were compared (for a total of 1930 pairwise comparisons) to obtain the 5.71% frequency of paralogs. In contrast, if one compares the frequencies calculated with respect to the number of genes, the frequency of conversions is 1.17% (13/1102) for ohnologs and 51.40% for paralogs (110/212). The large difference between the two ways of calculating frequencies is due to the fact that frequencies calculated with respect to the total number of comparisons have a much larger denominator which biases the comparisons between ohnologs and paralogs. For example, for a family with 10 paralogous sequences, the number of pairwise comparisons will be 45 ([10(10−1)]/2) whereas it will only be 5 for 10 ohnologs.
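The difference in denominators can be checked with a few lines of code. The sketch assumes, as in the example above, that every paralog in a family is compared with every other member whereas each ohnolog is compared only with its single duplicate partner; the function names are hypothetical.

```python
def paralog_comparisons(n_genes):
    """All pairwise comparisons within one family: n(n - 1)/2."""
    return n_genes * (n_genes - 1) // 2


def ohnolog_comparisons(n_genes):
    """One comparison per ohnolog pair: n/2."""
    if n_genes % 2:
        raise ValueError("ohnologs come in pairs, so n_genes must be even")
    return n_genes // 2


ten_paralogs = paralog_comparisons(10)   # 45
ten_ohnologs = ohnolog_comparisons(10)   # 5
```

The number of paralog comparisons grows quadratically with family size while the number of ohnolog comparisons grows linearly, which is why the per-comparison frequency penalizes large paralog families.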

Ectopic gene conversions between paralogs are equally frequent in both pre- and post-WGD genomes. Median gene conversion frequencies relative to both total number of comparisons and number of multigene family members are not statistically different between pre-WGD (12.09%, 23.8%) and post-WGD (5.06%, 12.4%) paralogs (Table 4; Wilcoxon two sample test, with respect to the number of gene comparisons and with respect to the number of multigene family members).

There is a significant negative correlation (Spearman rank correlation test) between gene conversion frequency and distance between paralogs located on the same chromosomes in the genomes of S. cerevisiae ( ; ), C. glabrata ( ; ), and D. hansenii ( ; ). Correlations could not be calculated for the other paralog and/or ohnolog data sets either because gene distance information was not available for some species (see above) or because less than four genes were found on the same chromosomes (statistical power analyses require at least 4 data points).
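For readers who wish to reproduce this type of test, the Spearman rank correlation coefficient can be computed as the Pearson correlation of the ranks. The sketch below uses only the standard library and average ranks for ties; it does not compute a significance level, and the distance and frequency values are hypothetical, chosen only to illustrate a negative relationship.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: distance between paralogs (kb) and conversion frequency.
distance_kb = [1, 5, 20, 80, 300]
frequency = [0.40, 0.30, 0.35, 0.10, 0.05]
rho = spearman_rho(distance_kb, frequency)
```

A coefficient close to −1, as in this toy example, indicates that conversion frequency decreases almost monotonically with distance.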

3.4. Gene Conversion Length and Flanking Similarity

The median lengths of the gene conversions between ohnologs do not differ significantly among the seven post-WGD genomes (Table 5; multiple comparison ANOVA test, , ). The median lengths of gene conversions between the paralogs of pre-WGD genomes also do not differ significantly ( ). However, the median length of the gene conversions between paralogs is significantly longer in S. cerevisiae than in S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus (multiple comparison ANOVA, ). In post-WGD genomes, the median lengths of gene conversions in paralogs and ohnologs (182 and 186.5 bp, respectively) are not significantly different (pairwise Wilcoxon rank tests, Table 5). Finally, the median lengths of gene conversions are significantly different from each other between pre-WGD (150 bp) and post-WGD (182 bp) paralogs (Wilcoxon two sample test, , ). These median lengths are similar to the average length of the S. cerevisiae conversions observed in a previous study (173 bp, [4]).

The median sequence similarities of regions flanking gene conversions between ohnologs are equal in all seven post-WGD genomes (Table 6; multiple ANOVA tests, , ). Furthermore, the median sequence similarities of regions flanking gene conversions between paralogs are equal in all seven genomes (multiple comparison ANOVA test, , ).

Although the median flanking similarity of the converted paralogs of post-WGD species is always higher than that of their ohnologs, this difference is only significant in the genomes of S. cerevisiae and S. bayanus (Table 6). However, this lack of statistical significance is likely the result of the relatively low power of these statistical tests because the power of each test was ≤61% (results not shown).

The median sequence similarities of regions flanking gene conversions between the paralogs of pre-WGD genomes are equal (Table 6; multiple ANOVA tests, , ). However, converted genes within pre-WGD paralogs have significantly less flanking similarity (pooled median of 90%) than converted paralogs in post-WGD genomes (pooled median of 94%; Wilcoxon two sample test, , , Table 6). We do not know whether this difference has any biological significance.

Analysis of the relationship between the length of gene conversions and flanking similarity indicates a significant positive correlation within the ohnologs of the seven post-WGD genomes (Spearman rank correlation test, , ; Figure 3(a)), the paralogs of the seven post-WGD genomes ( , ; Figure 3(b)) and the paralogs of the three pre-WGD genomes ( , ; Figure 3(c)).

3.5. Ka, Ks, Ka/Ks Ratios and Ontology of Ohnolog and Paralog Converted Genes

In post-WGD genomes, the fact that synonymous substitutions (Ks) are lower for converted paralogs than for converted ohnologs suggests that paralogs have a more recent origin (Table 7). In addition, the higher Ka/Ks ratio of paralogs indicates that paralogs are under weaker selective constraints than ohnologs. Furthermore, the similar Ka/Ks ratios of pre- and post-WGD paralogs indicate that the paralogs of pre- and post-WGD species evolve under similar selective constraints (Table 7).

The ohnologs and paralogs of S. cerevisiae are involved in different processes. Although many of the GO terms shown in Table 8 are not mutually exclusive (e.g., “transposition” and “transposition, RNA-mediated”), analyses of the processes in which these genes are involved show that ohnologs are involved in regulation, essential biosynthetic processes, and metabolic processes whereas paralogs are involved in transposition, transport, and nonessential biosynthetic processes.

3.6. Location of Converted Regions

When considering pre-WGD paralogs, post-WGD paralogs and post-WGD ohnologs, only the post-WGD paralogs of S. cerevisiae show a significant bias of gene conversions towards the 3'-end of genes (Table 9). However, the fact that the power of all nonsignificant tests is smaller than 15% suggests that this bias might also exist in the data sets where it was not detected but that our data are not sufficient to detect it (Table 9).

4. Discussion

Using S. cerevisiae ohnologs as queries allowed us to retrieve an average of 372 ohnolog pairs from the other six post-WGD genomes (Table 1). Although these seven species are phylogenetically related (see [13] for a phylogenetic tree of these fungal species), and therefore did not evolve independently, it is very unlikely that species as divergent as S. cerevisiae and C. glabrata (which diverged soon after the whole genome duplication, some 150 MYA), would have kept 300 pairs of common ohnologs by chance. In fact, assuming that the ancestral pre-WGD genome had 5000 genes and that current post-WGD genomes have 5500 genes [13], one would expect them to have kept only 50 ohnologs in common (5000 × (500/5000)² = 50) by chance alone. As we discuss further below, this suggests that common ohnologs provide a selective advantage and evolve under strong selective constraints.
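The chance expectation of 50 shared ohnolog pairs can be reproduced with a short calculation. The sketch assumes that a genome with 5500 genes derived from 5000 ancestral genes has retained 5500 − 5000 = 500 duplicate pairs, that each ancestral gene is equally likely to be among them, and that retention is independent in the two lineages; these assumptions are ours and the function name is hypothetical.

```python
def expected_shared_pairs(ancestral_genes, current_genes):
    """Expected number of ohnolog pairs retained in both of two genomes by chance."""
    retained_pairs = current_genes - ancestral_genes  # 500 pairs kept per genome
    p_retained = retained_pairs / ancestral_genes     # 0.1 per ancestral gene
    # Independent retention in two lineages: multiply the probabilities.
    return ancestral_genes * p_retained ** 2


expected = expected_shared_pairs(5000, 5500)  # approximately 50
```

Under these assumptions the expectation is far below the roughly 300 common pairs discussed above, which is the basis for the argument that shared retention is not random.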

Since the number and the mean size of paralog multigene families are not significantly different between pre- and post-WGD species, the genome duplication event in the post-WGD genome ancestor did not significantly increase the number or mean size of paralog multigene families in post-WGD species (Table 1, Figure 2). The small number and size of gene families in C. glabrata have already been noticed and are likely the result of reductive evolution and gene loss through relatively high genome instability [10, 12, 30].

The chromosomal distribution of ohnologs and paralogs is very different. Whereas, on average, 23.4% of paralogs are found on the same chromosomes, only 4.5% of ohnologs are found on the same chromosomes (Table 2). A likely explanation for this difference is that paralogs often originate from unequal crossing over or replication slippage events whereas ohnologs originate from whole genome duplication events (page 250 of [31], [18], pages 199–202 of [32]). Since gene conversions tend to be more frequent between genes found on the same chromosomes than between genes located on different chromosomes (Table 3), this explains, in part, why gene conversions tend to be more frequent between paralogs than between ohnologs (Table 4). In fact, on average, when comparing gene conversion using total numbers, frequency calculated using the number of multigene family members, or frequency based on the number of gene comparisons, gene conversions are more frequent in the paralogs of pre- and post-WGD genomes than in the ohnologs of the post-WGD genomes (Table 4).

Previous work on yeast, Drosophila, and humans has shown that intrachromosomal gene conversions are more frequent than interchromosomal gene conversions [4–6]. A possible explanation for the relatively high frequency of intrachromosomal conversions in D. hansenii (36%, Table 3) is that multiple tandem duplication events have been identified within this genome and, therefore, most paralogs are still located on the same chromosomes [10]. In contrast, in K. lactis and Y. lipolytica, gene conversions between intra- and interchromosomal paralogs are equally frequent (Table 3). The highly redundant Y. lipolytica genome has been shown to have undergone a high degree of map dispersion [10]. The low frequency of intrachromosomal conversions observed in this genome might therefore be the result of the dispersion of tandemly duplicated paralogs to other chromosomes. A similar phenomenon might be present in K. lactis. It is unlikely that these exceptions are due to mechanistic differences in the repair of double-strand breaks between pre- and post-WGD species because the majority of repair genes have been maintained throughout the evolution of the hemiascomycetes [33].

Previous studies have demonstrated a negative correlation between gene conversion frequency and physical distance on the same chromosome [4, 7]. We also observed such a negative correlation in the genomes of S. cerevisiae, C. glabrata, and D. hansenii (see above). However, a lack of data (statistical power) prevented the detection of such a relationship in the paralogs of K. lactis and Y. lipolytica and the ohnologs of S. cerevisiae and C. glabrata. This correlation could result from the fact that the DNA repair mechanisms preferentially search for suitable repair templates close to the damaged gene. Since ohnologs are more often found on different chromosomes (Table 2), this would also explain why conversions are less frequent between ohnologs than between paralogs. On the other hand, our recent analyses of the human genome [6] have shown that, in that genome, the negative correlation between gene conversion frequency and physical distance is simply the result of the fact that most duplicated genes are found next to one another. Thus the negative correlation we observed in some yeast species might also disappear if we normalized our data to take into account the fact that most paralogs are located next to one another on the same chromosome [10].

Sequence similarity requirements for ectopic conversions and the amount of negative selection are very similar between pre- and post-WGD paralogs. Several pieces of information support these conclusions. The fact that the frequency (Table 4), length (Table 5), and flanking sequence similarities (Table 6) of gene conversion of the paralogs within pre- and post-WGD species are similar indicates that mechanistic similarities are present between these genomes. In addition, the fact that the mean Ka/Ks values for the paralog families of pre- and post-WGD species are alike (Table 7) suggests that their genes are under similar selective pressures and have similar gene conversion constraints. This suggests that, despite the different ecological niches of the yeast species, these paralogs evolve in similar ways.

Surprisingly, the sequence similarity flanking conversions between post-WGD ohnologs is always lower than that flanking post-WGD paralogs (Table 6). This is likely due to the fact that ohnologs are much older than paralogs (i.e., they have larger Ks values; Table 7), which gave them time to accumulate more substitutions, and are under stronger selective constraints (i.e., they have smaller Ka/Ks ratios; Table 7). Stronger selective constraints are expected to select against conversions which would homogenize ohnologs because such homogenization would erase the functional differences that each member of a pair of ohnologs has acquired during evolution. As mentioned above, the different function each member of a pair of ohnologs has acquired (neofunctionalization) also likely explains why different yeast genomes have so many common ohnologs (Table 1; [20]). Conversely, one of the effects of repeated gene conversion due to less negative selective pressure on paralogs is that the sequence similarity between them will increase. Thus, the observation that ectopic gene conversions occur more frequently between paralogs than ohnologs (Table 4) might not only be due to the fact that ohnologs are more often found on different chromosomes (Table 2) but also due to ohnologs being under stronger selective constraints than paralogs (Table 7). These stronger selective constraints are due to the fact that ohnologs are involved in essential processes (regulation, essential biosynthetic processes and metabolic processes) whereas paralogs are involved in nonessential processes (transposition, transport and nonessential biosynthetic processes; Table 8). This is similar to the situation within genes where gene conversions have been shown to be less frequent in more functionally important regions [34, 35].

Previous studies on S. cerevisiae have found that gene conversions are biased toward the 3'-end of converted genes. This has been attributed to ectopic gene conversion via cDNA intermediates [4]. Our results confirm that conversions are biased toward the 3'-end of genes within the S. cerevisiae paralog dataset ([4]; Table 9). The fact that no significant bias was detected within any other species is likely a result of the low statistical power due to the small amount of data available for each of these species (Table 9). This low statistical power for the distribution of gene conversions other than those between S. cerevisiae paralogs likely reflects the fact that whereas there were 110 conversions between S. cerevisiae paralogs, there were only between 8 and 52 gene conversions between the paralogs of the other nine yeast species (Table 4). There were also only between 2 and 14 gene conversions between the ohnologs of the 7 post-WGD species. These low numbers of gene conversions are therefore not sufficient to ascertain whether their distribution is significantly biased.

The suggestion that the 3'-end bias of the gene conversions between S. cerevisiae paralogs is due to ectopic gene conversions with cDNA intermediates is consistent with the low number of introns present in this species as well as their 5'-position bias [4, 36, 37]. The genome of this species contains only 286 introns, and most of these introns are located at the 5'-end of the genes in which they are present [37]. This contrasts with the 139,418 introns found in the human genome and with the absence of intron position bias in human genes [37]. The model proposed by Fink to explain both the paucity and 5'-position bias of S. cerevisiae introns posits that incomplete cDNA molecules can recombine with their genomic copies leading to both intron loss and a 5'-position bias of the remaining introns [36, 37]. This model was later supported by the experimental demonstration that cDNA molecules can recombine with their genomic copy [9]. Since the genomes of C. glabrata, D. hansenii, K. lactis, and Y. lipolytica all have few introns and their introns have a 5'-position bias [38], one would also expect to observe a 3'-end bias for their gene conversions if they often occur with cDNA copies. As discussed above, the fact that we did not observe such a bias in these four species could be due to the low statistical power of our tests. Alternatively, it could reflect recombination differences between S. cerevisiae and these four species.

In summary, our results show that the number and mean size of multigene families composed of paralogous sequences are not significantly different between pre- and post-WGD species (Table 1, Figure 2), that paralogs are more often found on the same chromosomes than ohnologs (Table 2), that gene conversions tend to be more frequent between genes found on the same chromosomes than between genes located on different chromosomes (Table 3), that gene conversions tend to be more frequent between paralogs than between ohnologs (Table 4), that the frequency (Table 4), length (Table 5), and flanking sequence similarities (Table 6) of the gene conversions between the paralogs of pre- and post-WGD species are similar, that there is a positive correlation between the length of gene conversions and flanking similarity in all converted genes (Figure 3), that ohnologs are under stronger selective constraints than paralogs (Table 7), that these stronger selective constraints are due to the fact that ohnologs are involved in essential processes whereas paralogs are involved in nonessential processes (Table 8), and that conversions are biased toward the 3'-end of the S. cerevisiae paralogs (Table 9). In the future, since it has recently been shown that the expression levels of duplicated genes influence their rate of sequence divergence [39], it would be interesting to test whether the increased ectopic gene conversion frequency we observed in C. glabrata, D. hansenii, and K. lactis (Table 3) is due to conversions between highly expressed genes.

Acknowledgments

The authors thank the two anonymous referees for their useful and constructive comments on a previous version of this paper. This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada to G. Drouin.