Abstract

Tandemly arrayed genes (TAGs) are duplicated genes that are linked as neighbors on a chromosome, many of which have important physiological and biochemical functions. Here we performed a survey of these genes in 11 available vertebrate genomes. TAGs account for an average of about 14% of all genes in these vertebrate genomes, and about 25% of all duplications. The majority of TAGs (72–94%) have parallel transcription orientation (i.e., they are encoded on the same strand) in contrast to the genome, which has about 50% of its genes in parallel transcription orientation. The majority of tandem arrays have only two members. In all species, the proportion of genes that belong to TAGs tends to be higher in large gene families than in small ones; together with our recent finding that tandem duplication played a more important role than retroposition in large families, this fact suggests that among all types of duplication mechanisms, tandem duplication is the predominant mechanism of duplication, especially in large families. Finally, several species have a higher proportion of large tandem arrays that are species-specific than random expectation.

1. Introduction

Although the importance of duplicated genes in providing raw materials for genetic innovation has been recognized since the 1930s and is highlighted in Ohno's book Evolution by Gene Duplication [1], it is only recently that the availability of numerous genomic sequences has made it possible to quantitatively estimate how many genes in a genome are generated by gene duplication. For instance, it has been estimated that about 38% of the genes in the human genome and 49% of the Caenorhabditis elegans genome arose from gene duplication [2, 3]. It is almost certain that current estimates of the extent of gene duplications are low, as many duplicated genes may have diverged to such a great extent that their common origin can no longer be recognized.

Known mechanisms of gene duplication include unequal crossover (or equivalently, tandem duplication), retroposition, and segmental (or genome) duplication [4]. Unequal crossover consists of chromosomal mispairing followed by the exchange of DNA between nonhomologous regions, resulting in either gene duplication or gene deletion [5]. Retroposition refers to reverse transcription of the mRNA transcript of a gene into double-stranded DNA followed by insertion of the double-stranded DNA into a location typically distant from the original gene. Genome duplication in vertebrates is not as frequent as that in plants. According to the two-round genome duplication hypothesis, the last possible genome duplication in vertebrates occurred more than 400 million years ago [6]. Recent segmental duplications cover only about 2% of the mouse genome [7] and 4% of the human genome [8] and usually do not contain genes [9]. Recently, some general studies of gene duplications have been undertaken (e.g., [10]), as well as specific computational identification and characterization of retrotransposed duplicated genes with respect to their location and dynamics in species such as human and mouse [11, 12]. There have also been studies of duplicated genes generated through unequal crossover (tandem duplication) in C. elegans [13], Arabidopsis thaliana [14], Oryza sativa [15], and several mammals [16].

Our current study focuses on tandemly arrayed genes (TAGs) in available vertebrate genomes. Tandem duplication has been shown to act as the driving evolutionary force in the origin and maintenance of gene families [17] and has been a common mechanism of genetic adaptation to environmental challenges in organisms such as bacteria [18], mosquitoes [19], plants [20], and mammals [21]. TAGs constitute a large component of several eukaryotic genomes. For example, at least 10% of the genes in the genomes of C. elegans, Arabidopsis thaliana, and human are TAGs [13, 14, 16]. TAGs can either promote genomic diversity to enhance disease resistance, satisfy the requirement for a large amount of a gene product, or contribute to the fine-tuning of developmental stages and physiological functions [1, 22].

In this work, we performed a genome-wide survey of the TAGs in 11 completed or nearly completed vertebrate genomes. We provided some general statistics regarding the number of genes in these genomes that belong to TAGs, the contribution of TAGs to the total duplications, TAG size (i.e., how many genes are in an array) distributions, gene transcription orientations in TAGs, and the contribution of tandem duplication in the make-up of gene families of different sizes. We also identified species-specific TAGs and compared their distribution among species.

2. Results

The summary statistics are presented in Table 1. There are a total of 285 801 putative genes in the 11 genomes. Figure 1 (adapted mostly from [23]; the divergence time between zebrafish and tetraodon is from [24]) shows the phylogeny of these species. On average, each genome has 25 982 genes, with human (31 185) and chicken (19 399) having the largest and smallest numbers of genes, respectively. Of all these genes, 255 293 have been assigned to a specific chromosome location across the 11 species. More than 90% of the genes have been assigned to a known location in the genome assembly for human, chimp, mouse, rat, macaca, dog, opossum, and zebrafish, whereas only 69%, 82%, and 55% have been assigned for cattle, chicken, and tetraodon, respectively. The numbers of gene families range from 2297 (chicken) to 4127 (zebrafish), and the numbers of genes contained in these families range from 7199 (chicken) to 20 187 (zebrafish) among all species. Thus, gene families on average contain about 3 to 5 members.

2.1. Spacers in TAGs

TAGs are usually defined as genes that are duplicated tandemly on chromosomes. Spacers are genes that are not homologous to the members of TAGs (see Section 5 for details). Allowing different numbers of spacers between two members of an array will result in different numbers of TAGs. Figure 2 shows TAG statistics with respect to different numbers of spacers. There are three general patterns. First, for all species, the number of tandem arrays increases with the number of spacers allowed in the array, although the extent of increase varies among species (Figure 2(a)). The zebrafish shows the largest increase in the number of arrays as the number of spacers increases (significant in all of the one-tailed t-tests between zebrafish and the other species). Similarly, the number of genes included in the tandem arrays also increases when more spacers are allowed in the TAGs (Figure 2(b)). Second, for most species, both the number of tandem arrays and the number of genes in the arrays show the sharpest increase when going from spacer 0 to 1, consistent with studies in Arabidopsis thaliana and rice [14, 25]. Third, the similarities in these two quantities (the number of tandem arrays and the number of genes in the arrays) among species reflect to a certain extent their evolutionary distances (Figure 1). For instance, mouse and rat show a very similar pattern in both the number of arrays and the number of genes; so do human and macaca, and dog and cattle. The zebrafish appears to be the most distinct of the remaining species, having the highest numbers of both arrays and genes for almost all TAG definitions (significant in all pairwise t-tests between zebrafish and the other species), perhaps because the zebrafish has undergone recent genome duplications so that the number of tandemly arrayed genes is much larger than for other species.
An exception is seen in the chimp where, despite its being the most closely related to the human, there is a much greater divergence in the two quantities from human than in macaca from human. The quality of the chimp genome assembly has been known to be poor, which might explain the strange pattern that we observe here.

The percentages of TAG genes range from about 8% to 19% among all species when no spacers are allowed in TAGs, and from about 10% to 26% when allowing 10 spacers. Therefore, TAGs account for a large proportion of genes in vertebrate genomes (Figure 2(c)). As previous and current genome-wide studies of TAGs suggest that allowing 1 spacer between array members is a good compromise between stringency and gene coverage, we report for the rest of the study only the results on TAGs that have at most 1 spacer. Note that according to our definition, allowing at most 1 spacer means that every pair of neighboring members in a tandem array is separated by at most 1 spacer; therefore, the array can have more than 1 spacer in total.

2.2. Contribution to Gene Duplication

Tandem duplication has been commonly cited in the literature as one of three major mechanisms of gene duplication [4]. However, a quantitative evaluation of its contribution to duplications in the vertebrate genomes has not been available until our recent report [16]. Here, we also examined the percentage of duplicated genes that are in tandem arrays in these 11 genomes. Results are shown in Table 1. TAGs not only make up 9%–21% of the genes, but also account for 18%–34% (up to one third) of all duplications in these genomes.

2.3. Size of Tandem Array

Table 2 shows the distribution of tandem array sizes (i.e., the number of genes in a tandem array) and the percentages of TAG genes in each size category. Among all species, about 60% to 83% of the tandem arrays are of size two, that is, they have only 2 members. The proportions of tandem arrays of larger sizes decrease rapidly after size two. In terms of genes rather than arrays, mouse (30%), rat (34%), and opossum (38%) have the smallest proportions of TAG genes in two-member arrays, in contrast to 41%–73% for all remaining species. Mouse, rat, and opossum thus tend to have more large arrays. In fact, the average number of genes per array ranges from 3.4 to 4.0 in mouse, rat, and opossum, and from 2.4 to 3.2 in the remaining species.
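The two kinds of percentages reported per size category (share of arrays versus share of TAG genes) can diverge considerably, because a large array counts once as an array but contributes many genes. The following Python sketch illustrates the bookkeeping; the function name and the array sizes in the example are hypothetical and are not taken from Table 2.

```python
from collections import Counter


def size_distribution(array_sizes):
    """Return {size: (fraction_of_arrays, fraction_of_tag_genes)}.

    array_sizes: one integer per tandem array (its number of member genes).
    """
    counts = Counter(array_sizes)
    n_arrays = len(array_sizes)
    n_genes = sum(array_sizes)
    return {
        size: (count / n_arrays, size * count / n_genes)
        for size, count in sorted(counts.items())
    }


# Hypothetical genome: six arrays of size 2, two of size 3, one each of 5 and 9.
example_sizes = [2] * 6 + [3] * 2 + [5, 9]
distribution = size_distribution(example_sizes)
mean_size = sum(example_sizes) / len(example_sizes)
```

In this made-up example, 60% of the arrays have two members, but those arrays hold only 37.5% of the TAG genes, and the mean array size is 3.2.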

2.4. TAG Orientations

Table 3 shows the statistics of three types of gene transcription orientations (parallel, →→ or ←←; convergent, →←; and divergent, ←→) for both genomes and TAGs. The proportions of neighboring genes with the three transcription orientations in the genome are very similar among all species, with parallel transcription orientation being the major type (varying within a narrow range of about 50%–57%) and roughly equal proportions of convergent and divergent transcription orientations (both about 22%–25%). In contrast, for all species, the majority of gene pairs in TAGs have a parallel transcription orientation, ranging from about 72% to 94%, much higher than those in the genomes (significant by t-test). The proportions of convergent and divergent transcription orientations in TAGs are similar, ranging from 3% to 14% among species. Statistical tests show that the distribution of the three types of transcription orientations in TAGs is significantly different from that of all genes in the genome (chi-square goodness-of-fit test: P-value < 1E-36 for all species).
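As an illustration of how such a comparison can be set up, the Python sketch below classifies neighboring gene pairs by strand and computes a chi-square goodness-of-fit statistic against genome-wide proportions. The function names, the '+'/'-' strand encoding, and the counts in the example are assumptions made for illustration; they are not the data in Table 3.

```python
import math


def classify_orientation(left_strand, right_strand):
    """Classify two neighboring genes by strand ('+' or '-').

    The left gene is the one with the smaller chromosomal coordinate.
    """
    if left_strand == right_strand:
        return "parallel"
    # '+' followed by '-' means the genes are transcribed toward each other.
    return "convergent" if left_strand == "+" else "divergent"


def orientation_counts(strands):
    """Count orientation types over consecutive gene pairs."""
    counts = {"parallel": 0, "convergent": 0, "divergent": 0}
    for left, right in zip(strands, strands[1:]):
        counts[classify_orientation(left, right)] += 1
    return counts


def chi_square_statistic(observed, expected_fractions):
    """Goodness-of-fit statistic: sum over categories of (O - E)^2 / E."""
    total = sum(observed.values())
    return sum(
        (observed[key] - total * expected_fractions[key]) ** 2
        / (total * expected_fractions[key])
        for key in observed
    )


# Hypothetical counts for 100 TAG gene pairs, tested against a genome-like
# background of 50% parallel, 25% convergent, 25% divergent.
observed = {"parallel": 80, "convergent": 10, "divergent": 10}
background = {"parallel": 0.5, "convergent": 0.25, "divergent": 0.25}
statistic = chi_square_statistic(observed, background)
# With three categories there are 2 degrees of freedom, for which the
# chi-square upper-tail probability has the closed form exp(-x / 2).
p_value = math.exp(-statistic / 2)
```

With real data, the observed counts would come from gene pairs within TAGs and the background fractions from all neighboring gene pairs in the same genome.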

2.5. TAGs in Gene Families

Table 4 shows the proportions of duplicated genes that belong to TAGs in gene families of different sizes. There is a clear trend that, as family size gets larger, the proportion of TAGs in the families also increases. For instance, in gene families of size two (i.e., families that have two members), only around 10% of gene members belong to TAGs (except for tetraodon), whereas in families of size ≥10, 35%–60% of the members belong to TAGs. Figure 3 shows the relationship between family sizes and mean percentages of TAGs (averaged over all species). Tests of correlation show that the percentages of TAGs in gene families are positively correlated with gene family sizes (Pearson's correlation coefficient varies from .78 to .94 among species, all P-values < .008). For all species, the correlation remains significant even after removing the size category ≥10, which pools all families with 10 or more genes (all P-values < .05).

We also examined the homologous tandem arrays (the TAGs that belong to the same Ensembl gene family) across all the species. Due to the complex homologous relationship between the members of TAGs within and across species (see Section 3), we did not perform the standard phylogenetic analysis. Instead, to explore the relationship between TAGs across species, we clustered the species based on the distribution of the number of TAGs in the same families across these species by the k-means clustering method [26]. Specifically, each row of the input matrix for k-means clustering contains a vector whose entries correspond to the number of TAG occurrences in each of the gene families in a particular species. Therefore, we used k-means clustering to take account of the information of all families in order to group the species that show similar TAG profiles. Our purpose is to test whether the clustering based on TAGs is congruent with the species tree (Figure 1). We set the number of clusters k from 2 to 4. When k = 2, the resulting two groups are {human, chimp, macaca, mouse, rat, opossum} and {cattle, dog, chicken, zebrafish, tetraodon}; when k = 3, the resulting three groups are {human, chimp, macaca, opossum}, {mouse, rat}, and {cattle, dog, chicken, zebrafish, tetraodon}; and when k = 4, the resulting four groups are {human, chimp, macaca, opossum}, {mouse, rat}, {cattle, dog, chicken, tetraodon}, and {zebrafish}. Compared with the species tree, it turns out that primate species (human, chimp, macaca) and murine species (mouse, rat) are always clustered correctly, but cattle and dog are more likely to be clustered with nonmammals.
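A minimal Python sketch of this kind of clustering is given below, assuming plain Lloyd-style k-means with squared Euclidean distance and caller-supplied starting centers. The text does not specify the initialization or distance measure, so these are assumptions, and the species labels and TAG counts are invented toy values.

```python
def kmeans(rows, initial_centers, max_iterations=100):
    """Cluster equal-length numeric vectors; return one label per row."""
    centers = [list(center) for center in initial_centers]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(
                range(len(centers)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centers[c])),
            )
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy TAG profiles: one row per species, one column per gene family,
# each entry the number of TAGs of that family in that species.
species = ["sp_a", "sp_b", "sp_c", "sp_d"]
profiles = [[5, 0, 2], [4, 1, 2], [0, 6, 1], [1, 5, 0]]
labels = kmeans(profiles, initial_centers=[profiles[0], profiles[2]])
groups = {name: label for name, label in zip(species, labels)}
```

Because k-means results can depend on the starting centers, a real analysis would normally repeat the procedure from several initializations and compare the resulting groupings.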

2.6. Species Specific Tandem Arrays

Studies have shown that species-specific duplications can play an important role in species-specific traits or life styles that enable species to adapt to certain environments (e.g., [27–29]). We studied species-specific tandem arrays (SSTAs), which are defined as tandem arrays that are present in only one species; in each of the other species, the homologues are either absent or not tandemly arrayed. Table 5 shows the summary of SSTAs in all species. About 10% of the tandem arrays are SSTAs in mammals and more than 20% in nonmammals. The higher proportion of SSTAs in nonmammals may be mainly due to their much higher divergence from the most recent common ancestor in the species tree (Figure 1). We also used Gene Ontology (GO) annotations to see whether there are any GO categories that are highly enriched in the SSTAs. We found no apparent preference for specific GO functions even between closely related species, such as mouse and rat. But as not all SSTA genes have GO information, further evaluation is needed.

As SSTAs are more likely to be recently born than are the non-SSTA arrays that are shared by multiple species, we expect that under neutral evolution, the sizes of SSTAs (i.e., the number of genes in an array) should be on average smaller than the sizes of non-SSTA arrays. As most of the SSTAs are of size-two (Table 5), we expect that the proportion of SSTAs that are of size two should be higher than the proportion of non-SSTA arrays that are of size two. Only in chimp, macaca, rat, and dog is the proportion of size-two SSTAs significantly higher than that of size-two non-SSTAs, which means that the sizes of SSTAs in most of the species are not significantly smaller than the sizes of non-SSTA arrays.

3. Discussion

3.1. Contribution of Tandem Duplication

Here we performed a genome-wide survey of TAGs in 11 assembled vertebrate genomes. In summary, when using a stringent criterion for TAG identification (e.g., allowing at most 1 spacer between array members), we observed a consistent pattern of tandem duplication contributing to the number of genes in the genomes and to genome-wide duplications: on average, about 14% of the genes in vertebrate genomes are TAGs, and about 25% of all duplicated genes are tandemly arrayed.

These numbers most likely underestimate the extent of tandem duplication in these genomes. Our recent study shows that more than 25% to 40% of the recent gene duplications are generated by tandem duplications in human and mouse [30]. Therefore, it is likely that many old tandem arrays became invisible during evolution owing to various genome rearrangements. Meanwhile, one may wonder whether duplicated genes arising from duplication mechanisms other than tandem duplication could get scrambled during evolution and happen to be arranged as TAGs. However, as shown in our previous study, this possibility should have minimal effect on the TAG statistics because the probability that duplicated genes appear as TAGs by chance is very low, about 1 to 2 orders of magnitude lower than the actual extent of tandem duplication [16].

3.2. TAG Transcription Orientation

It has been shown that the majority of tandem arrays are in parallel transcription orientation in Arabidopsis thaliana and rice [25]. How this compares to the genome patterns in these species has not yet been studied. The vertebrate genomes show remarkably consistent patterns in the proportion of gene pairs in parallel, convergent, and divergent transcription orientations, with a ratio of roughly 2 : 1 : 1. In contrast, TAGs have much higher proportions of parallel orientation, ranging from about 72% to as high as 94%. Therefore, in both plants and animals, parallel orientation is the dominant type of transcription orientation in TAGs.

So why is there disproportionately less convergent and divergent transcription orientation in TAGs than in the genome? One explanation is that tandem duplications occur at a higher rate on the same strand than on different strands. Little is known about what determines the rates of tandem duplication on the same strand or different strands. Therefore, how much differential rates of tandem duplication between same-strand and different-strands contribute to the observed dominance of parallel orientation across all the studied species remains an open question. Another possible explanation is related to long inverted repeats (LIRs). It has been shown that LIRs can substantially increase genome instability. For example, in the mouse, LIRs in germ lines can lead to elevated genome rearrangements due to increased levels of illegitimate recombination, gene conversion, and deletion mediated by LIRs [31, 32]. In the human, several genetic diseases have been reported to be caused by illegitimate recombination and deletion induced by LIRs [33]. In the case of TAGs, tandem duplicated genes on opposite strands (in convergent or divergent orientation) are essentially LIRs, and their initial high sequence identity might increase the level of illegitimate recombination and various genome rearrangements. The increased genome instability might have a disastrous effect on the individuals that carry tandem duplication; strong negative selection against the duplication will reduce the fixation probability of tandem duplication in the population. This may at least in part explain why we observe a much lower proportion of TAGs with convergent and divergent orientations than in the genome. Meanwhile, the fact that there are still some TAGs with convergent or divergent orientation can also be explained by LIR-mediated changes. 
It has been observed that illegitimate recombination events induced by LIRs sometimes result in asymmetric deletion that eliminates the central symmetry of LIRs [31, 32]. When the deletion does not have a negative effect on the function of the genes, for example, when the deletion happens to be located in introns, the elimination of the symmetry in the LIRs can actually prohibit further illegitimate rearrangements and reduce the levels of gene conversion. Consequently, the LIRs, that is, tandem duplicated genes on opposite strands, no longer pose a threat to genome stability and thus can be fixed in the population [5, 31]. More research needs to be done to determine the causes of the higher proportion of TAGs in parallel orientation than the genome average.

3.3. Tandem Array Sizes

All plant and animal genomes that we have studied so far show that the majority of tandem arrays contain only two members. It is likely that large arrays are destroyed by various genome rearrangements and become smaller arrays over time, which might be the case for most of the tandem arrays. For the large TAG arrays such as the 18S and 28S ribosomal RNA genes in the vertebrates [22], mechanisms such as continued concerted evolution (including unequal crossover and gene conversion) and natural selection need to act on the arrays to prevent array-size decay.

The fluctuation of array sizes has been observed in natural populations of many species such as humans [34] and flies [35]. Empirical evidence also suggests that the fluctuation can produce visible phenotypic effects and sometimes can be detrimental. For example, in Drosophila melanogaster, 18S and 28S rRNA genes contain 150 to 250 tandemly arranged repeats in wild-type flies [35, 36] and individuals carrying a lower copy number than the wild-type have so-called bobbed mutations, characterized phenotypically by having small bristles, abdominal etching, and developmental delay [37, 38]. These studies show that the size of tandem arrays is important in the normal function of organisms and the fluctuation of array sizes might be selected against.

At the same time, a variety of mechanisms can reduce or prevent size change in a tandem array. For example, insertion of irrelevant genes (i.e., genes with no homology to the array members) into the array may effectively reduce the frequency of unequal crossovers. The divergence of array members can also reduce the frequency of unequal crossover. Therefore, observation on array sizes across multiple animal and plant genomes reflects not only a snapshot of current genomes, but also most likely a stable state of TAGs as a result of joint processes of selection, drift, and mutation on the arrays.

3.4. Tandem Duplication and Family Size

The positive correlation between the extent of tandem duplication and the sizes of gene families (Figure 3 and Table 4) indicates that the contribution of TAGs to gene families of different sizes weighs more in large gene families than in smaller ones. It may also be possible that large gene families have a higher percentage of genes belonging to TAGs than small gene families simply by chance. We performed a permutation test to see whether chance alone can produce such a strong association between family sizes and TAG percentages. We simulated 10 000 pseudogenomes, each of which has the same number of gene families and distribution of family sizes as the studied genomes. We randomly assigned the chromosome location of all the genes in the pseudogenomes and determined the percentage of TAGs for all the family sizes. We then examined the correlation between the percentage of TAGs and family sizes and found that indeed they are correlated. However, the percentage of TAGs in the simulation is much lower than our observation in all sizes. For example, in human, we observed about 34% of TAGs in the gene families of size 10, while only about 4% in our simulated distribution. In fact, the simulated percentages of TAGs in all the gene family sizes are about 10 times lower than the actual observations. Thus, it is clear that even though chance does contribute to the positive correlation between the extent of tandem duplication and the sizes of gene families, it is not a determining factor.
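The logic of this permutation test can be sketched in Python as follows. The sketch makes simplifying assumptions that are not in the text: all genes are placed on a single chromosome, genes are shuffled uniformly, a gene counts as a TAG member if a same-family gene lies within one intervening gene, and the family-size list is an invented toy input. The function names are hypothetical.

```python
import random
from collections import defaultdict


def tag_fraction_by_family_size(order, max_spacers=1):
    """order: family ids, one per gene, in chromosomal order.

    Returns {family_size: fraction of the genes in families of that size
    that have a same-family gene within max_spacers intervening genes}.
    """
    positions = defaultdict(list)
    for index, family in enumerate(order):
        positions[family].append(index)
    in_tag = defaultdict(int)
    total = defaultdict(int)
    for indices in positions.values():
        size = len(indices)
        total[size] += size
        tagged = set()
        # Checking consecutive family members is enough: if any member is
        # close enough, the nearest one is.
        for left, right in zip(indices, indices[1:]):
            if right - left - 1 <= max_spacers:
                tagged.update((left, right))
        in_tag[size] += len(tagged)
    return {size: in_tag[size] / total[size] for size in total}


def simulate_null(family_sizes, n_sim=1000, max_spacers=1, seed=0):
    """Mean TAG fraction per family size over shuffled pseudogenomes."""
    rng = random.Random(seed)
    order = [fam for fam, size in enumerate(family_sizes) for _ in range(size)]
    sums = defaultdict(float)
    for _ in range(n_sim):
        rng.shuffle(order)  # random reassignment of gene locations
        fractions = tag_fraction_by_family_size(order, max_spacers)
        for size, fraction in fractions.items():
            sums[size] += fraction
    return {size: value / n_sim for size, value in sums.items()}


# Toy pseudogenome: 850 singletons, 50 families of size 2, 5 of size 10.
toy_sizes = [1] * 850 + [2] * 50 + [10] * 5
null_fractions = simulate_null(toy_sizes, n_sim=200)
```

Even in this toy setting, larger families show a higher chance-level TAG fraction than small ones, which is why the observed percentages need to be compared against the simulated ones rather than against zero.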

Consistent with the current observation, our recent study shows that tandem duplication generated more duplicated genes than retroposition did in large families in both humans and mice [30]. Many genes in large families such as olfactory receptor genes and zinc finger protein genes are generated by tandem duplication. The question is why tandem duplication has played a more important role in large families than in small families.

To answer the question, we need to compare the differences among various mechanisms of gene duplication. There are three major mechanisms of gene duplication: genome duplication, retroposition, and tandem duplication [4]. Among the three, genome duplication happens the least frequently in animals. Moreover, it doubles the copy number of all genes and thus should have a similar contribution to different-sized families. In contrast, tandem duplication is more frequent and more specific as it duplicates only specific genes instead of every gene in the genome. It may be difficult for a gene to change from single copy to duplicated states because sequence homology around the gene, required by unequal crossover to generate tandem duplication, is not always present. However, once a tandem duplication occurs, it is easy for unequal crossover to quickly expand the array due to the availability of sequence homology. Thus, tandem duplication has the advantage of being fast and easy in generating a large number of genes and providing opportunities for the divergence and refinement of gene function among duplicated members.

It has been shown that retroposition seems to be more active in highly expressed genes in germline cells [39, 40]. However, members in large gene families are not necessarily highly expressed. Our recent study suggests that the expression level seems to be more important than gene family size in determining what genes get retroposed [41]. Moreover, due to the nature of retroposition, the retroposed copy does not have ancestral regulatory regions and its survival is thus dependent upon the probability of being able to capture a regulatory region. The large number of retroposed pseudogenes in the human and mouse genomes suggests that the probability of survival of the retroposed copy is very small. In contrast, many fewer pseudogenes are generated by tandem duplication [42], suggesting that the survival rate of tandemly duplicated genes is much higher than that of retroposed genes. Therefore, as unequal crossover is the most efficient mechanism to generate and maintain gene copies among the three major gene duplication mechanisms, it may explain why TAGs are more frequent in large families than in small ones.

3.5. TAG Homology and SSTAs

Identifying the homologous relationships (orthologous and paralogous) for TAGs across species is a challenging task. Frequent gene conversion within tandem arrays [43] and gene losses and gains of different array members in different species make genome-wide orthologue assignments computationally intractable [44]. One good example that shows the difficulty of homology assignments in TAGs can be seen in the HOX genes. Numerous studies of these genes have shown that there is a tremendous amount of variation in the number of HOX clusters in different species. Moreover, there are losses and gains of different members in different species and frequent gene conversion or concerted evolution in some members (e.g., [45–48]). Computationally, it is nearly impossible to identify a correct homology relationship for these genes across multiple species. There have been computational attempts to infer an evolutionary history of tandem repeated sequences in multiple species (e.g., [49–51]). However, it is clear that correct inference of evolutionary history relies on correct homology assignment, which remains a computationally challenging problem.

To circumvent the homology assignment problem, we studied two aspects regarding the evolution of TAGs in the 11 species that do not require identification of exact homology relationships among TAGs. The first aspect is to examine the evolutionary closeness of the 11 species in terms of TAG quantities in different gene families. There are two TAG quantities that one can describe for a particular gene family. One is the total number of arrays in the family and the other is the total number of tandemly arrayed genes in the family. It is expected that the two quantitative descriptions should show similar evolutionary closeness to what the species tree reflects. Our k-means clustering result suggests that the two quantities, to a large extent, are able to reflect the phylogenetic relationship of the species. The exception is the grouping of dog and cattle, which are always clustered with the nonmammals. A possible explanation is that many genes have not yet been annotated in these two species, especially those mammalian-specific TAGs, which makes them appear closer to nonmammals. Alternatively, it may also mean that some ancestral mammalian TAGs are broken up in dog and cattle.

The second aspect is related to species-specific tandem arrays (SSTAs). Apparently, the definition of SSTAs determines that SSTA statistics are sensitive to the number, the kind, and the annotation quality of the species that are sampled. For example, it is expected that the more species included in a sample, the less likely an array will be an SSTA. Meanwhile, the number of SSTAs in a certain species is directly influenced by the species' distance to its closest related species in the sample. For instance, the number of SSTAs in human in the human-mouse-rat sample will certainly be higher than the number of SSTAs in human in the sample that also includes chimp. Moreover, if the annotation qualities of the two species are different, for instance in the case of human and chimp, there would be more SSTAs in the better annotated species human than in the less well-annotated species chimp.

Despite these caveats, the study of SSTAs, or more generally, species-specific duplication, can potentially provide insight into the adaptive evolution of species-specific traits and life styles. For example, one of the human SSTAs, the sperm protein associated with the nucleus on the X chromosome (SPANX) gene family, which contains two tandem arrays with a total of 6 genes, has been reported to have gone through rapid evolution and amplification in hominids [52]. Our analyses show that the proportion of tandem arrays with more than two members in SSTAs is significantly higher than that in non-SSTAs in some of the studied mammals, which suggests that nonneutral forces may maintain relatively recent-born arrays. However, caution must be taken in interpreting the results, as the species sampling is not homogeneous and some of the SSTAs might not be truly species-specific due to the deep divergence.

4. Conclusions

We have provided a quantitative account of TAGs and their contribution to duplications in vertebrate genomes. This is a first step towards understanding the evolution of these genes. As it has been increasingly realized that how genes are arranged on chromosomes plays an important role in determining gene function, TAGs stand out for their unusual spatial arrangement. Future research can be directed towards further understanding the intricate differences of tandem duplication from other types of duplication and the impact on the ultimate fate of duplicated genes.

5. Materials and Methods

There are altogether 11 vertebrate genomes assembled and available in Ensembl Version 41 (http://www.ensembl.org/). Therefore, we focused on these 11 species including human (Homo sapiens), chimp (Pan troglodytes), mouse (Mus musculus), rat (Rattus norvegicus), macaca (Macaca mulatta), cattle (Bos taurus), dog (Canis familiaris), opossum (Monodelphis domestica), chicken (Gallus gallus), zebrafish (Danio rerio), and tetraodon (Tetraodon nigroviridis). Previously, we studied TAGs in the genomes of human, mouse, and rat [16]. However, as the annotation quality has been continually improved over time and this paper is intended to be a comprehensive overview of TAGs in all available vertebrate genomes, we reanalyzed these species using Ensembl Version 41 as well.

Annotation of genes for all 11 species was obtained using Ensembl Biomart (http://www.ensembl.org/). The total number of genes is shown in Table 1. Genes annotated as unknown or mitochondrial were removed and only those with a known chromosome location were kept, as we needed this information to determine TAGs. We also required that each gene be at least 300 nucleotides long. Family information was also obtained using Ensembl Biomart. In Ensembl, gene families are clustered using the Markov clustering (MCL) algorithm based on sequence similarities (see http://www.ensembl.org/ for details). All data were stored in a MySQL database for subsequent analysis.

TAGs are usually defined as genes that are duplicated tandemly on chromosomes. Members of tandem arrays may be separated by other unrelated genes (called spacers). During evolution, various genome rearrangements, such as transposition and insertion of genes that are unrelated to array members (i.e., not through duplication), can disrupt the spatial arrangement of the TAGs. Allowing different numbers of spacers in between two members of an array will result in a different number of TAGs. For example, consider an array with the spatial arrangement T1-T2-S1-T3-S2-T4-T5, where all Ts are duplicated genes, and S1 and S2 are spacers. When allowing 0 spacers, we will have 2 tandem arrays with each having 2 members (T1 and T2; T4 and T5); allowing 1 spacer, we will have 1 array with 5 members (T1, T2, T3, T4, and T5).

To obtain TAGs, we sorted all the genes of each species chromosome by chromosome and indexed them in ascending order based on their physical locations. Let d denote the absolute difference of the indices between two genes on the same chromosome; d − 1 is then equal to the number of spacers between these two genes. When d = 1, it is a perfect TAG gene pair with no spacers. For a given number n of allowed spacers, we marked the duplicated gene pairs with d ≤ n + 1 and clustered them using a single-linkage algorithm, which ensures that within each tandem array, any two array members are connected by at least one chain of TAG links. A TAG link is the relationship between two genes that can be regarded as a TAG pair under the given number of allowed spacers. We screened TAGs under each TAG definition (spacers 0–10) for every species.
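The identification procedure can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the study: the function name, the (gene_id, family_id) input format, and the use of a union-find structure to realize single-linkage clustering are assumptions, and membership in the same gene family stands in for homology.

```python
from collections import defaultdict


def find_tandem_arrays(genes, max_spacers=1):
    """Group same-family genes on one chromosome into tandem arrays.

    genes: list of (gene_id, family_id) tuples sorted by physical position
           on a single chromosome; family_id is None for genes that belong
           to no family.
    Returns a list of arrays, each a list of gene_ids in chromosomal order.
    """
    # Union-find over gene indices implements single-linkage clustering.
    parent = list(range(len(genes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Collect the position indices of each family's members.
    positions = defaultdict(list)
    for index, (_, family) in enumerate(genes):
        if family is not None:
            positions[family].append(index)

    # Link consecutive family members when the number of genes lying
    # between them (index difference minus one) is within the limit.
    for indices in positions.values():
        for left, right in zip(indices, indices[1:]):
            if right - left - 1 <= max_spacers:
                parent[find(right)] = find(left)

    clusters = defaultdict(list)
    for index, (gene_id, family) in enumerate(genes):
        if family is not None:
            clusters[find(index)].append(gene_id)
    # A tandem array needs at least two members.
    return [members for members in clusters.values() if len(members) >= 2]


# Seven-gene example: five members of one family and two unrelated spacers.
chromosome = [
    ("T1", "fam"), ("T2", "fam"), ("S1", None), ("T3", "fam"),
    ("S2", None), ("T4", "fam"), ("T5", "fam"),
]
arrays_no_spacer = find_tandem_arrays(chromosome, max_spacers=0)
arrays_one_spacer = find_tandem_arrays(chromosome, max_spacers=1)
```

With no spacers allowed the example yields two two-member arrays, and with one spacer allowed it yields a single five-member array that contains two spacers in total, since the limit applies to each pair of neighboring members rather than to the whole array.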

Acknowledgments

The authors thank Lenwood Heath and the anonymous reviewers for helpful comments. The work was supported by the NSF grant IIS-0710945 and the ASPIRES (A Support Program for Innovative Research Strategies) grant at Virginia Tech to the second author.