Abstract

Zea mays (maize) has historically been used as a model species for genetics, development, physiology, and, more recently, genome structure. The maize genome is complex, with striking intraspecific variation in gene order, repetitive DNA content, and allelic content exceeding the levels observed between primate species. Maize genome complexity is primarily driven by polyploidization and explosive amplification of LTR retrotransposons, with the counteracting effect of unequal and illegitimate crossover. Transposable elements have been shown to capture genic content, create chimeras, and amplify those sequences via transposition. New sequencing platforms and hybridization-based strategies that have appeared over the past decade are being applied to maize and are providing the first comprehensive genome-wide view of structural variation. They will provide the basis for investigating the interplay between repeats and genes as well as the amount of species-level diversity within maize.

1. Introduction

Maize is among the most extensively studied plant species in the history of genetics. Beyond its considerable agricultural and economic value as a crop for food, feed, and fuel, maize presents unparalleled biological attributes as a research model for genetic diversity and genome evolution [1, 2]. The maize genetic pool includes a large natural diversity among both wild and cultivated relatives. Well-established breeding strategies, inbred lines, mutant collections, easy-to-follow phenotypes, and large distinctive chromosomes are just some of the characteristics that allowed the construction of the first plant genetic map [3], proof of meiotic recombination linked with the recombination of genetic traits, and support for the chromosomal theory of inheritance [4, 5]. Subsequent research using maize as a genetic model led to the discovery of epigenetic modifications (in the form of paramutation) as well as transposable elements (TEs) (later found to be common to most, if not all, eukaryotic and prokaryotic organisms) [6–9].

The accumulated cytogenetic and genetic data and, more recently, vast sequence information derived from genome project initiatives in grasses have provided a wealth of information on the structure and evolution of the maize genome. A BAC-by-BAC, mapped draft of the genome of the maize inbred line B73 is currently available ([10]; latest release, AGPv2 available at http://www.maizesequence.org/). Additional structural and sequence data for diverse maize genotypes are accumulating thanks to the use of improved comparative genomic hybridization (CGH) techniques and deep resequencing using next-generation sequencing platforms [11–14]. Furthermore, an increasing number of finished genome sequences within the Poaceae (i.e., grass family) provide an important resource for comparative genomics [15]. Currently, there are relatively complete physical assemblies for four nonmaize grass species: rice (Oryza sativa [16]), sorghum (Sorghum bicolor [17]), purple false brome (Brachypodium distachyon [18]), and foxtail millet (Setaria italica [19]; Bennetzen et al., unpublished results; http://www.phytozome.net/foxtailmillet.php/).

In this review, we focus on the diversity of the nuclear genome of maize. As we will describe in this paper, this is a dynamic genome, with multiple processes playing a role in its expansion, contraction, and sequence diversification.

2. General Characteristics of the Maize Genome

The maize genome is genetically diploid and consists of 10 chromosomes with an estimated size ranging from 2.3 to 2.7 Gb [9, 10, 20, 21]. As is the case with other large plant genomes, the maize genome consists mostly of a nongenic, repetitive fraction punctuated by islands of unique or low-copy DNA that harbor single genes or small groups of genes. The repetitive elements contribute significantly to the wide range of diversity within the species and include transposable elements (TEs), ribosomal DNA (rDNA), and high-copy short tandem repeats mostly present at the telomeres, centromeres, and heterochromatic knobs [5, 22–24].

Transposable elements are mobile, “selfish” sequences of DNA that can move from an original location to different parts of the genome. They are classified as Class I (retrotransposons) and Class II (DNA transposons), according to whether the transposition intermediate is RNA or DNA, respectively [25].

Retrotransposons are duplicated in situ via reverse transcription, with the new copy inserting itself in a new location in the genome, producing a net gain of one element. Most Class II transposable elements, on the other hand, transpose by cutting and pasting. Members of one family of DNA transposons in plants are proposed to undergo replicative transposition, which uses a rolling circle DNA replication mechanism [26]. Class I and Class II transposons are either autonomous, which is defined as containing all the components necessary for transposition, or nonautonomous, indicating that their transposition is dependent upon the presence of the cognate autonomous element [25]. The most abundant class of TEs in plants is the long terminal repeat (LTR) retrotransposons, which are retrovirus-like mobile genetic elements characterized by long terminal repeats. Most LTR retrotransposons are bounded by target site duplications (TSDs), show extensive CpG and CHG methylation, and are generally organized into large clusters, commonly nested inside other LTR retrotransposons [27, 28]. Over 1 million LTR retrotransposon fragments, corresponding to more than 75% of the total nuclear genome sequence, have been identified in the maize genome. The actual number of elements is hard to define due to the nested nature of retrotransposition in maize and the fragmentation of the data [29], although some estimates have ranged from 150,000 to 250,000 elements [27]. To date, 441 families of LTR retrotransposons have been annotated, most of them present at relatively low copy numbers (less than 10 copies) and constituting a small proportion of the genome. However, two families (Copia and Gypsy) are highly abundant and constitute about 80% of the total retroelements.
Retrotransposons do not follow a random distribution, with Copia-like elements usually present in gene-rich regions and Gypsy-like elements overrepresented in pericentromeric and other heterochromatic regions [10]. Finally, LINEs (Long Interspersed Nuclear Elements) and their short derivatives, SINEs (Short Interspersed Nuclear Elements), are less-defined, non-LTR retroelements. LINEs are commonly identified by the presence of TSDs (typically generated during insertion) and one end terminated with a homopolymer, usually poly(T), while SINEs have an internal RNA polymerase III promoter sequence and one homopolymer end, and their transposition is dependent upon the presence of the autonomous LINE [29]. LINEs and SINEs are relatively infrequent in maize and constitute approximately 1% of the genome.
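
The contrast drawn above between copy-and-paste (Class I) and cut-and-paste (most Class II) transposition can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: the genome is a list of labels, the function names are hypothetical, and rolling-circle (Helitron-like) transposition is not modeled.

```python
from typing import List


def copy_and_paste(genome: List[str], source: int, target: int) -> List[str]:
    """Class I style: the element at `source` stays in place and a new
    copy is inserted before position `target` (net gain of one element)."""
    element = genome[source]
    return genome[:target] + [element] + genome[target:]


def cut_and_paste(genome: List[str], source: int, target: int) -> List[str]:
    """Class II style: the element is excised from `source` and reinserted
    before position `target`, counted after excision (no net gain)."""
    element = genome[source]
    remainder = genome[:source] + genome[source + 1:]
    return remainder[:target] + [element] + remainder[target:]


if __name__ == "__main__":
    genome = ["geneA", "TE", "geneB", "geneC"]
    retro = copy_and_paste(genome, source=1, target=3)
    dna = cut_and_paste(genome, source=1, target=2)
    print("copy-and-paste:", retro, "TE copies =", retro.count("TE"))
    print("cut-and-paste: ", dna, "TE copies =", dna.count("TE"))
```

In this toy model, only the copy-and-paste function increases the element count, which is the property that lets retrotransposons inflate genome size.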

Class II elements, DNA transposons, constitute a smaller proportion of the genome in maize than retrotransposons, about 8.6%. The first descriptions of mobile elements were Class II maize transposons, discovered by Barbara McClintock during her study of chromosomal duplications, inversions, and translocations produced by the Ac/Ds elements [6, 30]. Since then, additional families of DNA transposons have been identified and classified, including Tc1-Mariner, hAT, Mutator, PIF_Harbinger, and their nonautonomous derivatives, which are termed MITEs (Miniature Inverted-repeat Transposable Elements) [25, 31]. These Class II elements are usually bounded by terminal inverted repeats (TIRs) and are flanked by short TSDs. Autonomous elements encode transposase and/or additional genes that are necessary for their transposition, while nonautonomous elements tend to be short, have nonconserved internal sequences, and/or carry captured DNA elements [25, 32]. Like retrotransposons, Class II elements are not randomly distributed in the genome, and most families, with the exception of elements from the CACTA family, have an insertional preference for genic regions [10]. Methylation seems to have a role in the regulation and silencing of Class I and II transposable elements, and activation is correlated with demethylation of TIRs [33–35].

High-copy tandem repeats are present in different parts of the genome including centromeres, telomeres, knobs, and rDNA. The centromeric regions include a combination of repeats and retrotransposons in or near sequences that participate in the formation of the kinetochore and in the attachment of microtubules on chromosomes during mitosis and meiosis. Maize centromeres consist of thousands of copies of a 156-bp unit called CentC [23] (reviewed in [36]). Centromeres evolve very quickly and centromere repeats have little or no homology between species. However, the same repeats are found in all maize chromosomes [37]. Due to difficulties in sequencing large regions with tandem repeats, the number of copies in most centromeres has not been determined. The only centromeres that have been fully assembled are those on chromosomes 2 and 5, which are thought to be the shortest [38]. The maize ZmB73v1 reference genome assembly contains an estimated 54% of the genome’s total CentC content [38]. Analysis of stretched DNA fibers suggests that the total length of CentC arrays varies from less than 100 kb to several megabases [37]. Four types of retrotransposons, CRM1, CRM2, CRM3/CentA, and CRM4, have been described as centromeric and are interspersed among CentC tandem repeat sequences [10]. One additional repeat sequence, Cent4, is at or near the primary constriction of chromosome 4 [39]. Heterochromatic knobs, cytological features that can be observed as dark round structures, consist of megabase-sized tandem arrays derived from one of two repeats (180 and 350 bp long) that comprise 0.6% to 6% of the total genome. They can be found in more than 20 specific locations in pachytene chromosomes [40], and their structural differences among maize varieties suggested significant intraspecific diversity of the maize genome, as we will discuss later. Knobs can also have retroelements inserted [22, 41–43]. The rDNA regions consist of thousands of tandem repeats encoding rRNA.
It has been estimated that the maize genome has between 1,600 and 23,000 9-kb tandem copies of genes encoding the 45S RNA precursor for the 18S, 5.8S, and 28S ribosomal RNA on the short arm of chromosome 6. This arrangement constitutes the nucleolus organizer region (NOR), another early observable cytogenetic feature in the maize genome [44]. Each of these ribosomal genes in the repeat is separated by a non-transcribed spacer. Genes for 5S ribosomal RNA are arranged as 342-bp tandem repeats in an additional, smaller cluster on the long arm of chromosome 2 [45, 46]. Telomeres, first named by Muller in 1938 but defined by McClintock in maize several years earlier [5], include tandem-repeated telomeric and subtelomeric sequences that protect against the frequent rearrangements that naturally occur at the ends of DNA molecules [47]. Finally, the maize genome includes thousands of simple sequence satellites and a few megatracts of trinucleotide repeats, namely, AGT and AGC [48].

Nontransposon-related genes, pseudogenes, and miRNAs constitute the rest of the maize genome, approximately 5% of the total [49]. While it is difficult to estimate the total number of genes accurately due to the incomplete nature of the current B73 physical assembly, it has recently been estimated at approximately 32,000 genes, classified into 11,892 families, plus a total of 150 loci encoding miRNA [10]. However, syntenic arrangements of genes are not necessarily conserved across individuals within the Zea genus, as we will discuss below.

3. Intraspecific Diversity of Maize

Early cytogenetic studies showed considerable line-specific differences in heterochromatin, or C, banding, and heterochromatic knob distribution. Supernumerary chromosomes, or B chromosomes, were also found in some maize and teosintes [50–54]. These cytogenetic differences have been positively correlated with differences in DNA content [55–57]. Using Southern hybridization, Rivin et al. [41] found that copy numbers of tandem-repeated sequences such as ribosomal DNA and knob repeats varied among North American genotypes by as much as two- to threefold. Intraspecific variations of as much as 38.8% from the average of 5.5 pg/2n nucleus have been reported in Zea mays [55, 56, 58–60]. More recently, sequencing data have demonstrated that the maize genome exhibits variable levels of naturally occurring genetic diversity depending on the lines involved in the comparison [49, 61]. On average, the frequency of single nucleotide polymorphism between two maize inbreds is approximately 1 substitution per 100 bases [62, 63]. This level of intraspecies polymorphism is striking when compared to mammals: it is 10 times higher than that observed between humans and also higher than that observed between humans and chimpanzees [64].
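
To relate the DNA content figures above to genome sizes expressed in base pairs, a short conversion can help. The sketch below assumes the widely used conversion factor of 1 pg of DNA ≈ 978 Mb, which is not stated in this review, and reads "5.5 pg/2n" as the content of an unreplicated diploid nucleus. The function name is hypothetical.

```python
# Assumption (not from this review): 1 pg of double-stranded DNA ~ 978 Mb.
MB_PER_PG = 978.0


def diploid_pg_to_haploid_mb(pg_per_2n: float) -> float:
    """Convert DNA content of a diploid (2n) nucleus, in picograms,
    to haploid genome size in megabases."""
    return pg_per_2n / 2.0 * MB_PER_PG


if __name__ == "__main__":
    mean_pg = 5.5            # average content per 2n nucleus, from the text
    max_deviation = 0.388    # largest reported deviation from the average

    mean_mb = diploid_pg_to_haploid_mb(mean_pg)
    delta_mb = diploid_pg_to_haploid_mb(mean_pg * max_deviation)

    print(f"mean haploid genome size: {mean_mb:.1f} Mb")
    print(f"a 38.8% deviation corresponds to about {delta_mb:.1f} Mb")
```

Under these assumptions the average corresponds to a haploid genome near the upper end of the 2.3 to 2.7 Gb range quoted earlier, and the largest reported deviation is on the order of a gigabase.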

Maize seems to tolerate large increases in DNA content per nucleus without noticeable effect on plant phenotype. In maize, the most significant recent contributions to genome size have been by LTR retrotransposons, and their number and distribution have been shown to vary considerably in different haplotypes [65, 66]. Copy number variation has been found in tandem repeats at centromeres, knobs, and rDNA loci [43, 67, 68].

Major focus has been given recently to regions with copy number variation (CNV) and presence-absence variation (PAV). With the improvement of genome-wide hybridization technologies and increasing information on the sequences of multiple maize lines from next-generation technologies, it is becoming clear that CNVs and PAVs have a major role in the diversity of the maize genome, and potentially in its heterosis. Springer et al. [69] analyzed the structural variation present between the genomes of the inbred lines B73 and Mo17 using comparative genomic hybridization (CGH). This study showed megabase-size B73 regions that were absent in Mo17. By using PCR analysis in 22 additional lines, they were able to identify a 2 Mb region on chromosome 6 that was present or absent from the lines as a single haplotype block. Beló et al. [70] used an expression array to perform CGH analysis on 13 North American maize inbreds against the reference inbred B73 and found a total of 2,109 potential CNVs; the authors screened a subset of 15 CNV loci via PCR and were able to confirm that 12 loci (80%) were true insertion/deletion events. Two of the CNV regions were shown to be at least hundreds of kilobases long, with the remaining validated CNVs being shorter than 10 kb. Swanson-Wagner et al. [71] used array-based CGH to compare content and copy number variation of 32,500 genes among 19 diverse maize inbred lines and 14 teosinte accessions, relative to the B73 reference genome. They found variation in about 10% of the targets, with 479 genes showing higher copy number than in B73 and 3,410 genes showing lower copy number than in B73 or missing altogether. Most of the genes with lower copy number were single copy in B73 and were therefore considered PAVs. A number of genes were higher in some lines and lower or absent in others. Interestingly, they discovered that the majority of these polymorphisms predated the origin of maize from teosinte.
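
The array-based comparisons described above rest on comparing hybridization signal in a test line against the B73 reference for each probe. The following is a minimal sketch of that kind of classification, not the procedure used in the cited studies: the ±0.5 log2 thresholds, the function names, and the rule for flagging PAV candidates are illustrative assumptions.

```python
import math


def log2_ratio(test_signal: float, ref_signal: float) -> float:
    """log2 of test-line signal over B73 reference signal for one probe."""
    return math.log2(test_signal / ref_signal)


def classify_probe(ratio: float, gain: float = 0.5, loss: float = -0.5) -> str:
    """Label a probe relative to B73 using hypothetical thresholds."""
    if ratio >= gain:
        return "higher_than_B73"
    if ratio <= loss:
        return "lower_or_absent"
    return "no_change"


def is_pav_candidate(call: str, single_copy_in_b73: bool) -> bool:
    """A gene that is single copy in B73 and lower/absent in the test line
    cannot simply have fewer copies, so it is treated as a PAV candidate."""
    return call == "lower_or_absent" and single_copy_in_b73


if __name__ == "__main__":
    # Made-up signal values for three hypothetical genes: (test, reference).
    probes = {"gene1": (200.0, 100.0), "gene2": (20.0, 100.0), "gene3": (105.0, 100.0)}
    for name, (test, ref) in probes.items():
        call = classify_probe(log2_ratio(test, ref))
        print(name, call, "PAV candidate:", is_pav_candidate(call, single_copy_in_b73=True))
```

Because every call is made relative to B73 probes, sequences absent from B73 are invisible to this approach, which is the reference bias discussed later in the review.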

4. Mechanisms of Maize Genome Evolution and Diversity

Several major mechanisms affect the evolution of the maize genome and the generation of intraspecific genome diversity: (1) whole genome duplications (polyploidization) and segmental duplications, (2) DNA transposition and retrotransposition, (3) capture and translocation of genes or gene segments by transposons, (4) recombination and gene conversion events, and (5) single base mutations and expansion/contraction of simple sequence repeats (SSRs). These mechanisms are described below and add to the genome diversity generated by gene flow between maize populations and introgression between maize and related species (teosinte) [49, 72].

4.1. Duplication and Polyploidization

As in other cereals and plants in general, polyploidization has played an important role in the evolution of the maize genome [73, 74]. Evidence for both segmental duplications and whole genome duplication by wide crosses was initially found in linkage and comparative genetic analysis, which showed extensive chromosome duplications in maize [75]. More recently, comparative sequence information has supported the idea that the maize genome has undergone at least two polyploidization events. In the first event, approximately 70 to 80 Myr ago, a common ancestor to cereals underwent whole genome duplication, followed by gene loss. More than 68% of the duplicated genes from this event, which are currently collinear between rice and sorghum, retain only one copy. However, 99% of these genes are orthologous between the two species, suggesting that early gene loss predated the divergence among the cereals [76]. Genes have been preferentially removed from one of the homologs, a process called biased fractionation [77]. The second polyploidization event in maize occurred 5 to 12 Myr ago, after the divergence from the last common ancestor with sorghum. Two progenitors of maize hybridized at some point between 4.8 and 11.9 Myr ago [78–80], giving rise to a tetraploid, followed by large-scale loss and movement of duplicated genes (up to 50%) and chromosomal rearrangements that eventually returned the genome to diploid behavior [81, 82]. The number of maize B73 high-confidence protein-coding genes predicted under high stringency is 32,540, higher than similar estimates for Brachypodium (25,532), rice (29,717), or sorghum (27,640) [10, 17, 18, 83]. This number is likely to be an underestimate due to the missing genic content in the current physical assembly.
While there are stable tetraploid maize varieties reported [68], the most important effect of polyploidization in modern maize is the redundancy that the early polyploidization generated, with the subsequent relaxation of selective constraints.

4.2. Retrotransposition and the Expansion of the Maize Genome

While maize and sorghum, close relatives within the Andropogoneae tribe, share the same number of chromosomes, the maize genome is approximately 3 times the size of the sorghum genome (800 Mbp). The secondary polyploidization described above accounts for only part of this difference. The overall size of the maize genome and its intergenic distances have expanded dramatically due to LTR retrotransposition within the last 10 Myr. In grasses, the proportion of LTR retrotransposons is correlated with genome size, while the proportion of Class II transposons remains constant (see Table 1). The small genomes of Brachypodium and rice have a retrotransposon content of 23.3% and 25.8%, respectively, compared to 54.5% in sorghum and 75.9% in maize [84].

The high abundance and nonrandom distribution of LTR retroelements in maize was one of the early observations made as sequence information started accumulating [85–88]. As these elements have long terminal repeats that are identical at the time of transposition, the analysis of mutations accumulated between the two LTRs of an element allowed dating of its insertion [89]. Initial studies of nested retroelements found within the adh region indicated that a massive retrotransposition event had occurred in the last 3 Myr [65, 66, 89]. Liu et al. [90] investigated the insertion dynamics of LTR retrotransposons in gene-free and gene-containing BACs and identified two peaks of amplification in gene-free areas, the first around 1.5–2 Mya and a more recent one within the last 500,000 years. They found only one peak of amplification in gene-containing regions, within the last 1 Myr. LTR retrotransposition proceeds via an RNA intermediate and leaves the original element in place, which helps explain how this selfish DNA has colonized and expanded the maize genome.
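
The dating logic described above can be sketched in a few lines. Because the two LTRs of an element are identical at insertion and then diverge independently, a common estimate of the insertion time is T = K / (2r), where K is the divergence between the two LTRs and r is the substitution rate per site per year. The sketch below is only an illustration: the Jukes-Cantor correction is one of several possible distance models, the default rate is an assumed placeholder rather than a value given in this review, and the function names are hypothetical.

```python
import math


def p_distance(ltr5: str, ltr3: str) -> float:
    """Proportion of differing sites between two pre-aligned LTRs,
    skipping alignment columns that contain a gap ('-')."""
    if len(ltr5) != len(ltr3):
        raise ValueError("LTR sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance K for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("p-distance too large for Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_age_years(k: float, rate: float = 1.3e-8) -> float:
    """T = K / (2r). The default rate (substitutions per site per year)
    is an assumed placeholder, not a value taken from this review."""
    return k / (2.0 * rate)


if __name__ == "__main__":
    # Toy aligned LTR pair differing at 1 of 20 sites.
    five_prime = "TGTTGGGAATTACCCAACAA"
    three_prime = "TGTTGGGAATTACCCAACAT"
    k = jukes_cantor(p_distance(five_prime, three_prime))
    print(f"K = {k:.4f}, estimated age = {insertion_age_years(k):.0f} years")
```

The estimate scales inversely with the assumed rate, so ages reported by different studies are comparable only when the same rate is used.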

4.3. Mechanisms for Genome Decrease

Both unequal and illegitimate recombinations are important mechanisms that may counteract the expanding effects of LTR retrotransposition [91]. Unequal homologous recombination within a chromosome (i.e., intrastrand), that is associated with larger (>50 bp) direct repeats (in this case between adjacent LTRs), is proposed to generate a “solo” LTR and leads to the net deletion of the internal sequence plus one LTR sequence [92, 93]. The effects of unequal crossover between homologous LTR sequences at distinct chromosomal locations can have more striking results including a reciprocal deletion and duplication event, inversions and reciprocal translocations [94, 95]. By comparison, illegitimate recombination can occur between smaller lengths of homology than unequal homologous recombination and is proposed to be responsible for the creation of numerous internal deletions and truncated LTR retrotransposons [96]. This form of recombination is presumed to occur via non-homologous end joining or slip-strand mispairing which, in turn, leads to DNA loss [96, 97]. All three of these mechanisms are proposed to counteract the genome expansion in plants which is primarily driven by either increases in ploidy or amplification of repetitive DNA [94].
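
The formation of a solo LTR by unequal recombination between the two LTRs of one element, as described above, can be pictured with a toy string model. This is only a sketch under strong simplifications: the LTRs are short identical strings written in upper case, the internal region and flanking DNA are in lower case, and the function names are hypothetical.

```python
def make_element(ltr: str, internal: str) -> str:
    """Build an intact element: LTR, internal region, LTR."""
    return ltr + internal + ltr


def unequal_recombination(sequence: str, ltr: str) -> str:
    """Recombine the first two direct copies of `ltr` found in `sequence`,
    leaving a single ("solo") LTR and deleting the sequence between them
    plus one LTR copy. Returns the input unchanged if fewer than two
    copies are present."""
    first = sequence.find(ltr)
    if first == -1:
        return sequence
    second = sequence.find(ltr, first + len(ltr))
    if second == -1:
        return sequence
    return sequence[:first + len(ltr)] + sequence[second + len(ltr):]


if __name__ == "__main__":
    ltr, internal = "TGCA", "aaaaaaaa"
    locus = "gg" + make_element(ltr, internal) + "cc"
    collapsed = unequal_recombination(locus, ltr)
    print(locus, "->", collapsed)
    print("bases lost:", len(locus) - len(collapsed))
```

The number of bases lost equals the length of the internal region plus one LTR, matching the net deletion described in the text.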

4.4. Transposons and Genetic Colinearity

Intraspecific genome variation has long been attributed to changes in size of heterochromatic DNA outside coding sequences that contracted or expanded the chromosomes [98]. However, violation of gene microcolinearity has been found in multiple locations since it was first reported by Fu and Dooner [99]. These authors sequenced 230-kb and 110-kb BAC contigs flanking the bz locus in the North American inbred lines B73 and McC, respectively, and found extensive differences in content and position of intergenic retrotransposons. More remarkably, out of 10 genes clustered in the McC sequence, 4 were absent in B73. Further sequence analysis of the bz locus in multiple lines showed considerable variation in other maize lines, with only 25% to 84% of sequences shared [100]. Similar polymorphisms for the presence/absence of genic sequences have been found in different chromosome locations [101, 102].

Helitrons have been associated with intraspecific violation of genetic colinearity in maize. The role of Helitrons in genome variation in maize was first reported by Lai et al. [103], who used comparative bioinformatics analysis of the bz region in McC and B73 to reveal the presence/absence of two Helitrons, HelA and HelB, which account for all of the allelic variation at this locus. Unlike other Class II TEs, Helitron elements are not flanked by terminal inverted repeats and do not generate TSDs. They have an 18–25 bp sequence able to form a hairpin near the 3′ end and preferentially insert in AT dinucleotides [104, 105]. A more extensive study of the inbred lines B73 and Mo17 [106] suggested that a large proportion of differential insertions between the two genomes could be attributed to Helitron sequences. While most reported Helitron genes seem to be truncated versions of their progenitor genes [107], the maize CYP72A27-Zm gene represents a full cytochrome P450 monooxygenase (P450) gene recently captured by a Helitron and transposed into an Opie-2 retrotransposon [108]. Complete Helitron elements are widespread in the genome. One study identified 1,930 intact Helitrons belonging to 8 families and more than 20,000 Helitron fragments [109, 110]. Another study identified 2,791 nonautonomous Helitron elements [111]. The majority of the elements identified thus far represent nonautonomous Helitrons containing chimeric segments derived from multiple genes, although analysis of the complete sequence of a single maize inbred line provides only a biased view of the extent and diversity of gene capture, transposition, and amplification by Helitrons [112].

In addition to Helitrons, the Mutator superfamily possesses non-autonomous elements, called Pack-MULEs, that have the ability to capture segments of nuclear genes, which can be arranged in chimeras [113, 114]. Further, molecular evidence has revealed that these novel chimeras can be both transcribed and translated, which suggests that this mechanism of gene fragment capture inside non-autonomous elements can produce, and evolve into, novel protein-coding sequences [113, 114]. Much like the Helitron elements, Pack-MULE transposition and amplification can lead to deviations in intraspecific synteny, and recent research has shown that Pack-MULEs preferentially capture GC-rich genomic segments and display biased insertion into the 5′ end of coding regions [115]. Pack-MULE elements possess sequences that are associated with small RNAs and can influence the expression profile of the captured genic sequences, and, given the insertional bias towards the beginning of the transcribed region, these elements can have significant effects on the expression of the endogenous genes into which they are inserted [115].

5. Future Prospects

The discovery of the colinearity of the maize and other grass genomes, dating back to 50–80 million years ago, was a major breakthrough in comparative genomics within the Poaceae and helped in the identification of genes, gene families, duplication events, and characterization of the structural variation of the genomes [24, 116, 117]. The accumulation of evidence pointing to high structural polymorphism and the significant sequence diversity among maize inbreds, however, highlights a serious limitation in the use of a single reference genome as a sufficient representative of a species. Thus, in order to capture the range of sequence diversity (i.e., single nucleotide polymorphisms and small sized indels) as well as larger structural variations (i.e., CNVs, PAVs, and large indels), deep resequencing and assembly of the gene space to create a “pan-genome” is necessary among a range of diverse inbreds [118]. While the B73 reference assembly in its current state has proven invaluable for annotation and basic research into the organization and diversity of gene (and repeat) content, there are estimates of upward of 10% of the genic content missing from the current version [10, 29]. The complexity of the maize genome combined with available resequencing strategies (either Sanger or next-generation sequencing) prevents the creation of a completed physical assembly similar to that in Arabidopsis and rice whose genome sizes are roughly an order of magnitude smaller. With the advent of next-generation sequencing platforms, the ability to rapidly generate the full content of sequences in an inbred with high coverage is now feasible, but de novo assembly of these short reads will not create any additional finished assemblies of comparable quality to B73 within maize. 
Comparative genomic hybridization (CGH) offers a rapid and inexpensive way to look at structural variation, and alignment of whole genome shotgun (WGS) reads to the B73 reference assembly will produce an equivalent “digital” CGH for a reasonable cost. However, both of these strategies will have a strong B73-centric bias for the foreseeable future. Thus, until sequencing technology significantly increases read lengths to promote robust genome assembly from WGS projects, a large amount of diverse sequences from the Zea genus will remain as small, anonymous assemblies which will have to be positioned by laborious or inexact methods including, among others, BAC library screening, syntenic comparisons to sorghum, oat-maize addition line screening, and genetic mapping via genotyping-by-sequencing (GBS) on segregating populations. However, these small, genic assemblies from deep resequencing projects can be annotated structurally and functionally in a manner similar to the EST and FL-cDNA projects that were initiated prior to significant reference assembly creation in the past two decades. Once large contigs of physical sequence are created, these small genic assemblies can be easily integrated. Therefore, genic diversity in maize can be collected and analyzed in detail for individual inbreds using deep resequencing and assembly, and, once upcoming sequencing platforms allow rapid de novo assembly, these assemblies can be easily organized on the physical map.