Abstract

Legume crops are particularly important due to their ability to support symbiotic nitrogen fixation, a key to sustainable crop production and reduced carbon emissions. Soybean (Glycine max) has a special position as a major source of increased protein and oil production in the common grass-legume rotation. The cultivar “Forrest” has saved US growers billions of dollars in crop losses due to resistances programmed into the genome. Moreover, since Forrest grows well in the north-south transition zone, breeders have used this cultivar as a bridge between the southern and northern US gene pools. Investment in Forrest genomics resulted in the development of the following research tools: (i) a genetic map, (ii) three RIL populations ( ), (iii) 200 NILs, (iv) 115 220 BACs and BIBACs, (v) a physical map, (vi) 4 different minimum tiling path (MTP) sets, (vii) 25 123 BAC end sequences (BESs) that encompass 18.5 Mbp spaced out from the MTPs, and 2 000 microsatellite markers within them, (viii) a map of 2408 regions each found at a single position in the genome and 2104 regions found in 2 or 4 similar copies at different genomic locations (each of about 150 kbp), (ix) a map of homoeologous regions among both sets of regions, (x) a set of transcript abundance measurements that address biotic stress resistance, (xi) methods for transformation, (xii) methods for RNAi, (xiii) a TILLING resource for directed mutant isolation, and (xiv) analyses of conserved synteny with other sequenced genomes. The SoyGD portal provides access to the data. To date, these resources have assisted in the genomic analysis of soybean nodulation and disease resistance. This review summarizes the resources and their uses.

1. Introduction

The soybean cultivar “Forrest,” a product of a USDA breeding program, represents a determinate, Southern germplasm [1]. It was the first cultivar to combine soybean cyst nematode (SCN) resistance with high yield, and is believed to have played a key role in saving billions of US dollars during the 1970s and 1980s that would otherwise have been lost, either to SCN or to the poor agronomic performance of earlier SCN-resistant cultivars (see [2] and references therein). Forrest was an important parent of modern cultivars, including “Hartwig,” “Ina,” and many others that have an improved SCN resistance gene from PI437654 introgressed into their genome [3–5]. Forrest was also central to an understanding of the genetics of resistance to sudden death syndrome, an important new disease of soybean [6–9].

Forrest is also one of the two cultivars (the other being “Williams 82”) that provide the majority of the genomic tools for soybean available in the USA (Figure 1) [10, 11]. These two cultivars serve as models for soybean genomics research in the same way as Col and Ler do in Arabidopsis thaliana or Mo17 and B73 do in Zea mays. However, since the genomics of “Williams 82” was recently reviewed [11], its inclusion in this article would be repetitive. The other cultivars, which represent the worldwide germplasm variation for soybean genomics, include the following: (i) “Noir 1,” a Korean plant introduction (PI) [12], (ii) “Misuzudaizu,” a Japanese cultivar [13], and (iii) “Suinong14,” a Chinese cultivar [14]. The soybean community is committed to advancing the genomics of all these cultivars, which have been used in the past as resources for genomics research. However, the intent of this review is to present an overview of the genomic resources derived from Forrest. These resources enable a wide range of analyses that address several fundamental questions, such as the following: (i) what is the source of genetic variation in soybean improvement? [15]; (ii) what is the role of variation in regions of genome duplication in paleopolyploid species? [16]; (iii) how does the nodulation of legumes work? [17]; (iv) why are protein and oil contents of seed inversely related? [18, 19]; (v) why are seed yield and disease resistance so hard to combine? [4, 5, 15, 20]; (vi) why is seed isoflavone content limited below 6 mg/kg? [18, 21–24]; (vii) how does partial resistance to disease work? [6–9, 18]. It is believed that the development and use of genomics tools derived from Forrest will help soybean researchers to answer these questions.

2. Genetic Variation between Forrest and Other Cultivars

An important question that has received the attention of soybean researchers is how much sequence variation to expect between Forrest and other cultivars, if many are to be sequenced. This variation is extensive (about 1 bp difference per 100–300 bp) when judged by criteria such as the following: (i) the coefficient of parentage [25], (ii) the number of shared RFLP bands [26], (iii) polymorphism among microsatellite markers [27], and (iv) DNA sequence comparisons (Figure 2). In soybean, the degree of linkage disequilibrium (LD) among loci is high, extending over distances that range from 50 kbp to 150 kbp [28]. Few meioses have occurred within these regions to reshuffle the gene or DNA sequences, because soybean is largely an inbreeding crop. Only seven or eight crosses were made between the time when the PIs were collected and the development of most modern US cultivars (Figure 3). Therefore, in different parts of the genome, LD encompasses large segments and sets of genes.

2.1. The Essex × Forrest Population

A soybean recombinant inbred line (RIL) mapping population (Reg. no. MP-2, NSL 431663 MAP) involving Forrest was recently developed from the cross “Essex MAP” (PI 636326 MAP) × “Forrest MAP” (PI 636325 MAP) [10]. This RIL population was used for constructing a genetic map [9, 24, 30] that has been used extensively for the analysis of marker-trait associations [7–9, 24, 30–38]. The genetic marker data encompass thousands of polymorphic markers and tens of thousands of sequence-tagged sites (STSs) that were collected at SIUC by Dr. Lightfoot’s group (Table 1) [10]. The genetic maps of E × F94 will continue to be enriched [27, 39]. The registration of this population [10] has allowed worldwide public access to the population and the data generated from it.

A key feature of the above mapping population is that Essex (registered in 1973 [10]) was derived from the same southern US germplasm pool to which Forrest (registered in 1972 [1]) belongs. Consequently, the RILs share identity across about 25% of their genomes, the portion that was monomorphic between the two parents (Figure 3) [25, 26]. Further, the two cultivars were selected under similar conditions and, therefore, appear rather similar in most environments [6–10, 15–20, 30–38]. However, detailed records of maturity dates are important, since even a single day of variation in maturity may influence the results of QTL analysis for many other traits [10, 41]. Since morphological and developmental traits differ very little in the population, the RILs have been used extensively to map genes that control biochemical and physiological traits (Table 2). For example, the parents of the mapping population differ in resistance traits, which exhibit both qualitative and quantitative inheritance (Table 3).

A major limitation in using the E × F population in genomics research is the small population size (n = 100), which could preclude fine mapping [10]. To overcome this problem, populations of near-isogenic lines (NILs; n = 40; Figure 3) were developed from each RIL [10, 37, 38, 43]. The NIL populations are listed in Table 1. The residual heterozygosity present in the F5 seed was largely fixed and captured in these NILs. The heterogeneity across the RILs has been measured to be 8%, which is more than the 6.25% expected among F5 lines [7, 24]. That increased heterogeneity appears to be caused by selection, since rare heterozygous plants still exist in some RILs and NILs [37, 38, 40]. Each locus that segregates in the RIL population is expected to segregate in about eight NIL populations. Therefore, each region in the genome will be segregating in about 420 lines (100 + 8 × 40), quite sufficient to create fine maps of 0.25 cM resolution (Table 4). A 0.25 cM interval represents 25–100 kbp on the physical map [16], sufficient for candidate gene identification [37, 38].
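The arithmetic behind these figures can be checked with a short sketch. It assumes strict selfing from a fully heterozygous F1 (so the heterozygous fraction halves each generation) and treats the population sizes quoted above as exact; the function names are illustrative and are not taken from any published pipeline.

```python
def expected_heterozygosity(generation: int) -> float:
    """Expected heterozygous fraction in an F<generation> plant under strict selfing.

    Assumption: the F1 is fully heterozygous at a segregating locus and each
    round of selfing halves the heterozygous fraction.
    """
    if generation < 1:
        raise ValueError("generation must be 1 (F1) or later")
    return 0.5 ** (generation - 1)


def segregating_lines(n_rils: int, nil_populations: int, nils_per_population: int) -> int:
    """Lines informative at one locus: all RILs plus the NILs that segregate there."""
    return n_rils + nil_populations * nils_per_population


def kbp_per_cm(interval_kbp: float, interval_cm: float) -> float:
    """Physical distance per unit of genetic distance."""
    return interval_kbp / interval_cm


if __name__ == "__main__":
    print(f"Expected F5 heterozygosity: {expected_heterozygosity(5):.2%}")
    print(f"Lines segregating per locus: {segregating_lines(100, 8, 40)}")
    print(f"Implied kbp per cM: {kbp_per_cm(25, 0.25):.0f} to {kbp_per_cm(100, 0.25):.0f}")
```

Under these assumptions the 0.25 cM interval of 25–100 kbp corresponds to roughly 100–400 kbp per cM in the fine-mapped regions.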

Following the development of the NILs, the E × F population was used to study the genetics of a large number of quantitative traits (QTs), leading to the identification of quantitative trait loci (QTL; Table 2) underlying more than seventy different traits [24, 39, 40, 42, 44–46]. Biochemical and physiological traits included resistance to soybean sudden death syndrome (SDS) [caused by Fusarium virguliforme] in the US and Argentina, resistance to soybean cyst nematode (SCN; Heterodera glycines Ichinohe), seed yield, seed quality traits, agronomic traits, water use efficiency, manganese toxicity, aluminum toxicity, partial resistance to Phytophthora sojae, and insect herbivory. However, new opportunities abound because dozens of traits for resistance to pests and pathogens segregate in the population but have not yet been mapped [10]. Further, the concentrations of many secondary metabolites vary widely among lines, during development, and among different organs [47]. Pesticide uptake, metabolism, and degradation rates also vary among lines (unpublished). Preliminary studies have shown the link between the genome, proteome, and metabolome (the interactome), which can be further explored in these segregating populations [48]. Therefore, E × F will eventually be used to map thousands of QTL for hundreds of QTs.

Importantly, the NILs that have been developed from each RIL for fine mapping also allow confirmation of QTL detected in the RIL population. For instance, cqSDS001 was assigned to a QTL confirmed by NILs derived from Ripley [49], but earlier detected through RILs derived from Flyer [50] and “Pyramid” [6, 33]. The QTL have also been renamed under the new rules for QTL adopted by the Soybean Genetic Committee in 2006 [51], as a result of which cqRfs1, cqRfs2, and cqRfs4 were renamed as cqSDS003, cqSDS002, and cqSDS004, respectively.

The molecular linkage map, the RILs, and the NILs were used during the positional cloning of nts1, GmNARK [50], Rpg1 [17, 35], Rhg1 [38], Rhg4 [52], and Rfs2 [37]. Many opportunities for further gene isolations exist. Tables 2 and 3 list some of the known phenotypes that differ between the parents and segregate among the lines and that are candidates for gene isolation. The RIL and NIL populations provide sets of recombination events that can be used to identify the positions of genes underlying QTs [10]. Since all the lines self-fertilize, the populations can provide an immortal resource, provided that the seed is regenerated every five years to maintain its ability to germinate. This type of resource is particularly important for soybean because the draft genome sequence will be released in April 2008 (unpublished). Combining knowledge of locus positions with a comprehensive knowledge of gene content will lead to the rapid isolation of many new and economically important genes [16].

Selected lines from the E × F population that contrast for mapped QTL were also used for a variety of studies, including the following: (i) to validate assays of pathogenicity [32, 53–55], (ii) to examine the effects of resistance genes on gene expression [34, 56, 57], (iii) to analyze components of drought tolerance [24, 31, 36, 42, 46, 58], (iv) to validate methods of marker-assisted selection [6, 31, 59–62], and (v) to provide germplasm releases (Figure 4) and cultivars [6, 63]. New cultivars and new methods for selection of improved soybean genotypes are among the most important spin-offs from the genomics research involving Forrest soybean. Among the selected lines, E × F78 later became LS-G96 [63] and then “Gateway 512” (Gateway Seeds, Nashville, Ill, USA). This line, together with the line E × F55, was used as a parent to combine moderate resistance to SDS (carrying resistance alleles at six loci) with high yield. The RIL E × F23 was released as SD-X for very high resistance to SDS [34] and good yield potential under license from Access Plant Technologies (Plymouth, Ind, USA), because it contained beneficial alleles at all eight known resistance loci. In contrast, E × F85 is susceptible to SDS, as it contains no beneficial alleles at the known resistance loci, which makes it a useful entry for sentinel plots. For animal feed and human food, E × F52 has been used as a parent to provide very high phytoestrogen contents to progeny (unpublished), since it contains beneficial alleles at all the known loci underlying phytoestrogen content. Low phytoestrogen contents are also required for estrogen-sensitive consumers; E × F89 and E × F92 were used as parents to provide low phytoestrogen content in the progeny (unpublished).

2.2. Related Populations Flyer by Hartwig (F × H) and Resnik by Hartwig (R × H)

The F × H and R × H populations are integrated with E × F96 [10], since Forrest was the recurrent parent used to develop Hartwig (Figure 3) [62] and Essex shares many alleles with Flyer and Resnik [15, 27]. Flyer and Resnik were sister lines derived from a cross between a Williams 82 sister line and a commercial cultivar [64]. F × H has 92 RILs and R × H has 952 RILs that have been used to confirm QTL detected in E × F96 and for fine mapping of these QTL [4, 5, 15, 50, 52]. Flyer and Resnik each contain many genes conferring resistance against P. sojae. Both populations can be used to map genes underlying additional biochemical, physiological, and some agronomic traits that include the following: (i) resistance against Phytophthora root rot, soybean sudden death syndrome (SDS) caused by F. virguliforme, and soybean cyst nematode (SCN; Heterodera glycines Ichinohe), (ii) seed yield [15, 50, 52], and (iii) seed quality traits. These RILs were also used to develop SSR markers that anchor contigs and sequence scaffolds (http://soybeangenome.siu.edu/) to the physical map [27].

3. Phenotypic Variation between Forrest and Other Cultivars

One major limitation of the resources based on Forrest was the low amount of genetic variation detected in the populations based on this cultivar [65]. The implication was that the alleles detected in E × F would not be weaker variants of the major gene effects found in weedy plant introductions (PIs). It was hypothesized that, instead, the loci detected in the E × F population and in the material derived from it represented other gene systems of lower hierarchical position and therefore lower value. Consideration of a few examples of the locations of QTL underlying phenotypic variation between Forrest and other cultivars has been informative regarding this issue. The results to date all imply that the alleles underlying QTs in Forrest are variants of the same genes as the PI alleles, although weaker in their effects on QTs.

3.1. The Genetics of Phytoestrogen Content

The phytoestrogen content of soybean seed mainly consists of daidzein (60%) and genistein (about 30%), with a small proportion of glycitein (about 10%) [66]. Analysis of germplasm and elite cultivars [18, 21–24, 67–69] indicated that phytoestrogen concentrations in some elite cultivars (about 2 mg/kg) were higher than those in many of the ancestors of cultivated soybean (about 1 mg/kg). Phytoestrogen content and profile varied with environment (year and location effects) and genotype. However, the final seed content was largely controlled by the genotype (40–60% of the variation) through a set of about 6–12 loci [18, 24, 67]. If the content of each phytoestrogen component is controlled independently, improvements in content by genetic selection should be possible. For instance, raising glycitein content to the same amount as that of daidzein could double the total phytoestrogen content. However, because the heritability of phytoestrogen content is only moderate, at about 40–60%, direct selection (without DNA markers) has not been very effective. Through marker-assisted selection (MAS), the phytoestrogen amounts were raised to 3.6 mg/kg, well above the amounts found in elite cultivars or weedy PIs. Here, the variation programmed by the alleles segregating in the E × F population was greater than that among the entire germplasm collection.

Recently, crosses have been made between lines from southern Illinois and Canada having the highest phytoestrogen contents [23] and, separately, between the lines having the lowest phytoestrogen content [67]. MAS exercised in the segregating populations (at F4 in 2007) should lead to improvement in phytoestrogen content. Opportunities for collaborative studies exist with sets of RILs in maturity groups that are not adapted to be grown in southern Illinois or Canada.

3.2. The Genetics of Seed Yield, Protein and Oil Content

The overall average increase of 1-2% per year in soybean yield witnessed during 1960–1999 was only half the yield advance achieved in corn and other outcrossing crops, where genetic diversity was not limiting [68]. As one would expect, there are hundreds of loci controlling yield in soybean [69]. Of the yield loci detected in the E × F population, half had been detected earlier in other crosses [24]. These loci could each boost seed yield by 0.2 Mg/ha. In contrast, substantial gains (0.9–1.1 Mg/ha) can be made in soybean yield by identifying unique alleles in weedy PIs and introgressing them into elite cultivars [70]. The nature of the genes altering seed yield will be an interesting product of fine map analysis and positional cloning.

The major components of soybean seed yield include the following: (i) protein (about 40%), (ii) oil (about 20%), (iii) structural carbohydrates (about 6%), (iv) water (about 13%), (v) soluble carbohydrates (about 14%), and (vi) other metabolites (about 7%) [71]. Metabolic changes during development, driven by gene expression, underlie the seed composition and yield [72]. Seed yield and composition are under polygenic control, with different genes active at different stages of seed development. Seed traits are also associated with significant genotype × environment (G × E) interactions, as observed in the E × F population (see [15, 18, 19]). Again, the G × E interactions significantly reduce the effectiveness of visual selection based on the phenotype alone.

At harvest, seed protein content is inversely related to seed oil content and seed yield in the E × F population [18, 19], as in other germplasm (see [68]). While some loci are implicated in all three traits, others influence only one or two of them. Several QTL underlying soybean yield, protein, and oil content have been mapped in both the E × F and the F × H RIL populations [5, 18]. They do correspond with loci detected in crosses between high-protein weedy types and low-protein adapted cultivars. Three QTL on linkage groups A1, A2, and E have been fine-mapped and localized within 0.25 cM using substitution mapping to identify the underlying genes. Isolation of these genes will partly explain the molecular basis of the genetic control of yield and its component traits. However, a danger here is that because different genes are active at different stages of seed development, one would generally map only a composite trait, based on a mean of the action of several loci. Isolation of genes by position would not be successful in this circumstance.

3.3. The Genetics of Phytophthora Root Rot Resistance

The annual soybean yield loss from the root and stem rot disease caused by the oomycete pathogen Phytophthora sojae is valued at about $273 million in the US [73]. Monogenic resistance due to a series of Rps genes has provided reasonable protection to the soybean crop against the pathogen over the last four decades [74]. Several mapped Rps genes are known to occur in Flyer and Resnik [50, 64]. Partial, rate-reducing resistance to many races of P. sojae is also found in Forrest, Essex, and Hartwig. The loci providing this partial resistance had not been mapped as of 2007.

3.4. The Genetics of SCN Resistance

Soybean cyst nematodes (Heterodera glycines I.) are the most damaging pests of soybean worldwide [73]. Development of resistant cultivars is the only viable control measure [75]. By 2007, resistance genes had been located on 17 of the 20 chromosomes. A combination of recessive genes is necessary to provide resistance against SCN populations because many are known to be capable of overcoming all known single resistance genes. SCN populations can be classified into 16 broad races or up to 1024 biotypes (HG Types) [76] based on the host responses of 8 weedy indicator lines. SCN resistance in many other adapted and weedy cultivars [9, 31] shared the same loci underlying bigenic inheritance in E × F [20]. The E × F population was used to isolate candidate genes for those two loci (rhg1 and Rhg4; Table 4) that control resistance against SCN race 3 (HG Type 0). Alleles of the candidate genes were identified in many PIs through association studies [38, 77]. Paralogs of both these genes were found at new locations in BAC libraries and whole genome shotgun (WGS) sequences [78, 79]. They appear to be part of multigene families showing homoeology and intragenomic conserved synteny.

Three cultivars, Peking, PI437654, and Hartwig, encoded 2–4 additional genes that provide additional resistances to SCN [52, 80, 81]. Peking has alleles for resistance to races 1 and 5 that were not transferred to Forrest [20]. Hartwig and PI437654 have complete resistance against all races of SCN except race 0, HG Type 1.2.3.4.5.6.7.8. The locations of SCN resistance loci in F × H and R × H agreed with those found in crosses between PIs and adapted germplasm [81, 82]. Therefore, the SCN resistance traits that are introgressed from PIs into Forrest-based germplasm are useful, and the underlying genes can be isolated from Forrest.

3.5. The Genetics of SDS Resistance

Soybean sudden death syndrome, caused by Fusarium virguliforme (formerly F. solani f. sp. glycines), is among the most damaging disease syndromes affecting soybean in the US and worldwide [73]. The syndrome is composed of a root rot disease and a leaf scorch disease [53]. Development of resistant cultivars is the only viable control measure. Twelve resistance loci have now been found on 8 chromosomes (Figure 4); eight segregate in E × F [24, 44] and two additional loci segregate in F × H [5, 50]. A combination of loci is needed to provide resistance to both root rot (2 or more loci) and leaf scorch (all loci). Loci for resistance to SDS were named Rfs to Rfs11 [39]. Using NILs (Table 4), a set of candidate genes for the Rfs2 locus was identified [37]. Among these genes, a receptor-like kinase [38] and a laccase [83] are being tested for their ability to provide resistance following transformation and mutation (unpublished). However, the presence of a pair of syntenic genes on linkage group O with similar DNA sequences (84%) and encoding nearly identical amino acid sequences (98%) complicates analysis by a reverse genetics approach.

One of the two loci underlying root rot resistance is encoded in the DNA sequence around marker OI03514 that lies between AFLP derived SCARs, CGG5, and CTA13 on linkage group G [37]. However, the root rot resistance locus (Table 4) lay in a region not well represented among BAC libraries [84, 85], so that the gene isolation was delayed until the local genome sequence could be assembled. Transcript analysis showed that the fungus attempts to prevent gene transcription in the target roots [34, 55, 56]. Resistant cultivars prevent the poisoning of transcription by inducing stress and defense genes that produce fungicidal metabolites within 2 days of contact with the fungus. However, the induced genes do not appear to map to the loci that control the SDS resistance response [57]. Instead, genes of a higher hierarchical position in the interactome were found in this interval (unpublished). One of these genes is expected to underlie root resistance to SDS.

For F. virguliforme, the fungus causing SDS, no races are known so far in the US [86]. When lines from E × F were used to look for variation in pathogenicity between strains, no convincing evidence for a host differential response was observed (unpublished). However, different Fusarium species that are capable of causing SDS are found in South America [86]. E × F has been planted in Argentina since 2004, and it was shown that the SDS pathogen(s) invoked responses that mapped to different resistance loci [39]. Therefore, the fungus does have the potential to form races that vary in their pathogenicity. Hence, soybean breeders should be cautious in using the available resistance genes and should realize that stacking all twelve genes for full resistance would not be wise, because it would select for mutants in the pathogen populations that could lead to the development of races.

In conclusion, a variety of approaches, including QTL analysis, fine map development for some loci, and analysis of isolated genes, have revealed that the alleles detected in E × F are variants of the same major genes found in weedy plant introductions (PIs) [5, 24, 41, 53]. Only a few loci detected in the E × F population and in the other materials derived from this cross seem to represent other gene systems at a lower hierarchical position [57]. Identification of the lower tier of genetic control may require intercrosses among NILs or assays that relate to development, time, position, or cell type.

4. Structural Genomics Resources

Soybean (Glycine max L. Merr.) has a genome size of 1115 Mbp/1C [87]. The soybean genome is the product of a diploid ancestor (n = 11) that underwent aneuploid loss (n = 10), followed by allo- and autopolyploidization events separated by millions of years (n = 40), with reversion to a lower ploidy after one of those two events (n = 20) [88]. Evidence that two genome duplications occurred, 40–50 MYA and 8–10 MYA, was provided by RFLP analysis suggesting 4–8 homoeologous loci for most probes [89] and by discontinuous variation among paralogous EST sequences [90–92]. Even PCR-based markers that amplify single loci from genomic DNA amplify multiple amplicons from BAC pool DNA (Figure 2). The duplicated regions have been segmented and reshuffled after the polyploidization events [16, 93–95].

Recently, a systematic measurement of DNA sequence divergence between homoeologous regions was made possible by comparing Forrest BAC end sequences with 7 million reads from the WGS sequences of Williams 82 [29, 93]. MegaBlast searches distinguished some regions, resolving up to 10% nonidentity between homoeologs over a 60 bp window (Figure 2). This implied that significant sequence divergence has occurred at about half the loci tested, as predicted from the gene-family size distribution observed in the physical map [57] (Figure 5). Conversely, highly conserved regions (>90% identity) exceeding about 150 kbp (the size of a large insert clone) have been inferred in certain regions [29]. Within these regions, 2 or 4 homoeologs can be distinguished by single nucleotide variants that correspond to the duplicated regions of a paleopolyploid genome or recently polyploid genome. These variants have been described as single nucleotide polymorphisms among homoeologs (SNHs) [93] though they are commonly called homoeologous sequence variants (HSVs) (see, e.g., [91]).
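The idea of resolving homoeologs by nonidentity within a fixed window can be illustrated with a minimal sketch. It is not the MegaBlast procedure used in the cited work: it assumes the two sequences are already aligned without gaps and simply counts mismatches in consecutive 60 bp windows. The function names and the default window are illustrative choices.

```python
def window_mismatches(seq_a: str, seq_b: str, window: int = 60) -> list[tuple[int, int]]:
    """Count mismatches in consecutive, non-overlapping windows.

    Assumption: seq_a and seq_b are pre-aligned, gap-free, and of equal length.
    Returns (window_start, mismatch_count) pairs; a trailing partial window is ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if window <= 0:
        raise ValueError("window must be positive")
    results = []
    for start in range(0, len(seq_a) - window + 1, window):
        a = seq_a[start:start + window].upper()
        b = seq_b[start:start + window].upper()
        mismatches = sum(1 for x, y in zip(a, b) if x != y)
        results.append((start, mismatches))
    return results


def percent_nonidentity(mismatches: int, window: int = 60) -> float:
    """Express a mismatch count as percent nonidentity over the window."""
    return 100.0 * mismatches / window


if __name__ == "__main__":
    reference = "ACGT" * 30                    # 120 bp toy sequence
    variant = list(reference)
    for i in range(6):                         # introduce 6 differences in the first window
        variant[i] = "A" if reference[i] != "A" else "C"
    for start, count in window_mismatches(reference, "".join(variant)):
        print(start, count, f"{percent_nonidentity(count):.1f}%")
```

In this toy case, six differences in a 60 bp window correspond to the 10% nonidentity mentioned above, whereas a window with no differences would be indistinguishable between homoeologs.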

Overlain on the segmented regions found in 2 or 4 copies, the soybean genome is a composite of dispersed and contiguous euchromatic regions [88]. The short arms of four chromosomes are entirely heterochromatic, but in the remaining 16 chromosomes with potentially gene-rich euchromatic arms, the heterochromatin is restricted to pericentromeric regions. Euchromatin represents 64% of the soybean genome, with a range of 40–85% on an individual chromosome. Due to these features and the following other reasons, analysis of the soybean genome has been a challenge: (i) large genome size, (ii) serial duplication of regions, (iii) small proportion of unique DNA, and (iv) highly conserved repeated DNA. One reasonable prediction would be that many of the duplicated regions would be silenced in heterochromatin. However, a comparison of the genetic map and physical map [93–95] has shown that duplicated segments are neither clustered nor restricted to heterochromatic arms. Further, the gene-rich islands are not separate from the duplicated regions. Therefore, new models to explain gene regulation that include duplicated conditions must be developed. Lessons learned from this exercise will help in the analysis of some legume and many dicotyledonous crop genomes, where genome duplication is believed to have often accompanied speciation. Breeders, who develop new cultivars through selection from the available variation within a cultivar, will also utilize this information and will develop new selection methods through an understanding of the effects and benefits of partial, segmented genome duplication.

4.1. BAC Libraries and Physical Maps

Construction of fingerprint-based physical maps in soybean relied on the availability of deep-coverage high-quality large insert genomic libraries, and a number of such public sector large insert libraries are available in four different plasmid vectors, providing >45-fold genome coverage. BAC libraries are available not only for Forrest and PI437654, but also for some G. soja PIs and the wild relatives of G. max [84, 85, 96, 97]. Among these libraries, there are three “Forrest” BAC libraries [84, 85], available in two different plasmid vectors with different oris and different selectable markers (Table 5). Despite the availability of these rich BAC resources, there are still a few regions of the genome that are not well represented across the above set of BAC libraries. New libraries without involving restriction digestion may help solve this problem (unpublished).

A double-digest-based physical map for the soybean genome is now nearing completion. For this purpose, soybean BACs from five libraries belonging to three cultivars were fingerprinted and assembled [98] using a moderate information content fingerprint method (MICF) and FPC. The available BACs presently include 1182 Faribault BACs (about 130 kbp, EcoRI inserts, 0.125x), 860 Williams 82 BACs (about 130 kbp, HindIII inserts, 0.1x), and 78 001 Forrest BACs that were selected from the three libraries (125–157 kbp EcoRI, HindIII, and BamHI inserts, 9x). Cultivar sequence variation did not appear to cause incorrect binning of BACs by FPC. However, the first release (build 3) [98] had many problems (Table 6): many individual contigs appeared to contain noncontiguous genomic regions and, in some cases, different contigs contained the same region of the genome. Also, the available set of contigs encompassed a space that was 300 Mbp larger than the soybean genome. Clone contamination caused many of these problems, so new methods to identify and eliminate contaminated clones were developed [99].
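The genome coverage figures quoted for each library follow from clone number, mean insert size, and genome size. The sketch below assumes the 1115 Mbp haploid genome size given at the start of Section 4 and uses the nominal insert sizes as stated; it is a rough check only, and small departures from the quoted coverages are to be expected if the published values were based on measured insert sizes or a different genome size estimate.

```python
def genome_coverage(n_clones: int, mean_insert_kbp: float, genome_mbp: float = 1115.0) -> float:
    """Fold coverage = total cloned DNA / haploid genome size.

    Assumption: every clone carries an insert of the stated mean size and the
    genome size is 1115 Mbp (the value quoted in Section 4).
    """
    if genome_mbp <= 0:
        raise ValueError("genome size must be positive")
    total_mbp = n_clones * mean_insert_kbp / 1000.0
    return total_mbp / genome_mbp


if __name__ == "__main__":
    libraries = {
        "Faribault (EcoRI)": (1182, 130.0),
        "Williams 82 (HindIII)": (860, 130.0),
        "Forrest, smallest inserts": (78001, 125.0),
        "Forrest, largest inserts": (78001, 157.0),
    }
    for name, (clones, insert_kbp) in libraries.items():
        print(f"{name}: {genome_coverage(clones, insert_kbp):.2f}x")
```

With these inputs the Forrest clones span roughly 9 to 11 genome equivalents, in line with the coverage stated for the fingerprinted set.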

Subsequently, the publicly available soybean BAC fingerprint database was used to create build 4 [16] with the following specific aims: (i) to increase the number of genetic markers in the map, (ii) to reduce the frequency of clone contamination, (iii) to rebuild the physical map at high stringency, (iv) to examine clone density per contig, and (v) to examine the effectiveness of the generic genome browser in representing duplicated homoeologous regions (Table 6). Clones suspected of contamination were listed, fingerprints were examined, and contaminated clones were removed from the FPC database. Many well-to-well contaminated clones (7134, about 10%) were removed from the fingerprint database. The edited database produced 2854 contigs and encompassed 1050 Mbp. In addition, homoeologous regions that might cause separate contigs to coalesce were detected in several ways. First, contigs with high clone density (23%) were inferred to represent two-copy (240) or four-copy (406) conserved genomic regions per haploid genome (Table 6). If the polyploid regions could all be split using HSVs (Figure 1) [29], there would be 480 regions with two copies and 1624 regions with four copies in the soybean genome. A second proof of this genome structure was that pairs of separate contigs that contained the same marker anchors (69%) were inferred to represent homoeologous but diverged genomic regions (Figure 6) [16]. A third proof came from EST hybridizations to BAC libraries, where gene families with 1, 2, 4, and 8 members were more common than those with 3 or 5 members [57]. Finally, a similarity search within the whole genome sequence at 90% similarity showed that the sequences that map to the contigs with duplicated regions do have homoeologs in the sequence, whereas sequences from single-copy regions do not (Figure 2) [29, 93].
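The region counts can be checked against the contig counts with a few lines of arithmetic. The sketch assumes that a coalesced contig representing n homoeologous copies would split into exactly n regions once resolved; the function name is illustrative.

```python
def split_regions(contigs_by_copy_number: dict[int, int]) -> dict[int, int]:
    """Regions obtained if every n-copy contig is resolved into n separate regions.

    Assumption: each coalesced contig contains exactly as many homoeologous
    regions as its inferred copy number.
    """
    return {copies: copies * n_contigs for copies, n_contigs in contigs_by_copy_number.items()}


if __name__ == "__main__":
    # Contig counts quoted in the text: 240 two-copy and 406 four-copy contigs.
    regions = split_regions({2: 240, 4: 406})
    for copies, count in sorted(regions.items()):
        print(f"{copies}-copy contigs resolve into {count} regions")
    print(f"Total duplicated regions: {sum(regions.values())}")
```

Under this assumption the total is 2104, which matches the number of regions found in 2 or 4 copies that is given in the Abstract.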

To deal with duplicated regions, SoyGD was adapted to distinguish homoeologous regions by showing each contig at all potential anchor points, spread laterally, rather than as overlapping [16]. Therefore, it should be realized that the genes in such regions have duplicates in other regions of the genome (Figure 6). This information will prove useful in the future for gene isolation by positional cloning following a reverse genetics approach, where anaplerotic pathways regularly cause widespread failures [100–102] due to the inability to predict phenotypes reliably.

In build 5, DNA sequence scaffolds (unpublished) have been used to cluster groups of neighboring contigs. This, however, does not solve the problems caused by genome duplication. In many cases (60–80%), homoeologous variants may help to separate coalesced regions [29], but this would require BESs for every fingerprinted BAC clone. In a minority of regions (20–40%), sequences longer than BESs may be needed to correctly separate BAC clones into contigs.

4.2. Minimum Tile Paths

The creation of minimally redundant tile paths (MTPs) from contiguous sets of overlapping clones (contigs) in physical maps is a critical step for structural and functional genomics [95]. The first minimum tiling path developed (from builds 2 and 3) contained 2-fold redundancy of the haploid genome (2,100 Mbp). MTP2 comprised 14 208 clones (mean insert size 140 kbp) that were picked from the 5597 contigs of build 2. MTP2 was constructed from three BAC libraries (BamHI (B), HindIII (H), and EcoRI (E) inserts) and encompassed the contigs of build 3, which were derived from build 2 by a series of contig merges. It does not distinguish regions by degree of duplication, so many regions are redundant. MTP2 is used in two parts, MTP2BH and MTP2E (Table 6), because they are largely redundant and overlap each other. Also, the vectors differ in the antibiotic resistance conferred. Consequently, only MTP2BH was used for development of the EST map [57].
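
The idea of a minimum tiling path can be sketched as a greedy interval cover. This is only an illustration under simplifying assumptions: it treats each clone as a known interval on the contig, whereas the MTPs described here were picked from fingerprint overlaps in FPC, where clone positions are estimates. The clone names, coordinates, and the `min_overlap` parameter are hypothetical.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start, end) in contig coordinates

def minimum_tiling_path(clones: List[Clone], min_overlap: int = 0) -> List[Clone]:
    """Greedy cover: at each step take the clone that overlaps the covered
    region by at least min_overlap and extends furthest to the right."""
    if not clones:
        return []
    ordered = sorted(clones, key=lambda c: (c[1], -c[2]))
    contig_end = max(c[2] for c in ordered)
    covered_to = ordered[0][1]
    path: List[Clone] = []
    i = 0
    while covered_to < contig_end:
        # The first clone only has to start at the contig start.
        limit = covered_to if not path else covered_to - min_overlap
        best = None
        while i < len(ordered) and ordered[i][1] <= limit:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError("no clone extends the path past %d" % covered_to)
        path.append(best)
        covered_to = best[2]
    return path

# Hypothetical contig of five overlapping clones (coordinates in kbp).
contig = [("A", 0, 100), ("B", 20, 130), ("C", 90, 220),
          ("D", 120, 260), ("E", 200, 300)]
print([name for name, _, _ in minimum_tiling_path(contig)])
```

Requiring a larger `min_overlap` trades more clones for safer joins between neighbors, which matters when overlaps are inferred from shared fingerprint bands rather than from sequence.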

The third and fourth MTPs, called MTP4BH and MTP4E (Table 6), were each based on build 4 [95]. Each was selected as a single path through each of the 2854 contigs. MTP4BH had 4608 clones with a mean insert size of 173 kbp in the large (27.6 kbp) T-DNA vector pCLD04541, which is suitable for plant transformation and functional genomics. Plates 1–8 contained clones from the contigs belonging to the single copy regions of the genome. Plates 9 and 10 were picked from the duplicated and quadruplicated regions without redundancy, so that an individual clone represented either 2 or 4 regions per haploid genome. Plates 11 and 12 contained the marker-anchored clones also used in MTP2BH. Plate 13 of MTP4BH was developed from just 6 contigs from regions with four copies by redundant picking. This set of clones should resolve into 48 regions, if methods to separate them can be developed as the genome sequencing is completed [93]. This set of 13 plates was used for HICF fingerprinting by the same methods that were used for Williams 82 [11] and PI437654 BACs [79, 96]. The BACs used for HICF will form a bridge to other physical maps and a resource to test the ability of HICF to correctly separate duplicated regions, particularly in the contigs in plate 13.

MTP4E was designed to be 4608 BAC clones with large inserts (mean 175 kbp) in the small (7.5 kbp) pECBAC1 vector [57, 85]. However, only 3840 clones were picked to date. Sequencing efficiency was low on this MTP and reracking will be needed [103]. The vector is suitable for DNA sequencing and these clones will be used for sequencing across gaps in the WGS sequence.

MTP4BH and MTP4E clones each encompassed about 800 Mbp before duplicate regions were considered. The single copy regions represented 700 Mbp [57]. In addition, there were 50 Mbp from the duplicate and 50 Mbp from the quadruplicate regions in the MTP. Because those regions were duplicate and quadruplicate, they encompass another 300 Mbp in total. MTP2BH, MTP4E, and MTP4BH were each used for BAC-end sequencing and microsatellite integration into the physical map [27, 39]. MTP2BH was used for EST integration into the physical map [16, 57]. MTP4BH was used for high information content fingerprinting for integration with the Williams 82 physical map [11, 104]. In conclusion, it appears that each MTP and the derived BESs will be useful to deconvolute and finish the whole genome shotgun sequence of soybean, while the whole genome sequence will help complete the physical map. A complete MTP5BH would be a useful tool for functional genomics because clones from these libraries were constructed in a T-DNA vector and are ready for plant transformation. About four thousand transgenic lines made from BACs would be enough to transfer every soybean sequence to another plant.
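
The bookkeeping behind these figures can be written out as a short worked example. It follows one reading of the paragraph above, namely that each Mbp of MTP clones from a duplicated region stands for two genomic copies and each Mbp from a quadruplicated region stands for four; the function name is hypothetical.

```python
def mtp_and_genome_span(single_mbp, duplicated_mbp, quadruplicated_mbp):
    """Return (Mbp of clones in the MTP, Mbp of genome those clones represent)."""
    mtp_span = single_mbp + duplicated_mbp + quadruplicated_mbp
    # Collapsed regions are counted once in the MTP but occur 2 or 4 times
    # in the haploid genome.
    genome_span = single_mbp + 2 * duplicated_mbp + 4 * quadruplicated_mbp
    return mtp_span, genome_span

mtp_span, genome_span = mtp_and_genome_span(700, 50, 50)
print(mtp_span)            # 800 Mbp of clones in the MTP
print(genome_span - 700)   # 300 Mbp represented by the collapsed regions
print(genome_span)         # 1000 Mbp represented in total
```

Under this reading, the total of about 1000 Mbp is of the same order as the 1050 Mbp encompassed by the build 4 contigs.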

4.3. BAC End Sequences (BESs)

BAC end sequences (BESs) anchored to a robust physical map constitute an important tool for genome analysis, and have been developed from BACs belonging to three available MTPs: MTP2BH, MTP4BH, and MTP4E [95, 103]. Therefore, three sets of BESs were available, of which the first set consisted of 13 474 good BESs derived from 8064 clones of MTP2BH (Table 5). Queries against the GenBank nr and pat databases identified 7260 potentially genic homologs, and an analysis of the locations of inferred genes suggested the presence of gene-rich islands on each chromosome [37]. In addition, 42 BESs showed homology (extending over a length of 80–341 bp at e−30 to e−300) with DNA markers (10 RFLPs, 20 microsatellites) that were already genetically mapped [95]. This amounts to homology with about 2% of the markers whose sequences are available in GenBank. Available BESs also carried as many as 1053 new SSR markers [27, 37] that are described further in the next section.

The second set of BESs consisted of 7700 good BES reads from clones of MTP4BH (Table 5), of which 4147 had homologs in the GenBank nr and pat databases [57]. The clones in plates 11 and 12 were resequenced and so have 2 records for each BAC end in GenBank. Resequenced clones help determine the sequence error rate and greatly facilitate SNP detection [18, 19]. Twenty additional genetic anchors were detected in this second set of BESs (6 RFLPs, 14 microsatellites), which represented about 1% of the soybean markers with sequences in GenBank. This second set of BESs carried 625 SSR markers [27, 37] that are described further in the next section. The third set of BESs, from MTP4E, has recently been released and is only partly analyzed (Table 6).

The above builds of the physical map, which represent recently duplicated regions of the genome, can be further improved with existing databases and tools. In particular, this can be achieved by increasing the number of reliable genetic anchors derived from BESs [27, 37] and by separating BACs from homoeologous regions with diagnostic SNPs (Figure 2) before contigs are formed [93].

4.4. Genetic Map and SSR Markers Derived from BESs

The molecular genetic map for the soybean genome can be improved further through several approaches, including (i) addition of BES-derived markers to the available genetic map [27, 37], (ii) bioinformatics analysis of contig data [16], and (iii) the use of novel approaches to error detection [99]. The composite genetic map of soybean at SoyGD (in 2007) contained 3073 DNA markers [16, 27], which included 1019 class I SSRs, each with >10 di- or trinucleotide repeat motifs (BARC-SSR markers; Song et al., 2004), and a few class II SSRs with <10 di- to pentanucleotide repeats that were mostly SIUC-SSR markers. Forrest BESs helped to increase the number of class I and II SSR markers for the soybean genome, and allowed integration of BAC clones into the soybean physical map.

SSRs were mined separately from the two sets of BESs described above. As mentioned above, the first set of 10 Mbp of BAC end sequences (BESs), derived from 13 474 reads of 7050 clones constituting MTP2BH, had 1053 SSRs (333 class I + 720 class II), and the second set of 5.7 Mbp of BESs, derived from 7700 reads from 5184 clones constituting MTP4BH, had 620 SSRs (150 class I + 480 class II). Potential markers are shown on the MTP_SSR track at SoyGD (Figure 6). About 530 primer pairs were designed for both sets of SSRs. These primers were 20–24 nucleotides long with a melting temperature (Tm) of 55 ± 1°C, and provided amplicons that were 100–500 bp long. As many as 123 of these primers belonging to duplicated regions gave multiple amplified products, and therefore should be avoided.
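
SSR mining of this kind can be illustrated with a short regular-expression sketch. It is not the pipeline used for the Forrest BESs, whose software and thresholds are not given here. The assumptions are: motifs of 2 to 5 bases, a hypothetical minimum of 4 repeats to report a hit, class I meaning at least 10 repeats of a di- or trinucleotide motif (the text leaves exactly 10 repeats undefined), and everything else reported as class II. The example sequence is invented.

```python
import re

def find_ssrs(seq, min_repeats=4):
    """Return (start, motif, repeat_count, ssr_class) for each SSR found.

    Assumed thresholds: class I = di-/trinucleotide motif with >= 10
    repeats; any other reported SSR is class II.
    """
    # Lazy {2,5}? tries the shortest motif first, so ATATAT... is reported
    # as (AT)n rather than (ATAT)n.
    pattern = re.compile(r"([ACGT]{2,5}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        repeats = len(match.group(0)) // len(motif)
        ssr_class = "I" if repeats >= 10 and len(motif) in (2, 3) else "II"
        hits.append((match.start(), motif, repeats, ssr_class))
    return hits

# Invented BES-like fragment with one long (AT)n run and one short (GAA)n run.
example = "GGC" + "AT" * 12 + "CCGT" + "GAA" * 5 + "TTC" + "A" * 12 + "G"
for hit in find_ssrs(example):
    print(hit)
# (3, 'AT', 12, 'I')
# (31, 'GAA', 5, 'II')
```

An AT-rich hit such as the first one is consistent with the motif bias reported for these BESs, and primers flanking hits that fall in duplicated regions would be expected to give multiple products.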

Different possible motifs were not randomly distributed among the above SSRs, with AT rich motifs being more frequent [27]. Compound SSRs having tetranucleotide repeats clustered with di- and trinucleotide motifs were also found. About 75% of class I and 60% of class II SSR markers were polymorphic among the parents of four recombinant inbred line (RIL) populations. Most of the BESs-SSRs were located on the soybean genetic map in regions with few BARC-SSR markers [27, 39]. Therefore, BESs-SSRs represent a useful tool for the improvement of the genetic map of soybean.

4.5. SNP Markers Derived from BESs to WGS

The soybean genome has been shown to be composed of 8000 short interspersed regions of one, two, or four copies per haploid genome, as shown by RFLP analysis, SSR anchors to BACs and by BAC fingerprints [16]. Recently, the genome has been sequenced by WGS sequencing of 4 kbp inserts in pUC18 [105]. When the extent and homogeneity of duplications within contigs was examined using BAC end sequences (BESs) derived from minimum tile paths (MTP2BH and MTP4BH; Figure 2) [29], a strong correlation was found between the fold of duplication inferred from fingerprinting and that inferred from WGS matches. Duplicated regions were identified by BAC fingerprint contig analysis using a criterion of less than 10% mismatch across a trace with a window size of 60 bp. Previously, simulations had predicted that fingerprints of clones from different regions would coalesce, if sequence variation was less than 2%. Hopefully, the HSVs among contigs from duplicated regions can be used to separate clone sets from different regions. Ironically, improvements for contig building methods will result from the whole genome sequence! However, many duplicated regions with less than 1% sequence divergence were found [29, 93]. The implication for bioinformatics and functional annotation of the soybean genome (and other paleopolyploid or polyploid genomes) is that reverse genetics with many genes will be nearly impossible without tools to simultaneously repress or mutate several gene family members.

5. Functional Genomics Tools

Unequivocal identification and map-based cloning of genes underlying quantitative traits have been a challenge for soybean genomics research. Gene redundancy, gene action, and low transformation efficiencies seriously hampered positional cloning [16]. Therefore, a variety of approaches need to be used for soybean functional genomics research. Two major areas of soybean genomics research include (i) annotation of genomic sequences (genes with unknown functions) and (ii) analysis of genome sequences of “Forrest” for synteny with the genomes of other dicotyledonous genera and with those of other soybean cultivars.

5.1. Annotation of Genome Sequences

The three methods that proved useful for annotation of the genome sequences of Forrest and related germplasm include (i) mutant complementation using transformation, (ii) gene silencing through RNAi, and (iii) targeted mutations. Each will be briefly discussed.

(i) Mutant complementation using transformation. A popular approach for the study of gene function is mutant complementation, which involves transformation of mutants with the wild-type alleles. Therefore, development of transformation protocols is an essential component of functional genomics research. In soybean, A. tumefaciens- and A. rhizogenes-mediated transformation of cultured cells with Forrest BAC clones has been successfully achieved using previously described protocols involving the T-DNA vector pCLD04541 [84]. In this protocol, the nptII gene is used as a plant selectable marker, and kanamycin is used as a selective agent [106–109]. Screenable markers are available in some BAC clones (Table 7). Whole-BAC transformation is important because fine maps that locate loci within a genetic distance of 0.25 cM, equivalent to 50–150 kbp, were previously prepared using RILs and NILs. The clones selected for transformation are listed in Table 7, and should provide for complementation of easily scoreable phenotypes in mutants. For instance, dominant mutant phenotypes of traits like pubescence, color, and disease resistances should be evident in the very first products of transformation. BAC transformation with sets of overlapping clones will be the best approach in situations where an individual locus represents a cluster of genes [37, 38].

(ii) Gene silencing using RNAi. The composite plant system for RNAi has been tested in NILs derived from Forrest, and has been validated by Dr. C. G. Taylor at the Danforth Center (St. Louis, Mo, USA) [110] through expression of gene-specific dsRNA constructs. Using this system, shoots from stable transgenic soybean plants showing constitutive expression of uidA (GUS) were transformed with dsRNA constructs (Figure 7) that were designed using a modified pKannibal vector [111], with the 35S promoter replaced by the figwort mosaic virus (FMV) promoter. The 600 bp homologous sequences of the GUS or green fluorescent protein (GFP) gene were introduced in antisense and sense orientations separated by the pKannibal intron (spacer) sequence. These constructs were designed to produce transcripts with a stem-loop secondary structure that would be recognized by the plant cell machinery and activate RNAi. The dsRNA constructs were placed in a binary vector, introduced into A. rhizogenes, and used for composite plant production [112]. The GUS-specific RNAi construct silenced GUS expression in hairy roots produced on shoots of transgenic soybean plants, whereas the non-GUS (GFP) RNAi construct did not. These results show that hairy roots can be used to produce dsRNAs and that the RNAi machinery in soybean hairy roots is fully functional in a sequence-specific manner. Thus, RNAi technology will allow the rapid analysis of sets of candidate genes for alleles underlying variation [38].

(iii) Study of gene function through TILLING. Two soybean mutagenized M2 libraries are already available for TILLING [113], from which 3000 of the 6000 available M2 lines were phenotyped visually. A soybean mutant database has been developed to track and sort these mutants (http://www.soybeantilling.org/). A search engine was developed alongside the database, so that it can be searched by both phenotype and “TILLED” gene. The mutations occurred at a rate of 1 mutation/170 kbp, so that a screening of 6150 M2 families may provide a series of up to 40 to 60 alleles within each 1.5 kbp fragment of a target gene. This approach led to the identification of a putative mutant for a soybean leucine-rich repeat receptor-like kinase gene, Gm-Clavata1A (AF197946; Figure 8). In the future, TILLING and crosses among TILLED mutants [100–102] will allow the testing of candidate genes and will provide new genetic variation that may lead to germplasm enhancement.
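
The allele estimate follows from simple arithmetic, sketched below. The calculation assumes that every M2 family carries independent mutations at the stated average density; the function name is hypothetical.

```python
def expected_alleles(n_families, fragment_kbp, kbp_per_mutation):
    """Expected number of induced mutations in one screened fragment,
    summed over all M2 families (assumes a uniform mutation density)."""
    return n_families * fragment_kbp / kbp_per_mutation

# Values from the text: 6150 families, 1.5 kbp fragment, 1 mutation per 170 kbp.
print(round(expected_alleles(6150, 1.5, 170.0), 1))  # 54.3
```

The result, about 54 mutations per 1.5 kbp fragment, falls within the 40 to 60 alleles quoted above; only a fraction of these would be expected to change the encoded protein.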

5.2. Analyses of Conserved Synteny

Forrest genome sequences have also been used to study synteny with the genomes of other dicotyledonous genera and species, and with the genomes of other soybean cultivars. For this purpose, cross-species transferable genetic markers are available in the database legumeDB1 [114], and can be used to compare the linear order of markers and genes, which are either species specific or conserved across genera [115–124]. For instance, genes for resistance to pathogens will often appear as new genes or gene clusters inserted in regions that otherwise exhibit conserved synteny across genera [35, 115, 122]. Synteny extends beyond genes into repeat DNA, as exemplified by the distributions of 15 bp sequences that provide sequence-specific genome fingerprints [94]. Interestingly, the fingerprints do not show the same patterns of relatedness between species that are found in gene sequences. Therefore, genome fingerprinting will help identify good candidates for cross-species markers in repeat DNA, such as microsatellite markers.

Conserved synteny has also been observed among the genomes involved in the constitution of the allo- and autotetraploids hypothesized for soybean. It has been shown that about 25–30% of the genome has extensive conservation of gene order in otherwise shuffled blocks of 150–300 kbp [16]. Consequently, blocks of 3–10 genes are repeated at 2 or 4 locations per haploid genome [38, 79]. There are also genomic regions, where synteny among genomes of different cultivars has been shown to break down. Several interesting features including the following have been observed in these nonsyntenic regions: (i) in some cases, a loss of conserved synteny between cultivars is associated with a gene introgressed from a Plant Introduction [38]. (ii) In another case, a moderately repeated sequence common in one cultivar is absent in another cultivar [29]. (iii) In still another case, a sequence inserted in one cultivar appears to alter the expression of a neighboring gene (unpublished). It is thus apparent that genome analysis involving study of an association of these nonsyntenic sequence tracts in otherwise syntenic regions, with phenotypes will be an active area of research, when genome sequences from a number of soybean cultivars are available.

6. Conclusions

The soybean genomics resources developed through the use of cultivar Forrest have been used, and will continue to be used, to advance the soybean genomics knowledge base. The soybean genome shows evidence of a paleopolyploid origin, with regions encompassing gene-rich islands that were highly conserved following duplication [16]. In fact, it was estimated that 25–30% of the genome was highly conserved after both duplications. The implications of this feature are profound. First, a map of homoeology and an associated map of duplicated regions had to be developed. Second, an estimate of sequence conservation among the duplicated regions was necessary. Third, the implications for functional genomics were considered. Given that all soybean genes have been duplicated twice during recent evolution, and that most plant genomes encode functionally redundant pathways, it is not surprising that TILLING, RNAi-mediated silencing, and overexpression of several genes often did not lead to phenotypic changes [101, 102, 110, 113]. In the future, the E × F population will continue to be used for (i) an analysis of functions of a number of gene families, (ii) patenting of inventions based on useful genes [6, 77, 124–126], (iii) manipulation of soybean seed composition including protein, oil [19], and bioactive factors [127–129], and (iv) an analysis of the protein interactome [130]. In summary, the newly released E × F population and the other associated genomic resources developed through the use of cultivar “Forrest” will provide tremendous opportunities for further genomics research.

Acknowledgments

The research summarized in this review was funded in part by grants from the NSF 9872635, ISA (95-122-04; 98-122-02; 02-127-03), and USB 2228–6228. The physical map was based upon work supported by the National Science Foundation under Grant no. 9872635. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. The continued support of SIUC, College of Agriculture, and Office of the Vice Chancellor for Research to the Genomics Facility is appreciated. The author thanks Dr. Q. Tao and Dr. H. B. Zhang for assistance with fingerprinting; Dr. C. Town and Dr. C. Foo at TIGR for the BESs; Dr. K. Meksem for the TILLING figure; Dr. C. G. Taylor for an RNAi figure; Dr. Z. Zhang for transformation of Forrest; and the SIUC team for their rigor in addressing a novel problem for genomics. All team members are thanked for their continued collaborations. Thanks are also due to P. K. Gupta, the guest editor, for critical readings, which led to significant improvement of the manuscript.