Abstract

Genomic imprinting is an epigenetic phenomenon that causes differential expression of the paternally and maternally inherited alleles of a subset of genes (the so-called imprinted genes). Imprinted genes are distributed throughout the genome, and it is predicted that about 1% of human genes may be imprinted. The allelic expression of imprinted genes is known to vary between tissues and developmental stages. The current study represents the first attempt to estimate the prevalence of imprinted genes within the total human transcriptome. In silico analysis of the normalized expression profiles of a comprehensive panel of 173 established and candidate human imprinted genes was performed in 492 publicly available SAGE libraries, which represent human cell and tissue samples in a variety of physiological and pathological conditions. Variations in the prevalence of imprinted genes within the total transcriptomes (ranging from 0.08% to 4.36%) and in the expression profiles of the individual imprinted genes are assessed. This paper thus provides a useful reference on the size of the imprinted transcriptome and on the expression of the individual imprinted genes.

1. Introduction

Genomic imprinting is an epigenetic phenomenon that causes differential expression of the paternally and maternally inherited alleles of a minor subset of genes (the so-called imprinted genes). Genomic imprinting was first discovered in 1984 [1, 2], and in 1991 the first imprinted genes (IGF2, paternally expressed; IGF2R and H19, maternally expressed) were identified in the mouse [3–5]. Since then, imprinting status has been confirmed for numerous genes in the Homo sapiens and Mus musculus genomes and for fewer genes in Bos taurus, Rattus norvegicus, Sus scrofa, Canis lupus familiaris, and Ovis aries; many more genes are considered candidates [6]. The functional significance of genomic imprinting is not yet fully understood [7–9], while alterations in the expression of imprinted genes are linked to certain pathologies, including Angelman syndrome, Prader-Willi syndrome, and particular cancer subtypes. Genomic imprinting varies between species and tissues. Furthermore, it is a dynamic process and may vary depending on the developmental stage [10]. The goal of the study was to estimate the prevalence of imprinted genes within the total human transcriptome in cell and tissue samples representing a variety of physiological and pathological conditions.

Serial analysis of gene expression (SAGE) is a sequence-based technique for the quantitative study of mRNA transcripts in cell populations [11]. Two major principles underlie SAGE: first, short (10 bp) expressed sequence tags (ESTs) are sufficient to identify individual gene products, and second, multiple tags can be concatenated and identified by sequence analysis. SAGE results are reported in either absolute or relative numbers of tags, which permits direct comparisons between tag catalogues and datasets [12–15]. Numerous technical adaptations have led to the development of similar techniques [16], yet SAGE remains an important tool of modern molecular biology. It is widely used in a number of applications, of which molecular dissection of the cancer genome is the most prominent [17]. In the current study, expression of established and candidate imprinted genes was evaluated in a wide array of cell and tissue samples using a comprehensive set of currently available SAGE data for Homo sapiens. Five hundred eighty-one SAGE catalogues based on libraries generated with the most commonly used anchoring enzyme, NlaIII, were screened using a conservative set of criteria, and in 492 of these (accounting for nearly 36 million SAGE tags) the expression profiles of the imprinted genes were analyzed using a proven algorithm [18]. It was therefore possible to estimate the prevalence of imprinted genes within the total human transcriptome.

2. Methods

2.1. Imprinted Gene Subsets

The subset of established and candidate imprinted genes was assembled based on the Geneimprint resource (http://www.geneimprint.com/; credits to R.L. Jirtle) and the study by Luedi et al. [6]. From the latter, the high-confidence human gene candidates predicted to be imprinted by both the linear and RBF kernel classifiers learned by Equbits Foresight and by SMLR ([6], supplementary data) were used. Redundant entries were excluded.

2.2. SAGE

SAGE technology is based on the isolation of short tags from the appropriate position within the mRNA molecule, followed by concatemerization of the tags, sequencing, tag extraction, and gene annotation [11]. The complete set of publicly available SAGE libraries (GPL4 dataset, NlaIII anchoring enzyme) was downloaded from the Gene Expression Omnibus (GEO) database (National Center for Biotechnology Information (NCBI); http://www.ncbi.nlm.nih.gov/geo/). Following exclusion of duplicate entries, SAGE libraries were annotated and sorted based on the number of tags sequenced. Noninformative (A)10 sequences were removed from SAGE libraries when detected, and tags per million (tpm) values were recalculated accordingly for all libraries as the transcript’s raw tag count divided by the number of reliable tags in the library and multiplied by 1,000,000. SAGE libraries constructed by Potapova et al. [19] were subjected to a “clean-up” procedure through which all clones containing ≤4 tags were excluded [20], with the remaining tags constituting the pool of “reliable tags.”
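
The tpm recalculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the function name `tags_per_million`, the dictionary layout, and the example counts are hypothetical, the (A)10 sequence is assumed to be the plain 10-mer AAAAAAAAAA, and the clone-level “clean-up” step is not modeled.

```python
POLY_A_TAG = "A" * 10  # noninformative (A)10 tag, assumed to be the plain 10-mer


def tags_per_million(raw_counts):
    """Return tpm values for one SAGE library.

    raw_counts maps a 10 bp tag sequence to its raw count (hypothetical layout).
    The (A)10 tag is dropped before the library total is computed, so the
    denominator is the number of remaining ("reliable") tags.
    """
    reliable = {tag: n for tag, n in raw_counts.items() if tag != POLY_A_TAG}
    total = sum(reliable.values())
    if total == 0:
        raise ValueError("library contains no reliable tags")
    # tpm = raw count / reliable tags in the library * 1,000,000
    return {tag: n * 1_000_000 / total for tag, n in reliable.items()}


# Made-up counts, for illustration only.
example_library = {
    "ACTTTTTCAA": 30,
    "ATTAACAAAG": 10,
    "AAAAAAAAAA": 60,
    "GGGGGGGGGG": 9960,
}
example_tpm = tags_per_million(example_library)
```

Because the (A)10 tags are excluded from the denominator, the tpm values of the remaining tags always sum to 1,000,000 for each library.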

2.3. SAGE Tag Annotation

The established and candidate imprinted gene subset was matched against the CGAP (Cancer Genome Anatomy Project, NCI, NIH) SAGE Anatomic Viewer (SAV) applet [17]. For genes not matching SAV applet entries, and where the SAV applet suggested unreliable or internal tags (viz., for the TIGD1, HOXA3, and NTRI genes, among others), reliable 3′ end tags were extracted from full-length sequences available via GenBank (NCBI, NIH).
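
Manual extraction of a 3′ end tag can be illustrated as follows. The sketch assumes the conventional SAGE definition of a tag as the 10 bp immediately downstream of the 3′-most NlaIII site (CATG) in the transcript; the function name `three_prime_tag` and the example sequences are hypothetical, and real annotation would also need to handle poly(A) tails and alternative transcripts.

```python
NLA3_SITE = "CATG"  # NlaIII recognition sequence
TAG_LENGTH = 10     # short SAGE tag length


def three_prime_tag(mrna):
    """Return the 10 bp tag following the 3'-most CATG, or None.

    None is returned when the sequence has no NlaIII site at all, or when
    fewer than 10 bases follow the 3'-most site.
    """
    seq = mrna.upper()
    pos = seq.rfind(NLA3_SITE)  # index of the 3'-most site, -1 if absent
    if pos == -1:
        return None
    start = pos + len(NLA3_SITE)
    tag = seq[start:start + TAG_LENGTH]
    return tag if len(tag) == TAG_LENGTH else None


# Made-up sequence with two CATG sites; only the 3'-most one is used.
example_tag = three_prime_tag("GGCATGAAACCCGGGTTTCATGACTTTTTCAAGG")
```

A transcript without any CATG site yields `None`, which corresponds to the situation in which a gene cannot be annotated with an NlaIII-based tag.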

2.4. Expression Profiling

SAGE tags were matched against the individual SAGE catalogues using the Query function of the MS Access software package. Individual queries (both absolute tag abundance per library and normalized tags per million (tpm) values) were merged using MS Excel software. Calculations of the maximal and average expression of transcripts matching established and candidate imprinted genes were performed using normalized tpm values. Any tpm value can be converted to a fraction of the total gene expression by dividing it by 1,000,000.
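
The matching and summation steps can be sketched in Python as an alternative to the MS Access and MS Excel workflow described above. The function name `cumulative_expression` and the example values are hypothetical. Counting each tag once, even if it is annotated to more than one gene, is an assumption of this sketch rather than a documented detail of the study.

```python
def cumulative_expression(library_tpm, imprinted_tags):
    """Sum the tpm values of the given tags within one normalized catalogue.

    Returns (tpm_sum, percent_of_total_expression). Tags absent from the
    catalogue contribute 0. Duplicate tags are counted once (assumption).
    """
    tpm_sum = sum(library_tpm.get(tag, 0.0) for tag in set(imprinted_tags))
    # tpm / 1,000,000 is the fraction of total expression; * 100 gives percent
    return tpm_sum, tpm_sum / 10_000


# Made-up normalized catalogue and tag list, for illustration only.
example_catalogue = {
    "ACTTTTTCAA": 3000.0,
    "ATTAACAAAG": 1000.0,
    "GGGGGGGGGG": 996000.0,
}
example_tags = ["ACTTTTTCAA", "ATTAACAAAG", "ACGCCCGTGG"]
example_sum, example_percent = cumulative_expression(example_catalogue, example_tags)
```

In this made-up case the three tags together account for 4,000 tpm, that is, 0.4% of the total expression in the catalogue.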

2.5. Clustering Analysis

Clustering analysis was performed using the EPCLUST (Expression Profile data CLUSTering and analysis) software (http://www.bioinf.ebc.ee/EP/EP/EPCLUST/). K-means clustering was performed after transposing the data matrix, with initial clusters chosen by most distant (average) transcripts. For each dataset, the number of clusters was set to the lowest value yielding one cluster containing a solitary database entry. Hierarchical clustering was performed using correlation measure-based distance and the average linkage (average distance) clustering method; hierarchical trees were built for the individual datasets.
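
The distance measure and the rule for choosing the number of clusters can be illustrated with a short Python sketch. It is not a reimplementation of EPCLUST. The correlation distance is assumed to be 1 minus the centered Pearson correlation, the rule is interpreted as "the lowest k that yields at least one single-member cluster", and `cluster_fn` is a hypothetical stand-in for whatever K-means routine is used.

```python
from collections import Counter
from math import sqrt


def correlation_distance(x, y):
    """1 - Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("constant profile: correlation is undefined")
    return 1 - sxy / sqrt(sxx * syy)


def average_linkage(cluster_a, cluster_b):
    """Mean pairwise correlation distance between two groups of profiles."""
    dists = [correlation_distance(a, b) for a in cluster_a for b in cluster_b]
    return sum(dists) / len(dists)


def lowest_k_with_singleton(cluster_fn, profiles, k_max):
    """Smallest k >= 2 whose clustering contains a one-member cluster.

    cluster_fn(profiles, k) must return one cluster label per profile.
    Returns None if no k up to k_max produces a singleton.
    """
    for k in range(2, k_max + 1):
        sizes = Counter(cluster_fn(profiles, k))
        if 1 in sizes.values():
            return k
    return None
```

Passing the clustering routine in as a function keeps the selection rule independent of any particular K-means implementation or initialization scheme.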

3. Results

The subset of established and candidate human imprinted genes (203 entries in total) was assembled based on the Geneimprint resource and the data of Luedi et al. [6]. Of the candidate imprinted genes identified in the latter, the high-confidence candidates (predicted by both Equbits Foresight and SMLR [6]) were selected. Following exclusion of redundant entries, the appropriate short (10 bp) SAGE tags matching the NlaIII anchoring enzyme were annotated to gene targets using the CGAP (Cancer Genome Anatomy Project, NCI, NIH) SAGE Anatomic Viewer (SAV) applet or manually, as described earlier [18]. For a number of the candidate imprinted genes, a complete sequence was unavailable via GenBank or alternative databases (e.g., GenBank ID: NM_016158, NM_024547, NM_181648, etc.); for that reason, the human imprinted gene subset subjected to tag annotation was reduced to 174 genes. Of these, the candidate imprinted gene Q9NYI9 (PPARL; GenBank ID: AF242527) could not be annotated with a SAGE tag because it lacks NlaIII recognition sites entirely. Therefore, a total of 173 genes (53 established and 120 candidate imprinted genes) were annotated with the appropriate SAGE tags (Table 1) and subjected to further analysis.

The complete set of publicly available human SAGE catalogues was downloaded from the Gene Expression Omnibus (GEO, NCBI) database. The acquired SAGE catalogues represent 581 SAGE libraries generated from a wide spectrum of cell and tissue samples in a variety of physiological and pathological conditions. Following exclusion of the numerous duplicate GEO database entries (e.g., GSM785 = GSM383907; GSM1515 = GSM383958; GSM85612 = GSM125353, etc.), the criteria listed below were applied when selecting libraries for the analysis of gene expression. SAGE libraries were selected only if they represented (i) genetically unaltered/unmodified samples, (ii) SAGE catalogues with a total number of tags ≥20,000, and (iii) a complete available dataset. For example, samples GSM383929 and GSM180669 were excluded because they did not satisfy criterion (i), representing ovary surface epithelium immortalized with SV40 and lymphocytes from children with Down syndrome, respectively; samples GSM384024 (white blood cells, CD45+, isolated from a mammary gland carcinoma; 18,741 tags) and GSM1128 (breast cancer cell line; tags detected once are not available) were excluded as not satisfying criteria (ii) and (iii), respectively (Supplementary Table 1). Owing to the conservative nature of these criteria, the total number of SAGE catalogues selected for further analysis (i.e., for the extraction of tags matching imprinted genes) was reduced to 492. Together, these 492 SAGE catalogues representing human samples account for 35.97 million SAGE tags generated using the NlaIII anchoring enzyme. Each catalogue was assigned to one of the following clusters: C (cancer tissue; 185 SAGE catalogues), N (normal tissue and cells; 166 SAGE catalogues), IV (cells cultured in vitro; 112 SAGE catalogues), or D (nontumorous disease tissue and cells; 29 SAGE catalogues) (Table 2 and Supplementary Table 1).
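
The three selection criteria can be expressed as a simple filter. The sketch below is illustrative only. The `Catalogue` record, its field names, and the `EXAMPLE_*` entries are hypothetical, and the tag threshold is assumed to be an inclusive minimum of 20,000. Only the tag count of GSM384024 (18,741) is taken from the text; its other two flags are assumed.

```python
from dataclasses import dataclass

MIN_TOTAL_TAGS = 20_000  # criterion (ii); assumed to be an inclusive minimum


@dataclass(frozen=True)
class Catalogue:
    accession: str
    total_tags: int
    genetically_unaltered: bool  # criterion (i)
    complete_dataset: bool       # criterion (iii)


def is_eligible(cat):
    """True if a catalogue satisfies criteria (i), (ii), and (iii)."""
    return (cat.genetically_unaltered
            and cat.total_tags >= MIN_TOTAL_TAGS
            and cat.complete_dataset)


def select_catalogues(catalogues):
    """Keep only the catalogues that pass all three criteria."""
    return [c for c in catalogues if is_eligible(c)]


examples = [
    Catalogue("GSM384024", 18_741, True, True),   # fails (ii); flags assumed
    Catalogue("EXAMPLE_A", 50_000, False, True),  # hypothetical; fails (i)
    Catalogue("EXAMPLE_B", 50_000, True, False),  # hypothetical; fails (iii)
    Catalogue("EXAMPLE_C", 50_000, True, True),   # hypothetical; passes
]
selected = select_catalogues(examples)
```

Applied to these four records, the filter keeps only `EXAMPLE_C`.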

Figure 1 shows the distribution of the analyzed established and candidate imprinted genes across the human genome. Primary analysis of the normalized expression profiles of the imprinted genes demonstrated great variability in the cumulative gene expression of the 173 genes (Figure 2, Table 3, and Supplementary Table 2). The average cumulative expression of these genes in human tissues and cells was 0.90% of the total gene expression: specifically, 0.95% for both cancer and normal tissue and cells (clusters C and N, resp.), 0.77% for cells cultured in vitro (cluster IV), and 0.83% for nontumorous disease tissue and cells (cluster D). Across the assessed SAGE catalogues, it ranged from 0.08% (total blood, GSM389907 [21]) to 4.36% of the total gene expression (bronchial epithelium, GSM125353 [22]). Of the 492 human SAGE catalogues tested, the cumulative expression of the imprinted genes constituted >2% of the total gene expression in 21 catalogues and <0.2% in 7 catalogues. The SAGE libraries with the 10% highest and 10% lowest cumulative and average expression of the established and candidate imprinted gene subsets are listed in Table 3.

In some samples, a major fraction of the cumulative expression of the imprinted genes was accounted for by only a few highly abundant transcripts. For example, in the GSM125353 SAGE catalogue mentioned above, 91.9% of the cumulative (total) expression of the assayed imprinted genes is represented by a single gene, namely, PTPN14 (ACTTTTTCAA tag). Similarly, in the GSM383893 SAGE catalogue (gallbladder tubular adenocarcinoma [17, 23]), the same gene constitutes 86.6% of the cumulative (total) expression of the assayed imprinted genes. In many other SAGE catalogues, the expression profile of the assayed imprinted genes was more balanced. For example, in the GSM383840 SAGE catalogue (mammary myoepithelium, CD10+ cells [24]), PTPN14 constitutes just 8.7% of the cumulative (total) expression of the assayed imprinted genes, equal to the share of the GNAS gene (ATTAACAAAG tag). Some imprinted genes were expressed almost ubiquitously across the samples, for example, NDUFA4, RPL22, Q8NE65, PTPN14, GNAS, and RAB1B (Supplementary Table 3). Notably, the expression of other imprinted genes either was not detected in any of the 492 SAGE catalogues screened (EVX1, ACGCCCGTGG tag) or was detected only occasionally (Supplementary Table 3). For example, expression of the DUX2 gene (AAGGGGTGGA tag) was detected only 3 times (at a minimal level) in the 492 SAGE catalogues representing cell and tissue samples in a variety of physiological and pathological conditions: namely, in the GSM383692 SAGE catalogue (astrocytoma grade II [25]), the GSM383867 SAGE catalogue (colon carcinoma cell line [17, 23]), and the GSM383928 SAGE catalogue (ovary preneoplasia cell line [26]). Similarly rare was the expression of FAM75D1 (detected only 3 times altogether) and of FAM77D, ISM1, FLJ20464, and Q8NB05 (detected only 5 times, in all cases at a minimal level).
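
The share of the cumulative imprinted gene expression contributed by a single gene, as quoted above for PTPN14, is a simple ratio. A minimal sketch follows; the function name `dominant_gene_share` and the tpm values are made up for illustration and are not taken from any catalogue.

```python
def dominant_gene_share(gene_tpm):
    """Return (gene, percent) for the largest contributor to the cumulative
    expression of a gene set within one catalogue, or None if the set is
    not expressed at all.

    gene_tpm maps gene symbols to tpm values (hypothetical layout).
    """
    total = sum(gene_tpm.values())
    if total <= 0:
        return None
    top_gene = max(gene_tpm, key=gene_tpm.get)
    return top_gene, gene_tpm[top_gene] * 100 / total


# Made-up tpm values, for illustration only.
example_profile = {"PTPN14": 900.0, "GNAS": 60.0, "NDUFA4": 40.0}
example_share = dominant_gene_share(example_profile)
```

In this made-up profile a single gene contributes 90% of the cumulative expression of the set, which is the kind of imbalance described for the GSM125353 catalogue.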

To assess variation in the expression of individual imprinted genes in the samples, clustering analysis of the normalized expression profiles was performed using the EPCLUST (Expression Profile data CLUSTering and analysis) software. For each dataset, the number of clusters was set to the lowest value yielding one cluster containing a solitary database entry: 5 for cancer tissue, 6 for normal tissues and cells, 5 for cells cultured in vitro, and 2 for nontumorous disease tissue and cells (Figures 3–7). Notable diversity was observed in the transcription profiles represented by the individual clusters, with relatively high expression levels characteristic of just 1-2 or of a larger number of individual imprinted genes (Figures 4(a), 4(b)–7(a), and 7(b)). As expected, in a few cases samples generated from the same tissues or cell types fell into the same compact cluster with a distinct pattern (Figure 3, Figures 4(a), 4(b)–7(a), and 7(b)). However, in many other cases the imprinted gene expression profiles of the same or similar tissue or cell types fell into different clusters. Similarly, although in many cases the imprinted gene expression profiles of the same or similar tissue or cell types fell into closely matching areas of the hierarchical trees built for the individual datasets (clusters C, N, IV, and D) (Figures 4(c), 5(c), 5(d), 6(c), and 7(c)), in other cases notable variability was observed in the distribution of such profiles. For example, in K-means clustering, the small cluster 3 in the cancer tissue dataset (3 entries) is composed entirely of neuroblastoma samples (Figures 4(a) and 4(b)); however, other entries representing tumors with the same histological properties [27] fell into cluster 1 (composed of 141 entries in total).
Cluster 4 in the same dataset (12 entries) is composed entirely of carcinoma samples, while cluster 2 (28 entries) is composed predominantly of carcinoma samples (19 entries), with the other samples representing astrocytoma (3 entries), glioblastoma multiforme (2 entries), cystadenoma (1 entry), rhabdosarcoma (1 entry), and unclassified breast cancer (2 entries). Similarly, only one cluster in the normal tissue and cell dataset has a homogeneous composition (cluster 5, 2 entries), comprising both available SAGE libraries constructed from placenta (GSM14849, also designated GSM383945; GSM14750, also designated GSM383947 [17]) (Figure 3), with all other clusters composed of samples of diverse origins. Illustratively, this particular cluster breaks down (i.e., its content is redistributed to smaller clusters) only if the number of K-means clusters for the dataset is increased from the set value of 6 to 26, while some other clusters break down more readily. In the hierarchical trees, the most densely packed areas (representing the most similar transcription profiles) are generally composed of samples of the same or similar tissue or cell types. For example, one of the densest areas in the four hierarchical trees built is composed of 19 samples matching bronchial brushings (Figures 5(c) and 5(d)) [22], with all 5 other samples of the same origin falling in the nearest vicinity within the hierarchical tree (Figure 5(c)). At the same time, some SAGE libraries representing samples of identical origin fell into separate K-means clusters and into well-separated areas of the hierarchical tree. This was observed, for example, for the 3 available peripheral retina samples, of which GSM572 and GSM573 [28] fell into cluster 3 and GSM383968 [29] fell into cluster 1 (Figure 4).

4. Discussion

The mechanism of genomic imprinting plays an important, yet not fully understood, role in many physiological processes, in particular in the control of growth and development. Since the identification of the first imprinted genes (IGF2, IGF2R, and H19) in the mouse in 1991, a large volume of information has accumulated on the identity and biological function of imprinted genes both in Homo sapiens and in animal species (Mus musculus in particular). Over the past decade, the list of established imprinted genes has expanded [6, 30]. It is most probable that novel candidate imprinted genes will be identified in the future and that the imprinted status of some current candidates will be confirmed. In the current study, a comprehensive list of human imprinted genes and high-confidence gene candidates (203 entries in total) was subjected to large-scale in silico gene expression profiling. The available nucleotide sequences (174 genes and gene candidates) were used to extract the appropriate short SAGE tags matching the NlaIII anchoring enzyme, the one most commonly used in generating SAGE libraries. Notably, the candidate imprinted gene Q9NYI9 (PPARL) does not bear NlaIII recognition sites. This limitation of the conventional SAGE protocol can generally be overcome by using an alternative anchoring enzyme [16]. However, gene Q9NYI9 also lacks recognition sites for the anchoring enzymes Sau3AI and RsaI (the second and third most common in generating SAGE libraries), though it bears one for MmeI, which is used in the LongSAGE protocol. Taken together, not 174 but 173 genes (all except Q9NYI9 (PPARL)), including 53 established and 120 candidate imprinted genes, were annotated with the appropriate SAGE tags. These tags were matched against the pool of 492 normalized SAGE catalogues representing libraries derived from human samples, constructed using the NlaIII anchoring enzyme and together accounting for 35.97 million SAGE tags.
Collectively, these catalogues represent a comprehensive assay of tissues and cell types in physiological and a variety of pathological conditions. Expression of imprinted genes was assessed in the normalized SAGE catalogues representing the transcriptomes of these samples, using a straightforward algorithm of in silico analysis.

As with nearly any other gene, the expression of imprinted genes is not constant but rather a dynamic function of cell type and state. In the current study, great variability was observed both in the cumulative/total expression of the studied imprinted genes and in that of the individual genes. The cumulative expression of the 173 studied imprinted genes ranges from 0.08% (total blood) to 4.36% (bronchial epithelium) of the total gene expression (Table 3). In some samples (Table 3 and Supplementary Table 2), the proportion of the transcriptome associated with imprinted genes is clearly above what would be expected from such a limited group of genes, reflecting the importance of the biological roles played by the latter. At the same time, the overall expression of the imprinted genes was equal in the clusters of cancer tissues and of normal tissue and cells (clusters C and N, 0.95% for both clusters) and lower in the cells cultured in vitro (cluster IV, 0.77%).

The current study apparently represents the first attempt to estimate the impact of imprinted genes on the total volume of the transcriptome. Obvious biases affect the accuracy of the algorithm applied, leading to both underestimation (the probable existence of as yet unidentified imprinted genes, unavailable information on gene structure for some imprinted genes, and the absence of anchoring enzyme recognition sites in at least one gene) and overestimation (the unconfirmed imprinting status of some candidate imprinted genes and SAGE tags matching more than one gene; see Table 1) of the relative size of the imprinted transcriptome. Despite this, the data provided on the estimated cumulative/total expression of the known imprinted genes (whose number corresponds well to the predicted number of imprinted genes in the human genome [31, 32]) in a variety of tissues and cells are of considerable interest. Until now, little information has been available on the overall expression of imprinted genes in cells of different types. It is generally believed that many imprinted genes are highly expressed in developing and adult brain tissue [33], placenta [34], and undifferentiated stem cells [35]. Individual studies have identified certain highly expressed imprinted genes as potential biomarkers of cancer subtypes [36, 37]. In contrast, imprinted genes are known to be expressed at a relatively low level in adult blood cells [38]. This information is supported by the observed values of the cumulative expression of the imprinted genes across the screened samples (Table 3 and Supplementary Table 2): cumulative expression of the imprinted genes is generally high in many of the assessed brain-derived samples and low in blood samples. It was also observed earlier that major upregulation of the expression of numerous imprinted genes is associated with early differentiation and development rather than with the undifferentiated status of stem cells [39, 40].
Concordantly, in the current study, all 13 SAGE libraries generated from undifferentiated embryonic stem cells (ESCs), namely, lines HES3, HES4 [17, 23, 41], BG01, H1, H7, H9, H13, H14, and HSF6 [17, 23], uniformly demonstrate intermediate cumulative expression of the imprinted genes (Supplementary Table 2) and fit closely together in the hierarchical tree built for the corresponding cluster (cluster IV; Figure 5(c)). However, many samples with high cumulative expression of the imprinted genes do not fit into any of the groups listed above. The important role of genomic imprinting in particular normal cell and cancer subtypes, suggested by the high expression of these genes, should thus be the subject of follow-up studies. The expression of individual imprinted genes varies to an even greater extent in the samples screened. Expression of the candidate imprinted gene even-skipped homeobox 1 (EVX1) was not detected in any sample submitted to the analysis, while the expression of many more genes (DUX2, FAM75D1, Q8NB05, FLJ20464, ISM1, FAM77D, and others) was detected in only a few samples, always at a minimal level. In contrast, other imprinted genes (NDUFA4, RPL22, Q8NE65, GNAS, PTPN14, RAB1B, and others) were expressed in the majority of the samples screened, often at a high level (Supplementary Table 3).

Illustratively, notable variation in the cumulative expression of the imprinted genes and in the expression of individual imprinted genes is observed in cells cultured in vitro, including cells of the same type (e.g., numerous medulloblastoma, glioblastoma multiforme, and breast carcinoma cell lines) (Supplementary Table 2 and Figure 6). This observation further supports the earlier suggestion that cell culture conditions contribute to the maintenance or alteration of imprinted gene expression [42, 43].

Taken together, the current study screened the normalized expression profiles of a comprehensive panel of established and candidate imprinted genes within the publicly available human SAGE datasets and is the first to estimate the prevalence of imprinted genes within the total human transcriptome on a large scale. This paper thus provides a useful reference on the relative size of the imprinted transcriptome and on the expression of the individual imprinted genes.

Acknowledgments

This study was supported by grants from the Ministry of Education and Science of the Russian Federation, Federal target program “Research and Pedagogical Cadre for Innovative Russia” for 2009–2013 (State Contract 14.740.11.0004), and by the MCB Program, Russian Academy of Sciences. The author is grateful to all GEO database contributors.

Supplementary Materials

The Supplementary Material provides key properties of the established and candidate imprinted gene subset within the SAGE datasets.

  1. Supplementary Tables