Computational Systems Biology Methods in Molecular Biology, Chemistry Biology, Molecular Biomedicine, and BiopharmacyView this Special Issue
Research Article | Open Access
A Quantitative Analysis of the Impact on Chromatin Accessibility by Histone Modifications and Binding of Transcription Factors in DNase I Hypersensitive Sites
It is known that chromatin features such as histone modifications and the binding of transcription factors exert a significant impact on the “openness” of chromatin. In this study, we present a quantitative analysis of the genome-wide relationship between chromatin features and chromatin accessibility in DNase I hypersensitive sites. We found that these features show distinct preference to localize in open chromatin. In order to elucidate the exact impact, we derived quantitative models to directly predict the “openness” of chromatin using histone modification features and transcription factor binding features, respectively. We show that these two types of features are highly predictive for chromatin accessibility in a statistical viewpoint. Moreover, our results indicate that these features are highly redundant and only a small number of features are needed to achieve a very high predictive power. Our study provides new insights into the true biological phenomena and the combinatorial effects of chromatin features to differential DNase I hypersensitivity.
In eukaryotes, DNA is organized into chains of nucleosomes, each of which consists of about 146 bp of DNA wrapped around an octamer of four types of histones . The packaging of chromatin into nucleosomes provides a repressive environment for many DNA-binding proteins and plays an important role in the regulation of transcription . However, some domains in chromatin are depleted of nucleosomes and exhibit highly accessible structure. These nucleosome-free regions are supersensitive to the cleavage of DNase I  and are known as DNase I hypersensitive sites (DHSs). They are predominantly found in many active genes and cis-regulatory elements . The dynamic alterations of “openness” in chromatin play important roles in many biological processes, including transcription , replication , and differentiation .
Traditionally, the experimental technique of choice to discover the DNase I hypersensitive sites is Southern blotting . However, this low-throughput method is not able to study large chromosomal regions at a time and cannot represent the “openness” of chromatin in a quantitative manner. The significance of differential accessibility in DNase I hypersensitive sites is unknown, but it may reflect some important biological phenomena like histone modifications and protein occupation . Even until now genome-wide quantitative analyses of the relationship between chromatin accessibility and chromatin features in DNase I hypersensitive sites are rare. By taking advantage of the abundant datasets of the ENCODE project , we analyzed genome-wide localization data of DNase I hypersensitive sites and 33 chromatin features in human embryonic stem cell (H1hesc) cell line. All datasets were generated by recently developed genome-wide high throughput experimental techniques, such as Chip-seq [10, 11] and DNase-seq .
It is generally accepted that histone modifications and the binding of transcription factors are two main effectors for the “openness” of chromatin. Previous studies have shown that histone modifications and transcription factors tend to occur near or just in the DNase I hypersensitive sites [8, 13]. Recently, two studies, one in K562 cell line and the other in Drosophila embryonic cells, have demonstrated that transcription factor binding sites and the chromatin accessibility are highly correlated with each other [6, 13]. Although these studies provided important information, so far, quantitative analysis of the combinatorial effects of different chromatin features and the biological significance of differential hypersensitivity is still unclear. In this work, we built support vector regression (SVR) models to directly predict the “openness” of chromatin in DNase I hypersensitive sites using combined chromatin features. Our results indicate that both histone modification features and transcription factor binding features are predictive for chromatin accessibility with high accuracy and these chromatin features are highly redundant.
2. Materials and Methods
All datasets are from ENCODE project, which aims to build a comprehensive list of functional elements in the human genome . The 10 histone modifications (HMs) and binding sites of 23 transcription factors (TFs) were quantified using Chip-seq and downloaded from the tracks of UCSC Genome Browser at ENCODE/Broad Institute and ENCODE/Stanford/Yale/USC/Harvard. The chromatin accessibility dataset was measured using DNase-seq and downloaded from ENCODE/OpenChrom (Duke/UNC/UTA). Each dataset includes the genome-wide sequencing signals and regions of statistically enriched signal (peaks). Peaks can be viewed as locations of chromatin features and DNase I hypersensitive sites, respectively, and the values of DNase-seq signals represent chromatin accessibility. We must note that DNase I hypersensitivity could not be simply viewed as binary property (peaks versus nonpeaks) but rather continuous values (sequencing signals) representing differential chromatin accessibility. These datasets come from the common H1hesc cell line.
2.2. Mapping HM and TF Binding Peaks to the DNase I Hypersensitive Sites
We obtained genomic locations of 33 chromatin feature profiles, all together including 582489 histone modification peaks (10 HMs) and 443217 transcription factor binding peaks (23 TFs). For each profile, we mapped the peaks of feature onto the genome and examined whether it localized in open chromatin or not. The presence or absence of chromatin feature within accessible chromatin was decided by overlap or nonoverlap with DNase-seq peaks. If there was any amount of overlap within accessible chromatin (DNase-seq peaks), we counted as a presence . Then, we calculated the percentage of the peaks occurring in the DNase I hypersensitive sites for each feature.
2.3. Supervised Learning Methods for Chromatin Accessibility Prediction
To investigate the quantitative relationship between chromatin accessibility and these chromatin features in DNase I hypersensitive sites, we constructed support vector regression (SVR) models for HM and TF binding features, respectively. Concretely, in every DNase I hypersensitive site, we calculated the maximum signal of DNase-Seq and the corresponding maximum signal of Chip-Seq for each chromatin feature. For the sake of figuring out whether the maximum signal exhibits largest prediction power or not, as a comparison, we also calculated the average signal of Chip-seq and DNase-seq for each hypersensitive region. Then, SVR model was built to predict the chromatin accessibility using signals of these chromatin features. SVR is a machine learning algorithm based on statistical theory for regression problems [14, 15]. We implemented this algorithm using the “e1071” package .
In order to reduce the computation cost, we randomly selected 5000 DHSs for our samples. The sample size analysis indicated that the prediction power increased only moderately after the size reached 2000. So, the sample size of 5000 is big enough to represent the entire dataset (Supporting Information S1 which is available online at http://dx.doi.org/10.1155/2013/914971). We used the 10-fold cross-validation method to evaluate the prediction power. Specifically, we randomly split our sample dataset into 10 equal size subsets. Among them, 9 subsets were used as training data and the remaining subset was treated as the validation data for testing the model. This process was repeated 10 times and each subset could only be used once as the validation data. After that, we combined the results and plotted the regression relationship between predicted signals and the actual DNase-seq signals. Then, the coefficient of determination ()  was computed indicating how well these data points fit the line. is also a frequently used measure of the proportion of total variation of outcomes explained by the model. We chose the square root of the coefficient of determination as our prediction power.
2.4. Analysis of the Importance for Each Chromatin Feature and the Combinatorial Effects of Different Features
To estimate which feature exhibits the maximal prediction power, we predicted the chromatin accessibility using only one feature. And to investigate whether HM features and TF binding features are redundant, we next predicted the “openness” of chromatin using all features. We also explored the combinatorial effects of these features. All possible one-feature , two-feature , and three-feature models were evaluated by their performance.
2.5. Model Comparison Analysis
Instead of SVR algorithm, we also explored the quantitative relationship between chromatin features and chromatin accessibility with linear regression model. Similarly, HM features, TF binding features, and HM+TF feature combinations were applied to linear regression model, respectively. The coefficient of determination of the predicted signals and the actual DNase-seq signals were calculated and compared with the SVR models. In order to identify whether the maximum signals or the average signals exhibit largest prediction power, we also applied these models with the average signals of chromatin features to predict the average signals of DNase-seq.
3.1. The Localization Preference of Chromatin Features
We analyzed genome-wide localizations of 33 Chip-seq profiles in the human embryonic stem cell line (H1hesc) from ENCODE project , including 10 histone modifications, and the binding sites of 23 transcription factors. For each profile, we mapped the peaks of Chip-seq dataset to the DNase I hypersensitive sites (see Section 2). Figure 1 shows the percentage of the peaks within the accessible chromatin for each feature. We observed that different chromatin feature exhibits different preference to chromatin accessibility. For histone modifications, H3k4me3 exerts the largest preference of accessible chromatin. 82.2% H3k4me3 peaks located in DHS. On the contrary, most H3k9me3 occurred out of DHS (93.7%), which indicated that H3k9me3 was associated with heterochromatin . Compared to histone modifications, a majority of transcription factors tend to bind onto accessible chromatin, which suggests that the process of transcription requires an open chromatin structure . The mean percentage of transcription factors locating in DHS is 60.5%, higher than that of histone modifications (45.1%).
3.2. Predicting Chromatin Accessibility Using Histone Modification Features
In order to examine the quantitative relationship between chromatin accessibility and HM features in a combinatorial manner, we constructed SVR model to predict the “openness” of chromatin in DNase I hypersensitive sites using all histone modification features. We can see from Figure 2(a) that there is a linear relationship between predicted signals and the actual DNase-Seq signals. The coefficient of determination is 0.58 indicating that histone modification features explain about 58% variance of chromatin accessibility.
We next examined the prediction power for every histone modification feature. Figure 2(b) shows that H3k4me2, H3k4me3, and H3k9ac exhibit the most important effects to chromatin accessibility (, resp.). These histone modifications are generally enriched in the promoters of expressed genes  and the open chromatin structure plays an important role in regulating the complex transcription process. On the other hand, H3k9me3 and H3k36me3 exhibit the least prediction powers (, resp.), which suggests that these modifications are associated with heterochromatin [21, 22]. Interestingly, H3k27ac and H4k20me1, which are the most predictive histone modifications for gene expression levels , are not the most important features associated with chromatin accessibility.
3.3. Predicting Chromatin Accessibility Using Transcription Factor Binding Features
Previous studies have shown that transcription factors tend to bind onto open chromatin and they are highly correlated with each other [6, 13]. To investigate the quantitative relationship of the binding of transcription factors and the chromatin accessibility in a combinatorial manner, we next applied our SVR model to all TF binding features. As shown in Figure 3(a), the TF model achieves a coefficient of determination of 0.58 which is equal to that achieved by HM model. These TF binding features can also explain about 58% variance of chromatin accessibility.
For the prediction power of particular TF binding feature, there is a difference with that of histone modifications; that is, most transcription factors exhibit important effects to chromatin accessibility (Figure 3(b)). This is consistent with their functions because transcription factors directly control the complex transcription process  which requires an open chromatin environment. However, a small group of features exhibit lower prediction powers, such as SUZ12, CTCF, and ZNF274 (, resp.). ZNF274 and SUZ12 are known to be transcriptional repressors [25, 26]. CTCF has many roles, such as transcriptional repression, insulator function, and imprinting genetic information . These factors are not so important to contribute to the “openness” of chromatin.
3.4. Chromatin Features Are Highly Redundant to Chromatin Accessibility
The previous analyses suggest that both histone modification features and transcription factor binding features are predictive for chromatin accessibility with high accuracy in DNase I hypersensitive sites. So, there is a question that whether the prediction power will increase if we use all these features. To address this question, we directly predicted the “openness” of chromatin using all features. As shown in Figure 4(a), the coefficient of determination is a little higher (8%) than using only HM or TF binding features, which indicates that these two types of features are highly redundant. To check the importance of different features and their combinatorial effects, we tried to build models with all possible combinations of one to three features (Figure 4(b)). Focusing on the three-feature combinations (5456 models), we found that the least prediction power combinations (H3k36me3, H3k9me3, and ZNF274, ) could achieve about 56% prediction power of the full model . And there are 110 combinations achieving more than 90% prediction power of the full one. These analyses indicate that most of these features are highly redundant for chromatin accessibility.
By examining the 110 high prediction power combinations, we found that seven chromatin features, H3k4me2, H3k4me3, H3k9ac, H4k20me1, SIN3A, ZNF143, SUZ12, were significantly enriched (, hypergeometric test) in the set of 110 models. Interestingly, all these features showed high prediction powers in the one-feature models except H4k20me1 and SUZ12. H4k20me1 is a particular one, which has been reported for the most predictive histone modification for gene expression . SUZ12 is a part of Polycomb Repressive Complex 2 (PRC2) and may be involved in chromatin silencing with noncoding RNA . The mechanisms of how SUZ12 influences chromatin structure are unknown; however, it may exert distinct impact on chromatin accessibility compared with other features.
3.5. Comparison with Other Models
In this study, we chose the SVR algorithm and the maximum signal in every hypersensitive region to model the relationship between chromatin features and chromatin accessibility. Generally, the SVR algorithm is a nonlinear regression method. We also have explored modeling using linear regression model and the average signal in every region. As shown in Table 1, prediction powers of models using average signal are significantly lower than the corresponding maximum signal models. And in either situation, the SVR models exhibit higher prediction power than linear models. Our results indicate that the “openness” of chromatin is determined by the maximum signal of features and their relationships are assumed as a nonlinear relevance.
In this work, we presented quantitative analyses of the relationship of histone modifications and the binding of transcription factors to chromatin accessibility separately and combinedly in DNase I hypersensitive sites. We first examined the percentage of feature peaks within DNase hypersensitive sites (DHSs) in human embryonic stem cell (H1hesc) line. We found that different chromatin features showed different location preference in DHS. This may be due to the particular function of different chromatin features. Thurman et al. have done similar analysis in K562 cell line  for TF binding features. In our analysis, we find that the percentage of transcription factors within DHS is significantly lower in the H1hesc cell line than that in K562 cell line. The reason may be as follows: in order to maintain the “stemness” state, most genes are repressed in the stem cell compared to the cancer cell line K562. This phenomenon means that the degree to which chromatin features occur in accessible chromatin may differ according to different cellular circumstances.
Our results demonstrate that both histone modification (HM) features and transcription factor (TF) binding features account for nearly 58% variance of chromatin accessibility in H1hesc cell line. For histone hallmarks, many activators of gene expression exhibited important impact on the “openness” of chromatin, such as H3k4me  and histone acetylations . The hallmarks of repressors for gene expression such as H3k9me3  show lower prediction powers. Unexpectedly, the transcription elongation hallmark H3k36me3  shows the least prediction power. This is consistent with the viewpoint of a recently published paper. Chantalat et al.  argued that H3k36me3 is associated with constitutive and facultative heterochromatin. For TF binding features, the majority of TFs showed an important impact on chromatin accessibility except some transcriptional repressors, such as ZNF274 and SUZ12. This may indicate that the complex transcription process requires open chromatin environment .
It is generally accepted that cellular factors regulate the complex dynamic change of chromatin structure in a collective manner. We have shown that these features is highly redundant to predict chromatin accessibility and a small subgroup of features are able to achieve a very high prediction power. However, the mechanism of how these features cooperatively impact the openness of chromatin is still unclear, and we must note that our analysis could not reveal the “cause” or “consequence” relationship of HM and TF binding features to chromatin accessibility. Histone modifications play an important role in creating and maintaining the accessible chromatin environment  and may act as docking sites for transcription factors . Some pioneer TFs tend to bind onto the genome and create an accessible site, such as FoxA1  which is the best known pioneer transcription factor. Then, more transcription factors tend to bind onto the opening site and the DNase I hypersensitive site is created. As an extension, future work could explore the mechanisms of how these features cooperatively regulate open chromatin structure and their causal relationships, based on increased datasets.
We present genome-wide quantitative analysis of the impact of chromatin features to chromatin accessibility in DNase I hypersensitive sites. Our findings indicate that both histone modifications and the binding of transcription factors could explain nearly 58% variation of the “openness” of chromatin structure. The combinatorial effect analyses reveal that these chromatin features are highly redundant for prediction and H3k4me2, H3k4me3, H3k9ac, SIN3A, and ZNF143 show closest association with chromatin accessibility. Our results provide insights into the systematic effects of chromatin features to differential chromatin accessibility.
|DHS:||DNase I hypersensitive site|
|SVR:||Support vector regression.|
This work was funded by National Key Basic Research Program (2010CB912702, 2011CB910204), National Hi-Tech Program (2012AA020201), Key Infectious Disease Project (2012ZX10002012-014), and National Natural Science Foundation of China (31070752, 31000582).
Online Supporting Information S1. The prediction powers using all features with different sample sizes. In each sample size, we randomly sampled 500 times and modeling according to the procedure discussed in method 2.3 in each sample.
Online Supporting Information S2. The input maximum signal dataset including the maximum signals of Chip-seq and DNase-seq in 5000 DNase I hypersensitive sites.
Online Supporting Information S3. The input average signal dataset including the average signals of Chip-seq and DNase-seq in 5000 DNase I hypersensitive sites.
Online Supporting Information S4. The prediction powers (R) using only one feature.
Online Supporting Information S5. The prediction powers (R) using all possible two-feature combinations.
Online Supporting Information S6. The prediction powers (R) using all possible three-feature combinations.
Online Supporting Information S7. The related R codes are available.
- K. Luger, A. W. Mäder, R. K. Richmond, D. F. Sargent, and T. J. Richmond, “Crystal structure of the nucleosome core particle at 2.8 Å resolution,” Nature, vol. 389, no. 6648, pp. 251–260, 1997.
- J. D. Anderson and J. Widom, “Sequence and position-dependence of the equilibrium accessibility of nucleosomal DNA target sites,” Journal of Molecular Biology, vol. 296, no. 4, pp. 979–987, 2000.
- C. Dingwall, G. P. Lomonossoff, and R. A. Laskey, “High sequence specificity of micrococcal nuclease,” Nucleic Acids Research, vol. 9, no. 12, pp. 2659–2674, 1981.
- D. S. Gross and W. T. Garrard, “Nuclease hypersensitive sites in chromatin,” Annual Review of Biochemistry, vol. 57, pp. 159–197, 1988.
- P. N. Cockerill, “Structure and function of active chromatin and DNase I hypersensitive sites,” FEBS Journal, vol. 278, no. 13, pp. 2182–2210, 2011.
- X.-Y. Li, S. Thomas, P. J. Sabo, M. B. Eisen, J. A. Stamatoyannopoulos, and M. D. Biggin, “The role of chromatin accessibility in directing the widespread, overlapping patterns of Drosophila transcription factor binding,” Genome Biology, vol. 12, article R34, 2011.
- Q. Lu and R. Bruce, “DNaseI hypersensitivity analysis of chromatin structure,” in Epigenetics Protocols, pp. 77–86, Humana Press, 2004.
- A. P. Boyle, S. Davis, H. P. Shulha et al., “High-resolution mapping and characterization of open chromatin across the genome,” Cell, vol. 132, no. 2, pp. 311–322, 2008.
- I. Dunham, A. Kundaje, S. F. Aldred et al., “An integrated encyclopedia of DNA elements in the human genome,” Nature, vol. 489, pp. 57–74, 2012.
- P. J. Park, “ChIP-seq: advantages and challenges of a maturing technology,” Nature Reviews Genetics, vol. 10, no. 10, pp. 669–680, 2009.
- E. R. Mardis, “ChIP-seq: welcome to the new frontier,” Nature Methods, vol. 4, no. 8, pp. 613–614, 2007.
- L. Song and G. E. Crawford, “DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells,” Cold Spring Harbor Protocols, 2010.
- R. E. Thurman, E. Rynes, R. Humbert et al., “The accessible chromatin landscape of the human genome,” Nature, vol. 489, pp. 75–82, 2012.
- C. Nello and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.
- C. Cheng, K.-K. Yan, K. Y. Yip et al., “A statistical framework for modeling gene expression using chromatin features and application to modENCODE datasets,” Genome Biology, vol. 12, no. 2, article R15, 2011.
- D. Evgenia, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel, “Misc functions of the Department of Statistics (e1071), TU Wien,” R package pp. 1–5, 2008.
- B. S. Everitt, Cambridge Dictionary of Statistics, Cambridge University Press, New York, NY, USA, 2nd edition, 2002.
- J. Bartkova, P. Moudry, Z. Hodny, J. Lukas, R. D. Meyts, and J. Bartek, “Heterochromatin marks HP1γ, HP1α and H3K9me3, and DNA damage response activation in human testis development and germ cell tumours,” International Journal of Andrology, vol. 34, no. 4, pp. e103–e113, 2011.
- D. Sproul, N. Gilbert, and W. A. Bickmore, “The role of chromatin structure in regulating the expression of clustered genes,” Nature Reviews Genetics, vol. 6, no. 10, pp. 775–781, 2005.
- K. Regha, M. A. Sloane, R. Huang et al., “Active and repressive chromatin are interspersed without spreading in an imprinted gene cluster in the mammalian genome,” Molecular Cell, vol. 27, no. 3, pp. 353–366, 2007.
- R. Margueron and D. Reinberg, “Chromatin structure and the inheritance of epigenetic information,” Nature Reviews Genetics, vol. 11, no. 4, pp. 285–296, 2010.
- S. Chantalat, A. Depaux, P. Héry et al., “Histone H3 trimethylation at lysine 36 is associated with constitutive and facultative heterochromatin,” Genome Research, vol. 21, no. 9, pp. 1426–1437, 2011.
- R. Karlić, H.-R. Chung, J. Lasserre, K. Vlahoviček, and M. Vingron, “Histone modification levels are predictive for gene expression,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 7, pp. 2926–2931, 2010.
- G. Gill, “Regulation of the initiation of eukaryotic transcription,” Essays in Biochemistry, vol. 37, pp. 33–43, 2001.
- J. L. Rinn, M. Kertesz, J. K. Wang et al., “Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs,” Cell, vol. 129, no. 7, pp. 1311–1323, 2007.
- K. Yano, N. Ueki, T. Oda, N. Seki, Y. Masuho, and M.-A. Muramatsu, “Identification and characterization of human ZNF274 cDNA, which encodes a novel Kruppel-type zinc-finger protein having nucleolar targeting ability,” Genomics, vol. 65, no. 1, pp. 75–80, 2000.
- K. L. Dunn and J. R. Davie, “The many roles of the transcriptional regulator CTCF,” Biochemistry and Cell Biology, vol. 81, no. 3, pp. 161–167, 2003.
- B. M. Turner, “Histone acetylation and an epigenetic code,” BioEssays, vol. 22, no. 9, pp. 836–845, 2000.
- S. M. Fuchs, R. N. Laribee, and B. D. Strahl, “Protein modifications in transcription elongation,” Biochimica et Biophysica Acta, vol. 1789, no. 1, pp. 26–36, 2009.
- J. Marx, “Molecular biology. Protein tail modification opens way for gene activity,” Science, vol. 311, no. 5762, p. 757, 2006.
- O. Bell, V. K. Tiwari, N. H. Thomä, and D. Schübeler, “Determinants and dynamics of genome accessibility,” Nature Reviews, vol. 12, pp. 554–564, 2011.
- L. A. Cirillo, F. R. Lin, I. Cuesta, D. Friedman, M. Jarnik, and K. S. Zaret, “Opening of compacted chromatin by early developmental transcription factors HNF3 (FoxA) and GATA-4,” Molecular Cell, vol. 9, no. 2, pp. 279–289, 2002.
Copyright © 2013 Peng Cui et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.