Previous analyses of rearranged immunoglobulin (Ig) variable genes (VDJs) concluded that the mechanism of Ig somatic hypermutation (SHM) involves the Ig pre-mRNA acting as a copying template resulting in characteristic strand biased somatic mutation patterns at A:T and G:C base pairs. We have since analysed cancer genome data and found the same mutation strand-biases, in toto or in part, in nonlymphoid cancers. Here we have analysed somatic mutations in a single well-characterised gene TP53. Our goal is to understand the genesis of the strand-biased mutation patterns in TP53—and in genome-wide data—that may arise by “endogenous” mechanisms as opposed to adduct-generated DNA-targeted strand-biased mutations caused by well-characterised “external” carcinogenic influences in cigarette smoke, UV-light, and certain dietary components. The underlying strand-biased mutation signatures in TP53, for many non-lymphoid cancers, bear a striking resemblance to the Ig SHM pattern. A similar pattern can be found in genome-wide somatic mutations in cancer genomes that have also mutated TP53. The analysis implies a role for base-modified RNA template intermediates coupled to reverse transcription in the genesis of many cancers. Thus Ig SHM may be inappropriately activated in many non-lymphoid tissues via hormonal and/or inflammation-related processes leading to cancer.

1. Introduction

A major goal of this paper is to provide an explanation of the origin of the main strand-biased mutation signatures observed in the TP53 tumor suppressor gene in the many tumors likely to arise by “endogenous” mutation processes: that is to say, those cancers not caused by well-known exogenous mechanisms such as exposure to carcinogens in tobacco smoke (Benzo(a)pyrene, G-to-T), toxins in food contamination (Aflatoxin B1, G-to-T; Aristolochic acid, A-to-T), or UV radiation in sun exposure causing DNA photoproducts such as cyclobutane pyrimidine dimers, C-to-T reviewed in Soussi [1]. The TP53 mutation pattern in “All Breast Cancers” has been chosen as representative of the TP53 “endogenous pattern” as this mutation pattern appears to arise in a tissue “least accessible to carcinogens in tobacco smoke” or directly exposed to such exogenous carcinogens; see Hainaut and Pfeifer [2]. There is also a large number of TP53 point mutations in this tissue category (>1000), similar to the numbers in “All Lung Cancers,” the major comparator in the analysis. Further, this basic “endogenous” pattern is evident in many tumors outside of lung, head, neck, and oesophagus. All of these can be considered as “directly accessible to tobacco smoke carcinogens.” This choice is made despite the known complexity of breast cancer in both etiology and the diversity of histological subtypes [1] as a similar pattern is evident in mutated TP53 variants in “All Bladder Cancers” which are likely to be more directly exposed to tobacco smoke-derived carcinogenic metabolites in urine.

Our analysis shows that the underlying “endogenous” strand-biased mutation signatures in TP53, for many different non-lymphoid cancers, bear a striking resemblance to the Ig SHM pattern. This allows inferences to be drawn about the mechanistic role of TP53-mediated DNA repair regulation and base-modified RNA template intermediates coupled to reverse transcription in the genesis of many cancers. It is also consistent with the view that the normally tightly regulated mutation processes targeting VDJ genes in B lymphocytes may, following further loss of DNA damage response regulation by TP53, be inappropriately turned on in non-lymphoid tissues, for example, by hormonal and/or inflammation-related processes, leading to cancer.

2. Background

2.1. Caveat and Writing Strategy

The writing strategy of this paper is to provide as clear an introduction as possible to what is currently known of the immune system’s somatic mutation mechanism, as we believe that this mechanism is relevant to understanding the role of somatic mutations in the pathology of cancer. This is an unexplored topic for most scientists, particularly in the field of cancer biology, although it is very topical now given renewed interest in the regulation of inflammatory responses initiating somatic mutations and thus cancer (see later). However, there is a caveat to this analysis that should be highlighted right at the start: strand-biased mutation spectra, although very informative and of strong inferential value with respect to molecular mechanisms, provide little information about the initial events that precede malignant transformation, as malignant cells grow rapidly and are exposed to strong selection. This means “first causes” cannot be precisely defined. Nevertheless, the clear possibility that immunoglobulin somatic hypermutation may be one of these “first causes” promoting somatic mutation, both across the genome and in key gene regulators such as TP53, is worth pursuing in its own right, as the implications have wide and interesting ramifications for the future directions of cancer research.

2.2. Somatic Hypermutation in Rearranged Ig Genes

The mechanism of SHM of Ig VDJ genes is now well understood and many molecular steps are known or can be plausibly inferred [3, 23, 24]. More recently this knowledge has been applied to the etiology of cancer. What we discovered in a preliminary analysis was that the characteristic strand-biased mutation signatures of Ig SHM were present, in toto or in part, in a number of somatic mutation datasets posted at the Wellcome Trust Sanger Institute’s website run by the institute’s Cancer Genome Project [4]. Indeed the possibility has often been discussed that dysregulation of SHM, which is driven by activation-induced cytidine deaminase (AID) conversion of cytosine to uracil (C-to-U) in DNA and is normally confined to antigen-stimulated B lymphocytes in postantigenic Germinal Centers, could lead to somatic mutations and translocations in non-Ig genes, thus contributing to oncogenesis [21, 25–27].

The aim here is to use these insights from the Ig SHM field to help explain the strand-biased mutation signatures in TP53 in human nonlymphoid tumors arising by as yet unknown “endogenous” mechanisms. We have analysed in detail strand-biased TP53 mutation signatures in breast, bladder, and lung cancers. These are exemplars of oncogenesis in tissues exposed directly to carcinogens in tobacco smoke (lung) or indirectly exposed to such carcinogenic metabolites (bladder via urine) versus cancers arising in tissues such as breast generally considered “as least accessible to tobacco smoke” and other known exogenous agents [1, 2, 28].

We have also analysed and compared strand-biased mutation signatures and mutation patterns in other cancerous tissues. The analytical reviews by Soussi [1] and Pfeifer and Besaratinia [28] have proved valuable and we recommend these papers and related literature be read in association with the present analysis.

2.3. Strand-Biased Mutation Signatures in Ig Genes

Strand-biased mutation signatures in Ig VDJ gene loci, particularly at A:T base pairs, have been recognised for over 25 years [29]. More recently, published data (1984–2008) on mutated mouse VDJ regions and their 3′ JH4-flanks have been analysed [3]. In this study, somatic mutation data from thirty-two independent studies are summarised and are available in a meta-analysis in the Supplementary Data. Little significant new data has been published in the SHM field since then that changes the basic patterns shown in Table 1 or their interpretation.

The essence of this analysis established that the Ig SHM mutation pattern free of “PCR recombinant artefacts” reveals several significant strand-biased mutation signatures at A:T and G:C base pairs (Table 1). The first is at A:T base pairs where mutations of A exceed mutations of T by almost threefold, particularly the dominant strand bias of A-to-G exceeding T-to-C mutations. The second main strand bias is at G:C base pairs where mutations of G exceed mutations of C by at least 1.7-fold. The dominant strand bias here concerns G-to-A exceeding C-to-T mutations.

The distortion in the DNA sequence data contaminated with PCR-recombinant artefacts (Table 1) has previously masked the clear strand bias at G:C base pairs. This distortion has also made it difficult for the field to develop its understanding of the mutator mechanisms operative on A:T and G:C base pairs during Ig SHM in vivo [3, 30].

The significant presence of PCR recombinants, also referred to as PCR hybrids or mosaic heteroduplexes, at the end of a PCR run is due to Taq or Pfu polymerase denaturation generating incomplete extension products that act as forward and reverse primers during amplification cycles. They arise from PCR runs where amplification is from multiple similar templates using the same set of primers. For this reason, the inevitable presence of such sequence artefacts causes blunting, if not complete ablation, of strand-biased mutation signatures following cloning in E. coli and then sequencing of PCR inserts. Such problems are avoided by reducing PCR cycle numbers, by sequencing target VDJ genes expressed in hybridoma clones, or by direct sequencing of amplification products from single VDJ loci expressed in FACS-sorted single B lymphocytes [3, 30–32].

As a consequence of these analyses, the reference strand-biased Ig SHM mutation pattern we will use as a comparator for the TP53 mutation data is shown in Table 1. The patterns in Table 1 are essentially free of the PCR hybrid artefacts that blunt strand biases.

For comparison, Table 1 also shows the previously cited pattern in the SHM field, where there is no strand bias evident at G:C base pairs. Note that in that pattern the magnitude of the ratio of mutations of A versus mutations of T (hereafter indicated as the A≫T ratio) is reduced from 2.8x to 1.9x. Thus there is significant blunting of the established A≫T mutation ratio (as well as the dominant and diagnostic A-to-G versus T-to-C ratio, below) and complete ablation of the lower (1.7x) yet significant G≫C strand-biased ratio (compare the two patterns in Table 1).

In summary, we now know that the SHM reference pattern shown in Table 1 is characterised by significant strand-biases evident for all Watson-Crick complements: A-to-T versus T-to-A; A-to-C versus T-to-G; A-to-G versus T-to-C and G-to-A versus C-to-T; G-to-T versus C-to-A; G-to-C versus C-to-G.
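The six pairings listed above follow mechanically from Watson-Crick complementarity: a substitution read on one strand appears as the complementary substitution when read on the other. A minimal Python sketch of that bookkeeping is given here. The function names and the `X>Y` string notation are illustrative choices made for this sketch and are not taken from the analysed datasets.

```python
# Illustrative only: function names and the "X>Y" notation are assumptions.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def strand_partner(substitution):
    """Return the same substitution as read on the complementary strand."""
    ref, alt = substitution.split(">")
    return f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"

def strand_bias_pairs():
    """List the six comparisons: each mutation of A or G versus its partner."""
    pairs = []
    for ref in "AG":
        for alt in "ACGT":
            if alt != ref:
                sub = f"{ref}>{alt}"
                pairs.append((sub, strand_partner(sub)))
    return pairs

for sub, partner in strand_bias_pairs():
    print(f"{sub} versus {partner}")
```

In the absence of any strand bias, the two members of each pair would be expected to occur at similar frequencies, which is why each pair forms a natural unit of comparison.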

We first addressed this issue [30] by making the point that “the synthesis of a mutated cDNA copy of the transcribed strand (TS) off the pre-mRNA template, and replacement of the original TS with the cDNA is inevitably strand-biased (see Figure 1).” This was underpinned by the finding that the error-prone Y family DNA polymerase-eta (η), an enzyme shown to be at least involved in translesion DNA repair, reviewed in Goodman [33], was an efficient reverse transcriptase at low enzyme-to-RNA template ratios in vitro [7]. It is now firmly established that Pol-η is the only DNA polymerase involved in physiological Ig SHM in vivo [13, 16].

The SHM field now accepts that Pol-η mutates A:T base pairs, particularly A-sites at certain WA hotspots where the target A is preceded 5′ by A or T (=W). With the analysis of most published experimental data 1984–2008 (see legend to Table 1) a unifying explanation can be provided for the central role of base-modified RNA template intermediates and cellular reverse transcription in the generation of all the Watson-Crick strand biases displayed in Table 1.

In the updated version of the reverse transcriptase model (RT model SHM, initially proposed in Steele and Pollard [34]) the two different sets of mutation strand biases at A:T and G:C base pairs can, we believe, be explained by a common core mechanism (Figure 1): emergence of an error-filled mRNA intermediate followed by reverse transcription via DNA polymerase-η [3]. In the case of the A≫T strand bias the model proposes a combination of adenosine-to-inosine (A-to-I) pre-mRNA editing by ADAR1 deaminase [8] and A-to-T and A-to-C biases via the RT activity of Pol-η during the cDNA synthesis step. In the case of G≫C strand bias it proposes that RNA mutations (G-to-A, G-to-C) generated by RNA Polymerase II (RNAPII), transcribing a DNA template (TS) carrying AID-lesions (uracils and abasic sites), are copied back to DNA by the RT activity of Pol-η. This can be considered a form of “transcriptional mutagenesis” as proposed in Figure in Hanawalt and Spivak [35] but coupled now to DNA fixation by reverse transcription.

According to the RT model, for the G≫C mutation signatures the G-to-A versus C-to-T strand bias results from rA being incorporated into RNA opposite unrepaired dU on the TS and the G-to-C versus C-to-G strand bias results from rC being incorporated into RNA opposite an abasic site on the TS [6]. For the G-to-T versus C-to-A strand bias this would be the alternative pyrimidine substitution (rU) if rC is not inserted opposite an abasic site (note: modified G residues in DNA due to reactive oxygen species such as 8oxoG are not known to play a role in physiological SHM in vivo [9] and below).

Recently, we have applied this updated RT model for DNA diversification to both the etiology of strand-biased mutation patterns in many non-lymphoid cancers [4] and, amongst other hypotheses, to the origin of the established genome-wide strand bias for A-to-G over T-to-C in transcribed regions of the human genome [36]. The RNA modifications we are considering are nonbulky simple changes to base pairing, such as adenosine-to-inosine deamination in RNA [8] or putative 8oxoG modifications at G residues in RNA [4]. For Ig SHM and in our genome-wide diversification analysis, we have strongly argued against conventional explanations of strand biases (mainly the A-to-G versus T-to-C bias) for repair of non-bulky DNA lesions caused by differential DNA repair of transcribed (TS) as opposed to nontranscribed (NTS) strands during transcription-coupled repair (TCR). Indeed, critical evaluation of the TCR field has so far not provided evidence to support a TCR-mediated mechanism for strand biases arising from non-bulky DNA lesions such as C-to-U and abasic sites [36]. Further, in the case of the repair of 8oxoG lesions in DNA the Bohr group have convincingly shown that there is no transcriptional strand bias in their repair [37].

2.4. Origin of the A-to-G versus T-to-C Strand Bias

The A-to-G versus T-to-C strand bias is a common strand bias in many somatic and germline mutation data sets. This strand bias is found not only in all SHM data sets in mouse and human VDJ genes, in families of similar human germline IgV segments (in Matsuda et al. [38], EJS unpublished analysis) but also in almost all TP53 cancer data sets where A/T mutations have not been significantly suppressed or ablated (presumably by genetic deficiencies affecting the mismatch repair (MMR) machinery as in colorectal, stomach, oesophagus adenocarcinomas, skin, rectum, and colon cancers, below). And in all cases examined, A-to-G mutations are enriched at some but not all A-site hotspots where the A target is preceded by a 5′ A or T (=W).

Key evidence supporting an RNA template intermediate model for the prominent A-to-G versus T-to-C mutation strand bias derives from an IgV mRNA stem-loop computational analysis where the RNA substrate for ADAR1-mediated A-to-I deamination was modelled and tested on the somatic mutation data set of the rearranged light chain encoding VκOx1 passenger transgene [8]. Thus, in an RNA-based pathway for immunoglobulin SHM, A-to-I RNA editing causes A-to-G transitions since I, like G, pairs with C. The adenosine deaminases (ADARs) are known to preferentially edit A sites that are preceded by an A or U (W) in double-stranded RNA substrates [39]. We showed that a significant and specific Pearson correlation exists between the frequency of WA-to-WG mutations and the number of mRNA hairpins that could potentially form at the mutation site. Indeed the statistical significance of the correlation improved with increased stem length (or stability) of the dsRNA substrate (Figure 11, in [8]) and proximity of the nascent dsRNA to the transcription bubble. It is known that ADAR1 edits pre-mRNAs in the nucleus prior to splicing [40, 41]. Indeed ADAR1 seems to act on the WA-site closest to the transcription bubble, which explains why the A-stem partner in the target A:U editing site must be previously synthesised [8]. This study strongly implies a role for both RNA editing and reverse transcription during SHM in vivo involving ADAR1 and Pol-η acting in its RT mode. For these reasons, we consider the elevated A-to-G versus T-to-C ratio as diagnostic of mutational strand bias caused by modified RNA template intermediates and DNA fixation via reverse transcription.
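The WA motif referred to in this section (a target A preceded 5′ by A or T) is simple to locate computationally. The following Python sketch is one way to do so. It assumes the input is a DNA sequence written 5′ to 3′; the function name and the toy sequence are hypothetical and do not come from the TP53 or Ig data.

```python
def wa_sites(sequence):
    """Return 0-based positions of every A preceded 5' by A or T (a WA site)."""
    seq = sequence.upper()
    return [i for i in range(1, len(seq))
            if seq[i] == "A" and seq[i - 1] in "AT"]

# Toy sequence for illustration, not a TP53 or Ig fragment.
print(wa_sites("TATGAAC"))  # positions 1 and 5
```

A fuller analysis would also need to scan the complementary strand and, for the RNA editing argument, consider whether the site can sit within a double-stranded stem; neither refinement is attempted here.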

However key direct experimental evidence supporting an RNA template intermediate model is still lacking. The ideal experiment in the context of Ig SHM would be a conditional genetics approach targeting ADAR1 expression in mature B lymphocytes in antigen-activated Germinal Centers. In a collaboration Cre-lox specific gene targeting techniques were used to inactivate ADAR1 during SHM in vivo. A positive result might involve a clear reduction or complete removal of the A-to-G component of the SHM mutation spectrum (i.e., those A-to-G changes which correlate strongly with WA sites in dsRNA stem loops). If however ADAR1 is a more central player in the SHM process it may also result in a total reduction in mutations at A:T base pairs (leaving intact mutations at G:C base pairs). In the recent collaboration ADAR1flox alleles on the C57BL6 mouse background (Wang et al. 2004 [42]) were crossed into C57BL6 mice with a “knocked-in” Ig antigen receptor (the SWHEL IgVH10 single-copy heavy chain transgene) which was assayed for somatic hypermutation in the adoptive transfer system described in Paus et al. 2006 [43]. An inducible Cre-recombinase gene when activated by tamoxifen should specifically target the ADAR1floxed alleles and delete them from B lymphocytes activated by antigen into the somatic hypermutation pathway. Unfortunately no mature donor B lymphocytes could be recovered in Germinal Centers suggesting that one or more ADAR1 sensitive developmental steps were necessary leading to Germinal Center B lymphocytes. Therefore with current Cre-lox technology approaches to implementing a successful experiment targeting ADAR1 alleles seem limited (R. Brink, K. Nishikura, G. F. Weiller, and E. J. Steele unpublished data 2007–2009).

With respect to carcinogenesis, unregulated ADAR-mediated A-to-I RNA editing is a well-described phenomenon [44, 45]. In a similar vein, unregulated APOBEC family C-to-U DNA deaminases such as AID, APOBEC3G, and APOBEC1 are comparable rogue mutator processes thought to be operative in many cancers [21, 25–27]. Thus for the present analysis we can reasonably assume that unregulated RNA and DNA deamination processes (at rA and dC residues) may well be associated with either the genesis or progression of many non-lymphoid cancers. Here we are concerned with the molecular implications of such processes, particularly in relation to understanding the strand-biased mutation signatures at A:T and G:C base pairs in the TP53 gene and wider genome.

3. The TP53 Mutation Database

The DNA sequence encoding the human tumor suppressor protein TP53, on chromosome 17 located at 17p13.1, has been cloned and sequenced as both full length DNA and cDNA in many tumors over the past two decades. Mutated variants of the TP53 germline sequence carrying somatic mutations mainly in the region encoding DNA binding are found in a wide range of cancers [1, 28, 46]. Oncogenic TP53 mutations are a biased dataset in that they are partially selected for a competitive binding function focused on the DNA binding region. They are usually missense mutations in one allele spanning many sites in the TP53 coding DNA from about codon 130 to 300 [1]. Many investigators have deposited their sequence data in the database funded by the WHO at Lyon in France. The WHO-IARC public database has now curated around 30,000 somatic mutations in TP53. The data analysed here was extracted from this source [47] (http://www-p53.iarc.fr/, R15, November 2010).

3.1. Method of Data Extraction and Presentation

The somatic point mutation data presented in the Tables were extracted from the database as follows. On entering the website (http://www-p53.iarc.fr/) “Database Search” is selected allowing entry to http://www-p53.iarc.fr/p53main.html which allows selection of “Search database for data related to SOMATIC MUTATIONS” and selection of “Search (Tumor types).” This allows entry to http://www-p53.iarc.fr/BasicCriteria.asp where the tumor site can be selected at “Select a tumor site” and thus entry to “Mutation pattern” (at http://www-p53.iarc.fr/Graph.asp) where selection can be made for the key data sets: “Strand distribution” and “Download data.” The “Strand distribution” tables can be downloaded where numbers of all 12 possible base substitutions are displayed including the numbers of C-to-T and G-to-A mutations at CpG islands. The spreadsheet from “Download data” allows construction and analysis of all types of mutations with 5′ and 3′ flanking sequence context in relation to the unmutated TP53 exon sequence (and in some cases intronic sequence). This allows development of frequency distributions of various types of mutation (e.g., A-to-G) versus nucleotide (and codon) position across regions of interest such as the DNA binding region for example Figure 2.
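Once the point mutations have been downloaded, a tally of the 12 possible base substitutions can be rebuilt with a few lines of code. The Python sketch below assumes the records have already been reduced to (wild-type base, mutant base) pairs; how those two fields are named in the downloaded spreadsheet is not specified here, and the function name and example records are hypothetical.

```python
from collections import Counter
from itertools import permutations

BASES = "ACGT"
# The 12 ordered pairs of distinct bases, written as "X>Y".
ALL_SUBSTITUTIONS = [f"{ref}>{alt}" for ref, alt in permutations(BASES, 2)]

def tally_substitutions(records):
    """Count each of the 12 base substitutions from (wild_type, mutant) pairs."""
    counts = Counter(f"{wt.upper()}>{mut.upper()}" for wt, mut in records)
    # Report all 12 classes, including those with zero observations;
    # anything that is not a single-base substitution is ignored.
    return {sub: counts.get(sub, 0) for sub in ALL_SUBSTITUTIONS}

# Made-up records for illustration only.
example = [("A", "G"), ("A", "G"), ("C", "T"), ("G", "T")]
table = tally_substitutions(example)
print(table["A>G"], table["T>C"])
```

Keeping the zero-count classes in the output makes it straightforward to line the result up against the downloadable "Strand distribution" tables.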

3.2. Statistics

Displayed in each base substitution table are Chi-squared statistical comparisons (1 df) of several types of base substitutions. Thus for A:T base pairs the main comparisons are all mutations of A versus all mutations of T; when the data are strand-biased toward excess mutations of A, this is symbolized as “A≫T.” The common and dominant A-to-G strand bias is represented as “A>G versus T>C.” Other types of biases are presented and tested similarly. Thus when mutations of G exceed mutations of C this is symbolized as “G≫C”; when G-to-A mutations exceed C-to-T mutations this is symbolized as “G>A versus C>T.” The Appendix and Figure 3 explain the rationale for detecting strand biases in mutation datasets.
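As an illustration of the kind of comparison described above, the following Python sketch computes a Chi-squared statistic with 1 df for two mutation counts. It assumes the null expectation is an equal split between the two complementary classes; the expected values actually used in the tables may differ (for example, if corrected for base composition), so this should be read as a sketch of the general approach rather than the exact procedure. The function name and the counts are hypothetical.

```python
import math

def strand_bias_chi2(count_a, count_b):
    """Chi-squared (1 df) for two counts against an equal-split expectation."""
    total = count_a + count_b
    if total == 0:
        raise ValueError("no mutations to compare")
    expected = total / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # For 1 df, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts, not values from the tables.
chi2, p = strand_bias_chi2(280, 100)
print(f"chi2 = {chi2:.2f}, p = {p:.3g}")
```

The closed form for the p-value holds only for 1 df, which is the case for every two-class comparison of this kind.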

4. TP53 Strand-Biased Mutation Data

The strand-biased mutation pattern typical of normal Ig SHM (Table 1) was compared with the somatic point mutation patterns observed in the TP53 coding region for a range of key cancers (Table 2). Chi-squared tests were applied to test the levels of statistical significance of the various strand biases. Attention was focused on mutations of A and G, respectively, particularly A-to-G versus T-to-C, G-to-A versus C-to-T, and G-to-T versus C-to-A, as the analysis of these strand biases has implications for the molecular mechanisms involved.

The main strand-biased mutation patterns observed in TP53 are best represented by the patterns in All Breast, All Bladder, and All Lung cancer categories where each has many (>1000) somatic mutations to analyse (Table 2). We have ranked these tissues in their likely exposure to carcinogens in tobacco smoke as smoking is the leading cause of lung and many other cancers in the world today. The origin of TP53 mutations in breast may be considered the least likely to be caused by direct exposure to tobacco smoke carcinogens and metabolic by-products [2].

In the case of TP53 mutations in bladder cancer we assume the tissues are at least exposed to relatively high levels of carcinogenic metabolic by-products of tobacco smoke in urine.

In the case of TP53 mutations in lung cancer there is direct tissue exposure to tobacco smoke polycyclic aromatic hydrocarbons (PAHs) and carcinogenic metabolic derivatives [48]. Here the molecular, biochemical, and cellular evidence is overwhelming: exposure to such smoke-derived carcinogens causes lung cancer. In particular, bulky DNA adducts at certain G sites targeted by carcinogens such as benzo[a]pyrene (B[a]P) are the direct cause of the dominant G-to-T transversion in these cancers in the TP53 gene [49–51] (reviewed in [1, 28]) and throughout the wider lung cancer genome [52].

Some striking similarities are observed for the cancers shown in Table 2.

(a) The first is the similarity between the Ig mutation pattern (Table 1) and the patterns in TP53 for “All Breast” and “All Bladder” cancers. The main difference between the Ig pattern and the TP53 pattern in breast and bladder (and with many other TP53 cancer data sets for that matter) is the ~50 : 50 balance of mutations at A/T and G/C for Ig versus the G/C excess over A/T in TP53. The majority (70–75%) of TP53 mutations occur at G/C sites. This may be partly explained by the significant excess of G/C (~60%) versus A/T (~40%) in the base composition of the target region in TP53. But the pattern similarities nevertheless exist within A:T base pairs and within G:C base pairs.

(b) The second systematic pattern observed in all data sets shown in Table 2 is the A≫T and A-to-G versus T-to-C strand bias. They stand out as common strand-biased patterns. To our knowledge of the mainstream TP53 literature this strand bias is rarely highlighted in published discussions. However, this pattern is not observed in colorectal, stomach, skin, and some other cancers as A/T mutations here have been ablated or significantly reduced (below). This pattern is stable across breast, through bladder to lung (and all ovary cancers, Table 3) suggesting a common causal mechanism that may not be associated with exposure to carcinogens in tobacco smoke. For A-to-G hotspots in both breast and lung cancers, the majority are defined by being part of a WA-site, particularly at codons 132, 163, 205, 220, 234, and 239; the TAT site in codon 220 is a super hotspot (data not shown).

If the A-to-G spectrum in TP53 at all A-sites in codons 100–300 inclusive for “All Breast Cancer” is compared with the same spectrum in “All Lung Cancer”, they are virtually superimposable (Figure 2). The Pearson correlation coefficient (r) is 0.93, which for 129 degrees of freedom is highly significant. Indeed, we think that this repeatable pattern in the two disparate target tissues is consistent with their origin being the result of an “endogenous” process. By comparison with what we have inferred from Ig SHM (Figure 1), a likely candidate is unregulated ADAR1-mediated A-to-I RNA editing and fixing of the mutated retrotranscripts back into DNA via reverse transcription, as envisaged for Ig somatic hypermutation. The cellular reverse transcriptase could be DNA polymerase-η (or one of its Y family relatives, iota (ι) and kappa (κ), which also possess significant reverse transcriptase activity [7]).

(c) The third common pattern is the excess of G mutations over C mutations (G≫C), particularly the dominant G-to-A over C-to-T strand bias (evident at both CpG and non-CpG sites). The pattern in breast and bladder is very similar again, suggesting common causal mechanisms. All ovary cancers display a similar pattern (Table 3). Again, to our knowledge of the mainstream TP53 literature, this particular and striking strand bias is rarely highlighted in published discussions on the topic.

The likely cause of the G≫C imbalance for lung cancer is known to be due to the binding of bulky tobacco smoke-derived adducts such as B[a]P at certain G-sites (mainly CpG islands although GpG sites can be targeted) in known critical codons such as 154, 157, 158, 245, 248, 249, and 273. Such adducted G sites now mispair with adenosines causing G-to-T transversion mutations if left unrepaired. This grossly imbalances other G mutations at these sites leading to a loss of the G-to-A versus C-to-T strand bias evident in breast, bladder, and ovary cancers.

Direct experiments by Pfeifer and colleagues have shown that the G-to-T versus C-to-A strand bias is caused by the much slower repair of bulky DNA adducts such as B[a]P along the nontranscribed strand compared with the faster repair on the transcribed strand [51]. So the strand-biased G-to-T pattern is a direct consequence of a transcription-coupled DNA repair (TCR) process for bulky DNA adducts [35]. This is a DNA-based strand-biased mutation mechanism and thus quite different from the RNA-based mechanisms outlined above for Ig SHM (Figure 1).
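The breast-versus-lung comparison of A-to-G spectra discussed in this section (Figure 2) reports a Pearson coefficient of r = 0.93 with 129 degrees of freedom. For readers who wish to check how strong such a correlation is, the Python sketch below computes r from two per-site frequency lists and converts a given r and df into the standard t statistic for testing against zero correlation. The function names are illustrative, and no per-site data from the paper are reproduced.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, df):
    """t statistic for testing r against zero, with df = n - 2."""
    return r * math.sqrt(df / (1 - r * r))

# Using the r and df quoted in the text.
print(round(t_statistic(0.93, 129), 1))
```

With the quoted values the t statistic is close to 29, far beyond conventional significance thresholds for a t distribution with this many degrees of freedom.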

Before leaving this section we wish to deal with two further issues. First, we must deal with a relevant point that has now emerged from whole genome sequencing of individual lung cancers such as NCI-H209 (Pleasance et al. [52]). In a critical section on the origin of the DNA repair pathways that may be responsible for the complex strand-biased signatures of tobacco exposure, the authors make the following set of statements and assumptions:

“… that bulky adducts on purines are the predominant form of DNA damage induced by tobacco carcinogens and can be sufficiently disruptive to impede RNA polymerase when they occur on the transcribed strand..” and they observed “that guanine and adenine substitutions are generally less frequent on the transcribed than the nontranscribed strand-confirming that purines seem to be the major target of carcinogens in tobacco smoke.”

We accept this explanation for the origin of the G-to-T transversions but our data do not support their conclusions on the origin of the A-to-G strand bias. Indeed we would hardly expect the A-to-G spectra in TP53 to be identical in lung and breast cancers (Figure 2) if this is the case. The fitted curves in Pleasance et al. [52] show the effect of gene expression on strand bias mutation rate for the six classes of adenine and guanine mutations in NCI-H209 (Figure in [52]). The overall patterns are to be expected from the TP53 lung cancer pattern (Table 2), which the authors acknowledge in their Supplementary data. However the profiles for G-to-T and A-to-G are quite different. The decline in G-to-T mutation rate with increased transcription level is biphasic, suggesting two causal mechanisms for G-to-T: one to be expected, rapid and dependent on increased transcription, involving TCR of bulky adducts; the other suggestive of another strand bias process. The latter could perhaps be due to 8oxoG generation by reactive oxygen species in RNA followed by an RT step of RNA-to-DNA fixation.

The curve for A-to-G mutation rate versus transcription level appears monophasic, with the difference between the repair of the NTS and TS mutations deepening with increased gene expression (or transcription). However the slope of the curve is very shallow: a slight decline in mutation rate from expression level 4 to 9 (approximately 2.9 mutation rate/Mb to approximately 2.5 mutation rate/Mb). This suggests a constitutive process of error generation and repair marginally affected by transcription level. One interpretation could be that this is consistent with A-to-I RNA editing and RNA-to-DNA fixation.

The second issue adds to the argument against asymmetrical or strand-differential TCR as an explanation for the excess of G-to-A mutations over C-to-T mutations in both Ig SHM and somatic mutation patterns in TP53 in cancers such as in breast. If TCR was occurring to clear C-to-U lesions arising from the action of unregulated APOBEC family deaminases in cancer cells (or progenitors) we would expect a preferential clearance of C-to-U on the TS and an excess of unrepaired C-to-U lesions, manifest as C-to-T, on the NTS. In fact in Ig SHM and in mutated TP53 genes in breast and other cancers (Table 2) it is the other way around. An excess of G-to-A over C-to-T suggests that C-to-U lesions would need to go unrepaired preferentially on the TS which is not evident in the mutation patterns analysed here.

So the argument goes full circle, posing the question: why then should G-to-A mutations exceed C-to-T mutations? This conundrum may explain why the G-to-A≫C-to-T strand bias has gone relatively unreported in the literature. The simplest explanation of the G-to-A strand bias is that C-to-U goes unrepaired on the TS prior to RNA Pol II copying this into G-to-A in the mRNA [6], which in turn would be manifest in a strand-biased manner in the DNA by reverse transcription, first as a C-to-T in TS DNA and then following replication as G-to-A on the NTS [3] (see Figure 1). This is the simplest explanation of the G-to-A versus C-to-T strand bias involving AID/APOBEC3G/APOBEC1 deaminations in DNA. The same argument applies to G-to-C over C-to-G in SHM and in TP53 mutation patterns in breast and other cancers with similar mutation patterns (e.g., bladder, Table 2), and to the lesser strand bias of G-to-T over C-to-A. In the TP53 mutation pattern in breast cancers, the relative increase in the G-to-T over C-to-A ratio compared to that in Ig SHM (Table 1), in “a tissue less exposed to smoke”, suggests it might be contributed by another mechanism. One possibility is the formation of 8oxoG in the TP53 mRNA (as suggested in Steele and Lindley [4]) because it is known that 8oxoG DNA lesions are unlikely to display strand differential TCR biases on DNA repair [37]. Further research on the genetic and biochemical consequences of 8oxoG formation in RNA is clearly required.

5. TP53 G-to-T versus C-to-A Strand Biases in Other Cancers

B[a]P and similar DNA-binding tobacco-derived carcinogens are known to form bulky G-site adducts, causing G to base-pair, like T, with adenosines. These bulky adducts are preferentially cleared from the transcribed strand during transcription-coupled repair, and this causes the excess of G-to-T mutations on the upper nontranscribed strand. In a similar vein, in hepatocellular liver cancer the G-to-T versus C-to-A strand bias is caused by adduct formation at G-sites by aflatoxin dietary contaminants, diagnostically at the third position of codon 249 in TP53 [46]. This G-site (GpG) is also a G-to-T hotspot in many lung cancers.

Other geographically and ethnically localised strand biases via dietary contamination for G-to-T over C-to-A and for A-to-T over T-to-A are reviewed in [1, 46]. Thus G-to-T transversions in the third base of TP53 codon 249 correlate strongly, in tumors from HBV carriers, with exposure to dietary aflatoxin B1; and A-to-T strand bias (codons 131, 209, and 280) has been linked with crops contaminated with Aristolochia sp. seeds (in certain Balkan communities in southeastern Europe). Aristolochic acids (AAs) are the identified carcinogenic agent, and DNA adducts involving AA have been detected in patients suffering from Balkan endemic nephropathy (BEN). The strand bias suggests preferential TCR of bulky DNA lesions along the transcribed DNA strand, as concluded already for G-to-T strand biases in many lung cancers.

6. DNA Repair Deficiencies: Colorectal, Stomach, and Skin Cancers

There are also some other more complex TP53 mutation patterns that do not conform to the simple strand biases just discussed and are worthy of further comment here. Before we do this, it is informative to consider what is known about DNA repair deficiencies and distortions of the Ig SHM mutation pattern (Tables 1 and 4). Much of this work has been done using single and double knockout mice targeting key base excision repair (BER) and mismatch repair (MMR) genes encoding proteins that have been coopted to now function aberrantly (a form of “subverted DNA repair,” as put by Martomo and Gearhart [53]); see the summary in Table 4 and the review in Steele [3]. The various protein components of the normally tightly regulated and targeted Ig mutator now act to encourage error-prone DNA synthesis during the somatic hypermutation of rearranged antibody variable genes in Germinal Center B lymphocytes following antigenic challenge. The key proteins and DNA repair enzymes are AID deaminase, which initiates the SHM process (and Ig class switch recombination, CSR) by deaminating C-to-U in VDJ DNA; this is then followed by attempts to remove the base in a BER manner via uracil DNA glycosylase (UNG), followed by other “subverted” DNA repair enzymes such as translesion DNA polymerase η, and the mismatch repair heterodimer MSH2-MSH6.

Table 4 lists the main consequences on the Ig SHM mutation spectrum of genetic deficiency in uracil DNA-glycosylase (UNG), in the mismatch repair heterodimer (MSH2-MSH6), and deficiencies in Y family DNA polymerases η (eta), ι (iota) and κ (kappa) and combinations thereof. Additional deficiencies are shown in alkyladenine DNA glycosylase (Aag), which removes hypoxanthine (deaminated adenine) from DNA generating an abasic site, and 8-hydroxyguanine-DNA glycosylase (Ogg1), which removes oxidised guanine from DNA. The effect on the SHM spectrum of inactivating TP53 is also shown.

In Ig SHM a failure to remove uracils from DNA as a consequence of dC-to-dU AID deaminase action (UNG−/−, Table 4) has a slight effect on overall A/T mutations and no effect on A≫T or A-to-G strand biases. The main effect is on the focusing of mutations to G:C base pairs with a reduction in transversion mutations.

In UNG−/−MSH−/− double deficient mice mutations at A:T base pairs are virtually eliminated, as are transversions at G:C base pairs, leaving what is considered the “AID deamination footprint” (Table 4, [11]), which now manifests itself as a strong strand bias of C-to-T exceeding G-to-A by at least 1.5-fold [3]. The simplest interpretation is that since AID-deaminase converts C-to-U in the single stranded DNA regions of the transcription bubble, this is likely to happen more often on the NTS than the TS; hence the unfettered C-to-T over G-to-A strand bias in such mutant mice. The same strand bias is revealed in double deficient Pol-η−/−/MSH2−/− mice [13]. These data are consistent with SHM models whereby MSH2-MSH6 heterodimers engage G:U DNA mispairs and recruit Pol-η necessary for full blown mutagenesis at A:T base pairs [54]. In the complete absence of Pol-η, Pol-κ can step in to effect A/T mutagenesis [16] and probably also the RT step [7].

With respect to the general over-arching molecular mechanism of Ig SHM the deficiencies in Aag, Ogg1, and TP53 are informative. First, the failure to remove 8oxoG from DNA has no effect on the Ig SHM spectrum [9], indicating that the G-to-T/C-to-A component of the SHM spectrum does not involve 8oxoG residues in DNA. This leaves open the possibility that 8oxoG sites in RNA may contribute in other somatic mutation scenarios such as in TP53 as envisaged in Figure 1. Second, direct deamination of adenines in DNA to hypoxanthine (and thus potential A-to-G miscoding) seems to play no role in the generation of the A-to-G spectrum [17], once again leaving open the possibility of adenosine deamination to inosine at the RNA level contributing significantly to the observed A-to-G strand bias as envisaged in Figure 1. The investigators also observed a borderline statistically significant ( ) increase in T-to-C in Aag(−/−) mice; in our view this variation in T-to-C frequency is well within the range of variation for these PCR based SHM assays as outlined in Steele [3] and Supplementary data therein. Third, and most importantly for the present analysis, Storb and associates have clearly shown an effect of TP53 inactivation on the Ig SHM spectrum: there is a striking increase in A-to-G frequency in such mice and a corresponding increase in both A-to-G versus T-to-C and A≫T strand bias [18]. This result suggests that TP53 inactivation can profoundly affect the A-to-G component of the SHM spectrum and thus implies that the global DNA damage surveillance function of TP53 extends, as expected [18], to Ig SHM. In the context of the present analysis the result suggests that TP53 may well regulate the imprint of A-to-I RNA editing on the DNA somatic mutation pattern as predicted by the RT model of Ig SHM (Figure 1). This in turn has implications for the magnitude of the A≫T and A-to-G versus T-to-C strand biases across the human cancers bearing inactivated TP53 alleles such as, for example, bladder (Table 2) and ovary cancers (Table 3) and the wider cancer genome.

With respect to the other major effects on A/T and G/C mutations, can we find parallel, or similar, patterns in other cancers with TP53 somatic mutation data? It is well known that colorectal and other aggressive gastrointestinal cancers are typified by a high incidence of defects in the mismatch DNA repair machinery; see Bellizzi and Frankel (2009) [55]. We would therefore expect, if subverted DNA repair components of the Ig SHM process are operative in such tumors, that the signature of ablated or suppressed A/T mutagenesis should be revealed in the TP53 patterns in such cancers. This expectation is partly satisfied by the TP53 mutation data on colorectal and stomach cancers (Table 5), where A/T mutations have been reduced (but not eliminated). In some cases the A-to-G strand bias has been retained (stomach, Table 5) and in other cases it is lost (colorectal cancers, Table 5). Reductions in A/T mutations are also noted in oesophagus adenocarcinomas (not shown). Whilst there is a relative excess of G-to-A and C-to-T mutations at presumed methylated CpG sites in colorectal cancers, the strand bias here is also evident at non-CpG sites (not shown). In contrast, the TP53 mutation patterns in stomach cancers lack significant strand biases apart from those involving A-to-G versus T-to-C and G-to-T versus C-to-A (Table 5). It is conceivable that unrepaired excessive C-to-U deaminations on the NTS (relative to TS) are blunting intrinsic strand biases of G-to-A versus C-to-T in the same way such blunting (or strand bias reversal) occurs in Ig SHM in UNG−/−MSH−/− and Polη−/−/MSH−/− mice (Table 4; see extended discussion on this point in Steele [3]).

TP53 mutation patterns in skin cancers are shown in Table 5. Here there is both suppression of A/T mutagenesis and a reversal of the strand bias at G/C sites; namely, C-to-T mutations clearly exceed G-to-A mutations by almost 2-fold. This is similar to data for Ig SHM in mice displaying the “AID deamination footprint” in UNG−/−MSH−/− and Pol-η−/−/MSH−/− mice (Table 4, [11, 13]).

In summary, known DNA repair deficiencies in previously characterised Ig SHM model systems are displayed in toto or in part in TP53-bearing cancer mutation patterns. The results are consistent with the hypothesis that components of “subverted DNA repair” play a similar role in both SHM and non-lymphoid cancers bearing mutated TP53 derivatives. In addition inactivation of TP53 may increase the magnitude of A≫T and A-to-G versus T-to-C strand biases in the tumors that harbor mutated TP53 derivatives [18].

7. TP53 Mutation Patterns in Brain Cancers

This particular pattern is discussed at length in Soussi [1] and displayed in Table 6. Note that a trend to A≫T strand bias and a significant A-to-G versus T-to-C strand bias is evident. There is also a specific strand bias of G-to-T transversions over C-to-A. However the global G≫C strand bias has been ablated, particularly the dominance of mutations of G-to-A versus C-to-T evident in many other cancers harboring mutated TP53 variants. As Soussi [1] discusses, there are similar numbers of excessive G-to-A and C-to-T mutations at CpG sites suggesting excessive deaminations of 5-Methylcytosine-to-T on both DNA strands (Table 6). The latter may be affected by AID-deaminase [26]. Once again however, as with the analysis of mutation patterns in Ig SHM, competing strand-biased mutational processes at G:C base pairs may accentuate, blunt, ablate, or even reverse the strand-biased signature presented by a particular tumor.

8. Context of C-to-U Lesions by APOBEC-Family Enzymes

This topic has been extensively studied by the Neuberger group [21]. Here we summarise the main findings and relate them to the sequence context of TP53 mutations in lung and breast cancer. For mutations of G, most if not all occur at one of the known 5′-to-3′ motifs on the opposite deamination strand. Thus APOBEC1 targets 5′-TCA-3′ on the deamination strand (or 5′-TGA-3′ on the NTS); APOBEC3G targets 5′-CCG-3′ on the TS (or 5′-CGG-3′ on the NTS); and AID variously targets, in descending order, 5′-ACA-3′, 5′-GCA-3′, 5′-ACG-3′ and 5′-GCG-3′ (or 5′-TGT-3′, 5′-TGC-3′, 5′-CGT-3′ and 5′-CGC-3′ on the NTS). We also include the possibility that 5-MeC-to-T deaminations are affected by such enzymes [26].
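The NTS equivalents quoted above are simply the reverse complements of the deamination-strand motifs, which can be checked mechanically. The following is a minimal sketch under that assumption; the dictionary layout and the function names (`reverse_complement`, `nts_motifs`, `candidate_deaminases`) are hypothetical and chosen only for illustration, and a motif match indicates a candidate enzyme, not proof of its action.

```python
# Sketch: derive NTS-equivalent motifs from the deamination-strand motifs
# quoted in the text, and look up candidate deaminases for a mutated G.
# Names and structure are illustrative assumptions.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Motifs written 5'->3' on the deaminated strand, central C is the target.
DEAMINATION_MOTIFS = {
    "APOBEC1": ["TCA"],
    "APOBEC3G": ["CCG"],
    "AID": ["ACA", "GCA", "ACG", "GCG"],  # descending order of preference
}


def reverse_complement(motif):
    """Return the 5'->3' sequence of the opposite DNA strand."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def nts_motifs():
    """Map each enzyme to its motifs as they appear on the NTS (central G)."""
    return {
        enzyme: [reverse_complement(m) for m in motifs]
        for enzyme, motifs in DEAMINATION_MOTIFS.items()
    }


def candidate_deaminases(nts_trinucleotide):
    """List enzymes whose NTS motif matches a trinucleotide centred on a mutated G."""
    tri = nts_trinucleotide.upper()
    return [enzyme for enzyme, motifs in nts_motifs().items() if tri in motifs]


if __name__ == "__main__":
    print(nts_motifs())
    print(candidate_deaminases("CGC"))
```

For example, a G mutation sitting in a 5′-CGC-3′ context on the NTS is returned as an AID candidate, matching the motif list given in the text.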

When this information is applied to the G-site mutations of the “endogenous pattern” represented by the All Breast Cancer data it reveals that all major and minor G-site mutation hotspots can be classed as the direct result of either AID or APOBEC3G C-to-U deaminations targeting the transcribed TP53 DNA strand (Table 7). The dominant likely C-to-U deaminase is AID suggesting that dysregulated SHM initiation via AID activation in non-lymphoid tissue may be the primary cause of the mutagenesis leading to cancer.

9. Breast Cancer Mutation Patterns in TP53 Compared with Patterns in Genome-Wide Data

The somatic mutation pattern in TP53 is often a very good correlate of genome-wide point mutation patterns, for example, in lung cancer [52]. However, this is not always the case, probably because TP53 is not inactivated in all cancers of a given category: for example, about 25% of breast and bladder cancers, 48% of ovary cancers, and 38% of lung cancers have an inactivated TP53 allele (see IARC TP53 database). As pointed out in Pfeifer and Besaratinia [28], large-scale genome sequencing of cancer genomes has revealed some interesting results not evident in TP53 patterns [56, 57]. Thus there is a quantitative difference evident between strand-biased mutation patterns in TP53 in breast cancers (Table 2) compared with available data from genome-wide exome sequencing of breast cancer genomes [56, 57]. Table 8 displays data illustrating this difference [56], from the sequencing of exons of close to 20,000 protein coding genes in eleven breast cancer genomes. In the majority of these breast cancers (10/11) TP53 is mutated (see Supplementary data in [56]).

Note first that the strand-biased pattern at A:T base pairs in TP53 (Table 2) and in genome-wide data (Table 8) is similar. The systematic A≫T and prominent A-to-G strand biases are evident. However, in this data set (approximately 1445 point mutations) the strand bias at G:C base pairs for G≫C is systematic and just significant at the level. A key difference pointed out by Pfeifer and Besaratinia [28] is the higher load of mutations for G-to-C/C-to-G. Pfeifer and Besaratinia postulate the following:

“These data suggest that breast cancers are caused by an etiological agent that induces this particular type of mutation. There are few known mutagens that specifically induce G/C to C/G transversions, let alone selectively at a particular dinucleotide sequence.”

They then go on to point out that a significant fraction of these G-to-C transversions occur at the 5′ GpA dinucleotide motif, which is 5′ TpC on the other strand.

From our perspective it is interesting that this happens to be the favoured APOBEC1 motif if this DNA deaminase enzyme, now unregulated, deaminated cytosines at such sites ([21], see previous section, Table 7). Inspection of the TP53 “All Breast Cancer” data reveals there are indeed several 5′ GpA sites which account for about a third of the load of G-to-C mutations in the strand-biased G-to-C pattern in TP53 in cancers of the breast (Table 2). These G-to-C hotspots are at codons 196, 280, and 281 (Table 7) and constitute 25 of 92 (27%) of all G-to-C mutations over codons 150–300 inclusive. Additionally, ten G-to-C mutations in this region occur at a CGC site in codon 156 (a motif favoured by AID deamination on the opposite strand).

Recently the Cancer Genome Project (CGP) at The Wellcome Trust Sanger Institute has reported on the exomic mutation spectrum of category-selected sets of 21 breast cancer genomes (Nik-Zainal et al. [22]). Few if any significant strand biases are reported in this genome-wide data. Of real interest is the fact that only a minor fraction (4/21) carries an exomic mutation in TP53 and most sample sizes for mutations are statistically small (N values in the hundreds), except tumor PD4120a, which carries 1931 exomic mutations that are predominantly focused on G/C with ≤5% mutations at A/T base pairs. Two of the tumors bearing a TP53 mutation, PD4109a and PD4199a, display the early trends of the significant strand biases at A:T and G:C base pairs (Table 9) evident in the earlier Wood et al. (2007) exome study [56].

Collectively these genome-wide data sets suggest that mutations in TP53 accentuate strand-biased mutation patterns across the cancer genome, implying that inactivated TP53 and dysregulated SHM contribute to such patterns. This conclusion is underlined by the documented functional interaction between TP53 and the Ig SHM machinery shown in mice by the Storb group, particularly in relation to the accentuated strand-biased mutation pattern of A-to-G versus T-to-C [18].

10. Inflammation and Carcinogenesis

Whilst there are some exceptions and qualifications, we conclude that there is a strong statistically significant similarity between the strand-biased mutation signatures of TP53 in many tumor types and the now well-established Ig SHM pattern, particularly in relation to the strand biases of A-to-G over T-to-C and G-to-A over C-to-T. Previous work on Ig SHM suggests that the A-to-G over T-to-C strand bias correlates strongly with A-to-I RNA editing coupled to reverse transcription to fix the A-to-G pattern in the cellular DNA [8].

The G-to-A over C-to-T pattern is found to be a dominant strand bias in all those cancers arising in tissues “least accessible to tobacco smoke” suggesting that this strand biased pattern (as well as A-to-G over T-to-C) arises from endogenous mutation processes in most non-lymphoid cancers. We have previously concluded that the G-to-A over C-to-T strand bias is consistent with RNA mutations initiated at C sites by activation-induced cytidine deaminase (AID)-mediated C-to-U deamination on the transcribed strand (TS) resulting in G-to-A transitions in the mRNA which are fixed as G-to-A mutations on the nontranscribed strand (NTS) following reverse transcription [3], see Figure 1.

The present analyses therefore confirm and extend our earlier conclusions in a preliminary study of genome-wide somatic mutation data curated by the Cancer Genome Project (CGP) at The Wellcome Trust Sanger Institute, Hinxton, UK [4]. The special features of the strand biases at A:T and G:C base pairs in tumors bearing mutated derivatives of TP53 imply a role for base-modified mRNA template intermediates and reverse transcription in somatic mutagenesis leading to or initiating cancer. In addition our analysis and conclusions are consistent with the view that inflammatory infiltrates, or in situ inflammatory episodes, in non-lymphoid tissues may contribute to dysregulated Ig SHM and thus oncogenesis, first by mutating components of the Ig SHM machinery and then by affecting mutations in TP53. This could occur via “a bystander effect” of various liberated cytokines inappropriately activating gene expression pathways in nearby non-lymphoid cells. It is common clinical knowledge that most tumors have associated inflammatory infiltrates and that these are part and parcel of tumor growth.

This analysis also identifies several potential new drug targets for cancer therapy: in addition to AID and APOBEC family deaminases we can include Pol-η, ADAR1 and yet to be identified factors that modulate the apparatus of RNA Pol II transcription-coupled repair. Further, identifying the interacting proteins/genes mediating the functional TP53-Ig SHM interaction must also be considered a top priority in drug development and targeting.

Consistent with these conclusions is the large and now rapidly growing literature on chronic inflammation preceding cancer in many tissues [58]. Whilst it may not be possible to control all those induced somatic genetic factors leading to cancer, strategies to dampen and avoid chronic or transient inflammatory episodes in life may depress the chance of “endogenous” mutagenic events being triggered via dysregulated Ig SHM machinery turned on in non-lymphoid tissues [4]. This is particularly important in breast and ovarian tissues as estrogen can directly elevate AID expression, as demonstrated by Petersen-Mahrt and colleagues [59, 60] and discussed by Maul and Gearhart [61]. Indeed the Pauklin et al. data [60] show that estrogen induces AID transcription in these non-lymphoid tissues, suggesting that the TP53 G-site mutation hotspots in breast cancers may be directly caused by AID and other APOBEC-family deaminases targeting such sites, as the data in Table 7 imply. Collectively, these findings and the present analyses shed new light on how we might view oncogenesis. Our work points to the unregulated Ig SHM mechanism as playing a key role in the progression of the main non-lymphoid cancer groups. This provides us with a fundamentally new molecular model with which to view the process of oncogenesis and ways to develop new strategies for treating (and perhaps preventing) the development of certain cancer groups.


Detection of Strand-Biased Somatic Mutation Signatures

In a data set containing a large number of somatic mutations, strand-biased signatures are revealed by comparing the base substitution frequencies of Watson-Crick complements on the same strand. By convention nucleotide substitutions are read from the nontranscribed strand (NTS); the known direction of transcription in a region of genomic DNA encoding a protein allows identification of the strands. Thus if A-to-G mutations occur with equal frequency on both strands, then the Watson-Crick complement, T-to-C, will occur with equivalent frequency when scored off the same strand. However, if there is a bias in the mutations favouring the NTS then A-to-G mutations will exceed T-to-C mutations. If there are systematic strand biases involving excessive mutations of A or G (e.g., as seen in many of the data tables presented herein) then the sum total of mutations of A will exceed the sum total of mutations of T (at A:T base pairs where A≫T) and the sum total of mutations of G will exceed the sum total of mutations of C (at G:C base pairs where G≫C). (See Figure 3).
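The comparison described above can be sketched in a few lines. The example below assumes that each mutation has already been scored off the NTS as a (reference base, mutant base) pair, and it uses an exact two-sided binomial test against an expected 50:50 split as one possible significance check; that choice of test, and the names `strand_bias_table`, `binomial_two_sided_p` and `site_totals`, are our assumptions for illustration and not necessarily the statistics used for the tables in this paper.

```python
# Sketch: detect strand bias by pairing each substitution with its
# Watson-Crick complement, both scored off the nontranscribed strand (NTS).
# The binomial test is an assumed, illustrative choice of statistic.
from collections import Counter
from math import comb

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Substitutions of A and G, each compared with its complementary substitution.
PAIRED_SUBSTITUTIONS = [
    ("A", "G"), ("A", "T"), ("A", "C"),
    ("G", "A"), ("G", "T"), ("G", "C"),
]


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k of n events under a 50:50 expectation."""
    if n == 0:
        return 1.0
    tail = min(k, n - k)
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


def strand_bias_table(mutations):
    """mutations: iterable of (ref, alt) pairs read from the NTS.
    Returns rows of (substitution, count, complement, count, p-value)."""
    counts = Counter(mutations)
    rows = []
    for ref, alt in PAIRED_SUBSTITUTIONS:
        comp = (COMPLEMENT[ref], COMPLEMENT[alt])
        n_fwd, n_rev = counts[(ref, alt)], counts[comp]
        rows.append(((ref, alt), n_fwd, comp, n_rev,
                     binomial_two_sided_p(n_fwd, n_fwd + n_rev)))
    return rows


def site_totals(mutations):
    """Total mutations of each reference base, for the A vs T and G vs C sums."""
    return Counter(ref for ref, _ in mutations)


if __name__ == "__main__":
    # Toy data only, not taken from any table in this paper.
    toy = [("A", "G")] * 8 + [("T", "C")] * 2
    for row in strand_bias_table(toy):
        print(row)
    print(site_totals(toy))
```

With real data, `site_totals` gives the summed comparisons (mutations of A versus T, and of G versus C), while `strand_bias_table` gives the individual complementary pairs such as A-to-G versus T-to-C.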


We thank John A Millman, Brent (“Charlie”) J Stewart, Pat Carnegie, Susan Lester, Joseph F Williamson and Roger L Dawkins for early discussions and the AL & M Dawkins Foundation and C Y O’Connor ERADE Village Foundation for support in the early stages of the project. We thank Thierry Soussi for comments on an earlier draft of the manuscript, and Serena Nik-Zainal and Bert Vogelstein for their helpful patience and assistance in accessing and analysing the supplementary data in [22, 56].