Abstract

Colorectal cancer (CRC) is one of the leading cancers worldwide. It is the third most common cancer and the fourth leading cause of cancer mortality. Most CRC cases are sporadic, arising with no known high-penetrant genetic variation and no previous family history. The etiology of sporadic CRC is considered to be multifactorial, arising from the interaction of genetic variants of low-penetrant genes and environmental risk factors. The most common and best-studied type of genetic variation is the single nucleotide polymorphism (SNP). An SNP arises as a point mutation. If the frequency of the sequence variation reaches 1% or more in the population, it is referred to as a polymorphism, but if it is lower than 1%, the allele is typically considered a mutation. Many SNPs have been associated with CRC development and progression, for example, SNPs in genes of the TGF-β1 and CHI3L1 pathways. TGF-β1 is a pleiotropic cytokine with a dual role in cancer development and progression. TGF-β1 mediates its actions through canonical and noncanonical pathways. The most important negative regulatory protein for TGF-β1 activity is SMAD7. The production of TGF-β can be controlled by another protein called YKL-40. YKL-40 is a glycoprotein with an important role in cancer initiation and metastasis and is encoded by the CHI3L1 gene. The aim of the present review is to give a brief introduction to CRC and SNPs and to present examples of SNPs that have been documented to be associated with CRC. We also discuss two important signaling pathways, TGF-β1 and CHI3L1, that influence the incidence and progression of CRC.

1. Colorectal Cancer

Colorectal cancer (CRC) has attracted significant attention as it is the third most common cancer and the fourth leading cause of cancer mortality worldwide, after lung, stomach, and liver cancers [1]. Colorectal cancer accounts for approximately 10% of all new cancer cases, affecting one million people every year throughout the world [2]. The highest incidence rates are mainly found in developed countries, whereas the lowest rates are found in developing countries (Figure 1) [3]. From the genetic standpoint, CRC can be divided into three types: sporadic, familial, and hereditary CRC [4], as shown in Table 1.

The etiology of sporadic CRC is considered to be multifactorial and arises from the interaction between allelic variants in low-penetrant genes and environmental risk factors [5, 6]. Penetrance is the frequency with which the characteristics transmitted by a gene appear in individuals possessing it. A highly penetrant gene almost always expresses its phenotype regardless of environmental influences, while a low-penetrant gene expresses its phenotype only in the presence of other genetic and/or environmental influences [7]. The genetic contribution of high- and low-penetrant genes to CRC is shown in Figure 2. Risk factors for CRC may be nonmodifiable or modifiable [8], as shown in Table 2.
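The definition of penetrance given above can be illustrated with a minimal sketch. The function name and the carrier counts are hypothetical and chosen only for illustration; they are not taken from any cited study.

```python
def penetrance(carriers_with_phenotype: int, total_carriers: int) -> float:
    """Fraction of individuals carrying a variant who show its phenotype."""
    if total_carriers <= 0:
        raise ValueError("total_carriers must be positive")
    if not 0 <= carriers_with_phenotype <= total_carriers:
        raise ValueError("carriers_with_phenotype must lie between 0 and total_carriers")
    return carriers_with_phenotype / total_carriers


# Hypothetical counts, for illustration only.
high_penetrant = penetrance(90, 100)  # most carriers show the phenotype
low_penetrant = penetrance(5, 100)    # few carriers show the phenotype

print(f"high-penetrant example: {high_penetrant:.2f}")
print(f"low-penetrant example: {low_penetrant:.2f}")
```

In this simplified view, a low value does not mean the variant is irrelevant; it means that other genetic and/or environmental influences are needed for the phenotype to appear.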

The Vogelstein model, also known as the adenoma-carcinoma sequence, is a multistep model [19] that describes the progression of CRC carcinogenesis from a benign adenoma to a malignant carcinoma through a series of well-defined histological stages (Figure 3). The main features of the model include mutational activation of oncogenes and/or inactivation of tumor suppressor genes. At least four or five genetic alterations must take place for a malignant tumor to form. The characteristics of the tumor depend on the accumulation of multiple genetic mutations rather than on a specific order in which these mutations occur.

Dukes’ colorectal cancer staging and Tumors/Nodes/Metastases (TNM) are the two classification systems used for the staging of CRC (Table 3). There has been a gradual move from Dukes’ to the TNM classification system, as TNM was reported to give a more accurate independent description of the primary tumor and its spread [20].

2. Prevention of Colorectal Cancer

Several approaches have been developed to reduce CRC incidence and mortality. Prevention includes primary and secondary strategies. Primary strategies include dietary changes, increased physical activity, and the use of nonsteroidal anti-inflammatory drugs (NSAIDs), while secondary strategies are based on screening tests (Table 4).

Interestingly, dietary factors are responsible for 70% to 90% of CRC cases. The relatively low CRC rates in the Mediterranean area compared with most Western countries are largely attributed to the traditional Mediterranean diet, which is characterized by high consumption of foods of plant origin, relatively low consumption of red meat, and high consumption of olive oil [32]. Therefore, diet modification could potentially help to reduce the incidence of CRC [33, 34]. Examples of dietary components that lower CRC risk are shown in Table 5.

Early diagnosis of CRC is important to improve outcomes. Fecal occult blood testing (FOBT) or fecal immunochemical test (FIT) is routinely used prior to colonoscopy, and only patients with a positive test result are referred to a specialist. Although these assays are useful screening tools, patient compliance with these stool-based assays tends to be low. Serum-based assays for the early detection of CRC are highly attractive, as they could be integrated into any regular health checkup without the need for additional stool sampling, thereby increasing acceptance among patients [29].

3. Gene Polymorphism

Polymorphism is the occurrence of two or more clearly different morphs or forms of a species in the population. Poly means many; morph means form [48]. The colored flowers of mustard, butterflies, and the human ABO blood group system are obvious examples of polymorphism [49, 50].

Genetic polymorphisms are different forms of the DNA sequence, which may or may not affect biological function depending on their exact nature. Polymorphism arises as a result of mutation. If the frequency of a specific sequence variant reaches 1% or more in the population, it is referred to as a polymorphism, and if it is lower than 1%, the allele is typically regarded as a mutation [51]. Molecular polymorphism, first demonstrated in Drosophila pseudoobscura, stimulated molecular studies of many other organisms and led to vigorous theoretical debate about the significance of the observed polymorphisms [52, 53].
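The 1% frequency convention described above can be written as a short sketch. The function names and the allele counts are hypothetical; only the 1% cutoff comes from the text.

```python
POLYMORPHISM_THRESHOLD = 0.01  # the 1% population-frequency cutoff stated in the text


def allele_frequency(variant_allele_count: int, total_alleles: int) -> float:
    """Frequency of a variant allele among all sampled alleles."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    if not 0 <= variant_allele_count <= total_alleles:
        raise ValueError("variant_allele_count must lie between 0 and total_alleles")
    return variant_allele_count / total_alleles


def classify_variant(frequency: float) -> str:
    """Label a sequence variant by the 1% convention."""
    return "polymorphism" if frequency >= POLYMORPHISM_THRESHOLD else "mutation"


# Hypothetical sample: 1000 diploid individuals contribute 2000 alleles.
common = allele_frequency(30, 2000)  # 1.5% of sampled alleles
rare = allele_frequency(4, 2000)     # 0.2% of sampled alleles

print(classify_variant(common))
print(classify_variant(rare))
```

Note that the label depends on the population sampled, so the same variant could fall on different sides of the cutoff in different populations.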

A single nucleotide polymorphism (SNP) is a variation in a single nucleotide that occurs at a specific position in the genome. Single nucleotide polymorphisms are the most abundant type of genetic variation in the human genome, accounting for more than 90% of all differences between individuals [54]. A single nucleotide may be changed (substitution), removed (deletion), or added (insertion) in a polynucleotide sequence [54].

Single nucleotide polymorphisms are also thought to be key to realizing the concept of personalized medicine, as they can affect how humans develop diseases and respond to pathogens, chemicals, drugs, vaccines, and other agents. Single nucleotide polymorphisms underlie differences in susceptibility to a wide range of human diseases. For example, a single base mutation in the apolipoprotein E gene is associated with a higher risk for Alzheimer’s disease. The severity of illness and the way the body responds to treatments are also manifestations of genetic variation [55, 56].

According to their location in the genome, SNPs are classified into cSNP in the coding region (exons), rSNP in the regulatory region, and iSNP located in the intronic region [54].
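The location-based classification above amounts to a lookup of a SNP position against annotated regions. The following is a minimal sketch under the assumption of a simplified gene model; the coordinates, the function name, and the "unclassified" fallback are hypothetical and not part of the cited classification.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) coordinates


def _contains(intervals: List[Interval], position: int) -> bool:
    """True if the position falls inside any of the given intervals."""
    return any(start <= position <= end for start, end in intervals)


def classify_snp_location(position: int, exons: List[Interval],
                          regulatory: List[Interval],
                          introns: List[Interval]) -> str:
    """Return cSNP, rSNP, or iSNP according to the region containing the SNP."""
    if _contains(exons, position):
        return "cSNP"
    if _contains(regulatory, position):
        return "rSNP"
    if _contains(introns, position):
        return "iSNP"
    return "unclassified"  # assumption: positions outside the annotated regions


# Hypothetical gene model, for illustration only.
REGULATORY = [(1, 500)]
EXONS = [(501, 700), (1201, 1400)]
INTRONS = [(701, 1200)]

for pos in (250, 600, 900):
    print(pos, classify_snp_location(pos, EXONS, REGULATORY, INTRONS))
```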

Polymorphisms in the coding region are either synonymous or nonsynonymous (Figure 4). Synonymous polymorphisms do not change the amino acid in the protein but can still affect its function in other ways. A silent mutation in the multidrug resistance gene 1, which codes for a cellular membrane pump that expels drugs from the cell, is an example of a synonymous polymorphism. It can slow down translation and allow unusual folding of the peptide chain, causing the mutant pump to be less functional [57, 58].

Nonsynonymous polymorphisms, on the other hand, can change the amino acid sequence of the protein and are subclassified into missense and nonsense. A missense polymorphism results in a different amino acid, such as the single base change G > T in the LMNA gene that replaces arginine with leucine at the protein level and manifests as progeria syndrome [59]. A nonsense polymorphism results in a premature stop codon and usually a nonfunctional protein product, as seen in cystic fibrosis caused by mutation in the cystic fibrosis transmembrane conductance regulator gene [60].
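The distinction between synonymous, missense, and nonsense changes can be sketched by translating a codon before and after a single base substitution with the standard genetic code. The function name and the example codons are hypothetical illustrations; in particular, the arginine-to-leucine codons below are generic and are not claimed to be the actual LMNA codons.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... with '*' for stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-base substitution within one DNA codon."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if ref_codon not in CODON_TABLE or alt_codon not in CODON_TABLE:
        raise ValueError("codons must be three DNA bases from A, C, G, T")
    if sum(r != a for r, a in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("expected exactly one substituted base")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "other"  # loss of a stop codon is outside the classes in the text
    return "missense"


print(classify_codon_change("GAA", "GAG"))  # glutamic acid kept
print(classify_codon_change("CGC", "CTC"))  # arginine replaced by leucine (G > T)
print(classify_codon_change("CGA", "TGA"))  # arginine replaced by a stop codon
```

The sketch considers only the encoded amino acid; as noted above, a synonymous change can still affect the protein through translation kinetics and folding, which this classification does not capture.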

Promoter polymorphisms can cause variations in gene expression by affecting the DNA binding site and altering its affinity for regulatory proteins, while intronic polymorphisms may affect gene splicing and messenger RNA degradation [61, 62].

Genotyping technologies typically involve the generation of allele-specific products for SNPs of interest, followed by their detection for genotype determination. With only a few exceptions, current genotyping technologies require a polymerase chain reaction (PCR) amplification step. In most technologies, PCR amplification of the desired SNP-containing region is performed first to introduce specificity and to increase the number of molecules available for detection following allelic discrimination [63]. Enzymatic cleavage, primer extension, hybridization, and ligation are four popular methods used for allelic discrimination (Table 6).

4. Genome-Wide Association Study and Colorectal Cancer

A genome-wide association study (GWAS), also known as a whole genome association study, is defined as an examination of many common SNPs in different individuals to see whether any SNP is associated with a disease. A GWAS compares the DNA of participants who have a disease with that of similar people without the disease. The ultimate goal is to determine genetic risk factors that can be used to predict who is at risk for a disease and to identify their role in disease development in order to develop new prevention and treatment strategies [68].
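The case-control comparison described above can be sketched for a single SNP. The odds ratio and its Woolf-type 95% confidence interval are used here as a common summary of association; this choice, the function name, and the allele counts are assumptions made for illustration and are not taken from the cited studies.

```python
import math
from typing import Tuple


def allelic_odds_ratio(case_risk: int, case_other: int,
                       control_risk: int, control_other: int) -> Tuple[float, float, float]:
    """Odds ratio and approximate 95% CI from a 2x2 table of allele counts."""
    counts = (case_risk, case_other, control_risk, control_other)
    if min(counts) <= 0:
        raise ValueError("all four allele counts must be positive")
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    # Standard error of log(OR) by the Woolf method.
    se_log_or = math.sqrt(sum(1 / c for c in counts))
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts: 500 cases and 500 controls, two alleles each.
or_value, ci_low, ci_high = allelic_odds_ratio(600, 400, 500, 500)
print(f"OR = {or_value:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

In an actual GWAS this comparison is repeated across a very large number of SNPs, which is one reason that false-positive findings and independent replication are a central concern.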

The availability of chip-based microarray technology that assays hundreds of thousands of SNPs has made genome-wide association studies easier to perform (Table 7). A GWAS identifies specific genomic locations, not complete genes. Many SNPs identified in GWAS are near a protein-coding gene or are within genes that were not previously believed to be associated with the disease. Researchers therefore use data from this type of study to pinpoint genes that may contribute to a person’s risk of developing a certain disease [69].

The genome-wide association study is built on the expanding knowledge of the relationships among SNPs generated by the international HapMap project. The HapMap project is an international scientific effort to identify common SNPs among people from different ethnic populations. When several SNPs cluster together on a chromosome, they are inherited as a block known as a haplotype. The HapMap describes haplotypes, including their locations in the genome and how common they are in different populations throughout the world [70].
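How common a haplotype is in a population can be expressed as its frequency among sampled chromosomes. The sketch below assumes that haplotypes are already phased and written as strings of alleles at neighboring SNPs; the function name and the sample are hypothetical and do not represent HapMap data.

```python
from collections import Counter
from typing import Dict, List


def haplotype_frequencies(haplotypes: List[str]) -> Dict[str, float]:
    """Relative frequency of each haplotype among the sampled chromosomes."""
    if not haplotypes:
        raise ValueError("at least one haplotype is required")
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return {haplotype: count / total for haplotype, count in counts.items()}


# Hypothetical phased haplotypes over three neighboring SNPs in one population.
sample = ["ACG", "ACG", "TCG", "ACA"]
for haplotype, frequency in haplotype_frequencies(sample).items():
    print(haplotype, frequency)
```

Running the same calculation on samples from different populations would give the population-specific frequencies that the text refers to.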

The genome-wide association study is an important tool for discovering genetic variants influencing a disease, but it has important limitations, including the potential for false-positive and false-negative results and for biases related to the selection of study participants and genotyping errors [71]. The gold standard for validation of any GWAS is replication in an additional independent sample. Replication studies are performed in an independent set of data drawn from the same population as the GWAS, in an attempt to confirm the effect in the GWAS target population. Once an effect is confirmed in the target population, other populations may be sampled to determine whether the SNP has an ethnic-specific effect [72].

It has been recognized that SNPs play an important role in conferring risk of CRC. Genome-wide association studies have reported multiple loci associated with CRC risk, some of which are involved in the transforming growth factor-β (TGF-β) signaling pathway [73]. For example, SMAD7 rs4939827 was found to be associated with CRC in two GWASs [74, 75]. The association of SMAD7 rs4939827 with CRC was confirmed by other replication studies [76, 77]. A summary of other SNPs studied as risk factors for CRC is shown in Table 8.

5. Transforming Growth Factor-β Signaling and Its Regulatory Smad7

Mothers against decapentaplegic homolog 7 (Smad7) is a key inhibitor of TGF-β [94, 95]. Smad7 was named after mothers against decapentaplegic (mad), an intermediate of the decapentaplegic signaling pathway in Drosophila melanogaster, and the sma gene in Caenorhabditis elegans, which has a mutant phenotype similar to that observed for the TGF-β-like receptor gene [96]. Regulation of TGF-β by Smad7 is crucial to maintain gastrointestinal homeostasis [97]. Smad7 overexpression is commonly found in patients with chronic inflammatory conditions of the colon [98] and may be associated with prognosis in patients with CRC [99]. Loss of Smad/TGF-β signaling interrupts the principal role of TGF-β as a growth inhibitor, allowing unchecked cellular proliferation [100].

In the early 1980s, Roberts and his colleagues isolated from murine sarcoma cell extracts two fractions that could induce the growth of normal fibroblasts; these were named TGF-α and TGF-β [101, 102]. Transforming growth factor-β is the prototype of a large family of cytokines that includes the TGF-βs, activins, inhibins, and bone morphogenetic proteins (BMPs) [103].

In mammals, TGF-β has three isoforms (TGF-β1, TGF-β2, and TGF-β3) with similar biological properties. The TGF-β isoforms are encoded by genes located on different chromosomes. The TGF-β1 gene is located on chromosome 19q13.1, while the TGF-β2 and TGF-β3 genes are located on chromosomes 1q41 and 14q24.3, respectively [104].

The isoforms TGF-β1, TGF-β2, and TGF-β3 are encoded as large precursors, which undergo proteolytic digestion by the endopeptidase furin, yielding two products that assemble into dimers. One is the latency-associated peptide (LAP), a dimer from the N-terminal region. The other is mature TGF-β, a dimer from the C-terminal portion. A common feature of TGF-β is that its N-terminal portion (LAP) remains noncovalently associated with the mature TGF-β, forming a small latent complex [105, 106]. The small latent complex is linked via disulfide bonds to a large protein termed latent TGF-β binding protein (LTBP), forming the large latent complex for targeted export to the extracellular matrix (ECM) [107, 108]. For TGF-β to bind its receptors, the latent complex must be removed so that the receptor-binding site in TGF-β is not masked by LAP. Latent TGF-β is cleaved by several factors, including proteases, thrombospondin, reactive oxygen species (ROS), and integrins (Figure 5) [109, 110].

Transforming growth factor-β is a pleiotropic cytokine that has a dual function in cancer development, where it acts as a tumor suppressor in the early stages and a tumor promoter in the late stages [111]. The main actions of TGF-β are summarized in Table 9.

The active TGF-β binds to transforming growth factor-β receptor 2 (TGF-βR2), a serine/threonine kinase receptor, leading to the recruitment and phosphorylation of the TGF-βR1 (Figure 6). The activated TGF-βR1 interacts with and phosphorylates a number of proteins, thereby activating multiple downstream signaling pathways in either a Smad-dependent (canonical) or Smad-independent (noncanonical) signaling pathway (Figure 6) [96].

In the canonical pathway, TGF-βR1 propagates the signal through a family of intracellular signal mediators known as Smads. To date, eight mammalian Smad proteins have been characterized and are grouped into three functional classes: receptor-activated Smads (R-Smads) including Smad1, Smad2, Smad3, Smad5, and Smad8, common mediator Smad (Smad4), and inhibitory Smads (I-Smads) including Smad6 and Smad7. Receptor-activated Smads are retained in the cytoplasm by binding to SARA (Smad anchor for receptor activation). Receptor-activated Smads are released from SARA when they are phosphorylated by the activated TGF-βR1 [130, 131].

Once R-Smads (Smad2/3) are activated through phosphorylation by TGF-βR1, they form an oligomeric complex with Smad4 and translocate into the nucleus, where the complex modulates the transcription of specific genes. The ability of Smads to target a particular gene and the decision to activate or repress gene transcription are determined by many cofactors that affect the Smad complex [130].

In the noncanonical pathway, TGF-β activates other non-Smad signaling pathways (Table 10). Some of these pathways can regulate Smad activation, but others might induce responses unrelated to Smad [132].

Transforming growth factor-β is strongly implicated in cancer, as genetic alterations of some common components of the TGF-β pathway (Table 11) have been identified in human tumors [141].

6. Inhibitory Smad (I-Smad, Smad7)

Mothers against decapentaplegic homolog 7 (Smad7) belongs to the third type of Smads, the I-Smads, which also include Smad6. The structure of the Smads is characterized by two conserved regions known as the amino terminal (N-terminal) Mad homology domain-1 (MH1) and the C-terminal Mad homology domain-2 (MH2), which are joined by a short, poorly conserved linker region. The MH1 domain is highly conserved among the R-Smads and the Co-Smad, whereas the I-Smads lack an MH1 domain. The MH2 domain is conserved among all of the Smad proteins, but I-Smads lack the SXSS motif, which is needed for phosphorylation following TGF-βR1 activation (Figure 7). Thus, I-Smads are not phosphorylated upon binding of TGF-β to its receptors. The L3 loop in the MH2 domain of the R-Smads is a specific binding site for TGF-βR1 [95, 156].

Smad7 antagonizes TGF-β signaling through multiple mechanisms, both in the cytoplasm and the nucleus (Figure 8). Smad7 antagonizes TGF-β in the cytoplasm through the formation of a stable complex with TGF-βR1, leading to inhibition of R-Smad phosphorylation. Smad7 can recruit E3 ubiquitin ligases that induce the degradation of activated TGF-βR1 complexes [156, 157]. Also, Smad7 forms a heteromeric complex with R-Smads through the MH2 domain and hence interferes with R-Smad (Smad2/3)-Smad4 oligomerization in a competitive manner. Additionally, Smad7 can bind to DNA disrupting the formation of functional Smad-DNA complexes [158, 159].

Inhibitory Smads can mediate the cross talk of TGF-β with other signaling pathways. Various extracellular stimuli such as interferon-γ (IFN-γ) can induce Smad7 expression to exert opposite effects on diverse cellular functions modulated by TGF-β [161]. In addition, Smad7 was found to be a key regulator of the Wnt/β-catenin pathway, which is responsible for TGF-β-induced apoptosis and survival in various cell types [162].

The role of Smad7 in tumor development is controversial and depends on the type of tumor. High Smad7 expression was reported to be correlated with the clinical prognosis of patients with colorectal, pancreatic, liver, and prostate cancer. In contrast, a protective role of high Smad7 expression was reported in other tumors [163]. Boulay et al. [164] found that CRC patients with deletion of Smad7 had a favorable clinical outcome compared with patients with Smad7 expression. Additionally, Smad7 was found to act as a scaffold protein that facilitates TGF-β-induced activation of p38 and subsequent apoptosis in prostate cancer cells [162].

Even in the same tumor, the function of Smad7 can switch from tumor suppressive to tumor promoting depending on the tumor stage (i.e., early versus advanced). These apparently contradictory functions are in harmony with the opposite roles of TGF-β signaling pathway in the early versus advanced tumor stages and the interaction of Smad7 with a vast array of functionally heterogeneous molecules that may be differently expressed during the carcinogenic process [160].

The overexpression of Smad7 in CRC cells was reported to enhance cell growth and inhibit apoptosis through a mechanism dependent on suppression of TGF-β signaling [100]. In addition, Smad7 deficiency in CRC cells was reported to enhance the accumulation of cells in the S phase of the cell cycle and cell death through a pathway independent of TGF-β [165]. Genetic variants in the SMAD7 gene have been extensively studied in CRC patients (Table 12).

7. Chitinase 3 Like 1/YKL-40

YKL-40 is a mammalian member of the chitinase protein family. YKL-40 is a 40 kDa heparin- and chitin-binding glycoprotein. The human protein was named YKL-40 based on its three N-terminal amino acids tyrosine (Y), lysine (K), and leucine (L) and its 40 kDa molecular mass [178]. This protein has several names, YKL-40 [178], human cartilage glycoprotein-39 (HC-gp39) [179], 38 kDa heparin-binding glycoprotein (Gp38k) [180], chondrex [181], and 40 kDa mammary gland protein (MGP-40) [182].

In a search for new bone proteins, the glycoprotein YKL-40 was identified in 1989 as a protein secreted in vitro by the human osteosarcoma cell line MG63. The protein was later found to be secreted by differentiated smooth muscle cells, macrophages, human synovial cells, and nonlactating mammary gland [178, 181, 182]. In 1997, the chitinase 3 like 1 (CHI3L1) gene encoding YKL-40 was isolated. It is assigned to chromosome 1q31-q32, consists of 10 exons, and spans about 8 kilobases of genomic DNA [178, 183].

Based on its amino acid sequence, YKL-40 was found to belong to glycosyl hydrolase family 18. Glycosyl hydrolases hydrolyze the glycosidic bond between two or more carbohydrates or between a carbohydrate and a noncarbohydrate moiety. Based on sequence similarity, there are more than 100 different families of glycosyl hydrolases [184–186].

Chitin, a polymer of N-acetyl glucosamine, is the second most abundant polysaccharide in nature, following cellulose. It is found in the walls of fungi, the exoskeleton of crabs, shrimp, and insects, and the microfilarial sheath of parasitic nematodes [187]. Chitin accumulation is regulated by the balance of chitin synthase-mediated biosynthesis and degradation by chitinases. Although YKL-40 contains highly conserved chitin-binding domains, it lacks chitinase activity due to the mutation of the catalytic glutamic acid into leucine [183].

Several types of solid tumors can express YKL-40, including osteosarcoma [178], CRC [188], thyroid carcinoma [189], breast [190], ovarian [191], lung [192], and pancreatic cancer [193], glioblastoma [194–196], and cholangiocarcinoma [197].

There are several synergistic and antagonistic factors that modulate the regulatory functions of YKL-40 (Figure 9) in both normal and pathological conditions [198].

8. CHI3L1/YKL-40 Targets and Actions

Although the biological function of YKL-40 is not fully understood, the pattern of its expression suggests a function in the remodeling or degradation of the ECM. Diverse roles for YKL-40 in cell proliferation, differentiation, survival, inflammation, and tissue remodeling have been suggested [199]. Aberrant expression of YKL-40 is associated with the pathogenesis of an array of human diseases (Figure 10).

Elevated serum YKL-40 levels were reported to be associated with a wide range of inflammatory diseases (Table 13). More than 75% of patients with Streptococcus pneumoniae bacteremia had elevated serum levels of YKL-40 compared with age-matched healthy subjects. Antibiotic treatment of these patients returned serum YKL-40 to normal levels within a few days in most patients, before serum C-reactive protein (CRP) reached normal levels [200].

Biologically, YKL-40 was found to activate a wide range of inflammatory responses. An inflammatory stimulus can trigger the secretion of a variety of cytokines that in turn may regulate YKL-40 (Figure 11). Increased YKL-40 was reported to regulate chronic inflammatory responses in conditions such as asthma, chronic obstructive pulmonary disease (COPD), cardiovascular disease (CVD), and arthritis. Inhibition of YKL-40 with an anti-CHI3L1 antibody may be a useful therapeutic strategy to control or reduce the effects of inflammatory diseases [198].

Over the past three decades, considerable attention has been focused on the potential role of YKL-40 in the development of a variety of human cancers. Serum levels of YKL-40 (Table 14) were independent of serum carcinoembryonic antigen (CEA) in CRC [188], serum cancer antigen 125 (CA-125) in ovarian cancer [191], serum human epidermal growth factor receptor 2 (HER-2) in metastatic breast cancer [190], serum lactate dehydrogenase (LDH) in small cell lung cancer [192], and serum prostate-specific antigen (PSA) in metastatic prostate cancer [208]. Therefore, it may be of value to include serum YKL-40 as a biomarker for cancer screening together with a panel of other tumor markers, as it can reflect aspects of tumor growth and metastasis other than those captured by the routine tumor markers [201].

Macrophages and neutrophils in the tumor microenvironment, as well as tumor cells, were found to secrete YKL-40 into the extracellular space, where it can enhance tumor initiation, proliferation, angiogenesis, and metastasis (Figure 12).

The ability of YKL-40 to induce cytokine secretion, proliferation, and migration of target cells suggests the existence of its receptors on the cell surface. However, receptors interacting with YKL-40 are incompletely characterized, and only limited information is available about YKL-40-induced signaling pathways. There is evidence supporting the hypothesis that cross talk between adjacent membrane-anchored receptors plays a key role in transmitting “outside-in” signaling to the cells, leading to a diverse array of intracellular signaling [213, 214].

YKL-40 possesses heparin-binding affinity, which enables it to specifically bind heparan sulfate (HS) fragments [215]. Syndecans are transmembrane molecules with cytoplasmic domains that can interact with a number of regulators [216]. Syndecan-1 is the major source of cell surface HS. There is compelling evidence that syndecan-1 can act as a matrix coreceptor with adjacent membrane-bound receptors such as integrins to mediate cell adhesion and/or spreading [217]. YKL-40 was found to induce the coupling of syndecan-1 and αvβ3 integrin (Figure 13), resulting in phosphorylation of focal adhesion kinase (FAK) and activation of the downstream ERK1/2 signaling pathway, which enhances vascular endothelial growth factor (VEGF) expression in tumor cells, angiogenesis, and tumor growth [214]. Additionally, the ERK1/2 and JNK signaling pathways were reported to upregulate proinflammatory mediators such as C-C motif chemokine ligand 2 (CCL2), C-X-C motif chemokine ligand 2 (CXCL2), and matrix metalloproteinase-9 (MMP-9), all of which contribute to tumor growth and metastasis [218].

A VEGF-independent pathway was also reported to mediate the angiogenic activity of YKL-40, as an anti-VEGF neutralizing antibody failed to impede YKL-40-induced migration [219]. Therefore, targeting both YKL-40 and VEGF, along with radiotherapy, could be an effective therapeutic approach against these tumors.

Furthermore, YKL-40 was demonstrated to stimulate TGF-β1 production in malignant cells via an interleukin-13 receptor α2- (IL-13Rα2-) dependent mechanism (Figure 14). The binding of YKL-40 to IL-13Rα2 results in the activation of MAPK, AKT, and Wnt/β-catenin, which play an important role in inhibiting apoptosis and interleukin-1β (IL-1β) production, so that YKL-40 acts as a potential cancer promoter [220].

Recently, Low et al. [221] showed that YKL-40 can also bind the cell surface receptor for advanced glycation end products (RAGE), which is involved in tumor cell proliferation, migration, and survival through β-catenin- and nuclear factor kappa-B- (NF-κB-) associated signaling pathways [221, 222].

Most ongoing research has focused on SNP rs4950928 in the promoter region of the CHI3L1 gene, as it was found to be associated with serum/plasma YKL-40 levels [223, 224] and with diseases such as asthma, bronchial hyperresponsiveness [207], and the severity of hepatitis C virus-induced liver fibrosis [225]. Some of the association studies of CHI3L1 SNPs with different diseases are shown in Table 15.

Conflicts of Interest

The authors declare that they have no conflict of interest.