Abstract

The nucleotide and amino-acid distributions are studied for two mRNA variants of a gene that codes for a protein involved in multiple myeloma. Some patches and symmetries are singled out, showing some distinctions between the two variants. Fractal dimension and entropy are discussed as well.

1. Introduction

In some recent papers, the concepts of fractality [1–19] and entropy [19–21] have been considered as fundamental parameters for investigating the existence of correlations [22–36] and simple rules [37] in DNA sequences. In particular, it has been observed that an increasing fractal dimension [7–13] can be related to a degeneration in sequences, with the pathological evolution of related diseases as a consequence. A fundamental role is played by the concept of information entropy [20, 21]: a change in the nucleotide distribution in DNA implies a corresponding change in the information content and, as a consequence, a variation in the entropy. Since cell activity is functionally dependent on the nucleotide distribution, our task is to better understand this distribution and the possible existence of large-scale structure [1–6, 15, 22–37], so that we can relate the functional activity of cells to some characteristic patches in the nucleotide distribution. In the following, we also propose to take into account the information content of the amino-acid distribution. In particular, we will see that the amino-acid distribution shows a higher-level structure, and some patchiness, which are undetectable in the nucleotide distribution. Our statistical approach is based on the transformation of the symbolic string into a numerical string by the Voss indicator function [4, 5], which is a discrete binary function. The indicator matrix is defined from this function, and on this matrix the fractal dimension and entropy can be simply computed. We will compare the fractal dimension and complexity of two mRNA variants of the TET2 (ten-eleven translocation 2) gene downloaded from GenBank [38] (similar data are also available from [39–41]), showing that these parameters can be used to classify the two variants.

Multiple myeloma is a pathology that involves plasma cells, but it can move and spread through the whole body. Some aspects are still unclear; however, it is known that this pathology is characterized by the activation of abnormal genes through chromosomal translocations and other genetic anomalies. One of the genes involved in the onset and progression of multiple myeloma is TET2. It is implicated in some myelodysplastic syndromes, and it seems to play a key role when it is subject to mutation. TET2 is related to myelopoiesis; it encodes a protein that is significantly expressed in hematopoietic cells and granulocytes.

2. Multiple Myeloma and the Oncogene TET2

Multiple myeloma (MM) is a blood cancer of the plasma cell. Myeloma originates in a specific type of cell, the plasma cell, but it can move, spreading through the blood to the whole body. Like other cancers, multiple myeloma develops in steps. It begins when a normal plasma cell becomes abnormal. The abnormal cell divides, and the new cells divide again and again, so that the number of abnormal cells keeps growing. Myeloma cells collect in the bone marrow and in the solid part of the bone. These malignant plasma cells produce a paraprotein, an inactive antibody also known as M-protein or Bence Jones protein, that attacks bone marrow, bones, blood, and kidneys. As a consequence, there is extensive destruction within the skeleton involving multiple bones, resulting in widespread bone pain and multiple fractures; for this reason, the disease is called multiple myeloma. Some genetic factors are also involved in this pathology. In the absence of other symptoms and clinical signs, the condition is more properly called benign monoclonal gammopathy of undetermined significance (MGUS). The uncertainty concerns its future progression: even this benign condition might evolve into MM. It is likely that the evolution of MGUS into MM depends on many mutations of the MGUS clone. Initially, MM progresses slowly, but afterward it becomes more aggressive. The signs that characterize the onset of multiple myeloma are mainly a high concentration of calcium ions with damage to the kidneys, a weakening of the immune system with abnormal production of immunoglobulin, and some other signs such as evident osteoporosis. Both MGUS and MM are characterized by alterations in gene expression [42–46]. The chromosomes most often involved are 1, 11, 13, and 14. The alteration at chromosome 1 is found in half of MM patients [47–54]. The same chromosomal aberrations seem to be evident both in MM and in MGUS, thus supporting the thesis that these two diseases are closely related [53].

The TET2 gene is located on chromosome 4, at 4q24. More precisely, it spans base pairs 106,067,942 to 106,200,957 on chromosome 4, as shown in Figure 1 [38].

The TET2 gene plays a key role in the conversion of 5-methylcytosine (5mC) to 5-hydroxymethylcytosine (hmC); moreover, it is related to myelopoiesis. Several roles have been noted for hmC, for example, (1) remodeling of chromatin structure, (2) recruitment of some factors, and (3) demethylation of cytosine [55, 56]. TET2 encodes a protein that is significantly expressed in hematopoietic cells and granulocytes. In almost all patients with myelodysplastic syndromes, the protein is decreased in peripheral blood granulocytes. TET2 is usually mutated in myeloproliferative disorders (MPDs), which are part of a larger group of disorders called myeloproliferative neoplasms (MPNs). Mutation of TET2 characterizes a disorder known as systemic mast cell disease, but TET2 is above all mutated in myelodysplastic syndromes [57].

We will see that, by using some parameters defined on the indicator function, we can single out some patches which characterize abnormal functional activity [13, 35, 58, 59].

3. DNA Representation

The DNA, as well as the mRNA, of each organism of a given species is a sequence of a specific number of base pairs defined on the 4-element alphabet of nucleotides:

$$\mathcal{A}_4 = \{A, C, G, T\}.$$

Since the base pairs are distributed along a double helix, when straightened, the helix appears as a complementary double-strand system. The two sequences on opposite strands are complementary in the sense that opposite nucleotides must fulfil the ligand rules ($A$ with $T$ and $C$ with $G$) of base pairs, between the purines $A$ and $G$ and the pyrimidines $C$ and $T$. In a DNA sequence, there are some subsequences with special meaning, which can be roughly subdivided into coding and noncoding regions. In particular, genes (belonging to coding regions) are characteristic sequences of base pairs, and the genes in turn are made of alternating subsequences of exons and introns (except in prokaryotes, where introns are missing). Each exon region is made of triplets of adjacent bases called codons. There are 64 possible codons, since the 4 nucleotides can be arranged into ordered triplets in $4^3 = 64$ ways. There are only 20 amino acids; therefore, the correspondence from codons to amino acids is many-to-one. The 20-element alphabet of amino acids is given in Table 1.

In the following, we will analyze two mRNA sequences, (H1) and (H2), downloaded from the National Center for Biotechnology Information [38]:

(H1) Homo sapiens tet oncogene family member 2 (TET2), transcript variant 1, mRNA, locus NM_001127208 (9796 bp mRNA linear). The accession number is NM_001127208, version NM_001127208.2, GI: 325197189.

(H2) Homo sapiens tet oncogene family member 2 (TET2), transcript variant 2, mRNA, locus NM_017628 (9236 bp mRNA linear). The accession number is NM_017628, version NM_017628.4, GI: 325197183.

Some differences between the two variants are the following: (H2) differs from (H1) in the 5′ UTR (untranslated region) and in the 3′ UTR; furthermore, the (H2) variant, compared with (H1), has a distinct C-terminus and is shorter than (H1), which is also represented by a longer transcript [37].

4. Dot Plot on the Indicator Matrix

In this section, we define the indicator matrix [4, 5], on which the computation of multifractality and entropy is based.

4.1. Indicator Function for the 4-Symbol Alphabet

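Let $\mathcal{A}_4 = \{A, C, G, T\}$ be the finite set (alphabet) of nucleotides and $x \in \mathcal{A}_4$ any member of the 4-symbol alphabet.

A DNA sequence is the finite symbolic sequence $S = \{x_h\}_{h=1,\dots,N}$ ($N < \infty$) over $\mathcal{A}_4$, with $x_h \in \mathcal{A}_4$ being the nucleic acid at position $h$.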

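Let $S_1$, $S_2$ be two DNA sequences; the indicator function [4, 5] is the map

$$u : S_1 \times S_2 \to \{0, 1\}, \qquad u(x_h, x_k) = \begin{cases} 1 & \text{if } x_h = x_k, \\ 0 & \text{if } x_h \neq x_k, \end{cases} \qquad x_h \in S_1,\ x_k \in S_2. \tag{4.5}$$

When $S_1 \equiv S_2$, the indicator function shows the existence of autocorrelation on the same sequence. According to (4.5), the indicator map of the $N$-length sequence can be easily represented by the $N \times N$ sparse matrix of binary values $\{0, 1\}$, $u_{hk} = u(x_h, x_k)$, and this matrix can be visualized as an (autocorrelation) dot plot, in which a dot at position $(h, k)$ marks $u_{hk} = 1$.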
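The construction of the indicator matrix and its dot plot can be illustrated with a minimal sketch. It assumes the sequences are available as plain ASCII strings; the function names `indicator_matrix` and `dot_plot` and the toy fragment are illustrative only and are not taken from the original study.

```python
import numpy as np

def indicator_matrix(s1, s2=None):
    """Binary matrix u[h, k] = 1 if s1[h] == s2[k], else 0."""
    if s2 is None:
        s2 = s1  # autocorrelation case: both sequences coincide
    a = np.frombuffer(s1.encode("ascii"), dtype=np.uint8)
    b = np.frombuffer(s2.encode("ascii"), dtype=np.uint8)
    # Broadcasting compares every symbol of s1 with every symbol of s2.
    return (a[:, None] == b[None, :]).astype(np.uint8)

def dot_plot(u):
    """Text rendering of the matrix: a dot marks u[h, k] == 1."""
    return "\n".join("".join("." if v else " " for v in row) for row in u)

seq = "ATGGCA"  # toy fragment (assumption), not taken from (H1) or (H2)
u = indicator_matrix(seq)
print(dot_plot(u))
```

For a full transcript of about 10^4 bases the dense matrix has about 10^8 entries, so in practice one would probably plot only a window of the sequence or use a sparse representation.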

4.2. Indicator Function for the 20-Symbol Alphabet of Amino Acids

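As a generalization of the 4-symbol alphabet of nucleotides, we can define the 20-symbol alphabet of amino acids (in one-letter code) as follows:

$$\mathcal{A}_{20} = \{\text{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}\}.$$

A protein sequence is the finite symbolic sequence $S = \{x_h\}_{h=1,\dots,N}$ ($N < \infty$) over $\mathcal{A}_{20}$, with $x_h \in \mathcal{A}_{20}$ being the amino acid at position $h$.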

The indicator function [4, 5] can also be extended to two protein sequences $S_1$, $S_2$ as the map $u : S_1 \times S_2 \to \{0, 1\}$ such that $u(x_h, x_k) = 1$ if $x_h = x_k$ and $u(x_h, x_k) = 0$ otherwise, where $x_h \in S_1$ and $x_k \in S_2$ are now amino acids. After a translation of the two sequences (H1) and (H2) into their amino-acid components, we can see that the corresponding dot plots show some (higher-level) structure that is not apparent in the distribution of nucleotides (see Figure 2). In particular, (H2) shows a special pattern which is more evident in the amino-acid dot plot.
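The translation step can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard genetic code (NCBI translation table 1), and it assumes the input string is already the coding region, in frame and starting at the start codon, whereas a full mRNA record also contains untranslated regions. The names `CODON_TABLE` and `translate` are hypothetical helpers, not part of the original study.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), listed in the
# order TTT, TTC, TTA, TTG, TCT, ...; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cds):
    """Translate an in-frame coding sequence up to the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA or DNA letters
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# 64 codons map onto 20 amino acids plus the stop signal (many-to-one).
n_codons = len(CODON_TABLE)
n_amino = len(set(CODON_TABLE.values()) - {"*"})
print(n_codons, n_amino, translate("AUGGCAUAA"))
```

The resulting amino-acid string can then be treated exactly like a nucleotide string when building its indicator matrix, only over a 20-symbol alphabet.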

5. Probability Distribution

5.1. Frequency Distribution

The probability distribution of nucleotides can be defined by the frequency with which the nucleic acid $x \in \mathcal{A}_4 = \{A, C, G, T\}$ is found at position $n$. This value can be approximated by the frequency count (on the indicator matrix) of the nucleotide $x$ among the first $n$ positions:

$$p_x(n) = \frac{1}{n} \sum_{i=1}^{n} u(x, x_i), \qquad x \in \mathcal{A}_4,\ 1 \le n \le N, \tag{5.1}$$

where $u(x, x_i) = 1$ if the nucleotide at position $i$ is $x$ and $0$ otherwise. For the transcript variants, this gives the frequency distribution of Figure 3, in which each $p_x(n)$ tends to a different constant value, thus showing that the nucleotides are heterogeneously distributed.
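The running frequency count described above can be sketched in a few lines. The sketch assumes the sequence is a plain string over the given alphabet; `cumulative_frequency` is an illustrative name, and the toy input is not taken from (H1) or (H2).

```python
import numpy as np

def cumulative_frequency(seq, alphabet="ACGT"):
    """For each symbol x, return the array of frequencies of x among
    the first n symbols of seq, for n = 1, ..., len(seq)."""
    symbols = np.array(list(seq))
    n = np.arange(1, len(symbols) + 1)
    # cumsum of the 0/1 indicator gives the running count of x.
    return {x: np.cumsum(symbols == x) / n for x in alphabet}

p = cumulative_frequency("AACGT")  # toy sequence (assumption)
print(p["A"])
```

Passing the 20 one-letter amino-acid codes as `alphabet` gives the analogous amino-acid frequencies for a translated sequence.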

5.2. Distribution of the Essential Amino Acids

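Analogously to the nucleotide frequency distribution, we can also compute the frequency with which the amino acid $x$ of the 20-symbol amino-acid alphabet $\mathcal{A}_{20}$ is found at position $n$:

$$p_x(n) = \frac{1}{n} \sum_{i=1}^{n} u(x, x_i), \qquad x \in \mathcal{A}_{20},\ 1 \le n \le N, \tag{5.2}$$

where $u(x, x_i) = 1$ if the amino acid at position $i$ is $x$ and $0$ otherwise.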

In particular, we have noticed that even if the nucleotide distribution is nearly the same in both sequences (H1) and (H2), the same amino acid shows different distributions in the two sequences. In other words, the “second”-level distribution seems to be organized according to a different distribution law (see Figures 4 and 5).

6. Fractal Dimension and Entropy

6.1. Fractal Dimension

The frequency distribution implies a corresponding frequency of correlation in the correlation matrix. By using the indicator matrix, it is possible to give a simple formula which enables us to estimate the fractal dimension as the average of the number of 1s in randomly taken minors of the correlation matrix. If we compare the fractal dimensions of the two mRNA sequences (H1) and (H2), we can see (Figure 6) that the fractal dimension of the nucleotide distribution tends, for both variants, to the value 1.26.
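Since the explicit formula is not reproduced in this text, the following is only one possible reading of such an estimate, given as a hedged sketch: for each size $n$ it takes a randomly positioned $n \times n$ principal block of the indicator matrix, counts its 1s, and averages the ratio $\log(\text{ones}) / \log n$ over $n$. The choice of principal blocks, the normalization, and the name `fractal_dimension_estimate` are all assumptions, so the values it returns need not match those quoted in the text. The count of 1s in a block equals the sum of squared symbol counts in the corresponding window, which avoids building the matrix.

```python
import numpy as np

def fractal_dimension_estimate(seq, alphabet="ACGT", seed=0):
    """Average of log(ones)/log(n) over random n-by-n principal blocks
    of the indicator matrix (assumed normalization; see lead-in).
    Requires len(seq) >= 2 and every symbol of seq in `alphabet`."""
    rng = np.random.default_rng(seed)
    symbols = np.array(list(seq))
    N = len(symbols)
    # prefix[a, i] = occurrences of alphabet[a] in seq[:i]
    prefix = np.zeros((len(alphabet), N + 1), dtype=np.int64)
    for a, x in enumerate(alphabet):
        prefix[a, 1:] = np.cumsum(symbols == x)
    ratios = []
    for n in range(2, N + 1):
        s = rng.integers(0, N - n + 1)  # random start of the window
        counts = prefix[:, s + n] - prefix[:, s]
        # u[h, k] = 1 iff symbols match, so ones = sum of counts squared.
        ones = int(np.sum(counts ** 2))
        ratios.append(np.log(ones) / np.log(n))
    return float(np.mean(ratios))

d_const = fractal_dimension_estimate("A" * 50)      # block full of 1s
d_mixed = fractal_dimension_estimate("ACGT" * 25)   # toy periodic input
print(d_const, d_mixed)
```

With this reading the estimate always lies between 1 (only the diagonal is filled) and 2 (the block is completely filled).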

It is interesting to notice that the amino-acid sequences of the two variants have (more or less) the same fractal dimension, which tends for both (Figure 7) to the value 1.29.

6.2. Entropy Estimate

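As a measure of the information distribution, we consider the normalized Shannon entropy, which is defined, for a distribution $p_x(n)$ over the alphabet $\mathcal{A}_\ell$ of $\ell$ symbols, as

$$H(n) = -\sum_{x \in \mathcal{A}_\ell} p_x(n) \log_\ell p_x(n),$$

where $p_x(n)$ is given by (5.1) for nucleotides ($\ell = 4$) and (5.2) for amino acids ($\ell = 20$).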

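Since $\lim_{p \to 0^+} p \log p = 0$, we set $0 \log 0 = 0$, and the main values of this function, for an alphabet of $\ell$ symbols, are the following.
(1) If $p_x = 1$ for one symbol $x$ and $p_y = 0$ for all $y \neq x$, then $H = 0$. This happens when the information is concentrated in only one symbol.
(2) If $p_x = 1/\ell$ for every symbol $x$, then $H = 1$. In this case, the information is equally distributed over all symbols.
(3) In general, $0 \le H \le 1$: the information content is distributed over the range $[0, 1]$.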

Therefore, the entropy is a nonnegative function ranging in the interval $[0, 1]$; the minimum value is obtained when the distribution is concentrated on a single symbol, while the maximum value is obtained when all symbols are equally distributed.
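The limiting values listed above can be checked numerically with a short sketch. It assumes the normalization is obtained by taking the logarithm in the base of the alphabet size, which is the choice that makes the uniform distribution give exactly 1; the function names `normalized_entropy` and `entropy_profile` are illustrative.

```python
import numpy as np

def normalized_entropy(p):
    """Shannon entropy of the distribution p, with the logarithm taken
    in base len(p) so that the result lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # convention 0 * log 0 = 0
    return float(-(nz * np.log(nz)).sum() / np.log(len(p)))

def entropy_profile(seq, alphabet="ACGT"):
    """Entropy of the running symbol frequencies for n = 1, ..., len(seq)."""
    symbols = np.array(list(seq))
    n = np.arange(1, len(symbols) + 1)
    P = np.array([np.cumsum(symbols == x) / n for x in alphabet])
    return np.array([normalized_entropy(P[:, i]) for i in range(len(symbols))])

h_single = normalized_entropy([1, 0, 0, 0])    # one symbol only
h_uniform = normalized_entropy([0.25] * 4)     # equally distributed
profile = entropy_profile("ACGT")              # toy sequence (assumption)
print(h_single, h_uniform, profile)
```

The same functions apply to amino acids by passing a 20-symbol alphabet, in which case the logarithm base becomes 20.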

In particular, for higher values of $n$, according to the frequency definition of probability, the entropy tends to the constant value 1 (see Figures 8 and 9) both for nucleotides and amino acids. However, in the first case (Figure 8), the entropy of (H1) is lower than that of (H2), while, on the contrary, for the corresponding amino acids, the entropy of (H1) is greater than that of (H2).

7. Conclusions

In this paper, two mRNA variants (isoforms) of the TET2 gene have been analyzed through their nucleotide and amino-acid distributions. By using the indicator function (and matrix), the fractal dimension and the entropy have been easily computed. We have noticed that, at the amino-acid level, some patches can be easily singled out. Moreover, the second variant (H2) of TET2 shows somewhat more randomness than (H1).