Abstract

This paper deals with the sequence analysis of acute myeloid leukemia mRNA. Six transcript variants of mlf1 mRNA, each with more than 2000 bps, are analyzed by focusing on the autocorrelation of each distribution. Through the correlation matrix, some patches and similarities are singled out and discussed with respect to similar distributions. A comparison of the Kolmogorov fractal dimension is also given in order to classify the six variants. The existence of a fractal shape, patterns, and symmetries is discussed as well.

1. Introduction

In some recent papers [1–14], it has been shown that the basic structure of the genome is based on fractal geometry. Indeed, the fractal dimension is defined according to the concept of information entropy [15, 16], so that a change in the DNA structure, that is, in the distribution of nucleic acids, implies a corresponding change in the information and, as a consequence, a variation in the entropy. In [2–4, 8–14], it has been suggested that a variation in the entropy can be interpreted as the symptom of malignant evolution of the cell activity, thus being an expedient test for cancer prognosis.

Despite the many still unsolved questions about the distribution of base pairs (bps), it is generally understood that the cell activity is functionally dependent on the distribution of nucleotides, that is, the distribution of the 4 symbols A, C, G, and T along the DNA sequence [17, 18]. Some further attempts to better understand this distribution and the existence of large-scale structure or hidden rules (possibly fractals) were given in [1, 6, 7, 19–26]. Namely, the large-scale structure depends on the possibility of showing the long-range correlation among bps [23–25, 27–43]. Multifractality was also used in connection with the concept of entropy to analyze the complexity of the DNA sequence [1, 2, 26, 41]. The main tasks are to find (if any) some kind of mathematical rules or meaningful statistics in the nucleotide distribution and to use the deviation from these patterns as a means to detect the existence of malignant evolution.

On the other hand, the existence of patchiness and correlation would imply some important understanding of DNA organization. It has been observed that the source of long-range correlation is linked with the existence of patchiness in the DNA sequence. The identification of these patches could be the key point for understanding the large-scale structure of DNA.

Correlation in a DNA sequence is interesting because base pairs in a sequence of millions of pairs seem to have some statistical dependence. The existence of correlation in DNA has been explained with the so-called process of duplication mutation.

The power law for long-range correlations is a measure of the scaling law, showing the existence of self-similar structures similar to the physics of fractals. The long-range correlation, which can be detected by the autocorrelation function, implies the scale independence (scale invariance) which is typical of fractals.

Any statistical analysis of DNA is based on a digitization of the symbolic sequence, so that one may benefit from the statistical analysis of the digitized time series. The genome can then be characterized by classical statistical parameters, like variance and deviation, or by nonclassical ones, like complexity, fractal dimension, or long-range dependence.

The easiest mathematical model for DNA is based on the transformation of the symbolic string into a numerical string by the Voss indicator function [27, 28] which is a discrete binary function. In the following, a complex representation is proposed in order to single out a fractal law in the cumulative distribution of nucleotides. In some recent papers, the indicator (or correlation) matrix [1, 26] has been proposed as a suitable tool for detecting fractal patterns on the dot-plot representation of the indicator matrix. Then, the computation of fractal dimension and complexity can be easily performed on the dot plot.
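As an illustration of the digitization step, the following minimal sketch builds a binary indicator sequence for one nucleotide and computes a sample autocovariance at a given lag. The function names, the toy string, and the choice of the plain sample autocovariance as the autocorrelation estimate are assumptions made for this example; they are not taken from [27, 28].

```python
def voss_indicator(sequence, symbol):
    """Binary indicator sequence: 1 where `symbol` occurs, 0 elsewhere."""
    return [1 if base == symbol else 0 for base in sequence]


def autocorrelation(values, lag):
    """Sample autocovariance of a numeric sequence at a non-negative lag.

    The mean is taken over the whole sequence; the sum runs over the
    len(values) - lag pairs that are `lag` positions apart.
    """
    n_pairs = len(values) - lag
    if lag < 0 or n_pairs <= 0:
        raise ValueError("lag must satisfy 0 <= lag < len(values)")
    mean = sum(values) / len(values)
    total = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n_pairs)
    )
    return total / n_pairs


# Hypothetical toy string (not one of the mlf1 variants).
toy = "ACGTACGTACGT"
indicator_a = voss_indicator(toy, "A")
print(indicator_a)
print(autocorrelation(indicator_a, 4), autocorrelation(indicator_a, 1))
```

For this strictly periodic toy string, the value at lag 4 (the period) is larger than the value at lag 1, which is the kind of signature a correlation analysis looks for.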

In the following, we will compare the fractal dimension and complexity of six mRNA variants of leukemia, showing that these parameters can be used to classify variants. Moreover, the complexity is compared with random and quasi-random sequences (based on the same symbolic alphabet). It follows that the mRNA variants have a complexity close to that of the random sequence, thus raising further questions about the existence of long-range correlation.

Acute myeloid leukemia consists in the interruption of growth of bone marrow cells at the earliest stages of development. The mechanism of this interruption is under further investigation and still unclear in some aspects; however, it is known that it involves the activation of abnormal genes through chromosomal translocations and other genetic anomalies [42]. If a cell does not reach a mature state, then we can speak about a leukemia that usually has a very abrupt onset and for this reason is called acute leukemia. When, instead, the proportion of immature cells with respect to healthy cells is low and this number increases slowly, then we speak about chronic leukemia.

The outcome for adults with AML (acute myeloid leukemia) depends on a variety of factors, including the age of the patient and the biologic characteristics of the disease, the most important of which are the cytogenetics at presentation [42–44]. The karyotype of the leukemic cells can be roughly classified into 3 groups with either favourable, intermediate, or poor prognostic risk [42, 43].

In the following, we will analyze the sequence data for the Homo sapiens mlf1 transcript variant mRNA, which have been downloaded from GenBank [45].

2. Preliminary Survey on Leukemia

Blood cells are normally produced by the bone marrow and, when mature, enter the circulation. Bone marrow is the principal organ for the production of blood cells. It consists of a complex of cells with high proliferative capacity. The bone marrow includes a portion of adipose tissue that may become prevalent with age (yellow marrow) compared with the hematopoietic component (red marrow). The parenchyma of the bone marrow is supported by a stroma composed of irregularly distributed fibroblasts that produce thin beams of reticular fibers. These cells are responsible for the production of growth factors, which activate blood-forming elements. These elements provide a suitable microenvironment for the growth process. Every day, the bone marrow produces more than 2.5 billion red blood cells and platelets and 1 billion white blood cells per kilogram of body weight. It is known that leukemia originates in the bone marrow and from there spreads into the bloodstream.

Bone marrow can be considered the reservoir from which all blood cells are produced; it contains the precursors of red blood cells, white blood cells (lymphocytes, monocytes, and granulocytes), and platelets, so that blood cells are derived from a single progenitor cell or stem cell. Each blood cell belongs to a different branch starting from the same progenitor cell.

The myeloid progenitor cells give rise to the white blood cells called granulocytes. The lymphoid progenitor cells give rise to another type of white blood cell. The progenitor of the erythroid lineage produces red blood cells, and finally a megakaryocytic precursor gives rise to platelets.

Unlike other cells of the body, which rarely duplicate, the bone marrow cells are characterized by a higher proliferative capacity, so that the blood constantly contains a large number of red blood cells, white blood cells, and platelets, which are quickly renewed at different rates. It is clear that the probability of the formation of a malignant tumor is roughly proportional to the number of cell divisions, so that it is possible that almost any cell in the bone marrow becomes malignant and gives rise to the type of cancer called leukemia. The different speed of diffusion enables us to classify leukemia into acute (fast growth) or chronic (slow growth) leukemia. In the following, we will focus only on acute myeloid leukemia.

The underlying pathophysiology in acute myeloid leukemia consists in the interruption of growth of bone marrow cells at the earliest stages of development. According to the rate of growth, we can have either acute or chronic leukemia. The main characteristics of both kinds of leukemia depend on the cell line from which the disease originates, the most common being myeloid and lymphoid leukemia. In general, we have four main types: (1) acute myeloid leukemia, which strikes mostly adults and elderly people, (2) chronic myeloid leukemia, which is characterized by a highly specific chromosomal abnormality known as the Philadelphia chromosome, (3) chronic lymphocytic leukemia, and (4) acute lymphocytic leukemia, with a high incidence in children between 2 and 5 years of age.

Usually, the clinical symptoms of leukemia are initially underestimated, and they can be easily confused with symptoms of minor diseases. This is due to the fact that initially leukemia cells are quickly replaced by new cells produced by the bone marrow. Only when the number of leukemia cells increases rapidly do the symptoms become more easily detectable.

The most common symptoms are bleeding or bruising on the skin or mouth, due to a lack of platelets (thrombocytopenia); neutropenia, due to a lack of white blood cells; or simply pallor and weakness, due to anemia. Namely, the symptoms of leukemia and the production rate of cells by the marrow depend on the number of blasts.

The most popular classification of leukemia is the French-American-British (FAB) system, which classifies AML into 8 subtypes, from M0 to M7 (see Table 1), based both on the type of cell from which the leukemia develops and on its degree of maturity. The classification is made by analysing the appearance of the malignant cells under light microscopy and/or by using cytogenetics to characterize any underlying chromosomal anomalies. Depending on the subtype, leukemia has different prognoses and responses to therapy.

The malignant cell in AML usually appears at the myeloblast level. During normal hematopoiesis, the myeloblast is the immature precursor of myeloid white blood cells; a normal myeloblast gradually evolves into a mature white blood cell. In AML, instead, a single myeloblast accumulates some genetic changes which block the cell in its immature state and prevent differentiation. As seen in Table 1, there are six major subtypes, to which two more have recently been added: M0 and M7 (megakaryoblastic).

In the following, we will focus on the MLF1 gene (myeloid leukemia factor 1). The MLF1 gene [45] encodes an oncoprotein which is thought to play a fundamental role in the phenotypic determination of hematopoietic cells. Translocations between this gene and nucleophosmin have been associated with myelodysplastic syndrome and acute myeloid leukemia. Multiple transcript variants encoding different isoforms have been found for this gene. Figure 1 shows the location of MLF1 at the genomic level [46, 47].

3. DNA Representation

3.1. Preliminary Remarks on DNA

The DNA, as well as the mRNA, of each organism of a given species is a sequence of a specific number of base pairs defined on the 4-element alphabet of nucleotides: A (adenine), C (cytosine), G (guanine), and T (thymine). Since the base pairs are distributed along a double helix, when straightened, the helix appears as a double-strand system. The two sequences on opposite strands are complementary in the sense that opposite nucleotides must fulfill the ligand rules of base pairs between the purines A and G and the pyrimidines C and T: A pairs with T, and G pairs with C.

In a DNA sequence, there are some subsequences, coding and noncoding regions, having special meaning. In particular, genes (coding regions) are characteristic sequences of base pairs, and the genes in turn are made of alternating subsequences of exons and introns (except in prokaryotes, where the introns are missing). After transcription, each exon region is made of triplets of adjacent bases called codons. Since there are 4 bases, there are 4^3 = 64 possible codons. Each codon specifies an amino acid in the translation process, so that a sequence of codons defines a protein. There are only 20 amino acids; therefore, the correspondence between codons and amino acids is many to one. The exon region is also called the coding region.
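The pairing rule and the codon count can be checked with a short sketch. The helper names and the toy strings are assumptions introduced for illustration only; the complement is taken position by position, without reversing the strand.

```python
from itertools import product

ALPHABET = "ACGT"

# Base-pairing rule: each purine (A, G) pairs with one pyrimidine (T, C).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(strand):
    """Nucleotide facing each position of `strand` on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def split_into_codons(coding_sequence):
    """Cut a coding sequence into consecutive triplets, dropping a partial tail."""
    usable = len(coding_sequence) - len(coding_sequence) % 3
    return [coding_sequence[i:i + 3] for i in range(0, usable, 3)]


# All ordered triplets over the 4-symbol alphabet.
codons = ["".join(triplet) for triplet in product(ALPHABET, repeat=3)]

print(len(codons))
print(complementary_strand("ATGC"))
print(split_into_codons("ATGGCCTA"))
```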

3.2. Indicator Function in a 4-Symbol Alphabet

In this section, we will define the indicator matrix [1, 26] which can be used to visualize the existence of some patterns in the nucleotide distribution.

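Let $\mathcal{A}_4 = \{A, C, G, T\}$ be the finite set (alphabet) of nucleotides and $x \in \mathcal{A}_4$ any member of the 4-symbol alphabet.

A DNA sequence is the finite symbolic sequence $S = \{x_h\}_{h=1,\ldots,N}$, $N < \infty$, with $x_h \in \mathcal{A}_4$ being the nucleic acid at position $h$.

The indicator function [1, 26] is the map $u : S \times S \to \{0, 1\}$ such that

$$u(x_h, x_k) = \begin{cases} 1 & \text{if } x_h = x_k, \\ 0 & \text{if } x_h \neq x_k, \end{cases} \qquad x_h, x_k \in S. \tag{3.9}$$

According to (3.9), the indicator map of the $N$-length sequence can be easily represented by the $N \times N$ sparse matrix of binary values $u_{hk} = u(x_h, x_k) \in \{0, 1\}$, and this matrix can be visualized by the dot plot obtained by putting a black dot where $u_{hk} = 1$ and a white spot where $u_{hk} = 0$ (see the correlation matrix (3.11) for the 4-symbol alphabet).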
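A minimal sketch of the indicator matrix and of a text-mode dot plot is given below. The function names, the optional second argument for comparing two different sequences, and the toy string are assumptions made for this example; entries equal to 1 are drawn as `#` in place of black dots.

```python
def indicator_matrix(seq_a, seq_b=None):
    """Binary matrix with entry 1 where seq_a[h] equals seq_b[k], else 0.

    With a single argument the sequence is compared with itself, which
    gives a symmetric matrix with ones on the main diagonal.
    """
    if seq_b is None:
        seq_b = seq_a
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


def dot_plot(matrix, filled="#", empty="."):
    """Render the matrix as text, one row per line."""
    return "\n".join(
        "".join(filled if value else empty for value in row) for row in matrix
    )


# Hypothetical toy string (not one of the mlf1 variants).
toy = "ACGTAC"
u = indicator_matrix(toy)
print(dot_plot(u))
```

Storing the full matrix needs memory proportional to the square of the sequence length, so for sequences of a few thousand bases a sparse or streamed representation may be preferable.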

Indeed, the definition (3.9) is expedient to study the autocorrelation of each sequence. In general, for two sequences $S_1$, $S_2$, definition (3.9) can be extended to the map $u : S_1 \times S_2 \to \{0, 1\}$, with $u_{hk} = 1$ if $x_h = y_k$ and $u_{hk} = 0$ otherwise, where $x_h \in S_1$ and $y_k \in S_2$.

In order to understand the nucleic acid distribution in the 6 variants of AML mRNA, we will compare them with some artificial sequences based on the same alphabet $\{A, C, G, T\}$. In particular, we will consider the $N$-length random sequence, in which the nucleic acid at each position is randomly chosen.

The periodic sequence is defined by repeating the same shorter random sequence, whose length is the period. The quasiperiodic sequence is obtained by alternating periodic blocks of a given period with random blocks.

If we compare the dot plot of the AML mRNA with the artificial sequences, we can see (Figure 2) that the mRNA is much more similar to the random or quasirandom sequence.
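The three artificial sequences can be generated along the following lines. This is only a sketch under stated assumptions: the function names, the uniform choice of symbols, the equal length of periodic and random blocks, and the seeded generator are choices made for this example and are not specified in the text.

```python
import random

ALPHABET = "ACGT"


def random_sequence(length, rng):
    """Sequence whose symbols are drawn uniformly from the alphabet."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def periodic_sequence(length, period, rng):
    """Repeat one random motif of `period` symbols up to `length` symbols."""
    motif = random_sequence(period, rng)
    return (motif * (length // period + 1))[:length]


def quasiperiodic_sequence(length, period, block, rng):
    """Alternate periodic blocks and random blocks, each `block` symbols long."""
    motif = random_sequence(period, rng)
    pieces = []
    use_periodic = True
    while sum(len(piece) for piece in pieces) < length:
        if use_periodic:
            pieces.append((motif * (block // period + 1))[:block])
        else:
            pieces.append(random_sequence(block, rng))
        use_periodic = not use_periodic
    return "".join(pieces)[:length]


rng = random.Random(0)  # fixed seed so that the example is repeatable
r = random_sequence(40, rng)
p = periodic_sequence(40, 4, rng)
q = quasiperiodic_sequence(40, 4, 10, rng)
print(r, p, q, sep="\n")
```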

Although the mRNA sequence cannot be easily distinguished from these sequences in the dot plot, we can better characterize the distribution of nucleic acids by computing some parameters which are strictly related to the complexity. These parameters are based on counting the entries equal to 1 in the indicator matrix (see the following section).

3.3. Fractal Estimate by the Indicator Matrix

From the indicator matrix, we can have an idea of the “fractal-like” distribution of nucleotides as follows: consider the probability that a given nucleic acid can be found at the position $n$. This value can be approximated by the frequency count. For the transcript variant, this gives the probability density distribution of Figure 3.
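One simple way to approximate these probabilities by a frequency count is a running relative frequency, sketched below. The reading of the probability at position n as the fraction of occurrences among the first n symbols is an assumption of this example, as are the function name and the toy string.

```python
def running_frequencies(sequence, alphabet="ACGT"):
    """Relative frequency of each symbol among the first n symbols, for every n.

    Returns a dict mapping each symbol to a list whose entry n - 1 is the
    fraction of positions 1..n occupied by that symbol.
    """
    counts = {symbol: 0 for symbol in alphabet}
    history = {symbol: [] for symbol in alphabet}
    for position, base in enumerate(sequence, start=1):
        if base in counts:
            counts[base] += 1
        for symbol in alphabet:
            history[symbol].append(counts[symbol] / position)
    return history


# Hypothetical toy string (not one of the mlf1 variants).
freq = running_frequencies("AACG")
print(freq["A"])
```

Plotting each list against n shows whether the frequencies settle toward constant values as n grows.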

It can be seen from Figures 3, 4, 5, and 6 that, for higher values of the position $n$, the probabilities tend to assume some constant values, thus showing that nucleotides are heterogeneously distributed. However, there are some significant differences among the variants; H4 shows a higher frequency of one of the nucleotides, while H6 has a lower content of one of them.

The frequency distribution implies a corresponding frequency of correlation in the correlation matrix. By using the indicator matrix, it is possible to give a simple formula which enables us to estimate the fractal dimension as the average of the number of 1s in the randomly taken minors of the correlation matrix.

If we compare the fractal dimension of DNA with that of the random sequence and of the periodic sequence, we can see (Figure 7) that the mRNA dimension is closer to the dimension of the random sequence.
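The sketch below shows one possible reading of such an estimate; it is an assumption made for illustration and may differ from the formula actually used. For each size n from 2 to N, one square n-by-n block is taken at a random position of the indicator matrix, the number p(n) of entries equal to 1 is counted, and the ratios log p(n) / log n are averaged as D = (1/2)(1/N) Σ log p(n) / log n. The function names, the use of contiguous blocks as "minors", and the skipping of empty blocks are all hypothetical choices.

```python
import math
import random


def indicator_matrix(sequence):
    """Binary matrix with entry 1 where two positions carry the same symbol."""
    return [[1 if a == b else 0 for b in sequence] for a in sequence]


def count_ones(matrix, top, left, size):
    """Number of ones in the size-by-size block starting at (top, left)."""
    return sum(
        matrix[i][j]
        for i in range(top, top + size)
        for j in range(left, left + size)
    )


def fractal_dimension_estimate(sequence, rng):
    """Average of log p(n) / log n over randomly placed n-by-n blocks.

    Assumed formula (not confirmed by the source):
        D = (1/2) * (1/N) * sum_{n=2..N} log p(n) / log n
    Blocks that contain no ones contribute nothing to the sum.
    """
    n_total = len(sequence)
    matrix = indicator_matrix(sequence)
    total = 0.0
    for size in range(2, n_total + 1):
        top = rng.randint(0, n_total - size)
        left = rng.randint(0, n_total - size)
        ones = count_ones(matrix, top, left, size)
        if ones > 0:
            total += math.log(ones) / math.log(size)
    return 0.5 * total / n_total


d_const = fractal_dimension_estimate("A" * 10, random.Random(1))
d_mixed = fractal_dimension_estimate("ACGTTGCAACGGTCA", random.Random(1))
print(d_const, d_mixed)
```

The cost of this direct version grows with the cube of the sequence length, so it is meant for short fragments; longer sequences would need prefix sums or sampling of fewer block sizes.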

Moreover, the fractal dimension of the indicator matrix can be used to characterize the different distributions of nucleic acids in the 6 mRNA variants. In fact, it can be seen that some variants have lower values of the dimension (Figure 8), like H5 (and, for a short interval, H2).

3.4. Complexity

The existence of repeating motifs, periodicity, and patchiness can be considered as a simple behavior of a sequence, while nonrepetitiveness or singularity is taken as a characteristic feature of complexity. In order to have a measure of complexity, a definition for an $N$-length sequence is adopted. By using a sliding window of fixed length over the full DNA sequence, one can visualize the distribution of complexity on partial fragments of the sequence.
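As a hedged illustration of a sliding-window complexity profile, the sketch below uses a composition-based measure, K = (1/N) log Ω with Ω = N! / (a_A! a_C! a_G! a_T!), where a_x is the number of occurrences of nucleotide x in the fragment. This formula is an assumption chosen for the example and is not confirmed to be the definition used here; the function names and the toy string are likewise hypothetical.

```python
import math
from collections import Counter


def complexity(fragment):
    """Assumed measure: log of the multinomial coefficient, divided by length.

    Omega = N! / (a_A! a_C! a_G! a_T!) counts the distinct arrangements of
    the symbols in the fragment; lgamma(n + 1) equals log(n!).
    """
    length = len(fragment)
    if length == 0:
        raise ValueError("fragment must not be empty")
    counts = Counter(fragment)
    log_omega = math.lgamma(length + 1) - sum(
        math.lgamma(count + 1) for count in counts.values()
    )
    return log_omega / length


def sliding_complexity(sequence, window):
    """Complexity of every window of `window` consecutive symbols."""
    return [
        complexity(sequence[start:start + window])
        for start in range(len(sequence) - window + 1)
    ]


# Hypothetical toy string (not one of the mlf1 variants).
profile = sliding_complexity("AAAAACGTACGT", 4)
print(profile)
```

This particular measure depends only on the composition of each window, not on the order of the symbols inside it, so it is zero for a window made of a single repeated nucleotide and largest when all four nucleotides are equally represented.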

It can be seen (Figure 9) that the complexity line of the mRNA-H1 sequence is bounded by the upper line of the random sequence and the lower line of the periodic sequence. The quasiperiodic sequence, instead, shows high oscillations between the two bounds. It should also be noticed that the complexity of DNA tends to the asymptotic value of the random sequence.

The explicit computation of the complexity line for the remaining mRNA variants (Figure 10) shows some different behavior, and for this reason, it can be used as a parameter to characterize the differences among variants. As for the fractal dimension, H5 shows some lower values of complexity.

4. Conclusions

In this paper, six variants of acute myeloid leukemia mlf1 mRNA have been analyzed through the correlation matrix. In particular, some parameters like fractal dimension and complexity have been computed and compared. It has been shown that some variants have many similarities, and practically they can be considered as belonging to the same class of mRNA. Some others instead have a very characteristic distribution of bps, quite different from that of the remaining variants. Some variants look like pseudorandom sequences.