Abstract

We begin here a combinatorial study of dinucleotide circular codes. A word written on a circle is called circular. A set of dinucleotides is a circular code if all circular words constructed with this set have a unique decomposition. Propositions based on a letter necklace allow us to determine the 24 maximum dinucleotide circular codes (of 6 elements). A partition property is also identified: there are eight self-complementary maximum dinucleotide circular codes, and the 16 remaining codes form two classes of eight maximum dinucleotide circular codes in bijective correspondence by the complementarity map.

1. Introduction

We continue our study of the combinatorial properties of circular codes in genes, that is, on the nucleotide alphabet $\mathcal{A}_4 = \{A, C, G, T\}$. A dinucleotide is a word of two letters (diletter) on $\mathcal{A}_4$. A trinucleotide is a word of three letters (triletter) on $\mathcal{A}_4$. The two sets of 16 dinucleotides and 64 trinucleotides are codes in the sense of language theory but not circular codes [1, 2]. Intuitively, codes are written on a straight line, while circular codes are written on a circle, but, in both cases, unique decipherability is required.

Trinucleotide comma-free codes, a very particular case of trinucleotide circular codes, have been studied for a long time; see, for example, [3–5]. After the discovery of a trinucleotide circular code in genes with strong mathematical properties [6], circular codes have become mathematical objects studied in combinatorics, theoretical computer science, and theoretical biology. This theory has undergone rapid development; see, for example, [7–27].

Trinucleotides are the fundamental words for genes, that is, the DNA sequences coding the amino acids constituting the protein sequences. However, dinucleotides are also words with important biological functions in genomes. Dinucleotides are involved in some genome sites, for example, the splice sites of introns in eukaryotic genomes are based on the dinucleotides $GT$ and $AG$ [28, 29]. Dinucleotides are also involved in some genome regions, for example, the dinucleotide $CG$ in animal and plant genomes allows a positive or negative control over gene expression [30], and the dinucleotides [31, 32], [33], and [34] in eukaryotic genomes occur as concatenated words (called tandem repeats in biology).

We begin here a new combinatorial study concerning the dinucleotide circular codes. Their number, their list, and a partition according to the complementarity map are determined with propositions based on a letter necklace.

2. Preliminaries

The following definitions and propositions are classical for any finite set of words on any finite alphabet [1]. We recall them for dinucleotides, that is, words of length 2 on a 4-letter alphabet. Let $\mathcal{A}_4 = \{A, C, G, T\}$ denote the genetic alphabet, lexicographically ordered by $A < C < G < T$. The set of nonempty words (resp., words) on $\mathcal{A}_4$ is denoted by $\mathcal{A}_4^+$ (resp., $\mathcal{A}_4^*$). The set of the 16 words of length 2 (dinucleotides or diletters) over $\mathcal{A}_4$ is denoted by $\mathcal{A}_4^2$. The set of the 64 words of length 3 (trinucleotides or triletters) over $\mathcal{A}_4$ is denoted by $\mathcal{A}_4^3$.

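Definition 1. A set $X$ of words in $\mathcal{A}_4^2$ is a dinucleotide code if, for each $x_1, \ldots, x_n, x'_1, \ldots, x'_m \in X$, $n, m \geq 1$, the condition $x_1 \cdots x_n = x'_1 \cdots x'_m$ implies $n = m$ and $x_i = x'_i$ for $i = 1, \ldots, n$.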

Dinucleotide codes are read on a straight line.

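Definition 2. A dinucleotide code $X$ in $\mathcal{A}_4^2$ is circular if, for each $x_1, \ldots, x_n, x'_1, \ldots, x'_m \in X$, $n, m \geq 1$, $r \in \mathcal{A}_4^*$, $s \in \mathcal{A}_4^+$, the conditions $s x_2 \cdots x_n r = x'_1 \cdots x'_m$ and $x_1 = rs$ imply $n = m$, $r = \varepsilon$ (empty word), and $x_i = x'_i$ for $i = 1, \ldots, n$.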

Dinucleotide circular codes are read on a circle.

Remark 3. The set $\mathcal{A}_4^2$ is a code but not a circular code.

Definition 4. Two dinucleotides $u$ and $u'$ are conjugate if there exist two letters $l_1$ and $l_2$, $l_1, l_2 \in \mathcal{A}_4$, such that $u = l_1 l_2$ and $u' = l_2 l_1$.

Proposition 5 (see [1]). A dinucleotide circular code cannot contain a word of the form $ll$ with $l \in \mathcal{A}_4$.

The periodic dinucleotides $AA$, $CC$, $GG$, and $TT$ cannot be in a dinucleotide circular code.

Proposition 6 (see [1]). A dinucleotide circular code cannot contain conjugate dinucleotides.

Example 7. The dinucleotides and cannot be in the same circular code.

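The set operations complementarity $\mathcal{C}$, permutation $\mathcal{P}$, and mirror image $\tilde{\ }$ defined below are involutions.

Definition 8. The nucleotide complementarity map $\mathcal{C} : \mathcal{A}_4 \to \mathcal{A}_4$ is defined by $\mathcal{C}(A) = T$, $\mathcal{C}(C) = G$, $\mathcal{C}(G) = C$, and $\mathcal{C}(T) = A$.

Definition 9. The dinucleotide complementarity map $\mathcal{C} : \mathcal{A}_4^2 \to \mathcal{A}_4^2$ is defined by $\mathcal{C}(l_1 l_2) = \mathcal{C}(l_2)\mathcal{C}(l_1)$ for all $l_1, l_2 \in \mathcal{A}_4$.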

Example 10. .

Definition 11. The complementary dinucleotide set $\mathcal{C}(X)$ of a dinucleotide set $X$ is the set obtained by applying the dinucleotide complementarity map $\mathcal{C}$ to all the dinucleotides of $X$.

Remark 12. .

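Definition 13. A dinucleotide circular code $X$ is self-complementary if, for each $u \in X$, $\mathcal{C}(u) \in X$.

Definition 14. The (left) dinucleotide circular permutation map $\mathcal{P} : \mathcal{A}_4^2 \to \mathcal{A}_4^2$ permutes circularly each dinucleotide: $\mathcal{P}(l_1 l_2) = l_2 l_1$, $l_1, l_2 \in \mathcal{A}_4$.

Definition 15. The permuted dinucleotide set $\mathcal{P}(X)$ of a dinucleotide set $X$ is the set obtained by applying the circular permutation map $\mathcal{P}$ to all the dinucleotides of $X$.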

Remark 16. .

Definition 17. The mirror image of a dinucleotide $l_1 l_2$ is $\widetilde{l_1 l_2} = l_2 l_1$, $l_1, l_2 \in \mathcal{A}_4$.

Definition 18. The mirror image $\tilde{X}$ of a dinucleotide set $X$ is the set of the mirror images of all the dinucleotides of $X$.

Remark 19. .

Remark 20. For a dinucleotide $u$ and for a dinucleotide set $X$, we have $\mathcal{P}(u) = \tilde{u}$ and $\mathcal{P}(X) = \tilde{X}$.

Proposition 21 (see [27]). A dinucleotide code $X$ is circular if and only if the dinucleotide code $\tilde{X}$ is circular.

Proposition 22. A dinucleotide code $X$ is circular if and only if the permuted dinucleotide code $\mathcal{P}(X)$ is circular.

Proof. By Proposition 21 and Remark 20.

Remark 23. Proposition 22 is not true with trinucleotides [6].
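The three maps of this section can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and the dinucleotide complementarity map is assumed to reverse the word as well as complement each letter (the reading of Definition 9 that is consistent with the count of eight self-complementary codes reported in Section 3).

```python
# Nucleotide complementarity map (Definition 8): A <-> T, C <-> G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement(d):
    """Dinucleotide complementarity: complement each letter and reverse the word (assumed reading)."""
    return "".join(COMPLEMENT[letter] for letter in reversed(d))

def permute(d):
    """Left circular permutation of a dinucleotide: l1 l2 -> l2 l1."""
    return d[1:] + d[:1]

def mirror(d):
    """Mirror image of a dinucleotide: l1 l2 -> l2 l1."""
    return d[::-1]

def apply_to_set(f, X):
    """Apply a dinucleotide map to every dinucleotide of the set X."""
    return {f(d) for d in X}

X = {"AC", "AG", "CG"}
print(sorted(apply_to_set(complement, X)))
# On dinucleotides, circular permutation and mirror image coincide (Remark 20).
print(apply_to_set(permute, X) == apply_to_set(mirror, X))
```

Each map applied twice returns the original dinucleotide, which is the involution property stated at the beginning of this part of the section.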

3. Results

In this paper, we identify the subsets of $\mathcal{A}_4^2$ which are circular codes. Based on a letter necklace, we prove a necessary and sufficient condition for a set of dinucleotides to be a circular code.

Definition 24. Let $l_1, l_2, \ldots, l_{n-1}, l_n$ be $n$ letters in $\mathcal{A}_4$. One says that the ordered sequence $l_1, l_2, \ldots, l_n$ is an $n$-necklace for a subset $X \subseteq \mathcal{A}_4^2$ if each dinucleotide $l_1 l_2, l_2 l_3, \ldots, l_{n-1} l_n$ belongs to $X$.

Proposition 25. Let $X$ be a subset of $\mathcal{A}_4^2$. The following conditions are equivalent: (1) $X$ is a circular code; (2) $X$ has no 5-necklace.

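Proof. (1) $\Rightarrow$ (2). Let $X$ be a circular code. We have to prove that $X$ has no 5-necklace. Suppose, by way of contradiction, that $l_1, l_2, l_3, l_4, l_5$ is a 5-necklace for $X$. As $\mathcal{A}_4$ contains four letters, for some $i, j$, $1 \leq i < j \leq 5$, we have $l_i = l_j$. Note that the maximum value of $j - i$ is 4.
(i) If $j - i = 1$, then $X$ has a periodic dinucleotide $l_i l_{i+1} = l_i l_i$, in contradiction with Proposition 5.
(ii) If $j - i = 2$, then $X$ has two conjugate dinucleotides $l_i l_{i+1}$ and $l_{i+1} l_{i+2} = l_{i+1} l_i$, in contradiction with Proposition 6.
(iii) If $j - i = 3$, then either $i = 1$ or $i = 2$.
(iiia) If $i = 1$, the 5-necklace is $l_1, l_2, l_3, l_1, l_5$. So, $l_1 l_2, l_2 l_3, l_3 l_1 \in X$. Consider the sequence $l_1 l_2 l_3 l_1 l_2 l_3$. Put $x_1 = l_1 l_2$, $x_2 = l_3 l_1$, $x_3 = l_2 l_3$, $x'_1 = l_2 l_3$, $x'_2 = l_1 l_2$, $x'_3 = l_3 l_1$. Note that $x_1, x_2, x_3, x'_1, x'_2, x'_3$ belong to $X$. Now, the following relations hold: $l_2 x_2 x_3 l_1 = x'_1 x'_2 x'_3$ and $x_1 = l_1 l_2$ (that is, $r = l_1 \neq \varepsilon$ and $s = l_2$ in Definition 2), in contradiction with the assumption that $X$ is a circular code.
(iiib) The case $i = 2$ is analogous.
(iv) If $j - i = 4$, then $i = 1$, $j = 5$, and the 5-necklace is $l_1, l_2, l_3, l_4, l_1$. So, $l_1 l_2, l_2 l_3, l_3 l_4, l_4 l_1 \in X$. Consider the sequence $l_1 l_2 l_3 l_4$. Put $x_1 = l_1 l_2$, $x_2 = l_3 l_4$, $x'_1 = l_2 l_3$, $x'_2 = l_4 l_1$. Note that $x_1, x_2, x'_1, x'_2$ belong to $X$. Now, the following relations hold: $l_2 x_2 l_1 = x'_1 x'_2$ and $x_1 = l_1 l_2$, in contradiction with the assumption that $X$ is a circular code.
(2) $\Rightarrow$ (1). Let $X$ be without 5-necklace and suppose, by way of contradiction, that $X$ is not a circular code. As $X$ is a uniform code, there exist $x_1, \ldots, x_n, x'_1, \ldots, x'_n \in X$, $n \geq 1$, and two letters $r, s \in \mathcal{A}_4$ such that $s x_2 \cdots x_n r = x'_1 \cdots x'_n$ and $x_1 = rs$. Moreover, as all the elements of $X$ have a length of 2, there exist $l_1, l_2, \ldots, l_{2n} \in \mathcal{A}_4$ such that $x_1 = l_1 l_2$, $x_2 = l_3 l_4$, $\ldots$, $x_n = l_{2n-1} l_{2n}$, and $x'_1 = l_2 l_3$, $x'_2 = l_4 l_5$, $\ldots$, $x'_{n-1} = l_{2n-2} l_{2n-1}$, $x'_n = l_{2n} l_1$. Now:
if $n = 1$, then $l_1, l_2, l_1, l_2, l_1$ is a 5-necklace;
if $n = 2$, then $l_1, l_2, l_3, l_4, l_1$ is a 5-necklace;
if $n \geq 3$, then $l_1, l_2, l_3, l_4, l_5$ is a 5-necklace.
In any case, there is a 5-necklace. Contradiction.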
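The criterion of Proposition 25 is easy to test mechanically. The following Python sketch (function names are ours, not the authors') looks for a chain of four dinucleotides of the set in which each dinucleotide starts with the last letter of the previous one, which is exactly a 5-necklace.

```python
def has_necklace(X, n=5):
    """Return True if letters l1, ..., ln exist with every l_i l_{i+1} in X."""
    ends = set("ACGT")        # letters that can end a chain of 0 dinucleotides
    for _ in range(n - 1):    # an n-necklace chains n - 1 dinucleotides
        ends = {d[1] for d in X if d[0] in ends}
    return bool(ends)

def is_circular(X):
    """Proposition 25: a set of dinucleotides is circular iff it has no 5-necklace."""
    return not has_necklace(X, 5)

print(is_circular({"AC", "CA"}))                          # conjugate pair
print(is_circular({"AC", "AG", "AT", "CG", "CT", "GT"}))  # six dinucleotides
```

Since the alphabet has only four letters, a 5-necklace must repeat a letter, so the test amounts to checking that the directed graph whose arcs are the dinucleotides of the set has no cycle.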

If a dinucleotide set $X$ is a circular code, then no circular word constructed with $X$ has two different decompositions into dinucleotides of $X$.

Example 26. Consider the set containing only the dinucleotides and . Let be any sequence with or . As does not contain , , and , the sequence cannot have a double decomposition on a circle. The set has no 5-necklace as, if is (resp., ) then must be   (resp., ), but   (resp., ) is never a suffix in . There are sets with six dinucleotides which are circular codes. For example, any sequence with the set of dinucleotides , , , , , has no double decomposition on a circle.

More generally, write the sequence of on a circle. If , , , , , belong to a set , then cannot be a circular code because the sequence can be read in two ways: (with as the first letter) and (with as the first letter). There is a double reading of the sequence (corresponding to a double reading of the sequence ). In this case, is a -necklace for .

Example 27. If , , , , , are dinucleotides of , we have the following relations: , and . So, is not a dinucleotide circular code (also a consequence of the fact that contains two conjugate dinucleotides and ).

Proposition 28. A dinucleotide circular code has at most 6 elements.

Proof. There are 16 dinucleotides. Four dinucleotides are periodic: $AA$, $CC$, $GG$, and $TT$. The remaining 12 dinucleotides are partitioned in six conjugation classes: $\{AC, CA\}$, $\{AG, GA\}$, $\{AT, TA\}$, $\{CG, GC\}$, $\{CT, TC\}$, and $\{GT, TG\}$. By Proposition 5, a dinucleotide circular code contains no periodic dinucleotide and, by Proposition 6, it has at most one dinucleotide in each of these conjugation classes. So, a dinucleotide circular code has at most 6 elements.
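The counting used in this proof can be reproduced with a short Python sketch (variable names are ours): it lists the periodic dinucleotides and groups the remaining ones into conjugation classes.

```python
from itertools import product

LETTERS = "ACGT"
dinucleotides = ["".join(p) for p in product(LETTERS, repeat=2)]

# Periodic dinucleotides: both letters are equal.
periodic = [d for d in dinucleotides if d[0] == d[1]]

# A conjugation class groups a non-periodic dinucleotide with its conjugate.
classes = sorted({tuple(sorted((d, d[::-1]))) for d in dinucleotides if d[0] != d[1]})

print(periodic)
print(classes)
print(len(classes))
```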

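Proposition 29. Let $l_1 l_2 l_3 l_4$ be a permutation of $\mathcal{A}_4$. If $X = \{l_1 l_2, l_1 l_3, l_1 l_4, l_2 l_3, l_2 l_4, l_3 l_4\}$, then $X$ is a dinucleotide circular code.

Proof. Suppose, by way of contradiction, that $X$ is not a dinucleotide circular code, and, by Proposition 25, let $m_1, m_2, m_3, m_4, m_5$ be a 5-necklace of $X$. Note that, with the exception of $m_1$, each letter composing the necklace must be a suffix of a dinucleotide of $X$.
Claim 1. For $i \in \{2, 3, 4, 5\}$, $m_i \neq l_1$.
Proof of Claim 1. By inspection, $l_1$ is never a suffix of a dinucleotide of $X$.
Claim 2. For $i \in \{3, 4, 5\}$, $m_i \neq l_2$.
Proof of Claim 2. By inspection, $l_2$ is a suffix only of $l_1 l_2$. For $i \in \{3, 4, 5\}$, if $m_i = l_2$, then $m_{i-1} = l_1$, which is impossible by Claim 1.
Claim 3. For each $i \in \{4, 5\}$, $m_i \neq l_3$.
Proof of Claim 3. By inspection, $l_3$ is a suffix only of $l_1 l_3$ and $l_2 l_3$. Suppose, by way of contradiction, that $m_4 = l_3$. Then, $m_3 = l_1$ or $m_3 = l_2$. If $m_3 = l_1$, we are in contradiction with Claim 1 and if $m_3 = l_2$, we are in contradiction with Claim 2. Suppose, by way of contradiction, that $m_5 = l_3$. Then, $m_4 = l_1$ or $m_4 = l_2$. If $m_4 = l_1$, we are in contradiction with Claim 1 and if $m_4 = l_2$, we are in contradiction with Claim 2.
Claim 4. $m_5 \neq l_4$.
Proof of Claim 4. By inspection, $l_4$ is a suffix only of $l_1 l_4$, $l_2 l_4$, and $l_3 l_4$. Suppose, by way of contradiction, that $m_5 = l_4$. Then, $m_4 = l_1$, $m_4 = l_2$, or $m_4 = l_3$. In the first case, we are in contradiction with Claim 1; in the second case, we are in contradiction with Claim 2; and in the third case, we are in contradiction with Claim 3.
By Claims 1, 2, 3, and 4, we have $m_5 \notin \{l_1, l_2, l_3, l_4\} = \mathcal{A}_4$, which is impossible, and so, $X$ has no 5-necklace. Consequently, $X$ is a dinucleotide circular code.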
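As an illustration, the following Python sketch builds, for each ordering $l_1 l_2 l_3 l_4$ of the alphabet, the set of the six dinucleotides $l_i l_j$ with $i < j$ (our reading of the construction in Proposition 29) and checks with the necklace test of Proposition 25 that none of these sets has a 5-necklace. The function names are ours.

```python
from itertools import combinations, permutations

def code_from_permutation(perm):
    """Set of the dinucleotides l_i l_j, i < j, for an ordering l1 l2 l3 l4 of the alphabet."""
    return frozenset(a + b for a, b in combinations(perm, 2))

def has_5_necklace(X):
    """True if four dinucleotides of X can be chained as l1l2, l2l3, l3l4, l4l5."""
    ends = set("ACGT")
    for _ in range(4):
        ends = {d[1] for d in X if d[0] in ends}
    return bool(ends)

codes = {code_from_permutation(p) for p in permutations("ACGT")}
print(sorted(code_from_permutation("ACGT")))
print(len(codes), any(has_5_necklace(X) for X in codes))
```

The sets obtained from different orderings are pairwise distinct, so the sketch yields 24 codes of 6 elements.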

Definition 30. A maximum dinucleotide circular code is a dinucleotide circular code having 6 elements.

Remark 31. In Proposition 29, we have considered an arbitrary permutation $l_1 l_2 l_3 l_4$ of $\mathcal{A}_4$, and we have proved that a maximum dinucleotide circular code corresponds to it. As the number of possible permutations is 24, and as distinct permutations yield distinct codes, the number of maximum dinucleotide circular codes is at least 24, and we will prove hereafter that it is exactly 24.

In the maximum dinucleotide circular code $X = \{l_1 l_2, l_1 l_3, l_1 l_4, l_2 l_3, l_2 l_4, l_3 l_4\}$ (Proposition 29), the letter $l_1$ has three occurrences in prefix of dinucleotides of $X$ (shortly in prefix of $X$), the letter $l_2$ has two occurrences in prefix of $X$, and $l_3$ has one occurrence in prefix of $X$. The letter $l_4$ never occurs in prefix of $X$. This is a general fact, in the sense that in each maximum dinucleotide circular code $X$ there is a letter, say $l_1$, with three occurrences in prefix of $X$, a letter, say $l_2$, with two occurrences in prefix of $X$, and a letter, say $l_3$, with one occurrence in prefix of $X$, while the remaining letter, say $l_4$, never occurs in prefix of $X$.

We now prove this general fact formally. In the sequel, a set of $k$ nonnegative integers having a sum equal to $n$ is called a $k$-partition of $n$. By “set” we rather mean a “multiset” as some numbers can be equal. Define $n_A$ (resp., $n_C$, $n_G$, $n_T$) as the number of occurrences of $A$ (resp., $C$, $G$, $T$) in prefix of a maximum dinucleotide circular code $X$.

Lemma 32. If $X$ is a maximum dinucleotide circular code, then $\{n_A, n_C, n_G, n_T\}$ is a 4-partition of 6.

Proof. By Proposition 28.

Lemma 33. In any dinucleotide circular code, each letter has at most three occurrences in prefix, that is, $n_A, n_C, n_G, n_T \leq 3$.

Proof. The alphabet contains four letters, and a dinucleotide circular code cannot contain periodic dinucleotides.

Example 34. For , the 4-partition of 6 is .

The following lemma will prove that the unique possible 4-partition for a maximum dinucleotide circular code is $\{3, 2, 1, 0\}$.

Lemma 35. For each maximum dinucleotide circular code $X$, there exists a permutation $l_1 l_2 l_3 l_4$ of $\mathcal{A}_4$ such that $l_1$ has three occurrences in prefix of $X$, $l_2$ has two occurrences in prefix of $X$, $l_3$ has one occurrence in prefix of $X$, and $l_4$ has no occurrence in prefix of $X$.

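Proof. Putting the numbers of prefix occurrences in nonincreasing order, by Lemma 33, we have to consider only the following cases: $\{3, 3, 0, 0\}$, $\{3, 2, 1, 0\}$, $\{3, 1, 1, 1\}$, $\{2, 2, 2, 0\}$, and $\{2, 2, 1, 1\}$.
Case $\{3, 3, 0, 0\}$. Let $l_1$ be a letter with three occurrences in prefix of $X$. Let $l_2, l_3, l_4$ be the three other letters of $\mathcal{A}_4$. We have $l_1 l_2, l_1 l_3, l_1 l_4 \in X$. Without loss of generality, suppose that $l_2$ has three occurrences in prefix of $X$. Necessarily one of the two dinucleotides $l_2 l_1$ and $l_2 l_2$ must be in $X$. But, in the first case, we are in contradiction with Proposition 6, and in the second case, we are in contradiction with Proposition 5.
So, the case $\{3, 3, 0, 0\}$ is impossible.
Case $\{3, 1, 1, 1\}$. Let $l_1$ be the letter with three occurrences in prefix of $X$. Let $l_2, l_3, l_4$ be the three other letters of $\mathcal{A}_4$. Then, we have $l_1 l_2, l_1 l_3, l_1 l_4 \in X$ and $l_2 l_1, l_3 l_1, l_4 l_1 \notin X$; otherwise, we are in contradiction with Propositions 5 and 6.
Now, suppose that in the three remaining dinucleotides of $X$, the same letter, say $l_2$ without loss of generality, has two occurrences in suffix, that is, $l_3 l_2, l_4 l_2 \in X$. The letter $l_2$ cannot be a prefix of $X$. Indeed, $l_2 l_3$, $l_2 l_4$ are in contradiction with Proposition 6, and $l_2 l_2$ is in contradiction with Proposition 5.
So, in the three remaining dinucleotides of $X$, the letters $l_2, l_3, l_4$ must each have only one occurrence in suffix. Without loss of generality, we have $l_2 l_3, l_3 l_4, l_4 l_2 \in X$. But, $l_2, l_3, l_4, l_2, l_3$ is a 5-necklace for $X$. By Proposition 25, we are in contradiction.
So, the case $\{3, 1, 1, 1\}$ is impossible.
Case $\{2, 2, 2, 0\}$. Let $l_1$ be one of the three letters with two occurrences in prefix of $X$. Let $l_2, l_3, l_4$ be the three other letters of $\mathcal{A}_4$. Without loss of generality, we have $l_1 l_2, l_1 l_3 \in X$. For the two other letters having two occurrences in prefix of $X$, we have three possibilities: $\{l_2, l_3\}$, $\{l_2, l_4\}$, $\{l_3, l_4\}$.
Case $\{l_2, l_3\}$. By Propositions 6 and 5, $l_2 l_1, l_2 l_2 \notin X$, so $l_2 l_3, l_2 l_4 \in X$, and $l_3 l_1, l_3 l_3 \notin X$, so $l_3 l_2, l_3 l_4 \in X$. As $l_2 l_3$ and $l_3 l_2$ are conjugate, we are in contradiction with Proposition 6.
Case $\{l_2, l_4\}$. By Propositions 6 and 5, $l_2 l_1, l_2 l_2 \notin X$, but $l_2 l_3, l_2 l_4 \in X$. As $l_4 l_2 \notin X$ (otherwise, we are in contradiction with Proposition 6), the two dinucleotides with $l_4$ in prefix of $X$ must be $l_4 l_1$ and $l_4 l_3$ and, consequently, $l_1 l_2, l_2 l_4, l_4 l_1 \in X$. But, $l_1, l_2, l_4, l_1, l_2$ is a 5-necklace for $X$. By Proposition 25, we are in contradiction.
Case $\{l_3, l_4\}$. By Propositions 6 and 5, $l_3 l_1, l_3 l_3 \notin X$, but $l_3 l_2, l_3 l_4 \in X$. As $l_4 l_3 \notin X$ (otherwise, we are in contradiction with Proposition 6), the two dinucleotides with $l_4$ in prefix of $X$ must be $l_4 l_1$ and $l_4 l_2$ and, consequently, $l_1 l_3, l_3 l_4, l_4 l_1 \in X$. But, $l_1, l_3, l_4, l_1, l_3$ is a 5-necklace for $X$. By Proposition 25, we are in contradiction.
So, the case $\{2, 2, 2, 0\}$ is impossible.
Case $\{2, 2, 1, 1\}$. Let $l_1$ be one of the two letters with two occurrences in prefix of $X$. Let $l_2, l_3, l_4$ be the three other letters of $\mathcal{A}_4$. Without loss of generality, we have $l_1 l_2, l_1 l_3 \in X$. Consider the following cases:
(i) $l_2$ has two occurrences in prefix of $X$, and $l_3$ and $l_4$ have one occurrence in prefix of $X$. By Propositions 6 and 5, $l_2 l_3, l_2 l_4 \in X$, and $l_3 l_4$ is the unique possible dinucleotide of $X$ with $l_3$ in prefix of $X$. By Propositions 6 and 5, $l_4 l_2, l_4 l_3, l_4 l_4 \notin X$. If $l_4 l_1 \in X$, then $l_1, l_2, l_4, l_1, l_2$ is a 5-necklace of $X$, and by Proposition 25, we are in contradiction. So, $l_4$ cannot be a prefix of $X$. Contradiction.
(ii) $l_3$ has two occurrences in prefix of $X$, and $l_2$ and $l_4$ have one occurrence in prefix of $X$. By Propositions 6 and 5, $l_3 l_2, l_3 l_4 \in X$, and $l_2 l_4$ is the unique possible dinucleotide of $X$ with $l_2$ in prefix of $X$. By Propositions 6 and 5, $l_4 l_2, l_4 l_3, l_4 l_4 \notin X$ and $l_4 l_1 \notin X$ (otherwise, $l_1, l_3, l_4, l_1, l_3$ is a 5-necklace for $X$, and by Proposition 25, we are in contradiction). So, $l_4$ cannot be a prefix in $X$. Contradiction.
(iii) $l_4$ has two occurrences in prefix of $X$, and $l_2$ and $l_3$ have one occurrence in prefix of $X$. By Propositions 6 and 5, we have three possible cases: $\{l_4 l_1, l_4 l_2\}$, $\{l_4 l_1, l_4 l_3\}$, and $\{l_4 l_2, l_4 l_3\}$.
(iiia) $l_4 l_1, l_4 l_2 \in X$. So, $l_1 l_2, l_1 l_3, l_4 l_1, l_4 l_2 \in X$. By Propositions 6 and 5, $l_2 l_1, l_2 l_2, l_2 l_4 \notin X$, but $l_2 l_3 \in X$. By Propositions 6 and 5, $l_3 l_2 \in X$ or $l_3 l_4 \in X$. In the first case, $l_2 l_3, l_3 l_2 \in X$, and by Proposition 6, we are in contradiction. In the second case, $l_3 l_4 \in X$. But, $l_1, l_3, l_4, l_1, l_3$ is a 5-necklace for $X$. By Proposition 25, we are in contradiction.
(iiib) $l_4 l_1, l_4 l_3 \in X$. So, $l_1 l_2, l_1 l_3, l_4 l_1, l_4 l_3 \in X$. By Propositions 6 and 5, $l_3 l_1, l_3 l_3, l_3 l_4 \notin X$, but $l_3 l_2 \in X$. By Propositions 6 and 5, $l_2 l_4$ is the unique possible dinucleotide of $X$ with prefix $l_2$. So, $l_2 l_4 \in X$. But, $l_1, l_2, l_4, l_1, l_2$ is a 5-necklace for $X$. By Proposition 25, we are in contradiction.
(iiic) $l_4 l_2, l_4 l_3 \in X$. So, $l_1 l_2, l_1 l_3, l_4 l_2, l_4 l_3 \in X$. By Propositions 6 and 5, $l_2 l_1, l_2 l_2, l_2 l_4 \notin X$, but $l_2 l_3 \in X$. By Propositions 6 and 5, $l_3 l_1, l_3 l_3, l_3 l_4 \notin X$, but $l_3 l_2 \in X$. As $l_2 l_3$ and $l_3 l_2$ are conjugate, we are in contradiction with Proposition 6.
So, the case $\{2, 2, 1, 1\}$ is also impossible.
Only the 4-partition $\{3, 2, 1, 0\}$ is realized by $X$. It corresponds to the permutation $l_1 l_2 l_3 l_4$ of $\mathcal{A}_4$. In other words, the permutation whose existence is proved is given by the letters having three, two, one, and no occurrences in prefix of $X$, in this order.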
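The case analysis above can be cross-checked by exhaustive search. The sketch below (our own helper names, with the necklace criterion of Proposition 25 as the circularity test) computes the multiset of prefix counts for every circular set of six dinucleotides.

```python
from collections import Counter
from itertools import combinations, product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def is_circular(X):
    """Necklace test: X has no chain of four dinucleotides l1l2, l2l3, l3l4, l4l5."""
    ends = set("ACGT")
    for _ in range(4):
        ends = {d[1] for d in X if d[0] in ends}
    return not ends

def prefix_partition(X):
    """Numbers of prefix occurrences of A, C, G, T in X, in nonincreasing order."""
    counts = Counter(d[0] for d in X)
    return tuple(sorted((counts[letter] for letter in "ACGT"), reverse=True))

partitions = {prefix_partition(X)
              for X in combinations(DINUCLEOTIDES, 6) if is_circular(X)}
print(partitions)
```

Only one tuple of prefix counts is found, in agreement with the lemma.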

Proposition 36. There are 24 maximum dinucleotide circular codes.

Proof. By Proposition 29, each permutation $l_1 l_2 l_3 l_4$ of $\mathcal{A}_4$ is associated with a maximum dinucleotide circular code $X = \{l_1 l_2, l_1 l_3, l_1 l_4, l_2 l_3, l_2 l_4, l_3 l_4\}$. As there are $4! = 24$ permutations, the number of maximum dinucleotide circular codes is at least 24.
Now, let $X$ be a maximum dinucleotide circular code. By Lemma 35, its 4-partition must be $\{3, 2, 1, 0\}$. Let $l_1$ (resp., $l_2$, $l_3$, $l_4$) be the letter of $\mathcal{A}_4$ having 3 (resp., 2, 1, 0) occurrences in prefix of $X$. As $l_1$ has three occurrences in prefix of $X$, we must have $l_1 l_2, l_1 l_3, l_1 l_4 \in X$. As $l_2$ has two occurrences in prefix of $X$, and as $l_1 l_2$ is already in $X$, we must also have $l_2 l_3, l_2 l_4 \in X$. Finally, as $l_3$ has only one occurrence in prefix of $X$, and as $l_1 l_3$ and $l_2 l_3$ are already in $X$, we must have $l_3 l_4 \in X$. Consequently, $X = \{l_1 l_2, l_1 l_3, l_1 l_4, l_2 l_3, l_2 l_4, l_3 l_4\}$, and $X$ is one of the 24 maximum circular codes already considered. Thus, the number of maximum dinucleotide circular codes is exactly 24.

A computer calculation confirms that there are exactly 24 maximum dinucleotide circular codes (Table 1).
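Such a computer check can be sketched as follows. This is not the authors' program, only a brute-force illustration with our own names: it tests every set of six dinucleotides with the necklace criterion of Proposition 25, and also verifies that no set of seven dinucleotides passes the test.

```python
from itertools import combinations, product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def is_circular(X):
    """Necklace test: X has no chain of four dinucleotides l1l2, l2l3, l3l4, l4l5."""
    ends = set("ACGT")
    for _ in range(4):
        ends = {d[1] for d in X if d[0] in ends}
    return not ends

# All 8008 subsets of 6 dinucleotides, filtered by the necklace test.
maximum_codes = [set(c) for c in combinations(DINUCLEOTIDES, 6) if is_circular(c)]

# No subset of 7 dinucleotides is circular (Proposition 28).
larger_exists = any(is_circular(c) for c in combinations(DINUCLEOTIDES, 7))

print(len(maximum_codes), larger_exists)
for code in maximum_codes[:3]:
    print(sorted(code))
```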

There are eight self-complementary maximum dinucleotide circular codes: , , , , , , , and (Table 1). The 16 remaining ones are partitioned in two classes of eight maximum dinucleotide circular codes in bijective correspondence by the complementarity map (Table 1).
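The partition by the complementarity map can be checked with a short Python sketch. It assumes that the 24 maximum codes are the sets built from the orderings of the alphabet (the dinucleotides $l_i l_j$ with $i < j$) and that the complementarity map reverses and complements each dinucleotide; all names are ours.

```python
from itertools import combinations, permutations

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement_set(X):
    """Complementary dinucleotide set: reverse and complement every dinucleotide of X."""
    return frozenset(COMPLEMENT[d[1]] + COMPLEMENT[d[0]] for d in X)

# The 24 maximum codes, one per ordering l1 l2 l3 l4 of the alphabet.
codes = {frozenset(a + b for a, b in combinations(p, 2)) for p in permutations("ACGT")}

self_complementary = [X for X in codes if complement_set(X) == X]
others = [X for X in codes if complement_set(X) != X]
# Each remaining code is paired with its (distinct) complementary code.
pairs = {frozenset((X, complement_set(X))) for X in others}

print(len(self_complementary), len(others), len(pairs))
for X in sorted(sorted(X) for X in self_complementary):
    print(X)
```

The eight pairs give the two classes of eight codes in bijective correspondence; which code of a pair goes in which class is a matter of convention.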

Proposition 37. If $X$ is a maximum dinucleotide circular code, then $\mathcal{C}(X)$ is also a maximum dinucleotide circular code.

Proof. By inspection (Table 1).

Proposition 38. If is a maximum dinucleotide circular code, then

Proof. By inspection (Table 1).

This proposition is not true for maximum trinucleotide circular codes; see, for example, [6].

4. Conclusion

This new combinatorial study of circular codes in genes has proved that there are exactly 24 maximum dinucleotide circular codes on the 4-letter genetic alphabet $\mathcal{A}_4$. They are listed in Table 1. Propositions 22, 37, and 38 lead to interesting properties of dinucleotide circular codes in DNA. Indeed, they ensure that several maximum dinucleotide circular codes can exist in the two strands of the DNA double helix simultaneously. A maximum dinucleotide circular code $X$ in a given strand of DNA implies that its complementary set $\mathcal{C}(X)$ in the complementary strand of DNA is also a maximum dinucleotide circular code (Proposition 37), according to two possibilities: $\mathcal{C}(X) = X$ or $\mathcal{C}(X) \neq X$ (Table 1). Furthermore, its permuted set $\mathcal{P}(X)$, obtained by a frameshift of one letter, is also a maximum dinucleotide circular code (Proposition 22). Finally, its complementary permuted set $\mathcal{C}(\mathcal{P}(X))$ is also a maximum dinucleotide circular code (Proposition 38).

Chemical modification of nucleotides is ubiquitous in RNA and DNA. So far, a total of 107 modified nucleotides, for which chemical structures have been assigned, have been reported in RNA (see the RNA Modification Database at http://rna-mdb.cas.albany.edu/RNAmods/ [35]). The largest number, that is, 81, with the greatest structural diversity, is found in tRNA, with 30 in rRNA, 12 in mRNA, and 13 in other RNA species, most notably snRNA. The four nucleotides can be chemically modified, for example, methyladenosine, dimethyladenosine, trimethyladenosine, methylcytidine, dimethylcytidine, thiocytidine, methylguanosine, dimethylguanosine, trimethylguanosine, methyluridine, dimethyluridine, thiouridine, pseudouridine, dihydrouridine, but also inosine, lysidine, wybutosine, wyosine, queuosine, and archaeosine. In DNA, the cytosine in the dinucleotide $CG$, involved in gene regulation, can have two chemical forms (methylcytosine, hydroxymethylcytosine). This chemical change allows additional information to be stored, thus expanding the alphabet by two letters. The generalization of the dinucleotide circular code propositions to larger alphabets is therefore of interest and should be investigated.

Dinucleotide circular codes may be involved in retrieval of the modulo frame in genomes, for example, in the dinucleotide repeats.

Dinucleotide circular codes may also have a biological function in the coding process of amino acids. In the standard genetic code, eight amino acids Ala (A), Arg (R), Gly (G), Leu (L), Pro (P), Thr (T), Ser (S), and Val (V) are coded by sets of trinucleotides involving dinucleotides. Indeed, for each of these eight amino acids, there exists a dinucleotide $d$ such that all the trinucleotides of the form $dl$ (where $l$ is any letter of $\mathcal{A}_4$) code the same amino acid (Table 2).

Now, Gly (G) and Pro (P) cannot be coded by a dinucleotide circular code as their dinucleotides $GG$ and $CC$ are periodic. Moreover, Ala (A) and Arg (R) cannot be coded simultaneously by a dinucleotide circular code as their dinucleotides $GC$ and $CG$ are conjugate, and similarly for Leu (L) and Ser (S) with the conjugate dinucleotides $CT$ and $TC$. On the other hand, as any subset of a maximum dinucleotide circular code is also a dinucleotide circular code, the following properties exist.
(i) The four amino acids Arg (R), Leu (L), Thr (T), and Val (V) can be coded by the dinucleotide circular code $\{CG, CT, AC, GT\}$, which is a proper subset of a maximum dinucleotide circular code (Table 1).
(ii) The four amino acids Ala (A), Leu (L), Thr (T), and Val (V) can be coded by the dinucleotide circular code $\{GC, CT, AC, GT\}$, which is a proper subset of a maximum dinucleotide circular code (Table 1).
(iii) The four amino acids Ala (A), Ser (S), Thr (T), and Val (V) can be coded by the dinucleotide circular code $\{GC, TC, AC, GT\}$, which is a proper subset of a maximum dinucleotide circular code (Table 1).
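These three properties can be checked with the necklace criterion of Proposition 25. The Python sketch below is only an illustration with our own names; it uses the first two codon letters of the standard genetic code (Ala $GC$, Arg $CG$, Leu $CT$, Ser $TC$, Thr $AC$, Val $GT$) and tries the four ways of choosing one dinucleotide in each conjugate pair.

```python
from itertools import product

def is_circular(X):
    """Necklace test: X has no chain of four dinucleotides l1l2, l2l3, l3l4, l4l5."""
    ends = set("ACGT")
    for _ in range(4):
        ends = {d[1] for d in X if d[0] in ends}
    return not ends

# One of Ala (GC) / Arg (CG), one of Leu (CT) / Ser (TC), plus Thr (AC) and Val (GT).
results = {}
for ala_or_arg, leu_or_ser in product(("GC", "CG"), ("CT", "TC")):
    X = frozenset({ala_or_arg, leu_or_ser, "AC", "GT"})
    results[X] = is_circular(X)

for X, circular in results.items():
    print(sorted(X), circular)
```

Three of the four choices pass the test. The remaining one, Arg with Ser, fails because $CG$, $GT$, and $TC$ form a cycle on the letters $C$, $G$, $T$, which is consistent with the three cases listed above.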

These results contribute to the research field analysing the mathematical properties of genetic codes.

Acknowledgments

The authors thank the reviewers and Jacques Justin for their advice. The second author thanks the Dipartimento di Matematica “U. Dini” for its friendly hospitality.