Abstract

Recently, we identified a hierarchy relation between trinucleotide comma-free codes and trinucleotide circular codes (see our previous works). Here, we extend our hierarchy with two new classes of codes, called and codes, which are stronger than the comma-free codes. We also prove that no circular code with 20 trinucleotides is a code and that a circular code with 20 trinucleotides is comma-free if and only if it is a code. Finally, we point out the possible role of the symmetric group in the mathematical study of trinucleotide circular codes.

1. Introduction

We continue our study of the combinatorial properties of trinucleotide circular codes. A trinucleotide is a word of three letters (triletter) on the genetic alphabet $\mathcal{A}_4 = \{A, C, G, T\}$. The set of 64 trinucleotides is a code in the sense of language theory, more precisely a uniform code, but not a circular code (Remark 2.4 and [1, 2]). Intuitively, codes are written on a straight line while circular codes are written on a circle, but in both cases unique decipherability is required. Circular codes are particular subsets of the set of 64 trinucleotides, while comma-free codes are even more constrained subsets.

In the past 50 years, comma-free codes and circular codes have been studied in theoretical biology, mainly to understand the structure and the origin of the genetic code as well as the reading frame (construction) of genes, for example [3–5]. Before the discovery of the genetic code, Crick et al. [3] proposed a (maximal) comma-free code of 20 trinucleotides for coding the 20 amino acids. In 1996, a (maximal) circular code of 20 trinucleotides was identified statistically on two large and different gene populations, eukaryotes and prokaryotes [6]. In recent years, circular codes have been studied as mathematical objects in discrete mathematics, theoretical computer science, and theoretical biology, for example [7–22]. In particular, in the theory of codes, there are some unexpected common notions between variable-length circular codes and trinucleotide circular codes [17, 19, 21, 22].

Recently, we proposed a hierarchy relation between the trinucleotide comma-free codes and the trinucleotide circular codes (Proposition 3 in [23]). More precisely, all the trinucleotide codes in this hierarchy are circular, the strongest ones being comma-free. In this paper, we identify two new classes of trinucleotide circular codes which are stronger than the comma-free codes.

We introduce here the following new notions. A set $X$ of trinucleotides has the property if, for any trinucleotides $t, t' \in X$, no letter occurs both as a proper suffix of $t$ and a proper prefix of $t'$. A set $X$ of trinucleotides has the property if, for any trinucleotides $t, t' \in X$, no diletter occurs both as a proper suffix of $t$ and a proper prefix of $t'$. These sets and are not only trinucleotide circular codes but are also stronger than the comma-free codes (Propositions 3.4 and 3.5, and Remarks 3.6 and 3.7). We also prove that no circular code with 20 trinucleotides is a code (Proposition 3.10) and that a circular code with 20 trinucleotides is comma-free if and only if it is a code (Proposition 3.11).

Therefore, our previous hierarchy (Proposition 3 in [23], recalled in Proposition 2.17 below) is extended with these new and classes of strong trinucleotide circular codes (Proposition 4.1).

Finally, a curious relation with the symmetric group appears again. The tables given here and the other symmetric relations identified previously (e.g., Proposition 6 in [23]) suggest that the symmetric group can play an important role in the mathematical study of these trinucleotide circular codes. However, we have no formal mathematical explanation so far.

2. Preliminaries

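Let $\mathcal{A}$ denote a finite alphabet, $\mathcal{A}^*$ the free monoid over $\mathcal{A}$ and $\mathcal{A}^+$ the free semigroup over $\mathcal{A}$. The elements of $\mathcal{A}^*$ are words and the empty word, denoted by $\varepsilon$, is the identity of $\mathcal{A}^*$. Given a subset $X$ of $\mathcal{A}^*$, $X^n$ is the set of the words over $\mathcal{A}$ which are the products of $n$ words from $X$, that is, $X^n = \{x_1 x_2 \cdots x_n \mid x_i \in X\}$. If $X$ is a (finite) set, then $|X|$ denotes its cardinality and if $x$ is a word, then $|x|$ denotes its length. A word $u$ is a factor of a word $w$ if there exist two words $v$ and $v'$ such that $w = v u v'$. When $v = \varepsilon$ (resp. $v' = \varepsilon$), $u$ is a prefix (resp. suffix) of $w$. A proper factor (resp. proper prefix, proper suffix) $u$ of $w$ is a factor (resp. prefix, suffix) of $w$ such that $u \neq w$.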

There is a correspondence between the genetic and language-theoretic concepts. The letters (or nucleotides or bases) define the genetic alphabet $\mathcal{A}_4 = \{A, C, G, T\}$. The set of nonempty words (resp. words) over $\mathcal{A}_4$ is denoted by $\mathcal{A}_4^+$ (resp. $\mathcal{A}_4^*$). The set of the 16 words of length 2 (or dinucleotides or diletters) is denoted by $\mathcal{A}_4^2$. The set of the 64 words of length 3 (or trinucleotides or triletters) is denoted by $\mathcal{A}_4^3$. The total order over the alphabet $\mathcal{A}_4$ is $A < C < G < T$. Consequently, $\mathcal{A}_4^+$ is lexicographically ordered: given two words $u, v \in \mathcal{A}_4^+$, $u$ is smaller than $v$ in lexicographical order, written $u < v$, if and only if either $u$ is a proper prefix of $v$ or there exist $x, y \in \mathcal{A}_4$, $x < y$, and $r, s, t \in \mathcal{A}_4^*$ such that $u = rxs$ and $v = ryt$.

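Definition 2.1. Code: a subset $X$ of $\mathcal{A}^+$ is a code over $\mathcal{A}$ if for each $x_1, \ldots, x_n, x'_1, \ldots, x'_m \in X$, $n, m \geq 1$, the condition $x_1 \cdots x_n = x'_1 \cdots x'_m$ implies $n = m$ and $x_i = x'_i$ for $i = 1, \ldots, n$.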

For any $k$-letter alphabet $\mathcal{A}_k$, $k \geq 1$, and for any word length $l$, $l \geq 1$, $\mathcal{A}_k^l$ is a code. In particular, $\mathcal{A}_4^3$ is a code. More precisely, it is a uniform code [1]. Consequently, any nonempty subset of $\mathcal{A}_4^3$ is a code, called trinucleotide code in this paper.

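Definition 2.2. Trinucleotide comma-free code: a trinucleotide code $X \subset \mathcal{A}_4^3$ is comma-free if for each $y \in X$ and $u, v \in \mathcal{A}_4^*$ such that $u y v = x_1 \cdots x_n$ with $x_1, \ldots, x_n \in X$, $n \geq 1$, it results that $u, v \in X^*$.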
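As an illustration of Definition 2.2, the following minimal Python sketch tests the comma-free property. It assumes the usual reformulation for words of uniform length 3: a trinucleotide code is comma-free exactly when no word of the code occurs in one of the two shifted frames of a concatenation of two words of the code. The function name `is_comma_free` is hypothetical and not part of the paper.

```python
from itertools import product


def is_comma_free(code):
    """Return True if no word of `code` occurs in a shifted frame.

    Assumption: every word has length 3, so a word read out of frame
    always straddles exactly two consecutive words of the code.
    """
    words = set(code)
    if any(len(w) != 3 for w in words):
        raise ValueError("expected words of length 3 (trinucleotides)")
    for x, y in product(words, repeat=2):
        xy = x + y
        # Frames shifted by one and by two letters across the boundary of x and y.
        if xy[1:4] in words or xy[2:5] in words:
            return False
    return True
```

For instance, `is_comma_free({"ACG", "CGA"})` is false because `CGA` is read in a shifted frame of `ACGACG`.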

Several varieties of trinucleotide comma-free codes were described in [18].

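Definition 2.3. Trinucleotide circular code: a trinucleotide code $X \subset \mathcal{A}_4^3$ is circular if for each $x_1, \ldots, x_n, x'_1, \ldots, x'_m \in X$, $n, m \geq 1$, $r \in \mathcal{A}_4^*$, $s \in \mathcal{A}_4^+$, the conditions $s x_2 \cdots x_n r = x'_1 \cdots x'_m$ and $x_1 = rs$ imply $n = m$, $r = \varepsilon$ and $x_i = x'_i$ for $i = 1, \ldots, n$.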
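Circularity can also be tested mechanically. The sketch below does not follow Definition 2.3 directly: it swaps in a graph criterion from later work on circular codes, which is an assumption with respect to this paper. Each trinucleotide $b_1 b_2 b_3$ contributes the two arcs $b_1 \to b_2 b_3$ and $b_1 b_2 \to b_3$, and the code is taken to be circular exactly when the resulting directed graph has no cycle. The name `is_circular` is hypothetical.

```python
def is_circular(code):
    """Return True if the letter/diletter graph of `code` is acyclic.

    Assumption: acyclicity of this graph is used as the test for
    circularity (a criterion taken from later literature).
    """
    graph = {}
    for w in set(code):
        graph.setdefault(w[0], set()).add(w[1:])   # letter -> diletter
        graph.setdefault(w[:2], set()).add(w[2])   # diletter -> letter

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def has_cycle(node):
        # Depth-first search; meeting a grey node again closes a cycle.
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                return True
            if state == WHITE and has_cycle(nxt):
                return True
        colour[node] = BLACK
        return False

    return not any(
        colour.get(v, WHITE) == WHITE and has_cycle(v) for v in list(graph)
    )
```

With this sketch, a singleton such as `{"AAA"}` is rejected (the arcs `A -> AA -> A` form a cycle), as is any set containing two conjugate trinucleotides such as `{"ACG", "CGA"}`.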

Remark 2.4. $\mathcal{A}_4^3$ is obviously not a circular code and even less a comma-free code. However, several subsets of $\mathcal{A}_4^3$ are trinucleotide circular codes (e.g., Propositions 2.12 and 2.13).

Definition 2.5. Maximal trinucleotide circular code: a trinucleotide circular code $X \subset \mathcal{A}_4^3$ is maximal if for each $x \in \mathcal{A}_4^3$, $x \notin X$, $X \cup \{x\}$ is not a trinucleotide circular code.

Definition 2.6. A trinucleotide circular code containing exactly $k$ elements is called a $k$-trinucleotide circular code.

Remark 2.7. A 20-trinucleotide circular code is (i) maximal (in the sense that it cannot be contained in a trinucleotide circular code with more words); (ii) maximum (in the sense that no trinucleotide circular code can contain more than 20 words).

We now recall some definitions and previous results related to the trinucleotide circular code necklaces. In the sequel, are letters in , are diletters in , and is an integer satisfying .

Definition 2.8. Letter Diletter Necklaces (): we say that the ordered sequence , is an for a subset if and ,.

Definition 2.9. Letter Diletter Continued Necklaces (): we say that the ordered sequence is an for a subset if , and .

Definition 2.10. Diletter Letter Necklaces (): we say that the ordered sequence , is an for a subset if and ,.

Definition 2.11. Diletter Letter Continued Necklaces (): we say that the ordered sequence is an for a subset if , and .

Proposition 2.12 (see [17]). Let be a trinucleotide code. The following conditions are equivalent:(i) is a circular code;(ii) has no .

Proposition 2.13 (see [18]). Let be a trinucleotide code. The following conditions are equivalent:(i) is a comma-free code.(ii) has no and no .

Definition 2.14. Let be a trinucleotide code. For any integer , we say that belongs to the class if has no and that belongs to the class if has no . Similarly, for any integer , we say that belongs to the class if has no and that belongs to the class if has no .

Notation 1. For any integer , and . Similarly, for any integer , and .

Proposition 2.15 (see [23]). The following chains of inclusions hold:(i); (ii); (iii); (iv); (v); (vi).

Remark 2.16. By Proposition 2.13, the chain of inclusions of Proposition 2.15 () begins with which is the class of comma-free codes.

Proposition 2.17. With 20-trinucleotide circular codes, the following chains of inclusions and equalities hold:

3. Strong Trinucleotide Circular Codes

We introduce new definitions which impose very strong conditions on the words of a subset of . These word subsets, strongly constrained, are indeed new circular codes which are stronger than the trinucleotide comma-free codes according to the following propositions.

Definition 3.1. A subset of has the property if, for any , the conditions and imply .

In other words, no letter of $\mathcal{A}_4$ can occur in the first position of a trinucleotide of $X$ when it is also in the last position of a trinucleotide of $X$ (possibly the same one, see Proposition 3.8).

Definition 3.2. A subset of has the property if, for any , , the conditions and imply .

In other words, no diletter of $\mathcal{A}_4^2$ can occur as a prefix of a trinucleotide of $X$ when it is also a suffix of a trinucleotide of $X$ (possibly the same one, see Proposition 3.9).
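The two properties can be checked directly from their verbal descriptions. The following sketch is only an illustration: the function names are hypothetical labels chosen here, and the sketch assumes that a trinucleotide may be compared with itself, as the singleton case of Proposition 3.9 suggests.

```python
def first_last_letters_disjoint(code):
    """Letter property: no letter is both a first and a last letter in `code`."""
    first_letters = {w[0] for w in code}
    last_letters = {w[-1] for w in code}
    return first_letters.isdisjoint(last_letters)


def prefix_suffix_diletters_disjoint(code):
    """Diletter property: no diletter is both a prefix and a suffix in `code`."""
    prefix_diletters = {w[:2] for w in code}
    suffix_diletters = {w[1:] for w in code}
    return prefix_diletters.isdisjoint(suffix_diletters)
```

For example, the illustrative set `{"ACG", "GTT"}` fails the letter property (`G` ends `ACG` and starts `GTT`) but satisfies the diletter property, which shows that the two properties are not equivalent.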

Remark 3.3. The trinucleotide code is not a -strong trinucleotide circular code but it is a -strong trinucleotide circular code. The trinucleotide code is not a -strong trinucleotide circular code but it is a -strong trinucleotide circular code.

Therefore, the class of -strong trinucleotide circular codes is different from the class of -strong trinucleotide circular codes. However, both are very particular cases of comma-free codes according to the following propositions.

Proposition 3.4. A -strong trinucleotide circular code over is comma-free.

Proof. Suppose that is a -strong trinucleotide circular code and, by way of contradiction, that it is not comma-free. Then, there exist two trinucleotides such that either or are in . In the first case, is a prefix of and a suffix of while in the second case, is a prefix of and a suffix of . In both cases, is not a -strong circular code. This is a contradiction.

Proposition 3.5. A -strong trinucleotide circular code over is comma-free.

Proof. Suppose that is a -strong trinucleotide circular code and, by way of contradiction, that it is not comma-free. Then, there exist two trinucleotides such that either or are in . In the first case, is a prefix of and a suffix of while in the second case, is a prefix of and a suffix of . In both cases, is not a -strong circular code. This is a contradiction.

Remark 3.6. There are trinucleotide comma-free codes which are not -strong trinucleotide circular codes. Example: .

Remark 3.7. There are trinucleotide comma-free codes which are not -strong trinucleotide circular codes. Example: .

The two following propositions are obvious.

Proposition 3.8. For any letters , a trinucleotide singleton is a -strong trinucleotide circular code over if and only if .

Proposition 3.9. For any letters , a trinucleotide singleton is a -strong trinucleotide circular code over if and only if at least two of its letters are different.

Remark 3.3 showed that -strong and -strong trinucleotide circular codes are different classes. The following propositions give more information about their difference.

Proposition 3.10. No 20-trinucleotide circular code can be a -strong trinucleotide circular code.

Proof. Suppose, by way of contradiction, that a 20-trinucleotide circular code is also a -strong trinucleotide circular code. Let (resp. ) be the set containing the letters (resp. ) of the trinucleotides of . We have (otherwise, does not have the property), (otherwise, has at most 16 elements) and (otherwise, has at most 16 elements). By the pigeonhole principle, it follows that has two disjoint subsets, say and , such that and . Consequently, has at most the following elements: , so we have again at most 16 elements. This is a contradiction.
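The counting step of this proof can be verified by brute force. The sketch below assumes the reading given by the proof: the set of first letters and the set of last letters are disjoint, the middle letter is free, and the size of the code is bounded by the product of the three numbers of choices. The function name is hypothetical.

```python
from itertools import combinations

ALPHABET = "ACGT"


def max_size_with_disjoint_ends(alphabet=ALPHABET):
    """Largest value of |first| * |alphabet| * |last| over disjoint letter sets."""
    letters = list(alphabet)
    best = 0
    for first_size in range(1, len(letters)):
        for first in combinations(letters, first_size):
            rest = [x for x in letters if x not in first]
            for last_size in range(1, len(rest) + 1):
                for last in combinations(rest, last_size):
                    # Upper bound: any first letter, any middle letter, any last letter.
                    bound = len(first) * len(letters) * len(last)
                    best = max(best, bound)
    return best
```

The maximum is reached with two first letters and two last letters, giving $2 \times 4 \times 2 = 16 < 20$.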

Proposition 3.11. A 20-trinucleotide circular code is comma-free if and only if it is a -strong trinucleotide circular code.

Proof. If. By Proposition 3.5, any -strong trinucleotide circular code is also comma-free.
Only if. Suppose that is comma-free and, by way of contradiction, that it is not a -strong trinucleotide circular code. Then, there exist two letters and a diletter such that . As cannot contain two elements in the same conjugation class, the condition holds. So, contains exactly two elements, say and .
being a comma-free code, must contain exactly one trinucleotide in each of the 20 conjugation classes. By considering the conjugation class , only can belong to . Indeed, and are not comma-free codes as the concatenations and lead to in contradiction with Definition 2.2. With the conjugation class , only can belong to . Indeed, and are not comma-free codes as the concatenations and lead to in contradiction with Definition 2.2. Similarly, and must belong to . Moreover, with the conjugation class , only can belong to .
Now, we have: (i) (otherwise is not a comma-free code);(ii) (otherwise is not a comma-free code); (iii) (otherwise is not a comma-free code). So, no element in the conjugation class belongs to . This is a contradiction.
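The 20 conjugation classes used in this proof can be enumerated explicitly. The following sketch groups the 64 trinucleotides by circular permutation of their letters; the function name is hypothetical.

```python
from itertools import product


def conjugation_classes(alphabet="ACGT"):
    """Group all words of length 3 into classes closed under circular shift."""
    classes = set()
    for letters in product(alphabet, repeat=3):
        w = "".join(letters)
        # A word, its shift by one letter and its shift by two letters.
        classes.add(frozenset({w, w[1:] + w[0], w[2] + w[:2]}))
    return classes
```

The four periodic trinucleotides (a single repeated letter) form classes of size one, and the remaining 60 trinucleotides form 20 classes of size three.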

4. Extended Hierarchy

The previous hierarchy of trinucleotide circular codes [23] is now extended with these new and codes. By Proposition 3.10, the set of -strong 20-trinucleotide circular codes is empty. Moreover, by Proposition 3.11, the set of -strong 20-trinucleotide circular codes coincides with the set of 20-trinucleotide comma-free codes (set ). With the notations and (Notation 1), the hierarchy of Proposition 2.17 recalled above is extended with these new strong trinucleotide circular codes as follows.

Proposition 4.1. With the 20-trinucleotide circular codes, the following chains of inclusions and equalities hold:

5. Coding of Trinucleotide Circular Codes with the Symmetric Group

We use the symmetric group (e.g., [24]) to develop a coding of trinucleotide circular codes.

A permutation of a set is a bijection from into itself. Given a positive integer , denotes the set . As has a natural total order , a permutation of is the word giving the successive images of the elements of . Analogously, denotes any totally ordered set , , of elements. Also as a consequence of the total order, a permutation of is the word and by abuse of language, can also be considered as a permutation of . The symmetric group denotes all the permutations of .

Recall that denotes the number of elements of a set . Recall that if is a word of length on the alphabet , then . So, is the set of the letters of having at least one occurrence in .

A permutation of can be represented by a word of length . Clearly, the prefix of length of the word uniquely determines . There are also four other cases to represent the elements of by words of length : and ; and ; and ; and . We begin with the case and .

For a given , denotes the subset of containing its first elements . For a given and for a permutation of , the set is defined as follows: contains the elements of having one occurrence in , the suffix of length of . Consequently, counts the number of elements of such that and . In other words, counts the number of elements of such that and is on the right of in the word . Put and let the code of be the word denoted by .

For a given permutation , is the number of the letters of that are strictly smaller than or equivalently, the number of the elements of the alphabet that are strictly smaller than the leftmost letter , and by the choice of the alphabet, this number is exactly and belongs to . Then, is the number of the letters of that are strictly smaller than and on the right of or equivalently, the number of the elements of the alphabet that are strictly smaller than and this number belongs to . And so on until which is the number of the letters of that are strictly smaller than and on the right of or equivalently, the number of the elements of the two-letter alphabet that are strictly smaller than and this number belongs to , that is, with only values 0 or 1. Finally, is the number of the letters of that are strictly smaller than and on the right of or equivalently, the number of the elements of the one-letter alphabet that are strictly smaller than and this number belongs to , that is, with value equal to 0. Thus, and belongs to a set of cardinality which is exactly the cardinality of .

Clearly, if and are two different permutations of , then . Indeed, let be the maximum integer such that . Without loss of generality, suppose that . As , then . So, is different from .

Example 5.1. The code of the permutation of is .
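The coding described above can be sketched as follows, assuming that a permutation is given as the list of the successive images and that the $i$-th digit of the code counts the elements that are strictly smaller than the $i$-th image and occur to its right. The function name `permutation_code` is hypothetical.

```python
def permutation_code(perm):
    """Return, for each position, the number of smaller elements to its right."""
    return [
        sum(1 for later in perm[i + 1:] if later < value)
        for i, value in enumerate(perm)
    ]
```

The last digit is always 0, so for four elements the first three digits suffice to identify the permutation.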

The correspondence is an injective map between two finite sets of same cardinality (). So is a bijection and to each corresponds a unique . The following algorithm allows the permutation from the code to be retrieved.

Algorithm 1 (principle). Initialisation ; only one element, say , in can verify , so ; only one element, say , in can verify , so ; repeat this procedure until all the elements are found; finally is the unique value in .

Remark 5.2. In general, is the code of the permutation on the totally ordered alphabet .

Example 5.3. Consider the previous example with the permutation of having the code . As , then ; as contains elements strictly smaller; as contains elements strictly smaller; as contains element strictly smaller; as contains element strictly smaller; as contains element strictly smaller; finally, as . So, the permutation is .
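The principle of Algorithm 1 can be sketched in a few lines, under the assumption that each digit of the code selects, among the elements not yet used, the one having exactly that many strictly smaller unused elements. The function name and the optional `alphabet` parameter are hypothetical.

```python
def permutation_from_code(code, alphabet=None):
    """Rebuild a permutation from its code, digit by digit.

    `alphabet` is an optional totally ordered collection of len(code)
    elements; by default the integers 1..len(code) are used.
    """
    if alphabet is not None:
        remaining = sorted(alphabet)
    else:
        remaining = list(range(1, len(code) + 1))
    perm = []
    for digit in code:
        # The element at index `digit` has exactly `digit` smaller unused elements.
        perm.append(remaining.pop(digit))
    return perm
```

The optional alphabet allows the same sketch to be used on any totally ordered alphabet, in the spirit of Remark 5.2.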

For a given permutation , we can also define the sets , and . The set consists of the elements of that have one occurrence in the prefix of length of . Its cardinality counts the number of elements of such that and or, in other words, counts the number of elements of such that and is on the left of in . Similarly, the set consists of the elements of that have one occurrence in , the suffix of length of . Its cardinality counts the number of elements of such that and or, in other words, counts the number of elements of such that and is on the right of in . Finally, the set consists of the elements of that have one occurrence in the prefix of length of . Its cardinality counts the number of elements of such that and or, in other words, counts the number of elements of such that and is on the left of in .

There are trivial relations

For a given permutation, , , and allow the construction of three other codes, namely, , and , which have similar properties to the code . These relations allow the permutation to be retrieved more efficiently from the code . For the interesting case of this paper, an efficient algorithm is given.

Algorithm 2 (principle). Initialisation ; Consider and let with .
If , then and, if , then or, if , then .
If , then and, if , then or, if , then .
If , then and, if , then or, if , then .
The number is the code of the permutation on .

Example 5.4. Consider the permutation of having 111 as its code. Clearly, . Then, the considered set is . As , then and as , then . So, the permutation is .

Finally, the code of a permutation on the genetic alphabet () can easily be computed by putting , , and . Similarly, for the totally ordered alphabet () in Section 5, the code of a permutation is obtained by putting , , and .

6. Role of the Symmetric Group

We put , , and and identify the elements of the symmetric group over () with the 24 permutations of the word . We denote the permutations by their codes (Table 1(c)).

We wish to point out that a computer calculation confirms that the 20-trinucleotide comma-free codes are exactly the -strong 20-trinucleotide circular codes. These codes are partitioned into 28 classes: . There are four classes containing six codes each (Table 2(a)), 16 classes containing 12 codes each (Table 2(b)), and eight classes containing 24 codes each (Table 2(c)), that is, $4 \times 6 + 16 \times 12 + 8 \times 24 = 408$ codes in total. For each class, we give explicitly the list (in lexicographical order) of trinucleotides: the first (in lexicographical order) -strong 20-trinucleotide circular code (pattern of the class) and the codes of the permutations of (Table 1(c)) on giving the other -strong 20-trinucleotide circular codes of the class. The classes are lexicographically ordered according to the patterns of classes.

Moreover, a computer calculation gives the properties of prefixes and suffixes for the 28 classes of -strong 20-trinucleotide circular codes . The set is formed by the letters in the first position of the trinucleotides of and the set , by the letters in the last position of the trinucleotides of . The set is formed by the diletters in prefix position of the trinucleotides of and the set , by the diletters in suffix position of the trinucleotides of . Eight classes have four letters in both and (, , , , ). Ten classes have four letters in and three letters in (, , , , ). Conversely, ten classes have four letters in and three letters in (, , , ). Three classes have nine diletters in (, , ) and similarly, three classes have nine diletters in (, , ). Only the class has six diletters in and nine diletters in and similarly, only the class has six diletters in and nine diletters in . All the sets are obviously empty.

These tables and the other symmetric relations identified before (e.g., Proposition 6 of [23]) suggest that the symmetric group can have a very important role in the study of these trinucleotide circular codes.

Acknowledgments

We thank Jacques Justin for his advice. The second author thanks the Dipartimento di matematica “U. Dini” for its friendly hospitality.