Applied Computational Intelligence and Soft Computing
Volume 2016, Article ID 9481971, 12 pages
http://dx.doi.org/10.1155/2016/9481971
Research Article

Generative Power and Closure Properties of Watson-Crick Grammars

Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, 53100 Kuala Lumpur, Malaysia

Received 27 November 2015; Revised 19 February 2016; Accepted 2 June 2016

Academic Editor: Ryotaro Kamimura

Copyright © 2016 Nurul Liyana Mohamad Zulkufli et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

We define WK linear grammars, as an extension of WK regular grammars with linear grammar rules, and WK context-free grammars, and we investigate their computational power and closure properties. We show that WK linear grammars can generate some context-sensitive languages. Moreover, we demonstrate that the family of WK regular languages is a proper subset of the family of WK linear languages, but it is not comparable with the family of linear languages. We also establish that the family of Watson-Crick regular languages is closed under almost all of the main closure operations.

1. Introduction

DNA computing appears as a challenge to design new types of computing devices, which differ from their classical counterparts in a fundamental way, to solve a wide spectrum of computationally intractable problems. DNA (deoxyribonucleic acid) is a double-stranded chain of nucleotides, which differ by their chemical bases, adenine (A), guanine (G), cytosine (C), and thymine (T); the bases are paired as A-T and C-G according to the Watson-Crick complementarity, as illustrated in Figure 1 [1]. Massive parallelism, another fundamental feature of DNA molecules, allows millions of cut and paste operations to be performed simultaneously on DNA strands until a complete set of new DNA strands is generated. These two features give high hope for the use of DNA molecules and DNA-based bio-operations to develop powerful computing paradigms and devices.

Figure 1: The structure of DNA strand.

Since a DNA strand can be interpreted as a double-stranded sequence of symbols, the DNA replication and synthesis processes can be modeled using methods and techniques of formal language theory. Watson-Crick (WK) automata [2], one of the recent computational models abstracting the properties of DNA molecules, are finite automata with two reading heads, working on complete double-stranded sequences where characters on corresponding positions from the two strands of the input are related by a complementarity relation similar to the Watson-Crick complementarity of DNA nucleotides. The two strands of the input are separately read from left to right by heads controlled by a common state. Several variants have been introduced and studied in recent papers [3–7].

WK regular grammars [8], a grammar counterpart of WK automata, generate double-stranded strings related by a complementarity relation as in a WK automaton but use rules as in a regular grammar. Using formal grammars to study the biological and computational properties of DNA molecules is a new direction in the field of DNA computing: we can introduce powerful variants of WK grammars, such as WK linear, WK context-free, and WK regulated grammars, and use them in the investigation of the properties of DNA structures and also in DNA applications in food authentication, gene disease detection, and so forth. In this paper, we introduce WK linear grammars and study their generative capacity in relation to Chomsky grammars.

Further, as a motivation, we show that synthesis processes, for instance, in DNA replication (Figure 2), can be simulated by derivations in WK grammars. The replication of DNA begins at the origin(s). The double strand is then separated by proteins, producing bubble-like shape(s). The synthesis of new strands, using the parental strands as templates, starts from the origins and proceeds in the 5′ to 3′ direction on both strands [1].

Figure 2: Synthesis process in bacterial DNA replication.

This synthesis process in general can be seen as a string generation. The enzymes responsible for the synthesis, DNA polymerases, cannot initiate the process by themselves but can only add nucleotides to an existing RNA chain. This chain, called a primer, is produced by the enzyme primase. From the grammar perspective, the primase can be interpreted as the start symbol $S$. After the primer has been connected to the parental strand, one of the synthesizing enzymes, DNA polymerase III, continues to add nucleotides one by one to the primer, each complementary to the corresponding nucleotide of the parental strand. The synthesis finishes with the replacement of the RNA primer by DNA nucleotides using the enzyme DNA polymerase I and the joining of the fragments by DNA ligase. Again, from the grammar perspective, DNA polymerase I and polymerase III act as production rules in the grammar; in particular, DNA ligase resembles a terminal production (see Figure 3).

Figure 3: The simulation of a synthesis process with a derivation.

The paper is organized as follows. In Section 2, we give some notions and definitions from the theories of formal languages and DNA computing needed in the sequel. In Section 3, we define WK grammars and languages generated by these grammars. Section 4 is devoted to the study of the generative capacity of WK regular and linear grammars. In Section 5, we investigate the closure properties of WK grammars. In Section 6, we show the application of WK grammars in the analyses of DNA structures and programming language structures. Finally, in Section 7, we discuss open problems and interesting future research topics related to WK grammars.

2. Preliminaries

We assume readers are familiar with formal language theory and automata theory. For details, readers are referred to [3, 9–11].

Throughout this paper we use the following notations. Let $\in$ be the belonging relationship of an element to the corresponding set, and let $\notin$ indicate its negation. The symbol $\subseteq$ indicates the inclusion while $\subset$ denotes the proper inclusion. The notations $\emptyset$, $|X|$, and $2^X$ denote the empty set, the cardinality of a set $X$, and the power set of $X$, respectively. When $V$ is an alphabet (a finite set of symbols), the set of all finite strings over $V$ is denoted by $V^*$, while $V^+$ denotes the same set without the empty string (we use $\lambda$ for the empty string). The length of a string $w \in V^*$ is denoted by $|w|$. A language is a subset $L \subseteq V^*$.

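Next we recall some terms regarding the closure properties of languages. The union of two languages, $L_1 \cup L_2$, is the set of strings that belong to at least one of $L_1$ and $L_2$. The concatenation of two languages is obtained by joining a string of the first language with a string of the second, that is, $$L_1 L_2 = \{uv : u \in L_1, v \in L_2\}.$$ The Kleene-star closure $L^*$ is the closure under the Kleene operation, the set of all finite concatenations of strings in $L$, including the empty string. The mirror image of a word $w = a_1 a_2 \cdots a_n$ is $$w^R = a_n \cdots a_2 a_1.$$ For a language $L$, its mirror image is $$L^R = \{w^R : w \in L\}.$$ A substitution is a mapping $s : V \to 2^{U^*}$, extended to strings by $s(\lambda) = \{\lambda\}$ and $s(aw) = s(a)s(w)$ for $a \in V$, $w \in V^*$. The substitution for a language $L$, that is, $s(L)$, is the union of the sets $s(w)$, where $w \in L$. A substitution is called finite if $s(a)$ is a finite set for each $a \in V$. A morphism is a substitution in which each $s(a)$ consists of exactly one string.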
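The operations recalled above can be tried out on small finite languages. The following is a minimal Python sketch, not part of the original paper; languages are modeled as sets of strings, the function names are illustrative assumptions, and the Kleene star is truncated at a length bound because the full closure is infinite.

```python
def union(l1, l2):
    """Strings that belong to at least one of the two languages."""
    return l1 | l2

def concat(l1, l2):
    """All strings uv with u in l1 and v in l2."""
    return {u + v for u in l1 for v in l2}

def kleene_star(lang, max_len):
    """Finite concatenations of words of `lang`, truncated at length max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        frontier = {w + x for w in frontier for x in lang
                    if len(w + x) <= max_len} - result
        result |= frontier
    return result

def mirror(lang):
    """Mirror image of every word of the language."""
    return {w[::-1] for w in lang}

def substitute(word, s):
    """Image of one word under a finite substitution s (symbol -> set of strings)."""
    images = {""}
    for symbol in word:
        images = {x + y for x in images for y in s[symbol]}
    return images

def substitute_language(lang, s):
    """Union of the images of all words of the language."""
    return set().union(*(substitute(w, s) for w in lang))
```

For instance, `concat({"a", "ab"}, {"b"})` yields `{"ab", "abb"}`, and a morphism is the special case in which every `s[symbol]` is a one-element set.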

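A Chomsky grammar is defined by $G = (N, T, S, P)$, where $N$ is the set of nonterminal symbols and $T$ is the set of terminal symbols, and $N \cap T = \emptyset$. $S \in N$ is the start symbol while $P \subseteq (N \cup T)^* N (N \cup T)^* \times (N \cup T)^*$ is the set of production rules. We write $\alpha \to \beta$ indicating the rewriting process of the strings based on the production rules $(\alpha, \beta) \in P$. The term $x$ directly derives $y$ is written as $x \Rightarrow y$ when $x = x_1 \alpha x_2$ and $y = x_1 \beta x_2$ for some production rule $\alpha \to \beta \in P$. The reflexive and transitive closure of $\Rightarrow$ is denoted by $\Rightarrow^*$. A grammar $G$ generates a language defined by $$L(G) = \{w \in T^* : S \Rightarrow^* w\}.$$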

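According to the forms of production rules, grammars are classified as follows. A grammar is called
(i) context-sensitive if each production has the form $u_1 A u_2 \to u_1 u u_2$, where $A \in N$, $u_1, u_2 \in (N \cup T)^*$, and $u \in (N \cup T)^+$;
(ii) context-free if each production has the form $A \to u$, where $A \in N$ and $u \in (N \cup T)^*$;
(iii) linear if each production has the form $A \to u_1 B u_2$ or $A \to u$, where $A, B \in N$ and $u, u_1, u_2 \in T^*$;
(iv) right-linear if each production has the form $A \to uB$ or $A \to u$, where $A, B \in N$ and $u \in T^*$;
(v) left-linear if each production has the form $A \to Bu$ or $A \to u$, where $A, B \in N$ and $u \in T^*$.

Right-linear and left-linear grammars are called regular.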
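As an illustration of this classification, the following Python sketch inspects the shape of a single production. It is not taken from the paper: the representation (tuples of symbols plus a set of nonterminals), the function name `classify`, and the returned labels are assumptions made for the example, and the context-sensitive case is not checked.

```python
def classify(lhs, rhs, nonterminals):
    """Return the most restrictive rule shape matched by the production lhs -> rhs.

    lhs and rhs are tuples of symbols; every symbol outside `nonterminals`
    is treated as a terminal.
    """
    if len(lhs) != 1 or lhs[0] not in nonterminals:
        return "not context-free"
    positions = [i for i, symbol in enumerate(rhs) if symbol in nonterminals]
    if not positions:
        return "terminal"          # A -> u is allowed in every class above
    if len(positions) > 1:
        return "context-free"
    if positions[0] == len(rhs) - 1:
        return "right-linear"      # A -> uB (also covers A -> B)
    if positions[0] == 0:
        return "left-linear"       # A -> Bu
    return "linear"                # A -> u1 B u2 with u1, u2 nonempty
```

A grammar is then right-linear when every rule is reported as `"right-linear"` or `"terminal"`, and analogously for the other classes.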

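The families of languages generated by these grammars are $\mathbf{CS}$, $\mathbf{CF}$, $\mathbf{LIN}$, and $\mathbf{REG}$, respectively. The family of recursively enumerable languages is denoted by $\mathbf{RE}$ while the family of finite languages is denoted by $\mathbf{FIN}$. Thus the next relation holds [11].

Theorem 1 (Chomsky hierarchy). The following strict inclusions hold: $$\mathbf{FIN} \subset \mathbf{REG} \subset \mathbf{LIN} \subset \mathbf{CF} \subset \mathbf{CS} \subset \mathbf{RE}.$$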

We recall the definition of a finite automaton. A finite automaton (FA) is a quintuple $M = (Q, V, q_0, F, \delta)$, where $Q$ is the set of states, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final states. Meanwhile $V$ is an alphabet and $\delta : Q \times V \to 2^Q$ is called the transition function. The set (language) of all strings accepted by $M$ is denoted by $L(M)$. We denote the family of languages accepted by finite automata by $\mathbf{FA}$. Then, $\mathbf{FA} = \mathbf{REG}$ (see [11]).

Next, we cite some basic definitions and results of Watson-Crick automata.

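The key feature of WK automata is the symmetric relation $\rho$ on an alphabet $V$; that is, $\rho \subseteq V \times V$. In this paper, for simplicity, we use the form $\langle u/v \rangle$ to mention the elements $(u, v)$ in the set of all pairs of strings $V^* \times V^*$, and, instead of $V^* \times V^*$, we write $\langle V^*/V^* \rangle$.

The Watson-Crick domain $WK_\rho(V) = [V/V]_\rho^*$, where $[V/V]_\rho = \{[a/b] : a, b \in V, (a, b) \in \rho\}$, is the set of well-formed double-stranded strings (molecules); $WK_\rho^+(V) = [V/V]_\rho^+$ holds the similar meaning without including $\lambda$. We write $[a_1/b_1][a_2/b_2] \cdots [a_n/b_n] \in WK_\rho(V)$ as $[u/v]$, where the upper strand is $u = a_1 a_2 \cdots a_n$ and the lower strand is $v = b_1 b_2 \cdots b_n$. Note that $\langle u/v \rangle = [u/v]$ exactly when the two strands have the same length and the symbols at corresponding positions are complementary with respect to $\rho$.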
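The well-formedness condition can be stated as a short check. The sketch below is an illustration only, not part of the original paper; it assumes a double strand is given as two Python strings and the complementarity relation as a set of symbol pairs, and the names `is_well_formed`, `complement_strand`, and `DNA_RHO` are hypothetical.

```python
# Watson-Crick complementarity on the DNA alphabet, as a symmetric set of pairs.
DNA_RHO = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}

def is_well_formed(upper, lower, rho):
    """True iff the strands have equal length and are complementary position-wise."""
    return len(upper) == len(lower) and all(
        (a, b) in rho for a, b in zip(upper, lower))

def complement_strand(upper, rho):
    """Lower strand determined by `upper`, assuming each symbol has one complement."""
    partner = {}
    for a, b in rho:
        if partner.setdefault(a, b) != b:
            raise ValueError(f"{a!r} has several complements")
    return "".join(partner[a] for a in upper)
```

With the DNA relation, `is_well_formed("ACGT", "TGCA", DNA_RHO)` holds, whereas strands of different lengths or with a mismatched pair are rejected.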

A Watson-Crick finite automaton (WKFA) is 6-tuple where , , , and are the same as a FA. Meanwhile the transition function is where is not an empty set only for finitely many triples . Similar to FA, we can write the relation in transition function as a rewriting rule in grammars; that is, We describe the reflexive and transitive closure of as . The language accepted by a WKFA is The family of languages accepted is indicated by . It is shown in [10, 12] that
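To make the two-head reading mechanism concrete, the following Python sketch simulates a nondeterministic WK automaton by a breadth-first search over configurations (state, upper head position, lower head position). It is an illustration under stated assumptions and not the authors' formalization: the complementarity relation is assumed to be one-to-one, so the lower strand is determined by the input, and the transition table `ANBNCN`, given for the identity relation, is a hypothetical example accepting strings of the form a^n b^n c^n with n ≥ 1.

```python
from collections import deque

def wkfa_accepts(word, transitions, start, finals, complement=lambda a: a):
    """Simulate a WK finite automaton on the double strand built from `word`.

    transitions maps a state to a list of (u, v, target) triples: read u with
    the upper head, v with the lower head, then move to state `target`.
    """
    lower = "".join(complement(a) for a in word)
    seen = {(start, 0, 0)}
    queue = deque(seen)
    while queue:
        state, i, j = queue.popleft()
        if state in finals and i == len(word) and j == len(lower):
            return True
        for u, v, target in transitions.get(state, ()):
            if word.startswith(u, i) and lower.startswith(v, j):
                nxt = (target, i + len(u), j + len(v))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

# Hypothetical automaton: the lower head trails the upper head by one block.
ANBNCN = {
    "q0": [("a", "", "q0"), ("b", "a", "q1")],
    "q1": [("b", "a", "q1"), ("c", "b", "q2")],
    "q2": [("c", "b", "q2"), ("", "c", "q3")],
    "q3": [("", "c", "q3")],
}
```

Calling `wkfa_accepts("aabbcc", ANBNCN, "q0", {"q3"})` succeeds, while an unbalanced input such as `"aabbc"` is rejected because the lower head cannot finish its strand.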

3. Definitions

In this section, we slightly modify the definition of Watson-Crick regular grammars introduced in [8] in order to extend the concept to linear grammars and context-free grammars.

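Definition 2. A Watson-Crick (WK) grammar $G = (N, T, \rho, S, P)$, where $N$, $T$, and $S$ are as in a Chomsky grammar, $\rho \subseteq T \times T$ is a symmetric relation, and $\langle T^*/T^* \rangle$ denotes the set of pairs $\langle u/v \rangle$ of an upper strand $u \in T^*$ and a lower strand $v \in T^*$, is called
(i) regular if each production has the form $A \to \langle u/v \rangle B$ or $A \to \langle u/v \rangle$, where $A, B \in N$ and $\langle u/v \rangle \in \langle T^*/T^* \rangle$;
(ii) linear if each production has the form $A \to \langle u_1/v_1 \rangle B \langle u_2/v_2 \rangle$ or $A \to \langle u/v \rangle$, where $A, B \in N$ and $\langle u/v \rangle, \langle u_1/v_1 \rangle, \langle u_2/v_2 \rangle \in \langle T^*/T^* \rangle$;
(iii) context-free if each production has the form $A \to \alpha$, where $A \in N$ and $\alpha \in (N \cup \langle T^*/T^* \rangle)^*$.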

Definition 3. Let be a WK linear grammar. We say that directly derives , denoted by , iff where , , , and

Definition 4. Let be a WK context-free grammar. We say that directly derives , denoted by , if and only if where , , , and .

Remark 5. We use a common notion “Watson-Crick grammars” referring to any type of WK grammars.

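Definition 6. The language generated by a WK grammar $G = (N, T, \rho, S, P)$ is defined as $$L(G) = \{u : [u/v] \in WK_\rho(T) \text{ and } S \Rightarrow^* [u/v]\};$$ that is, $L(G)$ consists of the upper strands of the well-formed double-stranded strings derivable from the start symbol.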
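The following Python sketch enumerates, up to a length bound, the language generated by a WK regular grammar in the sense described above: both strands grow from left to right, and complementarity is checked only once the derivation terminates. It is a hedged illustration rather than the paper's construction; the rule encoding (triples `(u, v, next)` with `None` marking a terminal rule) and the example grammar `RULES`, which uses the identity relation on {a, b, c}, are assumptions made for this example.

```python
from collections import deque

def generate(rules, start, rho, max_len):
    """Upper strands of length <= max_len of well-formed molecules derivable from start.

    rules maps a nonterminal A to triples (u, v, B) for A -> <u/v>B,
    or (u, v, None) for a terminal rule A -> <u/v>.
    """
    language = set()
    seen = {(start, "", "")}
    queue = deque(seen)
    while queue:
        nonterminal, upper, lower = queue.popleft()
        for u, v, nxt in rules.get(nonterminal, ()):
            new_upper, new_lower = upper + u, lower + v
            if len(new_upper) > max_len or len(new_lower) > max_len:
                continue  # strands only grow, so this branch cannot recover
            if nxt is None:
                if len(new_upper) == len(new_lower) and all(
                        (a, b) in rho for a, b in zip(new_upper, new_lower)):
                    language.add(new_upper)
            elif (nxt, new_upper, new_lower) not in seen:
                seen.add((nxt, new_upper, new_lower))
                queue.append((nxt, new_upper, new_lower))
    return language

IDENTITY = {(x, x) for x in "abc"}
# Hypothetical WK regular grammar: the lower strand lags one block behind.
RULES = {
    "S": [("a", "", "S"), ("b", "a", "A")],
    "A": [("b", "a", "A"), ("c", "b", "B")],
    "B": [("c", "b", "B"), ("", "c", "C")],
    "C": [("", "c", "C"), ("", "", None)],
}
```

Under these assumptions, `generate(RULES, "S", IDENTITY, 6)` returns `{"abc", "aabbcc"}`, which suggests how regular-shaped rules on double strands can describe a language that is not context-free.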

4. Generative Power of Watson-Crick Grammars

In this section, we establish results regarding the computational power of WK grammars.

4.1. A Normal Form for Watson-Crick Linear Grammars

Next, we define the 1-normal form for WK linear grammars and show that, for every WK linear grammar $G$, there is an equivalent WK linear grammar $G'$ in the normal form; that is, $L(G) = L(G')$.

Definition 7. A linear WK grammar is said to be in the -normal form if each rule in of the form where , , , and .

Lemma 8. For every WK linear grammar , there exists an equivalent WK linear grammar in the 1-normal form.

Proof. Let be a WK linear grammar. Letbe a production in where , , , or . Without loss of generality, we assume that and . Then, we define the following sequence of right-linear and left-linear production rules: where , , and , , are new nonterminals.
Letwhere or . Without loss of generality, we assume that . Then, we define the following sequence of right-linear production rules: where , , are new nonterminals.
We construct a WK linear grammar , where consists of productions defined above for each with , , , or and with or . Then, it is not difficult to see that, in every derivation, productions in the form of (22) and (24) in can be replaced by the sequences of productions (23) and (25) in and vice versa. Thus, .
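The idea of replacing one long rule by a chain of short rules can be sketched in Python as follows. This covers only the right-linear case and is one plausible reading of the construction, since the exact rules of the proof are not reproduced here; the rule encoding `(lhs, u, v, rhs)` and the helpers `split_rule` and `make_fresh` are hypothetical names introduced for the illustration.

```python
from itertools import count, zip_longest

def make_fresh(prefix="X"):
    """Return a function producing new nonterminal names prefix1, prefix2, ..."""
    counter = count(1)
    return lambda: f"{prefix}{next(counter)}"

def split_rule(lhs, upper, lower, rhs, fresh):
    """Split lhs -> <upper/lower> rhs into rules with strands of length <= 1.

    rhs is a nonterminal, or None for a terminal rule. The shorter strand
    is padded with empty strings.
    """
    pieces = list(zip_longest(upper, lower, fillvalue="")) or [("", "")]
    rules, current = [], lhs
    for k, (a, b) in enumerate(pieces):
        target = rhs if k == len(pieces) - 1 else fresh()
        rules.append((current, a, b, target))
        current = target
    return rules
```

For example, `split_rule("A", "abc", "d", "B", make_fresh("A"))` produces the chain `A -> <a/d>A1`, `A1 -> <b/λ>A2`, `A2 -> <c/λ>B`; concatenating the upper and lower pieces along the chain gives back the strands of the original rule.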

4.2. The Generative Power

The following results immediately follow from the definition of WK grammars.

Lemma 9. The following inclusions hold:

Next, we show that WK grammars can generate non-context-free languages:

Example 10. Let , be a WK linear grammar, where consists of the rules In general, we have the derivation Thus, generates the language .

Example 11. Let be a WK regular grammar, and consists of the rules Then, we have the following derivation for : Hence, .

Example 12. Let , ) be a WK linear grammar with consisting of the following rules: By rules and , we obtain a sentential form where . The derivation is continued by only possible rule and we have . Further, we can only apply rules and . By the symmetric relation , the derivation results in . Then, we can only apply continuing with rules and and obtain . Finally, by rule , we get . Illustratively, Thus, .

The following theorem follows from Lemma 9 and Examples 10, 11, and 12.

Theorem 13. The following inclusions hold:

The following lemma shows that some WK linear languages cannot be generated by WK regular grammars.

Lemma 14. The following language is not a WK regular language:

Proof. The language can be generated by the following WK linear grammar , where consists of the rules: It is not difficult to see that where .
Next, we show that .
We suppose, by contradiction, that can be generated by a WK regular grammar . Without loss of generality, we assume that is in -normal form. Then, for each rule in , we have and Let be a string in such that . Then, the double-stranded sequence is generated by the grammar .
Case  1. In any derivation for this string, first can occur in the upper (or lower) strand if has already been generated in the upper (or lower) strand. Thus, we obtain two possible successful derivations: where . In the latter derivation in (40), we cannot control the number of occurrences of ; that is, the derivation may not be successful. In the former derivation in (40), using the second strand, we can generate :Equation (41) is continued by generating ’s in the first strand and we can use the second strand to control their number. Considerand is related to . Since , generally, is not the same as for all derivations.
Case  2. We can control the number of ’s after ’s by using the second strand for ’s before ’s. In this case, the number of ’s cannot be related to the number of ’s: In both cases, we cannot control the number of ’s and the number of ’s after ’s at the same time using WK regular rules.

Since strings are palindrome strings for even ’s, the language is not in ; that is, we have the following.

Corollary 15. The following holds:

4.3. Hierarchy of the Families of Watson-Crick Languages

Combining the results above, we obtain the following theorem.

Theorem 16. The relations in Figure 4 hold; the dotted lines denote incomparability of the language families and the arrows denote proper inclusions of the lower families into the upper families, while the dotted arrows denote inclusions.

Figure 4: The hierarchy of WK and Chomsky language families.

5. Closure Properties

In this section, we establish results regarding the closure properties of WK grammars. The families of WK languages are shown to be higher in the hierarchy than their respective Chomsky language families; thus it is interesting to see whether WK grammars retain the closure properties known for the Chomsky language families. Moreover, knowing the closure properties of WK grammars helps ensure the safety and correctness of the results obtained when performing operations on the sets of DNA molecules generated by WK grammars.

5.1. Watson-Crick Regular Grammars

Let , and , be Watson-Crick regular grammars generating languages and , respectively; that is, and . Without loss of generality, we can assume that and and do not appear on the right-hand side of any production rule.

Lemma 17 (union). is closed under union.

Proof. Define by setting with , and Then it is not difficult to see that .
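A union construction of this kind can be tested on small examples. The Python sketch below uses the standard construction for regular grammars (a fresh start symbol that inherits the start rules of both grammars), which may differ in detail from the rules used in the proof; the rule encoding, the bounded generator `generate`, and the two example grammars are assumptions made for the illustration.

```python
from collections import deque

def generate(rules, start, rho, max_len):
    """Bounded enumeration of the language of a WK regular grammar."""
    language, seen = set(), {(start, "", "")}
    queue = deque(seen)
    while queue:
        nonterminal, upper, lower = queue.popleft()
        for u, v, nxt in rules.get(nonterminal, ()):
            up, low = upper + u, lower + v
            if len(up) > max_len or len(low) > max_len:
                continue
            if nxt is None:  # terminal rule: check well-formedness
                if len(up) == len(low) and all((a, b) in rho for a, b in zip(up, low)):
                    language.add(up)
            elif (nxt, up, low) not in seen:
                seen.add((nxt, up, low))
                queue.append((nxt, up, low))
    return language

def union_grammar(rules1, start1, rules2, start2, new_start="S0"):
    """Fresh start symbol whose rules are the start rules of both grammars."""
    if set(rules1) & set(rules2) or new_start in rules1 or new_start in rules2:
        raise ValueError("nonterminals must be disjoint and new_start unused")
    merged = {**rules1, **rules2}
    merged[new_start] = list(rules1[start1]) + list(rules2[start2])
    return merged, new_start

IDENTITY = {(x, x) for x in "abc"}
# Hypothetical grammars: G1 yields a^n b^n (n >= 1), G2 yields c^n (n >= 1).
G1 = {"S1": [("a", "", "S1"), ("b", "a", "A1")],
      "A1": [("b", "a", "A1"), ("", "b", "B1")],
      "B1": [("", "b", "B1"), ("", "", None)]}
G2 = {"S2": [("c", "c", "S2"), ("c", "c", None)]}
```

With `merged, s0 = union_grammar(G1, "S1", G2, "S2")`, the bounded language `generate(merged, s0, IDENTITY, 4)` coincides with the union of the bounded languages of the two grammars, because every derivation from the new start symbol stays inside one of the original grammars.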

Lemma 18 (concatenation). is closed under concatenation.

Proof. Define , where and . We define Then it is obvious that .

Lemma 19 (Kleene-star). is closed under Kleene-star operation.

Proof. Define the WK regular grammar with where Then, .

Lemma 20 (finite substitution and homomorphism). is closed under finite substitution and homomorphism.

Proof. We show that the finite substitution (homomorphism) of is also in . Let be a finite substitution (homomorphism). Define , where . Without loss of generality, we assume that is in the 1-normal form. Then, can contain production rules of the forms We define as the set of production rules of the forms Since the substitution/homomorphism is finite, is a finite set too; that is, is a WK regular grammar, and .

Lemma 21 (mirror image). is closed under mirror image.

Proof. We show that . Define generating the language , where is a new nonterminal and consists of production rules defined as follows: (i), where ,(ii), where .

The results obtained from the lemmas above are summarized in the following theorem.

Theorem 22. The family of Watson-Crick regular languages is closed under union, concatenation, Kleene-star, finite substitution, homomorphism, and mirror image.

This shows that WK regular grammars preserve almost all of the closure properties of regular grammars. Other closure properties of WK regular grammars are left for future studies.

5.2. Watson-Crick Linear Grammars

Similar to the subsection above, a Watson-Crick linear grammar is constructed for the purpose of investigating the closure properties of .

Let , and be Watson-Crick linear grammars generating and , respectively; that is, and . Without loss of generality, we can assume that .

Lemma 23 (union). is closed under union.

Proof. Define , where with and Then, .

Lemma 24 (homomorphism). is closed under homomorphism.

Proof. Define , where . We show that the homomorphism of is also in . Let be a homomorphism. Without loss of generality, we assume that is in the 1-normal form. For each rule where , we construct in .
In every successful derivation of generating , we replace production rule with the production rule and obtain the string . Thus, .

Theorem 25. The family of Watson-Crick linear languages is closed under union and homomorphism.

It remains an interesting open question whether the concatenation of two WK linear languages is still a WK linear language. The answer would also show whether WK linear grammars differ from linear grammars in this respect, as the family of linear languages is closed under neither concatenation nor the Kleene-star operation.

5.3. Watson-Crick Context-Free Grammars

Let , and be Watson-Crick context-free grammars generating and , respectively; that is, and . Without loss of generality, we can assume that .

Lemma 26 (union). is closed under union operation.

Proof. Define the WK context-free grammar , where with , and Then it is obvious that .

Lemma 27 (concatenation). is closed under concatenation.

Proof. Define where , and Then, .

Lemma 28 (Kleene-star). is closed under Kleene-star operation.

Proof. Define setting Then it is not difficult to see that .

Lemma 29 (homomorphism). is closed under homomorphism.

Proof. Define , where . We show that the homomorphism of is also in . Let be a homomorphism. contains production rules of the form where
For each rule , we constructwhere .
In every successful derivation of generating , we replace production rule with the production rule and obtain the string . Thus, .

With the lemmas provided above in this subsection, the next theorem follows.

Theorem 30. The family of Watson-Crick context-free languages is closed under union, concatenation, Kleene-star, and homomorphism.

Closure of the family of WK context-free languages under complement and intersection depends on the generative capacity of WK context-free grammars. The nonclosure of the family of context-free (CF) languages under intersection was shown with the famous example in which the intersection of two CF languages is a language that cannot be generated by any CF grammar. If one can provide examples of languages that cannot be generated by WK context-free grammars, then the nonclosure of this family may be provable in a similar way.

6. Applications of Watson-Crick Grammars

In this section, we consider two examples of the applications of Watson-Crick grammars in the analyses of DNA structures and programming language structures.

6.1. DNA Structure Analysis

Since Watson-Crick grammars are developed based on the structure and recombinant behavior of DNA molecules, they can suitably be implemented in the study of DNA related problems.

The analysis of DNA strings provides useful information: for instance, the finding of a specific pattern in a DNA string and the identification of the repeats of a pattern are very important for detecting mutations. One of the diseases caused by mutation is Huntington disease, resulting from trinucleotide repeat disorders [13–15]. It has been discovered that the number of trinucleotide repeats in the DNA of patients with Huntington disease is abnormal. The repeats are also useful for finding the origin of replication of microorganisms [16].

In this section, we show that Watson-Crick grammars can be used for analyzing the repeats in DNA strings. We use the DNA of a breed of pig, Sus scrofa breed mixed chromosome 1, Sscrofa10.2, provided by the National Center for Biotechnology Information (NCBI) database (NCBI Reference Sequence: NC_010443.4) [17].

Consider a part of the upper strand of the DNA from the breed stated above with the length of 100 nucleotides:

In this example, the pattern is being repeated for six times in the direct strand (upper strand) and two times in the reverse strand (lower strand) which can be seen as in the upper strand. Focusing on the repeats of pattern, a DNA string containing such a pattern can be expressed as

We now construct a simple Watson-Crick regular grammar to generate the above language.

Let , where , , and consists of the following productions:

For instance, a derivation resulting in six repeats of pattern can be obtained as follows:where and is their complementarity symbols based on , respectively.

Though, in this example, the functionality of WK regular grammars is not used to the fullest, the DNA structures can be naturally and effectively analyzed with WK grammars.
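Independently of the grammar, the kind of repeat count discussed in this subsection can be sketched with a few lines of Python. This is an illustration and not the analysis performed in the paper: the sequence and the pattern `CAG` used in the example are hypothetical, and an occurrence on the lower strand is assumed to show up in the upper strand as the reverse complement of the pattern.

```python
# Translation table for the Watson-Crick complement of each base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand):
    """Complement every base and reverse the result."""
    return strand.translate(COMPLEMENT)[::-1]

def count_repeats(upper, pattern):
    """Non-overlapping occurrences of `pattern` on the direct and reverse strands.

    Occurrences on the reverse strand are counted as occurrences of the
    reverse complement of `pattern` in the upper strand.
    """
    direct = upper.count(pattern)
    reverse = upper.count(reverse_complement(pattern))
    return direct, reverse
```

For the made-up fragment `"CAGCAGTTCTGAA"`, `count_repeats` reports two direct occurrences of `CAG` and one occurrence on the reverse strand (seen as `CTG` in the upper strand). For a pattern equal to its own reverse complement, both counts coincide.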

6.2. Programming Language Structure Analysis

The ability to differentiate between parentheses that are correctly balanced and those that are unbalanced is an important part of recognizing many programming language structures. Balanced parentheses mean that each opening symbol has a corresponding closing symbol and the pairs of parentheses are properly nested.

The parsing algorithms of compilers and interpreters have to check the correctness of balanced parentheses in the blocks of codes including algebraic and arithmetic expressions.
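The balance condition described above is commonly checked with a single counter. The following Python sketch is a generic illustration, not a parser derived from WK grammars; the function name `is_balanced` is an assumption, and all characters other than the two parenthesis symbols are ignored.

```python
def is_balanced(text):
    """True iff every ')' closes an earlier '(' and no '(' stays open."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a closing symbol without a matching opening one
                return False
    return depth == 0
```

For example, `is_balanced("(a+(b*c))")` holds, while `")("` and `"(()"` are rejected.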

Although context-free grammars are used to develop parsers for programming languages, many programming language structures are context-sensitive. Thus, it is of interest to develop parsers based on grammars which are able to analyze non-context-free language structures.

Next, we show an example of a language of balanced parentheses that can be generated by WK regular grammars but cannot be generated by context-free grammars. One can see that, just by incorporating the concept of double-stranded strings bonded with Watson-Crick complementarity, even WK grammars that are based on just regular rules can enhance the power of parsing techniques.

To avoid confusion, we write the open parenthesis terminal symbol “(” and the close parenthesis terminal symbol “)” in bold font.

Example 31. Let be a WK regular grammar. consists of the rules From this, we obtain the derivationwhere . Hence, the language obtained is

7. Conclusion

In this paper, we defined Watson-Crick regular grammars, Watson-Crick linear grammars, and Watson-Crick context-free grammars. Further, we investigated their computational power and closure properties. We showed that
(1) WK linear grammars can generate some context-sensitive languages;
(2) the families of linear languages and WK regular languages are strictly included in the family of WK linear languages;
(3) the families of WK regular languages and linear languages are not comparable;
(4) the family of WK linear languages is not comparable with the family of context-free languages;
(5) WK regular grammars preserve closure properties similar to those of regular languages.

The following problems related to the topic remain open:
(1) Is the family of context-free languages a proper subset of the family of WK linear languages, or are the two families not comparable?
(2) What is the generative capacity of Watson-Crick context-free grammars?
(3) What are the remaining closure properties of Watson-Crick (regular, linear, and context-free) grammars? These depend on their generative capacity.

Moreover, there are many interesting topics for further research; for instance, we can define WK context-free and regulated WK context-free grammars and use them in the study of DNA properties and in the DNA based applications such as food authentication and disease gene detection.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work has been supported through International Islamic University Endowment B research Grant EDW B14-136-1021 and Fundamental Research Grant Scheme FRGS13-066-0307, Ministry of Education, Malaysia. The first author would like to thank both organizations for the scholarship through the IIUM fellowship program.

References

  1. J. B. Reece, L. A. Urry, M. L. Cain, S. A. Wasserman, P. V. Minorsky, and R. B. Jackson, Campbell Biology, Pearson Education, 10th edition, 2011.
  2. R. Freund, G. Paun, G. Rozenberg, and A. Salomaa, Watson-Crick Finite Automata, vol. 48 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1999.
  3. E. Czeizler, “A short survey on Watson-Crick automata,” Bulletin of the EATCS, vol. 88, no. 3, pp. 104–119, 2006.
  4. L. Kari, S. Seki, and P. Sosik, “DNA computing—foundations and implications,” in Handbook of Natural Computing, G. Rozenberg, T. Bäck, and J. N. Kok, Eds., pp. 1073–1127, 2012.
  5. P. Leupold and B. Nagy, “5′ → 3′ Watson-Crick automata with several runs,” Fundamenta Informaticae, vol. 104, no. 1-2, pp. 71–91, 2010.
  6. M. I. Mohd Tamrin, S. Turaev, and T. M. Tengku Sembok, “Weighted Watson-Crick automata,” in Proceedings of the 21st National Symposium on Mathematical Sciences, vol. 1605 of AIP Conference Proceedings, p. 302, Penang, Malaysia, November 2013.
  7. K. G. Subramanian, I. Venkat, and K. Mahalingam, “Context-free systems with a complementarity relation,” in Proceedings of the 6th International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA '11), pp. 194–198, Penang, Malaysia, September 2011.
  8. K. G. Subramanian, S. Hemalatha, and I. Venkat, “On Watson-Crick automata,” in Proceedings of the 2nd International Conference on Computer Science, Science, Engineering and Information Technology (CCSEIT '12), pp. 151–156, Coimbatore, India, 2012.
  9. P. Linz, An Introduction to Formal Languages and Automata, Jones and Bartlett, 2006.
  10. G. Păun, G. Rozenberg, and A. Salomaa, DNA Computing, New Computing Paradigms, Springer, Berlin, Germany, 1998.
  11. G. Rozenberg and A. Salomaa, Handbook of Formal Languages, vol. 1–3, Springer, 1997.
  12. S. Okawa and S. Hirose, “The relations among Watson-Crick automata and their relations with context-free languages,” IEICE Transactions on Information and Systems, vol. E89, no. 10, pp. 2591–2599, 2006.
  13. M. E. MacDonald, C. M. Ambrose, M. P. Duyao et al., “A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes,” Cell, vol. 72, no. 6, pp. 971–983, 1993.
  14. H. T. Orr and H. Y. Zoghbi, “Trinucleotide repeat disorders,” Annual Review of Neuroscience, vol. 30, pp. 575–621, 2007.
  15. J. Petruska, M. J. Hartenstine, and M. F. Goodman, “Analysis of strand slippage in DNA polymerase expansions of CAG/CTG triplet repeats associated with neurodegenerative disease,” The Journal of Biological Chemistry, vol. 273, no. 9, pp. 5204–5210, 1998.
  16. N. C. Jones and P. Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, Boston, Mass, USA, 2004.
  17. M. A. M. Groenen, A. L. Archibald, H. Uenishi et al., “Analyses of pig genomes provide insight into porcine demography and evolution,” Nature, vol. 491, no. 7424, pp. 393–398, 2012.