Abstract

This paper presents a novel hybrid DNA encryption (HyDEn) approach that uses randomized assignments of unique error-correcting DNA Hamming code words for single characters in the extended ASCII set. HyDEn relies on custom-built quaternary codes and a private key used in the randomized assignment of code words and the cyclic permutations applied on the encoded message. Along with its ability to detect and correct errors, HyDEn equals or outperforms existing cryptographic methods and represents a promising in silico DNA steganographic approach.

1. Introduction

The deluge of counterfeited goods flooding the world markets today generates a high demand for novel cryptographic and steganographic approaches that will better protect information and branded products and ensure their authenticity. Positioned at the confluence of mathematics, biology, informatics, chemistry, and physics, cryptography and steganography represent the ultimate means for information protection.

1.1. Cryptography

Cryptography is generally defined as the practice and study of techniques for secure communication performed over unsecured channels. There are two major operations involved in secure communication, namely, the encryption and decryption of a message. The purpose of encryption is to modify the information such that only an authorized party is capable of decoding it. Both encryption and decryption require a key, which is held by the authorized parties and is assumed to be kept secret. To date, only one encryption approach has been mathematically proven to be secure and virtually unbreakable: the one-time pad [1]. Nevertheless, its practicality is hampered by the necessity of a random key, which must be at least as long as the message itself. All other cryptographic approaches can be broken in theory, although the time required to do so might be very long, which makes them fairly secure in practice. Examples of such cryptographic approaches include the data encryption standard (DES) [2], the advanced encryption standard (AES) [3], the Rivest-Shamir-Adleman (RSA) method [4], and the Pretty Good Privacy (PGP) [5] method.

1.2. Steganography

Steganography is the science of concealing information within different types of media, such that only the sender and the receiver are aware of its exact location. Unlike cryptography, where only the message is protected, steganography protects both the message and the communicating parties. With origins deeply rooted in ancient Greece, where messages were recorded as texts or tattoos and then hidden on wax tablets and skins, steganography was used relentlessly over the centuries under various ingenious forms such as invisible inks [6], postal stamps [7], knitted clothes [8], microdots [9], modified images [10], executable files [11], and DNA sequences embedded in various materials [12, 13].

1.3. Error-Correcting DNA Codes

Error-correcting codes consist of sets of code words defined over a finite alphabet, such that if any code word is altered in a small number of positions we can detect and correct the error based on knowledge of the remaining code words.

For example, assume a given binary code $C$ consisting of two code words, $c_1 = 000$ and $c_2 = 111$, each of length 3. A 1-bit error occurring in either of the two code words (e.g., $c_1$) will produce a modified code word; let us say $c_1' = 010$. By comparing the modified code word $c_1'$ with both code words from $C$, we notice that it differs in only one bit from $c_1$ (middle bit), while it differs in two bits compared with $c_2$ (flanking bits). Thus, we can quickly identify the exact location of the error and correct it based on the closest proximity of $c_1'$ to code word $c_1$.
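The comparison described above amounts to nearest-code-word decoding. The following minimal sketch illustrates it, assuming the two code words are 000 and 111 and the received word is 010 (illustrative values; the function names are hypothetical and not part of HyDEn).

```python
def hamming_distance(u: str, v: str) -> int:
    """Count the positions in which two equal-length words differ."""
    if len(u) != len(v):
        raise ValueError("words must have equal length")
    return sum(a != b for a, b in zip(u, v))


def nearest_code_word(received: str, code: list[str]) -> str:
    """Return the code word with the smallest Hamming distance to `received`."""
    return min(code, key=lambda word: hamming_distance(received, word))


code = ["000", "111"]   # assumed code words (3-bit repetition code)
received = "010"        # assumed received word with a 1-bit error

distances = {word: hamming_distance(received, word) for word in code}
corrected = nearest_code_word(received, code)
print(distances, corrected)
```

The received word is one position away from 000 and two positions away from 111, so it is decoded as 000.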

1.3.1. Hamming Codes

One instance of simple and efficient error-correcting codes is the family of Hamming codes [14], where each pair of code words differs in at least $d$ bits. We denote by $A_4(n, d)$ the maximum size of a quaternary code where all pairs of code words of length $n$ differ in at least $d$ positions. The number of bits/positions in which two code words differ is also known as the Hamming distance. For certain combinations of $n$ and $d$, the exact maximum size of quaternary codes is unknown, and thus lower and upper bounds were derived to provide approximations. The text by MacWilliams and Sloane [15] provides a succinct introduction to the topic.

While Hamming codes were originally designed using a binary alphabet $\{0, 1\}$ with the purpose of sending binary information over noisy channels, the increased need for storing and retrieving information with synthetic DNA strands used as chemical bar codes, or as biological tags for DNA computing applications, facilitated the advent of Hamming codes defined over quaternary alphabets, such as the DNA alphabet $\{A, C, G, T\}$.

1.3.2. DNA Codes

A single-stranded DNA molecule is a long, unbranched polymer composed of only four types of subunits linked together by chemical bonds and attached to a sugar-phosphate chain like four kinds of beads strung on a necklace. These subunits are the deoxyribonucleotides containing the bases: adenine (A), cytosine (C), guanine (G), and thymine (T).

Conceptually equivalent to a digital signal, DNA sequences are naturally and synthetically used for information encoding in living organisms and biotechnological and steganographic applications. Given the data encoding capacity of DNA and the fact that traditional data encoding techniques using binary sequences are fortified against communication errors, quaternary codes using the DNA alphabet were proposed and continuously developed over the past decades.

The design of error-correcting DNA codes of fixed length that satisfy various combinations of constraints, such as having a minimum pairwise Hamming distance ($d$), is a hard computational problem whose complexity is still unknown today. Over the past two decades, a large number of publications have proposed intricate code design techniques [16-18] based on state-of-the-art algorithms such as stochastic local search, genetic algorithms, and pure mathematical constructions. Most of these approaches have also led to the continuous improvement of upper and lower bounds for DNA codes [19-21].

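Assuming that a DNA code $C$ with code words of length $n$ is given and that each pair of distinct code words $c_i$ and $c_j$ obeys the condition that, for all pairs ($c_i$, $c_j$) with $i \neq j$, $H(c_i, c_j) \geq d$, where $H$ denotes the Hamming distance, then $C$ can detect $d - 1$ errors and can correct $\lfloor (d - 1)/2 \rfloor$ errors.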
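As a rough illustration of this relationship, the sketch below computes the minimum pairwise Hamming distance of a small DNA code and derives the standard detection and correction capabilities from it. The three code words are taken from the worked example in Section 3; treating them as a stand-alone code, and the function names, are assumptions made only for this illustration.

```python
from itertools import combinations


def hamming_distance(u: str, v: str) -> int:
    """Count the positions in which two equal-length words differ."""
    if len(u) != len(v):
        raise ValueError("words must have equal length")
    return sum(a != b for a, b in zip(u, v))


def min_distance(code: list[str]) -> int:
    """Minimum Hamming distance over all pairs of distinct code words."""
    return min(hamming_distance(u, v) for u, v in combinations(code, 2))


# Three code words of length 8 borrowed from the example in Section 3.
code = ["ACTACACT", "GTTGTATT", "AAAAAAGA"]

d = min_distance(code)
detectable = d - 1            # errors that can always be detected
correctable = (d - 1) // 2    # errors that can always be corrected
print(d, detectable, correctable)
```

For this small subset the minimum distance is 5, so up to 4 substitution errors per code word can be detected and up to 2 can be corrected.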

1.4. Related Work

Over the past decade, complex algorithms have been devised to encode information using DNA sequences. Examples of such algorithms include the DNA triplet-based approach described by Clelland et al. [9], which extends the principle of using microdots to hide information developed during the Second World War. An extension of Clelland et al.’s work was presented by Leier et al. [22], and it consisted of encoding zeros and ones using short DNA sequences with sticky ends, which can bind together forming longer sequences. The encrypted messages include a mixture of coding and noncoding DNA sequences, and the decryption can be performed only by someone who has access to the correct primer sequences. A primer is a short DNA sequence that serves as a starting point for DNA synthesis. A similar approach based on DNA tiling was proposed by Hirabayashi et al. [23] who designed true random one-time pads using a DNA cryptosystem. The true randomness is conferred by molecular computations using hybridization of DNA sequences encoding 4 types of cipher tiles.

Gehani et al. [24] extended the one-time pad approach to perform operations on DNA sequence pairs, representing plain and cipher texts. Originally, the one-time pad approach was designed to perform XOR operations on binary codes. The message encoded with DNA pairs can be retrieved and decoded using specific DNA polymerases. Arita and Ohashi [25] developed a steganographic algorithm based on the redundant codon table (see Table 1). A codon consists of 3 consecutive nucleotides, and while it is possible to have 64 ($4^3$) different codons, only 20 of them encode distinct amino acids, with the rest being redundant. Their algorithm encoded each letter in the English alphabet using binary codes of length 5, with each bit being encoded by a codon. They added a parity bit to each letter encoding to keep the number of bits in each bit-pattern odd, which serves error-detection purposes. The decoding could be achieved only by someone who knows the original codon sequence.
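The parity idea in the Arita and Ohashi scheme can be sketched as follows. This is only an illustration under stated assumptions: letters are numbered A = 0 through Z = 25, the parity bit is appended at the end, and "odd" is read as an odd number of 1 bits. The codon-level encoding of each bit is omitted, and the function names are hypothetical.

```python
def letter_to_bits(letter: str) -> str:
    """Encode a letter as 5 data bits plus 1 parity bit (odd number of 1s)."""
    index = ord(letter.upper()) - ord("A")   # assumed mapping: A=0 ... Z=25
    if not 0 <= index < 26:
        raise ValueError("only the letters A-Z are supported")
    bits = format(index, "05b")
    # Append the bit that makes the total count of 1s odd.
    parity = "0" if bits.count("1") % 2 == 1 else "1"
    return bits + parity


def parity_ok(pattern: str) -> bool:
    """A pattern passes the check when it contains an odd number of 1s."""
    return pattern.count("1") % 2 == 1


pattern = letter_to_bits("D")
corrupted = "1" + pattern[1:]   # flip the first bit to simulate an error
print(pattern, parity_ok(pattern), parity_ok(corrupted))
```

A single flipped bit changes the parity and is therefore detected, although it cannot be located or corrected with one parity bit alone.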

Following a different approach, Wong et al. [27] developed a DNA steganography method that encodes information in living organisms. The information is encoded with the aid of unique DNA sequences that do not exist in the particular genomes where they will be embedded, thus assuring the success of the identification stage. For this approach to succeed, the embedded foreign DNA must be replicated by the host organism together with its genomic DNA. The extraction of the information is achieved using a standard laboratory technique called the polymerase chain reaction (PCR) [28].

The DNA-Crypt approach proposed by Heider and Barnekow [29] combines and extends the steganographic and cryptographic methodologies proposed by Wong et al. [27] and Arita and Ohashi [25]. DNA-Crypt encodes information using a substitution cipher and two types of error-correcting codes, namely, Hamming [14] and WDH [30]. DNA-Crypt incorporates a fuzzy controller and powerful cryptographic algorithms such as one-time pad, AES, Blowfish [31], and RSA. Shiu et al. [32] introduced 3 data hiding methods based on properties of DNA sequences, namely, the insertion method, the complementary pair method, and the substitution method. All three methods provide distinct means to incorporate secret messages within existing DNA sequences pulled from public databases. The known DNA sequence acts as a private key, and it can be identified only by the sender and the receiver.

A hybrid approach built on the substitution method described in Shiu et al. [32] that combines cryptography and DNA steganography was proposed by Torkaman et al. [33]. Their approach uses reference DNA sequences from the European Bioinformatics Institute (EBI) Database, which contains roughly 163 million entries. The encoding of information is achieved using 6 association rules.

Here, we present the hybrid DNA encryption (HyDEn) approach, which combines the advantages conferred by cryptography and steganography into a unique symmetric cryptosystem. The system uses a unique private numeric key to scramble the assignment of DNA code words from a predesigned set to the extended ASCII characters and then applies a cyclic permutation to the encrypted message. The combination of key uniqueness, the randomization of code word assignments, the undisclosed code word length, and the final cyclic permutation of the encrypted message confers additional strength to the proposed approach. The information encrypted with HyDEn can be safely communicated between senders and receivers via dedicated and inconspicuous publicly accessible channels, such as bioinformatics discussion groups and DNA sequence databases.

2. HyDEn: The Hybrid DNA Encryption Approach

Deeply rooted in the ways nature encodes information using nucleic acids, DNA stegano-cryptography uses short DNA sequences to encrypt and hide messages, thus protecting their content. The hybrid DNA encryption (HyDEn) approach presented here includes a novel in silico cryptosystem that uses DNA error-correcting Hamming codes and disguises encrypted messages as long DNA sequences conveniently placed on host bioinformatics resources.

A stepwise description of the HyDEn cryptosystem follows.

Input. The message $M$ is defined over an alphabet $\Sigma$, private key $k$.

Encryption Algorithm

Step 1. Select an error-correcting DNA code with $n$-ary code words obtained with one of the state-of-the-art code design techniques described in Aboluion et al. [16], Gaborit and King [19], Tulpan and Hoos [26], and Tulpan et al. [18]. Here, $n$ represents the number of characters in a DNA code word. An example of a DNA code with $n = 8$ is given in Table 2.

Step 2. Using the key $k$ provided as input, perform a random shuffling of the $n$-ary DNA code words that will be associated to each character from $\Sigma$.

Step 3. Encrypt the message $M$ using the random assignment of DNA code words obtained in Step 2.

Step 4. Perform a circular rotation to the right of the characters in the encrypted message by exactly $k \bmod 256$ positions.

Output. The encrypted message.

Step 1 provides the means of encoding a message using a code defined over a quaternary alphabet. The code will be able to identify and correct errors that can occur during the message transmission stage. Step 2 will generate a unique code word assignment based on the key $k$. If all keys are unique, then the assignment will be equivalent to a one-time pad system. In the event that the code word length ($n$) is discovered, Step 4 lowers the chances of a successful frequency analysis based on well-established tests such as the Friedman test [34] and the Kasiski test [35].

The message decryption step will use the same unique key to perform the reverse circular permutation on the encrypted message and find the correct code words assignment, which will reveal the original message.
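The encryption and decryption steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes that the key seeds Python's `random.Random` generator for the shuffle (the paper does not specify the shuffling procedure, and this generator is not cryptographically secure), that the rotation amount is the key modulo 256, and that a small 8-word code stands in for a full 256-word code. All function and variable names are hypothetical.

```python
import random


def assign_code_words(code, alphabet, key):
    """Step 2: key-driven shuffle of code words, one per alphabet character."""
    if len(code) < len(alphabet):
        raise ValueError("the code needs at least one word per character")
    shuffled = list(code)
    random.Random(key).shuffle(shuffled)   # assumed PRNG, illustration only
    return dict(zip(alphabet, shuffled))


def rotate_right(text, positions):
    """Step 4: circular rotation of `text` to the right."""
    if not text:
        return text
    r = positions % len(text)
    return text[-r:] + text[:-r] if r else text


def encrypt(message, code, alphabet, key):
    table = assign_code_words(code, alphabet, key)
    encoded = "".join(table[ch] for ch in message)    # Step 3
    return rotate_right(encoded, key % 256)


def decrypt(ciphertext, code, alphabet, key):
    if not ciphertext:
        return ""
    table = assign_code_words(code, alphabet, key)
    inverse = {word: ch for ch, word in table.items()}
    n = len(code[0])
    r = (key % 256) % len(ciphertext)
    encoded = ciphertext[r:] + ciphertext[:r]         # undo the rotation
    return "".join(inverse[encoded[i:i + n]] for i in range(0, len(encoded), n))


# Toy setting: 8 characters and 8 code words of length 8.
ALPHABET = " ACDKNTW"
CODE = ["ACTACACT", "GTTGTATT", "ATGGAGTT", "CTGGTAGT",
        "AAAAAAGA", "CCCTTCGA", "TCGTGTTA", "GGAAAGGT"]

ciphertext = encrypt("ATTACK AT DAWN", CODE, ALPHABET, key=5)
recovered = decrypt(ciphertext, CODE, ALPHABET, key=5)
print(ciphertext)
print(recovered)
```

Both parties must agree on the code, the ordering of the alphabet, and the key; the sketch omits error correction, which would replace the exact dictionary lookup in `decrypt` with a nearest-code-word search.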

The flowcharts for message encryption and decryption with HyDEn are summarized in Figure 1.

3. Example of Message Encryption and Decryption Using HyDEn

To better understand how the HyDEn approach works, let us assume that Alice would like to transmit the message “ATTACK AT DAWN” to Bob. They have agreed beforehand to use the secret key “5”. The message uses only 8 distinct ASCII characters, namely, “space,” “A,” “C,” “D,” “K,” “N,” “T,” and “W.” Based on the unique key used by Alice and Bob, and applying Steps 1 and 2 of our approach, a unique assignment of DNA code words of length 8 is associated to each of the 8 characters, as shown in Table 3.

Using this assignment, the encrypted message resulting after Step 3 is the following:

ACTACACTGTTGTATTGTTGTATTACTACACT

ATGGAGTTCTGGTAGTAAAAAAGAACTACACT

GTTGTATTAAAAAAGACCCTTCGAACTACACT

TCGTGTTAGGAAAGGT

To better visualize the encryption process, every second code word was bold faced. The encrypted message is then permuted cyclically five positions to the right, thus obtaining the following sequence of DNA bases:

AAGGTACTACACTGTTGTATTGTTGTATT

ACTACACTATGGAGTTCTGGTAGTAAAAAAGA

ACTACACTGTTGTATTAAAAAAGACCCTTCGA

ACTACACTTCGTGTTAGGA

Ideally, the key (mod 256) must not be a multiple of the code word length ($n$); otherwise, the permutation will shift the encrypted message by a whole number of letters to the right (or to the left) and will not have the desired effect.
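The example can be checked mechanically. The sketch below uses only the sequences printed above: it rotates the Step 3 output five positions to the right, compares the result with the permuted sequence, and then decodes the message. The character-to-code-word table is inferred here by aligning the plaintext with the Step 3 output (it stands in for Table 3), and the helper names are hypothetical.

```python
step3 = ("ACTACACTGTTGTATTGTTGTATTACTACACT"
         "ATGGAGTTCTGGTAGTAAAAAAGAACTACACT"
         "GTTGTATTAAAAAAGACCCTTCGAACTACACT"
         "TCGTGTTAGGAAAGGT")
shown = ("AAGGTACTACACTGTTGTATTGTTGTATT"
         "ACTACACTATGGAGTTCTGGTAGTAAAAAAGA"
         "ACTACACTGTTGTATTAAAAAAGACCCTTCGA"
         "ACTACACTTCGTGTTAGGA")
message = "ATTACK AT DAWN"
n, key = 8, 5


def rotate_right(text, positions):
    r = positions % len(text)
    return text[-r:] + text[:-r] if r else text


def chunks(text, size):
    return [text[i:i + size] for i in range(0, len(text), size)]


# Step 4 applied to the Step 3 output.
rotated = rotate_right(step3, key % 256)

# Infer the assignment by aligning plaintext characters with code words.
table = dict(zip(message, chunks(step3, n)))
inverse = {word: ch for ch, word in table.items()}

# Decryption: undo the rotation, then look up each code word.
r = (key % 256) % len(shown)
unrotated = shown[r:] + shown[:r]
decoded = "".join(inverse[word] for word in chunks(unrotated, n))

# A rotation by a multiple of n keeps every code word boundary intact.
boundaries_preserved = all(w in inverse for w in chunks(rotate_right(step3, 8), n))
print(rotated == shown, decoded, boundaries_preserved)
```

The last check illustrates the remark about key choice: a shift of 8 positions merely reorders whole code words, so the code word boundaries remain visible, whereas the shift of 5 positions breaks them.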

4. Comparison Parameters

To facilitate the comparison between our approach and related encryption methodologies, we use a combination of performance parameters including the ones introduced by Shiu et al. [32], namely, capacity, payload, and $bpn$, together with the cracking probability, that is, the probability of a successful brute-force attack.

The capacity is defined as the total length of a reference sequence that encodes or includes the encrypted message. The payload is the remaining length of the new sequence after subtracting the reference DNA sequence. The $bpn$ represents the number of hidden bits per character. The previous parameters utilize the following notations: $|S|$ is the length of a DNA sequence $S$, $M$ is the message that will be encrypted, and $|M|$ is its length.

5. Results and Discussion

We analyze the robustness of HyDEn by estimating the probability of success for a brute-force attack, and we provide a comparative assessment between our cryptosystem and other cryptographic techniques with performance characteristics described in the literature. The comparison relies on a set of parameters introduced in Section 4. We further investigate HyDEn’s strengths and weaknesses, and we provide insights into potential improvements that will augment its performance.

5.1. Robustness

Calculations of the strength of encryption against brute-force attacks typically consider the worst-case scenario. The probability of success for a brute-force attack against the proposed cryptosystem (HyDEn) is given by (2), where $n$ is the length of a DNA code word and $|\Sigma|$ is the number of characters in alphabet $\Sigma$.

Assuming that $\Sigma$ is the extended ASCII character set, then $|\Sigma| = 256$ and (2) becomes

Using the Stirling approximation [36] for factorials, , for all , and DNA code word length , we obtain

The first term in (2) comes from the fact that $n$ is unknown to the attacker; thus, a successful attacker must first guess the length of the used code words, which would be 8 in the sample DNA code from Table 2. The second term of the equation describes the probability of finding the correct code assignment for the extended ASCII character set. We also assume that the attacker already knows what character set is encoded by the DNA code. The last term of the equation is given by the probability of finding the correct cyclic permutation applied to the encrypted message. Without knowing the correct permutation, the attempt of identifying the correct code word assignment is prone to failure.
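To give a sense of scale for the code assignment term alone, the sketch below estimates the number of possible assignments. It assumes that this term corresponds to one ordering among all $256!$ orderings of 256 code words over the extended ASCII set; the remaining terms of (2) are not reproduced here. The Stirling lower bound $m! \geq \sqrt{2\pi m}\,(m/e)^m$ is evaluated in logarithmic form to avoid overflow, and the names used are hypothetical.

```python
import math

ALPHABET_SIZE = 256   # extended ASCII


def log10_factorial(m: int) -> float:
    """log10(m!) computed through the log-gamma function."""
    return math.lgamma(m + 1) / math.log(10)


def log10_stirling_lower(m: int) -> float:
    """log10 of the Stirling lower bound sqrt(2*pi*m) * (m/e)**m."""
    natural_log = 0.5 * math.log(2 * math.pi * m) + m * (math.log(m) - 1)
    return natural_log / math.log(10)


log10_assignments = log10_factorial(ALPHABET_SIZE)
log10_bound = log10_stirling_lower(ALPHABET_SIZE)

# Probability of guessing one specific assignment, as a power of ten.
print(f"assignments ~ 10^{log10_assignments:.2f}")
print(f"Stirling lower bound ~ 10^{log10_bound:.2f}")
```

Under this assumption the number of assignments exceeds $10^{506}$, so the chance of guessing the assignment in a single attempt is below $10^{-506}$ before the code word length and the rotation are even considered.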

5.2. Comparison with Other DNA Cryptographic Strategies

Using the parameter estimations described in Section 4, we compare HyDEn with other encryption approaches described in Shiu et al. [32].

Table 4 presents comparative results between HyDEn and other cryptographic methods. The methods are compared based on their capacity, payload, the number of hidden bits per character, and the probability of success for a brute-force attack.

Based on the probability of success for a brute-force attack, HyDEn and the insertion method are the most secure, while the substitution method seems to be the least secure. Nevertheless, the best capacity, payload, and number of hidden bits per character correspond to HyDEn and the substitution method, while the insertion method ranks second and the complementary pair method third.

The result expressed in (4) can be also directly compared with the result reported by Torkaman et al. [33] on page 233 in their paper. Their result states that the probability of recovering via a brute-force technique an original message hidden within a sequence database with other 163 million sequences is equal to . Using simple numerical inequality manipulations, we show that our technique confers higher protection against brute-force attacks compared with the method proposed by Torkaman et al.: Thus, (substitution: Torkaman et al. [33]).

5.3. HyDEn’s Strengths, Weaknesses, and Potential Extensions

Compared with the existing DNA-based cryptographic and steganographic methods, HyDEn has one of the lowest probabilities of success for brute-force attacks. HyDEn includes mechanisms such as cyclic permutations and randomized assignments of code words to protect against various types of frequency analysis, such as the Kasiski and Friedman tests, along with error detection and correction capabilities conferred by DNA Hamming codes. One of the drawbacks of encoding each character with many nucleotides is the increase in size of the encrypted message, which could become a burden for the communication media and which also poses a challenge for strategies that hide large messages. The steganographic approach, including message distribution and the selection of inconspicuous dissemination venues, must be carefully analyzed. For example, large encrypted messages encoded as long in silico DNA sequences can be better hidden in databases for DNA coding sequences, DNA contigs, or mRNA sequences, while relatively short messages would be better hidden as DNA and RNA primer sequences or as microarray probes.

One potential weakness of the current approach could stem from peculiarities of the language in which the original message was written, assuming that the attacker has already guessed it. For example, if English is the language, then an analysis based on occurrences of double letters, such as the double Ls found in a fairly limited number of words, could be used to find partial (code word, character) associations. A potential extension inspired by the Bellaso ciphers [37], which were later wrongfully attributed to Vigenère [38], that would add confusion and increase the security of HyDEn is to encode each character with multiple code words selected uniformly at random, without breaking the error detection and correction capabilities of the DNA code. Table 5 presents a code with 1024 DNA sequences of length 8 and minimum pairwise Hamming distance 3, which could be used as a replacement for the code from Table 2. Each extended ASCII character could be encoded using one out of 4 different code words, each selected with equal probability. Lower (2048) and upper (2340) bounds published by Bogdanova et al. [39] and hosted on Dr. Andries Brouwer’s website [40] suggest that even larger DNA codes can be generated.
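The proposed extension can be sketched as follows. This is an illustrative sketch under several assumptions: the key seeds Python's `random.Random` for the shuffle, a second generator picks among the code words of each character, and the toy code consists of all 64 DNA words of length 3, which has no error-correcting capability and is used only to keep the example short. A real deployment would use a code such as the one in Table 5 and a cryptographically secure source of randomness. All names are hypothetical.

```python
import random
from itertools import product


def assign_multiple(code, alphabet, key, per_char=4):
    """Give every character `per_char` distinct code words, chosen by the key."""
    if len(code) < per_char * len(alphabet):
        raise ValueError("not enough code words for this alphabet")
    shuffled = list(code)
    random.Random(key).shuffle(shuffled)   # assumed PRNG, illustration only
    return {ch: shuffled[i * per_char:(i + 1) * per_char]
            for i, ch in enumerate(alphabet)}


def encode(message, table, rng):
    """Pick one of the character's code words uniformly at random."""
    return "".join(rng.choice(table[ch]) for ch in message)


def decode(encoded, table, n):
    """Map every code word back to the single character that owns it."""
    inverse = {word: ch for ch, words in table.items() for word in words}
    return "".join(inverse[encoded[i:i + n]] for i in range(0, len(encoded), n))


# Toy code: all 64 words of length 3 over the DNA alphabet (no error correction).
toy_code = ["".join(p) for p in product("ACGT", repeat=3)]
alphabet = " ACDKNTW"

table = assign_multiple(toy_code, alphabet, key=5)
encoded = encode("ATTACK AT DAWN", table, random.Random())
decoded = decode(encoded, table, n=3)
print(encoded, decoded)
```

Because repeated characters, such as the double T in ATTACK, may now map to different code words, simple double-letter analysis yields less information, while decoding stays unambiguous as long as no code word is shared between characters.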

6. Conclusion

Here, we have presented a novel stegano-cryptographic approach called HyDEn (hybrid DNA encryption), which uses custom-built error-correcting DNA Hamming codes, a randomized code assignment procedure and cyclic permutations based on a private key. HyDEn represents a symmetric cipher that is capable of encrypting and disguising information as long DNA sequences in public bioinformatics discussion groups and DNA sequence databases. Our cryptosystem has significant error tolerance and adds another dimension to the information security field. We are currently working on experimentally evaluating and further improving HyDEn’s capabilities following the ideas described in Section 5.3.

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments that led to the improvement of this paper. They are grateful to their colleagues from the Knowledge Discovery and the Learning and Collaborative Technologies Groups for helping in reviewing and improving this paper. Funding for this research was provided by the National Research Council Canada.