Reversible Data Hiding for DNA Sequence Using Multilevel Histogram Shifting
A large number of studies have examined DNA storage to achieve information hiding in DNA sequences with DNA computing technology. However, most data hiding methods are irreversible in that the original DNA sequence cannot be recovered from the watermarked DNA sequence. This study presents reversible data hiding methods based on multilevel histogram shifting to prevent biological mutations, preserve sequence length, increase watermark capacity, and facilitate blind detection/recovery. The main features of our method are as follows. First, we encode a sequence of nucleotide bases with four-character symbols into integer values using the numeric order. Second, we embed multiple bits in each integer value by multilevel histogram shifting of noncircular type (NHS) and circular type (CHS). Third, we prevent the generation of false start/stop codons by verifying whether a start/stop codon is included in an integer value or between adjacent integer values. The results of our experiments confirmed that the NHS- and CHS-based methods have higher watermark capacities than conventional methods in terms of supplementary data used for decoding. Moreover, unlike conventional methods, our methods do not generate false start/stop codons.
DNA has the potential for high-capacity and long-term storage, which is of considerable interest to researchers in a wide range of applications related to biology and information technology [1–3]. Rapid progress in synthesizing defined strings of DNA and sequencing the data saved in DNA has enhanced the feasibility of DNA storage. In addition to DNA storage [1–3], DNA steganography for secret communication or encryption using noncoding DNA sequences [4–6] has been widely studied for decades. Recent years have witnessed the use of coding DNA sequences to investigate DNA watermarking for tracking parent genes in offspring or genetically modified organisms and for protecting DNA copyright [7–14]. In silico and in vitro/in vivo tracking of offspring by watermarking has been easily realized in bacteria and other genetically modified genomes. However, watermarking of sexually reproducing organisms remains an open issue because recombination events can destroy the watermark. Heider et al.  identified a coupled Y-chromosomal/mitochondrial DNA watermarking procedure as the most appropriate DNA watermarking procedure for diploid organisms by using population predictions and statistical analyses, but there remains a lack of experimental validation in this regard. Balado [15, 16] modeled the Shannon capacity of DNA data embedding under mutations for irreversible DNA watermarking.
A common consideration in DNA storage, steganography, and watermarking is how to embed the external data in a DNA sequence to preserve the biological function for various purposes. Furthermore, reversible data hiding for DNA storage, steganography, or watermarking is necessary to recover an original DNA sequence without loss of information. However, most methods [1–14] are irreversible. In addition to facilitating original DNA recovery, reversible DNA data hiding can prevent DNA forgery and mutations from external data, and the mutation process can be analyzed using the iterative process of embedding, detecting, and recovering the information. Reversible image data hiding (or watermarking) has been investigated in many studies using difference expansion , prediction error expansion [22–24], histogram shifting [25, 26], lossless compression , and quantization index modulation , among other methods. In addition, the performance analysis of relevant methods has been reported , and it has been shown that expansion-based methods are more effective than other methods in terms of capacity, imperceptibility, and computation. However, reversible DNA data hiding has not been investigated as extensively as reversible image data hiding or irreversible DNA data hiding for DNA storage, steganography, and watermarking because of the small quantities of four-character-symbol nucleotide bases and their limited reversibility.
With regard to reversible DNA data hiding, Chen  encoded a sequence of nucleotide bases in noncoding DNA into decimal values and embedded a watermark in coded values using a lossless compression and difference expansion (DE) algorithm described by Tian . Further, Huang et al.  used a histogram with low modification rates. Lee and Kwon presented consecutive DE multiple bit embedding (CDE-MBE)  and least-squares-based prediction error expansion (LS-PE)  of neighbor code values of noncoding DNA sequences while preventing a false start codon. They reported that LS-PE has 0.36 bits per nucleotide base (bpn) more than CDE-MBE . Other methods [30–33] use substitution by the complementary rule of base pairs with a reference DNA sequence. Although these methods cannot change the length of a DNA sequence, they can introduce false start codons [19, 20, 30–33], require a reference sequence to detect and recover nonblind sequences, [31–33], or have extremely low watermark capacity .
In this study, we examine reversible DNA data hiding methods using histogram shifting (HS). We aim not only to prevent false start/stop codons but also to achieve blind detection/recovery and high watermark capacity. First, we encode nucleotide bases, each one of four character symbols, into $2n$-bit integer values, where each value represents a unit of $n$ nucleotide bases. This makes the nucleotide bases easy to handle numerically. Next, we embed multiple bits in each integer value using multilevel histogram shifting of noncircular type (NHS) and circular type (CHS). Unlike an image, a sequence of integer values of a DNA sequence does not have a regular histogram distribution; hence, multilevel histogram shifting is possible by numeric coding. These methods provide a high watermark capacity and facilitate blind detection and recovery. Finally, we verify whether a false start/stop codon is generated within an integer value or between consecutive integer values.
Through experiments, we evaluate the capacity efficiency of watermark bpn versus extra data bpn and the occurrence of false start/stop codons for our NHS- and CHS-based methods as well as for the methods described by Chen, Huang et al., and Lee et al. (LS-PE). Extra data capacity is required for detecting the watermark and recovering the original sequence. The capacity efficiency is the number of watermark bits that can be embedded per bit of extra data. The experimental results show that the CHS-based method, NHS-based method, LS-PE-based method, Chen’s method, and Huang’s method have watermark capacities and capacity efficiencies of 0.584 bpn and 1.818, 0.409 bpn and 1.239, 0.419 bpn and 0.234, 0.108 bpn and 0.495, and 0.027 bpn and 0.131, respectively. In addition, we find that our methods do not introduce false start/stop codons, whereas Chen’s and Huang’s methods introduce false start/stop codons every 104 and $5.78 \times 10^5$ nucleotide bases, respectively.
The remainder of this paper is organized as follows. Section 2 discusses the requirements of reversible DNA data hiding and analyzes the advantages and disadvantages of conventional methods. Section 3 explains the numeric coding of nucleotide bases, the prevention of false start/stop codons, and the NHS and CHS methods in detail. Section 4 presents and analyzes the experimental results of the proposed methods and compares them with those of conventional methods. Finally, Section 5 concludes the paper.
2. Related Works
2.1. Requirements of Reversible Data Hiding in DNA Sequence
Recently, it has been shown that the genome of a genetically tractable organism can be used as a medium for data hiding depending on the required application [1–14]. The coding DNA of a gene is transcribed by codons of three nucleotide bases that specify amino acids to encode proteins. The watermark should be embedded into coding DNA by considering codon degeneracy, to preserve the protein sequence, and codon optimization, including codon usage and GC content [10, 11]. Codon degeneracy and optimization make it difficult to embed a large number of bits into coding DNA. Noncoding DNA was initially thought to lack biological function and was referred to as “junk DNA.” However, it is now clear that some noncoding DNA act as genetic switches that regulate gene expression and determine the levels or location of expression of various genes via transcription factor binding. Data hiding in noncoding DNA may damage unknown genes or gene regulatory networks. Heider et al.  experimentally showed that an integrated watermark deactivates the lac promoter and the RNA molecules display altered configuration after watermark introduction. Thus, they did not recommend integrating a watermark sequence into a noncoding regulatory sequence. Therefore, noncoding DNA in nonliving organisms or primitive organisms, such as bacteria, is suitable for DNA storage [1–3], DNA steganography [4–6], reversible DNA data hiding [17–20, 30–32], or fragile DNA watermarking requiring high data capacity.
There are several considerations for reversible data hiding in noncoding DNA.
(1) Dynamic Range. Each nucleotide base is one of four character symbols (A, T, C, and G in DNA, with U replacing T in RNA) and has a 2-bit representation. Compared to 8-bit or 10-bit image data, the 2-bit range of a single nucleotide base is too small for high-capacity watermarking. A combination of nucleotide bases should therefore be used to increase the dynamic range for more effective processing. For example, a series of four nucleotide bases can be coded as an 8-bit value ($4^4 = 256$ levels).
(2) False Start/Stop Codon. The watermark can change any nucleotide base in noncoding DNA to a start codon (“ATG” (Methionine)) or three stop codons (“TAG” (Amber), “TAA” (Ochre), “TGA” (Opal)), indicating the start/stop of the coding DNA region . Here, we refer to this as a “false start/stop codon.” Because these false codons have fatal effects on biological function, they should be prevented during the embedding process.
(3) Blindness without Appending. Blind detection and recovery, which detects the watermark and recovers the original DNA sequence without using the original DNA sequence or a reference sequence, should be practicable while preserving the length of the DNA sequence.
Figure 1 shows the general process of reversible data hiding in a noncoding DNA sequence. First, nucleotide bases in noncoding regions are coded into numerical values. The noncoding regions can be selected for embedding of the target depending on their length. The watermark coded by error correction or encryption is embedded into any or all numerical values using any reversible method while considering blind detection/recovery, biological function preservation, and capacity. The watermarked numerical values are reverse-coded into nucleotide bases in order to prepare watermarked noncoding regions. Here, two secret keys can be used for the watermark message or for embedding.
2.2. Conventional Methods
DNA steganography [4–6] and DNA watermarking [7–14] for DNA data hiding have been investigated in numerous studies. However, not all of these methods are reversible. We now consider some reversible data hiding methods.
Chen  adopted lossless compression and difference expansion (DE), which has been widely used for reversible image watermarking . They coded a sequence of 2-bit binary “ATCG” values into decimal values as a unit of bits, classified pairs of decimal values into expandable set S1 and changeable set S2, appended the compressed location map, original LSBs of pairs in S2 (LSB(S2)), and secret binary message in the last compressed values, and embedded them into differences of pairs in S1 using difference expansion as described by Tian . They experimentally demonstrated a payload capacity of 0.09–0.13 bpn with , which is extremely low. This method does not embed sufficient payload and does not consider preventing false start/stop codons.
Huang et al.  presented histogram-based reversible data hiding for low modification of nucleotide bases. This method generates a histogram of decimal values that are coded by every bits of consecutive nucleotide bases. Note that is the most frequent value, is the least frequent value, and is the second-least frequent value in the histogram; is a decimal value; and is a watermark bit. If , then set to and the location map to 1. If , then set the location map to 0 without changing . If and , then set to . Otherwise, if and , remains unchanged. Detection and recovery are performed using , , , and the location map. Huang et al. experimentally showed that the watermark capacity and the modification rate were 0.024 bpn and 4.07%–4.80% with and 0.011 bpn and 1.86%–2.34% with . Although this method achieves a low modification rate, the watermark capacity is extremely low and false start/stop codons are produced, as in the case of Chen’s method.
Lee and Kwon applied CDE-MBE  and LS-PE  to DNA code values, which allow for maximum permissible expansion within the range where no false start/stop codons are generated. CDE-MBE embeds multiple watermark bits in the maximum allowable difference expansion of the previous embedded code value and the current code value, while LS-PE embeds multiple watermark bits with the maximum allowable prediction error expansion of two code values. Both methods substitute extra information for detection and recovery with the LSB of the watermarked DNA code values. It was reported that the watermark capacities of LS-PE and CDE-MBE are 0.419 bpn and 0.235 bpn, respectively, on average . Thus, although LS-PE has a higher watermark capacity, it has low capacity efficiency. Its ratio of watermark data to extra data is 23.4%; hence, the extra data required by this method is 4.3 times greater than the watermark data.
Liu et al.  presented the piecewise linear chaotic map- (PWLCM-) based reversible data hiding method for DNA sequences. Fu et al.  and Ma et al.  presented reversible data hiding methods for tamper location and tamper restoration of DNA sequences via substitution by the complementary rule. However, these methods are nonblind in that the detection and recovery processes require the original DNA sequence or a reference sequence.
Most conventional methods do not prevent the introduction of false start/stop codons and have low capacity efficiency. In this paper, we describe reversible data hiding methods that produce no false start/stop codons while achieving blind detection and recovery as well as high watermark capacity.
3. Proposed Reversible Data Hiding on DNA Sequence
This section introduces reversible data hiding methods using multilevel histogram shifting with the following features. (1) Blindness: the watermark is embedded without changing the sequence length or amino acids. Further, the watermark is detected and the original DNA sequence is recovered without the need for a reference DNA sequence. (2) Numerical facility: the watermarking process is facilitated by encoding every $n$ consecutive nucleotide bases into a $2n$-bit integer value. (3) High capacity: multiple bits are embedded in each integer value using multilevel histogram shifting. (4) No false start/stop codon: false start/stop codons are prevented via false codon searching both within an integer value and between consecutive integer values.
3.1. Preprocessing: Numerical Coding and False Codon Prevention
Nucleotide bases of four-character symbols should be encoded into integer values to facilitate watermarking.
Numerical Coding of Nucleotide Bases. In general, a nucleotide base, one of four character symbols, is represented by a 2-bit value, as shown in Figure 2(a). To extend the numerical value range of nucleotide bases, we encode each segment of $n$ nucleotide bases into a single $2n$-bit integer value by concatenating the 2-bit values of its bases. Conversely, the $n$ nucleotide bases of a segment are obtained from its $2n$-bit value by applying the right bit-shifting operator ($\gg$) in 2-bit steps. Here, $n$ is the numeric order, that is, the number of nucleotide bases in a segment.
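The numeric coding can be sketched as follows. This is a minimal illustration rather than the exact coding of Figure 2(a): the 2-bit assignment (A = 0, T = 1, C = 2, G = 3), the choice of placing the first base in the most significant bits, and the function names are all assumptions made for the example.

```python
# Assumed 2-bit codes; the assignment in Figure 2(a) may differ.
BASE_TO_BITS = {"A": 0, "T": 1, "C": 2, "G": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_segment(bases):
    """Pack n bases into one 2n-bit integer (first base in the high bits)."""
    value = 0
    for base in bases:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_segment(value, n):
    """Recover n bases by right-shifting and masking two bits at a time."""
    return "".join(
        BITS_TO_BASE[(value >> (2 * (n - 1 - i))) & 0b11] for i in range(n)
    )


def encode_sequence(sequence, n):
    """Encode a sequence segment by segment; a trailing partial segment is skipped."""
    usable = len(sequence) - len(sequence) % n
    return [encode_segment(sequence[i:i + n]) for i in range(0, usable, n)]
```

With numeric order 4, each segment maps to a value between 0 and 255, and decoding with the same order returns the original four bases.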
False Start/Stop Codon Prevention. Any false start/stop codon can be generated in a segment (intrasegment) or between segments (intersegment) by the watermark. Hereafter, for brevity, we refer to a false start/stop codon as a false codon. We prevent false codons in noncoding regions using false codon searching.
(1) Intrasegment. Given a segment with $n$ nucleotide bases, a false codon can occur at $n-2$ positions in the segment, as shown in Figure 2(b). A false codon at a given position is encoded into $4^{n-3}$ values, because the remaining $n-3$ bases are free. Thus, one false start codon and three false stop codons together account for at most $4(n-2)\,4^{n-3}$ values. We generate the false codon table of all values that include the false start codon or one of the three false stop codons; then, in the embedding process, we determine whether the watermarked value is included in the false codon table. Table 1 shows an example of the false codon table when the numeric order is 4.
(2) Intersegment. Given two adjacent watermarked segments, a false codon may straddle their boundary. For example, the last nucleotide base or last two nucleotide bases of the first segment together with the first two nucleotide bases or first nucleotide base of the second segment can form (A, TG) or (AT, G) of the false start codon ATG, as shown in Figure 2(c). Adjacent watermarked segments that include a false codon can be identified by concatenating the $(n-1)$th and $n$th nucleotide bases of the first segment with the first and second nucleotide bases of the second segment and searching for the resulting value in the false codon table.
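The two checks above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the table is kept as base strings for readability, whereas the method described here stores the corresponding integer values.

```python
from itertools import product

# One start codon and three stop codons, as listed in Section 2.1.
FALSE_CODONS = ("ATG", "TAG", "TAA", "TGA")


def contains_false_codon(bases):
    """True if any start/stop codon appears anywhere in the base string."""
    return any(codon in bases for codon in FALSE_CODONS)


def false_codon_table(n):
    """All n-base segments that contain a start/stop codon (intrasegment check)."""
    segments = ("".join(p) for p in product("ATCG", repeat=n))
    return {segment for segment in segments if contains_false_codon(segment)}


def has_intersegment_codon(first_segment, second_segment):
    """Check the junction: last two bases of one segment plus first two of the next.

    Every codon found in this four-base window necessarily straddles the boundary.
    Assumes segments of at least two bases.
    """
    junction = first_segment[-2:] + second_segment[:2]
    return contains_false_codon(junction)
```

During embedding, a candidate watermarked segment would be rejected if it is in the table or if its junction with the previously watermarked segment contains a codon.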
3.2. Noncircular Histogram Shifting (NHS) Based Reversible Data Hiding
The $2n$-bit values of all segments (where $n$ is the numeric order) are shifted to other values, excluding values in the false codon table. We first consider multilevel histogram shifting of the noncircular type.
Embedding Process. Let be the maximum number of shifting levels for a value. We divide the range of 2-bit values into a number of regions having values such that a region has bilateral symmetry with subregion of left values and subregion of right values about a center value . The center value is used as the reference value for multilevel shifting. Here, a residual region of values that are not included in the regions exists. This region is not selected for embedding.
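The region layout can be sketched as follows. The region size is an assumption: it is inferred from the Figure 3 description, where a 3-bit maximum corresponds to differences of 1 to 7 on each side of the center, which suggests $2^{b}-1$ values per subregion and $2^{b+1}-1$ values per region for a maximum of $b$ bits. The symbols and function names are hypothetical, and the residual region is assumed to sit at the top of the value range.

```python
def partition(n, b_max):
    """Split the 4**n value range into symmetric regions around center values.

    Returns the list of center values, the subregion width on each side of a
    center, and the residual range that is left out of embedding.
    """
    total = 4 ** n
    half = 2 ** b_max - 1           # assumed width of each left/right subregion
    size = 2 * half + 1             # left subregion + center + right subregion
    count = total // size
    centers = [k * size + half for k in range(count)]
    residual = range(count * size, total)
    return centers, half, residual


def locate(value, n, b_max):
    """Return (region index, signed difference from the center), or None if residual."""
    centers, half, residual = partition(n, b_max)
    if value in residual:
        return None
    k = value // (2 * half + 1)
    return k, value - centers[k]
```

Under these assumptions, numeric order 4 with a 3-bit maximum gives 17 regions of 15 values each, and the single value 255 forms the residual region.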
Given a value for a segment , a previously watermarked segment , and a watermark , the maximum number of embeddable bits in a value is . First, we find the center value of a region to which belongs, and we determine the number of embeddable bits on the basis of the difference : If is a center value , is 0, that is, no bits are selected for embedding. Otherwise, we shift by up to bits of watermark while checking whether does not include false codons:If a shifted value is in the false codon table, we decrease by 1 and shift by up to bits. We obtain a sequence of all watermarked segments by repeating this process until for all segments.
Our approach requires an extra dataset for detection and recovery, including a set of the numbers of embedded bits in segments, a set of the region markers of shifted center values, and a set of binary LSBs of nucleotide bases in . The extra dataset is compressed losslessly and the compressed extra dataset is substituted into binary LSBs of nucleotide bases in . A DNA sequence with watermarked segments including is transmitted or stored.
Histogram Shifting. The region of a histogram is divided into a left subregion and a right subregion by a center value . Figure 3 shows the multilevel shifting of values by the difference with and watermark bits when the maximum number of embeddable bits is 3. The values with equal to 1 can be shifted by up to 3 bits (), while those with between 4 and 7 can be shifted by up to 2 bits () and those with between 2 and 3 can be shifted by up to 1 bit (). The value of () is fixed.
The values in the right subregion are shifted to the next left subregion . By contrast, the values in the left subregion are shifted to the previous right subregion , as shown in Figure 4(a).
The shifted center value can be observed in three cases. The first case is that a value is the same as , which is fixed. The other two cases include values in or shifted to . The number of embedded bits indicates whether the original value is or whether it is shifted to from the left or right regions. Therefore, the region mark of the shifted center values indicates where it is shifted from or for the detection and recovery process.
Figure 4(b) shows that subregions from to among a total of regions are shifted toward each other except for the two boundary subregions and , which are nonembedding regions.
Detection and Recovery Process. We extract the compressed extra dataset from binary LSBs of nucleotide bases in transmitted segments including and decompress to obtain the extra dataset . Here, the watermarked segments can be easily obtained by substituting and . Next, we detect the watermark and recover the original values of segments by a set of the numbers of embedded bits and a set of the region markers of shifted center values.
Given a value of a watermarked segment and the number of embedded bits of , we first obtain the center value of the region to which the original value belongs. This can be determined by the region of and the region marker in both cases:When is not the center value of , the region of can be easily determined as the left subregion or right subregion according to or , respectively. However, when and , the region of can be known as or by the region marker . Based on the center value , -bit watermark is detected and the original value is recovered as follows:
Watermark Capacity and Extra Data Capacity. Balado [15, 16] modeled the Shannon watermark capacity for noncoding and coding regions under mutations. Considering the capacity of a noncoding region, 2 bpn can always be embedded into a 4-ary alphabet of a nucleotide base without any mutations. In the case of substitution mutation, the capacity of a 4-ary symmetric channel, which is achieved for a uniform input, is bpn, where is the substitution probability of the nucleotide base. For example, if 25% of nucleotide bases are substituted, the maximum capacity is approximately 1.636 bpn. Thus, Balado analyzed the capacity of irreversible watermarking without considering reversible watermarking. Reversible watermarking, which can detect the watermark while recovering the original sequence from the watermarked sequence, has two additional constraints of reversibility and extra data. Therefore, we analyze the capacity of our method under reversibility and extra data rather than following Balado’s capacity analysis.
We compute the watermark bpn and the extra dataset bpn to analyze the capacity performance. The watermark bpn is defined as the watermark bits embedded in one nucleotide base, denoted by , and the extra dataset bpn is defined as the additional data bits for a nucleotide base, denoted by . The extra dataset is required to detect the watermark and recover the original sequence. The watermark bpn should be high whereas the extra dataset bpn should be low.
Let and be the number of embedded bits in two subregions and . The total number of embedded bits is the sum of embedded bits in all subregions except the two boundary subregions and . Thus, the number of watermark bits per nucleotide base is given bywhere is the total number of segments and is the number of nucleotide bases in a segment. Thus, denotes the total number of nucleotide bases in all segments.
Let be the number of bits for storing uncompressed extra dataset required for detection and recovery. Since the maximum number of embeddable bits in a value is , it is represented by bits. Therefore, a set of the numbers of embedded bits for all values is represented by bits. A region mark for the shifted center value can be stored by one bit that indicates whether it is shifted from the left region or the right region plus the position information on a sequence of segments, which can be represented by bits. Let be the total number of shifted center values. A set of region marks for shifted center values is represented by bits. The size of a set of LSBs of binary nucleotide bases in is equal to the total number of nucleotide bases. Thus, is represented by bits. In summary, the uncompressed extra dataset is represented as follows:Let be the lossless compression ratio. The number of bits of the compressed extra dataset per nucleotide base, , can be computed as follows:
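The bookkeeping described above can be sketched as follows. Several details are assumptions rather than the paper's exact formulas: each per-segment bit count is stored in $\lceil \log_2(b+1) \rceil$ bits for a maximum of $b$ embeddable bits, each position is stored in $\lceil \log_2 N \rceil$ bits for $N$ segments, and the compression ratio is taken as compressed size divided by uncompressed size. All names are hypothetical.

```python
from math import ceil, log2


def watermark_bpn(embedded_bits, n):
    """Embedded watermark bits per nucleotide base over all segments."""
    return sum(embedded_bits) / (len(embedded_bits) * n)


def extra_bits_uncompressed(num_segments, n, b_max, num_shifted_centers):
    """Bits needed for the three parts of the extra dataset before compression."""
    # Number of embedded bits per segment (values 0..b_max).
    count_bits = num_segments * ceil(log2(b_max + 1))
    # One direction bit plus a segment position for each shifted center value.
    marker_bits = num_shifted_centers * (1 + ceil(log2(num_segments)))
    # One original LSB per nucleotide base.
    lsb_bits = num_segments * n
    return count_bits + marker_bits + lsb_bits


def extra_bpn(uncompressed_bits, compression_ratio, num_segments, n):
    """Compressed extra-data bits per nucleotide base."""
    return compression_ratio * uncompressed_bits / (num_segments * n)
```

For example, 1024 segments of order 4 with a 3-bit maximum and 10 shifted center values need 6254 uncompressed bits under these assumptions.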
3.3. Circular Histogram Shifting (CHS) Based Reversible Data Hiding
Unlike image quality, integer values of DNA segments can be shifted to any values. In addition, maximum and minimum values can be shifted toward each other only if the condition of false codon is satisfied. The CHS-based method makes the histogram domain circular so that the two boundary subregions and can be shifted toward each other.
Embedding Process. As in the previous subsection, we divide the range of 2-bit values into regions , and each range consists of values. Here, the residual region of is generated, as shown in Figure 5. Because the residual region exists between and , it is difficult to shift the values in the two subregions toward each other. Therefore, we exchange the right subregion of the last region and the residual region. Here, the detached region has two center values of subregions individually: for and for .
Given a value , we find the center value of the region to which belongs; then, we shift by up to bits of watermark , which is determined by the difference .Here, values in the exchanged residual region and center values in each region are excluded from the embedding. The region marker of the center values shifted from an adjacent region is set as follows:
We obtain watermarked segments by shifting as many watermark bits as possible into all values except for the residual region while preventing false codons. The extra dataset required for the detecting and recovering process is , which is the same as the extra dataset for the NHS-based method. We substitute the compressed extra dataset into LSBs of binary nucleotide bases in and then obtain a DNA sequence with watermarked segments including the compressed extra dataset.
Detection and Recovery Process. As with the NHS-based method, we obtain watermarked segments from a transmitted DNA sequence via LSB substitution of the compressed extra dataset and then detect the watermark and recover the original values from the watermarked values using the extra dataset.
Given a watermarked value in , we find the center value of the nonboundary region or boundary region to which belongs: Then, we detect bits of the watermark and recover the original value using the center value .
Watermark Capacity and Extra Data Capacity. The CHS-based method embeds watermark bits into all regions except for the residual region in the histogram domain. The total number of embedded bits is the sum of embedded bits in all subregions. Thus, the number of watermark bits per nucleotide base is defined as follows:
Let be the number of bits of uncompressed extra dataset for the CHS-based method. It can be represented in the same way as of the NHS-based method. Based on the lossless compression ratio , the number of bits of the compressed extra dataset per nucleotide base, , can be computed as follows:The extra datasets for the CHS-based and NHS-based methods are the same, but the watermark capacity of the CHS-based method is higher than that of the NHS-based method.
4. Experimental Results
Image quality in terms of PSNR versus bits per pixel is the main evaluation metric for reversible image watermarking. However, DNA reversible data hiding methods should be evaluated on the basis of the preservation of biological function versus bpn. Our methods do not introduce any false start/stop codons in noncoding regions while preserving the amino acid code. We evaluated the watermark bpn, the extra data bpn, the base change rate, and the occurrence of false codons using our NHS- and CHS-based methods as well as Chen’s method, Huang’s method, and LS-PE. Here, the base change rate is the rate at which nucleotide bases are changed by watermark bits. Given the original segments and the watermarked segments, the base change rate is the number of nucleotide base positions at which the two differ divided by the total number of nucleotide bases in all segments. Assuming that nucleotide bases changed by a random watermark are uniformly distributed, the base change rate is expected to be nearly 3/4 = 0.75.
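The base change rate can be computed directly from the two sequences. The following is a minimal sketch with a hypothetical function name; it assumes the original and watermarked sequences are given as equal-length base strings, which holds here because embedding preserves the sequence length.

```python
def base_change_rate(original, watermarked):
    """Fraction of base positions that differ between the two sequences."""
    if len(original) != len(watermarked):
        raise ValueError("embedding must preserve the sequence length")
    if not original:
        return 0.0
    changed = sum(a != b for a, b in zip(original, watermarked))
    return changed / len(original)
```

If every base were replaced by a uniformly random base, one position in four would stay unchanged by chance, which gives the 3/4 figure quoted above.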
The watermark bpn and extra data bpn of LS-PE, and , respectively, depend on the numeric order and the prediction order . By contrast, the watermark bpn and extra data bpn of the NHS- and CHS-based methods, , and , , respectively, depend on the numeric order and the maximum embeddable bit number . Therefore, we experimentally selected parameters with the most watermark bpns in each method and used them to compare the capacities of our methods with those of the conventional methods. Specifically, we used in Chen’s method and in Huang’s method.
We used test DNA sequences provided by NCBI GenBank that are the same experimental sequences of . Table 2 summarizes the type, accession number, total number of nucleotide bases, number of noncoding regions, and number of nucleotide bases in the noncoding regions of our test DNA sequences. The test DNA sequences varied in length. Noncoding regions with a small number of nucleotide bases were not used as embedding regions.
4.1. Parameter Setting
Here, we discuss how to determine the parameters of our methods for comparison with the conventional methods. Given the numeric order , the maximum embeddable bit number for each region is . Figure 6 shows the extra data bpn versus the watermark bpn and the base change rate versus the watermark bpn for the NHS- and CHS-based methods, where is varied from 2 to 2 in . The watermark bpn is the highest when and . In this case, is 0.175 bpn higher than ( = 0.566 bpn and = 0.391 bpn). For a given , increases with . As increases, and decrease. Based on these results, is the highest when and is the highest when .
With regard to the extra data bpn versus the watermark bpn, the two bpns decrease if the numeric order increases. For and , the CHS-based method requires extra data bpn = 0.303 bpn for = 0.566 bpn and the NHS-based method requires extra data bpn = 0.315 bpn for = 0.391 bpn. These extra data are approximately one-third of the LSB substitutable bits, for which we can substitute the extra data into the LSBs of binary nucleotide bases three times. With regard to the base change rate versus the watermark bpn, the former increases with the numeric order . The CHS-based method has for , with the highest , but it has and 0.765 for , and , , respectively, with low . Similarly, the NHS-based method has for , with the highest , but it has = 0.498 and 0.474 for = 3, = 3 and = 4, , respectively, with low .
According to the two parameters and , the watermark bpn of the CHS-based method is 0.004–0.175 higher than that of the NHS-based method under a similar quantity of extra data, while the base change rate of the former is 0.038–2.760 higher than that of the latter. On the basis of these results, we set the maximum embeddable bit number to 20 in the NHS- and CHS-based methods and then compared them with the LS-PE-based method  and other methods [19, 20] by varying the numeric order .
4.2. Comparison of Watermark Capacity, Extra Data Capacity, and Base Change Rate
We set the parameters of the LS-PE-based method (), NHS-based method (), and CHS-based method () for numeric order , as well as for Chen’s method () and Huang’s method (), such that the highest watermark bpn is achieved. Next, we compared the watermark bpn, extra data bpn, and base change rate of these methods. Figure 7 shows the extra data bpn and the base change rate versus the watermark bpn of each method.
With regard to the watermark bpn, the CHS-based method with () = showed the highest value at 0.566 bpn. The next highest values were 0.419 bpn for the LS-PE-based method with , 0.413 bpn for the LS-PE-based method  with , and 0.391 bpn for the NHS-based method with () = . For Chen’s  and Huang’s  methods, the values were 0.108 bpn and 0.027 bpn, respectively, which are extremely low compared to those of our methods. With regard to the extra data bpn versus the watermark bpn, the watermark bpn of the CHS-based was 0.175 bpn higher than that of the NHS-based method, while the two methods required a similar quantity of extra data. The extra data required by the LS-PE-based method was 1.480 bpn more than that required by the NHS- and CHS-based methods. Chen’s and Huang’s methods required approximately 0.30 bpn of extra data.
The results for the test DNA sequences are summarized in Table 2, while Table 3 summarizes the results of the NHS- and CHS-based methods with () = , Chen’s method with = 2, Huang’s method with = 2, and the LS-PE-based method with () = . For all the test sequences, the watermark bpn of the CHS-based method was 0.147 bpn higher than that of the LS-PE-based method, 0.175 bpn higher than that of the NHS-based method, and 0.458–0.539 bpn higher than those of Chen’s and Huang’s methods. The extra data bpns of Chen’s and Huang’s methods were approximately 0.218 bpn and 0.203 bpn, respectively. Those of the NHS- and CHS-based methods were approximately 0.315 bpn and 0.303 bpn, respectively, which are slightly higher than those of Chen’s and Huang’s methods. Further, those of the LS-PE-based method with () = and () = were 1.114 bpn and 1.788 bpn, respectively, which are much higher than those of Chen’s and Huang’s methods. The two conventional methods require less extra data owing to their low watermark bpns. By contrast, with regard to the capacity efficiency, which is the ratio of the watermark bpn to the extra data bpn, the CHS-based method showed the highest value, 1.865. This means that 1 bit of extra data is required for embedding 1.865 watermark bits. Thus, the CHS-based method requires the least extra data per embedded watermark bit.
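The capacity efficiency defined above can be recomputed from the rounded bpn values quoted in this section. The following is a minimal sketch, not the author's code: the function name `capacity_efficiency` and the dictionary `reported` are illustrative, and the LS-PE-based method is left out because the text does not pair its watermark and extra data figures unambiguously.

```python
def capacity_efficiency(watermark_bpn, extra_bpn):
    """Ratio of watermark bpn to extra data bpn (watermark bits per extra bit)."""
    if extra_bpn <= 0:
        raise ValueError("extra data bpn must be positive")
    return watermark_bpn / extra_bpn


# (watermark bpn, extra data bpn) as quoted in the text, rounded to 3 decimals.
reported = {
    "CHS": (0.566, 0.303),
    "NHS": (0.391, 0.315),
    "Chen": (0.108, 0.218),
    "Huang": (0.027, 0.203),
}

efficiency = {name: capacity_efficiency(w, e) for name, (w, e) in reported.items()}

# Print the methods from most to least efficient.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

Because the inputs are rounded, the CHS ratio comes out near, but not exactly at, the reported 1.865.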
We assume that the segments of noncoding DNA used for embedding do not alter regulatory gene expression and have no impact on biological function. Although our methods show a relatively high base change rate compared to the conventional methods, they do not modify coding DNA and, under this assumption, do not affect biological function.
4.3. False Codon Occurrence
To prevent the generation of false codons in noncoding regions, we performed false codon searching of intra-/intercode values for both the embedding process and the LSB substitution of extra data. According to our results, the NHS- and CHS-based methods do not generate false codons. However, Chen’s and Huang’s methods and the LS-PE-based method generate false codons during the embedding process because they impose no false codon constraint.
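The intra-/intercode search can be pictured as a check on every three-base window that touches a modified code value, including windows that straddle the boundary with an adjacent code value. The sketch below is an illustration under stated assumptions, not the author's procedure: it assumes the standard start codon ATG and stop codons TAA, TAG, and TGA, treats a window as "false" only if it was not already such a codon in the original sequence, and uses the hypothetical name `introduces_false_codon`.

```python
# Assumption: standard genetic code start/stop codons on a DNA alphabet.
FLAGGED = {"ATG", "TAA", "TAG", "TGA"}


def introduces_false_codon(original, candidate, start, length):
    """Return True if rewriting original[start:start+length] created a new
    start/stop codon in any 3-base window touching the rewritten block."""
    if len(original) != len(candidate):
        raise ValueError("sequence length must be preserved")
    # A window starting at i covers i..i+2, so it touches the block when
    # start - 2 <= i <= start + length - 1 (clipped to valid window starts).
    first = max(0, start - 2)
    stop = min(len(candidate) - 2, start + length)
    for i in range(first, stop):
        if candidate[i:i + 3] in FLAGGED and original[i:i + 3] not in FLAGGED:
            return True
    return False


# Two 4-base code values: "CCAT" followed by "CGGA".
original = "CCATCGGA"
# Changing the second code value to "GGGA" forms ATG across the boundary.
print(introduces_false_codon(original, "CCATGGGA", start=4, length=4))
# Changing it to "CGGC" creates no start/stop codon.
print(introduces_false_codon(original, "CCATCGGC", start=4, length=4))
```

An embedder built on such a check would skip or re-map any candidate value for which the function returns True; how the skipped positions are signalled to the decoder is outside this sketch.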
We define the occurrence probability of false codons, , as . This indicates the probability that three consecutive nucleotide bases in the watermarked noncoding sequence form a start or stop codon. We ran the embedding 1000 times on each test sequence with different watermarks and then computed . The results are summarized in Table 4. Our methods did not generate false codons. However, Huang’s and Chen’s methods generated false codons every nucleotide bases and every nucleotide bases, respectively. The LS-PE-based method, which does not consider the stop codons, generated false codons approximately every nucleotide bases. Even though this probability is extremely low, a single false codon can be fatal to biological function.
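The repeated-trial measurement described above can be sketched as follows. The estimator here is an assumption, since the exact definition is not reproduced in this text: it counts three-base windows that are start/stop codons in the watermarked sequence but not in the original, and divides by the number of windows. The `embed` argument is a hypothetical callable standing in for any embedding method; it is not part of the paper.

```python
import random

# Assumption: standard genetic code start/stop codons on a DNA alphabet.
FLAGGED = {"ATG", "TAA", "TAG", "TGA"}


def false_codon_rate(original, watermarked):
    """Fraction of 3-base windows that became a start/stop codon."""
    n_windows = len(original) - 2
    if n_windows <= 0 or len(original) != len(watermarked):
        raise ValueError("need equal-length sequences of at least 3 bases")
    new_codons = sum(
        1
        for i in range(n_windows)
        if watermarked[i:i + 3] in FLAGGED and original[i:i + 3] not in FLAGGED
    )
    return new_codons / n_windows


def mean_false_codon_rate(original, embed, trials=1000, seed=0):
    """Average the rate over `trials` embeddings with different random watermarks.

    `embed(original, rng)` is a hypothetical function returning a watermarked
    sequence of the same length; `rng` supplies the random watermark bits.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += false_codon_rate(original, embed(original, rng))
    return total / trials
```

With this estimator, a method that never creates a new start/stop codon yields a rate of exactly zero, and the reciprocal of a nonzero rate gives the average number of bases per false codon.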
Reversible DNA data hiding can be used for repeated embedding and detection of a watermark while recovering the original DNA sequence without loss of information. Therefore, this technique can be applied to DNA storage and DNA steganography as well as to the analysis of the mutation process using an external watermark. However, most recently proposed DNA data hiding methods are irreversible. This study evaluated reversible DNA data hiding techniques using histogram shifting of noncircular and circular types while preventing biological mutation and achieving blindness and high watermark capacity.
It is extremely difficult to extend reversible image data hiding to multiple bits or to shift the histogram bin to multiple levels owing to image quality concerns. DNA data hiding, by contrast, has no invisibility criterion comparable to image quality; it is constrained instead by biological function and by the need to avoid false start/stop codons. Therefore, it is possible to extend the difference of code values of DNA sequences to multiple bits or to shift the values to multiple levels within these constraints. We coded four-character symbols of noncoding regions into integer values using the numeric order and embedded the binary watermark in two ways, namely, by multilevel histogram shifting of the NHS and CHS types. Next, we prevented the generation of false codons via searching of intra-/intercode values. On the basis of our experimental results, we verified that the CHS-based method has the highest watermark bpn of 0.566 bpn, which is 0.147–0.539 bpn higher than the watermark bpns of other methods, and that this method shows the highest capacity efficiency of approximately 1.865, which is 0.622–1.734 higher than that of other methods. Furthermore, we verified that false codons are not introduced by our methods, but they are introduced every – nucleotide bases by conventional methods.
Data hiding in noncoding regions may damage unknown genes or gene regulatory networks. Future studies should investigate reversible DNA watermarking in coding regions by solving codon preservation and codon optimization problems with reversibility.
Conflicts of Interest
The author declares that there are no conflicts of interest.
This research was supported by Tongmyong University Research Grants 2015A009.
O. O. Babatunde, “Deoxyribonucleic acid (DNA) as a hypothetical information hiding medium: DNA mimics basic information security protocol,” Journal of Engineering and Technology Research, vol. 3, no. 5, pp. 148–154, 2011.
I. Hafeez, A. Khan, and A. Qadir, “DNA-LCEB: a high-capacity and mutation-resistant DNA data-hiding approach by employing encryption, error correcting codes, and hybrid twofold and fourfold codon-based strategy for synonymous substitution in amino acids,” Medical & Biological Engineering & Computing, vol. 52, no. 11, pp. 945–961, 2014.
T. Chen, “A novel biology-based reversible data hiding fusion scheme,” in Frontiers in Algorithmics, vol. 4613 of Lecture Notes in Computer Science, pp. 84–95, Springer, 2007.
L.-T. Ko, J.-E. Chen, Y.-S. Shieh, H.-C. Hsin, and T.-Y. Sung, “Nested quantization index modulation for reversible watermarking and its application to healthcare information management systems,” Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 839161, 8 pages, 2012.