Abstract

In order to protect the copyright of audio media in cyberspace, a robust and blind audio watermarking scheme based on the genetic algorithm (GA) is proposed in a dual transform domain. A formula for calculating the embedding depth is developed, and two embedding depths with different values are used to represent the “1” and “0” states of the binary watermark, respectively. In the extracting process, the embedding depth of each audio fragment is calculated and compared with the average embedding depth to determine the watermark bit by bit, so the scheme can extract the watermark blindly, without the original audio. GA is applied to optimize the algorithm parameters to meet the performance requirements of different applications. In addition, the embedding rule is further optimized to enhance transparency based on the principle of minimal modification to the audio. Experimental results show that the payload capacity reaches 172.27 bps and the bit error rate (BER) is 0.1% while the signal-to-noise ratio (SNR) remains above 25 dB, and the robustness is strong against many attacks. Significantly, this scheme can adaptively select the algorithm parameters to satisfy specific performance requirements.

1. Introduction

In modern society, the development of network technology has greatly promoted the rapid dissemination of network resources. People can obtain these resources efficiently and conveniently, but at the same time they worry about copyright infringement [1]. Therefore, how to protect the copyright of network resources has attracted the interest of many scholars. Encryption is a traditional technology for protecting information, but ciphertext cannot be disseminated as widely as plaintext [2]. Moreover, if a hacker fails to obtain the correct key, he may attempt to break the ciphertext by brute force. Therefore, encryption may put information at risk. Information hiding is a different approach: information is secretly embedded into digital media that can be made public, in order to protect the copyright of the media, transmit confidential information, label content, and so on [3]. It not only conceals the information content but also hides the transmission behaviour, which reduces the possibility of attack because the human perception system cannot detect the embedded information. It can be seen from the above analysis that encryption prevents unauthorized parties from obtaining the information, while information hiding conceals the transmission behaviour itself [4]. Therefore, information hiding provides a new mode for protecting information in cyberspace and has been widely used in copyright protection, secret communication, content anticounterfeiting, military intelligence, identity authentication, and so on [5]. Four main indicators, namely transparency, robustness, security, and payload capacity, are used to evaluate the performance of information hiding technology. In practice these indicators conflict with one another, so researchers usually emphasize particular indicators according to the actual application requirements when developing a hiding scheme. Steganography [6, 7] and digital watermarking are the two main branches of information hiding. Both must have high transparency; otherwise, the quality of the carriers, which may be images [8], videos [9], audio [10, 11], and so on, will be degraded by the embedded information. Steganography usually requires a large payload capacity to carry large amounts of confidential information, whereas digital watermarking pays more attention to robustness against external attacks.

Digital watermarking technology protects the copyright of a carrier by embedding a watermark that is difficult for the human perception system to sense [12]. While the carriers are being used, the watermarks hidden in them may be damaged or lost due to signal processing operations, so a watermarking scheme must be robust enough to withstand those operations. In addition, to make a watermarking scheme convenient for practical application, schemes with blind extraction are preferred. In recent years, digital watermarking has been widely used in the field of information security, so many scholars have devoted themselves to the research of watermarking schemes, and the core issue is how to effectively improve the overall performance of a watermarking scheme in practical applications.

Audio media is one of the most common multimedia resources; a large amount of online songs, conversations, and other audio circulates in cyberspace every day. How to protect the copyright of these audio media has aroused the interest of many researchers. However, there are relatively few research results on audio watermarking schemes, mainly because of the following difficulties. The human ear is very sensitive to the drop in audio quality caused by the presence of a watermark, which will affect the normal use of the audio media [13, 14]. In addition, much audio editing software can be used to modify audio signals conveniently, which may cause the watermark hidden in the audio to be partially or totally destroyed or lost. Many existing audio watermarking schemes have not overcome these difficulties well, so there are still many technical issues to be solved in improving their performance.

1.1. Related Works

According to the embedding domain, audio watermarking schemes are mainly divided into two categories: time domain methods [15, 16] and transform domain methods [17–21]. In order to prevent the loss of the information hidden in audio carriers that may suffer from various attacks, watermarking schemes developed in the transform domain usually have strong robustness. There are many transform domain schemes, such as those based on the discrete cosine transform (DCT) [17], discrete wavelet transform (DWT) [18, 19], discrete Fourier transform (DFT) [20], and singular value decomposition (SVD) [21]. DCT concentrates most of the energy of an audio fragment in its low-frequency part, so many scholars use DCT to design audio watermarking schemes. Hu [17] proposed a high-capacity audio watermarking scheme that embedded the watermark into the low-frequency coefficients according to the masking characteristics of the human auditory system in the DCT domain, but its robustness was not strong against some conventional signal processing operations. DWT has multiresolution characteristics, so it is often used to analyse the main components of a signal in both the time domain and the frequency domain [22]. Huang [18] presented an adaptive watermarking scheme that modified the DWT coefficients using the signal-to-noise ratio (SNR) of the audio; this scheme had good transparency because its embedding formula was optimized by minimizing the difference between the original and modified coefficients. Hsu [19] designed an audio watermarking scheme based on the two-stage Lagrange principle and minimum-energy scaling optimization in the wavelet domain. This scheme also had good transparency, but its BER was high, indicating weak robustness against low-pass filtering, time-scaling, and resampling attacks. In [20], a stereo audio watermarking scheme was proposed in the DFT domain. This scheme calculated the similarity of the audio signals in the two stereo channels to develop the embedding and extracting rules. Although the payload capacity of the scheme was low, it had strong robustness against attacks. In [21], a blind audio watermarking scheme was proposed based on SVD by mixing the watermark with the diagonal matrix of singular values. In recent years, many watermarking schemes have been developed in multiple transform domains, and they are usually more robust than schemes developed in a single transform domain. Lei [23] proposed an audio watermarking scheme in the dual transform domain consisting of SVD and DCT. This scheme processed the audio signal into two-dimensional data blocks to which SVD and DCT were applied in turn, and finally the watermark was embedded into the SVD-DCT coefficients with larger values. Audio watermarking schemes based on DWT and SVD are proposed in [24, 25]. In addition, some scholars have proposed other watermarking schemes. Wang [26] proposed a blind audio watermarking scheme using exponential modulation; this scheme extracted exponential-distance feature parameters by mapping the audio into two-dimensional data and then embedded the watermark into these parameters. The above watermarking schemes have promoted the development of watermarking technology, but most audio watermarking schemes still have shortcomings, such as poor transparency, which may degrade the quality of the audio carriers; weak robustness, which may lead to loss of the watermark; low security; nonblind extraction; and insufficient capacity.
In addition, the parameters of most existing schemes are set by the designers according to their own experience, so those schemes cannot determine their parameters adaptively in different applications and thus cannot achieve their best performance.

In our work, a robust audio watermarking scheme is proposed by combining the energy concentration characteristics of DCT with the multiresolution characteristics of DWT. First, DWT is applied to the audio carrier to select the data in a specific frequency band, which improves transparency. Second, DCT is applied to these data to concentrate the energy in the low-frequency components that carry the watermark, which improves robustness. To solve the problem of parameter setting, a genetic algorithm is used to optimize the algorithm parameters for different applications.

1.2. Contributions

It can be seen from the above introduction that there are still many problems in audio watermarking schemes that need further research. In this paper, a robust and blind audio watermarking scheme based on GA is proposed in the DWT-DCT dual transform domain. Our contributions are as follows:
(1) Our proposed scheme is developed in the DWT-DCT dual transform domain, so it has strong robustness that prevents the watermark hidden in the carried audio from being lost when the audio suffers various attacks. Besides, the embedding rules are further optimized based on the principle of minimal modification to the audio to improve transparency.
(2) Our proposed scheme extracts the watermark blindly. It employs two different embedding depths to represent the two states of the binary watermark. In the extracting process, the embedding depth of each audio fragment is calculated and compared with the average embedding depth to determine the watermark bit by bit, so the extracting process does not need the original audio.
(3) Our proposed scheme uses GA to optimize the important parameters adaptively to meet the performance requirements of different applications, which brings out the optimal performance of the scheme. In many existing watermarking schemes, the algorithm parameters are set by the designers according to their own experience, which cannot fully exploit the performance of the algorithm and cannot adapt the parameters to different applications.

The remainder of this paper is organized as follows: the principle of the proposed scheme is described in Section 2, including chaotic encryption of the watermark, the principle of the embedding algorithm, the principle of the extracting algorithm, and optimization of the parameters based on GA. In Section 3, the performance of the proposed scheme is evaluated in terms of transparency, payload capacity, and robustness, and the scheme is compared with some related schemes from recent years according to their experimental results. Finally, we summarize our work and introduce future research directions in Section 4.

2. Principle of the Proposed Scheme

Due to auditory masking effects, the human auditory system cannot effectively perceive extremely small changes in the frequency components of audio media, so watermarks can be embedded in audio media to protect their copyright. Figure 1 shows the principle of the watermarking scheme.

In the embedding process, the watermark is first encrypted and then embedded into the audio media using the proposed embedding algorithm; finally, the carried audio containing the watermark is uploaded to the Internet. When it is necessary to prove the copyright of the audio media, the corresponding extracting algorithm is applied to the carried audio to extract the encrypted watermark, which is then decrypted using the correct key. The embedding and extracting algorithms are the core of this scheme. The extracting process and decryption are symmetric with the embedding process and encryption, respectively, and only those who have the corresponding extracting algorithm and the correct key can obtain the watermark.

2.1. Chaotic Encryption of Watermark

In order to enhance the security of the watermark, it is necessary to encrypt it before embedding it into the audio carrier. Because chaotic sequences are nonperiodic, broadband, noise-like, and unpredictable in the long term, chaotic encryption is an information security technology that has developed rapidly in recent years and is especially suitable for secure communications and related fields [27].

Assume that the watermark can be converted into a binary stream W = {w(i)}, where each w(i) is 0 or 1, L is the length of W, and i is the serial number of the element in W. The logistic mapping x(i + 1) = μ·x(i)·(1 − x(i)) is applied to generate a chaotic sequence of the same size as W, which is binarised into C = {c(i)} by comparing each x(i) with a threshold T: c(i) = 1 if x(i) > T and c(i) = 0 otherwise. Here x(0) is the initial value, and the logistic system is in chaos when μ lies roughly in the interval (3.57, 4]. An exclusive OR operation is performed on W and C to obtain the encrypted stream E shown in equation (4), where ⊕ represents the exclusive OR operator. The triple key (x(0), μ, T) is the unique key that can be used to decrypt E.

In order to mark the start and end positions of E, a synchronization code should be added to it. For instance, add “1111 1111 0000 0000” in front of E as the start flag and add “1111 1111 0000 0000” after E as the end flag.
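
To make this step concrete, the following minimal Python sketch (the variable names, the example key values, and the use of NumPy are ours, not the paper's) generates a logistic-map sequence, binarises it with the threshold, XORs it with the watermark bits, and frames the result with the synchronization code described above.

```python
import numpy as np

SYNC = np.array([1] * 8 + [0] * 8, dtype=np.uint8)  # "1111 1111 0000 0000"

def logistic_bits(length, x0=0.3456, mu=3.99, T=0.5):
    """Generate a binary chaotic sequence from the logistic map.

    (x0, mu, T) plays the role of the triple key: initial value, control
    parameter (chaotic for roughly 3.57 < mu <= 4), and binarisation threshold.
    """
    x, bits = x0, np.empty(length, dtype=np.uint8)
    for i in range(length):
        x = mu * x * (1.0 - x)        # logistic mapping
        bits[i] = 1 if x > T else 0   # threshold the real value into a bit
    return bits

def encrypt_watermark(w_bits, key=(0.3456, 3.99, 0.5)):
    """XOR the binary watermark with the chaotic sequence and add sync codes."""
    w_bits = np.asarray(w_bits, dtype=np.uint8)
    c = logistic_bits(len(w_bits), *key)
    e = np.bitwise_xor(w_bits, c)             # encrypted stream E
    return np.concatenate([SYNC, e, SYNC])    # start flag + payload + end flag

def decrypt_watermark(framed_bits, key=(0.3456, 3.99, 0.5)):
    """Strip the sync codes and XOR again with the same chaotic sequence."""
    e = np.asarray(framed_bits, dtype=np.uint8)[len(SYNC):-len(SYNC)]
    return np.bitwise_xor(e, logistic_bits(len(e), *key))
```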

2.2. Principle of the Embedding Algorithm

Suppose that A is the original audio with N sample points, where i is the serial number of an element in A and a(i) represents the i-th sample point. A is evenly divided into P audio fragments; the length of each fragment is ⌊N/P⌋, where ⌊ ⌋ indicates that the value in brackets is rounded down. Apply q-level DWT to each fragment to obtain one set of approximation coefficients cAq and q sets of detail coefficients cD1, cD2, …, cDq. cAq contains the main frequency components of the fragment, and even minor changes to it may cause a serious decline in audio quality, so the watermark is usually not concealed in cAq. The detail coefficients contain the higher frequency components of the fragment, so the watermark can often be concealed in these frequency bands because of their smaller influence on audio quality. cD1 represents the highest frequency band of the audio signal, and the degradation caused by embedding the watermark in this band is usually hard for the human ear to perceive; however, since the high-frequency components are vulnerable to attack, the detail coefficients cDq closest to cAq are usually used to carry the watermark. In the following description, the principle of the embedding algorithm is illustrated by taking the embedding of 1-bit binary information into the q-level detail coefficients cDq of one fragment as an example.
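
As a rough sketch of this decomposition (assuming the PyWavelets library, an illustrative db4 wavelet, and q = 3 levels, none of which are specified in this excerpt), each fragment can be transformed and its q-level detail band selected as follows.

```python
import pywt

def split_fragments(audio, n_fragments):
    """Evenly divide the audio into fragments; trailing samples are ignored."""
    frag_len = len(audio) // n_fragments           # floor division
    return [audio[k * frag_len:(k + 1) * frag_len] for k in range(n_fragments)]

def dwt_carrier_band(fragment, wavelet="db4", level=3):
    """Return the full coefficient list and the q-level detail band cDq.

    pywt.wavedec returns [cA_q, cD_q, ..., cD_1]; cD_q is the detail band
    closest to the approximation and is the one used to carry the watermark.
    """
    coeffs = pywt.wavedec(fragment, wavelet, level=level)
    return coeffs, coeffs[1]                       # coeffs[1] is cD_q

def rebuild_fragment(coeffs, new_cDq, wavelet="db4"):
    """Replace cD_q with its modified version and invert the DWT.

    (For odd-length fragments the reconstruction may be one sample longer.)
    """
    coeffs = list(coeffs)
    coeffs[1] = new_cDq
    return pywt.waverec(coeffs, wavelet)
```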

Divide cDq into two data blocks: the former block D1 and the latter block D2. Perform DCT on these two data blocks to obtain two sets of transform domain coefficients (TDC), which can be expressed as C1 and C2, respectively. The low-frequency components of C1 and C2 can be used to carry the watermark because they contain most of the energy of the TDC, which is beneficial for improving the scheme's robustness. M1 and M2 are the average amplitudes of the low-frequency components of C1 and C2, computed in equations (6) and (7), and M is the average value of M1 and M2, as shown in equation (8). In order to embed 1-bit binary information into the audio, M1 and M2 are modified by the rules in equations (9) and (10), where β is the embedding depth in the range (0, 1); the resulting modified coefficients have the average amplitudes M1′ and M2′, which can be calculated using equations (11) and (12).
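
The block splitting and average-amplitude computation can be sketched as follows (the share of coefficients treated as low-frequency is our assumption, and the paper's equations (6)–(8) are not reproduced in this excerpt).

```python
import numpy as np
from scipy.fft import dct

LOW_FRACTION = 0.25   # assumed share of coefficients treated as low-frequency

def block_dct(cDq):
    """Split cD_q into a former and a latter block and apply DCT to each."""
    half = len(cDq) // 2
    d1, d2 = cDq[:half], cDq[half:2 * half]
    return dct(d1, norm="ortho"), dct(d2, norm="ortho")

def low_freq_avg(coeff_block):
    """Average absolute amplitude of the low-frequency DCT coefficients."""
    n_low = max(1, int(len(coeff_block) * LOW_FRACTION))
    return np.mean(np.abs(coeff_block[:n_low]))

def block_averages(c1, c2):
    """Return M1, M2, and their mean M as used by the embedding rules."""
    m1, m2 = low_freq_avg(c1), low_freq_avg(c2)
    return m1, m2, 0.5 * (m1 + m2)
```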

The average value M′ of M1′ and M2′ can be calculated using equation (13). Then, the variations of M1 and M2 can be expressed by equation (14). Thus, the embedding depth of the fragment can be calculated according to equation (15).

Two embedding depths with different values can be used to represent the two states “1” and “0” of 1-bit binary information. Accordingly, when embedding the i-th bit of the watermark into its audio fragment (counting from the first fragment chosen to carry the watermark), the depth β1 is used if the bit is “1” and the depth β0 is used if the bit is “0.”

The rules shown in equations (9) and (10) can be called the first rule; the second rule is defined analogously, with the roles of the two modified average amplitudes exchanged.

Then, the average amplitudes of the modified coefficients under the second rule are calculated in equations (19) and (20). For the case illustrated in Figure 2, the variations of the two average amplitudes produced by the first rule are shown in Figure 2(a), and those produced by the second rule are shown in Figure 2(b).

Both variations in Figure 2(a) are smaller than those in Figure 2(b), which indicates that the first rule modifies the audio less and is therefore better for transparency in this case. Similarly, the comparison of the two graphs in Figure 3 implies that the second rule should be chosen to embed the information bit in the opposite case. In other words, for each fragment the rule that requires the smaller modification of the coefficients is selected.

Finally, the inverse discrete cosine transform (IDCT) and the inverse discrete wavelet transform (IDWT) are performed in turn on the modified coefficients to obtain the carried audio fragment. The flowchart of the embedding algorithm is shown in Figure 4. The embedding process can be described as follows:
Step 1: convert the watermark into a binary stream and encrypt it to obtain E.
Step 2: add the synchronization code at the beginning and end of E.
Step 3: divide the original audio A into P audio fragments.
Step 4: apply DWT to the current audio fragment to obtain the q-level detail coefficients cDq.
Step 5: divide cDq into the former block D1 and the latter block D2.
Step 6: apply DCT to the two data blocks to obtain C1 and C2, respectively.
Step 7: calculate M1, M2, and M according to equations (6)–(8).
Step 8: select the rule (first or second) that requires the smaller modification and use it to modify M1 and M2; if the current watermark bit is 1, the embedding depth β1 is used, otherwise β0 is used.
Step 9: apply IDCT and IDWT in turn to recover the carried audio fragment.
Step 10: repeat Step 4 to Step 9 until all watermark bits are concealed.
Step 11: recombine all audio fragments to obtain the carried audio.
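
Because equations (9), (10), and (16)–(18) are not reproduced in this excerpt, the per-fragment sketch below uses an illustrative rule that is merely consistent with the description above: the low-frequency averages of the two blocks are pushed to (1 + β)·M and (1 − β)·M, the block whose average is already larger receives the increase (so the smaller modification is applied), and β equals β1 for bit “1” and β0 for bit “0.”

```python
import numpy as np
from scipy.fft import dct, idct

def embed_bit(cDq, bit, beta1=0.9, beta0=0.5, low_fraction=0.25):
    """Embed one watermark bit into the q-level detail band of a fragment.

    Illustrative rule (not the paper's exact equations): scale the
    low-frequency DCT coefficients of the two blocks so that their average
    amplitudes become (1 + beta) * M and (1 - beta) * M, where beta is
    beta1 for bit 1 and beta0 for bit 0; the block whose average is already
    larger is the one that is increased, which needs the smaller change.
    """
    half = len(cDq) // 2
    c1 = dct(cDq[:half], norm="ortho")
    c2 = dct(cDq[half:2 * half], norm="ortho")
    n_low = max(1, int(half * low_fraction))
    m1 = max(np.mean(np.abs(c1[:n_low])), 1e-12)
    m2 = max(np.mean(np.abs(c2[:n_low])), 1e-12)
    m_bar = 0.5 * (m1 + m2)
    beta = beta1 if bit == 1 else beta0

    if m1 >= m2:   # first rule: raise the former block, lower the latter
        t1, t2 = (1 + beta) * m_bar, (1 - beta) * m_bar
    else:          # second rule: the mirror image, chosen for minimal change
        t1, t2 = (1 - beta) * m_bar, (1 + beta) * m_bar
    c1[:n_low] *= t1 / m1
    c2[:n_low] *= t2 / m2

    new_cDq = cDq.copy()
    new_cDq[:half] = idct(c1, norm="ortho")
    new_cDq[half:2 * half] = idct(c2, norm="ortho")
    return new_cDq
```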

2.3. Principle of the Extracting Algorithm

The extracting process is symmetric with the embedding process. When extracting the watermark from the carried audio, the embedding depth of each audio fragment is calculated according to equation (15), and the 1-bit binary information is then determined by comparing it with the overall average embedding depth, which is calculated according to equation (21). The extracting rule is expressed in equation (22).

When all binary bits have been extracted from the audio fragments, the data between the start flag and the end flag is the encrypted stream E. Remove the synchronization code, and then decrypt E using the triple key to obtain the watermark. It can be seen from the principle of the extracting algorithm that this scheme has high security because only those who have the corresponding extracting algorithm and the correct key can access the watermark. The flowchart of the extracting algorithm is shown in Figure 5. The process of extracting the watermark can be described as follows:
Step 1: divide the carried audio into audio fragments.
Step 2: apply DWT to the current audio fragment to obtain the q-level detail coefficients cDq.
Step 3: divide cDq into the former block D1 and the latter block D2.
Step 4: apply DCT to the two data blocks to obtain C1 and C2, respectively.
Step 5: calculate M1, M2, and M.
Step 6: calculate the embedding depth according to equation (15).
Step 7: repeat Step 2 to Step 6 until the embedding depths of all fragments have been calculated.
Step 8: calculate the overall average embedding depth according to equation (21).
Step 9: extract all binary bits according to equation (22).
Step 10: remove the synchronization code and decrypt E to obtain the watermark.
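
Continuing the illustrative rule used in the embedding sketch (again our assumption rather than the paper's exact equations (15), (21), and (22)), the blind extractor recomputes a depth estimate for every fragment and thresholds it against the overall average depth.

```python
import numpy as np
import pywt
from scipy.fft import dct

def fragment_depth(fragment, wavelet="db4", level=3, low_fraction=0.25):
    """Estimate the embedding depth of one carried fragment (blind)."""
    cDq = pywt.wavedec(fragment, wavelet, level=level)[1]
    half = len(cDq) // 2
    c1 = dct(cDq[:half], norm="ortho")
    c2 = dct(cDq[half:2 * half], norm="ortho")
    n_low = max(1, int(half * low_fraction))
    m1 = np.mean(np.abs(c1[:n_low]))
    m2 = np.mean(np.abs(c2[:n_low]))
    return abs(m1 - m2) / max(m1 + m2, 1e-12)   # depth estimate in [0, 1)

def extract_bits(fragments):
    """Blind extraction: compare each depth with the overall average depth."""
    depths = np.array([fragment_depth(f) for f in fragments])
    avg = depths.mean()
    return (depths >= avg).astype(np.uint8)     # above average -> "1", else "0"
```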

2.4. Optimization of the Parameters Based on GA

Three important parameters of this scheme have a strong effect on its overall performance. To bring out the best performance in different applications, GA is used to search for the optimal parameters according to the required performance indicators. The fitness function is constructed from transparency, payload capacity, and robustness, as shown in equation (23). SNR and BER are expressed in equations (24) and (25), respectively, where SNR is computed from the original audio and the carried audio, and BER is computed from the original watermark and the extracted watermark; the payload capacity and the thresholds of transparency and payload capacity also enter the fitness function. The population consists of several chromosomes, each encoded as a binary string, as shown in equation (26); a chromosome is the concatenation of the binary representations of the parameters, and their lengths add up to the chromosome length. Each parameter is obtained by converting its segment of the chromosome from binary to decimal, which defines the transformation relationship between a chromosome and the parameters. The detailed process can be described as follows:
Step 1: parameter initialization. Set the crossover probability, the mutation probability, the population size, the chromosome length, the thresholds of transparency and payload capacity, and the other GA parameters, and then generate an initial population.
Step 2: decode the parameters from each chromosome and then execute the embedding algorithm in Section 2.2 to obtain the carried audio.
Step 3: attack test. Apply some attacks to the carried audio and calculate SNR according to equation (24).
Step 4: choose all qualified chromosomes that meet the transparency and capacity thresholds. Then, execute the extracting algorithm and calculate BER according to equation (25).
Step 5: calculate the fitness value according to equation (23) and obtain the best chromosome with the largest fitness value.
Step 6: apply the roulette-wheel selection operation (keeping the best chromosome) to generate the transition population.
Step 7: apply the crossover operation to pairs of adjacent chromosomes, except for the best chromosome, to obtain a new transition population.
Step 8: apply the mutation operation to each chromosome, except for the best chromosome, to obtain the next-generation population.
Step 9: repeat Step 2 to Step 8 until the global optimal chromosome appears.
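
A generic sketch of this search is given below: a plain binary-chromosome GA in which the assumption that the decoded parameters are the two embedding depths, the fitness expression, and all numeric defaults are ours and only mimic the thresholding idea of equation (23) rather than reproducing it.

```python
import numpy as np

rng = np.random.default_rng(1)

BITS = 8                    # bits used to encode each parameter (assumed)
POP, GENS = 20, 50          # population size and number of generations
PC, PM = 0.8, 0.1           # crossover and mutation probabilities
SNR_T, C_T = 25.0, 172.27   # transparency and capacity thresholds

def decode(chrom):
    """Map a binary chromosome to two embedding depths in [0, 1)."""
    beta1 = int("".join(map(str, chrom[:BITS])), 2) / (2 ** BITS)
    beta0 = int("".join(map(str, chrom[BITS:])), 2) / (2 ** BITS)
    return beta1, beta0

def fitness(chrom, evaluate):
    """evaluate(beta1, beta0) is expected to run a trial embedding plus an
    attack test and return (snr, ber, capacity); chromosomes violating the
    transparency or capacity thresholds receive zero fitness."""
    snr, ber, capacity = evaluate(*decode(chrom))
    if snr < SNR_T or capacity < C_T:
        return 0.0
    return 1.0 / (ber + 1e-6)        # smaller BER -> larger fitness

def ga_search(evaluate):
    pop = rng.integers(0, 2, size=(POP, 2 * BITS))
    best = pop[0].copy()
    for _ in range(GENS):
        fit = np.array([fitness(c, evaluate) for c in pop])
        best = pop[int(fit.argmax())].copy()
        # roulette-wheel (fitness-proportional) selection
        p = fit + 1e-9
        pop = pop[rng.choice(POP, size=POP, p=p / p.sum())]
        # single-point crossover on adjacent pairs
        for i in range(0, POP - 1, 2):
            if rng.random() < PC:
                cut = int(rng.integers(1, 2 * BITS))
                a, b = pop[i, cut:].copy(), pop[i + 1, cut:].copy()
                pop[i, cut:], pop[i + 1, cut:] = b, a
        # bit-flip mutation
        mask = rng.random(pop.shape) < PM
        pop[mask] ^= 1
        pop[0] = best                # elitism: keep the best chromosome
    return decode(best)
```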

3. Performance Evaluation

In this section, the performance of this scheme is evaluated in terms of transparency, security, capacity, robustness, and complexity. The detailed experimental settings are as follows: (1) the transparency threshold is set to 25 dB and the capacity threshold to 172.27 bps, the crossover probability is 0.8, the mutation probability is 0.1, the population size is 20, and the length of each encoded parameter is 8 bits; (2) the remaining algorithm parameters are set to 0.1, 0.3, 0.5, and 0.9; (3) a fixed triple key is used; (4) a binary image is used as the watermark, as shown in Figure 6; and (5) the original audio carriers are 10 male voices and 10 female voices, all taken from the TIMIT standard database.

SNR, the subjective difference grade (SDG), and the objective difference grade (ODG) are used to evaluate the transparency of the proposed scheme. For SDG, the original audio and the carried audio are presented to the same group of listeners, who are asked to distinguish the difference and give a subjective score; the closer the average score is to 0, the better the audio quality. ODG is one of the output values of the perceptual evaluation of audio quality (PEAQ) model, which gives an objective score between −5 and 0. BER is the ratio of the number of erroneous bits to the total number of bits in the extracted information and is used to evaluate the robustness of the scheme. The normalized correlation coefficient (NC) measures the similarity between the original information and the extracted information, as defined in equation (29); it lies in the range [0, 1], and the larger the NC, the more similar the original and extracted information and the stronger the robustness of the scheme.
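
For reference, SNR, BER, and NC can be computed with the standard definitions below (variable names are ours; NC follows the usual normalized-correlation form for binary watermarks).

```python
import numpy as np

def snr_db(original, carried):
    """Signal-to-noise ratio of the carried audio in dB."""
    noise = original - carried
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))

def ber(original_bits, extracted_bits):
    """Fraction of watermark bits that were extracted incorrectly."""
    o, e = np.asarray(original_bits), np.asarray(extracted_bits)
    return np.mean(o != e)

def nc(original_bits, extracted_bits):
    """Normalized correlation between original and extracted watermarks."""
    o = np.asarray(original_bits, dtype=float)
    e = np.asarray(extracted_bits, dtype=float)
    return np.sum(o * e) / np.sqrt(np.sum(o ** 2) * np.sum(e ** 2))
```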

3.1. Transparency and Capacity

To illustrate the transparency of this scheme, an audio clip lasting about 3 seconds was randomly selected from the tested audio signals to compare the waveform and spectrum before and after embedding the information, as shown in Figure 7. The waveforms of the original audio are very similar to those of the carried audio, with no obvious difference between them, so this scheme has high transparency. In the case of no attack, the average experimental results are listed in Table 1.

The experimental results in Table 1 confirm that this scheme has higher transparency and robustness than [22, 26], as can be seen from the results for SNR, SDG, ODG, BER, and NC. At the capacity of 172.27 bps, although the transparency of our scheme is slightly lower than that of [19], its BER is significantly better, which shows that the robustness of our scheme is stronger.

3.2. Robustness

In this section, robustness is evaluated in three steps: first, various types of attacks are applied to the carried audio; then, the developed extracting algorithm is used to extract the watermark from the carried audio; finally, the BER and NC of the extracted watermark are calculated. The attack types considered in our test are shown in Table 2.

Figure 8 compares the waveforms of the original audio and the carried audio after both have been compressed by MP3 at 64 kbps (only a randomly chosen clip lasting about 3 seconds is shown). It can be seen from Figure 8 that there is no obvious abnormality in the waveforms. The waveform comparisons after other attacks are similar and are not shown one by one. Table 3 gives the BER of the watermark extracted from the carried audio under the above attacks, and the extracted watermarks and their NC values are listed in Figure 9. According to the experimental results in Tables 1 and 3 and Figure 9, the payload capacity and SNR of this scheme reach 172.27 bps and 25 dB, respectively, which indicates that this scheme has a large capacity and high transparency, so it can be used to protect the copyright of audio media without affecting the audio quality. The extracted watermarks are very similar to the original watermark under most attacks except for additive noise at 20 dB; the BER values are below 1.60% and the NC values are above 0.9237, which implies that this scheme has strong robustness, so the watermark will not be destroyed or lost by conventional signal processing operations during the use of the audio.

Compared with the schemes in the other references, this scheme has better overall performance. Although the transparency of our scheme is slightly lower than that in [19], our scheme is more robust against all the attacks listed in Table 2. At the same capacity, this scheme is more robust than that in [22] against most signal processing operations. The proposed scheme also has a larger capacity and better robustness than that in [26]: the average BER values in [26] reach 2.87% and 17.92% under amplitude scaling, while those of the proposed scheme are only 0.19% and 0.34%. Amplitude scaling is one of the most common operations that audio media undergo in practice, so the proposed scheme is more practical.

3.3. Security and Complexity

According to the Kerckhoffs principle, the security of a scheme should not rely on keeping the scheme itself secret, so the scheme should be further strengthened with encryption technology; the size of the key space then determines the degree of security. This scheme uses a triple key to encrypt the watermark chaotically. The three parameters of the triple key are real-valued, so the scheme has an infinite key space in theory; in practice, they are limited by the word length of the computer system, so the key space is finite. In addition, the comparison of the spectrograms in Figure 10 indicates that the characteristics of the carried audio change somewhat because of the embedded watermark.

The running time of the algorithm is used to assess its complexity. In our scheme, GA is used to optimize the main parameters of the algorithm, so the overall running time also depends on the search efficiency of GA. The average embedding time of this algorithm is 35.4561 seconds, and the average extraction time is 11.6896 seconds.

From the above analysis, it can be seen that our proposed scheme performs well, mainly because it combines the advantages of DWT and DCT to enhance robustness, uses GA to optimize the important parameters, and adjusts the embedding rules based on the principle of minimal modification of the TDC to improve transparency. However, the complexity of this scheme is high because it takes extra time to search for the best algorithm parameters using GA. In addition, slight changes in the spectrogram may expose the presence of a watermark hidden in the audio media.

4. Conclusions

In this paper, a robust and blind audio watermarking scheme based on GA is proposed in the dual transform domain, which can be used to protect the copyright of audio media in cyberspace. The scheme is developed in the DWT-DCT dual transform domain, so it has strong robustness that prevents the watermark hidden in the carried audio from being lost when the audio suffers various attacks. Furthermore, the scheme uses GA to optimize the important parameters adaptively to meet the performance requirements of different applications. Besides, the scheme adjusts the embedding rules based on the principle of minimal modification to the carried audio to improve transparency. When embedding the watermark, the audio is first divided into many audio fragments, and DWT is performed on each fragment to obtain the set of wavelet coefficients that will carry the watermark. Second, these wavelet coefficients are divided into two data blocks, to each of which DCT is applied, yielding two groups of TDC. Finally, two different embedding depths are used to modify the two groups of TDC to embed the binary watermark according to the designed embedding equation. When extracting the watermark, the embedding depth of each audio fragment is calculated first and then compared with the overall average embedding depth to extract the binary watermark according to the designed extracting equation, so this scheme is blind because it can extract the watermark without the original audio.

Experimental results confirm that the proposed scheme can embed watermarks into audio media without affecting their normal use and can detect the watermark blindly. Compared with other schemes in the relevant references, it achieves excellent performance, such as strong robustness against MP3 compression, additive noise, low-pass filtering, requantizing, resampling, amplitude scaling, and echo jamming. The SNR of the carried audio exceeds 25 dB at a payload capacity of 172.27 bps, which indicates that the proposed scheme has good transparency and a large capacity. However, due to the use of the genetic algorithm, the algorithm has high complexity. In future work, we will strive to reduce the complexity and improve the security of the scheme.

Data Availability

All audio signals and images tested in our experiment are available from public platforms.

Conflicts of Interest

All authors declare no conflicts of interest.

Acknowledgments

This research work was funded by the High-Level Talent Scientific Research Foundation of Jinling Institute of Technology, China (Grant no. jit-b-201918), the Natural Science Foundation of the Jiangsu Higher Education Institutions (Grant no. 20KJB110004), the National Natural Science Foundation of China (Grant no. 11601202), the “333 Project” Scientific Research Support Project of Jiangsu in 2020 (Research on Privacy Protection Technology for Big-Data), and the Scientific Research Fund Incubation Project of Jinling Institute of Technology in 2020 (Research on Multipath TCP Attack and Defense Mechanism).