Research Article  Open Access
Qiuling Wu, Aiyan Qu, Dandan Huang, "Robust and Blind Audio Watermarking Algorithm in Dual Domain for Overcoming Synchronization Attacks", Mathematical Problems in Engineering, vol. 2020, Article ID 2035747, 15 pages, 2020. https://doi.org/10.1155/2020/2035747
Robust and Blind Audio Watermarking Algorithm in Dual Domain for Overcoming Synchronization Attacks
Abstract
How to effectively resist synchronization attacks is the most challenging topic in the research of robust watermarking algorithms. A robust and blind audio watermarking algorithm for overcoming synchronization attacks is proposed in a dual domain that combines the time domain and the transform domain. Based on an analysis of the characteristics of synchronization attacks, an implicit synchronization mechanism (ISM) is developed in the time domain, which can effectively track the appropriate region for embedding and extracting watermarks. The data in this region are subjected to discrete cosine transform (DCT) and singular value decomposition (SVD) in turn to obtain the eigenvalue that carries the watermark. To extract the watermark blindly, the eigenvalue is quantized. A genetic algorithm (GA) is used to optimize the quantization step to balance transparency and robustness. The experimental results confirm that the proposed algorithm not only withstands various conventional signal processing operations but also resists malicious synchronization attacks, such as time scale modification (TSM), pitch-shifting modification (PSM), jittering, and random cropping. In particular, it can overcome TSM with strengths from −30% to +30%, which is much higher than the standard of the International Federation of the Phonographic Industry (IFPI) and far superior to the other algorithms in related papers.
1. Introduction
1.1. Related Works
With the rapid development of network and computer technology, people can easily edit, modify, store, and disseminate audio media by using various audio editing software [1–3]. While the editing software brings convenience, it also enables unauthorized users to commit a variety of infringements on audio media, such as malicious tampering, forgery, deletion, and unauthorized distribution. These infringements not only jeopardize the safety of personal property and the credibility of the audio media but may even endanger national public safety in acute cases [4–6]. How to effectively protect the security of audio media has become a research hotspot in information security, communication, and related fields. A robust audio watermarking algorithm focuses on preventing the watermarks hidden in the audio from being destroyed under complex environments [7, 8], so it must not only withstand the conventional signal processing operations encountered in normal use of the audio media but also be highly resistant to malicious synchronization attacks that may change the structure of the audio media.
Synchronization attacks may cause serious damage to the structure of the audio, so that extraction fails because the embedding region can no longer be located accurately [9–12]; they have therefore become the most challenging attacks in the research of audio watermarking algorithms [13–15]. Hu et al. [16] proposed an audio watermarking algorithm based on lifting wavelet transform. The authors claimed that the algorithm had good robustness to some conventional signal processing attacks and synchronization attacks, and its payload capacity reached 43.07 bps when SNR was over 21 dB. However, the experimental results show that its robustness against TSM still needs to be improved. Xiang and Huang [17] designed an audio watermarking algorithm with a constant watermark synchronization mechanism based on the insensitivity of the histogram shape of audio media. Hu and Chang [18] proposed a self-synchronous audio watermarking algorithm based on discrete wavelet transformation (DWT) and DCT. This algorithm concealed the synchronous signal in the first approximation subband and recalibrated the embedding position by extracting the zero-crossing point of the synchronous signal. The experimental results showed that this algorithm was effective against some synchronization attacks but poor against some signal processing attacks. Wang et al. [19] proposed a robust audio watermarking algorithm which utilized the invariance of exponential moment to enhance its robustness. However, its experimental results showed poor robustness against amplitude scaling and MP3 compression. Yuan et al. [20] put forward an audio watermarking algorithm that detected the mel-cepstrum coefficient as a synchronous signal when extracting the watermark in the DWT domain. Wang et al. [21] proposed a robust audio watermarking algorithm based on empirical mode decomposition. In this algorithm, the audio was evenly segmented into numerous fragments, and then each audio fragment was separated into two parts. One part was utilized to embed the synchronization code, and the other part was used to embed the watermark in the residue of higher-order statistics after empirical mode decomposition. If the synchronization codes could not be accurately acquired, watermark extraction would fail, which was a fatal shortcoming of this algorithm. Chen et al. [22] proposed an audio watermarking algorithm that embedded the watermark into the low-frequency coefficients of the audio in the DWT domain. This algorithm enhanced its robustness by increasing the embedding depth, but at the cost of low transparency.
In general, audio watermarking algorithms that can resist synchronization attacks must have an effective synchronization mechanism, which can be used to track the embedding position [23, 24]. However, most existing algorithms are robust to only one or two of these attacks, and some algorithms even lose robustness to conventional signal processing operations due to their excessive pursuit of robustness to some synchronization attacks. In addition, how to balance the overall performance of the algorithm by optimizing its parameters is also an issue with research significance.
1.2. Contributions
Based on the above introduction, many problems remain to be solved in resisting synchronization attacks. Our contributions in this paper are as follows.
(1) An ISM is developed to effectively search for the appropriate embedding region when embedding watermarks and to automatically track the region where the watermark is located when extracting watermarks. Based on an analysis of the characteristics of synchronization attacks, the shape of the voiced frame is found to remain almost unchanged after TSM, so the proposed ISM takes the sample point with the largest amplitude in the voiced frame as the synchronization mark to identify the embedding region and the extracting region. When embedding watermarks, the appropriate region is searched out from the voiced frames by using ISM, and then the data in the chosen embedding region are further processed to carry watermarks. When extracting watermarks, ISM can automatically track the region where the watermark is located.
(2) GA is utilized to optimize the key algorithm parameter to balance both transparency and robustness. The data in the embedding region are processed by DCT and SVD in turn to obtain the eigenvalue that can be used to carry watermarks. In order to extract the watermark blindly, the eigenvalue is quantized when embedding or extracting the watermark, so the quantization step is an important parameter, which directly affects the transparency and robustness of the algorithm. We propose an optimal audio watermarking algorithm using GA to further enhance the overall performance of this algorithm.
Besides, this algorithm adopts several additional measures to improve the robustness, such as twice even segmentation to the audio, and the operation that embeds the same watermark into three voiced frames.
This paper is organized as follows. Section 1 reviews related work on existing audio watermarking algorithms that can overcome synchronization attacks and then introduces the contributions of the proposed algorithm. Section 2 describes the proposed ISM and shows its implementation flowchart in detail. Section 3 elaborates the principle of the proposed audio watermarking algorithm in four parts: the embedding principle, the extracting principle, the optimization of the quantization step, and the measure to further improve robustness. Section 4 evaluates the performance of the proposed algorithm and compares it with other algorithms from recent years. Finally, Section 5 draws the conclusion and outlines possible future research.
2. ISM for Tracking Embedding Region
Synchronization attacks may cause the position of the data in the audio to shift, which may lead to extraction failure because the location of the watermark cannot be obtained accurately [25]. Therefore, it is very important to design an effective synchronization mechanism for tracking the embedding region. If the data in the voiced frames are modified too much, the audio may not be used normally because of the obvious degradation of audio quality, so synchronization attack usually only modifies the data in redundant frames, but not in voiced frames. TSM attacks by 10% and −10% are applied to an audio clip, respectively, and the waveform comparison is illustrated in Figure 1. It can be seen from the pictures that the absolute positions of the two voiced frames in this audio clip all have shifted on the time axis, but their shapes do not change much in the process of being stretched in Figure 1(b) or compressed in Figure 1(c), so it is relatively safe to conceal watermarks into these voiced frames.
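[Figure 1: Waveform comparison of an audio clip under TSM, panels (a)–(c); the clip is stretched in (b) and compressed in (c).]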
If the watermark is only embedded in voiced frames and the embedding region in the audio is independent of the absolute position of the audio data on the time axis, the algorithm’s ability to withstand synchronization attacks will be greatly improved. As long as the embedding region can be effectively tracked, the watermark will be accurately extracted. Based on the above analysis, an ISM is developed in our study, which can search out the appropriate embedding region when embedding watermarks and can effectively track the extracting region where the watermark is located when extracting watermarks. As shown in Figure 1, the regions between the two red dashed vertical lines are the embedding regions in the two voiced frames, and the marker on each waveform indicates the synchronization mark, which is the position of the tracked sample point with the largest amplitude. It can be observed in Figure 1 that the proposed ISM can accurately track the appropriate embedding region under TSM.
Figure 2 shows the implementation flowchart of the ISM. Assuming that the length of the voiced frame is , the length of the region for embedding watermarks is and . The specific implementation process can be described as follows.
(i) Step 1: extract all the voiced frames with the length of from the audio.
(ii) Step 2: search for the sample point with the largest amplitude in each voiced frame and record its position as .
(iii) Step 3: sample points in the surrounding region of , which are in the range of [], can be used to carry watermarks, where is the starting position in this embedding region, is the ending position, and indicates that the data in brackets should be rounded down.
The embedding region [] has the following three conditions.
(1) If , it indicates that the position of the sample point with the largest amplitude is closer to the head of this voiced frame, then the starting position of the embedding region should be set as , and the ending position is .
(2) If , this condition shows that the position is closer to the end of this voiced frame, then the starting position should be set as , and the ending position is .
(3) If , it indicates that the position is in the middle part of this voiced frame, then the starting position of the region should be set as , and the ending position is .
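The three cases can be illustrated with a minimal Python sketch. The symbols and exact thresholds are not legible in this text, so the names `ism_region`, `frame`, and `region_len`, the use of the absolute amplitude, the 0-based indexing, and the clamping rule are all assumptions that follow the verbal description rather than the authors' formulas.

```python
def ism_region(frame, region_len):
    """Return (start, end), inclusive 0-based indices of the embedding region.

    Hypothetical helper: the region is centred on the largest-amplitude
    sample and clamped so that it always stays inside the voiced frame.
    """
    n = len(frame)
    if not 0 < region_len <= n:
        raise ValueError("region_len must be between 1 and len(frame)")
    # Synchronization mark: position of the sample with the largest |amplitude|.
    peak = max(range(n), key=lambda i: abs(frame[i]))
    half = region_len // 2  # rounded down, as in the description
    if peak < half:
        start = 0                    # case 1: peak near the head of the frame
    elif peak + (region_len - half) > n:
        start = n - region_len       # case 2: peak near the end of the frame
    else:
        start = peak - half          # case 3: peak in the middle part
    return start, start + region_len - 1


if __name__ == "__main__":
    frame = [0.0, 0.1, -0.2, 0.3, 0.9, 0.4, -0.1, 0.0, 0.05, 0.0]
    print(ism_region(frame, 4))
```

Because the region is defined relative to the peak inside the voiced frame, it does not depend on where the frame sits on the time axis, which is the property the ISM relies on.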
3. Principle of the Watermarking Algorithm
3.1. Principle of Embedding Watermarks
The embedding algorithm mainly includes the following several parts. Firstly, the proposed ISM is used to search for the best embedding region. Then, DCT is performed on the data in the embedding region to determine the frequency range for carrying the watermark. Finally, the DCT coefficients in the frequency range are processed by SVD to conceal the watermark by the quantization method. Figure 3 shows the principle diagram of the embedding algorithm.
Suppose that the binary watermark can be expressed in the following formula, where and is the length of . In order to improve the security, should be encrypted before it is concealed into the audio.
Apply the logistic mapping formula to generate a chaotic sequence with the same size as , as shown in the formulas, where , is the initial value when , and is a threshold to obtain . The logistic system will be in chaos when .
An exclusive OR operation is performed on and to obtain the encrypted information , as shown in formula (3), where stands for the exclusive OR operator. The triple key is the unique correct key to decrypt .
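As an illustration of this encryption step, the following sketch uses the standard logistic map x(k+1) = mu * x(k) * (1 - x(k)), binarizes it with a threshold, and XORs the result with the watermark bits. The function names, the key values, and the convention that values above the threshold map to 1 are assumptions; the source formulas are not legible here. The map is commonly reported to be chaotic for mu between roughly 3.57 and 4.

```python
def logistic_bits(x0, mu, threshold, n):
    """Generate n key bits from the logistic map (assumed binarization rule)."""
    x = x0
    bits = []
    for _ in range(n):
        x = mu * x * (1.0 - x)              # logistic map iteration
        bits.append(1 if x > threshold else 0)
    return bits


def xor_bits(a, b):
    """Bitwise exclusive OR of two equal-length bit lists."""
    if len(a) != len(b):
        raise ValueError("bit sequences must have the same length")
    return [p ^ q for p, q in zip(a, b)]


if __name__ == "__main__":
    key = (0.37, 3.99, 0.5)                 # hypothetical (initial value, mu, threshold)
    watermark = [1, 0, 1, 1, 0, 0, 1, 0]
    keystream = logistic_bits(*key, n=len(watermark))
    encrypted = xor_bits(watermark, keystream)
    decrypted = xor_bits(encrypted, keystream)
    print(encrypted, decrypted == watermark)
```

Since XOR is its own inverse, applying the same keystream a second time recovers the watermark, which is why the three key values must be shared between the embedder and the extractor.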
Convert into a two-dimensional array shown in formula (4), where and . Suppose that is the original audio with sample points, as expressed in the formula, where is the amplitude of the th sample point. Divide into audio fragments, namely, (), and each audio fragment has sample points, .
Then, can be divided into two parts, namely, and , where will be used for carrying the watermark, and does not participate in the embedding process. can be expressed in formula (6), and its size is .
The watermark will be embedded into , that is to say, each audio fragment needs to carry bits watermark. To prevent the audio quality from decreasing too much, will be divided into several audio frames with the length of , and only the voiced frame with the largest energy is used to carry the watermark. The proposed ISM is used to track the appropriate embedding region [] in .
We take the process of embedding bits of binary watermark into as an example to illustrate the core embedding scheme. Figures 4 and 5 show the main data and the flowchart of this core embedding scheme.
In our study, DCT is used to determine the frequency range where the watermark is located. Apply DCT on the data between [] to obtain the DCT coefficient and then get the intermediate frequency coefficient from in the range [, ], where is the starting position and is the length of . The frequency range [] for embedding the watermark can be calculated according to the formulas, where is the maximum cutoff frequency of the audio, and its value is usually half of the sampling rate. Divide into data blocks, namely, (), and the length of the data blocks can be calculated as . Apply SVD on to obtain the eigenvalue , as shown in formula (8), where is a single element matrix, is an orthogonal matrix with the dimension of , and is a row matrix with the dimension of in which only the first element is nonzero and all other elements are equal to 0.
According to the stability characteristics of SVD, the eigenvalue usually does not change greatly when changes slightly, so one bit binary watermark can be hidden into one eigenvalue. In order to realize blind extraction, will be obtained by quantization, where is the quantization step. If the binary watermark is “0,” will be modified to be an even number; otherwise, will be set as an odd number. The embedding rule is described in Table 1.

Then, the modified eigenvalue is shown in the following formula:
The modified data block can be reconstructed according to the following formula:
Repeat the above process to modify all eigenvalues, and bits binary watermark can be concealed into all (). According to the process described above, each row of the binary data in can be concealed into each (); finally, will be completely concealed into . The embedding process can be described as follows.
(i) Step 1: convert into with the size of .
(ii) Step 2: divide the original audio into audio fragments with the same length.
(iii) Step 3: divide into audio frames with the length of and find out the voiced frame .
(iv) Step 4: search for the embedding region [] by using ISM in .
(v) Step 5: apply DCT on the data in the embedding region to obtain .
(vi) Step 6: divide into data blocks and perform SVD on to obtain .
(vii) Step 7: embed bits binary watermark into according to the embedding rule in Table 1.
(viii) Step 8: repeat Step 4 to Step 7 until all watermark bits are concealed.
(ix) Step 9: recombine all audio fragments to recover the carried audio .
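The core of Steps 6 and 7 can be sketched as parity quantization of the single singular value of each block. For a row vector, that singular value equals its Euclidean norm, so no explicit SVD call is needed. Table 1 is not legible in this text, so the exact rule used below (floor quantization, even index for bit 0, odd index for bit 1, reconstruction at the middle of the quantization cell) is an assumption, as are the names `embed_bits`, `extract_bits`, and `step`; the DCT and inverse DCT around this step are omitted.

```python
import numpy as np


def embed_bits(coeffs, bits, step):
    """Hide one bit per block of DCT coefficients (assumed parity rule)."""
    blocks = np.array_split(np.asarray(coeffs, dtype=float), len(bits))
    marked = []
    for block, bit in zip(blocks, bits):
        sigma = np.linalg.norm(block)   # only non-zero singular value of a row vector
        if sigma == 0.0:
            raise ValueError("cannot embed into an all-zero block")
        q = int(np.floor(sigma / step))
        if q % 2 != bit:                # even index carries 0, odd index carries 1
            q += 1
        sigma_new = (q + 0.5) * step    # centre of the cell, for error tolerance
        marked.append(block * (sigma_new / sigma))  # rescale: same U and V, new sigma
    return np.concatenate(marked)


def extract_bits(coeffs, n_bits, step):
    """Blind extraction: only the block count and quantization step are needed."""
    blocks = np.array_split(np.asarray(coeffs, dtype=float), n_bits)
    return [int(np.floor(np.linalg.norm(b) / step)) % 2 for b in blocks]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coeffs = rng.normal(size=512)       # stand-in for mid-frequency DCT coefficients
    bits = [1, 0, 0, 1, 1, 0, 1, 0]
    marked = embed_bits(coeffs, bits, step=0.5)
    print(extract_bits(marked, len(bits), step=0.5) == bits)
```

Under this rule a bit survives any distortion that changes the block norm by less than half a quantization step, which is why a larger step improves robustness while lowering transparency.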
3.2. Principle of Extracting Watermarks
The extracting algorithm is the inverse process of the embedding algorithm, and its principle is shown in Figure 6, in which the “core extracting scheme” is the most important part of the whole extracting algorithm. The process of extracting the watermark can be described as follows.
Figure 7 shows the flowchart of the core extracting scheme. In particular, the key parameters in the extracting algorithm should be consistent with the corresponding parameters in the embedding algorithm, including , , , , , , and .
3.3. Optimization of the Quantization Step
From the watermark embedding principle described above, the quantization step is an important parameter, which directly affects the transparency and robustness of the algorithm. In order to balance the algorithm performance, GA is used to search for the optimal quantization step. The fitness function is constructed with SNR and BER as shown in the formula, where is the lower threshold of transparency. The selected quantization step should not make the algorithm transparency lower than dB. and can be defined in formulas, where A and represent the original audio and the carried audio, respectively, and and denote the original watermark and the extracted watermark. The population consists of chromosomes, and each chromosome with the length of , which is encoded by using a binary encoding approach, can be converted into the quantization step. Formula (14) can be used to describe the transformation relationship between each chromosome and each quantization step (), where means converting from binary to decimal. The detailed process can be described as follows.
(i) Step 1: set the parameters, including the crossover probability and the mutation probability , and then generate an initial population .
(ii) Step 2: calculate the quantization step according to formula (14), execute the embedding algorithm proposed in Section 3.1, and then subject the carried audio to some attacks.
(iii) Step 3: pick out all qualified chromosomes when and then execute the extracting algorithm proposed in Section 3.2.
(iv) Step 4: calculate the fitness value according to formula (12) and obtain the best chromosome with the largest fitness value.
(v) Step 5: perform selection operation by roulette to get the transition population .
(vi) Step 6: perform crossover operation on each pair of chromosomes except for the best chromosome to obtain the next transition population .
(vii) Step 7: perform mutation operation on each chromosome except for the best chromosome to obtain the next generation population .
(viii) Step 8: repeat Step 2 to Step 7 until the global optimal chromosome appears.
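The search loop in Steps 1 to 8 can be sketched as a small binary-coded GA with roulette selection, single-point crossover, bit-flip mutation, and an exempt best chromosome. This is only an outline under stated assumptions: the linear decoding in `decode` stands in for formula (14), the `fitness` callable stands in for the SNR and BER based fitness (which would require running the embedding, attack, and extraction steps), a fixed generation count replaces the stopping rule, and all parameter values are hypothetical.

```python
import random


def decode(chrom, lo, hi):
    """Map a binary chromosome to a step in [lo, hi] (assumed linear mapping)."""
    value = int("".join(map(str, chrom)), 2)       # binary to decimal
    return lo + (hi - lo) * value / (2 ** len(chrom) - 1)


def ga_search(fitness, lo, hi, n_bits=10, pop_size=20,
              generations=30, pc=0.8, pm=0.05, seed=0):
    """Return the best quantization step found; fitness is maximized."""
    rng = random.Random(seed)
    score = lambda c: fitness(decode(c, lo, hi))
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=score)
    for _ in range(generations):
        scores = [score(c) for c in pop]
        base = min(scores)
        weights = [s - base + 1e-9 for s in scores]  # roulette needs positive weights
        pop = [list(c) for c in rng.choices(pop, weights=weights, k=pop_size - 1)]
        for i in range(0, len(pop) - 1, 2):          # single-point crossover on pairs
            if rng.random() < pc:
                cut = rng.randint(1, n_bits - 1)
                pop[i][cut:], pop[i + 1][cut:] = pop[i + 1][cut:], pop[i][cut:]
        for chrom in pop:                            # bit-flip mutation
            for j in range(n_bits):
                if rng.random() < pm:
                    chrom[j] ^= 1
        pop.append(list(best))                       # best chromosome is kept unchanged
        best = max(pop, key=score)
    return decode(best, lo, hi)


if __name__ == "__main__":
    # Toy fitness with a known optimum at 0.3, used only to exercise the loop.
    toy = lambda step: -(step - 0.3) ** 2
    print(ga_search(toy, 0.0, 1.0))
```

Keeping the best chromosome out of crossover and mutation guarantees that the best fitness never decreases from one generation to the next.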
3.4. Measure to Further Improve Robustness
In order to improve the robustness of the algorithm, the same row of the binary watermark can be repeatedly embedded into three voiced frames with the highest energy in . When extracting watermarks, three groups of binary watermarks are extracted from the three voiced frames, respectively, and compared bit by bit to obtain a more accurate group of binary watermarks according to formula (15), where , , and are the three groups of binary watermarks extracted from three voiced frames, respectively.
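A bit-by-bit comparison of three extracted copies is commonly implemented as a majority vote. Formula (15) is not legible in this text, so the rule below is an assumption, and the name `majority_vote` is hypothetical.

```python
def majority_vote(w1, w2, w3):
    """Combine three extracted bit sequences: each output bit is the value
    that appears in at least two of the three copies."""
    if not (len(w1) == len(w2) == len(w3)):
        raise ValueError("the three watermark copies must have the same length")
    return [1 if a + b + c >= 2 else 0 for a, b, c in zip(w1, w2, w3)]


if __name__ == "__main__":
    copy1 = [1, 0, 1, 0]
    copy2 = [1, 1, 0, 0]
    copy3 = [0, 1, 1, 0]
    print(majority_vote(copy1, copy2, copy3))
```

With this rule a bit is recovered correctly whenever at most one of the three voiced frames yields a wrong value for it.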
4. Performance Evaluation
In this section, the performance of the proposed algorithm is tested. The quality of the audio is evaluated in three ways: SNR, the objective difference grade (ODG), which is one of the output values obtained from the perceptual evaluation of audio quality (PEAQ), and the mean opinion score (MOS). According to the standard of IFPI, SNR should be greater than 20 dB for the audio to have good transparency. BER is used to evaluate the algorithm robustness. Generally, a small BER means that the algorithm has strong robustness to various attacks. According to the standard of IFPI, the BER of the extracted watermark should be no more than 20% when the carried audio is attacked. NC is used to compare the similarity between the original watermark and the extracted watermark, as shown in formula (16). When NC is close to 1, the original watermark is very similar to the extracted watermark.
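The three numerical measures can be computed as in the sketch below. The definitions used are the common textbook ones (signal-to-noise ratio in dB, bit error rate in percent, normalized correlation); the paper's own formulas are not legible in this text, so treating them as identical is an assumption, and the function names are hypothetical.

```python
import numpy as np


def snr_db(original, carried):
    """SNR in dB between the original audio and the watermarked (carried) audio."""
    original = np.asarray(original, dtype=float)
    carried = np.asarray(carried, dtype=float)
    noise = original - carried
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))


def ber_percent(w, w_ext):
    """Percentage of watermark bits that differ after extraction."""
    w = np.asarray(w)
    w_ext = np.asarray(w_ext)
    return 100.0 * np.mean(w != w_ext)


def nc(w, w_ext):
    """Normalized correlation between original and extracted watermark bits."""
    w = np.asarray(w, dtype=float)
    w_ext = np.asarray(w_ext, dtype=float)
    return np.sum(w * w_ext) / (np.sqrt(np.sum(w ** 2)) * np.sqrt(np.sum(w_ext ** 2)))


if __name__ == "__main__":
    audio = np.array([1.0, 2.0, 3.0, 4.0])
    print(snr_db(audio, audio + 0.1))
    print(ber_percent([1, 0, 1, 1], [1, 1, 1, 0]))
    print(nc([1, 0, 1, 1], [1, 1, 1, 0]))
```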
The experimental parameters are as follows. (1) Algorithm parameters: , , , , , , , , , , and . (2) The watermark is a binary image as shown in Figure 8(p), and its size is . (3) PEAQ is evaluated with the basic version released by the TSP Lab at McGill University. (4) The tested audio comprises twenty 64-second audio signals in WAV format, sampled at 44100 Hz with 16-bit resolution, including popular songs, classical songs, rock songs, and dialogues.
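[Figure 8: Extracted watermark images, panels (a)–(o), including (a) without attack and (k)–(m) under lowpass filtering with cutoff frequencies of 4 kHz, 8 kHz, and 12 kHz; (p) original watermark image.]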
The detailed experimental environment and software are described as follows: (1) computer system: 64-bit Microsoft Windows 10; (2) programming language: Matlab 2016R; (3) software for processing audio signals: Cool Edit Pro V2.1.
4.1. Transparency and Capacity
The payload capacity of this algorithm can be calculated according to the following formula, where is the time length of the audio that carries the watermark. In our study, is equal to 64 seconds, so the payload capacity is 64 bps. The average values of the payload capacity (bps), SNR (dB), ODG, and MOS of the audio and of the BER (%) and NC of the extracted watermark are listed in Table 2.
In our test, all the audio signals are processed with each of the four algorithms listed in Table 2 to obtain four groups of carried audio signals, which are presented to 20 listeners (10 males and 10 females, aged between 18 and 60 years old) in order to obtain MOS scores. Table 2 shows that this algorithm has good transparency because the average SNR is up to 25.96 dB, ODG is −0.99, and MOS is 4.5 while the payload capacity is 64 bps, which is higher than the standard of IFPI. Most importantly, BER is equal to 0, and NC is equal to 1, which indicates that the watermark is extracted without error when there is no attack, so the extracted watermark image in Figure 8(a) is the same as the original image shown in Figure 8(p). Compared with the algorithms in paper [16] and paper [23], the proposed algorithm has a larger payload capacity and better transparency, and its robustness is stronger than that in paper [23]. Although the payload capacity of this algorithm is not as high as that of the algorithm in paper [22], its transparency is superior. Besides, the proposed algorithm is more robust than the other three algorithms, which will be discussed in Section 4.2.
Figure 9 shows the waveforms of the audio before and after embedding the watermark, without any attack (only an audio clip lasting about 3 seconds is displayed to show the details clearly), and the corresponding spectrograms are shown in Figure 10. There is no obvious change in the waveform or the spectrogram of the audio after embedding the watermark, which indicates that this algorithm has good transparency.
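[Figure 9: Waveforms of the audio before and after embedding the watermark, panels (a) and (b).]
[Figure 10: Spectrograms of the audio before and after embedding the watermark, panels (a) and (b).]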
4.2. Robustness
This section evaluates the robustness of the algorithm by BER and NC under various conventional signal processing operations and synchronization attacks and compares the experimental results with the algorithms in three related papers.
4.2.1. Conventional Signal Processing Operations
Conventional signal processing operations are the most common attacks encountered by audio in the process of being used and spread, and they may cause damage or even loss of the watermark hidden in the audio, so the watermarking algorithm must have strong robustness to withstand these attacks. These operations mainly include the following types in Table 3.

BER (%) and NC of the extracted watermark are averaged and listed in Table 4 under these signal processing operations. The extracted images whose NC values are closest to the average value are shown in Figure 8. According to the experimental results in Figure 8 and Table 4, this algorithm has strong robustness against conventional signal processing operations, which can be summarized as follows.

When resisting noise corruption with 30 dB and 40 dB, requantization, lowpass filtering with cutoff frequency of 12 kHz, and echo addition with 50 ms, the extracted images are the same as the original image. BER values are equal to 0, and NC values are equal to 1, which indicates that the proposed algorithm has excellent robustness against these attacks.
When resisting MP3 compression, noise corruption with 20 dB, lowpass filtering with cutoff frequency of 8 kHz, resampling, echo addition with 50 ms, and amplitude scaling, the extracted images are similar to the original image. BER values are below 1.28%, and NC values are above 0.9740, so the proposed algorithm has good robustness against these attacks.
When resisting lowpass filtering with cutoff frequency of 4 kHz, the former half of the extracted watermark image is very clear, while another half is completely blurred, NC is 0.7458, and BER is 25.64%, as shown in Figure 8(k). The reason for this phenomenon is mainly related to the algorithm parameters, including the length of the data by DCT, the region [,] for embedding the watermark, and the sampling rate of the audio. In our experiment, sampling rate is 44.1 kHz, , , and , so the embedding frequency range [] can be calculated as follows.
It can be seen from formulas (18) and (19) that the watermark is concealed in the frequency range of [1.62, 7.13] kHz in the audio, so removing frequency components above 7.13 kHz or below 1.62 kHz has almost no effect on the watermark. In particular, lowpass filtering with a cutoff frequency higher than 7.13 kHz leaves the watermark largely intact.
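The relation between DCT coefficient positions and audio frequencies can be sketched as follows. For a DCT-II of length N on audio sampled at rate fs, coefficient k corresponds to roughly k * fs / (2N) Hz. The parameter values in the example are hypothetical and are not the paper's settings, which are not legible in this text; the helper names and the simple estimate of how much of the band a lowpass filter leaves untouched are also assumptions.

```python
def dct_band_khz(start, length, n_dct, sample_rate):
    """Frequency range (kHz) covered by DCT coefficients start .. start+length."""
    hz_per_bin = sample_rate / (2.0 * n_dct)   # DCT-II bins span 0 .. fs/2
    return (start * hz_per_bin / 1000.0,
            (start + length) * hz_per_bin / 1000.0)


def surviving_fraction(band_khz, cutoff_khz):
    """Share of the embedding band that lies below a lowpass cutoff."""
    low, high = band_khz
    kept = min(high, cutoff_khz) - low
    return max(0.0, min(1.0, kept / (high - low)))


if __name__ == "__main__":
    # Hypothetical settings: 1024-point DCT, coefficients 100..400, 44.1 kHz audio.
    print(dct_band_khz(start=100, length=300, n_dct=1024, sample_rate=44100))
    # Band reported in the text, checked against the three tested cutoffs.
    for cutoff in (4.0, 8.0, 12.0):
        print(cutoff, surviving_fraction((1.62, 7.13), cutoff))
```

For the reported band, a 4 kHz cutoff leaves a bit under half of the embedding band untouched, while cutoffs of 8 kHz and 12 kHz lie entirely above it.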
When resisting lowpass filtering with cutoff frequency of 8 kHz, the extracted watermark is relatively clear, NC is 0.9990, and BER is 0.05%. However, because the upper limit of the embedding region is 7.13 kHz, which is very close to the cutoff frequency (8 kHz) of the filter, and the lowpass filter has 3 dB amplitude attenuation near the cutoff frequency, there are still a few noise points in the extracted watermark image shown in Figure 8(l). When resisting lowpass filtering with cutoff frequency of 12 kHz, the extracted watermark is the same as the original image, as shown in Figure 8(m), NC is equal to 1, and BER is equal to 0. When the audio is subjected to lowpass filtering with cutoff frequency of 4 kHz, the frequency components above 4 kHz in the audio will be removed, so the watermark in the frequency range of [1.62, 4] kHz can be extracted (the former half of the image in Figure 8(k) is very clear), while the watermark in the frequency range of [4, 7.13] kHz cannot be extracted (another half in Figure 8(k) is completely blurred). In practical application, the cutoff frequency of the lowpass filter and the frequency range of the embedding region should be staggered by adjusting the algorithm parameters to prevent watermarks from being damaged.
4.2.2. Synchronous Attack
Synchronization attacks are the most challenging type of attack in the research of robust watermarking algorithms.
In Table 5, there are four kinds of synchronization attacks with different strengths for testing the robustness, including TSM, PSM, jittering, and random cropping. After the audio is subjected to the above synchronization attacks, BER (%) and NC values of the extracted watermark are, respectively, averaged and listed in Tables 6–9. The extracted images whose NC values are closest to the average value are shown in Figures 11–14.





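[Figure 11: Watermark images extracted under TSM with strengths from −30% to +30%, panels (a)–(t).]
[Figure 12: Watermark images extracted under PSM with strengths from −5% to 5%, panels (a)–(j).]
[Figure 13: Watermark images extracted under jittering with strengths from 1/100000 to 1/500, panels (a)–(j); panel (a) corresponds to the maximum strength (1/500).]
[Figure 14: Watermark images extracted under random cropping with different strengths, panels (a)–(l).]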
(1) TSM. Table 6 shows the average BER (%) and NC of the extracted watermark under TSM with different strengths from −30% to +30%. It can be seen from the experimental results in Table 6 and Figure 11 that this algorithm has excellent robustness against TSM attacks with different strengths. All BER values are below 15.31%, which is within the 20% limit of the IFPI standard. All NC values are above 0.8267, so the content of every extracted image can still be distinguished.
(2) PSM. When the audio is subjected to PSM, its playing time will not change, but the position and shape of the voiced frame will change slightly. Table 7 shows the average BER (%) and NC of the extracted watermark under PSM with different strengths from −5% to 5%. Although the extracted images are not very clear in Figure 12, their content can still be distinguished.
(3) Jittering. Table 8 shows the average BER (%) and NC of the extracted watermark under jittering with different strengths from 1/100000 to 1/500. The extracted images are shown in Figure 13. As shown in Figure 13(a), under the maximum attack strength (1/500), the extracted image contains more noise points, but its main feature can still be identified, in which NC is 0.9303 and BER is 3.68%. As the attack strength weakens, the extracted images become more and more similar to the original image in Figure 8(p), and BER values become smaller and smaller, so this proposed algorithm has strong robustness against jittering.
(4) Random Cropping. The average BER (%) and NC of the extracted watermark under random cropping with different strengths are shown in Table 9. From the experimental results, the extracted images in Figure 14 all are relatively clear, NC values are above 0.9754, and BER values are below 1.90%, which all show that this proposed algorithm is robust against random cropping.
4.2.3. Comparative Analysis of Robustness
In order to compare the robustness of the algorithm with the related algorithms in papers [16], [22], and [23], Table 10 lists the BER (%) values of these algorithms under signal processing operations and synchronization attacks. Based on the experimental results in Table 2 and Table 10, the four algorithms are compared as follows. Compared with the algorithm in paper [16], the proposed algorithm has larger payload capacity, higher transparency, and better robustness against synchronization attacks and conventional signal processing operations, except for some attacks such as MP3 compression with 64 kbps, lowpass filtering with cutoff frequency of 4 kHz, and PSM. According to the embedding principle of this algorithm, the frequency band where the watermark is located can be changed by modifying the algorithm parameters in practical application, so the proposed algorithm can in practice also overcome lowpass filtering with cutoff frequency of 4 kHz. In the following comparative analysis with the other two algorithms, this viewpoint will not be reiterated. The robustness of the proposed algorithm is far superior to the algorithms in paper [22] and paper [23], although the payload capacity is slightly lower than that in paper [22]. Synchronization attacks may change the overall structure of the audio, but they have little effect on the shape of the voiced frame. The proposed ISM can accurately track the position of the largest amplitude in the voiced frame to determine the embedding region where the watermark is located. Therefore, this algorithm has strong robustness against various malicious attacks.

4.3. Security
The watermark hidden in audio is protected by encryption technology and information hiding technology, so it is necessary to analyse the security of this algorithm from the key space constructed by encryption technology and information hiding technology.
The proposed algorithm uses a triple key to encrypt the watermark and seven key parameters (, , , , , , ) to conceal the watermark. and are taken in the real field, so this algorithm has an infinite key space in theory. In practice, they are limited by the word length, so their key space is finite. In our test, the computer system is 64-bit, , , and all are 16-bit, and , , and are 10-bit, so the key space to encrypt the watermark can be calculated as , and the key space to conceal the watermark is . From the above analysis, even if an attacker knows the principle of the algorithm, it is difficult to obtain the watermark as long as these key parameters remain unknown.
4.4. Complexity
The complexity of an algorithm is an important index for evaluating its performance. It is usually measured by the computational cost of embedding and extracting the watermark. In our experiment, the average time for extracting the watermark is 0.8544 s and that for embedding the watermark without GA is 1.6246 s. When GA is used to search for the best algorithm parameter, the embedding time depends on the evolution time of GA. GA enables the proposed algorithm to achieve a good balance between transparency and robustness, but it also brings a large computational cost. When embedding the watermark, the average time for searching the embedding region in a voiced frame is 0.037 ms. When extracting the watermark, the average time for tracking the extracting region in a voiced frame is 0.036 ms. Thus, the proposed ISM adds very little computational cost when locating the synchronization marks.
5. Conclusions
In this study, we found that TSM lengthens or shortens the playing time of the audio but leaves the shape of the voiced frame essentially unchanged. Therefore, an ISM is developed that searches for the embedding region where the watermark is located, taking the sample point with the largest amplitude in the voiced frame as the synchronization mark. GA is utilized to optimize the key algorithm parameter to balance transparency and robustness. Combining the “energy concentration” characteristic of DCT with the stability of SVD, a robust and blind audio watermarking algorithm with ISM and GA is proposed for overcoming malicious synchronization attacks and conventional signal processing operations.
The following measures are taken to improve the robustness of the algorithm. Firstly, the proposed ISM accurately tracks the region where the watermark is located: even if the structure of the audio changes slightly, the ISM can still find the synchronization mark in the voiced frame and thereby track the region used to embed and extract the watermark. Secondly, GA is utilized to optimize the key algorithm parameter to balance transparency and robustness. Thirdly, the audio is divided evenly twice to avoid drift of the embedding region caused by changes in the audio structure. Finally, the same watermark is embedded in three voiced frames, which is equivalent to protecting the watermark with a triple repetition code. The experimental results confirm that the proposed algorithm has excellent robustness at a payload capacity of 64 bps; it not only withstands conventional signal processing operations but also resists TSM, PSM, jittering, and random cropping. In particular, it withstands TSM with strength from −30% to +30%.
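To illustrate why embedding the same watermark in three voiced frames helps, the sketch below decodes a triple repetition code by per-bit majority vote. Majority voting is the standard decoder for a repetition code; whether the authors combine the three copies in exactly this way is an assumption of the sketch, and the function name and sample bits are hypothetical.

```python
def majority_vote(copies):
    """Decode a repetition code by per-bit majority vote over an odd number of copies."""
    if not copies or len(copies) % 2 == 0:
        raise ValueError("an odd, non-zero number of copies is required")
    length = len(copies[0])
    if any(len(copy) != length for copy in copies):
        raise ValueError("all copies must have the same length")
    # A bit decodes to 1 when more than half of the copies carry a 1.
    return [1 if 2 * sum(bits) > len(copies) else 0 for bits in zip(*copies)]


# Hypothetical watermark and three extracted copies, each with one bit error
# at a different position.
sent = [1, 0, 1, 1, 0, 0, 1, 0]
copy_a = [1, 1, 1, 1, 0, 0, 1, 0]  # error at bit 1
copy_b = [1, 0, 1, 1, 1, 0, 1, 0]  # error at bit 4
copy_c = [1, 0, 1, 1, 0, 0, 0, 0]  # error at bit 6

recovered = majority_vote([copy_a, copy_b, copy_c])
print(recovered == sent)
```

In this example every copy contains an error, yet the vote recovers the watermark because no bit position is corrupted in more than one copy. The price is that three frames carry one watermark, which is consistent with the moderate payload capacity reported above.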
Although the proposed algorithm has excellent robustness against TSM, jittering, random cropping, and various conventional signal processing operations, its results under PSM are not good enough, mainly because PSM shifts the synchronization mark in the voiced frame, which leads to error bits in the extracted watermark. For the same reason, the algorithm is not sufficiently robust against attacks that deliberately distort the peak amplitude points to remove the synchronization mark; this will be studied further in our future work. In addition, GA is used to optimize the key algorithm parameter, which helps to balance transparency and robustness. However, GA needs a long evolution time to search for the optimal algorithm parameters, which greatly increases the computational cost, so this algorithm is not suitable for applications with strict time requirements. In future research, we will strive to enhance the security and the robustness against more types of synchronization attacks.
Data Availability
The data used to support the findings of the study are included within the article and were obtained from a public platform.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was funded by the High-Level Talent Scientific Research Foundation of Jinling Institute of Technology, China (grant no. jitb201918), and the National Natural Science Foundation of China (grant no. 11601202).
References
[1] M. J. Hwang, J. S. Lee, M. S. Lee, and H. G. Kang, “SVD based adaptive QIM watermarking on stereo audio signals,” IEEE Transactions on Multimedia, vol. 20, no. 1, pp. 45–54, 2017.
[2] Y. Hong and J. Kim, “Autocorrelation modulation-based audio blind watermarking robust against high efficiency advanced audio coding,” Applied Sciences, vol. 9, no. 14, pp. 1–17, 2019.
[3] B. Lei, I. Yann Soon, F. Zhou, Z. Li, and H. Lei, “A robust audio watermarking scheme based on lifting wavelet transform and singular value decomposition,” Signal Processing, vol. 92, no. 9, pp. 1985–2001, 2012.
[4] Z. Zhang, M. Zhang, and L. Wang, “Reversible image watermarking algorithm based on quadratic difference expansion,” Mathematical Problems in Engineering, vol. 2020, pp. 1–8, 2020.
[5] A. Merrad and S. Saadi, “Blind speech watermarking using hybrid scheme based on DWT/DCT and subsampling,” Multimedia Tools and Applications, vol. 77, no. 20, pp. 27589–27615, 2018.
[6] G. Hua, J. Huang, Y. Q. Shi, J. Goh, and V. L. L. Thing, “Twenty years of digital audio watermarking-A comprehensive review,” Signal Processing, vol. 128, pp. 222–242, 2016.
[7] W. Jiang, X. Huang, and Y. Quan, “Audio watermarking algorithm against synchronization attacks using global characteristics and adaptive frame division,” Signal Processing, vol. 162, pp. 153–160, 2019.
[8] Q. Qian, H. Wang, X. Sun, Y. Cui, H. Wang, and C. Shi, “Speech authentication and content recovery scheme for security communication and storage,” Telecommunication Systems, vol. 67, no. 4, pp. 635–649, 2018.
[9] M. A. Nematollahi, C. Vorakulpipat, H. Gamboa-Rosales, F. J. Martinez-Ruiz, and J. I. De la Rosa-Vargas, “Digital speech watermarking based on linear predictive analysis and singular value decomposition,” Proceedings of the National Academy of Sciences, India Section A: Physical Sciences, vol. 87, no. 3, pp. 433–446, 2017.
[10] D. Singh and S. K. Singh, “DWT-SVD and DCT based robust and blind watermarking scheme for copyright protection,” Multimedia Tools and Applications, vol. 76, no. 11, pp. 13001–13024, 2017.
[11] H.-T. Hu and L.-Y. Hsu, “Incorporating spectral shaping filtering into DWT-based vector modulation to improve blind audio watermarking,” Wireless Personal Communications, vol. 94, no. 2, pp. 221–240, 2017.
[12] A. A. Attari and A. A. B. Shirazi, “Robust audio watermarking algorithm based on DWT using Fibonacci numbers,” Multimedia Tools and Applications, vol. 77, no. 19, pp. 25607–25627, 2018.
[13] P. K. Dhar and T. Shimamura, “Blind audio watermarking in transform domain based on singular value decomposition and exponential-log operations,” Radioengineering, vol. 26, no. 2, pp. 552–561, 2017.
[14] Z. Liu, Y. Huang, and J. Huang, “Patchwork-based audio watermarking robust against de-synchronization and recapturing attacks,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 5, pp. 1171–1180, 2019.
[15] P. Hu, Z. Yi, D. Peng, and Y. Xiang, “Robust time-spread echo watermarking using characteristics of host signals,” Electronics Letters, vol. 52, no. 1, pp. 56, 2016.
[16] H.-T. Hu, J.-R. Chang, and S.-J. Lin, “Synchronous blind audio watermarking via shape configuration of sorted LWT coefficient magnitudes,” Signal Processing, vol. 147, pp. 190–202, 2018.
[17] S. Xiang and J. Huang, “Histogram-based audio watermarking against time-scale modification and cropping attacks,” IEEE Transactions on Multimedia, vol. 9, no. 7, pp. 1357–1372, 2007.
[18] H.-T. Hu and J.-R. Chang, “Efficient and robust frame-synchronized blind audio watermarking by featuring multilevel DWT and DCT,” Cluster Computing, vol. 20, no. 1, pp. 805–816, 2017.
[19] X. Y. Wang, Q. L. Shi, S. M. Wang, and H. Y. Yang, “A blind robust digital watermarking using invariant exponent moments,” AEU-International Journal of Electronics and Communications, vol. 70, no. 4, pp. 416–426, 2016.
[20] X.-C. Yuan, C.-M. Pun, and C. L. Philip Chen, “Robust mel-frequency cepstral coefficients feature detection and dual-tree complex wavelet transform for digital audio watermarking,” Information Sciences, vol. 298, pp. 159–179, 2015.
[21] X.-G. Wang, P.-P. Niu, H.-Y. Yang, Y. Zhang, and T.-X. Ma, “A robust audio watermarking scheme using higher-order statistics in empirical mode decomposition domain,” Fundamenta Informaticae, vol. 130, no. 4, pp. 467–490, 2014.
[22] S.-T. Chen, H.-N. Huang, and C.-Y. Hsu, “Wavelet-domain audio watermarking using optimal modification on low-frequency amplitude,” IET Signal Processing, vol. 9, no. 2, pp. 166–176, 2015.
[23] L. Li and X. Fang, “Audio watermarking robust against playback speed modification,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E94-A, no. 12, pp. 2889–2893, 2011.
[24] Z. H. Liu, D. Luo, J. W. Huang, J. Wang, and C. D. Qi, “Tamper recovery algorithm for digital speech signal based on DWT and DCT,” Multimedia Tools and Applications, vol. 76, no. 10, pp. 12481–12504, 2017.
[25] M. A. Nematollahi, S. A. R. Al-Haddad, S. Doraisamy, and H. Gamboa-Rosales, “Speaker frame selection for digital speech watermarking,” National Academy Science Letters, vol. 39, no. 3, pp. 197–201, 2016.
Copyright
Copyright © 2020 Qiuling Wu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.