Abstract

This paper presents a gain invariant speech watermarking technique based on quantization of the Lp-norm. In this scheme, first, the original speech signal is divided into frames. Second, each frame is divided into two vectors based on odd and even indices. Third, quantization index modulation (QIM) is used to embed the watermark bits into the ratio of the Lp-norms of the odd- and even-indexed vectors. Finally, the Lagrange optimization technique is applied to minimize the embedding distortion. By applying a statistical analytical approach, the embedding distortion and error probability are estimated. Experimental results not only confirm the accuracy of the derived statistical analytical approach but also prove the robustness of the proposed technique against common signal processing attacks.

1. Introduction

Hiding a secret message in an object has a long history, possibly dating back thousands of years. The rapid growth of computer networks and digital communication has inspired the idea of digital data hiding. Digital watermarking, as a major branch of data hiding, has attracted many researchers [1]. The importance of speech watermarking is gradually increasing because a significant amount of speech is transmitted through insecure communication channels. There are many approaches to speech watermarking, including spread spectrum (SS), auditory masking, patchwork, transformation, and parametric modeling [2]. In the SS approach, a pseudorandom sequence is used to spread the spectrum of the watermark data, which is then added to the frequency spectrum of the host signal. In contrast, the auditory masking approach embeds the watermark bits in perceptually insignificant components of the signal. The patchwork approach embeds the watermark data by manipulating two sets of samples from the signal so that the statistical difference between them encodes the data. The transformation approach embeds the watermark data in transform domains, for example, the discrete cosine transform, discrete wavelet transform (DWT), and discrete Fourier transform (DFT). Finally, in the parametric modeling approach, the watermark is embedded by modifying the coefficients of an autoregressive (AR) model.

In addition to speech watermarking approaches, four main embedding strategies are widely applied for watermarking: least significant bit (LSB) replacement, quantization, addition, and multiplication. Among these strategies, quantization has attracted much attention because of its blind detection, robustness, controlled distortion, and payload. For this purpose, a set of quantizers associated with the different watermark symbols is used. However, the quantization strategy suffers from amplitude-scaling attacks. To rectify this problem, rational dither modulation (RDM) [3] was proposed to enhance the robustness of quantization index modulation (QIM) [4, 5]; however, it degraded the imperceptibility of the watermarked signal. Hence, hyperbolic RDM [6] was proposed to improve the robustness against power-law and gain attacks. Another attempt was made by embedding a watermark into the angle of the signal, known as angle QIM (AQIM) [7]. However, this technique was very sensitive to additive white Gaussian noise (AWGN). In [8], the normalized cross-correlation between the original signal and a random sequence was quantized based on dither modulation (known as NC-DM) to embed the watermark data. However, applying the random sequence degraded the security of this technique. Lastly, other efforts, such as projection quantization [9], logarithmic quantization index modulation (LQIM) [10], and Lp-norm QIM [11], have been studied for gain invariant image watermarking.

This paper attempts to mitigate the limitations of previous research by quantizing the ratio between the Lp-norms of the even- and odd-indexed subsequences. After quantization, the Lagrange optimization method is applied to compute the watermarked samples that minimize the embedding distortion and improve imperceptibility. By assuming Laplacian and Gaussian distributions for the speech and noise signals, respectively, the embedding distortion and error probability are derived analytically and validated by simulation. Moreover, experimental results show that the proposed speech watermarking technique outperforms state-of-the-art watermarking techniques.

Generally, speech watermarking should preserve the identity of the speaker, which is important for certain security applications [12, 13]. To preserve speaker-specific information, some investigations have been conducted to embed the watermark into special frequency subbands that have less speaker-specific information [5, 14, 15]. Further discussion can be found in [16].

The remainder of this paper is organized as follows. In Section 2, the proposed model for the speech watermarking technique is presented. Additionally, the watermark embedding and extraction processes are described. The performance of the developed watermarking technique is analytically studied in Section 3 and validated by performing a simulation in Section 4. The experimental results are explained in Section 5. Finally, the conclusion and future work are discussed.

2. Proposed Speech Watermarking Technique

In this section, a blind speech watermarking technique is developed based on quantization of the Lp-norm ratio between two blocks of even and odd indices. Assume that S represents an original speech signal that consists of N samples. Two subsets X and Y are formed from the even- and odd-indexed samples, respectively, so that both X and Y have approximately the same energy, which results in less embedding distortion. Moreover, synchronization between the transmitter and receiver is most efficient in this case. Figure 1 shows the formation of the subsequences X and Y from the even and odd indices of the original signal, respectively.

Then, the Lp-norms of the two subsequences X and Y are computed, respectively, as follows:

$$ L_X = \Bigl(\sum_{i=1}^{N/2} |x_i|^{p}\Bigr)^{1/p}, \qquad (1) $$

$$ L_Y = \Bigl(\sum_{i=1}^{N/2} |y_i|^{p}\Bigr)^{1/p}. \qquad (2) $$

The ratio $R$ between $L_X$ and $L_Y$, given as

$$ R = \frac{L_X}{L_Y}, \qquad (3) $$

is quantized to embed the watermark bit. Although embedding the watermark into the ratio of Lp-norms can provide high robustness against various attacks, imperceptibility can be seriously degraded.
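For concreteness, the following minimal Python sketch (illustrative rather than the implementation used in our experiments; variable and function names are arbitrary) splits a frame into even- and odd-indexed subsequences and computes the Lp-norm ratio of (1)–(3):

```python
import numpy as np

def lp_norm(v, p):
    """Lp-norm of a vector: (sum |v_i|^p)^(1/p)."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

def lp_norm_ratio(frame, p=2):
    """Split a frame into even/odd indexed subsequences X, Y and
    return the ratio R = L_X / L_Y used for watermark embedding."""
    x = frame[0::2]   # even-indexed samples -> X
    y = frame[1::2]   # odd-indexed samples  -> Y
    return lp_norm(x, p) / lp_norm(y, p)

# Example on a random frame of N = 256 samples.
frame = np.random.default_rng(0).standard_normal(256)
print(f"R = {lp_norm_ratio(frame, p=2):.4f}")
```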

To resolve this limitation, the variation between the original ratio ($R$) and the quantized ratio ($R'$) should be minimized. Therefore, the Lagrange optimization method is used to minimize this variation; that is, the Lagrange method decreases the embedding distortion introduced by quantization and thereby improves the imperceptibility of the watermarked speech signal. Accordingly, the embedding is formulated as a constrained optimization problem in (4): the distortion between the original and modified subsequences is minimized subject to the constraint that the modified Lp-norm ratio equals $R'$. To solve this problem, the Lagrangian of the equation system is formed in (5), and the optimized values are computed by setting its partial derivatives to zero, as in (6).
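The exact expressions of (4)–(6) are not reproduced here; the following LaTeX sketch only illustrates the structure of the constrained problem, under the assumption (consistent with the embedding algorithm below) that only the even-indexed subsequence is modified:

```latex
% Sketch: minimize the embedding distortion subject to reaching the
% quantized ratio R'; only the even-indexed block X is modified.
\begin{align}
  \min_{X'}\; & \|X' - X\|_2^2
    \quad \text{s.t.} \quad \frac{L_{X'}}{L_Y} = R',
    \qquad L_{X'} = \Bigl(\sum\nolimits_i |x_i'|^p\Bigr)^{1/p}, \\
  \mathcal{J}(X', \lambda) \;=\; & \|X' - X\|_2^2
    + \lambda\,\bigl(L_{X'} - R' L_Y\bigr), \\
  \frac{\partial \mathcal{J}}{\partial x_i'} = 0, \quad
  & \frac{\partial \mathcal{J}}{\partial \lambda} = 0
    \quad \Longrightarrow \quad \text{optimized } X'.
\end{align}
```

Solving the stationarity conditions jointly yields the modified subsequence that satisfies the quantized ratio with the smallest distortion.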

2.1. Speech Watermarking Algorithm

The details of the proposed embedding and extraction processes are described in the following algorithms:

Embedding Process

(a) Segment the input speech signal (S) into frames ($S_i$) of size N.

(b) Form two subsequences X and Y, each of length N/2, based on the even and odd indices of $S_i$, respectively.

(c) Compute the Lp-norms $L_X$ and $L_Y$ of the X and Y subsequences based on (1) and (2), respectively.

(d) Apply the QIM technique to embed the watermark bit $w_i$ into the ratio $R$ between the Lp-norms of X and Y as follows:

$$ R' = \Delta \cdot \operatorname{round}\!\left(\frac{R - d(w_i)}{\Delta}\right) + d(w_i), \qquad d(w_i) = \frac{\Delta}{2}\, w_i, \qquad (7) $$

where $\Delta$ represents the quantization step, $w_i \in \{0, 1\}$ is the watermark bit, and $R'$ is the modified ratio of the Lp-norms between X and Y. Choosing a large quantization step increases the robustness but reduces imperceptibility, and vice versa.

(e) Apply the Lagrange method to obtain the optimized values of the modified subsequence $X'$.

(f) Reposition the even and odd samples based on $X'$ and Y, respectively.

(g) Rearrange the watermarked speech signal from the modified frames ($S'_i$).
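A compact Python sketch of steps (a)–(g) follows. It is an illustration rather than the authors' code: the dithered QIM lattice is the form assumed in (7), and the uniform rescaling of X stands in for the Lagrange-optimized update (it satisfies the ratio constraint exactly but is not necessarily the minimum-distortion solution).

```python
import numpy as np

def lp_norm(v, p):
    """Lp-norm of a vector, as in (1) and (2)."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

def embed_frame(frame, bit, delta, p=2):
    """Embed one watermark bit into a frame by quantizing the Lp-norm
    ratio R = L_X / L_Y (X: even-indexed samples, Y: odd-indexed samples)."""
    x, y = frame[0::2], frame[1::2]
    r = lp_norm(x, p) / lp_norm(y, p)
    # Dithered QIM on the ratio, as assumed in (7).
    dither = 0.5 * delta * bit
    r_q = delta * np.round((r - dither) / delta) + dither
    # Feasible update: rescale X so that the new ratio equals r_q exactly.
    # (The paper obtains the minimum-distortion update via Lagrange
    # optimization; uniform rescaling is only an illustrative stand-in.)
    out = frame.astype(float)
    out[0::2] = x * (r_q / r)
    return out

def embed(signal, bits, frame_len, delta, p=2):
    """Embed a bit sequence frame by frame and return the watermarked signal."""
    wm = signal.astype(float)
    for i, bit in enumerate(bits):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        wm[seg] = embed_frame(wm[seg], bit, delta, p)
    return wm
```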

Figure 2 shows the block diagram of the proposed embedding process.

Extraction Process

(a) Segment the input watermarked speech signal ($S'$) into frames ($S'_i$) of size N.

(b) Form two subsequences $X'$ and $Y'$, each of length N/2, based on the even and odd indices of $S'_i$, respectively.

(c) Compute the Lp-norms $L_{X'}$ and $L_{Y'}$ of the $X'$ and $Y'$ subsequences based on (1) and (2), respectively.

(d) Extract the $i$th binary watermark bit from the $i$th frame of the watermarked speech signal by selecting the minimum Euclidean distance (nearest quantization point) from the ratio $R' = L_{X'} / L_{Y'}$ as follows:

$$ \hat{w}_i = \arg\min_{w \in \{0, 1\}} \bigl| R' - Q(R', w) \bigr|, \qquad (8) $$

where $Q(\cdot, w)$ is the quantization function associated with watermark bit $w \in \{0, 1\}$.
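The blind extraction of (8) can be sketched in the same style (again illustrative; the frame length, step size, and p must match the values used at the embedder):

```python
import numpy as np

def lp_norm(v, p):
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

def extract_frame(frame, delta, p=2):
    """Blind extraction of one bit: recompute the Lp-norm ratio and pick
    the QIM coset (bit 0 or bit 1) whose lattice point is closest, as in (8)."""
    x, y = frame[0::2], frame[1::2]
    r = lp_norm(x, p) / lp_norm(y, p)
    dists = []
    for bit in (0, 1):
        dither = 0.5 * delta * bit
        r_q = delta * np.round((r - dither) / delta) + dither
        dists.append(abs(r - r_q))
    return int(np.argmin(dists))

def extract(signal, n_bits, frame_len, delta, p=2):
    """Extract n_bits watermark bits frame by frame."""
    return [extract_frame(signal[i * frame_len:(i + 1) * frame_len], delta, p)
            for i in range(n_bits)]
```

On a clean signal, a round trip such as `extract(embed(s, bits, N, delta), len(bits), N, delta)` should return `bits` exactly.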

Figure 3 shows the block diagram of the proposed extraction process.

3. Statistical Analysis of the Proposed Technique

Generally, the Laplacian distribution is the best model for speech signals over frame lengths in the range of 5–50 ms [17, 18]. The Laplacian probability density function is expressed as

$$ f(x) = \frac{1}{2b}\exp\!\left(-\frac{|x - \mu|}{b}\right), $$

where $b$ is the scale parameter and $\mu$ is the mean of the random variable. If the subsequences X and Y are considered independent, identically distributed (i.i.d.) variables, then each of them can be assumed to follow a Laplacian distribution with parameters $(\mu_X, b_X)$ and $(\mu_Y, b_Y)$, respectively. Based on (3), the ratio $R$ between the Lp-norms of X and Y should be computed. However, the moments of the ratio of two Laplacian variables cannot be computed exactly, because the mean and variance of such a ratio are not finite, in either the Gaussian or the Laplacian case. The problem arises because the denominator has nonzero density in the neighborhood of zero. If the denominator is bounded away from zero (in which case it is no longer strictly the ratio of two Laplacian or two normal variables), then a Taylor expansion converges and can be used to estimate the moments of the ratio. Following Appendix A, the mean and variance of the ratio can then be derived.

To estimate the embedding distortion, the quantization noise between the original and watermarked speech signals should be considered, as formulated in (12). Using (4) to (6), (12) can be rewritten in terms of the modified ratio, giving (13). Under an approximation that holds for small quantization errors, (13) can be estimated, and its expected value can then be derived. If the quantization noise is considered to be uniformly distributed in $[-\Delta/2, \Delta/2]$, then its mean is zero and its variance is $\Delta^2/12$. Additionally, as the mean value of the speech signal is considered to be zero, a zero-mean Laplacian distribution is used to model the speech signal. To model the Lp-norm terms, the $p$th absolute moment of the Laplacian distribution should be estimated using Appendix B, from which the mean and variance of the $p$th absolute moment can be derived. Now, based on (1) and (19), the expected embedding distortion can be computed, and the signal-to-watermark ratio (SWR) can be estimated as in (20). Because both the X and Y sets are selected from neighboring samples, it can be assumed that their scale parameters are approximately equal. As a result, (20) can be expressed directly in terms of the quantization step.

To model the error probability, it is assumed that the watermarked speech signal passes through an AWGN channel with zero-mean Gaussian noise. Therefore, (3) must be rewritten to include the noise, where the added terms correspond to the odd and even components of the AWGN, respectively. Because the watermarked term is a known parameter, the resulting ratio cannot be estimated directly using a chi-square distribution with the corresponding degrees of freedom. To compute its distribution, the ratio should be decomposed and estimated term by term; the decomposition in (23) can then be expanded as in (24), and each part of (24) is estimated separately. To estimate the probability of error, the noise terms must be analyzed, because they can push the original ratio into a wrong decision region. The distribution of each term of (24) can be estimated by the central limit theorem (CLT) because of the large number of samples in each block. Regardless of the distribution of the original speech signal, and because of the independence between the signal and noise samples, the mean and variance of the noise terms can be computed. By assuming equal probabilities for the zero and one watermark bits, the probability of error for a fixed quantization step $\Delta$ can be estimated as in (27). A closed-form solution for (27) is then obtained, in which $\operatorname{erfc}(\cdot)$ is the complementary error function, defined as $\operatorname{erfc}(z) = \frac{2}{\sqrt{\pi}}\int_{z}^{\infty} e^{-t^{2}}\,dt$, and the remaining parameters can be computed as in (11) and (12), respectively.
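As a rough, self-contained check of the error-probability behavior, the following sketch uses a simplified scalar model: it perturbs the quantized ratio directly with Gaussian noise rather than propagating the Laplacian speech and AWGN statistics through the Lp-norm ratio as in the analysis above, and compares the measured bit error rate against a nearest-boundary erfc approximation. All parameter values are illustrative.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(3)

def qim_embed(value, bit, delta):
    """Scalar dithered QIM: map `value` to the nearest lattice point of
    the coset selected by `bit` (spacing delta, cosets offset by delta/2)."""
    dither = 0.5 * delta * bit
    return delta * np.round((value - dither) / delta) + dither

def qim_detect(value, delta):
    """Minimum-distance decision between the two cosets."""
    d0 = np.abs(value - qim_embed(value, 0, delta))
    d1 = np.abs(value - qim_embed(value, 1, delta))
    return (d1 < d0).astype(int)

delta, n_trials = 1.0, 200_000
for sigma in (0.05, 0.10, 0.20):
    bits = rng.integers(0, 2, n_trials)
    sent = qim_embed(rng.uniform(0, 10, n_trials), bits, delta)
    received = sent + rng.normal(0.0, sigma, n_trials)
    ber = np.mean(qim_detect(received, delta) != bits)
    # Error occurs (to first order) when the noise crosses the nearest
    # decision boundary at delta/4 from the embedded lattice point.
    approx = erfc(delta / (4 * sqrt(2) * sigma))
    print(f"sigma={sigma:.2f}  empirical BER={ber:.4f}  erfc approx={approx:.4f}")
```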

4. Discussion on the Experimental Results

To validate the performance of the developed watermarking technique, simulations were performed on the TIMIT database to verify the robustness, imperceptibility, and capacity of the technique. The TIMIT database includes 630 speakers (438 males and 192 females) with a sampling frequency of 16 kHz [19]. Each speaker pronounced 10 sentences, giving a total of 6,300 sentences. For the experimental results, the averages over 630 speech signals (one per speaker) with durations of 1 s to 3 s were used.

Figure 4 shows the bit error rate (BER) with respect to different values of p for various frame lengths under a watermark-to-noise ratio (WNR) of 40 dB. In this figure, each curve is plotted separately so that the changes are visible. As can be observed, the frame size was inversely related to the BER: whenever the frame size decreased, the BER increased. Additionally, it seems that p was not highly correlated with the BER for values greater than two; only a small fluctuation of the BER can be observed when p changed.

Figure 5 shows the BER with respect to different values of p for various quantization steps. As expected, whenever the quantization step increased, the BER decreased. Furthermore, varying p did not seriously change the BER. It must be mentioned that, because watermark detection is perfect under clean conditions, a small amount of AWGN was added to the watermarked signals for the experiments shown in Figures 4 and 5.

Figure 6 shows the variation of the signal-to-noise ratio (SNR) with respect to different values of p for different frame lengths. There was no significant difference in the SNR when the frame size increased. As can be observed, whenever the frame size increased, the energy difference between the two sets X and Y increased; consequently, the ratio between them increased, which caused a lower SNR. Additionally, it seems that changing p was not highly correlated with the SNR for different frame lengths.

Figure 7 illustrates the SNR with respect to different values of p for various quantization steps. As observed, p did not strongly affect the SNR, whereas the quantization step did: as the quantization step increased, the SNR decreased.

To compute the payload of the proposed watermark, a memoryless binary symmetric channel (BSC) with crossover probability $p_e$ was applied to estimate the capacity of the channel for error-free watermark transmission [20], defined as

$$ C = R_b \bigl(1 - H_b(p_e)\bigr), $$

where $H_b(p_e) = -p_e \log_2 p_e - (1 - p_e)\log_2(1 - p_e)$ is the binary entropy function and $R_b$ is the channel bitrate.

Because the sampling rate of TIMIT is 16 kHz, $R_b$ was assumed to be 64 kbps (8 kHz telephony sampling rate × 8 bits per sample = 64 kbps), and $p_e$ was assumed to be equal to the BER of the watermark detection process. Figure 8 shows the BSC capacity for different WNRs and various quantization steps. As observed, the capacity increased whenever the WNR increased, because the watermark was extracted with a minimum BER when the WNR increased. Moreover, it can be inferred that the BSC capacity increased as the quantization step increased, because the watermark was embedded with higher intensity for larger quantization steps. As observed, the BSC capacity for small quantization steps was approximately zero under a highly noisy channel.
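A quick illustration of this payload estimate follows (a sketch; the 64 kbps carrier rate is the telephony assumption above, and the BER values are arbitrary examples):

```python
import math

def binary_entropy(p):
    """Binary entropy H_b(p) in bits; H_b(0) = H_b(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_payload(bitrate_bps, ber):
    """Error-free payload of a memoryless BSC: R_b * (1 - H_b(p_e))."""
    return bitrate_bps * (1.0 - binary_entropy(ber))

# Telephony-channel example: 64 kbps carrier, illustrative BER values.
for ber in (0.0, 0.01, 0.05, 0.1):
    print(f"BER = {ber:.2f} -> payload ~ {bsc_payload(64_000, ber):,.0f} bps")
```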

Figure 9 shows the variation of the BSC capacity with respect to different WNRs for different frame lengths. As observed, it seems that, under serious noise, the frame size was not a significant factor for the BSC capacity. Despite this, the frame size was likely to be important whenever the WNR increased. Thus, for a large WNR, it is obvious that whenever the frame size increased, the BER in the watermark detection process decreased, which caused an improvement in the BSC capacity.

To demonstrate the efficiency and performance of the proposed speech watermarking technique, the robustness, capacity, and inaudibility of the proposed technique must be compared with other state-of-the-art speech watermarking techniques.

Table 1 describes the benchmark for simulating the results for the robustness test. Many of these attacks are based on the StirMark Benchmark for Audio (SMBA) [24].

Table 2 compares the BER with state-of-the-art speech watermarking techniques. We implemented all the techniques and tested them for the entire TIMIT corpus under different attacks. As can be observed, the proposed speech watermarking technique has a lower BER overall compared with other techniques.

The perceptual quality of the watermarked signal is critical for the evaluation of the proposed watermarking technique; it can be measured based on the mean opinion score (MOS), as recommended by the International Telecommunication Union (ITU-T) [23], and on the SNR. The MOS uses the subjective evaluation scale presented in Table 3 to score the watermarked signal. In the MOS evaluation, 10 listeners were asked to listen blindly to the original and watermarked signals and report the dissimilarities in quality between them. The average of these reports was computed for MOS music and MOS speech and is presented in Table 4.

Objective evaluation measures, such as the SWR and SNR, attempt to quantify this amount based on the following formula:

$$ \mathrm{SNR} = 10 \log_{10} \frac{\sum_{n} s^{2}(n)}{\sum_{n}\bigl(s(n) - \hat{s}(n)\bigr)^{2}}, $$

where $s(n)$ and $\hat{s}(n)$ are the original and watermarked signals, respectively.
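A direct implementation of this measure might look as follows (a sketch; it assumes the original and watermarked signals are time-aligned and of equal length):

```python
import numpy as np

def snr_db(original, watermarked):
    """SNR (dB) between an original signal and its watermarked version."""
    original = np.asarray(original, dtype=float)
    noise = original - np.asarray(watermarked, dtype=float)
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))
```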

Table 4 presents a comparison of the proposed technique and other techniques in terms of imperceptibility and capacity. Based on the results, it seems that the proposed speech watermarking technique outperformed the other techniques in terms of capacity and imperceptibility. Although the SNR of formant tuning [21] is higher than that of the proposed technique, the capacity and robustness of the proposed technique are greater than those of formant tuning [21] and Analysis-by-Synthesis [22].

As observed in Table 4, each entry is bounded between two values, relating a particular level of imperceptibility (SNR and MOS) to a particular capacity. Consequently, when the capacity increases, imperceptibility decreases. The trade-off point is completely application dependent and should be determined by the user.

5. Performance Analysis

Generally, two types of errors, the false positive probability (FPP) and false negative probability (FNP), must always be analyzed to validate the security of a watermarking system [25]. The FPP is the probability that an unwatermarked speech signal is declared watermarked by the watermark extractor. Similarly, the FNP is the probability that a watermarked speech signal is declared unwatermarked by the watermark extractor. By assuming that the watermark bits are independent random variables, both the FPP and FNP can be formulated based on Bernoulli trials, expressed as follows:

$$ P_{FP} = \sum_{k=T}^{N_w} \binom{N_w}{k}\, p_{fp}^{\,k}\,(1 - p_{fp})^{N_w - k}, $$

$$ P_{FN} = \sum_{k=0}^{T-1} \binom{N_w}{k}\,(1 - p_{fn})^{k}\, p_{fn}^{\,N_w - k}, $$

where $N_w$ is the total number of watermark bits; $k$ is the number of matching bits; $\binom{N_w}{k}$ is a binomial coefficient; $p_{fp}$ is the probability that a bit matches by chance, which is assumed to be 0.5; $p_{fn}$ is the probability of a bit error, which is assumed to be 0.0919 (as in Table 2); and $T$ is the decision threshold used by the detector.
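The two binomial sums can be evaluated directly; the sketch below uses an illustrative threshold of 75% matching bits (the threshold rule used in our experiments is not reproduced here):

```python
from math import comb

def fpp(n_bits, threshold, p_match=0.5):
    """False positive probability: >= threshold bits match by chance
    in an unwatermarked signal (each bit matches with probability 0.5)."""
    return sum(comb(n_bits, k) * p_match**k * (1 - p_match)**(n_bits - k)
               for k in range(threshold, n_bits + 1))

def fnp(n_bits, threshold, ber=0.0919):
    """False negative probability: fewer than threshold bits match
    in a watermarked signal whose per-bit error rate is `ber`."""
    return sum(comb(n_bits, k) * (1 - ber)**k * ber**(n_bits - k)
               for k in range(threshold))

# Illustrative threshold: require 75% of the bits to match.
for n in (32, 64, 128, 256):
    t = int(0.75 * n)
    print(f"N={n:4d}  T={t:4d}  FPP={fpp(n, t):.2e}  FNP={fnp(n, t):.2e}")
```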

Figure 10 shows the FPP with respect to the total number of watermark bits for different BERs. For better visualization, each line was shifted by adding a constant. As observed, the FPP was close to zero when the total number of watermark bits exceeded 50; below 50 bits, there was a small fluctuation, which depended on the BER.

Figure 11 shows the FNP with respect to the total number of watermark bits for different BERs. For better visualization, each line was shifted by adding a constant. As can be observed, the FNP was close to zero when the total number of watermark bits exceeded 100. Additionally, whenever the BER decreased, the fluctuation increased.

6. Conclusion and Future Work

In this paper, a gain invariant speech watermarking technique was developed using the Lagrange optimization method. For this purpose, samples of the signal were separated based on odd and even indices. Then the ratio between the Lp-norms was quantized using the QIM method. Finally, the Lagrange method was used to estimate the optimized values. In a similar manner, the extraction process detected the watermark data blindly by finding the nearest quantization step.

By assuming a Laplacian distribution for the speech signal and a Gaussian distribution for the noise, the probability of error and the watermarking distortion were modeled through a statistical analysis of the proposed technique. Additionally, the experimental results not only proved that the developed watermarking technique is highly robust against different attacks, such as compression, AWGN, filtering, and resampling, but also demonstrated the validity of the analytical model. For future work, an investigation of synchronization and adaptive quantization techniques might further improve the proposed watermarking technique.

Appendix

A. Estimation of the Mean and Variance of the Ratio of Two Laplacian Variables Based on Taylor Series

In [26], the bivariate second-order Taylor expansion of a function $f(X, Y)$ around the point $(\mu_X, \mu_Y)$ is expressed as follows:

$$ f(X, Y) \approx f(\mu_X, \mu_Y) + f_X(X - \mu_X) + f_Y(Y - \mu_Y) + \tfrac{1}{2}\bigl[f_{XX}(X - \mu_X)^2 + 2 f_{XY}(X - \mu_X)(Y - \mu_Y) + f_{YY}(Y - \mu_Y)^2\bigr], $$

where all partial derivatives are evaluated at $(\mu_X, \mu_Y)$. Therefore, $f(X, Y) = X/Y$ can be expanded about $(\mu_X, \mu_Y)$ to compute the approximate moments of the ratio. For $f(X, Y) = X/Y$, $f_X = 1/Y$, $f_Y = -X/Y^{2}$, $f_{XY} = -1/Y^{2}$, and $f_{YY} = 2X/Y^{3}$. Then, the mean and variance of the ratio between X and Y, respectively, can be estimated as follows:

$$ E\!\left[\frac{X}{Y}\right] \approx \frac{\mu_X}{\mu_Y} - \frac{\operatorname{Cov}(X, Y)}{\mu_Y^{2}} + \frac{\mu_X \sigma_Y^{2}}{\mu_Y^{3}}, $$

$$ \operatorname{Var}\!\left(\frac{X}{Y}\right) \approx \frac{\sigma_X^{2}}{\mu_Y^{2}} - \frac{2\mu_X \operatorname{Cov}(X, Y)}{\mu_Y^{3}} + \frac{\mu_X^{2}\sigma_Y^{2}}{\mu_Y^{4}}. $$
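A small numerical check of these approximations follows (a sketch with arbitrary Laplacian parameters, chosen so that the denominator stays well away from zero):

```python
import numpy as np

rng = np.random.default_rng(2)

# Independent Laplacian variables with means well away from zero,
# so the ratio moments are numerically stable (illustrative parameters).
mu_x, b_x = 5.0, 0.5
mu_y, b_y = 4.0, 0.3
var_x, var_y = 2 * b_x**2, 2 * b_y**2   # Var of Laplace(mu, b) is 2*b^2

x = rng.laplace(mu_x, b_x, 1_000_000)
y = rng.laplace(mu_y, b_y, 1_000_000)
r = x / y

# Second-order Taylor approximations (Cov(X, Y) = 0 for independent X, Y).
mean_approx = mu_x / mu_y + mu_x * var_y / mu_y**3
var_approx = var_x / mu_y**2 + mu_x**2 * var_y / mu_y**4

print(f"E[X/Y]:   empirical={r.mean():.4f}  Taylor={mean_approx:.4f}")
print(f"Var[X/Y]: empirical={r.var():.4f}  Taylor={var_approx:.4f}")
```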

B. Computation of the Absolute Moment of the Laplacian Distribution

The $p$th absolute moment of the Laplacian distribution is expressed as the integral of $|x|^{p}$ against the Laplacian density. Two cases must be considered when evaluating this integral: the first case is evaluated directly, and the second case can be rewritten through (B.4) and (B.5). Substituting (B.4) and (B.5) into (B.3), the $p$th absolute moment of the Laplacian distribution is obtained. In particular, for the zero-mean Laplacian model used in Section 3, the $p$th absolute moment reduces to $E[|X|^{p}] = \Gamma(p + 1)\, b^{p}$, where $b$ is the scale parameter and $\Gamma(\cdot)$ is the gamma function.
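For the zero-mean case, the closed form $\Gamma(p + 1)\, b^{p}$ can be checked numerically (a sketch; the scale value is arbitrary):

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(1)

def laplace_abs_moment_mc(p, scale, n=1_000_000):
    """Monte Carlo estimate of E[|X|^p] for a zero-mean Laplacian."""
    x = rng.laplace(0.0, scale, n)
    return np.mean(np.abs(x) ** p)

def laplace_abs_moment_closed(p, scale):
    """Closed form for the zero-mean case: Gamma(p + 1) * b^p."""
    return gamma(p + 1) * scale ** p

for p in (1.0, 2.0, 2.5, 4.0):
    mc = laplace_abs_moment_mc(p, scale=0.5)
    cf = laplace_abs_moment_closed(p, scale=0.5)
    print(f"p={p:3.1f}  Monte Carlo={mc:.4f}  closed form={cf:.4f}")
```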

Competing Interests

The authors declare that they have no competing interests.