Abstract

A robust and blind digital speech watermarking technique is proposed for online speaker recognition systems. It is based on the Discrete Wavelet Packet Transform (DWPT) and multiplication, and it embeds the watermark in the amplitudes of the wavelet subbands. To minimize the degradation caused by the watermark, the subbands that carry less speaker-specific information (500 Hz–3500 Hz and 6000 Hz–7000 Hz) are selected. Experimental results on the Texas Instruments Massachusetts Institute of Technology (TIMIT), Massachusetts Institute of Technology (MIT), and Mobile Biometry (MOBIO) databases show that the degradation for speaker verification and identification is 1.16% and 2.52%, respectively. Furthermore, the proposed watermarking technique provides sufficient robustness against different signal processing attacks.

1. Introduction

Security and robustness are the main concerns for speaker recognition systems in online environments [1]. Eight potential points of attack make online speaker recognition systems vulnerable [2]. Recently, speech watermarking has been used to secure the communication channel against intentional and unintentional attacks for speaker verification and identification purposes [3–7]. In this approach, the watermark is embedded to verify the authenticity of the transmitter (i.e., sensor and feature extractors) and the integrity of the entire authentication mechanism. However, applying speech watermarking can seriously degrade recognition performance. Since the main aim of speaker recognition technologies is to enhance recognition performance, applying watermarking in this context is questionable. Available speech watermarking techniques [8–11] embed the watermark in a special frequency range or in the speech formants, which can seriously degrade speaker recognition performance. Furthermore, watermarking and speaker recognition systems have opposing goals: whenever the Signal-to-Watermark Ratio (SWR) is decreased, the robustness of the watermark increases, but speaker identification and verification performance can decrease [5, 6, 12, 13]. Therefore, some researchers apply semifragile watermarking to reduce this impact on recognition performance [14, 15]. Although semifragile watermarking techniques can be used for tamper detection, robust watermarking techniques are still required to protect ownership.

In this paper, a novel digital speech watermarking technique is proposed for online speaker recognition systems by using the Discrete Wavelet Packet Transform (DWPT) and multiplication. Watermark bits are embedded in the subbands that carry less speaker-specific information. Basically, discriminative speaker features lie within the low and high frequency bands: the glottis between 100 Hz and 400 Hz, the piriform fossa between 4 kHz and 5 kHz, and the constriction of the consonants at 7.5 kHz [16–18].

The rest of this paper is organized as follows: first, the applied methodology is discussed; second, the proposed robust digital speech watermarking algorithm is explained; third, the experimental results for the proposed technique are evaluated; fourth, the effect of the proposed technique on speaker recognition performance is given; and finally, the conclusion and future work are presented.

2. Methodology

Figure 1 shows the critical bands which are chosen to embed the watermark. As seen in Figure 1, the selected bands carry less speaker-specific information, which causes less degradation of the recognition performance of online speaker recognition systems. The speech signal is decomposed into 16 critical bands by applying DWPT. Then, 8 critical bands (numbers 2, 3, 4, 5, 6, 7, 13, and 14), where the Fisher ratio (F-ratio) is low, are chosen to minimize the degradation of speaker-specific information. The F-ratio curve in Figure 1 is taken from previous work [3, 16].
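The link between the selected band numbers and the frequency ranges quoted in the abstract can be checked with a short sketch. It assumes a 16 kHz sampling rate (8 kHz bandwidth) and 16 uniform, frequency-ordered bands numbered from 1; the helper names are hypothetical and are not part of the original work.

```python
# Hypothetical helpers: map DWPT subband numbers to frequency ranges.
# Assumptions: 16 kHz sampling and 16 uniform bands in frequency order,
# numbered from 1, so that each band is 500 Hz wide.

SAMPLE_RATE_HZ = 16000
NUM_BANDS = 16
SELECTED_BANDS = [2, 3, 4, 5, 6, 7, 13, 14]  # band numbers given in the text


def band_range_hz(band_number, sample_rate_hz=SAMPLE_RATE_HZ, num_bands=NUM_BANDS):
    """Return the (low, high) edge frequencies in Hz of a 1-based band number."""
    width = (sample_rate_hz / 2) / num_bands
    return ((band_number - 1) * width, band_number * width)


def merge_ranges(bands):
    """Merge adjacent band ranges into contiguous frequency intervals."""
    merged = []
    for low, high in (band_range_hz(b) for b in sorted(bands)):
        if merged and merged[-1][1] == low:
            merged[-1] = (merged[-1][0], high)
        else:
            merged.append((low, high))
    return merged


print(merge_ranges(SELECTED_BANDS))
```

Under these assumptions the eight selected bands collapse into the two intervals 500 Hz–3500 Hz and 6000 Hz–7000 Hz.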

3. Robust Digital Speech Watermarking Algorithm

In this section, a robust digital speech watermarking technique based on a robust multiplicative technique is proposed. In this technique, the watermark is embedded by manipulating the amplitude of the speech signal [19]. The speech signal is segmented into nonoverlapping frames of fixed length. Then, every sample of the frame is manipulated according to (1), whose terms are the intensity of the watermark, which must be slightly greater than 1, the watermark bit, the original speech samples, and the watermarked speech samples. Whenever the intensity is increased, the robustness of the watermark is increased, but the imperceptibility is decreased. The rule is applied sample by sample, so each original sample of the frame is mapped to the watermarked sample at the same position.
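A minimal sketch of multiplicative embedding is given here for illustration. It assumes one common form of the rule, in which every sample is multiplied by the intensity for bit 1 and divided by it for bit 0; the exact form of (1), the name `alpha`, and the value 1.1 are assumptions.

```python
def embed_bit(frame, bit, alpha=1.1):
    """Scale every sample by alpha for bit 1 or by 1/alpha for bit 0.

    This scaling rule is an assumed form of the multiplicative embedding.
    """
    if alpha <= 1.0:
        raise ValueError("alpha is expected to be slightly greater than 1")
    gain = alpha if bit == 1 else 1.0 / alpha
    return [gain * x for x in frame]


def energy(samples):
    """Sum of squared samples."""
    return sum(x * x for x in samples)


frame = [0.2, -0.5, 0.1, 0.4]
marked_one = embed_bit(frame, 1)
marked_zero = embed_bit(frame, 0)

# The frame energy moves by alpha**2 in either direction; a larger alpha
# separates the two cases further but distorts the host signal more.
print(energy(marked_one) / energy(frame), energy(marked_zero) / energy(frame))
```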

It is demonstrated [19, 20] that, by knowing the watermark’s strength, the variance of the noise, and the variance of the original signal, it is possible to extract the watermark bit from the energy of the signal by using a predefined threshold. The detection of the watermark bit is based on (2), in which the threshold depends on the variance of the noise and of the signal. This detection function works well except under a gain attack: if all the samples are multiplied by a constant, the watermark bits cannot be detected at the receiver. In this paper, a rational (ratio-based) watermark detection technique is applied to solve this problem. The speech frame is divided into two sets which should have equal length and energy. If their energies are not equal, they can be equalized by using a distortion signal. Next, the watermark bit is embedded into one of the two sets based on (1).

For the extraction of the watermark bit from the watermarked frame, (3) is applied. One parameter of (3) is an even number, which is chosen to provide a tradeoff between robustness and imperceptibility.
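The benefit of a ratio-based detector under a gain attack can be illustrated with a small sketch. It assumes that the detector compares the ratio of the sums of the k-th powers of the two sets against a threshold, and that bit 1 multiplies the first set by the intensity while bit 0 divides it; the names `alpha`, `k`, and `threshold` and their values are assumptions for illustration.

```python
def power_sum(samples, k=2):
    """Sum of |x|**k over a set (k = 2 gives the energy)."""
    return sum(abs(x) ** k for x in samples)


def embed_in_first_set(set_a, set_b, bit, alpha=1.1):
    """Scale only the first set; the second set is left as the reference."""
    gain = alpha if bit == 1 else 1.0 / alpha
    return [gain * x for x in set_a], list(set_b)


def detect_bit(set_a, set_b, threshold=0.95, k=2):
    """Decide the bit from the ratio of the two power sums."""
    ratio = power_sum(set_a, k) / power_sum(set_b, k)
    return 1 if ratio > threshold else 0


# Two sets with the same magnitudes, hence equal energy before embedding.
set_a = [0.5, -0.2, 0.3, 0.1]
set_b = [0.1, 0.3, -0.5, 0.2]

for bit in (0, 1):
    marked_a, marked_b = embed_in_first_set(set_a, set_b, bit)
    # Gain attack: every sample of the frame is multiplied by the same constant.
    attacked_a = [4.0 * x for x in marked_a]
    attacked_b = [4.0 * x for x in marked_b]
    print(bit, detect_bit(marked_a, marked_b), detect_bit(attacked_a, attacked_b))
```

Because a common gain multiplies the numerator and the denominator by the same factor, it cancels in the ratio, which is why the decision is unchanged after the attack.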

Due to the application of DWPT, the distribution of the speech subbands is considered to be a Generalized Gaussian Distribution (GGD), which can be assumed to be a Weibull distribution when the DFT is applied [21]. The GGD density is expressed in terms of the gamma function and of the shape parameter of the distribution, which can be estimated from the statistical moments of the signal, as discussed briefly in Appendix A.

The detection threshold for the watermark bit is estimated for an Additive White Gaussian Noise (AWGN) channel. The received watermarked signal is therefore the watermarked speech signal plus an additive noise term. Equation (6) estimates the probability of the watermark bit. As seen in (6), the detection threshold depends on sums of different parameters. Therefore, the series in the numerator and denominator, which are treated as normally distributed, can be computed based on the Central Limit Theorem (CLT). Although some of these parameters are always positive and so cannot strictly be modeled by a Gaussian distribution, which may take negative values, the probability of a negative value being generated by this Gaussian distribution is very low due to the long length of the speech frames and the large value of the relevant parameter. As a result, the mean and variance of each term of the numerator and denominator are estimated based on (7) and (8), respectively, in terms of the length of each of the two sets.

Under the stated assumptions, and based on the moments of the GGD which are computed as in Appendix A, (9) and then (10) are estimated. By assuming a Gaussian signal with zero mean, (11) can be formulated. The fourth moment of the noise component and the remaining components of (6) are then expressed directly. By using the two free auxiliary parameters stated in (14), the detection statistic is expressed by (15). Estimating the PDF of this statistic requires the density of a ratio. By assuming that the two auxiliary parameters are normally distributed and independent, (18) can be expressed (more details in Appendix B). A closed-form solution for (17) is available and is fully discussed in the literature.

As a result, the density of the detection statistic can be formulated in terms of the lowest and the highest bounds of the energy ratio between the two sets. As discussed, these two sets should be selected so that they have approximately equal energy. The density of one parameter is expressed as in (10), whereas the density of the other is formulated as in (23), which is estimated from the ratio between independent normal distributions. The remaining probability can be computed in the same manner. Then, the probability of detection error can be estimated. As the main aim of the watermark detector is the minimization of the error, the threshold is chosen to minimize this probability. The value of the threshold is computed experimentally by simulation. In the following, the robust digital speech watermarking technique is developed based on the statistical model described in this section.
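The experimental choice of the threshold can be sketched as a small Monte Carlo search. The sketch below is only illustrative: it assumes Gaussian host samples, energy equalization by rescaling the second set, the scaling rule used for bit 1 and bit 0, and the values `alpha = 1.1`, a noise standard deviation of 0.1, and 128 samples per set, none of which are taken from the original experiments.

```python
import random


def simulate_ratios(bit, trials, set_len, alpha, noise_std, rng):
    """Energy ratios observed at the receiver for one bit value over AWGN."""
    ratios = []
    for _ in range(trials):
        set_a = [rng.gauss(0.0, 1.0) for _ in range(set_len)]
        set_b = [rng.gauss(0.0, 1.0) for _ in range(set_len)]
        # Equalize the energies by rescaling set B (an assumed form of the distortion).
        scale = (sum(x * x for x in set_a) / sum(x * x for x in set_b)) ** 0.5
        set_b = [scale * x for x in set_b]
        gain = alpha if bit == 1 else 1.0 / alpha
        recv_a = [gain * x + rng.gauss(0.0, noise_std) for x in set_a]
        recv_b = [x + rng.gauss(0.0, noise_std) for x in set_b]
        ratios.append(sum(x * x for x in recv_a) / sum(x * x for x in recv_b))
    return ratios


def best_threshold(ratios_zero, ratios_one, candidates):
    """Return the middle candidate among those with the lowest empirical error."""
    best, best_err = [], None
    total = len(ratios_zero) + len(ratios_one)
    for t in candidates:
        errors = sum(r > t for r in ratios_zero) + sum(r <= t for r in ratios_one)
        err = errors / total
        if best_err is None or err < best_err:
            best, best_err = [t], err
        elif err == best_err:
            best.append(t)
    return best[len(best) // 2], best_err


rng = random.Random(0)
zeros = simulate_ratios(0, 200, 128, 1.1, 0.1, rng)
ones = simulate_ratios(1, 200, 128, 1.1, 0.1, rng)
candidates = [0.5 + 0.01 * i for i in range(101)]
threshold, error_rate = best_threshold(zeros, ones, candidates)
print(threshold, error_rate)
```

With equalized energies the two ratio populations sit on either side of 1, so the error-minimizing threshold found by such a search is expected to lie close to 1.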

As discussed, the watermark bits are embedded into the specific frequency subbands of DWPT. Details of the embedding and extraction process are presented in the following algorithms.

The embedding process is as follows:
(a) Segment the original speech signal into frames Fi of fixed length.
(b) Apply DWPT to each frame, with the chosen number of levels, to compute the different subbands.
(c) Select the specific frequency subbands in the last level and arrange them into a data sequence.
(d) Divide the data sequence into two sets of equal length. If these two sets have different energy, their energy is equalized by using a distortion.
(e) Apply a channel coding technique to improve the robustness of the watermark bits.
(f) Embed the coded watermark into one of the two sets based on the multiplication expressed in (1).
(g) Apply inverse DWPT to reconstruct the watermarked signal.
Figure 2 shows the block diagram of the embedding process in the proposed robust speech watermarking technique.

The extraction process is as follows:
(a) Segment the watermarked speech signal into frames Fi of fixed length (the frame length can be considered as a public key between the transmitter and receiver).
(b) Apply DWPT to each frame, with the chosen number of levels, to compute the different subbands (the number of levels can be considered as a public key between the transmitter and receiver).
(c) Select the specific frequency subbands in the last level and arrange them into a data sequence.
(d) Divide the data sequence into two sets of equal length.
(e) Extract the watermark bits based on (3).
(f) Decode the watermark bits which are extracted from all embedding frames.
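The embedding and extraction steps above can be sketched end to end for a single frame. This is a simplified illustration, not the authors' implementation: it substitutes a Haar wavelet packet for the Daubechies wavelet, omits channel coding, splits the coefficients into two interleaved sets, equalizes energy by rescaling the second set, and assumes `alpha = 1.1` and a threshold of 0.95. All function names are hypothetical.

```python
import math
import random

SELECTED = [2, 3, 4, 5, 6, 7, 13, 14]  # 1-based, frequency-ordered band numbers
LEVELS = 4


def wpt(frame, levels=LEVELS):
    """Haar wavelet packet decomposition; returns the bands in natural order."""
    nodes = [list(frame)]
    for _ in range(levels):
        split = []
        for node in nodes:
            pairs = list(zip(node[0::2], node[1::2]))
            split.append([(p + q) / math.sqrt(2) for p, q in pairs])
            split.append([(p - q) / math.sqrt(2) for p, q in pairs])
        nodes = split
    return nodes


def iwpt(nodes):
    """Inverse of wpt: merge sibling bands until one frame remains."""
    while len(nodes) > 1:
        merged = []
        for low, high in zip(nodes[0::2], nodes[1::2]):
            node = []
            for a, d in zip(low, high):
                node += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
            merged.append(node)
        nodes = merged
    return nodes[0]


def natural_index(band_number):
    """Natural-order index of a 1-based frequency-ordered band (Gray code)."""
    f = band_number - 1
    return f ^ (f >> 1)


def split_sets(nodes):
    """Coefficient slots of the selected bands, interleaved into sets A and B."""
    slots = [(natural_index(b), i) for b in SELECTED for i in range(len(nodes[0]))]
    return slots[0::2], slots[1::2]


def energy(nodes, slots):
    return sum(nodes[b][i] ** 2 for b, i in slots)


def embed(frame, bit, alpha=1.1):
    nodes = wpt(frame)
    set_a, set_b = split_sets(nodes)
    # Equalize the energies by rescaling set B (an assumed form of the distortion).
    scale = math.sqrt(energy(nodes, set_a) / energy(nodes, set_b))
    for b, i in set_b:
        nodes[b][i] *= scale
    gain = alpha if bit == 1 else 1.0 / alpha
    for b, i in set_a:
        nodes[b][i] *= gain
    return iwpt(nodes)


def extract(frame, threshold=0.95):
    nodes = wpt(frame)
    set_a, set_b = split_sets(nodes)
    return 1 if energy(nodes, set_a) / energy(nodes, set_b) > threshold else 0


rng = random.Random(1)
frame = [rng.uniform(-1.0, 1.0) for _ in range(512)]
recovered = [extract(embed(frame, bit)) for bit in (0, 1, 1, 0)]
print(recovered)
```

The extractor needs only the frame length, the number of levels, and the band selection, which matches the role these values play as shared parameters between transmitter and receiver.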

4. Experimental Setup

In this part, simulations were carried out to evaluate the performance of the developed robust digital speech watermarking technique. The technique was applied on its own so that its performance could be evaluated separately. The simulation results fully confirmed the mathematical models which were developed in the previous section. It must be mentioned that 6300 speech signals of the TIMIT database were used in this experiment. The simulation parameters were assumed as follows:
(a) The frame size was assumed to be 32 ms, which was equal to 512 samples at the 16 kHz sampling rate of TIMIT. A watermark bit was embedded into each frame. It is clear that whenever the size of the original speech signal is increased, the watermark capacity is increased.
(b) The required level for DWPT was assumed to be 4. The watermarked subbands were considered as in Figure 1. Daubechies’ wavelet function was applied for DWPT.
(c) The watermark’s intensity was varied for simulation purposes around a fixed overall value.
(d) For channel coding, the Hamming method was used with fixed code parameters.
(e) The threshold was assumed to be 0.95. The proper value for the threshold was expected to be close to 1 due to the equalization of energy blocks in the developed algorithm.
In the following, simulations were done in MATLAB to study the robustness, imperceptibility, and payload of the developed robust digital speech watermarking technique. The developed technique was compared to state-of-the-art digital watermarking techniques including DWT-SVD [22], LWT-SVD [23], and SVD-QIM [24].
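The channel coding step in item (d) can be illustrated with a Hamming code. The sketch below uses the (7, 4) code purely as an assumption, since the code parameters used in the experiments are not restated here; it shows how one flipped watermark bit per codeword is corrected.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (parity bits at positions 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one bit error and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # 1-based; 0 means no error detected
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
corrupted = list(codeword)
corrupted[4] ^= 1  # a single extraction error
print(hamming74_decode(corrupted) == data)
```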

4.1. Robustness

Various intentional and unintentional attacks were used in this experiment to evaluate the robustness of the developed technique. The bit error rate (BER), the most common measure, was applied to evaluate the robustness of the watermarking. It is computed from the exclusive OR (XOR) of the original watermark and the extracted watermark, normalized by the length of the watermark.
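As a minimal sketch of this measure, BER can be computed as the fraction of positions at which the XOR of the two watermarks is 1. Whether the original results are reported as a fraction or as a percentage is not assumed here; the function name is hypothetical.

```python
def bit_error_rate(original, extracted):
    """Fraction of watermark bits that differ (XOR equal to 1)."""
    if len(original) != len(extracted) or not original:
        raise ValueError("watermarks must be non-empty and of equal length")
    errors = sum(a ^ b for a, b in zip(original, extracted))
    return errors / len(original)


print(bit_error_rate([1, 0, 1, 1], [1, 1, 1, 0]))
```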

Whenever BER is close to zero (due to fewer errors between the original and extracted watermarks), the watermark technique provides good robustness.

For better, valid, and fair comparisons of the advantages and disadvantages of each digital watermarking technique, different speech attacks were designed to evaluate the robustness of the different watermarking techniques. The following attacks are common during the speech transmission through telephony channel.

Speech Attacks

Additive White Gaussian Noise (AWGN). A 5 dB noise was added to the signal for simulating ambient distortion.

Low Pass Filter (LPF). An elliptic LPF with 4 kHz cutoff frequency was applied to the watermarked signal.

Band Pass Filter (BPF). An elliptic BPF (from 300 Hz to 3400 Hz, simulating the narrowband telephony channel) was applied to the watermarked signal.

A-Law. The watermarked signal was compressed and then expanded by A-law with its prevailing parameter value.

μ-Law. The watermarked signal was compressed and then expanded by μ-law with its prevailing parameter value.

CELP 9600. A 9.6 kbps CELP codec was applied to the watermarked speech signal.

CELP 16 K. A 16 kbps CELP codec was applied to the watermarked speech signal.

Amplitude Variation. The amplitude of the watermarked signal was increased up to 400% (multiplied by 4) or decreased down to 25% (divided by 4).

Resample 8 kHz. The sampling rate of the watermarked signal was converted to 8 kHz and then converted back to the original rate.

Requantization. The 16-bit watermarked samples were quantized to 8 bits and then requantized to 16 bits.
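Some of the simpler attacks in this list can be sketched as follows. These are illustrative stand-ins rather than the simulation code used in the experiments: the μ-law round trip assumes μ = 255 with an intermediate 8-bit style quantization, requantization is assumed to drop the 8 least significant bits, and the noise level is passed as an SNR in dB. Floating-point samples are assumed to lie in [-1, 1].

```python
import math
import random


def amplitude_scale(samples, factor):
    """Amplitude variation: multiply every sample by a constant."""
    return [factor * x for x in samples]


def requantize_16_to_8(samples_int16):
    """Drop the 8 least significant bits, then return to the 16-bit scale."""
    return [(s >> 8) << 8 for s in samples_int16]


def mu_law_round_trip(samples, mu=255.0, levels=127):
    """Compress with mu-law, quantize the compressed value, then expand."""
    out = []
    for x in samples:
        sign = 1.0 if x >= 0 else -1.0
        compressed = math.log1p(mu * abs(x)) / math.log1p(mu)
        quantized = round(compressed * levels) / levels
        out.append(sign * math.expm1(quantized * math.log1p(mu)) / mu)
    return out


def add_awgn(samples, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio."""
    power = sum(x * x for x in samples) / len(samples)
    noise_std = math.sqrt(power / 10 ** (snr_db / 10))
    return [x + rng.gauss(0.0, noise_std) for x in samples]


speech = [0.5, -0.25, 0.1, -0.8]
print(amplitude_scale(speech, 4.0))
print(requantize_16_to_8([12345, -12345, 255]))
print(mu_law_round_trip(speech))
print(add_awgn(speech, 5.0, random.Random(0)))
```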

Table 1 summarizes the average BER of the different digital watermarking techniques. As reported, robust DWPT-Multiplication clearly outperformed the other techniques in terms of average BER. Based on the results in Table 1, embedding the watermark in the transform domains (DWPT and DWT) appeared to be superior to embedding in the time domain (SVD-QIM), as the maximum BER among the robust watermarking techniques was obtained for SVD-QIM.

4.2. Imperceptibility

Table 2 presents the average of the objective and subjective imperceptibility comparisons among different watermarking techniques (see International Telecommunications Union (ITU-T) method for the subjective measurement of speech quality [25]). For fair comparison, similar frame lengths, watermark bits, and assumptions were used for these watermarking techniques. As seen in Table 2, approximately less MOS music was reported for the speech signals due to the short durations and less energy of the speech signals. Despite the signals’ duration and energy, the most important issue for the listeners was reporting the dissimilarities, that is, how much of the quality of speech signals was understood and how much of the quality of the speech signals was enjoyable. Although SNR values for music were expected to be more than those for speech, the listeners still expected more quality from music than speech. Therefore, they reported more MOS speech for this experiment. It can be concluded that the listeners’ expectations from the speech signals are different from those for the music signals.

4.3. Capacity

In this experiment, a memoryless binary symmetric channel model, as in (28), was applied to compute the capacity of the data channel for error-free watermark transmission through the telephony channel, where the bitrate of the channel was assumed to be 64 kbps [26]. Figure 4 shows the capacity based on the average BER (see Table 1) for the different watermarking techniques. It can be seen that the robust DWPT-Multiplication watermarking technique had more capacity than the LWT-SVD, SVD-QIM, and DWT-SVD techniques.
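For reference, the textbook capacity of a memoryless binary symmetric channel with crossover probability p is 1 - H(p) bits per channel use, where H is the binary entropy function. The sketch below scales this by the 64 kbps bitrate and treats the BER as the crossover probability; it is assumed, not confirmed, that (28) takes this form.

```python
import math


def binary_entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity_bps(ber, bitrate_bps=64000.0):
    """Capacity in bits per second of a binary symmetric channel."""
    return bitrate_bps * (1.0 - binary_entropy(ber))


for ber in (0.0, 0.01, 0.1, 0.5):
    print(ber, bsc_capacity_bps(ber))
```

A lower average BER therefore translates directly into a higher error-free capacity, which is the comparison made across techniques.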

4.4. Discussion on the Developed Robust Digital Speech Watermarking

This section discusses the various aspects of the DWPT-Multiplication robust watermarking technique. Firstly, the effects of the threshold are discussed. Figure 5 shows BER with respect to different thresholds for the various AWGN channels. Whenever the threshold became larger, the robustness of the watermark improved. Due to the equalization of energy in each block, it was expected that the proper value for the threshold would be close to 1.

Figure 6 presents BER versus frame length for the developed digital watermarking technique in various AWGN channels. From Figures 6 and 7, it can be seen that decreasing the frame length increases BER and decreases SNR, due to the increase in watermark distortion induced by embedding more watermark bits. Conversely, a longer frame length causes less distortion in the watermarked signal and therefore a higher SNR. A shorter frame also increases BER because it has little energy and is easily affected by noise. Increasing the frame length means increasing the duration of the host speech signal, which decreases BER and increases SNR, as demonstrated in Figures 6 and 7. There is a limit, however, beyond which increasing the frame length no longer improves imperceptibility and robustness. This limit is determined by the length over which the signal remains quasi-stationary. Furthermore, a longer frame length can be effective in reducing BER under severe noise. As seen in Figure 6, whenever the frame size increased, the watermark was detected with a lower BER under severe noise (SNR = 0 dB). This shows that the frame length directly affects robustness and imperceptibility.

Figure 8 shows how BER changed with the strength of the watermark. The robustness of the developed watermarking technique improved when the strength of the watermark was increased, because the watermark is then embedded more strongly in the speech signal. As seen in Figure 8, increasing the strength of the watermark was effective only for seriously noisy channels (SNR = 0 dB or SNR = 20 dB). It seems that increasing the strength of the watermark under clean conditions is unnecessary.

Figure 9 shows how SNR changed with the strength of the watermark. It is clear that increasing the strength of the watermark decreases SNR and therefore the signal quality, since choosing a high intensity injects more distortion into the host signal. Even so, SNR was more than 22 dB at the intensity considered, and there was a difference of 10 dB when the intensity was decreased to 1.01.
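The dependence of SNR on the watermark strength can be made concrete with a simplified sketch. It assumes that a bit 1 scales half of the samples of a frame by `alpha` directly in the time domain and that the two halves have equal energy, in which case the SNR reduces to 10 log10(2 / (alpha - 1)^2). The actual technique scales only selected DWPT coefficients, so the absolute values differ from those in Figure 9.

```python
import math


def snr_db(original, watermarked):
    """Signal-to-noise ratio of the watermarked signal in dB."""
    signal = sum(x * x for x in original)
    noise = sum((y - x) ** 2 for x, y in zip(original, watermarked))
    return 10.0 * math.log10(signal / noise)


def embed_half(frame, alpha):
    """Scale the first half of the frame by alpha (bit 1, assumed rule)."""
    half = len(frame) // 2
    return [alpha * x for x in frame[:half]] + list(frame[half:])


frame = [0.3, -0.4, 0.4, -0.3]  # the two halves have equal energy
for alpha in (1.01, 1.05, 1.10):
    print(alpha, snr_db(frame, embed_half(frame, alpha)))
```

In this simplified model every tenfold reduction of (alpha - 1) raises the SNR by 20 dB, which reflects the tradeoff between robustness and imperceptibility described above.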

5. Effects of Robust Watermarking on the Performance of Speaker Recognition Systems

In this section, the effects of robust DWPT-Multiplication digital speech watermarking on the performance of speaker verification were evaluated in terms of two speaker verification systems, that is, i-vector and GMM-UBM systems.

Table 3 shows the effect of robust digital speech watermarking on the performance of different speaker verification systems for different speech databases. As seen, the best results were reported for the TIMIT speech database, which is a clean speech database.

Due to the mismatch in the channel, microphone, and environment, the other databases yielded lower performance than TIMIT. As shown in Table 3, i-vector with MFCC outperformed the other speaker verification systems. Furthermore, robust speech watermarking affected MFCC more than LPRC. From Table 3, the total effect of robust digital speech watermarking was calculated to be 1.16%. This amount is not negligible and shows that robust digital speech watermarking causes some degradation of the performance of online speaker recognition systems, which can justify the application of frame selection before robust digital speech watermarking.

Table 4 presents the effect of the developed robust digital speech watermarking system on the performance of the GMM speaker identification system in terms of recognition rate. As seen, the best recognition rates were reported for the TIMIT speech database, which is a clean speech database. Due to the mismatch in the channel, microphone, and environment, the other databases had lower recognition rates than TIMIT. Furthermore, it seems that MFCC outperformed LPRC. From Table 4, the total degradation caused by robust digital speech watermarking was calculated to be 2.52%. Therefore, the degradation effect of robust digital speech watermarking can be considerable and can be decreased by applying the frame selection technique.

As seen in Tables 3 and 4, robust watermarking and the speaker recognition system have opposite goals. Whenever SWR decreased, the robustness of the watermark increased; however, the performance of speaker recognition decreased, as confirmed in previous studies [5, 6, 12, 13]. Therefore, selecting the frames which have less speaker-specific information for watermarking can result in minimum degradation of the recognition performance of online speaker recognition systems.

6. Conclusion and Future Works

In this paper, a new robust digital speech watermarking technique was developed by applying DWPT and multiplication. This watermarking technique is very robust against different attacks such as filtering, additive noise, resampling, gain, and compression. By embedding the watermark in the less speaker-specific speech subbands, the degradation of recognition performance caused by this watermarking technique is minimized.

Future work will study a new adaptive multiplication technique. In addition, proposing a synchronization technique for this approach could improve the watermark extraction process.

Appendices

A. Shape of the Distribution Based on Statistical Moment

The GGD shape parameter is estimated by using the moments of a signal. Under the assumed parameters for a signal with a GGD, (A.1) can be expressed. The moments of this variable are then calculated; a change of variable gives (A.3), from which (A.4) follows. For a particular case, (A.5) is obtained, and finally the GGD shape parameter is estimated.
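One widely used moment-matching estimator for the GGD shape parameter is sketched below. It is given as an illustration and may differ from the exact moments used in this appendix: for a zero-mean GGD with shape β, the ratio (E|X|)^2 / E[X^2] equals Γ(2/β)^2 / (Γ(1/β) Γ(3/β)), which increases with β and can therefore be inverted by bisection. The function names and the search interval are assumptions.

```python
import math
import random


def moment_ratio(beta):
    """(E|X|)^2 / E[X^2] for a zero-mean GGD with shape parameter beta."""
    return math.gamma(2.0 / beta) ** 2 / (math.gamma(1.0 / beta) * math.gamma(3.0 / beta))


def estimate_shape(samples, low=0.2, high=10.0, iterations=60):
    """Match the sample moment ratio to the GGD ratio by bisection on beta."""
    n = len(samples)
    mean_abs = sum(abs(x) for x in samples) / n
    mean_square = sum(x * x for x in samples) / n
    target = mean_abs * mean_abs / mean_square
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if moment_ratio(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(20000)]
shape = estimate_shape(gaussian)
print(shape)  # expected to be close to 2, the Gaussian special case of the GGD
```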

B. The Computation of Statistical Density of a Ratio between Two Independent Normal Variables

The computation of statistical density of a ratio between two independent normal variables is formulated as follows: For , the expression appeared which is finally expressed as in the following equation:

Conflict of Interests

The authors declare that they have no conflict of interests.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments to improve the quality of this paper.