Abstract

Content authentication and tampering detection of multimedia are vital applications of digital watermarking. In this paper, we propose a novel fraGile wateRmArking of speeCh based on Endpoint Detection (GRACED) to verify the integrity of speech. First, the speech signal is framed word by word so that each speech frame contains one intact nonsilent word. Subsequently, feature fusion is adopted to generate the fragile watermark, which is embedded into the coefficients of a hybrid domain of the discrete wavelet transform (DWT) and singular value decomposition (SVD). Finally, tampering detection is accomplished for several kinds of attacks without using any synchronous code. Several experiments are conducted to quantify the performance of the proposed method. Experimental evaluation and comparisons with other schemes demonstrate that the proposed method achieves a high signal-to-noise ratio and favorable imperceptibility. Additionally, tampering caused by various malicious attacks can be localized without using synchronous code, and the proposed scheme can even determine the attack type.

1. Introduction

With the advent of the era of big data, the relationship between big data and multimedia security has become closer. Speech plays an important role in many areas of life, such as military communication, courtrooms, and the dissemination of policies [1–3]. In some cases, speech contains private information that can be used for judicial expertise. However, digital multimedia can be manipulated easily with various software, and the content of speech may be modified or tampered with by attackers during transmission or storage [4, 5], which could even threaten national security. The “terminal-network-cloud” architecture based on big data brings more challenges to speech content authentication. Therefore, it is essential to evaluate the integrity and authenticate the content of speech.

Generally, two technologies are used to achieve content authentication: content-based identification and information hiding. The first is the perceptual hash function [6–8] and the second is digital watermarking [9, 10]. Hashing produces a hash sequence that serves as a hash value or message digest. The generated sequence is stored in the cloud and compared with the hash sequence reconstructed from the received speech to verify the integrity of the speech content. A reliable speech perceptual hash authentication algorithm [11] uses static and dynamic characteristics of speech based on Mel-frequency cepstral coefficients. In the tampering detection process, the Hamming distance between the reconstructed hash sequence and the stored hash sequence is calculated to verify authenticity. To achieve content authentication of encrypted speech in the cloud, an efficient encrypted speech authentication method [12] based on uniform sub-band spectrum variance and perceptual hashing has been proposed. The reconstructed authentication digest and the original hash sequence stored in the cloud are matched by Hamming distance to detect tampering. A robust hash method based on MFCC (Mel-Frequency Cepstrum Coefficients) and PCA (Principal Component Analysis) has been introduced to verify the integrity and authenticity of speech content [13]. Experimental results show that the BER (Bit Error Rate) between the hash values of the original audio and the tampered audio is low for perceptual manipulations. However, the solutions proposed in the above studies segment speech by fixed-length framing. Moreover, the original hash sequence must be stored in the cloud, which consumes additional storage.

On the other hand, digital watermarking is an essential technology for content authentication, which embeds a secure message into speech without noticeable perceptual distortion. Integer Wavelet Transform and Non-negative Matrix Factorization can be used to authenticate speech content [14]. In the authentication process, the tampered region is located by comparing the reconstructed perceptual hash with the extracted one. Experiments demonstrate that the scheme is sensitive to malicious tampering of encrypted speech. Two fragile watermarking schemes are proposed in [15] using the LSB (Least Significant Bit) in the hybrid domain of the DCT (Discrete Cosine Transform) and the DST (Discrete Sine Transform). These schemes are more sensitive than the LSB method in the spatial domain but are limited to tampering detection. The combination of modifying the least significant digits and G.723.1 coding can be used to achieve speech content authentication and tamper recovery [16]. To recover the tampered area, the compressed signal is generated by G.723.1 coding and embedded into the original speech. An audio watermarking algorithm is proposed in [17]. In this algorithm, watermarks are generated from the compressed data of the GBT (Graph Based Transform) and then embedded into the coefficients of LSFs (Line Spectral Frequencies) via the combination of LP (Linear Prediction) and DM-QIM (Dither Modulation-Quantization Index Modulation). A secure watermarking algorithm based on chaos is introduced in [18]. The embedded information is the compressed DCT data of the secret audio, which is embedded into random sequences of singular value matrices via the combination of DWT and SVD (Singular Value Decomposition).
The uniform sub-band spectral variance and spectral entropy are merged into fusion features, and each watermark bit is determined by comparing the value of the corresponding fusion feature with the average value [19]. In [20], the speech is first encrypted, and then the G.723.1 compression algorithm is used to compress the speech frame data. Finally, the compressed data is embedded into the LSBs of the encrypted speech, so that the embedded information supports integrity authentication and tampering recovery of the speech content. An audio watermarking scheme in the compressed domain is designed in [21]. In this scheme, the Huffman data of each MP3 frame is used to carry the watermark. Experiments show good results in terms of inaudibility, robustness, and capacity rate. A novel blind digital audio watermarking scheme has been proposed in the wavelet and cosine transform domain [22]. To achieve tampering detection and copyright protection, a hash sequence is generated with SHA-512 to authenticate integrity, and an image is embedded to protect copyright. A blind speech watermarking algorithm operating on a frame-by-frame basis is presented in [23]. The method first perceptually manipulates the vector norms drawn from the FFT (Fast Fourier Transform) coefficients and then modifies the speech signal through the combination of DPQIM (Downward Progressive Quantization Index Modulation) and BCIA (Boundary Constrained Iterative Adjustment) according to the watermark bits. A robust dual-domain twofold encrypted image-in-audio watermarking scheme is introduced in [24]. Initially, the encrypted binary image is obtained. Then, the encrypted image and the host audio signal are decomposed by a hybrid of DTCWT (Dual-Tree Complex Wavelet Transform), STFT (Short-Time Fourier Transform), and SVD. Finally, the singular values of the encrypted image are embedded in the singular values of the host audio signal.
Taking advantage of the LWT (Lifting Wavelet Transform) and DCT, the encrypted watermark is embedded into selected coefficients to ensure the stability of the watermark [25]. In addition, cyclic coding is introduced to correct errors and improve the robustness of the watermark. In the tampering detection process, the extracted watermark and the original watermark are compared to locate the tampered area.

The hashing-based algorithms mentioned above for authenticating the integrity of speech content consume storage, and the generated watermark is nonblind. Most of the algorithms that authenticate speech content integrity through digital watermarking use fixed-length framing to embed the watermark, and embedding watermarks can affect the audibility of speech. To solve these problems, we propose an efficient speech content authentication scheme based on endpoint detection. The main contributions are as follows.
(1) In GRACED, watermark generation and embedding focus on the nonsilent segments of speech. This better preserves speech audibility by reducing the amount of watermark data and avoiding interference with silent frames.
(2) For desynchronization attacks, the misaligned location can be resynchronized in GRACED without extra synchronous codes, and synchronization does not require a bit-by-bit search.
(3) In GRACED, attack types can be determined from the continuity of the frame numbers.

This paper is organized as follows: Section 2 illustrates the proposed authentication scheme. Section 3 introduces the experimental results of the proposed scheme. The conclusions are described in Section 4.

2. Proposed Method

In this section, we mainly present the proposed method GRACED. Three subsections are written to describe GRACED in detail. Subsection 2.1 introduces the proposed framing method. Subsection 2.2 describes the watermark generation and embedding principle of the proposed method. Subsection 2.3 gives a more specific explanation about the content authentication.

2.1. Speech Framing

For an attacker, the purpose of a malicious attack is to change the content of the speech signal. A single word can change the meaning of a message, so modifying an entire word is more meaningful than modifying random sampling points, and whether a spoken word has been tampered with is of greater concern than whether a nonspeech segment has. Speech endpoint detection determines the starting point and ending point of every speech segment. Therefore, in this paper, endpoint detection is used to obtain speech segments dynamically instead of using traditional fixed-length frames. The segmentation method is described in the following steps.
(1) The speech signal is first broken into short frames, each containing the same number of samples.
(2) For each frame, the spectral centroid is calculated from the coefficients of the discrete Fourier transform of the frame, with the frequency index running from one to M, where M is the frame length.
(3) The short-term energy of each frame is calculated from the amplitudes of its sampling points.
(4) Two thresholds, one for the spectral centroid sequence and one for the energy sequence, are calculated as follows.
(i) The histograms of the spectral centroid sequence and the energy sequence are computed.
(ii) Two local maxima of the spectral centroid histogram are selected.
(iii) The threshold of the spectral centroid sequence is calculated from these two maxima and a user-defined parameter.
(iv) Two local maxima of the energy histogram are selected.
(v) The threshold of the energy sequence is calculated from these two maxima and a user-defined parameter.
(vi) After the two thresholds are calculated, the beginning point and ending point of each speech word are determined from a flag assigned to each frame.

A frame belongs to a speech segment if the value of its flag is one; otherwise, it belongs to a nonspeech segment.
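The framing steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the spectral centroid and short-term energy follow their textbook definitions, the two thresholds are passed in directly rather than estimated from histograms, and the rule that a frame is speech when both features exceed their thresholds is an assumption. All function names are hypothetical.

```python
import cmath
import math


def split_frames(signal, frame_len):
    """Cut the signal into non-overlapping frames of frame_len samples."""
    count = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(count)]


def spectral_centroid(frame):
    """Magnitude-weighted mean DFT bin index (bins counted from 1, half spectrum)."""
    m = len(frame)
    mags = []
    for k in range(m // 2):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / m) for n in range(m))
        mags.append(abs(acc))
    total = sum(mags)
    if total == 0.0:
        return 0.0  # silent frame: no spectral content
    return sum((k + 1) * mag for k, mag in enumerate(mags)) / total


def short_term_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)


def speech_flags(frames, centroid_thr, energy_thr):
    """Assumed rule: flag is 1 when both features exceed their thresholds."""
    return [
        1 if spectral_centroid(f) > centroid_thr and short_term_energy(f) > energy_thr else 0
        for f in frames
    ]


def word_boundaries(flags):
    """Return (first_frame, last_frame) index pairs for each run of ones."""
    words, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            words.append((start, i - 1))
            start = None
    if start is not None:
        words.append((start, len(flags) - 1))
    return words


def tone(count):
    """Toy 'word': a sinusoid with a period of four samples."""
    return [math.sin(2 * math.pi * 4 * n / 16) for n in range(count)]


# Toy signal: silence, a long word, silence, a short word.
signal = [0.0] * 32 + tone(64) + [0.0] * 32 + tone(32)
flags = speech_flags(split_frames(signal, 16), centroid_thr=1.0, energy_thr=0.1)
print(flags)
print(word_boundaries(flags))
```

In this toy signal the two sinusoidal bursts are returned as two words. Real speech would need thresholds derived from the feature histograms as described in step (4).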

According to (5), the result is shown in Figure 1. Figure 1(a) is the waveform of an original speech signal; the red lines mark the start position of each speech segment, and the blue lines mark the end position of each word. Figure 1(b) shows the flags obtained after endpoint detection, where the ordinate indicates whether a sampling point belongs to a speech segment. It can be seen that the speech signal is divided into speech segments and nonspeech segments.

Therefore, the speech signal can be divided into a set of speech segments, each corresponding to one word.

2.2. Watermark Generation and Embedding Algorithm

Figure 2 illustrates the overall architecture of the watermark generation and embedding process. The detailed introduction of each step is shown below:

2.2.1. Speech Framing

In the speech division step, we adopt the endpoint detection algorithm to divide the original speech into frames, each of which contains one speech word.

2.2.2. Watermark Generation

The watermark generation process includes six steps: feature extraction, feature fusion, feature watermark generation, frame number watermark generation, watermark connection, and watermark encryption.
(i) Feature extraction. A 2-level discrete wavelet transform is first performed on each frame to obtain the detail component and the approximation component. Then, three features of the approximation component are extracted: the mean value of the short-time Fourier transform coefficients, the mean value of the Mel-frequency cepstral coefficients, and the mean value of the root-mean-square energy.
(ii) Feature fusion. To reduce the amount of watermark data and improve robustness, the three extracted features are merged into a single fusion feature using three fusion coefficients.
(iii) Feature watermark generation. Perceptual hashing is applied to the fusion feature to generate the feature watermark of each speech frame.
(iv) Frame number watermark generation. Each frame number n is converted into binary bits of a fixed length to produce the frame number watermark.
(v) Watermark connection. The feature watermark and the frame number watermark are concatenated to form the watermark of each speech frame.
(vi) Watermark encryption. A pseudorandom sequence is generated with the logistic map.

Subsequently, the generated sequence is sorted in ascending order. The watermark is then encrypted by permuting its bit positions according to the index order of the sorted pseudorandom sequence.
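Steps (iv) to (vi) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the pseudorandom generator is taken to be the logistic map x ← μx(1 − x), the initial value `x0` and parameter `mu` play the role of the key, and the frame number width and example feature bits are hypothetical.

```python
def frame_number_bits(n, width):
    """Frame number n as a fixed-width list of bits, most significant first."""
    return [(n >> shift) & 1 for shift in range(width - 1, -1, -1)]


def logistic_sequence(x0, mu, length):
    """Iterate the logistic map x <- mu * x * (1 - x) and collect the values."""
    seq, x = [], x0
    for _ in range(length):
        x = mu * x * (1.0 - x)
        seq.append(x)
    return seq


def scramble_order(x0, mu, length):
    """Indices that sort the pseudorandom sequence in ascending order."""
    seq = logistic_sequence(x0, mu, length)
    return sorted(range(length), key=lambda i: seq[i])


def encrypt(bits, x0, mu):
    """Permute the bit positions with the sorted-index order."""
    order = scramble_order(x0, mu, len(bits))
    return [bits[i] for i in order]


def decrypt(scrambled, x0, mu):
    """Invert the permutation by regenerating the same order from the key."""
    order = scramble_order(x0, mu, len(scrambled))
    plain = [0] * len(scrambled)
    for pos, i in enumerate(order):
        plain[i] = scrambled[pos]
    return plain


feature_bits = [1, 0, 1, 1, 0, 0, 1, 0]              # hypothetical feature watermark
watermark = feature_bits + frame_number_bits(5, 4)   # watermark connection
cipher = encrypt(watermark, x0=0.3731, mu=3.99)
print(cipher, decrypt(cipher, x0=0.3731, mu=3.99) == watermark)
```

Because the receiver regenerates the same sequence from the key, no side information about the permutation has to be transmitted.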

2.2.3. Watermark Embedding

To verify the integrity of the speech, the generated watermarks are embedded into the speech signal. The embedding steps are as follows.
(i) Speech division. Using the framing method in Sec. 2.1, the speech signal is divided into words, and each word forms one frame.
(ii) Position selection. To improve the security of GRACED, a subset of the sampling points of each speech frame is selected by a secret key to carry the watermark.
(iii) DWT transformation. The DWT is performed on the selected sampling points to obtain the detail component and the approximation component.
(iv) Subsegmentation. The detail component is divided into subsegments.
(v) Bit embedding. Singular value decomposition is first executed on each subsegment to obtain its singular value. The singular value is then modified, using a quantization step, so that it carries one watermark bit. After that, the inverse singular value decomposition is performed to obtain the watermarked subsegment. This step is repeated until all watermark bits are embedded, which yields the watermarked detail component.
(vi) Inverse transform. The inverse discrete wavelet transform is performed on the watermarked detail component and the approximation component to obtain the watermarked speech subsegment.
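The embedding steps can be sketched in plain Python under several simplifying assumptions that are not taken from the paper: a one-level Haar wavelet stands in for the DWT, each subsegment is treated as a 1 × n matrix (whose only singular value is its Euclidean norm), the bit is carried by odd-even quantization of that singular value, and key-based position selection is omitted. Function names are hypothetical.

```python
import math


def haar_dwt(x):
    """One-level Haar DWT of an even-length sequence -> (approx, detail)."""
    r = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / r for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / r for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    r = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / r, (a - d) / r])
    return x


def quantize(s, bit, step):
    """Move s to the centre of a quantization cell whose index parity equals bit."""
    q = math.floor(s / step)
    if q % 2 != bit:
        q += 1
    return q * step + step / 2.0


def embed_bit(segment, bit, step):
    s = math.sqrt(sum(v * v for v in segment))  # singular value of a 1 x n matrix
    target = quantize(s, bit, step)
    if s == 0.0:
        return [target] + [0.0] * (len(segment) - 1)
    return [v * target / s for v in segment]   # rescale so the norm equals target


def extract_bit(segment, step):
    s = math.sqrt(sum(v * v for v in segment))
    return math.floor(s / step) % 2


def embed(samples, bits, step):
    approx, detail = haar_dwt(samples)
    size = len(detail) // len(bits)
    if size == 0:
        raise ValueError("not enough detail coefficients for the watermark")
    marked = list(detail)
    for k, bit in enumerate(bits):
        marked[k * size:(k + 1) * size] = embed_bit(detail[k * size:(k + 1) * size], bit, step)
    return haar_idwt(approx, marked)


def extract(samples, n_bits, step):
    _, detail = haar_dwt(samples)
    size = len(detail) // n_bits
    return [extract_bit(detail[k * size:(k + 1) * size], step) for k in range(n_bits)]


frame = [0.5 * math.sin(0.3 * n) for n in range(64)]   # toy speech frame
marked = embed(frame, [1, 0, 1, 1], step=0.05)
print(extract(marked, 4, step=0.05))
```

Placing the quantized value at the centre of its cell gives the extractor a margin of half a step in either direction, so the sketch tolerates floating-point rounding while any substantial modification of the frame changes the recovered bits.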

2.2.4. Connection

Steps 2 and 3 are applied to each speech frame so that it carries its generated watermark. Then, all watermarked speech frames are concatenated to obtain the watermarked speech.

2.3. Content Authentication Algorithm
2.3.1. Speech Framing

Using the speech framing method in Sec. 2.1, the watermarked speech is divided into frames, each of which contains one watermarked speech word.

2.3.2. Feature Watermark Reconstruction

Following step 2 of the watermark generation process, the reconstructed feature watermark is calculated for each frame.

2.3.3. Watermark Extraction

The watermark extraction process is as follows.
(i) Position selection. For each speech frame, a subset of the sampling points is selected by the secret key.
(ii) Frequency domain transformation. The discrete wavelet transform is performed on the selected sampling points to obtain the detail component and the approximation component. Subsequently, the detail component is divided into segments, and singular value decomposition is executed on each segment to obtain its singular value.
(iii) Watermark extraction. Based on (11), the watermark bits are calculated one by one.
(iv) Watermark decryption. The logistic map is used to generate a pseudorandom sequence, which is sorted in ascending order. The resulting index order is then used to decrypt the extracted watermark of each frame. Finally, the feature watermark and the frame number watermark are separated.

2.3.4. Tampering Location

The information distance between the reconstructed feature watermark and the extracted feature watermark is calculated for each frame, and the tampering detection result is defined from this distance.

If the detection result indicates that the two watermarks match, the corresponding frame is intact and its frame number can be recovered. Otherwise, the frame has been tampered with, and the tampered frame number can be determined from the break in continuity of the frame number sequence.
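The decision logic can be sketched as below. The sketch assumes that the information distance is the Hamming distance and that any nonzero distance marks a frame as tampered; the classification rules are inferred from the attack experiments in Section 3.4 and are not a definitive statement of the authors' procedure. Function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def locate_tampering(reconstructed, extracted):
    """One flag per frame: True when the two feature watermarks differ."""
    return [hamming_distance(r, e) > 0 for r, e in zip(reconstructed, extracted)]


def classify_attack(tampered, frame_numbers):
    """tampered[i] is the flag of frame i; frame_numbers[i] is its extracted number.

    Frame numbers of tampered frames are ignored, so they may be None.
    Returns (attack kind, 1-based positions of tampered frames, missing numbers).
    """
    intact = [n for flag, n in zip(tampered, frame_numbers) if not flag]
    present = set(intact)
    missing = (
        [n for n in range(min(intact), max(intact) + 1) if n not in present]
        if intact else []
    )
    positions = [i + 1 for i, flag in enumerate(tampered) if flag]
    if positions and not missing:
        kind = "insertion"             # extra frame, numbering still consecutive
    elif positions and missing:
        kind = "substitution"          # a frame fails and its number is lost
    elif missing:
        kind = "deletion or muteness"  # all frames pass but a number is absent
    else:
        kind = "intact"
    return kind, positions, missing


# Word inserted after the third word: eight frames, the fourth fails authentication.
print(classify_attack([False, False, False, True, False, False, False, False],
                      [1, 2, 3, None, 4, 5, 6, 7]))
# Third word silenced: six frames pass, frame number 3 is absent.
print(classify_attack([False] * 6, [1, 2, 4, 5, 6, 7]))
```

Note that a word removed at the very beginning or end of the speech leaves no gap inside the extracted number sequence, so this sketch would not flag it without knowledge of the expected first and last frame numbers.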

3. Experimental Results

In this section, experiments are performed to verify the effectiveness of the designed audio watermarking algorithm. The simulation software is Python 3.9. A total of 240 speech signals (80 female, 80 male, and 80 children's speech signals) are selected to evaluate performance. Every speech recording is a 16-bit monaural file in WAVE format.

3.1. The Robustness of Framing

In this paper, endpoint detection is used to divide speech. Therefore, the robustness of the endpoint detection method directly affects the accuracy of finding speech frames and of tampering localization. To quantify the robustness of framing, the following experiments are performed. Five common signal processing operations are applied to the original speech signal: low-pass filtering, quantization, noise addition, MP3 compression, and resampling. Subsequently, the attacked speech signal is framed by the endpoint detection method. Figure 3 shows that the attacked speech is still accurately divided into words after these conventional signal processing operations. Therefore, the framing method in this paper is considered to have good robustness.

3.2. Inaudibility

Inaudibility can be assessed subjectively and objectively. On the one hand, the waveforms of the original speech and the watermarked speech are shown in Figure 4, which shows no obvious difference between the two. On the other hand, the SNR is employed to measure the quality of the watermarked speech. It is computed from the sample values of the original speech, the sample values of the watermarked speech, and the number of sampling points.
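As an illustration of the objective measure, the following sketch computes the SNR in its standard form: ten times the base-10 logarithm of the ratio between the energy of the original signal and the energy of the embedding distortion. It is assumed here that the paper uses this common definition; the function name is hypothetical.

```python
import math


def snr_db(original, watermarked):
    """SNR in dB between an original signal and its watermarked version."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, watermarked))
    if noise == 0:
        return math.inf    # identical signals: no distortion at all
    if signal == 0:
        return -math.inf   # silent original with nonzero distortion
    return 10.0 * math.log10(signal / noise)


# A constant error of 0.1 on unit-amplitude samples gives about 20 dB.
print(snr_db([1.0] * 100, [1.1] * 100))
```

A higher value means that the watermark changes the waveform less relative to the signal energy.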

In this experiment, different algorithms are evaluated for inaudibility using the same speech signal and embedding capacity. Table 1 shows that the SNR values of GRACED are larger than those of Refs. [14], [16], and [25]. This means that GRACED can authenticate the integrity of the speech signal with better inaudibility.

3.3. Fragility

Fragility means that the watermark is sensitive to all kinds of malicious and nonmalicious attacks. That is, the embedded watermark is changed by malicious attacks (such as insertion, deletion, mute, and substitution attacks) and by common signal processing (such as resampling, low-pass filtering, and compression). The bit error rate (BER) between the generated watermark and the extracted watermark is used to evaluate fragility. The BER is defined as the number of differing bits between the generated watermark and the extracted watermark divided by the total number of watermark bits.
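The BER definition above translates directly into a few lines of Python. The function name is hypothetical, and the watermarks are assumed to be given as equal-length lists of bits.

```python
def bit_error_rate(generated, extracted):
    """Fraction of watermark bits that differ between the two sequences."""
    if len(generated) != len(extracted):
        raise ValueError("watermarks must have the same length")
    if not generated:
        raise ValueError("watermarks must not be empty")
    errors = sum(g != e for g, e in zip(generated, extracted))
    return errors / len(generated)


print(bit_error_rate([1, 0, 1, 1], [1, 1, 1, 0]))  # two of four bits differ
```

For a fragile watermark, a value near 0.5 after an attack indicates that the extracted bits are no better than random guesses with respect to the generated watermark.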

To test the fragility of the proposed algorithm, several typical signal processing operations are performed on the watermarked speech. The details are as follows.
(1) AWGN: 30 dB white Gaussian noise is added to the watermarked speech.
(2) Low-pass filtering: A low-pass filter with a cutoff frequency of 1.5 kHz is applied to the watermarked speech.
(3) Requantization: The watermarked speech is quantized from 16 bits per sample down to 8 bits per sample and then requantized from 8 bits per sample back to 16 bits per sample.
(4) Compression: The format of the watermarked speech is changed from WAV to MP3.
(5) Resampling: The watermarked speech is downsampled from 4.8 kHz to 1.6 kHz and then upsampled from 1.6 kHz to 4.8 kHz.
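As one concrete example of these operations, the requantization attack can be sketched for signed 16-bit integer samples. The sketch assumes simple truncation of the eight least significant bits; the exact rounding used in the experiments is not stated, and the function name is hypothetical.

```python
def requantize_16_8_16(samples):
    """Reduce signed 16-bit samples to 8 bits, then scale back to 16 bits.

    The arithmetic right shift discards the low byte (8-bit quantization),
    and the left shift restores the 16-bit range with that byte zeroed.
    """
    return [(s >> 8) << 8 for s in samples]


print(requantize_16_8_16([0, 255, 256, -1, 32767, -32768]))
```

Because the operation perturbs every sample by up to 255 quantization levels, it would be expected to disturb the values that carry a fragile watermark.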

Table 2 shows the fragility of the proposed algorithm for different speech signals. In this table, each BER value is the average over 80 speech signals. The BER between the reconstructed watermark and the extracted watermark is around 0.5 after common signal processing, which indicates that the error bits are essentially random. Therefore, GRACED is considered highly sensitive to conventional signal processing.

3.4. Tampering Detection and Location

Malicious attacks are executed on speech signals for the purpose of modifying their content, and modifying an entire word is more meaningful than modifying random sampling points. Therefore, a fragile watermark for speech authentication based on endpoint detection is proposed to verify the integrity of speech words. Tampering detection focuses mainly on entire words rather than on fixed-length speech segments. To validate the proposed method, several malicious attacks are applied to the watermarked speech, as detailed below.

3.4.1. Insertion Attack

In this attack, one word is inserted into the watermarked speech signal, here after the third word. Figure 5 shows the watermarked speech signal (Figure 5(a)), the attacked speech (Figure 5(b)), the location result (Figure 5(c)), and the status of the frame numbers (Figure 5(d)). In the experiment, the watermarked speech is divided into seven frames, whereas the attacked speech is divided into eight frames and the fourth frame is classified as tampered, as shown in Figure 5(c). Meanwhile, the frame number sequence extracted from the intact speech frames is {1, 2, 3, 4, 5, 6, 7}, which is consecutive, as shown in Figure 5(d). Hence, the attack is identified as an insertion attack. Therefore, GRACED can accurately locate the tampering position without using synchronous code and can judge the attack type.

3.4.2. Deletion Attack

In this attack, one or more speech words are deleted from the watermarked speech. In the deletion experiment, the watermarked speech is shown in Figure 6(a), and the third word (9600 sampling points) is deleted, as shown in Figure 6(b). According to the content authentication, all words in Figure 6(c) are considered unattacked, and the correct frame numbers are {1, 2, 3, 5, 6, 7}, as shown in Figure 6(d). Hence, the missing frame number can be identified from the continuity of the frame numbers. From the detection result and the missing frame number, the attack is judged to be a deletion attack.

3.4.3. Muteness Attack

In this attack, one or more words are silenced in the watermarked speech. Because the muteness attack also removes speech content, it is considered a kind of deletion attack. In the experiment, the third word contains 12000 sampling points, as shown in Figure 7(a), and it is silenced to produce the attacked speech shown in Figure 7(b). According to the content authentication process, all speech frames are considered intact in Figure 7(c). However, as with the deletion attack, the missing frame number can be identified from the continuity of the frame numbers in Figure 7(d). Here, the frame numbers of the intact frames are {1, 2, 4, 5, 6, 7}, so the silenced content is the third word. Therefore, the tampering detection result and the missing frame number indicate that the attack is a muteness attack.

3.4.4. Substitution Attack

In this attack, one or more speech words are replaced by another word or by random sampling points. In the experiment, the fifth word is replaced by random sampling points, and the attacked speech is shown in Figure 8(b). With GRACED, the attacked speech is divided into seven frames, and the fifth frame is judged to be tampered. Meanwhile, the extracted correct frame numbers are {1, 2, 3, 4, 6, 7}, so the missing frame number can be determined. From the status of the frame numbers and the tampering location result, the attack is identified as a substitution attack.

In this section, four attack experiments are executed. The results show that GRACED can locate the tampered content of speech and judge the type of attack based on the tampering location result and the status of the frame numbers.

4. Conclusion

To realize integrity authentication and tampering localization of speech content, a speech content authentication scheme based on endpoint detection, GRACED, is proposed in this paper. First, the speech signal is divided into frames word by word using endpoint detection. For each extracted word, the approximation and detail components are calculated by the discrete wavelet transform. Second, feature fusion and perceptual hashing are combined to generate the authentication watermark. Finally, the integrity of the speech content is authenticated and tampering localization is achieved with that watermark. Extensive experiments show that GRACED is sensitive to conventional signal processing, while the embedded watermark has good imperceptibility. Compared with other algorithms, the signal-to-noise ratio is high, and tampering caused by various malicious attacks can be localized. Furthermore, the attack type can be identified from the continuity of the frame numbers of the intact speech words.

Data Availability

The experimental data used to support the findings of this study can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no financial, affiliation-related, intellectual property, personal, ideological, or academic conflicts of interest regarding this paper.

Authors’ Contributions

Shuyun Zhou and Meixin Song contributed equally to this work.

Acknowledgments

This work was supported by the National Natural Science Foundation of China, under Grant 61902085, the Guizhou Provincial Science and Technology Projects, under Grant no. Qian Ke He Jichu-ZK[2021]YiBan312, and Technological Talents of Guizhou Provincial Science, under Grant no. Qian Jiao He KY Zi[2021]136.