Abstract

Splicing is one of the most common tampering techniques for speech forgery in many forensic scenarios. Some successful approaches have been presented for detecting speech splicing when the spliced segments have different signal-to-noise ratios (SNRs). However, when the SNRs of the spliced segments are close or even the same, no effective detection method has been reported yet. In this study, the noise inconsistency between the original speech and a segment inserted from another speech is utilized to detect the splicing trace. First, the noise signal of the suspected speech is extracted by a parameter-optimized noise estimation algorithm. Second, statistical Mel frequency features are extracted from the estimated noise signal. Finally, the spliced region is located by applying a change point detection algorithm to the features of the estimated noise signal. The effectiveness of the proposed method is evaluated on a well-designed speech splicing dataset. The comparative experimental results show that the proposed algorithm achieves better detection performance than other algorithms.

1. Introduction

With the wide spread of social networks and the rapid development of powerful audio editing tools (such as Adobe Audition and GoldWave), digital speech can be easily accessed, manipulated, and distributed. Such tools have provided considerable convenience in areas such as social activity, news media, and entertainment. Modified speech, however, may have unpredictable consequences when it is presented in settings such as judicial proceedings or criminal investigations. Digital speech forensics [1–3] is a valuable technique for determining the authenticity of digital speech. By analyzing the modification traces left in the suspected speech, digital forensics can identify the tampering type and locate the tampering position [4].

Deletion, insertion, and splicing are the three most common tampering operations that can significantly change the content of the original speech. Splicing is an operation in which one or more speech segments are inserted into the original speech to change its content. In general, splicing is accompanied by deletion and insertion. According to whether the inserted speech segment comes from the original speech or not, splicing can be further divided into self-splicing and transsplicing, respectively. Specifically, self-splicing refers to copying a segment of the original speech and inserting it into another region of the same speech. Since self-splicing introduces highly similar regions into the spliced speech, a detector can use the similarity of speech features as a criterion to find the matching spliced regions. In real scenarios, transsplicing is more common than self-splicing. On the one hand, forgers tend to splice speech components from different sources or scenes. On the other hand, in most cases it is hard for forgers to find a suitable segment within the original speech. In this work, we focus on the detection of speech transsplicing.

Speech splicing detection is an important branch of multimedia security [5, 6], and many splicing detection algorithms [7–9] for digital speech have been proposed over the last decade. The ENF- (electric network frequency) based method [10] is effective for detecting speech splicing: the ENF signal is extracted from a questioned audio recording and matched against the reference signal in an ENF database. Reis et al. [11] proposed an ESPRIT-Hilbert ENF estimator with an outlier detector based on the kurtosis of the estimated ENF. The kurtosis is then taken as the input of a support vector machine classifier to indicate the presence of splicing. However, ENF-based detection algorithms may not be applicable when the speech is recorded with well-designed or battery-operated devices. In addition, a reference ENF dataset is needed during an ENF-based forensic investigation. Imran [12] proposed a splicing detection algorithm based on intrinsic statistical properties of the suspected speech. The speech is first divided into segments using voice activity detection, and the histogram of the one-dimensional LBP (local binary pattern) is exploited as the detection feature. Zhao et al. [13] introduced the channel impulse response to detect speech splicing. The impulse response amplitude and background noise are used to determine the location of the splicing.

In real scenarios, in order to remove the splicing trace, the forger would try to keep the SNR (signal-to-noise ratio) of the processed speech as consistent as possible between the spliced and the original regions. This greatly increases the difficulty of the splicing detection task. As far as we know, there is no prior work on transsplicing detection with the same SNR. In this study, we propose an approach for detecting transsplicing with the same SNR. First, the Sorensen algorithm [14] is utilized to estimate the noise of the suspected speech. Then, the variances of the Mel frequency cepstral coefficients (MFCCs) [15] of the estimated noise signal are calculated as the detection features. Finally, the spliced region is located by a change point detection algorithm based on the penalty cost function [16]. The performance of the proposed algorithm is evaluated on a well-designed speech splicing dataset. The experimental results show that the proposed algorithm achieves better detection accuracy than other algorithms.

The rest of the study is organized as follows. The main work of this study is described in Section 2, in which noise estimation, feature extraction, and the change point detection algorithm are described in detail. Section 3 will present the splicing dataset and the experimental results. Finally, the conclusion is drawn in Section 4.

2. Proposed Transsplicing Detection Algorithm

The proposed framework for transsplicing detection and localization is shown in Figure 1. First, the Sorensen algorithm is adopted to estimate the noise signal. Next, the estimated noise is framed, and its Mel-frequency cepstral coefficients are extracted. The variance of the coefficients is calculated as the detecting feature. Finally, the change point detection algorithm is applied on the variance sequence to detect and locate the splicing.

2.1. Noise Estimation

Sorensen [14] proposed a recursive averaging noise estimation algorithm. The idea is that different attenuation rules are applied to different regions so that the noise in the speech can be estimated accurately. Figure 2 shows the flowchart of this algorithm.

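Let $y(i)$ be the suspected speech at time $i$, which consists of clean speech $s(i)$ and additive noise $n(i)$. First, the windowed and framed speech signal is subjected to the short-time Fourier transform (STFT):

$$Y(\lambda,k)=\sum_{\mu=0}^{L-1} y(\lambda R+\mu)\,h(\mu)\,e^{-j2\pi k\mu/L}=S(\lambda,k)+N(\lambda,k), \qquad (1)$$

where $\lambda$ is the time index, $k$ is the frequency bin index, $L$ is the window length, $R$ is the frame shift, $h(\mu)$ is the analysis window, and $S(\lambda,k)$ and $N(\lambda,k)$ are the STFT coefficients of $s(i)$ and $n(i)$, respectively.

Then, the periodogram of the noisy speech can be calculated as

$$P_Y(\lambda,k)=|Y(\lambda,k)|^2. \qquad (2)$$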
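As a rough illustration of the framing, transform, and periodogram steps, the following Python sketch computes the squared STFT magnitude of each frame. The Hann window, the frame length `L`, the frame shift `R`, and the function name are illustrative assumptions; the text does not specify them.

```python
import cmath
import math


def stft_periodogram(y, L=8, R=4):
    """Return P[lam][k] = |Y(lam, k)|^2 for Hann-windowed frames of length L.

    L (frame length) and R (frame shift) are illustrative defaults only.
    """
    # Periodic Hann analysis window (an assumption, not taken from the paper).
    h = [0.5 - 0.5 * math.cos(2 * math.pi * mu / L) for mu in range(L)]
    n_frames = 1 + (len(y) - L) // R
    P = []
    for lam in range(n_frames):
        frame = [y[lam * R + mu] * h[mu] for mu in range(L)]
        row = []
        for k in range(L // 2 + 1):  # non-negative frequency bins only
            Y = sum(frame[mu] * cmath.exp(-2j * math.pi * k * mu / L)
                    for mu in range(L))
            row.append(abs(Y) ** 2)
        P.append(row)
    return P


# Usage: a tone located exactly on bin 2 peaks at k = 2 in every frame.
tone = [math.cos(2 * math.pi * 2 * i / 8) for i in range(16)]
periodogram = stft_periodogram(tone, L=8, R=4)
```

A direct DFT is used here only to keep the sketch dependency-free; a practical implementation would use an FFT routine.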

Next, the noisy speech periodogram $P_Y(\lambda,k)$ is spectrally smoothed to produce $\bar{P}_Y(\lambda,k)$ and then temporally smoothed to $P(\lambda,k)$. Then, the temporal minimum values $P_{\min}(\lambda,k)$ are tracked within a minimum search window of length $D_{\min}$, that is,

$$P_{\min}(\lambda,k)=\min\{P(\psi,k)\mid \lambda-D_{\min}<\psi\le\lambda\}, \qquad (3)$$

where $D_{\min}=U\cdot V$ represents the analysis window length. Since it is computationally expensive to find the minimum in each frequency band for each frame, an efficient procedure [17] is used in which the analysis window is divided into $U$ subwindows of $V$ samples. Hence, the minimum is updated once every $V$ samples and stored for later use, which reduces the number of comparison operations per frame and frequency bin.
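The subwindow-based minimum search can be sketched as follows for a single frequency bin. This is a simplified illustration under stated assumptions, not the implementation of [14] or [17]: the names `U` (number of subwindows), `V` (subwindow length), and `track_minimum` are chosen here for the example, and the search span is the current partial subwindow plus the last `U - 1` completed ones.

```python
from collections import deque


def track_minimum(p, U=2, V=5):
    """Track the minimum of the smoothed periodogram values `p` for one bin.

    The search window of about U*V frames is split into U subwindows of V
    frames, so each frame needs at most U comparisons instead of U*V.
    """
    stored = deque(maxlen=U - 1)   # minima of the last U-1 completed subwindows
    current = float("inf")         # running minimum of the current subwindow
    minima = []
    for i, value in enumerate(p, start=1):
        current = min(current, value)
        minima.append(min([current, *stored]))
        if i % V == 0:             # subwindow complete: store it and restart
            stored.append(current)
            current = float("inf")
    return minima


# Usage: the tracked minimum follows the floor of the sequence with a delay.
track = track_minimum([5, 4, 3, 2, 1, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7], U=2, V=5)
```

Smaller `U` and `V` shorten the search window, so the tracked minimum reacts faster to a change in the noise floor at the cost of a noisier estimate.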

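For $D(\lambda,k)=1$, the noise periodogram estimate $P_N(\lambda,k)$ is equal to a time-varying power scaling of the minimum tracks $P_{\min}(\lambda,k)$. For $D(\lambda,k)=0$, it is equal to the noisy speech periodogram $P_Y(\lambda,k)$, that is,

$$P_N(\lambda,k)=\begin{cases}R_{\min}\,P_{\min}(\lambda,k), & D(\lambda,k)=1,\\ P_Y(\lambda,k), & D(\lambda,k)=0,\end{cases} \qquad (4)$$

where $D(\lambda,k)$ is used to determine whether speech exists ($D(\lambda,k)=1$ when speech is present). $R_{\min}$ is a bias compensation factor, and it is updated only in the nonspeech frames.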

A smooth estimate of the noise magnitude spectrum can be obtained by

After the above steps, we obtain the enhanced speech $\hat{s}(i)$. Finally, the estimated noise signal $\hat{n}(i)$ can be obtained by subtracting the enhanced speech $\hat{s}(i)$ from the noisy speech $y(i)$, that is,

$$\hat{n}(i)=y(i)-\hat{s}(i). \qquad (6)$$

It is seen from equation (3) that the search window length $D_{\min}$ plays an important role in the noise estimation process. $D_{\min}$ is mainly used to control a fixed-length window. In the noise estimation process of each frame, the minimum in the window is tracked, and the value obtained by the tracking is used to continuously update $P_{\min}(\lambda,k)$. Finally, the noise power spectrum is calculated from $P_{\min}(\lambda,k)$. It can be seen from the above analysis that a reasonable adjustment of the number of subwindows $U$ and the subwindow length $V$ can effectively improve the noise estimation performance of the algorithm.

2.2. Detection Feature Extraction

For each frame of the estimated noise, the Mel frequency cepstral coefficients, which are based on the human peripheral auditory system, are extracted. Figure 3 shows the diagram of MFCC extraction.

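First, the estimated noise signal is subjected to the DFT to obtain a linear spectrum $X(k)$. Then, $X(k)$ is filtered by the Mel frequency filter bank to obtain the Mel spectrum. In order to make the result more robust to noise and spectral estimation errors, the logarithmic energy of the Mel spectrum is generally taken, that is,

$$E(m)=\ln\left(\sum_{k=0}^{N-1}|X(k)|^2 H_m(k)\right),\quad 0\le m<M, \qquad (7)$$

where $N$ is the DFT length, $H_m(k)$ is the frequency response of the $m$th Mel filter, and $M$ is the number of filter banks.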

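Next, the logarithmic Mel energy $E(m)$ is subjected to the DCT to obtain the MFCC coefficients:

$$C(n)=\sum_{m=0}^{M-1}E(m)\cos\left(\frac{\pi n(m+0.5)}{M}\right),\quad 0\le n<M, \qquad (8)$$

where $M$ is the number of Mel filters and $n$ is the index of the cepstral coefficients.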

Finally, for each frame, the variance of the MFCC coefficients $C(n)$ is calculated by equation (9), which yields a variance sequence for each suspected speech.
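The feature extraction for one frame can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: the triangular filter shape, the common `2595 * log10(1 + f / 700)` Mel mapping, the use of coefficients 1 to 12 (skipping the zeroth), the absence of a window function, and all function names are choices made for this example.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters H[m][k] over bins 0..n_fft//2, equally spaced in Mel."""
    top = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * fs / n_fft
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_variance(frame, bank, n_ceps=12):
    """Variance of MFCC coefficients 1..n_ceps for a single noise frame."""
    n_fft = len(frame)
    # Power spectrum |X(k)|^2 by direct DFT (slow but dependency-free).
    power = []
    for k in range(n_fft // 2 + 1):
        X = sum(x * cmath.exp(-2j * math.pi * k * i / n_fft)
                for i, x in enumerate(frame))
        power.append(abs(X) ** 2)
    M = len(bank)
    # Log Mel energies; the small constant guards against log(0).
    E = [math.log(sum(h * p for h, p in zip(row, power)) + 1e-12) for row in bank]
    # DCT of the log energies gives the cepstral coefficients.
    C = [sum(E[m] * math.cos(math.pi * n * (m + 0.5) / M) for m in range(M))
         for n in range(1, n_ceps + 1)]
    mean = sum(C) / n_ceps
    return sum((c - mean) ** 2 for c in C) / n_ceps
```

Applying `mfcc_variance` to every frame of the estimated noise gives the variance sequence. Because the zeroth coefficient is skipped in this sketch, the feature does not depend on the overall gain of the frame.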

2.3. Change Point Detection

Since the segments used in transsplicing come from different sources or scenes, we consider the inconsistency of the noise characteristics in the suspected speech to be a clue of splicing. This means that the noise characteristics change where the splicing happens. Hence, splicing detection and localization can be transformed into a change point detection problem. Algorithms for change point detection [18–20] have made good progress in recent years. Lavielle [16] proposed a model selection method based on a penalized contrast, which is applied to the change point problem. It can be used for estimating the number of change points and their locations. In this work, Lavielle’s algorithm is adopted to find the splicing positions.

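Let $Y=(Y_1,\ldots,Y_n)$ be the variance sequence of the estimated noise signal and $K$ be some integer. Similarly, let $\tau=(\tau_1,\ldots,\tau_{K-1})$ be a sequence of integers satisfying $0<\tau_1<\tau_2<\cdots<\tau_{K-1}<n$. For any $1\le k\le K$, let $U(Y_{\tau_{k-1}+1},\ldots,Y_{\tau_k};\theta)$ be a contrast function for estimating the unknown true value of the parameter in the segment $k$. This means that the estimated value of $\theta$ is the one at which the contrast function reaches its minimum. In other words, the minimum contrast estimate $\hat{\theta}(Y_{\tau_{k-1}+1},\ldots,Y_{\tau_k})$, computed on segment $k$ of $\tau$, is defined as a solution of the following minimization problem:

$$U\big(Y_{\tau_{k-1}+1},\ldots,Y_{\tau_k};\hat{\theta}(Y_{\tau_{k-1}+1},\ldots,Y_{\tau_k})\big)\le U\big(Y_{\tau_{k-1}+1},\ldots,Y_{\tau_k};\theta\big),\quad \forall\,\theta\in\Theta. \qquad (10)$$

Then, we define the contrast function as

$$J(\tau,Y)=\frac{1}{n}\sum_{k=1}^{K}U\big(Y_{\tau_{k-1}+1},\ldots,Y_{\tau_k};\hat{\theta}(Y_{\tau_{k-1}+1},\ldots,Y_{\tau_k})\big), \qquad (11)$$

where $\tau_0=0$ and $\tau_K=n$.

As an example, consider the following model:

$$Y_i=\mu_i+\sigma_i\varepsilon_i,\quad 1\le i\le n, \qquad (12)$$

where $(\varepsilon_i)$ is a sequence with zero mean and unit variance. In the case of changes in the variance, $(\mu_i)$ is a constant sequence and $(\sigma_i)$ is a piecewise constant one. The contrast function can be defined as a Gaussian log-likelihood, even if $(\varepsilon_i)$ is not a Gaussian sequence.

Then,

$$J(\tau,Y)=\frac{1}{n}\sum_{k=1}^{K}n_k\log\big(\hat{\sigma}_k^2\big),$$

where $n_k=\tau_k-\tau_{k-1}$ is the length of segment $k$ and $\hat{\sigma}_k^2$ is the variance of $(Y_{\tau_{k-1}+1},\ldots,Y_{\tau_k})$. For instance, when the maximum number of segments is $K=3$, the number of change points is $K-1=2$, and the change boundaries are $\tau_1$ and $\tau_2$.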
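To make the variance-change case concrete, the sketch below finds the boundaries that minimize the sum of per-segment Gaussian log-likelihood costs (segment length times the log of the segment's empirical variance) by dynamic programming. It is a simplified illustration with a fixed number of segments `K`: the penalty-based selection of the number of segments in Lavielle's method is not implemented, and `min_len`, `min_var`, and the function names are assumptions made for this example.

```python
import math


def segment_cost(y, start, end, min_var=1e-12):
    """Length of y[start:end] times the log of its empirical variance."""
    n = end - start
    mean = sum(y[start:end]) / n
    var = sum((v - mean) ** 2 for v in y[start:end]) / n
    return n * math.log(max(var, min_var))  # min_var guards against log(0)


def detect_change_points(y, K=3, min_len=5):
    """Return the K-1 boundaries that minimize the total segment cost.

    Requires len(y) >= K * min_len. The number of segments K is fixed here.
    """
    n = len(y)
    inf = float("inf")
    # best[k][t]: minimal cost of splitting y[:t] into k segments.
    best = [[inf] * (n + 1) for _ in range(K + 1)]
    back = [[0] * (n + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for t in range(k * min_len, n + 1):
            for s in range((k - 1) * min_len, t - min_len + 1):
                if best[k - 1][s] == inf:
                    continue
                cost = best[k - 1][s] + segment_cost(y, s, t)
                if cost < best[k][t]:
                    best[k][t], back[k][t] = cost, s
    # Walk the back-pointers from the end of the sequence to its start.
    boundaries, t = [], n
    for k in range(K, 0, -1):
        t = back[k][t]
        boundaries.append(t)
    return sorted(boundaries)[1:]  # drop the leading boundary at 0


# Usage: a low-variance stretch (frames 40..79) inside a high-variance sequence.
toy = [(-1) ** i * (0.1 if 40 <= i < 80 else 1.0) for i in range(120)]
boundaries = detect_change_points(toy, K=3)
```

The toy input is deterministic and only mimics a spliced region whose feature variance differs from its surroundings; real variance sequences are noisier, and the recovered boundaries are then approximate.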

Finally, we summarize our splicing detection algorithm as follows. First, we estimate the power spectral density of the noise in the noisy speech signal and then use it to obtain the enhanced speech signal. The noise signal is then estimated as the difference between the noisy speech and the enhanced speech. Next, the estimated noise is framed and windowed, and the MFCC coefficients are calculated for each frame. The variance sequence of the MFCC coefficients is obtained and taken as the input of the change point detection algorithm, and the penalty cost function is constructed by equation (11). Finally, the parameters estimated by minimizing the penalty cost function give the number of change points and the boundaries of the change segments. The boundaries of the change segments are the final detected tampering positions.

3. Experimental Results

In this section, we first describe the dataset adopted in this work. As mentioned in Section 2.1, the performance of the proposed detection algorithm depends strongly on the effectiveness of the noise estimation. Hence, the noise estimation algorithm is evaluated to find the optimal parameters. Then, the performance of the proposed splicing detection method with the optimized noise estimation is presented.

3.1. Splicing Dataset

The transsplicing speech samples in this study are created from the NOIZEUS speech corpus [21], which consists of clean speech contaminated by various kinds of real-world noise. The clean speech comprises 30 IEEE sentences spoken by three male and three female speakers. The noise signals in NOIZEUS come from the AURORA-2 database [22], including noise from train stations, airports, exhibition halls, streets, and restaurants, as well as car noise, noise from commuter trains, and babble noise from multiperson speech. During noise contamination, SNRs of 0 dB, 5 dB, 10 dB, and 15 dB have been considered.

The creation process of the splicing speech dataset is as follows. First, the samples of NOIZEUS corpus are divided into two classes: the original samples and the samples to be spliced. Then, for each sample to be spliced, we further cut it into 4 different segments by using random numbers. For each original sample, a pseudorandom generator is used to determine where the segment will be spliced. Next, the splicing is performed, and the spliced speech is saved with the original sampling rate. In this work, the SNR of the original sample is kept the same as the segment to be spliced.
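The splicing procedure described above might be sketched in Python as follows. The helper names, the use of plain lists of samples, and the choice of inserting (rather than overwriting) the segment are assumptions for illustration; the text does not give these details.

```python
import random


def cut_into_segments(samples, n_segments, rng):
    """Cut `samples` into `n_segments` non-empty pieces at random positions."""
    cuts = sorted(rng.sample(range(1, len(samples)), n_segments - 1))
    bounds = [0, *cuts, len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:])]


def make_transsplice(original, donor, rng, n_segments=4):
    """Insert one random donor segment at a random position in `original`.

    Returns the spliced signal and the (start, end) of the inserted region,
    which serves as ground truth for localization.
    """
    segment = rng.choice(cut_into_segments(donor, n_segments, rng))
    pos = rng.randrange(len(original) + 1)
    spliced = original[:pos] + segment + original[pos:]
    return spliced, (pos, pos + len(segment))


# Usage with toy integer "samples"; real inputs would be two recordings
# with the same sampling rate and the same SNR.
rng = random.Random(1)
spliced, region = make_transsplice(list(range(50)), list(range(100, 140)), rng)
```

Keeping the ground-truth region alongside each spliced sample makes it straightforward to score a detector's output later.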

In the experiment, there are 42 types of samples in each splicing subset, and each type contains 30 samples. As a result, there are 1260 samples in each splicing subset. Each sample is sampled at 8 kHz, mono, with 16-bit quantization, and has a duration of 3-4 seconds.

3.2. Performance Evaluation on Noise Estimation

It can be seen from the analysis in Section 2.1 that the parameters $U$ and $V$ (the number of subwindows and the subwindow length) affect the performance of the noise estimation algorithm. In order to find the optimal $U$ and $V$, we first vary the $U$ and $V$ values in the Sorensen algorithm to estimate the noise of the 1260 segments of each subset and then calculate the average SNR of the 1260 speech segments for each $(U, V)$ setting. The experimental results for 0 dB and 5 dB speech are given in Tables 1 and 2.
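One possible way to organize such a parameter search is sketched below. It assumes the SNR is computed as the energy ratio of the enhanced speech to the estimated noise and that the best setting is the one whose average estimated SNR is closest to the nominal SNR; both are assumptions, as are the function names and the toy numbers in the usage example.

```python
import math


def snr_db(speech, noise):
    """SNR in dB from the energies of a speech estimate and a noise estimate."""
    speech_energy = sum(x * x for x in speech)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10(speech_energy / noise_energy)


def best_setting(avg_snr_by_setting, nominal_db):
    """Pick the parameter setting whose average estimated SNR is closest to nominal."""
    return min(avg_snr_by_setting,
               key=lambda setting: abs(avg_snr_by_setting[setting] - nominal_db))


# Usage with made-up averages for three hypothetical parameter pairs.
toy_averages = {(2, 5): 0.1, (4, 4): -0.6, (8, 8): 1.9}
chosen = best_setting(toy_averages, nominal_db=0.0)
```

In a full experiment, each dictionary entry would be the mean of `snr_db` over all samples of a subset for one parameter pair.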

It can be clearly seen from Tables 1 and 2 that $U$ and $V$ have a great influence on the performance of the Sorensen algorithm. For example, the estimation error for the 0 dB case is minimized at −0.0737 dB when $(U, V)$ is (2, 5), and the best choice for the 5 dB case is (4, 4). Table 3 gives the optimal $U$ and $V$ for various SNR cases.

Additionally, we compared the optimized Sorensen algorithm with other typical noise estimation algorithms. From Table 4, the optimized algorithm achieves the best estimated results in various SNR cases.

3.3. Performance Evaluation on Splicing Detection

In MFCC extraction, we set the number of filters to 27 and the number of cepstral coefficients to 12. For Lavielle’s algorithm [16], we set the maximum number of segments to 3 and only variance change is considered.

The $F_1$ score is introduced as an objective metric to evaluate the performance of the proposed algorithm, which can be expressed as follows:

$$F_1=\frac{2PR}{P+R},\qquad P=\frac{|S_a\cap S_d|}{|S_d|},\qquad R=\frac{|S_a\cap S_d|}{|S_a|}, \qquad (15)$$

where $P$ is the accuracy rate (precision), $R$ is the recall rate, $S_a$ is the actual splicing region, and $S_d$ is the detected splicing region. It can be seen from equation (15) that the larger the $F_1$ value, the better the detection capability of the algorithm.
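For a single spliced region, a region-overlap score of this kind can be computed as in the following sketch. It assumes that regions are given as half-open `(start, end)` sample intervals and that precision and recall are the overlap length divided by the detected and actual region lengths, respectively; the function name is illustrative.

```python
def f1_score(actual, detected):
    """F1 score between an actual and a detected (start, end) region."""
    overlap = max(0, min(actual[1], detected[1]) - max(actual[0], detected[0]))
    if overlap == 0:
        return 0.0  # no overlap: precision and recall are both zero
    precision = overlap / (detected[1] - detected[0])
    recall = overlap / (actual[1] - actual[0])
    return 2 * precision * recall / (precision + recall)


# Usage: half of each region overlaps the other, so precision = recall = 0.5.
score = f1_score((100, 200), (150, 250))
```

A detection that covers the true region exactly scores 1.0, while a detection that is too long lowers precision and one that is too short lowers recall.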

For comparison with [7, 9], we adopt the optimal parameters in Table 3 to detect the splicing trace. The scores are shown in Table 5. It can be seen that the proposed method achieves better detection performance in all SNR cases. Meanwhile, it can be seen from Table 3 that the detection performance of the algorithm gradually deteriorates as the SNR increases. This is consistent with the situation in real scenes, that is, the lower the noise energy contained in the speech signal, the more difficult it is for the noise estimation algorithm to extract the noise. In addition, according to the results in Tables 3 and 5, the detection result of the algorithm tends to improve as $U$ and $V$ decrease. This indicates that faster tracking in the noise estimation helps improve the detection rate of the algorithm.

4. Conclusion and Future Work

In this study, a novel speech transsplicing detection method has been proposed. Considering that the segment to be spliced and the original segment have different noise levels, the noise of the suspected speech is estimated first. Then, we extract the variance of the 12-dimensional MFCC coefficients from the estimated noise and utilize the change point detection algorithm based on the penalty cost function to locate the splicing region. We find that the variance of the spliced region is significantly lower than that of the nonspliced regions. Experimental results show that the detection algorithm can accurately determine the starting position of the splicing and can detect the entire splicing region. Compared with the splicing detection methods based on the grid frequency and on the intrinsic statistical properties of speech, the proposed method makes fewer assumptions and can be applied to more forensic scenarios. Future work will focus on extracting more efficient hybrid features to further improve detection accuracy, and scenarios closer to the real world, such as reverberation, will be considered.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61300055), Ningbo Natural Science Foundation (Grant No. 202003N4089), and K. C. Wong Magna Fund in Ningbo University.