Abstract

Resampling is an operation that converts digital speech from a given sampling rate to a different one. It can be used to interface two systems with different sampling rates. Unfortunately, resampling may also be used intentionally as a postprocessing operation to remove the artifacts left by manipulations such as pitch shifting and splicing. To detect resampling, several forensic detectors have been proposed. Little consideration, however, has been given to the security of these detectors themselves. To expose the weaknesses of these resampling detectors and hide the resampling artifacts, a dual-path resampling antiforensic framework is proposed in this paper. In the proposed framework, 1D median filtering is applied to the low-frequency component to destroy the linear correlation between adjacent speech samples introduced by resampling. For the high-frequency component, Gaussian white noise perturbation (GWNP) is adopted to destroy the periodic resampling traces. The experimental results show that the proposed method successfully deceives the existing resampling forensic algorithms while keeping good perceptual quality of the resampled speech.

1. Introduction

With the wide availability of powerful audio-editing tools such as Adobe Audition, Audacity, and GoldWave, one can easily modify digital speech while leaving little or no obvious perceptual artifacts. For law enforcement agencies, authenticity and integrity [1, 2] are important when speech is provided as evidence. Therefore, various speech forensic techniques have been proposed to identify different kinds of forgery, such as replaying [3, 4], pitch shifting [5, 6], and double compression [7]. Resampling is an operation widely used to convert digital speech from a given sampling rate to a different one. It is worth noting that resampling is also a necessary step when speech undergoes other manipulations such as pitch shifting, splicing, and fake-quality mp3 compositing [8]. Additionally, resampling can be used as a postprocessing operation to hide the artifacts left by forgery operations [9].

In recent years, many forensic techniques have been proposed to detect the traces left in resampled speech. Since resampling introduces specific periodic correlations between the samples of the resampled speech, the Expectation Maximization (EM) algorithm has been applied to estimate such periodic artifacts [10]. In addition, statistical moments in the frequency domain are chosen as features to classify original and resampled speech. Inspired by the successful application of derivative features in image resampling forensics [11] and audio double compression forensics [12], Xu and Xia [13] found that if an original audio signal is resampled, significant peaks can be found in the second-order derivative of the spectrum, and the peak position is related to the resampling factor. They utilized K-singular value decomposition to distinguish resampled speech from natural speech. In our previous work [14], we found that resampling causes an inconsistency between the bandwidth and the sampling rate, and we showed that the logarithmic ratio of band energy can effectively detect the resampling trace.

In a real scenario, a malicious attacker may launch an antiforensic attack against the existing forensic detectors. The purpose of antiforensic techniques is to fool or mislead the forensic detectors by making forgery detection more difficult. At the same time, the study of antiforensics helps researchers become aware of the weaknesses in the existing forensic detectors. In this work, we focus on the antiforensics of speech resampling. The objective is to remove the resampling trace without degrading the perceptual quality of the resampled speech. To the best of our knowledge, there is no prior work on the antiforensics of speech resampling. Specifically, a dual-path strategy is applied in the proposed algorithm. For the low-frequency component of the speech, a median filter is used to destroy the linear relationship between adjacent samples in the resampled speech. For the high-frequency component, a perturbation is added in the form of Gaussian white noise. The experimental results show that the proposed antiforensic algorithm not only removes the resampling trace but also preserves the speech quality after the antiforensic process.

The rest of this paper is organized as follows. Section 2 briefly reviews the basics of speech resampling and the related work on resampling forensics. In Section 3, a dual-path antiforensic framework is given and then the proposed antiforensic resampling algorithm is introduced. The experimental results are presented in Section 4. Finally, Section 5 summarizes the paper and discusses future work.

In this section, the basics of speech resampling are given and two typical forensic algorithms for speech resampling are investigated.

2.1. Speech Resampling

Resampling, also known as sample rate conversion, is the process of converting a signal from one sampling rate to another. The resampling function can be defined by $y = \mathcal{R}(x, r)$, where $x$ is the original speech, $y$ is the resampled speech, and $r$ is the resampling factor. $r$ can be defined as $r = f_n / f_o$, where $f_o$ and $f_n$ are the original sampling rate and the new sampling rate after the conversion, respectively. According to the value of $r$, resampling can be divided into downsampling ($r < 1$) and upsampling ($r > 1$).

The implementation of the resampling scheme is shown in Figure 1. The original speech is first upsampled by a factor of $p$ and then passed through an interpolation filter. Then, the interpolated speech is downsampled by a factor of $q$, giving an overall resampling factor of $p/q$. Here, the filter is typically a lowpass filter that acts as both an anti-imaging and an antialiasing filter; the stopband frequency of the ideal filter should be $\min(\pi/p, \pi/q)$, where $\min(\cdot)$ denotes the minimum function. More details of resampling can be found in [11].
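The three stages described above (upsampling, lowpass filtering, and downsampling) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the names `lowpass_taps` and `resample_rational`, the Hamming-windowed sinc design, and the filter length `half_len` are all choices made here for illustration, and the integer factors `p` and `q` are assumed to be coprime.

```python
import math

def lowpass_taps(p, q, half_len=10):
    """Hamming-windowed sinc lowpass with cutoff min(pi/p, pi/q) rad/sample.

    half_len is an illustrative design parameter (assumption), not a value
    taken from the paper.
    """
    m = max(p, q)
    cutoff = 1.0 / m                 # cutoff as a fraction of pi
    n_half = half_len * m
    taps = []
    for k in range(-n_half, n_half + 1):
        arg = cutoff * k
        sinc = 1.0 if k == 0 else math.sin(math.pi * arg) / (math.pi * arg)
        window = 0.54 + 0.46 * math.cos(math.pi * k / n_half)
        taps.append(cutoff * sinc * window)
    return taps

def resample_rational(x, p, q):
    """Resample x by the factor p/q: upsample by p, lowpass, downsample by q."""
    taps = lowpass_taps(p, q)
    n_half = (len(taps) - 1) // 2
    # Stage 1: upsample by p (insert p - 1 zeros after every sample).
    up = [0.0] * (len(x) * p)
    for i, v in enumerate(x):
        up[i * p] = v
    # Stages 2 and 3: filter, but only evaluate every q-th output sample.
    out = []
    for n in range(0, len(up), q):
        acc = 0.0
        for k, h in enumerate(taps):
            j = n - (k - n_half)
            if 0 <= j < len(up):
                acc += h * up[j]
        out.append(p * acc)          # gain p compensates for the zero-stuffing
    return out
```

For example, `resample_rational(x, 4, 5)` would correspond to a resampling factor of 0.8, and `resample_rational(x, 2, 1)` to a factor of 2.0. Samples near the two ends of the output are affected by the truncated filter support.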

2.2. Existing Works on Detecting Resampling Trace

Popescu and Farid [11] found that a linearly interpolated signal exhibits a periodic correlation. Such correlation can be well fitted by the EM algorithm and has been successfully applied to the detection of image resampling. To make the EM algorithm more suitable for speech resampling detection, Chen and Wu [10] introduced the signal histogram to describe the distribution of the speech and extracted statistical moments in the frequency domain as the identification features. Starting from the values of the spectrum peaks, the detection algorithm can be described as follows.
(1) Normalize the spectrum peaks to [0, 1]. The normalization factor is defined by equation (5), and its purpose is to make the spectrum peaks consistent for speech of various lengths, since speech in real scenes often does not have a fixed length.
(2) To highlight the primary peaks, suppress the small values of the normalized peaks by using their average.
(3) Calculate the 3rd-order central moment of the resulting values and normalize it.
(4) Once the normalized moment is obtained, compare it with a threshold to determine whether the suspected speech is resampled or not.

Xu and Xia [13] found that significant peaks arise in the second-order derivative of the spectrum of resampled speech and that the peak position is related to the resampling factor. For a given speech, the FFT spectrum of its second-order derivative is obtained first. A flat critical boundary is then estimated as the sum of the mean and the standard deviation of this spectrum. Meanwhile, the maximum value of the spectrum peak is found by searching the whole normalized frequency domain. Under a hypothesis-testing framework, resampled and original speech are then distinguished by a decision rule with a decision threshold.
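The quantities named in this description can be illustrated with a short Python sketch. It is not the detector of [13]: the function names are hypothetical, a naive DFT stands in for the FFT, the population standard deviation is used, and the final rule (comparing the ratio of the maximum peak to the critical boundary against a threshold) is an assumption, since the exact decision rule is not reproduced here.

```python
import cmath
import math

def second_derivative_spectrum(x):
    """Magnitude spectrum (bins 0..N/2) of the second-order difference of x."""
    d = [x[n + 1] - 2.0 * x[n] + x[n - 1] for n in range(1, len(x) - 1)]
    size = len(d)
    spectrum = []
    for k in range(size // 2 + 1):
        # Naive DFT, used here only to keep the sketch dependency-free.
        acc = sum(d[n] * cmath.exp(-2j * math.pi * k * n / size)
                  for n in range(size))
        spectrum.append(abs(acc))
    return spectrum

def critical_boundary(spectrum):
    """Flat boundary: mean plus (population) standard deviation."""
    mean = sum(spectrum) / len(spectrum)
    var = sum((s - mean) ** 2 for s in spectrum) / len(spectrum)
    return mean + math.sqrt(var)

def peak_to_boundary_ratio(x):
    """Maximum spectrum peak divided by the critical boundary."""
    spectrum = second_derivative_spectrum(x)
    return max(spectrum) / critical_boundary(spectrum)

def looks_resampled(x, threshold):
    """Assumed decision rule: flag the speech when the ratio exceeds threshold."""
    return peak_to_boundary_ratio(x) > threshold
```

The `threshold` argument is left to the caller; a suitable value would have to be tuned on labelled data.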

3. Proposed Antiforensic Method

Generally, it is hard to defeat the resampling forensic detectors with a single-step attack [15]. Hence, in this work, a dual-path framework is designed to remove the resampling trace, as shown in Figure 2. In this framework, different strategies are applied to the different frequency-domain components of the resampled speech. For the low-frequency component, 1D median filtering is used to destroy the linear correlation between adjacent samples introduced by the resampling. For the high-frequency component, a perturbation is created by adding Gaussian white noise to the samples in order to destroy the periodic resampling traces.

Given the speech $x$ and the resampling factor $r$, the speech can be modelled as $x = x_l + x_h$, where $x_l$ and $x_h$ are the low- and high-frequency components, respectively.

The proposed antiforensic algorithm can be described as follows.
(1) First, the two components are separated by a 1D median filter: $x_l = \mathrm{MF}_w(x)$ and $x_h = x - x_l$, where $w$ is the size of the median filter. The 1D median filter works with a sliding window that passes over the speech samples one by one. At each sample, the neighboring samples inside the window are sorted numerically, and the middle value of the sorted list is the output of the median filter.
(2) For $x_h$, zero-mean Gaussian white noise is added, and the strength of the noise is controlled by the standard deviation $\sigma$.
(3) Then, the noised signal is resampled with the specified resampling factor $r$ to give the high-frequency component of the output.
(4) Meanwhile, in the low-frequency path, resampling with $r$ is performed, and the low-frequency component of the output is obtained by applying median filtering to the resampled signal.
(5) Finally, the resampled speech is obtained by summing the low- and high-frequency components of the output.
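The five steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions: the low-frequency component is taken to be the median-filter output and the high-frequency component the residual; the low-frequency path is assumed to resample the input speech and then median-filter the result; linear interpolation (`resample_linear`) stands in for a proper band-limited resampler; and the window edges are handled by replicating the end samples. All function names are hypothetical.

```python
import random

def median_filter(x, w):
    """Sliding-window median with odd window size w (end samples replicated)."""
    half = w // 2
    padded = [x[0]] * half + list(x) + [x[-1]] * half
    return [sorted(padded[i:i + w])[half] for i in range(len(x))]

def resample_linear(x, r):
    """Stand-in resampler: linear interpolation onto round(len(x) * r) points."""
    n_out = round(len(x) * r)
    out = []
    for i in range(n_out):
        pos = min(i / r, len(x) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(x) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * x[lo] + frac * x[hi])
    return out

def dual_path_resample(x, r, w=3, sigma=0.2, seed=None):
    """Dual-path sketch: median filtering (low path), noise perturbation (high path)."""
    rng = random.Random(seed)
    # Step 1: split into low- and high-frequency components (assumed residual split).
    x_low = median_filter(x, w)
    x_high = [a - b for a, b in zip(x, x_low)]
    # Step 2: perturb the high-frequency component with zero-mean Gaussian noise.
    noisy = [v + rng.gauss(0.0, sigma) for v in x_high]
    # Step 3: resample the noised high-frequency component.
    y_high = resample_linear(noisy, r)
    # Step 4: resample, then median-filter (assumption: the input speech is resampled).
    y_low = median_filter(resample_linear(x, r), w)
    # Step 5: sum the two paths.
    return [a + b for a, b in zip(y_low, y_high)]
```

A call such as `dual_path_resample(x, 1.5, w=3, sigma=0.2, seed=0)` would return a signal with roughly 1.5 times as many samples as `x`; the `seed` argument only makes the noise reproducible.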

4. Experimental Results

4.1. Experimental Setup

To evaluate the antiforensic effectiveness of the proposed algorithm quantitatively, we perform experiments to attack two typical resampling detectors [10, 13]. To this end, we use TIMIT, a dataset consisting of 6300 speech recordings with an average duration of 3 seconds from 630 North American speakers. Each recording is stored in WAV format with a 16 kHz sampling rate, 16-bit quantization, and a single channel. TIMIT is taken as the original speech dataset, and the resampled dataset is created with the Matlab function resample, using resampling factors from 0.8 to 2.0 in steps of 0.1. Two window sizes of the median filter (3 and 5) are chosen, and the strength of the Gaussian noise is set to 0.2.

4.2. Removing Resampling Artifacts

As described in Section 2.1, the interpolation algorithm is the key component of speech resampling, and it unavoidably introduces linear dependencies between adjacent audio samples. As shown in [16, 17], the strength of such linear dependencies varies periodically with the resampling parameters. Other detection methods [9, 13] also confirm the existence of such periodic artifacts.

The detection accuracy of the forensic methods is defined as the percentage of true positives and true negatives among all testing cases. Figure 3 shows the detection accuracies of the detectors in [10, 13]. First, when the resampling factor is less than 1.0, the accuracy of the EM algorithm is reduced by about 10%, while that of the second-order derivative method is reduced by about 2%. As the resampling factor increases beyond 1.2, the detection performance of the two methods is reduced by an average of 30%. Meanwhile, if we decrease the window size of the median filter, the attack performance improves further. These experimental results show that the proposed antiforensic algorithm effectively erases the periodic resampling traces.
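The accuracy definition used here can be written out as a short helper. The function name and the boolean label convention (`True` for resampled, `False` for original) are assumptions made for this illustration only.

```python
def detection_accuracy(labels, predictions):
    """Percentage of test cases that are true positives or true negatives.

    labels and predictions are equal-length sequences of booleans, where
    True means 'resampled' and False means 'original' (assumed convention).
    """
    correct = sum(1 for truth, guess in zip(labels, predictions) if truth == guess)
    return 100.0 * correct / len(labels)
```

Under this definition, a successful antiforensic attack shows up as a drop in the value returned for the attacked speech compared with the value for plainly resampled speech.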

Figure 4 shows the normalized power spectra of the EM algorithm, and Figure 5 shows the second-order derivative spectrum. We can see that the periodic traces caused by resampling have been effectively removed, which further confirms the effectiveness of the proposed algorithm.

It is known that median filtering produces regions with nearly constant intensity values [18]. Moreover, median filtering can serve as an attack against resampling detection because of its nonlinear smoothing ability. Thus, in the proposed method, the periodic dependencies between neighboring samples are effectively destroyed by the median filtering. On the other hand, the high-frequency component is resampled after Gaussian white noise has been added to its samples. With the proposed dual-path attack, the resampling trace can be successfully removed.

4.3. Perceptual Quality

Segmental SNR (SegSNR) [19] and Perceptual Evaluation of Speech Quality (PESQ) [20], which have been extensively used for evaluating sound quality objectively, are adopted as the speech quality evaluation metrics. SegSNR is a variant of the signal-to-noise ratio. It is computed by dividing the speech signal into short frames of 10-30 ms, determining the signal-to-noise ratio of each frame, and then averaging over the full duration of the speech. The definition of SegSNR is
$$\mathrm{SegSNR} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{n=Nm}^{Nm+N-1} s^2(n)}{\sum_{n=Nm}^{Nm+N-1} \left( s(n) - \hat{s}(n) \right)^2},$$
where $s(n)$ and $\hat{s}(n)$ denote the reference and processed speech samples, $M$ is the total number of frames, $m$ is the frame index, and $N$ is the frame size.
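As a worked illustration of this metric, the following Python sketch computes the frame-averaged SNR in dB. The function name, the small constant `eps` that guards against division by zero in silent or perfectly matched frames, and the choice to discard a trailing partial frame are assumptions of this sketch, not details taken from [19].

```python
import math

def segmental_snr(reference, processed, frame_size, eps=1e-12):
    """Average of per-frame SNR values in dB over all complete frames."""
    n_frames = min(len(reference), len(processed)) // frame_size
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    total = 0.0
    for m in range(n_frames):
        start = m * frame_size
        frame = range(start, start + frame_size)
        signal = sum(reference[n] ** 2 for n in frame)
        noise = sum((reference[n] - processed[n]) ** 2 for n in frame)
        # eps (assumption) keeps the ratio finite when a frame has no noise.
        total += 10.0 * math.log10((signal + eps) / (noise + eps))
    return total / n_frames
```

At a 16 kHz sampling rate, frames of 10-30 ms correspond to a `frame_size` between 160 and 480 samples.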

PESQ is standardized as ITU-T recommendation P.862 and can provide an end-to-end quality assessment. The Objective Difference Grade (ODG) is a main parameter of PESQ and has a range from 0 to -4. The highest score 0 means imperceptible difference, and the lowest score -4 means very annoying.

Figure 6 shows the SegSNR and ODG scores of the speech processed by the proposed antiforensic algorithm with various filtering window sizes. First, the quality of the resampled speech improves as the resampling factor increases, which is in line with expectations. Meanwhile, the ODG scores for all resampling factors are higher than -2.0, which means that the proposed antiforensic algorithm can maintain good perceptual quality of the resampled speech. Additionally, the algorithm with the smaller window size provides better quality. Hence, in practice, a smaller window size is preferable.

5. Conclusion

In this work, a resampling antiforensic method based on a dual-path strategy is proposed to attack the existing forensic detectors. The linear correlation introduced by resampling is destroyed by median filtering on the low-frequency component, and the periodic traces introduced by resampling are removed by adding a Gaussian perturbation to the high-frequency component. The proposed resampling antiforensic technique achieves a good tradeoff between perceptual speech quality and forensic undetectability. In the future, we expect that our approach can be extended to the antiforensics of other speech operations.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61300055), Ningbo Natural Science Foundation (Grant No. 202003N4089), and K.C. Wong Magna Fund in Ningbo University.