Abstract

Adaptive multirate (AMR) audio compression has been exploited as effective forensic evidence for verifying audio authenticity. Little consideration has been given, however, to antiforensic techniques capable of fooling AMR compression forensic algorithms. In this paper, we present an antiforensic method based on a generative adversarial network (GAN) to attack AMR compression detectors. The GAN framework is used to modify double AMR compressed audio so that it has the underlying statistics of single compressed audio. Three state-of-the-art detectors of AMR compression are selected as the attack targets. The experimental results demonstrate that the proposed method is capable of removing the forensically detectable artifacts of AMR compression under various compression bit rates with an average successful attack rate of about 94.75%, which means that the modified audios generated by our well-trained generator can deceive the forensic detectors effectively. Moreover, we show that the perceptual quality of the generated AMR audio is well preserved.

1. Introduction

The AMR audio codec [1] is one of the most popular audio codec standards. It is optimized for speech and encodes narrowband (200–3400 Hz) signals sampled at 8000 Hz [2]. As more and more AMR audio appears as evidence in forensic scenarios, it is of great importance to verify its integrity [3]. Generally, to manipulate an AMR audio, an attacker must first decompress it into a raw waveform, perform the forgery operations, and then recompress it into AMR format. Double compressed audio is therefore suspicious, because manipulated audio always goes through double compression. In the past decade, many forensic techniques have been proposed to detect the compression history of AMR audio, based on traditional methods [3–5] and deep learning methods [2, 6, 7]. To represent the difference between single and double compressed audio, traditional AMR compression detection techniques rely on low-level acoustic features such as sub-band energy and linear prediction coefficients (LPCs), which require professional acoustic knowledge. Recently, deep learning methods have been gaining popularity in forensic research; they can capture highly complex features from raw samples by training a neural network on large-scale data.

However, as many forensic techniques are proposed to verify the integrity of digital files, antiforensic methods have also been proposed to expose the shortcomings of existing forensic techniques and thus help investigators address these weaknesses and improve their techniques. For example, Fontani et al. [8] first presented an antiforensic method for median filtering (MF), which made MF images undetectable by MF detectors [9–11] while keeping a good PSNR. Luo et al. [12] applied a GAN framework to improve the quality of JPEG images and successfully fool JPEG compression detectors. Chen et al. [13] used the legacy traces of a designated camera to generate a forged image that deceives existing camera identification techniques. Kim et al. [14] adopted a deep convolutional neural network (DCNN) to remove the forensic traces from MF images and effectively recover images visually similar to the original. Li et al. [15] modified the forensic traces in a data-driven manner to mislead three advanced audio source identification techniques [16–18].

These antiforensic methods, however, give little consideration to exposing the weakness of AMR compression detection. As more and more AMR audio appears as evidence in forensic scenarios, it is important to help investigators address the weaknesses of AMR compression detectors. Therefore, in this paper, we propose an antiforensic method utilizing a GAN framework, which comprises two networks: a generator and a discriminator. The data produced by the generator can statistically model the distribution of real data [19]. To improve the perceptual quality of the double compressed audio and remove the artifacts introduced by the AMR compression procedure, we adopt the GAN to modify the double compressed audios so that they avoid forensic detection. To build our antiforensic attack, we design the architecture of the GAN and its loss functions. In particular, three state-of-the-art detectors of AMR compression are selected as the attack targets to evaluate the performance of our method.

The rest of this paper is organized as follows. In Section 2, we introduce the related work on forensic methods for AMR compression and the GAN framework. The details of our proposed GAN framework are provided in Section 3. Section 4 presents the experimental settings and extensive experiments against three AMR compression detectors. Conclusions are given in Section 5.

2. Related Work

In this section, we briefly introduce three advanced detection methods, which are considered as attack targets. Additionally, the GAN framework is also briefly reviewed.

2.1. Detection of AMR Compression

In general, traditional detection of AMR compression consists of two primary steps: feature extraction and model classification.

As the first work on the detection of AMR compression, Shen et al. [3] used traditional acoustic features, including the average sub-band frequency energy ratio, the average low-frequency sub-band energy ratio, bispectrum features, and the linear prediction spectrum, to represent the difference caused by AMR compression. A standard SVM was employed for classification. They achieved an accuracy of about 87% in distinguishing single compressed audio from double compressed audio.

In [2], Luo et al. adopted an autoencoder network for automatic feature extraction. They demonstrated that the deep features extracted from a well-trained autoencoder differ greatly between single and double compressed audio, and they designed a majority voting strategy for classification.
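
As an illustration of the voting step only, the following minimal sketch applies a generic majority vote to a list of per-segment predictions. The function name `majority_vote`, the label strings, and the idea that votes come from segments of one audio are assumptions made for this example; the exact voting rule of [2] is not described here.

```python
from collections import Counter

def majority_vote(segment_labels):
    """Return the label that occurs most often among the per-segment predictions."""
    if not segment_labels:
        raise ValueError("need at least one prediction to vote on")
    counts = Counter(segment_labels)
    # most_common(1) returns a one-element list holding (label, count).
    return counts.most_common(1)[0][0]

# Hypothetical predictions for five segments of one audio clip.
votes = ["double", "single", "double", "double", "single"]
decision = majority_vote(votes)
print(decision)  # double
```

Voting over several local decisions makes the final label less sensitive to a few misclassified segments.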

In [6], the authors explored the stacked autoencoder (SAE) network to obtain better deep features for the AMR compression forensic task. They then applied a universal background model-Gaussian mixture model (UBM-GMM) to identify the compression history, improving the classification accuracy to 98% on the TIMIT [20] database.

2.2. Generative Adversarial Network

The generative adversarial network (GAN) was first proposed by Goodfellow et al. [21] for generating realistic images. In a GAN, two networks are trained against each other in a min-max two-player game. In their iterative training, the purpose of the generator $G$ is to capture the distribution of the real data, and that of the discriminator $D$ is to decide whether a sample came from the real database rather than being generated by $G$. The generator tries to maximize the probability that the discriminator mistakenly classifies the generated data as real, while the discriminator guides the generator to produce more realistic samples. Generally, the adversarial training process can be denoted as a min-max game optimized by the following loss function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],$$

where $x$ denotes the real data and $z$ denotes the random noise; $G(z)$ becomes similar to $x$ after the adversarial training of the generator $G$ and the discriminator $D$. In the training process, the purpose of $G$ is to minimize the loss value while that of $D$ is to maximize it.
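
To make the min-max objective concrete, the following minimal sketch estimates the GAN value function from a handful of discriminator outputs. The function name `gan_value` and the probability values are illustrative assumptions, not taken from the original work.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from generated data keeps the value near 0.
strong_d = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])
# A fooled discriminator (D(G(z)) close to 1) lowers the value, which is what G aims for.
fooled_d = gan_value(d_real=[0.99, 0.98], d_fake=[0.95, 0.97])
print(strong_d, fooled_d)
```

The discriminator pushes this value up and the generator pushes it down, which is the two-player game described above.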

Recently, GANs have gained growing popularity in various fields because of their effective generative capability. In this work, the GAN framework is regarded as the reverse procedure of AMR compression, used to improve the perceptual quality of double compressed audio and remove the forensic artifacts. Specifically, the generator and the discriminator can be regarded as an antiforensic model and an AMR compression detector, respectively. Hence, the adversarial concept is well suited to the antiforensic task in AMR compression detection.

3. Proposed Antiforensic Framework

Figure 1: The double compressed audio $x_d$ is first sent into the generator $G$ to get a falsified audio $\hat{x}$. $\hat{x}$ and $x_o$, selected from the uncompressed audio, are further fed into the discriminator $D$ for classification. Then, with the parameters of the discriminator frozen, the loss from $D$ is fed back to $G$, which is represented by the dotted lines.

3.1. Overall Architecture

The overall goal of our attack is to remove the artifacts left by AMR compression so that the resultant audio can fool the detectors. To deploy a successful attack, the generated audio should be compressed back to AMR format, because many investigators only accept AMR files for detection. Thus, the generated audio $\hat{x}$ must statistically model the distribution of the original audio $x_o$ so that the recompressed versions will be similar to the single compressed audio $x_s$.

As shown in Figure 1, the proposed framework consists of a generator $G$ and a discriminator $D$. To remove the artifacts left by the compression, $G$ is used to generate the falsified audio $\hat{x}$ by adding a generated perturbation to the double compressed audio $x_d$. The discriminator $D$ is designed to distinguish an original audio $x_o$, which has never been compressed, from a falsified audio $\hat{x}$. In the adversarial training of $G$ and $D$, $G$ is encouraged to learn how to minimize the difference between $\hat{x}$ and $x_o$ and to optimize its parameters so that $\hat{x}$ has good perceptual quality.

3.2. Architecture of Proposed Framework
3.2.1. Generator

The generator $G$ is used to generate the antiforensic audios. In this framework, we use SEGAN [22], which has been effectively applied in speech enhancement, as a reference architecture to design our adversarial network. As shown in Figure 2, the generator takes the double compressed audio $x_d$ (size = 1 × 8000) as the input and consists of 7 convolutional groups and 7 corresponding deconvolutional groups.

Each convolutional group includes a convolutional layer with 64 filters of size 1 × 30 and stride 2, followed by a batch normalization (BN) layer, which stabilizes the training process and makes the generated audios more realistic, and a Leaky-ReLU activation function. Each deconvolutional group consists of a deconvolutional layer configured like the convolutional layer, followed by a BN layer and a ReLU activation function. To reconstruct the details of the audio and reduce the information lost as signals flow through the convolutional and deconvolutional groups, we apply skip connections in the generator, which pass each convolutional group's output to its corresponding deconvolutional group. Skip connections improve the generator's performance, as gradients can flow deeper through them without suffering much from vanishing [23]. Finally, a sigmoid activation is added to restrict the output for classification.
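
The effect of seven stride-2 groups on the temporal length can be traced with a short calculation. The sketch below is not the authors' implementation: it assumes "same" padding, so that a stride-2 layer maps length n to ceil(n / 2), and it assumes each upsampled feature map is cropped to the mirrored encoder length so that skip connections line up. The padding scheme is not stated in the text, and the names `encoder_lengths` and `decoder_lengths` are hypothetical.

```python
import math

def encoder_lengths(input_len=8000, num_groups=7, stride=2):
    """Temporal length of the input and of the output of each convolutional group.

    Assumes 'same' padding, so a layer maps length n to ceil(n / stride).
    """
    lengths = [input_len]
    for _ in range(num_groups):
        lengths.append(math.ceil(lengths[-1] / stride))
    return lengths

def decoder_lengths(enc_lengths):
    """Temporal length after each deconvolutional group, mirroring the encoder.

    Each group doubles the length and the result is assumed to be cropped to
    the mirrored encoder length (the skip-connection partner, or the input
    length for the last group).
    """
    # Mirrored targets: encoder lengths in reverse order, bottleneck excluded.
    targets = enc_lengths[-2::-1]
    out, current = [], enc_lengths[-1]
    for target in targets:
        upsampled = current * 2
        assert upsampled >= target, "upsampled map must cover the mirrored length"
        current = min(upsampled, target)
        out.append(current)
    return out

enc = encoder_lengths()
dec = decoder_lengths(enc)
print(enc)  # [8000, 4000, 2000, 1000, 500, 250, 125, 63]
print(dec)  # [125, 250, 500, 1000, 2000, 4000, 8000]
```

Because 8000 is not divisible by 2^7, the bottleneck length is odd under this assumption, and some cropping or padding is needed for the output to return to 8000 samples.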

3.2.2. Discriminator

Since the key advantage of a GAN is that iterative training yields better generated samples, the architecture of the discriminator $D$ is a very important constraint on our framework. The discriminator is intended to classify the original audio $x_o$ and the generated audio $\hat{x}$, and thereby force the generated audios to deceive the detector. Hence, the discriminator must perform well in distinguishing $x_o$ from $\hat{x}$, and we build a CNN architecture for $D$. As shown in Figure 3, the discriminator is designed as a CNN-based compression detector. It comprises 6 convolutional groups followed by a group consisting of a global average pooling layer. At the end of the network, a dense layer with a softmax activation function outputs the categorical probability.

Before the iterative training, we first test the capability of the designed discriminator $D$ to distinguish the original audio $x_o$ from the double compressed audio $x_d$. The test uses a sub-dataset of 6000 original audios selected from the TIMIT database together with their double compressed versions, each with a compression bit rate randomly selected from {4.75 kbps, 5.15 kbps, 5.9 kbps, 6.7 kbps, 7.4 kbps, 7.95 kbps, 10.2 kbps, 12.2 kbps}. The sub-dataset is divided into training (70%) and validation (30%) parts. The accuracy of the discriminator model is shown in Figure 4; the designed discriminator achieves a good performance.

3.3. Loss Functions

In this section, we describe the loss functions for the two networks. To achieve the goal of antiforensics, the generator $G$ should learn to minimize the difference between the modified double compressed audio $G(x_d)$ and the original audio $x_o$, while maintaining an acceptable perceptual quality. In this work, we define the loss of the generator as

$$L_G = \alpha L_p + \beta L_{adv},$$

where $L_p$ represents the perceptual loss of $G$, $L_{adv}$ denotes the adversarial loss calculated from the discriminator $D$, and $\alpha$ and $\beta$ are the weights that balance the importance of $L_p$ and $L_{adv}$.

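Considering that the attack needs to introduce fewer perceptual artifacts to improve the forensic undetectability, we employ the perceptual loss $L_p$ to improve the quality of the generated audio $G(x_d)$. $L_p$ is defined as

$$L_p = \frac{1}{N} \sum_{i=1}^{N} \lVert G(x_d^{(i)}) - x_o^{(i)} \rVert,$$

where $G(x_d^{(i)})$ represents the output of $G$ for the double compressed audio $x_d^{(i)}$, $x_o^{(i)}$ is the corresponding original audio, and $N$ and $i$ represent the batch size and the position of $x_d^{(i)}$ in this batch, respectively.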

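Then, the adversarial loss $L_{adv}$ is designed to force the generator $G$ to perform better in the iterative training. We define $L_{adv}$ as

$$L_{adv} = -\frac{1}{N} \sum_{i=1}^{N} \log D(G(x_d^{(i)})),$$

where $D(G(x_d^{(i)}))$ denotes the class probability of the modified audio calculated by the discriminator $D$, and $N$ is the batch size.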
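
A weighted generator loss of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it uses the mean absolute sample difference as the perceptual term and the negative log-probability of the "original" class as the adversarial term, neither of which is confirmed by the text. The function names and the toy values are hypothetical.

```python
import math

def perceptual_loss(generated, originals):
    """Mean absolute sample difference over a batch of equal-length waveforms.

    The L1 distance is an assumption; the text does not name the distance used.
    """
    total = 0.0
    for g, o in zip(generated, originals):
        total += sum(abs(a - b) for a, b in zip(g, o)) / len(g)
    return total / len(generated)

def adversarial_loss(d_probs_original):
    """Negative mean log-probability that D assigns the 'original' class to G's outputs."""
    return -sum(math.log(p) for p in d_probs_original) / len(d_probs_original)

def generator_loss(generated, originals, d_probs_original, alpha=1000.0, beta=1.0):
    """Weighted sum: alpha * perceptual term + beta * adversarial term."""
    return (alpha * perceptual_loss(generated, originals)
            + beta * adversarial_loss(d_probs_original))

# Toy batch: one two-sample waveform; D gives the generated audio probability 0.5.
pretrain = generator_loss([[0.1, 0.3]], [[0.0, 0.1]], [0.5], alpha=1000.0, beta=0.0)
joint = generator_loss([[0.1, 0.3]], [[0.0, 0.1]], [0.5], alpha=1000.0, beta=1.0)
print(pretrain, joint)
```

Setting the adversarial weight to 0 reduces the loss to the perceptual term alone, which corresponds to training the generator on its own before the adversarial phase.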

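In this adversarial task, to force the generator $G$ to modify the double compressed audio $x_d$ so that it is similar to the original audio $x_o$, the discriminator $D$ should be able to distinguish the original audio $x_o$ correctly from the decompressed $x_d$ or the generated $G(x_d)$. Therefore, the discriminator loss $L_D$ is defined as follows:

$$L_D = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log D(x_o^{(i)}) + \log\left(1 - D(G(x_d^{(i)}))\right) \right],$$

where $N$ is the batch size and $i$ indexes the samples in the batch.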
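
A discriminator loss of this type can be sketched as a binary cross-entropy in which original audios carry label 1 and generated audios carry label 0. This is an assumed, commonly used form rather than the confirmed loss of the original work; the name `discriminator_loss` and the small constant `eps` are introduced only for this example.

```python
import math

def discriminator_loss(d_on_original, d_on_generated, eps=1e-12):
    """Binary cross-entropy with originals labeled 1 and generated audios labeled 0.

    d_on_original: D's 'original' probabilities for never-compressed audios.
    d_on_generated: D's 'original' probabilities for the generator's outputs.
    eps guards against log(0) when a probability saturates.
    """
    real = sum(math.log(p + eps) for p in d_on_original) / len(d_on_original)
    fake = sum(math.log(1.0 - p + eps) for p in d_on_generated) / len(d_on_generated)
    return -(real + fake)

confident = discriminator_loss([0.99, 0.97], [0.02, 0.01])
uncertain = discriminator_loss([0.60, 0.55], [0.45, 0.40])
print(confident, uncertain)
```

A discriminator that separates the two classes well obtains the smaller loss, so minimizing this quantity trains it to act as a compression detector.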

4. Experimental Results

In this section, we evaluate our antiforensic method against the three advanced forensic techniques [2, 3, 6]. First, we create an audio database designed especially for the experiment. Then, the successful attack rate (SAR) is used to measure the forensic undetectability of our antiforensic audios, and the perceptual evaluation of speech quality (PESQ) [24] is adopted to assess their quality.

4.1. Database

TIMIT [20] is a typical speech database consisting of 630 speakers from different dialects of American English (192 females and 438 males); each speaker reads ten sentences of approximately three seconds each. To build the forensic database, we first use the AMR codec to obtain the single compressed audio from the TIMIT database, with a compression bit rate randomly selected from {4.75 kbps, 5.15 kbps, 5.9 kbps, 6.7 kbps, 7.4 kbps, 7.95 kbps, 10.2 kbps, 12.2 kbps}. Then, we decode and recompress the AMR audios to get the double compressed AMR audio, with random bit rates also selected from 4.75 to 12.2 kbps.
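
The single and double compression steps can be scripted. The sketch below only builds the command lines and does not run them. It assumes an ffmpeg build that includes the `libopencore_amrnb` encoder; the text does not say which AMR codec implementation was used, and the file names and helper functions are hypothetical.

```python
import random

# The eight AMR narrowband bit rates listed in the text, in kbps.
AMR_BIT_RATES_KBPS = [4.75, 5.15, 5.9, 6.7, 7.4, 7.95, 10.2, 12.2]

def amr_encode_cmd(src, dst, kbps):
    """ffmpeg command that encodes `src` to 8 kHz mono AMR at `kbps` (assumed tooling)."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "8000", "-ac", "1",
            "-c:a", "libopencore_amrnb", "-b:a", f"{kbps}k", dst]

def decode_cmd(src, dst):
    """ffmpeg command that decodes an AMR file back to an 8 kHz mono WAV waveform."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "8000", "-ac", "1", dst]

def double_compression_plan(wav_path, stem, rng):
    """Commands for single compression, decoding, and recompression of one utterance."""
    first = rng.choice(AMR_BIT_RATES_KBPS)
    second = rng.choice(AMR_BIT_RATES_KBPS)
    return [
        amr_encode_cmd(wav_path, f"{stem}_single.amr", first),
        decode_cmd(f"{stem}_single.amr", f"{stem}_decoded.wav"),
        amr_encode_cmd(f"{stem}_decoded.wav", f"{stem}_double.amr", second),
    ]

plan = double_compression_plan("utt.wav", "utt", random.Random(0))
for cmd in plan:
    print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run` once the tooling is confirmed to be available; drawing the two bit rates independently mirrors the random selection described above.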

In the experiments, we first split those audios into 1 s clips and randomly divide the clips into a training set and a test set, obtaining 12000 training clips and 6900 test clips of 1 s each. Then, the detectors of [3], [2], and [6] are trained using the training set; their average detection accuracies on the test set are 87.52%, 92.60%, and 98.54%, respectively, which are essentially in agreement with the results reported in the original works.

4.2. Experimental Setup and Evaluation Metrics
4.2.1. Experimental Setup

We train our network on patch-sized audios with the pairs of sets $(x_d, x_o)$, that is, double compressed audios and the corresponding original audios. Considering that an investigator might split the audio into different sizes before detection, we stitch all 1 s audios to obtain audios of different sizes, including 13800 0.5 s clips, 6900 1 s clips, 3450 2 s clips, and 2300 3 s clips. Then, we compress the generated audio $\hat{x}$ back to AMR format with random bit rates chosen from 4.75 to 12.2 kbps.
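
The clip counts follow from cutting the same 6900 s of audio into pieces of different lengths. The following sketch checks this arithmetic under the assumption that the stitched audio is treated as one continuous 8 kHz stream and cut into non-overlapping clips; the helper `split_into_clips` is hypothetical.

```python
def split_into_clips(samples, clip_seconds, sample_rate=8000):
    """Cut a waveform into consecutive non-overlapping clips, dropping any short tail."""
    clip_len = int(round(clip_seconds * sample_rate))
    n_clips = len(samples) // clip_len
    return [samples[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]

# 6900 s of audio at 8 kHz, represented by a lazy range of sample indices.
stream = range(6900 * 8000)
counts = {sec: len(split_into_clips(stream, sec)) for sec in (0.5, 1, 2, 3)}
print(counts)  # {0.5: 13800, 1: 6900, 2: 3450, 3: 2300}
```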

Adam [25] is adopted as the optimizer with a learning rate of $1 \times 10^{-4}$ for the generator $G$ and $5 \times 10^{-6}$ for the discriminator $D$. Before the iterative training, we train the generator alone for 5 epochs with a batch size of 64 and weight terms of $\alpha$ = 1000 for the perceptual loss and $\beta$ = 0 for the adversarial loss. Then, $G$ and $D$ are trained iteratively for 30 epochs with weight terms of $\alpha$ = 1000 and $\beta$ = 1 and an iteration ratio of 1 : 5, which gives the discriminator more iterations to reach a better performance.
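
The two training phases can be written down as a small schedule. The sketch below only enumerates the per-epoch settings reported above and performs no training. The tuple layout and the function name `training_schedule` are assumptions, as is the choice to leave the discriminator untouched during the generator-only phase.

```python
def training_schedule(pretrain_epochs=5, adversarial_epochs=30, d_steps_per_g_step=5):
    """Yield (phase, epoch, alpha, beta, g_steps, d_steps) for every epoch.

    alpha weights the perceptual loss and beta weights the adversarial loss.
    """
    for epoch in range(pretrain_epochs):
        # Generator-only phase: adversarial weight 0; D is assumed not to be updated.
        yield ("pretrain", epoch, 1000, 0, 1, 0)
    for epoch in range(adversarial_epochs):
        # Iterative phase: one generator update for every five discriminator updates.
        yield ("adversarial", epoch, 1000, 1, 1, d_steps_per_g_step)

schedule = list(training_schedule())
print(len(schedule), schedule[0], schedule[-1])
```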

4.2.2. Evaluation Metrics

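The successful attack rate (SAR) is used as the evaluation metric, as it represents the forensic undetectability of our antiforensic audio well. We define the SAR as

$$\mathrm{SAR} = \frac{1}{N} \sum_{i=1}^{N} f(\hat{x}_i) \times 100\%,$$

where $\hat{x}_i$ represents the antiforensic audio compressed with each bit rate selected from 4.75 to 12.2 kbps, $N$ is the number of such audios, and $f(\cdot)$ is the classification result of the forensic detector, that is, $f(\hat{x}_i) = 1$ when $\hat{x}_i$ has been misclassified as single compressed audio $x_s$ and 0 otherwise.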
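
Computing the SAR from detector decisions is a simple counting task. The sketch below assumes that a successful attack means the detector labels an antiforensic audio as single compressed; the label strings and the function name `successful_attack_rate` are hypothetical.

```python
def successful_attack_rate(predictions, target_label="single"):
    """Fraction of antiforensic audios that the detector labels as `target_label`."""
    if not predictions:
        raise ValueError("no predictions to score")
    hits = sum(1 for p in predictions if p == target_label)
    return hits / len(predictions)

# Hypothetical detector outputs for four antiforensic audios.
sar = successful_attack_rate(["single", "single", "double", "single"])
print(f"SAR = {sar:.2%}")  # SAR = 75.00%
```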

Meanwhile, we apply the PESQ to test the perceptual quality of the antiforensic audio . PESQ is an industry-standard methodology for the assessment of speech quality. The range from −0.5 to 4.5 is the default PESQ score range, and higher score means better perceptual quality.

4.3. Experimental Performance and Analysis

We perform our attack on three advanced forensic methods [2, 3, 6]. Specifically, for each clip in the test set, we generate a copy of it with the well-trained generator and then compress the copy with eight different bit rates. Finally, the three trained detectors are used to classify our antiforensic audios.

As shown in Table 1, the experimental results are in line with expectations. The antiforensic audios $\hat{x}$ deceive the three advanced AMR compression detectors, with an average SAR of about 94.71%, which means that the forensic techniques cannot classify the antiforensic audios correctly. Our method therefore makes the double compressed audio $x_d$ largely undetectable by the forensic techniques.

To measure the quality of our antiforensic audios, we compute the PESQ score of the antiforensic audio $\hat{x}$ against the original audio $x_o$. As shown in Figure 5, our antiforensic audios retain good perceptual quality: the PESQ values of most $\hat{x}$ are over 3.3 compared with $x_o$, which means that our method can improve the perceptual quality of the double compressed audio $x_d$ while achieving the antiforensic purpose. Figure 6 shows the spectrograms of an original audio $x_o$ from the test set, its single compressed version $x_s$ and double compressed version $x_d$, and its antiforensic audio $\hat{x}$ compressed with random bit rates. Compared with $x_o$, $\hat{x}$ presents fewer losses of content in the high frequency than $x_d$, and is similar to $x_s$.

5. Conclusion and Future Work

In this paper, we have proposed a new method to expose the weakness of forensic detectors of AMR compression. To do this, we developed a GAN framework for the removal of AMR compression artifacts. Unlike conventional antiforensic methods, our method retains good perceptual quality while achieving a better antiforensic capability in a data-driven manner. Extensive experiments demonstrate that the antiforensic double compressed audio can effectively avoid detection by existing AMR compression detection methods with an average SAR of about 94.75%, while retaining good perceptual quality.

However, many problems remain in the competition between forensics and antiforensics. In the future, we plan to consider the robustness of forensic approaches to AMR compression, i.e., whether an adversarial framework can yield a robust discriminator that correctly detects antiforensic audios produced by a well-trained generator or another attack strategy, while still distinguishing double compressed audios from single compressed audios.

Data Availability

The TIMIT dataset used to support the findings of the study is public and available at https://catalog.ldc.upenn.edu/LDC93S1.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported in part by the National Natural Science Foundation of China under grant nos. U1736215, 61672302, and 61901237, in part by the Zhejiang Natural Science Foundation under grant nos. LY20F020010 and LY17F020010, and in part by the K. C. Wong Magna Fund of Ningbo University.