Speech synthesis technology has made great progress in recent years and is widely used in the Internet of Things, but it also brings the risk of abuse by criminals. A series of studies on audio forensics models has therefore arisen to reduce or eliminate these negative effects. In this paper, we propose a black-box adversarial attack method that relies only on the output scores of audio forensics models. To improve the transferability of the adversarial examples, we use an ensemble-model method. In view of the serious threat that adversarial examples pose to audio forensics models, we also design a defense against the proposed attack. Experimental results on 4 forensics models trained on the LA part of the ASVspoof 2019 dataset show that our attack achieves a 99% attack success rate on score-only black-box models, which is competitive with the best white-box attacks, and that it also transfers to decision-only black-box models. Finally, our defense method reduces the attack success rate while maintaining the detection accuracy of the forensics models.

1. Introduction

Speech synthesis technologies have advanced significantly in recent years [1, 2]. Speech synthesis generally refers to the process of converting text into speech. At present, a mainstream speech synthesis system generally consists of two parts: a spectrogram prediction network and a vocoder. The spectrogram prediction network converts text into mel spectrograms. Shen et al. [3], for example, use a Seq2Seq network with an attention mechanism to map text to mel spectrograms, while Ren et al. [4] and Lancucki et al. [5] use the transformer structure [6] for this purpose. The vocoder converts the mel spectrograms into speech. Van et al. [7] use several dilated convolution layers to achieve this function, and Prenger et al. [8] use a generative model that generates audio by sampling from a distribution [9]. There are also end-to-end models, such as FastSpeech2s [10]. These technologies have been widely used in the Internet of Things [11, 12], for example in smart speakers and personal voice assistants.

However, these technologies have also been abused. They appear in telecom fraud, in the creation of rumors, and in the spoofing of automatic speaker verification (ASV) systems. To detect such fake audio, researchers have designed several methods. Lai et al. [13] selectively accumulate discriminative features in the frequency and time domains. Lai et al. [14] adaptively recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels. Jung et al. [15] use convolutional layers to extract frame-level embeddings and a GRU layer to aggregate the extracted frame-level features into a single utterance-level feature. Related competitions [16] have also been organized to promote research in this field.

Previous research shows that image classification neural networks [17-19] are vulnerable to adversarial examples, and audio models are no exception [20-27]. Generally, adversarial attacks are divided into two categories: white-box attacks and black-box attacks. In a white-box attack, the attacker can access the complete structure, parameters, inputs, and outputs of the model; in a black-box attack, the attacker can only obtain external input and output information and cannot access the internal structure or parameters of the model [28, 29]. Current research on adversarial attacks against audio forensics models mainly focuses on white-box attacks [30]. Although some studies use the transferability of adversarial examples to achieve black-box attacks, they still rely on white-box models to generate the adversarial examples [31]. In this paper, we rely only on the output distribution of the model to conduct black-box adversarial attacks.

The main contributions of this article can be summarized as follows:
(i) To the best of our knowledge, we are the first to propose a black-box adversarial attack method that relies only on the output distribution of audio forensics models, and we use the ensemble-model method to increase the transferability of adversarial examples in order to implement decision-only black-box attacks.
(ii) We propose a defense method based on low-sensitivity features in view of the serious threat of black-box adversarial examples.
(iii) Our proposed black-box method achieves an attack success rate equivalent to that of the best white-box attacks, and our defense method significantly reduces the threat.

The rest of the paper is organized as follows: Section 2 introduces several audio forensics models, which are the victim models in this paper; Section 3 describes the proposed adversarial attacks and defense methods; Section 4 introduces the experimental setup and results; and Section 5 gives the conclusion and future work.

2. Audio Forensics Models

Current speech synthesis technologies have developed to a high level. Once criminals use them for telecommunications fraud, online rumors, and similar purposes, they will bring great harm to society. Researchers have therefore designed a variety of audio forensics models, which aim to reveal the difference between real and fake voices from various angles. The following subsections introduce several current mainstream audio forensics models whose detection accuracy is among the best in the audio forensics competition ASVspoof 2019. They serve as the victim models in this paper.

2.1. Attentive Filtering Network Model (AFnet) [13]

Attentive filtering (AF) selectively accumulates discriminative features in the frequency and time domains. AF augments every input feature map $S$ with an attention heatmap $A_s$. The augmented feature map $S^{*}$ is then treated as the new input for the dilated residual network (DRN). For $S, A_s \in \mathbb{R}^{F \times T}$, AF is described as

$$S^{*} = A_s \circ S + \bar{S}, \tag{1}$$

where $F$ and $T$ are the frequency and time dimensions, $\circ$ is the element-wise multiplication operator, $+$ is the element-wise addition operator, and $\bar{S}$ is the residual ($\bar{S} = S$). To learn the attention heatmap, $A_s$ contains similar bottom-up and top-down processing as [32, 33], and is described as

$$A_s = \phi(U(S)), \tag{2}$$

where $\phi$ is a nonlinear transform such as sigmoid or softmax, $U$ is a U-net-like structure composed of a series of downsampling and upsampling operations, and $S$ is the input.
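The augmentation step can be illustrated with a minimal NumPy sketch. It only shows the "heatmap times input plus residual" pattern described above. The function `toy_heatmap` is a hypothetical stand-in (2x2 average pooling, nearest-neighbour upsampling, then a sigmoid) and is not the U-net-like branch of [13]; the names `attentive_filter` and `sigmoid` are likewise illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def toy_heatmap(S):
    """Hypothetical stand-in for the learned heatmap branch (needs F, T >= 2):
    one downsampling step, one upsampling step, then a sigmoid."""
    F, T = S.shape
    # Downsample: 2x2 average pooling over the even-sized part of the map.
    pooled = S[:F - F % 2, :T - T % 2].reshape(F // 2, 2, T // 2, 2).mean(axis=(1, 3))
    # Upsample: nearest-neighbour repetition back towards the input size.
    up = np.repeat(np.repeat(pooled, 2, axis=0), 2, axis=1)
    # Restore a dropped odd row/column by repeating the edge values.
    up = np.pad(up, ((0, F - up.shape[0]), (0, T - up.shape[1])), mode="edge")
    return sigmoid(up)

def attentive_filter(S):
    """Element-wise product of heatmap and input, plus the residual input."""
    A = toy_heatmap(S)
    return A * S + S
```

Because the sigmoid keeps every heatmap value in (0, 1), each time-frequency bin in this sketch is scaled by a factor between 1 and 2, so the heatmap emphasises bins without ever suppressing the original feature map.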

2.2. Squeeze-Excitation ResNet Model (SEnet) [14]

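The squeeze-and-excitation (SE) block [34] is a computational unit that can be constructed for any given transformation $F_{tr}: X \to U$, with $X \in \mathbb{R}^{H' \times W' \times C'}$ and $U \in \mathbb{R}^{H \times W \times C}$. The features $U$ are first passed through a squeeze operation to obtain a statistic $z \in \mathbb{R}^{C}$, where the $c$-th element of $z$ is calculated by

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j). \tag{3}$$

This is followed by an excitation operation

$$s = F_{ex}(z, W) = \sigma(W_2\, \delta(W_1 z)), \tag{4}$$

where $\sigma$ is the sigmoid function, $\delta$ refers to the ReLU function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, with reduction ratio $r$. The final output of the block is obtained by rescaling the transformation output $U$ with the activations

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c, \tag{5}$$

where $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_C]$ and $F_{scale}(u_c, s_c)$ refers to channel-wise multiplication between the feature map $u_c \in \mathbb{R}^{H \times W}$ and the scalar $s_c$.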

The SE block, which adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels, can easily be applied to ResNet to obtain the squeeze-excitation ResNet (SEnet).
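The squeeze, excitation, and rescaling steps can be sketched in a few lines of NumPy. This is an illustrative forward pass only: the function name `se_block`, the channels-last layout, and the externally supplied weight matrices are assumptions made for the example, and a trained SEnet would learn the weights and wrap this unit inside residual blocks.

```python
import numpy as np

def se_block(U, W1, W2):
    """Forward pass of a squeeze-and-excitation unit.

    U  : feature maps of shape (H, W, C) (channels-last is an assumption)
    W1 : weights of shape (C // r, C), the bottleneck layer
    W2 : weights of shape (C, C // r), the expansion layer
    """
    # Squeeze: global average pooling gives one statistic per channel.
    z = U.mean(axis=(0, 1))                      # shape (C,)
    # Excitation: bottleneck with ReLU, then a sigmoid gate per channel.
    hidden = np.maximum(W1 @ z, 0.0)             # shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))     # shape (C,), values in (0, 1)
    # Scale: broadcast the channel gates over the spatial dimensions.
    return U * s
```

With all-zero expansion weights every gate equals 0.5, so the block simply halves the input; this is a convenient sanity check for the broadcasting.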

2.3. CNN-GRU Model [15]

The DNNs used in this model (CNN-GRU) include a convolutional neural network (CNN), a gated recurrent unit (GRU), and fully connected layers. In this architecture, the convolutional layers first process the input features to extract frame-level embeddings. The convolutional layers include residual blocks with identity mapping [35] to facilitate the training of deep architectures. Specifically, the first convolutional layer of this model deals with local adjacent time and frequency regions, which are gradually aggregated through repeated pooling operations to extract frame-level embeddings. The GRU layer then aggregates the extracted frame-level features into a single utterance-level feature. Fully connected layers transform the utterance-level feature, and an output layer with two nodes indicates whether the input utterance is spoof or bona fide.

3. Audio Adversarial Examples Generation

3.1. Threat Model

In this paper, the adversarial attack crafts an adversarial voice $\hat{x} = x + \delta$ by finding a perturbation $\delta$ such that (1) $x$ is an original voice classified as spoof by the audio forensics model, (2) $\delta$ is as human-imperceptible as possible, and (3) the audio forensics model classifies the voice $\hat{x}$ as bonafide. To keep the perturbation as imperceptible as possible, our attack follows FAKEBOB [22]: it adopts the $L_\infty$ norm to measure the similarity between the original and adversarial voices and ensures that the $L_\infty$ distance $\lVert \hat{x} - x \rVert_\infty = \max_i |\hat{x}(i) - x(i)|$ is less than the given maximal amplitude threshold $\epsilon$ of the perturbation, where $i$ denotes the sample point of the audio waveform. We can therefore formalize the problem of finding an adversarial voice $\hat{x}$ for a voice $x$ as the following constrained minimization problem:

$$\min_{\delta} f(x + \delta) \quad \text{such that} \quad \lVert \delta \rVert_\infty < \epsilon, \tag{6}$$

where $f$ is a loss function. To ensure the success rate of the attack, we minimize the loss function $f$ rather than the perturbation $\delta$. When $f$ is minimized, $x + \delta$ is recognized as bonafide.

According to the attacker's knowledge of the model, adversarial attacks can be divided into white-box attacks and black-box attacks. A white-box attack generally means that the attacker has full knowledge of the victim model, including its external input and output information and its internal structure and parameters. Attackers can efficiently perform gradient descent by differentiating the loss function to launch an iteration-based adversarial attack. Previous research on adversarial examples against audio forensics models mostly focuses on white-box attacks [30] or uses adversarial examples generated from white-box models to conduct transferable adversarial attacks [31]. In real environments, however, users of an audio forensics model generally do not disclose its internal structure and parameters, which significantly limits the application scenarios and the threat of white-box attacks. It also leads some people to mistakenly believe that protecting the internal information of the model can protect it from adversarial attacks.

Therefore, in this paper, we focus on black-box adversarial attacks, in which the attacker can only access the input and the external output of the model. The attacker cannot directly use internal information to obtain the gradient of the loss function and launch the attack. Black-box adversarial attacks can be further subdivided into score-only and decision-only attacks. In a score-only black-box attack, the attacker can access the confidence scores that the model returns for each input, while a decision-only black-box attack relies solely on the final decision of the model [36].

In the remainder of this section, we will present methods for launching adversarial attacks in these two black-box scenarios and attempt to defend against score-only black-box adversarial attacks.

3.2. Score-Only Black-Box Attack Algorithm

As shown in Figure 1, we will introduce the whole attack process in the remainder of this subsection, especially the loss function and algorithm to solve the optimization problem.

3.2.1. Loss Function

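The key to carrying out the adversarial attack is that the score of the bonafide class should be greater than that of the spoof class. Therefore, the loss function is defined as follows:

$$f(x) = \max\{S_{s}(x) - S_{b}(x),\ -\kappa\}, \tag{7}$$

where $S_{s}(x)$ and $S_{b}(x)$ denote the spoof and bonafide scores that the model outputs for the voice $x$, and the parameter $\kappa$, with $\kappa \ge 0$, controls the intensity of the adversarial examples, so we can increase $\kappa$ to enhance their robustness.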
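One loss that is consistent with this description is a margin between the two class scores, floored at the negative intensity parameter. The sketch below is an assumption in the style of FAKEBOB [22], not a confirmed transcription of the loss used in the experiments; the function name `attack_loss` and its argument names are illustrative.

```python
def attack_loss(score_spoof, score_bonafide, kappa=0.0):
    """Margin loss for a spoof-to-bonafide attack (assumed form).

    The loss is positive while the model still prefers "spoof" and reaches
    its floor of -kappa once the bonafide score leads by at least kappa.
    """
    if kappa < 0:
        raise ValueError("kappa must be non-negative")
    return max(score_spoof - score_bonafide, -kappa)
```

For example, scores of 0.9 (spoof) and 0.1 (bonafide) give a loss of about 0.8, while scores of 0.2 and 0.9 with `kappa=0.5` give the floor value -0.5. A larger `kappa` forces the optimisation to keep pushing after the decision has already flipped, which is why it acts as an intensity control.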

3.2.2. Optimization Algorithm

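We use the basic iterative method (BIM) [18] with estimated gradients to craft adversarial examples. The $t$-th iteration voice can therefore be defined as

$$x_t = \mathrm{clip}_{x, \epsilon}\{x_{t-1} - \eta \cdot \mathrm{sign}(g_t)\}, \tag{8}$$

where $\eta$ is the learning rate, $g_t$ is the $t$-th iteration gradient, and $\mathrm{clip}_{x, \epsilon}$ keeps the result within the threshold $\epsilon$ of equation (6) around the original voice $x$.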
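A single BIM-style update can be sketched as follows. The sketch assumes a descent step on the loss (the attack minimises it), a projection back into the L-infinity ball around the original voice, and a waveform normalised to [-1, 1]; the normalisation and the name `bim_step` are assumptions made for the example.

```python
import numpy as np

def bim_step(x_prev, x_orig, grad, eta, eps):
    """One signed-gradient descent step with L-infinity projection.

    x_prev : current adversarial waveform
    x_orig : original (unperturbed) waveform
    grad   : estimated gradient of the loss at x_prev
    eta    : step size (learning rate)
    eps    : maximal amplitude of the perturbation
    """
    # Move against the gradient sign, because the loss is minimised.
    x_new = x_prev - eta * np.sign(grad)
    # Project back so that |x_new - x_orig| <= eps at every sample point.
    x_new = np.clip(x_new, x_orig - eps, x_orig + eps)
    # Keep a valid waveform (assumes samples are normalised to [-1, 1]).
    return np.clip(x_new, -1.0, 1.0)
```

Using only the sign of the gradient makes the step size independent of the gradient magnitude, which is convenient when the gradient is a noisy estimate obtained from model queries.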

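To compute the estimated gradients, we use the natural evolution strategy (NES) [37], because NES-based gradient estimation has been shown to require far fewer queries than finite-difference gradient estimation. In detail, we first create $m$ ($m$ must be even) Gaussian noises $(u_1, \ldots, u_m)$ at the $t$-th iteration and generate $m$ new voices $x_{t-1}^{j} = x_{t-1} + \sigma u_j$ with noise scale $\sigma$, where $u_j = -u_{m+1-j}$ for $j = 1, \ldots, m/2$. Then we compute the loss values $f(x_{t-1}^{1}), \ldots, f(x_{t-1}^{m})$. Finally, the gradient can be computed as

$$\nabla f(x_{t-1}) \approx \frac{1}{m \sigma} \sum_{j=1}^{m} f(x_{t-1}^{j})\, u_j. \tag{9}$$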
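The estimation procedure can be sketched with NumPy as below. The sketch uses antithetic sampling (each noise vector is paired with its negation), which is one standard way to satisfy the "must be even" requirement. The default values of `m` and `sigma` are placeholders chosen for illustration, not the settings used in the experiments, and `loss_fn` stands for any function that queries the model and returns a scalar loss.

```python
import numpy as np

def nes_gradient(loss_fn, x, m=50, sigma=1e-3, rng=None):
    """Estimate the gradient of loss_fn at x from m score queries.

    m and sigma are illustrative placeholder defaults.
    """
    if m % 2 != 0:
        raise ValueError("m must be even")
    rng = np.random.default_rng() if rng is None else rng
    # Draw m/2 Gaussian noises and mirror them, so that row j is the
    # negation of row m-1-j (antithetic pairs).
    half = rng.standard_normal((m // 2, x.size))
    noise = np.concatenate([half, -half[::-1]], axis=0)
    # One model query per perturbed voice.
    losses = np.array([loss_fn(x + sigma * u.reshape(x.shape)) for u in noise])
    # Loss-weighted sum of the noises, scaled by 1 / (m * sigma).
    grad = (losses[:, None] * noise).sum(axis=0) / (m * sigma)
    return grad.reshape(x.shape)
```

For a linear loss the constant term cancels exactly across each antithetic pair, so the estimate converges to the true gradient as `m` grows. Each call costs `m` queries to the victim model, which is the main driver of the query budget of a score-only attack.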

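We also use momentum [38] to speed up convergence and increase the transferability of the adversarial examples. The $t$-th iteration gradient can therefore be defined as

$$g_t = \mu \cdot g_{t-1} + \frac{\nabla f(x_{t-1})}{\lVert \nabla f(x_{t-1}) \rVert_1}, \tag{10}$$

where $\mu$ is the decay factor.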
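A momentum accumulator of this kind can be sketched as follows. The L1 normalisation of the incoming gradient follows the common momentum iterative formulation and is an assumption here, as are the function name `momentum_update` and the placeholder default for the decay factor.

```python
import numpy as np

def momentum_update(g_prev, grad_est, mu=0.9):
    """Accumulate an L1-normalised gradient estimate with decay factor mu.

    mu=0.9 is an illustrative placeholder value.
    """
    norm = np.abs(grad_est).sum()
    if norm > 0:
        # Normalise so that iterations with large raw gradients do not
        # dominate the accumulated direction.
        grad_est = grad_est / norm
    return mu * g_prev + grad_est
```

Starting from a zero accumulator, the first call simply returns the normalised estimate; later calls blend it with the decayed history, which smooths out the noise of query-based gradient estimates.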

3.3. Decision-Only Black-Box Attack Algorithm

Although the NES-based gradient estimation attack does not need to touch the internal structure and parameters of the model, it still needs to obtain the distribution of the output scores through a large number of queries. Once the model limits the number of queries or returns only a positive or negative decision without the score, this attack becomes impossible to implement. In this case, a transferable adversarial attack can be used to achieve a decision-only black-box attack. Specifically, a transferable adversarial attack generates adversarial examples through known methods and then uses these examples to attack the decision-only black-box model.

To improve the transferability of adversarial examples, an obvious idea is to increase the attack intensity $\kappa$. However, if we only increase $\kappa$, the adversarial examples may overfit to the source model, which decreases transferability. We therefore use an ensemble-based method to conduct the decision-only black-box attack. The authors of [39] argue that adversarial examples are more likely to transfer to other models if they can fool various models simultaneously, as shown in Figure 2. We follow this strategy and take a weighted sum of the scores of multiple models. To attack $K$ score-only black-box models simultaneously, we fuse the loss function as

$$f(x) = \sum_{k=1}^{K} w_k f_k(x), \tag{11}$$

where $f_k$ represents the loss of the $k$-th score-only black-box model and $w_k$ is the ensemble weight, where $w_k \ge 0$ and $\sum_{k=1}^{K} w_k = 1$.
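The fusion step amounts to a convex combination of the per-model losses. The sketch below assumes the weights are non-negative and sum to one; the function name `ensemble_loss` and the validation it performs are illustrative choices rather than part of the described method.

```python
def ensemble_loss(losses, weights):
    """Weighted sum of per-model attack losses (assumed convex combination).

    losses  : loss value of each source model for the same candidate voice
    weights : ensemble weights, non-negative and summing to 1
    """
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * l for w, l in zip(weights, losses))
```

For instance, per-model losses of 0.6, -0.2, and 0.1 with weights 0.5, 0.25, and 0.25 give a fused loss of 0.275. The fused value can be passed to the same gradient estimator as a single-model loss, at the cost of one query per source model for every perturbed voice.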

3.4. Defense

Audio forensics models aim to reduce the harm of speech synthesis technology; however, adversarial attacks can render these efforts futile. We therefore need methods to defend against such attacks so that the models can be reinforced.

In previous experiments, we noticed that models with the same structure, trained on the same dataset, show different detection accuracy when trained on features of different sizes and types. We therefore consider that audio forensics models have different sensitivity to different features. Models trained on high-sensitivity features perform better than those trained on low-sensitivity features, and we believe the reason is that particular features let the model obtain more information from the original audio. We therefore expect that models using low-sensitivity features will be less affected by adversarial perturbations, and we attempt to use these low-sensitivity features to reinforce audio forensics models.

4. Experiments

4.1. Dataset and Victim Models

Following the setting in [30], we use the LA part of the ASVspoof 2019 dataset [16]. We use the energy spectrum with 2048 fast Fourier transform (FFT) bins as input for all models. Only the first 400 frames of each utterance are used to extract acoustic features.
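A feature extractor of this kind can be sketched with NumPy alone. Only the FFT size (2048) and the 400-frame limit come from the text; the hop length, the Hann window, the zero-padding of very short inputs, and the name `energy_spectrum` are assumptions made for the example. The sketch returns the one-sided spectrum, so each frame has 1025 values.

```python
import numpy as np

def energy_spectrum(wave, n_fft=2048, hop=512, max_frames=400):
    """Frame-wise energy (squared magnitude) spectrum of a 1-D waveform.

    hop and the Hann window are illustrative assumptions.
    """
    wave = np.asarray(wave, dtype=float)
    if len(wave) < n_fft:
        # Zero-pad very short inputs so that at least one frame exists.
        wave = np.pad(wave, (0, n_fft - len(wave)))
    window = np.hanning(n_fft)
    # Keep only the first max_frames frames of the utterance.
    n_frames = min(1 + (len(wave) - n_fft) // hop, max_frames)
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # One-sided FFT per frame; squared magnitude gives the energy spectrum.
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2
```

Utterances shorter than 400 frames simply yield fewer rows in this sketch; a real pipeline would need a fixed policy, such as padding or repeating the utterance, to give the models equally sized inputs.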

We use the LA training set to train our audio forensics models and the LA development set to evaluate the models. The details of the models can be found in Table 1.

4.2. Score-Only Black-Box Attack

We randomly selected 500 spoof audio examples from the trn set to conduct our proposed adversarial attacks. All the selected samples are classified correctly by our victim audio forensics models before the attacks. We generate adversarial examples only from spoof examples, because we see no real value in converting a bonafide sample into a spoof one. All of the attacks are conducted with the same settings for the perturbation threshold in equation (6), the intensity parameter in equation (7), and the two gradient-estimation parameters in equation (9). As shown in Table 2, our proposed method achieves a 99% attack success rate, which is comparable to that of MI-FGSM, the most powerful white-box attack method.

We can conclude that our proposed score-only black-box attack method is extremely threatening to mainstream audio forensics models. Tiny adversarial perturbation can almost completely invalidate them. This also shows that if only the internal structure and parameters are hidden from the attackers, it is almost impossible to defend against the attack. It is necessary to find other more effective defense methods.

4.3. Decision-Only Black-Box Attack

We use the 100 spoof audio examples used in the previous subsection to conduct the decision-only black-box attack. We use 3 of the models to generate the adversarial examples and the remaining one to evaluate them. To evaluate the ensemble-model method and the effect of the intensity factor, we also generate adversarial examples with a single model and with multiple models under different settings of the intensity factor. All of the results can be found in Table 3 and Figure 3.

We find that if we only increase the intensity factor or use the ensemble-model method, the improvement of the transferability of the adversarial examples is limited. So we need to combine these methods to get the best attack effect.
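The evaluation protocol of this subsection can be sketched in plain Python: three models act as sources and the remaining one is the held-out victim, and the attack success rate is the fraction of adversarial examples that the victim labels as bonafide. The function names and the string labels are illustrative assumptions.

```python
def leave_one_out_splits(model_names):
    """Return (source_models, held_out_model) pairs, one per model."""
    return [
        ([m for m in model_names if m != held_out], held_out)
        for held_out in model_names
    ]

def attack_success_rate(decisions, target_label="bonafide"):
    """Fraction of adversarial examples that the victim labels as target_label.

    decisions : hard labels returned by the decision-only victim model
    """
    decisions = list(decisions)
    if not decisions:
        raise ValueError("no decisions to evaluate")
    return sum(d == target_label for d in decisions) / len(decisions)
```

Because only hard labels are needed, this evaluation requires a single query per adversarial example on the victim model, which matches the decision-only setting.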

4.4. Defense

Earlier in the paper, we discussed how to strengthen the model's defense against our proposed adversarial attack. Here, we test the method based on low-sensitivity features. After a series of experiments, we found that models trained on the log-power spectrum with 512 FFT bins strike a balance between the accuracy of detecting spoof samples and the ability to defend against adversarial attacks.

We used the LA training set to train the audio forensics models and the LA development set to evaluate the models. The detection accuracy of the original models and the reinforced models is shown in Table 4.

We randomly select 100 spoof audio examples from the trn set to conduct the adversarial attack and evaluate the defense capabilities of the reinforced models. We also conduct the attack on the original models. The results can be seen in Table 5 and Figure 4.

By comparing the two types of models, we find that the average detection accuracy of original models on original examples is slightly higher than that of the reinforced models. However, the reinforced models we proposed significantly reduce the success rate of adversarial attacks.

5. Conclusion

In this paper, the proposed black-box attack method achieves an attack success rate equivalent to that of the best white-box attacks, which shows that hiding the internal structure and parameters of the model from the attacker cannot effectively protect the model. The success rate of the decision-only black-box attack also shows that limiting the number of queries offers little protection for the model. More research is therefore needed on more effective methods of model reinforcement.

Although the method proposed in this paper reaches a success rate similar to that of the white-box attack, there is still a large gap between the black-box and white-box methods in terms of the efficiency of generating adversarial examples. Further research is therefore needed on improving the generation efficiency of black-box adversarial examples.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.