Abstract

Speaker verification systems have gained great popularity in recent years, especially with the development of deep neural networks and the Internet of Things. However, the security of speaker verification systems based on deep neural networks has not been well investigated. In this paper, we propose an attack that spoofs the state-of-the-art speaker verification system based on the generalized end-to-end (GE2E) loss function so that illegal users are misclassified as an authentic user. Specifically, we design a novel loss function to train a generator that produces effective adversarial examples with slight perturbation and then spoof the system with these adversarial examples. The success rate of our attack reaches 82% when cosine similarity is adopted to deploy the deep-learning-based speaker verification system. Beyond that, our experiments also report a signal-to-noise ratio of 76 dB, which shows that our attack has higher imperceptibility than previous works. In summary, the results show that our attack not only spoofs the state-of-the-art neural-network-based speaker verification system but also, more importantly, can hide from human hearing and machine discrimination.

1. Introduction

In recent years, verification systems have been employed in many scenarios including entrance guard, online payment, and smart home management. The speaker verification system, which offers convenient and reliable verification, is a popular biometric system. It verifies a person by an utterance that contains a unique biometric feature called the voiceprint. Compared with other biometric features used in verification systems (e.g., fingerprint [1], face ID [2], and iris features [3]), the voiceprint has the advantages of being contactless, privacy-preserving, and low cost. Therefore, speaker verification has become a promising biometric technique and received high social acceptance, especially in the area of the smart Internet of Things (IoT), such as voice assistants.

One concern about applying speaker verification systems is whether they are secure enough. To explore this latent risk, we first reviewed how speaker verification systems work. We found that current speaker verification systems can be divided into two types: text-dependent speaker verification (TD-SV) and text-independent speaker verification (TI-SV). They have different requirements for inputs. TD-SV requires users to say the same utterance as the one used at enrollment, whereas users can say any utterance for verification in TI-SV. Obviously, TI-SV offers a more convenient speaker verification system than TD-SV. However, we raised a question about the security of TI-SV: can a speaker verification system based on TI-SV be spoofed by an adversary?

Firstly, we need to find the state-of-the-art speaker verification system based on TI-SV. Through our review, we found that deep-neural-network-based speaker verification shows better performance than conventional solutions. Because it shows a more promising prospect, we set it as our target speaker verification system to explore the latent risk. With a further survey, we raised another question: can a speaker verification system based on deep neural networks be spoofed by adversarial examples? To find out whether such spoofing is possible, we first performed a comprehensive review of voiceprint identification technologies using deep neural networks [4–6]. Among these works, the state-of-the-art embedding vector model with the generalized end-to-end (GE2E) loss function [5] proposed by Google performs best and has been applied in many domains (e.g., education and speaker verification). Based on this knowledge, our work explores the vulnerability of TI-SV based on the GE2E loss function. We therefore rebuild this speaker verification system with the TIMIT dataset and try to spoof it with adversarial examples. To achieve the spoofing attack, we first state two requirements.

1.1. Imperceptible

The adversarial examples need to be similar to the original utterance. In other words, the perturbation injected into the original utterance should be slight enough. Otherwise, the spoofing attack will be easily discovered by humans or machines, which will cause it to fail.

1.2. Purposeful

The adversarial utterance should be verified as the specific target identity set by the adversary, which makes the spoofing attack more powerful and destructive.

We observe that the speaker verification system verifies the user by fine-grained waveform details rather than the macroscopic waveform shape. This trait gives us the possibility of spoofing the speaker verification system with a slight perturbation. Based on this possibility, we identified several technical challenges that still need to be addressed to realize the spoofing attack: (i) How to generate the perturbation for the victim? (ii) How to keep the perturbation slight enough? (iii) How to quantify the perturbation's influence on the utterance? We construct a novel adversarial example generator for producing effective slight perturbations to spoof the state-of-the-art speaker verification system, as shown in Figure 1. The adversary utilizes the generator to generate a tiny perturbation and injects it into an utterance of an unregistered user; the synthetic utterance will be verified as the targeted registered legal user. In addition, we designed a novel loss function to achieve imperceptibility, which tries to find the slightest perturbation that still realizes the spoofing attack. We propose two methods to evaluate the perturbation: the maximum noise ratio (MNR) and the signal-to-noise ratio (SNR). Finally, we evaluated our spoofing attack on the TIMIT dataset, which contains 630 speakers and 6300 sentences. Based on this dataset, we rebuilt the state-of-the-art speaker verification system with the GE2E loss function and two identification models, linear discriminant analysis (LDA) and a cosine similarity threshold, respectively. In our experiments, the spoofing attack achieved a high success rate of 82% and a slight distortion of −77 dB in MNR (usually negative) and 76 dB in SNR (usually positive).

To summarize, the contributions of our work are as follows:

(1) We propose a novel multifactor-based attack to spoof the state-of-the-art deep-learning-based TI-SV system. Our spoofing attack transforms an illegal utterance's verification result into a legal target with a slight perturbation, and we do not require any utterance from the target. To the best of our knowledge, this is the first exploratory work on spoofing the state-of-the-art deep-learning-based TI-SV system.

(2) We consider imperceptibility a key metric of the spoofing attack. Hence, our spoofing attack improves imperceptibility through a novel loss function. The results show that the imperceptibility of our spoofing attack is much better than that of previous works.

(3) We evaluated our spoofing attack on the state-of-the-art deep-learning-based TI-SV system with the TIMIT dataset, which includes 630 speakers. The results show that our spoofing attack achieves a high success rate and high imperceptibility.

2. Related Work

This section introduces previous works on spoofing attacks against speaker verification systems and on adversarial examples in the audio domain.

2.1. Attack on Speaker Verification Services

Before deep learning, many conventional methods were used to realize speaker verification systems, among which two models work best: the i-vector [7] and the Gaussian mixture model-universal background model (GMM-UBM) [8]. These two models achieved good results in realizing speaker verification systems, but adversaries still found vulnerabilities in them. For instance, given a target utterance from the dataset, two genetic algorithms were used to produce a new utterance that is identified as the target by both the i-vector and GMM-UBM systems [9]. Moreover, adversaries can also use voice conversion to transform an original utterance into a target utterance, and the transformed utterance will have features similar to those of the target utterance in speaker verification systems [10]. Deep-learning-based verification systems can also be spoofed (e.g., the spoofing attack on the d-vector [11]). In our work, we aim at the state-of-the-art TI-SV and propose a novel attack method to spoof it.

2.2. Audio Adversarial Examples

Adversarial examples have been utilized as spoofing attack methods in different domains, including the audio domain, and have shown effective results in recent research [12–14]. Speech-to-text is an important problem in the audio domain, and an effective solution called connectionist temporal classification (CTC) [15] has been universally adopted. With the development of this technology, spoofing attacks based on adversarial examples have been proposed: a generator produces a perturbation so that the new utterance sounds like the original one, and Carlini and Wagner [16] proposed utilizing this method to spoof speech-to-text systems. A later spoofing attack [17] on speech-to-text systems is reported to be more imperceptible. Aiming at intelligent voice assistants, Qin et al. [18] proposed an attack that produces an incorrect command that cannot be detected by people. Our work uses adversarial examples to attack the state-of-the-art speaker verification system. Compared with previous works, our target is the speaker verification system rather than the speech-to-text system, and our attack is more imperceptible.

3. Background

In this section, we will elaborate on technologies and related concepts that were used in our work.

3.1. Speaker Verification Systems

First of all, we briefly introduce two different forms of speaker verification systems, TD-SV and TI-SV. There are several works on TD-SV [19, 20], but TD-SV requires users to say a fixed utterance in both enrollment and verification. This constraint prevents TD-SV from being used for continuous verification or in scenarios with high user-experience requirements. Hence, TI-SV was proposed to mitigate this constraint. TI-SV focuses on finding speaker features independently of the spoken content, and several works [21, 22] have been proposed that benefit from this advantage. TI-SV is more practical and convenient than TD-SV. Therefore, it is meaningful to explore the possibility of attacks on TI-SV.

For this reason, our work focuses on exploring vulnerabilities in TI-SV, and we first reviewed TI-SV technologies. Through the review, we found that several works realized TI-SV based on the i-vector [7] and GMM-UBM [8]. Furthermore, with the rapid development of deep learning, deep networks have also been used in speaker verification to extract the voiceprint that represents the speaker's identity. Deep-learning-based TI-SV performs better than previous solutions based on the i-vector and GMM-UBM. Because of its higher efficiency, our attack targets deep-learning-based TI-SV. We analyzed the main process of these solutions. For an utterance, the voiceprint extractor employs an embedding network to extract feature vectors that represent the identity of the speaker from the processed utterance. After this process, the deep-learning-based TI-SV verifies the feature vectors by different methods. During our experiments, we found that the loss function is a critical part that directly influences the final performance. Hence, we reviewed deep-learning-based TI-SV and found the state-of-the-art loss functions, the tuple-based end-to-end (TE2E) loss function [6] and the generalized end-to-end (GE2E) loss function [5], which have been widely used in real life and achieve great performance. Next, we briefly illustrate GE2E and its prior work TE2E.

3.1.1. Tuple-Based End-to-End

TE2E outputs embedding vectors for one evaluation utterance and several enrollment utterances by a long short-term memory (LSTM) network, and the centroid of the enrollment embedding vectors is calculated. The centroid is denoted by $c_k$, where $k$ indexes the speaker, and it is the mean vector of the embeddings of all enrollment utterances from that speaker. The evaluation embedding vector is denoted by $\tilde{e}_j$, where $j$ indexes the speaker. TE2E quantifies the distance between $\tilde{e}_j$ and $c_k$ by cosine similarity with the formula

$s = w \cdot \cos(\tilde{e}_j, c_k) + b,$

where $w$ and $b$ are parameters learned during training. Based on these definitions, the TE2E loss is defined as

$L_T(\tilde{e}_j, c_k) = \delta(j, k)\,(1 - \sigma(s)) + (1 - \delta(j, k))\,\sigma(s),$

where $\sigma$ is the sigmoid function and $\delta(j, k) = 1$ if $j$ equals $k$; otherwise, $\delta(j, k) = 0$. Although TE2E works well in both TI-SV and TD-SV, it has several disadvantages. Firstly, TE2E produces a scalar representing the similarity between the embedding vector $\tilde{e}_j$ and a single centroid $c_k$, which prevents the network from capturing features from the other enrollment speakers. Therefore, the authors further proposed the GE2E model.

3.1.2. Generalized End-to-End

Compared with TE2E, GE2E builds a similarity matrix between each embedding $e_{ji}$ and all centroids $c_k$ rather than a single $c_k$ for one input batch, where $N$ is the number of speakers, $M$ is the number of utterances for each speaker, $1 \le j \le N$, $1 \le i \le M$, and $1 \le k \le N$. The similarity matrix is

$S_{ji,k} = w \cdot \cos(e_{ji}, c_k) + b.$

When calculating $S_{ji,k}$, if $k = j$, then the centroid $c_j$ is calculated from the other $M-1$ utterances (excluding $e_{ji}$); otherwise, $c_k$ is calculated from all $M$ utterances. GE2E then defines two different loss functions: the softmax loss function, defined by Equation (4) as

$L(e_{ji}) = -S_{ji,j} + \log \sum_{k=1}^{N} \exp(S_{ji,k}),$

and the contrast loss function, defined by Equation (5) as

$L(e_{ji}) = 1 - \sigma(S_{ji,j}) + \max_{1 \le k \le N,\, k \ne j} \sigma(S_{ji,k}),$

where $S_{ji,k}$ represents the similarity between the $i$-th utterance of the $j$-th speaker and the centroid of the $k$-th speaker, and $e_{ji}$ is the embedding of the $i$-th utterance of the $j$-th speaker. The softmax loss function performs well on TI-SV, and the contrast loss function performs well on TD-SV. Because GE2E considers global features rather than local features, it can extract more distinctive features than TE2E.
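To make the GE2E training objective concrete, the following is a minimal PyTorch sketch of the similarity matrix and the softmax loss of Equation (4). The batch layout (N speakers × M utterances), the fixed values of w and b, and all variable names are our own assumptions for illustration, not the authors' implementation; in [5], w and b are trainable parameters with w constrained to be positive.

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
    """Sketch of the GE2E softmax loss for embeddings of shape (N, M, D):
    N speakers, M utterances per speaker, D-dimensional vectors e_ji."""
    n, m, d = embeddings.shape
    # Centroid of each speaker over all M utterances (used when k != j).
    centroids = embeddings.mean(dim=1)                                    # (N, D)
    # Centroid of speaker j excluding utterance i (used when k == j).
    excl = (embeddings.sum(dim=1, keepdim=True) - embeddings) / (m - 1)   # (N, M, D)

    # Cosine similarity matrix S_{ji,k} = w * cos(e_ji, c_k) + b.
    e = F.normalize(embeddings, dim=-1)
    c = F.normalize(centroids, dim=-1)
    sim = w * torch.einsum('jid,kd->jik', e, c) + b                       # (N, M, N)
    # Replace the "own speaker" entries with similarity to the excluded centroid.
    excl_sim = w * (e * F.normalize(excl, dim=-1)).sum(dim=-1) + b        # (N, M)
    idx = torch.arange(n)
    sim[idx, :, idx] = excl_sim

    # Softmax loss: L(e_ji) = -S_{ji,j} + log sum_k exp(S_{ji,k}).
    loss = -sim[idx, :, idx] + torch.logsumexp(sim, dim=-1)
    return loss.sum()
```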

Since the GE2E loss function performs better than the TE2E loss function, our work uses GE2E with the softmax loss to realize the TI-SV system that serves as our attack target.

3.2. Adversarial Examples

As a learning-based spoofing attack method, adversarial examples were proposed to inject tiny perturbations that lead the network to output incorrect results with high confidence. They were first proved effective in the image domain: the difference between the attacked image and the original image cannot be distinguished by the human eye, yet the two images are classified differently. Adversarial example attacks also exist in the audio domain; for instance, a generator is utilized to produce adversarial examples that spoof speech-to-text systems [16]. Here, we employ adversarial examples to generate a perturbation that cannot be perceived by the ear to realize our attack.

4. Attack Model

In this section, we will illustrate the target and constraints of our spoofing attack.

As the final goal, the adversary wants an illegal utterance to be identified as a legal identity, which has been set as the target victim, hereafter A, by the adversary. To achieve this, for any given utterance $x$ from anyone other than the victim, hereafter B, the adversary tries to generate a slight perturbation $\delta$ and introduce it into the audio waveform to obtain a new waveform $x'$, where $x' = x + \delta$. The adversary wants $x'$ to be verified as A, which can spoof the speaker verification system on mobile phones, online systems, and so on.

We assume a black-box setting in which the adversary only knows the output scores and the identification result of the speaker verification system, without the detailed knowledge of its structure that a white-box setting would require. In addition, we assume that our adversarial examples can be introduced directly into the waveform without any noise (e.g., ambient noise when they are played over the air). These constraints are reasonable, as they also appear in prior work [16]. We aim to prove the possibility of this attack rather than its practical deployment. To make the work more convincing, we also discuss several advanced attacks in Section 9 that overcome these constraints and improve the practicality of our work. Note that our spoofing attack injects noise into a real utterance; thus, antispoofing methods based on liveness detection [23, 24] are not effective against this attack.

5. Attack System Design

In this section, we first present the requirements on adversarial examples in our work. Next, we design two different attacks based on distinct loss functions and describe them.

5.1. Adversarial Examples’ Requirements

As an effective attack, the adversarial examples in our experiments need to satisfy several requirements. In the adversarial example generation process, we inject a perturbation into a given original utterance to generate a new utterance, which we define as

$x' = x + \delta,$

where $\delta$ represents the perturbation that will be injected into the utterance and $x$ represents the original utterance. To make the attack effective, we restrict the process with the following three points: (1) $x'$ must lie in the valid range so that the waveform can be recovered as an utterance; (2) $\delta$ needs to be as slight as possible; (3) the speaker verification system must recognize $x'$ as the specific target that the adversary sets before the attack. To describe these requirements more precisely, we formulate them as

$f(x') = t, \quad x' \in [v_{\min}, v_{\max}], \quad \min \|\delta\|,$

where $f(\cdot)$ represents the classification of the speaker verification system and $t$ represents the target classification result. We make several designs in our adversarial example generator to satisfy the above requirements, which we elaborate in the remainder of this section.

5.2. The Clip Function

We first design the solution for requirement (1). It is difficult to constrain the generator so that the generated $\delta$ always keeps $x'$ in the valid range. However, we found that we can simply set any value that falls outside the valid range to the minimum or maximum of that range; this hardly affects the auditory effect. Thus, we designed a special clip function, and the generator clips every sample of $x + \delta$ with this function to keep the new utterance valid. The clip function is defined as

$\mathrm{clip}(x_i + \delta_i) = \max\big(\min(x_i + \delta_i, v_{\max}),\, v_{\min}\big),$

where $[v_{\min}, v_{\max}]$ is the valid amplitude range of the waveform.
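As a concrete illustration, here is a minimal Python sketch of such a clip function, assuming waveforms normalized to [−1, 1] (for 16-bit PCM the bounds would be −32768 and 32767 instead); the range and function name are assumptions, not the paper's code.

```python
import numpy as np

def clip_adversarial(x, delta, v_min=-1.0, v_max=1.0):
    """Clamp every sample of x + delta into the valid waveform range so that
    the perturbed signal can still be written back as an audio file."""
    return np.clip(x + delta, v_min, v_max)
```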

The above function satisfies requirement (1) well through our experiments. Next, we need to design special loss functions to satisfy requirements (2) and (3).

5.3. Generalized Relevancy Based Attack

Based on the clip function, we designed loss functions to satisfy requirements (2) and (3). As introduced in Section 4, our attack must work in the black-box setting; in other words, the adversary can only obtain the classification result and the confidence score of each speaker from the speaker verification system. Existing work needs to estimate a special hyperparameter for the loss function before training [25], and this hyperparameter deeply influences the attack performance. Our generator excludes this influence, so we do not need to choose such a hyperparameter before training. We designed our generalized relevancy loss function mainly based on the GE2E loss function. It is built from the confidence scores returned by the speaker verification system, where $s_k$ represents the confidence score between $x'$ and the $k$-th speaker, $t$ represents the target speaker index, $N$ represents the number of enrolled speakers, and $1/N$ is the reciprocal of the speaker quantity. This loss function expands the distance between $x'$ and the nontarget identities and shrinks the distance between $x'$ and the target identity, while the training process does not rely on a hand-tuned hyperparameter.
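The original equation cannot be fully recovered here, but a plausible form consistent with the description above is sketched below: the loss averages the nontarget confidence scores and subtracts the target score, so minimizing it raises the target score relative to all others. The function name and the use of a simple mean are our assumptions, not the paper's exact formula.

```python
import torch

def generalized_relevancy_loss(scores, target):
    """Illustrative sketch (not the paper's exact equation).

    scores: 1-D tensor of length N holding the black-box confidence score
            of the adversarial utterance x' for each enrolled speaker.
    target: index t of the target speaker.
    """
    non_target = torch.cat([scores[:target], scores[target + 1:]])
    # Push non-target scores down (via their mean) and the target score up.
    return non_target.mean() - scores[target]
```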

5.4. Multifactor Based Attack

In our experiments, the generalized relevancy based loss function elaborated in the previous section can guide the generator to generate adversarial examples that successfully attack the speaker verification system. However, the distortion of those adversarial examples is beyond our tolerance: people can hear obvious noise in the adversarial samples and easily discover the attack, and the samples are also easily recognized by machines. To this end, we designed a stronger generator with another loss function, which can achieve the spoofing attack imperceptibly. We divide this loss function into two parts: one achieves the attack and the other limits the amplitude of the perturbation. In particular, we limit the distortion with a special design in the loss function. Note that because the first goal of our generator is to generate a new utterance that can spoof the speaker verification system, we only activate the part that limits the noise after the new utterance can be verified as the target identity. We use this multifactor based loss function to realize the spoofing attack. We found that the distortion of successful adversarial examples is much smaller than before, while the total success rate stays close to that of the generalized relevancy based attack. The loss combines the attack term with a distortion term weighted by a coefficient $\alpha$ and a constant $c$, where $\alpha$ equals 0 when $x'$ is rejected by the speaker verification system and 1 otherwise. We designed our generator with this loss function. When we initially train the generator, $\alpha$ is set to 0, so the generator first tries its best to achieve the attack; once $x'$ is accepted by the speaker verification system, $\alpha$ is reset to 1, and the generator begins to reduce the distortion, which may cause $x'$ to be rejected again, at which point $\alpha$ is reset to 0. The two requirements compete throughout the training process. Thus, the generator can find the slightest perturbation to inject into the original utterance, and we obtain adversarial examples that cannot be recognized by humans or machines.
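Below is a minimal sketch of how such a multifactor loss could be assembled, under stated assumptions: the attack term is the relevancy-style loss sketched above, the distortion term is the L2 norm of the perturbation, and alpha toggles between 0 (x′ rejected) and 1 (x′ accepted). The exact combination and the default c = 1.0 are illustrative, not the paper's equation.

```python
import torch

def multifactor_loss(scores, target, delta, accepted, c=1.0):
    """Illustrative sketch of the two-part multifactor loss.

    scores:   1-D tensor of black-box confidence scores for each speaker.
    target:   index of the target speaker.
    delta:    the perturbation tensor currently being optimized.
    accepted: whether x' = clip(x + delta) was accepted as the target
              by the speaker verification system in the last query.
    """
    alpha = 1.0 if accepted else 0.0
    non_target = torch.cat([scores[:target], scores[target + 1:]])
    attack_term = non_target.mean() - scores[target]
    distortion_term = torch.norm(delta)
    # While rejected (alpha = 0), only the attack term drives the update;
    # once accepted (alpha = 1), the distortion term starts shrinking delta.
    return attack_term + alpha * c * distortion_term
```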

6. Experiment Setup

6.1. Data Partition and Experimental Environment

In this section, we describe the design of our experiments. To prove the efficiency of our attacks, we examined our spoofing attack on the open TIMIT dataset, which has been used in many other voice-related works [26, 27]. This dataset includes 630 speakers and 6300 sentences. Firstly, we used 462 speakers from the TIMIT training set to train the embedding network based on the GE2E loss function. Then we extracted the embedding vectors from the testing set with the embedding network for further speaker verification. We deploy our attacks locally and randomly select 4 illegal speakers and 5 legal speakers from the testing set. For each illegal speaker, we selected 5 sentences, yielding 20 (4 × 5) sentences; we then trained our generator to produce a perturbation for each sentence targeting each of the 5 legal speakers. Through this process, we obtained 100 (5 × 20) adversarial examples for spoofing attacks and tested them on the speaker verification system. Two different classifiers were employed in our work to show the performance of the spoofing attack on machine learning solutions and similarity threshold solutions, and we also tested the two loss functions introduced in Section 5 to compare the performance of the different spoofing attacks. We conducted the experiments on a server with Ubuntu 16.04 and an Intel Xeon CPU E5-2678 v3 @ 2.50 GHz with 125 GB of RAM. We set the learning rate to 1e−2, the constant c to 20, and the number of epochs to 500.

6.2. Metrics

We employ different metrics to evaluate the above results. For the verification part, we employ the false acceptance rate (FAR), false rejection rate (FRR), and average classification error (ACE). They are defined by the following equations:

$\mathrm{FAR} = \frac{FP}{FP + TN}, \quad \mathrm{FRR} = \frac{FN}{FN + TP}, \quad \mathrm{ACE} = \frac{\mathrm{FAR} + \mathrm{FRR}}{2},$

where TP represents the number of correctly classified positive samples, TN represents the number of correctly classified negative samples, FP represents the number of samples incorrectly classified as positive, and FN represents the number of samples incorrectly classified as negative. In addition, we employ the equal error rate (EER), which is the error rate when FAR equals FRR. For the attack part, we use the success rate (SR) as the metric, defined as

$\mathrm{SR} = \frac{\text{number of adversarial examples accepted as the target}}{\text{total number of adversarial examples}}.$
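Assuming the standard definitions reconstructed above, a small Python helper for these metrics could look like this (function names are illustrative):

```python
def verification_metrics(tp, tn, fp, fn):
    """FAR, FRR, and ACE from raw classification counts."""
    far = fp / (fp + tn)          # fraction of illegal trials accepted
    frr = fn / (fn + tp)          # fraction of legal trials rejected
    ace = (far + frr) / 2
    return far, frr, ace

def success_rate(successful, total):
    """SR: fraction of adversarial examples accepted as the target speaker."""
    return successful / total
```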

These metrics are widely utilized in evaluating performance of verification.

It is easy to evaluate the performance of classification and verification, but it is difficult to evaluate distortion directly, so we need a quantitative method to calculate it. We evaluate an utterance in decibels (dB), a universal description of relative volume:

$\mathrm{dB}(x) = \max_i 20 \log_{10}(|x_i|).$

Based on this, Equation (14) defines the MNR of the adversarial sample as in [16]:

$\mathrm{MNR} = \mathrm{dB}(\delta) - \mathrm{dB}(x),$

where $\delta$ represents the perturbation we inject into the utterance $x$. We can also write the MNR as

$\mathrm{MNR} = 20 \log_{10}\left(\frac{\max_i |\delta_i|}{\max_i |x_i|}\right).$

Since the perturbation is smaller than the original utterance, the result is a negative number; the smaller the result, the tinier the distortion. Although this metric describes the maximum distortion well, it cannot reflect the overall distortion. Therefore, we also introduce another metric, the SNR, which has been used in previous work [25] and is defined as

$\mathrm{SNR} = 10 \log_{10}\left(\frac{P_x}{P_\delta}\right),$

where $P_x$ is the signal power of the original utterance and $P_\delta$ is the signal power of the injected perturbation. When the SNR is large enough, a human can hardly hear the noise in the utterance.
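The two distortion metrics can be computed directly from the waveforms; a minimal NumPy sketch under the definitions above (function names are ours):

```python
import numpy as np

def mnr(x, delta):
    """Maximum noise ratio in dB: dB(delta) - dB(x), following [16];
    negative when the perturbation peak is below the signal peak."""
    db = lambda s: 20 * np.log10(np.max(np.abs(s)))
    return db(delta) - db(x)

def snr(x, delta):
    """Signal-to-noise ratio in dB: 10 * log10(P_x / P_delta);
    larger values mean a less audible perturbation."""
    p_x = np.mean(np.asarray(x, dtype=float) ** 2)
    p_delta = np.mean(np.asarray(delta, dtype=float) ** 2)
    return 10 * np.log10(p_x / p_delta)
```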

Our work uses the above two metrics to evaluate distortion. They respectively describe the maximum distortion, which reflects the waveform alteration in detail, and the total influence of the distortion, which reflects the waveform alteration as a whole.

7. Evaluation

In this section, we will evaluate our spoofing attack on the state-of-the-art TI-SV based on deep learning.

7.1. Performance without Attack

We first need to study the performance of our speaker verification system on the TIMIT dataset, which proves that the spoofing result under attack is caused by our spoofing attack rather than by the system's poor performance. Thus, we run an evaluation to confirm that the speaker verification system can verify the identities of users before we evaluate the performance of our spoofing attack. We train the embedding vector extractor on the training data of the TIMIT dataset. After training, we randomly select 100 people to evaluate the performance of the speaker verification system. When LDA is used, FAR = 5%, FRR = 5%, and ACE = 5%. Figure 2 shows the ROC curve when cosine similarity is used: the EER equals 5% and the area under the curve (AUC) reaches 0.99. Meanwhile, we found through the experiments that the TI-SV performs best when the threshold value is set to 0.59. These results show that, whichever classifier the TI-SV uses, the identities of users can be correctly verified. We can therefore examine our spoofing attack on the TI-SV based on these results.
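For reference, the cosine-similarity decision rule with the 0.59 threshold found above can be sketched as follows; the embedding extraction and enrollment centroid are assumed to come from the GE2E network, and the function name is ours.

```python
import numpy as np

def verify(test_embedding, enrolled_centroid, threshold=0.59):
    """Accept the claimed identity if the cosine similarity between the test
    embedding and the speaker's enrollment centroid exceeds the threshold."""
    cos = np.dot(test_embedding, enrolled_centroid) / (
        np.linalg.norm(test_embedding) * np.linalg.norm(enrolled_centroid))
    return cos >= threshold
```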

7.2. Performance with Spoofing

In this section, we first evaluate the distortion introduced by our generator when generating adversarial examples with the different loss functions. Firstly, we produce adversarial samples using the cosine similarity score and randomly show three waveforms from the same utterance; the result is shown in Figure 3, where the yellow waveform is generated by the generalized relevancy based attack, the red waveform is generated by the multifactor based attack, and the black waveform is the original. We can observe that the waveform generated by our multifactor based attack is more similar to the original waveform than the one generated by our generalized relevancy based attack. These waveforms intuitively show that the multifactor based attack has better imperceptibility than the generalized relevancy based attack. Beyond that, we need to describe the distortion quantitatively. Firstly, Figure 4 shows a distribution diagram of MNR, where the ordinate represents the number of samples in each range. The average MNR for our multifactor based attack is −33 dB, and it is −18 dB for the generalized relevancy based attack. The best-reported MNR for our multifactor based attack and generalized relevancy based attack is −77 dB and −22 dB, respectively. Our multifactor based attack is therefore more imperceptible than the generalized relevancy based attack in this metric. Besides, our best performance is also better than the −45 dB reported by previous work [16]. The MNR describes the distortion at the maximum point, so we still need to describe the distortion globally. Thus, we compared the SNR of the two attacks.

Figure 5 shows the distribution of SNR, where the ordinate represents the number of samples in each range. The average SNR for our multifactor based attack is 31 dB, and it is 17 dB for the generalized relevancy based attack; meanwhile, the best SNR for the multifactor based attack is 76 dB, which is larger than the 26 dB achieved by the generalized relevancy based attack. This best result is also better than the 31 dB reported in previous work. The above results prove that our multifactor based spoofing attack has better imperceptibility than previous work. In other words, our attack achieves higher imperceptibility, which is a key feature of adversarial audio samples. After evaluating the imperceptibility of the generalized relevancy based attack and the multifactor based attack, we evaluate the SR under different classifiers.

7.2.1. Linear Discriminate Analysis

We first tried to spoof the TI-SV that uses LDA to verify the identity of the user. We use the enrollments to train a two-class model that distinguishes the target speaker from nontarget speakers, which realizes the speaker verification task. The result is shown in column SR1 of Table 1. The SR of the generalized relevancy based attack reaches 83%, and the SR of the multifactor based attack reaches 80%. This result proves that the two spoofing attacks can spoof the TI-SV with similar SRs when LDA is used for speaker verification.

7.2.2. Cosine Similarity

LDA and other machine learning classifiers need training before use, which is inconvenient when enrolling a new user. As a more practical alternative, the TI-SV can use cosine similarity with a fixed threshold to verify the speaker's identity, which requires no training. We set the threshold to 0.59, which gives the best performance in Section 7.1. Column SR2 in Table 1 shows the results for the generalized relevancy based attack and the multifactor based attack. The results show that our multifactor based attack still performs close to the generalized relevancy based attack: the SR for the multifactor based attack is 82% and the SR for the generalized relevancy based attack is 86%.

7.2.3. Summary

From these results, we can see that our multifactor based attack against the state-of-the-art TI-SV achieves a high SR and is more imperceptible, whether the TI-SV model is based on machine learning classifiers or threshold classifiers. Our spoofing attack's SR is lower than that of previous works because our target is TI-SV, from which it is more difficult to extract the voiceprint than from the TD-SV targeted by previous works. Our multifactor based attack introduces slighter distortion to the original utterance and is therefore more imperceptible than the generalized relevancy based attack and previous works.

8. Attack Characteristics

The attack has two distinct characteristics: imperceptibility and target-independent ability. We analyze them further in the following subsections.

8.1. Imperceptibility

The imperceptibility changes continually during the generation process. It is important to analyze this process to show the advantage of the multifactor based attack in imperceptibility. We ran an experiment to analyze the imperceptibility of the perturbation during generation. We took all 20 sentences in the illegal set and a random target in the legal set and observed the generation process. Figure 6 shows the average signal-to-noise ratio (dB) for each epoch. The result shows that the SNR curves of the multifactor based attack and the generalized relevancy based attack coincide during the first phase of the process. However, in the second phase the multifactor based attack begins to increase the SNR, while the generalized relevancy based attack decreases the SNR persistently. Note that the multifactor based attack not only stops the decrease of the SNR but also searches for the highest SNR that still yields a successful attack. We can observe that the final SNR of the multifactor based attack is larger than the SNR at the end of the first phase and much larger than the final SNR of the generalized relevancy based attack.

8.2. Target-Independent Ability

We consider that the voiceprint depends only on biometric differences rather than on the content. Thus, physiological structure determines the voiceprint. Previous work [28] has proven that gender is an important factor of the voice. Given that view, we analyze the target-independent ability by studying the influence of gender. We randomly selected ten sentences from different males and ten sentences from different females and partitioned them into two groups; each group includes five sentences from different males and five sentences from different females. We enrolled the users of one group into the speaker verification system. After that, we used sentences from females to attack the enrolled males and did the same with sentences from males. Note that we use cosine similarity for verification, since Section 7.1 has shown that it has the same efficacy as LDA. Table 2 shows the SR, average signal-to-noise ratio (ASNR), and average maximum noise ratio (AMNR) of this experiment. The results show that gender has little impact on the success rate and the perturbation of the attack, so our multifactor based attack has target-independent ability.

9. Discussion

In this section, we will discuss some advanced attack methods and defense methods for speaker verification systems.

9.1. Universal Perturbation

Current spoofing attacks need the generator to generate a perturbation for each utterance, even when targeting the same victim. We hope our generator can produce a universal perturbation for a specific target, so that a single perturbation realizes the spoofing attack against that victim. Previous work suggests that the voiceprint exists in fine-grained waveform details [29], so it may be possible to generate a stable perturbation that encodes the whole voiceprint of one victim.

9.2. Over-the-Air Injection

Current spoofing attacks on speaker verification systems can only work at the data layer or provide utterances for replay attacks [25]. These restrictions limit the attack's applicability in practice. An advanced attack could inject a slight perturbation while a legal speaker is being verified; the adversary could then lead the speaker verification system into an incorrect permission space, making all the legal user's operations in this space unsafe. Some attacks in computer vision can spoof classifiers by changing only a single pixel [30]. If we could realize the spoofing attack by changing only one or several points of the utterance, we could repeatedly play this short perturbation over the air to attack the system while a legal speaker is verifying. Furthermore, we could use ultrasound instead of audible sound to complete the injection with the technology proposed in previous work [25]. This would further enhance the imperceptibility of the spoofing attack and make it more practical.

9.3. Mitigation of Multifactor Based Attack

Since our attack is effective yet hard to detect by humans or current speaker verification systems, we discuss several possible defense methods. One detection method is to train a detector on normal utterances and adversarial utterances. In the computer vision domain, such a detector has been employed to detect adversarial samples [31]; although it has a high false positive rate and is not robust when the adversary is aware of the defense, a similar detector could help distinguish normal utterances from adversarial ones.

Another method is to apply a transformation to the input (e.g., bit-depth reduction or JPEG compression [32] for images). We can mitigate the attack by applying an input transformation such as bit-depth reduction; because this process reduces the information in an utterance, including the injected perturbation, our attack may no longer succeed.
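As a sketch of this defense, assuming waveforms normalized to [−1, 1] and an 8-bit target depth (both assumptions), bit-depth reduction can be implemented as simple re-quantization:

```python
import numpy as np

def reduce_bit_depth(x, bits=8):
    """Re-quantize a [-1, 1] waveform to the given bit depth, discarding the
    fine-grained detail that a slight adversarial perturbation lives in."""
    levels = 2 ** (bits - 1)
    return np.round(np.clip(x, -1.0, 1.0) * levels) / levels
```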

10. Conclusion

In this paper, we explore the vulnerability of speaker verification systems, which affects the security of users' finances, privacy, and even safety. We construct imperceptible audio adversarial examples to attack the state-of-the-art deep-learning-based TI-SV through our generalized relevancy based attack and multifactor based attack. We evaluate both attacks against two verification schemes, a machine-learning-based classifier and a threshold-based method. The multifactor based attack achieves an 82% SR while the distortion in the utterance reaches only −77 dB in MNR and 76 dB in SNR, which is much better than the values in previous works. Our work also outlines several advanced attacks with a theoretical foundation that could have a significant real-world impact. Since our effective attack reveals the vulnerability of speaker verification systems, we also propose several defense methods to mitigate these security problems.

Data Availability

The voice data used to support the findings of this study have been deposited in the TIMIT repository (https://catalog.ldc.upenn.edu/LDC93S1).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61972348, in part by the National Key Research and Development Program of China under Grant 2018YFB0803600, and in part by the Leading Innovative and Entrepreneur Team Introduction Program of Zhejiang under Grant 2018R01005.