Journal of Electrical and Computer Engineering

Volume 2017 (2017), Article ID 1735698, 9 pages

https://doi.org/10.1155/2017/1735698

## Speaker Recognition Using Wavelet Packet Entropy, I-Vector, and Cosine Distance Scoring

Laboratory of Cyberspace, School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China

Correspondence should be addressed to She Kun; kun@uestc.edu.cn

Received 16 February 2017; Revised 17 April 2017; Accepted 26 April 2017; Published 14 May 2017

Academic Editor: Lei Zhang

Copyright © 2017 Lei Lei and She Kun. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Today, more and more people benefit from speaker recognition. However, the accuracy of speaker recognition often drops rapidly because of low-quality speech and noise. This paper proposes a new speaker recognition model based on wavelet packet entropy (WPE), i-vector, and cosine distance scoring (CDS). In the proposed model, WPE transforms the speech into short-term spectrum feature vectors (short vectors) and resists noise. The i-vector is generated from those short vectors and characterizes the speech to improve recognition accuracy. CDS quickly compares two i-vectors to produce the recognition result. The proposed model is evaluated on the TIMIT speech database. The experimental results show that the proposed model performs well in both clean and noisy environments and is insensitive to low-quality speech, but its time cost is high. To reduce the time cost, parallel computation is used.

#### 1. Introduction

Speaker recognition refers to recognizing unknown persons from their voices. With the use of speech as a biometric in access systems, more and more ordinary people benefit from this technology [1]. An example is the automatic speech-based access system. Compared with the conventional password-based system, this system is more suitable for elderly people with poor eyesight and clumsy fingers.

With the development of phone-based services, the speech used for recognition is usually recorded by phone. However, the quality of phone speech is low for recognition because its sampling rate is only 8 kHz. Moreover, ambient noise and channel noise cannot be completely removed. Therefore, it is necessary to find a speaker recognition model that is not sensitive to factors such as noise and low-quality speech.

In a speaker recognition model, the speech is first transformed into one or more feature vectors that represent information unique to a particular speaker, irrespective of the speech content [2]. The most widely used feature vector is the short vector, because it is easy to compute and yields good performance [3]. Usually, the short vector is extracted by the Mel frequency cepstral coefficient (MFCC) method [4]. This method represents the speech spectrum in compact form, but the extracted short vector represents only the static information of the speech. To represent the dynamic information, the Fused MFCC (FMFCC) method [5] was proposed. This method calculates not only the cepstral coefficients but also the delta derivatives, so the short vector extracted by this method represents both the static and the dynamic information.

Both methods use the discrete Fourier transform (DFT) to obtain the frequency spectrum. DFT decomposes the signal into a global frequency domain. If part of the frequency range is corrupted by noise, the whole spectrum is strongly affected [6]. In other words, the DFT-based extraction methods, such as MFCC and FMFCC, are sensitive to noise. Wavelet packet transform (WPT) [7] is another tool used to obtain the frequency spectrum. Compared with DFT, WPT decomposes the speech into many small frequency bands that are independent of each other. Because the bands are independent, the ill effect of noise cannot spread over the whole spectrum. In other words, WPT has antinoise ability. Based on WPT, the wavelet packet entropy (WPE) [8] method was proposed to extract the short vector. References [8–11] have shown that the short vector extracted by WPE is insensitive to noise.
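The band independence described above can be illustrated with a small sketch. It is only an idealized example: the Haar wavelet, the decomposition depth of 3, and the toy signals are assumptions chosen for brevity, and real noise is rarely confined to a single sub-band. The function names are hypothetical.

```python
import math

def haar_split(x):
    """One orthonormal Haar step: split x (even length) into low and high halves."""
    s = 1.0 / math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return low, high

def wavelet_packet_bands(x, depth):
    """Split every band at every level, giving 2**depth leaf sub-bands.

    Bands are returned in tree (natural) order, not sorted by frequency.
    """
    if len(x) % (2 ** depth) != 0:
        raise ValueError("len(x) must be divisible by 2**depth")
    bands = [list(x)]
    for _ in range(depth):
        split = []
        for band in bands:
            split.extend(haar_split(band))  # appends the low band, then the high band
        bands = split
    return bands

def band_energies(bands):
    """Energy (sum of squares) of each sub-band."""
    return [sum(v * v for v in band) for band in bands]

# Toy signals (assumed): a constant "clean" signal and the same signal
# with an alternating disturbance added to it.
clean = [1.0] * 16
disturbed = [1.0 + 0.5 * (-1) ** t for t in range(16)]

e_clean = band_energies(wavelet_packet_bands(clean, 3))
e_disturbed = band_energies(wavelet_packet_bands(disturbed, 3))

# Indices of the sub-bands whose energy changed because of the disturbance.
changed = [k for k in range(8) if abs(e_clean[k] - e_disturbed[k]) > 1e-9]
print(changed)
```

In this toy case the disturbance changes the energy of a single sub-band and leaves the other seven untouched, which is the property the WPE method relies on.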

The i-vector is another type of feature vector. It is a robust way to represent a speech using a single high-dimensional vector, and it is generated from the short vectors. The i-vector considers both speaker-dependent and background information, so it usually leads to good accuracy. References [12–14] have used it to enhance the performance of speaker recognition models. In particular, [15] uses the i-vector to improve the discrimination of low-quality speech. Usually, the i-vector is generated from short vectors extracted by the MFCC or FMFCC methods, but we employ WPE to extract those short vectors, because WPE can resist the ill effect of noise.

Once the speech is transformed into feature vectors, a classifier is used to recognize the identity of the speaker based on those feature vectors. The Gaussian mixture model (GMM) is a conventional classifier. Because it is fast and simple, GMM has been widely used for speaker recognition [4, 16]. However, if the dimension of the feature vector is high, the curse of dimensionality degrades this classifier. Unfortunately, the i-vector is a high-dimensional vector compared with the short vector. Cosine distance scoring (CDS) is another type of classifier used for speaker recognition [17]. This classifier uses a kernel function to deal with the problem of high-dimensional vectors, so it is suitable for the i-vector. In this paper, we employ CDS for speaker classification.
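As a minimal sketch of the idea behind CDS, the cosine of the angle between two vectors can serve as a similarity score that is then compared with a threshold. The vectors, the threshold value, and the function names below are hypothetical, and the sketch leaves out any projection or normalization applied to the i-vectors beforehand.

```python
import math

def cosine_score(w_target, w_test):
    """Cosine of the angle between two vectors of equal length."""
    if len(w_target) != len(w_test):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm = math.sqrt(sum(a * a for a in w_target)) * math.sqrt(sum(b * b for b in w_test))
    if norm == 0.0:
        raise ValueError("cosine score is undefined for a zero vector")
    return dot / norm

def same_speaker(w_target, w_test, threshold):
    """Accept the test vector if its score reaches the (tuned) threshold."""
    return cosine_score(w_target, w_test) >= threshold

# Hypothetical low-dimensional stand-ins for i-vectors.
enrolled = [0.9, 0.1, -0.3]
trial = [1.8, 0.2, -0.6]      # same direction, different length
impostor = [-0.2, 0.7, 0.4]

print(same_speaker(enrolled, trial, threshold=0.5))
print(same_speaker(enrolled, impostor, threshold=0.5))
```

Because the score depends only on the angle between the vectors, it costs one dot product and two norms regardless of how the vectors are scaled, which keeps the comparison cheap even when the dimension is high.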

The main contribution of this paper is a new speaker recognition model that uses wavelet packet entropy (WPE), i-vector, and cosine distance scoring (CDS). WPE is used to extract the short vectors from the speech, because it is robust against noise. The i-vector is generated from those short vectors. It characterizes the speech used for recognition and improves the discrimination of low-quality speech. CDS is well suited to high-dimensional vectors such as the i-vector, because it uses a kernel function to deal with the curse of dimensionality. To improve the discrimination of the i-vector, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are added to the CDS. The proposed model is evaluated on the TIMIT database. The experimental results show that the proposed model can deal with the low-quality speech problem and resist the ill effect of noise. However, the time cost of the new model is high, because extracting WPE is time-consuming. This paper calculates the WPE in parallel to reduce the time cost.

The rest of this paper is organized as follows. In Section 2, we describe the conventional speaker recognition model. In Section 3, the speaker recognition model based on i-vector is described. We propose a new speaker recognition model in Section 4, and the performance of the proposed model is reported in Section 5. Finally, we give the conclusion in Section 6.

#### 2. The Conventional Speaker Recognition Model

The conventional speaker recognition model can be divided into two parts: short vector extraction and speaker classification. Short vector extraction transforms the speech into short vectors, and speaker classification uses a classifier to produce the recognition result based on those short vectors.

##### 2.1. Short Vector Extraction

The Mel frequency cepstral coefficient (MFCC) method is the conventional short vector extraction algorithm. This method first divides the speech into 20–30 ms speech frames. For each frame, the cepstral coefficients are calculated as follows [18]:

1. Take the DFT of the frame to obtain the frequency spectrum.
2. Map the power of the spectrum onto the Mel scale using the Mel filter bank.
3. Calculate the logarithm of the power spectrum mapped onto the Mel scale.
4. Take the DCT of the logarithmic power spectrum to obtain the cepstral coefficients.

Usually, the lowest 13-14 coefficients are used to form the short vector. The Fused MFCC (FMFCC) method is an extension of MFCC. Compared with MFCC, it additionally calculates the delta derivatives to represent the dynamic information of speech. The derivatives are defined as follows [5]:

$$d_i = c_{i+\theta} - c_{i-\theta}, \qquad dd_i = d_{i+\theta} - d_{i-\theta},$$

where $c_i$ is the $i$th cepstral coefficient obtained by the MFCC method and $\theta$ is the offset. $d_i$ is the $i$th delta coefficient and $dd_i$ is the $i$th delta-delta coefficient. If the short vector extracted by MFCC is denoted as $[c_1, c_2, \ldots, c_N]$, then the short vector extracted by FMFCC is denoted as $[c_1, \ldots, c_N, d_1, \ldots, d_N, dd_1, \ldots, dd_N]$.
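The four MFCC steps can be sketched for a single frame as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the paper: the Mel-scale formula, the triangular filter shape, the 20 filters, the small floor added before the logarithm, and all function names are assumptions, and windowing and pre-emphasis are left out. A plain DFT loop is used so that the block needs only the standard library.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Step 1: DFT of the frame, keeping the power of the non-negative frequencies."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    # Filter edges expressed as (fractional) DFT bin positions.
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) * n_fft / sample_rate
             for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left < k <= center:
                row.append((k - left) / (center - left))
            elif center < k < right:
                row.append((right - k) / (right - center))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    power = power_spectrum(frame)                                    # step 1
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    mel_energy = [sum(w * p for w, p in zip(row, power))             # step 2
                  for row in bank]
    log_energy = [math.log(e + 1e-10) for e in mel_energy]           # step 3
    return [sum(log_energy[m] * math.cos(math.pi * i * (m + 0.5) / n_filters)
                for m in range(n_filters))                           # step 4 (DCT-II)
            for i in range(n_coeffs)]

# A 32 ms frame of a 1 kHz tone sampled at 8 kHz (synthetic test signal).
sample_rate = 8000
frame = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(256)]
coeffs = mfcc(frame, sample_rate)
print(len(coeffs))
```

In practice an FFT routine would replace the quadratic-time DFT loop; the loop is kept here only to make each step explicit.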

##### 2.2. Speaker Classification

The Gaussian mixture model (GMM) is a conventional classifier. It is defined as

$$\mathrm{GMM}(\mathbf{x}) = \sum_{i=1}^{M} w_i \, G(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i),$$

where $\mathbf{x}$ is a short vector extracted from an unknown speech. $G(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ is the $i$th Gaussian function in the GMM, where $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ are its mean vector and variance matrix, respectively. $w_i$ is the combination weight of the $i$th Gaussian function and satisfies $\sum_{i=1}^{M} w_i = 1$. $M$ is the mixture number of the GMM. All of the parameters, such as the weights, mean vectors, and variance matrices, are estimated by the EM algorithm [19] using the speech samples of a known speaker. In other words, $\mathrm{GMM}(\mathbf{x})$ represents the characteristics of the known speaker's voice, so we use $\mathrm{GMM}(\mathbf{x})$ to recognize the speaker of an unknown speech.

Assume that an unknown speech is denoted by $\mathbf{Y} = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n\}$, where $\mathbf{y}_j$ represents the $j$th short vector extracted from $\mathbf{Y}$. Also, assume that the parameters of $\mathrm{GMM}(\mathbf{x})$ are estimated using the speech samples of a known speaker $s$. The result of recognition is defined as

$$r = \frac{1}{n} \sum_{j=1}^{n} \log \mathrm{GMM}(\mathbf{y}_j) - \gamma,$$

where $\gamma$ is the decision threshold and should be adjusted beforehand to obtain the best recognition performance. If $r < 0$, then the GMM decides that the unknown speech was not spoken by the known speaker $s$; if $r \geq 0$, then the GMM decides that the unknown speech was spoken by the speaker $s$.
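The decision rule described above (score the short vectors of an unknown speech under the known speaker's GMM and compare with a threshold) can be sketched as follows. Several details are assumptions made for illustration: the covariance matrices are taken to be diagonal, the score is the average log-likelihood over the short vectors, the model parameters are set by hand instead of being estimated with EM, and the threshold value and function names are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (var holds the diagonal)."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """log of sum_i w_i * N(x; mu_i, Sigma_i), using log-sum-exp for stability."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def average_score(short_vectors, weights, means, variances):
    """Average log-likelihood of the short vectors under the speaker's GMM."""
    total = sum(gmm_log_likelihood(y, weights, means, variances)
                for y in short_vectors)
    return total / len(short_vectors)

def is_known_speaker(short_vectors, gmm, threshold):
    """Accept the claimed identity if the average score reaches the threshold."""
    return average_score(short_vectors, *gmm) >= threshold

# Hand-set two-component model in two dimensions (stand-in for an EM-trained GMM).
gmm = ([0.5, 0.5],                    # weights, summing to 1
       [[0.0, 0.0], [3.0, 3.0]],      # mean vectors
       [[1.0, 1.0], [1.0, 1.0]])      # diagonal variances

near = [[0.1, -0.2], [2.9, 3.1]]      # short vectors close to the model
far = [[10.0, 10.0], [-8.0, 9.0]]     # short vectors far from the model

print(is_known_speaker(near, gmm, threshold=-5.0))
print(is_known_speaker(far, gmm, threshold=-5.0))
```

Working in the log domain avoids numerical underflow when the short vectors have many dimensions or lie far from every component.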

#### 3. The Speaker Recognition Model Using I-Vector

The speaker recognition model using i-vector can be decomposed into three parts: short vector extraction, i-vector extraction, and speaker classification. Figure 1 shows the structure of the model.