Abstract

In speaker recognition systems, feature extraction is a challenging task under environmental noise conditions. To improve feature robustness, we propose a multiscale chaotic feature for speaker recognition. We use a multiresolution analysis technique to capture finer information about different speakers in the frequency domain. We then extract the chaotic characteristics of speech based on a nonlinear dynamic model, which helps to improve the discriminability of the features. Finally, we use a GMM-UBM model to build a speaker recognition system. Our experimental results verify its good performance: under clean speech and noisy speech conditions, the EER of our method is reduced by 13.94% and 26.5%, respectively, compared with the state-of-the-art method.

1. Introduction

Speaker recognition is a biometric recognition technique that identifies a speaker according to the personal information carried in the speech signal. Among existing biometric techniques, speaker recognition is one of the most convenient and accessible, thanks to the abundance of microphone-equipped mobile devices, which allow users to be authenticated across multiple environments and devices [1].

Research in speaker recognition has focused increasingly on enhancing robustness under adverse conditions induced by background noise. Many approaches have been proposed to address these challenges, one of the most successful being i-vector technology [2] used jointly with a probabilistic linear discriminant analysis (PLDA) back-end [3, 4]. In addition to new utterance-level features and back-ends, robust acoustic features have been developed to improve the performance of speaker recognition systems.

Cepstral features of speech (such as MFCC) are the most discriminative and were the first used [5] in speaker recognition. However, under the influence of channel distortion and background noise, the cepstral feature distribution of speech changes arbitrarily, which weakens its discriminative ability. Therefore, starting in the early 1990s, a series of feature compensation techniques were proposed to enhance the generalization ability of speech features in recognition [6–8]. Existing feature compensation methods mainly include filter compensation, noise model compensation, and empirical compensation.

The main purpose of filter compensation is to reduce noise or relieve its influence on features. The method rests on the fact that channel and environmental distortions are additive in the logarithmic spectrum and cepstrum domains. Furui argued that channel variation appears as an offset of individual coefficients in the cepstrum vector and therefore used cepstrum mean subtraction (CMS) to relieve the influence of the channel [9]. CMS can also reduce channel noise to a certain extent, but it impairs the information carried by the cepstral coefficients. Unlike CMS, the relative spectral (RASTA) feature was proposed to compensate for rapidly changing channel distortions; it uses moving average filtering to simulate the exponential decay of the mean subtraction [10]. However, this method was later shown to give only limited improvements under channel mismatch and additive background noise.

Noise model compensation uses prior knowledge of the noise spectrum to estimate the parameters of clean speech through a noise model, or to estimate the influence of noise on the speech. It mainly uses spectral equalization and spectral subtraction to relieve the influence of noise on features. J. Hansen et al. proposed a multidimensional equalization method to reduce the sensitivity of speech features to noise, thereby improving their discriminative ability [11]. S. S. Bharti utilized interframe features to estimate the continuous noise spectrum, which alleviates the noise spectrum variation caused by the single-frame estimation in the original spectral subtraction, and then applied spectral subtraction to enhance the speech features and improve the robustness of speaker features [12]. Noise model compensation relies mainly on a mathematical model for noise estimation; because of the uncertainty of noise variation, it is difficult to find a model with good performance.

Empirical compensation is a data-driven method, which is inherently stochastic. Studies have shown that it outperforms the previous two approaches [13]. The method directly uses an experience-based spectral comparison. In the training phase, to estimate the change between clean speech and noisy speech, the difference between the feature vectors of corresponding frames is calculated, and the probability distribution is modeled by adding a bias term for this difference. In the evaluation stage, minimum mean square error prediction is adopted, and the bias vector is used to convert the noisy test feature vector into an equivalent clean speech feature vector. Afify et al. proposed a stochastic mapping method [14] that uses the joint distribution of clean and noisy speech feature vectors to build a Gaussian mixture model and then uses this joint distribution to predict clean speech. This prediction method improves significantly over the earlier minimum mean square error approach.

In summary, feature compensation improves the discriminative ability of speech features by reducing the influence of noise on them. However, as long as noise is present, it remains difficult to avoid its impact on recognition performance.

In this paper, a novel multiscale chaotic feature is proposed for speaker recognition. The proposed feature is computed with a nonlinear dynamic model applied on top of a wavelet decomposition (multiresolution analysis (MRA)). In our method, the MRA technique is used to capture finer spectral information. Among speech features, harmonic structure is an important factor for distinguishing speakers: harmonics represent a speaker's tone, and tone information is usually distributed over different frequency components. Wavelet decomposition is a suitable method for capturing these frequency components. Moreover, we also take into account the chaotic characteristics of the speech signal. Over long time series, the speech signal behaves as a nonlinear system, and this nonlinear characteristic should be reflected in the speech features used for speaker recognition. Chaotic features based on nonlinear dynamic models are widely used in speech applications; such models have been applied in various areas of speech processing, including speech steganalysis [15], speech synthesis [16], speech recognition [17], and speech encryption [18]. The proposed feature represents the chaotic characteristics of the signal in different frequency bands.

2. Proposed Speaker Recognition System

To improve the performance of speaker recognition under adverse conditions induced by background noise, we propose a multiscale chaotic feature extraction method that enhances the robustness of the recognition system. The proposed speaker recognition system is illustrated in Figure 1.

In our proposed system, the time-domain speech signal is split into short-time frames of N samples, each frame being windowed with, e.g., the Hamming window. To relieve the influence of noise, we extract a multiscale chaotic feature (MCF) from each frame, comprising a multiresolution analysis stage and chaotic features. The multiresolution analysis is implemented by wavelet decomposition, and the chaotic features include both nonlinear dynamic model parameters and acoustic features; more details are given in Section 3. A Gaussian mixture model (GMM) is used to identify each speaker; here, we introduce a universal background model (UBM) to train the distribution of features that are not related to any particular speaker. The GMM-UBM model is widely used as a classifier for speaker recognition [19–21]; it is a generalization of the GMM model. The GMM-UBM model first pretrains a background model by collecting feature data from other speakers, which mitigates the decline in recognition performance caused by insufficient training data for the current speaker. The pretrained model is then adapted to the target speaker model by a maximum a posteriori (MAP) adaptation algorithm [22].
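As a rough illustration of this training pipeline, the sketch below trains a diagonal-covariance UBM and performs means-only MAP adaptation. It is not the authors' implementation: the use of scikit-learn and the relevance factor of 16 are our assumptions, chosen as common defaults in GMM-UBM systems (the 512 mixtures and 10 EM iterations follow the settings reported in Section 4.5).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=512):
    """Train the universal background model on pooled features
    from many non-target speakers."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type='diag', max_iter=10)
    ubm.fit(background_features)
    return ubm

def map_adapt_means(ubm, speaker_features, relevance=16.0):
    """Means-only MAP adaptation of the UBM toward one speaker.
    `relevance` is the usual relevance factor (a common default,
    not taken from the paper)."""
    # Responsibilities gamma_k(t) of each mixture for each frame
    gamma = ubm.predict_proba(speaker_features)          # (T, K)
    n_k = gamma.sum(axis=0)                              # soft counts per mixture
    # First-order statistics E_k[x] per mixture
    E = (gamma.T @ speaker_features) / np.maximum(n_k[:, None], 1e-10)
    alpha = n_k / (n_k + relevance)                      # adaptation weights
    # Interpolate between speaker statistics and UBM means
    return alpha[:, None] * E + (1.0 - alpha[:, None]) * ubm.means_
```

At test time, a trial is then typically scored by the average log-likelihood ratio between the adapted speaker model and the UBM.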

3. Multiscale Chaotic Feature

In speaker recognition systems, the speech feature is vital to recognition performance: a highly discriminative feature can separate speakers accurately. Acoustic features (such as MFCC and LPC) discriminate strongly between speakers on clean speech. However, classification performance declines sharply if the speech signal is disturbed by noise. To reduce the disturbance caused by environmental noise, we introduce wavelet decomposition and reconstruction to enhance the frequency-domain resolution of the speech features. Taking into account that the speech signal is a nonlinear system, we also extract chaotic features with a nonlinear dynamic model to improve the recognition rate. In our proposed speaker recognition system, feature extraction thus consists of two parts: multiresolution analysis and chaotic feature extraction.

3.1. Multiresolution Analysis

Multiresolution analysis (MRA) is a technique that forms a set of basis functions through dilations and translations of a wavelet. At large scales, the expanded basis functions capture the significant features, and at smaller scales, they capture finer details. In our system, following the method of [23], the speech signal is decomposed into a number of subband signals by MRA. The MRA is carried out by passing the speech signal through a series of high pass and low pass filter banks. The signal x[n] is passed simultaneously through a high pass filter with impulse response g[n] and a low pass filter with impulse response h[n]. The resulting outputs are the (downsampled) convolutions of x[n] with g[n] and h[n], respectively:

\[ y_{\mathrm{high}}[k] = \sum_{n} x[n]\, g[2k - n], \qquad y_{\mathrm{low}}[k] = \sum_{n} x[n]\, h[2k - n]. \]

As in [23], we selected Daubechies 4 (db4) as the basis function for MRA in this paper. We also tested other basis functions and found that the recognition rate achieved with multiscale chaotic features using the db4 basis function is higher than with the others; therefore, db4 is used for MRA. The filter coefficients g[k] and h[k], corresponding to the high pass and low pass filters, respectively, enter the following MRA equations [24]:

\[ d_{j}[k] = \sum_{n} a_{j-1}[n]\, g[2k - n], \qquad a_{j}[k] = \sum_{n} a_{j-1}[n]\, h[2k - n], \]

where j represents the decomposition scale, k denotes the coefficient index at each scale, and a_0[n] = x[n]. At the first level of decomposition (j = 1), the detail coefficients d_1[k] are the output of the high pass filter and the approximation coefficients a_1[k] are the output of the low pass filter; the detail and approximation coefficients capture the high frequency and low frequency information, respectively. The approximation band is further decomposed into detail and approximation bands at the next level, and repeating the decomposition yields multiple levels with better frequency resolution. In our system, a 3-level decomposition is used, as shown in Figure 1. To analyse the speech signal at different resolutions, subband signals are reconstructed from each set of approximation and detail coefficients by applying the inverse discrete wavelet transform (IDWT) [24]; when reconstructing one subband signal, the coefficients of the other subbands are set to 0. In this paper, we thus obtain 4 subband signals, D1, D2, D3, and A3, which are used to extract the chaotic features.
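The following minimal sketch shows how the 3-level db4 decomposition and single-subband reconstruction could be realized with the PyWavelets library; the paper itself works in Matlab, so this Python version is only illustrative.

```python
import numpy as np
import pywt

def mra_subbands(frame, wavelet='db4', level=3):
    """Decompose one speech frame into 4 subband signals (A3, D3, D2, D1)
    via a 3-level DWT, reconstructing each band alone with the other
    coefficients zeroed out."""
    coeffs = pywt.wavedec(frame, wavelet, level=level)  # [cA3, cD3, cD2, cD1]
    subbands = []
    for i in range(len(coeffs)):
        masked = [c if j == i else np.zeros_like(c)
                  for j, c in enumerate(coeffs)]
        # waverec may return one extra sample for odd lengths; trim it
        subbands.append(pywt.waverec(masked, wavelet)[:len(frame)])
    return subbands  # ordered [A3, D3, D2, D1]
```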

3.2. Chaotic Feature

Our feature extraction method extracts both acoustic features and nonlinear features. The acoustic features mainly capture the speech spectrum (e.g., MFCC and LPC). The nonlinear features represent the chaotic characteristics of speech via a nonlinear dynamic model.

3.2.1. Acoustic Feature

Among acoustic features, we extract Mel frequency cepstrum coefficients (MFCC) and linear prediction coefficients (LPC). MFCC is computed based on the perceptual characteristics of the human auditory system, in which the Mel frequency has a nonlinear relationship to frequency in Hz. The Mel spectral feature is obtained through this nonlinear relationship:

\[ \mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \]

where f denotes the frequency in Hz.
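For concreteness, the sketch below implements this Hz-to-Mel mapping in its standard form (the constants 2595 and 700 are the conventional ones; the paper does not show its exact constants):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Map linear frequency (Hz) to the perceptual Mel scale."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, used when placing the Mel filter bank."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```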

LPC represents the spectral envelope of the voice; its computation is based on the digital model of the speech signal. The vocal tract model is a key factor in distinguishing different speakers, so LPC is often used to represent the vocal tract envelope in various speech recognition tasks [25, 26]. According to the digital model of the speech signal, a speech frame can be treated as a unit impulse sequence exciting the vocal tract. The process is a linear time-invariant system and can be represented by a difference equation:

\[ s[n] = \sum_{i=1}^{p} a_i\, s[n-i] + e[n], \]

where s[n] is the real signal, the weighted sum \( \sum_{i=1}^{p} a_i s[n-i] \) is the prediction, and e[n] denotes the prediction error. The filter coefficients a_i are calculated according to the minimum mean square error (MSE) criterion on e[n].
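A compact way to obtain the p = 4 LPC coefficients used later in this paper is the autocorrelation method, sketched below; the use of scipy.linalg.solve_toeplitz (Levinson-Durbin under the hood) is our choice, not the paper's.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=4):
    """Estimate LPC coefficients a_1..a_p by the autocorrelation method:
    solve the Toeplitz normal equations that minimize the mean squared
    prediction error e[n]."""
    # Biased autocorrelation r[0..p]
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    # Yule-Walker equations: Toeplitz(r[0..p-1]) a = r[1..p]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return a  # prediction: s_hat[n] = sum_i a[i] * s[n-1-i]
```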

3.2.2. Nonlinear Feature

The nonlinear dynamic model is an effective analysis method for studying the chaotic characteristics of speech signals. Under this model, the nonlinear characteristics of the speech signal are obtained by treating the speaker's signal as a one-dimensional time series x(n). By the Takens embedding theorem, the phase space can be reconstructed by mapping the one-dimensional signal into a high-dimensional space using two parameters: an appropriate minimum delay time τ and an embedding dimension m. The reconstructed high-dimensional space is equivalent to the original system [27]. The reconstructed speech signal becomes

\[ X(i) = [\,x(i),\; x(i+\tau),\; \ldots,\; x(i+(m-1)\tau)\,]. \]

The key points of chaotic feature extraction are the analysis of the speech signal in this high-dimensional space and the extraction of nonlinear feature parameters under the dynamic model.
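The delay embedding itself is straightforward; a minimal sketch (assuming τ and m have already been chosen, e.g., by the C-C method described next) follows.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Reconstruct the phase space of a 1-D series x by Takens embedding:
    X(i) = [x(i), x(i+tau), ..., x(i+(m-1)*tau)]."""
    n_points = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n_points]
                            for i in range(m)])
```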

(1) Minimum delay time: the minimum delay time τ describes the correlation between neighboring components of the speech signal x(n). To reconstruct the phase space of the one-dimensional signal, we calculate the minimum delay time τ and the embedding dimension m by the C-C method [28].

(2) Maximum Lyapunov exponent: the Lyapunov exponent represents the average rate of local convergence or divergence of adjacent orbits in the phase space, and the maximum Lyapunov exponent λ1 denotes the speed of that convergence or divergence. When λ1 > 0, the larger its value, the faster the orbital divergence and the greater the degree of chaos. We use the small-data-size method to compute the Lyapunov exponent [28], as follows:

(1) Calculate the mean period P by applying the fast Fourier transform to the time series x(n).
(2) Calculate the minimum delay time τ and the embedding dimension m by the C-C method.
(3) Reconstruct the phase space of the series and denote it X(i). Then, find the nearest neighbor X(ĵ) of each point X(j) in the phase space while enforcing a short temporal separation:
\[ d_j(0) = \min_{\hat{j}} \| X(j) - X(\hat{j}) \|, \qquad |j - \hat{j}| > P. \]
(4) For each point in the phase space, calculate the distance of the neighboring pair after i discrete time steps:
\[ d_j(i) = \| X(j+i) - X(\hat{j}+i) \|. \]
(5) If the orbit through the nearest neighbor diverges at an exponential rate λ1, then
\[ d_j(i) = C_j e^{\lambda_1 (i\,\Delta t)}, \]
where Δt is the sampling period. Taking the logarithm of both sides, we get
\[ \ln d_j(i) = \ln C_j + \lambda_1 (i\,\Delta t). \]

Averaging the logarithm of the distances over all adjacent point pairs gives

\[ y(i) = \frac{1}{q\,\Delta t} \sum_{j} \ln d_j(i), \]

where q is the number of nonzero d_j(i). Finally, we fit y(i) by the least squares method; the slope of the fitted line is the maximum Lyapunov exponent λ1.
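The sketch below follows this small-data-size recipe (steps (3)–(5) plus the final fit); the function and parameter names, and the brute-force neighbor search via scipy, are our own illustrative choices. It reuses delay_embed from the previous sketch.

```python
import numpy as np
from scipy.spatial.distance import cdist

def largest_lyapunov(x, m, tau, mean_period, n_steps, dt):
    """Small-data-size estimate of the largest Lyapunov exponent:
    track the log divergence of nearest-neighbor pairs and fit its slope."""
    X = delay_embed(x, m, tau)                 # phase-space points
    n = len(X)
    dist = cdist(X, X)
    # Exclude temporally close points (|j - jhat| <= mean period)
    for j in range(n):
        lo, hi = max(0, j - mean_period), min(n, j + mean_period + 1)
        dist[j, lo:hi] = np.inf
    nn = dist.argmin(axis=1)                   # nearest neighbor of each point
    y = []
    for i in range(n_steps):
        keep = (np.arange(n) + i < n) & (nn + i < n)
        d = np.linalg.norm(X[np.arange(n)[keep] + i] - X[nn[keep] + i], axis=1)
        d = d[d > 0]                           # skip zero distances before log
        y.append(np.log(d).mean())
    # lambda_1 is the slope of the least-squares line through y(i)
    slope, _ = np.polyfit(np.arange(n_steps) * dt, y, 1)
    return slope
```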

(3) Correlation dimension and Kolmogorov entropy: the correlation dimension and Kolmogorov entropy are both nonlinear quantities under the nonlinear dynamic model. The correlation dimension describes the self-similar structure of the system, while the Kolmogorov entropy describes the degree of disorder in the distribution probability of the time series. We use the G-P algorithm [29] to calculate the correlation dimension and Kolmogorov entropy at the same time, as follows:

(1) First, we calculate the correlation integral and the ln C(r)–ln r curve. Reconstruct the m-dimensional phase space. Then, given a critical distance r, find the pairs of phase points whose distance is less than r and calculate their proportion among all point pairs, giving the correlation integral

\[ C(r) = \frac{2}{M(M-1)} \sum_{1 \le i < j \le M} H\big(r - \| X(i) - X(j) \|\big), \]

where m is the embedding dimension, M is the total number of phase points, and H(·) is the Heaviside function, which satisfies H(x) = 1 for x > 0 and H(x) = 0 otherwise.
(2) The correlation dimension follows from the G-P algorithm as

\[ D = \lim_{r \to 0} \frac{\ln C(r)}{\ln r}. \]

Drawing the ln C(r)–ln r curve and taking the slope of its approximately straight part gives the correlation dimension D.
(3) The Kolmogorov entropy follows from the G-P algorithm as

\[ K_2 = \lim_{r \to 0,\, m \to \infty} \frac{1}{\tau} \ln \frac{C_m(r)}{C_{m+1}(r)}. \]

(4) Hurst exponent: the Hurst exponent H measures the long-term memory of a time series and can reveal the tendency of the series to keep evolving in one direction. Its value lies in the range 0–1. If H > 0.5, the time series has long-term autocorrelation, with larger values indicating stronger correlation between past and future values; H < 0.5 indicates anti-persistence, and H = 0.5 indicates no long-term autocorrelation. The Hurst exponent of a speaker's speech reflects the degree of correlation in the evolution of that speech, so this paper selects it as one of the nonlinear features. Hurst proposed the exponent together with the rescaled range (R/S) analysis method [30] for calculating its value; this nonparametric statistical method is not affected by the distribution of the time series.
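As an illustration of the G-P procedure for the correlation dimension, the sketch below computes C(r) over a set of radii and fits the slope of ln C(r) versus ln r; choosing the radii to cover the scaling region is left to the caller and is an assumption of this sketch. It reuses delay_embed from above.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_integral(X, r):
    """G-P correlation integral C(r): fraction of phase-point pairs
    closer than the critical distance r."""
    d = pdist(X)                  # all pairwise distances
    return np.mean(d < r)

def correlation_dimension(x, m, tau, radii):
    """Estimate D as the slope of ln C(r) versus ln r over the
    given radii (assumed to lie in the scaling region)."""
    X = delay_embed(x, m, tau)
    radii = np.asarray(radii, dtype=float)
    c = np.array([correlation_integral(X, r) for r in radii])
    keep = c > 0                  # avoid log(0) outside the scaling region
    slope, _ = np.polyfit(np.log(radii[keep]), np.log(c[keep]), 1)
    return slope
```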

4. Evaluation and Analysis

4.1. Experiment Setting

In order to evaluate the performance of the proposed speaker recognition system, we carried out 2 experiments, under clean speech and noisy speech, respectively. For feature selection, we set up 3 combinations: (1) acoustic features, (2) nonlinear features, and (3) chaotic features (acoustic + nonlinear features). We selected the i-vector speaker recognition model of [31] as the baseline system; in the i-vector model, MFCC features are extracted and a supervector is then obtained as the speaker feature. It is currently the state-of-the-art speaker feature. The details of the experiment settings are listed in Table 1. In our experiments, the hardware environment is an Intel i7 CPU with 8 GB of memory, and the software environment is 64-bit Windows 10 with Matlab 2016a and a voice speech tool package as the development tools.

4.2. Corpus

The TIMIT corpus is used to evaluate our system. The corpus comprises 630 speakers from different regions, each with 10 sentences. Each sentence is 3–5 s long, sampled at 16 kHz with 16 bits per sample. The 630 speakers are divided into groups of 462 and 168 (roughly a 3 : 1 ratio), used to train the background model and to test the recognition system, respectively. For each of the 168 test speakers, 9 sentences were randomly selected as training data and 1 sentence as test data. The noise data come from the NOISEX-92 noise library; white noise and babble noise are selected as the experimental conditions. Since the two selected noises occur in everyday scenes and would also be present in real-life application scenarios, this noise experiment is representative and feasible.

4.3. Preprocessing

The speech signal is a nonstationary, time-varying signal, so preprocessing must be performed before speech analysis and feature extraction. Preprocessing usually includes endpoint detection, pre-emphasis, windowing, and framing. In this paper, endpoint detection adopts a double-threshold method based on the zero-crossing rate and energy. Pre-emphasis is carried out by a first-order FIR high pass filter with a pre-emphasis coefficient of 0.97. The frame length is set to 20 ms with 50% overlap.
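A minimal sketch of the pre-emphasis and framing steps is given below (endpoint detection is omitted here; the double-threshold detector would run before framing):

```python
import numpy as np

def preprocess(signal, fs=16000, alpha=0.97, frame_ms=20, overlap=0.5):
    """Pre-emphasis with a first-order FIR high pass filter,
    then framing with a Hamming window (20 ms frames, 50% overlap)."""
    # y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(fs * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    window = np.hamming(frame_len)
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    return np.stack([emphasized[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])
```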

4.4. Feature Extraction

In the feature extraction phase, we extracted the acoustic features and nonlinear features for each subband (4 subbands per frame). The acoustic features consist of 12-order MFCC coefficients and 4-order LPC coefficients; statistical features of each coefficient are then calculated for classification, namely skewness, kurtosis, mean, variance, and median. The nonlinear features comprise the minimum delay time, correlation dimension, Kolmogorov entropy, maximum Lyapunov exponent, and Hurst exponent, and we likewise calculate their statistics: maximum, minimum, mean, median, and variance. The resulting speaker feature is listed in Table 2.

In order to eliminate the internal dependence of the speech features caused by their differing scales, a mean normalization is carried out on the features as follows:

\[ \hat{x} = \frac{x - \mu}{\sigma}, \]

where μ and σ denote the mean and standard deviation, respectively.

4.5. Results and Analysis

In order to evaluate the discriminability of the proposed feature, we carried out 2 groups of experiments according to Table 1. Considering the validity of the equal error rate (EER) for speaker recognition evaluation, we selected EER as the metric for the performance of the chaotic features: the lower the EER, the better the performance. In our experiments, the number of GMM-UBM mixtures is set to 512, the EM training algorithm runs for 10 iterations, and the dimension of the T matrix of the i-vector model is set to 100. All parameters are obtained by iterative optimization.

4.5.1. Results and Analysis under Clean Speech Condition

The results under the clean speech condition are listed in Table 3. As Table 3 shows, using the acoustic features alone gives an EER of 2.562%, similar to the performance of the i-vector model. This shows that the LPC feature helps to identify different speakers, since LPC can represent the vocal tract envelope. The EER of the nonlinear features is 2.833%, the worst among the compared features. However, the chaotic feature performs best: its EER is reduced by 14% compared with the i-vector, which shows that the chaotic characteristics of speech provide good discrimination.

4.5.2. Results and Analysis under Noise Speech Condition

The purpose of this group of experiments is to evaluate the robustness of the chaotic features on noisy speech. We selected stationary noise (white noise) and nonstationary noise (babble noise) as the disturbance signals and set different disturbance levels, with the SNR set to 0 dB, 5 dB, 10 dB, 15 dB, 20 dB, 25 dB, and 30 dB.

Table 4 and Figure 2 show the EER of speaker recognition under the white noise condition. From the results, the acoustic features, nonlinear features, and i-vector model have similar average EER values. However, the chaotic feature achieves better recognition performance than the other features: its EER is reduced by 27.53% compared with the i-vector model. This good performance is attributed to the chaotic characteristics, and it also shows that the nonlinear speech features can relieve the disturbance of noise and improve the robustness of speaker recognition.

Table 5 and Figure 3 give the EER results under the nonstationary babble noise condition. As with the white noise disturbance, the chaotic feature again shows good robustness.

Table 6 shows the average EER values for all experiments. Compared with the i-vector model, the EER is reduced by 13.94% and 26.5% under the clean speech and noisy speech conditions, respectively. Therefore, we conclude the following:

(1) The acoustic features perform well on clean speech: MFCC and LPC discriminate well between different speakers. However, recognition performance declines when the speech signal is disturbed by environmental noise.
(2) The chaotic characteristics of speech, obtained from the nonlinear dynamic model, compensate well for the acoustic features under noisy conditions; that is, the nonlinear features can better distinguish different speakers in a noisy environment. This benefit also stems from the multiresolution analysis technique, which better captures the frequency information of speakers.
(3) In the proposed method, we extracted only 5 nonlinear parameters, and the feature combination is not optimized. We expect that feature optimization may further improve the robustness of recognition.

5. Conclusion

In this paper, we proposed a novel multiscale chaotic feature for speaker recognition. The MRA technique is used to capture the frequency information of a speaker under environmental noise conditions, and nonlinear features based on the chaotic characteristics of speech are extracted to improve the robustness of recognition. The experimental results show that the method is effective. We therefore believe that the chaotic characteristics of speech constitute a robust feature for various speech applications, such as speech recognition and speech emotion recognition. The proposed feature is not yet optimized; feature optimization will be our future work.

Data Availability

The data used to support the findings of the study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grants 61762005, 61702472, and 61671335, Hunan Provincial Natural Science Foundation of China under grant 2019JJ40144, Scientific Research Project of the Hunan Province Education Department of China under grants 18A304 and 18B338, and the Key Laboratory of Hunan Province for New Retail Virtual Reality Technology under grant 2017TP1026.