Abstract

Owing to the shortcomings of linear feature parameters for speech signals, and the limitations of existing time- and frequency-domain attribute features in characterizing the integrity of speech information, in this paper we propose a nonlinear feature extraction method based on phase space reconstruction (PSR) theory. First, the speech signal is analyzed with a nonlinear dynamic model, and the model is used to reconstruct the one-dimensional speech time series in a higher-dimensional phase space. Nonlinear dynamic (NLD) features are then extracted from the reconstructed phase space as new characteristic parameters. The performance of the NLD features is verified by comparing their recognition rates with those of other features (prosodic features and MFCC features) on the Korean isolated words database, the Berlin emotional speech database, and the CASIA emotional speech database, using a support vector machine (SVM) classifier. The results show that NLD features not only achieve a high recognition rate and excellent antinoise performance in speech recognition tasks but can also fully characterize the different emotions contained in speech signals.

1. Introduction

Language is the most effective medium of human communication. It not only contains interpretable text but also carries a large amount of paralinguistic information that reflects the emotional changes of the speaker. Interpretation of human spoken language through technologies such as speech recognition and affective computing has found a wide range of applications in diverse domains such as vehicle navigation, video surveillance, network video, and other human-computer interaction fields. Speech recognition refers to the ability of machines to convert spoken language into written text. To do this, a speech recognition system often needs to take both specific and nonspecific environments into consideration to recognize the content of speech accurately. Therefore, feature extraction and speech signal characterization are two important steps for accurate speech recognition. Currently, the most important feature extraction techniques used in speech recognition can be divided into (a) prosodic features [1], (b) phonetic features [2], (c) features based on the correlation characteristics of the spectrum [3, 4], and (d) feature fusion [5]. The above features rely on the piecewise linearity of speech signals. However, studies have shown that speech signal generation is neither a linear process nor a stochastic process, but rather a nonlinear process [6]. Thus, using only the piecewise linearity of speech signals in the time and frequency domains to extract speech features loses some of the nonlinear characteristics of speech signals, making the extracted information incomplete.

With recent developments in nonlinear analysis methods, they have been successfully applied in various fields [7-12]. Zbancioc [7] applied the Lyapunov exponent alongside the spectral coefficients of MFCC and LPCC features and achieved an emotion recognition accuracy of 75%. Firooz et al. [8] evaluated nonlinear dynamic features obtained by reconstructing speech signals through phase space reconstruction to improve the accuracy of automatic speech recognition. The Spanish researcher Karmele Lopez applied the chaotic characteristics of natural speech to the detection of Alzheimer's disease, detecting a speaker's lesions by extracting fractal dimension features from natural speech [9, 10]. Xiang and Tan of Beijing Jiaotong University combined chaotic features of speech with other common features to detect fatigue among automobile drivers [11]. Although some researchers have studied the chaotic characteristics of speech signals, very few studies have focused on the nonlinear dynamics and geometric features of chaotic characteristics in speech signals.

Aerodynamic studies have shown that, when people make sounds, vortices are generated in the boundary layer of the vocal tract, and these vortices can eventually form turbulence [12]. The nature of this turbulence is chaotic. To verify the chaotic characteristics of speech signals, this paper explores this chaotic mechanism of speech signal generation from three analytical aspects: (a) the power spectrum, (b) principal component analysis, and (c) phase space reconstruction. This analysis provides a theoretical basis for extracting nonlinear dynamic features based on the chaotic characteristics of speech signals. By studying the two main parameters of the phase space reconstruction of speech signals, the minimum delay time and the embedding dimension, we realize an optimal phase space reconstruction, from which we extract the nonlinear dynamic features. By designing experiments that contrast the nonlinear dynamic features with MFCC features for speech recognition, we verify that the nonlinear dynamic features of speech signals not only provide high accuracy and excellent noise robustness for speech recognition but also help in identifying emotional cues in speech.

2. Chaos Theory and Verification of Chaotic Characteristics in Speech

2.1. Chaos Theory

Chaos is a seemingly irregular, random phenomenon that occurs in deterministic systems [13]. Although a chaotic system has no obvious cycle and its motion appears disorderly, its internal structure is ordered; chaos is a distinct form of existence of nonlinear systems.

Nonlinear dynamics is mainly used to describe a system or time series: the internal state of motion and the transformation law of a nonlinear system or time series are analyzed qualitatively and quantitatively [6]. At present, nonlinear dynamic analysis of time series has matured and rests on a relatively complete theoretical background, covering different nonlinear modeling techniques and nonlinear representations [14], such as fractal dimensions, the Lyapunov exponent, and the Kolmogorov entropy. These features can not only effectively distinguish signal sequences by their chaotic characteristics but also effectively describe the motion state and variation of the signal. Such features, absent from traditional analysis methods, give nonlinear modeling its advantage.

2.2. Verification of Chaotic Characteristics in Speech

Two basic features are used to describe chaotic characteristics: (a) the chaotic attractor of the high-dimensional reconstructed phase space has a fractal dimension, and (b) the system is highly sensitive to its initial conditions [13]. If a time series exhibits both characteristics, we can say that the time series itself is chaotic. Based on this theory, this paper verifies the chaotic characteristics of speech signals from three aspects: (a) power-spectrum analysis, (b) principal component analysis, and (c) phase space reconstruction [13].

2.2.1. Power-Spectrum Analysis Method

From the time-domain waveform alone, we cannot intuitively determine whether a time series is periodic or disordered. However, its power spectrum can be used to identify such regularities. Analysis of the power spectrum can help determine whether the time series demonstrates chaotic characteristics, based on two aspects: the number of peaks in the power spectrum and the broad-spectrum characteristic. If there is a finite number of peaks in the power spectrum, the time series is periodic. However, if there is no obvious peak in the spectrum and it demonstrates a "wide-spectrum" characteristic, we can say that the time series is turbulent or chaotic. Therefore, power-spectrum analysis serves as a theoretical basis for judging whether a signal has chaotic characteristics.
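As a minimal sketch of this test in Python, the following snippet flags a broad spectrum with no dominant peak; it assumes the signal is already loaded as a 1-D NumPy array x sampled at fs Hz, and the peak_ratio threshold is an illustrative choice, not a value from the paper:

```python
import numpy as np
from scipy.signal import periodogram

def is_broad_spectrum(x, fs, peak_ratio=10.0):
    """Heuristic check: a periodic series shows a few dominant spectral
    peaks, while a chaotic series shows a broad spectrum with no special
    peak. Returns True when no line dominates the median spectral level."""
    freqs, pxx = periodogram(x, fs=fs, window="hamming")
    return pxx.max() / np.median(pxx[pxx > 0]) < peak_ratio
```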

In this paper, we analyze the power spectrum of the speech signal of a single word from the Korean isolated words database [15]. The analysis is done for four cases: "15 dB," "20 dB," "25 dB," and "clean." From Figure 1, we can see that the speech signals at all four SNRs have a wide spectrum and no special peak. Therefore, it can be verified that the isolated-word speech signals are chaotic.

2.2.2. Principal Component Analysis

Principal component analysis (PCA) is an effective method to identify whether a time series has chaotic characteristics. The steps of the calculation are as follows.

Given a time series $\{x_i\},\ i = 1, 2, \dots, N$, an appropriate embedding dimension $m$ is chosen to construct the trajectory matrix $X$, which is represented as

$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \\ x_2 & x_3 & \cdots & x_{m+1} \\ \vdots & \vdots & & \vdots \\ x_{N-m+1} & x_{N-m+2} & \cdots & x_N \end{bmatrix}. \tag{1}$$

Then, the covariance matrix of the trajectory matrix is calculated as

$$C = \frac{1}{N-m+1} X^{T} X. \tag{2}$$

Then, the eigenvalues of the covariance matrix are solved to obtain $\lambda_1, \lambda_2, \dots, \lambda_m$. Next, we calculate the sum of all the eigenvalues and sort the eigenvalues in descending order. We compute $\lambda_i / \sum_{j=1}^{m} \lambda_j$ and plot the principal component spectrum using $\bigl(i,\ \ln(\lambda_i / \sum_{j} \lambda_j)\bigr)$ as the coordinates. If the principal component spectrum is a nearly straight line with a negative slope, the signal has chaotic characteristics.
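A compact sketch of this test, assuming the time series is a 1-D array x and m is the chosen embedding dimension (the function name is illustrative):

```python
import numpy as np

def principal_component_spectrum(x, m):
    # Trajectory matrix: each row is a window of m consecutive samples.
    X = np.lib.stride_tricks.sliding_window_view(x, m)
    # Covariance matrix of the trajectory and its eigenvalues, sorted
    # in descending order.
    C = (X.T @ X) / len(X)
    eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]
    # Normalized spectrum ln(lambda_i / sum(lambda)); plotted against i,
    # a near-straight line with negative slope suggests chaos.
    return np.log(eigvals / eigvals.sum())
```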

As shown in Figure 2, we carry out the principal component spectral analysis of four emotions, "happy," "sad," "neutral," and "angry," for the same semantic content, using speech signals taken from the CASIA Chinese emotional speech database [16]. As can be seen from Figure 2, the covariance matrix is used to calculate the three eigenvalues, and the resulting principal component spectrum is a nearly straight line with a negative slope for each emotion. Therefore, it can be shown that the emotional speech signals are chaotic.

2.2.3. Phase Space Reconstruction

Phase space reconstruction (PSR) is the first step in nonlinear dynamic analysis, commonly based on the embedding theorem proposed by Takens [17]. The essence of this method is to construct an $m$-dimensional space vector from a one-dimensional time series by selecting an appropriate delay time $\tau$ and embedding dimension $m$. The reconstructed high-dimensional space is equivalent to the original space. Given the time series of a one-dimensional emotional speech signal $x(n)$, $n = 1, 2, \dots, N$, we select the appropriate time delay $\tau$ and embedding dimension $m$. The sequence expression after phase space reconstruction can be written as

$$X_i = [x(i),\ x(i+\tau),\ \dots,\ x(i+(m-1)\tau)], \quad i = 1, 2, \dots, N-(m-1)\tau. \tag{3}$$

Each row vector $X_i$ represents the location information of a single attractor point required for the phase space reconstruction. The definition of nonlinear dynamical systems indicates that these vectors are stacked row by row to form a trajectory matrix. This information can be used to create the following PSR matrix:

$$X = \begin{bmatrix} x(1) & x(1+\tau) & \cdots & x(1+(m-1)\tau) \\ x(2) & x(2+\tau) & \cdots & x(2+(m-1)\tau) \\ \vdots & \vdots & & \vdots \\ x(K) & x(K+\tau) & \cdots & x(N) \end{bmatrix}, \quad K = N-(m-1)\tau. \tag{4}$$
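The following minimal sketch builds the PSR trajectory matrix of equations (3)-(4), assuming a 1-D signal x, a delay tau in samples, and an embedding dimension m (the function name is illustrative):

```python
import numpy as np

def phase_space_reconstruction(x, tau, m):
    """Return the trajectory matrix of shape (K, m), where
    K = N - (m - 1) * tau and row i is [x[i], x[i+tau], ..., x[i+(m-1)tau]]."""
    k = len(x) - (m - 1) * tau
    if k <= 0:
        raise ValueError("time series too short for this tau and m")
    return np.column_stack([x[i * tau : i * tau + k] for i in range(m)])
```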

The significance of a high-dimensional phase space is that the internal structure of the signal can be expanded. The signal can be projected onto a high-dimensional space, and the qualitative properties of the signal can be obtained by measuring and predicting the evolutionary trajectory in this space.

This paper reconstructs the phase space of different emotions with the same semantics from the Berlin emotional speech database [18]. We study the overall structure and motion trajectory of the speech signals both as one-dimensional time series and under a three-dimensional phase space reconstruction for four emotional states: "happy," "sad," "neutral," and "angry." From Figure 3, we can see that the differences between the four kinds of emotional speech are mainly reflected in features such as the number of peaks, the peak size, and the number of zero crossings in the time-domain waveform. However, there are also significant differences in the overall structure and motion trajectory once the four kinds of emotional speech are reconstructed in a three-dimensional phase space. Therefore, a nonlinear dynamic model can be used to analyze the chaotic characteristics of speech signals.

3. Nonlinear Dynamic Feature Extraction from Speech

Phase space reconstruction is one of the key techniques used to study time series with chaotic characteristics. Takens' embedding theorem [17] states that, as long as the time delay $\tau$ and the embedding dimension $m$ are appropriately selected, a one-dimensional emotional time series can be mapped from a low-dimensional space to a high-dimensional space to realize phase space reconstruction. Here, choosing $m \ge 2d + 1$, where $d$ is the dimension of the underlying attractor, ensures that the reconstructed phase space retains the information integrity of the original one-dimensional speech signal. The emotional speech signals are analyzed in the reconstructed phase space, and the following nonlinear dynamic (NLD) features are then extracted. The algorithm flow is shown in Figure 4.

3.1. Preprocessing

Since speech signals are nonstationary and time-varying but have short-time stationary characteristics, the following three steps are needed for the processing and analysis of speech signals: ① endpoint detection: identification of the start and end points of the speech signal based on short-time energy and zero-crossing rate; ② preemphasis: a first-order digital filter is used to preemphasize the high-frequency part of the speech signal; ③ windowing and framing: a Hamming window is used for frame processing, with a frame length of 256 samples and a frame shift of 128 samples.
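A sketch of steps ② and ③ follows (endpoint detection is omitted here; the preemphasis coefficient 0.97 is a common default and an assumption, not a value stated in the paper):

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    # First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len=256, frame_shift=128):
    # Split into overlapping frames (assumes len(x) >= frame_len) and
    # apply a Hamming window to each frame.
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len] * window
                     for i in range(n_frames)])
```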

3.2. C-C Algorithm

The purpose of phase space reconstruction is to unfold the dynamics of the one-dimensional speech signal into a high-dimensional space so as to completely reveal the implicit information in the time series. However, the two key reconstruction parameters, the delay time $\tau$ and the embedding dimension $m$, are strongly correlated. Therefore, this paper chooses the C-C method [18] to calculate the delay time $\tau$ and the delay time window $\tau_w$, from which the embedding dimension $m$ is then obtained. In view of current research on spatial coordinates, geometric information is limited to two- or three-dimensional spaces. This paper improves the C-C method and extends the speech time series to two- and three-dimensional phase spaces to extract five nonlinear geometric features (NLD-2) from the structural trajectory contours. The specific calculations are performed in the following steps (a code sketch follows the list):

(1) As shown in equation (5), the time series $\{x_i\},\ i = 1, \dots, N$, is divided into $t$ disjoint time subsequences, each of length $N/t$:

$$\{x_1, x_{1+t}, x_{1+2t}, \dots\},\ \{x_2, x_{2+t}, x_{2+2t}, \dots\},\ \dots,\ \{x_t, x_{2t}, x_{3t}, \dots\}. \tag{5}$$

(2) The correlation integral of the embedded time series is defined by the following function:

$$C(m, N, r, t) = \frac{2}{M(M-1)} \sum_{1 \le i < j \le M} \Theta\bigl(r - \|X_i - X_j\|\bigr), \tag{6}$$

where $M = N - (m-1)t$ is the number of embedded points, $r > 0$, and $\Theta$ is the Heaviside function: $\Theta(a) = 0$ when $a < 0$, and $\Theta(a) = 1$ when $a \ge 0$.

(3) The statistic $S(m, r, t)$ of the subsequences is defined using the correlation integral:

$$S(m, r, t) = \frac{1}{t} \sum_{s=1}^{t} \bigl[ C_s(m, r, t) - C_s^{m}(1, r, t) \bigr]. \tag{7}$$

When $N \to \infty$, if the time series is independently and identically distributed, then for fixed $m$ and $t$, $S(m, r, t)$ equals zero for all $r$. But the actual sequence is finite, and its elements may be correlated, so in practice $S(m, r, t)$ is generally not equal to zero. The locally optimal delay times can therefore be located at the zero crossings of $S(m, r, t)$ or at the times where its variation over the radii is minimal, since the zero crossings should be almost the same for all radii. Selecting the maximum and minimum values over the corresponding radii, the difference can be written as

$$\Delta S(m, t) = \max_{j} S(m, r_j, t) - \min_{j} S(m, r_j, t), \tag{8}$$

which measures the maximum deviation of $S(m, r, t)$ over the radius $r$.

(4) To calculate the time delay $\tau$ and the delay time window $\tau_w$, we first calculate the following three quantities:

$$\bar{S}(t) = \frac{1}{16} \sum_{m=2}^{5} \sum_{j=1}^{4} S(m, r_j, t), \qquad \Delta\bar{S}(t) = \frac{1}{4} \sum_{m=2}^{5} \Delta S(m, t), \qquad S_{\mathrm{cor}}(t) = \Delta\bar{S}(t) + \bigl|\bar{S}(t)\bigr|, \tag{9}$$

where $m = 2, \dots, 5$ and $r_j = j\sigma/2$, with $\sigma$ the mean-square deviation of the time series. The time delay $\tau$ is the value of $t$ at the first zero of $\bar{S}(t)$ or the first local minimum of $\Delta\bar{S}(t)$. The delay time window $\tau_w$ is the value of $t$ corresponding to the global minimum of $S_{\mathrm{cor}}(t)$.

(5) The embedding dimension is calculated as

$$m = \frac{\tau_w}{\tau} + 1. \tag{10}$$
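The sketch below follows the standard formulation of the C-C statistics; the m range 2-5 and the radii as multiples of the standard deviation are the usual choices, and the exact settings of the paper's improved variant are not specified, so treat this as an assumption-laden illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_integral(series, m, tau, r):
    # C(m, r): fraction of embedded point pairs closer than r (Chebyshev norm).
    k = len(series) - (m - 1) * tau
    if k < 2:
        return 0.0
    X = np.column_stack([series[i * tau : i * tau + k] for i in range(m)])
    return np.mean(pdist(X, metric="chebyshev") < r)

def cc_statistics(x, t, ms=(2, 3, 4, 5)):
    """Return (S_bar(t), dS_bar(t)) of equation (9) for candidate delay t,
    averaging over the t disjoint subsequences and radii j * sigma / 2."""
    radii = np.std(x) * np.array([0.5, 1.0, 1.5, 2.0])
    subs = [x[s::t] for s in range(t)]          # equation (5)
    s_vals, ds_vals = [], []
    for m in ms:
        s_m = [np.mean([correlation_integral(sub, m, 1, r)
                        - correlation_integral(sub, 1, 1, r) ** m
                        for sub in subs])
               for r in radii]                  # equation (7)
        s_vals.extend(s_m)
        ds_vals.append(max(s_m) - min(s_m))     # equation (8)
    return np.mean(s_vals), np.mean(ds_vals)

# The delay tau is the first local minimum of dS_bar(t); the window tau_w is
# the global minimum of S_cor(t) = dS_bar(t) + |S_bar(t)|; the embedding
# dimension then follows from equation (10): m = tau_w / tau + 1.
```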

3.3. Nonlinear Attribute Feature Extraction

(1) Minimum delay time: the speech signal is represented as $x(n)$. We use the mutual information function to calculate the dependence between the speech series $x(n)$ and its delayed copy $x(n+\tau)$ at different time intervals. At the point where the mutual information of the two series reaches its first local minimum, the correlation between the two variables is also minimal; this time interval is the minimum delay time $\tau$. As shown in equation (11), this paper uses the average mutual information (MI) [19] to calculate the minimum delay time:

$$I(\tau) = \sum_{i,j} P_{ij}(\tau) \ln \frac{P_{ij}(\tau)}{P_i P_j}, \tag{11}$$

where $P_i$ and $P_j$, respectively, represent the probabilities of the sequence amplitude falling in the $i$th and $j$th bins, and $P_{ij}(\tau)$ denotes the joint probability of the two amplitudes of the sequence at time interval $\tau$. The minimum delay, which quantifies the disorder between two discrete variables, corresponds to the first local minimum of the mutual information curve.

(2) Correlation dimension: the correlation dimension is a nonlinear representation of chaotic dynamics. It describes the dynamics and self-similarity of the structure of speech in the high-dimensional space and provides a quantitative measure of its structural complexity: the more complex the system structure, the larger the correlation dimension. The correlation dimension is calculated using the G-P algorithm [20], proposed by Grassberger and Procaccia, as shown in equation (12):

$$D_2 = \lim_{r \to 0} \frac{\ln C(r)}{\ln r}, \tag{12}$$

where $D_2$ is the correlation dimension and $C(r)$ is the correlation integral function, the ratio of the number of phase-point pairs in the $m$-dimensional reconstructed space whose distance is less than $r$ to the total number of pairs, defined as

$$C(r) = \frac{2}{M(M-1)} \sum_{1 \le i < j \le M} \Theta\bigl(r - \|X_i - X_j\|\bigr). \tag{13}$$

In equation (13), the $\ln C(r)$ versus $\ln r$ curve is obtained by taking the minimum embedding dimension $m$, and the correlation dimension is obtained by fitting the slope of the linear region of the curve.

(3) Kolmogorov entropy: this is a physical quantity that accurately describes the degree of disorder in a time-series distribution. Grassberger and Procaccia, who proposed the correlation dimension analysis method, demonstrated that the Kolmogorov entropy can be approximated by the second-order entropy $K_2$. The relationship between this entropy and the correlation integral function can be expressed as

$$K_2 = \lim_{r \to 0} \lim_{m \to \infty} \frac{1}{\tau} \ln \frac{C_m(r)}{C_{m+1}(r)}. \tag{14}$$

The entropy calculated in equation (14) approximates the Kolmogorov entropy.

(4) Largest Lyapunov exponent: the Lyapunov exponent quantifies the average rate of local convergence or divergence of adjacent orbits in the phase space. The largest Lyapunov exponent $\lambda_1$ represents the overall degree of convergence or divergence of the orbits. When $\lambda_1 > 0$, the larger its value, the stronger the orbital divergence and the chaos. This paper uses the Wolf method [21] to obtain the largest Lyapunov exponent. We take an initial point $X(t_0)$ in the phase space and find its nearest neighbor $X_0(t_0)$, with distance $L(t_0)$ between them. This distance is tracked over time as the adjacent orbits converge or diverge; when, after some iterations of tracking, the distance between the two points exceeds the set value, a new neighboring point is retained and the next interval is tracked. After tracking through $M$ such iterations, we can obtain the largest Lyapunov exponent using the following equation:

$$\lambda_1 = \frac{1}{t_M - t_0} \sum_{k=1}^{M} \ln \frac{L'(t_k)}{L(t_{k-1})}. \tag{15}$$

Compared with other algorithms, this algorithm is fast to compute and robust to the embedding dimension $m$, the delay time $\tau$, and noise.

(5) Hurst exponent: the Hurst exponent ($H$) measures the long-term memory of a time series and lies in the range 0-1. If $0.5 < H < 1$, the time series displays long-term autocorrelation and is highly correlated. This paper uses the rescaled-range (R/S) analysis method [22] to calculate $H$. Rescaled-range analysis is a nonparametric statistical method that is not affected by the distribution of the time series. The method divides the one-dimensional emotional speech signal into adjacent subsequences of equal length $n$. By calculating the cumulative deviation and the standard deviation of each subsequence, we obtain the rescaled range $R(n)/S(n)$, and the Hurst exponent follows from

$$\frac{R(n)}{S(n)} = c \cdot n^{H}. \tag{16}$$

Here, $c$ is a constant. By taking the logarithm of both sides of equation (16), we can obtain the value of $H$, the Hurst exponent. The value of $H$ changes differently for the different emotional states contained in a speech signal, so the extracted Hurst exponent feature reflects the correlation between the emotion and this change. (A code sketch of the minimum-delay estimate in step (1) follows.)
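A sketch of the average-mutual-information delay estimate of equation (11), using a simple histogram estimator (the 16-bin choice and the function names are illustrative assumptions):

```python
import numpy as np

def average_mutual_information(x, tau, bins=16):
    # Joint histogram of the series and its tau-delayed copy -> P_ij(tau).
    joint, _, _ = np.histogram2d(x[:-tau], x[tau:], bins=bins)
    pij = joint / joint.sum()
    pi = pij.sum(axis=1, keepdims=True)   # marginal P_i
    pj = pij.sum(axis=0, keepdims=True)   # marginal P_j
    mask = pij > 0
    return np.sum(pij[mask] * np.log(pij[mask] / (pi @ pj)[mask]))

def minimum_delay(x, max_tau=50):
    # The minimum delay is the first local minimum of I(tau).
    mi = [average_mutual_information(x, t) for t in range(1, max_tau + 1)]
    for i in range(1, len(mi) - 1):
        if mi[i] < mi[i - 1] and mi[i] < mi[i + 1]:
            return i + 1  # index i corresponds to tau = i + 1
    return int(np.argmin(mi)) + 1
```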

3.4. Nonlinear Geometric Feature Extraction

After the one-dimensional speech signal is mapped to a high-dimensional space using phase space reconstruction, the speech signal is analyzed in that high-dimensional space. Next, the geometric features, namely the five trajectory-based descriptor contours of the phase space reconstruction for different speech states, are extracted. These five descriptors are detailed as follows (a code sketch follows the list):

(1) The first contour: the distance from each attractor point to the center of the attractor, expressed as $d_1(n)$:

$$d_1(n) = \|\mathbf{p}_n - \bar{\mathbf{p}}\|, \tag{17}$$

where the attractor point in the two-dimensional space is defined as $\mathbf{p}_n = (x(n), x(n+\tau))$ and in the three-dimensional space as $\mathbf{p}_n = (x(n), x(n+\tau), x(n+2\tau))$, and $\bar{\mathbf{p}}$ is the centroid of all points.

(2) The second contour: the length of each continuous trajectory segment between successive attractor points, expressed as $d_2(n)$:

$$d_2(n) = \|\mathbf{p}_{n+1} - \mathbf{p}_n\|. \tag{18}$$

(3) The third contour: the trajectory of the continuous path between successive attractor points, expressed as $d_3(n)$ (equation (19)).

(4) The fourth contour: the distance from each attractor point to the marker line, expressed as $d_4(n)$. For a small time delay $\tau$, when the original waveform is lagged, there is only a small difference between the two samples $x(n)$ and $x(n+\tau)$. This can be expressed as the identity [20]:

$$x(n) \approx x(n+\tau). \tag{20}$$

From formula (20), we can observe that this relation no longer holds exactly when the three coordinates of an attractor point differ. Since the dynamic factors of a chaotic system are interactive, the data points produced over time are also correlated [23]. Therefore, formula (21) represents the marker line, the diagonal of the phase space:

$$x(n) = x(n+\tau) = x(n+2\tau). \tag{21}$$

The differences between the attractors can be obtained by analyzing the perpendicular distances between the attractor points and this marker line.

(5) The fifth contour: the total length of the trajectory of the attractor, expressed as $L$:

$$L = \sum_{n} \|\mathbf{p}_{n+1} - \mathbf{p}_n\|. \tag{22}$$
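A sketch of four of the five contour descriptors, assuming the trajectory matrix P (rows are attractor points) produced by the phase_space_reconstruction sketch above; the third contour is omitted because its exact expression is not recoverable here, and the distance-to-marker-line computation uses the standard projection onto the unit diagonal vector as an assumption:

```python
import numpy as np

def contour_descriptors(P):
    center = P.mean(axis=0)
    d1 = np.linalg.norm(P - center, axis=1)            # first contour (17)
    d2 = np.linalg.norm(np.diff(P, axis=0), axis=1)    # second contour (18)
    # Fourth contour: distance from each point to the diagonal marker line
    # x = y (= z), i.e., the component orthogonal to the unit diagonal.
    u = np.ones(P.shape[1]) / np.sqrt(P.shape[1])
    d4 = np.linalg.norm(P - np.outer(P @ u, u), axis=1)
    L = d2.sum()                                       # fifth contour (22)
    return d1, d2, d4, L
```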

4. Experimental Preparation

4.1. Speech Corpora
4.1.1. Korean Isolated Words Database [15]

The isolated words database was used for performing speaker-independent, isolated word recognition from neutral (nonemotional) speech. The vocabulary sizes used in the experiments were 10, 20, 30, 40, and 50 words. The corpus consists of ten digits and 40 command words, with 16 speakers each repeating every word three times. For our experiment, we used the recordings of 9 speakers as the training set and the recordings of the remaining 7 speakers as the test set.

4.1.2. CASIA Database [16]

The CASIA database is a Chinese database developed at the Institute of Automation, Chinese Academy of Sciences. The recordings consist of six acted (simulated) emotions (neutral, anger, fear, happiness, sadness, and surprise) by four professional speakers (two female and two male). Each emotion category consists of 300 recordings of identical texts and 100 of different texts. Recordings of the same text read with different emotions are useful for comparing acoustic and prosodic behavior across emotional states, while the 100 different texts, whose emotional content matched the emotion being expressed, made it easier for the speakers to express their feelings. The recordings were made at a sampling rate of 16 kHz with 16-bit resolution and stored in PCM format.

4.1.3. Berlin Database [17]

The Berlin database is a German database recorded in an anechoic chamber at the Technical University of Berlin. The database consists of 10 actors (5 female and 5 male) who simulated seven emotions (neutral, anger, fear, happiness, sadness, disgust, and boredom). Each emotion category contains ten German sentences. The recordings were made with high-quality equipment at a sampling frequency of 48 kHz and were later downsampled to 16 kHz. In our experiments, we use happy, sad, neutral, and angry as the four basic emotions from the Berlin database.

Taking into account the effect of speech length on the recognition results, this paper filters the databases to obtain 363 German sentences and 1000 Chinese sentences with an approximate length of five seconds each. The division of the emotional speech into training and test sets is shown in Table 1.

4.2. Feature Extraction

Previous studies have demonstrated that prosodic features [24] and MFCC features [24] are highly efficient at distinguishing between different emotional states. In this paper, we first perform a series of preprocessing operations on the speech signals. Then, we extract the prosodic features and MFCC features for each speech frame. We also extract the NLD-1 and NLD-2 features based on the phase space reconstruction method described earlier in this paper. We then calculate statistical functions over the above frame-level features: the maximum, the minimum, the mean, the variance, the median, the skewness, and the kurtosis. Finally, as shown in Table 2, we obtain a feature set of 150 dimensions. Linear-function (min-max) normalization is applied to eliminate the influence of the differing scales of the affective features, after which the objective performance is evaluated synthetically.
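A minimal sketch of the statistical functionals applied to each frame-level feature track (assuming feat is a 1-D array of per-frame values):

```python
import numpy as np
from scipy.stats import kurtosis, skew

def functionals(feat):
    # Seven statistics per frame-level feature track, as listed above.
    return np.array([feat.max(), feat.min(), feat.mean(), feat.var(),
                     np.median(feat), skew(feat), kurtosis(feat)])
```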

4.2.1. Prosodic Feature Extraction

Prosodic features mainly describe the nonverbal information in the emotional speech signal, including the pitch level, duration, speed, stress, and fluency of speech. The prosodic feature, also known as the "suprasegmental feature," is therefore widely recognized for its ability to convey emotion. We use speech rate, average zero-crossing rate, energy, fundamental frequency, and formants as the prosodic features.

4.2.2. MFCC Feature Extraction

The ability of the human ear to perceive sound intensity is related to the frequency of the sound. At low frequencies, the perceived loudness varies approximately linearly with the sound frequency. At high frequencies, due to the masking effect, the perception of the human ear is nonlinear in frequency, so Mel frequencies are introduced to simulate these auditory properties. This paper uses the common expression $\mathrm{Mel}(f) = 2595 \log_{10}(1 + f/700)$ to convert ordinary frequency to Mel frequency, and the first 12 MFCC coefficients are extracted.
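An extraction sketch using librosa (an assumed dependency, not named by the paper); n_fft and hop_length mirror the frame length and shift from Section 3.1:

```python
import librosa

def extract_mfcc(path):
    # Load at 16 kHz to match the corpora, then compute 12 MFCCs per frame.
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_fft=256, hop_length=128)
```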

4.3. Classification

Constructing a reasonable and efficient speech recognition model is the most important research challenge in the field of speech recognition technology. It requires learning from a large training corpus, which can be used to explore a variety of acoustic features for mapping the corresponding path of the speech signals to achieve correct identification. Currently, both linear and nonlinear classifiers are used for speech recognition tasks. The linear ones include the Naïve Bayes classifier, the linear ANN (artificial neural network), and the linear SVM (support vector machine). The nonlinear ones include decision trees, k-NN (the k-nearest neighbor algorithm), and nonlinear ANNs. Nonlinear classifiers also include kernel SVMs, GMMs (Gaussian mixture models), HMMs (hidden Markov models), and sparse-means classifiers, among others. Researchers have experimented with different model classifiers to improve speech recognition. The most widely used classifiers for speech recognition are the HMM [25, 26], GMM [27, 28], ANN [29, 30], and SVM [31, 32]. In this paper, to improve the separability of the data, an SVM classifier is used to generate a nonlinear mapping of the original features to a high-dimensional space; the kernel function chosen is the radial basis function (RBF).
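A sketch of the classification stage with scikit-learn (an assumed dependency; X_train/X_test stand for the 150-dimensional feature matrices, y_train/y_test for the labels, and the C and gamma values are illustrative defaults):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Min-max scaling plays the role of the linear-function normalization in
# Section 4.2; the RBF kernel performs the nonlinear high-dimensional mapping.
clf = make_pipeline(MinMaxScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```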

5. Experimental Setup and Analysis of Results

To verify the validity and robustness of the proposed NLD feature set, we design the following two experiments. The first experiment consists of an analysis of the influence of PSR parameter selection on the NLD feature set. The second experiment verifies the validity of NLD features for speech recognition by comparing them with traditional acoustic features.

5.1. Influence of PSR Parameter Selection on NLD-2 Features

We design two experiments to verify the effect of the two important parameters of phase space reconstruction and discuss the results under different parameter settings.

Experiment 1: first, we perform phase space reconstruction of the speech signals using the fixed delay time $\tau$ and embedding dimension $m$ set in [20]. Next, phase space reconstruction is also carried out using the delay time $\tau$ and embedding dimension $m$ computed for each frame of the speech signal by the improved C-C method. Finally, we compare the results of the two settings.

Experiment 2: in view of current research on spatial coordinates, geometric information is limited to two- or three-dimensional spaces [13]. Therefore, we fix the embedding dimension $m$ at two or three together with a fixed delay time $\tau$, so as to compare the experimental results across the different delay times and embedding dimensions.

We reconstruct the phase space, based on the above two sets of experimental parameters. Next, we extract five kinds of NLD-2 features from the corresponding phase space of the Berlin-DB for the recognition of five basic emotions. The experimental results are shown in Table 3 and Figure 5.

From Table 3, we can observe that, for the emotional speech recognition task, the delay time $\tau$ and embedding dimension $m$ obtained by our method achieve a higher accuracy (75%) than the values reported in the literature [20]. Our system demonstrates an increase of 33.3% for the happiness category, while the recognition rates for sadness, anger, and fear are relatively low. Nevertheless, in terms of average recognition rate, the NLD features extracted with the parameters of this paper achieve a recognition rate that is 2.5% higher.

According to the experimental results shown in Figure 5, the NLD-2 features based on our parameter-setting method do not achieve the optimal recognition rate for every emotional speech category. However, the overall recognition trend is smoother than with the other approaches, and the average recognition rate reaches its optimal value. This indicates that solving for the delay time $\tau$ and the embedding dimension $m$ with the improved C-C method yields valid NLD-2 features. It also proves that, compared with setting fixed values for the delay time $\tau$ and the embedding dimension $m$, using the C-C method to set these parameters for the phase space reconstruction of each frame of the speech signal yields better recognition of emotional speech.

5.2. Validity and Verification of Robustness of the NLD Features

In this paper, we used three methods to verify the validity of the extracted features.

5.2.1. Experimental Scheme 1: Speech Recognition of Isolated Words

The ten types of NLD features based on the PSR theory are combined with the MFCC features to identify isolated speech vocabulary. The experimental results are shown in Table 4 and Figure 6. These results verify the validity and robustness of the NLD features based on phase space theory.

The experimental results show that, across different vocabulary sizes and different values of the signal-to-noise ratio (SNR), the recognition rate can be improved by combining NLD features with traditional linear acoustic features. Comparing the four feature combination methods in Table 4, we can see that the complementary effect of the NLD-2 features is better than that of the NLD-1 features, and the combination of the NLD features with the MFCC features yields the best results. The gain in recognition rate from fusing traditional linear acoustic features with NLD features grows with the vocabulary size, which can be attributed to the larger training set. Therefore, the effective information in the speech signals can be better described by combining or complementing the traditional linear features with the NLD features. However, the overall recognition rate itself decreases as the number of words increases, because the fused features are not well suited to large vocabularies. Therefore, new features must be considered to improve recognition performance for large-vocabulary speech recognition.

5.2.2. Experimental Scheme 2: Single Language Emotion Recognition

The prosodic features, MFCC features, NLD-1 features, and NLD-2 features are used to recognize emotional speech in each single language, drawn from the Berlin-DB and CASIA databases. The recognition results are shown in Tables 5 and 6.

The confusion matrix of the Chinese emotional speech recognition is provided in Table 5. We can see that, compared with MFCC, NLD-1, and NLD-2, the prosodic features achieve the best recognition rate for the happy emotional state. From the perspective of misjudgment, the confusion between happiness and anger is lowest for the prosodic features, indicating that prosodic features can effectively distinguish between the happy and angry emotional states. From the overall recognition results, MFCC performs better than the other three feature types, and its recognition of the anger class is optimal. The NLD-1 features recognize neutral emotional speech better, and NLD-2 recognizes sadness and fear better. The recognition performance of the NLD features alone is not optimal; this can be explained by the fact that, for emotional speech, NLD features are effective only for recognizing certain local emotions. It also indicates that the nonlinear features can compensate for the chaotic information of speech neglected in previous studies.

Table 6 shows the confusion matrix for the Berlin German emotional speech corpus. The recognition effect of NLD-2 is better than that obtained using the prosodic, MFCC, and NLD-1 features. For happiness, NLD-2 correctly classifies 50 instances, more than any of the other feature sets. For fear, the recognition performances of NLD-1 and MFCC reach the optimum. From the overall recognition results, MFCC is superior to the other three feature types, because the MFCC features yield the best recognition results for sadness, neutral, anger, and fear. Comparing the results of emotion recognition in the two languages, we can see that the recognition of emotional speech is related not only to the language of the speech database but also, closely, to the features used: the same feature yields different results when representing emotional information in different languages.

In Figure 7, we compare the results of single-language emotional speech recognition for German and Chinese. We can see that only the prosodic features yield slightly better results in Chinese than in German, because in Chinese they obtain the highest recognition rate for happy emotional speech. Ranking the features by recognition performance gives MFCC > NLD-2 > NLD-1 > prosodic features, and this ordering holds for both the Chinese and the German emotional speech corpora. Therefore, we can state that the NLD-1 and NLD-2 features extracted in this paper can effectively characterize the emotional information in speech signals.

5.2.3. Experimental Scheme 3: Speech Recognition of Mixed-Language Emotion

Prosodic features, MFCC features, NLD-1 features, and NLD-2 features are used to recognize cross-language emotional speech drawn from the Berlin-DB and CASIA databases in two languages. The recognition results are shown in Table 7. This further validates the efficiency of the extracted features for recognizing emotional states from speech.

From Table 7, we can draw the following conclusions. Considering the average recognition results when each of the four feature types (prosodic features, MFCC, NLD-1, and NLD-2) is used alone, the average recognition rate is highest for NLD-2 and lowest for the prosodic features; this suggests that prosodic features are better suited to recognizing emotional speech within a single language. Evaluating the results for each individual emotion, we observe that MFCC has better discriminative power for detecting sadness, NLD-1 better differentiates the neutral emotion, and NLD-2 provides a better distinction between happiness, anger, and sadness. Therefore, we can state that the NLD features better distinguish emotions of great intensity, such as sadness, neutral, happiness, and anger. From the perspective of feature fusion, we observe that adding NLD features effectively compensates for the chaotic characteristics of the emotional speech signals missed by traditional linear acoustic features. We also observe that using NLD features alone to characterize the emotional differences in speech signals is one-sided, because the NLD features treat the speech signal as a one-dimensional time series and completely ignore its acoustic characteristics. Therefore, when the NLD features and the acoustic features are combined, the effective information in the emotional speech signals can be better described.

6. Conclusion and Further Study

In this paper, based on the chaotic characteristics of the nonlinear generation mechanism of speech signals, and aiming at the deficiency of linear feature parameters and the limitations of existing time- and frequency-domain attribute features in characterizing the integrity of speech information, a nonlinear feature extraction method based on phase space reconstruction theory is proposed. The chaotic characteristics of speech signals are verified from three aspects: power-spectrum analysis, principal component analysis, and phase space reconstruction. A nonlinear dynamic model is then applied to extract NLD features from speech signals, and their contribution is evaluated. Speech recognition experiments are designed that combine traditional linear acoustic features with the NLD features to verify whether this combination improves recognition performance. The experimental results for the recognition of isolated words show that adding nonlinear dynamic features effectively compensates for the chaotic information neglected by the traditional acoustic features, proving that merging NLD features with acoustic features can better describe the effective information contained in speech signals. The recognition results for emotional speech show that, while the performance of the nonlinear features alone is not ideal, better recognition rates can be obtained through feature fusion. For the experiments designed in this paper, the recognition network was developed by combining NLD features with acoustic features. Through our experiments, we demonstrate that while NLD features efficiently compensate for the chaotic characteristics of the emotional speech signals, they are also biased when used alone to represent the differences in emotional speech. In future research, we would like to explore the integration of NLD and acoustic features to generate the strongest feature combination. Additionally, in view of the high efficiency of the NLD features for emotion recognition in mixed languages, cross-database emotion recognition using NLD features is another direction that needs to be explored further.

Data Availability

The databases used in this manuscript can be downloaded from the following two links: http://emodb.bilderbar.info/docu/#home and http://www.chineseldc.org/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 61371193), by a research project supported by the Shanxi Scholarship Council of China (Grant no. 201925), and by the Natural Science Foundation of Shanxi Province, China (Grant no. 201901D111096).