Abstract

Feature extraction is a crucial step in speech emotion recognition. To address the feature extraction problem, this paper proposes a new method that uses deep belief networks (DBNs), a class of deep neural networks (DNNs), to extract emotional features from speech signals automatically. A five-layer DBN is trained to extract speech emotion features, and the features of multiple consecutive frames are combined to form a high-dimensional feature vector. The features produced by the trained DBNs are then used as the input of a nonlinear SVM classifier, yielding a multiple-classifier speech emotion recognition system. The recognition rate of the system reached 86.5%, which is 7% higher than that of the original method.

1. Introduction

Speech emotion recognition is a technology in which a computer extracts emotional features from speech signals and then compares and analyzes the acquired characteristic parameters and the corresponding emotional changes. From this analysis, the relationship between speech and emotion is derived, and speech emotional states are judged according to that relationship. At present, speech emotion recognition is an emerging interdisciplinary field of artificial intelligence and artificial psychology, and it is a hot research topic in signal processing and pattern recognition [1]. It is widely applied in human-computer interaction, interactive teaching, entertainment, security, and other fields.

A speech emotion processing and recognition system is generally composed of three parts: speech signal acquisition, feature extraction, and emotion recognition. The system framework is shown in Figure 1.

In this system, the quality of feature extraction directly affects the accuracy of speech emotion recognition. In the feature extraction process, the whole emotional sentence is usually taken as the unit, and features are extracted from four aspects of emotional speech: the acoustic characteristics of time construction, amplitude construction, fundamental frequency construction, and formant construction. The emotional speech is then compared with neutral (non-emotional) sentences in these four aspects to obtain the distribution law of the emotional signal, and the emotional speech is classified according to that law [2].

Deep neural networks (DNNs) have had unprecedented success in the fields of speech recognition and image recognition [3]; however, so far no research on deep neural networks had been applied to speech emotion processing. We found that the deep belief network (DBN), one kind of DNN, has a great advantage in speech emotion processing [4]. Therefore, this paper proposes a method to extract emotional features from sentences automatically. It uses DBNs to train a five-layer deep network to extract speech emotion features, combines the speech emotion features of multiple consecutive frames to build a high-dimensional feature vector, and uses an SVM classifier to classify the emotional speech. We compared this method with other, traditional feature extraction methods and found that its speech emotion recognition rate reached 86.5%, which is 7% higher than that of the original method.

2. Feature Extraction of Speech Emotion

Emotion can be expressed through speech because speech contains characteristic parameters that reflect emotional information [5]. We can extract these characteristic parameters and observe their changes to measure the corresponding changes in speech emotion. The key, therefore, is extracting the characteristic parameters of speech emotion from the speech signal. The quality of feature extraction directly affects the accuracy of speech emotion recognition. Meanwhile, speech signals contain not only emotional feature information but also important information about the speaker; therefore, research on how to extract speech emotion characteristic parameters, and on which parameters to extract, is of great importance [6].

2.1. Emotion Speech Database

Before speech emotion feature extraction, the emotional speech signal must first be acquired. The emotional speech database is the foundation of speech emotion recognition, as it provides standard speech for the task.

At present, there is a large body of literature on this topic [7], and single-language emotional speech databases have been built around the world in English, German, Spanish, and Chinese; a few speech databases also contain several languages. This paper is aimed at recognizing speech emotion in Chinese. In order to establish a sound speech sampling database, the following principles must be followed when selecting experimental sentences.

(i) The selected sentence must not contain any particular emotional tendency, ensuring that the recorded statements will not bias the experimenter's judgment [8].

(ii) The selected sentence must be relatively emotionally free; that is, the sentence can express different emotions rather than just a single one. Otherwise, we would be unable to compare the emotional speech parameters of the same sentence under different emotional states [9].

According to the above principles, and to ensure accuracy, this paper uses the Buaa emotional speech database, which has passed an effectiveness evaluation, instead of recording a speech database ourselves. This database was recorded by 7 male and 8 female speakers and consists of 7 basic emotions: sadness, anger, surprise, fear, joy, hate, and calm. For each emotion there are 20 reference scripts, so the database consists of 2100 emotional sentences in total, all recorded as WAV files. The sampling rate of the speech is 16000 Hz and the quantization accuracy is 16 bits. This paper selected 1200 sentences covering four basic emotions, sadness, anger, surprise, and happiness, for training and recognition.

2.2. Traditional Emotional Feature Extraction

Traditional emotional feature extraction is based on analyzing and comparing various emotion characteristic parameters and selecting, for feature extraction, those characteristics with high emotional resolution. In general, traditional emotional feature extraction concentrates on analyzing the emotional features of speech in terms of the time construction, amplitude construction, and fundamental frequency construction of the signal [10].

2.2.1. Time Construction

Speech time construction refers to the differences in the timing of emotional speech pronunciation. When people express different feelings, the time construction of their speech differs, mainly in two aspects: one is the length of continuous pronunciation time, and the other is the average rate of pronunciation.

Zhao Li's research showed that different emotional pronunciations differ in pronunciation length and speed. Compared with the length of calm pronunciation, the pronunciation time of joy, anger, and surprise is significantly shortened, whereas sad pronunciation is longer. Compared with the calm pronunciation rate, the sad pronunciation rate is slower, while the joy, anger, and surprise pronunciation rates are relatively quick.

In conclusion, if we extract the time construction characteristic parameters of speech, it is easy to distinguish the sad emotion from other emotional states. Of course, we can also set a certain time threshold to distinguish joy, anger, and surprise from calm speech. However, it is obvious that speech time construction alone is not enough to recognize the speech emotional state.

2.2.2. Amplitude Construction

The amplitude construction of the speech signal also has a direct link with the speech emotional state. When the speaker is angry or happy, the volume of speech is generally high; when the speaker is sad or depressed, the volume of speech is generally low. Therefore, analyzing the amplitude construction of speech emotion features is meaningful.

Figure 2 compares emotional speech with calm speech in terms of the average amplitude difference. It shows that the amplitudes of the three emotions joy, anger, and surprise are larger than that of the calm speech signal, while the amplitude of sad speech is smaller.

2.2.3. Fundamental Frequency Construction

Bänziger and Scherer [11] showed that, for the same sentence, if the emotions expressed are different, the fundamental frequency curves are also different, as are the mean and variance of the fundamental frequency. When the speaker is in a state of happiness, the fundamental frequency curve of the speech generally bends upwards; when the speaker is in a state of sadness, it generally bends downward. Figure 3 shows the fundamental frequency variance curves of different emotions.

Compared with the calm emotional state, the speech signal characteristics of happiness, surprise, and anger vary more. Thus, by analyzing the fundamental frequency curve of the same sentence under different emotional states, we can contrast and obtain the fundamental frequency construction of different emotional speech.

3. The Deep Belief Network

Deep neural networks stem from artificial neural networks [12]; a deep neural network is, literally, a neural network with a deep structure. In 2006, Professor Hinton at the University of Toronto presented the deep belief network (DBN) structure in [13]. Since then, deep neural networks and deep learning have become among the hottest research topics in artificial intelligence. Hinton demonstrated the effectiveness of unsupervised, layer-by-layer training and pointed out that each layer can be trained in an unsupervised manner on the output of the previous layer. Compared with traditional neural networks, a deep neural network has a deep structure with multiple nonlinear mappings, which can approximate complex functions [14].

Hinton first put forward DBNs in 2006, and since then DBNs have achieved unprecedented success in areas such as speech recognition and image recognition. Microsoft researcher Dr. Deng, in cooperation with Hinton, found that deep neural networks can significantly improve the accuracy of speech recognition; however, so far no research on deep neural networks had been applied to speech emotion recognition. In this paper, we found that deep belief networks (DBNs) have a great advantage in speech emotion recognition; therefore, we chose deep belief networks to extract emotional features from speech automatically.

A typical deep belief network is a highly complex directed acyclic graph formed by stacking a series of restricted Boltzmann machines (RBMs). Training a DBN is realized by training the RBMs layer by layer from the bottom up. Because each RBM can be trained rapidly with the contrastive divergence algorithm, training the RBMs avoids the high complexity of training the whole DBN at once, simplifying the process to the training of each RBM. Numerous studies have demonstrated that deep belief networks can overcome the slow convergence and local optima that affect the traditional backpropagation algorithm when training multilayer neural networks.

3.1. Restricted Boltzmann Machine

DBNs are built by stacking restricted Boltzmann machines (RBMs); an RBM is a typical kind of neural network. An RBM consists of a visible layer and a hidden layer connected to each other, with no connections between units within the visible layer or within the hidden layer. As shown in Figure 4, an RBM is trained with an unsupervised greedy method, step by step: during training, the feature values of the visible layer are mapped to the hidden layer, the visible layer is then reconstructed from the hidden layer, and the reconstructed visible feature values are mapped to the hidden layer again, yielding new hidden-layer values. The main purpose of this procedure is to obtain the generative weight values. Thus, the main characteristic of the RBM stack is that the activations of one layer are used as the training data for the next layer, and as a consequence learning is fast [15]. This layer-by-layer learning strategy is efficient; the theory and proof are given in [16].
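To make this training procedure concrete, the following is a minimal NumPy sketch of one contrastive-divergence (CD-1) update for a Bernoulli RBM; it is an illustration only, and the layer sizes, learning rate, and batch handling are placeholders rather than the settings used in our experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.001, rng=np.random):
    """One contrastive-divergence (CD-1) step for a Bernoulli RBM.

    v0 : (batch, n_visible) batch of visible vectors
    W  : (n_visible, n_hidden) weight matrix
    a  : (n_visible,) visible biases
    b  : (n_hidden,) hidden biases
    """
    # Positive phase: hidden activation probabilities and a binary sample.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.rand(*ph0.shape) < ph0).astype(v0.dtype)

    # Negative phase: reconstruct the visible layer, then recompute hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)

    # Approximate gradient: difference of positive and negative correlations.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```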

In Figure 4, the DBN is stacked from the bottom up out of RBMs, using a Gaussian-Bernoulli RBM followed by Bernoulli-Bernoulli RBMs; the output of each lower layer is the input of the layer above.

Figure 5 is the structure diagram of the DBN. The numbers of layers and units shown are only an example; the number of hidden layers is not necessarily the same in the actual experiment.

In a Bernoulli RBM, the visible and hidden layer units are binary: $v_i \in \{0, 1\}$ and $h_j \in \{0, 1\}$, where $V$ and $H$ denote the numbers of visible and hidden units. In a Gaussian RBM, each visible unit is real-valued: $v_i \in \mathbb{R}$. The joint probability of $(\mathbf{v}, \mathbf{h})$ is expressed as
$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp\bigl(-E(\mathbf{v}, \mathbf{h})\bigr),$$
where $Z$ is a normalizing constant and $E(\mathbf{v}, \mathbf{h})$ is the energy function. For a Bernoulli RBM, the energy function is
$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{V} \sum_{j=1}^{H} w_{ij} v_i h_j - \sum_{i=1}^{V} a_i v_i - \sum_{j=1}^{H} b_j h_j,$$
where $w_{ij}$ is the weight connecting visible node $i$ and hidden node $j$, and $a_i$ and $b_j$ are the biases of the visible and hidden units, respectively. For a Gaussian RBM (with unit-variance visible units), the energy function is
$$E(\mathbf{v}, \mathbf{h}) = \sum_{i=1}^{V} \frac{(v_i - a_i)^2}{2} - \sum_{i=1}^{V} \sum_{j=1}^{H} w_{ij} v_i h_j - \sum_{j=1}^{H} b_j h_j.$$
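For completeness, the conditional activation probabilities implied by these energy functions, which are what each RBM uses when sampling during layer-wise training, are the standard ones. For a Bernoulli RBM,
$$P(h_j = 1 \mid \mathbf{v}) = \sigma\Bigl(b_j + \sum_{i} w_{ij} v_i\Bigr), \qquad P(v_i = 1 \mid \mathbf{h}) = \sigma\Bigl(a_i + \sum_{j} w_{ij} h_j\Bigr),$$
and for the Gaussian RBM the hidden units keep the same form while each visible unit is sampled from $v_i \mid \mathbf{h} \sim \mathcal{N}\bigl(a_i + \sum_{j} w_{ij} h_j,\, 1\bigr)$, where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid.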

The DBN combines the emotional speech features of consecutive frames into a high-dimensional feature vector, fully describing the correlation between the emotional speech features, and models these high-dimensional features [17]. Moreover, the way DBNs extract speech information is similar to the way the brain processes speech: RBMs extract emotional information layer by layer, eventually yielding high-dimensional characteristics that are well suited to pattern recognition. In practical applications, DBNs can be combined well with traditional speech emotion recognition techniques (e.g., SVM), improving the accuracy of speech emotion recognition.
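As an illustration of this frame-stacking idea, the sketch below concatenates a context window of consecutive frame-level feature vectors into one high-dimensional input vector; the window length and feature dimension here are placeholders, since the exact values depend on the context window chosen in the experiments.

```python
import numpy as np

def stack_context(frames, context=5):
    """Concatenate each frame with its neighbours to form high-dimensional inputs.

    frames  : (n_frames, n_features) frame-level feature matrix for one utterance
    context : number of frames on each side of the centre frame
    Returns : (n_frames, (2 * context + 1) * n_features) stacked feature matrix
    """
    n_frames, n_features = frames.shape
    # Pad the utterance at both ends by repeating the edge frames.
    padded = np.vstack([frames[:1]] * context + [frames] + [frames[-1:]] * context)
    stacked = np.empty((n_frames, (2 * context + 1) * n_features), dtype=frames.dtype)
    for t in range(n_frames):
        stacked[t] = padded[t:t + 2 * context + 1].ravel()
    return stacked

# Example: 100 frames of 13-dimensional features -> 100 stacked vectors of size 143.
features = stack_context(np.random.randn(100, 13), context=5)
```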

3.2. DBNs Model Training

We used Theano to train the deep belief networks (DBNs) in this paper. Theano is a symbolic mathematics compilation toolkit for Python and an extremely powerful tool in machine learning, because it combines the expressiveness of Python with optimized compiled numerical computation, making it easier to build deep learning models.
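As a minimal illustration of Theano's symbolic style (not the actual network used in our experiments; the layer sizes here are placeholders), one sigmoid layer can be expressed symbolically and then compiled into a callable function:

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic input: a matrix of frame-level feature vectors.
x = T.matrix("x")

# Shared (trainable) parameters for one hidden layer; sizes are placeholders.
W = theano.shared(np.zeros((143, 50), dtype=theano.config.floatX), name="W")
b = theano.shared(np.zeros(50, dtype=theano.config.floatX), name="b")

# Symbolic expression for the layer's activations, compiled into a function.
hidden = T.nnet.sigmoid(T.dot(x, W) + b)
layer = theano.function(inputs=[x], outputs=hidden)
```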

The DBNs were first pretrained on the training set in an unsupervised manner [18]. We split the Buaa dataset, using 40% of the voice data for training and 60% for testing. We then carried out supervised fine-tuning on the same training set and used a validation set for early stopping [19].

We tried approximately 100 different hyperparameter combinations and selected the model with the smallest error rate. The selected DBN model is shown in Table 1.

We fixed the DBN architecture for all experiments, using 5 hidden layers (the first layer is a Gaussian RBM; all other layers are Bernoulli RBMs) with 50 units per layer; the only variable is the size of the input vector, which depends on the context window length. The hyperparameters for generative pretraining are shown in Table 1. The unsupervised layers were trained for 50 epochs with a learning rate of 0.001, and the supervised layers for 475 epochs with a learning rate of 0.01.
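For reference, the architecture and training hyperparameters stated above can be summarized as a configuration sketch; the input size is left symbolic because it depends on the context window length.

```python
# DBN configuration as described in the text (Table 1); input_size depends on the
# context window length and is therefore left as a placeholder.
dbn_config = {
    "input_size": None,                      # (2 * context + 1) * n_features, set per experiment
    "hidden_layers": [50, 50, 50, 50, 50],   # 5 hidden layers, 50 units each
    "first_layer_rbm": "gaussian",           # Gaussian-Bernoulli RBM on real-valued inputs
    "other_layer_rbm": "bernoulli",          # remaining layers are Bernoulli-Bernoulli RBMs
    "pretrain_epochs": 50,                   # unsupervised pretraining
    "pretrain_lr": 0.001,
    "finetune_epochs": 475,                  # supervised fine-tuning
    "finetune_lr": 0.01,
}
```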

4. Support Vector Machine Classifier

The SVM is a machine learning method based on statistical learning theory and the structural risk minimization principle. Its principle is to map low-dimensional feature vectors into a high-dimensional feature space so as to solve problems that are not linearly separable. SVMs have been widely used in the field of pattern classification [20].

The key to solving a nonlinearly separable problem is constructing the optimal separating hyperplane; this construction ultimately reduces to computing the optimal weights and bias. Let the training sample set be $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ with $y_i \in \{-1, +1\}$; the cost function minimizing the weights and slack variables is
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i,$$
subject to the constraints $y_i (\mathbf{w}^{\top} \phi(\mathbf{x}_i) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, where the slack variable $\xi_i$ measures the degree to which a sample point deviates from the ideal linearly separable condition. Given the sample set, the optimal weights and bias are obtained by solving the dual of this quadratic optimization problem. Once the optimal hyperplane in the feature space has been constructed, it can be defined by the following decision function:
$$f(\mathbf{x}) = \operatorname{sgn}\!\left( \sum_{i=1}^{N_s} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right).$$

In the formula above, $K(\mathbf{x}_i, \mathbf{x})$ is the kernel function, $\mathbf{x}_i$ is a support vector with nonzero Lagrange multiplier $\alpha_i$, $N_s$ is the number of support vectors, and $b$ is the bias parameter. For a nonlinear SVM, the kernel function maps the data into a higher-dimensional feature space in which the optimal hyperplane exists. Several kinds of kernel functions are used with nonlinear SVMs, such as the Gaussian kernel and the polynomial kernel. This paper uses the following Gaussian kernel function:
$$K(\mathbf{x}_i, \mathbf{x}) = \exp\!\left( -\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2 \sigma^2} \right).$$

In this function, $\sigma$ is the Gaussian kernel width (transmission coefficient).
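A short NumPy sketch of this kernel, for illustration; the value of $\sigma$ is chosen by the cross-validation described below.

```python
import numpy as np

def gaussian_kernel(x, x_i, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - x_i||^2 / (2 * sigma^2))."""
    diff = np.asarray(x) - np.asarray(x_i)
    return np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2))
```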

When we use a support vector machine (SVM) to solve a classification problem, there are two strategies: one-versus-all and one-versus-one. In previous studies we found that the one-versus-one classification strategy has higher accuracy; therefore, this paper chose the one-versus-one strategy to handle the four emotions (surprise, joy, anger, and sadness).

In the one-versus-one mode, a hyperplane is constructed for every pair of emotions, so for $n$ categories the number of child classifiers to be trained is $n(n-1)/2$ [21]. In this experiment, with four emotions, the number of SVM classifiers needed over the whole training process is $4 \times 3 / 2$, namely 6. Each child classifier is trained on one pair of the emotions surprise, joy, anger, and sadness, namely joy-anger, joy-sadness, joy-surprise, anger-sadness, anger-surprise, and sadness-surprise.

A classifier is trained for every pair of categories. When classifying an unknown emotional speech sample, each classifier judges its category and votes for the corresponding category; the category with the most votes is taken as the category of the unknown emotion. Because the decision stage uses voting, it is possible for several categories to tie, so that the unknown sample would belong to several categories at the same time, which affects the classification accuracy.
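The following is a minimal scikit-learn sketch of this one-versus-one training and voting scheme, assuming the DBN-derived features X and integer emotion labels y have already been prepared; the kernel parameters are placeholders, and the paper does not state which SVM implementation was actually used.

```python
from itertools import combinations

import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["joy", "anger", "sadness", "surprise"]

def train_pairwise_svms(X, y, C=1.0, gamma=0.01):
    """Train one RBF-kernel SVM per emotion pair: n * (n - 1) / 2 = 6 child classifiers."""
    classifiers = {}
    for a, b in combinations(range(len(EMOTIONS)), 2):
        mask = (y == a) | (y == b)
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        clf.fit(X[mask], y[mask])
        classifiers[(a, b)] = clf
    return classifiers

def predict_by_voting(classifiers, x):
    """Each pairwise SVM votes; the emotion with the most votes wins (ties broken arbitrarily)."""
    votes = np.zeros(len(EMOTIONS), dtype=int)
    for clf in classifiers.values():
        votes[int(clf.predict(x.reshape(1, -1))[0])] += 1
    return EMOTIONS[int(np.argmax(votes))]
```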

Before training and recognition, the SVM classifier assigns a label to each emotional speech signal to indicate its emotional category; the label type must be set to double. In the emotion recognition process, the feature vector is input into all the SVMs; the output of each SVM then passes through a logic judgment to choose the most probable emotion, and the emotion with the highest weight (most votes) is taken as the emotional state of the speech signal. The penalty factor $C$ used in classifier training and the coefficient $\sigma$ in the kernel function can be determined via cross-validation on the training set. One thing to note is that values of $C$ and $\sigma$ that fit the training set well do not necessarily fit the test set. After repeated testing, the orders of magnitude of the Gaussian kernel coefficient $\sigma$ and of the penalty factor $C$ used in the speech emotion recognition experiments of this paper were fixed by this cross-validation procedure. These parameters are adjusted according to the experiment and the error rate, to improve the classification accuracy on the training data.
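A minimal sketch of this cross-validation, using scikit-learn's grid search for illustration; the candidate values below are hypothetical and not the grid actually searched in the paper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def select_svm_parameters(X_train, y_train):
    """Choose the penalty factor C and RBF coefficient gamma (gamma = 1 / (2 * sigma^2))
    by 5-fold cross-validation on the training set."""
    param_grid = {
        "C": [0.1, 1, 10, 100],              # illustrative candidate magnitudes
        "gamma": [1e-4, 1e-3, 1e-2, 1e-1],   # illustrative candidate magnitudes
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)
    return search.best_params_
```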

The multidimensional feature vectors produced by the DBNs are the input of the SVM. Since the SVM does not scale well to large datasets, we subsampled the training set by randomly picking 10,000 frames. For the nonlinearly separable problem in speech emotion recognition, a kernel function is used to map the input feature sample points into a high-dimensional feature space, making the corresponding sample space linearly separable. In simple terms, it creates a classification hyperplane as the decision surface that maximizes the margin between positive and negative examples. The decision function is still evaluated in the original space, which keeps the computational complexity of the mapped high-dimensional feature space low.
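A sketch of this subsampling step, assuming the DBN output features for all training frames are stored in arrays X_train and y_train:

```python
import numpy as np

def subsample_frames(X_train, y_train, n_frames=10000, seed=0):
    """Randomly keep n_frames training frames so that SVM training stays tractable."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_train), size=min(n_frames, len(X_train)), replace=False)
    return X_train[idx], y_train[idx]
```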

Figure 6 is the block diagram of the SVM-based emotion recognition system.

5. The Experiment and Analysis

This paper selected 1200 sentences covering the four basic emotions of sadness, anger, surprise, and happiness for training and recognition.

Before the emotional speech signal is input into the model, it needs to be preprocessed. In this paper, the input speech signal first goes through pre-emphasis and windowing. We then used a median filter with a window of length 5 to smooth the denoised emotional speech signal.
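A hedged sketch of this preprocessing chain follows; the pre-emphasis coefficient and the frame length and shift are typical values for 16 kHz speech rather than values stated in the paper, and the median filter is applied here directly to the signal.

```python
import numpy as np
from scipy.signal import medfilt

def preprocess(signal, alpha=0.97, frame_len=400, frame_shift=160):
    """Pre-emphasis, length-5 median smoothing, then Hamming-windowed framing."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n - 1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Median filter with a window of length 5, as described in the text.
    smoothed = medfilt(emphasized, kernel_size=5)
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + max(0, (len(smoothed) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    frames = np.stack([
        smoothed[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames
```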

This paper used 40% of the voice data for training and 60% for testing. The experimental group is the speech emotion recognition model built on the features extracted by the deep belief network. The control group is the speech emotion recognition model built on traditional speech feature parameters, under the condition of the same emotional speech input. Finally, we contrasted and analyzed the experimental data to draw conclusions. The process of the experiment is shown in Figure 7.

5.1. The Experimental Group: Research of Speech Emotion Recognition Based on Multiple Classifier Models of the Deep Belief Network and SVM

This paper proposes a method to extract emotional features automatically from sentences. It uses DBNs to train a five-layer deep network to extract speech emotion features. In this experiment, we take the output of the DBN hidden layers as features, and the trained features are the input of a nonlinear support vector machine (SVM); in this way, we set up a multiple-classifier model for speech emotion recognition. It would also have been possible to train the DBN directly as a classifier; however, our goal is to compare the DBN-learned representation with other representations, and using a single classifier allowed direct comparisons. The experimental results are shown in Table 2.

As shown in Table 2, anger and sadness have higher recognition rates than joy and surprise, reaching 91.3% and 88.3%, respectively; the overall recognition rate was 86.5%.

5.2. The Control Group: Research of Speech Emotion Recognition Based on Extracting the Traditional Emotional Characteristic Parameters and SVM

In this paper, the control group was established by extracting the traditional emotional characteristic parameters: time construction, amplitude construction, and fundamental frequency construction. After extracting them, we input these emotional characteristic parameters into the SVM classifier for speech emotion recognition. Finally, we compared the experimental group and the control group. The experimental results are shown in Table 3.

As shown in Table 3, the overall recognition rate of the SVM system based on the traditional emotional characteristic parameters was 79.5%.

As shown in Table 4 and Figure 8, comparing the two methods, the method that uses DBNs to extract speech emotion features improved the recognition rate for all four emotion types (sadness, joy, anger, and surprise). The average recognition rate increased by 7%, and the recognition rate for joy increased by 10%.

6. Conclusion

In this paper we proposed a method that uses deep belief networks (DBNs), one kind of deep neural network, to extract emotional characteristic parameters from emotional speech signals automatically. We combined the deep belief network with the support vector machine (SVM) and proposed a classifier model based on DBNs and SVM. In the practical training process, the model has low complexity and a final recognition rate 7% higher than traditional manual feature extraction; the method extracts emotional characteristic parameters accurately and noticeably improves the recognition rate of emotional speech recognition. However, the time cost of training the DBN feature extraction model was 136 hours, which is longer than that of other feature extraction methods.

In future work, we will continue to study speech emotion recognition based on DBNs and further expand the training data set. Our ultimate aim is to improve the recognition rate of speech emotion recognition.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank the National Key Science and Technology Pillar Program of China (the key technology research and system of stage design and dress rehearsal, 2012BAH38F05; study on the key technology research of the aggregation, marketing, production and broadcasting of online music resources, 2013BAH66F02; Research of Speech Emotion Recognition Based on Deep Belief Network and SVM, 3132013XNG1442).