Abstract

Researchers have paid increasing attention to recognizing the emotional state of an individual from his/her speech signals, since speech is the fastest and most natural means of communication between individuals. In this work, a new feature enhancement method based on the Gaussian mixture model (GMM) is proposed to improve the discriminatory power of features extracted from speech and glottal signals. Three different emotional speech databases were used to evaluate the proposed methods. Extreme learning machine (ELM) and k-nearest neighbor (k-NN) classifiers were employed to classify the different types of emotions. Several experiments were conducted, and the results show that the proposed methods significantly improve speech emotion recognition performance compared with works published in the literature.

1. Introduction

Spoken utterances of an individual can provide information about his/her health, emotion, language, gender, and so on. Speech is one of the most natural forms of communication between individuals. Understanding an individual's emotion can be useful for applications such as web movies, electronic tutoring, in-car board systems, diagnostic tools for therapists, and call centers [1–4]. Most existing emotional speech databases contain three types of emotional speech recordings: simulated, elicited, and natural. Simulated emotions tend to be more expressive than real ones and are the most commonly used [4]. In the elicited category, emotions are closer to natural ones, but if the speakers know that they are being recorded, the quality becomes artificial. In the natural category, not all emotions may be available, and they are difficult to model because they are expressed completely naturally. Most researchers have analysed four primary emotions, namely, anger, joy, fear, and sadness, in either the simulated or the natural domain. High recognition accuracies have been obtained for two-class emotion recognition (high arousal versus low arousal), but multiclass emotion recognition remains challenging. This is due to the following reasons: (a) it is unclear which speech features are information-rich and parsimonious, (b) different sentences, speakers, speaking styles, and speaking rates, (c) more than one perceived emotion in the same utterance, and (d) long-term/short-term emotional states [1, 3, 4].

To improve the accuracy of multiclass emotion recognition, a new GMM based feature enhancement method was proposed and tested using three different emotional speech databases: the Berlin emotional speech database (BES), the Surrey audio-visual expressed emotion (SAVEE) database, and the Sahand Emotional Speech database (SES). Both the speech signals and their glottal waveforms were used in the emotion recognition experiments. Several techniques have been proposed to extract the glottal and vocal tract characteristics from speech waveforms [5–8]. In this work, the glottal waveforms were extracted from the emotional speech signals by inverse filtering and linear predictive analysis [5, 6, 9]. The emotional speech signals and their glottal waveforms were decomposed into 4 levels using the discrete wavelet packet transform, and relative energy and entropy features were calculated for each decomposition node, giving a total of 120 features. A high degree of overlap between the features of different classes may degrade classifier performance and result in poor recognition of speech emotions. To decrease the intraclass variance and increase the interclass variance of the features, a GMM based feature enhancement method was proposed, which improves the recognition of speech emotions. Both raw and enhanced features were subjected to several experiments to validate their effectiveness in speech emotion recognition. The rest of this paper is organized as follows. Some of the significant works on speech emotion recognition are discussed in Section 2. Section 3 presents the materials and methods used. Experimental results and discussions are presented in Section 4. Finally, Section 5 concludes the paper.

2. Previous Works

Several speech features have been successfully applied to speech emotion recognition; they can be broadly classified into four groups: continuous features, qualitative features, spectral features, and nonlinear Teager energy operator based features [1, 3, 4]. Various types of classifiers have been proposed for speech emotion recognition, such as the hidden Markov model (HMM), Gaussian mixture model (GMM), support vector machine (SVM), artificial neural networks (ANN), and k-NN [1, 3, 4]. This section describes some of the recently published works in the area of multiclass speech emotion recognition. Table 1 lists some of the recent works on multiclass speech emotion recognition using the BES and SAVEE databases.

Although speech related features are widely used for speech emotion recognition, there is also a strong correlation between emotional states and features derived from glottal waveforms. The glottal waveform is significantly affected by the emotional state and speaking style of an individual [10–12]. In [10–12], researchers showed that the glottal waveform is affected by excessive tension or lack of coordination in the laryngeal musculature under different emotional states and in speech produced under stress. The classification of clinical depression using glottal features was carried out by Moore et al. [13, 14]. In [15], the authors obtained a correct emotion recognition rate of 85% by using the glottal flow spectrum as a possible cue for depression and near-term suicide risk. Iliev and Scordilis investigated the effectiveness of glottal features derived from the glottal airflow signal in recognizing emotions [16]. An average emotion recognition rate of 66.5% for all six emotions (happiness, anger, sadness, fear, surprise, and neutral) and 99% for four emotions (happiness, neutral, anger, and sadness) was achieved. He et al. proposed wavelet packet energy and entropy features for emotion recognition from speech and glottal signals with a GMM classifier [17]. They achieved average emotion recognition rates between 51% and 54% for the BES database. In [18], prosodic features, spectral features, glottal flow features, and AM-FM features were utilized, and a two-stage feature reduction was proposed for speech emotion recognition. Overall emotion recognition rates of 85.18% for gender-dependent and 80.09% for gender-independent experiments were achieved using an SVM classifier.

Several feature selection/reduction methods have been proposed to select relevant speech features and reduce their dimensionality. Although all the above works are novel contributions to the field of speech emotion recognition, it is difficult to compare them directly since the division of datasets is inconsistent: the number of emotions used, the number of datasets used, inconsistency in the usage of simulated or naturalistic speech emotion databases, and lack of uniformity in the computation and presentation of the results. Most researchers have used 10-fold cross validation or conventional validation (one training set + one testing set), and some of them have tested their methods under speaker-dependent, speaker-independent, gender-dependent, and gender-independent conditions. In this regard, the proposed methods were validated using three different emotional speech databases, and emotion recognition experiments were conducted under both speaker-dependent and speaker-independent conditions.

3. Materials and Methods

3.1. Emotional Speech Databases

In this work, three different emotional speech databases were used to perform emotion recognition and to test the robustness of the proposed methods. First, the Berlin emotional speech database (BES), which consists of speech utterances in German, was used [19]. Ten professional actors/actresses simulated 7 emotions (anger: 127, disgust: 45, fear: 70, neutral: 79, happiness: 71, sadness: 62, and boredom: 81). Secondly, the Surrey audio-visual expressed emotion (SAVEE) database [20] was used; it is an audio-visual emotional database which includes speech utterances in seven emotion categories (anger: 60, disgust: 60, fear: 60, neutral: 120, happiness: 60, sadness: 60, and surprise: 60) from four native English male speakers aged from 27 to 31 years. Fifteen TIMIT sentences per emotion were recorded: 3 common, 2 emotion-specific, and 10 generic sentences. In this work, only the audio samples were utilized. Lastly, the Sahand Emotional Speech database (SES) [21] was used; it was recorded at the Artificial Intelligence and Information Analysis Lab, Department of Electrical Engineering, Sahand University of Technology, Iran. This database contains speech utterances of five basic emotions (neutral: 240, surprise: 240, happiness: 240, sadness: 240, and anger: 240) from 10 speakers (5 male and 5 female). Ten single words, 12 sentences, and 2 passages in the Farsi language were recorded, resulting in 120 utterances per emotion. Figures 1(a)–1(d) show an example of a portion of an utterance spoken by a speaker in four different emotions (neutral, anger, happiness, and disgust). It can be observed from the figures that the structure of the speech signals and their glottal waveforms is considerably different for speech spoken under different emotional states.

3.2. Features for Speech Emotion Recognition

Extraction of suitable features for efficiently characterizing different emotions is still an important issue in the design of a speech emotion recognition system. Short-term, frame-by-frame features are widely used by researchers. All the speech samples were downsampled to 8 kHz. The unvoiced portions between words were removed by segmenting the downsampled emotional speech signals into nonoverlapping frames with a length of 32 ms (256 samples) and examining the energy of the frames. Frames with low energy were discarded, and the remaining (voiced) frames were concatenated and used for feature extraction [17]. The voiced emotional speech signals were then passed through a first-order low pass filter to spectrally flatten the signal and make it less susceptible to finite precision effects later in the signal processing [22]. The first-order preemphasis filter is defined as $H(z) = 1 - a z^{-1}$, where the commonly used value of $a$ is 0.9375 or 0.95 [22]. In this work, $a$ was set equal to 0.9375.

Extraction of the glottal flow signal from a speech signal is a challenging task. In this work, the glottal waveforms were estimated from the preemphasized speech waveforms by inverse filtering and linear predictive analysis [5, 6, 9]. The wavelet and wavelet packet transforms can analyze nonstationary signals in the time and frequency domains simultaneously. The hierarchical wavelet packet (WP) transform decomposes the original emotional speech signals/glottal waveforms into successive subbands. In WP decomposition, both the low and high frequency subbands are used to generate the next level of subbands, which results in finer frequency bands. The energy of the wavelet packet nodes is more robust in representing the original signal than the wavelet packet coefficients themselves, and Shannon entropy is a robust description of the uncertainty over the whole signal duration [23–26].

The preemphasized emotional speech signals and glottal waveforms were segmented into 32 ms frames with 50% overlap. Each frame was decomposed into 4 levels using the discrete wavelet packet transform, and relative wavelet packet energy and entropy features were derived for each decomposition node as
$$E_j = \sum_{k=1}^{N_j} \lvert c_{j,k} \rvert^{2}, \qquad \tilde{E}_j = \frac{E_j}{\sum_{j} E_j}, \qquad S_j = -\sum_{k=1}^{N_j} p_{j,k} \log p_{j,k}, \qquad \tilde{S}_j = \frac{S_j}{\sum_{j} S_j},$$
where $c_{j,k}$ is the $k$th wavelet packet coefficient at node $j$, $p_{j,k} = \lvert c_{j,k}\rvert^{2}/E_j$, $j = 1, 2, \ldots, 2^{L+1}-2$, $L$ is the number of decomposition levels, and $N_j$ is the number of wavelet packet coefficients at node $j$. Daubechies wavelets with 4 different orders (db3, db6, db10, and db44) were used, since Daubechies wavelets are frequently used in speech emotion recognition and provide good results [1, 17, 27, 28]. After the relative wavelet packet energy and entropy features were obtained for each frame, they were averaged over all frames and used for analysis. Four-level wavelet packet decomposition gives 30 nodes, and features were extracted from all the nodes, yielding 60 features (30 relative energy features + 30 relative entropy features). The same features were extracted from the emotional glottal signals, giving a total of 120 features.
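The sketch below illustrates this feature pipeline in Python: pre-emphasis, a rough LPC-based inverse filtering step for the glottal waveform, and relative wavelet packet energy/entropy features over the 30 nodes of a 4-level decomposition. It is a simplified reading of the procedure, not the authors' code; the voicing decision, the averaging over frames, the exact glottal estimation algorithm of [5, 6, 9], and the particular normalisation of the entropy values are assumptions.

```python
# Sketch of the per-frame feature extraction: pre-emphasis, LPC inverse
# filtering (crude glottal estimate), and 4-level WP relative energy/entropy.
import numpy as np
import pywt
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def preemphasize(x, a=0.9375):
    """First-order pre-emphasis H(z) = 1 - a*z^-1."""
    return np.append(x[0], x[1:] - a * x[:-1])

def estimate_glottal(frame, order=10):
    """Rough glottal estimate: autocorrelation LPC + inverse filtering."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + order]
    r[0] *= 1.000001                                   # keep system well conditioned
    a = solve_toeplitz(r[:order], r[1:order + 1])      # predictor coefficients
    inv = np.concatenate(([1.0], -a))                  # inverse (whitening) filter
    residual = lfilter(inv, [1.0], w)                  # glottal derivative estimate
    return np.cumsum(residual - residual.mean())       # crude integration to flow

def wp_relative_features(frame, wavelet="db3", level=4):
    """30 relative energies + 30 relative entropies from WP nodes of levels 1-4."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    energies, entropies = [], []
    for lev in range(1, level + 1):                    # 2 + 4 + 8 + 16 = 30 nodes
        for node in wp.get_level(lev, order="natural"):
            c = np.asarray(node.data)
            e = np.sum(c ** 2)                         # node energy
            p = c ** 2 / (e + 1e-12)                   # normalised coefficient energies
            energies.append(e)
            entropies.append(-np.sum(p * np.log(p + 1e-12)))   # Shannon entropy
    energies, entropies = np.asarray(energies), np.asarray(entropies)
    return np.concatenate([energies / (energies.sum() + 1e-12),
                           entropies / (entropies.sum() + 1e-12)])

# Example on one 32 ms frame at 8 kHz (256 samples of a voiced segment):
frame = preemphasize(np.random.randn(256))                      # placeholder signal
speech_feats = wp_relative_features(frame)                      # 60 speech features
glottal_feats = wp_relative_features(estimate_glottal(frame))   # 60 glottal features
features = np.concatenate([speech_feats, glottal_feats])        # 120 features in total
```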

3.3. Feature Enhancement Using Gaussian Mixture Model

In any pattern recognition application, increasing the interclass variance and reducing the intraclass variance of the features are fundamental to improving classification or recognition accuracy [24, 25, 29]. Several works in the literature aim to increase the discriminating ability of the extracted features [24, 25, 29]. The GMM has been successfully applied in various pattern recognition applications, particularly in speech and image processing [17, 28–36]; however, its capability to enhance the discriminative ability of features has not been extensively explored. These diverse applications of the GMM motivated us to propose a GMM based feature enhancement [17, 28–36]. High intraclass variance and low interclass variance among the features may degrade classifier performance and result in poor emotion recognition rates. To decrease the intraclass variance and increase the interclass variance of the features, GMM based clustering is proposed in this work to enrich the discriminative ability of the relative wavelet packet energy and entropy features. A GMM is a probabilistic model whose use for labelling rests on the assumption that all the data points are generated from a finite mixture of Gaussian distributions. In a model-based approach, certain models are used for clustering, and one attempts to optimize the fit between the data and the model. Each cluster can be represented mathematically by a parametric Gaussian distribution. The entire dataset is modeled by a weighted sum of $M$ Gaussian component densities, given by
$$p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i\, g(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i),$$
where $\mathbf{x}$ is the $D$-dimensional continuous valued vector of relative wavelet packet energy and entropy features, $w_i$, $i = 1, \ldots, M$, are the mixture weights with $\sum_{i=1}^{M} w_i = 1$, and $g(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ are the component Gaussian densities. Each component density is a $D$-variate Gaussian function of the form
$$g(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \frac{1}{(2\pi)^{D/2} \lvert \boldsymbol{\Sigma}_i \rvert^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{T} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right),$$
where $\boldsymbol{\mu}_i$ is the mean vector and $\boldsymbol{\Sigma}_i$ is the covariance matrix. Using the linear combination of mean vectors, covariance matrices, and mixture weights, the overall probability density function is estimated [17, 28–36]. The GMM is trained with an iterative expectation maximization (EM) algorithm that converges to a local optimum and assigns posterior probabilities to each component density with respect to each observation; the posterior probabilities indicate that each data point has some probability of belonging to each cluster [17, 28–36]. The GMM based feature enhancement is summarized (Figure 2) as follows: first, the component means (cluster centers) of each feature in the dataset were found using GMM based clustering. Next, the ratios of the means of the features to their cluster centers were calculated. Finally, these ratios were multiplied with each respective feature.

After applying the GMM clustering based feature weighting, the raw features (RF) are referred to as enhanced features (EF).
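The following is a minimal sketch of one possible reading of this three-step weighting procedure, using scikit-learn's GaussianMixture fitted to each feature column; the number of mixture components and the per-column treatment are assumptions, not the authors' exact configuration.

```python
# One possible reading of the GMM feature-weighting step (a sketch, not the
# authors' code): fit a 1-D GMM per feature column, find the component mean
# ("cluster centre") assigned to each sample, and scale the feature by the
# ratio of the overall feature mean to that centre.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_enhance(X, n_components=2, random_state=0):
    """X: (n_samples, n_features) raw features -> enhanced features, same shape."""
    X_enh = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        col = X[:, j].reshape(-1, 1)
        gmm = GaussianMixture(n_components=n_components,
                              random_state=random_state).fit(col)
        centres = gmm.means_.ravel()[gmm.predict(col)]   # centre of each sample
        ratio = col.ravel().mean() / (centres + 1e-12)   # mean-to-centre ratio
        X_enh[:, j] = col.ravel() * ratio                # weighted (enhanced) feature
    return X_enh

# Usage: EF = gmm_enhance(RF)   # RF: raw 120-D features, EF: enhanced features
```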

The class distribution plots of the raw relative wavelet packet energy and entropy features are shown in Figures 3(a), 3(c), 3(e), and 3(g) for different orders of Daubechies wavelets ("db3," "db6," "db10," and "db44"). The figures show a high degree of overlap among the raw relative wavelet packet energy and entropy features, which results in poor performance of the speech emotion recognition system. The class distribution plots of the enhanced relative wavelet packet energy and entropy features are shown in Figures 3(b), 3(d), 3(f), and 3(h). From these figures, it can be observed that the degree of overlap is greatly reduced, which in turn improves the performance of the speech emotion recognition system.

3.4. Feature Reduction Using Stepwise Linear Discriminant Analysis

The curse of dimensionality is a major issue in all pattern recognition problems: irrelevant and redundant features may degrade the performance of the classifiers. Feature selection/reduction is used to select a subset of relevant features from a large number of features [24, 29, 37]. Several feature selection/reduction techniques have been proposed to find the most discriminating features and improve the performance of speech emotion recognition systems [18, 20, 28, 38–43]. In this work, we propose the use of stepwise linear discriminant analysis (SWLDA), since plain LDA is a linear technique which relies on the mixture model containing the correct number of components and has limited flexibility when applied to more complex datasets [37, 44]. Stepwise LDA uses both forward and backward strategies. In the forward step, the attributes that contribute significantly to the discrimination between the groups are added; this process stops when there are no more attributes to add to the model. In the backward step, the attributes (less relevant features) whose removal does not significantly degrade the discrimination between groups are removed. The F-statistic or its p value is generally used as the predetermined criterion to add/remove attributes. In this work, the selection of the best features is controlled by four different combinations of p-to-enter and p-to-remove values (0.05 and 0.1, SET1; 0.01 and 0.05, SET2; 0.01 and 0.001, SET3; 0.001 and 0.0001, SET4). Taking SET1 as an example, in the feature entry step, the feature that provides the most significant performance improvement is entered into the feature model if its p value < 0.05; in the feature removal step, attributes that do not significantly affect the performance of the classifiers are removed if their p value > 0.1. SWLDA was applied to the enhanced feature set to select the best features and remove irrelevant ones.
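A full Wilks'-lambda SWLDA with F-to-enter/F-to-remove tests is not available off the shelf in common Python libraries, so the sketch below approximates only the forward half with scikit-learn's SequentialFeatureSelector wrapped around LDA; its entry criterion is cross-validated accuracy rather than the p value thresholds described above, and the target feature count is an arbitrary assumption.

```python
# A hedged stand-in for SWLDA: greedy forward selection wrapped around LDA.
# Note: the entry criterion here is cross-validated score, not the F-statistic
# p value used by true stepwise LDA, and there is no backward-removal pass.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector

def swlda_like_selection(X, y, n_features=40):
    lda = LinearDiscriminantAnalysis()
    sfs = SequentialFeatureSelector(lda, n_features_to_select=n_features,
                                    direction="forward", cv=5)
    sfs.fit(X, y)
    return sfs.get_support(indices=True)     # indices of retained features

# Usage: idx = swlda_like_selection(EF, labels); EF_best = EF[:, idx]
```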

The number of selected enhanced features for each combination is tabulated in Table 2. From Table 2, it can be seen that the enhanced relative energy and entropy features of the speech signals were more significant than those of the glottal signals. Approximately 32% to 82% of the insignificant enhanced features were removed across the combinations of p values (0.05 and 0.1, 0.01 and 0.05, 0.01 and 0.001, and 0.001 and 0.0001), and speaker-dependent and speaker-independent emotion recognition experiments were carried out using these significant enhanced features. The results were compared with those of the original raw relative wavelet packet energy and entropy features and with some of the significant works in the literature.

3.5. Classifiers
3.5.1. k-Nearest Neighbor Classifier

The k-NN classifier is an instance-based classifier which predicts the class label of a new test vector by relating it to known training vectors according to a distance/similarity function [25]. The Euclidean distance function was used, and an appropriate value of k was found by searching over values between 1 and 20.
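A minimal sketch of this setup with scikit-learn follows; the 10-fold cross-validated grid search over k is an assumed validation scheme, not necessarily the one used by the authors.

```python
# Sketch of the k-NN setup: Euclidean distance, k searched over 1..20.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn = GridSearchCV(KNeighborsClassifier(metric="euclidean"),
                   param_grid={"n_neighbors": list(range(1, 21))},
                   cv=10)
# knn.fit(EF_best, labels); print(knn.best_params_, knn.best_score_)
```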

3.5.2. Extreme Learning Machine

ELM, a learning algorithm for single hidden layer feedforward networks (SLFNs), was proposed by Huang et al. [45–48]. It has been widely used in various applications to overcome the slow training speed and overfitting problems of conventional neural network learning algorithms [45–48]. The basic idea of ELM is as follows [45–48].

For $N$ training samples $(\mathbf{x}_j, \mathbf{t}_j)$, the output of an SLFN with $L$ hidden nodes can be expressed as
$$\sum_{i=1}^{L} \boldsymbol{\beta}_i\, h(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{t}_j, \qquad j = 1, \ldots, N.$$
It can be written compactly as $\mathbf{H}\boldsymbol{\beta} = \mathbf{T}$, where $\mathbf{x}_j$, $\mathbf{w}_i$, and $b_i$ are the input training vector, the input weights, and the biases of the hidden layer, respectively, $\boldsymbol{\beta}_i$ are the output weights that link the $i$th hidden node to the output layer, and $h(\cdot)$ is the activation function of the hidden nodes. Training an SLFN then amounts to finding a least-squares solution using the Moore-Penrose generalized inverse:
$$\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{T}, \qquad \mathbf{H}^{\dagger} = \left(\mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T} \ \text{or}\ \mathbf{H}^{T}\left(\mathbf{H}\mathbf{H}^{T}\right)^{-1},$$
depending on the singularity of $\mathbf{H}^{T}\mathbf{H}$ or $\mathbf{H}\mathbf{H}^{T}$. Assuming that $\mathbf{H}\mathbf{H}^{T}$ is not singular, the coefficient $1/C$ ($C$ is a positive regularization coefficient) is added to the diagonal of $\mathbf{H}\mathbf{H}^{T}$ in the calculation of the output weights, that is, $\boldsymbol{\beta} = \mathbf{H}^{T}\left(\mathbf{I}/C + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}$. Hence, a more stable learning system with better generalization performance can be obtained.
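A compact NumPy sketch of these equations is given below, with a sigmoid activation, an arbitrary number of hidden nodes, and T assumed to hold one-hot class targets; it illustrates the regularized least-squares solution rather than the authors' implementation.

```python
# Minimal regularised ELM sketch: random input weights/biases, sigmoid hidden
# layer H, and output weights beta = H^T (I/C + H H^T)^{-1} T.
import numpy as np

def elm_train(X, T, n_hidden=100, C=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))      # input weights
    b = rng.standard_normal(n_hidden)                    # hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))               # sigmoid activations
    beta = H.T @ np.linalg.solve(np.eye(X.shape[0]) / C + H @ H.T, T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                                      # one score per class
```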

The output function of the ELM can then be written compactly as
$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T} = \begin{bmatrix} K(\mathbf{x},\mathbf{x}_1) & \cdots & K(\mathbf{x},\mathbf{x}_N)\end{bmatrix}\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\mathrm{ELM}}\right)^{-1}\mathbf{T},$$
where $\boldsymbol{\Omega}_{\mathrm{ELM}}(i,j) = K(\mathbf{x}_i, \mathbf{x}_j)$. In this kernel ELM implementation, the hidden layer feature mapping need not be known to the user, and a Gaussian kernel was used. The best values of the positive regularization coefficient $C$ (equal to 1) and the Gaussian kernel parameter (equal to 10) were found empirically after several experiments.
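A corresponding sketch of the kernel ELM variant is shown below, assuming the common Gaussian parameterisation $K(\mathbf{u},\mathbf{v}) = \exp(-\lVert \mathbf{u}-\mathbf{v}\rVert^{2}/\sigma)$ with C = 1 and kernel parameter 10 as reported; the exact parameterisation used by the authors is an assumption.

```python
# Kernel-ELM sketch matching the output function above, with a Gaussian kernel.
import numpy as np

def gaussian_kernel(A, B, sigma=10.0):
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / sigma)

def kelm_train(X, T, C=1.0, sigma=10.0):
    omega = gaussian_kernel(X, X, sigma)                     # Omega_ELM
    alpha = np.linalg.solve(np.eye(X.shape[0]) / C + omega, T)
    return alpha

def kelm_predict(X_test, X_train, alpha, sigma=10.0):
    return gaussian_kernel(X_test, X_train, sigma) @ alpha   # class scores
```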

4. Experimental Results and Discussions

This section describes the average emotion recognition rates obtained under speaker-dependent and speaker-independent conditions using the proposed methods. To demonstrate the robustness of the proposed methods, 3 different emotional speech databases were used: 2 of them were recorded using professional actors/actresses and 1 was recorded using university students. The average emotion recognition rates for the original raw and enhanced relative wavelet packet energy and entropy features, and for the best enhanced features, are tabulated in Tables 3, 4, and 5. Table 3 shows the results for the BES database. The k-NN and ELM kernel classifiers were used for emotion recognition. The results show that the ELM kernel classifier always performs better than the k-NN classifier in terms of average emotion recognition rate, irrespective of the order of the "db" wavelet. In the speaker-dependent experiment, maximum average emotion recognition rates of 69.99% and 98.98% were obtained with the ELM kernel classifier using the raw and enhanced relative wavelet packet energy and entropy features, respectively. In the speaker-independent experiment, maximum average emotion recognition rates of 56.61% and 97.24% were attained with the ELM kernel classifier using the raw and enhanced features, respectively. The k-NN classifier gives maximum average recognition rates of only 59.14% and 49.12% in the speaker-dependent and speaker-independent experiments, respectively.

The average emotion recognition rates for the SAVEE database are tabulated in Table 4. Only the audio signals from the SAVEE database were used in this experiment. According to Table 4, the ELM kernel achieved a better average emotion recognition rate (58.33%) than the k-NN classifier (50.31%) using all the raw relative wavelet packet energy and entropy features in the speaker-dependent experiment. Similarly, maximum emotion recognition rates of 31.46% and 28.75% were obtained in the speaker-independent experiment using the ELM kernel and k-NN classifiers, respectively.

After GMM based feature enhancement, the average emotion recognition rate improved to 97.60% with the ELM kernel classifier and 94.27% with the k-NN classifier in the speaker-dependent experiment. In the speaker-independent experiment, maximum average emotion recognition rates of 77.92% (ELM kernel) and 69.17% (k-NN) were achieved using the enhanced relative wavelet packet energy and entropy features. Table 5 shows the average emotion recognition rates for the SES database. As these emotional speech signals were recorded from nonprofessional actors/actresses, the average emotion recognition rates dropped to 42.14% and 27.25% in the speaker-dependent and speaker-independent experiments, respectively. Using the proposed GMM based feature enhancement method, the average emotion recognition rates increased to 92.79% and 84.58% in the speaker-dependent and speaker-independent experiments, respectively. The superior performance of the proposed methods in all the experiments is mainly due to the GMM based feature enhancement and the ELM kernel classifier.

A paired t-test was performed on the emotion recognition rates obtained using the raw and enhanced relative wavelet packet energy and entropy features, with a significance level of 0.05. In almost all cases, the emotion recognition rates obtained using the enhanced features were significantly better than those obtained using the raw features. The results of the proposed method cannot be compared directly with the literature presented in Table 1, since the division of datasets is inconsistent: the number of emotions used, the number of datasets used, inconsistency in the usage of simulated or naturalistic speech emotion databases, and lack of uniformity in the computation and presentation of the results. Most researchers have used 10-fold cross validation or conventional validation (one training set + one testing set), and some have tested their methods under speaker-dependent, speaker-independent, gender-dependent, and gender-independent conditions. In this work, however, the proposed algorithms were tested on 3 different emotional speech corpora under both speaker-dependent and speaker-independent conditions, and they yielded better emotion recognition rates than most of the significant works presented in Table 1.
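For completeness, the snippet below illustrates such a paired comparison with SciPy; the accuracy values are dummies for illustration only, not results from this work.

```python
# Illustrative paired t-test on per-fold recognition rates (values are dummies).
import numpy as np
from scipy.stats import ttest_rel

raw_rates = np.array([0.55, 0.58, 0.60, 0.57, 0.56])   # dummy fold accuracies, raw
enh_rates = np.array([0.95, 0.97, 0.96, 0.98, 0.97])   # dummy fold accuracies, enhanced
t_stat, p_value = ttest_rel(enh_rates, raw_rates)
print(p_value < 0.05)   # True -> improvement is significant at the 0.05 level
```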

5. Conclusions

This paper proposes a new feature enhancement method, based on the Gaussian mixture model, for improving multiclass speech emotion recognition. Three different emotional speech databases were used to test the robustness of the proposed methods. Both the emotional speech signals and their glottal waveforms were used in the emotion recognition experiments. They were decomposed using the discrete wavelet packet transform, and relative wavelet packet energy and entropy features were extracted. The GMM based feature enhancement method was used to diminish the within-class variance and to increase the between-class variance of the features. The most significant enhanced features were found using stepwise linear discriminant analysis. The findings show that the GMM based feature enhancement significantly improves the discriminatory power of the relative wavelet packet energy and entropy features and hence the performance of the speech emotion recognition system, particularly for multiclass emotion recognition. In future work, more low-level and high-level speech features will be derived and tested with the proposed methods. Other filter, wrapper, and embedded feature selection algorithms will be explored, and the results will be compared. The proposed methods will also be tested under noisy environments and in multimodal emotion recognition experiments.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported by a research grant under the Fundamental Research Grant Scheme (FRGS), Ministry of Higher Education, Malaysia (Grant no. 9003-00297), and Journal Incentive Research Grants, UniMAP (Grant nos. 9007-00071 and 9007-00117). The authors would like to express their deepest appreciation to Professor Mohammad Hossein Sedaaghi from Sahand University of Technology, Tabriz, Iran, for providing the Sahand Emotional Speech database (SES) for this analysis.