Computer Intelligence in Modeling, Prediction, and Analysis of Complex Dynamical Systems
Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System
This paper discusses the impact of the classification method and feature selection on the accuracy of speech emotion recognition. Selecting the right parameters in combination with the classifier is an important part of reducing the computational complexity of the system. This step is especially necessary for systems that will be deployed in real-time applications. Speech emotion recognition systems are being developed and improved because of their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured with respect to the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and groups of features for stress detection in human speech. The research contribution lies in the design of a speech emotion recognition system with regard to its accuracy and efficiency.
Applications and services are increasingly designed to support natural interaction between humans and computers. Giving commands by voice and movement is very popular nowadays. Most of the information is extracted from human speech with rather good accuracy. Human speech also carries secondary information that describes properties of the speaker: age, gender, emotional state, speech errors, and other features. These source characteristics are highly valuable, because they can be simulated only by a person with good acting skills. As the title suggests, this paper describes a system for classifying the emotional state of human speech. Emotion is a human characteristic that describes a person's mental condition and causes physiological changes in the body. These changes are also reflected in speech. Information about the emotional state is in demand in many fields. Customer satisfaction and interest in products can be evaluated statistically from the customer's emotional state, which is a direct response to a stimulus. Call center agents can be evaluated with regard to their work and their approach to the customer. New agents can be trained and taught the correct procedure for communicating with customers. A human body influenced by stronger emotions becomes stressed. Sectors such as the police, fire services, and especially the military put the greatest emotional pressure on their employees. The dispatching of orders can be directly influenced by information from a speech emotion recognition system. A speech signal can also serve as an authorization key in access systems. Because speech is affected by the physiological changes that accompany emotions, an authorized user can be denied access when the authorization unit treats stressed speech as a wrong key. These are just the first examples of uses for speech emotion recognition systems.
It is obvious that such a system will find wide application in human-machine interaction. It is therefore appropriate to identify the classification ability of different classifiers for different emotional states. A related but more extensive research summary is the article “Survey on speech emotion recognition: features, classification schemes, and databases” by El Ayadi et al., as mentioned in [1–4].
2. Speech Emotion Recognition System
The system design consists of several blocks, each responsible for one major function. The input values are audio signals from the created database, which are used for training and testing. The block diagram of the system is shown in Figure 1.
The quality of the input data, the audio signal in this case, has a direct impact on the classification accuracy. For this reason, the Berlin database is used; it contains over 500 recordings of male and female actors. The database contains 10 sentences in seven emotional states. This corpus of recordings is considered a high-quality database, because it was created by professional actors in a sound studio. Three blocks of the diagram represent the points of view for emotion recognition. The system can be designed to detect the stress of the speaker (the first option) or to recognize all emotional states, of which there are seven in the Berlin database (the second option). Other approaches to the problem are represented by the third block. As mentioned, the speech signal has to be modified by routine preprocessing operations such as removing the DC component, preemphasis, and segmentation of the stochastic signal into quasiperiodic frames.
The recognition system is context independent; that is, it takes into account only signal parameters, not content information. These parameters are the training and testing vectors for the classifiers [5–7].
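The preprocessing chain named above (DC removal, preemphasis, segmentation into frames) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation; the preemphasis coefficient `alpha = 0.97`, the frame length of 400 samples, and the hop of 160 samples are assumed values that the paper does not state.

```python
def remove_dc(signal):
    """Subtract the mean so that the signal is centred on zero."""
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]


def preemphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is an assumed, commonly used value.
    """
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def split_into_frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames; an incomplete tail is dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """DC removal, then preemphasis, then segmentation into frames.

    frame_len and hop are assumptions (25 ms and 10 ms at 16 kHz).
    """
    return split_into_frames(preemphasis(remove_dc(signal), alpha),
                             frame_len, hop)
```

Each returned frame is short enough to be treated as quasi-stationary, so features can be computed frame by frame.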
The parameters are calculated by the feature extraction block, which extracts the following:
(i) 39 Mel-frequency cepstral coefficients (MFCC), including the dynamic parameters (first and second derivatives of MFCC),
(ii) 12 linear prediction coefficients (LPC),
(iii) 12 line spectral pairs (LSP),
(iv) 8 prosodic features (RMS energy, log-energy, zero crossing rate (ZCR), mean crossing rate (MCR), position of maximum, maximum, minimum, and harmonic-to-noise ratio (HNR)).
Individual studies show that no single classifier can be called the best for emotion recognition. Each classifier, or combination of classifiers, achieves an accuracy that depends on several factors. The success of a classifier depends directly on the data: accuracy varies with characteristics such as the quantity of data, the distribution of the classes (emotions), and the language. A classifier gives different results on an acted database, where the emotions are evenly represented, than on real call center data, where the normal (calm) emotional state makes up 85 to 95 percent of all data. An appropriate choice of parameters also has a considerable effect on the accuracy of these classifiers. The following subsections describe the classification methods used.
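To make some of the prosodic features concrete, the sketch below computes RMS energy, log-energy, zero crossing rate, and mean crossing rate for a single frame. The definitions are common textbook ones and are an assumption here; the extraction tool used in the experiment may define them slightly differently (for example, in how log-energy is floored or normalized).

```python
import math


def rms_energy(frame):
    """Root mean square of the samples in the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def log_energy(frame, floor=1e-12):
    """Natural logarithm of the frame energy; floor avoids log(0) on silence."""
    return math.log(sum(x * x for x in frame) + floor)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def mean_crossing_rate(frame):
    """Zero crossing rate measured after subtracting the frame mean."""
    mean = sum(frame) / len(frame)
    return zero_crossing_rate([x - mean for x in frame])
```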
3.1. Artificial Neural Network
Our emotional state classification problem, with its high number of parameters, can be considered a pattern recognition problem. In this case, a two-layer feedforward network can be used. A two-layer feedforward network with sigmoid hidden and output neurons can classify vectors arbitrarily well, given enough neurons in its hidden layer. The network is trained with scaled conjugate gradient (SCG) backpropagation.
We shall denote the input values to the network by $x_i$, where $i = 1, \ldots, d$. The first layer of the network forms linear combinations of these inputs to give a set of intermediate activation variables $a_j^{(1)}$:

$$a_j^{(1)} = \sum_{i=1}^{d} w_{ji}^{(1)} x_i + b_j^{(1)}, \quad j = 1, \ldots, M,$$

with one variable $a_j^{(1)}$ associated with each hidden unit. Here $w_{ji}^{(1)}$ represents the elements of the first-layer weight matrix and $b_j^{(1)}$ are the bias parameters associated with the hidden units. A demonstration of such a network with speech features as an input, 5 hidden layers, and two output classes is shown in Figure 2.
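The forward pass of such a two-layer network can be sketched in a few lines of plain Python. Each layer forms the weighted sums plus biases described above and applies a sigmoid. The function and argument names (`forward`, `w1`, `b1`, `w2`, `b2`) are illustrative assumptions, and the weights would come from training, which is not shown.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def layer(inputs, weights, biases):
    """One layer: weighted sum of the inputs plus bias, then sigmoid.

    weights holds one row of input weights per unit in the layer.
    """
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(x, w1, b1, w2, b2):
    """Two-layer feedforward pass: input -> hidden layer -> output layer."""
    hidden = layer(x, w1, b1)
    return layer(hidden, w2, b2)
```

With two output units, the index of the larger output can be read as the predicted class, for example neutral versus stress.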
SCG training minimizes the mean squared error using gradient information and avoids the line search in each learning iteration by using a Levenberg-Marquardt approach to scale the step size. The weights of the network, that is, all of its weight and bias parameters, are expressed in vector notation:

$$\mathbf{w} = \left(\ldots, w_{ji}^{(1)}, \ldots, b_j^{(1)}, \ldots\right).$$

The vector $-\nabla E(\mathbf{w})$ points in the direction in which the error function $E(\mathbf{w})$ decreases at the fastest possible rate. The weight update equation is shown below, where $\eta$ is a suitable constant:

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta \, \nabla E(\mathbf{w}_k).$$

The gradient descent method for optimization is very simple and general. Only local information, used to estimate the gradient, is needed to find the minimum of the error function [9, 10].
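The weight update rule can be illustrated with plain gradient descent on a toy error function. This is deliberately not SCG: the step size `eta` is a fixed, assumed constant, whereas SCG scales the step automatically. The quadratic error and its target values are hypothetical and serve only to show that repeated updates move the weights toward the minimum.

```python
def gradient_descent(grad, w, eta=0.1, steps=100):
    """Repeat the update w <- w - eta * grad(w) a fixed number of times."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


def quadratic_error_gradient(w, target=(1.0, -2.0)):
    """Gradient of the toy error E(w) = sum_i (w_i - target_i)^2."""
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]
```

Starting from `[0.0, 0.0]`, the weights converge toward the hypothetical target `(1.0, -2.0)`, because each step shrinks the remaining distance by a constant factor.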
3.2. k-Nearest Neighbour
The k-NN classifier is based on the principle of learning by analogy. Samples from the training set have numeric attributes, and each sample represents a point in $n$-dimensional space. The classifier searches this space of training samples to determine the shortest distances between the training samples and the unknown sample. Euclidean and other distances can be computed. In other words, an object is classified by a majority vote of its neighbours, with the object being assigned to the class most common amongst its $k$ nearest neighbours ($k$ is a positive integer, typically small). If $k = 1$, then the object is simply assigned to the class of its nearest neighbour. Given an $m$-by-$n$ data matrix $X$, treated as $m$ (1-by-$n$) row vectors $x_1, x_2, \ldots, x_m$, the neighbourhood distance between the vectors $x_s$ and $x_t$ is calculated with the Euclidean metric:

$$d_{st}^2 = (x_s - x_t)(x_s - x_t)^{\top}.$$
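The majority vote described above can be sketched as follows. This is a brute-force illustration in plain Python with assumed function names; it sorts all training samples by squared Euclidean distance, which gives the same neighbour ordering as the Euclidean distance itself. The default of 5 neighbours matches the setting reported in the experiment.

```python
from collections import Counter


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(train_vectors, train_labels, query, k=5):
    """Assign the query to the class most common among its k nearest neighbours."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: squared_euclidean(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    # The class with the most votes among the k nearest samples wins.
    return votes.most_common(1)[0][0]
```

For large training sets a spatial index would normally replace the full sort; that optimization is omitted here for clarity.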
3.3. Gaussian Mixture Model
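A Gaussian mixture model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in biometric systems, such as vocal tract features in speaker recognition systems. The probability distribution of the parameter vectors derived from human speech can be described using a GMM:

$$p(\mathbf{o} \mid \lambda_s) = \sum_{i=1}^{M_s} w_i \, p_i(\mathbf{o}),$$

where $M_s$ is the number of components for class $s$; $w_i$, $i = 1, \ldots, M_s$, are the weights of the components, which satisfy the condition that the sum of all weights is 1; and $p_i(\mathbf{o})$ is the probability density of the $i$-th component, represented by the mean value $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$. The Gaussian model for class “$s$” is defined by (6):

$$\lambda_s = \{ w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i \}, \quad i = 1, \ldots, M_s. \tag{6}$$

The maximum likelihood criterion depends on the probability density $p$ and the sequence of parameter vectors $O = (\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_T)$, as seen below [11, 12]:

$$L(\lambda_s) = \prod_{t=1}^{T} p(\mathbf{o}_t \mid \lambda_s).$$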
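Classification with per-class mixture models can be sketched as follows: each class has its own set of weights, means, and variances, and an utterance is assigned to the class whose model gives the highest likelihood over all of its frames. The sketch assumes diagonal covariance matrices and already trained models (training, typically done with the EM algorithm, is not shown), and it works in the log domain so that the product over frames becomes a sum. All names are illustrative.

```python
import math


def log_gaussian_diag(o, mean, var):
    """Log density of a Gaussian with diagonal covariance at vector o."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
                      for x, m, v in zip(o, mean, var))


def log_gmm_density(o, model):
    """Log of the weighted sum of component densities.

    model is a list of (weight, mean, var) tuples; the weights sum to 1.
    The log-sum-exp form avoids numerical underflow.
    """
    terms = [math.log(w) + log_gaussian_diag(o, m, v) for w, m, v in model]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def log_likelihood(observations, model):
    """Sum of log densities over all frames (log of the likelihood product)."""
    return sum(log_gmm_density(o, model) for o in observations)


def classify(observations, models):
    """Return the class name whose model maximizes the log-likelihood."""
    return max(models, key=lambda name: log_likelihood(observations, models[name]))
```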
The aim of the experiment was to clarify the significance of the chosen groups of features, as well as the classification ability of the selected classification methods for a speech emotion recognition system. The examined samples were formed from recordings of human speech with various emotional characters. The following settings and features were used in the experiment:
(i) input samples: Berlin database of emotional utterances:
    (a) 10 different sentences recorded by 10 different actors (both genders),
    (b) over 530 samples consisting of 7 emotions: anger, boredom, disgust, fear, happiness, sadness, and neutral state;
(ii) feature extraction: computing of input vectors (speech parameters):
    (a) 13 MFCC coefficients, 13 dynamic coefficients, and 13 acceleration coefficients of MFCC,
    (b) 12 LPC coefficients,
    (c) 12 LSP coefficients,
    (d) 8 prosodic features;
(iii) emotion classification:
    (a) GMM: 64 mixture components,
    (b) k-nearest neighbours (set to 5 neighbours),
    (c) artificial neural network: feedforward backpropagation.
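The 39-dimensional MFCC vector listed above combines 13 static coefficients with 13 dynamic and 13 acceleration coefficients. One common way to obtain the latter two is the regression formula sketched below, applied once for the dynamic coefficients and a second time for the acceleration coefficients. The regression window of 2 frames and the edge handling (repeating the first and last frame) are assumptions; the paper does not state how its derivatives were computed.

```python
def delta(frames, width=2):
    """Regression-based time derivative of a sequence of feature vectors.

    frames is a list of equal-length vectors, one per frame. Frames beyond
    the edges are replaced by the first or last frame (assumed padding).
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    result = []
    for t in range(num_frames):
        vector = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            vector.append(acc / denom)
        result.append(vector)
    return result


def stack_mfcc_features(mfcc):
    """Concatenate static, dynamic, and acceleration coefficients per frame."""
    dynamic = delta(mfcc)
    acceleration = delta(dynamic)
    return [c + d + a for c, d, a in zip(mfcc, dynamic, acceleration)]
```

Given 13 static coefficients per frame, `stack_mfcc_features` returns 39 values per frame.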
One point of view is the recognition of a stressed-out person, which means recognizing deviations from the neutral state. A stress state is not defined in the Berlin database. Therefore, it was necessary to assemble a set of data, the so-called “stress cocktail,” from the defined emotional states.
The stress of a person can be assembled from emotional states other than the neutral state. The emotions anger and fear were used to compile the stress data set in a 50/50 ratio, because these emotional states are reflected most often when a person is exposed to stressful situations. Fear and anger were also selected because of their major sound differences from the neutral state. Confusion matrices for each set of features are shown in Tables 2, 3, 4, and 5. Table 1 describes the meaning of the cells. True positive (TP) represents the correctly classified first class (neutral), and true negative (TN) represents the correctly classified second class (stress).
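The cell counts of such a two-class confusion matrix, and the measures derived from them, can be sketched as follows. The sketch follows the convention stated above, with neutral as the positive (first) class and stress as the negative (second) class; the names FN and FP for the two error cells are the standard ones and are assumed here. It also assumes that both classes occur in the test data.

```python
def confusion_counts(true_labels, predicted_labels, positive="neutral"):
    """Count TP, FN, FP, and TN, with `positive` as the first class."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            counts["TP" if predicted == positive else "FN"] += 1
        else:
            counts["FP" if predicted == positive else "TN"] += 1
    return counts


def accuracy(c):
    """Share of all samples that were classified correctly."""
    return (c["TP"] + c["TN"]) / sum(c.values())


def sensitivity(c):
    """Share of first-class (neutral) samples classified correctly."""
    return c["TP"] / (c["TP"] + c["FN"])


def specificity(c):
    """Share of second-class (stress) samples classified correctly."""
    return c["TN"] / (c["TN"] + c["FP"])
```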
The GMM, k-NN, and ANN were used to classify the stress versus neutral state. Results for all three classifiers are shown in Figure 3. The receiver operating characteristic (ROC) is applied for better understanding of the system. The ROC curve is a tool for the evaluation and optimization of a binary classification system (test); it shows the relationship between the sensitivity and specificity of the test or detector for all possible threshold values.
The results in Tables 6, 7, and 8 describe the classification accuracy of each type of classifier trained on the best-scoring MFCC features of an emotion pair. Each classifier was trained on a pair of emotions, and the values in the tables show the tested ability to recognize the emotional state given in the left table header. All three classifiers showed the best recognition ability for the emotional state of anger. The emotional state of sadness was also recognized very well. On the other hand, the system recognized fear (GMM and ANN) and disgust (ANN) worst.
The neutral state versus stress scenario has been used for evaluating the accuracy of the classification methods and features. The results show that the most precise method for recognizing stressed human speech is the artificial neural network, which achieved the best results for all sets of parameters (90% for MFCC). The most significant feature for emotion classification is MFCC. This is demonstrated by the accuracies of all the classifiers and by the relationship between sensitivity and specificity in the ROC curves shown in Figure 3. One of the reasons is the individuality of the MFCC coefficients, which are not mutually correlated.
This experiment shows that these classification methods can be used for the recognition of emotional state. At the same time, the question arises of which emotional states characterize stress. The answer will probably depend on the system in which the recognition is applied. Another fact is that we cannot determine the intensity of the emotions acted in the Berlin database. One of the main future tasks will be to compare these results with emotional recordings made in realistic environmental conditions.
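An ROC curve of this kind can be computed by sweeping a decision threshold over the classifier scores, as in the sketch below. It assumes that each test sample has a numeric score where a higher value means "more likely the positive class" (neutral, following the convention used for the confusion matrices) and that both classes are present. The points are returned as (1 − specificity, sensitivity) pairs, which is the usual plotting convention; the function names are illustrative.

```python
def roc_points(scores, labels, positive="neutral"):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    A sample is predicted positive when its score is >= the threshold.
    """
    num_pos = sum(1 for label in labels if label == positive)
    num_neg = len(labels) - num_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, label in zip(scores, labels)
                 if s >= threshold and label == positive)
        fp = sum(1 for s, label in zip(scores, labels)
                 if s >= threshold and label != positive)
        points.append((fp / num_neg, tp / num_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An area close to 1 indicates that the scores separate the two classes well, while an area near 0.5 corresponds to chance-level separation.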
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
M. Zarkowski, “Identification-driven emotion recognition system for a social robot,” in Proceedings of the 18th International Conference on Methods and Models in Automation and Robotics (MMAR '13), pp. 138–143, August 2013.
S. Bakhshi, D. Shamma, and E. Gilbert, “Faces engage us: photos with faces attract more likes and comments on Instagram,” in Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems (CHI '14), pp. 965–974, ACM, New York, NY, USA, 2014.
M. A. R. Ahad, Motion History Images for Action Recognition and Understanding, Springer, London, UK, 2013.
M. Voznak, F. Rezac, and J. Rozhon, “Speech quality monitoring in Czech national research network,” Advances in Electrical and Electronic Engineering, vol. 8, no. 5, pp. 114–117, 2010.
P. Partila, M. Voznak, M. Mikulec, and J. Zdralek, “Fundamental frequency extraction method using central clipping and its importance for the classification of emotional state,” Advances in Electrical and Electronic Engineering, vol. 10, no. 4, pp. 270–275, 2012.
F. Eyben, F. Weninger, M. Wollmer, and B. Schuller, openSMILE: the Munich open Speech and Music Interpretation by Large Space Extraction toolkit, TU Munchen, 2013, http://opensmile.sourceforge.net/.
H. Hu, M.-X. Xu, and W. Wu, “GMM supervector based SVM with spectral features for speech emotion recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), pp. IV-413–IV-416, Honolulu, Hawaii, USA, April 2007.