Abstract

The speech emotion classification method proposed in this paper is based on a Student’s t-mixture model with an infinite number of components (iSMM) and can directly recognize various kinds of speech emotion samples. Compared with the traditional Gaussian mixture model (GMM), a speech emotion model based on the Student’s t-mixture can effectively handle the outliers that exist in the emotion feature space. Moreover, the t-mixture model remains robust to atypical emotion test data. To address the high data complexity caused by the high-dimensional space and the problem of insufficient training samples, a global latent space is added to the emotion model. This approach allows a countably infinite number of components and forms the iSMM emotion model, which can automatically determine the best number of components with lower complexity when classifying various kinds of emotion feature data. In experiments on one spontaneous (FAU Aibo Emotion Corpus) and two acted (DES and EMO-DB) speech emotion databases, which have high-dimensional feature samples and diverse data distributions, the iSMM achieves better recognition performance than the compared methods. The iSMM emotion model is thus verified as a robust method that remains effective and generalizes to outliers and high-dimensional emotion features.

1. Introduction

In naturalistic human-computer interaction (HCI), speech emotion recognition (SER) is becoming increasingly important in various applications. In the literature, studies on speech emotion recognition may be categorized as follows: (a) the search for robust speech emotion features, (b) the study of emotion classification models (the Gaussian mixture model (GMM) [1] and regression are adopted, respectively, for the discrete emotion model and the continuous dimensional model [2–4]), (c) the study of naturalistic emotional behavior, and (d) approaches that consider complex environmental factors, for example, context-dependent [5], cross-language, and gender-dependent speech emotion recognition [6].

Because of their simple structure and easy implementation, classifier models with Gaussian kernels are commonly used in speech emotion recognition. However, how well a Gaussian kernel partitions the feature space largely depends on how well the training samples themselves characterize the features. If a certain proportion of outliers (singular points) arises in the actual emotion feature samples, the number of partitioned components may surge, which would seriously affect the recognition performance. Moreover, in previous research on emotion models [7–9], the number of model components usually has to be obtained through extensive training on data. When underfitting of the emotion characteristics leads to a mismatch between test samples and training samples, the choice of the number of components is affected, which degrades classifier performance. To solve these problems, this paper introduces a feature classification method for speech emotion based on Student’s t-mixture model with an infinite number of components (iSMM). First, based on a nonparametric Bayesian statistical distribution [10], a Student’s t-distribution emotion model is set up for modeling speech emotion features in high-dimensional space, so that feature classification proceeds effectively with an appropriate number of components. Then, the iSMM models are based on normal distributions [11]. Compared to other models, their unique “long-tail” structures make them more robust when handling outliers in speech emotion features and observed data with underfitted sample spaces. Finally, by using latent variables with the mixture model distributions, the iSMM emotion model in this paper can automatically determine the best number of components to conduct the optimal partition of the feature space. Meanwhile, outliers in various kinds of speech emotion features can be effectively dealt with, and the problem of model size selection can be solved simultaneously.

In order to verify our proposed method, we adopt three different databases in two different languages: Danish and German. The cross-language speech emotion recognition is still a challenge. Although the emotion expression has a universal character in different nations and cultures, the acoustic features are greatly impacted by different languages. Hence, some of the emotion recognition algorithms may achieve good result in one language while having poor performance in another one.

The remainder of this paper is organized as follows. Section 2 introduces Student’s t-mixture model. Section 3 describes in detail how the emotion model based on the iSMM is established. Section 4 gives the experimental results on different databases and different languages. Finally, the conclusion is given in Section 5.

2. Speech Emotion Feature Classifiers Based on Statistical Models

FA (factor analysis) is a statistical model to process high-dimensional observed data of speech emotion. FA consists of low-dimensional latent factors, generally, used for dimensionality reduction of local offline data, such as GMM. The combination of a series of factor analyses can form finite local MFA (mixture factor analysis). However, during the process of MFA applied, standardization may usually lead to distribution errors on several factor components, which can further result in the decline of performance with the presence of underfitting data and outliers. The common solution is adopting an independent dimensionality reduction strategy instead of MFA, such as LDA (Linear Discriminant Analysis) and PCA (Principal Component Analysis). However, not only does this increase the complexity of the classification scheme, but processing the singular values or outliers of speech emotion features is also very difficult.

As shown in Figure 1, Student’s t-distribution is introduced to replace the normal distribution in the MFA, constructing the finite Student’s t-distribution mixture model (SMM) [12]. Specifically, the degrees of freedom are integrated into Student’s t-distribution as a new parameter, giving the model distribution a “long tail.” As a result, the emotion model remains robust to outliers and atypical feature data.
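To make the “long tail” concrete, the following minimal sketch compares the log-density that a Gaussian and a Student’s t-distribution assign to a point far from the center. The function names and the choice of 3 degrees of freedom are illustrative assumptions and are not taken from the paper.

```python
import math


def gaussian_logpdf(x, mu=0.0, sigma=1.0):
    """Log-density of a univariate Gaussian N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def student_t_logpdf(x, mu=0.0, sigma=1.0, nu=3.0):
    """Log-density of a univariate Student's t (location mu, scale sigma, nu d.o.f.)."""
    z = (x - mu) / sigma
    log_norm = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return log_norm - 0.5 * (nu + 1.0) * math.log1p(z * z / nu)


if __name__ == "__main__":
    # The gap between the two log-densities widens as x moves into the tail.
    for x in (0.0, 2.0, 8.0):
        print(x, gaussian_logpdf(x), student_t_logpdf(x))
```

Because the t log-density decays only logarithmically in the squared distance, a single outlier costs the model far less likelihood than under a Gaussian, which is one way to read the robustness claim above.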

For speech emotion classifiers based on distribution models, a finite mixture model is applied to assess the observed emotion sample data. The density of a finite mixture model with $g$ components has the form

$$f(x; \Theta) = \sum_{i=1}^{g} \pi_i f_i(x; \theta_i), \quad (1)$$

where the mixing proportions satisfy $\pi_i > 0$ and $\sum_{i=1}^{g} \pi_i = 1$, $\Theta = (\pi_1, \ldots, \pi_g, \theta_1, \ldots, \theta_g)$ is the vector of parameters, and $f_i(x; \theta_i)$ is the density of the $i$th component in the model. In earlier research on emotion models for SER [9, 13], multivariate Gaussian distributions are often used to express the component densities in the analytical formula of classifiers, as shown in Figure 2(a). However, recent studies have focused on mixture models based on non-Gaussian frameworks, which are used for classification or clustering of high-dimensional data in SER and other fields [14–17].
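As a small illustration of the finite mixture density described above, the sketch below evaluates a weighted sum of component densities and checks the constraint on the mixing proportions. The helper names and the two Gaussian components are hypothetical and serve only as an example.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian, used here as an example component."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, component_pdfs):
    """Density of a finite mixture: sum over components of weight * component density."""
    if len(weights) != len(component_pdfs):
        raise ValueError("one mixing proportion is needed per component")
    if any(w <= 0.0 for w in weights):
        raise ValueError("mixing proportions must be positive")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to one")
    return sum(w * pdf(x) for w, pdf in zip(weights, component_pdfs))


if __name__ == "__main__":
    components = [lambda x: normal_pdf(x, -1.0, 1.0),
                  lambda x: normal_pdf(x, 2.0, 0.5)]
    print(mixture_pdf(0.0, [0.3, 0.7], components))
```

Any component density can be plugged in, so the same function covers the Gaussian case and a Student’s t-component.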

Similar to the Gaussian distribution and the skew-t distribution [18], Student’s t-distribution is well suited to classifying or clustering high-dimensional data. The AECM (alternating expectation-conditional maximization) algorithm [19] is often used to estimate the model parameters. Meanwhile, the BIC (Bayesian information criterion) [20] may be introduced to determine the type of model.

3. Establishment of the Emotion Model Based on iSMM

In some cases, such as when the dimension of the observed data is very high (as is usual in spontaneous speech emotion databases) or when a very large number of components must be partitioned in the emotion feature space, the parameters of the SMM may be hard to control, which degrades classifier performance. Motivated by this, a countably infinite number of mixture components is introduced in this work, constructing the iSMM.

As shown in Figure 2(b), for each component, expectations of mean are denoted by “” and expectations of covariance are represented as the shaded ellipse. So the iSMM propels monitoring data to match the model, rather than aiming at reducing the complexity through adjusting the structure of the models themselves.

3.1. The Emotion Model Based on Student’s t-Distribution

The distribution of the emotional samples in the feature space can be described by multiple superposed Student’s t-functions. Theoretically, the iSMM may fit an arbitrary density function, as long as there are enough mixed Student’s t-components. In our research, the iSMM is applied to modeling various emotions. In detail, each emotion category corresponds to one iSMM, and the category of an utterance is decided by the maximum a posteriori probability criterion. Here, $X$ is the utterance sample and $c$ represents the emotion category, where $c = 1, \ldots, C$. Then, the maximum a posteriori probability is

$$P(\lambda_c \mid X) = \frac{P(X \mid \lambda_c) P(\lambda_c)}{P(X)}. \quad (2)$$

In (2), $P(X \mid \lambda_c)$ is obtained through each iSMM. For a given utterance sample, the probability $P(X)$ of the feature vector is a constant. Assume that each emotion possesses equal probability, so that the probability of the emotion class is $P(\lambda_c) = 1/C$:

$$P(\lambda_c \mid X) \propto P(X \mid \lambda_c), \quad (3)$$

where $X$ is the acoustic feature vector of the corresponding speech utterance and $\lambda_c$ is the feature distribution model, which can be formulated by Student’s t-mixture model with the iSMM. Hence, samples can be identified by

$$c^{*} = \arg\max_{1 \le c \le C} P(X \mid \lambda_c), \quad (4)$$

where $c^{*}$ stands for the emotion class label in the emotion model.
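The decision rule described above (one model per emotion, pick the class with the largest posterior, which reduces to the largest likelihood under equal priors) can be sketched as follows. The function name, the dictionary interface, and the example scores are hypothetical; in practice the log-likelihoods would come from the trained per-emotion models.

```python
import math


def classify_map(log_likelihoods, log_priors=None):
    """Return the label maximizing log p(X | model) + log P(label).

    log_likelihoods: dict mapping emotion label -> log-likelihood of the utterance.
    log_priors: optional dict of log prior probabilities; equal priors if omitted.
    """
    if not log_likelihoods:
        raise ValueError("at least one emotion model is required")
    if log_priors is None:
        # Equal priors add the same constant to every score.
        equal = -math.log(len(log_likelihoods))
        log_priors = {label: equal for label in log_likelihoods}
    scores = {label: ll + log_priors[label] for label, ll in log_likelihoods.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    example = {"anger": -120.5, "neutral": -118.2, "sadness": -125.0}
    print(classify_map(example))
```

Working in the log domain avoids numerical underflow when the likelihood of an utterance is a product over many frames.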

An infinite Student’s t-factor mixture distribution is introduced in this paper to construct the emotion classification model. Suppose that the number of substructures in a model is countably infinite; then, according to the Dirichlet process, a mixture model suitable for the emotion feature database can be obtained.

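The $p$-dimensional emotion feature data $y_j$ ($j = 1, \ldots, n$) can be described as

$$y_j = A u_{ij} + e_{ij} \quad \text{with probability } \pi_i, \quad (5)$$

where $\pi_i$ is the probability that the feature data sample belongs to the $i$th component, $i = 1, \ldots, g$, and $g$ is the number of mixture components partitioned in the emotion feature space. The corresponding $q$-dimensional ($q < p$) factors $u_{ij}$ are distributed independently as $t_q(\xi_i, \Omega_i, \nu_i)$, where $\nu_i$ is called the degrees of freedom, $\xi_i$ is a $q$-dimensional vector, and $\Omega_i$ is a $q \times q$ positive definite symmetric matrix; $\xi_i$ and $\Omega_i$ represent the mean and variance of the $i$th emotion component in the low-dimensional feature subspace. Furthermore, $e_{ij}$ represents the component error of the emotion features, which independently follows $t_p(0, D, \nu_i)$, where $D$ is a $p \times p$ diagonal matrix and $A$ is a $p \times q$ matrix shared by all partitioned emotion components. $A$ is therefore defined as the common emotion factor loading matrix. The t-distributions of $u_{ij}$ and $e_{ij}$ can be regarded as scale mixtures of normal distributions whose precision scalar $w_j$ follows a Gamma distribution; that is,

$$u_{ij} \mid w_j \sim N_q(\xi_i, \Omega_i / w_j), \quad e_{ij} \mid w_j \sim N_p(0, D / w_j), \quad w_j \sim \mathrm{Gamma}(\nu_i / 2, \nu_i / 2). \quad (6)$$

Here, the data space model based on the t-distribution of speech emotion features is obtained, where the parameter set of the emotion model, $\Theta$, contains the parameters $\pi_i$, $A$, $\xi_i$, $\Omega_i$, $D$, and $\nu_i$.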
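The view of a t-distribution as a normal distribution whose precision is scaled by a Gamma variable can be checked numerically. The sketch below is a univariate illustration only; the function name, the seed, and the choice of 10 degrees of freedom are assumptions made for the example.

```python
import math
import random


def sample_student_t(mu, sigma, nu, rng):
    """Draw one Student's t sample via the normal/Gamma scale-mixture representation."""
    # Precision scalar w ~ Gamma(shape=nu/2, rate=nu/2); gammavariate takes a scale,
    # so rate nu/2 becomes scale 2/nu and w has mean 1.
    w = rng.gammavariate(nu / 2.0, 2.0 / nu)
    # Conditional on w, the sample is Gaussian with variance sigma^2 / w.
    return mu + sigma * rng.gauss(0.0, 1.0) / math.sqrt(w)


if __name__ == "__main__":
    rng = random.Random(0)
    nu = 10.0
    draws = [sample_student_t(0.0, 1.0, nu, rng) for _ in range(20000)]
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    # For nu > 2 the theoretical variance is nu / (nu - 2).
    print(mean, var, nu / (nu - 2.0))
```

Small values of the precision scalar produce occasional large deviations, which is where the heavy tail comes from.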

3.2. Probabilistic Distribution Modeling of Emotion Features

Because the SER task discussed in this paper is a supervised machine learning problem, accomplishing emotion classification does not require focusing on the posterior distribution of the stochastic variables of the emotion feature parameters; instead, the analysis of the emotion feature data under test forms the main part of our work.

For this purpose, the distribution of emotion characteristics data tested is evaluated as . According to the principle of the Bayesian framework, it is known that the distribution above can be obtained through the calculation of marginal likelihood about and posterior distribution , as the following formula:where the evaluation on posterior distribution of emotion samples is obtained from training data of each type of emotions. In formula (7), , , and and parameters and are gotten from inference prediction of corresponding emotion categories in the infinite Student’s -distribution.

Furthermore, by applying the E-step of the expectation-maximization (EM) algorithm [21] to the test data , the posterior probability of the latent variables , which is derived from the emotion characteristics, can be obtained. The iSMM, which is composed of several latent weights, can approximate the real distribution of samples by the Student’s t-model. Estimating the iSMM involves three parameters: the mixing weight of each latent component, the mean , and the variance of each t-distribution function. Hereby, the weight of latent variables is The recursive formula of mean and variance is where and are both Student’s t-distribution.
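As a hedged illustration of the kind of update discussed above, the sketch below performs one EM iteration for a finite, univariate Student’s t-mixture with fixed degrees of freedom. These are the standard textbook updates for t-mixtures, not the paper’s exact recursions; all names and the toy data are assumptions.

```python
import math


def t_logpdf(x, mean, var, nu):
    """Log-density of a univariate Student's t with location `mean` and squared scale `var`."""
    d2 = (x - mean) ** 2 / var
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi * var)
            - 0.5 * (nu + 1.0) * math.log1p(d2 / nu))


def em_step(data, weights, means, variances, nu):
    """One EM iteration for a univariate t-mixture with fixed degrees of freedom nu."""
    k_range = range(len(weights))
    taus, us = [], []
    # E-step: responsibilities tau and latent scale weights u for every sample.
    for x in data:
        logs = [math.log(weights[k]) + t_logpdf(x, means[k], variances[k], nu)
                for k in k_range]
        top = max(logs)
        log_norm = top + math.log(sum(math.exp(v - top) for v in logs))
        taus.append([math.exp(v - log_norm) for v in logs])
        us.append([(nu + 1.0) / (nu + (x - means[k]) ** 2 / variances[k])
                   for k in k_range])
    # M-step: weighted updates; a small u shrinks the influence of an outlier.
    new_weights, new_means, new_vars = [], [], []
    for k in k_range:
        s_tau = sum(t[k] for t in taus)
        s_tau_u = sum(t[k] * u[k] for t, u in zip(taus, us))
        mean_k = sum(t[k] * u[k] * x for t, u, x in zip(taus, us, data)) / s_tau_u
        var_k = sum(t[k] * u[k] * (x - mean_k) ** 2
                    for t, u, x in zip(taus, us, data)) / s_tau
        new_weights.append(s_tau / len(data))
        new_means.append(mean_k)
        new_vars.append(var_k)
    return new_weights, new_means, new_vars


if __name__ == "__main__":
    toy = [-0.2, 0.1, 0.0, 0.3, -0.1, 50.0]  # last value is an outlier
    print(em_step(toy, [1.0], [0.0], [1.0], nu=3.0))
```

In the toy run, the outlier at 50 receives a very small latent weight, so the updated mean stays near the bulk of the data instead of being pulled toward the arithmetic mean.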

At last, the forecast distribution of characteristic in each type of emotions can be calculated according to the analyses above.

3.3. Evaluation: Decision Function of SER Based on Student’s t-Distribution

In speech emotion classification, label represents a category of emotions in data sample set , where is the sample set of emotion features, in which is -dimensional emotion data and category number belongs to feature samples of speech for training. During testing, the E-step of the EM algorithm is applied to each emotion sample under test. At the same time, the prediction distribution of features is calculated for each emotion category: . Based on formula (4), the criterion of the speech emotion model can be obtained by Student’s t-distribution as where the emotion feature vector satisfying is extracted from the frames of each utterance in the database in use.

Based on the judgment criterion (12) and the conditional distribution , the emotion model can be established by Student’s t-distribution, and the decision function for feature recognition can be inferred as follows:

At the beginning of testing the emotion model with Student’s t-distribution, the E-step for deriving model parameters is carried out on each test sample . Meanwhile, according to formula (11), the forecast distributions of each emotion category are calculated as . Consequently, the emotion category obtained by formula (13), which possesses the maximum prediction distribution, is the recognition result for the emotion feature data under test.

3.4. Analysis on High-Dimensional Feature Space of Speech Emotion Based on iSMM

The emotion model, in which each variable is in one-to-one correspondence with the observed data, uses the Bayesian criterion and mean field theory to obtain the relevant posterior distribution, which can then be used to determine the model structure for emotion feature classification. Following the principle of nonparametric Bayesian statistics, the iSMM adjusts the number of components on the basis of the established distribution model.

After the evaluation of the decision function for SER is accomplished, further analysis of the probability distributions of the parameters in the emotion model is necessary. The high-dimensional emotion feature samples generated by the iSMM are $Y = \{y_1, \ldots, y_n\}$. The iSMM is formed by adding a countably infinite number of components to the SMM, where the infinite-dimensional latent stochastic variables $z_j$ introduced are in one-to-one correspondence with the emotional feature data $y_j$ used in the experiment. The constraint conditions of the latent emotion feature elements $z_{ij}$ in $z_j$ are $z_{ij} \in \{0, 1\}$ and $\sum_{i=1}^{\infty} z_{ij} = 1$. In Student’s t-distribution, because $i$ is the mixed component of emotion features stemming from the partition of the sample space, the specific feature element here satisfies $z_{ij} = 1$ and all other elements are 0. Therefore, the entire latent variable space of features can be set as $Z = \{z_1, \ldots, z_n\}$. According to formula (3), the marginal distribution of $Z$ conforms to the mixing probabilities $\pi = (\pi_1, \pi_2, \ldots)$, which is given as

$$p(Z \mid \pi) = \prod_{j=1}^{n} \prod_{i=1}^{\infty} \pi_i^{z_{ij}}.$$

Then, based on the principle of nonparametric Bayesian statistics, $\pi$ can be obtained by the stick-breaking [22] method.
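The stick-breaking construction mentioned above can be sketched with a finite truncation: each mixing proportion is a random fraction of whatever length of the unit stick remains. The concentration parameter `alpha`, the truncation level, and the function name are illustrative assumptions; the paper does not state these values here.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw truncated stick-breaking weights with Beta(1, alpha) break fractions.

    Returns the list of weights and the leftover stick length not yet assigned.
    """
    weights = []
    remaining = 1.0
    for _ in range(truncation):
        fraction = rng.betavariate(1.0, alpha)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


if __name__ == "__main__":
    rng = random.Random(1)
    weights, leftover = stick_breaking_weights(alpha=2.0, truncation=50, rng=rng)
    print(weights[:5], leftover)
```

A small `alpha` concentrates the mass on a few components, while a larger value spreads it over more components, which is how the effective number of components can adapt to the data.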

With respect to the data space of emotion features tested, in which it is assumed that samples are obeying identically independent distribution, the decision function can be represented aswhere the hyperparameters and are inferred through the estimation of latent variable by the convergence criterion standard [23]. The relationship between the variables in the iSMM emotion model is shown in Figure 3.

According to the emotion model based on the infinite Student’s t-distribution, the calculation of the posterior distribution in the feature space is essential. Therefore, the marginal likelihood of the partitioned emotion feature space needs to be analyzed:

In detail, because there are interactions between the multiple stochastic variables in the set $\Psi$, an arbitrary distribution $q(\Psi)$ is introduced to approximate the actual posterior distribution $p(\Psi \mid Y)$, following the method based on Kullback-Leibler (K-L) divergence theory described in previous research [24, 25]. Under this assumption, the marginal likelihood of the feature data $Y$ in the emotion model based on Student’s t-distribution can be decomposed as

$$\ln p(Y) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p),$$

where $\mathcal{L}(q) = \int q(\Psi) \ln \frac{p(Y, \Psi)}{q(\Psi)} \, d\Psi$ and $\mathrm{KL}(q \,\|\, p) = -\int q(\Psi) \ln \frac{p(\Psi \mid Y)}{q(\Psi)} \, d\Psi$ is the K-L divergence between the approximate posterior distribution $q(\Psi)$ and the actual posterior distribution $p(\Psi \mid Y)$ [8]. Because the K-L divergence is nonnegative, $\mathcal{L}(q)$ is a lower bound of $\ln p(Y)$. According to the mean field approximation principle [26], the lower bound can be estimated. Accordingly, the predicted parameters of the emotion features can be calculated.
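The relationship between the log marginal likelihood, the lower bound, and the K-L divergence can be verified numerically on a toy model with a discrete latent variable. The joint probabilities and the approximate distribution below are made-up numbers used purely for illustration.

```python
import math


def evidence_elbo_kl(joint, q):
    """Return (log evidence, lower bound, KL divergence) for a discrete latent variable.

    joint[k] is p(x, z = k) for one fixed observation x; q[k] is the approximate
    posterior probability of z = k and must sum to one.
    """
    evidence = sum(joint)
    posterior = [j / evidence for j in joint]
    elbo = sum(qk * (math.log(jk) - math.log(qk))
               for qk, jk in zip(q, joint) if qk > 0.0)
    kl = sum(qk * (math.log(qk) - math.log(pk))
             for qk, pk in zip(q, posterior) if qk > 0.0)
    return math.log(evidence), elbo, kl


if __name__ == "__main__":
    log_evidence, elbo, kl = evidence_elbo_kl([0.02, 0.05, 0.01], [0.5, 0.3, 0.2])
    # The log evidence equals the lower bound plus the (nonnegative) KL term.
    print(log_evidence, elbo, kl, elbo + kl)
```

The bound becomes tight exactly when the approximate distribution equals the true posterior, which is why maximizing the lower bound is a sensible surrogate for fitting the posterior.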

Therefore, the emotion model could automatically determine the best number of components with lower complexity to complete various kinds of emotion characteristics data classification.

4. Experimental Results

4.1. Experimental Settings

In the experiment, we evaluated the iSMM emotion model for SER on three widely used international speech databases. As in previous related research [27–29], EMO-DB and the Danish Emotional Speech (DES) database are acted-speech databases; their utterances are obtained from speakers simulating and performing emotions. These two databases are chosen to test the performance of the iSMM in dealing with different amounts of data and speech categories. The Aibo database is composed of spontaneous speech, that is, natural speech emotion. We use Aibo in the experiment to evaluate the recognition performance of the iSMM emotion model on speech emotion samples with unevenly distributed properties.

We adopted the OpenEAR toolkit [30] to extract numerous low-level descriptors (LLD) including the delta and double-delta functions and achieved 1,582 dimensional features in total. After applying the LLD description criterion, the mapping static feature vectors could be obtained including original signal, the signal energy, pitch, timbre, frequency, and MFCCs. Particularly, the focus in our work is the performance of the iSMM emotion model proposed, which is applied to the SER, rather than research on the optimization algorithm about selecting features. Accordingly, any promotion in recognition performance could be given the credit to the emotion model proposed.

In this experiment, a set of   -dimensional speech feature data from various corpus set is generated by the SMM [12], whose true underlying group structure is known: the number of components is the number of emotion classes (5 or 7) and the dimension of factors is 2. In each observed data , represents dimensional subvector containing the latent stochastic variables , while is a -dimensional subvector of test variables with . Each observed data is obtained by both SMM and iSMM by substitution into formulas in Section 3:

submatrix , the common factor loading matrix, isand all elements of submatrix related to the singularities are zero. Assume that ; then the mixing proportion vector is . The mean vectors of are given by , , , , and and their covariance matrices are specified asThe degrees of freedom are and . The covariance matrix of is a diagonal matrix, whose diagonal elements are specified to be 0.2.

4.2. Experimental Analysis on Acting Corpus Using iSMM

We conduct SER on the three general databases using four emotion models: iSMM, SMM, GMM, and kNN (k-nearest neighbor) [31]. The Berlin database, also known as EMO-DB, contains utterances spoken in German. EMO-DB was recorded from the acted speech of five male and five female actors and includes seven kinds of emotion: anger, disgust, fear, sadness, boredom, neutral, and happiness. There are a total of 800 utterances in EMO-DB. In the literature [9], 20 respondents were invited to conduct a perception test of 494 utterances. The results show that, in EMO-DB, the proportion of utterances judged natural reaches 60% and the accuracy of emotion selection is above 80%. The DES database contains 341 utterances covering 5 kinds of emotion: anger, happiness, neutral, sadness, and surprise.

Figure 4 shows the normalized confusion matrix of the SER experiment using the iSMM on the EMO-DB database. Compared with the improved GMM in the literature [9], the iSMM proposed in our research greatly improves SER performance. Specifically, the recognition rates of the “anger” and “disgust” emotions are close to 100%. Owing to Student’s t-distribution, when dealing with a database with insufficient speech samples such as EMO-DB, the EM algorithm is applied to the test emotion data to infer the latent variables of the emotional features. Thus, the prediction distribution of emotion features in each category could be deduced as ; as a result, the unstable recognition performance caused by an underfitted feature space can be mitigated.

However, analyzing the error result of “fear’’ and “boredom” emotions, which are almost classified to “calm’’ and “sad,’’ it is shown that the correlations of emotion feature between them are relatively high. It should be noted that there are two types of errors: false positive and false negative. In our experiments, recognition accuracy is the ratio of correct judgment: where represents the accuracy and is the false positive. Here, the inaccuracy consists of leakage probabilistic and false drop rate.
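As an illustration of how recognition accuracy and the two error types relate to a confusion matrix, the sketch below computes the row-normalized matrix, the overall accuracy, and the per-class counts of misses and false alarms. The counts in the example are hypothetical and are not results from the paper.

```python
def row_normalize(confusion):
    """Divide each row (true class) by its total so that every row sums to one."""
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


def overall_accuracy(confusion):
    """Fraction of samples on the diagonal, i.e. correctly recognized."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def class_error_counts(confusion, k):
    """Return (true positives, false negatives, false positives) for class k."""
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                 # class-k samples that were missed
    fp = sum(row[k] for row in confusion) - tp  # other samples labeled as class k
    return tp, fn, fp


if __name__ == "__main__":
    # Rows are true classes, columns are predicted classes (hypothetical counts).
    counts = [[8, 2, 0],
              [1, 7, 2],
              [0, 3, 7]]
    print(row_normalize(counts))
    print(overall_accuracy(counts))
    print(class_error_counts(counts, 1))
```

Row normalization makes classes with different numbers of utterances comparable, which matters for unbalanced corpora.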

Figure 5 shows the result of SER on the Danish database using the iSMM emotion model. DES is a speech emotion database that is comparatively difficult to recognize; accordingly, the experimental results show an overall decline in emotion recognition performance compared to the Berlin database. In particular, the recognition rates of “fear” and “happy” are less than 80%. It is worth mentioning that, in most cases, a few key dimensions of the emotion features in a speech emotion database carry the main information about their categories. Compared to many other dimensions, these dimensions are much more necessary. Thanks to its unique “long-tail” structure, Student’s t-distribution, compared with the GMM emotion model, is able to select the key dimensions with high feature information for component partition. Hence, under similar conditions, the SER method proposed in our work achieves an obvious improvement in performance compared with the literature [32], which uses a 3DEC hierarchical classifier.

4.3. Analysis of Contrast Experiment between iSMM and GMM on the Aibo Database

The Aibo speech database (Batliner et al., 2008 [33]; Steidl, 2009 [34]) was gathered from children interacting with the pet robot Aibo developed by Sony. The ten emotion classes contained in the recordings are mapped, through machine learning, into four categories: anger, emphatic, neutral, and positive, while the remaining samples are assigned to a fifth class: rest. Thereafter, through the speaker-independent recognition test by Schuller et al. in 2009, the Aibo database was collated as the corpus of the Interspeech 2009 Emotion Challenge [35].

This section focuses on the effect of the iSMM algorithm on emotion recognition from spontaneous speech database. It is revealed by Figure 6 that, compared with the acting speech databases, the recognition rates on the Aibo present striking descent.

Because of the strong continuous correlation between emotional categories in spontaneous speech databases, there is a considerable number of outliers and singular values, which leads to classification errors when emotion models partition the components. These errors are more frequent in describing the categories of feature data, particularly when Gaussian distributions with equal means and variances are used. Benefiting from the long-tail structure of Student’s t-distribution, the iSMM is more robust to atypical observed data (outliers). Thus, compared to other models, the iSMM has a distinct advantage in SER performance on spontaneous speech emotion samples.

Regarding the emotion feature samples of the spontaneous database Aibo, the data distribution in the feature space is discrete, and the training samples are insufficient, leading to underfitting, since the speech material is evoked. This condition makes component partition more difficult for the emotion models. The analysis of Figure 7 shows that the classification accuracies on Aibo using the GMM remain at a considerably low level; in particular, the accuracies on the two emotions “positive” and “rest” are even lower than 30%. The iSMM, by contrast, has comprehensive advantages in SER. In particular, there is a significant improvement on the three emotions “emphatic,” “anger,” and “rest.” Since a large mass of highly redundant emotion features is distributed in the spontaneous database, the GMM, which partitions the component spaces based on the principle of mean square error, may assign large redundancy to the singular points. In contrast, as analyzed above, the ability to deal with outliers makes Student’s t-distribution fit the emotion features of spontaneous speech well. These aspects allow the iSMM to maintain relatively good performance and robustness on the Aibo database.

4.4. Integrated Analysis on Various Databases Using the Four Emotion Models

The results of the recognition experiments on EMO-DB in German and DES in Danish show that the iSMM maintains good robustness across speech emotion databases in different languages. In the iSMM emotion model, each forecast distribution of a component in an emotion category is a subset of the distribution of the initial component. Through the stick-breaking principle in the time domain, the number of infinite components on the observed data can be inferred. In contrast, the numbers of components in the GMM or kNN emotion models are fixed; as a result, the speech feature data may be overfitted.

As shown in Table 1, the iSMM achieves higher average recognition rates on all databases than the compared methods. The recognition results of the GMM and kNN are almost the same, while the SMM also has a significant advantage over them. Given this difference, we found that the iSMM and SMM approaches described here, which achieve better results than the GMM and kNN, profit from the prediction distribution of emotion categories: . This inference of the component partition in the EM algorithm can effectively handle singular values and outliers. Therefore, with Student’s t-distribution, the degradation of recognition performance caused by the partition of outliers can be reduced. Furthermore, compared to the SMM, the iSMM, owing to the introduction of an infinite number of components, adapts better to high-dimensional and underfitted emotion feature data.

In further analysis, we can see that the iSMM reaches recognition rates of 86.9% and 75.0% on the German database EMO-DB and the Danish database DES, respectively. This suggests that our method is not significantly dependent on the language type. The variance in features caused by language differences can be modeled universally using the infinite Student’s t-distribution, which is an important property compared with the GMM and other algorithms. In contrast, the GMM method achieves a good recognition rate of 82.7% on EMO-DB, while giving a low recognition rate of 58.4% on the Danish database. These results show the iSMM’s advantage in cross-language modeling. The relatively low recognition rates on the Aibo database are mainly caused by the naturalistic speech data, rather than the language factor.

According to the former description in Sections 2 and 3, we can obtain the total number of free parameters in iSMM, which is . In the FA + GMM, this number is , while, in the SMM with common factor loading matrix, it is . Therefore, in most cases, the iSMM has the smallest number of free parameters. When is large and/or is not small, the iSMM is a more feasible tool for modeling high-dimensional data.

As shown in Table 2, the time consumption results are obtained by training on features from the “anger” emotion, which contains 128 utterances in EMO-DB. The models based on Student’s t-distribution consume less time than the GMM and kNN. In particular, the proposed iSMM obtains the best result. Moreover, the estimated means , covariance , and degrees of freedom in the iSMM can be used to visualize the distributions of the factors in a low-dimensional latent space, enabling this approach to perform clustering or classification in the latent space.

It is worth mentioning that speaker independence, text independence, the number of emotional categories for classification, and the size of the database all affect the actual recognition performance of the emotion model. This paper puts forward a solution for SER on underfitted samples and provides the iSMM emotion model, which has good robustness. However, as revealed in the experiments on the Aibo database, the SER accuracy of current emotion models on the spontaneous database is still considerably low. Moreover, improving overall recognition performance depends not only on the classifiers; optimizing the selection of speech features also plays an important role.

5. Conclusions

The mixture model based on Student’s t-distribution proposed in this paper, when applied to speech emotion recognition, can handle the outliers of feature samples through its “long-tail” distribution structure. Furthermore, to reduce the dependence of the emotion model on the training samples, an infinite number of components is introduced in our research, constituting the iSMM emotion model. With this solution, the emotion model can automatically partition the components in the feature space and thereby adapt itself to the samples of various databases.

Emotion expression is related to many factors, such as personality, context, gender, age, and culture. These factors may create various problems for the speech emotion recognition. Emotions in speech are expressed and interpreted according to different circumstances by humans (e.g., context dependent, personality specific, and gender dependent), which may be the key to bring speech emotion recognition to the real world applications. As a result, the further research will focus on how to obtain a more appropriate combination of the extractor of emotion feature and the classifier.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work has been supported by the National Natural Science Foundation of China (NSFC) under Grants nos. 61273266, 61231002, and 61375028.