Classification of Emotional Speech Based on an Automatically Elaborated Hierarchical Classifier

Xiao, Zhongzhe; Dellandrea, Emmanuel; Dou, Weibei; Chen, Liming

doi:https://doi.org/10.5402/2011/753819

International Scholarly Research Notices

Research Article | Open Access

Volume 2011 | Article ID 753819 | https://doi.org/10.5402/2011/753819

Zhongzhe Xiao,¹,² Emmanuel Dellandrea,¹ Weibei Dou,³ and Liming Chen¹

Academic Editors: F. Palmieri, Y. H. Ha, R. Palaniappan

Received 04 Nov 2010

Accepted 15 Dec 2010

Published 20 Feb 2011

Abstract

Current machine-based techniques for vocal emotion recognition consider only a finite number of clearly labeled emotional classes, whereas the kinds of emotional classes and their number are typically application dependent. Previous studies have shown that, because of the ambiguous nature of affect classes, a multistage classification scheme helps to improve emotion classification accuracy. However, these multistage classification schemes were elaborated manually, taking into account the underlying emotional classes to be discriminated. In this paper, we propose an automatically elaborated hierarchical classification scheme (ACS), driven by an evidence theory-based embedded feature-selection scheme (ESFS), for application-dependent emotion recognition. Evaluated on the Berlin dataset with 68 features and six emotion states, the ACS proved effective, reaching a 71.38% classification accuracy rate compared to the 71.52% achieved by our previous dimensional model-driven but manually elaborated multistage classifier (DEC). On the DES dataset with five emotion states, our ACS achieved a 76.74% recognition rate compared to the 81.22% accuracy rate of the manually elaborated DEC.

1. Introduction

Speech emotion analysis has attracted growing interest within the context of increasing awareness of the wide application potential of affective computing [1, 2]. Current machine-based techniques for vocal emotion recognition only consider classification problems of a finite number of discrete emotion categories [3] whereas the kinds of emotional states and their number are typically application dependent. These affective categories can be the six basic emotional states but also some nonbasic emotional classes, including for instance deception [4], certainty [5], stress [6], confidence, confusion, and frustration [7].

Most works in the literature make use of acoustic correlates of vocal emotion expressions and explore prosodic features, including pitch, energy, formants, cepstral coefficients, and voice quality [8–10], and more recently harmonic and Zipf features [11]. Moreover, the majority of them rely on one or several global one-step classifiers, including SVM, neural networks, and GMM, using the same feature set for all the emotional states, while studies on emotion taxonomy suggest that some discrete emotions are very close to each other in the dimensional emotion space and that the borders between emotion classes are blurred. Indeed, Banse and Scherer showed [12] that the acoustic correlates distinguishing fear from surprise, or boredom from sadness, are not very clear, which makes accurate emotion classification by a single-step global classifier very hard. Implementing the intuition that classes that are hard to separate should be divided last, Schuller et al. [13] proposed an interesting multilayer-SVM architecture. Some authors [14, 15] also tested an ensemble classification scheme based on voting among several base classifiers. However, only a very slight improvement of classification accuracy was observed. In our previous work [16], we proposed an effective multistage classification scheme, driven by the dimensional emotion model, which hierarchically combines several binary classifiers. At each stage, a binary classifier made use of a different set of the most discriminative features and discriminated emotional states according to different emotional dimensions. However, all these hierarchical classification schemes, including our own, which is based on an empirical mapping of the discrete emotion states onto the dimensional emotion model, were manually elaborated to take into account the various emotional states under consideration, whereas in practice the types of emotion considered are application or dataset dependent. Clearly, we need an automatic way of building such a hierarchical classification scheme for machine-based emotion analysis, especially when the number of emotions changes and their types vary.

In this paper, we propose an automatically elaborated hierarchical classification scheme (ACS), driven by an evidence theory-based embedded feature selection scheme (ESFS), for application-dependent emotion recognition. Evaluated on the Berlin dataset with 68 features and six emotion states, the ACS proved effective, reaching a 71.38% accuracy rate compared to the 71.52% classification rate achieved by our previous dimensional model-driven but manually elaborated multistage classifier (DEC). On the DES dataset with five emotion states, our ACS achieved a 76.74% recognition rate compared to the 81% accuracy rate of the manually elaborated multistage classification scheme (DEC). To our knowledge, the best classification rates previously reported in the literature on the same DES dataset are 66% [17] and 76.15% [18].

The remainder of this paper is organized as follows. Section 2 briefly introduces our evidence theory-based embedded feature selection scheme, the ESFS. Section 3 describes our ESFS-based algorithm for automatically deriving a hierarchical classification scheme (ACS) for application-dependent emotion analysis. Section 4 presents the experimental results of our ACS on both the Berlin and the DES datasets and compares them with those of our earlier hierarchical classification schemes (DECs), which are driven by the dimensional emotion model but were elaborated empirically. Finally, we summarize and conclude our work in Section 5.

2. ESFS: An Evidence Theory-Based SFS

Feature subset selection is an important subject when training classifiers in Machine Learning (ML) problems. Practical ML algorithms are known to degrade in prediction accuracy when faced with many features that are not necessary [19, 20]. Current feature selection methods can be categorized into three broad classes according to their dependence on the underlying classifier [21]: the filter approach, the wrapper approach, and the embedded approach. In this section, we describe a novel embedded feature selection method, called ESFS, which is similar to the wrapper method SFS in that it relies on the simple principle of incrementally adding the most relevant features. As an embedded method, our ESFS also carries out the selection of an optimal feature subset together with the classifier construction [22, 23]. Its originality lies in the use of mass functions from the evidence theory, which allow the information sources carried by the features to be merged elegantly, in an embedded way, thereby leading to a lower computational cost than the original SFS. We first introduce some basics of the evidence theory and then describe our ESFS.

2.1. Introduction to the Evidence Theory

In our feature selection scheme, the term “belief mass” from the evidence theory is introduced into the processing of features. In the 1970s, Dempster and Shafer sought to calculate a general uncertainty level from the Bayesian theory. They developed the concept of “uncertainty mapping” to measure the uncertainty between a lower limit and an upper limit [24, 25]. Similar to the probabilities in the Bayesian theory, they presented a combination rule of the belief masses (or mass function) $m(\cdot)$.

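The evidence theory was completed and presented by Shafer in [26]. It relies on the definition of a set $\Omega$ of hypotheses which have to be exclusive and exhaustive. In this theory, the reasoning concerns the frame of discernment $2^{\Omega}$, which is the set composed of the $2^{|\Omega|}$ subsets of $\Omega$ [27]. In order to express the degree of confidence we have in a source of information for an event $A$ of $2^{\Omega}$, we associate to it an elementary mass of evidence $m(A)$. The elementary mass function, or belief mass, which represents the chance of being a true statement, is defined as

$$m : 2^{\Omega} \to [0, 1], \tag{1}$$

which satisfies

$$m(\emptyset) = 0, \qquad \sum_{A \subseteq \Omega} m(A) = 1. \tag{2}$$

The belief function $\mathrm{Bel}$ is defined if it satisfies $\mathrm{Bel}(\emptyset) = 0$ and $\mathrm{Bel}(\Omega) = 1$ and, for any collection $A_1, \ldots, A_n$ of subsets of $\Omega$,

$$\mathrm{Bel}(A_1 \cup \cdots \cup A_n) \ge \sum_{\substack{I \subseteq \{1, \ldots, n\} \\ I \neq \emptyset}} (-1)^{|I|+1} \, \mathrm{Bel}\Bigl(\bigcap_{i \in I} A_i\Bigr).$$

The belief function gives the lower bound on the chances, and it corresponds to the mass function through the following formulae:

$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B), \qquad m(A) = \sum_{B \subseteq A} (-1)^{|A - B|} \, \mathrm{Bel}(B),$$

where $|X|$ means the number of elements in the subset $X$.

The doubt function is defined as

$$\mathrm{Dou}(A) = \mathrm{Bel}(\bar{A}),$$

and the upper probability function is defined as

$$\mathrm{Pl}(A) = 1 - \mathrm{Dou}(A).$$

The true belief in $A$ should be between $\mathrm{Bel}(A)$ and $\mathrm{Pl}(A)$.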

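Dempster’s combination rule can combine two or more independent sets of mass assignments by using the orthogonal sum. For the case of two mass functions, let $m_1$ and $m_2$ be mass functions on the same frame $\Omega$; the orthogonal sum is defined as $m = m_1 \oplus m_2$, with $m(\emptyset) = 0$ and

$$m(A) = K \sum_{X \cap Y = A} m_1(X) \, m_2(Y), \qquad K = \frac{1}{1 - \sum_{X \cap Y = \emptyset} m_1(X) \, m_2(Y)}.$$

For the case with more than two mass functions, let $m = m_1 \oplus m_2 \oplus \cdots \oplus m_n$; it satisfies $m(\emptyset) = 0$ and

$$m(A) = K \sum_{\cap X_i = A} \; \prod_{1 \le i \le n} m_i(X_i), \qquad K = \frac{1}{1 - \sum_{\cap X_i = \emptyset} \prod_{1 \le i \le n} m_i(X_i)}.$$

This definition of mass functions from the evidence theory is used in our model to represent the source of information given by each acoustic feature, to combine these sources easily, and to consider them as a classifier whose recognition value is given by the mass function.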
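As a minimal illustration of the two-source orthogonal sum, the sketch below combines two mass functions stored as dictionaries from subsets (frozensets) to masses. The function name `dempster_combine`, the two-emotion frame, and the numerical masses are assumptions chosen for the example; they are not taken from the experiments.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Orthogonal sum of two mass functions given as {frozenset: mass} dicts."""
    combined = {}
    conflict = 0.0
    for (x, mass_x), (y, mass_y) in product(m1.items(), m2.items()):
        intersection = x & y
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_x * mass_y
        else:
            conflict += mass_x * mass_y  # product mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    k = 1.0 / (1.0 - conflict)  # normalization constant K
    return {subset: k * mass for subset, mass in combined.items()}

# Hypothetical two-class frame and masses, for illustration only.
A = frozenset({"anger"})
B = frozenset({"sadness"})
AB = A | B
m1 = {A: 0.6, AB: 0.4}
m2 = {B: 0.5, AB: 0.5}
m = dempster_combine(m1, m2)
print({tuple(sorted(s)): round(v, 4) for s, v in m.items()})
```

Here the conflicting mass is 0.6 × 0.5 = 0.3, so the remaining products are rescaled by 1/0.7 and the combined masses again sum to one.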

2.2. The ESFS Scheme

An exhaustive search for the best subset of features, which requires exploring a space of 2ⁿ subsets, is impractical; we thus turn to a heuristic approach for the feature selection, as does SFS. However, unlike SFS, which is a wrapper-based approach, our evidence theory-based feature-selection technique, ESFS, uses the concept of belief mass from the evidence theory as a classifier and its combination rules to fuse various audio features, leading to an embedded feature-selection method. Moreover, compared to the original SFS, the range of subsets evaluated in the forward process of ESFS is extended to multiple subsets for each size, and the feature set is reduced according to a certain threshold before the selection in order to decrease the computational burden caused by this extension.

A heuristic feature selection algorithm can be characterized by its stance on four basic issues that determine the nature of the heuristic search process [28]. First, one must determine the starting point in the space of feature subsets, which influences the direction of search, and the operators used to generate successor states. The second issue concerns the search strategy. As an exhaustive search in a space of feature subsets is impractical, one needs to provide a more realistic approach such as greedy methods to traverse the space. At each point of the search, one considers local changes to the current state of the features, selects one, and iterates. The third issue concerns the strategy used to evaluate alternative subsets of features. Finally, one must decide on some criterion for halting the search.

As illustrated in Figure 1, we can summarize our embedded ESFS using belief masses by the following four steps, which answer the previous four questions:
(i) computation of the belief masses of the single features from the training set,
(ii) evaluation and ordering of the single features to decide the initial set of potential best features,
(iii) combination of features for the generation of the feature subsets, making use of operators of combination,
(iv) selection of the best feature subset.

2.2.1. Calculation of the Belief Masses of the Single Features

Before the feature selection starts, all features are normalized into [0,1]. For each feature,

$$F_n = \frac{F_n^{0} - \min(F_n^{0})}{\max(F_n^{0}) - \min(F_n^{0})},$$

where $F_n^{0}$ is the set of original values of the $n$th feature and $F_n$ is the normalized value of the $n$th feature.

By definition of the belief masses, the mass can be obtained in different ways which can represent the chance for a statement to be true. In our work, the PDFs (probability density functions) of the features of the training data are used for calculating the masses of the single features.

The curves of PDFs of the features are obtained by applying polynomial interpolation to the statistics of the distribution of the feature values from the training data.

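Taking the case of a 2-class classifier as an example, the two classes are defined as subset $A$ and subset $A^{C}$. First, the probability densities of the features in each of the 2 subsets are estimated from the training samples by the statistics of the values of the features in each class. We define the probability density of the $n$th feature in subset $A$ as $\Pr(A, f_n)$ and the probability density in subset $A^{C}$ as $\Pr(A^{C}, f_n)$, where $f_n$ is the value of the feature $F_n$. According to the probability densities, the masses of feature $F_n$ on these 2 subsets can be defined to meet the requirement in (2) as

$$m(A, f_n) = \frac{\Pr(A, f_n)}{\Pr(A, f_n) + \Pr(A^{C}, f_n)}, \qquad m(A^{C}, f_n) = \frac{\Pr(A^{C}, f_n)}{\Pr(A, f_n) + \Pr(A^{C}, f_n)},$$

where, at any possible value $f_n$ of the $n$th feature, $m(A, f_n) + m(A^{C}, f_n) = 1$.

In the case of $M$ classes, the classes are defined as $A_1, A_2, \ldots, A_M$. The masses of feature $F_n$ for the $i$th class $A_i$ can be obtained as

$$m(A_i, f_n) = \frac{\Pr(A_i, f_n)}{\sum_{j=1}^{M} \Pr(A_j, f_n)},$$

which satisfies

$$\sum_{i=1}^{M} m(A_i, f_n) = 1.$$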
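The normalization and the mass computation can be sketched as follows. This is only an illustration under stated assumptions: class-conditional densities are estimated here with a plain histogram rather than with the polynomial interpolation used in our work, and the function names `min_max_normalize` and `feature_masses`, the number of bins, and the equal-mass fallback for empty bins are hypothetical choices.

```python
import numpy as np

def min_max_normalize(values):
    """Rescale raw feature values into [0, 1]."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return (values - values.min()) / span if span > 0 else np.zeros_like(values)

def feature_masses(values, labels, n_bins=10):
    """Belief masses of one normalized feature, per class and per histogram bin.

    Returns (bin_edges, masses) where masses[c][b] is the mass of class c
    for feature values falling in bin b. Each class must have samples.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    classes = sorted(set(labels.tolist()))
    # Histogram estimate of the class-conditional density (stand-in for the PDF).
    dens = np.array([
        np.histogram(values[labels == c], bins=edges, density=True)[0]
        for c in classes
    ])
    total = dens.sum(axis=0)
    safe_total = np.where(total > 0, total, 1.0)
    # Bins where no class has any sample get equal masses (assumption).
    masses = np.where(total > 0, dens / safe_total, 1.0 / len(classes))
    return edges, dict(zip(classes, masses))
```

For every bin the masses of all classes sum to one, which is the requirement in (2) restricted to the singleton classes.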

2.2.2. Evaluation of the Single Features and Selection of the Initial Set of Potentially Best Features

Once the belief masses associated with each single feature have been computed from the training data, they are used to build a simple classifier. Indeed, given the belief masses associated with a single feature, a data sample can simply be assigned to the class having the highest belief mass. Using the classification accuracy rates of these single feature-based classifiers, all the features can then be sorted in descending order as $\{F_1, F_2, \ldots, F_N\}$, where $N$ is the number of features in the whole feature set. In order to reduce the computational burden in the feature selection, an initial feature set FS_ini is selected with the first $K$ best features in this ordered feature set, using for instance a threshold for cutting off classification accuracy rates, leading to

$$\mathrm{FS\_ini} = \{F_1, F_2, \ldots, F_K\}, \qquad K \le N.$$

The threshold on the classification rates is decided according to the best classification rate as

$$R_{\mathrm{thres},1} = \theta \cdot R_{\mathrm{best},1},$$

where $R_{\mathrm{best},1}$ is the best single-feature classification rate and $0 < \theta < 1$, as illustrated in Figure 2. In our work on vocal emotion analysis, the threshold value $\theta$ is set to 0.8, which experiments showed to be a good balance between overall performance and calculation time. This threshold may vary with the underlying problem. Out of the 68 features considered in our work, around 30 features are kept in our vocal emotion analysis problem when the threshold is set to 0.8.

Only the features selected in the set FS_ini take part in the later steps of the feature selection process. The elements (features) of FS_ini are also regarded as feature subsets of size 1.
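The cut-off step can be illustrated with a short sketch. It assumes that the single-feature accuracies have already been measured and that the threshold is applied as a ratio of the best accuracy; the function name `select_initial_set`, the feature names, and the accuracy values are hypothetical.

```python
def select_initial_set(accuracy, ratio=0.8):
    """Keep the features whose accuracy is at least `ratio` times the best one.

    accuracy: {feature_name: accuracy of the single-feature classifier}.
    Returns the kept feature names, best first.
    """
    ranked = sorted(accuracy, key=accuracy.get, reverse=True)
    cutoff = ratio * accuracy[ranked[0]]
    return [name for name in ranked if accuracy[name] >= cutoff]

# Made-up accuracies for four features.
acc = {"f0_mean": 0.50, "energy_max": 0.45, "zipf_entropy": 0.38, "f1_var": 0.41}
fs_ini = select_initial_set(acc)
print(fs_ini)
```

With a best accuracy of 0.50 the cut-off is 0.40, so the feature with accuracy 0.38 is discarded before the forward selection starts.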

2.2.3. Combination of Features for the Generation of the Best Feature Subsets

Having the best feature subsets of size $k$ ($k \ge 1$), the generation of a new feature subset of size $k+1$ is achieved by computing a new composite feature through an operator of combination, thus fusing a feature subset of size $k$ with one feature from the initial feature set FS_ini. All these composite features are then sorted in descending order according to their classification accuracy, and the best ones are selected using a threshold, as we did for the selection of the initial feature set FS_ini.

We denote the set of all the feature subsets of size $k$ under evaluation as $\mathrm{FS}_k$ and the set of the selected feature subsets of size $k$ as $\mathrm{FS}'_k$. Thus, $\mathrm{FS}_1$ equals the original whole feature set, and $\mathrm{FS}'_1 = \mathrm{FS\_ini}$. From $\mathrm{FS}'_k$, the set of the feature subsets of size $k+1$ is obtained as

$$\mathrm{FS}_{k+1} = \mathrm{Combine}(\mathrm{FS}'_k, \mathrm{FS\_ini}) = \{F^{k+1}_1, F^{k+1}_2, \ldots, F^{k+1}_{N_{k+1}}\},$$

where the function “Combine” generates new composite features by combining features from each of the two sets $\mathrm{FS}'_k$ and FS_ini in all possible combinations, except the ones in which a feature from FS_ini already appears in the composite feature from $\mathrm{FS}'_k$; $F^{k+1}_j$ represents the new composite features generated using an operator of combination, and $N_{k+1}$ is the number of elements in the set $\mathrm{FS}_{k+1}$.

The creation of a new composite feature from two other features is achieved by combining the belief masses of the two features, making use of an operator of combination. The fusing process works as follows.

Assume that $M$ classes are considered in the classifier. For the $i$th class $A_i$, the preprocessed mass $m^{*}$ for the new composite feature $F_c$, which is generated from a composite feature $F_a$ in $\mathrm{FS}'_k$ and a feature $F_b$ from FS_ini, is calculated as

$$m^{*}(A_i, f_c) = T\bigl(m(A_i, f_a), \, m(A_i, f_b)\bigr), \tag{16}$$

where $f$ is the value of the feature $F$ and $T$ is an operator of combination. The commonly used existing operators for fusing two elements, the triangular norms, are used in our work to combine the features. These operators are explained in detail in the next subsection. The sum of the $m^{*}$ values may not be 1, depending on the combination operator being used. In order to meet the definition of belief masses, the $m^{*}$ values are then normalized to give the masses of the new composite feature:

$$m(A_i, f_c) = \frac{m^{*}(A_i, f_c)}{\sum_{j=1}^{M} m^{*}(A_j, f_c)}. \tag{17}$$

The new composite feature may perform better than both of the base features used in the combination, as illustrated in Figure 3. However, it may also perform worse than either of the two original features, in which case it is eliminated in the feature selection process.
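The fusion of two features for a single sample can be sketched as below: the per-class masses are combined with an operator and then renormalized so that they sum to one. The algebraic product is used here only as a simple example of a t-norm; the function names and the mass values are assumptions made for the illustration.

```python
import numpy as np

def product_tnorm(x, y):
    """Algebraic product, a simple t-norm used here as the example operator."""
    return x * y

def combine_masses(m_a, m_b, op):
    """Fuse per-class masses of two features for one sample, then renormalize."""
    raw = op(np.asarray(m_a, dtype=float), np.asarray(m_b, dtype=float))
    total = raw.sum()
    if total == 0:
        # No evidence left for any class: fall back to equal masses (assumption).
        return np.full(raw.shape, 1.0 / raw.size)
    return raw / total

# Masses of two hypothetical features over two classes for the same sample.
m_c = combine_masses([0.7, 0.3], [0.6, 0.4], product_tnorm)
print(m_c)
```

The raw combined values are 0.42 and 0.12, which do not sum to one; after normalization the first class receives 0.42/0.54 of the mass, more than either base feature gave it.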

All these new composite features can also be sorted in descending order according to their classification accuracy on the training dataset, as we did for the single original audio features.

The best composite feature of size $k+1$ is noted as $F^{k+1}_{\mathrm{best}}$, and its classification accuracy is recorded as $R_{\mathrm{best},k+1}$. Similar to the selection of FS_ini from the single original features, a threshold is set to select a number of composite features of size $k+1$ for the next step of forward selection. The set of these selected composite features is noted as $\mathrm{FS}'_{k+1}$, whose elements $F$ satisfy $R(F) \ge 0.8 \, R_{\mathrm{best},k+1}$, where $R(F)$ is the classification accuracy of $F$. In order to simplify the selection, the threshold value is set in our work to the same value of 0.8 in every step, without any adaptation to each step.

2.2.4. Stop Criterion and the Selection of the Best Feature Subset

The stop criterion of ESFS is met when the best classification rate begins to decrease as the size of the feature subsets increases. In our work, in order to avoid missing the real peak of the classification performance, the forward selection stops when the classification performance decreases in two consecutive steps, $R_{\mathrm{best},k} > R_{\mathrm{best},k+1} > R_{\mathrm{best},k+2}$. The number of the selected composite features is noted as Num_select.
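The stopping rule can be illustrated with the following sketch, which scans the best accuracy obtained for each subset size and stops at the first size followed by two consecutive decreases. Returning the size at that peak, and falling back to the overall maximum when no such pattern occurs, are assumptions of this example; the function name `selected_size` and the accuracy values are hypothetical.

```python
def selected_size(best_rates):
    """best_rates[k - 1] is the best accuracy among feature subsets of size k.

    Returns the subset size at which the forward selection stops, i.e. the
    first size whose accuracy is followed by two consecutive decreases.
    """
    for k in range(len(best_rates) - 2):
        if best_rates[k] > best_rates[k + 1] > best_rates[k + 2]:
            return k + 1  # sizes are 1-based
    # No double decrease observed: take the size with the highest accuracy.
    return max(range(len(best_rates)), key=best_rates.__getitem__) + 1

rates = [0.60, 0.68, 0.66, 0.70, 0.69, 0.67, 0.65]
print(selected_size(rates))
```

In this made-up sequence the single dip at size 3 does not stop the search, which is what allows the later peak at size 4 to be found.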

2.3. Operators of Combination

Aggregation and fusion of different information sources are basic concerns in many systems and applications. There exist different fusion approaches, including evidence theory, possibility theory, and fuzzy set theory, but all these approaches can be summarized as the application of some numerical aggregation operators. Generally speaking, aggregation operators are mathematical functions that reduce a set of numbers to a unique representative number [29].

Since the combination of the masses of the features in our feature selection scheme amounts to combining two features, the commonly used existing operators for two elements, the triangular norms, are used in our work to fuse the features as in (16).

The triangular norm (abbreviated as t-norm) is a kind of binary operation used in the framework of probabilistic metric spaces and in multivalued logic. It was first introduced by Menger [30] in order to generalize the triangular inequality of a metric. The current concept of a t-norm and its dual operator (t-conorm) is due to Schweizer and Sklar [31, 32]. The t-norms generalize the conjunctive “AND” operator and the t-conorms generalize the disjunctive “OR” operator. These properties enable them to be used to define the intersection and union operations [29, 33].

The definitions of a t-norm and a t-conorm are as follows.

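t-norm
A t-norm is a function $T : [0,1] \times [0,1] \to [0,1]$ having the following properties:
(1) $T(x, y) = T(y, x)$ (T1), commutativity,
(2) $T(x, y) \le T(u, v)$ if $x \le u$ and $y \le v$ (T2), monotonicity (increasing),
(3) $T(x, T(y, z)) = T(T(x, y), z)$ (T3), associativity,
(4) $T(x, 1) = x$ (T4), one as a neutral element.

A well-known property of t-norms is $T(x, y) \le \min(x, y)$.

t-conorm
Formally, a t-conorm is a function $S : [0,1] \times [0,1] \to [0,1]$ having the following properties:
(1) $S(x, y) = S(y, x)$ (S1), commutativity,
(2) $S(x, y) \le S(u, v)$ if $x \le u$ and $y \le v$ (S2), monotonicity (increasing),
(3) $S(x, S(y, z)) = S(S(x, y), z)$ (S3), associativity,
(4) $S(x, 0) = x$ (S4), zero as a neutral element.

A well-known property of t-conorms is $S(x, y) \ge \max(x, y)$.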

We say that a t-norm and a t-conorm are dual (or associated) if they satisfy the DeMorgan law. The minimum is the biggest t-norm; and its dual is the smallest t-conorm.

Six parameterized t-norms, namely Lukasiewicz, Hamacher, Yager, Weber-Sugeno, Schweizer and Sklar, and Frank, which are frequently proposed in the literature [34], were tested with different parameters in our work.
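For reference, the sketch below implements these six families in forms in which they are commonly stated in the t-norm literature. The exact parameterizations and parameter values used in our experiments are not reproduced here, so the formulas (in particular the Schweizer and Sklar variant) and the default parameters should be read as assumptions of this illustration.

```python
import math

def lukasiewicz(x, y):
    return max(0.0, x + y - 1.0)

def hamacher(x, y, gamma=0.5):
    # Hamacher family, gamma > 0 here (gamma = 1 gives the algebraic product).
    return x * y / (gamma + (1.0 - gamma) * (x + y - x * y))

def yager(x, y, p=2.0):
    # Yager family, p > 0.
    return max(0.0, 1.0 - ((1.0 - x) ** p + (1.0 - y) ** p) ** (1.0 / p))

def weber_sugeno(x, y, lam=1.0):
    # Sugeno-Weber family, lam > -1.
    return max(0.0, (x + y - 1.0 + lam * x * y) / (1.0 + lam))

def schweizer_sklar(x, y, p=2.0):
    # One common form of the Schweizer-Sklar family, p > 0 here.
    return max(0.0, x ** p + y ** p - 1.0) ** (1.0 / p)

def frank(x, y, s=2.0):
    # Frank family, s > 0 and s != 1.
    return math.log(1.0 + (s ** x - 1.0) * (s ** y - 1.0) / (s - 1.0), s)

T_NORMS = [lukasiewicz, hamacher, yager, weber_sugeno, schweizer_sklar, frank]

for t in T_NORMS:
    print(t.__name__, round(t(0.7, 0.8), 4))
```

Each function satisfies the neutral-element property T(x, 1) = x and stays below min(x, y), which gives a quick sanity check on any parameter choice.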

Figure 4 depicts the various surfaces associated with the aforementioned six combination operators. The $z$-axis represents the output of the operators from the two inputs $x$ and $y$. As we can see from these surfaces, the Yager, Weber-Sugeno, and Schweizer and Sklar operators have convex surfaces, while the Lukasiewicz and Frank operators have flatter or even concave surfaces. For each operator, the degree of convexity or concavity is affected by the parameters. The difference in the shape of the surfaces may influence the performance when they are applied in the classification.

In addition to these t-norm operators, the average and the geometric average of the features are also used for the combination of the features:
(i) Average: $T(x, y) = \dfrac{x + y}{2}$,
(ii) Geometric average: $T(x, y) = \sqrt{x y}$.

The property curve surfaces of average and geometric average are displayed in Figure 5.

It should be noted that, since a normalization step is applied when calculating the masses of the combined new features in (17), the “associativity” property of the t-norms no longer holds in our case. Consequently, the result depends on the order in which features are combined, and the order of the selected features cannot be changed arbitrarily without affecting the performance of the final combined feature.
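The loss of associativity can be checked on a small example. The sketch below uses the minimum t-norm followed by renormalization and combines three mass vectors in two different orders; the function name `fuse` and the mass values are assumptions chosen for the illustration.

```python
import numpy as np

def fuse(m_a, m_b):
    """Combine two mass vectors with the minimum t-norm, then renormalize."""
    raw = np.minimum(m_a, m_b)
    return raw / raw.sum()

# Masses of three hypothetical features over two classes for the same sample.
a = np.array([0.6, 0.4])
b = np.array([0.7, 0.3])
c = np.array([0.2, 0.8])

left = fuse(fuse(a, b), c)   # features combined as ((a, b), c)
right = fuse(a, fuse(b, c))  # features combined as (a, (b, c))
print(left, right)
```

The first order gives masses of about (0.375, 0.625) while the second gives (0.5, 0.5), although the minimum itself is associative.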

2.4. Discussion

ESFS can be used either as an embedded feature selection method, as we do in the next section when building the hierarchical classification scheme for emotion analysis, or as a simple filter method for the selection of relevant features which can then be fed into other classifiers. Using ESFS as a filter method, we carried out experiments comparing its behavior with other feature-selection techniques, including the Fisher filter method, PCA, and SFS. On the Berlin dataset for emotional speech recognition and the Simplicity dataset for visual object recognition, our ESFS displayed better performance, showing its effectiveness in the selection of relevant features [35].

3. ESFS-Based Hierarchical Classification Scheme for Vocal Emotion Recognition

The fuzzy neighborhood relationship between some emotional classes, for instance between sadness and boredom, as evidenced by studies on acoustic correlates, leads to unnecessary confusion between emotion states when a single global classifier is applied using the same set of features. While several previous works have shown the effectiveness of multistage classification schemes for vocal emotion analysis, the elaboration of these hierarchical classification schemes was intuitive and manual. On the other hand, the number of emotions and their types to be recognized are typically dataset or application dependent. The empirically built hierarchical classification structure thus needs to be adjusted when the emotional space changes. In this section, we propose an automatically elaborated Hierarchical Classification Scheme (ACS) which is driven by our evidence theory-based feature selection technique ESFS. While keeping at least similar performance, the main goal here is to avoid the repeated work of manually building a new multistage classification scheme each time the vocal emotions to be analyzed change.

Basically, our ESFS, when applied as an embedded feature selection technique to an application specific vocal emotion recognition problem, automatically divides in an optimal way the set of emotional states to be recognized into two disjoint subsets of emotional states, leading to a hierarchical classifier represented by a binary tree whose root is the union of all emotion classes, while leaves are single emotion classes and intermediate nodes composite emotional classes discriminated by a subclassifier. Each of these subclassifiers is based on our ESFS introduced in the previous section, thus extracting the best features to best discriminate two composite emotional classes.

The generation process of an ACS is shown in Figure 6. The discrete emotional classes concerned in the classification problem are first assigned to a frame of discernment $\Omega = \{E_1, E_2, \ldots, E_M\}$, where $E_i$ stands for the $i$th emotional state in the frame of discernment $\Omega$. For example, the frame of discernment associated with the Berlin database is {Anger, Happiness, Fear, Neutral, Sadness, Boredom}, while the frame of discernment associated with the DES dataset is {Anger, Happiness, Neutral, Surprise, Sadness}. The frame of discernment describes the initial affect space under study. Using our embedded ESFS, the affect space is recursively divided into two complementary subaffect spaces which best describe the affect space with respect to the training data, until the subaffect spaces become simple emotional classes.

The hierarchical classifier is thus expressed by a binary tree. The initial frame of discernment is set as the root node of this binary tree. The main steps for generating an ACS are listed as follows.

3.1. The Algorithm

Step 1. The hierarchical structure is composed of several binary subclassifiers. The frame $\Omega$ is divided into $2^{M-1} - 1$ pairs of nonempty subsets, exhausting all possible partitionings of the initial affect space. The two subsets $A_p$ and $A_p^{C}$ in each pair are complements of each other, and each subset represents a class with respect to the initial affect space: $A_p \cup A_p^{C} = \Omega$ and $A_p \cap A_p^{C} = \emptyset$.
All the possible pairs of complements are evaluated using ESFS to decide which partitioning of the initial affect space is the best from the viewpoint of classification accuracy. In order to avoid repeated partitionings, the pairs are defined so that the number of elements in subset $A_p$ is not larger than the number of elements in $A_p^{C}$. For example, in the case of 4 classes, 7 pairs of subsets can be evaluated, as listed in Table 1.

Our feature combination and selection process (ESFS) is applied to each pair of the subsets and the belief masses of the training samples in the subsets can be obtained. All these pairs can then be sorted by their classification accuracy rates.
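The enumeration of candidate partitions in Step 1 can be sketched as follows. The function name `complementary_pairs` and the convention used to drop mirror-image pairs of an even split are assumptions of this illustration.

```python
from itertools import combinations

def complementary_pairs(classes):
    """List all (subset, complement) pairs with len(subset) <= len(complement)."""
    classes = list(classes)
    n = len(classes)
    pairs = []
    for size in range(1, n // 2 + 1):
        for subset in combinations(classes, size):
            # For an even split, keep only one of the two mirror-image pairs.
            if size * 2 == n and classes[0] not in subset:
                continue
            complement = tuple(c for c in classes if c not in subset)
            pairs.append((subset, complement))
    return pairs

for subset, complement in complementary_pairs(["E1", "E2", "E3", "E4"]):
    print(subset, "|", complement)
```

For four classes this yields four pairs with a single class on one side and three pairs with two classes on each side, seven in total.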

Step 2. The two subsets in the pair with the highest classification rate (assuming it is the $p$th pair of subsets) are assigned as the children nodes: $A_p$ as the left child node and $A_p^{C}$ as the right child node.
The two children nodes $A_p$ and $A_p^{C}$ are then processed in the same way as in Step 1. The numbers of elements in the children nodes are counted. Denote the subset $A_p$ or $A_p^{C}$ as $A$.
(i) If $|A| = 1$ (only one element in the subset), this node is marked as a leaf node.
(ii) If $|A| > 1$ (the subset can be further partitioned), the frame of discernment is updated as $\Omega = A$, and the construction of the binary tree continues with Step 1.

Step 3. When the number of leaf nodes equals the number of emotional classes, the generation process of the binary tree stops. The information about the binary tree is stored in the model of the classifier.
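Steps 1 to 3 amount to a recursive construction, sketched below. The ESFS evaluation of a pair is replaced by a caller-supplied `score` function; the toy score based on made-up arousal values, and the names `splits`, `build_tree`, and `toy_score`, are hypothetical stand-ins used only to make the example runnable.

```python
from itertools import combinations

def splits(classes):
    """Yield every partition of `classes` into two nonempty groups, once each."""
    first, rest = classes[0], classes[1:]
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(c for c in classes if c not in left)
            yield left, right

def build_tree(classes, score):
    """Return a nested tuple: leaves are class names, nodes are (left, right)."""
    classes = tuple(classes)
    if len(classes) == 1:
        return classes[0]  # leaf node: a single emotional class
    left, right = max(splits(classes), key=lambda pair: score(*pair))
    return (build_tree(left, score), build_tree(right, score))

# Made-up arousal values standing in for a real separability measure.
AROUSAL = {"anger": 0.9, "happiness": 0.8, "neutral": 0.3, "sadness": 0.1}

def toy_score(left, right):
    """Hypothetical separability: smallest arousal gap between the two groups."""
    return min(abs(AROUSAL[a] - AROUSAL[b]) for a in left for b in right)

tree = build_tree(tuple(AROUSAL), toy_score)
print(tree)
```

With this toy score the root separates the two high-arousal classes from the two low-arousal ones, and each child is then split into its single classes.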

3.2. Practice and Improvement

In practice, we want the ACS resulting from the previous scheme to be as balanced as possible. Indeed, the overall classification accuracy rate of a multistage hierarchical classifier is approximately the product of the classification rates at each stage. Assuming that the different stages in the classifier have classification accuracy rates close to each other, say $r$, the overall classification rate of an $n$-stage classifier can be approximated by $r^{n}$. Thus, too many stages may dramatically degrade the overall classification accuracy rate. In order to reach a classification accuracy as high as possible, one needs to reduce the depth of the tree-based hierarchical classifier, which calls for a balanced structure.
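The effect of depth on the overall rate can be checked numerically with a short sketch; the per-stage rate of 0.9 is only an example value and the function name `overall_rate` is hypothetical.

```python
def overall_rate(stage_rate, n_stages):
    """Approximate overall accuracy of a cascade of equally accurate stages."""
    return stage_rate ** n_stages

for n in (2, 3, 4, 5):
    print(n, round(overall_rate(0.9, n), 3))
```

Under this approximation a per-stage rate of 90% drops to about 73% after three stages and to about 59% after five.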

In our work, the notion of a balanced pair of subsets is introduced. For each pair of subsets $A_p$ and $A_p^{C}$, a subset distance $d_p$ is calculated as the difference between the numbers of elements of the two subsets,

$$d_p = |A_p^{C}| - |A_p|,$$

where $|A|$ means the number of elements in the set $A$.

Because the subsets are defined so that $A_p$ always has fewer elements than, or the same number of elements as, $A_p^{C}$, $d_p$ always satisfies $d_p \ge 0$. When the $p$th pair of subsets satisfies $d_p \le 1$, it is defined as a balanced pair of subsets.

If the pair of subsets with the highest classification rate (assuming that it is the $p$th pair and the classification rate is $R_p$) is a balanced pair, the generation of the binary tree continues normally; if it is not a balanced pair, it is compared with the balanced pair having the best classification rate (assuming that it is the $q$th pair and the classification rate is $R_q$). If we have only five or six classes in our applications, as is the case for the Berlin and DES datasets, there should be two or three stages in a balanced binary tree. We thus set a threshold thre_diff defining an acceptable difference in classification accuracy, derived from $R_p$ under the assumption that the number of stages does not exceed three. The approximate values of thre_diff related to $R_p$ are listed in Table 2. The best classification rate in the first stage of the hierarchical structure is normally around 90% in the experiments, so the most commonly selected thre_diff is between 4% and 5%. When the number of classes increases in the classification problem, the thresholds should be adjusted accordingly.

If $R_p - R_q <$ thre_diff, the binary tree with the balanced pair is assumed to have better overall classification performance, and the $q$th pair is selected instead of the $p$th pair. However, when an unbalanced pair has a much better recognition rate than the balanced one, it is still selected.
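The preference for balanced pairs can be sketched as a small selection rule. The sketch assumes that a pair counts as balanced when its two sides differ by at most one class and takes thre_diff as a plain parameter (0.045 is only an example within the 4% to 5% range mentioned above); the function name `choose_pair` and the candidate pairs are hypothetical.

```python
def is_balanced(pair):
    left, right, _ = pair
    return abs(len(left) - len(right)) <= 1

def choose_pair(pairs, thre_diff=0.045):
    """pairs: list of (left, right, rate). Prefer a balanced pair unless the
    best unbalanced pair beats it by at least thre_diff."""
    best = max(pairs, key=lambda p: p[2])
    if is_balanced(best):
        return best
    balanced = [p for p in pairs if is_balanced(p)]
    if not balanced:
        return best
    best_balanced = max(balanced, key=lambda p: p[2])
    return best_balanced if best[2] - best_balanced[2] < thre_diff else best

# Made-up candidate pairs for a five-class problem.
unbalanced = (("sadness",), ("anger", "happiness", "neutral", "surprise"), 0.92)
balanced = (("neutral", "sadness"), ("anger", "happiness", "surprise"), 0.90)
print(choose_pair([unbalanced, balanced]))
```

Here the unbalanced pair is only two points better than the balanced one, so the balanced pair is retained; with a gap larger than thre_diff the unbalanced pair would be kept.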

The common structure of the ACS generated by this approach is shown in Figure 7. The grey doubled line illustrates the possible recognition route of an audio sample.

Figure 8 illustrates some typical hierarchical classifier schemes with balanced pairs of subsets when the number of affect classes varies from 3 to 7, as is the case for most affect recognition problems currently studied.

4. Experimental Results

The effectiveness of our approach was evaluated on both the Berlin and DES datasets. In the following, we first introduce the audio features. Then, our experimental results are presented and discussed.

4.1. The Feature Set

We consider the same set of 68 features as in [11, 16], covering popular frequency and energy-based features as well as our newly introduced features, namely harmonic features for a better description of voice timbre pattern, and Zipf features for a better rhythm and prosody characterization. They are the following.

Frequency-Based Features
1–20. Mean, maximum, minimum, and median value and the variance of F0 and the first 3 formants.

Energy-Based Features
21–23. Mean, maximum, and minimum value of energy
24. Energy ratio of the signal below 250 Hz
25–32. Mean, maximum, median, and variance of the values and durations of energy plateaus
33–40, 42–49. Statistics of gradient and durations of rising and falling slopes of energy contour
41, 50. Number of rising and falling slopes of energy contour per second

Harmonic Features
51–63. Mean, maximum, variance and normalized variance of the 4 areas
64–66. The ratio of mean values of areas 2~4 to area 1

Zipf Features
67. Entropy feature of Inverse Zipf of frequency coding
68. Resampled polynomial estimation Zipf feature of UFD (Up-Flat-Down) coding

4.2. Experimental Results on the Berlin Dataset

Holdout cross-validation with 10 iterations was carried out on the Berlin dataset. In each iteration, 50% of the samples were used as the training set and the other 50% as the test set. Out of the seven basic emotions in the Berlin dataset, we excluded “disgust” as there are only 8 samples of “disgust” among the male samples, far fewer than for the other emotional classes. Moreover, the acoustic features for this emotion were shown to be inconsistent [12]. The influence of gender information on the emotion classification accuracy was also examined. For each classification scheme, three experimental settings were evaluated and compared, using only the female speech samples, only the male speech samples, and a combination of all the samples (mixed samples), respectively. Figure 9 illustrates the two hierarchical classification schemes automatically generated by the ESFS-driven ACS for the male and female samples, respectively.

Figure 10 displays the best classification rates achieved by the eight combination operators that we tested. The error bars in the figure show the root mean square errors of the classification rates.
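The text does not spell out how the error bars are computed. One plausible reading, assumed here, is the root mean square deviation of the per-iteration classification rates from their mean. The function name and the toy rates below are hypothetical.

```python
import math

def mean_and_rms_error(rates):
    """Return the mean rate and the RMS deviation of the rates from that mean."""
    mean = sum(rates) / len(rates)
    rms = math.sqrt(sum((r - mean) ** 2 for r in rates) / len(rates))
    return mean, rms

# Hypothetical classification rates from three holdout iterations.
toy_mean, toy_rms = mean_and_rms_error([0.70, 0.72, 0.74])
```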

The best classification accuracy is for the female samples, for the male samples, and for the mixed genders with gender classification and without gender classification. These results are quite close to those achieved by DEC, our manually elaborated multistage classification scheme driven by the dimensional emotion model [11, 16]. All of the best results are obtained with the Schweizer and Sklar operator. The two curves “All samples (1)” and “All samples (2)” show that preprocessing the audio samples with gender classification clearly improves the overall classification performance for the mixed-gender samples. We also found that the fusion operators Hamacher, Yager, Weber-Sugeno, and Schweizer and Sklar, whose curve surfaces are convex, perform better.

4.3. Experimental Results on the DES Dataset

Our ACS described in Section 3 was also benchmarked on the DES dataset. Figure 11 illustrates the automatically generated hierarchical classification scheme, which proves to be the same for both genders. As in the hierarchical classification schemes generated on the Berlin dataset, the first stage of the ACS scheme operates in the arousal dimension (or energy/active dimension), separating neutral and sadness from anger, happiness, and surprise. Among the three active emotions, surprise is separated from anger and happiness in the second stage, which is still in the arousal dimension. Anger and happiness are separated in the appraisal dimension in the last stage of the hierarchical framework, as for the female samples in the Berlin dataset.
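The hierarchy described for the DES dataset can be written down as a small binary tree. The sketch below is illustrative only: the node-level decision functions are hypothetical threshold rules standing in for the ESFS-based subclassifiers, the three-value feature vector is invented, and the final split between neutral and sadness is an assumption, since the text above does not describe it.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Node:
    labels: frozenset
    decide: Optional[Callable[[Sequence[float]], bool]] = None  # True selects the left child
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def leaf(label):
    return Node(frozenset([label]))

def build_des_tree(passive_vs_active, neutral_vs_sadness, surprise_vs_rest, anger_vs_happiness):
    """Assemble the three-stage hierarchy from four binary decision functions."""
    passive = Node(frozenset({"neutral", "sadness"}), neutral_vs_sadness,
                   leaf("neutral"), leaf("sadness"))  # assumed split, not stated in the text
    anger_happiness = Node(frozenset({"anger", "happiness"}), anger_vs_happiness,
                           leaf("anger"), leaf("happiness"))
    active = Node(frozenset({"anger", "happiness", "surprise"}), surprise_vs_rest,
                  leaf("surprise"), anger_happiness)
    return Node(passive.labels | active.labels, passive_vs_active, passive, active)

def classify(node, features):
    """Walk from the root to a leaf and return its single emotion label."""
    while node.decide is not None:
        node = node.left if node.decide(features) else node.right
    (label,) = node.labels
    return label

# Hypothetical decision rules on a toy vector [arousal, surprise_score, appraisal].
tree = build_des_tree(
    passive_vs_active=lambda x: x[0] < 0.5,
    neutral_vs_sadness=lambda x: x[2] >= 0.5,
    surprise_vs_rest=lambda x: x[1] >= 0.5,
    anger_vs_happiness=lambda x: x[2] < 0.5,
)
predicted = classify(tree, [0.9, 0.1, 0.2])
```

Keeping the decision function on the node mirrors the idea that each subclassifier has its own feature subset and is consulted only for samples that reach that node.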

Holdout cross-validation with 10 iterations was used in our experiments on the DES dataset. To allow comparison with previous work [17, 36–38], in each iteration 90% of the segments were used as the training set and the remaining 10% as the test set. The training set and test set were selected randomly in each group. Figure 12 shows the best classification rates of the eight combination operators that we tested. The classification rates for all speech samples from both genders were obtained by adding an automatic gender classifier [39], as in the experiments on the Berlin dataset. The error bars in the figure show the root mean square errors of the classification rates.

The indices on the axis denote the operators (1) Lukasiewicz, (2) Hamacher, (3) Yager, (4) Weber-Sugeno, (5) Schweizer and Sklar, (6) Frank, (7) Average, and (8) Geometric Average. “All samples (1)” refers to the classification on the mixed-gender samples with gender classification, and “All samples (2)” refers to the classification on the mixed-gender samples without gender classification.
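For readers unfamiliar with these combination operators, the sketch below implements a few of them in their common textbook forms for two values in [0, 1]. The exact parameterizations and parameter values used in the experiments are not given in this section, so the default parameters (`gamma`, `p`) are assumptions, and only five of the eight operators are shown.

```python
import math

def lukasiewicz(a, b):
    """Lukasiewicz t-norm."""
    return max(0.0, a + b - 1.0)

def hamacher(a, b, gamma=1.0):
    """Hamacher t-norm family; gamma = 1 reduces to the product a * b."""
    denom = gamma + (1.0 - gamma) * (a + b - a * b)
    return 0.0 if denom == 0.0 else (a * b) / denom

def yager(a, b, p=2.0):
    """Yager t-norm family with exponent p > 0."""
    return max(0.0, 1.0 - ((1.0 - a) ** p + (1.0 - b) ** p) ** (1.0 / p))

def average(a, b):
    """Arithmetic mean (an averaging operator, not a t-norm)."""
    return (a + b) / 2.0

def geometric_average(a, b):
    """Geometric mean (an averaging operator, not a t-norm)."""
    return math.sqrt(a * b)

# Combine two hypothetical belief values with each operator.
combined = {f.__name__: f(0.8, 0.6)
            for f in (lukasiewicz, hamacher, yager, average, geometric_average)}
```

The t-norms return the other argument when one argument equals 1, while the two averaging operators do not have that property, which is one way to see how the operator families differ.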

The best result is for the female samples with Hamacher operator when , for the male samples with Schweizer and Sklar operator when , for the mixed genders with gender classification, and without gender classification with Schweizer and Sklar operator when . Significant improvement of up to 23% is obtained for the mixed genders with the gender classification as a preprocessing step.

On the same DES dataset, with the same cross-validation split of 90% training data and 10% test data, the best results reported in the literature are 66%, obtained by Ververidis and Kotropoulos [17] for male samples only using a one-step GMM (Gaussian Mixture Model) classifier for all five emotions, and 76.15%, obtained by Schuller et al. [18]. Significant improvement in classification accuracy is achieved with our automatically generated hierarchical classifier.

4.4. Synthesis and Comparison

In Table 3, we summarize and compare the performance of the automatically generated hierarchical classification schemes (ACS) and the earlier empirically built hierarchical DEC on both the Berlin and DES datasets. The two kinds of hierarchical classifiers obtain almost the same results on the Berlin dataset (71.52% versus 71.38%), while the empirically built hierarchical DEC classifier performs slightly better than the automatically derived one on the DES dataset (81.22% versus 76.74%).

As the table shows, the automatically derived ACS offers performance very close to that of the empirical DEC while avoiding repeated empirical work whenever the emotion classification problem changes.

5. Concluding Remarks

In this paper, we have introduced a new embedded feature selection scheme, ESFS, and used it as the basis for automatically deriving hierarchical classification schemes, called ACS. Such a hierarchical classifier is represented by a binary tree whose root is the union of all emotion classes, whose leaves are single-emotion classes, and whose internal nodes are subsets of several emotion classes obtained by a subclassifier. Each subclassifier is built with ESFS, which makes it easy to represent a classifier by its mass function, namely the combination of the information given by an appropriate feature subset; each subclassifier has its own feature subset. Benchmarked on the Berlin and DES datasets, our approach has shown its effectiveness for vocal emotion analysis, achieving performance close to that of our previous empirical dimensional emotion model-driven hierarchical classification scheme (DEC).

Many issues need further study. For instance, from a machine learning point of view, deriving the ACS automatically consists of successively dividing the initial set of class labels into two disjoint subsets using the best binary classifier according to ESFS. Unfortunately, the number of such disjoint subset pairs increases exponentially with the number of class labels. While an exhaustive search is feasible with a set of 5 or 6 class labels, as was the case with the DES and Berlin datasets, it is no longer feasible when the number of class labels is much larger. Therefore, heuristic rules need to be found in order to derive the proposed ACS automatically in such cases.
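To make the growth concrete: a set of n class labels can be divided into two non-empty disjoint subsets in 2^(n-1) - 1 ways, which gives 15 candidate splits for 5 labels and 31 for 6, but 524,287 for 20. The sketch below enumerates these splits for the root node only; the function name is hypothetical, and it does not reproduce the ESFS-based evaluation of each candidate.

```python
from itertools import combinations

def bipartitions(labels):
    """Yield every split of labels into two non-empty disjoint subsets, once each."""
    ordered = sorted(labels)
    first, rest = ordered[0], ordered[1:]
    # Pin the first label to the left subset so (A, B) and (B, A) are not both produced.
    for size in range(len(rest)):      # the right subset keeps at least one label
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(ordered) - left
            yield left, right

des_labels = ["neutral", "sadness", "anger", "happiness", "surprise"]
des_splits = list(bipartitions(des_labels))
six_label_split_count = sum(1 for _ in bipartitions(range(6)))
```

A full tree search repeats this enumeration at every internal node, so the total number of binary classifiers to train grows even faster than the count at the root.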

Another issue in machine recognition of vocal emotions is the fuzzy and subjective character of vocal emotion. The emotional state conveyed by an utterance may be judged to lie between several emotional states, or even to combine several of them, depending on the listener. Thus, ambiguous or multiple judgments also need to be addressed. A preliminary study of this issue is presented in [40, 41].

References

1. http://emotion-research.net.
2. R. Picard, Affective Computing, MIT Press, Cambridge, Mass, USA, 1997.
3. Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: audio, visual, and spontaneous expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39–58, 2009.
4. J. Hirschberg, S. Benus, J. M. Brenier et al., “Distinguishing deceptive from non-deceptive speech,” in Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH '05), pp. 1833–1836, September 2005.
5. J. Liscombe, J. Hirschberg, and J. J. Venditti, “Detecting certainness in spoken tutorial dialogues,” in Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH '05), pp. 1837–1840, September 2005.
6. O. W. Kwon, K. Chan, J. Hao, and T. W. Lee, “Emotion recognition by speech signals,” in Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH '03), Geneva, Switzerland, September 2003.
7. T. Zhang, M. Hasegawa-Johnson, and S. E. Levinson, “Children’s Emotion Recognition in an Intelligent Tutoring Scenario,” in Proceedings of the 8th European Conference on Speech Communication and Technology (INTERSPEECH '04), 2004.
8. T. Vogt and E. André, “Comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '05), pp. 474–477, July 2005.
9. B. Schuller, M. Wimmer, L. Mösenlechner, C. Kern, D. Arsic, and G. Rigoll, “Brute-forcing hierarchical functionals for paralinguistics: a waste of feature space?” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), pp. 4501–4504, April 2008.
10. R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis et al., “Emotion recognition in human-computer interaction,” IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32–80, 2001.
11. Z. Xiao, E. Dellandrea, L. Chen, and W. Dou, “Recognition of emotions in speech by a hierarchical approach,” in Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII '09), Amsterdam, The Netherlands, September 2009.
12. R. Banse and K. R. Scherer, “Acoustic profiles in vocal emotion expression,” Journal of Personality and Social Psychology, vol. 70, no. 3, pp. 614–636, 1996.
13. B. Schuller, G. Rigoll, and M. Lang, “Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine—belief network architecture,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), pp. I577–I580, May 2004.
14. B. Schuller, S. Reiter, R. Muller, M. Al-Hames, M. Lang, and G. Rigoll, “Speaker independent speech emotion recognition by ensemble classification,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '05), pp. 864–867, July 2005.
15. D. Morrison and L. C. De Silva, “Voting ensembles for spoken affect classification,” Journal of Network and Computer Applications, vol. 30, no. 4, pp. 1356–1365, 2007.
16. Z. Xiao, E. Dellandrea, W. Dou, and L. Chen, “Multi-stage classification of emotional speech motivated by a dimensional emotion model,” Multimedia Tools and Applications, vol. 46, no. 1, pp. 119–145, 2010.
17. D. Ververidis and C. Kotropoulos, “Emotional speech classification using Gaussian mixture models and the sequential floating forward selection algorithm,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '05), pp. 1500–1503, July 2005.
18. B. Schuller, S. Reiter, and G. Rigoll, “Evolutionary feature generation in speech emotion recognition,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '06), pp. 5–8, July 2006.
19. R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence, vol. 97, no. 1-2, pp. 273–324, 1997.
20. I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
21. M. Sebban and R. Nock, “A hybrid filter/wrapper approach of feature selection using information theory,” in Proceedings of International Conference on Machine Learning and Cybernetics, vol. 4, pp. 2537–2542, 2004.
22. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, Calif, USA, 1993.
23. J. R. Quinlan, “Improved use of continuous attributes in C4.5,” Journal of Artificial Intelligence Research, vol. 4, pp. 77–90, 1996.
24. A. P. Dempster, “Upper and lower probabilities induced by a multivalued mapping,” Annals of Mathematical Statistics, vol. 38, no. 2, pp. 325–339, 1967.
25. A. P. Dempster, “A generalization of Bayesian inference,” Journal of the Royal Statistical Society B, vol. 30, 1968.
26. G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, Princeton, NJ, USA, 1976.
27. G. Fioretti, “Evidence theory: a mathematical framework for unpredictable hypotheses,” Metroeconomica, vol. 55, no. 4, pp. 345–366, 2004.
28. A. L. Blum and P. Langley, “Selection of relevant features and examples in machine learning,” Artificial Intelligence, vol. 97, no. 1-2, pp. 245–271, 1997.
29. M. Detyniecki, Mathematical aggregation operators and their application to video querying, Doctoral thesis, University of Paris 6, France, LIP6 research report 2001/002, November 2000.
30. K. Menger, “Statistical metrics,” Proceedings of the National Academy of Sciences of the United States of America, vol. 28, pp. 535–537, 1942.
31. B. Schweizer and A. Sklar, “Statistical metric spaces,” Pacific Journal of Mathematics, vol. 10, pp. 313–334, 1960.
32. B. Schweizer and A. Sklar, Probabilistic Metric Spaces, North Holland, New York, NY, USA, 1983.
33. R. Fuller, “OWA operators in decision making,” in Exploring the Limits of Support Systems, C. Carlsson, Ed., vol. 3 of TUCS General Publications, pp. 85–104, 1996.
34. W. Dou, Segmentation of multispectral images based on information fusion: application for MRI images, Ph.D. thesis, Université de Caen, 2006.
35. H. Fu, Z. Xiao, E. Dellandréa, W. Dou, and L. Chen, “Image categorization using ESFS: a new embedded feature selection method based on evidence theory,” in Proceedings of the International Conference on Advanced Concepts Intelligent Vision Systems (ACIVS '09), Bordeaux, France, September 2009.
36. D. Ververidis, C. Kotropoulos, and I. Pitas, “Automatic emotional speech classification,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), vol. 1, pp. I593–I596, Montreal, Canada, May 2004.
37. D. Ververidis and C. Kotropoulos, “Automatic speech classification to five emotional states based on gender information,” in Proceedings of 12th European Signal Processing Conference, pp. 341–344, Austria, September 2004.
38. D. Ververidis and C. Kotropoulos, “Emotional speech classification using Gaussian mixture models,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '05), pp. 2871–2874, May 2005.
39. H. Harb and L. Chen, “Voice-based gender identification in multimedia applications,” Journal of Intelligent Information Systems, vol. 24, no. 2-3, pp. 179–198, 2005.
40. Z. Xiao, Recognition of emotion in audio signals, Ph.D. thesis, Ecole Centrale de Lyon, 2008.
41. Z. Xiao, E. Dellandrea, W. Dou, and L. Chen, “Ambiguous classification of emotional speech,” in Proceedings of the International Workshop on EMOTION—Satellite of International Conference on Language Resources and Evaluation (LREC '08), 2008.

Copyright

Copyright © 2011 Zhongzhe Xiao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
