Abstract

Current machine-based techniques for vocal emotion recognition only consider a finite number of clearly labeled emotional classes, whereas the kinds of emotional classes and their number are typically application dependent. Previous studies have shown that, because of the ambiguous nature of affect classes, a multistage classification scheme helps to improve emotion classification accuracy. However, these multistage classification schemes were manually elaborated by taking into account the underlying emotional classes to be discriminated. In this paper, we propose an automatically elaborated hierarchical classification scheme (ACS), which is driven by an evidence theory-based embedded feature-selection scheme (ESFS), for the purpose of application-dependent emotion recognition. Evaluated on the Berlin dataset with 68 features and six emotion states, this automatically elaborated hierarchical classifier (ACS) showed its effectiveness, displaying a 71.38% classification accuracy rate compared to a 71.52% classification rate achieved by our previous dimensional model-driven but still manually elaborated multistage classifier (DEC). Using the DES dataset with five emotion states, our ACS achieved a 76.74% recognition rate compared to an 81.22% accuracy rate displayed by a manually elaborated multistage classification scheme (DEC).

1. Introduction

Speech emotion analysis has attracted growing interest within the context of increasing awareness of the wide application potential of affective computing [1, 2]. Current machine-based techniques for vocal emotion recognition only consider classification problems of a finite number of discrete emotion categories [3] whereas the kinds of emotional states and their number are typically application dependent. These affective categories can be the six basic emotional states but also some nonbasic emotional classes, including for instance deception [4], certainty [5], stress [6], confidence, confusion, and frustration [7].

Most works in the literature make use of acoustic correlates of vocal emotion expressions and explore prosodic features, including pitch, energy, formants, cepstral, voice quality [8–10], and more recently Harmonic and Zipf features [11]. Moreover, the majority of them rely on one or several global one-step classifiers, including SVM, neural networks, and GMM, using the same feature set for all the emotional states, while studies on emotion taxonomy suggest that some discrete emotions are very close to each other in the dimensional emotion space, so that the borders between emotion classes are blurred. Indeed, Banse and Scherer showed [12] that the acoustic correlates distinguishing fear from surprise or boredom from sadness are not very clear, thus making an accurate emotion classification by a single-step global classifier very hard. Following the intuition that classes that are hard to separate should be divided last, Schuller et al. [13] proposed an interesting multilayer-SVM architecture. Some authors [14, 15] also tested an ensemble classification scheme based on voting by several base classifiers. However, only a very slight improvement of classification accuracy was observed. In our previous work [16], we proposed an effective multistage classification scheme driven by the dimensional emotion model which hierarchically combines several binary classifiers. At each stage, a binary classifier made use of a different set of the most discriminative features and discriminated emotional states according to different emotional dimensions. However, all these hierarchical classification schemes, including our own, which is based on an empirical mapping of the discrete emotion states onto the dimensional emotion model, were manually elaborated to take into account the various emotional states under consideration, whereas in practice the types of emotion considered are rather application or dataset dependent.
Clearly, we need an automatic way for building such a hierarchical classification scheme for machine-based emotion analysis, especially when the number of emotions changes and their types vary.

In this paper, we propose an automatically elaborated hierarchical classification scheme, which is driven by an evidence theory-based feature selection (ESFS), for the purpose of application-dependent emotion recognition. Evaluated on the Berlin dataset with 68 features and six emotion states, this automatically elaborated hierarchical classifier (ACS) showed its effectiveness, displaying a 71.38% accuracy rate compared to a 71.52% classification rate achieved by our previous dimensional model-driven but still manually elaborated multistage classifier (DEC). Using the DES dataset with five emotion states, our ACS achieved a 76.74% recognition rate compared to an 81% accuracy rate displayed by the manually elaborated multistage classification scheme (DEC). So far as we know, the best classification rates displayed in the literature are, respectively, 66% [17] and 76.15% [18] on the same DES dataset.

The remainder of this paper is organized as follows. Section 2 briefly introduces our evidence theory-based embedded feature selection scheme, the ESFS. We describe in Section 3 our ESFS-based algorithm for automatically deriving a hierarchical classification scheme (ACS) for application-dependent emotion analysis. Section 4 presents the experimental results of our ACS both on the Berlin and the DES datasets compared to the ones by our previously empirically elaborated but dimensional emotion model-driven hierarchical classification schemes (DECs). Finally, we summarize and conclude our work in Section 5.

2. ESFS: An Evidence Theory-Based SFS

Feature subset selection is an important subject when training classifiers in machine learning (ML) problems. Practical ML algorithms are known to degrade in prediction accuracy when faced with many features that are not necessary [19, 20]. The current feature selection methods can be categorized into three broad classes according to their dependence on the underlying classifier [21]: the filter approach, the wrapper approach, and the embedded approach. In this section, we describe a novel embedded feature selection method, called ESFS, which is similar to the wrapper method SFS in that it relies on the simple principle of incrementally adding the most relevant features. As an embedded method, our ESFS also carries out the selection of an optimal feature subset together with the classifier construction [22, 23]. Its originality lies in the use of mass functions from the evidence theory, which allow the information sources carried by the features to be merged elegantly in an embedded way, thereby leading to a lower computational cost than the original SFS. We first introduce some basics of the evidence theory and then describe our ESFS.

2.1. Introduction to the Evidence Theory

In our feature selection scheme, the term “belief mass” from the evidence theory is introduced into the processing of features. Dempster and Shafer wanted in the 1970s to calculate a general uncertainty level from the Bayesian theory. They developed the concept of “uncertainty mapping” to measure the uncertainty between a lower limit and an upper limit [24, 25]. Similar to the probabilities in the Bayesian theory, they presented a combination rule of the belief masses (or mass function) 𝑚().

The evidence theory was completed and presented by Shafer in [26]. It relies on the definition of a set $\Omega$ of $n$ hypotheses which have to be exclusive and exhaustive. In this theory, the reasoning concerns the frame of discernment $2^{\Omega}$, which is the set composed of the $2^{n}$ subsets of $\Omega$ [27]. In order to express the degree of confidence we have in a source of information for an event $A$ of $2^{\Omega}$, we associate to it an elementary mass of evidence $m(A)$. The elementary mass function or belief mass, which represents the chance of being a true statement, is defined as
$$m : 2^{\Omega} \to [0,1], \tag{1}$$
which satisfies
$$m(\emptyset) = 0, \qquad \sum_{A \in 2^{\Omega}} m(A) = 1. \tag{2}$$
The belief function is defined if it satisfies $\mathrm{Bel}(\emptyset) = 0$ and $\mathrm{Bel}(\Omega) = 1$ and, for any collection $A_1, \ldots, A_n$ of subsets of $\Omega$,
$$\mathrm{Bel}(A_1 \cup \cdots \cup A_n) \ge \sum_{I \subseteq \{1,\ldots,n\},\ I \neq \emptyset} (-1)^{|I|+1}\, \mathrm{Bel}\Bigl(\bigcap_{i \in I} A_i\Bigr). \tag{3}$$
The belief function gives the lower bound on the chances, and it corresponds to the mass function through the following formulae:
$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B), \quad \forall A \subseteq \Omega, \qquad m(A) = \sum_{B \subseteq A} (-1)^{|A - B|}\, \mathrm{Bel}(B), \tag{4}$$
where $|X|$ means the number of elements in the subset $X$.

The doubt function is defined as
$$\mathrm{Dou}(A) = \mathrm{Bel}(\neg A), \tag{5}$$
and the upper probability function is defined as
$$P(A) = 1 - \mathrm{Dou}(A). \tag{6}$$
The true belief in $A$ should be between $\mathrm{Bel}(A)$ and $P(A)$.

Dempster's combination rule can combine two or more independent sets of mass assignments by using the orthogonal sum. For the case of two mass functions, let $m_1$ and $m_2$ be mass functions on the same frame $\Omega$; the orthogonal sum $m = m_1 \oplus m_2$ is defined by $m(\emptyset) = 0$ and
$$m(A) = K \sum_{X \cap Y = A} m_1(X)\, m_2(Y), \qquad K^{-1} = 1 - \sum_{X \cap Y = \emptyset} m_1(X)\, m_2(Y). \tag{7}$$
For the case with more than two mass functions, let $m = m_1 \oplus \cdots \oplus m_n$; it satisfies $m(\emptyset) = 0$ and
$$m(A) = K \sum_{\bigcap A_i = A}\ \prod_{1 \le i \le n} m_i(A_i), \qquad K^{-1} = 1 - \sum_{\bigcap A_i = \emptyset}\ \prod_{1 \le i \le n} m_i(A_i). \tag{8}$$
This definition of mass functions from the evidence theory is used in our model to represent the source of information given by each acoustic feature, to combine these sources easily, and to treat them as a classifier whose recognition value is given by the mass function.
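The two-source case of Dempster's rule in (7) can be sketched in a few lines. This is only an illustrative sketch: the function name `dempster_combine`, the representation of a mass function as a dictionary keyed by `frozenset`, and the toy masses are assumptions made here, not part of the original work.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Orthogonal sum of two mass functions given as {frozenset: mass} dicts."""
    combined = {}
    conflict = 0.0
    for (x, mx), (y, my) in product(m1.items(), m2.items()):
        inter = x & y
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mx * my
        else:
            conflict += mx * my          # product mass that falls on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    k = 1.0 / (1.0 - conflict)           # normalisation constant K of (7)
    return {a: k * v for a, v in combined.items()}

# Toy example on a two-hypothesis frame (values are illustrative only).
A = frozenset({"anger"})
B = frozenset({"sadness"})
AB = A | B
m1 = {A: 0.6, AB: 0.4}
m2 = {B: 0.5, AB: 0.5}
m = dempster_combine(m1, m2)
```

In this toy case the conflicting mass is 0.6 × 0.5 = 0.3, so the remaining products are rescaled by K = 1/0.7 and the combined masses again sum to one.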

2.2. The ESFS Scheme

An exhaustive search for the best subset of features, which requires exploring a space of $2^{n}$ subsets, is impractical; we thus turn to a heuristic approach for the feature selection, as SFS does. However, unlike SFS, which is a wrapper-based approach, our evidence theory-based feature-selection technique, ESFS, makes use of the concept of belief mass from the evidence theory as a classifier and of its combination rules to fuse various audio features, leading to an embedded feature-selection method. Moreover, as compared to the original SFS, the range of subsets to be evaluated in the forward process in ESFS is extended to multiple subsets for each size, and the feature set is reduced according to a certain threshold before the selection in order to decrease the computational burden caused by the extension of the subsets in the evaluation.

A heuristic feature selection algorithm can be characterized by its stance on four basic issues that determine the nature of the heuristic search process [28]. First, one must determine the starting point in the space of feature subsets, which influences the direction of search, and the operators used to generate successor states. The second issue concerns the search strategy. As an exhaustive search in a space of $2^{n}$ feature subsets is impractical, one needs to provide a more realistic approach such as greedy methods to traverse the space. At each point of the search, one considers local changes to the current state of the features, selects one, and iterates. The third issue concerns the strategy used to evaluate alternative subsets of features. Finally, one must decide on some criterion for halting the search.

As illustrated in Figure 1, we can summarize our embedded ESFS using belief masses by the following four steps, which answer the previous four questions:
(i) computation of the belief masses of the single features from the training set,
(ii) evaluation and ordering of the single features to decide the initial set of potentially best features,
(iii) combination of features for the generation of the feature subsets, making use of operators of combination,
(iv) selection of the best feature subset.

2.2.1. Calculation of the Belief Masses of the Single Features

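Before the feature selection starts, all features are normalized into $[0,1]$. For each feature,
$$\mathrm{Fea}_n = \frac{\mathrm{Fea}_{n0} - \min(\mathrm{Fea}_{n0})}{\max(\mathrm{Fea}_{n0}) - \min(\mathrm{Fea}_{n0})}, \tag{9}$$
where $\mathrm{Fea}_{n0}$ is the set of original values of the $n$th feature and $\mathrm{Fea}_n$ is the normalized value of the $n$th feature.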

By definition of the belief masses, the mass can be obtained in different ways which can represent the chance for a statement to be true. In our work, the PDFs (probability density functions) of the features of the training data are used for calculating the masses of the single features.

The curves of PDFs of the features are obtained by applying polynomial interpolation to the statistics of the distribution of the feature values from the training data.

Taking the case of a 2-class classifier as an example, the two classes are defined as subset $A$ and subset $A^{C}$. First, the probability densities of the features in each of the 2 subsets are estimated from the training samples by the statistics of the values of the features in each class. We define the probability density of the $k$th feature $\mathrm{Fea}_k$ in subset $A$ as $\mathrm{Pr}_k(A, f_k)$ and the probability density in subset $A^{C}$ as $\mathrm{Pr}_k(A^{C}, f_k)$, where $f_k$ is the value of the feature $\mathrm{Fea}_k$. According to the probability densities, the masses of feature $\mathrm{Fea}_k$ on these 2 subsets can be defined to meet the requirement in (2) as
$$m_k(A, f_k) = \frac{\mathrm{Pr}_k(A, f_k)}{\mathrm{Pr}_k(A, f_k) + \mathrm{Pr}_k(A^{C}, f_k)}, \qquad m_k(A^{C}, f_k) = \frac{\mathrm{Pr}_k(A^{C}, f_k)}{\mathrm{Pr}_k(A, f_k) + \mathrm{Pr}_k(A^{C}, f_k)}, \tag{10}$$
where, at any possible value $f_k$ of the $k$th feature, $m_k(A, f_k) + m_k(A^{C}, f_k) = 1$.

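In the case of $N$ classes, the classes are defined as $A_1, A_2, \ldots, A_N$. The masses of feature $F_k$ for the $i$th class $A_i$ can be obtained as
$$m_k(A_i, f_k) = \frac{\mathrm{Pr}_k(A_i, f_k)}{\sum_{n=1}^{N} \mathrm{Pr}_k(A_n, f_k)}, \tag{11}$$
which satisfies
$$\sum_{i=1}^{N} m_k(A_i, f_k) = 1. \tag{12}$$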
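The normalization in (9) and the mass computation in (11) can be sketched as follows. This is a simplified illustration: the class-conditional densities are approximated here by plain per-class histograms rather than by the polynomial interpolation described above, and the helper names (`normalize`, `class_histograms`, `masses`), the bin count, and the toy data are all assumptions introduced for the example.

```python
def normalize(values):
    """Min-max normalisation of one feature to [0, 1], following (9)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def class_histograms(values, labels, classes, n_bins=10):
    """Per-class relative frequency per bin; a crude stand-in for the PDFs."""
    counts = {c: [0] * n_bins for c in classes}
    for v, c in zip(values, labels):
        counts[c][min(int(v * n_bins), n_bins - 1)] += 1
    eps = 1e-6                        # small floor so empty bins never give 0/0
    return {c: [n / max(sum(h), 1) + eps for n in h] for c, h in counts.items()}

def masses(value, hists):
    """Belief masses of one feature value over all classes, following (11)."""
    n_bins = len(next(iter(hists.values())))
    b = min(int(value * n_bins), n_bins - 1)
    dens = {c: h[b] for c, h in hists.items()}
    total = sum(dens.values())
    return {c: d / total for c, d in dens.items()}

# Toy data: six samples of one feature, two classes (illustrative values only).
raw = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["A", "A", "A", "B", "B", "B"]
values = normalize(raw)
hists = class_histograms(values, labels, classes=["A", "B"])
m = masses(values[0], hists)
```

Because every bin has the same width, dividing the per-bin frequencies by the bin width would not change the ratio in (11), so the sketch skips that step.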

2.2.2. Evaluation of the Single Features and Selection of the Initial Set of Potentially Best Features

Once the belief masses associated with each single feature have been computed from the training data, they are used to build a simple classifier. Indeed, given the belief masses associated with a single feature, a data sample can simply be assigned to the class having the highest belief mass. Using the classification accuracy rates of these single-feature classifiers, all the features can then be sorted in descending order as $\{F_{s\_1}, F_{s\_2}, \ldots, F_{s\_N}\}$, where $N$ is the number of features in the whole feature set. In order to reduce the computational burden in the feature selection, an initial feature set $\mathrm{FS}_{\mathrm{ini}}$ is selected with the first $K$ best features in this ordered feature set, using for instance a threshold for cutting off classification accuracy rates, leading to
$$\mathrm{FS}_{\mathrm{ini}} = \{F_{s\_1}, F_{s\_2}, \ldots, F_{s\_K}\}. \tag{13}$$

The threshold on the classification rates is set relative to the best classification rate as
$$R_{\mathrm{single}}(F_{s\_K}) \ge \mathrm{thres\_1} \times R_{\mathrm{best\_1}}, \tag{14}$$
where $R_{\mathrm{best\_1}} = R_{\mathrm{single}}(F_{s\_1})$, as illustrated in Figure 2. In our work on vocal emotion analysis, the threshold value thres_1 is set to 0.8, a value chosen experimentally as a balance between the overall performance and the calculation time. This threshold may vary with the underlying problem. Out of the 68 features considered in our work, around 30 features are kept in our vocal emotion analysis problem when the threshold is set to 0.8.

Only the features selected in the set FSini take part in the later steps of the feature selection process. The elements (features) in FSini are at the same time considered as feature subsets of size 1.
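The threshold-based choice of the initial feature set can be sketched as below. The function name `select_initial_features`, the dictionary of single-feature accuracy rates, and the toy rates are hypothetical; only the rule "keep every feature whose rate is at least thres_1 times the best rate" comes from the text.

```python
def select_initial_features(single_rates, thres=0.8):
    """Rank features by single-feature accuracy and keep those within
    a factor `thres` of the best rate."""
    ranked = sorted(single_rates.items(), key=lambda kv: kv[1], reverse=True)
    best_rate = ranked[0][1]
    return [name for name, rate in ranked if rate >= thres * best_rate]

# Illustrative accuracy rates of four single-feature classifiers.
single_rates = {"f1": 0.70, "f2": 0.50, "f3": 0.60, "f4": 0.58}
fs_ini = select_initial_features(single_rates)
```

With a best rate of 0.70 the cutoff is 0.56, so `f2` is dropped and the remaining features are returned in descending order of accuracy.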

2.2.3. Combination of Features for the Generation of the Best Feature Subsets

Having the best feature subsets of size $k-1$ ($k \ge 2$), the generation of a new feature subset of size $k$ is achieved by computing a new composite feature through an operator of combination, thus fusing a feature subset of size $k-1$ with one feature from the initial feature set FSini. All these composite features are then sorted in descending order according to their classification accuracy, and the best ones are selected using a threshold as we did for the selection of the initial feature set FSini.

We denote the set of all the feature subsets of size $k$ under evaluation as $\mathrm{FS}'_k$ and the set of the selected feature subsets of size $k$ as $\mathrm{FS}_k$. Thus, $\mathrm{FS}'_1$ equals the original whole feature set, and $\mathrm{FS}_1 = \mathrm{FS}_{\mathrm{ini}}$. From $k = 2$, the set of the feature subsets $\mathrm{FS}'_k$ is given by
$$\mathrm{FS}'_k = \mathrm{Combine}(\mathrm{FS}_{k-1}, \mathrm{FS}_{\mathrm{ini}}) = \{\mathrm{Fc}^{0}_{1\_k}, \mathrm{Fc}^{0}_{2\_k}, \ldots, \mathrm{Fc}^{0}_{N_k\_k}\}, \tag{15}$$
where the function "Combine" generates new composite features by combining features from each of the two sets $\mathrm{FS}_{k-1}$ and $\mathrm{FS}_{\mathrm{ini}}$ in all possible combinations, except those in which a feature from $\mathrm{FS}_{\mathrm{ini}}$ already appears in the composite feature from $\mathrm{FS}_{k-1}$; $\mathrm{Fc}^{0}_{n\_k}$ represents the new composite features generated using an operator of combination, and $N_k$ is the number of elements in the set $\mathrm{FS}'_k$.

The creation of a new composite feature from two other features is achieved by combining the belief masses of the two features, making use of an operator of combination. The fusing process works as follows.

Assume that $N$ classes are considered in the classifier. For the $i$th class $A_i$, the preprocessed mass $m'$ for the new composite feature $\mathrm{Fc}^{0}_{t\_k}$, which is generated from a composite feature $\mathrm{Fc}_{x\_k-1}$ in $\mathrm{FS}_{k-1}$ and a feature $\mathrm{Fs}_y$ from $\mathrm{FS}_{\mathrm{ini}}$, $\mathrm{Fc}^{0}_{t\_k} = \mathrm{Combine}(\mathrm{Fc}_{x\_k-1}, \mathrm{Fs}_y)$, is calculated as
$$m'(A_i, \mathrm{fc}^{0}_{t\_k}) = T\bigl(m(A_i, \mathrm{fc}_{x\_k-1}),\ m(A_i, \mathrm{fs}_y)\bigr), \tag{16}$$
where $f_x$ is the value of the feature $F_x$ and $T(x, y)$ is an operator of combination. The commonly used existing operators for fusing two elements, the triangular norms, are used in our work to combine the features. These operators are explained in detail in the next subsection. The sum of the $m'$ may not be 1, depending upon the combination operator being used. In order to meet the definition of belief masses, the $m'$ are then normalized to give the masses for the new composite feature:
$$m(A_i, \mathrm{fc}^{0}_{t\_k}) = \frac{m'(A_i, \mathrm{fc}^{0}_{t\_k})}{\sum_{n=1}^{N} m'(A_n, \mathrm{fc}^{0}_{t\_k})}. \tag{17}$$
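The fusion step of (16) followed by the normalization of (17) can be sketched as follows. The function name `combine_masses`, the dictionary representation of per-class masses, the uniform fallback for an all-zero result, and the use of the plain product as the operator are assumptions made for illustration only.

```python
def combine_masses(m_a, m_b, op):
    """Fuse two per-class mass dicts with a binary operator, then renormalise."""
    raw = {c: op(m_a[c], m_b[c]) for c in m_a}      # preprocessed masses, as in (16)
    total = sum(raw.values())
    if total == 0.0:
        # Assumption (not from the text): fall back to uniform masses
        # when the operator returns zero for every class.
        return {c: 1.0 / len(raw) for c in raw}
    return {c: v / total for c, v in raw.items()}   # normalised masses, as in (17)

def product(x, y):
    """Plain product, used here only as an illustrative operator."""
    return x * y

m_a = {"A": 0.7, "B": 0.3}
m_b = {"A": 0.6, "B": 0.4}
m_new = combine_masses(m_a, m_b, product)
```

In this toy case both features favor class A, and the fused mass for A (0.42 / 0.54, about 0.78) is higher than either input mass, which is the kind of reinforcement the combination is meant to exploit.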

The performance of the new composite feature may be better than that of both base features used in the combination, as illustrated in Figure 3. However, the new composite feature may also perform worse than either of the two original features, in which case it will be eliminated in the feature selection process.

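All these new composite features can also be sorted in descending order according to their classification accuracy on the training dataset, as we did for the single original audio features, so that the set of evaluated composite features $\mathrm{FS}'_k$ is relabeled in sorted order:
$$\mathrm{FS}'_k = \{\mathrm{Fc}^{0}_{1\_k}, \mathrm{Fc}^{0}_{2\_k}, \ldots, \mathrm{Fc}^{0}_{N_k\_k}\} = \{\mathrm{Fc}_{1\_k}, \mathrm{Fc}_{2\_k}, \ldots, \mathrm{Fc}_{N_k\_k}\}. \tag{18}$$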

The best composite feature of size $k$ is denoted as $\mathrm{Fc}_{\mathrm{best}\_k} = \mathrm{Fc}_{1\_k}$, and its classification accuracy is recorded as $R_{\mathrm{best}\_k}$. Similar to the selection of FSini from the single original features, a threshold is set to select a number of composite features of size $k$ for the next step of forward selection. The set of these selected composite features is denoted as
$$\mathrm{FS}_k = \{\mathrm{Fc}_{1\_k}, \mathrm{Fc}_{2\_k}, \ldots, \mathrm{Fc}_{N^{0}_{k}\_k}\}, \tag{19}$$
which satisfies $R(\mathrm{Fc}_{N^{0}_{k}\_k}) \ge \mathrm{thres}\_k \times R_{\mathrm{best}\_k}$. In order to simplify the selection, the threshold value thres_$k$ is set in our work to the same value of 0.8 at every step, without any adaptation to each step.

2.2.4. Stop Criterion and the Selection of the Best Feature Subset

The stop criterion of ESFS is met when the best classification rate begins to decrease while the size of the feature subsets increases. In our work, in order to avoid missing the real peak of the classification performance, the forward selection stops when the classification performance continues to decrease over two steps, $R_{\mathrm{best}\_k} < \min(R_{\mathrm{best}\_k-1}, R_{\mathrm{best}\_k-2})$. The number of the selected composite features is denoted as Num_select, and the selected composite feature is
$$\mathrm{SS} = \mathrm{Fc}_{1\_\mathrm{Num\_select}} = \mathrm{Combine}(\mathrm{Fc}_{x\_\mathrm{Num\_select}-1}, \mathrm{Fs}_y) = \cdots = \mathrm{Combine}(\mathrm{Combine}(\cdots \mathrm{Combine}(\mathrm{Fs}_p, \mathrm{Fs}_q) \cdots)). \tag{20}$$
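The stopping rule can be sketched on a list of best rates per subset size. The function name `forward_stop` and the toy rates are hypothetical, and returning the size with the highest rate seen before stopping is an interpretation made here of how Num_select is chosen; the text itself only states the inequality.

```python
def forward_stop(best_rates):
    """best_rates[i] is the best accuracy for subsets of size i + 1.

    Returns (stop_size, peak_size): the size at which the rule
    R_best_k < min(R_best_{k-1}, R_best_{k-2}) first fires (or the last
    size if it never fires), and the size with the highest rate up to
    that point.
    """
    stop = len(best_rates)
    for i in range(2, len(best_rates)):
        if best_rates[i] < min(best_rates[i - 1], best_rates[i - 2]):
            stop = i + 1
            break
    peak = max(range(stop), key=lambda i: best_rates[i]) + 1
    return stop, peak

# Illustrative rates: the peak is at size 3, and the rule fires at size 5.
rates = [0.80, 0.85, 0.88, 0.87, 0.86, 0.90]
stop_size, peak_size = forward_stop(rates)
```

Note that a single drop (size 4 in the toy list) does not stop the search, which is the point of looking two steps back.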

2.3. Operators of Combination

Aggregation and fusion of different information sources are basic concerns in many systems and applications. There exist different fusion approaches, including evidence theory, possibility theory, and fuzzy set theory, but all these approaches can be summarized as the application of some numerical aggregation operators. Generally speaking, aggregation operators are mathematical functions that reduce a set of numbers to a unique representative number [29].

Since the combination of the masses of the features in our feature selection scheme amounts to combining two features, the commonly used existing operators for two elements, the triangular norms, are used in our work to fuse the features as in (16).

The triangular norm (abbreviated as t-norm) is a kind of binary operation used in the framework of probabilistic metric spaces and in multivalued logic; it was first introduced by Menger [30] in order to generalize the triangular inequality of a metric. The current concept of a t-norm and its dual operator (t-conorm) is due to Schweizer and Sklar [31, 32]. The t-norms generalize the conjunctive "AND" operator and the t-conorms generalize the disjunctive "OR" operator. These properties enable them to be used to define the intersection and union operations [29, 33].

The definitions of a t-norm and a t-conorm are as follows.

t-norm
A t-norm is a function $T : [0,1] \times [0,1] \to [0,1]$ having the following properties:
(1) $T(x,y) = T(y,x)$ (T1), commutativity,
(2) $T(x,y) \le T(u,v)$ if $x \le u$ and $y \le v$ (T2), monotonicity (increasing),
(3) $T(x, T(y,z)) = T(T(x,y), z)$ (T3), associativity,
(4) $T(x,1) = x$ (T4), one as a neutral element.
A well-known property of t-norms is
$$T(x,y) \le \min(x,y). \tag{21}$$

t-conorm
Formally, a t-conorm is a function $S : [0,1] \times [0,1] \to [0,1]$ having the following properties:
(1) $S(x,y) = S(y,x)$ (S1), commutativity,
(2) $S(x,y) \le S(u,v)$ if $x \le u$ and $y \le v$ (S2), monotonicity (increasing),
(3) $S(x, S(y,z)) = S(S(x,y), z)$ (S3), associativity,
(4) $S(x,0) = x$ (S4), zero as a neutral element.

A well-known property of t-conorms is
$$S(x,y) \ge \max(x,y). \tag{22}$$

We say that a t-norm and a t-conorm are dual (or associated) if they satisfy the De Morgan law
$$1 - T(x,y) = S(1-x, 1-y). \tag{23}$$
The minimum is the biggest t-norm, and its dual, the maximum, is the smallest t-conorm.

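Six parameterized t-norms, namely Lukasiewicz, Hamacher, Yager, Weber-Sugeno, Schweizer and Sklar, and Frank, which are frequently proposed in the literature [34], were tested with different parameters in our work. They are defined as follows:
(1) Lukasiewicz
$$T(x,y) = \max(x + y - 1,\ 0), \tag{24}$$
(2) Hamacher
$$T(x,y) = \frac{xy}{\gamma + (1-\gamma)(x + y - xy)}, \quad \gamma \ge 0, \tag{25}$$
(3) Yager
$$T(x,y) = \max\Bigl(1 - \bigl((1-x)^{p} + (1-y)^{p}\bigr)^{1/p},\ 0\Bigr), \quad p > 0, \tag{26}$$
(4) Weber-Sugeno
$$T(x,y) = \max\Bigl(\frac{x + y - 1 + \lambda_T\, xy}{1 + \lambda_T},\ 0\Bigr), \quad \lambda_T > -1, \tag{27}$$
(5) Schweizer and Sklar
$$T(x,y) = 1 - \bigl((1-x)^{q} + (1-y)^{q} - (1-x)^{q}(1-y)^{q}\bigr)^{1/q}, \quad q > 0, \tag{28}$$
(6) Frank
$$T(x,y) = \log_{s}\Bigl(1 + \frac{(s^{x} - 1)(s^{y} - 1)}{s - 1}\Bigr), \quad s > 0,\ s \ne 1. \tag{29}$$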
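For readers who want to experiment with these operators, the six definitions (24) to (29) can be written directly as small functions. This is a sketch: the function names and the default parameter values are arbitrary choices made here for illustration and are not the settings used in the experiments.

```python
import math

def lukasiewicz(x, y):
    return max(x + y - 1.0, 0.0)

def hamacher(x, y, gamma=1.0):
    denom = gamma + (1.0 - gamma) * (x + y - x * y)
    # With gamma = 0 and x = y = 0 the formula is 0/0; return 0 by convention.
    return 0.0 if denom == 0.0 else x * y / denom

def yager(x, y, p=2.0):
    return max(1.0 - ((1.0 - x) ** p + (1.0 - y) ** p) ** (1.0 / p), 0.0)

def weber_sugeno(x, y, lam=0.0):
    return max((x + y - 1.0 + lam * x * y) / (1.0 + lam), 0.0)

def schweizer_sklar(x, y, q=2.0):
    a, b = (1.0 - x) ** q, (1.0 - y) ** q
    return 1.0 - (a + b - a * b) ** (1.0 / q)

def frank(x, y, s=2.0):
    return math.log(1.0 + (s ** x - 1.0) * (s ** y - 1.0) / (s - 1.0), s)

T_NORMS = [lukasiewicz, hamacher, yager, weber_sugeno, schweizer_sklar, frank]

# Evaluate every operator on the same pair of inputs.
values = {t.__name__: t(0.3, 0.8) for t in T_NORMS}
```

A quick sanity check on such implementations is the neutral-element property (T4): each function should return `x` when called with `y = 1`, and no output should exceed `min(x, y)`.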

Figure 4 depicts the various surfaces associated with the aforementioned six combination operators. The 𝑧-axis represents the output of the operators from the two inputs 𝑥 and 𝑦. As we can see from these curves, the Yager, Weber-Sugeno, and Schweizer and Sklar operators have convex surfaces, while the Lukasiewicz and Frank operators have flatter or even concave surfaces. For each operator, the degree of convexity or concavity is affected by the parameters. The difference in the shape of the surfaces may influence the performance when they are applied in the classification.

In addition to these t-norm operators, the average and the geometric average of the features are also used for the combination of the features.
(i) Average
$$A(x,y) = \frac{x + y}{2}. \tag{30}$$
(ii) Geometric average
$$G_a(x,y) = \sqrt{xy}. \tag{31}$$

The property curve surfaces of average and geometric average are displayed in Figure 5.

It should be noted that, since a normalization step is applied when calculating the masses of the combined new features in (17), the "associativity" property of the t-norms no longer holds in our case. Consequently, in order to preserve the performance of the final combined new feature, the order in which the selected features are combined cannot be changed arbitrarily.

2.4. Discussion

ESFS can be used either as an embedded feature selection method, as we do in the next section when building the hierarchical classification scheme for emotion analysis, or as a simple filter method for the selection of relevant features which can then be fed into classifiers. We carried out experiments with ESFS used as a filter method, comparing its behavior with that of other filter feature-selection techniques, including the Fisher filter method, PCA, and SFS. Using the Berlin dataset for emotional speech recognition and the Simplicity dataset for visual object recognition, our ESFS displayed better performance, showing its effectiveness in the selection of relevant features [35].

3. ESFS-Based Hierarchical Classification Scheme for Vocal Emotion Recognition

The fuzzy neighborhood relationship between some emotional classes, for instance between sadness and boredom, as evidenced by studies on acoustic correlates, leads to unnecessary confusion between emotion states when a single global classifier is applied using the same set of features. While several previous works have shown the effectiveness of multistage classification schemes on vocal emotion analysis, the elaboration of these hierarchical classification schemes was intuitive and manual. On the other hand, the number of emotions and their types to be recognized are typically dataset or application dependent. The empirically built hierarchical classification structure thus needs to be adjusted when the emotional space changes. In this section, we propose an automatically elaborated Hierarchical Classification Scheme (ACS) which is driven by our evidence theory-based feature selection technique ESFS. While keeping at least similar performance, the main goal here is to avoid the repeated work of manually building a new multistage classification scheme each time the vocal emotions to be analyzed change.

Basically, our ESFS, when applied as an embedded feature selection technique to an application specific vocal emotion recognition problem, automatically divides in an optimal way the set of emotional states to be recognized into two disjoint subsets of emotional states, leading to a hierarchical classifier represented by a binary tree whose root is the union of all emotion classes, while leaves are single emotion classes and intermediate nodes composite emotional classes discriminated by a subclassifier. Each of these subclassifiers is based on our ESFS introduced in the previous section, thus extracting the best features to best discriminate two composite emotional classes.

The generation process of an ACS is shown in Figure 6. The $N$ discrete emotional classes concerned in the classification problem are first assigned to a frame of discernment $\Omega = \{E_1, E_2, \ldots, E_N\}$, where $E_n$ stands for the $n$th emotional state in the frame of discernment $\Omega$. For example, the frame of discernment associated with the Berlin database is $\Omega_{\mathrm{Berlin}}$ = {Anger, Happiness, Fear, Neutral, Sadness, Boredom}, while the frame of discernment associated with the DES dataset is $\Omega_{\mathrm{DES}}$ = {Anger, Happiness, Neutral, Surprise, Sadness}. The frame of discernment describes the initial affect space under study. Using our embedded ESFS, the affect space will be recursively divided into two complementary subaffect spaces which best describe the affect space with respect to the training data, until the subaffect spaces become simple emotional classes.

The hierarchical classifier is thus expressed by a binary tree. The initial frame of discernment is set as the root node of this binary tree. The main steps for generating an ACS are listed as follows.

3.1. The Algorithm

Step 1. The hierarchical structure is composed of several binary subclassifiers. The frame $\Omega$ is divided into pairs of nonempty subsets, exhausting all possible partitionings of the initial affect space. The two subsets in each pair are complements of each other, and each subset represents a class with respect to the initial affect space $\Omega$:
$$\Omega = A_n \cup A_n^{C}, \qquad A_n \neq \emptyset, \qquad A_n^{C} \neq \emptyset. \tag{32}$$
All the possible pairs of complements are evaluated using ESFS to decide which partitioning of the initial affect space is the best from the viewpoint of classification accuracy. In order to avoid repeated partitioning, the pairs are defined to ensure that the number of elements in subset 𝐴𝑛 is not larger than the number of elements in 𝐴𝐶𝑛. For example, in the case with 4 classes, 7 pairs of subsets can be evaluated as listed in Table 1.

Our feature combination and selection process (ESFS) is applied to each pair of the subsets and the belief masses of the training samples in the subsets can be obtained. All these pairs can then be sorted by their classification accuracy rates.

Step 2. The two subsets in the pair with the highest classification rate (assuming it is the 𝑛th pair of subsets) are assigned as the children nodes: 𝐴𝑛 as the left child node and 𝐴𝐶𝑛 as the right child node.
The two child nodes $A_n$ and $A_n^{C}$ are then processed in the same way as in Step 1. The numbers of elements in the child nodes $A_n$ and $A_n^{C}$ are counted. Denote either subset, $A_n$ or $A_n^{C}$, as $A$.
(i) If $\mathrm{Size}_A = 1$ (only one element in the subset), this node is marked as a leaf node.
(ii) If $\mathrm{Size}_A > 1$ (the subset can be further partitioned), the frame of discernment is updated as $\Omega = A$, and the construction of the binary tree continues with Step 1.

Step 3. When the number of leaf nodes equals the number of emotional classes, the generation process of the binary tree stops. The information about the binary tree is stored in the model of the classifier.
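The three steps above can be sketched as a recursive procedure. In this sketch, `candidate_partitions` enumerates the complementary pairs of Step 1 (smaller subset first, mirrored splits removed), and `build_tree` applies Steps 2 and 3. The `score` callback is a hypothetical stand-in for the ESFS classification accuracy of a pair; the toy score used below simply prefers even splits and carries no experimental meaning.

```python
from itertools import combinations

def candidate_partitions(classes):
    """All splits of `classes` into two non-empty complementary subsets,
    with the first subset never larger than the second."""
    classes = list(classes)
    n = len(classes)
    pairs = []
    for size in range(1, n // 2 + 1):
        for subset in combinations(classes, size):
            # For equal-sized halves, keep only one of the two mirrored splits.
            if 2 * size == n and classes[0] not in subset:
                continue
            rest = tuple(c for c in classes if c not in subset)
            pairs.append((subset, rest))
    return pairs

def build_tree(classes, score):
    """Recursively split until every node holds a single class.

    Leaves are class names; internal nodes are (left, right) tuples.
    """
    classes = tuple(classes)
    if len(classes) == 1:
        return classes[0]
    left, right = max(candidate_partitions(classes), key=lambda p: score(*p))
    return (build_tree(left, score), build_tree(right, score))

def toy_score(a, b):
    """Hypothetical score: the more even the split, the higher the value."""
    return -abs(len(a) - len(b))

n_pairs_4 = len(candidate_partitions("ABCD"))
tree = build_tree(("A", "B", "C", "D"), toy_score)
```

For four classes the enumeration yields the 7 pairs mentioned in Step 1; in general there are 2^(N-1) - 1 candidate pairs for N classes, so the exhaustive evaluation is only practical for a small number of emotions.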

3.2. Practice and Improvement

In practice, we want the ACS resulting from the previous scheme to be as balanced as possible. Indeed, the overall classification accuracy rate of a multistage hierarchical classifier is approximately the product of the classification rates at each stage. Assuming the different stages in the classifier have classification accuracy rates close to each other, say $R_{\mathrm{stage}}$, the overall classification rate of an $n$-stage classifier can be approximated by $R_{\mathrm{stage}}^{\,n}$. Thus, too many stages may lead to a dramatic degradation of the overall classification accuracy rate. In order to reach a classification accuracy as high as possible, one needs to reduce the depth of the tree-based hierarchical classifier so that it has a balanced structure.
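As a rough numerical illustration of this approximation, assume a per-stage rate of 0.9 (a value chosen here only for the example). Going from two stages to three then costs about eight points of overall accuracy:

```latex
\[
R_{\mathrm{stage}}^{2} = 0.9^{2} = 0.81,
\qquad
R_{\mathrm{stage}}^{3} = 0.9^{3} = 0.729 .
\]
```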

In our work, the notion of a balanced pair of subsets is put forward. For each pair of subsets $A_n$ and $A_n^{C}$, a subset distance is calculated as the difference between the numbers of elements of the two subsets,
$$D_n = \mathrm{Size}_{A_n^{C}} - \mathrm{Size}_{A_n}, \tag{33}$$
with
$$\mathrm{Size}_A = |A|, \tag{34}$$
where $|X|$ means the number of elements in the set $X$.

Because the subsets $A_n$ and $A_n^{C}$ are defined so that $A_n$ never has more elements than $A_n^{C}$, $D_n$ always satisfies
$$D_n \ge 0. \tag{35}$$
When the $n$th pair of subsets satisfies $D_n \le 1$, it is defined as a balanced pair of subsets.

If the pair of subsets with the highest classification rate (assuming that it is the $n_1$th pair and its classification rate is $R_{n_1}$) is a balanced pair, the generation of the binary tree continues normally; if it is not a balanced pair, it is compared with the balanced pair having the best classification rate (assuming that it is the $n_2$th pair and its classification rate is $R_{n_2}$). If we have only five or six classes in our applications, as is the case for the Berlin and DES datasets, there should be two or three stages in a balanced binary tree. We thus set a threshold thre_diff defining an acceptable difference in classification accuracy, which is determined by $(R_{n_1} - \mathrm{thre\_diff})^{2} \ge (R_{n_1})^{3}$, assuming the number of stages does not exceed three. The approximate values of thre_diff related to $R_{n_1}$ are listed in Table 2. The best classification rate $R_{n_1}$ in the first stage of the hierarchical structure is normally around 90% in the experiments, so the most commonly selected thre_diff is between 4% and 5%. When the number of classes increases in the classification problems, the thresholds should be adjusted according to the number of classes.

If $R_{n_1} - R_{n_2} <$ thre_diff, the binary tree with the balanced pair is assumed to have better overall performance in the classification, and the $n_2$th pair is selected instead of the $n_1$th pair. However, when an unbalanced pair has a much better recognition rate than the balanced one, it is still selected.
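The selection rule between the best pair and the best balanced pair can be sketched as follows. Here `thre_diff` is derived by taking the acceptable-difference condition between a two-stage and a three-stage tree with equality, which gives thre_diff = R - R^1.5; the function names, the tuple layout `(rate, size_left, size_right)`, and the toy rates are assumptions made for illustration.

```python
def thre_diff(r_n1):
    """Largest accuracy drop d such that (r - d)**2 >= r**3, i.e. d = r - r**1.5."""
    return r_n1 - r_n1 ** 1.5

def is_balanced(pair):
    _, size_left, size_right = pair
    return abs(size_right - size_left) <= 1

def choose_pair(pairs):
    """pairs: list of (rate, size_left, size_right) tuples; returns the chosen one."""
    best = max(pairs, key=lambda p: p[0])
    if is_balanced(best):
        return best
    balanced = [p for p in pairs if is_balanced(p)]
    if not balanced:
        return best
    best_balanced = max(balanced, key=lambda p: p[0])
    # Prefer the balanced pair when it is only slightly worse than the best pair.
    if best[0] - best_balanced[0] < thre_diff(best[0]):
        return best_balanced
    return best

t90 = thre_diff(0.90)
close_case = choose_pair([(0.90, 1, 5), (0.87, 3, 3)])
far_case = choose_pair([(0.90, 1, 5), (0.80, 3, 3)])
```

For a first-stage rate of 0.90 this derivation gives a threshold of roughly 0.046, which falls in the 4% to 5% range quoted above; the balanced pair wins in the first toy case (gap of 0.03) and loses in the second (gap of 0.10).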

The common structure of the ACS generated by this approach is shown in Figure 7. The grey doubled line illustrates the possible recognition route of an audio sample.

If the number of affect classes varies from 3 to 7 as it is the case for most affect recognition problems currently studied, Figure 8 illustrates some typical hierarchical classifier schemes with balanced pairs of subsets.

4. Experimental Results

The effectiveness of our approach is evaluated on both the Berlin and DES datasets. In the following, we first introduce the audio features. Then, our experimental results are presented and discussed.

4.1. The Feature Set

We consider the same set of 68 features as in [11, 16], covering popular frequency and energy-based features as well as our newly introduced features, namely harmonic features for a better description of voice timbre pattern, and Zipf features for a better rhythm and prosody characterization. They are the following.

Frequency-Based Features
1–20. Mean, maximum, minimum, and median value and the variance of F0 and the first 3 formants.

Energy-Based Features
21–23. Mean, maximum, and minimum value of energy
24. Energy ratio of the signal below 250 Hz
25–32. Mean, maximum, median, and variance of the values and durations of energy plateaus
33–40, 42–49. Statistics of gradient and durations of rising and falling slopes of energy contour
41, 50. Number of rising and falling slopes of energy contour per second

Harmonic Features
51–63. Mean, maximum, variance and normalized variance of the 4 areas
64–66. The ratio of mean values of areas 2~4 to area 1

Zipf Features
67. Entropy feature of Inverse Zipf of frequency coding
68. Resampled polynomial estimation Zipf feature of UFD (Up-Flat-Down) coding

4.2. Experimental Results on the Berlin Dataset

Holdout cross-validation with 10 iterations was carried out on the Berlin dataset. In each iteration, 50% of the samples were used as the training set and the other 50% as the test set. Out of the seven basic emotions in the Berlin dataset, we excluded “disgust” because there are only 8 samples of “disgust” among the male samples, far fewer than for the other emotional classes. Moreover, the acoustic features for this emotion were shown to be inconsistent [12]. The influence of gender information on the emotion classification accuracy was also examined. For each classification scheme, three experimental settings were evaluated and compared: using only the female speech samples, only the male speech samples, and all the samples combined (mixed samples). Figure 9 illustrates the two hierarchical classification schemes automatically generated by the ESFS-driven ACS described above, for male and female samples, respectively.
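The repeated holdout protocol can be sketched as follows. This is a minimal illustration under stated assumptions: the split is a plain random shuffle without stratification, the function name `repeated_holdout` and the `evaluate` callback (which would train on one index set and return the classification rate on the other) are hypothetical, and the spread is summarized here by the population standard deviation.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def repeated_holdout(
    n_samples: int,
    evaluate: Callable[[Sequence[int], Sequence[int]], float],
    train_fraction: float = 0.5,
    iterations: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean rate, standard deviation) over repeated random splits."""
    rng = random.Random(seed)
    n_train = int(round(n_samples * train_fraction))
    rates: List[float] = []
    for _ in range(iterations):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # fresh random split for every iteration
        train_idx, test_idx = indices[:n_train], indices[n_train:]
        rates.append(evaluate(train_idx, test_idx))
    return statistics.mean(rates), statistics.pstdev(rates)
```

Setting `train_fraction=0.9` gives the 90%/10% variant of the same protocol.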

Figure 10 displays the best classification rates achieved by the eight combination operators that we tested. The error bars in the figure show the root mean square errors of the classification rates.

The best classification accuracy is 71.75%±3.10% for the female samples, 73.77%±2.33% for the male samples, 71.38%±2.33% for the mixed genders with gender classification, and 57.95%±2.87% without gender classification. These results are quite close to the ones achieved by DEC, our manually elaborated multistage classification scheme driven by the dimensional emotion model [11, 16]. All of the best results are obtained with the Schweizer and Sklar operator. From the two curves “All samples (1)” and “All samples (2)”, we can see that preprocessing the audio samples with gender classification clearly improves the overall classification performance for the mixed-gender samples. We have also found that the fusion operators Hamacher, Yager, Weber-Sugeno, and Schweizer and Sklar, which have convex curve surfaces, perform better.
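The gender-classification preprocessing behind the “All samples (1)” setting can be pictured as a simple gating step: a sample is first assigned a gender, then passed to the emotion classifier built for that gender. The sketch below only illustrates this routing; the function name, the type alias, and the toy classifiers in the usage lines are hypothetical stand-ins, not the classifiers used in the experiments.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]


def classify_with_gender_gate(
    features: Features,
    gender_classifier: Callable[[Features], str],
    emotion_classifiers: Dict[str, Callable[[Features], str]],
) -> str:
    """Route a sample to the emotion classifier trained for its predicted gender."""
    gender = gender_classifier(features)  # e.g. "female" or "male"
    return emotion_classifiers[gender](features)


# Toy stand-ins, for illustration only.
label = classify_with_gender_gate(
    [210.0, 0.4],
    gender_classifier=lambda f: "female" if f[0] > 165.0 else "male",
    emotion_classifiers={
        "female": lambda f: "happiness" if f[1] > 0.3 else "sadness",
        "male": lambda f: "anger" if f[1] > 0.3 else "neutral",
    },
)
print(label)
```

One consequence of this design is that a gender misclassification sends the sample to the wrong emotion classifier, so the mixed-gender rate depends on both stages.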

4.3. Experimental Results on the DES Dataset

Our ACS described in Section 3 was also benchmarked on the DES dataset. Figure 11 illustrates the automatically generated hierarchical classification scheme, which proves to be the same for both genders. As with the hierarchical classification schemes generated on the Berlin dataset, the first stage of the ACS scheme operates in the arousal dimension (or energy/active dimension), separating neutral and sadness from anger, happiness, and surprise. Among the three active emotions, surprise is separated from anger and happiness in the second stage, which is still in the arousal dimension. Anger and happiness are separated in the appraisal dimension in the last stage of the hierarchical framework, as for the female samples in the Berlin dataset.
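The scheme described above can be represented as a small binary tree in which each internal node holds a binary decision and each leaf a single emotion. The sketch below is illustrative only: the `Node` class, the `build_des_tree` helper, and the threshold-based decisions in the usage lines are hypothetical, and the split of neutral versus sadness is assumed to be a further binary stage, which the text does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Features = Sequence[float]


@dataclass
class Node:
    classes: frozenset                                   # emotion labels at this node
    decide: Optional[Callable[[Features], bool]] = None  # True -> left child
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def classify(self, features: Features) -> str:
        node = self
        while node.decide is not None:  # descend until a leaf is reached
            node = node.left if node.decide(features) else node.right
        (label,) = node.classes         # a leaf holds exactly one label
        return label


def leaf(label: str) -> Node:
    return Node(frozenset({label}))


def build_des_tree(is_passive, is_neutral, is_surprise, is_anger) -> Node:
    anger_happiness = Node(frozenset({"anger", "happiness"}), is_anger,
                           leaf("anger"), leaf("happiness"))
    active = Node(frozenset({"anger", "happiness", "surprise"}), is_surprise,
                  leaf("surprise"), anger_happiness)
    passive = Node(frozenset({"neutral", "sadness"}), is_neutral,
                   leaf("neutral"), leaf("sadness"))
    return Node(passive.classes | active.classes, is_passive, passive, active)


# Toy decisions on a hypothetical (arousal, appraisal) pair, for illustration only.
tree = build_des_tree(
    is_passive=lambda f: f[0] < 0.5,
    is_neutral=lambda f: f[1] >= 0.5,
    is_surprise=lambda f: f[0] > 0.9,
    is_anger=lambda f: f[1] < 0.5,
)
print(tree.classify((0.7, 0.2)))
```

In the actual ACS each decision would be a binary subclassifier with its own selected feature subset rather than a fixed threshold.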

Holdout cross-validation with 10 iterations was used in our experiments on the DES dataset. In order to compare with previous works [17, 36–38], in each iteration 90% of the segments were used as the training set and the remaining 10% as the test set. The training set and test set were selected randomly in each group. Figure 12 shows the best classification rates of the eight combination operators that we tested. The classification rates for all speech samples from both genders are obtained by adding an automatic gender classifier [39], as in the experiments on the Berlin dataset. The error bars in the figure show the root mean square errors of the classification rates.

The indexes on the 𝑋 axis stand for the operators (1) Lukasiewicz, (2) Hamacher, (3) Yager, (4) Weber-Sugeno, (5) Schweizer and Sklar, (6) Frank, (7) Average, and (8) Geometric Average. “All samples (1)” refers to the classification of the mixed-gender samples with gender classification, and “All samples (2)” refers to the classification of the mixed-gender samples without gender classification.
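For readers unfamiliar with these operators, the sketch below gives the standard textbook forms of the corresponding t-norms and averaging operators for two values in [0, 1]. This is an assumption about their general shape only: the exact definitions and parameterizations used in the experiments are those given earlier in the paper and may differ from the ones shown here, and all function names are ours.

```python
import math


def lukasiewicz(a: float, b: float) -> float:
    return max(0.0, a + b - 1.0)


def hamacher(a: float, b: float, gamma: float) -> float:
    # Hamacher family, gamma >= 0; the a = b = 0, gamma = 0 case is set to 0.
    denom = gamma + (1.0 - gamma) * (a + b - a * b)
    return 0.0 if denom == 0.0 else a * b / denom


def yager(a: float, b: float, p: float) -> float:
    # Yager family, p > 0.
    return max(0.0, 1.0 - ((1.0 - a) ** p + (1.0 - b) ** p) ** (1.0 / p))


def weber_sugeno(a: float, b: float, lam: float) -> float:
    # Sugeno-Weber family, lam > -1.
    return max(0.0, (a + b - 1.0 + lam * a * b) / (1.0 + lam))


def schweizer_sklar(a: float, b: float, q: float) -> float:
    # Schweizer-Sklar family, restricted here to q > 0.
    return max(0.0, a ** q + b ** q - 1.0) ** (1.0 / q)


def frank(a: float, b: float, s: float) -> float:
    # Frank family, s > 0 and s != 1.
    return math.log(1.0 + (s ** a - 1.0) * (s ** b - 1.0) / (s - 1.0), s)


def average(a: float, b: float) -> float:
    return (a + b) / 2.0


def geometric_average(a: float, b: float) -> float:
    return math.sqrt(a * b)
```

All six t-norms above satisfy T(a, 1) = a, which is a quick sanity check when trying different parameter values; the two averages are not t-norms and do not satisfy it.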

The best result is 79.54%±1.95% for the female samples with the Hamacher operator when 𝛾=3, 81.96%±1.27% for the male samples with the Schweizer and Sklar operator when 𝑞=1, and, with the Schweizer and Sklar operator when 𝑞=0.6, 76.74%±0.83% for the mixed genders with gender classification and 53.75%±1.71% without gender classification. An improvement of about 23 percentage points is thus obtained for the mixed genders when gender classification is used as a preprocessing step.

On the same DES dataset, with the same split of 90% of the data for training and 10% for testing in cross-validation, the best results reported in the literature are 66% by Ververidis and Kotropoulos [17], for male samples only, using a one-step GMM (Gaussian mixture model) classifier over all five emotions, and 76.15% by Schuller et al. [18]. Our automatically generated hierarchical classifier thus achieves a significant improvement in classification accuracy.

4.4. Synthesis and Comparison

In Table 3, we summarize and compare the performance of the automatically generated hierarchical classification schemes (ACS) and the earlier, empirically built hierarchical DEC on both the Berlin and DES datasets. The two kinds of hierarchical classifiers obtain almost the same results on the Berlin dataset (71.52% versus 71.38%), while on the DES dataset the empirically built DEC classifier performs somewhat better than the automatically derived one (81.22% versus 76.74%, a gap of about 4.5 percentage points).

As the table shows, the automatically derived ACS offers performance close to that of the empirical DEC, with the advantage of avoiding repeated empirical work whenever the emotion classification problem changes.

5. Concluding Remarks

In this paper, we have introduced a new embedded feature selection scheme, ESFS, and used it as the basis for automatically deriving hierarchical classification schemes, called ACS. Such a hierarchical classifier is represented by a binary tree whose root is the union of all emotion classes, whose leaves are single emotion classes, and whose internal nodes are subsets containing several emotion classes obtained by a subclassifier. Each subclassifier is based on ESFS, which makes it easy to represent a classifier by its mass function, that is, the combination of the information given by an appropriate feature subset, each subclassifier having its own feature subset. Benchmarked on the Berlin and DES datasets, our approach has shown its effectiveness for vocal emotion analysis, with performance close to that of our previous empirical, dimensional emotion model-driven hierarchical classification scheme (DEC).

Many issues need to be further studied. For instance, from a machine learning point of view, deriving the ACS automatically consists of successively dividing the initial set of class labels into two disjoint subsets using the best binary classifier according to ESFS. Unfortunately, the number of such disjoint subset pairs increases exponentially with the number of class labels. While an exhaustive search is feasible with a set of 5 or 6 class labels, as was the case with the Berlin and DES datasets, it is no longer feasible when the number of class labels is much larger. Therefore, heuristic rules need to be found in order to automatically derive the ACS proposed in this paper in such cases.
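To make the growth concrete, the sketch below enumerates all ways of splitting a label set into two non-empty disjoint subsets at the root of the tree. A set of n labels has 2^(n-1) - 1 such pairs, that is, 15 for five labels and 31 for six. The function names are hypothetical, and only the root-level split is counted; the recursive splits further down the tree add to the total.

```python
from itertools import combinations
from typing import Iterable, Iterator, Tuple


def bipartitions(labels: Iterable) -> Iterator[Tuple[frozenset, frozenset]]:
    """Yield every unordered split of the labels into two non-empty subsets."""
    ordered = sorted(labels)
    first, rest = ordered[0], ordered[1:]
    # Pin the first label to the left subset so each split is produced once.
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            left = frozenset((first,) + extra)
            right = frozenset(ordered) - left
            if right:  # skip the split whose right subset is empty
                yield left, right


def count_bipartitions(n: int) -> int:
    """Closed form for the number of splits of n labels."""
    return 2 ** (n - 1) - 1


for n in (5, 6, 10, 20):
    print(n, count_bipartitions(n))
```

Each candidate pair requires training and evaluating a binary subclassifier, which is why exhaustive enumeration stops being practical as the label set grows.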

Another issue in machine recognition of vocal emotions is the fuzzy and subjective character of vocal emotion. The emotional state conveyed by an utterance may be judged to lie between several emotional states, or even to be multiple, depending on the listener. Thus, ambiguous or multiple judgments also need to be addressed. A preliminary attempt to address this issue is reported in [40, 41].