Abstract

Feature subset selection is a crucial aspect of the speech emotion recognition (SER) problem. In this paper, a Reordering Features with Weights Fusion (RFWF) algorithm is proposed for selecting a more effective and compact feature subset. The RFWF algorithm comprehensively fuses weights reflecting the relevance, complementarity, and redundancy between features and classes, and reorders the features to construct a feature subset with excellent emotional recognizability. A binary-tree structured multiple-kernel SVM classifier is adopted for emotion recognition, and different feature subsets are selected at different nodes of the classifier. The highest recognition accuracy over the five emotions in the Berlin database is 90.594% with only 15 features selected by RFWF. The experimental results show the effectiveness of RFWF in building feature subsets and demonstrate that using different feature subsets for specific emotions can improve the overall recognition performance.

1. Introduction

Feature selection is a crucial aspect of pattern recognition problems. In a multiclass SVM classifier, for example, the structure can be one-versus-one, one-versus-all, hierarchical, or tree structured, so several SVM nodes or models exist in the multiclass classifier [1–3]. Two questions arise in speech emotion recognition (SER): (1) how to seek the optimal feature subset from the acoustic features; and (2) whether the same acoustic feature subset is appropriate at all nodes of the multiclass classifier. These questions are investigated in this paper. A novel algorithm named Reordering Features with Weights Fusion (RFWF) is proposed to select feature subsets, and in the recognition procedure different feature subsets are adopted at the SVM nodes to recognize different emotions.

In the SER field, the dimension of the feature set ranges from tens to hundreds. However, an increasing dimension does not guarantee a radical improvement in recognition accuracy, because the variety of and redundancy among ever more features affect the overall performance and complexity of the system [4]. Moreover, there is as yet no categorical assertion about the most effective feature set in SER. Feature selection algorithms, widely used in machine learning, can choose the optimal feature subset with the least generalization error. There are three types of feature selection methods: the wrapper method, the embedded method, and the filter method [5]. Compared with the wrapper and embedded methods, the filter method is simpler and faster in calculation, and its learning strategy is more robust to overfitting. Additionally, because the selection result of the filter method is independent of the learning model, it can be adopted in a variety of learning tasks. The criteria in filter methods mainly focus on relevance, redundancy, and complementarity. For example, Joint Mutual Information (JMI) [6] considers the relevance between features and classes. The fast correlation-based filter (FCBF) [7] takes the redundancy between features into account. Max-Relevance Min-Redundancy (MRMR) [8] considers both relevance and redundancy to find a balance between the two properties. In Conditional Infomax Feature Extraction (CIFE) [9], the information provided by the features is divided into two parts: class-relevant information that benefits the classification and class-redundant information that disturbs it. The key idea of Double Input Symmetrical Relevance (DISR) [10] is the use of symmetric relevance to consider the complementarity between two input features. In the SER field, various feature selection criteria are adopted [7, 11–13], and each criterion emphasizes a different aspect. The Reordering Features with Weights Fusion (RFWF) algorithm proposed in this paper aims to consider relevance, redundancy, and complementarity comprehensively.

Traditionally, the same feature subset is adopted for all emotional classes in training and testing [14]. In [11], different feature subsets are adopted on two emotional speech databases, but the recognizability of the features for different emotions is not considered. Research has shown that acoustic features have different recognizability for specific emotions. For example, pitch-related features are usually essential for classifying happy and sad [15], while they are often weak in discriminating happy from surprise because both emotions yield high pitch values [16]. In order to improve the performance of the whole system, different feature subsets are selected and adopted at the different nodes of the multiclass classifier in this paper.

The content of the paper is arranged as follows: Section 2 gives the basic concepts of filter feature selection and the method of RFWF; Section 3 introduces the structure of the multiclass and multiple-kernel SVM classifier; Section 4 is the analysis of experiments including the results of RFWF and recognition accuracies of emotions; and the final section is devoted to the conclusions.

2. Features and Feature Selection Methods

2.1. Acoustic Features

The speech acoustic features usually used in SER are prosodic features, voice quality features, and spectral features. In this paper, 409 utterances of 5 emotions in the Berlin database [17] are studied: happy (71 samples), angry (127 samples), fear (69 samples), sad (63 samples), and neutral (79 samples). These samples are separated randomly into training and testing sets. There are 207 training samples, comprising happy (36), angry (64), fear (35), sad (32), and neutral (40), and the remaining 202 samples are used for testing.

Pitch, energy, time, formant, and Mel Frequency Cepstral Coefficient (MFCC) features and their statistics are extracted. The total dimension of the feature set is 45. Table 1 lists the acoustic features and their sequence indices used in this paper.
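As a hedged illustration of how such utterance-level statistics could be computed, the sketch below extracts frame-level pitch, energy, and MFCCs and reduces them to statistics of the kind listed in Table 1. The librosa toolkit, the file name, and all parameter values here are assumptions for illustration; the paper does not specify its extraction toolkit or settings.

```python
# A hypothetical extraction of a few Table 1-style statistics; librosa,
# the file name, and all parameter values are illustrative assumptions.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=None)      # one Berlin utterance
f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)       # frame-level pitch (Hz)
energy = librosa.feature.rms(y=y)[0]                # frame-level energy
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)  # 12 MFCCs per frame

# Statistics over frames give utterance-level features.
feats = np.concatenate([
    [f0.mean(), f0.std(), f0.max() - f0.min()],     # pitch statistics
    [energy.mean(), energy.std()],                  # energy statistics
    mfcc.mean(axis=1),                              # MFCC means
])
```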

2.2. Mathematical Description of Feature Selection

Relevance, redundancy, and complementarity are considered in feature selection methods. If a feature can provide information about the class, relevance exists between the feature and the class. Redundancy arises from the dependency between the selected and unselected features. Complementarity means that the interaction between an individual feature and the selected feature subset is beneficial to the classification. Complementarity is important in cases of null relevance, such as the XOR problem [10, 18].

The concepts of information theory, such as mutual information, denoted by $I$, and entropy, denoted by $H$, are widely used in feature selection. Mathematically, $x_i$ is the feature vector of the $i$th sample, and $x_{i,j}$ is the $j$th feature of the $i$th sample in the feature set $F$. The selected subset and unselected subset are $S$ and $U$, with the mathematical relations $S \cup U = F$ and $S \cap U = \emptyset$. $c_k$, $k = 1, \dots, 5$, is the specified emotion in the Berlin database. In the following, the mathematical description of relevance, redundancy, and complementarity is interpreted through the introduction of MRMR and DISR.
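For concreteness, $H$ and $I$ can be estimated empirically from discretized features. The following is a minimal sketch under that assumption; the function names are ours, not the paper's.

```python
# Minimal empirical estimates of H(X) and I(X; Y) from discrete data.
import numpy as np
from collections import Counter

def entropy(x):
    """Empirical entropy H(X), in bits, of a sequence of discrete symbols."""
    n = len(x)
    p = np.array([c / n for c in Counter(x).values()])
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), with (x, y) paired samplewise."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))
```

For example, `mutual_information(feature_column, labels)` estimates the relevance term used by the criteria below.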

In MRMR, for $f_j \in U$ and $f_i \in S$, the relevance term and the redundancy term are used in the criterion
$$\max_{f_j \in U} \left[ I(f_j; C) - \frac{1}{|S|} \sum_{f_i \in S} I(f_j; f_i) \right], \tag{1}$$
where $I(f_j; C)$ represents the relevance between an unselected feature and the class, and $\frac{1}{|S|}\sum_{f_i \in S} I(f_j; f_i)$ represents the redundancy between the unselected and selected features. The detailed computation can be found in [8].
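Reusing numpy and the estimators sketched above, the standard greedy MRMR procedure of (1) can be illustrated as follows; this is a sketch, not the authors' implementation.

```python
def mrmr_select(features, labels, k):
    """Greedy MRMR per (1): pick the unselected feature maximizing relevance
    I(f_j; C) minus the mean redundancy with the already-selected features.
    `features` is an (n_samples, n_features) array of discretized values."""
    n_feat = features.shape[1]
    relevance = [mutual_information(features[:, j], labels)
                 for j in range(n_feat)]
    selected, unselected = [], list(range(n_feat))
    while len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            red = np.mean([mutual_information(features[:, j], features[:, i])
                           for i in selected])
            return relevance[j] - red
        best = max(unselected, key=score)
        selected.append(best)
        unselected.remove(best)
    return selected
```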

The key idea of DISR rests on the second average sub-subset information criterion in (2), which considers the complementarity between an unselected feature $f_j$ and a selected feature $f_i$ given the specific class $C$:
$$\max_{f_j \in U} \sum_{f_i \in S} I(\{f_i, f_j\}; C). \tag{2}$$

Equation (2) can also be modified by a normalized relevance measure named symmetric relevance, calculated by the following:
$$SR(\{f_i, f_j\}; C) = \frac{I(\{f_i, f_j\}; C)}{H(f_i, f_j, C)}. \tag{3}$$

In DISR, the complementarity is calculated by
$$I(f_i; f_j; C) = I(\{f_i, f_j\}; C) - I(f_i; C) - I(f_j; C), \tag{4}$$
where $I(f_i; f_j; C)$ stands for the interaction among $f_i$, $f_j$, and $C$. In its general meaning, for a set of random variables $\mathcal{X}$, the interaction can be defined as
$$I(\mathcal{X}) = -\sum_{\mathcal{T} \subseteq \mathcal{X}} (-1)^{|\mathcal{X}| - |\mathcal{T}|} H(\mathcal{T}). \tag{5}$$

The detailed definition and proof can be found in [10].
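Continuing the same sketch, the symmetric-relevance score of (2) and (3) and the interaction of (4) for a candidate feature could be computed as follows; the function names are illustrative.

```python
def disr_score(features, labels, j, selected):
    """DISR score of candidate j per (2)-(3): the sum over selected i of
    the symmetric relevance I({f_i, f_j}; C) / H(f_i, f_j, C)."""
    score = 0.0
    for i in selected:
        pair = list(zip(features[:, i], features[:, j]))
        triple = list(zip(features[:, i], features[:, j], labels))
        score += mutual_information(pair, labels) / entropy(triple)
    return score

def interaction(features, labels, i, j):
    """Interaction I(f_i; f_j; C) per (4)."""
    pair = list(zip(features[:, i], features[:, j]))
    return (mutual_information(pair, labels)
            - mutual_information(features[:, i], labels)
            - mutual_information(features[:, j], labels))
```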

2.3. Reordering Features with Weights Fusion

For the comprehensive consideration of relevance, redundancy, and complementarity, the following criterion, named Reordering Features with Weights Fusion (RFWF), is proposed to fuse the intrinsic properties of the features:
$$w(f_j) = w_{\mathrm{rel}}(f_j) + w_{\mathrm{red}}(f_j) + w_{\mathrm{com}}(f_j), \tag{6}$$
where $w_{\mathrm{rel}}(f_j)$, $w_{\mathrm{red}}(f_j)$, and $w_{\mathrm{com}}(f_j)$ are the fusing weights of the unselected feature $f_j$, combined in (6) to reflect the contribution of $f_j$ to the given class. The procedure of the RFWF algorithm, illustrated in Figure 1, is as follows:
(1) $r(f_j)$ is the sequence number of feature $f_j$ ranked by the values of relevance, redundancy, and complementarity, respectively. If the dimension of the feature set is 45, $r(f_j)$ is an integer ranging within 1–45. For example, if the relevance of $f_j$ is the largest, $r(f_j)$ is 1, and if it is the lowest, $r(f_j)$ is 45. The initial selected feature in $S$ is the one with the largest relevance value.
(2) The weighted values are calculated by the formula
$$w_{\ast}(f_j) = 46 - r_{\ast}(f_j), \quad \ast \in \{\mathrm{rel}, \mathrm{red}, \mathrm{com}\}. \tag{7}$$
For example, if $r_{\mathrm{rel}}(f_j)$ is 1, the corresponding weight for the relevance between the feature and the class is $w_{\mathrm{rel}}(f_j) = 45$.
(3) All of the features are reordered by the result of fusing $w_{\mathrm{rel}}(f_j)$, $w_{\mathrm{red}}(f_j)$, and $w_{\mathrm{com}}(f_j)$ according to (6).
(4) The top $N$ features are selected to construct the optimal feature subset.

Because this algorithm fuses three weights to account for the contribution of each feature to the classification, and a reordering step exists in the process, it is named Reordering Features with Weights Fusion (RFWF). A code sketch of this procedure follows.
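The sketch below implements the rank-to-weight conversion of (7) and the fusion of (6), reusing the estimators sketched in Section 2.2. How the paper scores redundancy and complementarity against the selected subset is only partially specified, so the per-feature scores here, computed against all other features, are plausible stand-ins rather than the authors' exact method.

```python
def rank_to_weight(scores, descending=True):
    """Per (7): the feature ranked r-th gets weight D + 1 - r, so rank 1
    maps to weight D (45 for the full feature set)."""
    D = len(scores)
    order = np.argsort(-scores) if descending else np.argsort(scores)
    w = np.empty(D)
    for rank, j in enumerate(order, start=1):
        w[j] = D + 1 - rank
    return w

def rfwf_order(features, labels):
    """Fuse the three weights per (6) and reorder all features. The
    redundancy and complementarity scores below are computed against all
    other features, a simplification for illustration."""
    D = features.shape[1]
    rel = np.array([mutual_information(features[:, j], labels)
                    for j in range(D)])
    red = np.array([np.mean([mutual_information(features[:, j], features[:, i])
                             for i in range(D) if i != j]) for j in range(D)])
    com = np.array([np.mean([interaction(features, labels, i, j)
                             for i in range(D) if i != j]) for j in range(D)])
    w = (rank_to_weight(rel, descending=True)      # high relevance is better
         + rank_to_weight(red, descending=False)   # low redundancy is better
         + rank_to_weight(com, descending=True))   # high complementarity is better
    return np.argsort(-w)   # indices reordered by fused weight; take the top N
```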

3. Multiclass and Multiple-Kernel SVM Classifier with Binary-Tree Structure

The Support Vector Machine (SVM) is a discriminative classifier proposed for binary classification problems and based on the theory of structural risk minimization. The performance of a single-kernel method depends heavily on the choice of the kernel. If a dataset has varying distributions, a single kernel may not be adequate. Kernel fusion has been proposed to deal with this problem [19].

The simplest kernel fusion is a weighted combination of $M$ kernels:
$$K = \sum_{s=1}^{M} \mu_s K_s, \quad \mu_s \ge 0, \tag{8}$$
where $\mu_s$ are the optimal weights and $K_s$ is the $s$th kernel matrix. The selection of $\mu_s$ is an optimization problem, and the objective function and constraints can be formulated in Semidefinite Programming (SDP) form. The detailed proof can be found in [20].
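As a sketch, the weighted combination in (8) amounts to a few lines once the kernel matrices are available. The RBF helper below is a standard construction, and the SDP step producing $\mu_s$ is assumed to have been solved elsewhere.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """RBF kernel matrix K[p, q] = exp(-gamma * ||X[p] - Y[q]||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def fuse_kernels(kernels, mu):
    """K = sum_s mu_s * K_s per (8); mu is assumed to come from the SDP step."""
    return sum(m * K for m, K in zip(mu, kernels))
```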

In this paper, a multiple-kernel SVM classifier with an unbalanced binary-tree structure, illustrated in Figure 2, is adopted. In Figure 2, there are five emotions to be recognized. The first classifying node (Model 1) is improved by multiple-kernel SVM to recognize the most confusable emotion, while the subsequent classifying nodes retain single-kernel SVM. This arrangement reduces the accumulation of recognition errors and avoids the computational cost of calculating multiple-kernel matrices at every node.

According to previous works [2, 11, 21, 22], happy is the most confusable emotion in the Berlin database, and its recognition accuracy is the main factor influencing the total performance. Thus, in the classifier shown in Figure 2, happy is emotion 1, angry is 2, fear is 3, neutral is 4, and sad is 5. The feature subsets selected by RFWF are adopted in SVM training and testing, where Model 1 is learned with multiple kernels while Models 2, 3, and 4 remain single-kernel SVM models. A sketch of the resulting decision cascade is given below.
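Under the emotion ordering above, the unbalanced tree of Figure 2 reduces to a fall-through cascade; the `predict` interface and all names here are assumptions, and each node uses its own RFWF-selected feature subset (column indices).

```python
# The emotion order follows the text; models and subsets are the four
# trained node classifiers and their RFWF-selected feature index lists.
EMOTIONS = ["happy", "angry", "fear", "neutral", "sad"]

def classify(x, models, subsets):
    """Walk the unbalanced tree: node k tests emotion k against the rest;
    if every node rejects, the sample falls through to 'sad'."""
    for emotion, model, subset in zip(EMOTIONS[:-1], models, subsets):
        if model.predict(x[subset]) == 1:   # node accepts its emotion
            return emotion
    return EMOTIONS[-1]
```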

4. Experiments and Analysis

4.1. The Experimental Results of RFWF

Table 2 lists the reordering results of the features for the four SVM models in Figure 2 according to the fused weights $w_{\mathrm{rel}}$, $w_{\mathrm{red}}$, and $w_{\mathrm{com}}$. In Table 2, the numbers are the indices of the features listed in Table 1.

It is clear that the contribution of different features to emotional recognizability is distinct across the four SVM models. For example, the standard deviation of pitch (feature index 5) is the most essential feature for separating happy from the other emotions in the Berlin database, while the ratio of voiced frames to total frames (feature index 23) is the most important feature for distinguishing neutral from sad. These results show that it is necessary to adopt different feature subsets to recognize different emotions.

4.2. Experimental Results of SER and Analysis

In the SER experiments, the LibSVM package is adopted. Three Radial Basis Function (RBF) basis kernels with three different kernel parameters are combined in Model 1. The YALMIP toolbox is used to solve the SDP problem and find the three weights $\mu_s$ with the features listed in Table 1. In the single-kernel SVM models, the value of the kernel parameter $\gamma$ is determined by the number of selected features $N$ in the recognition procedure. When the selected feature number is specified, the same $\gamma$ is adopted for all single-kernel models.
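As a hypothetical illustration of training Model 1 on a fused precomputed kernel, the sketch below reuses the `rbf_kernel` and `fuse_kernels` helpers from Section 3. scikit-learn stands in for LibSVM here, and the data, kernel widths, and weights are placeholders, not the values used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(207, 15))     # stand-in: 207 samples, 15 features
y_train = rng.integers(0, 2, size=207)   # stand-in labels: happy vs. the rest

gammas = [0.1, 1.0, 10.0]                # placeholder RBF widths
mu = [1 / 3, 1 / 3, 1 / 3]               # placeholder weights (SDP in the paper)
Ks = [rbf_kernel(X_train, X_train, g) for g in gammas]
model1 = SVC(kernel="precomputed").fit(fuse_kernels(Ks, mu), y_train)
```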

Recognition accuracies, Root Mean Square Error (RMSE), and Maximum Error (MaxE) are used to evaluate the performance of the SVM classifier. RMSE and MaxE are calculated by the following equations:
$$\mathrm{RMSE} = \sqrt{\frac{1}{5} \sum_{i=1}^{5} e_i^2}, \tag{9}$$
$$\mathrm{MaxE} = \max_{i} e_i, \tag{10}$$
where $e_i$ is the recognition error (%) of the $i$th emotion. Obviously, the higher the recognition accuracies and the lower the values of RMSE and MaxE, the better the performance of the classifier.
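A minimal sketch of the two error measures in (9) and (10), assuming numpy as above:

```python
def rmse_maxe(errors):
    """Per (9)-(10): errors[i] is the recognition error (%) of emotion i."""
    e = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(e ** 2))), float(e.max())
```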

If the dimension of the feature subset is $N$, then the top $N$ features in Table 2 are selected to construct the feature subset. Varying $N$ within 1–45 yields different recognition performance. Figure 3 plots the curves of the total emotion recognition accuracies of the MRMR, DISR, and RFWF feature selection algorithms, where RFWF is adopted in the multiple-kernel (RFWF MK) and single-kernel (RFWF SK) SVM classifiers. Different feature subsets are selected for Models 1–4. Figure 4 gives the RMSE and MaxE corresponding to Figure 3. In Figures 3 and 4, the horizontal axis is the number of selected features, that is, the dimension of the selected feature subset. Table 3 lists the detailed experimental data for the highest total accuracies of the MRMR, DISR, RFWF MK, and RFWF SK methods.

The recognition results show that the DISR and MRMR algorithms reach their highest accuracies with 39 features, whereas the highest accuracy of RFWF MK is 90.594% with only 15 features. The accuracies of DISR and MRMR are 70.792% and 77.723%, respectively, when the selected feature number is 15. A selected feature number of 45 means that no feature selection algorithm is utilized; in this situation, the performances of DISR, MRMR, and RFWF MK are the same, with a total accuracy of 83.663%. These results show that the RFWF algorithm gives the best performance with the lowest dimension of the feature subset. The corresponding RMSE and MaxE curves of RFWF are the lowest when the selected feature number $N$ is below 30. As the dimension of the feature subset increases further, the three feature selection methods with the multiple-kernel classifier have similar performance. This is mainly because RFWF uses the same weighting method for relevance, redundancy, and complementarity; in this respect it is an averaging strategy in the weight fusion procedure. The results show that when the dimension of the feature subset is close to 45, the ability of RFWF to handle the complex inherent relations among the features degrades, and a more optimal feature fusion method should be studied.

The highest total accuracy of RFWF SK is 79.208%, which is much lower than that of RFWF MK. The recognition accuracy of happy is steadily 97.143% in the three methods when Model 1 is improved by multiple-kernel SVM. The experimental results demonstrate that the multiple-kernel classifier can effectively resolve the confusion between happy and the other emotions, which cannot be handled by single-kernel SVM. The highest SER accuracies of RFWF MK are comparable to the results of the Enhanced Sparse Representation Classifier (Enhanced-SRC) in [11] and the feature fusion based on MKL in [23]. The experimental comparison is listed in Table 4, where the symbol "–" denotes that no related experimental result is reported in the reference.

If Models 2–4 use the same feature subset as Model 1 with a dimension of 15, the accuracy of RFWF MK is only 63.861%. When all models use the same feature subset of dimension 39, the highest accuracy of RFWF MK is 85.149%. These data confirm that using the same feature subset in all models affects the emotion recognition performance negatively. The experimental results demonstrate that different feature subsets are necessary for recognizing different emotions, which also indicates the difficulty of building a single robust and effective feature subset for all emotions.

5. Conclusions

In this paper, the RFWF feature selection method is proposed for building more effective feature subsets in SER. A binary-tree structured multiclass and multiple-kernel SVM classifier is adopted to recognize emotions in a public emotional speech database. The experimental results indicate the effectiveness of the whole system.

The conclusions of this paper are as follows: (1) the intrinsic properties of features concerning relevance, redundancy, and complementarity can be considered comprehensively by weights fusion; (2) the feature subset selected by RFWF achieves higher total accuracy with lower dimension than MRMR and DISR; (3) in a multiclass classifier, different feature subsets adopted at different nodes can improve the recognizability of the whole system; (4) the multiple-kernel SVM classifier is robust and effective in recognizing the most confusable emotion.

Future work will focus on more optimal feature fusion algorithms and the automatic determination of the optimal dimension of the feature subset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61501204 and no. 61601198), Hebei Province Natural Science Foundation (no. E2016202341), Hebei Province Foundation for Returned Scholars (no. C2012003038), Shandong Provincial Natural Science Foundation (no. ZR2015FL010), and Science and Technology Program of University of Jinan (no. XKY1710).