Journal of Electrical and Computer Engineering

Volume 2017 (2017), Article ID 8639782, 6 pages

https://doi.org/10.1155/2017/8639782

## Cross-Corpus Speech Emotion Recognition Based on Multiple Kernel Learning of Joint Sample and Feature Matching

College of Big Data and Information Engineering, Guizhou University, Guiyang 550002, China

Correspondence should be addressed to Ping Yang

Received 5 April 2017; Revised 3 August 2017; Accepted 13 September 2017; Published 1 November 2017

Academic Editor: Ping Feng Pai

Copyright © 2017 Ping Yang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Cross-corpus speech emotion recognition, which learns an accurate classifier for new test data from previously collected, labeled training data, has shown promising value in speech emotion recognition research. Most previous works have explored two learning strategies independently for cross-corpus speech emotion recognition: feature matching and sample reweighting. In this paper, we show that both strategies are important and indispensable when the distribution difference between the training and test data is substantially large. We therefore put forward a novel multiple kernel learning of joint sample and feature matching (JSFM-MKL) to model them in a unified optimization problem. Experimental results demonstrate that the proposed JSFM-MKL outperforms competing algorithms for cross-corpus speech emotion recognition.

#### 1. Introduction

In cross-corpus speech emotion recognition, the recognition performance of many algorithms degrades [1–3]. This is due to the lack of robust feature representations and of training samples with important properties. To address this issue, researchers use matched feature selection and sample reweighting [4, 5]. Feature selection or extraction algorithms discover a shared feature representation that reduces the distribution mismatch between the training and test data. Sample reweighting also aims at reducing this distribution mismatch, by reweighting the training samples and then training a robust recognizer on the reweighted training samples. As is well known, in cross-corpus speech emotion recognition there will always be some training samples that are not relevant to the test samples, even in the feature matching subspace [6]. Recent works have exploited matching feature learning and sample reweighting individually to improve the performance of cross-corpus speech emotion recognition [4, 5]. However, it is natural to combine the benefits of the two learning strategies. In this work, we extend the idea of feature extraction and sample reweighting to multiple kernel learning (MKL) and propose a novel multiple kernel learning of joint sample and feature matching (JSFM-MKL) to model them in a unified optimization problem. We test the proposed JSFM-MKL on the FAU Aibo speech emotion corpus, which was used in the Interspeech 2009 Emotion Challenge. Experimental results show that the proposed JSFM-MKL outperforms MKL [7] and adaptive multiple kernel learning (A-MKL) [8] and significantly improves on the baseline performance of the Emotion Challenge.

#### 2. MKL of Joint Sample and Feature Matching

##### 2.1. Problem Definition

We are given the training and test data , respectively. The training data is fully labeled and represented as , where is the label of . The test data is divided into labeled and unlabelled parts. The training and test data have the same feature dimensionality . Our goal is to design a robust recognizer that predicts the labels of the unlabelled test data. The proposed recognizer is based on the MKL framework [8], in which the sample reweighting and feature matching schemes are modeled in a unified MKL optimization problem. Specifically, the learning framework of joint sample and feature matching MKL (JSFM-MKL) can be formulated as

where is any monotonically increasing function and is the trade-off between the distribution mismatch and the structural risk function on the labeled data.

Our work on JSFM-MKL is motivated by two aspects: matching feature selection and sample reweighting. In cross-corpus speech emotion recognition, the training data may not be representative of the test data. More specifically, is different from . This indicates that some features may behave differently in the training and test data. A recognizer that relies heavily on these features in the training data may not perform well on the unlabelled test data. Thus, one key computational problem is to reduce the distribution mismatch between and [9]. However, directly estimating the probability densities is a nontrivial problem. To avoid it, we resort to the empirical Maximum Mean Discrepancy (MMD) [10], an effective nonparametric distance measure for comparing data distributions in a reproducing kernel Hilbert space (RKHS). Using the training and test data, the MMD can be formulated as follows:

Let be the feature mapping matrix of the training data and be the feature mapping matrix of the test data. In addition, we define two column vectors and , respectively. has entries, each set to , and has entries, each set to . Then (2) can be rewritten as

Instead of learning a kernel matrix, following [8], we assume that the kernel is a linear combination of base kernels, namely,

where . We furthermore assume that the first objective in (1) is

However, (5) does not consider the role of each feature in reducing the mismatch of the conditional distributions. It is therefore natural to select the features that reduce this mismatch. Although previous MKL methods can perform feature selection through the corresponding kernel weights, they generally treat all features as coming from the same distribution. In other words, they do not address the problem of cross-corpus feature selection as we do [7]. To address this problem, we construct each type of feature with different kernel choices and formulate the kernel weights as the matrix . The entry is the weight of the th type of feature corresponding to the th kernel. For feature selection, we impose a norm constraint on , which shrinks the entries of some rows to zero. This norm constraint is defined as the sum of the norms of the rows of . Then, (4) can be reformulated as follows:

where is the weight matrix of the base kernels. The mixed norm constraint creates sparsity across different features, while the values of for the same feature need not be sparse. This allows different properties of a selected feature to be represented by more than one kernel.
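The empirical MMD with a weighted combination of base kernels, as described above, can be illustrated with a minimal NumPy sketch. Everything named here is an assumption made for illustration and is not taken from the paper: Gaussian base kernels, the names `feature_groups`, `gammas`, and `D`, the use of a single signed vector `s` (positive entries for training samples, negative for test samples) in place of two separate vectors, and the choice of the row-wise Euclidean norm inside the mixed norm.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def combined_kernel(X, feature_groups, gammas, D):
    """Sum of base kernels weighted by D[m, p].

    Row m of D belongs to feature group m, column p to bandwidth gammas[p]
    (hypothetical layout, chosen for this sketch).
    """
    n = X.shape[0]
    K = np.zeros((n, n))
    for m, cols in enumerate(feature_groups):
        for p, gamma in enumerate(gammas):
            K += D[m, p] * rbf_kernel(X[:, cols], X[:, cols], gamma)
    return K


def mmd2(K, n_tr, n_te):
    """Squared empirical MMD from a kernel matrix over [training; test] samples."""
    s = np.concatenate([np.full(n_tr, 1.0 / n_tr), np.full(n_te, -1.0 / n_te)])
    return float(s @ K @ s)


def row_mixed_norm(D):
    """Sum over rows of the Euclidean norm of each row (assumed mixed norm)."""
    return float(np.sqrt((D ** 2).sum(axis=1)).sum())


# Toy data: the second feature group is shifted in the test set, the first is not.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(60, 4))
X_te = rng.normal(size=(60, 4))
X_te[:, 2:] += 2.0
X = np.vstack([X_tr, X_te])
groups = [[0, 1], [2, 3]]
gammas = [0.1, 1.0]

D_matched = np.array([[0.5, 0.5], [0.0, 0.0]])  # all weight on the matched group
D_shifted = np.array([[0.0, 0.0], [0.5, 0.5]])  # all weight on the shifted group
for name, D in [("matched group", D_matched), ("shifted group", D_shifted)]:
    K = combined_kernel(X, groups, gammas, D)
    print(name, "MMD^2 =", round(mmd2(K, 60, 60), 4), "row norm =", round(row_mixed_norm(D), 4))
```

In this toy setting, a weight matrix whose nonzero row sits on the matched feature group yields a smaller MMD than one concentrated on the shifted group, which is the behaviour the row-sparse constraint is meant to encourage.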

However, matching feature selection based on MMD minimization is not sufficient for cross-corpus speech emotion recognition, since it only reduces the mismatch of the conditional distributions through high-order moments of the probability distributions. The distribution matching therefore remains far from perfect. In fact, some training samples are irrelevant to the test samples. Therefore, a sample reweighting procedure should be combined with matching feature selection to deal with this difficult setting. Following previous works, kernel mean matching (KMM) [5] is introduced to weight the training data by minimizing the difference between the means of the weighted training data and the test data in the RKHS. Unlike in previous works, the sample reweighting procedure and matching feature selection are modeled in a unified optimization problem. Thus the optimization problem can be rewritten as

Letting ,

(6) can be rewritten as follows:

After obtaining , we use the objective function of MKL to model the second objective function . Thus, the optimization problem of JSFM-MKL can be written as

where

By introducing the Lagrange multiplier , the dual form of the optimization problem of JSFM-MKL can be formulated as

where

In this work, we employ an alternating optimization algorithm [8] to iteratively update the dual variable , the weighting matrix , and the weighting vector . Specifically, we first update the dual variable with the weighting matrix and the weighting vector fixed; we then update the weighting matrix and the weighting vector with the dual variable fixed.
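The kernel mean matching step used in this section for sample reweighting can be illustrated in isolation. The following is a simplified sketch under stated assumptions, not the joint JSFM-MKL solver: it uses a single Gaussian kernel, keeps only a box constraint `0 <= beta <= B` on the weights (the usual KMM constraint on the sum of the weights is omitted), and solves the resulting quadratic problem by projected gradient descent. The names `kmm_weights`, `kmm_objective`, `gamma`, and `B` are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kmm_objective(beta, K_tr, kappa, n_tr, n_te):
    """Squared distance between the weighted training mean and the test mean
    in the RKHS, without the constant test-test term."""
    return float(beta @ K_tr @ beta / n_tr ** 2 - 2.0 * beta @ kappa / (n_tr * n_te))


def kmm_weights(X_tr, X_te, gamma=0.5, B=10.0, n_iter=500):
    """Projected gradient descent on the KMM objective with box constraints only."""
    n_tr, n_te = len(X_tr), len(X_te)
    K_tr = rbf_kernel(X_tr, X_tr, gamma)
    kappa = rbf_kernel(X_tr, X_te, gamma).sum(axis=1)
    # Step size 1/L, where L = 2 * lambda_max(K_tr) / n_tr^2 bounds the curvature.
    step = n_tr ** 2 / (2.0 * np.linalg.eigvalsh(K_tr)[-1])
    beta = np.ones(n_tr)  # start from uniform weights
    for _ in range(n_iter):
        grad = 2.0 * K_tr @ beta / n_tr ** 2 - 2.0 * kappa / (n_tr * n_te)
        beta = np.clip(beta - step * grad, 0.0, B)  # project back onto the box
    return beta


# Toy example: the test data are concentrated around 1, the training data around 0.
rng = np.random.default_rng(1)
X_tr = rng.normal(0.0, 1.0, size=(80, 1))
X_te = rng.normal(1.0, 0.5, size=(40, 1))
beta = kmm_weights(X_tr, X_te)
print("weight range:", float(beta.min()), float(beta.max()))
```

Because the step size is the reciprocal of the curvature bound, each projected step cannot increase the objective, so the returned weights match the test mean at least as well as uniform weights do.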

#### 3. Experiments

In this work, we evaluate the proposed JSFM-MKL on the spontaneous FAU Aibo Emotion Corpus [11]. This corpus was an integral part of the Interspeech 2009 Emotion Challenge [12]. It contains recordings of 51 children aged 10–13 years interacting with Sony’s dog-like Aibo robot. The children were asked to treat the robot as a real dog and were led to believe that the robot was responding to their spoken commands. In this recognition task, we use the utterances of the five-class emotion problem: angry, emphatic, positive, neutral, and rest. The evaluation measure for all experimental results is the average unweighted recall, defined as the per-class accuracy (recall) averaged over the total number of classes, which is more suitable for imbalanced data [12]. To achieve a good average unweighted recall, we arrange multiple recognizers into the binary decision tree structure proposed by Lee et al. [13]. In addition, we use synthetic minority oversampling [14] to reduce the class imbalance during the training phase of each recognizer. For acoustic feature extraction, we use a “brute force” approach based on a baseline feature set, without any attempt to select a smaller subset of well-performing features. Specifically, we use the openEAR toolkit [15] to extract acoustic features from each utterance.
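The evaluation measure described above can be written in a few lines. This is a minimal sketch; the function name `unweighted_average_recall` and the toy labels are chosen for illustration only.

```python
import numpy as np


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, with every class counted equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Imbalanced toy case: 8 "neutral" and 2 "angry" utterances,
# with a recognizer that always predicts "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
uar = unweighted_average_recall(y_true, y_pred)
print(accuracy, uar)  # accuracy is 0.8, unweighted average recall is 0.5
```

The example shows why the measure suits imbalanced data: a recognizer that ignores the minority class still reaches 80% accuracy, while its unweighted average recall stays at the chance level for two classes.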

The feature set includes 16 low level descriptors consisting of prosodic, spectral envelope, and voice quality features listed in Table 1. These low level descriptors are zero crossing rate, root mean square energy, pitch, harmonics-to-noise ratio, and 12 mel-frequency cepstral coefficients, together with their deltas. Then 12 statistical functionals were computed for every low level descriptor per utterance in the Aibo database: kurtosis, skewness, minimum, maximum, relative position, range, two linear regression coefficients and their respective mean square error, mean, and standard deviation. This results in a collection of 384 acoustic features per utterance (16 low level descriptors and their 16 deltas, each with 12 functionals). The features were then normalized to the range between 0 and 1.
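The size of the feature vector follows from 16 low level descriptors, their 16 deltas, and 12 functionals each: 32 × 12 = 384. The sketch below reproduces this count on random frame-level contours. It is an illustration under assumptions and not the openEAR implementation: deltas are approximated with `np.gradient`, the functionals are plain NumPy approximations of the listed statistics, and the helper names `functionals`, `utterance_features`, and `min_max_normalize` are hypothetical.

```python
import numpy as np


def functionals(contour):
    """Twelve utterance-level statistics of one frame-level contour."""
    x = np.asarray(contour, dtype=float)
    t = np.arange(len(x))
    slope, offset = np.polyfit(t, x, 1)           # linear regression coefficients
    mse = np.mean((slope * t + offset - x) ** 2)  # error of the linear fit
    mu, sd = x.mean(), x.std()
    z = (x - mu) / sd if sd > 0 else np.zeros_like(x)
    return np.array([
        mu, sd,
        np.mean(z ** 4), np.mean(z ** 3),             # kurtosis, skewness
        x.min(), x.max(),
        np.argmin(x) / len(x), np.argmax(x) / len(x),  # relative positions
        x.max() - x.min(),                             # range
        slope, offset, mse,
    ])


def utterance_features(lld):
    """Map a (frames, 16) matrix of low level descriptors to a 384-dim vector."""
    delta = np.gradient(lld, axis=0)  # simple delta approximation (assumption)
    contours = np.hstack([lld, delta])  # (frames, 32)
    return np.concatenate([functionals(contours[:, j]) for j in range(contours.shape[1])])


def min_max_normalize(F):
    """Scale each feature column of F to [0, 1]; constant columns map to 0."""
    lo, hi = F.min(axis=0), F.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (F - lo) / span


rng = np.random.default_rng(0)
F = np.vstack([utterance_features(rng.normal(size=(50, 16))) for _ in range(5)])
print(F.shape)  # (5, 384)
F_norm = min_max_normalize(F)
```

In a cross-corpus setting, the scaling statistics would normally be estimated on the training corpus and then applied to the test corpus; the sketch normalizes a single matrix only to keep the example short.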