Abstract

Posttraumatic stress disorder (PTSD), bipolar manic disorder (BMD), obsessive compulsive disorder (OCD), depression, and suicide are major problems in both civilian and military life. Changes in emotion underlie or accompany such disorders, so it is essential to develop a robust and reliable emotion detection system suitable for real-world applications. Beyond healthcare, the importance of automatically recognizing emotions from human speech has grown with the increasing role of spoken-language interfaces in human-computer interaction. Detection of emotion in speech can be applied in a variety of situations, such as automated call centers or nursing homes, to allocate limited human resources to the clients with the highest levels of distress or need. In this paper, we use a novel multi least squares twin support vector machine (MLSTSVM) classifier to detect seven different emotions: anger, happiness, sadness, anxiety, disgust, panic, and a neutral state. The experimental results indicate better performance of the proposed technique over other existing approaches and suggest that the proposed emotion detection system may be used for screening of mental status.

1. Introduction

Stressful situations can cause major psychiatric problems such as depression, suicide, PTSD, BMD, and OCD in civilian as well as military life. Early treatment can be beneficial for such psychiatric problems [1], so there is a need to develop technology for recognizing early changes in human behavior. Medical researchers have reported several biomarkers for psychiatric diseases [1, 2], but these biomarkers are of limited use in military life because detecting psychiatric diseases with them requires large and complicated equipment. On the other hand, voice, speech, and emotion detection technologies are developing rapidly in the engineering field. These technologies provide human-machine interaction for emotion detection and further treatment of psychiatric problems [3–5]. Several studies have measured levels of fatigue and stress from speech [6], but fatigue and stress levels do not lead directly to a psychiatric disorder. Changes in a person's emotion, however, can contribute to mental illness. Clinicians mostly recognize the mental state of a patient from his/her face and voice, which convey emotion. This observation suggests that an emotion detection system could be used to recognize a mental disorder or disease in humans. Early detection of disease improves the prognosis and helps provide effective treatment at early stages. An emotion detection system can support clinicians in performing the task of emotion detection more efficiently. In automated call centers or in a nursing home, where nursing staff may not be available to assist everyone, automated emotion detection can be used to "triage" a patient: the system recognizes whether a patient has become angry or impatient and, if so, staff or treatment is directed to that patient as soon as possible.

Nowadays, emotion detection from speech is an active research area and is useful for man-machine interaction [3–7]. Much research has addressed automated emotion detection from facial expressions, but this task is computationally expensive and complex because it requires high quality cameras to capture face images. Apart from facial expression, emotions can also be detected from speech, which has proven to be a promising modality. Since speech is the primary mode of human communication, detecting emotion from speech is an important problem.

Machine learning algorithms such as k-nearest neighbor (kNN), artificial neural network (ANN), and support vector machine (SVM) are widely used for emotion detection due to their excellent performance [8–13]. In this paper, the proposed emotion detection system recognizes seven different emotions: anger, anxiety, disgust, happiness, sadness, panic, and a neutral state. Since the different emotions can be seen as different classes, a multiclass classifier is required for emotion detection. We propose a novel multi least squares twin support vector machine (MLSTSVM) classifier, an extension of the binary least squares twin support vector machine (LSTSVM), which predicts the class, or emotion, for a given input. To check the validity of the proposed classifier, we also evaluate its performance on five benchmark datasets.

The paper is organized as follows. Section 1 has motivated the need for an emotion detection system. Section 2 details the proposed multi least squares twin support vector machine classifier. The proposed framework for emotion detection and the dataset details are discussed in Section 3. The experimental results and the conclusion are presented in Sections 4 and 5, respectively.

2. Multi Least Squares Twin Support Vector Machine

Kumar and Gopal proposed LSTSVM for binary classification; it constructs two nonparallel hyperplanes, one for each class, by solving two systems of linear equations rather than quadratic programming problems [14]. Since real world data often contain multiple classes and require a classifier that works well in the multiclass setting, we propose in this paper a novel multiclass classifier termed MLSTSVM. This classifier is an extension of the binary LSTSVM and is based on the "one-versus-rest" strategy. We selected and extended the binary LSTSVM because it shows better generalization ability and is faster than other existing approaches [14, 15]. MLSTSVM constructs K hyperplanes, one for each class, by solving K least squares optimization problems, where K denotes the number of classes. It adopts the "one-versus-rest" scheme, in which the data points of each class are trained against the data points of all other classes. Consider a training dataset with m data points, {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, where x_j ∈ R^n is a feature vector in n-dimensional space and y_j ∈ {1, 2, ..., K} is the label of the corresponding class. "One-versus-rest" generates K binary LSTSVM classifiers, each of which separates one class from the rest of the classes. The i-th LSTSVM classifier treats the data points of the i-th class as positive data points and the data points of all other classes as negative data points. Let the data points of the i-th class be collected in the matrix A_i ∈ R^{m_i × n}, where m_i is the number of data points in the i-th class, and let the data points of the remaining classes be stacked in the matrix

    B_i = [A_1; ...; A_{i-1}; A_{i+1}; ...; A_K],        (1)

which includes all the data points except those of the i-th class. The MLSTSVM classifier for both the linear and the nonlinear case is formulated as follows.

2.1. Linear Case

The i-th hyperplane is given by

    x^T w_i + b_i = 0,    i = 1, 2, ..., K,        (2)

where w_i and b_i represent the normal vector and the bias term, respectively, in the real space R^n. The i-th LSTSVM classifier optimizes the following objective function:

    min_{w_i, b_i, ξ_i}  (1/2) ||A_i w_i + e_1 b_i||^2 + (C_i / 2) ξ_i^T ξ_i
    subject to  −(B_i w_i + e_2 b_i) + ξ_i = e_2,        (3)

where e_1 and e_2 denote vectors of 1's of appropriate dimensions and C_i > 0 and ξ_i represent the penalty parameter and the slack variable vector, respectively. The first term of (3) is the squared sum of the distances of the data points of the i-th class from the hyperplane; its minimization keeps the hyperplane in close affinity with the i-th class. The second term of (3) minimizes the misclassification error over the data points of the remaining K − 1 classes. In this way the hyperplane is kept in close affinity with the data points of the i-th class and lies as far as possible from the data points of the other classes. The objective function is solved by taking its dual form. The Lagrangian of (3) is

    L(w_i, b_i, ξ_i, α) = (1/2) ||A_i w_i + e_1 b_i||^2 + (C_i / 2) ξ_i^T ξ_i − α^T (−(B_i w_i + e_2 b_i) + ξ_i − e_2),        (4)

where α represents the vector of Lagrange multipliers. The Lagrangian is optimized by differentiating it with respect to the normal vector, bias, slack variable, and Lagrange multipliers, which yields the following Karush-Kuhn-Tucker (KKT) conditions:

    A_i^T (A_i w_i + e_1 b_i) + B_i^T α = 0,        (5)
    e_1^T (A_i w_i + e_1 b_i) + e_2^T α = 0,        (6)
    C_i ξ_i − α = 0,    −(B_i w_i + e_2 b_i) + ξ_i = e_2.        (7)

Combining (5) and (6), the following equation is obtained:

    [A_i  e_1]^T [A_i  e_1] [w_i; b_i] + [B_i  e_2]^T α = 0.        (8)

Let E = [A_i  e_1], F = [B_i  e_2], and z_i = [w_i; b_i]^T. After putting these values in (8), it may be reformulated as

    z_i = −(E^T E)^{-1} F^T α.        (9)

Equation (9) requires the inverse of E^T E. This matrix may be singular or ill-conditioned, in which case its inverse cannot be obtained reliably. The situation is avoided by adding a regularization term, and (9) is reformulated as

    z_i = −(E^T E + δ I)^{-1} F^T α,        (10)

where δ is a very small positive scalar and I is an identity matrix of suitable size. The Lagrange multiplier vector is obtained from (7) and the equality constraint as

    α = C_i (F z_i + e_2).        (11)

After substituting (11) into (10) and solving for z_i, we obtain the normal vector and bias of the i-th classifier as follows:

    z_i = [w_i; b_i]^T = −(F^T F + (1/C_i)(E^T E + δ I))^{-1} F^T e_2.        (12)

For a new data point or test sample x, its perpendicular distance to each hyperplane is measured and the sample is assigned to the class whose hyperplane lies at minimum distance from it:

    class(x) = arg min_{i = 1, ..., K}  |x^T w_i + b_i| / ||w_i||.        (13)

Algorithm 1. For i = 1 to K, where K is the total number of classes,
(i) obtain the two matrices A_i and B_i as in (1), where A_i and B_i contain the data points of the i-th class and of the remaining classes, respectively;
(ii) use a validation process to obtain the penalty parameter C_i;
(iii) calculate the weight vector w_i and bias b_i of each class by using (12).
Obtain the decision function (13) and use it to assign classes to new data points.
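Equations (12) and (13) give a closed-form training rule and a minimum-distance prediction rule, so the linear MLSTSVM is straightforward to implement. The following is a minimal illustrative sketch in Python/NumPy rather than the authors' MATLAB implementation; the class name, the use of a single shared penalty parameter C for all classes, and the default value of δ are our assumptions.

```python
import numpy as np

class LinearMLSTSVM:
    """One-versus-rest multi least squares twin SVM (linear case) -- illustrative sketch."""

    def __init__(self, C=1.0, delta=1e-6):
        self.C = C          # penalty parameter C_i (shared across classes here)
        self.delta = delta  # regularization term added to E^T E, see (10)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.planes_ = []   # list of (w_i, b_i), one hyperplane per class
        for cls in self.classes_:
            A = X[y == cls]            # data points of the i-th class
            B = X[y != cls]            # data points of the remaining classes
            E = np.hstack([A, np.ones((A.shape[0], 1))])   # E = [A_i  e_1]
            F = np.hstack([B, np.ones((B.shape[0], 1))])   # F = [B_i  e_2]
            # z_i = -(F^T F + (1/C)(E^T E + delta I))^{-1} F^T e_2,  eq. (12)
            M = F.T @ F + (E.T @ E + self.delta * np.eye(E.shape[1])) / self.C
            z = -np.linalg.solve(M, F.T @ np.ones((F.shape[0], 1)))
            self.planes_.append((z[:-1, 0], z[-1, 0]))      # (w_i, b_i)
        return self

    def predict(self, X):
        # assign each point to the class whose hyperplane is nearest, eq. (13)
        dists = np.column_stack([
            np.abs(X @ w + b) / np.linalg.norm(w) for (w, b) in self.planes_
        ])
        return self.classes_[np.argmin(dists, axis=1)]
```

Using np.linalg.solve instead of an explicit matrix inverse is a standard numerical choice and plays the same stabilizing role as the regularization term δ in (10).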

2.2. Nonlinear Case

Most real datasets are nonlinear in nature; that is, the classes are separable only by nonlinear class boundaries. It is therefore essential that a classifier works well for both linearly and nonlinearly separable data points. In this section, we formulate the MLSTSVM classifier for the nonlinear case. First, a kernel function is used to map the input data points into a higher-dimensional feature space, and the data points are then classified by constructing nonlinear (kernel) surfaces in this space. Let D = [A_1; A_2; ...; A_K] denote the matrix of all training data points. In the higher-dimensional space, the i-th kernel surface for any kernel function Ker is given by

    Ker(x^T, D^T) u_i + b_i = 0,        (14)

where u_i and b_i are the kernel weight vector and the bias. The optimization problem of MLSTSVM for the nonlinear case is formulated as

    min_{u_i, b_i, ξ_i}  (1/2) ||Ker(A_i, D^T) u_i + e_1 b_i||^2 + (C_i / 2) ξ_i^T ξ_i
    subject to  −(Ker(B_i, D^T) u_i + e_2 b_i) + ξ_i = e_2.        (15)

The Lagrangian of (15) is

    L(u_i, b_i, ξ_i, α) = (1/2) ||Ker(A_i, D^T) u_i + e_1 b_i||^2 + (C_i / 2) ξ_i^T ξ_i − α^T (−(Ker(B_i, D^T) u_i + e_2 b_i) + ξ_i − e_2).        (16)

The KKT conditions for nonlinear MLSTSVM are

    Ker(A_i, D^T)^T (Ker(A_i, D^T) u_i + e_1 b_i) + Ker(B_i, D^T)^T α = 0,        (17)
    e_1^T (Ker(A_i, D^T) u_i + e_1 b_i) + e_2^T α = 0,        (18)
    C_i ξ_i − α = 0,    −(Ker(B_i, D^T) u_i + e_2 b_i) + ξ_i = e_2.        (19)

Combining (17) and (18) and letting G = [Ker(A_i, D^T)  e_1], H = [Ker(B_i, D^T)  e_2], and z_i = [u_i; b_i]^T, we get

    G^T G z_i + H^T α = 0.        (20)

Proceeding exactly as in the linear case, the kernel weight vector and bias are obtained as

    z_i = [u_i; b_i]^T = −(H^T H + (1/C_i)(G^T G + δ I))^{-1} H^T e_2.        (21)

The values of u_i and b_i are used to construct the kernel surface of each class. For a new data point, its perpendicular distance to each nonlinear surface is measured and the point is assigned to the class whose surface lies at minimum distance from it. The decision function for nonlinear MLSTSVM is

    class(x) = arg min_{i = 1, ..., K}  |Ker(x^T, D^T) u_i + b_i| / ||u_i||.        (22)

The Gaussian and polynomial kernel functions for two input vectors x_a and x_b are given by

    Ker(x_a, x_b) = exp(−||x_a − x_b||^2 / (2σ^2)),        (23)
    Ker(x_a, x_b) = (x_a^T x_b + 1)^d,        (24)

where σ is the Gaussian kernel width and d is the polynomial degree.

Algorithm 2. Choose a kernel function Ker.
For i = 1 to K, where K is the total number of classes,
(i) obtain the two matrices A_i and B_i as in (1), where A_i and B_i contain the data points of the i-th class and of the remaining classes, respectively;
(ii) use a validation process to obtain the penalty parameter C_i and the kernel parameters;
(iii) calculate the kernel weight vector u_i and bias b_i of each class by using (21).
Obtain the decision function (22) and assign classes to new data points by using it.
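As a companion to Algorithm 2, the sketch below extends the previous implementation to the Gaussian kernel case of (21) and (22): the data matrices are simply replaced by their kernel matrices against the full training matrix D. It is again an illustrative Python/NumPy sketch, not the authors' MATLAB code; the function names and the default parameter values are ours.

```python
import numpy as np

def gaussian_kernel(Xa, Xb, sigma=1.0):
    """Ker(x_a, x_b) = exp(-||x_a - x_b||^2 / (2*sigma^2)), computed pairwise over rows."""
    sq = np.sum(Xa**2, axis=1)[:, None] + np.sum(Xb**2, axis=1)[None, :] - 2 * Xa @ Xb.T
    return np.exp(-sq / (2 * sigma**2))

def fit_nonlinear_mlstsvm(X, y, C=1.0, sigma=1.0, delta=1e-6):
    """Return the classes and per-class (u_i, b_i) of the kernel surfaces, eq. (21)."""
    classes = np.unique(y)
    surfaces = []
    for cls in classes:
        KA = gaussian_kernel(X[y == cls], X, sigma)    # Ker(A_i, D^T)
        KB = gaussian_kernel(X[y != cls], X, sigma)    # Ker(B_i, D^T)
        G = np.hstack([KA, np.ones((KA.shape[0], 1))])
        H = np.hstack([KB, np.ones((KB.shape[0], 1))])
        M = H.T @ H + (G.T @ G + delta * np.eye(G.shape[1])) / C
        z = -np.linalg.solve(M, H.T @ np.ones((H.shape[0], 1)))
        surfaces.append((z[:-1, 0], z[-1, 0]))         # (u_i, b_i)
    return classes, surfaces

def predict_nonlinear_mlstsvm(Xnew, Xtrain, classes, surfaces, sigma=1.0):
    """Assign each new point to the class of the nearest kernel surface, eq. (22)."""
    K = gaussian_kernel(Xnew, Xtrain, sigma)           # Ker(x^T, D^T) for all new points
    dists = np.column_stack([
        np.abs(K @ u + b) / np.linalg.norm(u) for (u, b) in surfaces
    ])
    return classes[np.argmin(dists, axis=1)]
```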

In order to validate the proposed MLSTSVM, we performed experiments on five benchmark datasets, all taken from the UCI machine learning repository [16]. Table 1 shows the accuracy comparison of the proposed MLSTSVM classifier with other existing classifiers. Accuracy refers to the correct classification rate and is calculated as the average of the testing accuracies; the experiment is performed using the 10-fold cross validation method. It is clear from the table that the proposed classifier achieves better accuracy than kNN, ANN, and multi-SVM on the Wine, Glass, Vehicle, and Teaching Evaluation datasets, while on the Iris dataset MLSTSVM obtains 97.75% accuracy, which is better than ANN and multi-SVM and comparable with kNN.
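For readers who want to reproduce this evaluation protocol, the sketch below shows how a 10-fold cross-validated accuracy is computed as the average of per-fold testing accuracies. A scikit-learn k-nearest neighbor model and the built-in Wine data are used purely as stand-ins; the MLSTSVM implementation sketched above would slot into the same loop.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)          # stand-in for the UCI Wine dataset

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_acc = []
for train_idx, test_idx in skf.split(X, y):
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X[train_idx], y[train_idx])
    fold_acc.append(np.mean(clf.predict(X[test_idx]) == y[test_idx]))

# reported accuracy = average of the 10 testing accuracies
print("10-fold accuracy: %.2f%%" % (100 * np.mean(fold_acc)))
```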

3. Description of Dataset and Proposed Model

3.1. Dataset Description

The emotion detection system for human speech is divided into two parts, as shown in Figure 1: a dataset collection and feature extraction system, and an emotion detection system. The emotion dataset used in this research work is taken from two sources: a benchmark dataset obtained from http://www.eecs.qmul.ac.uk/mmv/datasets/deap/download.html and a real dataset (audio files) collected using a Nexus 5 smartphone. The benchmark dataset contains voice recordings of an expert reciting numbers and dates with different emotions, and the real dataset contains audio recordings of 10 male subjects, where each person's voice is recorded at least two to three times with different sets of emotions. The dataset covers seven different emotions: anger, anxiety, happiness, sadness, disgust, panic, and a neutral state. The audio file of a person or a patient is given as input for emotion detection. Power Audio Cutter is used to cut the audio files so that the duration of each file is 2 seconds. This research uses the PRAAT scripting tool for feature extraction, a free and flexible speech analysis tool developed by Paul Boersma and David Weenink at the University of Amsterdam. It performs spectral, formant, pitch, intensity, jitter, and shimmer analyses [17, 18].
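The authors drive PRAAT through its own scripting interface; purely as a rough illustration of this feature extraction step, the sketch below computes a few of the same measures (mean pitch, mean intensity, harmonicity, jitter, and shimmer) through the praat-parselmouth Python wrapper. The library choice, pitch range, thresholds, and file name are our assumptions, not the authors' setup.

```python
import parselmouth
from parselmouth.praat import call

def extract_voice_features(wav_path):
    """Return a few PRAAT-style voice measures for one 2-second audio clip (illustrative)."""
    snd = parselmouth.Sound(wav_path)

    pitch = snd.to_pitch()
    mean_pitch = call(pitch, "Get mean", 0, 0, "Hertz")            # mean F0

    intensity = snd.to_intensity()
    mean_intensity = call(intensity, "Get mean", 0, 0, "energy")   # mean intensity (dB)

    harmonicity = snd.to_harmonicity()
    mean_hnr = call(harmonicity, "Get mean", 0, 0)                 # harmonics-to-noise ratio

    # jitter/shimmer need a point process of glottal pulses (75-500 Hz range assumed)
    pulses = call(snd, "To PointProcess (periodic, cc)", 75, 500)
    jitter_local = call(pulses, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
    shimmer_local = call([snd, pulses], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)

    return {
        "pitch": mean_pitch, "intensity": mean_intensity, "hnr": mean_hnr,
        "jitter": jitter_local, "shimmer": shimmer_local,
    }

print(extract_voice_features("sample_angry.wav"))   # hypothetical file name
```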

This tool generates a voice report from the audio files and converts them into text files. The voice report contains different voice features such as pitch, intensity, shimmer, and jitter. Pitch, also known as the vibration rate of the vocal folds, is one of the most important components of the human voice. The sound of the voice varies according to the vibration rate: a high pitch corresponds to a high vibration rate and a louder voice, while a low pitch corresponds to a softer voice. The vibration rate depends on the length and thickness of the vocal cords, and relaxation and tightening of the muscles around the vocal cords also affect it. A person's emotion or mood likewise influences pitch; during excitement or fright, the muscles strain the vocal cords and produce a high-pitched voice. The tone of a person describes the way a statement is delivered and can convey that person's emotion, psychological arousal, and mood. A softer pitch and tone are usually perceived as nonaggressive and indicate friendly behavior. Jitter and shimmer are two further important voice attributes: they measure the percentage of irregularity in the pitch and in the amplitude of the vocal note, respectively. Voice quality and signal-to-noise ratio can be estimated from harmonicity.

The second part of the system covers the selection of significant features from the voice report generated by PRAAT and the classification of emotions using MLSTSVM. Feature selection (FS) is the process of selecting relevant and important attributes from a dataset and plays a significant role in the construction of a classification system [19–21]. FS, also termed attribute selection, reduces the number of input attributes by retaining only the attributes that are important for the classifier, in order to enhance its performance. In this paper, we use a combination of F-score and sequential forward selection approaches for feature selection.

F-Score. The F-score is a widely used FS approach in machine learning. It was originally defined to measure the discrimination between two sets of real numbers (two classes) and was later extended to measure the discrimination between more than two classes [19–21]. Let the dataset contain K classes and let the c-th class contain n_c data points, c = 1, 2, ..., K. The F-score of the j-th feature is given by [21, 22]

    F(j) = [ Σ_{c=1}^{K} (x̄_j^(c) − x̄_j)^2 ] / [ Σ_{c=1}^{K} (1/(n_c − 1)) Σ_{i=1}^{n_c} (x_{i,j}^(c) − x̄_j^(c))^2 ],        (25)

where x̄_j is the average of the j-th feature over the whole dataset, x̄_j^(c) is the average of the j-th feature over the c-th class, and x_{i,j}^(c) is the j-th feature of the i-th instance of the c-th class. The F-score of a feature represents its discriminative ability; a larger value indicates that the corresponding feature has better separation ability.
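A minimal NumPy sketch of the multiclass F-score in (25) is given below; the function name and the toy data are ours. It simply accumulates the between-class spread of each feature in the numerator and the within-class spread in the denominator.

```python
import numpy as np

def f_scores(X, y):
    """Multiclass F-score of each feature: between-class spread over within-class spread."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)                     # x-bar_j over the whole dataset
    numerator = np.zeros(X.shape[1])
    denominator = np.zeros(X.shape[1])
    for cls in classes:
        Xc = X[y == cls]
        class_mean = Xc.mean(axis=0)                  # x-bar_j^(c)
        numerator += (class_mean - overall_mean) ** 2
        denominator += ((Xc - class_mean) ** 2).sum(axis=0) / (len(Xc) - 1)
    return numerator / denominator

# example: rank features from most to least discriminative on toy data with 7 classes
X = np.random.rand(35, 24); y = np.repeat(np.arange(7), 5)
ranking = np.argsort(f_scores(X, y))[::-1]
print(ranking + 1)    # 1-based feature indices, highest F-score first
```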

Sequential Forward Selection. Sequential forward selection (SFS) works in a bottom-up fashion and begins with an empty feature set. It first adds the single best feature to the empty set; at each subsequent step, the best single feature among the remaining original features is added to the previously selected set.
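The following sketch illustrates this greedy procedure; the evaluation criterion (cross-validated accuracy of a scikit-learn k-nearest neighbor stand-in) and the function name are our assumptions, since in the paper the resulting subsets are scored with the MLSTSVM classifier itself.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sequential_forward_selection(X, y, n_features, cv=10):
    """Greedy bottom-up feature selection; returns the chosen feature indices in order."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        best_feat, best_score = None, -np.inf
        for f in remaining:
            candidate = selected + [f]
            score = cross_val_score(KNeighborsClassifier(), X[:, candidate], y, cv=cv).mean()
            if score > best_score:
                best_feat, best_score = f, score
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected
```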

The features obtained from feature selection are given to the proposed classifier, and the usefulness of feature selection is assessed by comparing against the behavior of the same classifier without feature selection. The detailed description of MLSTSVM is given in Section 2.

3.2. Proposed Framework

Figure 2 shows the proposed framework for emotion detection from audio files by using a novel MLSTSVM classifier. The proposed system contains the following steps.

Step 1. Record voice samples in audio format for the different emotions.

Step 2. Convert the audio files into text files using PRAAT scripting tool.

Step 3. Generate a database containing different emotions.

Step 4. Preprocess the dataset.

Step 5. Partition the dataset into training and testing sets by using k-fold cross validation.

Step 6. Select the relevant features with the help of the F-score and sequential forward selection approaches.

Step 7. Train and test the model with selected features and evaluate its performance with different features.

Step 8. Select the model with highest classification accuracy.

4. Experimental Results and Discussion

The benchmark and real data exist in the form of audio files. The experiment is performed on a time span of 2 seconds per audio file, for which the Power Audio Cutter tool is used to cut the audio files to the required duration. The features of the audio files are extracted using the PRAAT scripting tool. Figure 3 shows the browsing of an audio file for running the PRAAT script in order to extract features from it. The voice report of an audio file generated by PRAAT is shown in Figure 4.

Table 2 lists the speech features extracted with PRAAT: six major attributes and 24 subattributes are extracted from the audio files.

The emotion detection benchmark dataset used in this research work contains 290 voice recordings with 24 features and 7 class labels, as given in Table 3.

Figure 5 shows the snapshot of the emotion detection dataset. In this snapshot 35 instances, 5 instances of each emotion, have been taken to generate a complete view of range of various attributes for corresponding class. The first attribute of the snapshot denotes emotions. Here, “1,” “2,” “3,” “4,” “5,” “6,” and “7” are used for anger, anxiety, disgust, happiness, neutral emotions, panic, and sadness, respectively.

Since the ranges of the attributes differ from one another, each attribute value is normalized to bring it within a specified range. Two feature selection techniques, F-score and SFS, are used for selecting significant features. Table 4 shows the average F-score of each attribute over 10-fold cross validation. After calculating the F-score of each attribute, SFS is used to obtain 24 feature subsets or models. Ranked by F-score from high to low, the features are 1, 2, 4, 14, 6, 7, 5, 3, 10, 12, 8, 17, 23, 20, 19, 11, 9, 13, 16, 15, 18, 22, 21, and 24. Table 5 shows the twenty-four feature subsets or models obtained on the basis of SFS. For each feature subset, an MLSTSVM classifier is constructed and its predictive accuracy is estimated using the 10-fold cross validation method. The proposed MLSTSVM classifier is implemented in MATLAB R2012a.
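As a concrete reading of this preprocessing and subset-building step, the sketch below min-max normalizes each attribute to [0, 1] (the exact target range used in the paper is not stated, so this is an assumption) and then forms the 24 nested feature subsets by adding features in decreasing order of the F-score ranking reported above; the helper names are ours.

```python
import numpy as np

def min_max_normalize(X):
    """Scale every attribute to the [0, 1] range so that no feature dominates by magnitude."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / np.where(maxs - mins == 0, 1, maxs - mins)

# feature indices ordered from highest to lowest F-score (1-based, as reported above)
ranking = [1, 2, 4, 14, 6, 7, 5, 3, 10, 12, 8, 17, 23, 20, 19, 11, 9, 13, 16, 15, 18, 22, 21, 24]

# model k uses the top-k ranked features; model 16 was the best-performing subset
subsets = [[f - 1 for f in ranking[:k]] for k in range(1, 25)]
print(subsets[15])   # 0-based column indices of the 16 features in model 16
```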

An appropriate choice of parameters, namely, the penalty parameters C_i, the Gaussian kernel width σ, and the polynomial degree d, can enhance the performance of MLSTSVM. This study uses a grid search for this purpose: the parameters are selected from predefined candidate ranges by using 10-fold cross validation. Table 6 reports the average predictive accuracies of the proposed linear and nonlinear MLSTSVM classifiers on each feature subset. In this research, we use Gaussian and polynomial kernel functions. Among the 24 models, model 16 obtains the highest predictive accuracy: 87.28% for linear MLSTSVM, 92.89% for Gaussian MLSTSVM, and 88.87% for polynomial MLSTSVM. Therefore, model 16 is considered the best feature subset.
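The grid search itself can be sketched as an exhaustive loop over candidate parameter values. The grids and the dummy evaluation function below are placeholders (the paper's actual search ranges are not reproduced here); in practice, evaluate would return the 10-fold cross-validated accuracy of an MLSTSVM trained with the given C, σ, and d.

```python
import itertools
import numpy as np

def grid_search(evaluate, C_grid, sigma_grid, degree_grid):
    """Exhaustive grid search: evaluate(C, sigma, degree) should return a
    cross-validated accuracy; the best-scoring parameter triple is kept."""
    best_params, best_acc = None, -np.inf
    for C, sigma, degree in itertools.product(C_grid, sigma_grid, degree_grid):
        acc = evaluate(C, sigma, degree)
        if acc > best_acc:
            best_params, best_acc = (C, sigma, degree), acc
    return best_params, best_acc

# placeholder grids (illustrative only; not the ranges used in the paper)
C_grid = [2.0**k for k in range(-5, 6)]
sigma_grid = [2.0**k for k in range(-4, 5)]
degree_grid = [2, 3, 4]

# example call with a dummy evaluation function standing in for MLSTSVM + 10-fold CV
params, acc = grid_search(lambda C, s, d: np.random.rand(), C_grid, sigma_grid, degree_grid)
print(params, acc)
```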

We have compared the performance of the proposed classifier with kNN, ANN, and multi-SVM on the emotion detection dataset, as shown in Table 7. All these machine learning algorithms are also implemented in MATLAB R2012a on the Windows 7 operating system. The proposed feature selection based MLSTSVM model with the Gaussian kernel function obtains the highest accuracy, 92.89%, compared to the other approaches. Table 8 compares emotion detection across the various classifiers. MLSTSVM correctly detects the neutral, happy, and panic emotions in all cases. For anger, it detects 90% as anger and 10% as sadness; for anxiety, 80% as anxiety, 10% as disgust, and 10% as sadness; for disgust, 90% as disgust and 10% as anxiety; for sadness, 90% as sadness and 10% as disgust.

Figure 6 compares the accuracy of kNN, ANN, multi-SVM, and MLSTSVM for the different emotions. For the anger emotion, the accuracy of kNN, multi-SVM, and MLSTSVM is 90%, while the accuracy of ANN is 80%. The accuracy of the proposed MLSTSVM is 100% for the happy, neutral, and panic emotions. For the disgust emotion, the accuracy of ANN is the highest among the compared approaches. It is therefore concluded that the accuracy of MLSTSVM is lower than that of ANN only for the disgust emotion, while for the other emotions its accuracy is equal to or higher than that of the other approaches.

We have also checked the performance of the proposed system on real data. For this purpose, we recorded the voices of 10 male subjects using a Nexus 5 smartphone. The Nexus 5 voice recorder has dual-membrane MEMS microphones, with one membrane focused on high sensitivity and the other on high sound pressure; after the two signals are combined, background noise is greatly reduced, which increases the overall sound quality of the recording. The real dataset contains 160 voice recordings covering the seven different emotions. The performance comparison of the classifiers kNN, ANN, multi-SVM, and MLSTSVM on the real dataset is shown in Table 9. The proposed MLSTSVM also works well on real data compared to the other existing approaches and obtains the highest accuracy, 86.18%.

5. Conclusion

The proposed emotion detection system can be used in automated call centers or in nursing homes where resources or nursing staff may not be available to aid everyone. An automated emotion detection system can identify changes in a patient's emotion and trigger an alarm accordingly, so that effective treatment or care can be provided to the patient as soon as possible, and it can assist clinicians in performing the task of emotion detection more efficiently. The proposed system may serve as an important tool because changes in emotion are associated with several disorders such as PTSD, BMD, OCD, depression, and suicide. In this paper, emotions are detected using a novel classifier, named MLSTSVM, whose performance is validated on five benchmark datasets. The PRAAT scripting tool is used for feature extraction, yielding 24 features from each voice recording. A combination of F-score and SFS is used to select important features from the emotion detection dataset. The MLSTSVM based emotion detection system with sixteen selected features achieves the best predictive accuracy: 87.28% for linear MLSTSVM, 92.89% for Gaussian MLSTSVM, and 88.87% for polynomial MLSTSVM. The performance of the proposed system is compared with the kNN, ANN, and multi-SVM approaches, and the experimental results indicate that the proposed classifier based emotion detection performs well compared to these existing approaches. The results are also verified on a real dataset containing the voices of 10 persons with different emotions, on which the Gaussian MLSTSVM obtains 86.18% accuracy. The whole system may be adopted and extended as an intelligent personal assistant application for helping disabled people, autistic children, psychiatric patients, and elderly people. Apart from healthcare, automatic recognition of emotions from human speech as achieved here may also be used as part of human-computer interaction applications such as robotics, games, and intelligent tutoring systems. We have developed the emotion recognition system using MATLAB on the Windows operating system. The system has certain limitations; for example, it does not deal with background noise and is trained on male voices only. Hence a few enhancements are possible in the future. Better performance could likely be achieved by optimizing parameters such as sigma (for the Gaussian kernel function) and the cost parameters by using a genetic algorithm, particle swarm optimization, or another optimization approach.

Conflict of Interests

The authors declare that they have no conflict of interests regarding the publication of this paper.