Abstract

In the recent past, handling the high dimensionality of the auditory features of speech signals has been a primary focus of machine learning (ML)-based emotion recognition. Training ML models on high-dimensional feature sets leads contemporary emotion prediction approaches to raise significant false alarms. Most contemporary models acknowledge the curse of excessive dimensionality in the training corpus, yet they place greater emphasis on merging many classifiers, which can improve emotion recognition accuracy only to a limited extent when the training corpus contains high-dimensional data points. “Ensemble Learning by High-Dimensional Acoustic Features (EL-HDAF)” is an innovative ensemble model that leverages the diversity assessment of feature values spanned over diversified classes to recommend the best features. Furthermore, the proposed technique employs a distinctive clustering process to limit the impact of high-dimensional feature values. The experimental inquiry evaluates and compares emotion forecasting from spoken audio data against current methods that use machine learning for emotion recognition. Fourfold cross-validation on the standard data corpus is used for performance analysis.

1. Introduction

Emotions have a profound influence on the physical and psychological well-being of humans. In therapeutic settings, improvement depends on how well patients convey their emotions and how well their therapists recognize and respond to them [1]. Therapists must deal with enormous volumes of data over a lengthy period of time, which is difficult because they must see numerous patients throughout that time. A platform that can give meaningful speech-based emotion identification insights might therefore be useful in therapy sessions. EmoViz, for example, allows users to record voice samples and use a neural network to determine emotional states (such as joyful, sad, angry, surprised, or neutral). Emotional information may be obtained through the examination of spoken audio signals without the need for intrusive technology such as facial recognition or physiological sensors based on internal signals. Using EmoViz, users may view how their emotions have evolved over time and how audio and texts have been grouped by emotion [2]. Emotion is important in everyday interpersonal connections and is regarded as a necessary skill for human communication; it facilitates interaction by expressing feelings and responding to the people being communicated with [2]. Many cognitive and affective computing tasks, such as rational decision-making, perception, and learning, benefit from emotional input. As intelligent systems grow more ubiquitous, emotion identification is becoming increasingly crucial [3].

Computer games, banking, call centers, video monitoring, and psychiatric diagnosis are just a few examples of real-world applications for emotion detection systems; other practical applications include online learning, business applications, clinical investigations, and entertainment [4, 5]. Intelligent systems that infer the emotions embedded in voice signals are known as “emotion recognition from speech” systems. Because of a host of intrinsic socio-economic benefits, speech signals are a great source for affective computing. Owing to their inexpensive acquisition, they are more appealing for speech emotion recognition research than other physiological signals such as electroencephalograms, electrooculograms, and electrocardiograms [6].

Despite modest progress, the accuracy of this approach in identifying fear is lower than for other emotions [7, 8]. When Semwal and colleagues [8] integrated fundamental frequency, ZCR (zero-crossing rate), MFCC, and energy, they were able to identify fear with 77 percent accuracy. Sun et al. [9] revealed that a deep learning neural network model using bottleneck features achieved an accuracy of 62.50 percent in detecting fear.

1.1. Motivation

Machine learning techniques utilize a number of processes to obtain a collection of speech features that may be used to properly categorize emotions. To build an appropriate emotion-recognition-from-speech system, a suitable collection of characteristics must be chosen on which to train the selected learning algorithm. Emotion recognition algorithms rely mainly on features extracted from spoken audio signals [3, 10]; however, identifying an appropriate feature set is challenging [11]. Speech emotion recognition is difficult for a variety of reasons, including the imperfect description of an emotion and the blurring of the boundaries between distinct emotions. Emotion identification from speech has been improved by introducing new features, as demonstrated in [12], with an accuracy of 91.75 percent on an acted corpus when employing PLP characteristics. This accuracy is rather low when compared to the 95.20 percent accuracy attained by fusing acoustic characteristics focusing on MFCC and pitch for recognizing speech emotion. Some studies have sought to agglutinate numerous auditory characteristics to increase the accuracy and precision of speech emotion identification [7, 8].

1.2. Problem Statement

“Ensemble learning” refers to the process of combining various learning models with the goal of producing a better learner as a result. Such algorithms are used in a variety of fields, including medical investigations [13] and dialect prediction [14]. Bagging [15] and boosting [16] are two of the most common ensemble approaches. In terms of accuracy, ensembles of core estimation methods have been shown to outperform single hypotheses [17]. Quinlan [18] tested boosting and bagging ensemble models on a variety of datasets and found them to be remarkably effective. Bagging, as the name implies, trains several estimators on random subsets of the dataset; if the training samples are drawn with replacement, they are referred to as “bootstrap samples.” Ensemble methods have also been used to analyse audio data. Schuller et al. [19] investigated ensemble learning methods for recognizing speaker emotion through speech and found an increase in accuracy on movie content. Morrison et al. [20] combined several classifiers for emotion recognition tasks in call center practice using an unweighted voting method. However, most of these contributions are limited to adopting the classification decision delivered by the majority of classifiers in an ensemble of diversified classifiers. The crux of high-dimensional features remains the same. Hence, rather than assembling diversified classifiers, the focus should be on handling the high dimensionality of the features.

1.3. Organisation of the Manuscript

This paper begins with the introduction given above to ensemble learning by high-dimensional acoustic features for emotion recognition from speech audio signals. Section 2 reviews related work and numerous models for emotion recognition from speech audio signals. Section 3 covers the methods and materials connected to the proposed model. In Section 4, the experimental study is conducted, and the proposed model is compared to other modern models. The conclusion of this article is presented in Section 5, followed by references.

2. Related Work

There have been a few studies on support vector machine ensemble learning [21]. Hu et al. [22] used such an ensemble to solve the problem of rotating machinery fault detection. However, studies of this nature are few and far between.

Bhavan et al. [23] used a bagged ensemble approach on the Emo-DB corpus and achieved a prediction accuracy of 92.45 percent. Shegokar and Sircar [24] proposed the continuous wavelet transform (CWT) with prosodic elements for recognizing emotion in speech audio signals; using PCA feature transformation and an SVM with a quadratic kernel as the classification approach, they achieved an overall accuracy of 60.1 percent on the RAVDESS database. The EMD (empirical mode decomposition) method, which uses the reconstructed signal's optimal features, was used to analyse reconstructed speech signals; on the Spanish database, an average classification accuracy of 91.16 percent was achieved using the RNN technique.

As stated in the introduction, there are numerous reasons why emotion identification remains a major challenge. There is a disconnect between acoustic qualities and human emotions, and no established theoretical framework links voice characteristics to a speaker’s emotional state [10, 24–26]. Because of these underlying difficulties, there is disagreement in the research about which elements are better for recognizing emotion [10, 26]. When several different types of auditory characteristics are combined, researchers have shown promising results in speech emotion identification [10, 26–29]. They have, however, struggled to find a way to combine the various elements that is both effective and efficient. The studies [3, 10, 27] emphasise the importance of identifying appropriate features in order to improve the stability of speech emotion recognition systems. Researchers frequently use specialist software to simplify the extraction, selection, and unification of speech features across multiple sources. Diverse learning algorithms for speaker emotion recognition have been trained and verified using specific features extracted from public databases.

Multiple neural networks can be fused together to increase recognition efficiency from multiple perspectives. However, when a trained model is applied to an unseen domain, gradient vanishing and overfitting can easily occur. The ability to generalise is crucial in speech emotion recognition. Ensemble learning offers a number of advantages, including generalisation ability and parallelism. The accuracy and reliability of each individual expert are crucial in ensemble learning [30, 31].

The use of ensemble learning and traditional machine learning approaches in speech emotion recognition has recently increased [32]. Weighted sum fusion was used by Mao et al. [33] to demonstrate that separating complex language features from emotional aspects in speech improves the recognition rate. Liu et al. trained subclassifiers on a variety of emotional feature subsets and fused them at the decision-making layer, resulting in improved recognition results. Existing ensemble learning relies heavily on expert credibility allocation, which is a significant flaw. Moreover, the data underlying the initial decision are speech features, and acquisition methods are limited, resulting in slight variations across samples and inaccurate grouping information [34, 35].

On this basis, ensemble learning models can be used to make more stable decisions by combining multiple models, with each expert's credibility updated online based on its accuracy rate. Both the generalisation and the recognition of speech emotions have thereby improved [36].

The most recent attempt at ensemble learning that fuses diverse categorization strategies is HAF [37], which combines various classification algorithms. Despite that model's superior performance, the high dimensionality of the training corpus remains a problem. This contribution depicts an ensemble learning model that clusters the speech audio signals of the dataset used as input to the training phase, in order to mitigate the negative impact of high-dimensional features. The suggested method uses the distribution diversity of feature values spanned over different records of divergent emotions to determine the best features. The proposed model is motivated by the previously described model titled “Speech Emotion Recognition Using Supervised Bayes Learning on Digital Features (SBL-DF)” [38]. SBL-DF, however, does not address the negative impact of high-dimensional features.

3. Methods and Materials

This section explores the materials and methods used in the proposed model that enable the prediction of emotions in speech audio signals. The materials and methods explored in this section centre on handling the adverse impact of high-dimensional features on emotion prediction, feature extraction, feature optimization, and ensemble classification using the adaptive boosting technique, as represented in Figure 1. A detailed description of these materials and methods is given in the following sections.

3.1. Dimensionality Reduction

The Fuzzy C-Means [39] clustering technique has been employed to handle the high dimensionality of the training corpus, which otherwise leads to low sensitivity and specificity and causes an intolerable false-alarm rate.

The Fuzzy C-Means method divides the input data into clusters, with each cluster retaining a group of records with a substantial association. The procedure is as follows:

Take the records randomly as centroids and perform fuzzy clustering using Fuzzy C-Means, such that one or more records may fall in more than one cluster.

Find the optimal centroids of the resultant clusters and repeat the clustering of records recurrently till there is no change in the centroids.

Records that settle in more than one cluster can be graded for their association by measuring their distance from the centroids of the corresponding clusters containing them.

The algorithm works by assigning a membership value to each record for every cluster, in proportion to the distance between the record and the corresponding cluster centroid: the closer a record is to a cluster's centroid, the higher its membership towards that cluster. Following each membership iteration, the memberships and the cluster centroids are revised using the following formulas:

$$\mu_{ij} = \frac{1}{\sum_{k=1}^{c}\left(\frac{d_{ij}}{d_{ik}}\right)^{\frac{2}{m-1}}}, \qquad v_j = \frac{\sum_{i=1}^{n}\mu_{ij}^{m}\,x_i}{\sum_{i=1}^{n}\mu_{ij}^{m}}.$$

The number of records is indicated by the notation $n$, and $x_i$ reflects the $i$-th record. The membership $\mu_{ij}$ expresses the support of record $x_i$ towards the $j$-th cluster (the record with the highest membership has the highest support towards that cluster), while the notation $m$ ($m > 1$) reveals the index of fuzziness. Centroids are indicated by the set $V = \{v_1, v_2, \ldots, v_c\}$, where $c$ is the number of clusters. The notation $d_{ij} = \lVert x_i - v_j \rVert$ denotes the Euclidean distance of the record $x_i$ towards the current centroid $v_j$ of the $j$-th cluster, and $d_{ik}$ represents the Euclidean distance between the centroid of the $k$-th cluster and the record $x_i$. The objective that the Fuzzy C-Means algorithm seeks to minimise is

$$J = \sum_{i=1}^{n}\sum_{j=1}^{c}\mu_{ij}^{m}\,d_{ij}^{2},$$

where $d_{ij}$ is the Euclidean distance between the cluster centroid $v_j$ and the record $x_i$.

The steps involved in Fuzzy C-Means clustering are as follows: (i) the set $X = \{x_1, x_2, \ldots, x_n\}$ represents the set of records, whereas the set $V = \{v_1, v_2, \ldots, v_c\}$ indicates the set of centroids of all clusters; (1) the initial cluster centroid $v_j^{(0)}$ of the $j$-th cluster is selected randomly; (2) the fuzzy membership $\mu_{ij}$ is computed by utilizing the membership formula above; (3) the fuzzy centroid $v_j$ is then measured by utilizing the centroid formula above; (4) the two preceding steps are repeated till the condition $\lVert U^{(t+1)} - U^{(t)} \rVert < \beta$ is true or the value of the objective $J$ is minimal.

The notation $t$ in this case reflects the iteration's progress. The termination criterion is indicated by the notation $\beta$, which ranges between 0 and 1. The notation $U = [\mu_{ij}]_{n \times c}$ illustrates the fuzzy membership matrix. Finally, the depiction $J$ denotes the fitness estimation process.

Let the number of fuzzy clusters that have been generated be $c$, the size of the set $FC = \{fc_1, fc_2, \ldots, fc_c\}$, which contains the fuzzy clusters in chronological order.
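To make the clustering step concrete, the following is a minimal Fuzzy C-Means sketch in NumPy, assuming the speech records have already been digitised into a two-dimensional array X (records × features). The parameter names (c clusters, fuzziness index m, termination criterion beta) follow the notation above; the function is an illustrative reconstruction, not the authors' implementation.

```python
import numpy as np

def fuzzy_c_means(X, c=7, m=2.0, beta=1e-4, max_iter=100, seed=0):
    """Cluster the rows of X into c fuzzy clusters; returns (centroids V, memberships U)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))                      # random initial membership matrix
    U /= U.sum(axis=1, keepdims=True)           # each record's memberships sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]                            # fuzzy centroids v_j
        D = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-10   # distances d_ij
        U_new = 1.0 / (D ** (2.0 / (m - 1.0)) *
                       (D ** (-2.0 / (m - 1.0))).sum(axis=1, keepdims=True))
        if np.linalg.norm(U_new - U) < beta:    # termination criterion beta
            return V, U_new
        U = U_new
    return V, U
```

Records whose membership is spread over more than one cluster can be identified directly from the returned matrix U, as described in the clustering steps above.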

3.2. Optimal Feature Selection

For each set $S_l$ of the records representing the label $l$, find the optimal features (x-coordinates of the given speech audio signal) compared to the counterpart sets $S_o$ ($o \ne l$). For each set $S_l$, a feature $f$ (x-coordinate) is optimal if the values projected onto that feature in $S_l$ have distribution diversity with the values projected onto the same feature in the other sets $S_o$. For each feature $f$ of the set $S_l$, the process estimates the diversity weight $dw(f, l)$ towards each of the other sets $S_o$, which is the absolute difference between the maximum similarity (one) and the probable similarity observed $ps$. The mathematical model for identifying the optimal features from each pair of sets is portrayed in the following description:

Begin // for all the label sets $S_l$
 Begin // for all the feature attributes $f$ of the set $S_l$
  $dw(f, l) \leftarrow 0$ // the overall diversity of the feature $f$ concerning the set (label) $S_l$
  Begin // performing the fusion of the diversity estimation of the feature $f$ between the sets $S_l$ and $S_o$
   $\delta(f, S_l, S_o) \leftarrow dt$ // the diversity of the feature between the sets is initialized to the distance threshold $dt$
   Begin // if the probable similarity value $ps$ observed for the feature between the sets is found to be greater than the given probability threshold $pt$
    $\delta(f, S_l, S_o) \leftarrow$ ks-statistic // the diversity of the feature between the sets is discovered from the ks-test
   End // of the condition
   $dw(f, l) \leftarrow dw(f, l) + \delta(f, S_l, S_o)$ // fuse the diversity over the counterpart sets
  End // of the iterations over the counterpart sets
  Begin // if the diversity weight $dw(f, l)$ of the feature towards the set (label) $S_l$ is greater than or equal to the given diversity threshold $dwt$,
   // then consider the feature optimal to the set $S_l$ and move it to the optimal feature set $OF_l$
  End
 End // of the iterations over features
End // of the iterations over label sets

// Preprocess the datasets of the diversified labels //
Begin // for each feature $f$ that is selected as an optimal feature of the set of the label $l$,
 // discard the feature $f$ and the values projected to the corresponding feature from the set
End
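A hedged Python sketch of this diversity-based feature selection is given below, assuming each label's records are held in a pandas DataFrame whose columns are the feature attributes (x-coordinates). The helper name select_optimal_features and the default threshold values are illustrative rather than taken from the paper; the KS statistic and p value come from scipy.stats.ks_2samp.

```python
from scipy.stats import ks_2samp

def select_optimal_features(label_sets, probability_threshold=0.05,
                            distance_threshold=1.0, diversity_threshold=0.5):
    """label_sets: dict mapping emotion label -> DataFrame of feature columns."""
    optimal = {}
    for label, frame in label_sets.items():
        optimal[label] = []
        others = [other for other in label_sets if other != label]
        for feature in frame.columns:
            diversity_weight = 0.0
            for other in others:
                diversity = distance_threshold           # initialise diversity to the distance threshold
                stat, p_value = ks_2samp(frame[feature], label_sets[other][feature])
                if p_value > probability_threshold:      # probable similarity ps > pt
                    diversity = stat                     # take the diversity from the ks-test
                diversity_weight += diversity            # fuse diversity over counterpart sets
            if diversity_weight / max(len(others), 1) >= diversity_threshold:
                optimal[label].append(feature)           # feature is optimal for this label
    return optimal
```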

3.3. The Classifier
3.3.1. Classification Procedure

This section describes the classifier employed in this proposal, as well as the training stage model and the classification procedure's objective function.

The proposed classifier was built using adaptive boosting. The classifier was designed to combine a large number of Boolean classifiers, also known as weak classifiers, built using decision trees. Each weak classifier was trained on the best features selected through the quantitative steps described above. These weak classifiers categorize the provided test data based on whether a condition is true or false. A single weak classifier may mislabel records on both sides, yielding both false positives and false negatives. This procedure is repeated until the overall set of weak classifiers determines that the task has been finished. Furthermore, the outcomes obtained from all weak classifiers are combined into a weighted score that provides the final result.

In this article’s projected model, each weak classifier was employed to highlight the coherently ideal features gathered during the feature optimization phase for binary classification. The classification technique was repeated for each boosting iteration using a weak classifier; the corpus portion that could not be accurately identified became the focus of the next classifier iteration, known as “boosting.” Furthermore, the records misclassified by the inferior classifier employed in each iteration are re-weighted. Completing the weak classifiers iteratively results in accurately categorised records from the union of these weak classifiers. Each weak classifier, according to the projected method, contributes its recommended optimal features to the classification accuracy. Furthermore, the classification results of the weak classifiers are aggregated in order to discover the label of the given record. When compared to other binary classification challenges, the Adaboost classifier has been demonstrated to be a feasible approach for optimising the output of decision trees (DT), as sketched after the following list. It has the potential to be widely employed to improve the performance of various machine learning approaches. The label prediction approach for an unlabelled record consists of the steps listed below: (i) extract the values of all considered features from the unlabelled record; (ii) the adaptive boosting classification strategy recommended in this proposal is used to predict the emotion label of the speech record; (iii) discover the standard measures of the fitness coefficients of the features towards all weak classifiers; (iv) consider the values of the features in the given input record, where the considered features are optimal in regard to one or more weak classifiers; (v) prepare the normal distribution for each optimal feature that uses the input value of the feature as a standard measure; (vi) find the fitness confidences of the input record towards all optimal features of the corresponding weak classifier; (vii) compare the standard measures of the fitness coefficients discovered during the training phase and the fitness confidences of the respective features to predict the label; (viii) a label is assigned to each input record after completing this prediction phase.
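As a concrete illustration, a minimal adaptive boosting ensemble over decision-tree weak learners can be assembled with scikit-learn as sketched below. The variable names and the use of X_train/y_train for the optimal feature values and emotion labels are assumptions made for the example; the paper's exact training configuration is not reproduced here.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Shallow decision trees act as the Boolean (weak) classifiers.
weak_learner = DecisionTreeClassifier(max_depth=1)

# scikit-learn >= 1.2 uses the `estimator` keyword (older releases use `base_estimator`).
ensemble = AdaBoostClassifier(estimator=weak_learner, n_estimators=100)

# X_train: optimal feature values per record; y_train: emotion labels.
# ensemble.fit(X_train, y_train)
# predicted_labels = ensemble.predict(X_test)
```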

4. Experimental Study

This section focuses on the proposed model’s practical implementation in comparison to some of the latest methods discussed in the literature. It describes the dataset in detail, the programme requirements, and the system conditions that are critical for the performance study. Python [40] is used to execute the model, while PyCharm [41] is used to write the code.

4.1. The Data

For the experimental analysis of the proposed model, the RAVDESS dataset [42] was used, which is a corpus of speech audio signals reflecting a variety of emotions. The emotional relevance, expressiveness, and authenticity of the RAVDESS recordings were assessed by 247 individuals characteristic of untrained adult research participants. A further 72 volunteers provided ratings for the dataset's cross-validation. Emotional relevance, reliability, and cross-rater reliability have all been reported to be high. 6204 speech audio signal records were used in the experiment, each of which was labelled with the emotion identified in the corresponding speech audio signal. The records represent the emotions anger, disgust, fear, joy, neutral, surprise, and sad, where the records labelled anger, fear, joy, and sad each number 1128, disgust numbers 576, neutral numbers 564, and surprise numbers 552. Overall, the 200 words spoken by 200 different people in 200 different emotional contexts represent a wide range of emotions.

4.2. Data Processing

The speech audio signals of the dataset are transformed into a digital format [43] such that each speech signal is converted to a set of y-coordinates (amplitudes) representing the corresponding x-coordinates (sample positions). Each speech audio signal is thus viewed as a row of a two-dimensional matrix of digital representations. A total of seven datasets in CSV format, each representing one of the emotion labels, are generated after data processing.
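For illustration, a minimal digitisation sketch is shown below, assuming WAV input; the file names and the anger.csv output path are hypothetical. Each signal becomes one CSV row of sample amplitudes (y-coordinates) ordered by sample position (x-coordinates).

```python
import csv
from scipy.io import wavfile

def signal_to_row(path):
    rate, samples = wavfile.read(path)   # amplitudes (y) indexed by sample position (x)
    if samples.ndim > 1:                 # collapse stereo channels to one
        samples = samples.mean(axis=1)
    return samples.tolist()

with open("anger.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    for path in ["anger_001.wav", "anger_002.wav"]:  # hypothetical file names
        writer.writerow(signal_to_row(path))
```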

4.3. Performance Analysis

This approach has been evaluated for performance using metrics derived from confusion matrices, alongside the contemporary models that use “hybrid acoustic features (HAF)” [37] and “Supervised Bayes Learning of Digital Features (SBL-DF)” [38]. To perform cross-validation, the records of each label are divided into training and testing sets. The suggested EL-HDAF and the contemporary models HAF and SBL-DF have undergone fourfold cross-validation to demonstrate the superiority of EL-HDAF over the existing HAF and SBL-DF models. Table 1 summarises the record counts used for training and testing.

The overall number of records taken for the experimental study is 6204; of these, 4653 records are used for training and 1551 for testing.
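A fourfold split of this kind can be set up as sketched below, assuming the digitised records and their emotion labels are loaded into arrays X and y; the stratified splitter (an assumption for the example) keeps the per-emotion proportions roughly equal across folds, yielding around 4653 training and 1551 testing records per fold.

```python
from sklearn.model_selection import StratifiedKFold

folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
# for train_index, test_index in folds.split(X, y):
#     X_train, X_test = X[train_index], X[test_index]
#     y_train, y_test = y[train_index], y[test_index]
#     ...  # train EL-HDAF on the fold and collect confusion-matrix metrics
```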

To evaluate the multilabel cross-validation adopted in the performance analysis, metrics including precision (positive predictive value) and sensitivity are used. Other metrics considered in the analysis include the weighted measures of sensitivity, F-score, and precision. The scalability and effectiveness of the solution are determined through the micro-level study of the associated assessment metrics.

When compared to the HAF and SBL-DF approaches, the recommended EL-HDAF strategy shows a more consistent precision rate for all emotions, according to the statistical data shown in Figure 2.

Figure 3 shows that the EL-HDAF has similar performance advantages in emotion prediction sensitivity (recall) over the contemporary models HAF and SBL-DF.

The graphs in Figure 4 plot the F-measure against the distinct labels, with the F-measure representing the harmonic mean of the precision and sensitivity. According to the statistics in the graphical representation, EL-HDAF surpassed the other frameworks, HAF and SBL-DF, used for comparison.
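For reference, with precision $P$ and sensitivity (recall) $R$ for a given label, the F-measure plotted in Figure 4 is the standard harmonic mean:

$$F = \frac{2 \cdot P \cdot R}{P + R}.$$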

Figure 5 considers one of the critical measures, specificity, defined as the ratio of true negatives to the cumulative set of actual negatives. The graphical representation of the performance indicates that the proposed EL-HDAF model performs better than the other key models HAF and SBL-DF reviewed for this speech corpus. The comparison of the three models is shown in the form of a graph over the fourfolds for the labels angry, disgust, fear, joy, neutral, sad, and surprised. Thus, it is concluded that the performance of the proposed model in terms of specificity is better for all the labels when compared to the contemporary models.

The accuracy metric has been used for measuring the performance of EL-HDAF, HAF, and SBL-DF over the fourfolds, as exhibited in Figure 6. The comparison of the three models is shown in the form of a graph over the fourfolds for the labels angry, disgust, fear, joy, neutral, sad, and surprised. Therefore, it is concluded that the performance of the proposed model in terms of accuracy is better for all the labels compared to the other contemporary models.

Weighted measures of accuracy, recall, and F-score are all essential metrics in determining the strength of multilabel classification performance because they help in understanding the classifier’s overall performance. The metric values represent the classifier's ability to scale its performance based on the precision, sensitivity, and accuracy factors, including the harmonic mean of the precision and sensitivity. The micromeasures of the corresponding metrics precision, sensitivity, accuracy, and F-score are also critical for assessing the performance of multilabel classification.
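These weighted and micro-averaged measures can be computed from each fold's predictions as sketched below, assuming y_true and y_pred hold the actual and predicted emotion labels; the snippet uses standard scikit-learn metric helpers rather than the authors' own scripts.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarise_fold(y_true, y_pred):
    weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted")[:3]
    micro = precision_recall_fscore_support(y_true, y_pred, average="micro")[:3]
    return {
        "weighted_precision_recall_f1": weighted,
        "micro_precision_recall_f1": micro,
        "accuracy": accuracy_score(y_true, y_pred),
    }
```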

For EL-HDAF, HAF, and SBL-DF, the weighted measures of the metrics observed for each emotion prediction are the essential inputs for determining the micromeasures of the corresponding metrics. The micromeasures of the corresponding cross-validation metrics are represented in Figure 7. The fourfold cross-validation process and the resultant micromeasures of precision, sensitivity, F-score, and class prediction accuracy indicate that the EL-HDAF model outperforms the SBL-DF and HAF models.

5. Conclusion

In recent years, predicting emotional states from acoustic features of spoken audio signals has been a prominent objective in the field of speech audio signal processing. Machine learning models trained on high-dimensional features are used to recognize emotion from audio data. To reduce the effect of high-dimensional data on the proposed model during training, the feature values of the various classes were analysed for diversity, and a novel clustering approach was devised. It is also worth mentioning that the adaptive boosting classification technique is designed to learn from the various clusters in the training corpus. Ensemble Learning by High-Dimensional Acoustic Features (EL-HDAF) is the projected model, and it has been evaluated against two existing models, HAF and SBL-DF, on the benchmark RAVDESS dataset using fourfold cross-validation. In the performance analysis, the cross-validation metrics and the accompanying micromeasures were investigated. The resulting measurements demonstrate that EL-HDAF emotion detection beats the existing methods HAF and SBL-DF with the fewest false alarms and the highest decision accuracy. In the future, the acoustic features of the speech stream can be tuned utilizing evolutionary computing methodologies to increase the performance of ensemble learning models in predicting emotion. This contribution should motivate future research on emotion detection through acoustic features of speech signals, where evolutionary techniques have optimal scope in feature optimization.

Abbreviations

ML:Machine learning
EL-HDAF:Ensemble learning by high-dimensional acoustic features
ZCR:Zero-crossing rate
EMD:Empirical mode decomposition
HAF:Hybrid acoustic features
SBL-DF:Speech emotion recognition using supervised Bayes learning on digital features
$v_j^{(0)}$:Cluster centroid (initial)
$d_{ij}$:Euclidean distance
$v_j$:Fuzzy centroid
$FC$:Fuzzy clusters
$\delta$:Diversity
$f$:Feature
$ps$:Probable similarity value
$pt$:Probability threshold
$dw$:Diversity weight
DT:Decision trees.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request ([email protected]).

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Authors’ Contributions

M M Venkata Chalapathi conceptualised the study, curated the data, performed a formal analysis, devised the methodology, contributed to the software, and wrote the original draft; M. Rudra Kumar supervised the study, wrote and reviewed the content, edited the article, and helped with project administration and visualization; Neeraj Sharma supervised the study, wrote and reviewed the software, validated the content, and wrote the original draft and was responsible for devising the methodology; S. Shitharth wrote, reviewed, and edited the article, helped acquire funding, and contributed to the visualization and formal analysis, and also software development.