Sound classification is a broad area of research that has gained much attention in recent years. The sound classification systems based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have undergone significant enhancements in the recognition capability of models. However, their computational complexity and inadequate exploration of global dependencies for long sequences restrict improvements in their classification results. In this paper, we show that there are still opportunities to improve the performance of sound classification by substituting the recurrent architecture with the parallel processing structure in the feature extraction. In light of the small-scale and high-dimension sound datasets, we propose the use of the multihead attention and support vector machine (SVM) for sound taxonomy. The multihead attention is taken as the feature extractor to obtain salient features, and SVM is taken as the classifier to recognize all categories. Extensive experiments are conducted across three acoustically characterized public datasets, UrbanSound8K, GTZAN, and IEMOCAP, by using two commonly used audio spectrograms as inputs, respectively, and we fully evaluate the impact of parameters and feature types on classification accuracy. Our results suggest that the proposed model can reach comparable performance with existing methods and reveal its strong generalization ability of sound taxonomy.

1. Introduction

As an essential medium in human-machine interactions, automatic sound classification (ASC) is a wide area of study that has been investigated for years, mostly focusing on its subareas such as environment sound classification (ESC), music genre classification (MGC), and speech emotion recognition (SER). Few works attend to the uniform model for these specific tasks, which can increase the expenditure on the hardware devices in the application scenarios. Hence, how to explore a general-purpose approach to improve sound classification performance remains a challenge for researchers.

In recent years, deep learning approaches are drawing more attention in sound classification tasks. Due to its strong learning ability and excellent generalizability to extract task-specific hierarchical feature representations from large quantities of training data, the deep neural network (DNN) has shown impressive performance in the research of automatic speech recognition and music information retrieval [1, 2]. Compared with traditional machine learning classifiers [3], the DNN could capture features from the raw data and obtain impressive results. Han et al. [4] propose to train a DNN model using segments with the highest energy in one utterance as inputs to discover valid segment-level feature information. However, the DNN’s in-depth fully connected architecture is not robust for feature conversion. Several latest studies have revealed that the CNN is powerful in investigating potential relationships between adjacent frames. CNN involves a group of interleaved convolutional layers and pooling layers followed by a certain number of fully connected layers. It shares weights by using locally connected filters, which makes inputs translation invariant, and these convolution filters have interpretable time and frequency significance for the audio spectrogram. Mao et al. [5] utilize the CNN to learn affect-salient feature representations from local invariant features that are preprocessed by a sparse autoencoder for the speech emotion recognition task and achieve stable performance on several benchmark datasets. Zhang et al. [6] improved the model effectiveness by using structural adjustment of the CNN and newly generated training data with the mixup method. Medhat et al. [7] have found that the enforced systematic sparseness of embedded filter banks within the model could facilitate exploring the nature of audio. Audio signals could convey contextual information in the temporal domain, which means the audio information at the current time step is relevant to that at the previous time steps. Hence, it is beneficial to apply the RNN and LSTM to capture time-dependent feature representation in sound classification tasks [8]. As an advanced version of RNN architecture, LSTM has a built-in storage gate to retain temporal context-related information. It is suitable for LSTM to extract chronological characteristics from input sequences. Sainath et al. [9] have demonstrated that LSTM outperforms the conventional RNN techniques in handling large-scale training data. Nevertheless, LSTM’s sequential nature means increasing the training time since larger sequence lengths require more training steps. Furthermore, the structures of RNN and LSTM have to suffer from a big problem of long-range dependency disappearance, which can affect the performance stability of models in real-world application scenarios.

The attention mechanism has become a hot topic in the sequence-to-sequence field in recent times. As a type of attention mechanism, self-attention could learn inherent spatial connections between each frame in a sequence to capture the hierarchical structure of the whole sequence. Transformer [10] is a sequence transduction model based entirely on the self-attention mechanism and has shown excellent performance in the field of natural language processing (NLP). Compared with traditional RNN networks, transformer replaces the recurrent structure with multihead attention to process all frames of the sequence in parallel efficiently, which is able to solve the problem of long-range dependency disappearing and greatly outperforms LSTM in NLP, such as pretraining language models [11, 12] and the end-to-end speech recognition approach [13]. Moreover, the high training efficiency of the transformer also facilitates the development of models for various real-world scenarios. Since audio signals express their semantic content relying on the implicit relations between their components in sparse positions, the multihead attention mechanism is more suitable for sound classification tasks. As a classifier, the support vector machine (SVM) has shown great advantages in dealing with the small sample, nonlinear, and high-dimensional pattern recognition. It transforms the original samples by nonlinear mapping algorithm from low-dimensional feature space to high-dimensional feature space or even infinite-dimensional feature space (Hilbert space) and uses a radial kernel function to construct the optimal hyperplane in high-dimensional space to make these samples linearly separable. With respect to the small-size and high-dimension sound datasets, this paper proposes an architecture based on two stages: MhaNN, a variant of transformer, where the salient feature representations are extracted, and a second stage based on SVM where we obtain the final classification results on the basis of the extracted feature representations. In order to show that SVM is more suitable for our task, we also report the experimental results of two other classifiers, K-Nearest Neighbor (KNN) and Logistic Regression (LR). The idea of KNN is to assign the most frequent category of the K-nearest samples to that sample. LR is to create regression equations for decision boundaries based on training data and then map the regression equations to classification functions to realize the classification. Experiments showed that the proposed MhaNN-SVM achieved the strongest results across three acoustically characterized datasets and demonstrates its generalization capability.

The summarized contributions of the proposed framework are demonstrated as follows:(1)Leveraging theories from the multihead attention and SVM, we proposed a sound classification model based on the multihead attention and SVM, which was able to extract the spatiotemporal hierarchical feature representations from inputs through the multihead attention and map these representations into high-dimensional feature space by SVM to further improve classification results. To demonstrate the suitability of SVM for our task, we also tested the classification performances of KNN and LR as final classifiers separately.(2)To enhance the comparability of our model, we chose two types of features as inputs: mel-spectrograms and the 68-dimensional feature set, both of which are most commonly used in sound classification fields, and also investigate the impacts of the different numbers of layers or heads to the model.(3)We compared the proposed model with other top-performing methods on the UrbanSound8K, GTZAN, and IEMOCAP dataset individually to access its generalization capability. Experiment results showed the simplicity and the robustness of our model and demonstrated its applicability for the actual application.

The remainder of the paper is organized as follows: Section 2 reviews some related work in the field of sound classification. Section 3 describes the details of the proposed model. The evaluation of the model is provided in Sections 4, and Section 5 presents our conclusions.

Most works on the abovementioned architectures are commonly restricted to a single audio classification task, and a few studies have focused on a uniform framework to solve different audio classification problems. To compare with our experiments on the same datasets, we review some task-independent models.

Medhat et al. [14] suggest a binary-masked CNN model that adopts a controlled systematic sparseness such as embedding a filterbank-like behavior within the network to preserve the spatial locality of features in the process of training weights. The experiments on GTZAN and UrbanSound8K have achieved the accuracies of 85.1% and 74.2%, respectively. In their model, they introduce a set of hidden layers in which each neuron establishes a connection with the input by the activation function through the influence of distinct active regions in the feature vector, so the spatial information of the learned feature is saved as these active weights’ locations which are fixed in position. In order to exploit the interframe relations in a sequential signal, the network allows a concurrent exploration of a range of feature combinations to find the optimum combination of features through an exhaustive manual search. This causes the increased complexity and training difficulties, especially when the number of features increased substantially.

Palanisamy et al. [15] present an ensembled Dense CNN architecture with five identical dense blocks. They pretrain the model with a larger size of image dataset ImageNet for the problems of MRC and ESC on the basis of transfer learning knowledge and then initialize the network with these pretrained weights instead of random settings, which brings 90.50% accuracy on the GTZAN dataset and 87.42% accuracy on the UrbanSound8K dataset. This model can solve the problem of poor prediction results caused by the insufficient number of samples. However, its performance is based on consistent data distribution between the pretraining dataset and the target dataset, which limits the broad exploitation of the model for real application scenarios.

Choi et al. [16] provide a transfer learning approach. They design a CNN network with five convolutional layers for a source task on a training dataset and then use the network with trained weights as a feature extractor for target tasks. The network achieves the accuracies of 89.8% on GTZAN and 69.1% on UrbanSound8K. Since the authors choose the Million Song dataset to pretrain their model, the experimental results of GTZAN outperform UrbanSound8K, which illustrates the importance of data distribution consistency between the pretraining dataset and the target dataset.

Transformer was first presented in 2017, and its related studies mainly concentrate on the area of text processing. Relevant to the sound classification on the same datasets used in our experiments is the multimodal transformer [17]; Delbrouck et al. induce a multimodality fusion method based on LSTM and transformer for emotion recognition and sentiment analysis. The method takes mel-spectrograms as input to feed into LSTM, where the linguistic features and acoustic features are extracted independently. Then, these two types of features are fused into a new multimodel representation in the transformer encoder through multihead attention modulation and linear modulation. The experiment on IEMOCAP yields a 74% precision rate that outperforms the single-modal approach, which shows the multihead attention is adept at handling sequential data with different characteristics. Because the multimodal model only has strong effects for some specific types of classification tasks, when encountering large-scale datasets, its internal circular structure will reduce the efficiency of model operations, which is not conducive to the terminal deployment of the model.

3. Proposed Model

In the literature of sound classification, mel-spectrograms and mel-spectrogram-related feature sets have been broadly applied as acoustic features in many deep learning models and shown their powerful performance. In this paper, two types of spectrograms were used as features to be fed into the model, respectively. We applied Librosa [18] to extract the mel-spectrograms, named Feature 1, with 128 mel-filters banks ranging from 0 to 22050 Hz. The PyAudioAnalysis library by Giannakopoulos [19] was used to produce a 68-dimensional feature vector, namely, Feature 2, listed in detail in the following:(i)3 time domains: zero-crossing rate, energy, and entropy of energy(ii)5 spectral domains: spectral centroid, spectral entropy, spectral flux, spectral roll-off, and spectral spread(iii)13 MFCCs(iv)13 chroma: 12-dimensional chroma vector and standard deviation of chroma vector(v)34 first-order differential features: first-order differential of the abovementioned 34 features

The proposed feature-extracted model MhaNN was composed of a stack of L identical blocks with their own set of training parameters. Each of these contained a multihead self-attention layer and a feed-forward layer. Around each of these two layers, the residual connection [20] was used and followed by layer normalization [21]. To facilitate residual connections, all layers within the model produced the output with dimension dmodel = 256. The structure of MhaNN is described in Figure 1.

In MhaNN, the attention value is the outcome of a Key K, Query Q, and Value that interact together, as shown in the following:where Query Q, Key K, and Value are input matrices in case of self-attention. represents similarities between Q and K. is a scaling factor. In the multihead attention mechanism, a feature space needs to be evenly segmented into m slices, each of which corresponds to one subspace. The multihead attention is to conduct m self-attention functions simultaneously for learning the information from these different subspaces at different positions. Its function is listed as follows:where , , , and . In the feed-forward layer, we adopted a multilayer perceptron that consists of two linear projections and a ReLU activation. The hyperparameters in the fully connected layer were set as follows: dropout was 0.5, and the number of neurons was 64. The cross entropy was taken as the loss function. In SVM, we employed the one-versus-one method and choose the radial basis function, in which the penalty coefficient of C is set to 1 and the parameter is set to 0.01.

The schematic diagram of our model is shown in Figure 2. Firstly, we entered the target training dataset into MhaNN and extracted features in its global average pooling layer to input into the SVM for training. Then, we applied the trained MhaNN extractor to capture features from the target testing dataset and fed these features into the trained SVM classifier for final results.

4. Experiment Evaluation

4.1. Datasets

In this paper, we evaluated the proposed MhaNN-SVM on three publicly available datasets, respectively, including UrbanSound8K, GTZAN, and IEMOCAP. Their classes and the number of samples contained in each class are given in Table 1.

UrbanSound8K [22]: this dataset is widely used as a benchmark in environmental sound classification. It includes 8732 short (less than 4 seconds) audio clips derived from the live sound recordings of the FreeSound online archive. These clips are pregrouped into ten folders, each containing one of the ten possible urban sound sources.

GTZAN [23]: it is a very popular dataset in the research of music genre classification. It includes 1000 audio excerpts that are collected from radio compact disks and MP3 compressed audio files. Each excerpt is 30 seconds long and annotated with one of the ten music genres.

IEMOCAP [24]: this dataset is designed for the task of multimodal speech emotion recognition. It is composed of twelve hours of audio-visual recordings made by ten professional actors in the form of conversations between two actors of different genders performing scripts or improvising. These collected recordings are divided into short utterances of length between 3 to 15 seconds, each utterance labeled as one of ten emotions (neutral, happiness, sadness, anger, surprise, fear, disgust, frustration, excited, and other). In this paper, we only use audio data. For the consistent comparison with previous works [2527], all utterances annotated “excited” are grouped into the happiness category with that annotated “happiness,” and only four emotion categories (neutral, happiness, sadness, and anger) are considered. As shown in Table 1, the imbalanced distribution of the IEMOCAP makes a great challenge to the classification task.

4.2. Training and Other Details

To minimize the cross entropy, we employed the ADAM optimizer to train the model on three public datasets for 400 epochs with a batch size of 64, respectively. The 10-fold cross validation was implemented to access the proposed MhaNN-SVM. In all experiments, the training set, test set, and validation set were randomly divided at the ratio of 8/1/1, and we calculated the classification accuracy as the average of 10-fold cross validation. Our experiments were carried out on the Tensorflow2.0 framework, running on NVIDIA TITAN Xp GPU with 16 GB memory. In the following, we analysed the performance of our model on each dataset in more detail.

For optimizing the parameters of the network, we first tested the MhaNN-SVM with the different number of layers, namely, L, and heads on these two types of features individually to evaluate the impacts of parameters and feature type on the classification results. We also attempted two other classifiers LR and KNN to verify the suitability of SVM for processing high-dimensional data. Last, we chose the best accuracy to compare with other competitive methods.

4.3. Experiment and Discussion on UrbanSound8K

Environment sound classification (ESC) has gained considerable attention in recent times. ESC aims at capturing a variety of natural and artificial sounds, rather than speech and music, presented in an acoustic scene by audio sensors, and classifying these sounds into predefined categories.

4.3.1. Preprocessing

As introduced in Section 3, the preprocessing was implemented with the frame size of 50 ms at a rate of 25 ms to obtain Feature 1 with the size of 128 × 173 and Feature 2 with 68 × 173. These two spectrograms are depicted in Figure 3.

4.3.2. Results and Analysis

We fed the two types of spectrograms into the model separately to yield sixty classification accuracies. As listed in Table 2, we can observe that SVM was significantly more effective in improving MhaNN classification results than KNN and LR. Feature 1 made a larger contribution to the improvement of accuracy than Feature 2, no matter how many L and heads were set, which indicated that mel-spectrograms can enhance the predictive capability of the model effectively in ESC tasks. In particular, when we set 4 to head and 3 to L, the accuracy can reach 94.6%, the highest in all of the results. Figure 4 showed the confusion matrix and t-SNE plot. In this confusion matrix, the rows stood for true labels, the columns for predicted labels, and values along the diagonal represent the number of samples classified correctly for each category, while those nondiagonal terms denoted incorrect classification. We can find that the proposed model distinguished most environmental sound categories effectively, but it was difficult to separate drilling from street music. One explanation was that they shared a lot of similar frequency information, which made it more difficult to discriminate them in nature.

Table 3 illustrated the detailed information about precision, recall, and F-score for each category. Overall, most of the categories had been correctly classified, and the F-scores of AI, EN, and JA were able to reach more than 96%. Even though the number of CA and GU training samples were both small, their recognition accuracies were considerably high, based on which we can assume that the proposed model had outstanding classification capability on the UrbanSound8K dataset.

To evaluate the performance of the presented model, we compared the MhaNN-SVM with five other top-performing methods on the UrbanSound8K in Table 4. DenseNet [28] exploited a dense connectivity pattern between layers in a feed-forward pattern to realize feature reuse, which led to high convergence efficiency. MelNet [29] and RawNet [29] both used five convolutional layers with batch normalization and a fully connected layer, but took mel-spectrograms and raw audio waves as input, respectively. GoogleNet [30] applied a deep convolutional neural network that is designed for image recognition to the ESC field, achieving a 93% accuracy rate with a large number of parameters up to 6.7 M. 1D CNN [31] was an end-to-end environmental sound recognition model initialized with Gammatone filter banks at the first convolutional layer. It utilized convolutional layers to capture temporal information from raw waveforms. From these results, it can be found that our model could yield 94.6% accuracy on the UrbanSound8K dataset, surpassing the compared models.

4.4. Experiments and Discussion on GTZAN

Automatic music classification could provide convenience for music storage and improve the efficiency of music information retrieval systems. The challenges involved in the MGC are typically related to a variety of attributes, such as instruments, melody, timbre, the mixing of two or more genres in the same piece of music, and even human voices appearing in the music, all of which need to be considered for decision making.

4.4.1. Preprocessing

First, we segmented each audio excerpt in the dataset evenly into four parts and then used two different tool libraries to transform each part with 50 ms frame size at a rate of 25 ms into two types of spectrograms separately, Feature 1 and Feature 2. Their own sizes were 128 × 300 and 68 × 300, respectively, visualized as Figure 5.

4.4.2. Results and Analysis

The spectrograms extracted from the preceding step were input into the model individually, and all experimental results are listed in Table 5 by adjusting the different numbers of layers and heads. From these results, we can find that MhaNN-SVM had the highest classification result, while LR had no boosting effect on MhaNN. Meanwhile, the performance of Feature 1 was overall better than that of Feature 2, which indicated that mel-spectrogram can improve the accuracy of music genre classification effectively, especially when the head was set to 4 and L to 2, and the model can obtain the best result of 88.4% whose experimental description is shown in Figure 6.

Table 6 lists the precision, recall rate, and F-score for each class. The proposed model recognized most classes, but it was difficult to discriminate rock from country and disco. One reason was that they may share more similar frequency information, which made it more challenging to categorize them in nature. In general, most genres can be correctly recognized, with blues, classic, metal, and hip-hop all achieving over 90% precision rates.

In Table 7, we made comparisons between the MhaNN-SVM with existing models. Hybrid [32] and CVAF [33] investigated different feature fusion strategies to improve classification accuracy, with results of 88.3% and 90.9%, respectively. Choi et al. [16] applied a transfer learning framework to the MGC task and obtained an accuracy of 89.8%. Freitag et al. [34] proposed an AuDeep approach based on the recurrent sequence-to-sequence autoencoder that can extract the characteristics of time series data from their temporal dynamics, resulting in an accuracy of 85.4%. Liu et al. [35] employed a CNN-based BBNN method to extract the low-level information from spectrograms of audio signals for the long context involved in recognition decisions and achieved 93.3% accuracy.

4.5. Experiment and Discussion on IEMOCAP

Human emotions have been playing an essential role in human communication. Speech emotion recognition refers to the analysis of changes of emotion hidden in human conversations, by extracting the relevant features from the speech to feed into the neural network for classification, to identify the possible emotional changes of the speaker.

4.5.1. Preprocessing

All files in the dataset went through the preprocessing step stated in Section 3, with settings of 50 ms frame length and 25 ms frame shift size. As a result, Feature 1 had a size of 128 × 310 and Feature 2 had 68 × 310, as portrayed in Figure 7.

4.5.2. Results and Analysis

Two types of spectrograms produced from the preprocessing were input into the network separately and yielded a variety of classification accuracies under different quantities of head and layers. Table 8 showed all the experimental results. The performance of Feature2 outperformed that of Feature1 overall. In terms of impact on MhaNN classification results, SVM was able to improve, LR had no effect, and KNN even decreased, especially in the Feature2 group. One reason is that LR is a linear classification model, and KNN is not able to estimate the prediction error statistically, which leads to great fluctuations in the results. This indicated that compared to environmental events and music genres, we need more kinds of features to make accurate judgments on the emotion. The high dimension of the data and the uneven distribution of data onto categories are not suitable for LR and KNN classifiers. When the model parameters were set as head = 4 and L = 1, the accuracy can reach 62.8%, the highest of all the results. We visualized the experiment results of 62.8% in Figure 8.

Table 9 shows the precision, recall, and F-score for each category. Due to the few training samples of the sadness category in the training set, the model can only learn limited discriminative information, so it could misclassify some sad samples into the neutral or happy label. Lower precision and recall rate on the sad label could weaken the ability of the model to identify the sad emotion. On the whole, the precision rates of other labels were all over 60%, including 68% for the angry category, which indicated that MhaNN-SVM had better recognition performance on an unbalanced dataset.

In Table 10, the accuracy of MhaNN-SVM was close to the best available results on the IEMOCAP dataset. Haytham et al. [36] provided an RNN-based SER technique that used mel-spectrograms to train the model at a large computing cost. Gu et al. [37] introduced a multimodal architecture based on hierarchical attention to fuse text and audio at word level for emotion classification. The HSF-CRNN [38] was a two-channel SER system that combined handcrafted HSFs and CRNN-learning spectrograms to jointly extract emotion-related feature representations for SER. Reference [39] proposed an architecture of a parallel multilayer CNN stacked on the LSTM to capture multiple temporal-contextual interactions. Lakomkin et al. [40] presented a progressively trained neural network based on transfer knowledge of automatic speech recognition with SER.

5. Conclusions

In this paper, we presented a method of combining the multihead attention mechanism with SVM to replace the recurrent framework most commonly used in sound recognition models. Three benchmarked datasets (UrbanSound8K, GTZAN, and IEMOCAP) were employed to validate its applicability to different acoustically characterized fields. In order to highlight the advantages of SVM as a classifier, we also compare SVM with two other classifiers, LR and KNN. Experiment results demonstrated that the proposed method could bring outstanding improvements in classification accuracy. However, we mainly focused on the generality of the model and did not perform feature fusion and model parameter setting according to the characteristics of each dataset. In the future, we will fully explore the characteristics of different audio datasets and design the task-specific feature representation and model parameters to further evaluate the performance of the transformer.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.