Abstract

The popularity of the Internet has driven the rapid development of artificial intelligence, affective computing, the Internet of Things (IoT), and other technologies. In particular, the development of the IoT provides more references for the realization of the smart home. However, once people have achieved a certain degree of material satisfaction, they are more likely to seek emotional communication. Music carries a great deal of emotional information: it is an important medium of communication between people and an effective way to convey emotion. It has therefore become one of the most convenient and natural interaction modes expected in intelligent human-computer interaction. Traditional music emotion recognition methods suffer from demerits such as a low recognition rate and high time consumption. Hence, we propose a generative adversarial network (GAN) model based on intelligent data analytics for music emotion recognition under the IoT. Driven by a double-channel fusion strategy, the GAN can effectively extract the local and global features of an image or voice signal. Meanwhile, to increase the feature difference between emotional voices, the feature data matrix of the Mel-frequency cepstral coefficients (MFCCs) of the music signals is transformed to improve the expression ability of the GAN. The experimental results show that the proposed model can effectively recognize music emotion. Compared with other state-of-the-art approaches, the error recognition rate of the proposed music emotion recognition method is greatly reduced, and its accuracy exceeds 87%, higher than that of the other methods.

1. Introduction

The 21st century is a new era of information technology, encompassing smart cities, 4G communication technology, low-carbon technology, the Internet of Things (IoT) [1], 3D display, augmented reality (AR), cloud computing, human vaccine technology, energy-efficient motor systems, and combustible-ice mining technology. Among these, IoT technology has naturally become a hot topic in the scientific community. Industry experts believe that, on the one hand, the IoT can improve the quality of people's life and work and change the way people live. On the other hand, the IoT is an information technology revolution that will drive the huge development of related industries, promote the progress of science and technology, and provide technological power for the recovery of the global economy.

Artificial emotion is a research field that simulates, identifies, and understands human emotional processes by means of information science, so that machines can generate human-like emotions and interact with humans naturally and harmoniously [2]. At present, research on artificial emotion mainly covers two related fields: affective computing and Kansei engineering. Artificial psychology uses the means of information science to simulate human emotional activities; its purpose is to study emotion, cognition, and motivation at the general psychological level of artificial machine realization. In this paper, we focus on the study of music emotion.

Music expresses the feelings of the composer and lyricist at the time of its creation. It is closely related to emotion and conveys messages that are difficult to quantify. With the development of the Internet, music plays an important role in people's lives. People pay more and more attention to the emotional characteristics of music, and music emotion has begun to be applied to music retrieval and music recommendation [3-5].

Music emotion recognition refers to recognizing high-level emotional states from low-level music features, which can be regarded as a classification problem over music sequences. The main processes include establishing the emotion database, extracting the phonetic emotion features, reducing dimensionality and selecting features, and classifying and recognizing emotion. Many music emotion recognition methods have achieved good results, such as the hidden Markov model (HMM) [6], artificial neural network (ANN) [7], Gaussian mixture model (GMM) [8], support vector machine (SVM) [9], K-nearest neighbor (KNN) [10], and maximum-likelihood Bayesian classification [11, 12]. However, the research objects (languages) differ and there is no unified standard for corpus databases, so the recognition results differ greatly from each other.

SVM and KNN methods are often used in models with high certainty, while human emotions are complex and uncertain; their music emotion recognition performance is therefore poor. A neural network is a typical nondeterministic model with nonlinear input-output mapping, strong generalization ability [13], and self-learning, self-organization, and self-adaptation abilities. It has unique advantages in dealing with uncertain and nonlinear mapping problems. Among neural network models, the convolutional neural network (CNN) [14, 15] is a multilayer feed-forward network that is the most widely used and successful in pattern recognition. For example, Wal et al. [16] proposed a new emotion recognition method based on deep spatio-temporal analysis of facial geometric features. Nantasri et al. [17] investigated the possibility of using the mean values of MFCCs and their derivatives to create a new set of informative features. Maheshwari et al. [18] proposed a rhythm-specific multichannel convolutional neural network (CNN)-based approach for automated emotion recognition using multichannel EEG signals. Rajapakshe et al. [19] introduced a novel policy called the "Zeta policy," tailored for speech emotion recognition, and applied pretraining in deep reinforcement learning to achieve a faster learning rate; pretraining with a cross dataset was also studied to assess the feasibility of pretraining the reinforcement learning agent with a similar dataset when no real environmental data is available. Since human emotions carry complex and uncertain information, the recognition rate of voice emotion based on convolutional neural networks is still not high. In order to increase the feature difference between emotional music information, this paper proposes a generative adversarial network (GAN) model with a double-channel fusion strategy based on intelligent data analytics. The main contributions are as follows:

(1) A generative adversarial network (GAN) model based on intelligent data analytics for music emotion recognition under the IoT is proposed.

(2) Driven by the double-channel fusion strategy, the GAN can effectively extract the local and global features of an image or voice signal.

(3) To increase the feature difference between emotional voices, the feature data matrix of the Mel-frequency cepstral coefficients of the music signals is transformed to improve the expression ability of the GAN.

(4) The experimental results show that the proposed model can effectively recognize music emotion.

The remainder of this paper is organized as follows. Music emotion feature extraction is introduced in Section 2. Section 3 reviews GAN, and Section 4 presents the proposed double-channel attention mechanism. Section 5 presents the experiments, and Section 6 concludes this paper.

2. Music Emotion Feature Extraction

Traditional music emotion feature extraction methods extract the pitch frequency, amplitude, energy, speed, and formant parameters of the music signal through global analysis. The timing and distribution characteristics of these parameters are analyzed to find the rhythm rules in different emotional sounds, which can serve as a basis for emotion recognition. In [20], a 40-dimensional emotional feature vector A is constructed by extracting the relevant emotional features of music signals.

From the viewpoint of signal analysis, a music signal is composed of many overlapping frequency components, so spectral feature analysis of the signal is also conducive to emotion recognition research. MFCC is an algorithm based on the auditory characteristics of the human ear, which uses a nonlinear frequency unit (the Mel frequency) to simulate the human auditory system. In recent years, related research has applied MFCCs to music recognition [21]. This paper also adopts the MFCC feature extraction method to extract music emotion features.

At present, the traditional MFCC feature extraction method takes 256 sampling points as one frame and 160 sampling points as the frame shift, with a coefficient order of 12. The energy, first-order difference, and second-order difference of each frame are calculated, and the average value of each frame's coefficients is further computed, so a 40-dimensional coefficient vector is obtained for each frame [22]. Since the number of sampling points differs across voice samples, the frame numbers are inconsistent. To obtain a uniform frame number, the following two feature extraction methods are studied in this paper (a code sketch of Method 2 follows the list):

Method 1. The common feature extraction scheme is to extract feature data directly from each music sample. Although the number of extracted feature frames is not uniform, most samples fall between 140 and 170 frames. To retain the features of each piece of music, the feature data of each sample is cut to 160 frames, and the feature data is then converted into an 80 × 80 matrix as the input of the convolutional neural network.

Method 2. The data extracted by Method 1 is huge, which leads to a long training time. To reduce the feature dimension, the sampling points of each sample are first uniformly adjusted to 7136. Then, the MFCC coefficients are extracted to obtain unified 40-frame feature data. The feature data is converted into a 40 × 40 matrix A1 as the input of the convolutional neural network.
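A minimal sketch of Method 2 is given below, assuming the librosa library. The exact composition of the 40 coefficients per frame (here 13 MFCCs, their first- and second-order differences, and the frame energy) and the 8 kHz sampling rate are our assumptions, since the paper does not spell them out:

```python
import librosa
import numpy as np

def extract_a1(path, sr=8000, n_samples=7136):
    """Build the 40 x 40 feature matrix A1 of Method 2 for one music sample."""
    y, sr = librosa.load(path, sr=sr)
    y = librosa.util.fix_length(y, size=n_samples)          # unify the sample count
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=40,
                                n_fft=256, hop_length=160)  # 256-point frames, shift 160
    d1 = librosa.feature.delta(mfcc)                        # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)               # second-order difference
    energy = librosa.feature.rms(y=y, frame_length=256, hop_length=160)
    feats = np.vstack([mfcc, d1, d2, energy])               # 40 coefficients per frame
    return feats[:, :40].T                                  # keep 40 frames -> A1 (40 x 40)
```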

Extensive comparison experiments on the above two methods show that Method 2 requires less training time and achieves a higher recognition rate. This paper also examines the effect of normalizing A1; the comparison shows that the recognition rate of the normalized A1 is not improved and is unstable. Therefore, the original A1 is used as the input of the neural network in this paper. The feature vectors A and A1 are used as inputs of the convolutional neural network model, respectively, and the experiments show that the error recognition rate is lower when MFCC features are extracted.

3. GAN

GAN was proposed by Goodfellow in 2014 [23]. The framework consists of two subnetworks: a generator G and a discriminator D. Their respective functions are mapping random noise to the sample distribution and discriminating real samples from generated samples. Different from other generative models, GAN adopts an adversarial approach: it first learns the difference between the real samples and the generated samples through D and then guides G to generate false samples closer to the real sample distribution, adopting alternate training to continuously reduce the difference. At present, GAN mainly optimizes the following minimax loss function to achieve a Nash equilibrium:

$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))],$$

where z is a latent variable drawn from a distribution p(z) such as Gaussian noise or a uniform distribution. The generator G and discriminator D have their own loss functions. During training, G and D update their respective parameters to minimize their own losses; G and D cannot update each other's parameters but rely on the adversary's parameters to update their own. Radford et al. [24] proposed the DCGAN (deep convolutional generative adversarial network) framework in 2016 and applied convolutional neural networks (CNNs) to GAN for the first time. Since then, the generator G and discriminator D have usually adopted the CNN model. Deep learning-based generative adversarial networks have achieved great success in the field of image generation; as a unique image synthesis technology, GAN is widely used in image generation. GAN has the following advantages:

(1) It can train an unconditional generator with only random noise as input.

(2) It is a new technique for data transfer between different domains and an effective method for unsupervised image conversion between domains.

(3) It is a new optimization method and provides an effective image perception loss function.
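To make the alternating update concrete, here is a minimal PyTorch sketch of one training step (our illustration, not the paper's architecture); the toy networks and the 1600-dimensional sample size, a flattened 40 × 40 feature matrix, are assumptions:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 1600))   # toy generator
D = nn.Sequential(nn.Linear(1600, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())                        # toy discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                                # real: (batch, 1600) feature matrices
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)
    # Update D: push D(real) toward 1 and D(G(z)) toward 0.
    fake = G(torch.randn(batch, 100)).detach()       # detach blocks gradients into G
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Update G: push D(G(z)) toward 1, using only D's feedback.
    loss_g = bce(D(G(torch.randn(batch, 100))), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```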

Although GAN has made great progress and can generate convincing images, some problems remain to be solved:

(1) The training process of GAN is extremely unstable, and the network is very sensitive to hyperparameters, so it is difficult to reach a Nash equilibrium.

(2) GAN often exhibits mode collapse, so the model covers only a part of the real distribution rather than the whole object distribution.

(3) GAN cannot capture the structure and geometric shapes in some categories of images.

Most existing works are devoted to optimizing the training process of GAN, and some works focus on changing its objective function. For example, the least squares loss replaced the cross-entropy loss in the LSGAN (least squares generative adversarial network) method, which not only improved the stability of training but also shortened the training time. Other works focus on gradient penalties or constraining the gradient of D to ensure that D can provide an effective gradient for G. The WGAN (Wasserstein generative adversarial network) model [25] restricted D to satisfy the Lipschitz constraint, which greatly improved the stability of the network. Although WGAN satisfied the Lipschitz constraint, it directly restricted the parameter matrix, which destroyed the structure of the parameter matrix. To solve this problem, a new regularization technique was introduced in [25], which satisfies the Lipschitz constraint without destroying the structure of the parameter matrix.

Additionally, some references modify the GAN framework. EBGAN (energy-based generative adversarial network) was the first framework to introduce the energy model into GAN [26]. It regarded D as an energy model and adopted an auto-encoder structure: real samples were assigned low energy and generated fake samples high energy, and by reducing the reconstruction error of the generated samples, they gradually approached the real sample distribution. ProGANs (progressive generative adversarial networks) [27] trained a high-resolution GAN by progressively growing G and D: training starts with low-resolution images, and resolution is gradually improved by adding layers to the networks. This training method first captures the large-scale structure of the image distribution and then shifts attention to finer and finer details, instead of learning all scales at once. However, it only produced good results on single-category images. By reinforcing the connection between local and global locations of feature maps, SAGANs (self-attention generative adversarial networks) [28] attempted to generate high-quality images on multicategory images, but they ignored the connection between the channels of feature maps.

In view of the fact that GAN cannot capture features in certain categories of music, this paper proposes a GAN model based on a double-channel attention mechanism, which can effectively capture the feature distribution of music by adaptively learning the dependence between local and global features, thereby improving music emotion recognition accuracy.

4. Double-Channel Attention Mechanism

Although GAN has made great progress in this field, it is difficult to train the model on larger data sets, and it cannot capture geometric features that occur many times in some classes. The reason is that current models rely heavily on convolution to model the dependence between different regions of an image. Because of the local receptive field of convolution, dependencies between large regions can only be obtained through multiple convolution operations. As shown in Figure 1, three 3 × 3 convolutional layers are required to relate features across a 7 × 7 receptive field. However, the optimization algorithm may struggle to coordinate so many convolutional layers, and adding more convolutional layers captures few additional dependencies. If the size of the convolution kernel is enlarged, for example to 7 × 7, the feature dependence within a 7 × 7 receptive field can be obtained by a single convolutional layer. However, this is not only less effective than stacking several small filters but also greatly increases the computational burden. Therefore, it is difficult to capture long-range dependencies in images through convolutional layers alone.
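The receptive-field arithmetic behind this claim: for n stacked stride-1 convolutions with kernel size k, the receptive field is

$$r = n(k - 1) + 1, \qquad \text{so for } n = 3,\ k = 3:\ r = 3 \times 2 + 1 = 7,$$

which matches the 7 × 7 receptive field of Figure 1.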

To solve the problem that CNNs cannot effectively capture music features, some scholars introduced an attention model into GAN to make up for the deficiency of the CNN framework. The essence of the attention model is to emphasize or select the important information of the object through attention distribution coefficients (weight coefficients) and to suppress irrelevant details. The attention mechanism can flexibly capture the relations between local and global features, improve the representation ability of the model, and reduce the model complexity. Therefore, to improve the recognition rate of music emotion, this paper proposes a GAN framework based on a double-channel attention mechanism (DCGAN) and introduces two different attention models, a feature attention model and a channel attention model, which capture feature dependencies in the feature space and across channels, respectively.

4.1. Feature Attention Model

To add the dependency information between local features and global features to the feature map, a feature attention model is introduced. This model enhances the representation capability by encoding extensive global spatial information and adding it to the local feature information. The specific framework is shown in Figure 2, where C denotes the number of channels of the feature map and H and W denote its height and width, respectively.

First, the feature map X of the previous layer is transformed by 1 × 1 convolutions into three feature spaces, R, S, and T, with C/8, C/8, and C channels, respectively. Matrix multiplication is performed between the transpose of feature space R and S, and a softmax is applied to obtain the parameters of the feature attention layer:

$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}, \qquad s_{ij} = R(x_i)^{\mathsf{T}} S(x_j),$$

where $\beta_{j,i}$ represents the influence of the feature at the i-th position on the feature at the j-th position and N is the number of spatial positions; the more similar the features of two positions are, the stronger their correlation. Then, the feature map of feature attention is obtained by matrix multiplication of the feature space T with the transpose of the feature attention layer.
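The following is a minimal PyTorch sketch of this feature attention model, following the SAGAN-style formulation that the description matches; the class and variable names are ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAttention(nn.Module):
    """Add global spatial dependencies to each local position of the feature map."""
    def __init__(self, c):
        super().__init__()
        self.R = nn.Conv2d(c, c // 8, 1)   # 1x1 conv into C/8-channel space R
        self.S = nn.Conv2d(c, c // 8, 1)   # 1x1 conv into C/8-channel space S
        self.T = nn.Conv2d(c, c, 1)        # 1x1 conv into C-channel space T
    def forward(self, x):                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        r = self.R(x).flatten(2)           # (B, C/8, N) with N = H*W
        s = self.S(x).flatten(2)
        attn = F.softmax(r.transpose(1, 2) @ s, dim=1)  # beta_{j,i}, normalized over i
        t = self.T(x).flatten(2)           # (B, C, N)
        out = t @ attn                     # each position j mixes features from all i
        return out.view(b, c, h, w)
```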

4.2. Channel Attention Mechanism

In a feature map, each channel can be regarded as responding to a specific class, and different channels are related to each other. Therefore, a channel attention model is proposed to extract the dependencies between different channels. Its framework is shown in Figure 3.

Feature attention needs to convolve the feature map X, whereas channel attention uses the feature map X directly to calculate the parameters of the channel attention layer; the calculation process is otherwise similar:

$$x_{mn} = \frac{\exp(X_m \cdot X_n)}{\sum_{n=1}^{C} \exp(X_m \cdot X_n)},$$

where $x_{mn}$ is the influence of the n-th channel on the m-th channel; the closer the features of two channels are, the greater the dependency between them [29]. In addition, matrix multiplication is performed between the channel attention layer and the transpose of the input feature space X, and the channel attention feature map is finally output.
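A matching PyTorch sketch of the channel attention model follows (again with our naming, following the DANet-style formulation the description suggests):

```python
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Reweight channels by pairwise dependencies computed from X itself (no convs)."""
    def forward(self, x):                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        xf = x.flatten(2)                    # (B, C, N) with N = H*W
        energy = xf @ xf.transpose(1, 2)     # (B, C, C): dot products X_m . X_n
        attn = F.softmax(energy, dim=-1)     # x_{mn}: influence of channel n on m
        out = attn @ xf                      # mix channels according to x_{mn}
        return out.view(b, c, h, w)
```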

4.3. Double-Channel Attention Model

Figure 4 shows the double-channel attention model. The input feature map X is fused with the output P of the feature attention model and the output Q of the channel attention model to obtain a feature space carrying both local-global feature dependence information and class dependence information:

$$Y = \alpha P + \beta Q + X,$$

where $\alpha$ and $\beta$ are the learnable weights of P and Q, respectively; they are initialized to 0 and updated through back propagation. During network training, the two attention models start from simple feature dependencies and gradually learn more complex ones, so $\alpha$ and $\beta$ gradually increase, and the weighted feature maps learned by the attention modules are added to the original feature map. In this way, the feature maps that require attention are emphasized. In the high-level layers of G and D, the double-channel attention mechanism acts as an auxiliary structure of GAN, cascaded after the CNN [29]. Figure 5 shows the training flow chart of the proposed network, where CNN denotes the convolution operation and DC denotes the introduced double-channel attention mechanism. Through continuous alternating training of G and D, G generates more and more realistic images.
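Combining the two sketches above, the fusion step of Figure 4 can be written as follows (our reading of the paper, with $\alpha$ and $\beta$ as learnable scalars initialized to zero):

```python
import torch
import torch.nn as nn

class DoubleChannelAttention(nn.Module):
    """Fuse feature attention (P) and channel attention (Q) with the input X."""
    def __init__(self, c):
        super().__init__()
        self.feat = FeatureAttention(c)             # sketched in Section 4.1
        self.chan = ChannelAttention()              # sketched in Section 4.2
        self.alpha = nn.Parameter(torch.zeros(1))   # weight of P, starts at 0
        self.beta = nn.Parameter(torch.zeros(1))    # weight of Q, starts at 0
    def forward(self, x):
        p = self.feat(x)                            # local/global feature dependencies
        q = self.chan(x)                            # inter-channel (class) dependencies
        return self.alpha * p + self.beta * q + x   # Y = alpha*P + beta*Q + X
```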

5. Experiments and Analysis

The experimental songs mainly come from the mood song lists recommended by various music websites, such as Kuwo Music Box and BaiDu Heartlisten. A total of 637 songs are used in this experiment, among which 445 pop songs (215 happy and 230 sad) are used to train the music emotion classifier and 192 pop songs (93 happy and 99 sad) are used as test songs [30]. In this paper, music emotion is divided into sad, happy, quiet, lonely, and longing [31]. The extracted music features are used to construct the training and testing samples. In these experiments, we also compare with CLSTM [32], RNN [33], and HTG [34].

First, the feature attention, channel attention, and double-channel attention models are compared; the results are shown in Table 1, and Figure 6 shows the visualization results.

From Table 1 and Figure 6, we can see that the double-channel attention model achieves 93.4% accuracy, which is 7.1% and 5.6% higher than that of feature attention and channel attention, respectively. However, the running time of the double-channel attention model is 1.2 s, slightly longer than that of feature attention and channel attention because of the combination of the two channels. Table 2 shows the comparison with different methods.

Table 2 shows that the proposed method achieves the best results among all methods; the bold values in Table 2 are the best values. In particular, the recognition rates for sad and happy exceed 90%, because the double channels are utilized to extract both local and global features. The error rates are shown in Table 3.

As can be seen from the comparison of error recognition rates in Table 3, although A and B begin to converge at the 700-th iteration, their convergence is not stable, whereas the proposed model is already stable at the 500-th iteration. Because the two channel attention mechanisms are adopted, the weights of the convolution kernels gradually stabilize as the number of iterations increases, the recognition rate of the model converges steadily, and the error recognition rate decreases significantly. Doubling the input feature data better reflects the differences between music emotion features, so the error recognition rate is lower and the convergence is faster.

6. Conclusions

To solve the problem that traditional CNNs cannot effectively extract the dependencies between music features and between different categories, a generative adversarial network model based on a double-channel attention mechanism is proposed, comprising feature attention and channel attention. Driven by the attention mechanism, the two submodels model the dependencies between local and global features and the dependencies between classes, respectively, and then realize the task of generating the music feature library. The results show that the proposed framework obtains music feature information more comprehensively than other frameworks and improves the ability of music emotion recognition.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.