Abstract

With the rapid development of information technology and communication, digital music has grown explosively. To retrieve the music that users want quickly and accurately from a huge music repository, music feature extraction and classification are considered an important part of music information retrieval and have become a research hotspot in recent years. Traditional music classification approaches use a large number of hand-designed acoustic features. Designing such features requires knowledge and in-depth understanding of the music domain, and the features designed for different classification tasks are often neither universal nor comprehensive. Existing approaches have two shortcomings: manually extracted features cannot guarantee validity and accuracy, and traditional machine learning classifiers neither perform well on multiclass problems nor scale to training on large data sets. Therefore, this paper converts the audio signal of music into a sound spectrum as a unified representation, avoiding the problem of manual feature selection. According to the characteristics of the sound spectrum, the research combines 1D convolution, a gating mechanism, residual connections, and an attention mechanism and proposes a music feature extraction and classification model based on a convolutional neural network, which can extract sound spectrum characteristics that are more relevant to the music category. Finally, this paper designs comparison and ablation experiments. The experimental results show that this approach is better than traditional manual models and machine learning-based approaches.

1. Introduction

With the rapid development of multimedia and digital technologies [1–3], there are more and more digital music resources on the Internet, and consumers’ music consumption habits have shifted from physical music to online music platforms. Massive music resources and huge online music libraries lead users to generate a variety of complex music retrieval needs; at a certain moment, a user may be eager to listen to a certain genre or a song with a certain emotion. In such cases, music labels are essential to the quality of music retrieval. In addition to music retrieval, many recommendation and subscription scenarios also require music category information to provide users with more accurate content [4, 5].

Music is diverse: it is an art form made of elements such as melody, rhythm, and harmony combined according to certain rules. Understanding music of different forms often requires background knowledge, which cannot serve as a music classification standard, so almost all music media platforms use text labels as the basis for music classification and retrieval. Music labels are text descriptors that express high-level musical properties, such as “happy” and “sad” to express emotions and “electronic” and “blues” to express musical styles [6, 7].

Music genre classification [8–10] is an important branch of music information retrieval. Correct music classification is of great significance for improving the efficiency of music information retrieval. At present, music classification mainly includes text-based classification and classification based on music content. Text-based classification relies mainly on music metadata, such as singer, lyrics, songwriter, era, track name, and other labeled text information. This method is easy to implement, simple to operate, and fast to retrieve, but its shortcomings are also obvious. First, it relies on manually labeled music data, which requires a lot of manpower, and manual labeling inevitably introduces labeling errors. Second, this text-based method does not involve the audio data of the music itself. Audio data contain many key characteristics of music, such as pitch, timbre, and melody, which are almost impossible to describe with text labels. Content-based classification, in contrast, extracts features from the original music data and uses the extracted features to train a classifier, thereby achieving music classification. Content-based music classification has therefore also become a research hotspot in recent years, and it is the research direction of this article [11].

The emergence of deep learning has brought music classification technology into a new period of development. Deep learning has been widely used in image processing, speech recognition, and other fields, and its performance on many tasks surpasses traditional machine learning methods. Scholars have also begun to apply deep learning to problems in the field of music information retrieval, so it is necessary to study music classification methods based on deep learning to improve the effect of music classification [12]. The main innovation points of this paper are as follows:
(i) The paper converts the audio signal of music into a sound spectrum as a unified representation, avoiding the problem of manual feature selection.
(ii) It combines 1D convolution, a gating mechanism, residual connections, and an attention mechanism and proposes a music feature extraction and classification model based on a convolutional neural network, which can extract sound spectrum features that are more closely related to the music category.
(iii) Sufficient comparison and ablation experiments have been carried out. The experimental results prove the effectiveness and superiority of our algorithm, which surpasses several other well-known methods.

The organization of the paper is as follows. Section 2 depicts the background knowledge of the proposed study. The methodology of the paper is presented in Section 3, with details in the subsections. Experiments and results are presented in Section 4. The paper is concluded in Section 5.

2. Background

As a very important component in the field of music information retrieval, music feature extraction and classification have been widely studied since the 1990s. In 1995, Benyamin Matityaho and Furst [13] proposed a method for frequency-domain analysis of music signals: a fast Fourier transform is first performed on the audio data, then a logarithmic scale transformation, and the resulting data are used as features. Training was done in a neural network [14–17] containing two hidden layers, and two music genres were finally identified: classical and pop music. In 2002, Tzanetakis and Cook [18] systematically proposed dividing the characteristics of music into three feature sets, namely, timbre texture features, rhythm content features, and tonal content features; the authors adopted the Gaussian mixture model and the K-nearest neighbor method as classifiers. It is worth mentioning that, because music genres are numerous, there was no relatively fixed classification standard in academia before this. Since Tzanetakis's groundbreaking research, the ten music genres contained in the GTZAN data set used by George Tzanetakis have become the generally recognized classification standard in the field of music information retrieval.

As George Tzanetakis's research laid much of the foundation, later scholars in the field of automatic music genre recognition have mainly focused on two aspects: on the one hand, improving the selection of music features and the dimensionality of the feature vectors; on the other hand, improving the choice of classification algorithm. The extraction of music features is a very critical part of music genre recognition. If the extracted features cannot represent the essential characteristics of music, the classification results will undoubtedly be poor. Scaringella et al. [19] divided music signal characteristics into three categories: pitch, timbre, and rhythm. At present, the commonly used characteristics of music signals mainly include short-term zero-crossing rate, short-term energy, linear prediction coefficients, spectral flux, Mel-frequency cepstral coefficients, spectral centroid, and spectral contrast. Since these characteristics span both the time domain and the frequency domain, they can reflect the musical perception characteristics of pitch, rhythm, timbre, and loudness to some extent. The process of music feature extraction generally first performs frame processing on the original audio signal, then carries out calculations based on the statistical definition of each feature, and finally feeds the results, in vector form, to the classifier as training data. Because music feature extraction is based on music signal analysis, current audio-based music signal analysis techniques mainly include time-domain and frequency-domain methods. Time-domain analysis examines the waveform of the music signal along the time dimension. Frequency-domain analysis converts the time-domain music signal into the frequency domain through the Fourier transform, from which many useful features can be obtained, for example, Mel-frequency cepstral coefficients, spectral centroid, pitch frequency, subband energy, and the spectrogram. Literature [20] concatenates Mel-frequency cepstral coefficients with perceptual characteristics such as pitch frequency, spectral centroid, and subband energy to form a high-dimensional feature vector. For the classification algorithm, traditional machine learning methods are mainly used, such as support vector machines, Gaussian mixture models, decision trees, nearest neighbors, hidden Markov models, and artificial neural networks [21–24]. In addition, there are some improvements to the above algorithms; for example, literature [25] adds a genetic algorithm to the Gaussian mixture model, which improves classification accuracy according to the experimental results.

3. Methodology

3.1. Music Signal Features
3.1.1. Spectral Centroid

The spectral centroid is a metric used to characterize the frequency spectrum in digital signal processing. It indicates where the “centroid” of the frequency spectrum is located and is perceptually related to the brightness of the sound. Generally speaking, the smaller its value, the more the energy is concentrated in the low-frequency range. Since the spectral centroid reflects the brightness of the sound well, it is widely used in digital audio and music signal processing as a measure of the timbre of music. Its mathematical definition is as follows:

$$C_t = \frac{\sum_{n=1}^{N} n\, M_t[n]}{\sum_{n=1}^{N} M_t[n]},$$

where $M_t[n]$ represents the magnitude of the Fourier transform of the $t$-th frame at frequency bin $n$.
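As an illustration only, the following minimal Python sketch computes the spectral centroid with librosa and, equivalently, directly from the magnitude spectrogram. The file name is a hypothetical placeholder; librosa weights by frequency in Hz rather than by bin index, which differs only by a constant scale from the formula above.

```python
import librosa
import numpy as np

# "example.wav" is a hypothetical placeholder file.
y, sr = librosa.load("example.wav", sr=None, mono=True)

# Frame-wise spectral centroid computed by librosa.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # shape: (1, n_frames)

# Equivalent manual computation: magnitude-weighted mean frequency per frame.
S = np.abs(librosa.stft(y))                  # magnitude of the Fourier transform
freqs = librosa.fft_frequencies(sr=sr)       # frequency of each FFT bin
manual = (freqs[:, None] * S).sum(axis=0) / (S.sum(axis=0) + 1e-10)
```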

3.1.2. Spectral Flux

The spectral flux is generally a measure of the rate of change of the signal spectrum. It is calculated by comparing the spectrum of the current frame with the spectrum of the previous frame. More precisely, it is usually calculated as the 2-norm of the difference between two normalized spectra. Since the spectra are normalized, the spectral flux calculated in this way does not depend on the phase; only the magnitudes are compared. Spectral flux is generally used to characterize the timbre of an audio signal or to detect note onsets. Its mathematical definition is as follows:

$$F_t = \sum_{n=1}^{N} \left( N_t[n] - N_{t-1}[n] \right)^2,$$

where $N_t[n]$ is the normalized magnitude of the Fourier transform of the $t$-th frame at frequency bin $n$.
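A minimal sketch of frame-to-frame spectral flux computed from a normalized magnitude spectrogram; the file name is a placeholder and this is an illustration, not the paper's exact implementation.

```python
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=None, mono=True)   # hypothetical file
S = np.abs(librosa.stft(y))                                # (freq_bins, n_frames)

# Normalize each frame's magnitude spectrum so flux does not depend on overall level.
S_norm = S / (np.linalg.norm(S, axis=0, keepdims=True) + 1e-10)

# Spectral flux: squared 2-norm of the difference between consecutive frames.
flux = np.sum(np.diff(S_norm, axis=1) ** 2, axis=0)        # shape: (n_frames - 1,)
```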

3.1.3. Spectral Contrast

Spectral contrast is a feature used to classify music genres. Spectral contrast is expressed as the difference in decibels between peaks and valleys in the frequency spectrum, which can represent the relative spectral characteristics of music. It can be seen from the experimental results of the literature [26] that the spectral contrast has a good ability to discriminate music genres.
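For illustration, librosa provides a ready-made spectral contrast feature (the peak-valley difference per sub-band, in dB); the file name below is a hypothetical placeholder.

```python
import librosa

y, sr = librosa.load("example.wav", sr=None, mono=True)    # hypothetical file

# Spectral contrast per sub-band and frame; the default is 6 bands plus one residual row.
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)   # shape: (7, n_frames)
```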

3.1.4. Mel-Scale Frequency Cepstral Coefficients

Since the cochlea has filtering characteristics (as shown in Figure 1), different frequencies are mapped to different positions of the basilar membrane, so the cochlea is often regarded as a filter bank. Based on this property, psychologists obtained a set of filter banks that mimic the cochlear effect through psychoacoustic experiments, namely, the Mel frequency filter bank. Since the pitch level perceived by the human ear is not linearly related to frequency, researchers proposed a new concept called Mel frequency, whose scale is more in line with the auditory characteristics of the human ear. The relationship between Mel frequency and linear frequency is as follows:

$$f_{\mathrm{Mel}} = 2595 \log_{10}\!\left( 1 + \frac{f}{700} \right),$$

where $f_{\mathrm{Mel}}$ is the converted Mel frequency and $f$ is the linear frequency in Hz.
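As a quick check of the Hz-to-Mel conversion reconstructed above (the 2595 and 700 constants are the commonly used HTK-style values, assumed here), the mapping can be written directly:

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style conversion: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

print(hz_to_mel([100.0, 1000.0, 8000.0]))   # 1 kHz maps to roughly 1000 mel
```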

Firstly, the audio signal is divided into frames, preemphasized, and windowed, and then the short-time Fourier transform (STFT) is performed to obtain its frequency spectrum. Secondly, a Mel filter bank with L channels is set on the Mel frequency axis. The value of L is determined by the highest frequency of the signal, generally 12 to 16, and the Mel filters are equally spaced on the Mel frequency axis. Let $o(l)$, $c(l)$, and $h(l)$ be the lower limit frequency, center frequency, and upper limit frequency of the $l$-th triangular filter, respectively; then, the three frequencies of adjacent triangular filters are related as follows:

$$c(l) = h(l-1) = o(l+1).$$

Passing the linear amplitude spectrum $|X(k)|$ of the signal through the Mel filters gives the output of the $l$-th filter:

$$m(l) = \sum_{k=o(l)}^{h(l)} W_l(k)\, |X(k)|, \quad l = 1, 2, \ldots, L.$$

The frequency response of the $l$-th triangular filter is

$$W_l(k) = \begin{cases} \dfrac{k - o(l)}{c(l) - o(l)}, & o(l) \le k \le c(l), \\ \dfrac{h(l) - k}{h(l) - c(l)}, & c(l) < k \le h(l). \end{cases}$$

Taking the natural logarithm of the filter outputs and then applying the discrete cosine transform yields the MFCC:

$$c_{\mathrm{MFCC}}(i) = \sqrt{\frac{2}{L}} \sum_{l=1}^{L} \ln m(l) \cos\!\left( \frac{\pi i}{L} \left( l - \frac{1}{2} \right) \right), \quad i = 1, 2, \ldots, Q,$$

where $Q$ is the number of cepstral coefficients.
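The whole pipeline (framing, windowing, STFT, Mel filtering, logarithm, and discrete cosine transform) is available in librosa; note that librosa uses a log-power spectrogram in dB and the DCT-II, which differs in constants from the natural-logarithm form above. The parameter values below (n_mfcc, n_mels) are illustrative assumptions.

```python
import librosa

y, sr = librosa.load("example.wav", sr=16000, mono=True)   # hypothetical file

# Framing, windowing, STFT, Mel filtering, log, and DCT are performed internally.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=256, n_mels=40)
print(mfcc.shape)   # (13, n_frames)
```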

3.2. 1D Residual Gated Convolutional Neural Model
3.2.1. Selection of Convolution Kernel

Convolutional neural networks can well identify potential patterns in the data. By superimposing convolution kernels to perform repeated convolution operations, more abstract features can be obtained in the deep layers of the network. One-dimensional convolution is often used to deal with problems related to time series. Unlike two-dimensional convolution that attempts to convolve in multiple directions, one-dimensional convolution focuses more on capturing the translation invariance of data features in a specific direction. When dealing with time-related data, this direction is often the direction of time change. One-dimensional convolution is often used to analyze time series or sensor data and is suitable for signal data analysis within a fixed period of time, such as audio signals.

Figure 2 shows the convolution process of one-dimensional and two-dimensional convolution on the sound spectrum. It can be seen that the receptive field of the one-dimensional convolution kernel covers the entire frequency range of the sound spectrum, and the convolution is performed only along the time axis; it can capture the percussive components of the instruments appearing in the sound spectrum, their overtones, and other musical elements. Unlike one-dimensional convolution, which convolves only in the time direction, two-dimensional convolution performs convolution in both the time and frequency dimensions and can extract specific frequency patterns within a certain time range, such as the rise and fall of pitch. In the field of music classification, many models use two-dimensional convolution as the basic convolution structure of the convolutional network.

However, the temporal perception of two-dimensional convolution is not as good as that of one-dimensional convolution, its receptive field in the frequency range is not as broad, and one-dimensional convolutional neural networks have lower computational complexity. In addition, two-dimensional convolution also convolves along the frequency dimension of the sound spectrum, which is difficult to interpret for sound signals. Therefore, the model in this article uses one-dimensional convolution as the basic convolution structure, which is more consistent with the fact that the audio signal unfolds in time and has less correlation across the frequency range.
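The difference between the two convolution types on a Mel sound spectrum can be made concrete with a short PyTorch sketch; the kernel sizes and channel counts are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# A batch of Mel sound spectra: (batch, n_mels, time), with 128 Mel bins as in Section 4.1.
spec = torch.randn(8, 128, 313)

# 1D convolution: the kernel spans the whole frequency axis (in_channels = n_mels)
# and slides only along the time axis.
conv1d = nn.Conv1d(in_channels=128, out_channels=64, kernel_size=3, padding=1)
out1d = conv1d(spec)                    # (8, 64, 313)

# 2D convolution: the same spectrum treated as a 1-channel image; the kernel
# slides over both time and frequency.
conv2d = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, padding=1)
out2d = conv2d(spec.unsqueeze(1))       # (8, 64, 128, 313)
```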

The essential difference between one-dimensional and two-dimensional convolution lies in the translation direction; the calculation itself is not essentially different. Although the original audio signal is a time series, after it is converted into a sound spectrum its representation is similar to a single-channel grayscale image, so the convolution can be expressed by the following equation:

$$y_{i,j} = f\!\left( \sum_{u=1}^{k_w} \sum_{v=1}^{k_h} w_{u,v}\, x_{i+u-1,\, j+v-1} + b \right),$$

where $y$ is the output feature map, whose size is given by its width $W_{\mathrm{out}}$ and height $H_{\mathrm{out}}$, $f$ is the activation function used by the convolution layer, $k_w$ is the width of the convolution kernel, $k_h$ is the height of the convolution kernel, $b$ is the bias of the convolution, and $w$ and $x$ represent the weight matrix of the convolution kernel and the data input, respectively. In the one-dimensional convolution operation based on the sound spectrum, $k_h$ and the frequency range $F$ of the sound spectrum have the following relationship:

$$k_h = F.$$

That is, the height of the convolution kernel in one-dimensional convolution is equal to the frequency range of the sound spectrum, so the receptive field of the convolution kernel covers the entire frequency axis and can capture a specific frequency pattern. The convolution operation can then be expressed as

$$y_{i} = f\!\left( \sum_{u=1}^{k_w} \sum_{v=1}^{F} w_{u,v}\, x_{i+u-1,\, v} + b \right).$$

Assuming that the output of the convolution kernel is $Y$ and the bias matrix is $B$, the convolution operation can be simply expressed as

$$Y = f(W * X + B).$$

The width $W_{\mathrm{out}}$ of $Y$ can be obtained by the following formula:

$$W_{\mathrm{out}} = W_{\mathrm{in}} - k_w + 2P + 1,$$

where $W_{\mathrm{in}}$ represents the length of the sound spectrum on the time axis, that is, the width of the sound spectrum, $P$ represents the size of the padding, and $k_w$ represents the width of the convolution kernel. Since the one-dimensional convolution only translates along the time dimension of the sound spectrum, the height of the output feature map is as follows:

$$H_{\mathrm{out}} = 1.$$

In other words, $H_{\mathrm{out}}$ has nothing to do with the frequency range of the sound spectrum or the height of the convolution kernel. After one-dimensional convolution, the two-dimensional sound spectrum is reduced to a one-dimensional feature map, unlike the output of two-dimensional convolution; the size of the feature map after one-dimensional convolution is therefore also reduced.
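A quick numerical check of the width formula under the assumed stride of 1: with $W_{\mathrm{in}} = 313$, $k_w = 3$, and $P = 1$, the output width is $313 - 3 + 2 + 1 = 313$, which PyTorch confirms.

```python
import torch
import torch.nn as nn

spec = torch.randn(1, 128, 313)   # (batch, n_mels, time)
conv = nn.Conv1d(in_channels=128, out_channels=32, kernel_size=3, padding=1)
print(conv(spec).shape)           # torch.Size([1, 32, 313]): W_out = 313 - 3 + 2*1 + 1
```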

3.2.2. Gated Linear Units

Assuming that the sound spectrum sequence to be processed is $X$ and the output of the gated convolution unit is $h(X)$, the gated linear unit can be expressed as

$$h(X) = \mathrm{Conv1D}_1(X) \otimes \sigma\!\left( \mathrm{Conv1D}_2(X) \right).$$

$\mathrm{Conv1D}_1$ and $\mathrm{Conv1D}_2$ in the above formula represent two one-dimensional convolutions of identical structure whose weights are not shared, $\otimes$ represents element-wise multiplication, and $\sigma$ represents the Sigmoid activation function. The result of one of the two convolutions is activated by the Sigmoid function, the other is left without an activation function, and the two are then multiplied element by element, with the Sigmoid branch acting as the gate. Formally, this is equivalent to adding a “valve” to each output of the one-dimensional convolution to control the flow of information. The convolution-based gating mechanism differs from the complex gating mechanism in the LSTM network: it needs no forget gate, only an input gate, which also makes network models based on gated convolution units train faster than LSTM.

Figure 3 shows the basic structure of the one-dimensional gated convolution unit and the data flow inside it. The input of the convolution unit passes through two convolutions of identical structure; the output of one 1D convolution kernel additionally undergoes a Sigmoid activation and is then multiplied element by element with the output of the other convolution kernel to produce the output of this layer.
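A minimal PyTorch sketch of the one-dimensional gated convolution unit described above: two convolutions of identical structure with unshared weights, one passed through a Sigmoid and used as the gate. Channel and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Gated linear unit over 1D convolutions: Conv1D_1(x) * sigmoid(Conv1D_2(x))."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding)
        self.gate = nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding)

    def forward(self, x):
        # The Sigmoid branch acts as a "valve" on the linear branch, element by element.
        return self.conv(x) * torch.sigmoid(self.gate(x))

out = GatedConv1d(128, 64)(torch.randn(8, 128, 313))   # (8, 64, 313)
```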

3.2.3. Residual Connection

The entire convolutional neural network can be regarded as a process of information extraction: the more layers the network has, the stronger its ability to progress from low-level features to highly abstract features. When the network is deepened, the model is more likely to discover high-level abstract features related to music categories. However, increasing the depth of the network too much causes vanishing and exploding gradient problems. The usual remedies are careful weight initialization and intermediate normalization layers, but then the network degradation problem arises: when the network begins to degenerate, the accuracy on the training set decreases as the number of layers increases. This problem is essentially different from overfitting, which still shows excellent results on the training set.

The basic residual module is shown in Figure 4. The residual structure has an additional identity mapping branch, so that when increasing the depth of the network no longer improves its performance, the network can effectively skip the useless layers and pass on the output of the preceding layer directly. The calculation of the residual structure is as follows:

$$y = F(x, \{W_i\}) + x,$$

where $x$ is the input of the residual block, $F(x, \{W_i\})$ is the residual mapping learned by the stacked layers, and $y$ is the output of the block.
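A sketch of a residual gated convolution block combining the two ideas, assuming equal input and output channel counts so the shortcut can be a pure identity mapping; otherwise a 1x1 convolution on the shortcut would be needed. The configuration is illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResidualGatedBlock(nn.Module):
    """y = F(x) + x, where F is a gated 1D convolution (illustrative configuration)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.gate = nn.Conv1d(channels, channels, kernel_size, padding=padding)

    def forward(self, x):
        residual = self.conv(x) * torch.sigmoid(self.gate(x))   # gated convolution F(x)
        return residual + x                                      # identity skip connection

y = ResidualGatedBlock(128)(torch.randn(8, 128, 313))   # (8, 128, 313)
```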

3.3. Music Feature Extraction and Classification Model

The model in this paper can be divided into GLU stacking layer, global pooling feature aggregation layer, and fully connected layer from input to output. The overall structure of the network is shown in Figure 5.

To make full use of the statistical information of the pooling layer, the model in this paper combines global maximum pooling and global average pooling to form a global pooling feature aggregation layer. The feature maps obtained from the GLU block stacking layer undergo global average pooling and global maximum pooling to obtain average pooling statistics and maximum pooling statistics, respectively; the results of both pooling operations are one-dimensional. In Figure 5, two rectangular blocks of different colors represent these two one-dimensional features, and the two pooled statistical features are concatenated and fed into the next fully connected layer.
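A minimal sketch of the global pooling feature aggregation layer followed by the fully connected classifier: average and maximum pooling statistics over the time axis are concatenated. Layer sizes are assumptions, and the stacked GLU layers that would precede this module are omitted.

```python
import torch
import torch.nn as nn

class GlobalPoolAggregation(nn.Module):
    """Concatenate global average and global max pooling statistics, then classify."""
    def __init__(self, channels, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(2 * channels, num_classes)

    def forward(self, feat):               # feat: (batch, channels, time) from the GLU stack
        avg = feat.mean(dim=2)             # global average pooling -> (batch, channels)
        mx, _ = feat.max(dim=2)            # global maximum pooling -> (batch, channels)
        return self.fc(torch.cat([avg, mx], dim=1))

logits = GlobalPoolAggregation(channels=128)(torch.randn(8, 128, 313))   # (8, 10)
```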

4. Experiments and Results

4.1. Experimental Setup

Because the multiple channels of the original audio carry redundant information, all audio is converted to mono and downsampled to a sampling rate of 16 kHz. The Fourier transform window length used when computing the Mel sound spectrum is 512, the hop size is 256, and the number of Mel frequency bins is 128. Each original audio sample is segmented into slices of 5 seconds with an overlap rate of 50%. The Mel sound spectrum of a single slice generated with the above settings has size (313, 128), and each audio sample produces 11 slices of the same size.
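A sketch of this preprocessing with librosa, following the stated settings (mono, 16 kHz, window 512, hop 256, 128 Mel bins, 5-second slices with 50% overlap); the dB conversion and the exact slicing boundaries are assumptions of this illustration, and the file name is a hypothetical GTZAN example.

```python
import librosa
import numpy as np

SR, N_FFT, HOP, N_MELS = 16000, 512, 256, 128
SLICE_SEC, OVERLAP = 5.0, 0.5

def slice_mel(path):
    """Convert one audio file into overlapping (313, 128) log-Mel spectrogram slices."""
    y, _ = librosa.load(path, sr=SR, mono=True)            # mono, resampled to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
    mel = librosa.power_to_db(mel).T                        # (frames, 128)
    frames = int(SLICE_SEC * SR) // HOP + 1                 # 313 frames per 5 s slice
    step = int(frames * (1.0 - OVERLAP))                    # 50% overlap between slices
    return np.stack([mel[i:i + frames]
                     for i in range(0, mel.shape[0] - frames + 1, step)])

# slices = slice_mel("blues.00000.wav")   # hypothetical file; ~ (11, 313, 128) for a 30 s clip
```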

4.2. Data Set

The experiments in this paper use the GTZAN data set, which is widely used to verify the performance of music classification methods and is the most popular music classification data set. The GTZAN data set has 10 music genre categories (as shown in Table 1). Each genre category contains 100 audio samples, each sample lasts 30 seconds, and the sampling rate is 22050 Hz.

4.3. Evaluation Index

The classification accuracy (Acc) is selected as the evaluation index of the proposed music classification method. It is calculated as follows:

$$\mathrm{Acc} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}},$$

where $N_{\mathrm{correct}}$ is the number of correctly classified samples and $N_{\mathrm{total}}$ is the total number of samples.

4.4. Experimental Results

Different models learn different deep features and therefore produce different recognition results. To make a fair comparison, all experiments were implemented in the same environment with the same parameter settings, comparing the proposed model with SVM, CNN, GLU, RCNN, and RGLU.

The above five types of networks with different structures were tested with the same experimental settings, and the results are shown in Table 2 and Figure 6. The GLU network using the gated structure achieves higher accuracy than the ordinary convolutional network CNN, which indicates that the stacking of multiple gated convolutions used in the model in this paper learns the sound spectrum characteristics better than ordinary convolution. The gating structure makes the features passed to the next layer of the network focus on the sound spectrum features that are more important for the music classification task, while information irrelevant to the task is suppressed by the gating mechanism. The results of the comparison experiment thus verify the effectiveness of the gating structure for sound-spectrum-based music classification. From the perspective of information filtering, GLU can be regarded as another implementation of the attention mechanism: unlike the RGLU structure, which determines an attention weight for each feature map channel, GLU adaptively determines attention weights along the time dimension during network learning. This kind of gated structure, which adds attention in the time dimension and is combined with one-dimensional convolution along time, yields better performance. Compared with CNN and GLU without the residual structure, the accuracy of RCNN and RGLU with the added residual structure is improved, which shows that residual connections can improve classification accuracy to a certain extent. It is worth noting that the accuracy of RGLU improves over GLU, and the accuracy of RCNN is greater than that of CNN, which indicates that the combination of the residual structure and gated convolution is beneficial for the transmission of information in the network. Therefore, this experiment demonstrates the effectiveness and superiority of our algorithm.

4.5. Ablation Study of Global Pooling

This section will compare the classification performance of different pooling features and their combinations in the global pooling feature aggregation layer. We used three pooling methods to conduct experiments, and the experimental results are shown in Table 3.

Aggregating the two global pooling features enables the model to obtain a higher accuracy rate. Using the global average pooling feature alone is more accurate than using the global maximum pooling feature alone, which means that the overall statistical information in the sound spectrum feature map is more conducive to classification. The model in this paper combines the two types of global pooling features, which enables the fully connected layer to grasp more statistical information about the features abstracted by the convolutional layer and makes the classification performance of the model stronger.

5. Conclusion

Digital music has grown explosively with the continuing development of information technology and communication. Music feature extraction and classification are considered a significant part of music information retrieval. The design of features requires knowledge and in-depth understanding of the music domain, and the features designed for different classification tasks are often neither universal nor comprehensive. Traditional music classification approaches use a large number of hand-designed acoustic features, and it is difficult to extract music features effectively with manual design and traditional machine learning methods. Therefore, the contribution of this paper is to convert the audio signal of music into a sound spectrum as a unified representation, avoiding the problem of manual feature selection. According to the characteristics of the sound spectrum, and combining one-dimensional convolution, a gating mechanism, residual connections, and an attention mechanism, a music feature extraction and classification model based on a convolutional neural network is proposed, which can extract sound spectrum features that are more relevant to the music category. Finally, this paper designs comparison and ablation experiments. Experimental results show that this method is superior to traditional manual models and machine learning-based methods.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The author declares no conflicts of interest.