Advanced Pattern Recognition Systems for Multimedia Data
Active Learning Music Genre Classification Based on Support Vector Machine
The improved SVM (support vector machine) offers an active learning method that, over multiple iterations, presents the user with the most informative samples and adds them to the training set, which can significantly reduce the cost of manually labeling samples. To evaluate the classifier's performance, 801 music samples covering five genres (Dance, Lyric, Jazz, Folk, and Rock) were tested. The effectiveness of the proposed SVM active learning method was confirmed in two respects: the convergence speed together with the classification accuracy, and the number of samples that must be labeled to reach the same accuracy. The classification accuracy reached 81%. At the cost of a small loss in precision, both SVM active learning methods drastically reduce the number of samples that must be labeled, and the method proposed in this paper performs better. At the same time, the smaller the batch value, the fewer samples need to be labeled: more iterations allow the classifier to select the most informative sample points, while a larger batch value reduces the number of iterations. One can therefore trade off between the two according to the actual situation.
1. Introduction
Music is a common art form. With the wide application of emerging technologies such as the Internet, digital music has become not only more varied but also more efficiently produced and distributed. Because the threshold of creation has gradually fallen, more and more people choose to become creators, the number of works has grown explosively, and a large number of genres has emerged. It is difficult to distinguish different genres through a few salient features, but works of the same genre share similar features, and analyzing such features makes genre classification possible. Early genre classification was performed mainly by professionals, but manual work cannot cope with the huge number of classification tasks today. Therefore, it is necessary to overcome the difficulty of manual classification with the help of the automation advantage of computers, so as to improve the efficiency of genre classification [1, 2]. In order to make the music in network databases more accessible, the discipline of music information retrieval (MIR) emerged. Genre classification is an important application of music information retrieval and a basic step in music recommendation, content identification, and other applications.
The primary task of electronic music genre classification is to extract appropriate features: the features extracted from different genres of music indirectly represent those genres. Music data contains a huge amount of information, and the feature data is correspondingly complex; feature vectors can reach hundreds of dimensions. If all feature data are used for classification, the result will not be satisfactory, because alongside the relevant features there are many irrelevant and redundant ones, and uncorrelated features strongly degrade classifier performance. How to effectively select features for a specific classification task has therefore become an important problem in music classification research.
2. Literature Review
Yao and Chen analyzed music data in the time domain and frequency domain and then used a neural network with two hidden layers to classify classical and popular music. Wen et al. selected two measures, short-term energy and zero crossing rate, and used a Gaussian classifier to separate music and speech, with an accuracy of more than 90%. Ge et al. extracted audio features along the three directions of timbre, rhythm, and pitch frequency; trained a Gaussian mixture model and a K-nearest-neighbor classifier; and achieved a 61% recognition rate. In terms of feature selection, Das and Satpathy used DWCH (Daubechies wavelet coefficient histograms) as the classification feature and evaluated and compared classification algorithms such as the Gaussian mixture model, K-nearest-neighbor classifier, support vector machine (SVM), and linear discriminant analysis (LDA), improving the recognition rate to 79%. Daliman and Abdul Ghapar explored the relationship between timbre and rhythm features and classification performance, for example by changing the number of selected Mel cepstrum coefficient orders, adding high-order statistics, and introducing a covariance matrix, and combined a variety of feature sets with different classifiers to achieve good accuracy. Garrido-Arévalo et al. used a deep convolutional belief network to obtain trained network parameters, passed them as the initial parameters of a convolutional neural network, and applied the result to music retrieval tasks such as identifying composers and distinguishing genres. At the same time, because the theory of SVM is relatively mature and widely applied, active learning algorithms based on SVM have attracted considerable research. Zhang et al. successively proposed SVM active learning algorithms based on the distance between samples and the classification hyperplane and on version-space reduction, and achieved good classification results.
Sontayasara et al. proposed a simple incremental learning algorithm based on SVM, which retains only the support vector set as the historical training samples.
In this paper, starting from the traditional SVM active learning assumption that the closer a sample is to the classification hyperplane, the more information it contains, an improved SVM active learning method is proposed: when selecting samples, we should not only prefer those close to the classification hyperplane but also ensure the diversity of the selected samples. Through experimental comparison, the feasibility of the method is confirmed from two aspects: the convergence speed and classification accuracy, and the number of labeled samples. Figure 1 is the experimental flow chart of this paper.
3. Research Methods
3.1. Feature Extraction
In this paper, Mel frequency cepstrum coefficients (MFCCs), which perform well in music classification, and relative spectral-perceptual linear prediction (RASTA-PLP), which extends traditional PLP, are selected for feature extraction from the training samples. The MFCC part is 20-dimensional, the spectral coefficients in PLP are 21-dimensional, and the cepstral coefficients in PLP are 9-dimensional, 50 dimensions in total. The mean and variance of these 50-dimensional music features are then calculated, so that the feature vector of each music sample has 100 dimensions.
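As a concrete illustration of the feature assembly described above, the per-frame 50-dimensional features can be collapsed into one 100-dimensional vector per sample by concatenating the per-dimension means and variances. This is a minimal numpy sketch; the frame count and random data are placeholders, not the paper's actual extractor output.

```python
import numpy as np

def summarize_features(frame_features):
    """Collapse per-frame features (n_frames x 50) into one fixed-length
    vector: per-dimension mean (50 values) followed by variance (50)."""
    mean = frame_features.mean(axis=0)
    var = frame_features.var(axis=0)
    return np.concatenate([mean, var])

# Placeholder: 300 frames of 50-dimensional MFCC + RASTA-PLP features.
frames = np.random.default_rng(0).normal(size=(300, 50))
vector = summarize_features(frames)  # 100-dimensional sample vector
```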
3.1.1. MFCC (Mel Frequency Cepstrum Coefficient)
MFCC is an audio feature parameter based on human auditory characteristics. It is one of the most widely used feature parameters in automatic recognition and classification systems. To some extent, MFCC is closer to the principle of the human auditory system: it can simulate the human auditory model, and, in terms of music features, MFCC represents music signals more accurately than other short-term feature parameters. The calculation flow of MFCC is shown in Figure 2.
(1) Preemphasis, framing, and windowing. The original music signal is divided into frames and windowed to obtain single-frame music signals.
(2) Fast Fourier transform. The sampling rate used in this paper is 16 kHz, the window length is 32 ms, and the frame shift is 16 ms, so the window corresponds to N = 512 sampling points per frame. Performing a fast Fourier transform on the N points of a frame of the audio signal yields the spectrum of that frame:
X(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πnk/N}, k = 0, 1, …, N − 1.
After obtaining the spectrum, square its magnitude to obtain the energy spectrum.
(3) Designing the filter bank. The Mel filter bank consists of triangular filters defined on the Mel frequency scale; this paper chose M = 19. The center frequency of each triangular filter is distributed equally along the Mel frequency axis and widens as m increases along the linear frequency axis. The frequency response of the m-th triangular filter is
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m),
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)) for f(m) < k ≤ f(m+1),
H_m(k) = 0 otherwise,
where f(m) is the center frequency of the m-th triangular filter, computed as
f(m) = (N / f_s) · F_mel^{−1}( F_mel(f_l) + m · (F_mel(f_h) − F_mel(f_l)) / (M + 1) ),
where f_l and f_h are the lowest and highest frequencies of the triangular filter bank, respectively, N is the number of FFT points, M is the number of filters in the filter bank, f_s is the sampling frequency of the music signal, and F_mel^{−1} is the conversion from Mel frequency back to linear frequency:
F_mel^{−1}(m) = 700 · (10^{m/2595} − 1).
(4) Taking the logarithm of the output signal, in order to obtain a spectral estimate that is robust against evaluation error. After passing the short-term energy spectrum through the Mel filter bank, take the logarithm of the output:
S(m) = ln( Σ_{k} |X(k)|² H_m(k) ), m = 0, 1, …, M − 1.
(5) Discrete cosine transform of the logarithmic energies obtained in (4), yielding the cepstrum:
C(n) = Σ_{m=0}^{M−1} S(m) cos( πn(2m + 1) / (2M) ).
Then, the first 20 coefficients of each frame are taken as the MFCC feature parameters of the extracted music signal.
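The MFCC steps (1)–(5) above can be sketched in self-contained numpy, using the parameters stated in the text (16 kHz, N = 512, M = 19, 20 coefficients). The triangular-filter bin placement follows the common textbook construction and may differ in detail from the authors' implementation; note that with M = 19 filters only 19 DCT coefficients are informative, although the paper states 20 MFCC dimensions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=19, n_fft=512, sr=16000, fmin=0.0, fmax=8000.0):
    # Triangular filters whose centers are equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(frame, sr=16000, n_fft=512, n_filters=19, n_ceps=20):
    # (1) window, (2) FFT power spectrum, (3) Mel filter bank,
    # (4) logarithm, (5) DCT, keeping the first n_ceps coefficients.
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    energies = mel_filterbank(n_filters, n_fft, sr) @ spectrum
    log_e = np.log(energies + 1e-10)           # guard against log(0)
    m = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * m + 1)
                 / (2 * n_filters))
    return dct @ log_e

# One 32 ms frame of a 440 Hz tone at 16 kHz.
frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)
coeffs = mfcc(frame)
```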
3.2. RASTA-PLP (Relative Spectral-Perceptual Linear Prediction)
The traditional feature extraction of perceptual linear prediction (PLP) is based on the short-time spectrum, so spectral variation caused by the frame shift is carried into the feature parameters. RASTA (relative spectra) technology adds a bandpass filter to each frequency band during critical-band spectrum estimation in the traditional PLP feature extraction process, which effectively suppresses rapid spectral change between frames. The commonly used RASTA filters are log-RASTA and J-RASTA; the difference lies in the processing before and after the bandpass filtering of each critical-band spectrum. The log-RASTA filter is used in this paper, and the RASTA-PLP feature extraction process is shown in Figure 3.
(1) Preprocess the music signal, calculate the power spectrum, and perform critical-band analysis.
(2) Map the obtained critical-band power spectrum amplitudes nonlinearly, that is, take the logarithm, transforming the processing of the critical-band spectrum into the logarithmic domain. In this way, some multiplicative distortions in the frequency domain become additive and can be filtered out.
(3) The bandpass filter here is equivalent to an IIR filter, whose transfer function can be written as
H(z) = 0.1 · (2 + z^{−1} − z^{−3} − 2z^{−4}) / (z^{−4}(1 − ρz^{−1})),
where ρ is the filter's pole parameter.
(4) Apply the inverse nonlinear mapping to the filtered data, that is, exponential expansion.
(5) Equal-loudness preemphasis. Suppressing the low-frequency and high-frequency parts through the equal-loudness curve shifts emphasis to the 400–1200 Hz band, to which human hearing is more sensitive. The weight coefficient applied at the center frequency of the k-th critical-band spectrum reflects the sensitivity of the human ear to the perceived frequency.
(6) Amplitude cube-root compression (intensity-loudness transformation).
Since perceived loudness is a nonlinear function of sound intensity, the amplitude is raised to the power 0.33 (a cube-root compression) to simulate the power law of hearing.
(7) The linear prediction coefficients are solved with the all-pole model. The basic idea of applying linear prediction to a music signal is that successive samples are correlated, so one sample can be approximated by a linear combination of several preceding samples.
(8) Calculating the cepstrum. The cepstral features of the music signal are computed from the linear prediction coefficients obtained by LPC.
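The RASTA band-pass filtering in step (3) can be sketched as a direct IIR implementation. The numerator 0.1·[2, 1, 0, −1, −2] and a single pole are the classic log-RASTA choices; the paper elides the exact pole constant, so pole=0.98 here is an assumption, not the authors' stated value.

```python
import numpy as np

def rasta_filter(log_band, pole=0.98):
    """Band-pass filter one critical band's log-domain trajectory across
    frames: y[n] = sum_k b[k]*x[n-k] + pole*y[n-1], with
    b = 0.1*[2, 1, 0, -1, -2]. Because the b coefficients sum to zero,
    slowly varying (near-constant) spectral components are suppressed."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    x = np.asarray(log_band, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(5) if n - k >= 0)
        if n >= 1:
            acc += pole * y[n - 1]
        y[n] = acc
    return y

# A constant log-spectral component (e.g. a fixed channel distortion)
# decays toward zero after the initial transient.
steady = rasta_filter(np.ones(500))
```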
3.3. SVM and Active Learning
3.3.1. SVM Classifier
For the introduction of the support vector machine, there is a classic example, shown in Figure 4: on the horizontal axis, all points in the red segment between A and B are considered positive samples, while the points in the black segments on both sides are considered negative samples. Obviously, no linear function can correctly separate the two classes, but, as Figure 4(b) shows, a curve can divide them completely: the region above the curve is positive and the region below is negative. The principle of SVM is shown in Figure 5.
SVM has a good learning effect on both binary and multi-class classification problems. Taking the binary classification problem as an example, this section expounds the principle of SVM. Classification problems can be divided into two types: linearly separable and linearly non-separable. For the linearly separable problem, a training sample set {(x_i, y_i)}, i = 1, …, n, is given, with category labels y_i ∈ {−1, +1}. Let the dimension of the sample feature vectors be d, the number of samples be n, and the linear discriminant function be g(x) = w · x + b. Through normalization, the samples satisfy |g(x)| ≥ 1, with equality for the samples closest to the hyperplane. The classification margin is then 2/‖w‖, and whenever y_i(w · x_i + b) ≥ 1, i = 1, …, n, holds, the classifier labels all samples correctly. Increasing the margin obviously means reducing ‖w‖, so the optimal classification hyperplane must satisfy the above constraint while minimizing ‖w‖. The support vectors are the samples for which equality holds. In conclusion, solving for the optimal classification hyperplane is equivalent to solving the following constrained optimization problem:
min_{w,b} (1/2)‖w‖², s.t. y_i(w · x_i + b) ≥ 1, i = 1, …, n.
In this way, the solution of SVM is finally transformed into a quadratic programming problem, so, theoretically, the solution of SVM is the globally unique optimum. The Lagrange multiplier method gives
L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^{n} α_i [ y_i(w · x_i + b) − 1 ], α_i ≥ 0.
Because w is determined by the samples, it can be expressed as a combination of them, namely
w = Σ_{i=1}^{n} α_i y_i x_i.
Then, the Lagrangian is rewritten as its dual problem:
max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j (x_i · x_j), s.t. Σ_{i=1}^{n} α_i y_i = 0, α_i ≥ 0.
Here, (x_i · x_j) is the inner product of two vectors, and w and b are the parameters that determine the classification hyperplane.
The above is the SVM solution for the linearly separable problem. For nonlinear classification problems, SVM maps the data from the low-dimensional space to a high-dimensional space through a kernel function, so that the originally linearly non-separable problem becomes linearly separable. In this way, without increasing the computational complexity, the decision function becomes
f(x) = sgn( Σ_{i=1}^{n} α_i y_i K(x_i, x) + b ).
The commonly used kernel functions are as follows:
(1) Linear kernel: K(x, y) = x · y.
(2) Polynomial kernel: K(x, y) = (x · y + 1)^d.
(3) Sigmoid kernel: K(x, y) = tanh(η(x · y) + c).
(4) Radial basis (RBF) kernel: K(x, y) = exp(−‖x − y‖² / (2σ²)).
There is still no guiding principle for kernel selection; the effect of each kernel can only be judged from experimental results. Here, the RBF kernel, which gave the best experimental results, is selected, and its parameter is set to 8.
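The four common kernels can be sketched as plain numpy functions. The polynomial constant and degree and the sigmoid parameters are illustrative defaults; the paper sets the RBF parameter to 8 but does not say whether that value is γ or σ, so treating it as γ below is an assumption.

```python
import numpy as np

def linear_kernel(x, y):
    return float(np.dot(x, y))

def polynomial_kernel(x, y, d=3, c=1.0):
    # d and c are illustrative; the paper does not use this kernel.
    return float((np.dot(x, y) + c) ** d)

def sigmoid_kernel(x, y, a=0.01, c=0.0):
    return float(np.tanh(a * np.dot(x, y) + c))

def rbf_kernel(x, y, gamma=8.0):
    # gamma=8 mirrors the paper's "parameter set to 8" (an assumption
    # about which parameter that value refers to).
    diff = np.asarray(x) - np.asarray(y)
    return float(np.exp(-gamma * np.sum(diff ** 2)))
```

For any x, rbf_kernel(x, x) is exactly 1, which is what makes the kernel-based angle normalization in the active learning section straightforward.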
In addition, SVM was originally designed to solve binary classification problems, whereas practical applications usually require multi-class classification. To address this, there are two common schemes: "one versus one" and "one versus rest" (shown in Figures 6 and 7); the two are compared in Table 1. Here the "one versus rest" scheme, with its relatively small amount of computation, is selected, so that five classifiers correspond to the five kinds of music in the later experiment.
3.3.2. Active Learning
Active learning is a cyclic, iterative process. All candidate samples are initially unlabeled. Based on prior knowledge or random sampling, a small number of samples are selected from the candidates as the first batch and their classes are labeled, ensuring that this first batch contains at least one positive and one negative sample. This labeled set is used to train an initial classifier. Under this classifier, a selection strategy picks the most valuable samples for the classifier from the remaining candidates; their classes are labeled, they are added to the training set, the classifier is retrained, and the new classifier again selects the most valuable samples from the remaining candidates. These steps repeat until the candidate set is empty or a specified criterion is reached. A schematic diagram of active learning is shown in Figure 8.
The “most valuable” samples in active learning are actually those about which the classifier is most uncertain. The purpose of an SVM classifier is to find an optimal classification hyperplane that separates the two classes with the maximum margin. In each iteration, adding unlabeled samples that lie within the geometric margin to the training set is most likely to change the position of the new classifier's decision surface, while samples outside the margin have little effect on it. Therefore, the traditional SVM active learning method considers the sample points closest to the optimal classification hyperplane to be the most valuable. However, if this is the only criterion, repeated learning may occur: because sample sets are often large in practice, valuable samples must be selected in batches in each iteration, and the samples selected in one batch are likely to be highly correlated and therefore redundant, which leads to repeated learning. In other words, we hope that the sample set selected in each iteration is not only the most uncertain but also diverse, so as to correct the classification hyperplane as much as possible.
The diversity of samples can be measured by the angle between them. Mapping a sample point of the feature space into version space actually yields a hyperplane, so the angle between two samples x_i and x_j can be expressed as the angle between their corresponding hyperplanes h_i and h_j. In terms of the kernel function K, it can be written as
|cos∠(h_i, h_j)| = |K(x_i, x_j)| / sqrt( K(x_i, x_i) · K(x_j, x_j) ), (21)
where the denominator normalizes by the norms of the normal vectors of the hyperplanes corresponding to the two samples. Formula (21) gives the absolute value of the cosine of the angle between two samples: the smaller the value, the larger the angle between the two samples, that is, the smaller their correlation.
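Formula (21) needs only kernel evaluations, so the angle-based correlation can be sketched directly (the RBF parameter below follows the paper's value 8, under the assumption that the value is γ):

```python
import numpy as np

def rbf(x, y, gamma=8.0):
    diff = np.asarray(x) - np.asarray(y)
    return float(np.exp(-gamma * np.sum(diff ** 2)))

def abs_cos_angle(xi, xj, kernel=rbf):
    """|cos| of the angle between the version-space hyperplanes induced
    by two samples, formula (21): |K(xi,xj)| / sqrt(K(xi,xi)*K(xj,xj)).
    A value near 1 means the samples are nearly redundant."""
    return abs(kernel(xi, xj)) / np.sqrt(kernel(xi, xi) * kernel(xj, xj))
```

Identical samples give exactly 1, and increasingly distant samples give values approaching 0, matching the interpretation that a smaller cosine means lower correlation.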
To ensure the diversity of each batch of selected samples, a classification hyperplane is first trained on the labeled initial sample set, and the distance from each unlabeled sample to this hyperplane can be calculated. At the same time, S is defined as the sample set selected in the current batch, with initial value ∅. The correlation between an unlabeled sample x and the currently selected set S is defined through the largest cosine value (i.e., the smallest angle) between x and the samples in S; this quantity measures the diversity the finally selected set would retain if x were added, and it serves as the criterion for whether x is selected as a valuable sample. Then, for sample x, the diversity criterion can be written as
angle(x, S) = max_{s ∈ S} |cos∠(x, s)|. (22)
It should be noted that active learning itself is an iterative process; each round of training is one iteration. Within each iteration, during the batch selection of unlabeled samples, the samples in S are added one by one until the set batch value is reached; that is, the formation of the sample set S is itself an inner iterative process. Because the initial value of S is empty, the first sample added is the one closest to the classification hyperplane, and subsequent samples are added by choosing the sample with the minimum value computed from (22). Detailed batch selection steps are given in the active learning algorithm below.
Since the selected sample points must take into account both the distance from the sample to the hyperplane (i.e., its own uncertainty) and the diversity of the sample set, a parameter β is introduced to weigh these two selection conditions, and the evaluation of a sample's value is formulated as
score(x) = β · d(x) + (1 − β) · max_{s ∈ S} |cos∠(x, s)|, (23)
where d(x) is the distance from x to the current classification hyperplane.
The first term of the above formula, d(x), is the distance from the unlabeled sample to the current classification hyperplane, a judgment factor for the uncertainty of the sample; the smaller this distance, the better. The second term is the cosine value of the angle between the unlabeled sample and S according to the formation strategy proposed above; the larger the angle, the better, and correspondingly the smaller the cosine value, the better. β plays a trade-off role, with a value between 0 and 1: if β is greater than 0.5, more attention is paid to the uncertainty of the sample; otherwise, more attention is paid to its diversity. To sum up, the smaller the score value, the greater the value of the sample.
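The scoring rule just described can be sketched as a single function (a minimal interpretation of the formula; the kernel, its parameter, and the list-of-vectors representation of S are illustrative assumptions):

```python
import numpy as np

def rbf(x, y, gamma=8.0):
    diff = np.asarray(x) - np.asarray(y)
    return float(np.exp(-gamma * np.sum(diff ** 2)))

def sample_score(dist_to_hyperplane, x, selected, beta=0.5, kernel=rbf):
    """Value score of an unlabeled sample (smaller = more valuable):
    beta weighs the distance to the hyperplane (uncertainty) against the
    largest |cos angle| to the already-selected batch S (diversity)."""
    diversity = max(
        (abs(kernel(x, s)) / np.sqrt(kernel(x, x) * kernel(s, s))
         for s in selected),
        default=0.0,  # empty S: only distance matters
    )
    return beta * abs(dist_to_hyperplane) + (1 - beta) * diversity
```

With β = 0.5 and an empty batch, the score is simply half the distance, so the first sample chosen is indeed the one closest to the hyperplane, as stated above; a duplicate of an already-selected sample is maximally penalized on the diversity term.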
After the sample selection strategy of SVM active learning is determined, the following assumptions are made.
Let U be the unlabeled candidate sample set. The elements of the set S are the most valuable samples selected in each iteration for batch labeling; its initial value is ∅, and it is cleared before each iteration. Set the batch selection value to m. The samples in the set T are all manually labeled and serve as the SVM training set; its initial value is also ∅.
Based on the above assumptions, the specific steps of the improved SVM active learning proposed in this paper are as follows:
Improved SVM active learning algorithm:
(1) Select samples from the candidate sample set U through a clustering algorithm, label their categories, and take them as the initial training sample set T, ensuring that T contains at least one positive and one negative sample; then set U = U − T.
(2) Train the classification hyperplane with the SVM algorithm on T.
(3) Build the batch-selected sample set S:
① Judge whether the number of samples in S is less than m. If so, go to Step ②; otherwise, go to Step (4).
② Compute the score of every sample in U according to (23), select the sample x with the smallest score, execute S = S ∪ {x} and U = U − {x}, and return to Step ①.
(4) Manually label the categories of the sample points in S, and execute T = T ∪ S, S = ∅. When the set number of iterations or the stopping condition is reached, go to Step (5); otherwise, return to Step (2).
(5) The algorithm ends.
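The steps above can be sketched as a runnable loop. Scikit-learn's SVC stands in for the paper's SVM-light environment, a synthetic labeled array plays the role of the human annotator, and fixed seed indices replace the clustering-based initialization of step (1); all of these are assumptions for illustration, and the kernel parameter is chosen for the toy data rather than taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def rbf(x, y, gamma=0.5):
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def select_batch(clf, X, pool, m, beta=0.5):
    # Step (3): grow the batch one sample at a time; each candidate is
    # scored by beta*distance-to-hyperplane + (1-beta)*largest |cos angle|
    # to the samples already chosen, and the smallest score wins.
    dist = np.abs(clf.decision_function(X[pool]))
    if dist.ndim > 1:            # multi-class case: least-certain margin
        dist = dist.min(axis=1)
    batch = []
    for _ in range(min(m, len(pool))):
        best_i, best_score = None, np.inf
        for j, i in enumerate(pool):
            if i in batch:
                continue
            div = max((abs(rbf(X[i], X[s]))
                       / np.sqrt(rbf(X[i], X[i]) * rbf(X[s], X[s]))
                       for s in batch), default=0.0)
            score = beta * dist[j] + (1 - beta) * div
            if score < best_score:
                best_i, best_score = i, score
        batch.append(best_i)
    return batch

def active_learn(X, labels, init, m=5, iters=5):
    # `labels` plays the role of the manual annotator ("oracle").
    train = list(init)
    pool = [i for i in range(len(X)) if i not in train]
    clf = SVC(kernel="rbf").fit(X[train], labels[train])
    for _ in range(iters):                     # steps (2)-(4)
        if not pool:
            break
        batch = select_batch(clf, X, pool, m)
        train += batch
        pool = [i for i in pool if i not in batch]
        clf = SVC(kernel="rbf").fit(X[train], labels[train])
    return clf, train

# Demo: two well-separated synthetic classes; indices 0 and 40 seed one
# labeled sample per class, then 3 iterations of batch size 5 add 15 more.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(-2, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
y_demo = np.array([0] * 40 + [1] * 40)
clf, train = active_learn(X_demo, y_demo, init=[0, 40], m=5, iters=3)
```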
In the later experiments, it will be seen that the classifier must be retrained in each iteration, which reduces manual labeling but increases the number of training runs, that is, machine time. In practice, however, selecting the most valuable samples only requires computing the distance from the samples to the classification hyperplane and the angles between samples, which does not add much computation, and compared with the time and manpower consumed by manually labeling all samples, the extra machine time is completely acceptable.
4. Result Analysis
All music sample sets used in this experiment were downloaded from the Baidu music box according to genre labels and clipped by us. Each clip is the middle 30 s of the piece; the sampling frequency is 16000 Hz; the format is mono WAV; and each music sample is 100-dimensional after feature extraction. The classification environment is SVM-light. According to the characteristics of the training samples, the radial basis function (RBF) kernel is selected with its parameter set to 8, and the trade-off parameter β in the active learning method is set to 0.5. Music clips are labeled into five genres: Dance, Lyric, Jazz, Folk, and Rock. The numbers of training and test samples per category are shown in Table 2 and Figure 9.
To verify the effectiveness of the SVM-based active learning method proposed in this paper, two groups of experiments were performed. The first group compares the convergence rate of classification accuracy against the traditional SVM active learning method, which simply takes the distance from the sample to the classification hyperplane as the selection criterion, and an SVM random-sampling method. The second group fixes the desired accuracy and compares the number of samples that must be labeled.
4.1. The First Group of Experiments: Classification Accuracy and Convergence Speed
To better compare the three learning methods, 100 of the 2500 training samples are selected by a clustering algorithm as the initial training samples for labeling; the initial samples of the three algorithms are identical in each experiment. Different batch selection values are set, and the classification accuracy after each iteration is calculated and averaged over 20 experimental runs. The comparison results are shown in Figure 10.
A careful look at Figure 10 shows that the SVM active learning method proposed in this paper is superior to the other two in both classification accuracy and convergence speed over the whole iterative process. In the early iterations, however, the gap between the three methods is small, and the traditional SVM active learning method is even worse than random sampling. This is presumably because the initial samples are too few and deviate markedly from the characteristics of the full sample set, while random sampling is more likely to pick weakly correlated samples; this confirms that the sample point closest to the hyperplane is not necessarily the most informative one, and the redundancy of sample information must be considered. Later, as the number of iterations increases and the classification hyperplane is corrected again and again, the classifier represents the characteristics of all samples better and better; the convergence speed and classification accuracy of the two active learning classifiers become significantly better than those of the random-sampling SVM classifier, and the improved SVM active learning method proposed in this paper performs best.
4.2. The Second Group of Experiments: Determining the Number of Samples
To verify that SVM active learning can significantly reduce the number of labeled samples needed to achieve the same accuracy, all samples were first labeled and added to the training set; the resulting classification accuracy is shown in Table 3.
To demonstrate the effectiveness of the proposed method, iteration is stopped once the classification accuracy reaches 81%, and the number of samples labeled by the two active learning methods is compared statistically (again averaged over 20 trials). Each sample is counted only once; that is, if the same sample is selected in multiple iterations of a classifier, it is not counted again. The final statistics are shown in Table 4.
The SVM active learning method proposed in this paper labels only about half of all samples yet achieves the same accuracy as training on the fully labeled set. The traditional SVM active learning method also reduces the number of labeled samples relative to the total, but the effect is less pronounced. Moreover, in practical applications, it is almost impossible to label all samples for training, so, if a small sacrifice in accuracy is allowed, the advantage of active learning in reducing labeling becomes even more prominent. From experiment 1 it can be seen that the accuracy of the two active learning methods quickly converged to a fixed value, that is, the later iterations bring very limited improvement, whereas only the random-sampling method keeps a gentle upward trend as the number of iterations grows. We may therefore sacrifice a little accuracy and set the iteration stopping condition to 80% accuracy. The recomparison results are shown in Table 5.
The experimental results show that both SVM active learning methods greatly reduce the number of samples that must be labeled for training, and the method proposed in this paper works better. It is also observed that the smaller the batch value m, the fewer samples need to be labeled: more iterations give the classifier more opportunities to select the sample points most useful to it, while a larger m reduces the number of iterations required. A trade-off can therefore be made between the two according to the actual situation.
Comparing the two experiments, the conclusion is straightforward: purposefully selecting valuable samples for iterative labeled training makes the classification accuracy converge rapidly, and far fewer samples need to be labeled to reach the same accuracy.
Traditional music classification requires a great deal of manpower and time to label training samples. This paper proposes an improved SVM active learning algorithm that takes both uncertainty and diversity as the criteria for judging whether samples are valuable, and divides music into five genre categories. Experiments verify the advantages of the proposed method in convergence speed and classification accuracy, and in the number of samples that must be labeled to reach the same accuracy. This fully demonstrates that the SVM active learning method can greatly reduce the cost of manually labeling samples and is of great significance for music classification.
Data Availability
The labeled data set used to support the findings of this study is available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The research was supported by Guangxi Arts University and Sehan University.
References
[1] Z. Rustam, D. A. Utami, R. Hidayat, J. Pandelaki, and W. A. Nugroho, “Hybrid preprocessing method for support vector machine for classification of imbalanced cerebral infarction datasets,” International Journal of Advanced Science, Engineering and Information Technology, vol. 9, no. 2, p. 685, 2019.
[2] G. Manogaran, J. J. P. C. Rodrigues, S. A. Kozlov, and K. Manokaran, “Conditional support-vector-machine-based shared adaptive computing model for smart city traffic management,” IEEE Transactions on Computational Social Systems, vol. 9, no. 1, pp. 174–183, 2022.
[3] S. K. Krishnan and D. M. Malleswaran, “Cardiovascular disease prediction and classification using modified neural network and support vector machine,” Solid State Technology, vol. 63, no. 6, Article ID 22141, 2021.
[4] Y. Li, Y. Liu, Y. Z. Guo, X. F. Liao, B. Hu, and T. Yu, “Spatio-temporal-spectral hierarchical graph convolutional network with semisupervised active learning for patient-specific seizure prediction,” IEEE Transactions on Cybernetics, vol. 51, no. 99, pp. 1–16, 2021.
[5] C. Crockett, C. J. Finelli, M. Demonbrun, K. A. Nguyen, and R. S. Rosenberg, “Common characteristics of high-quality papers studying student response to active learning,” International Journal of Engineering Education, vol. 37, no. 2, pp. 420–432, 2021.