Abstract

Digital multimedia resources have become an important part of people’s daily cultural life. Automatic scene classification of large numbers of dance art videos is the basis for video content retrieval based on scene semantics. In order to improve the accuracy of scene classification, a deep convolutional neural network based on differential evolution is used to identify dance art videos. First, the Canny operator is applied in the YCbCr colour space to detect the human silhouette in the key frames of the video. Then, the AdaBoost algorithm based on a cascade structure is used to implement human target tracking and labelling, and the construction and updating of the weak classifiers are analysed. Next, a differential evolution algorithm is used to optimise the structural parameters of the convolutional neural network, and an adaptive strategy is adopted for the scaling factor of the differential evolution algorithm to improve the accuracy of the optimisation solution. Finally, the improved deep convolutional neural network is trained on the labelled videos to obtain stable scene classification results. The experimental results show that high scene classification performance can be obtained by reasonably setting the crossover rate of differential evolution and the convolutional kernel size of the convolutional neural network. The high accuracy and low root-mean-square error validate the applicability of the proposed method to dance art scene classification.

1. Introduction

With today’s accelerated pace of life, many people are busy with work and do not have enough time to rest. In their leisure time, people want to enrich their spare-time activities, for example by dancing. However, because dance learning usually requires attending professional classes offline, many people cannot find the time to learn. Learning dance by searching for and watching online videos has therefore become a trend [1–7]. The main means of expression in the art of dance is the flexible footwork and graceful movement of the human body; through this art form, dance expresses feelings and reflects social life. As society continues to develop, people’s demand for quality of life keeps increasing, and the traditional offline dance learning method, which remains very limited, can no longer meet their needs. As more and more people want to learn dance, the resources of dance teachers become scarce, and the original face-to-face teaching method can no longer meet actual needs. Online teaching has become increasingly accepted and has emerged as a new mode of teaching.

The online teaching and learning process requires students to access knowledge through the computer network. Online teaching not only maximises the sharing of information resources but also enables multiple forms of teaching and learning, helping to improve teaching efficiency and share teaching resources. At present, the use of digital multimedia video is growing explosively. Along with the popularity of the Internet, massive amounts of dance art video data have appeared on various online media [8–10]. According to relevant statistical reports, online media around the world generate tens of terabytes of video data every day, including video data of movies, music, and dance.

With so much dance video data available, only a fraction of it is of interest to any individual. How, then, can one find the data one needs from this dance video data? This requires effective scene classification of the dance video data. Classifying scenes from a large number of dance videos is a very difficult task. The traditional method is to annotate and classify these videos manually, thus forming a database of dance videos that can be indexed by keywords. However, given the huge amount of dance video data, manual annotation would require a great deal of human resources, money, and time [11–13]. The manual approach requires staff to face large amounts of dance video data every day, which is prone to visual errors and thus leads to errors in video annotation and classification. This traditional method therefore has major drawbacks [14]. An alternative approach is to use computers to analyse these massive amounts of video data and eventually realise an automated dance video scene classification system. Using computer technology to annotate, classify, and retrieve dance videos requires the design of an efficient processing algorithm [15]. In recent years, video annotation, video scene classification, and video retrieval have become hot research topics in the multimedia field, and numerous scholars and research institutions have conducted in-depth research on these problems.

Traditional video scene classification methods generally use manually designed features for modelling. Wei et al. [16] proposed a moving-human tracking algorithm based on region-segmentation contours, with more accurate and stable performance in complex occlusion situations. Suganya et al. [17] proposed a human eye tracking and localisation algorithm based on AdaBoost-STC and random forests. Wang et al. [18] proposed a target tracker based on a likelihood graph and a real-time AdaBoost cascade. Both methods effectively improve tracking speed without degrading tracking accuracy. Ibrahim et al. [19] conducted a video classification study using video saliency features. They split the RGB colour channels of each frame into three images, combined them with the grey-scale image, and arranged these images in temporal order to obtain three spatio-temporal container models. These spatio-temporal containers were then subjected to pyramidal degradation, and the salient regions in the containers were divided using mean clustering. Finally, a support vector machine (SVM) was used to classify the video scenes. This algorithm has a rather complex pipeline and is not effective for video scene classification. Calvin et al. [20] mapped motion vectors into the unit circle, divided it into 8 regions, projected each motion vector onto the coordinate axes of the corresponding region, and derived the resulting matrix as features; finally, an SVM was used to classify the video. However, this method can only detect certain motion patterns in the video, such as jumping, running, swimming, and other specific events, and cannot determine the scene classification of the video. Lu et al. [21] classified videos by taking the comparative luminance values between regions in the video as features and using a Hidden Markov Model (HMM). This method is able to eliminate the influence of factors such as illumination, but can only distinguish broad categories of videos, such as news, movie, and animation videos. In addition, estimating the parameters of the HMM requires a large number of training videos, and the whole process is rather complicated.

Semantic-based information processing has developed rapidly in recent years with the development of artificial intelligence and data mining techniques. Many researchers study the mapping from the underlying features of a video to its semantic information. By mining the semantic information of videos and forming semantic rules according to certain algorithms, scene classification of video data can be achieved. Therefore, using semantic information to classify video is also a future trend in video classification. Deep learning abandons the complex hand-crafted processing of low-level features used in traditional algorithms, so it can effectively accomplish video semantic information mining based on computer vision. The convolutional neural network (CNN) [22–24], which emerged in the field of deep learning, first achieved great success in image recognition and image segmentation. Breakthroughs in typical network structures followed, such as the recurrent neural network (RNN) [25, 26], the deep belief network (DBN) [27], and generative adversarial networks (GAN) [28]. These network structures enhance the feature extraction capability of models in a supervised learning manner. Compared with traditional machine learning methods, deep neural networks perform feature extraction at different scales on images, combine gradients to explore better strategies, and avoid the tedious manual feature extraction process; they only require a well-designed network structure. With their excellent image feature representation capability, deep neural networks are robust in scene classification of sports, news, and other videos. However, dance art videos are more diverse and involve human target tracking and labelling problems, so the various network structures available in deep learning do not perform well enough on the scene classification task for dance videos.

The aim of this study is to automatically classify scenes from dance videos using deep convolutional neural networks and to further improve the accuracy of the model through structural parameter optimisation. The proposed method helps to implement a video content retrieval task based on scene semantics.

Key innovations and contributions of this work include the following:
(1) Both the contour model and the AdaBoost algorithm show advantages in the robustness and accuracy of video target tracking. Therefore, a combination of the two is proposed to solve the person tracking problem in dance art videos.
(2) A deep CNN based on differential evolution (DE) [29] is proposed to address the unsatisfactory classification efficiency and stability of the traditional CNN structure when classifying dance video scenes based on semantic information. The DE algorithm is introduced to optimise the network parameters, and an adaptive strategy is adopted for the scaling factor of the DE algorithm to improve the accuracy of the optimal solution.

The rest of the paper is organised as follows: Section 2 studies target detection based on a human silhouette model in dance videos; Section 3 presents human tracking based on the cascade-structure AdaBoost algorithm; Section 4 describes DE-CNN based dance video scene classification; Section 5 provides the experimental results and analysis; finally, Section 6 concludes the paper.

2. Target Detection Based on Human Silhouette Model in Dance Videos

In dance video data, the human body often rotates, translates, and stretches, and detecting human targets becomes difficult when the body’s pose is constantly changing. Therefore, this paper uses a statistical learning approach based on a human contour model to implement human body detection.

2.1. YCbCr Colour Model

First, the RGB colour model is converted into the YCbCr colour model in the 3D colour space, which is mainly used for edge detection and image segmentation in the digital video field; its colour cube diagram is shown in Figure 1. The conversion can be written as

$$\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.169 & -0.331 & 0.500 \\ 0.500 & -0.419 & -0.081 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} + \begin{bmatrix} 0 \\ 128 \\ 128 \end{bmatrix},$$

where $Y$ is the intensity information, and $C_b$ and $C_r$ are the colour difference components of the colour image.
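To make the conversion concrete, the following minimal sketch implements the colour-space transform above in Python, assuming the standard ITU-R BT.601 coefficients (the paper does not list its exact conversion matrix):

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an H x W x 3 uint8 RGB image to YCbCr.

    A sketch assuming BT.601 coefficients with a chroma offset of 128;
    the paper's exact conversion matrix is not given.
    """
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b          # luminance Y
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128.0  # blue-difference chroma Cb
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128.0  # red-difference chroma Cr
    return np.clip(np.stack([y, cb, cr], axis=-1), 0, 255).astype(np.uint8)
```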

2.2. Canny Operator Edge Detection

The human body image is preprocessed by edge detection in order to extract the human contour features. The first-order derivative of the two-dimensional image function $f(x, y)$ is expressed as follows:

$$\nabla f(x, y) = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right]^{\mathsf{T}}.$$

The second-order derivative of the two-dimensional function is expressed as follows:

$$\nabla^2 f(x, y) = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}.$$

The luminance region can be divided by finding the pixel points that satisfy $\nabla^2 f(x, y) = 0$. This paper uses the Canny operator [30] to implement human edge detection. The edge direction of each pixel point is calculated by equation (4):

$$\theta(x, y) = \arctan\!\left( \frac{G_y(x, y)}{G_x(x, y)} \right), \tag{4}$$

where $G_x$ and $G_y$ are the gradient components in the horizontal and vertical directions.

The pixel point with the maximum pixel gradient is set as an edge pixel point. The pixel gradient is calculated as follows:

$$G(x, y) = \sqrt{G_x^2(x, y) + G_y^2(x, y)}.$$
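As an illustration of this detection step, the sketch below runs the Canny operator on the luminance channel of a key frame using OpenCV; the file name and the hysteresis thresholds (100, 200) are assumptions, not values from the paper:

```python
import cv2

frame = cv2.imread("key_frame.png")                   # hypothetical key frame (BGR)
ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)      # OpenCV stores chroma as Cr, Cb
y_channel = ycrcb[:, :, 0]                            # luminance component Y
y_channel = cv2.GaussianBlur(y_channel, (5, 5), 1.4)  # smooth before differentiation
edges = cv2.Canny(y_channel, 100, 200)                # gradient + hysteresis thresholding
```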

3. Human Tracking Based on the Cascade Structure AdaBoost Algorithm

Recently, ensemble methods such as AdaBoost (adaptive boosting) have been successfully applied to many target tracking problems. The AdaBoost classifier is a meta-algorithmic classifier built from instances of the same type of base classifier (weak classifier). Based on each classifier’s error rate, AdaBoost assigns it a different weighting parameter. Finally, the ensemble classifier outputs its prediction by a weighted summation of the weak classifiers’ outputs.

3.1. Construction and Updating of Weak Classifiers

For each pixel in each image frame, the weak classifier is defined as follows:

$$h(\mathbf{x}) = \operatorname{sign}\!\left( \mathbf{h}^{\mathsf{T}} \mathbf{x} \right),$$

where $\mathbf{x}$ is the sample and $\mathbf{h}$ is the adjusted segmentation hyperplane, calculated as follows:

$$\mathbf{h} = \left( A^{\mathsf{T}} W A \right)^{-1} A^{\mathsf{T}} W \mathbf{y},$$

where $\mathbf{y}$ is the vector of sample labels, $A$ is the sample matrix, and $W$ is a diagonal matrix of weights. The sample weights are updated as follows:

$$w_i \leftarrow \frac{w_i \exp\!\left( -y_i h(\mathbf{x}_i) \right)}{\sum_j w_j \exp\!\left( -y_j h(\mathbf{x}_j) \right)}.$$
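A minimal sketch of this weak classifier, under the weighted least-squares reading of the formulas above (sample matrix A, labels y in {-1, +1}, diagonal weight matrix W), might look as follows; the small ridge term is an added numerical safeguard, not part of the paper:

```python
import numpy as np

def train_weak_classifier(A, y, w):
    """Fit h(x) = sign(h . x) with hyperplane h = (A^T W A)^(-1) A^T W y."""
    W = np.diag(w)
    h = np.linalg.solve(A.T @ W @ A + 1e-6 * np.eye(A.shape[1]), A.T @ W @ y)
    return h

def update_sample_weights(w, A, y, h):
    """Exponentially up-weight samples the weak classifier misclassifies."""
    pred = np.sign(A @ h)
    w = w * np.exp(-y * pred)
    return w / w.sum()  # renormalise so the weights stay a distribution
```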

3.2. Description of the Algorithm Flow

The AdaBoost algorithm is an iterative algorithm that aggregates multiple weak classifiers trained on the same training set into one strong classifier. The main steps of the AdaBoost algorithm are as follows:

Step 1. Let the input samples be $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ with labels $y_i \in \{-1, +1\}$, forming the data set $D$. Initialise the weights as follows:

$$w_1(i) = \frac{1}{N}, \quad i = 1, \ldots, N.$$

Step 2. In iteration $t$, train a weak classifier $h_t$ with each candidate feature, which gives a weighted error rate

$$\varepsilon = \sum_{i=1}^{N} w_t(i) \, \mathbb{1}\!\left[ h_t(\mathbf{x}_i) \neq y_i \right].$$

Step 3. The classifier with the smallest weighted error rate is selected, and its error rate is noted as $\varepsilon_t$. The weight of the weak classifier is then calculated as follows:

$$\alpha_t = \frac{1}{2} \ln\!\left( \frac{1 - \varepsilon_t}{\varepsilon_t} \right).$$

Step 4. The sample weights are updated as follows:

$$w_{t+1}(i) = \frac{w_t(i) \exp\!\left( -\alpha_t y_i h_t(\mathbf{x}_i) \right)}{Z_t},$$

where $Z_t$ denotes the normalisation parameter.

Step 5. Construct the final strong classifier as follows:

$$H(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{t=1}^{T} \alpha_t h_t(\mathbf{x}) \right).$$
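Steps 1–5 can be summarised in the following minimal sketch, where `weak_learners` is a hypothetical pool of candidate classifiers mapping a sample matrix to predictions in {-1, +1}:

```python
import numpy as np

def adaboost(X, y, weak_learners, T):
    """Discrete AdaBoost: select T weak classifiers and weight them."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # Step 1: uniform initial weights
    alphas, chosen = [], []
    for _ in range(T):
        errs = [np.sum(w * (h(X) != y)) for h in weak_learners]
        t = int(np.argmin(errs))                 # Steps 2/3: smallest weighted error
        eps = max(errs[t], 1e-10)
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # weak-classifier weight
        w = w * np.exp(-alpha * y * weak_learners[t](X))
        w /= w.sum()                             # Step 4: normalise by Z_t
        alphas.append(alpha)
        chosen.append(weak_learners[t])
    # Step 5: strong classifier as the sign of the weighted vote
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, chosen)))
```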

4. DE-CNN Based Dance Video Scene Classification

4.1. Adaptive DE Algorithm

Let the population size be $NP$, the attribute dimension be $D$, the differential scaling factor be $F$, and the crossover rate be $CR$. The $j$-th dimensional attribute [31] of the $i$-th individual is initialised as follows:

$$x_{i,j} = x_j^{\min} + \operatorname{rand}(0, 1) \cdot \left( x_j^{\max} - x_j^{\min} \right), \quad i = 1, \ldots, NP, \; j = 1, \ldots, D,$$

where $\operatorname{rand}(0, 1)$ denotes a random number in $(0, 1)$.

Individual $x_i^{g}$ of the $g$-th generation produces a mutant individual of generation $g+1$ using the mutation operation:

$$v_i^{g+1} = x_{r_1}^{g} + F \cdot \left( x_{r_2}^{g} - x_{r_3}^{g} \right),$$

where $x_{r_1}^{g}$, $x_{r_2}^{g}$, and $x_{r_3}^{g}$ are three mutually distinct random individuals from the $g$-th generation. A common range of values for $F$ is $[0, 2]$.

The individual crossover method is shown as follows:

$$u_{i,j}^{g+1} = \begin{cases} v_{i,j}^{g+1}, & \operatorname{rand}(0, 1) \leq CR \ \text{or} \ j = j_{\text{rand}}, \\ x_{i,j}^{g}, & \text{otherwise}. \end{cases}$$

Compare $u_i^{g+1}$ with $x_i^{g}$ by evaluating the fitness value of each individual, and select the individual with the higher fitness value for the subsequent evolutionary process:

$$x_i^{g+1} = \begin{cases} u_i^{g+1}, & f\!\left( u_i^{g+1} \right) \geq f\!\left( x_i^{g} \right), \\ x_i^{g}, & \text{otherwise}, \end{cases}$$

where $f(\cdot)$ represents the fitness function. The DE algorithm stops when the maximum number of generations is reached.

A common range of values for $CR$ is $(0, 1)$. The optimisation process of DE is closely related to the value of $F$: a poor choice of $F$ results in unsatisfactory optimisation performance of the differential evolution algorithm. Therefore, an adaptive $F$ value is introduced in the calculation:

$$F = F_{\max} - \left( F_{\max} - F_{\min} \right) \cdot \frac{g}{G_{\max}},$$

where $g$ is the current generation, $G_{\max}$ is the maximum number of generations, and the value range of $F_{\min}$ and $F_{\max}$ is $[0, 2]$.

The $F$ value becomes progressively smaller as the evolutionary generations proceed. Early evolution pursues population diversity, while late evolution focuses on local search ability, so the DE algorithm is more likely to obtain optimal individuals.
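The sketch below implements DE/rand/1/bin with a linearly decaying scaling factor, one common adaptive scheme consistent with the description above; the decay schedule and the default parameter values are assumptions, not the paper’s settings:

```python
import numpy as np

def adaptive_de(fitness, dim, lo, hi, NP=30, CR=0.9, F_max=0.9, F_min=0.4, G_max=100):
    """Maximise `fitness` with DE; F shrinks from F_max to F_min over generations."""
    pop = lo + np.random.rand(NP, dim) * (hi - lo)          # random initial population
    fit = np.array([fitness(x) for x in pop])
    for g in range(G_max):
        F = F_max - (F_max - F_min) * g / G_max             # adaptive scaling factor
        for i in range(NP):
            idx = [j for j in range(NP) if j != i]
            r1, r2, r3 = np.random.choice(idx, 3, replace=False)
            v = pop[r1] + F * (pop[r2] - pop[r3])           # mutation
            j_rand = np.random.randint(dim)                 # force one mutant gene
            mask = np.random.rand(dim) <= CR
            mask[j_rand] = True
            u = np.clip(np.where(mask, v, pop[i]), lo, hi)  # binomial crossover
            fu = fitness(u)
            if fu >= fit[i]:                                # greedy selection
                pop[i], fit[i] = u, fu
    return pop[np.argmax(fit)]
```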

4.2. CNN Model Design

Machine learning has played a huge role in computer vision. Most traditional machine learning methods use shallow structures that can only handle limited data operations. A large number of experiments have shown that the feature representations learned by shallow structures struggle to meet practical needs when dealing with complex classification problems. In recent years, computer performance has continued to improve, providing powerful support for deep learning. New deep learning models are constantly being proposed and successfully applied to areas such as image recognition, speech recognition, and natural language processing.

Common deep learning models in image recognition include the deep belief network (DBN), recurrent neural network (RNN), generative adversarial network (GAN), capsule network (CapsNet), restricted Boltzmann machine (RBM), and convolutional neural network (CNN). This paper builds on the deep convolutional neural network and selects representative dance videos as the recognition objects.

Originally designed specifically to handle image recognition tasks, CNNs are multilayer neural networks and currently the most classical and commonly used computational structure in the field of computer vision. The basic structure of a CNN consists of an input layer, hidden layers, and an output layer. The hidden layers are the core part of the convolutional neural network and contain the convolutional layers, the pooling layers (also known as downsampling layers), and the fully connected layers, as shown in Figure 2.

Pooling layers generally reduce the dimensionality of the input feature map between successive convolutional layers; a pooling layer effectively reduces the output feature vector of the convolutional layer. This process takes a partially contiguous region of the image as the pooling region and translates the sliding window of the pooling function within that region. The pooling size and step size control the sliding window size and translation rule, respectively, as shown in Figure 3.
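The pooling operation can be sketched as follows, with the pooling size and step size playing exactly the roles described above (max pooling is assumed; the paper does not state which pooling function it uses):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Slide a size x size window over the map and keep the maximum value."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = feature_map[i * stride : i * stride + size,
                                    j * stride : j * stride + size].max()
    return out
```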

Let the set of dance video samples be $X = \{x_1, x_2, \ldots, x_m\}$. The $m$ video attribute features are convolved through the $l$-th layer:

$$C^{l} = f\!\left( W^{l} * C^{l-1} + b^{l} \right),$$

where $W^{l}$ and $b^{l}$ represent the weights and biases assigned to the features by the $l$-th layer, respectively, and $*$ denotes the convolution operation.

The features are convolved from the samples using a convolution kernel of size $k \times k$.

After convolution and pooling, the original sample is reconstructed into a new feature representation, on which a conversion operation is then performed before the fully connected layers, subject to the dimensional consistency restrictions of the network.

After all the connected layers of the CNN are obtained, a classifier is selected to predict the sample class. Let the training output and the actual value of the $k$-th node be $o_k$ and $y_k$, respectively; the error term is then $e_k = y_k - o_k$.

Assuming that layers $l$ and $l+1$ contain $n_l$ and $n_{l+1}$ nodes, respectively, the error of node $j$ in layer $l$ is

$$\delta_j^{l} = f'\!\left( o_j \right) \sum_{k=1}^{n_{l+1}} w_{jk} \, \delta_k^{l+1},$$

where $o_j$ is the output of neuron $j$ and $w_{jk}$ is the weight from neuron $j$ to neuron $k$ in layer $l+1$. The weight increment is calculated as follows:

$$\Delta w_{jk} = -\eta \, \delta_k^{l+1} \, o_j,$$

where $\eta$ is the learning rate.

The bias is updated as follows:

$$\Delta b_k = -\eta \, \delta_k,$$

where $\eta$ is the bias update step. The adjusted weights are shown as follows:

$$w_{jk}' = w_{jk} + \Delta w_{jk}.$$

The adjusted offsets are shown as follows:

$$b_k' = b_k + \Delta b_k.$$

The error for all nodes is shown as follows:

$$E = \frac{1}{2} \sum_{k} \left( y_k - o_k \right)^2.$$

When $E$ falls below the set threshold, the iteration stops and a stable CNN model is obtained.
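For illustration, a minimal CNN matching the structure described above is sketched below in PyTorch (the paper’s experiments were run in Matlab); the channel counts, the 64 × 64 input size, and the 3 × 3 kernels are assumptions, with the seven output classes corresponding to the dance categories in Section 5:

```python
import torch
import torch.nn as nn

class DanceSceneCNN(nn.Module):
    """Input -> convolution/pooling hidden layers -> fully connected output."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling (downsampling) layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # fully connected layer

    def forward(self, x):                                 # x: (N, 3, 64, 64) assumed
        return self.classifier(self.features(x).flatten(1))

model = DanceSceneCNN()
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate eta
loss_fn = nn.CrossEntropyLoss()                           # surrogate for the error E
```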

4.3. Classification Process Based on DE-CNN Model

Before the CNN can be applied to classify videos, the sample data to be classified must first be transformed; this mainly addresses the vectorisation of the video attributes. The Skip-gram conversion facilitates efficient input to the CNN. After the CNN video classification model is established, the random weights and biases are optimised by the DE algorithm. A fitness function is established based on the video classification accuracy. The optimal individuals of weights and biases are obtained through multigeneration evolution of DE. Finally, the video classification results are obtained by CNN classification training, as shown in Figure 4.
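A sketch of this coupling is given below: each DE individual encodes the CNN’s initial weights and biases, and the fitness is the classification accuracy after training. The helper names `build_cnn`, `set_initial_parameters`, and `train_and_score` are hypothetical placeholders for the corresponding steps in Figure 4:

```python
def fitness(individual, train_data, val_data):
    """Fitness of one DE individual = accuracy of the CNN it initialises."""
    model = build_cnn()                          # hypothetical CNN constructor
    set_initial_parameters(model, individual)    # decode DE vector into weights/biases
    return train_and_score(model, train_data, val_data)   # accuracy in [0, 1]

# The adaptive DE from Section 4.1 then evolves the initial parameters, e.g.:
# best = adaptive_de(lambda x: fitness(x, train_data, val_data),
#                    dim=num_parameters, lo=-0.5, hi=0.5)
```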

5. Experimental Results and Analysis

5.1. Experimental Setup

In order to validate the performance of the DE-CNN model in dance video scene classification, simulation experiments were conducted on dance video sequences (resolution 640 × 480), each 400 frames long. First, the performance of human target tracking was verified; second, the performance under different DE algorithm parameters; then, the performance for different convolutional kernel sizes; and finally, the DE-CNN model was compared with commonly used video scene classification algorithms.

The data sources for the video classification experiment were 11 large video websites. All videos were in MP4 format, and seven categories of dance videos were selected for the classification test: classical dance, ballet, folk dance, modern dance, tap dance, jazz dance, and Latin dance. Each category contains 500 videos, giving 3500 dance video sequences in the experimental dataset. Each video sequence was 400 frames long with a duration of 5 min. Sample frames from the dance videos are shown in Figure 5.

The proposed method classifies the dance videos so that automatic scene recognition can be achieved. Information on the experimental video dataset is shown in Table 1.

The videos from Table 1 were transformed using the Skip-gram structure, thus completing the video-to-attribute-vector mapping. This allowed the video samples to be used for CNN classification training. During the experiments, the entire dance video sample set was split into training and test sets in a 7 : 3 ratio. The experimental hardware environment was a CPU i7 3770 (3.4 GHz) with 8 GB RAM; the software environment was the Windows 10 operating system with Matlab 7.0 simulation software. The initial DE parameter settings (population size, scaling factor, crossover rate, and maximum number of generations) were set empirically, and the CNN convolutional kernel size is 2 × 2 by default.

5.2. Human Target Tracking Performance

The effect of human target detection was first quantified in order to assess its robustness. The panning errors for human detection are shown in Figure 6. As can be seen from Figure 6, human detection performs well in the panning case, with an average error of less than 10 pixels.

In addition, in order to quantitatively compare tracking performance, comparison experiments on the same video sequences were conducted using the hybrid algorithm, the AdaBoost-STC algorithm, and the adaptive EKF algorithm. The centroid pixel error is defined as

$$e = \sqrt{\left( x_t - x_b \right)^2 + \left( y_t - y_b \right)^2},$$

where $(x_t, y_t)$ is the centre of the tracking result and $(x_b, y_b)$ is the centre of the baseline result.

After repeating the experiment 100 times and taking statistical averages, the human tracking results for the three different algorithms on a 400-frame video sequence are shown in Figure 7.

As can be seen in Figure 7, the difference in centroid pixel error between the three algorithms is not very significant up to frame 250. However, as the tracking time increases beyond 250 frames, the hybrid tracking algorithm based on the contour model and AdaBoost shows a stronger advantage. In other words, the hybrid tracking algorithm based on the contour model and AdaBoost is better in terms of stability and robustness under the same conditions.

5.3. Video Classification Performance with Different Convolution Sizes

CNN structures with different kernel sizes were used to test the experimental samples separately, and the results are shown in Table 2.

From Table 2, the best results were obtained when a convolutional kernel size of 3 × 3 was chosen, with the classification accuracy of the dance video data samples reaching 92.16%. As the kernel size increases, the classification accuracy and the standard deviation both decrease. This is because an overly large convolution size results in coarser convolutional granularity, which reduces the opportunity for the important attributes of the samples to participate in the convolution and transformation operations. The temporal performance of the DE-CNN algorithm on the dance video dataset with a convolutional kernel size of 3 × 3 is shown in Figure 8.

As can be seen from Figure 8, the classification time of the DE-CNN model was about 55 s with a convolutional kernel size of 3 × 3. Ultimately, the classification accuracy of the DE-CNN model at convergence was consistently above 0.9.

5.4. Optimisation Performance of the DE Algorithm

In order to verify the optimisation performance of the DE algorithm for CNN, the performance of the test samples was simulated using the CNN algorithm and the DE-CNN algorithm, respectively.

As can be seen from Table 3, the DE-CNN algorithm showed better performance in the classification of dance video scenes: all three metrics of DE-CNN video classification exceeded 0.9. The maximum classification accuracy of DE-CNN was 93.18%, while that of CNN was only 88.96%, so the accuracy of DE-CNN is significantly improved. This is mainly because, after weight optimisation by DE, the CNN obtains better initial weights and biases, resulting in more accurate video classification. The convergence performance of the two algorithms is compared below, as shown in Figure 9.

It can be seen that the convergence performance of DE-CNN is significantly superior to that of CNN. In the classification of dance video data samples, DE-CNN converges with an RMSE of about 0.18, while CNN converges with an RMSE of about 2.5. Therefore, the DE-CNN algorithm has better classification stability than the CNN algorithm. In terms of convergence time, CNN converges about 5 s faster than DE-CNN, probably because of the extra time taken by the DE algorithm to solve for the optimal weights and biases. However, in terms of the overall DE-CNN classification time, the DE algorithm consumes only a small fraction and has little impact on the video classification time.

5.5. Video Classification Performance of Different Algorithms

The commonly used naive Bayes (NB) [32], BP neural network [33], LSTM neural network [34], and DE-CNN were used to analyse the test dataset, as shown in Figure 10.

In terms of video classification accuracy, the DE-CNN and LSTM algorithms achieve the highest accuracies. In terms of classification time, the LSTM algorithm consumes the most time, followed by the DE-CNN algorithm, while the NB algorithm takes the least.

The stability of the four algorithms in video scene classification was then tested. The RMSE performance of the four algorithms is shown in Figure 11.

It can be seen that the DE-CNN algorithm has the best RMSE values, while NB performs worst. This also indicates that the classification RMSE values are sensitive to the number of video categories. In summary, for scene classification of 3500 dance video sequences, the DE-CNN model achieves good classification time and RMSE performance while maintaining high classification accuracy.

6. Conclusion

In this paper, a differential evolution convolutional neural network model is applied to scene classification of dance videos. A contour-model-based detection approach is used for human target detection, which effectively improves the robustness of human detection. The AdaBoost algorithm based on a cascade structure is used for human target tracking. The weight optimisation capability of the differential evolution algorithm is used to improve the applicability of the convolutional neural network model in video scene classification. The following conclusions are drawn:
(1) The average error in human motion detection is less than 10 pixels, which indicates high robustness.
(2) The proposed method has a smaller centroid pixel error in human movement than other methods and is suitable for long tracking processes.
(3) Compared with commonly used video classification algorithms, the proposed DE-CNN model has significant advantages in classification accuracy and RMSE performance.
Subsequent studies will further tune the differential evolution parameters to improve the time performance of video scene classification.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.