Dance is an art form that relies on the human body to express beauty. The character image in a dance work is conveyed through the performer's body language and expression of inner emotion, which gives dance works their aesthetic quality and lets them deliver delicate emotions and profound ideas. This article studies the change of dance form in air equipment dance performance using convolutional neural network technology. It also proposes tracking the position and scale of the dancer within the convolutional neural network framework. When estimating the position and scale of the dancer, a discriminative correlation filter is trained on samples taken around the initial region. The experimental results show that the recognition rates of HOG features in the follower combination and the inner flower combination were 42.8% and 40%, respectively, which were 3.33% and 29.2% higher than in the flower combination. The proposed method fell below the reference method only in the inner flower combination, where its recognition rate of 53.3% was lower than the reference method's 60%. The algorithm maintains a reasonable recognition rate for complex dance actions and retains a certain accuracy even when the background and the target are easily confused. This verifies the effectiveness of the proposed algorithm for dance action recognition.

1. Introduction

Research on form recognition technology for dance is still at an early stage. Because dance forms are highly complex and the dancer's body often occludes itself during a performance, progress in recognizing dance in video has been relatively slow.

If body recognition technology can be applied to the analysis of music and dance videos, dance body fragments linked to the music can be obtained. This not only reduces the workload of dance professionals but also makes music and dance video data easier to retrieve and makes automatic choreography systems more efficient, with more varied results. Research on dance-video-based body recognition therefore benefits not only the analysis of dance videos by dance professionals but also teaching and the protection of artistic cultural heritage.

The innovations of this paper are as follows. (1) This paper focuses on the extraction and representation of body features in videos, including directional gradient histogram (HOG) features, optical flow direction histogram (HOF) features, the bag-of-words model, and the audio features involved in this study. (2) An effective feature extraction method is studied that targets the characteristics of dance forms. (3) A multifeature fusion method for dance body recognition is studied, based on the directional gradient histogram feature, the optical flow direction histogram feature, and the audio signature feature. To address the problem of fusing heterogeneous features, this paper uses multikernel learning to fuse the three types of features for dance body recognition.

There are festivals and celebrations almost every day in India, where people perform dances to express joy. Dharmalingam found that every festival is accompanied by folk and tribal dances, almost all of which are constantly evolving and improvised [1]. Indian folk dance is dynamic and highly varied, so recognizing the physical form of the dance is important. Anami and Bhandage therefore proposed a three-stage method to identify mudra images of Bharatanatyam dance, using an artificial neural network to identify and classify unknown mudras [2]. However, this work is not well suited to dance learning and dance automation. Building an automated dance assessment tool is also significant. Faridee et al. discussed their experience of building such a tool using IMUs and Internet of Things devices and highlighted the main challenges of the effort; their method proved more effective than traditional feature engineering methods at accurately identifying the microstep aspect of dance activities [3]. However, it can cause the limbs to be out of tune. Extensive research has shown dance therapy to be an extremely beneficial form of exercise. Grosu et al. reduced stress by applying an intervention program built from a series of specific dance steps and artistic programs developed entirely for this purpose [4]. Although the program greatly reduced mental tension by releasing muscle tension, it had little effect on certain stress indicators. The classification of folk dance videos is very important for dance education and the protection of cultural heritage. Bhatt and Patalia proposed a folk dance classification framework that extracts audio signals from videos, generates high-dimensional feature vectors, and reduces their dimensionality using principal component analysis (PCA) [5]. Sabahi proposed a new self-organising fuzzy neural network; although simulation results showed the effectiveness of the method in the presence of interference, the rest was uncertain [6]. To confirm the classification accuracy obtained, Anami and Bhandage used a deep learning method, namely, convolutional neural networks, but this led to discrepancies in the experimental results [7].

The shapes in current body recognition datasets are not as complex as dance shapes. Most datasets contain a single scene, are limited by the external environment, and are recorded from mostly fixed perspectives, whereas the forms in dance performances are often highly complex. Because of factors such as the speed of each performer's movements and differences in acquisition speed, it is difficult to characterize complex dance shapes with a single traditional shape feature. The difficulty of dance body recognition therefore lies in extracting effective features that accurately characterize the dancing body in dance videos.

3. Methods on the Change of Dance Form in Dance Performance

3.1. Development of Body Recognition Technology

The main application areas of body recognition are shown in Figure 1.

As shown in Figure 1, form recognition is mainly applied in motion capture, video surveillance, auxiliary motion analysis, virtual reality, and human-machine interaction, while research on form recognition technology for airline cabin crew lags behind [8, 9].

3.2. Human Posture Features

Human pose features are derived from human pose information, which can be obtained in two ways [10]. The first is to record the coordinates of each joint of the human body with a motion capture device while the dance video is collected. The second is to apply pose estimation to the test set to obtain the joint positions and then calculate joint angle information. Figure 2 shows the division of human body regions according to joints and pose estimation.

Complex and changing backgrounds, as well as complex costumes, have a negative impact on body recognition in dance performances. As shown in Figure 2, using a pose estimator removes the influence of the background, and dividing the body into regions according to the body pose reduces the influence of clothing occlusion on body recognition [11, 12]. When optical flow features are extracted, background regions whose changes are small or insignificant can also be filtered out according to the optical flow value. The regions from which information is finally extracted are therefore the upper body, the lower body, and the whole body of the performer.

3.3. Convolutional Neural Network of Dancer’s Form

Main dancer (C position) tracking: the main dancer, who is usually in the middle, is manually marked in the first frame of the template video [13]. Because the main dancer moves during the rest of the dance, the dancer's position in the initial frame is used to initialize the DSST tracking algorithm (a tracking method based on kernel correlation filtering), which tracks position and scale simultaneously [7]. When estimating the dancer's position and scale, a discriminative correlation filter is trained on samples taken around the initial region. The squared error function to be minimized is as follows:

Here, f represents the training sample taken around the initial region, g represents the Gaussian feature map of the target location, and h is the correlation filter to be computed. They are all of size M × N, and ⋆ denotes circular correlation [14]. Parseval's theorem is used to obtain the right side of formula (1). H, F, and G represent the discrete Fourier transforms of h, f, and g, respectively, and the overbar represents complex conjugation. Minimizing formula (1) gives the following equation:

During training, the numerator and denominator of formula (2) are regarded as a whole for iterative optimization. After training, for a new image region z, its discrete Fourier transform Z is calculated first. The response score of this region is then obtained by the following formula:

Here, ℱ⁻¹ represents the inverse discrete Fourier transform, y is the response score, and the location of the largest value of y is the position of the tracked dancer.
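The training and detection steps described above can be illustrated with a short sketch. It is a minimal single-channel example, assuming NumPy and the standard closed-form correlation filter solution in the Fourier domain; the function names, the random stand-in patch, and the small regularization value `lam` are illustrative assumptions and are not taken from this paper.

```python
import numpy as np

def gaussian_target(shape, sigma=2.0):
    """Desired output g: a Gaussian peak centred in an M x N map."""
    m, n = shape
    ys, xs = np.mgrid[0:m, 0:n]
    cy, cx = m // 2, n // 2
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def train_filter(f, g, lam=1e-3):
    """Return the numerator and denominator of the filter in the Fourier domain."""
    F = np.fft.fft2(f)
    G = np.fft.fft2(g)
    numerator = np.conj(G) * F
    denominator = np.conj(F) * F + lam  # lam is an assumed regularisation value
    return numerator, denominator

def response(numerator, denominator, z):
    """Correlation response y for a new patch z; its peak marks the target."""
    Z = np.fft.fft2(z)
    H = numerator / denominator
    return np.real(np.fft.ifft2(np.conj(H) * Z))

def peak(y):
    """Row and column of the maximum response."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(y), y.shape))

rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))                  # stand-in for a dancer patch
num, den = train_filter(patch, gaussian_target(patch.shape))
shifted = np.roll(patch, shift=(3, -2), axis=(0, 1))   # the patch moved by (3, -2)
peak_same = peak(response(num, den, patch))
peak_shifted = peak(response(num, den, shifted))
```

In this sketch the response peak stays at the centre for the training patch and moves by the same offset as the circularly shifted patch, which is the behaviour the tracker relies on to locate the dancer.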

When estimating the dancer's scale, the calculation is similar to the one above, but it takes both the position and the scale dimension into account. f is the feature region, and there are a total of d scale dimensions. h and g have meanings similar to those above, with the additional scale dimension. The loss function to be optimized is calculated as follows:

Here, λ is the weight of the regularization term. Solving this gives H in the Fourier domain as follows:

To make the tracking algorithm more robust, the numerator and denominator at each iteration are kept as separate quantities, which are updated according to formulae (6) and (7).

The final score for the region with the scale space is calculated as follows:
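For reference, the description in this subsection matches the standard discriminative scale space tracking (DSST) formulation. Assuming formulae (1) to (8) follow that formulation, they can be written as below. The notation is an assumption made here: f is the training sample, g the desired Gaussian output, h the filter, capital letters denote discrete Fourier transforms, the overbar denotes complex conjugation, ⋆ denotes circular correlation, A and B denote the running numerator and denominator, and η is a learning rate.

```latex
\begin{align}
\varepsilon &= \left\| f \star h - g \right\|^{2}
             = \frac{1}{MN} \left\| \overline{H} F - G \right\|^{2} \tag{1} \\
H &= \frac{\overline{G} F}{\overline{F} F} \tag{2} \\
y &= \mathcal{F}^{-1}\left\{ \overline{H} Z \right\} \tag{3} \\
\varepsilon &= \Bigl\| \sum_{l=1}^{d} h^{l} \star f^{l} - g \Bigr\|^{2}
             + \lambda \sum_{l=1}^{d} \bigl\| h^{l} \bigr\|^{2} \tag{4} \\
H^{l} &= \frac{\overline{G} F^{l}}{\sum_{k=1}^{d} \overline{F^{k}} F^{k} + \lambda} \tag{5} \\
A_{t}^{l} &= (1-\eta) A_{t-1}^{l} + \eta\, \overline{G_{t}} F_{t}^{l} \tag{6} \\
B_{t} &= (1-\eta) B_{t-1} + \eta \sum_{k=1}^{d} \overline{F_{t}^{k}} F_{t}^{k} \tag{7} \\
y &= \mathcal{F}^{-1}\left\{ \frac{\sum_{l=1}^{d} \overline{A^{l}} Z^{l}}{B + \lambda} \right\} \tag{8}
\end{align}
```

Under this reading, formulae (6) and (7) are running averages of the numerator and denominator of formula (5), and formula (8) applies the averaged filter to a new sample Z.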

3.4. Attitude Estimation Algorithm Based on Coordinate Regression

Since 2014, research on pose estimation has moved towards deep learning. Earlier methods modeled pose using the position information between adjacent key points, which limited the part search to local positions and made detection difficult for incomplete human bodies [8]. DeepPose is one of the first methods to use coordinate regression in deep neural networks. It uses an end-to-end approach to predict human key points from a global view of the human body. The powerful feature extraction capability of deep neural networks removes the need to design features by hand. The key points of the human body are then located from these features, which greatly simplifies key point prediction [15, 16]. The pose estimation algorithm based on coordinate regression takes a whole image as the input of the model, uses a simple 7-layer convolutional neural network as the feature extraction network, and finally uses a fully connected layer to output a multidimensional vector corresponding to the coordinates. For example, if (x, y) represents the coordinates of a key point and a total of five key points need to be regressed, then the vector output by the network and the supervision information are both vectors of length ten [17].

This model can usually be expressed as follows. A pose of the human body is represented by the positions of its k joint points, written as the following vector:

The absolute coordinates of the predicted pose vector can be represented as follows:

The loss function used is L2 loss, and then the model can be written as follows:

In essence, a convolutional neural network based on coordinate regression regresses the offset of each key point from the image boundary. This form of supervision provides relatively little information, which slows the convergence of the whole network, and the error in actual model training is also large [18, 19].
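The pose vector and the L2 loss described above can be sketched as follows. This is a minimal illustration, assuming that coordinates are normalised relative to a bounding box given by its centre, width, and height before regression; the box, the key point values, and the function names are hypothetical and not taken from this paper.

```python
def flatten_pose(keypoints):
    """Turn k (x, y) key points into a single vector of length 2k."""
    return [c for point in keypoints for c in point]

def normalize(pose, box):
    """Map absolute coordinates to box-relative ones (assumed normalisation)."""
    cx, cy, w, h = box
    return [((x - cx) / w, (y - cy) / h) for x, y in pose]

def denormalize(pose, box):
    """Inverse of normalize: recover absolute image coordinates."""
    cx, cy, w, h = box
    return [(x * w + cx, y * h + cy) for x, y in pose]

def l2_loss(predicted, target):
    """Sum of squared differences between two flattened pose vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

# Hypothetical bounding box: centre x, centre y, width, height
box = (100.0, 120.0, 80.0, 160.0)
# Five hypothetical key points in absolute image coordinates
truth = [(100.0, 60.0), (80.0, 100.0), (120.0, 100.0), (90.0, 180.0), (110.0, 180.0)]
target_vec = flatten_pose(normalize(truth, box))
# A made-up prediction: every coordinate off by 0.05 in normalised units
pred_vec = [v + 0.05 for v in target_vec]
loss = l2_loss(pred_vec, target_vec)
recovered = denormalize(normalize(truth, box), box)
```

With five key points the flattened vector has length ten, matching the example in the text, and the predicted vector is mapped back to absolute coordinates by inverting the normalisation.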

For the convolutional neural network, the number of parameters of a single convolutional layer is calculated as follows:

Here, the subscript l always denotes the lth convolutional layer. The quantities involved for the lth layer are the number of convolution kernels, which equals the number of channels of the output feature map; the width and the height of the convolution kernel; and the number of channels of each convolution kernel, which equals the number of channels of the input feature map, that is, the number of output channels of the (l − 1)th layer. The term in parentheses is the number of weights of a single convolution kernel. In particular, in depthwise separable convolutional layers,

In this case, the term in brackets again represents the number of weights of a convolution kernel. The total number of parameters of the model is as follows:
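The parameter counts described in this subsection can be sketched in code. The sketch assumes that bias terms are not counted and that a depthwise separable layer consists of one depthwise filter per input channel followed by 1 × 1 pointwise kernels; the argument names and the three-layer example are hypothetical.

```python
def conv_params(num_kernels, k_w, k_h, in_channels):
    """Weights of a standard convolution: kernels x (width x height x channels)."""
    return num_kernels * (k_w * k_h * in_channels)

def depthwise_separable_params(num_kernels, k_w, k_h, in_channels):
    """One k_w x k_h depthwise filter per input channel plus 1 x 1 pointwise kernels."""
    depthwise = in_channels * (k_w * k_h * 1)
    pointwise = num_kernels * (1 * 1 * in_channels)
    return depthwise + pointwise

def total_params(layers, per_layer=conv_params):
    """Sum the per-layer counts over all layers of a model."""
    return sum(per_layer(*layer) for layer in layers)

# Hypothetical stack: (kernels, kernel width, kernel height, input channels)
layers = [(32, 3, 3, 3), (64, 3, 3, 32), (128, 3, 3, 64)]
standard_total = total_params(layers)
separable_total = total_params(layers, depthwise_separable_params)
```

For this hypothetical stack the depthwise separable variant needs roughly one eighth of the weights of the standard one, which is why such layers are attractive for efficient models.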

Since predictions must be made at the spatial size of the original image, and convolution with a stride greater than 1 reduces the size of the image, the deconvolution layer (deconv) serves as a resampling layer. The size of the original image is usually restored by resampling with linear interpolation or nearest-neighbor interpolation. The deconvolution layer is shown in Figure 3.

As shown in Figure 3, the image to be resampled is padded with zeros. A convolution can have two possible outcomes: the size of the convolved features is reduced compared to the input, or it is increased or unchanged. The convolution after zero padding and the effective pixel convolution are shown in Figure 4.

As shown in Figure 4, there are two cases. In effective pixel convolution, the input image is not padded, so the feature map obtained after convolution is smaller than the input. In zero-padded convolution, zeros are added around the input image before convolution so that the feature map after convolution has the same size as the input image [20, 21]. For example, convolving a 4 × 4 input without padding with a 3 × 3 kernel gives an output of size 2 × 2, whereas padding a 5 × 5 image with zeros to 7 × 7 and then convolving it with a 3 × 3 kernel gives an output of size 5 × 5.
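The two cases can be checked with the usual output size rule for convolution. The helper below is a small sketch; its name and the strided example are illustrative and not part of the original method.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

valid = conv_output_size(4, 3)                          # 4 x 4 input, no padding
same = conv_output_size(5, 3, padding=1)                # 5 x 5 input, one ring of zeros
strided = conv_output_size(8, 3, stride=2, padding=1)   # a stride above 1 shrinks the map
```

Without padding the 4 × 4 input shrinks to 2 × 2, with one ring of zeros the 5 × 5 input keeps its size, and a stride of 2 halves an 8 × 8 input to 4 × 4.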

3.5. Model Calculation

The number of operations of the model reflects its time complexity and can be measured in FLOPs, that is, floating-point operations. In computer vision, one multiply-add is often counted as a single floating-point operation.

The number of operations in the lth layer is

Here, the two spatial factors in the formula are the width and the height of the output feature map.

In particular, in depthwise separable convolutional layers,

The total computational cost of the model is as follows:
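The operation counts described in this subsection can be sketched as follows, counting one multiply-add as one operation. The sketch assumes that a depthwise separable layer is a depthwise filter followed by a 1 × 1 pointwise convolution; the function names and the example layer are hypothetical.

```python
def conv_flops(out_w, out_h, num_kernels, k_w, k_h, in_channels):
    """Multiply-adds of a standard convolution layer."""
    return out_w * out_h * num_kernels * (k_w * k_h * in_channels)

def depthwise_separable_flops(out_w, out_h, num_kernels, k_w, k_h, in_channels):
    """Multiply-adds of a depthwise filter followed by 1 x 1 pointwise kernels."""
    depthwise = out_w * out_h * in_channels * (k_w * k_h)
    pointwise = out_w * out_h * num_kernels * in_channels
    return depthwise + pointwise

# Hypothetical layer: 48 x 48 output map, 64 kernels of 3 x 3 over 32 input channels
standard = conv_flops(48, 48, 64, 3, 3, 32)
separable = depthwise_separable_flops(48, 48, 64, 3, 3, 32)
ratio = separable / standard
```

For this layer the ratio equals 1/64 + 1/9, that is, one over the number of kernels plus one over the kernel area. The total cost of a model is the sum of these per-layer counts.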

4. Experiments on Convolutional Networks for Estimating Changes in Dance Shapes

4.1. Feature Extraction and Description

In form recognition research, feature extraction is usually the first step. The extracted features have a crucial impact on the accuracy of the body recognition results and the robustness of the body recognition method. The specific feature extraction process is shown in Figure 5.

As shown in Figure 5, several features are extracted from the dance videos to characterise the dance forms, taking the characteristics of dance into account. The directional gradient histogram (HOG) feature describes the local appearance and shape of the dancing body, and the optical flow direction histogram (HOF) feature describes its motion information. The study of dance form recognition should also take the influence of music on dance into account. In this paper, the audio features extracted from the audio files corresponding to the dance form videos are therefore combined with these two features for the recognition of dance forms.
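The idea behind the directional gradient histogram feature can be illustrated for a single image cell. This is a minimal sketch, assuming NumPy, nine unsigned orientation bins, and L2 normalisation; these settings and the synthetic ramp images are illustrative choices and are not stated in this paper.

```python
import numpy as np

def orientation_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient directions for one cell."""
    gy, gx = np.gradient(cell.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0   # unsigned orientation in [0, 180)
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 180.0), weights=magnitude)
    return hist / (np.linalg.norm(hist) + 1e-12)     # L2 normalisation

horizontal_ramp = np.tile(np.arange(8.0), (8, 1))    # brightness grows left to right
vertical_ramp = horizontal_ramp.T                    # brightness grows top to bottom
h_bin = int(np.argmax(orientation_histogram(horizontal_ramp)))
v_bin = int(np.argmax(orientation_histogram(vertical_ramp)))
```

The two ramps put all of their gradient energy into different bins, which shows how the histogram encodes local edge direction. A full descriptor would concatenate such histograms over many cells of the accumulated edge image.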

4.2. Preprocessing Operation

A study of dance videos shows that a given type of dance consists of a set of different dance forms in combination, and each dance form in the combination is fixed. Even when different people perform the dance forms, the differences are small, so each specific dance form can be regarded as having a fixed shape. For this reason, a method of accumulating edge features is proposed for dance form videos that are divided into equal segments. The multikernel learning feature fusion process is shown in Figure 6.

As shown in Figure 6, because each class of features has limited ability to discriminate dance forms on its own, multiple classes of features are fused so that they complement each other and improve the recognition ability of the classifier.
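The core of multikernel fusion is a combined kernel formed as a weighted sum of one base kernel per feature type. The sketch below assumes Gaussian base kernels and fixed weights; in multikernel learning the weights would be learned together with the classifier. All feature values, weights, and names are made up for illustration.

```python
import math

def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def fused_kernel(sample_a, sample_b, weights, gammas):
    """Weighted sum of one base kernel per feature type."""
    return sum(
        w * rbf_kernel(sample_a[name], sample_b[name], gammas[name])
        for name, w in weights.items()
    )

# Made-up feature vectors for two video clips (values are illustrative only)
clip_a = {"hog": [0.2, 0.8], "hof": [0.5, 0.5], "audio": [0.1, 0.9]}
clip_b = {"hog": [0.3, 0.7], "hof": [0.9, 0.1], "audio": [0.1, 0.9]}
weights = {"hog": 0.3, "hof": 0.4, "audio": 0.3}   # fixed here; learned in practice
gammas = {"hog": 1.0, "hof": 1.0, "audio": 1.0}

self_similarity = fused_kernel(clip_a, clip_a, weights, gammas)
cross_similarity = fused_kernel(clip_a, clip_b, weights, gammas)
```

Because each feature type has its own kernel, heterogeneous features such as appearance, motion, and audio can be combined without being forced into a single vector space.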

4.3. Improved Dance Body Model

Current human pose estimation methods often consider only how to improve the accuracy and generalization performance of the model and ignore significant model efficiency issues. This leads to models with poor portability and cost-effectiveness in practical use. Starting from a simple benchmark human pose estimation network, this paper studies the relationship between the scaling of network depth, width, and resolution and the gain in model accuracy, and obtains a series of efficient human pose estimation networks by compound scaling of these three dimensions. Deep neural networks are suitable for this task because they can approximate complex nonlinear mapping functions well, from arbitrary human images to joint positions, even in the presence of unconstrained human appearance, viewing conditions, and background noise.

In general, as the network becomes deeper and the number of parameters increases, the results become more refined, but the network structure becomes more complex and more computing resources are consumed. Network pruning was proposed to address this large amount of calculation. Among the many parameters of a neural network, some contribute little to the output. The purpose of network pruning is to find and delete these redundant parameters and connections so that they no longer take part in the forward and backward passes of the model, which reduces both the amount of calculation and the size of the model. Network pruning is shown in Figure 7.

As shown in Figure 7, the original complex network becomes sparse once some neurons and connections are deleted. Network pruning is an iterative process. Although the pruned neurons contribute little to the model, their removal still affects the final result as the effect accumulates through the calculation. To reduce this effect, pruning is usually alternated with fine-tuning, which is often referred to as iterative pruning, until the network reaches a balance between model performance and network size.
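The iterative procedure can be sketched with a simple pruning criterion. The sketch assumes that the weights with the smallest absolute value are the redundant ones, which is a common choice but is not stated in this paper; the weight values and the function name are hypothetical, and the fine-tuning step is only marked by a comment.

```python
def prune_smallest(weights, fraction):
    """Zero the given fraction of the still-active weights with smallest magnitude."""
    active = [i for i, w in enumerate(weights) if w != 0.0]
    count = int(len(active) * fraction)
    to_prune = set(sorted(active, key=lambda i: abs(weights[i]))[:count])
    return [0.0 if i in to_prune else w for i, w in enumerate(weights)]

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]   # made-up layer weights
history = []
for _ in range(2):
    weights = prune_smallest(weights, 0.25)
    # Fine-tuning of the surviving weights would happen here (omitted in this sketch)
    history.append(sum(1 for w in weights if w != 0.0))
```

Each round removes a quarter of the remaining weights, so the eight weights shrink to six and then to five active ones. In practice the loop would stop once accuracy and model size are balanced.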

4.3.1. Human Pose Estimation Model Based on Benchmark Backbone Network

For the final heat map regression, the feature maps are upsampled with deconvolution layers. Behind the baseline backbone network, three consecutive deconvolution layers with 256 channels and a convolution kernel size of 4 × 4 scale up the output feature map by a factor of 8 so that it matches the size of the labelled heat map. The number of output channels is the number of key points of the human body to be detected. The baseline human pose estimation model is shown in Figure 8.
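The factor of 8 can be checked with the output size rule for a deconvolution (transposed convolution) layer. The sketch assumes a stride of 2 and a padding of 1 for each 4 × 4 layer, which doubles the size per layer; these two values and the starting size of 6 are assumptions, since the text states only the kernel size and the overall scale factor.

```python
def deconv_output_size(size, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution: (size - 1) * stride - 2 * padding + kernel."""
    return (size - 1) * stride - 2 * padding + kernel

size = 6              # hypothetical width of the backbone output feature map
sizes = [size]
for _ in range(3):    # three consecutive deconvolution layers
    size = deconv_output_size(size)
    sizes.append(size)
```

Under these assumptions the feature map grows from 6 to 12, 24, and finally 48, which is eight times the starting size.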

As shown in Figure 8, the reference pose estimation model in (a) is called M0. M0 is trained on the MPII human pose dataset. Although the average accuracy of M0 exceeds 86%, its parameter count and computation amount, 7.0M and 4.42 GFLOPs, respectively, are far from the expected values. The parameters and computation of the backbone network of M0 and of the last four layers used for pose estimation are calculated separately, giving 3.60M and 0.34 GFLOPs for the backbone and 3.4M and 4.08 GFLOPs for the last four layers. The last four layers of the feature processing network thus account for 48.57% of the parameters and 92.31% of the computation of M0. To further simplify the network and reduce the parameter and computation overhead, the number of channels in the output feature maps is reduced step by step towards the number of joint points to be regressed during pose estimation. The model of Figure 8(b) adopts this stepwise decreasing scheme for the number of channels of the feature processing network.
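The shares quoted above follow directly from the reported figures, as the short check below shows. The variable names are illustrative; the numbers are those given in the text.

```python
backbone = {"params_m": 3.60, "gflops": 0.34}
head = {"params_m": 3.4, "gflops": 4.08}       # last four layers

total_params = backbone["params_m"] + head["params_m"]
total_gflops = backbone["gflops"] + head["gflops"]
head_param_share = round(100 * head["params_m"] / total_params, 2)
head_flops_share = round(100 * head["gflops"] / total_gflops, 2)
```

The last four layers hold just under half of the parameters but more than nine tenths of the computation, which is why the simplification targets the feature processing layers rather than the backbone.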

4.3.2. A Depth-Scaled Benchmark Human Pose Estimation Model

Extending the network depth is one of the methods commonly used in many convolutional neural networks. Using deeper convolutional neural networks can not only capture more complex and richer feature information but also transfer well to new tasks. However, deeper networks are also harder to train due to the vanishing gradient problem. Although the training problem can be alleviated by some techniques, as the network depth increases, the accuracy gain of the network will gradually decrease. For example, ResNet-152 has similar accuracy to ResNet-101 although the former has more layers. The test results of the depth scaling model on the MPII validation set are shown in Figure 9.

As shown in Figure 9, with the input image resolution fixed at 192 × 192, the depth of the base model is scaled to 0.5∼4 times the original depth. The results further show that the accuracy gains of extremely deep convolutional neural networks are not as good as those of relatively shallow convolutional neural networks.

4.3.3. Width-Scaled Baseline Human Pose Estimation Model

Small-scale models usually adopt the method of expanding the network width. A wider network can usually extract more fine-grained features and is easier to train. However, higher-level features generally require deeper networks, and wide but shallow networks often have difficulty extracting them, which causes the performance of the network to stagnate. The test results of the width scaling model on the MPII validation set are shown in Figure 10.

As shown in Figure 10, with the input image resolution fixed at 192 × 192, the width of the base model is scaled to 0.5∼2 times the original width. As the network width increases, the accuracy quickly saturates.

4.3.4. Resolution-Scaled Benchmark Human Pose Estimation Model

The higher the resolution of the input image, the easier it is for the convolutional neural network to extract fine-grained features. In image classification, the input resolution was 224 × 224 in early networks, and modern convolutional neural networks tend to use 299 × 299 or 331 × 331 for better accuracy. For object detection networks, higher resolution is even more essential. In top-down human pose estimation, the resolution of the input image is generally 128 × 128∼384 × 384; commonly used input resolutions on the MPII dataset are 256 × 256 and 384 × 384, and on the MSCOCO dataset they are 256 × 192 and 384 × 288. The test results of the resolution scaling model on the MPII validation set are shown in Figure 11.

As shown in Figure 11, with the network width and depth kept unchanged, the input resolution of the base model is gradually increased from 128 × 128 to 224 × 224. As the resolution of the input image increases, the accuracy of the network does increase, but the gain in accuracy gradually decreases.

5. Experiments of Dance Form

The experimental results of the proposed algorithm and of the three features on the four groups of dance combinations in the FolkDance dataset are shown in Table 1.

According to the experimental results in Table 1, among the four dance combinations in the FolkDance dataset, the similarity between dance shapes in the double flower combination with the step and in the inner flower combination is much smaller than in the towel flower combination and the piece flower combination. Both the towel flower combination and the piece flower combination contain similar dance shapes. The piece flower combination in particular has multiple similar shapes, and the same shape is performed in different directions, which further increases the difficulty of dance shape recognition. The recognition rate of HOG features in this group is the lowest among the four groups at 29.29%. The HOF feature, which is used in this paper to characterize the motion information of the dancing body, has recognition rates of 38.1% and 33.3% in the double flower combination with the step and the inner flower combination, which differ little from the 37.5% and 33.3% in the towel flower combination and the piece flower combination. The experimental results also show that the HOF feature is less affected than the HOG feature by the similarity between dance shapes in the four groups. The experimental results of the proposed method and the benchmark method on the four dance combinations of the FolkDance dataset are shown in Table 2.

As shown in Table 2, in the DanceDB dataset, the target and the background are easily confused. The recognition results obtained with each single feature are shown in Table 3.

As shown in Table 3, the optical flow direction histogram feature achieves the best recognition rate at 35.4%, and the recognition rate of the audio feature is 33.3%, which is higher than the 31.2% of the directional gradient histogram feature.

6. Conclusions

With these challenges in mind, this paper provides a detailed survey and analysis of outstanding research in the field of shape recognition. It focuses on feature extraction, representation, and shape recognition methods based on dance videos. First, each dance shape video in the dataset is divided into equal segments. The edge features of each divided video are accumulated individually, and the edge features of all videos in each segment are accumulated into a single image from which directional gradient histogram features are extracted. Finally, a set of directional gradient histogram features is used to represent the local appearance and shape of the dance forms. In this paper, directional gradient histogram features, optical flow histogram features, and audio features are extracted to identify dance forms through multifeature fusion. To address the problem of heterogeneous feature fusion, the three types of features are fused by multikernel learning to achieve dance shape recognition. Although this paper has made some progress in dance body recognition, the recognition rate for dance videos is still not very high. The main reason is that dance shapes are too complex, and existing methods are still not well suited to recognizing them. Dance body recognition therefore needs further research and improvement.

Data Availability

No data were used to support this study.

Conflicts of Interest

The author declares that there are no conflicts of interest.


This work was supported by the Key Projects of Art and Science in Shandong Province in 2020 (No. ZD202008105).