The purpose of video key frame extraction is to represent as much video content as possible with as few frames as possible, removing redundant frames and reducing computation, so as to facilitate quick browsing, content summarization, indexing, and retrieval of videos. In this paper, a method of dance motion recognition and video key frame extraction based on multifeature fusion is designed to handle complicated and changeable dance motions. Firstly, multiple features are fused and the similarity between frames is measured. Then, the video sequences are clustered by scene using a clustering algorithm. Finally, the key frames are extracted according to the minimum amount of motion. Quantitative analysis of the simulation results of different models indicates that the proposed model achieves high performance and stability. Progress in video clip retrieval technology can effectively promote the inheritance and development of dance, which is of theoretical significance and practical value.

1. Introduction

With the continuous progress of multimedia technology and computer networks, video and images play an increasingly important role in daily life, and the amount of video and image data is increasing geometrically [1]. Therefore, how to index video data and retrieve it quickly and accurately has become an urgent demand [2]. At first, humans communicated through sound and language, and then words and graphics appeared [3]. In modern society, the emergence of digital products such as digital cameras and digital video cameras has made images and videos a popular means of information exchange [4]. Video has developed into the main carrier of information dissemination, enriching people's lives and bringing opportunities for the vigorous development of the artificial intelligence and big data industries [5]. Among these technologies, computer vision plays a vital role. Motion recognition is a very challenging subject in current computer vision research. Its purpose is to analyze video data using image processing [6–8] and classification and recognition technology [9,10] to recognize human motion [11]. Effective segment retrieval of dance videos can help dance teachers choreograph dances and can assist dance teaching [12]. A breakthrough in dance video retrieval technology will effectively promote the inheritance and development of dance.

Every day, a large amount of video data is generated, and digital video is becoming more widely used in all aspects of life. With the large increase in video data, video database management systems have received a lot of attention and have considerable potential [13]. Because of the large amount of video data, the current standard practice is to first detect and segment the video into shots and then select several representative still image frames, referred to as key frames, from each shot to represent the visual content of the entire shot [14]. Once the video has been segmented into shots, key frame extraction analyzes the color, texture, and other characteristics of the image frames in a shot and finds the frame that best represents the shot content based on the relationship between frames [15]. Several common key frame extraction methods are available today, but most of them are effective only for specific types of video and cannot be applied to others, and the extracted key frames do not always represent the video's main content [16]. With key frames as the index, a key frame set is extracted from the video sequence, the original video content is summarized from high-level semantic information down to low-level visual features, and the original video content is retrieved. As a result, the number and quality of the extracted key frames have a direct impact on the efficiency and accuracy of the final search results [17]. Based on this, this paper proposes a multifeature fusion-based video key frame extraction method and applies it to dance action recognition.

Image retrieval is an early, well-known video retrieval technology. It distinguishes videos by manually attached text descriptions or numbers and uses these labels to search when retrieving a video [18]. The content-based video retrieval method retrieves massive video data in the database according to the relationship between video content and context, with the aim of providing visual feature algorithms that can automatically understand and recognize video in an unsupervised manner [19]. Content-based video retrieval extracts features from the lowest level up to high-level semantics, analyzes and processes the video, automatically builds an index of the video data, and supports retrieval and browsing according to that index [20]. Research on motion recognition for dance video not only helps dance professionals analyze dance videos but can also be used for teaching and for the protection and excavation of artistic and cultural heritage. In addition, research on dance video motion recognition will contribute to research on human motion recognition in real and complex environments, enriching the application fields of motion recognition technology [21]. If motion recognition technology is applied to the analysis of music and dance videos to obtain organically related music and dance motion segments, it can reduce the workload of dance professionals, facilitate the retrieval of music and dance video data, and make automatic dance arrangement systems more efficient.

Firstly, this paper shows the feature extraction process and model in the key frame extraction method of music and dance video and then applies the feature fusion and recognition method to the key frame extraction of music and dance video. The simulation results show that the method proposed in this paper has high performance and accuracy.

2. Related Work

Literature [22] extracts edge force field features with a Boosting classifier and designs a human posture estimation algorithm based on component detection. Literature [23] proposed an appearance model combining histogram and color features to estimate the pose of dance movements. However, because human posture changes are complex, it is difficult for traditional methods to achieve effective posture estimation. Therefore, methods based on deep learning [9,24,25] have gradually been adopted for human posture estimation. Literature [26] designed an hourglass-shaped neural network structure to extract multiscale features and identify dance movements. Literature [27] proposed a method to obtain the human skeleton map using part affinity fields. In addition, many dance movement recognition algorithms based on deep learning have been proposed in succession. For key frame extraction, literature [28] takes a fixed number of image frames at the first frame, the second frame, or equally spaced positions as key frames. Literature [29] selects multiple key frames according to significant changes between frames. First, the first frame of the shot is taken as a key frame, and then the difference between the previous key frame and each remaining frame is calculated. If the difference is greater than a certain threshold, another key frame is selected. Literature [30] proposed a method of extracting key frames based on shot activity. First, the histograms of the internal frames and reference frames are calculated, and then the activity measure is calculated. On the activity curve, the frames at local minima are taken as key frames.

These methods often do not consider the variation and complexity of the visual content in the shot. Most of the more sophisticated methods measure the similarity between any two frames in the shot using low-level features such as color, texture, and motion information, divide all frames in the shot into classes by thresholding or clustering, and then select representative frames from each class as key frames. Therefore, this paper proposes a method of dance motion recognition and video key frame extraction based on multifeature fusion to handle complex and changeable dance motions. Through preprocessing, classifying, and indexing video data, a practical, convenient, and economical video retrieval system is developed, and the mechanism for video information retrieval and browsing is improved.

3. Key Frame Extraction Method of Music and Dance Video

The relationship between video data units in terms of operation is unclear. The relationship between video segments is complex and difficult to define precisely, which introduces a slew of new issues into the setup and operation of a video database. It is difficult to process unstructured video data directly because it is difficult to measure the similarity between two unstructured data [31]. The successful application of motion recognition technology in other fields provides us with a sufficient theoretical foundation to apply it to dance video motion recognition. Currently, there are a large number of music and dance video materials, and professionals must spend a significant amount of time listening to and looking at these dance video materials, which is clearly inefficient. A specific action category is thought to have generated the image sequence in the video. As a result, the single-layer motion recognition method is primarily concerned with how to represent and match videos [32]. One or more frames of images that reflect the main information content in a group of shots and can express the shot content succinctly are known as key frames. Because each shot is taken in the same scene, each frame of images in the same shot contains a lot of the same information. Feature extraction is usually the first step in motion recognition research.

The feature extraction process mainly consists of three parts: the first part is to extract directional gradient histogram features by using the method of accumulating edge features; the second part mainly extracts the directional histogram features of optical flow from the dance data set; the third part extracts the corresponding audio stream files from the dance action videos and then extracts the audio signature features from the audio stream files. The specific feature extraction process is shown in Figure 1.

Key frames serve two purposes. First, because storage capacity is limited, only the key frames of a shot need to be stored, which achieves data compression. Second, key frames represent shots in the same way that keywords represent text in text retrieval, so video shots can be processed with image retrieval technology. Key frame extraction is made difficult by the variability of dance movements and the presence of many redundant movements. To extract a set of key frames that has little redundancy and can summarize the video content, this paper calculates the optical flow of the image sequence obtained by splitting the dance action video into frames. This optical flow method can match movements with large displacement and estimate optical flow even for smaller objects. At this stage, there is little difference in the visual characteristics and content of the image frames.

When a video stream is segmented into a series of semantically independent shots, the amount of data to be analyzed and processed is divided up, but the amount of image data in each shot is still huge. To reduce the amount of data in the video index, and more importantly to help users retrieve video information and improve retrieval efficiency, it is necessary to extract one or more key frames from a shot according to the complexity of the shot content [33]. Since a shot is composed of frame images that are continuous in time and highly correlated in content, the least correlated frames can be selected as the key frames of the shot so that they contain the most information. Specifically, let $f_i$ be the feature vector of the $i$-th frame of a shot with $N$ frames, and define the correlation coefficient between feature vectors $f_i$ and $f_j$ as

$$\rho_{ij} = \frac{C_{ij}}{\sqrt{C_{ii}\, C_{jj}}}. \tag{1}$$

Here, $C_{ij} = (f_i - m)^{T}(f_j - m)$, and $m$ is the mean vector. Select the $k$ frames with the smallest correlation as the key frames ($k \ll N$):

$$\{f_{r_1}, f_{r_2}, \ldots, f_{r_k}\} = \arg\min_{r_1, \ldots, r_k} \{\rho_{r_1 r_2}, \rho_{r_2 r_3}, \ldots, \rho_{r_{k-1} r_k}\}. \tag{2}$$

The main problem with the above method is that the amount of calculation is too large, because the correlation must be calculated for every pair of frames. The method is therefore simplified so that 1 to 3 frames are automatically extracted as key frames according to the characteristics of the shot.
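As a rough illustration of why the correlation-based selection is expensive, the following Python sketch computes the pairwise correlation of mean-centered feature vectors and then picks k weakly correlated frames. The function names and the greedy selection rule (start from the least correlated pair, then repeatedly add the frame whose largest correlation with the frames already chosen is smallest) are assumptions made for illustration; the text does not specify how the minimization is carried out.

```python
import math


def correlation_matrix(features):
    """Return rho[i][j], the correlation between mean-centered feature vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    centered = [[f[d] - mean[d] for d in range(dim)] for f in features]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    norms = [math.sqrt(dot(c, c)) for c in centered]
    rho = [[0.0] * n for _ in range(n)]
    for i in range(n):  # every pair of frames is visited: O(N^2) correlations
        for j in range(n):
            denom = norms[i] * norms[j]
            rho[i][j] = dot(centered[i], centered[j]) / denom if denom else 0.0
    return rho


def select_least_correlated(features, k):
    """Greedily pick k frame indices with small mutual correlation (assumed rule)."""
    n = len(features)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of frames")
    rho = correlation_matrix(features)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Seed with the least correlated pair of frames.
    chosen = list(min(pairs, key=lambda p: rho[p[0]][p[1]]))
    while len(chosen) < k:
        rest = [i for i in range(n) if i not in chosen]
        # Add the frame whose worst-case correlation with the chosen set is lowest.
        chosen.append(min(rest, key=lambda i: max(rho[i][c] for c in chosen)))
    return sorted(chosen)
```

In this sketch the nested loop over all frame pairs is the cost that the simplified method avoids.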

Let $f$ denote an image frame and $S = \{f_1, f_2, \ldots, f_N\}$ denote a shot with $N$ frames. Take the image frames $f_1$, $f_{N/2}$, and $f_N$ as candidate key frames. Define the distance between two images $f_i$ and $f_j$ as

$$D(f_i, f_j) = \sum_{x, y} \left| f_i(x, y) - f_j(x, y) \right|. \tag{3}$$

When extracting key frames, first calculate the distances between each pair of candidate key frames, namely, $D(f_1, f_{N/2})$, $D(f_1, f_N)$, and $D(f_{N/2}, f_N)$. Compare them with a predetermined threshold $T$. If they are all smaller than $T$, the candidates are relatively close, and $f_{N/2}$ is taken as the key frame. If they are all larger than $T$, there is a big gap between the candidates, and all three frames are taken as key frames. In other cases, the two images with the largest distance are taken as key frames.
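The simplified selection rule can be sketched in Python as follows. The sketch assumes that the three candidates are the first, middle, and last frames of the shot and that the distance is the sum of absolute pixel differences between grayscale frames; `frame_distance` and `select_key_frames` are hypothetical names.

```python
def frame_distance(a, b):
    """Sum of absolute pixel differences between two equally sized grayscale frames."""
    return sum(abs(pa - pb)
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def select_key_frames(shot, threshold):
    """Return the indices of 1 to 3 key frames chosen from three candidates.

    shot is a list of frames, each frame a list of pixel rows.
    Candidates (assumed): first, middle, and last frame of the shot.
    """
    n = len(shot)
    if n < 3:
        raise ValueError("the shot needs at least three frames")
    first, middle, last = 0, n // 2, n - 1
    dists = {
        (first, middle): frame_distance(shot[first], shot[middle]),
        (first, last): frame_distance(shot[first], shot[last]),
        (middle, last): frame_distance(shot[middle], shot[last]),
    }
    if all(d < threshold for d in dists.values()):
        return [middle]                    # candidates are close: keep one frame
    if all(d > threshold for d in dists.values()):
        return [first, middle, last]       # candidates all differ: keep all three
    return list(max(dists, key=dists.get))  # otherwise keep the farthest pair
```

For example, a static shot yields only the middle frame, while a shot whose end differs strongly from its beginning yields the first and last frames.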

Key frames are digital images that contain the most intuitive information summary for users of video retrieval systems. The summary should as much as possible express the main content of the shot, so that the user understands the content to be expressed in the video from the start. The envelope and music energy features of music will be extracted in this paper, and the music feature and entropy sequence will be fused to produce a music-related entropy sequence. Figure 2 depicts the main flow of video key frame extraction.

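Use the optical flow calculation method to obtain the motion features of the dance video:

$$E(w) = E_{\mathrm{color}}(w) + \gamma E_{\mathrm{gradient}}(w) + \alpha E_{\mathrm{smooth}}(w) + \beta E_{\mathrm{match}}(w, w_1) + E_{\mathrm{desc}}(w_1). \tag{4}$$

Here, $w$ is the optical flow field, $w_1$ is the correspondence field obtained by descriptor matching, $\alpha$, $\beta$, and $\gamma$ are adjustable weight parameters, and $E_{\mathrm{color}}$ encodes the brightness constancy assumption, which is applicable to both color and grayscale images. The influence of lighting is inevitable. Therefore, to reduce it, the gradient constancy constraint $E_{\mathrm{gradient}}$ is added, and the flow field is smoothed through $E_{\mathrm{smooth}}$. The last two terms incorporate descriptor matching, and the minimum of the energy is found through a variational model and optimization.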

Calculate the entropy value of each optical flow map in chronological order:

$$S = -\sum_{k=0}^{m-1} p_k \log p_k. \tag{5}$$

Here, $p_k$ represents the proportion of pixels with a gray value of $k$ in the image, $m$ represents the number of gray levels, and $S$ is the entropy value. The greater the amount of information contained in the image, the greater the entropy value.
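A minimal sketch of the entropy computation for one optical flow map, assuming the map has been quantized to integer gray values in the range 0 to `levels - 1` and that the logarithm is taken to base 2 (the base is not stated in the text). `image_entropy` is a hypothetical name.

```python
import math


def image_entropy(image, levels=256):
    """Shannon entropy of a grayscale image given as a list of pixel rows."""
    counts = [0] * levels
    total = 0
    for row in image:
        for value in row:
            counts[value] += 1
            total += 1
    if total == 0:
        raise ValueError("the image must contain at least one pixel")
    # Gray levels that never occur contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts if c)
```

A constant image has entropy 0, and an image that uses four gray levels equally often has entropy 2 bits, which matches the statement that more information gives a larger entropy value.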

In the process of correspondence between audio and dance movements, the length of dance video, the frame number of images, and the frame rate of video are known. Then, the standard deviation is used to carry out interval operation to obtain the corresponding audio value per second, and the audio value and entropy value sequence are subjected to feature fusion.
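The alignment and fusion step might look like the following Python sketch. It is only one possible reading of the description: the audio value of each second is taken to be the standard deviation of the samples in that second, each frame is mapped to its second through the frame rate, and the fusion rule (a weighted sum of the entropy and the peak-normalized audio value) is an assumption, since the text does not give the fusion formula. All function and parameter names are hypothetical.

```python
import statistics


def audio_value_per_second(samples, sample_rate):
    """Standard deviation of the audio samples within each full second."""
    seconds = len(samples) // sample_rate
    return [
        statistics.pstdev(samples[s * sample_rate:(s + 1) * sample_rate])
        for s in range(seconds)
    ]


def fuse_entropy_with_audio(entropy_per_frame, audio_per_second, fps, weight=0.5):
    """Weighted sum of frame entropy and normalized audio value (assumed rule)."""
    if not audio_per_second:
        raise ValueError("at least one second of audio is required")
    peak = max(audio_per_second) or 1.0  # avoid division by zero for silent audio
    fused = []
    for i, entropy in enumerate(entropy_per_frame):
        # Frame i belongs to second i // fps; clamp frames past the last full second.
        second = min(i // fps, len(audio_per_second) - 1)
        fused.append((1 - weight) * entropy + weight * audio_per_second[second] / peak)
    return fused
```

The result is one fused value per frame, which can serve as the music-related entropy sequence on which key frames are later thresholded.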

4. Feature Fusion and Recognition

Although multiple key frames can describe the information expressed by a shot more effectively than a single one, the likelihood of repeated or redundant video frames increases dramatically as the number of key frames grows. As a result, the focus and difficulty of key frame extraction is how to select key frames that not only represent the shot information but also improve retrieval efficiency and reduce the amount of video index data. Histogram of oriented optical flow features are used to describe the motion information of dance movements, while histogram of oriented gradient features are used to describe their local appearance and shape. Furthermore, the influence of music on dance should be considered when studying dance action recognition. All dancers perform with music playing in the background, and the type of music is related to the type of dance. Audio features therefore carry a lot of information, making them an important auxiliary feature that can help reduce the impact of self-occlusion on the recognition of dance movements. Figure 3 depicts the multiple kernel learning feature fusion process.

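Suppose the dance data set contains $N$ dance movement samples $x_i$ with category labels $y_i$, $i = 1, \ldots, N$. The $G$ kernel functions corresponding to the Histogram of Oriented Gradient (HOG) feature are defined as $k_g(x_i, x_j)$, $g = 1, \ldots, G$, the $F$ kernel functions corresponding to the Histograms of Oriented Optical Flow (HOF) feature are defined as $k_f(x_i, x_j)$, $f = 1, \ldots, F$, and the $M$ kernel functions corresponding to the audio signature feature are defined as $k_m(x_i, x_j)$, $m = 1, \ldots, M$. The linear combination of the kernel functions for the above three features can be expressed by the following formula:

$$K(x_i, x_j) = \sum_{g=1}^{G} \beta_g k_g(x_i, x_j) + \sum_{f=1}^{F} \beta_f k_f(x_i, x_j) + \sum_{m=1}^{M} \beta_m k_m(x_i, x_j). \tag{6}$$

Formula (6) satisfies $\sum_{g=1}^{G} \beta_g + \sum_{f=1}^{F} \beta_f + \sum_{m=1}^{M} \beta_m = 1$. $\beta_g$, $\beta_f$, and $\beta_m$ are the weights of the corresponding kernel functions.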
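The weighted combination of kernels can be illustrated with a short Python sketch that operates on precomputed kernel (Gram) matrices, one per kernel function. The name `combine_kernels` is hypothetical, and the non-negativity check on the weights is an assumption commonly made in multiple kernel learning rather than something stated in the text; only the requirement that the weights sum to 1 comes from the description of formula (6).

```python
def combine_kernels(kernel_matrices, weights, tol=1e-9):
    """Return the weighted sum of equally sized kernel matrices.

    kernel_matrices: list of square matrices (lists of rows), for example the
                     HOG, HOF, and audio kernels evaluated on the same samples.
    weights:         one weight per kernel matrix; must sum to 1.
    """
    if len(kernel_matrices) != len(weights):
        raise ValueError("one weight is required per kernel matrix")
    if any(w < 0 for w in weights):  # assumed constraint, not stated in the text
        raise ValueError("weights must be non-negative")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    n = len(kernel_matrices[0])
    return [
        [sum(w * k[i][j] for k, w in zip(kernel_matrices, weights)) for j in range(n)]
        for i in range(n)
    ]
```

The combined matrix can then be passed to any classifier that accepts a precomputed kernel; how the weights themselves are learned is outside the scope of this sketch.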

In order to express the content of the shot as completely as possible, conservative principles will be adopted when extracting key frames. When analyzing a video, if all the image frames at every moment are used, too many redundant image frames will be used. Therefore, people think of extracting key frames from thousands of image frames. The use of key frames greatly reduces the amount of data in video index and also provides an organizational framework for searching and browsing videos.

5. Result Analysis and Discussion

In the past, motion recognition methods that relied on a single feature could describe only one aspect of human motion in video and therefore could not describe it effectively. As a result, motion recognition research has turned to multifeature fusion. Combining different features describes human motion in video more comprehensively, resulting in a better recognition effect. The dance video retrieval database is made up of key frames and summarized video clips. Kernel functions play different roles in classification depending on the problem. The goal of multiple kernel learning is to improve the classification effect by assigning reasonable weights to different kernel functions and combining them to describe features more thoroughly. For the dance motion recognition method described in this paper, histogram of oriented gradient features, histogram of oriented optical flow features, and audio features are extracted from the data set. Figure 4 shows the overall shape of the audio signal obtained by envelope feature extraction from the dance video accompaniment music.

A dance video is composed of a series of dance movements, and coherent dance movements involve varying amounts of motion. The motion information in the dance video is expressed by optical flow, and the information in each optical flow map is then measured by entropy. The entropy sequence and the music features are fused to obtain a music-related entropy sequence. The key frames are then selected by a threshold; to set the threshold, the resulting key frames are compared with the key frame sets selected by several users, and the threshold best suited to the video is chosen. Feature point matching starts from the first frame of the query segment, and the frames in the query segment are sequentially compared with the frames in the key frame set. The key frame most similar to each frame of the query segment is then selected. Video clips can be described by one or several key frames. At any level of similarity matching, if there are dissimilar parts between the query segment and some subsegments, shots, or frames in the video segment, such segments are discontinuous. The more such dissimilarities there are, the lower the similarity between the two segments. An adaptive key frame extraction algorithm based on unsupervised clustering is used to extract key frames for different types of shots, as shown in Figure 5.

Dance videos are frequently created from different combinations of repetitive dance movements. In this process, similar dance movements appear in various types of dance videos, and choreography tends to follow this pattern. Segment retrieval of dance videos shows that the dance movements in dance videos are closely related to the music. At the start of merging, every video frame in the shot is treated as an independent subcluster, and pairwise similarity is calculated. The two most similar subclusters, that is, the pair with the smallest dissimilarity, are chosen and merged into a new subcluster. Merging is repeated in this cycle until an automatic stop rule is satisfied, which yields the final clustering result. Regardless of which algorithm is used to extract key frames, the features of the image frames in the video are extracted first, the features are then processed by the algorithm, and the key frames are finally extracted. Figure 6 shows a simulation comparison of image key frame extraction reliability optimization.
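The cyclic merging procedure can be sketched in Python as follows. The text does not state the automatic stop rule, so this sketch assumes a simple one: merging stops when the two closest subclusters are farther apart than `stop_distance`. Measuring similarity by the Euclidean distance between cluster centroids and taking the frame nearest the centroid as the key frame are likewise assumptions, and all names are hypothetical.

```python
import math


def centroid(features, cluster):
    """Mean feature vector of the frames whose indices are in cluster."""
    dim = len(features[0])
    return [sum(features[i][d] for i in cluster) / len(cluster) for d in range(dim)]


def cluster_frames(features, stop_distance):
    """Merge the closest pair of subclusters until the assumed stop rule fires."""
    clusters = [[i] for i in range(len(features))]  # each frame starts alone
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(centroid(features, clusters[a]),
                              centroid(features, clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > stop_distance:  # assumed stop rule: closest pair is too far apart
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def representative(features, cluster):
    """Index of the frame closest to the cluster centroid (candidate key frame)."""
    c = centroid(features, cluster)
    return min(cluster, key=lambda i: math.dist(features[i], c))
```

One key frame per final cluster can then be obtained with `[representative(features, c) for c in cluster_frames(features, stop_distance)]`.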

The motion features of video describing human motion information are often essential features in the research based on motion recognition, and the optical flow method is usually used to extract the features in some related motion recognition research at present. Texture is a value calculated from the texture image, which quantifies the features of gray scale changes inside the texture. Generally, texture features are related to the position, direction, size, and shape of the texture but have nothing to do with the average gray level. The purpose of feature extraction is to transform the spatial structure difference of random texture or geometric texture into the difference of feature gray value and use some mathematical models to describe the texture information of the image, including the smoothness, sparseness, and regularity of the image area. The linear regression curve is calculated according to the stepwise multiple linear regression equation, as shown in Figure 7.

FS denotes a method of searching that uses F7 layer features, HS denotes a method of searching that uses coarse searching, and HFS denotes a method of searching that goes from coarse to fine. Figure 8 depicts the effect of various algorithmic features on the retrieved video key frames. Clearly, HFS is the first to demonstrate good retrieval accuracy, implying that the video GIS data retrieval algorithm can obtain richer detail features of video GIS key frames. When retrieving more than 13 images, however, the retrieval effect of HS outperforms that of FS. This demonstrates that the binary code generated by this algorithm has a high level of discrimination and contains a lot of semantic information.

For a video containing multiple shots, the key frame fidelity is the average of the fidelity of each shot and the group of key frames. The purpose of extracting key frames is to use as few video frames as possible to represent as much video content as possible, so the higher the compression rate, the more effective this key frame extraction method. Make the estimate continuous at the threshold. The shrinking trend of the adaptive nonlinear curve is shown in Figure 9.

The purpose of key frame extraction is to replace the whole video with a few image frames, so that the viewer can quickly browse the content of the whole video and the amount of video data is reduced, making video processing more convenient and faster. Therefore, the effectiveness of a key frame extraction method should be considered first, and its computational complexity and efficiency should then be considered on the premise of effectiveness. The experimental results of the comparison between this method and the benchmark method on four dance combinations are shown in Figure 10. In most of the four dance combinations, the recognition rate of this method is higher than that of the benchmark method. In one combination, the recognition rate of this method is 53.9%, which is lower than the 59.9% of the benchmark method. In the other combinations, this method performs better than the benchmark method, especially when the similarity of dance movements in the towel and flower combination is too high.

When the dance movements are too complicated and there are similar movements and self-occlusion, the benchmark method based on trajectory feature fusion can not accurately represent the dance movements. The fusion algorithm in this paper can avoid the above influence to a certain extent, thus improving the recognition rate of dance, and it also verifies the effectiveness of the algorithm. The experimental results show that not only is the algorithm in this paper relatively simple in calculation, but also the extracted key frames can effectively summarize the main content of the video, realize video compression and storage, and lay a good foundation for video retrieval and video summarization. From the perspective of the future development trend of video data mining, the requirement for video data processing technology is getting higher and higher because of the contradiction between the large amount of calculation of video processing and the short retrieval time expected by users. Moreover, with the continuous development of information technology, it is more and more difficult to process video data with various features.

6. Conclusions

The dance contains far too many repetitive dance moves, which will slow down retrieval speed when searching. As a result, this paper proposes a method for extracting key frames from music and dance videos. First, the framed video's optical flow is calculated, and the video's motion features are extracted. Music and dance are inextricably linked. The corresponding audio in the video is then extracted, along with its features. This paper presents an unsupervised automatic video key frame extraction method. This method uses simple cyclic merging to cluster video frames and creates an automatic merging stop rule to stop merging when the clustering results are optimized. There is no need to set parameters or prior knowledge in advance for this video frame clustering process. The results of the experiments on multiple test videos show that the extracted key frames can effectively represent the video's main visual content.

Although this paper has made some progress in dance motion recognition, the recognition rate for dance video motion is currently low, owing to the complexity of dance motion and the inadequacy of existing methods for dance motion recognition. Because of the complexity of dance movements, the dance data set created at this stage only considers solo dance, ignoring changing stage scenes and other factors. Future research will examine how to apply music theory to create a more accurate mapping between music and dance movements in order to improve dance movement recognition accuracy [33].

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The author declares no conflicts of interest.