Abstract

This paper applies deep learning-based image visualization technology to extract, recognize, and analyze human skeleton movements and evaluates the effect of a deep learning-based human-computer interaction (HCI) system, taking dance education as the application scenario. First, the Visual Geometry Group Network (VGGNet), a Convolutional Neural Network (CNN), is optimized and used to extract human skeleton movements based on OpenPose. Second, the Long Short-Term Memory (LSTM) network is optimized and used to recognize human skeleton movements. Finally, an HCI system for dance education is designed based on these extraction and recognition methods. The results demonstrate that the highest extraction accuracy is 96% and that the average recognition accuracy across different dance movements is stable, which verifies the effectiveness of the proposed model. The recognition accuracy of the optimized F-Multiple LSTMs increases to 88.9%, making the model suitable for recognizing human skeleton movements. The interaction accuracy of the dance education HCI system built on deep learning-based visualization technology reaches 92%, and the overall response time lies between 5.1 s and 5.9 s, indicating good instantaneity. Therefore, deep learning-based image visualization technology has considerable potential in human movement recognition, and combining deep learning with HCI plays a significant role.

1. Introduction

Modern technologies, such as the Internet and multimedia technology, have developed rapidly. Multimedia systems based on computer information technology have been applied in many fields. Intelligent interactive multimedia is a new platform built on the foundation of computer technology [1, 2]. However, traditional multimedia systems are often independent and mechanized and therefore cannot adequately meet people’s needs. Consequently, human-computer interaction (HCI) technology emerged. Through HCI, people can interact with the multimedia engine and obtain the required media information quickly and efficiently. Besides, HCI technology can promote the accurate transmission of information and improve work efficiency [3–5], which has triggered a research boom. In daily life, people can directly express their thoughts or emotions through movements. Therefore, movement recognition and analysis have become a critical direction in the field of HCI and attracted widespread attention, which has led to the wide popularity of human movement-based recognition technology [6, 7].

With the advancement of social informatization, human beings have an increasing requirement in the intelligence level of computers. HCI no longer only depends on the original hardware-based interaction, and some relatively more intelligent interaction methods gradually appear in mass life. The face recognition, gesture recognition, and speech recognition systems constructed by machine learning technology have established a bridge between humans and computers [8]. The emergence of these convenient interaction modes has become a major development trend in the field of HCI. The development of HCI mode aims to enable the computer to serve and adapt to human needs well, so HCI focuses on humans instead of adapting to the computer. Therefore, the friendly interaction between robots and humans is extremely vital in the research of machine learning and HCI. Some scholars focus on the importance of emotional factors related to the interaction between people and computer systems, when exploring the people-centered interaction systems [9]. Motion recognition technology is essentially a classification problem close to machine learning [10].

The above research results imply that the development of the Internet and multimedia technology has allowed multimedia systems to be applied successfully in many fields. The friendly interaction between robots and human beings plays an extremely important role in the study of machine learning and HCI, and deep learning shows excellent application potential in feature extraction and HCI. Therefore, a combination of deep learning and HCI is proposed to extract and recognize human skeleton movements and to expand the application field of HCI. The ultimate research purpose is to significantly reduce time costs and the dependence on traditional equipment and facilities while improving human-computer collaboration and interaction. To this end, the Visual Geometry Group Network (VGGNet) and the long short-term memory (LSTM) network are optimized and combined with deep learning-based image visualization technology in an HCI system. The final HCI system and the results on the recognition and analysis of human dance movements provide a reference for related research.

The contributions based on the extraction and recognition of human dance movements are as follows:
(1) An optimized VGGNet human skeleton movement extraction algorithm is proposed. Its extraction accuracy reaches 96%, which is significantly better than traditional algorithms.
(2) An optimized multiple LSTM human skeleton movement recognition algorithm is proposed. Its recognition accuracy reaches 88.9%, which is significantly better than traditional LSTMs.
(3) An HCI system based on image visualization is designed, and the interaction accuracy rate reaches 92%.
(4) A reference is provided for more in-depth human movement extraction and recognition, and the application range of deep learning methods in HCI systems is expanded.

2. Literature Review

2.1. Current Situation of Deep Learning in Dance Education

Dance is an important intangible cultural heritage. Dimitropoulos et al. introduced a research project (i-juries) of intangible cultural heritage, emphasizing the importance of 3D dance interaction [11]. Grammalidis et al. introduced an intangible cultural heritage dataset, i-treasure, including audio and other data information [12]. Doulamis et al. considered that intangible cultural heritage was an important source of cultural diversity, but there were few electronic documents of intangible cultural heritage. According to the “Terpsichore” project funded by the Horizon 2020 of the European Union, they proposed a high-level method based on the digitization of cultural assets [13]. Doulamis et al. discussed the digitization of tangible and intangible cultural heritage and proposed that 3D digital assets would develop into a part of augmented, virtual, and mixed reality experience [14]. Lv studied the application of virtual reality (VR) in 3D environment and HCI system and revealed the excellent performance of VR technology in 3D digitization [15]. The digitization of intangible cultural heritage has become an inevitable development trend, so has dance.

On the recognition and extraction of dance movements, Rallis et al. proposed a dance summarization method based on 3D capture data of the Vicon motion capture system. They analyzed and studied the automatic extraction of dance patterns. This method was a hierarchical scheme based on the temporal and spatial changes of dance characteristics [16]. Aiming at the preservation and dissemination of dance performance, Aristidou et al. proposed a dance action recognition framework based on Laban analysis, which used feature space to capture different dance action components and pointed out a new direction for dance evaluation [17]. In terms of editing and synthesis of dance movements, Aristidou et al. used Laban analysis, radial basis function regression, and interpolation methods to map the movement features and emotional features in two directions and realized the stylization of high dynamic dance movements [18]. To sum up, there is a gap between research on human action recognition and research on HCI, and there is little research on action recognition in dance education.

2.2. Research Progress of HCI

Experts and scholars have made great efforts on deep learning and HCI. Bhardwaj et al. applied support vector machine and artificial neural network classifier to fingerprint recognition. By integrating the relevant dynamic information from hundreds of biometric scanning sample datasets, they found that the accuracy of fingerprint dynamic recognition by fusing the deep learning method was improved by 5.3% [19]. Israelsen and Ahmed analyzed the influence of artificial intelligence (AI) agent in HCI and machine learning based on the research of algorithm-guaranteed AI agent and discussed the advantages and disadvantages of different methods [20]. Based on similarity embedding, Spathis et al. proposed an interactive dimension reduction framework (iSP). In this framework, user interaction formed different goals. Gradient descent was used for learning, and an end-to-end composition structure could be trained. By evaluating the framework in two interaction scenarios, they found that the framework could be applied to semisupervised learning, transfer learning, and adaptive learning in interaction field [21]. Using interactive machine learning, Wu et al. studied local decision-making in feature selection of emotion classification task and analyzed the influence of interactive machine learning tools on feature selection results [22]. To improve the performance of multimodal image retrieval by using unmarked and marked multimodal web objects, Xu et al. proposed a semisupervised multiconcept retrieval method based on deep learning (SMRDL). Different from the traditional method of using multiple independent concepts in multiconcept semantic query, the proposed method regarded multiple concepts as a whole scene, which was used for multiconcept scene learning of unimodal retrieval. The comprehensive experimental results on two datasets of MIR flickr2011 and NUS-WIDE indicated that the proposed method was superior to some of the latest methods [23]. 
Long and Zhao held that intelligent teaching mode overcame the shortcomings of traditional online and offline teaching. However, there were some shortcomings in the real-time feature extraction of teachers and students. In view of this, they used particle swarm image recognition and deep learning technology to process the video teaching image of intelligent classroom. To overcome the shortcomings of premature convergence of standard particle swarm optimization (PSO) algorithm, they proposed an improved multi PSO algorithm strategy. Moreover, to improve the premature problem of PSO in search performance, they combined the algorithm with the useful attributes of other algorithms to improve the diversity of particles in the algorithm, enhance the global search ability of particles, and achieve effective feature extraction [24]. To sum up, there are many research results on the application of deep learning in HCI, but few studies on the combination of the two for dance action extraction.

3. Methods

In computer vision and image processing, movement recognition is a crucial component. However, some problems are found in its research and applications. For example, when extracting and recognizing human skeleton movements, bone modeling is challenging, movement amplitude can affect the extraction results, and feature extraction can be insufficient, increasing the difficulty in analyzing and classifying human movements. Deep learning has developed rapidly. CNN shows excellent performance in feature extraction, while LSTM has significant performance in processing time sequence problems. Therefore, CNN and LSTM are introduced to extract and recognize human skeleton movements. However, traditional CNN models have lots of parameters, using a large convolution kernel to extract features. Traditional LSTM models never consider the connection of multiple different movement times in a long time. Hence, the CNN-based VGGNet is introduced and optimized in parallel. In the meantime, LSTM is improved and optimized before extracting and recognizing human skeleton movements.

3.1. Optimization of VGGNet CNN Model

The deep learning-based CNN is inspired by the theory of the cat’s visual cortex. Compared with the traditional neural network, CNN extracts the object’s local feature information through the convolution layer, a critical CNN component that contains multiple convolution kernels [25]. VGGNet is a typical CNN. Unlike traditional CNNs that employ large convolution kernels to extract features, VGGNet utilizes several small convolution kernels for feature extraction. Hence, VGGNet can extract richer features and reduce the calculation amount significantly [26–28].
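The saving from stacking small kernels can be checked with simple arithmetic. The sketch below compares the weight count of two stacked 3 × 3 convolution layers with that of a single 5 × 5 layer covering the same receptive field. The 3 × 3 and 5 × 5 sizes and the 64-channel width are illustrative assumptions in the usual VGG style, and the function names are hypothetical.

```python
from typing import Tuple


def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Number of weights in one 2D convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def stacked_small_vs_large(channels: int) -> Tuple[int, int]:
    """Compare two stacked 3x3 layers with one 5x5 layer.

    With stride 1 and no pooling in between, two 3x3 layers see the same
    5x5 input neighbourhood as a single 5x5 layer.
    """
    small = 2 * conv_params(3, channels, channels)
    large = conv_params(5, channels, channels)
    return small, large


small, large = stacked_small_vs_large(64)
print(small, large)  # 73728 102400
```

Under these assumptions the stacked small kernels need 18C² weights against 25C² for the large kernel, and they add an extra nonlinearity between the two layers.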

To improve the accuracy of VGGNet, the features extracted by the convolution layers are fused, which yields the parallel CNN [29–31].

Extractions of input image features before fusion are as follows:

In (1) and (2), and represent features. The feature information extracted by the two small convolution kernels is fused via the feature fusion module. The convolution operation is denoted as . The feature map after fusion processing can be written as follows:

The process of fusion of the above feature maps , , and can be expressed as follows:

The above fusion processing can enrich and diversify the extracted features. Graphics Processing Unit (GPU) processing is utilized for training VGGNet to compare the performance of the CNN-based VGGNet before and after optimization. Images in the training set are taken by the Kinect camera and the host computer program. The selected human movements include clapping, slapping, standing, picking up objects, and sitting down.
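The parallel feature fusion described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the network used here: it runs two 3 × 3 convolution branches on the same input and fuses them by stacking the feature maps along a channel axis. The fusion operator, the kernel size, and all function names are assumptions, since the exact fusion expressions are not reproduced in this section; an element-wise sum would be an equally plausible reading.

```python
import numpy as np


def conv2d_same(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def fuse_parallel_features(image, kernel_a, kernel_b):
    """Run two small-kernel branches in parallel and fuse their outputs.

    Fusion is channel-wise stacking here (an assumption).
    """
    feat_a = relu(conv2d_same(image, kernel_a))
    feat_b = relu(conv2d_same(image, kernel_b))
    return np.stack([feat_a, feat_b], axis=0)


rng = np.random.default_rng(0)
image = rng.random((8, 8))                 # hypothetical 8x8 input patch
kernel_a = rng.standard_normal((3, 3))
kernel_b = rng.standard_normal((3, 3))
fused = fuse_parallel_features(image, kernel_a, kernel_b)
print(fused.shape)  # (2, 8, 8)
```

Stacking keeps the two branches separable for later layers, whereas summation would keep the channel count fixed; which one applies depends on the fusion module actually used.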

Movement capture includes the following steps: (1) the demonstrator makes different movements in front of the Kinect camera and (2) Kinect is utilized for evaluating human skeleton changes in real-time. Several demonstrators complete the collection of the entire training set. One thousand images are collected for each movement. Finally, a total of 5,000 human skeleton images under different movements are obtained. The skeleton images affected by the environment are removed, and the remaining human skeleton images are retained. These images train the VGGNet before and after feature fusion. Accuracy and loss rates are taken as evaluation indicators [32, 33]. Parameter settings of the entire training process are shown in Table 1.

3.2. Extraction Algorithm of Human Skeleton Movements

Traditional human pose estimation algorithms extract human skeleton features in a top-down manner: each skeleton extraction target requires a detector, and each movement is estimated separately. Therefore, traditional algorithms suffer from problems such as false detection, long running time, and poor instantaneity, and they cannot meet the demands. Here, the OpenPose open-source library [34] is taken as the basis, the optimized VGGNet serves as the network architecture, and histogram equalization [35, 36] is introduced to suppress noise, thereby extracting the 2D features of the human skeleton.

OpenPose is an open-source library for skeleton extraction released in 2017. Unlike traditional pose estimation algorithms, OpenPose uses a bottom-up method: the joint points of all human body parts are detected first, and the nodes are then connected to obtain the skeletons. This significantly shortens the running time while also improving detection accuracy. Figure 1 illustrates the video information processing by OpenPose.

The unique convolution kernel structure in the CNN can learn spatial information in human actions, and more useful information can be obtained by different convolution kernels. Compared with traditional machine learning methods, CNN is more systematic and comprehensive in task learning with better performances. Unlike traditional CNN models, the VGGNet model extracts features by massive small convolution kernels as a typical CNN model. It can extract more features and reduce calculation amount with satisfactory generalization performance. The optimized VGGNet consists of three parts. The first part processes the image data via the input layer and employs CNN to extract the feature values of body parts. Then, the extracted feature values enter the other two parts for critical point positioning and the body-based 2D vector field positioning. The input to output via the neural network spends a total of k periods, and the information input to the current period is the output feature value obtained through the learning process of k − 1. The optimized VGGNet’s output is formed by a 2D vector field of crucial body parts and a confidence map. As the calculations increase, the candidate human body parts and the corresponding structure division become apparent via this cyclic process. Here, CNN’s first convolutional layer is a double convolutional layer, and each contains 64 convolution kernels in the size of . Simultaneously, an activation layer and a normalization layer are added after each convolutional layer to process the nonlinear data. A pooling layer is added after the normalization layer to reduce dimensionality and prevent overfitting, located between the two convolutional layers. The Dropout layer comes after the second pooling layer. The Part Affinity Fields (PAFs) [37, 38] are adopted to predict all the human body key points in the images.

In summary, extracting human skeleton information includes the following two processes: first, adding the corresponding image data to the input layer of VGGNet and, second, learning the feature value F according to the body parts. The 2D vector field of output corresponding to the human body in the k= 1 period is

In (5) and (6), represents the set of 2D position confidence maps, and denote the set parameters, refers to the period corresponding to the feature value, and signifies the set of 2D vector fields.

The solution to the confidence in the confidence map can be presented as follows:

In (7) and (8), represents the position confidence atlas and denotes the output image in the corresponding period. Meanwhile, refers to the number of people in the input image, stands for the body part’s serial number, and is a constant.

The joint point position in the 2D vector field is judged according to

In (9) and (10), represents the pixel of the prejudgment part and denotes the unit vector. On this basis, the average value of the 2D vector field can be written as follows:

In (11), represents the number of all points of the pixel on the link . After testing, candidate positions on PAFs should be determined first. Then, all connected line segments are determined.
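For reference, the quantities discussed in (5) to (11) match the way the original OpenPose formulation is commonly written. The block below restates that formulation; the symbols are taken from it and are an assumption about the notation intended here. In particular, t denotes the stage (the "period" above), j the body part, k the person, and c the limb.

```latex
% Stage-wise prediction of confidence maps S and part affinity fields L
S^{1} = \rho^{1}(F), \qquad L^{1} = \phi^{1}(F)
S^{t} = \rho^{t}\!\left(F, S^{t-1}, L^{t-1}\right), \qquad
L^{t} = \phi^{t}\!\left(F, S^{t-1}, L^{t-1}\right), \quad t \ge 2

% Ground-truth confidence map of part j for person k, and its aggregation
S^{*}_{j,k}(p) = \exp\!\left(-\frac{\lVert p - x_{j,k} \rVert_2^{2}}{\sigma^{2}}\right),
\qquad S^{*}_{j}(p) = \max_{k} S^{*}_{j,k}(p)

% Ground-truth part affinity field of limb c for person k
L^{*}_{c,k}(p) =
\begin{cases}
v, & \text{if } p \text{ lies on limb } (c,k) \\
0, & \text{otherwise}
\end{cases}
\qquad
v = \frac{x_{j_2,k} - x_{j_1,k}}{\lVert x_{j_2,k} - x_{j_1,k} \rVert_2}

% Average over all people whose limb c covers pixel p
L^{*}_{c}(p) = \frac{1}{n_{c}(p)} \sum_{k} L^{*}_{c,k}(p)
```

Here σ is a constant controlling the spread of each peak, and n_c(p) counts the nonzero vectors at pixel p across all people.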

The OpenPose open-source library can achieve excellent results of skeleton extraction. However, the image noise limits feature extraction. Therefore, histogram equalization is introduced, which enhances the contrast and reduces the noise by stretching the distribution range of pixel intensity. Videos based on image visualization are processed by Compute Unified Device Architecture (CUDA) to ensure the instantaneity of information extraction. Eighteen key part points are chosen as the input of skeleton movement extraction while utilizing the OpenPose open-source library. A variable-view movement database containing 40 kinds of aerobic exercises is chosen for analyzing algorithm extraction effects. Eight different movements are chosen for analysis, with the classification accuracy as the primary evaluation indicator.
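The contrast-stretching effect of histogram equalization can be illustrated with a minimal NumPy sketch for 8-bit grayscale frames. The function name and the synthetic test frame are hypothetical, and the implementation is the textbook cumulative-histogram mapping, not necessarily the exact variant used in this work. OpenCV provides `cv2.equalizeHist` for the same purpose.

```python
import numpy as np


def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Histogram-equalize an 8-bit grayscale image via its cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]      # CDF value at the darkest occupied level
    total = gray.size
    if total == cdf_min:           # constant image: nothing to stretch
        return gray.copy()
    # Map intensities so that the cumulative distribution becomes roughly linear.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]


# Low-contrast synthetic frame (hypothetical data): intensities within 100..139.
rng = np.random.default_rng(0)
frame = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)
equalized = equalize_histogram(frame)
print(frame.min(), frame.max(), equalized.min(), equalized.max())
```

The narrow input range is spread over the full 0 to 255 range, which is the stretching of the pixel intensity distribution referred to above.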

Here, the optimized 3D CNN (O-3DCNN) algorithm, Spatial-Temporal CNN(ST-CNN) algorithm, and optimized Deformable Part Model CNN (ODPM-CNN) are compared with the optimized VGGNet to prove its effectiveness.

3.3. Skeleton Movement Recognition Based on Optimized LSTM

Traditional neural networks have major limitations in practical application. For example, in time series processing, traditional methods perform well only in short-time series processing. In the separate data processing, the good learning and understanding abilities enable CNN to be applied in practice. However, CNN has limitations in the sequence problem processing related to time correlation. LSTM is a unique Recurrent Neural Network (RNN). LSTM can solve the long-term dependence problem in RNN applications, which has an inseparable relationship with the particular gate structure of LSTM, explicitly referring to input gates, forget gates, and output gates. The input data are calculated according to the following equation:

In (12), represents the weight, corresponds to the deviation, and denotes the output value corresponding to the time . Meanwhile, refers to the input value, represents the activation function, and stands for the forget gate. Moreover, the memory information can be displayed as follows:

In (13), represents deciding whether to memorize the information at the time and means the input gate.

Finally, the output gate can be expressed as follows:

Although LSTM has many excellent performances, LSTM does not consider the correlation and feature influence between different skeleton movements over a long time. Hence, the LSTM model only depends on the human skeleton joints while recognizing human skeleton movements, resulting in limitations to recognizing human skeleton movements. Therefore, the idea of time integral is introduced. First, the pre-acquired skeleton sequence information is transformed, such as translation and rotation. In this way, all movements can obtain their relative coordinates. If the human skeleton movement has differences due to different times, a multiple LSTM model is used to extract and fuse features [39]. Finally, multiple types of movements are captured by integrating multiple LSTMs. Figure 2 reveals the overall implementation framework of the optimized multi-LSTM human skeleton movement recognition.
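The gate structure described in this subsection can be made concrete with a minimal NumPy sketch of one step of a standard LSTM cell. It follows the textbook formulation (forget, input, and output gates acting on a cell state), which is assumed to correspond to (12) to (14). The weight layout, the hidden size of 32, and the random inputs are hypothetical choices for illustration; the 18 key points and 24 frames are the values used elsewhere in this paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    W maps the concatenation [h_prev, x_t] to the four gate pre-activations
    (forget, input, candidate, output), stacked along the first axis.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:hidden])                 # forget gate
    i_t = sigmoid(z[hidden:2 * hidden])        # input gate
    g_t = np.tanh(z[2 * hidden:3 * hidden])    # candidate memory
    o_t = sigmoid(z[3 * hidden:4 * hidden])    # output gate
    c_t = f_t * c_prev + i_t * g_t             # updated cell state
    h_t = o_t * np.tanh(c_t)                   # new hidden state (cell output)
    return h_t, c_t


# 18 key points x 2 coordinates per frame, 24 frames; 32 hidden units is hypothetical.
rng = np.random.default_rng(0)
input_size, hidden_size, frames = 36, 32, 24
W = rng.standard_normal((4 * hidden_size, hidden_size + input_size)) * 0.1
b = np.zeros(4 * hidden_size)
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
sequence = rng.standard_normal((frames, input_size))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (32,) (32,)
```

A multiple-LSTM model in the sense described above would run several such cells over differently transformed or time-integrated versions of the sequence and fuse their final hidden states; that fusion step is not shown here.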

Extraction accuracy and loss entropy of various LSTMs are compared to verify the effectiveness of the optimized multi-LSTM human skeleton movement recognition algorithm. Specifically, algorithms selected for comparison include the single-LSTM and double-LSTM. A skeleton sequence input into the optimized F-Multi-LSTM contains 24 frames, among which each frame consists of multiple 2D skeleton points. During analysis, the Adam optimization algorithm is used as the optimization tool, and the initial learning rate is set to 10⁻⁴, in an effort to achieve the model’s global optimization. The single-LSTM has one input layer, while the double-LSTM has two input layers. The input is assumed to be a sentence. In double-LSTM, one side of the input corresponds to the word at the beginning of the sentence, and the other side corresponds to the word at the end of the sentence.
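The translation and rotation applied to the skeleton sequences before recognition can be sketched as follows. The choice of the neck as the origin, the alignment of the shoulder line with the x-axis, and the joint indices (taken from the 18-point layout commonly used with OpenPose) are assumptions for illustration; the text only states that the sequences are translated and rotated to obtain relative coordinates.

```python
import numpy as np

NECK, R_SHOULDER, L_SHOULDER = 1, 2, 5  # assumed indices in an 18-point layout


def normalize_sequence(seq: np.ndarray) -> np.ndarray:
    """Translate and rotate a (frames, joints, 2) skeleton sequence.

    Each frame is shifted so that the neck is the origin and then rotated so
    that the right-to-left shoulder line lies along the positive x-axis.
    """
    out = np.empty_like(seq, dtype=float)
    for t, frame in enumerate(seq):
        centered = frame - frame[NECK]
        dx, dy = centered[L_SHOULDER] - centered[R_SHOULDER]
        angle = np.arctan2(dy, dx)
        cos_a, sin_a = np.cos(-angle), np.sin(-angle)
        rotation = np.array([[cos_a, -sin_a], [sin_a, cos_a]])
        out[t] = centered @ rotation.T      # rotate every joint by -angle
    return out


# Hypothetical sequence: 24 frames, 18 key points, 2D coordinates.
rng = np.random.default_rng(0)
seq = rng.random((24, 18, 2)) * 100.0
normalized = normalize_sequence(seq)
print(normalized.shape)  # (24, 18, 2)
```

After this step the same movement yields the same coordinates regardless of where the dancer stands or how the body is turned in the image plane, which is the purpose of using relative coordinates.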

3.4. Design of HCI System Based on Dance Education

Dance education based on physical education helps improve students’ physical fitness and transforms traditional sports teaching. According to the above image visualization-based extraction and recognition method of human skeleton movements, the Web3D engine-oriented deep movement recognition system’s functional modules are shown in Figure 3.

The system based on dance education and dance movement recognition consists of the front-end interactive function module and the back-end recognition function module. The former is a 3D world built on Web Graphics Library (WebGL) technology, including data processing of video images, 3D processing, and the HCI submodule. The latter consists of two subfunction modules, namely, node recognition and classification of human dance movements.

In this HCI system, the OpenPose open-source database and optimized VGGNet model can estimate facial expressions, positioning of limbs and trunk, and people’s feature information. This human skeleton extraction method can identify the critical points of the human body, thereby employing the optimized F-Multi-LSTM skeleton movement recognition network to determine the classification and label attribution of human dance movements. The designed system is based on recognizing and analyzing dance movements. Eight types of dance movements are analyzed and discussed, including stepping and knee lift (S), crouching (C), reaching out and jumping (R), turning and clapping (T), straight punch (B), arm circles (A), jumping (J), and high knee (H).

In the HCI system, the dance pose estimation module and dance movement classification module in the background recognition module are the keys. Accuracy and response time are evaluation indicators to analyze the chosen dance movements, thereby testing the feasibility of the HCI system based on dance education and movement analysis and recognition.
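The two evaluation indicators can be computed as sketched below: accuracy per dance movement from true and predicted labels, and response time as the wall-clock duration of a call. The label lists are hypothetical data, and the function names are illustrative only; in the real system the timed call would be the full recognition pipeline.

```python
import time
from collections import Counter

MOVEMENTS = ["S", "C", "R", "T", "B", "A", "J", "H"]  # the eight movement labels


def per_movement_accuracy(true_labels, predicted_labels):
    """Fraction of correctly recognized samples for each movement that occurs."""
    total = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {m: correct[m] / total[m] for m in MOVEMENTS if total[m] > 0}


def timed_call(func, *args):
    """Return the function result together with its elapsed time in seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Hypothetical labels, for illustration only.
true_labels = ["S", "S", "C", "C", "B", "B", "B", "H"]
predicted = ["S", "C", "C", "C", "B", "B", "B", "J"]
accuracy, elapsed = timed_call(per_movement_accuracy, true_labels, predicted)
print(accuracy)  # {'S': 0.5, 'C': 1.0, 'B': 1.0, 'H': 0.0}
```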

3.5. Data Preprocessing

The image is preprocessed in two steps to better meet the needs of behavior recognition. First, the image is uniformly scaled to 432 × 368 about its center point. Second, the image is denoised. Noise is common in images, and Gaussian noise is the most common type. A Gaussian filter is therefore used to suppress the Gaussian noise in the image effectively. The one-dimensional Gaussian distribution and two-dimensional Gaussian distribution are shown in (15) and (16), respectively. The Gaussian filter function in the open-source computer vision library (OpenCV) is used to realize image denoising, and the relevant parameters are optimized.
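The Gaussian filtering step can be sketched with NumPy alone. The kernels sample the standard zero-mean Gaussian densities, G(x) = exp(−x²/(2σ²)) / (√(2π)σ) in one dimension and G(x, y) = exp(−(x² + y²)/(2σ²)) / (2πσ²) in two dimensions, and are renormalized to sum to one. The kernel size of 5, σ = 1, the reading of 432 × 368 as width × height, and the synthetic noisy frame are assumptions for illustration. In practice, `cv2.GaussianBlur` in OpenCV performs this filtering.

```python
import numpy as np


def gaussian_kernel_1d(size: int, sigma: float) -> np.ndarray:
    """Sampled 1D Gaussian, normalized to sum to 1; size must be odd."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    x = np.arange(size) - size // 2
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def gaussian_kernel_2d(size: int, sigma: float) -> np.ndarray:
    """2D Gaussian kernel as the outer product of two 1D kernels (separability)."""
    k = gaussian_kernel_1d(size, sigma)
    return np.outer(k, k)


def gaussian_blur(image: np.ndarray, size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Blur a 2D grayscale image, replicating edge pixels at the borders."""
    k = gaussian_kernel_1d(size, sigma)
    padded = np.pad(image.astype(float), size // 2, mode="edge")
    # Separable filtering: along rows first, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)


# Hypothetical noisy frame of height 368 and width 432.
rng = np.random.default_rng(0)
image = 0.5 + 0.1 * rng.standard_normal((368, 432))
blurred = gaussian_blur(image)
print(blurred.shape, image.std() > blurred.std())
```

Because the kernel is a weighted average with weights summing to one, it leaves flat regions unchanged and reduces the variance of independent pixel noise, at the cost of some edge sharpness.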

4. Results

This section first compares the optimized VGGNet algorithm with several human skeleton movement extraction algorithms and then analyzes its accuracy in extracting human skeleton movements, which verifies the effectiveness of the optimized model. Besides, the performance of the LSTM model, the single-LSTM model, and the double-LSTM model is compared. Finally, the interaction accuracy and real-time performance are used to verify the performance of the dance education HCI system.

4.1. Performance Comparison of Skeleton Movement Extraction Algorithms

Table 2 presents the comparison result of the extraction accuracy of human movements by several algorithms, including the original and optimized VGGNet.

Table 2 suggests that the optimized VGGNet algorithm presents the best performance in extracting human movements, with the highest accuracy of 98.2%, showing apparent superiority in performance over traditional VGGNet algorithms. The 3D CNN model can only extract a type of features from a three-dimensional space because the weights of the convolution kernel are the same in the whole space; that is, the weights are shared by the same convolution kernel, so the extraction accuracy of 3D CNN is only 91.2%. The spatial invariance of ST-CNN refers to the invariance of spatial transformation of images such as rotation, translation, and scaling. Even if the input is transformed or slightly modified, the model can recognize and extract features. ST-CNN is the most time-consuming and error-prone place in debugging interpolation and image index, so the extraction accuracy of ST-CNN is only 90.5%. ODPM-CNN model is a variability network and ODPM-CNN just the opposite, and its recognition accuracy reached 97.08%. The optimized VGGNet is also superior to other human movement extraction algorithms. In this way, the effectiveness of the proposed skeleton extraction algorithm is verified preliminarily.

4.2. Extraction Results of Human Skeleton Movements

The accuracy distribution of the eight human skeleton movements’ extraction results by optimized VGGNet on OpenPose open-source database is shown in Figure 4.

The collection of 100 dance pictures is taken as the total sample, and each picture contains the movement changes of eight body parts. In Figure 4, S denotes the extraction accuracy of the head, shoulder, elbow, wrist, hip, knee, and ankle bone nodes for the stepping and knee lift movement; C, R, T, B, A, J, and H denote the extraction accuracy of the same parts for the correspondingly labeled movements. The extraction accuracy of the head is the highest, reaching 96%, and 100 images are correctly extracted. The extraction accuracy of the shoulder reaches 84.8%, with 90 pictures extracted correctly. The extraction accuracy of the elbow reaches 92.6%, with 89 pictures extracted correctly. The extraction accuracy of the wrist reaches 87.6%, with 86 pictures correctly extracted. The extraction accuracy of the hip reaches 91.0%, with 100 pictures extracted correctly. The extraction accuracy of the knee reaches 95.8%, with 90 pictures extracted correctly. The extraction accuracy of the ankle reaches 86.7%, with 88 pictures extracted correctly. Figure 4 signifies that the extraction accuracy of bone nodes differs among the body parts and that the proportion of correctly extracted samples differs as well. Moreover, Figure 4 implies that body parts occupying a larger area of the image have a significantly higher proportion of correctly extracted samples.

4.3. Skeleton Movement Recognition Results of Multiple LSTMs

The single-LSTM, double-LSTM, F-multi-LSTM, and A-multi-LSTM are compared. The results are shown in Figure 5.

The abscissa in Figure 5 lists the different neural network models; the left axis shows the accuracy, and the right axis shows the loss rate. Single-LSTM supports one-way variable sequence input and output, while double-LSTM supports two-way sequence input and output. Multi-LSTM is a multidimensional LSTM for high-frequency time series that supports multiple parallel input sequences, rather than the planar multi-input structure of other models. F-Multi-LSTM is an optimized multidimensional LSTM, and A-Multi-LSTM is expressed as a pair of optimized multidimensional LSTMs. According to the comparison of loss rate and accuracy, the double-LSTM has higher accuracy than the single-LSTM. The recognition accuracy reaches 79.8%, and the loss rate is 0.0685. Compared with the single-LSTM model, the difference is 43.8%; overall, the recognition accuracy and loss rate of the proposed multi-LSTM model are the best. Specifically, the F-Multi-LSTM model’s recognition accuracy reaches 88.9%, and the loss rate is 0.0748, which is the best among the comparative algorithms. Compared with the traditional LSTM model before improvement, the optimized LSTM model has higher recognition accuracy and the best applicability in recognizing human skeleton movements.

4.4. HCI System Performance Based on Dance Education and Movement Analysis

The eight dance movements are chosen as the benchmark. According to the indicators of interaction accuracy and system instantaneity, the HCI system’s performance for dance education is shown in Figure 6.

In the dance education HCI system, the eight dance movements’ overall interaction accuracy is above 70%. The interaction accuracy of movement B is the highest, reaching 92%. The overall accuracy of interactive recognition is distributed in the range of 72%–92%, with a large span. The overall response time corresponding to the eight dance movements is distributed within 5.1 seconds to 5.9 seconds, showing that the dance education HCI system has a high instantaneity.

5. Discussion

The above results indicate changes in the OpenPose open-source database’s recognition accuracy and the optimized VGGNet model. The reason is that the head has almost no changes in coordinates or rotation angle. Besides, the movement range of the head is small. Therefore, the accuracy of classification and recognition of the head is the highest. In contrast, the shoulders are greatly affected by external factors, such as rotation angle and abscissa among different movements. Hence, classification and recognition accuracy of the shoulders are relatively low. The elbow movements and the wrist movements are affected by changes in moving speed and longitudinal coordinates. If the human body’s moving speed is slow and the position between the arm and the camera is not parallel, the classification and recognition accuracy will be high. The hips are easily affected by changes in the leg movements. The overall accuracy of classification and recognition corresponding to the knees is high, but movements with large fluctuations, such as movement H, can significantly affect classification and recognition accuracy. Therefore, the accuracy is low. The ankles and other parts’ classification and recognition accuracies are low, probably because of external factors such as clothes and shoes. Although the classification and recognition accuracy of different dance movements are mainly different, the average accuracy is high, confirming the proposed algorithm’s effectiveness.

The multiple LSTM model is also advantageous in skeleton movement recognition. Because the optimized LSTM model is robust, its learning and classification abilities are improved, thereby increasing its accuracy in recognizing different dance movements. The distribution of the interaction accuracy of the dance education HCI system reveals that the interaction accuracy has a large span across movements. Unlike the model training process, actual interaction is affected by complex environmental conditions, such as different lighting, the constraints between different dance movements, and how frequently the dance movements change. Under such complex conditions, the interaction accuracy of the HCI system drops. Hence, attention should also be paid to improving the datasets in actual HCI applications.

Meanwhile, the proposed algorithm is compared with the methods proposed by other scholars [40–49] to verify its superiority. For the training, the input image size is set to 432 × 368, the number of cycles is set to 50, the batch size is set to 16, and the initial learning rate is set to 0.001. Table 3 shows the results. Table 3 demonstrates that the proposed multi-LSTM model has the highest accuracy in skeleton movement recognition, and the recognition accuracy is improved by 27.79%, 17.69%, and 27.62%, respectively, compared with the other methods.
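The training settings listed above can be collected in one configuration object. The following is a minimal Python sketch, not the original training code: the class name `TrainConfig`, its field names, and the reading of 432 × 368 as width × height are assumptions, and the dataset size passed to the helper methods is a hypothetical value used only to show how the number of training steps follows from the batch size and the number of cycles.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the text; the field names are hypothetical.
    input_width: int = 432    # assumed to be the width in "432 x 368"
    input_height: int = 368   # assumed to be the height in "432 x 368"
    epochs: int = 50          # "number of cycles" in the text
    batch_size: int = 16
    initial_lr: float = 0.001

    def steps_per_epoch(self, num_samples: int) -> int:
        # Number of mini-batches per cycle, counting a final partial batch.
        return math.ceil(num_samples / self.batch_size)

    def total_steps(self, num_samples: int) -> int:
        # Total number of parameter updates over all cycles.
        return self.epochs * self.steps_per_epoch(num_samples)


cfg = TrainConfig()
# 1000 is a hypothetical dataset size, not a figure from the experiment.
print(cfg.steps_per_epoch(1000), cfg.total_steps(1000))
```

Keeping the settings in a frozen dataclass makes it harder for the compared methods to be trained under accidentally different hyperparameters.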

6. Conclusions

For the dance education HCI system, the CNN-based VGGNet model is optimized and applied to extract human skeleton movements based on the OpenPose open-source database and histogram equalization. The proposed extraction algorithm for human skeleton movements performs well in extracting eight different dance movements, with the highest accuracy reaching 96%. The comparison of loss value and accuracy between a single-LSTM model and a multi-LSTM model shows that the accuracy of skeleton movement recognition by the multi-LSTM model is 79.8%, which is higher than that of a single-LSTM model. The optimized multi-LSTM model has higher accuracy in recognizing human skeleton movements than the traditional LSTM models. The constructed HCI system has an interaction accuracy of 92%. This work extends the application range of deep learning in skeleton movement recognition and achieves an organic combination of deep learning and HCI.

The contributions based on the extraction and recognition of human dance movements are as follows:
(1) An optimized VGGNet human skeleton movement extraction algorithm is proposed, which achieves a better extraction accuracy than traditional algorithms, attaining 96%.
(2) An optimized multiple LSTM human skeleton movement recognition algorithm is proposed. Its recognition accuracy reaches 88.9%, which is significantly better than traditional LSTMs.
(3) An HCI system based on image visualization is designed, with an interaction accuracy of 92%.
(4) A reference is provided for more in-depth human movement extraction and recognition, and deep learning strengthens the applicability to the HCI system.

Due to computational resource limitations, other larger and more complex datasets are not considered in this experiment [50–55, 57, 58]. In addition, although the algorithm can meet the real-time requirements, its recognition speed is still slow [59–61]. In view of the above problems, subsequent work will further expand the datasets, especially in complex scenes, and further optimize the model to improve the detection speed [56].

Data Availability

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent was obtained from all individual participants included in the study.

Conflicts of Interest

All authors declare that they have no conflicts of interest.

Authors’ Contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.