To improve group animation motion capture and advance group animation design, this paper studies a group animation motion capture method combined with virtual reality technology. It constructs a crowd motion capture system based on virtual reality technology and describes crowd attributes by organizing crowd profiles of different levels and structures. It then models multiple attributes collaboratively and constructs a model based on structured interactive attributes. The strength of the model is that it takes scene scale, chaos, and crowding into account, so it can characterize the potential interactions in group movement patterns. The simulation study indicates that the proposed method can play an important role in group animation motion capture and animation design.

1. Introduction

Technically, 3D animation production has four main steps: geometric modeling, material adjustment, motion trajectory setting, and shading and rendering. The main problem in geometric modeling is reproducing the model in the virtual world. The main problem in material adjustment is making virtual material feel like real material [1]. The main problem in motion trajectory setting is the movement of characters, especially the limb movements of humans and animals. The main problem in shading and rendering is producing the corresponding static or dynamic pictures for a given scene setup. Of these, the motion trajectory is the hardest part of 3D animation production, because it involves adjusting character movement and coordinating the limbs. Ordinary animations often feature spectacular scenes, but in a good animation work it is the movement of the characters that holds the viewer’s attention [2]. The same is true of film and television: however well the special effects are done, the actors’ performance still comes first, and a film with poor, unrealistic acting is unlikely to be well liked. In 3D animation, the character’s motion trajectory plays the role that acting plays in film and television. Character movement is usually adjusted by the animator, but for a long animation it is very difficult to rely on manual adjustment alone. Setting and adjusting character motion trajectories has therefore become the main bottleneck in 3D animation production [3].

The animation system of traditional film and television performance is based on computer graphics processing. Multiple video capture devices record the movement of objects as images, and the recorded image information is then processed with computer graphics techniques. Capture technologies mainly include acoustic, optical, mechanical, and electromagnetic capture. Optical motion capture is the technology used in most performance animation, and its general workflow is as follows. First, the script is written and the shot-by-shot storyboard is drawn, the atmosphere maps and character designs are prepared, and the models required for the animation are built according to the script. To facilitate later video editing and animation production, performers usually wear green clothes with light-emitting points attached to key body parts such as the joints, wrists, and elbows; these points are easy for the vision system to recognize. Several cameras then work together to film the performer from all angles, and the markers in each frame are used to record the performer’s movement, which yields the positions of many points over time. Finally, three-dimensional techniques convert the motion trajectories of these points into the movements of a skeleton model, so that the performer’s movements are transferred to the model.

This paper studies the group animation motion capture method combined with virtual reality technology, constructs a group motion capture system based on virtual reality technology, and improves the effect of subsequent group animation motion capture.

Because optical motion capture is sensitive to light, it is easily disturbed by ambient light in practice, which degrades the overall capture quality. When capture is performed in a studio, the studio set differs greatly from the actual environment and the performer has little real interaction with the scene, so the performance becomes harder and has to rely on the director’s guidance and the actors’ imagination. This makes the whole capture process more difficult, and optical motion capture also requires a relatively large amount of equipment. If the performer cannot get into character quickly, capture takes longer and consumes considerable resources [4]. As optical motion capture technology has improved, researchers have gradually refined the trackers mounted on the observed subject so that outdoor shooting is less affected by ambient light. The markers were changed from passively receiving light to actively emitting it, which effectively avoids the influence of external light on optical motion capture [5].

Using virtual shooting technology during optical motion capture lets the director see the current state of the scene more intuitively and guide the live actors better, so that the captured actors can deliver the expressions and actions required of the animated characters in the scene, which improves the overall quality of the animation [6]. During optical motion capture, virtual shooting is used to simulate the different scenes of the film or animation, so the shooting team can see the intended result more directly and analyze it quickly. If deficiencies appear early in the process, the director can communicate with the actors and the postproduction team in time and adjust the actions and performances. This reduces the changes needed after production is finished and keeps the cost of film and television animation production under control [7].

Literature [8] uses three cameras to capture the changes of 102 landmark points and achieve voice expression animation. Literature [9] uses two camcorders combined with frame receivers to transmit data to the computer and applies direct linear transformation to turn the captured point information into usable data. Literature [10] uses marker points of different colors to capture faces at different angles; after the data are obtained, it recognizes and trains templates according to the point colors and uses image processing to calculate the 3D point information. Literature [11] proposes a method that obtains three-dimensional information with only one camera, using two mirrors to reflect a face covered with fluorescent markers. During shooting, the face is irradiated with violet light, so the fluorescent markers reflect strongly and the image has high contrast, which makes tracking easier. The three-dimensional position of each point is then computed from spatial geometry to obtain the initial capture data. Literature [3] paints colored pigments on the face: blue marker points track the overall expression changes, and colored stripes in other regions track detailed wrinkle changes. Literature [12] proposes reconstructing real face geometry by projecting different types of textures.

Literature [13] uses somatosensory peripherals as the data capture tool. They capture two-dimensional images and three-dimensional position information simultaneously, which gives high speed and synchronization, but the three-dimensional information is noisy and error-prone and requires software denoising and correction. Literature [14] uses a trained facial expression model and relies on model matching to limit the influence of noise. This gives good real-time performance, but the resulting expression changes are somewhat monotonous and lack variability and degrees of freedom. Literature [15] illuminates a face without markers with structured light, calculates the depth difference between two image sequences, and matches the depth against a face template to drive the model. This method places higher demands on lighting and is generally more cumbersome to use. Literature [16] shoots faces without markers with five synchronized cameras, supplemented by two-dimensional grid tracking to link frames and reproduce facial expression animation, but it is more time-consuming.

3. Group Motion Capture Algorithm Based on Virtual Reality Technology

The proposed method first uses a nontracking approach to obtain group motion information in dense scenes. Ideally, group behavior would be analyzed by tracking every individual target. However, dense scenes contain many occlusions, which makes it difficult to track all individuals. We therefore use particle flow to approximate crowd trajectories and avoid tracking individual animated pedestrians. This method treats dense pedestrians as a field of particles and lets the particles advect through the optical flow field, which approximates pedestrian motion and captures continuous group motion. The resulting motion information is considerably better than that produced by a pure optical flow representation.

Given a video of a group scene, the frame sequence is first divided into video blocks of size T × W × H, which can be represented as follows:

The particles are placed evenly in the optical flow field at a fixed step size, and the average optical flow is computed within the space-time cube that covers a fixed area around each particle. Each particle moves with the average optical flow of the points it covers, and its trajectory is accumulated with the fourth-order Runge–Kutta method. The particle flow forms a trajectory as follows:

Particles follow the fluid motion to generate trajectories guided by the average neighbors. The particle trajectory includes T-tuples in the optical flow field as follows:

Among them, s and represent the position vector of particle i obtained from the optical flow field at and the velocity vector obtained by particle i at time r and displacement .
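As an illustration of the particle advection step, the following minimal sketch moves one particle through a flow field with the classical fourth-order Runge–Kutta rule. It is not the authors' implementation. The function `rotating_flow` is a hypothetical analytic stand-in for the averaged optical flow, and `seed_grid`, `rk4_step`, and `advect` are illustrative names; in practice the flow would be sampled from the space-time cubes of the video.

```python
import math


def rotating_flow(x, y, t):
    """Hypothetical stand-in for the averaged optical flow at (x, y, t):
    a rigid rotation about the origin with unit angular velocity."""
    return -y, x


def seed_grid(width, height, step):
    """Place particle seeds evenly on a grid with the given step size."""
    return [(float(x), float(y))
            for y in range(0, height, step)
            for x in range(0, width, step)]


def rk4_step(flow, x, y, t, dt):
    """Advance one particle by dt with the classical fourth-order Runge-Kutta rule."""
    k1x, k1y = flow(x, y, t)
    k2x, k2y = flow(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y, t + 0.5 * dt)
    k3x, k3y = flow(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y, t + 0.5 * dt)
    k4x, k4y = flow(x + dt * k3x, y + dt * k3y, t + dt)
    return (x + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0,
            y + dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6.0)


def advect(flow, seed, steps, dt):
    """Return the trajectory of one particle as a list of (x, y), seed included."""
    x, y = seed
    trajectory = [(x, y)]
    for i in range(steps):
        x, y = rk4_step(flow, x, y, i * dt, dt)
        trajectory.append((x, y))
    return trajectory


# One particle released at (1, 0) and advected for one time unit.
trajectory = advect(rotating_flow, (1.0, 0.0), steps=100, dt=0.01)
end_radius = math.hypot(*trajectory[-1])  # stays close to 1 for a pure rotation
```

In this toy field the particle should stay on the unit circle, which gives a quick sanity check on the integrator before it is applied to real optical flow.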

This clustering method yields reliable clusters that are determined by particle density (unlike the K-means method), and it is robust to different trajectory types. In particular, c samples are given in the T-dimensional space [17]:

The algorithm randomly selects a point and drifts to with the highest probability density at the current scale by computing the mean vector , as follows:

Among them, is the inference coefficient obtained using the gradient descent algorithm. The next point is drifted from as follows:
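The drift toward the point of highest density reads like a mean-shift-style update. The sketch below shows that general technique under stated assumptions: a Gaussian kernel with a hand-picked `bandwidth`, and modes merged when they fall within `merge_radius` of each other. These choices and all function names are hypothetical, and the coefficient obtained by gradient descent in the text is not modeled.

```python
import math


def mean_shift_point(point, samples, bandwidth, iters=100, tol=1e-6):
    """Drift one point toward a density mode using Gaussian-weighted means."""
    current = list(point)
    for _ in range(iters):
        weights = [math.exp(-math.dist(current, s) ** 2 / (2.0 * bandwidth ** 2))
                   for s in samples]
        total = sum(weights)
        shifted = [sum(w * s[d] for w, s in zip(weights, samples)) / total
                   for d in range(len(current))]
        moved = math.dist(shifted, current)
        current = shifted
        if moved < tol:  # the mean vector no longer moves the point
            break
    return current


def mean_shift_cluster(samples, bandwidth, merge_radius=None):
    """Assign every sample to the mode it drifts to; nearby modes are merged."""
    if merge_radius is None:
        merge_radius = bandwidth
    modes, labels = [], []
    for sample in samples:
        mode = mean_shift_point(sample, samples, bandwidth)
        for index, known in enumerate(modes):
            if math.dist(mode, known) < merge_radius:
                labels.append(index)
                break
        else:
            modes.append(mode)
            labels.append(len(modes) - 1)
    return modes, labels


# Two well-separated groups of toy trajectory descriptors (2-D for readability).
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
modes, labels = mean_shift_cluster(samples, bandwidth=0.5)
```

Unlike K-means, the number of clusters is not fixed in advance here; it follows from the density of the samples and the chosen bandwidth.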

The graph structure is described as follows. To express the graph structure of the group, we use spectral information to reflect the structural properties of the graph, because the Laplacian spectrum performs well in recognition and classification problems. We assume that there are N graphs in a T-frame video clip containing m trajectories. For each graph , the Laplacian matrix can be defined by F as [18]

Among them, i, j = 1, 2, …, m, k = 1, 2, …, N, represents the distance between different vertices. The eigenvalues of the Laplacian matrix can be obtained by the singular value decomposition (SVD) method.

Among them, . We select the three largest eigenvalues as the graph structure features, which discriminate between diverse graph structure patterns.
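The following sketch illustrates how a three-value spectral feature could be computed for one trajectory graph. Several parts are assumptions rather than the paper's definitions: edge weights are taken as a Gaussian function of vertex distance with a hypothetical `sigma`, the Laplacian is the unnormalized form (degree matrix minus weight matrix), and a small Jacobi rotation routine replaces SVD so that the example needs only the standard library. For a symmetric positive semidefinite matrix such as this Laplacian, the eigenvalues and singular values coincide.

```python
import math


def gaussian_weights(points, sigma=1.0):
    """Assumed edge weights: Gaussian of the distance between vertices."""
    n = len(points)
    return [[0.0 if i == j else
             math.exp(-math.dist(points[i], points[j]) ** 2 / (2.0 * sigma ** 2))
             for j in range(n)] for i in range(n)]


def laplacian(weights):
    """Unnormalized graph Laplacian: degree matrix minus weight matrix."""
    n = len(weights)
    lap = [[-weights[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] = sum(weights[i][j] for j in range(n) if j != i)
    return lap


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, descending."""
    a = [list(row) for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4.0
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def graph_structure_feature(points, k=3, sigma=1.0):
    """The k largest Laplacian eigenvalues of the graph built on the given vertices."""
    return symmetric_eigenvalues(laplacian(gaussian_weights(points, sigma)))[:k]


# Toy graph with four vertices at the corners of a unit square.
feature = graph_structure_feature([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
```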

The group attribute is as follows: the group attribute is used to express the characteristics of the group, including the orientation distribution and the velocity distribution. In each trajectory graph , for each node , we calculate the basic velocity S(.) and orientation O(.) feature channel as follows:

Among them, is the position of the graph node trajectory . These attributes can be viewed as quantitative measures of group behavior. We use histograms with n uniform bins (n = 8) over orientation and velocity to define the group properties as follows:
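A minimal sketch of the 8-bin orientation and velocity histograms is given below. It assumes that velocity and orientation are taken from finite differences of consecutive node positions and that the speed range is capped by a hypothetical `max_speed`; neither detail is confirmed by the text, and the function names are illustrative.

```python
import math


def speed_and_orientation(trajectory):
    """Per-step speed and orientation (radians in [0, 2*pi)) from consecutive positions."""
    speeds, orientations = [], []
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        dx, dy = x1 - x0, y1 - y0
        speeds.append(math.hypot(dx, dy))
        orientations.append(math.atan2(dy, dx) % (2.0 * math.pi))
    return speeds, orientations


def histogram(values, n_bins, low, high):
    """Normalized histogram with n_bins uniform bins on [low, high]."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for value in values:
        index = int((value - low) / width)
        index = max(0, min(index, n_bins - 1))  # clamp values at or beyond the range
        counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


def group_attribute(trajectory, max_speed, n_bins=8):
    """Concatenate the orientation histogram and the speed histogram (8 + 8 values)."""
    speeds, orientations = speed_and_orientation(trajectory)
    return (histogram(orientations, n_bins, 0.0, 2.0 * math.pi)
            + histogram(speeds, n_bins, 0.0, max_speed))


attribute = group_attribute([(0, 0), (1, 2), (2, 4), (3, 6)], max_speed=4.0)
```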

Movement dynamics are combined with the internal attributes of the group, and the movement information outside the group also needs to be considered to describe the group. For each trajectory graph Gj, we choose the highest three as the trajectory graph velocity and consider the average node position as the position of the graph to record the dynamic motion.

These features record the motion information of the structure and trajectory graphs, and effectively express typical group behavior patterns. Based on this, all features can be expressed as a unified 24-dimensional vector (the concatenation of 3 + 8 + 8 + 3 + 2-dimensional vectors), which describes the group-level structure and apparent patterns. Next, a bag-of-words model is constructed to quantify the trajectory graph patterns.

The dictionary of trajectory graphs is inspired by visual words, which represent local patterns of images. In the same way, trajectory graphs represent the group behavior patterns of a specific video sequence and can be applied to group recognition tasks. The concatenated feature vectors are clustered with the K-means method to build a dictionary of trajectory graph words. The bag-of-words model of trajectory graphs (BoTG) represents the group behavior pattern through a histogram vector hj, as shown below [19]:

Among them, d is the number of words selected as a dictionary. is the frequency with which the i-th trajectory graph word appears in the j-th video clip. The vector contains the pattern distribution information of a particular group scene and is normalized by the number of all images in the j-th video segment. Each T-frame video can therefore be represented as a bag of trajectory graph words (BoTG). In this way, BoTG captures informative group information by preserving the co-occurrence pattern. We extracted BoTG from group video clips to train an SVM model that recognizes different event types. BoTG can serve as an efficient behavioral representation of global groups with co-occurring structural coherence.
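The dictionary and histogram steps can be sketched as follows. This is an illustration only: the K-means routine uses a deterministic initialization from the first k vectors, the toy vectors are 2-dimensional rather than 24-dimensional, and the histogram is normalized by the number of trajectory graphs in the clip, which is an assumption about the normalization. The SVM training step is not shown.

```python
import math


def nearest(vector, centroids):
    """Index of the centroid closest to the vector (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))


def kmeans(vectors, k, iters=20):
    """Plain Lloyd iterations; initial centroids are the first k vectors (sketch only)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for vector in vectors:
            groups[nearest(vector, centroids)].append(vector)
        for index, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[index] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def botg_histogram(clip_vectors, dictionary):
    """Word-frequency histogram of one clip, normalized by its number of graphs."""
    counts = [0] * len(dictionary)
    for vector in clip_vectors:
        counts[nearest(vector, dictionary)] += 1
    n = len(clip_vectors)
    return [c / n for c in counts] if n else [0.0] * len(dictionary)


# Toy feature vectors from all training clips.
training = [(0, 0), (10, 10), (0, 1), (10, 11)]
dictionary = kmeans(training, k=2)
h = botg_histogram([(0, 0), (0, 1), (10, 10)], dictionary)
```

Each clip is reduced to a fixed-length vector `h` regardless of how many trajectory graphs it contains, which is what makes it usable as input to a classifier.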

We assume that there are N types of attributes to describe each group video, and there are M group videos in total. First, a graph structure is constructed separately for each attribute to measure the relationship between videos. Each graph is defined by a similarity matrix W, whose entries give the similarity between the i-th and the j-th video under the n-th of the N attributes. For the n-th class of attributes (n = 1, …, 7), each subgraph can be constructed as follows:

A feature xn can be mapped to in the original feature space. This mapping can be a linear or nonlinear mapping. Inspired by deep autoencoders, we expect that both the group data relationship and the underlying manifold structure of the data distribution can be better learned by nonlinear transformations in deep models. Therefore, we next add deep embedding to the graph ranking framework.

Overall, graph-based ranking methods can be formalized as the following regularization framework:

In particular, given the label Y, the attribute vector xni (for the n-th class attribute of the video), and the similarity matrix for the n-th class attribute, our goal is to find the relevance ranking score f. The specific form is as follows [20]:

Among them, represents the weights of different graphs, and is a diagonal matrix whose . In order to better integrate different properties and improve computational efficiency, α can be regarded as a variable, and formula (5) should be able to be optimized simultaneously.
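For orientation, the sketch below shows one widely used instance of graph-based ranking: scores are propagated over a symmetrically normalized similarity graph until they settle. It is not necessarily the formulation used in this paper. In particular, combining the per-attribute graphs as a weighted sum, the `damping` parameter, and all function names are assumptions made for illustration, and the graph weights are held fixed rather than optimized.

```python
import math


def combine_graphs(graphs, graph_weights):
    """Weighted sum of per-attribute similarity matrices (assumed fusion rule)."""
    n = len(graphs[0])
    return [[sum(a * g[i][j] for a, g in zip(graph_weights, graphs))
             for j in range(n)] for i in range(n)]


def normalize(similarity):
    """Symmetric normalization: S[i][j] = W[i][j] / sqrt(d_i * d_j)."""
    n = len(similarity)
    degree = [sum(row) for row in similarity]
    return [[similarity[i][j] / math.sqrt(degree[i] * degree[j])
             if degree[i] > 0 and degree[j] > 0 else 0.0
             for j in range(n)] for i in range(n)]


def rank(similarity, labels, damping=0.5, iters=200):
    """Iterate f <- damping * S f + (1 - damping) * Y until it settles."""
    s = normalize(similarity)
    n = len(labels)
    f = list(labels)
    for _ in range(iters):
        f = [damping * sum(s[i][j] * f[j] for j in range(n))
             + (1.0 - damping) * labels[i] for i in range(n)]
    return f


# Three videos in a chain; only the first one is labeled as relevant.
chain = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
combined = combine_graphs([chain, chain], [0.5, 0.5])
scores = rank(combined, [1.0, 0.0, 0.0])
```

In this toy chain the score decreases with graph distance from the labeled video, which is the behavior a relevance ranking score is expected to show.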

The system parameters can be derived through the structure of the deep network and the multilevel nonlinear transformation abstraction, and the depth transformation metric An can be derived when f and αn are fixed. The deep structure of the stacked denoising autoencoder (SDAE) can autonomously abstract higher-level semantic information through a series of nonlinear reconstruction transformations. In particular, each layer of the SDAE is a hidden layer representation generated by data training, which is equivalent to a higher-level abstract response.

A linear transformation function and a continuous nonlinear transformation function both transform x through h into r neurons.

The representation of the manifold structure present in the data is introduced by such a decoder reconstruction form to describe the manifold distance relationship between the data. The total reconstruction error is thus defined as follows:

Afterward, x can be updated with its hidden representation h(x), which gives the input of the (t + 1)-th iteration. Each iteration acts as one deep layer through which the hidden layer embeds semantic attributes, and this explains the relationship between the input data well. Furthermore, the transformation metric matrix gradually approaches the inherent manifold structure of the group data, which promotes the fusion of the diverse and heterogeneous properties of group patterns.
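To make the encoder, decoder, and total reconstruction error concrete, here is a minimal single-layer sketch. It assumes a sigmoid encoder with r hidden units, a linear decoder that reuses the transposed encoder weights (tied weights), and a squared-error loss summed over samples. These choices and the function names are illustrative assumptions; the paper's exact layer definitions are not recoverable from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(x, weights, enc_bias):
    """Hidden representation h(x): one sigmoid unit per row of the r x d weight matrix."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, enc_bias)]


def decode(h, weights, dec_bias):
    """Linear reconstruction with tied weights (transpose of the encoder matrix)."""
    return [sum(weights[k][d] * h[k] for k in range(len(h))) + dec_bias[d]
            for d in range(len(dec_bias))]


def reconstruction_error(samples, weights, enc_bias, dec_bias):
    """Total squared reconstruction error over all samples."""
    total = 0.0
    for x in samples:
        x_hat = decode(encode(x, weights, enc_bias), weights, dec_bias)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total


# With zero weights the decoder outputs its bias, so the error is easy to check by hand.
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # r = 2 hidden units, d = 3 inputs
error = reconstruction_error([[1.0, 2.0, 3.0], [2.0, 2.0, 3.0]],
                             weights, [0.0, 0.0], [1.0, 2.0, 3.0])
```

Stacking would feed `encode(x, ...)` of one layer into the next layer as its input, in line with the iteration described above.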

The maximization (M) step is to optimize f and .

It is worth noting that when A is fixed, the objective function is convex with respect to and . Formula (19) for f and can likewise be solved by iteratively minimizing . When is fixed, the partial derivative of L can be obtained as follows:

The specific details of the iterative algorithm process and the corresponding convergence proof can be referred to. When f is fixed, we can derive formula (18) to obtain , which has the constraint , as follows:

If is fixed, f can be solved as follows:

It is worth mentioning that we ended up adopting 7 kinds of attributes, so we constructed 7 separate graphs.

The entire iterative optimization process can be found in Algorithm formula (19). At the same time, our method can easily extend to a wider variety of properties and incorporate more semantic structural information.

4. Group Animation Motion Capture Method Based on Virtual Reality Technology

The design of the group animation capture system is shown in Figure 1. The key technology of motion capture is marker tracking and three-dimensional reconstruction of spatial coordinates. In addition, in computer vision, it is necessary to use the position information of the viewpoint and the orientation information of the viewpoint to calculate the three-dimensional spatial structure from the two-dimensional image information. This uses various parameters of the camera. The relationship between the three-dimensional geometric position of a point on the surface of the space object and its corresponding point in the image is determined by the geometric model of the camera’s imaging. These geometric model parameters are camera parameters, which can only be obtained through camera calibration calculation. Therefore, camera calibration is also one of the key technologies of motion capture.
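The relationship between a 3D point and its image point can be illustrated with the standard pinhole camera model, shown below as a sketch. The paper does not state which camera model or calibration procedure it uses, so the intrinsic parameters `fx`, `fy`, `cx`, `cy` and the extrinsic `rotation` and `translation` are assumptions, and lens distortion is ignored.

```python
def project(point, fx, fy, cx, cy, rotation, translation):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    rotation is a 3x3 matrix and translation a 3-vector that map world
    coordinates into the camera frame; fx, fy are focal lengths in pixels
    and (cx, cy) is the principal point.
    """
    camera = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
              for i in range(3)]
    if camera[2] <= 0:
        raise ValueError("point is behind the camera")
    u = fx * camera[0] / camera[2] + cx
    v = fy * camera[1] / camera[2] + cy
    return u, v


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A marker one unit to the right of the optical axis, four units in front of the camera.
pixel = project((1.0, 0.0, 4.0), fx=800.0, fy=800.0, cx=320.0, cy=240.0,
                rotation=IDENTITY, translation=[0.0, 0.0, 0.0])
```

Calibration estimates exactly these parameters; once they are known for two or more cameras, the projection can be inverted to recover the 3D position of a marker seen in several views.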

Video capture cards offer two methods of storing data into memory. This paper selects double buffering, which performs better. Two buffers are opened in memory: one is used to capture images, and the other is used to analyze image data that has already been acquired. In this way, image data can be analyzed and processed while capture continues. After one frame has been acquired, the two buffers are switched. The entire video capture module is shown in Figure 2. A video capture thread is opened for each capture card and is responsible for that card’s image capture and analysis. In addition, a calculation thread is opened, which uses the image coordinates of the analyzed marker points to calculate their spatial positions.
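The double-buffering idea can be sketched with Python's `threading` and `queue` modules. This is a simplified illustration, not the system's code: frames are plain Python values, `run_capture` and `analyze` are hypothetical names, and analysis runs in its own thread here so that the hand-over between the two buffers is easy to see.

```python
import queue
import threading


def run_capture(frames, analyze):
    """Capture into one buffer while the other buffer is being analyzed."""
    buffers = [None, None]          # the two frame buffers
    free = queue.Queue(maxsize=2)   # indices of buffers that may be overwritten
    ready = queue.Queue(maxsize=1)  # index of the buffer waiting for analysis
    free.put(0)
    free.put(1)
    results = []

    def capture():
        for frame in frames:
            index = free.get()      # wait until a buffer is free
            buffers[index] = frame  # acquire the frame into that buffer
            ready.put(index)        # switch: hand the filled buffer to analysis
        ready.put(None)             # signal the end of the stream

    def analysis():
        while True:
            index = ready.get()
            if index is None:
                break
            results.append(analyze(buffers[index]))
            free.put(index)         # the buffer can be reused for capture

    threads = [threading.Thread(target=capture), threading.Thread(target=analysis)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Toy run: the "frames" are integers and the "analysis" squares them.
squared = run_capture(range(5), lambda frame: frame * frame)
```

A buffer index returns to the `free` queue only after its frame has been analyzed, so the capture side can never overwrite data that is still in use.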

Motion capture is a technology that captures and records human movement. The skeleton captured by this technique contains some typical joint points. As shown in Figure 3, the skeleton contains 21 joint points, and the root node contains 6-dimensional data, which are the three-dimensional translation and three-dimensional rotation information. Motion capture data can be represented by a matrix, in which each row represents a frame and each column represents a dimension.
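As an illustration of the frame-by-dimension matrix, the sketch below reads the MOTION block of a BVH file into a list of rows, one row per frame. BVH is assumed here only because the prototype system includes a Bvh loading module. The parser ignores the HIERARCHY block, the function names are hypothetical, and the assumption that the first six columns belong to the root (three translation and three rotation channels) follows the common BVH layout rather than anything stated for this dataset.

```python
def parse_motion(bvh_text):
    """Return (frame_time, matrix) from the MOTION block of a BVH file.

    Each row of the matrix is one frame; each column is one channel.
    """
    lines = [line.strip() for line in bvh_text.splitlines()]
    start = lines.index("MOTION")
    n_frames = int(lines[start + 1].split(":")[1])
    frame_time = float(lines[start + 2].split(":")[1])
    matrix = [[float(value) for value in line.split()]
              for line in lines[start + 3:] if line]
    if len(matrix) != n_frames:
        raise ValueError("frame count does not match the header")
    return frame_time, matrix


def root_motion(matrix):
    """Split the first six columns into root translation and root rotation per frame."""
    return [(row[0:3], row[3:6]) for row in matrix]


SAMPLE = """HIERARCHY
ROOT Hips
{
}
MOTION
Frames: 2
Frame Time: 0.04
0 1 2 10 20 30
0 1 3 10 20 35
"""

frame_time, motion = parse_motion(SAMPLE)
root = root_motion(motion)
```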

In the early stage of system design, the functional requirements of the system are analyzed first. This section focuses on the function and specific implementation process of each module of the animation synthesis prototype system, which is based on key frames of 3D motion capture data. The functional modules of the prototype system are shown in Figure 4.

This prototype system is mainly composed of four main modules: Bvh loading, player, key frame visualization, and animation synthesis. The system frame structure diagram is shown in Figure 5.

In order to verify whether the animation effect is good, the animation is synthesized on the basis of the extracted key frames, and the synthesized animation effect is displayed. The specific operation process is shown in Figure 6.

Figure 7 shows the group animation image designed by the group animation capture method proposed in this paper.

The group animation motion capture method based on virtual reality technology proposed in this paper is then evaluated, and the motion capture effect and the animation design effect are measured. The statistical test results are shown in Tables 1 and 2.

It can be seen from the above research that the group animation motion capture method based on virtual reality technology proposed in this paper can play an important role in group animation motion capture and animation design.

5. Conclusion

Motion capture technology still has certain limitations. For example, optical systems are expensive, special light spots on the performer have to be captured, and the subsequent adjustment and data correction involve a particularly large workload. The technology is therefore still developing and improving. In optical motion capture, it is precisely these light spots that make postprocessing of the animation data tricky. Some scholars now propose using video processing, instead of directly capturing light spots, to identify the performer’s movement trajectory, which simplifies motion capture and reduces the workload. This paper studies the group animation motion capture method combined with virtual reality technology and builds a group motion capture system based on virtual reality technology. The simulation results show that the proposed method can play an important role in group animation motion capture and animation design.

Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare no competing interests.

Acknowledgments

This work was supported by Vocational Education Reform and Innovation Project of “Science, Innovation and Education” of the Ministry of Education (Grant HBKC217128), by Industry-University-Research Innovation Fund for Chinese Universities, Ministry of Education (Grant 2021ALA02024), by University-Industry Collaborative Education Program of the Ministry of Education of China (Grant 201702028006), and by Team and Science Project Funds of Yibin Vocational and Technical College (Grants ybzysc20bk05, ybzy21cxtd-06, and ZRKY21ZDXM-03).