Abstract

Computer vision is widely used in manufacturing, sports, medical diagnosis, and other fields. In this article, a multifeature fusion method for expressing erroneous human actions, based on silhouette and optical flow information, is proposed to overcome the limited effectiveness of single-feature expression methods in human error action recognition. We analyse human error action recognition based on template matching to identify the key issues that affect the overall expression of error action sequences. We then propose a motion energy model that decomposes the motion energy of video clips of human error actions directly in the 3D action sequence space through a filter bank. The method avoids preprocessing operations such as target localization and segmentation. Finally, we use MET features combined with an SVM to test on a human error action database and compare the results obtained with different feature reduction and classification methods. The results show that the method has a clear advantage in recognition rate and is suitable for other dynamic scenes.

1. Introduction

According to the state of human motion, human shape capture research can be divided into two categories: static human shape capture and dynamic human shape capture. Early human shape capture research mainly constructed static human body models from images. Researchers proposed using three or four orthogonal photographs with automatic or semiautomatic methods to construct a simple human body shape; the system cost is low, but the resulting body shape is coarse [1]. Capturing static human shapes can also be treated as a special case of three-dimensional reconstruction. In computer vision, the techniques for recovering shape are called Shape from X, where X is a cue such as contour, stereo vision, motion, texture, shading, or focus. With a monocular camera, it is very difficult to reconstruct the shape of an object because the available cues do not constrain each other well, and there is no perfect solution to monocular 3D reconstruction [2]. With multiple cameras, the cues obtained from different cameras form better constraints, and the 3D model can be reconstructed from contours by volume intersection or by multicamera stereo vision techniques. By applying sophisticated 3D scanning devices or 3D reconstruction algorithms, fine human shapes can be captured, often with millimeter-level accuracy [3]. The study of dynamic human shape capture aims to capture both the skin deformation of the human body and the movement of garments caused by body motion.

According to their working principle, existing human motion capture technologies can be divided into mechanical, electromagnetic, acoustic, inertial, and optical motion capture. Mechanical, electromagnetic, acoustic, and inertial motion capture devices use different sensors based on different principles, and each has its advantages and disadvantages. They are generally evaluated in terms of cost, capture accuracy, ease of use, applicable range, real-time performance, and resistance to interference [4, 5]. Most optical motion capture is based on computer vision: if a point in three-dimensional space is seen by two or more cameras at the same time, its three-dimensional coordinates can be calculated from the images acquired at that moment and the calibration parameters of the cameras [6]. When the cameras record continuously at a fixed frame rate, the motion path of the point can be obtained from the image sequence. A typical optical motion capture system uses 4–8 cameras arranged around the scene, and the region where the cameras' fields of view overlap is the region in which the subject's movement can be captured [7]. In practice, motion capture is usually divided into marker-based and markerless motion capture according to whether markers or luminous dots are used. Marker-based systems usually require markers to be attached to key parts of the subject's body, such as the head and joints. The vision system analyzes the input image sequence, identifies the marker points, and then calculates the spatial position of each marker in each frame to obtain the motion trajectory. Many marker-based motion capture systems have been successfully commercialized, but they still have shortcomings [8].
First, markers are very time-consuming to attach, and wearing clothing with markers or sensors may make the subject uncomfortable and distort some movements. Second, such systems place requirements on the environment, such as lighting conditions and clothing, which limits their use in outdoor or natural settings. Third, manual user interaction is required when a tracked marker is lost. Markerless motion capture estimates body posture from images or videos without attaching markers to the subject. Compared with marker-based motion capture, markerless motion capture remains a challenging problem [9].

For model-based human motion capture, the accuracy of the model initialization strongly affects the quality of the subsequent tracking. Automatic initialization can improve tracking accuracy, but automatic initialization methods may have limitations, such as requiring a specific pose or a predefined motion [10]. Over the past decade, there have been many research efforts on the automatic initialization of human models in multiview images. To improve tracking accuracy, these methods reconstruct a model whose joint structure resembles that of the tracked subject. Because the number of cameras is limited, it is difficult to obtain accurate body dimensions, and body shape varies between tracked subjects. Previous studies used simple models such as stick figures, which required users to manually adjust limb lengths and posture to match the tracked subject [11]. Given the complexity of the human body, such simple models capture posture with limited accuracy. With the continuous improvement of computer processing power, model-based methods are increasingly used, and the geometric representation of the model has evolved from simple to complex: the more realistic the model, the better it approximates the captured subject and the higher the achievable accuracy [12]. The best initialization results are achieved when a 3D scan of the captured subject is used as the model. However, 3D scans of the human body are expensive to acquire and require lengthy postprocessing. In addition, the initial model obtained from a scan contains no joint structure information, so a skeleton must be embedded and bound to the skin before pose estimation.
This increases the complexity of both the initial model and the initialization process to some extent [13, 14]. As shown in Figure 1, we study the key issues in constructing markerless motion capture based on a human body model, against the background of the application needs of human motion capture in the cultural and creative industries.

2. Three-Dimensional Human Error Motion Shape Capture Based on Computer Vision

2.1. Human Rigid False Motion Posture Matching

Constructing human models and controlling their motion has long been one of the most tedious tasks faced by animators. We construct a realistic three-dimensional human body model that is close to the geometric surface of the human body. At present, methods for constructing human body models can be roughly divided into three categories: creation, reconstruction, and interpolation. The capture of human body shape can be classified as static or dynamic. Static human models depict the body shape in a particular posture, usually standing. The main problems associated with static models include shape reconstruction, body depiction, alignment, filling of mesh holes, and differences in body size. Although 3D scanners can reconstruct human models with high accuracy, they are expensive and have limitations, such as restrictions on the environment in which they can be used and the need for postprocessing and other tedious steps. Reconstructing the human body from images can be seen as a special case of three-dimensional reconstruction. Three-dimensional reconstruction of static scenes has a long history, and algorithms in this field include shape from stereo vision, among others. With multiple cameras and clearly visible limb outlines free of self-occlusion, better reconstruction results can be obtained, and the quality of the reconstruction also depends on the number of cameras and their viewing angles. During human movement, body parts are prone to occlude each other. If the reconstruction is carried out frame by frame, some frames will be reconstructed incorrectly, because the lack of depth information about the body posture tends to make body parts appear fused together [15].

Three-dimensional objects are usually represented by polygonal meshes consisting of vertices and faces. The polygonal mesh is matched to the image data by moving its vertices. Smooth deformation of the mesh surface is usually achieved within the Laplacian mesh editing framework. The final human body shape is obtained by solving a linear system that constrains the mesh vertices to correspond to image pixels. Most mesh models deform well under such smoothness constraints. As shown in Figure 2, the challenge is to find the correct correspondence between the mesh vertices and the pixels in the image.

2.2. False Motion Recognition and Visual Algorithm Matching

For model-based human motion capture, the accuracy of the model initialization strongly affects the quality of the subsequent tracking. Automatic initialization can improve tracking accuracy, but automatic initialization methods may have limitations, such as requiring a specific pose or a predefined motion. Over the past decade, there have been many research efforts on the automatic initialization of human models in multiview images. To improve tracking accuracy, these methods reconstruct a model whose joint structure resembles that of the tracked subject. Because the number of cameras is limited, it is difficult to obtain accurate body dimensions, and body shape varies between tracked subjects [16]. Previous studies used simple models such as stick figures, which required users to manually adjust limb lengths and posture to match the tracked subject. Given the complexity of the human body, such simple models capture posture with limited accuracy. With the continuous improvement of computer processing power, model-based methods are used by more and more researchers, and the geometric representation of the model has evolved from simple to complex. The more realistic the model, the better it approximates the captured subject and the higher the achievable accuracy. The best initialization results are achieved when a 3D scan of the captured subject is used as the model. However, as shown in Figure 3, 3D scans of the human body are expensive to acquire and require lengthy postprocessing, and because the initial model obtained from a scan contains no joint structure information, a skeleton must be embedded and bound to the skin before pose estimation. This increases the complexity of both the initial model and the initialization process to a certain extent [17].

The goal of human motion capture is to recover human motion information from monocular or multicamera images. Estimating human pose from observed images, even with multiple cameras and simple backgrounds, remains a complex optimization problem because of model-to-image matching ambiguities, ambiguities in image depth information, and the high-dimensional state space. Local optimization algorithms are fast because they rely on a single pose hypothesis, but once a frame fails to be tracked, for example because of occlusion, it is difficult for the system to recover the correct pose. To describe the motion of the human body accurately, pose information with at least 30 degrees of freedom (DOFs) needs to be captured, and it is very difficult to search over all pose parameters simultaneously in such a high-dimensional pose space [18].

The pinhole camera model is a widely used perspective projection camera model. The relationship between the camera coordinate system and the world coordinate system is described by a rotation matrix R and a translation vector t, and a homogeneous world point X projects to image coordinates x as $x \simeq P X$ with $P = K [R \mid t]$, where P is the projection matrix, x represents the image coordinates of the projection of X, and K is the calibration matrix containing the camera’s intrinsic parameters.

The extrinsic matrix [R | t] provides a rigid transformation that converts the homogeneous coordinates (X, Y, Z, 1)T of a point X in the world coordinate system into camera coordinates (Xc, Yc, Zc, 1)T. The process of estimating the intrinsic and extrinsic parameters of one or more cameras is known as camera calibration.
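The projection described above can be illustrated with a minimal sketch. It assumes the standard pinhole form (world point to camera coordinates by a rotation and a translation, then to pixels by the calibration matrix). The function names and the numeric values of the focal length and principal point are hypothetical and are not taken from the paper.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def project_point(K, R, t, point_world):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    K is the 3x3 calibration (intrinsic) matrix, R the 3x3 rotation and
    t the translation that together map world to camera coordinates.
    """
    # World -> camera coordinates: Xc = R * X + t
    point_cam = [c + ti for c, ti in zip(mat_vec(R, point_world), t)]
    if point_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Camera -> homogeneous image coordinates, then divide by depth
    u, v, w = mat_vec(K, point_cam)
    return (u / w, v / w)


# Hypothetical intrinsics: focal length 800 px, principal point (320, 240)
K_EXAMPLE = [[800.0, 0.0, 320.0],
             [0.0, 800.0, 240.0],
             [0.0, 0.0, 1.0]]
R_IDENTITY = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]
T_ZERO = [0.0, 0.0, 0.0]

pixel = project_point(K_EXAMPLE, R_IDENTITY, T_ZERO, [0.1, -0.05, 2.0])
```

With the camera placed at the world origin, a point on the optical axis lands on the principal point, which is a quick sanity check for any calibration result.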

The vertices of the model are multiplied by this matrix to obtain their new coordinates. A rotation in 3D space can be about the X-, Y-, or Z-axis, and each such rotation can be represented by its own matrix.
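As a small illustration of the axis rotations mentioned above, the following sketch builds the three elementary rotation matrices and applies one to a list of vertices. It assumes right-handed coordinates, angles in radians, and column vectors; the helper names are hypothetical.

```python
import math


def rot_x(angle):
    """Rotation matrix about the X-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(angle):
    """Rotation matrix about the Y-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(angle):
    """Rotation matrix about the Z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def mat_mul(a, b):
    """Product of two 3x3 matrices; the result applies b first, then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotate_vertices(rotation, vertices):
    """Multiply every vertex by the rotation matrix."""
    return [[sum(r * x for r, x in zip(row, vertex)) for row in rotation]
            for vertex in vertices]


# A quarter turn about Z moves the X unit vector onto the Y unit vector.
rotated = rotate_vertices(rot_z(math.pi / 2), [[1.0, 0.0, 0.0]])
```

A general joint rotation can be composed from the three elementary ones with `mat_mul`, keeping in mind that the order of multiplication matters.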

3. Parametric Human False Motion Recognition Model

3.1. Computer Vision-Based Initialization of Athlete Error Motion Shapes and Postures

A model-based human motion capture system can be divided into four steps: initialization, tracking, pose estimation, and recognition. Initialization involves two aspects: the pose and the model representation. Initializing vision-based human motion capture usually requires defining a human model whose shape and joint structure are similar to those of the captured subject, and the initial pose of the model should be consistent with the pose of the tracked subject. In most 3D pose estimation algorithms, the user has to manually initialize a generic model so that its limb lengths and pose are consistent with the tracked subject. For model-based human motion capture, the accuracy of the model initialization strongly affects the quality of the subsequent tracking.

Automatic initialization can improve tracking accuracy, but automatic initialization methods may have limitations, such as requiring a specific pose or a predefined motion [19]. This is an active research topic, and there have been many studies on the automatic initialization of human models in multiview images. To improve tracking accuracy, these methods reconstruct a model whose joint structure is close to that of the tracked subject. Because the number of cameras is limited, it is difficult to obtain accurate body dimensions, and body shape varies from one tracked subject to another.

For a given tracked subject, the initialization results can be used as prior knowledge to constrain the subsequent tracking and pose estimation. Manual initialization rarely gives good results because of visual judgment errors. This paper aims to obtain detailed human shape and pose information quickly and accurately. Estimating human body shape from monocular or multiview images requires projecting a 3D model onto the 2D observation image, constructing a cost function that measures the distance between the projected model and the 2D contour, and minimizing this cost function. The popular parametric model SCAPE (Shape Completion and Animation of People) is a data-driven method for constructing models of the human body with different postures and body shapes. As shown in Figure 4, the SCAPE model has been extended to generate a specific model, which is matched to a sequence of visual hulls for each body part using a method based on contours and iterative closest point (ICP).
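The idea of building a cost between the projected model and the observed contour and then minimizing it can be sketched on a toy example. The paper does not give the exact cost function, so the sketch below assumes a simple one: the fraction of pixels on which two binary masks disagree. The rectangle "renderer" and the brute-force search over one parameter are hypothetical stand-ins for projecting a real body model and running a real optimizer.

```python
def silhouette_cost(model_mask, observed_mask):
    """Fraction of covered pixels on which the two binary masks disagree.

    Returns 0.0 for identical silhouettes and 1.0 for disjoint ones.
    """
    mismatched = 0
    covered = 0
    for model_row, observed_row in zip(model_mask, observed_mask):
        for m, o in zip(model_row, observed_row):
            if m or o:
                covered += 1
                if m != o:
                    mismatched += 1
    return mismatched / covered if covered else 0.0


def render_box(width, height, left, box_width, top, box_height):
    """Toy stand-in for projecting a model: draw a filled rectangle."""
    return [[1 if left <= x < left + box_width and top <= y < top + box_height
             else 0
             for x in range(width)]
            for y in range(height)]


# Observed silhouette: a 3x4 box whose left edge is at column 5.
observed = render_box(12, 8, 5, 3, 2, 4)

# Minimize the cost over one model parameter (horizontal position).
best_left = min(
    range(0, 10),
    key=lambda left: silhouette_cost(render_box(12, 8, left, 3, 2, 4), observed),
)
```

In a real system the single position parameter would be replaced by the full set of pose and shape parameters, and the exhaustive search by a local or global optimizer.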

The SCAPE model is matched to the observed contour to capture detailed human body deformation. Used as a three-dimensional to two-dimensional mapping, the SCAPE model reproduces the deformation of the human body in the image, and good results are obtained [20]. Discriminative models based on a mixture of experts have been used to estimate the parameters of the SCAPE model from monocular or multiview image contours. A matching error is considered to have occurred when, after the local optimization algorithm is applied, a limb part of the model matches the observed contour with an error greater than a given threshold. In the experiments in this paper, the threshold is set to 500 mm; when the matching error of a limb part exceeds 500 mm, the global optimization algorithm is applied to that part to correct it. Global optimization estimates the human body pose with a given number of particles and a given number of iterations: the number of particles is the number of samples placed in the search space, and the number of iterations determines the time required by the global optimization. Applying a constrained weighted energy function to the particles effectively reduces both the number of particles and the number of iterations, as shown in Figure 5.

3.2. Parametric Human Error Motion Recognition Shape Estimation

In human motion capture, using a model close to the human body not only improves the realism of the model but also helps improve the performance of markerless motion capture systems. A markerless motion capture system often needs to capture subjects of varying height and body type. Usually, however, the human model is built for one particular person, the processing methods are built around that person, and the whole pipeline relies on that person, which makes it difficult to generalize. Most motion capture systems require the subject to wear tight clothing so that accurate motion postures can be obtained. Considering this, and combining the parametric human body model with the initialization needs of motion capture systems, this section proposes a method for estimating human body size and posture based on a deformable model. The estimated 3D model is compared with 3D scan data, and good reconstruction results are obtained. We adopt the PSMA framework for stepwise estimation of human posture and shape: using the SCAPE model, the algorithm first obtains the human posture from the reconstructed visual hull mesh model and then defines a contour-matching cost function to estimate the body shape and fine-tune the posture.

As shown in Figure 6, the mean error of the athlete’s erroneous movements in SCAPE is between −11.14 mm and 12.05 mm, with a standard deviation of 16.16 mm; the red regions indicate large differences. The estimated human body model is very similar to the scanned model except for the hands, which are not very important for most applications. Once an error occurs in a frame, it persists throughout the tracking sequence and leads to incorrect estimates for all remaining frames. In this paper, the human model is divided into 15 parts as defined by the joint tree, and the contour-matching error is computed separately for each part.

As shown in Figure 7, once a matching error occurs at frame 29, the remaining estimates are all incorrect, which shows that the local optimization algorithm cannot recover automatically from an incorrect estimate. Although the local optimization algorithm is prone to incorrect posture estimates, it still tracks position correctly: the pelvis of the human model is matched correctly, so the global position and orientation of the erroneous human motion can be obtained.

3.3. Comparative Analyses of Human Error Motion Recognition Model Algorithms

To examine the effect of human models of different fineness on pose estimation, this paper uses the visual hull model obtained from images with the shape-from-silhouette (SFS) method and an approximate stick model, respectively, for validation. The mesh model obtained by the SFS method is given an embedded skeleton, the same algorithm is used for posture estimation, and the results are analyzed from the same viewpoint. The visual hull model obtained by the SFS method is not very accurate, and the resulting triangular mesh is irregular, especially around the shoulders, so the embedded skeleton causes shoulder deformation problems. When the body moves slowly, the problem is not obvious; as the body moves faster, the shoulder deformation becomes obvious. With the visual hull model the algorithm produces errors, whereas with the initialized model it does not.

In this paper, foreground segmentation is performed on each camera's recorded image sequence to obtain binary foreground contours. The projection matrix is used to project the three-dimensional model onto the two-dimensional image, which defines the relationship between the three-dimensional model and the two-dimensional foreground contour. The local optimization algorithm then computes the best match between the body parts of the three-dimensional model and the foreground contour, yielding the pose of the human body. When the estimate obtained by the local optimization algorithm is wrong, the global optimization algorithm is used to re-estimate the posture of the wrongly matched body parts. Because the global optimization algorithm searches for the optimal posture over the entire state space, it is very time-consuming. To overcome this drawback, global optimization is used only when the local optimization algorithm makes an estimation error, and only the pose parameters of the limb part where the error occurs are re-estimated. As shown in Figure 8, the global optimization algorithm is constrained by prior knowledge of human pose learned from a motion capture database, which reduces the number of particles and the number of iterations.

In real-world problems, a variety of factors may lead to multiple solutions; such problems are known as multimodal problems. Finding the single best solution in a multimodal problem requires a global optimization approach. Markerless motion capture is a multimodal problem in which global information is unknown and there is no very efficient way to mine it. It can be regarded as a black-box problem, which is usually solved with metaheuristic stochastic optimization methods. Stochastic optimization methods are highly scalable and also overcome the limitation of learning-based strategies that depend on training samples.

In the markerless motion capture problem, stochastic optimization methods can represent and exploit multimodal global information well, and the results obtained are accurate and robust. To show the capture results clearly, this article uses 8 viewpoints. As shown in Figure 9, each row is the result for one camera viewpoint. Column 1 is the input image, column 2 is the stick-like model, column 3 is the projection of the stick-like model on the image contour, column 4 is the 3D initialized model, and column 5 is the projection of the initialized model on the image contour. In the figure, green indicates contour belonging to the original image, fuchsia indicates contour belonging to the captured human model, and white indicates where the two contours overlap. The estimates in this paper are relatively close to the actual human posture.
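The colour coding used in the overlay columns can be reproduced with a short sketch. It assumes the two contours are available as binary pixel masks of equal size; the label for pixels covered by neither mask and the overlap ratio are additions made here for illustration and are not described in the paper.

```python
def overlay_label(in_image, in_model):
    """Colour label for one pixel, following the figure's colour scheme."""
    if in_image and in_model:
        return "white"       # both contours agree
    if in_image:
        return "green"       # original image contour only
    if in_model:
        return "fuchsia"     # projected model contour only
    return "background"      # assumption: pixels covered by neither mask


def overlay(image_mask, model_mask):
    """Label every pixel of two equally sized binary masks."""
    return [[overlay_label(i, m) for i, m in zip(image_row, model_row)]
            for image_row, model_row in zip(image_mask, model_mask)]


def overlap_ratio(image_mask, model_mask):
    """Share of covered pixels that are white, as a rough agreement score."""
    labels = [label for row in overlay(image_mask, model_mask) for label in row]
    covered = len(labels) - labels.count("background")
    return labels.count("white") / covered if covered else 1.0


example_image = [[1, 1, 0],
                 [0, 1, 0]]
example_model = [[0, 1, 1],
                 [0, 1, 0]]
example_overlay = overlay(example_image, example_model)
```

A ratio close to 1 means the projected model covers almost exactly the same pixels as the image contour in that view.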

The local optimization algorithm, although fast, is prone to wrong estimates when occlusion occurs or when the action is fast, and once an error occurs, it is difficult for it to recover on its own. Global optimization algorithms such as particle filtering represent the uncertainty in the pose space through a Bayesian formulation. Since describing the human body pose requires a model with at least 20 degrees of freedom, an extremely large number of particles is required in such a high-dimensional pose space, which makes particle filtering very expensive. Adding a simulated annealing strategy greatly reduces the number of particles and makes the problem tractable, but a long run time is still required. In this paper, a hybrid of local and global optimization is used to estimate the human body pose: the global optimization algorithm is activated only when the local optimization algorithm has a matching error, and the initial pose of the global optimization is taken from the result of the local algorithm. This paper also adds prior constraints on the human body pose, including joint angle constraints and pose information learned from a motion capture database, as constraints on the particle energy. The number of particles and the number of iterations are thereby greatly reduced, and experiments confirm that the algorithm obtains correct pose estimates.
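The hybrid strategy can be sketched on a one-dimensional toy problem. Everything below is an assumption made for illustration: the matching error is a synthetic multimodal function of a single joint angle, the threshold of 0.1 is arbitrary (the paper uses 500 mm on the contour matching error), and the particles are evenly spaced rather than randomly sampled so that the example is reproducible. It is not the paper's particle filter, only the control flow of "local first, global on failure, search restricted by joint limits".

```python
import math


def part_error(theta, target):
    """Synthetic matching error with local minima away from the target."""
    d = theta - target
    return 0.05 * d * d + 1.0 - math.cos(3.0 * d)


def local_refine(theta, target, step=0.05, max_iters=200):
    """Greedy local search: move to a neighbouring angle while it helps."""
    for _ in range(max_iters):
        candidates = (theta - step, theta, theta + step)
        best = min(candidates, key=lambda c: part_error(c, target))
        if best == theta:
            break
        theta = best
    return theta


def global_search(start, target, lo, hi, n_particles=41, n_layers=5):
    """Layered search: spread particles, keep the best, halve the spread.

    The search starts from the local estimate, and particles are clamped
    to the joint limits [lo, hi] (the prior constraint).
    """
    center, half_width = start, hi - lo
    for _ in range(n_layers):
        particles = [
            max(lo, min(hi, center - half_width
                        + 2.0 * half_width * i / (n_particles - 1)))
            for i in range(n_particles)
        ]
        center = min(particles, key=lambda p: part_error(p, target))
        half_width *= 0.5
    return center


def estimate_pose(prev_theta, target, lo, hi, threshold=0.1):
    """Run local search; fall back to global search on a matching error."""
    theta = local_refine(prev_theta, target)
    used_global = part_error(theta, target) > threshold
    if used_global:
        theta = global_search(theta, target, lo, hi)
    return theta, used_global
```

Calling `estimate_pose(2.4, 0.3, -3.0, 3.0)` starts in the basin of a local minimum, so the local search stalls above the threshold and the global stage takes over, whereas `estimate_pose(0.5, 0.3, -3.0, 3.0)` is solved by the local search alone.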

During local optimization, this paper divides the 3D-2D matching into contour matching and texture matching. Texture matching matches SIFT feature points between two adjacent frames, and these matches need to be computed for each frame of each viewpoint. Although computing the texture matches takes time, pose estimation with texture matching is faster overall than pose estimation without it, because texture matching improves the accuracy of the local optimization and reduces, to some extent, the number of joints that must be re-initialized by the global optimization algorithm, which improves overall efficiency. The texture matching results are shown in Figure 10.

To verify the effect of model fineness on motion capture, experiments were performed with the visual hull mesh model and the stick-like model obtained from multiview images, respectively. Because the visual hull model is not fine-grained enough, especially in the shoulder and armpit areas, it can deform incorrectly as the model moves, which affects the posture estimates. With a stick-like model, an approximate pose can be obtained, but the capture accuracy is much lower than with the fine-grained initialized model. Therefore, the accuracy of pose estimation is improved by using a human model that is close to the captured subject.

4. Conclusion

In film and television, animation, computer games, and other fields, designing the shape of three-dimensional character models and controlling their motion has long been among the most time-consuming and laborious work, and motion capture technology has made this work more convenient and has come into wide use. With the continuous development of computer graphics processing technology, people are no longer satisfied with obtaining only human posture information, and the deformation of human skin and clothing has also begun to attract scholars' attention. Obtaining human body shape and posture information from images is an important branch of computer vision research. In addition to the film, television, animation, and game applications mentioned above, it is widely used in many fields such as security, industrial design, sports training, and medical care.

This paper provides an in-depth discussion of markerless motion capture based on human body models, focusing on three topics: how to capture the 3D human body shape in corresponding frames from image sequences, how to construct parametric human body models and use them to generate initialized models for motion capture, and how to perform human motion capture using local and global optimization algorithms based on fine 3D human body models. Because the human body model reconstructed by the stereo vision algorithm in this study contains topological errors, this paper proposes a stepwise PSMA framework for matching human body pose and shape based on a three-dimensional template model. The framework first matches the pose of the template model to the human body pose in the target image and then matches the shape of the template model. Because changing the shape of the model may slightly change its pose, the pose is optimized a second time to reach the optimal solution. The three-dimensional template model is taken from the first frame in which the outline of every body part is clearly visible, and it can be generated by three-dimensional scanning or by a three-dimensional reconstruction algorithm. The pose transformation is realized by the embedded skeleton, and the shape matching is realized by the Laplacian deformation technique. The method applies to scenes in which the subject wears ordinary or loose clothes, and its feasibility is confirmed by experiments.

This paper implements the parametric human body model SCAPE and proposes a method that uses the SCAPE model to obtain the pose parameters and body shape parameters of the human body model from multicamera images. The SCAPE model is applied within the PSMA framework to generate an initialized model for human motion capture. The pose transformation is again achieved by the embedded skeleton, the body shape is matched using the matching error between the projected contour of the human body model and the image contour, and the shape and pose parameters are estimated according to a pixel distance function and verified experimentally. The effectiveness of the method is demonstrated by comparing the generated 3D human body model with models obtained from real 3D body scans or 3D reconstruction algorithms.

This paper also presents a particle energy function constrained by prior knowledge of human body poses. Based on this function, a combination of local and global optimization algorithms is used to estimate the human body pose accurately. The method improves the accuracy of the local optimization algorithm by using not only human body contour matching but also texture matching between adjacent image frames. The energy function constrains the particle states used in the global optimization algorithm, which effectively reduces the number of particles and the number of iterations and improves the efficiency of the algorithm. In addition, comparative experiments verify that a fine-grained human body model gives higher capture accuracy than a coarse stick model.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.