Abstract

This paper surveys modeling methods for the deformable human body and for motion analysis over the past 30 years. First, basic concepts of human representation and modeling are introduced. Then, typical human modeling technologies, including 2D models, 3D surface models, and geometry-based, physics-based, and anatomy-based approaches, as well as model-based motion analysis, are summarized, and their characteristics are analyzed. Finally, the techniques accumulated in the field are outlined to give an overview.

1. Introduction

Human body modeling is experiencing a continuous and accelerated growth. This is partly due to the increasing demand from computer graphics and computer vision communities. Computer graphics pursues a realistic modeling of both the human body geometry and its associated motion. This will benefit applications such as games, virtual reality, or animations, which demand highly realistic human body models (HBMs).

Recently, computer vision has been used to generate HBMs automatically from image sequences by incorporating and exploiting prior knowledge of human appearance. Computer vision thus also addresses human body modeling but, in contrast to computer graphics, it favors an efficient model over a highly accurate one for applications such as intelligent video surveillance, motion analysis, telepresence, and human-machine interfaces. Computer vision applications rely on vision sensors for reconstructing HBMs. The rich information provided by a vision sensor, which contains all the data needed to generate an HBM, must therefore be processed. Pipelines such as tracking, segmentation, and model fitting, or motion prediction, segmentation, and model fitting, as well as other combinations, have been proposed; their performance differs according to the nature of the scene to be processed (e.g., indoor environments, studio-like environments, outdoor environments, or single-person scenes). The challenge is to produce an HBM that faithfully follows the movements of a real person [1, 2].

Modeling a human is a great challenge if we consider the numerous parts needed to compose a body. The first step is the basic structure modeling, the definition of the joints, their positions, orientations, and the geometric model that will describe the body hierarchy. Next, we need to think about the body volume and on top of this, we can use a parametric surface to simulate skin for example. With these three elements, we can reach a good representation of a body.

Methods used to deform the human skin layer include curve-, contour-, surface-, geometry-, physics-, and anatomy-based approaches. 2D models such as curve and contour models offer representational and computational simplicity and are often preferred over 3D models for applications involving monocular images and video. These models typically represent the shape of the human body only coarsely, and none of them explicitly models how clothing influences human shape. Surface models used in animation have a highly structured mesh that gives a high-resolution representation in areas of deformation and an efficient representation elsewhere. Preserving this vertex parameterization is important when reconstructing models that are to be used for animation. Geometry-based approaches such as free form deformation (FFD) give users flexible control over the deformation of a model. They are simple and fast but require considerable skill to produce a realistic model. Physics-based approaches such as the finite element method (FEM) can model skin layers accurately according to their physical properties. Anatomy-based methods create an accurate human body model based on a precise representation of the skeleton, muscles, and fatty tissues. These techniques generate realistic, dynamic deformation of an articulated body using physical simulation but, because of their high computational cost, they are used mainly in offline simulation and animation [3].

Model-based pose estimation algorithms aim to recover human motion from one or more camera views and a 3D model of the human body. The model pose is usually parameterized with a kinematic chain, so that the pose is represented by a vector of joint angles. Most algorithms minimize an error function that measures how well the 3D model fits the image. Such algorithms usually have two main stages: defining the model and fitting the model to image observations. The model-image association problem for pose estimation is usually formulated as the minimization of an error function or the maximization of a likelihood function. Two main strategies have been described: local optimization and particle-based optimization. Local optimization methods are faster and more accurate, but in practice the tracker may fail catastrophically when there are visual ambiguities or very fast motions. For greater robustness, particle filters can be used because they represent uncertainty within a rigorous Bayesian framework [4].
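The model-fitting stage described above can be illustrated with a minimal sketch of local optimization. It assumes a planar two-link kinematic chain whose "image observations" are simply the 2D joint positions; the function names, limb lengths, step size, and the use of plain gradient descent are illustrative assumptions and are not taken from any of the surveyed methods.

```python
import math

def forward_kinematics(angles, lengths):
    """Joint positions of a planar kinematic chain rooted at the origin."""
    x = y = heading = 0.0
    joints = []
    for angle, length in zip(angles, lengths):
        heading += angle  # relative joint angles accumulate along the chain
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        joints.append((x, y))
    return joints

def fit_error(angles, lengths, observed):
    """Sum of squared distances between model joints and observed joints."""
    model = forward_kinematics(angles, lengths)
    return sum((mx - ox) ** 2 + (my - oy) ** 2
               for (mx, my), (ox, oy) in zip(model, observed))

def local_fit(start, lengths, observed, step=0.05, iters=2000, eps=1e-6):
    """Local optimization: gradient descent with central-difference gradients."""
    angles = list(start)
    for _ in range(iters):
        grad = []
        for i in range(len(angles)):
            hi = angles[:]
            lo = angles[:]
            hi[i] += eps
            lo[i] -= eps
            grad.append((fit_error(hi, lengths, observed)
                         - fit_error(lo, lengths, observed)) / (2 * eps))
        angles = [a - step * g for a, g in zip(angles, grad)]
    return angles

lengths = [1.0, 1.0]        # hypothetical limb lengths
true_pose = [0.5, 0.4]      # hypothetical ground-truth joint angles (radians)
observed = forward_kinematics(true_pose, lengths)
estimate = local_fit([0.3, 0.3], lengths, observed)
print([round(a, 4) for a in estimate])
```

Because the search starts near the true pose, descent converges; started far away, the same procedure can settle in a wrong local minimum, which is the failure mode that motivates particle-based strategies.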

Several reviews of human motion capture, detection, and analysis have been published. Aggarwal et al. [5] reviewed articulated and elastic nonrigid motion in 1994. No other review of non-rigid, elastic, or deformable human motion analysis has appeared in the 19 years since, and this paper is the first such review after 1995.

This paper is organized as follows. Section 2 presents the 2D curve and contour human models. Section 3 presents the 3D human surface models, such as quadrics, superquadrics, implicit surface, spline surface, and mesh surface models. Section 4 presents geometry-based human models. Section 5 presents physics-based human models. Anatomically based human models are presented in Section 6.

2. Two-Dimensional Models

2.1. Curve Models

Tabia et al. proposed an approach to matching 3D objects in the presence of nonrigid transformations and partially similar models [6]. They adopt the square-root elastic (SRE) framework because it simplifies elastic shape analysis. They define a space of closed curves of interest, impose a Riemannian structure on this space using the elastic metric, and compute geodesic paths under this metric. These geodesic paths can then be interpreted as optimal elastic deformations of curves. Mori and Malik [7] took a single two-dimensional image containing a human figure, located the joint positions, and used them to estimate the body configuration and pose in three-dimensional space (see Figure 1(a)). They match the input image to each stored view using shape context matching in conjunction with a kinematic chain-based deformation model. In their approach, a shape is represented by a discrete set of points sampled from the internal and external contours of the shape. They first perform edge detection on the image, using a boundary detector to obtain a set of edge pixels on the contours of the body, and then sample a number of points (300–1000) from these edge pixels to serve as the sample points for the body. Srivastava et al. [8] introduced a square-root velocity (SRV) representation for analyzing shapes of curves in Euclidean spaces using an elastic metric (see Figure 1(b)). Huang et al. [9] presented a variational and statistical approach for shape registration (see Figure 1(c)). Shapes of interest are implicitly embedded in a higher-dimensional space of distance transforms. In this implicit embedding space, registration is formulated hierarchically: a mutual information criterion, which supports various transformation models, is optimized to perform global registration; then a B-spline-based incremental free form deformation (IFFD) model is used to minimize a sum-of-squared-differences (SSD) measure and recover a dense local nonrigid registration field.
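As an illustration of the square-root velocity idea mentioned above, the following sketch computes a discrete SRV function q = f' / sqrt(|f'|) for a sampled curve. The uniform sampling, the forward-difference derivative, and all function names are assumptions made for this example; it is not the implementation of [8]. One convenient check is that the squared L2 norm of q equals the length of the curve.

```python
import math

def srv(points, closed=True):
    """Discrete square-root velocity function q = f' / sqrt(|f'|).

    `points` samples a curve f at uniform parameter steps on [0, 1].
    """
    n = len(points)
    segments = n if closed else n - 1
    dt = 1.0 / segments
    q = []
    for i in range(segments):
        nxt = points[(i + 1) % n]
        vel = [(b - a) / dt for a, b in zip(points[i], nxt)]  # forward difference
        speed = math.sqrt(sum(v * v for v in vel))
        scale = 1.0 / math.sqrt(speed) if speed > 0.0 else 0.0
        q.append([v * scale for v in vel])
    return q, dt

def squared_l2_norm(q, dt):
    """Integral of |q|^2 dt, which equals the length of the sampled curve."""
    return sum(sum(c * c for c in qi) for qi in q) * dt

n = 1000
circle = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
          for k in range(n)]
q, dt = srv(circle)
length = squared_l2_norm(q, dt)  # close to 2 * pi for the unit circle
```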

2.2. Contour Models

Liu et al. [10] presented a boosted deformable model for human body alignment. Their model representation consists of a shape component represented by a point distribution model and an appearance component represented by a collection of local features, trained discriminatively as a two-class classifier using boosting (see Figure 2(a)). Freifeld et al. [11] defined a new “contour person” model of the human body that has the expressive power of a detailed 3D model and the computational benefits of a simple 2D part-based model. The contour person (CP) model is learned from a 3D SCAPE model of the human body that captures natural shape and pose variations; the projected contours of this model, along with their segmentation into parts, form the training set (see Figure 2(b)). The CP model factors deformations of the body into three components: shape variation, viewpoint change, and part rotation. The part rotation component also incorporates a learned non-rigid deformation model. Zuffi et al. [12] defined a new deformable structures (DS) model that is a natural extension of previous pictorial structures (PS) models and that captures the non-rigid shape deformation of the parts (see Figure 2(c)). Each part in a DS model is represented by a low-dimensional shape deformation space, and pairwise potentials between parts capture how the shape varies with pose and with the shape of neighboring parts. A key advantage of such a model is that it represents object boundaries more accurately, which enables image likelihood models that are more discriminative than the previous PS likelihoods. They focus on a human DS model learned from 2D projections of a realistic 3D human body model and use it to infer human poses in images using a form of nonparametric belief propagation. Guan et al. [13] studied detection, tracking, segmentation, and pose estimation of people in monocular images. They start with a contour person (CP) model (see Figure 2(d)), which is a low-dimensional, realistic, parameterized generative model of 2D human shape and pose. The CP model is learned from examples created by 2D projections of multiple shapes and poses generated from a 3D body model such as SCAPE. The CP model is based on a template, a reference contour that can be deformed into a new pose and shape. This deformation is parameterized and factors the changes in a person’s 2D shape into those due to pose, body shape, and the parameters of the viewing camera. This factorization allows the different causes of shape change to be modeled separately.

3. 3D Surface Models

3.1. Quadrics

Park and Hodgins [14] presented a technique for capturing and animating skin motion using a commercial motion capture system and approximately 350 markers. They supplement these markers with a detailed, actor-specific surface model (see Figure 3(a)). The motion of the skin can then be computed by segmenting the marker motion into the motion of a set of rigid parts and a residual deformation (approximated first as a quadratic transformation and then with radial basis functions). Fayad et al. [15] proposed a more general shape model that accounts for quadratic deformations. Their approach takes motion capture (MOCAP) data as input and uses a factorization framework to extract more accurate estimates of the rigid component of the different body segments. The parameters of the model are computed using a Levenberg-Marquardt nonlinear optimization scheme. Hyun et al. [16] presented a new approach to the modeling and deformation of a human or virtual character’s arms and legs. Each limb is represented as a set of ellipsoids of varying sizes interpolated along a skeleton curve (see Figure 3(b)). A base surface is generated by approximating these ellipsoids with a swept ellipse, and the difference between that surface and the detailed shape of the arm or leg is represented as a displacement map. Pan and Liu [17] presented a model of elastic articulated objects based on revolving conic surfaces and a method of model-based motion estimation (see Figure 3(c)). The model includes a 3D object skeleton and deformable surfaces that can represent the deformation of human body surfaces. In each limb, surface deformation is represented by adjusting one or two deformation parameters. The 3D deformation parameters are then determined from corresponding 2D image points and contours under a volume-invariance constraint, and the 3D motion parameters are estimated based on the 3D model.

3.2. Superquadrics

Terzopoulos and Metaxas [18] presented a physically based approach to fitting complex three-dimensional shapes using a class of dynamic models that can deform both locally and globally. They formulate the deformable superquadrics, which incorporate the global shape parameters of a conventional superellipsoid with the local degrees of freedom of a spline. The model’s six global deformational degrees of freedom capture gross shape features from visual data and provide salient part descriptors for efficient indexing into a database of stored models. The local deformation parameters reconstruct the details of complex shapes that the global abstraction misses. The equations of motion which govern the behavior of deformable superquadrics make them responsive to externally applied forces. The authors fit models to visual data by transforming the data into forces and simulating the equations of motion through time to adjust the translational, rotational, and deformational degrees of freedom of the models. Sminchisescu [19] built a human body model which consists of a kinematic “skeleton” of articulated joints controlled by angular joint parameters, covered by “flesh” built from superquadric ellipsoids with additional tapering and bending parameters (see Figure 4(a)). A typical model has around 30 joint parameters, plus 8 internal proportion parameters encoding the positions of the hip, clavicle, and skull tip joints, plus 9 deformable shape parameters for each body part, gathered into a vector. A complete model can be encoded as a single large parameter vector. During tracking, they usually estimate only joint parameters, but during initialization the most important internal proportions and shape parameters are also optimized, subject to a soft prior based on standard humanoid dimensions and updated using collected image evidence. 
Although this model is far from being photo realistic, it suffices for high-level interpretation and realistic occlusion prediction, and it offers a good trade-off between computational complexity and coverage. Hofmann and Gavrila [20] presented an approach for 3D human body shape model adaptation to a sequence of multi-view images. They use an articulated model with linearly tapered superquadrics as geometric primitives for torso, neck, head, upper and lower arm, hand, upper and lower leg, and foot, assuming body symmetry (see Figure 4(b)). The parameter space of each superquadric comprises parameters for length, squareness, and tapering. They implement automatic pose and shape estimation using a three-step procedure: first, they recover initial pose over a sequence using an initial (generic) body model. Both model and poses then serve as input to the above-mentioned adaptation process. Finally, a more accurate pose recovery is obtained by means of the adapted model. Sundaresan and Chellappa [21] proposed a general approach using Laplacian Eigenmaps and a graphical model of the human body to segment 3D voxel data of humans into different articulated chains. They select the superquadric model to represent human bodies (see Figure 4(c)). They use a hierarchical approach beginning with a skeletal model (joint locations and limb lengths) and then proceeding to increase the model complexity and refining parameters to obtain a volumetric model (superquadric parameters). Yang and Lee [22] reconstructed a 3D human body pose from stereo image sequences based on a top-down learning method. The 3D human body model consists of 17 body components. The human body model has 40 degrees of freedom (DOF). Tapered superquadrics are employed to represent body components.
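To make the tapered superquadric primitive more concrete, the following sketch evaluates a commonly used superellipsoid inside-outside function with linear tapering along the z axis. The parameterization (sizes a1, a2, a3, squareness exponents e1 and e2, taper factors kx and ky with magnitude below 1) is a textbook formulation assumed here for illustration; the cited works may use different conventions.

```python
def superquadric_F(p, size, e1, e2, taper=(0.0, 0.0)):
    """Inside-outside function: F < 1 inside, F = 1 on the surface, F > 1 outside.

    p     -- point (x, y, z) in the primitive's local frame
    size  -- half-extents (a1, a2, a3)
    e1/e2 -- squareness in the z direction and in the x-y plane
    taper -- linear taper factors (kx, ky), each assumed to lie in (-1, 1)
    """
    x, y, z = p
    a1, a2, a3 = size
    kx, ky = taper
    # Undo the tapering so the point can be tested against the untapered shape.
    x = x / (kx * z / a3 + 1.0)
    y = y / (ky * z / a3 + 1.0)
    xy = abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)
    return xy ** (e2 / e1) + abs(z / a3) ** (2.0 / e1)

point = (0.8, 0.0, 0.9)
# Nearly cylindrical limb segment (small e1), no tapering: the point is inside.
untapered = superquadric_F(point, (1.0, 1.0, 1.0), 0.1, 1.0)
# Same segment narrowing toward +z: the same point now lies outside.
tapered = superquadric_F(point, (1.0, 1.0, 1.0), 0.1, 1.0, taper=(-0.5, -0.5))
```

With e1 = e2 = 1 and no tapering the function reduces to that of an ordinary ellipsoid, which is why a handful of extra parameters per body part is enough to cover both rounded and box-like segments.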

3.3. Implicit Surface

Matsuda and Nishita [23] modeled the human body with layered metaballs, which correspond to horizontal cross sections of the body, in their cloth simulation system. For each cross section, metaballs are generated from measured sample points on the boundary of the cross section. To fit the metaball surface to the sample points, they employed the steepest descent method. For body deformation, the sample points on the cross section are moved smoothly using Bezier curves. Blinn [24] presented an algorithm applicable to other functional forms, in particular to the summation of several Gaussian density distributions. He modeled the human body with implicit surfaces (see Figure 5(a)) but did not build the model from images. Thalmann et al. [25] presented different methods for representing realistic deformations of virtual humans with various characteristics: sex, age, height, and weight. Their methods, based on a combination of metaballs and splines, could be applied to frame-by-frame computer generated films and virtual environments. Smooth implicit surfaces, known as metaballs, are attached to an articulated skeleton of the human body and are arranged in an anatomically based approximation. This particular human body model includes 230 metaballs. D’Apuzzo et al. [26] outlined techniques for fitting a simplified model to the noisy 3D data extracted from images and presented a new tracking process based on least squares matching. They present a simplified model of a limb in which ellipsoidal metaballs simulate the gross behavior of bone, muscle, and fat tissue. Only three ellipsoidal metaballs are attached to each limb skeleton and arranged in an anatomically based approximation (see Figure 5(b)). Each ellipsoidal metaball has four deformation parameters, so each limb has 12 deformation parameters. Fua et al. [27] presented a comprehensive concept for fitting animation models to a variety of data derived from multi-image video sequences. Their research includes setting up and calibrating a system of three CCD cameras, extracting image silhouettes, tracking individual key body points in 3D, and generating surface data by stereo or multi-image matching. To reduce the number of degrees of freedom (DOFs) and to estimate the skeleton’s position robustly, they replace the multiple metaballs with one ellipsoid attached to each bone in the skeleton (see Figure 5(c)). Plänkers and Fua [28] developed a framework for 3D shape and motion recovery of articulated deformable objects. They propose a formalism that incorporates implicit surfaces into earlier robotics approaches designed to handle articulated structures. Their human body model also includes 230 metaballs (see Figure 5(d)). Tong et al. [29] constructed a human body model using a convolution surface with an articulated kinematic skeleton (see Figure 5(e)). The pose and shape of the human body in a monocular image can be estimated from the convolution curve through nonlinear optimization.
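The implicit-surface idea behind metaballs can be sketched with a sum of Gaussian density contributions, in the spirit of the formulation attributed to Blinn above: the surface is the set of points where the summed field equals a threshold. The ball parameters, the threshold value, and the function names below are illustrative assumptions.

```python
import math

def field(p, balls):
    """Summed density: each ball contributes b * exp(-a * r^2)."""
    total = 0.0
    for center, a, b in balls:
        r2 = sum((pc - cc) ** 2 for pc, cc in zip(p, center))
        total += b * math.exp(-a * r2)
    return total

def inside(p, balls, threshold=0.5):
    """A point belongs to the body where the field reaches the threshold."""
    return field(p, balls) >= threshold

# Two hypothetical metaballs, given as (center, falloff a, strength b).
left = ((-1.0, 0.0, 0.0), 1.0, 1.0)
right = ((1.0, 0.0, 0.0), 1.0, 1.0)
midpoint = (0.0, 0.0, 0.0)

alone = inside(midpoint, [left])            # outside a single ball
blended = inside(midpoint, [left, right])   # inside once the fields add up
```

The midpoint lies outside either ball taken alone but inside their sum, which is the smooth blending behavior that makes a small number of primitives per limb sufficient for a rough body surface.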

3.4. Spline Surface

Nahas et al. [30] described how B-spline surfaces allow supple movements of the body and face (see Figure 6(a)). Their method is empirical and based on parametric animation, and it can be combined with a muscle model for animation. Fu and Yuan [31] described the construction of a human body model based on B-splines (see Figure 6(b)). Following military criteria, they divide the whole body into sixteen limbs and use matrix multiplication to establish the equations of the body’s movement. The blending surface between two limbs is constructed with a transfinite interpolant. Huang et al. [32] discussed a motion modeling method that simulates the bending of the human leg and the corresponding deformations of the muscles on the basis of NURBS free form deformation (FFD) [33]. In general, FFD uses a trivariate tensor product spline to transmit deformations, and in this case it is feasible to use only second-order B-spline basis functions (see Figure 6(c)). Simulation formulas are derived from the anatomical structure of the joints, and they agree with the motion characteristics of the knee joint and muscles. Wang and Jiang [34] simulated three-dimensional bending of the human leg using NURBS-based free form deformation (see Figure 6(d)); the leg is treated as a combination of rigid bone and flexible muscle. Their method improves on Barr’s deformation methods by using weights to control the deformation, and it achieves a good visual effect.
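The trivariate tensor product at the core of FFD can be sketched as follows. For simplicity this example uses Bernstein (Bezier) basis functions on a unit-cube control lattice, which is the classic FFD formulation, rather than the NURBS basis used in [32, 34]; the lattice size and function names are assumptions made for illustration.

```python
from math import comb

def bernstein(i, n, t):
    """Bernstein polynomial B_i^n(t)."""
    return comb(n, i) * t ** i * (1.0 - t) ** (n - i)

def make_lattice(l, m, n):
    """Undeformed control points, evenly spaced over the unit cube."""
    return {(i, j, k): [i / l, j / m, k / n]
            for i in range(l + 1) for j in range(m + 1) for k in range(n + 1)}

def ffd(point, lattice, l, m, n):
    """Map local coordinates (s, t, u) in [0, 1]^3 through the control lattice."""
    s, t, u = point
    out = [0.0, 0.0, 0.0]
    for (i, j, k), ctrl in lattice.items():
        w = bernstein(i, l, s) * bernstein(j, m, t) * bernstein(k, n, u)
        for axis in range(3):
            out[axis] += w * ctrl[axis]
    return out

lattice = make_lattice(2, 2, 2)
rest = ffd((0.3, 0.6, 0.9), lattice, 2, 2, 2)      # undeformed lattice: identity map
lattice[(1, 1, 1)][2] += 1.0                       # pull the central control point up
bulged = ffd((0.5, 0.5, 0.5), lattice, 2, 2, 2)    # interior point follows smoothly
corner = ffd((0.0, 0.0, 0.0), lattice, 2, 2, 2)    # corner of the volume stays put
```

Moving one control point displaces embedded points in proportion to their basis weights, so the deformation falls off smoothly away from the edited region.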

3.5. Mesh Surface

Huang et al. [35] considered the problem of aligning multiple non-rigid surface mesh sequences into a single temporally consistent representation of the shape and motion (see Figure 7(a)). A global alignment graph structure is introduced, which uses shape similarity to identify frames for intersequence registration. Graph optimization is performed to minimize the total non-rigid deformation required to register the input sequences into a common structure. Chang and Lin [36] presented a 3D model-based tracking algorithm called the progressive particle filter to decrease the computational cost in high degrees of freedom by employing hierarchical searching. A 3D virtual human model is developed to simulate human movement. The proposed 3D human model is constructed from deformable flesh (see Figure 7(b)). Deformable flesh can be deformed to precisely fit the target body to achieve accurate tracking results. Liao et al. [37] reconstructed complete 3D deformable models over time by a single depth camera, provided that most parts of the models are observed by the camera at least once. A mesh warping algorithm based on linear mesh deformation is used to align different partial surfaces. A volumetric method is then used to combine partial surfaces, fix missing holes, and smooth alignment errors (see Figure 7(c)). Varanasi et al. [38] addressed the problem of surface tracking in multiple camera environments and over time sequences. In order to fully track a surface undergoing significant deformations, they cast the problem as a mesh evolution over time. Such an evolution is driven by 3D displacement fields estimated between meshes recovered independently at different time frames. The contribution is a novel mesh evolution-based framework that allows to fully track, over long sequences, an unknown surface encountering deformations, including topological changes (see Figure 7(d)).

Balan and Black [39] estimated the detailed 3D shape of a person from images of that person wearing clothing. They employ a parametric body model called SCAPE that captures the variability of body shape between people as well as articulated and non-rigid pose deformations. The model is derived from a large training set of human laser scans that have been brought into full correspondence with a reference mesh (see Figure 7(e)). Starck and Hilton [40] presented a model-based approach to recovering animated models of people from multiple-view video images. A prior humanoid surface model is first decomposed into multiple levels of detail and represented as a hierarchical deformable model for image fitting. A novel mesh parameterization is presented that allows deformation to be propagated through the model hierarchy and surface deformation to be regularized so as to preserve vertex parameterization and animation structure (see Figure 7(f)). De Aguiar et al. [41] jointly captured the motion and the dynamic shape of humans from multiple video streams without using optical markers. Their approach uses a deformable high-quality mesh of a human as the scene representation. It combines an image-based 3D correspondence estimation algorithm with a fast Laplacian mesh deformation scheme to capture both the motion and the surface deformation of the actor from the input video footage (see Figure 7(g)). Wang et al. [42] developed an efficient and intuitive deformation technique for virtual human modeling from silhouette input. In their method, reference silhouettes and target silhouettes are used to modify the synthetic human model, which is represented by a polygonal mesh. The system moves the vertices of the polygonal model so that the spatial relation between the original positions and the reference silhouettes is identical to the relation between the resulting positions and the target silhouettes. Their method is related to axial deformation. Seo et al. [43] aimed to produce realistic deformations of human body models while keeping the system simple to use. Their system is composed of several modules. Skin attachment to an H-Anim skeleton is carried out first, so that the skin deforms under skeletal shape modification as well as in animation. A volumetric deformation module deals with the volumetric scale of body parts (see Figure 7(h)). These deformation operators, together with the skeletal deformation, allow the body model to be adapted automatically to different sizes and proportions to accommodate anthropometric variations. Surface optimization is used to simplify the model, taking into account not only its geometric features but also its animation. Finally, the BDP (MPEG-4 format) generation module describes the geometry of the model and how to animate it according to the MPEG-4 BDP specifications.

4. Geometry-Based Approaches

Kokkinos et al. [44] presented intrinsic shape context (ISC) descriptors for 3D shapes (see Figure 8(a)). They generalize to surfaces the polar sampling of the image domain used in shape contexts: for this purpose, they chart the surface by shooting geodesics outward from the point being analyzed; “angle” is treated as the geodesic shooting direction and radius as the geodesic distance. To deal with orientation ambiguity, they exploit properties of the Fourier transform. For the analysis of deformable 3D shapes, Raviv et al. [45] introduced an (equi)affine invariant diffusion geometry by which surfaces that undergo squeeze and shear transformations can still be properly analyzed (see Figure 8(b)). The definition of an affine invariant metric enables them to construct an invariant Laplacian from which local and global geometric structures are extracted. Castellani et al. [46] exploited a new generative model for encoding the variations of local geometric properties of 3D shapes. Surfaces are locally modeled as a stochastic process, which spans a neighborhood area through a set of circular geodesic pathways and is captured by a modified version of a Hidden Markov Model (HMM), named multicircular HMM (MC-HMM). The proposed approach consists of two main phases: local geometric feature collection and MC-HMM parameter estimation. Akhter et al. [47] proposed a dual approach that describes the evolving 3D structure in trajectory space by a linear combination of basis trajectories. They describe the dual relationship between this trajectory-space representation and the conventional shape-space representation, showing that both have equal power for representing 3D structure. They further show that temporal smoothness in 3D trajectories alone can be used to recover nonrigid structure from a moving camera (see Figure 8(c)). The principal advantage of expressing deforming 3D structure in trajectory space is that the basis can be defined independently of the object. This results in a significant reduction in unknowns and a corresponding stability in estimation. Gotardo and Martinez [48] addressed the classical computer vision problems of rigid and nonrigid structure from motion (SFM) with occlusion. They assume that the columns of the input observation matrix describe smooth 2D point trajectories over time. They then derive a family of efficient methods that estimate the column space of this matrix using compact parameterizations in the Discrete Cosine Transform (DCT) domain. For non-rigid SFM, they propose a 3D shape trajectory approach that solves for the deformable structure as the smooth time trajectory of a single point in a linear shape space.
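The notion of a compact, object-independent trajectory basis can be illustrated with the DCT. The sketch below builds orthonormal DCT-II basis vectors, projects one coordinate of a smooth point trajectory onto the first few of them, and reconstructs it. The trajectory, the number of frames, and the function names are hypothetical; this is not the estimation procedure of [47] or [48], only the basis representation they rely on.

```python
import math

def dct_basis(num_frames, k):
    """k-th orthonormal DCT-II basis vector of length num_frames."""
    scale = math.sqrt((1.0 if k == 0 else 2.0) / num_frames)
    return [scale * math.cos(math.pi * (2 * t + 1) * k / (2 * num_frames))
            for t in range(num_frames)]

def project(trajectory, num_basis):
    """Coefficients of a 1D trajectory in the first num_basis DCT vectors."""
    frames = len(trajectory)
    return [sum(b * x for b, x in zip(dct_basis(frames, k), trajectory))
            for k in range(num_basis)]

def reconstruct(coeffs, num_frames):
    """Trajectory spanned by the given low-frequency coefficients."""
    out = [0.0] * num_frames
    for k, c in enumerate(coeffs):
        for t, b in enumerate(dct_basis(num_frames, k)):
            out[t] += c * b
    return out

def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

frames = 50
trajectory = [math.sin(0.1 * t) for t in range(frames)]   # hypothetical smooth motion
err_4 = rms_error(trajectory, reconstruct(project(trajectory, 4), frames))
err_10 = rms_error(trajectory, reconstruct(project(trajectory, 10), frames))
err_full = rms_error(trajectory, reconstruct(project(trajectory, frames), frames))
```

Because the basis is fixed in advance, only the coefficients per point remain unknown, which is the reduction in unknowns noted above for smooth motion.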

Raviv et al. [49] presented a generalization of symmetries for non-rigid shapes and a numerical framework for their analysis (see Figure 8(d)), addressing the problems of full and partial exact and approximate symmetry detection and classification. Zhu et al. [50] formulated a hierarchical configurable deformable template (HCDT) to model articulated visual objects—such as horses and baseball players—for tasks such as parsing, segmentation, and pose estimation. HCDTs represent an object by an AND/OR graph where the OR nodes act as switches, which enables the graph topology to vary adaptively. This hierarchical representation is compositional, and the node variables represent positions and properties of subparts of the object. The graph and the node variables are required to obey the summarization principle, which enables an efficient compositional inference algorithm to rapidly estimate the state of the HCDT. Cui et al. [51] reported a parameterized model for virtual human body (see Figure 8(e)). In this model, the virtual human body was partitioned into several parts. Based on the partitioned human model, the proportional characteristics of the human body were used to calculate the offset of the vertices to implement the deformation on specific part of the body. The interpolation method was used to smoothen the deformed surfaces.

Liu and Shang [52] presented an example-based method for generating realistic, controllable human models (see Figure 8(f)). Users are assisted in automatically generating example body data by controlling the parameters. Examples from Poser and 3D Max are preprocessed as templates, and the modeling method learns from these examples. After this learning process, the synthesizer transforms the mesh vertices to generate the appropriate shape and proportions of the body geometry through the free form deformation method. Oshita and Suzuki [53] proposed an easy-to-use real-time method for simulating realistic deformation of human skin. They exploit the fact that skin deformations can be categorized into a number of deformation patterns. A deformation pattern for a local skin region is represented using a dynamic height map (see Figure 8(g)). Users of their system specify the region of the body model in texture space to which each deformation pattern should be applied. Then, during animation, the skin deformation of the body model is realized by changing the height patterns based on the joint angles and applying bump mapping or displacement mapping to the body model. Tian et al. [54] presented an improved skinning method that effectively reduces the traditional flaws of skinning. After the key terminology is introduced, the improvements to skinning deformation are described in detail. The main measures are as follows: extra joints are added on the JSL to reduce the distance between adjacent joints, for which a joint importance attribute is introduced; joint clusters replace single joints; and the correspondence between the skin and the JSLs is established with a flexible model and a multijoints-binding method (MJBM), which binds each skin vertex to several joints according to a distance criterion, with a weight coefficient that is a function of the distances between the skin vertex and its related joints. Together, these improvements make the skin deformation smoother (see Figure 8(h)). Zhou and Zhao [55] presented a skin deformation algorithm for creating 3D characters or virtual human models. The algorithm can be applied to rigid deformation, joint-dependent localized deformation, skeleton-driven deformation, cross-contour deformation, and free-form deformation (FFD). These deformations are computed and demonstrated with examples, and the algorithm is applied to overcome the difficulties of mechanically simulating the motion of the human body with club-shape models. The techniques enable the reconstruction of dynamic human models that can be used to define and represent the geometrical and kinematical characteristics of human motion. Shen et al. [56] presented an approach to human skin modeling and deformation based on cross-sectional methods. Internally, the authors use dynamic trimmed parametric patches to describe the smooth deformation of skin pieces; they then polygonalize the parametric patches for final body skin synthesis and rendering. Simple and intuitive, their method combines the advantages of both parametric and polygonal representations, produces very realistic body deformations, and allows surface models to be displayed at several levels of detail. Smeets et al. [57] used an isometric deformation model in which the geodesic distance matrix serves as an isometry-invariant shape representation. Two approaches are described for arriving at a shape descriptor that is invariant to sampling order: the histogram of geodesic distance matrix values and the set of largest singular values of the geodesic distance matrix. Shapes are compared by comparing their shape descriptors, using a distance between the descriptors as the dissimilarity measure.
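The first of the two sampling-order-invariant descriptors mentioned above, the histogram of geodesic distance matrix values, can be sketched as follows. Geodesic distances are approximated here by shortest paths along mesh edges (Floyd-Warshall), and the tiny mesh graph, edge weights, bin width, and function names are hypothetical choices for illustration; the graph is assumed to be connected.

```python
def geodesic_matrix(num_vertices, edges):
    """All-pairs shortest path lengths along mesh edges (Floyd-Warshall)."""
    inf = float("inf")
    d = [[0.0 if i == j else inf for j in range(num_vertices)]
         for i in range(num_vertices)]
    for i, j, w in edges:
        d[i][j] = d[j][i] = min(d[i][j], w)
    for k in range(num_vertices):
        for i in range(num_vertices):
            for j in range(num_vertices):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def histogram_descriptor(d, bin_width=1.0, num_bins=8):
    """Histogram of pairwise distances; independent of vertex ordering."""
    hist = [0] * num_bins
    n = len(d)
    for i in range(n):
        for j in range(i + 1, n):
            hist[min(int(d[i][j] / bin_width), num_bins - 1)] += 1
    return hist

# Hypothetical mesh graph: (vertex, vertex, edge length).
edges = [(0, 1, 1.0), (1, 2, 1.5), (2, 3, 2.0), (3, 4, 1.0), (4, 0, 1.5), (1, 3, 2.0)]
perm = [3, 0, 4, 1, 2]   # the same shape with its vertices stored in another order
shuffled = [(perm[i], perm[j], w) for i, j, w in edges]

h_original = histogram_descriptor(geodesic_matrix(5, edges))
h_shuffled = histogram_descriptor(geodesic_matrix(5, shuffled))
```

Reordering the vertices permutes the rows and columns of the distance matrix but leaves the multiset of its entries unchanged, so both histograms coincide.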

Rumpf and Wirth [58] analyzed the covariance of a number of given shapes interpreted as boundary contours of elastic objects. Based on the notion of nonlinear elastic deformations from one shape to another, a suitable linearization of geometric shape variations is introduced. Once such a linearization is available, a principal component analysis can be carried out. This requires the definition of a covariance metric, that is, an inner product on linearized shape variations. The resulting covariance operator robustly captures strongly nonlinear geometric variations in a physically meaningful way and allows the dominant modes of shape variation to be extracted. The underlying elasticity concept represents an alternative to Riemannian shape statistics. Fundana et al. [59] proposed a method for variational segmentation of image sequences containing nonrigid, moving objects. The method is based on the classical Chan-Vese model augmented with a novel frame-to-frame interaction term, which allows the segmentation result to be updated from one image frame to the next using the previous segmentation result as a shape prior. The interaction term is constructed to be pose invariant and to allow moderate deformations in shape. Mio et al. [60] studied shapes of planar arcs and closed contours modeled on elastic curves obtained by bending, stretching, or compressing line segments nonuniformly along their extensions. Shapes are represented as elements of a quotient space of curves obtained by identifying those that differ by shape-preserving transformations. The elastic properties of the curves are encoded in Riemannian metrics on these spaces. Geodesics in shape spaces are used to quantify shape divergence and to develop morphing techniques. The shape spaces and metrics constructed offer an environment for the study of shape statistics. Elasticity leads to shape correspondences and deformations that are more natural and intuitive than those obtained in several existing models. 
Cremers [61] tackled the challenge of learning dynamical statistical models for implicitly represented shapes and showed how these can be integrated as dynamical shape priors in a Bayesian framework for level set-based image sequence segmentation. The temporal dynamics of a deforming shape are learned by approximating the shape vectors of a sequence of level set functions by a finite-order Markov chain.
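The idea of approximating a sequence of shape vectors by a finite-order Markov chain can be sketched with a linear autoregressive model. The following is a deliberately reduced illustration and not the model of [61]: it assumes a single scalar shape coefficient instead of a full shape vector, fixes the order at 2, ignores the noise term, and uses hypothetical helper names (`simulate_ar2`, `fit_ar2`, `predict_next`).

```python
def simulate_ar2(a1, a2, x0, x1, steps):
    """Generate x[t] = a1 * x[t-1] + a2 * x[t-2] from two start values."""
    series = [x0, x1]
    for _ in range(steps):
        series.append(a1 * series[-1] + a2 * series[-2])
    return series


def fit_ar2(series):
    """Least-squares estimate of (a1, a2) from an observed sequence.

    Solves the 2x2 normal equations of
        minimize sum_t (x[t] - a1 * x[t-1] - a2 * x[t-2])**2
    with Cramer's rule.
    """
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(2, len(series)):
        x0, x1, x2 = series[t], series[t - 1], series[t - 2]
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        r1 += x0 * x1
        r2 += x0 * x2
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("sequence does not determine both coefficients")
    a1 = (r1 * s22 - r2 * s12) / det
    a2 = (r2 * s11 - r1 * s12) / det
    return a1, a2


def predict_next(series, coeffs):
    """One-step prediction from the two most recent values."""
    a1, a2 = coeffs
    return a1 * series[-1] + a2 * series[-2]


# Hypothetical training sequence for one shape coefficient
# (a damped oscillation, as a periodic deformation might produce).
coefficient_track = simulate_ar2(1.6, -0.8, 1.0, 0.5, 30)
learned = fit_ar2(coefficient_track)
forecast = predict_next(coefficient_track[:-1], learned)
```

In a segmentation setting, a prediction of this kind would act as the prior for the shape in the next frame; how that prior is combined with the image data is outside the scope of this sketch.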

5. Physics-Based Approaches

Tang [62] presented a physics-based approach to modeling human skin deformation using the boundary element method (BEM). Given the magnitude of displacement between the skin layer and the underlying skeleton at the anatomical landmarks, the approach determines the displacement of each vertex of the human skin model by using the BEM (see Figure 9(a)). The results were demonstrated by modeling the skin deformation of the human lower limb during jumping and walking motions. Shin and Badler [63] modeled a deformable human arm to improve the accuracy of constrained reach analysis. Their research is composed of two main parts. The first part is modeling a deformable human arm based on empirical biomechanical properties and calculating the deformation due to various contact areas (see Figure 9(b)). The second part is evaluating the reachable space (reachability) from the arm deformation in a given geometric (CAD) environment. Using the empirical force-displacement relation, they built a simple human arm model that deforms using a finite element method. Pentland and Horowitz [64] introduced a physically correct model of elastic nonrigid motion. This model is based on the finite element method (see Figure 9(c)), but it decouples the degrees of freedom by breaking down object motion into rigid and nonrigid vibration or deformation modes. The result is an accurate representation of both rigid and nonrigid motions with greatly reduced dimensionality. Because of the small number of parameters involved, they were able to use this representation to obtain accurate overconstrained estimates of both rigid and nonrigid global motions. These estimates can be integrated over time by the use of an extended Kalman filter [65], resulting in stable and accurate estimates of both 3D shape and 3D velocity. The formulation was then extended to include constrained nonrigid motion. Examples of tracking single nonrigid objects and multiple constrained objects were presented.
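The decoupling into rigid and vibration modes can be made concrete with the smallest possible example. The sketch below is not the finite element formulation of [64]; it assumes a hypothetical 1D system of two equal masses `m` joined by one spring of stiffness `k`, for which the two modes are known in closed form, and all function names are illustrative.

```python
import math

SQRT_HALF = 1.0 / math.sqrt(2.0)
RIGID_MODE = (SQRT_HALF, SQRT_HALF)       # both masses translate together
VIBRATION_MODE = (SQRT_HALF, -SQRT_HALF)  # masses move in opposition


def stiffness_times(u, k):
    """Apply the stiffness matrix K = k * [[1, -1], [-1, 1]] to u."""
    return (k * (u[0] - u[1]), k * (u[1] - u[0]))


def to_modal(u):
    """Project nodal values onto the (rigid, vibration) modes."""
    return (u[0] * RIGID_MODE[0] + u[1] * RIGID_MODE[1],
            u[0] * VIBRATION_MODE[0] + u[1] * VIBRATION_MODE[1])


def from_modal(q):
    """Rebuild nodal values from the two modal amplitudes."""
    return (q[0] * RIGID_MODE[0] + q[1] * VIBRATION_MODE[0],
            q[0] * RIGID_MODE[1] + q[1] * VIBRATION_MODE[1])


def free_motion(u0, v0, t, k, m):
    """Nodal displacements at time t for free (unforced) motion.

    In modal coordinates the two equations are independent: the rigid
    amplitude drifts at constant velocity, the vibration amplitude
    oscillates with angular frequency sqrt(2 * k / m).
    """
    omega = math.sqrt(2.0 * k / m)
    q0, p0 = to_modal(u0), to_modal(v0)
    q_rigid = q0[0] + p0[0] * t
    q_vib = q0[1] * math.cos(omega * t) + p0[1] / omega * math.sin(omega * t)
    return from_modal((q_rigid, q_vib))


# Pull the first mass by one unit, release from rest, and look at the
# state after half a vibration period (k = 2, m = 1, so omega = 2).
half_period_state = free_motion((1.0, 0.0), (0.0, 0.0), math.pi / 2.0, 2.0, 1.0)
```

With more nodes the same projection applies, and keeping only a few low-frequency modes is what yields the reduced dimensionality mentioned above.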

6. Anatomy-Based Approaches

Hyun et al. [66] presented a sweep-based approach to human body modeling and deformation. A rigid 3D human model, given as a polygonal mesh, is approximated with control sweep surfaces. The vertices on the mesh are bound to nearby sweep surfaces and then follow the deformation of the sweep surfaces as the model bends and twists its arms, legs, spine, and neck (see Figure 10(a)). Anatomical features including bone protrusion, muscle bulge, and skin folding are supported by a GPU-based collision detection procedure. The volumes of arms, legs, and torso are kept constant by a simple control using a volume integral formula for sweep surfaces. Zuo et al. [67] proposed a new method of muscle modeling based on both anatomical and real-time considerations. In the muscle modeling system, a muscle can be constructed and edited easily by specifying radial and transverse cross-section control parameters. The muscle model is deformed through axial deformation and cross-section deformation. The user can adjust the precision of the models to meet different requirements. Nedel and Thalmann [68] proposed a method to simulate human bodies based on anatomy concepts. Their model is divided into three layers and presented in three steps: the concept of a rigid body from a real skeleton, the muscle design and deformation based on physical concepts, and skin generation. Muscles are represented at two levels: the action lines and the muscle shape. The action line represents the force produced by a muscle on the bones, while the muscle shapes used in the simulation consist of surface-based models. To physically simulate deformations, they used a mass-spring system with a new kind of spring, called “angular springs,” which were developed to control the muscle volume during simulation (see Figure 10(b)). 
Aubel and Thalmann [69] proposed a new, generic, multilayered model for automating the deformations of the skin of human characters based on physiological and anatomical considerations. Muscle motion and deformation are automatically derived from an action line that is deformed using a 1D mass-spring system. They cover the muscle layer with a viscoelastic fat layer that concentrates the crucial dynamic effects of the animation (see Figure 10(c)). Min et al. [70] proposed an anatomically based modeling and animation scheme for a human body model whose shape was created from 3D scan data of a human body. The proposed human body model is composed of three layers: a skeleton layer, a muscle layer, and a skin layer. The skeleton layer, represented as a set of joints and bones, controls the animation of the human body model. The muscle layer deforms the skin layer realistically during animation. They create the muscles in that layer using soft objects, also known as blobby objects or metaballs, and deform them through the insertion/origin points of the muscles and the volume-preserving constraints. To deform the skin layer during animation, they bind the skin layer to both the skeleton layer and the muscle layer by finding the joints and muscles that correspond to the vertices on the skin layer. They applied the proposed scheme to modeling the upper limb and shoulder of the human body (see Figure 10(d)).
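A volume-preserving constraint of the kind mentioned for the muscle layer can be sketched as follows. This is a simplified stand-in and not the metaball formulation of [70]: the muscle is assumed to be a single ellipsoid whose long axis runs between origin and insertion, and both cross-section radii are assumed to scale by the same factor. The function names are illustrative.

```python
import math


def ellipsoid_volume(half_length, radius_a, radius_b):
    """Volume of an ellipsoid with the given three semi-axes."""
    return 4.0 / 3.0 * math.pi * half_length * radius_a * radius_b


def deform_muscle(half_length, radius_a, radius_b, new_half_length):
    """Change the muscle length and rescale the cross-section.

    Volume is proportional to half_length * radius_a * radius_b, so
    scaling both radii by sqrt(half_length / new_half_length) keeps the
    product, and therefore the volume, constant.
    """
    if half_length <= 0 or new_half_length <= 0:
        raise ValueError("lengths must be positive")
    scale = math.sqrt(half_length / new_half_length)
    return new_half_length, radius_a * scale, radius_b * scale


# Hypothetical rest shape (arbitrary units) and a contraction that
# shortens the muscle from a half-length of 10 to 8.
rest = (10.0, 2.0, 1.5)
contracted = deform_muscle(*rest, 8.0)
```

Shortening the long axis makes the cross-section radii grow, which reproduces the familiar bulge of a contracting muscle while the enclosed volume stays fixed.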

7. Conclusion

This paper attempts to provide a comprehensive survey of research on deformable human modeling and motion analysis and to provide some structural categories for the methods described in over 60 papers. The surveyed work can be broadly categorized into 2D models, 3D surface models, and geometry-based, physics-based, and anatomy-based approaches. Compared with rigid motion analysis, the analysis of nonrigid and elastic articulated motion is still in its infancy. The main difficulties in developing algorithms for human shape and motion analysis stem from the complex 3D nonrigid motions of the human body. Human motion analysis based on deformable models is motivated by application areas such as medical imaging, biomedical applications, gesture recognition, choreography, video conferencing, material deformation studies, and image compression.

Although much progress has been made over the last decade in human pose estimation based on deformable models, a number of open problems and challenges remain. First, one important motivation of this research is to build a body surface model that properly describes human body deformation from a small number of parameters and supports human 3D shape analysis from 2D image sequences. Second, sports biomechanics analysis determines temporal and spatial parameters, kinematic variables, and kinetic variables of the human body, but it has so far been largely restricted to rigid models, and there is less research using nonrigid models. Image-based sports biomechanics analysis of a deformable human body model involves determining volume, center of gravity, force of gravity, and moment of inertia, as well as kinetic and rotational dynamics analysis. Third, while tracking walking motions in semicontrolled settings is more or less reliable, robust tracking of arbitrary and highly dynamic motions is still challenging even in controlled setups. Fourth, tracking arbitrary motions in outdoor settings has been mostly unaddressed and remains one of the open problems in computer vision. Outdoor human tracking would make it possible to capture sports motions in their real competitive settings. Fifth, tracking people who interact with the environment in offices or in the streets is still an extremely challenging problem. We expect that novel schemes will be presented to deal with human motion analysis based on deformable models in the future.

Acknowledgments

This research is supported by the National Science Foundation of China 61075031, 31270998, and 61173096, China Postdoctoral Science Foundation 2012M511321, Jiangsu Postdoctoral Science Foundation 1102169C, and National Science and Technology Support Program of China 2012BAZ04319.