Abstract

A framework is proposed for imitating real human gait in 3D, directly and automatically, from monocular video of an uncalibrated camera. It first combines polygon approximation with deformable template matching, using knowledge of human anatomy, to obtain the static and dynamic parameters of real human gait. These characteristics are then regularized and normalized. Finally, they are imposed on a 3D human motion model, together with prior constraints and universal gait knowledge, to imitate the gait. For recognition based on this gait imitation, the dimensionality of the time sequences corresponding to the motion curves is first reduced by NPE. Then, taking the essential features acquired from the gait imitation as input, HCRF and SVM are integrated into a single classifier to realize identity recognition on human gait. In the associated experiments, the imitation framework is robust to the subject's clothes and backpack to a certain extent, needs no manual assistance and no camera model information, and is suited to straight-line indoor walking with the viewing angle of the target between 60° and 120°. In the recognition tests, the integrated HCRF/SVM classifier achieves a higher recognition rate than HCRF alone, SVM alone, and a typical baseline method.

1. Introduction

The investigation of human gait has recently received more and more attention. In surveillance especially, gait is the only characteristic that can be recognized at a long distance without contact, so its study occupies an important position in the security field. In medicine, gait study is valuable for diagnosis and for corrective bone treatment of patients.

As an important branch of gait study, human gait imitation not only helps to investigate real motion deeply, but also has a broad scope of application in virtual reality and 3D TV. It can recover visual cues that are hard to obtain from the raw images directly, making the results clearer. For example, some characteristics of the human body cannot be detected continually during motion because of occlusion by other objects or by the body itself; by imitating the gait, we can adjust the corresponding motion model to any viewing angle, at any time, for detection and analysis.

At present, most studies on imitating motion use multiple cameras to reconstruct the motion in 3D [1–5]. These methods share an obvious style: they solve the occlusion problem in the object's motion directly, and the results are remarkable for various poses of human motion, but at the expense of complex camera model information and heavy computation.

There are also many attempts to restore or track 3D human motion from a monocular video sequence directly. However, most of them not only need training to learn and estimate poses but also need manual help initially. For example, [6] initializes the 2D parameter values for the first frame manually by overlaying a model onto the 2D image of that frame. Reference [7] initializes by matching the first frame to six key poses acquired by manual clustering, choosing the pose with minimal matching error as the initial pose. Reference [8] involves prior information extracted from hand-labeled data. In [9], the local model needs contour information extracted manually from real images.

Some need information about the camera. For example, [10] recovers a 3D model of humans from a single frame or a monocular video sequence using a simplification of the camera model based on a collinearity condition. In [11], the webcam must be calibrated in a preprocessing phase. Reference [12] needs both to estimate camera information and to locate joints manually in the first frame of the video sequences.

Others impose too many demands or constraints. For example, [13] is subject not only to a Gaussian prior and Gaussian stabilizers but also to heavy computation time caused by its covariance-scaled sampling. Reference [14] needs to detect the pedestrian's motion trajectory and footprints throughout the segmented video sequence by an associated clustering technique. References [15, 16] need information on both edges and textured regions. And [17] needs to preextract a state subspace from one sequence of motion-capture data for each motion type.

We can see that most motion imitations, whether using multiple cameras or a single camera, rely on numerous continuous characteristics at the cost of expensive precision instruments, accurate camera model information, manual assistance, or other extra demands. Are these demands necessary?

Motivated by these doubts, this paper presents a framework. It first combines polygon approximation with deformable template matching, using knowledge of human anatomy, to obtain the static and dynamic parameters of real human gait. These characteristics are then regularized and normalized. Finally, they are imposed on a 3D human model, together with prior constraints and universal gait knowledge, to imitate real human gait, thus reconstructing 3D human gait from monocular video directly and automatically. In tests on the CASIA gait database, the method is robust to the subject's clothes and backpack to a certain extent for stable straight-line walking at viewing angles of 60°–120°, and it needs no manual assistance and no camera model information. As an application of this framework, the dimensionality of the time sequences corresponding to the imitated gait curves is reduced by neighborhood preserving embedding (NPE), and a classifier that integrates HCRF (hidden conditional random field) with SVM (support vector machine), taking the essential time sequences acquired from the gait imitation as input, realizes identity recognition on gait. Because it captures more of the spatial and temporal structure of the data being classified, it achieves a higher recognition rate in the associated experiments than HCRF or SVM used alone and than the typical baseline method.

The remainder of this paper is organized as follows. Section 2 interprets the basic idea and principle of the gait imitation framework; Section 3 describes the realization of the method in detail; Section 4 gives the application of gait imitation to identity recognition; Section 5 provides the experiments and an analysis of the results; Section 6 concludes the paper.

2. Principle

We all know that when people begin to learn about something new, they tend to compare it with some general model. Then they carve the model in detail using the characteristics of the real thing; that is to say, they impose individuation on the model. A complete impression of the new thing is thus finally formed in their minds.

Here, analogously, in the course of learning a human gait, we first have a prior general human motion model. Second, we acquire the individual characteristics of some real gait in some way. Then, for the characteristics that are detected, we update the corresponding prior values of the relevant parts of the human model; for the characteristics that are not detected, we compensate proportionally with the prior configuration of the relevant parts. Thus, integrating the real individual characteristics with the prior general information, we form a sufficient and complete expression of the gait.

This idea of acquiring characteristics heuristically mitigates occlusion in human motion. It maintains the continuity of motion detection to some extent and adapts well to situations where some characteristics are hard to detect.

The basic principle of the gait imitation method above can be described mathematically by the posterior probability function in formula (2.1) and the entropy function in formula (2.2). Here, $m$ is the prior knowledge of the human model corresponding to some person, $g$ is the prior knowledge on gait, the likelihood function $p(c_t \mid s_t, m, g)$ carries the individual characteristic information $c_t$ on gait, and $S_t$ is a random variable whose range is a group of state sequences on gait, representing the gait imitation:

$$p(s_t \mid c_t, m, g) = \frac{p(c_t \mid s_t, m, g)\, p(s_t \mid m, g)}{p(c_t \mid m, g)}, \tag{2.1}$$

$$H(S_t \mid c_t) = -\sum_{s_t} p(s_t \mid c_t, m, g)\, \log p(s_t \mid c_t, m, g). \tag{2.2}$$

In formula (2.1), according to Bayes' rule, at time $t$, given certain prior information on $m$ and $g$ (note that this prior information is general, or universal for most persons), the more remarkable the individual characteristic information $c_t$ on gait is, the larger the value of the posterior probability becomes. Then, at that time, by formula (2.2) and information theory [18, 19], the value of the corresponding entropy becomes smaller, showing that the next state close to the real one is determined with greater certainty. The analysis at other times is similar. As a result, the whole imitation more closely approximates a real person's gait.

3. Designing

Before designing, we assume that the human gait to be detected is a straight-line walk at constant speed. Following the idea of Section 2, the design includes three main parts, displayed in Figure 1. In the first part, building the human motion model, the universal prior knowledge for drawing the human model and for letting it walk corresponds to $m$ and $g$, respectively. The result of the second part, acquiring the individual characteristics from the real human gait, corresponds to the characteristic information $c_t$. Both of these parts are based on knowledge of human anatomy, dynamics, and kinematics. Besides, the second part mainly uses a method combining polygon approximation and deformable template matching to track and detect the moving object. The third part, which integrates the real individual characteristics with the prior human motion model according to the principle of formulas (2.1) and (2.2), is realized mainly by updating part of the static and dynamic parameters and applying the corresponding compensations.

3.1. Building Human Model in Prior Knowledge
3.1.1. Drawing for the Whole Human Model

Referring to the H-Anim standard [20] in VRML (virtual reality modeling language) [21], the human model of this paper is illustrated in Figure 2, and Figure 3 gives the corresponding 3D human model. The general structure of the 3D human model, with the crotch of the model at the center of the world coordinates, consists of three parts: trunk, upper limbs, and lower limbs. As can be seen, except for the head, hands, and feet, the basic cell is a sphere plus a cylinder side by side, as in Figure 4. The cylinder denotes a bone, and the sphere denotes a joint between bones. The three parts above are connected with such cells in different directions. In detail, at each joint of the model we build a local coordinate frame so that one of its axes points along the axis of the next bone. All the parts begin from the origin of the world coordinates. First, we configure the pose of the local coordinates. Then, we draw a cell in the local coordinates. Next, we move the local coordinates to the end of the cell. Again, we configure the pose of the local coordinates and draw the next cell, and we repeat in that manner until all the cells of the part are drawn. This is shown in Figure 5, where $n$ equals the total number of cells in each part above, $R_x$, $R_y$, and $R_z$ stand for the rotation matrices about the $x$-, $y$-, and $z$-axes, and $T_x$, $T_y$, and $T_z$ stand for the translation matrices along the $x$-, $y$-, and $z$-axes, respectively.
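The per-part drawing loop of Figure 5 can be sketched in legacy OpenGL-style C++ as follows. This is a minimal sketch under our own assumptions: the `Cell` struct, `drawPart`, `drawCell`, and the choice of the negative $y$-axis as the bone axis are illustrative, not the paper's actual code.

```cpp
#include <GL/gl.h>

struct Cell {
    float rx, ry, rz;  // pose of the local frame: rotations about x, y, z (degrees)
    float length;      // bone (cylinder) length; the joint sphere sits at its base
};

void drawCell(float length);  // draws one sphere + cylinder cell; provided elsewhere

void drawPart(const Cell* cells, int n) {
    glPushMatrix();                       // start from the world origin (crotch)
    for (int i = 0; i < n; ++i) {
        // configure the pose of the local coordinates (R_x, R_y, R_z in turn)
        glRotatef(cells[i].rx, 1.0f, 0.0f, 0.0f);
        glRotatef(cells[i].ry, 0.0f, 1.0f, 0.0f);
        glRotatef(cells[i].rz, 0.0f, 0.0f, 1.0f);
        drawCell(cells[i].length);        // sphere (joint) plus cylinder (bone)
        // move the local coordinates to the end of the cell (translation T)
        glTranslatef(0.0f, -cells[i].length, 0.0f);
    }
    glPopMatrix();                        // restore before drawing the next part
}
```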

Consequently, combining Figure 2 with Figure 3 again, the design of the trunk begins from the crotch upward, consisting of 6 segments: head, neck, chest, belly, waist, and abdomen. The design of the upper limbs begins from the neck downward, consisting of 4 segments on each side: shoulder, upper arm, forearm, and hand. (Here, from the crotch to the neck, we do not draw any cell but only configure the pose of the local coordinates and move them, because the trunk has already been drawn.) The design of the lower limbs begins from the crotch downward, consisting of 4 segments on each side: hip, thigh, calf, and foot. The last cell of each part above is replaced with the relevant head, hand, or foot, that is, a sphere or a cuboid. Thus, the human model has 22 segments overall, each of which has 3 degrees of freedom (DOFs). With an additional 3 DOFs for global rotation, there are 22 × 3 + 3 = 69 DOFs for the whole human model. Because the hands and feet themselves affect the human pose little in motion, they are simplified away, leaving 18 × 3 + 3 = 57 DOFs overall.

When the human model above is in motion, the motion is displayed by adjusting the poses of the cells in the model and updating the display continuously as the cells are drawn.

3.1.2. The Universal Prior Knowledge for the Human Motion Model

According to the knowledge of human anatomy [22–24], combined with the H-Anim standard, if we assume the model's height is $h$, then the head occupies $0.13h$; the shoulders' width is $0.26h$; the width of the hip is $0.19h$; when the upper limbs are horizontal, the span between the left and right hands equals $h$; the midpoints of the lower limbs correspond to the knees of the model; and the lengths of the upper arm, forearm, and hand are $0.19h$, $0.15h$, and $0.11h$, respectively.
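These prior proportions are easy to encode directly; the following minimal sketch uses the fractions quoted above, with struct and function names of our own choosing:

```cpp
// Prior proportions of the model: each quantity is a fixed fraction of the height h.
struct PriorProportions {
    double head, shoulderWidth, hipWidth, upperArm, forearm, hand;
};

const PriorProportions kPrior = { 0.13, 0.26, 0.19, 0.19, 0.15, 0.11 };

// e.g., the upper-arm length of a model of height h
inline double upperArmLength(double h) { return kPrior.upperArm * h; }
```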

When the model is in walking mode, according to the knowledge of human dynamics and kinematics [25–30], as Figure 6 displays, a normal walking mode can be described as follows.

The upper limbs swing backward and forward on the left and right sides in turn, and at the same time so do the lower limbs. The phases of the upper-limb and lower-limb motion are contrary: when the upper limbs swing forward on the left side, the lower limbs swing forward on the right side, and when the upper limbs swing forward on the right side, the lower limbs swing forward on the left side, to and fro. During walking, the neck bends slightly forward, and the elbows and knees both go through cycles of slight bending and stretching. Simultaneously, the shoulders and hips, corresponding to the respective limbs, also swing slightly back and forth. With the limbs in motion, the trunk, including the head, also sways slightly, turning clockwise and anticlockwise. In addition, there are some constraints: the elbows can bend forward only, and the knees can bend backward only; the extent of bending at the elbow increases to its maximum when the corresponding upper limb swings to the front end and decreases to its minimum when it swings to the back end; the extent of bending at the knee decreases to its minimum when the corresponding lower limb swings to the front end and increases to its maximum when it swings to the back end.
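As an illustrative sketch of these constraints (the sinusoidal phase model and all the names below are our assumptions, not the paper's exact rule), the bends can be driven by each limb's swing phase:

```cpp
#include <cmath>

const double kPi = 3.14159265358979;

// swing position of a limb: +1 at the front end, -1 at the back end
double swing(double phase) { return std::sin(phase); }

// elbow bend: forward only, maximal when the arm swings to the front end
double elbowBend(double armPhase, double minBend, double maxBend) {
    return minBend + (maxBend - minBend) * (swing(armPhase) + 1.0) / 2.0;
}

// knee bend: backward only, minimal when the leg swings to the front end
double kneeBend(double legPhase, double minBend, double maxBend) {
    return maxBend - (maxBend - minBend) * (swing(legPhase) + 1.0) / 2.0;
}

// upper and lower limbs on the same side swing in contrary phases
double armPhaseFromLeg(double legPhase) { return legPhase + kPi; }
```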

The prior structural and motion parameters above will be partially updated once the actual parameters of a real individual are acquired.

3.2. Acquiring the Individual Characteristics from Motional Object in Reality
3.2.1. Detecting and Tracking Motional Object

To acquire the actual moving object, this paper first processes the video images of real human gait by graying, background modeling and updating, background subtraction, binary conversion, and binary morphological operations, in turn. Second, consulting [31], we use a connectivity method to retrieve all the contours and reconstruct the full hierarchy of nested contours of the moving object, compressing horizontal, vertical, and diagonal segments so that only their end points remain. Thus, we acquire the external contour of the actual moving object. Finally, referring to [32], the contour is approximated by a polygon with an assigned accuracy. When the overall number of vertices of the polygon is 4, we carry out deformable template matching to acquire the individual characteristics of the real human gait. The principle of deformable template matching is shown in Figure 7.
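Assuming an OpenCV-style implementation (the paper cites [31, 32] for the contour and approximation routines; the exact API below is our choice), the detection chain might look like this minimal sketch:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Point> extractGaitPolygon(const cv::Mat& frame,
                                          const cv::Mat& background) {
    cv::Mat gray, fg, bin;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);          // graying
    cv::absdiff(gray, background, fg);                      // background subtraction
    cv::threshold(fg, bin, 30, 255, cv::THRESH_BINARY);     // binary conversion
    cv::morphologyEx(bin, bin, cv::MORPH_OPEN, cv::Mat());  // morphological cleanup

    // retrieve all contours with the full nesting hierarchy, compressing
    // horizontal/vertical/diagonal runs to their end points
    std::vector<std::vector<cv::Point> > contours;
    std::vector<cv::Vec4i> hierarchy;
    cv::findContours(bin, contours, hierarchy,
                     cv::RETR_TREE, cv::CHAIN_APPROX_SIMPLE);

    // keep the largest contour as the external contour of the moving object
    int best = -1; double bestArea = 0.0;
    for (size_t i = 0; i < contours.size(); ++i) {
        double a = cv::contourArea(contours[i]);
        if (a > bestArea) { bestArea = a; best = (int)i; }
    }

    // approximate by a polygon with an assigned accuracy; template matching
    // is attempted once the polygon has 4 vertices
    std::vector<cv::Point> poly;
    if (best >= 0)
        cv::approxPolyDP(contours[best], poly, /*epsilon=*/5.0, /*closed=*/true);
    return poly;
}
```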

3.2.2. The Principle of Deformable Template-Matching

In Figure 7, the points a, b, c, and d correspond to the top of the head, the crotch, the left ankle, and the right ankle, respectively. Assume that their image coordinates are $(x_a, y_a)$, $(x_b, y_b)$, $(x_c, y_c)$, and $(x_d, y_d)$, respectively, with $y$ increasing downward. The constraining rules are then orderings of these points:

$$y_a < y_b < \min(y_c, y_d), \qquad x_c < x_b < x_d. \tag{3.1}$$

We can infer that the deformable template matching can take effect only when the spread between the lower limbs of the object is roughly at its largest; in other words, that is the moment at which the matching is likely to succeed.

According to the knowledge of human anatomy [22–24], in Figure 7, if we let $h$ denote the length from the crotch to the top of the head, then the neck is about $0.75h$ above the crotch, and the waist is about $0.30h$ above the crotch. We then connect the waist with the left ankle and the right ankle, respectively, and the midpoints of these connecting segments can be taken as the knees. Thus, we determine the positions of the lower limbs approximately at that moment. Because of the variety of upper-limb motion, and because some parts of the upper limbs have little effect on gait, the structure and motion of the upper limbs are estimated entirely in proportion to the prior human motion model. Thus, once the deformable template matching succeeds, all the above marks on the subject's body are placed automatically, without any manual work.
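This landmark derivation is simple to state in code. The following sketch (all names are ours) takes the four matched template points and returns the estimated neck, waist, and knees:

```cpp
#include <opencv2/core/core.hpp>

// a = head top, b = crotch, c/d = left/right ankle, in image coordinates
struct Landmarks {
    cv::Point2f neck, waist, leftKnee, rightKnee;
};

Landmarks deriveLandmarks(cv::Point2f a, cv::Point2f b,
                          cv::Point2f c, cv::Point2f d) {
    Landmarks L;
    cv::Point2f up = a - b;              // crotch -> head direction, length h
    L.neck  = b + 0.75f * up;            // neck  ~0.75 h above the crotch
    L.waist = b + 0.30f * up;            // waist ~0.30 h above the crotch
    L.leftKnee  = 0.5f * (L.waist + c);  // knees: midpoints of the
    L.rightKnee = 0.5f * (L.waist + d);  // waist-ankle connecting segments
    return L;
}
```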

Figure 8 shows the actual form of deformable template matching. The prior model places the crotch and hips on the same horizontal line for simplification, and slight bending of the elbows, knees, and neck has already been added to the model in proportion, as universal prior knowledge, before the real gait is imitated in 3D. Consequently, the errors whereby the measured limb lengths are probably shorter than the real lengths, owing to the bending of real joints, are compensated to a certain extent. This strategy saves processing time effectively.

3.2.3. Regularization and Normalization for Detected Characteristics

Because the motion of the detected object contains depth variation, the lengths detected for the same real part at different times and positions may differ. Hence, for the measurements to be approximately uniform, the detected lengths need to be regularized and normalized. The principle of regularization in this paper is illustrated in Figure 9.

In Figure 9, the two rectangles are alike in shape and different in size. We assume that this difference results from the same object being detected at different distances, or depths, from the viewer. Corresponding line segments $l_1$ and $l_2$ in the two rectangles differ in direct measurement, but according to the analysis of formula (3.2), if each is divided by the square root of the area $S$ of its smallest circumscribing rectangle (or another circumscribing shape),

$$\tilde{l} = \frac{l}{\sqrt{S}}, \tag{3.2}$$

the two direction-dependent measurements are transformed into equivalent forms; that is, the transformed values act as expressions of the real size. Figure 10 shows the actual deformable template matching with the circumscribing rectangle for a real moving object.

This paper regularizes the lengths of every part of the detected object in light of the principle above. Then the regularized lengths of each part, accumulated over the whole sampling process, are divided by the respective sample count to obtain the mean value of each part. Next, the mean values are divided by the height of the object to obtain the proportions of all the parts, realizing the normalization. In addition, in Figure 7, the step is measured from the maximum angle between the two thighs; for this variable we need no regularization and normalization, but use the law of cosines and the corresponding inverse trigonometric function to obtain its value when the spread between the two thighs is maximal. As for the obliquity of the trunk, it is measured with the inverse tangent of the ratio between the differences of the $x$-coordinates and $y$-coordinates of the hip and the top of the head, and its mean is then taken; whether the trunk actually leans forward or backward is decided by the sign of the result. The corresponding computation is formula (3.3), where $l_i$ is the length of the relevant part in sample $i$; $x$ and $y$ stand for the coordinates of the relevant positions; $N$ is the sample count; and $S_i$ is the area of the smallest circumscribing rectangle of the object in sample $i$:

$$\bar{l} = \frac{1}{N}\sum_{i=1}^{N}\frac{l_i}{\sqrt{S_i}}, \qquad \bar{\varphi} = \frac{1}{N}\sum_{i=1}^{N}\arctan\frac{x_{\mathrm{head},i} - x_{\mathrm{hip},i}}{y_{\mathrm{head},i} - y_{\mathrm{hip},i}}. \tag{3.3}$$

Finally, all the parameters above are regarded as the final detected individual characteristics.
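A compact sketch of this regularization and normalization step, under our own naming and data layout, is:

```cpp
#include <cmath>
#include <vector>

// one detected sample: raw part lengths plus the bounding-rectangle area
struct Sample {
    std::vector<double> partLen;  // raw lengths of each body part (pixels)
    double rectArea;              // area S of the smallest circumscribing rectangle
};

std::vector<double> normalizedProportions(const std::vector<Sample>& samples,
                                          int heightIndex) {
    size_t parts = samples[0].partLen.size();
    std::vector<double> mean(parts, 0.0);
    for (size_t i = 0; i < samples.size(); ++i)    // regularize and accumulate
        for (size_t k = 0; k < parts; ++k)
            mean[k] += samples[i].partLen[k] / std::sqrt(samples[i].rectArea);
    for (size_t k = 0; k < parts; ++k)             // average over the samples
        mean[k] /= samples.size();
    double h = mean[heightIndex];                  // regularized mean height
    for (size_t k = 0; k < parts; ++k)             // normalize by the height
        mean[k] /= h;
    return mean;
}

// step angle from the law of cosines at the moment of maximal thigh spread
double stepAngle(double thighL, double thighR, double ankleGap) {
    double c = (thighL * thighL + thighR * thighR - ankleGap * ankleGap)
             / (2.0 * thighL * thighR);
    return std::acos(c);
}
```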

3.2.4. Transforming from General Viewing Angle to Silhouette

In theory, any angle of view can be decomposed into two components, horizontal and vertical; that is, any viewing angle can be regarded as the synthesis of these two kinds of viewing angle. Figure 11 gives the decomposition principle of a general viewing angle for a common camera, where $\alpha$ and $\beta$ are the obliquities of the camera against the plane A in the horizontal and vertical directions. Based on these parameters, Figures 12 and 13 display the projections of the deformable template in the horizontal and vertical directions, where the blue figure is the real gait detected in plane A, the red figure is the result observed in plane B, and the green lines are auxiliary lines. In Figures 12 and 13, from the relationships between the coordinates of the projection in plane A and the original graph in plane B, the coordinate transformation and distance computation for a general angle of view can be inferred as formula (3.4):

$$x_A = \frac{x_B}{\cos\alpha}, \qquad y_A = \frac{y_B}{\cos\beta}, \qquad d_A = \sqrt{x_A^2 + y_A^2}. \tag{3.4}$$

In this paper, the angle of view $\theta$ is defined as the angle swept anticlockwise from the walking (facing) direction of the detected object to the watching screen. In terms of the horizontal obliquity above, namely,

$$\alpha = \theta - 90^{\circ}. \tag{3.5}$$

From formula (3.5) above, it can be seen that when the viewing angle is about 90°, the associated cosine factor in formula (3.4) approximates 1. So, at such times, the viewing angle can be ignored in the transformation computation to a certain extent.
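As a toy illustration of this point, and relying on our reconstruction of (3.4)-(3.5) above, a length measured at viewing angle $\theta$ could be corrected as follows; the correction factor tends to 1 as $\theta$ approaches 90°:

```cpp
#include <cmath>

// correct a length measured in the image plane for the horizontal obliquity
// alpha = theta - 90 degrees; cos(alpha) ~ 1 when theta ~ 90 degrees
double correctedLength(double measured, double thetaDeg) {
    double alpha = (thetaDeg - 90.0) * 3.14159265358979 / 180.0;
    return measured / std::cos(alpha);
}
```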

In reality, strictly speaking, the viewing angle of a real camera keeps changing, because of the camera's limited field of view, even when the camera's position is fixed, as in Figure 14.

In Figure 14, with the camera's leaning angle $r$ fixed, as the moving object (the red rectangle) moves from position A to B and C, the corresponding viewing angle changes, and the transformation computation should be based on the viewing angle at each particular position according to formula (3.4). In particular, when the object reaches position B, the viewing angle is 90°; at this moment, by the analysis above, the angle can be ignored in the transformation computation. In the subsequent experiments, we will find that the deformable template matching usually succeeds at about this angle, because the triangular template can take effect only when the subject's shoulders nearly overlap in the view, that is, are approximately aligned along the viewing direction. Considering also the normalization and averaging in formula (3.3), the viewing angle can thus be omitted from the transformation computation to a great extent.

3.3. Integrating Real Individual Characteristics of Detected Motional Object with Prior Human Motion Model

The process by which this paper maps the structural-proportion and motion parameters of the real moving object onto the 3D human model is as follows.

3.3.1. Mapping Static Parameters

This paper fixes the height of the human model used for gait imitation at a constant prior value. Then, combining it with the proportions of all the parts of the trunk and lower limbs obtained in Section 3.2, we compute the corresponding lengths. For the lengths of the parts of the upper limbs, we default to the normal prior values of the human model.

3.3.2. Mapping Dynamic Parameters

This paper substitutes the detected maximum angle between the two thighs and the detected obliquity of the trunk for the corresponding prior values. When the limbs of the walking model swing backward and forward, the switching between the left and right sides is conditioned on the step, so the span of the upper limbs' swing is kept in proportion to the step of the lower limbs.

3.4. Interpretation and Relative Analysis for the Final Motional Form of the Mapped Human Model

Figure 15 shows the final motion of the mapped human model and draws, in real time, the motion curves of the model's main joints, including the ankles, knees, hips, wrists, elbows, shoulders, and the top of the head. The asymmetry in the curves arises because there are cycles of slight bending at the elbows and knees as the limbs swing backward and forward on both sides. Besides, the curves of the upper limbs in reality are synthesized with those of the lower limbs. Note that in Figure 15, the video image for detecting the moving object is 320 × 240, and the image for imitating the motion is initially 362 × 522; because of page-size limits, we have shrunk the images while keeping their aspect ratios. We also set the angles of view of the human motion model and of the real moving object to be the same, for convenience of side-by-side observation (the same applies to subsequent images of this kind). In fact, once the model has acquired the actual individual motion characteristics and updated itself, we can transform its viewing angle freely for observation.

For deeper study, we redraw the curves accurately in MATLAB using the data behind the real-time curves produced in the VC6.0 development environment of Figure 15; the result is shown in Figure 16. Next, we combine Figures 15 and 16 to understand the characteristics of the curves of the important parts of the moving model.

For the ankles: when the corresponding lower limb rises from the rearmost position of the body and the knee bends backward, the ankle's curve reaches its highest position. Then this lower limb begins to swing forward, and the curve begins to decline. After the limb passes the trunk, the curve begins to rise. When the limb swings to the front end, the curve rises to a high position, because the knee can only bend backward and cannot bend further forward; at that moment the extent of bending is the smallest in the whole motion process. Then the limb begins to fall to the ground. Because the horizontal displacement from the front end to the ground is quite short, the curve declines with a steeper slope than before, reaching its lowest position as the limb lands. Next, the limb on the opposite side of the body moves in the same style, and so on, time after time.

For the wrists, the situation is similar, the difference being that the elbows can only bend forward and cannot bend backward. So, when an upper limb swings to the front end, the corresponding wrist's curve reaches its highest position; when the upper limb swings to the back end, the curve reaches a high position; and while the upper limb passes the trunk, the curve reaches its lowest position. Notice that in reality, although the curves of the upper limbs are synthesized with those of the lower limbs, both sets of curves change with the same general trend in the horizontal and vertical directions, so they affect each other with the same tendency. Specifically, when the upper limbs swing to the front and back ends, corresponding to the highest and high positions, the lower limbs also swing to the front and back ends; moreover, at that moment the front-end lower limb is not bent and the back-end lower limb has not yet risen, so the body's barycenter tends to rise. When the upper limbs swing parallel with the trunk, the lower limbs do too, bending slightly in order to exchange the phase of the swing, so the body's barycenter tends to decline.

For the knees: during the swing of the lower limbs, the displacement of one lower limb from the rearmost position to the front end is, relative to the ground, approximately twice the displacement of the other lower limb from the front end to the rearmost position. Therefore, in motion, the length of the knee's forward-swing curve is greater than that of its backward-swing curve, which basically conforms to reality. As for the elbows, similarly, since the curves of the upper limbs and the corresponding parts of the lower limbs change with the same tendency, as described above, the length of the elbow's forward-swing curve is also greater than that of its backward-swing curve, which is consistent with reality too.

For the shoulders: when the corresponding upper limb swings forward, the trunk twists and the shoulder shifts slightly forward and upward; when the limb swings backward, the trunk twists in reverse and the shoulder shifts slightly backward and downward. The hips move in the same style, except that their phases are opposite to those of the shoulders on the same side and their motion extents are smaller.

For the head: as the upper limbs swing on both sides, because of the inertia of the arms dragging the trunk and the twisting of the trunk itself, the head sways slightly left and right, always toward the upper limb that is swinging backward.

The analysis above is based mainly on the characteristics of human anatomy. According to it, the curves of the imitated motion are basically consistent with the actual characteristics of human gait. In the results of the subsequent experiments, different individuals differ mainly in the parameter values of these curves, while the general shapes of the curves do not change essentially.

4. Application to Identification Recognition on Gait

As an application, we utilize the curve features acquired from the proposed gait imitation framework, combined with an appropriate classifier, to realize human identity recognition. Figure 17 displays the principle of this gait recognition. First, because of the periodicity of human gait, only one gait cycle is studied, obtained by letting the human model walk for two steps. Then all the curves are arranged into sequences. Thus, each curve of the gait imitation can be regarded as a time sequence, and the recognition problem is transformed into the problem of processing all these time-sequence data.

4.1. NPE for Reducing the Dimensionality of Time-Sequences

In reality, the time sequences are very long, while their perceptually meaningful structure is of much lower dimensionality, so dimensionality reduction is needed. Considering that all the time sequences of the curves of the same person are correlated with each other, and that these correlations are important individual characteristics, the structural correlations should be preserved as much as possible while reducing the dimensionality.

Referring to [33], the method of neighborhood preserving embedding (NPE) aims at preserving the local neighborhood structure on the data manifold and is less sensitive to outliers than principal component analysis (PCA). Compared with recently proposed manifold-learning algorithms such as Isomap and locally linear embedding, NPE is defined everywhere rather than only on the training data points. So we adopt this method to reduce the dimensionality:

$$\min_{W} \sum_{i}\Big\| x_i - \sum_{j} W_{ij}\, x_j \Big\|^2, \qquad \text{s.t.}\ \sum_{j} W_{ij} = 1, \tag{4.1}$$

$$X M X^{T} a = \lambda X X^{T} a, \qquad M = (I - W)^{T}(I - W), \tag{4.2}$$

$$y_i = A^{T} x_i, \qquad A = (a_1, a_2, \ldots, a_d). \tag{4.3}$$

The detailed method is as in [33], whose main principle is displayed in formulas (4.1)–(4.3). Note that before applying [33], each time sequence produced by an associated curve is regarded as a data point $x_i$. To preserve the correlations among curves, the adjacency graph is constructed by KNN, with $k$ set to 39; that is to say, each time sequence is reconstructed from the adjacent other 38 time sequences in the motion model. It is reasonable to assume that each local neighborhood is linear, although the data points may reside on a nonlinear submanifold. Then, in turn, we use formula (4.1) to compute the weight matrix of the structural relations among the data points, formula (4.2) to compute the dimensionality-reducing projections, and formula (4.3) to realize the final dimensionality-reducing transformation of each data point $x_i$.

Originally, each motion curve of the human model over one gait cycle has 90 3D space samples. By using NPE, the dimensionality of the corresponding time sequences is reduced from 90 per gait cycle to 39, and the local manifold structure is preserved in the low-dimensional space with an optimal embedding.

4.2. Classifier Integrating SVM with HCRF for Classifying All the Time-Sequences

In the key recognition phase, we integrate the hidden conditional random field (HCRF) with the support vector machine (SVM) to construct the classifier. This kind of classifier owns the merits of both HCRF and SVM. On the one hand, for each time sequence, a set of latent variables conditioned on local features can be learned, while the observations need not be independent and may overlap in space and time [34]. On the other hand, the separating margins of the final decision boundaries are maximized in a high-dimensional space called the feature space [35]. That is, the classifier can solve the classification problem for whole multisequences existing over the same time course.

4.2.1. HCRF for Marking All the Time-Sequences

In Figure 17, for person number $i$, since different joints of the human motion model correspond to different motion curves, all the curves of his motion model, and hence all the corresponding time sequences, are marked differently as X01A, X01B, X01C, X02A, X02B, X02C, …, X13A, X13B, X13C, respectively.

Since all the time sequences are correlated with each other, these observations are of course not conditionally independent. Referring to [34, 36], the hidden conditional random field (HCRF), which uses intermediate hidden variables to model the latent structure of the input domain and defines a joint distribution over the class label and hidden state labels conditioned on the observations, with dependencies between the hidden variables expressed by an undirected graph, does not need the observations to be independent and allows them to overlap in space and time. It can also model sequences in which the underlying graphical model captures temporal dependencies across frames and incorporates long-range dependencies. So, using an HCRF to mark these sequences is a reasonable way of describing them, and the mapping between the sequences and the corresponding labels is conducted by the HCRF's training and inference. Here, the HCRF method is as in [37], similar in principle to [36]. The associated formulas are (4.4) and (4.5).

Formula (4.4) describes the principle of inferring the label $y$ from the HCRF model, given the observation $\mathbf{x}$, the model parameters $\theta$, and the window parameter $\omega$, which is used to incorporate long-range dependencies:

$$P(y \mid \mathbf{x}, \theta, \omega) = \sum_{\mathbf{h}} P(y, \mathbf{h} \mid \mathbf{x}, \theta, \omega) = \frac{\sum_{\mathbf{h}} e^{\Psi(y, \mathbf{h}, \mathbf{x};\, \theta, \omega)}}{\sum_{y', \mathbf{h}} e^{\Psi(y', \mathbf{h}, \mathbf{x};\, \theta, \omega)}}. \tag{4.4}$$

Here $\mathbf{h} = (h_1, \ldots, h_T)$ is a vector of latent variables, not observed on the training examples, where each $h_t$ is a member of a finite set $\mathcal{H}$ of possible hidden states of the HCRF model. Intuitively, each $h_t$ corresponds to a hidden state assignment that may capture the "component" structure of an observation. $\Psi$ is a potential function parameterized by $\theta$ and $\omega$,

$$\Psi(y, \mathbf{h}, \mathbf{x}; \theta, \omega) = \sum_{t}\varphi(\mathbf{x}, t, \omega)\cdot\theta_h(h_t) + \sum_{t}\theta_y(y, h_t) + \sum_{(s,t)\in E}\theta_e(y, h_s, h_t).$$

The graph is a chain in which each node corresponds to a hidden state variable $h_t$ at time $t$; $\varphi(\mathbf{x}, t, \omega)$ is a vector that can include any feature of the observation for the specific window size $\omega$. The inner product $\varphi(\mathbf{x}, t, \omega)\cdot\theta_h(h_t)$ measures the compatibility between the observation and the hidden state $h_t$ at window size $\omega$; each parameter $\theta_y(y, h_t)$ measures the compatibility between hidden state $h_t$ and a label $y$; and each parameter $\theta_e(y, h_s, h_t)$ measures the compatibility between an edge with states $h_s$ and $h_t$ and the label $y$.

Formula (4.5) describes the estimation of the HCRF model parameters $\theta$:

$$\theta^{*} = \arg\max_{\theta} L(\theta), \qquad L(\theta) = \sum_{i}\log P(y_i \mid \mathbf{x}_i, \theta, \omega) - \frac{\|\theta\|^{2}}{2\sigma^{2}}. \tag{4.5}$$

The first term is the log-likelihood of the training data. The second term is the log of a Gaussian prior with variance $\sigma^2$, that is, $P(\theta)\sim\exp(-\|\theta\|^2/2\sigma^2)$. Combining with [37], the typical conjugate-gradient method of [38] is used to estimate the parameters $\theta$.

In this paper, the number of hidden states is set to 10, and the window size $\omega$ is set to 0, 1, and 2 in turn for testing. There are too many kinds and numbers of time sequences to fit a one-versus-all HCRF model or a multiclass HCRF model directly, so a compromise between the two is adopted. In training, each type of articular time sequence across all persons, with its corresponding labels, is learned with a separate HCRF model. In inference, a test sequence is run against the HCRF model trained on the same type of articular time sequence, and the class label with the largest probability becomes the label of the test sequence.
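Structurally, this compromise amounts to keeping one HCRF model per articular sequence type, shared by all subjects. The sketch below shows that organization only; the `HcrfModel` interface is hypothetical, standing in for an implementation along the lines of [34, 36, 37]:

```cpp
#include <string>
#include <vector>

struct HcrfModel {
    virtual ~HcrfModel() {}
    // most probable class label for one test sequence (inference of (4.4))
    virtual std::string inferBestLabel(const std::vector<double>& seq) const = 0;
};

// 13 joints x 3 coordinates = 39 articular sequence types, hence 39 models
const int kNumSeqTypes = 39;

std::vector<std::string> labelSubject(
        const std::vector<const HcrfModel*>& models,          // 39 trained models
        const std::vector<std::vector<double> >& testSeqs) {  // 39 test sequences
    std::vector<std::string> marks;
    for (int t = 0; t < kNumSeqTypes; ++t)
        // run each sequence only against the model of its own type; the
        // label with the largest probability marks the sequence
        marks.push_back(models[t]->inferBestLabel(testSeqs[t]));
    return marks;  // e.g., X01A, X01B, ..., the feature vector for the SVM stage
}
```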

4.2.2. SVM for Classifying All the Signs of Multisequence

In Figure 17, the HCRF training learns a group of labels and their corresponding motion curves, namely, the corresponding time sequences, for a specific human model, and these labels' definitions or values differ from person to person; thus, the labels can be regarded as a group of features for a specific person. However, since there are similarities in most people's walking styles, it is possible that, in HCRF inference, a few of these features are occasionally identical between persons, even though the HCRF can overcome the overlapping of space and time to a certain extent; part of the features then overlap. Referring to [35], since the SVM's decision boundaries are determined directly by the training data, the separating margins of the decision boundaries are maximized in a high-dimensional space called the feature space, and most data that are nonseparable in the low-dimensional space may become separable after mapping into the high-dimensional space. So SVM is used here as the final classifier to recognize a person, taking the marked features associated with his motion imitation as input.

This is a multiclass classification problem. There are many multiclass SVM methods at present; according to the comparison in [39], the "one-against-one" approach [40, 41] is suitable for practical use, so we use this method, as in [42]. Under this method, if the number of training persons in the database is $k$, then $k(k-1)/2$ binary SVM classifiers are needed. Following [42], each classifier adopts the C-support vector classification (C-SVC) model with an RBF kernel, in which two parameters are considered: the penalty $C$ and the kernel width $\gamma$. The two parameters are selected by typical cross-validation via parallel grid search, and all the decision functions finally share the same $(C, \gamma)$.

For training data from the $i$th and $j$th classes, formula (4.6) displays the binary classification problem to be solved:

$$\begin{aligned} \min_{w^{ij},\, b^{ij},\, \xi^{ij}} \quad & \frac{1}{2}(w^{ij})^{T} w^{ij} + C \sum_{t} \xi^{ij}_{t} \\ \text{s.t.}\quad & (w^{ij})^{T}\Phi(x_t) + b^{ij} \ge 1 - \xi^{ij}_{t}, \quad \text{if } x_t \text{ is in the } i\text{th class}, \\ & (w^{ij})^{T}\Phi(x_t) + b^{ij} \le -1 + \xi^{ij}_{t}, \quad \text{if } x_t \text{ is in the } j\text{th class}, \\ & \xi^{ij}_{t} \ge 0. \end{aligned} \tag{4.6}$$

In (4.6), the training data are mapped into a higher-dimensional space by the function $\Phi$; $C$ is the penalty parameter, and $\xi^{ij}$ is the relaxation (slack) parameter. Minimizing $(w^{ij})^{T}w^{ij}/2$ means maximizing $2/\|w^{ij}\|$, the margin between the $i$th and $j$th classes of data. When the data are not linearly separable, the penalty term $C\sum_t \xi^{ij}_t$ balances the regularization term $(w^{ij})^{T}w^{ij}/2$ against the number of training errors. In the final classification, the voting strategy suggested in [40] is used: if the classifier for classes $i$ and $j$ assigns $x$ to the $i$th class, the vote for the $i$th class is incremented by one; otherwise, the vote for the $j$th class is incremented by one. Then $x$ is predicted to be in the class with the largest vote; if two classes have identical votes, the one with the smaller index is selected.
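The voting stage can be sketched independently of the SVM internals; `PairClassifier` below is our illustrative stand-in for a trained pairwise C-SVC (e.g., a LIBSVM model as in [42]):

```cpp
#include <vector>

struct PairClassifier {
    virtual ~PairClassifier() {}
    // decides between its two classes; returns the winning class index
    virtual int predict(const std::vector<double>& x) const = 0;
};

// pairwise holds the k(k-1)/2 trained binary classifiers
int predictByVoting(const std::vector<const PairClassifier*>& pairwise,
                    const std::vector<double>& x, int k) {
    std::vector<int> votes(k, 0);
    for (size_t i = 0; i < pairwise.size(); ++i)
        ++votes[pairwise[i]->predict(x)];      // each pair's winner gets a vote
    int best = 0;
    for (int c = 1; c < k; ++c)                // ties keep the smaller index,
        if (votes[c] > votes[best]) best = c;  // as suggested in [40]
    return best;
}
```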

5. Performance Evaluation

5.1. Evaluation Setup and Dataset

To demonstrate the applicability of the proposed method in a general environment, all the experiments are conducted in Microsoft Visual C++ 6.0 on a Pentium 1.73 GHz personal computer.

We test the proposed framework with the videos of the CASIA gait database [43]. There are 3 subsets in this database: dataset A, dataset B, and dataset C. Dataset A (viz. NLPR) consists of 20 subjects. Each subject has 12 walking sequences, covering 3 walking directions (at angles of 0°, 45°, and 90°, respectively, to the image plane), with 4 walking sequences per direction. The length of each image sequence varies with the person's walking speed; the number of frames per sequence is between 37 and 127. Dataset B is a large-scale multiview database. It includes 124 subjects, each of whom is captured from 11 angles of view (0°, 18°, 36°, 54°, 72°, 90°, 108°, 126°, 144°, 162°, and 180°) and walks under 3 conditions (thin coat, thick coat, and backpack). Dataset C is a large-scale database captured with infrared photography at night, in which 153 subjects walk under 4 conditions (normal walk, quick walk, slow walk, and walk with backpack).

5.2. Experimental Procedures
5.2.1. For Gait Imitation

This paper uses the original videos of the CASIA gait database (viz. dataset B) to test the proposed gait imitation framework. We experiment with all 124 subjects, each walking in a thin coat, a thick coat, and with a backpack, respectively, under the different angles of view in the dataset.

In the experiment, the overall difference among the detected parameters across the different walking conditions for the same person at the same angle of view is measured with two distance measures: the Euclidean distance in formula (5.1) and the Mahalanobis distance in formula (5.2), where $u$ and $v$ are any two groups of the detected data above and $u_i$, $v_i$ are their elements, respectively:

$$d_E(u, v) = \sqrt{\sum_{i}(u_i - v_i)^2}, \tag{5.1}$$

$$d_M(u, v) = \sqrt{(u - v)^{T}\,\Sigma^{-1}\,(u - v)}, \tag{5.2}$$

where $\Sigma$ is the covariance matrix of the data. Finally, all the results of the tested subjects are drawn in 3D space with the MATLAB R2009b tool for analysis.
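In code form, the two measures of (5.1)-(5.2) are straightforward; the helper names are ours, and in practice the inverse covariance would come from a small matrix library:

```cpp
#include <cmath>
#include <vector>

double euclidean(const std::vector<double>& u, const std::vector<double>& v) {
    double s = 0.0;
    for (size_t i = 0; i < u.size(); ++i) s += (u[i] - v[i]) * (u[i] - v[i]);
    return std::sqrt(s);
}

// d_M = sqrt((u - v)^T * SigmaInv * (u - v)), with SigmaInv precomputed
double mahalanobis(const std::vector<double>& u, const std::vector<double>& v,
                   const std::vector<std::vector<double> >& sigmaInv) {
    size_t n = u.size();
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            s += (u[i] - v[i]) * sigmaInv[i][j] * (u[j] - v[j]);
    return std::sqrt(s);
}
```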

5.2.2. For Gait Recognition Based on Motion Imitation

In the experiment on the application of gait imitation, we use the original videos of dataset B of the CASIA gait database to test the recognition framework of Figure 17, based on the integrated HCRF/SVM classifier, at different window sizes $\omega$.

We adopt leave-one-out cross-validation to train and test gait identification. In detail, for each person, we take 5 of the 6 thin-coat videos, 1 of the 2 thick-coat videos, and 1 of the 2 backpack videos for training, and we keep the remaining thin-coat, thick-coat, and backpack videos for recognition (inference). We then change the selection and repeat the train/infer experiment until every video has had its chance of being inferred. Finally, we compute the average of these recognition rates and regard it as the final result. The computation is as in formula (5.3), where the function $\mathrm{inference}(\cdot)$ stands for the whole HCRF/SVM recognition system applied when the $i$th subject is the test object, and the result is counted as correct only when it equals the input $i$:

$$\text{rate} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\{\mathrm{inference}(i) = i\}. \tag{5.3}$$

Combining with Section 4, in training the HCRF, the conjugate-gradient method of [38] is used to estimate the associated parameters, and in training the SVM, cross-validation via parallel grid search as in [42] is used. According to Section 4, since there are 124 subjects and each subject corresponds to 13 3D time sequences, there are 13 × 3 = 39 HCRF models and SVM classifiers altogether. When the training on all the subjects of dataset B of the CASIA gait database is finished, all the trained parameters of the HCRF models and SVM classifiers are saved as another database.

In the area of gait recognition, the baseline algorithm of [44] is a typical method that estimates silhouettes by background subtraction and performs recognition by temporal correlation of silhouettes, so we compare it with the framework presented in this paper in terms of recognition performance. In realizing [44], the silhouettes containing the actual moving object are acquired from the CASIA dataset B videos by the detection phase described in Section 3. Because of the various viewing angles in the gait database, the gait period is detected by computing the ratio of the number of pixels in the silhouette to the area of its smallest circumscribing rectangle. Leave-one-out cross-validation is also adopted, with the average recognition rates computed as the final identification rates.

In addition, at the stage just after acquiring the normalized parameters of the real detected object, and at the stage just after reducing the dimensionality of the associated time sequences, we use the corresponding temporal data as input to HCRF and SVM classifiers used alone to recognize human gait. In this situation, one HCRF model corresponds to one subject, that is, there are 124 HCRF models, while the number of SVM classifiers is as before. Of course, another database of trained parameters is likewise produced from dataset B of the CASIA gait database after training. Finally, we compare the recognition rates with those of the recognition framework proposed in this paper at the same window parameter $\omega$.

5.3. Experimental Results and Associated Analysis
5.3.1. For Gait Imitation

Figure 18 displays the motion imitation of the same subject wearing a thin coat, a thick coat, and a backpack, respectively, at a viewing angle of 54°. Judging from the emulated models and motion curves, the three impressions are quite similar to each other. The parameters of the real motion characteristics for the three types of walking in Figure 18 are listed in Table 1, and the corresponding measurements of the two distances, Euclidean and Mahalanobis, are shown in Table 2. By comparison and analysis, all the values in Table 2 are universally very small; that is, the overall differences between corresponding values in Table 1 are very small, and we can infer that the detected parameter values listed in Table 1 are quite close to each other.

That is to say, for the same person under the three different walking conditions, the detected proportions and key poses of the limbs and trunk are almost unchanged, which is the essential reason why the three imitations of walking motion are alike.

Figure 19 shows the 3D gait imitation experiments for some subjects at other angles of view with the three types of walking. By observation, comparison, and measurement as above, each of these motion imitations closely resembles the corresponding real moving individual.

If we take the columns of Table 2 as points in the 3D spaces of Euclidean distance and Mahalanobis distance, respectively, we can draw a point in each of the two spaces corresponding to one subject's measurement of the differences among walking conditions. In this manner, Figure 20 displays the measurements for the 124 subjects of the CASIA database at viewing angles of 54°, 72°, 90°, 108°, and 126°, respectively. From Figure 20 we can see that most points in the two 3D spaces lie near the origin. Whether in the Euclidean or Mahalanobis measurement, the distances between thin coat and backpack for the same subject are universally slightly larger than the other distances, and because the Mahalanobis measurement includes an extra covariance matrix compared with the Euclidean measurement, the former is slightly larger than the latter; but all values remain within acceptable ranges overall. So, reasoning as before, for each of the 124 subjects, the overall difference in walking across the different conditions is comparatively small.

Thus, we find that the method proposed in this paper is robust to the moving person's clothes and backpack to a certain extent. Note that robustness here means that the forms of gait imitation for the same object walking under different conditions are consistent with each other to a great extent, thereby revealing, to some degree, latent constant essential characteristics of the gait.

With regard to the experiments at other angles of view, because it is very hard for the shape of the contour of the detected object to conform to the deformable template proposed in this paper, or in other words the errors are too large, the corresponding gait imitation fails in those situations.

The final quality of the results of the proposed human motion imitation framework is assessed mainly through analysis of the characteristics of human anatomy and the gait curves, together with the observation, comparison, and measurements in the CASIA gait database experiments. It can be seen that the similarity between the motion imitation and the real moving object is rather high. Since the motion imitation in this paper directly uses the data detected from the preceding real object and does not perform any prediction of motion poses, the imitation is not carried out simultaneously with the video. Synchronization, however, is not our main objective here: after collecting enough information on a real person's gait, we attempt to reconstruct the gait in 3D for further studies.

Up to now, it can be seen that the deformable template matching method of this paper not only applies at many angles of view and is robust to the moving person's clothes and backpack to a certain extent, but also needs no manual work and no camera model information. It does not need to fit the motion model in every video frame, only in the frames where the outer template matching at the key states succeeds, and it does not need to consider bending at the elbows, knees, and neck during fitting, instead compensating proportionally, as universal prior knowledge, before the real gait is finally imitated in 3D, which greatly improves detection efficiency. In comparison, [7, 8] study motion imitation in 2D and not only need manual assistance initially but also mainly target silhouettes at a 90° angle of view, a very limited range, and need to fit the motion model in every frame, which is time-consuming.

5.3.2. For Gait Recognition Based on Motion Imitation

Table 3 gives the comparisons of the integrated HCRF/SVM classifier, based on the recognition application of gait imitation, at different window sizes $\omega$ and different viewing angles, against the typical baseline method. Figure 21 gives the bar graph associated with the data in Table 3.

From Figure 21, combined with Table 3, we can see that the recognition rate differs with window size, and when $\omega$ equals 1, the recognition rates are universally higher than at the other window sizes. At every tested $\omega$, when the viewing angle is near 90°, including 108°, the recognition rates are universally higher than at other viewing angles, because the detected parameters, including the proportions of the trunk and lower limbs, are more accurate there. Although any viewing angle can be mapped to the 90° viewing angle by the transformation computation mentioned above, small errors still cannot be avoided. We can also see that, at the same window size and viewing angle, the recognition rate in a thin coat is usually slightly higher than in a thick coat, and that in a thick coat slightly higher than with a backpack. After all, whether in a thick coat or with a backpack, occlusion affects the detection accuracy to some degree, and with a backpack not only occlusion but also the disturbance of additions to the body affects the accuracy, so the recognition rates there are the smallest among the walking conditions. On the whole, however, the results of the proposed recognition framework are satisfactory, and the method overcomes these limitations to some extent. That is, the recognition framework, like the gait imitation framework above, is robust to the subject's coat or backpack to a certain extent.

Not depicting gait information in 3D, the baseline method mainly studies silhouette gait sequences in 2D, which makes the volume and accuracy of its gait information more limited than in the framework of this paper, so its identification rates are universally lower. In addition, under the same walking conditions, its identification rate at about the 90° viewing angle is slightly lower than at other angles, because the frontal gait information in the silhouettes at this angle is less than at other angles, which increases the possibility of identical detection results among the testing samples. At the same viewing angles, the identification rate in a thin coat is higher than in a thick coat, and that in a thick coat higher than with a backpack, again because of the disturbances of occlusion and additions to the body.

Table 4 presents the comparisons of the sole SVM and sole HCRF classifiers on gait recognition at the same window size and different viewing angles. Figure 22 gives the bar graph associated with the data in Table 4, together with the data of the integrated HCRF/SVM classifier at the same window size from Table 3.

In Figure 22, at each tested visual angle, the recognition rate of HCRF + SVM is comparatively higher than that of HCRF or SVM alone. According to the analysis in Section 4, when NPE reduces the dimensionality of the time sequences, the local neighborhood structure on the data manifold is preserved; when the HCRF trains on or infers the labels of the time sequences, the underlying graphical model captures temporal dependencies across frames and incorporates long-range dependencies; and when the SVM trains on or infers the final identities from those labels, the separating margins of the decision boundaries are maximized as the data are mapped into the high-dimensional space. Thus, compared with HCRF or SVM alone, HCRF + SVM retains more of the structural traits of the data being classified, which makes the recognition more sufficient. At the stage just after acquiring the normalized parameters of the real detected object, the recognition rate of SVM is a little higher than that of HCRF, because the parameters detected at the same time instant do not form sequences correlated in time or space, so the merits of HCRF cannot be fully expressed; treating independent parameters as correlated sequences is unreasonable to some extent, and SVM is better suited than HCRF to classifying multidimensional data at a single time instant. Conversely, at the stage just after reducing the dimensionality of the associated time sequences, by similar reasoning, the recognition rate of HCRF is a little higher than that of SVM. In addition, for both the sole HCRF and the sole SVM, the trend of the recognition rates as the viewing angle changes is consistent with that of the integrated HCRF/SVM classifier above.

6. Conclusion

This paper proposes a framework of human gait imitation, analogous to the human cognitive process, which integrates individual gait characteristics measured in reality with general prior knowledge of human motion to reconstruct human gait in 3D from monocular video of an uncalibrated camera, directly and automatically. According to the experimental results on the CASIA gait database and the associated measurements and analysis, the method is reasonable and robust to the subject's clothes and backpack to a certain extent.

In the application of this framework, all the imitated motion curves are first transformed into time sequences of limited length by extracting one gait cycle, arranging the curves into sequences, and reducing the dimensionality with NPE, in turn. Then a classifier that integrates HCRF with SVM classifies the multisequences by first marking the time sequences with labels and then classifying the labels, realizing identity recognition on human gait. In the associated experiments, this integrated classification performs better than HCRF or SVM used alone and than the typical baseline method, because it retains more of the spatial and temporal structure of the data being classified.

We do not claim that the integrated classifier adopted in this paper is the only one, or the best one, suited to the proposed gait imitation; it is just one way, and other classification methods may fit the recognition of the gait imitation too. Nor do we claim that the integrated classifier can deal only with the kind of dataset used in this paper; it may also suit other kinds of datasets, which of course requires new testing in other situations.

In future work, we will investigate the imitated walking curves at other speeds and in other moods, and search for the constant properties among these curves for identity recognition. In fact, the recognition process of this paper is a two-stage classification, which creates the possibility that errors from the first classifier propagate to the second, forming accumulated error. The training time and testing time are about 2–3 hours and 30 seconds, respectively. So we will study the two kinds of classifier more deeply in theory and try to find a single-stage classification that is theoretically equivalent to the two stages above; uniting the two classifiers into one is essentially a way to improve the accuracy and speed of recognition.

Acknowledgments

The authors wish to thank the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences for supplying gait researchers with the free CASIA gait database, which brought much convenience to the experiments of this paper.