Abstract

One of the most important issues in human motion analysis is the tracking and 3D reconstruction of human motion from the positions of anatomical points. These points uniquely define the position and orientation of all anatomical segments. In this work, a new method is proposed for tracking and 3D reconstruction of human motion from the image sequence of a monocular static camera. In this method, 2D tracking is used for 3D reconstruction, and a database of selected frames is used to correct the tracking process. The method utilizes a new image descriptor based on the discrete cosine transform (DCT), which is employed in different stages of the algorithm. The advantage of this descriptor is that the proper frequency regions can be selected for each task, which results in efficient tracking and pose-matching algorithms. The tracking and matching algorithms are based on reference descriptor matrices (RDMs), which are updated after each stage based on the frequency regions in DCT blocks. Finally, 3D reconstruction is performed using Taylor’s method. Experimental results show the promise of the algorithm.

1. Introduction

One of the challenging issues in machine vision and computer graphics applications is the modeling and animation of human characters. In particular, body modeling from video sequences is a difficult task that has been investigated extensively over the last decade. Nowadays, 3D human models are employed in various applications such as movies, video games, ergonomics, e-commerce, virtual environments, and medicine.

3D scanners [1, 2] and video cameras are two tools that have been used for 3D human model reconstruction. 3D scanners offer limited flexibility and restrict the subject's freedom of movement. In addition, the high cost of these devices puts them out of reach for general use.

Video cameras are nonintrusive and flexible devices for extraction of human motion. However, due to the high number of degrees of freedom for the human body, human motion tracking is a difficult task. In addition, self-occlusion of human segments and their unknown kinematics make the human tracking algorithm more challenging.

Existing vision-based approaches for human motion analysis may be divided into two groups: model-based and model-free methods [3]. In model-based methods [4–8], an a priori known human model is employed to represent human joints and segments as well as their kinematics. Model-free approaches do not employ a predefined human model for motion analysis; instead, the motion information is derived directly from video sequences. Model-free approaches mostly use a database of exemplars [9] or a learning machine [10, 11] for motion reconstruction. They are mostly restricted to known environments or images taken from a known viewpoint. Model-based approaches are more general and typically support viewpoint-independent processing or multiple viewpoints. However, they need initialization.

Algorithms may also be divided into different categories based on the acquisition system. Some approaches are based on monocular cameras [4–14], while others employ multicamera video streams [15–20]. Also, some approaches benefit from calibrated views or cameras [15–20], while others utilize uncalibrated images [5–14].

Nowadays, monocular uncalibrated video sequences such as sports video footage are the most common source of human motions. In general, 3D pose estimation is not possible from a monocular camera without further information. Therefore, it is necessary to employ special assumptions for 3D pose estimation. Furthermore, 3D reconstruction of human motion poses additional difficulties such as self-occlusion, high-dimensional representation, lack of calibration, and articulated human motion, to name a few.

To compensate for the lack of information for 3D reconstruction of human motion from uncalibrated monocular video sequences, different approaches impose restrictive conditions. Some algorithms assume the manual specification of key features such as joint positions or segment lengths [5, 21]. Other algorithms employ a database of different motions from various human subjects to facilitate motion reconstruction [9, 13, 14].

Algorithms for motion reconstruction from monocular videos are roughly divided into three categories: (i) discriminative methods [9, 13, 14], (ii) estimating and tracking methods [6–8], and (iii) methods based on learning [4, 11]. In discriminative methods, 3D joint coordinates are found by using a database, motion libraries, and similar resources. In estimating and tracking methods, 3D information is extracted using a sequence of images and a tracking algorithm. In methods based on learning, a machine or model is trained with some a priori features and used for motion reconstruction.

Algorithms for human motion reconstruction may utilize different image descriptors for tracking, matching, or model extraction. In [9, 13], the shape context descriptor was used for matching key points. A shape context is a representation of shape by a discrete set of points sampled from the internal or external contours of the shape. The contour can be obtained as the locations of edge pixels found by an edge detector. Some image-matching algorithms employ the scale invariant feature transform (SIFT) [22] to detect and describe local features in images. SIFT features are scale and rotation invariant, but computationally expensive. In [7, 23], the silhouette and contour of the human body were extracted for human model reconstruction. Silhouettes and contours can be easily extracted with static cameras. However, with a moving camera or a cluttered background, it is difficult to extract the silhouette robustly. Edges or edge lines [24] and point features [12] have also been used as image descriptors in some algorithms.

In this article, we introduce a new method for 3D reconstruction of human motion in uncalibrated monocular video streams, which is based on our previous work [25]. The method combines discriminative and tracking algorithms: the information in the database is utilized to increase tracking accuracy. The method utilizes a new descriptor based on the discrete cosine transform (DCT). The advantage of this descriptor is that the proper frequency regions can be selected for each task, which results in better tracking and pose matching. For example, we use low and middle frequencies in tracking, to follow intensity as well as edges. Also, we disregard the color of clothes in database matching by excluding low-frequency information.

The paper is organized as follows. In the next section, we review the human model utilized in the proposed algorithm. Section 3 discusses the proposed algorithm for tracking and 3D reconstruction of human motion using sequences of images acquired by a single video camera. Experimental results appear in Section 4, and we conclude the paper in Section 5.

2. Human Body Model

The human skeleton system is treated as a series of jointed links (segments), which can be modeled as rigid bodies. In motion reconstruction applications, it is common to use a simple skeleton system that models the important segments. We describe the body as a stick model consisting of a set of thirteen joints (plus the head), which are connected by thirteen segments as shown in Figure 1.

For 3D reconstruction, the algorithm needs the relative lengths of the segments, which can be obtained from the anthropometric data shown in Table 1.

Given the 2D joint positions and the segment lengths, and enforcing constraints such as dynamic smoothing, we can reconstruct the 3D human model.

3. Proposed Algorithm

Figure 2 shows the block scheme of the proposed algorithm. In the proposed method, we track 2D joint positions in a static, uncalibrated monocular video and use them to estimate the 3D skeletal configuration. Since a monocular video does not provide enough information for 3D reconstruction, we save several 2D exemplars of various body poses in the database and use them to correct the tracked points. In this algorithm, joint tracking is based on the n*n block of DCT coefficients (descriptor matrix). The algorithm starts with background subtraction, and the 2D joint positions are initialized by the user in the first frame. Then, the descriptor matrix is calculated and saved as the “reference descriptor matrix” (RDM) for each joint. In the next stage, all joints are tracked using their own RDMs. After the joint positions are found in the subsequent frames, the RDMs are updated based on the DCT block frequency regions, taking occlusion and tracking errors into account. The advantage of using RDMs is that the proper frequency regions can be selected for each task, which results in efficient tracking and pose-matching algorithms. When the human pose is estimated in the current frame, it is compared with the poses in the database based on middle-frequency information. In the case of correspondence, the joint positions are corrected and the RDMs are updated. We use the middle-frequency regions for this purpose to exclude clothing color (low frequency) and body deformation details (high frequency).

A major problem that may be encountered in the algorithm is the occlusion of joints. To handle the problem, we detect occluded joints and mark them as “occluded.” When an “occluded” joint appears again, its positions are corrected by interpolation.

As shown in Figure 2, we utilize different frequency regions for the various tasks in the proposed algorithm. Table 2 summarizes these tasks and the frequency regions they use.

Given the 2D joint locations, the 3D body configuration is estimated using Taylor’s algorithm [12].

3.1. Descriptor Matrix

In this article, we use DCT-based descriptors for the tracking and matching purposes. Descriptor Matrix (DM) for a point $p_i$ is an n*n DCT coefficients matrix. By utilizing the image window of fixed size (n*n) centered on point $p_i$, a descriptor matrix for the point $p_i$ is calculated as follows:

$$F(u,v) = C(u)\,C(v) \sum_{x=p_x-n/2}^{p_x+n/2} \; \sum_{y=p_y-n/2}^{p_y+n/2} f(x,y)\, \cos\frac{(2x+1)u\pi}{2n} \times \cos\frac{(2y+1)v\pi}{2n}, \tag{1}$$

where $(p_x, p_y)$ is the coordinate of central point $p_i$, and $C(x)$ is calculated using the following equation:

$$C(x) = \begin{cases} \sqrt{\dfrac{1}{n}}, & \text{if } x = 0, \\[4pt] \sqrt{\dfrac{2}{n}}, & \text{otherwise.} \end{cases} \tag{2}$$

There are $n^2$ coefficients in each DM matrix divided into three frequency regions according to Figure 3. White region is the low-frequency region, gray region is the middle-frequency region, and black region is high-frequency region. We use these three frequency regions for tracking and matching joints. We use matrix distance as a method to measure the similarity between two descriptor matrixes. Matrix distance for two descriptor matrixes $M$ and $N$, in the specified frequency region of $R$, is calculated as

$$M_{\mathrm{dis}}(M,N) = \sum_{f \in R} \left(M_f - N_f\right)^2. \tag{3}$$
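The descriptor matrix and the matrix distance can be illustrated with a minimal sketch in plain Python. Several details here are assumptions rather than part of the method as published: the window is taken to cover n samples per axis and is indexed locally (0 to n-1) inside the cosines, and because Figure 3 is not reproduced, the hypothetical helper `frequency_region` splits the coefficients by the diagonal index u + v with arbitrary cut-offs.

```python
import math

def dct_descriptor(image, px, py, n):
    """Return the n*n DCT coefficient matrix of the window centred on (px, py).

    image is a list of rows of grey levels. Assumption: the window covers
    n samples per axis and is indexed locally (0..n-1) inside the cosines.
    """
    half = n // 2
    window = [[image[py - half + y][px - half + x] for x in range(n)]
              for y in range(n)]
    # C(.) from equation (2)
    scale = [math.sqrt(1.0 / n)] + [math.sqrt(2.0 / n)] * (n - 1)
    # basis[k][i] = cos((2i + 1) k pi / 2n)
    basis = [[math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n)]
             for k in range(n)]
    dm = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = sum(window[y][x] * basis[u][x] * basis[v][y]
                        for y in range(n) for x in range(n))
            dm[u][v] = scale[u] * scale[v] * total
    return dm

def frequency_region(u, v, n, low_cut=None, mid_cut=None):
    """Hypothetical region split by diagonal index u + v (cut-offs are assumed)."""
    low_cut = n // 4 if low_cut is None else low_cut
    mid_cut = n if mid_cut is None else mid_cut
    d = u + v
    if d <= low_cut:
        return "low"
    return "middle" if d < mid_cut else "high"

def matrix_distance(m, other, regions):
    """Sum of squared coefficient differences over the chosen regions, as in (3)."""
    n = len(m)
    return sum((m[u][v] - other[u][v]) ** 2
               for u in range(n) for v in range(n)
               if frequency_region(u, v, n) in regions)

# Usage on a synthetic 16*16 image containing a vertical edge.
image = [[0 if x < 8 else 100 for x in range(16)] for y in range(16)]
dm_edge = dct_descriptor(image, 8, 8, 8)    # window straddles the edge
dm_flat = dct_descriptor(image, 12, 8, 8)   # window lies in the flat area
print(matrix_distance(dm_edge, dm_flat, {"low", "middle"}))
```

For a constant window, only the coefficient at (0, 0) is non-zero, which is why the low-frequency region carries the overall intensity of the joint neighbourhood.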

3.2. Reference Descriptor Matrix (RDM)

RDMs store the information required for tracking the joints. To find the location of a joint in the current frame, the RDMs of the joints in the previous frame as well as the information in the database are employed. RDMs are generated independently for the different joints of the body and are denoted by $\mathrm{RDM}_1, \ldots, \mathrm{RDM}_{13}$. The reference descriptor matrix for joint $j$ ($\mathrm{RDM}_j$) is loaded from the descriptor matrix for joint $j$ after the joints are initialized by the user and is updated after the joints are located in the subsequent frames. The updating routine is different for each frequency region, as follows.

Low-Frequency Region
This region consists of general shape and intensity information of the tracked joint, so it changes gradually in successive frames. Tracking process may lose the tracked joint for several reasons such as occlusion problem or large distortion. Therefore, the tracked joint information may be incorrect in the current frame. For safekeeping of the general joint information, we leave the low frequency coefficients unchanged during the tracking. This region is updated only when a correspondence is found in the database.

Middle-Frequency Region
This region consists of general edge information. Because the individual limbs are deformable due to moving muscle and clothing, we update middle frequency coefficients during tracking only if the tracked joint is not occluded. Furthermore, this region is updated when a correspondence is found in the database.

High-Frequency Region
This region consists of joints’ details. The region is updated frame by frame without any restriction.
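The three update rules can be summarized in a short sketch. It assumes that "updating" means replacing the stored coefficients with those of the descriptor computed at the current (possibly database-corrected) joint position; the text does not say whether coefficients are replaced or blended. The function `region_of` is a hypothetical stand-in for the region layout of Figure 3.

```python
def update_rdm(rdm, dm, region_of, occluded, database_match):
    """Return a new RDM refreshed from dm according to the per-region rules.

    rdm, dm        : n*n lists of DCT coefficients (reference and current).
    region_of(u, v): returns "low", "middle", or "high" (assumed layout).
    """
    n = len(rdm)
    updated = [row[:] for row in rdm]
    for u in range(n):
        for v in range(n):
            region = region_of(u, v)
            if region == "low":
                # kept fixed during tracking, refreshed only on a database match
                refresh = database_match
            elif region == "middle":
                # refreshed when the joint is visible or on a database match
                refresh = database_match or not occluded
            else:
                # high frequencies are refreshed in every frame
                refresh = True
            if refresh:
                updated[u][v] = dm[u][v]
    return updated

def region_of(u, v):
    """Toy region layout for a 4*4 block (assumption for this example)."""
    d = u + v
    return "low" if d <= 1 else ("middle" if d <= 3 else "high")

rdm = [[0.0] * 4 for _ in range(4)]
dm = [[1.0] * 4 for _ in range(4)]
hidden = update_rdm(rdm, dm, region_of, occluded=True, database_match=False)
visible = update_rdm(rdm, dm, region_of, occluded=False, database_match=False)
matched = update_rdm(rdm, dm, region_of, occluded=True, database_match=True)
```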

3.3. Tracking

The tracking process is based on the matching techniques in the frequency domains. Tracking process aims to find body joints in successive frames. Because of temporal correspondences between subsequent frames, search for the corresponding joint is local. In two successive frames, limbs and joints have the same intensity and general shape, but they are different in details. So, we use low- and middle-frequency regions in tracking process.

The basic idea is to track joints through the sequence of frames by utilizing RDMs. For this purpose, descriptor matrices are computed for each pixel in the search window. The best match is the pixel with the minimum matrix distance, over the low and middle frequencies, between $\mathrm{RDM}_j$ and the search window descriptor matrices (SWDMs).

Assuming that the initial estimate of the pose has been given, the tracking algorithm can be summarized in two steps as follows.
(1) Generate descriptor matrices for all pixels in the search window at frame t (SWDMs).
(2) Determine best matching point in the search window by computing matrix distance between $\mathrm{RDM}_j$ and SWDMs.
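The two steps amount to an exhaustive local search, sketched below. The descriptor and distance are passed in as callables so that the example stays independent of any particular DCT implementation; the square search window and the `radius` parameter are assumptions, since the text does not state the window shape or size.

```python
def track_joint(descriptor_at, distance, rdm, prev_pos, radius):
    """Find the pixel near prev_pos whose descriptor is closest to rdm.

    descriptor_at(x, y): descriptor matrix of the pixel (x, y) in frame t.
    distance(a, b)     : matrix distance restricted to low and middle frequencies.
    Returns the best position and its distance.
    """
    px, py = prev_pos
    best_pos, best_dist = prev_pos, float("inf")
    for y in range(py - radius, py + radius + 1):
        for x in range(px - radius, px + radius + 1):
            d = distance(rdm, descriptor_at(x, y))   # step (1) and step (2)
            if d < best_dist:
                best_pos, best_dist = (x, y), d
    return best_pos, best_dist

# Toy usage: the "descriptor" of a pixel is simply its own coordinates.
def toy_descriptor(x, y):
    return [[float(x), float(y)]]

def toy_distance(a, b):
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))

position, dist = track_joint(toy_descriptor, toy_distance,
                             rdm=[[12.0, 7.0]], prev_pos=(10, 10), radius=3)
```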

As mentioned before, a major problem that may be encountered in the algorithm is the occlusion of joints. To handle the problem, we detect occluded joints and mark them as “occluded.” In order to detect the occlusion of the tracked joint j at frame t, we calculate matrix distance in the middle-frequency region between descriptor matrix of tracked point ($\mathrm{DM}_j(t)$) and $\mathrm{RDM}_j$. Then, we determine the occlusion of the joints based on the following equation:

$$M_{\mathrm{dis}}^{\mathrm{middle}}\left(\mathrm{RDM}_j, \mathrm{DM}_j(t)\right) \begin{cases} < \Delta, & \text{joint is not occluded}, \\ > \Delta, & \text{joint is occluded}. \end{cases} \tag{4}$$

When an occluded joint appears again, its positions during the period of occlusion are estimated by linear interpolation using the positions of the joint before and after the occlusion.
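A minimal sketch of the occlusion test and the interpolation step follows. The value of the threshold Δ is not given in the text, so `delta` is a free parameter here, and representing occluded frames by `None` is purely a convention of this example.

```python
def is_occluded(middle_distance, delta):
    """Occlusion test of (4): a large middle-frequency distance means occlusion."""
    return middle_distance > delta

def fill_occluded(track):
    """Linearly interpolate positions over runs of occluded frames.

    track: list of (x, y) tuples, with None for frames marked "occluded".
    Runs at the very start or end are left as None, since interpolation
    needs a known position on both sides.
    """
    filled = list(track)
    i = 0
    while i < len(filled):
        if filled[i] is not None:
            i += 1
            continue
        start = i - 1                      # last visible frame before the gap
        end = i
        while end < len(filled) and filled[end] is None:
            end += 1                       # first visible frame after the gap
        if start >= 0 and end < len(filled):
            (x0, y0), (x1, y1) = filled[start], filled[end]
            for k in range(start + 1, end):
                t = (k - start) / (end - start)
                filled[k] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        i = end
    return filled

track = [(0, 0), None, None, None, (4, 8), None]
recovered = fill_occluded(track)
```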

3.4. Database Matching Process

The database consists of required information of different poses for video sequences of a number of subjects. This information includes body joints’ positions and their descriptor matrixes in the middle frequency region as well as necessary labels for 3D reconstruction. Head position is used as the reference joint to calculate joints’ positions. In other words, joints’ positions are determined with respect to head.

To measure the similarity between the human pose in the current frame ($p_f$) and a human pose in the database ($p_d$), we employ two kinds of descriptor matrix: DDMs and FDMs, which are defined below. If the pose distance is smaller than a predefined threshold, a correspondence occurs. In this case, the joint positions and the middle-frequency region of the RDM are corrected. The human pose distance is defined by

$$P_{\mathrm{dis}}(p_f, p_d) = \sum_{j=1}^{13} M_{\mathrm{dis}}^{\mathrm{low,mid}}\left(\mathrm{DDM}_j, \mathrm{FDM}_j\right), \tag{5}$$

where the database descriptor matrix (DDM) is generated using the low-frequency information of the RDM (for intensity similarity of joints) and the middle-frequency information of the database (for edge similarity). The DDM descriptor is defined as follows:

$$\mathrm{DDM}_f = \begin{cases} \mathrm{RDM}_f, & f \in \text{low frequency}, \\ \mathrm{Database}_f, & f \in \text{middle frequency}, \\ 0, & f \in \text{high frequency}. \end{cases} \tag{6}$$

Frame descriptor matrix (FDM) is also generated using the following algorithm.
(i) Search locally around the previous head position to find correspondence for $\mathrm{RDM}_{\mathrm{head}}$ point in the current frame.
(ii) Determine other joints in the current frame by adjusting the head position.
(iii) Generate descriptor matrices for each joint and save them as FDMs.

The algorithm to measure the similarity between human pose in the current frame ($p_f$) and human pose in the database ($p_d$) can be summarized as follows.
(1) Generate DDMs.
(2) Search locally to find the head position in the current frame.
(3) Determine other joints’ positions in the current frame.
(4) Compute matrix distance for DDM and FDM in low and middle frequency regions.
(5) In the case of correspondence, correct joints’ positions.
(6) Update RDMs.
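Steps (1) and (4) of this summary, that is, equations (6) and (5), can be sketched as follows. The helper `region_of` is again a hypothetical stand-in for the layout of Figure 3, and the 2*2 matrices are toy data chosen only to make the arithmetic easy to follow.

```python
def build_ddm(rdm, database_dm, region_of):
    """Compose a DDM as in (6): low band from the RDM, middle band from the
    database exemplar, high band set to zero."""
    n = len(rdm)
    ddm = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            region = region_of(u, v)
            if region == "low":
                ddm[u][v] = rdm[u][v]
            elif region == "middle":
                ddm[u][v] = database_dm[u][v]
    return ddm

def pose_distance(ddms, fdms, region_of):
    """Pose distance as in (5): sum over joints of the low+middle matrix distance."""
    total = 0.0
    for ddm, fdm in zip(ddms, fdms):
        n = len(ddm)
        for u in range(n):
            for v in range(n):
                if region_of(u, v) in ("low", "middle"):
                    total += (ddm[u][v] - fdm[u][v]) ** 2
    return total

def region_of(u, v):
    """Toy layout for a 2*2 block (assumption for this example)."""
    d = u + v
    return "low" if d == 0 else ("middle" if d == 1 else "high")

rdm = [[1.0, 2.0], [3.0, 4.0]]
exemplar = [[5.0, 6.0], [7.0, 8.0]]
ddm = build_ddm(rdm, exemplar, region_of)          # [[1.0, 6.0], [7.0, 0.0]]
same_pose = pose_distance([ddm], [[[1.0, 6.0], [7.0, 99.0]]], region_of)
other_pose = pose_distance([ddm], [[[2.0, 6.0], [7.0, 0.0]]], region_of)
```

The high-frequency coefficient of the frame descriptor (99.0 in the first call) does not affect the result, which reflects the choice to ignore joint details when comparing poses.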

With a constant descriptor size, the descriptor matrices are not scale invariant. In the absence of substantial background clutter, scale invariance can be achieved by setting the descriptor matrix size as a function of the length of the body segments.

3.5. 3D Reconstruction

We use Taylor’s method [12] to estimate the 3D configuration of the human body given the joints’ positions. Taylor’s method operates on a single 2D image taken by an uncalibrated camera. It assumes a scaled orthographic projection model for the camera and needs the following information.
(i) The image coordinates of joints (u, v).
(ii) The relative lengths of body segments connecting the joints.
(iii) The “closer endpoints” for body segments and joints.

In this paper, the image coordinates of joints are obtained using the proposed tracking and matching algorithms. The closer endpoints for segments are supplied by exemplars in the database, and automatically transferred to the input image after the matching process. The relative lengths of body segments are fixed in advance but can also be transferred from exemplars.

We use the same 3D kinematics model defined over joints as that in Taylor’s work. We can solve for the 3D configuration of the body $\{(X_i, Y_i, Z_i) \mid i \in \text{joints}\}$ up to some ambiguity in scale $s$. The method considers the foreshortening of each body segment to construct the estimate of body configuration. For each pair of body segment’s joints, we have the following equations:

$$\begin{aligned} l^2 &= (X_1 - X_2)^2 + (Y_1 - Y_2)^2 + (Z_1 - Z_2)^2, \\ (u_1 - u_2) &= s\,(X_1 - X_2), \\ (v_1 - v_2) &= s\,(Y_1 - Y_2), \\ dZ &= (Z_1 - Z_2), \\ dZ &= \sqrt{l^2 - \frac{(u_1 - u_2)^2 + (v_1 - v_2)^2}{s^2}}. \end{aligned} \tag{7}$$

To estimate the configuration of a body, we first fix one joint as the reference point and then compute the positions of the others with respect to the reference point. Since we are using a scaled orthographic projection model, the X and Y coordinates are known up to the scale factor $s$. All that remains is to compute the relative depths of the endpoints, dZ. We compute the amount of foreshortening and use the user-supplied “closer endpoint” labels from the closest matching exemplar to solve for the relative depths.

Moreover, Taylor notes that the minimum scale $s$ can be estimated from the fact that dZ cannot be complex:

$$s \ge \sqrt{\frac{(u_1 - u_2)^2 + (v_1 - v_2)^2}{l^2}}. \tag{8}$$

This minimum value is a good estimate for the scale since one of the body segments is often perpendicular to the viewing direction.
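Equations (7) and (8) translate directly into a few lines of Python. This is only a sketch of the per-segment computation: the sign convention (Z increasing away from the camera, so the closer endpoint has the smaller Z) and the clamping of small negative values under the square root are assumptions of this example.

```python
import math

def min_scale(segments):
    """Smallest scale s for which dZ is real for every segment, as in (8).

    segments: list of ((u1, v1), (u2, v2), length) tuples.
    """
    return max(math.hypot(u1 - u2, v1 - v2) / length
               for (u1, v1), (u2, v2), length in segments)

def relative_depth(p1, p2, length, s, p1_closer):
    """Relative depth dZ = Z1 - Z2 of a segment, as in (7).

    Assumed convention: Z grows away from the camera, so dZ is negative
    when the first endpoint is the closer one.
    """
    (u1, v1), (u2, v2) = p1, p2
    squared = length ** 2 - ((u1 - u2) ** 2 + (v1 - v2) ** 2) / s ** 2
    dz = math.sqrt(max(squared, 0.0))      # guard against rounding below zero
    return -dz if p1_closer else dz

# Toy usage: one segment of relative length 10 projected to 5 pixels.
segments = [((0.0, 0.0), (3.0, 4.0), 10.0)]
s_min = min_scale(segments)                                   # 5 / 10
dz_unit = relative_depth((0.0, 0.0), (3.0, 4.0), 10.0, 1.0, p1_closer=False)
dz_min = relative_depth((0.0, 0.0), (3.0, 4.0), 10.0, s_min, p1_closer=True)
```

At the minimum scale the segment has no foreshortening left, so its relative depth is zero; this corresponds to a segment lying perpendicular to the viewing direction.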

4. Experimental Results

The proposed algorithm was applied for the reconstruction of human subjects from single-camera videos. The database consists of some poses of a number of subjects, performing different types of motions from the CMU MoCap database [28]. On this collection of poses, we manually determined joint locations of each pose and “closer endpoint” labels for each body segment, which are used in 3D reconstruction. Also, we save middle frequency of the descriptor matrix for each labeled joint.

Our experiments are divided into two parts: (i) reconstruction results for the sequences of real people with different motions in CMU MoCap database, and (ii) 2D tracking results in different video sequences.

4.1. Reconstruction Results

We tested the proposed algorithm on a variety of sequences of real human subjects performing various motions. To facilitate the tracking process, we utilized a background estimation algorithm based on temporal median filter. To make the proposed descriptor matrixes scale invariant, we set descriptor matrix size as a function of length of the body segments.
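As a rough illustration of background estimation with a temporal median filter, the sketch below takes the per-pixel median over a stack of grey-level frames and thresholds the difference to the current frame. The threshold and the thresholding step itself are assumptions; the text only states that a temporal median filter was used.

```python
from statistics import median

def median_background(frames):
    """Per-pixel temporal median of a list of equally sized grey-level frames."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(frame[y][x] for frame in frames) for x in range(width)]
            for y in range(height)]

def foreground_mask(frame, background, threshold):
    """Mark pixels that differ from the background by more than threshold."""
    return [[abs(pixel - bg) > threshold for pixel, bg in zip(row, bg_row)]
            for row, bg_row in zip(frame, background)]

# Toy usage: a bright object appears at one pixel in the second frame only.
frames = [[[10, 10, 10]], [[10, 200, 10]], [[10, 10, 10]]]
background = median_background(frames)
mask = foreground_mask(frames[1], background, threshold=30)
```

The median is robust to a moving subject as long as each pixel shows the background in more than half of the frames.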

Figure 4 shows sample results of 2D body joint localization before and after interpolation, followed by the 3D reconstruction, on the CMU dataset. Note that some joints are occluded or lost during 2D tracking. These joints are recovered by interpolation. Figure 5 shows sample results for another video, for which the 3D reconstruction was performed successfully.

4.2. Tracking Results

In this section, we investigate the robustness of the tracking algorithm in some video sequences consisting of occluded limbs and noise.

Figure 6 shows the robustness of the proposed algorithm for limb tracking and for distinguishing the occluded limbs. We tracked the head and right-hand joints using the proposed algorithm. Bigger circles show the nonoccluded tracked joints. The occluded or falsely tracked joints, which are detected by (4), are shown by smaller circles. The figure indicates that the tracking algorithm both follows the visible joints and detects the occluded limbs in this sequence.

To show the efficiency of the proposed image descriptor, we compared the proposed descriptor with shape context descriptor [13, 29]. Figure 7 shows the results of joints tracking using the proposed descriptor as well as the shape context descriptor. The results of Figure 7 reveal that the proposed algorithm has tracked the joints more efficiently.

We also compared the proposed algorithm with a well-known tracking algorithm that tracks feature points by optical flow using the iterative Lucas-Kanade (LK) method in pyramids [26]. Figure 8 illustrates the true positions of the head and hand joints as well as their positions tracked by the proposed algorithm and by the LK method. The figure shows that the LK method lost the head and hand positions, whereas the proposed algorithm tracked them successfully.

To show the efficiency of the proposed algorithm in the noisy environment, we tested the proposed algorithm with noisy images. Figure 9(a) shows the tracking results for the video sequence of Figure 7 corrupted with 10 percent salt and pepper noise. In Figure 9(b), the results of the proposed tracking algorithm are shown for the same video sequence corrupted with Gaussian noise of SNR = 10 dB. Solid circles in the figure are the joints that are tracked normally, and empty circles show the joints labeled as “occluded.” Figure 10 shows the true position of the left-hand joint as well as its position tracked by the proposed algorithm and LK method for the noisy images of Figure 9. Figures 9 and 10 show the efficiency of the proposed algorithm in tracking videos in noisy environments.

We also investigated the effect of DCT block size on the efficiency of the tracking algorithm. Figure 11 shows the average tracking error for a typical video. As the figure shows, the algorithm performs best at a DCT block size of 10*10. However, the efficiency of the algorithm does not change considerably for DCT block sizes of 8 to 14. Our experiments show that the optimal DCT block size depends on the height of the human body in pixels. For example, for a human height of 130 pixels, the optimal block size is 8. As the human height increases, the optimal block size increases linearly at a rate of 0.1 per pixel.
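The linear relation stated above can be written as a one-line rule. This is a sketch only: rounding to an integer block size and the lower bound of 2 are assumptions, and the rule should not be extrapolated far outside the heights covered by the experiments.

```python
def optimal_block_size(height_px):
    """Block size from body height in pixels: 8 at 130 px, plus 0.1 per extra pixel.

    Rounding and the lower bound of 2 are assumptions of this sketch.
    """
    return max(2, round(8 + 0.1 * (height_px - 130)))

sizes = {height: optimal_block_size(height) for height in (130, 150, 190)}
```

Under this rule, a block size of 10 corresponds to a body height of about 150 pixels, and the range 8 to 14 corresponds to heights of roughly 130 to 190 pixels.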

5. Conclusion

In this paper, a new method for 3D reconstruction of human motion from the image sequence of a single static and uncalibrated camera is described. In this method, 2D tracking is used for 3D reconstruction, and a database of selected frames is used to correct the tracking process. We used DCT blocks as descriptor matrices, which are employed both in the tracking process and in the matching process that finds the appropriate pose in the database. We used three frequency regions for different tasks to enhance the accuracy of the proposed algorithm. The algorithm can detect occluded joints and recover their positions by interpolation. The proposed algorithm was tested with several video sequences in noisy and noiseless environments, and the experimental results showed the reliability of the algorithm. The method is robust in 2D tracking and preserves the properties of each joint throughout the tracking process.

We also investigated the effect of DCT block size on the efficiency of the tracking algorithm. To make the tracking system scale invariant, it is possible to use an adaptive block size based on the height of the human body in pixels.