Abstract

Segmenting the human hand is important in computer vision applications such as sign language interpretation, human-computer interaction, and gesture recognition. However, serious bottlenecks remain in hand localization systems, including fast hand motion capture, hand over face, and hand occlusions, on which we focus in this paper. We present a novel method for hand tracking and segmentation based on augmented graph cuts and a dynamic model. First, an effective dynamic model for state estimation is built, which predicts the location of hands that may undergo fast motion or shape deformations. Second, new energy terms are added to the energy function to form augmented graph cuts based on three cues: spatial information, hand motion, and chamfer distance. The proposed method achieves hand segmentation even when the hand passes over other skin-colored objects. We test it on challenging videos involving hand over face, hand occlusions, dynamic background, and fast motion. Experimental results demonstrate that the proposed method is much more accurate than other graph cuts-based methods for hand tracking and segmentation.

1. Introduction

Recent papers [1, 2] distinguish four main kinds of object tracking methods: point, skeleton, contour, and silhouette tracking. As an important branch of tracking, hand tracking is a critical step in computer vision systems, such as human computer interaction (HCI) [3], sign language interpretation [4], and gesture recognition [5]. In addition, vision-based hand gesture recognition [3] is a promising way to let computers in robot systems understand the meaning of gestures, and its first key step is robust hand tracking. Hence, we concentrate on silhouette tracking, which means that the hand silhouette or region should be separated from cluttered backgrounds.

In the last decade [6], human hand motion capture has gained widespread interest in the pattern recognition area. For example, Yang et al. [5] presented a method to obtain hand trajectories based on pixel matches with affine transformations. An optical flow-based method [7] was later proposed for hand tracking. Although the method [7] can capture quick motion and fast hand shape deformations, it still fails to track hands when hands and skin-colored objects occlude each other. Some other works [4, 8] use a linear quadratic estimation model (e.g., Kalman filter) or a sequential Monte Carlo model (e.g., particle filter) to track the hand trajectory. Later on, the authors of [9] applied a real-time hand tracking method to a mechanical device, combining the advantages of the particle filter and mean shift (MS). They incorporate MS optimization into the particle filter to improve the sampling efficiency considerably. Though these approaches have delivered promising results, they have difficulty handling occlusions.

In recent years, graph cuts-based methods have been applied in tracking and segmentation systems. Xu and Ahuja [10] first proposed a method to track object contours by graph cuts. They dilate the object contour into a narrow band and construct a graph only on this band. Nevertheless, the method cannot deal with large displacements because there is no dynamic model to estimate the object location. Freedman and Turek [11] presented a graph cuts-based method to track objects under drastic illumination changes. Yet, judging from their experimental results, they do not achieve object segmentation. Later, Malcolm et al. [12] incorporated a distance penalty into graph cuts to realize object segmentation and used a simple filter to estimate the location of the objects of interest. Although this method can achieve multiobject tracking, it still cannot deal with occlusions. Bugeau and Pérez [13] proposed a method based on optical flow and graph cuts to simultaneously track and segment objects. However, this method needs a reference background image, which restricts its applicability. In [14], the authors managed to track objects in live videos via a reseeding strategy. Papadakis and Bugeau [1] modelled the object of interest as consisting of visible and occluded parts, which are tracked separately. Although those methods have been successful in some areas, they still have drawbacks in situations such as hand over face and hand occlusions.

Hand tracking is a challenging problem because the hand has 27 degrees of freedom (DOFs), including 21 DOFs for the joint angles and 6 DOFs for orientation and location [6]. Therefore, hand shape and motion are more arbitrary than those of rigid objects. In this paper, we present an effective approach to track and segment hands even when they undergo arbitrary shape deformations. Similar to the methods of [12, 13], a dynamic model and graph cuts are used. However, compared with these methods, the key contributions of our method are summarized as follows.
(i) To avoid the degeneracy problem of interest points [12, 13], we combine a resampling strategy with an optical flow algorithm to robustly track interest points from hand regions.
(ii) An augmented graph cuts method is introduced to track and segment hand regions, with different hands labelled with different colors.
(iii) The proposed method can track and segment hands in challenging environments, such as overlapping hands, fast hand motion, and hand over face. It can also track and segment hands in dynamic backgrounds where skin-colored objects may be present.

The framework of our method is shown in Figure 1, which consists of optical flow estimation and augmented graph cuts introduced in Section 3.

This paper is a substantial extension of our conference paper [15]. Compared with [15], further details of our method are presented, and more extensive performance evaluation is conducted. We also give a more comprehensive literature review to introduce the background of our method and make the paper more self-contained. Therefore, this paper provides a more comprehensive and systematic report of our work. The rest of the paper is organized as follows. We describe basic notions of multiobject tracking based on graph cuts in Section 2. The proposed method is described in Section 3. Section 4 shows the experimental results and the performance evaluation. The conclusion is given in Section 5.

2. Notion of Traditional Graph Cuts

Here, we describe the basic principle of graph-cuts based methods for object tracking and segmentation. We review image segmentation via graph cuts at first. Then, object tracking is described via graph cuts and dynamic model.

2.1. Segmentation via Graph Cuts

We briefly outline multilabel graph cuts technique. The detailed information can be found in [16, 17]. The simple segmentation of the background and objects can be obtained by minimizing the following energy with respect to the labelling function in (1): where data term evaluates the likelihood of a pixel belong to the th object and is defined as where is delta function (equal to 1 if and 0 otherwise); represents an image; is the number of tracked objects; is calculated by a normalized histogram of the th object.

The smooth term evaluates the penalty for assigning two neighboring pixels to different labels. is defined as where and are coordinates of pixels, is the Euclidean distance, is a smooth parameter, , and is all neighborhood pixel pairs which are 4 or 8 neighborhood systems.
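
The two terms described above can be illustrated with a minimal Python sketch. It assumes a negative log-likelihood data cost read from a normalized histogram and a contrast-sensitive smooth cost divided by the pixel distance, which is a common choice in the graph cuts literature [16, 17]; the names `data_cost`, `smooth_cost`, `energy`, `lam`, and `sigma` are hypothetical and chosen only for this example.

```python
import math

def data_cost(value, hist):
    """Negative log-likelihood of a quantized pixel value under a normalized histogram."""
    eps = 1e-6  # assumption of this sketch: avoids log(0) for empty bins
    return -math.log(hist.get(value, 0.0) + eps)

def smooth_cost(p, q, ip, iq, sigma):
    """Penalty for giving neighbouring pixels p and q different labels."""
    dist = math.hypot(p[0] - q[0], p[1] - q[1])
    return math.exp(-((ip - iq) ** 2) / (2.0 * sigma ** 2)) / dist

def energy(image, labels, hists, lam, sigma):
    """Total energy of a labelling on a 4-connected grid (image and labels are 2D lists)."""
    h, w = len(image), len(image[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            total += data_cost(image[y][x], hists[labels[y][x]])
            for dy, dx in ((0, 1), (1, 0)):  # right and down neighbours
                ny, nx = y + dy, x + dx
                if ny < h and nx < w and labels[y][x] != labels[ny][nx]:
                    total += lam * smooth_cost((y, x), (ny, nx),
                                               image[y][x], image[ny][nx], sigma)
    return total
```

In this sketch a labelling whose boundary follows a strong intensity edge receives a lower energy than one that cuts through a uniform region, which is the behaviour the smooth term is meant to encourage.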

2.2. Tracking via Graph Cuts

Suppose objects are tracked and is a set of pixels at time of an image . represents the th object at time ( denotes the background region). Therefore, we can know that . Equations (1)–(3) can be rewritten by adding temporal information:

In [12], the mean velocity is assumed to be known for each object, and the authors translate the current object at time to obtain a prediction at time . A new term called the distance term is introduced, which discourages pixels from being associated with the th object when they do not belong to the predicted set . is defined as where , is a scaling function explained in Section 3, which constrains a new estimate to be in the spatial neighborhood of the prediction. For example, if a pixel is in the mask of predicted object , then . If a pixel is outside the mask of , is equal to the smallest distance between and the pixels . can be quickly calculated with the fast marching algorithm [18]. Therefore, the energy function is reformulated as

Although the methods [12, 13] can track and segment partly occluded objects in some cases, they cannot track overlapping objects of similar color (e.g., when the hands and face overlap). Figure 2 illustrates the limitations of these methods. In the initialization step at time , the left/right hand is labelled blue/green as shown in Figure 2(a). At time , in Figures 2(b) and 2(c), the labels of the left and right hands are confused by [12, 13]. In addition, some pixels in the red circle are wrongly labelled by [13] as shown in Figure 2(b), while pixels in the background are correctly segmented by the method [12] (see Figure 2(c)). This is because is added in [12] to constrain the estimation to be in the spatial neighbourhood of the prediction. However, the method [12] still does not distinguish the pixels of each hand (see Figure 2(c)).
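
The distance term of [12] discussed in this section can be sketched in a few lines of Python. The sketch uses a brute-force nearest-pixel search as a simple stand-in for the fast algorithm cited in the text, and it omits the scaling function; the name `distance_to_prediction` and the representation of a predicted set as a Python set of `(row, column)` pairs are assumptions of this example.

```python
import math

def distance_to_prediction(shape, predicted):
    """Distance from every pixel to the nearest pixel of a non-empty predicted mask.

    Pixels inside the mask get 0, so they are not penalized.
    """
    h, w = shape
    pts = list(predicted)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if (y, x) not in predicted:
                out[y][x] = min(math.hypot(y - py, x - px) for py, px in pts)
    return out
```

The brute-force search costs time proportional to the number of image pixels times the number of mask pixels, so it is only suitable for small illustrations.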

3. The Proposed Method

Suppose that hands are tracked and each hand is totally visible at time . This means that , . The initial segmentations, labelled with different colors, are provided manually at time . At time , our approach processes the frames sequentially for simultaneous hand tracking and segmentation.

3.1. State Prediction

When the segmentation result is correct at time , the prediction set at time is estimated by the mean velocity using (9) as

To compute the unknown mean velocity , several methods (such as the autoregression model [19] and interest point detectors [20, 21]) have been proposed in the past decades. Compared with these methods, optical flow delivers excellent results on fast moving objects with high computational efficiency [7]. Therefore, as in the method [13], we choose optical flow based on the pyramidal Lucas-Kanade multiresolution scheme [22] as our dynamic model. However, the method [13] suffers from two problems. The first is that some interest points may be wrongly detected by optical flow, as shown in the first row of Figure 3; the second is the degeneracy problem, illustrated in the second row of Figure 3. In our dynamic model, two strategies are introduced to avoid these two problems.

To compute the unknown velocities, a set of interest points is considered. At time , the interest points are found by good-features-to-track [7, 21], which seeks a steep brightness gradient along at least two directions for promising feature candidates. Then, at time , can be detected [22]. So the velocity is computed between two successive frames as

And the mean velocity at time is calculated as

From (10), we know that every detected point contributes to the mean velocity. When some points lie outside the hand region (see the first row of Figure 3), the mean velocity may be biased away from the true velocity. So a distance penalty in (12) is created to eliminate outliers. Here, we only consider points whose displacements are less than a given threshold as

In order to capture fast hand motion, we can set a large value to . In our experiments, is well suitable for all test videos.
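
The first strategy can be sketched as follows: displacements of matched interest points are averaged after discarding those that reach the threshold, and the current mask is translated by the result. This is a minimal sketch; the names `mean_velocity`, `predict_mask`, and `max_disp` are hypothetical, points are assumed to be `(x, y)` tuples already matched between two frames, and rounding the velocity to whole pixels is a simplification of this example.

```python
import math

def mean_velocity(prev_pts, next_pts, max_disp):
    """Average displacement of matched points whose displacement is below max_disp.

    Returns the mean velocity and the number of points that were kept.
    """
    kept = []
    for (x0, y0), (x1, y1) in zip(prev_pts, next_pts):
        vx, vy = x1 - x0, y1 - y0
        if math.hypot(vx, vy) < max_disp:  # outlier rejection
            kept.append((vx, vy))
    if not kept:
        return (0.0, 0.0), 0
    n = len(kept)
    return (sum(v[0] for v in kept) / n, sum(v[1] for v in kept) / n), n

def predict_mask(mask, velocity):
    """Translate a set of (x, y) pixels by the mean velocity, rounded to whole pixels."""
    dx, dy = round(velocity[0]), round(velocity[1])
    return {(x + dx, y + dy) for x, y in mask}
```

In practice the matched points would come from a pyramidal Lucas-Kanade tracker; here they are plain lists so that the filtering step can be read in isolation.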

As time goes on, the number of interest points tracked by optical flow may decrease (see the second row in Figure 3). To avoid the degeneracy problem, the second strategy is to resample interest points. When the number of interest points falls below a given threshold , we redetect new interest points using good-features-to-track [21]. After these two strategies, the detected interest points are shown in the third row of Figure 3.

3.2. Error of Prediction

In this work, we adopt the idea of [12] to handle the prediction error problem. The prediction error is the distance between the predicted centroid and the actual centroid at time . The scaling function is defined as

Here, is a threshold based on empirical motion, which controls the change rate of penalty . If is large, will slowly change. As mentioned in [12], in practice, is quite robust to our model. When the actual sets are off , is lowered to hopefully still capture motion. can automatically rise when prediction errors decrease by (13).
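
The qualitative behaviour described above can be illustrated with a short Python sketch. The Gaussian-shaped fall-off used here is an assumption made for illustration and is not claimed to be the definition in (13); the names `prediction_error`, `scaling`, `error`, and `rho` are likewise hypothetical.

```python
import math

def prediction_error(predicted_centroid, actual_centroid):
    """Euclidean distance between the predicted and the actual centroid."""
    return math.hypot(predicted_centroid[0] - actual_centroid[0],
                      predicted_centroid[1] - actual_centroid[1])

def scaling(error, rho):
    """Assumed scaling: 1 for a perfect prediction, decaying as the error grows.

    A larger rho makes the value change more slowly with the error.
    """
    return math.exp(-(error ** 2) / (rho ** 2))
```

With this form the weight on the distance penalty drops when the prediction was poor, so the segmentation may search farther from the predicted set, and it rises again automatically once the error decreases.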

3.3. Augmented Graph Cuts

Now we explain how to define new terms and incorporate them into energy function. Those new terms are the core principle in augmented graph cuts.

3.3.1. Spatial Constraint

Owing to the similar color of human skin, it is difficult to eliminate the effect of each hand by the works [12, 13] as shown in Figure 2. Here, we introduce a new energy term called spatial term : where denotes the centroid of the predict set . is the parameter value. The penalization is made through the function : where is the Euclidean distance from the location of a pixel to the centroid of . When a pixel is close to , the value becomes a small value which indicates that the pixel is encouraged to assign the th object.

As illustrated in Figure 4(a), when hands are visible (), then which means that the pixel is inclined to assign . Therefore, when hands are totally visible in the same scene, spatial term can distinguish each hand. Nevertheless, when hands overlap together (), it will be ambiguous to assign the pixel to or in Figure 4(b). This means that spatial term is suitable for .
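
The effect of the spatial term can be illustrated with a small Python sketch. It uses the raw Euclidean distance to each predicted centroid as the penalty, which is a simplifying assumption of this example; the names `centroid` and `spatial_penalties` are hypothetical, and predicted sets are represented as sets of `(x, y)` pixels.

```python
import math

def centroid(mask):
    """Centroid of a non-empty set of (x, y) pixels."""
    n = len(mask)
    return (sum(p[0] for p in mask) / n, sum(p[1] for p in mask) / n)

def spatial_penalties(pixel, predicted_sets):
    """Distance from a pixel to the centroid of every predicted set.

    The smallest entry marks the hand the pixel is encouraged to join.
    """
    return [math.hypot(pixel[0] - cx, pixel[1] - cy)
            for cx, cy in map(centroid, predicted_sets)]
```

When two predicted sets are disjoint the penalties differ clearly, whereas for sets with coinciding centroids they are equal, which mirrors the ambiguity described for overlapping hands.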

3.3.2. Motion Constraint

In (8), the energy function does not consider the situation in which hands pass over other skin-colored objects, such as face. Therefore, a new energy term called motion term is given to handle this situation: where is a weight parameter. The function is defined as where is the motion parameter mentioned in (13).

Using the motion information allows us to reject some bad segmentations in the case of hands over skin-colored objects. When a pixel is from with the velocity , it assigns to and the value to the other sets according to (16). means that the pixel tends to be assigned to the object. The motion term can keep a good segmentation when hands and other skin-colored objects overlap (e.g., hands over face).

3.3.3. Chamfer Distance

The terms defined above are based on motion information and the prediction set . However, the spatial and motion terms still cannot deal with hand occlusions (see Figure 4(b)). Therefore, a new term called the chamfer term is introduced to deal with hand occlusions. is defined as where is the function of the chamfer distance transform and is the weight parameter. Before computing the chamfer distance, we obtain a binary edge image from the frame at time (e.g., using Canny edge detection [23]). Then the chamfer distance can be calculated quickly in two passes over the frame [24], as shown in Figure 5. encourages label discontinuities to stay on image boundaries. In particular, when hands overlap, we can set a large value to to reject bad segmentations in the areas of occlusion .
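
The two-pass computation mentioned above can be sketched in Python as follows. The sketch assumes the widely used 3-4 integer chamfer weights (3 for horizontal or vertical steps, 4 for diagonal steps), which may differ from the mask used in [24]; the function name `chamfer_distance` and the list-of-lists edge map are assumptions of this example.

```python
def chamfer_distance(edges):
    """Two-pass 3-4 chamfer distance transform of a binary edge map (1 = edge pixel)."""
    h, w = len(edges), len(edges[0])
    inf = 10 ** 9  # stands in for "no edge reached yet"
    d = [[0 if edges[y][x] else inf for x in range(w)] for y in range(h)]
    # Forward pass: propagate distances from the top-left corner.
    for y in range(h):
        for x in range(w):
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + 3)
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + 3)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y - 1][x - 1] + 4)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y - 1][x + 1] + 4)
    # Backward pass: propagate distances from the bottom-right corner.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if x < w - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + 3)
            if y < h - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + 3)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y + 1][x + 1] + 4)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y + 1][x - 1] + 4)
    return d
```

Dividing the result by 3 gives an approximation of the Euclidean distance to the nearest edge pixel, and each pixel is visited only twice regardless of how many edge pixels the frame contains.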

3.4. Final Energy Function

We merge all of the terms above, so that hand tracking amounts to minimizing the following energy function, which consists of six terms:

Compared with the energy function equations (4) and (8), our model can handle hand occlusions, hands over face, and fast hand capture. After building the graph by (19), we can apply the α-expansion algorithm [16] to minimize the energy function.

3.5. Overview of the Proposed Method

We have described the principle of our method to track and segment hands in different circumstances. Hand tracking and segmentation proceed in four steps. First, initial segmentations for all tracked hands are provided manually at time . Then, at time , the prediction is estimated by the dynamic model. Next, we construct the graph for the augmented graph cuts and use α-expansion to obtain the final segmentation results. Finally, we check whether the number of interest points is larger than a given threshold; if it is below the threshold, we resample the interest points. An overview of our algorithm is given in Algorithm 1.

Step  1. Initialization (at time )
(i)  : the number of tracked hands
(ii) : the displacement of one interesting point from time to
(iii) : minimum number of interesting points
(iv) Manually initialize the sets , (such as at time in Figures 6–12)
   , .
(v)  Find interesting points in via good-feature-to-track [7, 21].
   For at time
Step  2.
(i)  Find interesting points using via optical flow [22].
(ii) If obtain the final interest points .
(iii) Compute the hands mean velocity , using via (11).
(iv) Predict the sets , using .
Step  3.
   Build the graph and apply α-expansion via (19).
Step  4.
   If < (the number of interest points below a given threshold )
   Update interesting points in the region via good-feature-to-track [7, 21].
   If <
  Return to Step  2.
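
The control flow of the steps above can be summarized in a short Python skeleton. It is only a sketch of the loop structure: every callable passed in (`flow`, `predict`, `segment`, `detect`) is a hypothetical placeholder for the corresponding component (optical flow matching, state prediction, augmented graph cuts with α-expansion, and interest point detection), and their signatures are assumptions of this example.

```python
def track(frames, init_masks, init_points, flow, predict, segment, detect, min_points):
    """Skeleton of the tracking loop; returns the list of masks for every frame."""
    masks, points = init_masks, init_points          # Step 1: manual initialization
    results = [masks]
    for prev, cur in zip(frames, frames[1:]):
        # Step 2: match interest points and predict each hand's new location.
        pairs = [flow(prev, cur, pts) for pts in points]
        predictions = [predict(m, pr) for m, pr in zip(masks, pairs)]
        # Step 3: segment the current frame given the predictions.
        masks = segment(cur, predictions)
        # Step 4: keep the matched points, resampling a hand that has too few.
        points = [[new for _, new in pr] for pr in pairs]
        points = [detect(cur, m) if len(pts) < min_points else pts
                  for pts, m in zip(points, masks)]
        results.append(masks)
    return results
```

Passing the components in as functions keeps the skeleton independent of any particular optical flow or graph cuts implementation.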

4. Experimental Results

To validate and evaluate the proposed approach, we use four videos (three videos were captured by our webcam and one is an American Sign Language (ASL) video provided by the Purdue ASL database [25]). All the videos have the same frame rate of 30 fps. In this paper, we only provide four challenging videos, but more results (e.g., tracking and segmentation of four hands) can be found on the website: http://joewan.weebly.com/my-research.html.

4.1. Results

The proposed method is implemented in Microsoft Visual Studio 2008. All the videos are tested on a Core 2 Duo P8600 processor with 2 GB RAM. The initial segmentations (at time ), the tracking results, and the parameters are given for each experiment. Every tracked hand is labelled with a different color. Although several methods are similar to ours, we only compare the proposed approach with the method [12]. This is because the method [13] requires a reference background image for background subtraction to obtain external observations, so it is not suitable for hand tracking in dynamic backgrounds. Papadakis and Bugeau [1] proposed a framework for object tracking, but it relies on the strong assumption that the occluded part of an object is a subset of the prediction of the whole object, which is not appropriate for the self-occlusions that commonly happen in hand motion, especially finger movement. To compare with the method [12], the parameters , , and are set to zero, which recovers the original energy function equation (8).

4.1.1. Hand Occlusions

This video, called video 1, has 141 frames of 320 × 240 pixels and shows two hands that may overlap while both are in motion. The parameters of our method are as follows: , , , , and . The parameters of the method [12] are as follows: , , , , and .

In Figure 6, we first analyze the results of the method [12], shown in the first row. Both hands are labelled green at , which means that the right hand is wrongly segmented. Additionally, when the two hands partially overlap at , tracking of the left hand fails. In contrast, with our method the hands are well recovered after hand occlusions, as shown in the second row of Figure 6, which shows that our approach is able to solve two principal problems: dealing with hand occlusions and rejecting oversegmentation.

4.1.2. Hand over Face

Now we give an example to demonstrate that our method can achieve hand segmentation even when hands pass over skin-colored objects, such as the face. The video, called video 2, was recorded outdoors and includes 106 frames. The frame size is 640 × 480 pixels. Our parameters are as follows: , , , , and . The parameters of the method [12] are as follows: , , , , and .

As shown in Figures 7 and 8, when the hands move from the left to the right, hand over face occlusion occurs at time . Figure 7 shows the results by the method [12], which reveals the failure to accurately track and segment hand when hands pass over face. In Figure 8, the hand segmentation is quite well achieved along the sequence by our method. Owing to the motion constraint in (19), when the hands pass over the face, our method still can reject the bad segmentations which may occur in the face region.

4.1.3. Fast Hand Tracking in Sign Language

The video called video 3 is from the Purdue ASL database [25]. It involves fast hand motion (the entire video), hand over face (), partial hand occlusions (), and hand shape deformation (the entire video). This video includes 265 frames with a frame size of 640 × 480 pixels. Our parameters are , , , , and . As shown in Figure 9, our method segments and tracks hands robustly. As the results in Figures 6–8 show, the method [12] cannot deal with hand occlusions and hand over face, so we only give the results of our method.

4.1.4. Dynamic Background

To further evaluate the effectiveness of the proposed method under complex situations, we test it with a dynamic background. The video, called video 4, was captured in a lab environment and includes 174 frames with a frame size of 320 × 240. A walking pedestrian forms the dynamic background and causes occlusions while the tracked hand is in motion. The parameters of our method are as follows: , , , , and . The final results, given in Figure 10, show that our method achieves good performance with a dynamic background.

4.2. Discussion and Adjusting Parameters

The energy minimizing function in (19) is composed of six different terms. It has eight parameters to be tuned (three dynamic model parameters, five graph cuts parameters). However, most of them can be fixed in our experiments. For dynamic model parameters, the three parameters are given constant value because those parameters are not sensitive to our model. We set , , and .

Here, we give some hints about adjusting parameters to help use the proposed method. The spatial parameter denotes the weight value to every pixel of image, which measures the distance from the location of a pixel to the centroid of every tracked hand. When hands are disjoint, can be set to a large value. If is set to infinity, the model will only consider the separated areas of the tracked objects. The motion parameter represents the weight for handling hands over face. When the hand passes over skin-colored object, a large value can be set to (see Section 4.1.2). The chamfer parameter indicates the weight for hand occlusions. When hands overlap, a large value can be set to (see Section 4.1.2). These three parameters often vary from 0 to 6.

The parameter makes the segmentation smoother everywhere [16]. If the value of is set to be large, it will lead to poor results at object boundaries. In most circumstances, usually varies from 5 to 10, which suits the model well according to the experimental results. It is known that a small value allows [12] to search farther from the prediction and still track the object. When a small value is set to , oversegmentations may occur. For instance, as illustrated in Figure 11, when we set and keep the other parameters the same as in Section 4.1.3, oversegmentations happen at time .

4.3. Evaluation and Complexity Analysis

In order to perform an objective comparison, we first manually segment the hand masks (ground truths) for every frame in our test videos. Then we calculate the mean percentage error (MPE) [27] between the ground truths and the segmentation results. MPE is defined as where denotes the number of falsely detected pixels, represents the frame size, and is the number of frames in a video. Note that a false detection happens in two situations: pixels in the background are detected as the hand region, or pixels in one hand are treated as another hand or as background. Figure 12(a) shows the MPEs for the four videos, from which we can draw two conclusions.
(i) When hands and other skin-colored objects are in the same scene (hand over face, hand occlusions), the MPEs of our method are much lower than those of the method [12]. In particular, the MPEs of videos 3 and 4 by the method [12] are very high (>10%) due to the wrongly labelled face region when hands and the face overlap.
(ii) The MPE (0.1714%) of video 4 by our method is close to the ground truth (0%), which indicates that the proposed method is well suited for hand tracking and segmentation in sign language video.
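
The MPE described above can be computed with a few lines of Python. This is a minimal sketch based on the verbal definition: per frame, the number of wrongly labelled pixels is divided by the frame size, and the result is averaged over all frames and expressed as a percentage. The function name `mean_percentage_error` and the representation of label maps as lists of lists of integers (0 for background, 1, 2, ... for hands) are assumptions of this example.

```python
def mean_percentage_error(results, truths):
    """Mean percentage of wrongly labelled pixels over all frames of a video."""
    total = 0.0
    for result, truth in zip(results, truths):
        frame_size = len(truth) * len(truth[0])
        wrong = sum(r != t
                    for row_r, row_t in zip(result, truth)
                    for r, t in zip(row_r, row_t))
        total += wrong / frame_size
    return 100.0 * total / len(truths)
```

Because labels are compared for equality, both kinds of false detection are counted: background pixels labelled as a hand, and hand pixels labelled as another hand or as background.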

Next, we give the running times of both the proposed method and the method [12] in Figure 12(b), which reports the average execution time (AET) per frame for all test videos. The AETs of our method are close to those of the method [12], although they are slightly higher, by about 20 to 30 milliseconds per frame. This is because additional terms are incorporated into the energy function, which leads to slightly higher complexity. In return, the proposed method can successfully track and segment hands when the face and hands are partly occluded. AET depends on the frame size and the number of tracked hands (see Figure 12(b)). In our future research, we will consider a narrow band around the prediction sets [10] to decrease the computational cost. The study of this band will be the subject of future work toward real-time operation.

5. Conclusion

In this paper, we present a method based on augmented graph cuts and the dynamic model for hand tracking and segmentation in different environments. The proposed algorithm can resolve three problems: fast hand motion capture, hand occlusions, and hand over face. In our method, we reformulate the energy function by adding some new energy terms which are more robust to hand tracking and segmentation. Additionally, the new terms can deal with occlusions and obtain accurate segmentation.

Meanwhile, several aspects can be improved. First, we can develop a method to automatically extract the hand region instead of manually segmenting hands in the initialization step. For instance, we can apply the AdaBoost algorithm [26] to detect the region of interest (ROI) of hands and use GrabCut [28] in the ROI to achieve hand segmentation. Second, some prior knowledge can be incorporated into the proposed method to handle total occlusion. Moreover, another important point is the tuning of the parameters in the energy function. In our future research, we will focus on these problems.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported partly by the National Natural Science Foundation of China (61172128), National Key Basic Research Program of China (2012CB316304), New Century Excellent Talents in University (NCET-12-0768), the Fundamental Research Funds for the Central Universities (2013JBZ003), Program for Innovative Research Team in University of Ministry of Education of China (IRT201206), Beijing Higher Education Young Elite Teacher Project (YETP0544), and Research Fund for the Doctoral Program of Higher Education of China (20120009110008).