Abstract

Gesture recognition is an important part of human-robot interaction. In order to achieve fast and stable gesture recognition in real time without distance restrictions, this paper presents an improved threshold segmentation method. The improved method combines the depth information and color information of a target scene with the hand position through a spatial hierarchical scanning method, and the ROI in the scene is then extracted by the local neighbor method. In this way, the hand can be identified quickly and accurately in complex scenes and at different distances. Furthermore, the convex hull detection algorithm is used to locate the fingertips in the ROI, so that the fingertips can be identified and positioned accurately. The experimental results show that the hand position can be obtained quickly and accurately against a complex background by using the improved method, real-time recognition is achieved over a distance interval of 0.5 m to 2.0 m, and the average fingertip detection rate reaches 98.5%. Moreover, the gesture recognition rates of the convex hull detection algorithm exceed 96%. It can thus be concluded that the proposed method achieves good hand detection and positioning performance at different distances.

1. Introduction

Nowadays, the interaction between people and machines is mainly completed through the mouse, keyboard, remote control, touch screen, and other direct-contact means, while the communication between people is basically achieved through more natural and intuitive noncontact means, such as sound and body movements. Communication in a natural and intuitive noncontact manner is usually considered to be flexible and efficient; many researchers have thus made efforts to enable machines to identify people's intentions and information through noncontact means, as people do, such as sound [1], facial expressions [2], body movements [3], and gestures [4, 5]. Among them, gesture is the most important part of human language, and its development affects the naturalness and flexibility of human-robot interaction [6–10].

In the past decades, gestures were usually identified by wearing data gloves [11] to obtain the angles and positions of each joint of the hand. However, such devices are difficult to use widely due to their cost and the inconvenience of wearing the sensors. In contrast, noncontact visual inspection methods have the advantages of low cost and comfort for the human body and are currently the popular gesture recognition methods. Chakraborty et al. [12] and Song et al. [13] proposed skin color models utilizing the image pixel distribution in a given color space, which can significantly improve the detection accuracy in the presence of varying illumination conditions. However, it was difficult to achieve the desired results with the model-based methods because of light sensitivity during the imaging process. Algorithm-based noncontact visual inspection methods have also been used for gesture recognition, such as the hidden Markov model [14], the particle filter [15], and the Haar-feature AdaBoost learning algorithm [16]; however, they are difficult to execute in real time due to their computational complexity. These methods cannot acquire gestures efficiently in real time, since only insufficient 2D image information is used.

Therefore, it is inevitable that gesture recognition based on 2D images is replaced by 3D recognition with depth information. In general, 3D information can be acquired by binocular cameras [17], the Kinect sensor [18–20], the Leap Motion [21], and other devices. These devices are usually used to obtain depth information through the spatial relationship between different viewing directions [17] or through infrared reflection [22], which conveniently provides noncontact images for recognition and classification without wearing complicated equipment. For example, Ťupa et al. [23, 24] presented the detection of selected gait attributes using Microsoft Kinect image and depth sensors to track movements in three-dimensional space. Youness et al. [25] proposed a real-time human pose classification technique using skeleton data from a depth sensor. However, the calibration process of a binocular camera is usually complex, and the recognition distance of the Leap Motion is only from 2.5 cm to 60 cm. Owing to its calibration-free use and long recognition distance, the Kinect sensor has been widely used in body pose detection [26, 27], skeleton tracking [28], and other applications.

Gesture extraction from a complex image is considered to be important, and more information generally leads to higher accuracy and a larger scope of gesture recognition. The Kinect sensor is usually selected to acquire gestures because it provides extra depth information, and it has demonstrated significant insensitivity to lighting conditions in gesture recognition. However, the problems of finding the hand and segmenting it with depth information still need to be further treated. Recently, researchers have focused on the recognition problem rather than the gesture segmentation problem in applications of hand gesture recognition. Gesture segmentation is usually performed by setting a direct distance interval [4, 5, 29] or by assuming that the hand is the frontmost object [30]. These simplified methods have been demonstrated to be quick and effective; however, the distance between the hand and the Kinect sensor is restricted, so hand gestures can only be recognized by moving the hand to a specific position and keeping that distance during the whole process. In general, researchers seek human-robot interaction in a natural way, like communication between human beings, yet such a distance restriction is unnatural in the existing literature. Gesture recognition methods such as template matching [31] and finite-state machines [32] have also been used, and high classification rates can be obtained; however, only specific gestures can be recognized by these methods. In contrast, the convex hull detection algorithm [33] recognizes gestures from the finger hull and can locate each fingertip of the human hand, so it captures more gesture information and has a potential advantage.

In this paper, an improved threshold segmentation method is proposed based on the Kinect sensor with depth information for long-distance recognition. The proposed method not only has the advantage of light insensitivity but can also extract gestures accurately over a wide range of distances and against complex backgrounds, provided the hand gesture is not completely covered by objects in front of it. Firstly, the RGB image data and the depth image data obtained by the Kinect sensor are preprocessed by median filtering. Secondly, combining the depth information with a skin color threshold, an improved spatial stratification method is proposed to extract gestures, so that gestures can be identified over a wide range of distances in complex backgrounds. Finally, the local neighbor method is used to segment the ROI of the human hand. In order to verify the efficiency of the proposed method, the k-cosine curvature method is also presented to detect the fingertips and recognize number gestures. The experimental results demonstrate that the proposed method achieves good performance and has strong robustness.

The remainder of this paper is organized as follows. In Section 2, hand gestures are extracted from an image by the improved threshold segmentation method. In Section 3, the algorithm of fingertip detection is described. In Section 4, experimental tests are conducted to detect fingertips and recognize number hand gestures. Finally, conclusions are drawn in Section 5.

2. Gesture Extraction

Gesture extraction includes preprocessing and hand segmentation. Preprocessing is to register the depth image and RGB image and to process the spatial stratification of depth distance. Hand segmentation is to extract the hand area at different distances from the complex background.

2.1. Hand Recognition Preprocessing

In order to carry out the hand segmentation, it is necessary to register the depth image and RGB image, to stratify the depth distance, and to filter the depth image.

The first step is to register the depth image and the RGB image. The resolution of the RGB image obtained by the Kinect sensor is 1920 × 1080, while the resolution of the depth image converted from the depth information is 512 × 424. Moreover, the RGB camera and the IR camera of the Kinect sensor are located at different positions. Therefore, the RGB image and the depth image are mismatched in both spatial size and camera position. In order to obtain the corresponding RGB information and depth information for each point, every point on the depth image has to be registered to match a corresponding point on the RGB image. In this paper, the CoordinateMapper in the Kinect v2 API is used to register the RGB image and the depth image. The result is basically accurate except for burrs on object edges due to the distance gap; however, these can be ignored because the accuracy meets the requirements of the following calculations. After registration, assume that the target image is P, which is composed of the pixel set P = {pi}. The value of each pi is (xi, yi, di, ri, gi, bi), where (xi, yi) represents the pixel coordinate position of pi, di represents the depth value of pi, and (ri, gi, bi) represents the RGB element values of pi.
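To make the registered representation concrete, the following C++ sketch shows one possible data structure for the pixel set P; the struct name and field types are illustrative choices, not the paper's, and the CoordinateMapper call that fills the structure is only indicated in a comment because the exact mapping code depends on the Kinect v2 SDK setup.

    #include <cstdint>
    #include <vector>

    // One registered pixel pi = (xi, yi, di, ri, gi, bi): image
    // coordinates, depth value, and RGB color copied from the RGB image.
    struct RegisteredPixel {
        int     x, y;      // pixel coordinates in the registered image
        float   d;         // depth value in meters
        uint8_t r, g, b;   // color values from the registered RGB image
    };

    // The target image P is simply the set of all registered pixels.
    using TargetImage = std::vector<RegisteredPixel>;

    // In the actual pipeline, each depth pixel would be mapped to a color
    // pixel with the Kinect v2 CoordinateMapper before this structure is
    // filled; that SDK call is omitted here.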

The second step is to spatially stratify the depth distance. The aim of the spatial stratification is to search for the characteristics of the human hand in depth space from near to far, so that one can avoid the limitation of traditional methods in which only gestures in the forefront can be detected. According to our experimental investigation, the detection range of the depth camera is 0.5 m~4.5 m; however, it is not sufficient to identify gestures when the distance between the Kinect sensor and the human hand is greater than 2.5 m. Therefore, the stratification range of our experimental tests is selected to be 0.5 m~2.0 m. Taking into account the real-time performance and accuracy of the Kinect sensor, we set the layer step to 0.1 m; therefore, P can be divided into 15 parts, and the kth layer is represented by the pixel set Pk, where

Pk = {pi ∈ P | 0.5 + 0.1(k − 1) ≤ di < 0.5 + 0.1k},  k = 1, 2, …, 15.
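A minimal sketch of this stratification step, assuming the layer index is computed directly from the depth value with the 0.5 m offset and 0.1 m step given above:

    // Map a depth value (meters) to its layer index k in 1..15, assuming
    // layers of width 0.1 m starting at 0.5 m; returns 0 for depths
    // outside the 0.5 m - 2.0 m working range used in this paper.
    int LayerIndex(float depthMeters) {
        if (depthMeters < 0.5f || depthMeters >= 2.0f) return 0;
        return static_cast<int>((depthMeters - 0.5f) / 0.1f) + 1;
    }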

The last step is to filter the depth image. Since the depth image is obtained by calculating random speckles, which are produced by the infrared light of the IR camera reflected from rough object surfaces, points without values or nonuniform regions are inevitable. The median filtering method is a nonlinear smoothing technique that sets the gray value of each pixel to the median of the gray values of all pixels within a certain neighborhood of that point. In comparison to other filters such as the mean filter and the Gaussian filter, the main advantage of median filtering is that isolated noise points can be eliminated efficiently while the gesture edge information is well retained. Therefore, in this paper, the depth image is preprocessed with the median filter to remove small noise points in the image, where the aperture linear size of the median filter is set to 3. The medianBlur function [34] in OpenCV is used for depth image filtering due to its fast and effective performance.
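The filtering step can be reproduced with OpenCV's medianBlur; the sketch below assumes the depth image is stored as a single-channel 16-bit cv::Mat, for which medianBlur accepts an aperture of 3.

    #include <opencv2/imgproc.hpp>

    // Remove isolated speckle noise from the depth image while preserving
    // gesture edges; the aperture linear size is 3, as in the text.
    cv::Mat FilterDepth(const cv::Mat& depth16U)   // CV_16UC1 depth image
    {
        cv::Mat filtered;
        cv::medianBlur(depth16U, filtered, 3);
        return filtered;
    }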

2.2. Hand Segmentation

After registering the depth image and the RGB image, P is detected layer by layer by spatial stratification. There are two steps to obtain the ROI of the gesture. The first step is to find the approximate location of the gesture and to distinguish the right or left hand and other uncovered parts of the body by using GetJoints in the Kinect v2 API. The second step is to detect skin color layer by layer within the approximate image frame; in this way, the accurate ROI of the gesture can be determined. In this paper, object detection starts from the objects nearest to the camera and is combined with the depth information, so the target can be located quickly once the conditions on the RGB information are met; in this way, the ROI area of the target depth image can be obtained. Starting from the first layer space P1, the hand detection moves to the next layer until the RGB values of the points pi in the point set Pk fall within the range of skin color values. Then, the area where the hand is located is determined if pi satisfies

Clow ≤ (ri, gi, bi) ≤ Cup,

where Clow and Cup are the lower and upper bounded skin color values of the tricolor, respectively.

When the number of detected interest points exceeds a given threshold (a certain number of points set in this paper), the possibility of interference from noise or other uncertainties can be ruled out. These points are then determined to lie in the kth layer and to be of skin color. In this way, the area where the hand is located can be finally determined.
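The layer-by-layer search can be sketched as follows, reusing the RegisteredPixel structure and LayerIndex helper introduced above; the skin color bounds and the minimum number of interest points are free parameters that the paper does not specify, so the values here are placeholders.

    #include <opencv2/core.hpp>

    struct HandHit { int layer; cv::Point seed; bool found; };

    // Scan the layers P1..P15 from near to far and return the first layer
    // that contains enough skin-colored pixels, together with the first
    // interest point found in it. Thresholds are illustrative only.
    HandHit FindHandLayer(const TargetImage& P,
                          const cv::Vec3b& skinLow,   // lower RGB bound (Clow)
                          const cv::Vec3b& skinUp,    // upper RGB bound (Cup)
                          int minPoints = 50)         // placeholder threshold
    {
        for (int k = 1; k <= 15; ++k) {
            int count = 0;
            cv::Point seed(-1, -1);
            for (const RegisteredPixel& p : P) {
                if (LayerIndex(p.d) != k) continue;
                bool skin = p.r >= skinLow[0] && p.r <= skinUp[0] &&
                            p.g >= skinLow[1] && p.g <= skinUp[1] &&
                            p.b >= skinLow[2] && p.b <= skinUp[2];
                if (!skin) continue;
                if (++count == 1) seed = cv::Point(p.x, p.y);
            }
            if (count >= minPoints) return { k, seed, true };  // hand layer found
        }
        return { 0, cv::Point(-1, -1), false };                // no hand detected
    }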

After determining the hand position, the local neighbor method is used to obtain the ROI area; that is, a square of side length b, centered on the interest point, is chosen to mark the hand position. Because of the large interval, from 0.5 m to 2.0 m, between the hand and the Kinect sensor, the hand appears large in the image when near and small when far.

Remark 1. A value of b appropriate for the ROI area at a short distance would be too large at a long distance, which would decrease the calculation accuracy and speed; conversely, a value appropriate for a long distance would lose necessary information at a short distance, and the hand location might not be obtained. Therefore, the value of b should be chosen according to the layer index k to ensure the accuracy of the hand segmentation.

According to the above analysis, the ROI area can be obtained, as shown in Figure 1(a). In the ROI area, the gray value of the background in the image is set to 255, while the gray value of the gesture area is set to 0; the binary image of the hand gesture can thus be obtained. Let PROI denote the set of ROI pixels whose gray value is 0; then PROI is the pixel set of the gesture location.
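One possible realization of the local neighbor ROI and binarization described above is sketched below; the inverse scaling of the side length b with the layer index k is an illustrative assumption, since the paper does not give the exact relation, and the skin mask input (255 for skin pixels) is likewise assumed.

    #include <algorithm>
    #include <opencv2/core.hpp>
    #include <opencv2/imgproc.hpp>

    // Cut a square ROI of side b around the interest point and binarize it:
    // gesture pixels -> 0, background -> 255.
    cv::Mat ExtractHandROI(const cv::Mat& skinMask,   // CV_8UC1, 255 = skin
                           const cv::Point& seed, int k)
    {
        int b = std::max(48, 240 / k);                 // placeholder: smaller ROI when farther
        cv::Rect roi(seed.x - b / 2, seed.y - b / 2, b, b);
        roi &= cv::Rect(0, 0, skinMask.cols, skinMask.rows);  // clip to the image
        cv::Mat bin;
        cv::bitwise_not(skinMask(roi), bin);           // hand = 0, background = 255
        return bin;
    }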

3. Fingertip Detection

After extracting the binary image of the hand gesture from the complex background, the hand contour is first extracted from the ROI region, the palm point is then detected, and the fingertip points are finally located based on the obtained contour and palm point.

3.1. Hand Contour Extraction

In this paper, the FindContours algorithm [34] is used to extract the hand gesture contour from the ROI region. Contour point extraction is generally achieved by comparing the values of adjacent pixels. The basic principle of the FindContours algorithm is to find the contour by detecting the boundary between the black and white regions of the binary image. In the previous section, we obtained the binary image of the hand gesture; the FindContours algorithm is then used to extract the gesture contour from the ROI region, as shown in Figure 1(b).
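A minimal OpenCV call corresponding to this step is sketched below; because findContours expects white objects on a black background, the binary ROI (hand = 0) is inverted first, and the largest contour is assumed to be the hand.

    #include <vector>
    #include <opencv2/imgproc.hpp>

    // Extract the outer hand contour from the binary ROI image.
    std::vector<cv::Point> HandContour(const cv::Mat& binROI)  // hand = 0, background = 255
    {
        cv::Mat inverted;
        cv::bitwise_not(binROI, inverted);              // hand becomes white
        std::vector<std::vector<cv::Point>> contours;
        cv::findContours(inverted, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_NONE);
        if (contours.empty()) return {};
        size_t best = 0;                                // keep the largest contour
        for (size_t i = 1; i < contours.size(); ++i)
            if (cv::contourArea(contours[i]) > cv::contourArea(contours[best])) best = i;
        return contours[best];
    }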

After obtaining the gesture contour, the contour points are stored clockwise in an array.

3.2. Palm Point Detection

The Kinect sensor provides palm detection for hand gestures, in which the end of the upper limb is considered to be the palm point based on the detection and recognition of the human skeleton. However, this method can only be used when the whole body is visible in the image. Moreover, the hand detection provided by the Kinect sensor has errors, since it is usually in the inferred state. Therefore, the center of the gesture contour is used as the palm point in this paper. The contour of the hand gesture in the ROI has been obtained above, and the center point of the contour is calculated as the palm point O from the coordinate array of the hand contour, as shown in Figure 1(c).
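The contour center used as the palm point can be computed from image moments; a short sketch, assuming the contour array returned by the previous step:

    #include <opencv2/imgproc.hpp>

    // Palm point O = centroid of the hand contour, from spatial moments.
    cv::Point PalmPoint(const std::vector<cv::Point>& contour)
    {
        cv::Moments m = cv::moments(contour);
        if (m.m00 == 0.0) return cv::Point(-1, -1);    // degenerate contour
        return cv::Point(static_cast<int>(m.m10 / m.m00),
                         static_cast<int>(m.m01 / m.m00));
    }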

Remark 2. Although the above palm point detection method has limited accuracy, it is relatively fast and stable. In this paper, the palm point is mainly used to exclude non-fingertip groove points when the fingertip points are calculated, so the precision requirement is not high. Therefore, the method meets the requirements of the calculation.

3.3. Fingertip Calculation

To track the characteristics of hand gestures, it is necessary to know the feature points of the hand in human-machine interaction. The most important features of the human hand are the fingers; therefore, we have to locate the fingers, that is, to find the fingertip points. In the previous subsection, we obtained the gesture contour image, in which the main feature of the fingertips is the convex hull. Therefore, in this paper, the k-cosine curvature algorithm, illustrated in Figure 2, is used to calculate the curvature values of the gesture contour. The points whose curvature values match reasonably chosen parameter settings are taken as the fingertip coordinates.

In Figure 2, pi+k is the kth point after pi in the clockwise contour array, and pi-k is the kth point before it.

Define the two vectors formed by pi, pi+k, and pi-k on the gesture contour curve as

aik = pi+k − pi,  bik = pi-k − pi.

Then, the k-cosine value of pi can be obtained as

eik = (aik · bik) / (|aik| |bik|).

In general, the appropriate interval of eik needs to be selected. A point that complies with the interval can be considered to be the required corner.

Remark 3. It is necessary to select an appropriate k value so that the position of each fingertip can be precisely detected. In order to handle the inaccuracy problem caused by the difference in size of the gesture at different distances, and to prevent the burr problem of edge curve caused by unsmooth contour, the range of k value should be selected carefully.

In this paper, we select k in an interval [m, n] and calculate the k-cosine of pi for each k in this interval, as shown in Figure 3. The maximum cosine value eik, obtained when k takes a certain value between m and n, is taken as the k-cosine value of pi.

Based on the k-cosine of the contour point, the angle θi between aik and bik can be calculated. Given a threshold angle θT, the corresponding contour point pi is suspected to be a fingertip point when θi ≤ θT.
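A sketch of this k-cosine test is given below; the interval [m, n] and the threshold angle are tunable parameters whose exact values the paper does not state, so the defaults used here are placeholders.

    #include <algorithm>
    #include <cmath>
    #include <vector>
    #include <opencv2/core.hpp>

    // k-cosine of contour point pi for a given k:
    // eik = (aik . bik) / (|aik| |bik|), with aik = p(i+k) - pi and
    // bik = p(i-k) - pi; indices wrap around the closed contour.
    double KCosine(const std::vector<cv::Point>& c, int i, int k)
    {
        int n = static_cast<int>(c.size());
        cv::Point a = c[(i + k) % n] - c[i];
        cv::Point b = c[((i - k) % n + n) % n] - c[i];
        double na = std::sqrt(double(a.x) * a.x + double(a.y) * a.y);
        double nb = std::sqrt(double(b.x) * b.x + double(b.y) * b.y);
        return (double(a.x) * b.x + double(a.y) * b.y) / (na * nb + 1e-9);
    }

    // A contour point is a fingertip candidate when its largest k-cosine
    // over k in [m, n] corresponds to an angle no larger than thetaT.
    // m, n, and thetaT are placeholder values, not the paper's.
    bool IsCandidate(const std::vector<cv::Point>& c, int i,
                     int m = 10, int n = 25, double thetaT = CV_PI / 3)
    {
        double maxCos = -1.0;
        for (int k = m; k <= n; ++k)
            maxCos = std::max(maxCos, KCosine(c, i, k));
        double angle = std::acos(std::min(std::max(maxCos, -1.0), 1.0));
        return angle <= thetaT;
    }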

As shown in Figure 4, a contour point that satisfies the condition can be either the convex point of a fingertip or the groove point between fingers. Therefore, the distance between this point and the palm point is used to further distinguish fingertip points from finger groove points. In Figure 4, assume that pi0 is the midpoint of pi-k and pi+k, the distance between pi0 and the palm point O is di1, and the distance between pi and the palm point O is di2, where each distance is the Euclidean distance between the two points computed from their coordinates. Then (1) pi is a fingertip if di1 < di2; (2) pi is a groove point if di1 > di2.
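The distance test that separates fingertips from grooves can be sketched as follows, reusing the contour, candidate index, and palm point from the previous steps.

    #include <cmath>
    #include <vector>
    #include <opencv2/core.hpp>

    // Euclidean distance between two pixel coordinates.
    static double Dist(const cv::Point& a, const cv::Point& b)
    {
        return std::sqrt(double(a.x - b.x) * (a.x - b.x) +
                         double(a.y - b.y) * (a.y - b.y));
    }

    // Decide whether candidate pi is a fingertip (true) or a groove point
    // (false): compare the distance di1 from the midpoint pi0 of p(i-k)
    // and p(i+k) to the palm O with the distance di2 from pi to O.
    bool IsFingertip(const std::vector<cv::Point>& c, int i, int k,
                     const cv::Point& palm)
    {
        int n = static_cast<int>(c.size());
        cv::Point prev = c[((i - k) % n + n) % n];
        cv::Point next = c[(i + k) % n];
        cv::Point mid((prev.x + next.x) / 2, (prev.y + next.y) / 2);   // pi0
        double di1 = Dist(mid, palm);
        double di2 = Dist(c[i], palm);
        return di1 < di2;                 // fingertip lies farther from the palm
    }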

4. Experimental Results and Discussion

In this paper, the Microsoft Kinect 2.0 is used as the data acquisition device, and a dynamic frame rate is used for video data acquisition; the experimental tests are conducted on the Visual Studio 2010 platform using C++ programs. OpenCV is used for image processing tasks such as image data storage and contour point search. According to the above analysis, there are two steps in the experimental tests. First, the improved threshold segmentation method is used for hand recognition. Then, the fingertip convex hull detection algorithm is used for fingertip positioning and gesture recognition. The experimental process is shown in Figure 5.

4.1. Experiment Results of Gesture Extraction

According to the design process of Section 2, the improved threshold segmentation method is used for gesture extraction, and the flow chart is shown in Figure 6.

According to the flow chart, the real-time recognition results of gesture extraction at different distances are shown in Figure 7. From Figure 7, it can be seen that the improved threshold segmentation method can automatically and efficiently identify the position of the hand gesture in real time at 0.6 m, 1.0 m, 1.5 m, and 2.0 m, and the recognition distance is considerably wider than that in the existing literature [4, 5, 25]. Furthermore, clear hand gestures in the ROI region can be obtained by using the improved method, which leads to higher accuracy of fingertip detection.

4.2. Experiment Results of Fingertip Detection and Gesture Recognition

According to the method presented in Section 3, the fingertip positions of the human hand can be detected, and the marked pixel coordinates are shown in Figure 8.

Figure 8(a) shows the result of the fingertip detection, where the red dots indicate the positions of the five fingertips. Figure 8(b) shows the positioning results, where d1 to d5 are the pixel distances between the five fingertips and the palm point, coordinate1 to coordinate5 are the coordinates of the five fingertips in the pixel coordinates of image P, and e is the k-cosine curvature corresponding to the preceding fingertip coordinate point.

We obtain images at different distances and against complex backgrounds from the Kinect sensor in real time and extract the images of the five-finger open gesture using the improved threshold segmentation method. The results of the fingertip detection are shown in Table 1, where “correct” means that all five fingertips are found and positioned correctly.

From Table 1, it can be seen that the hand gesture is identified and the fingertips are positioned between 0.5 m and 2.0 m against a complex background, and the detection rate is relatively high. The best fingertip positioning performance is achieved at a distance of about 1.0 m. However, the recognition speed is reduced when the distance is too close or too far: small noise spots around the hand reduce the fingertip detection performance when the distance is too close, and the contour sequence points in the ROI are so few that there are not enough data to detect all five fingertips when the distance is too far.

Based on the fingertip positioning, 650 images containing six gestures were randomly selected from real-time videos of five experimental participants to recognize hand gestures under different depth distances and different backgrounds. The six kinds of gestures, which represent the numbers 0~5, are shown in Figure 9. The identification results for gesture numbers 0~5 are shown in Table 2, where the “recognition rate” is the proportion of test results that agree with the presented gesture.

According to the experimental results of fingertip detection and gesture recognition, hand gesture identification can be achieved by the improved threshold segmentation method between 0.5 m and 2.0 m in real time, which is a longer recognition distance than that obtained with a direct distance interval setting; for example, the distance interval is set between 0.8 m and 1.0 m in [25]. Moreover, the proposed method shows good recognition performance with a complex foreground and background. The recognition rates of the number gesture experiments in this paper further show that the improved method can not only identify the hand in real time in a complex background and at different distances but also meet the requirements of fingertip detection and gesture recognition for natural human-robot interaction.

5. Conclusions

Aiming at the distance limitation of gesture recognition, this paper proposes an improved threshold segmentation method with depth information for hand gesture segmentation and presents the k-cosine curvature algorithm for fingertip detection. First, the improved threshold segmentation method, a spatial stratification scanning method that combines depth information with a skin color RGB interval, is used to identify the position of hand gestures at long distances. Second, the k-cosine curvature algorithm is used to detect the convex hull of the fingers so as to determine the positions of the fingertips, and the number gestures 0~5 can thus be identified. Third, the experimental results show that the proposed method can efficiently increase the detection distance in comparison with the traditional threshold segmentation methods. Moreover, every fingertip can basically be detected in the ROI by the improved method, and the recognition rates are more than 96%. Finally, the experimental results of number gesture recognition also show that the proposed method can meet the requirements of hand gesture recognition at different distances. Further work will be devoted to identifying more gesture information, applying the method to human-machine interaction, and achieving more machine control functions through dynamic and static gesture recognition.

Abbreviations

ROI:Region of interest
2D:2 dimensional
3D:3 dimensional
RGB:Red green blue
IR:Infrared radiation.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (61773351, 61473265, and 61374128), the Natural Science Foundation of Henan Province (162300410260), the Outstanding Young Teacher Development Fund of Zhengzhou University (1521319025), the Training Plan for University’s Young Backbone Teachers of Henan Province (2017GGJS004), and the Science and Technology Innovation Research Team Support Plan of Henan Province (17IRTSTHN013).