Abstract

In this work, we propose a vision-based hand gesture recognition system that serves as a secure and smart node in the application layer of the Internet of Things (IoT). The system can be installed on any terminal device with a monocular camera and interacts with users by recognizing pointing gestures in the captured images. The interaction target is determined by the straight line from the user's eye to the tip of the index finger, which enables real-time and authentic data communication. The system mainly contains two modules. The first is an edge repair-based hand subpart segmentation algorithm that combines pictorial structures and edge information to extract hand regions from complex backgrounds. The second is an adaptive pointing gesture estimation method that locates the position the user focuses on and adjusts the offset between the target position and the calculated position caused by the lack of depth information.

1. Introduction

The Internet of Things (IoT), which connects the objects around us to the Internet, has become one of the most important information technologies. A key issue for IoT is how to make object-object and human-object interaction both smart and safe. Data collected from users may include sensitive or private information about their daily activities; hence, security protection and privacy preservation are vital for the development of IoT. According to how information is generated and to the technical architecture, IoT consists of three layers: the perception layer [1, 2], the network protocol layer [3, 4], and the application layer [5, 6]. The perception layer is composed of various sensors and gateways, such as temperature sensors, two-dimensional barcodes, cameras, and the Global Positioning System (GPS). It recognizes objects and collects information with Radio Frequency Identification (RFID) and Wireless Sensor Networks (WSNs); therefore, the security issue in the perception layer is essentially node capture attacks on RFID and WSNs [7, 8]. The network protocol layer is the central pivot of IoT, transmitting and processing the information from the perception layer. As one of the key technologies in this layer, wireless mobile communication networks employ traditional encryption and authentication techniques to improve the security and privacy of IoT [9, 10]. The application layer realizes intelligent services by connecting IoT and users, such as city management, intelligent transportation, telemedicine, and smart home. The application layer also faces security threats; for example, attackers can capture unattended equipment and send data to the application system through identity impersonation and data modification. Therefore, this study aims to develop a secure and authentic interaction system based on computer vision for smart home applications in the application layer of IoT. As shown in Figure 1, our interaction system controls the cursor of a computer or a smart TV by recognizing users' pointing gestures. The cursor is located immediately according to the line from the eye to the fingertip when the user points at the target. Since interaction data is produced only when the user's eye and fingertip are detected and the pointing direction is recognized, the system effectively prevents unauthorized access by illegal users and is thus more secure than traditional point-touch devices.

Many studies have been conducted on recognizing pointing gestures with computer vision. When a person interacts with an object by a pointing gesture, the pointing direction is estimated from two points on the camera perspective line. Depending on the input camera, pointing gesture recognition technologies can be classified into 3D methods [11–13] and 2D methods [14–16]. 3D methods rely on specific input devices such as stereo cameras or Kinect, whereas 2D methods use cheaper and more readily available cameras and have lower computation cost. However, pointing gesture recognition based on 2D methods is still challenging. First, since 2D information of the hand provides relatively weak and ambiguous characteristics, discriminating a bare hand from the background is easily influenced by the large variability of hand appearances and the presence of other skin-like objects. Several studies have addressed this problem [17–19]. For example, Li and Wachs [17] proposed a weighted elastic graph matching method to detect and recognize ten hand postures against complex backgrounds. The average recognition accuracy reached 97%; however, the performance was unclear when the hand shape was distorted. In [18], hand regions were extracted with a Bayesian model of visual attention by combining shape, texture, and color cues; the long run time of 2.65 s clearly makes this method unsuitable for real-time applications. Gonzalez et al. [19] proposed a hand segmentation method based on pixel color and edge orientation to handle the overlap of hand and face, but the segmentation results were affected by thin lines under the chin or over the collar. The second challenge is to estimate the pointing position, determined by the intersection between the interaction plane and the pointing vector, in the absence of depth information. To sidestep this problem, some interaction systems based on 2D pointing gesture methods use only the pointing direction instead of the pointing position [14, 15], and others require users to perform coordinate calibration before operating [16]. In this paper, we propose an edge repair-based hand subpart segmentation algorithm which accurately and effectively segments the palm and finger regions from the background using 2D information. Furthermore, on the basis of this segmentation algorithm, we develop an adaptive pointing direction estimation method which adjusts the eye-fingertip line during operation.

2. System Model

We develop a vision-based interaction system using pointing gesture recognition as a node in the application layer of IoT. When the user points at the screen, the straight line from the eye to the fingertip determines the position on the screen where the cursor should be located. Figure 2 illustrates the flow chart of our system. First, 2D images are captured by a normal camera. Second, the user's eye is detected by an AdaBoost classifier based on Haar-like features [20]. Third, the hand region is segmented from the complex background using the edge repair-based hand subpart segmentation algorithm, and the fingertip of the index finger is detected by combining convex hull and convexity defect features [21]. If both the eye and the hand are located, the proposed adaptive pointing direction estimation method computes the pointing position from the eye-fingertip line. Finally, the cursor is moved to the pointing position. Consequently, the system is secure and authentic because it can only be activated when both the eye and the hand of the user are detected.

2.1. Methods of Eye Detection and Hand Segmentation

In our system, the face position is first coarsely determined by background subtraction and skin color detection. Then connected regions with large areas are extracted. In order to verify face from these regions, an ellipse model is employed to select the approximately elliptical candidates because of the face shape [22]. For each candidate, the AdaBoost algorithm based on Haar-like features is employed for eye detection. The AdaBoost classifier is trained by the positive samples of eye images and the negative samples of all kinds of background images without human eyes. A result of eye detection is shown in Figure 2.
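A minimal sketch of the eye-detection step is given below using OpenCV's pretrained Haar cascade. Note that the system described above trains its own AdaBoost classifier on positive eye samples and negative background samples; the pretrained haarcascade_eye.xml file and the parameter values here are only stand-ins.

```python
import cv2

# Pretrained Haar cascade shipped with OpenCV (stand-in for the trained AdaBoost classifier).
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_eye(gray_face_roi):
    """Return the (x, y) center of the largest eye detection in a grayscale face region, or None."""
    eyes = eye_cascade.detectMultiScale(gray_face_roi,
                                        scaleFactor=1.1,
                                        minNeighbors=5,
                                        minSize=(20, 20))
    if len(eyes) == 0:
        return None
    x, y, w, h = max(eyes, key=lambda r: r[2] * r[3])  # keep the largest detection
    return (x + w // 2, y + h // 2)
```

In practice the cascade is only evaluated inside the elliptical face candidates found by background subtraction and skin color detection, which keeps the search fast and reduces false positives.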

In order to segment hand regions efficiently, we propose an edge repair-based hand subpart segmentation algorithm which includes four procedures as illustrated in Figure 3.

Firstly, a hierarchical chamfer matching algorithm (HCMA) [23] is used to locate the whole hand region in the binary image produced by combining skin color detection and background subtraction. As shown in Figure 4, the chamfer distance image of the hand is searched for the optimal position which matches the hand template from the previous frame. After the distance image is traversed, the optimal position is determined by minimizing the edge distance

$$E_d = \frac{1}{N}\sum_{i=1}^{N} d_i, \qquad (1)$$

where $d_i$ is the pixel value of the distance image that the $i$th template contour point hits and $N$ is the number of contour pixels in the template. In order to accelerate the matching, a pyramid structure is built by halving the resolution of the distance image gradually. At the top level of the pyramid, a grid of positions is chosen to start the matching, and $E_d$ is computed for each position and its neighborhood. If a smaller edge distance is found, the template is moved to the new position in the distance image of the current level by

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = s \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}, \qquad (2)$$

where $(x, y)$ are the coordinates of the points in the template and $(t_x, t_y)$, $s$, and $\theta$ are translation, scaling, and rotation parameters, respectively. For each start position, the position with the local minimum is obtained and then used as the start position at the next, finer level. When all the levels have been traversed, the optimal position is finally identified, illustrated by the blue rectangle in Figure 4(b).
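The sketch below illustrates the core of chamfer matching under the assumptions above: the template contour is evaluated against a distance-transform image and the position with the smallest mean distance $E_d$ is kept. Only a translation search at a single pyramid level is shown; the full HCMA additionally searches over scale and rotation across the resolution pyramid.

```python
import cv2
import numpy as np

def edge_distance(dist_img, template_pts, offset):
    """Mean chamfer distance Ed for the template contour placed at the given (x, y) offset."""
    pts = (template_pts + np.asarray(offset)).astype(int)
    h, w = dist_img.shape
    pts[:, 0] = np.clip(pts[:, 0], 0, w - 1)
    pts[:, 1] = np.clip(pts[:, 1], 0, h - 1)
    return dist_img[pts[:, 1], pts[:, 0]].mean()

def match_translation(binary_hand, template_pts, step=8):
    """Coarse grid search for the offset minimizing Ed (single-level, translation only)."""
    edges = cv2.Canny(binary_hand, 50, 150)
    # Distance to the nearest edge pixel; edges must be zero-valued for distanceTransform.
    dist_img = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)
    best_ed, best_offset = np.inf, (0, 0)
    for y in range(0, binary_hand.shape[0], step):
        for x in range(0, binary_hand.shape[1], step):
            ed = edge_distance(dist_img, template_pts, (x, y))
            if ed < best_ed:
                best_ed, best_offset = ed, (x, y)
    return best_offset, best_ed
```

In the hierarchical version, the coarse grid above is only applied at the top (lowest-resolution) level, and each local minimum is refined at progressively finer levels.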

Secondly, the located hand region is analyzed to detect the palm and fingers separately by combining pictorial structures and Histogram of Oriented Gradients (HOG) features [24]. Figure 5(a) shows the hand model based on pictorial structures, which includes a root part for the palm and five finger parts. Thus the hand configuration is denoted by

$$L = \{l_0, l_1, \ldots, l_5\}, \qquad (3)$$

where the subscript 0 corresponds to the palm part and the subscripts 1-5 correspond to the finger parts. Given an image $I$ and a set of hand model parameters $\Theta = (u, E, c)$, the maximum a posteriori (MAP) probability of $L$ is represented by

$$p(L \mid I, \Theta) \propto p(I \mid L, u)\, p(L \mid E, c) = \prod_{i=0}^{5} p(I \mid l_i, u_i) \prod_{(i,j) \in E} p(l_i \mid l_j, c_{ij}), \qquad (4)$$

where $u_i$ is the appearance parameter of part $i$ and $c_{ij}$ is the connection parameter between part $i$ and part $j$. Due to the different characteristics of the palm and fingers in the model, two support vector machine (SVM) classifiers are employed to detect the subparts of the hand. On the one hand, the classifier for the palm part is trained on HOG features. On the other hand, the input feature vector of the classifier for a finger part considers both HOG features and the spatial relationship between the finger and the palm. Let the state of a finger part be $l_i = (x_i, y_i, \theta_i)$. Assuming $(x_i, y_i)$ and $(x_0, y_0)$ are the coordinates of the finger joint and the palm center, the relative position is computed by

$$(\Delta x_i, \Delta y_i) = \left(\frac{x_i - x_0}{s_0},\ \frac{y_i - y_0}{s_0}\right), \qquad (5)$$

where $s_0$ is the size of the palm part and $\theta_i$ is the absolute part orientation as shown in Figure 5(a). Figure 5(b) shows the detection result for the image in Figure 4.
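A minimal sketch of the finger classifier's feature construction is given below: a HOG descriptor of the candidate finger window is augmented with the finger-to-palm relative position of (5) and the absolute orientation. The window size, HOG parameters, and the trained classifier are assumptions, not the paper's exact settings.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def finger_feature(patch, joint_xy, palm_xy, palm_size, orientation):
    """HOG of a grayscale finger window plus its spatial relation to the palm (cf. eq. (5))."""
    h = hog(patch, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    rel = (np.asarray(joint_xy, dtype=float) - np.asarray(palm_xy, dtype=float)) / float(palm_size)
    return np.concatenate([h, rel, [orientation]])

# Offline training (sketch): rows of X_finger are finger_feature vectors, y_finger in {0, 1}.
# finger_svm = LinearSVC().fit(X_finger, y_finger)
# At run time, each candidate finger placement is scored with
# finger_svm.decision_function([finger_feature(...)]), and the palm-consistent
# placements with the highest scores are kept as the five finger parts.
```

The palm classifier is the same construction without the relative-position terms, trained on HOG features of palm windows only.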

Thirdly, since the blurry border between the hand and the face probably leads to an incompletely connected hand silhouette, we propose an edge repair method to recover the contour of each subpart. The edge image of each subpart is extracted to detect where the contour breaks. An edge point is determined to be a breakpoint if one or two adjacent points among its eight-neighborhood are edge points. For each breakpoint $p_i$, the contour point $q_i$ in the prestored template that is closest to $p_i$ in Euclidean distance is found. As shown in Figure 6(a), in order to connect adjacent breakpoints according to the template, the template contour is divided into several subsegments by the matched points $q_i$, and a set of corresponding breakpoint pairs is thereby generated. Then each pair of adjacent breakpoints is connected by the Catmull–Rom interpolation method [25]. Figure 6(b) shows the result of edge repair, where the connections of breakpoints are illustrated in red.
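A minimal sketch of bridging two adjacent breakpoints with a uniform Catmull–Rom segment is shown below; here p1 and p2 are the breakpoints to connect, while p0 and p3 are neighbouring contour points (taken from the template subsegment) that supply the tangents. The choice of neighbours and the sample count are assumptions.

```python
import numpy as np

def catmull_rom(p0, p1, p2, p3, n=10):
    """Return n points of the uniform Catmull-Rom curve interpolating from p1 to p2."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    t = np.linspace(0.0, 1.0, n)[:, None]
    return 0.5 * ((2 * p1) +
                  (-p0 + p2) * t +
                  (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2 +
                  (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)
```

The generated points are drawn into the edge image of the subpart, closing the contour so the subsequent mask extraction yields a connected region.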

Finally, the repaired edge images of all the subparts are used to extract the refined hand pixels from the coarse binary image as shown in Figure 6(c). All images of hand subparts are combined to generate the whole hand region in Figure 6(d).

2.2. Cursor Positioning

When both the eye and the fingertip are detected, the position on the screen that the user points to can be obtained from the line extending from the dominant eye through the fingertip. Since the depth information of the eye and fingertip is unavailable and pointing habits differ among users, an offset exists between the target position and the calculated position. Thus, we propose an adaptive pointing direction estimation method which adjusts the eye-fingertip line through a learning process.

As shown in Figure 7, a three-dimensional coordinate system with the camera position as the origin is established. The intersection point $(x_p, y_p)$ of the screen and the eye-fingertip line can be calculated through the similar triangle theory as

$$(x_p, y_p) = \left(\frac{x_f z_e - x_e z_f}{z_e - z_f},\ \frac{y_f z_e - y_e z_f}{z_e - z_f}\right), \qquad (6)$$

where $(x_e, y_e, z_e)$ and $(x_f, y_f, z_f)$ represent the eye's coordinates and the fingertip's coordinates in the 3D coordinate system, respectively. Since the relationship of these coordinates in the 3D coordinate system is similar to that in the captured image (Figure 7(c)), $(x_e, y_e)$ can be computed by

$$(x_e, y_e) = \left(\frac{u_e - u_0}{u_0}\, z_e \tan\frac{\alpha}{2},\ \frac{v_e - v_0}{v_0}\, z_e \tan\frac{\beta}{2}\right), \qquad (7)$$

where $(u_e, v_e)$ are the coordinates of the human eye in the image, $\alpha$ and $\beta$ are the camera's angles of view in the horizontal and vertical directions, and $(u_0, v_0)$ is half of the spatial resolution of the image. $(x_f, y_f)$ can be figured out by a similar equation to (7).
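The sketch below works through this geometry under the stated assumptions: the eye and fingertip are back-projected from image coordinates given assumed depths, and the eye-fingertip ray is intersected with the screen plane $z = 0$ (camera at the origin, screen assumed coplanar with the camera). Function names and parameters are illustrative only.

```python
import math

def back_project(u, v, depth, img_w, img_h, fov_h, fov_v):
    """Image pixel (u, v) at the given depth -> camera-frame (x, y, z); FOV angles in radians."""
    u0, v0 = img_w / 2.0, img_h / 2.0
    x = (u - u0) / u0 * depth * math.tan(fov_h / 2.0)
    y = (v - v0) / v0 * depth * math.tan(fov_v / 2.0)
    return x, y, depth

def screen_intersection(eye, fingertip):
    """Intersection of the eye->fingertip line with the screen plane z = 0."""
    xe, ye, ze = eye
    xf, yf, zf = fingertip
    t = ze / (ze - zf)                     # similar-triangle ratio along the ray
    return xe + t * (xf - xe), ye + t * (yf - ye)

# Example with assumed numbers: eye 60 cm and fingertip 35 cm from the screen.
eye = back_project(320, 200, 60.0, 640, 480, math.radians(60), math.radians(45))
tip = back_project(400, 260, 35.0, 640, 480, math.radians(60), math.radians(45))
print(screen_intersection(eye, tip))       # physical (x, y) on the screen plane, in cm
```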

Then the cursor's coordinates $(u_s, v_s)$ are computed by transforming $(x_p, y_p)$ into screen coordinates in pixels as

$$(u_s, v_s) = \left(\frac{x_p}{W} R_w,\ \frac{y_p}{H} R_h\right), \qquad (8)$$

where $(R_w, R_h)$ is the spatial resolution of the screen and $W$ and $H$ are the width and height of the screen.

It can be seen that the cursor's position is closely related to the distance $z_f$ between the fingertip and the screen and the distance $z_e$ between the eye and the screen. However, these two distances cannot be accurately obtained from our monocular vision-based system, and the inaccurate distances lead to an offset between the calculated position and the target position of the cursor. Hence, we propose a method to adjust the cursor's position through a learning process. Firstly, $z_e$ is estimated according to the face's area and $z_f$ is initialized to $z_e$ minus 30 cm based on users' habits. Secondly, when the cursor is not located at the desired position, the user is allowed to alter the cursor's position by moving the fingertip slightly. The cursor's coordinates are adjusted from $(u_s, v_s)$ to $(u_s', v_s')$ by

$$(u_s', v_s') = (u_s + k_u \Delta u_f,\ v_s + k_v \Delta v_f), \qquad (9)$$

where $(\Delta u_f, \Delta v_f)$ represents the moving distance of the fingertip in successive frames and $(k_u, k_v)$ is the multiple coefficient. Our system monitors the fingertip's movement and records the corrected cursor position $(u_s', v_s')$ when the fingertip moves a short distance after a pause. Then $z_f$ can be adjusted by (10) from $z_f$ to $z_f'$:

$$z_f' = (1 - \alpha)\, z_f + \alpha\, \hat{z}_f, \qquad (10)$$

where $\alpha$ is the update rate and $\hat{z}_f$ is estimated from the corrected cursor position $(u_s', v_s')$ according to the derivation of (6)–(9). After the adjusting process has run several times, $z_f$ approaches its real value.
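A minimal sketch of the adjustment loop is given below, assuming the forms of (9) and (10) above: the cursor follows small fingertip movements scaled by $(k_u, k_v)$, and the fingertip-screen distance estimate is relaxed toward the value implied by the corrected cursor position. The helper solve_zf, which inverts (6)-(8) for the depth, is hypothetical.

```python
def nudge_cursor(us, vs, duf, dvf, ku=3.0, kv=3.0):
    """Shift the cursor by the fingertip motion (duf, dvf) scaled by the multiple coefficient."""
    return us + ku * duf, vs + kv * dvf

def update_distance(zf, zf_observed, alpha=0.2):
    """Exponential update of the fingertip-screen distance estimate (eq. (10) form)."""
    return (1.0 - alpha) * zf + alpha * zf_observed

# Typical use after the user pauses on the corrected position (sketch; solve_zf is hypothetical):
# zf = update_distance(zf, solve_zf(corrected_cursor, eye_px, fingertip_px, ze))
```

With a small update rate, occasional bad corrections are averaged out, and the distance estimate converges over a few interactions rather than jumping on a single observation.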

2.3. Feasibility Verification of Our System

In order to verify whether our cursor positioning system is feasible, we perform an error analysis to evaluate how much the cursor error depends on the eye and hand positions along the z-axis and on the eye and hand positions in the image. Because of the lack of depth information, the real locations of the fingertip and the eye along the z-axis, denoted by $z_f$ and $z_e$, cannot be obtained; hence, estimated values of $z_f$ and $z_e$ are used in our pointing direction estimation method. Generally, offsets $\Delta z_f$ and $\Delta z_e$ exist between the real values and the estimated values, which lead to errors in the cursor position. Similarly, assuming only the x-z plane is considered, offsets $\Delta u_f$ and $\Delta u_e$ possibly exist due to the detection deviations of the fingertip's and eye's coordinates in the image. Note that the adjusting process of the cursor position is not activated here, in order to analyze the error.

As shown in Figure 8, it is assumed that an offset $\Delta z_f$ exists and the eye is fixed on the z-axis; the cursor's x-coordinate is then computed by

$$\hat{x}_s = \frac{x_f z_e}{z_e - (z_f + \Delta z_f)}. \qquad (11)$$

Then the cursor error $\Delta x_s$ is calculated by subtracting the deviation value from the real value as

$$\Delta x_s = \hat{x}_s - x_s = \frac{x_f z_e \Delta z_f}{(z_e - z_f - \Delta z_f)(z_e - z_f)}, \qquad (12)$$

where $x_s = x_f z_e / (z_e - z_f)$. $x_f$ can be computed by using the similar triangle principle in Figure 8, i.e., $x_f = x_s (z_e - z_f)/z_e$. Therefore,

$$\Delta x_s = \frac{x_s\, \Delta z_f}{z_e - z_f - \Delta z_f}. \qquad (13)$$
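The following small numeric check illustrates this sensitivity under assumed numbers (it simulates the geometry directly rather than reproducing the closed form above): the fingertip depth is perturbed by a given offset and the resulting shift of the eye-fingertip line's screen intersection is measured.

```python
def intersect_x(eye, fingertip):
    """x-coordinate where the eye->fingertip line meets the screen plane z = 0."""
    (xe, _, ze), (xf, _, zf) = eye, fingertip
    t = ze / (ze - zf)                     # similar-triangle ratio
    return xe + t * (xf - xe)

def cursor_error_for_depth_offset(eye, fingertip, dz):
    """Cursor error caused by a depth estimation offset dz of the fingertip."""
    xf, yf, zf = fingertip
    return abs(intersect_x(eye, (xf, yf, zf + dz)) - intersect_x(eye, fingertip))

# Assumed example: eye 60 cm and fingertip 30 cm from the screen, fingertip 10 cm
# off-axis, 3 cm depth error -> roughly a 2.2 cm cursor error.
print(cursor_error_for_depth_offset((0.0, 0.0, 60.0), (10.0, 0.0, 30.0), 3.0))
```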

Assuming the width of the screen is 31 cm, Figure 9 shows the cursor errors with different $z_e$ and $z_f$ when the user points to the edge of the screen and to a quarter of the screen. The cursor error becomes larger when the user points at positions closer to the screen edge, and it increases as $z_e$ decreases and as the hand gets closer to the head. Moreover, the cursor error is acceptable when operating the laptop at a close distance of less than 100 cm, and the error is smallest when the hand is located at the midpoint between the screen and the eye.

Figure 10 illustrates how much the cursor error depends on the four offsets $\Delta z_e$, $\Delta z_f$, $\Delta u_e$, and $\Delta u_f$, using a derivation similar to (13), where the two distances from the screen, $z_e$ and $z_f$, are set according to users' operation habits. Note that the user is assumed to point to the edge of the screen, which causes the maximum error over the entire screen. As shown in Figure 10, the maximum cursor errors fall below 3 cm and 1.5 cm when the offsets of the eye and hand positions along the z-axis are less than 3 cm, and they fall below 3 cm when the offsets of the eye and hand positions in the image are less than 9 pixels. This precision is acceptable for block-level positioning, and the following experiments demonstrate what is attainable when the adjusting process is activated.

3. Experimental Results

Our proposed system was evaluated on a laptop with a 2D camera, and the resolutions of the screen and the captured image were and , respectively. In addition, in accordance with the proposed methods described above, users were asked not to sway back and forth during operation, since such movement would affect the adjusting process of cursor positioning. Ten subjects were asked to operate the system by pointing gestures.

Firstly, some segmentation results of the edge repair-based hand subpart segmentation algorithm are shown in Figure 11, where yellow lines indicate the hand regions. The results demonstrate that our method can extract hand pixels accurately against various complex backgrounds, including in the presence of human faces with similar color. Moreover, the method is user-independent and robust to various hand appearances.

Then, in order to evaluate the accuracy of cursor positioning, the computer screen was divided into several blocks as shown in Figure 12, and the subjects repeatedly pointed to the four blocks marked "1" to "4". Each block was pointed to 15 times, forming one set of data.

Figure 12 shows the qualitative experimental results of cursor positioning by different subjects under different backgrounds. In each row, the same subject points to different positions on the screen, and the relative positions between the fingertip and the eye appear different. The cursors highlighted by the black circles are successfully located at the corresponding marked blocks, which demonstrates that our cursor positioning method is robust to diverse individuals and situations. The results also show that our hand segmentation method works well when the hand overlaps the face.

Let the side length of the blocks be 1. When a subject points to a block, the cursor error between the calculated position and the desired position is computed by the Euclidean distance. Figure 13 shows the errors of several sets of data, where the horizontal axis represents the number of repetitions of pointing to the blocks. The error decreases significantly after several repetitions owing to the adjusting process in the adaptive pointing direction estimation method. Moreover, Table 1 shows the average errors of cursor positioning for the four marked blocks, expressed in physical units using the block side length $\sqrt{S_b}$, where $S_b$ represents the area of a block. Because the size of a tile icon in the Windows 8/10 Metro interface can be set to 3 cm or 6 cm, it can be concluded that the positioning errors are small and acceptable for our application requirement.

Besides, our system runs in real time, with an average processing time of 131 ms per frame.

4. Conclusion

A secure and smart Internet of Things interaction system based on hand gesture recognition is proposed in this work. When a user points at the screen, the target position is estimated from the straight line from the user's eye to the fingertip; therefore, the interaction between human and computer can only be activated when both the eye and the hand are detected. In our system, we employ a novel hand segmentation algorithm which combines the pictorial structure model, the hierarchical chamfer matching algorithm, and curve fitting to segment hand regions accurately and efficiently. Furthermore, we propose an adaptive pointing direction estimation method for cursor calibration, in which an adjusting process corrects the offsets between the target position and the calculated position arising from individual differences and the lack of depth information. Experimental results show that our system provides natural and friendly human-computer interaction and achieves satisfactory cursor positioning accuracy under complex backgrounds.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Perspective Joint Research Project of Jiangsu Province Technology Project (BY2016076-07).