Abstract

This paper presents a vision-based approach for hand gesture recognition that combines trajectory and hand posture recognition. The hand area is segmented from cluttered and moving backgrounds by fixed-range CbCr thresholding and tracked with a Kalman Filter. With the tracking results of two calibrated cameras, the 3D hand motion trajectory can be reconstructed. The trajectory is then modeled by dynamic movement primitives, and a support vector machine is trained for trajectory recognition. The scale-invariant feature transform is employed to extract features on segmented hand postures, and a novel strategy for hand posture recognition is proposed. A gesture vector that combines the recognition results of the motion trajectory and the hand postures is introduced to recognize a hand gesture as a whole, and a support vector machine is trained for gesture recognition based on gesture vectors.

1. Introduction

It was shown that nearly 90% of daily communication is nonverbal [1]. Hand gesture has always been a powerful communication tool in people's daily life. As technology develops, hand gesture recognition is becoming an important component of innovative applications such as human-computer interfaces, robotic tele-control, and sign language interpretation. In this paper, hand gesture refers to the hand motion trajectory, and hand posture stands for the hand shape and appearance.

Decades ago, the interaction between humans and computers was through command-line interfaces and keyboard entries. Through technological advancements, other user interface devices have been introduced (such as the computer mouse) that offer alternatives to the traditional design of human-computer interfaces. In addition, it was shown that 65% of human communication involves nonverbal gestures [2]. For hearing-impaired persons, this percentage rises to nearly 100%. Hand gestures, as one of the most dominant communication tools in daily life, therefore offer an enriched mode of communication with abundant coded messages. Developing a hand gesture interface not only offers a more convenient approach for the hearing impaired but also enhances and extends the existing modes of human-computer interaction.

Hand gesture recognition can be divided into two main categories: device-based and ambient-based. Device-based hand gesture recognition requires the user to wear devices such as gloves, markers, or other tools in order to acquire hand or arm joint angles and their spatial positions [3–6]. A data glove modeling a 3D hand is designed in [7]; this glove measures the angles of finger bending using analog flex sensors. A multicolored glove is adopted in [5] to reconstruct the hand pose based on its color pattern. Due to current advancements in sensor design, device-based hand gesture recognition collects relatively accurate information about hand gestures. In addition, such an approach is robust to illumination changes, which is the main drawback of many vision-based hand gesture recognition methods.

For ambient-based hand gesture recognition, sensors capture images of the scene, and the information needed for determining hand movements and appearances is extracted from them. For RGB sensors, gesture information mainly depends on hand color or texture [8–12]. A drawback of RGB-based systems is their sensitivity to illumination variations; therefore, color spaces such as HSV [13], YCbCr [14], CIE Lab, or CIE Luv can be utilized. For classification, methods such as the Bayesian classifier with the histogram technique [15] and Gaussian classifiers [16, 17] are usually introduced. There are other methods for hand gesture recognition. For example, to address some of the limitations of RGB imaging, depth sensors were introduced to capture hand motion [18–20]. The output of a depth sensor can be encoded as a gray-scale image, where the intensities correspond to the distances between the objects and the camera. For hand gesture recognition, since the hand is usually the closest moving object to the camera, a suitable threshold can be defined on the gray-scale images to eliminate background noise. An IR camera combined with retroreflective markers has also been used to determine the position and posture of the hand in the view of a single camera [21]. In this paper, two calibrated RGB cameras are employed to record hand gestures. These cameras are utilized under a steady lighting condition. Due to the exposure compensation of the cameras, the YCbCr color space is adopted for hand area segmentation.

Hands, as one of the most dexterous parts of the human body, have 27 degrees of freedom and can take on a large variety of shapes and appearances. To extract the hand region from the rest of the image, color cues and motion cues are most often used. Skin color is usually more distinctive and less sensitive to illumination changes in the hue-saturation space than in the RGB color space [9]. Most color segmentation approaches rely on histogram matching [23, 24]. However, the color cue is not robust to illumination variation and frequently results in undetected skin regions or falsely detected nonskin areas. To alleviate this problem, assumptions such as area size (scale filter) or certain spatial positions (position filter) are adopted. Another solution is to let users wear gloves with distinctive colors [5] or special markers (LED lights [3, 25], fluorescent material [26]), or to clean the background so that it produces little noise [12]. These methods are robust to illumination variation but defeat the purpose of freeing the hand from gloves. Motion cues are usually used as one of the main components for segmenting moving objects such as hands or arms in image frames; they can also be used to segment hand gestures from a stationary background [11, 27, 28].

Feature extraction is very important for posture recognition. The simplest and most frequently used feature is the hand silhouette, which can be easily extracted. Contours are another group of commonly used features; several different edge detection schemes can be used to produce them [9]. Contours are often employed with a 3D hand model built from the hand shape and structure, and hand posture can be recognized by comparing the similarities between detected contours and contours generated from the hand model [29, 30]. In [29], the authors build a 3D hand model with 27 degrees of freedom to model the articulation of the hand, and hand gesture recognition is done by comparing the contours generated from the hand model with the input hand images. Another frequently used feature in posture recognition is the fingertip. Postures can be recognized based on the positions of the five fingertips, extracted either by markers (LED lights [3, 25] or distinct colors) or by the convex hull of the silhouette [31]. There are also other feature detectors that can be applied to posture recognition, such as the scale-invariant feature transform (SIFT), which is insensitive to illumination variations and to scale and orientation changes [32–35]; Haar-like features, which transform hand postures into a coefficient vector in the Haar wavelet transform [36]; and the orientation histogram [37].

A hand gesture is represented in four aspects: hand shape, position, orientation, and movement [38]. The same semantic paths are usually made with different scales, speeds, and shapes due to individual differences. As a statistical model, the hidden Markov model (HMM) has been found efficient in modeling spatiotemporal time series in which the same gesture has different shape and duration [39, 40]. Other feature extraction methods such as Gaussian mixture models and principal component analysis [41, 42] can be used to enhance the HMM recognition process. The finite state machine (FSM) is similar to the HMM in that it models hand movement as an ordered sequence of states in a spatiotemporal configuration space [8, 43]. Dynamic movement primitives (DMP), proposed in [44], are employed in [43] for 2D trajectory recognition and achieve an impressive accuracy of 98.06%. DMP encodes gesture paths into weight vectors that preserve the topological structures of the paths. The benefits of DMP are that it is (a) robust to spatiotemporal variations in gesture paths and (b) easy to adjust the dimension of the weight vector to the complexity of the gesture paths in order to adapt to different applications.

Hand posture recognition is another key part of gesture recognition. Template matching is a simple method of posture recognition, and it is easy to add or remove template classes. To extract features on hand postures, a convex hull on the silhouette [31] or fingertip detection using a circular mask as a correlation technique [45] can be employed. However, these recognition approaches based on hand silhouette or contour usually require a clean background where the hand can be well segmented. In our case, hand postures are made in a cluttered and moving background and segmented using the YCbCr color space, so sometimes the hand cannot be well segmented from the background; in such cases, the hand can be treated as partially occluded. Therefore, a feature detector that is insensitive to illumination variation and partial occlusion is needed. The scale-invariant feature transform (SIFT) is such a feature detector and descriptor, and it is also robust to scale and orientation changes. In addition, it is robust to affine distortion within some range, which benefits posture recognition because the relative position between the hand and the cameras keeps changing and causes affine distortion between the input hand postures and the posture templates. In our work, SIFT is used as the feature detector for posture recognition. Combining the recognition results of the gesture path and the hand postures, a gesture vector is proposed for gesture recognition [46].

In this paper, hand gesture refers to the hand motion trajectory together with the hand posture, which stands for the hand shape and appearance. Hand movement is also called the gesture path. Two calibrated cameras are employed to record hand gestures under a steady lighting condition, and due to the exposure compensation of the cameras, the YCbCr color space is adopted for hand area segmentation. This paper focuses only on hand shape and movement, since hand position and orientation involve body context, which is not considered in our work.

2. Preliminaries

This section presents preliminary analysis and results which are obtained pertaining to the main contributions of the paper. Two preliminaries, 3D coordinate reconstruction and hand segmentation, are introduced. 3D coordinate reconstruction method of a point in world space using two calibrated cameras is presented in Section 2.1. Section 2.2 introduces and experimentally compares several approaches of skin color segmentation and the scheme of extracting hand area from the background.

2.1. Camera Calibration

Two calibrated cameras are employed to capture hand motion in their overlapping field of view in order to reconstruct the 3D hand motion trajectory. The setup of the two cameras is shown in Figure 1, and the relationships between the camera coordinate systems and the world coordinate system are marked in Figure 2. Based on the pinhole camera model, the relationship between a point in world coordinates and its projection on the image plane of the first camera is given in (1); the corresponding relationship for the second camera is given in (2).

The intrinsic matrices of the two cameras are calculated separately with the Camera Calibration Toolbox for MATLAB [47]. The extrinsic matrices represent the rotation and translation between the world coordinate system and each of the two camera coordinate systems.

The objective of this section is to reconstruct the location of a point in world space from its projections on the image planes of the two calibrated cameras. Substituting the two projections into (1) and (2) yields four equations in the three unknown world coordinates, so the coordinates of the point can be solved in a least-squares sense. Based on the geometry of our experimental setup, the specific values of the two intrinsic matrices and of the rotation and translation matrices of both cameras are computed from the calibration.
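To illustrate this reconstruction step, the following sketch assembles the four linear equations coming from the two projections and solves them by least squares. The projection matrices P1 and P2 (intrinsics multiplied by extrinsics) and the function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def triangulate_point(P1, P2, uv1, uv2):
    """Reconstruct a 3D world point from its projections in two calibrated views.

    P1, P2 : 3x4 projection matrices of the two cameras (assumed known from calibration).
    uv1, uv2 : (u, v) pixel coordinates of the point in each image.

    Each view contributes two linear equations in the unknown world coordinates,
    giving an overdetermined system of 4 equations in 3 unknowns.
    """
    A = []
    for P, (u, v) in ((P1, uv1), (P2, uv2)):
        A.append(u * P[2] - P[0])   # u * (third row) - first row
        A.append(v * P[2] - P[1])   # v * (third row) - second row
    A = np.asarray(A)               # shape (4, 4): rows act on [X, Y, Z, 1]
    M, b = A[:, :3], -A[:, 3]       # 4 equations, 3 unknowns
    X, *_ = np.linalg.lstsq(M, b, rcond=None)
    return X                        # least-squares estimate of [X, Y, Z]
```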

For the experimental evaluation, a box made by transparent plastics is placed in the overlapping view of both cameras. The dimensions of the box are marked in Figure 3. Figure 4 shows the two different views captured by the cameras.

Table 1 lists the coordinates of the eight vertices of the box in both image planes and the reconstructed 3D coordinates in the physical world. The ground truths of the eight box corners are also listed in this table. To eliminate any displacement error between coordinate systems, the relative locations between the eight corner points of the box are calculated and compared with their ground truths. The results are listed in Table 2, which also shows the relative errors between the values.

2.2. Hand Segmentation

An efficient hand segmentation method is key to successful visual tracking and subsequent posture recognition. There is a large variety of hand appearances across different postures, angles, and orientations. The color cue is an efficient tool for identifying the hand against the background. However, segmenting the hand from a cluttered background is very challenging [48], because skin color differs between people and can also change under different illumination. In this section, a suitable color space for representing skin color is explored, and a position and size constraint is added to locate the hand area.

Given a background whose color is uniform or distinct from the hand area, the hand can be segmented by thresholding the background color. For a cluttered background, multiple colors are present inside the camera view. Human skin has relatively consistent colors that are distinct from the colors of many objects [49]; therefore, skin color can be an essential cue for separating the hand area from the background. A suitable color space and a classification algorithm are essential for successful and efficient skin segmentation. Skin segmentation has adopted many color spaces in previous research [13–15]. The Red-Green-Blue (RGB) color space is sensitive to illumination variations and is therefore less efficient for hand segmentation. The Hue-Saturation-Value (HSV) [13] and YCbCr [14] color spaces are more robust than RGB and are thus widely used for skin segmentation under various lighting conditions.

Different classification algorithms exist for skin segmentation, such as piecewise linear classifiers [14, 50] and the Bayesian classifier with the histogram technique [15]. An analysis and comparison of skin segmentation using color pixel classification is carried out in [49], demonstrating the performance of different skin color representations and classifiers; the classifiers based on Bayesian RGB, a 3D Gaussian mixture in RGB, and fixed-range CbCr were all shown to obtain good performance on a common data set [49]. This comparison was reproduced on our experimental setup with the same three skin color classifiers. The Bayesian RGB classifier and the 3D RGB Gaussian mixture classifier are trained using the methods and data reported in [15]. The fixed-range CbCr classifier (fixed thresholds on the Cb and Cr components) is obtained using the method reported in [14]. Figure 5(a) shows a typical frame in a gesture video. The skin segmentation results using the Bayesian RGB, 3D RGB Gaussian mixture, and fixed-range CbCr classifiers are shown in Figures 5(b), 5(c), and 5(d), respectively. As the segmentations in Figure 5 show, the fixed-range CbCr classifier gave the best result among the three classifiers.

We evaluated the fixed-range CbCr classifier on several other sample gesture frames taken with different skin colors, lighting conditions, and backgrounds. Table 3 shows the segmentation results. In our study, room light stands for the existing fluorescent illumination in our laboratory (300–500 lux, measured using a standard light meter (Reed LM-81LX) held vertically at the position of the hand). Strong light stands for the existing room light plus an additional standard LED desk lamp pointing toward the subject (500–800 lux, measured in the same way). For the remainder of this study, the fixed-range CbCr skin classifier is adopted for segmenting the hand in a cluttered and moving background. Similar to other segmentation methods, the fixed-range CbCr method can still leave some background noise (the blobs shown in the segmentation results in Table 3). For our study, a size constraint on segmented blobs is introduced (3000 pixels), and all noise blobs smaller than this threshold are eliminated.
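A minimal OpenCV sketch of this segmentation step is given below. The Cb/Cr bounds are commonly used placeholder values, not necessarily the exact thresholds from [14] used in the paper; the 3000-pixel size constraint follows the text above.

```python
import cv2
import numpy as np

# Placeholder fixed CbCr range (commonly used values); the paper's exact bounds follow [14].
CB_RANGE = (77, 127)
CR_RANGE = (133, 173)
MIN_BLOB_AREA = 3000  # size constraint: blobs smaller than this are treated as noise

def segment_hand(frame_bgr):
    """Fixed-range CbCr skin segmentation with a blob-size filter (sketch)."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)   # OpenCV channel order: Y, Cr, Cb
    cr, cb = ycrcb[:, :, 1], ycrcb[:, :, 2]
    mask = ((cb >= CB_RANGE[0]) & (cb <= CB_RANGE[1]) &
            (cr >= CR_RANGE[0]) & (cr <= CR_RANGE[1])).astype(np.uint8)
    # Keep only connected components larger than the size threshold.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    out = np.zeros_like(mask)
    for i in range(1, n):                                   # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= MIN_BLOB_AREA:
            out[labels == i] = 255
    return out
```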

3. Hand Tracking and Trajectory Reconstruction

In the previous section, hand blobs were segmented; in some nonideal cases, other skin areas such as the face and neck are also included in the segmentation. Because the hand area and these other skin areas share various attributes, such as shape and color, a scheme involving a position constraint is adopted to separate hand blobs from other skin blobs. A Kalman Filter (KF) is employed to track the hand and reduce the influence of other skin areas on posture and trajectory recognition. Based on the tracking results, the trajectories of the moving hands can then be reconstructed.

3.1. Application of Kalman Filter

The Kalman Filter (KF) [51] has been extensively used in the computer vision community for object tracking [52]. Here, a simplified representation of the state of the hand at frame $k$ is defined as the position and velocity of the centroid of the hand blob, $\mathbf{x}_k = [u_k, v_k, \dot{u}_k, \dot{v}_k]^T$, where $u_k$ and $v_k$ are the pixel coordinates of the centroid of the hand blob, computed from the yellow bounding box in Figure 6. The bounding box is generated from the utmost points of the hand blob in four directions (up, down, left, and right), and its center is marked by a red cross. $\dot{u}_k$ and $\dot{v}_k$ indicate the velocity of the hand blob in the $u$ and $v$ directions, respectively, at frame $k$. The implementation of the filter consists of two steps, the prediction step (update) and the correction step (measurement). For each video frame, the hand location is predicted from previous frames, and the estimate is then corrected according to the measurement. Figure 7 displays the conceptual diagram of the KF iterations. Since the KF is a recursive estimator, the computation of the estimated state for the next time step requires only the estimated state and the measurement of the current time step.

With a constant velocity motion model, the state transition matrix can be defined as
$$A = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},$$
which models the relationship between the current state $\mathbf{x}_{k-1}$ and the next state $\mathbf{x}_k$. The measurement matrix
$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$
maps the state into the observation vector $\mathbf{z}_k$. The KF motion model in this system is simplified to a constant velocity model, and the transition and measurement models remain the same during the whole tracking process. In the prediction step, the prior state estimate and the prior error covariance are calculated as
$$\hat{\mathbf{x}}_k^- = A\,\hat{\mathbf{x}}_{k-1}, \qquad P_k^- = A P_{k-1} A^T + Q,$$
where $Q$ is the covariance of the Gaussian process noise.

The actual hand location needs to be measured to correct the prior state estimate $\hat{\mathbf{x}}_k^-$ and error covariance $P_k^-$. The measurement follows the constant velocity assumption as
$$\mathbf{z}_k = H \mathbf{x}_k + \mathbf{v}_k,$$
where $\mathbf{v}_k$ is the Gaussian measurement noise with covariance $R$. The posterior state and error covariance are obtained from the correction equations
$$K_k = P_k^- H^T \bigl(H P_k^- H^T + R\bigr)^{-1}, \qquad \hat{\mathbf{x}}_k = \hat{\mathbf{x}}_k^- + K_k\bigl(\mathbf{z}_k - H\hat{\mathbf{x}}_k^-\bigr), \qquad P_k = (I - K_k H) P_k^-,$$
where $K_k$ is the gain matrix. In addition, to constrain the tracking state within each frame, a saturation value is defined between frames. This added metric, defined as a function of the area of the blob, further increases the robustness of the tracking results.
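The following sketch implements this constant-velocity predict/correct loop in NumPy. The state ordering and the noise covariances are illustrative choices, not values reported in the paper.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter for the hand-blob centroid.

    State x = [u, v, du, dv]^T (pixel position and velocity); measurement
    z = [u, v]^T is the bounding-box center. Q and R are illustrative values.
    """
    def __init__(self, dt=1.0, q=1e-2, r=1.0):
        self.A = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # state transition matrix
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # measurement matrix
        self.Q = q * np.eye(4)                            # process noise covariance
        self.R = r * np.eye(2)                            # measurement noise covariance
        self.x = np.zeros(4)                              # state estimate
        self.P = np.eye(4)                                # error covariance

    def predict(self):
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x[:2]                                 # predicted centroid

    def correct(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                                 # corrected centroid
```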

3.2. Experimental Study

In this section, experimental results for one-hand tracking at different speeds and conditions are presented (results for tracking two hands can be found in [53]). Figure 8 shows the setup of the two cameras together with example frames taken by each camera. The frames taken by the two cameras are referred to as the front view and the side view, respectively, for the rest of the paper.

Figure 9 shows the results of multicamera tracking of a single hand in the presence of minimal background noise. For a case that involves background noise, Figure 10 shows an example tracking result. At initialization, the hand blob is successfully located and tracked while it does not overlap with any background noise area (Figure 10(a)). In the next frame (Figure 10(e)), the hand overlaps with a noise color blob. Instead of the previous frame's bounding box being expanded to include the whole merged color blob, the predicted tracking result of the KF is adopted to place the bounding box at the new hand location. In the following frame, the overlap is over and the bounding box is updated with the size and location of the hand blob. The KF tracking performs well in such cases, but if the hand area overlaps with background noise for a longer period, the bounding box associated with the tracking result drifts away according to the predicted hand speed. Figure 11 shows an example: the hand blob is tracked in Figures 11(a), 11(b), and 11(c), but when the hand overlaps with the background noise area for a long time, the system loses track of the hand.

4. Trajectory Reconstruction and Smoothing

In Section 2, it was shown that the 3D coordinates of a point inside the overlapping area of the camera views can be reconstructed from the pixel values of its projections on the image planes. At each time instant, given the definition of the tracking state, the hand center can be extracted from both camera views. Substituting these two hand centers into (1) and (2), the hand coordinates in world space can be reconstructed for each time instant. Figure 12 shows examples of reconstructed hand trajectories for single-hand and two-hand movements.

4.1. Trajectory Smoothing

The reconstructed trajectory shown in Figure 12 is in general not smooth. For example, in Figure 13(a), the trajectory starts from the green marker point, follows the blue arrows, and ends at the red marker point. The fluctuation along the trajectory can be due to variation in the size of the bounding box; for instance, if the hand area is occluded by other skin-like areas, the center of the segmented color blob does not represent the true position of the hand center. To eliminate the effects of such fluctuation, each trajectory is represented by three spatiotemporal components along the three world coordinate directions. Figure 13(b) shows the projections of the circle trajectory over time onto these three directions. Two smoothing methods are then implemented and compared on the trajectory, namely, locally weighted scatterplot smoothing (LOESS) [54] and its robust version (RLOESS). LOESS is locally weighted scatterplot smoothing using least-squares quadratic polynomial fitting; RLOESS is a robust version of LOESS that assigns lower weight to outliers in the regression.

Figure 14 compares the results of these two methods. In this figure, the span is the fraction of the total number of data points used in each local fit (a value between 0 and 1); for example, a span of 0.1 means that 10% of the data points are included in each local smoothing computation. The red dots in Figure 14 represent the original data points of one component of the reconstructed 3D circle trajectory, and the blue curve represents the smoothed data. If the span is set too large (Figures 14(b) and 14(d)), the smoothed curve does not fit the original data points well; if the span is set too small, the effect of outliers in the original data is not completely eliminated (Figure 14(a)). In conclusion, RLOESS is superior to LOESS in this case. In our system, we adopt RLOESS with a span of 0.1 to eliminate small fluctuations in the reconstructed hand trajectories. Figure 15 shows the trajectory smoothed by RLOESS, where the fluctuation has been removed.
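As a rough stand-in for this smoothing step, the sketch below uses the LOWESS smoother from statsmodels (locally weighted linear, rather than quadratic, regression), with robustifying iterations that down-weight outliers similarly to RLOESS; the span and iteration count are illustrative.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def smooth_component(t, x, span=0.1, robust=True):
    """Smooth one spatial component x(t) of a reconstructed trajectory.

    LOWESS (locally weighted linear regression) approximates the LOESS/RLOESS
    smoothers discussed above; `span` is the fraction of points in each local fit,
    and robustifying iterations (it > 0) down-weight outliers as in RLOESS.
    """
    it = 4 if robust else 0
    smoothed = lowess(x, t, frac=span, it=it, return_sorted=True)
    return smoothed[:, 1]          # smoothed values, ordered by t

# Usage sketch: smooth each of the three components of a trajectory separately.
# t = np.arange(len(traj)) / fps
# traj_smooth = np.column_stack([smooth_component(t, traj[:, d]) for d in range(3)])
```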

5. Hand Trajectory Recognition

In the previous sections, methods for hand tracking, trajectory reconstruction, and smoothing were presented. Recognition of hand trajectories is a challenging task due to the various patterns that hands can make in space and time; for instance, the same intended motion trajectory performed by different people usually varies in scale, speed, and shape. Dynamic Movement Primitives (DMP) [44, 55] is a trajectory modeling and control method that has shown good performance in [6] for handwriting recognition and in [11] for 2D hand trajectory recognition. In this paper, we extend the method to spatial (3D) hand trajectory recognition.

5.1. An Overview of Dynamic Movements Primitives

The Dynamic Movement Primitives (DMP) method models a movement with given start and end states as a set of differential equations. It is capable of encoding the spatiotemporal information of a hand movement trajectory into a weight vector that is robust to spatiotemporal variations of the same hand trajectory.

The differential equations that characterize the spatiotemporal evolution of a dynamic system with given start and end states are given in (8) and (9). A second-order linear damped spring model is used, with a nonlinear function added in (8) as the forcing term. This nonlinear forcing function can capture the complexity of motion patterns made by humans.

In (8) and (9), $y$, $\dot{y}$, and $\ddot{y}$ represent the position, velocity, and acceleration of the hand motion dynamics, $\tau$ is a time constant representing the trajectory duration, and $g$ is the known goal representing the final hand position of the trajectory. For a suitable selection of the parameters $\alpha$ and $\beta$, the forcing term decays to zero over time, which allows the system to converge to the goal position $g$.

The nonlinear forcing function is composed of a set of Gaussian-like basis functions, where $w_i$ are the weights of the basis functions and $\psi_i$ are the Gaussian-like basis functions. The forcing term vanishes as time increases.
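Since the display equations are referenced here only by number, the block below sketches one standard DMP formulation following [44]; the notation of the paper's (8) and (9) may differ.

```latex
% A standard DMP formulation following [44] (notation may differ from (8)-(9)):
% transformation system: damped spring model driven by a nonlinear forcing term
\tau \dot{z} = \alpha_z \bigl( \beta_z (g - y) - z \bigr) + f(x), \qquad \tau \dot{y} = z
% canonical system: phase variable x decays from 1 toward 0 over the movement
\tau \dot{x} = -\alpha_x x
% forcing term: weighted mixture of Gaussian basis functions, vanishing as x -> 0
f(x) = \frac{\sum_i \psi_i(x)\, w_i}{\sum_i \psi_i(x)}\; x \,(g - y_0), \qquad
\psi_i(x) = \exp\!\bigl(-h_i (x - c_i)^2\bigr)
```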

The weight vector $\mathbf{w}$ preserves the shape information of the trajectory. For instance, if $\mathbf{w}$ is fixed and other parameters such as the goal state or the time constant change, the DMP generates topologically similar trajectories. In other words, similar trajectories have similar weight vectors, which is called the invariance property of the DMP model [55]. With this property, trajectories can be classified based on their weight vectors.

5.2. Weight Vector Extraction from 3D Trajectory

Given a trajectory in one dimension, the position, velocity, and acceleration at each time step can be calculated from the frame rate, with the time duration of the whole path obtained from the number of frames. To learn the weight vector for a given trajectory, the initial and goal states as well as the time duration are extracted from the trajectory. From (8), the required forcing term at each time step can be computed from these states, and the weight vector is then learned using locally weighted regression (LWR) as stated in [44].
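The sketch below illustrates this weight-extraction step for a single trajectory component, using the standard formulation of [44]; the parameter values, basis-function spacing, and function name are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def learn_dmp_weights(y, dt, n_basis=30, alpha=25.0, beta=6.25, alpha_x=8.0):
    """Learn DMP basis-function weights for one trajectory component (sketch).

    y  : sampled 1D trajectory component, shape (T,)
    dt : sampling interval (1 / frame rate)
    Returns the weight vector w of length n_basis.
    """
    tau = dt * (len(y) - 1)                    # movement duration
    g, y0 = y[-1], y[0]
    yd = np.gradient(y, dt)                    # velocity
    ydd = np.gradient(yd, dt)                  # acceleration

    # Phase variable of the canonical system: x(t) = exp(-alpha_x * t / tau).
    t = np.arange(len(y)) * dt
    x = np.exp(-alpha_x * t / tau)

    # Target forcing term obtained by inverting the transformation system.
    f_target = tau**2 * ydd - alpha * (beta * (g - y) - tau * yd)

    # Gaussian basis functions spaced evenly in phase (width heuristic).
    c = np.exp(-alpha_x * np.linspace(0, 1, n_basis))            # centers
    h = n_basis**1.5 / c                                          # widths
    psi = np.exp(-h[None, :] * (x[:, None] - c[None, :])**2)      # (T, n_basis)

    # Locally weighted regression: one scalar weight per basis function.
    s = x * (g - y0 + 1e-8)
    w = np.empty(n_basis)
    for i in range(n_basis):
        w[i] = np.sum(s * psi[:, i] * f_target) / (np.sum(s * psi[:, i] * s) + 1e-10)
    return w
```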

Before conducting trajectory recognition, a trajectory instance of "Circle" (Figure 17(d)) is selected from the collected trajectories in order to visualize the weight vectors and the learned DMP models. The acquired hand trajectory is in 3D space, and DMP is applied along each projected direction of the world coordinate system. Figure 16 shows the learned DMP models and weight vectors of different dimensions for the trajectory projected onto one direction. Figures 16(a), 16(c), 16(e), and 16(g) display the original trajectories in green and the trajectories learned by DMP in blue, while Figures 16(b), 16(d), 16(f), and 16(h) show the corresponding learned weight vectors. It can be seen that as the dimension of the weight vector increases, the learned trajectory approaches the original trajectory.

5.3. Training Stage for Hand Trajectory Classifier Using Support Vector Machine

The invariance property of DMP preserves the shape information of a trajectory in its weight vector, which can therefore be used for trajectory recognition. In [11], two classification methods are compared: k-nearest-neighbor (k-NN) and the support vector machine (SVM); based on their experimental results, the SVM obtains much better accuracy than k-NN. In our implementation, multiclass SVM training and testing are performed using the LIBSVM library [56].

For trajectory recognition, five classes of trajectories are collected: "Jump," "Left," "Right," "Circle," and "Forward." Figure 17 shows example trajectories for each class. Eight people (two male and six female) were asked to perform the trajectories, 5 per class each. Over 200 trajectories were collected; 3/4 of the data set is used to train the SVM, while the rest is held out for testing. The trajectories in the training and testing data sets are performed by different people.

An SVM with a linear kernel is trained for trajectory recognition. Table 4 gives the 5-fold cross-validation recognition accuracy for different weight vector dimensions. As the dimension increases for the same amount of training data, the accuracy decreases: the larger the weight vector dimension, the more parameters in the SVM need to be determined, and the more training data is needed. The highest recognition rate is obtained at the smallest dimension considered, which is also because the trajectory classes we collected are relatively simple and well distinguished from each other. For more complex trajectories, a weight vector with a higher dimension is needed.
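As a rough illustration of this training step, the sketch below trains a linear-kernel multiclass SVM on DMP weight vectors using scikit-learn (whose SVC wraps libsvm, the library used in the paper); the data layout, with the weights of the three spatial components concatenated per trajectory, is an assumption rather than the paper's exact setup.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def train_trajectory_classifier(X, y):
    """Train a linear-kernel SVM on DMP weight vectors with 5-fold cross-validation.

    X : array of shape (n_trajectories, 3 * n_basis), one concatenated weight
        vector (x, y, z components) per trajectory -- assumed layout.
    y : class labels ("Jump", "Left", "Right", "Circle", "Forward").
    """
    clf = SVC(kernel="linear", C=1.0)
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation accuracy
    clf.fit(X, y)
    return clf, scores.mean()
```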

5.4. Testing Stage for Hand Trajectory Recognition

We tested the trained SVM on 90 testing trajectories collected from three people (6 per class), which resulted in an accuracy of 88.0%. The recognition results are shown in Table 5. The misclassifications mainly stem from user habits. Because the trajectories "Jump" and "Push" both move forward, a "Jump" made with a smaller radian can be misclassified as "Push"; similarly, the trajectory "Left" is sometimes performed at an angle and is then recognized as "Push" as well. The trajectory "Circle," which is quite distinctive from the other classes, is classified perfectly.

6. Hand Posture Recognition

Hand posture recognition is another key part of gesture recognition. Our system samples hand postures and follows the posture changes along the hand trajectory (Figure 18). There are three steps in posture recognition, namely, (a) hand image acquisition, (b) feature extraction, and (c) classification. In our system, hand image acquisition is done along with the trajectory tracking process, where the hand postures are already segmented from the background.

Template matching is a simple method of posture recognition, and it is easy to add or remove template classes. To extract features on hand postures, a convex hull on the silhouette [31] or fingertip detection using a circular mask as a correlation technique [45] can be employed. However, these recognition approaches based on hand silhouette or contour usually require a clean background where the hand can be well segmented. In our case, hand postures are made in a cluttered and moving background and segmented using the fixed-range CbCr skin classifier, so sometimes the hand cannot be well segmented from the background; in such cases, the hand can be treated as partially occluded. Therefore, a feature detector that is insensitive to illumination variation and partial occlusion is needed. SIFT is such a feature detector and descriptor, and it is also robust to scale and orientation changes. In addition, it is robust to affine distortion within some range, which benefits posture recognition, since the relative position between the hand and the cameras keeps changing and causes affine distortion between the input hand postures and the posture templates. In our work, SIFT is used as the feature detector for posture recognition, and the bag-of-visual-words method and an SVM are combined for classification.

6.1. Scale-Invariant Feature Transform

The Scale-Invariant Feature Transform (SIFT) is a feature detector developed in [22]. SIFT features have been shown to provide robust matching under a range of occlusions and affine distortions, addition of noise, and changes in illumination. To acquire features at different scales, SIFT convolves the original image and its downsampled versions with Gaussian filters of increasing variance. The difference of Gaussian is calculated by subtracting adjacent Gaussian-blurred images within the same octave (Figure 19).

To detect the local maxima and minima of the difference of Gaussian (DoG), each sample point is compared with its 26 neighbors: 8 in its own scale image and 9 in each of the scales above and below (Figure 20). A point is selected as a feature point only if it is larger or smaller than all 26 of these neighbors. For the stability of the feature points, once a candidate is detected by the method above, a threshold on minimum contrast is applied, and edge responses are eliminated, since the DoG has a strong response along edges. In this way, only feature points with strong contrast that lie away from edges remain for feature point matching.

By assigning a consistent orientation to each feature point based on local image properties, the feature point is made invariant to rotation. This orientation information is also used to build a feature point descriptor, a vector containing 128 nonnegative elements. The resulting vectors, referred to as SIFT keys, are used in a nearest-neighbor approach to find matching points and detect the same object across images. Figure 21 shows the feature points detected by SIFT as green circles on two hand palms; the size of each circle represents the scale of the feature point, and the line within each circle represents its orientation. In this example, 36 pairs of matching points are found between the two postures and connected with blue lines.
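A minimal OpenCV sketch of this detection and matching step is given below; the ratio-test threshold of 0.75 is a common default, not a value taken from the paper.

```python
import cv2

def count_sift_matches(img1_gray, img2_gray, ratio=0.75):
    """Detect SIFT features in two posture images and count good matches."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1_gray, None)
    kp2, des2 = sift.detectAndCompute(img2_gray, None)
    if des1 is None or des2 is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des1, des2, k=2)
    good = []
    for pair in knn:
        # Lowe's ratio test: keep a match only if it is clearly better than the runner-up.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return len(good)
```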

Our work is designed to recognize six targeted hand postures, namely, "Palm," "V," "Point," "Y," "Fist," and "Eight" (Figure 22). To demonstrate that the features detected on these postures discriminate well between different classes, we apply SIFT to each class of postures shown in Figure 23 and list the numbers of matching points in Table 6. The numbers marked in bold font on the diagonal are the numbers of matching points within the same class; each is larger than the other numbers in the same row or column, which represent the numbers of matching points between two different classes. This shows that although there are matching features between different posture classes, the number of matches within the same posture class is consistently larger.

6.2. Bag of Visual Words

Bag of visual words is a popular algorithm for image classification. Each image is represented by a set of detected feature points, and the bag of visual words uses a vector to represent the occurrence counts of these feature points; in other words, it is a histogram over the feature points. In this way, each image is represented by a histogram vector. Figure 24 shows a typical processing pipeline for generating a feature point histogram for each image.

In this paper, the feature points are detected by SIFT and the clustering is accomplished by k-means. k-means clustering encodes each feature point by the index of the cluster it belongs to; usually, this is done by finding the shortest Euclidean distance between the input feature point and the cluster centers, which are trained on a group of feature points extracted from a set of training images. Determining the number of clusters is a problem in itself: if the number of clusters is too small, there is not enough discrimination between classes and the classification accuracy decreases; if the number of clusters is too big, the features become over-scattered and the classification accuracy also decreases. In [57], an approach is proposed to determine the number of clusters from the number of detected SIFT feature points on the training images. There are over 20,000 feature points detected on our training images, and we have selected the number of clusters to be 150. After clustering, each template is represented by a vector of indexes indicating which cluster each of its features belongs to. The final step is to sort this vector into a histogram with as many bins as clusters. In this way, each template is mapped into a vector called the bag-of-words vector.
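The sketch below illustrates this bag-of-visual-words pipeline with scikit-learn's k-means; the normalization of the histogram and the function names are illustrative choices, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_list, k=150):
    """Cluster all SIFT descriptors from the training images into k visual words."""
    all_desc = np.vstack(descriptor_list)          # shape (total_features, 128)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_desc)

def bow_histogram(descriptors, vocabulary):
    """Map one image's SIFT descriptors to a k-bin bag-of-words histogram."""
    k = vocabulary.n_clusters
    if descriptors is None or len(descriptors) == 0:
        return np.zeros(k)
    words = vocabulary.predict(descriptors)        # nearest cluster center per feature
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()                       # normalize for varying feature counts
```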

6.3. Training Stage for Posture Classifier Using Scale-Invariant Feature Transform

After mapping the feature points of each template into one bag-of-words vector, those vectors are employed to train a multiclass SVM classifier. The SVM is a supervised learning method for classification and regression that creates a hyperplane in a high-dimensional space to optimally divide the data into different groups.

For posture recognition, six classes of postures are collected; both front view and side view for each class are included. Figure 25 shows the six classes of postures; the front view is shown in Figure 25(a), while the side view is shown in Figure 25(b); from left to right, the postures are “Palm,” “V,” “Point,” “Y,” “Fist,” and “Eight.”

To train a posture classifier, well-segmented hand posture templates from the six posture classes are clipped from the gesture videos; both front views and side views are included. By applying the bag of words, each image is represented by a vector, and the multiclass SVM is trained on these vectors. Table 7 shows the 5-fold cross-validation accuracy for different numbers of training images with a linear-kernel SVM. As the number of training images increases, the accuracy increases slightly. Here we use 402 posture images as the training data set for building the SVM classifier.

6.4. Testing Stage for Posture Recognition

The testing set contains 432 postures made by four other people at 216 time states, 36 time states for each class. At each time state, one front view and one side view of the posture are obtained. To test the performance of the trained posture classifier, the images of each posture taken from the two camera views are first recognized individually. The recognition results are shown in Table 8, where each column refers to the class into which a posture instance is classified. An accuracy of 78.7% is obtained.

The classifier distinguishes most of the testing postures correctly. However, for postures belonging to the class "Fist," a fair number are classified into the class "Point," and a few postures of "Y" are misclassified into the class "Fist." The reason is that these three postures all contain a similar hand part with three fingers curled together and therefore share more similar features. For postures that are well distinguished from each other, the recognition accuracies are higher.

Looking into the misclassified postures, poorly segmented hand postures can also cause misclassification. Figure 26 shows three samples of misclassified postures. Background noise that is segmented as hand area (Figures 26(a) and 26(b)) is one cause of misclassification. Also, because the hand appearance changes with viewpoint, some postures cannot be recognized when the appearance changes too much; for instance, in Figure 26(c), the index finger of the posture "Eight" is totally invisible from the camera view, and the posture is misclassified into the class "Fist."

Considering that the postures taken at the same time state are the same posture seen from different views, the posture recognition results from the two camera views are associated with each other in the following way: if, at the same time state, the recognition results of the postures taken from the two cameras differ, the posture is considered ambiguous and discarded; postures that are recognized as the same class from both camera views are kept for gesture recognition in later steps. For each posture class, the recognition and abandonment rates are shown in Table 9. Although nearly thirty percent of the testing data is abandoned, this scheme increases the recognition accuracy from 78.7% to 96.6%.

7. Gesture Recognition

For gesture recognition, the results of posture and trajectory recognition are combined into a feature vector that we call the gesture vector. A gesture vector is extracted from each gesture video and classified by an SVM. The details of how the gesture vector is defined are introduced below, and the performance of gesture recognition is evaluated in this section.

7.1. Gesture Vector

A gesture vector is constructed from two components: posture elements and a trajectory element. The trajectory element indicates the recognized trajectory class. Depending on the complexity of the gestures, each gesture can be separated into several segments to deal with posture variations, and each posture element represents the occurrence count of one recognized posture class in one segment. Equation (13) shows a gesture vector for a given number of segments and posture classes.

A gesture made at different speeds would produce a large difference in the number of recognized postures: because the frame rate of the cameras is fixed, the faster the hand moves, the fewer postures can be captured. Therefore, the posture elements in the gesture vector are normalized by the total number of recognized postures in each segment, and (13) is rewritten with these normalized elements. The gesture vector is extracted for each hand gesture and adopted for gesture classification.
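The sketch below builds such a normalized gesture vector from a recognized trajectory class and a time-ordered list of recognized postures; the exact element ordering inside the paper's gesture vector is not given, so the layout [trajectory, segment-1 histogram, segment-2 histogram, ...] used here is an assumption.

```python
import numpy as np

def gesture_vector(trajectory_class, posture_labels, n_segments=2, n_classes=6):
    """Build a normalized gesture vector (sketch).

    trajectory_class : integer index of the recognized trajectory class
    posture_labels   : recognized posture class indices (0..n_classes-1), ordered
                       in time; postures discarded by the two-view check are
                       assumed to have been removed already.
    """
    labels = np.asarray(posture_labels, dtype=int)
    segments = np.array_split(labels, n_segments)          # split the gesture in time
    parts = [np.array([trajectory_class], dtype=float)]    # trajectory element first
    for seg in segments:
        hist = np.bincount(seg, minlength=n_classes).astype(float)
        total = hist.sum()
        parts.append(hist / total if total > 0 else hist)  # normalize per segment
    return np.concatenate(parts)
```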

7.2. Defining Gesture Classes and Gesture Vectors

To test the recognition performance, we defined eight classes of hand gestures, including both one-hand and two-hand gestures. Although only eight classes are defined for evaluation, more gesture classes can always be added to the system given some training data and gesture definitions. Figure 27 shows the combinations of hand motion trajectory and posture: the blue arrows represent the hand motion trajectory, and the posture changes are also shown in the figure.

The gesture names and their trajectory and posture components are listed in Table 10. Since the defined gestures contain at most one posture change, each trajectory is segmented into two sections, which fixes the length of the gesture vector for a one-hand gesture. In the two-hand case, a gesture vector is extracted for each hand and the two are concatenated, making the two-hand gesture vector twice as long as the one-hand one.

7.3. Training Stage for Gesture Classifier Using SVM

In the training stage, 10 gestures per class, 80 gestures in total, are collected from four people. The one-hand and two-hand gesture models are trained separately by SVM. The ground truth of the trajectory and the recognition results of the hand postures are used to compose the gesture vectors. A linear-kernel SVM is trained, obtaining a 5-fold cross-validation accuracy of 94% for single-hand gestures and 100% for double-hand gestures. The training accuracies for hand gestures are listed in Table 11.

7.4. Testing Stage for Gesture Recognition

For testing, another 80 gestures are collected, 10 for each gesture class, from four people. Based on how many hands are detected in the initial areas, a gesture is automatically classified as a single-hand or double-hand gesture. Then, based on the trajectory and posture recognition for each hand, the trained SVM model recognizes each hand gesture from its gesture vector. Figure 28 shows the testing pipeline, and Table 12 shows the recognition results and accuracies for each class.

All the gestures are well recognized except for the classes "Grab" and "Hit." This is because, at the posture recognition stage, the trained classifier can misclassify "Fist" as "Point" (Table 8) due to the similarity of these two postures; therefore, there is a higher chance of misclassification for these classes. For gestures that involve a posture change, there is an interval during which the posture does not belong to any of the defined posture classes. With our posture recognition scheme, such postures have a high chance of being discarded, since the recognition results from the two views are not consistent. For instance, in the class "Move," the transitional posture between "V" and "Eight" is discarded, and the class still achieves a high recognition accuracy.

8. Conclusions and Future Works

This paper proposed a vision-based approach for both trajectory and posture recognition of the hand using multiple calibrated cameras. A fixed-range CbCr classifier is adopted for hand area segmentation, and a Kalman Filter is adopted for hand tracking in the presence of occlusion or background noise. It was found that hand trajectories contain both temporal and spatial patterns that can vary among individuals. DMP is applied as a method that preserves this spatiotemporal information in weight vectors; topologically similar trajectories have similar weight vectors, which are used as feature vectors for trajectory classification with an SVM. It was shown that, with only a small amount of trajectory training data, recognition can be achieved with an accuracy of 88.0%.

Feature points of a hand posture are detected using the SIFT method, and a bag-of-words approach is employed to represent each posture as a fixed-length histogram vector, which is used for posture recognition with an SVM. The training postures contain both front and side views taken by the two cameras, and a hand posture is considered recognized only when the recognition results from both camera views are consistent. With this scheme, although some of the postures are taken as unrecognizable and discarded (about 30%), the recognition accuracy reaches 96.6%. For hand gesture recognition, a gesture feature vector is defined that combines the results of both trajectory and posture recognition. In our experimental study, gesture recognition using the SVM approach resulted in 92.5% accuracy.

8.1. Future Work

A combination of RGB and depth sensors (RGB-D) can be utilized in the future to determine various salient features by combining both sensing modalities. In particular, various features can be associated with each of the scan planes in the depth map, which can be used as part of a deep learning algorithm [58]. The current experimental setup utilizes two calibrated cameras placed at a fixed distance from the hand. For applications where the cameras cannot be placed in proximity to the users (e.g., elderly living and elderly computer interfaces), pan-tilt-zoom cameras can be deployed [59]. This active camera setup offers a robust approach where one camera can zoom in on the hand of the user for gesture recognition while following (through pan and tilt movements) the predicted movement pattern of the hand obtained through the other cameras.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.