Abstract

Sliding-window-based multiclass hand posture detection is often performed by detecting the postures of each predefined category with an independent detector, which is inefficient and leads to high confusion rates among postures in real-time applications. To tackle these problems, this work investigates an efficient cascade detector that integrates multiple softmax-based binary (SftB) models and a softmax-based multiclass (SftM) model to perform multiclass posture detection in parallel. The SftB models are used to distinguish the predefined postures from the background regions, and the SftM model is applied to discriminate among all the predefined hand posture categories. The cascade structure also decomposes the complexity of the background pattern space and therefore improves the detection accuracy. In addition, to balance detection accuracy and efficiency, HOG features of increasing resolution are adopted by classifiers of increasing stage-level in the cascade. The experiments are carried out under various scenarios with complicated backgrounds and challenging lighting. The results show the superiority of the proposed SftB classifiers over traditional binary classifiers such as logistic regression, as well as the accuracy and efficiency improvements brought by the softmax-based cascade architecture compared with noncascade multiclass softmax detectors.

1. Introduction

Hand detection refers to determining the locations and shapes of hands. It is a prerequisite step for various hand gesture recognition systems [1, 2], which have been widely studied because of their potential applications in entertainment and virtual reality [3], medical systems, and assistive technologies, as well as in crisis management and disaster relief [4]. However, hand detection is not an easy task because of hand deformation [5], the sensitivity of skin colors to lighting conditions [6], and the complicated environments of practical applications. As a result, robust and efficient hand detection remains a challenging task in the computer vision community.

Multiclass hand posture detection is worth investigating for several reasons: different users may be accustomed to different postures for interaction; many application systems require multiple postures to realize different functions; and robust detection of the human hand from multiple viewpoints can be achieved through multiclass hand detection by letting different posture categories represent postures captured under different viewpoints. One way to deal with multiclass hand detection is first to locate the human hand and then to determine the hand shape by classification. Such methods usually have low accuracy. For example, locating the human hand using skin color cues is easily affected by the lighting conditions and skin-like backgrounds, which leads to high miss and false detection rates and degrades the accuracy and speed of the follow-up classification. Another example is to train a binary classifier for sliding-window-based hand localization, in which all predefined postures are treated as a single positive class and the background is regarded as the negative class. In this method, the differences among posture shapes increase the pattern complexity of the positive space and consequently lead to a low exclusion rate for the background. Another way to perform multiclass hand detection is to build an independent detector for each predefined posture and to perform multiposture detection by sequentially applying the corresponding posture detectors [7, 8]. This practice has several disadvantages: (a) the computing cost is high, because multiple rounds of detection are required to find the postures of multiple categories; (b) a window image may be predicted into multiple posture categories, which results in heavily overlapping detection results; and (c) the multiple detectors are trained independently rather than jointly, which easily causes confusion between different postures.

To improve the performance of multiclass hand posture detection systems, this work provides a softmax-based cascade detector that integrates several SftB classifiers at the early stages and a SftM classifier at the last stage. The advantages of the proposed method include the following: (a) the softmax-based structure makes it possible to perform multiclass posture detection in parallel; (b) the cascade structure helps decompose the complexity of the background pattern space and therefore improves the detection accuracy; (c) the pass-rate of postures and the false rate of background can be adjusted easily by using the binary SftB classifiers (adapted from softmax models) in the first few stages; (d) the SftB-based binary classification is made upon the multiple decision surfaces implied by the softmax model and therefore has a stronger background exclusion ability than binary classifiers trained with the examples of all defined posture categories as a single positive class; and (e) with the cascaded softmax scheme, the prediction probabilities across multiple stages can be merged to make the final decision, which helps reduce the confusion rates between posture categories. Moreover, stage-classifiers of increasing stage-level take HOG features of increasing resolution to balance detection accuracy and efficiency.

To sum up, the major contributions of this work are as follows:
(1) A softmax-based cascade architecture is proposed to perform multiclass hand posture detection in parallel and, at the same time, to decompose the complexity of the background pattern space to improve the detection accuracy.
(2) The SftB classifier is proposed to better distinguish the predefined postures from the background regions, since it decomposes the complexity of the multiclass posture pattern space through multiclass decision boundaries that are learned jointly.
(3) The cascade is designed to take low-resolution HOG features at the lower stages and HOG features of higher resolution for stage-classifiers of higher levels, which helps balance detection accuracy and efficiency.

The remainder of this paper is organized as follows. Section 2 briefly reviews existing work on vision-based hand detection. The proposed softmax-based cascade architecture is described in detail in Section 3. Experimental results and discussions are provided in Section 4. Conclusions and future work are offered in Section 5.

2. Related Work

Vision-based hand detection methods can generally be separated into two groups: appearance-based methods and 3D-model-based methods [2, 7]. The appearance-based methods carry out the detection by directly comparing image features with prebuilt appearance models. These methods are usually efficient, but their performance is easily affected by viewpoint variation and hand deformation. The 3D-model-based methods adopt a kinematic model with many degrees of freedom [5, 8]. Such methods offer a richer hand description and can therefore deal with more posture categories, but they are usually computationally expensive because of the complex model matching algorithms. In this work, an appearance-based method is explored to perform multiclass hand posture detection in parallel.

The key to appearance-based methods is to find effective features for hand posture representation and to develop an efficient and expressive posture classification model. Frequently used appearance features include Haar-like [2, 7, 9], HOG [10–12], SIFT [13, 14], and BRIEF [14, 15] features. However, such features are seriously affected by cluttered backgrounds, which introduce noise into the feature encodings. For this reason, there is a recent trend toward combining multiple feature descriptions, such as the integration of HOG and skin features in [16] and the combination of Haar-like and HOG features in [2]. However, the accuracy improvements of such multifeature methods are usually gained at the expense of a considerable increase in computing cost. To improve the efficiency, a two-level classifier is presented in [1], in which the possible presence of hands is determined from a global perspective in the first level, and the hand regions are then precisely delineated at the pixel level by a probabilistic model in the second level. In [17], the saliency map generated by a Bayesian model is first thresholded to localize the hand regions, and shape and texture features are then extracted from the saliency map of the hand regions for hand posture recognition. More recently, deep learning (DL) methods have also been investigated for hand posture detection, such as the integration of a CNN scheme with fast candidate generation [18], the multiscale deep feature approach [19], and the deep architecture with three networks sharing convolution layers [20]. However, DL-based methods are much slower than the classical methods when the algorithms run on a machine without an advanced GPU.

The multiclass posture detection problem is often addressed by two-stage methods [20–23], in which hand region proposals are first obtained by techniques such as skin, motion, or saliency detection, which are robust to hand deformation and viewpoint variation, and these regions are then classified by multiple binary models or a single multiclass model to achieve the final posture recognition. For such methods, precise region proposals are a prerequisite for satisfactory recognition rates, while obtaining precise proposals is itself difficult if no specific posture models are utilized. As a result, the miss rate of such methods is often relatively high. The sliding-window-based methods usually perform multiclass posture detection with multiple posture-specific detectors [9, 24]. Such methods may have relatively high recall rates, but they are inefficient, since each window needs to be classified by multiple detectors, and they suffer from heavy confusion among categories, because the detectors for different categories are trained independently rather than in a coordinated manner. There are also works that adopt a tree-type structure [7], but practical experiments show no significant improvement in accuracy or efficiency. In this work, we propose a softmax-based cascade detector that performs multiclass hand posture detection simultaneously rather than category by category. Moreover, owing to the multiclass objective function, the decision boundaries are obtained by seeking a balance among all categories and can therefore help reduce the confusion rates among different posture categories.

3. The Proposed Methodology

In this section, the softmax model is firstly presented for multiclass classification. Then, the softmax-based cascade architecture is introduced for multiclass hand posture detection. And, finally, we will show how to apply multiresolution features to the cascade architecture to balance the detection accuracy and efficiency.

3.1. Multiclass Hand Posture Classification by Softmax Regression

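Instead of utilizing multiple independent binary classifiers, our method applies the softmax model [25] to discriminate among the background category and multiple hand posture categories. Specifically, given the feature vector $x$ of image $I$, the distribution of the class label $y$ can be modeled as

$$p(y = k \mid x; W) = \frac{\exp\left(w_k^{T}\phi(x)\right)}{\sum_{j=0}^{K}\exp\left(w_j^{T}\phi(x)\right)}, \quad k = 0, 1, \ldots, K, \tag{1}$$

where $W = [w_0, w_1, \ldots, w_K]$ are the model parameters and $\phi(\cdot)$ represents the basis functions used for feature transformation. $y = k$ with $k \geq 1$ means that $I$ is an image of the $k$th posture category, and $y = 0$ indicates that $I$ is an image of background or undefined postures. In this work, the identity basis functions are adopted; that is, $\phi(x) = x$. For a kernelized softmax model, $\phi(x) = [\kappa(x, x_1), \ldots, \kappa(x, x_N)]^{T}$, where $\kappa(\cdot, \cdot)$ is the kernel function and $x_1, \ldots, x_N$ are the features of the training examples. To facilitate the subsequent discussion, the ground-truth label of $I$ is reformulated into a $(K+1)$-dimensional vector $t = [t_0, t_1, \ldots, t_K]^{T}$, whose $k$th element is equal to 1 if $y = k$ and 0 otherwise. Moreover, we use $\sigma_k(x; W)$ to denote the softmax model $p(y = k \mid x; W)$ with parameter $W$ and use $\sigma(x; W)$ to denote the vector $[\sigma_0(x; W), \ldots, \sigma_K(x; W)]^{T}$ for simplicity. With these notations, the distribution of the label vector $t$ can be formulated as

$$p(t \mid x; W) = \prod_{k=0}^{K}\sigma_k(x; W)^{t_k}. \tag{2}$$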

The model parameter $W$ can be obtained by maximum likelihood estimation (MLE) [25, 26]. Specifically, given the training set $\{(I_n, t_n)\}_{n=1}^{N}$, under the assumption of independent and identically distributed examples, the likelihood of the parameters $W$ can be formulated as

$$L(W) = \prod_{n=1}^{N}\prod_{k=0}^{K}\sigma_k(x_n; W)^{t_{nk}}, \tag{3}$$

where $x_n$ is the feature representation of $I_n$, $t_n$ is the $(K+1)$-dimensional label of example $I_n$, and $t_{nk}$ is the $k$th component of $t_n$. In implementation, $W$ is acquired by minimizing the negative log-likelihood:

$$E(W) = -\sum_{n=1}^{N}\sum_{k=0}^{K} t_{nk}\ln\sigma_k(x_n; W). \tag{4}$$

Since the loss function in (4) remains unchanged when the same vector is added to every $w_k$, the minimizer is not unique, and a penalization on $W$ is added to the objective function to suppress the magnitude of the model parameters. Therefore, in practice, we take the loss function with a regularization term:

$$E_{\lambda}(W) = E(W) + \frac{\lambda}{2}\lVert W\rVert^{2}, \tag{5}$$

where $\lVert W\rVert^{2} = \sum_{k=0}^{K} w_k^{T}w_k$ and $\lambda$ is the regularization coefficient. Finally, we use the efficient iterative BFGS algorithm [27, 28] to find the solution of (5). Once the model parameters are obtained, the prediction of $y$ can be made with the softmax model by

$$\hat{y} = \arg\max_{0 \leq k \leq K}\sigma_k(x; W). \tag{6}$$

This prediction formula will be slightly modified in the next subsection to carry out two-class classification.
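As a rough illustration of the model in this subsection, the following sketch evaluates the softmax class probabilities with identity basis functions, the regularized negative log-likelihood, and the arg-max prediction. It is written in plain Python for readability. The function names, the toy weights, and the scaling of the squared-norm penalty by `lam / 2` are assumptions made for the example, and the BFGS optimization itself is not reproduced.

```python
import math

def softmax_probs(W, x):
    """Class probabilities for feature vector x; W[k] is the weight vector of class k (class 0 = background)."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    m = max(scores)  # subtract the largest score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def regularized_nll(W, X, y, lam):
    """Negative log-likelihood over (X, y) plus an assumed (lam / 2) * ||W||^2 penalty."""
    nll = -sum(math.log(softmax_probs(W, x)[label]) for x, label in zip(X, y))
    penalty = 0.5 * lam * sum(wi * wi for w in W for wi in w)
    return nll + penalty

def predict(W, x):
    """Index of the most probable class."""
    probs = softmax_probs(W, x)
    return max(range(len(probs)), key=probs.__getitem__)

# Toy model: background plus two posture classes, 2-D features (hypothetical values).
W = [[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]]
X = [[2.0, 0.0], [0.0, 2.0], [0.1, 0.1]]
y = [1, 2, 0]

# Adding the same vector to every class weight leaves the unregularized loss
# unchanged, which is one reason a penalty on W is useful.
W_shifted = [[w[0] + 5.0, w[1] - 3.0] for w in W]
loss_plain = regularized_nll(W, X, y, lam=0.0)
loss_shifted = regularized_nll(W_shifted, X, y, lam=0.0)
```

In this toy setting, `loss_plain` and `loss_shifted` agree up to rounding, whereas any positive `lam` makes the two parameter sets distinguishable.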

3.2. Softmax-Based Cascade Architecture for Human Hand Detection

For multiscale sliding-window-based hand detection, the background pattern space is highly complicated because of the varied background window images. To decompose the complexity of the background space, a softmax-based cascade architecture is introduced, which comprises a set of softmax-based binary (SftB) classifiers $\{B_p\}_{p=1}^{P-1}$ and a softmax-based multiclass (SftM) classifier $M_P$, where $P$ is the number of stages. These classifiers are derived from softmax regression models $\sigma(\cdot; W_p)$ that are learned with a cascade training procedure. The classifiers $B_p$, with outputs in $\{0, 1\}$, are mainly used to distinguish the defined hand postures from the background window images, where SftB is formulated as

$$B_p(x) = \begin{cases} 1, & \max_{1 \leq k \leq K}\sigma_k(x; W_p) - \sigma_0(x; W_p) \geq \theta_p, \\ 0, & \text{otherwise}. \end{cases} \tag{7}$$

That is, at stage $p$, a window is accepted if and only if the maximal probability over the posture categories exceeds the probability of the background category by at least $\theta_p$. The thresholds $\theta_p$ are set so that most windows that properly contain the defined postures can get through, and they are determined at the training stage from the preset posture example pass-rates. Specifically, for $B_p$, the threshold $\theta_p$ is computed from the posture example set $S_p$ that is used for learning $B_p$: the values $\max_{1 \leq k \leq K}\sigma_k(x; W_p) - \sigma_0(x; W_p)$, $x \in S_p$, are sorted in ascending order to produce a vector $v$, and the entry of $v$ at the position determined by the preset pass-rate $r_p$ is taken as the threshold, where $r_p$ is the preset posture example pass-rate for the $p$th stage SftB during the training period. The SftM classifier $M_P$, with output in $\{0, 1, \ldots, K\}$, takes the form described in (6), and it is mainly used to discriminate among the $K+1$ categories comprising the $K$ classes of defined postures and the difficult backgrounds. To speed up the classification, the classifier $B_p$ can be replaced by a faster variant $\tilde{B}_p$ (equation (8)), whose threshold $\tilde{\theta}_p$ is determined in the same way as $\theta_p$.
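The binary decision and the pass-rate-based threshold selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the example margins, and in particular the exact position picked from the sorted vector (chosen here so that at least the requested fraction of posture examples passes) are assumptions.

```python
import math

def posture_margin(probs):
    """Largest posture probability minus the background probability (index 0 = background)."""
    return max(probs[1:]) - probs[0]

def threshold_from_pass_rate(margins, pass_rate):
    """Pick a threshold so that at least `pass_rate` of the posture examples are accepted.

    Assumption: margins are sorted in ascending order and the entry that leaves
    ceil(pass_rate * n) examples at or above it is used as the threshold.
    """
    ordered = sorted(margins)
    n = len(ordered)
    keep = min(n, max(1, math.ceil(pass_rate * n - 1e-9)))  # small guard against float noise
    return ordered[n - keep]

def sftb_accept(probs, theta):
    """Binary decision: 1 if the window is accepted as a possible posture, else 0."""
    return 1 if posture_margin(probs) >= theta else 0

# Hypothetical margins of ten posture training examples at one stage.
margins = [0.9, 0.2, 0.7, 0.8, -0.1, 0.6, 0.95, 0.5, 0.85, 0.4]
theta = threshold_from_pass_rate(margins, pass_rate=0.9)
passed = sum(1 for m in margins if m >= theta)
```

With these toy margins, nine of the ten posture examples pass, and only the example whose background probability exceeds every posture probability is rejected.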

The classification of a window image $I$ is achieved by a two-step decision process. In the first step, the class label of $I$ is predicted as

$$b(I) = \prod_{p=1}^{P-1} B_p(x_p), \tag{9}$$

where $x_p$ represents the feature representation used by $B_p$. The range of $b(I)$ is $\{0, 1\}$. When $b(I)$ is 0, the window is directly excluded, and the second step is not carried out. In the second step, the class label of a window image accepted by (9) is reidentified as

$$\hat{y}(I) = \arg\max_{0 \leq k \leq K} s_k(I), \tag{10}$$

where $s(I)$ is the $(K+1)$-dimensional score vector calculated using the softmax models at the high-level stages (equation (11)).

In the experiments, the parameter in (11) that selects the high-level stages contributing to the score vector is set to 2. For ease of understanding, the flowchart of the window image classification is provided in Figure 1.
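The two-step decision can be illustrated with the short sketch below. It assumes that the per-stage probability vectors are already available (a real cascade would stop computing features as soon as one stage rejects the window), that the merged score is the sum of the probability vectors of the last stages, and that `merge_last` plays the role of the parameter set to 2 in the experiments. All three points, along with the function and variable names, are assumptions for illustration.

```python
def classify_window(stage_probs, thetas, merge_last=2):
    """Two-step decision for one window.

    stage_probs: one probability vector per stage (index 0 = background),
                 ordered from the first stage to the last.
    thetas:      thresholds of the binary stages (all stages except the last).
    merge_last:  number of final stages whose probabilities are merged (assumption).
    Returns 0 for background, otherwise the index of the predicted posture class.
    """
    # Step 1: every binary stage must accept the window.
    for probs, theta in zip(stage_probs, thetas):
        if max(probs[1:]) - probs[0] < theta:
            return 0
    # Step 2: merge the probabilities of the last stages (summation is an assumption).
    merged = [sum(column) for column in zip(*stage_probs[-merge_last:])]
    return max(range(len(merged)), key=merged.__getitem__)

# Hypothetical outputs of a four-stage cascade with two posture classes.
stage_probs = [
    [0.20, 0.50, 0.30],
    [0.10, 0.60, 0.30],
    [0.10, 0.40, 0.50],
    [0.05, 0.50, 0.45],
]
thetas = [0.1, 0.2, 0.2]
label_merged = classify_window(stage_probs, thetas, merge_last=2)
label_last_only = classify_window(stage_probs, thetas, merge_last=1)
```

In this toy case the last stage alone slightly prefers class 1, while merging the last two stages selects class 2, which shows how evidence from an earlier stage can change a borderline decision.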

3.3. Multiresolution HOG Feature for Different Stage-Classifiers

For sliding-window-based hand detection, tens of thousands of window images need to be classified in a single frame, which makes the detection system inefficient. To improve the efficiency, multiresolution HOG features are adopted for posture representation [24]. The cascade is designed so that HOG features of low resolution are utilized by classifiers of lower stage-levels, and HOG features of high resolution are utilized by classifiers of higher stage-levels. The varied feature resolutions are achieved by adjusting the density of cell splits in the window images, as discussed in [24]. With such a multiresolution scheme, a large number of background windows can be excluded by the classifiers using low-resolution HOG features, and only a few difficult background windows need to be further classified with the high-resolution HOG features, which are more discriminative but more computationally costly. In this way, the detection speed can be greatly improved without sacrificing the detection accuracy. Concretely, let $\tau_p$ denote the time consumption for a single window classification with the $p$th stage-classifier, and let $\rho_k$ denote the fraction of windows that pass through the first $k$ stages:

$$\rho_k = \frac{\text{number of windows passing the first } k \text{ stages}}{\text{number of all windows generated from the full-sized image}}.$$

Then, based upon the proposed multiresolution and cascade scheme, the average time expense for classifying one window image is $T = \tau_1 + \sum_{p=2}^{P}\rho_{p-1}\tau_p$. However, if the detection system adopts a single softmax model with HOG features of the highest resolution, the time expense would be $\tau_P$, which is usually several times as much as $T$.
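The cost argument can be made concrete with a small numerical sketch. The stage costs and survival fractions below are hypothetical values chosen only to show the arithmetic; they are not measurements from the experiments.

```python
def average_window_cost(stage_times, pass_fractions):
    """Expected per-window cost of a cascade.

    stage_times[p]:    cost of evaluating stage p on one window.
    pass_fractions[p]: fraction of all windows that survive stages 0..p
                       (one entry for every stage except the last).
    """
    cost = stage_times[0]  # every window is evaluated by the first stage
    for t, fraction in zip(stage_times[1:], pass_fractions):
        cost += fraction * t  # later stages only see the surviving windows
    return cost

stage_times = [1.0, 2.0, 4.0, 8.0]     # hypothetical relative costs per stage
pass_fractions = [0.10, 0.02, 0.005]   # hypothetical survival fractions
cascade_cost = average_window_cost(stage_times, pass_fractions)
single_cost = stage_times[-1]          # one highest-resolution model applied to every window
```

Under these assumed numbers the cascade costs about 1.32 units per window against 8 units for the single high-resolution model, because almost all windows are discarded while only the cheap features have been computed.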

To aid understanding, the details of the training process for the proposed method are described in Algorithm 1. In Step (1), the training data are prepared and some hyperparameters are defined to control the training process. In Step (2), the first stage-classifier is trained, while the remaining stage-classifiers are trained one by one in Step (3). During training of the first stage, the initial negatives are randomly cropped, and all the rest are acquired using hard example mining (Step (2.4)). Such a strategy can enhance the discriminative ability of the first stage-classifier. For stages beyond the first, all the negatives are directly mined with the previous stage-classifiers (Step (3.2)). In the $p$th stage, the multiclass softmax model is learned first, and then, based upon the predefined pass-rate hyperparameter $r_p$, the modified SftB classifiers $B_p$ and $\tilde{B}_p$ are generated. Once the stage level reaches the predefined stage number $P$, the procedure stops and returns the set of cascade components.

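Algorithm 1: Training of the softmax-based cascade.
(1) Prepare the multiclass posture example set $S$ and the set $G$ of full-sized background images. Specify the control factors, the stage number $P$, the HOG resolutions for the different stages, the posture sample pass-rates $r_1, \ldots, r_{P-1}$ for the first $P-1$ stages, and the size of the training sample sets. Set the current stage level to $p = 1$ and the set of stage-classifiers to $C = \emptyset$. Note that all sub-images cropped from full-sized background images have the normalized example size in the training process.
(2) Train the first stage-classifier as follows:
(2.1) Initialize the positive example set with $S$ and the negative example set $N$ with sub-images randomly cropped from images in $G$.
(2.2) Train a softmax model with the current sample sets and HOG features of the specified resolution, and modify the model into two SftB classifiers $B_1$ (Eq. (7)) and $\tilde{B}_1$ (Eq. (8)) based upon the pass-rate $r_1$.
(2.3) Add $B_1$ and $\tilde{B}_1$ to $C$. If $|N|$ has reached the required number of negatives, go to Step (3). Otherwise, go to Step (2.4). Here $|N|$ represents the number of examples in $N$.
(2.4) Randomly crop a sub-image from an image queried from $G$. Add it to $N$ if it is accepted by the current first-stage classifier. Repeat this process until $|N|$ reaches the target size.
(2.5) Reset the variables updated above and go to Step (2.2).
(3) Train the remaining stage-classifiers:
(3.1) Initialize the positive and negative example sets for the current stage.
(3.2) Randomly crop a sub-image from an image in $G$. Add it to the negative set if it is accepted by the previous stage-classifiers. Repeat this process until the negative set reaches the predefined size.
(3.3) Train a softmax model with the sample sets and HOG features of the specified resolution.
(3.4) If $p < P$, modify this model into the SftB classifiers $B_p$ and $\tilde{B}_p$ based on the pass-rate $r_p$.
(3.5) Add the trained classifiers to $C$ and advance to the next stage level.
(3.6) If stages remain to be trained, go to Step (3.1). Otherwise, the cascade training is finished and the procedure stops.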
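The hard negative mining used in the training procedure can be sketched as below. The cropping function and the stage-classifiers are replaced by toy stand-ins (a "crop" is a single random number, and a stage accepts values above a cut-off), so every name and value here is hypothetical. The sketch only shows the control flow of keeping background crops that the earlier stages wrongly accept.

```python
import random

def mine_hard_negatives(crop_fn, accept_fns, target_size, max_tries=100000):
    """Collect background crops that all given stage-classifiers wrongly accept.

    crop_fn:    returns one random background crop (hypothetical callable).
    accept_fns: previously trained binary stage-classifiers, each returning 0 or 1.
    """
    negatives = []
    tries = 0
    while len(negatives) < target_size and tries < max_tries:
        tries += 1
        crop = crop_fn()
        # A crop is a hard negative only if it survives every earlier stage.
        if all(accept(crop) == 1 for accept in accept_fns):
            negatives.append(crop)
    return negatives

# Toy stand-ins for background crops and two earlier stages.
rng = random.Random(0)
stage1 = lambda v: 1 if v > 0.5 else 0
stage2 = lambda v: 1 if v > 0.8 else 0
hard = mine_hard_negatives(rng.random, [stage1, stage2], target_size=20)
```

The `max_tries` guard is a practical addition for the sketch: as the cascade becomes more selective, accepted background crops become rare and an unbounded loop could run for a long time.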

4. Experimental Results and Discussions

The proposed method is evaluated on a dataset that is collected under various scenarios with complex background and challenging light conditions. In this section, we firstly describe the dataset and experimental settings. Then, performances of the proposed SftB classifier and softmax-based cascade are evaluated. And, finally, influences of the settings for posture example pass-rates are discussed.

4.1. Datasets and Experimental Settings
4.1.1. Datasets

The experimental dataset comprises four predefined posture categories. For each category, there are around 2000 positive examples normalized to a common size. The samples are obtained by cropping hand regions from full images that are collected from ten subjects under various backgrounds and lighting conditions. The negative samples are generated during the training process by randomly cropping image regions from 500 additional full-sized pictures with complicated content. These full-sized images comprise various undefined hand postures but contain no hand posture of the predefined categories. In addition to the training samples, we prepare 4000 full-sized images to evaluate the performance of the proposed method, and each image contains at least one predefined posture instance. Examples of the defined posture categories are presented in Figure 2.

4.1.2. Experimental Settings

In the experiments, training samples are normalized to a common resolution. HOG features of various resolutions are utilized for classification, where different resolutions are achieved by adopting different cell splits. The cell splits for the three adopted resolutions are illustrated in Figure 3. The parameters for HOG features of all resolutions are fixed as follows: unsigned gradient orientation, 9 equally distributed angle bins, a fixed number of cells per block, and a block step equal to the cell size. In total, four stage-classifiers are incorporated into the softmax cascade structure. The first three are SftB classifiers, and the last one is the SftM classifier. The feature configuration for each stage-classifier is presented in Table 1.

In addition, to improve the detection efficiency, the multiscale search changes the window size rather than resizing the image itself; for example, four different window sizes can be used to detect hands of different scales in the frame. For a given window size, the window is divided evenly into the cells of the split shown in Figure 3, so the region of each cell is computed from the top-left coordinates of the window, the window size, and the number of cells in the horizontal or vertical direction. In short, the cell size changes with the window size. Although this calculation of the cell locations is not exact when the window size is not divisible by the number of cells, the feature is still effective. In video-based detection, if the application scenario requires the users to be near the camera, the window sizes should be larger, while if the users are required to stay far away from the camera, the window sizes should be smaller. The window step is set to 0.05 times the window size. For live hand detection, the web-camera is configured to capture images at a fixed resolution. All experiments are conducted on a PC equipped with an Intel(R) Pentium(R) G3220 @3.00GHz CPU and 4.00GB RAM, under the Visual Studio 2013 platform.
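The window-size-based multiscale search can be illustrated with the following sketch, which computes the pixel region of a cell for an arbitrary window size and enumerates window positions with a step of 0.05 times the window size. Square windows, integer floor division for the cell borders, and the image and window sizes used in the example are assumptions. The exact cell formula of the paper is not reproduced here.

```python
def cell_region(x0, y0, win_size, n_cells, i, j):
    """Pixel bounds (left, top, right, bottom) of cell (i, j) in a square window.

    (x0, y0) is the top-left corner of the window; right and bottom are exclusive.
    Floor division keeps the cells contiguous even if win_size is not divisible by n_cells.
    """
    left = x0 + (i * win_size) // n_cells
    right = x0 + ((i + 1) * win_size) // n_cells
    top = y0 + (j * win_size) // n_cells
    bottom = y0 + ((j + 1) * win_size) // n_cells
    return left, top, right, bottom

def window_positions(img_w, img_h, win_size, step_ratio=0.05):
    """Top-left corners of all windows of one size, stepping by step_ratio * win_size pixels."""
    step = max(1, int(round(step_ratio * win_size)))
    return [(x, y)
            for y in range(0, img_h - win_size + 1, step)
            for x in range(0, img_w - win_size + 1, step)]

# Hypothetical sizes: a 100-pixel window split into 3 x 3 cells (100 is not divisible by 3).
cells = [cell_region(0, 0, 100, 3, i, 0) for i in range(3)]
positions = window_positions(320, 240, 100)
```

With these values, the three cells of the top row are 33, 33, and 34 pixels wide, which illustrates the small inexactness mentioned above, and a single window size already yields more than a thousand candidate windows on a small image.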

4.2. Effectiveness of the Proposed SftB Classifiers

To evaluate the proposed SftB classifier, we use the softmax and logistic regression (LR) techniques, respectively, to train the first three binary stage-classifiers of the final four-stage cascade. During the SftB cascade training period, all samples prepared for the $p$th stage-classifier are divided into a training set $D_p^{\text{train}}$ and a testing set $D_p^{\text{test}}$. The SftB classifier $B_p$ is learned from $D_p^{\text{train}}$, and the ROC curve for $B_p$ is calculated on $D_p^{\text{test}}$. The ROC curve describes the relation between the false positive rate (FPR) and the true positive rate (TPR): different TPR values of $B_p$ are achieved by adjusting the threshold $\theta_p$, and varying $\theta_p$ in turn produces varying FPR values on $D_p^{\text{test}}$. Similarly, an LR classifier can be trained and evaluated for each stage on the same sets. In this way, six ROC curves in total are produced for the three stages. In addition, an extra ROC curve is generated for a SftB classifier trained on the second-stage sets using HOG features of the first resolution. All seven ROC curves are displayed in Figure 4, where the notations “stage2&Reso2&LR” and “stage2&Reso2&SftB” represent, respectively, the second-stage LR and SftB classifiers trained with HOG features of the second resolution. The other notations can be interpreted in a similar way.
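The way an ROC curve arises from sweeping the acceptance threshold can be sketched as follows. The margins (largest posture probability minus background probability) of the test examples are hypothetical, and the function name is chosen for this illustration only.

```python
def roc_points(pos_margins, neg_margins):
    """(FPR, TPR) pairs obtained by sweeping the acceptance threshold over all observed margins."""
    thresholds = sorted(set(pos_margins) | set(neg_margins), reverse=True)
    points = [(0.0, 0.0)]  # a threshold above every margin accepts nothing
    for theta in thresholds:
        tpr = sum(1 for m in pos_margins if m >= theta) / len(pos_margins)
        fpr = sum(1 for m in neg_margins if m >= theta) / len(neg_margins)
        points.append((fpr, tpr))
    return points

pos = [0.9, 0.8, 0.4, 0.7]      # hypothetical margins of posture test examples
neg = [0.1, 0.5, -0.3, -0.6]    # hypothetical margins of background test examples
curve = roc_points(pos, neg)
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1). A classifier is better at a fixed TPR when the corresponding FPR is smaller, which is the comparison made between the SftB and LR classifiers.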

From Figure 4, we can see that, with the same HOG resolution and for a fixed TPR (Table 4), the FPR (Table 4) of the SftB classifier is much smaller than that of the LR classifier. This is because the SftB classifier is derived from a multiclass classifier, which essentially provides the decision boundaries among different posture categories and can therefore decompose the complex space formed by the multiclass posture examples. Moreover, the classifier “Stage2&Reso1&SftB” seriously underperforms the others, which indicates that increasing the resolution of the HOG features is crucial to guarantee the classification accuracy.

In addition, the histograms of the outputs of the classifier defined in (8) are calculated and presented in Figure 5, so that more insight can be gained into the proposed softmax-based binary classification. In the illustration, the upper histogram is calculated from the background examples, and the bottom one is calculated from the predefined hand posture examples.

4.3. Effectiveness of the Proposed Softmax-Based Cascade Detector

To fully evaluate the proposed method, we compare the performance of the softmax-based cascade and noncascade detectors based on their confusion matrices. The three noncascade softmax detectors being compared are trained, respectively, with each of the three HOG feature resolutions illustrated in Figure 3. For the cascade detector, the posture pass-rates for the first three stage-classifiers are set to 98.0%, 98.5%, and 99.0%, respectively. In practice, multiclass posture detection is carried out on the full-sized testing images with each of the four detectors (one cascade and three noncascade) based on the multiscale sliding-window scheme. For each detector, all rectangular regions that are classified into the same category are postprocessed by nonmaximum suppression to determine the final locations of the posture instances. The confusion matrix $C^{(d)}$ for a detector $d$ is computed from the final results produced by $d$. With zero-based indexes, the elements of $C^{(d)}$ are computed from the following counts: $n_{ij}$, the number of instances that belong to the $i$th posture category but are predicted into the $j$th category; $m_i$, the number of instances from the $i$th posture category that are not predicted into any of the defined categories; $n_i$, the number of posture instances from the $i$th posture category; $f_j$, the number of background regions that are predicted into the $j$th posture category; $g$, the number of full-sized pure background images used for testing; and $g_0$, the number of full-sized pure background images that do not contain false detections. A pure background image is an image that does not contain instances of the predefined posture categories. A detected region $R$ is a correct detection of an instance $H$ if and only if (a) the predicted class of $R$ is equal to the ground-truth class of $H$ and (b) the overlap ratio between $R$ and the ground-truth region of $H$ is larger than 0.6.
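The correctness criterion for a detected region can be sketched as below. The text does not spell out how the overlap ratio is computed, so intersection-over-union is used here as an assumption. The box format and function names are also chosen for the illustration.

```python
def overlap_ratio(a, b):
    """Intersection-over-union of two boxes given as (left, top, right, bottom)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not intersect
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def is_correct_detection(det_box, det_class, gt_box, gt_class, min_overlap=0.6):
    """True if the class matches and the overlap ratio exceeds the 0.6 threshold."""
    return det_class == gt_class and overlap_ratio(det_box, gt_box) > min_overlap

# Hypothetical ground-truth instance and three detections.
gt_box, gt_class = (0, 0, 100, 100), 2
close_hit = is_correct_detection((10, 10, 110, 110), 2, gt_box, gt_class)
loose_hit = is_correct_detection((30, 30, 130, 130), 2, gt_box, gt_class)
wrong_class = is_correct_detection((10, 10, 110, 110), 1, gt_box, gt_class)
```

Under this assumed definition, a detection shifted by 10 pixels in each direction still counts as correct, while a shift of 30 pixels or a wrong class label does not.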

The four confusion matrices corresponding to the four detectors are presented in Table 2, where Softmax+Resolution1, Softmax+Resolution2, and Softmax+Resolution3 denote, respectively, the confusion matrices computed from the three noncascade detectors. Note that the confusion matrix here is different from that of a classification problem. In sliding-window-based detection, one target instance may be covered by many windows, and the postprocessing is only applied to windows that are classified into the same category. As a result, one region can finally be predicted into more than one posture category. For this reason, the sum of the elements in each row does not necessarily equal one.

From Table 2, we can see that hand detection with the noncascade softmax detectors may cause high false detection rates in the background areas and high confusion rates among different posture categories. By contrast, the proposed softmax-based cascade significantly suppresses all kinds of false detections without sacrificing the recall rates. This is because the complexity of the background space is effectively decomposed by the multiple stage-classifiers, and it therefore becomes much easier for the final multiclass softmax model to discriminate among the predefined postures and the small number of remaining background windows.

To make more direct and intuitive comparisons, several summary measures are also computed and provided in Table 3. The measures mean recall rate and mean confusion rate represent, respectively, the recall rates and the confusion rates averaged over the four predefined posture categories. For the definitions of FPPI and mean correct rate, please refer to Table 4.

From Table 3, we can see that the detection accuracy of Softmax+Resolution3 is the highest among the three noncascade detectors. By comparison, the proposed multiclass cascade detector further improves the mean recall rate from 0.9225 to 0.9448 and boosts the mean correct rate from 0.5155 to 0.8475. Meanwhile, the mean confusion rate is reduced from 0.2182 to 0.0515, and the FPPI is reduced from 0.2248 to 0.0169. In addition, the proposed detector is almost 4 times faster than Softmax+Resolution3.

Figure 6 shows some hand posture detection results obtained with an ordinary web-camera. From the results, we can see that the proposed method can detect the defined hand postures under various environments. The system reaches a real-time running speed of 27 FPS under our experimental setup.

4.4. The Influences of the Settings for Posture Example Pass-Rates

The performance of the proposed cascade is directly affected by the thresholds of its stage-classifiers, as shown in (8). The thresholds affect not only the detection results but also the training process, since the background samples for the $p$th stage are mined with the previous $p-1$ stage-classifiers. These thresholds are determined from the pass-rates of posture samples, which are set at the training stage to control the training process, following the sorting procedure described in Section 3.2. To obtain a better cascade detector, we prepare multiple groups of pass-rate settings and train the four-stage cascade classifier with each group. After that, the FPPW and detection rate (Table 4) are computed for each of these cascade detectors, and the best group of settings is selected by comparing all FPPW values and detection rates. Note that the detection rate does not necessarily equal the mean correct rate, since confusion may exist among different posture categories.

Six groups of pass-rates are compared. Each group is written as a triple of percentages, meaning that, for the first three stage-classifiers, the pass-rate of posture examples is successively set to the first, second, and third percentage. Each group contains exactly three pass-rate settings because there are exactly four stage-classifiers in each cascade detector, and the fourth stage is a multiclass softmax model that is not modified. The variation of FPPW with the stage-level is presented in Figure 7, and the detection rates are illustrated in Figure 8. Besides the fact that both FPPW and detection rate increase with the product of the three pass-rates, we have another important observation: when the product of the three pass-rates is fixed, the detection rate differs significantly between groups, while the FPPW values are very close. This indicates that the detectors trained with the better-performing settings are more discriminative. This observation suggests that, to achieve good performance, it is better to set lower pass-rates for classifiers at low stages and higher pass-rates for classifiers at higher stages.
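The role of the product of the pass-rates can be illustrated with a few lines of arithmetic. The sketch uses the pass-rates reported for the cascade in Section 4.3 (98.0%, 98.5%, and 99.0%). Treating the product as the fraction of posture training examples that survives all three binary stages is a simplifying assumption, and the reversed ordering is a hypothetical setting included only to show that different orderings share the same product.

```python
import math

def overall_pass_rate(stage_pass_rates):
    """Product of the per-stage pass-rates of the binary stages."""
    return math.prod(stage_pass_rates)

increasing = [0.980, 0.985, 0.990]        # lower pass-rates at the lower stages
decreasing = list(reversed(increasing))   # hypothetical alternative with the same product
product_increasing = overall_pass_rate(increasing)
product_decreasing = overall_pass_rate(decreasing)
```

Both orderings give a product of about 0.9556, so the product alone cannot distinguish them. The comparison in this subsection concerns how the same overall budget is distributed across the stages.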

5. Conclusion and Future Work

In this work, a softmax-based cascade detector is proposed to perform multiclass hand posture detection in parallel. The cascade contains several SftB classifiers, used for distinguishing all predefined postures from the background, and a SftM classifier, mainly used to discriminate among all predefined hand postures. Moreover, HOG features of increasing resolution are adopted by stage-classifiers of increasing stage-level so as to further improve the efficiency without sacrificing the detection accuracy. The experimental comparison of ROC curves demonstrates the superiority of the proposed SftB classifier, and the evaluation results on a challenging dataset indicate that the proposed model structure improves both accuracy and efficiency compared with the noncascade multiclass posture detection methods. In future work, we will replace the softmax-based stage-classifiers in the cascade with more expressive classification models, such as convolutional neural networks, to further improve the accuracy of single-stage classification.

Appendix

Acronyms, Definitions, and Terminology

See Table 4.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Key R&D Program of China (no. 2017YFB1103602), the National Natural Science Foundation of China (nos. 51705513, U1613213, and U1713213), Shenzhen Science Plan (KQJSCX20170731165108047 and JCYJ20170413152535587), and Shenzhen Engineering Laboratory for 3D Content Generating Technologies (no. [2017]476).