Hand gesture recognition is a topic in artificial intelligence and computer vision whose goal is to interpret human hand gestures automatically by means of algorithms. Because it is a difficult classification task on which a single simple classifier cannot achieve satisfactory performance, several classifier combination techniques are employed in this paper to handle it. On the hand gesture data considered here, AdaBoost and rotation forest are seen to behave significantly better than all the other considered algorithms, especially a single classification tree. By investigating the bias-variance decompositions of error for all the compared algorithms, the success of AdaBoost and rotation forest can be attributed to the fact that each of them simultaneously reduces the bias and variance terms of a single tree's error to a large extent. In addition, kappa-error diagrams are used to study the diversity-accuracy patterns of the constructed ensemble classifiers visually.

1. Introduction

Hand gesture language, one type of sign language, originated from deaf people communicating with each other. To convey a meaning, a person generally needs to combine the shape, orientation, and movement of the hands simultaneously. The complex spatial grammars of hand gesture language are markedly different from the grammars of spoken languages.

Apart from helping deaf people express their thoughts more conveniently, hand gestures are also a very natural part of human communication for hearing people. In some special situations, such as very noisy environments where speech is not possible, they can become the primary communication medium. With its rapid development, hand gesture language has been applied in many fields such as human-computer interaction and visual surveillance [1–3]. Thus, hand gesture recognition has become a hot topic in artificial intelligence and computer vision, with the goal of interpreting human hand gestures automatically by means of algorithms. Nevertheless, the great variability in the spatial and temporal features of a hand gesture, such as that in time, size, and position, as well as interpersonal differences, makes the recognition problem very difficult. For instance, different subjects have different hand appearances and may sign gestures at different paces.

Recent work in hand gesture recognition tends to handle the spatial and temporal variations separately, which has led to two smaller areas, namely, static posture recognition [4–10] and dynamic action recognition [2, 3, 11, 12]. In static posture recognition, the pose or the configuration of the hands is recognized using texture or some other features. By contrast, hand action recognition tries to interpret the meaning of the movement using dynamic features such as the trajectory of the hands. In the current study, we focus on hand posture classification and recognition.

As for hand posture recognition, Bedregal et al. [4] introduced a fuzzy rule-based method for recognizing the hand gestures of LIBRAS (the Brazilian Sign Language). The method utilizes the set of angles of finger joints for classifying hand configurations and classifications of segments of hand gestures for recognizing gestures based on the concept of monotonic gesture segment. Just et al. [5] applied an approach that has been successfully used for face recognition to the hand posture recognition. The used features are based on the modified census transform and are illumination invariant. To achieve the classification and recognition processes, a simple linear classifier is trained using a set of feature look-up tables. Kim and Cipolla [7] attempted to address gesture recognition under small sample size where direct use of traditional classifiers is inappropriate due to high dimensionality of input space. Through combining canonical correlation analysis with the discriminative functions and scale-invariant feature transform (SIFT), they developed a pairwise feature extraction method for robust gesture recognition. In the experiments using 900 videos of 9 hand gesture classes, the proposed procedure was seen to notably outperform support vector classifier and relevance vector classifier. Based on a hand gesture fitting procedure via a new self-growing and self-organized neural gas (SGONG) network, a new method for hand gesture recognition was proposed by Stergiopoulou and Papamarkos [8]. The main idea of this method is as follows. Initially, the region of the hand is detected by applying a color segmentation technique based on a skin color filtering procedure in the 𝑌𝐶𝑏𝐶𝑟 color space. Then, the SGONG network is applied on the hand area so as to approach its shape. Based on the output grid of neurons produced by the neural network, palm morphologic characteristics are extracted. 
These characteristics, in accordance with powerful finger features, allow the identification of the raised fingers. Finally, the hand gesture recognition is accomplished through a likelihood-based classification technique. The proposed system has been extensively tested with success. Furthermore, Flasiński and Myśliński [9] presented a novel method for recognizing hand postures of the Polish sign language based on a synthetic pattern recognition paradigm. The main objective is to construct algorithms for generating a structural graph description of hand postures that can be analyzed with the ETPL(𝑘) graph grammar parsing model. The structural description generated with the designed algorithms is unique and unambiguous, which results in good discriminative properties of the method.

In recent years, classifier combination strategies have been growing rapidly and have attracted a lot of attention in pattern recognition as well as many other domains because of their potential to greatly increase the prediction accuracy of a learning system. So far, these techniques have proven to be quite versatile in a broad range of real applications such as face recognition and sentiment classification [13–15]. Compared with a single classifier, an ensemble classifier is better able to handle a classification task that is difficult for traditional methods and to achieve much higher prediction accuracy. In the research on hand gesture recognition, however, there is very little literature on the application of ensemble classifier methods. Dinh et al. [6] proposed a hand gesture classification system which is able to efficiently recognize 24 basic signs of the American sign language. In the system, computational performance is achieved through the use of a boosted cascade of classifiers that are trained by AdaBoost and informative Haar wavelet features. To adapt to the complex representation of hand gestures, a new type of feature was suggested. Some experimental results show that the proposed approach is promising. Burger et al. [10] suggested applying a belief-based method for SVM (support vector machine) fusion to recognize hand shapes. Moreover, the method was integrated into a wider classification scheme which allows other sources of information to be taken into account by expressing them in the formalism of belief theories. The experimental results showed that the proposed method avoided more than 1/5 of the mistakes made by the classical methods. In this paper, we employ several classifier combination techniques to deal with this specific classification problem. On the basis of the hand gesture data at hand, AdaBoost and rotation forest are seen to behave significantly better than all the other considered algorithms, especially a classification tree. The reasons for the success of AdaBoost and rotation forest are then investigated by analyzing the bias-variance decompositions of error for all the compared algorithms. Moreover, the diversity-accuracy patterns of each ensemble classifier are studied visually via kappa-error diagrams, and some promising results are obtained.

The rest of the paper is organized as follows. In Section 2, some commonly used classifier combination methods are briefly reviewed. Section 3 describes the hand gesture recognition data set used in this study. Some experiments are conducted in Section 4 to find the best method for this specific classification task, and the reasons for the better performance of some algorithms are also investigated. Finally, the conclusions of the paper are offered in Section 5.

2. Brief Review of Ensemble Methods

In this section, we give a brief review of the classifier combination techniques used later. In an ensemble classifier, multiple classifiers, generally referred to as base classifiers, are first generated by applying a base learning algorithm (also called a base learner) to different distributions of the training data, and then the outputs of the ensemble members are combined with a classifier fusion rule to classify a new example. In order to construct an ensemble classifier with better prediction capability, its constituent members should be accurate and, at the same time, should disagree with each other as much as possible. In other words, the diversity between base classifiers and their individual accuracy are two essential factors for building a good ensemble classifier. However, these two factors are contradictory in practice. Generally speaking, there is a tradeoff between diversity and accuracy: as the classifiers become more diverse, they tend to become less accurate; conversely, as they become more accurate, the diversity between them tends to decrease. The different ensemble classifier generation strategies mainly differ in how they achieve a better tradeoff between the diversity and the accuracy of their base classifiers.

In order to facilitate the following discussions, we first introduce some notations here. Denote by $\mathcal{L}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}$ a training set consisting of $N$ observations, where $\mathbf{x}_i=(x_{i,1},x_{i,2},\ldots,x_{i,p})^T$ is a $p$-dimensional feature vector and $y_i$ is a class label coming from the set $\Phi=\{\phi_1,\phi_2,\ldots,\phi_J\}$. Meanwhile, let $\mathbf{X}=(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_N)^T$ be an $N\times p$ matrix containing the training features and $\mathbf{Y}=(y_1,y_2,\ldots,y_N)^T$ be an $N$-dimensional vector containing the class labels for the training data. Put another way, the training set $\mathcal{L}$ can be expressed as concatenating $\mathbf{X}$ and $\mathbf{Y}$ horizontally; that is, $\mathcal{L}=[\mathbf{X}\ \mathbf{Y}]$. Furthermore, let $T$ be the number of base classifiers, and let $C_1,C_2,\ldots,C_T$ be the $T$ classifiers used to construct an ensemble classifier, say, $C^*$. Denote by $\mathcal{W}$ the given base learning algorithm to train each base classifier.

Bagging [16], random subspace [17], and random forest [18] may be the three most intuitive and simplest ensemble learning methods to implement. These three methods share the same combination rule, namely, a simple majority voting scheme, to combine the decisions of their base classifiers. They differ only in how they use the given training set to generate a diverse set of base classifiers. Breiman's bagging, an acronym of bootstrap aggregating, trains its base classifiers by applying a base learning algorithm to bootstrap samples [19] of the given training set ℒ. Each bootstrap sample is generated by performing 𝑁 extractions with replacement from ℒ. As a result, in each of the resulting training sets for constructing base classifiers, many of the original training examples may appear several times whereas others may never occur. The random subspace method [17] obtains different versions of the original training set by performing modifications in the feature space (i.e., randomly selecting some features) rather than in the example space as bagging does. The different training sets are then provided as the input of the given learning algorithm to build its base classifiers. Random forest [18] is an ensemble technique that takes a decision tree algorithm [20] as its base learner. Besides utilizing bootstrap sampling to obtain different training sets as in bagging, random forest tries to produce additional diversity between base classifiers by adding a randomization principle to the tree induction process: at each nonterminal node it randomly selects a feature subset of size 𝐾 (a hyperparameter of random forest) and then chooses the best split among these features. In Algorithm 1, we present the general algorithmic framework for these three ensemble methods.

Algorithm 1: General algorithmic framework for bagging, random subspace, and random forest.
● Input
  A training set $\mathcal{L}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}$; a base learner $\mathcal{W}$; number of iterations $T$; a new data point $\mathbf{x}$ to be classified.
● Training Phase
  For $t=1,\ldots,T$
   (1) Utilize the corresponding technique (i.e., bootstrap sampling or randomly selecting features) to get a training set $\mathcal{L}_t$.
   (2) Provide $\mathcal{L}_t$ as the input of $\mathcal{W}$ (random forest has an additional randomness injection operation) to train base classifier $C_t$.
● Output
   – The class label for $\mathbf{x}$ predicted by the ensemble classifier $C^*$ as
       $$C^*(\mathbf{x})=\arg\max_{y\in\Phi}\sum_{t=1}^{T} I\left(C_t(\mathbf{x})=y\right),$$
   where $I(\cdot)$ denotes the indicator function, which takes value 1 or 0 depending on whether its condition is true or false.
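
The framework of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `nearest_neighbor_learner`, `train_ensemble`, and `predict` are hypothetical, a 1-nearest-neighbour rule stands in for the base learner 𝒲 (the experiments in this paper use a decision tree), and the random subspace variant keeps one half of the features. Random forest's per-node feature sampling is not shown.

```python
import random
from collections import Counter

def nearest_neighbor_learner(X, y):
    """Hypothetical stand-in for the base learner W (1-nearest-neighbour rule)."""
    def classify(x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X]
        return y[dists.index(min(dists))]
    return classify

def train_ensemble(X, y, T, mode="bagging", seed=0):
    """Training phase: build T base classifiers on perturbed training sets."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    members = []
    for _ in range(T):
        if mode == "bagging":
            # Step (1), bagging: N draws with replacement, all features kept.
            rows = [rng.randrange(n) for _ in range(n)]
            cols = list(range(p))
        else:
            # Step (1), random subspace: all rows kept, half of the features.
            rows = list(range(n))
            cols = sorted(rng.sample(range(p), max(1, p // 2)))
        X_t = [[X[i][j] for j in cols] for i in rows]
        y_t = [y[i] for i in rows]
        # Step (2): train the base classifier; remember which features it uses.
        members.append((nearest_neighbor_learner(X_t, y_t), cols))
    return members

def predict(members, x):
    """Output: unweighted majority vote over the base classifiers."""
    votes = Counter(clf([x[j] for j in cols]) for clf, cols in members)
    return votes.most_common(1)[0][0]
```

Storing the selected column indices next to each base classifier keeps the prediction step uniform for both variants, since a new point must be projected onto the same features the member was trained on.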

Nowadays, boosting can be deemed the largest algorithmic family in the domain of ensemble learning. Unlike bagging, whose base classifiers can be trained in parallel, boosting is a sequential algorithm in which each new classifier is built by taking into account the performance of the previously generated classifiers. AdaBoost [21, 22], due to its simplicity and adaptability, has become the most prominent member of the boosting family. AdaBoost constructs an ensemble of subsidiary classifiers by applying a base learner to successively derived training sets that are formed by either resampling from the original training set [21] or reweighting the original training set [22] according to a set of weights maintained over the training set. Initially, the weights assigned to the training examples are set to be equal, and, in subsequent iterations, these weights are adjusted so that the weight of the instances misclassified by the previously trained classifiers is increased whereas that of the correctly classified ones is decreased. Thus, AdaBoost attempts to produce new classifiers that are better able to predict the examples that are "hard" for the previous ensemble members. After a sequence of classifiers has been trained, they are combined by weighted majority voting in the final decision. Algorithm 2 lists the main steps of the resampling version of AdaBoost, which is utilized in our later experiments.

Algorithm 2: The resampling version of AdaBoost.
● Input
  A training set $\mathcal{L}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}$; a base learner $\mathcal{W}$; number of iterations $T$; a new data point $\mathbf{x}$ to be classified.
● Training Phase
  Initialization: Set the weight distribution over $\mathcal{L}$ as $D_1(i)=1/N$ ($i=1,2,\ldots,N$).
  For $t=1,\ldots,T$
  (1) According to the distribution $D_t$, draw $N$ training instances at random from $\mathcal{L}$ with replacement to compose a new set $\mathcal{L}_t=\{(\mathbf{x}_i^{(t)},y_i^{(t)})\}_{i=1}^{N}$.
  (2) Provide $\mathcal{L}_t$ as the input of $\mathcal{W}$ to train a classifier $C_t$, and then compute the weighted training error of $C_t$ as
          $$\varepsilon_t=\Pr_{i\sim D_t}\left(C_t(\mathbf{x}_i)\neq y_i\right)=\sum_{i=1}^{N} I\left(C_t(\mathbf{x}_i)\neq y_i\right)D_t(i), \tag{1}$$
   where $I(\cdot)$ takes value 1 or 0 depending on whether the $i$th training instance is misclassified by $C_t$ or not.
  (3) If $\varepsilon_t>0.5$ or $\varepsilon_t=0$, then set $T=t-1$ and abort loop.
  (4) Let $\alpha_t=(1/2)\ln\left((1-\varepsilon_t)/\varepsilon_t\right)$.
  (5) Update the weight distribution $D_t$ over $\mathcal{L}$ as
          $$D_{t+1}(i)=\frac{D_t(i)}{Z_t}\times\begin{cases}e^{-\alpha_t}, & \text{if } C_t(\mathbf{x}_i)=y_i,\\ e^{\alpha_t}, & \text{if } C_t(\mathbf{x}_i)\neq y_i,\end{cases} \tag{2}$$
   where $Z_t$ is a normalization factor chosen so that $D_{t+1}$ is a probability distribution over $\mathcal{L}$.
● Output
  – The class label for $\mathbf{x}$ predicted by the ensemble classifier $C^*$ as
          $$C^*(\mathbf{x})=\arg\max_{y\in\Phi}\sum_{t=1}^{T}\alpha_t\, I\left(C_t(\mathbf{x})=y\right).$$
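
To make the steps of Algorithm 2 concrete, the following Python sketch mirrors them one by one. It is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, and a one-split decision stump (`stump_learner`) is assumed as the base learner 𝒲 in place of a full decision tree.

```python
import math
import random
from collections import Counter, defaultdict

def stump_learner(X, y):
    """Hypothetical weak base learner W: best single-feature threshold split."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, j, thr, l_lab, r_lab)
    _, j, thr, l_lab, r_lab = best
    return lambda x: l_lab if x[j] <= thr else r_lab

def update_weights(D, miss, alpha):
    """Equation (2): raise weights of misclassified instances, then normalize."""
    new = [d * math.exp(alpha if m else -alpha) for d, m in zip(D, miss)]
    Z = sum(new)  # normalization factor Z_t
    return [d / Z for d in new]

def adaboost_resampling(X, y, T, seed=0):
    rng = random.Random(seed)
    n = len(X)
    D = [1.0 / n] * n                                  # initialization: D_1(i) = 1/N
    ensemble = []
    for _ in range(T):
        idx = rng.choices(range(n), weights=D, k=n)    # step (1): resample by D_t
        clf = stump_learner([X[i] for i in idx], [y[i] for i in idx])  # step (2)
        miss = [clf(X[i]) != y[i] for i in range(n)]
        eps = sum(d for d, m in zip(D, miss) if m)     # Equation (1)
        if eps > 0.5 or eps == 0:                      # step (3): stop early
            break
        alpha = 0.5 * math.log((1 - eps) / eps)        # step (4)
        D = update_weights(D, miss, alpha)             # step (5)
        ensemble.append((alpha, clf))
    return ensemble

def predict(ensemble, x, labels):
    """Output: weighted majority vote with weights alpha_t."""
    score = defaultdict(float)
    for alpha, clf in ensemble:
        score[clf(x)] += alpha
    return max(labels, key=lambda lab: score[lab])
```

Note that, as in step (3) of the algorithm, the loop may stop before `T` classifiers are built, so the returned ensemble can be shorter than requested (and even empty if the first classifier already has zero weighted error).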

Based on principal component analysis (PCA), Rodríguez et al. [23] developed a novel technique for creating ensemble classifiers, called rotation forest, and demonstrated that it outperforms several other ensemble methods on some benchmark classification data sets from the UCI repository [24]. To create the training data for a base classifier, the feature set of ℒ is randomly split into several subsets, and PCA is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, some axis rotations take place to form new features for training a base classifier. The main idea of rotation forest is to simultaneously encourage diversity and individual accuracy within an ensemble classifier. Specifically, diversity is promoted by using PCA to rotate the feature axes for each base classifier, while accuracy is sought by keeping all principal components and by using the whole data set to train each base classifier. We summarize the detailed steps of rotation forest in Algorithm 3. Note that this algorithm has one more parameter that should be specified in advance, namely, the number of features 𝑀 contained in each feature subset. For simplicity, suppose that 𝑀 is a factor of 𝑝 so that the features are distributed into 𝐾=𝑝/𝑀 subsets, each containing 𝑀 features. Otherwise, the 𝐾th feature subset will have 𝑝−(𝐾−1)𝑀 features. According to the results reported by Rodríguez et al. [23], rotation forest with 𝑀=3 performs satisfactorily, which gives users practical guidance for choosing a suitable value of 𝑀.

Algorithm 3: The rotation forest algorithm.
● Input
  A training set $\mathcal{L}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}=[\mathbf{X}\ \mathbf{Y}]$; number of input features $M$ contained in each feature subset; a base learner $\mathcal{W}$; number of iterations $T$; a new data point $\mathbf{x}$ to be classified.
● Training Phase
  For $t=1,2,\ldots,T$
  – Calculate the rotation matrix $\mathbf{R}_t^a$ for the $t$th classifier $C_t$
    (1) Randomly split the feature set $F=\{X_1,X_2,\ldots,X_p\}$ into $K$ subsets $F_{t,k}$ ($k=1,2,\ldots,K$).
    (2) For $k=1,2,\ldots,K$
      (a) Select the columns of $\mathbf{X}$ that correspond to the attributes in $F_{t,k}$ to compose a submatrix $\mathbf{X}_{t,k}$.
      (b) Draw a bootstrap sample $\mathbf{X}'_{t,k}$ (with sample size smaller than that of $\mathbf{X}_{t,k}$, generally taken to be 75%) from $\mathbf{X}_{t,k}$.
      (c) Apply PCA to $\mathbf{X}'_{t,k}$ to obtain a matrix $\mathbf{D}_{t,k}$ whose $i$th column consists of the coefficients of the $i$th principal component.
    (3) EndFor
    (4) Arrange the matrices $\mathbf{D}_{t,k}$ ($k=1,2,\ldots,K$) into a block diagonal matrix $\mathbf{R}_t$.
    (5) Construct the rotation matrix $\mathbf{R}_t^a$ by rearranging the rows of $\mathbf{R}_t$ so that they correspond to the original features in $F$.
  – Provide $[\mathbf{X}\mathbf{R}_t^a\ \ \mathbf{Y}]$ as the input of $\mathcal{W}$ to build a classifier $C_t$.
● Output
  – The class label for $\mathbf{x}$ predicted by the ensemble classifier $C^*$ as
          $$C^*(\mathbf{x})=\arg\max_{y\in\Phi}\sum_{t=1}^{T} I\left(C_t(\mathbf{x}\mathbf{R}_t^a)=y\right).$$
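
The construction of a single rotation matrix in Algorithm 3 (steps (1) to (5)) can be sketched with NumPy as follows. This is an illustrative sketch only: the function name `rotation_matrix` and the parameter `sample_frac` are hypothetical, PCA is assumed to be computed from the eigenvectors of the sample covariance matrix, and the training of the base classifier on the rotated data is left out.

```python
import numpy as np

def rotation_matrix(X, M, rng, sample_frac=0.75):
    """Build one p x p rotation matrix R_t^a for the feature matrix X (N x p)."""
    n, p = X.shape
    perm = rng.permutation(p)                            # step (1): random split
    subsets = [perm[k:k + M] for k in range(0, p, M)]    # last subset may be smaller
    R = np.zeros((p, p))
    start = 0
    for cols in subsets:                                 # step (2)
        # (a)+(b): bootstrap sample (75% of N by default) of the selected columns.
        rows = rng.choice(n, size=int(sample_frac * n), replace=True)
        sub = X[np.ix_(rows, cols)]
        # (c): PCA loadings = eigenvectors of the covariance, all components kept.
        cov = np.atleast_2d(np.cov(sub, rowvar=False))
        _, vecs = np.linalg.eigh(cov)                    # ascending eigenvalues
        m = len(cols)
        R[start:start + m, start:start + m] = vecs[:, ::-1]  # step (4): block diagonal
        start += m
    R_a = np.empty_like(R)
    R_a[perm] = R            # step (5): row r of R belongs to original feature perm[r]
    return R_a

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 7))          # toy data: N = 40, p = 7
R_a = rotation_matrix(X, M=3, rng=rng)
X_rotated = X @ R_a                   # features passed to the base learner
```

Because every block holds orthonormal eigenvectors and step (5) only permutes rows, the resulting matrix is orthogonal, so the rotated data preserve all the variability of the original features.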

Furthermore, Melville and Mooney [25] proposed a new meta-learner, DECORATE (diverse ensemble creation by oppositional relabeling of artificial training examples), that can use any "strong" learner (one that provides high accuracy on the training data) to build a diverse ensemble. This is accomplished by adding different randomly constructed instances to the training set when building new ensemble members. The artificially constructed instances are given category labels that disagree with the prediction of the current ensemble, thereby directly increasing diversity when a new learner is trained on the augmented data and added to the ensemble. Based on experimental results using J48 decision-tree induction (an open source Java implementation of the C4.5 algorithm in the Weka data mining tool) as a base learner and on the analysis of the cross-validated learning curves for DECORATE as well as some other ensemble methods, Melville and Mooney [25] found that DECORATE produces highly accurate ensembles that outperform bagging, AdaBoost, and random forest low on the learning curve, that is, when the training set is small. In order to make this paper self-contained, we list the pseudocode of DECORATE in Algorithm 4.

Algorithm 4: The DECORATE algorithm.
● Input:
  $\mathcal{L}$: training set consisting of $N$ instances;
  $\mathcal{W}$: base learner whose output is assumed to be a class probability distribution;
  $C_{\text{size}}$: desired ensemble size;
  $I_{\max}$: maximum number of iterations to construct an ensemble classifier;
  $R_{\text{size}}$: a factor to determine the number of artificial instances to generate.
● Training phase
  – Initialization:
    Let $i=1$ and $\text{trials}=1$;
    Provide the given training set $\mathcal{L}$ as the input of base learner $\mathcal{W}$ to get a classifier $C_i$;
    Initialize ensemble set $C^*=\{C_i\}$;
  – Compute ensemble error as
          $$\varepsilon=\frac{1}{N}\sum_{i=1}^{N} I\left(C^*(\mathbf{x}_i)\neq y_i\right). \tag{3}$$
  – While $i<C_{\text{size}}$ and $\text{trials}<I_{\max}$
    (1) Generate $\lfloor R_{\text{size}}\times N\rfloor$ training instances, $\mathcal{R}$, according to the distribution of training data;
    (2) Label each instance in $\mathcal{R}$ so that the probability of each class label being selected is inversely proportional to that predicted by $C^*$;
    (3) Combine $\mathcal{L}$ with $\mathcal{R}$ to get a new training set $\mathcal{L}'$;
    (4) Apply base learner $\mathcal{W}$ to $\mathcal{L}'$ to obtain a new classifier $C'$;
    (5) Add $C'$ to ensemble set $C^*$, namely, let $C^*=C^*\cup\{C'\}$;
    (6) Based on the training set $\mathcal{L}$, compute the ensemble error of $C^*$, say, $\varepsilon'$, as done in equation (3);
    (7) If $\varepsilon'\leq\varepsilon$, let $i=i+1$ and update ensemble error as $\varepsilon=\varepsilon'$; otherwise, delete $C'$ from the ensemble set $C^*$, that is, $C^*=C^*-\{C'\}$;
    (8) $\text{trials}=\text{trials}+1$;
  – EndWhile
● Prediction phase
  – Let $p_{i,j}(\mathbf{x})$ be the probability that $\mathbf{x}$ comes from class $j$ supported by the classifier $C_i$. Calculate the confidence for each class by the mean combination rule, that is,
          $$d_j(\mathbf{x})=\frac{1}{L}\sum_{i=1}^{L}p_{i,j}(\mathbf{x}),\quad j=1,2,\ldots,J, \tag{4}$$
  where $L$ stands for the real ensemble size.
  – Assign $\mathbf{x}$ to the class with the largest confidence.

As can be seen in Algorithm 4, DECORATE builds an ensemble classifier iteratively, like the other ensemble methods. Initially, a classifier is trained on the given training data $\mathcal{L}$. In each successive iteration, one classifier is created by applying the base learner $\mathcal{W}$ to $\mathcal{L}$ combined with some artificial data. These artificial training instances are generated according to the distribution from which the given training data come (considering only the input variables at this stage), and the number of instances to be generated is specified as a fraction, $R_{\text{size}}$, of the training set size $N$. To label an artificially generated training instance $\mathbf{x}_k$, the current ensemble is first used to predict the class membership probabilities $\mathbf{P}(\mathbf{x}_k)=(P_1(\mathbf{x}_k),P_2(\mathbf{x}_k),\ldots,P_J(\mathbf{x}_k))^T$ that this instance belongs to each class. Then, zero probabilities are replaced with a small nonzero value and the probabilities are normalized so that they form a probability distribution $\hat{\mathbf{P}}(\mathbf{x}_k)=(\hat{P}_1(\mathbf{x}_k),\hat{P}_2(\mathbf{x}_k),\ldots,\hat{P}_J(\mathbf{x}_k))^T$. The label $y_k$ of the instance $\mathbf{x}_k$ is then determined such that the probability of each class $i$ being selected is inversely proportional to the ensemble's prediction; namely, $\hat{P}'_i(\mathbf{x}_k)=\left(1/\hat{P}_i(\mathbf{x}_k)\right)/\sum_{j=1}^{J}\left(1/\hat{P}_j(\mathbf{x}_k)\right)$. The main purpose of doing so is to make the labels of the artificially generated instances differ maximally from the current ensemble's predictions in order to promote diversity in the constructed ensemble classifier. For this reason, the labeled artificially created training set is called diversity data. On the other hand, DECORATE tries to maintain the accuracy of the ensemble while forcing diversity by rejecting a new classifier if adding it to the existing ensemble decreases the ensemble's accuracy. The whole process is repeated until the desired ensemble size is reached or the maximum number of iterations is exceeded.
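
The relabeling rule described above is short enough to write out. The sketch below is only an illustration: the function names and the replacement value `floor` for zero probabilities are assumptions, since the text only asks for "a small nonzero value".

```python
import random

def relabel_probabilities(ensemble_probs, floor=1e-3):
    """Map ensemble class probabilities to label-selection probabilities
    that are inversely proportional to them."""
    # Replace zero probabilities by a small nonzero value (assumed: floor).
    p = [q if q > 0 else floor for q in ensemble_probs]
    total = sum(p)
    p_hat = [q / total for q in p]          # renormalized distribution
    inverse = [1.0 / q for q in p_hat]
    z = sum(inverse)
    return [v / z for v in inverse]         # selection probabilities

def draw_label(labels, ensemble_probs, rng):
    """Sample a label for one artificial instance."""
    weights = relabel_probabilities(ensemble_probs)
    return rng.choices(labels, weights=weights, k=1)[0]

# An ensemble that predicts class "a" with probability 0.5 and "b", "c" with
# 0.25 each yields selection probabilities 0.2, 0.4, and 0.4.
probs = relabel_probabilities([0.5, 0.25, 0.25])
label = draw_label(["a", "b", "c"], [0.5, 0.25, 0.25], random.Random(0))
```

The class the ensemble considers most likely thus becomes the least likely label for the artificial instance, which is what pushes a classifier trained on the augmented data to disagree with the current ensemble.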

It is worth mentioning that, in DECORATE, the artificially generated training data are randomly picked from an approximation of the training-data distribution. For a numeric feature, the values are drawn from a Gaussian distribution whose mean and standard deviation are computed from the corresponding data in the training set. For a nominal feature, the probability of occurrence of each distinct value in its domain is first calculated, using Laplace smoothing so that nominal feature values not represented in the training set still have a nonzero probability of occurrence. Values are then generated from this distribution. Another issue that should be pointed out is that we can only specify a desired ensemble size 𝐶size when using DECORATE to deal with a classification task. The size 𝐿 of the finally obtained ensemble may be smaller than 𝐶size because the algorithm terminates if the number of iterations exceeds the maximum limit, even if 𝐶size has not been reached. As for 𝑅size, it can take any value in theory. Nevertheless, the experiments done by Melville and Mooney [25] have shown that values of 𝑅size lower than 0.5 adversely affect the performance of DECORATE, and that the results with 𝑅size chosen in the range from 0.5 to 1 do not vary much.
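
As a rough illustration of how such artificial feature values might be generated, consider the following sketch. The helper names are hypothetical, the sample standard deviation is assumed for the Gaussian, and add-one smoothing is assumed as the form of Laplace smoothing.

```python
import random
import statistics
from collections import Counter

def gaussian_sampler(values):
    """Numeric feature: Gaussian with mean and std estimated from the data."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)        # sample standard deviation (assumption)
    return lambda rng: rng.gauss(mu, sigma)

def laplace_smoothed_distribution(values, domain):
    """Nominal feature: add-one smoothing so unseen values keep nonzero mass."""
    counts = Counter(values)
    total = len(values) + len(domain)
    return {v: (counts[v] + 1) / total for v in domain}

def nominal_sampler(values, domain):
    dist = laplace_smoothed_distribution(values, domain)
    levels = list(dist)
    weights = [dist[v] for v in levels]
    return lambda rng: rng.choices(levels, weights=weights, k=1)[0]

rng = random.Random(0)
draw_numeric = gaussian_sampler([1.0, 2.0, 3.0, 4.0])
draw_nominal = nominal_sampler(["red", "red", "green"], ["red", "green", "blue"])
# Five artificial (numeric, nominal) instances; "blue" never occurs in the
# training values but can still be drawn, with probability 1/6.
artificial = [(draw_numeric(rng), draw_nominal(rng)) for _ in range(5)]
```

In this sketch each feature is sampled independently of the others, which matches the per-feature description above.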

3. Data Set

For a hand gesture recognition problem, the task is to design a classifier to recognize different hand gestures, where each gesture has the meaning of one simple or compound word. The data set used here contains 120 different signs of the Dutch sign language, each performed by 75 different persons. The images were captured at 640×480 pixels and 25 frames per second. Most sign examples include partial occlusions of the hands by each other or by the face/neck. For the detailed process of obtaining the experimental data, readers can refer to [3, 26]. The supplementary video can be found on the Computer Society Digital Library at http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2008.123. We briefly describe the data collection process as follows. When a person is making a hand gesture, two cameras are used to independently record the continuous activity of his or her left and right hands. Because the gesture is made continuously and features have to be extracted from it, three images (frames) which, respectively, denote the beginning, middle, and ending of the gesture were extracted from the video recorded by one camera; they are denoted as images 1, 2, and 3 here. For each obtained image, a segmentation algorithm was first used to segment the two hands from the background, and then 7 invariant moments were computed for each hand. This process was repeated for the video of the other camera. Finally, we obtained 84 features in total by collecting the computed moments corresponding to two hands, three images, and two cameras (7×2×3×2=84). In the original data set, there are 120 different gestures in total, among which 29 denote compound words and 91 indicate simple words. For each gesture, there are 75 objects made by different persons, and every object is described by the 84 features extracted in the above-mentioned way.

Unfortunately, we encountered a problem when segmenting the two hands from one image. Sometimes the head, the left hand, and the right hand overlap, which makes it impossible to compute meaningful moments for each hand. Therefore, some of the corresponding features cannot be obtained in the general manner, and they are indicated as missing. In our experiments, we only considered the objects without missing features. Meanwhile, we tried to select classes consisting of an approximately equal number of objects. By preprocessing the experimental data in this way, we finally obtained a set having 11 classes with about 70 objects in each class. The data set contains 793 objects in total, and each object is described by 84 features.

4. Experimental Study

4.1. Experimental Setting

In this section, we report experiments in which six commonly used classifier combination methods were applied to the hand gesture recognition data set described in Section 3. The considered ensemble methods include bagging [16], random forest [18], random subspace [17], AdaBoost [22], rotation forest [23], and DECORATE [25].

The experimental settings were as follows. In all the ensemble methods, a decision tree [20] was adopted as the base learning algorithm because it is sensitive to changes in its training data and can still be very accurate. All experiments were conducted in Matlab, version 7.7. The decision tree algorithm was realized by the "Treefit" algorithm contained in the "Stats" package of Matlab. The parameters involved in this algorithm, such as the minimum number of training instances that an impure node must contain in order to be split, were all set to their default values. The considered ensemble methods were implemented in Matlab by writing programs according to their respective pseudocode.

The ensemble size was set to 25 since the largest error reduction achieved by ensemble methods generally occurs in the first several iterations. Although a larger ensemble size may result in better performance, the improvement achieved at the cost of additional computational complexity is trivial in comparison with that obtained with just a few iterations. As for the hyperparameter 𝐾 in random forest, which specifies how many features are first selected at each nonterminal node in the process of building a decision tree, its value was taken to be ⌊log2(𝑝)+1⌋ since some experiments [18] have shown that this choice very often makes random forest achieve good performance. When using the random subspace technique to construct an ensemble classifier, one half of the features were randomly selected to train each of its constituent members. With respect to the parameter 𝑀, which indicates the number of features contained in each feature subset in rotation forest, we set it to 3 just as Rodríguez et al. [23] did because they found that this value was almost always the best choice in their experiments. In the DECORATE algorithm, the parameters except for the ensemble size were all identical to those utilized by Melville and Mooney [25]; namely, the maximum number of iterations 𝐼max to build an ensemble classifier was set to 50, and the factor 𝑅size, which determines the number of artificial examples to generate, was chosen to be 1. Here, it should be noted that we can only specify a desired ensemble size for the DECORATE algorithm, and it may terminate when the number of iterations exceeds the maximum limit even if the desired ensemble size has not been reached.
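
For the data set at hand (𝑝=84 features), the settings above translate into concrete numbers. The short Python snippet below simply evaluates them; the dictionary keys and the function name are hypothetical labels chosen for illustration, and the experiments themselves were run in Matlab.

```python
import math

def experiment_settings(p):
    """Hyperparameter values described in Section 4.1 for p input features."""
    M = 3  # features per subset in rotation forest
    return {
        "ensemble_size": 25,
        "random_forest_K": math.floor(math.log2(p) + 1),  # features tried per node
        "random_subspace_features": p // 2,               # one half of the features
        "rotation_forest_M": M,
        "rotation_forest_subsets": math.ceil(p / M),      # number of feature subsets
        "decorate_I_max": 50,
        "decorate_R_size": 1,
    }

settings = experiment_settings(84)
print(settings["random_forest_K"])            # 7
print(settings["random_subspace_features"])   # 42
print(settings["rotation_forest_subsets"])    # 28
```

Since 84 is divisible by 3, every rotation forest feature subset contains exactly three features for this data set.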

4.2. Results and Discussion
4.2.1. Comparison of Prediction Error

Because no separate training and test sets were available, we employed 10-fold cross-validation to investigate the performance of the considered classification methods. Specifically, the data were first split into ten subsets of approximately equal size, and then nine of them were utilized as a training set to construct a classifier while the remaining one was used to estimate the prediction error of that classifier. The experiment was conducted ten times by alternating the roles of the ten subsets until each of them had been used for testing once. We repeated the above process ten times with different random seeds for splitting the data in order to reduce the impact of random factors on the measured performance of each algorithm.
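
The evaluation protocol, ten repetitions of 10-fold cross-validation, can be sketched as an index generator. This is a minimal sketch under assumptions: the function name `repeated_kfold` and the way seeds are derived (`seed + r`) are illustrative choices, and the folds here are not stratified by class because the text does not state whether stratification was used.

```python
import random

def repeated_kfold(n, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    for r in range(repeats):
        rng = random.Random(seed + r)       # a different seed for every repetition
        indices = list(range(n))
        rng.shuffle(indices)
        # Deal the shuffled indices into k folds of approximately equal size.
        folds = [indices[f::k] for f in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield r, f, train, test

# With the 793 objects of the data set this gives 100 train/test splits,
# each test fold holding 79 or 80 objects.
splits = list(repeated_kfold(793))
```

Each of the 100 splits would be used to train every compared method on `train` and to record its error on `test`.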

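Before utilizing each obtained training set to carry out experiments, we preprocessed the data with a normalization technique. Given a training set $\mathcal{L}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}$, the normalization of the values corresponding to each feature $X_j$ ($j=1,2,\ldots,p$) can be expressed as
$$\bar{x}_j^{c}=\frac{1}{N_c}\sum_{i=1}^{N_c}x_{ij},\quad s_j^{c}=\frac{1}{N_c-1}\sum_{i=1}^{N_c}\left(x_{ij}-\bar{x}_j^{c}\right)^2,\quad s=\sum_{c=1}^{m}P(\omega_c)\,s_j^{c},\quad x'_{ij}=\frac{x_{ij}-\bar{x}_j^{c}}{\sqrt{s}},\quad i=1,2,\ldots,N_c, \tag{4.1}$$
where the sums run over the $N_c$ training instances of class $\omega_c$, $\bar{x}_j^{c}$ and $s_j^{c}$, respectively, denote the mean and variance for class $\omega_c$, and $s$ is the weighted sum of the variances for each class with weights equal to the class prior probabilities $P(\omega_c)$. After obtaining $\bar{x}_j^{c}$ and $s$, the same mapping was applied to the test set.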
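
For a single feature, the quantities in (4.1) can be computed as in the following sketch. The function names are hypothetical, the class priors are assumed to be estimated by the class frequencies in the training set, and only the training-side computation is shown.

```python
import math

def fit_class_scaling(column, labels):
    """Class means and the prior-weighted pooled variance s for one feature."""
    n = len(column)
    means, s = {}, 0.0
    for c in sorted(set(labels)):
        values = [x for x, lab in zip(column, labels) if lab == c]
        mean_c = sum(values) / len(values)
        # Unbiased class variance (each class needs at least two instances).
        var_c = sum((x - mean_c) ** 2 for x in values) / (len(values) - 1)
        means[c] = mean_c
        s += (len(values) / n) * var_c      # prior assumed to be the class frequency
    return means, s

def transform(column, labels, means, s):
    """Center each value by its class mean and scale by sqrt(s)."""
    return [(x - means[lab]) / math.sqrt(s) for x, lab in zip(column, labels)]

column = [1.0, 3.0, 10.0, 14.0]
labels = ["a", "a", "b", "b"]
means, s = fit_class_scaling(column, labels)   # means: a -> 2, b -> 12; s = 5
scaled = transform(column, labels, means, s)
```

In this toy example the class variances are 2 and 8, and with equal priors of 0.5 the pooled variance is 5, so every value is divided by the same factor √5 regardless of its class.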

Table 1 reports the mean and the standard deviation of the computed test errors for each algorithm. To make the comparison complete, the results obtained with a single classification tree are also included. In Table 1, the best results are highlighted in boldface to facilitate the comparison. To determine whether the performance of the evaluated ensemble methods differs significantly on this specific data set, we carried out a one-tailed paired 𝑡-test with significance level 𝛼=0.01 between each pair of algorithms. If an algorithm was found to be significantly better than its competitor, we assigned a score of 1 to the former and −1 to the latter. If there was no significant difference between the two compared methods, both scored 0. Obviously, the higher the score of an approach, the better its performance. The third row of Table 1 lists the total score of each classification method, that is, the number of times it was significantly better than another algorithm minus the number of times it was significantly worse.
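The scoring scheme can be sketched as follows. This is an illustrative reading of the procedure, not the code used for Table 1: each pair of methods is compared with a paired 𝑡 statistic on their per-fold errors, and the critical value `t_crit` must be supplied by the caller for the chosen significance level and degrees of freedom (for instance, 2.821 for a one-tailed test at 𝛼=0.01 with 9 degrees of freedom). The function names are hypothetical.

```python
import math


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    if var == 0.0:
        # Degenerate case: all paired differences are identical.
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)


def score_methods(errors, t_crit):
    """Give +1 to the significantly better method of each pair and -1 to the other."""
    scores = {name: 0 for name in errors}
    names = list(errors)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            t = paired_t_statistic(errors[a], errors[b])
            if t < -t_crit:      # a has significantly lower error than b
                scores[a] += 1
                scores[b] -= 1
            elif t > t_crit:     # b has significantly lower error than a
                scores[b] += 1
                scores[a] -= 1
    return scores
```

Because every significant comparison adds 1 to one method and subtracts 1 from the other, the scores of all methods always sum to zero.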

From the mean test errors and scores obtained for each classification method, it can be observed that every classifier combination technique greatly improves on the prediction error of a single decision tree, especially AdaBoost and rotation forest. Among the ensemble algorithms, DECORATE performs much worse than the other ensemble learning strategies; the reason may be that its main advantage lies in classification problems with small training sets, whereas the sample size of the current hand gesture recognition data set is medium. Based on the scores calculated from the statistical tests between each pair of algorithms, rotation forest is the best method for this specific problem, performing significantly better than all the other algorithms at significance level 𝛼=0.01. AdaBoost is the second best approach, since it was beaten only by rotation forest. In contrast, a single decision tree and DECORATE perform poorly and should not be selected for this problem. The other three ensemble methods perform almost equivalently, even though their working mechanisms differ as described in Section 2.

4.2.2. Bias-Variance Decomposition of Error

To investigate why an ensemble classifier performs better than its constituent members, it is useful to decompose its error into bias and variance terms, an approach that has been used by many researchers [27–29]. The decomposition of a learning machine's error into bias and variance terms originates from the analysis of learning models with numeric outputs in regression problems. Given a fixed target and training set size, the conventional formulation of the decomposition breaks the expected error into the sum of three nonnegative quantities.
(i) Intrinsic “target noise” (𝜎²). This quantity is a lower bound on the expected error of any learning algorithm. It is the expected error of the Bayes optimal classifier.
(ii) Squared “bias” (bias²). This quantity measures how closely the learning algorithm's average guess (over all possible training sets of the given size) matches the target.
(iii) “Variance” (variance). This quantity measures how much the learning algorithm's guess fluctuates across the different training sets of the given size.

Because the above decomposition cannot be directly translated to contexts where the value to be predicted is categorical, a number of ways to decompose error into bias and variance terms for classification tasks have been proposed [30–33]. Each of these definitions provides some valuable insight into different aspects of a learning machine's performance. To gain more insight into the performance of the considered ensemble methods on the hand gesture recognition data set, we used the bias-variance definition developed by Kohavi and Wolpert [30]; the two terms are denoted by Bias and Var, respectively, in the following discussion.

Let 𝑌𝐻 and 𝑌𝐹 denote the random variables representing, respectively, the evaluated and true labels of an instance. Then Bias and Var for a testing instance (𝐱,𝑦) can be expressed as

$$\mathrm{Bias}(\mathbf{x}) = \frac{1}{2}\sum_{y'\in\Phi}\left[\Pr\left(Y_F=y'\mid\mathbf{x}\right)-\Pr\left(Y_H^{\mathcal{L}}=y'\mid\mathbf{x}\right)\right]^2,$$

$$\mathrm{Var}(\mathbf{x}) = \frac{1}{2}\left\{1-\sum_{y'\in\Phi}\left[\Pr\left(Y_H^{\mathcal{L}}=y'\mid\mathbf{x}\right)\right]^2\right\}. \tag{4.2}$$

Here, the superscript ℒ in $Y_H^{\mathcal{L}}$ denotes that the evaluated class label is predicted by the machine trained on the set ℒ. The term Pr(⋅) in the above formulae can be computed as the frequency with which the event in the parentheses occurs in trials conducted with different training sets of the given size.

To compute the above two statistics exactly, the distribution from which the training data are drawn must be known in advance. Unfortunately, all that is available in the current hand gesture recognition situation is a learning sample of medium size. Consequently, the Bias and Var terms have to be estimated instead. In our experiments, a method similar to that used in [32], that is, ten trials of the 10-fold cross-validation procedure, was used to estimate the bias and variance defined above. Once the cross-validation trials have been completed, the relevant measures can be estimated directly from the observed distribution of results. Using cross-validation in this way has the advantage that every instance in the available data ℒ is used the same number of times, both for training and for testing.

Following this approach, the Bias and Var terms of the error of each classification method were estimated for each instance in the given data set, and their values were then averaged over the data set. The detailed decompositions of mean error into Bias and Var for each classification method are provided in Table 2.
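The estimation just described can be sketched as follows for the definitions in (4.2). The sketch assumes a noise-free target, that is, Pr(𝑌𝐹=𝑦′∣𝐱) is taken to be 1 for the observed label and 0 otherwise; this is an assumption of the example, since only one label per instance is observed. Pr(𝑌𝐻=𝑦′∣𝐱) is estimated by the frequency of each predicted label over the trials. The function names are hypothetical.

```python
from collections import Counter


def kw_bias_var(true_label, predictions, classes):
    """Kohavi-Wolpert Bias and Var for one instance.

    predictions: labels predicted for this instance across the trials.
    """
    counts = Counter(predictions)
    n = len(predictions)
    bias_sum, sq_sum = 0.0, 0.0
    for c in classes:
        p_h = counts[c] / n                      # estimated Pr(Y_H = c | x)
        p_f = 1.0 if c == true_label else 0.0    # assumed noise-free target
        bias_sum += (p_f - p_h) ** 2
        sq_sum += p_h ** 2
    return 0.5 * bias_sum, 0.5 * (1.0 - sq_sum)


def mean_bias_var(true_labels, predictions_per_instance, classes):
    """Average the per-instance Bias and Var over the data set."""
    pairs = [
        kw_bias_var(t, p, classes)
        for t, p in zip(true_labels, predictions_per_instance)
    ]
    n = len(pairs)
    return sum(b for b, _ in pairs) / n, sum(v for _, v in pairs) / n
```

Under the noise-free assumption, Bias and Var for an instance add up to its misclassification rate over the trials. For example, an instance predicted correctly in 8 of 10 trials and assigned one other label in the remaining 2 has Bias 0.04 and Var 0.16, which sum to the error rate 0.2.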

As can be seen from Table 2, the considered classification methods ranked by Bias from best to worst are AdaBoost, RotForest, RandForest, Bagging, RandSubspace, DECORATE, and SingleTree. With regard to Var, the ranking from best to worst is RotForest, RandSubspace, Bagging, AdaBoost, RandForest, DECORATE, and SingleTree. Therefore, the better performance of RotForest and AdaBoost can be attributed to the fact that they reduce both the bias and the variance of the SingleTree's error to a large degree. RotForest does a better job of reducing the variance term, while AdaBoost has a small advantage in reducing bias. RandForest behaves similarly to RotForest and AdaBoost, but the reduction it achieves is smaller. Bagging and RandSubspace, in turn, are observed to reduce mainly the variance term. DECORATE decreases the bias and variance of the SingleTree's error only to a small extent.

4.2.3. Kappa-Error Diagrams

Many researchers [34–36] have pointed out that an ensemble classifier achieves a much lower generalization error than any of its constituent members when it consists of highly accurate classifiers that at the same time disagree as much as possible. Put another way, to construct an ensemble classifier with good performance, we should achieve a good tradeoff between diversity and accuracy.

The kappa-error diagrams developed by Margineantu and Dietterich [37] provide an effective means of visualizing how an ensemble classifier constructed by some ensemble learning technique attempts to reach the tradeoff between the diversity and the accuracy of its constituent members. For each pair of classifiers, the diversity between them is measured by the statistic kappa (𝜅), which evaluates the level of agreement between two classifier outputs while correcting for chance; their accuracy is measured by the average of their error rates estimated on the testing data set. A kappa-error diagram is a scatter plot in which each point corresponds to a pair of classifiers 𝐶𝑖 and 𝐶𝑗. The 𝑥-axis of the plot shows the diversity value 𝜅, and the 𝑦-axis shows the mean error of 𝐶𝑖 and 𝐶𝑗, namely, 𝐸𝑖,𝑗=(𝐸𝑖+𝐸𝑗)/2.

The statistic 𝜅 is defined as follows. Suppose that there are 𝐽 classes, and let ℳ be the 𝐽×𝐽 coincidence matrix of two classifiers 𝐶𝑖 and 𝐶𝑗 (𝑖,𝑗=1,2,…,𝑇), where 𝑇 is the number of classifiers in an ensemble. The entry $m_{k,s}$ of ℳ is the proportion of the testing data set that classifier 𝐶𝑖 labels as 𝜔𝑘 and classifier 𝐶𝑗 labels as 𝜔𝑠. Then the agreement between 𝐶𝑖 and 𝐶𝑗 can be measured as

$$\kappa_{i,j} = \frac{\sum_{k=1}^{J} m_{k,k} - \mathrm{ABC}}{1-\mathrm{ABC}}, \tag{4.3}$$

where $\sum_{k=1}^{J} m_{k,k}$ is the observed agreement between the two classifiers 𝐶𝑖 and 𝐶𝑗. “ABC,” the acronym of “agreement-by-chance,” is defined as

$$\mathrm{ABC} = \sum_{k=1}^{J}\left(\sum_{s=1}^{J} m_{k,s}\right)\left(\sum_{s=1}^{J} m_{s,k}\right). \tag{4.4}$$

According to the above definition of 𝜅, low values indicate higher diversity. Since small values of 𝐸𝑖,𝑗 indicate better accuracy, the most desirable pairs of classifiers lie in the bottom left corner of the scatter plot.
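The definitions in (4.3) and (4.4) translate directly into code. The sketch below computes 𝜅 for one pair of classifiers from their predicted labels and then builds the points of a kappa-error diagram for a whole ensemble; the function names and the list-based data layout are choices made for this example. Note that 𝜅 is undefined when ABC equals 1, which happens only if both classifiers assign every instance to the same single class.

```python
from itertools import combinations


def kappa(pred_i, pred_j, classes):
    """Pairwise kappa from the coincidence matrix of two label sequences."""
    n = len(pred_i)
    index = {c: k for k, c in enumerate(classes)}
    J = len(classes)
    m = [[0.0] * J for _ in range(J)]
    for a, b in zip(pred_i, pred_j):
        m[index[a]][index[b]] += 1.0 / n
    observed = sum(m[k][k] for k in range(J))
    # Agreement-by-chance: product of row and column marginals, summed over classes.
    abc = sum(sum(m[k]) * sum(m[s][k] for s in range(J)) for k in range(J))
    return (observed - abc) / (1.0 - abc)  # raises ZeroDivisionError if abc == 1


def kappa_error_points(all_preds, y_true, classes):
    """One (kappa, mean error) point for every pair of base classifiers."""
    errors = [
        sum(p != t for p, t in zip(preds, y_true)) / len(y_true)
        for preds in all_preds
    ]
    return [
        (kappa(all_preds[i], all_preds[j], classes), (errors[i] + errors[j]) / 2)
        for i, j in combinations(range(len(all_preds)), 2)
    ]
```

Two classifiers with identical outputs give 𝜅 = 1, and two whose agreement equals the chance level give 𝜅 = 0. For an ensemble of 25 base classifiers, `kappa_error_points` returns 300 points, one per pair.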

Figure 1 illustrates the kappa-error diagrams of the ensemble classifiers constructed by each ensemble algorithm on the hand gesture recognition data. All the constructed forests except the one built by DECORATE consist of 25 trees; therefore, there are $\binom{25}{2}=300$ points in each of the corresponding plots. Because the data set has no separate training and testing parts, we randomly took 90% of the observations to build the forests and used the remaining 10% to compute the kappa-error diagrams. The axes of all plots were set to identical ranges so that comparisons can be made easily. Moreover, at the top of each plot in Figure 1, we present the ensemble method used, its prediction error estimated on the testing set, and the coordinates (shown as the red point in each plot) of the mean diversity and the mean error, averaged over all pairs of base classifiers.

As can be observed in Figure 1, DECORATE gives a very compact cloud of points. Each point has a low error rate and a high value of 𝜅, which indicates that the base classifiers are accurate but not very diverse. The kappa-error diagrams for bagging and random subspace have similar shapes, but the points for bagging are more diverse while those for random subspace are slightly more accurate. Although the test error of random forest is identical to that of AdaBoost, the base classifiers of AdaBoost are more diverse but less accurate than those of random forest. Rotation forest achieves better diversity than random subspace and DECORATE, and in accuracy it is second only to DECORATE. Nevertheless, the final prediction error of rotation forest is a little higher than that of random forest and AdaBoost in this particular trial.

Notice that here we randomly selected 90% of the data to construct the forests and used the remaining 10% to estimate the values of 𝜅 and the mean error for each pair of base classifiers, and that the experiment was carried out for only one trial. When comparing the results obtained here with those listed in Table 1, conclusions should be drawn with caution, since the errors reported in Table 1 were averaged over ten trials of 10-fold cross-validation in order to reduce the random factors that may affect the relative performance of the considered classification methods. Our aim in this subsection, however, was only to use the kappa-error diagrams to study the working mechanisms of the compared ensemble methods more clearly.

5. Conclusions

In this paper, we adopted some widely used classifier fusion methods to solve a hand gesture recognition problem. Since the data of this classification task likely come from a multi-normal distribution, ensemble methods are found to be more appropriate for this problem: their performance is much better than that of a single classification tree. Among the ensemble techniques, AdaBoost and rotation forest perform significantly better than their rivals and achieve the lowest generalization error. By investigating the bias-variance decompositions of error for the considered classification algorithms, the success of AdaBoost and rotation forest can be attributed to the fact that each of them simultaneously reduces the bias and variance terms of the SingleTree's error to a large extent. Rotation forest does a better job of reducing variance, whereas AdaBoost has a small advantage in reducing bias. Furthermore, we made use of kappa-error diagrams to visualize how a classifier combination strategy attempts to reach a good tradeoff between diversity and accuracy in the process of constructing an ensemble classifier. The experimental results demonstrate that AdaBoost creates the most diverse base classifiers, but with a slightly higher error. Rotation forest is observed to generate very accurate base classifiers, while the diversity between them is only medium.


This research was supported in part by the National Natural Science Foundation of China (no. 61075006), the Tianyuan Special Funds of the National Natural Science Foundation of China (no. 11126277), the Fundamental Research Funds for the Central Universities of China, and the Research Fund for the Doctoral Program of Higher Education of China (no. 20100201120048). The authors would like to thank Gineke A. ten Holt for providing the hand gesture data. The authors are also grateful to the reviewers and the editor for their valuable comments and suggestions, which led to a substantial improvement of the paper.