Abstract

The recognition of hand movements is an important method for human-computer interaction (HCI) technology, and it is widely used in virtual reality and other HCI areas. While many valuable efforts have been made, efficient ways to capture more than 20 types of hand movements with high accuracy using a single data glove are still lacking. This paper proposes a new classification framework for 52 hand movements. The framework consists of two parts: the movement detection algorithm and the movement classification algorithm. The fine K-nearest neighbor (Fine KNN) classifier is the core of the movement detection algorithm. The movement classification algorithm is composed of downsampling in data preparation and a new deep learning network named the DBDF network, whose main component is Bidirectional Long Short-Term Memory (BiLSTM). Experiments on the Ninapro DB1 dataset demonstrate that our work can classify more types of hand movements than related algorithms, with a precision of 93.15%.

1. Introduction

As one of the hotspots in the research field of human-computer interaction (HCI) technology, hand movement recognition has been deeply studied by researchers and has been widely used in virtual reality, artificial intelligence, and other HCI areas. With the popularization and application of advanced sensors, hand movement recognition based on contact sensors has developed rapidly, especially data glove technology. Data gloves can more intuitively obtain the three-dimensional spatial information of hand posture by using multiple sensors and are not constrained by the surrounding environment.

In the virtual reality scene, the user’s different intentions can be judged by hand movement recognition, which facilitates the user’s manipulation of objects in the virtual scene and improves the user’s immersive experience. In daily life, the emergence of smart homes also provides more application fields for hand movement recognition. In virtual reality training scenes in particular, users need operation that is simpler, faster, and closer to reality, so the traditional mouse-and-keyboard HCI mode is gradually being replaced by data gloves. Data glove technology allows users to manipulate objects in the training scene through gesture changes alone, which gives them a realistic immersive experience. Hand movement recognition therefore has great potential for future development.

According to whether there is a time series of actions, hand movement recognition is divided into the following two types: static and dynamic recognition. Static recognition refers to recognition and classification without time series, focusing only on changes in hand spatial features, such as hand shape, contour, and center of gravity. However, the system often needs continuous instruction rather than a single, on-off instruction, which makes the static gesture unable to meet the needs of HCI. Dynamic hand movement recognition can be understood as the recognition of static gesture combinations in a period of time series, which not only provides users with better fluency but also compensates for the defects of static gesture recognition in human-computer interactions.

Gesture recognition can also be divided, according to how the gesture interaction information is collected, into recognition based on noncontact sensors and recognition based on contact sensors. A camera is a typical noncontact sensor. An image or video of the hand movement is collected using a camera, and the processed gesture information is recognized through a combination of gesture detection, tracking, positioning, and other methods. The camera-based hand movement recognition method has been widely used in smart homes, but it is strongly affected by the background environment, especially under weak illumination, low camera resolution, and complex and changeable backgrounds. Therefore, this method is not well suited to virtual reality scenes. Hand movement recognition based on contact sensors obtains hand information through sensors in contact with the hand, and it generally refers to hand movement recognition based on data gloves. A data glove contains many sensors. Through these sensors, various hand change data can be obtained in real time without omission, and the features of these data are then extracted and classified to achieve hand movement recognition. Compared with the camera-based method, this method has faster recognition speed and higher accuracy. In addition, users of data gloves have better comfort and freedom because they do not need to consider the position of the camera.

In the past, due to the slow development of sensor technology, data gloves were inaccurate and expensive, which made the camera-based hand movement recognition method dominant in the field of gesture recognition. In recent years, with the rapid development of sensor technology, the cost of data gloves has been reduced. The feasibility of hand movement recognition methods based on data gloves in various fields has become increasingly obvious, and they have attracted increasing attention. Therefore, research on hand movement recognition technology based on data gloves plays an increasingly important role in this recognition field. Nassour et al. [1] proposed a sensory glove and used machine learning algorithms to estimate the angles of the joints in the hand and to identify 15 gestures with an average accuracy of 89.4%. Chen et al. [2] presented a wearable hand rehabilitation system that offers 16 kinds of finger gestures with an accuracy of 93.32%. Pan et al. [3] presented a wireless smart glove that can recognize 10 American Sign Language gestures, and the highest testing classification accuracy of their system is 99.7%. Maitre et al. [4] developed a data glove prototype allowing for the recognition of objects in 8 basic daily activities with an accuracy of 95%. Lee and Bae [5] proposed a real-time gesture recognition system that uses a data glove to classify 11 gestures, and the recognition result showed 100% accuracy. Ayodele et al. [6] proposed the use of convolutional neural networks (CNNs) on 6 grasp classifications using a piezoresistive data glove, and the average classification accuracy of the CNN algorithm was 88.27% and 75.73% in the object-seen and object-unseen scenarios, respectively. Chauhan et al. [7] presented grasp prediction algorithms that can be used for naturalistic, synergistic control of exoskeleton gloves for 5 activities, and an average accuracy of approximately 75% was achieved.
In [8], an automatic recognition algorithm to identify hand movements using surface electromyogram signals was proposed with an average accuracy of 93.53% on Ninapro DB5 (17 gestures). In [9], a prototype of a data glove was proposed that performs object recognition well during 13 basic daily activities, with an accuracy of almost 93%. Huang et al. [10] used a prefabricated data glove to monitor the bending angle of the finger joint in real time and then realized the recognition of 9 Chinese Sign Language (CSL) words, with an accuracy of 98.3%. A novel deep learning framework based on graph convolutional neural networks (GCNs-Net) was presented in [11], and GCNs-Net achieved the highest averaged accuracy of 96.24% at the subject level on the High Gamma dataset (4 movements). In [12], an attention-based BiLSTM-GCN was proposed to accurately classify four-class electroencephalogram motor imagery tasks, and it achieved 98.81% prediction accuracy based on individual training.

Although many scholars have made remarkable progress in the field of hand movement recognition, there is still a need to improve the classification accuracy of multiple hand movements [13]. The recognition of multiple hand movements is challenging because the accuracy typically decreases markedly as more hand movements are added [14]. To the best of our knowledge, existing research recognizes only a small number of movements with high precision when using data gloves alone. In the above research using data gloves, the average accuracy exceeds 90% in several studies, but the number of hand movements never exceeds 20 types. To recognize more hand movements accurately from the raw signals of data gloves, a new classification framework should be constructed.

Therefore, in this paper, we propose applying the fine K-nearest neighbor (Fine KNN) and Bidirectional Long Short-Term Memory (BiLSTM) to classify the larger Ninapro DB1 dataset, which includes 52 movements recorded with the 22 sensor measurements of one data glove. The classification framework is divided into two steps. First, a Fine KNN classifier is used to obtain the sensor sequence of a hand movement over a period and to determine whether the sequence is in a resting state or a movement state. Second, a classification model, named the DBDF network, is proposed to recognize 52 movements from the movement state sensor sequence. The structure of this classification framework is shown in Figure 1. Our results show that the classification framework recognizes the 52 hand movements in Ninapro DB1 using only one data glove with an accuracy of 93.15%.

The main contributions of this work can be summarized under three headings:
(1) A novel classification framework is introduced to detect 52 hand movements using only data gloves.
(2) The hand movement classification algorithm, evaluated on the benchmark dataset Ninapro DB1, recognizes more hand movements than the existing state-of-the-art algorithms while maintaining high classification accuracy.
(3) The hand movement classification algorithm has good scalability when the input size grows larger. Data preparation uses downsampling to rescale the input variables before training the DBDF network, and downsampling improves neural network stability and modeling performance by scaling the data.

The structure of the rest of this paper is as follows. Section 2 describes the Ninapro DB1. Section 3 introduces the hand movement detection algorithm based on Fine KNN. The hand movement classification algorithm is reported in Section 4. Sections 5 and 6 illustrate the experimental results and conclusion, respectively.

2. Ninapro DB1

Ninapro [15, 16] is a publicly available multimodal database to foster research on robotic hands controlled with artificial intelligence. Ninapro includes electromyography, kinematic, inertial, eye tracking, visual, clinical, and neurocognitive data. Ninapro DB1 includes data from 27 intact subjects. We use the hand kinematics data in Ninapro DB1 as our experimental data. The 52 hand movements and the rest state in Ninapro DB1 are listed in Table 1 [16]. In the experiment, 70% of Ninapro DB1 is used as the training set and 30% is used as the testing set.

Hand kinematics were measured for all subjects using a 22-sensor CyberGlove II data glove. CyberGlove II is a motion capture data glove instrumented with joint-angle measurement sensors. It uses proprietary resistive bend-sensing technology to transform hand and finger motions into real-time digital joint-angle data. The number and the corresponding position of each CyberGlove II sensor are shown in Figure 2.

3. The Movement Detection Algorithm

3.1. Algorithm Introduction

Sample data of CyberGlove II are represented symbolically as follows:

The corresponding movement label is defined as follows:

This window can provide different transition widths for the same L, which is something other fixed windows lack:

Define as a subvector extracted from a vector . The movement label is as follows:

Hand movement detection is a classification that classifies the data between two classes in “0” and “1” of .

Let the training set be , where is an input vector and its label. The K-nearest neighbors (KNN) [17] algorithm is a supervised machine learning algorithm that can be used for both classification and regression problems, although in industry it is mainly used for classification. KNN classifies a sample according to its similarity to the training samples under a distance measure. In this paper, Fine KNN denotes a KNN classifier in which the nearest neighbors are determined by Euclidean distance and the number of neighbors is set to 1, which allows finely detailed distinctions between the two classes of the movement label. Define as the Fine KNN classifier prediction of .
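As an illustration of the Fine KNN idea described above (one nearest neighbor under Euclidean distance), a minimal Python sketch follows. It is not the author's implementation: the function name `fine_knn_predict` and the toy three-dimensional vectors are hypothetical, and real inputs would be the selected CyberGlove II sensor values.

```python
import math

def fine_knn_predict(train_x, train_y, query):
    """Return the label of the training vector closest to `query`.

    This is 1-nearest-neighbor classification with Euclidean distance.
    """
    best_label, best_dist = None, math.inf
    for x, y in zip(train_x, train_y):
        d = math.dist(x, query)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_dist, best_label = d, y
    return best_label

# Hypothetical toy data: label 0 = rest, label 1 = movement.
train_x = [(0.0, 0.1, 0.0), (0.1, 0.0, 0.1), (0.9, 1.0, 0.8), (1.0, 0.9, 1.1)]
train_y = [0, 0, 1, 1]

pred_rest = fine_knn_predict(train_x, train_y, (0.05, 0.05, 0.0))
pred_move = fine_knn_predict(train_x, train_y, (0.95, 1.0, 0.9))
print(pred_rest, pred_move)
```

With a single neighbor there are no ties to resolve by voting, which is one reason this setting is convenient for a two-class rest/movement decision.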

If , then a whole movement period is composed of the measurement of CyberGlove II:
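The condition and formula for a movement period are not recoverable from the text above. One plausible reading, stated here purely as an assumption, is that consecutive records predicted as movement ("1") form one movement period. A sketch under that assumption, with the hypothetical helper `movement_periods`:

```python
def movement_periods(predictions):
    """Return (start, end) index pairs, end exclusive, for each run of 1s.

    Assumption: a movement period is a maximal run of consecutive records
    that the detector labels as movement (1).
    """
    periods = []
    start = None
    for i, p in enumerate(predictions):
        if p == 1 and start is None:
            start = i                      # a movement run begins
        elif p != 1 and start is not None:
            periods.append((start, i))     # the run ended at the previous record
            start = None
    if start is not None:                  # run reaches the end of the sequence
        periods.append((start, len(predictions)))
    return periods

# Hypothetical detector output for ten consecutive records.
preds = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
periods = movement_periods(preds)
print(periods)
```

The sensor records of one period would then be the slice `records[start:end]` of the glove measurements.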

3.2. Algorithm Effect Analysis

The thumb is the most important part of the hand. It has only two bones, so it is noticeably shorter than the other fingers, and it plays a physiological role that the other fingers cannot fulfil. In this paper, we select the sensors numbered (1, 2, 3, 4, 5, 6, 9, 13, 17) and is defined as follows:

As shown in Table 2, the Fine KNN using the sensors numbered (1, 2, 3, 4, 5, 6, 9, 13, 17) achieved the highest accuracy, reaching 99.4% on the test dataset, which is higher than with the other sensor selections. The Fine KNN classifier is therefore selected as the optimized classification model. Figure 3 shows the confusion matrix of Fine KNN. This classification model can effectively obtain a whole movement period [18].

4. The Movements Classification Algorithm

The model presented in Section 3 can detect only the hand movement state and its movement period, so a new model is needed to recognize the 52 hand movements. Hand movements are classified and recognized mainly according to their characteristics, and the feature extraction of hand movements is based on the data in the training sample dataset. Therefore, the quality of the data in the training sample dataset directly affects the subsequent hand movement recognition.

4.1. Data Preparation

In Ninapro DB1, hand movement periods differ in length; the shortest sample has approximately 200 records, and the longest has nearly 1000 records. The records of these movements should be processed so that every movement sample has a similar length. Downsampling is used for this purpose. Downsampling, which is also sometimes called decimation, reduces the sampling rate. The idea of downsampling is to remove records from the signal while maintaining its length with respect to time.

The typical sensor data rate of CyberGlove II is 90 records/second, so its sampling frequency is fs = 90 Hz. The shortest sample of a whole hand movement lasts 200/90 ≈ 2.2 seconds, and the longest lasts approximately 1000/90 ≈ 11 seconds. This means that the frequency f of a whole hand movement ranges from about 0.1 Hz to 0.45 Hz. If downsampling is used to bring the number of samples for every movement to between 100 and 200, the new sampling frequency ranges from 9 Hz (for 1000 records) to 90 Hz (for 200 records), which remains well above twice the highest movement frequency (2 × 0.45 Hz = 0.9 Hz). According to the Nyquist theorem, the new samples produced by downsampling therefore enable a complete reconstruction of the corresponding original samples.
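The arithmetic in the paragraph above can be checked with a few lines of Python. The sketch follows the text in treating the frequency of a whole movement as the reciprocal of its duration; the variable names are illustrative only.

```python
FS = 90.0                       # CyberGlove II sampling frequency in Hz (from the text)
SHORTEST, LONGEST = 200, 1000   # approximate records per movement (from the text)

dur_short = SHORTEST / FS       # about 2.2 s
dur_long = LONGEST / FS         # about 11.1 s

# Movement frequency taken as 1 / duration, as in the text.
f_max = 1.0 / dur_short         # about 0.45 Hz
f_min = 1.0 / dur_long          # about 0.09 Hz

# New sampling frequency after downsampling: the longest movement reduced
# to 100 samples, the shortest kept at 200 samples.
fs_new_low = 100 / dur_long     # about 9 Hz
fs_new_high = 200 / dur_short   # about 90 Hz

# Nyquist condition: sampling frequency must exceed twice the signal frequency.
nyquist_ok = fs_new_low > 2 * f_max

print(round(dur_short, 2), round(dur_long, 2))
print(round(f_min, 2), round(f_max, 2))
print(round(fs_new_low, 1), round(fs_new_high, 1), nyquist_ok)
```

Even the lowest new sampling frequency is roughly ten times the Nyquist minimum of 0.9 Hz, so the margin is comfortable under these assumptions.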

Consider downsampling a labeled signal of length N and reducing the number of samples N by a factor of ns, where ns is a divisor of N. A new signal is represented as follows:where is the input of the hand movement classifier below.
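A minimal sketch of downsampling by an integer factor is given below. It keeps every ns-th record, which is plain decimation without an anti-aliasing filter. The rule in `choose_factor` for picking ns is a hypothetical choice made for this illustration, not taken from the paper, and unlike the text it does not require ns to divide N exactly.

```python
def downsample(records, ns):
    """Keep every ns-th record, starting with the first one."""
    if ns < 1:
        raise ValueError("ns must be a positive integer")
    return records[::ns]

def choose_factor(n_records, max_len=200):
    """Smallest integer factor that brings the length to at most max_len.

    Hypothetical rule for this sketch; uses ceiling division.
    """
    return max(1, -(-n_records // max_len))

# Toy movement sample: 1000 records, each a placeholder for 22 sensor values.
sample = [[float(i)] * 22 for i in range(1000)]
ns = choose_factor(len(sample))
reduced = downsample(sample, ns)
print(ns, len(reduced))
```

For any sample length between 200 and 1000 records, this rule yields between 100 and 200 records, which matches the target range mentioned in the data preparation step.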

4.2. DBDF Network

Essentially, hand movement recognition is a sequence classification task, that is, a predictive modeling problem in which the input is the sensor sequence of CyberGlove II and the task is to predict a hand movement category for the sequence. What makes this problem difficult is that the sensor sequences of CyberGlove II can vary in length. Although the sensor sequences have been shortened by downsampling, the classifier is still required to learn long-term dependencies in the input sequence.

Recently, deep learning methods have been shown to provide state-of-the-art results on challenging hand movement recognition tasks with little or no feature engineering, relying instead on feature learning from raw data.

The LSTM (Long Short-Term Memory) [19] network model is a recurrent neural network that can learn and memorize long sequences of input data. BiLSTM [20] is an extension of traditional LSTM, which can improve the performance of the model on sequence classification problems. In this problem, where all time steps of the input sequence are available, BiLSTM trains two LSTMs on the input sequence instead of one. The first LSTM is on the input sequence, and the second is on the reverse copy of the input sequence. This can provide additional context for the network to understand the problem faster and more comprehensively.
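To make the idea of processing the sequence and its reversed copy concrete, the following toy sketch runs a very simple recurrent cell in both directions and pairs the two states at each time step. The cell is a plain tanh recurrence with hypothetical fixed weights, not an LSTM, and it only illustrates the bidirectional wiring.

```python
import math

def run_cell(sequence, w_x=0.5, w_h=0.8):
    """Toy recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1}).

    Returns the hidden state at every time step. The weights are
    arbitrary illustrative values, not trained parameters.
    """
    h, states = 0.0, []
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional(sequence):
    """Pair each forward state with the backward state of the same time step."""
    forward = run_cell(sequence)
    # Process the reversed copy, then flip back to the original time order.
    backward = run_cell(sequence[::-1])[::-1]
    return list(zip(forward, backward))

seq = [0.1, 0.4, 0.9, 0.3]
features = bidirectional(seq)
for t, (f, b) in enumerate(features):
    print(t, round(f, 3), round(b, 3))
```

At every time step the forward state summarizes the past and the backward state summarizes the future, which is the additional context the paragraph above refers to.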

The BiLSTM network can support multiple parallel sequences of input data, such as each sensor of the CyberGlove II data. The BiLSTM network learns to extract features from sequences of observations and how to map the internal features to different movement categories.

The benefit of using BiLSTM for sequence classification is that it can learn from the raw time series data directly and therefore does not require domain expertise to manually engineer input features. The BiLSTM network can learn an internal representation of the time series data and ideally achieve performance comparable to models fit on a version of the dataset with engineered features. BiLSTM is an effective solution that obtains access to both preceding and succeeding information by involving two separate hidden layers with opposite information flow directions. The traditional BiLSTM classification network usually uses the final state for classification.

Since sensor records show the time dependence of dynamic CyberGlove II sensor data, to better classify the sensor time series of CyberGlove II and further improve the classification accuracy of the sensor sequences, a classification network composed of double BiLSTM layers (bilstmLayer) and double fully connected layers (fullyConnectedLayer) is proposed, called the DBDF network. The association relationship between a sequence of CyberGlove II sensors and the forward and backward association relationship of the sequence itself can be fully extracted by the DBDF network. It has the advantage of an easy training process, and the computational complexity is reduced [21].

A DBDF network is designed to classify the movement periods obtained in Section 3, and the structure of this model is shown in Figure 4. The BiLSTM cell is the core of the DBDF network. The joint-angle measurement data of CyberGlove II are used as the input of the DBDF network. The outputs of each repeating BiLSTM cell are concatenated into a dense layer for further prediction. In the DBDF network, two fully connected layers are added after two bilstmLayers. The number of hidden units in both bilstmLayer #1 and bilstmLayer #2 is 350, and the output sizes of fullyConnectedLayer #1 and fullyConnectedLayer #2 are 200 and 52, respectively. The two bilstmLayers extract features from the sensor sequences, concatenate them together, and map them to the two fullyConnectedLayers. The combination of two bilstmLayers increases the amount of data, which can then improve the accuracy and stability of classification [22].
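As a rough sanity check on the size of such a network, the sketch below counts learnable parameters for the layer sizes stated above. Several points are assumptions rather than facts from the paper: the input is taken to be all 22 sensor channels, each LSTM gate is assumed to have one bias vector (some frameworks store two), and the first fully connected layer is assumed to receive a single 700-dimensional vector (forward and backward states concatenated) per sequence.

```python
def lstm_params(input_size, hidden):
    """Parameters of one LSTM direction: 4 gates, each with input weights,
    recurrent weights, and one bias vector (assumed convention)."""
    return 4 * (hidden * (input_size + hidden) + hidden)

def bilstm_params(input_size, hidden):
    """A bidirectional layer holds two independent LSTMs."""
    return 2 * lstm_params(input_size, hidden)

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

N_SENSORS = 22   # assumption: all CyberGlove II channels are used as input
HIDDEN = 350     # hidden units per direction (from the text)

layers = {
    "bilstmLayer #1": bilstm_params(N_SENSORS, HIDDEN),
    "bilstmLayer #2": bilstm_params(2 * HIDDEN, HIDDEN),  # input = 700 features
    "fullyConnectedLayer #1": dense_params(2 * HIDDEN, 200),
    "fullyConnectedLayer #2": dense_params(200, 52),
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name}: {count:,}")
print(f"total: {total:,}")
```

Under these assumptions the second BiLSTM layer dominates the parameter count, because its input is already 700 features wide.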

5. Experiments and Analysis

5.1. Experimental Results

The DBDF network is trained and tested with the deep learning framework of MATLAB R2020b on a machine with an Intel i7-8750H CPU @ 2.20 GHz, 32 GB of memory, and an Nvidia RTX 2070 GPU. Training and testing of the DBDF network are accelerated by the GPU. The RMSProp optimizer is used to train the network; it monitors the training error “MSE” and continuously updates the network weights and biases. The initial learning rate is 0.0001, and it is reduced by a factor of 0.1 every 10 epochs. The maximum number of training epochs is 30, and a minibatch of 128 observations is used at each iteration.
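The learning rate schedule described above can be written out explicitly. The sketch assumes that 0.0001 is the starting value and that the drop is applied after every 10 completed epochs; the function name `learning_rate` is illustrative.

```python
def learning_rate(epoch, initial=1e-4, drop_factor=0.1, drop_period=10):
    """Piecewise-constant schedule; `epoch` is 1-based.

    Assumption: the rate is multiplied by drop_factor once every
    drop_period completed epochs.
    """
    return initial * drop_factor ** ((epoch - 1) // drop_period)

MAX_EPOCHS = 30
schedule = [learning_rate(e) for e in range(1, MAX_EPOCHS + 1)]

for epoch in (1, 10, 11, 20, 21, 30):
    print(epoch, f"{schedule[epoch - 1]:.0e}")
```

With 30 epochs the network therefore trains at three rates, each ten times smaller than the previous one.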

The training process usually takes approximately 6 minutes on a single GPU, and the network achieves a training accuracy of 99.5% and a validation accuracy of 93%. The confusion matrix is used in the performance analysis of classification algorithms. The testing result is shown in the confusion matrix chart in Figure 5, and the accuracy for the 52 hand movements is listed in Table 3. The prediction accuracy of movements 2, 3, 4, 7, 10, 15, 16, and 19 reaches 100%, whereas the accuracy of movements 43, 9, 52, 44, and 21 is lower than 80%. Improving the recognition accuracy of these movements is the focus of future work.
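For readers unfamiliar with the analysis, the sketch below shows how a confusion matrix and per-movement accuracy can be computed from true and predicted labels. The labels are made-up toy data with three classes rather than 52, and treating per-movement accuracy as the row-wise fraction of correct predictions is an assumption about how Table 3 was derived.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(matrix)]

def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical toy labels for three movement classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, 3)
class_acc = per_class_accuracy(cm)
print(cm)
print([round(a, 2) for a in class_acc], round(overall_accuracy(cm), 3))
```

Off-diagonal entries show which movements are confused with each other, which is the information needed to target the low-accuracy movements.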

We compare the number of movements and classification accuracy of the proposed methodology with the existing hand movement recognition algorithms in Table 4. We conclude that the proposed algorithm of the DBDF network performs better than the existing state-of-the-art algorithms in recognizing the number of hand movements and maintains high classification accuracy.

5.2. Classification Network Structure and Downsampling

The deep learning model in the hand movement classification algorithm is composed of double BiLSTM layers and double fully connected layers in Section 4. We also consider changing the number of bilstmLayers and fullyConnectedLayers. The experimental results are shown in Table 5. As the number of layers increases, the training time of the classification network increases. The DBDF network outperforms other classification networks in Table 5.

In this research, downsampling is very important in data preparation. Downsampling the sensor sequences saves considerable time when training the network and increases the validation accuracy of classification. For variable-length sequence problems such as this one, the data should be transformed so that the sequences are as similar in length as possible. A comparison of different classification networks without downsampling is shown in Table 6. For the same classification network, the validation accuracy with downsampling is higher than that without downsampling, and the training time is shorter.

5.3. Computational Cost

The DBDF network, a deep learning network, is the core of the proposed approach. It is adapted to use GPU acceleration, which significantly boosts performance and shortens the training time. Because the experiments run on a single GPU, GPU usage, GPU memory usage, GPU power usage, and training time are the key metrics for evaluating the computational cost of the proposed approach. Training time has already been reported in Tables 5 and 6, so GPU usage, GPU memory usage, and GPU power usage are discussed below. The data in this section were logged by GPU-Z.

Figure 6 shows the GPU usage trend while the movement classification algorithm is running. The preparation process refers to the initial run of the algorithm program. At this time, GPU usage and GPU memory usage are nearly zero. In Figure 7, the DBDF network training causes over 30% GPU usage, and GPU memory usage increases gradually to about 2500 MB.

In Figure 8, the basic GPU power consumption is 9 W when the GPU is powered on. GPU power consumption exceeds 60 W when the DBDF network is being constructed and initialized, and it averages nearly 21 W while the program is running. In the testing process, GPU usage and GPU power consumption decrease to 0% and 9 W, respectively, but about 2500 MB of GPU memory remains occupied.

Table 7 summarizes metrics for evaluating the GPU performance of the movements classification algorithm.

6. Conclusions

In this paper, we have presented the use of Fine KNN and a BiLSTM network for hand movement classification using one data glove. The Fine KNN classifier is the core of the movement detection algorithm, which obtains the sensor sequences of a hand movement over a period. The hand movement classification algorithm includes downsampling and the DBDF network. The DBDF network architecture consists of two BiLSTM layers and two fully connected layers. Notably, the average classification accuracy of our algorithm is 93.15% on Ninapro DB1.

Because a hand is an elastic object, the same movement can show different features, and different movements can show highly similar features [23]. Our approach can detect 52 hand movements using only data gloves. It recognizes more hand movements than the existing state-of-the-art algorithms while maintaining high classification accuracy on data from 27 intact subjects. Downsampling is introduced to improve neural network stability and modeling performance by scaling the data.

The prediction accuracies of movements 43, 9, 52, 44, and 21 are lower than 80%, and we will focus on improving the recognition accuracy of these movements. In the future, we will optimize the movement classification algorithm and consider more deep learning networks to improve the accuracy on Ninapro DB1 and in other movement prediction applications. We are also interested in studying the transferability of the DBDF network parameters from one dataset to another without retraining.

Data Availability

The Ninapro DB1 data used to support the findings of this study have been deposited in the Ninapro Project repository (http://ninapro.hevs.ch/).

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this study.

Acknowledgments

This work was partially supported by Guangdong University of Education Teaching Quality and Teaching Reform Project (no. 2018sfjd02) and the Appropriative Researching Fund for Guangdong Provincial Key Laboratory of Precision Equipment and Manufacturing Technology under Grant PEM201604.