Abstract

3D gestural interaction provides a powerful and natural way to interact with computers using the hands and body for a variety of different applications including video games, training and simulation, and medicine. However, accurately recognizing 3D gestures so that they can be reliably used in these applications poses many different research challenges. In this paper, we examine the state of the field of 3D gestural interfaces by presenting the latest strategies on how to collect the raw 3D gesture data from the user and how to accurately analyze this raw data to correctly recognize 3D gestures users perform. In addition, we examine the latest in 3D gesture recognition performance in terms of accuracy and gesture set size and discuss how different applications are making use of 3D gestural interaction. Finally, we present ideas for future research in this thriving and active research area.

1. Introduction

Ever since Sutherland’s vision of the ultimate display [1], the notion of interacting with computers naturally and intuitively has been a driving force in the field of human computer interaction and interactive computer graphics. Indeed, the notion of the post-WIMP interface (Windows, Icons, Menus, Point and Click) has given researchers the opportunity to explore alternative forms of interaction over the traditional keyboard and mouse [2]. Speech input, brain computer interfaces, and touch and pen-computing are all examples of input modalities that attempt to bring a synergy between user and machine and that provide a more direct and natural method of communication [3, 4].

One such method of interaction that has received considerable attention in recent years is 3D spatial interaction [5], where users’ motions are tracked in some way so as to determine their 3D pose (e.g., position and orientation) in space over time. This tracking can be done with sensors users wear or hold in their hands or unobtrusively with a camera. With this information, users can be immersed in 3D virtual environments, avateer virtual characters in video games and simulations, and provide commands to various computer applications. Tracked users can also use these handheld devices or their hands, fingers, and whole bodies to generate specific patterns over time that the computer can recognize to let users issue commands and perform activities. We refer to these specific recognized patterns as 3D gestures.

1.1. 3D Gestures

What exactly is a gesture? Put simply, gestures are movements with an intended emphasis, and they are often characterized as rather short bursts of activity with an underlying meaning. In more technical terms, a gesture is a pattern that can be extracted from an input data stream. The frequency and size of the data stream are often dependent on the underlying technology used to collect the data and on the intended gesture style and type. For example, x and y coordinates and timing information are often all that is required to support and recognize 2D pen or touch gestures. A thorough survey on 2D gestures can be found in Zhai et al. [6].

Based on this definition, a 3D gesture is a specific pattern that can be extracted from a continuous data stream that contains 3D position, 3D orientation, and/or 3D motion information. In other words, a 3D gesture is a pattern that can be identified in space, whether it be a device moving in the air such as a mobile phone or game controller, or a user’s hand or whole body. There are three different types of movements that can fit into the general category of 3D gestures. First, data that represents a static movement, like making and holding a fist or crossing and holding the arms together, is known as a posture. The key to a posture is that the user is moving to get into a stationary position and then holds that position for some length of time. Second, data that represents a dynamic movement with limited duration, like waving or drawing a circle in the air, is considered to be what we think of as a gesture. Previous surveys [7, 8] have distinguished postures and gestures as separate entities, but they are often used in the same way and the techniques for recognizing them are similar. Third, data that represents dynamic movement with an unlimited duration, like running in place or pretending to climb a rope, is known as an activity. In many cases these types of motions are repetitive, especially in the entertainment domain [9]. The research area known as activity recognition, a subset of computer vision, focuses on recognizing these types of motions [10, 11]. One of the main differences between 3D gestural interfaces and activity recognition is that activity recognition is often focused on detecting human activities where the human is not intending to perform the actions as part of a computer interface, for example, detecting unruly behavior at an airport or train station. For the purposes of this paper, unless otherwise stated, we will group all three movement types into the general category of 3D gestures.

1.2. 3D Gesture Interface Challenges

One of the unique aspects of 3D gestural interfaces is that they cross many different disciplines in computer science and engineering. Since recognizing a 3D gesture is a question of identifying a pattern in a continuous stream of data, concepts from time series analysis, signal processing, and control theory can be used. Concepts from machine learning are commonly used since one of the main ideas behind machine learning is to be able to classify data into specific classes and categories, something that is paramount in 3D gesture recognition. In many cases, cameras are used to monitor a user’s actions, making computer vision an area that has extensively explored 3D gesture recognition. Given that recognizing 3D gestures is an important component of a 3D gestural user interface, human computer interaction, virtual and augmented reality, and interactive computer graphics all play a role in understanding how to use 3D gestures. Finally, sensor hardware designers also work with 3D gestures because they build the input devices that perform the data collection needed to recognize them.

Regardless of the discipline, from a research perspective, creating and using a 3D gestural interface require the following:
(i) monitoring a continuous input stream to gather data for training and classification,
(ii) analyzing the data to detect a specific pattern from a set of possible patterns,
(iii) evaluating the 3D gesture recognizer,
(iv) using the recognizer in an application so commands or operations are performed when specific patterns are detected.

Each one of these components has research challenges that must be solved in order to provide robust, accurate, and intuitive 3D gestural user interaction. For example, devices that collect and monitor input data need to be accurate with high sampling rates, as unobtrusive as possible, and capture as much of the user’s body as possible without occlusion. The algorithms that are used to recognize 3D gestures need to be highly accurate, able to handle large gesture sets, and run in real time. Evaluating 3D gesture recognizers is also challenging given that their true accuracies are often masked by the constrained experiments that are used to test them. Evaluating these recognizers in situ is much more difficult because the experimenter cannot know what gestures the user will be performing at any given time. Finally, incorporating 3D gesture recognizers as part of a 3D gestural interface in an application requires gestures that are easy to remember and perform with minimal latency to provide an intuitive and engaging user experience. We will explore these challenges throughout this paper by examining the latest research results in the area.

1.3. Paper Organization

The remainder of this paper is organized in the following manner. In the next section, we will discuss various strategies for collecting 3D gesture data with a focus on the latest research developments in both worn and handheld sensors as well as unobtrusive vision-based sensors. In Section 3, we will explore how to recognize 3D gestures by using heuristic-based methods and machine learning algorithms. Section 4 will present the latest results from experiments conducted to examine recognition accuracy and gesture set size as well as discuss some applications that use 3D gestural interfaces. Section 5 presents some areas for future research that will enable 3D gestural interfaces to become more commonplace. Finally, Section 6 concludes the paper.

2. 3D Gesture Data Collection

Before any 3D gestural interface can be built or any 3D gesture recognizers can be designed, a method is required to collect the data that will be needed for training and classification. Training data is often needed for the machine learning algorithms used to classify one gesture from another (heuristic recognition does not require training data). Since we are interested in 3D gestural interaction, information about the user’s location in space or how the user moves in space is critical. Depending on what 3D gestures are required in a given interface, the type of device needed to monitor the user will vary. When thinking about what types of 3D gestures users perform, it is often useful to categorize them into hand gestures, full body gestures, or finger gestures. This categorization can help to narrow down the choice of sensing device, since some devices do not handle all types of 3D gestures. Sensing devices can be broken down into active sensors and passive sensors. Active sensors require users to hold a device or devices in their hands or wear the device in some way. Passive sensors are completely unobtrusive and mostly include pure vision sensing. Unfortunately, there is no perfect solution and there are strengths and weaknesses with each technology [12].

2.1. Active Sensors

Active sensors use a variety of different technologies to support the collection and monitoring of 3D gestural data. In many cases, hybrid solutions are used (e.g., combining computer vision with accelerometers and gyroscopes) that combine more than one technology together in an attempt to provide a more robust solution.

2.1.1. Active Finger Tracking

To use the fingers as part of a 3D gestural interface, we need to track their movements and how the various digits move in relation to each other. The most common approach, and the one with the longest history, uses some type of instrumented glove that can determine how the fingers bend. Accurate hand models can be created using these gloves and the data used to feed a 3D gesture recognizer. These gloves often do not provide the position or orientation of the hand in 3D space, so other tracking systems are needed to complement them. A variety of different technologies are used to perform finger tracking including piezoresistive, fiber optic, and Hall effect sensors. These gloves also vary in the number of sensors they have, which determines how detailed the tracking of the fingers can be. In some cases, a glove is worn without any instrumentation at all and used as part of a computer vision-based approach. Dipietro et al. [13] present a thorough survey on data gloves and their applications.

One of the more recent approaches to finger tracking for 3D gestural interfaces is to remove the need to wear an instrumented glove in favor of wearing a vision-based sensor that uses computer vision algorithms to detect the motion of the fingers. One example of such a device is the SixSense system [14]. The SixSense device is worn like a necklace and contains a camera, mirror, and projector. The user also needs to wear colored fiducial markers on the fingertips (see Figure 1). Another approach developed by Kim et al. uses a wrist worn sensing device called Digits [15]. With this system, a wrist worn camera (see Figure 2) is used to optically image the entirety of a user’s hand which enables the sampling of fingers. Combined with a kinematic model, Digits can reconstruct the hand and fingers to support 3D gestural interfaces in mobile environments. Similar systems that make use of worn cameras or proximity sensors to track the fingers for 3D gestural interfaces have also been explored [16–19].

Precise finger tracking is not always a necessity in 3D gestural interfaces. It depends on how sophisticated the 3D gestures need to be. In some cases, the data needs only to provide distinguishing information to support different, simpler gestures. This idea has led to utilizing different sensing systems to support coarse finger tracking. For example, Saponas et al. have experimented with using forearm electromyography to differentiate finger presses and finger tapping and lifting [20]. A device that contains EMG sensors is attached to a user’s wrist and collects muscle data about fingertip movement and can then detect a variety of different finger gestures [21, 22]. A similar technology supports finger tapping that utilizes the body for acoustic transmission. Skinput, developed by Harrison et al. [23], uses a set of sensors worn as an armband to detect acoustical signals transmitted through the skin [18].

2.1.2. Active Hand Tracking

In some cases, simply knowing the position and orientation of the hand is all the data that is required for a 3D gestural interface. Thus, knowing about the fingers provides too much information and the tracking requirements are simplified. Of course, since the fingers are attached to the hand, many finger tracking algorithms will also be able to track the hand. Thus there is often a close relationship between hand and finger tracking. There are two main flavors of hand tracking in active sensing: the first is to attach a sensing device to the hand and the second is to hold the device in the hand.

Attaching a sensing device to the user’s hand or hands is a common approach to hand tracking that has been used for many years [5]. There are several tracking technologies that support the attachment of an input device to the user’s hand including electromagnetic, inertial/acoustic, ultrasonic, and others [12]. These devices are often placed on the back of the user’s hand and provide single point pose information through time. Other approaches include computer vision techniques where users wear a glove. For example, Wang and Popović [24] designed a colored glove with a known pattern to support a nearest-neighbor approach to tracking hands at interactive rates. Other examples include wearing retroreflective fiducial markers coupled with cameras to track a user’s hand.

The second approach to active sensor-based hand tracking is to have a user hold the device. This approach has both strengths and weaknesses. The major weakness is that the users have to hold something in their hands which can be problematic if they need to do something else with their hands during user interaction. The major strengths are that the devices users hold often have other functionalities such as buttons, dials, or other device tools which can be used in addition to simply tracking the user’s hands. This benefit will become clearer when we discuss 3D gesture recognition and the segmentation problem in Section 3. There have been a variety of different handheld tracking devices that have been used in the virtual reality and 3D user interface communities [25–27].

Recently, the game industry has developed several video game motion controllers that can be used for hand tracking. These devices include the Nintendo Wii Remote (Wiimote), Playstation Move, and Razer Hydra. They are inexpensive and mass-produced. The Wiimote and the Playstation Move both combine vision and inertial sensing technology, while the Hydra uses a miniaturized electromagnetic tracking system. The Hydra [28] and the Playstation Move [29] both provide position and orientation information (6 DOF), while the Wiimote is more complicated because it provides certain types of data depending on how it is held [30]. However, all three can be used to support 3D gestural user interfaces.

2.1.3. Active Full Body Tracking

Active sensing approaches to tracking a user’s full body can provide accurate data used in 3D gestural interfaces but can significantly hinder the user since there are many more sensors the user needs to wear compared with simple hand or finger tracking. In most cases, a user wears a body suit that contains the sensors needed to track the various parts of the body. This body suit may contain several electromagnetic trackers, for example, or a set of retroreflective fiducial markers that can be tracked using several strategically placed cameras. These systems are often used for motion capture for video games and movies but can also be used for 3D gestures. In either case, wearing the suit is not ideal in everyday situations given the amount of time required to put it on and take it off and given other less obtrusive solutions.

A more recent approach for supporting 3D gestural interfaces using the full body is to treat the body as an antenna. Cohn et al. first explored this idea for touch gestures [31] and then found that it could be used to detect 3D full body gestures [32, 33]. Using the body as an antenna does not support exact and precise tracking of full body poses but provides enough information to determine how the body is moving in space. Using a simple device either in a backpack or worn on the body, as long as it makes contact with the skin, this approach picks up how the body affects the electromagnetic noise signals present in an indoor environment stemming from power lines, appliances, and devices. This approach shows great promise for 3D full body gesture recognition because it does not require any cameras to be strategically placed in the environment, making the solution more portable.

2.2. Passive Sensors

In contrast to active sensing, where the user needs to wear a device or other markers, passive sensing makes use of computer vision and other technologies (e.g., light and sound) to provide unobtrusive tracking of the hands, fingers, and full body. In terms of computer vision, 3D gestural interfaces have been constructed using traditional cameras [34–37] (such as a single webcam) as well as depth cameras. The more recent approaches to recognizing 3D gestures make use of depth cameras because they provide more information than a traditional single camera in that they support extraction of a 3D representation of a user, which then enables skeleton tracking of the hands, fingers, and whole body.

There are generally three different technologies used in depth cameras, namely, time of flight, structured light, and stereo vision [38]. Time-of-flight depth cameras (e.g., the depth camera used in the Xbox One) determine the depth map of a scene by illuminating it with a beam of pulsed light and calculating the time it takes for the light to be detected on an imaging device after it is reflected off of the scene. Structured-light depth cameras (e.g., Microsoft Kinect) use a known pattern of light, often infrared, that is projected into the scene. An image sensor then captures this deformed light pattern based on the shapes in the scene and finally extracts 3D geometric shapes using the distortion of the projected optical pattern. Finally, stereo-based cameras attempt to mimic the human visual system using two calibrated imaging devices laterally displaced from each other. These two cameras capture synchronized images of the scene, and the depth for image pixels is extracted from the binocular disparity. The first two depth camera technologies are becoming more commonplace given their power in extracting 3D depth and their low cost.
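To make the stereo principle concrete, the following sketch (in Python) converts binocular disparity to depth using the standard pinhole stereo relation; the focal length, baseline, and disparity values are hypothetical, so this illustrates the general idea rather than any particular camera's pipeline.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Depth (meters) from binocular disparity via the pinhole stereo model
    Z = f * B / d. Zero disparities are mapped to infinite depth."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    depth = np.full_like(disparity_px, np.inf)
    valid = disparity_px > 0
    depth[valid] = focal_length_px * baseline_m / disparity_px[valid]
    return depth

# Hypothetical values: 600 px focal length, 7.5 cm baseline, disparities in pixels.
print(disparity_to_depth([30.0, 15.0, 0.0], 600.0, 0.075))  # -> [1.5, 3.0, inf]
```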

These different depth camera approaches have been used in a variety of ways to track fingers, hands, and the whole body. For example, Wang et al. used two Sony Eye cameras to detect both the hands and fingers to support a 3D gestural interface for computer aided design [39], while Hackenberg et al. used a time-of-flight camera to support hand and finger tracking for scaling, rotation, and translation tasks [40]. Keskin et al. used structured light-based depth sensing to also track hand and finger poses in real time [41]. Other recent works using depth cameras for hand and finger tracking for 3D gestural interfaces can be found in [42–44]. Similarly, these cameras have also been used to perform whole body tracking that can be used in 3D full body-based gestural interfaces. Most notable is Shotton et al.’s seminal work on using a structured light-based depth camera (i.e., Microsoft Kinect) to track a user’s whole body in real time [45]. Other recent approaches that make use of depth cameras to track the whole body can be found in [46–48].

More recent approaches to passive sensing used in 3D gesture recognition are through acoustic and light sensing. In the SoundWave system, a standard speaker and microphone found in most commodity laptops and devices are used to sense user motion [49]. An inaudible tone is sent through the speaker and gets frequency-shifted when it reflects off moving objects like a user’s hand. This frequency shift is measured by the microphone to infer various gestures. In the LightWave system, ordinary compact fluorescent light (CFL) bulbs are used as sensors of human proximity [50]. These CFL bulbs are sensitive proximity transducers when illuminated, and the approach can detect variations in electromagnetic noise resulting from the distance of the human from the bulb. Since this electromagnetic noise can be sensed from any point in an electrical wiring system, gestures can be sensed using a simple device plugged into any electrical outlet. Both of these sensing strategies are in their early stages and currently do not support recognizing a large quantity of 3D gestures at any time, but their unobtrusiveness and mobility make them a potentially powerful approach to body sensing for 3D gestural user interfaces.

3. 3D Gesture Recognition and Analysis

3D gestural interfaces require the computer to understand the finger, hand, or body movements of users to determine what specific gestures are performed and how they can then be translated into actions as part of the interface. The previous section examined the various strategies for continuously gathering the data needed to recognize 3D gestures. Once we have the ability to gather this data, it must be examined in real time using an algorithm that analyzes the data and determines when a gesture has occurred and what class that gesture belongs to. The focus of this section is to examine some of the most recent techniques for real-time recognition of 3D gestures. Several databases such as the ACM and IEEE Digital Libraries as well as Google Scholar were used to survey these techniques and the majority of those chosen reflect the state of the art. In addition, when possible, techniques that were chosen also had experimental evaluations associated with them. Note that other surveys that have explored earlier work on 3D gesture recognition also provide useful examinations of existing techniques [8, 51–53].

Recognizing 3D gestures is dependent on whether the recognizer first needs to determine if a gesture is present. In cases where there is a continuous stream of data and the users do not indicate that they are performing a gesture (e.g., using a passive vision-based sensor), the recognizer needs to determine when a gesture is performed. This process is known as gesture segmentation. If the user can specify when a gesture begins and ends (e.g., pressing a button on a Sony Move or Nintendo Wii controller), then the data is presegmented and gesture classification is all that is required. Thus, the process of 3D gesture recognition is made easier if a user is holding a tracked device, such as a game controller, but it is more obtrusive and does not support more natural interaction where the human body is the only “device” used. We will examine recognition strategies that do and do not make use of segmentation.

There are, in general, two different approaches to recognizing 3D gestures. The first, and most common, is to make use of the variety of different machine learning techniques in order to classify a given 3D gesture as one of a set of possible gestures [54, 55]. Typically, this approach requires extracting important features from the data and using those features as input to a classification algorithm. Additionally, varying amounts of training data are needed to seed and tune the classifier to make it robust to variability and to maximize accuracy. The second approach, which is somewhat underutilized, is to use heuristics-based recognition. With heuristic recognizers, no formal machine learning algorithms are used, but features are still extracted and rules are procedurally coded and tuned to recognize the gestures. This approach often makes sense when a small number of gestures are needed (e.g., typically 5 to 7) for a 3D gestural user interface.

3.1. Machine Learning

Using machine learning algorithms as classifiers for 3D gesture recognition represents the most common approach to developing 3D gesture recognition systems. The typical procedure for using a machine learning-based approach (sketched in code after the list) is to
(i) pick a particular machine learning algorithm,
(ii) come up with a set of useful features that help to quantify the different gestures in the gesture set,
(iii) use these features as input to the machine learning algorithm,
(iv) collect training and test data by obtaining many samples from a variety of different users,
(v) train the algorithm on the training data,
(vi) test the 3D gesture recognizer with the test data,
(vii) refine the recognizer with different/additional features or with more training data if needed.
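The sketch below walks through these steps with scikit-learn. The trajectory statistics, the placeholder data, and the choice of a k-nearest-neighbor classifier are all assumptions made for illustration; any of the classifiers discussed in the following subsections could be substituted in step (i).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical feature extractor (step ii): a gesture sample is a (T, 3) trajectory
# of 3D positions, summarized by per-axis means/standard deviations and path length.
def extract_features(trajectory):
    t = np.asarray(trajectory, dtype=float)
    path_length = np.linalg.norm(np.diff(t, axis=0), axis=1).sum()
    return np.concatenate([t.mean(axis=0), t.std(axis=0), [path_length]])

# Placeholder data standing in for step (iv): 100 trajectories with 4 gesture labels.
rng = np.random.default_rng(0)
samples = [rng.normal(size=(60, 3)) for _ in range(100)]
labels = rng.integers(0, 4, size=100)

X = np.array([extract_features(s) for s in samples])          # step (iii)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)                      # step (i): pick an algorithm
clf.fit(X_train, y_train)                                      # step (v): train
print("test accuracy:", clf.score(X_test, y_test))             # step (vi): test
```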

There are many different questions that need to be answered when choosing a machine learning-based approach to 3D gesture recognition. Two of the most important are what machine learning algorithm should be used and how accurate can the recognizer be. We will examine the former question by presenting some of the more recent machine learning-based strategies and discuss the latter question in Section 4.

3.1.1. Hidden Markov Models

Although Hidden Markov Models (HMMs) should not be considered recent technology, they are still a common approach to 3D gesture recognition. HMMs are ideally suited for 3D gesture recognition when the data needs to be segmented because they encode temporal information so a gesture can first be identified before it is recognized [37]. More formally, an HMM is a double stochastic process that has an underlying Markov chain with a finite number of states and a set of random functions, each associated with one state [56]. HMMs have been used in a variety of different ways with a variety of different sensor technologies. For example, Sako and Kitamura used multistream HMMs for recognizing Japanese sign language [57]. Pang and Ding used traditional HMMs for recognizing dynamic hand gesture movements using kinematic features such as divergence, vorticity, and motion direction from optical flow [58]. They also make use of principal component analysis (PCA) to help with feature dimensionality reduction. Bevilacqua et al. developed a 3D gesture recognizer that combines HMMs with stored reference gestures, which helps to reduce the amount of training required [59]. The method used only a single example for each gesture and the recognizer was targeted toward music and dance performances. Wan et al. explored better methods to generate efficient observations after feature extraction for HMMs [60]. Sparse coding is used for finding succinct representations of information in comparison to vector quantization for hand gesture recognition. Lee and Cho used hierarchical HMMs to recognize actions using 3D accelerometer data from a smart phone [61]. This hierarchical approach, which breaks up the recognition process into actions and activities, helps to overcome the memory storage and computational power concerns of mobile devices. Other work on 3D gesture recognizers that incorporate HMMs includes [62–69].
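As a rough illustration of how HMMs are typically applied to segmented 3D gesture data, the sketch below trains one Gaussian HMM per gesture class with the hmmlearn library and classifies a new sequence by maximum log-likelihood. The feature dimensionality, number of hidden states, and synthetic data are assumptions, not values from any of the cited systems.

```python
import numpy as np
from hmmlearn import hmm

# Hypothetical training data: for each gesture class, a list of observation
# sequences, where each sequence is a (T, D) array of per-frame features.
rng = np.random.default_rng(1)
train = {
    "wave":   [rng.normal(0.0, 1.0, size=(40, 6)) for _ in range(10)],
    "circle": [rng.normal(2.0, 1.0, size=(50, 6)) for _ in range(10)],
}

# One HMM per gesture class; classification picks the model with the highest
# log-likelihood for an observed sequence.
models = {}
for name, seqs in train.items():
    X = np.vstack(seqs)
    lengths = [len(s) for s in seqs]
    model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    models[name] = model

def classify(sequence):
    return max(models, key=lambda name: models[name].score(sequence))

print(classify(rng.normal(2.0, 1.0, size=(45, 6))))  # likely "circle"
```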

3.1.2. Conditional Random Fields

Conditional random fields (CRFs) are considered to be a generalization of HMMs and have seen a lot of use in 3D gesture recognition. Like HMMs, they are a probabilistic framework for classifying and segmenting sequential data; however, they make use of conditional probabilities, which relax any independence assumptions and also avoid the label bias problem [70]. As with HMMs, there have been a variety of different recognition methods that use and extend CRFs. For example, Chung and Yang used depth sensor information as input to a CRF with an adaptive threshold for distinguishing between gestures that are in the gesture set and those that are outside of it [71]. This approach, known as T-CRF, was also used for sign language spotting [72]. Yang and Lee also combined a T-CRF and a conventional CRF in a two-layer hierarchical model for recognition of signs and finger spelling [73]. Other 3D gesture recognizers that make use of CRFs include [39, 74, 75].
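For a flavor of how a linear-chain CRF can jointly segment and label a frame stream, here is a minimal sketch using the sklearn-crfsuite library. The per-frame features, labels, and tiny training set are hypothetical; the cited systems use far richer features and, in the T-CRF case, an additional adaptive threshold that is not shown here.

```python
import sklearn_crfsuite

# Hypothetical per-frame features for two hand-motion sequences: each frame is a
# feature dict, and each frame carries a label ("gesture_A", "gesture_B", or
# "none" for frames that fall outside any gesture).
X_train = [
    [{"speed": 0.1, "dir": "up"},   {"speed": 0.9, "dir": "up"},   {"speed": 0.8, "dir": "up"}],
    [{"speed": 0.2, "dir": "left"}, {"speed": 0.7, "dir": "left"}, {"speed": 0.1, "dir": "left"}],
]
y_train = [
    ["none", "gesture_A", "gesture_A"],
    ["none", "gesture_B", "none"],
]

# A linear-chain CRF labels every frame, which segments and classifies the stream
# jointly; unlike a per-frame classifier, it models transitions between labels.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([[{"speed": 0.15, "dir": "up"}, {"speed": 0.85, "dir": "up"}]]))
```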

Hidden conditional random fields (HCRFs) extend the concept of the CRF by adding hidden state variables into the probabilistic model which is used to capture complex dependencies in the observations while still not requiring any independence assumptions and without having to exactly specify dependencies [76]. In other words, HCRFs enable sharing of information between labels with the hidden variables but cannot model dynamics between them. HCRFs have also been utilized in 3D gesture recognition. For example, Sy et al. were one of the first groups to use HCRFs in both arm and head gesture recognition [77]. Song et al. used HCRFs coupled with temporal smoothing for recognizing body and hand gestures for aircraft signal handling [78]. Liu et al. used HCRFs for detecting hand poses in a continuous stream of data for issuing commands to robots [79]. Other works that incorporate HCRFs in 3D gesture recognizers include [80, 81].

Another variant to CRFs is the latent-dynamic hidden CRF (LDCRF). This approach builds upon the HCRF by providing the ability to model the substructure of a gesture label and learn the dynamics between labels, which helps in recognizing gestures from unsegmented data [82]. As with CRFs and HCRFs, LDCRFs have been examined for use as part of 3D gesture recognition systems and received considerable attention. For example, Elmezain and Al-Hamadi use LDCRFs for recognizing hand gestures in American sign language using a stereo camera [83]. Song et al. improved upon their prior HCRF-based approach [78] to recognizing both hand and body gestures by incorporating the LDCRF [84]. Zhang et al. also used LDCRFs for hand gesture recognition but chose to use fuzzy-based latent variables to model hand gesture features with a modification to the LDCRF potential functions [85]. Elmezain et al. also used LDCRFs in hand gesture recognition to specifically explore how they compare with CRFs and HCRFs. They examined different window sizes and used location, orientation, and velocity features as input to the recognizers, with LDCRFs performing the best in terms of recognition accuracy [86].

3.1.3. Support Vector Machines

Support vector machines (SVMs) are another approach used in 3D gesture recognition that has received considerable attention in recent years. SVMs are a supervised learning-based classification approach that constructs a hyperplane or set of hyperplanes in a high-dimensional space so as to maximize the distance to the nearest training data point of any class [87]. These hyperplanes are then used for classification of unseen instances. The mappings used by SVMs are designed in terms of a kernel function selected for a particular problem type. Since not all of the training data may be linearly separable in a given space, the data can be transformed via nonlinear kernel functions to work with more complex problem domains.
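A minimal scikit-learn sketch of a kernel SVM applied to gesture feature vectors is shown below; the synthetic features and parameter values are placeholders, and a real system would substitute features such as those described in the following paragraphs.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical feature vectors for three gesture classes (e.g., statistics of
# hand trajectories); one cluster of samples per class for illustration only.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.5, size=(30, 8)) for c in (0.0, 1.5, 3.0)])
y = np.repeat([0, 1, 2], 30)

# An RBF kernel lets the SVM separate classes that are not linearly separable in
# the original feature space; scikit-learn handles multiclass problems internally.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.predict(rng.normal(1.5, 0.5, size=(1, 8))))  # likely class 1
```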

In terms of 3D gestures, there have been many recognition systems that make use of SVMs. For example, recent work has explored different ways of extracting the features used in SVM-based recognition. Huang et al. used SVMs for hand gesture recognition coupled with Gabor filters and PCA for feature extraction [88]. Hsieh et al. took a similar approach for hand gesture recognition but used the discrete Fourier transform (DFT) coupled with the Camshift algorithm and boundary detection to extract the features used as input to the SVM [89]. Hsieh and Liou not only used Haar features for their SVM-based recognizer but also examined the color of the user’s face to assist in detecting and extracting the users’ hands [90]. Dardas et al. created an SVM-based hand gesture detection and recognition system by using the scale invariant feature transform (SIFT) and vector quantization to create a unified dimensional histogram vector (e.g., bag of words) with K-means clustering. This vector was used as the input to a multiclass SVM [91, 92].

Other ways in which SVMs have been used for 3D gesture recognition have focused on fusing more than one SVM together or using the SVM as part of a larger classification scheme. For example, Chen and Tseng used 3 SVMs from 3 different camera angles to recognize 3D hand gestures by fusing the results from each with majority voting or using recognition performance from each SVM as a weight to the overall gesture classification score [93]. Rashid et al. combined an SVM and HMM together for American sign language where the HMM was used for gestures while the SVM was used for postures. The results from these two classifiers were then combined to provide a more general recognition framework [94]. Song et al. used an SVM for hand shape classification that was combined with a particle filtering estimation framework for 3D body postures and an LDCRF for overall recognition [84]. Other 3D gesture recognizers that utilize SVMs include [80, 95–101].

3.1.4. Decision Trees and Forests

Decision trees and forests are an important machine learning tool for recognizing 3D gestures. With decision trees, each node of the tree makes a decision about some gesture feature. The path traversed from the root to a leaf in a decision tree specifies the expected classification by making a series of decisions on a number of attributes. There are a variety of different decision tree implementations [102]. One of the most common is the C4.5 algorithm [103], which uses the notion of entropy to rank features and determine which feature is most informative for classification. This strategy is used in the construction of the decision tree. In the context of 3D gesture recognition, there have been several different strategies explored using decision trees. For example, Nisar et al. used standard image-based features such as area, centroid, and convex hull among others as input to a decision tree for sign language recognition [104]. Jeon et al. used decision trees for recognizing hand gestures for controlling home appliances. They added a fuzzy element to their approach, developing a multivariate decision tree learning and classification algorithm. This approach uses fuzzy membership functions to calculate the information gain in the tree [105]. Zhang et al. combined decision trees with multistream HMMs for Chinese sign language recognition. They used a 3-axis accelerometer and electromyography (EMG) sensors as input to the recognizer [106]. Other examples of using decision trees in 3D gesture recognition include [107, 108].
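The following sketch shows an entropy-based decision tree on hypothetical posture features using scikit-learn. Note that scikit-learn implements a CART-style tree, so criterion="entropy" only approximates C4.5's information-gain ranking.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical image-derived posture features (e.g., area, centroid x/y, hull size),
# loosely in the spirit of the feature sets described above.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 1.0, size=(40, 4)) for c in (0.0, 3.0)])
y = np.repeat(["fist", "open_hand"], 40)

# Entropy-based splitting mirrors C4.5-style information gain; depth is capped to
# keep the tree small and interpretable.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
tree.fit(X, y)
print(tree.predict(rng.normal(3.0, 1.0, size=(1, 4))))  # likely "open_hand"
```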

Decision forests are an extension of the decision tree concept. The main difference is that instead of just one tree used in the recognition process, there is an ensemble of randomly trained decision trees that output the class that is the mode of the classes output by the individual trees [115]. Given the power of GPUs, decision forests are becoming prominent for real-time gesture recognition because the recognition algorithm can be easily parallelized with potentially thousands of trees included in the decision forest [116]. This decision forest approach can be considered a framework that has several different parts that can produce a variety of different models. The shape of the decision to use for each node, the type of predictor used in each leaf, the splitting objective used to optimize each node, and the method for injecting randomness into the trees are all choices that need to be made when constructing a decision forest used in recognition. One of the most notable examples of the use of decision forests was Shotton et al.’s work on skeleton tracking for the Microsoft Kinect [45]. This work led researchers to look at decision forests for 3D gesture recognition. For example, Miranda et al. used decision forests for full body gesture recognition using the skeleton information from the Microsoft Kinect depth camera. Key poses from the skeleton data are extracted using a multiclass SVM and fed as input to the decision forest. Keskin et al. used a depth camera to recognize hand poses using decision forests [41]. A realistic 3D hand model with 21 different parts was used to create synthetic depth images for decision forest training. In another example, Negin et al. used decision forests on kinematic time series to determine the best set of features to use from a depth camera [111]. These features are then fed into an SVM for gesture recognition. Other work that has explored the use of decision forests for 3D gesture recognition includes [110, 117, 118].
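A corresponding decision forest sketch, again with scikit-learn and hypothetical skeleton features, is given below; the ensemble prediction is the mode of the individual tree votes, and training and prediction parallelize naturally across trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-frame skeleton features (e.g., joint angles) for three body poses.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 1.0, size=(50, 12)) for c in (0.0, 2.0, 4.0)])
y = np.repeat(["t_pose", "arms_up", "crouch"], 50)

# An ensemble of randomized trees: each tree votes and the mode of the votes is
# returned. n_jobs=-1 parallelizes training and prediction across CPU cores.
forest = RandomForestClassifier(n_estimators=200, max_depth=10, n_jobs=-1, random_state=0)
forest.fit(X, y)
print(forest.predict(rng.normal(2.0, 1.0, size=(1, 12))))  # likely "arms_up"
```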

3.1.5. Other Learning-Based Techniques

There are, of course, a variety of other machine learning-based techniques that have been used for 3D gesture recognition; examples include neural networks [119, 120], template matching [121, 122], finite state machines [121, 123], and the Adaboost framework [112]. To cover all of them in detail would go beyond the scope of this paper. However, two other 3D gesture recognition algorithms are worth mentioning because they both stem from recognizers used in 2D pen gesture recognition, are fairly easy to implement, and provide good results. These recognizers tend to work on segmented data but can be extended to unsegmented data streams by integrating circular buffers with varying window sizes, depending on the types of 3D gestures in the gesture set and the data collection system. The first is based on Rubine’s linear classifier [124], first published in 1991. This classifier is a linear discriminator where each gesture has an associated linear evaluation function, and each feature has a weight based on the training data. The classifier uses a closed form solution for training which produces optimal classifiers given that the features are normally distributed. However, the approach still produces good results even when there is a drift from normality. This approach also always produces a classification, so the false positive rate can be high. However, a good rejection rule will remove ambiguous gestures and outliers. The extension of this approach to 3D gestures is relatively straightforward. The features need to be extended to capture 3D information with the main classifier and training algorithm remaining the same. This approach has been used successfully in developing simple, yet effective 3D gesture recognizers [112, 125, 126].
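To illustrate the flavor of a Rubine-style linear classifier extended to 3D features, here is a compact numpy sketch that derives per-class linear evaluation functions from class means and a pooled covariance estimate. It follows the general linear-discriminant idea described above rather than Rubine's exact training equations, and the features and rejection rule are left as placeholders.

```python
import numpy as np

class LinearGestureClassifier:
    """A minimal Rubine-style linear classifier: per-class linear evaluation
    functions whose weights come from class means and a shared covariance matrix
    estimated from training data (a sketch, not Rubine's exact estimator)."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        means = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Pooled (shared) covariance across classes, lightly regularized for stability.
        centered = np.vstack([X[y == c] - m for c, m in zip(self.classes_, means)])
        cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        inv_cov = np.linalg.inv(cov)
        self.weights_ = means @ inv_cov                                # one weight vector per class
        self.bias_ = -0.5 * np.einsum("ij,ij->i", self.weights_, means)
        return self

    def predict(self, X):
        scores = np.asarray(X, dtype=float) @ self.weights_.T + self.bias_
        return self.classes_[scores.argmax(axis=1)]

# Hypothetical 3D gesture features (e.g., total angle swept, path length, duration).
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 1.0, size=(30, 5)) for c in (0.0, 2.5)])
y = np.repeat(["swipe", "circle"], 30)
print(LinearGestureClassifier().fit(X, y).predict(rng.normal(2.5, 1.0, size=(2, 5))))
```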

The second approach is based on Wobbrock et al.’s $1 2D recognizer [127]. Kratz and Rohs used the $1 recognizer as a foundation for the $3 recognizer, designed primarily for 3D gestures on mobile devices [113, 128]. In this approach, gesture traces are created using the differences between the current and previous acceleration data values and resampled to have the same number of points as any gesture template. These resampled traces are then corrected for rotational error using the angle between the gesture’s first point and its centroid. Average mean square error is then used to determine the given gesture trace’s distance to each template in the gesture class library. A heuristic scoring mechanism is used to help reject false positives. Note that a similar approach to constructing a 3D gesture recognizer was taken by Li, who adapted the Protractor 2D gesture recognizer [129] and extended it to work with accelerometer and gyroscope data [114, 130].
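The sketch below captures the template-matching core shared by these $-family recognizers: resample a trace, normalize it, and pick the template with the smallest mean squared error. It deliberately omits the $3 recognizer's rotation-correction search and heuristic score threshold, so it should be read as a simplified illustration rather than the published algorithm.

```python
import numpy as np

def resample(points, n=32):
    """Resample a 3D trace to n points spaced evenly along its path length."""
    pts = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, cum[-1], n)
    return np.column_stack([np.interp(targets, cum, pts[:, k]) for k in range(pts.shape[1])])

def normalize(points, n=32):
    """Resample, translate to the centroid, and scale to unit size."""
    pts = resample(points, n)
    pts -= pts.mean(axis=0)
    scale = np.linalg.norm(pts, axis=1).max()
    return pts / scale if scale > 0 else pts

def recognize(trace, templates, n=32):
    """Return the template label with the smallest mean squared point-to-point error.
    (Rotation correction and false-positive rejection are omitted in this sketch.)"""
    candidate = normalize(trace, n)
    def mse(label):
        return np.mean(np.sum((candidate - normalize(templates[label], n)) ** 2, axis=1))
    return min(templates, key=mse)

# Hypothetical templates standing in for stored acceleration-delta traces.
t = np.linspace(0, 2 * np.pi, 64)
templates = {"circle": np.column_stack([np.cos(t), np.sin(t), 0 * t]),
             "line":   np.column_stack([t, 0 * t, 0 * t])}
print(recognize(np.column_stack([np.cos(t), np.sin(t), 0.05 * t]), templates))  # "circle"
```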

3.2. Heuristic Recognizers

Heuristic 3D gesture recognizers make sense when there are a small number of easily identifiable gestures in an interface. The advantage of heuristic-based approaches is that no training data is needed and they are fairly easy to implement. For example, Williamson et al. [131] developed a heuristic recognition method using skeleton data from a Microsoft Kinect focused on jumping, crouching, and turning. An example of a heuristic recognizer for jumping would be to assume a jump was made when the head is at a certain height above its normal position, defined as

$$J = (H_{\text{head}} > H_{\text{normal}} + C),$$

where $J$ is true or false based on whether a jump has occurred, $H_{\text{head}}$ is the height of the head position, $H_{\text{normal}}$ is the calibrated normal height of the head position with the user standing, and $C$ is some constant. $C$ would then be set to a height that a person would only reach by jumping from the ground. Such recognition is very specialized but simple and explainable and can determine in an instant whether a jump has occurred.
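A direct translation of this jump heuristic into code might look as follows; the threshold constant and the calibrated standing height are hypothetical values that would need to be tuned for a particular sensor and user.

```python
def detect_jump(head_height, normal_height, threshold=0.25):
    """Heuristic jump detector following the rule above: a jump is reported when
    the head rises more than `threshold` meters above its calibrated standing
    height. The 0.25 m default is a hypothetical value to be tuned per setup."""
    return head_height > normal_height + threshold

# Example with a calibrated standing head height of 1.70 m.
print(detect_jump(1.98, 1.70))  # True
print(detect_jump(1.75, 1.70))  # False
```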

Recent work has shown that heuristic 3D recognition works well with devices that primarily make use of accelerometers and/or gyroscopes (e.g., the Nintendo Wiimote, smart phones). For example, One Man Band used a Wiimote to simulate the movements necessary to control the rhythm and pitch of several musical instruments [132]. RealDance explored spatial 3D interaction for dance-based gaming and instruction [133]. By wearing Wiimotes on the wrists and ankles, players followed an on-screen avatar’s choreography and had their movements evaluated on the basis of correctness and timing. These explorations led to several heuristic recognition schemes for devices which use accelerometers and gyroscopes.

Poses and Underway Intervals. A pose is a length of time during which the device is not changing position. Poses can be useful for identifying held positions in dance, during games, or possibly even in yoga. An underway interval is a length of time during which the device is moving but not accelerating. Underway intervals can help identify smooth movements and differentiate between, say, strumming on a guitar and beating on a drum.

Because neither poses nor underway intervals have an acceleration component, they cannot be differentiated using accelerometer data alone. To differentiate the two, a gyroscope can provide a frame of reference to identify whether the device has velocity. Alternatively, context can be used, such as tracking acceleration over time to determine whether the device is moving or stopped.

Poses and underway intervals have three components. First, the time span is the duration in which the user maintains a pose or an underway interval. Second, the orientation of gravity from the acceleration vector helps verify that the user is holding the device at the intended orientation. Of course, unless a gyroscope is used, the device’s yaw cannot be reliably detected. Third, the allowed variance is the threshold value for the amount of acceleration allowed in the heuristic before rejecting the pose or underway interval. For example, in RealDance [133], poses were important for recognizing certain dance movements. For a pose, the user was supposed to stand still in a specific posture beginning at time $t_0$ and lasting until $t_0 + N$, where $N$ is a specified number of beats. A player’s score could be represented as the percentage of the time interval during which the user successfully maintained the correct posture.
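A minimal sketch of a pose test along these lines is shown below: the window length stands in for the time span, the mean acceleration direction is compared against an expected gravity direction, and the summed variance enforces the allowed-variance component. All thresholds are hypothetical tuning values.

```python
import numpy as np

def is_pose(accel_window, target_gravity, max_deviation=0.15, max_variance=0.05):
    """Heuristic pose test over a window of accelerometer samples (one row per
    sample, in g units). The window length encodes the required time span; the
    thresholds are hypothetical and would be tuned per device and gesture."""
    a = np.asarray(accel_window, dtype=float)
    mean_dir = a.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir)
    target = np.asarray(target_gravity, dtype=float)
    target /= np.linalg.norm(target)
    orientation_ok = np.linalg.norm(mean_dir - target) < max_deviation  # gravity direction check
    steady = a.var(axis=0).sum() < max_variance                         # allowed-variance check
    return orientation_ok and steady

# Device held still, roughly level: gravity along -z plus a little sensor noise.
window = np.tile([0.0, 0.0, -1.0], (30, 1)) + np.random.default_rng(6).normal(0, 0.01, (30, 3))
print(is_pose(window, target_gravity=[0.0, 0.0, -1.0]))  # True
```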

Impulse Motions. An impulse motion is characterized by a rapid change in acceleration, easily measured by an accelerometer. A good example is a tennis or golf club swing in which the device motion accelerates through an arc or a punching motion, which contains a unidirectional acceleration. An impulse motion has two components, which designers can tune for their use. First, the time span of the impulse motion specifies the window over which the impulse is occurring. Shorter time spans increase the interaction speed, but larger time spans are more easily separable from background jitter. The second component is the maximum magnitude reached. This is the acceleration bound that must be reached during the time span in order for the device to recognize the impulse motion.
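An impulse test can be as simple as thresholding the peak acceleration magnitude within a window, as in the sketch below; the window length and threshold are the two tunable components described above, and the values used are hypothetical.

```python
import numpy as np

def detect_impulse(accel_window, magnitude_threshold=2.5):
    """Heuristic impulse detector: within the window (whose length sets the time
    span), report an impulse if the peak acceleration magnitude exceeds the
    threshold. Both parameters are hypothetical tuning values (in g units)."""
    mags = np.linalg.norm(np.asarray(accel_window, dtype=float), axis=1)
    return mags.max() >= magnitude_threshold

# A short swing: acceleration builds up, peaks well above the threshold, then falls.
swing = [[0.1, 0.0, -1.0], [0.8, 0.2, -1.0], [2.9, 0.5, -0.8], [1.0, 0.1, -1.0]]
print(detect_impulse(swing))  # True
```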

Impulse motions can also be characterized by their direction. The acceleration into a punch is essentially a straight impulse motion, a tennis swing has an angular acceleration component, and a golf swing has both angular acceleration and even increasing acceleration during the follow-through when the elbow bends. All three of these impulse motions, however, are indistinguishable to an acceleration-only device, which does not easily sense these orientation changes. For example, the punch has an acceleration vector along a single axis, as does the tennis swing, which only roughly changes its orientation as the swing progresses. These motions can be differentiated by using a gyroscope as part of the device or by assuming that orientation does not change. As an example, RealDance used impulse motions to identify punches. A punch was characterized by a rapid deceleration occurring when the arm was fully extended. In a rhythm-based game environment, this instant should line up with a strong beat in the music. An impulse motion was scored by considering a one-beat interval centered on the expected beat.

Impact Events. An impact event is an immediate halt to the device due to a collision, characterized by an easily identifiable acceleration burst across all three dimensions. Examples of this event include the user tapping the device on a table or dropping it so it hits the floor. To identify an impact event, the change in acceleration (jerk) vector is computed for each pair of adjacent time samples,

$$\mathbf{j}_t = \mathbf{a}_t - \mathbf{a}_{t-1},$$

where $\mathbf{a}_t$ is the acceleration vector at time $t$. An impact occurs when the largest magnitude of jerk in the interval exceeds a threshold value. As an example, RealDance used impact motions to identify stomps. If the interval surrounding a dance move had a maximal jerk value less than a threshold, no impact occurred. One Man Band also used impact events to identify when a Nintendo Nunchuk controller and Wiimote collided, which is how users played hand cymbals.
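In code, the impact test reduces to computing the jerk between adjacent samples and thresholding its largest magnitude, as sketched below with a hypothetical threshold.

```python
import numpy as np

def detect_impact(accel_samples, jerk_threshold=4.0):
    """Heuristic impact detector following the jerk rule above: compute the jerk
    vector j_t = a_t - a_(t-1) for adjacent samples and report an impact when its
    largest magnitude exceeds a threshold (hypothetical value, in g per sample)."""
    a = np.asarray(accel_samples, dtype=float)
    jerk = np.diff(a, axis=0)
    return np.linalg.norm(jerk, axis=1).max() >= jerk_threshold

# A tap: steady acceleration followed by an abrupt burst across all three axes.
samples = [[0.0, 0.0, -1.0], [0.1, 0.0, -1.0], [3.5, 3.0, 2.0], [0.0, 0.1, -1.0]]
print(detect_impact(samples))  # True
```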

Modal Differentiation. Heuristics can also be used as a form of simple segmentation to support the recognition of different gestures. For example, in One Man Band [132], the multi-instrument musical interface (MIMI) differentiated between five different instruments by implementing modal differences based on a Wiimote’s orientation. Figure 3 shows four of these. If the user held the Wiimote on its side and to the left, as if playing a guitar, the application interpreted impulse motions as strumming motions. If the user held the Wiimote to the left, as if playing a violin, the application interpreted the impulse motions as violin sounds. To achieve this, the MIMI’s modal-differentiation approach used a normalization step on the accelerometer data to identify the most prominent orientation, followed by two exponential smoothing functions with different smoothing factors. The first function removed jitter and identified drumming and strumming motions. The second function removed jitter and identified short, sharp gestures such as violin strokes.
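The sketch below illustrates the two ingredients described above, a gravity-axis normalization step and an exponential smoothing filter, on synthetic accelerometer data; the smoothing factors and the data are hypothetical and are not the values used in One Man Band.

```python
import numpy as np

def exponential_smooth(accel_samples, alpha):
    """Exponentially smooth a stream of 3-axis accelerometer samples:
    s_t = alpha * a_t + (1 - alpha) * s_(t-1). Smaller alpha values smooth more
    aggressively (slow, sweeping motions); larger values preserve short, sharp
    gestures. The alpha used below is a hypothetical choice."""
    a = np.asarray(accel_samples, dtype=float)
    smoothed = np.empty_like(a)
    smoothed[0] = a[0]
    for t in range(1, len(a)):
        smoothed[t] = alpha * a[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed

def dominant_axis(accel_samples):
    """Normalization step: identify the most prominent gravity axis, which a
    modal-differentiation heuristic can use to decide how the device is held."""
    mean = np.asarray(accel_samples, dtype=float).mean(axis=0)
    axis = int(np.argmax(np.abs(mean)))
    return axis, np.sign(mean[axis])

rng = np.random.default_rng(7)
stream = np.tile([0.0, -1.0, 0.0], (50, 1)) + rng.normal(0, 0.2, (50, 3))
print(dominant_axis(stream))                      # axis 1, negative: held on its side
print(exponential_smooth(stream, alpha=0.1)[-1])  # heavily smoothed estimate
```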

4. Experimentation and Accuracy

As we have seen in the last section, there have been a variety of different approaches for building 3D gesture recognition systems for use in 3D gestural interfaces. In this section, we focus on understanding how well these approaches work in terms of recognition accuracy and the number of gestures that can be recognized. These two metrics help to provide researchers and developers guidance on what strategies work best. As with Section 3, we do not aim to be an exhaustive reference on the experiments that have been conducted on 3D gesture recognition accuracy. Rather, we present a representative sample that highlights the effectiveness of different 3D gesture recognition strategies.

A summary of the experiments and accuracy of various 3D gesture recognition systems is shown in Table 1. This table shows the authors of the work, the recognition approach or strategy, the number of recognized gestures, and the highest accuracy level reported. As can be seen in the table, there have been a variety of different methods that have been proposed and most of the results reported are able to achieve over 90% accuracy. However, the number of gestures in the gesture sets used in the experiments vary significantly. The number of gestures in the gesture set is often not indicative of performance when comparing techniques. In some cases, postures were used instead of more complex gestures and in some cases, more complex activities were recognized. For example, Lee and Cho recognized only 3 gestures, but these are classified as activities that included shopping, taking a bus, and moving by walking [61]. The gestures used in these actions are more complex than, for example, finger spelling. In other cases, segmentation was not done as part of the recognition process. For example, Hoffman et al. were able to recognize 25 gestures at 99% accuracy, but the data was presegmented using button presses to indicate the start and stop of a gesture [112].

It is often difficult to compare 3D gesture recognition techniques for a variety of reasons including the use of different data sets, parameters, and number of gestures. However, there have been several, more inclusive experiments that have focused on examining several different recognizers in one piece of research. For example, Kelly et al. compared their gesture threshold HMM with HMMs, transition HMMs, CRFs, HCRFs, and LDCRFs [64] and found their approach to be superior, achieving over 97% accuracy on 8 dynamic sign language gestures. Wu et al. compared their frame-based descriptor and multiclass SVM to dynamic time warping, a naive Bayes classifier, C4.5 decision trees, and HMMs and showed their approach has better performance compared to the other methods for both user dependent (95.2%) and user independent cases (89.3%) for 12 gestures [100]. Lech et al. compared a variety of different recognition systems for building a sound mixing gestural interface [118]. They compared the nearest neighbor algorithm with nested generalization, naive Bayes, C4.5 decision trees, random trees, decision forests, neural networks, and SVMs on a set of four gestures and found the SVM to be the best approach for their application. Finally, Cheema et al. compared a linear classifier, decision trees, Bayesian networks, SVM, and AdaBoost using decision trees as weak learners on a gesture set containing 25 gestures [125]. They found that the linear classifier performed the best under different conditions which is interesting given its simplicity compared to the other 3D gesture recognition methods. However, SVM and AdaBoost also performed well under certain user independent recognition conditions when using more training samples per gesture.

Experiments on 3D gesture recognition systems have also been carried out in terms of how they can be used as 3D gestural user interfaces and there have been a variety of different application domains explored [134]. Entertainment and video games are just one example of an application domain where 3D gestural interfaces are becoming more common. This trend is evident since all major video game consoles and the PC support devices that capture 3D motion from a user. In other cases, video games are being used as the research platform for 3D gesture recognition. Figure 4 shows an example of using a video game to explore what the best gesture set should be for a first person navigation game [9], while Figure 5 shows screenshots of the video game used in Cheema et al.’s 3D gesture recognition study [125]. Other 3D gesture recognition research that has focused on the entertainment and video game domain includes [132, 135–137].

Medical applications and use in operating rooms are an area where 3D gestures have been explored. Using passive sensing enables the surgeon or doctor to use gestures to gather information about a patient on a computer while still maintaining a sterile environment [138, 139]. 3D gesture recognition has also been explored with robotic applications in the human robot interaction field. For example, Pfeil et al. (shown in Figure 6) used 3D gestures to control unmanned aerial vehicles (UAVs) [140]. They developed and evaluated several 3D gestural metaphors for teleoperating the robot. Other examples of 3D gesture recognition technology used in human robot interaction applications include [141–143]. Other application areas include training and interfacing with vehicles. Williamson et al. developed a full body gestural interface for dismounted soldier training [29] while Riener explored how 3D gestures could be used to control various components of automotive vehicles [144]. Finally, 3D gesture recognition has recently been explored in consumer electronics, specifically for control of large screen smart TVs [145, 146].

5. Future Research

Although there have been great strides in 3D gestural user interfaces, from unobtrusive sensing technologies to advanced machine learning algorithms that are capable of robustly recognizing large gesture sets, there still remains a significant amount of future research that needs to be done to make 3D gestural interaction truly robust, provide compelling user experiences, and support interfaces that are natural and seamless to users. In this section, we highlight four areas that need to be explored further to significantly advance 3D gestural interaction.

5.1. Customized 3D Gesture Recognition

Although there has been some work on customizable 3D gestural interfaces [147], customization is still an open problem. Customization can take many forms and in this case, we mean the ability for users to determine the best gestures for themselves for a particular application. Users should be able to define the 3D gestures they want to perform for a given task in an application. This type of customization goes one step further than having user-dependent 3D gesture recognizers (although this is still a challenging problem in cases where many people are using the interface).

There are several problems that need to be addressed to support customized 3D gestural interaction. First, how do users specify what gestures they want to perform for a given task? Second, once these gestures are specified, if using machine learning, how do we get enough data to train the classification algorithms without burdening the user? Ideally, the user should need to specify a gesture only once. This means that synthetic data needs to be generated based on user profiles, or more sophisticated learning algorithms that deal with small training set sizes are required. Third, how do we deal with user-defined gestures that are very similar to each other? This problem occurs frequently in all kinds of gesture recognition, but the difference in this case is that the users are specifying the 3D gesture and we want them to use whatever gesture they come up with. These are all problems that need to be solved in order to support truly customized 3D gestural interaction.

5.2. Latency

3D gesture recognition needs to be both fast and accurate to make 3D gestural user interfaces usable and compelling. In fact, the recognition component needs to be somewhat faster than real time because responses based on 3D gestures need to occur at the moment a user finishes a gesture. Thus, the gesture needs to be recognized a little bit before the user finishes it. This speed requirement makes latency an important problem that needs to be addressed to ensure fluid and natural user experiences. In addition, as sensors get better at acquiring a user’s position, orientation, and motion in space, the amount of data that must be processed will increase making the latency issue a continuing problem.

Latency can be broken up into computational latency and observational latency [74, 148]. Computational latency is the delay based on the amount of computation needed to recognize 3D gestures. Observational latency is the delay based on the minimum amount of data that needs to be observed to recognize a 3D gesture. Minimizing and mitigating both kinds of latency is an important area of research. Parallel processing can play an important role in reducing computational latency, while better understanding the kinematics of the human body is one of many possible ways to assist in reducing observational latency.

5.3. Using Context

Making use of all available information for recognizing 3D gestures in a 3D gestural interface makes intuitive sense because it can assist the recognizer in several ways: it can help to reduce the number of possible 3D gestures that could be recognized at any one time, and it can assist in improving recognition accuracy. Using context is certainly an area that has received considerable attention [149–151], especially in activity recognition [152–154], but there are several questions that need to be answered specifically related to context in 3D gestural interfaces. First, what type of context can be extracted that is most useful to improve recognition? As an example, in a video game, the current state of the player and the surrounding environment could provide useful information to trivially reject certain gestures that do not make sense in a given situation. Second, how can context be directly integrated into 3D gesture recognizers? As we have seen, there are a variety of different approaches to recognize 3D gestures, yet it is unclear how context can be best used in all of these algorithms. Finally, what performance benefits do we gain from making use of context, both in terms of accuracy and latency reduction, when compared to recognizers that do not make use of context? It is important to know how much more of an improvement we can get in accuracy and latency minimization so we can determine what the best approaches are for a given application.

5.4. Ecological Validity

Perhaps one of the most important research challenges with 3D gestural interfaces is determining exactly how accurate the 3D gesture recognizer that makes up the 3D gestural interface is from a usability standpoint. In other words, how accurate is the 3D gestural interface when used in its intended setting? Currently most studies that explore a recognizer’s accuracy are constrained experiments intended to evaluate the recognizer by having users perform each available gesture a certain number of times. As seen in Section 4, researchers have been able to get very high accuracy rates. However, we have also seen from Cheema et al. [125, 126] that accuracy can be severely reduced when tests are conducted in more realistic, ecologically valid scenarios. Even in the case of Cheema et al.’s work, their experiments do not come close to the ecological validity required to truly test a 3D gestural interface. Thus, these studies act more as an upper bound on gesture recognition performance than a true indicator of the recognition accuracy in everyday settings.

The open research problem here is how to design an ecologically valid experiment to test a 3D gestural interface. To illustrate the challenge, consider a 3D gestural interface for a video game. To adequately test the 3D gesture recognizer, we need to evaluate how accurately the recognizer can handle each gesture in the gesture set. However, to be ecologically valid, the game player should be able to use any gesture that makes sense at any given time. Thus, we do not know what gestures the user will be doing at any given time nor whether they will provide enough test samples to adequately test the recognizer. That presents a difficult challenge. One option is to try to design the game so that each gesture is needed a sufficient number of times, but this may not be the best user experience if, for example, a user likes one particular gesture over another. Another option is to have many users test the system, video tape the sessions, and then watch them to determine which gestures they appear to perform. With enough users, the number of gestures in the test set would approach the appropriate amount. Neither of these two options seems ideal and more research is needed to determine the best way to deal with the ecological validity issue.

6. Conclusion

3D gestural interaction represents a powerful and natural method of communication between humans and computers. The ability to use 3D gestures in a computer application requires a method to capture 3D position, orientation, and/or motion data through sensing technology. 3D gesture recognizers then need to be developed to detect specific patterns in the data from a set of known patterns. These 3D gesture recognizers can be heuristic-based, where a set of rules is encoded based on observation of the types of gestures needed in an application, or machine learning-based, where classifiers are trained using training data. These recognizers must then be evaluated to determine their accuracy and robustness.

In this paper, we have examined 3D gesture interaction research by exploring recent trends in these areas including sensing, recognition, and experimentation. These trends show that there are both strengths and weaknesses with the current state of 3D gesture interface technology. Strengths include powerful sensing technologies that can capture data from a user’s whole body unobtrusively as well as new sensing technology directions that go beyond computer vision-based approaches. In addition, we are seeing better and faster 3D gesture recognition algorithms (that can make use of parallel processing) that can reach high recognition accuracies with reasonably sized gesture sets. However, 3D gestural interaction still suffers from several weaknesses. One of the most important is that although accuracy reported for various 3D gesture recognizers is high, these results are often considered a theoretical upper bound given that the accuracy will often degrade in more ecologically valid settings. Performing accuracy tests in ecologically valid settings is also a challenge making it difficult to determine the best ways to improve 3D gesture recognition technology. In addition, latency is still a concern in that delays between a user’s gesture and the intended response can hamper the overall user experience. In spite of these issues, the field of 3D gesture interfaces is showing maturity as evidenced by 3D gestural interfaces moving into the commercial sector and becoming more commonplace. However, it is clear that there is still work that needs to be done to make 3D gesture interfaces truly mainstream as a significant part of human computer interfaces.