Abstract

This paper presents a method for online recognition of continuous full-body human motion using sparse, low-cost sensors. The only input signals needed are linear accelerations without any rotation information, which are provided by four Wiimote sensors attached to the four human limbs. Based on the fused hidden Markov model (FHMM) and an autoregressive process, a predictive fusion model (PFM) is put forward, which considers the different influences of the upper and lower limbs, establishes an HMM for each part, and fuses them using a probabilistic fusion model. An autoregressive process is then introduced into the HMM to predict the gesture, which enables the model to deal with incomplete signal data. To reduce the number of alternatives in the online recognition process, a graph model is built that rejects some motion types based on the graph structure and previous recognition results. Finally, an online signal segmentation method based on semantic information and the PFM is presented to complete the recognition task efficiently. The results indicate that the method is robust, achieves a high recognition rate on sparse and deficient signals, and can be used in various interactive applications.

1. Introduction

In recent years, sensor-based human motion recognition has received a great deal of attention from researchers. Sensors have been adapted for large-scale movements to avoid shading and lighting problems. This has advantages over vision-based methods for special scenes and has allowed full-body motion recognition and sensor-based motion control to be applied in various fields, such as medical rehabilitation and interactive games.

Currently, motion control tasks are based on accurate and complete accelerations, as well as signals provided by other sensors. Unfortunately, these devices are expensive and not easily portable. In practice, sparse and low-cost sensors are more attractive, but they are usually accompanied by less information, more noise, and frequent signal deletion, making it difficult to acquire or reconstruct accurate position information and accordingly harder to achieve a proper online recognition result. Therefore, reconstructing human motion from signal features based on sparse and deficient signals has recently evoked much interest.

In light of the above problems, an online motion recognition method that adopts sparse, low-cost Wii Remote sensors (Wiimotes) as input devices is proposed. Because accurate position information of human motion cannot be acquired from sparse, deficient linear accelerations, a predictive fusion model, which combines a fused hidden Markov model (HMM) with an autoregressive process, is presented. Considering the independence of each part of the human body, a hierarchical fusion structure of fused HMM is used to deal with human motion signals, which enhances the independent and cooperative expression of the classification model. The predictive capability provided by the autoregressive process ensures robustness when dealing with noisy and deficient signals. Once the online recognition process is underway, a graph model that encodes the transitions between different motion types filters out incompatible motion types and reduces the recognition complexity of the predictive fusion model (PFM). Moreover, a semantic-based automatic signal segmentation method is introduced to ensure the continuity of the online recognition process.

Thus, based on sparse and deficient input signals, a human motion recognition PFM is presented that effectively supports sparse, low-cost sensors. The presented model achieves a high accuracy rate and is robust enough to handle insufficient and missing signals. An online motion recognition method is also proposed that does not require any position calibration. The method integrates the PFM, an action graph structure, and a semantic-based signal segmentation method to support user-driven virtual human motion in virtual scenes with continuous motions.

2. Related Work

As pattern recognition technologies develop, pattern recognition methods are increasingly used in the context of motion recognition. Typical methods, including self-organizing maps (SOMs), support vector machines (SVMs), and HMM approaches, can be adapted for motion recognition processes.

Methods for motion recognition vary depending on the input source. It has been shown that vision-based methods and sensor-based methods constitute two of the main research areas and are based on two types of input device, depending on the application. Poppe [1] presented a survey of vision-based human action recognition systems. Ning and Mokhtarian [2] used a shape to represent object contours extracted from each frame of a movie and constructed a tangent space based on the mean shape to approximate the linear space encompassing the datasets. Zhou et al. [3] and Min et al. [4] built a low-dimensional deformable model based on shape information from human motions in an image sequence to realize motion control. Lai et al. [5] proposed a local feature-based human motion analysis framework that extracted the features directly from local regions containing motion. Research has shown that the general idea of vision-based methods is to extract varied feature information from image sequences. In order to avoid the effect of light and shade and the inconvenience of vision-based methods when moving in a larger scene, sensor-based methods remain a hot topic in this field.

Recent work [6–9], which describes some basic methods for gesture recognition using accelerometers, shows that sensor-based methods can be adapted for recognition tasks. Sun et al. [10] and Shiratori and Hodgins [11] used low-cost sensors to monitor daily physical activities. This approach is practical, but the limited set of simple activities restricts recognition. Niu and Abdel-Mottaleb [12] considered the continuity of signals and provided a segmentation and recognition method based on HMM. Khan et al. [13] used a hierarchical scheme for human activity recognition. Tautges et al. [14] and Wong et al. [15] generated simple full-body animations controlled by sparse and accurate 3D accelerometers attached to the extremities of a human actor; this approach can use accurate input to recover accurate human position information. For signals that are both sparse and deficient, learning models are more effective than generative models. Early learning-model methods perform feature analysis with HMM but require improvement in recognition rate and in robustness to deficient signals.

The present research is motivated by the above studies. A probabilistic fusion model and autoregressive process in the hierarchical model of virtual human movement are proposed, which ensures that full-body motion information can be expressed relatively independently and deals with deficient input caused by sparse, inexpensive sensors. The recognition process ensures robustness, accuracy, and efficiency.

3. Method

3.1. Overview

In this paper, a recognition model, PFM, is first proposed to deal with offline single motion segments. Combined with a graph constraint and online signal segmentation, the model can then be applied to online motion recognition. The method consists of three key technologies, the structure of which is shown in Figure 1.

Predictive Fusion Model. Sparse and deficient inputs require more relevant information between each input signal sequence to keep local and global information. Therefore, HMMs of different part inputs were constructed, and a probabilistic model was used to fuse these HMMs together so as to enhance the model robustness. An autoregressive process is then introduced, which ensures that the unstable signals can be adjusted based on the past signals and training signals. The model can properly deal with offline motion recognition with sparse and deficient inputs.

Graph Constraint Construction. A graph structure based on the content of motion segments is constructed, limiting the choice of the following motion type based on the current motion content. The graph structure can filter out some motion types, reducing the complexity when dealing with a large motion database and also improving the recognition accuracy.

Semantics-Based Signal Segmentation. Because input signals are continuous and may consist of multiple motion types, a method to separate the long continuous signal into segments is proposed. This method supports online motion recognition on the basis of the PFMs and the graph constraint built offline.

3.2. Predictive Fusion Model

HMM shows a high capability of dealing with time series and is therefore used to build a robust learning model that can acquire feature information from sparse and deficient sequential input. Here, a predictive fusion model is presented based on the structure of HMM, which considers not only the sparse and deficient signal but also the features of human motion.

Consider two HMMs with observations $O_1$ and $O_2$, which denote two groups of signals divided from all input sources. In our experiments, these input sources are the Wiimotes attached to different body parts. For each motion type, a corresponding model is needed to evaluate the similarity between the current input and the model, and the highest similarity probability determines the input type. The problem can then be defined as constructing the connections between the two HMMs so as to provide an optimal estimate of this similarity probability $p(O_1, O_2)$. To capture the statistical dependence between the two observations $O_1$ and $O_2$, the maximum entropy principle is used:

$$\hat{p}(O_1, O_2) = p(O_1)\, p(O_2)\, \frac{p(w, v)}{p(w)\, p(v)}, \tag{1}$$

where $w$ and $v$ are the respective transforms of $O_1$ and $O_2$ and absorb some dependence between $O_1$ and $O_2$. Here, the transforms should be chosen from the two components of an HMM, that is, the hidden state $U$ and the observation $O$.

Supported by the maximum mutual information criterion in [20], it is better to connect the two HMMs through the hidden state sequence of one HMM and the observation sequence of the other, rather than through the hidden states of both. The structure is shown in Figure 2. Thus, writing $\hat{U}_1$ and $\hat{U}_2$ for the estimated hidden state sequences of the two HMMs, the transforms can be replaced by $w = \hat{U}_1$, $v = O_2$ or by $w = O_1$, $v = \hat{U}_2$. The probability defined by (1) yields

$$\hat{p}^{(1)}(O_1, O_2) = p(O_1)\, p(O_2 \mid \hat{U}_1) \tag{2}$$

or

$$\hat{p}^{(2)}(O_1, O_2) = p(O_2)\, p(O_1 \mid \hat{U}_2), \tag{3}$$

where the structures defined by (2) and (3) are different. Equation (2) expresses the relationship between $O_2$ and $\hat{U}_1$, indicating that the former HMM is more reliable than the latter one. The reliability of each HMM can be quantified as a weight for each part, where the two weights represent the reliability of the motion of each body part. The values of the weights are determined by the selected types of actions. For general daily activities, such as the actions in our experiment, the two weights can be set equal, while for specialized activities such as ping-pong, where the action is concentrated in the upper body, the upper-body weight can be set larger than the lower-body weight.

The observation and state can be unfolded as $O = \{o_1, \ldots, o_T\}$ and $U = \{u_1, \ldots, u_T\}$, where $T$ is the length of the data sequence. The structure of this model is described in Figure 2.
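The fusion in (2) can be illustrated with a minimal sketch for discrete observations. It assumes that the coupling term factorizes over frames, so that the probability of the second observation sequence given the leading HMM's Viterbi states is a product of per-frame table lookups; this factorization, the function names, and the toy data layout (lists of lists) are illustrative assumptions and not the exact formulation used in the experiments.

```python
import math

def forward_likelihood(pi, A, B, obs):
    """Return p(obs) for a discrete HMM using the forward algorithm.
    Plain probabilities are used for clarity; long sequences need logs or scaling."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def viterbi(pi, A, B, obs):
    """Return the most likely hidden state sequence for a discrete HMM."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []
    for o in obs[1:]:
        # Best predecessor of each state j, computed from the previous delta.
        prev = [max(range(n), key=lambda i: delta[i] * A[i][j]) for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        back.append(prev)
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    return path[::-1]

def fused_score(leading_hmm, fusion, obs_lead, obs_other):
    """Estimate p(O1, O2) as p(O1) * p(O2 | U1_hat), with HMM 1 leading.
    fusion[u][o] is an assumed table of p(o2 = o | u1 = u)."""
    pi, A, B = leading_hmm
    states = viterbi(pi, A, B, obs_lead)
    p_lead = forward_likelihood(pi, A, B, obs_lead)
    p_coupled = math.prod(fusion[u][o] for u, o in zip(states, obs_other))
    return p_lead * p_coupled
```

Swapping the roles of the two sequences gives the alternative structure in (3); which HMM leads reflects which body part is considered more reliable.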

The basic HMM parameters $(\pi, A, B)$ are then varied, where $\pi$ stands for the initial probability vector, $A$ for the state transition probability matrix, and $B$ for the observation probability vector. The new parameters enhance the model's ability to deal with intermittent or noisy observations: the hidden state and the past observations are taken into account when generating the current observation, which can be written in the form of an autoregressive process (5). This process includes a parameter that preserves the descriptive power of the standard HMM and a residual error term for the calculated observation. Since all current observations are affected by the current hidden state and past observations, the observation probability parameter of the HMM can be modified accordingly, using the quantities calculated from (5).
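To illustrate how an autoregressive term lets a model bridge missing samples, the following sketch fills gaps in a one-dimensional signal with a generic autoregressive prediction. The form used here (a constant offset plus a weighted sum of past samples), the single shared coefficient set, and the names `ar_predict` and `fill_missing` are assumptions for illustration; in the model described above the prediction would additionally depend on the current hidden state.

```python
def ar_predict(history, coeffs, offset):
    """Predict the next sample as offset + sum_k coeffs[k] * history[-1 - k]."""
    return offset + sum(c * history[-1 - k] for k, c in enumerate(coeffs))

def fill_missing(signal, coeffs, offset):
    """Replace None entries with autoregressive predictions from the filled past."""
    filled = []
    for x in signal:
        if x is None:
            if len(filled) >= len(coeffs):
                x = ar_predict(filled, coeffs, offset)
            else:
                x = offset  # not enough history yet: fall back to the offset
        filled.append(x)
    return filled
```

With all coefficients set to zero the prediction reduces to the offset alone, which mirrors the idea that the autoregressive extension should contain the standard model as a special case.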

The methods described above define the model parameters , consisting of two predictive HMM parameters and the dependencies parameter . The training process can be summarized as follows.

Calculate the parameters of two predictive HMMs with the expectation-maximization (EM) algorithm presented in [21] and Baum-Welch method in [22]. To maximize has to be maximized at each time of the sequence, which can also be written as . The terms that have to be maximized are where is the probability of being in state at time and in state at time in the Baum-Welch algorithm. To solve the terms in (7), the derivatives of the terms with respect to each variable and must be determined: where indicates . The parameters and can be calculated by solving (8). The covariance matrix can then be calculated using the updated parameters and :

Select one predictive HMM as the leading HMM and calculate the hidden state sequence for the leading HMM using the Viterbi algorithm. Then, determine the fusion parameters or . If is discrete, the following is obtained: where is the total hidden state number, is the clustering number, and is the impulse function. When the parameter set of the model has been trained, the similarities and can be acquired by forward-backward algorithm, and the similarity can be calculated by (2) or (3).
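For the discrete case, the fusion parameters can be estimated by counting, as in the following sketch. It assumes that the leading HMM's Viterbi states and the other HMM's clustered observation symbols are given as integer sequences of equal length; the uniform fallback for states that never occur is an added assumption, and the function name is hypothetical.

```python
def estimate_fusion(states, symbols, n_states, n_symbols):
    """Estimate p(symbol | state) by co-occurrence counts over aligned frames."""
    counts = [[0] * n_symbols for _ in range(n_states)]
    for u, o in zip(states, symbols):
        counts[u][o] += 1  # impulse function: counts frames with state u and symbol o
    table = []
    for row in counts:
        total = sum(row)
        if total:
            table.append([c / total for c in row])
        else:
            table.append([1.0 / n_symbols] * n_symbols)  # unseen state: uniform
    return table
```

Each row of the returned table sums to one and can be used as the coupling term when the similarity is evaluated with (2) or (3).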

The use of the model in the recognition process is now described. In the training process, the input signal sequences come from four Wiimotes attached to the four human limbs, which are divided into two groups (upper and lower limbs). One model is trained for each motion type. In the recognition process, the models trained for each motion type are used to compute each model's similarity to the input signal sequence. The similarity probability can be calculated using the same forward algorithm as for HMM. If the similarity to any motion exceeds a certain threshold, the sequence is classified as the motion type for which the similarity probability is the largest. The recognition result and the trend of the similarity probability are shown in Figure 3. The results indicate that, of the six types, “waving hello” is the motion most similar to the input signal. Inspecting the similarity probability of these models at each time shows that PFM had a higher classification capacity than the standard model, because the PFM result can be determined as early as frames 20–40. More experiments with larger databases are described in Section 4.
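The decision rule described above can be sketched as follows. The scores are assumed to be log-similarities already computed by the per-type models; the function name and the example values are hypothetical.

```python
def classify(scores, threshold):
    """Return the motion type with the largest similarity, or None if no
    similarity exceeds the threshold (the input is then rejected)."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```

For example, `classify({"walk": -40.0, "wave": -12.5}, -20.0)` selects `"wave"`, while raising the threshold above every score rejects the segment.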

3.3. Graph Constraint Construction

The model detailed above can properly identify the motion type from dozens of alternative ones. However, when the number of alternative motion types grows, it not only affects the accuracy rate of recognition but also increases the computation time due to the probability calculations required for each model. Therefore, a structured method was used to reduce the scale of alternative motion types in dealing with a large database.

When a user performs continuous and varied actions, it is noticed that certain action types cannot appear when the current action type has been determined, due to the coordination of human motion. This constraint can be used to guide selection of the following motion type based on the current determinate type.

The present graph model is motivated by the method of Li et al. [23] and the motion graph of Kovar et al. [24] but differs from them in purpose and results. The model can be weighted or unweighted: the weighted one is a directed graph that contains the transition probability describing the compatibility and transitivity between two motion types. Each node of the graph is a single motion type, such as “walk” or “run.” Before constructing the graph, a training process is necessary to obtain a more precise transition probability. Hundreds of long, continuous human motions are required, and the transition probability is calculated statistically by recording the frequency of motion transitions from one motion type to another. The unweighted graph has a similar structure to the weighted graph, but the transition probability takes only two values, 0 or 1. The structure of the graph is shown in Figure 4.
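The statistical construction of the graph can be sketched as follows, assuming the training motions are available as lists of motion-type labels in performance order. The function names are hypothetical, and the unweighted graph is derived here simply by keeping every transition that was observed at least once.

```python
from collections import Counter, defaultdict

def build_transition_graph(sequences):
    """Return {type: {next_type: probability}} from labeled motion sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    graph = {}
    for current, nxt in counts.items():
        total = sum(nxt.values())
        graph[current] = {b: c / total for b, c in nxt.items()}
    return graph

def to_unweighted(graph):
    """Keep only which transitions exist (probability 1) and drop the rest (0)."""
    return {a: set(nxt) for a, nxt in graph.items()}
```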

Once the recognition process is underway, the motion type is annotated immediately after the current motion signal segment is recognized. For the unweighted graph, the nodes that are directly reachable from the node of the current motion type are selected, and the remaining motion types are excluded without calculating the similarity probability between the upcoming input signal and their models. Only the models that correspond to the selected nodes calculate the similarity probability. For the weighted graph, the transition probability between the current motion type and each alternative motion type for the upcoming signal segment is combined with the similarity measure after being passed through a scaling function, such as a logarithmic function, that reduces its effect.
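The filtering step for the unweighted graph can be sketched as follows. The graph is assumed to be a mapping from each motion type to the set of types that may follow it; falling back to all types when the current type is unknown or has no recorded successors is an added assumption, and `score_fn` stands in for the similarity computed by each model.

```python
def candidate_types(successors, current_type, all_types):
    """Return the motion types whose models need to be evaluated next."""
    nxt = successors.get(current_type) if current_type is not None else None
    if not nxt:
        return set(all_types)  # no constraint available: evaluate every model
    return set(nxt)

def recognize_next(successors, current_type, score_fn, all_types):
    """Score only the candidate types and return the best one."""
    candidates = candidate_types(successors, current_type, all_types)
    return max(candidates, key=score_fn)
```

Only the candidates are passed to `score_fn`, so the number of model evaluations per segment is bounded by the out-degree of the current node rather than by the size of the database.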

3.4. Online Semantics-Based Signal Segmentation and Motion Recognition

For the online recognition process, input signals, which are always continuous and long, need to be separated into short segments based on different motion types. In recent studies, such as the recursive least squares (RLS) method presented in [25] and the piecewise linear representation (PLR) method presented in [26], segmentation points are always located at break points in the signal energy curve, which may lead to oversegmentation or to skipping smooth transition points. Therefore, signal segmentation based only on the signal shape is not comprehensive and requires consideration of the semantic information in the signal sequence.

In order to combine the semantic information with the segmentation process, the motion content needs to be parsed by a recognition model in the online signal segmentation. PFM is introduced into the process to acquire semantic information. With specific semantic information, it can be ensured that the segmented sequence is an intact and independent motion type, which can greatly reduce the occurrence of oversegmentation. The method can be described as follows.

Let $S$ be a long sequence of $d$-dimensional input acceleration signal vectors and let $V$ be a two-dimensional integer vote array of size $N \times T$, where $T$ is the time length and $N$ is the number of all motion types. The entry $V(c, t)$ is the number of votes for type $c$ at frame $t$, which indicates the current motion type at $t$. In the online recognition stage, a sliding window of length $w$ scans the input sequence from front to back with a step length of one frame. Each time the window is moved, the PFMs recognize the signal segment in the current window. For example, when the window moves to frame $t$, similarity probabilities are calculated by the PFMs of the alternative types with the signals in the window as input. The votes for the winner type of the present recognition are then increased by 1. After the window has slid to frame $t$, vote curves can be drawn for each type based on $V$ before frame $t$, as shown in Figure 5. The intersection points shown in Figure 5 can be classified as alternative segment points, and recognition results can be acquired after finishing the segmentation.
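The voting scheme can be sketched as follows. The sketch assumes that each window casts one vote for its winning type at every frame it covers, and that a segment point is placed wherever the type with the most votes changes (the intersection of two vote curves); both choices, the batch-style loop, and the `recognize` callback standing in for the PFMs are illustrative assumptions.

```python
def vote_and_segment(signal, window, recognize, types):
    """Slide a window one frame at a time, accumulate votes, and return the
    per-frame labels together with the frames where the label changes."""
    n = len(signal)
    votes = {c: [0] * n for c in types}
    for end in range(window, n + 1):
        start = end - window
        winner = recognize(signal[start:end])  # None means the window is rejected
        if winner is not None:
            for t in range(start, end):
                votes[winner][t] += 1
    labels = []
    for t in range(n):
        best = max(types, key=lambda c: votes[c][t])
        labels.append(best if votes[best][t] > 0 else None)
    cuts = [t for t in range(1, n) if labels[t] != labels[t - 1]]
    return labels, cuts
```

In an online setting the same bookkeeping would be updated incrementally as each new frame arrives, and only the votes before the current frame would be inspected.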

To deal with transition signals and signals that do not belong to any alternative motion type, an appropriate threshold for each PFM should be set during the PFM training process to filter out redundant segment points. The threshold is defined as the minimum normalized probability in the training dataset, and it rejects motion signals dissimilar to the training set.
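A minimal sketch of the rejection threshold follows. It assumes that "normalized" means the log-likelihood divided by the sequence length, which is one common choice and is not confirmed by the text above; the function names are hypothetical.

```python
def rejection_threshold(train_loglikes, train_lengths):
    """Return the minimum length-normalized log-likelihood over the training set."""
    return min(ll / n for ll, n in zip(train_loglikes, train_lengths))

def accept(loglike, length, threshold):
    """Accept a segment only if it is at least as likely as the worst training example."""
    return loglike / length >= threshold
```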

The method presented above considers the semantic information of the signal sequence and acquires the recognition result based on the PFMs trained offline. The recognition process is online, and its results are discussed in Section 4.

4. Results and Discussions

In this section, the functions of PFM, the effect of online recognition, and various applications of this technology are described. As presented in the last section, the input devices used in our experiment are sparse and low-cost (see Table 1). Devices that provide more information are generally more expensive. Several general portable input devices are shown in the table, and the sparser and cheaper devices were chosen to conduct our experiment. The signals analyzed here were linear accelerations without any denoising or angular information, making it difficult to calculate accurate position information, as Table 2 shows.

The Wiimotes transmitted signals to a computer via a Bluetooth interface that supports a distance of 8–10 meters during an experiment. The sampling rate in our experiment was 25 fps, which can be adjusted to accommodate a range of precisions. A preliminary training motion signal database has been constructed, which is clustered into 28 nodes in the graph structure based on the content of the motion signal segments. Each node consists of 3-4 groups of motion segments with different variants, such as walking in different styles or kicking to different positions. Each type of motion signal was captured 5 times by 4 different actors. These hundreds of motion signals are well organized for model training. In the experiment, thousands of independent action signals and hundreds of long continuous action signals were performed by testers in real time to obtain results on recognition rate, robustness, and so forth.

4.1. Performance of Model

Before the experiment, we tested several state-based methods, such as the coupled HMM and structural HMM presented by Pan et al. [20]. The result shows that the fused HMM offers better accuracy and robustness than the others when dealing with sparse and deficient motion signals. Therefore, in this section, the functions of our PFM are shown, and the accuracy and efficiency of the recognition process are compared only with the performance of the traditional Gaussian HMM and the fused HMM when dealing with sparse and deficient input.

In our experiment, the recognition effect for different actors was validated by leave-one-out and k-fold cross validation methods, and the recognition rates of the PFM are shown in Figure 6, based on 40 alternative action types from the database we built. The HMM method yielded an average recognition rate of 42%, lower than the fused HMM and PFM recognition rates. The horizontal axis in Figure 6 represents the type of input signal sample and the vertical axis represents the types of corresponding models we built. While HMM is not robust when dealing with certain special motions, the PFM presents a more robust and accurate recognition result. In the HMM, which does not consider the motion of different body parts, the combined acceleration information led to confusion and a worse classification capability for motions of similar variance. The fused HMM considered the structure of human motion and presented a higher classification capability than the standard model. The prediction capabilities of PFM were much better for these special inputs.

The proposed model can handle imperfect signals as well as deletion of input signals. The smallest number of sensors that can retain complete full-body motion information remains to be determined. Further experiments were conducted to show the robustness and capabilities of dealing with deficient signals and to determine the number of sensors required to function properly in the motion recognition process. Table 3 shows the model's robustness with respect to deficient signals. In this experiment, different actors wearing a decreasing number of input devices were chosen. The actors here include both trainees and newcomers. Since the action data of trainees are more standard and similar to the trained motion data, the recognition rates for trainees are slightly higher than those for newcomers, as the table shows. In the event that a short signal sequence from one sensor is lost, the recognition results remain unchanged from those derived from the complete signal sequence. For trained actors, when two Wiimotes are intermittent, the average recognition rate is 41% for HMM and 73% for FHMM. This comparison shows that the classifying abilities of the PFM are greater than those of the other two methods.

An analysis of unknown motions not included in the training datasets provides an estimate for the maximal probability of the motions most likely to be in the training datasets. Evaluation methods demonstrate the accuracy of the input signal relative to the recognition results.

4.2. Online Recognition

In an online recognition system, continuous signal processing is key to completing the task, and its results are essential for evaluating the recognition process. In our experiments, five actors were required to perform a continuous motion that included 51 motion segments, which was used to test the segmentation accuracy rate. The accuracy rate of the segmentation experiment was evaluated by the number of desired segmentation points that were missed and the number of undesired or redundant segmentation points. Table 4 presents the segmentation results for these two measures. With our method, the desired segmentation points can always be located properly, and the main factor that reduces the accuracy is the redundant segmentation points. Unlike current segmentation methods for human motion signal sequences, namely, the RLS method and the PLR method, whose numerous parameter and threshold groups are determined by repeated adjustments, our semantics-based method depends less on parameters. Even with an appropriate parameter group, a large number of missed desired segmentation points remains one of the main factors that affect the segmentation accuracy rate of these two methods. In addition, the delay of segmentation points and the accumulation of errors, which always appear in these two methods, can be effectively avoided in our method. Taking actor 3 as an example, the desired segmentation points are 41 for RLS and 36 for PLR, and the redundant segmentation points are 17 and 22, which is also more than for the semantics-based method. Figure 7 shows the final accuracy rate of these methods.
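The two measures can be computed as in the following sketch, which greedily matches each reference segmentation point to the first unused detected point within a frame tolerance. The tolerance, the greedy matching, and the function name are assumptions for illustration and are not the evaluation protocol used for Table 4.

```python
def segmentation_errors(reference, detected, tolerance):
    """Return (missing, redundant): reference points with no detected point
    within the tolerance, and detected points left unmatched."""
    unmatched = sorted(detected)
    missing = 0
    for r in sorted(reference):
        hit = next((d for d in unmatched if abs(d - r) <= tolerance), None)
        if hit is None:
            missing += 1
        else:
            unmatched.remove(hit)
    return missing, len(unmatched)
```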

When dealing with large databases of alternative motion types, the difficulty in distinguishing features between different motion types becomes greater. The recognition capability of PFM is reduced substantially (see Table 5). Based on the graph structure we built, the alternative types of current segment recognition were fewer than the total alternative types, thereby preserving the online recognition accuracy rate rather efficiently. The high classification capability of the PFM model ensures that the results can be efficiently acquired at 30–50 frames of the signal input before the actor finishes the motion.

4.3. Applications

The methods proposed here are applicable to a wide variety of applications, including behavioral teaching evaluations, interactive games in virtual environments, and activity validation systems in large-scale scenes.

A general application of the proposed recognition method is driving a virtual human to generate computer animations or to simulate a virtual environment for user interactions. After the user performs the continuous motions, segmentation and recognition are conducted efficiently, and the recognition results guide the search for the corresponding motion data in the database. The blending process in the motion graph technology guarantees continuity of the generated motion. Generative models, such as the Gaussian latent variable model presented in [27], can be embedded to synthesize more delicate and stylized motion in various applications.

In the context of educational applications, the present method can be used to evaluate activities, such as playing tennis, doing martial arts, or dancing. Students can act out motions while following a standard motion sequence that is presented in advance. The system can then evaluate the similarity of the mimicked sequence to the standard sequence. An evaluation system can be constructed by calculating the probability ratio between the input motion signal and the normative training data. The ratio provides an important evaluation criterion. The weights of the fusion model may be adjusted to standardize the motions of each appendage. Figure 8 shows an experiment based on the evaluation system described here. The proposed method was adapted to a set of complex motions associated with tennis, Tai chi, and boxing. The motions performed by the user were recognized and evaluated using our method.

Complex virtual environment interactions constitute the main application focus of our method. Virtual environment games and special training regimens require environmental immersion and interactions with virtual objects. Our method, based on sparse, low-cost sensors, performed well in the context of these applications and can provide the user with an immersive experience.

5. Conclusion and Future Works

This paper presents a full-body motion recognition method based on sparse, low-cost accelerometers. In the online recognition process, a semantics-based signal segmentation method was adopted to acquire short motion segments, and a motion transition graph structure was constructed to reduce the number of alternative motion types. To recognize the motion type accurately, a predictive fusion model was presented to efficiently distinguish between current motion types and alternative motion types. The model's recognition capability is robust and accurate in dealing with unstable and deficient signals that provide little information for reconstructing position information. Results show that the method has a high recognition rate and can be adapted to specific input signals.

During experiments, it was found that the method had difficulty identifying the actors’ orientation, as the input devices we used lack the direction information needed to recover whole motion information. In addition, a short pause in a continuous motion occasionally led to a redundant motion segment. In the future, to overcome these problems, low-cost sensors that also provide direction information will be integrated so that the input device can be more conveniently adapted to a specific interaction. The database of the motion signals and the motion data will also be expanded. Ultimately, the method will be applied to complicated scene interactions between users and the virtual environment.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of China (Grant no. 61170186) and the Zhejiang Leading Team of Science and Technology Innovation (2011R50019-06). The data used was obtained from HDM05 in [28] and CMU public database.

Supplementary Materials

An online full-body motion recognition method using sparse and deficient signal sequences.

  1. Supplementary Video