Human motion sensing technology has gained tremendous popularity, with practical applications such as video surveillance for security, hand signing, smart homes, and gaming. These applications capture human motions in real time from video sensors, and the resulting data patterns are nonstationary and ever changing. While the hardware of such motion sensing devices and their data collection processes have become relatively mature, the computational challenge lies in the real-time analysis of these live feeds. In this paper we argue that traditional data mining methods fall short of accurately analyzing human activity patterns from the sensor data stream. The shortcoming is due to an algorithmic design that does not adapt to the dynamic changes in gesture motions. The successor of these algorithms, known as data stream mining, is evaluated against traditional data mining through a case of gesture recognition over motion data captured by Microsoft Kinect sensors. Three different subjects were asked to read three comic strips and to tell the stories in front of the sensor. The data stream contains the coordinates of articulation points and the positions of the parts of the human body corresponding to the actions that the user performs. In particular, a novel feature selection technique using swarm search and accelerated PSO is proposed to enable fast preprocessing for inducing an improved classification model in real time. Superior results are shown in an experiment run on this empirical data stream. The contribution of this paper is a comparative study of traditional and data stream mining algorithms, together with the incorporation of the novel feature selection technique, in a scenario where different gesture patterns are to be recognized from streaming sensor data.

1. Introduction

With advances in sensing technology [1] that is relatively easy to deploy and cost-effective to operate, video motion sensors have found applications across many domains. To name a few successful cases in environmental sensing, useful applications include hand motion gesture detection by smartphones for remote interaction [2], monitoring of human movements for medical rehabilitation [3], daily activities in elderly homes [4], entertainment and sports [5], and advanced computer-human interaction research [6, 7]. While the communication mechanisms and general operation of sensor applications have been studied intensively, the decision making of such sensor systems, often described as the analytical “brain,” has not yet been explored in detail. The Kinect sensor [8], for example, is a peripheral I/O device designed to provide a natural interface to gaming consoles without the need for conventional controllers. This sensing device features a simple color camera, a depth sensor, and a multiarray microphone, and it delivers swift audio and visual data streams that support facial recognition, partial or full-body motion recognition, and acoustic recognition.

In general, human motion recognition is the process of first detecting and recording changes in the position of a human posture or gesture (full body or hands only, depending on the context) relative to its surroundings or backdrop, with reference to the previous positions in the video sequence. The data collected from sensors include infrared signals, optical vision, radio frequency energy, or ultrasound, depending on the type of motion sensor used. Assuming that the deployed sensors are the eyes and ears that continuously and reliably collect perceived data from a moving subject, the data stream quickly to a preliminary image-processing stage that extracts features from the images, and then to a central processor that acts as decision support for the applications by interpreting the changing feature information.

As a generalized process applicable to many applications, the remaining tasks after the data are collected and delivered to the decision-support center consist of the following: image processing, data preprocessing, model induction, rule extraction, and intelligence dissemination. In modern sensor hardware, image processing is usually embedded in the device as low-level middleware, which processes raw images and extracts information from them as characteristic features at an abstract level.

Data preprocessing involves data transformation, data cleaning, and often feature selection, which reduces the feature space for enhanced recognition accuracy; our previous papers have addressed this task at length for distributed wireless sensor networks [9, 10]. In this paper we focus on the tasks of model induction and rule extraction for gesture recognition, specifically by using data streaming algorithms and a lightweight feature selection scheme suitable for high-speed incremental machine learning.

This paper investigates the remaining tasks after the data are collected and delivered to the decision-support center, with the aim of finding the right combination of classification and feature selection algorithms for accurately recognizing human gestures on the fly. The sensing data considered in this paper are collected from a Microsoft Kinect sensor, which captures gestures in terms of 3D body positions and their corresponding velocities and accelerations while the user poses in various postures. Our focus is a rigorous comparative performance analysis of two groups of classification algorithms, namely, batch learning and incremental (data stream) learning, with a view to achieving top accuracy at the shortest possible preprocessing time. We illustrate the efficacy of a newly proposed feature selection method with an example: recognizing human gesture patterns in a video sensor application.

The main research challenge here is finding the most appropriate model induction algorithm for gesture pattern recognition. The challenge rests on several stringent requirements of video motion sensing. First, the data feed is potentially infinite, and data delivery is continuous, like a high-speed train of information. Processing is hence expected to be real-time and instantly responsive. This implies that the classification induction algorithm being deployed must be lightweight, incremental, and accurate. The model update needs to be done quickly and on the fly upon the arrival of each new data instance. In addition, because the decision support process may have to be embedded in a small mobile device, the memory requirement ought to be as small as possible, both to save energy and to fit a tiny device. In other words, the learned model, probably in the form of generalized nonlinear mappings from feature values to predicted target classes, must be compact enough to execute in a small run-time memory. No room should be wasted on storing features and relations that are insignificant or contribute little to model accuracy. To this end, operating without feature selection is out of the question, because the number of original features extracted from the video sequences could be very high. Feature selection is a heuristic process that retains only the significant features as an optimal subset of the full feature set, representative enough to induce an accurate classification model for pattern recognition.

Another complication, on top of quantitatively computing the nonlinear relations between the feature values and the target classes, is the temporal nature of such a sensor data stream. One must process the data stream long enough to model seasonal cycles or regular patterns, if they exist. No straightforward relation maps the attribute data to a specific class without long-term observation. This considerably affects the design of the data mining algorithm, which should be capable of reading and then forgetting the data stream (a so-called “one-pass” algorithm), retaining nothing but the statistics required for reasoning about the long-term relations between the attribute values and the target classes.

Taking into account the aforementioned computational challenges associated with motion data from video sensors, a rigorous analytical evaluation is necessary for comparing popular data mining algorithms for gesture pattern recognition. The evaluation reported in this paper offers insights to developers who want to design a video sensor network for recognizing human activities or gestures through data stream mining.

The remainder of the paper is organized as follows. Section 2 introduces the background of our research by discussing both the experimental layout and the types of data mining algorithms to be tested. In particular, the video sensor for measuring motion data resulting from human activities is described, and traditional and incremental decision trees are compared and contrasted. Section 3 covers the technical details of the data stream mining algorithms. Specifically, it discusses the shortcomings of the traditional algorithms and describes how the new functions of data stream mining algorithms help overcome these limitations. In Section 4, an empirical video sensor dataset is used in an experiment that compares several data mining algorithms with respect to gesture recognition. Lastly, Section 5 concludes the paper.

2. Background

2.1. Gesture Recognition and Dataset

In 2013, Madeo et al. from the University of Sao Paulo studied a gesture segmentation problem using a support vector machine [11], reviewed the temporal aspects of hand gesture analysis [12], and examined how gestures captured by a video sensor can be recognized by incorporating the temporal aspects with reference to bodily positions [13]. Although their research focused on gesture phase segmentation, the research team pointed out that gesticulation behavior may influence the performance of a classifier. Different human users recorded performing the same gesture may exhibit different gesticulation behavior. An effective machine learning approach is therefore needed to accurately classify the motion pattern data into their corresponding gestures.

Their experimental dataset, which is available for download, is composed of 7 video sequences captured with a Microsoft Kinect sensor. In front of the sensor, 3 human subjects were instructed to read 3 comic strips and to tell the stories from the comic strips using hand gestures and bodily postures while the sensor recorded their motions. At the end of the image processing, each video contains a sequence of images, one per frame, indexed by a timestamp. The video sequence is then formatted into a text file holding a matrix, with rows representing the temporal data instances and columns characterizing the spatial positions, that is, the 3D coordinates of 6 articulation points. The articulation points are identified by the current positions of the limbs and body parts: the head, spine, left hand, left wrist, right hand, and right wrist. The position measures were normalized into numeric values. Each data instance is the gesture information extracted from one video frame and is uniquely identified by a timestamp. The data instances were postprocessed with the help of a human specialist, who manually segmented the file and associated each segment with a gesture label. This manual postprocessing was needed to generate a ground truth for classification, so that the performance of the classification algorithms under test could be evaluated.

The attributes, or features, in the dataset combine the static positions of the body parts in each video frame with motion information. The target classes are the five gesture phases, abstracted as Hold, Preparation, Rest, Retraction, and Stroke. A total of 50 attributes characterize each instance, and the number of instances runs into the thousands, depending on the length of the stories being told. Of the 50 attributes, 18 are the positions of the body parts (the 3D coordinates of the 6 articulation points), and 32 are the velocities and accelerations, in both vectorial and scalar forms, of the hands and wrists. Table 1 shows the 50 attributes. A total of 9,900 data instances, extracted from the 7 videos, are available for training and testing the classification algorithms.

As an illustration, the three prime positions of the left hand are visualized in Figure 1 to show the fluctuation in values over time. Furthermore, the normalized values of the three coordinates are visualized as a parallel coordinate graph in Figure 2, which displays the wide ranges of the attribute values even though they have been normalized between 0 and 1. The mapping relations of the three coordinates to the five phases (target classes) are visualized in 2D and 3D in Figures 3 and 4, respectively. Both figures indicate that the mapping relations on which a classification model is built are highly nonlinear. Highly nonlinear models are known to be computationally difficult to induce, especially if they need to achieve a good balance between generality and accuracy. Data stream mining adds the requirement that processing be fast, which poses a further challenge for algorithm design.

2.2. Traditional and Incremental Model Learning Methods

Sensor data analysis is an emerging field, and it demands an efficient classification model that is capable of mining data streams and making predictions for unseen samples. The traditional classification approach is a method of top-down supervised learning [14], in which a full set of data is used to construct a classification model by recursively partitioning the data to form the mapping relations that model a concept. Since these models are built from a stationary dataset, updating a model requires repeating the whole training process whenever new samples arrive, so that the new samples are added and the changing underlying patterns are incorporated. Traditional models may perform well when a full set of historical data is available and the data are relatively stationary, with few changes expected. In a dynamic stream processing environment such as gesture recognition with a video sensor, however, data streams are ever evolving, and the classification model has to be updated frequently. A new generation of algorithms, generally known as incremental classification algorithms or simply data stream mining algorithms, has therefore been proposed to solve this problem [15].

Take decision tree construction as an example. The heuristic function is an important evaluation method that determines the split attributes for converting leaves into nodes; an instance is the information gain used in C4.5 [16] and the Hoeffding tree [17]. Traditional methods require the full dataset (newly arrived data plus historical data) to update the decision model, whereas incremental methods implement a single-pass approach that does not need to reload the full dataset. Figure 5 shows flowcharts of classification model induction using these two families of learning methods.
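To make the role of the heuristic function concrete, the following minimal sketch computes the information gain of one candidate split on a numeric attribute. It is an illustration only: the function names, the binary threshold split, and the toy data are assumptions made here and are not taken from the algorithms evaluated in this paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting a numeric attribute at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


# Hypothetical toy data: one normalized coordinate and two gesture phases.
x = [0.10, 0.15, 0.20, 0.70, 0.80, 0.90]
y = ["Rest", "Rest", "Rest", "Stroke", "Stroke", "Stroke"]

good_split = information_gain(x, y, threshold=0.5)   # separates the two phases
poor_split = information_gain(x, y, threshold=0.12)  # isolates a single instance
print(good_split, poor_split)
```

A batch learner evaluates such a gain over the full dataset for every candidate split, whereas an incremental learner has to estimate it from the instances seen so far.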

A technical drawback of the traditional methods is that holding the whole model induction process in run-time memory is unfavorable, especially when the input training data are very large. Incremental methods therefore load only a small fragment of the input data stream at a time, rather than loading everything in one go, and refresh the classification model incrementally, as shown in Figure 5. In incremental learning, the Hoeffding bound (HB) is used to decide whether an attribute should be split to establish new nodes, provided that sufficient samples for that attribute have appeared in the data stream. This approach is designed for incremental decision trees, the pioneer of which is the very fast decision tree (VFDT), more generally called the Hoeffding tree (HT) [17]. HT is the classical work that uses HB in the node-splitting test, owing to the statistical property of HB that controls the node-splitting error rate on the fly.

3. Incremental Learning Model for Data Stream Mining

3.1. Batch-Learning Classification Problems

Here we review mathematically why the traditional model induction process may not function well for mining a data stream from a video sensor. Assume that an instance of data $X_t$ arrives from the stream for model induction at timestamp $t$, $t \ge 1$; it carries a vector of values of multiple attributes and a corresponding class value $y_t$, as defined in

$$X_t = (x_{1,t}, x_{2,t}, \ldots, x_{M,t}, y_t). \tag{1}$$

During a time slot that spans $N$ timestamps, the data are collected into a data block $D$, defined in

$$D = \{X_1, X_2, \ldots, X_N\}. \tag{2}$$

With the data block $D$ collected so far on hand, a heuristic function is used to induce a classification model. Let $H$ be such a heuristic function. A greedy search that works in a divide-and-conquer manner is usually employed by a traditional decision tree model, which attempts to induce a globally optimal decision tree, TRGLOBAL. This tree is regarded as global because the full collected dataset $D$ is available. The role of $H$ is to rank and select the attributes one by one in order of the highest information gain [18], as splitting tree nodes in the case of a decision tree. (Other incremental learning methods exist, although the decision tree is used for illustration here.) For each attribute value $x_{i,j}$ of attribute $X_i$, with indices $1 \le i \le M$ and $1 \le j \le N$, where $M$ is the number of attributes and $N$ is the number of instances received so far, $H(x_{i,j})$ is the splitting value. The function picks the attribute with the maximum splitting value among the splitting values ranging from $H(x_{i,1})$ to $H(x_{i,N})$, all of which are already known from $D$. This process ensures that the resultant model is globally optimal with respect to the full data collected in $D$, and it is defined in

$$X_{\mathrm{split}} = \arg\max_{1 \le i \le M} \; \max_{1 \le j \le N} H(x_{i,j}). \tag{3}$$

For any new instance $X_{t'}$ that arrives at a future time $t'$, $t' > N$, the induced model maps it to a predicted class $y_k$, where $k$ indexes the set of possible classes, $1 \le k \le K$. With reference to the data collected and used for training so far, the model is built with the aim of minimizing the classification error, as defined in (4). The Train() and Test() functions are generic and depend on the implementation and the choice of classification algorithm. In general, Train() takes two parameters: the data $D$ to be used for supervised learning and the heuristic function $H$ to be used in learning from $D$. Test() produces a prediction by applying the classification model, which is supposed to be globally optimal, to a testing sample $X_{t'}$ received at timestamp $t'$:

$$\mathrm{TRGLOBAL} = \mathrm{Train}(D, H), \qquad \min \sum_{t'} Err_{t'}, \qquad Err_{t'} = \begin{cases} 1, & \mathrm{Test}(\mathrm{TRGLOBAL}, X_{t'}) \ne y_{t'}, \\ 0, & \text{otherwise}. \end{cases} \tag{4}$$

Now consider the situation at timestamp $N$: the data $D$ have been accumulated up to $X_N$, and a classification model TRGLOBAL has been induced. When new data $X_{N+1}$ arrive at timestamp $N+1$, TRGLOBAL needs to be updated by repeating the induction process defined by (3) and (4) on the enlarged data block $D \cup \{X_{N+1}\}$. The time taken to rebuild the model only gets longer as $M$ and $N$ increase, and each rebuild requires loading all the historical data $D$ again.
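As a rough worked illustration of this growth, assume (purely for the sake of the estimate, not as a measurement from this study) that one induction costs time proportional to the number of instances loaded and that the batch model is rebuilt upon every new arrival. After $N$ arrivals the batch learner has then loaded

```latex
\sum_{n=1}^{N} n \;=\; \frac{N(N+1)}{2}
\quad\Longrightarrow\quad
\frac{9900 \times 9901}{2} \;=\; 49{,}009{,}950
\quad \text{instance loads for } N = 9900,
```

whereas a one-pass learner reads each of the 9,900 instances exactly once. Under these assumptions the batch cost grows quadratically with the length of the stream, while the incremental cost grows linearly.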

In mining sensor data, the collected data instances are massive in volume, and new data are generated continually without end (as in 24/7 video surveillance). How to keep the model up to date efficiently is an open problem. For a frequently updated model, recomputing over historical data is impractical when the data repository contains millions of records. Some sort of incremental approach is required.

To solve this problem, the authors in [17] proposed an alternative method for incrementally inducing a classification model, TRINCR. This method is also known as an any-time algorithm, in which the training data are read only once, without being stored or reloaded. The induction method builds a tree by selecting an attribute for node splitting from sufficient statistics that record the counts of each attribute value. This is done by computing the Hoeffding bound $\epsilon$, defined in (5), which checks how often the value $x_{i,j}$ of attribute $X_i$ has corresponded to class $y_k$:

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}, \tag{5}$$

where $R$ measures the range of the class distribution, $n$ is the number of instances seen so far that belong to a class, and $1 - \delta$ is the desired confidence level. Unlike the traditional approach, for attribute $X_i$ the method checks the splitting value of the heuristic function $H$ by nominating the two best values. At any time, the best value of $H(x_{i,j})$ is $H(x_{i,a})$, such that $H(x_{i,a}) \ge H(x_{i,j})$ for all $j$. Likewise, the second best value is $H(x_{i,b})$, such that $H(x_{i,b}) \ge H(x_{i,j})$ for all $j \ne a$. These two best values are updated incrementally as the induction proceeds and new data arrive. The difference between them is calculated as $\Delta H(X_i) = H(x_{i,a}) - H(x_{i,b})$ for each attribute $X_i$, where $1 \le i \le M$ and $M$ is the number of attributes. For the $n$ instances observed so far, HB yields a confidence interval $\epsilon$, as in (5), within which the attribute value can be related to class $y_k$ with confidence. Incrementally, these confidence intervals are the only statistics retained for each attribute $X_i$. To ensure that an attribute is nominated for node splitting reliably, a minimum number of observed samples, $n_{min}$, is required. Over the observed samples, if the inequality $\Delta H(X_i) > \epsilon$ holds for $n > n_{min}$, then the attribute being tested is, with good confidence, the best candidate, even though the statistics are based on only a part of the data stream rather than the entire stream.
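The split test can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the bound is the standard Hoeffding form $\sqrt{R^2 \ln(1/\delta) / (2n)}$, the range $R$ is taken as $\log_2$ of the number of classes (appropriate when the heuristic is information gain), and the defaults `delta=1e-7` and `n_min=200` are illustrative choices made here.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_value, second_value, n, n_classes,
                 delta=1e-7, n_min=200):
    """Return True when the gap between the two best heuristic values
    exceeds the Hoeffding bound after at least n_min observations."""
    if n < n_min:
        return False
    value_range = math.log2(n_classes)  # assumed range of information gain
    gap = best_value - second_value
    return gap > hoeffding_bound(value_range, delta, n)


# With five gesture phases and a gap of 0.1 between the two best candidates,
# the bound is still too wide after 1,000 instances but not after 10,000.
print(should_split(0.30, 0.20, n=1_000, n_classes=5))
print(should_split(0.30, 0.20, n=10_000, n_classes=5))
```

Because the bound shrinks as $1/\sqrt{n}$, a small but genuine gap between candidates is eventually resolved simply by waiting for more instances, which is what allows the tree to grow without revisiting old data.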

In this way, we estimate the splitting value of attribute $X_i$ without needing to know all of its values from $x_{i,1}$ to $x_{i,N}$. This frees us from reloading the full data to train the classification model, because the model learns incrementally as additional data arrive. The induced model can be used for prediction at any time, and it can also be trained at any time by adjusting the statistics of the splitting values. While being able to absorb unlimited incoming samples from the data stream, incremental learning is designed with the optimization goal of keeping the error minimal, as follows:

$$\min \sum_{t=1}^{\infty} Err_t, \qquad Err_t = \begin{cases} 1, & \mathrm{Test}(\mathrm{TRINCR}, X_t) \ne y_t, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$

where $X_t$ is the instance arriving at timestamp $t$ and $y_t$ is its actual class.

3.2. Incremental Learning Algorithms as Solutions

Two main schools of algorithms were designed for incremental learning: functional-based and decision tree-based. The former group of algorithms constructs a black-box model which is represented by numeric weights and coefficients for mapping the relations between the inputs and the predicted outputs. Two of the most popular functional-based incremental learning algorithms are KStar and Updatable Naïve Bayes.

The full name of KStar is “Instance-Based Learner Using an Entropic Distance Measure.” As the name suggests, it learns incrementally per instance using a similarity function that measures the entropic distance between the test instance and the other instances. Motivated by information theory, the underlying similarity function solves the smoothness problem by summing the probabilities over all possible decision paths, which yields good overall performance. Because of the large amount of summation over all possible paths, KStar usually requires a longer processing time than its counterparts. The algorithm and its entropy-based distance function are described in full in [19]. In the same article, KStar was shown to outperform other rule-based and instance-based learning algorithms on some empirical datasets.

Updatable Naïve Bayes extends the well-known Naïve Bayes classifiers, a family of simple probabilistic classifiers founded on Bayes’ theorem. The algorithm assumes strong independence between the features. An advantage of this assumption is that only a small amount of training data is required to estimate the means and variances of the features (variables), from which the probabilities of all possible outcomes are computed for classification. Updatable Naïve Bayes is the online version of Naïve Bayes, in which the same algorithm continually updates its variables to tune the hypothesis as it runs: it receives a new data instance, predicts its target class based on the current hypothesis, and then uses the new instance to update the hypothesis.
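The predict-then-update cycle can be illustrated with a small Gaussian Naïve Bayes learner that keeps only per-class running means and variances. This is a sketch under assumptions, not the implementation evaluated in this paper: the class names, the Gaussian likelihood, the variance floor of `1e-4`, and the toy stream are all hypothetical choices made here.

```python
import math
from collections import defaultdict


class RunningStats:
    """One-pass mean and variance (Welford's method); no instance is stored."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # A small floor (an assumption) avoids division by zero early on.
        return max(self.m2 / self.n, 1e-4) if self.n else 1e-4


class IncrementalGaussianNB:
    def __init__(self, n_features):
        self.stats = defaultdict(
            lambda: [RunningStats() for _ in range(n_features)])
        self.class_counts = defaultdict(int)

    def predict(self, x):
        if not self.class_counts:
            return None  # nothing learned yet
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for value, s in zip(x, self.stats[label]):
                score += (-0.5 * math.log(2 * math.pi * s.variance)
                          - (value - s.mean) ** 2 / (2 * s.variance))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def learn(self, x, y):
        self.class_counts[y] += 1
        for value, s in zip(x, self.stats[y]):
            s.update(value)


# Hypothetical stream of (features, gesture phase) pairs.
stream = [([0.10, 0.20], "Rest"), ([0.90, 0.80], "Stroke"),
          ([0.15, 0.25], "Rest"), ([0.85, 0.90], "Stroke"),
          ([0.12, 0.22], "Rest"), ([0.88, 0.82], "Stroke")]
model = IncrementalGaussianNB(n_features=2)
correct = 0
for features, label in stream:
    correct += int(model.predict(features) == label)  # test first ...
    model.learn(features, label)                      # ... then train
print(correct, "of", len(stream), "predicted correctly before training on them")
```

Each instance is used once for testing and once for training and is then discarded, so memory use depends on the number of classes and features rather than on the length of the stream.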

The other major group of algorithms is decision tree-based. Building on the any-time tree induction principle discussed in Section 3.1, several research papers over the past decade have proposed different approaches to improve the accuracy of VFDT. Some selected algorithms, together with KStar and Updatable Naïve Bayes, are put to experimental test in this paper. Incremental decision tree algorithms that use HB in the node-splitting test are called Hoeffding trees (HT).

HOT [18] is an algorithm that produces optional tree branches at the same time, replacing rules of lower accuracy with optional ones. The classification accuracy improves significantly, while the learning speed is slowed by the construction of the optional tree branches. Several variants exist: in some, inactive branches that consume computer resources are removed; in others, trees are chosen at random for speed-up, as in the random Hoeffding tree (RHT); and so forth. ADWIN [20], which stands for adaptive sliding window, detects changes by observing the recently seen data instances within a variable-sized sliding window. The node-splitting value is judged on the variation in the average value of the instances seen inside the window. Another adaptive algorithm is called “Concept Drift Active Learning Strategies,” or simply Active [21]. Active learning aims at learning an accurate model with as few tree branches as possible. As data stream in, the data distribution is prone to change over time, resulting in concept drift, so the learnt model needs to adapt by relearning. Data stream learning usually focuses on checking the uncertain instances found near the decision boundary; if concept drift happens in areas other than the boundary, the learning may fail to adapt. Active learning strategies randomize the search space so as to learn evenly from the data stream.

Another challenge associated with incremental learning in data streams is the huge search space (a hyperspace) from which a representative feature subset must be derived for efficient model induction without incurring a large latency in high-speed data stream mining. In general, the larger the number of features in the data, the higher the dimensionality, the larger the search space, and the longer the processing time. The next subsection deals with feature selection techniques that tackle this problem.

3.3. Feature Selection by Swarm Search and APSO

A contemporary type of feature selection algorithm, specially designed for choosing an optimal subset from a huge hyperspace, is the swarm search-feature selection (SS-FS) model [22]. SS-FS is a wrapper-based feature selection model that records the accuracy of each trial classifier built from a candidate feature subset, picks the one with the highest fitness, and outputs the corresponding candidate feature subset as the chosen one. The workflow of the SS-FS model is shown in Figure 6. The operation starts from a random selection of a feature subset and iteratively refines the accuracy of the classification model by searching, in a stochastic manner, for a better feature subset. The flow allows the classification model and the chosen feature subset to converge eventually.

The wrapped classifier is used as a fitness evaluator that indicates how useful a candidate subset of features is, while the optimization function searches for candidate subsets in a stochastic manner. If this approach were run by brute force, testing all possible subsets, it would take a very long time. With 50 features in the sensor data, there are $2^{50} \approx 1.1259 \times 10^{15}$ possible trials of repeatedly building the wrapped classifier. While the number of candidate subsets grows exponentially with the number of features, the computation cost also grows in proportion to the number of instances; in data stream mining, the sensor feed may be unbounded.
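The size of the brute-force search space can be checked directly. The sketch below counts the subsets of a feature set and, under the purely hypothetical assumption that one wrapped classifier can be built and scored in one millisecond, estimates how long an exhaustive search over 50 features would take. The helper names are introduced here for illustration.

```python
from itertools import combinations


def count_subsets(n_features):
    """Number of distinct feature subsets, including the empty one."""
    return 2 ** n_features


def enumerate_subsets(features):
    """Yield every subset of `features`; feasible only for tiny feature sets."""
    for size in range(len(features) + 1):
        yield from combinations(features, size)


total = count_subsets(50)
# Hypothetical cost of 1 ms per trial classifier, converted to years.
years = total * 1e-3 / (365 * 24 * 3600)

print(total)
print(round(years))
print(len(list(enumerate_subsets(["x", "y", "z"]))))
```

Even at this optimistic per-trial cost, exhaustive enumeration would take tens of thousands of years, which is why a stochastic search is needed.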

In this regard, a search strategy called Swarm Search is used. Instead of testing every possible feature subset, Swarm Search employs multiple search agents working in parallel, so that the best feature subset found so far is available at any time. To shorten the search process, our proposed model incorporates a speed-up in the initialization step of the Swarm Search, called Accelerated Particle Swarm Optimization (APSO) [23].

PSO searches the space of an objective function by adjusting the trajectories of individual agents, called particles, as piecewise paths formed by positional vectors in a quasistochastic manner. The movement of a swarming particle consists of two major components: a stochastic component and a deterministic component. Each particle is attracted towards the position of the current global best $g^{*}$ and its own best location in history $x_i^{*}$, called the “individual best,” while at the same time it has a tendency to move randomly. Let $x_i$ and $v_i$ be the position vector and velocity of particle $i$, respectively. The velocity vector is defined by

$$v_i^{t+1} = v_i^{t} + \alpha \epsilon_1 \odot (g^{*} - x_i^{t}) + \beta \epsilon_2 \odot (x_i^{*} - x_i^{t}), \tag{7}$$

where $\epsilon_1$ and $\epsilon_2$ are two random vectors whose entries take values between 0 and 1. The parameters $\alpha$ and $\beta$ are the learning parameters for accelerating the particles, with typical values of $\alpha \approx \beta \approx 2$. One noticeable improvement is the use of an inertia function $\theta(t)$, so that $v_i^{t}$ is replaced by $\theta(t) v_i^{t}$; the velocity vector with the inertia function is defined by

$$v_i^{t+1} = \theta v_i^{t} + \alpha \epsilon_1 \odot (g^{*} - x_i^{t}) + \beta \epsilon_2 \odot (x_i^{*} - x_i^{t}), \tag{8}$$

where $\theta \in (0, 1)$, with a typical value of 0.5. This is similar to introducing a virtual mass to stabilize the motion of the particles, so that the swarm search can converge more quickly.

The reason for using the individual best is primarily to increase the diversity of quality solutions; however, this diversity can be simulated using some randomness. A simplified version, which can accelerate the convergence of the algorithm, uses the global best only. Thus, in this version of APSO the velocity vector is generated by a simpler formula. Consider

$$v_i^{t+1} = v_i^{t} + \alpha \epsilon_n + \beta (g^{*} - x_i^{t}), \tag{9}$$

where $\epsilon_n$ is drawn from $N(0, 1)$ and replaces the individual-best term; in this simplified form $\alpha$ scales the randomness and $\beta$ scales the attraction towards the global best. The update of the position now becomes simply

$$x_i^{t+1} = x_i^{t} + v_i^{t+1}. \tag{10}$$

In order to speed up the convergence further, the update of the location can be defined in a single step,

$$x_i^{t+1} = (1 - \beta) x_i^{t} + \beta g^{*} + \alpha \epsilon_n. \tag{11}$$

This simpler version of the position update delivers the same order of convergence. Typically, $\alpha = 0.1L \sim 0.5L$, where $L$ is the scale of each variable, while $\beta = 0.1 \sim 0.7$ is sufficient for most cases. Note that velocity does not appear in (11), so there is no need to initialize velocity vectors; only the starting positions must be set appropriately.
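The single-step position update can be tried on a toy problem. The sketch below is a minimal continuous-space illustration, not the SS-FS implementation: the function name `apso_minimize`, the sphere objective, the clipping to the search bounds, and the gradual damping of the randomness by a factor `gamma` per iteration are assumptions introduced here. In a feature selection setting the positions would additionally have to be mapped to feature subsets, which is not shown.

```python
import random


def apso_minimize(objective, dim, bounds, n_particles=20, n_iter=100,
                  alpha=0.2, beta=0.5, gamma=0.95, seed=0):
    """Minimize `objective` with the single-step accelerated PSO update
    x <- (1 - beta) * x + beta * g_best + alpha * L * noise."""
    rng = random.Random(seed)
    low, high = bounds
    scale = high - low  # L, the scale of each variable
    swarm = [[rng.uniform(low, high) for _ in range(dim)]
             for _ in range(n_particles)]
    best = min(swarm, key=objective)[:]  # copy of the global best
    for _ in range(n_iter):
        for particle in swarm:
            for d in range(dim):
                noise = alpha * scale * rng.gauss(0.0, 1.0)
                value = (1 - beta) * particle[d] + beta * best[d] + noise
                particle[d] = min(max(value, low), high)  # stay in bounds
            if objective(particle) < objective(best):
                best = particle[:]
        alpha *= gamma  # assumption: reduce randomness as the search proceeds
    return best


def sphere(position):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in position)


result = apso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
print(result, sphere(result))
```

No velocity array is kept, which mirrors the observation that only the starting positions need to be initialized.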

To set the initial starting positions for APSO, a feature ranking function that is quick and simple should be applied. In our proposed data stream mining model, a simple and efficient feature selection method called clustering coefficients of variation (CCV) is used to find suitable starting positions for APSO. CCV rests on a simple variance-based principle: it finds a subset of features that balances classification model induction between generalization and overfitting. CCV is founded on the belief that a good attribute in a training dataset should have values that vary sufficiently widely across a range, so that it contributes significantly to a useful prediction model. The coefficient of variation (CV) is expressed as a real number from −∞ to +∞ and describes the standard deviation of a set of numbers relative to their mean. It can be used to compare variability even when the units are not the same. In general, the CV informs us about the extent of variation relative to the size of the observation, and it has the advantage of being independent of the units of observation. Because it does not depend on the unit of measurement, the CV is on a common scale across all the features of a dataset. Information about data variation across all the features can therefore be obtained by looking at the ratio of standard deviation to mean in each feature. Intuitively, if the mean is the expected value, then the coefficient of variation is the expected variability of a measurement relative to the mean.
This is useful when comparing measurements across multiple heterogeneous data sets or across multiple measurements taken on the same data set: the coefficients of variation of two data sets, or of two attributes in the case of feature selection, can be compared directly, even if the data in each are measured on very different scales, sampling rates, or resolutions. In contrast, the standard deviation is specific to the measurement or sample it is obtained from; that is, it is an absolute rather than a relative measure of variation. In statistics, the CV is sometimes known as a measure of dispersion, which helps compare variation across variables with different units. A variable with a higher coefficient of variation is more dispersed than one with a lower CV.
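The unit independence of the CV can be checked with a small sketch. The helper name `coefficient_of_variation` and the sample values are hypothetical, and the population standard deviation is assumed here since the text does not say which estimator is used.

```python
import statistics


def coefficient_of_variation(values):
    """Return standard deviation divided by mean (population form, an assumption)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mean


# The same hypothetical measurements expressed in metres and in centimetres.
heights_m = [1.60, 1.70, 1.80]
heights_cm = [v * 100 for v in heights_m]

cv_m = coefficient_of_variation(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
```

The two standard deviations differ by a factor of 100, while the two CV values agree up to floating-point rounding.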

Let $X$ be a training dataset with $n$ instances of vector whose values are characterized by a total of $m$ attributes or features. An instance is an $m$-dimensional tuple, in the form of $(x_1, x_2, \ldots, x_m)$. For each feature $x_j$, where $1 \le j \le m$, the values can be partitioned into $K$ subgroups of different classes, where $K$ is the total number of prediction target classes, so that $x_j = x_{j,1} \cup x_{j,2} \cup \cdots \cup x_{j,K}$. Consider

$$\mathrm{CCV}_j = \sum_{k=1}^{K} \frac{\sigma_{j,k}}{\mu_{j,k}},$$

where $\mu_{j,k}$ is the mean of all the $j$th feature values that belong to class $k$ and $\sigma_{j,k}$ is their standard deviation. $\mathrm{CCV}_j$ is the sum of the coefficients of variation over each class $k$, where $1 \le k \le K$, for that particular $j$th feature. The coefficient of variation is expressed as a real number from $-\infty$ to $+\infty$. The subsequent step in CCV, after calculating the CV, is to find a threshold that decides which features, and how many, are to be retained. The underlying concept behind this task is the bias-variance dilemma. Some recent studies state that decomposing a supervised learner’s error into bias and variance terms can provide considerable insight into the prediction performance of the classifier.
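The per-feature score described above, a sum over classes of the standard deviation divided by the mean, can be sketched as follows. The function name `ccv_scores`, the toy rows, and the handling of zero means are assumptions made for illustration only.

```python
import statistics
from collections import defaultdict


def ccv_scores(rows, labels):
    """Return one score per feature: the sum over classes of std / mean."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        total = 0.0
        for class_rows in by_class.values():
            column = [r[j] for r in class_rows]
            mean = statistics.mean(column)
            if mean != 0:  # assumption: a class with zero mean contributes nothing
                total += statistics.pstdev(column) / mean
        scores.append(total)
    return scores


# Hypothetical data: feature 0 varies within each class, feature 1 does not.
rows = [[1.0, 10.0], [3.0, 10.0], [2.0, 20.0], [6.0, 20.0]]
labels = ["a", "a", "b", "b"]
scores = ccv_scores(rows, labels)
```

In this toy case feature 0 receives a score of about 1.0 (0.5 from each class) and feature 1 receives 0.0, so a threshold between the two would retain only feature 0.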

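Assume a target function $y = f(x) + \epsilon$, where $\epsilon$ is zero-mean noise with variance $\sigma_\epsilon^2$. Then the expected squared error over fixed-size training sets $D$ drawn from the joint distribution $P(X, Y)$ can be expressed as the sum of three components:

$$E\big[(y - \hat{f}(x; D))^2\big] = \sigma_\epsilon^2 + \big(E[\hat{f}(x; D)] - f(x)\big)^2 + E\big[(\hat{f}(x; D) - E[\hat{f}(x; D)])^2\big],$$

that is, the noise, the squared bias, and the variance, respectively.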

Our goal is to minimize the expected loss, which we have decomposed into the sum of a (squared) bias, a variance, and a constant noise term. As we shall see, there is a trade-off between bias and variance: very flexible models (which can overfit) have low bias and high variance, and relatively rigid models (which underfit) have high bias and low variance. To achieve an equilibrium between bias and variance, a simple $k$-means clustering technique is employed. It tries to partition the data points into two clusters: one to be retained and the other to be removed. The goal is to assign membership of a cluster to each data point. The clustering algorithm finds the cluster positions $c_i$, $i = 1, \ldots, k$, that minimize the distance from the data points to the cluster centroids, with the following objective function:

$$\arg\min_{c} \sum_{i=1}^{k} \sum_{x \in S_i} d(x, c_i),$$

where $S_i$ is the set of points that belong to cluster $i$. The clustering algorithm uses the square of the Euclidean distance, $d(x, c_i) = \lVert x - c_i \rVert_2^2$.
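As an illustration of the two-cluster split, the following is a minimal one-dimensional sketch of Lloyd's iteration with two centroids and squared Euclidean distance. The function names, the initialisation at the minimum and maximum values, and the sample scores are assumptions, not the authors' implementation.

```python
def two_means_1d(values, iterations=100):
    """Split 1-D values into a low cluster (0) and a high cluster (1)."""
    low, high = min(values), max(values)  # assumption: start at the extremes
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            0 if (v - low) ** 2 <= (v - high) ** 2 else 1 for v in values
        ]
        if new_assignment == assignment:
            break  # memberships are stable, so the centroids are too
        assignment = new_assignment
        low_members = [v for v, a in zip(values, assignment) if a == 0]
        high_members = [v for v, a in zip(values, assignment) if a == 1]
        if low_members:
            low = sum(low_members) / len(low_members)
        if high_members:
            high = sum(high_members) / len(high_members)
    return assignment, (low, high)


def within_cluster_cost(values, assignment, centroids):
    """Objective: sum of squared distances from each point to its centroid."""
    return sum((v - centroids[a]) ** 2 for v, a in zip(values, assignment))


# Hypothetical per-feature variation scores.
cv_values = [0.1, 0.2, 0.15, 1.9, 2.1]
assignment, centroids = two_means_1d(cv_values)
cost = within_cluster_cost(cv_values, assignment, centroids)
```

Here the first three features fall into the low cluster and the last two into the high cluster; which of the two clusters is retained is a separate decision.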

Given a data set $D$, the estimation function is $\hat{f}(x; D)$. As mentioned before, when more parameters are added to the model as features, the complexity of the model rises, and so does the variance, while the bias steadily falls. The function of $k$-means is to divide the data set into two groups according to the values of the coefficient of variation. The variance and bias values differ between the data points in the two clusters, reflecting the complexity of the model. It is known that the more complex a model is, the higher its variance and the lower its bias, and vice versa. We therefore reduce the complexity of the model by choosing the valuable attributes, separating them by variance. The total error of two groups:

One of the two groups (15) and (16), whose data points represent the combinations of variances and biases, is to be chosen as the optimal feature subset. A quick and effective division method called HyperPipes [24] is utilized for this task. HyperPipes is a probabilistic learning tool that is very similar to Naïve Bayes, except that it does not record the frequency count of how attributes correspond to classes. In essence, an attribute either corresponds to a hypothetical class or it does not, regardless of how many times this is the case. The learner records all of the attributes and their correspondence with the class in a table of Booleans. The learner determines the class from the sum of the attribute scores (0 if the correspondence does not exist and 1 if it does).
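The Boolean-table idea can be sketched as follows for nominal attributes only. The class name `HyperPipesSketch`, the attribute values, and the class labels are hypothetical, and this is an illustration of the description above rather than the implementation cited in [24].

```python
from collections import defaultdict


class HyperPipesSketch:
    """Per class, remember which attribute values were ever seen (no counts)."""

    def __init__(self):
        # seen[label][attribute index] = set of values observed with that label
        self.seen = defaultdict(lambda: defaultdict(set))

    def learn(self, instance, label):
        for j, value in enumerate(instance):
            self.seen[label][j].add(value)

    def score(self, instance, label):
        """Count attributes whose value was seen with this class (1 each, else 0)."""
        pipes = self.seen.get(label, {})
        return sum(
            1 for j, value in enumerate(instance) if value in pipes.get(j, set())
        )

    def predict(self, instance):
        # Ties are broken by the order in which classes were first learned.
        return max(self.seen, key=lambda label: self.score(instance, label))


model = HyperPipesSketch()
model.learn(("up", "fast"), "stroke")
model.learn(("up", "slow"), "stroke")
model.learn(("down", "slow"), "rest")
```

Learning the same instance a second time leaves the table unchanged, which is the difference from a frequency-based learner such as Naïve Bayes.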

4. Mining Sensor Data Streams

4.1. Evaluation Method

The experiment contains two parts. Firstly, we compare two groups of classification learning methods, traditional batch learning and incremental learning, with respect to classification performance measures such as accuracy, kappa, precision, and recall. The names of the classification learning algorithms, together with short descriptions, are shown in Table 2. The algorithms chosen for both groups are popular methods that have been used widely in the literature. The data stream mining algorithms under test are mainly derived from the Hoeffding principle for growing a decision tree. In addition, two incremental learners that are not decision trees, Updateable Naïve Bayes and KStar, are included in the comparison. Secondly, the timing performance of the two groups of classifiers is evaluated, in relation to the cost-benefit of accuracy improvement at the price of extra running time.

The experiment was conducted on a Dell Precision T7610 PC with an Intel Xeon Processor E5-2670 v2 (Ten Core HT, 2.5 GHz Turbo, 25 MB) and 128 GB RAM. The programming environment is Java Development Kit 1.5. The algorithms are implemented on the MOA platform. Default parameter values are used for all experimental runs. For a fair evaluation of the efficacy of the algorithms, 10-fold cross-validation is used to obtain an unbiased estimate of the accuracy of the classification models. The data is divided into 10 subsets of equal size; the same algorithm builds a model in each of 10 rounds, and each round holds out one of the 10 subsets from training as unseen data for performance validation.
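The fold construction can be sketched as below. The generator name `k_fold_indices` is hypothetical, and the sketch splits contiguous index ranges without shuffling, which is an assumption; the evaluation platform may assign instances to folds differently.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) for k folds of near-equal size."""
    indices = list(range(n_instances))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
```

Each instance appears in exactly one held-out subset, so every instance is used once for validation and nine times for training.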

4.2. Sensor Data Classification

The sensor data subject to the performance evaluation is treated with four types of preprocessing for feature selection. The first performs no feature selection; we simply call it “Original”, meaning that the sensor data is in its original form as collected from the video sensor. The second is correlation-based feature selection, namely Cfs, which is a popular approach in data mining. The third is swarm search feature selection using PSO, called FS-PSO. The fourth is the same as the third except that standard PSO is replaced by accelerated PSO; it is called FS-APSO.

The experiment was conducted over a number of combinations of feature selection preprocessing methods and classification algorithms, from both the traditional and the incremental learning types. The performance results are reported in terms of accuracy, kappa (the kappa statistic), true positive rate, false positive rate, precision, recall, F-measure, model building time per run, preprocessing time, and number of features selected. The results are tabulated in Tables 3 and 4 for traditional and incremental classification algorithms, respectively. Selected important performance indicators, namely accuracy, kappa, true positive rate, false positive rate, time, and size of the selected feature subset, are graphed as radar charts in Figures 7 to 12, respectively.
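For reference, the listed indicators can all be derived from a confusion matrix. The sketch below uses the standard definitions of accuracy, Cohen's kappa, precision, recall, F-measure, and false positive rate; the function name and the example matrix are hypothetical and are not taken from Tables 3 or 4.

```python
def classification_metrics(confusion):
    """confusion[i][j] = number of instances of true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    accuracy = correct / total

    row_totals = [sum(row) for row in confusion]
    col_totals = [
        sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)
    ]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    per_class = []
    for k in range(n_classes):
        tp = confusion[k][k]
        fp = col_totals[k] - tp
        fn = row_totals[k] - tp
        tn = total - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0  # also the true positive rate
        f_measure = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        fp_rate = fp / (fp + tn) if fp + tn else 0.0
        per_class.append(
            {
                "precision": precision,
                "recall": recall,
                "f_measure": f_measure,
                "fp_rate": fp_rate,
            }
        )
    return accuracy, kappa, per_class


# Hypothetical two-class confusion matrix.
accuracy, kappa, per_class = classification_metrics([[40, 10], [20, 30]])
```

Kappa discounts the agreement that would be expected by chance, which is why it can be much lower than accuracy on the same predictions.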

4.3. Discussion of the Results

The radar charts are laid out with the seven traditional classification algorithms on the right side of the chart and the seven incremental algorithms on the left, for easy comparison. The accuracy measure is defined as the number of correctly classified instances over the total number of instances in the sensor data. In the case of batch learning by traditional algorithms, the accuracy is the ratio of correctly classified instances over all the 99,000 instances. In the case of incremental learning, the accuracy is measured by averaging all the intermediate accuracies obtained from each data segment over a series of tests. In Figure 7 the overall accuracy of the traditional classification algorithms is higher than that of the incremental algorithms: the average accuracy is 82.98296% for traditional versus 74.08409% for incremental. The top performers are Neural Network and KStar. In general, the Original and Cfs preprocessing methods are outperformed by FS-PSO and FS-APSO. Cfs consistently offered an improvement in accuracy for traditional algorithms, though only marginally. For incremental algorithms, however, Cfs does not always enhance the accuracy. This may be because the calculation of correlation between targets and attributes in the incremental mechanism does not work well with nonstationary data. The swarm search type of feature selection (FS) outperformed Cfs throughout. The improvement by FS is most obvious for the NB, RHT, HOT, NBup, and KStar algorithms. These algorithms have in common that their model structures are loosely represented by a large set of numeric variables. In HOT and RHT, for example, the decision trees take multiple forms, gathering a pool of possible model candidates during the induction process. NB, NBup, and KStar are represented by a large number of conditional probabilities and statistical variables.
These models are relatively loosely defined; hence the stochastic search by PSO is appropriate and effective in finding the optimal feature subsets, leading to a big leap in performance. The proposed new version of Swarm Search with APSO, namely FS-APSO, shows a greater performance improvement than the standard PSO version, FS-PSO. FS-APSO is better than FS-PSO in all cases except NB. Moreover, for HT, PSO performs very poorly in upholding the accuracy, whereas APSO solves the problem. FS-APSO shows by far the largest accuracy improvement over Original and Cfs, indicating that FS-APSO would be a feasible feature selection scheme for the other members of the Hoeffding tree family. For performance indicators such as kappa and true positive rate, the algorithms show patterns similar to those described above, in Figures 8 and 9, respectively. The false positive rate, also known as the false alarm rate, is an undesirable property in machine learning. Figure 10 shows that RHT with Cfs incurred the highest false alarm rate, suggesting that correlation-based feature selection is unsuitable for data stream mining, especially when many random trees are generated during runtime. FS-APSO managed to reduce the false alarm rate in all cases. KStar in particular works extremely well with FS-APSO, maintaining the lowest false alarm rate of all.

The numbers of features selected as an optimal subset by the different combinations of algorithms are shown in Figure 12. It can be seen that FS-APSO retains only the minimum number of features that are significant enough for inducing classification models of the highest accuracy in most cases. FS-PSO, the standard PSO counterpart of FS-APSO, follows; it likewise selects fewer features than Cfs in all cases except NN, BN, and HP. Selecting fewer features may imply simpler deployment of classification or prediction on sensor data, without the need for a full array of features, each of which may require certain processing resources and sensing abilities. In other words, it is cost-effective to require fewer features while still attaining a good level of accuracy in sensor data classification.

Lastly, runtime is considered together with the accuracy-related performance measures. Figure 11 shows a comparison of the preprocessing times incurred by different combinations of feature selection and classification algorithms. Cfs takes almost no time, which is indeed a benefit, although Cfs underperforms in accuracy and the other performance indicators. FS-PSO and FS-APSO are stochastic in nature and need time to search for the optimal feature subset, so it is interesting to observe which of the two is more efficient. FS-APSO, which benefits from the qualified features precalculated very quickly by CCV as its initial search position, shortens the runtime in all cases (except NBup) when compared to FS-PSO. HP is remarkably quick with both FS-PSO and FS-APSO, followed by RHT, which completes the preprocessing in a relatively short time. KStar, NN, SVM, and AC, however, require the longest preprocessing with both types of FS methods. Figure 11 also shows that preprocessing takes slightly longer in the traditional group of classification algorithms than in the incremental group used in data stream mining. This could be explained by the fact that a traditional classifier embedded in the swarm search as the fitness evaluation function is time-consuming to run over the stationary data. On the other hand, an incremental algorithm that mines along the data stream performs much more quickly as the fitness function because of its incremental nature.

To have a fair comparison with respect to time, a new indicator called gain is proposed in this paper. Gain is simply the performance increase factor: the increment in accuracy (accuracy % with feature selection minus accuracy % of Original) divided by the number of seconds consumed in preprocessing. Ideally we favour a combination of algorithms with a high gain, meaning that it yields the highest increase in accuracy while incurring the shortest preprocessing time. To compare the different combinations of algorithms in view of this indicator, a gain value is computed for each of them and tabulated in Table 5.
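The gain indicator amounts to a one-line computation, sketched below. The function name `gain` and the numbers in the usage line are hypothetical and are not values from Table 5.

```python
def gain(accuracy_with_fs, accuracy_original, preprocessing_seconds):
    """Accuracy increase (percentage points) per second of preprocessing."""
    if preprocessing_seconds <= 0:
        raise ValueError("preprocessing time must be positive")
    return (accuracy_with_fs - accuracy_original) / preprocessing_seconds


# Hypothetical case: accuracy rises from 80% to 85% after 10 s of preprocessing.
example_gain = gain(85.0, 80.0, 10.0)
```

A negative gain indicates that feature selection reduced the accuracy relative to Original, regardless of how little time it took.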

As shown in Table 5, the highest average gain is achieved by the group of incremental learning algorithms coupled with FS-APSO (0.7253), followed by the same incremental group with FS-PSO (0.5801), then the traditional algorithms with FS-APSO (0.3202); the traditional group with FS-PSO comes last (0.2679). Individually, the top gain is achieved by RHT combined with FS-APSO. Both RHT and FS-APSO employ many randomization functions, yet they complement each other in operation. NN in turn has the least gain, because it has a rigid learning mechanism that adjusts its internal weights and activation function.

To sum up, our newly proposed FS-APSO is the most feasible choice for data stream mining, particularly with the RHT algorithm. Alternatively, NB, DT, and RF are good choices considering their relatively high accuracy and moderate preprocessing times.

5. Conclusion

As long as a sensing device is operating, it collects a large amount of streaming data. Fresh data are generated continuously, which requires incremental computation that is able to monitor large-scale data dynamically. As a result, the algorithm design of a data mining sensor application should consider a lightweight incremental algorithm that offers robustness, high accuracy, and minimum preprocessing latency. In this paper, we investigated the possibility of using a group of incremental classification algorithms for classifying the data streams collected from a video sensor for gesture recognition. As a case study, an empirical data stream was used that was donated by the research team of Madeo et al. at the University of Sao Paulo, Brazil. The data collected are visual feeds of composed video captured by a Microsoft Kinect sensor. The video data are transformed into 50 numeric variables extracted from videos of people gesticulating, with the aim of studying gesture phase segmentation. We compared traditional classification model induction with its counterpart in incremental induction. In particular, we proposed a novel lightweight feature selection method using Swarm Search and Accelerated PSO, which is intended to be suitable for data stream mining. The evaluation results showed that the incremental methods obtained a higher gain in accuracy per second incurred in preprocessing. The contribution of this paper is a set of experimental insights for anybody who wishes to design a similar gesture recognition application from video sensors and needs to choose appropriate decision support algorithms, especially in scenarios of mining activity patterns that are temporal and streaming in nature. In the future, we want to extend the data stream mining of such sensor data with the capability of sensing more complex gestures, for richer information in the experimentation.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.