Abstract

Multiple built-in cameras and the small size of mobile phones are underexploited assets for creating novel applications that are ideal for pocket-sized devices but may not make much sense with laptops. In this paper we present two vision-based methods for the control of mobile user interfaces based on motion tracking and recognition. In the first case the motion is extracted by estimating the movement of the device held in the user's hand. In the second it is produced by tracking the motion of the user's finger in front of the device. In both alternatives, sequences of motion are classified using Hidden Markov Models. The results of the classification are filtered using a likelihood ratio and the velocity entropy to reject possibly incorrect sequences. Our hypothesis is that incorrect measurements are characterised by a higher entropy value for their velocity histogram, denoting more random movement by the user. We also show that the same filtering criteria can be used to control unsupervised Maximum A Posteriori adaptation. Experiments conducted on a recognition task involving simple control gestures for mobile phones clearly demonstrate the potential of our approaches and may provide ingredients for new user interface designs.

1. Introduction

Designing comfortable user interfaces for mobile phones is a challenging problem, given the limited amount of interaction hardware and the small size of the device. Touch sensitive technology has already enabled new ways for users to interact with handheld devices. Recent touch screens provide an intuitive interface for navigating content, but this equipment still imposes some limitations: the user's fingertip size can decrease pointing accuracy, the area of interest can be occluded by fingers, and most importantly the operation area is restricted. Moreover, the number of functions in mobile devices is likely to keep increasing due to the forthcoming 3D user interfaces and applications. Going forward, we will also see multiple sensors in portable devices that can enrich the mobile user experience by allowing control through gestures and other types of movement. Studies into alternative forms of mobile user interaction have, therefore, become a very active research area in recent years.

Much of the work in mobile interaction has been in direct manipulation interfaces, such as screen navigation by scrolling or pointing and clicking. In particular, it has been shown that different sensors provide viable alternatives to conventional user interaction. For example, tilting interfaces can be implemented with gyroscopes [1] and accelerometers [2]. Using both tilt and buttons, the device itself serves as the input for navigating menus and maps, and only one hand is required during operation. Several devices employ a detachable stylus in which interaction is done by tapping the touch screen to activate buttons or menu choices. Interestingly, Apple's products make use of the same technology in a different way. In the iPhone, users can zoom in and out by performing multi-finger gestures on the touch screen. In addition, a proximity sensor shuts off the display in certain situations to save battery power, and an accelerometer senses the orientation of the phone and changes the screen accordingly.

On the other hand, many current mobile phones also have two cameras built in, one for capturing high-resolution photographs and the other for lower-resolution video telephony. Even the most recent devices have not yet utilised the unique input capabilities enabled by cameras for purposes other than photography. With appropriate computer vision methods, the information provided by images allows us to create new self-intuitive user interface concepts. In our work we have focused on what could be described as indirect interfaces, where an abstract shape is recognised and then interpreted as a command by the mobile device.

In this paper we investigate two specific approaches for creating patterns of motion: firstly the estimation of the egomotion of the device itself using the inbuilt camera now available on most mobile devices, and secondly the use of this camera for tracking the motion of an external object, in our case the user's finger. These motion trajectory sequences are then modelled using Hidden Markov Models (HMMs). In order to improve the initial classification results, we propose to automatically filter incorrectly classified sequences from the final result. This filtering is based on two criteria: entropy and likelihood ratio. The first, entropy, is a measure of the data itself, whilst the second, likelihood ratio, is a measure of the confidence in the classification result. In our case the entropy measure is used to characterise the randomness of the velocity of the motion sequence. Our hypothesis is that sequences with more random velocity are more likely to be incorrectly classified, as opposed to sequences with a more constant velocity. The likelihood ratio is the ratio between the most likely sequence and the second most likely sequence, and can be seen as a confidence measure of the classification result.

In the following section, Section 2, we look at previous approaches to vision-based control of mobile user interfaces. In Section 3, we present two methods used for producing motion information from image sequences. In Section 4, we describe the HMMs used for sequence classification and the use of Maximum A Posteriori (MAP) adaptation in adapting these models. In Section 5, we demonstrate in two sets of experiments how the criteria of entropy and likelihood ratio can be used to filter the results of a recognition task and also how the same criteria can be used to select data for performing unsupervised adaptation of HMMs using MAP adaptation. Finally we present our conclusion in Section 6.

2. Related Work

Much of the previous work on vision-based user interaction with mobile phones has utilised measured motion information directly for controlling purposes. In these systems the user can operate the phone through a series of hand movements whilst holding the phone, performing actions on the screen of the device such as scrolling or pointing and clicking [3]. For example, Siemens introduced an augmented reality game called Mozzies developed for their SX1 cell phone in 2003. This was the first mobile phone application utilising the camera as a sensor. The goal of the game was to shoot down the synthetic flying mosquitoes projected onto a real-time background image by moving the phone around and clicking at the right moment. During these movements, the motion of the phone was recorded using a simple optical flow technique.

After this work, we have seen many other image motion based approaches. Möhring et al. [4] presented a tracking system for augmented reality on a mobile phone to estimate 3D camera pose using special colour coded markers. Other marker-based methods use a printed or hand-drawn circle [5], a hand-held target [6], and a set of squares [7] to facilitate the control task. One new solution was presented by Pears [8]. The idea of this approach was to use a camera on the mobile device to track markers on the computer display. This technique can compute which part of the display is viewed and the 6-DOF position of the camera with respect to the display.

An alternative to markers is to estimate motion between successive image frames with methods similar to those commonly used in video coding. Rohs [9] divided incoming frames into a fixed number of blocks and then determined the relative 𝑥, 𝑦, and rotational motion using a simple block matching technique in order to interact with an RFID tag. Another possibility is to extract distinctive features such as edges and corners that exist naturally in the scene. Haro et al. [10] have proposed a feature-based method to estimate movement direction and magnitude, so the user can navigate the device screen in 2D. Instead of using local features, some approaches extract global features such as integral projections from the image [11].

A recent and generally interesting direction for mobile interaction is to combine information from several different sensors. In their feasibility study, Hwang et al. [12] combined forward and backward movement and rotation around the 𝑦-axis from camera-based motion tracking with tilts about the 𝑥- and 𝑧-axes from a 3-axis accelerometer. Also, a technique to couple wide area, absolute, low resolution global data from a GPS receiver with local tracking using feature-based motion estimation was presented by DiVerdi and Höllerer [13].

Recently, motion input has also been applied to more advanced indirect interaction such as recognising signs. This increases the flexibility of the control system, as abstract signs can be used to represent any command, such as controls for a music player. A number of authors have examined the possibility of using phone motion to draw alphanumeric characters. Liu et al. [14] show examples of Latin and Chinese characters drawn using the ego-motion of a mobile device, although these characters are not recognised or used for control. Kratz and Ballagas [15] propose using a simple set of motions to interact with the external environment through the mobile device. In their case there are four symbols, consisting of a three-sided square in four different orientations, and due to the small size of the symbol set they report good performance with no user training.

The other solution studied in this paper, vision-based finger tracking, is a well-studied problem on desktop computers with numerous applications [16, 17]. On mobile devices, Henrysson et al. [18] considered how a front-facing camera on the phone can be used for 3-D augmented reality interaction. They compared finger gesture input to tangible input, keypad interaction, and phone tilting in user interface tasks. However, in their work finger tracking was performed using a simple frame-differencing method. A similar system, called Finteraction, was introduced by Jenabi and Reiterer [19], but they do not provide much detail of the tracking method. Davis et al. [20] presented a real-time algorithm for finger pointing. The method is based on skin detection, which makes it susceptible to illumination changes and noise. In experiments, they evaluate the method in a picture browsing task, achieving promising results. Recently, Terajima et al. [21] presented another template-based finger tracking system for recognising motion made by the user. They achieve real-time performance but do not provide any quantitative analysis of the algorithm.

3. Motion Feature Extraction

In our contribution, we propose two alternative solutions for extracting motion information from successive images which can be used as a feature for classification. In the first approach, the ego-motion of the device is estimated while the user operates the phone through a series of hand movements. The second technique is to move an object such as a finger in front of the camera and simultaneously track the object during gestures. Both approaches utilise feature-based motion analysis as a subtask, where a sparse set of image features is first selected from one image and then their displacements are determined. In order to improve the accuracy of the motion information, the uncertainty of these features is also analysed.

3.1. Feature Motion Analysis

Feature motion analysis begins with the selection of image features from the first frame. The goal is to ensure that the features are distributed over the image so that the probability of a sufficient representation of the overall image motion is high. We use a computationally straightforward approach in which the image area is split into nonoverlapping regions and one feature is selected from each region [22].

Another goal is to select distinctive features which guarantee high precision in the estimation of the displacement vectors. Various criteria for selecting such features typically analyse the richness of texture within an image area [23]. One approach is to consider first-order image derivatives in the horizontal and vertical directions; the sum of squared derivatives provides a computationally simple criterion. An alternative approach, which we have used, is eigenvalue analysis of the 2×2 normal matrix, which can give better features at a slightly higher computational cost.
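As an illustration of this selection step, the sketch below combines the per-region selection described above with the minimum-eigenvalue criterion of the 2×2 normal matrix. It is a minimal sketch, not the authors' implementation: the grid size, patch size, and the use of scipy for the patch sums are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def select_features(gray, grid=(6, 8), patch=5):
    """Pick one feature per grid cell: the pixel maximising the smaller eigenvalue
    of the patch-summed 2x2 normal matrix. 'gray' is a float grayscale image."""
    gy, gx = np.gradient(gray.astype(np.float64))      # first-order derivatives
    # Patch-averaged entries of the normal matrix [[a, b], [b, d]]
    a = uniform_filter(gx * gx, patch)
    b = uniform_filter(gx * gy, patch)
    d = uniform_filter(gy * gy, patch)
    lam_min = 0.5 * (a + d - np.hypot(a - d, 2.0 * b))  # smaller eigenvalue
    h, w = gray.shape
    feats = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            r0, r1 = r * h // grid[0], (r + 1) * h // grid[0]
            c0, c1 = c * w // grid[1], (c + 1) * w // grid[1]
            cell = lam_min[r0:r1, c0:c1]
            y, x = np.unravel_index(np.argmax(cell), cell.shape)
            feats.append((r0 + y, c0 + x))              # (row, col) of the feature
    return feats
```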

To estimate the displacement of feature 𝑖, a block matching measure is evaluated exhaustively over a suitable range of integer displacements in both the 𝑥- and 𝑦-directions. As a matching measure, we use either the sum of squared differences (SSD) or its variant, the zero-mean sum of squared differences (ZSSD). The latter measure is more robust to lighting changes, which can be crucial in some applications. Exhaustive evaluation of either measure gives a motion profile. The displacement that minimises the criterion provides a feature motion estimate 𝐝𝑖, which can be refined to subpixel accuracy via quadratic interpolation of the motion profile values.
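The following sketch performs the exhaustive SSD/ZSSD search and the quadratic sub-pixel refinement for a single feature. The patch size and search range follow the values quoted later for the experimental tracker (5×5 pixels, 12 pixels), but all function and parameter names are assumptions and the interpolation details are one possible choice.

```python
import numpy as np

def match_feature(prev, curr, p, patch=5, max_disp=12, zero_mean=True):
    """Exhaustive block matching around feature p = (y, x); returns a sub-pixel
    displacement estimate (dy, dx) and the motion profile."""
    half = patch // 2
    y, x = p
    tmpl = prev[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)
    if zero_mean:
        tmpl = tmpl - tmpl.mean()                       # ZSSD instead of plain SSD
    size = 2 * max_disp + 1
    profile = np.full((size, size), np.inf)             # matching-measure values
    h, w = curr.shape
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            yy, xx = y + dy, x + dx
            if yy - half < 0 or xx - half < 0 or yy + half >= h or xx + half >= w:
                continue                                # candidate block outside image
            cand = curr[yy - half:yy + half + 1, xx - half:xx + half + 1].astype(np.float64)
            if zero_mean:
                cand = cand - cand.mean()
            profile[dy + max_disp, dx + max_disp] = np.sum((tmpl - cand) ** 2)
    iy, ix = np.unravel_index(np.argmin(profile), profile.shape)
    d = np.array([iy - max_disp, ix - max_disp], dtype=np.float64)
    # Quadratic interpolation around the minimum for sub-pixel accuracy
    if 0 < iy < size - 1 and np.isfinite(profile[iy - 1, ix]) and np.isfinite(profile[iy + 1, ix]):
        f0, f1, f2 = profile[iy - 1, ix], profile[iy, ix], profile[iy + 1, ix]
        if f0 - 2 * f1 + f2 > 0:
            d[0] += 0.5 * (f0 - f2) / (f0 - 2 * f1 + f2)
    if 0 < ix < size - 1 and np.isfinite(profile[iy, ix - 1]) and np.isfinite(profile[iy, ix + 1]):
        f0, f1, f2 = profile[iy, ix - 1], profile[iy, ix], profile[iy, ix + 1]
        if f0 - 2 * f1 + f2 > 0:
            d[1] += 0.5 * (f0 - f2) / (f0 - 2 * f1 + f2)
    return d, profile
```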

Uncertainty of the obtained estimate is analysed by detecting those displacements that may be close to the true displacement according to the matching measure value. Selection of the set of those displacements is based on gradient-based thresholding of the motion profile. The result of this analysis is summarized as a covariance matrix 𝐂𝑖.

As a result of these computational steps, we obtain a set of motion features. A motion feature 𝑖 consists of the feature centroid location in the first image, 𝐩𝑖, its displacement estimate 𝐝𝑖, and the result of uncertainty analysis (𝐂𝑖). Device motion estimation and object tracking use this information as an input.

3.2. Device Motion Estimation

A mobile user interface system controlled through a series of hand movements requires a method for estimating the ego-motion of the device's camera [22]. Camera ego-motion is often estimated from 2-D image motion measured between two successive frames. As the observed motion in an image sequence may consist of multiple motions due to moving objects in a scene and motion parallax, one must consider solutions that estimate the dominant motion.

Ego-motion estimation generally refers to the computation of 6-DOF motion. However, the choice of a model and the number of parameters for the computation are application dependent. For simplicity we use a four-parameter similarity motion model which represents the displacement $\mathbf{d}$ of a feature located at $\mathbf{p}=[x,y]^T$ as

$$\mathbf{d} = \mathbf{d}(\boldsymbol{\theta},\mathbf{p}) = \mathbf{H}[\mathbf{p}]\,\boldsymbol{\theta} = \begin{bmatrix} 1 & 0 & x & -y \\ 0 & 1 & y & x \end{bmatrix}\boldsymbol{\theta}, \qquad (1)$$

where $\boldsymbol{\theta}=[\theta_1,\theta_2,\theta_3,\theta_4]^T$ is a vector of model parameters and $\mathbf{H}[\mathbf{p}]$ is an observation matrix. Here, $\theta_1$ and $\theta_2$ are related to common translational motion, while $\theta_3$ and $\theta_4$ encode information about the 2-D rotation $\phi$ and scaling $s$ through $\theta_3 = s\cos\phi - 1$ and $\theta_4 = s\sin\phi$.
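The model in (1) is straightforward to make concrete. The sketch below builds the observation matrix $\mathbf{H}[\mathbf{p}]$, evaluates the displacement, and recovers the rotation and scale from $\theta_3$ and $\theta_4$; the function names are illustrative only.

```python
import numpy as np

def observation_matrix(p):
    """H[p] of the four-parameter similarity model in (1); p = (x, y)."""
    x, y = p
    return np.array([[1.0, 0.0, x, -y],
                     [0.0, 1.0, y,  x]])

def displacement(theta, p):
    """Displacement d(theta, p) = H[p] @ theta."""
    return observation_matrix(p) @ theta

def rotation_and_scale(theta):
    """Recover (phi, s) from theta3 = s*cos(phi) - 1 and theta4 = s*sin(phi)."""
    s = np.hypot(theta[2] + 1.0, theta[3])
    phi = np.arctan2(theta[3], theta[2] + 1.0)
    return phi, s
```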

The global motion describing the device motion is estimated using those motion features which pass an outlier analysis stage. Such analysis is necessary because feature displacement estimates can be erroneous due to image noise, or there may be several independent motions in the scene. It is assumed that the majority of motion features are associated with the global motion we want to estimate. To select those inlier features, we use a RANSAC-based scheme in which pairs of motion features are used to instantiate motion model hypotheses, which are then voted for by the other features.

A feature votes for a hypothesis if the displacement instantiated from the hypothesis is close to the estimated displacement. The covariance matrix 𝐂𝑖 provides information about the feature motion uncertainty in different directions, and the calculation of votes uses a 𝐂𝑖-based Mahalanobis distance measure. Once the inlier features have been selected, a weighted least squares approach is used to compute the estimate of the device motion, with the weighting based primarily on the measured uncertainties.
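A minimal sketch of this hypothesise-vote-refine procedure is given below, reusing the observation_matrix and displacement helpers from the previous sketch. The number of hypotheses and the voting threshold are illustrative assumptions rather than values from the paper.

```python
import numpy as np
from scipy.linalg import block_diag
# assumes observation_matrix() and displacement() from the earlier sketch

def estimate_global_motion(points, disps, covs, n_hyp=50, vote_thresh=5.99):
    """RANSAC-style estimation of the similarity parameters theta from N motion
    features (points p_i, displacements d_i, covariances C_i)."""
    rng = np.random.default_rng(0)
    n = len(points)
    best_inliers = []
    for _ in range(n_hyp):
        i, j = rng.choice(n, size=2, replace=False)
        # Two features give four equations: enough to instantiate four parameters.
        H = np.vstack([observation_matrix(points[i]), observation_matrix(points[j])])
        b = np.hstack([disps[i], disps[j]])
        try:
            theta = np.linalg.solve(H, b)
        except np.linalg.LinAlgError:
            continue                                    # degenerate pair
        inliers = []
        for k in range(n):
            r = disps[k] - displacement(theta, points[k])
            m2 = r @ np.linalg.solve(covs[k], r)        # Mahalanobis distance via C_k
            if m2 < vote_thresh:
                inliers.append(k)
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Weighted least squares over the inliers, weights from the measured uncertainties.
    A = np.vstack([observation_matrix(points[k]) for k in best_inliers])
    b = np.hstack([disps[k] for k in best_inliers])
    W = block_diag(*[np.linalg.inv(covs[k]) for k in best_inliers])
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
    return theta, best_inliers
```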

3.3. Object Tracking

The goal of object tracking is to estimate the motion of an object such as a finger which can then be used as a feature for recognising gestures [24]. With hand-held devices the camera also moves slightly when the user is operating the device. The problem is therefore formulated as a task of estimating two distinct motion components, the camera motion and the object motion. However, we are not so interested in segmenting the observed displacements into coherent regions in an image.

One way to track multiple object motions and cope with multimodal distributions is to use combinatorial data association methods [25]. In many tracking problems, more than one measurement is available at the same time step. Data association is the process of assigning each measurement to the appropriate object or motion, and it can be effective in the case of incoherent motion. Methods of this kind often perform data association and estimation separately, by first assigning the measurements and then estimating the state. In the following, we review our method, which is able to track multiple motions using a sparse set of motion features. One benefit compared to previous approaches is that no iterations are needed, making the algorithm computationally efficient.

In our model, we assume that the background and foreground motions are constant but subject to random perturbations. Translational models are considered sufficient approximations, and the state-space model of the camera ($j=1$) and object + camera ($j=2$) motions is

$$\mathbf{s}_j(k) = \mathbf{s}_j(k-1) + \boldsymbol{\varepsilon}_j(k), \qquad (2)$$

where the state $\mathbf{s}_j(k) = [u_j(k), v_j(k)]^T$ denotes the motion between frames $k-1$ and $k$; $u_j(k)$ is the motion in the $x$-direction and $v_j(k)$ is the motion in the $y$-direction. $\boldsymbol{\varepsilon}_j(k)$ is the process noise term, which is assumed to be zero-mean white Gaussian noise with covariance matrix $\mathbf{Q}_j = \sigma_j^2\mathbf{I}$. As the foreground motion contains both camera and object motion, it is reasonable to assume that $\sigma_2^2 > \sigma_1^2$.

Object tracking uses the motion features described in Section 3.1, illustrated in Figure 1(a), as its input. The observed displacements of those features, $\mathbf{d}_i(k)$, are modelled as

$$\mathbf{d}_i(k) = \lambda_i\,\mathbf{s}_1(k) + (1-\lambda_i)\,\mathbf{s}_2(k) + \boldsymbol{\eta}_i(k), \qquad (3)$$

where $\boldsymbol{\eta}_i(k)$ is the observation noise, assumed to obey a zero-mean Gaussian distribution with covariance $\mathbf{R}_i$, and $\lambda_i$ is a hidden binary assignment variable which indicates the motion component that generates the measurement.

To estimate the motions we use a technique where the Kalman filter [26] and the EM algorithm [27] are combined. The basic assumption is that the motion measurements 𝐝𝑖 are drawn from either of two distributions corresponding to the background or foreground. Having some estimate of distribution parameters, we can assign measurements to the appropriate motion. Note that the Kalman filter could be used to directly estimate 𝐬𝑗(𝑘) if the assignments 𝜆𝑖 were known. As these assignments are unknown, the predicted estimates of 𝐬𝑗(𝑘) and a priori probabilities of associating features to motion components are used to compute soft assignments 𝑤𝑖,𝑗 using a Bayesian formulation. An example of assignments is shown in Figures 1(b) and 1(c). The assignment step corresponds to the E step of the EM algorithm.

Soft assignments are then used in the computation of the Kalman gains, which are needed to obtain the filtered estimates of 𝐬𝑗(𝑘). The principle is that the lower the value of 𝑤𝑖,𝑗, the higher the observation noise 𝐑𝑖,𝑗 becomes. This weighting of the measurements corresponds to the M step of the EM algorithm.

To describe the algorithm in more detail, we denote the estimate of the state $\mathbf{s}_j(k)$ by $\hat{\mathbf{s}}_j^{+}(k)$ and the associated error covariance matrix by $\mathbf{P}_j^{+}(k)$. The steps used to obtain the state estimate at time instant $k+1$ are as follows.

(1) Predict the estimate $\hat{\mathbf{s}}_j^{-}(k+1)$ by applying the dynamics (2),
$$\hat{\mathbf{s}}_j^{-}(k+1) = \hat{\mathbf{s}}_j^{+}(k), \qquad (4)$$
and predict the error covariance $\mathbf{P}_j^{-}(k+1)$,
$$\mathbf{P}_j^{-}(k+1) = \mathbf{P}_j^{+}(k) + \mathbf{Q}_j. \qquad (5)$$

(2) Compute the weights $w_{i,j}$ for each motion feature $i$, characterised by $(\mathbf{d}_i(k+1), \mathbf{C}_i(k+1))$, using a Bayesian formulation. Let $\pi_j(k) > 0$ be the a priori probability of associating a feature with the motion $j$ ($\sum_j \pi_j(k) = 1$). The weight $w_{i,j}$ is the a posteriori probability ($\sum_j w_{i,j} = 1$) given by
$$w_{i,j} \propto p\!\left(\mathbf{d}_i \mid \hat{\mathbf{s}}_j^{-}(k+1),\, \mathbf{P}_j^{-}(k+1) + \mathbf{C}_i(k+1)\right)\pi_j(k), \qquad (6)$$
where the likelihood function $p(\cdot)$ is a Gaussian pdf with mean $\hat{\mathbf{s}}_j^{-}(k+1)$ and covariance $\mathbf{P}_j^{-}(k+1) + \mathbf{C}_i(k+1)$.

(3) Use the weights $w_{i,j}$ to set the observation noise covariance matrices in (3) according to
$$\mathbf{R}_{i,j} = \mathbf{C}_i\left(w_{i,j} + \epsilon\right)^{-1}, \qquad (7)$$
where $\epsilon$ is a small positive constant, and compute the Kalman gain
$$\mathbf{K}_j(k+1) = \mathbf{P}_j^{-}(k+1)\mathbf{H}^T\left(\mathbf{H}\mathbf{P}_j^{-}(k+1)\mathbf{H}^T + \mathbf{R}_j\right)^{-1}, \qquad (8)$$
where $\mathbf{R}_j$ is a block diagonal matrix composed of the $\mathbf{R}_{i,j}$ and $\mathbf{H} = [\mathbf{I}_2 \cdots \mathbf{I}_2]^T$ is the corresponding $2N \times 2$ observation matrix. Note that if $w_{i,j}$ has a small value, the corresponding measurement is effectively discarded by this formulation.

(4) Compute the filtered estimate of the state,
$$\hat{\mathbf{s}}_j^{+}(k+1) = \hat{\mathbf{s}}_j^{-}(k+1) + \mathbf{K}_j(k+1)\left(\mathbf{z}(k+1) - \mathbf{H}\hat{\mathbf{s}}_j^{-}(k+1)\right), \qquad (9)$$
and the associated error covariance matrix,
$$\mathbf{P}_j^{+}(k+1) = \left(\mathbf{I} - \mathbf{K}_j(k+1)\mathbf{H}\right)\mathbf{P}_j^{-}(k+1), \qquad (10)$$
where $\mathbf{z}(k+1) = [\mathbf{d}_1(k+1)^T, \mathbf{d}_2(k+1)^T, \ldots, \mathbf{d}_N(k+1)^T]^T$.

(5) Update the a priori probabilities for the assignments with a recursive filter,
$$\pi_j(k+1) = a\,\pi_j(k) + (1-a)\,\frac{1}{N}\sum_{i=1}^{N} w_{i,j}, \qquad (11)$$
where $a < 1$ is a constant learning rate.
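One prediction/assignment/update cycle of this tracker, covering steps (1)-(5) and equations (4)-(11), is sketched below for the two motion components. It is a minimal sketch under the stated model only: the data layout, zero-based component indexing, and numerical guards are assumptions.

```python
import numpy as np

def track_step(s_prev, P_prev, Q, pi_prev, disps, covs, a=0.95, eps=1e-3):
    """One time step of the two-component tracker. s_prev, P_prev, Q: length-2 lists of
    2-vectors / 2x2 matrices for camera (index 0) and object+camera (index 1) motion.
    disps, covs: the N feature displacements d_i and uncertainties C_i."""
    n = len(disps)
    # (1) predict under the random-walk dynamics (2): eqs. (4)-(5)
    s_pred = [s_prev[j].copy() for j in range(2)]
    P_pred = [P_prev[j] + Q[j] for j in range(2)]
    # (2) soft assignments w_{i,j}, eq. (6) -- the E step
    w = np.zeros((n, 2))
    for i, (d, C) in enumerate(zip(disps, covs)):
        for j in range(2):
            S = P_pred[j] + C
            r = d - s_pred[j]
            lik = np.exp(-0.5 * r @ np.linalg.solve(S, r)) / (2 * np.pi * np.sqrt(np.linalg.det(S)))
            w[i, j] = lik * pi_prev[j]
        w[i] = w[i] / max(w[i].sum(), 1e-12)
    s_new, P_new = [], []
    H = np.vstack([np.eye(2)] * n)                       # 2N x 2 observation matrix
    z = np.hstack(disps)                                 # stacked measurements
    for j in range(2):
        # (3) inflate observation noise by 1/w, eq. (7), and Kalman gain, eq. (8) -- M step
        Rj = np.zeros((2 * n, 2 * n))
        for i in range(n):
            Rj[2 * i:2 * i + 2, 2 * i:2 * i + 2] = covs[i] / (w[i, j] + eps)
        S = H @ P_pred[j] @ H.T + Rj
        K = P_pred[j] @ H.T @ np.linalg.inv(S)
        # (4) filtered state and covariance, eqs. (9)-(10)
        s_new.append(s_pred[j] + K @ (z - H @ s_pred[j]))
        P_new.append((np.eye(2) - K @ H) @ P_pred[j])
    # (5) recursive update of the assignment priors, eq. (11)
    pi_new = a * np.array(pi_prev) + (1 - a) * w.mean(axis=0)
    return s_new, P_new, pi_new, w
```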

Figure 2 shows some frames of sequence 1 with motion features observed during tracking. The weighting for each feature is illustrated using colours: red and blue depict the background motion and the finger motion, respectively. It can be seen that most of the features are correctly associated. In Figure 2(b) all features are associated with the background because the finger motion is negligible. In our experimental tracker, 100 motion features are used, the image feature size is 5 by 5 pixels, and the maximum displacement is 12 pixels. We assume that the majority of features are extracted from the background. Therefore, the initial probabilities 𝜋1 and 𝜋2 (see (6)) for the background and the finger motion were set to 0.7 and 0.3, respectively. The learning rate 𝑎 in (11) was set to 0.95, which ensures that the proportions of the mixture components change only gradually.

4. Motion Sequence Recognition

In order to perform classification we must select an appropriate method of modelling the motion sequences produced by the feature extraction methods described in Section 3. The most common method currently used to model sequences of data is the HMM [28]. An HMM is a statistical model capable of representing temporal relations in these sequences. The data is characterised as a parametric stochastic process, and the parameters of this process are automatically estimated. The data sequence is factorised over time by a series of hidden states and emissions from these states. The transition between states is probabilistic and depends only on the previous state. In our case [22, 24] the continuous emission probability from each state is modelled using Gaussian Mixture Models (GMMs) [28]. HMM training can be carried out using the Expectation-Maximisation (EM) algorithm [27] and sequence decoding using the Viterbi algorithm [29].
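As an illustration of how such class-conditional HMMs with GMM emissions might be trained and used for classification, the sketch below uses the hmmlearn library; this is an assumption, since the paper does not name its implementation. The data layout and function names are illustrative, and the hyperparameter defaults echo the values reported later in Section 5.1.1.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # assumed library choice, not the authors' code

def train_models(train_data, n_states=11, n_mix=10):
    """train_data: dict mapping class label -> list of (T_i, 2) motion sequences.
    Trains one HMM with GMM emissions per gesture class using EM."""
    models = {}
    for label, seqs in train_data.items():
        X = np.vstack(seqs)                  # concatenated observations
        lengths = [len(s) for s in seqs]     # per-sequence lengths for hmmlearn
        m = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, seq):
    """Score one (T, 2) sequence against every class model; return labels and
    log-likelihoods ranked best-first (the gap between the top two is used later
    as a confidence measure)."""
    scored = sorted(((m.score(seq), lbl) for lbl, m in models.items()), reverse=True)
    return [lbl for _, lbl in scored], [s for s, _ in scored]
```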

4.1. Maximum A Posteriori Adaptation

Due to the difficulty of tracking and the noisy nature of the measurements, in some applications it may be difficult to create general models for each class that will perform well for many different users. In order to improve the models' performance we propose using unsupervised Maximum A Posteriori (MAP) adaptation to tailor the general models to a specific user. We address the problem of controlling unsupervised learning by proposing a method of selecting adaptation data using a combination of entropy and likelihood ratio. We demonstrate how this approach can significantly improve performance in the task of finger gesture recognition.

When using statistical models for pattern recognition we must train the models on a training set. If this training set is labelled, then the Maximum Likelihood (ML) principle is used to update the model parameters during training: the likelihood of the training set $\mathcal{X}_{\text{train}}$ is maximised with respect to the parameters of the model $\theta$, that is, we select the parameters $\theta_{\text{ML}}$ such that

$$\theta_{\text{ML}} = \arg\max_{\theta}\; p\!\left(\mathcal{X}_{\text{train}} \mid \theta\right). \qquad (12)$$

The EM algorithm can be used to estimate (12). If, however, we are presented with unlabelled data to train the model, then there is the possibility that some of the training data will not correspond to the class we wish to recognise. Therefore we need some way of constraining the estimation of the model parameters to limit the effects of incorrect data. In MAP adaptation [30] a prior distribution over the parameters $\theta$, $P(\theta)$, is used to constrain the updated parameters. The formulation for MAP estimation is similar to the formulation for ML estimation given in (12); however, in MAP estimation it is assumed that there is a prior distribution on the parameters to be estimated. The estimation of the parameters $\theta$ according to the MAP principle is given by

$$\theta_{\text{MAP}} = \arg\max_{\theta}\; P\!\left(\theta \mid \mathcal{X}_{\text{adapt}}\right) = \arg\max_{\theta}\; p\!\left(\mathcal{X}_{\text{adapt}} \mid \theta\right)P(\theta), \qquad (13)$$

where $\mathcal{X}_{\text{adapt}}$ is the data selected for adaptation. Again the EM algorithm can be used for MAP estimation. The next section looks at the use of MAP learning for adapting the parameters of HMMs.

4.2. MAP Adaptation for Hidden Markov Models

In a GMM the set of parameters $\theta$ is given by

$$\theta = \{W, \mu, \Sigma\}, \qquad (14)$$

where $W = \{\omega_m\}$ is the set of scalar mixture weights, $\mu = \{\mu_m\}$ is the set of mean vectors, and $\Sigma = \{\Sigma_m\}$ is the set of covariance matrices of the Gaussian mixture. The HMM is trained by updating the parameters of each Gaussian and also the transitions between the states. In MAP adaptation the estimation of the model parameters for each state is constrained by a prior distribution over these parameters, $\theta_{\text{prior}} = \{W_{\text{prior}}, \mu_{\text{prior}}, \Sigma_{\text{prior}}\}$. The updated parameters of a particular GMM mixture $m$, $\theta^{\text{MAP}}_m$, can be estimated according to the following update equations:

$$w^{\text{MAP}}_m = \alpha\, w^{\text{prior}}_m + (1-\alpha)\, w^{\text{ML}}_m,$$
$$\mu^{\text{MAP}}_m = \alpha\, \mu^{\text{prior}}_m + (1-\alpha)\, \mu^{\text{ML}}_m,$$
$$\Sigma^{\text{MAP}}_m = \alpha\!\left[\Sigma^{\text{prior}}_m + \left(\mu^{\text{prior}}_m - \mu^{\text{MAP}}_m\right)\!\left(\mu^{\text{prior}}_m - \mu^{\text{MAP}}_m\right)^T\right] + (1-\alpha)\!\left[\Sigma^{\text{ML}}_m + \left(\mu^{\text{ML}}_m - \mu^{\text{MAP}}_m\right)\!\left(\mu^{\text{ML}}_m - \mu^{\text{MAP}}_m\right)^T\right], \qquad (15)$$

where $\alpha$ is a weighting factor on the contributions of the prior parameters, $\theta_{\text{prior}}$, and the parameters currently estimated using ML, $\theta_{\text{ML}}$.
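The update equations in (15) translate directly into code. The sketch below is a minimal transcription for a single mixture component, assuming a simple dictionary representation of the parameters; it is not the authors' implementation.

```python
import numpy as np

def map_update(prior, ml, alpha=0.8):
    """MAP update of one GMM mixture component following (15). 'prior' and 'ml' are
    dicts with keys 'w' (scalar weight), 'mu' (mean vector), 'Sigma' (covariance)."""
    w_map = alpha * prior["w"] + (1 - alpha) * ml["w"]
    mu_map = alpha * prior["mu"] + (1 - alpha) * ml["mu"]
    dp = (prior["mu"] - mu_map)[:, None]    # prior mean minus MAP mean, as a column
    dm = (ml["mu"] - mu_map)[:, None]       # ML mean minus MAP mean, as a column
    Sigma_map = (alpha * (prior["Sigma"] + dp @ dp.T)
                 + (1 - alpha) * (ml["Sigma"] + dm @ dm.T))
    return {"w": w_map, "mu": mu_map, "Sigma": Sigma_map}
```

Applied to every mixture of every state, this yields the adapted emission densities; the weighting factor alpha plays the same role as in Section 5.2.1.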

4.3. Filtering Recognition Results

A key point in unsupervised learning is controlling either the learning process or the data used for adaptation. One way to control unsupervised adaptation is to filter out any incorrectly classified sequences before they are used for adapting the model. Here we propose the use of entropy and the log-likelihood ratio as criteria for selecting sequences for adaptation. This is based on our previous work [22], where we demonstrated that a combination of log-likelihood ratio and entropy can be used as a measure of the confidence in the recognition result for a sequence.

To adapt the general model to a user-specific model, we first classify the sequences produced by the user using the general model. We then use the entropy and log-likelihood ratio as the criteria for filtering incorrect sequences from these results. Finally, the sequences that pass the selection criteria are used to update the model for that class using MAP adaptation.

4.3.1. Likelihood Ratio

If we have a set of classes $\{Cl_1, Cl_2, \ldots, Cl_N\}$ and a sequence of data $\mathcal{X}$, then the class with the highest likelihood given $\mathcal{X}$ is denoted by $Cl_a$ and the class with the second highest likelihood given $\mathcal{X}$ is denoted by $Cl_b$. The log-likelihood ratio $\delta$ for a particular sequence is given by

$$\delta = \log p\!\left(Cl_a \mid \mathcal{X}\right) - \log p\!\left(Cl_b \mid \mathcal{X}\right), \qquad (16)$$

where $p(Cl \mid \mathcal{X})$ is the likelihood of the class $Cl$ given the data sequence $\mathcal{X}$. In our experiments any sequence that produces a value of $\delta$ below a certain threshold is not used for adaptation.
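Computed from the per-class log-likelihoods returned by the HMMs, $\delta$ is a one-line calculation; the sketch below, with hypothetical numbers in the usage comment, is included only to make the filtering step concrete.

```python
def log_likelihood_ratio(log_liks):
    """delta of (16): gap between the best and second-best class log-likelihoods.
    log_liks maps each class label to log p(Cl | X) for one sequence."""
    top = sorted(log_liks.values(), reverse=True)
    return top[0] - top[1]

# Example with made-up scores: delta = -410.2 - (-455.7) = 45.5
delta = log_likelihood_ratio({"up": -410.2, "down": -455.7, "circle": -499.0})
```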

4.3.2. Entropy

Information entropy is a measure of the randomness of the probability distribution of a random variable $\mathcal{Y}$ and is given by [31]

$$\operatorname{Ent}(\mathcal{Y}) = -L\sum_{n=1}^{N} P\!\left(y_n\right)\log_2 P\!\left(y_n\right), \qquad (17)$$

where $P(y_n)$ is the probability of $y_n$, $N$ is the number of samples, and $L$ is a constant. In our case we take the first derivative of the motion trajectory $\mathcal{X}$, which gives us the velocity of the motion. This continuous velocity sequence is quantised into a histogram, and we calculate the entropy of the entries in this histogram.
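One possible realisation of this velocity-entropy measure is sketched below. Whether the velocity magnitude or its individual components are histogrammed, the number of bins, and the value of the constant L are assumptions not specified above.

```python
import numpy as np

def velocity_entropy(traj, n_bins=16, L=1.0):
    """Entropy of the velocity histogram of a motion trajectory, following (17).
    traj is a (T, 2) array of positions; bin count and L are illustrative choices."""
    vel = np.diff(traj, axis=0)               # first derivative: velocity per frame
    speed = np.linalg.norm(vel, axis=1)        # quantise the velocity magnitude
    hist, _ = np.histogram(speed, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]                               # ignore empty bins (0 * log 0 = 0)
    return -L * np.sum(p * np.log2(p))
```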

So a sequence with a higher entropy will have a more random velocity, while a sequence with lower entropy would have a more constant velocity. Our hypothesis is that well-formed signs will have a more constant velocity, and so a lower entropy, as opposed to more random or poorly formed signs. In this paper we demonstrate that sequences with higher entropy are more likely to be incorrectly classified and that by setting a threshold on the entropy we can filter these potentially incorrect sequences from the data used for adaptation.

5. Experiments

5.1. Device Motion Recognition

The system we propose here uses HMMs, described in Section 4, to model the device motion features described in Section 3.2. These models are then used to classify the device motion sequences that are input from the user. In order to ensure a minimum number of incorrect classifications this initial result is then filtered in order to reject any possibly incorrect commands before they can be executed. The methods used for filtering the results, likelihood ratio and entropy, are described in Section 4.3. An overview of the system is shown in Figure 3.

We use a two-level filtering of the result. The first level of filtering is based on the likelihood ratio between the most likely command and the second most likely command. This ratio can be seen as a confidence measure of the classification result. If this ratio is below a certain predefined threshold, then the confidence in the result is low and the sequence is rejected. A second level of filtering is employed to reject unintentional or accidental sequences, such as when the input system is activated without the user's knowledge or the user loses control of the phone for some reason. It is important that these unintended commands are not recognised and executed as real commands.

In our experiments we use two threshold values for 𝛿. The first, 𝛿hard, is a hard decision: any sequence with a log-likelihood ratio below this threshold is rejected. The second, 𝛿soft, is a soft decision: for any sequence where 𝛿hard<𝛿<𝛿soft, the entropy of the sequence is used as an additional indicator of the quality of the classification, and any such sequence with an entropy higher than a predetermined threshold is rejected as potentially incorrect. This entropy threshold Entth is set on the validation set as described in the next section.
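The two-level decision just described can be summarised in a few lines. The sketch below uses the threshold values reported for the validation set in Section 5.1.1 (𝛿hard = 5, 𝛿soft = 30, Entth = 0.72) purely for illustration; the function name is an assumption.

```python
def accept_command(delta, entropy, delta_hard=5.0, delta_soft=30.0, ent_th=0.72):
    """Two-level filtering of a classified sequence: hard likelihood-ratio rejection,
    then the velocity-entropy criterion inside the 'soft' band."""
    if delta < delta_hard:
        return False                    # low confidence: reject outright
    if delta < delta_soft:
        return entropy <= ent_th        # soft band: rely on the entropy criterion
    return True                         # high confidence: accept
```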

Entropy is used for filtering the final classification result. In our experiments we have included a number of sequences where the user has either deliberately made a bad sign or has just moved the phone at random. These sequences are used to test the case where the system is unintentionally turned on by the user or the user loses control of the phone whilst making a sign. The mean entropy of the “bad” sequences in the validation set was found to be significantly higher than the mean of the “good” sequences. This initial result indicates the potential of using the velocity entropy as a measure of the quality of the sign.

5.1.1. Experimental Procedure

In order to validate the technique described here a hypothetical control system of mobile phone functions was devised. In this system a series of control commands was proposed. These commands are composed of seven simple elements based on seven different motions. These seven elements are shown in Table 1. Using these basic elements alone the system would be limited to seven different commands, so in order to provide a greater number of commands more complex commands are constructed from these motion elements. These complex commands are used for our recognition experiments and are shown in Table 2. Although we have used 11 complex commands this could easily be extended to a larger number of commands.

The experimental data was collected from 35 subjects. Each subject was asked to draw each of the commands in Table 2 five times using a standard camera-equipped mobile phone, a Nokia N90. The majority of subjects had no previous experience in performing this task. There was considerable variability in the sequences both between subjects and between different attempts by the same subject; this can clearly be seen in Figure 4.

The subjects were randomly divided into training, validation, and test sets. There were 20 subjects in the training set, 5 subjects in the validation set, and 10 subjects in the test set. Additionally 10 “bad” sequences were added to the validation set and 20 to the test set, giving a total of 1100, 285, and 570 sequences in the training, validation, and testing sets, respectively.

In addition to these subjects, 30 random sequences were collected by moving the camera in a random way. These random or “bad” sequences were included in the data to test the system's performance with input caused by accidental activation of the camera or the user losing control of the phone whilst making a sign. The mean entropy of the “bad” sequences in the validation set is 0.88 with a standard deviation of 0.11, while the mean and standard deviation of the “good” sequences are 0.58 and 0.13, respectively.

It must be emphasised that there was no overlap of subjects between these three sets. The training set was used to train the parameters of the HMMs. The validation set was used to set the hyperparameters of the individual models, such as the number of Gaussians in the GMMs that model the state distributions and the number of states in the HMMs. The validation set was also used to set the hard threshold, 𝛿hard, the soft threshold, 𝛿soft, and the entropy threshold Entth. The values of these parameters were: number of states = 11, number of Gaussians = 10, 𝛿hard = 5, 𝛿soft = 30, and Entth = 0.72.

5.1.2. Results and Discussion

The results of running the system on the 570 test sequences are shown in Table 3. It can be seen from these results that only 5 sequences are incorrectly classified, while 27 sequences that would have produced an incorrect result were rejected by the system. These rejected sequences included all of the 20 deliberately bad sequences. It is particularly interesting that 13 of these bad sequences were rejected using the entropy criteria. This result confirms that the higher entropy of the bad sequences observed in the validation set can be generalised to the bad sequences in the test set.

5.2. Object Motion Recognition

In this section we propose a system for control of a mobile device by again recognising motion sequences. In this instance the sequences are generated by tracking the user's finger motion in front of the mobile device camera, as described in Section 3.3. In this system a series of simple control commands was proposed; these eight commands are shown in Figure 5. These command gestures are formed by the user drawing the sign in the air with an extended index finger in front of the mobile phone camera. There are therefore two challenges to overcome: first, the tracking of the finger, and second, the classification of the sequences produced by this tracking.

In order to recognise the motion trajectories produced by finger tracking we are again using HMMs. However, due to the diversity in how people make the gestures it may be difficult to create general models for each class that will perform well for many different users. In order to improve the model performance we propose using unsupervised MAP adaptation to tailor the general models for a specific user. We address the problem of controlling unsupervised learning by proposing a method of selecting adaptation data using a combination of entropy and likelihood ratio. We demonstrate how this approach can significantly improve the performance in the task of finger gesture recognition.

5.2.1. Experimental Procedure

The experimental data was collected from 10 subjects. Each subject was asked to draw each of the commands in Figure 5 four times using a standard camera-equipped mobile phone, a Nokia N73. This data formed the initial training and validation sets and also the baseline test set used to measure the models' performance before any adaptation. The collected data was divided into a training set of four subjects (128 sequences), a validation set of two subjects (64 sequences), and a test set of four subjects (128 sequences). To form the training and test sets for adaptation, two subjects from the test set were asked to draw each sign an additional 11 times. These sequences form an adaptation set and a test set for each of the subjects: seven sequences from each subject are used to create a model adapted to that specific subject, and four sequences from each subject are used to test the performance of the adapted models. During the MAP adaptation experiments the initial prior model is the baseline model; after this, the prior model, 𝜃prior, is the ML-estimated model, 𝜃ML, from the previous iteration. The adaptation weighting, 𝛼, used in the testing was 0.8; this was found to be a reasonable value in previous work and was also verified using the validation set.

5.2.2. Results and Discussion

We first ran the baseline experiments using the training set of four subjects and the test set of four different subjects. This produced a sequence recognition rate of 82% on the test set. It can be seen from the confusion matrix shown in Table 4 that the errors show a distinct pattern of confusion between sign S1 and sign S8 and also between sign S2 and sign S7. These signs are quite similar to each other with the only difference being a horizontal or vertical separation of the strokes in signs S8 and S7. This may be due to the variability between different subjects when making the signs, so if a user in the training set does a particularly narrow S8 or S7 this may cause the model to incorrectly classify S1 and S2 in the test set. This problem can clearly be seen in Figure 6.

In the next set of experiments we tailor the general model to an individual user using unsupervised MAP adaptation. The results of these experiments can be seen in Table 6; they show that adapting with no constraints on the data used for adaptation can produce an increased recognition rate. If, however, we filter the sequences used for adaptation by applying the likelihood ratio and entropy criteria, we can significantly improve this result. This improvement can also be seen in the confusion matrix shown in Table 5.

6. Conclusions

We have presented here two camera-based user interaction techniques for mobile devices that combine motion features and statistical sequence modelling to classify the hand movements of a user: the first by recognising the motion of the device held in the user's hand and the second by recognising the motion of the user's finger. In order to improve the results produced by these systems we have introduced two methods of filtering the result of this classification, likelihood ratio and entropy. In the first application these criteria were used to filter incorrect or random sequences from the final result, while in the second the criteria were used to select data for unsupervised adaptation. It is clear from the results shown in Section 5.1.2 that our proposed method of using entropy as an indicator of badly formed sequences is able to filter out all such sequences from the final result. Additionally, the results in Section 5.2.2 demonstrate that the same criteria can be used to improve the performance of the models using MAP adaptation.

We conclude that the computer vision-based motion estimation and recognition techniques presented in this paper have clear potential to become practical means for interacting with mobile devices. They can also augment the information provided by other sensors, such as accelerometers and touch screens, in a complementary manner. In fact, the cameras in future mobile devices may, for most of the time, be used as sensors for self-intuitive user interfaces rather than for digital photography.