Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA
Abstract
We propose two algorithms that decompose the joint likelihood of observing multidimensional neural input data into
marginal likelihoods. The first algorithm, boosted mixtures of hidden Markov chains (BMs-HMM), applies techniques from
boosting to create implicit hierarchic dependencies between these marginal subspaces. The second algorithm, linked mixtures
of hidden Markov chains (LMs-HMM), uses a graphical modeling framework to explicitly create the hierarchic dependencies
between these marginal subspaces. Our results show that these algorithms are very simple to train and computationally efficient,
while also reducing the input dimensionality for brain-machine interfaces (BMIs).
1. Introduction
The field of brain-machine interfaces (BMIs) is
devoted to accomplishing the goal of one day restoring paralyzed patients mobility by
directly connecting their brain to a machine. A majority of developments in
this field have centered around linear and nonlinear models that map neural
firing patterns of an animal to a robotic prosthetic [1–3]. Usually, in this type of experiment, a primate or
rat engages in a movement task as neural data is recorded from their cortex (as
well as the kinematic information). Once a prediction model has been trained
with the trajectory/neural data, only neural data is used to control a robotic
arm in real-time [2, 3].
As a result of the above-mentioned experiments, our
group established that when multiple feed-forward prediction models map
discrete portions of the neural data to respective portions of the continuous
trajectory, the overall trajectory reconstruction is improved [4, 5]. The classifier that is
responsible for switching between these different feed-forward models is the
main focus of this paper (see Figure 1 for system overview).
The previous switching classifier was an ensemble
method that incorporated multiple independent experts. These experts were
single neural-channel HMM chains that formed an independently coupled hidden
Markov model (IC-HMM) [5]. Since it is unlikely that the independence
assumption applies to all of the neurons, finding dependencies between some of
the neurons, could prove beneficial for final kinematic reconstruction or other
biologically-inspired modeling. Additionally, one limitation of this algorithm
is that as more neurons are sampled from the brain (as is the current trend for
BMIs), the number of independent models could grow to an unmanageable level
[6]. Although the
ICHMM is still computationally more efficient than using a model that has full
dependencies, like the coupled hidden Markov model (CHMM), it is still desirable
that the input dimensionality be reduced to avoid this pitfall.
In this paper, we first take multidimensional neural
input data and decompose the joint likelihood of observing the neural input
into marginal likelihoods using boosted mixtures of hidden Markov chains
(BMs-HMM). The algorithm applies techniques from boosting to create hierarchical
dependencies between these marginal subspaces in an unsupervised manner.
Additionally, ideas from mixture of experts (MOEs) are incorporated so that the
local information is weighted and integrated into an ensemble decision. Our
results show that this algorithm is very simple to train and computationally
efficient, while also providing the ability to reduce the input dimensionality
for BMIs.
We then move from this method, which only implicitly
exploits the dependencies between the neurons, to a more formal graphical model
structure which explicitly models dependencies between neurons. We call this
model linked mixture of hidden Markov chains (LM-HMM). Borrowing ideas from the
BM-HMM, we show how this semisupervised model explicitly exploits the
dependencies between neurons (if they exist).
The development and evaluation of the BM-HMM and
LM-HMM serve as the core of this paper and directs the following organization.
First, we discuss our motivation and technique for the BM-HMM. Second, we
develop the theory for the LM-HMM and explain how it relates to the BM-HMM.
Third, we compare results of the new models to our previous models, as well as
present some possible interpretation of the results. Finally, we suggest areas
of future work.
2. Modeling Neural Data for Brain-Machine Interfaces
2.1. Experimental Animal Data
Neural action
potentials from two different animal experiments are used to train and test the
models in this paper. In the first experiment, neural data was recorded from an
owl monkey's cortex as it performed a food reaching task. Specifically,
multiple implanted microwire arrays recorded this data from 104 neural cells in
the following cortical areas: posterior parietal cortex (PP), left and right
primary motor cortex (M1), and dorsal premotor cortex (PMd). Concurrently with
the neural data recording, the 3D hand position was recorded as the monkey made
three repeated movements: rest to food, food to mouth, and mouth to rest
[1, 5].
In the
second experiment, a male Sprague-Dawley rat performed
a go, no-go lever pressing task as neural data was recorded. This dataset
contains 16 neural cells that were collected with microwire arrays implanted in
the forelimb region of the left primary cortex (M1) [7]. Subsequently, the data was
spike-detected and spike-sorted using thresholds and template matching
[7]. The lever
presses were recorded simultaneously with the neural activity. The beginning of
each trial was signaled to the rat by an LED indicating when to press the lever
in order to receive a reward [7].
The data sets resulting from the above experiments
were segmented into movement and rest classes for modeling. For the monkey
data, all angular velocities greater than 4 mm/s are labeled as part of the
movement class. Included in this movement class are the times when the monkey
momentarily holds its arm during a reach for food or its mouth [5, 8]. For the rat experiments,
since there is no information about the rat moving around the cage (or
grooming), only the lever press is included as part of the movement class
[7].
For both experiments, the neural data is binned into
100-millisecond counts, which is consistent with the neural science community
[8, 9]. Consequently, the movement
data is down-sampled to match the 10 Hz neural bin counts. The time recording
for the monkey experiment corresponds to a dataset of 23000
104 time bins. For the rat experiment, the
dataset consists of 13000
16 time bins [1, 7].
2.2. Motivation
Neuroscientists often treat the multiple channels of
neural data acquired from BMI experiments as multivariate observations from a
single process [10].
This perspective requires fully coupled statistics across all of the channels
at all times irrespective of partial independence among the multiple processes.
Additionally, it has been established that neurons exhibit nonstationary
behavior [11]. Our
group's own work has shown that modeling this data as a single multivariate
process is not the most appropriate [4]. It is also inappropriate to model all of the
channels independently since there is a possibility of dependencies between
some of the neurons.
From a machine learning perspective, modeling of the
observed and hidden neural information can be accomplished with observable and
hidden random processes that are interacting with each other in some unknown
way. To achieve this, we make the assumption that each neuron's output is an
observable random process that is affected by hidden information. Since the
experiment does not provide detailed biological information about the
interactions between the sampled neurons, we use hidden variables to model
these hidden interactions [9, 12]. We further assume that this compositional
representation of the interacting processes occurs through space and time
(i.e., between neurons at different times).
We also differ from other work in the BMI field by not
taking the traditional regression approach of mapping the neural data directly
to the patient hand kinematics with a conventional linear/nonlinear model
[10]. Since the final
BMI paradigm will not include desired kinematic information from paraplegics,
we use generative models to explain the observable neural data. From these
generative models, we divide the input space into regions that we call “motion
primitives.” These structures have been loosely touched upon in other work
[13, 14]. The basic idea is that
kinematic information or neural data can be decomposed into these “motion
primitives” similar to phonemes in speech processing. In speech
processing, graphical models (specifically HMMs) are the leading technology
because they are able to capture very well the piecewise nonstationarity of
speech [15, 16]. Since speech production is
ultimately a motor function, graphical models can potentially also be useful
for motor BMIs (also nonstationary) [11, 15]. By using smaller simpler components, more
complicated arm kinematics can be constructed through the combination of these
simple structures.
The ultimate goal is to decipher these underlying
structures so that the model may one day be decoupled from the desired
kinematics (for unsupervised modeling). It is our current goal to determine
these underlying structures through a supervised mode in order to later exploit
them in an unsupervised mode.
3. Boosted Mixtures of HMM Chains
3.1. Related Work
With IC-HMMs, each neural channel is modeled with a
hidden and observable random process (i.e., an HMM chain) [5]. Each neural channel HMM in
the IC-HMM is computed independently. To calculate the log likelihood for the
IC-HMM (Figure 2) we use
(1) where
is the set of all the parameters for the model
of a particular neural channel (where
neural channels). Given that,
is a sequence of discrete binned neural
firings and
represents the sequence of discrete hidden
states or variables (of length
), this joint probability can be further
decomposed into
(2)
Figure 2: IC-HMM graphical model.
From Figure 2 and (2), we see that the observable
variables are IID and the hidden state sequence is Markovian (which is the
classical HMM formulation). Essentially, the IC-HMM decomposes the multivariate
input into the independent marginals so that only the likelihoods of the
individual neurons are needed for the final calculation [5].
In order to move beyond the IC-HMM and exploit the
complimentary information provided by the independent HMM chains, we first look
to boosting. Boosting is a technique that creates different training
distributions from an initial input distribution so that a set of weak
classifiers is generated [17, 18]. The generated classifiers then form an ensemble vote
for the current data example. It has been shown that these hierarchic
combinations of classifiers achieve lower error rates than the individual base
classifiers [17, 18].
Adaboost is the most widely used algorithm to evolve
from boosting methods [19]. This algorithm sequentially generates weak
classifiers based on weighted training examples. Essentially, the initial
distribution of training examples is resampled each round (based on the
distribution of the weights
) in order to train the next classifier up to
rounds [19]. The
training examples that fail to be classified on a particular round receive an
increased weighting so that the subsequent classifiers are more likely to be
trained on these hard examples. With the initial values of the weights being
,
for
samples, the update for each weight is
(3)where
are the external weights for each
th expert that has been trained for the
th round which act very similar to priors for
the respective experts. Additionally,
is a normalization factor in order to make the
weights
a distribution. Concurrent to the weights
for training examples, each
is updated so that it can be used in a final
ensemble vote (which is a linear combination of the
weights and the hypothesis of each expert):
(4)where
(5)
and the final ensemble output becomes
(6)
The success of boosting has been attributed to the
distribution of the “margins” of the training examples [17]. For further details on
Adaboost or boosting with respect to the margin and relationships to support
vector machines, see Schapier et al.
With respect to improving IC-HMM, Adaboost offers a
promising way to implicitly find dependencies among the channels through the
process of boosting classifiers during training. In our case, Adaboost is
applied differently to the multidimensional neural data as explained in the
next section.
3.2. Modeling Framework
BMI data imposes specific constraints that require
modifications to the standard procedures. First, our data is high dimensional
(104 channels) and the importance of each channel to the mapping can be very
different and is unknown. Second, the classes have markedly different prior
probabilities. Third, the computational practicality of using thousands of
experts to generate a decision is infeasible for hundreds of neural channels.
Our approach uses Adaboost as a way to select multichannel experts that
contribute the most information in the training set. Although the algorithm
starts out with parallel training for the independent experts, gradually a
winning expert is chosen for each hierarchic level to combine later into the
ensemble. Since each level is formed in an unsupervised way, this algorithm is
a hybrid between supervised and unsupervised learning, following a process
similar to the mixture of experts framework [20].
Our first major departure from Adaboost results from
how the ensemble is generated. Instead of forming one expert at a time, the
independent HMM chains are trained in parallel using the Baum-Welch formulation
[21]. As explained,
this splits the joint likelihood into marginals so that independent processes
are working in simpler subspaces [5]. A ranking is then performed and a winner is chosen
based on the classification performance for the current distribution of input
examples. Specifically, the winner that minimizes the error with respect to the
distribution of samples is chosen. To find the minimal error, we differ from
Adaboost and use an Euclidean distance for the classes
to avoid biasing class
assignments (since classes may not have equal priors)
[5]. Next,
the remaining
experts are trained within their respective subspace but relative to the errors
of the previous winner. Finally, the
are used to select the next distribution of
examples for the remaining experts. Similar to Adaboost, the remaining experts
are trained on the hard examples from different subspaces. In turn, a
hierarchic structure is formed as the winning experts affect the training on
the local subspaces for the subsequent experts. During this process, they are
implicitly modeling the dependencies among the channels.
As explained earlier, Adaboost uses
's as external weights to the classifiers as
opposed to the
's which weight the training examples. Our
computation of the
's is the second major departure from Adaboost
since we use a mixture of experts formulation for the external weights or
mixture coefficients. To find the mixture coefficients for the local
classifiers, we look to the boosted mixture of experts (BME) [22]. With BME, improved performance is gained through the
use of a confidence measure for the individual experts [22]. Although many different
confidence measures exist, the majority use a scalar function of the expert's
output which is then used as a static gating function or mixture coefficient
[20, 22].
Our algorithm uses a
simple measure for each expert
based on the
-Norm of the class errors (instead
of the one outlined in (4)):
(7)where
and
are the respective errors of the two classes
in our problem, move and rest (which could generalize to more classes). We use
to substitute for the normal Adaboost
formulation of
(4) to update the
's in (3).
Since there is a condition placed during the boosting
phase to discard experts with less than 50% classification, negative alphas
will not occur [19].
Notice that as the errors between the two classes are smaller, the weights for
the experts become larger. We present the proposed Adaboost
training algorithm for BM-HMM (see Algorithm 1).
The criterion for stopping is based on two conditions.
The first stopping condition occurs, if the chosen experts are performing less
than 50% classification. The second stopping condition occurs if the cross-validation
set shows an increase in error or a plateau in performance for a significant
number of rounds.
Since a single HMM chain is trained on a single neural
channel, the number of parameters is very small and can support the amount of
training data. The individual HMM chains in the BM-HMM contain around 70
parameters for a training set of 10000 examples as opposed to almost 18000
parameters necessary for a comparable CHMM (due to the dependent states)
[5].
In the next section, we slightly increase the
complexity of the BM-HMM to explicitly model dependencies between the neurons.
4. Linked Mixtures of HMM Chains
4.1. Modeling Framework
Let
represents the set of variables (both hidden
and observed) included in the probabilistic model. A graphical model (or
Bayesian network) representation can then be used to provide insight into the
probability distributions over
encoded in a graph structure [23]. With this type of
representation, edges of the graph represent direct dependencies between
variables. Conversely, and more importantly, the absence of an edge allows the
assumption of conditional independence between variables. Ultimately, these
conditional independencies allow a more complicated multivariate distribution
to be decomposed (or factorized) into simple and tractable distributions
[23].
Since there is a variety of graphical model
representations that decomposes the joint probability of the hidden and
observed variables in
,
choosing the best approximation is overwhelming. To move beyond the IC-HMM and
implicit dependencies in the BM-HMM, we establish another layer of hidden or
latent variables to link and express the spatial dependencies between the lower
level HMM structures (Figure 3), thus creating a clique tree structure
(since there are cycles), where hierarchic
links exist between neural channels.
Figure 3: LM-HMM graphical model.
The log likelihood of the dynamic neural firings from
all of the neurons for this structure (Figure 3) is
(8) where the dependency between the tree cliques are
represented by a hidden variable
in the second layer:
(9)and the hidden state sequence
also has a dependency on the hidden variable
in the second layer:
(10)
This hierarchic model allows us to model data across
multiple neural channels while also exploiting possible dependencies. From
Figure 3, we see that the lower observable variables
are conditionally independent from the second
layer hidden variable
as well as the subgraphs of the other neural
channels
(where
). The hidden variable
in the second layer of (8) can be interpreted
as a mixture variable (when excluding the hierarchic links).
The LM-HMM implements a middle ground between making
an independence assumption and a full dependence assumption. Since we are
adding a layer of hidden variables, we are incurring a computational cost. The
next section will show that we can make an approximation to ease computational
costs but still bring in the richness of modeling the interrelationships
between neurons.
4.2. Training with Expectation Maximization (EM)
Although using EM with graphical models gives us
insight into probability distributions over our observed and hidden variables,
some probabilities of interest are intractable to compute. Instead of using
brute force methods to evaluate such probabilities, conditional independencies
represented in the graphical model can be exploited. Often, approximations like
Gibbs sampling, variational methods, and mean field approximations are applied in
order to make the problem tractable or computationally efficient [23, 24].
For our model, we use a mean field approximation to
allow interactions associated with tractable substructures to be taken into
account [24]. The
basic idea is to associate with the intractable distribution a simplified
distribution that retains certain terms of the original distribution while
neglecting others, replacing them with parameters
that we will refer to as “variational
parameters.” Graphically, the method can be viewed as deleting edges from
the original graph until a forest of tractable structures is obtained. Edges
that remain in the simplified graph correspond to terms that are retained in
the original distribution and edges that are deleted correspond to variation
parameters [24, 25].
We make approximations when finding the
expectation of
(8). In particular, we will first approximate
by treating
as independent from
making conditional probability equal to the
familiar
.
Two important features can be seen in this type of approximation. First, we
have decoupled the simple lower-level HMM chains from the higher-level
variables. Second,
can now be regarded as a linked mixture
variable for the HMM chains since
which we will address later [20].
Because the lower-level HMMs have been decoupled, we
are able to use the Baum-Welch formulation to compute some of the calculations
in the
-step, leaving estimation of the variational parameter for later. As a
result, we can calculate the forward pass:
step
(11)
We can calculate this quantity
recursively by setting
(12)
The well-known backward procedure is
similar:
(13)this computes the probability of
the ending partial sequence
given the start at state
at time
.
Recursively, we can define
as
(14)
Additionally, the
and
matrices are the transition and emission
matrices defined for the model which are updated in the
-step. Continuing in
the
-step, we will rearrange posteriors in terms of the forward and backward
variables. Let
(15)which is the posterior
distribution. We can rearrange the equations to quantities so we
have
(16)
and now with the conditional independencies we can
define the posterior in terms of
's and
's:
(17)
We also define
(18)which can be expanded
as
(19)
The
-step departs from the Baum-Welch formulation and
introduces the variational parameter [24]. Specifically, the
-step involves the update of the
parameters
,
,
(we will save
for later):
step
(20)
There are two issues left to resolve. First, how can
the variational parameter be estimated and maximized given the dependencies.
Second, if experimentally it is not known which neurons are affecting other
neurons (if at all), how can the dependencies between neurons be defined in the
model.
4.3. Updating Variational Parameter via Importance Sampling
While still working within the EM framework, we treat
the variational parameters
as mixture variables generated by the
th HMM each having a prior probability of
.
We want to estimate the set of parameters that maximize the likelihood function
[25–27]:
(21)
Given the set of sequences and current estimates of
the parameters, the
-step consists of computing the conditional expectation of
hidden variable
:
(22)
The problem with this conditional expectation is the
dependency on
.
Since
is independent from
and
,
we can decompose this into
(23)
The first term, a well-known expectation for mixture
of experts, is calculated by using Bayes rule and the priori probability that
:
(24)
Since the integration for the second term is much
harder to compute, we look to an integration approximation that will maintain
the dependencies. Importance sampling is a well-known method that is capable of
approximating the integration with a lower variance than Monte-Carlo
integration [28]. We
can approximate the integration with
(25)where the
samples have been drawn from the proposal
distribution
.
For the estimation of
,
we need to combine the two terms:
(26)
To compute the
-step,
(27)
Borrowing from the competitive nature of the BM-HMM,
we choose winners based on the same criterion of minimizing the Euclidean
distance for the classes for the LM-HMM.
5. Experimental Results
5.1. Quantitive Results
In this section, we first show the results of our
model on the two animal experiments. Next, we illustrate the effects of the
algorithm by comparing the BM-HMM and LM-HMM chains to the IC-HMM chains. The
BM-HMM and LM-HMM chains are ranked in order of the rounds, while the IC-HMM
chains are ranked from the best to worst in terms of the individual chain's
ability to classify. We conclude with an analysis of the methods by randomly
selecting the chains for the LM-HMM and BM-HMM to understand if dependencies
are being captured.
In Tables and
, we see a comparison of the results
from the BM-HMM and LM-HMM versus the full IC-HMM and a simple linear
classifier created by a regression model followed by a threshold [5]. For the HMMs, we use three
hidden states and an observation sequence length
, which corresponds to one second of data
(given the 100-millisecond bins). These choices were based on previous efforts
to optimize performance [4]. For the linear classifier, a Wiener filter with a
10-tap delay (that corresponds to one second of data) is used. These parameters
and thresholds were also chosen based on previous work [4, 13]. The
classification results
are determined on each data point in the training set. Cross-validation sets,
Monte-Carlo runs, leave-K-out methodologies are also employed to make sure that
our results are general.
Table 1: Classification results (BM-HMM selected channels).
Table 2: Classification results
(random LM-HMM selected channels).
As seen in Tables , ,
, and , the correct
classification percentage is much higher for the monkey data than the rat data.
This is explained primarily by the experimental conditions. Since the rat
experiment is not as controlled as the monkey experiment, the rat dataset is
much noisier. In particular, as the rat moves around the cage there are many
training and testing samples labeled as nonmovement, when the animal is
actually moving (only lever presses are counted as movement in this setup).
This explains the much lower performance on the rat data. Also, fewer neural
channels were collected from the rat than the monkey [1, 7]. Apart from this
difference, the same trends are seen in both datasets.
Table 3: Classification results (random BM-HMM selected
channels).
Table 4: Classification results (random LM-HMM selected
channels).
To give a fair comparison between the methods, the
same neural channels that were chosen by the BM-HMM are used with the linear
model and the IC-HMM. For the LM-HMM, a different set of neurons (although some
overlap the BM-HMM) was selected based on the algorithm, but the number of
final channels were kept the same. As compared in Tables
and , the BM-HMM
and LM-HMM perform on par with the IC-HMM, but with the added benefit of
dimensionality reduction. Effectively, the same performance is obtained with a
fraction of the number of neural channels. For this experiment, the performance
on the rat data is better with the BM-HMM, and slightly worst with the LM-HMM.
In Tables and
, we wanted to see the effect of
randomly selecting the experts. Notice the results in these tables are poor in
comparison to Tables and
. In particular, Tables
and show a significant
decrease in performance modeling the monkey data. The performance reduction is
less pronounced for the data collected from the rat. We believe this is due to
the available number of channels that we can randomly select. Since the monkey
data is collected from a greater number of channels, there is an increased
possibility of selecting a bad expert for both the BM-HMM and LM-HMM. It is
interesting that the
for both models on both datasets were
approximating similar values (like an averaging), almost as if the neurons
randomly selected are more independent than dependent. This is also evident
since the result of the IC-HMM using the same channels has similar performance
with respect to both models in Tables and
. Even
Figure 8 points to the
channels having zero correlation amongst themselves, alluding to independency.
The results demonstrate three interesting points.
First, it is significant that on the monkey data, nine BM-HMM and LM-HMM chains
outperform the linear classifier that uses the full input space. Second, the
subset of experts that are chosen by the BM-HMM seems to perform well on the
linear model. This result is expected since the BM-HMM chains select neural
channels with important complimentary information. Third, when comparing the
BM-HMM and LM-HMM to the Wiener filter and the IC-HMM using the same subset of
neural channels, the results show that the hierarchic training of the BM-HMM
and LM-HMM provides a significant increase in performance. We believe this is
due to the dependencies that are being exploited during each round of training,
where the other models simply try to uniformly combine all the neural
information into a single hypothesis. This is supported by Tables
and
since
the LM-HMM and BM-HMM appears to default to an independent approximation.
Finally, other BMI researchers apply sensitivity analysis to understand the
importance of a neural channel respective to the kinematics performed by the
subject. In contrast the BM-HMM and LM-HMM channel selection is trying to
improve classification results by exploiting dependencies between channels as
well as kinematics. Interestingly, some of the channels selected in both
datasets do overlap some of the same neurons selected during sensitivity
analysis [3, 7].
Additionally, if we do an expert adding experiment in
which the best ranked experts are added one by one to the ensemble vote, an
interesting result emerges. In Figure 4, as the BM-HMM chains are added, the
error rate quickly decreases below the IC-HMM error when applied to the monkey
neural data. The LM-HMM has an even faster drop in error but does not achieve
the same result as the BM-HMM. Figure 7 shows a similar result when applied to
the rat neural data. Overall, we find that the boosted mixtures or linked
mixtures are exploiting more useful and complimentary information for the final
ensemble than the simple IC-HMM.
Figure 4: Monkey expert adding experiment.
6. Discussion and Future Work
Based on the results, the BM-HMM performs better than
the LM-HMM due to the fact that the BM-HMM takes advantage of using the errors
in classification when determining the
's. In contrast, the LM-HMM only uses the
likelihood information for updating the priors and variational parameters in
order to create explicit dependencies. Despite the slightly lower performance
of the LM-HMM, it is significant that the model can still find and exploit
neural channels that are important for movement. Additionally, since the
likelihood is only required, perhaps the algorithm could be improved by dynamically
updating the prior and variational parameters during the evaluation of the test
set. This could allow us to modify the temporal and spatial dynamics in an
online fashion.
To understand how the two methods differ in selecting
the channels, we plotted in Figures 5 and
6, the correlation coefficients
between the subset of channels chosen by the BM-HMM and LM-HMM. Interestingly,
we see that more of the channels in the LM-HMM have a large positive and large
negative correlation (with respect to the move class). In contrast, the BM-HMM
has some positive correlation amongst its subset, but few channels exhibit
negative correlation. Figure 8 shows the correlation coefficient between the
channels that were randomly selected. Notice how the correlations between
channels are near zero. These results suggest that the LM-HMM prefers stronger
correlated channels (i.e., dependant) than the BM-HMM and much stronger than
randomly selecting channels (which seems to exhibit more independence between
channels).
Figure 5: Correlation coefficients between channels (monkey
moving).
Figure 6: Correlation coefficients between channels (monkey at
rest).
Figure 7: Rat expert adding experiment.
Figure 8: Correlation coefficient between channels (randomly
selected for monkey).
With respect to other algorithms in the machine
learning community, there are a few ways to interpret the BM-HMM. BM-HMMs can
be thought of as a modification to boosting or even a simpler version of the
mixture of trees algorithm, if the HMM chains are interpreted as binary stumps
[29]. Additionally, the
temporal Markovian dynamics coupled with the hierarchic structure and mixture
modeling can be thought of as a simple approximation to tree structured HMMs
[23]. Other work has
focused on solving this problem of boosting multiple parallel classifiers
[30]. Other authors
have proposed boosting solutions that reduce the dimensionality of the input
data [30–32]. From their perspective,
the multidimensional inputs are treated as simple features of a single random
process [30]. We
differ from this perspective by assuming the input space is composed of
multiple random processes that are interacting with each other in some unknown
way. By decomposing the input space into multiple random processes, the local
contributions of the individual processes are exploited rather than using the
global effect of a single process. This type of algorithm also has components
similar to the mixture of experts (MOEs) algorithm.
Although some similarities exist between MOE and both
of our models, it is important to note that the MOE has the advantages of
localization and the use of a dynamic model for combining the outputs from the
experts [20, 22]. Others have proposed a
similar formulation of building boosted hierarchic structures with the MOE
algorithm, but these efforts lack the Markovian dynamics that are inherent with
BM-HMMs [20, 22, 23, 33]. Our work also differs from these formulations, since
our “gating functions” are static and behave like priors for the experts
[5]. Because of the
static “gating functions,” it could even be closer aligned with an
ensemble method, but in future work we hope to exploit a dynamic gating
function similar to MOE, where temporal dynamics changes the output of the
switching classifier.
Finally, for BMIs, the selection of only 9 neurons
(for the monkey data) from the large number of cortical neurons must be
interpreted in the proper context. Notice that we are just selecting the
neurons that are capable of allowing recognition of movement or rest, which is
a very simple task. Our experience with cursor tracking tasks shows that a much
larger number of neurons is required, and since the data is nonstationary,
neurons are being multiplexed across time to implement tasks. This implies that
in principle, the larger the number of neurons the better, provided that one
has sufficient data to train the models. Nonetheless, this HMM methodology will
be extremely useful for the design of BMIs, especially if ways are found to
discover the time varying dependencies.
Acknowledgments
The authors
would like to thank Johan Wessberg and Miguel A. L. Nicolelis for sharing their
monkey neural data and experience and Justin Sanchez for his rat data. They
also thank Antonio Paiva for his linear classifier code and Michael Nechyba for
providing a helpful hand in this research as well as thanking Jeremy Anderson
and Grisel Morales for editing the paper. This paper was supported by DARPA
Project no. N66001-02-C-8022 and NSF Project no. CNS-0540304.
References
- M. A. L. Nicolelis, D. Dimitrov, J. M. Carmena, et al., “Chronic, multisite, multielectrode recordings in macaque monkeys,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 19, pp. 11041–11046, 2003.
- J. Wessberg, C. R. Stambaugh, J. D. Kralik, et al., “Real-time prediction of hand trajectory by ensembles of cortical neurons in primates,” Nature, vol. 408, no. 6810, pp. 361–365, 2000.
- J. C. Sanchez, S.-P. Kim, D. Erdogmus, et al., “Input-output mapping performance of linear and nonlinear models for estimating hand trajectories from cortical neuronal firing patterns,” in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 139–148, Martigny, Switzerland, September 2002.
- S. Darmanjian, S.-P. Kim, M. C. Nechyba, et al., “Bimodal brain-machine interface for motor control of robotic prosthetic,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '03), vol. 3, pp. 3612–3617, Las Vegas, Nev, USA, October 2003.
- S. Darmanjian, S.-P. Kim, M. C. Nechyba, J. C. Principe, J. Wessberg, and M. A. L. Nicolelis, “Independently coupled HMM switching classifier for a bimodel brain-machine interface,” in Proceedings of the 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, pp. 379–384, Maynooth, Ireland, September 2006.
- S. Darmanjian, A. R. C. Paiva, J. C. Principe, et al., “Hierarchal decomposition of neural data using boosted mixtures of independently coupled hidden Markov chains,” in Proceedings of the International Joint Conference on Neural Networks, pp. 89–93, Orlando, Fla, USA, August 2007.
- J. C. Sanchez, J. C. Principe, and P. R. Carney, “Is neuron discrimination preprocessing necessary for linear and nonlinear brain machine interface models,” in Proceedings of the 11th International Conference on Human-Computer Interaction, Vegas, Nev, USA, July 2005.
- E. Todorov, “On the role of primary motor cortex in arm movement control,” in Progress in Motor Control III, pp. 125–166, Human Kinetics, Champaign, Ill, USA, 2003.
- A. B. Schwartz, D. M. Taylor, and S. I. H. Tillery, “Extraction algorithms for cortical control of arm prosthetics,” Current Opinion in Neurobiology, vol. 11, no. 6, pp. 701–707, 2001.
- M. A. Lebedev and M. A. L. Nicolelis, “Brain-machine interfaces: past, present and future,” Trends in Neurosciences, vol. 29, no. 9, pp. 536–546, 2006.
- R. B. Northrop, Introduction to Dynamic Modeling of Neurosensory Systems, CRC Press, Boca Raton, Fla, USA, 2001.
- W. J. Freeman, Mass Action in the Nervous System, Academic Press, New York, NY, USA, 1975.
- S.-P. Kim, J. C. Sanchez, D. Erdogmus, et al., “Divide-and-conquer approach for brain machine interfaces: nonlinear mixture of competitive linear models,” Neural Networks, vol. 16, no. 5-6, pp. 865–871, 2003.
- N. G. Hatsopoulos, Q. Xu, and Y. Amit, “Encoding of movement fragments in the motor cortex,” Journal of Neuroscience, vol. 27, no. 19, pp. 5105–5114, 2007.
- B. H. Juang and L. R. Rabiner, “Issues in using hidden Markov models for speech recognition,” in Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, Eds., pp. 509–553, Marcel Dekker, New York, NY, USA, 1992.
- X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice-Hall, Englewood Cliffs, NJ, USA, 2001.
- R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, “Boosting the margin: a new explanation for the effectiveness of voting methods,” in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 322–330, Nashville, Tenn, USA, July 1997.
- R. E. Schapire, “The strength of weak learnability,” Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
- Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” in Proceedings of the 13th International Conference on Machine Learning (ICML '96), pp. 148–156, Bari, Italy, July 1996.
- R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
- L. E. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains,” Annals of Mathematical Statistics, vol. 41, no. 1, pp. 164–171, 1970.
- R. Avnimelech and N. Intrator, “Boosted mixture of experts: an ensemble learning scheme,” Neural Computation, vol. 11, no. 2, pp. 483–497, 1999.
- M. I. Jordan, Learning in Graphical Models, MIT Press, Cambridge, Mass, USA, 1999.
- M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” in Learning in Graphical Models, MIT Press, Cambridge, Mass, USA, 1998.
- A. Ypma and T. Heskes, “Automatic categorization of web pages and user clustering with mixtures of hidden Markov models,” in Proceedings of the 4th International Workshop on MiningWeb Data for Discovering Usage Pattens and Profiles (WEBKDD '02), pp. 35–49, Edmonton, Canada, July 2002.
- J. Alon, S. Sclaroff, G. Kollios, and V. Pavlovic, “Discovering clusters in motion time-series data computer vision and pattern recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '03), vol. 1, pp. 375–381, Madison, Wis, USA, June 2003.
- I. Cadez and P. Smyth, “Probabilistic clustering using hierarchical models,” Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1999.
- G. S. Fishman, Monte Carlo: Concepts, Algorithms, and Applications, Springer, New York, NY, USA, 1995.
- L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classication and Regression Trees, Wadsworth International Group, Belmont, Calif, USA, 1984.
- Y. Sun and J. Li, “Iterative RELIEF for feature weighting,” in Proceedings of the 23rd International Conference on Machine Learning, pp. 913–920, ACM Press, Pittsburgh, Pa, USA, June 2006.
- M. Martínez-Ramón, V. Koltchinskii, G. L. Heileman, and S. Posse, “fMRI pattern classification using neuroanatomically constrained boosting,” NeuroImage, vol. 31, no. 3, pp. 1129–1141, 2006.
- P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '01), vol. 1, pp. 511–518, Kauai, Hawaii, USA, December 2001.
- M. Meila and M. I. Jordan, “Learning fine motion by Markov mixtures of experts,” in Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., vol. 8, MIT Press, Cambridge, Mass, USA, 1996.