Computing Laboratory and Center for Biomedical Informatics, University of Kent, Canterbury CT2 7NF, UK
Abstract
The discrete particle swarm optimization (DPSO) algorithm is an
optimization technique which belongs to the fertile paradigm of Swarm Intelligence.
Designed for the task of attribute selection, the DPSO deals with discrete
variables in a straightforward manner. This work empowers the DPSO algorithm
by extending it in two ways. First, it enables the DPSO to select attributes for a
Bayesian network algorithm, which is more sophisticated than the Naive Bayes
classifier previously used by the original DPSO algorithm. Second, it applies the
DPSO to a set of challenging protein functional classification data, involving a
large number of classes to be predicted. The work then compares the performance
of the DPSO algorithm against the performance of a standard Binary PSO
algorithm on the task of selecting attributes on those data sets. The criteria used
for this comparison are (1) maximizing predictive accuracy and (2) finding the
smallest subset of attributes.
1. Introduction
Most of the particle swarm algorithms present in the
literature deal only with continuous variables [1–3]. This is a significant
limitation because many optimization problems are set in a search space
featuring discrete variables. Typical examples include problems which require
the ordering or arranging of discrete variables, such as scheduling or routing
problems [4].
Therefore, the design of particle swarm algorithms that deal directly with
discrete variables is pertinent to this field of study.
The work in [5] proposed a discrete variant of particle swarm optimization (PSO)
for attribute selection in data mining. Hereafter, this algorithm
will be referred to as the discrete particle swarm optimization (DPSO) algorithm.
The DPSO deals directly with discrete variables, and its population of
candidate solutions contains particles of different sizes—the DPSO forces
each particle to carry a constant number of attributes across iterations. The
DPSO algorithm interprets the concept of velocity, used in traditional PSO, as
"probability": it renders velocity as a proportional likelihood and uses this information
to sample new particle positions. The motivation behind the DPSO algorithm is
indeed to introduce a probability-like approach to particle swarm.
Although specifically designed for the task of
attribute selection, the DPSO is not limited to this kind of application. By
performing a few modifications, one can apply this algorithm to many other
discrete optimization problems, such as facility location problems [6].
Many data mining applications involve the task of
building a model for predictive classification. The goal of such a model is to
classify examples—records or data instances—into classes or categories of
the same type. Noise or unnecessary attributes may reduce the accuracy and
reliability of a classification or prediction model. Unnecessary attributes also
increase the costs of building and running a model—particularly on large
data sets. Before performing classification, it is therefore important to
select an appropriate subset of "good" attributes. Attribute selection
tries to simplify a data set by reducing its dimensionality and identifying
relevant underlying attributes without sacrificing predictive accuracy. As a
result, it reduces redundancy in the information provided by the attributes
used for prediction. For a more detailed review of the attribute selection task
using genetic algorithms, see [7].
The main difference between the DPSO and other
traditional PSO algorithms is that the particles in the DPSO do not represent
points inside an $n$-dimensional
Euclidean space (continuous case) or lattice (binary case) as in the standard
PSO algorithms [8].
Instead, they represent a combination of selected attributes. In previous work,
the DPSO was used to select attributes for a Naive Bayes (NB) classifier. The
resulting NB classifier was then used to predict postsynaptic function in
proteins.
The study presented here extends previous work
reported in [5, 9] in two ways.
First, it enables the DPSO to select attributes for a Bayesian network
algorithm, which is more sophisticated than the Naive Bayes algorithm previously
used. Second, it increases the number of data sets used to evaluate the PSO
from 1 to 6. All the 6 functional classification data sets used have a much
greater number of classes to be predicted—in contrast with the postsynaptic
data set used in [5]
which had just two classes to be predicted.
The work is organized as follows. Section 2 briefly
addresses Bayesian networks and the Naive Bayes classifier. Section 3 briefly
discusses PSO algorithms. Section 4 describes the standard binary PSO algorithm
and Section 5 the DPSO algorithm. Section 6 describes the G-protein-coupled
receptors (GPCRs) and Enzyme data sets used in the computational experiments.
Section 7 reports computational experiments—it also includes a discussion of
the results obtained. Section 8 presents conclusions and points out future
research directions.
2. Bayesian Networks and Naive Bayes
The Naive Bayes
(NB) classifier uses a probabilistic approach to assign each record of the data
set to a possible class. In this work, the NB classifier assigns a protein of a
data set of proteins to a possible class. A Naive Bayes classifier assumes that
all attributes are conditionally independent of one another given the class
[10].
A Bayesian network (BN), by contrast, detects probabilistic
relationships among these attributes and uses this information to aid the
attribute selection process.
Bayesian networks are graphical representations of a
probability distribution over a set of variables of a given problem domain
[11, 12]. This graphical
representation is a directed acyclic graph in which nodes represent the
variables of the problem and arcs represent conditional probabilistic
dependencies among the nodes. A directed acyclic graph $G$ is an ordered
pair $G = (V, E)$, where $V$ is a set whose
elements are called vertices or nodes and $E$ is a set whose
elements are called directed edges, arcs, or arrows. The graph $G$
contains no directed cycles: for any vertex $v \in V$, there is no directed
path that starts and ends on $v$.
An example of a Bayesian network is as follows. (This is a modified version of
the so-called "Asia" problem [13].)
Suppose that a doctor is treating a patient who has been suffering from
shortness of breath—called dyspnoea. The doctor knows that diseases such as
tuberculosis, bronchitis, and lung cancer are possible causes for that. The
doctor also knows that other relevant information includes whether the patient
is a smoker—increasing the chances of lung cancer and bronchitis—and what
sort of air pollution the patient has been exposed to. A positive x-ray would
indicate either tuberculosis or lung cancer. The set of variables for this
problem and their possible values are shown in Table 1.
Table 1: Bayesian network: nodes and values for the
lung cancer problem. L = low, H = high, T = true, F = false, Pos = positive,
and Neg = negative.
Figure 1 shows a Bayesian network representing this
problem. For applications of Bayesian networks on evolutionary algorithms and
optimization problems, see [14, 15].
Figure 1: A Bayesian
network representing the lung cancer problem.
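More formally, let $X = (X_1, X_2, \ldots, X_n)$ be a
multivariate random variable whose components $X_i$ are also random
variables. A corresponding lower-case letter $x_i$ denotes an
assignment of state or value to the random variable $X_i$.
$\mathit{Parents}(X_i)$ represents the
set of nodes (variables or attributes in this work) that have a directed
edge pointing to $X_i$. Let us consider a BN containing $n$ nodes,
$X_1$ to $X_n$, taken in that order. A particular value of
$X = (X_1, X_2, \ldots, X_n)$ in the joint
probability distribution is represented by

$$p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) \tag{1}$$

or more compactly, $p(x_1, x_2, \ldots, x_n)$. The chain rule of probability theory allows the
factorization of joint probabilities, therefore

$$p(x_1, x_2, \ldots, x_n) = p(x_1) \times p(x_2 \mid x_1) \times \cdots \times p(x_n \mid x_1, \ldots, x_{n-1}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}) \tag{2}$$

As the structure of a BN implies that the value of a
particular node is conditional only on the values of its parent nodes, (2) may be
reduced to

$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathit{Parents}(X_i)) \tag{3}$$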
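As a minimal illustration of the factorization in (3), the following Python sketch computes a joint probability by multiplying one conditional probability per node. The three-node chain, the node names, and every probability value are hypothetical and are not taken from Table 1 or Figure 1.

```python
from itertools import product

# Hypothetical chain Smoker -> Cancer -> XRay; every number below is made up.
parents = {"Smoker": [], "Cancer": ["Smoker"], "XRay": ["Cancer"]}

# cpt[node][parent values] = P(node = True | parent values)
cpt = {
    "Smoker": {(): 0.3},
    "Cancer": {(True,): 0.05, (False,): 0.01},
    "XRay": {(True,): 0.9, (False,): 0.2},
}

def joint_probability(assignment):
    """Multiply P(x_i | Parents(X_i)) over all nodes, as in (3)."""
    prob = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[p] for p in node_parents)
        p_true = cpt[node][parent_values]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob

# P(Smoker=T, Cancer=F, XRay=T) = 0.3 * (1 - 0.05) * 0.2
p = joint_probability({"Smoker": True, "Cancer": False, "XRay": True})

# Sanity check: the joint distribution sums to 1 over all 2^3 assignments.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([True, False], repeat=len(parents))
)
```

Each node contributes exactly one factor, so the cost of evaluating a joint probability grows with the number of nodes rather than with the size of the full joint table.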
Learning the structure of a BN is an NP-hard problem
[16, 17]. Many algorithms that were developed to this end use a scoring metric and a search procedure. The scoring
metric evaluates the goodness-of-fit of a structure to the data. The search
procedure generates alternative structures and selects the best one based on
the scoring metric. To reduce the search space of networks, only candidate
networks in which each node has at most $k$ inward arcs
(parents) are considered, where $k$ is a parameter
determined by the user. In the present work, $k$ is set to 20
to avoid overly complex models.
A greedy search algorithm is used to generate
alternative structures for the BN. Starting with an empty network, the greedy
search algorithm adds into the network the edge that most increases the score
of the resulting network. The search stops when no other edge addition improves
the score of the network. Algorithm 1 shows the pseudocode of this generic greedy
search algorithm.
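The greedy search described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a transcription of Algorithm 1: the scoring metric is passed in as a hypothetical callable `score`, edges are `(parent, child)` tuples, and `max_parents` plays the role of the user-defined limit on inward arcs.

```python
from itertools import permutations

def creates_cycle(edges, new_edge):
    """Return True if adding new_edge (parent, child) would close a directed cycle."""
    parent, child = new_edge
    # A cycle appears if the parent is already reachable from the child.
    stack, seen = [child], set()
    while stack:
        node = stack.pop()
        if node == parent:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(c for (p, c) in edges if p == node)
    return False

def greedy_structure_search(nodes, score, max_parents):
    """Start from an empty network and repeatedly add the best-scoring edge."""
    edges = set()
    best_score = score(edges)
    while True:
        best_edge = None
        for edge in permutations(nodes, 2):
            if edge in edges or creates_cycle(edges, edge):
                continue
            n_parents = sum(1 for (_, c) in edges if c == edge[1])
            if n_parents >= max_parents:
                continue
            candidate_score = score(edges | {edge})
            if candidate_score > best_score:
                best_score, best_edge = candidate_score, edge
        if best_edge is None:  # no single edge addition improves the score
            return edges, best_score
        edges.add(best_edge)

# Toy usage: a made-up score that rewards two target edges and penalizes others.
target = {("A", "B"), ("B", "C")}

def toy_score(edges):
    return len(edges & target) - len(edges - target)

found_edges, found_score = greedy_structure_search(["A", "B", "C"], toy_score, max_parents=2)
```

In the setting of this work, `score` would be replaced by the classification accuracy of the candidate network on the validation set.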
To evaluate the “goodness-of-fit” (score) of a
network structure to the data, an unconventional scoring metric—specific for
the target classification task—is adopted. The entire data set is divided
into mutually exclusive training and test sets—the standard methodology for
evaluating classifiers, see Section 7.1. The training set is further divided
into two mutually exclusive parts. The first part is used to compute the
probabilities for the Bayesian network. The second part is used as the
validation set. During the search for the best possible network structure, only
the validation set is used to compute predictive accuracy. The score of a
candidate network is given by the classification accuracy in the validation
set. The graphical model of the network that shows the highest predictive
accuracy on the validation set—during the entire PSO run—is then used to
compute the predictive accuracy on the test set.
Once the best network structure is selected, at the
end of the PSO run, the validation set and the other part of the training set
are merged and this merged data (that is, the entire original training set) is used to compute the probabilities for the selected Bayesian network. The
predictive accuracy, reported as the final result, is then computed on the
previously untouched test set. This process is discussed again, in more
detail, in Section 7.1. A similar process is adopted for the computation of
the predictive accuracy using the Naive Bayes classifier.
3. A Brief Introduction to Particle Swarm Optimization
Particle swarm optimization (PSO) comprises a set of
search techniques, inspired by the behavior of natural swarms, for solving
optimization problems [8]. In PSO, a potential solution to a problem is represented
by a particle, $X_{(i)} = (x_{(i,1)}, x_{(i,2)}, \ldots, x_{(i,n)})$,
in an $n$-dimensional search space. $X_{(i)}$ represents the
$i$th particle in the population and $n$ represents the
number of variables of the problem. The coordinates $x_{(i,d)}$
of these particles have a rate of change (velocity) $v_{(i,d)}$,
$d = 1, 2, \ldots, n$. Note that the use of the double subscript notation
($i$, $d$) like in $x_{(i,d)}$ represents the
$d$th component of the $i$th particle in the swarm, $X_{(i)}$; the same
rationale is used for $v_{(i,d)}$, and so forth.
Every particle keeps a record of the best position
that it has ever visited. Such a record is called the particle's previous best
position and denoted by $B_{(i)}$. The global best position attained by any particle so
far is also recorded and stored in a particle denoted by G. An iteration
comprises evaluation of each particle, then stochastic adjustment of $v_{(i,d)}$
in the direction of particle $X_{(i)}$'s previous
best position and the previous best
position of any particle in the neighborhood [18]. There is much variety in
the neighborhood topology used in PSO, but quite often gbest or lbest topologies are used. In the gbest topology, the neighborhood of a
particle consists of all the other particles in the swarm, and therefore all
the particles will have the same global best neighbor, which is the best
particle in the entire population. In the lbest topology, each particle
has just a "local" set of neighbors, typically much fewer than the number
of particles in the swarm, and so different particles can have different best
local neighbors. For a review of the neighborhood topologies used in PSO the
reader is referred to [8, 19].
As a whole, the set of rules that govern PSO are evaluate,
compare, and imitate. The evaluation phase measures how well each particle
(candidate solution) solves the problem at hand. The comparison phase
identifies the best particles. The imitation phase produces new particle
positions based on some of the best particles previously found. These three
phases are repeated until a given stopping criterion is met. The objective is
to find the particle that best solves the target problem.
Important concepts in PSO are velocity and
neighborhood topology. Each particle, $X_{(i)}$, is associated with a velocity vector. This velocity
vector is updated at every generation. The updated velocity vector is then used
to generate a new particle position $X_{(i)}$. The neighborhood topology defines how other
particles in the swarm, such as $B_{(i)}$ and G,
interact with $X_{(i)}$ to modify its
respective velocity vector and, consequently, its position as well.
4. The Standard Binary PSO Algorithm
Potential solutions to the target problem are encoded
as fixed size binary strings; that is,
$X_{(i)} = (x_{(i,1)}, x_{(i,2)}, \ldots, x_{(i,n)})$, where
$x_{(i,d)} \in \{0, 1\}$, $i = 1, 2, \ldots, N$, and
$d = 1, 2, \ldots, n$ [8]. Given a list of attributes
$A = (A_1, A_2, \ldots, A_n)$, the first element of
$X_{(i)}$, from the left to the right hand side, corresponds to
the first attribute "$A_1$," the second
to the second attribute "$A_2$," and so
forth. A value of 0 on the site associated to an attribute indicates that the
respective attribute is not selected. A value of 1 indicates that it is
selected.
4.1. The Initial Population for the Standard Binary PSO Algorithm
For the initial
population, $N$ binary strings of size $n$ are randomly
generated. Each particle $X_{(i)}$ is generated
independently. For every position $x_{(i,d)}$ in $X_{(i)}$, a uniform random number
is drawn on the interval (0, 1), and $x_{(i,d)}$ is set to either 1 or 0
depending on the value drawn.
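A minimal Python sketch of this initialization is given below. The function name, its parameters, and in particular the 0.5 threshold (a draw below it yields 1, otherwise 0) are assumptions made for illustration.

```python
import random

def init_binary_population(num_particles, num_attributes, rng, threshold=0.5):
    """Generate num_particles independent binary strings of length num_attributes."""
    population = []
    for _ in range(num_particles):
        # One fresh uniform draw per position; the 0.5 threshold is an assumption.
        particle = [1 if rng.random() < threshold else 0 for _ in range(num_attributes)]
        population.append(particle)
    return population

# Hypothetical swarm of 4 particles over 6 attributes, seeded for repeatability.
swarm = init_binary_population(num_particles=4, num_attributes=6, rng=random.Random(0))
```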
4.2. Updating the Records for the Standard Binary PSO Algorithm
At the
beginning, the previous best position of $X_{(i)}$, denoted by $B_{(i)}$, is empty. Therefore, once the initial particle
$X_{(i)}$ is generated, $B_{(i)}$ is set to $B_{(i)} = X_{(i)}$. After that, every time that
$X_{(i)}$ is updated, $B_{(i)}$ is also updated
if $f(X_{(i)})$ is better than $f(B_{(i)})$. Otherwise, $B_{(i)}$ remains as it
is. Note that $f(\cdot)$ represents the
fitness function used to measure the quality of the candidate solutions. A
similar process is used to update the global best position G. Once
all the $B_{(i)}$ have been
determined, G is set to the fittest $B_{(i)}$ previously
computed. After that, G is updated if the fittest $B_{(i)}$ in the swarm is
better than G. In that case, G is set to G = (fittest $B_{(i)}$). Otherwise, G remains as it is.
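The record-keeping rules above can be sketched in Python as follows. The function and argument names are hypothetical, `fitness` is any callable returning a number, and "better" is assumed to mean a strictly higher fitness value.

```python
def update_records(particles, fitness, personal_best, global_best):
    """Refresh each previous best position and then the global best position."""
    for i, particle in enumerate(particles):
        # An empty record (None) is always replaced by the current particle.
        if personal_best[i] is None or fitness(particle) > fitness(personal_best[i]):
            personal_best[i] = list(particle)
    fittest = max(personal_best, key=fitness)
    if global_best is None or fitness(fittest) > fitness(global_best):
        global_best = list(fittest)
    return personal_best, global_best

# Toy usage with a made-up fitness (the number of selected attributes).
bests, g = update_records([[1, 0, 0], [1, 1, 0]], sum, [None, None], None)
bests, g = update_records([[1, 1, 1], [0, 0, 0]], sum, bests, g)
```

In the toy usage the second particle gets worse in the second call, so its record is left untouched while the global best moves to the improved first particle.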
4.3. Updating the Velocities for the Standard Binary PSO Algorithm
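Every particle $X_{(i)}$ is associated
to a unique vector of velocities
$V_{(i)} = (v_{(i,1)}, v_{(i,2)}, \ldots, v_{(i,n)})$. Note that, for simplicity, this work uses row
vectors rather than column vectors. The elements $v_{(i,d)}$
in $V_{(i)}$ determine the
rate of change of each respective coordinate $x_{(i,d)}$
in $X_{(i)}$, $d = 1, 2, \ldots, n$. Each element $v_{(i,d)} \in V_{(i)}$ is updated
according to the equation

$$v_{(i,d)} = w\, v_{(i,d)} + \varphi_1 \left(b_{(i,d)} - x_{(i,d)}\right) + \varphi_2 \left(g_{(d)} - x_{(i,d)}\right) \tag{4}$$

where $w$ ($0 < w < 1$), called the inertia weight, is a constant value
chosen by the user and $d = 1, 2, \ldots, n$. Equation (4) is a standard equation used in PSO
algorithms to update the velocities [20, 21]. The factors $\varphi_1$
and $\varphi_2$ are uniform
random numbers independently generated in the interval (0, 1).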
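A minimal Python sketch of a velocity update of this kind is shown below: an inertia term plus two randomly weighted attraction terms, one toward the particle's previous best position and one toward the global best position. The function name and the choice of drawing fresh random factors for every dimension are assumptions made for illustration.

```python
import random

def update_velocity(velocity, position, personal_best, global_best, w, rng):
    """Apply v = w*v + phi1*(b - x) + phi2*(g - x) to every dimension."""
    new_velocity = []
    for v, x, b, g in zip(velocity, position, personal_best, global_best):
        phi1, phi2 = rng.random(), rng.random()  # fresh draws per dimension (assumed)
        new_velocity.append(w * v + phi1 * (b - x) + phi2 * (g - x))
    return new_velocity

# When the particle already sits on both best positions, only inertia remains.
settled = update_velocity([1.0, -2.0], [1, 0], [1, 0], [1, 0], 0.8, random.Random(1))
```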
4.4. Sampling New Particle Positions for the Standard Binary PSO Algorithm
For each
particle $X_{(i)}$ and each
dimension $d$, the value of the new coordinate $x_{(i,d)} \in X_{(i)}$
can be either 0
or 1. The decision of whether $x_{(i,d)}$
will be 0 or 1
is based on its respective velocity $v_{(i,d)} \in V_{(i)}$
and is given by
the equation

$$x_{(i,d)} = \begin{cases} 1 & \text{if } \mathit{rand} < S(v_{(i,d)}), \\ 0 & \text{otherwise}, \end{cases} \tag{5}$$

where $\mathit{rand}$ is a uniform
random number, and

$$S(v_{(i,d)}) = \frac{1}{1 + e^{-v_{(i,d)}}} \tag{6}$$

is the sigmoid
function. Equation (5) is a standard equation used to sample new particle
positions in the binary PSO algorithm [8]. Note that the lower the value of
$v_{(i,d)}$ is, the more
likely the value of $x_{(i,d)}$
will be 0. By
contrast, the higher the value of $v_{(i,d)}$
is, the more
likely the value of $x_{(i,d)}$
will be 1. The
motivation to use the sigmoid function is to map the range of velocity values
into the
interval (0, 1), which is equivalent to the interval of a probability function.
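The sampling rule can be sketched in Python as follows; the function names are hypothetical. The sigmoid turns each velocity into a probability, and one uniform draw per coordinate decides between 1 and 0.

```python
import math
import random

def sigmoid(v):
    """S(v) = 1 / (1 + exp(-v)), which lies strictly between 0 and 1 for moderate v."""
    return 1.0 / (1.0 + math.exp(-v))

def sample_position(velocity, rng):
    """Set each coordinate to 1 when a uniform draw falls below S(v), else to 0."""
    return [1 if rng.random() < sigmoid(v) else 0 for v in velocity]

# A very high velocity almost surely yields 1, a very low one almost surely 0.
extreme = sample_position([50.0, -50.0], random.Random(0))
```

For very large negative velocities `math.exp` can overflow, so a practical implementation would usually clamp the velocity to a bounded range first; that clamping is not part of this sketch.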
5. The Discrete PSO (DPSO) Algorithm
The DPSO
algorithm deals directly with discrete variables
(attributes) and, unlike the binary PSO algorithm, its population of candidate
solutions contains particles of different sizes. Potential solutions to the
optimization problem at hand are represented by a swarm of particles. There are
$N$ particles in a
swarm. The size of each particle may vary from 1 to $n$, where $n$
is the number
of variables (attributes in this work) of the problem. In this context, the
size of a particle refers to the number of different attribute indices that the
particle is able to represent at a single time.
For example, in DPSO it may occur that a particle $X_{(i)}$
in the
population has size 6, whereas another particle $X_{(j)}$
in the same
population has size 2, and so
forth, or any other sizes between 1 and $n$.
Each particle $X_{(i)}$
keeps a record
of the best position it has ever attained. This information is stored in a
separate vector labeled as $B_{(i)}$. The swarm also keeps a record of the global best
position ever attained by any particle in the swarm. This information is also
stored in a separate vector labeled G. Note that G is equal to
the best $B_{(i)}$
present in the
swarm.
5.1. Encoding of the Particles for the DPSO Algorithm
Each attribute
is represented by a unique positive integer number, or index. These numbers,
indices, vary from 1 to $n$. A particle is a subset of nonordered indices without
repetition.
5.2. The Initial Population for the DPSO Algorithm
The original work on DPSO [5] used a randomly generated
initial population for the standard PSO algorithm and a new randomly generated
initial population for the DPSO algorithm, when comparing these algorithms'
performances in a given data set. However, initializing the two populations
independently raised a doubt about a possible advantage of one initial
population over the other, which would bias the performance of one algorithm
over the other. In this work, to eliminate this possible bias, the initial
population used by the DPSO is always identical to the initial population used
by the binary PSO. They differ only in the way in which solutions are
represented. The conversion of every particle in the initial population of
solutions of the binary PSO to the Discrete PSO initial population is as
follows.
The index of every attribute that has value 1 is
copied to the new solution (particle) of the DPSO initial population. For
instance, an initial candidate solution for the binary PSO algorithm in which
exactly three attributes are set to 1 (are present) is converted, for
the DPSO algorithm, into a particle holding the indices of those three
attributes. Note that the same initial population of solutions
is used for both algorithms, binary PSO and DPSO, to make the comparison between
the performances of these algorithms as free from initialization bias as
possible.
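The conversion from a binary PSO particle to a DPSO particle can be sketched in Python as follows. The function name and the example bit string are hypothetical; indices are 1-based, as in the encoding of Section 5.1.

```python
def binary_to_indices(bits):
    """Return the 1-based indices of the attributes whose bit is 1."""
    return {d for d, bit in enumerate(bits, start=1) if bit == 1}

binary_particle = [0, 1, 0, 1, 1]  # hypothetical binary PSO particle, n = 5
dpso_particle = binary_to_indices(binary_particle)  # the set {2, 4, 5}
```

The size of the resulting DPSO particle equals the number of ones in the binary string, which is why particles converted this way can have different sizes.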
Initializing the particles $X_{(i)}$
in this way
causes different particles, in DPSO, to have different sizes. For instance, an
initial candidate solution of the binary PSO algorithm in which two attributes
are set to 1 is converted into an initial candidate
solution of the
DPSO algorithm which has size 2, whereas another initial candidate solution
of the binary PSO algorithm in which four attributes are set to 1 is converted into an initial candidate solution
of the DPSO algorithm which has size 4.
In the DPSO algorithm, for simplicity, once the size
of a particle is determined at the initialization, the particle will keep that
same size during the entire execution of the algorithm. For example, the particle
of size 4 above, which has been initialized with 4 indices, will always carry exactly 4
indices.
The values of those 4 indices, however, are likely to change every time that
the particle is updated.
5.3. Velocities = Proportional Likelihoods
The DPSO
algorithm does not use a vector of velocities as the standard PSO algorithm
does. It works with proportional likelihoods instead. Arguably, the notion of
proportional likelihood used in the DPSO algorithm and the notion of velocity
used in the standard PSO are somewhat similar. DPSO uses $M_{(i)}$
to represent an
array of proportional likelihoods and $m_{(i,d)}$
to represent
one of $M_{(i)}$'s components.
Every particle in DPSO is associated with a 2-by-$n$
array of
proportional likelihoods, where 2 is the number of rows in this array and $n$
is the number
of columns; note that the number of columns in $M_{(i)}$
is equal to the
number of variables of the problem ($n$).
This is an example of a generic proportional
likelihood array:

$$M_{(i)} = \begin{pmatrix} m_{(i,1)} & m_{(i,2)} & \cdots & m_{(i,n)} \\ 1 & 2 & \cdots & n \end{pmatrix} \tag{7}$$

Each of the $n$ elements in the
first row of $M_{(i)}$ represents the
proportional likelihood that an attribute be selected. The second row of $M_{(i)}$
shows the
indices of the attributes associated with the respective proportional
likelihoods.
There is a one-to-one correspondence between the
columns of this array and the attributes of the problem domain. At the
beginning, all elements in the first row of $M_{(i)}$
are set to 1,
for example,

$$M_{(i)} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 2 & \cdots & n \end{pmatrix} \tag{8}$$

After the
initial population of particles is generated, this array is always updated
before a new configuration for the particle associated to it is generated. The
updating of the likelihoods $m_{(i,d)}$
is based on $X_{(i)}$, $B_{(i)}$, G and three constant updating factors,
namely, $\alpha$, $\beta$, and $\gamma$. The updating factors ($\alpha$, $\beta$, and $\gamma$) determine the
strength of the contribution of $X_{(i)}$, $B_{(i)}$, and G to the adjustment of every coordinate $m_{(i,d)} \in M_{(i)}$.
Note that $\alpha$, $\beta$, and $\gamma$
are parameters chosen
by the user. The contribution of these parameters to the updating of $m_{(i,d)}$
is as follows.
All indices present in $X_{(i)}$
have their
correspondent proportional likelihood increased by $\alpha$. In addition to that, all indices present in $B_{(i)}$
have their
correspondent proportional likelihood increased by $\beta$. The same for G for which the proportional
likelihoods are increased by $\gamma$.
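For instance, given particular values for the particle, the two best positions,
and the updating factors, and also the current array $M_{(i)}$ in (9), the updated
$M_{(i)}$ would be the array in (10).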
Note that index 1 is absent in $X_{(i)}$, $B_{(i)}$, and G in this example. Therefore, the proportional likelihood
of attribute 1 in $M_{(i)}$
remains as it
is. The values used for $\alpha$, $\beta$, and $\gamma$ in this work
were empirically determined in preliminary experiments; this work
makes no claim that these are optimal values. Parameter optimization is a topic
for future research. As a whole, these values make the contribution of $B_{(i)}$
($\beta$) to the updating of $M_{(i)}$
a bit stronger
than the contribution of $X_{(i)}$
($\alpha$); and the contribution of G ($\gamma$) even stronger.
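The updating of the proportional likelihoods can be sketched in Python as follows. The array is stored here as a dictionary from attribute index to likelihood (the second and first rows of the array, respectively). The function name, the particle contents, and the three factor values are hypothetical; the factors are chosen only so that they increase in the order current position, previous best, global best.

```python
def update_likelihoods(likelihoods, position, personal_best, global_best,
                       alpha, beta, gamma):
    """Return a new index -> likelihood mapping after the three increments."""
    updated = dict(likelihoods)
    for index in position:        # indices in the current particle
        updated[index] += alpha
    for index in personal_best:   # indices in the particle's previous best
        updated[index] += beta
    for index in global_best:     # indices in the global best
        updated[index] += gamma
    return updated

# Hypothetical example with n = 5 and all likelihoods initially equal to 1.
m = {d: 1.0 for d in range(1, 6)}
m = update_likelihoods(m, {2, 3, 4}, {3, 5, 2}, {5, 2},
                       alpha=0.10, beta=0.12, gamma=0.14)
```

In this made-up example index 2 appears in all three sets and receives all three increments, whereas index 1 appears in none of them and keeps its initial likelihood.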
The new updated array $M_{(i)}$
replaces the
old one and will be used to generate a new configuration to the particle
associated to it as follows.
5.4. Sampling New Particle Positions for the DPSO Algorithm
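The
proportional likelihood array $M_{(i)}$
is then used to
sample a new instance of particle $X_{(i)}$, the particle
associated to $M_{(i)}$. For this sampling process, a series of operations is
performed on the array. To start with, every element of the first row of the
array $M_{(i)}$
is multiplied
by a uniform random number between 0 and 1. A new random number is drawn for
every single multiplication performed.
To illustrate, suppose that

$$M_{(i)} = \begin{pmatrix} m_{(i,1)} & m_{(i,2)} & \cdots & m_{(i,n)} \\ 1 & 2 & \cdots & n \end{pmatrix} \tag{11}$$

The multiplied
proportional likelihood array would be

$$M_{(i)} = \begin{pmatrix} m_{(i,1)} \cdot \varphi_1 & m_{(i,2)} \cdot \varphi_2 & \cdots & m_{(i,n)} \cdot \varphi_n \\ 1 & 2 & \cdots & n \end{pmatrix} \tag{12}$$

where $\varphi_1, \ldots, \varphi_n$
are uniform
random numbers independently drawn on the interval (0, 1).
Suppose that (13) is the resulting array $M_{(i)}$
after the
multiplication.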
A new particle position is then defined by ranking the
columns in $M_{(i)}$
by the values
in its first row. That is, the elements in the first row of the array are
ranked in a decreasing order of value; and the indices of the attributes, in
the second row of $M_{(i)}$, follow their
respective proportional likelihoods. For example, ranking the array $M_{(i)}$
in (13) would generate the array in (14).
The next operation now is to select the indices that
will compose the new particle position. After ranking the array $M_{(i)}$,
the first $s_i$
indices (in the
second row of $M_{(i)}$), from left to
right, are selected to compose the new particle position. Note that $s_i$
represents the
size of the particle $X_{(i)}$, the particle
associated to the ranked array $M_{(i)}$.
Suppose that
the particle $X_{(i)}$
associated to $M_{(i)}$ has size 3.
That makes $s_i = 3$; note
that $X_{(j)}$, for instance, may have a different size and
consequently a different $s_j$
value. For the $X_{(i)}$
above, however,
as $s_i = 3$
the first 3
indices from the second row of $M_{(i)}$
would be
selected to compose the new particle position. Based on the ranked array $M_{(i)}$
given above and $s_i = 3$, the
indices (attributes) 5, 2, and 4 would be selected to compose the new particle
position, that is, $X_{(i)} = \{5, 2, 4\}$. Note
that indices that have a higher proportional likelihood are, on average, more
likely to be selected.
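The three sampling operations (random multiplication, ranking, and selection of the first indices) can be sketched in Python as follows. The function name and the dictionary representation of the array are assumptions made for illustration; `size` stands for the fixed size of the particle being resampled.

```python
import random

def sample_new_position(likelihoods, size, rng):
    """Multiply each likelihood by a fresh uniform number, rank, keep the top indices."""
    noisy = {index: value * rng.random() for index, value in likelihoods.items()}
    ranked = sorted(noisy, key=noisy.get, reverse=True)  # indices by decreasing value
    return set(ranked[:size])

# Hypothetical usage: a particle of size 2 resampled from five equal likelihoods.
new_position = sample_new_position({d: 1.0 for d in range(1, 6)}, 2, random.Random(0))
```

Because each likelihood is scaled by its own random number, an index with a higher likelihood is more likely, but not guaranteed, to rank among the first positions.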
The updating of $X_{(i)}$, $B_{(i)}$, and G follows what is described in Section
4.2.
Now that the algorithms have been explained, the next
section briefly introduces the particular data sets (case studies) used to test
the algorithms.
6. Case Study: the GPCR and Enzyme Data Sets Used in the Computational Experiments
The experiments involved 6 data sets comprising two
kinds of proteins, namely, G-protein-coupled receptors (GPCRs) and Enzymes.
The G-protein-coupled receptors (GPCRs) are a protein
superfamily of transmembrane receptors. Their function is to transduce signals
that induce a cellular response to the environment. GPCRs are involved in many
types of stimulus-response pathways, from intercellular communication to
physiological senses. GPCRs are of much interest to the pharmaceutical industry
because these proteins are involved in many pathological conditions—it is
estimated that GPCRs are the target of 40% to 50% of modern medical drugs [22].
Enzymes are proteins that
accelerate chemical reactions—they participate in many processes in a
biological cell. Some enzymes are used in the chemical industry and other
industrial applications where extremely specific catalysts are required. In
Enzyme Nomenclature, enzymes are assigned and identified by an Enzyme Commission
(EC) number. For instance, EC 2.3.4 is an enzyme with class value 2 in the
first hierarchical class level, class value 3 in the second class level, and so forth.
This work uses the GPCRs and EC data sets described in Table 2.
Table 2: GPCR and EC data sets. "Cases" represents the
number of proteins in the data set, "Attributes" represents the total
number of attributes that describe the proteins in
the data set, and "L1", ..., "L4" represent the number of classes at
hierarchical class levels 1, ..., 4, respectively.
These
data sets were derived from the data sets used in [23, 24]. Note that both the GPCR
and the Enzyme data sets have hierarchical classes. Each protein in these data
sets is assigned one class at the first (top) hierarchical level, corresponding
to a broad function, another class at the second level, corresponding to a more
specialized function, and another class at the third level, corresponding to an
even more specialized function, and so forth. This work copes with these hierarchical
classes in a simple way by predicting classes one level at a time, as explained
in more detail later.
The data sets used in the experiments involved four
kinds of protein signatures (biological "motifs"), namely, PROSITE
patterns, PRINTS fingerprints, InterPro entries, and Pfam signatures.
PROSITE is a database of protein families and domains.
It is based on the observation that, while there is a huge number of different
proteins, most of them can be grouped, on the basis of similarities in their
sequences, into a limited number of families (a protein consists of a sequence
of amino acids). PROSITE patterns are essentially regular expressions
describing small regions of a protein sequence which present a high sequence
similarity when compared to other proteins in the same functional family.
In the data sets, the absence of a given PROSITE
pattern is indicated by a value of 0 for the attribute corresponding to that
PROSITE pattern. The presence of it is indicated by a value of 1 for that same
attribute.
PRINTS is a compendium of protein fingerprints. A
fingerprint is a group of conserved motifs used to characterize a protein
family. In the PRINTS data sets, a fingerprint corresponds to an attribute. The
presence of a fingerprint is indicated by a value of 1 for that same attribute;
the absence by a 0.
Pfam signatures are produced by hidden Markov models,
and InterPro integrates a number of protein signature databases into a single
database. In this work, Pfam and InterPro entries also correspond to binary
attributes indicating whether or not a protein matches those entries, using the
same encoding described for the PROSITE patterns and PRINTS fingerprints.
The objective of the binary PSO and DPSO algorithms is
to classify each protein into its most suitable functional class level. The
classification of the proteins is performed in each class level individually.
For instance, given a protein, at first, a conventional "flat" classification algorithm
assigns a class to the protein at the first
class level only. Once the protein has been
classified at the first class level, the conventional flat classification
algorithm is again applied to assign a class to the protein at the second
level; no information about the protein's class at the
previous level is used. The same process is used to assign a class to the protein
at the third
class level, and so forth.
7. Experiments
The quality of
a candidate solution (fitness) is evaluated in three different ways: (1) by a
baseline algorithm—using all possible attributes; (2) by the binary PSO—using
only the attributes selected by this algorithm; and (3) by the discrete PSO
(DPSO) algorithm—using only the attributes selected by this algorithm. Each
of these algorithms computes the fitness of every given solution using two
distinct techniques: (a) using a Naive Bayes classifier; and (b) using a
Bayesian network.
7.1. Experimental Methodology
Note that the
computation of the fitness function for the
particles of the binary PSO
algorithm and of the DPSO
algorithm follows the description given below. For simplicity, only the
process for one kind of particle is described, but the same is applicable to the other.
The fitness of a particle is equal to
the predictive accuracy achieved by the Naive Bayes classifier (or the
Bayesian network) on each data set using only the attributes selected in
that particle.
The measurement of fitness follows a
wrapper approach. The wrapper approach searches for an optimal attribute subset
tailored to a particular algorithm, such as the Naive Bayes classifier or
Bayesian network. For more information on wrapper and other attribute selection
approaches, see [25].
The computational experiments involved a 10-fold
crossvalidation method [25]. First, the data set being considered is divided into
10 equally sized folds. The folds are randomly generated but under the
following criterion. The proportion of classes in every single fold must be
similar to the proportion of classes found in the original data set containing
all records. This is known as stratified crossvalidation.
Each of the 10 folds is used once as a test set and
the remainder of the data is used as the training set. Out of the 9 folds in the
training set, one is reserved to be used as a validation set. The Naive Bayes
classifier and the Bayesian network use the remaining 8 folds to compute the
probabilities required to classify new examples. Once those probabilities have
been computed, the Naive Bayes (NB) classifier and the Bayesian network (BN)
classify the examples in the validation set.
The accuracy of this classification on the validation
set is the value of the fitness function, for both classifiers and for the
particles of both the binary PSO and the DPSO algorithms. When the run
of the PSO algorithm is completed, the 9 folds are merged into a full training
set. The Naive Bayes classifier and the Bayesian network are then trained again
on this full-training set (9 merged folds), and the probabilities computed in
this final, full-training set are used to classify examples in the test set
(the 10th fold), which was never accessed during the run of the algorithms.
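The data partitioning described above can be sketched in Python as follows. The function names are hypothetical, the round-robin dealing is only one simple way of obtaining stratified folds, and the choice of which training fold serves as the validation set is an assumption.

```python
import random
from collections import defaultdict

def stratified_folds(labels, num_folds, rng):
    """Assign example indices to folds so class proportions stay similar across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(num_folds)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for index in members:  # deal examples round-robin, class by class
            folds[position % num_folds].append(index)
            position += 1
    return folds

def split_for_iteration(folds, test_fold, validation_fold):
    """Return (build, validation, test) index lists for one crossvalidation iteration."""
    build = [i for f, fold in enumerate(folds)
             if f not in (test_fold, validation_fold) for i in fold]
    return build, folds[validation_fold], folds[test_fold]

# Toy usage: 30 examples of two made-up classes, 10 folds, fold 9 held out for testing.
labels = ["a"] * 20 + ["b"] * 10
folds = stratified_folds(labels, 10, random.Random(0))
build, validation, test = split_for_iteration(folds, test_fold=9, validation_fold=8)
```

During the PSO run only `build` and `validation` would be used; once the run ends, the two are merged for the final training and `test` is consulted a single time.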
The reasons for having separate validation and test
sets are as follows. In the classification task of data mining, by definition,
the goal is to measure predictive accuracy—generalization ability—on a
test set unseen during training. Hence, the test set cannot be accessed by the
PSO, and is reserved just to compute the predictive accuracy associated with
the Bayesian classifier constructed with the best set of attributes selected at
the end of the PSO run.
The validation set, which is used to
compute the fitness of particles during the PSO run, is a part of the
original training set that is kept separate from the part of the training set used
to build the Bayesian classifier. The reason for having these two separate
parts of the training set is to avoid overfitting of the classifier to the
training data; for overfitting in the context of classification, see [7, pages 17, 18]. If
the same training set that was used to build a Bayesian classifier
were also used to measure the fitness (accuracy) of the corresponding particle,
there would be no pressure to build classifiers with good generalization
ability on data unseen during training, and a classifier could obtain a high
accuracy simply by being overfitted to idiosyncrasies of the training set that
are unlikely to generalize to unseen data. Measuring fitness on a
validation set separated from the data used to build the classifier avoids
this and introduces into the fitness function a pressure to build classifiers
with good generalization ability.
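To make the role of the validation set concrete, the following sketch evaluates a candidate attribute subset by building a classifier on one part of the training data and scoring it on the separate validation part. It is only an illustration under stated assumptions: the classifier is a small categorical Naive Bayes with Laplace smoothing written for this example, the fitness is the plain fraction of correct predictions, and the names `train_naive_bayes`, `predict`, and `fitness` as well as the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels, attrs):
    """Count class and attribute-value frequencies on the selected attributes."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for row, label in zip(rows, labels):
        for a in attrs:
            value_counts[(a, label)][row[a]] += 1
            values_seen[a].add(row[a])
    return class_counts, value_counts, values_seen


def predict(model, row, attrs):
    """Return the class with the highest log-posterior score."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        for a in attrs:
            # Laplace smoothing (an assumption of this sketch).
            num = value_counts[(a, label)][row[a]] + 1
            den = class_counts[label] + len(values_seen[a])
            score += math.log(num / den)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def fitness(attrs, build, validation):
    """Accuracy on the validation part of a classifier built on the build part."""
    build_rows, build_labels = build
    val_rows, val_labels = validation
    model = train_naive_bayes(build_rows, build_labels, attrs)
    hits = sum(predict(model, r, attrs) == y
               for r, y in zip(val_rows, val_labels))
    return hits / len(val_labels)


# Hypothetical data: attribute 0 determines the class, attribute 1 is noise.
build = ([(0, 0), (0, 1), (1, 0), (1, 1)], ["n", "n", "p", "p"])
validation = ([(0, 1), (1, 0), (0, 0), (1, 1)], ["n", "p", "n", "p"])

print(fitness([0], build, validation))  # informative attribute only
print(fitness([1], build, validation))  # noise attribute only
```

A particle that selects only the noise attribute is penalized because its score is measured on records that played no part in estimating the probabilities.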
In each of the 10 iterations of the crossvalidation
procedure, the predictive accuracy of the classification is assessed by 3
different methods, as follows.
(1) Using all possible original attributes: all possible attributes are used by the Naive
Bayes classifier and the Bayesian network; there is no attribute selection.
(2) Standard binary PSO algorithm: only the attributes selected by the best particle
found by the binary PSO algorithm are used by the Naive Bayes classifier and
the Bayesian network.
(3) DPSO algorithm: only the attributes selected by the best particle found by the
DPSO algorithm are used by the Naive Bayes classifier and the Bayesian network.
Since the Naive Bayes and Bayesian network classifiers
used in this work are deterministic, only one run—for each of these
algorithms—is performed for the classification using all possible
attributes.
For the binary PSO and the DPSO algorithms, 30
independent runs are performed for each iteration of the crossvalidation
procedure. The results reported are averaged over these 30 independent runs and
over the 10 iterations of the crossvalidation procedure.
The population size used for both algorithms (binary
PSO and DPSO) is 200 and the search stops after 20 000 fitness evaluations—or 100 iterations.
The binary PSO algorithm uses an inertia weight value
of 0.8. The choice of the value of this parameter was based on the work
presented in [26].
Other choices of parameter values for the DPSO were
,
, and
, chosen
based on empirical experiments but probably not optimal values.
The measurement of the predictive accuracy rate of a
model should be a reliable estimate of how well that model classifies the test
examples—unseen during the training phase—on the target problem.
In Data Mining, the accuracy rate of a classifier is typically assessed with the
equation
$$\text{accuracy rate} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{16}$$
where $TP$, $TN$, $FP$, and $FN$ are the numbers
of true positives, true negatives, false positives, and false negatives,
respectively [25].
However, if the class distribution is highly
unbalanced, (16) is an ineffective way of measuring the accuracy rate
of a model. For instance, in many problems, it is easy to achieve a high value
for (16) by
always predicting the majority class. Therefore, in the experiments
reported in this work, a more demanding measurement for the accuracy rate of a
classification model is used.
This measurement has been used before in [27]. It is given by the
equation
$$\text{predictive accuracy rate} = TPR \times TNR, \tag{17}$$
where $TPR = TP/(TP + FN)$ and $TNR = TN/(TN + FP)$; $TPR$ stands for true
positive rate and $TNR$ stands for true negative rate.
Note that if any of the quantities $TPR$ or $TNR$ is zero, the
value returned by (17) is also zero.
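The difference between the two measurements is easy to see on a small unbalanced example. The sketch below assumes that (16) is the usual ratio of correct predictions to all predictions and that (17) is the product of the true positive rate and the true negative rate; the function names, the zero-denominator guard, and the counts are hypothetical choices made for the illustration.

```python
def accuracy_rate(tp, tn, fp, fn):
    """Measurement (16): fraction of all examples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def tpr_times_tnr(tp, tn, fp, fn):
    """Measurement (17): true positive rate times true negative rate."""
    # Guard against empty classes (an assumption of this sketch).
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr * tnr


# Hypothetical unbalanced problem: 95 negatives and 5 positives, and a
# classifier that always predicts the majority (negative) class.
tp, tn, fp, fn = 0, 95, 0, 5

acc = accuracy_rate(tp, tn, fp, fn)
prod = tpr_times_tnr(tp, tn, fp, fn)
print(acc)   # 0.95
print(prod)  # 0.0
```

The majority-class predictor looks strong under (16) but scores zero under (17), because it never identifies a positive example.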
7.2. Discussion
Computational
results are reported in Tables 5 and 6. Let us focus the discussion on the
results obtained by the 3 algorithms (binary PSO, DPSO, and baseline algorithm)
for attribute selection on the GPCR-PROSITE data set (see Table 5). The results
obtained for the other 5 data sets are similar. To start with, the results
obtained using the Naive Bayes classifier are presented.
Results obtained using the Naive Bayes classifier approach
To assess the performance of the algorithms, two
criteria were considered: (1) maximizing predictive accuracy; and (2) finding
the smallest subset of attributes.
The results for the first criterion, accuracy, show
that both versions of the PSO algorithm did better, in all class levels, than the baseline algorithm using all attributes.
Furthermore, the DPSO algorithm did slightly better
than the binary PSO algorithm, also in all class levels. Nevertheless, the
difference in predictive accuracy between these algorithms is,
in some cases, not statistically significant.
Table 3 shows the results of a paired two-tailed
t-test for the
predictive accuracy of the binary PSO versus the predictive accuracy of the
DPSO at a significance level of 0.05.
Table 3: Predictive accuracy: binary PSO versus DPSO. Paired two-tailed t-test for the
predictive accuracy, significance level 0.05.
Table 3 shows that, using Naive Bayes as classifier,
the only statistically significant difference in performance—in terms of
predictive accuracy—between the algorithms binary PSO and DPSO is at the
third class level. By contrast, using Bayesian networks as classifier, the
difference in performance is statistically significant at all class levels.
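For readers unfamiliar with the test, a paired two-tailed t-test at the 0.05 level can be sketched as follows. This section does not state which samples were paired, so the example assumes one accuracy value per crossvalidation fold for each algorithm; the numbers are invented purely for illustration and are not results from this work. The critical value 2.262 is the standard two-tailed value for 9 degrees of freedom at the 0.05 level.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return the paired t statistic for a - b and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical per-fold accuracies (percent) for the two algorithms.
binary_pso = [70.1, 71.3, 69.8, 72.0, 70.5, 71.1, 69.9, 70.7, 71.8, 70.2]
dpso = [71.0, 71.9, 70.9, 72.4, 71.6, 71.5, 70.8, 71.9, 72.3, 71.1]

t, df = paired_t_statistic(dpso, binary_pso)

# Two-tailed critical value of Student's t for df = 9 at the 0.05 level.
T_CRIT_DF9 = 2.262
significant = abs(t) > T_CRIT_DF9
print(df, round(t, 2), significant)
```

Pairing the measurements removes the fold-to-fold variation that both algorithms share, so the test looks only at the per-fold differences.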
Nevertheless, the discriminating factor between the
performance of these algorithms is on the second comparison criterion—finding
the smallest subset of attributes.
The DPSO not only outperformed the binary PSO in
predictive accuracy, but also did so using a smaller subset of attributes in
all class levels. Moreover, when it comes to effectively pruning the set of
attributes, the difference in performance between the binary PSO and the DPSO
is always statistically significant, as Table 4 shows.
Table 4: Number of selected attributes: binary PSO versus DPSO. Paired two-tailed t-test for the
number of attributes selected, significance level 0.05.
Table 5: Results for the
GPCRs data sets. For the binary PSO and DPSO algorithms, 30 independent runs
are performed. The results reported are averaged over these 30 independent
runs. The best result on each line for each performance criterion is marked
with an asterisk (*).
Table 6: Results for the
EC data sets. For the binary PSO and the DPSO algorithms, 30 independent runs are
performed. The results reported are averaged over these 30 independent runs.
The best result on each line for each performance criterion is marked with an
asterisk (*).
Results obtained using the Bayesian network approach
Again, the
predictive accuracy attained by both versions of the PSO algorithm surpassed
the predictive accuracy obtained by the baseline algorithm in all class levels. The DPSO obtained the best predictive accuracy of all
algorithms in all class levels. Regarding the second comparison criterion,
finding the smallest subset of attributes, the DPSO again selected the
smallest subset of attributes in all hierarchical levels.
The results on the performance of the classifiers—Naive Bayes versus Bayesian networks—show that Bayesian networks did a much
better job. For all class levels, the predictive accuracy obtained by the 3
approaches (baseline, binary PSO and DPSO) using Bayesian networks was significantly
better than the predictive accuracy obtained using the Naive Bayes classifier. The
Bayesian networks approach also enabled the two PSO algorithms to do the job
using fewer selected attributes—compared to the Naive Bayes approach.
The results emphasize the importance of taking
relationships among attributes into account—as Bayesian networks do—when
performing attribute selection. If these relationships are ignored, predictive
accuracy is adversely affected.
The results also show that for all 6 data sets tested,
the DPSO algorithm not only selected the smallest subset of attributes, but
also obtained the highest predictive accuracy in every single class level.
8. Conclusions
Computational
results show that the use of unnecessary attributes tends to derail classifiers
and hurt classification accuracy. Using only a small subset of selected
attributes, the binary PSO and DPSO algorithms obtained better predictive
accuracy than the baseline algorithm using all attributes. Previous work had
already shown that the DPSO algorithm outperforms the binary PSO in the task of
attribute selection [5], but that work involved only one data set. The
current work provides much stronger evidence for the effectiveness of the DPSO, on 6
data sets. In addition, the 6 data sets mined in this work are much more
challenging than the two-class data set mined in [5], because the former have
several hierarchical class levels per data set, leading to a much larger number
of classes to be predicted for each data set.
Even when the difference in predictive accuracy is
not statistically significant, the DPSO selects fewer attributes than the
binary PSO, which improves the computational efficiency of the classifier and
makes the DPSO preferable.
The original work on DPSO [5] questioned whether the
difference in performance between these two algorithms was attributable to
variations in the initial population of solutions. To overcome this possible
advantage/disadvantage for one algorithm or the other, the present work used
the same initial population for both algorithms.
The results demonstrate that, even using an identical
initial population of particles, the DPSO still outperforms the binary PSO
in both predictive accuracy and number of selected attributes. Although the DPSO is
arguably not too different from traditional PSO, it has
features that enable it to improve over the binary PSO on the task of attribute
selection.
Another result—although expected—from the
experiments is the clear difference in performance between Naive Bayes and
Bayesian networks used as classifiers. The Bayesian networks approach
outperformed the Naive Bayes approach in all experiments and in all
hierarchical class levels.
In this work, the hierarchical classification problem
was dealt with in a simple way by “flattening” the hierarchy, that is, by
predicting classes for one class level at a time, which permitted the use of
flat classification algorithms. The algorithms made no use of the information
of the class assigned to a protein in one level to help predict the class at
the next hierarchical level. Future work intends to look at an algorithm that
makes use of this information.
Algorithm 1: Pseudocode for a generic greedy search
algorithm.
Acknowledgments
The authors
would like to thank Nick Holden for kindly providing them with the biological
data sets used in this work. The authors would also like to thank EPSRC (grant Extended
Particle Swarms GR/T11265/01) for financial support.
References
- T. Blackwell and J. Branke, “Multi-swarm optimization in dynamic environments,” in Applications of Evolutionary Computing, vol. 3005 of Lecture Notes in Computer Science, pp. 489–500, Springer, New York, NY, USA, 2004.
- S. Janson and M. Middendorf, “A hierarchical particle swarm optimizer for dynamic optimization problems,” in Proceedings of the 1st European Workshop on Evolutionary Algorithms in Stochastic and Dynamic Environments (EvoCOP '04), vol. 3005 of Lecture Notes in Computer Science, pp. 513–524, Springer, Coimbra, Portugal, April 2004.
- M. Løvbjerg and T. Krink, “Extending particle swarm optimisers with self-organized criticality,” in Proceedings of the Congress on Evolutionary Computation (CEC '02), D. B. Fogel, M. A. El-Sharkawi, X. Yao, et al., Eds., vol. 2, pp. 1588–1593, IEEE Press, Honolulu, Hawaii, USA, May 2002.
- M. M. Solomon, “Algorithms for the vehicle routing and scheduling problems with time window constraints,” Operations Research, vol. 35, no. 2, pp. 254–265, 1987.
- E. S. Correa, A. A. Freitas, and C. G. Johnson, “A new discrete particle swarm algorithm applied to attribute selection in a bioinformatics data set,” in Proceedings of the 8th Annual Conference Genetic and Evolutionary Computation (GECCO '06), M. Keijzer, M. Cattolico, D. Arnold, et al., Eds., pp. 35–42, ACM Press, Seattle, Wash, USA, July 2006.
- E. S. Correa, M. T. A. Steiner, A. A. Freitas, and C. Carnieri, “A genetic algorithm for solving a capacity p-median problem,” Numerical Algorithms, vol. 35, no. 2–4, pp. 373–388, 2004.
- A. A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms, Springer, Berlin, Germany, 2002.
- J. Kennedy and R. C. Eberhart, Swarm Intelligence, Morgan Kaufmann, San Francisco, Calif, USA, 2001.
- E. S. Correa, A. A. Freitas, and C. G. Johnson, “Particle swarm and Bayesian networks applied to attribute selection for protein functional classification,” in Proceedings of the 9th Annual Genetic and Evolutionary Computation Conference (GECCO '07), pp. 2651–2658, London, UK, July 2007.
- T. M. Mitchell, Machine Learning, McGraw-Hill, London, UK, 1997.
- F. V. Jensen, Bayesian Networks and Decision Graphs, Springer, New York, NY, USA, 1st edition, 2001.
- J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, Calif, USA, 1st edition, 1988.
- S. L. Lauritzen and D. J. Spiegelhalter, “Local computations with probabilities on graphical structures and their application to expert systems,” Journal of the Royal Statistical Society, vol. 50, no. 2, pp. 157–224, 1988.
- P. Larrañaga, R. Etxeberria, J. A. Lozano, B. Sierra, I. Inza, and J. M. Peña, “A review of the cooperation between evolutionary computation and probabilistic models,” in Proceedings of the 2nd International Symposium on Artificial Intelligence and Adaptive Systems (CIMAF '99), pp. 314–324, La Havana, Cuba, March 1999.
- J. M. Peña, J. A. Lozano, and P. Larrañaga, “Globally multimodal problem optimization via an estimation of distribution algorithm based on unsupervised learning of Bayesian networks,” Evolutionary Computation, vol. 13, no. 1, pp. 43–66, 2005.
- R. R. Bouckaert, “Properties of Bayesian belief network learning algorithms,” in Proceedings of the 10th Annual Conference on Uncertainty in Artificial Intelligence (UAI '94), I. R. L. de Mantaras and E. D. Poole, Eds., pp. 102–109, Morgan Kaufmann, Seattle, Wash, USA, July 1994.
- D. M. Chickering, D. Geiger, and D. Heckerman, “Learning Bayesian networks is NP-hard,” Tech. Rep. MSR-TR-94-17, Microsoft Research, Redmond, Wash, USA, November 1994.
- J. Kennedy and R. C. Eberhart, “A discrete binary version of the particle swarm algorithm,” in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC '97), vol. 5, pp. 4104–4109, IEEE, Orlando, Fla, USA, October 1997.
- J. Kennedy, “Small worlds and mega-minds: effects of neighborhood topology on particle swarm performance,” in Proceedings of the Congress of Evolutionary Computation, P. J. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, and A. Zalzala, Eds., vol. 3, pp. 1931–1938, IEEE Press, Washington, DC, USA, July 1999.
- G. Kendall and Y. Su, “A particle swarm optimisation approach in the construction of optimal risky portfolios,” in Proceedings of the IASTED International Conference on Artificial Intelligence and Applications, part of the 23rd Multi-Conference on Applied Informatics, pp. 140–145, Innsbruck, Austria, February 2005.
- R. Poli, C. D. Chio, and W. B. Langdon, “Exploring extended particle swarms: a genetic programming approach,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '05), pp. 169–176, ACM Press, Washington, DC, USA, June 2005.
- D. Filmore, “It's a GPCR world,” Modern Drug Discovery, vol. 11, no. 7, pp. 24–28, 2004.
- N. Holden and A. A. Freitas, “Hierarchical classification of G-protein-coupled receptors with a PSO/ACO algorithm,” in Proceedings of the IEEE Swarm Intelligence Symposium (SIS '06), pp. 77–84, IEEE Press, Indianapolis, Ind, USA, May 2006.
- N. Holden and A. A. Freitas, “A hybrid particle swarm/ant colony algorithm for the classification of hierarchical biological data,” in Proceedings of the IEEE Swarm Intelligence Symposium (SIS '05), pp. 100–107, IEEE Press, Pasadena, Calif, USA, June 2005.
- I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
- Y. Shi and R. C. Eberhart, “Parameter selection in particle swarm optimization,” in Proceedings of the 7th International Conference on Evolutionary Programming (EP '98), pp. 591–600, Springer, San Diego, Calif, USA, March 1998.
- G. L. Pappa, A. J. Baines, and A. A. Freitas, “Predicting post-synaptic activity in proteins with data mining,” Bioinformatics, vol. 21, pp. ii19–ii25, 2005.