Bacterial Colony Algorithms for Association Rule Mining in Static and Stream Data

Cunha, Danilo S. da; Xavier, Rafael S.; Ferrari, Daniel G.; Vilasbôas, Fabrício G.; de Castro, Leandro N.

doi:https://doi.org/10.1155/2018/4676258

Mathematical Problems in Engineering

Special Issue

Recent Advances on Swarm Intelligence for Solving Complex Engineering Problems

Research Article | Open Access

Volume 2018 | Article ID 4676258 | https://doi.org/10.1155/2018/4676258

Danilo S. da Cunha,¹ Rafael S. Xavier,¹ Daniel G. Ferrari,¹ Fabrício G. Vilasbôas,¹,² and Leandro N. de Castro¹

Guest Editor: Eric Monfroy

Received 09 May 2018

Revised 25 Sept 2018

Accepted 09 Oct 2018

Published 11 Nov 2018

Abstract

Bacterial colonies perform a cooperative and distributed exploration of the environmental resources by using their quorum-sensing mechanisms. This paper describes how bacterial colony networks and their skills to explore resources can be used as tools for mining association rules in static and stream data. A new algorithm is designed to maintain diverse solutions to the problems at hand, and its performance is compared to that of other well-known bacteria, genetic, and immune-inspired algorithms: Bacterial Foraging Optimization (BFO), a Genetic Algorithm (GA), and the Clonal Selection Algorithm (CLONALG). Taking into account the superior performance of our approach in static data, we applied the algorithms to dynamic environments by converting static into flow data via a stream data model named sliding-window. We also provide some notes on the running time of the proposed algorithm using different hardware and software architectures.

1. Introduction

Bacterial colonies can be seen as complex adaptive systems that perform distributed information processing to solve complex problems, such as food acquisition, swarming mobility, and biofilm formation, among others. They use a collaborative system of chemical signals to explore the resources of a given environment and coordinate their social and behavioural tasks [1]. Bacteria can be found in distinct environments, ranging from hostile to more hospitable ones, and they apply different kinds of survival strategies to process self and environmental stimuli [2].

The collective and collaborative activities carried out by a bacterial colony are classified as a type of collective intelligence [3], where each bacterium is able to sense itself and the environment and maintain communication with other bacteria in the colony to perform its coordinated tasks. This enables the colony to acquire information about the environment and its changes. Thus, a colony can be seen as an adaptive computational system that processes information on different levels, independently of environmental changes [4]. Some important computational properties and collective behaviours of bacteria colonies are shown in [4].

This paper presents an algorithm, named BaCARO-II, inspired by the way a colony of bacteria explores environmental resources. It is extended from [5, 6] for mining association rules of items in transactional databases, and the paper introduces the modifications needed to apply it to data streams. As an outcome of the modifications, the new bacteria algorithm is able to avoid the genic conversion problem discussed in [7].

The bacterial colony algorithm is compared to other bio-inspired heuristics, more specifically the Bacterial Foraging Optimization (BFO) [8], a Genetic Algorithm (GA) [9], and the Clonal Selection Algorithm (CLONALG) [10], which were adapted to perform association rule mining of static and stream data. The following performance measures are accounted for: support (S), confidence (C), interestingness (I), number of rules (U), and processing time (P).

The paper is an extension of [11] and it is organized as follows. Section 2 provides some theoretical background on association rule mining and Section 3 a review of data stream processing models. Section 4 provides the biological foundations of bacterial colonies and Section 5 presents an overview of bacterial algorithms. Section 6 introduces two bacterial algorithms applied to association rule mining in static and dynamic environments. Section 7 shows the experimental results and, finally, the final considerations and future works are provided in Section 8.

The abbreviations used for the algorithms in this research are as follows:
BaCARO-II: Bacterial Colony Association Rule Optimization-II
BFO: Bacterial Foraging Optimization
CLONALG: Clonal Selection Algorithm
GA: Genetic Algorithm
sBaCARO-II: Stream Bacterial Colony Association Rule Optimization-II
sBFO: Stream Bacterial Foraging Optimization Algorithm
sCLONALG: Stream Clonal Selection Algorithm
sGA: Stream Genetic Algorithm

2. On Association Rule Mining and Data Streams

This section provides a brief review of the two main concepts covered in this paper: association rule mining and data streams.

2.1. Association Rule Mining

Originally known as market-basket analysis, mining association rules is one of the main data mining tasks. It is a descriptive task, which uses unsupervised learning and focuses on the identification of associations between items that occur together in a dataset [12–15]. A transaction is a set of items that occur together. In the scenario described in the original market-basket analysis, items in a transaction are those that are acquired together by an end user [14, 15]. An association rule is as follows:

$$A \Rightarrow C,$$

where A and C are itemsets of products selected by a consumer.

The first set A is called the antecedent and the other one C is called the consequent of the association rule. The intersection between these two sets is empty ($A \cap C = \emptyset$), because it is redundant for an item to imply itself. The rule means that the presence of (all items in) A in a transaction implies the presence of (all items in) C in the same transaction with some associated probability [13, 15].

Given a set of transactions T, it is interesting to generate all rules that satisfy two types of constraints:
(i) Syntactic constraints: the number of items that appear in a rule is limited.
(ii) Support constraints: these delimit the number of transactions in T that support the rule, with support, usually an input parameter, being defined as the number of transactions in T that contain A and C simultaneously.

The problem with the previous definition is that the number N of possible association rules, given a number d of items, grows exponentially, and the problem is placed within the NP-complete set [12, 13, 15]:

$$N = 3^{d} - 2^{d+1} + 1.$$

To illustrate how this scales, Figure 1 shows the value of N for growing values of d.

Therefore, it is not computationally feasible to generate all rules for fairly large datasets in a reasonable time. Thus, it is compulsory to somehow prune the association rules built before trying to analyse their real usefulness.
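
As a rough check on this growth, the sketch below enumerates every rule with a non-empty antecedent and a non-empty, disjoint consequent for small d, and compares the count with the closed form $3^d - 2^{d+1} + 1$ that is commonly given for this quantity. The function names are illustrative and do not come from the paper.

```python
from itertools import combinations


def count_rules_by_enumeration(d):
    """Count rules A => C over d items, with A and C non-empty and disjoint."""
    items = list(range(d))
    total = 0
    for size_a in range(1, d):
        for antecedent in combinations(items, size_a):
            rest = [i for i in items if i not in antecedent]
            # every non-empty subset of the remaining items is a valid consequent
            for size_c in range(1, len(rest) + 1):
                total += sum(1 for _ in combinations(rest, size_c))
    return total


def count_rules_closed_form(d):
    """Closed form assumed here for the number of possible rules."""
    return 3 ** d - 2 ** (d + 1) + 1


if __name__ == "__main__":
    for d in range(1, 9):
        print(d, count_rules_by_enumeration(d), count_rules_closed_form(d))
```

The enumeration itself becomes slow after a few dozen items, which is the practical face of the exponential growth described above.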

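Measures of Interest. The confidence and support, proposed in [12, 13], are the most studied and applied measures of interest in the association rule mining literature. The support of an association rule is a measure of its relative frequency in the set of all transactions:

$$\mathrm{support}(A \Rightarrow C) = \frac{\sigma(A \cup C)}{|T|},$$

where $\sigma(X)$ counts the transactions in T that contain every item of the itemset X.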

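On the other hand, the confidence of a rule is a measure of its satisfiability or strength when its antecedent is found in T, that is to say, from all the occurrences of A, how often C also occurs in the base:

$$\mathrm{confidence}(A \Rightarrow C) = \frac{\sigma(A \cup C)}{\sigma(A)},$$

with $\sigma(X)$ denoting the number of transactions in T that contain the itemset X.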

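While confidence is a measure of the strength of a rule, the support corresponds to its statistical significance over the database. The interestingness of a rule, $I(A \Rightarrow C)$, is calculated as follows [14]:

$$I(A \Rightarrow C) = \frac{\sigma(A \cup C)}{\sigma(A)} \times \frac{\sigma(A \cup C)}{\sigma(C)} \times \left(1 - \frac{\sigma(A \cup C)}{T}\right),$$

where A and C are defined as previously, $\sigma(X)$ is the number of transactions that contain the itemset X, and T is the number of transactions in the database. This measure of interest, differently from the support, looks for low frequency rules in the database.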
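
The three measures can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: transactions are assumed to be Python sets, the helper names are hypothetical, and the interestingness function assumes the product form (confidence, times the share of C's occurrences covered by the rule, times one minus the relative support); treat that form as an assumption if it differs from the definition in [14].

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support(transactions, antecedent, consequent):
    """Relative frequency of transactions containing both A and C."""
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of the transactions containing A that also contain C."""
    n_a = count_containing(transactions, antecedent)
    if n_a == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / n_a


def interestingness(transactions, antecedent, consequent):
    """Assumed product form; favours rules that are strong but not frequent."""
    n_a = count_containing(transactions, antecedent)
    n_c = count_containing(transactions, consequent)
    n_ac = count_containing(transactions, set(antecedent) | set(consequent))
    if n_a == 0 or n_c == 0:
        return 0.0
    return (n_ac / n_a) * (n_ac / n_c) * (1 - n_ac / len(transactions))


if __name__ == "__main__":
    baskets = [{"bread", "milk"}, {"bread", "butter"},
               {"bread", "milk", "butter"}, {"milk"}]
    print(support(baskets, {"bread"}, {"milk"}))
    print(confidence(baskets, {"bread"}, {"milk"}))
    print(interestingness(baskets, {"bread"}, {"milk"}))
```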

The Apriori Algorithm. The most well-known algorithm for association rule mining is called Apriori [13] and has the following main steps:
(i) Generate frequent itemsets: a set of frequent items is the one whose support is greater than or equal to a minimum support threshold (minsup).
(ii) Generate reliable association rules: the reliable association rules are those with a confidence value equal to or greater than a minimum confidence value (minconf).

A set of items of length k, i.e., with k items, is called a k-itemset. The Apriori algorithm was named after its use of prior knowledge (a priori) of the frequent itemsets already found when generating larger frequent itemsets. This feature relies on the downward closure property: every subset of a frequent itemset is also frequent.

The algorithm performs multiple scans over the database. In the first scan it computes the frequency of each item. After keeping those items whose frequency is equal to or greater than minsup, it checks whether each frequent item i_x occurs in conjunction with item i_{x+1} and whether their joint frequency is greater than or equal to minconf. At each new iteration on the data, the algorithm stores, incrementally, only those frequent items that satisfy minsup and minconf. Therefore, Apriori-based algorithms are not suitable for a data stream environment, in which data can be scanned only once [16].
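
The level-wise search for frequent itemsets can be sketched as follows. This is a simplified illustration rather than the Apriori implementation of [13]: `minsup` is assumed to be a relative frequency, the function name is hypothetical, and the rule-generation step with minconf is omitted. Each call to the inner `support` helper is one pass over the data, which makes the repeated scans explicit.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Level-wise search for itemsets whose relative support is >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # one full pass over the data per candidate
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(s) >= minsup}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        # join step: unite pairs of frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # prune step: drop candidates that have an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, k))}
        level = {c for c in candidates if support(c) >= minsup}
        k += 1
    return frequent


if __name__ == "__main__":
    baskets = [{"bread", "milk"}, {"bread", "butter"},
               {"bread", "milk", "butter"}, {"milk"}]
    for itemset, sup in frequent_itemsets(baskets, 0.5).items():
        print(sorted(itemset), sup)
```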

3. Data Streams

A sequence of objects that arrive in temporal order is named a data stream [17, 18]. Differently from traditional static data, data streams are continuous, unbounded, and of high speed, and their data distribution changes with time. Data streams can be classified in two main classes: offline streams and online streams. An offline stream is characterized by regular bulk arrivals, while an online stream is characterized by real-time updated data that arrive one after the other in time. Unlike offline data streams, bulk data processing is not possible for online stream data [19]. As the number of applications over data streams grows rapidly, there is an increasing need to perform data stream mining tasks, such as classification, clustering, and association rule mining, on stream data.

There are three major stream data processing models for rule mining [20]:
(i) Landmark model: it mines all frequent itemsets over the entire log of stream data from a fixed point in time, named landmark, to the current one. This simple model is not suitable for applications where the user is interested in the most recent information of data streams.
(ii) Damped model: also named time-fading model, it finds frequent itemsets in stream data in which each transaction has a weight that decreases with time. Older transactions contribute a smaller weight to itemset frequencies, i.e., new and old transactions have different weights.
(iii) Sliding-window model: it finds and maintains frequent itemsets in sliding-windows. Only the part of the data stream within the sliding-window is stored and processed at a time while the data flows in. The sliding-window size is defined based on the application and system resources. The result depends on the recently generated transactions in the window range.
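
To make the difference between the landmark and damped models concrete, the sketch below keeps weighted item counts over a stream. It is only an illustration under stated assumptions: the exponential decay factor `decay` and the function name are not taken from [20], and single items are counted instead of itemsets to keep the example short. With `decay=1.0` every transaction since the start keeps its full weight, which corresponds to landmark-style counting.

```python
def damped_item_counts(stream, decay=0.9):
    """Weighted item counts; a transaction `age` steps old weighs decay**age."""
    counts = {}
    for transaction in stream:
        # ageing: every existing count fades before the new transaction is added
        for item in counts:
            counts[item] *= decay
        for item in transaction:
            counts[item] = counts.get(item, 0.0) + 1.0
    return counts


if __name__ == "__main__":
    stream = [{"a"}, {"b"}, {"a"}]
    print(damped_item_counts(stream, decay=0.5))  # recent items dominate
    print(damped_item_counts(stream, decay=1.0))  # landmark-style plain counts
```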

All three approaches have been used in different studies on data stream mining. Selecting which stream data processing model to use largely depends on the application demands. The three approaches are summarized in Figure 2.

Some data stream applications involving association rule mining include estimating missing data in sensor networks [21]; predicting the frequency of Internet packet streams [22]; finding alarm incidents from streams [23]; determining frequent itemsets over online data streams [24]; and association analysis [25–27].

Open Problems in Data Stream Association Rule Mining. Despite the many applications, these tools are focused on specific areas, and none of them fully deal with the main open issues in data stream association rule mining [16]:
(i) There is not enough time to rescan the whole database or to perform a multiscan, as in traditional data mining algorithms.
(ii) The data stream mining method needs to adapt to the data distribution, i.e., avoid the drifting problem [28].
(iii) The speed of the mining algorithm should be faster than the data arrival rate.
(iv) Due to the stream properties, the analysis results of data streams often keep changing as well.
(v) A mining mechanism that adapts itself to the available resources is needed.

4. Some Notes on Bacterial Colonies

Bacterial colonies have different behavioural patterns, including foraging, reproduction, communication, sporulation, and motility [29, 30]. They perform a distributed and parallel information processing and each bacterium is an autonomous system capable of sending, storing, processing, and interpreting information. This gives the bacterium a certain freedom to choose its response according to the messages received as part of the chemical distributed processing of information from the colony.

Bacterial communication occurs via chemical signals. The main entities around this communication are the signalling cell, the target cell, the signal molecule, and the receiver protein. The signalling cell sends the chemical signal, presented by the signal molecule, to one or more target cells. The target cells read the message contained in the signalling molecule via protein receptors and then send the message to the intracellular gel. The signalling molecule does not enter the bacteria; the responsible one for decoding and sending each message to the intercellular plasma is the receiver protein [31].

The most studied bacterial communication process in the literature is quorum-sensing, which depends on the concentration of a diffusible molecule called autoinducer [32, 33] and works only in a high-density colony. The concentration of autoinducers in the environment increases with the number of cells that produce them, thus promoting the activation or suppression of the expression of genes that are responsible for generating certain behaviours in bacteria. Quorum-sensing works as a micro and macro communication mechanism. At the micro level, in the intracellular communication network, a bacterium analyses and interprets the data read from the environment. The macro level information processing is represented by the biochemical interactions of the colony, which correspond to the extracellular communication.

The motion patterns that bacteria generate in the presence of chemical attractants and repellents, named taxes, are called chemotaxis. Bacteria can move by swimming, which means moving in the same direction; a bacterium that performs successive swimming steps is said to be running, and one that moves in a random direction is said to be tumbling. Swimming and tumbling (chemotactic behaviour) are individual and stochastic responses that result in emergent global responses, such as swarming.

Reproduction in bacteria is performed after some chemotaxis steps. The bacteria fitness is used to select those who will die, and the survivors are divided into two new bacteria placed in the same direction. In other words, the survivors are cloned via asexual reproduction, and the clones stay in the same region as their parents.

5. Bacterial Colony Algorithms: BFO and BaCARO-II

There are currently a number of bacteria-inspired algorithms. The pioneering proposal was the Bacterial Chemotaxis Algorithm (BCA) [34], and bacterial foraging behaviours have been used as inspiration for the design of other algorithms, such as the Bacterial Foraging Optimization (BFO) Algorithm [8], Bacterial Colony Optimization (BCO) [35], and Bacterial Colony Association Rule Optimization (BaCARO) [5, 6]. This section describes BFO, which is one of the most well-known proposals in the literature, and a version of our approach, named BaCARO-II. The nomenclature of the parameters used by the algorithms is as follows:
P: population of candidate solutions
Bac_num: number of bacteria in a population
N_ed: number of elimination and dispersal steps
N_re: number of reproduction steps
N_c: number of chemotactic steps
N_s: number of swim steps
P_ed: probability of elimination-dispersal
The remaining BaCARO-II parameters are the probability of intracellular communication, the probability of extracellular communication, the probability of changing information, and the extracellular network size.

5.1. The Bacterial Foraging Optimization Algorithm: BFO

The Bacterial Foraging Optimization (BFO) algorithm simulates the foraging strategy of Escherichia coli and was originally designed to solve optimization problems in continuous environments. It takes inspiration from the following biological mechanisms [8, 36]: chemotaxis, reproduction, elimination, and dispersion.

Algorithm 1 summarizes the main steps of the BFO algorithm for solving a minimization task. It starts by initializing all the input parameters: a colony P with Bac_num bacteria of the same dimension as the problem to be solved; number of elimination and dispersal steps (N_ed); number of reproduction steps (N_re); number of chemotactic steps (N_c); number of swim steps (N_s); the elimination-dispersal probability (P_ed); and number of bacteria to be selected for reproduction (S_r).

procedure [P] = BFO(Bac_num, N_ed, N_re, N_c, N_s, P_ed, S_r)
  initialize P(Bac_num)
  for l = 0 to N_ed do // Elimination-dispersal loop
    for k = 0 to N_re do // Reproduction loop
      for j = 0 to N_c do // Chemotaxis loop
        Apply chemotaxis
        foreach Bacterium in P do
          if Fitness(Bacterium) ≥ Fitness(Bacterium_best) then
            Bacterium_best ← Bacterium
          end if
        end foreach
      end for // Chemotaxis
      P ← SortByCellFitness(P, S_r)
      P ← Clone(P)
    end for // Reproduction
    foreach Bacterium in P do
      if Random() ≤ P_ed then
        Bacterium ← BacteriumAtRandLocation()
      end if
    end foreach
  end for // Elimination-dispersal
  return Bacterium_best
end procedure

The algorithm first applies chemotaxis and reproduction until their thresholds are reached and then follows with elimination-dispersal. During reproduction a bacterium is cloned (duplicated) with no mutation. During chemotaxis, the health (fitness) of each bacterium is assessed and a number S_r of the healthiest ones are cloned, while the others are removed from the population. Bacteria are then allowed to swim for a number of swim steps (N_s), moving to different locations. If the new location results in improved (healthier) bacteria, then they keep swimming in the same direction; otherwise they tumble, exploring other regions of the search space. Finally, bacteria can survive or be removed from the population with probability P_ed. Whenever a bacterium is eliminated, another one is generated in a random position (dispersal).

BFO is the bacteria-inspired algorithm most extensively applied to solve problems in different areas [37, 38], such as global optimization [39], engineering design [40], power systems [41–43], optimal design [44], network planning [45], and data analysis [46–48].

5.2. The Bacterial Colony Association Rule Optimization Algorithm: BaCARO-II

The algorithm named Bacterial Colony Association Rule Optimization-II (BaCARO-II) is inspired by the biological processes of intra- and extracellular communication networks of bacterial colonies, as well as quorum-sensing, chemotaxis, and bacterial dispersion [1, 49]. In BaCARO-II, intracellular communication [50] is used to search for better gene rearrangements so that bacteria present a higher fitness, and extracellular communication is used to coordinate bacterial motility over the search space. Quorum-sensing is applied to evaluate the neighbourhood and use the synergy of individual and collective decisions, and chemotaxis is used to make fine adjustments during intracellular communication: if the new gene arrangement (position in the search space) is worse than the previous one, it can be undone. Finally, dispersion promotes the movement of bacteria away from regions with high concentrations of bacteria.

BaCARO-II starts by initializing a random colony of size equal to the search-space dimension. The artificial colony is evaluated and each bacterium has a given probability of performing intracellular communication. The bacteria randomly selected to perform intracellular communication reconfigure their gene expression and, if the new rearrangement is better than the previous one, the new one is adopted. The colony fitness is updated and the extracellular step begins. Each bacterium starts to perceive its neighbourhood, and those in the same region disperse to new regions. Those that are not occupying dense regions are selected, with the probability of extracellular communication, together with a number of surrounding bacteria given by the extracellular network size; they exchange information with their neighbours according to the probability of changing information and move in the best direction. After that, fitness is computed. Finally, the colony is confronted with an environmental pressure that leads to the selection of the bacteria with the highest fitness values for the next generation. The synergy of intracellular and extracellular communication results in quorum-sensing, which is the core of most bacterial algorithms. The pseudocode of BaCARO-II is summarized in Algorithm 2.

procedure [P] = BaCARO-II(,,)
  initialize P
  t ← 1
  f ← evaluate(P)
  while not_stopping_criterion do
    for i = 0 to Size(P) do // Intracellular communication loop
      rf ← inCellular(P,,)
      f ← update(f, rf)
    end for
    for j = 0 to Size(P) do // Extracellular communication loop
      exCellular()
      Foreach in each do
        if bacterialDensity() == true then // Quorum-sensing
          P ← disperse() // Dispersion
        else
          Foreach in do
            MoveToBestDirectionInExtracellularNetwork()
          end Foreach
        end if
      end Foreach
    end for
    P
    f ← evaluate(P)
    P ← select(P, f)
    t ← t+1
  end while
end procedure

6. Bacterial Colonies in Association Rule Mining

This section describes how the different bacteria-inspired algorithms were adapted to solve association rule mining problems in static and dynamic environments. As presented in the previous section, BFO takes into account reproduction, chemotaxis (tumbling and swimming), and elimination-dispersal mechanisms. By contrast, BaCARO-II uses chemotaxis (tumbling and swimming), intra- and extracellular communication, and dispersion. These mechanisms will be presented here so that both algorithms can be applied to solve association rule mining tasks.

6.1. Encoding Scheme

Instead of initializing the agents in a real-valued interval, we randomly set them as pairs of binary values (00, 01, 10, or 11) for each vector position. A pair of bits represents each item in a transaction, where items present in the association rule are represented by a bit pair of 00 (antecedent of the rule) or 11 (consequent of the rule). Items out of a rule are represented by the other combinations: 01 or 10. Figure 3 illustrates an artificial bacterium encoding an association rule.
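
A minimal sketch of this encoding is given below. The function and constant names are hypothetical, and representing each bit pair as a Python tuple is an implementation choice made here for readability.

```python
import random

ANTECEDENT = (0, 0)  # bit pair 00: item belongs to the antecedent
CONSEQUENT = (1, 1)  # bit pair 11: item belongs to the consequent
# the pairs (0, 1) and (1, 0) mark items that are not part of the rule


def random_bacterium(n_items, rng=random):
    """One bit pair per item, drawn uniformly from 00, 01, 10, and 11."""
    return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n_items)]


def decode(bacterium, item_names):
    """Translate a bacterium into the (antecedent, consequent) itemsets it encodes."""
    antecedent = {name for pair, name in zip(bacterium, item_names)
                  if pair == ANTECEDENT}
    consequent = {name for pair, name in zip(bacterium, item_names)
                  if pair == CONSEQUENT}
    return antecedent, consequent


if __name__ == "__main__":
    names = ["bread", "milk", "butter", "eggs"]
    print(decode([(0, 0), (0, 1), (1, 1), (1, 0)], names))
    print(decode(random_bacterium(len(names)), names))
```

Because a randomly drawn bacterium may encode an empty antecedent or consequent, a full implementation would need to reject or repair such individuals; how the paper handles this case is not stated in this section.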

6.2. Reproduction

The surviving bacteria are cloned without mutation.

6.3. Chemotaxis: Swim and Tumble

Another modification made to mine association rules concerns the chemotactic behaviours. A shorter rule is more probable than a longer one, so the tumbles were implemented by randomly choosing a rule part (antecedent or consequent) to be shortened and removing an element from this part. If the bacterium adaptation level (fitness) increases after the tumble, it starts to run (applying swim steps) by removing items from the same part until its size is equal to 1 or the number of swim steps (user-defined parameter) is reached, as illustrated in Figure 4. Note that, in terms of chromosomes, the bacteria maintain the same length after swimming and tumbling; what changes is only the number of items in the encoded rules.

On the other hand, if the bacterium maintains its adaptation value (fitness) after tumbling, the chemotactic behaviour is finalized, as illustrated in Figure 5.
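
The tumble and run steps can be sketched over the bit-pair encoding as follows. This is an illustration under assumptions that the text does not settle: an item is removed by switching its pair to 01, a part that already has a single item is never shortened, and the tumbled bacterium is kept even when its fitness does not increase. The names `chemotaxis`, `remove_random_item`, and `n_swim` are hypothetical, and `fitness` is any caller-supplied function.

```python
import random

ANTECEDENT, CONSEQUENT, ABSENT = (0, 0), (1, 1), (0, 1)


def remove_random_item(bacterium, part, rng):
    """Copy of the bacterium with one item of `part` switched off.

    Returns None when the part has at most one item left.
    """
    positions = [i for i, pair in enumerate(bacterium) if pair == part]
    if len(positions) <= 1:
        return None
    shortened = list(bacterium)
    shortened[rng.choice(positions)] = ABSENT
    return shortened


def chemotaxis(bacterium, fitness, n_swim, rng=random):
    """Tumble once; if fitness increases, run by shortening the same part."""
    part = rng.choice([ANTECEDENT, CONSEQUENT])  # tumble: pick a rule part
    tumbled = remove_random_item(bacterium, part, rng)
    if tumbled is None:
        return list(bacterium)
    if fitness(tumbled) <= fitness(bacterium):
        return tumbled  # no improvement: chemotaxis ends after the tumble
    current = tumbled
    for _ in range(n_swim):  # run: successive swim steps on the same part
        shorter = remove_random_item(current, part, rng)
        if shorter is None:  # the part has reached size 1
            break
        current = shorter
    return current


if __name__ == "__main__":
    start = [ANTECEDENT] * 3 + [CONSEQUENT] * 3 + [ABSENT]
    fewer_items_is_better = lambda b: -sum(p in (ANTECEDENT, CONSEQUENT) for p in b)
    print(chemotaxis(start, fewer_items_is_better, n_swim=10))
```

The returned list always has the same length as the input, matching the remark that only the number of encoded items changes.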

6.4. Elimination-Dispersal Mechanisms

This step has two parts:
(1) Elimination: removal of some bacteria from the colony based on their fitness (adaptability).
(2) Dispersal: randomly changing the positions of the bacteria in the search space.

6.5. Intracellular Communication

In this step each bacterium has an associated probability of performing internal communication. The parts that make up a rule are identified as exchanging structures and the items of these structures may assume a new position in the rule, that is, a new gene expression, as illustrated in Figure 6.

6.6. Extracellular Communication

Extracellular communication is used to coordinate bacterial motility as a collective behaviour over the search space by sharing information in a chemical network. The chemical network restricts the range of the information to a part of the colony, a group. In our model, the information shared by the bacterium with the highest fitness is considered by the others as a reference to move around the search space. In a higher density group, the collective behaviour adopted is to disperse to new regions.

6.7. Evaluation Function

The evaluation, fitness, or objective function should reflect the relevance of the measures to be optimized, exhibit regularities over the space defined by the chosen representation, and provide enough information to drive the environmental pressure of a population-based search algorithm [51]. The measures of interest often used in Evolutionary Algorithms and Artificial Immune Systems to compute fitness values are based on those employed for classification rule mining, with some slight modifications.

Confidence and support were used in [52–54] to define a fitness function parameterized by minSupp and minConf, which are, respectively, the user-defined minimum threshold values for support and confidence. Another fitness function present in the association rule mining literature also takes minSupp as a minimum threshold value defined by the user. There are other fitness functions in the field [52, 55, 56], but they are essentially different combinations of support, confidence, and other measures of interest. A detailed description of various measures of interest usually applied in the association rule mining literature is available in [57].

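The evaluation of each bacterium is related to the occurrence probability and accuracy of an association rule in the database. The selection of bacteria is proportional to their fitness values. The fitness function used in BaCARO-II and in the benchmark algorithms is

$$\mathrm{fitness}(A \Rightarrow C) = w_1 \cdot \mathrm{support}(A \Rightarrow C) + w_2 \cdot \mathrm{confidence}(A \Rightarrow C),$$

where w₁ = w₂ = 0.5 and w₁ + w₂ = 1, subject to a cardinality constraint on the itemsets of the rule, where $|\cdot|$ returns the cardinality of a set.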
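
The sketch below illustrates a weighted fitness and fitness-proportional selection. It rests on two assumptions: the fitness is taken to be the weighted sum of support and confidence with w₁ = w₂ = 0.5, and selection is implemented as roulette-wheel sampling with replacement. The function names are hypothetical.

```python
import random


def weighted_fitness(support, confidence, w1=0.5, w2=0.5):
    """Assumed fitness: weighted sum of support and confidence."""
    return w1 * support + w2 * confidence


def roulette_select(population, fitnesses, n_selected, rng=random):
    """Sample individuals with probability proportional to their fitness."""
    if sum(fitnesses) == 0:
        # no individual stands out, so fall back to uniform sampling
        return [rng.choice(population) for _ in range(n_selected)]
    return rng.choices(population, weights=fitnesses, k=n_selected)


if __name__ == "__main__":
    rules = ["rule_1", "rule_2", "rule_3"]
    scores = [weighted_fitness(0.2, 0.4), weighted_fitness(0.5, 1.0),
              weighted_fitness(0.1, 0.3)]
    print(scores)
    print(roulette_select(rules, scores, n_selected=5))
```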

The algorithms use support and confidence to calculate the fitness value and the interestingness measure to compare them from a different perspective, as in [14, 58].

7. Experimental Results

To assess the performance of the algorithms, we ran several experiments over distinct scenarios. The first set of tests was performed using five different binary static datasets and the second was run applying a sliding-window approach to the datasets to simulate data streams. Finally, some experiments were performed investigating the computational complexity of the algorithms using a standard and an optimized architecture.

The following algorithms were implemented for comparison: BFO, BaCARO-II, GA, and CLONALG, as well as their stream versions sBFO, sBaCARO-II, sGA, and sCLONALG [59]. All algorithms were implemented in Java 1.7.0_95 over a GNU/Linux environment (Debian 3.16.7-ckt20-1). The experiments were run on an Intel Pentium® Dual-Core CPU T4500 @ 2.30 GHz.

7.1. Performance Tests in Static Datasets

The BFO parameters were set as follows: , , , , , and . The BaCARO-II parameters were set as follows: , , , and . For CLONALG we used , , and , and, finally, for GA we used and . All populations were set with 100 individuals and the maximum number of iterations was 100.

The following datasets were taken from the UCI Machine Learning Repository [60]: SPECT Heart database, with a sparsity of 66.75%; Mushroom Database, with 119 items and 8,124 instances with a sparsity of 80.67%; Balance Scale Database, with 23 items and 625 instances with a sparsity of 78.26%; Flare Data, with 49 items and 1,389 instances with a sparsity of 73.47%; the Monks Problems-1 Database, with 19 items and 432 instances with a sparsity of 63.16%; and the Nursery Database, with 32 items and 12,960 instances with a sparsity of around 71.88%.

All the values taken over ten simulations of BFO, BaCARO-II, CLONALG, and GA for static environments are summarized in Table 1, while sBFO, sBaCARO-II, sCLONALG, and sGA for static and dynamic environments are summarized in Table 2. The values presented are the mean ± standard deviation and minimum and maximum values for the set of rules found in the final population of each algorithm over ten simulations, where S means support, C confidence, I interestingness, U number of unique rules found over the last set of candidate solutions, and Time the processing time. As S and C are used in the fitness function, we selected the best fitness value from the final population. On the other hand, I is conceptually different from S and C and we used it to estimate the heterogeneity of solutions in the final population, as well as U.

In general, BaCARO-II presented better results than BFO, CLONALG, and GA in most measures. For instance, BaCARO-II outperforms BFO in all five datasets for the S and P measures. This occurs because BFO uses its global information by combining a measure value of each attribute of the bacterium to influence the entire colony, whereas BaCARO-II uses its global information to promote localized variations along the colony and improve its search ability. As a result, BaCARO-II tends to maintain many agents over the same highly adapted regions. Consequently, BFO sometimes outperforms BaCARO-II in the U measure by applying more local search steps, which avoids the concentration of large numbers of agents in the same region. On the other hand, BFO makes less use of global information, so BaCARO-II presents better fitness values as well as processing times.

BaCARO-II presented competitive results for all datasets. The best performance of our bacterial algorithm was for the Mushroom, Monks, and Nursery databases. The average values of support, confidence, and interestingness of our approach are higher than those presented by BFO. However, the number of rules generated by BaCARO-II is not greater than that of BFO in most datasets. On the other hand, our approach produces association rules with higher values of support and confidence. Another favourable point for BaCARO-II is its average processing time, which is shorter than that of its competitors. Nevertheless, BaCARO-II performs worse than BFO, GA, and CLONALG for all databases for the unique rules measure.

7.2. Bacterial Colony Algorithms in Stream Data

The same parameter configurations adopted in the static environment were applied to the dynamical case. As datasets have different sizes, we fixed the sliding-window size at 100, changing 1 object per iteration.

By considering the highlighted performance of our algorithm presented here and in other works [5, 6], we designed dynamical environments to evaluate its robustness and flexibility in mining association rules. In fact, we converted the following static datasets, SPECT, Balance Scale, Flare, Monks, and Nursery, to dynamical datasets by applying the Sliding-Window approach over them. To differentiate static and stream databases, we refer to the stream versions as streamSPECT, streamBalance, streamFlare, streamMonks, and streamNursery.

For experimental purposes, we fixed the sliding-window size at 100 objects per time step t_i of the data stream. The transition from t_i to t_{i+1} occurs when one object from the stream enters and another leaves the sliding-window, which always maintains its size. The sliding-window schema, data stream, and its transactions used in the experiments are illustrated in Figure 7.
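
The conversion of a static dataset into a stream of windows can be sketched as a small generator. This is an illustrative sketch, not the authors' Java code; the name `sliding_windows` is hypothetical, and emitting a window only once it is full is an assumption about how the first time step is formed.

```python
from collections import deque


def sliding_windows(stream, window_size=100):
    """Yield the window contents at each time step once the window is full."""
    window = deque(maxlen=window_size)
    for transaction in stream:
        # when the deque is full, appending drops the oldest object automatically
        window.append(transaction)
        if len(window) == window_size:
            yield list(window)


if __name__ == "__main__":
    # a tiny stream of five objects and a window of size three
    for step, window in enumerate(sliding_windows(range(5), window_size=3)):
        print(step, window)
```

Each yielded window differs from the previous one by exactly one entering and one leaving object, so a mining algorithm applied per window sees the one-object-per-iteration change described above.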

The results obtained by the stream versions of the algorithms (sBaCARO-II, sCLONALG, sBFO, and sGA) in the dynamic environments for streamSPECT, streamBalance, streamFlare, streamMonks, and streamNursery are summarized in Table 2.

Although the final result is based on the different objects that run through the sliding-window during the association rule mining process, it is undeniable that the objects at the final time t are the most relevant for the development of the previous ones.

To validate the results obtained in the static and dynamic environments, we compared our approach with BFO using a two-tailed Student’s t-test; we chose BFO instead of GA or CLONALG because of its superior performance in the experiments. In the static environment, the t-test gave the following results. For the Balance database, it showed no statistical difference for the highest values of the support and confidence measures, 8.53 and 0, respectively. For the Flare database, it indicated 0.00017 for support and 0.1341 for confidence. For the Monks database, the value obtained was 0.015 for support and 0.167 for confidence. For the Mushroom database, it registered 0.025 for support and recorded no difference for confidence. For the Nursery database, the values were 0.010 for confidence and 0.006 for support. Finally, for the SPECT database, the t-test pointed to the largest statistical difference between the algorithms, with 0.489 for support and 0.109 for confidence.

In the dynamic environment, the t-test for the Balance database registered 0.009 and 0.041 for the support and confidence measures, respectively. For the Flare database, values of 2.2 and 0.343 were recorded for support and confidence, respectively. For the Monks database, the t-test registered 7.93 for support and showed no statistical difference for confidence. For the Mushroom database, the t-test showed a statistical difference for both measures because sBaCARO-II did not generate any rule. Finally, the t-test indicated 3.73 and 0.0006 for the support and confidence measures, respectively.
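For readers who wish to reproduce this kind of comparison, the sketch below implements a two-tailed Student's t-test for two independent samples using only the Python standard library. It assumes the classic pooled-variance (equal-variance) form, which the text does not confirm, and the per-run support values are hypothetical placeholders rather than results from the experiments.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def t_pdf(x: float, df: int) -> float:
    """Probability density of Student's t distribution with `df` degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_t_test(a: Sequence[float], b: Sequence[float],
                      steps: int = 10000) -> Tuple[float, float]:
    """Return (t statistic, two-tailed p-value), assuming equal variances."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    if std_err == 0.0:
        raise ValueError("both samples have zero variance")
    t = (mean(a) - mean(b)) / std_err

    # Integrate the density over [0, |t|] with Simpson's rule (steps must be even).
    upper = abs(t)
    h = upper / steps
    area = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3

    p_value = max(0.0, 1.0 - 2.0 * area)  # probability mass in both tails
    return t, p_value


# Hypothetical average support values over five runs of two algorithms.
runs_algorithm_a = [0.82, 0.85, 0.80, 0.83, 0.84]
runs_algorithm_b = [0.70, 0.72, 0.69, 0.71, 0.73]
t_stat, p_value = two_tailed_t_test(runs_algorithm_a, runs_algorithm_b)
```

Under this convention, a p-value below the chosen significance level (commonly 0.05) indicates a statistically significant difference between the two sets of runs.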

7.3. Some Notes on BaCARO-II Running Time

To assess the running time of the proposed algorithm, we tested its static version using a different hardware and software architecture: a platform for accelerating the performance of server-side Java applications [61], with a JVM (Java Virtual Machine) optimized, from version 1.8 to newer versions, for the new Intel® Xeon® Scalable Processors. We performed new experiments to investigate the benefits of Intel’s High Performance Computing (HPC) platforms. These experiments were run on a compute node composed of two Intel® Xeon® Platinum 8160 processors @ 2.10 GHz, each with 24 physical cores (48 logical) and 33 MB of cache memory, 190 GB of RAM, two Intel® Solid State Drive Data Center (Intel® SSD DC) S3520 Series drives with 1.2 TB and 240 GB of storage capacity, and a CentOS 7 operating system running kernel version 3.10.0-693.21.1.el7.x86_64. Table 3 provides a comparison of the running times of BaCARO-II for the static datasets in both architectures. As can be observed, the use of an HPC platform leads to an average 2.60-fold gain in performance.

8. Final Remarks and Future Trends

There are many phenomena happening in a bacterial colony. Some of them, such as foraging and chemotaxis, were used to construct tools to solve complex problems. This paper proposed and applied a new bacteria-inspired algorithm by looking at intra- and extracellular communication networks, as well as interactions between bacteria and their internal constituent parts to deal with association rule mining. The results presented by BaCARO-II showed a superior performance to other bio-inspired algorithms, such as BFO, GA, and CLONALG when applied to the same tasks.

Given the current need to solve stream data problems, we designed and applied versions of BFO, BaCARO-II, GA, and CLONALG for mining association rules in stream data. The proposed bacterial approach showed good results in the experiments performed, in both static and stream data. We understand that the superior performance of our approach is primarily due to two reasons: first, the local search performed in the intracellular communication phase and, second, the use of information available in the neighbourhood (nearest bacterial cell) of each bacterial cell to improve the search space exploration. BFO was very competitive and presented better results in some dynamic scenarios, though it demands a longer processing time.

In future investigations, sBaCARO-II should be applied to stream data mining tasks with other stream data processing models, such as Landmark and Damped. Other settings for the sliding-window size should also be tested and the results compared with other algorithms, such as the ones presented in [62, 63]. Future work may also include a deeper understanding of bacterial behaviours and phenomena.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank CAPES, CNPq, Fapesp, and Mackpesquisa for the financial support. The authors also acknowledge the support of Intel for the Natural Computing and Machine Learning Laboratory as an Intel Center of Excellence in Artificial Intelligence.

References

[1] M. Matsushita and H. Fujikawa, “Diffusion-limited growth in bacterial colony formation,” Physica A: Statistical Mechanics and its Applications, vol. 168, no. 1, pp. 498–506, 1990.
[2] J. Van Helden, A. Toussaint, and D. Thieffry, “Bacterial molecular networks: Bridging the gap between functional genomics and dynamical modelling,” Methods in Molecular Biology, vol. 804, pp. 1–11, 2012.
[3] E. Ben-Jacob, “Learning from bacteria about natural information processing,” Annals of the New York Academy of Sciences, vol. 1178, pp. 78–90, 2009.
[4] R. S. Xavier, N. Omar, and L. N. De Castro, “Bacterial colony: Information processing and computational behavior,” in Proceedings of the 2011 3rd World Congress on Nature and Biologically Inspired Computing, NaBIC 2011, pp. 439–443, Spain, October 2011.
[5] D. S. da Cunha, R. S. Xavier, and L. N. de Castro, “A bacterial colony algorithm for association rule mining,” in Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning (IDEAL'15), 2015.
[6] D. S. Da Cunha, R. S. Xavier, D. G. Ferrari, and L. N. De Castro, “Association rule mining using a bacterial colony algorithm,” in Proceedings of the 2nd Latin-America Congress on Computational Intelligence, LA-CCI 2015, Brazil, October 2015.
[7] D. S. da Cunha and L. N. de Castro, “Evolutionary and immune algorithms applied to association rule mining,” in Proceedings of the International Conference on Swarm, Evolutionary, and Memetic Computing (SEMCCO), Bhubaneswar, 2012.
[8] K. M. Passino, “Biomimicry of bacterial foraging for distributed optimization and control,” IEEE Control Systems Magazine, vol. 22, no. 3, pp. 52–67, 2002.
[9] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, 1992.
[10] L. N. de Castro and F. J. von Zuben, “Learning and optimization using the clonal selection principle,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 3, pp. 239–251, 2002.
[11] D. S. da Cunha, R. S. Xavier, D. G. Ferrari, and L. N. de Castro, “Bacterial Colony Algorithms Applied to Association Rule Mining in Static Data and Streams,” in Proceedings of the International Conference on Practical Applications of Agents and Multi-Agent Systems, pp. 525–533, Springer, 2018.
[12] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases,” in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93), pp. 207–216, May 1993.
[13] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proceedings of the 20th International Conference Very Large Data Bases (VLDB'94), 1994.
[14] S. Dehur, A. K. Jagadev, A. Ghosh, and R. Mall, “Multi-objective genetic algorithm for association rule mining using a homogeneous dedicated cluster of workstations,” American Journal of Applied Sciences, vol. 3, no. 11, pp. 2086–2095, 2006.
[15] K. J. Cios, W. Pedrycz, R. W. Swiniarski, and L. A. Kurgan, Data Mining: A Knowledge Discovery Approach, Springer Science and Business Media, 2007.
[16] N. Jiang and L. Gruenwald, “Research issues in data stream association rule mining,” ACM SIGMOD Record, vol. 35, no. 1, pp. 14–19, 2006.
[17] C. C. Aggarwal, Data Streams: Models and Algorithms, vol. 31, Springer Science and Business Media, 2007.
[18] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, “Mining data streams: a review,” ACM SIGMOD Record, vol. 34, no. 2, pp. 18–26, 2005.
[19] S. Guha, N. Koudas, and K. Shim, “Data-streams and histograms,” in Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pp. 471–475, Hersonissos, Greece, 2001.
[20] Y. Zhu and D. Shasha, “Statstream: Statistical monitoring of thousands of data streams in real time,” in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02), Hong Kong, 2002.
[21] M. H. Le Gruenwald, “Estimating missing values in related sensor data streams,” in COMAD, 2005.
[22] E. D. Demaine, A. López-Ortiz, and J. I. Munro, “Frequency estimation of internet packet streams with limited space,” in Proceedings of the European Symposium on Algorithms, 2002.
[23] Y. D. Cai, D. Clutter, G. Pape, J. Han, M. Welge, and L. Auvil, “MAIDS: Mining alarming incidents from data streams,” in Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD 2004, pp. 919-920, France, June 2004.
[24] D. Lee and W. Lee, “Finding maximal frequent itemsets over online data streams adaptively,” in Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05), pp. 266–273, Houston, TX, USA, 2005.
[25] H. Huang, X. Wu, and R. Relue, “Association analysis with one scan of databases,” in Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM 2002, pp. 629–632, Maebashi City, Japan, 2002.
[26] R. Relue, X. Wu, and H. Huang, “Efficient runtime generation of association rules,” in Proceedings of the tenth International Conference on Information and Knowledge Management (IKM'01), p. 466, Atlanta, Georgia, USA, October 2001.
[27] L. Yang and M. Sanver, “Mining short association rules with one database scan,” in Proceedings of the International Conference on Information and Knowledge Engineering, IKE'04, pp. 392–395, USA, June 2004.
[28] H. Wang, W. Fan, P. S. Yu, and J. Han, “Mining concept-drifting data streams using ensemble classifiers,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 226–235, Washington, DC, USA, August 2003.
[29] I. Habibi, E. S. Emamian, and A. Abdi, “Quantitative analysis of intracellular communication and signaling errors in signaling networks,” BMC Systems Biology, vol. 8, no. 1, 2014.
[30] J. F. Prescott and P. M. Dowling, Antimicrobial Therapy in Veterinary Medicine, John Wiley & Sons, 2013.
[31] B. Alberts, Molecular Biology of the Cell, CRC Press, 2017.
[32] E. Ben Jacob, Y. Shapira, and A. I. Tauber, “Seeking the foundations of cognition in bacteria: From Schrödinger's negative entropy to latent information,” Physica A: Statistical Mechanics and its Applications, vol. 359, no. 1-4, pp. 495–524, 2006.
[33] H. Salis, A. Tamsir, and C. Voigt, “Engineering bacterial signals and sensors,” in Bacterial Sensing and Signaling, vol. 16, pp. 194–225, Karger Publishers, 2009.
[34] H. J. Bremermann, “Chemotaxis and optimization,” Journal of The Franklin Institute, vol. 297, no. 5, pp. 397–404, 1974.
[35] B. Niu and H. Wang, “Bacterial colony optimization: principles and foundations,” in Proceedings of the International Conference on Intelligent Computing, vol. 304, 2012.
[36] K. M. Passino, “Bacterial foraging optimization,” in Innovations and Developments of Swarm Intelligence Applications, vol. 219, IGI Global, 2012.
[37] B. Xing and W. Gao, “Bacteria inspired algorithms,” in Innovative Computational Intelligence: A Rough Guide to 134 Clever Algorithms, vol. 62, pp. 21–38, Springer, 2014.
[38] S. Das, A. Biswas, S. Dasgupta, and A. Abraham, “Bacterial foraging optimization algorithm: theoretical foundations, analysis, and applications,” in Foundations of Computational Intelligence, vol. 203, pp. 23–25, Springer, 2009.
[39] A. Biswas, S. Dasgupta, S. Das, and A. Abraham, “A synergy of differential evolution and bacterial foraging optimization for global optimization,” Neural Network World, vol. 17, no. 6, pp. 607–626, 2007.
[40] E. A. H.-O. B. Mezura-Montes, “Modified bacterial foraging optimization for engineering design,” in Intelligent Engineering Systems through Artificial Neural Networks, pp. 1–8, ASME Press, 2009.
[41] S. M. Abd-Elazim and E. S. Ali, “A hybrid particle swarm optimization and bacterial foraging for optimal power system stabilizers design,” International Journal of Electrical Power & Energy Systems, vol. 46, no. 1, pp. 334–341, 2013.
[42] S. M. Abd-Elazim and E. S. Ali, “Bacteria foraging optimization algorithm based svc damping controller design for power system stability enhancement,” International Journal of Electrical Power & Energy Systems, vol. 43, no. 1, pp. 933–940, 2012.
[43] K. S. Kumar and T. Jayabarathi, “Power system reconfiguration and loss minimization for an distribution systems using bacterial foraging optimization algorithm,” International Journal of Electrical Power & Energy Systems, vol. 36, no. 1, pp. 13–17, 2012.
[44] S. M. Abd-Elazim and E. S. Ali, “Synergy of particle swarm optimization and bacterial foraging for TCSC damping controller design,” International Journal of World Scientific and Engineering Academy and Society (WSEAS) Transactions on Power Systems, vol. 8, pp. 74–84, 2013.
[45] H. Chen, Y. Zhu, and K. Hu, “Multi-colony bacteria foraging optimization with cell-to-cell communication for RFID network planning,” Applied Soft Computing, vol. 10, no. 2, pp. 539–547, 2010.
[46] M. Wan, L. Li, J. Xiao, C. Wang, and Y. Yang, “Data clustering using bacterial foraging optimization,” Journal of Intelligent Information Systems, vol. 38, no. 2, pp. 321–341, 2012.
[47] J. R. Olesen, J. Cordero, and Y. Zeng, “Auto-clustering using particle swarm optimization and bacterial foraging,” in Proceedings of the International Workshop on Agents and Data Mining Interaction, pp. 69–83, 2000.
[48] R. Majhi, G. Panda, B. Majhi, and G. Sahoo, “Efficient prediction of stock market indices using adaptive bacterial foraging optimization (ABFO) and BFO based techniques,” Expert Systems with Applications, vol. 36, no. 6, pp. 10097–10104, 2009.
[49] S. R. Chhabra, B. Philipp, L. Eberl, M. Givskov, P. Williams, and M. Cámara, “Extracellular communication in bacteria,” in The Chemistry of Pheromones and Other Semiochemicals II, pp. 279–315, Springer, Berlin, Heidelberg, Germany, 2005.
[50] C. G. Bowsher and P. S. Swain, “Environmental sensing, information transfer, and cellular decision-making,” Current Opinion in Biotechnology, vol. 28, pp. 149–155, 2014.
[51] T. Bäck, D. Fogel, and Z. Michalewicz, Evolutionary Computation 1: Basic Algorithms and Operators, vol. 1, CRC Press, 2000.
[52] Y. Su, X. Gu, and Z. Li, “Incremental updating algorithm based on artificial immune system for mining association rules,” in Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE'06), Shanghai, China, October 2006.
[53] Y. Zhang, S. Bu, and Y. Zhang, “Association rules mining based on the improved immune algorithm,” in Proceedings of the Third International Symposium on Intelligent Information Technology Application, 2009.
[54] Y. Zhang and S. Bu, “Association rules mining based on simulated annealing immune programming algorithm,” in Proceedings of the International Conference on Computer Engineering and Technology, 2009.
[55] T. Liu, “An immune based association rule algorithm,” in Proceedings of the Second International Conference on Innovative Computing, Information and Control (ICICIC 2007), Kumamoto, Japan, 2007.
[56] Z. Lei and L. Ren-hou, “An algorithm for mining fuzzy association rules based on immune principles,” in Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, 2007.
[57] L. Geng and H. J. Hamilton, “Interestingness measures for data mining: a survey,” ACM Computing Surveys, vol. 38, no. 3, pp. 1–32, 2006.
[58] M. J. del Jesus, J. A. Gámez, P. González, and J. M. Puerta, “On the discovery of association rules by means of evolutionary algorithms,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 5, pp. 397–415, 2011.
[59] D. S. da Cunha and L. N. de Castro, “Evolutionary and immune algorithms applied to association rule mining in static and stream data,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Rio de Janeiro, Brazil, 2018.
[60] K. Bache and M. Lichman, UCI - Machine Learning Repository, 2013, http://archive.ics.uci.edu/ml.
[61] Intel Corporation, “Accelerating performance for server-side Java applications,” Portland, 2017.
[62] P. Deepa Shenoy, K. G. Srinivasa, K. R. Venugopal, and L. M. Patnaik, “Evolutionary approach for mining association rules on dynamic databases,” in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2003.
[63] K. R. Venugopal, K. G. Srinivasa, and L. M. Patnaik, “Dynamic association rule mining using genetic algorithms,” in Soft Computing for Data Mining Applications, 2009.

Copyright

Copyright © 2018 Danilo S. da Cunha et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
