Computational Intelligence and Neuroscience

Volume 2018, Article ID 2085721, 13 pages

https://doi.org/10.1155/2018/2085721

## Constructing Temporally Extended Actions through Incremental Community Detection

College of System Engineering, National University of Defense Technology, Changsha, Hunan 410073, China

Correspondence should be addressed to Mei Yang; yangmei@nudt.edu.cn

Received 25 September 2017; Revised 6 February 2018; Accepted 26 February 2018; Published 23 April 2018

Academic Editor: Michele Migliore

Copyright © 2018 Xiao Xu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Hierarchical reinforcement learning works on temporally extended actions or skills to facilitate learning. How to automatically form such abstraction is challenging, and many efforts tackle this issue in the options framework. While various approaches exist to construct options from different perspectives, few of them concentrate on options’ adaptability during learning. This paper presents an algorithm to create options and enhance their quality online. Both aspects operate on detected communities of the learning environment’s state transition graph. We first construct options from initial samples as the basis of online learning. Then a rule-based community revision algorithm is proposed to update graph partitions, based on which existing options can be continuously tuned. Experimental results in two problems indicate that options from initial samples may perform poorly in more complex environments, and our presented strategy can effectively improve options and get better results compared with flat reinforcement learning.

#### 1. Introduction

Reinforcement learning (RL) is a machine learning branch in which an agent learns to optimize its behavior through trial-and-error interaction with its environment. Traditional RL methods are hard to apply to complex practical problems due to the so-called "Curse of Dimensionality," that is, the exponential growth of memory requirements with the number of state variables. Hierarchical Reinforcement Learning (HRL) aims to reduce the dimensionality by decomposing the RL problem into several subproblems. As solving small-scale subproblems is simpler than solving the entire problem, HRL is expected to be more efficient than flat RL. In the HRL research community, three main frameworks, HAM [1], the options framework [2], and MAX-Q [3], provide different paradigms of problem hierarchies and learning methodologies. All of them make HRL work on temporally extended actions or skills. Generally, HRL requires domain knowledge to define such abstraction, which may function only for specific problems. How to automatically form useful abstractions, that is, skill acquisition, remains an attractive open issue.

To the best of our knowledge, most studies on this topic adopt the options framework [4–6], so skill acquisition reduces to automatic option construction. Though existing approaches solve this from different perspectives, they have one thing in common: they require data sampled from the environment. In some complex environments this sampled experience may be insufficient to describe the actual transition dynamics. Options created from such samples may fail to adapt to the environment, possibly even degrading HRL performance. For such cases there is a need to improve the quality of individual options online.

This paper targets two problems: how to create options and how to optimize options during learning. Our approach is to operate on the state transition graph of the learning environment, in which states are individual nodes and connecting edges denote state transitions. We first divide the sampled graph into communities, from which options are constructed. Community is a concept from the network science field, representing a cluster of strongly connected states. This paper employs Louvain algorithm [7] for community detection. The generated option set acts as the basis for online learning. We then present a rule-based community revision algorithm, adding newly collected states and transitions to previous communities. Option improvement is performed on top of these updated communities. Our approach is evaluated in two environments: the four-room grid world and a small-scale Pac-Man world. The former is a benchmark problem for testing option generation algorithms, in which we test the effectiveness of learning options from Louvain-detected communities. The latter is more complex and uncertain, making it suitable for demonstrating the performance of the presented incremental option improvement algorithm. Comparative results show that, in the four-room environment, options constructed from communities converge faster than learning from primitive actions alone. In the Pac-Man environment, two scenarios with different types of ghost agent are set up: one follows a fixed strategy while the other moves randomly. Results suggest that options from initial samples perform poorly in the more complex scenario, while the presented incremental option improvement helps adapt the existing option set and obtains better results than flat RL.

The remainder of this paper is organized as follows: In Section 2 we describe some basic ideas of RL and the options framework. Section 3 shows some related works on option construction. In Section 4 we illustrate the main approach of creating options from communities. Section 5 gives the detailed algorithm on incremental community revision and how to learn from these evolving communities. Section 6 demonstrates experiments and result analysis. Finally we discuss our implementation and draw conclusions in Section 7.

#### 2. Preliminaries

##### 2.1. Reinforcement Learning

The RL environment is typically formalized as a Markov Decision Process (MDP) [8] that can be described as a 5-element tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is a finite set of states of the learning environment; $A(s)$ is the available action set in state $s$; $P(s' \mid s, a)$ describes the state transition dynamics; $R(s, a, s')$ represents the reward function for each state transition; and the discount factor $\gamma \in [0, 1]$ balances the importance of short-term and long-term reward. At each time step $t$, an agent in state $s_t$ selects an available action $a_t \in A(s_t)$; at the next step it moves to $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t)$ and obtains the reward $r_{t+1}$. A policy $\pi$ defines which action to choose in a given state. It is associated with the action-value function $Q^{\pi}(s, a)$, indicating the expected return from $s$ after taking $a$ and thereafter following $\pi$. The aim of RL is to find an optimal policy $\pi^*$ that attains the maximal expected return, corresponding to the optimal action-value function $Q^*(s, a)$.

RL algorithms can be divided into two categories, model-based and model-free, according to whether they attempt to model the environment. $Q$-learning [9], one of the most commonly used RL algorithms, is of the model-free type. In each learning step, the agent experiences the transition $\langle s_t, a_t, r_{t+1}, s_{t+1} \rangle$, and then the $Q$ function is updated as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],$$

where $\alpha \in (0, 1]$ is the learning rate. $Q$-learning has been shown to converge to $Q^*$ under standard stochastic approximation assumptions.
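As a concrete illustration, the update rule above takes only a few lines of Python. This is a minimal tabular sketch; the dictionary-backed Q-table and the state/action encodings are our own illustrative choices, not from the paper:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max((Q[(s_next, a2)] for a2 in actions_next), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)
# One transition: from state 0, action 'right' yields reward 1 and leads to a
# terminal state 1 with no further actions, so the update target is just r.
q_update(Q, 0, 'right', 1.0, 1, [], alpha=0.5, gamma=0.9)
```

With $\alpha = 0.5$, the single update moves the initial estimate halfway toward the target of 1.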

##### 2.2. Options Framework

The MDP model assumes that an action lasts for a single time unit. In large state space problems, hierarchical abstraction has proven able to increase RL efficiency. Options [2], built on these one-step actions, are formed as temporally extended courses of action. An option $o$ is defined as a triple $\langle \mathcal{I}, \pi, \beta \rangle$, where

(i) $\mathcal{I} \subseteq S$ is the *initiation set*; that is, $o$ is applicable in $s$ iff $s \in \mathcal{I}$;
(ii) $\pi$ defines the *option policy*;
(iii) $\beta: S \to [0, 1]$ specifies the *termination condition* while executing $o$.

With this definition, an atomic action $a$ can also be viewed as a *primitive option* with the initiation set $\mathcal{I} = \{s : a \in A(s)\}$, the local policy $\pi(s) = a$ for all $s \in \mathcal{I}$, and the one-step termination condition $\beta(s) = 1$ for all $s$. Thus the option-based RL agent can choose among atomic actions as well as higher-level skills.

The MDP model with a set of options forms a Semi-Markov Decision Process (SMDP). When the learning agent chooses an option to perform, it follows the option policy for several steps until the termination condition is satisfied. $Q$-learning under SMDP, also referred to as Option-to-Option Learning, updates the option value function after the option has terminated. Specifically, the rule is

$$Q(s, o) \leftarrow Q(s, o) + \alpha \left[ r + \gamma^{k} \max_{o'} Q(s', o') - Q(s, o) \right],$$

where $s$ is the starting state of the option $o$, $k$ is the number of steps from when $o$ is taken to its ending, $r$ is the cumulative discounted reward collected over those $k$ steps, and $s'$ is the state in which $o$ terminates.
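The SMDP rule can be sketched analogously. Again an illustrative fragment: `cum_reward` is assumed to already hold the discounted sum of rewards collected while the option ran, and all names are ours:

```python
from collections import defaultdict

def smdp_q_update(Q, s, o, cum_reward, k, s_next, options_next, alpha=0.1, gamma=0.9):
    """Update Q(s, o) once option o, started in s, has terminated in s_next after k steps."""
    best_next = max((Q[(s_next, o2)] for o2 in options_next), default=0.0)
    Q[(s, o)] += alpha * (cum_reward + gamma ** k * best_next - Q[(s, o)])
    return Q[(s, o)]

Q = defaultdict(float)
# Option 'go_east' ran for 3 steps, accumulated a discounted reward of 2.0,
# and ended in a terminal state with no further options available.
smdp_q_update(Q, 's0', 'go_east', 2.0, 3, 's_goal', [], alpha=0.5)
```

Note that the bootstrap term is discounted by $\gamma^k$ rather than $\gamma$, reflecting the option's duration.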

The main drawback of Option-to-Option Learning is that it needs to execute an option to completion before learning its value, thus requiring a significant amount of experience to reach convergence for every option. On the other hand, intraoption learning [10] can take advantage of one-step option execution for all related options, which leads to potentially more efficient learning. In detail, an experience fragment $\langle s_t, a_t, r_{t+1}, s_{t+1} \rangle$ can be utilized by all consistent options, which would have taken $a_t$ in $s_t$, to update the value estimation. Such one-step intraoption value learning is expressed as

$$Q(s_t, o) \leftarrow Q(s_t, o) + \alpha \left[ r_{t+1} + \gamma U(s_{t+1}, o) - Q(s_t, o) \right],$$

where

$$U(s, o) = \left( 1 - \beta(s) \right) Q(s, o) + \beta(s) \max_{o'} Q(s, o').$$

This update rule takes place after the one-step transition and is applied to every option whose policy is consistent with the action $a_t$ taken in $s_t$.
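A sketch of this one-step intraoption update follows. The encoding of options as (policy, termination) pairs of plain functions is our own illustrative choice:

```python
from collections import defaultdict

def intraoption_update(Q, s, a, r, s_next, options, alpha=0.1, gamma=0.9):
    """Apply the one-step intraoption update to every option consistent with (s, a).
    options: name -> (policy, beta), with policy(s) -> action, beta(s) -> P(terminate)."""
    names = list(options)
    for name, (policy, beta) in options.items():
        if policy(s) != a:
            continue  # this option would not have taken a in s: skip it
        term = beta(s_next)
        best = max(Q[(s_next, n)] for n in names)
        u = (1 - term) * Q[(s_next, name)] + term * best  # U(s', o)
        Q[(s, name)] += alpha * (r + gamma * u - Q[(s, name)])

Q = defaultdict(float)
# A single option that always moves 'right' and terminates everywhere (beta = 1),
# updated from one transition it is consistent with.
options = {'go': (lambda s: 'right', lambda s: 1.0)}
intraoption_update(Q, 0, 'right', 1.0, 1, options, alpha=0.5)
```

The key point the code makes explicit is that a single primitive transition updates every option whose policy agrees with the executed action, not only the option currently being followed.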

The options framework illustrates how defined options are utilized by the learning agent. What concerns us here is how to form useful options, with their $\mathcal{I}$, $\pi$, and $\beta$ generated automatically.

#### 3. Literature Review

Automated option construction has been an active research area, and various approaches have been proposed. These efforts commonly work on the basis of sampled experience. With collected states and transitions, a general process is to identify useful *subgoals* and then compose options around them. Subgoals define which states the options need to achieve; they act as the basis for dividing the original problem. Based on how they find subgoals, most existing works can be categorized into two main branches: the sampling-trajectory-based approach and the graphical approach.

The sampling-trajectory-based approach analyzes history experience from a statistical perspective. For instance, the diverse density algorithm [4] specifies subgoals as regions the agent passes frequently on successful trajectories but not on unsuccessful ones; here a successful trajectory is one that can start from any state and finally end at the expected goal state. The relative novelty method [11] assumes that subgoals can lead the agent from highly visited regions to a new region. The Local Roots algorithm [12] considers subgoals to be junctions of shortcut paths from each state to the goal state; it constructs a sequence tree online from collected successful trajectories and takes the node with the local-maximum *root factor* measure as a subgoal. Another method presented in [13] employs ant colony optimization to construct options; in its context, subgoals are specified by monitoring the variance in pheromone values of related transitions.

The graphical approach forms a state transition graph through the agent's interaction with the environment. In such graphs, states act as vertexes and their potential transitions caused by actions are represented as edges. Some efforts form subgoals by directly ranking graph centrality measures of state nodes (e.g., the betweenness centrality [5] and the connection graph stability centrality [14]); the basic idea is that potential subgoals stand out on these measures compared to other vertexes. A more common way, however, is to partition the transition graph into several vertex clusters such that states within the same cluster are strongly connected while intercluster connectivity is minimized. Border states connecting adjacent clusters can then naturally be regarded as subgoals, and options carry the implication of moving from one cluster to another. Several approaches follow this idea with different implementations. The approach in [15] partitions the transition graph by removing edges with high edge betweenness centrality, while in [16] the eigenvector centrality is used as a basis for clustering the graph; the authors also present an online option pruning algorithm, attaining substantial performance improvement compared with the betweenness and edge betweenness approaches. The spectral clustering algorithm PCCA+ [17] has also been used for skill acquisition [18]; combined with neural network training, it shows effectiveness for complex environments like Atari games. The work presented in [19] finds subgoals in linear time by forming the Strongly Connected Components (SCCs) of the graph; uniquely, this method also exploits historical data to help improve performance. Additionally, a reformed Label Propagation Algorithm (LPA), a community detection method, has been employed to tackle this issue [6].
While LPA has a near-linear time complexity [20], its stability remains doubtful as it can generate redundant communities as well as skills even in simple problems.

One main drawback of the sampling-trajectory-based approach is that excessive exploration is needed to accurately identify the subgoals. Also, if the goal of the environment changes, previous effort is wasted whenever the currently detected subgoals do not lead to the new ultimate goal. The graphical approach relies on the transition graph, which forms an understanding of the overall environment; this helps identify potential subgoals no matter what the current goal is. The two branches do not have a clear border, and some approaches can take advantage of both, such as the SCC-based approach [19]. The approach in [13] also operates on the state transition graph but concentrates more on metrics in the context of ant colony optimization. The main difference among these efforts lies in how they define the standard for states to be subgoals.

For the graphical approach, it is usually difficult to obtain a complete transition graph for large state space problems, and hence continuous sampling is necessary to approximate the full view. This requires that the graphical processing can deal with the potential arrival of new states and new transitions during exploration. In this paper we propose an option construction and option improvement strategy through incremental community detection. Our main concern is how initially generated options can be updated during online learning. Among related works, the PCCA+-based method [18] calls the clustering algorithm iteratively to get options for large state space problems, which can be computationally expensive; moreover, if the resulting partition differs markedly from the existing one, the currently formed options are wasted. The reformed LPA in [6] is extended with an incremental version, but the stability of the generated communities is not further discussed. We focus on how detected communities evolve and try to improve options in a stable and efficient way.

#### 4. Generating Options from Communities

Constructing options from communities belongs to the graph-partition-based approach. We first give a brief description of the concept of communities and the process of the Louvain community detection algorithm, and then describe how options are generated from communities.

##### 4.1. Louvain Method for Community Detection

Define $G = (V, E)$ as an undirected unweighted graph, where $V$ and $E$ represent the vertex set and the edge set, respectively. Community detection aims to partition $V$ into a finite set of communities $P = \{C_1, C_2, \ldots, C_k\}$, where $\bigcup_i C_i = V$ and $C_i \cap C_j = \emptyset$ for any distinct $i, j$. A community is thought of as a portion of a graph in which intracommunity edges are dense while intercommunity edges are sparse [21]. The measure *modularity* [22] is often used to evaluate the quality of a community partition:

$$Q(P) = \sum_{i=1}^{k} \left[ \frac{l_i}{m} - \left( \frac{d_i}{2m} \right)^2 \right], \tag{5}$$

where $l_i$ is the sum of intracommunity edges of $C_i$, $d_i$ is the sum of degrees of vertexes in $C_i$, and $m$ denotes the total number of edges in $G$. The value range of modularity is $[-1/2, 1)$.
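Computed directly from this definition, modularity takes only a few lines; the adjacency-set representation below is an illustrative choice:

```python
def modularity(adj, communities):
    """adj: node -> set of neighbours (undirected); communities: list of node sets."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2               # total edge count
    q = 0.0
    for comm in communities:
        l_c = sum(1 for u in comm for v in adj[u] if v in comm) / 2  # intra edges
        d_c = sum(len(adj[u]) for u in comm)                      # degree sum
        q += l_c / m - (d_c / (2 * m)) ** 2
    return q

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q = modularity(adj, [{0, 1, 2}, {3, 4, 5}])
```

For this graph the natural two-community partition gives $Q = 6/7 - 1/2 \approx 0.357$.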

Generally, a higher value of $Q$ means a better partitioning; thus community detection can be cast as seeking a solution that maximizes modularity. However, because the space of possible partitions grows extremely fast, achieving the highest modularity is an NP-hard problem [23]. Algorithms for modularity-based community detection therefore try to approximate the maximum of this measure. A comprehensive review of these approaches can be found in [24].

Louvain algorithm [7], which we use in this paper, is a hierarchical greedy optimization approach. It generates communities through iteratively executing a two-phase process. The general procedure is as follows:

(1) Initially, each vertex of the graph is assigned to a different community.
(2) For each node, check the modularity change of moving it from its current community to one of its neighbor communities, and make the change yielding the maximal positive modularity increase. The process continues until all nodes are checked, resulting in a first-level partition with locally maximal $Q$.
(3) Build a new graph based on the first-level partition, where each node represents a community and the connecting edges are weighted by the sum of the previous weights of the corresponding intercommunity connections.
(4) Repeat steps (2) and (3) until no increase in modularity is possible, resulting in the ultimate partition.
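The first (local-moving) phase can be sketched as follows. This is a deliberately simplified illustration: it recomputes full modularity for every candidate move rather than using Louvain's incremental gain formula, and it omits the graph-aggregation phase entirely:

```python
def local_moving(adj):
    """One simplified Louvain-style local-moving pass over an unweighted graph."""
    comm_of = {u: i for i, u in enumerate(adj)}   # step (1): singleton communities
    m = sum(len(n) for n in adj.values()) / 2

    def q(assign):                                # full modularity of an assignment
        total = 0.0
        for c in set(assign.values()):
            nodes = [u for u in adj if assign[u] == c]
            l_c = sum(1 for u in nodes for v in adj[u] if assign[v] == c) / 2
            d_c = sum(len(adj[u]) for u in nodes)
            total += l_c / m - (d_c / (2 * m)) ** 2
        return total

    improved = True
    while improved:                               # step (2): greedy local moves
        improved = False
        for u in adj:
            best_c, best_q = comm_of[u], q(comm_of)
            for c in {comm_of[v] for v in adj[u]}:
                trial = dict(comm_of)
                trial[u] = c                      # tentatively move u to community c
                if q(trial) > best_q + 1e-12:
                    best_c, best_q = c, q(trial)
            if best_c != comm_of[u]:
                comm_of[u] = best_c
                improved = True
    return comm_of

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = local_moving(adj)   # the two triangles end up in separate communities
```

Because each accepted move strictly increases the bounded modularity, the loop terminates; the real algorithm then aggregates communities into supernodes and repeats.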

Louvain algorithm is believed to be one of the fastest modularity-based community detection methods [25]. Assume the graph to be processed has a total of $m$ edges and $n$ vertexes. The algorithm's runtime complexity is typically reported as $O(n \log n)$, and for sparse graphs it grows roughly linearly in the graph size. In addition to high efficiency, it can obtain very good-quality results in terms of the modularity measure [25]. The main limitation is its storage demand for large-scale networks. Compared with some existing graphical option generation algorithms, employing Louvain algorithm is advantageous in computation time. For instance, the betweenness centrality computation following [26] takes $O(nm)$ and $O(nm + n^2 \log n)$ for unweighted and weighted graphs, respectively; LPA grows like $O(Tm)$, where $T$ is the number of the algorithm's internal label propagation iterations; and the SCC-based method [19] is a linear time algorithm with an $O(n + m)$ complexity.

It should be noted that, in step (2) of Louvain algorithm, the visit order of vertexes can vary. As indicated in [7], the ordering can influence the computation time as well as the final partition obtained. The default strategy is to traverse nodes in random order. In [27] the authors evaluate several other vertex ordering strategies and suggest that sorting nodes in descending order of edge degree can bring marginal improvement in computation time over the default strategy. Results in [28] also show that partitions generated by Louvain algorithm following this degree-descending order exhibit very low variance in modularity value across most tested networks. In this paper, our implementation uses Louvain algorithm with this ordering strategy by default.

##### 4.2. Option Generation from Communities

The main idea of generating options from communities is to form an abstract MDP model on the basis of communities converted from the state transition graph of the original problem. State vertexes in the same group can be aggregated into a macrostate, and transitions between macrostates are formed as macroactions (i.e., options). Here we only consider options shifting between two adjacent communities. An option from $C_i$ to $C_j$ can be generated by assigning its $\mathcal{I}$, $\beta$, and $\pi$. Specifically:

(i) The initiation set is $\mathcal{I} = C_i$.
(ii) The termination condition is defined as
$$\beta(s) = \begin{cases} 1, & s \in B_{ij}, \\ 0, & \text{otherwise}, \end{cases}$$
where $B_{ij} \subseteq C_j$ denotes the border states of $C_j$ connecting $C_i$. The option is expected to stop when the agent reaches these border states. In some cases more than one state connects the adjacent clusters, all of which we can regard as subgoals.
(iii) The option policy $\pi$ mainly guides the agent in moving to $C_j$. It is assumed that enough episodes of transition experience have been collected before the communities' generation. We adopt the experience replay (ER) mechanism [29] to learn from previous trajectories. ER reuses past experiences to find an optimal policy to reach specified subgoals. During this process, a completion reward in addition to environmental reward signals is assigned when $\beta$ is satisfied. For nondeterministic transitions, we also set a negative reward if the agent executing an option jumps out of both the source community and the target community.
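The initiation set and termination condition can be read directly off two adjacent communities. A minimal sketch, in which the graph representation and all names are our own assumptions (the option policy itself would be learned via experience replay, which the sketch omits):

```python
def make_option(adj, c_src, c_dst):
    """Build (initiation set, termination fn, subgoals) for moving c_src -> c_dst."""
    # Border states of c_dst: states in c_dst directly reachable from c_src.
    subgoals = {v for u in c_src for v in adj[u] if v in c_dst}
    initiation = set(c_src)                           # option applicable anywhere in c_src
    beta = lambda s: 1.0 if s in subgoals else 0.0    # terminate on any border state
    return initiation, beta, subgoals

# Two triangles joined by edge (2, 3): the only subgoal for C_i -> C_j is state 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
init, beta, subgoals = make_option(adj, {0, 1, 2}, {3, 4, 5})
```

When several edges cross between the clusters, `subgoals` simply contains all of the border states, matching point (ii) above.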

#### 5. Incremental Community Detection for Online Learning

Though Louvain algorithm is computationally efficient and can generate high-quality solutions, it was initially designed for static network analysis. For complex systems, sample-based methods need to be employed to asymptotically form a satisfying state transition graph, and the initial samples may not reach all states. Thus the state transition graph needs to be developed online as experience accumulates during learning, and the community detection needs to adapt to these dynamic changes. A direct approach is to execute Louvain algorithm from scratch whenever new states are discovered. However, the algorithm can produce distinct community structures when run multiple times on the same network. Even though we employ a specific ordering strategy to decrease the variance, partition changes can force the reconstruction of options, which may waste previous effort spent learning these options' internal policies. Therefore, a trade-off between optimality and stability should be considered.

##### 5.1. Rule-Based Incremental Community Revision

Here we propose an approach combining the original Louvain algorithm with subsequent incremental processing. Louvain algorithm is first called to create a community assignment from initial samples, and the incremental processing then handles later changes to the transition graph. We concentrate on the changes that can happen during the learning process, which fall into three categories:

**Case 1**: new vertex addition with edge(s).
**Case 2**: intracommunity edge addition.
**Case 3**: intercommunity edge addition.

We set rules for responding to these cases. For **Case 1**, two operations are possible: assigning the new vertex to its connected community (Op1) or creating a new community for the vertex (Op2). For **Case 2**, an intracommunity edge addition actually strengthens the related community's local modularity, so we can simply keep the current community structure unchanged (Op3). For **Case 3**, there are also two potential operations: Op3, or merging the corresponding communities into a new one (Op4). Selecting an operation for a given case follows the principle that the modularity of the resulting partition must be maximal among all choices.

The modularity changes brought by each operation can be deduced from (5). Specifically, we have the following.

*(1) In Case 1*. If Op1 is applied, the original partition has a new vertex $v$ added to one of its communities. Denote the community to be changed as $C_j$; then $C_j' = C_j \cup \{v\}$. Suppose $v$ attaches through a single new edge into $C_j$; the resulting modularity is

$$Q(P_1) = \sum_{i \neq j} \left[ \frac{l_i}{m+1} - \left( \frac{d_i}{2(m+1)} \right)^2 \right] + \frac{l_j + 1}{m+1} - \left( \frac{d_j + 2}{2(m+1)} \right)^2,$$

where $P_1$ represents the resulting partition after applying Op1 on the original partition in Case 1.

The other alternative, Op2, creates a new community for the new vertex. Let the new community be $C_{k+1} = \{v\}$; then $C_{k+1}$ and $C_j$ are adjacent. Similarly, we have

$$Q(P_2) = \sum_{i \neq j} \left[ \frac{l_i}{m+1} - \left( \frac{d_i}{2(m+1)} \right)^2 \right] + \frac{l_j}{m+1} - \left( \frac{d_j + 1}{2(m+1)} \right)^2 - \left( \frac{1}{2(m+1)} \right)^2.$$

In order to select between Op1 and Op2 in Case 1, we compare their effects on the original partition:

$$Q(P_1) - Q(P_2) = \frac{1}{m+1} - \frac{d_j + 1}{2(m+1)^2},$$

so Op1 is preferred whenever $2(m+1) > d_j + 1$.
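A quick numeric check of this comparison on a hypothetical toy graph, recomputing modularity from scratch for both operations and matching it against the closed form above:

```python
def modularity(adj, comms):
    m = sum(len(n) for n in adj.values()) / 2
    return sum(
        sum(1 for u in c for v in adj[u] if v in c) / 2 / m
        - (sum(len(adj[u]) for u in c) / (2 * m)) ** 2
        for c in comms
    )

# Before the change: triangle C_j = {0,1,2} plus vertex 3 attached to 2,
# so m = 4 and d_j = 7. The change: new vertex 4 arrives with one edge to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2}, 4: {2}}
q_op1 = modularity(adj, [{0, 1, 2, 4}, {3}])     # Op1: absorb vertex 4 into C_j
q_op2 = modularity(adj, [{0, 1, 2}, {3}, {4}])   # Op2: vertex 4 as a new community
m, d_j = 4, 7                                     # pre-update edge count and degree sum
delta = 1 / (m + 1) - (d_j + 1) / (2 * (m + 1) ** 2)
```

Here `delta` equals `q_op1 - q_op2` and is positive (0.04), so the rule selects Op1, agreeing with the direct recomputation.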

*(2) In Case 2*. Only Op3 applies, which leaves the current partition as is. Suppose the intracommunity edge lies within $C_j$; then $l_j' = l_j + 1$, $d_j' = d_j + 2$, and $m' = m + 1$. The modularity for $P_3$ is

$$Q(P_3) = \sum_{i \neq j} \left[ \frac{l_i}{m+1} - \left( \frac{d_i}{2(m+1)} \right)^2 \right] + \frac{l_j + 1}{m+1} - \left( \frac{d_j + 2}{2(m+1)} \right)^2.$$

It can be found that $Q(P_3)$ has the same form as $Q(P_1)$, though they result in different partitions. In [30] it has been proved that adding any intracommunity link to a community of a graph will not split it into smaller modules, because this actually increases the community's local modularity. Hence it is reasonable to apply Op3 in response to Case 2.

*(3) In Case 3*. Suppose the added edge connects $C_i$ and $C_j$. Then for Op3 the modularity becomes

$$Q(P_4) = \sum_{t \neq i, j} \left[ \frac{l_t}{m+1} - \left( \frac{d_t}{2(m+1)} \right)^2 \right] + \frac{l_i}{m+1} - \left( \frac{d_i + 1}{2(m+1)} \right)^2 + \frac{l_j}{m+1} - \left( \frac{d_j + 1}{2(m+1)} \right)^2.$$

If Op4 is selected, $C_i$ and $C_j$ are combined into one community, which we denote as $C_u$. The resulting modularity is computed as

$$Q(P_5) = \sum_{t \neq i, j} \left[ \frac{l_t}{m+1} - \left( \frac{d_t}{2(m+1)} \right)^2 \right] + \frac{l_i + l_j + e_{ij} + 1}{m+1} - \left( \frac{d_i + d_j + 2}{2(m+1)} \right)^2,$$

where $e_{ij} \geq 0$ accounts for the intercommunity edges that may already exist between $C_i$ and $C_j$.

In order to obtain the better operator for Case 3, we compare the two:

$$Q(P_5) - Q(P_4) = \frac{e_{ij} + 1}{m+1} - \frac{(d_i + 1)(d_j + 1)}{2(m+1)^2},$$

so Op4 (merging) is chosen only when $2(m+1)(e_{ij} + 1) > (d_i + 1)(d_j + 1)$; otherwise Op3 keeps the partition unchanged.
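The analogous numeric check for Case 3, again on a hypothetical toy graph:

```python
def modularity(adj, comms):
    m = sum(len(n) for n in adj.values()) / 2
    return sum(
        sum(1 for u in c for v in adj[u] if v in c) / 2 / m
        - (sum(len(adj[u]) for u in c) / (2 * m)) ** 2
        for c in comms
    )

# Before the change: two triangles joined by edge (2, 3), so m = 7,
# d_i = d_j = 7, and e_ij = 1. The change: a second edge (1, 4) arrives.
adj = {0: {1, 2}, 1: {0, 2, 4}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {1, 3, 5}, 5: {3, 4}}
ci, cj = {0, 1, 2}, {3, 4, 5}
q_op3 = modularity(adj, [ci, cj])    # Op3: keep the two communities
q_op4 = modularity(adj, [ci | cj])   # Op4: merge them
m, d_i, d_j, e_ij = 7, 7, 7, 1       # pre-update quantities
delta = (e_ij + 1) / (m + 1) - (d_i + 1) * (d_j + 1) / (2 * (m + 1) ** 2)
```

Here `delta` equals `q_op4 - q_op3` and is negative (-0.25), so merging would hurt modularity and Op3 is applied.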

Having analyzed each operator's effect, we present the incremental processing algorithm for handling detected state transition changes. As shown in Algorithm 1, the input is a list of changes, with each item corresponding to a specific case. The algorithm is called after an initial graph has been created. Periodically, new nodes and edges detected on the current transition graph from the history of several episodes are stored in a list, and the algorithm then processes each item sequentially. Finally, a new community partition is generated, which provides the basis for option learning.
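Since Algorithm 1 itself is not reproduced in this excerpt, the following is only our sketch of such a revision loop; the change-list format (`'node'`/`'edge'` tuples) and all names are assumptions:

```python
def revise(adj, comm_of, changes):
    """Rule-based incremental revision. adj: node -> neighbour set (mutated in place);
    comm_of: node -> integer community id; changes: list of
    ('node', new_vertex, known_neighbour) or ('edge', u, v) items."""
    for kind, a, b in changes:
        m = sum(len(n) for n in adj.values()) // 2           # edges before the change
        if kind == 'node':                                   # Case 1
            cj = comm_of[b]
            d_j = sum(len(adj[u]) for u in comm_of if comm_of[u] == cj)
            adj[a] = {b}
            adj[b].add(a)
            if 2 * (m + 1) > d_j + 1:
                comm_of[a] = cj                              # Op1: join neighbour's community
            else:
                comm_of[a] = max(comm_of.values()) + 1       # Op2: open a new community
        else:                                                # edge between known vertices
            ci, cj = comm_of[a], comm_of[b]
            if ci == cj:                                     # Case 2: Op3, nothing to revise
                adj[a].add(b)
                adj[b].add(a)
                continue
            d_i = sum(len(adj[u]) for u in comm_of if comm_of[u] == ci)
            d_j = sum(len(adj[u]) for u in comm_of if comm_of[u] == cj)
            e_ij = sum(1 for u in adj for v in adj[u]
                       if comm_of.get(u) == ci and comm_of.get(v) == cj)
            adj[a].add(b)
            adj[b].add(a)
            if 2 * (m + 1) * (e_ij + 1) > (d_i + 1) * (d_j + 1):
                for u in comm_of:                            # Case 3: Op4, merge cj into ci
                    if comm_of[u] == cj:
                        comm_of[u] = ci
            # otherwise Case 3 resolves to Op3: keep the partition unchanged
    return comm_of

# Two triangles joined by (2, 3); a new state 6 then a new intercommunity edge arrive.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
comm_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
revise(adj, comm_of, [('node', 6, 5), ('edge', 1, 4)])
```

In this run the new state is absorbed via Op1, while the second intercommunity edge is not enough to trigger a merge, so the two communities (and the options built on them) survive intact.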