Computational Intelligence and Neuroscience

Volume 2016 (2016), Article ID 7158507, 14 pages

http://dx.doi.org/10.1155/2016/7158507

## A Self-Organizing Incremental Spatiotemporal Associative Memory Networks Model for Problems with Hidden State

Department of Computer Science and Software, Tianjin Polytechnic University, Tianjin 300387, China

Received 31 May 2016; Revised 23 July 2016; Accepted 27 July 2016

Academic Editor: Manuel Graña

Copyright © 2016 Zuo-wei Wang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Identifying the hidden state is important for solving problems with hidden state. We prove any deterministic partially observable Markov decision processes (POMDP) can be represented by a minimal, looping hidden state transition model and propose a heuristic state transition model constructing algorithm. A new spatiotemporal associative memory network (STAMN) is proposed to realize the minimal, looping hidden state transition model. STAMN utilizes the neuroactivity decay to realize the short-term memory, connection weights between different nodes to represent long-term memory, presynaptic potentials, and synchronized activation mechanism to complete identifying and recalling simultaneously. Finally, we give the empirical illustrations of the STAMN and compare the performance of the STAMN model with that of other methods.

#### 1. Introduction

The real environment in which agents are is generally an unknown environment where there are partially observable hidden states, as the large partially observable Markov decision processes (POMDP) and hidden Markov model (HMM) literatures attest. The first problem for solving a POMDP is hidden states identifying. In many papers, the method of using the -step short-term memory to identify hidden states has been proposed. The -step memory is generally implemented through tree-based models, finite state automata, and recurrent neural networks.

The most classic algorithm of tree-based model is U-tree model [1]. This model is a variable length suffix tree model; however, this method can only obtain the task-related experiences rather than general knowledge of the environment. A feature reinforcement learning (FRL) framework [2, 3] is proposed, which considers maps from the past observation-reward-action history to an MDP state. Nguyen et al. [4] introduced a practical search context trees algorithm for realizing the MDP. Veness et al. [5] introduced a new Monte-Carlo tree search algorithm integrated with the context tree weighting algorithm to realize the general reinforcement learning. Because the depth of the suffix tree is restricted, these tree-based methods cannot efficiently handle long-term dependent tasks. Holmes and Isbell Jr. [6] first proposed the looping prediction suffix trees (LPST) in the deterministic POMDP environment, which can map the long-term dependent histories onto a finite LPST. Daswani et al. [7] extended the feature reinforcement learning framework to the space of looping suffix trees, which is efficient in representing long-term dependencies and perform well on stochastic environments. Daswani et al. [8] introduced a squared -learning algorithm for history-based reinforcement learning; this algorithm used a value-based cost function. Another similar work is by Timmer and Riedmiller [9], who presented the identify and exploit algorithm to realize the reinforcement learning with history lists, which is a model-free reinforcement learning algorithm for solving POMDP. Talvitie [10] proposed the temporally abstract decision trees to learning partially observable models. These -step memory representations based on multidimensional tree required additional computation models, resulting in poor time performances and more storage space. And these models have poor tolerance to fault and noise because of the accurate matching of each item.

More related to our work, finite state automata (FSA) has been proved to approximate the optimal policy on belief states arbitrarily well. McCallum [1] and Mahmud [11] both introduced the incremental search algorithm for learn probabilistic deterministic finite automata, but these methods learn extremely slowly and with some other restrictions. Other scholars use recurrent neural networks (RNN) to acquire memory capability. A well known architecture for RNN is Long Short-term Memory (LSTM) proposed by Hochreiter and Schmidhuber [12]. Deep reinforcement learning (DRL) [13] first was proposed by Mnih et al., which used deep neural networks to capture and infer hidden states, but this method still apply to MDP. Recently, deep recurrent -learning was proposed [14], where a recurrent LSTM model is used to capture the long-term dependencies in the history. Similar methods were proposed to learn hidden states for solving POMDP [15, 16]. A hybrid recurrent reinforcement learning approach that combined the supervised learning with RL was introduced to solve customer relationship management [17]. These methods can capture and identify hidden states in an automatic way. Because these networks use common weights and fixed structure, it is difficult to achieve incremental learning. These networks were suited to resolve the spatiotemporal pattern recognition (STPR), which is extraction of spatiotemporal invariances from the input stream. For the temporal sequence learning and recalling in more accurate fashion, such as the trajectory planning, decision making, robot navigation, and singing, special neural network models for temporal sequence learning may be more suitable.

Biologically inspired associative memory networks (AMN) have shown some success for this temporal sequence learning and recalling. These networks are not limited to specific structure, realizing incremental sequence learning in an unsupervised fashion. Wang and Yuwono [18] established a model to recognize and learn complex sequence which is also capable of incremental learning but need to provide different identifiers for each sequence artificially. Sudo et al. [19] proposed the self-organizing incremental associative memory (SOIAM) to realize the incremental learning. Keysermann and Vargas [20] proposed a novel incremental associative learning architecture for multidimensional real-valued data. However, these methods cannot address temporal sequences. By using time-delayed Hebb learning mechanism, a self-organizing neural network for learning and recall of complex temporal sequences with repeated items and shared items is presented in [21, 22], which was successfully applied to robot trajectory planning. Tangruamsub et al. [23] presented a new self-organizing incremental associative memory for robot navigation, but this method only dealt with simple temporal sequences. Nguyen et al. [24] proposed a long-term memory architecture which is characterized by three features: hierarchical structure, anticipation, and one-shot learning. Shen et al. [25, 26] provided a general self-organizing incremental associative memory network. This model not only leaned binary and nonbinary information but realized one-to-one and many-to-many associations. Khouzam [27] presented a taxonomy about temporal sequences processing methods. Although these models realized heteroassociative memory for complex temporal sequences, the memory length still is decided by designer which cannot vary in self-adaption fashion. Moreover, These models are unable to handle complex sequence with looped hidden state.

The rest of this paper is organized as follows: In Section 2, we introduce the problem setup, present the theoretical analysis for a minimal, looping hidden state transition model, and derive a heuristic constructing algorithm for this model. In Section 3, STAMN model is analyzed in detail, including its short-term memory (STM), long-term memory (LTM), and the heuristic constructing process. In Section 4, we present detailed simulations and analysis of the STAMN model and compare the performance of the STAMN model with that of other methods. Finally, a brief discussion and conclusion are given in Sections 5 and 6 separately.

#### 2. Problem Setup

A deterministic POMDP environment can be represented by a tuple , where is the finite set of hidden world states, is the set of actions that can be taken by the agent, is the set of possible observations, is a deterministic transition function , and is a deterministic observation function . In this thesis, we only consider the special observation function that solely depends on the state . A history sequence is defined as a sequence of past observations and actions , which can be generated by the deterministic transition function and the deterministic observation function . The length of a history sequence is defined as the number of observations in this history sequence.

The environment that we discuss is deterministic and the state space is finite. We also assume the environment is strongly connected. However the environment is deterministic, which can be highly complicated, and nondeterministic at the level of observation. The hidden state can be fully identified by a finite history sequence in a deterministic POMDP, which is proved in [2]. Several notations are defined as follows: is a possible observation following by taking action . denotes the observations sequence generated by taking actions sequence from state .

Our goal is to construct a minimal, looping hidden state transition model by use of the sufficient history sequences. First we present the theoretical analysis showing any deterministic POMDP environment can be represented by a minimal, looping hidden state transition model. Then we present a heuristic constructing algorithm for this model. Corresponding definitions and lemmas are proposed as follows.

*Definition 1 (identifying history sequence). *A identifying history sequence is considered to uniquely identify the hidden state . In the rest of this paper, the hidden state is equally regarded as its identifying history sequence , so and can replace each other.

A simple example of deterministic POMDP is illustrated in Figure 1. We conclude easily that an identifying history sequence for is expressed by and an identifying history sequence for is expressed by . Note that there may exist infinitely many identifying history sequences for because the environment is strongly connected and may exist as unbounded long identifying history sequences for because of uninformative looping. So this leads us to determine the minimal identifying history sequences length .