Abstract

Identifying the hidden state is important for solving problems with hidden state. We prove that any deterministic partially observable Markov decision process (POMDP) can be represented by a minimal, looping hidden state transition model and propose a heuristic algorithm for constructing such a state transition model. A new spatiotemporal associative memory network (STAMN) is proposed to realize the minimal, looping hidden state transition model. The STAMN uses the decay of neural activity to realize short-term memory, connection weights between different nodes to represent long-term memory, and presynaptic potentials together with a synchronized activation mechanism to complete identifying and recalling simultaneously. Finally, we give empirical illustrations of the STAMN and compare its performance with that of other methods.

1. Introduction

The real environment in which an agent operates is generally unknown and contains partially observable hidden states, as the large partially observable Markov decision process (POMDP) and hidden Markov model (HMM) literatures attest. The first problem in solving a POMDP is identifying the hidden states. Many papers have proposed using a k-step short-term memory to identify hidden states. The k-step memory is generally implemented through tree-based models, finite state automata, and recurrent neural networks.

The most classic tree-based model is the U-tree model [1]. This model is a variable-length suffix tree model; however, it can only acquire task-related experiences rather than general knowledge of the environment. A feature reinforcement learning (FRL) framework [2, 3] has been proposed, which considers maps from the past observation-reward-action history to an MDP state. Nguyen et al. [4] introduced a practical context tree search algorithm for realizing the MDP. Veness et al. [5] introduced a new Monte-Carlo tree search algorithm integrated with the context tree weighting algorithm to realize general reinforcement learning. Because the depth of the suffix tree is restricted, these tree-based methods cannot efficiently handle long-term dependent tasks. Holmes and Isbell Jr. [6] first proposed looping prediction suffix trees (LPST) for deterministic POMDP environments, which can map long-term dependent histories onto a finite LPST. Daswani et al. [7] extended the feature reinforcement learning framework to the space of looping suffix trees, which is efficient in representing long-term dependencies and performs well in stochastic environments. Daswani et al. [8] introduced a Q-learning algorithm for history-based reinforcement learning that uses a value-based cost function. Another similar work is by Timmer and Riedmiller [9], who presented the identify-and-exploit algorithm for reinforcement learning with history lists, a model-free reinforcement learning algorithm for solving POMDPs. Talvitie [10] proposed temporally abstract decision trees for learning partially observable models. These k-step memory representations based on multidimensional trees require additional computation models, resulting in poor time performance and larger storage requirements. These models also tolerate fault and noise poorly because each item must be matched exactly.

More related to our work, finite state automata (FSA) have been proved to approximate the optimal policy on belief states arbitrarily well. McCallum [1] and Mahmud [11] both introduced incremental search algorithms for learning probabilistic deterministic finite automata, but these methods learn extremely slowly and carry other restrictions. Other scholars use recurrent neural networks (RNN) to acquire memory capability. A well-known RNN architecture is the Long Short-Term Memory (LSTM) proposed by Hochreiter and Schmidhuber [12]. Deep reinforcement learning (DRL) [13] was first proposed by Mnih et al., using deep neural networks to capture and infer hidden states, but this method still applies only to MDPs. Recently, deep recurrent Q-learning was proposed [14], where a recurrent LSTM model captures the long-term dependencies in the history. Similar methods have been proposed to learn hidden states for solving POMDPs [15, 16]. A hybrid recurrent reinforcement learning approach combining supervised learning with RL was introduced for customer relationship management [17]. These methods can capture and identify hidden states automatically. However, because these networks use shared weights and a fixed structure, it is difficult for them to achieve incremental learning. They are suited to spatiotemporal pattern recognition (STPR), that is, the extraction of spatiotemporal invariances from the input stream. For learning and recalling temporal sequences in a more exact fashion, as in trajectory planning, decision making, robot navigation, and singing, special neural network models for temporal sequence learning may be more suitable.

Biologically inspired associative memory networks (AMN) have shown some success at such temporal sequence learning and recalling. These networks are not limited to a specific structure and realize incremental sequence learning in an unsupervised fashion. Wang and Yuwono [18] established a model that recognizes and learns complex sequences and is also capable of incremental learning, but it requires different identifiers to be provided artificially for each sequence. Sudo et al. [19] proposed the self-organizing incremental associative memory (SOIAM) to realize incremental learning. Keysermann and Vargas [20] proposed a novel incremental associative learning architecture for multidimensional real-valued data. However, these methods cannot address temporal sequences. Using a time-delayed Hebb learning mechanism, a self-organizing neural network for learning and recall of complex temporal sequences with repeated and shared items is presented in [21, 22] and was successfully applied to robot trajectory planning. Tangruamsub et al. [23] presented a new self-organizing incremental associative memory for robot navigation, but this method only deals with simple temporal sequences. Nguyen et al. [24] proposed a long-term memory architecture characterized by three features: hierarchical structure, anticipation, and one-shot learning. Shen et al. [25, 26] provided a general self-organizing incremental associative memory network. This model not only learned binary and nonbinary information but also realized one-to-one and many-to-many associations. Khouzam [27] presented a taxonomy of temporal sequence processing methods. Although these models realize heteroassociative memory for complex temporal sequences, the memory length is still decided by the designer and cannot vary in a self-adaptive fashion. Moreover, these models are unable to handle complex sequences with looped hidden states.

The rest of this paper is organized as follows. In Section 2, we introduce the problem setup, present the theoretical analysis for a minimal, looping hidden state transition model, and derive a heuristic constructing algorithm for this model. In Section 3, the STAMN model is analyzed in detail, including its short-term memory (STM), long-term memory (LTM), and the heuristic constructing process. In Section 4, we present detailed simulations and analysis of the STAMN model and compare its performance with that of other methods. Finally, a brief discussion and conclusion are given in Sections 5 and 6, respectively.

2. Problem Setup

A deterministic POMDP environment can be represented by a tuple ⟨S, A, O, T, Ω⟩, where S is the finite set of hidden world states, A is the set of actions that can be taken by the agent, O is the set of possible observations, T is a deterministic transition function T: S × A → S, and Ω is a deterministic observation function Ω: S → O. In this paper, we only consider the special observation function that depends solely on the state s. A history sequence is defined as a sequence of past observations and actions h = o_1 a_1 o_2 a_2 ... a_{t-1} o_t, which is generated by the deterministic transition function T and the deterministic observation function Ω. The length of a history sequence is defined as the number of observations in the sequence.

The environment we discuss is deterministic and its state space is finite. We also assume the environment is strongly connected. Although the environment is deterministic, it can be highly complicated and is nondeterministic at the level of observations. A hidden state can be fully identified by a finite history sequence in a deterministic POMDP, which is proved in [2]. Several notations are defined as follows: trans(h, a) is the observation following history h after taking action a; trans(s, a_1 a_2 ... a_n) denotes the observation sequence generated by taking the action sequence a_1 a_2 ... a_n from state s.
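To make the notation concrete, the following Python sketch shows a tiny deterministic POMDP and the trans operator; the dictionary encoding and the concrete states, actions, and observations are our own illustrative assumptions, not part of the paper.

```python
# A minimal sketch of a deterministic POMDP (assumed encoding):
# T maps (state, action) -> next state; Omega maps state -> observation.
T = {("s1", "a"): "s2", ("s2", "a"): "s3", ("s3", "a"): "s1"}
Omega = {"s1": "o1", "s2": "o1", "s3": "o2"}  # s1 and s2 alias under observation

def trans(s, actions):
    """Observation sequence generated by taking an action sequence from state s."""
    obs = []
    for a in actions:
        s = T[(s, a)]
        obs.append(Omega[s])
    return obs

# Two states with the same observation are distinguished by their futures:
assert trans("s1", ["a", "a"]) != trans("s2", ["a", "a"])
```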

Our goal is to construct a minimal, looping hidden state transition model from sufficient history sequences. First we present the theoretical analysis showing that any deterministic POMDP environment can be represented by a minimal, looping hidden state transition model. Then we present a heuristic constructing algorithm for this model. The corresponding definitions and lemmas are proposed as follows.

Definition 1 (identifying history sequence). An identifying history sequence h_s is a history sequence that uniquely identifies the hidden state s. In the rest of this paper, the hidden state s is regarded as equivalent to its identifying history sequence h_s, so s and h_s can replace each other.
A simple example of a deterministic POMDP is illustrated in Figure 1, from which an identifying history sequence for each hidden state can easily be read off. Note that there may exist infinitely many identifying history sequences for a state because the environment is strongly connected, and there may exist unboundedly long identifying history sequences because of uninformative looping. This leads us to determine the minimal identifying history sequence length k.

Definition 2 (minimal identifying history sequence length k). The minimal identifying history sequence for the hidden state s is an identifying history sequence h such that no suffix of h also identifies s. However, a minimal identifying history sequence may have unbounded length, because an identifying history sequence can traverse a loop arbitrarily many times, which merely lengthens the history sequence without adding information; the repeated segment is treated as the looped item. So the minimal identifying history sequence length k for the hidden state s is the minimal length over all identifying history sequences after excising the looping portions.

Definition 3 (a criterion for identifying history sequences). Given a sufficient history sequence, if every occurrence of a history sequence h for the hidden state s is followed by the same observation trans(h, a) for each action a, then we consider h an identifying history sequence. Definition 3 is correct iff a fully sufficient history sequence is given.
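As a sketch of how Definition 3 could be checked against logged experience, one can verify that every logged continuation of a candidate history agrees for each action; the record format and the function name is_identifying are our assumptions.

```python
def is_identifying(candidate, records):
    """Definition 3 on logged data: `candidate` (a tuple of observations and
    actions) is identifying if every logged continuation maps each action to
    one and the same next observation. `records` holds (history, action,
    next_obs) triples; this encoding is an illustrative assumption."""
    seen = {}
    for history, action, next_obs in records:
        if history[-len(candidate):] == candidate:  # candidate occurs as suffix
            if seen.get(action, next_obs) != next_obs:
                return False  # two different observations for the same action
            seen[action] = next_obs
    return True
```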

Lemma 4 (backward identifying history sequence). Assume h is an identifying history sequence for a single hidden state s. If h' = hao is a backward extension of h obtained by appending an action a and an observation o, then h' is a new identifying history sequence for the single state s' = T(s, a). s' is called the prediction state of s.

Proof. Assume h' is a history sequence generated by the transition function T and observation function Ω, and h is the history sequence that is a prefix of h'. Since h is an identifying history sequence for s, it identifies s uniquely. Because the environment is a deterministic POMDP, the next state s' = T(s, a) is uniquely determined by s after the action a has been executed. Then h' = hao is a new identifying history sequence for the single state s'.

Lemma 5 (forward identifying history sequence). Assume h is an identifying history sequence for a single hidden state s'. If h' is the prefix of h obtained by removing the last action a and the last observation o, then h' is an identifying history sequence for the single state s such that T(s, a) = s'. s is called the previous state of s'.

Proof. Assume h is a history sequence generated by the transition function T and observation function Ω, and h' is the prefix of h obtained by removing the last action a and the last observation o. Since h is an identifying history sequence for the state s', it identifies s' uniquely. Because the environment is a deterministic POMDP, the current state s' is uniquely determined by the previous state s such that T(s, a) = s'. Then h' is an identifying history sequence for the state s such that T(s, a) = s'.

Theorem 6. Given sufficient history sequences, a finite, strongly connected, deterministic POMDP environment can be soundly represented by a minimal, looping hidden state transition model.

Proof. Assume S = {s_1, s_2, ..., s_n}, where S is the finite set of hidden world states and n is the number of hidden states. First, for at least one state s_i there exists a finite-length identifying history sequence. Because the environment is strongly connected, for every other state s_j there must exist a transition history from s_i to s_j; according to Lemma 4, backward extending the identifying history sequence of s_i along this transition history yields an identifying history sequence for s_j. Since the hidden state transition model can identify looping hidden states, the transition history length from s_i to s_j is bounded by n, so every state has a finite identifying history sequence. Thus, this model has a minimal hidden state space of size n.
Since the hidden state transition model correctly realizes the hidden state transition T(s, a), it correctly expresses all identifying history sequences for all hidden states, possesses perfect transition prediction capacity, and has a minimal hidden state space of size n. This model is a k-step variable history length model, in which the memory depth k is a variable value for different hidden states. This model is realized by the spatiotemporal associative memory network in Section 3.
To construct a correct hidden state transition model, we first need to collect sufficient history sequences and perform statistical tests to determine the minimal identifying history sequence length k. In practice, however, a sufficient history can be difficult to obtain, so we propose a heuristic algorithm for constructing the state transition model. Without sufficient history sequences, the algorithm may produce a premature state transition model, but this model is at least correct for the past experience. We develop the heuristic constructing algorithm by way of two definitions and two lemmas as follows.

Definition 7 (current history sequence). A current history sequence h_t is the k-step history sequence o_{t-k} a_{t-k} ... a_{t-1} o_t generated by the transition function T and the observation function Ω. At t = 0, the current history sequence is empty. o_t is the observation vector at the current time t, and o_{t-k} is the observation vector k steps before time t.

Definition 8 (transition instance). The history sequence associated with time t is captured as a transition instance. The transition instance is represented by the tuple ⟨h_{t-1}, a_{t-1}, o_t, h_t⟩, where h_{t-1} and h_t are the current history sequences occurring at times t-1 and t of an episode. A set of transition instances is denoted by the symbol F, which may contain transitions from different episodes.
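A transition instance is easy to represent directly; the following dataclass is a minimal sketch under assumed field names.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class TransitionInstance:
    """One transition instance <h_{t-1}, a_{t-1}, o_t, h_t> from Definition 8."""
    h_prev: Tuple   # current history sequence at time t-1
    action: str     # action a_{t-1} taken at time t-1
    obs: str        # observation o_t received at time t
    h_curr: Tuple   # current history sequence at time t

F = []  # the set of transition instances, possibly spanning several episodes
```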

Lemma 9. For any two hidden states s_1 and s_2, s_1 ≠ s_2 if there exists an action sequence a_1 a_2 ... a_n such that trans(s_1, a_1 a_2 ... a_n) ≠ trans(s_2, a_1 a_2 ... a_n).
This lemma gives us a sufficient condition to determine s_1 ≠ s_2; however, it is not a necessary condition.

Proof by Contradiction. Assume that s_1 = s_2; then for all action sequences a_1 a_2 ... a_n, trans(s_1, a_1 a_2 ... a_n) = trans(s_2, a_1 a_2 ... a_n). This contradicts trans(s_1, a_1 a_2 ... a_n) ≠ trans(s_2, a_1 a_2 ... a_n). Thus, the original proposition is true.

Lemma 10. For any two identifying history sequences h_1 and h_2 for s_1 and s_2, respectively, h_1 ≠ h_2 if s_1 ≠ s_2. However, the converse of Lemma 10 does not hold.

Proof by Contradiction. Since an identifying history sequence uniquely identifies a hidden state, the identifying history sequence h_1 uniquely identifies the hidden state s_1 and the identifying history sequence h_2 uniquely identifies the hidden state s_2. Assume h_1 = h_2; then s_1 = s_2 must hold, which contradicts s_1 ≠ s_2. Thus, the original proposition is true.

However, the converse of Lemma 10 is not always true, because several different identifying history sequences may exist for the same hidden state.

Algorithm 11 (a heuristic constructing algorithm for the state transition model). The initial state transition model is constructed using the minimal identifying history sequence length k. The model is empty initially.
The transition instance chain is empty, and t = 0:
(1) We assume the given model can perfectly identify the hidden state. The agent makes a step in the environment (according to the history sequence definition, the first item in a history is an observation vector) and records the current transition instance at the end of the chain of transition instances. At each time t, for the current history sequence h_t, execute Algorithm 12; if the current state is looped to a hidden state s_i, go to step (2); otherwise a new state node is created in the model; go to step (4).
(2) According to Algorithm 13, if trans(h_t, a) ≠ trans(s_i, a) for some action a, then the current state is distinguished from the identifying state s_i; go to step (3). If no such conflict exists, go to step (4).
(3) According to Lemma 10, for any two identifying history sequences h_1 and h_2 for s_1 and s_2, respectively, h_1 ≠ h_2 if s_1 ≠ s_2. So the identifying history sequence lengths for h_1 and h_2, for s_1 and s_2 separately, must increase to k+1 until h_1 and h_2 are discriminated. We reconstruct the model based on the new minimal identifying history sequence length k+1 and go to step (1).
(4) If the action node and the corresponding exploration value for the current identifying state node exist in the model, the agent chooses its next action node based on the exhaustive exploration function. If they do not exist, the agent chooses a random action instead, and a new action node is created in the model. After the action control signal is delivered, the agent obtains the new observation vector by o = trans(s, a); go to step (1).
(5) Steps (1)-(4) continue until all identifying history sequences for the same hidden state correctly predict trans(s, a) for each action.

Note that we apply a k-step variable history length from 1 to k, where the history length is a variable value for different hidden states. We adopt the minimalist hypothesis model in the process of constructing the state transition model; this constructing algorithm is a heuristic. If we adopted an exhaustiveness assumption model instead, the probability of missing a looped hidden state would increase exponentially, and many valid loops could be rejected, yielding larger redundant state spaces and poor generalization.

Algorithm 12 (the current history sequence identifying algorithm). In order to construct the minimal looping hidden state transition model, we need to identify the looped state by identifying the hidden state. The current history sequence with k-step memory is needed. According to Definition 7, the current k-step history sequence at time t is h_t = o_{t-k} a_{t-k} ... a_{t-1} o_t. There are three identifying processes.

(1) k-Step History Sequence Identifying. If the current k-step history sequence h_t matches the identifying history sequence h_i for the hidden state s_i, then the current state is looped to the hidden state s_i.

In the STAMN, the k-step history sequence identifying activation value is computed for this process.

(2) Following Identifying. If the current state is identified as the hidden state s_i, the next transition history is represented by h' = h_t a o, obtained through o = trans(s_i, a). According to Lemma 4, the next transition history identifies the transition prediction state s_j = T(s_i, a).

In the STAMN, the following activation value is computed for this process.

(3) Previous Identifying. If there exists a transition instance ⟨h_{t-1}, a_{t-1}, o_t, h_t⟩ in which h_t is an identifying history sequence for state s_j, then according to Lemma 5, the previous state s_i is uniquely identified such that T(s_i, a_{t-1}) = s_j. So if there exists a transition prediction state s_j with T(s_i, a_{t-1}) = s_j, then the previous state s_i is uniquely identified.

In the STAMN, the previous activation value is computed for this process.

Algorithm 13 (transition prediction criterion). If the model correctly represents the environment as a Markov chain, then for every state s, all identifying history sequences for the same state satisfy that trans(s, a) is the same observation for the same action a.
If h_1 and h_2 are identifying history sequences with k-step memory for the hidden state s_i and trans(h_1, a) ≠ trans(h_2, a), then according to Lemma 9, h_1 is discriminated from h_2.
In the STAMN, the transition prediction criterion is realized by the transition prediction value.
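A possible realization of the transition prediction criterion over logged records is sketched below; the record encoding and the function name are our assumptions. A conflict signals that the two histories identify different hidden states and the memory depth must grow.

```python
def prediction_conflict(h1, h2, records):
    """Algorithm 13 sketch: h1 and h2 are k-step identifying sequences mapped
    to the same state node; they conflict if the same action ever leads to
    different next observations. `records` holds (history, action, next_obs)."""
    def outcomes(h):
        return {a: o for hist, a, o in records if hist[-len(h):] == h}
    out1, out2 = outcomes(h1), outcomes(h2)
    shared = out1.keys() & out2.keys()
    return any(out1[a] != out2[a] for a in shared)  # True: increase k to k+1
```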

3. Spatiotemporal Associative Memory Networks

In this section, we relate the spatiotemporal sequence problem to the idea of identifying the hidden state by a sequence of past observations and actions. An associative memory network (AMN) is expected to have several characteristics: (1) an AMN must memorize incrementally, that is, have the ability to learn new knowledge without forgetting learned knowledge; (2) an AMN must record not only the temporal order of sequence items but also each item's duration in continuous time; (3) an AMN must be able to realize heteroassociative recall; (4) an AMN must be able to process real-valued feature vectors in a bottom-up manner, not just symbolic items; (5) an AMN must be robust and must be able to recall a sequence correctly from incomplete or noisy input; (6) an AMN should realize learning and recalling simultaneously; (7) an AMN should realize the interaction between STM and LTM; dual-trace theory suggests that the persistent neural activity of STM can lead to LTM.

The first thing to be defined is the temporal dimension (discrete or continuous). Previous research is mostly based on regular sampling intervals Δt, and few AMN have been proposed to deal with not only the sequential order but also each item's duration in continuous time. However, this characteristic is important for many problems. In speech production, writing, music generation, motor planning, and so on, item duration and repeated emergence of an item have essentially different meanings. For example, "AAAA" and "A-A-A-A" are exactly different: the former represents that item A is sustained for 4 timesteps, and the latter represents that item A emerges repeatedly 4 times. The STAMN can explicitly distinguish these two temporal characteristics. So a history of past observations and actions can be expressed by a special spatiotemporal sequence X = x_1 x_2 ... x_n, where n is the length of X; the items of X include the observation items and the action items, where o_t denotes the real-valued observation vector generated by taking action a_{t-1}, and a_t denotes the action taken by the agent at time t. Here t does not index a discrete temporal dimension sampled at regular intervals but represents the t-th step in the continuous-time dimension, whose duration is the time between the current item and the next one.

A spatiotemporal sequence can be classified as a simple sequence or a complex sequence. A simple sequence is a sequence without repeated items, for example, "A-B-C-D", whereas sequences containing repeated items are defined as complex sequences. In a complex sequence, a repeated item can be classified as a looped item or a discriminative item by identifying the hidden state; for example, in the history sequence "A-B-C-A-B", "A" and "B" may be looped items or discriminative items. Identifying the hidden state requires introducing contextual information, which is resolved by the k-step memory. The memory depth k is not fixed and is a variable value in different parts of the state space.

3.1. Spatiotemporal Associative Memory Networks Architecture

We build a new spatiotemporal associative memory network (STAMN). This model makes use of the activity decay of nodes to achieve short-term memory, connection weights between different nodes to represent long-term memory, presynaptic potentials and a neuron synchronized activation mechanism to realize identifying and recalling, and a time-delayed Hebb learning mechanism to fulfil one-shot learning.

The STAMN is an incremental, possibly looping, not fully connected, asymmetric associative memory network. Its nonhomogeneous nodes correspond to hidden state nodes and action nodes.

For a state node j, the input values are defined as follows: (1) the current observation activation value, responsible for the matching degree between state node j and the current observed value, which is obtained from the preprocessing neural networks (if the matching degree is greater than a threshold value, the activation value is set to 1); (2) the observation activation values of the presynaptic node set of the current state node j; (3) the identifying activation value of the previous state node of state node j; (4) the activation value of the previous action node of state node j. The output values are defined as follows: (1) the identifying activation value, which represents whether the current state is identified with the hidden state s_j or not; (2) the transition prediction value, which represents whether state node j is the current state transition prediction node or not.

For an action node i, the input value is defined as the current action activation value, responsible for the matching degree between action node i and the current motor vector. The output value is the activation value of the action node, indicating that the action node has been selected by the agent to control the robot's current action.

In the STAMN, not all nodes and connection weights necessarily exist initially. The k-step long-term memory weights and the one-step transition prediction weights connected to state nodes can be learned incrementally by time-delayed Hebb learning rules, representing the LTM. The weights connected to action nodes can be learned by reinforcement learning. All nodes have an activity self-decay mechanism to record the duration time of the node, representing the STM. The output of the STAMN is the winner state node or winner action node by winner-takes-all.

The STAMN architecture is shown in Figure 2, where black nodes represent action nodes and concentric-circle nodes represent state nodes.

3.2. Short-Term Memory

We use the self-decay of neuron activity to accomplish short-term memory, for both the observation activation and the identifying activation. The activity of each state node has a self-decay mechanism to record the temporal order and the duration time of the node. Supposing the factor of self-decay is γ_s and the activation value of state node i is a_i(t), the self-decay process is

a_i(t) = γ_s a_i(t-1), if γ_s a_i(t-1) ≥ θ_s; a_i(t) = 0, otherwise.

The activity of an action node also has the self-decay characteristic. Supposing the factor of self-decay is γ_a and the activation value of action node j is a_j(t), the self-decay process is

a_j(t) = γ_a a_j(t-1), if γ_a a_j(t-1) ≥ θ_a; a_j(t) = 0, otherwise,

wherein θ_s and θ_a are activity thresholds and γ_s and γ_a are self-decay factors. Together they determine the depth k of short-term memory, where t is a discrete time point obtained by sampling at regular intervals Δt, and Δt is a very small regular interval.
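A short sketch of the self-decay mechanism follows; the multiplicative form and the parameter values are our reconstruction of the omitted equations, not the paper's exact formulas.

```python
import math

def decay(activations, gamma=0.9, theta=0.1):
    """One self-decay tick over all node activations (assumed multiplicative
    form): activity fades by the factor gamma and is forgotten below theta."""
    for node, a in activations.items():
        a = gamma * a
        activations[node] = a if a >= theta else 0.0
    return activations

# The pair (gamma, theta) fixes the STM depth: a unit activation survives for
# k = ceil(log(theta) / log(gamma)) ticks, here about 22.
print(math.ceil(math.log(0.1) / math.log(0.9)))
```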

3.3. Long-Term Memory

Long-term memory can be classified into semantic memory and episodic memory. Learning a history sequence is considered episodic memory and is generally realized by one-shot learning. We use the time-delayed Hebb learning rules to fulfil the one-shot learning.

(1) The k-Step Long-Term Memory Weight. The weights connected to state nodes represent the k-step long-term memory. This is a past-oriented behaviour in the STAMN. The weight w_{ji} is adjusted when state node i is the currently identified activation state node. Because the identifying activation process is a neuron synchronized activation process, at that moment all nodes whose activation values are not zero form the contextual information related to state node i; these nodes are considered the presynaptic node set of the current state node i. The activation values of the presynaptic nodes record not only the temporal order but also the duration time, because of the self-decay of neuron activity: the smaller a presynaptic node's residual activation, the earlier that node occurred before the current state node i. w_{ji} is the activation weight between presynaptic node j and state node i; it records the contextual information related to state node i, to be used in identifying and recalling, and is updated with a learning rate η.

The weight w_{ji} stores time-related contextual information; the update process is shown in Figure 3.
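The following sketch shows one plausible time-delayed Hebb update for these weights; the storage format and the rule of copying the decayed trace into the weight are assumptions consistent with the description above.

```python
def hebb_kstep_update(W, trace, winner, eta=1.0):
    """One-shot time-delayed Hebb update (sketch). `trace[j]` is the decayed
    STM activation of presynaptic node j at the moment state node `winner` is
    identified; storing it in the weight keeps both the temporal order and the
    duration of the context. W maps (presynaptic node, state node) -> weight."""
    for j, a_j in trace.items():
        if a_j > 0.0 and j != winner:
            old = W.get((j, winner), 0.0)
            W[(j, winner)] = (1.0 - eta) * old + eta * a_j  # eta=1: one-shot
    return W
```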

(2) One-Step Transition Prediction Weight. The weights connected to state nodes also represent the one-step transition prediction in LTM. This is a future-oriented behaviour in the STAMN. Using the time-delayed Hebb learning rules, the weight from the previous winning state node and action node to the current state node is adjusted.

The transition activation of the current state node i is associated only with the previous winning state node and action node, where i is the currently identified activation state node and the update uses a learning rate η.

This weight stores the one-step transition prediction information; the update process is shown in Figure 4.
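A sketch of the one-step transition prediction update is given below; the dictionary encoding of (state, action) -> successor weights is our assumption.

```python
def transition_weight_update(V, prev_state, prev_action, curr_state, eta=1.0):
    """Strengthen the edge from the previous winner (state, action) pair to
    the state node identified now, so recall can predict s' = T(s, a)."""
    succ = V.setdefault((prev_state, prev_action), {})
    for s in succ:                 # weaken competing successor predictions
        succ[s] *= (1.0 - eta)
    succ[curr_state] = succ.get(curr_state, 0.0) + eta
    return V
```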

(3) The Weights Connected to Action Nodes. The activation of an action node is associated only with the corresponding state node that selects this action node directly. We assume the state node with the maximal identifying activation value is the winner state node at time t, so the connection weight to the currently selected action node is adjusted accordingly. The weight value is set by an exhaustive exploration function based on a curiosity reward, since this paper only discusses how to build a generalized environment model, not how to learn the optimal policy. The curiosity reward is described by (7). When an action is invalid, its value is defined to be a large negative constant, which avoids entering dead ends. The curiosity reward depends on a constant C, the count of explorations of the currently selected action by the agent, and the average count of explorations of all actions by the agent; it represents the degree of familiarity with the currently selected action. The curiosity reward is updated when each action finishes, according to (8), with a learning rate η.

The update process is shown in Figure 5.

The action node is selected by winner-takes-all among the valid action set of the identified state node: once the state node is identified, the action node with the maximal value is selected by the agent to control the robot's current action, and its activation value is set to 1.
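The exhaustive exploration can be sketched as follows; the concrete curiosity reward C * (mean count - own count) is our reconstruction of the omitted equation (7), not the paper's exact formula.

```python
import random

def select_action(valid_actions, counts, C=1.0):
    """Curiosity-driven exhaustive exploration (sketch): rarely tried actions
    of the identified state node get larger rewards, balancing exploration."""
    n_avg = sum(counts.get(a, 0) for a in valid_actions) / len(valid_actions)
    reward = {a: C * (n_avg - counts.get(a, 0)) for a in valid_actions}
    best = max(reward.values())
    choice = random.choice([a for a, r in reward.items() if r == best])
    counts[choice] = counts.get(choice, 0) + 1  # update familiarity
    return choice
```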

3.4. The Constructing Process in STAMN

In order to construct the minimal looping hidden state transition model, we need to identify the looped state by identifying the hidden state. The identifying phase and the recalling phase (transition prediction phase) exist simultaneously in the STAMN constructing process. There is a chicken-and-egg problem during the constructing process: the building of the STAMN depends on state identifying; conversely, state identifying depends on the current structure of the STAMN. Thus, exhaustive exploration and a k-step variable memory length (depending on the prediction criterion) are used to try to avoid the local minima that this interdependence causes.

According to Algorithm 12, the identifying activation value of a state node depends on three identifying processes: k-step history sequence identifying, following identifying, and previous identifying. We describe the calculation of each identifying process below.

(1) k-Step History Sequence Identifying. The matching degree of the current k-step history sequence with the identifying history sequence for state node i is computed to identify the looping state node i. First, we compute the presynaptic potential of state node i: each presynaptic node j contributes its current activation value, weighted by a confidence parameter β_j that expresses node j's importance degree in the presynaptic node set; the values of β_j can be set in advance and sum to 1. A similarity function measures the degree of agreement between the activation value of presynaptic node j and the contextual information w_{ji} stored in LTM; when the two are close, the similarity is close to 1. According to winner-takes-all, among all nodes whose presynaptic potentials exceed the threshold, the node with the largest presynaptic potential is selected.

The presynaptic potential represents the synchronous activation process of the presynaptic node set of state node i, that is, the matching of the previous k-step contextual information of state node i. To realize full k-step history sequence matching, the k-step history sequence identifying activation value of state node i combines the maximal presynaptic potential with the matching degree between state node i and the current observed value: the node with the largest potential is selected among all nodes whose presynaptic potentials exceed the threshold, and if the resulting identifying value is 1, the current state is identified with the looped state node i by the k-step memory.
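The presynaptic potential could be computed as sketched below; the similarity function and the names are assumptions.

```python
def presynaptic_potential(trace, W, node, beta):
    """Synchronous context matching for `node` (sketch): compare each decayed
    STM trace with the stored LTM context weight, weighted by the confidence
    parameters beta (beta[j] sums to 1 over the presynaptic set of `node`)."""
    def sim(a, w):
        return max(0.0, 1.0 - abs(a - w))  # trace close to weight -> near 1
    return sum(b * sim(trace.get(j, 0.0), W.get((j, node), 0.0))
               for j, b in beta.items())
```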

(2) Following Identifying. If the current state is identified as state s_i, then the next transition prediction state s_j = T(s_i, a) is identified. First, we compute the transition prediction value of state node j from the identifying activation values of the previous winner state node and action node at time t-1, each weighted by a confidence parameter expressing that node's importance degree; the confidence parameters can be set in advance and sum to 1. The one-step transition prediction weights record the transition information related to state node j, to be used in the identifying and recalling phases. If the current state is identified as the hidden state s_i, the transition prediction value of node j represents the probability that the next transition prediction state is s_j.

If the next prediction state node matches the current observation value, the current history sequence is identified with the state node s_j. The following identifying value of state node j is computed from the transition prediction value and this match.

If the following identifying value is 1, the current history sequence is identified with the looped state node s_j by the following identifying. If the transition prediction value is 1 but the predicted state node mismatches the current observed value, then two histories assigned to the same node satisfy trans(h_1, a) ≠ trans(h_2, a). According to the transition prediction criterion (Algorithm 13), the current history sequence is not identified with the hidden state s_j, so the identifying history sequence lengths for the conflicting histories must increase to k+1.

(3) Previous Identifying. If the current history sequence is identified as state s_j, then the previous state s_i is identified such that T(s_i, a_{t-1}) = s_j. First, we compute the identifying activation values for all states; if state s_j is identified at the current time, then the previous state s_i is identified as the previous state of s_j. The previous identifying value of state node i is set to 1 when two conditions are satisfied: Condition 1, state node s_j is identified at the current time; Condition 2, state node i and the previous action node were the previous winner nodes leading to s_j by one-step transition, that is, T(s_i, a_{t-1}) = s_j.

According to the above three identifying processes, the identifying activation value of a state node combines the k-step history sequence identifying value, the following identifying value, and the previous identifying value.
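Equation (16) is omitted in the source; a combination consistent with the text and with winner-takes-all is taking the maximum of the three identifying values, as in this sketch.

```python
def identifying_activation(m_hist, m_follow, m_prev):
    """Final identifying activation of a state node (assumed combination)."""
    return max(m_hist, m_follow, m_prev)

def winner_state_node(m, threshold=0.5):
    """Winner-takes-all over all state nodes whose activation exceeds the
    threshold; returning None means a new state node must be created."""
    node, best = max(m.items(), key=lambda kv: kv[1])
    return node if best >= threshold else None
```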

According to Algorithm 11, we give Pseudocode 1 for Algorithm 11.

Set initial memory depth k = 1; t = 0;
No nodes or connection weights exist initially in the STAMN;
The transition instance chain F is empty;
A STAMN is constructed incrementally through the agent's interaction with the environment,
expanding the STAMN until a transition prediction contradicts the current minimal
hypothesis model, in which case the hypothesis model is reconstructed with an increased
memory depth k. In this paper, the constructing process includes the identifying and
recalling simultaneously.
While  a pattern o or a can be activated from the pattern cognition layer  do
   for all existing state nodes do
    compute the identifying activation value by executing Algorithm 12
    if a state node with identifying activation value 1 exists  then
     the looped state is identified, and the memory and transition weights are adjusted according to (4), (5)
    else
     a new state node is created with its identifying activation value set to 1, and the memory and transition weights are adjusted according to (4), (5)
    end if
   end for
   for all existing state nodes do
    compute the transition prediction criterion by executing Algorithm 13
    if the criterion is violated  then
     for the previous winner state node of the conflicting state node
      increase the memory depth to k + 1 until each identifying history sequence is discriminated
     end for
     reconstruct the hypothesis model with the new memory depth k + 1 according to Algorithm 11
    end if
   end for
   for all existing action nodes do
    compute the activation value according to (9)
    if an action node with activation value 1 exists  then
     the action weight is adjusted according to (6)
    else
     a new action node is created with its activation value set to 1, and the action weight is adjusted according to (6)
    end if
   end for
   The agent obtains the new observation vector by o = trans(s, a)
   t = t + 1
end While

The pseudocode for Algorithm 12, the current history sequence identifying algorithm (identifying phase), computes the identifying activation value according to (16).

The pseudocode for Algorithm 13 is in Pseudocode 2.

TransitionPredictionCriterion(state node j)
  if the transition prediction value according to (12) is 1 and the following identifying value according to (13) is 0  then
   return False
  else
   return True
  end if

4. Simulation

4.1. A Simple Example of Deterministic POMDP

We want to construct the looping transition model of Figure 1. First, we obtain the transition instances F. According to Algorithm 11, set the initial identifying history sequence length k = 1. The STAMN model is constructed incrementally using F, as in Figure 6.

When the looped state is first identified by the current observation vector alone, two histories h_1 and h_2 are both taken as identifying history sequences for the same state. But since their recorded transitions in F disagree, trans(h_1, a) ≠ trans(h_2, a); according to Lemma 9, h_1 and h_2 are identifying history sequences for two distinct states s_1 and s_2, respectively, with s_1 ≠ s_2. So the merged state is distinguished into s_1 and s_2.

According to Lemma 10, for any two identifying history sequences h_1 and h_2 for s_1 and s_2, respectively, h_1 ≠ h_2 if s_1 ≠ s_2. So the identifying history sequence lengths for h_1 and h_2, for s_1 and s_2 separately, must increase to 2. If the resulting length-2 identifying history sequences are still identical, a longer transition instance chain is needed.

With the longer transition instance chain, we obtain identifying history sequences of length 2 for the conflicting states. If the distinguished identifying history sequence for one state is still not discriminated from an identifying history sequence for the other, the identifying history sequence length for that state must increase to 3. The resulting identifying history sequences for the hidden states are described in Table 1.

According to the k-step history identifying, following identifying, and previous identifying processes, each hidden state can then be represented by its minimal identifying history sequence.

According to Pseudocode 1, the STAMN model is constructed incrementally using F, as in Figure 7(a), and the LPST is constructed as in Figure 7(b). The STAMN has the fewest nodes because the state nodes in the STAMN represent hidden states, whereas the state nodes in the LPST represent observation states.

To illustrate the difference between the STAMN and LPST, we present the comparison results in Figure 8.

After 10 timesteps, the algorithms use the currently learned model to perform transition prediction, and the quality of the model is expressed by the prediction error. Each data point represents the average prediction error over 10 runs. We ran three algorithms: the STAMN with exhaustive exploration, the STAMN with random exploration, and the LPST.

Figure 8 shows that all three algorithms eventually produce the correct model with zero prediction error, because there is no noise. However, the STAMN with exhaustive exploration has the best performance and the fastest convergence because of the exhaustive exploration function.

4.2. Experimental Comparison in Grid Problem

First, we use a small POMDP problem to test the feasibility of the STAMN. The grid problem shown in Figure 9, with 11 hidden states, is selected. The agent wanders inside the grid and has only a left sensor and a right sensor to report the existence of a wall at the current position. The agent has four actions: forward, backward, turn left, and turn right. The reference direction of the agent is northward.

In the grid problem, the hidden state space size is 11, the action space size is 4 for each state, and the observation space size is 4. The upper number in each grid cell represents the observation state and the bottom number represents the hidden state. The black cell and the walls are regarded as obstacles. We present a comparison between the LPST and the STAMN; the results are shown in Figure 10. Figure 10 shows that the STAMN with exhaustive exploration again has the better performance and the faster convergence speed in the grid problem.

The numbers of state nodes and action nodes in the STAMN and the LPST are described in Table 2. In the STAMN, a state node represents a hidden state, but in the LPST, a state node represents an observation value, and a hidden state is expressed by the k-step past observations and actions; thus, many observation nodes and action nodes are created repeatedly, and most of them are identical. The LPST has the same perfect observation prediction capability as the STAMN but is not a state transition model.

In the STAMN, the setting of the decay factors and activity thresholds is very important, since it determines the depth of short-term memory. We assume θ = γ^k, with the initial γ and θ chosen to represent the initial memory depth k. When we need to increase k to k+1, we only need to decrease θ according to θ = γ^k. The other parameters are determined relatively easily: we set the learning rates for one-shot learning, set the confidence parameters to equal values representing the same importance degree, and set the remaining constants empirically.
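Under the assumed relation θ = γ^k, the threshold needed for a desired memory depth can be computed directly, as in this small sketch.

```python
# Choosing the activity threshold for a desired STM depth k, assuming the
# relation theta = gamma ** k from Section 3.2 (our reconstruction).
gamma = 0.9
for k in (2, 3, 4):
    theta = gamma ** k
    print(f"k = {k}: theta = {theta:.4f}")
# Deepening the memory from k to k+1 only requires lowering theta;
# the decay factor gamma itself is unchanged.
```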

4.3. Experimental Results in Complex Symmetrical Environments

The symmetrical environment in this paper is very complex, as shown in Figure 11(a). The robot wanders in the environment. By wall-following behaviour, the agent can recognize left-wall, right-wall, and corridor landmarks. These observation landmarks differ from the observation vectors in the previous grid problem in that every observation landmark has a different duration time. To analyse the fault tolerance and robustness of the STAMN, we assume the disturbed environment shown in Figure 11(b).

The numbers in Figure 11(a) represent the hidden states. The robot has four actions: forward, backward, turn left, and turn right. The initial orientation of the robot is shown in Figure 11(a). Since the robot follows the wall in the environment, it has only one optional action in each state. The reference direction of an action is the robot's current orientation. Thus, the paths 2-3-4-5-6 and 12-13-14-15-16 are observationally identical, and a sufficient memory depth is necessary to identify the hidden states reliably. We present a comparison between the STAMN and the LPST in the noise-free environment and the disturbed environment; the results are shown in Figures 12(a) and 12(b). Figure 12(b) shows that the STAMN has better noise tolerance and robustness than the LPST because of the neuron synchronized activation mechanism, which tolerates fault and noise in the sequence item duration time and can realize reliable recalling. The LPST, by contrast, cannot converge correctly in reasonable time because of its exact matching.

5. Discussion

In this section, we compare the related work with the STAMN model. The related work mainly includes the LPST and the AMN.

5.1. Looping Prediction Suffix Tree (LPST)

The LPST is constructed incrementally by expanding branches until they are identified or become looped, using an observable termination criterion. Given a sufficient history, this model can correctly capture all predictions of identifying histories and can map all infinite identifying histories onto a finite LPST.

The STAMN proposed in this paper is similar to the LPST. However, the STAMN is a looping hidden state transition model, so in comparison with the LPST, the STAMN has fewer state nodes and action nodes, because the nodes in an LPST are based on observations, not hidden states. Furthermore, the STAMN has better noise tolerance and robustness than the LPST. The LPST realizes recalling by successive exact matching, which is sensitive to noise and faults, whereas the STAMN offers the neuron synchronized activation mechanism to realize recalling; even in a noisy and disturbed environment, the STAMN can still realize reliable recalling. Finally, the algorithm for learning an LPST is an additional computation model, not a distributed computational model. The STAMN is a distributed network using the synchronized activation mechanism, whose performance does not degrade as history sequences and scale increase.

5.2. Associative Memory Networks (AMN)

The STAMN is proposed based on the development of associative memory networks. Among existing AMN, first, almost all models are unable to handle complex sequences with looped hidden states; the STAMN realizes identifying the looped hidden state, so it can be applied to HMM and POMDP problems. Furthermore, most AMN models use a memory depth determined by experiments, whereas the STAMN offers a self-organizing incremental memory depth learning method in which the memory depth is variable in different parts of the state space. Finally, existing AMN models generally record only the temporal order at discrete intervals rather than item durations in continuous time; the STAMN explicitly deals with the duration time of each item.

6. Conclusion and Future Research

POMDP is a long-standing difficult problem in machine learning. In this paper, the STAMN is proposed to identify looped hidden states using only transition instances in a deterministic POMDP environment. The learned STAMN can be seen as a variable-depth k-step Markov model. We proposed the heuristic constructing algorithm for the STAMN, which is proved to be sound and complete given sufficient history sequences. The STAMN is a truly self-organizing, incremental, unsupervised learning model. The transition instances can be obtained through the agent's interaction with the environment, or from training data that does not depend on a real agent. The STAMN is very fast, robust, and accurate. We have also shown that the STAMN outperforms some existing hidden state methods in deterministic POMDP environments. The STAMN can generally be applied to almost all temporal sequence problems, such as simultaneous localization and mapping (SLAM), robot trajectory planning, sequential decision making, and music generation.

We believe that the STAMN can serve as a starting point for integrating associative memory networks with the POMDP problem. Further research will be carried out on the following aspects: how to scale our approach to the stochastic case by heuristic statistical tests; how to incorporate reinforcement learning to produce a new distributed reinforcement learning model; and how to apply it to robot SLAM to solve practical navigation problems.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this article.