Research Article

Optimal Policy Learning for Disease Prevention Using Reinforcement Learning

Algorithm 2: SARSA(λ).
Input:
States: S = 1, …, n
Actions: A = 1, …, m
Rewards: R: S × A ⟶ ℝ
Transitions: T: S × A ⟶ S
α ∈ [0, 1] (learning rate) and γ ∈ [0, 1] (discount factor)
λ ∈ [0, 1], the trace-decay parameter that controls the trade-off between temporal-difference and Monte Carlo updates.
Randomly initialize Q(s, a) ∀ s ∈ S, a ∈ A(s)
for every episode do
    Randomly initialize s ∈ S
    e(s, a) ⟵ 0 ∀ s ∈ S, a ∈ A
    Choose an initial action a ∈ A(s) (e.g., at random)
    for every step in the episode do
      //Repeat until s is terminal
      r ⟵ R(s, a)
      s′ ⟵ T(s, a)
      Compute π from Q using an exploration strategy (e.g., ε-greedy)
      a′ ⟵ π(s′)
      e(s, a) ⟵ e(s, a) + 1
      δ ⟵ r + γ·Q(s′, a′) − Q(s, a)
      for all (s̄, ā) ∈ S × A do
         Q(s̄, ā) ⟵ Q(s̄, ā) + α·δ·e(s̄, ā)
         e(s̄, ā) ⟵ γ·λ·e(s̄, ā)
      s ⟵ s′
      a ⟵ a′
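
For concreteness, the following Python sketch implements the tabular SARSA(λ) procedure of Algorithm 2 with accumulating eligibility traces. The environment interface (the reward, transition, and is_terminal callbacks) and all parameter values are illustrative assumptions rather than part of the algorithm specification; note that the vectorized update Q += alpha * delta * e realizes the "for all (s̄, ā) ∈ S × A" loop in one step.

import numpy as np

def sarsa_lambda(n_states, n_actions, reward, transition, is_terminal,
                 alpha=0.1, gamma=0.95, lam=0.9, epsilon=0.1,
                 episodes=500, rng=None):
    # Hypothetical environment callbacks, mirroring R and T in Algorithm 2:
    #   reward(s, a) -> float, transition(s, a) -> next state,
    #   is_terminal(s) -> bool.
    if rng is None:
        rng = np.random.default_rng()
    Q = rng.normal(scale=0.01, size=(n_states, n_actions))  # random init of Q

    def epsilon_greedy(s):
        # Exploration strategy: random action with probability epsilon,
        # otherwise the greedy action under the current Q.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)                 # reset eligibility traces
        s = int(rng.integers(n_states))      # random start state
        a = epsilon_greedy(s)                # initial action
        while not is_terminal(s):
            r = reward(s, a)
            s_next = transition(s, a)
            a_next = epsilon_greedy(s_next)  # a' drawn from the policy
            e[s, a] += 1.0                   # accumulating trace
            delta = r + gamma * Q[s_next, a_next] - Q[s, a]
            Q += alpha * delta * e           # update all (s, a) pairs
            e *= gamma * lam                 # decay all traces
            s, a = s_next, a_next
    return Q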