Research Article

Optimal Policy Learning for Disease Prevention Using Reinforcement Learning

Algorithm 1: Q-Learning.
Input:
States: S = {1, …, n}
Actions: A = {1, …, n}
Rewards: R: S × A ⟶ ℝ
Transitions: T: S × A ⟶ S
Learning rate α ∈ [0, 1] and discount factor γ ∈ [0, 1]
Randomly Initialize Q (s, a) ∀ s ∈ S, a ∈ A (s)
for every episode do
    Initialize s ∈ S
    Select a from s on the basis of exploration strategy (e.g. ε-greedy)
    for every step of the episode do
      //Repeat until s is terminal
      Compute π from Q and the exploration strategy (e.g., π (s) = argmax_a Q (s, a))
      a ⟵ π (s)
      r ⟵ R (s, a)
      s′ ⟵ T (s, a)
      Q (s, a) ⟵ (1 − α) · Q (s, a) + α [r + γ max_a′ Q (s′, a′)]
      s ⟵ s′
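
The update loop of Algorithm 1 can be illustrated with a minimal tabular sketch in Python. This is not the authors' implementation. The five-state chain environment and the names `q_learning`, `chain_transition`, `chain_reward`, `chain_terminal`, and `greedy_policy` are hypothetical and introduced only for illustration. The sketch also makes a few assumptions: Q is initialized to zero rather than randomly (so terminal states keep a value of zero), ties in the greedy choice are broken at random, and each episode is capped at a fixed number of steps.

```python
import random

def q_learning(n_states, n_actions, reward, transition, is_terminal,
               alpha=0.1, gamma=0.9, epsilon=0.1,
               episodes=2000, max_steps=100, seed=0):
    """Tabular Q-learning with an epsilon-greedy exploration strategy."""
    rng = random.Random(seed)
    # Assumption: zero initialization instead of random initialization.
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = rng.randrange(n_states)          # initialize s in S
        for _ in range(max_steps):           # step cap is an assumption
            if is_terminal(s):
                break
            if rng.random() < epsilon:       # explore
                a = rng.randrange(n_actions)
            else:                            # exploit, random tie-breaking
                best = max(Q[s])
                a = rng.choice([b for b in range(n_actions) if Q[s][b] == best])
            r = reward(s, a)
            s_next = transition(s, a)
            # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]
            target = r + gamma * max(Q[s_next])
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
            s = s_next
    return Q

# Hypothetical toy environment: a chain of states 0..4, state 4 is terminal.
# Action 0 moves left, action 1 moves right; reaching state 4 pays reward 1.
GOAL = 4

def chain_transition(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, GOAL)

def chain_reward(s, a):
    return 1.0 if chain_transition(s, a) == GOAL else 0.0

def chain_terminal(s):
    return s == GOAL

def greedy_policy(Q, states):
    """Return argmax_a Q(s, a) for each listed state."""
    return [max(range(len(Q[s])), key=lambda a: Q[s][a]) for s in states]

if __name__ == "__main__":
    Q = q_learning(5, 2, chain_reward, chain_transition, chain_terminal)
    print(greedy_policy(Q, range(GOAL)))
```

In this toy chain the learned greedy policy should move right in every non-terminal state, since that is the shortest path to the only reward. The discount factor γ is what makes the shorter path preferable: without the γ max term in the update, the value of a state would not depend on what can be reached from it.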