Input:
    States: S = {1, …, n}
    Actions: A = {1, …, n}
    Rewards: R : S × A ⟶ ℝ
    Transitions: T : S × A ⟶ S
    Learning rate α ∈ [0, 1] and discount factor γ ∈ [0, 1]
    Trace decay λ ∈ [0, 1], which controls the trade-off between temporal-difference and Monte Carlo updates

Randomly initialize Q(s, a) for all s ∈ S, a ∈ A(s)
for every episode do
    Initialize e(s, a) ⟵ 0 for all (s, a) ∈ S × A
    Randomly select an initial pair (s, a) ∈ S × A
    for every step of the episode do  // repeat until s is terminal
        r ⟵ R(s, a)
        s′ ⟵ T(s, a)
        Derive π from Q using an exploration strategy (e.g. ε-greedy)
        a′ ⟵ π(s′)
        e(s, a) ⟵ e(s, a) + 1
        δ ⟵ r + γ·Q(s′, a′) − Q(s, a)
        for all (s̃, ã) ∈ S × A do
            Q(s̃, ã) ⟵ Q(s̃, ã) + α·δ·e(s̃, ã)
            e(s̃, ã) ⟵ γ·λ·e(s̃, ã)
        s ⟵ s′
        a ⟵ a′
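This update rule is SARSA(λ) with accumulating eligibility traces. Below is a minimal tabular sketch in Python, assuming a deterministic MDP given as reward and transition arrays; the function name `sarsa_lambda`, the hyperparameter values, and the toy chain environment in the usage snippet are illustrative assumptions, not part of the original.

```python
import numpy as np

def sarsa_lambda(R, T, terminal, n_episodes=500, alpha=0.1,
                 gamma=0.95, lam=0.9, epsilon=0.1, max_steps=100,
                 rng=None):
    """SARSA(lambda) with accumulating eligibility traces.

    R[s, a]  -- immediate reward for taking action a in state s
    T[s, a]  -- successor state (deterministic transition)
    terminal -- set of terminal state indices
    """
    rng = rng or np.random.default_rng(0)
    n_states, n_actions = R.shape
    Q = rng.normal(scale=0.01, size=(n_states, n_actions))  # random init

    def epsilon_greedy(s):
        # Exploration strategy: random action with probability epsilon,
        # otherwise greedy with respect to the current Q estimates.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        e = np.zeros_like(Q)                 # reset traces each episode
        s = int(rng.integers(n_states))      # random initial state
        a = int(rng.integers(n_actions))     # random initial action
        if s in terminal:                    # skip episodes starting terminal
            continue
        for _ in range(max_steps):
            r, s_next = R[s, a], int(T[s, a])
            a_next = epsilon_greedy(s_next)  # a' <- pi(s')
            e[s, a] += 1.0                   # accumulate trace for (s, a)
            delta = r + gamma * Q[s_next, a_next] - Q[s, a]
            Q += alpha * delta * e           # update all (s, a) pairs at once
            e *= gamma * lam                 # decay all traces
            s, a = s_next, a_next
            if s in terminal:
                break
    return Q
```

A usage example on a hypothetical 4-state chain, where action 0 moves left, action 1 moves right, and reaching state 3 yields reward 1 and ends the episode:

```python
n = 4
T = np.array([[max(s - 1, 0), min(s + 1, n - 1)] for s in range(n)])
R = np.zeros((n, 2)); R[2, 1] = 1.0
Q = sarsa_lambda(R, T, terminal={3})
print(Q[:3].argmax(axis=1))  # greedy policy should prefer moving right: [1 1 1]
```

Updating every state-action pair through the trace vector `e` is what distinguishes this from one-step SARSA: the TD error δ is propagated backward to all recently visited pairs in a single step, with λ controlling how quickly their credit decays.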
|