Input:
    States: S = {1, …, n}
    Actions: A = {1, …, n}
    Rewards: R : S × A ⟶ ℝ
    Transitions: T : S × A ⟶ S
    Learning rate α ∈ [0, 1] and discount factor γ ∈ [0, 1]
    Trace decay λ ∈ [0, 1], which controls the trade-off between temporal-difference and Monte Carlo updates

Randomly initialize Q(s, a) for all s ∈ S, a ∈ A(s)
for every episode do
    Initialize e(s, a) ⟵ 0 for all (s, a) ∈ S × A
    Randomly select an initial pair (s, a) ∈ S × A
    for every step of the episode do  // repeat until s is terminal
        r ⟵ R(s, a)
        s′ ⟵ T(s, a)
        Derive π from Q using an exploration strategy (e.g. ε-greedy)
        a′ ⟵ π(s′)
        e(s, a) ⟵ e(s, a) + 1
        δ ⟵ r + γ·Q(s′, a′) − Q(s, a)
        for all (s̃, ã) ∈ S × A do
            Q(s̃, ã) ⟵ Q(s̃, ã) + α·δ·e(s̃, ã)
            e(s̃, ã) ⟵ γ·λ·e(s̃, ã)
        s ⟵ s′
        a ⟵ a′
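This update rule is SARSA(λ) with accumulating eligibility traces. Below is a minimal tabular sketch in Python, assuming a deterministic MDP given as reward and transition arrays; the function name `sarsa_lambda`, the hyperparameter values, and the toy chain environment in the usage snippet are illustrative assumptions, not part of the original.

```python
import numpy as np

def sarsa_lambda(R, T, terminal, n_episodes=500, alpha=0.1,
                 gamma=0.95, lam=0.9, epsilon=0.1, max_steps=100,
                 rng=None):
    """SARSA(lambda) with accumulating eligibility traces.

    R[s, a]  -- immediate reward for taking action a in state s
    T[s, a]  -- successor state (deterministic transition)
    terminal -- set of terminal state indices
    """
    rng = rng or np.random.default_rng(0)
    n_states, n_actions = R.shape
    Q = rng.normal(scale=0.01, size=(n_states, n_actions))  # random init

    def epsilon_greedy(s):
        # Exploration strategy: random action with probability epsilon,
        # otherwise greedy with respect to the current Q estimates.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        e = np.zeros_like(Q)                 # reset traces each episode
        s = int(rng.integers(n_states))      # random initial state
        a = int(rng.integers(n_actions))     # random initial action
        if s in terminal:                    # skip episodes starting terminal
            continue
        for _ in range(max_steps):
            r, s_next = R[s, a], int(T[s, a])
            a_next = epsilon_greedy(s_next)  # a' <- pi(s')
            e[s, a] += 1.0                   # accumulate trace for (s, a)
            delta = r + gamma * Q[s_next, a_next] - Q[s, a]
            Q += alpha * delta * e           # update all (s, a) pairs at once
            e *= gamma * lam                 # decay all traces
            s, a = s_next, a_next
            if s in terminal:
                break
    return Q
```

A usage example on a hypothetical 4-state chain, where action 0 moves left, action 1 moves right, and reaching state 3 yields reward 1 and ends the episode:

```python
n = 4
T = np.array([[max(s - 1, 0), min(s + 1, n - 1)] for s in range(n)])
R = np.zeros((n, 2)); R[2, 1] = 1.0
Q = sarsa_lambda(R, T, terminal={3})
print(Q[:3].argmax(axis=1))  # greedy policy should prefer moving right: [1 1 1]
```

Updating every state-action pair through the trace vector `e` is what distinguishes this from one-step SARSA: the TD error δ is propagated backward to all recently visited pairs in a single step, with λ controlling how quickly their credit decays.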
|