Optimal Policy Learning for Disease Prevention Using Reinforcement Learning

<table class="algorithm-group"><tr><td><table class="algorithm" id="alg1">
<tr><td> </td><td>Input:</td></tr>
<tr><td> </td><td>States: S = {1, …, n}</td></tr>
<tr><td> </td><td>Actions: A = {1, …, n}</td></tr>
<tr><td> </td><td>Rewards: R: S × A ⟶ ℝ</td></tr>
<tr><td> </td><td>Transitions: T: S × A ⟶ S</td></tr>
<tr><td> </td><td>Learning rate α ∈ [0, 1] and discount factor γ ∈ [0, 1]</td></tr>
<tr><td> </td><td>Randomly initialize Q (s, a) ∀ s ∈ S, a ∈ A (s)</td></tr>
<tr><td> </td><td>for every episode do</td></tr>
<tr><td> </td><td>    Initialize s ∈ S</td></tr>
<tr><td> </td><td>    Select a for s on the basis of the exploration strategy (e.g., ε-greedy)</td></tr>
<tr><td> </td><td>    for every step in the episode do</td></tr>
<tr><td> </td><td>       //Repeat until s is terminal</td></tr>
<tr><td> </td><td>      Compute π on the basis of Q and the exploration strategy (e.g., π (s) = argmax<sub>a</sub> Q (s, a))</td></tr>
<tr><td> </td><td>      a ⟵ π (s)</td></tr>
<tr><td> </td><td>      r ⟵ R (s, a)</td></tr>
<tr><td> </td><td>      s′ ⟵ T (s, a)</td></tr>
<tr><td> </td><td>      Q (s, a) ⟵ (1 − α) · Q (s, a) + α [r + γ max<sub>a′</sub> Q (s′, a′)]</td></tr>
<tr><td> </td><td>      s ⟵ s′</td></tr>
</table></td></tr></table>

Scientific Programming



Algorithm 1: Optimal Policy Learning for Disease Prevention Using Reinforcement Learning
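
The update rule in Algorithm 1 can be sketched in Python as tabular Q-learning. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `q_learning` and `greedy_policy`, the `is_terminal` callback, the `max_steps` cap, and the five-state chain environment are all hypothetical and were chosen only to make the example self-contained. Q is initialized to zero rather than randomly so that the run is reproducible, and the value of a terminal state is taken to be zero.

```python
import random


def q_learning(n_states, n_actions, reward, transition, is_terminal,
               alpha=0.5, gamma=0.9, epsilon=0.2,
               episodes=2000, max_steps=100, seed=0):
    """Tabular Q-learning with an epsilon-greedy exploration strategy."""
    rng = random.Random(seed)
    # Assumption: zero initialization instead of the random one in Algorithm 1.
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = rng.randrange(n_states)          # Initialize s in S
        for _ in range(max_steps):           # Repeat until s is terminal
            if is_terminal(s):
                break
            if rng.random() < epsilon:       # explore
                a = rng.randrange(n_actions)
            else:                            # exploit: argmax_a Q(s, a)
                a = max(range(n_actions), key=lambda act: q[s][act])
            r = reward(s, a)
            s_next = transition(s, a)
            # max_a' Q(s', a'), taken as 0 when s' is terminal
            best_next = 0.0 if is_terminal(s_next) else max(q[s_next])
            q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + gamma * best_next)
            s = s_next
    return q


def greedy_policy(q):
    """Return argmax_a Q(s, a) for every state s."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q]


# Hypothetical toy environment: a chain of 5 states, action 0 moves left,
# action 1 moves right, and reaching the last state yields reward 1.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4


def chain_transition(s, a):
    return min(s + 1, GOAL) if a == 1 else max(s - 1, 0)


def chain_reward(s, a):
    return 1.0 if chain_transition(s, a) == GOAL else 0.0


q = q_learning(N_STATES, N_ACTIONS, chain_reward, chain_transition,
               is_terminal=lambda s: s == GOAL)
policy = greedy_policy(q)
print(policy[:GOAL])  # greedy action for each non-terminal state
```

In this toy chain the learned greedy policy moves right in every non-terminal state, since that is the shortest path to the rewarded state. For a disease-prevention setting, `reward` and `transition` would be replaced by the problem's own R and T.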