Research Article

Dynamical Motor Control Learned with Deep Deterministic Policy Gradient

Figure 2

Schematic illustration of the deep deterministic policy gradient (DDPG) method. The critic network approximates the value function by minimizing the temporal-difference (TD) error. The actor network is updated with the gradient supplied by the critic. Two sets of actors and critics are used for stability, shown as the “actor” and “critic” boxes and the “target actor” and “target critic” boxes, respectively. The target critic and target actor are slowly adjusted toward the critic and actor by a “soft update” (curved dashed arrows).
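
For concreteness, the critic loss, actor gradient, and soft update described in the caption can be written in the standard DDPG form; the parameter symbols θ^Q, θ^μ, the minibatch size N, and the averaging constant τ are assumed here for illustration and do not appear in the figure itself:

\[
y_i = r_i + \gamma \, Q'\!\bigl(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\bigr)
\]
\[
L(\theta^{Q}) = \frac{1}{N} \sum_{i} \bigl( y_i - Q(s_i, a_i \mid \theta^{Q}) \bigr)^{2}
\]
\[
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i}
\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_i,\, a = \mu(s_i)} \,
\nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s_i}
\]
\[
\theta^{Q'} \leftarrow \tau \, \theta^{Q} + (1 - \tau)\, \theta^{Q'}, \qquad
\theta^{\mu'} \leftarrow \tau \, \theta^{\mu} + (1 - \tau)\, \theta^{\mu'}
\]

Here the first two lines are the critic update (TD target formed with the target actor μ' and target critic Q', then a squared TD error minimized over a minibatch), the third is the actor update driven by the critic's action gradient, and the last is the soft update of the target networks with a small τ.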