Dynamical Motor Control Learned with Deep Deterministic Policy Gradient

<div>Schematic illustration of the dynamical control. (a) Conventional motor control takes the state feedback <svg height="11.5564pt" id="M1" style="vertical-align:-2.26807pt" version="1.1" viewbox="-0.0498162 -9.28833 18.4723 11.5564" width="18.4723pt" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><g transform="matrix(.013,0,0,-0.013,0,0)"><path d="M352 391C352 416 319 448 267 448C236 448 173 423 147 400C107 364 96 332 96 304C96 248 143 210 193 181C241 153 258 124 258 100C258 72 232 38 184 38C151 38 107 66 81 108C77 114 64 116 55 111C34 99 23 84 23 65C23 29 81 -12 134 -12C220 -12 325 61 325 141C325 184 297 215 234 256C194 282 161 309 161 346C161 380 188 401 217 401C255 401 279 380 301 353C308 344 313 341 325 347C341 355 352 371 352 391Z" id="g113-116"></path></g><g transform="matrix(.013,0,0,-0.013,4.875,0)"><path d="M300 -147C201 -63 143 98 143 270S200 602 300 686L282 710C136 610 70 450 70 271V270C70 89 136 -72 282 -170L300 -147Z" id="g113-41"></path></g><g transform="matrix(.013,0,0,-0.013,9.373,0)"><path d="M324 430H196L233 583L223 592L145 529L120 430H54L29 396L31 388H111L56 126C33 15 54 -12 77 -12C137 -12 214 57 250 95L233 119C208 92 155 59 138 59C126 59 120 70 131 125L186 390L298 394L324 430Z" id="g113-117"></path></g><g transform="matrix(.013,0,0,-0.013,13.806,0)"><path d="M275 270C275 450 212 609 64 710L45 686C145 604 203 442 203 270S147 -63 45 -147L64 -170C213 -68 275 89 275 270Z" id="g113-42"></path></g></svg> as input to generate the control signal <span class="nowrap"><svg height="11.5564pt" id="M2" style="vertical-align:-2.26807pt" version="1.1" viewbox="-0.0498162 -9.28833 20.1817 11.5564" width="20.1817pt" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><g transform="matrix(.013,0,0,-0.013,0,0)"><path d="M483 97L471 123C436 91 401 65 392 65C388 65 384 74 390 106C414 239 444 378 457 429L455 433C444 433 429 436 416 439C392 444 368 448 344 448C281 448 204 415 152 376C71 315 23 205 23 103C23 
21 57 -12 85 -12C114 -12 149 6 185 34C231 70 285 119 329 183H331L309 81C292 0 308 -12 326 -12C350 -12 421 24 483 97ZM374 387C370 363 356 291 345 261C315 193 181 50 139 50C124 50 110 71 110 118C110 224 153 331 218 379C238 394 271 402 301 402C329 402 359 394 374 387Z" id="g113-98"></path></g><g transform="matrix(.013,0,0,-0.013,6.58,0)"><path d="M300 -147C201 -63 143 98 143 270S200 602 300 686L282 710C136 610 70 450 70 271V270C70 89 136 -72 282 -170L300 -147Z" id="g113-41"></path></g><g transform="matrix(.013,0,0,-0.013,11.078,0)"><path d="M324 430H196L233 583L223 592L145 529L120 430H54L29 396L31 388H111L56 126C33 15 54 -12 77 -12C137 -12 214 57 250 95L233 119C208 92 155 59 138 59C126 59 120 70 131 125L186 390L298 394L324 430Z" id="g113-117"></path></g><g transform="matrix(.013,0,0,-0.013,15.511,0)"><path d="M275 270C275 450 212 609 64 710L45 686C145 604 203 442 203 270S147 -63 45 -147L64 -170C213 -68 275 89 275 270Z" id="g113-42"></path></g></svg>,</span> and it acts as a regulator or spatial filter on the feedback state. 
(b) The dynamical controller generates the control signal <svg height="11.5564pt" id="M3" style="vertical-align:-2.26807pt" version="1.1" viewbox="-0.0498162 -9.28833 20.1817 11.5564" width="20.1817pt" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><g transform="matrix(.013,0,0,-0.013,0,0)"><path d="M483 97L471 123C436 91 401 65 392 65C388 65 384 74 390 106C414 239 444 378 457 429L455 433C444 433 429 436 416 439C392 444 368 448 344 448C281 448 204 415 152 376C71 315 23 205 23 103C23 21 57 -12 85 -12C114 -12 149 6 185 34C231 70 285 119 329 183H331L309 81C292 0 308 -12 326 -12C350 -12 421 24 483 97ZM374 387C370 363 356 291 345 261C315 193 181 50 139 50C124 50 110 71 110 118C110 224 153 331 218 379C238 394 271 402 301 402C329 402 359 394 374 387Z" id="g113-98"></path></g><g transform="matrix(.013,0,0,-0.013,6.58,0)"><path d="M300 -147C201 -63 143 98 143 270S200 602 300 686L282 710C136 610 70 450 70 271V270C70 89 136 -72 282 -170L300 -147Z" id="g113-41"></path></g><g transform="matrix(.013,0,0,-0.013,11.078,0)"><path d="M324 430H196L233 583L223 592L145 529L120 430H54L29 396L31 388H111L56 126C33 15 54 -12 77 -12C137 -12 214 57 250 95L233 119C208 92 155 59 138 59C126 59 120 70 131 125L186 390L298 394L324 430Z" id="g113-117"></path></g><g transform="matrix(.013,0,0,-0.013,15.511,0)"><path d="M275 270C275 450 212 609 64 710L45 686C145 604 203 442 203 270S147 -63 45 -147L64 -170C213 -68 275 89 275 270Z" id="g113-42"></path></g></svg> by its internal dynamics. 
Note that the dynamical controller runs as a self-contained loop, so in theory the initial state <svg height="9.25202pt" id="M4" style="vertical-align:-3.29111pt" version="1.1" viewbox="-0.0498162 -5.96091 9.8741 9.25202" width="9.8741pt" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><g transform="matrix(.013,0,0,-0.013,0,0)"><path d="M352 391C352 416 319 448 267 448C236 448 173 423 147 400C107 364 96 332 96 304C96 248 143 210 193 181C241 153 258 124 258 100C258 72 232 38 184 38C151 38 107 66 81 108C77 114 64 116 55 111C34 99 23 84 23 65C23 29 81 -12 134 -12C220 -12 325 61 325 141C325 184 297 215 234 256C194 282 161 309 161 346C161 380 188 401 217 401C255 401 279 380 301 353C308 344 313 341 325 347C341 355 352 371 352 391Z" id="g113-116"></path></g><g transform="matrix(.0091,0,0,-0.0091,4.81,3.132)"><path d="M245 635C92 635 37 457 37 312C37 149 91 -12 244 -12C395 -12 449 166 449 312C449 469 395 635 245 635ZM243 598C332 598 358 454 358 312C358 173 334 26 245 26C158 26 128 174 128 313S152 598 243 598Z" id="g50-49"></path></g></svg> and the goal state are sufficient to generate the control command, with or without the feedback state (dotted arrow). 
(c) The dynamical controller is trained with DDPG using the reward <svg height="11.5564pt" id="M5" style="vertical-align:-2.26807pt" version="1.1" viewbox="-0.0498162 -9.28833 19.0855 11.5564" width="19.0855pt" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><g transform="matrix(.013,0,0,-0.013,0,0)"><path d="M393 379C402 394 400 411 393 422C384 437 365 448 348 448C301 448 237 372 186 285H182L193 335C210 408 205 448 178 448C150 448 80 402 29 344L45 321C80 355 114 373 122 373C128 373 130 365 124 330C106 228 76 98 50 -5L57 -12C82 -5 112 3 132 6L172 203C196 256 234 304 254 329C275 355 293 367 306 367C318 367 330 360 342 348C347 343 355 343 365 350S386 367 393 379Z" id="g113-115"></path></g><g transform="matrix(.013,0,0,-0.013,5.488,0)"><path d="M300 -147C201 -63 143 98 143 270S200 602 300 686L282 710C136 610 70 450 70 271V270C70 89 136 -72 282 -170L300 -147Z" id="g113-41"></path></g><g transform="matrix(.013,0,0,-0.013,9.986,0)"><path d="M324 430H196L233 583L223 592L145 529L120 430H54L29 396L31 388H111L56 126C33 15 54 -12 77 -12C137 -12 214 57 250 95L233 119C208 92 155 59 138 59C126 59 120 70 131 125L186 390L298 394L324 430Z" id="g113-117"></path></g><g transform="matrix(.013,0,0,-0.013,14.419,0)"><path d="M275 270C275 450 212 609 64 710L45 686C145 604 203 442 203 270S147 -63 45 -147L64 -170C213 -68 275 89 275 270Z" id="g113-42"></path></g></svg> from the environment (shown as the Env box). The dashed arrow indicates that the controller parameters are updated with the gradients from DDPG.</div>
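
The contrast between panels (a) and (b) can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the architecture from the article: the names `feedback_controller` and `DynamicalController`, the tanh recurrent update, the hidden size, and the random weights are all hypothetical. In the setting of panel (c) the weights would be tuned by DDPG; here they are simply left untrained.

```python
import numpy as np


def feedback_controller(state, goal, gain=1.0):
    """Panel (a) style: the action is a static map of the fed-back state.

    A proportional law is used purely as a stand-in (assumption).
    """
    return gain * (np.asarray(goal, dtype=float) - np.asarray(state, dtype=float))


class DynamicalController:
    """Panel (b) style: the action is read out from an internal state that
    evolves under its own recurrent dynamics (hypothetical toy model)."""

    def __init__(self, s0, goal, n_hidden=16, n_action=None, seed=0):
        s0 = np.atleast_1d(np.asarray(s0, dtype=float))
        goal = np.atleast_1d(np.asarray(goal, dtype=float))
        n_action = s0.size if n_action is None else n_action
        rng = np.random.default_rng(seed)
        # Random, untrained weights (assumption); DDPG would tune these.
        self.W_rec = rng.normal(scale=1.0 / np.sqrt(n_hidden), size=(n_hidden, n_hidden))
        self.W_fb = rng.normal(scale=0.1, size=(n_hidden, s0.size))
        self.W_out = rng.normal(scale=0.1, size=(n_action, n_hidden))
        W_init = rng.normal(size=(n_hidden, s0.size + goal.size))
        # The initial state and the goal only set the initial internal state.
        self.h = np.tanh(W_init @ np.concatenate([s0, goal]))

    def step(self, state=None):
        """Advance the internal dynamics one step and return the action.

        `state` is the optional feedback path (the dotted arrow in panel b).
        """
        drive = self.W_rec @ self.h
        if state is not None:
            drive = drive + self.W_fb @ np.atleast_1d(np.asarray(state, dtype=float))
        self.h = np.tanh(drive)
        return self.W_out @ self.h


# Open-loop rollout: no feedback state is supplied after initialization.
controller = DynamicalController(s0=np.zeros(2), goal=np.ones(2))
open_loop_actions = np.stack([controller.step() for _ in range(5)])
```

The point of the sketch is structural: `feedback_controller` cannot produce an action without the current state, whereas `DynamicalController.step()` can be called with no argument because the action depends only on the internal state seeded from the initial state and the goal.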

Computational Intelligence and Neuroscience


Figure 1: Dynamical Motor Control Learned with Deep Deterministic Policy Gradient