Learning from Demonstrations and Human Evaluative Feedbacks: Handling Sparsity and Imperfection Using Inverse Reinforcement Learning Approach

<div>Performance of the standard IRL method (<span class="nowrap">MLIRL)</span> used in the first stage of our framework. The plain curves are the mean of “EV” scores with respect to demonstration steps and nonoptimality degree. The blue, red, and black circles are different initialization settings for stage 2 of our framework.</div>

Journal of Robotics

Figure 3
