Research Article
Learning from Demonstrations and Human Evaluative Feedbacks: Handling Sparsity and Imperfection Using Inverse Reinforcement Learning Approach
Table 2
Settings used to study different aspects of our framework in the simulated navigation domain.
| Aspect | Demonstration step number | Demonstration optimality degree | Feedback error | Learner policy type |
| --- | --- | --- | --- | --- |
| Comparison (5.1.1) | 100 | 60% (point A5) | 0 | Probabilistic |
| Comparison (5.1.1) | 20 | 100% (point B2) | 0 | Probabilistic |
| Nonoptimal demonstration effect (5.1.2) | 100 | Different (points A1–A8, B10, C) | 0 | Probabilistic |
| Sparse demonstration effect (5.1.3) | Different (points B1–B10, C) | 100% | 0 | Probabilistic |
| Learn only from feedback (5.1.4) | No demo (point C) | — | 0 | Probabilistic |
| Effect of the policy execution (5.1.5) | 100 | 60% (point A5) | 0 | Probabilistic, greedy, random |
| Effect of feedback error (5.1.6) | 100 | 60% (point A5) | ε = 0, 0.1, 0.5 | Probabilistic |
Note: learning from the collected feedback is done in batch mode.
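As a minimal sketch of how the feedback-error parameter ε in the table can be simulated, the snippet below flips the sign of a binary evaluative feedback with probability ε. The function name and the +1/−1 feedback encoding are assumptions for illustration, not details from the article; ε = 0 corresponds to error-free feedback and ε = 0.5 to uninformative noise.

```python
import random


def noisy_feedback(true_feedback: int, epsilon: float, rng: random.Random) -> int:
    """Return the trainer's evaluative feedback (+1 or -1).

    With probability epsilon the sign is flipped, modeling the
    feedback-error settings (epsilon = 0, 0.1, 0.5) listed in Table 2.
    """
    return -true_feedback if rng.random() < epsilon else true_feedback


rng = random.Random(0)
# With epsilon = 0 the feedback is always the true evaluation.
print(noisy_feedback(+1, 0.0, rng))
```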