Computing Low-Rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and Its Application to Solving a Hierarchically Semiseparable Linear System of Equations

<table class="figure-group"><tr class="fig-image" id="a"><td><object data="https://static.hindawi.com/articles/sp/volume-2015/246019/figures/246019.fig.006a.svgz" name="246019.fig.006a" type="image/svg+xml"></object></td></tr><tr class="fig-caption"><td><b>(a) </b>Whole trace (<i>n</i> = <i>m</i> = 1000)</td></tr><tr class="fig-image" id="b"><td><object data="https://static.hindawi.com/articles/sp/volume-2015/246019/figures/246019.fig.006b.svgz" name="246019.fig.006b" type="image/svg+xml"></object></td></tr><tr class="fig-caption"><td><b>(b) </b>Partial zoomed-in trace</td></tr></table>

<div>Execution trace of the hybrid QP3 implementation. The top trace is on the CPU, while the remaining two traces are on the GPU with two GPU streams. Matrix-vector multiply, matrix-matrix multiply, column swap, pivot selection, reflector generation, norm computation, and communication are shown in green, purple, orange, magenta, red, cyan, and black, respectively. Since the BLAS matrix-vector multiply routine does not support a vector-matrix multiply, a matrix-matrix multiply is used to compute <b>f</b><sub><i>j</i></sub> at Step 1.4 of Algorithm <a href="../alg2/">2</a>. The second GPU stream is used to transfer the next panel and top block row to the CPU.</div>

Scientific Programming

Figure 6
