Learning a Mid-Level Representation for Multiview Action Recognition

<div>Framework of our method. First, we densely sample spatiotemporal cuboids from the subvolumes of <i>M</i> views. Suppose that 24 cuboids are extracted from each subvolume; six multitask random forests are then constructed, each built upon cuboids from multiple views sampled at four adjacent positions. All cuboids are then classified by their corresponding random forests, and an integrated histogram is created to represent the cuboids of all <i>M</i> views sampled at the same position.
The concatenation of the histograms for all positions constitutes a mid-level representation of the input <i>M</i> subvolumes. Finally, a random forest is used as the final action classifier.</div>
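
The bookkeeping described in the caption (24 positions, six forests covering four adjacent positions each, one histogram per position, concatenation over positions) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes that "adjacent" positions are consecutive indices, that each forest can be treated as a callable returning a list of discrete bin votes for a cuboid, and that histograms are normalized. The names `forest_index`, `position_histogram`, `mid_level_representation`, and `make_toy_forest` are hypothetical.

```python
from collections import Counter

POSITIONS_PER_SUBVOLUME = 24  # from the caption
POSITIONS_PER_FOREST = 4      # four adjacent positions share one forest
NUM_FORESTS = POSITIONS_PER_SUBVOLUME // POSITIONS_PER_FOREST  # six forests


def forest_index(position):
    """Map a cuboid position (0..23) to the forest covering it (0..5).

    Assumption: adjacent positions are consecutive indices.
    """
    return position // POSITIONS_PER_FOREST


def position_histogram(cuboids_across_views, forest, num_bins):
    """Integrated histogram for one position over all M views.

    `forest` is a hypothetical callable that returns a list of bin
    indices (for example, one vote per tree) for a single cuboid.
    Normalizing the counts is an assumption of this sketch.
    """
    counts = Counter()
    for cuboid in cuboids_across_views:
        counts.update(forest(cuboid))
    total = sum(counts.values()) or 1
    return [counts[b] / total for b in range(num_bins)]


def mid_level_representation(views, forests, num_bins):
    """Concatenate the per-position histograms.

    views[v][p] is the cuboid sampled at position p in view v.
    """
    feature = []
    for p in range(POSITIONS_PER_SUBVOLUME):
        same_position = [view[p] for view in views]
        forest = forests[forest_index(p)]
        feature.extend(position_histogram(same_position, forest, num_bins))
    return feature


def make_toy_forest(offset, num_bins):
    """Hypothetical stand-in for a trained forest: one vote per cuboid."""
    def forest(cuboid):
        return [(cuboid + offset) % num_bins]
    return forest


# Toy data: M = 2 views, with integers standing in for cuboid descriptors.
NUM_BINS = 3
toy_forests = [make_toy_forest(k, NUM_BINS) for k in range(NUM_FORESTS)]
toy_views = [list(range(24)), list(range(1, 25))]
feature = mid_level_representation(toy_views, toy_forests, NUM_BINS)
```

With these toy settings the resulting vector has 24 × 3 entries, one 3-bin histogram per position. In the pipeline described by the caption, a vector of this kind would then be passed to the final random forest classifier, which the sketch does not model.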

Advances in Multimedia

Figure 1