Research Article

Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation

Table 6

Kernel information of Shields on GTX460.

Branch, IPC, Shared_mem, and Registers are average values for each kernel launch.

| Method | Number of calls | Exe. time (µs) | % Exe. time | Branch | IPC | Shared_mem | Registers | Limiting factor |
|---|---|---|---|---|---|---|---|---|
| memcpyHtoD | 864 | 74184.20 | 15.75% | 0.00 | 0.00 | 0.00 | 0.00 | Number of calls |
| cavlc_bitpack_block | 150 | 62971.30 | 13.37% | 5215.31 | 0.86 | 6656.00 | 14.00 | Parallelism |
| memcpyDtoH | 357 | 60254.80 | 12.80% | 0.00 | 0.00 | 0.00 | 0.00 | Number of calls |
| pframe_intra_coding_luma | 29 | 36969.20 | 7.85% | 104046.00 | 0.30 | 3824.00 | 32.00 | Parallelism |
| me_IntegerSimulsadVote | 29 | 34574.20 | 7.34% | 47548.10 | 0.99 | 1216.00 | 40.00 | Registers |
| me_QR_LowresSearch | 29 | 28985.80 | 6.16% | 65434.90 | 1.36 | 5648.00 | 32.00 | Registers |
| Iframe_luma_residual_coding | 1 | 27286.10 | 5.79% | 873822.00 | 1.94 | 5472.00 | 63.00 | Parallelism |
| ChromaPFrameIntraResidualCoding | 29 | 19010.40 | 4.04% | 1895.59 | 0.74 | 320.00 | 63.00 | Registers |
| pframe_inter_coding_luma | 29 | 18334.80 | 3.89% | 10815.80 | 0.53 | 1824.00 | 42.00 | Parallelism |
| cavlc_texture_codes_luma_DC | 90 | 16730.30 | 3.55% | 10254.50 | 1.45 | 1008.00 | 18.00 | Instruction issue |
| me_HR_Cal_Candidate_SAD | 29 | 7972.38 | 1.69% | 4639.97 | 1.25 | 1584.00 | 19.00 | Block size |
| cavlc_block_context_iframe_LumaAC | 30 | 7900.45 | 1.68% | 1539.20 | 2.23 | 0.00 | 15.00 | Instruction issue |
| cavlc_texture_symbols_luma_AC | 30 | 7585.54 | 1.61% | 23281.90 | 0.94 | 4096.00 | 23.00 | Instruction issue |
| ChromaPFrameInterResidualCoding | 29 | 7196.61 | 1.53% | 7221.10 | 1.63 | 2688.00 | 31.00 | Parallelism |
| me_HR_Candidate_Vote | 29 | 6964.67 | 1.48% | 6781.52 | 1.73 | 272.00 | 21.00 | Parallelism |
| MotionCompensateChroma | 29 | 6353.73 | 1.35% | 4137.38 | 1.08 | 748.00 | 18.00 | Instruction issue |
| memset32_aligned1D | 182 | 4387.74 | 0.93% | 3957.69 | 2.26 | 0.00 | 3.00 | None |
| cavlc_bitpack_MB | 30 | 4362.85 | 0.93% | 2084.40 | 1.72 | 0.00 | 19.00 | Global bandwidth |
| cavlc_block_context_PrevSkipMB | 29 | 4307.42 | 0.91% | 729.00 | 0.79 | 0.00 | 8.00 | Parallelism |
| cavlc_texture_symbols_chroma_AC | 30 | 3908.42 | 0.83% | 9674.63 | 0.40 | 2560.00 | 22.00 | Global bandwidth |
| me_Decimate | 58 | 3695.84 | 0.78% | 1345.78 | 1.48 | 512.00 | 13.00 | Block size |
| CalcCBP_and_TotalCoeff_Luma | 30 | 3498.78 | 0.74% | 257.47 | 1.63 | 4608.00 | 21.00 | Global bandwidth |
| CalcPredictedMVRef | 29 | 3313.12 | 0.70% | 230.28 | 1.45 | 0.00 | 18.00 | Parallelism |
| CalcCBP_and_TotalCoeff_Chroma | 30 | 2855.01 | 0.61% | 1148.83 | 0.82 | 2528.00 | 23.00 | Global bandwidth |
| cudaDeblockMB_kernel_ver | 30 | 2851.80 | 0.61% | 35558.30 | 1.19 | 1040.00 | 31.00 | Global bandwidth |
| cavlc_block_context_ChromaAC | 30 | 2764.67 | 0.59% | 643.33 | 1.82 | 0.00 | 27.00 | Registers |
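
One way to read the table above is to group entries by their limiting factor and add up their shares of execution time. The following is a minimal sketch of that aggregation, not part of the original profiling workflow. It uses only the ten most expensive entries, and the names `ROWS` and `share_by_factor` are hypothetical, introduced here for illustration.

```python
from collections import defaultdict

# (method, % execution time, limiting factor) for the ten most expensive
# entries of the table; the remaining rows are omitted to keep the sketch short.
ROWS = [
    ("memcpyHtoD", 15.75, "Number of calls"),
    ("cavlc_bitpack_block", 13.37, "Parallelism"),
    ("memcpyDtoH", 12.80, "Number of calls"),
    ("pframe_intra_coding_luma", 7.85, "Parallelism"),
    ("me_IntegerSimulsadVote", 7.34, "Registers"),
    ("me_QR_LowresSearch", 6.16, "Registers"),
    ("Iframe_luma_residual_coding", 5.79, "Parallelism"),
    ("ChromaPFrameIntraResidualCoding", 4.04, "Registers"),
    ("pframe_inter_coding_luma", 3.89, "Parallelism"),
    ("cavlc_texture_codes_luma_DC", 3.55, "Instruction issue"),
]


def share_by_factor(rows):
    """Sum the % execution time of all rows that share a limiting factor."""
    totals = defaultdict(float)
    for _method, pct, factor in rows:
        totals[factor] += pct
    return dict(totals)


if __name__ == "__main__":
    # Print the factors from largest to smallest combined share.
    ranked = sorted(share_by_factor(ROWS).items(), key=lambda kv: -kv[1])
    for factor, pct in ranked:
        print(f"{factor:<18}{pct:6.2f}%")
```

Within this subset, the two memory-copy entries (memcpyHtoD and memcpyDtoH) together account for 28.55% of the execution time, and the four parallelism-limited kernels account for 30.90%.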