Research Article
Efficient Parallel Video Processing Techniques on GPU: From Framework to Implementation
Table 6
Kernel information of Shields on GTX460.
Branch, IPC, Shared_mem, and Registers are average values for each kernel launch.

| Method | Number of calls | Exe. time (µs) | % Exe. time | Branch | IPC | Shared_mem | Registers | Limited factors |
|---|---|---|---|---|---|---|---|---|
| memcpyHtoD | 864 | 74184.20 | 15.75% | 0.00 | 0.00 | 0.00 | 0.00 | |
| cavlc_bitpack_block | 150 | 62971.30 | 13.37% | 5215.31 | 0.86 | 6656.00 | 14.00 | Parallelism |
| memcpyDtoH | 357 | 60254.80 | 12.80% | 0.00 | 0.00 | 0.00 | 0.00 | |
| pframe_intra_coding_luma | 29 | 36969.20 | 7.85% | 104046.00 | 0.30 | 3824.00 | 32.00 | Parallelism |
| me_IntegerSimulsadVote | 29 | 34574.20 | 7.34% | 47548.10 | 0.99 | 1216.00 | 40.00 | Registers |
| me_QR_LowresSearch | 29 | 28985.80 | 6.16% | 65434.90 | 1.36 | 5648.00 | 32.00 | Registers |
| Iframe_luma_residual_coding | 1 | 27286.10 | 5.79% | 873822.00 | 1.94 | 5472.00 | 63.00 | Parallelism |
| ChromaPFrameIntraResidualCoding | 29 | 19010.40 | 4.04% | 1895.59 | 0.74 | 320.00 | 63.00 | Registers |
| pframe_inter_coding_luma | 29 | 18334.80 | 3.89% | 10815.80 | 0.53 | 1824.00 | 42.00 | Parallelism |
| cavlc_texture_codes_luma_DC | 90 | 16730.30 | 3.55% | 10254.50 | 1.45 | 1008.00 | 18.00 | Instruction issue |
| me_HR_Cal_Candidate_SAD | 29 | 7972.38 | 1.69% | 4639.97 | 1.25 | 1584.00 | 19.00 | Block size |
| cavlc_block_context_iframe_LumaAC | 30 | 7900.45 | 1.68% | 1539.20 | 2.23 | 0.00 | 15.00 | Instruction issue |
| cavlc_texture_symbols_luma_AC | 30 | 7585.54 | 1.61% | 23281.90 | 0.94 | 4096.00 | 23.00 | Instruction issue |
| ChromaPFrameInterResidualCoding | 29 | 7196.61 | 1.53% | 7221.10 | 1.63 | 2688.00 | 31.00 | Parallelism |
| me_HR_Candidate_Vote | 29 | 6964.67 | 1.48% | 6781.52 | 1.73 | 272.00 | 21.00 | Parallelism |
| MotionCompensateChroma | 29 | 6353.73 | 1.35% | 4137.38 | 1.08 | 748.00 | 18.00 | Instruction issue |
| memset32_aligned1D | 182 | 4387.74 | 0.93% | 3957.69 | 2.26 | 0.00 | 3.00 | None |
| cavlc_bitpack_MB | 30 | 4362.85 | 0.93% | 2084.40 | 1.72 | 0.00 | 19.00 | Global bandwidth |
| cavlc_block_context_PrevSkipMB | 29 | 4307.42 | 0.91% | 729.00 | 0.79 | 0.00 | 8.00 | Parallelism |
| cavlc_texture_symbols_chroma_AC | 30 | 3908.42 | 0.83% | 9674.63 | 0.40 | 2560.00 | 22.00 | Global bandwidth |
| me_Decimate | 58 | 3695.84 | 0.78% | 1345.78 | 1.48 | 512.00 | 13.00 | Block size |
| CalcCBP_and_TotalCoeff_Luma | 30 | 3498.78 | 0.74% | 257.47 | 1.63 | 4608.00 | 21.00 | Global bandwidth |
| CalcPredictedMVRef | 29 | 3313.12 | 0.70% | 230.28 | 1.45 | 0.00 | 18.00 | Parallelism |
| CalcCBP_and_TotalCoeff_Chroma | 30 | 2855.01 | 0.61% | 1148.83 | 0.82 | 2528.00 | 23.00 | Global bandwidth |
| cudaDeblockMB_kernel_ver | 30 | 2851.80 | 0.61% | 35558.30 | 1.19 | 1040.00 | 31.00 | Global bandwidth |
| cavlc_block_context_ChromaAC | 30 | 2764.67 | 0.59% | 643.33 | 1.82 | 0.00 | 27.00 | Registers |
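As a reading aid for Table 6, the short Python sketch below derives two quantities the table does not list directly: the average execution time per call, and the combined share of execution time spent on host-device copies. The four rows are copied from the table. The helper names `per_call_us` and `transfer_share` are illustrative assumptions, not part of the profiled encoder or of any profiling tool.

```python
# Rows copied from Table 6:
# (method, number of calls, execution time in microseconds, % of execution time)
rows = [
    ("memcpyHtoD", 864, 74184.20, 15.75),
    ("cavlc_bitpack_block", 150, 62971.30, 13.37),
    ("memcpyDtoH", 357, 60254.80, 12.80),
    ("Iframe_luma_residual_coding", 1, 27286.10, 5.79),
]


def per_call_us(calls, total_us):
    """Average execution time of a single call, in microseconds (hypothetical helper)."""
    return total_us / calls


def transfer_share(table_rows):
    """Sum of the '% Exe. time' entries of the memcpy rows (hypothetical helper)."""
    return sum(pct for name, _, _, pct in table_rows if name.startswith("memcpy"))


for name, calls, total_us, pct in rows:
    print(f"{name:30s} {per_call_us(calls, total_us):12.2f} us/call {pct:6.2f}%")

print(f"memcpy share of execution time: {transfer_share(rows):.2f}%")
```

On these rows, the two memcpy entries together account for 28.55% of the execution time, more than any single kernel in the table.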