4kUHD H264 Wireless Live Video Streaming Using CUDA
Ultrahigh definition video streaming has been explored in recent years. Most recently, the possibility of 4kUHD video streaming over wireless 802.11n was presented using preencoded video. Live encoding for streaming using x264 has proven to be very slow. Parallel encoding using CUDA has been explored to speed up the process; however, there has not been a parallel implementation for video streaming. We therefore present, for the first time, a novel implementation of 4kUHD live encoding for streaming over a wireless network at low bitrate indoors, using CUDA for parallel H264 encoding. Experimental results are presented to verify this claim.
1. Introduction
Video streaming in wireless networks has witnessed improvements in recent years. The recent adoption and growing popularity of 4kUHD (3840 × 2160) video resolution make it necessary to investigate the possibilities of streaming such video in a wireless environment. Video display at such resolutions can be done with the aid of SAGE [1, 2] (scalable adaptive graphics environment), tiled displays (using more than one display output unit to produce the required resolution), CAVE  (cave automatic virtual environment), a one-to-many presentation system, and, more recently, commercially produced 4kUHD television sets. Currently, streaming at this resolution is normally done using compressed formats [4–7], as the minimum requirement for uncompressed UHD video starts at 2.39 Gb/s for 8-bit 4 : 2 : 0 subsampling at 24 frames per second. Streaming uncompressed video at this resolution has been carried out both in wired networks [8, 9] and in wireless , using four wireless 60-GHz parallel channels. However, storage and bandwidth limitations make uncompressed video streaming applications harder to deploy. The analysis of H.264 video transmission over 802.11b [11, 12], 802.11g [13, 14], and 802.11n  has provided insight into how the H264 encoder can function in a lossy network such as wireless. Previous studies only cover video resolutions no larger than full high definition (1080p), with the exception of our previous work , which studied the use of preencoded video for live streaming over an 802.11n wireless network. We take this further by implementing a live streaming encoder which can accept both uncompressed video and live capture from a 4kUHD camera.
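The 2.39 Gb/s figure quoted above can be checked with a short calculation. The following is a minimal sketch; the function name and the table of samples per pixel are illustrative and not part of the original work.

```python
def uncompressed_bitrate_bps(width, height, bit_depth, chroma, fps):
    """Raw video bitrate in bits per second for a YCbCr format."""
    # Average samples per pixel: one luma sample plus the two chroma planes,
    # whose combined size depends on the subsampling scheme.
    samples_per_pixel = {"4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}[chroma]
    return width * height * samples_per_pixel * bit_depth * fps


# 4kUHD, 8-bit, 4:2:0 subsampling, 24 frames per second (values from the text).
rate = uncompressed_bitrate_bps(3840, 2160, 8, "4:2:0", 24)
print(f"{rate / 1e9:.2f} Gb/s")  # 2.39 Gb/s
```

The same function shows how quickly the requirement grows: moving to 4 : 4 : 4 doubles the number of samples per pixel relative to 4 : 2 : 0.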
2. 4kUHD Wireless Video Streaming
Quality of experience (QoE) is receiving more attention due to the rapid increase in high quality multimedia content, especially with the adoption of higher video resolutions (such as 4kUHD); on one hand end-user expectations increase, while on the other transmission requirements present new problems . A major problem is how best to choose encoding parameters for both compression efficiency and the intended transmission medium. Wireless network communication is unstable due to environmental factors and can experience interference from other signals, especially those operating in a similar frequency range. Both factors influence dynamic rate scaling in wireless networks. Therefore, when designing a solution for wireless video streaming services, the fidelity of the network must be taken into consideration.
Previous studies already show the possibilities of 4kUHD video streaming over a wireless network in both uncompressed and compressed formats. Uncompressed 4kUHD wireless video transmission and playback has been explored in . That study showed initial possibilities of uncompressed 4kUHD playback and streaming over four commercial off-the-shelf (COTS) WirelessHD  60 GHz parallel channels using a high-end PC and GPU (graphics processing unit) based decoding using CUDA  (compute unified device architecture); however, this was only done over a short range due to interference between the channels. Another study showed the use of 4kUHD H264 compressed video in a live stream over 802.11n . In that study, the chroma subsampling was varied from 4 : 2 : 0 to 4 : 4 : 4 at a low bitrate of 20 Mb/s using a group of pictures (GOP) length of 40. That approach did not consider other video sequences, due to availability, or interroom transmission, due to the bandwidth of 802.11n between rooms; video quality was measured using structural similarity (SSIM) , which is considered to emulate the human visual system (HVS). Furthermore, the studies in [20–22] have shown that the widely used peak signal-to-noise ratio (PSNR) and mean squared error (MSE) are flawed in differentiating the structural content of images, since different types of impairments can be applied and still yield the same MSE value, while PSNR is more sensitive to noise. We therefore use SSIM, since an increase in the spatial resolution of video improves satisfaction for the HVS , and structural impairments are more easily noticed at higher resolution, which can diminish the end-user QoE.
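The point that different impairments can share one MSE value can be illustrated with a small numerical sketch. It uses a simplified, single-window form of SSIM computed over a whole one-dimensional signal (the usual metric is computed over local windows of an image), and the pixel values are made up for illustration only.

```python
def mean(values):
    return sum(values) / len(values)


def mse(x, y):
    return mean([(a - b) ** 2 for a, b in zip(x, y)])


def global_ssim(x, y, dynamic_range=255.0):
    """Simplified SSIM over the whole signal (no sliding window)."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mx, my = mean(x), mean(y)
    vx = mean([(a - mx) ** 2 for a in x])
    vy = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


# Hypothetical pixel row and two impairments with identical MSE.
reference = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0]
shifted = [p + 10.0 for p in reference]                      # brightness shift
alternating = [p + (10.0 if i % 2 == 0 else -10.0)           # structural change
               for i, p in enumerate(reference)]

print(mse(reference, shifted), mse(reference, alternating))
print(global_ssim(reference, shifted), global_ssim(reference, alternating))
```

Both impairments give an MSE of 100, yet the brightness shift keeps the structure intact and scores a higher SSIM than the alternating pattern, which distorts the gradient of the signal.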
Based on the observations stated above, we propose an implementation of 4kUHD live encoding for live video streaming at low bitrates. We employ CUDA in implementing this system, as regular open source x264  has been shown to provide low bitrates for 1080p videos and 4kUHD videos.
3. Background on 802.11ac
802.11ac  provides a high-throughput wireless local area network (WLAN) on the 5 GHz band. Its specification allows for a multistation throughput of 1 Gb/s and a single link throughput of 500 Mb/s, due to the increase in channel width to 80 MHz. Furthermore, it supports multiuser multiple input multiple output (MIMO). The standard also implements standardized beamforming technology and feedback compatibility across vendors, unlike its predecessor 802.11n, where the lack of standardization made it difficult to implement beamforming effectively.
4. Parallel H264 Encoder
Previous studies, showing the parallelization of H.264 encoder applications for GPUs, involved the development of different solutions for separate encoding processes such as prediction, entropy encoding, and deblocking filter. More recently there has been a study of porting these processes onto a GPU using CUDA .
In  an efficient algorithm for block-size motion estimation with fractional pixel refinement on a CUDA GPU is presented. The study decomposes the H264 motion estimation algorithm into 5 steps, allowing for highly parallel computation with low memory transfer rates. In  motion estimation and discrete cosine transform (DCT) coefficient parallelization in CUDA is studied; however, the dependencies between the DCT transformation and intraframe prediction were neglected. A GPU-based intraframe encoding algorithm was presented in ; it reorders block processing in a diagonal order using OpenCL (a platform-independent GPU application development environment); however, this is only one way of reordering the blocks for parallelism. In , all encoding processes are parallelized on a GPU using CUDA; however, the parallel encoder is aimed at high definition video and at video storage, as the implementation does not consider real-time encoding for live video streaming because of the large memory transfer latency.
These related works show improvements in using CUDA for parallel H264 encoding; however, none of them consider live video streaming, as the memory transfer latencies which occur during the pre- and postencoding processes are not suitable for such an application. For example, in  the achieved speedup of the GPU application was only double the performance of a single threaded CPU application. The implementation was found to have architectural bottlenecks as a result of caching latencies and register limitations, and a software solution seemed impossible given the CUDA compute capability available. In  a CPU/GPU parallel model for an H.264 SVC encoder using CUDA was implemented to study the data transfer between host (CPU) and device (GPU). The authors used a raw frame as reference; though this increased data granularity, the resulting image quality was poor; for a resolution of the performance achieved was 1.03 fps. In  the authors port four major processes of H264 encoding to CUDA. They use data localization to organize data and threads to work efficiently on the GPU. However, this approach is constrained by available GPU resources, as it rules out resolutions higher than HD, especially on GPUs with constrained local memory. Furthermore, their implementation introduces a latency bottleneck for critical video applications such as live streaming and real-time encoding/decoding, as data must be transferred to and from the CPU memory before it can be accessed by other processes. We therefore extend the work of  to support zero-copy memory and also implement it as a DirectShow filter so that it can accept a feed from live capture. The modules in  are interprediction, intraprediction, entropy encoding, and deblocking filter.
We also work on the interprediction module, where we implement dynamic parallelism for motion estimation, and on the intraprediction module, where we reduce the number of predictions; every other module within their implementation is retained and used to perform our experiments.
Our GPU architecture differs from previous implementations, as our GPU uses a newer streaming multiprocessor unit called SMX, available through the Kepler GK110  architecture with CUDA compute capability 3.5. This allows for dynamic parallelism: kernels can launch new work on the GPU without needing instructions from the CPU each time. This is very useful in modules such as interprediction, where the macroblocks are subdivided into smaller units, because the CPU need not issue instructions every time a subdivision is to be executed.
5. Implementation Design
Our test bed consists of two computers connected using the 802.11ac  network at channel 36 using a Buffalo  AirStation (wireless transmitter and receiver), as illustrated in Figure 1(a). The open source VLC media player  was used to decode video. Our GPU platform is the NVIDIA Quadro 510; streaming was done using the real-time streaming protocol (RTSP) server filter provided by , and network statistics were collected using Wireshark. We do our tests in three parts:
(1) encoding,
(2) real-time encoding for live streaming,
(3) live capture with a Point Grey Flea 3  4k camera and encoding for live streaming.
In conducting tests 2 and 3, we consider two major scenarios which depict the usage of video streaming indoors in a typical office space with lots of furniture. The two scenarios considered are as follows:
(i) change in distance: intraroom (10 and 20 meters),
(ii) the use of obstruction: interroom (one room and two rooms apart) and interfloor (one floor above).
Figure 2 shows the floor plan of two floors of the office environment used for the experiments. The notations Rx (A, B, and C) show the position of the receiver in other parts of the building during the experiments. In the intraroom scenario, the distance between the transmitter and receiver was 10 and 20 meters, respectively. During each experiment the following test clips were used: Sintel 4k , Coast, Foreman, and News . These video clips were chosen based on motion, with Sintel having the fastest and Foreman the slowest; each video clip had 500 frames. At the beginning of each experiment, the available bandwidth was measured using LAN speed test  software. For the sake of clarity we use arbitrary numbers to denote the scenarios in Table 1.
Based on the data rate observed at the beginning of each scenario we can expect good quality results for peer-to-peer video transmission. The video clips and live capture were compressed using a GOP length of 40 (using I and P frames only), with a bitrate of 20 Mb/s ABR (average bitrate); the designated packet size used was 1500 bytes.
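To give a feel for what these encoding parameters imply on the network side, the following sketch derives a few rough budgets from them. The frame rate is an assumption (the text only mentions 24 frames per second for the uncompressed case), and protocol header overhead is ignored.

```python
BITRATE_BPS = 20_000_000   # 20 Mb/s average bitrate (from the text)
PACKET_BYTES = 1500        # designated packet size (from the text)
GOP_LENGTH = 40            # frames per group of pictures (from the text)
ASSUMED_FPS = 24           # assumption: not stated for the test clips

# Packets needed per second if every packet is full (headers ignored).
packets_per_second = BITRATE_BPS / (PACKET_BYTES * 8)

# Average compressed size available per frame at the target bitrate.
avg_bytes_per_frame = BITRATE_BPS / 8 / ASSUMED_FPS

# Time between two I frames, i.e. the longest wait before a decoder can resync.
gop_duration_s = GOP_LENGTH / ASSUMED_FPS

print(round(packets_per_second), round(avg_bytes_per_frame), round(gop_duration_s, 2))
```

Under these assumptions the stream needs roughly 1667 packets per second, and an I frame arrives about every 1.67 seconds.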
5.1. Directshow Filter Implementation
Microsoft DirectShow  provides an architecture for high-quality video capture, streaming, and playback. It supports a wide variety of formats, including Motion Picture Experts Group (MPEG) and audio-video interleaved (AVI). Furthermore, it supports video capture from analogue and digital devices supported by the Windows Driver Model (WDM).
The main aim of using DirectShow is to simplify the task of creating multimedia applications on the Windows platform irrespective of hardware differences, data transport, and synchronization of the individual components of the framework.
Filters are normally divided into three main categories, namely, source, transform, and render. For this work we implement a transform filter derived from the base class CTransformFilter. The filter implements the following functions in order to achieve its core purpose.
5.1.1. Check Input Type
This function checks whether the media type provided by the upstream filter is acceptable to the encoder filter. In our case the major type of the media should be video. The filter supports the following media subtypes:
(a) WMMEDIASUBTYPE_I420
(b) MEDIASUBTYPE_IYUV
(c) MEDIASUBTYPE_YV12
(d) MEDIASUBTYPE_YUY2.
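The acceptance rule described above can be modelled in a few lines. This is only a language-neutral sketch of the logic: a real DirectShow filter is written in C++, compares GUIDs rather than strings, and returns an HRESULT. The function name and the string representation are hypothetical.

```python
# Subtype names taken from the list above, represented here as plain strings.
ACCEPTED_SUBTYPES = {
    "WMMEDIASUBTYPE_I420",
    "MEDIASUBTYPE_IYUV",
    "MEDIASUBTYPE_YV12",
    "MEDIASUBTYPE_YUY2",
}


def check_input_type(major_type, subtype):
    """Return True when the proposed upstream media type can be encoded."""
    return major_type == "video" and subtype in ACCEPTED_SUBTYPES


print(check_input_type("video", "MEDIASUBTYPE_YV12"))   # True
print(check_input_type("audio", "MEDIASUBTYPE_YV12"))   # False
print(check_input_type("video", "MEDIASUBTYPE_RGB24"))  # False
```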
5.1.2. Get Media Type
This function provides the supported media type of the output pin of the encoder filter to the downstream filter during the connection negotiation process. The major type of the output pin is video, while the subtype it supports is custom defined for the CUDA H264 encoder. It is defined as a globally unique identifier (GUID) in the following manner.
DEFINE_GUID(MEDIASUBTYPE_H264CUDA, 0x55B845A5, 0x8169, 0x4BE7, 0xBA, 0x63, 0x6C, 0x4C, 0x2C, 0x01, 0x26, 0x6D);
This function also makes sure that the input pin of the encoder filter is connected before the output pin attempts any connection with the downstream filter. This is because most of the properties (width, height, frame rate, etc.) between input and output do not change and these must be known beforehand so that they can be conveyed to the downstream filter.
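The arguments of the DEFINE_GUID macro above are one 32-bit value, two 16-bit values, and eight single bytes. As a small aid for anyone who needs to look the subtype up in a graph editor or registry dump, the sketch below assembles those values into the usual hyphenated string form using Python's standard uuid module; the variable names are illustrative.

```python
import uuid

# Values copied from the DEFINE_GUID line in the text.
data1, data2, data3 = 0x55B845A5, 0x8169, 0x4BE7
data4 = bytes([0xBA, 0x63, 0x6C, 0x4C, 0x2C, 0x01, 0x26, 0x6D])

# uuid.UUID(fields=...) takes a 32-bit, two 16-bit, two 8-bit values and a
# 48-bit value, so the last six bytes are combined into one integer.
subtype = uuid.UUID(fields=(data1, data2, data3, data4[0], data4[1],
                            int.from_bytes(data4[2:], "big")))

print(str(subtype).upper())  # 55B845A5-8169-4BE7-BA63-6C4C2C01266D
```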
5.1.4. Decide Buffer Size
This function negotiates memory requirements between the output pin of the encoder filter and the input pin of the downstream filter. This function asks for memory that can hold the largest possible encoded frame. The input pin of the encoder filter must be connected for this function to succeed because it needs to estimate the size of the output frame based on the properties of the raw incoming video.
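The text does not state how the largest possible encoded frame is estimated. One conservative choice, shown in the sketch below purely as an assumption, is to request a buffer as large as one raw input frame, whose size follows directly from the resolution and the input subtype. Function names are hypothetical.

```python
def raw_frame_bytes(width, height, subtype):
    """Size of one uncompressed 8-bit frame for the accepted input subtypes."""
    # I420, IYUV and YV12 are 4:2:0 planar (12 bits per pixel);
    # YUY2 is 4:2:2 packed (16 bits per pixel).
    bits_per_pixel = {"I420": 12, "IYUV": 12, "YV12": 12, "YUY2": 16}[subtype]
    return width * height * bits_per_pixel // 8


def output_buffer_bytes(width, height, subtype):
    # Assumption: the raw frame size serves as an upper bound for the encoded
    # frame; the rule used by the original filter is not given in the text.
    return raw_frame_bytes(width, height, subtype)


print(output_buffer_bytes(3840, 2160, "I420"))  # 12441600
print(output_buffer_bytes(3840, 2160, "YUY2"))  # 16588800
```

This also shows why the input pin has to be connected first: without the negotiated width, height, and subtype there is nothing to base the estimate on.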
5.1.5. Transform
This function receives raw YUV video frame by frame from the input pin and passes it to the encoder instance. During this process, the CPU pointer allocated to the input buffer of the encoder allows the buffer to be filled; when the buffer is full, it is unlocked for the GPU to access. If any output is available from the encoder, the function sends that output to the downstream filter. The function itself does not return the encoded frame; the encoded frame is returned by a call-back function that is registered with the encoder during initialization. Since this makes the process asynchronous, a queue is needed to hold the encoded frames output by the encoder through the call-back function. Whenever the “transform” function is called, it checks the queue; if an encoded frame is available, it sends it to the downstream filter.
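The call-back and queue arrangement described above can be sketched as follows. This is a minimal model of the pattern only: the class, the stand-in encoder with one frame of latency, and all names are hypothetical and do not reflect the actual encoder API.

```python
from collections import deque
import threading


class EncodedFrameQueue:
    """Thread-safe FIFO between the encoder call-back and transform()."""

    def __init__(self):
        self._frames = deque()
        self._lock = threading.Lock()

    def on_frame_encoded(self, bitstream):
        # Registered as the call-back: runs whenever the encoder has output.
        with self._lock:
            self._frames.append(bitstream)

    def pop_ready(self):
        # Called from transform(): one encoded frame, or None if none is ready.
        with self._lock:
            return self._frames.popleft() if self._frames else None


def transform(raw_frame, encoder_submit, frame_queue, deliver):
    encoder_submit(raw_frame)            # hand the raw frame to the encoder
    encoded = frame_queue.pop_ready()    # may be None: encoding is asynchronous
    if encoded is not None:
        deliver(encoded)                 # pass on to the downstream filter


# Stand-in encoder that emits each frame one call late.
pending = []
frame_queue = EncodedFrameQueue()
delivered = []


def fake_submit(frame):
    if pending:
        frame_queue.on_frame_encoded(b"enc:" + pending.pop(0))
    pending.append(frame)


for frame in (b"f0", b"f1", b"f2"):
    transform(frame, fake_submit, frame_queue, delivered.append)

print(delivered)  # [b'enc:f0', b'enc:f1']
```

The first call delivers nothing because no encoded frame exists yet, which is exactly the situation the queue is meant to absorb.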
5.1.6. Start Streaming
This function is called when the stream is being executed. It initializes the encoder using the API already provided by the work of . After the default settings are fetched, some changes are made to customize the encoding process according to user requirements (e.g., bitrate, fps, and so forth). A call to the following function returns the encoder object that is used for the actual encoding. A call to the following functions instantiates the object with the memory buffer and a call-back function that is used to output encoded bit stream by the encoder. The encoded bit stream is then allocated a CPU pointer. At this point the encoded bit stream is pushed to the RTSP server for packetization and streaming.
5.1.7. Stop Streaming
This function is called when the streaming is stopped. This function frees any memory allocated for the encoder and destroys the encoder object.
5.2. Zero-Copy Memory Mapping
Irregular memory access patterns can be handled well by CPUs, whose design reduces memory access latency through extensive caching. However, the same patterns may prevent efficient utilization of GPU memory bandwidth, because the restrictions on access patterns that must be met to achieve good memory performance are stricter on GPUs than on CPUs.
When running a typical CUDA application, memory is allocated as pageable. Therefore, the memory is only allocated when needed. The use of pageable memory increases memory access latency, as this memory can be paged out and is only reallocated when there is a need for it. The major disadvantage is that CPU-GPU memory transactions are slower, because the bandwidth of the peripheral component interconnect express (PCIe) expansion bus cannot be fully exploited. Since paged memory can be swapped or reallocated, the PCIe driver needs to access every single page, copy it to a buffer, and pass it on to direct memory access (DMA), resulting in a synchronous page-by-page copy. For pageable memory, this is the only way PCIe transfers can occur.
For this reason there is a need for an independent host-controlled memory management allocation unit (MMU), especially for applications such as live video streaming or video conferencing which have strict latency requirements. At initialization, host mapping is enabled and the MMU allocates CPU pinned memory for input data. It is assumed that the maximum memory size for this application is 2 GB; this is based on experiments performed with x264 video streaming. CUDA kernel pointers are then produced to allow access to and from the GPU. Finally, the kernels are given pointers to the host memory as if it were the GPU’s global memory for the encoding process, as the memory is already mapped into the CUDA unified virtual address space. This technique of memory allocation allows encoding and packetization for video transport to overlap, as data is accessed directly via DMA (residing on the GPU) from the CPU without any explicit data transfer to the GPU memory. Figure 3 shows an illustration of this.
At runtime the CUDA kernels are normally asynchronous to the CPU; therefore, each block is also issued an atomic counter for synchronization. All blocks can thus be executed in a sequential manner with nondivergent branching , and data can also be read from previous threads. It should be noted that the CPU waits until the encoding process is complete before releasing the memory buffers for packetization and for refilling with unprocessed frames.
Interprediction is the most demanding process in the x264 encoder . Previous profiling in  indicates that interprediction accounts for approximately 70% of the total encoding time and is done with the help of motion estimation.
To obtain accurate prediction values, the H264 standard makes it possible to partition a macroblock (MB) into blocks of variable size. Each 16 × 16 macroblock used for encoding a frame can be split into 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, and 4 × 4 submacroblocks; an illustration can be seen in Figure 4. We therefore implement dynamic parallelism. Using dynamic parallelism, the macroblock acting as a parent kernel can spawn the thread blocks of the subblocks (which act as child kernels) without needing any extra instructions from the CPU; dynamic parallelism drastically reduces the execution control required from the CPU (Figures 5(b) and 6). In deciding the final encoding mode, two steps are involved.
Figure 5: (a) Typical procedure; (b) modified procedure.
The first step involves the calculation of the best motion vector (MV) for each possible mode within the reference image, using the sum of absolute differences (SAD) as the matching criterion. The second step involves the evaluation of the rate-distortion cost of each mode; the mode with the best result is selected as the final mode.
Since a macroblock is divided into sixteen 4 × 4 blocks, the SAD value is calculated for each block in parallel for all candidate motion vector positions within the reference search range. A 4kUHD frame contains 960 × 540 such blocks, and a 32 × 32 search range gives 1024 candidate positions (MVs) per block. Each SAD candidate is computed by one thread and 512 threads are executed in one thread block; we allocate more threads to increase processing granularity. For example, since 16 threads can be executed in a block, 32 MBs can be processed per execution. The total number of blocks is therefore based on

$$N_{w,h} = \frac{W}{w} \times \frac{H}{h} \times \frac{S^{2}}{T},$$

where $N_{w,h}$ denotes the number of blocks per subblock, with possible combinations of $w$ and $h$ ((16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)), where $w$ and $h$ are the subblock dimensions, $W$ and $H$ are the frame dimensions, $S$ denotes the search range, and $T$ denotes the number of threads allocated per block. Based on this calculation, the 1024 candidates of one block are assigned to a thread block, and their SADs are placed in shared memory so that they are shared by all threads executing within that block.
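The per-candidate SAD computation described above can be sketched sequentially. In the GPU implementation each candidate position would be evaluated by its own thread; the loop below simply visits them one after another. The frames are synthetic, the search window is kept small for readability, and all function names are illustrative.

```python
def sad_4x4(cur, ref, bx, by, mvx, mvy):
    """SAD between the 4x4 block at (bx, by) in cur and its displaced match in ref."""
    total = 0
    for y in range(4):
        for x in range(4):
            total += abs(cur[by + y][bx + x] - ref[by + y + mvy][bx + x + mvx])
    return total


def full_search(cur, ref, bx, by, search):
    """Evaluate every candidate MV in a (2*search+1)^2 window that stays inside ref."""
    height, width = len(ref), len(ref[0])
    results = {}
    for mvy in range(-search, search + 1):
        for mvx in range(-search, search + 1):
            inside_x = 0 <= bx + mvx and bx + mvx + 4 <= width
            inside_y = 0 <= by + mvy and by + mvy + 4 <= height
            if inside_x and inside_y:
                results[(mvx, mvy)] = sad_4x4(cur, ref, bx, by, mvx, mvy)
    return results


# Counts quoted in the text for a 4kUHD frame.
blocks_per_frame = (3840 // 4) * (2160 // 4)   # 960 x 540 blocks of 4x4 pixels
candidates_per_block = 32 * 32                 # 1024 positions in a 32x32 range

# Synthetic frames (made up for this example): cur is ref displaced by (2, 1).
SIZE = 12
ref = [[7 * x + 13 * y for x in range(SIZE)] for y in range(SIZE)]
cur = [[7 * (x + 2) + 13 * (y + 1) for x in range(SIZE)] for y in range(SIZE)]

results = full_search(cur, ref, bx=4, by=4, search=2)
best = min(results, key=results.get)
print(best, results[best])  # (2, 1) 0
```

With 518400 blocks and 1024 candidates each, a frame requires over half a billion SAD evaluations, which is why the work is spread across GPU threads.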
Since the SADs of the larger block sizes 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, and 4 × 8 are consolidations of 4 × 4 SADs, the 4 × 4 SADs are merged to obtain the SAD values of all possible combinations. Each thread retrieves the sixteen SADs of one macroblock obtained at a candidate position and combines them in different ways to determine the SADs of all these block sizes.
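The merging step can be sketched as plain summation over groups of 4 × 4 SADs. The sketch assumes the standard H.264 partition sizes and a 4 × 4 grid of sixteen SADs per macroblock; the input values are arbitrary and the function name is hypothetical.

```python
def merge_sads(sad4):
    """sad4[r][c] is the SAD of the 4x4 block in row r, column c of one macroblock."""

    def region(r0, c0, rows, cols):
        return sum(sad4[r][c]
                   for r in range(r0, r0 + rows)
                   for c in range(c0, c0 + cols))

    # Keys are (width, height) in pixels; values are (rows, cols) in 4x4 blocks.
    shapes = {(8, 4): (1, 2), (4, 8): (2, 1), (8, 8): (2, 2),
              (16, 8): (2, 4), (8, 16): (4, 2), (16, 16): (4, 4)}
    merged = {}
    for size, (rows, cols) in shapes.items():
        merged[size] = [region(r0, c0, rows, cols)
                        for r0 in range(0, 4, rows)
                        for c0 in range(0, 4, cols)]
    return merged


# Arbitrary SAD values for the sixteen 4x4 blocks at one candidate position.
sad4 = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
merged = merge_sads(sad4)
print(merged[(16, 16)])  # [136]
print(merged[(8, 8)])    # [14, 22, 46, 54]
print(merged[(16, 8)])   # [36, 100]
```

Because every larger SAD is a sum of values that are already available, no pixel data needs to be read again for the larger partitions.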
After the SADs of all block sizes are generated, the candidate with the least SAD is chosen as the best motion vector. In our implementation the SADs are compared within one thread block. During this process each thread pulls in 4 SADs from the host memory allocated to the parent kernel and finds the least of them. This value is stored temporarily in shared memory alongside its index. The stored SADs are then compared and the least value is written back to memory. This process helps to reduce system memory accesses. The total number of thread blocks therefore equals the number of blocks in a frame. When the smallest SAD, which identifies the fractional pixel MV (FMV) of that block, has been found and indexed, it is stored in the mapped memory allocated to it.
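The comparison described above is a minimum reduction that also keeps track of the winning index. The sketch below simulates it sequentially: each simulated thread first reduces its own 4 SADs, and the partial results are then combined pairwise, in the way a shared-memory reduction would proceed. It is an illustration of the pattern, not the kernel used in this work, and the SAD values are synthetic.

```python
def argmin_reduce(sads, per_thread=4):
    """Tree-style reduction returning (minimum SAD, candidate index)."""
    # Step 1: each simulated thread scans its own chunk of `per_thread` SADs.
    partial = []
    for start in range(0, len(sads), per_thread):
        chunk = sads[start:start + per_thread]
        best = min(range(len(chunk)), key=chunk.__getitem__)
        partial.append((chunk[best], start + best))
    # Step 2: pairwise reduction over the partial results; each pass halves
    # the number of active entries, as in a shared-memory reduction.
    while len(partial) > 1:
        half = (len(partial) + 1) // 2
        for i in range(len(partial) - half):
            partial[i] = min(partial[i], partial[i + half])
        partial = partial[:half]
    return partial[0]


# Synthetic SADs for 1024 candidate positions (values made up for the example).
sads = [(i * 37) % 101 + 1 for i in range(1024)]
print(argmin_reduce(sads))
print(argmin_reduce([9, 4, 7, 4, 8, 3, 6]))  # (3, 5)
```

With 4 SADs per thread, 1024 candidates need only 256 threads for the first step and 8 pairwise passes afterwards.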
In intraprediction the reconstructed pixels are needed as reference pixels. Strong dependencies between neighbouring MBs enable multiple prediction modes. This makes it quite different from interprediction and gives it a fundamentally low degree of parallel execution. A typical example is the 4 × 4 intraprediction scenario, in which each block refers to neighbouring 4 × 4 blocks both diagonally and vertically (in a zigzag manner). This limits its performance in a parallel implementation, as the degree of parallel execution is limited to multiples of 2 subblocks within a 4 × 4 macroblock. For a 4kUHD image resolution, the maximum degree of parallel execution for an intraprediction mode is limited to 270 (using diagonal processing for a wavefront algorithm).
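The limit imposed by diagonal (wavefront) processing can be explored with a short sketch. It assumes the simplest dependency pattern, in which a block only needs its left and upper neighbours, so that all blocks with the same x + y can run together; with an additional upper-right dependency the wavefront would be narrower. The block sizes tried below are assumptions, since the text does not state which block size the figure of 270 refers to.

```python
def wavefront_widths(cols, rows):
    """Blocks on each anti-diagonal (x + y = d) of a cols x rows block grid."""
    return [min(d, cols - 1) - max(0, d - rows + 1) + 1
            for d in range(cols + rows - 1)]


# Peak parallelism for a 3840 x 2160 frame at several block sizes.
for size in (4, 8, 16):
    cols, rows = 3840 // size, 2160 // size
    print(size, max(wavefront_widths(cols, rows)))
```

Under these assumptions the peak is bounded by the smaller grid dimension: 540 for 4 × 4 blocks, 270 for 8 × 8 blocks, and 135 for 16 × 16 blocks. Only a small fraction of the blocks in a frame can therefore be predicted concurrently, which motivates reducing the dependencies.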
Other prediction modes which have been proposed rely on the recommendations of [42, 43], where the number of predictions is decreased for a better degree of parallelism. However, during the test experiments it was noticed that the degree of parallel execution could be increased by reducing the number of prediction directions even further without compromising video quality. A horizontal approach for parallel execution is therefore proposed.
The number of predictions within the 4 × 4 block is decreased by limiting the prediction directions that are computed (Figure 7). In doing this, the wave parallelism in the blocks is simplified such that the blocks on the top and lower ends depend on each other horizontally, while the blocks between them are vertically independent; an illustration can be seen in Figure 8. This simplification is aimed at reducing decoding complexity after transmission through a noisy channel such as a wireless network.
As explained in the implementation section, we evaluate our implementation in three stages. The metrics used for the second and third stages are the structural similarity index metric (SSIM), network delay, and packet loss. Since the experiments at every stage were conducted ten times, the average of each metric is used for each video clip and test. Since the encoder differs only in that it runs in parallel, we have access to the same input parameters as any other H264 encoder.
6.1. Encoding Performance
Figure 9(a) compares the average frame rate attained by x264 and by its parallel implementation on CUDA for both 1080p and 4kUHD videos. Across the board, the results show a significant speedup of the GPU implementation over the CPU implementation. We cannot compare directly to previous work, as CUDA parallel processing cores differ between GPU generations due to several important architectural changes between streaming multiprocessor designs. We therefore compare only on the basis of the approach used. In , the authors use memory copies between the GPU memory and the host (CPU) memory. Figure 9(b) shows the amount of latency avoided. On average, 2000 ms of latency is avoided due to the use of zero-copy in our implementation. Based on 500 frames, we can therefore assume that the additional memory copy latency incurred during encoding would be, on average, 4 ms per frame. This might look negligible for a small number of frames; however, for a longer sequence it will be more pronounced.
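The per-frame figure above follows from dividing the total latency avoided by the number of frames, and the same arithmetic shows how the cost would accumulate over a longer sequence. The sketch assumes that the per-frame copy cost stays constant and, for the extrapolation, a frame rate of 24 frames per second; neither assumption is stated in the text.

```python
TOTAL_COPY_LATENCY_MS = 2000   # average latency avoided (from the text)
FRAMES = 500                   # frames per test clip (from the text)

per_frame_ms = TOTAL_COPY_LATENCY_MS / FRAMES


def copy_latency_s(frame_count, per_frame=per_frame_ms):
    """Cumulative copy latency in seconds for a constant per-frame cost."""
    return frame_count * per_frame / 1000.0


# Assumption: one hour of video at 24 frames per second.
one_hour_at_24fps = copy_latency_s(24 * 3600)

print(per_frame_ms)        # 4.0
print(one_hour_at_24fps)   # about 345.6 seconds
```

At 24 frames per second the frame interval is about 41.7 ms, so a 4 ms copy would consume roughly a tenth of the time budget of every frame.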
6.2. Real-Time Encoding for Live Streaming Performance
Based on these observations we proceed to the live encoding for streaming experiments. Figure 10 shows an illustration of the interior of two floors within the networks building at the University of Essex which is used for the experiments.
Table 2: Variance and standard deviation of video quality results at 10 meters.
Table 2 shows the standard deviation (SD) and variance in values of the video quality for each video clip used at 10 meters. All SD values show slight deviation from the mean SSIM value in each case with foreman having the largest deviation and coast the smallest deviation.
Figures 11 and 12 show that, across the board, for a peer-to-peer transmission the 802.11ac network provides sufficient bandwidth for a 20 Mb/s ABR stream. The lowest average SSIM value, 0.73, occurs in the interfloor scenario for Sintel 4k; however, the video quality is still acceptable. The average network delay ranges between 59 and 92 ms in all cases. Packet loss is low: the maximum average packet loss is only 0.72%.
6.3. Live Capture Stream
We experiment using a Point Grey Flea 3 4k camera (blue arrow) connected to a PC which is connected to the 802.11ac WLAN transmitter (green arrow). We use the Buffalo AirStation receiver within the room in the upper diagram of Figure 2 and a Netgear A6200  USB 2.0 dongle in the room on the upper floor (lower diagram in Figure 2). The AirStation, which was connected to the 4kUHD screen from , still performed better than the Netgear USB dongle. This is typical of the hardware design, and we did not expect very good video quality from the Netgear dongle. However, when the AirStation was used in the same scenario as the Netgear dongle, we obtained better quality video.
7. Conclusion and Future Work
In this paper, we have demonstrated a novel implementation of 4kUHD live encoding for streaming over a wireless network using CUDA parallel H264. We extended previous research by implementing zero-copy to reduce memory copy latencies between host memory and GPU memory, a dynamic parallel motion estimation algorithm, and a novel intraprediction mode. Finally, we demonstrated proof of concept by streaming 4kUHD video content using uncompressed video sequences and a live capture device in a peer-to-peer network, making this a significant improvement to the state of the art. Having reached these results (Figures 8, 9, and 10), we conclude that 4kUHD real-time encoding for live streaming at low bitrates is possible and can be implemented in real-world applications, particularly in one-way video streaming applications where delay is not a major issue.
The findings of this research point to the possibility of streaming 4320p (8KUHD) video at reduced bitrates. Moreover, since GPU generational changes will bring about changes in parallelization, improved interprediction algorithms will be investigated as those changes happen. Furthermore, H.264 is currently limited to 4kUHD; the major focus of future research will be on the new high efficiency video coding (HEVC) standard. HEVC shows huge potential for video streaming, since it can reduce the bandwidth needed by up to 50% compared to its predecessor. However, the issue which we have come across is the time taken to encode video files, which is longer due to threading issues. We will therefore be working towards the implementation of a CUDA parallel encoder for HEVC, thereby enabling and improving the quality of experience for UHD raw videos being encoded and streamed in real time.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
A. R. L. Renambot, R. Singh, B. Jeong et al., “SAGE: the scalable adaptive graphics environment,” in Proceedings of World Conference on Cooperative & Work-Integrated Education (WACE '04), 2004.
C. Cruz-Neira, D. J. Sandin, and T. A. DeFanti, “Surround-screen projection-based virtual reality: the design and implementation of the CAVE,” in Proceedings of the ACM Conference on Computer Graphics (SIGGRAPH '93), pp. 135–142, August 1993.
K. Jarrett, “Beyond broadcast yourself™: the future of YouTube,” Media International Australia, no. 126, pp. 132–144, 2008.
N. Electronics, http://www.ntt-electronics.com/.
A. O. Adeyemi-Ejeye and S. D. Walker, “Ultra-high definition wireless video transmission using H.264 over 802.11n WLAN: challenges and performance evaluation,” in Proceedings of the 12th International Conference on Telecommunications (ConTEL '13), pp. 109–114, 2013.
M. K. J. Halák, S. Ubik, P. Žejdl, and F. Nevřela, “Real-time long-distance transfer of uncompressed 4K video for remote collaboration,” Future Generation Computer Systems, vol. 27, pp. 886–892, 2011, http://www.elsevier.com/wps/find/journaldescription.cws_home/505611/description#description.
A. O. Ejeye and S. D. Walker, “Uncompressed quad-1080p wireless video streaming,” in Proceedings of the 4th Computer Science and Electronic Engineering Conference (CEEC '12), pp. 13–16, 2012.
C. T. Calafate, M. P. Malumbres, and P. Manzoni, “Performance of H.264 compressed video streams over 802.11b based MANETs,” in Proceedings of the 24th International Conference on Distributed Computing Systems Workshops, pp. 776–781, March 2004.
S. P. A. J. V. K. Soroushian, “H.264 parameter optimizations for internet based distribution of high quality video,” in Proceedings of the SMPTE Annual Technical Conference, October 2008.
K. Gatimu, T. Johnson, M. Sinky, Z. Jing, L. Ben, K. Myungchul et al., “Evaluation of wireless high definition video transmission using H.264 over WLANs,” in Proceedings of the IEEE Consumer Communications and Networking Conference (CCNC '12), pp. 204–208, 2012.
L. WirelessHD, “WirelessHD Specification Version 1.0 Overview,” 2010, http://www.wirelesshd.org/pdfs/WirelessHD_Full_Overview_071009.pdf.
Nvidia, “Nvision 08: the world of Visual Computing,” 2011, http://www.nvidia.com/content/cudazone/download/Getting_Started_w_CUDA_Training_NVISION08.pdf.
S. Winkler, “Video quality and beyond,” in Proceedings of the European Signal Processing Conference, pp. 3–7, 2007.
T. Murakami, “The development and standardization of ultra high definition video technology,” in High-Quality Visual Experience, pp. 81–135, Springer, 2010.
Reference software X264-060805, http://www.videolan.org/developers/x264.html.
IEEE 802.11 Working Group, “Official IEEE 802.11 working group project timelines, 2013-09-21,” 2013, http://grouper.ieee.org/groups/802/11/Reports/802.11_Timelines.htm.
N. Wu, M. Wen, H. Su, J. Ren, and C. Zhang, “A parallel H.264 encoder with CUDA: mapping and evaluation,” in Proceedings of the IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS '12), pp. 276–283, 2012.
G. Zhiyong, W. Shuang, S. Zhenyu, and L. Haihua, “Design and implementation of H.264/AVC video encoding based on CUDA,” Journal of South-Central University for Nationalities, vol. 28, pp. 67–72, 2009.
M. C. Kung, O. Au, P. Wong, and C. H. Liu, “Intra frame encoding using programmable graphics hardware,” in Proceedings of the Advances in Multimedia Information Processing (PCM '07), pp. 609–618, Springer, 2007.
S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-M. W. Hwu, “Optimization principles and application performance evaluation of a multithreaded GPU using CUDA,” in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '08), pp. 73–82, February 2008.
Nvidia, “Nvidia Kepler GK110 Next-Generation CUDA Compute Architecture,” 2012.
Buffalo, “Networking at Gigabit Speeds,” 2012, http://www.buffalotech.com/resource_center/wireless_technologies.
nanocosmos, “Nanocosmos,” 2012, http://www.nanocosmos.de/v4/documentation/live_video_encoder_-_playback.
Sintel, “Sintel 4K,” 2011, http://www.sintel.org/news/sintel-4k-version-available/.
Elemental, “4K Test sequences,” 2013, http://www.elementaltechnologies.com/resources/4k-test-sequences.
Microsoft, “Directshow,” 2012, http://msdn.microsoft.com/en-us/library/windows/desktop/dd375454(v=vs.85).aspx.
M. Harris, “Optimizing parallel reduction in CUDA,” NVIDIA Developer Technology, vol. 2, 2007.
NETGEAR, “A6200 802.11ac Wifi USB Adapter,” 2013.