Abstract

Design of flexible multimedia accelerators that can cater to multiple algorithms is being aggressively pursued in the media processor community. Such an approach is justified in the era of sub-45 nm technology, where an increasingly dominant leakage power component is forcing designers to make the best possible use of on-chip resources. In this paper we present an analysis of two commonly used window-based operations (sum of absolute differences and mean squared error) across a variety of search patterns and block sizes. We propose a context adaptable architecture that has (i) a configurable 2D systolic array and (ii) a 2D Configurable Register Array (CRA). The CRA can cater to variable pixel access patterns while reusing fetched pixels across search windows. Benefits of the proposed architecture, when compared to 15 other published architectures, are adaptability, high throughput, and low latency, at the cost of an increased footprint when ported to a Xilinx FPGA.

1. Introduction

Video processing applications increasingly demand flexibility and support for algorithmic variations. Conventional approaches to supporting flexibility in hardware design have centered around designing multiple accelerators, each accelerating a specific variation of the application, and selectively switching off accelerators during run-time to save power. However, the dominance of leakage power dissipation in deep submicron technologies [1] continues to challenge VLSI architects to make the best possible use of on-chip resources, and the conventional approaches fail to provide a low-power design. As an alternative, VLSI architects are exploring designs that can offer flexibility or adaptability within and across multiple algorithms.

Video processing applications include motion estimation, deinterlacing, and optical flow analysis. Such applications are embarrassingly parallel, thereby making them suitable for hardware acceleration. An analysis of these algorithms reveals that most of the compute-intensive kernels are window-based operations such as sum of absolute differences (SAD) and mean squared error (MSE). Depending on content and compression ratios, each of these kernels may in turn need to be supported for various block sizes. A high-level pseudocode is presented in Figure 1. This code illustrates the window-based operations for variable block sizes and also showcases the parallelism available amongst such operations. The focus of this paper is a video processing accelerator capable of efficiently supporting arbitrary block sizes for both SAD and MSE kernels; it is based on a context adaptable architecture that builds on prior efforts which use systolic arrays.
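Figure 1 is not reproduced here; the following Python sketch (an assumption-laden software analog, with hypothetical function names such as block_error and best_match) illustrates the same idea: a per-pixel operation reduced over a p × q window, evaluated independently for every block/candidate pair.

```python
# Hedged sketch (not the paper's Figure 1): a software view of the window-based
# SAD/MSE kernels for an arbitrary block size p x q. Every (block, candidate)
# pair in the loops over blocks and search positions is independent, which is
# the parallelism a hardware accelerator can exploit.
import numpy as np

def block_error(cur_blk, ref_blk, metric="sad"):
    """Per-pixel operation followed by a reduction over one p x q block pair."""
    diff = cur_blk.astype(np.int32) - ref_blk.astype(np.int32)
    if metric == "sad":          # sum of absolute differences
        return np.abs(diff).sum()
    return (diff * diff).sum()   # MSE-style sum of squared differences

def best_match(cur, ref, top, left, p, q, search_offsets, metric="sad"):
    """Evaluate one current-frame block against all candidate blocks
    reached by the given list of (row, col) search offsets."""
    cur_blk = cur[top:top + p, left:left + q]
    best = None
    for dr, dc in search_offsets:                      # independent candidates
        r, c = top + dr, left + dc
        if 0 <= r <= ref.shape[0] - p and 0 <= c <= ref.shape[1] - q:
            err = block_error(cur_blk, ref[r:r + p, c:c + q], metric)
            if best is None or err < best[0]:
                best = (err, (dr, dc))
    return best

# Toy usage: 4x4 blocks, a small full-search window of +/-2 pixels.
cur = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
ref = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
offsets = [(dr, dc) for dr in range(-2, 3) for dc in range(-2, 3)]
print(best_match(cur, ref, 4, 4, 4, 4, offsets, metric="sad"))
```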

The rest of the paper is organized as follows. Limited-flexibility hardware accelerators for motion estimation that have been proposed in [2–11] are discussed in Section 2. Section 3 analyzes the compute and data access patterns of arbitrary block-sized SAD/MSE computations. Section 4 discusses the proposed context adaptable architecture, and Section 5 presents results of the architecture in comparison to peer designs in terms of an FPGA implementation.

2. Literature Review

This section provides a review of motion estimation architectures. Motion estimation is an embarrassingly parallel application that can be accelerated using a set of vector processing elements. However, massively parallel vector processors require exceedingly high memory or I/O bandwidth, making such designs infeasible to realize in actual hardware. The concept of systolic arrays was first proposed in 1980 by Kung and Leiserson [12]; it provides a gateway to execute operations concurrently while data is fed into the system in a pipelined fashion, transforming I/O resources into local connections between processing elements. Fixed Size Block Motion Estimation (FSBME) is observed to reuse data in a highly regular manner, and this can be exploited to develop a systolic array. Previous works have utilized systolic arrays to realize efficient motion estimation architectures and have shown 100% (or near 100%) resource utilization while attempting to limit the memory bandwidth. In 1989 alone, there was a plethora of research efforts [2–4] towards developing such architectures.

In one of the earlier works on FSBME acceleration engines, Komarek and Pirsch [4] presented a set of four systolic arrays of differing flavors (1D and 2D, with varying scheduling vectors). The overall data flow for the FSBME algorithm was modeled by combining three individual data flows: those of the reference pixels, the current pixels, and the partial SAD values. In each of the four architectures, a partial SAD value is computed in each Processing Element (PE) and is passed to the bottom PE or accumulator. The dependence graph for the search operation of a single Macroblock (MB) is a 4D graph. This 4D graph is projected along two of the axes onto a 2D plane, resulting in a 2D systolic array; another level of projection results in a 1D systolic array. The regular block-type systolic array architecture is realized by passing columns of data either between two rows or two columns of PEs (in the case of a 2D array) or between two PEs (in the case of a 1D array) in a regular, nondiagonal fashion. AS-1 and AS-2 are systolic array architectures that involve data transfer along nonuniform delay units and along diagonal paths between PEs in adjacent rows or columns, respectively. Expressions for the time to compute each motion vector are also derived; the time to compute the best SAD for a block with a search area of size P is given for both AS-2 and AB-2 (another 2D systolic array proposed by Komarek and Pirsch). The AS-2 architecture has a high throughput and a relatively low resource count and is used as the motivation for the context-adaptable architecture proposed in our paper.

Ou et al. [5] propose an SAD accelerator that exploits data reuse at multiple levels and keeps I/O bandwidth to a minimum while maintaining a high throughput. Ou proposes a Variable Block Size Motion Estimation (VBSME) engine that is capable of generating a set of 41 motion vectors every 256 clock cycles while running at 123 MHz. This set of 41 motion vectors is calculated for the various block sizes required by H.264-based motion estimation. Each MB in the current frame is split into sixteen 4×4 subblocks, and the best match is identified inside a 7×7 search area in the reference frame. SAD is used as the metric to determine the best match. SADs are calculated directly using pixel values from all the 4×4 subblocks, and the resultant 4×4 SADs are combined to generate the SADs for larger blocks. This architecture contains 256 PEs, with each PE containing a subtract unit, an absolute value calculation unit, and an accumulator, and requires a total bandwidth of 1024 bits per clock cycle. Latency is computed to be 256 clock cycles. A configurable mode processor is used to combine the 4×4 SADs (and eventually determine the best motion vector) and compute SADs for larger blocks.
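The merging of small-block SADs into larger-block SADs described above can be sketched in software. The following Python fragment is a hedged illustration of that idea only; the partitioning and function names are ours, not the datapath of [5].

```python
# Hedged sketch of the SAD-merging idea: compute sixteen 4x4 SADs for a 16x16
# macroblock and combine them into SADs for larger partitions (8x8 and 16x16
# shown). The partition list and indexing are illustrative.
import numpy as np

def sad4x4_grid(cur_mb, ref_mb):
    """Return a 4x4 array holding the SAD of each 4x4 sub-block of a 16x16 MB."""
    grid = np.zeros((4, 4), dtype=np.int64)
    for by in range(4):
        for bx in range(4):
            c = cur_mb[4 * by:4 * by + 4, 4 * bx:4 * bx + 4].astype(np.int32)
            r = ref_mb[4 * by:4 * by + 4, 4 * bx:4 * bx + 4].astype(np.int32)
            grid[by, bx] = np.abs(c - r).sum()
    return grid

def merge_sads(grid):
    """Combine 4x4 SADs into larger-block SADs by summing sub-block results."""
    sad8x8 = [grid[y:y + 2, x:x + 2].sum() for y in (0, 2) for x in (0, 2)]
    sad16x16 = grid.sum()
    return sad8x8, sad16x16

cur_mb = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
ref_mb = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
print(merge_sads(sad4x4_grid(cur_mb, ref_mb)))
```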

Chen et al. [6] propose a motion estimation architecture that isolates the input data flow from the 2D compute array. Input data is maintained in a pipeline of registers, and data is allowed to flow through this pipeline, thereby reducing I/O bandwidth. Jen et al. [7] explore the data reuse properties found in FSBME and formulate them along with the associated memory bandwidth requirements; multiple modes of scheduling the data flow are also identified and formulated. Other systolic array-based architectures are proposed in [8–11]. While some of these approaches cater to the need for limited flexibility in window sizes and focus only on one kernel (SAD), we extend some of these approaches to cater to a larger set of window sizes across both SAD and MSE kernels. The approaches in [2–11] were designed to support motion estimation based on Full Search (FS). The remainder of this section discusses prior work [13–15] on architectures that can support data flow for search patterns other than Full Search.

Ruchika and Akoglu [13] propose a reconfigurable architecture to perform VBSME. A soft reconfigurable router is used to support data flow for multiple search patterns, including full, diamond, and hexagon searches. A 2-dimensional array of PEs is used to compute the SADs for 4×4 blocks, and 5 other PEs are used to compute the larger SADs (4×8, 8×4, 8×8, etc.). Each PE is connected to its four nearest neighboring PEs and is associated with a router to send/receive packets of data using handshaking protocols. The paper claims an overhead of 8 clock cycles per search point for diamond and hexagon search patterns. Our research does not use a router-based design, thus removing the overhead due to handshaking. We instead use a 2D Configurable Register Array to support the multiple data flows required by multiple search patterns. In our approach, the timing overhead is found to be 2 clock cycles per search point for diamond and hexagon searches.

Porto et al. [14] propose a high-throughput and low-cost architecture for performing diamond search based motion estimation for a block size of 16×16. This architecture is dedicated to diamond search and uses a set of memory banks to hold the required data. Pixel data from the search area in the reference frame and from the current frame is preloaded into a monolithic memory bank (higher level of hierarchy), which is then copied into smaller, parallel memory banks (lower level of hierarchy). Apart from the initial latency to fill the monolithic memory bank, the architecture is found to provide valid data to the compute engine during every clock cycle. However, such a memory architecture is dedicated to a single search pattern and cannot be reused to support any other search pattern.

Zhang and Gao [15] propose a methodology to derive a reusable architecture that can support a predefined set of search patterns. Best motion vectors for four different block sizes (16×16, 16×8, 8×16, and 8×8) are computed using a set of 9 processing units (PUs). Each unit is dedicated to computing the cost of the motion vector (which is based on SAD) for a pair of 16×16 pixel blocks from the reference frame (candidate block) and current frame (current block), respectively. During each clock cycle, two rows of pixels from the two blocks (one from the reference frame and one from the current frame) are fed into each PU, which updates the cost of the motion vector for that input. A data array is used to buffer the input data from the reference frames before feeding it to the nine PUs. Each row of the data array is long enough to hold the maximum number of pixels in a single row of the current set of nine candidate blocks. Rows are introduced from the top of the data array and piped through to the bottom. Based on the selected search pattern and the control step, specific pixels are selected using multiplexers and fed to the appropriate PUs. The data array also has provisions to calculate the subpixel values for fractional motion estimation. The paper does not provide a detailed discussion of the data array architecture. While the architecture does exploit data reuse among candidate blocks, it is effective only if the size of the search window is limited. Also, the set of search patterns needs to be predefined.

In this paper, we propose a 2D Configurable Register Array (CRA) that can support multiple search patterns while exploiting data reuse among adjacent candidate blocks in the reference frame. Amongst the prior approaches, only [13] and [15] have the capability to support multiple search patterns. While [13] suffers from the overhead associated with the router (8 clock cycles per search operation), [15] requires the search area size to be limited. The proposed CRA does not suffer from either of these limitations.

3. Compute Flow and Data Access Analysis of Arbitrary Block SAD/MSE Operations

3.1. Compute Flow Analysis

An analysis of video processing applications such as motion estimation, deinterlacing, and motion flow analysis reveals that most of the compute-intensive kernels are window-based operations [6, 16, 17]. Such operations include SAD and MSE computations for arbitrary block sizes. In this discussion, the terms “window” and “block” are used interchangeably. For example, (1–4) are commonly found in motion adaptive deinterlacing algorithms [13] and (5) is often used in motion estimation algorithms [4]:

In these equations, CFB and RFB represent the intensities of pixels at corresponding locations inside pixel blocks in the current and reference frames, respectively, and (a, b) represents the Manhattan distance between the anchor point of the current block and the anchor point of the reference block. The size of these blocks varies based on the operation being performed. For instance, in (1) the block size is 3×2, and in (5) the block size is N×N, where N can be 4, 8, or 16 if the reference standard is H.264. In (3), vecmMul is a constant scaling factor. In motion flow analysis algorithms, unusual block sizes such as 5×5 are used in many compute-intensive kernels [14]. Due to the wide range of window/block sizes used in video processing, we analyze their data flow by decomposing such window-based operations into two suboperations:

(i) per-pixel operations,
(ii) reduction operations.

Per-pixel operations can include

(a) the absolute difference between two pixels, or (b) the square of the difference between two pixels. In some motion flow applications, MSE, based on (b), is used as an error metric instead of SAD, based on (a), as shown in (6):

Reduction operations may include (a) simple accumulation of per-pixel results or (b) scaled accumulation of per-pixel results. To support multiple per-pixel operations (SAD and MSE) and reduction operations (simple and scaled accumulation), each operation is realized in hardware, and routing resources (multiplexers) can be used to select among the outputs of these operations. In the proposed architecture, only simple accumulation of the results of per-pixel operations (SAD and MSE) is supported. A single window-based operation for a block of size p × q is illustrated using the dependence graph shown in Figure 2. Each node in this figure is a computation node represented by its coordinate, and the inputs to this node are pixels from the current and reference frames, CFB and RFB, respectively. For the purpose of illustration, the computation shown in (5) is transformed into two stages as shown in (7) and (8):

In Figure 2, "dark" nodes represent the computation of a single step of (7) (one per-pixel operation and one reduction operation), and the "light" nodes (the last node in each column of the dependence graph) represent the computation of a single step of (7) for the last pair of pixels in each column together with a single step of (8) (one reduction operation). The operation of each node is illustrated in Figure 3. Section 4 discusses the design of a configurable PE array that can support parallel execution of multiple arbitrary-sized block operations.
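A minimal software sketch of this two-stage decomposition (column-wise reduction followed by a reduction across columns) is given below; the function name and the simple nested-loop form are illustrative assumptions, not the systolic schedule itself.

```python
# Hedged sketch of the two-stage decomposition in (7) and (8): each column of
# the block first accumulates its per-pixel results ("dark" nodes), and the
# per-column partial sums are then reduced into the final block result
# ("light" nodes). Plain Python, for illustration only.
def block_metric_two_stage(cur_blk, ref_blk, per_pixel=lambda a, b: abs(a - b)):
    p = len(cur_blk)          # rows
    q = len(cur_blk[0])       # columns
    column_sums = []
    for j in range(q):                        # stage 1: reduction along each column
        s = 0
        for i in range(p):
            s += per_pixel(cur_blk[i][j], ref_blk[i][j])
        column_sums.append(s)
    return sum(column_sums)                   # stage 2: reduction across columns

cur = [[10, 20], [30, 40]]
ref = [[12, 18], [33, 35]]
print(block_metric_two_stage(cur, ref))                                       # SAD = 12
print(block_metric_two_stage(cur, ref, per_pixel=lambda a, b: (a - b) ** 2))  # squared-error sum = 42
```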

3.2. Data Access Analysis

This section discusses the analysis of the access patterns of reference frame pixels. The proposed architecture aims to support multiple search algorithms while exploiting data reuse and is hence required to support multiple data access patterns.

Figure 4 illustrates the data access inside a search area for a particular search pattern. In this illustration, a variation of the diamond search pattern is considered, and block SAD/MSE computations are performed over fixed-size blocks. The following notation is used in this discussion:

(i) the number of contiguous blocks in the current frame involved in the block-based computation,
(ii) the size of each block,
(iii) the current frame block (CFB),
(iv) nSearches: the number of searches (4 in this illustration),
(v) the reference frame block (RFB) corresponding to a given search step,
(vi) the Manhattan distance between the anchor pixels of a CFB and the corresponding RFB,
(vii) the pixel location.

Concurrency amongst such block-based computations can be exploited in two different ways. (a) Concurrency is found in the set of block computations between different blocks in the reference frame and different blocks in the current frame, for example, the SAD computations between distinct CFB and RFB pairs. The groups of RFBs during two consecutive search steps are shown in (9) and (10), respectively:

From Figure 4, it can be observed that a considerable number of pixels are common between the two sets and can be reused. This approach assumes that the RFBs belonging to a set are contiguous, that is, that the motion vectors of contiguous CFBs are highly correlated. In the case of motion involving affine or perspective warping, or when the CFBs being processed concurrently belong to different visual objects (MPEG-4 advanced profile [18]), the correlation between the motion vectors of the CFBs is weak. This leads to noncontiguous RFBs, and the proposed approach cannot be applied.

(b) Concurrency is also found in the set of block computations between different blocks in the reference frame and the same block in the current frame, that is, SAD computations between a single CFB and multiple RFBs. The groups of RFBs during two consecutive search steps are shown in (11) and (12), respectively. From Figure 4, it can be observed that there are no pixels common between these two sets that can be reused. Thus, option (a) is used to exploit concurrency and data reuse. A set of blocks in the reference frame is involved in block computations with a set of blocks in the current frame during a single search step. This is repeated sequentially till all searches are completed (nSearches). We observed earlier that there is possible data reuse between reference block sets in consecutive searches. This property can be exploited to reduce the amount of data fetched from memory at a higher level of the hierarchy (e.g., a cache) that stores the reference frame. Each set of reference frame blocks consists of a fixed number of pixels. Instead of fetching a new set of such pixels for every new search, it is sufficient to fetch a specific number of new rows and columns and reuse some of the pixels from the earlier set of reference frame blocks. For a given search step, the number of new rows and columns that need to be fetched depends on the hop between the two search steps. For instance, consider the search steps in Figure 4. In search step 1, the set of pixels shown in (13) is fetched, and (14) shows the number of pixels fetched. In this discussion, RF represents the reference frame, and the hop represents the Manhattan distance between the anchor pixel locations of consecutive search steps:

To fetch the pixels for the next search step, new rows of pixels from the north (with respect to the previous RFPixelSet) are fetched, while older rows of pixels from the south are discarded. The set of pixels shown in (15) is fetched, and (16) shows the number of pixels fetched. After fetching the pixels from the north, new columns of pixels from the east (with respect to the previous RFPixelSet) are fetched, while older columns of pixels from the west are discarded. The set of pixels shown in (17) is fetched, and (18) shows the number of pixels fetched. In the proposed approach, it is sufficient to fetch only these new pixels instead of fetching a new set of reference frame blocks for each search step. The total number of pixels fetched for performing all the searches (number of search steps = nSearches) in the proposed approach is computed in (19), and the total number of pixels fetched for a vanilla approach that does not exploit data reuse is shown in (20). The savings obtained with the proposed approach are computed as shown in (21). These savings reflect the reduction in the bandwidth requirements of the compute unit. The amount of savings depends on the following features of the window-based operation.

(1) The larger the hops, the less data reuse and hence the smaller the savings.
(2) The smaller the size of the reference block set, the less possibility of overlap and hence the smaller the savings.
(3) The larger the number of searches, the greater the accumulated savings.
Table 1 lists the total number of pixels fetched in the two approaches and the savings obtained for some of the commonly used search patterns. The block size is assumed to be 4×4, and the arrangement of contiguous current frame blocks is also assumed to be 4×4. The number of searches depends on the search pattern and is equal to the number of candidate blocks in the reference frame for a given block in the current frame. It can be observed that a considerable amount of savings is obtained by exploiting the data reuse property.
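The savings can be sketched with a simple counting model. The following Python fragment assumes the reference-block set is a contiguous H × W pixel region and that a hop of (dr, dc) requires |dr| new rows and |dc| new columns; it is an illustrative approximation of (19)-(21), not the paper's exact equations.

```python
# Hedged model of the bandwidth savings: the reference-block set is assumed to
# be a contiguous H x W pixel region; each hop (dr, dc) then only needs |dr|
# new rows and |dc| new columns, while a "vanilla" scheme refetches the region.
def pixels_fetched(H, W, hops):
    reuse = H * W                      # initial fill of the register array
    vanilla = H * W                    # first search step
    for dr, dc in hops:                # one hop per additional search step
        reuse += abs(dr) * W + abs(dc) * H
        vanilla += H * W
    savings = 1.0 - reuse / vanilla
    return reuse, vanilla, savings

# Toy usage: a 16x16 reference-block set and a serpentine full-search scan
# over a 4x4 grid of candidates (15 unit hops, 16 search steps in total).
serpentine = [(0, 1)] * 3 + [(1, 0)] + [(0, -1)] * 3 + [(1, 0)] + \
             [(0, 1)] * 3 + [(1, 0)] + [(0, -1)] * 3
print(pixels_fetched(16, 16, serpentine))   # (496, 4096, ~0.88 savings)
```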

This section analyzed the compute flow and data access patterns for SAD/MSE-based block operations with (i) arbitrary block sizes and (ii) multiple search patterns. In this paper, we define a "context" to be a property of the block operation that is a mix of the following:

(1) per-pixel operation (SAD/MSE),
(2) block size (p × q),
(3) search pattern (Full Search, Hexagon Search, etc.).

The next two sections propose an architecture template that can support multiple contexts during run-time. We also discuss the methodology to realize a prototype of this architecture on a commodity FPGA and evaluate its performance and resource overhead against a fixed block size architecture.

4. Proposed Architecture

This section discusses the methodology to derive a systolic array architecture for performing SAD/MSE computations (block computations) and extends it to support arbitrary block sizes. In this discussion, a control step represents a single clock period. The top-level block diagram of the proposed architecture is shown in Figure 5. It consists of two key modules: (a) a 2D Configurable Register Array (CRA) and (b) a merged PE systolic array (MPESA).

Controller and memory interface modules that facilitate the integration of this accelerator with a host processor system are also shown in Figure 5, but their discussion is beyond the scope of this paper. The remainder of this section analyzes the parallel processing of window-based operations and discusses the two key modules of the proposed architecture.

4.1. Parallel Processing of Window-based Operations and Bandwidth Requirements

In the video processing applications discussed in Section 3 (motion estimation, deinterlacing, etc.), the window-based operation is performed repeatedly to compute an error metric between pairs of multiple contiguous blocks of pixels in the reference and current frames. Window-based operations for all pairs of blocks can be performed in parallel, but the amount of parallelism supported by any hardware accelerator is limited by two factors.

(1) The number of processing elements (PEs) available in the proposed 2D merged array. A single PE operates on one pair of pixels. This number is constrained by the total number of resources available on an FPGA (DSP48s, LUTs, flip-flops, etc.). The number of PEs is denoted M × N.
(2) The memory bandwidth available to the PEs. This number depends on the implementation of the 2D CRA and is equal to the output bandwidth of the CRA.

For a given block size p × q and a 2D merged PE array of size M × N, the proposed architecture can support parallel execution of window-based operations for multiple block pairs. For a given block size, (22) lists the blocks from the reference frame that can be processed in parallel during the first control step. Similarly, subblocks for the current frame (CFBs) are defined. Figure 6 illustrates the set of 5×5 reference frame blocks that are processed during the first control step for the chosen block and array sizes. In this figure, the label of each pixel represents its relative position in the reference frame blocks, with the first label representing the top and leftmost pixel in the first block processed during the first control step.

In many video processing applications, it is observed that window-based operations are performed between multiple blocks in the reference frame (also called candidate blocks) and a given block in the current frame during a single control step. Equation (23) lists one possible set of candidate blocks corresponding to a given current frame block for a Full Search pattern, and this set of blocks is illustrated in Figure 7:

The number of candidate blocks in the reference frame for each block in the current frame (for a given block index) is expressed in (24) in terms of the number of searches that need to be performed, as discussed in Section 3. In this paper, window-based operations for the listed pairs of blocks are started in parallel by the proposed merged PE architecture in one control step; during subsequent control steps, the search index y is varied from 1 to nSearches. During each control step, a fixed bandwidth is required by the PE array from the memory support system used to store the current and reference frames. Pixels from the current frame do not vary between control steps, so the memory support system that provides the current frame pixels is simple and is not discussed here. However, pixels from the reference frame vary with the control step, and the memory support system is required to provide the necessary bandwidth during every control step.

4.2.  2D Configurable Register Array

This section discusses the derivation of proposed 2D Configurable Register Array (CRA) and estimates the resource requirements.

4.2.1. Architecture Derivation

From Figure 7, it is observed that there is ample data reuse between the sets of blocks accessed in adjacent control steps. This data reuse depends on how the blocks are accessed within the search area in the reference frame; the analysis in Section 3 showed considerable data reuse for various search patterns. This paper proposes a 2D Configurable Register Array (CRA) to exploit this property of data reuse. The memory support system for the reference frame comprises external memory connected to the proposed 2D CRA, which in turn is connected to the merged PE array. Such a design helps to reduce the bandwidth required from external memory while providing a sufficient number of pixels to the merged PE array. The architecture of the CRA is illustrated in Figure 8, and Figure 9 illustrates the inner design of each register node. In Figure 8, three types of register nodes are shown: one type feeds data to the corresponding PE in the PE array, while the other two types are required by the configurable delay unit, which is part of the PE array design. Since the proposed 2D systolic array architecture consists of M × N PEs, the number of feed nodes in the CRA is also fixed at M × N. Based on the block size, only a portion of the CRA is required to provide contiguous columns of pixels to the 2D systolic array during every control step. For a block size of 3×3 and a 2D array size of 16×16, 15×15 register nodes are utilized, and the remaining 31 nodes are wasted. This is one of the overheads of supporting arbitrary block sizes and is formulated in (26).

The proposed 2D CRA exploits the property of data reuse based on how the candidate blocks are accessed from the reference frame. Table 2 shows the set of operations that modify the contents of the CRA for each control step; the operation varies based on the data flow required during that control step. In these operations, all reference frame pixels are assumed to be stored in a memory system that is capable of providing a set of pixels from a predetermined location in a single row, or a set of pixels from a predetermined location in a single column, of the reference frame. Such a set of pixels is represented by the vector InR. The input bandwidth of the 2D CRA can thus be defined as the larger of the row and column set sizes.

The operation of the 2D CRA is illustrated here for a block size of 3×3, a 2D merged PE array of size 16×16, and a search area defined by four candidate blocks (for the sake of illustration). Figure 7 illustrates the access pattern within the reference frame across multiple control steps (for Full Search). During each of the first 16 control steps, a new row of pixels is fed into the 2D CRA (see Table 2); the vector InR fetched from the reference frame during a given control step is shown in (27). After 16 control steps, the CRA contains all the pixels corresponding to the first set of 25 candidate blocks, and the proposed systolic array architecture can start all 25 window-based operations in parallel. In control step 17, a new row is fetched, the data is shifted upward, and the 2D CRA then contains all the pixels corresponding to the second set of candidate blocks. In control step 18, another fetch is performed, the data is shifted eastward, and the 2D CRA then contains all the pixels corresponding to the third set of candidate blocks. This process continues till all candidate blocks have been fed to the 2D PE array; the last set of candidate blocks is presented to the systolic array after the final control step.

It is observed that the proposed 2D CRA provides the required set of candidate blocks to the PE array during each control step (after the initial latency). This observation applies only to Full Search based window operations. To support window-based operations involving other search patterns (diamond search, etc.), the 2D CRA operates in a similar fashion but is unable to provide a new set of candidate blocks to the PE array during every control step. This is a tradeoff made to support multiple search patterns.
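A behavioural sketch of the CRA update, under assumed shift semantics (load a new row while shifting up, or load a new column while shifting sideways, depending on the hop direction), is given below; it models Table 2 only at a high level and is not the RTL.

```python
# Hedged behavioural model of the 2D CRA update (assumed semantics, not RTL):
# on each control step the array either loads a new bottom row while shifting
# the contents up, or loads a new edge column while shifting sideways,
# depending on the hop direction of the search pattern.
import numpy as np

class ConfigurableRegisterArray:
    def __init__(self, size):
        self.size = size
        self.regs = np.zeros((size, size), dtype=np.uint8)

    def shift_up_load_row(self, new_row):
        """Hop moving the window down in the frame: shift contents up,
        insert the newly fetched row at the bottom."""
        self.regs = np.roll(self.regs, -1, axis=0)
        self.regs[-1, :] = new_row

    def shift_left_load_col(self, new_col):
        """Hop moving the window right: shift contents left,
        insert the newly fetched column at the right edge."""
        self.regs = np.roll(self.regs, -1, axis=1)
        self.regs[:, -1] = new_col

cra = ConfigurableRegisterArray(16)
for r in range(16):                                      # initial fill, one row per cycle
    cra.shift_up_load_row(np.full(16, r, dtype=np.uint8))
cra.shift_left_load_col(np.arange(16, dtype=np.uint8))   # one (0, 1) hop
print(cra.regs[0, -1], cra.regs[15, 0])                  # 0 and 15: corners after the updates
```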

4.2.2. Resource Requirements and Performance Estimates

This subsection discusses an analytical model to estimate the overall resource and input bandwidth requirements of the proposed 2D CRA for arbitrary block sizes (p × q) and systolic array sizes (M × N). Based on Figure 8, we compute the total number of register nodes here; in this discussion, n(a) represents the number of duplicate units of a given component "a" that need to be implemented. Each register node is shown in Figure 9 and consists of two 8-bit multiplexers and one 8-bit register. Multiplexers and registers can be realized using Look-Up Tables (LUTs) and flip-flops (FFs), which are the primitive resources in Xilinx FPGAs. The total resource requirements for the proposed 2D CRA are shown in (32) and (33). Here, nLUT(a) denotes the number of LUTs used to realize the FPGA design of "a", and nFF(a) denotes the number of FFs used to realize the FPGA design of "a". The CRA also requires a fixed input bandwidth to be available from the external memory. Performance is estimated in terms of (a) the bandwidth provided by the CRA, (b) the initial latency, and (c) the total latency due to idle cycles caused by hops in the search pattern.

Effective bandwidth is computed in (38) using these three factors, which are listed in (34), (35), and (36). The proposed 2D CRA provides a set of pixels to the 2D PE array during each clock cycle, and the number of clock cycles required to send all the sets of RFBs corresponding to the search pattern depends on nSearches; the total number of clock cycles spent in sending all the required sets of RFBs is computed in (37). Effective bandwidth is a measure of the amount of useful data that the 2D CRA is able to feed the 2D PE array in a single clock cycle; the greater the effective bandwidth, the better the performance of the 2D CRA. It is computed as shown in (38). Table 3 lists the effective bandwidth for various search patterns. An ideal 2D CRA with the best performance should be able to provide M × N (256) pixels per clock cycle. The reduced performance of the proposed CRA is a tradeoff made to support multiple search patterns with a limited input bandwidth.

4.3. Merged PE Array

This section discusses the merged PE systolic array (MPESA) architecture required to compute multiple window-based operations for arbitrary block sizes. During every control step (after the initial latency), the proposed 2D CRA is responsible for providing the array with a new set of reference frame blocks that are candidates for a given set of blocks from the current frame. For search patterns other than Full Search, some idle cycles are introduced while the CRA prepares the next set of candidate blocks. During these cycles, the PE array continues to read data from the 2D CRA and process it, but the results are ignored. The following subsections discuss the PE array architecture in detail.

4.3.1. Architecture Derivation

The merged PE array needs to support three different modules.

(i) Configurable delay unit. This unit is responsible for delaying the reference frame pixels according to the block size, the 2D array size, and the location of the PE in the 2D array.
(ii) PEs for the per-pixel operation. This unit is responsible for computing the absolute difference or squared error between a pair of pixels from the current frame and reference frame.
(iii) PEs for the reduction operation. This unit is responsible for computing the summation (or weighted summation) of the results of the per-pixel operation PEs.

The proposed methodology to derive the MPESA architecture begins with a base architecture that supports a single window-based operation involving a pair of reference and current frame blocks. This operation is represented in the form of a dependence graph, as shown in Figure 2, and is well suited to processing using a systolic array architecture. Figure 2 displays the dependence graph for the computation of the SAD operation for a given block size.

To map the dependence graph onto a systolic array architecture, we need to determine the projection vector and the scheduling vector [12]. It should be noted that no projection vector is required to map a 2-dimensional dependence graph onto a 2-dimensional systolic array. After evaluating multiple options, it is observed that the chosen scheduling vector is the best option and obeys all the rules required for a systolic array design. Hence a systolic array based on this scheduling vector is chosen as the base architecture for performing block computations. This systolic array is illustrated in Figure 10. While the pixels from the current frame are stored in place in each processing element (PE), pixels from the sets of reference frame blocks are fed into different PEs of the systolic array from the 2D CRA. Arrows marked "D" indicate a single clock cycle delay associated with data transfer between PEs, and arrows marked with a multiple of D indicate a delay of multiple clock cycles; such delays are realized using a series of edge-triggered register units. In this figure, each input label corresponds to a specific pixel of the reference frame block, with the first label representing the top and leftmost pixel in the block.

Two types of PEs are required in this systolic array. The "dark" PE performs the computation corresponding to the "dark" node shown in Figures 2 and 3, and the "light" PE performs the computation corresponding to the "light" node shown in Figures 2 and 3. The PE performing (7) feeds its result, after a variable delay, to the unit that performs the computation corresponding to (8). To support the computation of SADs for pairs of multiple contiguous blocks (as illustrated in Section 3), the proposed systolic array is replicated, and an M × N PE Systolic Array (PESA) is realized. Based on the block size, multiple configurations of the PESA are required. Figure 11 shows the configuration of the PESA for two possible block sizes for a given number of PEs; M and N are set to 6 for the purpose of illustration:

(1) PESA1, configured for the first block size,
(2) PESA2, configured for the second block size.

In each PESA, each PE is represented by its (row, column) coordinate. The following requirements guide us toward the design of a merged PESA that supports arbitrary block sizes:

(i) The type of a PE ("light" or "dark") depends on the row index of the PE and the block size (e.g., PE(3,3) is a "light" PE in PESA1 and a "dark" PE in PESA2).
(ii) Based on the row index of a specific PE and the block size, either a zero or the intermediate result from a neighboring PE is routed into this specific PE (in PESA1, a zero is routed to a given PE; in PESA2, the result from its neighbor is routed).
(iii) Based on the column index of a specific PE and the block size, either a zero or the intermediate result from a neighboring PE is routed into this specific PE (in PESA1, a zero is routed to PE(6,4); in PESA2, the result from PE(5,4) is routed).
(iv) The input fed into each PE is delayed by a specific number of clock cycles, which depends on the row index of the PE and the block size (in PESA1, the input is delayed by 2 clock cycles; in PESA2, by 1 clock cycle).
(v) The data transfer within a "light" PE is delayed by a specific number of clock cycles, which depends on the column index of the PE and the block size (in PESA1, the data is delayed by 3 clock cycles; in PESA2, by 1 clock cycle).

The remainder of this section discusses the design of a merged PE Systolic Array (MPESA) that can be configured to support arbitrary block sizes. Each PE in the MPESA may act as either a "dark" PE or a "light" PE, as discussed above. By analyzing the computations performed by each type of PE, it is observed that the computation performed by a "dark" PE is a subset of that performed by a "light" PE. Thus, every PE in the MPESA is realized as a "light" PE. Each PE is termed a merged processing element (MPE) and can support (a) both MSE and SAD computations and (b) arbitrary block sizes. Each MPE is identified by its location in the MPESA. Figure 12(a) shows the design of an MPE, and Figure 12(b) shows the internal design of the per-pixel operator, which allows the MPE to support both MSE and SAD computations. A pair of configurable delay units is included in this design to support arbitrary values of p and q. One intermediate signal represents the result of the computation within a single column, and another corresponds to the accumulated SAD, as shown in Figure 3. Other terms used in Figure 12 are explained later.

Salient features of the proposed MPESA (and MPE) architecture and its derivation are explained with an example. A computation is assumed to start at a given control step. At this instant, the first set of blocks from the reference frame and current frame is available to the appropriate PEs from the 2D CRA (a CFB pixel and an RFB pixel are provided to each MPE); the first RFB pixel corresponds to the top and leftmost pixel in the set of RFBs. Each RFB pixel is delayed before it is fed to the first compute unit (absolute difference or squared error unit). For a given MPE, this delay depends on its row index and is computed as follows:

This value can range from 0 up to a block-size-dependent maximum. A configurable delay unit for reference pixels is derived to generate the appropriate delays. We observed that delayed copies of the reference frame pixels are already available in the 2D CRA; thus, the configurable delay unit can be realized by reusing the registers present in the 2D CRA, and the resource overhead for a single MPE is reduced to a single multiplexer. This design is discussed in the next subsection. A single clock delay register is used to hold the intermediate result of ADDER 1 before sending it to the next PE (MPE) or to the accumulator unit (ADDER 2). An MPE that is configured to act as a "dark" PE sends the intermediate result of its ADDER 1 to the next MPE after a single clock cycle, and the result of its ADDER 2 is ignored. An MPE that is configured to act as a "light" PE sends the intermediate result of ADDER 1 to ADDER 2 after a specific number of clock cycles based on the location of the MPE and the block size (as shown in Figure 10). This number depends on the column index of the MPE in the 2D MPESA and the block size, and lies within a range determined by the block and array sizes. Such a configurable delay unit is realized using circuitry involving multiple single-cycle delay registers and one M:1 multiplexer, as illustrated in Figure 13. Based on the location of the MPE in the 2D MPESA and the values of p and q, the inputs to the select lines of the 2:1 multiplexers shown in Figure 12, which control the selection between zero and an intermediate value, are set as shown in (41) and (42).
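The configurable delay unit of Figure 13 can be modelled, under the assumption that it is a chain of single-cycle registers whose output tap is chosen by an M:1 multiplexer, as in the following sketch.

```python
# Hedged model of a configurable delay unit (assumed structure: a chain of
# single-cycle registers with a multiplexer selecting the tap that gives the
# desired delay). Delay selection is fixed per context, while the chain
# advances once per control step.
class ConfigurableDelay:
    def __init__(self, max_delay, select):
        assert 0 <= select <= max_delay
        self.chain = [0] * max_delay   # shift-register chain, one entry per cycle of delay
        self.select = select           # multiplexer select: number of cycles to delay

    def step(self, value_in):
        """Advance one control step; return the value delayed by `select` cycles."""
        out = value_in if self.select == 0 else self.chain[self.select - 1]
        self.chain = [value_in] + self.chain[:-1]
        return out

# A delay of 3 cycles: the input sequence 1,2,3,... appears 3 steps later.
d = ConfigurableDelay(max_delay=15, select=3)
print([d.step(v) for v in [1, 2, 3, 4, 5, 6, 7, 8]])   # [0, 0, 0, 1, 2, 3, 4, 5]
```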

This merged PE (MPE) is now used as the building block of a 16×16 MPESA (similar to the one illustrated in Figure 12). A possible list of block sizes, and the number of such block computations that can be performed during a single control step using a 16×16 MPESA, is shown as follows (see also the sketch after this list):

(i) one 16×16 block,
(ii) four 8×8 blocks,
(iii) sixteen 4×4 blocks,
(iv) nine 5×5 blocks,
(v) fifteen 5×3 blocks,
(vi) twenty-five 3×3 blocks.
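The counts above are consistent with an M × N array processing ⌊M/p⌋ × ⌊N/q⌋ independent p × q blocks per control step; the short check below uses that inferred formula (it is deduced from the list, not quoted from the paper).

```python
# Small consistency check of the configurations listed above, under the
# assumption that a 16x16 MPESA processes floor(M/p) x floor(N/q) blocks
# of size p x q per control step.
def parallel_blocks(M, N, p, q):
    return (M // p) * (N // q)

for p, q in [(16, 16), (8, 8), (4, 4), (5, 5), (5, 3), (3, 3)]:
    print(f"{p}x{q}: {parallel_blocks(16, 16, p, q)} blocks per control step")
# Expected: 1, 4, 16, 9, 15, 25 -- matching items (i)-(vi) above.
```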
4.3.2. Resource Requirement and Performance Estimates

This subsection discusses an analytical model to estimate the overall resource requirement of the MPESA design. The MPESA consists of M × N MPEs arranged in a 2D systolic array. The resource requirement for each MPE is calculated as shown in (43):

In (43), the first term is the configurable delay unit for reference pixels, ABSDIFF is the absolute difference unit (for SAD calculations), MULTIPLIER is the multiplier unit (for MSE calculations), and the last term is the configurable delay unit for intermediate results.

The LUT counts of the two configurable delay units are computed as shown in (44) and (45), respectively, and nLUT(MPESA) is computed in (46); nFF(MPESA) is computed in a similar fashion. The performance of the MPESA is estimated in terms of (a) latency and (b) throughput. Latency is the number of clock cycles required by the MPESA to compute the first set of valid SAD outputs; it is a function of the block size and is computed as shown in (47). Throughput (the number of pixels processed per clock cycle) is computed as shown in (48) and is affected by the following factors:

(i) latency of the MPESA,
(ii) total number of clock cycles for the CRA to send all blocks,
(iii) total number of pixels processed.

Table 4 lists the throughput computed for various search patterns. An ideal architecture with the best performance should be able to process M × N (256) pixels per clock cycle. The reduced performance of the proposed MPESA is a tradeoff made to support multiple search patterns and reuse intermediate results.

4.4. Illustration

To illustrate the working of the proposed architecture, an example of computing 3×3 SAD operations for Full Search is discussed here. The sequence of hops is set to (0,1), (0,1), (0,1), (1,0), (0,−1), (0,−1), (0,−1), (1,0), (0,1), (0,1), (0,1), (1,0), (0,−1), (0,−1), (0,−1). A 16×16 MPESA, supported by a 16×16 Configurable Register Array (CRA), can be used to compute a maximum of twenty-five 3×3 SADs. The computation is assumed to start at the first control step; at this instant, the first row of pixels is fed into the CRA. At control step 16, the first set of blocks from the reference frame and current frame is available to the appropriate PEs from the CRA. At every subsequent control step in this example, a set of twenty-five 3×3 blocks from the reference frame is fed into the MPESA. During subsequent control steps, the control step index is varied from 16 to 16 + nSearches, and the appropriate blocks from the reference frame are fed into the MPESA via the CRA. During each control step, only a subset of the MPEs in the 2D MPESA computes useful intermediate results. This set is represented as OpMPE(·, ·), whose two arguments are the current control step and the control step during which a particular set of blocks was fed into the MPESA; for instance, the first set of blocks is fed into the MPESA at control step 16. Table 5 illustrates the activity of the MPESA during each control step.

The set of operations from control step 16 to control step 21 can be pipelined across multiple sets of candidate blocks, as different sets of resources are used during different control steps.
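The relationship between a hop sequence and the candidate positions it visits can be illustrated with the short helper below (an illustrative sketch; the serpentine hop list is the one given in this example).

```python
# Hedged helper showing how a hop sequence defines the candidate positions of a
# search pattern: starting from the anchor, each (dr, dc) hop produces the next
# candidate. With the serpentine sequence used above, 15 unit hops visit a full
# 4x4 grid of candidates (Full Search over 16 positions).
def candidates_from_hops(start, hops):
    positions = [start]
    r, c = start
    for dr, dc in hops:
        r, c = r + dr, c + dc
        positions.append((r, c))
    return positions

serpentine = [(0, 1)] * 3 + [(1, 0)] + [(0, -1)] * 3 + [(1, 0)] + \
             [(0, 1)] * 3 + [(1, 0)] + [(0, -1)] * 3
pts = candidates_from_hops((0, 0), serpentine)
print(len(pts), sorted(set(pts)) == [(r, c) for r in range(4) for c in range(4)])  # 16 True
```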

5. Results and Analysis

The merged PE systolic array (MPESA) architecture for performing arbitrary-sized SAD computations is implemented in Verilog RTL and synthesized using the Xilinx ISE design tools [19]. A Xilinx Virtex-4 (XC4VSX35) FPGA [20] is used for prototyping, and the design can also be ported onto custom technology libraries. For benchmarking performance, the proposed architecture was used to compute the 16 possible 4×4 SAD values between pixels in the current frame and the pixels in the search area of the reference frame. This operation is repeated for all 256 candidate blocks in the search area (nSearches = 256).

The following metrics were used to evaluate the proposed architecture:

(i) overall resource utilization,
(ii) initial latency,
(iii) time taken to determine 16 SAD values,
(iv) memory bandwidth,
(v) local and global I/O,
(vi) fanout.

Local I/O, global I/O, and fanout were used to represent the network congestion of a given architecture. FPGAs are known to have limited routing resources, and hence an architecture with lower network congestion is a better fit for an FPGA implementation. Local I/O indicates the number of wires in the architecture that have a single source and a single destination. Global I/O indicates the number of wires in the architecture that have multiple sources and multiple destinations. Table 6 evaluates the proposed architecture against all the architectures listed in the literature review section. In this comparison table, resource utilization is an estimated value, as different architectures have been implemented using different technology libraries (and some papers provide limited implementation details). It was observed that the proposed architecture had the lowest latency and one of the highest throughputs among all architectures. Memory bandwidth was found to be 128 bits. Network congestion of the proposed architecture was found to be larger because we support multiple data access patterns. For comparing the resources used, each architecture is modeled as a set of adders, absolute difference units, multiplexers, and registers; the number of units is estimated from the literature, and the overall resource utilization is estimated by summing the individual resource counts. It was observed that the proposed architecture resulted in the highest resource utilization, which is the penalty paid for context adaptability. Among the other architectures, only Chen et al. [6] and Komarek and Pirsch [4] have low latency, high throughput, and low resource count.

To evaluate the resource overhead of the proposed architecture over a fixed-size SAD architecture, a reference design (F-SAD) was implemented using the same technology. The F-SAD implementation is a stripped-down version of the proposed implementation: all the logic required for context-adapting the design during run-time based on the block size is removed, and p, q, and nSearches are fixed at 4, 4, and 256, respectively. Table 7 shows the comparative results between the two architectures. It is observed that the proposed implementation requires 10.3 times more LUTs and 5.5 times more flip-flops. This overhead is dedicated to supporting all possible block shapes and sizes. All the other numbers in Table 7 are comparable.

The remainder of this section provides results to analyze the performance of the proposed architecture. 720p HDTV video sequences at 30 frames per second (fps) are considered as the test sequence. The proposed architecture can be used to accelerate multiple window-based algorithms; in this paper, we provide results for Full Search Motion Estimation (FSME) with the block size set to 4×4. The search pattern is defined by the sequence of hops, set to {(0,1), (0,1), (0,1), (1,0), (0,−1), (0,−1), (0,−1), (1,0), (0,1), (0,1), (0,1), (1,0), (0,−1), (0,−1), (0,−1)}. The proposed architecture has been tested to run at a frequency of 100 MHz. Each 720p frame has 57,600 4×4 blocks. Each block in the current frame has 16 candidate blocks in the reference frame, and 16 current blocks can be processed in parallel using the proposed architecture. Using the formulas presented in Section 4 for estimating performance, it is determined that 16 current blocks can be processed in 40 clock cycles. To process one frame, 144,000 clock cycles are required, and to process 30 frames, 4,320,000 clock cycles (0.0432 seconds at a clock frequency of 100 MHz) are required. Thus, the proposed architecture is shown to support 30 fps (real-time performance). A similar analysis can be performed for other applications as well (deinterlacing, fast search, etc.).
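The arithmetic behind this real-time claim can be checked directly from the quoted figures (57,600 blocks per frame, 16 blocks processed in parallel, 40 cycles per group, 100 MHz clock):

```python
# Arithmetic check of the real-time claim above, using the figures quoted in
# the text (16 parallel current blocks processed in 40 clock cycles at 100 MHz).
blocks_per_frame  = (1280 // 4) * (720 // 4)     # 57,600 4x4 blocks in a 720p frame
groups_per_frame  = blocks_per_frame // 16       # 16 current blocks processed in parallel
cycles_per_frame  = groups_per_frame * 40        # 40 cycles per group -> 144,000
cycles_for_30fps  = cycles_per_frame * 30        # 4,320,000 cycles for one second of video
seconds_at_100mhz = cycles_for_30fps / 100e6     # 0.0432 s < 1 s, so 30 fps is sustained
print(blocks_per_frame, cycles_per_frame, cycles_for_30fps, seconds_at_100mhz)
```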

The performance of the proposed architecture is evaluated for multiple block sizes. Figure 14 shows the latency and throughput for multiple block sizes. In this figure, throughput is computed as the number of pixels processed by the MPESA per clock cycle, and overall latency is computed as the sum of the latencies of the 2D CRA and the MPESA.

6. Conclusion

This paper proposes a context-adaptable architecture to accelerate compute-intensive video processing kernels comprising arbitrary-sized block SAD/MSE operations and variable search patterns. Results are presented by comparing the performance and resource utilization of the proposed architecture against a reference architecture and other architectures found in the literature. Compared to the fixed architecture (F-SAD), the proposed architecture provides context adaptability (block dimensions p and q varying anywhere between 1 and 16) at an increased cost of 5.5× for FFs and 10.3× for LUTs, while maintaining similar performance (latency and throughput).