Abstract

The high integration density in today's VLSI chips offers enormous computing power to be utilized by the design of parallel computing hardware. The implementation of computationally intensive algorithms represented by 𝑛-dimensional (𝑛-D) nested loop algorithms, onto parallel array architecture is termed as mapping. The methodologies adopted for mapping these algorithms onto parallel hardware often use heuristic search that requires a lot of computational effort to obtain near optimal solutions. We propose a new mapping procedure wherein a lower dimensional subspace (of the 𝑛-D problem space) of inner loop is identified, in which lies the computational expression that generates the output or outputs of the 𝑛-D problem. The processing elements (PE array) are assigned to the identified sub-space and the reuse of the PE array is through the assignment of the PE array to the successive sub-spaces in consecutive clock cycles/periods (CPs) to complete the computational tasks of the 𝑛-D problem. The above is used to develop our proposed modified heuristic search to arrive at optimal design and the complexity comparisons are given. The MATLAB results of the new search and the design space trade-off analysis using the high-level synthesis tool are presented for two typical computationally intensive nested loop algorithmsβ€”the 6D FSBM and the 4D edge detection alternatively known as the 2D filtering algorithm.

1. Introduction

1.1. Prelude to the New Search Method

Today’s reconfigurable SoCs feature processing elements (PEs) with significant amount of programmable logic fabric present on the same die. The management of complexity and tapping the full potential of these RSoC architectures present many challenges [1]. A large number of heuristic algorithms have been used in developing many novel scheduling and mapping algorithms [2–5]. However, these approaches face difficulties in dealing with large execution times.

𝑛-dimensional (𝑛-D) nested loop representations are used in the formulation of numerous computationally intensive multimedia computing/image processing and signal processing algorithms. Systolic array design style can effectively exploit parallelism inherent in the nested loop algorithm and, therefore, reduce processing time [2, 3]. Often heuristic procedures are used to search for the mapping transformations that are used to map the nested loop algorithms onto array architectures [4, 5]. Since the effort that goes into heuristic search is large and complex, the challenge lies in improving the process to reduce the computational effort in getting the mapping results.

Our main contribution in this paper is that we propose an augmented approach to the heuristic search. A new method of identifying the subspace to which the PE array is to be assigned is proposed based on the directional index of the computational expression that is explained in Section 2. The new vectors and terminologies used in the procedure are defined and elaborated in Section 2.

A modified heuristic search is implemented using the proposed procedure to determine the optimal solution to the 𝑛-D problem. The complexity analysis is performed by comparing the search space used in our method with the search space in [4]. The high-level synthesis tool GAUT is used to plot the design space trade-off curves to obtain the design space exploration curves.

The paper is organized as follows: in Section 3, mapping steps used in the heuristic method and our proposed modified search method are described. The 4D nested loop formulation of the 2D filtering problem is explained in Section 4. The methodology and the implementation of the above approach for the 2D filtering algorithm and the mapping results are presented in Section 4. The mapping process for 6D FSBM is elaborated in Section 5 followed by the results of the heuristic search for the reduced 4D FSBM and modified heuristic search for the same in Section 6. The design trade-off results using the high-level synthesis tool GAUT are presented in Section 6. Section 7 discusses the complexity considerations and comparisons. Section 8 gives the conclusion and future work.

2. Terminologies and Definitions

2.1. Axis Vector I

The multidimensional (𝑛-D) problem is associated with an 𝑛-dimensional axis vector I. Its components are {𝑖1,𝑖2,𝑖3,…,𝑖𝑛}, where the subscripts of the components belong to the integer set 𝑧. The components of the vector I represent the different axis directions of the 𝑛-dimensional vector I. The letter K is used to represent a constant vector whose components are different constant numbers, K= {π‘˜1,π‘˜2,π‘˜3,π‘˜4,…,π‘˜π‘›}. Each π‘˜π‘§ represents the upper limit of the corresponding vector component 𝑖𝑧 of the vector I. For example, axis component 𝑖4 has a value varying from 𝑖4=1 to 𝑖4=π‘˜4.

2.2. Data Representations

Considering the input data set to the algorithm, the input data is represented using letter A with subscript 𝑧. The input data set consists of collection of data 𝐀1,𝐀2,…,π€π‘˜ where π‘˜ is some constant integer number. Each of this type of data is associated with the axis vector I. For example, for 𝐀1, we call it as 𝐀1(𝐈). Now every such data is associated with a particular axis component 𝑖𝑧 in I. 𝑖𝑧 is the axis vector along which the data 𝐀1(𝐈) is read into the 𝑛-D multidimensional algorithm using a set of ports. The input data is represented as 𝐀1(𝑖1,𝐼2,𝑖3,….𝑖𝑛). This means that input data 𝐀1(𝐈) is fed along 𝑖2 axis. The corresponding word size is π‘˜2, and the port size required to feed this data is π‘˜2. The input data is reused either within the same computation or in different computations within iteration (depending on the application considered). If the reuse is within the same clock cycle/clock period CP, it is made possible by propagating the data (with zero delay) termed as data broadcast. The reuse direction of each data is represented by the directional vector termed as the β€œdependence vector”—𝐷𝑣. 𝐷𝑣 is determined as follows: as shown in Listing 1, the data 𝐀1 on the LHS is assigned from the data 𝐀1(𝑖1,𝐼2,𝑖3βˆ’1,…,𝑖𝑛) on the RHS in equation (1a) in Listing 1. This means that it is broadcast within the same iteration in the 𝑖3 direction and fed along the 𝑖2 axis using π‘˜2 ports (Figure 1).

Do 𝑖 1 = 1 to π‘˜ 1 ;
    Do 𝑖 2 = 1 to π‘˜ 2 ;
      Do 𝑖 3 = 1 to π‘˜ 3 ;
    𝐴 1 ( 𝑖 1 , 𝐼 2 , 𝑖 3 , . . . , 𝑖 𝑛 )     = 𝐴 1 ( 𝑖 1 , 𝐼 2 , 𝑖 3 βˆ’ 1 , . . . , 𝑖 𝑛 ) ;//broadcast in 𝑖 2 direction
                                            (1a)
    𝐴 2 ( 𝑖 1 , 𝑖 2 , 𝐼 3 , . . . , 𝑖 𝑛 )     = 𝐴 2 ( 𝑖 1 βˆ’ 1 , 𝑖 2 , 𝐼 3 , . . . , 𝑖 𝑛 ) ;//broadcast in 𝑖 1 direction
    𝐴 3 ( 𝐼 1 , 𝑖 2 , 𝑖 3 , . . . , 𝑖 𝑛 ) = 𝐴 3 ( 𝐼 1 , 𝑖 2 , 𝑖 3 βˆ’ 1 , . . . , 𝑖 𝑛 ) broadcast in 𝑖 3 direction
    𝐴 𝑛 βˆ’ 1 ( 𝑖 1 , 𝑖 2 , 𝑖 3 , 𝐼 4 , . . . , 𝑖 𝑛 ) = 𝐴 𝑛 βˆ’ 1 ( 𝑖 1 , 𝑖 2 , 𝑖 3 βˆ’ 1 , 𝐼 4 , . . . , 𝑖 𝑛 ) ;
         𝐢 [ β†’ 𝐈 ] = 𝑓 2 (C[ β†’ 𝐈 βˆ’ β†’ 𝐝 ], 𝑓 1 ( 𝐴 1 𝐴 2 [ β†’ 𝐈 ] ))          (2)
        or
         𝐢 [ 𝑖 1 , 𝑖 2 , 𝑖 3 , . . . , 𝑖 𝑛 ] = 𝐢 [ 𝐼 1 , 𝐼 2 , 𝑖 3 , . . . , 𝑖 𝑛 βˆ’ 1 ]+ 𝐴 1 ( 𝑖 1 , 𝐼 2 , 𝑖 3 , . . . , 𝑖 𝑛 ) Γ— 𝐴 2 ( 𝐼 1 , 𝑖 2 , 𝑖 3 , . . . , 𝑖 𝑛 )
                                           (3)
        End Do 𝑖 𝑛 ;  … .;
        End Do   𝑖 2 ;
    End Do   𝑖 1 ;

The output data is represented as 𝐂(𝑖1,𝑖2,𝑖3,…,πΌπ‘›βˆ’1) which means that the data is output along 𝑖𝑛 axis and propagated along 𝑖𝑛 direction. When we consider the output data, the word β€œpropagation” is replaced by the term β€œupdate direction.” The vector associated with update direction is termed as the Computational Trail Vector (CTV). The updation of CTV may be with delay or without delay as demanded by the application.

The vector representing the update direction in this example is given as [].CTV=0,0,…,1(1)

The form of representation of the 𝑛-D algorithm in Listing 1 wherein the broadcast direction and computations are shown with the complete detail is termed as the uniform recurrence relation or the URE form of the 𝑛-D nested loop algorithm. In the expression (2) for CTV in Listing 1, the computational output data C is represented as C[β†’πˆ] (arrow line on top of the symbol) which indicates that it is associated with an update direction. The corresponding vector →𝐝 in the RHS of (2) represents the CTV defined in (1).

The functions 𝑓1 and 𝑓2 in (3) in Listing 1 are simple commutative operators which are executed independent of any other output component computations of C. These are assumed to be operators with no precedence constraint. 𝑓2 especially is an operator that has no precedence constraint. It needs not wait for any past computations. It can proceed independently provided as much parallel hardware is available. There is only one output computation expression in Listing 1. Listing 1 is said to have a single CTV with no precedence constraint.

2.3. 𝑛-D Nested Loop Problem

A general 𝑛-D nested loop algorithm is illustrated in Listing 1. 𝑖1,𝑖2, and 𝑖3,…,𝑖𝑛 are the loop indices. Together they form the 𝑛-D (iteration) index space. Representation of the 𝑛-D loop computations as a dependence graph (DG) leads to each point in the index space corresponding to a single node in the DG. Theoretically each node can be assigned a processing element (PE). The 𝑛-D iteration space is constructed as follows.

2.3.1. An 𝑛-D Iteration Space Computation in Terms of (π‘›βˆ’1)-D Subspace

First an (π‘›βˆ’1)-D dependence graph (DG) as in Figure 1 with an (π‘›βˆ’1) multidimensional indexed positions given by ξ‚Έβ†’πˆπ‘–π‘›π‘›βˆ’1ξ‚Ή=𝑖,11,𝑖2,𝑖3,…,π‘–π‘›βˆ’1ξ€Ύ,1(2) is constructed showing the data input directions and data broadcast directions. Here we show one of the data input directions and data broadcast directions for the sake of illustration. The data specifications or the dependence relations within each cell in the iteration space show the different data broadcast directions as shown in Figure 1.

The 𝑛-D iteration space is constructed by replicating the (π‘›βˆ’1)-D iterationspace along the 𝑖𝑛 direction. Each (π‘›βˆ’1)-D subspace is termed as a cell (or iteration). An array of PE is assigned to this cell, and the computation of the cell is completed in 1 clock period (CP). In the next CP, the PE array is assigned to the next cell along the 𝑖𝑛 direction. The direction of PE array assignment to consecutive subspaces is termed as the scheduling direction represented as the scheduling vector →𝑠𝑑. As per Listing 1, the CTV is also updated along the same 𝑖𝑛 direction. The CTV is partially updated in CP1, and the updation continues as the scheduling advances along the 𝑖𝑛 direction in every CP till the completion of computation in π‘˜π‘› CPs.

2.4. Mapping and Scheduling

Any node in the iteration space is 𝑁[𝑖1,𝑖2,𝑖3,…,1]and is mapped onto the PE array assigned to the iteration subspace. This is termed as mapping. The time β€œπ‘‘β€ at which the node 𝑁[𝑖1,𝑖2,𝑖3,…,1] is mapped on the PE in the PE array is termed as scheduling. The mapping and scheduling are derived for each application in detail in the corresponding sections.

2.5. Computation of 𝑛-D Iteration Space Using an (π‘›βˆ’2)-D Subspace

In an alternate generalization, we represent the 𝑛-D nested loop problem as identified to have an iterative (π‘›βˆ’2)-D subspace as shown in Figure 2. An (π‘›βˆ’2)-D dependence graph (DG) with an (π‘›βˆ’2)-D multidimensional indexed positions is given by ξ‚Έβ†’πˆπ‘–π‘›,π‘–π‘›βˆ’1π‘›βˆ’1ξ‚Ή=𝑖,1,11,𝑖2,𝑖3,…,π‘–π‘›βˆ’2ξ€Ύ.,1,1(3)

The collection of indexed node positions in (3) is termed as the (π‘›βˆ’2)-D subspace or hyperplane, which is represented showing the data input directions and data broadcast directions in Figure 2(a). The 𝑛-D iteration space computation is completed by replicating the (π‘›βˆ’2)-D DG. We expand the iteration space along the π‘–π‘›βˆ’1 direction, followed by its expansion along 𝑖𝑛 direction. Each (π‘›βˆ’2)-D subspace is termed as a cell or iteration cell. An array of PE is assigned to this cell, and the computation of the cell is completed in 1 CP.

A part of the output expression termed as the computational expression is assumed to be computed in the inner loop formed by the (π‘›βˆ’2)-D iteration space as depicted in Figure 2(a). The directional index representing the propagation direction or the update direction of the computational expression is termed as the Computational Trail Vector (CTV). The CTV is partially updated in CP1, and the updation continues as the scheduling advances along the π‘–π‘›βˆ’1, showing that in the next CP the PE array is assigned to the next iteration cell along the π‘–π‘›βˆ’1 direction (as shown in Figure 2(b)) to complete the first row of computation in π‘˜π‘›βˆ’1 CPs. The sequence direction of subspace assigned to the PE array in consecutive CPs is termed as the scheduling direction represented by the scheduling vector →𝑠𝑑1, which is along the π‘–π‘›βˆ’1 direction, and CTV is also updated along the same π‘–π‘›βˆ’1 direction.

Following this, the PE array assignment is done to next 𝑖𝑛 giving the scheduling vector →𝑠𝑑2 as 𝑖𝑛 as in Figure 2(b). The total number of CPs used to complete the computation isπ‘˜π‘›βˆ’1Γ—π‘˜π‘›.

2.6. 𝑛-D to (π‘›βˆ’π‘₯)-D with CTV and Scheduling Directions

In the previous section, the (π‘›βˆ’1)-D subspace is built using a sequence of (π‘›βˆ’2)-D subspaces by scheduling along the appropriate (π‘›βˆ’1)th dimension followed by scheduling along the appropriate 𝑛th dimensionβ€”say along 𝑖𝑛 with an assumption that CTV has the same direction as the scheduling vector which may not be true always. There are two approaches to complete the 𝑛-D computation using the (π‘›βˆ’2)-D subspaces. The PE array assignment to the (π‘›βˆ’2)-D subspace is one order closer to the physical realization. For a practical implementation, this process has to be continued down to 2D level.

In general, the direction of updation of the computational expression is defined as a vector termed as the Computational Trail Expression (CTV) of the 𝑛-D problem. We identify the corresponding (π‘›βˆ’π‘₯)-D computational hyperplane in which the CTV lies, forming an (π‘›βˆ’π‘₯)-D subspace in the 𝑛-D space. The PE array is assigned to this plane. This is followed by the reuse of the (π‘›βˆ’π‘₯)-D plane along the scheduling direction/s.

3. Methodology of Mapping

The mapping methodology used in the heuristic search of the mapping transformation matrix 𝑀 is explained hereafter. In general, the mapping matrix 𝑀 is constituted of the timing vector or hyperplane 𝑆 and the space matrix or vector also called the space hyperplane 𝑃 [6, 7]. Any node in the iteration space 𝑁[𝑖1,𝑖2,𝑖3,…,𝑖𝑛] is mapped onto a PE in the PE array using the 𝑃 matrix at a time β€œπ‘‘β€ determined by the 𝑆 vector of [4] βŽ‘βŽ’βŽ’βŽ£π‘†π‘ƒβŽ€βŽ₯βŽ₯βŽ¦π‘€=.(4)

3.1. Heuristic Method [4]

Step 1. Generate the iteration space for the 𝑛-D nested loop application under consideration.

Step 2. Find the data dependencies in the algorithm and formulate the dependence vector 𝐷𝑣.

Step 3. The causality constraint is checked for using (5), that is, whether the condition π‘†βˆ—π·π‘£>0(5) for all dependencies is satisfied, where 𝐷𝑣 is dependence vector for each data variable (Table 1). Choose those 𝑠 elements of 𝑆 which satisfy the condition.

Step 4. Generate or modify the search space for the 𝑀 matrix (𝑀set) to satisfy the rank condition [4].

Step 5. Chose a candidate 𝑀 matrix from the above set.

Step 6. Save the candidate 𝑀 matrix in 𝑀result.

3.2. The Proposed Modified Heuristic Method

The following are the steps in our approach for modification of the heuristic search based on the optimal allocation method evolved in Section 2.

Identify the scheduling direction. Once a layer of PEs is assigned to the (π‘›βˆ’π‘₯)-D subspace, the same array of PEs is to be used in the next computation. This reuse direction is known as the scheduling direction in Section 2. All these conditions are used in the modified heuristic search procedure in the following steps.

Step 7. The scheduling vector representing the scheduling direction represented by the →𝑠𝑑1 vector is used to prune down the valid 𝑀 matrices.

Step 8. Prune down the valid 𝑀 matrices by choosing the (π‘›βˆ’π‘₯)-D subspace to which the PE plane is discussed. This is done by identifying the iterative subspace. To summarise, the selected 𝑀mat is obtained by pruning down the 𝑀result using the CTV and PE plane assignment done as discussed in Section 2.

Step 9. Evaluate the cost function as given in (10) in Section 5.2. If Costactual<Costrequired, proceed to Step 6 else to Step 3.

The plots of Figure 5 show the comparison of heuristic method of Section 3.1 with the modified heuristic search method described in Section 3.2.

3.3. Direct Method

Step 10. The delay edge is calculated by the direct method as explained in Section 4.5. The results are presented in Table 2.

Step 11. The delay edge matrix in Table 2 is determined using the expression 𝐷𝑣 defined in Tables 1 and 3 for 2D filtering algorithm.

3.4. Mapping Process

The main objective is to find the 𝑀 matrix which consists of the processor allocation vector (𝑃𝑑) and the scheduling vector (𝑆𝑑).

First we take the boundaries of the search space between which the 𝑃𝑑 and 𝑆𝑑 are to be searched. The selection of search space is an important factor, because there is an exponential growth in both area and time complexity of the mapping methodology. Consider that π‘ˆπ‘–,π‘ˆπ‘—,π‘ˆπ‘˜,…,π‘ˆπ‘› are the upper bounds of an 𝑛-dimensional nested loop algorithm. The heuristic followed in this work is to generate the search space that can be obtained by the following element set {0,1,π‘ˆπ‘–,π‘ˆπ‘—,π‘ˆπ‘˜,∏(π‘ˆπ‘–π‘ˆπ‘—)}.

3.5. Methods and Resources Used in Obtaining the Mapping Methodology

As a whole, the implementation of the mapping methodology consists of two parts. The first is the heuristic search for the mapping. The heuristic search allows us to obtain the near optimal solutions and then pick up the feasible architecture by pruning the solutions based on Steps 4–9 as described in Section 3.3. The new mapping methodology is explained with respect to the 2D filtering algorithm in Section 4. The modified heuristic method based on the new method followed after implementing the steps in Section 3.1 are implemented using MATLAB to obtain the results of the search procedure of Sections 3.1 and 3.2. Also the comparative results between the heuristic and the modified heuristic method for the 6D full search block motion estimation (FSBM) algorithm are given. The second part is the design space exploration of resultant architecture. It is obtained as explained in the next section.

3.6. High-Level Synthesis (HLS)

The input to high-level synthesis system is the problem represented in behavioural description in a high-level language. The optimization in a high-level synthesis is done at a level higher than the boolean optimization done by the RTL synthesis tools. This is suitable for hardware optimization of DSP and image processing algorithms [8]. This is followed by scheduling and allocation [9]. The GAUT [10] tool used incorporates all the above features and allows the design space exploration.

The algorithm is described in a high-level description in C, and this is used as the input design specification to the high-level synthesis tool. The high-level synthesis tool is used to obtain the Control Data Flow Graph (CDFG). The CDFG allows the designer to verify the design required at a later stage. It allows the tracing of data values as live variables in registers associated with the PE hardware. Also the high-level synthesis tool is used to obtain the design space exploration results which give the area Versus latency tradeoff.

4. Mapping of 2D Filtering Algorithm

4.1. 2D Filtering for Image Processing: A 4D Problem

The problem formulation of Section 2 and the methodology in Section 3 are applied to the 2D filtering problem. 2D filtering or convolution is one of the essential operations in digital image processing required for image enhancements. The grey levels are usually represented with a byte or 8-bit unsigned binary number, ranging from 0 to 255 in decimal. Equation (6) shows the two dimensional discrete convolution algorithm, where 𝐼[π‘₯,𝑦] represents the input pixel data image, π‘Š is the window coefficient, and 𝑂 is the output image. The movement of the mask window function to calculate the window function value for the whole image region is shown in Figure 3: 𝑂[][][]=π‘₯,𝑦=π‘Šπ‘₯,π‘¦βŠ—πΌπ‘₯,𝑦43𝑖=0𝑗=0𝐼[][].𝑖+π‘₯,𝑗+π‘¦βŠ—π‘Šπ‘–,𝑗(6)

Digital convolution can be thought of as a moving window of operations, where the window that is, mask, is moved from left to right and from top to bottom.

The 2D image filtering problem is a representative example of a 4D nested loop involving 2D convolution, as in Listing 2 and Figure 3. The computation is highly redundant and requires high data reuse. This is considered here for systolic mapping. An image of size 0 to +π‘˜1; 0 to +π‘˜2 is considered convolved with a mask of size 0 to +π‘˜3; 0 to +π‘˜4. The mask coefficients are stored in memory. The significant features of the algorithm are listed in the following section.

For( 𝑖 1 = 0 ; 𝑖 1 < π‘˜ 1 ; 𝑖 1 ++)
For( 𝑖 2 = 0 ; 𝑖 2 < π‘˜ 2 ; 𝑖 2 ++)
{
O[ 𝑖 1 , 𝑖 2 ] = 0;
    For ( 𝑖 3 = 0 ; 𝑖 3 < = + π‘˜ 3 ; 𝑖 3 ++)
    For ( 𝑖 4 = 0 ; 𝑖 4 < = + π‘˜ 4 ; 𝑖 4 ++)
    O[ 𝑖 1 , 𝑖 2 ] = O[ 𝑖 1 , 𝑖 2 ] + I[ 𝑖 1 + 𝑖 3 , 𝑖 2 + 𝑖 4 ] Γ— w[ 𝑖 3 , 𝑖 4 ]    (8)
    //Output one window function evaluation
    Ends do 𝑖 4 ; End do 𝑖 3 ;}
  End 𝑖 2 ;
End 𝑖 1 ;

4.2. Nested Loop Formulation

The nested loop formulation for the 2D filtering algorithm for image size π‘˜1Γ—π‘˜2 and window function size π‘˜3Γ—π‘˜4 is given in Listing 2β€”the same is represented in uniform recurrence equation form (URE) in Listing 3.

 For 𝑖 = 1 to π‘˜ 1
 For 𝑗 = 1 to π‘˜ 2
For π‘˜ = 1 to 4 //window size = 4 Γ— 3
             For 𝑙 = 1 to 3
            If ( 𝑖 == 1 && 𝑗 == 1)
                  𝑀 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 ) = π‘Š n e w
            Else if( 𝑗 == π‘˜ 2 )
                  𝑀 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 ) = 𝑀 ( 𝑖 βˆ’ 1 , 𝑗 , π‘˜ , 𝑙 ) ;//next 𝑖
            //[Dvw1 = 1 0 0 0]
            Else
             𝑀 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 ) = 𝑀 ( 𝑖 , 𝑗 βˆ’ 1 , π‘˜ , 𝑙 ) ;
            // next 𝑗 [Dvw2 = 0 1 0 0]
            End if
  If ( 𝑖 == 1 && 𝑗 == 1)
   𝐼 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 ) = 𝐼 n e w ;
  Else if( 𝑖 == 1 && 𝑗 > 1) // first rowβ€”second window calculation
   𝐼 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 ) = 𝐼 ( 𝑖 , 𝑗 + 1 , π‘˜ , 𝑙 + 1 ) ; move to the next 𝑗 pixel – 𝑗 + 1; pixel dataβ€”reads in next column of pixel and
  old data is moved in the ( π‘˜ , 𝑙 ) plane-PE_array from ( π‘˜ , 𝑙 ) to ( π‘˜ , 𝑙 + 1);//[ 𝐃 𝐕 𝐱 𝟐 = 0 1 0 1]
  Else if (j==k 𝟐 )    // for next 𝑖
   𝐼 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 ) = 𝐼 ( 𝑖 + 1 , 𝑗 , π‘˜ + 1 , 𝑙 ) ; // move to the next 𝑖 pixel 𝑖 + 1; pixel dataβ€”reads in next row of
  // pixel and old data is moved in the ( π‘˜ , 𝑙 ) plane-PE_array from ( π‘˜ , 𝑙 ) to ( π‘˜ + 1 , 𝑙 )
  //[ 𝐃 𝐕 𝐱 𝟏 = 1 0 1 0]
  End if
  If ( 𝑙 ==1 && π‘˜ == 1)
   𝑂 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 ) = 0;
  Else if (k < 3)
   𝑂 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 ) = 𝑂 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 βˆ’ 1 ) + 𝐼 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 ) βˆ— π‘Š ( 𝐼 , 𝑗 , π‘˜ , 𝑙 ) ;
  Else
   𝑂 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 ) = 𝑂 ( 𝑖 , 𝑗 , π‘˜ βˆ’ 1 , 𝑙 ) + 𝐼 ( 𝑖 , 𝑗 , π‘˜ , 𝑙 ) βˆ— π‘Š ( 𝐼 , 𝑗 , π‘˜ , 𝑙 ) ;
  End;
End For 𝑙 , π‘˜ , 𝑗 , 𝑖 ;

4.3. Single Assignment Statement Formulation or Uniform Recurrence Equation (URE) Form of 2D Filtering

The SAS of the 4D edge detection algorithm is in Listing 3, and the dependence vectors for the four level algorithms have 4 indices and the index space is generated by varying the four index values till the upper limit of each index as in Listing 3. The dependencies give the propagation direction of the input variables and update direction of the output data. In Listing 3, π‘Šnew represents the mask values in 2D filtering algorithm that are to be input at the fresh windowing and 𝐼new to indicate the loading of pixel values for a new frame of image.

4.4. Dependence Vectors for 2D Filtering Algorithm

Listing 3 is well commented to bring out the formulation of the following dependence vectors in Table 1.

4.5. Delay-Edge Matrix-Direct Method of Determining Delay and Edge Connectivity

The delay edge mapping is obtained by the product of dependence matrix (𝐷𝑣) and 𝑀 matrix as shown in Table 2.

Step 11 in the mapping process uses the dependence matrix to compute the edges and delays as follows: 𝐷𝑣=[0101;1010;0001;0010;0001;0010]β€² (Table 1); the first half in each vector in 𝐷𝑣 stands for the scheduling direction and the second half for the PE array directions. The first half (termed as sdd vector—→𝑠𝑑𝑑) gives the delays associated with the corresponding edges given by the second half (sde vector = →𝑠𝑑𝑒): ξ‚ΈDelays=→𝑠𝑑2××𝑀11→𝑠𝑑𝑑;Edges=→𝑠𝑑2××𝑀11→𝑠𝑑𝑒.(7) This is computed and presented in Table 2.

Mapping results for 2D filtering are given in Table 4(a) for heuristic method, and Table 4(b) gives the modified heuristic method.

4.6. Space-Time Mapping Matrix (𝑀) Illustration

The mapping was performed for 1D array. The generalized form of space time mapping matrix 𝑀 is given here as shown in (3); βŽ‘βŽ’βŽ’βŽ£π‘†π‘€=π‘‘π‘ƒπ‘‘βŽ€βŽ₯βŽ₯⎦.(8) if 𝑃𝑑=0031; 𝑆𝑑=4151.

4.6.1. Delay-Edge Matrix

The delay edge mapping is obtained by the product of dependence matrix (𝐷) and 𝑀 matrix: DEmat=π‘€βˆ—DVmat,⎑⎒⎒⎣⎀βŽ₯βŽ₯βŽ¦Γ—βŽ‘βŽ’βŽ’βŽ’βŽ’βŽ’βŽ’βŽ£βŽ€βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯⎦=⎑⎒⎒⎣⎀βŽ₯βŽ₯⎦.41510031010100101000010001100012111300130141(9)

4.7. Direct Method-Edge Connectivity and Delay Registers

The direct method in deriving the delay edge connectivity is obtained from the dependence vector as given in Table 6.(1)The delay edge matrix based on the heuristic search is used to calculate the cost as given in Table 4(a) and the above can be used to pick up the good solution based on minimum cost, but does always guarantee the feasibility. So we do not consider the delay edge obtained from this method.(2)Using the proposed modified search algorithm, 9, 9 or 12, 12 are the number of PEs and number of clock cycles in Table 4(b) (for assumed window size in Table 4(b)) are arrived at after pruning down the search results using the PE-plane subspace based on CTV. (3)As mentioned above, the delay edge connectivity is obtained directly from the dependence matrix directly by considering the scheduling directions for delays and considering the PE directions for the edges as discussed in Section 4.5 and as shown in Table 2 and the architecture is obtained using the mapping results and direct delay_edge connectivity.

4.8. Mapping Results

The cost function is defined as (10) and is used as an additional constraint mentioned to Step 9 in Section 3.2 for selecting architecture according to the modified heuristic method heuristic search Cost=π‘Žβˆ—processors+π‘βˆ—cycles+π‘βˆ—delays.reg(10) Here π‘Ž, 𝑏, 𝑐 are the scalar coefficients which represent weights for the corresponding costs to minimize the overall cost function.

4.9. Architecture

Figure 4 shows the architecture for edge-detection algorithm. It consists of 2 ports, one for accessing the image data and the other for the output. The architecture consists of 𝑀1×𝑀2 PEs, where 𝑀1×𝑀2 is the size of the window used. The intermediate output is propagated to the successive PEs within a row but has to be passed through a line buffer when passing the intermediate output between rows of PEs. The buffer width is equal to the number of pixels per row. The final output is at the 𝑀1×𝑀2 PE.

Figure 5(a) shows the search results giving the possible solutions including the register cost. Registers represent the delays in the connecting edges which are the result of heuristic search, but which may not be feasible or realizable. The Pareto optimal and near optimal solutions are shown in the plots Figures 5(a) and 5(b) based on the heuristic search and the modified heuristic search, respectively. The modified heuristic search developed by us picks up the good solutions with respect to the number of PEs and cycles concerned, but we see that the register cost does not reflect the Pareto optimal solution and does not guarantee feasibility. The delay-edge connectivity is obtained directly from the dependency vectors as explained in Section 3.3 and in Table 2 for 2D filtering and leads to the feasible architecture in Figure 4.

5. Mapping of 6D FSBM

The main objective is to find the 𝑀 matrix which consists of the processor allocation vector (𝑃𝑑) and the scheduling vector (𝑆𝑑). The method used is same as explained in Section 3.

5.1. Dependencies for 6-Level FSBM Algorithm

Dependence vectors formulations have been presented for a reduced index space 4D FSBM algorithm [11]. Due to lack of space, it is not presented.

5.2. Results of Modified Method for FSBM Algorithm

The mapping results after the search are presented here.

The heuristic search results of Tables 5 and 8(a) (using MATLAB) for 𝑝=1 and 2, respectively, are shown in the graph in Figure 6.

5.3. Delay-Edge Connectivity for FSBM Algorithm Using Table 5 Results

(1)1101βˆ—π·π‘£=3113310031; the edges we get are same as the elements in 𝐷𝑣   at the 𝑝-direction (hence verified), and the delays [0100]βˆ—π·π‘£=ans=3116431101616ans=register/delaysforthevariablesπ‘₯,𝑦,MAD,Dmin. This is obtained as a good solution from Table 5 by selecting the optimum cost taking into consideration the feasibility. (2)The final delay edge is given as follows: ⎑⎒⎒⎣00β„ŽΓ—π‘2𝑁𝑁111𝑁2𝑁2⎀βŽ₯βŽ₯⎦2𝑃+1112𝑝+12𝑃+11002𝑃+11.(11)The second row is the edge, and the first row is the registers connected obtained as the highest nonzero value, in the 𝐷𝑣 values along other indices other than 𝑝-direction in Listing 2. 𝑝-direction is the direction of orientation of the systolic array (PE array) in the 𝑛-D problem space. The above gives a minimal cost connectivity and register delay elements simultaneously satisfying the feasibility and implementability checked by the direct method.

6. Architecture of the FSBM Algorithm

The architecture is arrived at, based on above is in Figure 7.

6.1. Design Space Exploration Using High-Level Synthesis

The design space exploration results are presented in the following based on the architecture arrived at.

6.2. CDFG of the Design

The architecture in Figure 7 is input using a behavioural description using a C type language to the GAUT tool, and it generates the control data flow graph (CDFG) architecture as in Figure 8 and also integrates into ModelSim and Xilinx ISE.

6.3. Results of Design Space Exploration

The high-level synthesis tool allows the designer to input the timing constraint as the cadency values to obtain the tradeoff of allocation of hardware as obtained in Table 7 for 𝑝=1 for FSBM algorithm.

6.4. Design Space Exploration for 𝑝=2

The search range 𝑝 in FSBM algorithm is increased to 𝑝=2, and the design space exploration is done in MATLAB for the modified heuristic and also using the HLS GAUT tool.

The results of the above are shown in Figure 9.

7. Complexity Analysis

The merit of the modified heuristic algorithm is measured in terms of the search space complexity.

7.1. Search Space Complexity

In general, in heuristic search procedures, the loop bounds are considered as the maximum values for searching. But as the loop bounds and the nested loop dimension increase, the search space will be huge if vectors are exhaustively generated. A graphical representation of search space expansion with respect to the different values of 𝑛 for 𝑛-level nested loop algorithms is given in Figure 10.

The β€œa” bars show the search space obtained by taking the loop bounds, say βˆ’π‘ˆπ‘– to +π‘ˆπ‘–, as the limit for each variable, and the β€œb” bars are obtained by using our proposed modified heuristic elaborated in Section 3, where it is observed from the plot in Figure 10 that the increase in cost is not high.

7.2. Search Space Complexity Tables

Tables 9(a) and 9(b) show the complexity calculations for 6D FSBM and 4D FSBM and the proposed modified heuristic method whose results are in Tables 4(b) and 5(b).

Table 9(a) shows the complexity calculations for varying values of 𝑛 and gives a comparison between the general heuristic method and the method presented in this paper.

7.3. 6D Problem Reduced to 4D FSBM [11] and 4D Problem-2D FIR Filtering Problem

The reduction in search space by modifying the 6D algorithm to 4D as reported in [11] and also the benefit of the modified heuristic are reflected by the last entry in Table 9(b).

8. Conclusion and Future Work

Many of the computationally intensive algorithms are of 𝑛-D deeply nested loop type. The methodology of mapping of algorithms involves heuristic search wherein the search complexity is large. The search space of the 2D filtering and 4D FSBM has been pruned down using the scheduling vector →𝑠𝑑 and the constraints imposed by it. The search has been performed using MATLAB, for the PE array assigned to the identified (π‘›βˆ’π‘₯)-D subspace evolved with the nature of the CTV. The resultant mapping matrix is useful in determining the PE assignment and the exact clock cycle at which a particular node in 𝑛-D space represented by the DG is mapped onto a PE in the PE array. The search results are presented for 2 computationally intensive applicationsβ€”2D filtering and the reduced index space 4D FSBM algorithm. The graph in Figure 5(a) corresponds to Table 4(a) showing the heuristic search results that show the distribution of PEs and cycles and cost. Figure 5(b) corresponds to Table 5(b) that gives the number of PEs and cycles pruned down after applying the modified heuristic algorithm. The delay edge connectivity is determined by the proposed direct approach as described in Sections 3.3 and 4.5 using Tables 2 and 4, instead of using the Mapping Transformation Matrix 𝑀 or Tmat in Tables 4(a) and 5(a) as in [4]. The use of high-level synthesis tool is to obtain the CDFG. Also the design space exploration results obtained using high-level synthesis tool GAUT have been presented. The search have been performed for varying search ranges of 𝑃 values 𝑃=1 and 𝑃=2 and the number of resources used, and latency for different input cadency values gives the design trade-off results presented in Tables 7 and 8(b) shown in the graph in the Figure 9. The output file of the GAUT tool could be used to interface with simulation tools and synthesis tools to build the RTL design and map it onto target FPGA architecture in the future for elaborate timing verification. The complexity comparison of our method with heuristic method is given in Tables 9(a) and 9(b).