Abstract

A technique for parallelising multiple loops in a heterogeneous computing system is presented. Loops are first unrolled and then broken up into multiple tasks which are mapped to reconfigurable hardware. A performance-driven optimisation is applied to find the best unrolling factor for each loop under hardware size constraints. The approach is demonstrated using three applications: speech recognition, image processing, and the N-Body problem. Experimental results show that a maximum speedup of 34 is achieved on a 274 MHz FPGA for the N-Body problem over a 2.6 GHz microprocessor, which is 4.1 times higher than that of an approach without unrolling.

1. Introduction

Microprocessors are commonly used to implement computing systems as they have the advantages of low cost and fast development time. In performance-critical applications, performance can be improved by introducing larger degrees of spatial parallelism via reconfigurable hardware implemented on field programmable gate arrays (FPGAs). Heterogeneous computing systems using both microprocessors and FPGA-based custom function units can combine advantages of both for many applications.

Computationally intensive tasks in digital signal processing algorithms are usually iterative operations. Scheduling such loops in a heterogeneous computing system to fully utilise the available resources is difficult due to their complex nature. Previously proposed techniques tend to address a single loop only; they are summarised as follows.

(i) Control flow based [1, 2]. This approach divides a control flow graph into subgraphs based on control edges, and each subgraph is scheduled independently, typically using list scheduling. A complete schedule is generated by combining the schedules of all subgraphs. Since only one iteration of the loop body is analysed, this approach does not target implementations with higher parallelism on multiprocessor systems.

(ii) Modulo scheduling [3]. This approach generates a schedule for one iteration of a loop such that all iterations repeat at a fixed interval, that is, a software pipelined design. Since only a single iteration is analysed, limited parallelism is achieved.

(iii) Graph conversion [4]. An application with a loop can be characterised as a cyclic graph. This approach attempts to find a better schedule of the loop body by using a graph traversal algorithm to convert the cyclic graph into an acyclic one with a minimised critical path: a depth-first search traverses the cyclic graph and removes the feedback edges, and an acyclic graph scheduling technique then schedules the loop body. Task dependencies across different iterations are not analysed, which may reduce parallelism.

(iv) Loop unrolling [5–7]. This is a common technique to generate an implementation with greater parallelism. It involves unrolling a loop and extracting parallel tasks from different loop iterations. These techniques have only been applied to parallelise a single loop.

(v) Dynamic scheduling [8]. This approach schedules tasks at run-time, making use of both online and offline parameters. The loop condition is checked dynamically at run-time. Loop parallelisation is not addressed in this approach.

(vi) Loop fission [9, 10]. This approach breaks a loop into multiple tasks and maps each one to the FPGA, making it feasible to implement applications which exceed the FPGA size constraint. Since loop unrolling is not involved, this approach results in limited parallelism.

A comparison between this work and the different approaches is shown in Table 1. Previous work has focused on parallelising a single loop [3–7], and multiloop optimisation has not been adequately addressed. Since reconfigurable hardware in a heterogeneous system is capable of supporting parallel execution of tasks, a major challenge is to develop techniques which can effectively exploit this capability.

This work explores techniques to optimise applications with multiple loops in a heterogeneous computing system. Our recent work has shown that an integrated mapping and scheduling scheme with multiple neighborhood functions [11], and combining mapping and scheduling with loop unrolling [12] can achieve considerable performance gains. This work complements those results through a method for optimising the unrolling factors in multiple loops. The novel aspects of this work are as follows:

(i) a performance-driven strategy, combined with an integrated mapping/scheduling system with multiple neighborhood functions, to find the best unrolling factor for each loop (Section 2.4);

(ii) a static mapping and scheduling technique capable of handling cyclic task graphs for which the number of iterations is not known until run-time (Sections 3.1 and 3.3);

(iii) the introduction of additional management tasks for dynamic data synchronisation, while maintaining near-optimal performance when an accurate compile-time prediction of the run-time condition is made (Section 3.2).

The remainder of this paper is organised as follows. The proposed multiloop parallelisation scheme is presented in Section 2. Section 3 describes the quality score calculation, covering the loop unrolling and fission steps and an overview of the multiple neighborhood function based mapping/scheduling system. Experimental results are given in Section 4, and finally, concluding remarks are given in Section 5.

2. Multi-Loop Parallelisation

2.1. Reference Architecture

The reference heterogeneous computing system contains two processing elements (PEs): one microprocessor and one FPGA. Each processing element has a local memory for data storage during task execution, and the communication channel between the two processing elements is assigned a weight which specifies the data transfer rate. Results of a task's predecessors must be transferred to the local memory before the task starts execution. An illustrative sketch of such an architecture description is given below.
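
As an illustration, the architecture description consumed by the mapping/scheduling tool could be captured as follows. This is a minimal sketch in C; all type and field names are our own assumptions, not data structures defined in the paper.

    /* Minimal sketch of the reference architecture description.
       All names are illustrative; the paper does not define a
       concrete data structure. */
    typedef struct {
        const char *name;     /* "CPU" or "FPGA" */
        double clock_mhz;     /* operating frequency */
        double local_mem_kb;  /* local memory used during task execution */
    } ProcessingElement;

    typedef struct {
        ProcessingElement pe[2]; /* one microprocessor and one FPGA */
        double channel_weight;   /* data transfer rate between the PEs */
    } Architecture;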

2.2. Notations

Given an application containing a loop (Figure 1(a)), the following notations are used in this paper:

(i) Loop Unrolling and Unrolling Factor
Loop unrolling is a process that duplicates the body of a loop multiple times to replace the original body, with the loop-control code adjusted accordingly. The number of duplicated copies is called the unrolling factor. For example, Figure 1(b) shows an unrolled loop with an unrolling factor of N.
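
As a further concrete illustration, the following C fragment unrolls a simple loop with an unrolling factor of 4; the loop body and bounds are our own example, not taken from the paper.

    /* Original loop: R iterations of a single statement. */
    for (i = 0; i < R; i++)
        a[i] = b[i] * c[i];

    /* Unrolled with a factor of 4: the body is duplicated four times
       and the loop control is adjusted; a clean-up loop handles the
       case where R is not a multiple of 4. */
    for (i = 0; i + 3 < R; i += 4) {
        a[i]   = b[i]   * c[i];
        a[i+1] = b[i+1] * c[i+1];
        a[i+2] = b[i+2] * c[i+2];
        a[i+3] = b[i+3] * c[i+3];
    }
    for (; i < R; i++)  /* remaining R mod 4 iterations */
        a[i] = b[i] * c[i];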

(ii) Loop Fission and Sub-Loop
Loop fission is a process that splits a loop containing multiple instructions into a number of loops with the same loop control. Each resulting loop is called a sub-loop and contains a portion of the instructions of the original loop body. For instance, Figure 1(c) shows multiple sub-loops after fission.
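
For example, a loop unrolled with a factor of 2 can be split so that each copy of the original body becomes its own sub-loop; this is an illustrative C sketch of how unrolling and fission combine in this work, assuming R is even.

    /* Loop unrolled with factor 2 (body duplicated twice). */
    for (i = 0; i < R; i += 2) {
        a[i]   = b[i]   * c[i];
        a[i+1] = b[i+1] * c[i+1];
    }

    /* After fission: two sub-loops with the same loop control, each
       containing one copy of the original loop body. The sub-loops
       are independent and can run on different processing elements. */
    for (i = 0; i < R; i += 2)
        a[i] = b[i] * c[i];
    for (i = 0; i < R; i += 2)
        a[i+1] = b[i+1] * c[i+1];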

(iii) Task
A task is a block of consecutive instructions derived from the task partitioning stage for a given application [13]; for example, the loop in Figure 1(a) is a task.

(iv) Task Graph
A task graph is an acyclic graph representing the data-flow dependencies of tasks, where each task in the graph is executed only once and cannot start before its predecessors because of data dependencies. For instance, Figures 1(d) and 1(e) are the task graphs of Figures 1(a) and 1(c), respectively, where each loop is a node in the graph.

2.3. Overview

Figure 2 gives an overview of the proposed multi-loop parallelisation strategy. A search strategy is employed whose goal is to find an optimal unrolling factor for each loop so that overall performance is maximised. This section focuses on the search for unrolling factors; the calculation of the quality score is introduced in Section 3.

Given an application containing a set of loops L = {l_1, l_2, …, l_n}, let U be a set of unrolling configurations, with each u ∈ U designating an instance of the unrolling factors of all loops, u = (f_1, f_2, …, f_n), where f_i is the unrolling factor of loop l_i. Each unrolling configuration thus contains the unrolling factors of all loops in the application. In each iteration of the search, a set of unrolling configurations is first generated, and a quality score is then calculated for each configuration after the loop unrolling and fission, task graph generation, and mapping/scheduling processes have been applied. The unrolling configuration with the highest quality score is selected and used for the next iteration. This process is repeated until a termination condition is reached, the goal being to find a solution with the maximum quality score.

The advantage of considering unrolling and fission of all loops globally is that unrolled sub-loops from different loops can potentially be executed in parallel. This allows a better mapping/scheduling solution to be found after unrolling and fission. Figure 3 shows an example of unrolling two loops which have no data dependencies between iterations. In the original graph, two nodes represent the two loops; after unrolling and fission, each loop becomes three unrolled sub-loops. Before unrolling and fission, the two loops are mapped to the two processing elements PE1 and PE2; hardware resources are not fully utilised, and the processing time for three iterations under this mapping is 90 time units (Figure 3(c)). After unrolling and fission, two of the sub-loops are mapped to PE2 and PE3, respectively, and the other unrolled sub-loops are mapped to PE1; the processing time is reduced to 50 time units (Figure 3(d)). Two generated management tasks synchronise the results produced by the different sub-loops; these are introduced in Section 3.2.

Unrolling and fission can still achieve higher parallelism for loops with data dependencies between iterations. Since a loop may execute in parallel with other tasks in an application, the execution of the unrolled sub-loops can, after unrolling and fission, be better interleaved with those other tasks. Figure 4 shows the unrolling and fission of two loops with data dependencies between iterations. Before unrolling and fission, one loop is mapped to PE1 and the other to PE2; the overall processing time for three iterations is 90 time units (Figure 4(c)). After unrolling, the first sub-loop of one loop is mapped to PE2, and its remaining sub-loops are executed on PE1, reducing the overall processing time (Figure 4(d)). A better mapping/scheduling solution with higher inter-loop parallelism is thus obtained.

2.4. Generation and Selection of Unrolling Configuration

If an application contains only one loop, that loop is obviously selected for unrolling. For the multiple-loop case, the number of loops to unroll and the corresponding unrolling factors need to be determined. Since unrolling a loop without data dependencies between iterations is likely to achieve a larger performance gain than unrolling a loop with data dependencies, a performance-driven strategy (Algorithm 1) is proposed in this work.

(1)  u* ← u_0
(2)  u_0 = (f_1, f_2, …, f_n), where f_i = 1 for i = 1, …, n
(3)  q* ← 0
(4)  while FPGA resources are not exhausted do
(5)    for all loops l_i ∈ L do
(6)      generate u_i from u* by incrementing f_i by 1
(7)    end for
(8)    for all unrolling configurations u_i do
(9)      for all loops l_j ∈ L do
(10)       unroll l_j for f_j iterations, where f_j is the j-th factor of u_i
(11)       loop fission
(12)     end for
(13)     generate new task graph
(14)     generate complete mapping/scheduling
(15)     calculate quality score q_i for u_i
(16)     Δq_i ← q_i − q*
(17)   end for
(18)   find loop l_k with maximum Δq_k
(19)   u* ← u_k
(20)   q* ← q_k
(21)   update remaining FPGA resources
(22) end while
(23) return u* and q*

Given an application containing a set of loops L = {l_1, …, l_n}, an initial unrolling configuration u_0 is generated with all unrolling factors set to 1, that is, u_0 = (f_1, …, f_n), where f_i = 1 for i = 1, …, n. A new set of unrolling configurations is generated by incrementing each f_i in turn, for example, u_1 = (2, 1, …, 1) and u_2 = (1, 2, 1, …, 1). For each unrolling configuration u_i, a quality score q_i is calculated by first applying the unrolling factors specified in u_i, followed by fission to break each unrolled loop into sub-loops over the same loop count, with each sub-loop having the same loop body as the original loop. A task graph is then generated with each sub-loop treated as a task. The task graph is passed to the mapping and scheduling process, where a complete mapping/scheduling solution is generated and a quality score is calculated (Section 3). As a result, a set of quality scores {q_1, …, q_n} is produced. The corresponding quality improvement is calculated as Δq_i = q_i − q*, where q* is the best quality score to date. The unrolling configuration u_k with the highest quality improvement is chosen, the best quality score is updated as q* ← q_k, and the current unrolling configuration is replaced by u_k. This process is repeated until the resources on the FPGA are exhausted, causing termination of the algorithm.
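
The following C sketch summarises this search. quality_score() and fpga_exhausted() are placeholders for the full unroll/fission/task-graph/mapping/scheduling evaluation of Section 3 and for the FPGA resource check; they are assumptions for illustration, not functions defined in the paper.

    /* Sketch of the performance-driven search (Algorithm 1).
       f[0..n-1] holds the unrolling factor of each loop. */
    double search_unrolling(int n, int f[],
                            double (*quality_score)(const int *, int),
                            int (*fpga_exhausted)(const int *, int))
    {
        for (int i = 0; i < n; i++)
            f[i] = 1;                        /* u0: all factors set to 1 */
        double q_best = quality_score(f, n);

        while (!fpga_exhausted(f, n)) {
            int best = -1;
            double best_gain = 0.0;
            for (int i = 0; i < n; i++) {    /* increment each factor in turn */
                f[i]++;
                double gain = quality_score(f, n) - q_best;
                if (gain > best_gain) { best_gain = gain; best = i; }
                f[i]--;                      /* restore before next candidate */
            }
            if (best < 0) break;             /* no configuration improves */
            f[best]++;                       /* commit the best configuration */
            q_best += best_gain;
        }
        return q_best;                       /* f[] holds the chosen factors */
    }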

3. Quality Score Calculation

3.1. Unrolling, Fission, and Task Graph Generation

Given a set of loops L and an unrolling configuration u, the following steps are used to generate a task graph:

(i) Unroll each loop according to u.

(ii) Break each unrolled loop into sub-loops by fission; each sub-loop performs the same operations as the original loop body before unrolling.

(iii) Construct a new task graph by treating each sub-loop as a task, each having the same parent and child tasks as the original task before unrolling.

(iv) Generate a management task to synchronise the results produced by the different sub-loops (Section 3.2), and insert this task at the tails of all unrolled sub-loops in the task graph (the management tasks in Figure 3(b)); that is, the predecessors of the management task are the unrolled sub-loops, and its successors are the successors of the original loop. A sketch of this graph transformation follows the list.
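
The graph transformation in steps (iii) and (iv) can be sketched as follows. The Task type and the add_edge() helper are assumptions made for illustration; they do not come from the paper.

    /* Illustrative task graph node; add_edge(u, v) is assumed to
       record v as a successor of u and u as a predecessor of v. */
    typedef struct Task Task;
    struct Task {
        Task **preds; int n_preds;  /* data-flow predecessors */
        Task **succs; int n_succs;  /* data-flow successors   */
    };
    void add_edge(Task *from, Task *to);  /* assumed helper */

    /* Replace an original loop task with its n unrolled sub-loops
       and a management task that synchronises their results. */
    void expand_loop(Task *loop, Task *sub[], int n, Task *mgmt)
    {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < loop->n_preds; j++)
                add_edge(loop->preds[j], sub[i]); /* same parents as original */
            add_edge(sub[i], mgmt);               /* mgmt follows each sub-loop */
        }
        for (int j = 0; j < loop->n_succs; j++)
            add_edge(mgmt, loop->succs[j]);       /* mgmt feeds original successors */
    }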

The produced task graph is then presented to the mapping and scheduling tool to generate a quality score (Section 3.3), which guides the search.

3.2. Management Task

One of the problems introduced after unrolling is data synchronisation: since results are produced by unrolled iterations in parallel, they need to be reorganised into the correct sequence (Figure 5). Another problem is loop count uncertainty; for example, a loop may be unrolled N times but the actual loop count at run-time may not be a multiple of N, in which case some results must be discarded. To handle these problems, a management task is introduced which collects data from the different unrolled tasks, keeps track of the actual loop count at run-time, organises the collected data into the correct sequence, and discards unneeded data. The management task is treated as a normal task, inserted into the task graph, and presented to the mapping/scheduling tool. For loops without data dependencies, the following pseudo-code shows the data synchronisation process:

    for (i = 0; i < (M-1); i++)
        for (j = 0; j < N; j++)
            rst[i*N + j] = d[j][i];
    tc = R - (M-1) * N;
    for (i = 0; i < tc; i++)
        rst[(M-1)*N + i] = d[i][M-1];

where M is the actual number of times the unrolled loop is executed, R is the required loop count of the loop before unrolling, and N is the number of iterations being unrolled. d[j][i] is the result produced by unrolled iteration j during the i-th execution of the unrolled loop; for example, d[0] holds the results produced by the first unrolled iteration. rst is the original array used to store results. The second loop collects the results of the last execution and discards unneeded data, where tc = R − (M−1)*N is the number of data items remaining.

If there are data dependencies between iterations, the management task must select the correct result from the unrolled iterations:

    tc = M * N - R;
    switch (tc) {
        case 0:   rst = d[N-1]; break;
        case 1:   rst = d[N-2]; break;
        ...
        case N-1: rst = d[0];   break;
    }

With these management tasks, the generated mapping/scheduling solution does not require the designer to know the exact loop termination conditions. Users can, however, specify an estimated loop count at compile time; loops are unrolled using this information and a mapping/scheduling solution is generated. If the estimated loop count matches the actual value at run-time, maximum performance is achieved. If the loop count differs, the management task handles data synchronisation dynamically, so the generated mapping/scheduling solution remains feasible. These management tasks can easily be implemented in software or as hardware state machines.

3.3. Mapping and Scheduling Overview

A heuristic search-based approach is used to find the best mapping/scheduling solution for an input task graph, as shown in Figure 6. Given a task graph and a target architecture specification, which includes information about the processing elements and the communication channel, a tabu search iteratively generates different mapping/scheduling solutions (neighbors). For each solution, a speedup coefficient is calculated and used to guide the search, the goal being to find a solution with maximum speedup.

3.4. Integrated Scheduling Technique

Given a set of tasks T = {t_1, …, t_m} and a set of task lists Q = {q_1, …, q_p}, where each task list q_j is an ordered task sequence to be executed by processing element p_j, each task in q_j is processed by p_j in sequence once it is ready for execution, that is, once all of its predecessors have finished. Task mapping and scheduling is thus integrated into a single step that assigns tasks to task lists. A task assignment function is defined as A: T → Q × N; for example, A(t_i) = (q_j, k) denotes task t_i being assigned to position k of list q_j, meaning that t_i is the k-th task to be executed by processing element p_j. A mapping/scheduling solution is characterised by the assignments of all tasks to processing elements, that is, A(t) for every task t ∈ T. A minimal sketch of this representation is shown below.
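
This C sketch shows one way the ordered task lists could be represented; the constants and names are illustrative only, not the paper's implementation.

    #define NUM_PES   2    /* reference architecture: CPU + FPGA */
    #define MAX_TASKS 256

    /* One ordered task list per processing element: list[j][k] is
       the id of the k-th task executed by PE j, so mapping and
       scheduling are expressed by a single assignment. */
    typedef struct {
        int list[NUM_PES][MAX_TASKS];
        int length[NUM_PES];  /* tasks currently assigned to each PE */
    } Solution;

    /* A(t) = (q_j, k): append task t to the list of PE j, making it
       the next task that PE j will execute (no bounds check here). */
    static void assign(Solution *s, int t, int j)
    {
        s->list[j][s->length[j]++] = t;
    }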

3.5. Multiple Neighborhood Functions

Tabu search is used to find the best mapping/scheduling solution. It is based on neighborhood search, which starts with a feasible solution and attempts to improve it by examining its neighbors, that is, solutions that can be reached directly from the current solution by an operation called a move. Tabu search keeps a list of the searched space and uses it to guide the future search direction by forbidding moves to certain neighbors. In the proposed tabu search technique with multiple neighborhood functions, after an initial solution is generated, two neighborhood functions designed to move tasks between task lists are used to generate various neighbors simultaneously [11]. If there exists a neighbor that is better than the best solution so far and is not in the tabu list, this neighbor is recorded; otherwise, a neighbor not in the tabu list is recorded. If neither condition can be fulfilled, the solution in the tabu list with the least degree, that is, the solution resident in the tabu list for the longest time, is recorded. If the recorded solution has a smaller cost than the best solution so far, it becomes the new best solution. The searched neighbors are added to the tabu list and the solutions with the least degrees are removed. This process is repeated until the search fails to find a better solution for a given number of iterations. The overall control flow is sketched below.
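
The following C sketch outlines this control flow under stated assumptions: the neighborhood functions, cost evaluation, and tabu-list operations are placeholders for the operators of [11], and the Solution type is the illustrative one from Section 3.4.

    /* Placeholders for the operators of [11]; prototypes only. */
    Solution initial_solution(void);
    int    neighbors1(const Solution *s, Solution out[]); /* move operator 1 */
    int    neighbors2(const Solution *s, Solution out[]); /* move operator 2 */
    double evaluate(const Solution *s);       /* speedup coefficient */
    int    in_tabu(const Solution *s);
    void   push_tabu(const Solution *s);      /* evicts least-degree entries */
    int    least_degree_index(const Solution cand[], int n);

    #define MAX_NEIGHBORS 64
    #define MAX_IDLE      100   /* stop after this many non-improving rounds */

    Solution tabu_search(void)
    {
        Solution best = initial_solution(), cur = best;
        for (int idle = 0; idle < MAX_IDLE; ) {
            Solution cand[MAX_NEIGHBORS];
            int n = neighbors1(&cur, cand);      /* both neighborhood functions */
            n    += neighbors2(&cur, cand + n);  /* generate neighbors together */

            int pick = -1;
            for (int i = 0; i < n; i++)          /* best non-tabu neighbor */
                if (!in_tabu(&cand[i]) &&
                    (pick < 0 || evaluate(&cand[i]) > evaluate(&cand[pick])))
                    pick = i;
            if (pick < 0)                            /* all neighbors tabu:      */
                pick = least_degree_index(cand, n);  /* take the oldest solution */

            cur = cand[pick];
            push_tabu(&cur);
            if (evaluate(&cur) > evaluate(&best)) { best = cur; idle = 0; }
            else idle++;
        }
        return best;
    }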

3.6. Quality Score

For each mapping/scheduling solution, an overall execution time is calculated: the time to process all tasks on the reference heterogeneous computing system, including data transfer time. The processing time of a task t on processing element p is calculated as the execution time of t on p plus the time to retrieve results from all of its predecessors. The data transfer time between a task and a predecessor is assumed to be zero if they are assigned to the same processing element, as in the sketch below.
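
For example, the finish time of a task under this model can be sketched as follows; all of the lookup functions are assumed interfaces for the schedule under evaluation, not the paper's code, and for brevity the sketch ignores the processing element's own availability time.

    /* Assumed lookup interfaces. */
    int    n_preds(int t);              /* number of predecessors of t  */
    int    pred(int t, int i);          /* i-th predecessor of t        */
    int    pe_of(int t);                /* PE the task is assigned to   */
    double finish(int t);               /* finish time already computed */
    double exec_time(int t, int p);     /* execution time of t on PE p  */
    double transfer_time(int t, int p); /* time to move t's result to p */

    /* Earliest finish time of task t on PE p: wait for every
       predecessor's result to arrive, then execute. Transfer time
       is zero when the predecessor ran on the same PE. */
    double finish_time(int t, int p)
    {
        double ready = 0.0;
        for (int i = 0; i < n_preds(t); i++) {
            int u = pred(t, i);
            double arrive = finish(u)
                          + (pe_of(u) == p ? 0.0 : transfer_time(u, p));
            if (arrive > ready) ready = arrive;
        }
        return ready + exec_time(t, p);
    }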

A speedup coefficient is defined and used to measure the quality of a mapping/scheduling solution. It is calculated as the processing time using a single microprocessor divided by the processing time using the heterogeneous computing system:

    speedup = T_uP / T_HC,

where T_uP is the processing time on the microprocessor alone and T_HC is the processing time on the heterogeneous computing system.

A higher speedup indicates a better mapping/scheduling solution, as the application finishes in less time. This score is used to guide the tabu search, the goal being to find a solution with maximum speedup. The maximum speedup found is used as the final output and is defined as the quality score measuring the quality of the input unrolling configuration.
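
As a worked example with illustrative numbers, an application that takes 900 time units on the microprocessor alone and 300 time units on the heterogeneous system has a speedup coefficient of 900/300 = 3.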

4. Results

4.1. Experimental Setup

The reference heterogeneous computing system used in this work has one 2.6 GHz AMD Opteron(tm) Processor 2218 and one Celoxica RCHTX-XV4 FPGA board with a Xilinx Virtex-4 XC4VLX160 FPGA. The FPGA board and microprocessor are connected via an HTX interface with a maximum data transfer rate of 3.2 GB/s.

An isolated word recognition (IWR) system [14] is used as an application. It uses 12th-order linear predictive coding coefficients (LPCCs), a codebook with 64 code vectors, and 20 hidden Markov models (HMMs), each with 12 states. One set of utterances from the TIMIT TI 46-word database [15], containing 5082 words from 8 males and 8 females, is used for recognition. Table 2 shows the profiling results of the major processes of the isolated word recognition system on the AMD processor. It is found that the loops in vector quantisation (vq), autocorrelation (autocc), and hidden Markov model decoding (hmmdec) consume the most CPU resources (Table 2).

4.2. Multi-Loop Unrolling and Fission

In this experiment, the proposed unrolling strategy is applied. Figure 7 shows the mapping of the different processes in the speech system. Vector quantisation is unrolled 3 times (vq3) and mapped to the FPGA; all 12 iterations of the autocorrelation process are unrolled (autocc12); the inner loop of hidden Markov model decoding is unrolled for 12 iterations (hmmdec12), which is equal to the number of HMM states, and the outer loop is further unrolled for 2 iterations, meaning that two HMM decodings are executed in parallel. The corresponding FPGA resource usage and operating frequency are shown in Table 3. The speedup (quality score) obtained for this configuration is substantially higher than that obtained without unrolling, where vector quantisation, autocorrelation, and HMM decoding are executed on the FPGA without unrolling; a clear improvement is hence obtained using the proposed strategy.

Figure 8 shows the speedups for different vector quantisation unrolling factors, where all other processes are executed on the CPU. It is found that the speedup increases with the unrolling factor and then saturates. This explains why only three iterations of vector quantisation are unrolled in the final mapping/scheduling solution.

4.3. Run-Time versus Compile-Time Parameters

In the above experiment, mapping/scheduling solutions are generated by assuming an LPCC order of 12 at compile time. However, this value may be modified at run-time to cope with different circumstances. Using a mapping/scheduling solution generated for 12 LPCCs, Figure 9 shows the performance of the system for different run-time LPCC orders. Maximum performance is achieved at 12 LPCCs, and the performance drops when the run-time LPCC order differs from the compile-time value.

4.4. Quality Score Comparison

In addition to the IWR example, two other applications are employed to evaluate the proposed approach: the SUSAN corner detection image processing algorithm [16] and the N-Body problem [17]. Figure 10 shows the quality score comparison between the strategies with and without unrolling; the FPGA resource usage and operating frequency are shown in Table 4. The proposed strategy achieves speedups for IWR, SUSAN, and N-Body, with a maximum speedup of 34 times for N-Body, a factor of 4.1 over the approach without unrolling. The improvements for SUSAN and N-Body are much higher than the improvement obtained for the IWR application because each of these two applications contains a single critical loop: in SUSAN, the loop computing the similarity of pixels, and in N-Body, the loop computing velocity. Unrolling these loops significantly improves performance in those cases.

5. Conclusions

A multi-loop parallelisation technique involving fission and unrolling is proposed to improve intra-loop and inter-loop parallelism in heterogeneous computing systems. The utility of this approach is demonstrated on three practical applications, and a maximum speedup of 34 times is obtained using a computing system containing an FPGA and a microprocessor, which is 4.1 times higher than the case where unrolling is not applied. The generated system is tolerant to run-time conditions, and its performance is closer to optimal when the compile-time prediction of the run-time condition is more accurate.

Acknowledgment

The support from FP6 hArtes (Holistic Approach to Reconfigurable Real Time Embedded Systems) Project, the UK Engineering and Physical Sciences Research Council, Celoxica, and Xilinx is gratefully acknowledged.