Research Article

The Potential for a GPU-Like Overlay Architecture for FPGAs

Table 1

The schedule of operand reads from the central register file for batches of four threads (T0–T3, T4–T7, etc.) decoding only ALU instructions. An ALU instruction has up to three vector operands (A, B, C), which are read across the threads in a batch over three cycles. In the steady state this schedule can sustain the issue of one ALU instruction every cycle.

| Clock cycle | Inst phase | Register file read | ALU ready |
|---|---|---|---|
| 0 | ALU0 | ALU:A(T0,T1,T2,T3) | |
| 1 | ALU1 | ALU:B(T0,T1,T2,T3) | |
| 2 | ALU2 | ALU:C(T0,T1,T2,T3) | |
| 3 | | | T0 |
| 4 | ALU0 | ALU:A(T4,T5,T6,T7) | T1 |
| 5 | ALU1 | ALU:B(T4,T5,T6,T7) | T2 |
| 6 | ALU2 | ALU:C(T4,T5,T6,T7) | T3 |
| 7 | | | T4 |
| 8 | ALU0 | ALU:A(T8,T9,T10,T11) | T5 |
| 9 | ALU1 | ALU:B(T8,T9,T10,T11) | T6 |
| 10 | ALU2 | ALU:C(T8,T9,T10,T11) | T7 |
| 11 | | | |
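
The schedule in Table 1 can be reproduced with a short Python sketch. This is an illustrative model only, not the authors' hardware description: the function name `schedule`, the constants, and the fixed four-cycle period per batch are assumptions read off the rows of the table. The sketch also extrapolates the "ALU ready" pattern (thread index equals cycle minus three), so it would report an entry for cycle 11 and beyond, where the table as printed shows none.

```python
BATCH_SIZE = 4               # threads per batch (from the caption)
OPERANDS = ("A", "B", "C")   # up to three vector operands per ALU instruction
PERIOD = 4                   # assumed: each batch occupies four clock cycles


def schedule(n_cycles):
    """Return (cycle, inst_phase, register_file_read, alu_ready) rows."""
    rows = []
    for cycle in range(n_cycles):
        batch, phase = divmod(cycle, PERIOD)
        first = batch * BATCH_SIZE
        if phase < len(OPERANDS):
            # One operand is read for all four threads of the batch.
            threads = ",".join(f"T{t}" for t in range(first, first + BATCH_SIZE))
            inst_phase = f"ALU{phase}"
            read = f"ALU:{OPERANDS[phase]}({threads})"
        else:
            # Fourth cycle of the period: no ALU operand read in this model.
            inst_phase = ""
            read = ""
        # A thread becomes ready once all three operands of its batch are read;
        # after the three-cycle fill, one thread is ready per cycle.
        ready_idx = cycle - len(OPERANDS)
        ready = f"T{ready_idx}" if ready_idx >= 0 else ""
        rows.append((cycle, inst_phase, read, ready))
    return rows


if __name__ == "__main__":
    for row in schedule(11):
        print(row)
```

Under these assumptions, four threads are supplied with operands every four cycles, which is why the steady-state issue rate works out to one ALU instruction per cycle after the initial three-cycle fill.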