Research Article

The Potential for a GPU-Like Overlay Architecture for FPGAs

Table 2

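The schedule of operand reads from the central register file for batches of four threads (T0–T3, T4–T7, etc.) when decoding both ALU and TEX instructions. ALU source operands A, B, and C are read in three consecutive cycles (phases ALU0–ALU2), each cycle covering all four threads of the batch. TEX instructions require only one source operand, so the source operands for all four threads can be read in a single cycle.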

| Clock cycle | Inst phase | Register file read | ALU ready | TEX ready |
|---|---|---|---|---|
| 0 | ALU0 | ALU:A(T0,T1,T2,T3) | | |
| 1 | ALU1 | ALU:B(T0,T1,T2,T3) | | |
| 2 | ALU2 | ALU:C(T0,T1,T2,T3) | | |
| 3 | TEX | TEX:A(T0,T1,T2,T3) | T0 | |
| 4 | ALU0 | ALU:A(T4,T5,T6,T7) | T1 | T0 |
| 5 | ALU1 | ALU:B(T4,T5,T6,T7) | T2 | T1 |
| 6 | ALU2 | ALU:C(T4,T5,T6,T7) | T3 | T2 |
| 7 | TEX | TEX:A(T4,T5,T6,T7) | T4 | T3 |
| 8 | ALU0 | ALU:A(T8,T9,T10,T11) | T5 | T4 |
| 9 | ALU1 | ALU:B(T8,T9,T10,T11) | T6 | T5 |
| 10 | ALU2 | ALU:C(T8,T9,T10,T11) | T7 | T6 |
| 11 | | | | |
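The pattern in Table 2 can be reproduced with a short model, which may help when checking the schedule by hand. The following is a minimal sketch in Python, not the authors' implementation. The function name `schedule_row` and the constants are hypothetical, and the ready-time offsets (an ALU thread becomes ready 3 cycles after cycle 0 in thread order, a TEX thread 4 cycles after) are simply read off the rows listed in the table rather than derived from the hardware design.

```python
# Illustrative model of the operand-read schedule in Table 2.
# All names here are hypothetical; the offsets are inferred from the table rows.

PHASES = ("ALU0", "ALU1", "ALU2", "TEX")          # repeating 4-cycle pattern
OPERANDS = ("ALU:A", "ALU:B", "ALU:C", "TEX:A")   # operand read in each phase
BATCH = 4                                         # threads per batch

ALU_READY_OFFSET = 3  # assumption from the table: T0 is ALU-ready at cycle 3
TEX_READY_OFFSET = 4  # assumption from the table: T0 is TEX-ready at cycle 4


def schedule_row(cycle):
    """Return (cycle, phase, register-file read, ALU ready, TEX ready)."""
    phase = cycle % len(PHASES)
    batch = cycle // len(PHASES)
    threads = [batch * BATCH + i for i in range(BATCH)]
    names = ",".join("T%d" % t for t in threads)
    read = "%s(%s)" % (OPERANDS[phase], names)
    # One thread becomes ready per cycle once the pipeline has filled.
    alu_ready = "T%d" % (cycle - ALU_READY_OFFSET) if cycle >= ALU_READY_OFFSET else ""
    tex_ready = "T%d" % (cycle - TEX_READY_OFFSET) if cycle >= TEX_READY_OFFSET else ""
    return (cycle, PHASES[phase], read, alu_ready, tex_ready)


if __name__ == "__main__":
    # Print only the cycles that the table lists in full (0 through 10).
    for c in range(11):
        print(schedule_row(c))
```

Under these assumptions, four register-file reads per cycle over a four-cycle period supply four threads with three ALU operands and one TEX operand each, which matches the steady rate of one ALU-ready and one TEX-ready thread per cycle seen from cycle 4 onward. Any row the model produces beyond cycle 10 is an extrapolation of the pattern, not data from the table.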