Graduate School of Science and Technology, Kumamoto University, 2-39-1 Kurokami, Kumamoto 860-8555, Japan
Reconfigurable logic devices (RLDs) are classified as the
fine-grained or coarse-grained type based on their basic logic
cell architecture. In general, each architecture has its own
advantage. Therefore, it is difficult to achieve a balance between
the operation speed and implementation area in various
applications. In the present paper, we propose a variable
grain logic cell (VGLC) architecture, which consists of a 4-bit ripple carry adder with configuration memory bits and
develop a technology mapping tool. The key feature of the
VGLC architecture is that the variable granularity is a tradeoff
between coarse-grained and fine-grained types required
for the implementation arithmetic and random logic, respectively.
Finally, we evaluate the proposed logic cell using the
newly developed technology mapping tool, which improves
logic depth by 31% and reduces the number of configuration
data by 55% on average, as compared to the Virtex-4 logic
cell architecture.
1. Introduction
System large-scale integrated circuits
(LSICs), which exhibit high performance and have
high densities, are manufactured using advanced process technologies. However,
their high costs are an enormous disadvantage. To respond to diversifications
in market trends and frequent changes in standards, various types of products
must be manufactured in low volumes, despite the fact that conventional
production facilities essentially target high volumes of only a few types of
LSICs. It follows that the cost benefit of each chip is poor because of the
increase in both the nonrecurring expense (NRE) and design cost.
Hence, a reconfigurable logic device (RLD), which has
circuit programmability, is applied to embedded systems as a hardware
intellectual property (IP) core. Due to its flexibility, it is possible for
designs to be implemented in the shortest turn-around time from specification
to implementation. In particular, there is a possibility that design complexity
and power distribution problems can be solved using a dynamic reconfigurable
processor. It becomes necessary to adapt the
frequently changing market cycles, such as that of mobile phones.
However, conventional RLDs, which are commercial
field-programmable gate arrays (FPGAs), cannot achieve efficient
implementation. Therefore, the chip area and power consumption increase.
Because embedded systems, in particular, require a small area and low power
consumption, the above-mentioned parameters are critical for these products. If
we consider the quality of the actual product, conventional FPGAs are not
satisfactory in terms of performance or function.
We have studied a reconfigurable logic architecture
that has both flexibility and high performance [1–4] as a reconfigurable IP core. Our goal is specification
circuits rather than large circuits, such as stand-alone commercial FPGA.
Figure 1 shows the usage of reconfigurable IP cores as follows: (a) bug fixing
after fabrication process, (b) avoidance of high volume, (c) version upgrade,
and (d) multiple mode/specification. In the present paper, we propose a
variable grain logic cell (VGLC) architecture that can change the computational
granularity corresponding to the application. The novel architecture, which is
based on a 4-bit ripple carry adder that includes configuration memory bits,
offers a tradeoff between coarse and fine granularity and can be used for efficient
mapping in an application. We also evaluate the implementation efficiency using
the newly developed technology mapping tool.
Figure 1: The use of a VGLC as a reconfigurable logic core.
The remainder of the present paper is organized as
follows. Related research is described in Section 2. Section 3 introduces issues
related to implementation efficiency in RLDs. Section 4 describes the VGLC
architecture. Section 5 describes the technology mapping method. Section 6
describes the evaluation process and method. Section 7 presents a discussion of
the results obtained herein. Finally, the conclusions and an overview of future
research are presented in Section 8.
2. Related Research
The need to
optimize a general-purpose reconfigurable logic cell architecture is recognized
in the field of reconfigurable IP-core design. For example, mixed-grain
FPGA [5] with four
2-LUTs and carry generation logic is used to reduce critical path delay and
implementation area for the DSP application domain. Mixed-grained FPGA can construct
a 4-bit adder or 4-input logic using one logic cell as well as our concept.
They performed implementation evaluation using hand mapping for small circuits.
Reference [6] also
proposes memory-oriented architecture. Wilton et al. presented a synthesizable
datapath-oriented embedded FPGA [7] that is optimized for bus-based operations, which are
common in signal processing. This architecture will be used to implement
multiply functions, rather than a more general logic circuit. NEC electronics'
DRP-1 [8], Hewlett
Packard Chess architecture [9], and IPFLex's DAPDNA [10] are proposed for DSP
application domain. These devices are based on the ALU structure. The function
units in Chess are 4-bit ALUs, and the connections are 4-bit buses. Each ALU is
adjacent to four switch boxes, and each switch box is adjacent to four ALUs. On
the other hand, DPFPGA [11]
and the medium-grain logic cell [12] are based on the LUT structure. In [13], the proposed logic cell
includes a 6:2 compressor, which features additional fast carry chains that
concatenate adjacent compressors and can be routed locally without a global
routing network. Altera Stratix devices [14, 15], Xilinx Virtex-5 [16], and Kumamoto University MCNC [17] are multigrained (adaptive) LUT structures. In the
Stratix-II and Stratix-III, the adaptive logic module (ALM) can be used for
varying input granularity. Higuchi
et al. [18] presented reconfigurable hardware (RHW), which has a 1-bit full adder
and EXOR and MUX. While they presented a fast depth-constrained technology
mapping algorithm for RHW as tree-structured 2-LUTs, RHW only ensures 2-input
logic completely. Peter Jamieson and Jonathan Rose present shadow
clusters [19]. The
shadow cluster is a standard FPGA logic cluster, such as a multiplier, that
shares the routing resources with the hard circuit. The key advantage of the
shadow cluster concept is that the waste associated with the unused
programmable routing is mitigated in hard circuit tiles because a shadow
cluster always allows the programmable routing designed for hard circuits to be
used.
3. Application Implementation Spectrum
3.1. Granularity
Conventional RLDs are roughly classified into
fine-grained and coarse-grained architecture types based on the granularity of
the basic logic block [20]. The fine-grained reconfigurable architecture is
associated with lookup tables (LUTs) or multiplexers, which are essentially
identical to the logic elements of FPGAs. These logic elements can efficiently
implement any circuit for random logic functions. On the other hand, a typical
element of the coarse-grained reconfigurable architecture is a computational
datapath unit, such as the arithmetic-logic unit (ALU). This architecture type
reduces the total amount of configuration data, thereby improving the
reconfiguration overhead. NEC electronics' DRP-1 [8] and IPFLex's DAPDNA [10] are coarse-grained dynamic
reconfigurable processors that are developed for the current market.
However, these RLDs can be efficiently implemented
only in their specific application domains. If fine-grained RLDs incorporate a
digital filter, which requires several arithmetic operations, a large circuit
area will be required. On the other hand, coarse-grained RLDs have several area
and delay overheads for glue logic implementation.
In order to overcome this disadvantage, traditional
RLDs have dedicated circuits inside and outside the logic block. Table 1 shows
the features of two typical RLDs. The Xilinx Virtex-4 (XC4VLX15) [21] comprises a carry logic, a
MULT AND, and an 18-bit multiplier that is embedded as a hard IP core. On the
other hand, DRP-1 has an 8-bit data management unit (DMU) as the logic unit and
eight 32-bit multiplier units. However, we must carefully consider the type of
functionality for dedicated circuits. If these circuits are not utilized in an
application, then the space they take up on the chip is wasted. It is difficult
to achieve a balance between the operation speed and area in applications.
Table 1: Structures of typical RLDs.
3.2. Computational Characterization of
Applications
In this section, we present a type of application
domain analysis. For this purpose, we use six types of OpenCores [22] benchmark circuits that are
characterized in terms of three types of application domain: the controller
domain, the cryptosystem domain, and the DSP domain.
We first synthesize six industrial designs using the
Synopsys Design Compiler with ASIC standard cell. Here, we extract the datapath
circuit in the form of a macroblock using the Synopsys Design Ware library.
This allows the extraction of various macroblocks, such as adders, multipliers,
and wide-ranging MUXes. Nondetected circuits are
categorized as random logic computation circuits.
We mapped these designs onto standard cells with a CMOS
0.35- library and generated a synthesis report
containing the area information for each design. Based on the obtained results,
we classified all of the design components into three main groups: random
logic, arithmetic functions, and MUXes.
Figure 2 shows the information obtained
from the area report: Ac97 and Vga are the circuits for the control application
domain, aescipher and sha256 are the circuits for the for the cryptosystem
application domain, and biquad and Dct are the circuits for the DSP application
domain. The graph shows the percentage of the total area taken by each
function. Since flip flops are unrelated to the division
of random logic or arithmetic logic, they are
excluded. The ratio of random logic to arithmetic logic is not constant across
different applications. In addition, this shows that the MUX occupies a large
area in four circuits and has different input ranges, for example, 2:1MUX,
4:1MUX, and 8:1MUX as indicated by the synthesis report. In [5], a similar analysis and
results were presented for the DSP application domain. Hence, it is necessary
to effectively implement these computations in a logic block for efficiently
mapping of the area. At the same time, we must consider not only the type of
computation but also various input ranges.
Figure 2: Synthesis results of six benchmark circuits.
In the following section, we propose a homogeneous
architecture in which, as discussed in Section 3.2, the various functionality
requirements must be fulfilled in every logic cell. The potential area overhead
is avoided by maximal reuse of existing logic cell resources.
4. Proposed Logic Cell Architecture
As mentioned
above, since the granularity of a conventional logic cell is fixed, it is
difficult to efficiently implement circuits for all applications. This problem
can be solved if the computational granularity of a logic cell unit can be
changed. We propose a logic cell that has a 4-bit ripple carry adder with a
configuration memory bit. In this section, we describe a hybrid cell that is
used as the basic element of the VGLC. Next, we introduce a novel logic cell
structure and its functions.
4.1. Hybrid Cell
The arithmetic computation is based on an adder, and
the random logic is expressed efficiently in a “canonical form,” such as a
lookup table. Figure 3(a) shows the structure of a 1-bit full adder. Although
an OR gate is generally used for the output of the full adder, an EXOR gate can
also be used. Figure 3(b) shows the 2-input Reed-Muller canonical form. The
2-input Reed-Muller canonical form is described by the following equation:
Figure 3: Basic structure of a hybrid cell.
Each part enclosed by brackets corresponds to a
configuration memory bit. This equation can represent the 16 logic patterns
with a 4-bit configuration memory as well as a 2-input LUT. Both Figures
3(a)
and 3(b) have common parts consisting of EXOR and AND gates.
Figure 3(c) shows
a hybrid cell (HC) composed of a full adder with four configuration memory
bits. The HC can be constructed as either a full adder or a 2-input canonical
form according to the computation. In the case of mapping a 1-bit full adder,
there is a logic cell structure having either a 4-LUT with carry logic or two
3-LUTs in the traditional FPGAs. In contrast, an HC requires only four
configuration memory bits, thereby reducing the configuration data.
Furthermore, to allow increased functionality beyond a
1-bit full adder and 2-input Reed-Muller canonical form, we construct a
structure called basic logic element (BLE), as shown in
Figure 4. A BLE consists
of HCs, MUXes, EXOR, inputs ; and
AS, and
outputs
and .
Since , and AS input variables or clamp to 1/0, a BLE can perform
three basic functions; as shown in Figure 4.
Figure 4: Components and functions of a BLE.
4.2. Variable Grain Logic Cell
Architecture
We propose a VGLC architecture with HC.
Figure 5 shows
detailed views of the ports in the VGLC. The novel logic block, which has 21
inputs and four outputs with a 27-bit configuration memory, consists of four
BLEs and an output selector part. The number of BLEs is determined by two
reasons, namely, coarse-grained architecture (e.g., Chess [9] and mixed-grained
architecture [5]),
which employ nibble bit processing in the logic element and are suitable for
extension to 4-, 8-, 16-, 32-, and 64-input ranges. On the other hand,
fine-grained architecture uses at least 4-input logic implementation in
traditional FPGAs. Therefore, by using a VGLC with four BLEs, the VGLC can
implement both nibble bit ALU and 4-input random logic.
Figure 5: Variable grain logic cell architecture.
The signal carry_in and carry_out terminals are
connected directly to the adjoining VGLCs as dedicated lines. The add/sub
control signal (AS) and carry selector memory (CP) are entered commonly for all
four BLEs. As shown in Figure 6, the VGLC can be programmed to operate in one of
five modes: Arithmetic, Shift-resistor, Random Logic, Misc. Logic, or Wide-range MUX Mode.
Figure 6: Basic function
of a VGLC.
4.2.1. Arithmetic Mode
When a BLE is
used as a full adder, the VGLC is composed of a 4-bit ripple carry adder or a
subtractor with a carry line between the BLEs. Since the carry path is created
via the local interconnection within a logic block, it is propagated at a high
speed. We can expand the bit range corresponding to the computation using the dedicated
lines carry_in and carry_out. It is possible to dynamically select the add/sub
functions using AS.
4.2.2. Shift Register Mode
Figure 7 shows
the architecture of the output selector part. The output selector with a 9-bit
control memory is composed of a serial-to-parallel 4-bit shift register. In
addition to the carry path, the VGLC has a dedicated shift line to increase the
shift range.
Figure 7: Output selector part.
4.2.3. Random Logic Mode
When a BLE is used in the canonical form, the VGLC is
composed of a 3-input or 4-input canonical form (3-CF or 4-CF) with MUX10-12
(see Figure 5) using the 2-input canonical form (with some shared inputs). Since
4-LUT is a good implementation in conventional FPGAs, four BLEs are implemented
in a single VGLC. In the Random Logic mode, the VGLC can have various canonical
forms, for example, four 2-input canonical form, two 3-input canonical form, or
one 4-input canonical form, so that suitable BLEs corresponding to the circuit
designs can be allocated. Furthermore, the mapping area and the number of
configuration memory bits are expected to be improved.
4.2.4. Misc. Logic Mode
It is shown
that the VGLC can represent 2-, 3-, and 4-input canonical forms. However, this
requires an area larger than 2-, 3-, and 4-LUTs, respectively. In order to
reduce these overheads, we use the misc. logic (miscellaneous logic) function
that applies its gate structure. Since a BLE can be used for a 4-input/2-output
module, it can represent the maximum 3- or 4-input logic pattern with 4-bit
configuration memory bits. Table 2 depicts the logic pattern that can be
expressed at the output of dt and ds in the BLE. The -input variable has output logic patterns. Thus, there are 65,536,
and 256 patterns in the 4-input and 3-input canonical forms, respectively. For
example, since AS = 0 in the 3-input variable, dt and ds are capable of
covering 120 and 43 logic patterns, respectively. In total, the misc. logic
function has the ability to represent 80.47% of the 3-input logic patterns.
Figure 8 shows an example of 3-input Misc. Logic (3-misc.) usage.
Table 2: Output logic
pattern of the BLE (CP = 0).
Figure 8: Example of a three-input misc. logic function.
Using multiple BLEs, the VGLC can also increase the
number of inputs that are expressed. The coverages are 73.6% and 50.3% in the
4-input logic and 5-input logic, respectively, with two BLEs. We can combine a
maximum of four BLEs.
4.2.5. Wide Range of Multiplexer
Mode
In Figure 5, the
output of MUXes 2-5 is through the HC, and the BLE is composed of a 2:1
multiplexer. In addition to the MUXes 10–12, which are used to realize the
canonical form, a wide range of MUX configurations, such as 4:1 and 8:1, can be
implemented.
5. Technology Mapping Method
5.1. Mapping Flow
In order to
efficiently map the five functions of the VGLC, the datapath circuits (e.g.,
adder) a or wide-ranging multiplexer (e.g., 8:1 multiplexer) are prepared in
advance in the user library for the logic synthesis process. We can extract
these circuits as a macroblock during logic synthesis. The remaining circuits
are mapped using the Misc. Logic and Random Logic modes.
5.2. VGLC-HeteroMap
We need to implement Random Logic circuits
using the VGLC structure. UCLA FlowMap [23] is well known as a FPGA technology mapping
algorithm.
Although the smallest number in the LUT can be mapped using FlowMap, it targets
only homogeneous logic cell structures such as 4-LUT. On the other hand, UCLA
HeteroMap [24], which
has a delay optimized algorithm, can also map heterogeneous structures, such as
the Xilinx XC4K series [25], having a 3-LUT and a 4-LUT in one logic block. Since
the HeteroMap source code is open to the public [26], we develop the
VGLC-HeteroMap, which is a modified HeteroMap. In particular, we add the
following two factors.
(i) Correspondence to the netlist that contains
the macroblock.(ii) Addition of the Misc. Logic mode for random
logic mapping.
First, since HeteroMap targets single-output LUTs, it
must be modified such that the netlist will include a multioutput macroblock.
Second, the VGLC has the Misc. Logic
mode and the Random Logic mode for random logic mapping. Therefore, we add the Misc. Logic matching process
into the VGLC-HeteroMap. The pattern matching of Misc. Logic results in Boolean
matching. Debnath and Sasao [27] present an efficient method to check the equivalence
of two Boolean functions under permutation of the variables. As a basis of the
Boolean matching, they introduced the concept of the -representative and a
breadth-first search algorithm for its quick computation. If two functions have
the same -representative, then they match. We use this technique for
Misc. Logic pattern matching. Since this method is useful for up to eight of
inputs, we assume that Misc. Logic circuits with a maximum of eight inputs
(8-misc.) are used with BLEs. The definitions for this initial search is as
follows.
Definition 1.
The minterm expansion of an
-variable function is ,
where .
The binary digit is called the coefficient of the th minterm, the th coefficient, or simply the coefficient. The bit binary number is the binary number representation of .
The subscript 2 is used to denote a binary number.
Definition 2. Two functions and are P-equivalent if can be obtained from by permutation of the variables. denotes that and are -equivalent. -equivalent functions form
a -equivalence class of
functions.
Definition 3. The function that has the
smallest binary number representation among the functions of a -equivalence
class is the P-representative of
that class.
Figure 9 shows an example of the mapping process in
the VGLC. The pseudocode of the VGLC-HeteroMap with the Misc. Logic mode is
given in Algorithm 1.
Algorithm 1: Pseudocode in
random logic mapping.
Figure 9: VGLC-HeteroMap.
6. Experimental Methodology
In this evaluation, we show the number of
logic cells, the logic depth, the area, and the amount of configuration data
used to implement benchmark circuits. This section presents a brief description
of the environment and the model using the following evaluations.
6.1. Evaluation Environment and Modeling
We need to
fairly compare both the coarse-grained and fine-grained types with the VGLC.
However, a coarse-grained type logic cell has various structures, and there is
no freely available CAD tool. Therefore, the fine-grained logic cell is
restricted to three types of LUT-based structure, as shown in Figures 10
and 11. Figure 10(a) shows the 6-LUT, and
Figure 10(b) shows a 4-LUT with a carry
chain. In addition, Figure 11 shows the heterogeneous LUT structure of Virtex-4.
The total number of transistors in each logic cell is computed using the cell
count suggested in [28, 29]. A 6-LUT can be implemented using 67 SRAMs, seven
inverters, one EXOR, one flip flop, and 64 2:1MUXes, for a total of transistors. A 4-LUT can be implemented using
16 SRAMs, five inverters, one EXORs, one flip flop, and 16 2:1MUXes, for a
total of transistors. A Virtex-4 structure can be
implemented using 35 SRAMs, five inverters, two EXORs, two ANDs, two flip
flops, and 35 2:1MUXes, for a total of transistors. On the other hand, the VGLC has a
function based on three types of structure; as shown in Figure 12. The VGLC can
be implemented using 27 SRAMs, 12 ANDs, four flip flops, 32 2:1MUXes, and 24
EXORs, for a total of transistors. Each logic cell is counted as a
unit when we evaluate the area. Therefore, the total area of the mapping
solution is equal to the total number of logic cells. Such simplification is
reasonable because the layout information is not yet available, and the die
size of commercial FPGAs is not open.
Figure 10: Homogeneous LUT structures with carry chain.
Figure 11: Heterogeneous LUT structures with carry chain.
Figure 12: Three components of the VGLC.
Similar to the area count, the total amount of
configuration data is characterized by the product: the number of configuration
memory bits per logic cell the total number of logic cells. The number of
configuration memory bits for each architecture is as follows: VGLC: , comparison logic cell: , , and .
The combination
delay of the 4-LUT () and heterogeneous LUT (, ) refers to Xilinx Virtex-4 [21], which is manufactured using 90- process technology. In addition, 6-LUT () refers to Xilinx Virtex-5 [16], which is manufactured
using 65- process technology. The delay parameters in
the typical condition refer to Xilinx ISE. On the other hand, with regard to
transistor optimization for the proposed architecture, we perform transistor
level design with the Rohm 0.35-
3.3 V transistor model and obtain the delay by
Synopsys HSpice simulation in the typical condition. Table 3 summarizes the
combination delay for each function of the VGLC.
Table 3: Implementation parameters for VGLC.
Note that although the 0.35- process technology is somewhat obsolete. In
addition, the placement and routing are not performed because the present
research focuses on logic depth and area evaluation for the proposed logic
cell. The interconnect cost of such an architecture is not considered herein
and will instead be evaluated using an advanced process technology in a future
study. Therefore, in this evaluation, we assume that the routing architecture
is of the island type.
6.2. Evaluation Flow
Figure 13 shows
the evaluation flow, and the three mapping methods are as follows:
Figure 13: Architecture evaluation flow.
(A)mapping method using Random Logic mode only,(B)mapping
method using (A) + Misc. Logic mode + macro block,(C)mapping
method using each comparison logic cell.
First, the Verilog RTL netlists of circuits are used
as inputs for the synthesis tool Synopsys Design Compiler and are mapped onto
the standard cell library. An adder and/or wide-ranging multiplexer is
extracted as a macroblock using the DesignWare library. Furthermore, we convert
the gate-level netlist into the BLIF format using perl script, the netlists are
mapped using the VGLC-HeteroMap. Second, each comparison logic cell used the
above-described netlist and technology mapping tool, as well as the VGLC.
Finally, we obtain the mapping delay, the logic depth, and the number of logic
cells.
In this evaluation, OpenCores [22] circuits and MCNC [30] circuits are chosen as
benchmarks. Note that since the MCNC benchmarks are provided in EDIF format,
which is almost synthesized, we cannot extract macroblocks.
7. Results and Analysis
7.1. Effect of Misc. Logic
We first
evaluate effect of misc. logic functions. In this evaluation, five OpenCores
circuits and large/medium MCNC benchmarks are used. Table 4 shows the results
of the number of logic cells and the mapping delay for technology mapping
benchmarks. This table compares two conditions, that is, a VGLC that uses only
a canonical form function without the misc. logic function (column method “(A)”)
and a VGLC that adds the misc. logic function (column method “(B)”).
Figures 14 and 15 illustrate the ratio of misc. logic to the total number of
logic functions. Note that five OpenCores benchmarks include macroblock
functions. misc. logic functions occupy approximately 75.5% and 80.7% in Figures 14
and 15, respectively.
Table 4: Number of logic cells and mapping delay.
Figure 14: Ratio of Misc. Logic (five MCNC benchmarks).
Figure 15: Ratio of Misc. Logic (five OpenCores benchmarks).
Table 5 shows the mapping delay, the area, and the
number of configuration data, which are normalized by each value of (A).
Compared to case of CF only, Misc. Logic improves the critical path delay, the
area, and the number of configuration data. The VGLC with misc. logic functions
improve one at a maximum of 27% and an average of 13% in mapping delay.
Similarly, the area and configuration data improve one at a maximum of 60% and
an average of 38%. As a result, Misc. Logic has an effect on the improvement of
the circuit performance.
Table 5: Normalized
delay, area, and configuration data.
7.2. Comparison with LUT-Based
Architectures
Table 6
compares four architectures, that is, the VGLC, the Virtex-4, the 4-LUT, and
the 6-LUT using five larger MCNC circuits and five OpenCores circuits (as shown
Figure 2). The logic depth, the area, and the number of configuration data are
shown in the table. Table 7 shows the performances normalized by each value of
the VGLC. Compared to the Virtex-4 logic cell, the VGLC reduces the logic depth
and configuration data by 31% and 55%, respectively, the area increases by 24%
on average. Compared to the 4-LUT based logic cell, the VGLC reduces the logic
depth and configuration data by 62% and 42%, respectively, while the area
increases by 30% on average. Compared to the 6-LUT based logic cell, the VGLC
reduces the logic depth, the area, and the configuration data by 48%, 45%, and
243%, respectively. In particular, Sha256 and vga show improvements in logic
depth of more than three times the other benchmarks. Since the critical path
line contains multibit adder circuits as a macroblock part, such as a 32-bit
adder, the VGLC can be effectively packed using the ALU mode. These circuits
have a larger arithmetic logic than the other circuits; as shown in Figure 2.
Table 6: Logic depth, area, and data for different
architectures.
Table 7: Normalized logic depth, area, and configuration
data.
The logic depth and amount of configuration data are
reducible in all comparative logic cells. For this reason, the misc. logic
function and heterogeneous mapping may have a considerable effect on
large-scale applications in particular. With fewer logic cells to cross, the
routing delay between the modern FPGA is lower. Therefore, it is necessary to
pack more logic into one block in order to avoid routing delay, which is a
major problem in the deep submicron FPGA. It is important that the VGLC can
pack more logic per logic cells. On the other hand,
Table 7 shows that most
benchmarks occupy more silicon area, as compared to 4-LUT and Virtex-4
architectures. Nevertheless, the configuration data required by the VGLC is
less than that required by all three architectures.
In the present evaluation, we assume that the routing
architecture is island style routing. It is actually a routing area
which dominates a die area and has a significant influence
on circuit performance (area, delay, and power). However, since the VGLC is
used as a logic IP core for small or medium circuits, we believe that the
number of routing tracks in island style routing is reduced to below the number
of stand-alone type modern FPGAs. Moreover, the VGLC does not restrict the
routing architecture to the above style, and various connections and routing
architectures can be implemented due to the specifications. For example,
Runesas MX-core [31]
has massive data banks that consist of SRAM data registers, and the medium-grain
reconfigurable architecture [32] has a H-tree routing architecture.
8. Conclusions
In the present
paper, we have proposed a VGLC architecture and evaluated the area, delay, and
configuration data using a technology mapping tool called VGLC-HeteroMap. The
novel architecture, which is based on a 4-bit ripple carry adder that includes
configuration memory bits, offers a tradeoff between coarse and fine
granularity and can be used for efficient mapping in an application. In our
evaluation, the VGLC improves the logic depth by 31% and reduces the number of
configuration data by 55% on average, compared to the Virtex-4 logic cell.
The present study did not consider the routing
network, which is likely to dominate the area and delay in an FPGA
implementation. In the future, we will study the connection block and routing
structure required to best support the VGLC. Currently, we are attempting to
optimize the full custom design to evaluate the chip performance and are
developing clustering, place, and routing tools for the proposed architecture.
Furthermore, the VGLC can be used as an IP core and can be considered as an
FPGA logic element. Since the VGLC can reduce the logic depth compared to other
traditional LUT architectures, it is possible to implement the VGLC as a 2D
array of the VGLC that is connected by an island style routing architecture.
Kuon and Rose [33]
showed that the benefit of the IP core in the FPGA could be lost due to its
fixed position in the chip, whereas the VGLC has no such restriction. We must
also estimate the routing resources using [34] and further explore the use of the VGLC.
Acknowledgments
The present study was supported by the VLSI Design and
Education Center (VDEC) at the University of Tokyo in collaboration with
Synopsys, Inc., Calif, USA.