Rapid Datapath-Oriented FPGA Mapping
Our group is investigating the use of
reconfigurable coprocessors based on FPGA (field-programmable
gate array) technology to accelerate computational tasks
(see the page describing the Garp Chip).
The computation kernels to be executed on the FPGA coprocessor
are either automatically extracted from a high-level language
such as C (see Automatic C Compilation for Garp)
or are manually specified using a hardware description language
or schematic capture.
One serious problem when synthesizing logic for an FPGA
coprocessor is the time required to partition, place,
and route the part of the computation performed on FPGA resources.
Vendors' tools are optimized to handle random logic and can take hours to
complete. Notably, they do not take proper advantage of the structure
inherent in regular datapath operations. Instead, they take one of
two approaches:
-
Flatten the entire datapath to primitive gates, then perform
logic optimization and technology mapping. Often the
regularity of the input is not exploited, and thus the same
optimization is repeated for each bit slice of the datapath.
The subsequent place and route task often initially randomizes
the layout, losing any remaining regularity.
Not surprisingly, this approach is very expensive
computationally.
-
Map each individual operation to a predefined "hard macro" function
unit. This is quick and preserves regularity that simplifies layout
and routing, but with no optimization performed across
macro boundaries, computation resources are often underutilized [1].
Our approach, based on the same linear-time tree covering algorithm
used by instruction selection in compilers, merges multiple operations
into optimized modules while preserving the preferred regular bit-slice
layout of the resulting datapath. This approach is very fast;
a typical kernel extracted during C compilation, approximately
50k-100k gate equivalents, is synthesized in a second or less.
Because of the importance of routing delays in FPGAs, the algorithms
perform module selection and placement
in one integrated step.
Other characteristics of the Garp chip's reconfigurable array
add interesting twists to the mapping problem.
For example, the algorithms must consider the clock period to be fixed,
since the reconfigurable array is clocked synchronously with the
microprocessor core. Also, general memory accesses can be performed
directly from the gate array.
This research is described in more detail in a
paper that appeared at FPGA'98.
References
[1] ``Module Compaction in FPGA-based Regular Datapaths'',
Andreas Koch, 33rd Design Automation Conference, 6/96,
Las Vegas, NV, USA
For more information, contact Tim Callahan,
timothyc at cs.berkeley.edu.
RANDOM DETAILS:
Ideally, each intermediate mapping step would retain all
points that are not "clearly suboptimal" (equal or worse in both delay
and area than another equivalent mapping point).
Then each production would consider
all combinations of points of its children. But because of the
complex interactions of mapping and placement, it is not clear
if even this approach could claim an exact solution.
So for simplicity and speed,
only one "best" solution is kept for each unique grammar point---either
best delay, with area as a tie-breaker, or vice versa.
[BRASS Home]
[Projects]
[Class]
[Documents]
[People]
[Contact]
[Sponsors]
[Links]