Ph.D. Dissertation by Timothy John Callahan.
Abstract:
Microprocessors coupled with dynamically reconfigurable hardware offer the potential of performance approaching that of an application-specific integrated circuit, but with the reprogrammability and mass production cost benefits of a microprocessor. My research group has designed such a processor, called Garp. Throughout program execution Garp's coprocessor is reconfigured as necessary at the entrance of each coprocessor-accelerated loop.
This thesis describes garpcc, a fully operational prototype C compiler targeting the Garp architecture, automatically utilizing the reconfigurable coprocessor when beneficial with no need for guidance from the programmer.
Garpcc uses the hyperblock framework to combine a candidate loop's commonly executed paths using predicated and speculative execution to expose instruction level parallelism (ILP). The hyperblock also allows the exclusion of rarely taken paths, leading to many benefits: coprocessor-infeasible operations on excluded paths do not interfere with acceleration of the rest of the loop; the remaining kernel is smaller allowing it to fit onto the coprocessor's reconfigurable resources when it would not otherwise; and excluding rare paths can improve the performance of remaining paths by removing long dependence chains and improving optimization opportunities.
The hardware synthesis phase of garpcc uses a fully-spatial approach where each operation is directly instantiated in the datapath configuration. The spatial approach allows merging of operations into optimized modules. The module-mapping step is timing-sensitive in that it optimizes for the critical path; it also simultaneously performs relative placement of the resulting modules. Finally, the spatial datapath is easily pipelined, further increasing the ILP.
The automatic compilation path combined with a cycle-accurate simulation of a realistic implementation of the Garp chip allows quantitative evaluation of the current Garp/garpcc system across a number of large benchmarks. Identification of weaknesses is mainly confined to garpcc, since until these are addressed, it is premature to fault Garp or its approach to reconfigurable computing in general.
Garpcc is shown to create pipelined datapaths capable of exploiting large amounts of ILP. But in many cases large peak ILP does not translate to significant net speedup because with short-running loops the pipeline does not completely fill, and also the overhead of using the coprocessor is more significant. Many improvements to the compilation path are suggested.
callahan_thesis.pdf
.