BRASS Research Group


Automatic C Compilation to SW + Reconfigurable HW



You could previously run a small example through our online Garp compilation demo; unfortunately, the demo has since been taken offline.

Tim Callahan's Ph.D. thesis on a reconfigurable architecture compiler

In this project we have created a complete, functioning compiler that automatically extracts compute-intensive loops from ANSI C programs for acceleration on a tightly coupled, dynamically reconfigurable coprocessor (see related work). Our current concrete target is the Garp chip (Garp diagram), but the compiler is retargetable to similar architectures that can exploit spatial computing to accelerate loops.
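
For example, consider a function like the following (an illustrative sketch, not code from our benchmark set). The inner loop is the kind of compute-intensive kernel the compiler extracts and maps onto the reconfigurable array, while the rest of the program continues to run unmodified on the main processor.

    /* Illustrative only: the loop below is a candidate kernel. */
    unsigned checksum(const unsigned *buf, int n)
    {
        unsigned sum = 0;
        int i;
        for (i = 0; i < n; i++)            /* candidate kernel loop        */
            sum = (sum << 1) ^ buf[i];     /* shift and XOR each data word */
        return sum;
    }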

The User Just Writes C

Our goal was to make it easy for the typical software programmer to use the Garp chip; thus, our compiler shoulders the burden of every task related to using the reconfigurable hardware: selecting the parts of the program (the kernels) that will run on it, designing the optimized hardware circuit that accelerates each kernel, and orchestrating the interaction between the reconfigurable hardware and the software on the main processor. The programmer therefore does not even have to know that the reconfigurable coprocessor is there. Furthermore, existing programs simply need to be recompiled to reap performance benefits from the Garp chip's reconfigurable coprocessor. Of course, expert human designers will be able to create higher-quality hardware configurations than the compiler can; we see automatic compilation as a complement to human-designed libraries of common and domain-specific functions.

We started with the assumption that, for ease of use, compilation to this combined software/hardware platform should be as fast as typical C (software) compilation. This required the development of new algorithms for generating the hardware configurations, since traditional algorithms for generating hardware can be very time-consuming.

 

Performance

One of the main performance advantages of the reconfigurable hardware is that it can support a large number of concurrent operations: unlike a microprocessor, it has no instruction-issue bandwidth limitation. However, the amount of instruction-level parallelism (ILP) within a single basic block is typically very limited. To extract greater degrees of ILP, our compiler looks across basic block boundaries by borrowing compilation techniques previously developed for VLIW (very long instruction word) processors. These techniques include hyperblock scheduling using predicated execution, software pipelining, and utilization of speculative memory loads. The compiler can also automatically utilize Garp's "memory queues", which allow streaming access to data.
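
As a hedged illustration of the last point (hypothetical code, not taken from the benchmark set), a loop such as the one below reads and writes its arrays in strict sequential order, which is the access pattern the memory queues are intended to serve:

    /* Hypothetical example: both arrays are accessed in strictly
     * sequential (streaming) order, so individual loads and stores
     * can be replaced by queued streaming accesses. */
    void scale_add(int *dst, const int *src, int k, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = src[i] * k + dst[i];   /* src and dst are streamed */
    }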

The hyperblock approach allows the compiler to exclude infrequently executed paths from the hardware kernels. This results in smaller, faster hardware implementations. It also allows the compiler to extract and accelerate many more kernels than would otherwise be possible.
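
For example (a hand-written sketch, not from the benchmark set), in the loop below the rarely taken error path could be excluded from the hyperblock: the common path forms the hardware kernel, and taking the rare path simply exits the kernel back to software.

    /* Hypothetical software fallback for the rare case. */
    static long handle_error(int i)
    {
        return -(long)(i + 1);
    }

    long sum_samples(const int *s, int n)
    {
        long total = 0;
        int i;
        for (i = 0; i < n; i++) {
            if (s[i] == -1)              /* rare sentinel: path excluded from kernel */
                return handle_error(i);  /* exit the kernel, continue in software    */
            total += s[i];               /* common path: implemented in hardware     */
        }
        return total;
    }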

Each hyperblock is translated to a dataflow graph (DFG). Two important tasks are performed in building the DFG: (i) Control dependence within the kernel is converted to data dependence: conditional branches within the loop are eliminated through the introduction of predicates. The only remaining conditional branches are exits out of the kernel. (ii) Data producer / consumer relationships are made explicit via edges in the graph; this is similar to static single assignment (SSA) form and derives similar benefits. Many classic optimizations are then reapplied to the DFG. There are typically additional opportunities because value definitions on paths excluded from the hyperblock no longer interfere with optimization. Also, the SSA-like form of the DFG makes the implementation of these global optimizations simpler.
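
The if-conversion step can be pictured at the source level as follows (a hand-written sketch; the compiler actually performs this on its internal DFG, not on C text). The branch in the first version becomes a predicate driving a select in the second, so both sides can be evaluated concurrently in hardware.

    /* Before: a conditional branch inside the kernel body. */
    int clip_branch(int x, int limit)
    {
        int y;
        if (x > limit)
            y = x - limit;
        else
            y = 0;
        return y;
    }

    /* After if-conversion: the branch becomes a predicate and a select;
     * both candidate values can be computed concurrently. */
    int clip_predicated(int x, int limit)
    {
        int p = (x > limit);             /* predicate replaces the branch  */
        int y = p ? (x - limit) : 0;     /* select driven by the predicate */
        return y;
    }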

Finally, our hardware synthesis tool uses a grammar-based linear time tree covering algorithm to identify groups of operations that can be merged. Each group is then fed to a module generator that, given the group of operations, constant input values, and other input sources, generates the detailed configuration for the optimized module. Essentially, embedded in the grammar and the module generator is a large amount of device-specific human designer expertise. This approach reduces area and critical path delay compared to unoptimized module-based approaches, but is much faster than approaches based on gate- and CLB-level optimization.
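
The flavor of the tree-covering step can be conveyed with a toy sketch (this is not the actual synthesis code; the real grammar, costs, and module generators are far richer). A bottom-up walk over a tiny DFG checks whether a fused multi-operation pattern, here ADD(SHL(x, const), y), matches at each node; when it does, the whole group is handed off as a single module.

    #include <stdio.h>

    enum op { OP_INPUT, OP_CONST, OP_SHL, OP_ADD };

    struct node {
        enum op op;
        int value;                      /* used only by OP_CONST          */
        struct node *left, *right;      /* operand subtrees (may be NULL) */
        const char *module;             /* label assigned by the coverer  */
    };

    /* Greedy bottom-up cover: prefer the fused shift-add pattern
     * ADD(SHL(x, const), y); otherwise fall back to single-op modules. */
    static void cover(struct node *n)
    {
        if (n == NULL)
            return;
        cover(n->left);
        cover(n->right);

        if (n->op == OP_ADD && n->left != NULL && n->left->op == OP_SHL &&
            n->left->right != NULL && n->left->right->op == OP_CONST) {
            n->module = "fused shift-add module";
            n->left->module = "(absorbed into parent module)";
        } else if (n->op == OP_ADD) {
            n->module = "adder module";
        } else if (n->op == OP_SHL) {
            n->module = "shifter module";
        } else {
            n->module = "input/constant";
        }
    }

    int main(void)
    {
        /* The expression (a << 2) + b as a tiny dataflow tree. */
        struct node a   = { OP_INPUT, 0, NULL, NULL, NULL };
        struct node b   = { OP_INPUT, 0, NULL, NULL, NULL };
        struct node two = { OP_CONST, 2, NULL, NULL, NULL };
        struct node shl = { OP_SHL,   0, &a,   &two, NULL };
        struct node add = { OP_ADD,   0, &shl, &b,   NULL };

        cover(&add);
        printf("root node covered by:  %s\n", add.module);
        printf("shift node covered by: %s\n", shl.module);
        return 0;
    }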

 

Target Environment

The Garp chip was designed from the start for use in a general purpose multi-user environment. It features protected OS kernel mode, inter-process protection, and virtual memory support. Our primary goal is effective compilation to the Garp chip in this typical desktop general-purpose environment.

But realistically, the compilation techniques developed here are most applicable in the short term to CAD tools targeting systems on a chip (SoC) for embedded applications. Specifically, this compiler is a perfect match for a new breed of embedded compute engines known as "application-specific programmable products" (ASPPs). In addition to a microprocessor core, these contain reconfigurable logic rather than hardwired ASIC cores (though they may also include some generally useful domain-specific hard cores). The reconfigurability allows on-the-fly flexibility, quicker time to market, and lower costs compared to ASIC-based SoC solutions.

The Garp compiler currently supports two types of IP integration.

 

Status

The construction of the compiler is complete for the moment; all of SPECint95 compiles and executes without error. More recent additions:
- Pipelining after module mapping (gives better performance than the previous approach of pipelining before module mapping). Rau's Iterative Modulo Scheduling algorithm is used to resolve conflicts between different modules using the memory port.
- Improved redundant load elimination
- Sharing of a single reconfigurable array row by two unrelated modules, when one is Boolean and one is 32-bit data.
- Kernel-invariant computation elimination (see the sketch after this list)
- Global "address taken" analysis (helps a lot with large benchmarks for which pointer analysis is not feasible).
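
As a hedged illustration of kernel-invariant computation elimination (a hand-written example; the compiler performs the transformation on its DFG, not on C source), the product base * stride below does not change inside the loop, so it need not occupy array resources in the kernel. The effect is as if the source had been rewritten to compute it once, before the kernel starts.

    /* Before: base*stride is conceptually recomputed every iteration. */
    int sum_strided(const int *a, int base, int stride, int n)
    {
        int i, sum = 0;
        for (i = 0; i < n; i++)
            sum += a[base * stride + i];   /* base*stride is kernel-invariant */
        return sum;
    }

    /* Effect of the optimization: the invariant product is computed once,
     * outside the kernel, and consumes no resources inside it. */
    int sum_strided_opt(const int *a, int base, int stride, int n)
    {
        int i, sum = 0;
        int off = base * stride;           /* hoisted out of the kernel */
        for (i = 0; i < n; i++)
            sum += a[off + i];
        return sum;
    }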

For more information, you can see the article "The Garp Architecture and C Compiler" in IEEE Computer, April 2000, or a slightly more detailed but outdated description; read the related paper Instruction Level Parallelism for Reconfigurable Computing; see the slides from the demo at FCCM'99 (.ps.gz, 66kB; .ps.Z, 132kB; .ps, 498kB); and/or contact Tim Callahan, tim.callahan at ieee.org.

For the most detail, the full thesis is also now available.

 

Future

I believe the eventual widespread use of reconfigurable architectures will require feedback-directed optimization, likely in the context of binary translation; this sidesteps problems such as binary compatibility. Fortunately, all steps in my compilation path were designed for speed, so both online and offline techniques are feasible.

 

