TCE project: Co-design of application-specific processors with LLVM-based compilation support

TTA-based Codesign Environment (TCE) is an application-specific instruction-set processor (ASIP) design toolset developed in Tampere University of Technology in several research projects since 2003. This blog post introduces the project and how LLVM is used in it to provide high-level language compiler support for the designed ASIPs.

The use case for application-specific processors

Especially in embedded devices so called general purpose "off-the-shelf processors" are often not optimal for the application at hand. The readily available processors might be too large in chip area, might consume too much power, might be too expensive (for a mass product) or are not running the set of programs fast enough. In order to tackle the performance problem, a common design choice is to run parts of the application in software while speeding up the performance critical functions with a custom hand-tailored hardware accelerator or a co-processor implemented as an application-specific integrated circuit (ASIC).

The design, implementation and verification time of the hardware accelerators costs money and lengthens the time-to-market for the designed device. In addition, a fixed function hardware accelerator designed with an hardware description language (HDL) such as VHDL or Verilog has the problem of being "carved in stone", thus not providing programmability that enables late bug fixes and on-the-field updates to the supported set of functions.

Field-Programmable Gate Arrays (FPGAs) allow reconfiguring the implemented hardware logic on the field. However, as FPGA is only an hardware design implementation technique, the "non-recurring engineering cost" of designing the hardware logic in an HDL is still there.

Application-specific processors can be spotted in the "design space" between off-the-shelf processors, such as ARM products or Texas Instruments DSPs where functionality is described fully in software by the designer, and custom fixed function hardware accelerators where functionality is described fully in hardware description language by the designer. In case of ASIPs, the engineer is able to design both software and hardware at the same time (co-design) and is free to the choose the level of application specialization applied to the processor.

What can be customized in an ASIP depends on the used processor template. A commonly customized part of the processor is the instruction set. Customizable instruction set allows the designer to define new application specific instructions for the processor to implement the desired functionality faster than a set of basic software operations such as additions or shifts can. Examples of such special instructions include complex arithmetic, non-standard floating point arithmetic, application-specific precision fixed point arithmetic, more than two input adders, etc.

TCE places few restrictions on the types of custom instructions that can be added to the processor design. For example, there are no limits to the number of input operands or produced results nor the number of clock cycles the operation can execute. In addition to the custom operations, the number and size of the register files (RF), the number and type of the functional units (FU) and the connectivity between the RFs and FUs can be freely customized.

About Transport Triggered Architectures

TCE is based on a simple but scalable architecture template called Transport Triggered Architecture (TTA, not to be confused with "Time Triggered Architecture"). TTA can be described as an exposed datapath VLIW architecture. The "exposed datapath" part means that the data transports that take place between the functional units (e.g. arithmetic logic unit or a multiplier) and the register files are explicitly visible to the programmer. In other words, while processors are commonly programmed by defining which operations to execute (including information of the sources of the operands and the destinations of the results), TTA is programmed by defining the transports of the operands and results. The name TTA comes from the way operations are executed: when operand data is moved to the triggering port of the functional unit, the operation starts executing. After a fixed latency (from the architecture point of view) the results can be read from the output ports of the functional unit to the next destination.

The programming model can be illustrated more easily with an assembly code snippet example.

1) Traditional "operation triggered" (first parameter is the destination)

ADD R1, R2, R3
MUL R4, R1, R5

2) Transport triggered

R2 -> ADD.OPERAND, R3 -> ADD.TRIGGER
ADD.RESULT -> R1
R1 -> MUL.OPERAND, R5 -> MUL.TRIGGER
MUL.RESULT -> R4

Transport programming enables some specific software optimizations such as software (register) bypassing which in turn can enable "dead result read elimination", that can be applied in case all result reads can be transported directly to the destination functional units. This can lead to reduced register (file) pressure. An example follows:

3. Transport triggered with software bypassing and dead result read elimination

R2 -> ADD.OPERAND, R3 -> ADD.TRIGGER
ADD.RESULT -> MUL.OPERAND, R5 -> MUL.TRIGGER
MUL.RESULT -> R4

Here the result from the adder is copied directly to the operand input port of the multiplier, thus removing the need for the general purpose register R1 used as a temporary storage. In addition to reducing register pressure, the freedom to schedule the movement of operand/result data in multiple cycles reduces the register file port pressure, which is one of the TTA's main motivations. In the traditional VLIW the number of RF ports needs to be scaled according to the number of connected functional units and their worst case RF port requirements (maximum simultaneous operand reads and result writes), leading to more complex RFs with higher delay and area.

Toolset assisted processor design with TCE

Designing new processors from the scratch is not a straightforward task. One needs to take care of the design, verification and porting a high level language programming toolchain for each of the processors so the programmers are happy (writing peculiar assembler syntax for a changing target would get quite depressing quickly!). Thus, the design process should be automated as fully as possible to make experimenting with different processor architecture alternatives feasible.

The ultimate goal for an ASIP design toolset is to be as easy to use as taking a high-level language program as input and producing as a result an optimal processor implementation in VHDL or Verilog. It should parallelize the program for the processor's resources efficiently while exploiting custom instructions intelligently without any user intervention. In our experience, this type of fully automated "design space exploration" tends not to produce good enough results as the codesign process is often something that a human can do more efficiently. For example, sometimes the software needs to be refactored to a form that can exploit instruction level parallelism better. Sometimes it can be hard, or even impossible, for a software algorithm to realize that a complex-looking loop can be replaced with a simple single cycle custom instruction if implemented in hardware, and so on. Thus, we see that the realistic use case for an ASIP design toolset is to assist in the design task as much as possible while still leaving leeway for the engineer to exploit their knowledge in the field of algorithms or hardware design. This way, in case the engineer is skilled enough and the toolset assisting the ASIP design task is flexible enough, the processor design can eventually reach the performance of a fixed function hardware accelerator, while the design process can also be stopped at any point when the result is good enough.

TCE is at a relatively mature state, providing graphical tools to design the TTA architectures, architecture description driven instruction set simulators, a retargetable compiler, and a processor implementation generator supporting VHDL output. Because TCE uses TTA, a static ILP architecture, as its processor template, the efficiency of the end result is highly dependent on an efficient compiler. The compiler has been our main focus in recent years and will most likely be in the future also.

LLVM in TCE

We were introduced to the LLVM project at about 2006. Until that point we used an old gcc v2.7.0 compiler ported from MOVE, the toolset preceeding TCE. It goes without saying that maintaining such an ancient piece of gcc code was quite a challenge and we actively tried to look for something easier to work with.

In addition to a clean C++-based code base, one of the main things that lured us towards LLVM from a purely gcc-based compiler were the interprocedural optimizations. Interprocedural optimizations are very useful for us as we work with standalone (no operating system with a runtime linker assumed) fully linked programs. In fact, our compiler toolchain does not currently include any linker at all but inputs fully linked LLVM bitcodes to its code generation phase. This means that most programs benefit from the LLVM global optimizations such as aggressive inlining and dead code elimination as the externally visible interface in the compiled programs can be limited merely to the startup functions.

After poking around and doing some experimental TTA code generation prototyping with LLVM (I think LLVM was at version 1.7 or so at that point) we noted that LLVM was getting more and more traction and started to look for a proper way for using LLVM for parts of the TCE code generation process.

As described previously, our target, on top of being customizable, is transport triggered, thus nothing like any other architecture supported by LLVM. This caused some trial-and-error coding efforts while figuring out the best way of taking advantage of the existing code base of LLVM while still supporting instruction scheduling that exploits the special trickery and the extra scheduling freedom enabled by TTA. Another requirement, the automated retargeting of the LLVM backend for the resources of different designed processors caused some long work days to get functioning robustly.

In the end, we came up with a plugin based backend approach. TCE code generation chain now includes a tool that generates LLVM backends from our XML-format architecture description files, compiles them with a C++ compiler to a dynamic library, and loads the generated LLVM backend on the fly during the code generation. At this point each TTA is modeled as a simple operation triggered architecture so the supported instruction set and registers can be described in the TableGen format. This way we could use the LLVM instruction selection and register allocation code, but the resulting sequential code is not yet something we can execute in a TTA. To finalize the code generation, after LLVM register allocation we convert the MachineInstructions to another internal representation (CFG+DDG with moves as the graph nodes) and do the rest of the code generation and TTA-specific optimizations on the TCE side.

The main missing piece for us in LLVM code generation framework is a VLIW-type instruction scheduler. In case of TTA which is a statically scheduled architecture with programmer visible operation latencies, the compiler instruction scheduling is not merely an optimization that can be optionally executed to get more performance. It's fully up to the compiler to schedule the operations in such a way that the operations are not started too early (in case a previous operation is executing) nor the results read too soon (in case the operation has not yet finished). Thus, in our case, the instruction scheduling is actually mandatory to get correct results. Also, in order to exploit the instruction level parallelism in an architecture like TTA or VLIW, one needs a way to bundle multiple instructions (in our case data transport moves) in a single wide instruction (or "a cycle") to be executed in parallel. Finally, a way to model the processor resources (with resource tables or similar) during scheduling is needed to produce correct code in the absence of structural hazard detection.

It would be really nice to have such a VLIW-style scheduling framework in LLVM so we could move more of our code generation to LLVM-side. Unfortunately due to a lack of time, we have not yet started to work on this as our existing instruction scheduler does the job well enough when starting from the "sequential RISC-like operation triggered" output from the LLVM code generation.

In general, LLVM and it core developers have been a pleasure to work with and we hope to be able to contribute to the LLVM project more in the future. Keep up the good work!

Future work and final words

Currently we are looking into GPGPU-style workload compilation issues. We are experimenting with OpenCL to describe the applications for easier extraction of parallelism while still providing clean support for calling custom operations from the kernel code. There is also work ongoing to extend TCE to better support task level parallelism with multicore ASIP generation and compiler assisted multithreading.

In case you got interested in the project or have questions to ask, please join the mailing list tce-users@cs.tut.fi.

I hope to see you there!

--Pekka Jääskeläinen,
a researcher that has been working in the TCE project from the start.