The Garp Architecture and C Compiler. Brought to you by Liao Jirong, liaojiro@comp.nus.edu.sg, http://www.comp.nus.edu.sg/~liaojiro. T.J. Callahan, J.R. Hauser & J. Wawrzynek, U.C. Berkeley

Dec 29, 2015


Abner Turner
The Garp Architecture and C Compiler

Brought to you by

Liao Jirong

liaojiro@comp.nus.edu.sg

http://www.comp.nus.edu.sg/~liaojiro

T.J. Callahan J.R. Hauser & J. Wawrzynek, U.C. Berkeley

Outline

- Background
- The Garp Architecture
- The Compiler for the Garp
- Simulation results
- Summary

Background
- Emergence of reconfigurable hardware (FPGAs, etc.)
- Impressive speedups for various tasks
    - DNA sequence matching, encryption, etc.
- Obstacles to be overcome
    - configuration time, size, floating-point operations, compatibility of the various implementations on the market
- Past work: PRISC, NAPA, PRISM, etc.
    - limited to specific application domains
    - compilation not fully automatic

The Big Picture

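[Figure: an application is split into a computation kernel and a non-computation-kernel part. The computation kernel goes through compilation/synthesis onto the coprocessor (FPGA, ASIC, etc.); the non-computation-kernel part is compiled for the processor (CPU). Processor and coprocessor are linked by communication.]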

Working Flow of the Execution of the Kernel

1. Load a configuration
2. Copy any initial register data to the coprocessor
3. Start execution on the coprocessor
4. Copy the result back to the processor

Steps 1, 2 and 4 are overhead.

Motivation
- Integrate reconfigurable hardware more closely with the processor
- Long reconfiguration times
- Low-bandwidth paths for data transfer
- Need for hardware design expertise

Assumptions
- A few cycles of overhead for transferring register data is acceptable
- The coprocessor needs its own direct path to the processor's memory system
    - impossible for the processor to do this
- The coprocessor needs to be rapidly reconfigurable

The Garp Architecture
- Single-issue MIPS processor core with reconfigurable hardware (coprocessor)
- Coprocessor is on the same die as the processor
- Coprocessor and processor share the same memory
- The reconfigurable hardware architecture and interfaces are designed
- Does not exist as real silicon (simulation only)

The Blueprint

The Garp Arch. (Cont.)
- For general-purpose applications
- Fits into an ordinary processing environment
- The main thread of control through a program is managed by the processor
1. A configuration can be loaded only when the coprocessor is idle
2. The coprocessor can work independently
3. Coprocessor execution can be halted or resumed
4. The processor cannot load a configuration or access the coprocessor while it is active
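
The four control rules above can be sketched as a small state machine. This is a hypothetical model in C, not Garp's actual interface: the type and function names (`garp_array`, `garp_load_config`, and so on) are invented for illustration, and treating a halted array as "not idle" for configuration loading is an assumption the slide does not settle.

```c
#include <stdbool.h>

/* Hypothetical model of the control rules listed above; all names are illustrative. */
typedef enum { GARP_IDLE, GARP_ACTIVE, GARP_HALTED } garp_state;

typedef struct {
    garp_state state;
    int config_id;              /* -1 means no configuration is loaded */
} garp_array;

void garp_init(garp_array *g) {
    g->state = GARP_IDLE;
    g->config_id = -1;
}

/* Rules 1 and 4: loading is refused unless the array is idle.
   Refusing while halted is an assumption; the slide does not say. */
bool garp_load_config(garp_array *g, int config_id) {
    if (g->state != GARP_IDLE) return false;
    g->config_id = config_id;
    return true;
}

/* Rule 4: the processor may touch array state only when the array is not running. */
bool garp_can_access(const garp_array *g) {
    return g->state != GARP_ACTIVE;
}

/* Rule 2: once started, the array runs without processor involvement. */
bool garp_start(garp_array *g) {
    if (g->state != GARP_IDLE || g->config_id < 0) return false;
    g->state = GARP_ACTIVE;
    return true;
}

/* Rule 3: execution can be halted and later resumed. */
bool garp_halt(garp_array *g) {
    if (g->state != GARP_ACTIVE) return false;
    g->state = GARP_HALTED;
    return true;
}

bool garp_resume(garp_array *g) {
    if (g->state != GARP_HALTED) return false;
    g->state = GARP_ACTIVE;
    return true;
}

/* Called when the kernel finishes; the array becomes idle again. */
void garp_finish(garp_array *g) {
    g->state = GARP_IDLE;
}
```

In this model a caller would load a configuration, start the array, and only read results back once `garp_can_access` returns true again.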

The reconfigurable hardware
- Two-dimensional array of blocks
    - number of rows is implementation-specific
    - upward-compatible fashion
- Interconnected by programmable wiring
- A fixed global clock - sequencer
- Configuration cache
- Memory buses
- Memory queues

Blocks
- Configurable Logic Block (CLB)
    - 2 bits wide
    - 16 CLBs in a row form a 32-bit data path
    - each takes up to four 2-bit inputs
    - (a<<10)|(b&c) can be implemented in one row
- Control blocks
    - one for each row, in the leftmost column
    - serve as a liaison
    - Boolean values for if-conversion, used in hyperblocks
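
To make the one-row claim concrete, the sketch below evaluates `(a<<10)|(b&c)` two bits at a time, the way 16 two-bit blocks would. It is an illustrative software model only: the helper names are invented, and the convention that block 0 holds the least significant bits is an assumption. Because the shift amount is a multiple of the block width, each block needs just three 2-bit inputs (one slice each of `a`, `b` and `c`), within the four-input limit.

```c
#include <stdint.h>

enum { GARP_BLOCKS_PER_ROW = 16, GARP_BITS_PER_BLOCK = 2 };

/* Extract the 2-bit slice handled by block i.
   Assumption: block 0 holds the least significant bits. */
static uint32_t slice2(uint32_t word, int i) {
    return (word >> (i * GARP_BITS_PER_BLOCK)) & 0x3u;
}

/* Evaluate (a << 10) | (b & c) one 2-bit block at a time. */
uint32_t row_shift_or_and(uint32_t a, uint32_t b, uint32_t c) {
    const int shift_blocks = 10 / GARP_BITS_PER_BLOCK;   /* a 10-bit shift is 5 blocks */
    uint32_t result = 0;
    for (int i = 0; i < GARP_BLOCKS_PER_ROW; i++) {
        /* The shift is pure wiring: block i reads the slice of a from block i - 5. */
        uint32_t a_in = (i >= shift_blocks) ? slice2(a, i - shift_blocks) : 0u;
        uint32_t out = a_in | (slice2(b, i) & slice2(c, i));
        result |= out << (i * GARP_BITS_PER_BLOCK);
    }
    return result;
}
```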

Wires
- Vertical wires connect blocks in the same column
- Horizontal wires connect blocks in the same or adjacent rows
- Built-in carry chain
    - support for addition, subtraction and comparison
- Multi-bit shifts across a row make multiplication and division by a constant fairly efficient
- The wire network is passive
    - a value cannot jump from one wire to another without passing through a logic block
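
The point about constant multiplication can be illustrated with the standard shift-and-add decomposition: shifts come from wiring and the additions use the carry chain. The code below is a generic C sketch of that technique, not output of the Garp compiler, and the function names are invented for illustration.

```c
#include <stdint.h>

/* x * 10 written as (x << 3) + (x << 1): two shifts and one addition. */
uint32_t times10(uint32_t x) {
    return (x << 3) + (x << 1);
}

/* General form: one shifted copy of x is added for every set bit of the constant k.
   Unsigned arithmetic wraps modulo 2^32, matching x * k on uint32_t. */
uint32_t mul_const_shift_add(uint32_t x, uint32_t k) {
    uint32_t acc = 0;
    for (int bit = 0; bit < 32; bit++) {
        if ((k >> bit) & 1u) {
            acc += x << bit;
        }
    }
    return acc;
}
```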

Memory tricks
- Configuration cache
    - holds recently displaced configurations
    - reloading from the cache requires only 5 cycles
    - can hold 4 full-sized configurations
- Wide path between coprocessor and memory
    - data transfer and configuration loading
- Memory bus
    - four 32-bit data buses and one 32-bit address bus
    - the coprocessor is master of the memory buses when active
    - can initiate one access every cycle
- Memory queues
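
A rough way to picture the configuration cache is a four-slot table with a 5-cycle hit cost. The sketch below is a simplified model under stated assumptions: the least-recently-used replacement policy, the `config_cache` type and the caller-supplied `miss_cycles` value are all hypothetical, since the slides give only the slot count and the reload time.

```c
enum { CACHE_SLOTS = 4, CACHE_HIT_CYCLES = 5 };   /* both figures are from the slide */

typedef struct {
    int id[CACHE_SLOTS];              /* configuration id per slot, -1 means empty */
    unsigned last_used[CACHE_SLOTS];  /* timestamp of the most recent use */
    unsigned tick;
} config_cache;

void cache_init(config_cache *c) {
    for (int i = 0; i < CACHE_SLOTS; i++) {
        c->id[i] = -1;
        c->last_used[i] = 0;
    }
    c->tick = 0;
}

/* Returns the cycles charged for activating config_id (must be >= 0).
   miss_cycles is a caller-supplied estimate of a load from memory.
   LRU replacement is an assumption, not a documented Garp policy. */
unsigned cache_load(config_cache *c, int config_id, unsigned miss_cycles) {
    int victim = 0;
    c->tick++;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->id[i] == config_id) {          /* hit: reload from the cache */
            c->last_used[i] = c->tick;
            return CACHE_HIT_CYCLES;
        }
        if (c->last_used[i] < c->last_used[victim]) {
            victim = i;                       /* remember the least recently used slot */
        }
    }
    c->id[victim] = config_id;                /* miss: replace the LRU slot */
    c->last_used[victim] = c->tick;
    return miss_cycles;
}
```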

Compare Garp with other arch.

VLIW
- Garp resembles VLIW
- Advantages over VLIW
    - does not have VLIW's per-cycle limits on instruction issue, functional units, or register file bandwidth
    - pipelining in Garp is more straightforward than software pipelining on VLIW: Garp has no competition for functional units
    - maintains high performance for sequential code on the processor
- Disadvantages relative to VLIW
    - kernel size limit
    - cannot exploit ILP outside of loops

Garp vs. Vector

Garp resembles a memory-to-memory vector processor when synthesizing a vectorizable loop.

Feedback loops can be constructed arbitrarily, whereas vector units can handle only very specialized recurrences.

Garp can easily handle data-dependent loop exits, which are a problem for vector architectures.
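
For readers unfamiliar with the two loop shapes mentioned here, the following C functions are illustrative examples chosen for this note; they do not come from the talk or from the Garp benchmarks. The first loop has a data-dependent exit (its trip count is unknown until the data is read), and the second carries a general recurrence through `h` that is not a plain sum or product.

```c
#include <stddef.h>
#include <stdint.h>

/* Data-dependent exit: the loop stops at the first zero byte,
   so the trip count is not known before the data is examined. */
size_t find_first_zero(const uint8_t *buf, size_t len) {
    size_t i = 0;
    while (i < len && buf[i] != 0) {
        i++;
    }
    return i;
}

/* General recurrence: each iteration rotates and updates the previous
   value of h, forming a feedback loop through the loop body. */
uint32_t rotate_add_hash(const uint8_t *buf, size_t len) {
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) {
        h = ((h << 5) ^ (h >> 27)) + buf[i];
    }
    return h;
}
```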

Garp vs. Superscalar

Because of its modest number of instruction issue slots, a superscalar processor cannot compete with the Garp coprocessor in cases with a large amount of ILP.

Any Questions About Garp?

For further details:

Garp: A MIPS Processor with a Reconfigurable Coprocessor

J.R. Hauser and J. Wawrzynek, IEEE FCCM 1997

Automatic Compilation

Standard ANSI C as input

SUIF C compiler for the front-end phase

parsing and standard optimizations

Fully automatic compilation

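Compilation Flow

[Figure: the application goes through kernel selection. Kernels go through optimization and synthesis, producing a bit-stream for the coprocessor. Non-kernel code goes through optimization, producing an executable file for the processor.]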

Kernel selection
- Loops
- The whole loop? -- NO
    - loop size: too large
    - contains some infrequently executed code
        -- longer load time
        -- longer interconnects
    - operations that cannot be implemented
- ILP limitation in basic block

Hyperblock
- Join all the basic blocks of a loop body by using predication (Boolean values)
- Increases ILP
- Precedence edges
    - array subscript analysis
    - inter-procedural pointer analysis
- Contains the loop back edges
    - avoids switching control back and forth
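
If-conversion, the step that merges a loop body's basic blocks into one hyperblock, can be shown on a small loop. The example below is a generic illustration written for this note, not code from the Garp compiler: the first function has a branch inside the loop body, and the second computes both outcomes and lets a Boolean predicate select one, which is the form that maps naturally onto a multiplexer in hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Original form: the loop body contains a branch, hence several basic blocks. */
int64_t sum_abs_branchy(const int32_t *v, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < 0) {
            sum -= v[i];
        } else {
            sum += v[i];
        }
    }
    return sum;
}

/* If-converted form: both sides are computed every iteration and a
   Boolean predicate selects the result, leaving a single block. */
int64_t sum_abs_predicated(const int32_t *v, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int p = v[i] < 0;              /* predicate */
        int64_t if_neg = sum - v[i];   /* result of the "then" side */
        int64_t if_pos = sum + v[i];   /* result of the "else" side */
        sum = p ? if_neg : if_pos;     /* select, like a multiplexer */
    }
    return sum;
}
```

Both functions return the same value; the second simply exposes the two sides as independent operations that can run in parallel.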

Hyperblock (Cont.)
- Reject loops whose speedup does not make up for the overhead
    - decided by profiling and execution time estimates
- Exceptional exit cases
    - execution continues on the processor
    - occur only a small fraction of the time
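
The accept-or-reject decision above amounts to comparing estimated cycle counts. The sketch below is a minimal model of such a test, assuming a per-entry overhead that lumps together configuration loading and register copies in both directions; the `loop_profile` fields and the function name are hypothetical and do not describe the Garp compiler's real cost model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical profile and estimate data for one candidate loop. */
typedef struct {
    uint64_t entries;             /* times the loop was entered (from profiling) */
    uint64_t iterations;          /* total iterations over all entries (from profiling) */
    uint64_t sw_cycles_per_iter;  /* estimated cost per iteration on the processor */
    uint64_t hw_cycles_per_iter;  /* estimated cost per iteration on the coprocessor */
    uint64_t overhead_per_entry;  /* config load + register copy in + result copy out */
} loop_profile;

/* Accept the loop only if the coprocessor version, overhead included,
   is estimated to take fewer cycles than the software version. */
bool loop_is_profitable(const loop_profile *p) {
    uint64_t sw_total = p->iterations * p->sw_cycles_per_iter;
    uint64_t hw_total = p->iterations * p->hw_cycles_per_iter
                      + p->entries * p->overhead_per_entry;
    return hw_total < sw_total;
}
```

With these made-up numbers, a loop entered 10 times for 10000 iterations in total (8 versus 2 cycles per iteration, 50 cycles of overhead per entry) is accepted, while one entered 1000 times for only 3000 iterations is rejected because the overhead dominates.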

Optimization Techs.
- Speculative loads
    - crucial for pipelining
- Pipelining
    - loop-carried dependencies
    - simultaneous memory accesses
- Memory queues
    - 3 memory queues
    - buffering and reading ahead, writing behind
    - non-cache-allocating

Configuration Synthesis
- Module mapping
    - maps groups of nodes in the DFG to compound modules in the configuration
    - minimizes the size and its critical path
- Placement
    - places connected modules close to one another
- Generating the bit-stream file

Simulation Results
- 32-row array
- Adapted Ultrasparc processor
- Cycle-accurate simulator
- Models cache misses and interlocks

Wavelet image compression

Gzip compression
- Gzip has irregular memory accesses
    - these reduce parallelism and prevent pipelining
- Each loop executes for only a few cycles
    - the overhead cost is more significant
- The overhead negates the benefit

Compilation time & Code expansion
- Compilation time
    - typically much less than double that of compiling for software only
- Code size
    - typically increases by 10 to 50 percent
    - wavelet benchmark: 16 percent

Garp vs. Ultrasparc
- Ultrasparc
    - a four-way superscalar, 167 MHz
- Garp
    - implemented using the same VLSI process, 133 MHz
- Wavelet
    - Garp is 68% faster than Ultrasparc
- Gzip
    - Ultrasparc is 14% faster than Garp

Garp vs. Ultrasparc (Cont.)
- Hand-coded functions
- Garp has great potential

Future
- More experiments over a broader range of benchmarks
- Development of new optimizations
- Find out the strengths and weaknesses of the Garp architecture

Summary
- The Garp Architecture
    - processor + coprocessor
    - configuration cache
    - memory queues
    - high-bandwidth, low-latency data access
- Synthesis Compiler for Garp

The End

Thank you!

Any feedback will be appreciated.

liaojiro@comp.nus.edu.sg

http://www.comp.nus.edu.sg/~liaojiro