Design Automation of Co-Processors for Application Specific Instruction Set Processors

Seng Lin Shee

Outline

1. Introduction

2. Justification & Aims

3. Work done & Accomplishments

4. Current Research

5. Customized Architecture

6. Future work

ASIPs in General

• ASICs vs GPPs situation

• Power & Performance vs Design / Manufacturing Cost

• ASIPs are a hybrid of the two

• Main characteristic: highly configurable

• Consist of a base processor and optional components

• Today’s ASIPs are extensible

• Xtensa, Jazz, PEAS-III, ARCtangent, Nios, SP5-flex

Aim

• Automatically create coprocessors for critical loops

• Create coprocessors that are small in area, low in power, and fast

• Maximize parallelism

• Design the methodology to create the coprocessor

• Create estimation methods / ILP formulation

Related Work

• [Ernst1993] Hardware software cosynthesis for microcontrollers

– Standard processor is connected by the main memory bus to a co-processing ASIC/FPGA

– Disadvantage: produces only a small improvement; no parallelism involved; can also degrade performance

• [Stitt2003] Dynamic Hardware/Software Partitioning: A First Approach

– Hardware approach to profile program dynamically

– Synthesize onto FPGA; dynamic partitioning to extract appropriate loop

– Disadvantage: only small regions of code; single-cycle loop body; sequential addressing of the memory block; number of iterations must be predetermined

• CriticalBlue

– Provides complete methodology with toolset for converting functions to individual coprocessors on the Cascade platform

– Disadvantage: no parallelism between coprocessor and base processor; coprocessor is a separate component on the bus

Contributions

• Coprocessors are generally separate components from the main processor, connected via the main memory bus

• My contributions:

– Coprocessors can execute loops over multiple clock cycles

– Maximum parallelism

– No limit on loop size

– Minimized resource usage, reducing area

– Methodology to generate such a coprocessor

– Reduction in communication overhead

– Accurate prediction of the improvement of a code segment under a given constraint and architectural configuration

Project Tools

• Rapid Embedded Hardware/Software System Generation [Peddersen J., Shee S. L., Janapsatya A., Parameswaran S.], presented at the 18th IEEE/ACM International Conference on VLSI Design, January 2005

– Uses ASIPmeister to generate core then adapts the RTL to complete the processor

– Include and exclude any instructions

– Automatic generation of Application Specific Instruction Set

– Implements the Portable Instruction Set Architecture (PISA)

– Part of the SimpleScalar framework

– Support for extended instructions

– Contribution:

• A full SimpleScalar architecture (integer) processor core (synthesizable into SOC or FPGA for prototyping)

• A novel approach to generate a processor with various subsets of instructions

• A novel approach to generate a processor with various subsets of instructions

More Tools

• Modified SimpleScalar Toolset to support SYSCALL of SS CPU

– Take advantage of cache & memory features in SimpleScalar

– Matches clock cycle count of hardware version

– Provides memory dump support

• Loop detection software

– To detect most frequently occurring outermost loops.

– Refers back to the line numbers in the C source code.

– “Dynamic Characteristics of Loops”, [Kobayashi M. 1984]

• Memory dump file analyser

• Hot Function Detector

– Provides the statistics of how much time is spent in each function
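The loop detection idea can be illustrated with a minimal sketch. It assumes a simulator can emit a trace of taken branches, and it treats every taken backward branch as one iteration of the loop that starts at the branch target. The `branch_rec` and `loop_stat` types and both function names are hypothetical, not part of the tools listed here, and the sketch does not distinguish outermost loops from nested ones or map addresses back to C source lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One taken-branch record from a hypothetical simulator trace. */
typedef struct {
    uint32_t pc;      /* address of the branch instruction */
    uint32_t target;  /* address the branch jumped to */
} branch_rec;

/* One candidate loop, identified by the address of its first instruction. */
typedef struct {
    uint32_t head;        /* target of the backward branch */
    unsigned long count;  /* taken backward branches to this head */
} loop_stat;

/* Count taken backward branches per target address.
 * Returns the number of distinct loop heads stored (at most max_loops). */
size_t count_loops(const branch_rec *trace, size_t n,
                   loop_stat *loops, size_t max_loops)
{
    size_t used = 0;
    assert(trace != NULL && loops != NULL);
    for (size_t i = 0; i < n; i++) {
        if (trace[i].target > trace[i].pc)
            continue;                 /* forward branch: not a loop back-edge */
        size_t j = 0;
        while (j < used && loops[j].head != trace[i].target)
            j++;
        if (j == used) {
            if (used == max_loops)
                continue;             /* table full: ignore new loop heads */
            loops[used].head = trace[i].target;
            loops[used].count = 0;
            used++;
        }
        loops[j].count++;
    }
    return used;
}

/* Index of the most frequently executed loop, or -1 if none was found. */
long hottest_loop(const loop_stat *loops, size_t used)
{
    long best = -1;
    for (size_t i = 0; i < used; i++)
        if (best < 0 || loops[i].count > loops[best].count)
            best = (long)i;
    return best;
}
```

A linear search over loop heads keeps the sketch short; a real profiler would more likely use a hash table keyed by address.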

High Level Synthesis Approach

• Previous tools used: SUIF, MACHsuif (particularly for unrolling loops)

• Use SPARK for coprocessor creation (inner control)

– A C-to-VHDL high-level synthesis framework that employs a set of innovative compiler, parallelizing compiler, and synthesis transformations

– Takes behavioural ANSI-C code as input, schedules it using speculative code motions and loop transformations, runs an interconnect-minimizing resource binding pass, and generates a finite state machine for the scheduled design graph. A backend code generation pass outputs synthesizable register-transfer level (RTL) VHDL

• SPARK: A High-Level Synthesis Framework For Applying Parallelizing Compiler Transformations [Gupta2003]

How improvements are obtained

[Figure: execution timelines of the Load, Computation, and Store phases for three cases: single pipeline (1 iteration), unrolled, and my method]
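One way to reason about the gain from overlapping the three phases is a simple cycle-count model. The sketch below is an assumption about what the timelines convey, not the author's estimation method: the base processor performs the loads and stores, the coprocessor performs the computation, and in steady state the two run in parallel. The type `phase_cycles` and both function names are hypothetical.

```c
#include <assert.h>

/* Cycles spent in each phase of one loop iteration (hypothetical model). */
typedef struct {
    unsigned load;
    unsigned compute;
    unsigned store;
} phase_cycles;

/* Every iteration runs load, compute, store back to back. */
unsigned long sequential_cycles(phase_cycles p, unsigned long n)
{
    return n * ((unsigned long)p.load + p.compute + p.store);
}

/* Memory traffic (load + store) on the base processor overlaps with
 * computation on the coprocessor. The first load and the last
 * compute + store cannot be hidden, so they are added once. */
unsigned long overlapped_cycles(phase_cycles p, unsigned long n)
{
    unsigned long mem = (unsigned long)p.load + p.store;
    unsigned long steady = mem > p.compute ? mem : p.compute;
    if (n == 0)
        return 0;
    assert(steady > 0);
    return p.load + (n - 1) * steady + p.compute + p.store;
}
```

Under this model, overlap pays off most when computation and memory traffic take a similar number of cycles; if one dominates, the steady-state cost approaches that of the dominant phase alone.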

Integration

[Figure: block diagram showing the base processor (ID and WB stages, GPR), the Co Reg and Co Done blocks, and the coprocessor wrapper around the HLS internal coprocessor]

HLS Coprocessor Features

• Register file sharing

• A wrapper to control the execution of the inner coprocessor

• SCPR & BCPR instructions

• Disadvantages:

– Can only read from the destination register after the write-back stage; latency of a number of pipeline stages

– Very hard to build loops if the inputs need to be fetched every time

– A wrapper has to be written every time just to accommodate the SPARK-generated component

– Number of inputs / outputs = number of arguments

More details

• Detected loop hotspot in cjpeg program

• Created coprocessor using HLS Approach

• Simulated using ModelSim

• Synthesized using tcbn90gwc technology libraries through Synopsys Design Compiler

• Given a 10 ns clock constraint:

– 416.7 MHz; 6,199 µm²; 2,562 NAND gates

[Figure: chart "Loop Execution" comparing Original Program and Modified Program; axis: Clock Cycles, 0 to 4,500,000]

[Figure: chart "Program Execution" comparing Original Program and Modified Program; axis: Clock Cycles, 0 to 20,000,000]

Why HLS approach was used

• Used to unroll loops

• To find out how much parallelism can be obtained

• Parallelism is limited by how many register ports can be read at any one time

• Area usage & power of the register file increase linearly with the number of ports

• However, loop unrolling is only beneficial if the fetches / stores are done in parallel

• We need multiple resources, but we only have 1 base processor! Bottleneck!

• No need to fetch data at the last moment

GPR configuration    NAND gates
2 reads, 1 write     19,185
4 reads, 2 writes    27,813
5 reads, 3 writes    34,101
8 reads, 4 writes    42,432
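The claim that register file area grows linearly with the port count can be checked against the four data points above with a least-squares line. The sketch below assumes the relevant variable is the total number of ports (reads plus writes), which the slide does not state; `line_fit` and `fit_line` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double slope;      /* NAND gates per additional port */
    double intercept;  /* NAND gates at zero ports (extrapolated) */
} line_fit;

/* Ordinary least-squares fit of y = slope * x + intercept. */
line_fit fit_line(const double *x, const double *y, size_t n)
{
    double mx = 0.0, my = 0.0, sxx = 0.0, sxy = 0.0;
    line_fit f;
    assert(n >= 2);
    for (size_t i = 0; i < n; i++) {
        mx += x[i];
        my += y[i];
    }
    mx /= (double)n;
    my /= (double)n;
    for (size_t i = 0; i < n; i++) {
        double dx = x[i] - mx;
        sxx += dx * dx;
        sxy += dx * (y[i] - my);
    }
    assert(sxx > 0.0);  /* x values must not all be equal */
    f.slope = sxy / sxx;
    f.intercept = my - f.slope * mx;
    return f;
}

/* Data from the table above; ports = reads + writes (an assumption). */
const double gpr_ports[4] = {3.0, 6.0, 8.0, 12.0};
const double gpr_gates[4] = {19185.0, 27813.0, 34101.0, 42432.0};
```

Calling `fit_line(gpr_ports, gpr_gates, 4)` gives a slope of roughly 2,590 NAND gates per additional port with an extrapolated intercept of about 12,100 gates, so under this assumption each extra port costs a sizeable fraction of the smallest configuration.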

for (i = 0; i < 100; i++)
    g();

for (i = 0; i < 100; i += 2)
{
    g();
    g();
}
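The unrolled loop above works because the trip count of 100 is even. A minimal sketch of the general case, with an epilogue for odd trip counts, is shown here; the counter `g_calls` and the body of `g` are stand-ins added only so the two versions can be compared.

```c
#include <assert.h>

/* Stand-in for the loop body: it only counts how often it is called. */
unsigned long g_calls;

void g(void)
{
    g_calls++;
}

/* Original form: one call per iteration. */
void run_rolled(int n)
{
    for (int i = 0; i < n; i++)
        g();
}

/* Unrolled by a factor of 2, with an epilogue for odd trip counts. */
void run_unrolled2(int n)
{
    int i;
    assert(n >= 0);
    for (i = 0; i + 1 < n; i += 2) {
        g();
        g();
    }
    if (i < n)
        g();  /* leftover iteration when n is odd */
}
```

Both functions call `g` exactly `n` times; the unrolled version halves the number of loop-control branches, which only helps if the two calls in the body can actually proceed in parallel.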

Customized Architecture

• Highly integrated coprocessor architecture

• Something like a coprocessor but integrated within the base processor

• Makes full use of unused registers (r8–r15, r24–r25)

• All calculations in the loop (when possible) are done in the coprocessor

• Base processor just fetches the required data from memory and stores the result back to memory

• Coprocessor taps into signals to know when data is ready and when to start execution

• Assumptions:

– No multitasking

– No preemption, no interrupts

– Coprocessor does not stall the CPU; its execution time is known at creation time, so NOPs are used to wait

• Problems:

– Latency of the pipeline stages

– Not good for loops with short / simple computations
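Because the coprocessor never stalls the CPU and its execution time is fixed at creation time, the number of NOPs to insert can be computed statically. The sketch below assumes a single-issue base processor that retires one instruction per cycle with no other stalls; both function names are hypothetical.

```c
#include <assert.h>

/* NOPs needed so the base processor does not read a coprocessor result
 * too early. coproc_latency is the cycle count fixed at creation time;
 * useful_instrs is the number of independent base-processor instructions
 * that can be scheduled while the coprocessor is busy. */
unsigned nops_needed(unsigned coproc_latency, unsigned useful_instrs)
{
    return coproc_latency > useful_instrs
               ? coproc_latency - useful_instrs
               : 0u;
}

/* Length of the padded instruction window, in cycles. */
unsigned padded_length(unsigned coproc_latency, unsigned useful_instrs)
{
    unsigned total = useful_instrs
                     + nops_needed(coproc_latency, useful_instrs);
    assert(total >= coproc_latency);  /* result is never read too early */
    return total;
}
```

When the loop body is short, few useful instructions are available to fill the window and most of it becomes NOPs, which is consistent with the approach not suiting loops with short or simple computations.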

Advantages

• Save register usage

• Fetch data immediately when it is ready at WB stage

• Easy coprocessor task generation; basic block grouping

• Full control of instruction synthesis

• Maximize parallelism; address calculations are also performed

• Memory I/O task given to base processor

• No branch calculations

Customized Coprocessor Integration

[Figure: block diagram showing the base processor (ID and WB stages, GPR), the Co Reg and Co Done blocks, and the custom coprocessor]

Custom Coprocessor Creation Methodology

[Figure: flowchart of the custom coprocessor creation methodology. Boxes: application written in C / C++; initial application profiling; critical functions and loops; heuristic algorithm for selecting segments of code to be handled by COPR; preconfigured processor library (ASIPmeister libraries etc.); reduce instruction set in ASIPmeister; custom coprocessor architectural model; custom coprocessor generation methodology; list of processes & loop control; ICOP creation methodology; generate coprocessor segment in source code (assembly code); custom coprocessor integration methodology; CPU; set of coprocessors; program completed]

Verification Methodology

• C/C++ program is run through hardware simulation (ModelSim) and software simulation (SimpleScalar)

• Memory dump file and execution time produced by both simulations should be identical.

• Same method is applied for verification of ICOP architecture

• Sim-hexbin (program developed) is used to obtain output file from dump file for comparison purposes
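The comparison step can be sketched as a byte-by-byte check of two dump files. This is an illustration only: it assumes both simulators write their dumps in the same format, it does not reproduce Sim-hexbin, and it does not cover the execution-time comparison. The function name `first_difference` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Compare two open streams byte by byte.
 * Returns -1 if they are identical, otherwise the byte offset of the
 * first difference (a length mismatch counts as a difference). */
long first_difference(FILE *a, FILE *b)
{
    long offset = 0;
    assert(a != NULL && b != NULL);
    for (;;) {
        int ca = fgetc(a);
        int cb = fgetc(b);
        if (ca != cb)
            return offset;  /* differing byte, or one stream ended early */
        if (ca == EOF)
            return -1;      /* both streams ended together */
        offset++;
    }
}
```

Reporting the offset of the first mismatch, rather than a plain yes or no, makes it easier to locate the memory region where the hardware and software simulations diverge.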

Loop Identification

[Figure: chart "Critical Loops"; x-axis: Pixels (K pixels), 0 to 60; y-axis: Percentage, 0 to 20; series: jcphuff.c:643, jcdctmgr.c:232, jchuff.c:766, jchuff.c:684, jchuff.c:673, jfdctint.c:220, jfdctint.c:155]

How did we fare?

• Detected the same loop hotspot in the cjpeg program as before

• Created coprocessor using Custom Coprocessor Methodology

• Simulated using ModelSim

• Synthesized using tcbn90gwc technology libraries through Synopsys Design Compiler

• Given a 10 ns clock constraint:

– 166.9 MHz (1 GHz possible); 16,203 µm²; 6,698 NAND gates

• Has potential to occupy less area

[Figure: chart "Loop Execution" comparing Original Program, HLS Coprocessor, and Custom Coprocessor; axis: Clock Cycles (M cycles), 0.0 to 4.5]

[Figure: chart "Program Execution" comparing Original Program, HLS Coprocessor, and Custom Coprocessor; axis: Clock Cycles (M cycles), 17.0 to 22.0]

HLS vs Custom Coprocessor

[Figure: chart "Loop Energy Consumption"; y-axis: Energy (µJ), 0 to 180; bars: Base Processor, HLS Coprocessor, Custom Coprocessor]

[Figure: chart "Processor Size"; y-axis: NAND gates, 0 to 80,000; bars: Base Processor, HLS Coprocessor, Custom Coprocessor]

[Figure: chart "Coprocessor Size"; y-axis: NAND gates, 0 to 7,000; bars: HLS Coprocessor, Custom Coprocessor]

Memory Latency Effect

Loop Improvement (% improvement) versus memory access latency

Memory Access Latency   ICACHE   DCACHE   ICACHE & DCACHE
0/0                     159.58   159.58   159.58
18/2                    156.66   103.72   103.92
36/4                    156.92    69.25    69.07
54/6                    156.33    52.17    52.20
72/8                    156.50    42.11    42.03
90/10                   155.62    35.48    35.44
108/12                  155.72    30.78    30.76
126/14                  154.91    27.20    27.17

Memory Latency Effect

Overall Improvement (% improvement) versus memory access latency

Memory Access Latency   ICACHE   DCACHE   ICACHE & DCACHE
0/0                     13.97    13.97    13.97
18/2                     8.70    10.38     7.61
36/4                     6.44     7.39     5.07
54/6                     5.22     5.57     3.89
72/8                     4.44     4.37     3.21
90/10                    3.91     3.51     2.76
108/12                   3.52     2.95     2.46
126/14                   3.22     2.50     2.23

Input Data Behaviour

[Figure: chart "Critical Loops in CJPEG"; x-axis: Pixels, 0 to 1,200,000; y-axis: Percentage, 0 to 25; series: jcphuff.c:643, jcdctmgr.c:232, jchuff.c:766, jchuff.c:684, jchuff.c:673, jfdctint.c:220, jfdctint.c:155]

Future Work

• Formalize methodology

• More concrete model of coprocessor

• Model to predict performance improvement

• Be able to decide when the ICOP architecture is feasible

• Analyze performance improvements on a variety of benchmark applications

Thank you
