Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems

Zhiyi Yu, Bevan M. Baas

VLSI Computation Lab, ECE Department, UC Davis

Outline

• Motivation
• Experimental platform: a GALS array processor
• Performance analysis of GALS multiprocessors
• Power analysis of GALS multiprocessors

Why GALS Chip Multiprocessor

• Why GALS clocking style
  – The challenge of globally synchronous systems
  – The challenge of totally asynchronous systems
  – GALS is a good compromise
• Why chip multiprocessor
  – The challenge of increasing clock frequency
  – High performance and high energy efficiency of multiprocessor systems

GALS Effects

• Performance penalty due to additional synchronization delay
• High energy efficiency due to independent clock/voltage scaling

[Figure: a simple GALS system. Module 1 (clk1) feeds module 2 (clk2) through a synchronization circuit; a timing diagram of clk1 and clk2 marks the total synchronization delay.]

Sync. delay = clock edge alignment (te) + sync. circuit (ts)

Synchronous vs. GALS comparisons

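[Figure: synchronous vs. GALS (multiple clock domains) designs, shown for a uniprocessor (ALU, MULT, and MEM blocks) and for an array processor. Previous work compared the two styles for uniprocessors; this work compares them for array processors.]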

Outline

• Motivation

• Experimental platform: a GALS array processor

• Performance analysis of GALS multiprocessor

• Power analysis of GALS multiprocessor

Our Implementation of a GALS Array and Synchronous Array

• Contains multiple identical simple processors
• Local oscillator and dual-clock FIFOs are the key components for the GALS style

[Figure: 3x3 array processor, with two single-processor block diagrams. Single processor in a synchronous array: IMEM, DMEM, ALU, MAC, control, config, FIFO0, and FIFO1. Single processor in a GALS array: the same blocks plus a local oscillator (osc.), with FIFO0 and FIFO1 replaced by enhanced dual-clock FIFOs.]

Micrograph of the 6 x 6 GALS Array

Technology: 0.18 µm CMOS

Max speed: 475 MHz

Area (1 Proc): 0.66 mm²

Fully functional

[Figure: chip micrograph with a single processor outlined; dimension labels 5.65 mm and 5.68 mm.]

[Yu, ISSCC06]

Outline

• Motivation

• Experimental platform: a GALS array processor

• Performance analysis of GALS multiprocessor

• Power analysis of GALS multiprocessor

Performance Penalty of GALS Uni-processor

• Extra pipeline hazards result in ~10% throughput penalty compared with synchronous uni-processor

[Iyer, ISCA02; Semeraro, HPCA02; Talpes, ISLPED03]

[Figure: pipeline timing (IF, ID, EXE, MEM, WB) of a branch instruction and the instruction that follows it.]

• Synchronous uni-processor: branch penalty of 3 cycles
• GALS uni-processor (stages clocked by clk1 to clk5, with a SYNC delay between adjacent stages): branch penalty of 3 cycles and 4 SYNC delays

Application Performance of GALS and Synchronous Array

• GALS array has only ~1% performance penalty
• Simulation conditions: 32-word FIFO, same clock frequency, 2 synchronization registers for GALS

Clock cycles for several applications

Application    Sync. array   GALS array   GALS perf. reduction (%)
8-pt DCT       41            41           0
8×8 DCT        498           505          1.4
Zig-zag        168           168          0
Merge sort     254           254          0
Bubble sort    444           444          0
Matrix         817.5         819          0.1
64 FFT         11439         11710        2.3
JPEG           1439          1443         0.3
802.11a/g      87857         88989        1.3
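The reduction figures can be reproduced from the cycle counts above. The sketch below assumes that performance is inversely proportional to cycle count (both arrays run at the same clock frequency), so the reduction is 1 - sync/GALS. That definition is inferred from the numbers rather than stated on the slide, and the names `CYCLES` and `reduction_percent` are illustrative. Under this definition the matrix entry comes out nearer 0.2% than the listed 0.1%.

```python
# Clock-cycle counts copied from the slide: (synchronous array, GALS array).
CYCLES = {
    "8-pt DCT": (41, 41),
    "8x8 DCT": (498, 505),
    "zig-zag": (168, 168),
    "merge sort": (254, 254),
    "bubble sort": (444, 444),
    "matrix": (817.5, 819),
    "64 FFT": (11439, 11710),
    "JPEG": (1439, 1443),
    "802.11a/g": (87857, 88989),
}


def reduction_percent(sync_cycles, gals_cycles):
    """Performance reduction of GALS relative to synchronous, in percent.

    Assumption: performance is proportional to 1 / cycle count, so the
    performance ratio GALS/synchronous equals sync_cycles / gals_cycles.
    """
    return (1.0 - sync_cycles / gals_cycles) * 100.0


if __name__ == "__main__":
    for name, (sync, gals) in CYCLES.items():
        print(f"{name:12s} {reduction_percent(sync, gals):.1f}%")
```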

The Source of Performance Penalty of GALS Array Processor

• For all systems, communication delay affects system throughput only when it forms a loop
• For a GALS array processor, the communication loop is the FIFO stall loop
  – Performance simulation results show that FIFO stall loops are unlikely in many DSP applications
• FIFO depth affects FIFO stalls, so a reasonable FIFO size is required

Importance of the Communication Loop Delay

• One-way communication does not affect system throughput
• A communication loop degrades throughput
  – In a uni-processor, it is caused by pipeline hazards
  – A GALS system has a longer communication loop delay

[Figure: unit 1 (T1) and unit 2 (T3) connected by communication links.]

One-way communication, unit 1 → comm. (T2) → unit 2: the throughput is 1/max(T1, T2, T3)

Communication loop, with a return link comm. (T4): the throughput is 1/(T1 + T2 + T3 + T4)
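The two throughput expressions can be compared numerically. This is a minimal sketch; the function names and the delay values are made up for illustration and are not measurements from the array. It shows that a longer communication delay leaves one-way throughput unchanged as long as it is not the slowest stage, while it always lowers loop throughput.

```python
def one_way_throughput(delays):
    """Throughput of a one-way chain: limited by the slowest element."""
    return 1.0 / max(delays)


def loop_throughput(delays):
    """Throughput of a communication loop: every delay in the loop adds up."""
    return 1.0 / sum(delays)


if __name__ == "__main__":
    # Hypothetical delays in arbitrary time units.
    t1, t3 = 4.0, 5.0          # unit 1 and unit 2
    for comm in (2.0, 3.0):    # shorter vs. longer communication delay
        print(
            f"comm delay {comm}: "
            f"one-way {one_way_throughput([t1, comm, t3]):.3f}, "
            f"loop {loop_throughput([t1, comm, t3, comm]):.3f}"
        )
```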

Examples of Stall and Stall Loops in a GALS Array Processor

[Figure: four examples of FIFO links between Proc. 1 and Proc. 2, marked with empty stalls and full stalls.]

• Data producer Proc. 1 too slow: frequent FIFO empty stalls
• Data consumer Proc. 2 too slow: frequent FIFO full stalls
• Data producer Proc. 1 and data consumer Proc. 2 both too slow at different times: FIFO empty and full stalls
• Multiple-link loop between the two processors

Performance of Synchronous and GALS Array with Different FIFO Sizes

[Figure: three panels over the nine applications (8-pt DCT, zig-zag, bubble sort, 64 FFT, 802.11, 8x8 DCT, merge sort, matrix, JPEG) for different FIFO sizes. "Synchronous Array" and "GALS Array" plot relative performance; "GALS vs. Synchronous Array" plots the performance ratio.]

Outline

• Motivation

• Experimental platform: a GALS array processor

• Performance analysis of GALS multiprocessor

• Power analysis of GALS multiprocessor

Clock/Voltage Scaling in GALS Uni-processor

• All GALS designs
  – Independently control clock frequencies to save power
  – Reduced clock frequency allows voltage reduction
• Around 25% energy savings with more than 10% performance penalty

[Iyer, ISCA02; Semeraro, HPCA02; Talpes, ISLPED03]

[Figure: five-stage pipeline (IF, ID, EXE, MEM, WB) with stages clocked by clk1 to clk5 and a sync. circuit between adjacent stages. Task 1, a program with few MEM instructions: clk4 can be reduced. Task 2, a program with few WB instructions: clk5 can be reduced.]

Clock/Voltage Scaling in GALS Array Processor

• Same basic idea as in the uni-processor
  – Use a low clock frequency for processors with a light computation load
  – Benefit from unbalanced processor computation loads
• Both static and dynamic clock scaling methods are possible
  – We study only static scaling here
• The optimal clock frequency of a processor is determined by its
  – Computational load
  – Position
• Can achieve power savings without performance reduction!

Unbalanced Processor Computation Loads in Nine Applications

[Figure: per-processor computation load, one panel per application: 8-pt DCT, 8x8 DCT, zig-zag, merge sort, bubble sort, matrix, 64 FFT, JPEG, 802.11a/g.]

Throughput Changes with Statically Configured Clock for 8x8 DCT

[Figure: relative throughput (0.6 to 1) versus relative clock frequency (0.5 to 1), with one curve for scaling each of the 1st, 2nd, 3rd, and 4th processors and the optimal clock scaling points marked.]

Compute load (clk cycles) of the four processors: 408, 204, 408, 204

Relationship of Processors in an 8x8 DCT

• Proc. 2 and 4 have identical computational loads
• Their different positions result in different FIFO stall behavior, which causes different clock scaling behavior

[Figure: pipeline of 1-DCT and Trans. processors connected by FIFOs, annotated with the FIFO empty stall and FIFO full stall of the 2nd processor and the FIFO empty stall of the 4th processor.]
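As a first-order illustration of static clock scaling, the sketch below takes the compute loads reported for the 8x8 DCT (408, 204, 408, 204 clock cycles) and finds how far each clock could be lowered in an idealized linear pipeline. It assumes throughput is set only by the slowest stage and that FIFOs never stall, so it deliberately ignores the position effect described above; the real optimal points for Proc. 2 and Proc. 4 differ for exactly that reason. The function names are hypothetical.

```python
def ideal_relative_frequencies(loads):
    """Lowest relative clock per processor that keeps it off the critical path.

    Idealized: a stage with load L at relative frequency f takes L / f time,
    and the pipeline runs at the pace of its slowest stage.
    """
    bottleneck = max(loads)
    return [load / bottleneck for load in loads]


def pipeline_throughput(loads, rel_freqs):
    """Relative throughput of an ideal pipeline with no FIFO stall loops."""
    return 1.0 / max(load / f for load, f in zip(loads, rel_freqs))


if __name__ == "__main__":
    loads = [408, 204, 408, 204]  # compute loads from the slide, in clk cycles
    freqs = ideal_relative_frequencies(loads)
    print("relative clocks:", freqs)
    print("throughput unchanged:",
          pipeline_throughput(loads, freqs)
          == pipeline_throughput(loads, [1.0] * len(loads)))
```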

Power of GALS Array with Static Clock/Voltage Scaling

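[Figure: relative power of the GALS array with static clock/voltage scaling for 8-pt DCT, zig-zag, JPEG, merge sort, matrix, 8x8 DCT, 64 FFT, 802.11, and bubble sort.]

• 40% power savings without a performance penalty
• Derived from simulated clock frequencies and a referenced clock/voltage/power relationship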
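To show how a clock/voltage/power relationship turns scaled clocks into a power estimate, here is a rough sketch. It does not use the referenced relationship from the slides; it substitutes the textbook dynamic-power model (power proportional to f times V squared) and assumes, purely for illustration, that voltage scales linearly with frequency down to a floor `v_min`. Leakage is ignored, all processors are taken to draw equal power at full speed, and the function names and the example frequencies are hypothetical.

```python
def relative_power(rel_freq, v_min=0.5):
    """Dynamic power relative to full speed, using P ~ f * V^2.

    Assumption: relative voltage tracks relative frequency linearly,
    but cannot drop below v_min (relative to nominal supply).
    """
    rel_voltage = max(rel_freq, v_min)
    return rel_freq * rel_voltage ** 2


def array_relative_power(rel_freqs, v_min=0.5):
    """Average relative power over identical processors."""
    return sum(relative_power(f, v_min) for f in rel_freqs) / len(rel_freqs)


if __name__ == "__main__":
    # Hypothetical static clock configuration for a four-processor pipeline.
    freqs = [1.0, 0.5, 1.0, 0.5]
    print(f"relative array power: {array_relative_power(freqs):.4f}")
```

Under these assumptions the result depends heavily on the chosen voltage model, so the number it prints is not comparable to the measured-relationship figure reported above.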

Summary

• Compared to a synchronous array processor, the proposed GALS array processor has:
  – < 1% throughput reduction
  – ~40% energy savings
• These results compare well with reported GALS uni-processors:
  – ~10% throughput reduction
  – ~25% energy savings
• Source of throughput reduction in GALS systems
  – Extra cost for communication loops
  – Extra cost for FIFO stall loops in GALS array processors
• Energy benefit of GALS clock/voltage scaling
  – Unbalanced processor computation loads

Acknowledgments

• Funding
  – Intel Corporation
  – UC Micro
  – NSF Grant No. 0430090
  – UCD Faculty Research Grant
• Special Thanks
  – E. Work, T. Mohsenin, other VCL processor co-designers, R. Krishnamurthy, M. Anders, S. Mathew
