Be-Nice Scheduling for embedded SMT processors. Apr 6th, 2008, Boston. Handong Ye.


Be-Nice Scheduling

• ITS (Inter-Thread Stall) Introduction

• Be-Nice Scheduling

• Some experimental results

• ITS Introduction
  – ITS in Out-of-Order processor
  – ITS in In-Order processor



• ITS Introduction
  – ITS in Out-of-Order machine
    • A thread holds (or fills up) shared resources for too long, e.g., the instruction queue, reservation stations, ..., and blocks the other threads
    • Flush, …
  – ITS in In-Order machine
    • A thread holds Functional Units (FUs), blocking the other threads
    • 2 examples
    • What can the compiler do?

• ITS Introduction
  – ITS in In-Order machine
    • Examples, assume:
      – SMT, 2 threads
      – Embedded
      – 2 LS units and 2 ALUs
      – Separate dispatch buffer

• Example 1 (Same-FU ITS)
  – A load that misses the cache can block other threads that are using the same LS unit

[Figure: Example 1, same-FU block. Pipeline diagram with the dispatch buffers of Thread-A and Thread-B feeding the shared units LS1, LS2, ALU1, and ALU2 through the EXE, MEM, and WB stages. A ld in an LS unit is marked MISS.]


• Example 2 (Cross-FU ITS)
  – A load that misses the cache can block other threads that are using non-LS Functional Units, e.g., an ALU

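[Figure: Example 2, cross-FU block. Pipeline diagram with the dispatch buffers of Thread-A and Thread-B feeding the shared units LS1, LS2, ALU1, and ALU2 through the EXE, MEM, and WB stages. A ld is marked MISS, and add instructions occupy the ALUs.]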
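The two examples above can be mimicked with a very small toy model. This is only a sketch under stated assumptions, not the simulator used for the talk: thread-A's miss is modeled as "holding" a set of units until a given cycle, thread-B issues one instruction per cycle in order, and the 10-cycle miss penalty and the function name `thread_b_stall_cycles` are hypothetical.

```python
def thread_b_stall_cycles(held_until, b_units):
    """Count cycles thread-B waits for functional units held by thread-A.

    held_until: dict mapping a unit name to the first cycle it is free again.
    b_units:    units needed by thread-B's instructions, in program order.
    """
    cycle = 0
    stalls = 0
    for unit in b_units:
        free_at = held_until.get(unit, 0)
        if cycle < free_at:            # unit is still held by thread-A
            stalls += free_at - cycle  # in-order: thread-B cannot skip ahead
            cycle = free_at
        cycle += 1                     # issue one instruction per cycle
    return stalls


MISS_PENALTY = 10  # assumed miss latency in cycles, not taken from the slides

# Example 1 (same-FU): thread-A's missed ld holds LS1, thread-B wants LS1.
same_fu = thread_b_stall_cycles({"LS1": MISS_PENALTY}, ["LS1", "ALU2"])

# Example 2 (cross-FU): thread-A's add is stuck behind its missed ld,
# so ALU1 is held as well, and thread-B wants ALU1.
cross_fu = thread_b_stall_cycles(
    {"LS1": MISS_PENALTY, "ALU1": MISS_PENALTY}, ["LS2", "ALU1"]
)

# No shared unit: thread-B is not affected by thread-A's miss.
no_its = thread_b_stall_cycles({"LS1": MISS_PENALTY}, ["LS2", "ALU2"])
```

In this model thread-B loses cycles only when it needs a unit that thread-A is holding, which is the situation Be-Nice scheduling tries to make rarer.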


The effect of ITS, from thread-A to thread-B

Assume:
1. Thread-A has a cache miss rate of around 1% to 2%
2. Thread-B always hits

Results:
1. Half of the idle cycles are due to ITS
2. Almost 1/3 of the cycles are idle
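Taken together, the two results imply that ITS accounts for roughly one sixth of all cycles. The arithmetic below is a back-of-the-envelope check that treats "half" and "almost 1/3" as exact fractions, which is an assumption; the slides give only the approximate figures.

```python
from fractions import Fraction

idle_share = Fraction(1, 3)         # "almost 1/3 of the cycles are idle" (treated as exact)
its_share_of_idle = Fraction(1, 2)  # "half of the idle cycles are due to ITS"

# Share of all cycles that are idle because of ITS.
its_share_of_all_cycles = idle_share * its_share_of_idle
print(its_share_of_all_cycles)      # prints 1/6
```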


• What can the compiler do?
  – Focus on in-order embedded processors
  – Needs a few simple HW supports
  – Uses Open64, in Instruction Scheduling

• Be-Nice Scheduling
  • Intuitive thinking
    – Prefetch: unacceptable for embedded systems
    – Reduce Cross-FU ITS: reduce the number of FUs held by thread-A
    – Reduce Same-FU ITS: avoid issuing instructions from other threads into the blocked FUs

[Figure: pipeline diagram (dispatch buffers of Thread-A and Thread-B; units LS1, LS2, ALU1, ALU2; stages EXE, MEM, WB) comparing the original Thread-A instruction order with the scheduled order.]

• Be-Nice Scheduling
  – Objective
    • Schedule n (>=2) loads back-to-back
    • Issue the n loads to the same FU
  – Compiler + HW solution
    • HW side
      – Add an extra load instruction, ld.n (n=1,2), which sends the load only to the nth LS unit
      – Each thread has its own preferred LS unit
    • Compiler side
      – Profile to find the loads that are highly likely to miss, say ‘load_a’
      – Schedule another load, say ‘load_b’, right behind ‘load_a’, and glue them together as a pseudo OP
      – Change ‘load_a’ and ‘load_b’ to use the thread’s preferred LS unit, e.g., both are changed to ‘ld.1’
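The HW-side rule can be sketched as a small steering function. This is a minimal illustration, not the actual dispatch logic: the "first free unit" policy for a plain `ld`, the per-thread preference table, and the names `steer_load` and `pinned_opcode` are all assumptions made for the example.

```python
LS_UNITS = ("LS1", "LS2")
PREFERRED_UNIT = {"A": 1, "B": 2}  # assumed per-thread preferred LS unit


def pinned_opcode(thread):
    """Return the ld.n opcode that targets the thread's preferred LS unit."""
    return f"ld.{PREFERRED_UNIT[thread]}"


def steer_load(opcode, busy):
    """Pick an LS unit for a load, or None if the load has to wait.

    opcode: "ld" (any unit) or "ld.n" (only the nth LS unit).
    busy:   set of LS unit names that are currently blocked.
    """
    if opcode == "ld":
        for unit in LS_UNITS:          # assumed policy: first free unit
            if unit not in busy:
                return unit
        return None
    if opcode.startswith("ld."):
        n = int(opcode[3:])
        if not 1 <= n <= len(LS_UNITS):
            raise ValueError(f"no such LS unit: {opcode}")
        unit = LS_UNITS[n - 1]
        return None if unit in busy else unit  # never spills to another unit
    raise ValueError(f"not a load: {opcode}")


busy = {"LS1"}  # thread-A's missed ld.1 is blocking LS1
second_a_load = steer_load(pinned_opcode("A"), busy)  # stays queued for LS1
plain_b_load = steer_load("ld", busy)                 # goes to LS2
pinned_b_load = steer_load(pinned_opcode("B"), busy)  # goes to LS2
```

With both of thread-A's loads pinned to LS1, the second load waits behind the miss instead of occupying LS2, so LS2 stays available to thread-B.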

• Be-Nice Scheduling
  – A Compiler + HW solution

BB1 (original; a load is identified to miss):
    $r1 = ld $r2
    $r2 = $r2 + 4
    $r3 = ld $r4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

BB1 (loads scheduled back-to-back):
    $r1 = ld $r2
    $r3 = ld $r4
    $r2 = $r2 + 4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

BB1 (both loads changed to ld.1):
    $r1 = ld.1 $r2
    $r3 = ld.1 $r4
    $r2 = $r2 + 4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3
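The BB1 transformation can be reproduced with a short sketch of the compiler-side steps. This is an illustration only, not the Open64 implementation: the `Instr` class and the `be_nice_pair` function are hypothetical, the load that misses is passed in by index instead of coming from a profile, immediates are omitted, and the block is assumed to contain no stores or other memory side effects.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Instr:
    dest: str    # register written
    op: str      # "ld", "ld.1", "add", ...
    srcs: tuple  # registers read (immediates omitted)


def is_load(instr):
    return instr.op == "ld" or instr.op.startswith("ld.")


def can_hoist(load_b, between):
    """True if load_b can move above every instruction in `between`."""
    for instr in between:
        if instr.dest in load_b.srcs:  # read-after-write
            return False
        if load_b.dest in instr.srcs:  # write-after-read
            return False
        if instr.dest == load_b.dest:  # write-after-write
            return False
    return True


def be_nice_pair(block, miss_index, unit=1):
    """Place the next load right behind the likely-to-miss load, pin both to ld.<unit>."""
    out = list(block)
    load_a = out[miss_index]
    b_index = next(
        (i for i in range(miss_index + 1, len(out)) if is_load(out[i])), None
    )
    if b_index is None:
        return out                      # no second load to pair with
    load_b = out[b_index]
    if load_a.dest in load_b.srcs:      # load_b needs load_a's result
        return out
    if not can_hoist(load_b, out[miss_index + 1:b_index]):
        return out
    del out[b_index]
    pinned = f"ld.{unit}"
    out[miss_index] = replace(load_a, op=pinned)
    out.insert(miss_index + 1, replace(load_b, op=pinned))
    return out


bb1 = [
    Instr("r1", "ld", ("r2",)),
    Instr("r2", "add", ("r2",)),
    Instr("r3", "ld", ("r4",)),
    Instr("r3", "add", ("r3",)),
    Instr("r5", "add", ("r1", "r3")),
]
scheduled = be_nice_pair(bb1, miss_index=0)
```

For `bb1`, the second load is independent of the intervening add, so it is placed directly behind the first load and both become `ld.1`, which matches the final BB1 listing. If an intervening instruction wrote `$r4`, the block would be left unchanged.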

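[Figure: Open64 code generator flow, from WHIRL through CG-expand to CGIR and on to code emission (.s). Phases shown: control flow optimization, if-conversion, loop optimizations (software pipelining, loop unrolling), scheduling pre-pass (GCM here), global register allocation, local register allocation, scheduling post-pass, prolog and epilog, extended block optimizer, code emission.]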

• Be-Nice Scheduling (in Open64 GCM and LIS)
  – The key points during code motion
    • Use GCM to find candidates for <ld.1, ld.1> pairs
    • Move each pair as a single ‘pseudo’ instruction
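One way to picture "moving the pair as a single pseudo instruction" is to give the glued pair the union of its members' register definitions and uses, so that a dependence on either load pins the whole pair. The sketch below is an assumption-laden illustration, not Open64's data structures: `Op`, `glue`, and `independent` are hypothetical names, and only register dependences are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    defs: frozenset  # registers written
    uses: frozenset  # registers read


def glue(first, second):
    """Combine two ops into one pseudo op for dependence checks."""
    return Op(
        name=f"<{first.name}, {second.name}>",
        defs=first.defs | second.defs,
        uses=first.uses | second.uses,
    )


def independent(x, y):
    """True if x and y have no RAW, WAR, or WAW register dependence."""
    return not (x.defs & y.uses or x.uses & y.defs or x.defs & y.defs)


load_a = Op("ld.1 r1", frozenset({"r1"}), frozenset({"r2"}))
load_b = Op("ld.1 r3", frozenset({"r3"}), frozenset({"r4"}))
pair = glue(load_a, load_b)

unrelated = Op("add r6", frozenset({"r6"}), frozenset({"r7"}))
writes_r4 = Op("add r4", frozenset({"r4"}), frozenset({"r4"}))

pair_moves_past_unrelated = independent(pair, unrelated)     # True
pair_moves_past_writes_r4 = independent(pair, writes_r4)     # False
load_a_alone_could_move = independent(load_a, writes_r4)     # True
```

The last two results show the design choice: `load_a` on its own could move past the instruction that writes `r4`, but the pair cannot, so the two loads are never separated by code motion.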

• Some experimental results
  – Be-Nice scheduling applied to Thread-A
  – Performance difference measured on Thread-B

[Figure: The number of ITS cycles in thread-B, with Be-Nice vs. without Be-Nice]

[Figure: IPC improvement of thread-B with Be-Nice instruction scheduling]
