Be-Nice Scheduling for Embedded SMT Processors
Apr 6th, 2008, Boston
Handong Ye
Dec 22, 2015
Be-Nice Scheduling
• ITS (Inter-Thread Stall) Introduction
• Be-Nice Scheduling
• Some experimental results
• ITS Introduction
  – ITS in out-of-order processors
  – ITS in in-order processors
• Be-Nice Scheduling
• Some experimental results
• ITS Introduction
  – ITS in an out-of-order machine
    • A thread holds (or fills up) shared resources for too long, e.g., instruction queue, reservation station, ..., and blocks other threads
    • Flush, …
  – ITS in an in-order machine
    • A thread holds functional units (FUs), blocking other threads
    • 2 examples
    • What can the compiler do?
• ITS Introduction
  – ITS in an in-order machine
    • The examples assume:
      – SMT, 2 threads
      – Embedded
      – 2 LS (load/store) units and 2 ALUs
      – Separate dispatch buffers
• Example 1 (same-FU ITS)
  – A missed load can block other threads that are using the same LS unit
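[Figure: Example 1, same-FU block. Pipeline diagram with the Thread-A and Thread-B dispatch buffers (holding ld and add instructions), the functional units LS1, LS2, ALU1, ALU2, and the stages EXE, MEM, WB; one in-flight instruction is marked MISS.]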
• Example 2 (cross-FU ITS)
  – A missed load can block other threads that are using non-LS functional units, e.g., an ALU
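[Figure: Example 2, cross-FU block. Pipeline diagram with the Thread-A and Thread-B dispatch buffers (holding ld and add instructions), the functional units LS1, LS2, ALU1, ALU2, and the stages EXE, MEM, WB; add instructions are in flight and one in-flight instruction is marked MISS.]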
The effect of ITS from thread-A on thread-B

Assume:
1. Thread-A has a cache miss rate of around 1% to 2%
2. Thread-B always hits

Results:
1. Half of the idle cycles are due to ITS
2. Almost 1/3 of all cycles are idle
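Read together, the two results above give a rough figure for how much of the run is lost to ITS. The short calculation below assumes that both fractions refer to the same cycle count, which the slide does not state explicitly.

```latex
\[
\frac{\text{ITS cycles}}{\text{all cycles}}
  = \frac{\text{ITS cycles}}{\text{idle cycles}}
    \times \frac{\text{idle cycles}}{\text{all cycles}}
  \approx \frac{1}{2} \times \frac{1}{3}
  = \frac{1}{6}
\]
```

Under that reading, roughly one cycle in six is an ITS cycle.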
• What can the compiler do?
  – Focus on in-order embedded processors
  – Needs a few simple HW supports
  – Uses Open64, in instruction scheduling
• Be-Nice Scheduling
  – Intuitive thinking
    • Prefetch: unacceptable for embedded systems
    • Reduce cross-FU ITS: reduce the number of FUs held by thread-A
    • Reduce same-FU ITS: avoid issuing instructions from other threads into the blocked FUs
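The cross-FU point can be illustrated with a toy model. The sketch below is not the simulator behind these slides. It assumes that, while a thread is stalled on a miss, every FU holding one of its in-flight instructions is unavailable to other threads. The unit names follow the earlier examples; `held_units`, `usable_units`, and the two placements are hypothetical.

```python
# Toy model: which FUs can other threads still use while thread-A is stalled?
ALL_UNITS = {"LS1", "LS2", "ALU1", "ALU2"}


def held_units(in_flight):
    """FUs tied up by a stalled thread.

    in_flight maps each in-flight instruction of that thread to the FU
    it was issued to.
    """
    return set(in_flight.values())


def usable_units(in_flight):
    """FUs left for other threads while the stall lasts."""
    return ALL_UNITS - held_units(in_flight)


# Assumed original schedule: the missed load sits in LS1 and a following
# add from the same thread sits in ALU1.
original = {"ld": "LS1", "add": "ALU1"}

# Assumed Be-Nice schedule: two back-to-back loads, both steered to LS1.
be_nice = {"ld_a": "LS1", "ld_b": "LS1"}

print(sorted(usable_units(original)))  # ['ALU2', 'LS2']
print(sorted(usable_units(be_nice)))   # ['ALU1', 'ALU2', 'LS2']
```

In this model the Be-Nice placement holds one FU instead of two, so the other thread keeps both ALUs.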
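[Figure: Thread-A before and after scheduling. An "Original Thread-A" sequence of add and ld instructions is shown next to its rescheduled form (marked "sched"), together with the pipeline diagram: Thread-A and Thread-B dispatch buffers, functional units LS1, LS2, ALU1, ALU2, and stages EXE, MEM, WB.]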
• Be-Nice Scheduling
  – Objective
    • Schedule n (>= 2) loads back-to-back
    • Issue the n loads to the same FU
  – Compiler + HW solution
    • HW side
      – Add an extra load instruction, ld.n (n = 1, 2), which sends the load only to the nth LS unit
      – Each thread has its own preferred LS unit
    • Compiler side
      – Profile to find the loads that are highly likely to miss, say ‘load_a’
      – Schedule another load, say ‘load_b’, right behind ‘load_a’, and glue them together as a pseudo-op
      – Change ‘load_a’ and ‘load_b’ to the thread’s preferred LS unit, e.g., both are changed to ‘ld.1’
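The compiler-side steps can be sketched on a single basic block. This is a minimal illustration, not the Open64 implementation: the `Instr` class, `independent`, and `be_nice_pair` are hypothetical names, the profile result is passed in as the index of the load expected to miss, and only register dependences are checked (the block is assumed to contain no stores or other memory hazards).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Instr:
    dest: str    # destination register
    op: str      # "ld", "ld.1", "add", ...
    srcs: tuple  # source operands


def independent(a, b):
    """True if a and b have no register dependence (RAW, WAR, or WAW)."""
    return (a.dest not in b.srcs       # b does not read what a writes
            and b.dest not in a.srcs   # a does not read what b writes
            and a.dest != b.dest)      # they do not write the same register


def be_nice_pair(block, load_a_idx, unit=1):
    """Hoist the next movable load right behind block[load_a_idx]
    and pin both loads to the same LS unit."""
    out = list(block)
    for j in range(load_a_idx + 1, len(out)):
        cand = out[j]
        if not cand.op.startswith("ld"):
            continue
        # The candidate must not depend on load_a or on anything it jumps over.
        if all(independent(out[k], cand) for k in range(load_a_idx, j)):
            out.insert(load_a_idx + 1, out.pop(j))
            for k in (load_a_idx, load_a_idx + 1):
                out[k] = replace(out[k], op=f"ld.{unit}")
            return out
    return out  # no partner load found: block is left unchanged


bb1 = [
    Instr("$r1", "ld", ("$r2",)),          # assumed to be the profiled miss
    Instr("$r2", "add", ("$r2", "4")),
    Instr("$r3", "ld", ("$r4",)),
    Instr("$r3", "add", ("$r3", "4")),
    Instr("$r5", "add", ("$r1", "$r3")),
]
scheduled = be_nice_pair(bb1, load_a_idx=0)
for ins in scheduled:
    print(f"{ins.dest} = {ins.op} {' '.join(ins.srcs)}")
```

On this block the second load moves up behind the first and both become `ld.1`, while the dependent adds keep their relative order.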
• Be-Nice Scheduling
  – A compiler + HW solution
BB1:
    $r1 = ld $r2
    $r2 = $r2 + 4
    $r3 = ld $r4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

BB1:
    $r1 = ld $r2
    $r3 = ld $r4
    $r2 = $r2 + 4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

BB1:
    $r1 = ld $r2
    $r2 = $r2 + 4
    $r3 = ld $r4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3
Identified to miss

BB1:
    $r1 = ld.1 $r2
    $r3 = ld.1 $r4
    $r2 = $r2 + 4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3
[Figure: Open64 code generator phases. Labels: WHIRL, CG-expand, CGIR, control flow opt., if-conversion, loop optimizations (software pipelining, loop unrolling), scheduling pre-pass (GCM here), global register alloc, local register alloc, scheduling post-pass, prolog and epilog, extended block optimizer, code emission, .s]
• Be-Nice Scheduling (in Open64 GCM and LIS)
  – The key points during code motion
    • Use GCM to find candidate <ld.1, ld.1> pairs
    • Move each pair as a single ‘pseudo’ instruction
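One way to picture moving a pair as a single pseudo instruction is to wrap the two loads in one scheduling unit before any code motion happens. The sketch below is a hypothetical illustration and does not use Open64 data structures; `glue`, `move`, and `flatten` are made-up helpers, and the legality check that a real scheduler would run before each move is omitted.

```python
def glue(ops, i):
    """Replace ops[i] and ops[i + 1] by one tuple that acts as a single unit."""
    return ops[:i] + [(ops[i], ops[i + 1])] + ops[i + 2:]


def move(units, src, dst):
    """Move the unit at index src so that it ends up at index dst."""
    units = list(units)
    units.insert(dst, units.pop(src))
    return units


def flatten(units):
    """Expand glued tuples back into a flat instruction list."""
    flat = []
    for u in units:
        flat.extend(u if isinstance(u, tuple) else [u])
    return flat


ops = ["add_x", "ld.1 a", "ld.1 b", "add_y"]
units = glue(ops, 1)         # ['add_x', ('ld.1 a', 'ld.1 b'), 'add_y']
hoisted = move(units, 1, 0)  # the pair moves above add_x as one unit
print(flatten(hoisted))      # ['ld.1 a', 'ld.1 b', 'add_x', 'add_y']
```

Because the pair is one element of the list, no move can separate the two loads or place another instruction between them.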
• Some experimental results
  – Be-Nice scheduling applied to Thread-A
  – Performance difference measured on Thread-B
[Figure: The number of ITS cycles in thread-B, with Be-Nice vs. without Be-Nice]
[Figure: IPC improvement of thread-B with Be-Nice instruction scheduling]