A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture

Yun Zhang, Michael J. Voss

University of Toronto

Guansong Zhang, Raul Silvera

IBM Toronto Lab

Apr 18, 2023


Agenda

Background
Motivation
Previous Work
Adaptive Schedulers
IBM Power 5 Architecture
A Multi-Level Hierarchical Scheduler
Evaluation
Future Work


Simultaneous Multi-Threading

Architecture

Several threads per physical processor

Threads share
  Caches
  Registers
  Functional Units


Power 5 SMT

[Figure: three panels, each plotting execution resources (Resource 0, Resource 1, ..., Resource n) against clock cycles and showing which slots are occupied by Thread 0 and Thread 1]


OpenMP

OpenMP
  A standard API for shared memory programming
  Add directives for parallel regions

Standard Loop Schedulers
  Static
  Dynamic
  Guided
  Runtime


OpenMP API

#pragma omp parallel for shared(a, b) private(i, j) schedule(runtime)
for (i = 0; i < 100; i++) {
    for (j = 0; j < 100; j++) {
        a[i][j] = a[i][j] + b[i][j];
    }
}

An example of a parallel loop in C code (Fortran is similar).

[Figure: the i-j iteration space, with the i dimension divided among threads T0 ... Tn]


Motivation

OpenMP Applications
  Designed for SMP systems
  Not aware of HT technology
  Understanding and controlling performance of OpenMP applications on SMT processors is not trivial

Important performance issues on SMP systems with SMT nodes
  Inter-thread data locality
  Instruction mix
  SMT-related load balance


Scaling (SPEC & NAS)

[Figure: speedup (0.00 to 4.00) vs. number of threads (1 to 8) for ammp, apsi, art, equake, mgrid, swim, wupwise, BT, CG, EP, FT, MG, SP; the plot is divided into a "1 Thread per Processor" region and a "1-2 Threads per Processor" region]

4 Intel Xeon processors with Hyperthreading


Why do they scale poorly?

Inter-thread data locality
  Cache misses
Instruction mix
  Functional unit sharing
  Benefit gained this way may outweigh cache misses
SMT-related load balance
  We should balance workloads well among:
    processors
    threads running on the same physical processor


Previous Work: Runtime Adaptive Scheduler

Hierarchical Scheduling
  Upper level scheduler
  Lower level scheduler

Select scheduler and the number of threads to run at runtime
  One thread per physical processor
  Two threads per physical processor


Two-Level Hierarchical Scheduler


Traditional Scheduling

[Figure: two views of the i-j iteration space. Static Scheduling: the space is divided among threads T0 ... Tn. Dynamic Scheduling: pieces of the space are labeled T0, Tn, Ti, Tk.]


Hierarchical Scheduling

[Figure: the i-j iteration space divided hierarchically among processors P0 ... Pi and their threads T00, T01, ..., Ti0, Ti1; the two levels are labeled Dynamic Scheduling and Static Scheduling]


Why can we benefit from runtime scheduler selection?

Many parallel loops in OpenMP applications are executed again and again.

Example

for (k = 1; k < 100; k++) {
    ………….
    calculate();
    ………….
}

void calculate() {
    #pragma omp parallel for schedule(runtime)
    for (i = 1; i < 100; i++) {
        ……………; // calculation
    }
}

# of calls vs. Execution time

Benchmark   < 10 times   10 – 40 times   > 40 times
ammp        0%           0%              84.20%
apsi        0%           0%              82.55%
art         100%         0%              0%
equake      0.05%        0%              98.23%
mgrid       0%           0.11%           95.95%
swim        0.09%        0%              99.25%
wupwise     0.12%        0%              99.49%
BT          0%           0%              100%
CG          0.92%        3.5%            92.57%
EP          100%         0%              0%
MG          12.73%       12.87%          71.91%
SP          1.02%        0%              92.71%


Adaptive Schedulers

Region Based Scheduler
  Select loop schedulers at runtime
  Parallel loops in one parallel region have to use the same scheduler, which may not be the best

Loop Based Scheduler
  Higher runtime overhead
  More accurate loop scheduler for each parallel loop


Sample from NAS2004

!$omp parallel default(shared) private(i,j,k)
!$omp do schedule(runtime)
      do j=1,lastrow-firstrow+1
         do k=rowstr(j),rowstr(j+1)-1
            colidx(k) = colidx(k) - firstcol + 1
         enddo
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do i = 1, na+1
         x(i) = 1.0D0
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do j=1, lastcol-firstcol+1
         q(j) = 0.0d0
         z(j) = 0.0d0
         r(j) = 0.0d0
         p(j) = 0.0d0
      enddo
!$omp end do nowait
!$omp end parallel

Annotations on the slide:
  Region based scheduler picks one scheduler that applies to all three loops
  Loop based scheduler picks a scheduler (once for each of the three loops)


Runtime Loop Scheduler Selection

Phase 1: try upper level scheduler, run with 4 threads …………

[Diagram: M1 with processors P0, P1, P2, P3 running threads T0, T1, T2, T3, one thread per processor]

Static Scheduler


Runtime Loop Scheduler Selection

Phase 1: try upper level scheduler, run with 4 threads …………

[Diagram: M1 with processors P0, P1, P2, P3 running threads T0, T1, T2, T3, one thread per processor]

Dynamic Scheduler


Runtime Loop Scheduler Selection

Phase 1: try upper level scheduler, run with 4 threads …………

[Diagram: M1 with processors P0, P1, P2, P3 running threads T0, T1, T2, T3, one thread per processor]

Affinity Scheduler


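Runtime Loop Scheduler Selection

Phase 2: a decision has been made on the upper level scheduler; try lower level scheduler, run with 8 threads …………

[Diagram: M1 with two processor pairs, each labeled P0 and P1, running threads T0 to T3 and T4 to T7, two threads per processor]

Upper level: Affinity Scheduler
Lower level: Static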


Sample from NAS2004

!$omp parallel default(shared) private(i,j,k)
!$omp do schedule(runtime)
      do j=1,lastrow-firstrow+1
         do k=rowstr(j),rowstr(j+1)-1
            colidx(k) = colidx(k) - firstcol + 1
         enddo
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do i = 1, na+1
         x(i) = 1.0D0
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do j=1, lastcol-firstcol+1
         q(j) = 0.0d0
         z(j) = 0.0d0
         r(j) = 0.0d0
         p(j) = 0.0d0
      enddo
!$omp end do nowait
!$omp end parallel

Annotations on the slide (one per loop):
  Static-Static, 8 threads
  TSS, 4 threads
  TSS, 4 threads


Hardware Counter Scheduler

Motivation
  RBS and LBS have runtime overhead. They will work even better if we can reduce that overhead as much as possible.

Algorithm
  Try different schedulers on parallel loops on a subset of the benchmarks using training data
  Use the loop characteristics (cache misses, number of floating point operations, number of micro-ops, load imbalance) and the best scheduler for that loop as input
  Feed the above data to classification software (we use C4.5) to build a decision tree
  Apply this decision tree to a loop at runtime: feed the hardware counter data collected at runtime as input, and get the result (the scheduler) as output


Speedup on 4 Threads

[Figure: speedup (1.00 to 4.50) per benchmark (ammp, apsi, art, equake, mgrid, swim, wupwise, BT(W), CG, EP, MG, SP(W), Average) for the schedulers static, dynamic, guided, afs, tss, original, RBS, LBS, HCS]

4 Intel Xeon processors with Hyperthreading


Speedup on 8 Threads

[Figure: speedup (1.00 to 4.00) per benchmark (ammp, apsi, art, equake, mgrid, swim, wupwise, BT(W), CG, EP, MG, SP(W), Average) for the schedulers static, dynamic, guided, afs, tss, original, RBS, LBS, HCS]

4 Intel Xeon processors with Hyperthreading


IBM Power 5

Technology: 130nm
Dual processor core
8-way superscalar
Simultaneous Multi-Threaded (SMT) core
  Up to 2 virtual processors
  24% area growth per core for SMT
  Natural extension to Power 4 design


Single Thread

Single thread has an advantage when executing execution-unit-limited applications
  Floating or fixed point intensive workloads

Extra resources necessary for SMT provide higher performance benefit when dedicated to a single thread

Data locality on one SMT core is better with single thread for some applications


Power 5 Multi-Chip Module (MCM)

Or Multi-Chipped Monster

4 processor chips
  2 processors per chip

4 L3 cache chips


Power5 64-way Plane Topology

Each MCM has 4 inter-connected processor chips

Each processor chip has two processors on chip

Each processor has SMT technology therefore two threads can be executed on it simultaneously


Multi-Level Scheduler

[Diagram, top to bottom:]
Loop Iterations
  1st Level Scheduler
    Iterations for Module 1 ... Module i ... Module n
      2nd Level Scheduler (one per module)
        Iterations for Processor 1 ... Processor m
          3rd Level Scheduler (one per processor)
            Iterations for Thread 1 ... Thread k


OpenMP Implementation

Outline Technique
  New subroutines are created with the body of each parallel construct
  Runtime routines receive the address of the outlined procedure as a parameter


Source Code:

#pragma omp parallel for shared(a,b) private(i)
for (i = 0; i < 100; i++) {
    a = a + b;
}

Outlined Functions:

long main {
    _xlsmpParallelDoSetup_TPO(…)
}

void main@OL@1 ( … ) {
    do {
        loop body;
    } while (loop end condition meets);
    return;
}

Runtime Library:

1. Initialize work items and work shares
2. Call _xlsmp_DynamicChunkCall(…)

while (still iterations left, go to get some iterations for this thread) {
    …………
    call main@OL@1(...);
    ………….
}


Source Code:

#pragma omp parallel for shared(a,b) private(i)
for (i = 0; i < 100; i++) {
    a = a + b;
}

Outlined Functions:

long main {
    _xlsmpParallelDoSetup_TPO(…)
}

void main@OL@1 ( … ) {
    do {
        loop body;
    } while (loop end condition meets);
    return;
}

Runtime Library:

1. Initialize work items and work shares
2. Call _xlsmp_DynamicChunkCall(…)

while (hier_sched(…)) {
    …………
    call main@OL@1(...);
    ………….
}


1. Look up its parent's iteration list to see if there are any iterations available; if yes, get some iterations from the 2nd level scheduler and return

2. Otherwise, look one level up, grab the lock for its group, and seek more iterations from the upper level using the upper level loop scheduler (a recursive function call) until it gets some iterations or the whole loop ends

[Diagram: Root with children M0 (processors P0, P1; threads T0 to T3) and M1 (processors P0, P1; threads T4 to T7); the levels are labeled Guided and Static Cyclic]


Hierarchical Scheduler

Guided as the 1st level scheduler
  Balance workloads among processors
  Reduce runtime overhead

Static Cyclic as the 2nd level scheduler
  Improve cache locality
  Reduce runtime overhead

[Figure: the iteration space divided between T0 and T1 in two ways. Standard static scheduling: one contiguous part for T0 and one for T1. Static cyclic scheduling: parts alternate between T0 and T1.]


Evaluation

IBM Power 5 System
  4 Power 5 1904 MHz SMT processors
  31872 M memory

Operating System
  AIX 5.3

Compiler
  IBM XL C/C++, XL Fortran compiler

Benchmark
  SPEC OMP2001


Scalability of IBM Power 5 SMT Processors, 1 through 8 threads

[Figure: speedup (0.00 to 6.00) vs. number of threads (1 to 8) for ammp, applu, apsi, art, equake, gafort, mgrid, swim, wupwise]


Evaluation on Power 5: Execution Time Normalized to Default (Static) Scheduler

[Figure: time normalized over static (0.80 to 1.10) per benchmark (ammp, applu, apsi, art, equake, gafort, mgrid, swim, wupwise) for the schedulers static, dynamic, guided, Hier]


Conclusion

Standard schedulers are not aware of SMT technology

Adaptive hierarchical schedulers take SMT-specific characteristics into account, which could make the OpenMP API (software) and SMT technology (hardware) work better together

OpenMP parallel applications running on the Power 5 architecture with SMT have the same problem

The multi-level hierarchical scheduler designed for IBM Power 5 achieves an average improvement of 3% over the default loop scheduler on SPEC OMP2001
  Large improvements of 7% and 11% on some benchmarks
  Improves on average over all other standard OpenMP loop schedulers by at least 2%


Future Work

Evaluate multi-level hierarchical scheduler on a larger system with 32 SMT processors (with MCM)

Explore performance on auto-parallelized benchmarks (SPEC CPU FP)

Examine mechanisms for determining best scheduler configuration at compile-time

Explore the use of helper threads on Power 5
  Cache prefetching


Thank You~


(A cache miss comparison chart will be shown here, if we find a way to calculate the overall L2 load/store misses in general. If not, the overhead of this optimization will be shown from the tprof data.)


Schedulers’ Speedup on 4 Threads

[Figure: speedup (1.00 to 4.00) per benchmark for the schedulers static, dynamic, guided, afs, tss, original]


Schedulers’ Speedup on 8 Threads

[Figure: speedup (1.00 to 4.00) per benchmark (ammp, apsi, art, equake, mgrid, swim, wupwise, BT(W), CG, EP, MG, SP(W), Average) for the schedulers static, dynamic, guided, afs, tss, original]


Decision Tree

Only one decision tree is built offline, before executing the program

Apply that decision tree to loops at runtime without changing the tree

Make a decision on which scheduler we should use with only one run of each loop, which greatly reduces runtime scheduling overhead

uops <= 3.62885e+08 :
|   cachemiss <= 111979 :
|   |   uops > 748339 : static-4
|   |   uops <= 748339 :
|   |   |   l/s <= 167693 : static-4
|   |   |   l/s > 167693 : static-static
|   cachemiss > 111979 :
|   |   floatpoint <= 1.52397e+07 :
|   |   |   cachemiss <= 384690 :
|   |   |   |   uops <= 2.06431e+07 : static-static
|   |   |   |   uops > 2.06431e+07 :
|   |   |   |   |   imbalance <= 1330 : afs-static
|   |   |   |   |   imbalance > 1330 :
|   |   |   |   |   |   cachemiss <= 301582 : afs-4
|   |   |   |   |   |   cachemiss > 301582 : guided-static
…………………………….
uops > 3.62885e+08 :
|   l/s > 7.22489e+08 : static-4
|   l/s <= 7.22489e+08 :
|   |   imbalance <= 32236 : static-4
|   |   imbalance > 32236 :
|   |   |   floatpoint <= 5.34465e+07 : static-4
|   |   |   floatpoint > 5.34465e+07 :
|   |   |   |   floatpoint <= 1.20539e+08 : tss-4
|   |   |   |   floatpoint > 1.20539e+08 :
|   |   |   |   |   floatpoint <= 1.45588e+08 : static-4
|   |   |   |   |   floatpoint > 1.45588e+08 : tss-4
END hardware-counter scheduling


(Load imbalance comparison chart will be shown here)

Generating……..