
NCSA Wiki - Présentation PowerPoint · 2010. 11. 23. · Title: Présentation PowerPoint Author: Franck Cappello Created Date: 11/21/2010 5:24:18

Sep 09, 2020


Charm++ on NUMA Platforms: the impact of SMP Optimizations and a NUMA-aware Load Balancer

Laércio Pilla (UFRGS/Brazil - INRIA)

Christiane Pousa (INRIA)

Daniel Cordeiro (USP/Brazil - INRIA)

Jean-François Méhaut (INRIA)


Outline

• Introduction

• Performance Evaluation of SMP Optimizations of Charm++ on NUMA Machines

• NUMA-aware Load Balancer on Charm++

• Conclusion and Future Work


Motivation for NUMA platforms

• The number of cores per processor is increasing

• Hierarchical shared memory multiprocessors

• ccNUMA is coming back (NUMA factor)

[Figure: four NUMA nodes (Node 0 to Node 3), each with a memory bank and a single CPU, next to the current ("Now") design in which each node has a memory bank and four cores]


NUMA problems

• Remote access

• Optimization of latency

[Figure: eight NUMA nodes (Node 0 to Node 7), each with its own memory bank (M0 to M7) and four cores]


NUMA problems

• Memory contention

• Optimization of bandwidth

[Figure: eight NUMA nodes (Node 0 to Node 7), each with its own memory bank (M0 to M7) and four cores]


NUMA problems

On NUMA machines, data distribution matters!


Charm++ Parallel Programming System

• Platform independent

• Both shared and distributed memory

• Architecture abstraction

• Programmer productivity

From the Charm++ site: http://charm.cs.uiuc.edu/research/charm/


Charm++ Parallel Programming System

• Communications originally implemented with message passing

• Even on SMP machines

• Currently uses optimizations for SMP systems

• Chao Mei et al., “Optimizing a parallel runtime system for multicore clusters: a case study”, in TG ’10


Charm++ & NUMA

• How do these optimizations work on NUMA machines?

• Our evaluation

• How can we use knowledge about the NUMA system to improve performance on Charm++?

• NUMA-aware load balancer


Outline

• Introduction

• Performance Evaluation of SMP Optimizations of Charm++ on NUMA Machines

• NUMA-aware Load Balancer on Charm++

• Conclusion and Future Work


Evaluation of SMP optimizations

• Different Charm++ versions

• With optimizations

• Without optimizations

• Different architecture compilations (flavors)

• net-linux: distributed memory

• multicore: shared memory


NUMA machines

• AMD Opteron

• 8 nodes x 2 cores

• @ 2.2 GHz

• 2 MB L2 cache

• 32 GB main memory

• Low latency for local memory access

• Crossbar

• NUMA factor: 1.2 – 1.5

• Linux 2.6.32.6

[Figure: Opteron topology with eight NUMA nodes (Node 0 to Node 7), each with its own memory bank and two cores (C0 to C15)]


NUMA machines

• Intel Xeon X7560

• 4 nodes x 8 cores

• @ 2.27 GHz

• 24 MB shared L3 cache

• 64 GB main memory

• QuickPath

• NUMA factor: 2 – 2.6

• Linux 2.6.32

[Figure: Xeon topology with four NUMA nodes (Node 0 to Node 3), each with its own memory bank and eight cores (C0 to C31)]
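
The NUMA factors quoted for the two machines are commonly read as the ratio between remote and local memory access latency. The sketch below illustrates that reading only; the latency values and the helper name `numa_factor_matrix` are made up for the example and are not measurements from the talk.

```python
def numa_factor_matrix(latency_ns):
    """Return remote/local latency ratios for every pair of NUMA nodes.

    latency_ns[src][dst] is the latency seen by a core on node `src`
    when it accesses memory that lives on node `dst`.
    """
    return [
        [remote / row[src] for remote in row]  # row[src] is the local latency
        for src, row in enumerate(latency_ns)
    ]


# Hypothetical latencies in nanoseconds for a three-node machine.
latency_ns = [
    [100.0, 120.0, 150.0],
    [120.0, 100.0, 120.0],
    [150.0, 120.0, 100.0],
]

factors = numa_factor_matrix(latency_ns)
# The diagonal is 1.0 (local access); off-diagonal entries are the NUMA factors.
```

With these made-up numbers the factors fall between 1.2 and 1.5, the same range the Opteron slide reports.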


Experimental setup

• Exclusive access to the machines

• Minimum of 10 executions

• Low standard deviation (< 5%)

• Different numbers of cores
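
A minimal sketch of how the "at least 10 executions, standard deviation below 5%" criterion could be checked. It assumes the 5% is the sample standard deviation relative to the mean, which the slide does not state; the timings are invented for illustration.

```python
import statistics


def relative_stdev(times):
    """Sample standard deviation divided by the mean (coefficient of variation)."""
    return statistics.stdev(times) / statistics.mean(times)


# Hypothetical iteration times (seconds) from 10 runs of one configuration.
runs = [2.00, 2.04, 1.98, 2.02, 1.99, 2.01, 2.03, 1.97, 2.00, 2.02]

enough_runs = len(runs) >= 10          # "minimum of 10 executions"
rsd = relative_stdev(runs)
stable = rsd < 0.05                    # "low standard deviation (< 5%)"
```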


Benchmark: Jacobi2D

• Iterative benchmark

• Computations over 2D matrix

• Communications with 4 neighbors

• Stencil (CPU bound)

• Imbalanced
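
To make the stencil concrete, here is a sequential sketch of one Jacobi iteration over a 2D matrix, where each interior cell is replaced by the average of its 4 neighbors. This is one common form of the update and an assumption about the benchmark, whose exact stencil and decomposition into chares are not given on the slide; the function name `jacobi_step` is hypothetical.

```python
def jacobi_step(grid):
    """Return a new grid after one Jacobi sweep; boundary cells stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so every update reads the old values
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            new[r][c] = 0.25 * (
                grid[r - 1][c] + grid[r + 1][c] + grid[r][c - 1] + grid[r][c + 1]
            )
    return new


# 4x4 grid with a "hot" top boundary and zeros elsewhere.
grid = [[0.0] * 4 for _ in range(4)]
grid[0] = [1.0] * 4
after = jacobi_step(grid)
```

In a parallel version each block of the matrix would need the border values of its 4 neighboring blocks before every sweep, which is where the communication listed above comes from.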


Jacobi2D on Opteron Machine

[Figure: average iteration time (s) versus number of cores (2, 4, 8, 16) on the Opteron machine for four configurations: with and without optimizations, each built as multicore and as net_linux. An arrow marks the "better" direction. Annotation: no noticeable difference.]


Jacobi2D on Xeon Machine

[Figure: average iteration time (s) versus number of cores (2, 4, 8, 16, 32) on the Xeon machine for four configurations: with and without optimizations, each built as multicore and as net_linux. An arrow marks the "better" direction. Annotation: inside the error margin.]


Benchmark: kNeighbor

• Synthetic benchmark

• Completely communication bound

• Each chare communicates with k neighbors

• k = 3

• Message size = 1024 B


kNeighbor on Opteron Machine

[Figure: average iteration time (µs) versus number of cores (2, 4, 8, 16) on the Opteron machine for four configurations: with and without optimizations, each built as multicore and as net_linux. An arrow marks the "better" direction. Annotations: speedup of 9 and speedup of 1.2.]


kNeighbor on Xeon Machine

[Figure: average iteration time (µs) versus number of cores (2, 4, 8, 16, 32) on the Xeon machine for four configurations: with and without optimizations, each built as multicore and as net_linux. An arrow marks the "better" direction. Annotations: speedup of 2 and speedup of 4.7.]


Partial conclusions

• Times can differ by up to 50% between Charm++ versions

• Times up to 90% smaller when using multicore instead of net-linux

• Impact proportional to the amount of communication
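
The percentages in these partial conclusions and the speedups annotated on the kNeighbor charts are two views of the same ratio. As a worked conversion (the helper name `time_reduction` is hypothetical), a speedup s corresponds to a time reduction of 1 - 1/s:

```python
def time_reduction(speedup):
    """Fraction of the original time saved for a given speedup."""
    return 1.0 - 1.0 / speedup


# Speedup values taken from the kNeighbor chart annotations.
reductions = {s: time_reduction(s) for s in (1.2, 2.0, 4.7, 9.0)}
```

A speedup of 2 is a 50% shorter time, and a speedup of 9 is roughly an 89% shorter time, which is consistent with the rounded "50%" and "90%" figures.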


Outline

• Introduction

• Performance Evaluation of SMP Optimizations of Charm++ on NUMA Machines

• NUMA-aware Load Balancer on Charm++

• Conclusion and Future Work


NUMA-aware Load Balancer

• Use knowledge about the system

• NUMA-factor among nodes

• Collected through libarchtopo

• Communication history

• No knowledge about the chare’s memory

• Improve performance by reducing communication latency

• Avoid too many chare migrations


NUMA-aware Load Balancer

Calculate processors’ load
Sort chares by decreasing load
While there are migratable chares
    Pick most loaded chare k
    Compute W(k,i) for all processors i
    Migrate k to the processor with the smallest W(k,i)
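
The greedy loop on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: every chare is treated as migratable and visited once, a chare's own load is removed from its current processor before candidates are compared, and the cost W(k,i) is supplied as a callable. The names `greedy_rebalance` and `cost` are hypothetical.

```python
from typing import Callable, Dict, List


def greedy_rebalance(
    chare_load: Dict[str, float],
    placement: Dict[str, int],
    num_procs: int,
    cost: Callable[[str, int, List[float]], float],
) -> Dict[str, int]:
    """Return a new chare -> processor mapping chosen greedily by cost."""
    # Calculate processors' load from the current placement.
    proc_load = [0.0] * num_procs
    for chare, proc in placement.items():
        proc_load[proc] += chare_load[chare]

    new_placement = dict(placement)
    # Sort chares by decreasing load.
    pending = sorted(chare_load, key=chare_load.get, reverse=True)

    # Visit each chare once, most loaded first.
    for k in pending:
        src = new_placement[k]
        # Assumption: k's own load does not count against its current host.
        proc_load[src] -= chare_load[k]
        # Compute W(k, i) for all processors i and keep the smallest.
        dst = min(range(num_procs), key=lambda i: cost(k, i, proc_load))
        new_placement[k] = dst
        proc_load[dst] += chare_load[k]
    return new_placement


# Usage with a load-only cost, W(k, i) = L(i), so communication is ignored.
loads = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
start = {"a": 0, "b": 0, "c": 0, "d": 0}
result = greedy_rebalance(loads, start, 2, lambda k, i, load: load[i])
```

In this toy run two of the four chares move and both processors end with a load of 5. Because ties keep the lower-numbered processor, a chare that is already well placed tends to stay where it is, which fits the stated goal of avoiding too many migrations.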


NUMA-aware Load Balancer

\[
W(k,i) = L(i) + \alpha \left( -M(k,i) + \sum_{j=1,\ j \neq i}^{N} M(k,j) \cdot NF(j,i) \right)
\]

• L(i): load on candidate processor (core) i

• α: communication weight (constant)

• M(k,i): intra-core communications (extended for intra-NUMA node)

• M(k,j), j ≠ i: inter-core communications (extended for inter-NUMA node)

• NF(j,i): NUMA factor
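
A small worked example of the cost W(k,i) for a single chare k. It reads M(k,j) as the amount of communication between chare k and the chares on processor j, which is an interpretation of the slide labels; the function name `numa_cost` and all numbers are made up for illustration.

```python
from typing import Sequence


def numa_cost(
    proc_load: Sequence[float],
    msgs_to_proc: Sequence[float],
    numa_factor: Sequence[Sequence[float]],
    i: int,
    alpha: float,
) -> float:
    """W(k, i) = L(i) + alpha * (-M(k, i) + sum over j != i of M(k, j) * NF(j, i))."""
    remote = sum(
        msgs_to_proc[j] * numa_factor[j][i]
        for j in range(len(proc_load))
        if j != i
    )
    return proc_load[i] + alpha * (-msgs_to_proc[i] + remote)


# Hypothetical state for three processors and one chare k.
proc_load = [5.0, 3.0, 3.0]      # L(i)
msgs = [10.0, 0.0, 2.0]          # M(k, j): k talks mostly to processor 0
nf = [                           # NF(j, i), with 1.0 on the diagonal
    [1.0, 1.5, 2.0],
    [1.5, 1.0, 1.5],
    [2.0, 1.5, 1.0],
]
alpha = 0.1                      # communication weight

costs = [numa_cost(proc_load, msgs, nf, i, alpha) for i in range(3)]
best = min(range(3), key=lambda i: costs[i])
```

Here the costs come out to about 4.4, 4.8 and 4.8, so processor 0 is chosen even though it carries the highest load: the communication term outweighs the load difference for this choice of α.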


Load Balancer Evaluation

• Benchmarks

• Imbalance

• Jacobi2D

• Poisson3D

• Comparison with different load balancers

• GreedyLB

• GreedyCommLB


Benchmark: Imbalance

• By Isaac Dooley

• Based on Fractography3D

• Iterative benchmark

• Imbalance increases with computations

• Computations over 2D array of chares

• Communications with 4 neighbors


Imbalance on Opteron Machine

[Figure: total time speedup versus number of cores (8, 16) on the Opteron machine for No LB, GreedyCommLB, GreedyLB and NumaLB. An arrow marks the "better" direction. Annotations: 15% and 5%.]


Imbalance on Xeon Machine

[Figure: total time speedup versus number of cores (8, 16, 32) on the Xeon machine for No LB, GreedyCommLB, GreedyLB and NumaLB. An arrow marks the "better" direction. Annotation: ~5% (inside the error margin).]


Benchmark: Jacobi2D

• Iterative benchmark

• Computations over 2D matrix

• Communications with 4 neighbors

• Stencil (CPU bound)

• Imbalanced


Jacobi2D on Opteron Machine

[Figure: iteration time speedup versus number of cores (8, 16) on the Opteron machine for No LB, GreedyCommLB, GreedyLB and NumaLB. An arrow marks the "better" direction.]


Jacobi2D on Xeon Machine

[Figure: iteration time speedup versus number of cores (8, 16, 32) on the Xeon machine for No LB, GreedyCommLB, GreedyLB and NumaLB. An arrow marks the "better" direction. Annotations: 17% and 3%.]


Benchmark: Poisson3D

• By Xavier Besseron and Thierry Gautier

• Solves the Poisson equation on a 3D domain

• Parallelized by domain decomposition

• Well balanced


Poisson3D on Opteron Machine

[Figure: total time speedup versus number of cores (8, 16) on the Opteron machine for No LB, GreedyCommLB, GreedyLB and NumaLB. An arrow marks the "better" direction. Data labels on the chart: 0.900, 0.797, 0.999, 0.989.]


Poisson3D on Xeon Machine

[Figure: total time speedup versus number of cores (8, 16, 32) on the Xeon machine for No LB, GreedyCommLB, GreedyLB and NumaLB. An arrow marks the "better" direction. Data labels on the chart: 0.914, 0.837, 0.711, 0.995, 0.993, 0.983.]

GreedyLB performance decreased due to migration overhead


Outline

• Introduction

• Performance Evaluation of SMP Optimizations of Charm++ on NUMA Machines

• NUMA-aware Load Balancer on Charm++

• Conclusion and Future Work


Conclusion

• SMP optimizations do affect the performance on NUMA machines

• Up to 50% between versions and 90% between architecture-specific compilations

• Gains with NUMA LB

• Speedups of up to 2.8 (compared to no LB)

• Performance near GreedyLB

• Avoid migrations


Future Work

• Evolution of NUMA-aware LB

• Consider topology

• Number of hops

• Cache hierarchy

• Memory per chare

• Improve NUMA information discovery

• Initialization overheads

• Run experiments with communication-intensive benchmarks

• Interface for a memory LB?


Charm++ on NUMA Platforms: the impact of SMP Optimizations and a NUMA-aware Load Balancer

Thank you