Tutorial: High Performance Computing


Igal G. Rasin

Department of Chemical Engineering, Israel Institute of Technology

27 Nisan 5769 (21.04.2009)

Igal G. Rasin (Technion) Tutorial: High Performance Computing 27 Nisan 5769 (21.04.2009) 1 / 18

Motivation

What is High Performance Computing?

What is “Regular” Computing?

How do you choose a method for your “regular” problem?

Is your algorithm optimal for serial computing?


Memory Gap

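[Figure: relative performance (log scale, 1 to 1000) versus year, 1980 to 2005. Curve labels: "CPU, 2x every 2 years" and "CPU, 2x every 6 years"; the slower curve presumably shows memory.]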


Processor structure

RAM frequency is lower than the processor frequency

Hierarchical structure

The L1 cache usually runs at the same frequency as the processor

[Diagram: memory hierarchy, from fastest to slowest: Processor, L1 cache, L2 cache, RAM]

Processor name     Frequency, GHz   Memory transfer rate   L1 size    L1 transfer rate
Intel Core 2 Duo   1.6              6.4 GB/s               2x32 KB    96 GB/s


Memory. Performance

Example

[Figure: iterations per second (0 to 1.2e+09) versus array size 2^n, for n = 8 to 26. Two curves: random access and serial access.]

L1 size: 2x32 KB; 2^12 · 8 B = 32 KB
L2 size: 3072 KB; 2^18 · 8 B = 2048 KB
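The measurement above can be approximated with a small benchmark. The following is a minimal sketch, not the code used for the slide: the function names (serialOrder, randomOrder, sumInOrder, iterationsPerSecond, reportMemoryBenchmark) and the fixed shuffle seed are assumptions made for illustration. It sums an array of 2^n doubles once in index order and once in a shuffled order and reports iterations per second for each.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Visiting order 0, 1, 2, ... for an array of 2^n elements.
std::vector<std::size_t> serialOrder(unsigned n) {
    std::vector<std::size_t> idx(std::size_t(1) << n);
    std::iota(idx.begin(), idx.end(), std::size_t(0));
    return idx;
}

// The same indices in shuffled order (fixed seed, so runs are repeatable).
std::vector<std::size_t> randomOrder(unsigned n) {
    std::vector<std::size_t> idx = serialOrder(n);
    std::mt19937 gen(12345);
    std::shuffle(idx.begin(), idx.end(), gen);
    return idx;
}

// Sum data[] in the order given by idx[].
double sumInOrder(const std::vector<double>& data,
                  const std::vector<std::size_t>& idx) {
    assert(data.size() == idx.size());
    double s = 0.0;
    for (std::size_t k : idx) s += data[k];
    return s;
}

// Iterations per second for one visiting order, averaged over `reps` passes.
double iterationsPerSecond(const std::vector<double>& data,
                           const std::vector<std::size_t>& idx, int reps) {
    volatile double sink = 0.0;  // keeps the compiler from dropping the loop
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) sink = sink + sumInOrder(data, idx);
    const auto t1 = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    return double(reps) * double(data.size()) / seconds;
}

// Print one line per array size 2^n, n = nMin, nMin + 2, ..., up to nMax.
void reportMemoryBenchmark(unsigned nMin, unsigned nMax) {
    for (unsigned n = nMin; n <= nMax; n += 2) {
        const std::vector<double> data(std::size_t(1) << n, 1.0);
        const double serial = iterationsPerSecond(data, serialOrder(n), 10);
        const double random = iterationsPerSecond(data, randomOrder(n), 10);
        std::printf("n = %2u  serial %.3g it/s  random %.3g it/s\n",
                    n, serial, random);
    }
}
```

Calling reportMemoryBenchmark(8, 24) prints one line per array size. Note that the index array is itself read from memory, so the cache footprint of this sketch is roughly twice that of the data alone; the sizes at which the rates drop need not match the figure.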


Multi-Core / Multi-Processor architectures.

[Diagram: two processors, each with two cores (core1, core2). Each core has its own L1 and L2 cache (L1 1, L1 2, L2 1, L2 2), and each processor is attached to its own memory.]

memory access rate depends on memory placement

both processors work independently with their own memory


Tendencies

More parallelization
More cores
Vectorization

IBM, Sony
Tilera
Nvidia
ATI


Cluster vs. Shared memory

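Separate machines connected by a network on all levels

[Diagram: three processor (P) and memory (M) pairs, connected by a network]

Shared memory machine on all levels

[Diagram: three processors (P) attached to one common memory]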

Small data/Many calculations

Large data/few calculations




Molecular dynamics

Molecular dynamics of atoms with Lennard-Jones potential

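$$V_{ij} = 4\varepsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right]$$

$$\vec{F}_{ij} = 4\varepsilon\,\vec{r}\left(\frac{6\sigma^{6}}{r^{8}} - \frac{12\sigma^{12}}{r^{14}}\right), \qquad \vec{F}_{i} = \sum_{j \neq i} \vec{F}_{ij}$$

where $\vec{r} = \vec{r}_j - \vec{r}_i$ and $r = |\vec{r}|$, so that $\vec{F}_{ij} = -\nabla_i V_{ij}$ is the force on atom $i$ due to atom $j$.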

Program

    // Vec, force() and initF() are defined elsewhere in the program.
    // Each pair (i, j) is visited once; Newton's third law gives the
    // force on j as the negative of the force on i.
    void computeF(Vec *p, Vec *f, int n){
      initF(f, n);                       // set all forces to zero
      for (int i(0); i<n; ++i)
        for (int j(i+1); j<n; ++j){
          Vec ff(force(p[i], p[j]));     // force on i due to j
          f[i] += ff;
          f[j] -= ff;
        }
    }
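The slide does not show Vec, force or initF. The following is a minimal self-contained sketch of what they might look like, under stated assumptions: reduced units (ε = σ = 1, the constants EPS and SIGMA are hypothetical), no cut-off, and a separation vector pointing from the first particle to the second. It is an illustration of the Lennard-Jones force formula, not the author's implementation.

```cpp
#include <cassert>
#include <cmath>

// Minimal 3-component vector with the two operators computeF needs.
struct Vec {
    double x, y, z;
    Vec& operator+=(const Vec& o) { x += o.x; y += o.y; z += o.z; return *this; }
    Vec& operator-=(const Vec& o) { x -= o.x; y -= o.y; z -= o.z; return *this; }
};

// Assumption: reduced units, epsilon = sigma = 1.
const double EPS = 1.0;
const double SIGMA = 1.0;

// Lennard-Jones force on particle a due to particle b, with r = b - a:
// F = 4 eps r (6 sigma^6 / r^8 - 12 sigma^12 / r^14).
Vec force(const Vec& a, const Vec& b) {
    const double rx = b.x - a.x, ry = b.y - a.y, rz = b.z - a.z;
    const double r2 = rx * rx + ry * ry + rz * rz;
    assert(r2 > 0.0);  // two particles must not coincide
    const double s2 = SIGMA * SIGMA / r2;
    const double s6 = s2 * s2 * s2;
    // Same expression, factored: 24 eps (s6 - 2 s6^2) / r^2.
    const double c = 24.0 * EPS * (s6 - 2.0 * s6 * s6) / r2;
    return Vec{c * rx, c * ry, c * rz};
}

// Set all n forces to zero.
void initF(Vec* f, int n) {
    for (int i = 0; i < n; ++i) f[i] = Vec{0.0, 0.0, 0.0};
}

// Pairwise force summation as in the program above.
void computeF(Vec* p, Vec* f, int n) {
    initF(f, n);
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j) {
            Vec ff(force(p[i], p[j]));
            f[i] += ff;
            f[j] -= ff;
        }
}
```

Because every pair contributes equal and opposite forces, the total force over all particles should sum to zero up to rounding, which makes a convenient sanity check. The pair force itself vanishes at r = 2^(1/6) σ, the minimum of the potential.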

Parallelization. OpenMP

OpenMP is a tool for parallelization on shared-memory machines

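Program

    // OpenMP version: the outer loop over i is distributed among threads.
    // Caution: different threads can update the same f[j] (and f[i]) at the
    // same time, so these updates are a data race unless they are protected
    // or each thread accumulates into its own force array.
    void computeF(Vec *p, Vec *f, int n){
      initF(f, n);
      int i;
      #pragma omp parallel for schedule(dynamic,10)
      for (i=0; i<n; ++i)
        for (int j(i+1); j<n; ++j){
          Vec ff(force(p[i], p[j]));
          f[i] += ff;
          f[j] -= ff;
        }
    }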

Nanco, 8000 particles:
1 core                   1 update per second
2 processors, 2 cores    4 updates per second
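In the OpenMP loop, two threads can add to the same force entry at the same time. One possible way to avoid that, sketched here as an assumption rather than as the method used for the timings above, is an OpenMP array reduction: every thread accumulates into a private copy of the force array and the copies are summed at the end of the loop. The sketch stores positions and forces as flat arrays of 3n doubles and uses reduced units (ε = σ = 1); the names pairCoefficient, computeFSerial and computeFParallel are hypothetical. Array-section reductions need OpenMP 4.5 or later; without OpenMP the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Coefficient c such that the force on i due to j is c * (r_j - r_i).
// Assumption: reduced units, epsilon = sigma = 1.
static double pairCoefficient(double r2) {
    const double s2 = 1.0 / r2;
    const double s6 = s2 * s2 * s2;
    return 24.0 * (s6 - 2.0 * s6 * s6) / r2;
}

// Serial reference. pos and f hold x, y, z per particle (3 * n doubles).
void computeFSerial(const double* pos, double* f, int n) {
    for (int k = 0; k < 3 * n; ++k) f[k] = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j) {
            const double rx = pos[3 * j] - pos[3 * i];
            const double ry = pos[3 * j + 1] - pos[3 * i + 1];
            const double rz = pos[3 * j + 2] - pos[3 * i + 2];
            const double c = pairCoefficient(rx * rx + ry * ry + rz * rz);
            f[3 * i] += c * rx; f[3 * i + 1] += c * ry; f[3 * i + 2] += c * rz;
            f[3 * j] -= c * rx; f[3 * j + 1] -= c * ry; f[3 * j + 2] -= c * rz;
        }
}

// Race-free OpenMP variant: each thread gets a private, zero-initialised
// copy of f[0 .. 3n), and the copies are added together after the loop.
void computeFParallel(const double* pos, double* f, int n) {
    assert(n > 0);  // a zero-length array section is not allowed
    const int len = 3 * n;
    for (int k = 0; k < len; ++k) f[k] = 0.0;
    #pragma omp parallel for schedule(dynamic, 10) reduction(+ : f[:len])
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j) {
            const double rx = pos[3 * j] - pos[3 * i];
            const double ry = pos[3 * j + 1] - pos[3 * i + 1];
            const double rz = pos[3 * j + 2] - pos[3 * i + 2];
            const double c = pairCoefficient(rx * rx + ry * ry + rz * rz);
            f[3 * i] += c * rx; f[3 * i + 1] += c * ry; f[3 * i + 2] += c * rz;
            f[3 * j] -= c * rx; f[3 * j + 1] -= c * ry; f[3 * j + 2] -= c * rz;
        }
}
```

The private copies cost 3n doubles per thread, and many compilers place them on the thread stack, so very large n may require a larger stack or an explicitly allocated per-thread buffer. The parallel result can differ from the serial one in the last digits because the summation order changes.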


Cluster. MPI

MPI (Message Passing Interface): a library for data exchange between different nodes

[Diagram: Node 0, Node 1, Node 2, Node 3 and Node 4, all connected to one network]

Point to Point: communication between single nodes

Collective: data exchange between a node and a group

One-Sided: remote direct memory access of a process


Cluster. Parallelization

Step 1. Initialization

    #include <mpi.h>

    int main(){
      MPI_Init(nullptr, nullptr);   // start MPI; no command-line arguments are passed on
      ...
      MPI_Finalize();               // shut MPI down before the program exits
      return 0;
    }

Step 2. Particle exchange

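    // Every node owns a block of n/nn particles inside p and receives the
    // blocks of all other nodes. This assumes that Vec is three contiguous
    // doubles and that n is divisible by the number of nodes nn.
    void particlesExchange(Vec *p, int n)
    {
      int nn;
      MPI_Comm_size(MPI_COMM_WORLD, &nn);
      // With MPI_IN_PLACE the send count and send type are ignored.
      MPI_Allgather(MPI_IN_PLACE, 0, MPI_DOUBLE, p, 3*n/nn,
                    MPI_DOUBLE, MPI_COMM_WORLD);
    }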



Step 3. Force calculation

    // Node id handles the rows i in [n1, n2); j still runs over all
    // particles, so f holds only a partial sum until the forces are exchanged.
    // Caution: as in the pure OpenMP version, concurrent updates of f[j]
    // (and f[i]) by different threads are a data race unless protected.
    void computeF(Vec *p, Vec *f, int n){
      initF(f, n);
      int i, n1, n2, id, nn;
      MPI_Comm_rank(MPI_COMM_WORLD, &id);
      MPI_Comm_size(MPI_COMM_WORLD, &nn);
      n1 = n/nn*id;
      n2 = n/nn*(id+1);
      #pragma omp parallel for schedule(dynamic,10)
      for (i=n1; i<n2; ++i)
        for (int j(i+1); j<n; ++j){
          Vec ff(force(p[i], p[j]));
          f[i] += ff;
          f[j] -= ff;
        }
    }
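The row decomposition can be checked without an MPI installation by emulating the nodes with an ordinary loop. The sketch below is an illustration under assumptions: it uses flat arrays of 3n doubles, reduced units (ε = σ = 1), and hypothetical names (pairCoefficient, computeFRange, emulateRanks). Unlike the program above, it gives the remainder rows to the last node so that n need not be divisible by the number of nodes. The element-wise sum over nodes stands in for the collective reduction performed in the force exchange step.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Coefficient c such that the force on i due to j is c * (r_j - r_i).
// Assumption: reduced units, epsilon = sigma = 1.
static double pairCoefficient(double r2) {
    const double s2 = 1.0 / r2;
    const double s6 = s2 * s2 * s2;
    return 24.0 * (s6 - 2.0 * s6 * s6) / r2;
}

// Partial forces from the rows i in [n1, n2); j runs over all j > i.
// pos and f hold x, y, z per particle (3 * n doubles); f is overwritten.
void computeFRange(const double* pos, double* f, int n, int n1, int n2) {
    assert(0 <= n1 && n1 <= n2 && n2 <= n);
    for (int k = 0; k < 3 * n; ++k) f[k] = 0.0;
    for (int i = n1; i < n2; ++i)
        for (int j = i + 1; j < n; ++j) {
            const double rx = pos[3 * j] - pos[3 * i];
            const double ry = pos[3 * j + 1] - pos[3 * i + 1];
            const double rz = pos[3 * j + 2] - pos[3 * i + 2];
            const double c = pairCoefficient(rx * rx + ry * ry + rz * rz);
            f[3 * i] += c * rx; f[3 * i + 1] += c * ry; f[3 * i + 2] += c * rz;
            f[3 * j] -= c * rx; f[3 * j + 1] -= c * ry; f[3 * j + 2] -= c * rz;
        }
}

// Emulate nn nodes: each computes its partial forces, then the partial
// arrays are summed element by element (the role of an all-reduce).
std::vector<double> emulateRanks(const std::vector<double>& pos, int nn) {
    const int n = static_cast<int>(pos.size() / 3);
    assert(nn > 0);
    std::vector<double> total(pos.size(), 0.0), partial(pos.size());
    for (int id = 0; id < nn; ++id) {
        const int n1 = n / nn * id;
        // Assumption (differs from the slide): the last node takes the remainder.
        const int n2 = (id == nn - 1) ? n : n / nn * (id + 1);
        computeFRange(pos.data(), partial.data(), n, n1, n2);
        for (std::size_t k = 0; k < total.size(); ++k) total[k] += partial[k];
    }
    return total;
}
```

Running emulateRanks with one node and with several nodes should give the same forces up to rounding. Because the inner loop runs over j > i, the node with the lowest rows handles the most pairs, so equal row counts do not mean equal work.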



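Step 4. Force exchange

    // Sum the partial forces of all nodes element by element; afterwards
    // every node holds the complete force array (passed here as p).
    void forcesExchange(Vec *p, int n)
    {
      MPI_Allreduce(MPI_IN_PLACE, p, 3*n, MPI_DOUBLE,
                    MPI_SUM, MPI_COMM_WORLD);
    }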


Barnes-Hut simulation

Cut-off radius r_c = 2.5σ

J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4), December 1986
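With a cut-off radius, only nearby pairs contribute, so the full loop over all pairs is no longer needed. One common way to exploit this is a cell list: space is divided into cells with an edge of at least r_c, and each particle is compared only with particles in its own and the adjacent cells. The sketch below illustrates that idea under assumptions; it is not the Barnes-Hut tree itself and not the author's code. The names P3, dist2, countPairsBrute and countPairsCells are hypothetical, the box is a cube [0, L)^3 without periodic boundaries, and the functions only count pairs within the cut-off instead of computing forces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct P3 { double x, y, z; };

static double dist2(const P3& a, const P3& b) {
    const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Reference: test all n (n - 1) / 2 pairs.
std::size_t countPairsBrute(const std::vector<P3>& p, double rc) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < p.size(); ++i)
        for (std::size_t j = i + 1; j < p.size(); ++j)
            if (dist2(p[i], p[j]) <= rc * rc) ++count;
    return count;
}

// Cell list: only the 27 surrounding cells of each particle are searched.
std::size_t countPairsCells(const std::vector<P3>& p, double boxL, double rc) {
    assert(boxL > 0.0 && rc > 0.0);
    const int m = std::max(1, static_cast<int>(boxL / rc));  // cells per edge
    const double h = boxL / m;                               // cell edge, >= rc if boxL >= rc
    auto coord = [&](double x) {
        assert(x >= 0.0 && x < boxL);                        // particle must lie in the box
        return std::min(static_cast<int>(x / h), m - 1);
    };
    auto cellId = [m](int cx, int cy, int cz) {
        return (static_cast<std::size_t>(cx) * m + cy) * m + cz;
    };
    // Sort particle indices into cells.
    std::vector<std::vector<int>> cells(static_cast<std::size_t>(m) * m * m);
    for (int i = 0; i < static_cast<int>(p.size()); ++i)
        cells[cellId(coord(p[i].x), coord(p[i].y), coord(p[i].z))].push_back(i);
    std::size_t count = 0;
    for (int i = 0; i < static_cast<int>(p.size()); ++i) {
        const int cx = coord(p[i].x), cy = coord(p[i].y), cz = coord(p[i].z);
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    const int nx = cx + dx, ny = cy + dy, nz = cz + dz;
                    if (nx < 0 || ny < 0 || nz < 0 || nx >= m || ny >= m || nz >= m)
                        continue;                            // no periodic images
                    for (int j : cells[cellId(nx, ny, nz)])
                        if (j > i && dist2(p[i], p[j]) <= rc * rc) ++count;  // count each pair once
                }
    }
    return count;
}
```

For a roughly uniform particle density the number of candidates per particle stays constant, so the work grows about linearly with the number of particles instead of quadratically. The same cells are a natural unit for dividing the domain among threads or nodes.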




OpenMP Parallelization

Parallelization within 1 subdomain

Parallelization over subdomains


MPI Parallelization

Only border cells require data exchange


Conclusions

Small data/many calculations
  - no adaptations needed for modern processors
  - extremely easy and efficient to parallelize

Large data/few calculations
  - the serial program requires data decomposition in order to fit the cache
  - a serial program with decomposed data is extremely easy and efficient to parallelize
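As an illustration of decomposing data to fit the cache, the sketch below transposes a square matrix in small tiles, so that both the rows being read and the columns being written stay in cache while a tile is processed. The matrix transpose is an example chosen here, not one taken from the tutorial, and the names transposeNaive and transposeBlocked and the tile size parameter bs are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Straightforward transpose of an n x n row-major matrix: b = a^T.
// Writes to b jump by n elements, which is slow once the matrix
// no longer fits in the cache.
void transposeNaive(const double* a, double* b, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            b[j * n + i] = a[i * n + j];
}

// Same result, computed tile by tile. A tile of bs x bs doubles from a
// and the matching tile of b should fit in the cache together.
void transposeBlocked(const double* a, double* b, int n, int bs) {
    assert(bs > 0);
    for (int ii = 0; ii < n; ii += bs)
        for (int jj = 0; jj < n; jj += bs) {
            const int iEnd = std::min(ii + bs, n);  // handle partial tiles at the edges
            const int jEnd = std::min(jj + bs, n);
            for (int i = ii; i < iEnd; ++i)
                for (int j = jj; j < jEnd; ++j)
                    b[j * n + i] = a[i * n + j];
        }
}
```

The tiles are independent of each other, so once the data is decomposed this way the loop over tiles is also a natural candidate for parallelization. A suitable tile size depends on the cache sizes of the machine and is best found by measurement.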

