Tutorial: High Performance Computing
Igal G. Rasin
Department of Chemical Engineering
Israel Institute of Technology
27 Nisan 5769 (21.04.2009)
Igal G. Rasin (Technion) Tutorial: High Performance Computing 27 Nisan 5769 (21.04.2009) 1 / 18
Motivation
What is High Performance Computing?
What is “Regular” Computing?
How do you choose a method for your “regular” problem?
Is your algorithm optimal for serial computing?
Memory Gap
[Figure: relative performance (log scale, 1 to 1000) versus year, 1980 to 2005. CPU: 2x every 2 years; memory: 2x every 6 years.]
Processor structure
RAM runs at a lower frequency than the processor
Hierarchical structure
The L1 cache usually runs at the same frequency as the processor
[Diagram: Processor, cache L1, cache L2, RAM]

Processor name     PF (processor frequency), GHz   Memory transfer rate   L1 size    L1 transfer rate
Intel Core 2 Duo   1.6                             6.4 GB/s               2x32 KB    96 GB/s
Memory. Performance
Example
[Figure: iterations per second (0 to 1.2e+09) versus array size 2^n (n = 8 to 26), for random and serial access]
L1 size: 2x32 KB; 2^12 · 8 = 32 KB
L2 size: 3072 KB; 2^18 · 8 = 2048 KB
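The measurement behind a plot of this kind can be sketched as follows. This is a minimal sketch, not the benchmark used for the slides: the helper names (`sumInOrder`, `iterationsPerSecond`, `serialOrder`, `randomOrder`) are hypothetical, and reading the index array adds memory traffic of its own, so absolute numbers will differ from the figure.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Sum `data` in the order given by `order`. The sum is returned so the
// compiler cannot discard the loop.
double sumInOrder(const std::vector<double>& data,
                  const std::vector<std::size_t>& order) {
    double s = 0.0;
    for (std::size_t k : order) s += data[k];
    return s;
}

// Iterations per second for `repeats` traversals in the given order.
double iterationsPerSecond(const std::vector<double>& data,
                           const std::vector<std::size_t>& order,
                           int repeats, double& checksum) {
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) checksum += sumInOrder(data, order);
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count();
    return sec > 0.0 ? repeats * double(order.size()) / sec : 0.0;
}

// Indices 0, 1, ..., 2^n - 1 (serial access).
std::vector<std::size_t> serialOrder(int n) {
    assert(n >= 0 && n < 30);
    std::vector<std::size_t> idx(std::size_t(1) << n);
    std::iota(idx.begin(), idx.end(), std::size_t(0));
    return idx;
}

// The same indices in shuffled order (random access).
std::vector<std::size_t> randomOrder(int n, unsigned seed) {
    std::vector<std::size_t> idx = serialOrder(n);
    std::mt19937 gen(seed);
    std::shuffle(idx.begin(), idx.end(), gen);
    return idx;
}
```

Calling `iterationsPerSecond` for n = 8 to 26 with both orders should show the two curves separating once the array no longer fits into L1 and then L2, assuming 8-byte elements as in the size estimates above.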
Multi-Core / Multi-Processor architectures.
[Diagram: two processors, each with two cores (core1, core2), caches L1 1, L1 2, L2 1, L2 2, and its own memory]
Memory access rate depends on memory placement
Both processors work independently with their own memories
Tendencies
More parallelization: more cores, vectorization
Examples: IBM and Sony, Tilera, Nvidia, ATI
Cluster vs. Shared memory
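Cluster: separate machines connected by a network on all levels
[Diagram: three processor/memory pairs (P, M) connected by a network]
Shared memory: a shared-memory machine on all levels
[Diagram: three processors (P) attached to one memory]
Small data / many calculations
Large data / few calculations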
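Molecular dynamics
Molecular dynamics of atoms with Lennard-Jones potential

$$V_{ij} = 4\varepsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right]$$

$$\vec{F}_{ij} = 4\varepsilon\,\vec{r} \left( \frac{6\sigma^{6}}{r^{8}} - \frac{12\sigma^{12}}{r^{14}} \right), \qquad \vec{F}_{i} = \sum_{j} \vec{F}_{ij}$$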
Program

    void computeF(Vec *p, Vec *f, int n) {
        initF(f, n);                          // reset all forces
        for (int i(0); i < n; ++i)
            for (int j(i + 1); j < n; ++j) {  // visit each pair once
                Vec ff(force(p[i], p[j]));
                f[i] += ff;
                f[j] -= ff;                   // Newton's third law
            }
    }
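The listing relies on a `Vec` type and a `force` function that the slides do not show. A minimal sketch of both follows, under several assumptions: `Vec` holds three doubles, ε = σ = 1 (reduced units), and r⃗ points from particle i to particle j, so that the pair-force formula gives the force acting on i.

```cpp
#include <cassert>

// Assumed minimal 3-component vector; the Vec class of the slides is not shown.
struct Vec {
    double x, y, z;
    Vec& operator+=(const Vec& o) { x += o.x; y += o.y; z += o.z; return *this; }
    Vec& operator-=(const Vec& o) { x -= o.x; y -= o.y; z -= o.z; return *this; }
};

const double EPS = 1.0;    // epsilon in reduced units (assumption)
const double SIGMA = 1.0;  // sigma in reduced units (assumption)

// Force on particle i due to particle j, with r = pj - pi:
// F_ij = 4 eps r (6 sigma^6 / r^8 - 12 sigma^12 / r^14)
Vec force(const Vec& pi, const Vec& pj) {
    Vec r = {pj.x - pi.x, pj.y - pi.y, pj.z - pi.z};
    double r2 = r.x * r.x + r.y * r.y + r.z * r.z;
    assert(r2 > 0.0);                  // particles must not coincide
    double s2 = SIGMA * SIGMA / r2;    // (sigma/r)^2
    double s6 = s2 * s2 * s2;          // (sigma/r)^6
    double c = 4.0 * EPS * (6.0 * s6 - 12.0 * s6 * s6) / r2;
    return Vec{c * r.x, c * r.y, c * r.z};
}
```

Working with r² throughout avoids a square root per pair. The force vanishes at r = 2^(1/6) σ, the minimum of the potential, and is repulsive at shorter distances.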
Parallelization. OpenMP
OpenMP is a tool for parallelization on shared-memory machines
Program

    void computeF(Vec *p, Vec *f, int n) {
        initF(f, n);
        int i;
        #pragma omp parallel for schedule(dynamic, 10)
        for (i = 0; i < n; ++i)
            for (int j(i + 1); j < n; ++j) {
                Vec ff(force(p[i], p[j]));
                // f[i] and f[j] may be written by several threads at once
                #pragma omp critical
                {
                    f[i] += ff;
                    f[j] -= ff;
                }
            }
    }
Nanco, 8000 particles
    1 core                   1 update per second
    2 processors, 2 cores    4 updates per second
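Because the inner loop writes to f[j] for every j > i, several threads can touch the same force entry. One way to keep the updates safe without synchronizing inside the loop is to let each thread accumulate into a private array and merge once at the end. The following is a sketch of that idea, not the code used for the timings above; `Vec` and `force` are assumed stand-ins (three doubles, reduced units).

```cpp
#include <cassert>
#include <vector>

// Assumed minimal vector type and Lennard-Jones pair force (reduced units).
struct Vec {
    double x, y, z;
    Vec& operator+=(const Vec& o) { x += o.x; y += o.y; z += o.z; return *this; }
    Vec& operator-=(const Vec& o) { x -= o.x; y -= o.y; z -= o.z; return *this; }
};

Vec force(const Vec& pi, const Vec& pj) {
    Vec r = {pj.x - pi.x, pj.y - pi.y, pj.z - pi.z};
    double r2 = r.x * r.x + r.y * r.y + r.z * r.z;
    assert(r2 > 0.0);
    double s6 = 1.0 / (r2 * r2 * r2);
    double c = 4.0 * (6.0 * s6 - 12.0 * s6 * s6) / r2;
    return Vec{c * r.x, c * r.y, c * r.z};
}

void computeF(const Vec* p, Vec* f, int n) {
    for (int i = 0; i < n; ++i) f[i] = Vec{0.0, 0.0, 0.0};
    #pragma omp parallel
    {
        // Private to each thread: no other thread writes here.
        std::vector<Vec> local(n, Vec{0.0, 0.0, 0.0});
        #pragma omp for schedule(dynamic, 10) nowait
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j) {
                Vec ff = force(p[i], p[j]);
                local[i] += ff;
                local[j] -= ff;
            }
        // Merge the private arrays one thread at a time.
        #pragma omp critical
        for (int i = 0; i < n; ++i) f[i] += local[i];
    }
}
```

The price is one extra array of n vectors per thread and a serialized merge of n additions per thread, which is small next to the O(n²) pair loop. Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result.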
Cluster. MPI
MPI (Message Passing Interface): a library standard for data exchange between different nodes
Node 0 Node 1 Node 2 Node 3 Node 4
Network
Point-to-point: communication between single nodes
Collective: data exchange between a node and a group
One-sided: remote direct memory access of a process
Cluster. Parallelization
Step 1. Initialization

    int main() {
        MPI_Init(NULL, NULL);   // no command-line arguments are passed to MPI
        ...
        MPI_Finalize();
        return 0;
    }

Step 2. Particle exchange

    void particlesExchange(Vec *p, int n)
    {
        int nn;
        MPI_Comm_size(MPI_COMM_WORLD, &nn);
        // every node contributes its own block of n/nn particles in place;
        // assumes Vec holds 3 doubles and n is divisible by nn
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DOUBLE, p, 3 * n / nn,
                      MPI_DOUBLE, MPI_COMM_WORLD);
    }
Cluster. Parallelization
Step 3. Force calculation

    void computeF(Vec *p, Vec *f, int n) {
        initF(f, n);
        int i, n1, n2, id, nn;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &nn);
        n1 = n / nn * id;          // first particle of this node
        n2 = n / nn * (id + 1);    // one past the last; assumes n divisible by nn
        #pragma omp parallel for schedule(dynamic, 10)
        for (i = n1; i < n2; ++i)
            for (int j(i + 1); j < n; ++j) {
                Vec ff(force(p[i], p[j]));
                // f[i] and f[j] may be written by several threads at once
                #pragma omp critical
                {
                    f[i] += ff;
                    f[j] -= ff;
                }
            }
    }
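The bounds n1 and n2 in the listing cover all particles only when n is divisible by the number of nodes. A sketch of a block split that also handles a remainder is given below; `Range` and `blockRange` are hypothetical names, not part of the slides or of MPI.

```cpp
#include <cassert>

// Half-open index range [begin, end) owned by one node.
struct Range { int begin, end; };

// Split n items over nn nodes; the first n % nn nodes get one extra item.
Range blockRange(int n, int nn, int id) {
    assert(nn > 0 && id >= 0 && id < nn && n >= 0);
    int base = n / nn;
    int extra = n % nn;
    int begin = id * base + (id < extra ? id : extra);
    int end = begin + base + (id < extra ? 1 : 0);
    return Range{begin, end};
}
```

For n = 8000 and 4 nodes this reproduces the bounds of the listing (node 2 owns particles 4000 to 5999). With unequal blocks, the particle exchange would need `MPI_Allgatherv` with per-node counts instead of `MPI_Allgather`. Note also that equal blocks of i do not mean equal work: since j runs from i + 1, nodes with low i evaluate more pairs.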
Cluster. Parallelization
Step 4. Force exchange

    void forcesExchange(Vec *f, int n)
    {
        // sum the partial forces of all nodes in place; every node gets the total
        MPI_Allreduce(MPI_IN_PLACE, f, 3 * n, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
    }
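Why a sum reduction gives the right forces can be checked without MPI. The sketch below emulates the nodes serially: each "node" fills a full-length force array from its own rows i, and the element-wise sum over nodes equals the result of the single-node loop. The 1D `pairForce` is a stand-in chosen for brevity, not the Lennard-Jones force, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in 1D pair force on i due to j (antisymmetric, repulsive).
double pairForce(double xi, double xj) {
    double d = xi - xj;
    return d / (std::fabs(d) * d * d);
}

// Partial forces computed by node `id` of `nn`: rows i in [n1, n2), all j > i.
std::vector<double> partialForces(const std::vector<double>& x, int nn, int id) {
    int n = int(x.size());
    assert(nn > 0 && n % nn == 0);       // same assumption as the slides
    int n1 = n / nn * id;
    int n2 = n / nn * (id + 1);
    std::vector<double> f(n, 0.0);
    for (int i = n1; i < n2; ++i)
        for (int j = i + 1; j < n; ++j) {
            double ff = pairForce(x[i], x[j]);
            f[i] += ff;
            f[j] -= ff;                  // lands in another node's block
        }
    return f;
}

// Element-wise sum over nodes: what a sum all-reduce leaves on every node.
std::vector<double> sumOverNodes(const std::vector<double>& x, int nn) {
    std::vector<double> total(x.size(), 0.0);
    for (int id = 0; id < nn; ++id) {
        std::vector<double> part = partialForces(x, nn, id);
        for (std::size_t k = 0; k < total.size(); ++k) total[k] += part[k];
    }
    return total;
}
```

The reduction has to cover all 3n values rather than only the local block, because the f[j] updates of one node fall into the blocks of the others.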
Barnes-Hut simulation
Cut-off radius rc = 2.5σ
J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4), December 1986
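With a cut-off radius, only pairs closer than rc contribute to the force. A common way to exploit this is to sort particles into cells with an edge length of at least rc, so that a particle can only interact with particles in its own cell or in adjacent ones. The sketch below illustrates that test. It is a cell-grid illustration under stated assumptions (σ = 1, box starting at the origin, hypothetical function names), not the hierarchical tree of the Barnes-Hut paper and not necessarily the scheme used for the slides.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

struct Vec { double x, y, z; };      // assumed minimal position type

const double SIGMA = 1.0;            // reduced units (assumption)
const double RC = 2.5 * SIGMA;       // cut-off radius from the slide

// True if the pair is close enough to interact.
bool withinCutoff(const Vec& a, const Vec& b) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz < RC * RC;
}

// Index of the cell (edge length RC) that contains coordinate c.
int cellIndex(double c) {
    assert(c >= 0.0);                // box is assumed to start at the origin
    return int(std::floor(c / RC));
}

// Particles within the cut-off always sit in identical or adjacent cells,
// so pairs that fail this test can be skipped without computing a distance.
bool cellsAdjacent(const Vec& a, const Vec& b) {
    return std::abs(cellIndex(a.x) - cellIndex(b.x)) <= 1 &&
           std::abs(cellIndex(a.y) - cellIndex(b.y)) <= 1 &&
           std::abs(cellIndex(a.z) - cellIndex(b.z)) <= 1;
}
```

For a roughly uniform density, each particle then has a bounded number of candidate neighbors, so the cost of a force evaluation grows about linearly with the number of particles instead of quadratically.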
OpenMP Parallelization
Parallelization within 1 subdomain
Parallelization over subdomains
MPI Parallelization
Only border cells require data exchange
Conclusions
Small data / many calculations
    - no adaptations needed for modern processors
    - extremely easy and efficient to parallelize
Large data / few calculations
    - a serial program requires data decomposition in order to fit the cache
    - a serial program with decomposed data is extremely easy and efficient to parallelize