Page 1: Parallel Computing

PHY 604: Computational Methods in Physics and Astrophysics II

Page 2: Optimization

● Getting performance out of your code means:

– Picking the right algorithm

– Implementing the algorithm efficiently

● We talked a lot about picking the proper algorithm, and saw some examples of the speed-ups you can get

● For performance in the implementation:

– You need to understand a bit about how the computer's CPU works

– You may need to consider parallel methods

Page 3: Modern CPU + Memory System

● Memory hierarchy

– Data is stored in main memory

– Multiple levels of cache (L3, L2, L1)

● A line of memory is moved into cache—you amortize the cost if you use all the data in the line

– Data gets to the registers in the CPU—this is where the computation takes place

● It is expensive to move data from main memory to the registers

– You need to exploit cache

– For arrays, loop over the data such that you operate on elements that are adjacent in memory

Page 4: Modern CPU + Memory System

● Some numbers (http://www.7-cpu.com/cpu/Haswell.html)

● Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 CL11 CR2)

– L1 data cache = 32 KB, 64 B/line, 8-way

– L1 instruction cache = 32 KB, 64 B/line, 8-way

– L2 cache = 256 KB, 64 B/line, 8-way

– L3 cache = 8 MB, 64 B/line

● L1 data cache latency = 4 cycles for simple access via pointer

● L1 data cache latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n])

● L2 cache latency = 12 cycles

● L3 cache latency = 36 cycles

● RAM latency = 36 cycles + 57 ns

Page 5: Arrays

● Row vs. column major: A(m,n)

– The first index is called the row

– The second index is called the column

– Multi-dimensional arrays are flattened into a one-dimensional sequence for storage

– Row-major (C, Python): rows are stored one after the other

– Column-major (Fortran, MATLAB): columns are stored one after the other

● Ordering matters for:

– Passing arrays between languages

– Deciding which index to loop over first

Row major

Column major

Page 6: Arrays

● This is why in Fortran you want to loop as:

double precision :: A(M,N)

do j = 1, N
   do i = 1, M
      A(i,j) = …
   enddo
enddo

● And in C:

double A[M][N];

for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
        A[i][j] = …
    }
}

Page 7: Arrays

● The floating point unit uses pipelining to perform operations

● It is most efficient if you can keep the pipe full—again, taking advantage of nearby data in cache

Page 8: Parallel Computing

● Individual processors themselves are not necessarily getting much faster on their own (the GHz wars are over)

– Chips are packing more processing cores into the same package

– Even your phone is likely a multicore chip

● If you don't use the other cores, then they are just "space heaters"

● Some techniques for parallelism require only simple modifications of your code and can provide great gains on a single workstation

● There are lots of references online

– Great book: High Performance Computing by Dowd and Severance—freely available (linked to from our webpage)

– We'll use this for some background

Page 9: Types of Machines

● Modern computers have multiple cores that all access the same pool of memory directly—this is a shared-memory architecture

● Supercomputers are built by connecting LOTS of nodes (each a shared-memory machine with ~4–32 cores) together with a high-speed network—this is a distributed-memory architecture

● Different parallel techniques and libraries are used for each of these paradigms:

– Shared memory: OpenMP

– Distributed memory: the message-passing interface (MPI)

– Offloading to accelerators: OpenACC, OpenMP, or CUDA

Page 10: Moore's Law

"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase."

—Gordon Moore, Electronics Magazine, 1965

(Steve Jurvetson/Wikipedia)

Page 11: Processor Trends

Page 12: Top 500 List

Page 13: Amdahl's Law

● In a typical program, you will have sections of code that adapt easily to parallelism, and stuff that remains serial

– For instance: initialization may be serial and the resulting computation parallel

● Amdahl's law: the speedup attained from increasing the number of processors, N, given the fraction of the code that is parallel, P, is

    S(N) = 1 / [ (1 − P) + P/N ]

Page 14: Amdahl's Law

(Daniels220 at English Wikipedia)

Page 15: Amdahl's Law

● This seems to argue that we'd never be able to use 100,000s of processors

● However (Dowd & Severance):

– New algorithms have been designed to exploit massive parallelism

– Larger computers mean bigger problems are possible—as you increase the problem size, the fraction of the code that is serial likely decreases

Page 16: Types of Parallelism

● Flynn's taxonomy classifies computer architectures

● 4 classifications: single/multiple data; single/multiple instruction

– Single instruction, single data (SISD)

● Think of a typical application on your computer—no parallelism

– Single instruction, multiple data (SIMD)

● The same instruction is applied to multiple pieces of data all at once

● Old days: vector computers; today: GPUs

– Multiple instruction, single data (MISD)

● Not very interesting...

– Multiple instruction, multiple data (MIMD)

● What we typically think of as parallel computing. The machines on the Top 500 list fall into this category

(Wikipedia)

Page 17: Types of Parallelism

● We can do MIMD in different ways:

– Single program, multiple data

● This is what we normally do. MPI allows this

● Differs from SIMD in that general CPUs can be used, and it doesn't require direct synchronization of all tasks

(Wikipedia)

Page 18: Trivially Parallel

● Sometimes our tasks are trivially parallel

– No communication is needed between processes

● Ex: ray tracing or Monte Carlo

– Each realization can do its work independently

– At the end, maybe, we need to do some simple processing of all the results

● Large data analysis

– You have a bunch of datasets and a reduction pipeline to work on them

– Use multiple processors to work on the different data files as resources become available

– Each file is processed on a single core

Page 19: Trivially Parallel via Shell Script

● Ex: data analysis—launch independent jobs

● This can be done via a shell script—no libraries necessary

– Loop over the files

● Run jobs until all of the processors are full

● Use lockfiles to indicate that a job is running

● When resources become free, start up the next job

● Let's look at the code...

● Also see GNU parallel

Page 20: How Do We Make Our Code Parallel?

● Despite your best wishes, there is no simple compiler flag "--make-this-parallel"

– You need to understand your algorithm and determine which parts are amenable to parallelism

● However... if the bulk of your work is in one specific piece (say, solving a linear system), you may get all that you need by using a library that is already parallel

– This will require minimal changes to your code

Page 21: Shared Memory vs. Distributed

● Imagine that you have a single problem to solve and you want to divide the work on that problem across the available processors

● If all the cores see the same pool of memory (shared memory), then parallelism is straightforward

– Allocate a single big array for your problem

– Spawn threads: each thread is a separate instance of a sequence of instructions operating on the data

● Multiple threads operate simultaneously

– Each core/thread operates on a smaller portion of the same array, writing to the same memory

– Some intermediate variables may need to be duplicated on each thread—thread-private data

– OpenMP is the standard here

Page 22: Shared Memory vs. Distributed

● Distributed computing: running on a collection of separate computers (CPU + memory, etc.) connected by a high-speed network

– Each task cannot directly see the memory of the other tasks

– You need to explicitly send messages from one machine to another over the network, exchanging the needed data

– MPI is the standard here

Page 23: Shared Memory

● Nodes consist of one or more chips, each with many cores (2–16 typically)

– Everything can access the same pool of memory

[Figure] A single 4-core chip and its pool of memory

Page 24: Shared Memory

● Some machines are more complex—multiple chips, each with their own pool of local memory, can talk to one another on the node

– Latency may be higher when going "off-chip"

● Best performance will require knowing your machine's architecture

[Figure] Two 4-core chips comprising a single node—each has its own pool of memory

Page 25: Ex: Blue Waters Machine

(Cray, Inc.)

Page 26: OpenMP

● Threads are spawned as needed

● When you run the program, there is one thread—the master thread

– When you enter a parallel region, multiple threads run concurrently

(Wikipedia—OpenMP)

Page 27: Parallel Computing

Page 28: OpenMP "Hello World"

● OpenMP is done via directives or pragmas

– They look like comments unless you tell the compiler to interpret them

– The environment variable OMP_NUM_THREADS sets the number of threads

– Support for C, C++, and Fortran

● Hello world:

– Compile with: gfortran -o hello -fopenmp hello.f90

program hello

  !$OMP parallel
  print *, "Hello world"
  !$OMP end parallel

end program hello

Page 29: C Hello World

● In C, the preprocessor is used for the pragmas

#include <stdio.h>

int main() {
    #pragma omp parallel
    printf("Hello world\n");
    return 0;
}

Page 30: OMP Functions

● In addition to using pragmas, there are a few functions that OpenMP provides to get the number of threads, the current thread, etc.

program hello

  use omp_lib

  print *, "outside parallel region, num threads = ", &
       omp_get_num_threads()

  !$OMP parallel
  print *, "Hello world", omp_get_thread_num()
  !$OMP end parallel

end program hello

code: hello-omp.f90

Page 31: OpenMP

● Most modern compilers support OpenMP

– However, the performance across them can vary greatly

– GCC does a reasonable job; Intel is the fastest

● There is an overhead associated with spawning threads

– You may need to experiment

– Some regions of your code may not have enough work to offset the overhead

Page 32: Number of Threads

● There will be a systemwide default for OMP_NUM_THREADS

● Things will still run if you use more threads than the cores available on your machine—but don't!

● Scaling: if you double the number of cores, does the code take 1/2 the time?

Page 33: Aside: Stack vs. Heap

● Memory allocated at compile time is put on the stack, e.g.:

– Fortran: double precision a(1000)

– C: double a[1000]

● Stack memory has a fixed (somewhat small) size

– It's managed by the operating system

– You don't need to clean up this memory

● Dynamic allocation puts the memory on the heap (see the sketch below)

– Much bigger pool

– You are responsible for deallocating it
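
● A minimal Fortran sketch of heap allocation with an allocatable array (the array name and size here are purely illustrative, not from the slides):

program heap_demo
  implicit none
  double precision, allocatable :: a(:)   ! the data for an allocatable goes on the heap
  integer :: i

  allocate(a(1000000))                    ! dynamic allocation at run time

  do i = 1, size(a)
     a(i) = dble(i)
  enddo
  print *, "sum = ", sum(a)

  deallocate(a)                           ! you are responsible for freeing heap memory
end program heap_demo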

Page 34: Parallel Loops

● Ex: matrix multiplication:

program matmul

  use omp_lib

  implicit none

  integer, parameter :: N = 50000
  double precision, allocatable :: a(:,:)
  double precision :: x(N), b(N)
  double precision :: start_omp, finish_omp
  integer :: i, j

  start_omp = omp_get_wtime()

  allocate(a(N,N))

  !$omp parallel private(i, j)
  !$omp do
  do j = 1, N
     do i = 1, N
        a(i,j) = dble(i + j)
     enddo
     x(j) = j
     b(j) = 0.0
  enddo
  !$omp end do

Page 35: Parallel Loops

  ! multiply
  !$omp do
  do j = 1, N
     do i = 1, N
        b(i) = b(i) + a(i,j)*x(j)
     enddo
  enddo
  !$omp end do
  !$omp end parallel

  finish_omp = omp_get_wtime()

  print *, "execution time: ", finish_omp - start_omp

end program matmul

Continued...

code: matmul.f90

Page 36: Timing

● We can use the omp_get_wtime() function to get the current wallclock time (in seconds)

– This is better than, e.g., the Fortran cpu_time() intrinsic, which measures the time for all threads summed together

OMP_NUM_THREADS   run 1 time (s)   run 2 time (s)   run 3 time (s)
 1                26.276           26.294           26.696
 2                18.696           17.514           16.287
 4                 8.628            9.072           10.680
 8                 4.744            6.582            4.923
16                 3.066            3.146            3.111

Timings on 2x Intel Xeon Gold 5115 CPUs using gfortran, N = 50000

Page 37: Loop Ordering

● This is a great example to see the effects of loop ordering—what happens if you switch the order of the loops?

Page 38: Loop Parallelism

● We want to parallelize all the loops we can

– Instead of f(:,:) = 0.d0, we write out the loops and thread them (see the sketch below)

● Private data

– Inside the loop, all threads will have access to all the variables declared in the main program

– For some things, we will want a private copy on each thread. These are put in the private() clause
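
● For example, a minimal sketch of threading such an initialization loop (the bounds nx and ny are just illustrative; f is the array from the slide):

! j, the loop index of the work-shared loop, is private automatically;
! the inner index i must be made private explicitly
!$omp parallel do private(i)
do j = 1, ny
   do i = 1, nx
      f(i,j) = 0.d0
   enddo
enddo
!$omp end parallel do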

Page 39: Reduction

● Suppose you are finding the minimum value of something, or summing

– The loop is spread across threads

– How do we get the data from each thread back into a single variable that all threads see?

● The reduction() clause

– Has both shared and private behaviors

– The compiler ensures that the data is synchronized at the end

Page 40: Reduction

● Example of a reduction:

program reduce

  implicit none

  integer :: i
  double precision :: sum

  sum = 0.0d0

  !$omp parallel do private(i) reduction(+:sum)
  do i = 1, 10000
     sum = sum + exp((mod(dble(i), 5.0d0) - 2*mod(dble(i), 7.0d0)))
  end do
  !$omp end parallel do

  print *, sum

end program reduce

Do we get the same answer when run with a differing number of threads?

code: reduce.f90

Page 41: Example: Relaxation

● In two dimensions, with Δx = Δy, the update for a cell is

    φ(i,j) = 1/4 [ φ(i−1,j) + φ(i+1,j) + φ(i,j−1) + φ(i,j+1) − Δx² f(i,j) ]

– Red-black Gauss-Seidel:

● Update in place

● First update the red cells (the black cells are unchanged)

● Then update the black cells (the red cells are unchanged)

Page 42: Example: Relaxation

● Let's look at the code

● All the two-dimensional loops are wrapped with OpenMP directives

● We can measure the performance

– Fortran 95 has a cpu_time() intrinsic

● Be careful though—it returns the CPU time summed across all threads

– OpenMP has the omp_get_wtime() function

● This returns the wallclock time

– Looking at wallclock time: if we double the number of processors, we want the code to take 1/2 the wallclock time

Page 43: Example: Relaxation

● Performance:

This is an example of a strong scaling test—the amount of work is held fixed as the number of cores is increased

code: relax.f90

groot, with gfortran -Ofast

512x512

threads   wallclock time
 1        1.583
 2        0.8413
 4        0.3979
 8        0.2253
16        0.1634

1024x1024

threads   wallclock time
 1        7.211
 2        3.179
 4        1.717
 8        0.8832
16        0.5076

Page 44: Threadsafe

● When sharing memory, you need to make sure you have private copies of any data that you are changing directly

● This applies to functions that you call in the parallel regions too!

● What if your answer changes when running with multiple threads?

– Some roundoff-level error is to be expected if sums are done in a different order

– Large differences indicate a bug—most likely something needs to be private that is not

● Unit testing

– Run with 1 and with multiple threads and compare the output

Page 45: Threadsafe

● Fortran:

– Common blocks are simply a list of memory spaces where data can be found. This is shared across multiple routines

● Very dangerous—if one thread updates something in a common block, every other thread sees that update

● It is much safer to use arguments to share data between functions

– save statement: the value of the data persists from one call to the next

● What if a different thread is the next to call that function—is the saved quantity the correct value? (see the sketch below)
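
● As an illustration (not from the slides), a routine like the following is not threadsafe, because every thread shares the single saved counter:

subroutine count_calls(n)
  implicit none
  integer, intent(out) :: n
  integer, save :: ncalls = 0   ! one shared copy that persists across calls

  ncalls = ncalls + 1           ! a data race if called from multiple threads
  n = ncalls
end subroutine count_calls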

Page 46: Critical Sections

● Within a parallel region, sometimes you need to ensure that only one thread at a time can write to a variable

● Consider the following:

– If this is in the middle of a loop, what happens if 2 different threads meet the criteria?

– Marking this section as critical will ensure that only one thread changes things at a time (see below)

● Warning: critical sections can be VERY slow

if ( a(i,j) > maxa ) then
   maxa = a(i,j)
   imax = i
   jmax = j
endif
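
● With OpenMP directives, the guarded section would look roughly like this (a sketch, assuming maxa, imax, and jmax are shared variables):

!$omp critical
if ( a(i,j) > maxa ) then
   maxa = a(i,j)   ! only one thread at a time executes this block
   imax = i
   jmax = j
endif
!$omp end critical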

Page 47: OpenMP

● OpenMP is relatively big

Page 48: Porting to OpenMP

● You can parallelize your code piece by piece

● Since OpenMP directives look like comments to the compiler, your old version is still there

● Generally, you are not changing any of your original code—just adding directives

Page 49: More Advanced OpenMP

● The if clause tells OpenMP only to parallelize if a certain condition is met (e.g., a test of the size of an array)

● firstprivate: like private, except each copy is initialized to the original value

● schedule: affects the balance of the work distributed to the threads (see the sketch below)
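
● A hedged sketch combining these clauses on a loop (the array x, its size n, the variable scale, and the 10000 threshold are illustrative assumptions, not from the slides):

! only go parallel if the loop is big enough; each thread starts with
! its own copy of scale initialized to the original value; iterations
! are handed out dynamically to balance the work
!$omp parallel do schedule(dynamic) firstprivate(scale) if(n > 10000)
do i = 1, n
   x(i) = scale * x(i)
enddo
!$omp end parallel do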

Page 50: OpenMP in Python

● Python enforces a "global interpreter lock", which means only one thread can talk to the interpreter at any one time

– OpenMP within pure Python is not possible

● However, C (or Fortran) extensions called from Python can do shared-memory parallelism

– The underlying code can do parallel OpenMP

Page 51: MPI

● The Message Passing Interface (MPI) is the standard library for distributed parallel computing

– Now each core cannot directly see the others' memory

– You need to manage how the work is divided and explicitly send messages from one process to another as needed

Page 52: MPI Hello World

● No longer do we simply use comments—now we call subroutines in the library:

program hello

  use mpi

  implicit none

  integer :: ierr, mype, nprocs

  call MPI_Init(ierr)

  call MPI_Comm_Rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_Size(MPI_COMM_WORLD, nprocs, ierr)

  if (mype == 0) then
     print *, "Running Hello, World on ", nprocs, " processors"
  endif

  print *, "Hello World", mype

  call MPI_Finalize(ierr)

end program hello

Page 53: MPI Hello World

● MPI jobs are run using a command-line tool

– Usually mpirun or mpiexec

– E.g.: mpiexec -n 4 ./hello

● You need to install the MPI libraries on your machine to build and run MPI jobs

– MPICH is the most popular

– Fedora: dnf install mpich mpich-devel mpich-autoload

code: hello_mpi.f90

Page 54: MPI Concepts

● A separate instance of your program is run on each processor—these are the MPI processes

– Threadsafety is not an issue here, since each instance of the program is isolated from the others

● You need to tell the library the datatype of the variable you are communicating and how big it is (the buffer size)

– Together with the address of the buffer, these specify what is being sent

● Processors can be grouped together

– Communicators label the different groups

– MPI_COMM_WORLD is the default communicator (all processes)

● Many types of operations:

– Send/receive, collective communications (broadcast, gather/scatter)

(based on Using MPI)

Page 55: MPI Concepts

● There are > 100 functions

– But you can write any message-passing algorithm with only 6:

● MPI_Init

● MPI_Comm_Size

● MPI_Comm_Rank

● MPI_Send

● MPI_Recv

● MPI_Finalize

– More efficient communication can be done by using some of the more advanced functions

– System vendors will usually provide their own MPI implementation that is well matched to their hardware

(based on Using MPI)

Page 56: Ex: Computing Pi

● This is an example from Using MPI

– Compute π by doing the integral:

    π = ∫₀¹ 4/(1 + x²) dx

● We will divide the interval up, so that each processor sees only a small portion of [0, 1]

● Each processor computes the sum for its intervals

● Add all the integrals together at the end to get the value of the total integral

– We'll pick one processor as the I/O processor—it will communicate with us

– Let's look at the code... (a rough sketch of the idea follows below)

(based on Using MPI)

code: pi.f90
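
● A minimal sketch of the idea (this is not the course's pi.f90: it combines the partial sums with MPI_Reduce, and the number of intervals is hard-coded for illustration):

program compute_pi
  use mpi
  implicit none
  integer, parameter :: n = 100000        ! number of intervals (illustrative)
  integer :: ierr, mype, nprocs, i
  double precision :: h, x, local_sum, pi

  call MPI_Init(ierr)
  call MPI_Comm_Rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_Size(MPI_COMM_WORLD, nprocs, ierr)

  h = 1.d0/n
  local_sum = 0.d0

  ! each process does every nprocs-th interval of the midpoint rule
  do i = mype+1, n, nprocs
     x = h*(dble(i) - 0.5d0)
     local_sum = local_sum + 4.d0/(1.d0 + x*x)
  enddo
  local_sum = h*local_sum

  ! combine the partial sums onto process 0 (the I/O processor)
  call MPI_Reduce(local_sum, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierr)

  if (mype == 0) print *, "pi = ", pi

  call MPI_Finalize(ierr)
end program compute_pi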

Page 57: Send/Receive Example

● The main idea in MPI is sending messages between processes

● MPI_Send() and MPI_Recv() pairs provide this functionality

– This is a blocking send/receive

● For the sending code, the program resumes when it is safe to reuse the buffer

● For the receiving code, the program resumes when the message has been received

– This may cause network contention if the destination process is busy doing its own communication

– See Using MPI for some diagnostics on this

● There are also non-blocking sends, and sends where you explicitly attach a buffer

Page 58: Send/Receive Example

● Simple example (mimics ghost cell filling):

– On each processor, allocate an integer array of 5 elements

– Fill the middle 3 with a sequence (proc 0: 0, 1, 2; proc 1: 3, 4, 5, ...)

– Send messages to fill the left and right elements with the corresponding elements from the neighboring processors

code: send_recv.f90

Page 59: Send/Receive

● Good communication performance often requires staggering the communication

● A combined sendrecv() call can help avoid deadlocking

● Let's look at the same task with MPI_Sendrecv() (a sketch of one exchange follows below)

code: sendrecv.f90
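
● As an illustration only (not the course's sendrecv.f90), one half of the exchange might look like this, assuming an integer array data(0:4) with data(1:3) holding local values, neighbor ranks left and right set to MPI_PROC_NULL at the domain ends, and an integer ierr:

! send my rightmost interior value to the right neighbor while
! receiving my left ghost cell from the left neighbor;
! a mirrored call handles the other direction
call MPI_Sendrecv(data(3), 1, MPI_INTEGER, right, 0, &
                  data(0), 1, MPI_INTEGER, left,  0, &
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)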

Page 60: Parallel Computing

Page 61: Relaxation

● Let's do the same relaxation problem, but now using MPI instead of OpenMP

– In the OpenMP version, we allocated a single array covering the entire domain, and all processors saw the whole array

– In the MPI version, each processor will allocate a smaller array, covering only a portion of the entire domain, and it will only see its part directly

Page 62: Relaxation

● We will do 1-d domain decomposition

– Each processor allocates a slab that covers the full y-extent of the domain

– The width in x is nx/nprocs (see the sketch below)

● If it is not evenly divisible, then some slabs have a width of 1 more cell

– A perimeter of 1 ghost cell surrounds each subgrid

● We will refer to a global index space [0:nx-1]×[0:ny-1]

– The memory needs are spread across all processors

– Arrays are allocated as:

f(ilo-ng:ihi+ng, jlo-ng:jhi+ng)
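
● One way each rank might compute its slab bounds (a sketch, assuming the names above; nlocal and nextra are illustrative helpers, and any remainder cells go one per low-numbered rank):

! nx cells split over nprocs ranks; the first mod(nx, nprocs) ranks get one extra
nlocal = nx / nprocs
nextra = mod(nx, nprocs)

if (mype < nextra) then
   nlocal = nlocal + 1
   ilo = mype*nlocal
else
   ilo = mype*nlocal + nextra
endif
ihi = ilo + nlocal - 1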

Page 63: Relaxation

● The left set of ghost cells is filled by receiving a message from the processor (slab) to the left

Page 64: Relaxation

● The right set of ghost cells is filled by receiving a message from the processor (slab) to the right

● The top and bottom ghost cells are physical boundaries

Page 65: Domain Decomposition

● Generally speaking, you want to minimize the surface-to-volume ratio (this reduces communication)

Page 66: Relaxation

● Most of the parallelism comes in the ghost cell filling

– Fill the left GCs by receiving data from the processor to the left

– Fill the right GCs by receiving data from the processor to the right

– Send/receive pairs—we want to try to avoid contention (this can be very tricky, and people spend a lot of time worrying about this...)

● On the physical boundaries, we simply fill as usual

● The way this is written, our relaxation routine doesn't need to do any parallelism itself—it just operates on the domain it is given

● For computing a norm, we will need to reduce the local sums across processors (see the sketch below)

● Let's look at the code...

code: relax_mpi.f90
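
● For instance, an L2 norm could be assembled roughly like this (a sketch; the residual array e and the other names are illustrative, not taken from relax_mpi.f90):

! each rank sums the squares over its own interior cells...
local_sum = 0.d0
do j = jlo, jhi
   do i = ilo, ihi
      local_sum = local_sum + e(i,j)**2
   enddo
enddo

! ...then the partial sums are combined on every rank
call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                   MPI_SUM, MPI_COMM_WORLD, ierr)

norm = sqrt(global_sum/(dble(nx)*dble(ny)))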

Page 67: MPI Relaxation Results

● Note that the smaller problem sizes become work-starved more easily

Page 68: Weak vs. Strong Scaling

● In assessing the parallel performance of your code, there are two methods that are commonly used

– Strong scaling: keep the problem size fixed and increase the number of processors

● Eventually you will become work-starved, and your scaling will stop (communication dominates)

– Weak scaling: increase the amount of work in proportion to the number of processors

● In this case, perfect scaling will result in the same wallclock time for all processor counts

Page 69: Ex: Maestro Scaling

● Maestro is a publicly available adaptive mesh refinement low Mach number hydrodynamics code

– It models astrophysical flows

– General equation of state, reactions, implicit diffusion

– Elliptic constraint enforced via multigrid

– https://github.com/AMReX-Astro/MAESTRO

Page 70: Ex: Maestro Scaling

Page 71: Ex: Maestro Scaling

Page 72: Ex: Castro Scaling

● Castro is a publicly available adaptive mesh refinement compressible radiation hydrodynamics code

– It is used to model stellar explosions

– Self-gravity is solved via multigrid

Page 73: Ex: Castro Scaling

Page 74: Debugging

● There are parallel debuggers (but these are pricey)

● It's possible to spawn multiple gdb sessions, but this gets out of hand quickly

● Print is still your friend

– Run as small a problem as possible on as few processors as necessary

● Some roundoff-level differences are to be expected from sums (different order of operations)

Page 75: Hybrid Parallelism

● To get good performance on current supercomputers, you need to do hybrid parallelism:

– OpenMP within a node, MPI across nodes

● For example, in our MPI relaxation code, we could split the loops over each subdomain across multiple cores on a node using OpenMP

– Then we have MPI to communicate across nodes and OpenMP within nodes

– This hybrid approach is often needed to get the best performance on big machines

Page 76: Parallel Python

● MPI has interfaces for Fortran and C/C++

● There are several Python modules for MPI

– mpi4py: a module that can be imported into Python

– pyMPI: changes the Python interpreter itself

Page 77: Parallel Python

● Hello world:

● Run with mpiexec -n 4 python hello.py

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

print("Hello, world", rank)

Page 78: Parallel Python

● We can easily parallelize our Monte Carlo poker odds code

– Each processor considers hands independently

– Do a reduction at the end

code: poker-mpi.py

Page 79: Parallel Libraries

● There are lots of libraries that provide parallel frameworks for writing your application

● Some examples:

– Linear algebra / PDEs

● PETSc: linear and nonlinear system solvers, parallel matrix/vector routines

● hypre: sparse linear system solver

– I/O

● HDF5: platform-independent parallel I/O built on MPI-IO

– Adaptive mesh refinement (grids)

● BoxLib: logically Cartesian AMR with elliptic solvers

Page 80: Coarray Fortran

● Part of the Fortran 2008 standard

– A parallel version of Fortran

– A separate image (instance of the program) is run on each processor

– [ ] on arrays is used to refer to data on different processors (see the sketch below)

– Not yet widely available
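
● A tiny illustrative sketch of the coarray syntax (not from the slides; it assumes the program is run with at least two images):

program coarray_demo
  implicit none
  integer :: n[*]        ! a scalar coarray: one copy per image

  n = this_image()       ! each image stores its own number

  sync all               ! make sure every image has written n

  if (this_image() == 1 .and. num_images() >= 2) then
     print *, "image 2 holds", n[2]   ! [ ] indexes another image's copy
  endif
end program coarray_demo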

Page 81: GPUs

● GPU offloading can greatly accelerate computing

● Main issue: data needs to transfer from the CPU across the (relatively slow) PCIe bus to the GPU

– Good performance requires that lots of work is done on the data to "pay" for the cost of the transfer

● GPUs work as SIMD parallel machines

– The same instructions operate on all the data in lockstep

– Branching (if-tests) is slower

● Best performance requires that you structure your code to be vectorized

Page 82: OpenACC

● OpenACC is a directives-based method for offloading computing to GPUs

– It looks like OpenMP

– A big difference is that you need to explicitly add directives that control data movement

● There's a big cost in moving data from the CPU to the GPU

– You need to do a lot of computing on the GPU to cover that expense

– We can separately control what is copied to and from the GPU

● We can do our same relaxation example using OpenACC

– Note: we need to explicitly write out the separate red-black updates to ensure that a loop doesn't access adjacent elements

code: relax-openacc.f90

Page 83: OpenACC

code: relax-openacc.f90

!$acc data copyin(f, dx, imin, imax, jmin, jmax, bc_lo_x, bc_hi_x, bc_lo_y, bc_hi_y) copy(v)

do m = 1, nsmooth

   !$acc parallel

   !$acc loop
   do j = jmin, jmax
      v(imin-1,j) = 2*bc_lo_x - v(imin,j)
      v(imax+1,j) = 2*bc_hi_x - v(imax,j)
   enddo

   !$acc loop
   do i = imin, imax
      v(i,jmin-1) = 2*bc_lo_y - v(i,jmin)
      v(i,jmax+1) = 2*bc_hi_y - v(i,jmax)
   enddo

   !$acc wait

   !$acc loop collapse(2)
   do j = jmin, jmax, 2
      do i = imin, imax, 2
         v(i,j) = 0.25d0*(v(i-1,j) + v(i+1,j) + &
                          v(i,j-1) + v(i,j+1) - dx*dx*f(i,j))
      enddo
   enddo

   ...

This is part of the smoother function marked up with OpenACC

Page 84: OpenACC

● This relaxation code runs about 30× faster on the GPU vs. the CPU (single core) on a local machine

– Note: when comparing CPU to GPU, a fair comparison would include all of the CPU cores, so for a 12-core machine, it is about 2.5× faster on the GPU

Page 85: Supercomputing Centers

● Supercomputing centers

– National centers run by the NSF (through the XSEDE program) and the DOE (NERSC, OLCF, ALCF)

– You can apply for time—starter accounts are available at most centers to get up to speed

– To get lots of time, you need to demonstrate that your codes can scale to O(10^4) processors or more

● Queues

– You submit your job to a queue, specifying the number of processors (MPI + OpenMP threads) and the length of time

– Typical queue windows are 2–24 hours

– The job waits until resources are available

Page 86: Supercomputing Centers

● Checkpoint/restart

– Long jobs won't be able to finish in the limited queue window

– You need to write your code so that it saves all of the data necessary to restart where it left off

● Archiving

– Mass storage at the centers is provided (usually through HPSS)

– Typically you generate far more data than is reasonable to bring back locally—remote analysis and visualization are necessary

Page 87: Future...

● The big thing in supercomputing these days is accelerators

– GPUs or Intel Phi boards

– They add a SIMD-like capability to the more general CPU

● Originally with GPUs, there were proprietary languages for interacting with them (e.g., CUDA)

● Currently, OpenACC is an OpenMP-like way of dealing with GPUs/accelerators

– Still maturing

– Portable

– It will merge with OpenMP in the near future

● Data transfer to the accelerators moves across the slow system bus

– Future processors may move these capabilities on-die