Page 1: Parallel Computing

PHY 604: Computational Methods in Physics and Astrophysics II

Page 2: Optimization

● Getting performance out of your code means:

– Picking the right algorithm

– Implementing the algorithm efficiently

● We talked a lot about picking the proper algorithm, and saw some examples of the speed-ups you can get

● For performance in the implementation:

– You need to understand a bit about how the computer's CPU works

– You may need to consider parallel methods

Page 3: Modern CPU + Memory System

● Memory hierarchy

– Data is stored in main memory

– Multiple levels of cache (L3, L2, L1)

● A line of memory is moved into cache—you amortize the cost if you use all the data in the line

– Data gets to the registers in the CPU—this is where the computation takes place

● It is expensive to move data from main memory to the registers

– You need to exploit cache

– For arrays, loop over the data such that you operate on elements that are adjacent in memory

Page 4: Modern CPU + Memory System

● Some numbers (http://www.7-cpu.com/cpu/Haswell.html)

● Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 CL11 CR2)

– L1 data cache = 32 KB, 64 B/line, 8-way

– L1 instruction cache = 32 KB, 64 B/line, 8-way

– L2 cache = 256 KB, 64 B/line, 8-way

– L3 cache = 8 MB, 64 B/line

● L1 data cache latency = 4 cycles for simple access via pointer

● L1 data cache latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n])

● L2 cache latency = 12 cycles

● L3 cache latency = 36 cycles

● RAM latency = 36 cycles + 57 ns

Page 5: Arrays

● Row vs. column major: A(m,n)

– The first index is called the row

– The second index is called the column

– Multi-dimensional arrays are flattened into a one-dimensional sequence for storage

– Row-major (C, Python): rows are stored one after the other

– Column-major (Fortran, MATLAB): columns are stored one after the other

● Ordering matters for:

– Passing arrays between languages

– Deciding which index to loop over first

Row major

Column major

Page 6: Arrays

● This is why in Fortran you want to loop as:

double precision :: A(M,N)

do j = 1, N
   do i = 1, M
      A(i,j) = …
   enddo
enddo

● And in C:

double A[M][N];

for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
        A[i][j] = …
    }
}

Page 7: Arrays

● The floating point unit uses pipelining to perform operations

● It is most efficient if you can keep the pipe full—again, taking advantage of nearby data in cache

Page 8: Parallel Computing

● Individual processors themselves are not necessarily getting much faster on their own (the GHz wars are over)

– Chips are packing more processing cores into the same package

– Even your phone is likely a multicore chip

● If you don't use the other cores, then they are just "space heaters"

● Some techniques for parallelism require only simple modifications of your code and can provide great gains on a single workstation

● There are lots of references online

– Great book: High Performance Computing by Dowd and Severance—freely available (linked to from our webpage)

– We'll use this for some background

Page 9: Types of Machines

● Modern computers have multiple cores that all access the same pool of memory directly—this is a shared-memory architecture

● Supercomputers are built by connecting LOTS of nodes (each a shared-memory machine with ~4–32 cores) together with a high-speed network—this is a distributed-memory architecture

● Different parallel techniques and libraries are used for each of these paradigms:

– Shared memory: OpenMP

– Distributed memory: the message-passing interface (MPI)

– Offloading to accelerators: OpenACC, OpenMP, or CUDA

Page 10: Moore's Law

"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase."

—Gordon Moore, Electronics Magazine, 1965

(Steve Jurvetson/Wikipedia)

Page 11: Processor Trends

Page 12: Top 500 List

Page 13: Amdahl's Law

● In a typical program, you will have sections of code that adapt easily to parallelism, and stuff that remains serial

– For instance: initialization may be serial and the resulting computation parallel

● Amdahl's law: the speedup attained from increasing the number of processors, N, given the fraction of the code that is parallel, P, is

    S(N) = 1 / [ (1 − P) + P/N ]

Page 14: Amdahl's Law

(Daniels220 at English Wikipedia)

Page 15: Amdahl's Law

● This seems to argue that we'd never be able to use 100,000s of processors

● However (Dowd & Severance):

– New algorithms have been designed to exploit massive parallelism

– Larger computers mean bigger problems are possible—as you increase the problem size, the fraction of the code that is serial likely decreases

Page 16: Types of Parallelism

● Flynn's taxonomy classifies computer architectures

● 4 classifications: single/multiple data; single/multiple instruction

– Single instruction, single data (SISD)

● Think of a typical application on your computer—no parallelism

– Single instruction, multiple data (SIMD)

● The same instruction is applied to multiple pieces of data all at once

● Old days: vector computers; today: GPUs

– Multiple instruction, single data (MISD)

● Not very interesting...

– Multiple instruction, multiple data (MIMD)

● What we typically think of as parallel computing. The machines on the Top 500 list fall into this category

(Wikipedia)

Page 17: Types of Parallelism

● We can do MIMD in different ways:

– Single program, multiple data

● This is what we normally do. MPI allows this

● Differs from SIMD in that general CPUs can be used, and it doesn't require direct synchronization of all tasks

(Wikipedia)

Page 18: Trivially Parallel

● Sometimes our tasks are trivially parallel

– No communication is needed between processes

● Ex: ray tracing or Monte Carlo

– Each realization can do its work independently

– At the end, maybe, we need to do some simple processing of all the results

● Large data analysis

– You have a bunch of datasets and a reduction pipeline to work on them

– Use multiple processors to work on the different data files as resources become available

– Each file is processed on a single core

Page 19: Trivially Parallel via Shell Script

● Ex: data analysis—launch independent jobs

● This can be done via a shell script—no libraries necessary

– Loop over the files

● Run jobs until all of the processors are full

● Use lockfiles to indicate that a job is running

● When resources become free, start up the next job

● Let's look at the code...

● Also see GNU parallel

Page 20: How Do We Make Our Code Parallel?

● Despite your best wishes, there is no simple compiler flag "--make-this-parallel"

– You need to understand your algorithm and determine which parts are amenable to parallelism

● However... if the bulk of your work is in one specific piece (say, solving a linear system), you may get all that you need by using a library that is already parallel

– This will require minimal changes to your code

Page 21: Shared Memory vs. Distributed

● Imagine that you have a single problem to solve and you want to divide the work on that problem across the available processors

● If all the cores see the same pool of memory (shared memory), then parallelism is straightforward

– Allocate a single big array for your problem

– Spawn threads: each thread is a separate instance of a sequence of instructions operating on the data

● Multiple threads operate simultaneously

– Each core/thread operates on a smaller portion of the same array, writing to the same memory

– Some intermediate variables may need to be duplicated on each thread—thread-private data

– OpenMP is the standard here

Page 22: Shared Memory vs. Distributed

● Distributed computing: running on a collection of separate computers (CPU + memory, etc.) connected by a high-speed network

– Each task cannot directly see the memory of the other tasks

– You need to explicitly send messages from one machine to another over the network, exchanging the needed data

– MPI is the standard here

Page 23: Shared Memory

● Nodes consist of one or more chips, each with many cores (2–16 typically)

– Everything can access the same pool of memory

[Figure] A single 4-core chip and its pool of memory

Page 24: Shared Memory

● Some machines are more complex—multiple chips, each with their own pool of local memory, can talk to one another on the node

– Latency may be higher when going "off-chip"

● Best performance will require knowing your machine's architecture

[Figure] Two 4-core chips comprising a single node—each has its own pool of memory

Page 25: Ex: Blue Waters Machine

(Cray, Inc.)

Page 26: OpenMP

● Threads are spawned as needed

● When you run the program, there is one thread—the master thread

– When you enter a parallel region, multiple threads run concurrently

(Wikipedia—OpenMP)

Page 27: Parallel Computing

Page 28: OpenMP "Hello World"

● OpenMP is done via directives or pragmas

– They look like comments unless you tell the compiler to interpret them

– The environment variable OMP_NUM_THREADS sets the number of threads

– Support for C, C++, and Fortran

● Hello world:

– Compile with: gfortran -o hello -fopenmp hello.f90

program hello

  !$OMP parallel
  print *, "Hello world"
  !$OMP end parallel

end program hello

Page 29: C Hello World

● In C, the preprocessor is used for the pragmas

#include <stdio.h>

int main() {
    #pragma omp parallel
    printf("Hello world\n");
    return 0;
}

Page 30: OMP Functions

● In addition to using pragmas, there are a few functions that OpenMP provides to get the number of threads, the current thread, etc.

program hello

  use omp_lib

  print *, "outside parallel region, num threads = ", &
       omp_get_num_threads()

  !$OMP parallel
  print *, "Hello world", omp_get_thread_num()
  !$OMP end parallel

end program hello

code: hello-omp.f90

Page 31: OpenMP

● Most modern compilers support OpenMP

– However, the performance across them can vary greatly

– GCC does a reasonable job; Intel is the fastest

● There is an overhead associated with spawning threads

– You may need to experiment

– Some regions of your code may not have enough work to offset the overhead

Page 32: Number of Threads

● There will be a systemwide default for OMP_NUM_THREADS

● Things will still run if you use more threads than the cores available on your machine—but don't!

● Scaling: if you double the number of cores, does the code take 1/2 the time?

Page 33: Aside: Stack vs. Heap

● Memory allocated at compile time is put on the stack, e.g.:

– Fortran: double precision a(1000)

– C: double a[1000]

● Stack memory has a fixed (somewhat small) size

– It's managed by the operating system

– You don't need to clean up this memory

● Dynamic allocation puts the memory on the heap (see the sketch below)

– Much bigger pool

– You are responsible for deallocating it
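
● A minimal Fortran sketch of heap allocation with an allocatable array (the array name and size here are purely illustrative, not from the slides):

program heap_demo
  implicit none
  double precision, allocatable :: a(:)   ! the data for an allocatable goes on the heap
  integer :: i

  allocate(a(1000000))                    ! dynamic allocation at run time

  do i = 1, size(a)
     a(i) = dble(i)
  enddo
  print *, "sum = ", sum(a)

  deallocate(a)                           ! you are responsible for freeing heap memory
end program heap_demo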

Page 34: Parallel Loops

● Ex: matrix multiplication:

program matmul

  use omp_lib

  implicit none

  integer, parameter :: N = 50000
  double precision, allocatable :: a(:,:)
  double precision :: x(N), b(N)
  double precision :: start_omp, finish_omp
  integer :: i, j

  start_omp = omp_get_wtime()

  allocate(a(N,N))

  !$omp parallel private(i, j)
  !$omp do
  do j = 1, N
     do i = 1, N
        a(i,j) = dble(i + j)
     enddo
     x(j) = j
     b(j) = 0.0
  enddo
  !$omp end do

Page 35: Parallel Loops

  ! multiply
  !$omp do
  do j = 1, N
     do i = 1, N
        b(i) = b(i) + a(i,j)*x(j)
     enddo
  enddo
  !$omp end do
  !$omp end parallel

  finish_omp = omp_get_wtime()

  print *, "execution time: ", finish_omp - start_omp

end program matmul

Continued...

code: matmul.f90

Page 36: Timing

● We can use the omp_get_wtime() function to get the current wallclock time (in seconds)

– This is better than, e.g., the Fortran cpu_time() intrinsic, which measures the time for all threads summed together

OMP_NUM_THREADS   run 1 time (s)   run 2 time (s)   run 3 time (s)
 1                26.276           26.294           26.696
 2                18.696           17.514           16.287
 4                 8.628            9.072           10.680
 8                 4.744            6.582            4.923
16                 3.066            3.146            3.111

Timings on 2x Intel Xeon Gold 5115 CPUs using gfortran, N = 50000

Page 37: Loop Ordering

● This is a great example to see the effects of loop ordering—what happens if you switch the order of the loops?

Page 38: Loop Parallelism

● We want to parallelize all the loops we can

– Instead of f(:,:) = 0.d0, we write out the loops and thread them (see the sketch below)

● Private data

– Inside the loop, all threads will have access to all the variables declared in the main program

– For some things, we will want a private copy on each thread. These are put in the private() clause
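
● For example, a minimal sketch of threading such an initialization loop (the bounds nx and ny are just illustrative; f is the array from the slide):

! j, the loop index of the work-shared loop, is private automatically;
! the inner index i must be made private explicitly
!$omp parallel do private(i)
do j = 1, ny
   do i = 1, nx
      f(i,j) = 0.d0
   enddo
enddo
!$omp end parallel do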

Page 39: Reduction

● Suppose you are finding the minimum value of something, or summing

– The loop is spread across threads

– How do we get the data from each thread back into a single variable that all threads see?

● The reduction() clause

– Has both shared and private behaviors

– The compiler ensures that the data is synchronized at the end

Page 40: Reduction

● Example of a reduction:

program reduce

  implicit none

  integer :: i
  double precision :: sum

  sum = 0.0d0

  !$omp parallel do private(i) reduction(+:sum)
  do i = 1, 10000
     sum = sum + exp((mod(dble(i), 5.0d0) - 2*mod(dble(i), 7.0d0)))
  end do
  !$omp end parallel do

  print *, sum

end program reduce

Do we get the same answer when run with a differing number of threads?

code: reduce.f90

Page 41: Example: Relaxation

● In two dimensions, with Δx = Δy, the update for a cell is

    φ(i,j) = 1/4 [ φ(i−1,j) + φ(i+1,j) + φ(i,j−1) + φ(i,j+1) − Δx² f(i,j) ]

– Red-black Gauss-Seidel:

● Update in place

● First update the red cells (the black cells are unchanged)

● Then update the black cells (the red cells are unchanged)

Page 42: Example: Relaxation

● Let's look at the code

● All the two-dimensional loops are wrapped with OpenMP directives

● We can measure the performance

– Fortran 95 has a cpu_time() intrinsic

● Be careful though—it returns the CPU time summed across all threads

– OpenMP has the omp_get_wtime() function

● This returns the wallclock time

– Looking at wallclock time: if we double the number of processors, we want the code to take 1/2 the wallclock time

Page 43: Example: Relaxation

● Performance:

This is an example of a strong scaling test—the amount of work is held fixed as the number of cores is increased

code: relax.f90

groot, with gfortran -Ofast

512x512

threads   wallclock time
 1        1.583
 2        0.8413
 4        0.3979
 8        0.2253
16        0.1634

1024x1024

threads   wallclock time
 1        7.211
 2        3.179
 4        1.717
 8        0.8832
16        0.5076

Page 44: Threadsafe

● When sharing memory, you need to make sure you have private copies of any data that you are changing directly

● This applies to functions that you call in the parallel regions too!

● What if your answer changes when running with multiple threads?

– Some roundoff-level error is to be expected if sums are done in a different order

– Large differences indicate a bug—most likely something needs to be private that is not

● Unit testing

– Run with 1 and with multiple threads and compare the output

Page 45: Threadsafe

● Fortran:

– Common blocks are simply a list of memory spaces where data can be found. This is shared across multiple routines

● Very dangerous—if one thread updates something in a common block, every other thread sees that update

● It is much safer to use arguments to share data between functions

– save statement: the value of the data persists from one call to the next

● What if a different thread is the next to call that function—is the saved quantity the correct value? (see the sketch below)
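
● As an illustration (not from the slides), a routine like the following is not threadsafe, because every thread shares the single saved counter:

subroutine count_calls(n)
  implicit none
  integer, intent(out) :: n
  integer, save :: ncalls = 0   ! one shared copy that persists across calls

  ncalls = ncalls + 1           ! a data race if called from multiple threads
  n = ncalls
end subroutine count_calls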

Page 46: Critical Sections

● Within a parallel region, sometimes you need to ensure that only one thread at a time can write to a variable

● Consider the following:

– If this is in the middle of a loop, what happens if 2 different threads meet the criteria?

– Marking this section as critical will ensure that only one thread changes things at a time (see below)

● Warning: critical sections can be VERY slow

if ( a(i,j) > maxa ) then
   maxa = a(i,j)
   imax = i
   jmax = j
endif
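
● With OpenMP directives, the guarded section would look roughly like this (a sketch, assuming maxa, imax, and jmax are shared variables):

!$omp critical
if ( a(i,j) > maxa ) then
   maxa = a(i,j)   ! only one thread at a time executes this block
   imax = i
   jmax = j
endif
!$omp end critical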

Page 47: OpenMP

● OpenMP is relatively big

Page 48: Porting to OpenMP

● You can parallelize your code piece by piece

● Since OpenMP directives look like comments to the compiler, your old version is still there

● Generally, you are not changing any of your original code—just adding directives

Page 49: More Advanced OpenMP

● The if clause tells OpenMP only to parallelize if a certain condition is met (e.g., a test of the size of an array)

● firstprivate: like private, except each copy is initialized to the original value

● schedule: affects the balance of the work distributed to the threads (see the sketch below)
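
● A hedged sketch combining these clauses on a loop (the array x, its size n, the variable scale, and the 10000 threshold are illustrative assumptions, not from the slides):

! only go parallel if the loop is big enough; each thread starts with
! its own copy of scale initialized to the original value; iterations
! are handed out dynamically to balance the work
!$omp parallel do schedule(dynamic) firstprivate(scale) if(n > 10000)
do i = 1, n
   x(i) = scale * x(i)
enddo
!$omp end parallel do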

Page 50: OpenMP in Python

● Python enforces a "global interpreter lock", which means only one thread can talk to the interpreter at any one time

– OpenMP within pure Python is not possible

● However, C (or Fortran) extensions called from Python can do shared-memory parallelism

– The underlying code can do parallel OpenMP

Page 51: MPI

● The Message Passing Interface (MPI) is the standard library for distributed parallel computing

– Now each core cannot directly see the others' memory

– You need to manage how the work is divided and explicitly send messages from one process to another as needed

Page 52: MPI Hello World

● No longer do we simply use comments—now we call subroutines in the library:

program hello

  use mpi

  implicit none

  integer :: ierr, mype, nprocs

  call MPI_Init(ierr)

  call MPI_Comm_Rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_Size(MPI_COMM_WORLD, nprocs, ierr)

  if (mype == 0) then
     print *, "Running Hello, World on ", nprocs, " processors"
  endif

  print *, "Hello World", mype

  call MPI_Finalize(ierr)

end program hello

Page 53: MPI Hello World

● MPI jobs are run using a command-line tool

– Usually mpirun or mpiexec

– E.g.: mpiexec -n 4 ./hello

● You need to install the MPI libraries on your machine to build and run MPI jobs

– MPICH is the most popular

– Fedora: dnf install mpich mpich-devel mpich-autoload

code: hello_mpi.f90

Page 54: MPI Concepts

● A separate instance of your program is run on each processor—these are the MPI processes

– Threadsafety is not an issue here, since each instance of the program is isolated from the others

● You need to tell the library the datatype of the variable you are communicating and how big it is (the buffer size)

– Together with the address of the buffer, these specify what is being sent

● Processors can be grouped together

– Communicators label the different groups

– MPI_COMM_WORLD is the default communicator (all processes)

● Many types of operations:

– Send/receive, collective communications (broadcast, gather/scatter)

(based on Using MPI)

Page 55: MPI Concepts

● There are > 100 functions

– But you can write any message-passing algorithm with only 6:

● MPI_Init

● MPI_Comm_Size

● MPI_Comm_Rank

● MPI_Send

● MPI_Recv

● MPI_Finalize

– More efficient communication can be done by using some of the more advanced functions

– System vendors will usually provide their own MPI implementation that is well matched to their hardware

(based on Using MPI)

Page 56: Ex: Computing Pi

● This is an example from Using MPI

– Compute π by doing the integral:

    π = ∫₀¹ 4/(1 + x²) dx

● We will divide the interval up, so that each processor sees only a small portion of [0, 1]

● Each processor computes the sum for its intervals

● Add all the integrals together at the end to get the value of the total integral

– We'll pick one processor as the I/O processor—it will communicate with us

– Let's look at the code... (a rough sketch of the idea follows below)

(based on Using MPI)

code: pi.f90
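
● A minimal sketch of the idea (this is not the course's pi.f90: it combines the partial sums with MPI_Reduce, and the number of intervals is hard-coded for illustration):

program compute_pi
  use mpi
  implicit none
  integer, parameter :: n = 100000        ! number of intervals (illustrative)
  integer :: ierr, mype, nprocs, i
  double precision :: h, x, local_sum, pi

  call MPI_Init(ierr)
  call MPI_Comm_Rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_Size(MPI_COMM_WORLD, nprocs, ierr)

  h = 1.d0/n
  local_sum = 0.d0

  ! each process does every nprocs-th interval of the midpoint rule
  do i = mype+1, n, nprocs
     x = h*(dble(i) - 0.5d0)
     local_sum = local_sum + 4.d0/(1.d0 + x*x)
  enddo
  local_sum = h*local_sum

  ! combine the partial sums onto process 0 (the I/O processor)
  call MPI_Reduce(local_sum, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierr)

  if (mype == 0) print *, "pi = ", pi

  call MPI_Finalize(ierr)
end program compute_pi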

Page 57: Send/Receive Example

● The main idea in MPI is sending messages between processes

● MPI_Send() and MPI_Recv() pairs provide this functionality

– This is a blocking send/receive

● For the sending code, the program resumes when it is safe to reuse the buffer

● For the receiving code, the program resumes when the message has been received

– This may cause network contention if the destination process is busy doing its own communication

– See Using MPI for some diagnostics on this

● There are also non-blocking sends, and sends where you explicitly attach a buffer

Page 58: Send/Receive Example

● Simple example (mimics ghost cell filling):

– On each processor, allocate an integer array of 5 elements

– Fill the middle 3 with a sequence (proc 0: 0, 1, 2; proc 1: 3, 4, 5, ...)

– Send messages to fill the left and right elements with the corresponding elements from the neighboring processors

code: send_recv.f90

Page 59: Send/Receive

● Good communication performance often requires staggering the communication

● A combined sendrecv() call can help avoid deadlocking

● Let's look at the same task with MPI_Sendrecv() (a sketch of one exchange follows below)

code: sendrecv.f90
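
● As an illustration only (not the course's sendrecv.f90), one half of the exchange might look like this, assuming an integer array data(0:4) with data(1:3) holding local values, neighbor ranks left and right set to MPI_PROC_NULL at the domain ends, and an integer ierr:

! send my rightmost interior value to the right neighbor while
! receiving my left ghost cell from the left neighbor;
! a mirrored call handles the other direction
call MPI_Sendrecv(data(3), 1, MPI_INTEGER, right, 0, &
                  data(0), 1, MPI_INTEGER, left,  0, &
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)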

Page 60: Parallel Computing

Page 61: Relaxation

● Let's do the same relaxation problem, but now using MPI instead of OpenMP

– In the OpenMP version, we allocated a single array covering the entire domain, and all processors saw the whole array

– In the MPI version, each processor will allocate a smaller array, covering only a portion of the entire domain, and it will only see its part directly

Page 62: Relaxation

● We will do 1-d domain decomposition

– Each processor allocates a slab that covers the full y-extent of the domain

– The width in x is nx/nprocs (see the sketch below)

● If it is not evenly divisible, then some slabs have a width of 1 more cell

– A perimeter of 1 ghost cell surrounds each subgrid

● We will refer to a global index space [0:nx-1]×[0:ny-1]

– The memory needs are spread across all processors

– Arrays are allocated as:

f(ilo-ng:ihi+ng, jlo-ng:jhi+ng)
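
● One way each rank might compute its slab bounds (a sketch, assuming the names above; nlocal and nextra are illustrative helpers, and any remainder cells go one per low-numbered rank):

! nx cells split over nprocs ranks; the first mod(nx, nprocs) ranks get one extra
nlocal = nx / nprocs
nextra = mod(nx, nprocs)

if (mype < nextra) then
   nlocal = nlocal + 1
   ilo = mype*nlocal
else
   ilo = mype*nlocal + nextra
endif
ihi = ilo + nlocal - 1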

Page 63: Relaxation

● The left set of ghost cells is filled by receiving a message from the processor (slab) to the left

Page 64: Relaxation

● The right set of ghost cells is filled by receiving a message from the processor (slab) to the right

● The top and bottom ghost cells are physical boundaries

Page 65: Domain Decomposition

● Generally speaking, you want to minimize the surface-to-volume ratio (this reduces communication)

Page 66: Relaxation

● Most of the parallelism comes in the ghost cell filling

– Fill the left GCs by receiving data from the processor to the left

– Fill the right GCs by receiving data from the processor to the right

– Send/receive pairs—we want to try to avoid contention (this can be very tricky, and people spend a lot of time worrying about this...)

● On the physical boundaries, we simply fill as usual

● The way this is written, our relaxation routine doesn't need to do any parallelism itself—it just operates on the domain it is given

● For computing a norm, we will need to reduce the local sums across processors (see the sketch below)

● Let's look at the code...

code: relax_mpi.f90
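
● For instance, an L2 norm could be assembled roughly like this (a sketch; the residual array e and the other names are illustrative, not taken from relax_mpi.f90):

! each rank sums the squares over its own interior cells...
local_sum = 0.d0
do j = jlo, jhi
   do i = ilo, ihi
      local_sum = local_sum + e(i,j)**2
   enddo
enddo

! ...then the partial sums are combined on every rank
call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                   MPI_SUM, MPI_COMM_WORLD, ierr)

norm = sqrt(global_sum/(dble(nx)*dble(ny)))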

Page 67: MPI Relaxation Results

● Note that the smaller problem sizes become work-starved more easily

Page 68: Weak vs. Strong Scaling

● In assessing the parallel performance of your code, there are two methods that are commonly used

– Strong scaling: keep the problem size fixed and increase the number of processors

● Eventually you will become work-starved, and your scaling will stop (communication dominates)

– Weak scaling: increase the amount of work in proportion to the number of processors

● In this case, perfect scaling will result in the same wallclock time for all processor counts

Page 69: Ex: Maestro Scaling

● Maestro is a publicly available adaptive mesh refinement low Mach number hydrodynamics code

– It models astrophysical flows

– General equation of state, reactions, implicit diffusion

– Elliptic constraint enforced via multigrid

– https://github.com/AMReX-Astro/MAESTRO

Page 70: Ex: Maestro Scaling

Page 71: Ex: Maestro Scaling

Page 72: Ex: Castro Scaling

● Castro is a publicly available adaptive mesh refinement compressible radiation hydrodynamics code

– It is used to model stellar explosions

– Self-gravity is solved via multigrid

Page 73: Ex: Castro Scaling

Page 74: Debugging

● There are parallel debuggers (but these are pricey)

● It's possible to spawn multiple gdb sessions, but this gets out of hand quickly

● Print is still your friend

– Run as small a problem as possible on as few processors as necessary

● Some roundoff-level differences are to be expected from sums (different order of operations)

Page 75: Hybrid Parallelism

● To get good performance on current supercomputers, you need to do hybrid parallelism:

– OpenMP within a node, MPI across nodes

● For example, in our MPI relaxation code, we could split the loops over each subdomain across multiple cores on a node using OpenMP

– Then we have MPI to communicate across nodes and OpenMP within nodes

– This hybrid approach is often needed to get the best performance on big machines

Page 76: Parallel Python

● MPI has interfaces for Fortran and C/C++

● There are several Python modules for MPI

– mpi4py: a module that can be imported into Python

– pyMPI: changes the Python interpreter itself

Page 77: Parallel Python

● Hello world:

● Run with mpiexec -n 4 python hello.py

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

print("Hello, world", rank)

Page 78: Parallel Python

● We can easily parallelize our Monte Carlo poker odds code

– Each processor considers hands independently

– Do a reduction at the end

code: poker-mpi.py

Page 79: Parallel Libraries

● There are lots of libraries that provide parallel frameworks for writing your application

● Some examples:

– Linear algebra / PDEs

● PETSc: linear and nonlinear system solvers, parallel matrix/vector routines

● hypre: sparse linear system solver

– I/O

● HDF5: platform-independent parallel I/O built on MPI-IO

– Adaptive mesh refinement (grids)

● BoxLib: logically Cartesian AMR with elliptic solvers

Page 80: Coarray Fortran

● Part of the Fortran 2008 standard

– A parallel version of Fortran

– A separate image (instance of the program) is run on each processor

– [ ] on arrays is used to refer to data on different processors (see the sketch below)

– Not yet widely available
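
● A tiny illustrative sketch of the coarray syntax (not from the slides; it assumes the program is run with at least two images):

program coarray_demo
  implicit none
  integer :: n[*]        ! a scalar coarray: one copy per image

  n = this_image()       ! each image stores its own number

  sync all               ! make sure every image has written n

  if (this_image() == 1 .and. num_images() >= 2) then
     print *, "image 2 holds", n[2]   ! [ ] indexes another image's copy
  endif
end program coarray_demo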

Page 81: GPUs

● GPU offloading can greatly accelerate computing

● Main issue: data needs to transfer from the CPU across the (relatively slow) PCIe bus to the GPU

– Good performance requires that lots of work is done on the data to "pay" for the cost of the transfer

● GPUs work as SIMD parallel machines

– The same instructions operate on all the data in lockstep

– Branching (if-tests) is slower

● Best performance requires that you structure your code to be vectorized

Page 82: OpenACC

● OpenACC is a directives-based method for offloading computing to GPUs

– It looks like OpenMP

– A big difference is that you need to explicitly add directives that control data movement

● There's a big cost in moving data from the CPU to the GPU

– You need to do a lot of computing on the GPU to cover that expense

– We can separately control what is copied to and from the GPU

● We can do our same relaxation example using OpenACC

– Note: we need to explicitly write out the separate red-black updates to ensure that a loop doesn't access adjacent elements

code: relax-openacc.f90

Page 83: OpenACC

code: relax-openacc.f90

!$acc data copyin(f, dx, imin, imax, jmin, jmax, bc_lo_x, bc_hi_x, bc_lo_y, bc_hi_y) copy(v)

do m = 1, nsmooth

   !$acc parallel

   !$acc loop
   do j = jmin, jmax
      v(imin-1,j) = 2*bc_lo_x - v(imin,j)
      v(imax+1,j) = 2*bc_hi_x - v(imax,j)
   enddo

   !$acc loop
   do i = imin, imax
      v(i,jmin-1) = 2*bc_lo_y - v(i,jmin)
      v(i,jmax+1) = 2*bc_hi_y - v(i,jmax)
   enddo

   !$acc wait

   !$acc loop collapse(2)
   do j = jmin, jmax, 2
      do i = imin, imax, 2
         v(i,j) = 0.25d0*(v(i-1,j) + v(i+1,j) + &
                          v(i,j-1) + v(i,j+1) - dx*dx*f(i,j))
      enddo
   enddo

   ...

This is part of the smoother function marked up with OpenACC

Page 84: OpenACC

● This relaxation code runs about 30× faster on the GPU vs. the CPU (single core) on a local machine

– Note: when comparing CPU to GPU, a fair comparison would include all of the CPU cores, so for a 12-core machine, it is about 2.5× faster on the GPU

Page 85: Supercomputing Centers

● Supercomputing centers

– National centers run by the NSF (through the XSEDE program) and the DOE (NERSC, OLCF, ALCF)

– You can apply for time—starter accounts are available at most centers to get up to speed

– To get lots of time, you need to demonstrate that your codes can scale to O(10^4) processors or more

● Queues

– You submit your job to a queue, specifying the number of processors (MPI + OpenMP threads) and the length of time

– Typical queue windows are 2–24 hours

– The job waits until resources are available

Page 86: Supercomputing Centers

● Checkpoint/restart

– Long jobs won't be able to finish in the limited queue window

– You need to write your code so that it saves all of the data necessary to restart where it left off

● Archiving

– Mass storage at the centers is provided (usually through HPSS)

– Typically you generate far more data than is reasonable to bring back locally—remote analysis and visualization are necessary

Page 87: Future...

● The big thing in supercomputing these days is accelerators

– GPUs or Intel Phi boards

– They add a SIMD-like capability to the more general CPU

● Originally with GPUs, there were proprietary languages for interacting with them (e.g., CUDA)

● Currently, OpenACC is an OpenMP-like way of dealing with GPUs/accelerators

– Still maturing

– Portable

– It will merge with OpenMP in the near future

● Data transfer to the accelerators moves across the slow system bus

– Future processors may move these capabilities on-die