
Scaling hybrid coarray/MPI miniapps on Archer - cug.org · Scaling hybrid coarray/MPI miniapps on Archer ... I HECToR, ARCHER, Intel, OpenCoarrays/GCC ... states I Fixed or self-similar

Apr 28, 2018


Scaling hybrid coarray/MPI miniapps on Archer

L. Cebamanos (1), A. Shterenlikht (2), J.D. Arregui-Mena (3), L. Margetts (3)

(1) Edinburgh Parallel Computing Centre (EPCC), The University of Edinburgh, King's Buildings, Edinburgh EH9 3FD, UK

Email: [email protected]

(2) Department of Mechanical Engineering, The University of Bristol, Bristol BS8 1TR, UK, Email: [email protected]

(3) School of Mechanical, Aero and Civil Engineering, The University of Manchester, Manchester M13 9PL, UK

Emails: [email protected], Lee.Margetts@manchester.ac.uk

CUG2016, London, 8-12 May 2016

CGPACK - cellular automata microstructure simulation library: https://sourceforge.net/projects/cgpack

- Solidification, recrystallisation, and fracture of polycrystalline microstructures.
- Fortran 2008 coarrays + TS 18508 [1] extensions.
- HECToR, ARCHER, Intel, OpenCoarrays/GCC systems.
- BSD license

[Figure: {100} and {110} micro-cracks in individual crystals merge into a macro-crack.]

CGPACK design

- CA space coarray - 4D array, 3D corank - structured grid [2, 3, 4].
- Integer cell states
- Fixed or self-similar boundaries
- Traditional halo exchange

CGPACK space coarray:

    integer, allocatable :: space(:,:,:,:)[:,:,:]

- Discrete space, discrete time
- Mesh independent results require ≥ 10^5 CA cells per crystal on average [5].
- Crystal (grain) is a cluster of cells of the same value.

CGPACK IO - unresolved

- MPI/IO speeds up to 2.3 GB/s on HECToR (Cray XE6) [6].
- MPI/IO can reach 14 GB/s on ARCHER (Cray XC30) [7].
- NetCDF (not yet implemented) - higher level of abstraction, sits on top of MPI/IO [8].

10^6 grains, 10^11 cells - 400 GB dataset, > 4 hours on 1000 ARCHER nodes (24k cores).
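The dataset size quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only and is written in Python rather than Fortran. The 4-byte cell size is an assumption (a default-kind integer cell state), chosen because it reproduces the 400 GB figure. The transfer times are lower bounds derived from the peak MPI/IO rates on the slide, not measurements.

```python
# Back-of-envelope check of the dataset size and best-case write times.
GRAINS = 10**6
CELLS_PER_GRAIN = 10**5      # resolution per crystal quoted on the slides
BYTES_PER_CELL = 4           # assumption: default 4-byte integer cell state

cells = GRAINS * CELLS_PER_GRAIN
dataset_gb = cells * BYTES_PER_CELL / 1e9


def write_time_s(size_gb, rate_gb_per_s):
    """Best-case time to write size_gb at a sustained rate_gb_per_s."""
    return size_gb / rate_gb_per_s


print(f"cells: {cells:.0e}, dataset: {dataset_gb:.0f} GB")
for machine, rate in (("HECToR", 2.3), ("ARCHER", 14.0)):
    print(f"{machine}: {write_time_s(dataset_gb, rate):.1f} s at {rate} GB/s")
```

Under these assumptions a single dump of the full model takes minutes at best on HECToR and tens of seconds at best on ARCHER.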

CGPACK scaling

- Up to 32k cores on HECToR and ARCHER for solidification problems.
- Scaling varies for different programs built with CGPACK, depending on which routines are called, in what order, and requirements for synchronisation.

[Figure: speed-up (1 to 1000) vs number of cores (8 to 32768) on HECToR XE6; curves: sync all, sync images serial, sync images d&c, co_sum.]

ParaFEM - scalable general purpose finite element library

- http://parafem.org.uk
- Fortran 90, MPI
- Highly portable, many users [9]
- Excellent scaling
- BSD license

[Figure: time in seconds (1 to 10000) vs number of MPI processes (10 to 100000); curves: Actual, Ideal.]

Cellular Automata Finite Element (CAFE)

- Used for solidification [10], recrystallisation [11] and fracture [12, 13].
- FE - continuum mechanics - stress, strain, etc.
- CA - crystals, crystal boundaries, cleavage, grain boundary fracture
- FE -> CA - stress, strain
- CA -> FE - damage variables

CAFE design: structured CA grid + unstructured FE grid

[Figure: a body (domain) and its material combined into a multi-scale model, FE + CA = CAFE. The model is split over PE 1 to PE 4, each PE holding one MPI process (MPI 1 to MPI 4) and one coarray image (image 1 to image 4).]

Example with 4 PE (4 MPI processes, 4 coarray images). Arrows are FE <-> CA comms.

FE -> CA mapping via a private allocatable array of derived type:

    type mcen
      integer :: image, elnum
      real :: centr(3)
    end type mcen
    type( mcen ), allocatable :: lcentr(:)

based on coordinates of FE centroids calculated by each MPI process and stored in the centroid_tmp coarray:

    type rca
      real, allocatable :: r(:,:)
    end type rca
    type( rca ) :: centroid_tmp[*]
      :
    allocate( centroid_tmp%r(3, nels_pp) )

where nels_pp is the number of FE stored on this PE.
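A minimal serial sketch of the lcentr data structure, in Python rather than Fortran. The slides say only that the mapping is based on centroid coordinates. The function build_lcentr, its box_lo and box_hi parameters and the box test itself are hypothetical, and stand for "this centroid lies inside the CA block owned by this image".

```python
from dataclasses import dataclass


@dataclass
class Mcen:
    image: int     # image (MPI process) that stores the finite element
    elnum: int     # element number, local to that image
    centr: tuple   # centroid coordinates (x, y, z)


def build_lcentr(centroids_by_image, box_lo, box_hi):
    """Collect the elements whose centroids fall inside one image's CA box.

    centroids_by_image plays the role of the centroid_tmp coarray:
    image number -> list of (x, y, z) centroids of the FE stored there.
    """
    lcentr = []
    for image, centroids in centroids_by_image.items():
        for elnum, c in enumerate(centroids, start=1):
            if all(lo <= x < hi for lo, x, hi in zip(box_lo, c, box_hi)):
                lcentr.append(Mcen(image, elnum, tuple(c)))
    return lcentr


# Two images; the CA block of image 1 is assumed to be the unit cube.
centroids_by_image = {
    1: [(0.2, 0.5, 0.5), (1.5, 0.5, 0.5)],
    2: [(0.7, 0.1, 0.1), (1.9, 0.9, 0.9)],
}
lcentr_1 = build_lcentr(centroids_by_image, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
print([(e.image, e.elnum) for e in lcentr_1])
```

In this toy case image 1 ends up with one of its own elements and one element stored on image 2, which is the situation where FE <-> CA communication has to cross images.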

lcentr arrays on images P and Q

[Figure: two PEs (image, MPI process), P and Q, each shown with its finite elements, its CA block and its lcentr array with fields image, elnum and centr. On P, lcentr holds element n of image Q (centroid r) and element b of image P (centroid s). On Q, lcentr holds element m of image Q (centroid u) and element a of image P (centroid t).]

All-to-all vs nearest neighbour for lcentr

- cgca_pfem_cenc - all-to-all routine.
- cgca_pfem_map - nearest neighbour. It uses temporary arrays and the coarray collectives CO_SUM and CO_MAX, which are described in TS 18508 [1] and will be included in the next revision of the Fortran standard, Fortran 2015. At the time of writing, coarray collectives are available on Cray systems as an extension to the standard [14].

The two routines differ in their use of remote communications.

cgca_pfem_map

    integer :: maxfe, start, pend, ctmpsize
    real, allocatable :: tmp(:,:)
    ! Calc. the max num. of FE stored on this img
    maxfe = size( centroid_tmp%r, dim=2 )
    ctmpsize = maxfe
    call co_max( source = maxfe )
    allocate( tmp( maxfe*num_images(), 5 ), source=0.0 )
    ! Each image writes to a unique portion of tmp
    start = ( this_image() - 1 )*maxfe + 1
    pend = start + ctmpsize - 1
    tmp( start : pend, 1 ) = real( this_image(), kind=4 )
    ! Write element number *as real*
    tmp( start : pend, 2 ) = &
      real( (/ ( j, j = 1, ctmpsize ) /), kind=4 )
    ! Write centroid coord
    tmp( start : pend, 3:5 ) = &
      transpose( centroid_tmp%r(:,:) )
    call co_sum( source = tmp )
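The packing step in cgca_pfem_map can be modelled serially to see why a sum is enough. The sketch below is Python, not coarray Fortran. pack_image and co_sum are illustrative stand-ins (co_sum here is a plain element-wise sum over a list of per-image arrays, not the collective), and the centroid values are made up.

```python
def pack_image(image, centroids, maxfe, num_images):
    """Zero-filled tmp for one image, with only that image's slice written."""
    tmp = [[0.0] * 5 for _ in range(maxfe * num_images)]
    start = (image - 1) * maxfe            # 0-based start of this image's slice
    for j, (x, y, z) in enumerate(centroids, start=1):
        # columns: image number, element number (as real), centroid x, y, z
        tmp[start + j - 1] = [float(image), float(j), x, y, z]
    return tmp


def co_sum(arrays):
    """Element-wise sum over images, standing in for the CO_SUM collective."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*arrays)]


centroids_by_image = {
    1: [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)],
    2: [(1.0, 1.1, 1.2)],
}
num_images = len(centroids_by_image)
maxfe = max(len(c) for c in centroids_by_image.values())   # role of CO_MAX
tmp = co_sum([pack_image(i, c, maxfe, num_images)
              for i, c in centroids_by_image.items()])
for row in tmp:
    print(row)
```

Because every image writes to a disjoint slice and leaves the rest at zero, the sum behaves like a gather: after it, each image holds the full table. Rows that were only padding stay all zero and can be recognised by image number 0.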

Initial CAFE scaling

[Figure: runtime (time, s, 100 to 10000) and scaling (1 to 10) on log axes, horizontal axis from 100 to 100000; curves: runtime, scaling.]

ParaFEM/CGPACK MPI/coarray miniapp scaling on ARCHER XC30 for a 3D problem with 1M FE and 800M CA cells.

Initial profiling

Profiling function distribution for ParaFEM/CGPACK MPI/coarray miniapp with all-to-all routine cgca_gcupda at 7200 cores.

Initial profiling

Raw profiling data for ParaFEM/CGPACK MPI/coarray miniapp with all-to-all routine cgca_gcupda at 7200 cores.

cgca_gcupda - all-to-all

    integer :: gcupd(100,3)[*], rndint, j, &
      img, gcupd_local(100,3)
    real :: rnd
      :
    call random_number( rnd )
    rndint = int( rnd*num_images() ) + 1
    do j = rndint, rndint + num_images() - 1
      img = j
      if ( img .gt. num_images() ) &
        img = img - num_images()
      if ( img .eq. this_image() ) cycle
        :
      gcupd_local(:,:) = gcupd(:,:)[img]
        :
    end do
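The loop above visits every other image exactly once, starting from a random image and wrapping around. A small serial model in Python (not Fortran) makes the visiting order and the cost explicit. visit_order is a hypothetical helper name, and treating 7200 cores as 7200 images is an assumption based on the one-image-per-PE layout shown earlier.

```python
def visit_order(this_image, num_images, rnd):
    """Order in which one image reads gcupd from the other images.

    rnd is a uniform random number in [0, 1), as returned by random_number.
    """
    rndint = int(rnd * num_images) + 1
    order = []
    for j in range(rndint, rndint + num_images):
        img = j - num_images if j > num_images else j   # wrap around
        if img == this_image:
            continue                                    # skip self
        order.append(img)
    return order


print(visit_order(this_image=2, num_images=4, rnd=0.6))

# Remote reads per call of the all-to-all routine, summed over all images.
num_images = 7200
total_reads = num_images * (num_images - 1)
print(total_reads)
```

The number of remote reads grows as the square of the number of images, which is consistent with the all-to-all routine dominating the profile at 7200 cores.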

cgca_gcupdn - nearest neighbour

    do i = -1, 1
    do j = -1, 1
    do k = -1, 1
      ! Get the coindex set of the neighbour
      ncod = mycod + (/ i, j, k /)
        :
      gcupd_local(:,:) = &
        gcupd(:,:)[ ncod(1), ncod(2), ncod(3) ]
        :
    end do
    end do
    end do

Note: the nearest neighbour routine must be called multiple times to propagate changes from every image to all other images.
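How many calls the note implies can be estimated with a serial model, sketched here in Python rather than Fortran. It assumes a non-periodic 3D grid of images, that each call lets an image pick up everything its 26 neighbours currently hold, and that neighbours outside the grid are skipped. The function name and the grid sizes are hypothetical.

```python
from itertools import product


def rounds_to_propagate(codims):
    """Nearest-neighbour calls needed until every image has data from all images."""
    images = list(product(*(range(n) for n in codims)))
    known = {img: {img} for img in images}      # origins each image has seen
    rounds = 0
    while any(len(k) < len(images) for k in known.values()):
        new = {}
        for img in images:
            merged = set(known[img])
            for d in product((-1, 0, 1), repeat=3):
                nb = tuple(c + s for c, s in zip(img, d))
                if nb in known:                 # skip neighbours off the grid
                    merged |= known[nb]
            new[img] = merged
        known = new
        rounds += 1
    return rounds


print(rounds_to_propagate((4, 3, 2)))
print(rounds_to_propagate((2, 2, 2)))
```

Under these assumptions the number of calls equals the largest codimension minus one, so it grows with the linear size of the image grid, while each call touches at most 26 neighbours regardless of the total number of images.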

Profiling cgca_gcupdn

Profiling function distribution for ParaFEM/CGPACK MPI/coarray miniapp with the nearest neighbour routine cgca_gcupdn at 7200 cores.

Profiling cgca_gcupdn

Raw profiling data for ParaFEM/CGPACK MPI/coarray miniapp with the nearest neighbour routine cgca_gcupdn at 7200 cores.

Scaling improvement with cgca_gcupdn over cgca_gcupda

[Figure: runtime (time, s, 10 to 10000) and scaling (1 to 100) on log axes, horizontal axis from 100 to 100000; curves: cgca_gcupda runtime, cgca_gcupdn runtime, cgca_gcupdn scaling.]

Runtimes and scaling for ParaFEM/CGPACK MPI/coarray miniapp with the nearest neighbour, cgca_gcupdn, and all-to-all, cgca_gcupda, algorithms. Scaling limit increased from 2k to 7k cores.

Profiling with cgca_pfem_map

Profiling function distribution for ParaFEM/CGPACK MPI/coarray miniapp with cgca_gcupdn and cgca_pfem_map at 7200 cores.

Profiling with cgca_pfem_map

Raw profiling data for ParaFEM/CGPACK MPI/coarray miniapp with cgca_gcupdn and cgca_pfem_map at 7200 cores.

Profiling with cgca_pfem_map

[Figure: runtime (time, s, 10 to 10000) and scaling (1 to 100) on log axes, horizontal axis from 100 to 100000; curves: _map runtime, _cenc runtime, _map scaling.]

Runtimes and scaling for ParaFEM/CGPACK MPI/coarray miniapp with cgca_pfem_map and cgca_pfem_cenc. Either cgca_pfem_map or cgca_pfem_cenc is called only once during the execution of the miniapp. Hence only a minor improvement is obtained, and only from about 1000 cores.

Issues with CrayPAT

cgca_gcupda is top in sampling results, but is absent from tracing. It is called the same number of times as cgca_hxi.

Issues with CrayPAT

All profiling was done with a single thread.

Incorrect number of threads identified by CrayPAT in a tracing experiment of ParaFEM/CGPACK MPI/coarray miniapp with cgca_gcupda.

Future work - optimisation of coarray synchronisation

    ! ===>>> implicit sync all inside <<<===
    call cgca_sld( cgca_space, .false., 0, 10, cgca_solid )
    call cgca_igb( cgca_space )
    call cgca_gbs( cgca_space )
    call cgca_hxi( cgca_space )
    sync all
    call cgca_gcu( cgca_space )
    sync all

- Some routines have sync inside.
- Other sync responsibility is left to the end user.
- Over-synchronisation?
- Enough sync is required by the standard. A standard conforming Fortran coarray program will not deadlock or suffer races.

[1] ISO/IEC JTC1/SC22/WG5 N2074, TS 18508 Additional Parallel Features in Fortran, 2015.

[2] A. Shterenlikht and L. Margetts, "Three-dimensional cellular automata modelling of cleavage propagation across crystal boundaries in polycrystalline microstructures," Proc. Roy. Soc. A, vol. 471, p. 20150039, 2015. [Online]. Available: http://eis.bris.ac.uk/~mexas/pub/2015prsa.pdf

[3] A. Shterenlikht, L. Margetts, L. Cebamanos, and D. Henty, "Fortran 2008 coarrays," ACM Fortran Forum, vol. 34, pp. 10-30, 2015. [Online]. Available: http://eis.bris.ac.uk/~mexas/pub/2015acmff.pdf

[4] A. Shterenlikht, "Fortran coarray library for 3D cellular automata microstructure simulation," in Proc. 7th PGAS Conf., 3-4 October 2013, Edinburgh, Scotland, UK, M. Weiland, A. Jackson, and N. Johnson, Eds. The University of Edinburgh, 2014, pp. 16-24. [Online]. Available: http://www.pgas2013.org.uk/sites/default/files/pgas2013proceedings.pdf

[5] J. Phillips, A. Shterenlikht, and M. J. Pavier, "Cellular automata modelling of nano-crystalline instability," in Proc. 20th UK ACME Conf. 27-28 March 2012, The University of Manchester, UK, 2012. [Online]. Available: http://eis.bris.ac.uk/~mexas/pub/2012_ACME.pdf

[6] A. Shterenlikht, "Fortran 2008 coarrays," Invited talk at Fortran specialist group meeting, BCS/IoP joint meeting, 2014. [Online]. Available: http://eis.bris.ac.uk/~mexas/pub/coar_bcs.pdf

[7] D. Henty, A. Jackson, C. Moulinec, and V. Szeremi, "Performance of Parallel IO on ARCHER, version 1.1," ARCHER White Papers, 2015. [Online]. Available: http://archer.ac.uk/documentation/white-papers/parallelIO/ARCHER_wp_parallelIO.pdf

[8] T. Collins, "Using NETCDF with Fortran on ARCHER, version 1.1," ARCHER White Papers, 2016. [Online]. Available: http://archer.ac.uk/documentation/white-papers/fortanIO_netCDF/fortranIO_netCDF.pdf

[9] I. M. Smith, D. V. Griffiths, and L. Margetts, Programming the Finite Element Method, 5th ed. Wiley, 2014.

[10] C. A. Gandin and M. Rappaz, "A coupled finite element-cellular automaton model for the prediction of dendritic grain structures in solidification processes," Acta Met. and Mat., vol. 42, no. 7, pp. 2233-2246, 1994.

[11] C. Zheng and D. Raabe, "Interaction between recrystallization and phase transformation during intercritical annealing in a cold-rolled dual-phase steel: A cellular automaton model," Acta Materialia, vol. 61, pp. 5504-5517, 2013.

[12] A. Shterenlikht and I. C. Howard, "The CAFE model of fracture - application to a TMCR steel," Fatigue Fract. Eng. Mater. Struct., vol. 29, pp. 770-787, 2006.

[13] S. Das, A. Shterenlikht, I. C. Howard, and E. J. Palmiere, "A general method for coupling microstructural response with structural performance," Proc. Roy. Soc. A, vol. 462, pp. 2085-2096, 2006.

[14] ISO/IEC 1539-1:2010, Fortran - Part 1: Base language, International Standard, 2010.