Page 1: Distributed computing of the GEOS-Chem model


Distributed computing of the GEOS-Chem model

Kevin Bowman

Lei Pan, Qinbin Li, and Paul von Allmen

California Institute of Technology

Jet Propulsion Laboratory

Page 2: Distributed computing of the GEOS-Chem model


Objectives

• The objective of this activity is to develop a scalable, parallel version of the GEOS-Chem code based on a distributed computing architecture that is suitable for the JPL 1024 processing element (PE) institutional cluster

• The goal is to improve the GEOS-Chem wall-clock performance by at least one order of magnitude over the current capability

• The current capability: the speedup of GEOS-Chem with the number of CPUs currently plateaus at 4 processors on a shared-memory platform such as the SGI O2K. Best wall-clock performance is completion of a 1-month model simulation on a 200 x 250 km grid within 1 day.

Page 3: Distributed computing of the GEOS-Chem model


Approach

• The primary calculations in GEOS-Chem are:

– Chemistry (60%)

– Transport, deposition, emissions (40%)

• The chemistry component is inherently parallel and therefore the most logical starting point.

• The initial stage is to use a master/slave architecture for the parallelization of the chemistry.

• The second stage is to migrate towards a domain decomposition design that will handle both transport and chemistry.

Page 4: Distributed computing of the GEOS-Chem model


GEOS-Chem computational flow (flowchart, rendered here as an outline):

Initialization
Start 6-h loop
   Met fields (a3 & a6): unzip, read
   Seasonal, monthly, daily data; interpolate met fields
   Compute air mass quantities; unit conversion: kg -> v/v
   Start dynamic time step
      Transport         DO_TRANSPORT (transport_mod.f): TPCORE (tpcore_mod.f), TPCORE_FVDAS (tpcore_fvdas_mod.f90)
      Turbulent Mixing  TURBDAY (turbday.f)
      Convection        DO_CONVECTION (convection_mod.f): NFCLDMX (convection_mod.f), FVDAS_CONVECT (fvdas_convect_mod.f90)
      Dry Deposition    DO_DRYDEP (drydep_mod.f): DEPVEL (drydep_mod.f)
      Emissions         DO_EMISSIONS (emissions_mod.f): EMISSDR (emissdr.f)
      Chemistry         DO_CHEMISTRY (chemistry_mod.f): CHEM (chem.f), PHYSPROC (physproc.f), CALCRATE (calcrate.f), SMVGEAR (smvgear.f)
      Wet Deposition    DO_WETDEP (wetscav_mod.f): WETDEP (wetscav_mod.f)
      Upper boundary flux conditions; compute air mass quantities; unit conversion: kg -> v/v
      Archive diagnostics (diag3); timeseries diagnostics (diag49)
   End dynamic time step
End 6-h loop

Time-step lengths marked in the figure: 15 min and 60 min, depending on the operation.
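As a minimal sketch (not the actual GEOS-Chem driver), the flow above maps onto a nested loop of this shape; the loop counts are placeholders and the called routines are empty stubs named after the modules in the figure:

program flow_sketch
   implicit none
   integer, parameter :: n_6h = 4      ! placeholder: 6-h met periods to run
   integer, parameter :: n_dyn = 24    ! placeholder: dynamic steps per 6-h period
   integer :: i6h, istep

   do i6h = 1, n_6h                    ! 6-h met-field loop
      call read_met_fields()           ! a3 & a6 files: unzip, read, interpolate
      do istep = 1, n_dyn              ! dynamic time-step loop
         call do_transport()           ! transport_mod: TPCORE / TPCORE_FVDAS
         call do_turbulent_mixing()    ! TURBDAY
         call do_convection()          ! convection_mod
         call do_drydep()              ! drydep_mod
         call do_emissions()           ! emissions_mod
         call do_chemistry()           ! chemistry_mod: PHYSPROC, CALCRATE, SMVGEAR
         call do_wetdep()              ! wetscav_mod
         call archive_diagnostics()    ! diag3 / diag49
      end do
   end do

contains

   subroutine read_met_fields()
   end subroutine read_met_fields
   subroutine do_transport()
   end subroutine do_transport
   subroutine do_turbulent_mixing()
   end subroutine do_turbulent_mixing
   subroutine do_convection()
   end subroutine do_convection
   subroutine do_drydep()
   end subroutine do_drydep
   subroutine do_emissions()
   end subroutine do_emissions
   subroutine do_chemistry()
   end subroutine do_chemistry
   subroutine do_wetdep()
   end subroutine do_wetdep
   subroutine archive_diagnostics()
   end subroutine archive_diagnostics

end program flow_sketch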

Page 5: Distributed computing of the GEOS-Chem model


Master/slave architecture

(Figure: logical sequence for one time step.)

GEOS-Chem master node: transport, turbulent mixing, convection; dry deposition, emissions
Slave nodes:           chemistry (the chemistry workload is farmed out across the slave nodes)
GEOS-Chem master node: wet deposition

Page 6: Distributed computing of the GEOS-Chem model


Chemistry

Serial chemistry loop (single PE):

   PHYSPROC (physproc.f)
   FOR ii = 1, 2300 DO
      CALCRATE (calcrate.f)
      SMVGEAR  (smvgear.f)
   ENDDO

Master/slave chemistry loop (PE 1 = master; PE 2 ... PE N = slaves):

   PHYSPROC
   FOR ii = 1, 2300/N DO
      PE 1:           CALCRATE, SMVGEAR, MPI-SEND, MPI-RECEIVE
      PE 2 ... PE N:  CALCRATE, SMVGEAR
   ENDDO
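Below is a minimal MPI rendering of this pattern, assuming a flat array of grid-box values; the slide's explicit MPI-SEND/MPI-RECEIVE pairs are expressed here with the equivalent MPI_Scatter/MPI_Gather collectives, the multiplication stands in for CALCRATE + SMVGEAR, and the master also works on one block for simplicity. Names, sizes, and the even division of the 2300 boxes are illustrative, not the project's actual code.

program masterslave_sketch
   use mpi
   implicit none
   integer, parameter :: nbox = 2300           ! grid boxes with chemistry (from the slide)
   integer :: ierr, rank, nprocs, nloc
   real(8), allocatable :: conc(:), local(:)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   nloc = nbox / nprocs                        ! assume nbox divides evenly (sketch only)
   allocate(local(nloc))
   if (rank == 0) then
      allocate(conc(nbox))
      conc = 1.0d0                             ! placeholder species concentrations
   else
      allocate(conc(1))                        ! dummy buffer on the slave PEs
   end if

   ! master -> slaves: hand each PE its block of grid boxes
   call MPI_Scatter(conc, nloc, MPI_DOUBLE_PRECISION, &
                    local, nloc, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

   ! stand-in for CALCRATE + SMVGEAR on this PE's block
   local = 0.99d0 * local

   ! slaves -> master: return the integrated block
   call MPI_Gather(local, nloc, MPI_DOUBLE_PRECISION, &
                   conc, nloc, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

   call MPI_Finalize(ierr)
end program masterslave_sketch

Compiled with an MPI Fortran wrapper (e.g. mpif90) and launched on N processors, each PE receives 2300/N boxes, matching the loop bound in the figure.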

Page 7: Distributed computing of the GEOS-Chem model


Amdahl’s Law

Amdahl’s law describes the speed-up from parallelization as a function of processor number, non-parallelizable component, and processor communication and contention.

Speedup = T_seq / [ T_np + T_com(P) + T_cont(P) + (T_seq - T_np) / P ]

T_seq  : sequential time
T_np   : non-parallelizable component time
T_com  : communication time between processors
T_cont : contention time between processors
P      : number of processors
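For reference, a small self-contained Fortran sketch of this expression; the timings in the driver are placeholders, and T_com and T_cont are treated as constants here although in general they grow with P:

program amdahl_sketch
   implicit none
   real(8) :: t_seq, t_np, t_com, t_cont
   integer :: p

   ! placeholder timings (seconds)
   t_seq  = 100.0d0
   t_np   = 20.0d0
   t_com  = 1.0d0
   t_cont = 0.0d0

   p = 1
   do while (p <= 64)
      print '(a,i3,a,f6.2)', 'P = ', p, '   speedup = ', &
            amdahl(t_seq, t_np, t_com, t_cont, p)
      p = 2 * p
   end do

contains

   ! extended Amdahl's-law speedup as defined above
   real(8) function amdahl(tseq, tnp, tcom, tcont, nproc)
      real(8), intent(in) :: tseq, tnp, tcom, tcont
      integer, intent(in) :: nproc
      amdahl = tseq / (tnp + tcom + tcont + (tseq - tnp) / dble(nproc))
   end function amdahl

end program amdahl_sketch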

Page 8: Distributed computing of the GEOS-Chem model


Performance

• Test run on 4 x 5 deg, full chemistry
• 1024-processor (dual-CPU/node) Dell cluster

• Xeon processors, ~3 Tflops theoretical peak, ~2 TByte RAM
• Pentium 3.2 GHz and 2 GB RAM per node

• Communication and contention cost removed for analysis

• Tseq: 649.83 sec
• Chemistry (seq): 432.21 sec (66.5%)
• SMVGEAR+CALCRATE: 0.0076 sec/node
• Optimal trade-off in speedup vs. processor count is reached with 32 processors

However,
• Total time with the master/slave architecture is 2230 sec
• Contention time: 1825.95 sec, or 82% of wall-clock time
• Communication time: ~0.0063 * 2300 sec

The master/slave architecture is not a viable option for chemistry or transport.
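Reading these numbers back into the Amdahl expression from the previous page (a back-of-the-envelope check, assuming the non-chemistry time is the non-parallelizable part): T_np ≈ 649.83 - 432.21 = 217.62 sec, T_com ≈ 0.0063 * 2300 ≈ 14.5 sec, and T_cont ≈ 1825.95 sec, so at P = 32 the denominator is about 217.62 + 14.5 + 1825.95 + 432.21/32 ≈ 2072 sec and the "speedup" is roughly 649.83/2072 ≈ 0.3, i.e. slower than the sequential run, consistent with the measured 2230-sec total and the conclusion above.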

Page 9: Distributed computing of the GEOS-Chem model


Domain Decomposition

(Figure: horizontal grid split among processors PE 1,1, PE 1,2, PE 1,3, ...; the legend distinguishes interior grid cells from ghost boundaries.)

All computations (transport, chemistry) for a grid cell are performed on one processor. For transport, ghost boundaries must be used.
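A minimal sketch of such a 2-D horizontal decomposition using an MPI Cartesian communicator; the patch size, ghost width, and periodicity choices are illustrative only, not the project's actual design:

program domain_decomp_sketch
   use mpi
   implicit none
   integer, parameter :: ng = 3                    ! ghost-boundary width (illustrative)
   integer, parameter :: nx_loc = 20, ny_loc = 25  ! local patch size (illustrative)
   integer :: ierr, nprocs, cart_comm, cart_rank
   integer :: dims(2), coords(2)
   logical :: periods(2)
   real(8), allocatable :: field(:,:)

   call MPI_Init(ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   ! let MPI pick a 2-D processor layout; periodic in longitude only
   dims = 0
   call MPI_Dims_create(nprocs, 2, dims, ierr)
   periods = (/ .true., .false. /)
   call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .true., cart_comm, ierr)
   call MPI_Comm_rank(cart_comm, cart_rank, ierr)
   call MPI_Cart_coords(cart_comm, cart_rank, 2, coords, ierr)

   ! local patch plus ghost cells; all per-cell work (chemistry, transport)
   ! for the owned grid columns stays on this PE
   allocate(field(1-ng:nx_loc+ng, 1-ng:ny_loc+ng))
   field = 0.0d0

   call MPI_Finalize(ierr)
end program domain_decomp_sketch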

Page 10: Distributed computing of the GEOS-Chem model


Ghost Boundaries

(Figure: fields held by PE1 and PE2 at times t, t+dt, t+2dt, and t+3dt, with message passing between the two PEs.)

Process:
• Time t: current values of fields on all grid points are accessible by PE1 and PE2.
• Times t+dt and t+2dt: current values of fields are accessible by both PE1 and PE2 on a reduced set of grid points.
• Message passing: current values of fields are made accessible to both PE1 and PE2 on all grid points.
• Time t+3dt: situation identical to time t.

Salient features:
• Information is exchanged between PE1 and PE2 every 3 time steps.
• Fields on all the grid points in the ghost boundary are exchanged.
• Fields on some grid points are computed redundantly by both PE1 and PE2.

Optimization of ghost boundary size
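A minimal halo-exchange sketch between neighbouring PEs along one dimension, with a ghost width of 3 cells so that, as in the figure, messages are needed only every 3 time steps at the cost of some redundant computation near the boundary; array sizes and names are illustrative:

program halo_exchange_sketch
   use mpi
   implicit none
   integer, parameter :: ng = 3, nx = 20, ny = 25   ! ghost width and local sizes (illustrative)
   integer :: ierr, rank, nprocs, left, right
   integer :: status(MPI_STATUS_SIZE)
   real(8) :: field(1-ng:nx+ng, ny)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   left  = mod(rank - 1 + nprocs, nprocs)           ! periodic neighbours along x
   right = mod(rank + 1, nprocs)
   field = dble(rank)

   ! send our rightmost ng owned columns to the right neighbour and
   ! receive our left ghost columns from the left neighbour ...
   call MPI_Sendrecv(field(nx-ng+1:nx, :), ng*ny, MPI_DOUBLE_PRECISION, right, 0, &
                     field(1-ng:0, :),     ng*ny, MPI_DOUBLE_PRECISION, left,  0, &
                     MPI_COMM_WORLD, status, ierr)
   ! ... and the mirror-image exchange for the right ghost columns
   call MPI_Sendrecv(field(1:ng, :),       ng*ny, MPI_DOUBLE_PRECISION, left,  1, &
                     field(nx+1:nx+ng, :), ng*ny, MPI_DOUBLE_PRECISION, right, 1, &
                     MPI_COMM_WORLD, status, ierr)

   call MPI_Finalize(ierr)
end program halo_exchange_sketch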

Page 11: Distributed computing of the GEOS-Chem model


Future Directions and Conclusions

• We have a preliminary design for the domain decomposition

• We expect to achieve a speed-up of roughly P^1/2 with this design.

• The I/O bottleneck (lots of data written to files) will be resolved by using a Parallel Virtual File System (PVFS) and ROMIO (MPI-IO) in order to maintain the scaling for a larger number of processors (a minimal MPI-IO sketch follows this list).

• We expect this approach to enable GEOS-Chem users to address a broad range of questions that are currently inhibited by computational constraints.

• These techniques will be beneficial not only to large systems, such as the JPL institutional cluster, but also to more modest cluster systems.
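As referenced in the I/O bullet above, here is a minimal collective MPI-IO sketch of this kind of parallel write; the file name, data layout, and sizes are illustrative, and this is not the planned GEOS-Chem I/O layer:

program mpiio_sketch
   use mpi
   implicit none
   integer, parameter :: nloc = 1000          ! values owned by each PE (illustrative)
   integer :: ierr, rank, fh
   integer(kind=MPI_OFFSET_KIND) :: offset
   real(8) :: diag(nloc)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   diag = dble(rank)                          ! placeholder diagnostic data

   ! one shared file instead of one file per PE
   call MPI_File_open(MPI_COMM_WORLD, 'diag_sketch.bin', &
                      MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)

   ! each PE writes its own block at its own offset; the write is collective
   offset = int(rank, MPI_OFFSET_KIND) * nloc * 8_MPI_OFFSET_KIND
   call MPI_File_write_at_all(fh, offset, diag, nloc, MPI_DOUBLE_PRECISION, &
                              MPI_STATUS_IGNORE, ierr)

   call MPI_File_close(fh, ierr)
   call MPI_Finalize(ierr)
end program mpiio_sketch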

Page 12: Distributed computing of the GEOS-Chem model


Distributed Computation

(Figure: data flow for the domain-decomposed design across P1, P2, ..., PN.)

Data on full grid
   -> Distribute data (MP) to P1, P2, ..., PN
   -> Distributed computation on each PE: chemistry, transport t -> t+dt
   -> Inject boundary data (MP)
   -> Distributed computation on each PE: chemistry, transport t -> t+dt
   -> Gather data (MP)
Data on full grid