Implementing RTM on Dense GPU Platforms
Geert Wenes – Cray, Inc.
Ty McKercher – NVIDIA
Cray Vision: Fusion of Supercomputing and Big (Fast) Data
Modeling The World
• Math Models: simulation and modeling of the natural world via mathematical equations.
• Data Models: analysis of large datasets for knowledge discovery, insight, and prediction.
• Data-Intensive Processing bridges the two: feeding scientific, sensor, and internet data into simulations, and analytic processing of simulation output.
Processing: Algorithmic Complexity Increasing
[Chart (source: Henri Calandra, Total EP): HPC evolution, tracking the TOP500 #1 system from 1 TF toward 1 EF over 1990-2020, against the rising complexity of migration algorithms: paraxial WE approximation, Kirchhoff beam, post-SDM, pre-STM, full WE approximation (acoustic & anisotropic; elastic & visco-elastic), RTM-FWI, and L2RTM.]
Cray system + new algorithms:
"Instead of thousands of years, we can now process a full FWI survey in a matter of weeks or days, depending on the amount of data and complexity of the rocks in the subsurface."
Steve Derenthal, in The Lamp, 2012-2.
Petroleum Geo-Services (PGS) Selects High-End Cray XC Series Supercomputer for Seismic Processing
• "Abel": one of the largest ever commercial supercomputers, #12 on the June 2015 Top 500 (the #1 commercial system)
• 5-petaflop XC40 supercomputer performance
• Seismic processing and imaging focus
  – Subsurface maps and 3-D models
• PGS win based on Cray's:
  – Competitive advantage over other O&G service suppliers
    • Performance: increased processing capacity
    • Throughput: faster turnaround on seismic jobs
  – Compute efficiency & reliability
  – Supportive partnership
• Integrated Cray configuration includes:
  – High-performance XC40 configuration
  – Integrated Sonexion 2000 storage system
PGS researchers' codes scaled and performed beyond the competition.
Acquisition: Variety, Volume & Velocity Increasing
(Images © Schlumberger and © SeaBird Technologies)
Node:
• Permanently installed
• Repeated acquisition
• Broader workflow use
Wide Azimuth:
• Larger surveys
• Drives algorithmic complexity
Coil:
• Larger surveys
• More survey components
• Drives algorithmic complexity
Broadband:
• Larger surveys
• Drives algorithmic complexity
Agenda
• What geoscientists want
• How GPUs scale
• Why the application range is broadening
• Why RTM is a natural fit for GPUs
• How you port to GPUs
• Where you can find more information
Why GPUs are used in seismic processing
• 30 Hz survey, 7 days to process: 1,000 CPU-only nodes versus 300 GPU-accelerated nodes (1000:300, roughly a 3-to-1 productivity gain)
• 60 Hz survey, weeks to process: 15,000 CPU-only nodes versus 5,000 GPU-accelerated nodes (15000:5000, again a 3-to-1 productivity gain)
• More physics
• More scenarios
• Bigger models
Scaling from 30 Hz to 60 Hz (worked sketch below):
• Higher frequency means denser sampling
• Denser sampling means a larger memory footprint, hence multiple GPUs
• Multiple GPUs require halo exchange and inter-GPU communication
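A back-of-the-envelope sketch of why the 60 Hz case forces a multi-GPU split. The survey dimensions and wavefield count below are hypothetical, chosen only to show the scaling: doubling the peak frequency halves the grid spacing in all three dimensions, so the volume needs 2^3 = 8x the points, which exceeds the 12-24 GB of a single 2015-era GPU.

```cuda
// Hypothetical grid sizes, purely illustrative; not from the presentation.
#include <cstdio>

int main() {
    const double nx = 1000, ny = 1000, nz = 500;  // assumed 30 Hz grid
    const int wavefields = 3;                     // e.g. two time levels + velocity model
    const double bytes30 = nx * ny * nz * wavefields * sizeof(float);
    const double bytes60 = bytes30 * 8.0;         // halve dx, dy, dz -> 8x the points
    printf("30 Hz: %.0f GB\n", bytes30 / 1e9);    // ~6 GB: fits on a single GPU
    printf("60 Hz: %.0f GB\n", bytes60 / 1e9);    // ~48 GB: must be split across GPUs,
                                                  // which forces halo exchange
    return 0;
}
```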
Multi-GPU domain decomposition (host with four GPUs on two Tesla K80 cards, P2P links between GPUs):
1. Divide property volumes among the multiple GPUs.
2. Divide the volumes into halo and inner-region domains.
3. Use streams for halo calculations and halo data exchange.
4. Use streams to launch inner-region calculations.
5. Peer-to-peer exchange between GPUs overlaps halo operations with inner-region calculations (see the sketch below).
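The five steps above can be sketched in CUDA roughly as follows. This is a minimal illustration, not the presenters' production code: the kernel names (stencil_halo, stencil_inner), the one-dimensional decomposition along z, the grid sizes, and the launch configurations are all assumptions, and the downward halo exchange is omitted for brevity (it is symmetric).

```cuda
#include <cuda_runtime.h>

#define NGPU   4
#define NX     256           // local grid: NX * NX * NZ interior points per GPU
#define NZ     64
#define RADIUS 4             // stencil half-width = halo slab thickness

// Placeholder kernels; a real RTM code applies its FD stencil here.
__global__ void stencil_halo (const float *p, float *q) { /* boundary slabs  */ }
__global__ void stencil_inner(const float *p, float *q) { /* interior points */ }

int main() {
    float *p[NGPU], *q[NGPU];
    cudaStream_t halo[NGPU], inner[NGPU];
    const size_t plane = (size_t)NX * NX;            // points per z-plane
    const size_t local = plane * (NZ + 2 * RADIUS);  // interior + two ghost slabs

    for (int g = 0; g < NGPU; ++g) {
        cudaSetDevice(g);
        // Enable direct peer-to-peer access with neighboring GPUs.
        if (g > 0)        cudaDeviceEnablePeerAccess(g - 1, 0);
        if (g < NGPU - 1) cudaDeviceEnablePeerAccess(g + 1, 0);
        cudaMalloc(&p[g], local * sizeof(float));
        cudaMalloc(&q[g], local * sizeof(float));
        cudaStreamCreate(&halo[g]);   // stream for halo compute + exchange
        cudaStreamCreate(&inner[g]);  // stream for inner-region compute
    }

    for (int step = 0; step < 1000; ++step) {
        for (int g = 0; g < NGPU; ++g) {
            cudaSetDevice(g);
            // Halo first, on its own stream, so the exchange can start early
            // while the inner region (the bulk of the work) runs concurrently.
            stencil_halo <<< 128, 256, 0, halo[g]>>>(p[g], q[g]);
            stencil_inner<<<1024, 256, 0, inner[g]>>>(p[g], q[g]);
        }
        for (int g = 0; g < NGPU - 1; ++g) {
            cudaSetDevice(g);
            // Copy GPU g's top halo slab into GPU g+1's bottom ghost slab over
            // P2P; queued on halo[g], so it waits for stencil_halo to finish.
            cudaMemcpyPeerAsync(q[g + 1], g + 1,      // dst: ghost slab at offset 0
                                q[g] + plane * NZ, g, // src: top RADIUS interior planes
                                plane * RADIUS * sizeof(float), halo[g]);
        }
        for (int g = 0; g < NGPU; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
        for (int g = 0; g < NGPU; ++g) { float *t = p[g]; p[g] = q[g]; q[g] = t; }
    }
    return 0;
}
```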
[Diagram: dual-socket host, CPUs linked by QPI, each socket with its own DRAM and PCIe lanes]
[Diagram: the same host fully populated: each socket drives two PCIe buses, each bus fans out through two PLX switches, and each PLX switch carries one Tesla K80 (two GPU dies), for eight K80s with sixteen GPU dies numbered 0-15]
Single server + 8x Tesla K80: 192 GB GPU memory (8 x 24 GB), 39,936 CUDA cores (8 x 4,992), 64.8 TFLOPS peak fp32
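Whether the peer-to-peer paths in the diagram above are actually usable can be checked at runtime. The probe below is a generic CUDA utility sketch, not code from the presentation: on a topology like this 8x K80 server, GPU dies under a common PLX switch or CPU root complex typically report P2P support, while pairs split across the QPI link may not.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %2d: %s\n", i, prop.name);
    }
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);  // can GPU i directly access GPU j?
            printf("P2P %2d -> %2d : %s\n", i, j, ok ? "yes" : "no");
        }
    return 0;
}
```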
Performance scaling with multiple GPUs
Next-generation Pascal (2016)
• Stacked memory: peak performance, higher bandwidth
• NVLink high-speed interconnect: CPU-to-GPU and GPU-to-GPU
• Unified memory: single memory space
Why the application range is broadening
GPU adoption timeline for seismic applications:
• 2007: RTM
• 2008-2010: FWI, KDM, KTM, WEM, PSPI
• 2012+: elastic modeling, SRME, CSEM
Why RTM is a natural fit for GPUs
• Regular access patterns reach 80% of peak memory bandwidth (480 GB/s); see the stencil sketch below
• Hardware-based math operations
• Communication costs hidden by overlapping computation
• Entire TTI shots fit per node on multi-GPU systems
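A minimal sketch of the "regular access patterns" point: in a finite-difference RTM kernel, adjacent threads read adjacent memory, so loads coalesce and the kernel runs close to the memory-bandwidth limit. The 8th-order acoustic stencil below is a generic illustration with placeholder coefficients, not the tuned production kernel; vel is assumed to already hold (v*dt/h)^2.

```cuda
#include <cuda_runtime.h>

__constant__ float c[5];  // stencil coefficients (center + 4 per side), assumed preloaded

// One leapfrog time step of the acoustic wave equation:
//   p2 = 2*p1 - p0 + (v*dt/h)^2 * laplacian(p1)
__global__ void acoustic_step(const float *p0, const float *p1, float *p2,
                              const float *vel, int nx, int ny, int nz)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int iz = blockIdx.z * blockDim.z + threadIdx.z;
    if (ix < 4 || iy < 4 || iz < 4 || ix >= nx - 4 || iy >= ny - 4 || iz >= nz - 4)
        return;

    size_t i  = ((size_t)iz * ny + iy) * nx + ix;
    size_t sy = nx, sz = (size_t)nx * ny;           // strides in y and z

    float lap = 3.0f * c[0] * p1[i];
    for (int r = 1; r <= 4; ++r)                    // fixed-stride reads: the x
        lap += c[r] * (p1[i + r]      + p1[i - r]   // accesses are fully coalesced
                     + p1[i + r * sy] + p1[i - r * sy]
                     + p1[i + r * sz] + p1[i - r * sz]);

    p2[i] = 2.0f * p1[i] - p0[i] + vel[i] * lap;    // vel holds (v*dt/h)^2
}
```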
How you port to GPUs: the Assess, Parallelize, Optimize, Deploy cycle
• Assess: profile using familiar tools
• Parallelize: three ways to accelerate apps: libraries, directives, languages (see the port sketch below)
• Optimize: guided analysis
• Deploy: multi-GPU system advantage
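As an illustration of the "languages" route, here is the canonical CUDA port of a simple CPU loop (a generic example, not from the presentation): the loop body becomes a kernel, and each iteration maps to one thread. cudaMallocManaged is used so no explicit host-to-device copies are needed.

```cuda
#include <cuda_runtime.h>

// CPU version being ported:
//   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory: no explicit copies
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x); cudaFree(y);
    return 0;
}
```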
Seismic imaging workloads on GPUs
Adopters include Schlumberger, CGG, TGS-Nopec, ENI, Chevron, Petrobras, Statoil, Hess, Seismic City, Spectraseis, Acceleware, Stanford, U of Chicago, and KAUST.
• More physics, more scenarios, bigger models
• Linear scaling across multiple GPUs
• Multi-GPU system advantage: improved throughput & productivity
Realizing Dense GPU Computation for RTM, At Scale and In Production
RTM as a Seismic Migration Application
• More physics and features: RTM (VTI, TTI), L2RTM, eRTM
• Implementation issues and choices:
  • Possible strong migration artifacts
  • High computational cost, W ~ N^4 (N^3 grid points times N time steps)
  • Imaging condition (see the kernel sketch below)
  • Implementation schemes: explicit FD, (pseudo-)spectral

RTM as Part of a Critical Workflow
• Preconditioned data, model building, post-image processing
• Integrated with complementary migration schemes (e.g., Kirchhoff)
• Wide range of tradeoffs:
  • Disk/snapshots for source wavefield construction
  • In-memory processing
  • Partial imaging, de-/re-migration
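The "imaging condition" bullet typically refers to the zero-lag cross-correlation condition, I(x) = sum over t of S(x,t) * R(x,t): at each time step, multiply the source and receiver wavefields pointwise and accumulate into the image. A hedged kernel sketch (generic, not the production implementation):

```cuda
#include <cuda_runtime.h>

// Zero-lag cross-correlation imaging condition, applied once per time step.
__global__ void imaging_condition(const float *src,  // source wavefield at this step
                                  const float *rcv,  // receiver wavefield, same step
                                  float *image,      // accumulated image
                                  size_t npoints)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < npoints)
        image[i] += src[i] * rcv[i];  // I(x) = sum_t S(x,t) * R(x,t)
}
```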
CS-Storm
• Performance & technology
  • Performance at high resolution
  • Technology longevity for many-core technologies
• Productivity
  • Open software development environment
  • Communication in a distributed-memory environment
• Workload & storage management
  • Dynamic workload management tools
  • Optimized data location to efficiently utilize storage
• TCO / cost-effectiveness
  • Optimized power consumption, leveraging green computing initiatives
  • Reduced time-to-production
[Diagram: R&D, dev systems, and facilities dimensions: algorithms, performance, technology, productivity, I/T processes & standards, WLM/utilization, storage/FS/IO, power/cooling, and (remote) access, all under SLAs for time-to-solution and availability]
• Algorithmic complexity at ever-increasing fidelity and functionality
• Data acquisition at ever-increasing volume, velocity (and variety)
SPECFEM3D Strong Scaling on Multi-GPU CS-Storm Servers
• Seismology community code, a proxy for seismic applications
• CUDA version developed by Daniel Peter, ETH
• Data courtesy of BP & Princeton (3D elastic, isotropic model; density, Vp, and Vs)
[Chart: SPECFEM3D speed-up versus number of K40 GPUs (1, 2, 4, 8, 16) for simple_model and the BP demo model, compared against ideal scaling]
SPECFEM3D Linear Scaling on K40 and K80
• Almost perfect scaling when adding GPUs, on both K40 and K80
• K80 nodes show a clear performance advantage: ~2x the performance, i.e., about half the runtime per node
[Chart: SPECFEM3D strong scaling, wall-clock time (sec) versus number of GPU cards (up to 16), K40 versus K80, showing linear scaling with the number of GPUs]
Cray Advanced Cluster Engine (ACE™)
• Complete cluster, server, network, and storage management
• Extreme scalability and ease of use
• Partitioning; job-scheduler support; revision system with rollback; automatic network/server discovery and failover

Cray Programming Environment on CS
• Cray Compiling Environment
• Cray Scientific and Math Libraries
• Cray Performance Measurement and Analysis Tools

Complete SW Ecosystem
• Open-source and partner tools
Support for 19" or 24" Cabinet
• Five CS-Storm nodes mounted vertically in a 14RU cube (three 14U cubes per 42U rack)
• Datacenter-friendly cooling options:
  – Air cooled for versatility
  – Support for liquid-cooled rear-door heat exchangers for room-neutral cooling
Powerful and Efficient
• Uncompromising performance in a single-rack system
• Full system solution featuring the Cray management and programming environment
• Maximum efficiency for scalable GPU applications

Performance by Design
• Power and cooling to spare allows GPUs to run at full power
• Designed for upgradeability to protect your investment

Cray Service and Reliability
• Redundancy, data protection, and serviceability
• Cray expertise

CS-Storm delivers the best possible efficiency and performance for RTM applications.