Implementing RTM on Dense GPU Platforms
Geert Wenes – Cray, Inc.
Ty McKercher – NVIDIA
Cray Vision: Fusion of Supercomputing and Big (Fast) Data
Modeling The World
• Math Models: simulation and modeling of the natural world via mathematical equations.
• Data Models: analysis of large datasets for knowledge discovery, insight, and prediction.
• Data-Intensive Processing bridges the two: feeding scientific, sensor, and internet data into simulations, and analytic processing of simulation output.
Processing: Algorithmic Complexity Increasing
[Chart (source: Henri Calandra, Total EP): HPC evolution, tracking the TOP500 #1 system from 1 TF toward 1 EF over 1990-2020, against the rising complexity of migration algorithms: paraxial WE approximation, Kirchhoff beam, post-SDM, pre-STM, full WE approximation (acoustic & anisotropic; elastic & visco-elastic), RTM-FWI, and L2RTM.]
Cray system + new algorithms:
"Instead of thousands of years, we can now process a full FWI survey in a matter of weeks or days, depending on the amount of data and complexity of the rocks in the subsurface."
Steve Derenthal, in The Lamp, 2012-2.
Petroleum Geo-Services (PGS) Selects High-End Cray XC Series Supercomputer for Seismic Processing
• "Abel": one of the largest ever commercial supercomputers, #12 on the June 2015 Top 500 (the #1 commercial system)
• 5-petaflop XC40 supercomputer performance
• Seismic processing and imaging focus
  – Subsurface maps and 3-D models
• PGS win based on Cray's:
  – Competitive advantage over other O&G service suppliers
    • Performance: increased processing capacity
    • Throughput: faster turnaround on seismic jobs
  – Compute efficiency & reliability
  – Supportive partnership
• Integrated Cray configuration includes:
  – High-performance XC40 configuration
  – Integrated Sonexion 2000 storage system
PGS researchers' codes scaled and performed beyond the competition.
Acquisition: Variety, Volume & Velocity Increasing
(Images © Schlumberger and © SeaBird Technologies)
Node:
• Permanently installed
• Repeated acquisition
• Broader workflow use
Wide Azimuth:
• Larger surveys
• Drives algorithmic complexity
Coil:
• Larger surveys
• More survey components
• Drives algorithmic complexity
Broadband:
• Larger surveys
• Drives algorithmic complexity
Agenda
• What geoscientists want
• How GPUs scale
• Why the application range is broadening
• Why RTM is a natural fit for GPUs
• How you port to GPUs
• Where you can find more information
Why GPUs are used in seismic processing
• 30 Hz survey, 7 days to process: 1,000 CPU-only nodes versus 300 GPU-accelerated nodes (1000:300, roughly a 3-to-1 productivity gain)
• 60 Hz survey, weeks to process: 15,000 CPU-only nodes versus 5,000 GPU-accelerated nodes (15000:5000, again a 3-to-1 productivity gain)
• More physics
• More scenarios
• Bigger models
Scaling from 30 Hz to 60 Hz (worked sketch below):
• Higher frequency means denser sampling
• Denser sampling means a larger memory footprint, hence multiple GPUs
• Multiple GPUs require halo exchange and inter-GPU communication
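A back-of-the-envelope sketch of why the 60 Hz case forces a multi-GPU split. The survey dimensions and wavefield count below are hypothetical, chosen only to show the scaling: doubling the peak frequency halves the grid spacing in all three dimensions, so the volume needs 2^3 = 8x the points, which exceeds the 12-24 GB of a single 2015-era GPU.

```cuda
// Hypothetical grid sizes, purely illustrative; not from the presentation.
#include <cstdio>

int main() {
    const double nx = 1000, ny = 1000, nz = 500;  // assumed 30 Hz grid
    const int wavefields = 3;                     // e.g. two time levels + velocity model
    const double bytes30 = nx * ny * nz * wavefields * sizeof(float);
    const double bytes60 = bytes30 * 8.0;         // halve dx, dy, dz -> 8x the points
    printf("30 Hz: %.0f GB\n", bytes30 / 1e9);    // ~6 GB: fits on a single GPU
    printf("60 Hz: %.0f GB\n", bytes60 / 1e9);    // ~48 GB: must be split across GPUs,
                                                  // which forces halo exchange
    return 0;
}
```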
Multi-GPU domain decomposition (host with four GPUs on two Tesla K80 cards, P2P links between GPUs):
1. Divide property volumes among the multiple GPUs.
2. Divide the volumes into halo and inner-region domains.
3. Use streams for halo calculations and halo data exchange.
4. Use streams to launch inner-region calculations.
5. Peer-to-peer exchange between GPUs overlaps halo operations with inner-region calculations (see the sketch below).
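The five steps above can be sketched in CUDA roughly as follows. This is a minimal illustration, not the presenters' production code: the kernel names (stencil_halo, stencil_inner), the one-dimensional decomposition along z, the grid sizes, and the launch configurations are all assumptions, and the downward halo exchange is omitted for brevity (it is symmetric).

```cuda
#include <cuda_runtime.h>

#define NGPU   4
#define NX     256           // local grid: NX * NX * NZ interior points per GPU
#define NZ     64
#define RADIUS 4             // stencil half-width = halo slab thickness

// Placeholder kernels; a real RTM code applies its FD stencil here.
__global__ void stencil_halo (const float *p, float *q) { /* boundary slabs  */ }
__global__ void stencil_inner(const float *p, float *q) { /* interior points */ }

int main() {
    float *p[NGPU], *q[NGPU];
    cudaStream_t halo[NGPU], inner[NGPU];
    const size_t plane = (size_t)NX * NX;            // points per z-plane
    const size_t local = plane * (NZ + 2 * RADIUS);  // interior + two ghost slabs

    for (int g = 0; g < NGPU; ++g) {
        cudaSetDevice(g);
        // Enable direct peer-to-peer access with neighboring GPUs.
        if (g > 0)        cudaDeviceEnablePeerAccess(g - 1, 0);
        if (g < NGPU - 1) cudaDeviceEnablePeerAccess(g + 1, 0);
        cudaMalloc(&p[g], local * sizeof(float));
        cudaMalloc(&q[g], local * sizeof(float));
        cudaStreamCreate(&halo[g]);   // stream for halo compute + exchange
        cudaStreamCreate(&inner[g]);  // stream for inner-region compute
    }

    for (int step = 0; step < 1000; ++step) {
        for (int g = 0; g < NGPU; ++g) {
            cudaSetDevice(g);
            // Halo first, on its own stream, so the exchange can start early
            // while the inner region (the bulk of the work) runs concurrently.
            stencil_halo <<< 128, 256, 0, halo[g]>>>(p[g], q[g]);
            stencil_inner<<<1024, 256, 0, inner[g]>>>(p[g], q[g]);
        }
        for (int g = 0; g < NGPU - 1; ++g) {
            cudaSetDevice(g);
            // Copy GPU g's top halo slab into GPU g+1's bottom ghost slab over
            // P2P; queued on halo[g], so it waits for stencil_halo to finish.
            cudaMemcpyPeerAsync(q[g + 1], g + 1,      // dst: ghost slab at offset 0
                                q[g] + plane * NZ, g, // src: top RADIUS interior planes
                                plane * RADIUS * sizeof(float), halo[g]);
        }
        for (int g = 0; g < NGPU; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
        for (int g = 0; g < NGPU; ++g) { float *t = p[g]; p[g] = q[g]; q[g] = t; }
    }
    return 0;
}
```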
[Diagram: dual-socket host, CPUs linked by QPI, each socket with its own DRAM and PCIe lanes]
[Diagram: the same host fully populated: each socket drives two PCIe buses, each bus fans out through two PLX switches, and each PLX switch carries one Tesla K80 (two GPU dies), for eight K80s with sixteen GPU dies numbered 0-15]
Single server + 8x Tesla K80: 192 GB GPU memory (8 x 24 GB), 39,936 CUDA cores (8 x 4,992), 64.8 TFLOPS peak fp32
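Whether the peer-to-peer paths in the diagram above are actually usable can be checked at runtime. The probe below is a generic CUDA utility sketch, not code from the presentation: on a topology like this 8x K80 server, GPU dies under a common PLX switch or CPU root complex typically report P2P support, while pairs split across the QPI link may not.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %2d: %s\n", i, prop.name);
    }
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);  // can GPU i directly access GPU j?
            printf("P2P %2d -> %2d : %s\n", i, j, ok ? "yes" : "no");
        }
    return 0;
}
```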
Performance scaling with multiple GPUs
Next-generation Pascal (2016)
• Stacked memory: peak performance, higher bandwidth
• NVLink high-speed interconnect: CPU-to-GPU and GPU-to-GPU
• Unified memory: single memory space
Why the application range is broadening
GPU adoption timeline for seismic applications:
• 2007: RTM
• 2008-2010: FWI, KDM, KTM, WEM, PSPI
• 2012+: elastic modeling, SRME, CSEM
Why RTM is a natural fit for GPUs
• Regular access patterns reach 80% of peak memory bandwidth (480 GB/s); see the stencil sketch below
• Hardware-based math operations
• Communication costs hidden by overlapping computation
• Entire TTI shots fit per node on multi-GPU systems
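A minimal sketch of the "regular access patterns" point: in a finite-difference RTM kernel, adjacent threads read adjacent memory, so loads coalesce and the kernel runs close to the memory-bandwidth limit. The 8th-order acoustic stencil below is a generic illustration with placeholder coefficients, not the tuned production kernel; vel is assumed to already hold (v*dt/h)^2.

```cuda
#include <cuda_runtime.h>

__constant__ float c[5];  // stencil coefficients (center + 4 per side), assumed preloaded

// One leapfrog time step of the acoustic wave equation:
//   p2 = 2*p1 - p0 + (v*dt/h)^2 * laplacian(p1)
__global__ void acoustic_step(const float *p0, const float *p1, float *p2,
                              const float *vel, int nx, int ny, int nz)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int iz = blockIdx.z * blockDim.z + threadIdx.z;
    if (ix < 4 || iy < 4 || iz < 4 || ix >= nx - 4 || iy >= ny - 4 || iz >= nz - 4)
        return;

    size_t i  = ((size_t)iz * ny + iy) * nx + ix;
    size_t sy = nx, sz = (size_t)nx * ny;           // strides in y and z

    float lap = 3.0f * c[0] * p1[i];
    for (int r = 1; r <= 4; ++r)                    // fixed-stride reads: the x
        lap += c[r] * (p1[i + r]      + p1[i - r]   // accesses are fully coalesced
                     + p1[i + r * sy] + p1[i - r * sy]
                     + p1[i + r * sz] + p1[i - r * sz]);

    p2[i] = 2.0f * p1[i] - p0[i] + vel[i] * lap;    // vel holds (v*dt/h)^2
}
```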
How you port to GPUs: the Assess, Parallelize, Optimize, Deploy cycle
• Assess: profile using familiar tools
• Parallelize: three ways to accelerate apps: libraries, directives, languages (see the port sketch below)
• Optimize: guided analysis
• Deploy: multi-GPU system advantage
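As an illustration of the "languages" route, here is the canonical CUDA port of a simple CPU loop (a generic example, not from the presentation): the loop body becomes a kernel, and each iteration maps to one thread. cudaMallocManaged is used so no explicit host-to-device copies are needed.

```cuda
#include <cuda_runtime.h>

// CPU version being ported:
//   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory: no explicit copies
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    cudaFree(x); cudaFree(y);
    return 0;
}
```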
Seismic imaging workloads on GPUs
Adopters include Schlumberger, CGG, TGS-Nopec, ENI, Chevron, Petrobras, Statoil, Hess, Seismic City, Spectraseis, Acceleware, Stanford, U of Chicago, and KAUST.
• More physics, more scenarios, bigger models
• Linear scaling across multiple GPUs
• Multi-GPU system advantage: improved throughput & productivity
Realizing Dense GPU Computation for RTM, At Scale and In Production
RTM as a Seismic Migration Application
• More physics and features: RTM (VTI, TTI), L2RTM, eRTM
• Implementation issues and choices:
  • Possible strong migration artifacts
  • High computational cost, W ~ N^4 (N^3 grid points times N time steps)
  • Imaging condition (see the kernel sketch below)
  • Implementation schemes: explicit FD, (pseudo-)spectral

RTM as Part of a Critical Workflow
• Preconditioned data, model building, post-image processing
• Integrated with complementary migration schemes (e.g., Kirchhoff)
• Wide range of tradeoffs:
  • Disk/snapshots for source wavefield construction
  • In-memory processing
  • Partial imaging, de-/re-migration
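The "imaging condition" bullet typically refers to the zero-lag cross-correlation condition, I(x) = sum over t of S(x,t) * R(x,t): at each time step, multiply the source and receiver wavefields pointwise and accumulate into the image. A hedged kernel sketch (generic, not the production implementation):

```cuda
#include <cuda_runtime.h>

// Zero-lag cross-correlation imaging condition, applied once per time step.
__global__ void imaging_condition(const float *src,  // source wavefield at this step
                                  const float *rcv,  // receiver wavefield, same step
                                  float *image,      // accumulated image
                                  size_t npoints)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < npoints)
        image[i] += src[i] * rcv[i];  // I(x) = sum_t S(x,t) * R(x,t)
}
```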
CS-Storm
• Performance & technology
  • Performance at high resolution
  • Technology longevity for many-core technologies
• Productivity
  • Open software development environment
  • Communication in a distributed-memory environment
• Workload & storage management
  • Dynamic workload management tools
  • Optimized data location to efficiently utilize storage
• TCO / cost-effectiveness
  • Optimized power consumption, leveraging green computing initiatives
  • Reduced time-to-production
[Diagram: R&D, dev systems, and facilities dimensions: algorithms, performance, technology, productivity, I/T processes & standards, WLM/utilization, storage/FS/IO, power/cooling, and (remote) access, all under SLAs for time-to-solution and availability]
• Algorithmic complexity at ever-increasing fidelity and functionality
• Data acquisition at ever-increasing volume, velocity (and variety)
SPECFEM3D Strong Scaling on Multi-GPU CS-Storm Servers
• Seismology community code, a proxy for seismic applications
• CUDA version developed by Daniel Peter, ETH
• Data courtesy of BP & Princeton (3D elastic, isotropic model; density, Vp, and Vs)
[Chart: SPECFEM3D speed-up versus number of K40 GPUs (1, 2, 4, 8, 16) for simple_model and the BP demo model, compared against ideal scaling]
SPECFEM3D Linear Scaling on K40 and K80
• Almost perfect scaling when adding GPUs, on both K40 and K80
• K80 nodes show a clear performance advantage: ~2x the performance, i.e., about half the runtime per node
[Chart: SPECFEM3D strong scaling, wall-clock time (sec) versus number of GPU cards (up to 16), K40 versus K80, showing linear scaling with the number of GPUs]
Cray Advanced Cluster Engine (ACE™)
• Complete cluster, server, network, and storage management
• Extreme scalability and ease of use
• Partitioning; job-scheduler support; revision system with rollback; automatic network/server discovery and failover

Cray Programming Environment on CS
• Cray Compiling Environment
• Cray Scientific and Math Libraries
• Cray Performance Measurement and Analysis Tools

Complete SW Ecosystem
• Open-source and partner tools
Support for 19" or 24" Cabinet
• Five CS-Storm nodes mounted vertically in a 14RU cube (three 14U cubes per 42U rack)
• Datacenter-friendly cooling options:
  – Air cooled for versatility
  – Support for liquid-cooled rear-door heat exchangers for room-neutral cooling
Powerful and Efficient
• Uncompromising performance in a single-rack system
• Full system solution featuring the Cray management and programming environment
• Maximum efficiency for scalable GPU applications

Performance by Design
• Power and cooling to spare allows GPUs to run at full power
• Designed for upgradeability to protect your investment

Cray Service and Reliability
• Redundancy, data protection, and serviceability
• Cray expertise

CS-Storm delivers the best possible efficiency and performance for RTM applications.