HETEROGENEOUS SYSTEM ARCHITECTURE
Platform of the Future
Mike Houston, AMD Fellow
Sep 27, 2015
2 | XLDB - Stanford | Sept. 12, 2012
A NEW ERA OF PROCESSOR PERFORMANCE
[Figure: three performance-over-time curves, one per era]
- Single-Core Era: single-thread performance over time. Enabled by Moore's Law and voltage scaling; constrained by power and complexity.
- Multi-Core Era: throughput performance over time (# of processors). Enabled by Moore's Law and SMP architecture; constrained by power, parallel software, and scalability.
- Heterogeneous Systems Era ("we are here"): modern application performance over time (data-parallel exploitation). Enabled by abundant data parallelism and power-efficient GPUs; temporarily constrained by programming models and communication overhead.
Programming models along the timeline: Assembly, C/C++, Java, pthreads, OpenMP / TBB, Shader, CUDA, OpenCL ... ?
Most parallel code runs on CPUs designed for scalar workloads
HETEROGENEOUS SYSTEM ARCHITECTURE: AN OPEN PLATFORM
Open architecture with published specifications:
- HSAIL virtual ISA
- HSA memory model
- HSA dispatch
ISA-agnostic for both CPU and GPU
Inviting partners to join us, in all areas:
- Hardware companies
- Operating systems
- Tools and middleware
- Applications
www.hsafoundation.com: to define the next generation of computing platforms for all devices
GOALS
- Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today
- Dramatically expand the APU software ecosystem in client and server
- Enable immersive applications whether hosted locally or in the cloud
APU: ACCELERATED PROCESSING UNIT
The APU has arrived, and it is a great advance over previous platforms
Combines scalar processing on the CPU with parallel processing on the GPU, plus high-bandwidth access to memory
How do we make it even better going forward?
- Easier to program
- Easier to optimize
- Easier to load balance
- Higher performance
- Lower power
HETEROGENEOUS SYSTEM ARCHITECTURE ROADMAP
ACCELERATING MEMCACHED
CLOUD SERVER WORKLOAD
DATACENTER WORKLOAD
Generally used for short-term storage and caching, handling requests that would otherwise require database or file system accesses
Used by Facebook, YouTube, Twitter, Wikipedia, Flickr, and others
Effectively a large distributed hash table
Responds to store and get requests received over the network
Conceptually:
store(key, object)
object = get(key)
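The store/get interface above can be sketched as a bounded in-process hash table with LRU eviction (illustrative only; real memcached is a networked, multi-node service, and `MiniCache` is a hypothetical name, not memcached's API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of memcached's conceptual interface: a bounded
// key-value cache that evicts the least recently used entry when full.
public class MiniCache {
    private final Map<String, byte[]> table;

    public MiniCache(final int capacity) {
        // accessOrder=true makes LinkedHashMap maintain LRU order;
        // removeEldestEntry evicts once we exceed capacity.
        this.table = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > capacity;
            }
        };
    }

    public void store(String key, byte[] object) {
        table.put(key, object);
    }

    public byte[] get(String key) {
        return table.get(key); // null on cache miss
    }
}
```

A distributed deployment additionally hashes each key to one of many such servers, so the aggregate behaves like one large hash table.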
OFFLOADING MEMCACHED KEY LOOKUP TO THE GPU
[Chart: key-lookup speedup (0x-4x) and execution-time breakdown (data transfer vs. execution) for a multithreaded CPU, a Radeon HD 5870, a Trinity A10-5800K, and a Zacate E-350]
T. H. Hetherington, T. G. Rogers, L. Hsu, M. O'Connor, and T. M. Aamodt, "Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems," Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2012), April 2012.
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6189209
ACCELERATING JAVA
GOING BEYOND NATIVE LANGUAGES
JAVA ENABLEMENT BY APARAPI
- Developer creates Java source
- Source compiled to class files (bytecode) using the standard compiler (javac)
- Classes packaged and deployed using the established Java tool chain
- Aparapi = runtime capable of converting Java bytecode to OpenCL
  - For execution on any OpenCL 1.1+ capable device
  - OR execution via a thread pool if OpenCL is not available
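The thread-pool fallback path can be sketched in plain Java (no Aparapi dependency; `KernelBody`, `PoolFallback`, and `execute` are illustrative names, not Aparapi's actual API): the body of a data-parallel kernel is run once per global id, spread across worker threads when no OpenCL device is available.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of a thread-pool fallback for a data-parallel
// kernel: one invocation of the kernel body per global id.
public class PoolFallback {
    @FunctionalInterface
    interface KernelBody {          // hypothetical stand-in for a kernel's run() method
        void run(int globalId);
    }

    static void execute(KernelBody body, int range) {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int id = 0; id < range; id++) {
            final int globalId = id;
            pool.submit(() -> body.run(globalId));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        float[] in = {1f, 2f, 3f, 4f};
        float[] out = new float[in.length];
        // kernel body: square one element per work-item
        execute(id -> out[id] = in[id] * in[id], in.length);
        System.out.println(java.util.Arrays.toString(out)); // [1.0, 4.0, 9.0, 16.0]
    }
}
```

Each work-item writes a distinct index, so no synchronization beyond pool shutdown is needed; this mirrors the restriction that GPU kernels avoid cross-work-item data races.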
JAVA AND APARAPI HSA ENABLEMENT ROADMAP
[Diagram: four JVM software stacks showing progressive HSA enablement]
- Application → Aparapi → OpenCL → CPU ISA / GPU ISA (today's OpenCL path)
- Application → Aparapi → HSAIL → HSA Finalizer → HSA CPU / HSA GPU
- Application → Aparapi → JVM with LLVM optimizer lowering IR to HSAIL → HSA Finalizer → HSA CPU / HSA GPU
- Application → HSA-enabled JVM emitting HSAIL directly → HSA Runtime / HSA Finalizer → HSA CPU / HSA GPU
HSA SOFTWARE STACKS
INTRODUCING HSA BOLT: PARALLEL PRIMITIVES LIBRARY FOR HSA
Easily leverage the inherent power efficiency of GPU computing
- Common routines such as scan, sort, reduce, and transform
- More advanced routines like heterogeneous pipelines
- Bolt library works with OpenCL or C++ AMP
Enjoy the unique advantages of the HSA platform
- Move the computation, not the data
- Finally, a single source code base for the CPU and GPU!
- Developers can focus on core algorithms
See Ben Sander's session tomorrow for a deep dive on HSA Bolt!
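As an illustration of what the scan primitive computes (a sequential sketch in plain Java, not Bolt's actual C++ API): inclusive scan replaces each element with the running total of everything up to and including it, which libraries like Bolt parallelize on the GPU.

```java
import java.util.Arrays;

// Sequential reference for the scan (prefix-sum) primitive.
// Inclusive scan: out[i] holds the sum of in[0..i].
public class ScanDemo {
    static int[] inclusiveScan(int[] in) {
        int[] out = new int[in.length];
        int running = 0;
        for (int i = 0; i < in.length; i++) {
            running += in[i];
            out[i] = running;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(inclusiveScan(new int[]{3, 1, 4, 1, 5})));
        // prints [3, 4, 8, 9, 14]
    }
}
```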
HSA SOLUTION STACK
[Diagram: layered stack, top to bottom]
- Application SW: Application → domain-specific libs (Bolt, OpenCV, many others)
- Runtimes: OpenCL Runtime, DirectX Runtime, other runtimes → HSA Runtime (HSA software, with kernel-driver control)
- Drivers: HSA Finalizer (HSAIL → GPU ISA) alongside legacy drivers
- Differentiated HW: CPU(s), GPU(s), other accelerators
AMD'S OPEN SOURCE COMMITMENT TO HSA

Component Name              | AMD Specific | Rationale
HSA Bolt Library            | No           | Enable understanding and debug
OpenCL HSAIL Code Generator | No           | Enable research
LLVM Contributions          | No           | Industry and academic collaboration
HSA Assembler               | No           | Enable understanding and debug
HSA Runtime                 | No           | Standardize on a single runtime
HSA Finalizer               | Yes          | Enable research and debug
HSA Kernel Driver           | Yes          | For inclusion in Linux distros

We will open source our Linux execution and compilation stack:
- Jump-start the ecosystem
- Allow a single shared implementation where appropriate
- Enable university research in all areas
SEAMICRO
HIGH DENSITY COMPUTING
THE SM15K PRODUCT FAMILY
- 10 rack units
- 64 server cards
- 1.28 Tbps fabric interconnect
- Up to 160GbE uplink (16 x 10GbE or 64 x 1GbE)
- 0-64 internal 2.5" SAS/SATA HDD/SSD
- Up to 1,344 external 3.5" SAS/SATA HDD/SSD
- Up to 16 x4 3Gbps SAS interfaces for external storage
- Hardware RAID module with RAID 1, 5, 6, and 10
- Hot-swappable modules with in-service upgrades
- Runs off-the-shelf OSes and hypervisors
- Redundant power: 100-208V AC, 48V DC
- 3.0 to 3.5 kW power consumption (25-85% utilization)
AMD SERVER BLADES: SHIPS Q4 2012
- 64 Opteron EE-4365 servers per 10RU
- 512 cores in 10RU; 2,048 cores in a rack
- 64GB ECC DRAM per server (4TB per 10RU, 16TB per rack)
- 8 x 1GbE per server
AMD Opteron blade (SM15K-OP): one octal-core 2.0/2.3/2.8GHz Opteron EE-4365 processor per server blade
SEAMICRO FABRIC STORAGE ENCLOSURE FAMILY

                          | FS 5084-L          | FS 2012-L        | FS 2024-S
Positioning               | High capacity      | Low upfront cost | Performance optimized
Height (RU)               | 5RU                | 2RU              | 2RU
Disk count                | 84                 | 12               | 24
Disk types supported      | 3.5" / 2.5" SAS/SATA | 3.5" SAS/SATA  | 2.5" SAS/SATA
Controller                | Dual HA Storage Bridge Bay (SBB) 2.0 compatible controllers (all models)
Interfaces                | Three x4 6Gb mini-SAS connectors per controller (all models)
Max storage per enclosure* | 336 TB            | 48 TB            | 24 TB
Max storage per SM15K*    | 5,376 TB (5.3 PB)  | 768 TB           | 384 TB

[Diagram: SM15K chassis with 16 mini-SAS connectors linking to the enclosures]
*Based on 4TB 3.5" and 1TB 2.5" HDDs
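The footnoted capacities are simple products of disk count and drive size; a quick check of the FS 5084-L column (assuming the footnote's 4TB 3.5" drives and one enclosure per each of the SM15K's 16 mini-SAS connectors):

```java
// Sanity check of the table's capacity math: 84 drives x 4TB per
// FS 5084-L enclosure, and 16 enclosures attached to one SM15K.
public class CapacityCheck {
    public static void main(String[] args) {
        int drivesPerEnclosure = 84;
        int tbPerDrive = 4;              // 4TB 3.5" HDD (per the footnote)
        int enclosuresPerSystem = 16;    // assumed: one per mini-SAS connector

        int tbPerEnclosure = drivesPerEnclosure * tbPerDrive;    // 336 TB
        int tbPerSystem = tbPerEnclosure * enclosuresPerSystem;  // 5,376 TB

        System.out.println(tbPerEnclosure + " TB per enclosure, "
                + tbPerSystem + " TB (~5.3 PB) per SM15K");
    }
}
```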
FABRIC ENABLES FLEXIBLE COMPUTE, NETWORK AND STORAGE RATIOS
[Diagram: Freedom Fabric linking servers, uplinks, and storage]
- Freedom ASICs create 1.28 Tbps of bandwidth (0.5 to 6s fabric)
- Hot-plug up to 16 x 10GbE uplinks
- Hot-plug up to 5.3 petabytes of storage
- Freedom Fabric enables any server to access any uplink or storage
SM15000: 512 OPTERON "PILEDRIVER" CORES IN 10 RU
Opteron processors based on the new Piledriver core
- 64 sockets, each with a new octal-core Opteron processor: 64-bit x86, 2.0/2.3/2.8 GHz
- 512 cores in 10 RU; 2,048 cores in a rack
- DRAM: 64GB/socket; 4 terabytes/system; industry-leading DRAM density: 400 GB/RU
- Freedom supercompute fabric: 10 GigE bandwidth to each socket; 16 x 10GbE uplinks
- Supports 1,408 drives, linking up to 5 petabytes of fabric storage
- Runs standard OSes including Windows and Linux, and VMware and Citrix hypervisors
SM15000 LEADS THE INDUSTRY IN STORAGE CAPACITY
5-petabyte cluster comparison
Traditional approach:
- 6 racks
- 112 2RU dual-socket octal-core Sandy Bridge servers, each with 12 3.5" SATA/SAS disks
- 224 OS/big-data SW licenses
- 12 10GbE switches
- 6 terminal servers
- 224 power cables, 248 networking cables
- 40 kW
SM15000 approach:
- 2 racks (1/3 the space)
- 1 SM15000 + 16 Freedom Fabric storage enclosures
- 64 OS/big-data SW licenses
- 38 power cords, 32 fabric extender cables
- 20 kW
- 1/2 the price
Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, the HSA logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is
a trademark of Apple Corp. which is licensed to the Khronos Organization. All other names used in this presentation are for
informational purposes only and may be trademarks of their respective owners.
© 2012 Advanced Micro Devices, Inc.