  • HETEROGENEOUS SYSTEM ARCHITECTURE

    Mike Houston, AMD Fellow

    Platform of the Future

  • 2 | XLDB - Stanford | Sept. 12, 2012

    A NEW ERA OF PROCESSOR PERFORMANCE

    [Figure: three performance-over-time curves, each marked "we are here"]

    Single-Core Era (Single-thread Performance vs. Time)

    Enabled by: Moore's Law, Voltage Scaling

    Constrained by: Power, Complexity

    Multi-Core Era (Throughput Performance vs. Time (# of processors))

    Enabled by: Moore's Law, SMP architecture

    Constrained by: Power, Parallel SW, Scalability

    Heterogeneous Systems Era (Modern Application Performance vs. Time (Data-parallel exploitation))

    Enabled by: Abundant data parallelism, Power efficient GPUs

    Temporarily Constrained by: Programming models, Comm. overhead

    Assembly, C/C++, Java, pthreads, OpenMP / TBB, Shader, CUDA, OpenCL, !!!

  • 3 | XLDB - Stanford | Sept. 12, 2012

    Most parallel code runs on CPUs designed for scalar workloads

  • 4 | XLDB - Stanford | Sept. 12, 2012

    HETEROGENEOUS SYSTEM ARCHITECTURE: AN OPEN PLATFORM

    Open Architecture, published specifications

      HSAIL virtual ISA

      HSA memory model

      HSA dispatch

    ISA agnostic for both CPU and GPU

    Inviting partners to join us, in all areas

      Hardware companies

      Operating Systems

      Tools and Middleware

      Applications

  • 5 | XLDB - Stanford | Sept. 12, 2012

    www.hsafoundation.com

    to define the next generation of computing platforms for all devices

  • 6 | XLDB - Stanford | Sept. 12, 2012

    GOALS

    Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today

    Dramatically expand the APU software ecosystem in client and server

    Enable immersive applications whether hosted locally or in the cloud

  • 7 | XLDB - Stanford | Sept. 12, 2012

    APU: ACCELERATED PROCESSING UNIT

    The APU has arrived and it is a great advance over previous platforms

    Combines scalar processing on the CPU with parallel processing on the GPU and high-bandwidth access to memory

    How do we make it even better going forward?

    Easier to program

    Easier to optimize

    Easier to load balance

    Higher performance

    Lower power

  • 8 | XLDB - Stanford | Sept. 12, 2012

    HETEROGENEOUS SYSTEM ARCHITECTURE ROADMAP

  • 9 | XLDB - Stanford | Sept. 12, 2012

    ACCELERATING MEMCACHED

    CLOUD SERVER WORKLOAD

  • 10 | XLDB - Stanford | Sept. 12, 2012

    DATACENTER WORKLOAD

    Generally used for short-term storage and caching, handling requests that would otherwise require database or file system accesses

    Used by Facebook, YouTube, Twitter, Wikipedia, Flickr, and others

    Effectively a large distributed hash table

    Responds to store and get requests received over the network

    Conceptually:

    store(key, object)

    object = get(key)
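The conceptual `store`/`get` interface above can be illustrated with a toy, single-process sketch. This is not memcached's implementation or client API: the class name `KeyValueCache` and the helper `shard_for` are hypothetical, and the modulo placement of keys onto servers is only an assumption about how a distributed hash table might spread keys.

```python
import zlib


class KeyValueCache:
    """Toy in-process stand-in for the store/get interface on the slide."""

    def __init__(self):
        # A Python dict is itself a hash table.
        self._table = {}

    def store(self, key, obj):
        """Insert or overwrite the object cached under `key`."""
        self._table[key] = obj

    def get(self, key):
        """Return the cached object, or None on a miss.

        On a miss the caller would fall back to the database or file system.
        """
        return self._table.get(key)


def shard_for(key, num_servers):
    """Pick which server owns `key` (hypothetical modulo placement)."""
    return zlib.crc32(key.encode("utf-8")) % num_servers


cache = KeyValueCache()
cache.store("user:42", {"name": "Ada"})
hit = cache.get("user:42")
miss = cache.get("user:43")
```

Key lookup is the part of this workload that the following slides offload to the GPU; hashing many keys at once is naturally data parallel.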

  • 11 | XLDB - Stanford | Sept. 12, 2012

    OFFLOADING MEMCACHED KEY LOOKUP TO THE GPU

    [Figure: "Key Look Up Performance" and "Execution Breakdown" (Data Transfer vs. Execution) for Multithreaded CPU, Radeon HD 5870, Trinity A10-5800K, and Zacate E-350]

    T. H. Hetherington, T. G. Rogers, L. Hsu, M. O'Connor, and T. M. Aamodt, "Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems," Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2012), April 2012.

    http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6189209

  • 12 | XLDB - Stanford | Sept. 12, 2012

    ACCELERATING JAVA

    GOING BEYOND NATIVE LANGUAGES

  • 13 | XLDB - Stanford | Sept. 12, 2012

    JAVA ENABLEMENT BY APARAPI

    Developer creates Java source

    Source compiled to class files (bytecode) using standard compiler (javac)

    Classes packaged and deployed using established Java tool chain

    Aparapi = Runtime capable of converting Java bytecode to OpenCL

      For execution on any OpenCL 1.1+ capable device

      OR execute via a thread pool if OpenCL is not available
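The dispatch pattern described above (run a per-element kernel on an accelerator when one is available, otherwise on a thread pool) can be sketched as follows. This is written in Python for illustration and is not Aparapi's API: `run_kernel`, its `accelerator` parameter, and `add_kernel` are hypothetical names, and the accelerator path is only a placeholder for an OpenCL backend.

```python
from concurrent.futures import ThreadPoolExecutor


def run_kernel(kernel, global_size, accelerator=None, max_workers=4):
    """Run kernel(i) once for every index i in range(global_size).

    `accelerator` is a hypothetical callable standing in for an OpenCL
    backend. When it is None, the work falls back to a thread pool.
    Returns the name of the backend that was used.
    """
    if accelerator is not None:
        accelerator(kernel, global_size)
        return "accelerator"
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all work items and re-raises any exception.
        list(pool.map(kernel, range(global_size)))
    return "thread-pool"


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)


def add_kernel(i):
    """One work item: each index writes only its own output slot."""
    out[i] = a[i] + b[i]


backend = run_kernel(add_kernel, len(a))
```

Because every work item touches a distinct output element, the same kernel body is valid on either backend, which is what makes the fallback transparent to the developer.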

  • 14 | XLDB - Stanford | Sept. 12, 2012

    JAVA AND APARAPI HSA ENABLEMENT ROADMAP

    [Figure: four software stack diagrams; components of each stack as extracted]

    Application, Aparapi, JVM, OpenCL, CPU, GPU

    Application, Aparapi, JVM, HSAIL, HSA Finalizer, CPU ISA, GPU ISA, HSA CPU, HSA GPU

    Application, Aparapi, JVM, LLVM Optimizer, IR, HSAIL, HSA Runtime, HSA Finalizer, CPU ISA, GPU ISA, HSA CPU, HSA GPU

    Application, HSA-Enabled JVM, HSAIL, HSA Finalizer, CPU ISA, GPU ISA, HSA CPU, HSA GPU

  • 15 | XLDB - Stanford | Sept. 12, 2012

    HSA SOFTWARE STACKS

  • 16 | XLDB - Stanford | Sept. 12, 2012

    INTRODUCING HSA BOLT: PARALLEL PRIMITIVES LIBRARY FOR HSA

    Easily leverage the inherent power efficiency of GPU computing

      Common routines such as scan, sort, reduce, transform

      More advanced routines like heterogeneous pipelines

      Bolt library works with OpenCL or C++ AMP

    Enjoy the unique advantages of the HSA platform

      Move the computation, not the data

    Finally, a single source code base for the CPU and GPU!

      Developers can focus on core algorithms

    See Ben Sander's session tomorrow for a deep dive on HSA Bolt!
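For readers unfamiliar with the four primitives named on this slide, here is a minimal sequential sketch of what transform, reduce, scan, and sort compute. It uses the Python standard library only and says nothing about Bolt's actual C++ interface; a parallel library would produce the same results while distributing the work across CPU and GPU.

```python
from functools import reduce
from itertools import accumulate
import operator

data = [3, 1, 4, 1, 5]

# transform: apply a function independently to every element
squared = [x * x for x in data]          # [9, 1, 16, 1, 25]

# reduce: combine all elements with an associative operator
total = reduce(operator.add, data, 0)    # 14

# scan (inclusive prefix sum): running result of the same operator
prefix_sums = list(accumulate(data))     # [3, 4, 8, 9, 14]

# sort: reorder elements into ascending order
ordered = sorted(data)                   # [1, 1, 3, 4, 5]
```

Each primitive takes the operator or function as a parameter, which is why a developer can focus on the core algorithm and leave the parallel scheduling to the library.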

  • 17 | XLDB - Stanford | Sept. 12, 2012

    HSA SOLUTION STACK

    [Figure: layered stack diagram; labels as extracted]

    Legend: Application SW, Drivers, Differentiated HW, HSA Software

    Application

    Domain Specific Libs (Bolt, OpenCV, many others)

    OpenCL Runtime, DirectX Runtime, Other Runtime, HSA Runtime

    HSAIL, HSA Finalizer, GPU ISA

    Legacy Drivers, Knl Driver, Ctl

    CPU(s), GPU(s), Other Accelerators

  • 18 | XLDB - Stanford | Sept. 12, 2012

    AMD'S OPEN SOURCE COMMITMENT TO HSA

    Component Name                 AMD Specific   Rationale
    HSA Bolt Library               No             Enable understanding and debug
    OpenCL HSAIL Code Generator    No             Enable research
    LLVM Contributions             No             Industry and academic collaboration
    HSA Assembler                  No             Enable understanding and debug
    HSA Runtime                    No             Standardize on a single runtime
    HSA Finalizer                  Yes            Enable research and debug
    HSA Kernel Driver              Yes            For inclusion in Linux distros

    We will open source our Linux execution and compilation stack

      Jump start the ecosystem

      Allow a single shared implementation where appropriate

      Enable university research in all areas

  • 19 | XLDB - Stanford | Sept. 12, 2012

    SEAMICRO

    HIGH DENSITY COMPUTING

  • 20 | XLDB - Stanford | Sept. 12, 2012

    THE SM15K PRODUCT FAMILY

    10 Rack Units

    64 Server Cards

    1.28 Terabit fabric interconnect

    Up to 160GbE Uplink (16 x 10GbE or 64 x 1 GbE)

    0-64 Internal 2.5-inch SAS/SATA HDD/SSD

    Up to 1344 External 3.5-inch SAS/SATA HDD/SSD

    Up to 16 x4 3Gbps SAS interfaces for External Storage

    Hardware RAID module w/RAID 1,5,6 and 10

    Hot swappable modules with in-service upgrades

    Runs off-the-shelf OS and hypervisors

    Redundant Power 100-208V AC, 48V DC

    3.0 to 3.5 KW Power Consumption (25-85% Util)

  • 21 | XLDB - Stanford | Sept. 12, 2012

    AMD SERVER BLADES: SHIPS Q4 2012

    64 Opteron EE-4365 Servers per 10RU

    512 cores in 10 RU; 2,048 cores in a rack

    64GB ECC DRAM/Server (4TB per 10RU, 16TB per rack)

    8 x 1GbE per server

    AMD Opteron Blade

    1 Octal Core 2.0/2.3/2.8GHz Opteron EE-4365 processor per server blade

    SM15K-OP

  • 22 | XLDB - Stanford | Sept. 12, 2012

    SEAMICRO FABRIC STORAGE ENCLOSURE FAMILY

                                  FS 5084-L                       FS 2012-L          FS 2024-S
    Positioning                   High Capacity                   Low Upfront Cost   Performance Optimized
    Height (RU)                   5RU                             2RU                2RU
    Disk Count                    84                              12                 24
    Disk Types Supported          3.5-inch / 2.5-inch SAS/SATA    3.5-inch SAS/SATA  2.5-inch SAS/SATA
    Controller                    Dual HA Storage Bridge Bay (SBB) 2.0 Compatible controllers
    Interfaces                    Three x4 6Gb mini-SAS connectors per controller
    Max Storage per Enclosure*    336 TB                          48 TB              24 TB
    Max Storage per SM15K*        5,376 TB (5.3 PB)               768 TB             384 TB

    *Based on 4TB 3.5-inch and 1TB 2.5-inch HDD

    [Figure: SM15K with 16 mini-SAS connectors]
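The capacity figures on this slide follow from disk count times disk size, scaled by the number of enclosures per system. The short check below reproduces them. It assumes 16 enclosures per SM15K (one per mini-SAS connector, matching the 16 enclosures quoted in the cluster comparison later in the deck) and decimal terabytes; the variable names are illustrative.

```python
ENCLOSURES_PER_SM15K = 16  # assumption: one enclosure per mini-SAS connector

# model -> (disk count, TB per disk), using the footnote's 4 TB 3.5-inch
# and 1 TB 2.5-inch drive sizes
models = {
    "FS 5084-L": (84, 4),
    "FS 2012-L": (12, 4),
    "FS 2024-S": (24, 1),
}

per_enclosure_tb = {
    name: disks * tb_per_disk for name, (disks, tb_per_disk) in models.items()
}
per_system_tb = {
    name: tb * ENCLOSURES_PER_SM15K for name, tb in per_enclosure_tb.items()
}

for name in models:
    print(f"{name}: {per_enclosure_tb[name]} TB per enclosure, "
          f"{per_system_tb[name]} TB per SM15K")
```

The FS 5084-L result, 5,376 TB, is the figure the slide rounds to 5.3 PB.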

  • 23 | XLDB - Stanford | Sept. 12, 2012

    FABRIC ENABLES FLEXIBLE COMPUTE, NETWORK AND STORAGE RATIOS

    Freedom ASICs create 1.28Tbps Bandwidth & 0.5 to 6 µs Fabric

    Hot plug up to 16 x 10GbE

    Hot plug up to 5.3 Petabytes storage

    Freedom Fabric enables any server to access any uplink or storage

  • 24 | XLDB - Stanford | Sept. 12, 2012

    SM15000: 512 OPTERON "Piledriver" CORES IN 10 RU

    OPTERON PROCESSORS BASED ON THE NEW PILEDRIVER CORE

    64 sockets, each with a new Octal core Opteron processor: 64-bit, x86, 2.0/2.3/2.8 GHz

    512 cores in 10 RU; 2,048 cores in a rack

    DRAM: 64GB/socket; 4 terabytes/system,

    Industry leading DRAM density: 400 GB/RU

    Freedom supercompute Fabric

    10 GigE bandwidth to each socket

    16 x 10GbE uplinks

    Supports 1,408 drives, linking up to 5 petabytes of fabric storage

    Runs standard OS including Windows, Linux and VMware and Citrix hypervisors
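The headline numbers on this slide can be cross-checked with simple arithmetic. The sketch below assumes four 10 RU systems per rack (inferred from 2,048 / 512, not stated in the deck), binary units for DRAM (1 TB = 1,024 GB), and takes the internal and external drive counts from the SM15K product family slide.

```python
SOCKETS_PER_SYSTEM = 64
CORES_PER_SOCKET = 8          # "Octal core"
DRAM_GB_PER_SOCKET = 64
RACK_UNITS_PER_SYSTEM = 10
SYSTEMS_PER_RACK = 4          # assumption inferred from 2,048 / 512
INTERNAL_DRIVES = 64          # from the SM15K product family slide
EXTERNAL_DRIVES = 1344        # from the SM15K product family slide

cores_per_system = SOCKETS_PER_SYSTEM * CORES_PER_SOCKET        # 512
cores_per_rack = cores_per_system * SYSTEMS_PER_RACK            # 2048
dram_gb_per_system = SOCKETS_PER_SYSTEM * DRAM_GB_PER_SOCKET    # 4096
dram_tb_per_system = dram_gb_per_system / 1024                  # 4.0
dram_gb_per_ru = dram_gb_per_system / RACK_UNITS_PER_SYSTEM     # 409.6
total_drives = INTERNAL_DRIVES + EXTERNAL_DRIVES                # 1408
```

Under these assumptions the DRAM density works out to about 410 GB/RU, which the slide quotes as 400 GB/RU, and the 1,408 supported drives are the 64 internal drives plus 1,344 external ones.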

  • 25 | XLDB - Stanford | Sept. 12, 2012

    SM15000 LEADS THE INDUSTRY IN STORAGE CAPACITY

    5 PETABYTE CLUSTER COMPARISON

    Sandy Bridge server cluster:

      6 Racks

      112 2 RU Dual Socket Octal Core Sandy Bridge Servers, each w/12 3.5-inch SATA/SAS Disks

      224 OS/Big Data SW Licenses

      12 10GbE Switches

      6 Terminal Servers

      224 Power Cables, 248 Networking cables

      40 KW

    SM15000 cluster:

      2 Racks (1/3 the space)

      1 SM15000 + 16 Freedom Fabric Storage Enclosures

      64 OS/Big Data SW Licenses

      38 power cords, 32 Fabric Extender Cables

      20 KW

      1/2 the Price

  • 26 | XLDB - Stanford | Sept. 12, 2012

    Disclaimer & Attribution

    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

    The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

    NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

    ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

    AMD, the AMD arrow logo, the HSA logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a trademark of Apple Corp. which is licensed to the Khronos Organization. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

    © 2012 Advanced Micro Devices, Inc.