The New Era of Coprocessors in Supercomputing. Marc XAB, M.A. (J. F. Oberlin University Graduate School), Country Manager. 5/07/2013 @ BAH! Oil & Gas, Rio de Janeiro, Brazil. Super Micro Computer Inc., Rua Funchal, 418, São Paulo – SP. www.supermicro.com/brazil. Networking in Rio. Company Overview.
Revenues: FY10 $721M, FY11 $942M, FY12 $1B
Global Footprint: >70 countries, 700 customers, 6,800 SKUs
Production: US, EU, and Asia production facilities
Engineering: 70% of workforce in engineering; SSI member
Market Share: #1 server channel
Corporate Focus: leader in energy-efficient, HPC, and application-optimized systems
San Jose (Headquarters)
Fortune 2012 100 Fastest-Growing Companies
COPROCESSOR: A coprocessor is a computer processor used to supplement the functions of the primary processor (the CPU). Operations performed by a coprocessor may include floating-point arithmetic, graphics, signal processing, string processing, encryption, or I/O interfacing with peripheral devices.
Math coprocessor – a computer chip that handles floating-point operations and mathematical computations in a computer.
Graphics Processing Unit (GPU) – a separate card that handles graphics rendering and can improve performance in graphics-intensive applications, such as games.
Secure cryptoprocessor – a dedicated computer-on-a-chip or microprocessor for carrying out cryptographic operations, embedded in packaging with multiple physical security measures that give it a degree of tamper resistance.
“Submerged Supermicro Servers Accelerated by GPUs”
Supermicro 1U (single CPU) with two coprocessors. No requirement for room-level cooling. Operates at PUE ~ 1.12. 25 kilowatts per rack is the breakpoint between regular air cooling and submerged liquid cooling.
Case Study – Submerged Liquid Cooling
Cost Efficiency (kW per rack): air cooling vs. submerged liquid cooling, with ~25 kW per rack as the breakpoint.
Removed fans and heat sinks. Used SSDs and an updated BIOS. Reversed the handles.
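The quoted PUE of ~1.12 follows from the standard definition: total facility power divided by IT equipment power. The sketch below checks the figure with illustrative wattages (the 3 kW cooling overhead is an assumption for the example, not a measurement from the deck).

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    return total_facility_kw / it_equipment_kw

# Hypothetical rack: 25 kW of IT load plus an assumed 3 kW of cooling overhead
print(round(pue(25.0 + 3.0, 25.0), 2))  # 1.12
```

A PUE of 1.12 means only 12% of the facility's power goes to anything other than the computing equipment itself, which is why submerged cooling can be attractive above the ~25 kW/rack breakpoint.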
Tesla: 2-3x Faster Every 2 Years (chart: DP GFLOPS per watt, 2008-2014, for T10, Fermi (512 cores), Kepler, and Maxwell (thousands of cores)).
GPU Supercomputer Momentum (chart: number of GPU-accelerated systems in the Top500, 2008-2013; first double-precision GPU and Tesla Fermi launch marked; roughly 4x growth to 52 systems in the June 2012 Top500).
Case Study – PNNL
Expects the supercomputer to rank among the world's 20 fastest machines.
Research for climate and environmental science, chemical processes, biology-based fuels that can replace fossil fuels, new materials for energy applications, etc.
Supermicro FatTwin™ with 2x Intel Xeon Phi ("MIC") 5110P per node
Theoretical peak processing speed of 3.4 petaflops
42 racks / 195,840 cores
1,440 compute nodes with conventional processors and Intel Xeon Phi "MIC" accelerators
128 GB memory per node
FDR InfiniBand network
2.7-petabyte shared parallel file system (60 gigabytes per second read/write)
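The 3.4-petaflop figure is roughly consistent with the node count above. The sketch below estimates it, assuming ~1.011 TFLOPS double-precision peak per Xeon Phi 5110P and a ballpark per-node contribution from the host CPUs (both per-device figures are assumptions, not numbers from the deck).

```python
nodes = 1440
phi_per_node = 2
phi_dp_tflops = 1.011   # assumed DP peak of one Xeon Phi 5110P
cpu_dp_tflops = 0.33    # assumed combined DP peak of the host CPUs per node

phi_total = nodes * phi_per_node * phi_dp_tflops   # ~2.91 PFLOPS from the Phis
cpu_total = nodes * cpu_dp_tflops                  # ~0.48 PFLOPS from the CPUs
peak_pflops = (phi_total + cpu_total) / 1000.0
print(f"{peak_pflops:.2f} PFLOPS")  # close to the quoted 3.4 petaflops
```

Under these assumptions, most of the machine's theoretical peak comes from the coprocessors rather than the conventional processors.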
Programming Paradigm
The Xeon Phi programming model and its optimizations are shared with the standard Intel Xeon processor family, so code tuned for one largely carries over to the other.
CUDA (Compute Unified Device Architecture) – a parallel computing platform and programming model created by NVIDIA. CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA-capable GPUs.
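As a rough illustration of the CUDA programming model, the sketch below emulates its one-thread-per-element kernel pattern in plain Python (real CUDA kernels are C/C++ functions launched over a grid of threads; the SAXPY function and data here are illustrative, not from the deck).

```python
# Plain-Python sketch of the CUDA execution model: a "kernel" runs once per
# thread index, and each thread handles one array element (SAXPY: y = a*x + y).
def saxpy_kernel(i, a, x, y):
    # body that the CUDA thread with index i would execute
    y[i] = a * x[i] + y[i]

a = 2.0
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
for i in range(len(x)):          # stands in for the grid of GPU threads
    saxpy_kernel(i, a, x, y)
print(y)  # [12.0, 24.0, 36.0]
```

The point of the pattern is that each element's update is independent, so on a GPU all the "loop iterations" execute in parallel across thousands of cores.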
The model used by existing CPU-GPU heterogeneous architectures for GPU-to-GPU communication: data travels via the CPU and an InfiniBand (IB) Host Channel Adapter (HCA) and switch, or another proprietary interconnect.
In a TCA cluster, data transfer between cooperating GPUs in separate nodes is enabled by the PEACH2 chip.
Schematic of the PEARL network within a CPU/GPU cluster