
Fujitsu HPC and AI Processors · Title: Fujitsu HPC and AI Processors · Author: FUJITSU LIMITED · Created Date: 6/28/2017 3:09:14 PM

Page 1

Connect. Challenge. Inspire.

All Rights Reserved, Copyright© FUJITSU LIMITED 2015

ISC 2017

June 20th 2017

Fujitsu HPC and AI Processors

Takumi Maruyama, Senior Director

AI Platform Business Unit

Advanced System Research & Development Unit

Page 2

Agenda

• K computer
• Fujitsu’s latest processors
  - HPC
  - UNIX
• Future Fujitsu processors under development
  - Post K
  - AI processor: DLU
• Summary

Copyright 2017 FUJITSU LIMITED

Page 3

K Computer

Page 4

K Computer

WR#1

10.51 PFlops (Top500, 2011/11)

38,621 GTEPS (Graph500, 2016/11)

602.7 TFLOPS (HPCG, 2016/11)
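
To compare the quoted results, the short calculation below expresses the HPCG score as a fraction of the Top500 (HPL) score. It uses only the two figures on this slide; the variable names are illustrative, and the two results were measured five years apart.

```python
# Figures quoted on the slide, normalized to TFLOPS.
hpl_tflops = 10.51 * 1000   # Top500 (HPL), 2011/11: 10.51 PFlops
hpcg_tflops = 602.7         # HPCG, 2016/11

# Fraction of the HPL score that is reached on HPCG.
hpcg_to_hpl = hpcg_tflops / hpl_tflops
print(f"HPCG / HPL = {hpcg_to_hpl:.1%}")
```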

Page 5

Fujitsu Technologies in the K computer

• High Performance Processor: 8 cores
• Liquid Cooling: 4 processors
• High Density Rack: 24 boards
• Torus Network: 6D
• 864 racks, 82,944 compute nodes, 5,184 IO nodes
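
The system figures on this slide can be cross-checked with a few multiplications. The sketch below reads "4 processors" as processors per board and "24 boards" as boards per rack, and assumes one processor per compute node. These readings are interpretations of the diagram labels, not statements from the slide.

```python
# Figures quoted on the slide.
racks = 864
boards_per_rack = 24        # interpretation: boards in one high-density rack
processors_per_board = 4    # interpretation: processors on one liquid-cooled board
cores_per_processor = 8
io_nodes = 5184

# ASSUMPTION: one processor per compute node.
compute_nodes = racks * boards_per_rack * processors_per_board
total_cores = compute_nodes * cores_per_processor
io_nodes_per_rack = io_nodes / racks

print(compute_nodes)        # compare with the quoted 82,944 compute nodes
print(total_cores)
print(io_nodes_per_rack)
```

Under these assumptions the product reproduces the quoted compute node count, which suggests the reading of the labels is consistent.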

Page 6

The Latest Fujitsu Processors

Page 7

Fujitsu Processor Development
Perpetual Evolution > 60 years: Always Targeting No.1

[Roadmap figure: Mainframe/UNIX/HPC + AI incremental development]
Time axis: 2000~2003, 2004~2007, 2008~2011, 2012~2015, 2016~
Mainframe: GS8600, GS8800, GS8800B, GS8900, GS21 600, GS21 900, GS21 1600, GS21 2600, Next GS
UNIX: SPARC64, SPARC64 II, SPARC64 GP, SPARC64 V, SPARC64 V+, SPARC64 VI, SPARC64 VII, SPARC64 X, SPARC64 X+, SPARC64 XII, Next SPARC
HPC: SPARC64 VIIIfx, SPARC64 IXfx, SPARC64 XIfx, Post-K (ARM)
AI: DLU
Technology generation: 350nm, 250nm / 220nm, 180nm, 130nm, 90nm, 65nm, 45nm, 40nm, 28nm, 20nm
Annotations: CMOS Cu, Tr=1B
Performance: Store Ahead, Branch History, Prefetch, Single-chip CPU, Non-Blocking $, O-O-O Execution, Super-Scalar, L2$ on Die, HPC-ACE, System on Chip, Hardware Barrier, Multi-core Multi-thread, Virtual Machine Architecture, Software on Chip, High-speed Interconnect
Reliability: $ ECC, Register/ALU Parity, Instruction Retry, $ Dynamic Degradation, Error Checkers/History

Page 8

SPARC64™ XIfx Chip (HPC)

Architecture Features
• 32 computing cores + 2 assistant cores
• HPC-ACE2 (256bit SIMD): Fujitsu’s ISA enhancements
• Sector Cache: Cache with SW controllability
• 24 MB L2 cache

20nm CMOS
• 3,750M transistors
• 2.2GHz

Performance (peak)
• 1.1TFlops
• HMC 240GB/s x 2 (in/out)
• Tofu2 125GB/s x 2 (in/out)

[Die diagram: 32 cores and 2 assistant cores, two L2 cache blocks, Tofu2 interface and Tofu2 controller, HMC interfaces, MACs, PCI interface and PCI controller]

Many (32+2) cores, Medium CPU GHz
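
The quoted peak of 1.1TFlops can be approximated from the core count, clock, and SIMD width on this slide. The sketch below is a back-of-the-envelope check, not Fujitsu's published derivation. It assumes 64-bit (double precision) elements and two fused multiply-add SIMD pipelines per core; the pipeline count is an assumption that does not appear on the slide.

```python
cores = 32                  # computing cores only (assistant cores excluded)
clock_ghz = 2.2
simd_bits = 256             # HPC-ACE2 SIMD width

fp64_lanes = simd_bits // 64
fma_pipes_per_core = 2      # ASSUMPTION: not stated on the slide
flops_per_fma = 2           # one multiply plus one add

flops_per_cycle = fp64_lanes * fma_pipes_per_core * flops_per_fma
peak_gflops = cores * clock_ghz * flops_per_cycle
print(f"peak ~ {peak_gflops:.1f} GFlops")

# Memory bandwidth per flop, using the quoted HMC figure for one direction.
hmc_gbs_one_way = 240
bytes_per_flop = hmc_gbs_one_way / peak_gflops
print(f"HMC bytes/flop (one direction) ~ {bytes_per_flop:.3f}")
```

Under these assumptions the result is roughly 1.1 TFlops, in line with the quoted peak.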

Page 9

SPARC64™ XII Chip (UNIX)

Architecture Features
• 12 cores x 8 threads
• SWoC (“Software on Chip”): Fujitsu’s ISA enhancements
• 32MB L3 cache
• Embedded MAC and IOC

20nm CMOS
• 25.8mm x 30.8mm
• 5,450M transistors
• 4.25GHz (up to 4.35GHz with “High Speed Mode” enabled)

Performance (peak)
• 417GIPS / 835GFlops
• 153GB/s memory throughput

[Die diagram: 12 cores with L2 caches, L3 cache blocks, two DDR4 interfaces, MACs, PCIe Gen3 SERDES, interconnect SERDES, and Interconnect & Coherence Control]

Multiple big cores, High CPU GHz
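
Dividing the quoted peak figures by cores times clock gives per-core, per-cycle rates. The sketch below does this for both clock speeds on the slide. At 4.35GHz both rates land very close to whole numbers, which suggests, but does not prove, that the peak figures were computed at the High Speed Mode clock. That reading is an inference from the arithmetic, not something the slide states.

```python
cores = 12
peak_gflops = 835
peak_gips = 417

for clock_ghz in (4.25, 4.35):
    # Per-core, per-cycle rates implied by the quoted peaks at this clock.
    flops_per_core_cycle = peak_gflops / (cores * clock_ghz)
    instr_per_core_cycle = peak_gips / (cores * clock_ghz)
    print(f"{clock_ghz} GHz: {flops_per_core_cycle:.2f} flops/cycle/core, "
          f"{instr_per_core_cycle:.2f} instructions/cycle/core")
```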

Page 10

SPARC64™ XIfx (HPC) Pipeline

[Pipeline diagram. Stages: Fetch, Issue, Dispatch, Reg-Read, Execute, Cache and Memory, Commit. Blocks: L1 I$ (64KB, 4 ways), Branch Target Address, Pattern History Table, Local Pattern Table, Decode & Issue; reservation stations RSE, RSA, RSF, RSBR; GPR (188 registers) with GUB; FPR (128x4 Reg.) with FUB; execution units EXA, EXB, EXC, EXD, EAGA, EAGB, FLA, FLB; Fetch Port, Store Port, Write Buffer; L1 D$ (64KB, 4 ways); CSE, PC, Control Registers; L2$, MAC, IOC, CPU-CPU I/F; 34 cores]

Page 11

SPARC64™ XII (UNIX) Pipeline

[Pipeline diagram. Stages: Fetch, Decode, Issue, Reg-Read, Execute, Cache and Memory, Commit. Blocks: L1 Instruction Cache (64KB), Branch Prediction, Instruction Buffer, Decode; RSE (Reservation Station for Execution), RSA (Reservation Station for Address generation), RSF (Reservation Station for Floating-point), RSBR (Reservation Station for Branch); GPR x4 with GUB, FPR x4 with FUB (FPR Update Buffer); Pipeline-0 and Pipeline-1 with EXA, EXB, EAGA, EAGB, FLA, FLB, FLC, FLD; Fetch Port, Store Port, Store Buffer, dTLB; L1 Data Cache (32KB); Commit Stack Entry, Program Counter x4, Control Registers x4; L2 Cache, L3 Cache, MAC, IOC, CPU-CPU i/f; 12 cores]

Shared Micro-architecture

Page 12

Future Fujitsu Processors Under Development

- Post K

Page 13

Japan’s Post-K Computer Development Project

Project Overview
• RIKEN and Fujitsu are currently developing the post-K computer, which aims to be the most advanced general-purpose supercomputer in the world

Goals of Japan’s Post-K Development Project
• Application performance
• Low power consumption
• User convenience
• Ability to produce ground-breaking results

Page 14

Post-K Processor and Interconnect Features

Functions & Architecture                    Post-K            K computer
Processor
  Base ISA + SIMD Extensions                ARMv8-A + SVE     SPARCv9 + HPC-ACE
  SIMD width [bit]                          512               128
  FP16 (half precision) support             ✔                 -
  FMA: Floating-point multiply and add      ✔                 ✔
  Math. acceleration primitives             ✔ Enhanced        ✔
  Inter-core barrier                        ✔                 ✔
  Sector cache                              ✔ Enhanced        ✔
  Hardware “prefetch” assist                ✔ Enhanced        ✔
Interconnect
  Tofu                                      ✔ Enhanced        ✔

Fujitsu Processor, adopting ARM ISA and enhanced Tofu interconnect

Inherits and enhances the K computer’s innovative features
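
The SIMD width row translates directly into elements processed per SIMD register. The sketch below assumes elements are packed back to back in the register. For the K computer only FP64 is computed, because the table does not say how HPC-ACE handles narrower types. The dictionary names are illustrative.

```python
# SIMD widths in bits, from the table.
simd_width_bits = {"Post-K": 512, "K computer": 128}

# Element formats considered per machine (FP16 only where the table marks support).
formats = {
    "Post-K": {"FP64": 64, "FP32": 32, "FP16": 16},
    "K computer": {"FP64": 64},   # narrower formats left out: packing not stated
}

# ASSUMPTION: elements are packed, so lanes = register width / element width.
lanes = {
    (machine, fmt): simd_width_bits[machine] // bits
    for machine, fmts in formats.items()
    for fmt, bits in fmts.items()
}

for (machine, fmt), n in lanes.items():
    print(f"{machine:10s} {fmt}: {n} elements per SIMD register")
```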

Page 15

Post-K Processor Supports FP16

Provides optimized precision for a wide range of applications

• Superior performance

• Reduces required bandwidth and power consumption

Target applications:

• Existing numerical applications

• Brand-new applications, including Deep Learning

[Figure: Double Precision, Single Precision, and Half Precision: High Performance for More Applications]
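
The bandwidth argument for FP16 comes down to bytes per element, and the cost is reduced precision. The sketch below illustrates both with Python's standard `struct` module, whose `e`, `f`, and `d` format codes are IEEE 754 half, single, and double precision. It is a generic illustration and says nothing about how the Post-K processor implements FP16.

```python
import struct

# Bytes per element for double ('d'), single ('f'), and half ('e') precision.
sizes = {name: struct.calcsize("<" + code)
         for name, code in (("FP64", "d"), ("FP32", "f"), ("FP16", "e"))}

# Round-trip a value through FP16 to see the precision that is given up.
value = 0.1
as_fp16 = struct.unpack("<e", struct.pack("<e", value))[0]

print(sizes)
print(f"{value} stored as FP16 reads back as {as_fp16!r}")
```

For the same number of elements, FP16 data moves half the bytes of FP32 and a quarter of FP64.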

Page 16

Future Fujitsu Processor Development

- AI Processor (DLU™)

Page 17

Processor Designed for Deep Learning

DLU™ (Deep Learning Unit): FY2018 ~

Features of DLU
• Architecture designed for Deep Learning
• Low power consumption design
• Optimized precision
➔ Goal: 10x Performance / Watt compared to competitors
• Scalable design with Tofu interconnect technology
➔ Ability to handle large-scale neural networks

Utilizing technologies derived from the K computer

The photograph is for illustration only and differs from the actual product.

Page 18

DLU Design Target

High Deep Learning performance / watt: 10x performance / watt

However, high performance and low power are not easy to achieve at the same time

Conflicting Demands

High Performance
• More transistors
  - state-of-the-art O-O-O
  - many execution units/$
• Higher frequency

Low Power
• Fewer transistors
  - less control logic
  - fewer execution units/$
• Lower frequency

Page 19

Need for a New Architecture

A new architecture is required for the DLU to achieve the target.
The architecture is domain specific: Deep Learning

[Figure: Specialization Required vs. Processing, ranging from General Purpose Computing to Brain Computing; labeled items: Supercomputer, Accelerator, Quantum Computer, Deep Learning, Inference]

Page 20

What’s the New Architecture for the DLU?

Conventional Architecture                        The New Architecture
General Use: complicated O-O-O cores             1. Domain Specific: domain specific cores
High Precision: Double/Single precision FP       2. Optimal Precision: Deep Learning Integer
Sequential + Parallel: multiple strong cores     3. Massively Parallel: many cores w/ on-chip network

Domain specific, Optimal precision, and Massively parallel.

Page 21

DLU Architecture

DLU™ (Deep Learning Unit)
DPU: Deep learning Processing Unit, DPE: Deep learning Processing Element

[Block diagram: Host I/F, HBM2, DPU-0 through DPU-n each containing multiple DPEs, and Inter-chip I/F. Large scale DLU interconnect through off-chip network]

1. Domain specific
Domain specific Cores
- Newly designed ISA
- Simplified μ-architecture
- Fully software visible and controllable
- Heterogeneous cores ★
- DPE and Large RF ★

2. Optimal Precision
Deep Learning Integer ★

3. Massively Parallel
Many DPUs with an On-chip Network

Page 22

Heterogeneous Cores

Master Core: Memory Access and DPU control
• Push & Pull instructions and data for DLUs
• Start/stop execution of DLUs

DPU: Execution
• Execute DL operations based on master core’s control

How to utilize many DPUs (convolution example)
• one CH-out / DPU
• multiple batch / DPU

[Block diagram: Master core with Memory Controller and Memory, sending Instructions/Data to multiple DPUs; CH-in, CH-out]

The combination of a few large cores (Master) and many small execution cores (DPU) results in more performance with less power consumption, compared to a conventional homogeneous structure
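
The convolution mapping above ("one CH-out / DPU, multiple batch / DPU") can be illustrated with a small software model. The sketch below is a hypothetical scheduling loop in plain Python. The function names, the round-robin assignment of output channels to DPUs, and the data layout are all assumptions made for illustration and do not describe the DLU's actual ISA or runtime.

```python
def conv2d_one_channel(image, kernels):
    """Valid 2D convolution (cross-correlation form) producing ONE output channel.

    image:   [ch_in][H][W]
    kernels: [ch_in][kH][kW]
    """
    h, w = len(image[0]), len(image[0][0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out = [[0] * (w - kw + 1) for _ in range(h - kh + 1)]
    for y in range(h - kh + 1):
        for x in range(w - kw + 1):
            acc = 0
            for c in range(len(image)):          # sum over input channels (CH-in)
                for dy in range(kh):
                    for dx in range(kw):
                        acc += image[c][y + dy][x + dx] * kernels[c][dy][dx]
            out[y][x] = acc
    return out


def run_on_dpus(batch, weights, num_dpus):
    """Hypothetical master-core loop: each output channel is assigned to one DPU,
    and that DPU processes every image in the batch for its channel(s)."""
    ch_out = len(weights)
    work = {dpu: [] for dpu in range(num_dpus)}
    for co in range(ch_out):
        work[co % num_dpus].append(co)           # round-robin is an assumption

    # result[n][co] is the output feature map for batch item n, output channel co.
    result = [[None] * ch_out for _ in batch]
    for dpu, channels in work.items():           # DPUs would run in parallel
        for co in channels:
            for n, image in enumerate(batch):    # multiple batch items per DPU
                result[n][co] = conv2d_one_channel(image, weights[co])
    return work, result


# Tiny example: 2 images, 1 input channel, 3x3 pixels, 4 output channels, 2x2 kernels.
batch = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]],
         [[[1, 0, 1], [0, 1, 0], [1, 0, 1]]]]
weights = [[[[k, 0], [0, k]]] for k in range(1, 5)]   # weights[co][ci][kh][kw]
work, result = run_on_dpus(batch, weights, num_dpus=4)
print(work)
print(result[0][0])
```

With this mapping each DPU needs the weights of only its own output channel, while the input images are shared by all DPUs.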

Page 23

DPE & Large RF (Register File)

DPU: 128 SIMD* / 16 DPE
DPU consists of 16 DPEs connected with on-chip network

DPE: 8 SIMD* with large RF (~100x of typical CPU core)
DPE includes large RF and wide SIMD execution units to realize an efficient Deep Learning engine.
RF is fully SW controllable unlike cache to extract full HW potential

* For FP32

[Block diagram: DPU with CNTL and DPEs; each DPE contains Exec UNITs and a Register File (RF)]

        Name             RF/$ structure
UNIX    SPARC64 XII      RF + $
HPC     SPARC64 XIfx     RF + sector $
AI      DLU              Large RF
More SW controllability (from top to bottom)
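
The SIMD figures on this slide are consistent with each other, as the short check below shows. The per-chip helper is hypothetical: the slide does not give the number of DPUs per DLU, so that value is left as a parameter.

```python
dpes_per_dpu = 16
fp32_simd_per_dpe = 8       # "8 SIMD" per DPE, stated for FP32

# FP32 SIMD lanes in one DPU; compare with the quoted "128 SIMD".
fp32_simd_per_dpu = dpes_per_dpu * fp32_simd_per_dpe
print(fp32_simd_per_dpu)


def fp32_simd_per_dlu(num_dpus):
    """Hypothetical helper: FP32 SIMD lanes per chip for a given DPU count
    (the DPU count is not stated on the slide)."""
    return num_dpus * fp32_simd_per_dpu
```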

Page 24

Deep Learning Integer

Fujitsu’s “Deep Learning Integer” realizes necessary accuracy for Deep Learning with only a 16 or 8 bit data size (i.e. less power consumption compared with FP32)

Data Size and Precision
[Figure: Effective Precision vs. Data Size (Small to Large) for 8-bit INT, 16-bit INT, and FP32, with the Required Precision for Deep Learning marked; Deep Learning Integer: 16/8bit area with minimum accuracy loss]

DLU Data Type
[Figure: multiply-accumulate datapaths. int16 × int16 products are summed into an accumulator wider than 16 bits (Int>16); int8 × int8 products are summed into accumulators wider than 8 bits (Int>8). Other labels: Accumulator > INT8,16; INT16, INT8, FP16, FP32; HW gathered statistics]
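
The slide names three ingredients: narrow integer operands, an accumulator wider than the operands, and statistics gathered to keep accuracy loss small. The sketch below is a generic dynamic fixed-point example built from those ingredients. It is not Fujitsu's Deep Learning Integer format, whose details are not given here. Choosing a power-of-two scale from the maximum magnitude is an assumption standing in for the hardware-gathered statistics, and all function names are illustrative.

```python
def choose_shift(values, bits=8):
    """Pick a power-of-two scale from a simple statistic (the max magnitude)
    so that the largest value still fits in a signed `bits`-bit integer."""
    limit = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return 0
    shift = 0
    while peak * 2 ** (shift + 1) <= limit:
        shift += 1
    while peak * 2 ** shift > limit:
        shift -= 1
    return shift


def quantize(values, shift, bits=8):
    """Scale by 2**shift, round, and clamp to the signed integer range."""
    limit = 2 ** (bits - 1) - 1
    return [max(-limit - 1, min(limit, round(v * 2 ** shift))) for v in values]


def int_dot(qa, qb):
    """Multiply narrow integers and sum the products in a wider accumulator
    (Python's unbounded int stands in for the wide accumulator)."""
    acc = 0
    for a, b in zip(qa, qb):
        acc += a * b
    return acc


activations = [0.50, -0.25, 0.125, 0.75]
weights = [0.10, 0.20, -0.30, 0.40]

sa, sw = choose_shift(activations), choose_shift(weights)
qa, qw = quantize(activations, sa), quantize(weights, sw)

# Undo both scales to compare against the floating-point result.
approx = int_dot(qa, qw) / 2 ** (sa + sw)
exact = sum(a * w for a, w in zip(activations, weights))
print(qa, qw)
print(approx, exact)
```

The products of two 8-bit values need up to 16 bits and their sum needs more, which is why the accumulator must be wider than the operands.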

Page 25

Deep Learning Integer Accuracy

Deep Learning Integer has shown accuracy similar to FP32 for Deep Learning

[Figure: comparison of FP32, Deep Learning Integer (16bit), and Deep Learning Integer (8bit) (*)]

(*) ImageNet(subset): image size=96x96, #categories=25

Page 26

DLU Roadmap

Multiple generations of DLUs over time, as we currently do for HPC/UNIX/Mainframe processors

Timeline: FY2018 to Future

1st Generation
• Host CPU required
• Inter-DLU direct connection

2nd Generation
• Embedded host CPU

Other special processors
• Neuromorphic
• Combinatorial optimization

* Subject to change without notice

Page 27

Summary

Page 28

Fujitsu Processor Design Style

Product lines: Mainframe, UNIX, HPC (General Purpose); AI (DLU) (Domain Specific)

Instruction Set Architecture (HW-SW I/F)
  GS ISA, SPARC ISA, ARM ISA; New ISA for the DLU
  Standard ISA with FJ enhancements / newly developed ISA
Micro-architecture (CPU internal structure) ★Performance/RAS
  Shared [FJ development]; Simple, SW visible for the DLU
  Shared / Simple + SW visible micro-architecture
Semiconductor Technology
  The latest semiconductor technology
Design Infrastructure
  Shared design infrastructure: Circuit, Methodology, People

Page 29

Fujitsu Processor Direction

General purpose and Domain specific
Wider variety of processors in the future to meet different requirements.

[Figure: processors plotted by Specialization Required vs. Processing, from General Purpose to Domain Specific: SPARC64™ VII / VII+, SPARC64™ X, SPARC64™ XII, SPARC64™ VIIIfx, SPARC64™ XIfx, Post-K (Supercomputer), and DLU (Deep Learning); HPC & AI diverge]

Page 30

Summary

• Fujitsu has designed processors for a long time (> 60 years)
  - Perpetual evolution over generations
• SPARC64 IXfx (HPC), SPARC64 XII (UNIX), and Post-K
  - General purpose computing
• DLU
  - Domain specific
  - New Architecture: Heterogeneous, DPE and large RF, Deep Learning Integer
• Shared Design infrastructure: Circuit, Methodology, People

Fujitsu will continue to develop cutting-edge processors to meet the needs of a new era.
