Łukasz Schodowski Power Systems Client Technical Specialist IBM Power Systems FPGA and CAPI in OpenPower

FPGA MeetUp

Jan 17, 2017


Moya Brannan
Transcript
Page 1: FPGA MeetUp

Łukasz Schodowski
Power Systems Client Technical Specialist

IBM Power Systems

FPGA and CAPI in OpenPower

Page 2: FPGA MeetUp

© 2015 IBM Corporation

FPGA - Field Programmable Gate Array

• It’s not a microprocessor
• It’s not an ASIC chip
• It’s not a CPU
• But it can be all of them

Page 3: FPGA MeetUp


FPGA as an Accelerator

• FPGA: Field Programmable Gate Array
  – It’s a re-programmable chip
  – It can run fast (clock rates of 250–500 MHz or more)
  – It has industry-standard interfaces such as PCIe Gen3
  – The major FPGA suppliers, Altera and Xilinx, are OpenPOWER Foundation members

[Figure: an FPGA attached over PCIe, with an FPGA library of functions such as gzip, Encrypt and Monte Carlo]

Source code for FPGAs has traditionally been written in RTL* (VHDL** or Verilog). Now we also have OpenCL, a more programmer-friendly language.

*RTL = Register Transfer Level
**VHDL = VHSIC*** Hardware Description Language
***VHSIC = Very High Speed Integrated Circuit

Why FPGAs?

Page 4: FPGA MeetUp


Why is an Accelerator Faster?


Question: The POWER8 processor runs at ~3-4 GHz while our FPGA runs at 250 MHz. So why would an accelerator be better?

Answer: The FPGA is better for certain algorithms, such as those that are numerically intensive or highly parallel.

The POWER8 processor has a finite set of instructions with which to implement the algorithm in software (SW). The FPGA is customized logic built specifically to process that algorithm.
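
The clock-rate question can be put into rough numbers. The sketch below is only an illustration: the two clock rates come from the slide, while the per-cycle figures (`CPU_RESULTS_PER_CYCLE`, `FPGA_RESULTS_PER_CYCLE`) and the function names are assumptions chosen for the example, not measurements. It shows that the FPGA has to complete about 14 times more work per cycle just to match the processor, and that custom parallel logic can clear that bar.

```python
def throughput(clock_hz: float, results_per_cycle: float) -> float:
    """Results completed per second for a given clock and per-cycle output."""
    return clock_hz * results_per_cycle


def break_even_factor(cpu_clock_hz: float, fpga_clock_hz: float) -> float:
    """How many times more results per cycle the FPGA needs just to match the CPU."""
    return cpu_clock_hz / fpga_clock_hz


CPU_CLOCK_HZ = 3.5e9    # midpoint of the ~3-4 GHz quoted for POWER8
FPGA_CLOCK_HZ = 250e6   # FPGA clock quoted on the slide

# Hypothetical workload figures, chosen only for illustration:
CPU_RESULTS_PER_CYCLE = 1 / 100   # assume ~100 sequential cycles per result in software
FPGA_RESULTS_PER_CYCLE = 8        # assume 8 pipelines, each finishing one result per cycle

cpu_rate = throughput(CPU_CLOCK_HZ, CPU_RESULTS_PER_CYCLE)
fpga_rate = throughput(FPGA_CLOCK_HZ, FPGA_RESULTS_PER_CYCLE)

print(f"break-even factor: {break_even_factor(CPU_CLOCK_HZ, FPGA_CLOCK_HZ):.0f}x")
print(f"CPU : {cpu_rate:.3g} results/s")
print(f"FPGA: {fpga_rate:.3g} results/s")
```

With different assumed figures the comparison can go the other way, which is why the answer above is limited to certain algorithms.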

Page 5: FPGA MeetUp

Why is an Accelerator Faster?

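Example 1: Numerically Intensive Algorithm

[Figure: software (SW) side with Main(n, a, v, w) calling Integral(), Sigma(), Sin() and Cos(); FPGA side with dedicated sin, cos, multiply (x) and add (+) units. The variables n, a, v, w are passed to both sides, and each side ends with "Done!"]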

Page 6: FPGA MeetUp

Why is an Accelerator Faster?

Example 2: Parallelism

Monte Carlo risk analysis to determine the probability of financial success: given current finances, run 100 scenarios.

[Figure: on the FPGA, a variable distributor feeds Engine 1 through Engine 50 in parallel and a results accumulator collects their output; on the software (SW) side, Main(Vars) calls Monte. Variables are passed to both sides.]
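
The distributor, engines and accumulator in Example 2 can be sketched in software. This is a minimal illustration, not the accelerator's actual design: the financial model inside `run_scenario`, its parameters and all function names are hypothetical. The 100 scenarios and 50 engines come from the slide. Python threads only mirror the structure here; they do not deliver the true hardware parallelism an FPGA provides.

```python
import random
from concurrent.futures import ThreadPoolExecutor

N_SCENARIOS = 100   # from the slide: run 100 scenarios
N_ENGINES = 50      # from the slide: Engine 1 .. Engine 50


def run_scenario(seed: int, start: float = 100_000.0, years: int = 10) -> bool:
    """One scenario: hypothetical yearly returns; success means no loss overall."""
    rng = random.Random(seed)
    balance = start
    for _ in range(years):
        balance *= 1.0 + rng.gauss(0.05, 0.15)   # assumed return distribution
    return balance >= start


def distribute(n_scenarios: int, n_engines: int) -> list[list[int]]:
    """Variable distributor: deal scenario seeds round-robin to the engines."""
    return [list(range(i, n_scenarios, n_engines)) for i in range(n_engines)]


def engine(seeds: list[int]) -> int:
    """One engine: run its share of scenarios and count the successes."""
    return sum(run_scenario(s) for s in seeds)


def estimate_success() -> float:
    """Results accumulator: combine the per-engine counts into a probability."""
    batches = distribute(N_SCENARIOS, N_ENGINES)
    with ThreadPoolExecutor(max_workers=N_ENGINES) as pool:
        successes = sum(pool.map(engine, batches))
    return successes / N_SCENARIOS


print(f"estimated probability of success: {estimate_success():.2f}")
```

With 50 engines, each one handles only 2 of the 100 scenarios, so the run takes about two scenario times instead of one hundred, plus the cost of distributing and accumulating.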


Page 7: FPGA MeetUp

© 2014 International Business Machines Corporation

• Workload Acceleration
• Services Delivery Model
• Advanced Memories
• Optimized System Design
• Custom SoCs

Some Example Use Cases

Moore’s Law

System stack innovations are required to drive cost/performance

Processors

Semiconductor Technology

Applications and services

Firmware, Operating System and Hypervisor

System Stack

Systems Management & Cloud Deployment

Systems Acceleration & HW/SW Optimization

Page 8: FPGA MeetUp

POWER8 Invents CAPI: Coherent Accelerator Processor Interface

[Figure: the CAPP unit inside the POWER processor connects over PCIe (CAPI over PCIe) to a coherently attached device]

• Coherent Accelerator Processor Proxy (CAPP) in the processor
  – Unit on the processor that extends coherency to an attached device
  – On-processor directory responds on behalf of the off-chip device (filtering snoops)
• Coherency protocol tunneled over standard PCIe
  – Eliminates the need for special I/Os and protocol logic
  – CAPI uses standard posted writes and non-posted reads
  – Reduces the complexity and bandwidth requirements of the attached device
• Enables the attached device to be a peer to the processor
  – Simplifies the programming model between the application and the device
  – Enables the device to use the same effective addresses as the application running on the processor
  – Eliminates the cumbersome I/O device driver requirements
  – Pinned memory is not required

Page 9: FPGA MeetUp


What was done before CAPI?

[Figure: six POWER8 cores running the application (App) over the memory subsystem (Virt Addr), with the FPGA attached over PCIe. Copies of Variables, Input Data and Output Data are shown in application memory, in the Device Driver (DD) Storage Area and at the FPGA.]

Prior to CAPI, an application called a device driver to utilize an FPGA Accelerator. The device driver performed a memory mapping operation.

3 versions of the data (not coherent). 1000s of instructions in the device driver.

CAPI Coherency

Vs. IO Model

Page 10: FPGA MeetUp


CAPI Coherency

[Figure: six POWER8 cores running the application (App) over the memory subsystem (Virt Addr); the FPGA attaches over PCIe through the PSL and works on the same Variables, Input Data and Output Data.]

With CAPI, the FPGA shares memory with the cores.

1 coherent version of the data. No device driver call/instructions.
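
The difference between three non-coherent copies and one shared version can be mimicked in a few lines. This is a loose software analogy only, assuming hypothetical function names and a byte reversal as a stand-in for the accelerated work; it does not use any real CAPI or device driver API, and Python object sharing is not hardware cache coherency.

```python
def io_model(app_data: bytearray) -> bytes:
    """I/O model: the data exists in three places and must be copied between them."""
    driver_copy = bytes(app_data)           # copy 2: device driver storage area
    device_copy = bytearray(driver_copy)    # copy 3: the accelerator's own buffer
    device_copy.reverse()                   # stand-in for the accelerated computation
    return bytes(device_copy)               # result is copied back toward the application


def coherent_model(shared: bytearray) -> None:
    """Coherent model: the accelerator works directly on the application's buffer."""
    shared.reverse()                        # same stand-in computation, no copies


app_buffer = bytearray(b"input data")

result = io_model(app_buffer)
print(app_buffer, result)   # the application's buffer is untouched; the result is a separate copy

coherent_model(app_buffer)
print(app_buffer)           # the application sees the result in its own buffer
```

In the first case the application has to fetch the result from a separate copy; in the second it simply reads its own buffer once the accelerator is done.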


Page 11: FPGA MeetUp


CAPI vs. I/O Device Driver: Data Prep

Typical I/O Model Flow:
DD Call -> Copy or Pin Source Data -> MMIO Notify Accelerator -> Acceleration -> Poll / Interrupt Completion -> Copy or Unpin Result Data -> Ret. From DD Completion
Acceleration: application dependent, but equal to the coherent flow below
Instruction counts along the flow: 300, 10,000, 3,000, 1,000 and 1,000 instructions
Latency: 7.9 µs and 4.9 µs; total ~13 µs for data prep

Flow with a Coherent Model:
Shared Mem. Notify Accelerator -> Acceleration -> Shared Memory Completion
Acceleration: application dependent, but equal to the I/O flow above
Instruction counts along the flow: 400 and 100 instructions
Latency: 0.3 µs and 0.06 µs; total 0.36 µs
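
The totals on this slide can be checked with simple arithmetic. The sketch below uses only the instruction counts and latencies quoted on the slide; the variable and function names are my own, and the acceleration step is left out because the slide states it is the same in both flows.

```python
# Figures quoted on the slide (data-prep overhead only, acceleration excluded).
IO_MODEL_INSTRUCTIONS = [300, 10_000, 3_000, 1_000, 1_000]
IO_MODEL_LATENCY_US = [7.9, 4.9]
COHERENT_INSTRUCTIONS = [400, 100]
COHERENT_LATENCY_US = [0.3, 0.06]


def summarize(instructions, latencies_us):
    """Return (total instructions, total latency in microseconds) for one flow."""
    return sum(instructions), sum(latencies_us)


io_instr, io_us = summarize(IO_MODEL_INSTRUCTIONS, IO_MODEL_LATENCY_US)
capi_instr, capi_us = summarize(COHERENT_INSTRUCTIONS, COHERENT_LATENCY_US)

print(f"I/O model : {io_instr:,} instructions, {io_us:.1f} us")
print(f"Coherent  : {capi_instr:,} instructions, {capi_us:.2f} us")
print(f"Instruction ratio: {io_instr / capi_instr:.1f}x")
print(f"Latency ratio    : {io_us / capi_us:.1f}x")
```

The sums reproduce the slide's totals (about 13 µs versus 0.36 µs), which works out to roughly 30 times fewer instructions and roughly 35 times less data-prep latency for the coherent flow.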

Page 12: FPGA MeetUp


IBM Data Engine for NoSQL

Redis asks for data through the Block API. The data source is either the memory subsystem or IBM Flash; POWER8 manages it.

[Figure: six POWER8 cores over the memory subsystem (Virt Addr), connected over PCIe through the PSL and the IBM Data Engine Block API to IBM Flash. Other labels: "Acceptable latency", "CAPI", "Memory", "Conventional PCIe I/O", "network".]

Page 13: FPGA MeetUp


POWER8 with CAPI Cards

POWER8 Modules

CAPI Dev Kit Cards

Front View

Side View

How CAPI Works