FPGA MeetUp

Łukasz Schodowski, Power Systems Client Technical Specialist

IBM Power Systems

FPGA and CAPI in OpenPower

© 2015 IBM Corporation

FPGA - Field Programmable Gate Array

• It’s not a microprocessor
• It’s not an ASIC chip
• It’s not a CPU
• But it can be all of them


FPGA as an Accelerator

• FPGA: Field Programmable Gate Array
  – It’s a re-programmable chip
  – It can run fast (clock rates of 250–500 MHz or more)
  – It has industry-standard interfaces like PCI-E Gen3
  – The major FPGA suppliers, Altera and Xilinx, are OpenPOWER Foundation members

[Figure: an FPGA attached over PCIe, loaded from an FPGA library of accelerators: gzip, Encrypt, Monte Carlo]

Source code for FPGAs has traditionally been written in RTL* (VHDL** or Verilog). Now, we also have OpenCL, a more programmer-friendly language.

*RTL = Register Transfer Level
**VHDL = VHSIC*** Hardware Description Language
***VHSIC = Very High Speed Integrated Circuit

Why FPGAs?


Why is an Accelerator Faster?


Question: The POWER8 processor runs at ~3-4 GHz while our FPGA runs at 250 MHz. So why would an accelerator be better?

Answer: The FPGA is better for certain algorithms, such as those that are numerically intensive or highly parallel.

The POWER8 processor has a finite set of instructions with which to implement the algorithm in software. The FPGA is customized logic built for the specific processing of an algorithm.
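The clock-rate question can be made concrete with a back-of-the-envelope throughput model. This is only a sketch: the two clock rates come from the slide, while the instruction count per result, the number of engines, and the one-result-per-cycle pipeline are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope throughput model. All workload numbers below are
# illustrative assumptions, not measurements.

CPU_CLOCK_HZ = 3.5e9    # POWER8 core, roughly 3-4 GHz per the slide
FPGA_CLOCK_HZ = 250e6   # FPGA clock quoted on the slide


def cpu_results_per_second(instructions_per_result, instructions_per_cycle=1.0):
    """Results per second for one core running the algorithm in software."""
    cycles_per_result = instructions_per_result / instructions_per_cycle
    return CPU_CLOCK_HZ / cycles_per_result


def fpga_results_per_second(parallel_engines, cycles_per_result=1.0):
    """Results per second for pipelined engines working side by side."""
    return parallel_engines * FPGA_CLOCK_HZ / cycles_per_result


# Assumed workload: 200 instructions per result in software, versus 10
# hardware engines that each emit one result per clock once the pipeline is full.
cpu = cpu_results_per_second(instructions_per_result=200)
fpga = fpga_results_per_second(parallel_engines=10)
print(f"CPU : {cpu:,.0f} results/s")
print(f"FPGA: {fpga:,.0f} results/s ({fpga / cpu:.0f}x)")
```

Under these assumptions the slower clock wins because each FPGA cycle completes whole results, while the processor spends many cycles on each one. With a different instruction count or fewer engines the balance shifts accordingly.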


Example 1: Numerically Intensive Algorithm

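[Figure: software (SW) path, Main (n, a, v, w) calling Integral (), Sigma (), Sin () and Cos (); FPGA path, the variables n, a, v, w feeding dedicated sin, cos, ×, +, ∑ and ∫ logic blocks. Each path ends with "Done!"]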


Example 2: Parallelism

Monte Carlo risk analysis to determine the probability of financial success: given current finances, run 100 scenarios.

[Figure: software (SW) path, Main (Vars) calling Monte; FPGA path, a variable distributor feeding Engine 1 through Engine 50 in parallel, with a results accumulator collecting their outputs]
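The distributor / engines / accumulator structure can be sketched in software. This is a minimal illustration, not the accelerator's actual design: the scenario model (random yearly returns minus fixed spending) and all of its parameters are hypothetical, and Python threads merely stand in for hardware engines that would truly run side by side.

```python
import random
from concurrent.futures import ThreadPoolExecutor

SCENARIOS = 100   # from the slide
ENGINES = 50      # from the slide


def run_scenario(seed, start_balance=100_000.0, years=30):
    """Hypothetical scenario: random yearly returns minus fixed spending."""
    rng = random.Random(seed)          # private generator, so engines are independent
    balance = start_balance
    for _ in range(years):
        balance = balance * (1.0 + rng.gauss(0.05, 0.15)) - 4_000.0
        if balance <= 0:
            return False               # ran out of money
    return True


def engine(seeds):
    """One engine: runs its share of scenarios and counts the successes."""
    return sum(run_scenario(s) for s in seeds)


def distribute(n_scenarios, n_engines):
    """Variable distributor: deal scenario seeds round-robin to the engines."""
    return [list(range(i, n_scenarios, n_engines)) for i in range(n_engines)]


def success_probability(n_scenarios=SCENARIOS, n_engines=ENGINES):
    work = distribute(n_scenarios, n_engines)
    with ThreadPoolExecutor(max_workers=n_engines) as pool:
        successes = sum(pool.map(engine, work))   # results accumulator
    return successes / n_scenarios


print(f"Estimated probability of success: {success_probability():.2f}")
```

Because each scenario depends only on its own seed, the scenarios can be split across any number of engines without changing the answer, which is what makes this workload a good fit for parallel hardware.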


© 2014 International Business Machines Corporation

• Workload Acceleration
• Services Delivery Model
• Advanced Memories
• Optimized System Design
• Custom SoCs

Some Example Use Cases

Moore’s Law

System stack innovations are required to drive cost/performance

Processors

Semiconductor Technology

Applications and services

Firmware, Operating System and Hypervisor

System Stack

Systems Management & Cloud Deployment

Systems Acceleration & HW/SW Optimization

POWER8 Invents CAPI: Coherent Accelerator Processor Interface

[Figure: Power processor containing the CAPP, connected by CAPI over PCIe to a coherently attached device]

• Coherent Attached Processor Proxy (CAPP) in processor
  – Unit on processor that extends coherency to an attached device
  – On-processor directory responds on behalf of the off-chip device (filtering snoops)
• Coherency protocol tunneled over standard PCIe
  – Eliminates the need for special I/Os and protocol logic; CAPI utilizes standard Posted Writes and Non-posted Reads
  – Reduces the complexity and bandwidth requirements of the attached device
• Enables attached device to be a peer to the processor
  – Simplifies the programming model between the application and the attached device
  – Enables the device to use the same effective addresses as the application running in the processor
  – Eliminates the cumbersome I/O device driver requirements; pinned memory not required


What was done before CAPI?

[Figure: six POWER8 cores and an application (App) using the memory subsystem by virtual address; the FPGA sits behind PCIe and a device driver (DD). Variables, input data and output data appear in the application's memory, in the device driver storage area, and on the FPGA.]

Prior to CAPI, an application called a device driver to utilize an FPGA Accelerator. The device driver performed a memory mapping operation.

3 versions of the data (not coherent).
1000s of instructions in the device driver.

CAPI Coherency vs. I/O Model


CAPI Coherency

[Figure: six POWER8 cores and an application (App) using the memory subsystem by virtual address; the FPGA connects over PCIe through the PSL and works on the same variables, input data and output data.]

With CAPI, the FPGA shares memory with the cores.

1 coherent version of the data.
No device driver call/instructions.
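The difference between the two models can be illustrated with a small software analogy. This is a sketch only: the byte-doubling `accelerate` function and both flow functions are hypothetical stand-ins, and a `memoryview` over the application's buffer plays the role of the accelerator addressing the application's own memory.

```python
def accelerate(buf):
    """Stand-in for the accelerator: doubles every byte in place."""
    for i in range(len(buf)):
        buf[i] = (buf[i] * 2) % 256


def io_model(app_data):
    """Copy-based flow: application buffer -> driver staging area -> device."""
    driver_area = bytearray(app_data)      # copy 1: device driver storage area
    device_mem = bytearray(driver_area)    # copy 2: accelerator's private memory
    accelerate(device_mem)
    driver_area[:] = device_mem            # copy results back to the driver
    app_data[:] = driver_area              # copy results back to the application
    return 3                               # versions of the data that existed


def coherent_model(app_data):
    """Shared-memory flow: the accelerator works on the application's buffer."""
    shared = memoryview(app_data)          # a view, not a copy
    accelerate(shared)
    return 1                               # a single version of the data


a = bytearray([1, 2, 3])
b = bytearray([1, 2, 3])
io_versions = io_model(a)
coherent_versions = coherent_model(b)
print(a == b, io_versions, coherent_versions)
```

Both flows produce the same result; the copy-based one simply does extra work to move the data there and back, and has to keep three versions consistent along the way.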



Typical I/O Model Flow:
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Interrupt Completion → Copy or Unpin Result Data → Ret. From DD Completion
Instructions: 300, 10,000, 3,000, 1,000, 1,000
Acceleration: application dependent, but equal to below
Latency: 7.9 µs, 4.9 µs
Total ~13 µs for data prep

Flow with a Coherent Model:
Shared Mem. Notify Accelerator → Acceleration → Shared Memory Completion
Instructions: 400, 100
Acceleration: application dependent, but equal to above
Latency: 0.3 µs, 0.06 µs
Total 0.36 µs
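The totals can be checked with a few lines of arithmetic. The figures are the ones quoted on the slide; the variable names are only for illustration. The two I/O latencies add up to 12.8 µs, which the slide rounds to ~13 µs, and the coherent flow's data prep comes out roughly 36 times shorter.

```python
# Figures as quoted on the slide.
io_instructions = [300, 10_000, 3_000, 1_000, 1_000]
io_latency_us = [7.9, 4.9]
capi_instructions = [400, 100]
capi_latency_us = [0.3, 0.06]

io_total_us = sum(io_latency_us)
capi_total_us = sum(capi_latency_us)

print(f"I/O model : {sum(io_instructions):,} instructions, {io_total_us:.1f} us")
print(f"Coherent  : {sum(capi_instructions):,} instructions, {capi_total_us:.2f} us")
print(f"Latency ratio    : {io_total_us / capi_total_us:.1f}x")
print(f"Instruction ratio: {sum(io_instructions) / sum(capi_instructions):.1f}x")
```

The acceleration step itself is excluded on both sides, since the slide treats it as application dependent and equal in the two flows.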

CAPI vs. I/O Device Driver: Data Prep


IBM Data Engine for NoSQL

Redis asks for data through the Block API. The data source is either the memory subsystem or IBM Flash. POWER8 manages it.

[Figure: six POWER8 cores using the memory subsystem by virtual address; PCIe and the PSL connect the IBM Data Engine Block API to IBM Flash. Latency chart labels: Acceptable latency, Memory, CAPI, Conventional PCIe I/O, network.]


POWER8 with CAPI Cards

[Figure: front view and side view of the system, showing the POWER8 modules and the CAPI Dev Kit cards]

How CAPI Works