Łukasz Schodowski
Power Systems Client Technical Specialist
IBM Power Systems
FPGA and CAPI in OpenPower
© 2015 IBM Corporation
FPGA - Field Programmable Gate Array
• It’s not a microprocessor
• It’s not an ASIC chip
• It’s not a CPU
• But it can be all of them
FPGA as an Accelerator
• FPGA: Field Programmable Gate Array
– It’s a re-programmable chip
– It can run fast (clock rates of 250–500 MHz or more)
– It has industry-standard interfaces like PCIe Gen3
– The major FPGA suppliers, Altera and Xilinx, are OpenPOWER Foundation members
[Figure: an FPGA attached over PCIe, with an FPGA Library of accelerator functions: gzip, Encrypt, Monte Carlo]
Source code for FPGAs has traditionally been written in RTL* (VHDL** or Verilog). Now, we also have OpenCL, a more programmer-friendly language.
*RTL = Register Transfer Level
**VHDL = VHSIC*** Hardware Description Language
***VHSIC = Very High Speed Integrated Circuit
Why FPGAs?
Why is an Accelerator Faster?
Question: The POWER8 processor runs at ~3-4 GHz while our FPGA runs at 250 MHz. So why would an accelerator be better?
Answer: The FPGA is better for certain algorithms, such as those that are numerically intensive or have parallelism.
The POWER8 processor has a finite set of instructions to implement the algorithm in SW. The FPGA is customized logic built for specific processing of an algorithm.
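The clock-rate question above can be made concrete with a little arithmetic. The sketch below is only a back-of-the-envelope model: the clock rates come from the slide, while the cycles-per-result figures and the number of parallel logic copies are assumptions chosen for illustration, not measurements.

```python
# Back-of-the-envelope throughput comparison.
# Clock rates are from the slide; every workload number below is an assumption.
CPU_CLOCK_HZ = 3.5e9    # POWER8 at ~3-4 GHz (midpoint of the quoted range)
FPGA_CLOCK_HZ = 250e6   # FPGA clock quoted on the slide


def results_per_second(clock_hz, cycles_per_result, parallel_units=1):
    """Results produced per second by `parallel_units` identical units."""
    return clock_hz / cycles_per_result * parallel_units


# Assumption: the software version needs 200 cycles per result on one core.
cpu = results_per_second(CPU_CLOCK_HZ, cycles_per_result=200)

# Assumption: the custom logic is pipelined to emit one result per cycle,
# and four copies of that logic fit on the chip.
fpga = results_per_second(FPGA_CLOCK_HZ, cycles_per_result=1, parallel_units=4)

speedup = fpga / cpu
print(f"CPU:  {cpu:,.0f} results/s")
print(f"FPGA: {fpga:,.0f} results/s")
print(f"Speedup: {speedup:.1f}x")
```

Under these assumed numbers the slower clock still wins, because the FPGA spends far fewer cycles per result. With a workload that maps poorly onto custom logic, the same formula would favor the processor.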
Why is an Accelerator Faster?
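Example 1: Numerically Intensive Algorithm
[Figure: In software (SW), Main (n,a,v,w) calls Integral (), Sigma (), Sin () and Cos () as separate routines. In the FPGA, the variables n, a, v, w flow through dedicated sin, cos, x, +, ∑ and ∫ logic blocks. Both sides finish with "Done!".]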
Why is an Accelerator Faster?
Example 2: Parallelism
Monte Carlo Risk Analysis to determine probability of financial success: given current finances, run 100 scenarios.
[Figure: In software (SW), Main (Vars) calls a single Monte routine. In the FPGA, a Variable distributor feeds Engine 1 through Engine 50 in parallel, and a Results Accumulator collects their output.]
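The distributor / engines / accumulator structure in Example 2 can be sketched in software. This is a minimal illustration, not the accelerator's actual design: the scenario model (a random yearly return with fixed spending), the parameter values, and all function names are hypothetical. In Python the engines run one after another, whereas on the FPGA the 50 engines would work at the same time.

```python
import random


def run_scenario(start_balance, years, seed):
    """One hypothetical scenario: True if the money lasts for `years`."""
    rng = random.Random(seed)
    balance = start_balance
    for _ in range(years):
        # Assumed model: 4% mean return, 10% volatility, fixed yearly spending.
        balance = balance * (1 + rng.gauss(0.04, 0.10)) - 40_000
        if balance <= 0:
            return False
    return True


def distribute(seeds, n_engines):
    """Variable distributor: deal scenario seeds round-robin to the engines."""
    return [seeds[i::n_engines] for i in range(n_engines)]


def engine(start_balance, years, seeds):
    """One engine: count the successful scenarios in its batch."""
    return sum(run_scenario(start_balance, years, s) for s in seeds)


def monte_carlo(start_balance, years, n_scenarios=100, n_engines=50):
    batches = distribute(list(range(n_scenarios)), n_engines)
    # Results accumulator: add up the per-engine success counts.
    successes = sum(engine(start_balance, years, b) for b in batches)
    return successes / n_scenarios


print(f"Estimated probability of success: {monte_carlo(1_000_000, 30):.2f}")
```

Because each scenario depends only on its own seed, the answer is the same however the scenarios are split across engines. That independence is what lets the hardware version run them in parallel.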
© 2014 International Business Machines Corporation
Some Example Use Cases
Moore’s Law
System stack innovations are required to drive cost/performance
• Workload Acceleration
• Services Delivery Model
• Advanced Memories
• Optimized System Design
• Custom SoCs
System Stack
• Processors
• Semiconductor Technology
• Applications and services
• Firmware, Operating System and Hypervisor
• Systems Management & Cloud Deployment
• Systems Acceleration & HW/SW Optimization
POWER8 Invents CAPI – Coherent Accelerator Processor Interface
[Figure: a Power Processor containing the CAPP, connected by CAPI over PCIe to a Coherently Attached Device]
• Coherent Attached Processor Proxy (CAPP) in processor
– Unit on processor that extends coherency to an attached device
– On-processor directory responds on behalf of off-chip device (filtering snoops)
• Coherency protocol tunneled over standard PCIe
– Eliminates the need for special I/Os and protocol logic (CAPI utilizes standard Posted Writes and Non-posted Reads)
– Reduces the complexity and bandwidth requirements of the attached device
• Enables attached device to be a peer to the processor
– Simplifies programming model between application
– Enables device to use same effective address as application running in processor
– Eliminates the cumbersome I/O Device Driver requirements (pinned memory not required)
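The idea of an on-processor directory that answers on behalf of the off-chip device can be illustrated with a toy snoop filter. This is a conceptual sketch only and does not describe the actual CAPP hardware: the class name, its methods, and the 128-byte line size are assumptions made for the example.

```python
class ProxyDirectory:
    """Toy stand-in for an on-processor directory (hypothetical, not the CAPP design)."""

    LINE_BYTES = 128  # assumed cache-line size for this sketch

    def __init__(self):
        self.device_lines = set()  # cache lines the attached device currently holds
        self.forwarded = 0         # snoops that had to cross the link to the device
        self.filtered = 0          # snoops answered locally by the directory

    def _line(self, addr):
        return addr // self.LINE_BYTES

    def device_acquires(self, addr):
        self.device_lines.add(self._line(addr))

    def device_releases(self, addr):
        self.device_lines.discard(self._line(addr))

    def snoop(self, addr):
        """Return True if the snoop must be forwarded to the device."""
        if self._line(addr) in self.device_lines:
            self.forwarded += 1
            return True
        self.filtered += 1  # the directory responds on the device's behalf
        return False


directory = ProxyDirectory()
directory.device_acquires(0x1000)
print(directory.snoop(0x1040))  # same 128-byte line as 0x1000
print(directory.snoop(0x2000))  # line not held by the device
```

Only snoops for lines the device actually holds need to travel over the link, which is one way to read the slide's claim that the bandwidth requirements of the attached device are reduced.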
What was done before CAPI?
Prior to CAPI, an application called a device driver to utilize an FPGA Accelerator. The device driver performed a memory mapping operation.
3 versions of the data (not coherent).
1000s of instructions in the device driver.
[Figure: six POWER8 cores and the App address the Memory Subsystem by virtual address. Variables and Input Data appear three times: in the application's memory, in the Device Driver (DD) Storage Area, and in the FPGA attached over PCIe. Output Data appears on the return path.]
CAPI Coherency vs. I/O Model
CAPI Coherency
With CAPI, the FPGA shares memory with the cores.
1 coherent version of the data.
No device driver call/instructions.
[Figure: six POWER8 cores, the App, and the FPGA (attached over PCIe through the PSL) all address the same Variables, Input Data and Output Data in the Memory Subsystem by virtual address.]
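The contrast between three non-coherent versions of the data and one coherent version can be mimicked with ordinary buffers. This is only an analogy: Python `bytearray` copies and a `memoryview` stand in for the driver's staging buffers and for CAPI's shared address space, and the function names are made up for the sketch.

```python
def io_model(app_data):
    """Driver-style flow: each stage works on its own copy of the data."""
    driver_buf = bytearray(app_data)    # version 2: device driver storage area
    device_buf = bytearray(driver_buf)  # version 3: accelerator-side buffer
    device_buf[0] ^= 0xFF               # the accelerator updates only its copy
    # The three versions now disagree until the result is copied back.
    return driver_buf, device_buf


def coherent_model(app_data):
    """Shared-memory flow: the accelerator works on the application's bytes."""
    shared = memoryview(app_data)  # same underlying bytes, no copy
    shared[0] ^= 0xFF              # the update is visible to the application
    return shared


app = bytearray(b"\x01\x02\x03")
driver_buf, device_buf = io_model(app)
print(app[0], driver_buf[0], device_buf[0])  # application copy is unchanged

app2 = bytearray(b"\x01\x02\x03")
coherent_model(app2)
print(app2[0])  # application sees the accelerator's update directly
```

In the first flow someone has to copy the result back before the application can see it. In the second there is nothing to copy, which is the software-visible effect the slide attributes to CAPI.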
CAPI vs. I/O Device Driver: Data Prep
Typical I/O Model Flow:
– DD Call
– Copy or Pin Source Data
– MMIO Notify Accelerator
– Acceleration (application dependent, but equal to the coherent model)
– Poll / Interrupt Completion
– Copy or Unpin Result Data
– Ret. From DD Completion
Instruction counts shown for the driver steps: 300, 10,000, 3,000, 1,000 and 1,000 instructions.
Time: 7.9µs and 4.9µs. Total ~13µs for data prep.
Flow with a Coherent Model:
– Shared Mem. Notify Accelerator (400 instructions, 0.3µs)
– Acceleration (application dependent, but equal to the I/O model)
– Shared Memory Completion (100 instructions, 0.06µs)
Total 0.36µs.
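The totals on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above. Summing the five I/O instruction counts into one total is an assumption about how the slide intends them to be read, and the ratios are derived here rather than stated on the slide.

```python
# Figures quoted on the slide (microseconds and instruction counts).
io_times_us = [7.9, 4.9]
coherent_times_us = [0.3, 0.06]
io_instructions = [300, 10_000, 3_000, 1_000, 1_000]
coherent_instructions = [400, 100]

io_total_us = sum(io_times_us)
coherent_total_us = sum(coherent_times_us)
time_ratio = io_total_us / coherent_total_us

# Assumption: the listed counts are additive across the driver steps.
instruction_ratio = sum(io_instructions) / sum(coherent_instructions)

print(f"I/O model data prep:      {io_total_us:.1f} us")
print(f"Coherent model data prep: {coherent_total_us:.2f} us")
print(f"Time ratio:               {time_ratio:.1f}x")
print(f"Instruction ratio:        {instruction_ratio:.1f}x")
```

The two I/O times add up to 12.8µs, which the slide rounds to ~13µs, and the coherent times add up to the quoted 0.36µs.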
IBM Data Engine for NoSQL
Redis asks for data through the Block API. The data source is either the memory subsystem or IBM Flash. POWER8 manages it.
[Figure: six POWER8 cores address the Memory Subsystem by virtual address. The IBM Data Engine sits behind the PSL on PCIe and exposes the Block API in front of IBM Flash. Other labels: CAPI, Memory, Conventional PCIe I/O, network, Acceptable latency.]
POWER8 with CAPI Cards
[Figure: Front View and Side View photographs, labeled POWER8 Modules and CAPI Dev Kit Cards]
How CAPI Works