Study the effects of approximation on conjugate gradient algorithm and accelerate it on FPGA platform
Examiners: Prof. Dr. Christian Plessl, Prof. Dr. Marco Platzner
Guide: Michael Laß
Presenter: Tasneem Filmwala
Outline
➔ Problem Description
➔ Motivation
➔ Approximate Computing
➔ Algorithm Background
➔ Tools and Hardware Platform
➔ Design
➔ Approximation Techniques
➔ Precision Scaling in CG
➔ Error Analysis
➔ Evaluation
➔ Conclusion
Problem Description
● Huge amounts of data are available
● Clusters are used to compute on them (HPC)
● Increasingly complex mathematical models and simulation environments
● HPC meets approximate computing
Motivation
● Computing nodes already advanced
● Focus on optimizing algorithms
● Use of approximation for performance/resource benefits
● Iterative algorithms are promising targets in HPC
● Conjugate Gradient method used in HPC for iteratively solving systems of linear equations
● Approximate data paths of algorithm
● Accelerate on FPGA
Approximate Computing
● Key Idea
● Error Resilient Domains
  ➢ Image Processing
  ➢ Machine Learning
  ➢ Signal Processing
● HPC Meets Approximation
  ➢ Approximate iterative algorithms
  ➢ Tolerate imperfect solutions
Algorithmic Background
Conjugate Gradient Method
➢ Iterative approach
➢ Optimization of a quadratic function
$F(x) = \tfrac{1}{2} x^T A x - B^T x + C$
$\nabla F(x) = A x - B = 0$
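The link between minimizing the quadratic function and solving the linear system can be written out in one step. The derivation below is a sketch that assumes A is symmetric (and positive definite, so that the stationary point is a minimum); this is the standard assumption for CG but is not stated on the slide.

```latex
\begin{aligned}
F(x) &= \tfrac{1}{2}\, x^T A x - B^T x + C \\
\nabla F(x) &= \tfrac{1}{2}\,(A + A^T)\, x - B \;=\; A x - B \qquad (\text{using } A = A^T) \\
\nabla F(x) = 0 \;&\Longleftrightarrow\; A x = B
\end{aligned}
```

Under that assumption, the minimizer of F is exactly the solution of the linear system.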
➢ Solves large systems of linear equations
➢ Advanced variation of the steepest descent method
➢ Converges in few iterations
A x – B = 0, where A is the input (known) matrix, B is the input (known) vector, and x is the solution vector
➢ Move along A-conjugate search directions: $P_k^T A P_{k-1} = 0$
➢ Initial search direction is the negative of the gradient vector
➢ Next search direction ($P_k$) is a linear combination of the current gradient vector and the previous search direction
$X_{k+1} = X_k + \text{step-size} \cdot P_k$
$P_k = -R_k + \beta \cdot P_{k-1}$, where $R_k = A X_k - B$ is the gradient vector
CG Algorithm
Initial guess: x_0 = 0
Compute: r_0 = A x_0 - b, p_0 = -r_0
For (k = 0, 1, 2, ... until convergence) {
    α_k = (r_k^T r_k) / (p_k^T A p_k)           // calculate step size
    x_{k+1} = x_k + α_k p_k                      // calculate x
    r_{k+1} = r_k + α_k A p_k                    // calculate residual
    β_k = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)      // calculate search step
    p_{k+1} = -r_{k+1} + β_k p_k                 // calculate search direction
}
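The loop above can be sketched in plain Python. This is a minimal illustration, not the HLS implementation from the work: it assumes a small dense symmetric positive definite matrix stored as nested lists, uses the convention r = A x - b with p_0 = -r_0, and the names `conjugate_gradient`, `tol`, and `max_iter` are hypothetical.

```python
def mat_vec(A, v):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A (assumed)."""
    x = [0.0] * len(b)                                   # initial guess x_0 = 0
    r = [ax - bi for ax, bi in zip(mat_vec(A, x), b)]    # r_0 = A x_0 - b
    p = [-ri for ri in r]                                # p_0 = -r_0
    rr = dot(r, r)
    for k in range(max_iter):
        if rr ** 0.5 < tol:                              # converged
            return x, k
        Ap = mat_vec(A, p)
        alpha = rr / dot(p, Ap)                          # step size
        x = [xi + alpha * pi for xi, pi in zip(x, p)]    # new iterate
        r = [ri + alpha * api for ri, api in zip(r, Ap)] # residual update
        rr_new = dot(r, r)
        beta = rr_new / rr                               # search step
        p = [-ri + beta * pi for ri, pi in zip(r, p)]    # new search direction
        rr = rr_new
    return x, max_iter


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iterations = conjugate_gradient(A, b)
```

For this 2x2 example the exact solution is x = (1/11, 7/11). Note that the residual is updated recursively from the previous residual rather than recomputed from x.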
Tools and Hardware Platform
Tool Flow
Hardware Platform
● IBM POWER8 system with Virtex-7 based FPGA card.
● Coherent memory access of host memory by FPGA.
● Coherent Accelerator Processor Interface (CAPI) between the host and the Accelerator Function Unit (AFU)
● Components of CAPI
  ➢ Application and accelerator
  ➢ Coherent Attached Processor Proxy (CAPP): proxy for the accelerator
  ➢ Links coherency protocol between CAPP and PSL
  ➢ Power Service Layer (PSL): local cache for the accelerator
Design
➢ Optimization of AFU
➢ Design Methodology
➢ Framework
Optimization of AFU
➢ Increase the data width of the AXI bus
➢ Partition vector and matrix
➢ Pipeline and unroll operations
➢ Optimize the floating-point MAC operation and achieve II=1 by partial accumulation of the product
➢ MAC operation achieved II=1 in HLS simulation but failed in hardware
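The partial accumulation idea in the list above can be illustrated in software. A floating-point adder typically takes several cycles, so a single running sum creates a loop-carried dependency that prevents an initiation interval of 1; spreading the sum over several independent accumulators removes that dependency. The sketch below only shows the arithmetic pattern in Python; the function name and the lane count of 8 are assumptions, not values from the slides.

```python
def dot_partial_accumulation(a, b, lanes=8):
    """Dot product using `lanes` independent partial sums (round-robin)."""
    partial = [0.0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        # Consecutive iterations write to different accumulators, so no
        # iteration has to wait for the previous addition to finish.
        partial[i % lanes] += x * y
    # Final reduction of the partial sums, done once after the main loop.
    total = 0.0
    for s in partial:
        total += s
    return total


a = [float(i) for i in range(1, 17)]
b = [1.0] * 16
result = dot_partial_accumulation(a, b)
```

Because floating-point addition is not associative, the result can differ slightly from a strictly sequential sum for general inputs.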
Design Methodology
Framework
[Block diagram: Verilog top module for PSL signals; CAPI adapter; CAPI AXI interconnect; AFU AXI interconnect; AFU (data transfer); Clocking Wizard with ha_pclock, clocks at 250 MHz and 125 MHz, and resets rst_Clock_250MHz and rst_Clock_125MHz]
Approximation Techniques
➢ Loop perforation
➢ Inexact circuits
➢ Voltage over-scaling
➢ Over-clocking
➢ Skipping tasks and memory accesses
➢ Precision scaling (the technique used in this work)
Precision Scaling in CG
➢ Approximate Storage
➢ Approximate Computation
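The difference between the two forms of precision scaling can be sketched as follows. This is an illustration under assumptions, not the design from the slides: half precision is emulated with the `struct` module's IEEE 754 binary16 format, the higher-precision arithmetic is Python's double rather than single precision, and all function names are hypothetical.

```python
import struct


def to_half(x):
    """Round x to the nearest IEEE 754 half precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


def mat_vec_half_storage(A_half, v):
    """Approximate storage: the matrix is held in half precision,
    but products and sums are computed at higher precision."""
    return [sum(a * x for a, x in zip(row, v)) for row in A_half]


def mat_vec_half_compute(A, v):
    """Approximate computation: operands, every product, and every
    partial sum are rounded to half precision."""
    out = []
    for row in A:
        acc = 0.0
        for a, x in zip(row, v):
            prod = to_half(to_half(a) * to_half(x))
            acc = to_half(acc + prod)
        out.append(acc)
    return out


A = [[0.1, 0.2], [0.3, 0.4]]
v = [1.0, 1.0]
A_half = [[to_half(a) for a in row] for row in A]  # quantize once for storage
y_storage = mat_vec_half_storage(A_half, v)
y_compute = mat_vec_half_compute(A, v)
```

In the storage variant only the quantization of the stored values introduces error; in the computation variant rounding error is added at every operation.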
Error Analysis
➢ Residual is not used to check the error distance
➢ Residual propagates error from the previous residual
➢ This causes CG to converge at a wrong solution
➢ Use Euclidean error distance to study the error for different approaches
➢ Study error distance for approximate storage
➢ Study error distance for approximate computation
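The Euclidean error distance mentioned above can be sketched as a comparison against a reference solution. The choice of reference (for example, a full-precision solve of the same system) is an assumption, since the slides do not say how it was obtained; the function names and the sample values are hypothetical.

```python
import math


def euclidean_distance(x_approx, x_ref):
    """Euclidean distance between an approximate and a reference solution."""
    return math.sqrt(sum((a - r) ** 2 for a, r in zip(x_approx, x_ref)))


def true_residual_norm(A, x, b):
    """Norm of A x - b recomputed from x, not taken from the CG recurrence."""
    r = [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]
    return math.sqrt(sum(ri * ri for ri in r))


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_ref = [1.0 / 11.0, 7.0 / 11.0]   # exact solution of this small system
x_approx = [0.09, 0.64]            # stand-in for an approximate result
distance = euclidean_distance(x_approx, x_ref)
residual = true_residual_norm(A, x_approx, b)
```

Unlike the recursively updated residual inside the CG loop, both quantities here are computed directly from the final solution vector, so they do not inherit error accumulated across iterations.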
Approximate Storage
➢ Single Precision Storage
➢ Half Precision Storage
➢ Fixed Point Storage (using varied bit-widths)
[Plot: approximate storage; series: Half Precision Storage, Single Precision Storage]
Approximate Computation
➢ Half Precision Computation
➢ Fixed Point Matrix-Vector Computation (using varied bit-widths)
➢ All operations calculated in fixed point (using varied bit-widths)
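A fixed-point matrix-vector product with a configurable bit-width can be sketched as below. This is an illustration only: values are quantized to signed integers with `frac_bits` fractional bits, and the rounding and saturation behaviour, the 16-bit default width, and all names are assumptions that may differ from the HLS fixed-point types used in the work.

```python
def to_fixed(x, frac_bits, total_bits=16):
    """Quantize x to a signed fixed-point integer, saturating on overflow."""
    scaled = round(x * (1 << frac_bits))
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def fixed_mat_vec(A, v, frac_bits, total_bits=16):
    """Matrix-vector product on fixed-point operands.

    Each product carries 2 * frac_bits fractional bits; the sum is kept in a
    wide integer accumulator and converted back to a float at the end.
    """
    Aq = [[to_fixed(a, frac_bits, total_bits) for a in row] for row in A]
    vq = [to_fixed(x, frac_bits, total_bits) for x in v]
    scale = 1 << (2 * frac_bits)
    return [sum(a * x for a, x in zip(row, vq)) / scale for row in Aq]


A = [[0.5, 0.25], [0.125, 1.0]]
v = [1.0, 2.0]
y_fine = fixed_mat_vec(A, v, frac_bits=8)    # all values representable
y_coarse = fixed_mat_vec(A, v, frac_bits=2)  # 0.125 is lost to quantization
```

Varying `frac_bits` shows how the bit-width trades accuracy for narrower operands: with 8 fractional bits this example is exact, while with 2 fractional bits the entry 0.125 is quantized to zero.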
[Plots: half precision computation and fixed point matrix-vector computation; series: HP Mat-Vec, FP Mat-Vec, Single Precision]
Designs Evaluated
Evaluation
➢ BRAM Utilization
➢ Latency Comparison
➢ DSP Utilization
➢ Hardware Evaluation
BRAM Utilization
➢ ~40-50 % reduction in BRAM using half precision storage
Latency Comparison
➢ Speed-up of around 2.0x for designs using half precision storage
DSP Utilization
Hardware Evaluation
➢ Designs evaluated on hardware
➢ Resources available on hardware
➢ Resource utilization results
➢ Performance results
Designs Evaluated on Hardware
➢ Single precision storage with fixed point matrix-vector multiplication
➢ Half precision storage with fixed point matrix-vector multiplication
Resources Available
Resource Utilization Report
➢ Half precision uses 63 % less BRAM than single precision
➢ Half precision uses more DSPs than single precision
Performance Results
➢ FPGA implementations performed better than the software implementation
➢ Half precision gave a 1.5x speed-up compared to single precision
➢ Half precision achieved a 2.0x speed-up compared to the software implementation
Conclusion
➢ Proposed use of approximate computing in the HPC domain
➢ Approximated the Conjugate Gradient method using precision scaling
➢ Built an IP core to connect the HLS design with CAPI
➢ Implemented CG using HLS, approximated it and performed error analysis
➢ Evaluated 5 designs in terms of resources and performance
➢ Successfully ran 2 designs and compared performance/resources against the software implementation
➢ Gained speed-up and resource benefits using approximation
Key Findings
➢ Floating point worked better for approximate storage
➢ Matrix-vector multiplication is more error resilient than the rest of the operations
➢ Residual calculation proved erroneous
Thank You
Any Questions?
Backup Slides
➢ Pipeline
➢ Residual Problem
➢ Effect of condition number and matrix sizes
➢ Remaining operations using fixed point
➢ Fixed Point Computation: Residual Pattern
➢ Framework
Precision Scaling in CG
➢ Approximate storage for saving BRAM
➢ Approximate computation for matrix-vector operations
➢ Use of custom HLS data types
➢ Use of type casting feature of HLS