Study the effects of approximation on conjugate gradient algorithm and accelerate it on FPGA platform

Examiners: Prof. Dr. Christian Plessl, Prof. Dr. Marco Platzner

Guide: Michael Laß

Presenter: Tasneem Filmwala


Outline

➔ Problem Description

➔ Motivation

➔ Approximate Computing

➔ Algorithm Background

➔ Tools and Hardware Platform

➔ Design

➔ Approximation Techniques

➔ Precision Scaling in CG

➔ Error Analysis

➔ Evaluation

➔ Conclusion


Problem Description

Huge data available

Clusters to compute them

HPC

Increasingly complex mathematical models and simulation environments

HPC Meets Approximate Computing


Motivation

● Computing nodes already advanced
● Focus on optimizing algorithms
● Use of approximation for performance/resource benefits
● Iterative algorithms are promising targets in HPC
● Conjugate Gradient method used in HPC for iteratively solving systems of linear equations
● Approximate data paths of algorithm
● Accelerate on FPGA


Approximate Computing

● Key Idea

● Error Resilient Domains: Image Processing, Machine Learning, Signal Processing


HPC Meets Approximation

Approximate Iterative Algorithms

Tolerate imperfect solutions


Algorithmic Background

Conjugate Gradient Method

➢ Iterative approach
➢ Optimization of quadratic function

$F(x) = \tfrac{1}{2} x^T A x - B^T x + C$

$\nabla F(x) = A x - B = 0$
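The step from the quadratic function to the linear system can be written out as follows. This is a short derivation under the assumption, not stated on the slide, that A is symmetric (and positive definite, so that the stationary point is the minimum):

```latex
\begin{aligned}
F(x) &= \tfrac{1}{2}\, x^{T} A x - B^{T} x + C \\
\nabla F(x) &= \tfrac{1}{2}\,(A + A^{T})\,x - B \\
            &= A x - B \qquad \text{(using } A = A^{T}\text{)} \\
\nabla F(x) = 0 \;&\Longleftrightarrow\; A x = B
\end{aligned}
```

Minimizing F is therefore equivalent to solving A x = B, which is the problem CG addresses.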



➢ Solves large systems of linear equations

➢ Advanced variation of steepest descent method

➢ Converges in few iterations

A x – B = 0

A: input (known) matrix
B: input (known) vector
x: solution vector



➢ Move along A-conjugate search directions
➢ $P_k^T A P_{k-1} = 0$
➢ Initial search direction is the negative of the gradient vector
➢ Next search direction ($P_k$) is a linear combination of the current gradient vector and the previous search direction

$X_{k+1} = X_k + \text{step-size} \cdot P_k$

$P_k = -R_k + \beta \cdot P_{k-1}$, where $R_k = A X_k - B$ is the gradient (residual) at $X_k$



CG Algorithm

Initial guess: $x_0 = 0$

Compute: $r_0 = A x_0 - b$, $p_0 = -r_0$

For (k = 0, 1, 2, ... until convergence) {
    $\alpha_k = (r_k^T r_k) / (p_k^T A p_k)$            (calculate step size)
    $x_{k+1} = x_k + \alpha_k p_k$                       (calculate x)
    $r_{k+1} = r_k + \alpha_k A p_k$                     (calculate residual)
    $\beta_k = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)$        (calculate search step)
    $p_{k+1} = -r_{k+1} + \beta_k p_k$                   (calculate search direction)
}
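The loop above can be sketched in plain Python. This is a minimal full-precision reference, not the HLS design from the talk; it assumes a dense, symmetric positive definite matrix stored as a list of rows, and the function names, `tol`, and `max_iter` are illustrative choices.

```python
def matvec(A, v):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A (assumption).

    Uses the residual convention r = A x - b, so the search direction
    is built from -r.
    """
    x = [0.0] * len(b)                                  # initial guess x0 = 0
    r = [ax - bi for ax, bi in zip(matvec(A, x), b)]    # r0 = A x0 - b
    p = [-ri for ri in r]                               # p0 = -r0
    rr = dot(r, r)
    iterations = 0
    while iterations < max_iter and rr ** 0.5 > tol:
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                             # step size
        x = [xi + alpha * pi for xi, pi in zip(x, p)]       # update x
        r = [ri + alpha * api for ri, api in zip(r, Ap)]    # update residual
        rr_new = dot(r, r)
        beta = rr_new / rr                                  # search step
        p = [-ri + beta * pi for ri, pi in zip(r, p)]       # new direction
        rr = rr_new
        iterations += 1
    return x, iterations


if __name__ == "__main__":
    solution, iters = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
    print(solution, iters)
```

Note that the stopping test uses the recursively updated residual, which is exactly the quantity the error analysis later treats with caution under approximation.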


Tools and Hardware Platform

Tool Flow

Hardware Platform

● IBM POWER8 system with Virtex-7 based FPGA card.
● Coherent access to host memory by the FPGA.


Coherent Accelerator Processor Interface (CAPI)

Host

Accelerator Function Unit (AFU)



● Components of CAPI

Main components of CAPI: application and accelerator

Coherent Accelerator Processor Proxy (CAPP): proxy for the accelerator

POWER Service Layer (PSL): local cache for the accelerator

Link: carries the coherency protocol between CAPP and PSL



Design

➢ Optimization of AFU
➢ Design Methodology
➢ Framework


Optimization of AFU

➢ Increase the data width of the AXI bus
➢ Partition vector and matrix
➢ Pipeline and unroll operations
➢ Optimize floating-point MAC operation and achieve II=1 by partial accumulation of products

➢ MAC operation achieved II=1 in HLS simulation but failed in hardware
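Partial accumulation can be illustrated in software. The slides do not show the HLS code, so the following is only a sketch of the general idea: a single floating-point accumulator creates a loop-carried dependency (each addition must wait for the previous one), whereas several independent partial sums let consecutive products go to different accumulators. The function name and the `lanes` parameter are hypothetical.

```python
def mac_partial(a, b, lanes=4):
    """Multiply-accumulate using `lanes` independent partial sums.

    In a pipelined hardware loop, each partial sum is updated only once
    every `lanes` iterations, which hides the adder latency. `lanes` is
    an illustrative parameter; in hardware it would be chosen to cover
    the latency of the floating-point adder.
    """
    partial = [0.0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        partial[i % lanes] += x * y     # no dependency on the previous iteration
    total = 0.0
    for s in partial:                   # final reduction of the partial sums
        total += s
    return total


if __name__ == "__main__":
    print(mac_partial([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0, 1.0]))
```

Because floating-point addition is not associative, the reordered sum can differ from a strictly sequential accumulation in the last bits.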

Design Methodology

Framework

Block diagram components: ha_pclock input; Verilog top module for PSL signals; CAPI adapter; CAPI AXI interconnect; AFU AXI interconnect; AFU; clocking wizard providing clocks at 250 MHz and 125 MHz; resets rst_Clock_250MHz and rst_Clock_125MHz; data transfer path.



Approximation Techniques

➢ Loop perforation
➢ Inexact circuits
➢ Voltage over-scaling
➢ Over-clocking
➢ Skipping tasks and memory accesses
➢ Precision scaling (the technique used in this work)


Precision Scaling in CG

Approximate Storage

Approximate Computation


Error Analysis

➢ Residual not used to check error distance
➢ Residual propagates error from previous residual
➢ Causes CG to converge at wrong solution
➢ Use Euclidean error distance to study error distance for different approaches
➢ Study error distance for approximate storage
➢ Study error distance for approximate computation
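The difference between the two error measures can be sketched as follows. The Euclidean error distance compares an approximate solution against a reference solution (assumed here to come from a full-precision solve), while the true residual is recomputed from A, x, and b instead of being carried over from the previous iteration. Both function names are illustrative.

```python
import math


def euclidean_distance(x_approx, x_ref):
    """Euclidean (L2) distance between an approximate and a reference solution."""
    return math.sqrt(sum((a - r) ** 2 for a, r in zip(x_approx, x_ref)))


def true_residual_norm(A, x, b):
    """L2 norm of A x - b, recomputed from scratch.

    Unlike the recursively updated residual inside CG, this value does
    not inherit rounding error from earlier iterations.
    """
    r = [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]
    return math.sqrt(sum(ri * ri for ri in r))


if __name__ == "__main__":
    print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
    print(true_residual_norm([[2.0, 0.0], [0.0, 2.0]], [1.0, 1.0], [2.0, 2.0]))
```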


Approximate Storage

➢ Single Precision Storage
➢ Half Precision Storage
➢ Fixed Point Storage (using varied bit-widths)
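The three storage formats can be emulated in software to see how much each one perturbs a stored value. This is a sketch using Python's `struct` module (format `e` is IEEE 754 half precision, `f` is single precision); it is not the HLS data types used in the design. The fixed-point helper and its default bit-widths are hypothetical, and it assumes round-to-nearest with saturation, which may differ from the behaviour chosen in the actual hardware.

```python
import struct


def to_half(x):
    """Round x to the nearest IEEE 754 half-precision value.

    Raises OverflowError for magnitudes beyond the half-precision range.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x):
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fixed(x, total_bits=16, frac_bits=8):
    """Quantize x to signed fixed point with `frac_bits` fractional bits.

    Assumption: round to nearest (Python's round) and saturate at the
    representable range instead of wrapping around.
    """
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, round(x * scale)))
    return q / scale


if __name__ == "__main__":
    for value in (0.1, 3.14159, 1000.0):
        print(value, to_single(value), to_half(value), to_fixed(value))
```

Floating-point formats keep a roughly constant relative error across magnitudes, while fixed point has a constant absolute step and a hard range limit.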

Figure: Approximate Storage (series: Half Precision Storage, Single Precision Storage)

Approximate Computation

➢ Half Precision Computation

➢ Fixed Point Matrix-Vector Computation (using varied bit-widths)

➢ All operations calculated in fixed point (using varied bit-widths)
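A fixed-point matrix-vector product can be sketched with integer arithmetic. This is an illustration of the technique only, not the design evaluated in the talk: values are scaled by 2^frac_bits, multiplied and accumulated as integers, and scaled back at the end. The function names and the default `frac_bits` are assumptions, and Python integers are unbounded, whereas a hardware accumulator has a finite width and can overflow.

```python
def quantize(x, frac_bits):
    """Convert a float to an integer holding `frac_bits` fractional bits."""
    return round(x * (1 << frac_bits))


def fixed_matvec(A, v, frac_bits=12):
    """Matrix-vector product computed on fixed-point integers.

    Each product of two quantized values carries 2 * frac_bits
    fractional bits, so the accumulated sum is divided by
    2 ** (2 * frac_bits) to return to real-valued units.
    """
    Aq = [[quantize(a, frac_bits) for a in row] for row in A]
    vq = [quantize(x, frac_bits) for x in v]
    scale = 1 << (2 * frac_bits)
    return [sum(a * x for a, x in zip(row, vq)) / scale for row in Aq]


if __name__ == "__main__":
    A = [[1.0, 0.5], [0.25, 2.0]]
    v = [2.0, 4.0]
    print(fixed_matvec(A, v))
    print(fixed_matvec([[0.1]], [0.3]))
```

Varying `frac_bits` mimics the "varied bit-widths" experiments: fewer fractional bits mean a larger quantization error in every entry of the result.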

Figure: Approximate Computation, panels Half Precision Computation and Fixed Point Matrix-Vector Computation (series: HP Mat-Vec, FP Mat-Vec, Single Precision)

Designs Evaluated



Evaluation

➢ BRAM Utilization
➢ Latency Comparison
➢ DSP Utilization
➢ Hardware Evaluation

BRAM Utilization

➢ About 40-50% reduction in BRAM using half precision storage

Latency Comparison

➢ Speedup of around 2.0x for designs using half precision storage

DSP Utilization

Hardware Evaluation

➢ Designs evaluated on hardware
➢ Resources available on hardware
➢ Resource utilization results
➢ Performance results

Designs Evaluated on Hardware

➢ Single precision storage with fixed point matrix-vector multiplication

➢ Half precision storage with fixed point matrix-vector multiplication

Resources Available

Resource Utilization Report

➢ Half precision uses 63% less BRAM than single precision

➢ Half precision uses more DSPs than single precision

Performance Results

➢ FPGA implementations performed better than the software implementation

➢ Half precision gave a 1.5x speedup compared to single precision

➢ Half precision achieved a 2.0x speedup compared to the software implementation



Conclusion

➢ Proposed use of approximate computing in HPC domain

➢ Approximate Conjugate Gradient method using precision scaling

➢ Built an IP core to connect HLS design with CAPI

➢ Implemented CG using HLS, approximated it and performed error analysis

➢ Evaluated 5 designs in terms of resource and performance

➢ Successfully ran 2 designs and compared performance/resources against software implementation

➢ Gained speed up and resource benefits using approximation

Key Findings

➢ Floating point worked better for approximate storage
➢ Matrix-vector multiplication is more error resilient than the rest of the operations
➢ Residual calculation proved erroneous

Thank You

Any Questions?

Backup Slides

Pipeline

Residual Problem

Effect of condition number and matrix sizes

Rest Operations using Fixed Point

Fixed Point Computation: Residual Pattern

Framework

Precision Scaling in CG

➢ Approximate storage for saving BRAM
➢ Approximate computation for matrix-vector operations
➢ Use of custom HLS data types
➢ Use of type casting feature of HLS
