Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures
Ardavan Pedram, Robert van de Geijn, Andreas Gerstlauer


Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Analysis
• Conclusion and Future Work

04/11/23 2©Ardavan Pedram 2012




Trend of processors

• Technology scaling has reached physical limits
– Limit of performance is power
• We may have dark silicon on the chip
– Only a percentage of the chip might be active


Heterogeneous Solution

– Increase power efficiency: GFLOPS/W
– More cores with lower frequency and power
– Specialized cores
  • Orders of magnitude better power efficiency (GFLOPS/W)
  • Expensive
  • Long time to market


Nvidia Tegra System on Chip


Linear Algebra Processor Design Goals

• Efficiency of full custom hardware
• Orders of magnitude improvement

• Achieving upper limits of power/performance ratio

• Flexibility to execute a whole class of coarse-grain operations

• Co-optimized and co-designed across all layers

• Targeting linear algebra applications


Source: Andreas Olofsson


Linear Algebra Routines

• Linear Algebra Package (LAPACK) level
– Cholesky and QR factorization
• Basic Linear Algebra Subroutines (BLAS)
– General matrix-matrix multiplication (GEMM)
• Inner kernels
– Hand-optimized
• GEMM is often what delivers high performance to many crucial applications





GEMM Implementations

• CPUs: 95% peak
– [Goto et al. 2008] [Intel MKL]
• GPUs: 70% peak
– [Nath et al. 2010] Nvidia Fermi
– [Volkov et al. 2008] Nvidia Tesla
• FPGAs: 99% peak
– [Zikari et al. 2007]
– [Zhuo et al. 2008]
• Specialized architectures
– Clearspeed CSX: 78% peak
– Systolic arrays: [Lippert et al. 2001]

• Intel Quad core
– 40 GFLOPS @ 2.6 GHz
• Nvidia Fermi
– 350 GFLOPS @ 1.15 GHz
• Altera Stratix IV
– 100 GFLOPS @ 0.4 GHz
• CSX 700
– 75 GFLOPS @ 0.25 GHz


Common Sources of Inefficiencies in Conventional Architectures

• CPUs & GPUs
– Instruction handling
– Multi-ported register file
– Cache overheads: tags and coherency
– Thread scheduling
• FPGAs
– Low area efficiency
• Specialized architectures
– Data communication overheads


Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Modeling
• Generalization
• Conclusion and Future Work


Matrix Multiplication Hierarchy

• Fastest general-purpose implementation of GEMM [GotoBLAS]

[Figure: matrices C, A, and B]

Rank-1 Update

• Rank-1 update: updates a matrix by adding the outer product of two vectors to it

Matrix multiplication using a series of rank-1 updates: let C, A, and B be 4×4, 4×kc, and kc×4 matrices. C += AB can be computed as:

for i = 0 to kc-1
    C += A(:, i) · B(i, :)    (column i of A times row i of B)
end for

[Figure: matrices C, A, and B]
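The loop above can be sketched in plain Python. This is a minimal illustration, not the authors' kernel; the helper names `rank1_update` and `gemm_rank1` are hypothetical, and matrices are stored as nested lists to keep the example dependency-free.

```python
def rank1_update(C, a_col, b_row):
    """Add the outer product of a_col and b_row to C in place."""
    for r, a in enumerate(a_col):
        for c, b in enumerate(b_row):
            C[r][c] += a * b


def gemm_rank1(C, A, B):
    """Compute C += A B as kc successive rank-1 updates.

    A is m x kc, B is kc x n, C is m x n (nested lists).
    """
    kc = len(B)
    for i in range(kc):
        a_col = [row[i] for row in A]  # column i of A
        b_row = B[i]                   # row i of B
        rank1_update(C, a_col, b_row)
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
print(gemm_rank1(C, A, B))  # [[19, 22], [43, 50]]
```

Each iteration touches every element of C exactly once, which is what makes the operation a natural fit for a 2D array of multiply-accumulate units.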


Linear Algebra Core (LAC) Design

• Customized for rank-1 update
– 2D arrangement of PEs
– Broadcast buses
• Integrates into memory hierarchy

Memory Hierarchy

Levels: main memory, on-chip memory, core local stores.

Successive partitioning of C += AB across the levels:

• $C \mathrel{+}= A_0 B_0 + \cdots + A_{K-1} B_{K-1}$
• $C_i \mathrel{+}= A_{i,p} B_p$
• $C_{i,j} \mathrel{+}= A_{i,p} B_{p,j}$

[Figure: blocks of C, A, and B at each level of the memory hierarchy]
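The three partitioning steps on the memory hierarchy slides can be read as three nested blocking loops. The sketch below is one possible rendering under assumptions: the block sizes `kc`, `mc`, and `nr` and the function name `blocked_gemm` are hypothetical, and the innermost loops stand in for whatever the core actually executes.

```python
def blocked_gemm(C, A, B, kc, mc, nr):
    """C += A B with three levels of blocking (sizes are illustrative)."""
    m, k, n = len(A), len(B), len(B[0])
    for p in range(0, k, kc):          # C += A_p B_p
        p_end = min(p + kc, k)
        for i in range(0, m, mc):      # C_i += A_{i,p} B_p
            i_end = min(i + mc, m)
            for j in range(0, n, nr):  # C_{i,j} += A_{i,p} B_{p,j}
                j_end = min(j + nr, n)
                for r in range(i, i_end):
                    for c in range(j, j_end):
                        acc = 0
                        for q in range(p, p_end):
                            acc += A[r][q] * B[q][c]
                        C[r][c] += acc
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0], [0, 1], [1, 1]]
C = [[0, 0], [0, 0]]
print(blocked_gemm(C, A, B, kc=2, mc=1, nr=1))  # [[4, 5], [10, 11]]
```

The result does not depend on the block sizes; they only change which pieces of A, B, and C are live at each level at any one time.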

Design of Linear Algebra Core (LAC)

• Distributed memory architecture
• Broadcast buses

Data Mapping on LAC

PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Mapping of a 16×16 matrix A onto a 4×4 2D arrangement of PEs
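The slide does not spell out the distribution rule. One common choice, assumed here purely for illustration, is a 2D round-robin (cyclic) mapping in which element (i, j) lives on PE(i mod 4, j mod 4). The function names `owner` and `local_elements` are hypothetical.

```python
NR = 4  # PEs per row and per column (4x4 grid, as on the slide)


def owner(i, j, nr=NR):
    """PE coordinates holding element (i, j) under the assumed cyclic mapping."""
    return (i % nr, j % nr)


def local_elements(pe_row, pe_col, n=16, nr=NR):
    """Indices of the n x n matrix stored at PE(pe_row, pe_col)."""
    return [(i, j)
            for i in range(pe_row, n, nr)
            for j in range(pe_col, n, nr)]


print(owner(5, 10))               # (1, 2)
print(len(local_elements(0, 0)))  # 16
```

Under this assumption every PE holds the same number of elements (16 of the 256), so storage is balanced across the array.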




Rank-1 Update

$c_{11} \mathrel{+}= a_{1i} \times b_{i1}$

[Figure legend: orange = elements of A, green = elements of B, blue = elements of C]
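A small software model of this step may help: each PE(r, c) holds one accumulator and performs one multiply-accumulate per rank-1 update, using a value of A shared along its row and a value of B shared along its column. The bus orientation and the names `lac_rank1_step` and `lac_gemm` are assumptions of this sketch, not details confirmed by the slides.

```python
NR = 4  # 4x4 grid of PEs


def lac_rank1_step(c_acc, a_col, b_row):
    """One rank-1 update: every PE does a single multiply-accumulate.

    a_col[r] is assumed to be broadcast along PE row r,
    b_row[c] along PE column c.
    """
    for r in range(NR):
        for c in range(NR):
            c_acc[r][c] += a_col[r] * b_row[c]


def lac_gemm(A, B):
    """Accumulate a 4x4 block of C from a 4 x kc A and a kc x 4 B.

    Returns the accumulators and the number of rank-1 steps taken.
    """
    kc = len(B)
    c_acc = [[0] * NR for _ in range(NR)]
    for i in range(kc):
        lac_rank1_step(c_acc, [A[r][i] for r in range(NR)], B[i])
    return c_acc, kc


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0, 1, 0], [0, 1, 0, 1]]
c_acc, steps = lac_gemm(A, B)
print(steps)     # 2
print(c_acc[0])  # [1, 2, 1, 2]
```

In this model all 16 PEs are busy in every step, so a 4×4 block of C takes kc steps regardless of the values involved.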

GEMM on LAP


Multi LAC on Chip

• Same panel of B for all cores

• On-chip memory stores a complete n×n block of C

• Each core computes different panel of C

[Figure labels: LAC 0, LAC 1, LAC 2, Memory]
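The work division described on this slide can be sketched as follows. The assumptions are that the panels of C are contiguous row panels and that the cores can be modeled by a sequential loop; `split_rows`, `core_compute`, and `multi_lac_gemm` are hypothetical names.

```python
def split_rows(m, num_cores):
    """Split m rows of C into contiguous panels, one per core."""
    base, extra = divmod(m, num_cores)
    panels, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        panels.append((start, start + size))
        start += size
    return panels


def core_compute(C, A, B, lo, hi):
    """Work of one core: update rows lo..hi-1 of C with the shared B panel."""
    for r in range(lo, hi):
        for c in range(len(B[0])):
            C[r][c] += sum(A[r][q] * B[q][c] for q in range(len(B)))


def multi_lac_gemm(C, A, B, num_cores=3):
    """Every core reads the same B and writes a different panel of C."""
    for lo, hi in split_rows(len(A), num_cores):
        core_compute(C, A, B, lo, hi)
    return C


print(split_rows(8, 3))  # [(0, 3), (3, 6), (6, 8)]
```

Because the panels of C are disjoint, the cores never write to the same element, and B is the only operand shared by all of them.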




Performance and Power Analysis

• Analytical formulae
– Utilization
– Bandwidth
– Size of local stores
• Cycle-accurate simulator
– Matrix multiplication
– Cholesky factorization
• Component selections
– MAC units (45nm) [Galal et al. 2010]
– Storage model with [CACTI 6.0]
  • Pure SRAM model
– Interconnect
  • AMBA AHB [Lahiri 2004]
  • [Wolkotte 2009]
– Activity of components based on GEMM
– Leakage as 25% to 30% of dynamic power
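The leakage assumption translates into a simple calculation: total power is dynamic power scaled by one plus the leakage fraction. The sketch below is illustrative only; the function names are hypothetical and the input numbers are made up, not results from the talk.

```python
def total_power(dynamic_w, leakage_fraction=0.25):
    """Total power in watts when leakage is a fraction of dynamic power."""
    return dynamic_w * (1.0 + leakage_fraction)


def gflops_per_watt(gflops, dynamic_w, leakage_fraction=0.25):
    """Power efficiency including the leakage term."""
    return gflops / total_power(dynamic_w, leakage_fraction)


# Hypothetical core: 10 GFLOPS at 0.4 W of dynamic power.
for leak in (0.25, 0.30):
    print(leak, round(gflops_per_watt(10.0, 0.4, leak), 2))
```

Moving the leakage fraction from 25% to 30% lowers the estimated efficiency by about 4%, so the choice within that range has only a modest effect on the reported GFLOPS/W.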


Core Utilization Trade-off


• Bandwidth vs. local memory size trade-off

• 100% utilization

• Core dimension trade-off
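To illustrate the bandwidth vs. local memory size trade-off, the sketch below uses a simplified model that is assumed here and is not the set of analytical formulae from the talk. A core with nr×nr MAC units keeps an mc×kc block of A in its local stores, streams in a kc×n panel of B, and reads and writes an mc×n panel of C. That is mc·kc·n multiply-accumulates against kc·n + 2·mc·n words moved, so full utilization needs nr²·(1/mc + 2/kc) words per cycle. The cost of loading A is ignored.

```python
def required_bandwidth(nr, mc, kc):
    """Words per cycle needed to keep nr*nr MAC units fully busy
    under the simplified model described in the lead-in."""
    return nr * nr * (1.0 / mc + 2.0 / kc)


def local_store_words(mc, kc):
    """Words of local storage needed to hold the mc x kc block of A."""
    return mc * kc


for block in (32, 64, 128, 256):
    print(block,
          local_store_words(block, block),
          required_bandwidth(4, block, block))
```

In this model, doubling the block dimension quadruples the local store but only halves the required bandwidth, which is the shape of trade-off the slide refers to.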


Multi-LAC Solution Trade-off

• On-chip memory limits performance
• On-chip bandwidth requirement grows exponentially to maintain peak performance


Performance vs. External Bandwidth

• 33 GB/s off-chip BW
• Over 600 DP-GFLOPS
• Over 90% utilization

[Figure legend: 256x256 / 512x512 / 768x768 / 1024x1024]

PE Efficiency for Different Frequencies

• Area
– Mostly occupied by SRAM
• Power
– Mostly consumed by MAC units
• 120 GFLOPS/W
– Upper limit for SP-PE
• 60 GFLOPS/W
– Upper limit for DP-PE
• 1 GHz sweet spot of performance vs. efficiency
• Low voltages
– SRAM power consumption limits efficiency


LAP vs. Intel® Core2 Duo Penryn

• Power breakdown
– [V George et al. 2007]
• Out of Order and Frontend
– 40% of the core power (over 5 W)
• Execution logic
– Register file


LAP vs. GTX280 Nvidia Tesla

• Single Precision GEMM

LAP vs. GTX480 Nvidia Fermi


Summary of LAP

– 600/1200 DP/SP-GFLOPS
– One/two orders of magnitude improvement vs. GPUs/CPUs


GEMM Performance and Efficiency on Different Platforms

Platform                  GFLOPS  W/mm2  GFLOPS/mm2  GFLOPS/W  Utilization
Cell BE (SP)                 200   0.3      1.5         5        88%
NVidia GTX480 SM (SP)        780   0.2      0.9         5.2      70%
NVidia GTX480 SM (DP)        390   0.2      0.5         2.6      70%
Intel Core-i7 960 (SP)        96   0.4      0.5         1.2      95%
Intel Core-i7 960 (DP)        48   0.4      0.25        0.6      95%
Altera Stratix IV (DP)       100   0.02     0.05        3.5      90+%
ClearSpeed CSX700 (DP)        75   0.02     0.2        12.5      78%
LAP (SP)                    1200   0.2      6-11       55        90+%
LAP (DP)                     600   0.2      3-5        25        90+%
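The efficiency gap can be read directly off the GFLOPS/W column. The short sketch below only divides values copied from the table above; the dictionary layout and the `improvement` helper are hypothetical.

```python
# GFLOPS/W values copied from the table above.
GFLOPS_PER_WATT = {
    "NVidia GTX480 SM (SP)": 5.2,
    "NVidia GTX480 SM (DP)": 2.6,
    "Intel Core-i7 960 (SP)": 1.2,
    "Intel Core-i7 960 (DP)": 0.6,
    "LAP (SP)": 55.0,
    "LAP (DP)": 25.0,
}


def improvement(lap, other):
    """Ratio of LAP power efficiency to another platform's."""
    return GFLOPS_PER_WATT[lap] / GFLOPS_PER_WATT[other]


print(round(improvement("LAP (DP)", "NVidia GTX480 SM (DP)"), 1))   # 9.6
print(round(improvement("LAP (DP)", "Intel Core-i7 960 (DP)"), 1))  # 41.7
print(round(improvement("LAP (SP)", "NVidia GTX480 SM (SP)"), 1))   # 10.6
print(round(improvement("LAP (SP)", "Intel Core-i7 960 (SP)"), 1))  # 45.8
```

These ratios, roughly 10× over the GPU and 40× to 46× over the CPU, are where the "one/two orders of magnitude" summary comes from.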




Conclusion

• Linear Algebra Processor
– Algorithm/architecture co-design
– Power and efficiency estimation
– Generalized to more complex algorithms (Cholesky)
– Results @ 1 GHz
  • DP: 32 GFLOPS, 47 GFLOPS/W
  • 0.6 Watts
  • 2.8 mm2 in 45nm
  • 4 GB/s external BW
  • Orders of magnitude improvement


• Studied architectures and their power consumption sources

Future Work

• Implementation
– Hardware synthesis
• Generalization
– Level-3 BLAS
– LU and QR factorization
• Integration within a general purpose framework
• Design space exploration
– Picking the right algorithm variant

