Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures
Ardavan Pedram, Robert van de Geijn, Andreas Gerstlauer

Apr 01, 2015

Transcript
Page 1: Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures Ardavan Pedram Robert van de Geijn Andreas Gerstlauer.

Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures

Ardavan Pedram Robert van de Geijn Andreas Gerstlauer

Page 2

Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Analysis
• Conclusion and Future Work

04/11/23 2©Ardavan Pedram 2012

Page 3

Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Analysis
• Conclusion and Future Work

Page 4

Trend of processors

• Technology scaling has reached physical limits
  – Power is now the limit on performance
• Chips may contain dark silicon
  – Only a fraction of the chip may be active at any time

Page 5

Heterogeneous Solution

– Increase power efficiency: GFLOPS/W
– More cores with lower frequency and power
– Specialized cores
  • Orders of magnitude better power efficiency (GFLOPS/W)
  • Expensive
  • Long time to market

[Figure: Nvidia Tegra System on Chip]

Page 6

Linear Algebra Processor Design Goals

• Efficiency of full custom hardware
  – Orders of magnitude improvement
• Achieving upper limits of power/performance ratio
• Flexibility to execute a whole class of coarse-grain operations
• Co-optimized and co-designed across all layers
• Targeting linear algebra applications

[Figure source: Andreas Olofsson]

Page 7

Linear Algebra Routines

• Linear Algebra Package (LAPACK) level
  – Cholesky and QR factorization
• Basic Linear Algebra Subprograms (BLAS)
  – General matrix-matrix multiplication (GEMM)
• Inner kernels
  – Hand-optimized
• GEMM is often what delivers high performance to many crucial applications

Page 8

Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Analysis
• Conclusion and Future Work

Page 9

GEMM Implementations

• CPUs: 95% peak
  – [Goto et al. 2008] [Intel MKL]
• GPUs: 70% peak
  – [Nath et al. 2010] Nvidia Fermi
  – [Volkov et al. 2008] Nvidia Tesla
• FPGAs: 99% peak
  – [Zikari et al. 2007]
  – [Zhuo et al. 2008]
• Specialized architectures
  – Clearspeed CSX: 78% peak
  – Systolic arrays: [Lippert et al. 2001]

• Intel Quad core
  – 40 GFLOPS @ 2.6 GHz
• Nvidia Fermi
  – 350 GFLOPS @ 1.15 GHz
• Altera Stratix IV
  – 100 GFLOPS @ 0.4 GHz
• CSX 700
  – 75 GFLOPS @ 0.25 GHz

Page 10

Common Sources of Inefficiencies in Conventional Architectures

• CPUs & GPUs
  – Instruction handling
  – Multi-ported register file
  – Cache overheads: tags and coherency
  – Thread scheduling
• FPGAs
  – Low area efficiency
• Specialized architectures
  – Data communication overheads

Page 11

Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Modeling
• Generalization
• Conclusion and Future Work

Page 12

Matrix Multiplication Hierarchy

• Fastest general-purpose implementation of GEMM [GotoBLAS]

[Figure: matrices C, A, and B]

Page 13

Rank-1 Update

• Rank-1 update: updates a matrix by adding the outer product of two vectors to it

Matrix multiplication using a series of rank-1 updates: let C, A, and B be 4 x 4, 4 x kc, and kc x 4 matrices. C += AB can be computed as:

for i = 0 to kc-1
  C += A(:,i) B(i,:)    (column i of A times row i of B)
end for

[Figure: matrices C, A, and B]
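The rank-1 formulation above can be checked with a few lines of plain Python. This is a minimal sketch, not the authors' kernel: the function names, the value kc = 3, and the small integer matrices are illustrative assumptions.

```python
def rank1_update(C, a_col, b_row):
    """In place: C += a_col * b_row^T (outer product of two vectors)."""
    for r, a in enumerate(a_col):
        for c, b in enumerate(b_row):
            C[r][c] += a * b

def gemm_by_rank1(C, A, B):
    """In place: C += A B, computed as kc successive rank-1 updates."""
    kc = len(B)                          # A is m x kc, B is kc x n
    for i in range(kc):
        a_col = [row[i] for row in A]    # column i of A
        b_row = B[i]                     # row i of B
        rank1_update(C, a_col, b_row)
    return C

# Hypothetical 4 x 4 example with kc = 3 and small integer entries.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]     # 4 x kc
B = [[1, 0, 2, 1], [0, 1, 1, 0], [2, 1, 0, 1]]       # kc x 4
C = [[0] * 4 for _ in range(4)]
gemm_by_rank1(C, A, B)

# Reference: the usual inner-product formulation of C = A B.
reference = [[sum(A[r][p] * B[p][c] for p in range(3)) for c in range(4)]
             for r in range(4)]
print(C == reference)
```

Each pass of the outer loop touches every element of C exactly once, which is what makes the update a natural fit for a 2D array of multiply-accumulate units.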

Page 14

Linear Algebra Core (LAC) Design

• Customized for rank-1 update
  – 2D arrangement of PEs
  – Broadcast buses
• Integrates into memory hierarchy

Page 15

Memory Hierarchy

C += A_0 B_0 + … + A_(K-1) B_(K-1)

[Figure: A, B, and C across main memory, on-chip memory, and core local stores]

Page 16

Memory Hierarchy

C_i += A_(i,p) B_p

[Figure: A, B, and C across on-chip memory and core local stores]

Page 17

Memory Hierarchy

C_(i,j) += A_(i,p) B_(p,j)

[Figure: A, B, and C across main memory, on-chip memory, and core local stores]

Page 18

Memory Hierarchy

[Figure: A, B, and C across main memory, on-chip memory, and core local stores]
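The three update forms on the Memory Hierarchy slides (C += A_0 B_0 + …, then C_i += A_(i,p) B_p, then C_(i,j) += A_(i,p) B_(p,j)) correspond to three nested levels of blocking. The sketch below shows that loop structure in plain Python. The block-size names mc, kc, and nr and the test matrices are hypothetical, and the slides do not say which block lives in which memory level, so the code makes no claim about that.

```python
def blocked_gemm(C, A, B, mc, kc, nr):
    """In place: C += A B using three levels of blocking.

    mc, kc, nr are hypothetical block sizes (assumptions, not from the slides).
    """
    m, k, n = len(A), len(B), len(B[0])
    for p0 in range(0, k, kc):               # C     += A_p     B_p
        p1 = min(p0 + kc, k)
        for i0 in range(0, m, mc):           # C_i   += A_(i,p) B_p
            i1 = min(i0 + mc, m)
            for j0 in range(0, n, nr):       # C_(i,j) += A_(i,p) B_(p,j)
                j1 = min(j0 + nr, n)
                for r in range(i0, i1):
                    for c in range(j0, j1):
                        acc = 0
                        for p in range(p0, p1):
                            acc += A[r][p] * B[p][c]
                        C[r][c] += acc
    return C

# Hypothetical 6 x 5 times 5 x 7 problem with deliberately ragged blocks.
m, k, n = 6, 5, 7
A = [[(r + 2 * p) % 5 for p in range(k)] for r in range(m)]
B = [[(3 * p + c) % 4 for c in range(n)] for p in range(k)]
C = [[0] * n for _ in range(m)]
blocked_gemm(C, A, B, mc=4, kc=2, nr=3)

reference = [[sum(A[r][p] * B[p][c] for p in range(k)) for c in range(n)]
             for r in range(m)]
print(C == reference)
```

Block sizes that do not divide the matrix dimensions are handled by clamping each block's upper bound, so the result does not depend on the chosen sizes.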

Page 19

Design of Linear Algebra Core (LAC)

• Distributed memory architecture
• Broadcast buses

Page 20

Data Mapping on LAC

4x4 2D arrangement of PEs:

PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Mapping of a 16x16 matrix A onto the 4x4 2D arrangement of PEs

Page 21

Data Mapping on LAC

4x4 2D arrangement of PEs:

PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Mapping of a 16x16 matrix A onto the 4x4 2D arrangement of PEs
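The slide text does not spell out the mapping rule. One common choice for this kind of layout is a 2D round-robin (cyclic) distribution, in which element (i, j) goes to PE(i mod 4, j mod 4). The sketch below assumes that rule; the function names are hypothetical.

```python
NR = 4   # the slide's 4x4 arrangement of PEs

def owner(i, j, nr=NR):
    """PE coordinates holding element (i, j) under an ASSUMED 2D round-robin map."""
    return (i % nr, j % nr)

def local_elements(pe_row, pe_col, n=16, nr=NR):
    """Global (i, j) indices of an n x n matrix stored by PE(pe_row, pe_col)."""
    return [(i, j)
            for i in range(pe_row, n, nr)
            for j in range(pe_col, n, nr)]

# Under this assumption every PE holds (16 / 4)^2 = 16 elements of A.
counts = {(r, c): len(local_elements(r, c))
          for r in range(NR) for c in range(NR)}
print(owner(5, 10), counts[(2, 1)])
```

With a cyclic layout, any aligned 4x4 block of the matrix has exactly one element on each PE, so each PE can supply or update one element of such a block per step.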

Page 22

Rank-1 Update

c11 += a1i × bi1   c12 += a1i × bi2   c13 += a1i × bi3   c14 += a1i × bi4
c21 += a2i × bi1   c22 += a2i × bi2   c23 += a2i × bi3   c24 += a2i × bi4
c31 += a3i × bi1   c32 += a3i × bi2   c33 += a3i × bi3   c34 += a3i × bi4
c41 += a4i × bi1   c42 += a4i × bi2   c43 += a4i × bi3   c44 += a4i × bi4

[Figure legend: orange = elements of A, green = elements of B, blue = elements of C]
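The per-PE updates on this slide can be simulated one step at a time. In the sketch below, the value a_ri is shared by every PE in row r and b_ic by every PE in column c. Placing the A values on row buses and the B values on column buses is an assumption made for illustration, and all names are hypothetical.

```python
NR = 4   # 4x4 arrangement of PEs

def lac_rank1_step(C, A, B, i):
    """One rank-1 step on the grid: PE(r, c) performs c_rc += a_ri * b_ic."""
    a_on_row_bus = [A[r][i] for r in range(NR)]   # a_ri, shared along PE row r
    b_on_col_bus = [B[i][c] for c in range(NR)]   # b_ic, shared along PE column c
    for r in range(NR):
        for c in range(NR):
            # In hardware the 16 multiply-accumulates would happen together;
            # the loops only serialize them for this simulation.
            C[r][c] += a_on_row_bus[r] * b_on_col_bus[c]

kc = 3   # hypothetical number of rank-1 steps
A = [[r + p for p in range(kc)] for r in range(NR)]        # 4 x kc
B = [[p * c + 1 for c in range(NR)] for p in range(kc)]    # kc x 4
C = [[0] * NR for _ in range(NR)]
for i in range(kc):
    lac_rank1_step(C, A, B, i)

reference = [[sum(A[r][p] * B[p][c] for p in range(kc)) for c in range(NR)]
             for r in range(NR)]
values_broadcast_per_step = 2 * NR    # 4 values of A plus 4 values of B
macs_per_step = NR * NR               # one multiply-accumulate per PE
print(C == reference, values_broadcast_per_step, macs_per_step)
```

Per step, 8 broadcast values feed 16 multiply-accumulates, which illustrates why broadcast buses suit the rank-1 update.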

Page 23

GEMM on LAP

Page 24

Multi LAC on Chip

• Same panel of B for all cores
• On-chip memory stores a complete n×n block of C
• Each core computes a different panel of C

[Figure labels: LAC 0, LAC 1, LAC 2, Memory]
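A rough way to picture this work split is shown below: every simulated core reads the same B, while each one owns a different panel of C and the matching rows of A. Treating the panels as contigurow blocks of rows, and the function and variable names, are assumptions made for illustration.

```python
def multi_lac_gemm(C, A, B, num_lacs):
    """In place: C += A B with the rows of C split into one panel per core.

    Returns the (core, first_row, end_row) assignment that was used.
    """
    m, k, n = len(A), len(B), len(B[0])
    rows_per_lac = -(-m // num_lacs)          # ceil(m / num_lacs)
    assignments = []
    for lac in range(num_lacs):
        r0 = min(lac * rows_per_lac, m)
        r1 = min(r0 + rows_per_lac, m)
        assignments.append((lac, r0, r1))
        # This core reads all of B but only its own rows of A and C.
        for r in range(r0, r1):
            for c in range(n):
                C[r][c] += sum(A[r][p] * B[p][c] for p in range(k))
    return assignments

# Hypothetical problem: 6 rows of C shared by 3 cores.
m, k, n = 6, 4, 5
A = [[(r + p) % 3 for p in range(k)] for r in range(m)]
B = [[(2 * p + c) % 5 for c in range(n)] for p in range(k)]
C = [[0] * n for _ in range(m)]
assignments = multi_lac_gemm(C, A, B, num_lacs=3)

reference = [[sum(A[r][p] * B[p][c] for p in range(k)) for c in range(n)]
             for r in range(m)]
print(assignments, C == reference)
```

Because the panels of C are disjoint, the cores never write the same element, and the shared panel of B can be delivered to all of them at once.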

Page 25

Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Core
• Power/Performance Analysis
• Conclusion and Future Work

Page 26

Performance and Power Analysis

• Analytical formulae
  – Utilization
  – Bandwidth
  – Size of local stores
• Cycle-accurate simulator
  – Matrix multiplication
  – Cholesky factorization
• Component selections
  – MAC units (45nm) [Galal et al. 2010]
  – Storage model with [CACTI 6.0]
    • Pure SRAM model
  – Interconnect
    • AMBA AHB [Lahiri 2004]
    • [Wolkotte 2009]
  – Activity of components based on GEMM
  – Leakage modeled as 25% to 30% of dynamic power

Page 27

Core Utilization Trade-off

• Bandwidth vs. local memory size trade-off
• 100% utilization
• Core dimension trade-off

Page 28

Multi-LAC Solution Trade-off

• On-chip memory limits performance
• On-chip bandwidth requirement grows exponentially to maintain peak performance

Page 29

Performance vs. External Bandwidth

• 33 GB/s off-chip BW
• Over 600 DP-GFLOPS
• Over 90% utilization

[Figure legend: 256x256 / 512x512 / 768x768 / 1024x1024]

Page 30

PE Efficiency for Different Frequencies

• Area
  – Mostly occupied by SRAM
• Power
  – Mostly consumed by MAC units
• 120 GFLOPS/W
  – Upper limit for SP-PE
• 60 GFLOPS/W
  – Upper limit for DP-PE
• 1 GHz sweet spot of performance vs. efficiency
• At low voltages
  – SRAM power consumption limits efficiency

Page 31

LAP vs. Intel® Core2 Duo Penryn

• Power breakdown
  – [V. George et al. 2007]
• Out of Order and Frontend
  – 40% of the core power (over 5 W)
• Execution logic
  – Register file

Page 32

LAP vs. GTX280 Nvidia Tesla

• Single Precision GEMM

Page 33

LAP vs. GTX480 Nvidia Fermi

Page 34

Summary of LAP

– 600/1200 DP/SP-GFLOPS
– One/two orders of magnitude improvement vs. GPUs/CPUs

Page 35

GEMM Performance and Efficiency on Different Platforms

Platform                   GFLOPS   W/mm2   GFLOPS/mm2   GFLOPS/W   Utilization
Cell BE (SP)                  200    0.3       1.5          5          88%
NVidia GTX480 SM (SP)         780    0.2       0.9          5.2        70%
NVidia GTX480 SM (DP)         390    0.2       0.5          2.6        70%
Intel Core-i7 960 (SP)         96    0.4       0.5          1.2        95%
Intel Core-i7 960 (DP)         48    0.4       0.25         0.6        95%
Altera Stratix IV (DP)        100    0.02      0.05         3.5        90+%
ClearSpeed CSX700 (DP)         75    0.02      0.2          12.5       78%
LAP (SP)                     1200    0.2       6-11         55         90+%
LAP (DP)                      600    0.2       3-5          25         90+%
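The efficiency gap in this table can be made explicit with a little arithmetic. The sketch below copies the double-precision GFLOPS and GFLOPS/W values from the table and derives two quantities that the slide does not state directly: the power implied by dividing GFLOPS by GFLOPS/W, and the ratio of the LAP's GFLOPS/W to each platform's. The derived wattage is only as meaningful as the table's accounting (for example, which parts of each chip are counted), so treat it as a rough cross-check.

```python
# (GFLOPS, GFLOPS/W) for the double-precision rows of the table.
platforms = {
    "NVidia GTX480 SM (DP)": (390, 2.6),
    "Intel Core-i7 960 (DP)": (48, 0.6),
    "Altera Stratix IV (DP)": (100, 3.5),
    "ClearSpeed CSX700 (DP)": (75, 12.5),
    "LAP (DP)": (600, 25),
}

def implied_power_watts(gflops, gflops_per_watt):
    """Power implied by the table: GFLOPS divided by GFLOPS/W."""
    return gflops / gflops_per_watt

lap_efficiency = platforms["LAP (DP)"][1]
advantage = {name: lap_efficiency / eff for name, (_, eff) in platforms.items()}

for name, (gflops, eff) in platforms.items():
    watts = implied_power_watts(gflops, eff)
    print(f"{name:24s} ~{watts:6.1f} W   LAP GFLOPS/W ratio: {advantage[name]:5.1f}x")
```

On these numbers the LAP's double-precision efficiency is roughly 10 times that of the GTX480 and roughly 40 times that of the Core-i7 960, in line with the "one/two orders of magnitude" summary.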

Page 36

Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Core
• Power/Performance Analysis
• Conclusion and Future Work

Page 37

Conclusion

• Linear Algebra Processor
  – Algorithm/architecture co-design
  – Power and efficiency estimation
  – Generalized to more complex algorithms (Cholesky)
  – Results @ 1 GHz
    • DP: 32 GFLOPS, 47 GFLOPS/W
    • 0.6 Watts
    • 2.8 mm2 in 45nm
    • 4 GB/s external BW
    • Orders of magnitude improvement
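The 32 GFLOPS figure is consistent with a simple peak-rate calculation for a 4x4 core. The sketch below assumes one MAC unit per PE and counts each multiply-accumulate as two floating-point operations. Both are assumptions, since the slide lists only the totals.

```python
def peak_gflops(num_pes, macs_per_pe, freq_ghz):
    """Peak rate, counting each multiply-accumulate as 2 floating-point ops."""
    return num_pes * macs_per_pe * 2 * freq_ghz

# Assumption: a 4x4 arrangement of PEs with one MAC unit each, at 1 GHz.
core_peak = peak_gflops(num_pes=4 * 4, macs_per_pe=1, freq_ghz=1.0)

# Power implied by the slide's 47 GFLOPS/W at that peak rate.
implied_watts = core_peak / 47
print(core_peak, round(implied_watts, 2))
```

The implied power comes out just under 0.7 W, in the same range as the 0.6 W listed on the slide. The difference may be due to rounding or to the efficiency figure being measured at less than peak utilization.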

Page 38

Conclusion

• Studied architectures and their power consumption sources

Page 39

Future Work

• Implementation
  – Hardware synthesis
• Generalization
  – Level-3 BLAS
  – LU and QR factorization
• Integration within a general purpose framework
• Design space exploration
  – Picking the right algorithm variant