Page 1

DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics

Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi

49th International Conference on Parallel Processing – ICPP

August 2020

Page 2

Outline

➢Introduction

➢Background

➢Integrated Photonic Residue Arithmetic Computing Engine for Convolutional Neural Network

➢Performance Evaluation

➢Conclusion

Page 3

Introduction

Page 4

Introduction

➢Some neural-network applications require real-time analysis for inference

➢Inference is computation-intensive, involving billions of multiply-accumulate (MAC) operations

➢We propose DNNARA: a deep neural network accelerator using residue arithmetic based on integrated photonics

➢All computations through the neural network are performed in the residue number system (RNS), avoiding intermediate binary-to/from-RNS conversions

[Figure: Block diagram of a DNNARA system]

Page 5

Introduction

➢DNNARA combines RNS with wavelength-division multiplexing (WDM)

• Executes multiple matrix-vector multiplications (MVMs) in parallel, enabled by WDM

• Speeds up MVMs because RNS digits are independent (no carry propagation)

• Residues are small

• Increases system parallelism, saving area and hardware resources

Page 6

Background

➢ Convolutional Neural Network

➢ Residue Number System

Page 7

Background – Convolutional Neural Network

➢Widely applied in classification
• Image recognition

➢Includes several layers/functions
• Convolutional layers
• Activation functions – add non-linearity
  • ReLU (Rectified Linear Unit)
  • Sigmoid function / hyperbolic tangent function
• Pooling layers – downsample the output
  • Max pooling
  • Average pooling
• Fully-connected layers

➢Contains up to billions of multiply-accumulate (MAC) operations

Page 8

Background – Residue Number System (RNS)

➢Each integer X is represented by its "residues": the remainders obtained by dividing X by each modulus Mi

• Example: moduli M1=2, M2=3, M3=5, M4=7
• X = 20 is represented as X = {0, 2, 0, 6}[2, 3, 5, 7]
• Range of representable numbers: 0 to M−1, where M = M1·M2·M3·M4 (here 0 to 209)
• The moduli must be pairwise relatively prime

➢Negative-number notation: similar to 2's complement
• r = |m − |−X|_m|_m (where X is negative)
• Example: −20 = {|2−0|_2, |3−2|_3, |5−0|_5, |7−6|_7}[2, 3, 5, 7] = {0, 1, 0, 1}[2, 3, 5, 7]
• Range of representable numbers: [−(M−1)/2, (M−1)/2] if M is odd, or [−M/2, M/2 − 1] if M is even

➢Residue arithmetic: operations are carried out directly on the residues
• Example: addition of X = 20 = {0, 2, 0, 6}[2, 3, 5, 7] and Y = 5 = {1, 2, 0, 5}[2, 3, 5, 7]
• X+Y = {|0+1|_2, |2+2|_3, |0+0|_5, |6+5|_7}[2, 3, 5, 7] = {1, 1, 0, 4}[2, 3, 5, 7]
• X·Y = {|0·1|_2, |2·2|_3, |0·0|_5, |6·5|_7}[2, 3, 5, 7] = {0, 1, 0, 2}[2, 3, 5, 7]
• Residue arithmetic reduces to modular additions and multiplications on the residues
• All residue channels are processed in parallel (see the sketch below)
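The following minimal Python sketch reproduces the slide's RNS examples end to end: encoding into residues, digit-parallel modular add/multiply, decoding via the Chinese Remainder Theorem, and the negative-number wrap-around. Function names are illustrative, not from the paper.

```python
# A minimal RNS sketch using the slide's moduli [2, 3, 5, 7], so M = 210.
from math import prod

MODULI = [2, 3, 5, 7]
M = prod(MODULI)  # unsigned dynamic range: 0..209

def to_rns(x):
    """Encode an integer as its residues; negatives wrap modulo each m."""
    return [x % m for m in MODULI]

def rns_add(a, b):
    """Digit-parallel, carry-free addition: one modular add per channel."""
    return [(ai + bi) % m for ai, bi, m in zip(a, b, MODULI)]

def rns_mul(a, b):
    """Digit-parallel multiplication, also carry-free."""
    return [(ai * bi) % m for ai, bi, m in zip(a, b, MODULI)]

def from_rns(r):
    """Decode via the Chinese Remainder Theorem (result in 0..M-1)."""
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        x = (x + ri * Mi * pow(Mi, -1, m)) % M
    return x

x, y = to_rns(20), to_rns(5)
assert x == [0, 2, 0, 6] and y == [1, 2, 0, 5]
assert rns_add(x, y) == [1, 1, 0, 4]   # 25 in RNS
assert rns_mul(x, y) == [0, 1, 0, 2]   # 100 in RNS
assert from_rns(rns_mul(x, y)) == 100
assert to_rns(-20) == [0, 1, 0, 1]     # the slide's negative example
```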

Page 9

Integrated Photonic Residue Arithmetic Computing Engine for Neural Network

➢ Overview

➢ Residue Adders and Multipliers

➢ Residue Matrix-Vector Multiplication Unit

➢ Sigmoid Unit

➢ Max Pooling Unit

Page 10

• R-MVM: Residue Matrix-Vector Multiplication

• R-Multiplier: Residue Multiplier

• R-Adder: Residue Adder

• MRR: Micro-Ring Resonator

• PD: Photo-Detector

• LUT: Look-up Table

• RNS2Bin: RNS to Binary

• Bin2RNS: Binary to RNS

• T: tile

Architecture Overview

Page 11

Integrated Photonic Residue Adder and Multiplier

➢Basic block
• An electro-optical 2×2 switch
• Light either propagates straight through ("bar" state – (a)) or crosses over ("cross" state – (b))

➢Residue adder [1] – one-hot encoding
• Can be viewed as a mapping (an injection) from input ports to output ports
• Implemented as an Arbitrary-Size Benes (AS-Benes) network ((c) for an even modulus, (d) for an odd modulus)
• Switch states are precomputed and stored in a look-up table (LUT)

➢An AS-Benes modulo-5 adder (e)
• Example with |3+4|_5 = 2

➢A modulo-N residue multiplier implementation (f)

➢WDM capable (a behavioral sketch of the LUT-programmed adder follows)
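To make the LUT-programmed permutation concrete, here is a minimal behavioral sketch (a software emulation, not the photonic hardware): for each addend w, the precomputed switch configuration routes one-hot light on input port i to output port (i + w) mod N.

```python
# Behavioral model of a one-hot modulo-N adder built as a precomputed
# permutation, mirroring the LUT-programmed AS-Benes network above.
N = 5

# "LUT": for each addend w, store the permutation input port -> output port.
LUT = {w: [(i + w) % N for i in range(N)] for w in range(N)}

def modulo_add(x, w):
    """Inject light on port x; the configured network emits it on |x+w|_N."""
    one_hot_in = [1 if i == x else 0 for i in range(N)]  # one-hot encoding
    perm = LUT[w]                                        # configure switches
    one_hot_out = [0] * N
    for i, lit in enumerate(one_hot_in):                 # light propagates
        if lit:
            one_hot_out[perm[i]] = 1
    return one_hot_out.index(1)                          # photodetector side

assert modulo_add(3, 4) == 2   # the slide's |3+4|_5 = 2 example

# A modulo-N multiplier (f) would instead route i -> (i * w) mod N, which
# is likewise a permutation for a prime modulus and w != 0.
```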

Page 12

Residue MVM (R-MVM) Computing Block

➢Schematic of the designed R-MVM (b)

➢Wavelength-division multiplexing (WDM) capable

➢Requires lasers, MRRs, PDs, LUTs, registers, and the photonic and electrical connections between them

➢A sel signal selects either the partial sum or the bias as the addend

➢Example: a 5×5 input feature map with a 2×2 kernel (see the sketch below)
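Below is a rough functional sketch of what one R-MVM block computes for the slide's example sizes (a 5×5 feature map, a 2×2 kernel): each output pixel is a MAC, sum(w·x) + bias, evaluated independently in every residue channel. This is plain software emulation under assumed moduli [2, 3, 5, 7], not the authors' hardware model.

```python
# Functional sketch of an R-MVM block: a 2-D "valid" convolution with all
# arithmetic performed per modulus (digit-parallel, carry-free).
MODULI = [2, 3, 5, 7]

def conv2d_rns(feature, kernel, bias):
    H, W = len(feature), len(feature[0])
    kH, kW = len(kernel), len(kernel[0])
    out = []
    for r in range(H - kH + 1):
        row = []
        for c in range(W - kW + 1):
            # one residue accumulator per modulus; sel picks the bias for
            # the first accumulation and the partial sum afterwards
            acc = [bias % m for m in MODULI]
            for i in range(kH):
                for j in range(kW):
                    w, x = kernel[i][j], feature[r + i][c + j]
                    acc = [(a + (w % m) * (x % m)) % m
                           for a, m in zip(acc, MODULI)]
            row.append(acc)          # an RNS-encoded output pixel
        out.append(row)
    return out

feature = [[(r + c) % 10 for c in range(5)] for r in range(5)]  # 5x5 input
kernel = [[1, 2], [3, 4]]                                       # 2x2 kernel
result = conv2d_rns(feature, kernel, bias=1)   # 4x4 grid of residue vectors
```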

Page 13

Pipeline of a MAC operation

• Cycle 1:
  • Input features (x) are encoded as light at different wavelengths
  • Weights (w) are encoded on the selection lines, loading the switch states from the LUT

• Cycle 2:
  • Set up the switch states accordingly
  • Inject and detect light – multiply
  • MRRs and PDs act as filters to extract the results of all the multiplications

Page 14

Pipeline of a MAC operation

• Cycle 3:
  • Results from the previous cycle (w·x) are decoded onto the selection lines to load the adder states
  • Depending on sel, either the partial sum or the bias is encoded as light

• Cycle 4:
  • Set up the adders
  • Inject and detect light – add

• Cycle 5: Write the result back to the register (see the pipeline sketch below)
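The sketch below illustrates how the five cycles could overlap once MACs are pipelined; this scheduling view is my own illustration (assuming each cycle occupies distinct hardware), not the authors' simulator.

```python
# Illustrative schedule for the five-cycle MAC pipeline described above.
STAGES = [
    "C1: encode x as wavelengths; fetch switch states for w from the LUT",
    "C2: configure multiplier switches; inject & detect light (multiply)",
    "C3: decode w*x onto selection lines; encode partial sum or bias",
    "C4: configure adders; inject & detect light (add)",
    "C5: write the result back to the register",
]

def schedule(num_macs):
    """If the stages use disjoint resources, a new MAC enters every cycle,
    so num_macs MACs finish in num_macs + 4 cycles."""
    for cycle in range(num_macs + len(STAGES) - 1):
        active = [f"MAC{cycle - s} in C{s + 1}"
                  for s in range(len(STAGES)) if 0 <= cycle - s < num_macs]
        print(f"cycle {cycle + 1}: " + ", ".join(active))

schedule(3)   # e.g. 3 MACs complete in 7 cycles instead of 15
```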

Page 15

Sigmoid Function Unit – Polynomial

➢The sigmoid function is hard to compute directly in the residue domain

➢Instead, it can be treated as a polynomial, since the sigmoid function can be expanded as a Taylor series

➢The terms involving x are pre-calculated, and the connections are built accordingly

➢Example: P(x) = ax⁴ + bx³ + cx² + dx + e in a modulo-5 system (a Horner-style sketch follows)
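A minimal sketch of evaluating the slide's example polynomial entirely with modulo-5 additions and multiplications (Horner form), i.e. using only the operations the residue hardware provides. The coefficient values below are placeholders, not taken from the paper.

```python
# Evaluate P(x) = a*x^4 + b*x^3 + c*x^2 + d*x + e modulo 5 via Horner's
# rule: P(x) = ((((a*x + b)*x + c)*x + d)*x + e) mod 5.
M = 5
a, b, c, d, e = 1, 3, 2, 4, 1   # hypothetical coefficients, already mod 5

def poly_mod(x):
    acc = a
    for coef in (b, c, d, e):
        acc = (acc * (x % M) + coef) % M   # one residue multiply, one add
    return acc

# cross-check against direct evaluation for every residue value
for x in range(M):
    assert poly_mod(x) == (a*x**4 + b*x**3 + c*x**2 + d*x + e) % M
```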

Page 16

Max Pooling Function Unit

➢The sign of a number is only implicit in RNS (there is no explicit sign digit)

➢Instead, we convert the number from RNS to the MRS (mixed-radix number system) [2]

➢In the MRS form, the mixed-radix digit associated with the even modulus 2 (a4) separates negative from non-negative numbers

➢The conversion is serial but can be pipelined (a sketch follows below)
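Below is a minimal sketch of the classic RNS-to-MRS conversion from Szabo and Tanaka [2], applied to sign detection. The moduli ordering [3, 5, 7, 2] (the even modulus last, so its digit a4 is the most significant) is my assumption, chosen so that a4 = 1 exactly when the value falls in the negative half of the range, matching the slide's claim.

```python
# RNS -> mixed-radix (MRS) conversion for sign detection, with M = 210 and
# X = a1 + a2*3 + a3*(3*5) + a4*(3*5*7); a4 is the most significant digit.
MODULI = [3, 5, 7, 2]
M = 3 * 5 * 7 * 2   # signed range: -105..104

def to_mrs(residues):
    """Peel off the mixed-radix digits a1..a4 one channel at a time."""
    r = list(residues)
    digits = []
    for i, m in enumerate(MODULI):
        digits.append(r[i])
        # subtract the digit, then divide by m in every remaining channel
        for j in range(i + 1, len(MODULI)):
            mj = MODULI[j]
            r[j] = ((r[j] - digits[i]) * pow(m, -1, mj)) % mj
    return digits

def is_negative(residues):
    """a4 = 1 puts X in the upper half [M/2, M), i.e. X is negative."""
    return to_mrs(residues)[-1] == 1

neg = [(-20) % m for m in MODULI]   # residues of -20
pos = [20 % m for m in MODULI]      # residues of +20
assert is_negative(neg) and not is_negative(pos)
```

The digit-by-digit dependence is what makes the conversion serial, matching the slide's note that it can nonetheless be pipelined across successive inputs.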

Page 17

Performance Evaluation

Page 18

Experiment Setup

➢Electrical memory components
• CACTI 7.0 [3]

➢Optical switch [4]
• Lumerical FDTD

➢Optical circuit
• Lumerical INTERCONNECT

➢Lasers/MRRs/PDs
• Data from prior work ([5], [6], and [7], respectively)

➢HyperTransport serial link
• Data from [8]

➢System-level design
• Our own simulator

[Table: Configurations of selected benchmarks]

Page 19

Design Space Exploration

➢Swept parameters
• WDM size
• # of tiles in a chip
• # of MVMs in a tile

➢Computation capability
• # of operations / (time · area · power)

Page 20

Hardware Specification

Page 21

Speed & Power Analysis

➢Evaluated on real benchmarks

➢Using more chips is faster, but the speedup does not scale proportionally

➢More chips also consume more power

➢Both effects are mainly due to communication overhead

➢19× faster than a GPU (NVIDIA Tesla V100) for VGG-4 under the same power budget

Page 22

Conclusion

➢Proposed DNNARA, a deep neural network accelerator that uses the residue number system

➢DNNARA is a hybrid electro-optical design

➢Proposed a system-level CNN accelerator chip built with nanophotonics

➢Built a system-level simulator for experimental estimation

➢Reaches up to 12.6 giga-operations/(s·mm²·W)

➢Runs 19× faster than a state-of-the-art GPU (NVIDIA Tesla V100) for VGG-4 under the same power budget

Page 23

References

➢ [1] Jiaxin Peng, Yousra Alkabani, Shuai Sun, Volker Sorger, and Tarek El-Ghazawi. 2019. Integrated Photonics Architectures for Residue Number System Computations. In IEEE International Conference on Rebooting Computing (ICRC 2019), 129–137.

➢ [2] Nicholas S. Szabo and Richard I. Tanaka. 1967. Residue Arithmetic and its Applications to Computer Technology. McGraw-Hill.

➢ [3] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. 2007. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 3–14.

➢ [4] Shuai Sun, Vikram K. Narayana, Ibrahim Sarpkaya, Joseph Crandall, Richard A. Soref, Hamed Dalir, Tarek El-Ghazawi, and Volker J. Sorger. 2017. Hybrid photonic-plasmonic nonblocking broadband 5×5 router for optical networks. IEEE Photonics Journal 10, 2 (2017), 1–12.

➢ [5] Rupert F. Oulton, Volker J. Sorger, Thomas Zentgraf, Ren-Min Ma, Christopher Gladden, Lun Dai, Guy Bartal, and Xiang Zhang. 2009. Plasmon lasers at deep subwavelength scale. Nature 461, 7264 (2009), 629.

➢ [6] Erman Timurdogan, Cheryl M. Sorace-Agaskar, Jie Sun, Ehsan Shah Hosseini, Aleksandr Biberman, and Michael R. Watts. 2014. An ultralow power athermal silicon modulator. Nature Communications 5 (2014), 4008.

➢ [7] Yannick Salamin, Ping Ma, Benedikt Baeuerle, Alexandros Emboras, Yuriy Fedoryshyn, Wolfgang Heni, Bojun Cheng, Arne Josten, and Juerg Leuthold. 2018. 100 GHz plasmonic photodetector. ACS Photonics 5, 8 (2018), 3291–3297.

➢ [8] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622.

Page 24


Thank you!