Laconic Deep Learning Computing
Sayeh Sharify, Mostafa Mahmoud, Alberto Delmas Lascorz, Milos Nikolic, Andreas Moshovos

Abstract
Laconic is a deep neural network inference accelerator.
• Targets primarily CNNs, but can execute any layer type
• Exploits the effectual bit content of both activations and weights
• Term-serial design: 20.5× faster than data-parallel accelerators, 2.6× more energy efficient, at a 26% area cost

Motivation
• Previous accelerators exploit ineffectual computations (zero skipping, precision variability, and zero-bit skipping) to improve the performance and energy efficiency of CNNs.
• We show that the "At+Wt" policy, which eliminates ineffectual computations at the bit level for both activations and weights, has the highest potential speedup.

Conventional Bit-Parallel Accelerator
• Execution time does not vary with the data bit content.
[Figure: a bit-parallel multiply-accumulate unit; 16-bit activations and weights produce 32-bit products that are accumulated regardless of how many bits are actually "1".]

Laconic's Approach
• Process only the effectual terms of activations and weights. For example, for two 8-bit operands with 4 and 3 effectual terms, the work drops from 8 × 8 = 64 single-bit products to 4 × 3 = 12 term products, a 5.3× reduction.
[Figure: baseline bit-parallel multiplication of 01010110 × 00101010 versus Laconic, which decodes only the effectual term exponents (activations: 1, 2, 4, 6; weights: 1, 3, 5), adds exponent pairs, and reduces the results through a histogram stage.]
• Performance gain (with Pa, Pw the activation and weight precisions and ⟨ta⟩, ⟨tw⟩ the average effectual term counts):
  • Convolutional layer: (Pa × Pw) / (⟨ta⟩ × ⟨tw⟩)
  • Fully-connected layer: Pw / ⟨tw⟩
• Data access reduction:
  • Activations: 1 − ⟨ta⟩ / Pa
  • Weights: 1 − ⟨tw⟩ / Pw

Laconic: Term-Serial Accelerator
• Execution time varies with the number of effectual terms.

Enhanced Adder Tree Design
• A more area- and energy-efficient adder tree:
• Divides the outputs of the histogram stage into 6 groups, G0, G1, ..., G5.
• Outputs within the same group have no overlapping bits that are "1".
• Concatenates the outputs within a group to compute their sum.
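The At+Wt idea above can be sketched in a few lines of Python. This is a functional model only, not the hardware, and the names (`effectual_terms`, `term_serial_mul`) are illustrative; for simplicity every "1" bit counts as a term, whereas a signed-power-of-two term encoding could reduce term counts further.

```python
def effectual_terms(x, bits=8):
    """Positions of the '1' bits (effectual terms) in x."""
    return [i for i in range(bits) if (x >> i) & 1]

def term_serial_mul(a, w):
    """At+Wt policy: multiply by summing 2**(i + j) over pairs of
    effectual term exponents only, and report the work performed."""
    ta, tw = effectual_terms(a), effectual_terms(w)
    product = sum(1 << (i + j) for i in ta for j in tw)
    work = len(ta) * len(tw)   # term pairs actually processed
    return product, work

a, w = 0b01010110, 0b00101010      # the example operands from the poster
product, work = term_serial_mul(a, w)
assert product == a * w            # 86 * 42 = 3612, same result as bit-parallel
# work == 12 term pairs, versus 8 * 8 = 64 bit pairs: a 5.3x reduction
```

A bit-parallel multiplier always pays for all 64 bit pairs; here the cost tracks only the effectual bit content of the operands.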
Laconic Processing Element
• Inputs: 16 activations (1 term/cycle) and 16 weights (1 term/cycle)
• Multiplication is performed term-serially
• Reduces the 16 products to a single output
• Supports cascading for smaller layers

Laconic Architecture
• An 8 × 16 array of processing elements (S: sparse), from PE(0,0) to PE(7,15)
• Per tile: 256 input activations, 128 weights, 16 output activations
• An Activation Memory (256-wide) and a Weight Memory (128-wide) feed the execution engine
[Figure: datapath layout, showing the Activation Memory, the Weight Memory, and the 8 × 16 PE array of the execution engine.]

Implementation
• 65 nm TSMC process
• Clock frequency: 980 MHz
• 128W tile:
  • Area: 943,422.48 μm² (about 0.94 mm²)
  • Power: 224.1 mW
  • Operations per second: 752 GOPS
  • Operations per second, per watt: 763.4 GOPS/W

Results Relative to Baseline
[Figures: throughput and energy efficiency relative to the baseline for design configurations 256A-128W, 256A-256W, 256A-512W, and 256A-1KW.]
[Figure: potential speedup per network (log scale) for policies A, A+W, At, and At+Wt; At+Wt shows by far the highest potential, reaching from the hundreds to over a thousand times.]
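How the exponent adders, the histogram stage, and the enhanced adder tree fit together can be sketched functionally. This is an interpretation, not the RTL: in particular, grouping exponents by value mod 6 is our reading of the "no overlapping bits within a group" property, since a count of up to 16 lanes fits in 5 bits and exponents 6 apart therefore never collide once shifted.

```python
from collections import Counter

def pe_cycle(act_exps, wgt_exps):
    """One PE cycle, functionally: each lane receives one activation term
    exponent and one weight term exponent; an exponent adder forms the
    product-term exponent i + j, and a histogram stage counts how many
    lanes produced each exponent."""
    return Counter(i + j for i, j in zip(act_exps, wgt_exps))

def enhanced_adder_tree(hist, groups=6):
    """Reduce the histogram to one partial sum. Shifted counts whose
    exponents are congruent mod `groups` occupy disjoint bit ranges,
    so within a group they are concatenated (OR-ed) instead of added;
    only the 6 group results need a real addition."""
    acc = [0] * groups
    for exp, count in hist.items():
        acc[exp % groups] |= count << exp   # concatenation, no carries
    return sum(acc)

# Four lanes carrying the poster's example exponents, equivalent to
# 2*2 + 4*8 + 16*32 + 64*1:
assert enhanced_adder_tree(pe_cycle([1, 2, 4, 6], [1, 3, 5, 0])) == 612
```

Concatenation replaces most of a conventional adder tree's carry-propagating adders with wiring, which is where the area and energy savings of the enhanced tree come from.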