ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Method of Multipliers

Ao Ren¹*, Tianyun Zhang²*, Shaokai Ye², Jiayu Li², Wenyao Xu³, Xuehai Qian⁴, Xue Lin¹, Yanzhi Wang¹
* Ao Ren and Tianyun Zhang contributed equally to this work.
¹ Department of Electrical and Computer Engineering, Northeastern University
² Department of Electrical Engineering and Computer Science, Syracuse University
³ Department of Computer Science and Engineering, SUNY University at Buffalo
⁴ Department of Electrical Engineering, University of Southern California
¹ [email protected], ¹ {xue.lin, yanz.wang}@northeastern.edu, ² {tzhan120, sye106, jli221}@syr.edu, ³ wenyaoxu@buffalo.edu, ⁴ [email protected]

To appear in ASPLOS 2019.

ABSTRACT

To facilitate efficient embedded and hardware implementations of deep neural networks (DNNs), a number of prior works are dedicated to model compression techniques. The target is to simultaneously reduce the model storage size and accelerate the computation, with minor effect on accuracy. Two important categories of DNN model compression techniques are weight pruning and weight quantization. The former leverages the redundancy in the number of weights, whereas the latter leverages the redundancy in the bit representation of weights. These two sources of redundancy can be combined, thereby leading to a higher degree of DNN model compression. However, a systematic framework of joint weight pruning and quantization of DNNs is lacking, which limits the available model compression ratio. Moreover, the computation reduction, energy efficiency improvement, and hardware performance overhead need to be accounted for, besides simply the model size reduction.

To address these limitations, we present ADMM-NN, the first algorithm-hardware co-optimization framework of DNNs using the Alternating Direction Method of Multipliers (ADMM), a powerful technique for dealing with non-convex optimization problems with possibly combinatorial constraints. The first part of ADMM-NN is a systematic, joint framework of DNN weight pruning and quantization using ADMM. It can be understood as a smart regularization technique whose regularization target is dynamically updated in each ADMM iteration, thereby resulting in higher model compression performance than prior work. The second part is hardware-aware DNN optimizations to facilitate hardware-level implementations. We perform ADMM-based weight pruning and quantization accounting for (i) the computation reduction and energy efficiency improvement, and (ii) the hardware performance overhead due to irregular sparsity. The first requirement prioritizes the compression of convolutional layers over fully-connected layers, while the second requires the concept of the break-even pruning ratio, defined as the minimum pruning ratio of a specific layer that results in no hardware performance degradation.

Without accuracy loss, we can achieve 85× and 24× pruning on the LeNet-5 and AlexNet models, respectively, significantly higher than prior work. The improvement becomes more significant when focusing on computation reduction. Combining weight pruning and quantization, we achieve 1,910× and 231× reductions in overall model size on these two benchmarks, when focusing on data storage. Highly promising results are also observed on other representative DNNs such as VGGNet and ResNet-50. We release codes and models at anonymous link http://bit.ly/2M0V7DO.
1 INTRODUCTION

The wide applications of deep neural networks (DNNs), especially for embedded and IoT systems, call for efficient implementations of at least the inference phase of DNNs in power-budgeted systems. To achieve both high performance and energy efficiency, hardware acceleration of DNNs, including both FPGA-based and ASIC-based implementations, has been intensively studied in both academia and industry [1, 2, 4, 6–8, 13, 16, 20, 21, 28, 31, 35, 37, 41, 43–45, 48, 49, 51, 52, 54, 61, 62, 65]. With large model sizes (e.g., for the ImageNet dataset [11]), hardware accelerators suffer from frequent accesses to off-chip DRAM due to the limited on-chip SRAM memory. Unfortunately, off-chip DRAM accesses consume significant energy, e.g., 200× compared to on-chip SRAM [8, 21], and can thus easily dominate the whole system power consumption.

To overcome this hurdle, a number of prior works are dedicated to model compression techniques for DNNs, in order to simultaneously reduce the model size (storage requirement) and accelerate the computation, with minor effect on accuracy. Two important categories of DNN model compression techniques are weight pruning and weight quantization.

A pioneering work on weight pruning is Han et al. [24], an iterative, heuristic method that achieves 9× reduction in the number of weights of AlexNet (ImageNet dataset). This work has been extended to improve the weight pruning ratio and the actual implementation efficiency [18, 20, 21]. Weight quantization of DNNs has also been investigated in plenty of recent works [9, 27, 33, 34, 40, 42, 55, 57, 66], quantizing DNN weights to binary values, ternary values, or powers of 2, with acceptable accuracy loss. Both storage and computational requirements are reduced in this way; multiplication operations may even be eliminated through binary or ternary weight quantization [9, 27, 42].

The effectiveness of weight pruning lies in the redundancy in the number of weights in a DNN, whereas the effectiveness of weight quantization lies in the redundancy in the bit representation of weights. These two sources of redundancy can be combined, thereby leading to a higher degree of DNN model compression. Although certain prior works investigate this combination using greedy, heuristic methods [21, 22, 66], a systematic framework of joint weight pruning and quantization of DNNs is lacking. As a result, they cannot achieve the highest possible model compression ratio by fully exploiting the degree of redundancy.
Weight quantization: A number of recent works quantize DNN weights to binary values, ternary values, or powers of 2 to facilitate hardware implementations, with acceptable accuracy loss. The state-of-the-art technique adopts an iterative quantization and retraining framework, with randomness incorporated in quantization [9]. It achieves less than 3% accuracy loss on AlexNet for binary weight quantization [33]. It is also worth noticing that a similar technique, weight clustering, groups weights into clusters with arbitrary values, which is different from the equal-interval values used in quantization. As a result, weight clustering is not as hardware-friendly as quantization [22, 67].
Pros and cons of the two methods: Weight quantization has a clear advantage: it is hardware-friendly. The computation requirement is reduced in proportion to the weight representation, and multiplication operations can be eliminated using binary/ternary quantization. On the other hand, weight pruning incurs inevitable implementation overhead due to the irregular sparsity and the required indexing [14, 22, 53, 56, 58].

The major advantage of weight pruning is the higher potential gain in model compression. The reasons are twofold. First, there is often a higher degree of redundancy in the number of weights than in the bit representation; in fact, removing each bit from the weight representation doubles the imprecision, which is not the case in pruning. Second, weight pruning performs regularization that strengthens the salient weights and prunes the unimportant ones. It can even increase the accuracy with a moderate pruning ratio [23, 53], and as a result it provides a higher margin of weight reduction. This effect does not exist in weight quantization/clustering.
Combination: Because they leverage different sources of redundancy, weight pruning and quantization can be effectively combined. However, a systematic investigation in this direction is lacking. The extended work [22] by Han et al. uses a combination of weight pruning and clustering (not quantization) techniques, achieving 27× model compression on AlexNet. This compression ratio has been improved by the recent work [66] to 53× on AlexNet (but without any specification of the compressed model).
2.2 Basics of ADMM

ADMM has been demonstrated [38, 50] to be a powerful tool for solving non-convex optimization problems, potentially with combinatorial constraints. Consider a non-convex optimization problem that is difficult to solve directly. The ADMM method decomposes it into two subproblems that can be solved separately and efficiently. For example, the optimization problem

    \min_{x} \; f(x) + g(x)    (1)

lends itself to the application of ADMM if f(x) is differentiable and g(x) has some structure that can be exploited. Examples of g(x) include the L1-norm or the indicator function of a constraint set. The problem is first re-written as

    \min_{x, z} \; f(x) + g(z), \quad \text{subject to } x = z.    (2)

Next, by using the augmented Lagrangian [5], the above problem is decomposed into two subproblems on x and z. The first is min_x f(x) + q1(x), where q1(x) is a quadratic function. As q1(x) is convex, the complexity of solving subproblem 1 (e.g., via stochastic gradient descent) is the same as minimizing f(x). Subproblem 2 is min_z g(z) + q2(z), where q2(z) is also a quadratic function. When the function g has some special structure, exploiting the properties of g allows this subproblem to be solved analytically and optimally. In this way we get rid of the combinatorial constraints and solve a problem that is difficult to handle directly.
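To make the decomposition concrete, the following is a minimal numerical sketch (our own illustration, not the paper's implementation) of this two-subproblem iteration: f(x) is a toy least-squares loss and g(x) is the indicator of an at-most-k-nonzeros constraint; all function and variable names are assumptions for this example.

```python
import numpy as np

def admm_sparse_least_squares(A, b, k, rho=1.0, iters=100, inner_steps=50, lr=1e-3):
    """Toy ADMM: minimize 0.5*||Ax - b||^2 subject to ||x||_0 <= k."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        # Subproblem 1: minimize f(x) + (rho/2)*||x - z + u||^2 (smooth) by gradient descent.
        for _ in range(inner_steps):
            grad = A.T @ (A @ x - b) + rho * (x - z + u)
            x = x - lr * grad
        # Subproblem 2: Euclidean projection of (x + u) onto {at most k nonzeros},
        # solved in closed form by keeping the k largest-magnitude entries.
        v = x + u
        z = np.zeros_like(v)
        keep = np.argsort(np.abs(v))[-k:]
        z[keep] = v[keep]
        # Dual update.
        u = u + x - z
    return z  # z satisfies the combinatorial constraint exactly

# Example: recover a 5-sparse solution of a random least-squares problem.
rng = np.random.default_rng(0)
A = rng.standard_normal((80, 40))
x_true = np.zeros(40)
x_true[rng.choice(40, size=5, replace=False)] = rng.standard_normal(5)
b = A @ x_true
x_hat = admm_sparse_least_squares(A, b, k=5)
```

Note that the x-update only ever minimizes a smooth function, while the combinatorial structure is handled entirely by the closed-form projection in the z-update.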
3 ADMM FRAMEWORK FOR JOINT WEIGHT PRUNING AND QUANTIZATION

In this section, we present the novel framework of ADMM-based DNN weight pruning and quantization, as well as the joint model compression problem.
3.1 Problem Formulation

Consider a DNN with N layers, which can be convolutional (CONV) and fully-connected (FC) layers. The collection of weights in the i-th layer is W_i; the collection of biases in the i-th layer is denoted by b_i. The loss function associated with the DNN is denoted by f({W_i}_{i=1}^N, {b_i}_{i=1}^N).

The problem of weight pruning and quantization is an optimization problem [57, 64]:

    \underset{\{W_i\},\{b_i\}}{\text{minimize}} \;\; f\big(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N\big), \quad \text{subject to } W_i \in S_i, \; i = 1, \ldots, N.    (3)

Thanks to the flexibility in the definition of the constraint set S_i, the above formulation is applicable to the individual problems of weight pruning and weight quantization, as well as the joint problem. For the weight pruning problem, the constraint set is S_i = {W_i : the number of nonzero weights is less than or equal to α_i}, where α_i is the desired number of weights after pruning in layer i (an alternative formulation is to use a single α as an overall constraint on the number of weights in the whole DNN). For the weight quantization problem, S_i = {W_i : the weights in layer i take values from the quantization set {Q_1, Q_2, ..., Q_M}}, where M is the number of quantization values/levels. For quantization, these Q values are fixed, and the interval between two adjacent quantization values is the same, in order to facilitate hardware implementations.

For the joint problem, the above two constraints need to be satisfied simultaneously. In other words, the number of nonzero weights should be less than or equal to α_i in each layer, while the remaining nonzero weights should be quantized.
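For concreteness, the following small sketch (our own illustration; the helper names and the symmetric equal-interval levels are assumptions, not the paper's code) spells out membership in the two constraint sets and in the joint set.

```python
import numpy as np

def quantization_levels(delta, M):
    """M equal-interval levels centered at 0, e.g., delta=0.1, M=5 gives {-0.2, -0.1, 0, 0.1, 0.2}."""
    half = (M - 1) // 2
    return delta * np.arange(-half, half + 1)

def in_pruning_set(W, alpha):
    """W_i in S_i (pruning): at most alpha nonzero weights in the layer."""
    return np.count_nonzero(W) <= alpha

def in_quantization_set(W, levels):
    """W_i in S_i (quantization): every weight equals one of the fixed levels."""
    return bool(np.all(np.isin(W, levels)))

def in_joint_set(W, alpha, levels):
    """Joint constraint: at most alpha nonzeros, and every remaining weight is a quantization level (or zero)."""
    return in_pruning_set(W, alpha) and in_quantization_set(W, np.append(levels, 0.0))

W = np.array([0.0, 0.1, -0.2, 0.0, 0.1])
print(in_joint_set(W, alpha=3, levels=quantization_levels(0.1, 5)))  # True
```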
3.2 ADMM-based Solution Framework

The above problem is non-convex with combinatorial constraints, and cannot be solved using stochastic gradient descent methods (e.g., ADAM [29]) as in original DNN training. However, it can be efficiently solved using the ADMM framework, which eliminates the combinatorial constraints. To apply ADMM, we define indicator functions

    g_i(W_i) = \begin{cases} 0 & \text{if } W_i \in S_i, \\ +\infty & \text{otherwise}, \end{cases}

for i = 1, ..., N. We then incorporate auxiliary variables Z_i and rewrite problem (3) as

    \underset{\{W_i\},\{b_i\}}{\text{minimize}} \;\; f\big(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N\big) + \sum_{i=1}^{N} g_i(Z_i), \quad \text{subject to } W_i = Z_i, \; i = 1, \ldots, N.    (4)

Through the application of the augmented Lagrangian [5], problem (4) is decomposed into two subproblems by ADMM. We solve the subproblems iteratively until convergence. The first subproblem is

    \underset{\{W_i\},\{b_i\}}{\text{minimize}} \;\; f\big(\{W_i\}_{i=1}^N, \{b_i\}_{i=1}^N\big) + \sum_{i=1}^{N} \frac{\rho_i}{2} \|W_i - Z_i^k + U_i^k\|_F^2,    (5)

where U_i^k is the dual variable updated in each iteration, U_i^k := U_i^{k-1} + W_i^k - Z_i^k. In the objective function of (5), the first term is the differentiable loss function of the DNN, and the second quadratic term is differentiable and convex. The combinatorial constraints are effectively eliminated. This problem can be solved by stochastic gradient descent (e.g., ADAM), and the complexity is the same as training the original DNN.
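A minimal sketch of how subproblem 1 could look in practice (our own PyTorch-style illustration; `admm_layers`, `Z`, `U`, and `rho` are assumed bookkeeping structures, not the authors' code):

```python
import torch

def subproblem1_loss(model, loss_fn, inputs, labels, admm_layers, Z, U, rho):
    """Standard DNN loss f({W_i},{b_i}) plus the per-layer quadratic penalty of Eqn. (5)."""
    loss = loss_fn(model(inputs), labels)
    for i, layer in enumerate(admm_layers):
        # (rho_i / 2) * ||W_i - Z_i^k + U_i^k||_F^2, differentiable and convex in W_i
        penalty = torch.norm(layer.weight - Z[i] + U[i], p="fro") ** 2
        loss = loss + (rho[i] / 2.0) * penalty
    return loss

# Usage: minimize this loss with a standard optimizer (e.g., torch.optim.Adam),
# so the per-iteration cost matches ordinary DNN training:
#   loss = subproblem1_loss(model, torch.nn.CrossEntropyLoss(), x, y, admm_layers, Z, U, rho)
#   loss.backward(); optimizer.step()
```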
The second subproblem is

    \underset{\{Z_i\}}{\text{minimize}} \;\; \sum_{i=1}^{N} g_i(Z_i) + \sum_{i=1}^{N} \frac{\rho_i}{2} \|W_i^{k+1} - Z_i + U_i^k\|_F^2.    (6)

As g_i(·) is the indicator function of S_i, the analytical solution of subproblem (6) is

    Z_i^{k+1} = \Pi_{S_i}\big(W_i^{k+1} + U_i^k\big),    (7)

where Π_{S_i}(·) is the Euclidean projection of W_i^{k+1} + U_i^k onto the set S_i. The details of the solution to this subproblem are problem-specific. For the weight pruning and quantization problems, the optimal, analytical solutions of this subproblem can be found. The derived Z_i^{k+1} will be fed into the first subproblem in the next iteration.
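The two projections admit simple closed forms: for pruning, keep the α_i largest-magnitude weights and zero out the rest; for quantization, round each weight to its nearest level. Both minimize the Frobenius distance to the constraint set. A small NumPy sketch (assumed helper names, not the paper's implementation):

```python
import numpy as np

def project_pruning(V, alpha):
    """Euclidean projection onto {at most alpha nonzeros}: keep the alpha largest magnitudes."""
    Z = np.zeros_like(V)
    flat = np.abs(V).ravel()
    keep = np.argpartition(flat, -alpha)[-alpha:]
    Z.ravel()[keep] = V.ravel()[keep]
    return Z

def project_quantization(V, levels):
    """Euclidean projection onto the set of fixed quantization values: nearest-level rounding."""
    levels = np.asarray(levels)
    idx = np.argmin(np.abs(V[..., None] - levels), axis=-1)
    return levels[idx]
```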
The intuition of ADMM is as follows. In the context of DNNs, the ADMM-based framework can be understood as a smart regularization technique. Subproblem 1 (Eqn. (5)) performs DNN training with an additional L2 regularization term, and the regularization target Z_i^k − U_i^k is dynamically updated in each iteration through solving subproblem 2. This dynamic updating process is the key difference from conventional regularization with a fixed target, and is the reason for the higher model compression performance of the framework.
REFERENCES

[7] Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (2014), IEEE Computer Society, pp. 609–622.
[8] Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127–138.
[9] Courbariaux, M., Bengio, Y., and David, J.-P. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems (2015), pp. 3123–3131.
[10] Dai, X., Yin, H., and Jha, N. K. Nest: A neural network synthesis tool based on a grow-and-prune paradigm.
[14] Ding, C., Liao, S., Wang, Y., Li, Z., Liu, N., Zhuo, Y., Wang, C., Qian, X., Bai, Y., Yuan, G., et al. Circnn: Accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (2017), ACM, pp. 395–408.
[15] Dong, X., Chen, S., and Pan, S. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems (2017), pp. 4857–4867.
[16] Du, Z., Fasthuber, R., Chen, T., Ienne, P., Li, L., Luo, T., Feng, X., Chen, Y., and Temam, O. Shidiannao: Shifting vision processing closer to the sensor. In Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on (2015), IEEE, pp. 92–104.
[17] Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep learning, vol. 1. MIT Press, Cambridge, 2016.
[18] Guo, K., Han, S., Yao, S., Wang, Y., Xie, Y., and Yang, H. Software-hardware codesign for efficient neural network acceleration. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (2017), IEEE Computer Society, pp. 18–25.
[19] Guo, Y., Yao, A., and Chen, Y. Dynamic network surgery for efficient dnns. In Advances in Neural Information Processing Systems (2016), pp. 1379–1387.
[20] Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., Xie, D., Luo, H., Yao, S., Wang, Y., et al. Ese: Efficient speech recognition engine with sparse lstm on fpga. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2017), ACM, pp. 75–84.
[21] Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. Eie: Efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on (2016), IEEE, pp. 243–254.
[22] Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR) (2016).
[23] Han, S., Pool, J., Narang, S., Mao, H., Gong, E., Tang, S., Elsen, E., Vajda, P., Paluri, M., Tran, J., et al. Dsd: Dense-sparse-dense training for deep neural networks. In International Conference on Learning Representations (ICLR) (2017).
[24] Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (2015), pp. 1135–1143.
[25] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.
[26] He, Y., Zhang, X., and Sun, J. Channel pruning for accelerating very deep neural networks. In Computer Vision (ICCV), 2017 IEEE International Conference on (2017), IEEE, pp. 1398–1406.
[27] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks. In Advances in Neural Information Processing Systems (2016), pp. 4107–4115.
[28] Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and Moshovos, A. Stripes: Bit-serial deep neural network computing. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (2016), IEEE Computer Society, pp. 1–12.
[29] Kingma, D., and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR) (2015).
[30] Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (2012), pp. 1097–1105.
[31] Kwon, H., Samajdar, A., and Krishna, T. Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (2018), ACM, pp. 461–475.
[37] Moons, B., Uytterhoeven, R., Dehaene, W., and Verhelst, M. 14.5 Envision: A 0.26-to-10 tops/w subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm fdsoi. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International (2017), IEEE, pp. 246–247.
[38] Ouyang, H., He, N., Tran, L., and Gray, A. Stochastic alternating direction method of multipliers. In International Conference on Machine Learning (2013), pp. 80–88.
[39] Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. Scnn: An accelerator for compressed-sparse convolutional neural networks. In ACM SIGARCH Computer Architecture News (2017), vol. 45, ACM, pp. 27–40.
[40] Park, E., Ahn, J., and Yoo, S. Weighted-entropy-based quantization for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 7197–7205.
[41] Qiu, J., Wang, J., Yao, S., Guo, K., Li, B., Zhou, E., Yu, J., Tang, T., Xu, N., Song, S., et al. Going deeper with embedded fpga platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2016), ACM, pp. 26–35.
[42] Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision (2016), Springer, pp. 525–542.
[43] Reagen, B., Whatmough, P., Adolf, R., Rama, S., Lee, H., Lee, S. K., Hernández-Lobato, J. M., Wei, G.-Y., and Brooks, D. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on (2016), IEEE, pp. 267–278.
[44] Sharma, H., Park, J., Mahajan, D., Amaro, E., Kim, J. K., Shao, C., Mishra, A., and Esmaeilzadeh, H. From high-level deep neural models to fpgas. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (2016), IEEE Computer Society, pp. 1–13.
[45] Sim, J., Park, J.-S., Kim, M., Bae, D., Choi, Y., and Kim, L.-S. 14.6 A 1.42 tops/w deep convolutional neural network recognition processor for intelligent ioe systems. In Solid-State Circuits Conference (ISSCC), 2016 IEEE International (2016), IEEE, pp. 264–265.
[46] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[47] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR) (2015).
[48] Song, M., Zhong, K., Zhang, J., Hu, Y., Liu, D., Zhang, W., Wang, J., and Li, T. In-situ ai: Towards autonomous and incremental deep learning for iot systems. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on (2018), IEEE, pp. 92–103.
[49] Suda, N., Chandra, V., Dasika, G., Mohanty, A., Ma, Y., Vrudhula, S., Seo, J.-s., and Cao, Y. Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2016), ACM, pp. 16–25.
[50] Suzuki, T. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In International Conference on Machine Learning (2013), pp. 392–400.
[51] Umuroglu, Y., Fraser, N. J., Gambardella, G., Blott, M., Leong, P., Jahre, M., and Vissers, K. Finn: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2017), ACM, pp. 65–74.
[52] Venkataramani, S., Ranjan, A., Banerjee, S., Das, D., Avancha, S., Jagannathan, A., Durg, A., Nagaraj, D., Kaul, B., Dubey, P., et al. Scaledeep: A scalable compute architecture for learning and evaluating deep networks. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on (2017), IEEE, pp. 13–26.
[53] Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems (2016), pp. 2074–2082.
[54] Whatmough, P. N., Lee, S. K., Lee, H., Rama, S., Brooks, D., and Wei, G.-Y. 14.3 A 28nm soc with a 1.2 ghz 568nj/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for iot applications. In Solid-State Circuits Conference (ISSCC), 2017 IEEE International (2017), IEEE, pp. 242–243.
[55] Wu, J., Leng, C., Wang, Y., Hu, Q., and Cheng, J. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 4820–4828.
[56] Yang, T.-J., Chen, Y.-H., and Sze, V. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 6071–6079.
[57] Ye, S., Zhang, T., Zhang, K., Li, J., Xie, J., Liang, Y., Liu, S., Lin, X., and Wang, Y. A unified framework of dnn weight pruning and weight clustering/quantization using admm. arXiv preprint arXiv:1811.01907 (2018).
[58] Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., and Mahlke, S. Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on (2017), IEEE, pp. 548–560.
[59] Yu, X., Liu, T., Wang, X., and Tao, D. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 7370–7379.
[60] Yuan, Z., Yue, J., Yang, H., et al. Sticker: A 0.41-62.1 tops/w 8bit neural network processor with multi-sparsity compatible convolution arrays and online tuning acceleration for fully connected layers. In 2018 IEEE Symposium on VLSI Circuits (2018), IEEE, pp. 33–34.
[61] Zhang, C., Fang, Z., Zhou, P., Pan, P., and Cong, J. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In Proceedings of the 35th International Conference on Computer-Aided Design (2016), ACM, p. 12.
[62] Zhang, C., Wu, D., Sun, J., Sun, G., Luo, G., and Cong, J. Energy-efficient cnn implementation on a deeply pipelined fpga cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design (2016), ACM, pp. 326–331.
[63] Zhang, D., Wang, H., Figueiredo, M., and Balzano, L. Learning to share: Simultaneous parameter tying and sparsification in deep learning.
[64] Zhang, T., Ye, S., Zhang, K., Tang, J., Wen, W., Fardad, M., and Wang, Y. A systematic dnn weight pruning framework using alternating direction method of multipliers. arXiv preprint arXiv:1804.03294 (2018).
[65] Zhao, R., Song, W., Zhang, W., Xing, T., Lin, J.-H., Srivastava, M., Gupta, R., and Zhang, Z. Accelerating binarized convolutional neural networks with software-programmable fpgas. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2017), ACM, pp. 15–24.
[66] Zhou, A., Yao, A., Guo, Y., Xu, L., and Chen, Y. Incremental network quantization: Towards lossless cnns with low-precision weights. In International Conference on Learning Representations (ICLR) (2017).
[67] Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternary quantization. In International Conference on Learning Representations (ICLR) (2017).