Approximate Neural Networks for Speech Applications in
Resource-Constrained Environments
by
Sairam Arunachalam
A Thesis Presented in Partial Fulfillment of the Requirements for the Degree
Master of Science
Approved May 2016 by the Graduate Supervisory Committee:
Chaitali Chakrabarti, Co-Chair
Jae-sun Seo, Co-Chair
Yu Cao
ARIZONA STATE UNIVERSITY
August 2016
ABSTRACT
Speech recognition and keyword detection are becoming increasingly popular applications for mobile systems. While deep neural network (DNN) implementations of these systems have very good performance, they also have large memory and compute resource requirements, making their implementation on a mobile device quite challenging. In this thesis, techniques to reduce the memory and computation cost of keyword detection and speech recognition DNNs are presented.
The first technique is based on representing all weights and biases by a small number of bits and mapping all nodal computations into fixed-point ones with minimal degradation in accuracy. Experiments conducted on the Resource Management (RM) database show that for the keyword detection neural network, representing the weights by 5 bits results in a 6-fold reduction in memory compared to a floating-point implementation with very little loss in performance. Similarly, for the speech recognition neural network, representing the weights by 6 bits results in a 5-fold reduction in memory while maintaining an error rate similar to a floating-point implementation. Additional reduction in memory is achieved by a technique called weight pruning, where the weights are classified as sensitive and insensitive, and the sensitive weights are represented with higher precision. A combination of these two techniques helps reduce the memory footprint by 81-84% for the speech recognition and keyword detection networks, respectively.
Further reduction in memory size is achieved by judiciously dropping connections for large blocks of weights. The corresponding technique, termed coarse-grain sparsification (CGS), introduces hardware-aware sparsity during DNN training, which leads to efficient weight memory compression and a significant reduction in the number of computations during classification without loss of accuracy. Keyword detection and speech recognition DNNs trained with 75% of the weights dropped and classified with 5-6 bit weight precision effectively reduced the weight memory requirement by ∼95% compared to a fully connected network with double precision, while showing similar performance in keyword detection accuracy and word error rate.
DEDICATION
To my family for their support
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to Dr. Chaitali Chakrabarti and Dr. Jae-sun Seo for their continuous guidance throughout my research work during the past year. I'm also grateful to Dr. Yu Cao for taking the time to review my work.
I would like to thank Dr. Mohit Shah for his support throughout my research
work. He introduced me to speech recognition tools and helped me even after his
graduation. I'm grateful to Deepak Kadetotad for his insights into my research and for his collaboration on the project.
I also thank my parents, brother, friends and roommates for supporting me through the past two years.
I also gratefully acknowledge the financial support provided by Dr. Jae-sun Seo.
Figure 4.2: Effect of block size and percentage of dropped connections (25%, 50%, 75%, 88%) on the AUC of the keyword detection network
The percentage drop of the connections is with respect to the two middle layers. We do not drop connections in the last layer since it consists of only 12 nodes and is relatively sensitive. Moreover, since the last layer contributes only 1% of the total weights in the system, any reduction in this layer does not result in a substantial reduction in the memory requirement. From Figure 4.2, we see that for the same block size, increasing the percentage of dropped connections adversely affects the AUC performance, as expected. When the drop in connections is less than 50%, there is little change in the AUC performance even when the block size is large. However, for larger drops, the AUC performance becomes sensitive to the block size. For instance, a system with 75% of its weights dropped has an AUC loss of up to 0.029 when the block size is 128×128, but a loss of only 0.015 when the block size is 64×64, and so we use this configuration in our sparse network.
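As a concrete illustration, the sketch below (Python/NumPy) shows how such a coarse-grain mask over 64×64 blocks can be generated once and applied to a weight matrix, with 75% of the blocks dropped. This is not the thesis code; the random seed, the permutation-based block selection, and the 512×512 layer shape are illustrative assumptions.

    import numpy as np

    def make_block_mask(rows, cols, block=64, drop_frac=0.75, seed=0):
        """Zero out whole block-by-block tiles of a weight matrix."""
        rng = np.random.default_rng(seed)
        n_br, n_bc = rows // block, cols // block   # grid of blocks
        n_blocks = n_br * n_bc
        # keep a random 25% of the blocks; the selection is fixed before training
        keep = rng.permutation(n_blocks) >= int(drop_frac * n_blocks)
        # expand each per-block decision into a full block-by-block tile
        return np.kron(keep.reshape(n_br, n_bc), np.ones((block, block), dtype=bool))

    # Example: a 512x512 hidden-to-hidden layer of the keyword detection network
    W = np.random.randn(512, 512).astype(np.float32)
    mask = make_block_mask(512, 512)
    W_sparse = W * mask  # the same mask is applied at every training and inference step

Because entire tiles are zeroed, only the surviving blocks and a short list of their indices need to be stored, which is what makes the resulting sparsity hardware-friendly.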
Table 4.1: AUC and memory requirements of floating and fixed point implementations for the keyword detection network

Architecture                               AUC     Memory
Floating Point                             0.945   1.81 MB
Fixed Point (Q2.2)                         0.940   290.32 KB
Proposed CGS (64×64 blocks at 75% drop)    0.910   101.26 KB

To determine the fixed-point precision of the CGS architecture for keyword detection, a procedure similar to the one described in Chapter 3 is followed. The histogram of the weights and biases is shown in Figure 4.3 and the effect of fractional precision on the average AUC is shown in Figure 4.4. From the histogram, the majority of weight and bias values lie in (-4, 4), so 2 integer bits are required. Also, the gain in performance for a fractional precision greater than 3 bits is negligible. Therefore, we represent the weights and biases in Q2.3 format (6 bits including the sign bit). The inputs and hidden layers are represented using Q2.13 (16 bits including the sign bit) and Q10.5 (15 bits, no sign bit), respectively.
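A minimal sketch of this Qm.f quantization step is given below (Python/NumPy). Rounding to nearest and saturation at the format limits are assumptions, since the excerpt does not state the exact rounding mode used.

    import numpy as np

    def quantize_q(x, int_bits, frac_bits):
        """Quantize to signed Q<int_bits>.<frac_bits>, sign bit counted separately."""
        scale = 2.0 ** frac_bits
        lo = -(2.0 ** int_bits)                # most negative representable value
        hi = (2.0 ** int_bits) - 1.0 / scale   # most positive representable value
        q = np.round(x * scale) / scale        # round to the nearest step of 2^-frac_bits
        return np.clip(q, lo, hi)              # saturate out-of-range values

    # Q2.3: range (-4, 4), resolution 1/8
    w_q = quantize_q(np.array([0.36, -2.7, 5.1]), int_bits=2, frac_bits=3)
    # -> [0.375, -2.75, 3.875]; 5.1 saturates to the largest representable value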
Table 4.1 compares the memory requirements and the performance of the system for the different neural network architectures. We see that there is only a small drop in the performance of the proposed architecture compared to both the fixed point and floating point architectures. The proposed CGS architecture requires only 101 KB of memory, compared to 1.81 MB for the floating point architecture. This scheme achieves an 18× memory compression with minimal impact on the performance of the system. The ROC curves for the three architectures, shown in Figure 4.5, are similar. From these results, we conclude that our sparsified architecture performs at a level similar to the floating point architecture, while requiring only a small fraction of the memory required by the fully connected floating point architecture.
Figure 4.3: Histogram of weights and biases (layers 1-3) of the CGS architecture with 64×64 blocks at 75% drop for keyword detection
Figure 4.4: Effect of the fractional precision of weights and biases on the average AUC of the keyword detection network for the CGS architecture
Figure 4.5: ROC curves (true positive rate vs. false alarm rate) of different deep neural network implementations for keyword detection: proposed CGS with fixed point, fully connected floating point, and fully connected fixed point
Moreover, with 75% of the weights dropped, only a quarter of the original multiply-accumulate operations remain, so we also achieve a 4× reduction in the number of computations, which further reduces the power consumption of the network.
4.4.2 Speech Recognition
Figure 4.6 shows the effect of the percentage of dropped connections on the WER of the floating point system as a function of the block size. With up to 75% of the weights dropped at all layers of the network, the performance of the system is comparable to the fully connected floating point DNN. Increasing the drop rate to 88% increases the error rate for block sizes larger than 64×64. Based on this analysis, we choose a drop rate of 75% across all layers with a block size of 64×64.
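This selection can be framed as a simple grid sweep, sketched below in Python. Here train_and_eval_wer is a hypothetical stand-in for retraining the CGS network at a given drop rate and block size and scoring it on the evaluation set, and the 0.15% WER tolerance is an assumption, not a threshold stated in the thesis.

    BASELINE_WER = 1.65  # WER (%) of the fully connected floating point DNN (Table 4.2)

    def pick_cgs_config(train_and_eval_wer, tolerance=0.15):
        """Return the most aggressive (drop rate, block size) whose WER stays near baseline."""
        best = None
        for drop in (0.50, 0.75, 0.88):
            for block in (4, 8, 16, 32, 64, 128, 256):
                wer = train_and_eval_wer(drop, block)
                # prefer higher drop rates, then larger (more hardware-friendly) blocks
                if wer - BASELINE_WER <= tolerance and (best is None or (drop, block) > best[:2]):
                    best = (drop, block, wer)
        return best  # per the analysis above, this lands on a 75% drop with 64x64 blocks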
To determine the fixed-point precision of the CGS architecture, a procedure similar to the one described in Chapter 3 is followed. The histogram of the weights and biases is shown in Figure 4.7 and the effect of fractional precision on WER is shown in Figure 4.8.
Figure 4.6: Effect of block size (4×4 to 256×256) and percentage of dropped connections (50%, 75%, 88%) on the WER (%) of the speech recognition network
From the histogram, we see that the majority of weight and bias values lie in (-1, 1), so 0 integer bits are required. Also, the gain in performance for a fractional precision greater than 4 bits is negligible. Therefore, we represent the weights and biases in Q0.4 format (5 bits including the sign bit). The inputs and hidden layers are represented using Q4.11 (16 bits including the sign bit) and Q10.5 (15 bits, no sign bit), respectively.
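To make the mixed-precision dataflow concrete, here is a sketch of one layer's nodal computation in pure integer arithmetic (Python/NumPy). This is an illustration under the precision assignments above, not the thesis's hardware implementation, and the ReLU activation is an assumption consistent with the unsigned Q10.5 hidden-layer format.

    import numpy as np

    def fixed_point_layer(x_q, w_q, b_q, x_frac=11, w_frac=4, out_frac=5):
        """One DNN layer on Q-format integers: x_q holds Q4.11 inputs,
        w_q and b_q hold Q0.4 weights and biases; returns unsigned Q10.5 activations."""
        acc = x_q.astype(np.int64) @ w_q.astype(np.int64)  # products carry 11 + 4 = 15 fractional bits
        acc += b_q.astype(np.int64) << x_frac              # align Q0.4 biases to 15 fractional bits
        acc >>= x_frac + w_frac - out_frac                 # rescale down to 5 fractional bits
        acc = np.maximum(acc, 0)                           # ReLU keeps activations non-negative
        return np.clip(acc, 0, (1 << 15) - 1)              # saturate to the 15-bit Q10.5 range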
Table 4.2 compares the performance of our system with the fully connected floating point and fixed point architectures. The sparse fixed-point DNN using the proposed technique, with up to 75% of its connections dropped, has a WER close to that of the floating point fully connected DNN. The proposed architecture requires a memory size of only 0.85 MB, compared to 19.53 MB for the fully connected floating point architecture. Thus, the sparsified fixed point network reduces the memory requirement by ∼95% with only a small drop in performance.
Figure 4.7: Histogram of weights and biases (layers 1-5) of the CGS architecture with 64×64 blocks at 75% drop
Figure 4.8: Effect of the fractional precision of weights and biases on the WER of the speech recognition system for the CGS architecture
Table 4.2: WER and memory requirements of floating and fixed-point implementations for the speech recognition network

Architecture                               WER (%)   Memory
Floating Point                             1.65      19.53 MB
Fixed Point (Q0.5)                         1.77      3.66 MB
Proposed CGS (64×64 blocks at 75% drop)    1.64      0.85 MB
Moreover, with 75% of the weights dropped, we also achieve a 4× reduction in the number of computations, which further reduces the power consumption of the network.
4.5 Conclusions
In this chapter, a block structure was described to efficiently compress the weights of a neural network with minimal degradation in performance. The proposed methodology, combined with the fixed point architecture, achieved compression rates of 18× and 23× for the keyword detection and speech recognition networks respectively, along with a 4× reduction in the number of computations.
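The headline compression rates follow almost directly from the bit widths and drop rates. The back-of-the-envelope sketch below (Python) reproduces them; it ignores the layers that are not sparsified and the small block-index overhead, which is why the measured figures are somewhat lower than the ideal ones.

    def cgs_compression(weight_bits, keep_frac, baseline_bits=32):
        """Approximate weight-memory compression of a fixed point CGS network."""
        return baseline_bits / (weight_bits * keep_frac)

    # Keyword detection: 6-bit weights (Q2.3), 75% of the middle-layer weights dropped
    print(cgs_compression(6, 0.25))  # ~21x ideal; ~18x measured (1.81 MB -> 101.26 KB)
    # Speech recognition: 5-bit weights (Q0.4), 75% dropped across all layers
    print(cgs_compression(5, 0.25))  # ~25.6x ideal; ~23x measured (19.53 MB -> 0.85 MB)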
Chapter 5
CONCLUSIONS
5.1 Summary
This thesis focused on developing techniques to reduce the memory size of deep networks, specifically feed-forward neural networks for keyword detection and speech recognition. The network for keyword detection consists of 2 hidden layers with 512 neurons per layer, while the network for speech recognition is much larger, with 4 hidden layers and 1024 neurons per layer.
First, techniques were developed to represent the weights and biases with a minimal number of bits, reducing the memory footprint while minimally affecting the detection/recognition performance. For keyword detection, where 10 keywords were selected from the RM corpus, experimental results show only a marginal loss in performance when the weights are stored in Q2.2 (5-bit) format. The total memory required in this case is approximately 290 KB (compared to 1.81 MB if the weights were represented in 32-bit floating point), making the network highly suitable for resource-constrained hardware devices. On the larger speech recognition network, the memory reduction is even more significant: the memory size dropped from 19.53 MB to 3.66 MB when the weights are represented in Q0.5 (6-bit) format. An additional 0.82%-2.12% reduction (compared to the fixed point implementation) can be obtained by representing insensitive weights with lower precision.
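That pruning step can be sketched as follows (Python/NumPy). The magnitude threshold used here to separate sensitive from insensitive weights is an assumption, since this excerpt does not restate Chapter 3's exact sensitivity criterion.

    import numpy as np

    def split_precision(w, threshold=0.5, hi_frac=5, lo_frac=3):
        """Keep large-magnitude (sensitive) weights at higher fractional precision
        and store small-magnitude (insensitive) weights with fewer bits."""
        sensitive = np.abs(w) >= threshold             # assumed magnitude criterion
        q_hi = np.round(w * 2**hi_frac) / 2**hi_frac   # e.g., Q0.5 for sensitive weights
        q_lo = np.round(w * 2**lo_frac) / 2**lo_frac   # e.g., Q0.3 for insensitive weights
        return np.where(sensitive, q_hi, q_lo)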
An even larger reduction in memory was achieved by dropping connections in blocks. Instead of reducing the precision of individual weights, here weight connections are removed from the network entirely. We show that keyword detection and speech recognition networks with 75% of their connections removed perform at a level similar to that of fully connected networks. Applying this technique to the reduced-precision fixed point implementation reduced the memory requirement by ∼95% compared to a fully connected double-precision floating point architecture. Such an architecture also reduces the number of computations by 4×. Therefore, the proposed techniques can substantially reduce the memory and power requirements of resource-constrained devices. As speech recognition becomes more mainstream, these techniques will enable the implementation of such networks on mobile and wearable devices.
5.2 Future Work
Future work in this area can be directed towards finding an optimal block selection that maximizes the accuracy of the system. Other approaches include using Convolutional Neural Networks (Abdel-Hamid et al., 2014) to analyze the input features. Recently, Recurrent Neural Networks (Graves and Jaitly, 2014) have been shown to perform comparably to DNN-HMM systems, and they have greatly simplified the speech recognition pipeline. Applying the proposed memory reduction schemes to RNNs could simplify their hardware implementations significantly.
REFERENCES
Abdel-Hamid, O., A.-R. Mohamed, H. Jiang, L. Deng, G. Penn and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, 10, 1533–1545 (2014).

Baker, J. M., L. Deng, J. Glass, S. Khudanpur, C.-H. Lee, N. Morgan and D. O'Shaughnessy, "Developments and directions in speech recognition and understanding, Part 1 [DSP education]," Signal Processing Magazine, IEEE 26, 3, 75–80 (2009).

Bourlard, H. A. and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, vol. 247 (Springer Science & Business Media, 1994).

Bradley, A. P., "The use of the area under the ROC curve in the evaluation of machine learning algorithms," Pattern Recognition 30, 7, 1145–1159 (1997).

Chen, G., C. Parada and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091 (2014).

Dahl, G. E., D. Yu, L. Deng and A. Acero, "Large vocabulary continuous speech recognition with context-dependent DBN-HMMs," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4688–4691 (2011).

Dahl, G. E., D. Yu, L. Deng and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing 20, 1, 30–42 (2012).

Deng, L. and D. Yu, "Deep learning: Methods and applications," Foundations and Trends in Signal Processing 7, 3–4, 197–387 (2014).

Gales, M. and S. Young, "The application of hidden Markov models in speech recognition," Foundations and Trends in Signal Processing 1, 3, 195–304 (2008).

Gardner, W. A., "Learning characteristics of stochastic-gradient-descent algorithms: A general study, analysis, and critique," Signal Processing 6, 2, 113–133 (1984).

Graves, A. and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772 (2014).

Hinton, G., L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," Signal Processing Magazine, IEEE 29, 6, 82–97 (2012).

Huang, X., A. Acero and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development (Prentice Hall PTR, 2001).
Juang, B.-H., S. E. Levinson and M. M. Sondhi, "Maximum likelihood estimation for multivariate mixture observations of Markov chains," IEEE Transactions on Information Theory 32, 2, 307–309 (1986).

Mamou, J., B. Ramabhadran and O. Siohan, "Vocabulary independent spoken term detection," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 615–622 (ACM, 2007).

Miao, Y., "Kaldi+PDNN: Building DNN-based ASR systems with Kaldi and PDNN," arXiv preprint arXiv:1401.6984 (2014).

Miller, D. R., M. Kleber, C.-L. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz and H. Gish, "Rapid and accurate spoken term detection," in INTERSPEECH, pp. 314–317 (2007).

Morris, A. C., V. Maier and P. D. Green, "From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition," in INTERSPEECH, pp. 2765–2768 (2004).

Ooi, W. T., C. G. Snoek, H. K. Tan, C. K. Ho, B. Huet and C.-W. Ngo, 15th Pacific Rim Conference on Advances in Multimedia Information Processing (PCM), vol. 8879 (Springer, 2014).

Parlak, S. and M. Saraclar, "Spoken term detection for Turkish broadcast news," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5244–5247 (2008).

Povey, D., A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, No. EPFL-CONF-192584 (IEEE Signal Processing Society, 2011).

Price, P., W. M. Fisher, J. Bernstein and D. S. Pallett, "The DARPA 1000-word Resource Management database for continuous speech recognition," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 651–654 (1988).

Rath, S. P., D. Povey, K. Vesely and J. Cernocky, "Improved feature processing for deep neural networks," in INTERSPEECH, pp. 109–113 (2013).

Rohlicek, J. R., W. Russell, S. Roukos and H. Gish, "Continuous hidden Markov modeling for speaker-independent word spotting," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 627–630 (1989).

Rose, R. C. and D. B. Paul, "A hidden Markov model based keyword recognition system," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 129–132 (1990).
Sha, F. and L. K. Saul, "Large margin Gaussian mixture modeling for phonetic classification and recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1 (2006).

Shah, M., J. Wang, D. Blaauw, D. Sylvester, H.-S. Kim and C. Chakrabarti, "A fixed-point neural network for keyword detection on resource constrained hardware," in IEEE Workshop on Signal Processing Systems (SiPS), pp. 1–6 (2015).

Silaghi, M.-C., "Spotting subsequences matching an HMM using the average observation probability criteria with application to keyword spotting," in Proceedings of the National Conference on Artificial Intelligence (AAAI), vol. 20, pp. 1118–1123 (AAAI Press, 2005).

Silaghi, M.-C. and H. Bourlard, "Iterative posterior-based keyword spotting without filler models," in Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, pp. 213–216 (1999).

Song, H. A. and S.-Y. Lee, "Hierarchical representation using NMF," in Neural Information Processing, pp. 466–473 (Springer, 2013).

Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research 15, 1, 1929–1958 (2014).

Su, D., X. Wu and L. Xu, "GMM-HMM acoustic model training by a two level procedure with Gaussian components determined by automatic model selection," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4890–4893 (2010).

Sutskever, I., J. Martens, G. Dahl and G. Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1139–1147 (2013).

Tomas, J., J. A. Mas and F. Casacuberta, "A quantitative method for machine translation evaluation," in Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing, pp. 27–34 (Association for Computational Linguistics, 2003).

Van Ooyen, A. and B. Nienhuis, "Improving the convergence of the back-propagation algorithm," Neural Networks 5, 3, 465–471 (1992).

Venkataramani, S., A. Ranjan, K. Roy and A. Raghunathan, "AxNN: Energy-efficient neuromorphic systems using approximate computing," in Proceedings of the 2014 International Symposium on Low Power Electronics and Design, pp. 27–32 (2014).

Wan, L., M. Zeiler, S. Zhang, Y. L. Cun and R. Fergus, "Regularization of neural networks using DropConnect," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1058–1066 (2013).
Wilpon, J., L. Miller and P. Modi, "Improvements and applications for key word recognition using hidden Markov modeling techniques," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 309–312 (1991).

Zeiler, M. D., M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean and G. Hinton, "On rectified linear units for speech processing," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3517–3521 (2013).