Inference of Long-Short Term Memory networks at software-equivalent accuracy using 2.5M analog Phase Change Memory devices

H. Tsai, S. Ambrogio, C. Mackin, P. Narayanan, R. M. Shelby, K. Rocki, A. Chen and G. W. Burr
IBM Research–Almaden, 650 Harry Road, San Jose, CA 95120, Tel: (408) 927-2073, E-mail: [email protected]

Abstract
We report accuracy for forward inference of long short-term memory (LSTM) networks using weights programmed into the conductances of >2.5M phase-change memory (PCM) devices. We demonstrate strategies for software weight-mapping and for programming of hardware analog conductances that provide accurate weight programming despite significant device variability. Inference accuracy very close to software-model baselines is achieved on several language modeling tasks.

Keywords: PCM, LSTM, forward inference, in-memory computing

Introduction
Deep neural network (DNN) computation in analog memory has made significant progress, recently achieving software-equivalent accuracy with weights stored in non-volatile memory arrays [1], [2]. Fully-connected (FC) networks are built from the FC-layers that are particularly well suited to acceleration by in-memory analog computing [3], but such networks have been mostly eclipsed in modern DNN applications. Recurrent neural networks (RNNs), however, such as Long Short-Term Memory (LSTM) networks, are widely used for language modeling, speech recognition, translation, and sequence classification, and they consist primarily of FC-layers (plus a few element-wise vector operations). Due to their recurrent nature, the analog device requirements of LSTMs are more stringent [4] than those of FC-networks. To date, experimental demonstrations of LSTMs in analog memory have been limited by hardware constraints to very small networks [5]. In this paper, we study the inference accuracy of larger LSTM networks using a mixed hardware-software experiment [1], in which synaptic weights are programmed into a 90nm-node phase-change memory (PCM) array of 4M mushroom-cell devices, while all neuron functionality is simulated in software. We use the weight-mapping scheme of Fig. 1 to map LSTM networks for language modeling tasks on two datasets, including the Penn Treebank (PTB) dataset, a widely used LSTM software benchmark.

LSTM Network and Dataset
The 2-layer LSTM network used in this paper is shown in Fig. 2. A fully-connected embedding layer converts each character or word from its 'one-hot' encoding x0(t) into an embedding vector x(t) with the same size as the hidden layer. x(t) is then passed to a 2-layer LSTM with a recurrent hidden state h(t) in each layer. The output y(t) is computed from the hidden state of the second layer, h2(t), through a fully-connected output layer. Fig. 3a shows weight distributions for software-trained models on two datasets: the book 'Alice in Wonderland' (character-based) with a smaller model (50 hidden units), and the word-based PTB dataset (hidden layer size of 200). Since the software weights span different ranges, different scaling factors are chosen to map the weights into the same 'sensitive region' of analog conductances [6]. To assess the impact of weight-programming errors, conductance values are individually measured from the hardware, and the LSTM activations are calculated in software (using the equations in Figs. 1 and 4).
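For concreteness, the sketch below shows this topology in PyTorch (one of the software-baseline frameworks used here). It is a hypothetical reimplementation for illustration, not the authors' code; all names and defaults beyond the layer sizes quoted above are assumptions.

    # Hypothetical reimplementation of the 2-layer LSTM of Fig. 2 (not the
    # authors' code); names and defaults beyond the quoted sizes are assumed.
    import torch
    import torch.nn as nn

    class TwoLayerLSTM(nn.Module):
        def __init__(self, vocab_size: int, hidden_size: int = 50):
            super().__init__()
            # FC embedding layer: maps the one-hot x0(t) to x(t),
            # with the same size as the hidden layer
            self.embed = nn.Embedding(vocab_size, hidden_size)
            # 2-layer LSTM with a recurrent hidden state h(t) in each layer
            self.lstm = nn.LSTM(hidden_size, hidden_size,
                                num_layers=2, batch_first=True)
            # FC output layer: y(t) from the second layer's hidden state h2(t)
            self.head = nn.Linear(hidden_size, vocab_size)

        def forward(self, x0, state=None):
            x = self.embed(x0)               # x(t)
            h2, state = self.lstm(x, state)  # h2(t) at every time step
            return self.head(h2), state      # y(t); softmax(y(t)) is the
                                             # next-character/word prediction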
Future in-memory analog computing circuitry will perform these multiply-accumulate operations efficiently and rapidly in the analog domain, at the weight locations; in the meantime, the expected DNN performance of such hardware can be evaluated with the present experiment, as follows. For language modeling, as each character/word is presented, the LSTM network predicts the probability, quantified by softmax(y(t)), of the next character/word in the sequence. This prediction can be compared to the actual next input vector, x0(t+1). 'Accuracy' of the model is quantified by cross-entropy loss or by perplexity, the exponential of the cross-entropy loss: lower loss (and thus lower perplexity) indicates that correct answers are being predicted with higher probability.

Programming PCM Conductances and Weights
Fig. 5 compares two strategies for efficient programming in the presence of device variability. One strategy applies SET pulses with steadily increasing compliance current to reach the target analog conductance. With this method, the PCM conductance evolves non-monotonically, decreasing sharply as each device reaches its RESET condition. In contrast, applying RESET pulses with steadily decreasing compliance current is more tolerant of device variability and offers better precision at low conductance values. Iterating these pulse sequences with different pulse widths helps further address outlying devices (Fig. 6b). Conductance programming of a single target conductance (Fig. 6) and of a simple target pattern with 32 different levels (Fig. 7) shows promising results.

Fig. 8 compares weight-programming results when using 2 and then 4 PCM devices per weight. Target weight distributions, scaled from software weights into conductance units by α = 2.5 μS, overlap (Fig. 8a,d) and strongly correlate (Fig. 8b,e) with the weights actually programmed into 2 or 4 PCM devices, with low weight-programming error (Fig. 8c,f). Successive programming of two conductance pairs with F = 4 (e,f) significantly reduces weight errors compared to a single pair (b,c). Simple target patterns (Fig. 9) also clearly show the benefits of the weight-programming strategies introduced here.
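A closed-loop, program-and-verify version of the RESET-staircase strategy of Fig. 5 can be sketched as follows. The sketch is illustrative only: the device model is a toy stand-in for the real array, and every numeric parameter is an assumption rather than a value from this paper.

    # Illustrative program-and-verify sketch of the RESET staircase with
    # decreasing compliance current; the device model is a toy stand-in,
    # and all numeric parameters are assumptions, not paper values.
    import numpy as np

    rng = np.random.default_rng(0)

    def apply_reset(dev, i_comp, width):
        """Toy device model: a stronger RESET (higher compliance current)
        leaves a larger amorphous plug, hence lower conductance.
        (Pulse width is ignored here; real kinetics are far richer.)"""
        g = 25e-6 * (1.0 - i_comp / 320e-6) + rng.normal(0.0, 0.5e-6)
        dev["g"] = max(g, 0.1e-6)

    def read_g(dev):
        return dev["g"]

    def program_conductance(dev, g_target,
                            i_start=300e-6, i_stop=50e-6, n_steps=25,
                            pulse_widths=(50e-9, 100e-9)):
        """Weaker and weaker RESET pulses walk the conductance upward
        until it crosses g_target; re-iterating the staircase with a
        different pulse width helps outlying devices (cf. Fig. 6b)."""
        for width in pulse_widths:
            for i_comp in np.linspace(i_start, i_stop, n_steps):
                apply_reset(dev, i_comp, width)
                if read_g(dev) >= g_target:      # verify after each pulse
                    return read_g(dev)
        return read_g(dev)

    # Example: tune one device toward 15 uS
    # program_conductance({"g": 0.1e-6}, 15e-6)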
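The text quotes the scaling factor (α = 2.5 μS) and the significance factor (F = 4) but not the mapping equation itself; the sketch below therefore assumes the significance-pair form αW = F(G+ - G-) + (g+ - g-) of [2], with an assumed per-device conductance ceiling. Programming the most-significant pair first and then the residue mirrors the successive two-pair programming whose error reduction appears between Fig. 8(b,c) and Fig. 8(e,f).

    # Hedged sketch of 4-device weight mapping: alpha and F are quoted in the
    # text; the pair equation follows [2], and G_MAX is an assumed ceiling.
    import numpy as np

    ALPHA = 2.5e-6   # weight-to-conductance scaling, 2.5 uS (from the text)
    F = 4            # significance factor between the two pairs (from the text)
    G_MAX = 25e-6    # assumed programmable ceiling per device (not from the paper)

    def _pair(diff):
        """Encode a signed differential on a (G+, G-) pair of devices."""
        return (diff, 0.0) if diff >= 0.0 else (0.0, -diff)

    def weight_to_targets(w):
        """Split one software weight into 4 targets (Gp, Gm, gp, gm):
        program the most-significant pair first, then the residue."""
        g = ALPHA * w
        msp = np.clip(g / F, -G_MAX, G_MAX)
        lsp = np.clip(g - F * msp, -G_MAX, G_MAX)
        return (*_pair(msp), *_pair(lsp))

    def targets_to_weight(Gp, Gm, gp, gm):
        """Read-back: reconstruct the weight from measured conductances,
        as done when replacing software weights in the experiment."""
        return (F * (Gp - Gm) + (gp - gm)) / ALPHA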
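Abstracting the residual programming errors as additive Gaussian weight noise also allows their impact to be assessed in pure software, as done for Fig. 11(c,d) below. In this sketch, evaluate() is a placeholder for a test-set cross-entropy measurement, the σ values are placeholders for measured statistics such as those of Fig. 9c, and the drift exponent is an assumed typical value, not one extracted from this paper.

    # Sketch of the Gaussian weight-noise control experiments (cf. Fig. 11c,d)
    # and of power-law conductance drift; sigma values and the drift exponent
    # nu are placeholders/assumptions for illustration.
    import math
    import torch

    def perturb(state_dict, sigma_const=0.0, sigma_rel=0.0):
        """Add constant and/or weight-dependent Gaussian noise; sigma grows
        with |w| because larger weights program less accurately (cf. Fig. 9c)."""
        return {name: w + torch.randn_like(w) * (sigma_const + sigma_rel * w.abs())
                for name, w in state_dict.items()}

    def drifted(g, t, t0=1.0, nu=0.03):
        """Standard PCM power-law drift, G(t) = G(t0) * (t / t0)**(-nu);
        nu = 0.03 is an assumed typical exponent (relevant to Fig. 11e)."""
        return g * (t / t0) ** (-nu)

    # Usage (evaluate() is a placeholder returning mean test cross-entropy):
    #   model.load_state_dict(perturb(model.state_dict(), sigma_rel=0.05))
    #   perplexity = math.exp(evaluate(model, test_data))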
Impact on LSTM Performance
Fig. 10a summarizes LSTM accuracy results for the Alice in Wonderland dataset, compared to software baselines (PyTorch & MATLAB). Customized experiments in MATLAB provided an alternate weight-set for programming, obtained by clipping weight magnitudes during training to a level that left cross-entropy loss unaffected. Mapping this reduced software-weight range better utilizes the available PCM conductance levels and reduces the relative impact of read and write noise. Cross-entropy loss after programming is thus reduced, despite the slightly higher software loss of the clipped model. Programming only the weights of the 2-layer LSTM into PCM brings cross-entropy loss even closer to baseline. Fig. 10b shows inference results for the PTB dataset, compared to the PyTorch baseline [7]. (Here, only the 2-layer LSTM fits within the available PCM array.) LSTM network and dataset sizes are summarized in Fig. 11a. Fig. 11b puts the inference perplexity on PTB in context with state-of-the-art software models [8]. For comparison with the full mixed hardware-software results (Fig. 10), inference results with added Gaussian noise, both constant (Fig. 11c) and weight-dependent (Fig. 11d, with σ values extracted from Fig. 9c), are also shown. Finally, conductance drift of the PCM devices causes LSTM accuracy to degrade over time (Fig. 11e), highlighting the importance of PCM device structures known to offer lower drift [9], [10] as well as potentially better conductance-programming accuracy [10].

Conclusions
We demonstrated analog conductance programming and the mapping of LSTM weights into PCM arrays for inference. Write noise was found to be the primary limiting factor. Training and mapping procedures were shown that help avoid the large weights most prone to programming error. Despite the non-idealities and variability of PCM arrays, inference results close to software baselines were achieved on LSTM networks of competitive size.

References
[1] G. W. Burr et al., IEDM, T29-5 (2014).
[2] S. Ambrogio et al., Nature, 558, 60-67 (2018).
[3] T. Gokmen et al., Frontiers in Neuroscience, 10, 333 (2016).
[4] T. Gokmen et al., Frontiers in Neuroscience, 12, 745 (2018).
[5] C. Li et al., Nat. Machine Intelligence, 1, 49-57 (2019).
[6] C. Mackin et al., Adv. Electr. Mater., submitted (2019).
[7] https://github.com/deeplearningathome/pytorch-language-model
[8] S. Merity et al., https://arxiv.org/abs/1708.02182 (2017).
[9] S. Kim et al., IEDM, 30-7 (2013).
[10] I. Giannopoulos et al., IEDM, 27-7 (2018).