Eidetic 3D LSTM: A Model for Video Prediction and Beyond

Yunbo Wang 1, Lu Jiang 2, Ming-Hsuan Yang 2,3, Li-Jia Li 4, Mingsheng Long 1, Li Fei-Fei 4
1 Tsinghua University, 2 Google AI, 3 University of California, Merced, 4 Stanford University

Summary

▶ We build space-time models of the world through predictive unsupervised learning.
▶ Task 1: future frame prediction. Applications: urban computing, weather forecasting, and learning the dynamics of complex environments.
▶ Task 2: early action recognition, i.e., predicting future percepts from the information available so far. We ask: can pixel-level predictive learning help percept-level tasks?
▶ Code/models available: github.com/google/e3d_lstm

Motivations

Task         | 2D CNN         | 3D CNN   | LSTM              | Conv-in-LSTM | 3DConv-in-LSTM
Video Pred.  | Mathieu ICLR16 | VideoGAN | Srivastava NIPS15 | ConvLSTM     | E3D-LSTM
Video Recog. | Two-Stream CNN | C3D, I3D | LRCN              | ConvGRU      | E3D-LSTM

▶ Pixel-level and percept-level future prediction share a common requirement → modeling both long- and short-term dependencies.
▶ Our point: jointly learn long- and short-term video representations via recurrent 3D convolutions.
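To make concrete how a 3D convolution captures short-term motion across neighboring frames, here is a minimal NumPy sketch (illustrative only, not from the paper; `conv3d_valid` and the temporal-difference kernel are our own toy constructions):

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D cross-correlation over a single-channel
    video clip of shape (T, H, W) with a kernel of shape (kt, kh, kw)."""
    kt, kh, kw = kernel.shape
    T, H, W = clip.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(clip[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out

# A temporal-difference kernel: subtracts the previous frame from the
# current one, so it responds only where something changes over time.
kernel = np.zeros((2, 3, 3))
kernel[0] = -1.0 / 9
kernel[1] = 1.0 / 9

clip = np.zeros((4, 8, 8))
clip[2:, 2:5, 2:5] = 1.0  # a square appears at frame 2 and then stays

response = conv3d_valid(clip, kernel)
print(response.shape)  # (3, 6, 6)
```

The response is zero for static windows (frames 0–1 and 2–3) and nonzero only at the transition where the square appears — the kind of short-term spatiotemporal cue a 2D convolution applied frame-by-frame cannot see.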
Modeling Short-Term Video Representations

▶ 3DConv-in-LSTM: we integrate 3D convolutions into the LSTM's recurrent transitions, which also alleviates the vanishing-gradient problem.

[Figure: three ways of combining 3D convolutions with RNNs. (a) 3D-CNN at bottom: 3D CNN encoders over frames 1:T, 2:T+1, 3:T+2 feed RNN units, followed by 2D CNN decoders (frames T+1, T+2, T+3) and a classifier. (b) 3D-CNN on top: 2D CNN encoders feed RNN units, followed by 3D CNN decoders and a classifier. (c) E3D-LSTM network: 3D CNN encoders over clips of frames τ:T+τ, 2τ:T+2τ, … feed stacked E3D-LSTM units, followed by 3D CNN decoders and a classifier.]

Modeling Long-Term Video Representations

▶ Most prior work handles long-term video relations either through the recursions of feed-forward networks (weak at learning temporal dependencies) or through the temporal state transitions of recurrent networks (which easily lead to saturated forget gates).
▶ Our point: we introduce the Recall Gate, a Transformer-like attention mechanism, into the LSTM's memory transitions, replacing the traditional forget gate.

[Figure: (d) the Spatiotemporal LSTM unit, whose memory C^k_t is controlled by a forget gate, vs. (e) the Eidetic 3D LSTM unit, whose recall gate attends (via softmax and LayerNorm) over the past cell states C^k_{t-τ:t-1}.]

R_t = σ(W_xr ∗ X_t + W_hr ∗ H^k_{t-1} + b_r)
RECALL(R_t, C^k_{t-τ:t-1}) = softmax(R_t · (C^k_{t-τ:t-1})^⊤) · C^k_{t-τ:t-1}
C^k_t = I_t ⊙ G_t + LayerNorm(C^k_{t-1} + RECALL(R_t, C^k_{t-τ:t-1}))        (1)

Moving MNIST Dataset

Model        | SSIM  | MSE
ConvLSTM     | 0.713 | 96.5
DFN          | 0.726 | 89.0
FRNN         | 0.819 | 68.4
VPN baseline | 0.870 | 64.1
PredRNN      | 0.869 | 56.5
E3D-LSTM     | 0.910 | 41.3

[Figure: (f) 10 → 10 frame prediction — inputs, ground truth, and predictions from ConvLSTM, the VPN baseline, PredRNN, PredRNN++, and ours.]
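The recall-gate update of Eq. (1) can be sketched in NumPy with flattened toy shapes (an illustration under our own simplifications: in the model, R_t and the C tensors are 5D feature maps, W_xr and W_hr are 3D convolutions, and the attention runs over all spatiotemporal positions of the past cell states):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def recall(R_t, C_past):
    """RECALL(R_t, C_{t-tau:t-1}) = softmax(R_t @ C_past.T) @ C_past.
    R_t: (n, d) recall-gate activations; C_past: (tau*n, d) past cell
    states stacked over the last tau steps (positions flattened)."""
    attn = softmax(R_t @ C_past.T, axis=-1)  # rows sum to 1
    return attn @ C_past

rng = np.random.default_rng(0)
n, d, tau = 4, 8, 3
R_t = rng.standard_normal((n, d))            # recall gate
C_past = rng.standard_normal((tau * n, d))   # C_{t-tau:t-1}
I_t = rng.standard_normal((n, d))            # input gate
G_t = rng.standard_normal((n, d))            # input modulation
C_prev = rng.standard_normal((n, d))         # C_{t-1}

# Eq. (1): the memory keeps C_{t-1} plus attended long-term recall,
# normalized, instead of being multiplied by a (saturating) forget gate.
C_t = I_t * G_t + layer_norm(C_prev + recall(R_t, C_past))
print(C_t.shape)  # (4, 8)
```

Because the past states enter additively through attention rather than through a product of forget gates, distant time steps can still contribute to C_t without their influence decaying multiplicatively.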
[Figure: (g) copy test — after replaying a prior context identical to the sequence-2 inputs, our model reproduces the correct sequence-2 predictions, whereas ConvLSTM and PredRNN++ do not; the first row of each prediction block is the expected ground truth.]

KTH Action Dataset: Video Prediction and Replay

[Figure: qualitative KTH results at t = 1, 3, 5, …, 49 — inputs, ground truth, and predictions from ConvLSTM, PredRNN++, and ours, with and without prior inputs.]

Early Action Recognition

Accuracy (%) when observing only the front 25% / 50% of each video:

Model                                             | Front 25% | Front 50%
Baseline 1: 3D-CNN at bottom                      | 10.28     | 16.05
Baseline 2: 3D-CNN on top                         | 9.63      | 14.82
Baseline 3: Ours w/o 3D convolutions              | 9.58      | 13.92
Baseline 4: Ours w/o memory attention             | 11.39     | 18.84
Trained only on the recognition task              | 13.78     | 20.91
Pre-trained on the prediction task                | 14.00     | 22.15
Trained on both tasks with a fixed loss ratio     | 13.57     | 20.46
E3D-LSTM (both tasks with a scheduled loss ratio) | 14.59     | 22.73

[Figure: example predictions after observing 0%, 25%, 50%, and 100% of each video. On "Pouring [sth.] into [sth.] until it overflows", "Trying to pour [sth.] into [sth.], but missing so it spills next to it", and "Poking a stack of [sth.] so the stack collapses", E3D-LSTM commits to the correct fine-grained label earlier than the 3D-CNN, which lingers on coarser partial-action labels.]
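The best row above trains prediction and recognition jointly with a scheduled loss ratio. A minimal sketch of such a schedule (the linear ramp below is our own illustrative assumption, not the paper's exact recipe):

```python
def scheduled_loss(pred_loss, cls_loss, step, ramp_steps=10000, max_ratio=1.0):
    """Joint objective: frame-prediction loss plus a classification loss
    whose weight ramps up linearly over the first `ramp_steps` updates.
    The schedule shape and hyperparameters are illustrative assumptions."""
    ratio = max_ratio * min(1.0, step / ramp_steps)
    return pred_loss + ratio * cls_loss

# Early in training the model focuses on pixel-level prediction...
print(scheduled_loss(1.0, 2.0, step=0))      # 1.0
# ...and the recognition loss is weighted in gradually.
print(scheduled_loss(1.0, 2.0, step=10000))  # 3.0
```

The intuition, consistent with the table, is that the predictive task first shapes the shared video representation, and the percept-level task is then emphasized once that representation is useful.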