Learning Multimodal Representations with Factorized Deep Generative Models

Yao-Hung Hubert Tsai*†, Paul Pu Liang*†, Amir Zadeh‡, Louis-Philippe Morency‡, and Ruslan Salakhutdinov†
†Machine Learning Department, ‡Language Technologies Institute, Carnegie Mellon University (*equal contribution)
{yaohungt,pliang,abagherz,morency,rsalakhu}@cs.cmu.edu

Multimodal Factorization Model

• Bayesian Network
  – Generative network and inference network (see poster diagram).
• Notation
  – X_{1:M}: multimodal data from M modalities; Y: labels
  – X̂_{1:M}: generated multimodal data; Ŷ: generated labels
  – Z_y: latent variable for the multimodal discriminative factor F_y; Z_{a_i}: modality-specific latent variables for the generative factors F_{a_i}
• Summary
  – Joint generative-discriminative objective for multimodal data.
  – Factorizes the representation into independent sets of factors:
    * multimodal discriminative factors
    * modality-specific generative factors
• Neural Architecture
  – The encoder Q(Z_y | X_{1:M}) can be parametrized by any model that performs multimodal fusion.
• Contributions
  – State-of-the-art performance on six multimodal datasets.
  – Flexible generation controlled by the independent factors.
  – Reconstruction of missing modalities.
  – Interpretation of multimodal interactions.

Generation, Inference, and Learning

• Generation
  – Factorization of the joint distribution:

    P(\hat{X}_{1:M}, \hat{Y})
      = \int_{F,Z} P(\hat{X}_{1:M}, \hat{Y} \mid F) \, P(F \mid Z) \, P(Z) \, dF \, dZ
      = \int_{F_y, F_{a_{1:M}}} \int_{Z_y, Z_{a_{1:M}}}
          \Big[ P(\hat{Y} \mid F_y) \prod_{i=1}^{M} P(\hat{X}_i \mid F_{a_i}, F_y) \Big]
          \Big[ P(F_y \mid Z_y) \prod_{i=1}^{M} P(F_{a_i} \mid Z_{a_i}) \Big]
          \Big[ P(Z_y) \prod_{i=1}^{M} P(Z_{a_i}) \Big] \, dF \, dZ,

    with dF = dF_y \prod_{i=1}^{M} dF_{a_i} and dZ = dZ_y \prod_{i=1}^{M} dZ_{a_i}.
• Inference: Joint-Distribution Wasserstein Distance
  – Approximation, since exact inference is intractable.
  – Proposition 1: For any functions G_y : Z_y → F_y, G_{a_{1:M}} : Z_{a_{1:M}} → F_{a_{1:M}}, D : F_y → Ŷ, and F_{1:M} : (F_{a_{1:M}}, F_y) → X̂_{1:M}, the joint-distribution Wasserstein distance is

    W_c(P_{X_{1:M}, Y}, P_{\hat{X}_{1:M}, \hat{Y}})
      = \inf_{Q_Z = P_Z} \mathbb{E}_{P_{X_{1:M}, Y}} \mathbb{E}_{Q(Z \mid X_{1:M}, Y)}
        \Big[ \sum_{i=1}^{M} c_{X_i}\big(X_i, F_i(G_{a_i}(Z_{a_i}), G_y(Z_y))\big)
              + c_Y\big(Y, D(G_y(Z_y))\big) \Big],

    where P_Z is the prior over Z = [Z_y, Z_{a_{1:M}}] and Q_Z is the aggregated posterior of the proposed approximate inference distribution Q(Z | X_{1:M}, Y).
  – Generalized mean-field assumption:

    Q(Z \mid X_{1:M}, Y) := Q(Z \mid X_{1:M}) := Q(Z_y \mid X_{1:M}) \prod_{i=1}^{M} Q(Z_{a_i} \mid X_i).
• Relaxed Objective (optimized in practice; see the sketch after this section)

    \min_{F, G_{a_{1:M}}, G_y, D} \; \inf_{Q(Z \mid \cdot) \in \mathcal{Q}}
      \mathbb{E}_{P_{X_{1:M}, Y}} \mathbb{E}_{Q(Z_{a_1} \mid X_1)} \cdots \mathbb{E}_{Q(Z_{a_M} \mid X_M)} \mathbb{E}_{Q(Z_y \mid X_{1:M})}
        \Big[ \sum_{i=1}^{M} c_{X_i}\big(X_i, F_i(G_{a_i}(Z_{a_i}), G_y(Z_y))\big)
              + c_Y\big(Y, D(G_y(Z_y))\big) \Big]
      + \lambda \, \mathrm{MMD}(Q_Z, P_Z),

    with P_Z the centered isotropic Gaussian N(0, I) and Z = [Z_y, Z_{a_{1:M}}].
• Surrogate Inference for Missing Modalities
  – Φ: surrogate inference network (e.g., when modality X_1 is missing at test time):

    \Phi^* = \arg\min_{\Phi} \mathbb{E}_{P_{X_{2:M}, \hat{X}_1}} \big[ -\log P_\Phi(\hat{X}_1 \mid X_{2:M}) \big],
    with P_\Phi(\hat{X}_1 \mid X_{2:M}) := \int P(\hat{X}_1 \mid Z_{a_1}, Z_y) \, Q_\Phi(Z_{a_1} \mid X_{2:M}) \, Q_\Phi(Z_y \mid X_{2:M}) \, dZ_{a_1} \, dZ_y.

  – Deterministic mappings are used in Q_Φ(·|·).
  – The label is predicted analogously: P_\Phi(\hat{Y} \mid X_{2:M}) := \int P(\hat{Y} \mid Z_y) \, Q_\Phi(Z_y \mid X_{2:M}) \, dZ_y.
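The sketch below illustrates the structure of the relaxed objective for M = 2 modalities. It is a minimal, hypothetical PyTorch example: the MLP encoders/decoders, feature sizes, concatenation fusion, deterministic inference mappings, squared-error costs, and the RBF-kernel MMD penalty are assumptions for illustration, not the architecture or hyperparameters used in the paper.

import torch
import torch.nn as nn

def mlp(d_in, d_out, d_hid=64):
    return nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(), nn.Linear(d_hid, d_out))

class MFM(nn.Module):
    def __init__(self, d_x=(32, 48), d_y=1, d_z=16, d_f=16):
        super().__init__()
        # Inference networks Q(Z_{a_i} | X_i) and Q(Z_y | X_{1:M}); the fusion
        # encoder here is simple concatenation, but any fusion model could be used.
        self.enc_a = nn.ModuleList([mlp(d, d_z) for d in d_x])
        self.enc_y = mlp(sum(d_x), d_z)
        # Factor maps G_{a_i}: Z_{a_i} -> F_{a_i} and G_y: Z_y -> F_y.
        self.G_a = nn.ModuleList([mlp(d_z, d_f) for _ in d_x])
        self.G_y = mlp(d_z, d_f)
        # Decoders F_i: (F_{a_i}, F_y) -> X_i and label predictor D: F_y -> Y.
        self.dec_x = nn.ModuleList([mlp(2 * d_f, d) for d in d_x])
        self.dec_y = mlp(d_f, d_y)

    def forward(self, xs):
        z_a = [enc(x) for enc, x in zip(self.enc_a, xs)]
        z_y = self.enc_y(torch.cat(xs, dim=-1))
        f_a = [G(z) for G, z in zip(self.G_a, z_a)]
        f_y = self.G_y(z_y)
        x_hat = [dec(torch.cat([fa, f_y], dim=-1)) for dec, fa in zip(self.dec_x, f_a)]
        return x_hat, self.dec_y(f_y), torch.cat(z_a + [z_y], dim=-1)

def mmd_rbf(z_q, z_p, sigma=1.0):
    # MMD(Q_Z, P_Z) between aggregated-posterior samples z_q and prior samples z_p.
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(z_q, z_q).mean() + k(z_p, z_p).mean() - 2 * k(z_q, z_p).mean()

# One training step on random placeholder data (batch of 8).
model = MFM()
x1, x2, y = torch.randn(8, 32), torch.randn(8, 48), torch.randn(8, 1)
x_hat, y_hat, z_q = model([x1, x2])
recon = sum(nn.functional.mse_loss(xh, x) for xh, x in zip(x_hat, [x1, x2]))  # sum_i c_{X_i}
pred = nn.functional.mse_loss(y_hat, y)                                       # c_Y
loss = recon + pred + 0.1 * mmd_rbf(z_q, torch.randn_like(z_q))               # + lambda * MMD
loss.backward()

Replacing a KL term with an MMD penalty between the aggregated posterior and the N(0, I) prior follows the Wasserstein-autoencoder-style relaxation stated above; the weight 0.1 and the kernel bandwidth here are arbitrary placeholders.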
Controllable Generation

• Digits Dataset
  – Handwritten digits (MNIST [3]) + Street View House Numbers (SVHN [5]).
• Results
  – Qualitative generations controlled by the factorized factors (see poster figure).

Multimodal Time Series Datasets

• Datasets in Human Multimodal Language
  – Multimodal personality trait recognition
    * Movie reviews (POM [6])
  – Multimodal sentiment analysis
    * Monologue opinion videos (CMU-MOSI [10])
    * Online social reviews (ICT-MMMO [8])
    * Product reviews and opinions (MOUD [7] / YouTube [4])
  – Multimodal emotion recognition
    * Recorded dyadic dialogues (IEMOCAP [1])
• Multimodal Features
  – Language: pre-trained GloVe word embeddings
  – Visual: facial action units from Facet
  – Acoustic: MFCCs from COVAREP
  – Modalities aligned with P2FA
• Results (SOTA1/SOTA2 are the best and second-best prior methods; superscripts distinguish the prior methods, legend on the poster)

  POM personality traits (metric: Pearson r)
  Trait   Con     Pas     Voi     Dom     Cre     Viv     Exp     Ent     Res     Tru     Rel     Out     Tho     Ner     Per     Hum
  SOTA2   0.359†  0.425†  0.166‡  0.235‡  0.358†  0.417†  0.450†  0.378‡  0.295   0.237   0.215‡  0.238   0.363†  0.258   0.344†  0.319†
  SOTA1   0.395#  0.428#  0.193#  0.313#  0.367#  0.431#  0.452#  0.395#  0.333#  0.296#  0.255#  0.259#  0.381#  0.318#  0.377#  0.386#
  MFM     0.431   0.450   0.197   0.411   0.380   0.448   0.467   0.452   0.368   0.212   0.309   0.333   0.404   0.333   0.334   0.408

  Sentiment analysis
  Dataset   CMU-MOSI                                  ICT-MMMO        YouTube         MOUD
  Metric    Acc7    Acc2    F1      MAE     r         Acc2    F1      Acc3    F1      Acc2    F1
  SOTA2     34.1#   77.1‡   77.0‡   0.968‡  0.625‡    72.5*   72.6*   48.3‡   45.1†   81.1#   80.4#
  SOTA1     34.7‡   77.4#   77.3#   0.965#  0.632#    73.8#   73.1#   51.7#   51.6#   81.1‡   81.2‡
  MFM       36.2    78.1    78.1    0.951   0.662     81.3    79.2    53.3    52.4    82.1    81.7

  IEMOCAP emotion recognition
  Emotion   Happy           Sad             Angry           Frustrated      Excited         Neutral
  Metric    Acc2    F1      Acc2    F1      Acc2    F1      Acc2    F1      Acc2    F1      Acc2    F1
  SOTA2     86.7‡   84.2§   83.4*   81.7†   85.1    84.5§   79.5‡   76.6‡   89.6‡   86.3#   68.8§   67.1§
  SOTA1     90.1#   85.3#   85.8#   82.8*   87.0#   86.0#   80.3#   76.8#   89.8#   87.1‡   71.8#   68.5§
  MFM       90.2    85.8    88.4    86.1    87.5    86.7    80.4    74.5    90.0    87.1    72.1    68.1

• Ablation Study
  – On CMU-MOSI (see poster figure).
• Missing Modalities
  – On CMU-MOSI, with the missing modality handled by the surrogate inference network:

  Task                 X̂ reconstruction               Ŷ prediction
  Metric               MSE(ℓ)   MSE(a)   MSE(v)       Acc7   Acc2   F1     MAE    r
  Purely generative and discriminative baselines
  ℓ(anguage) missing   0.0411   -        -            19.4   59.6   59.7   1.386  0.225
  a(udio) missing      -        0.0533   -            34.0   73.5   73.4   1.024  0.615
  v(isual) missing     -        -        0.0220       33.7   75.4   75.4   0.996  0.634
  Multimodal Factorization Model (MFM)
  ℓ(anguage) missing   0.0403   -        -            21.7   62.0   61.7   1.313  0.236
  a(udio) missing      -        0.0468   -            35.4   74.3   74.3   1.011  0.603
  v(isual) missing     -        -        0.0215       35.0   76.4   76.3   0.990  0.635
  all present          0.0391   0.0384   0.0182       36.2   78.1   78.1   0.951  0.662

Analyzing Multimodal Representations

• Information-Based Interpretation
  – Analysis of overall trends.
  – Hilbert-Schmidt Independence Criterion [2, 9] (a minimal sketch follows this section):

    MI(F_\cdot, \hat{X}_i) = \mathrm{HSIC}_{\mathrm{norm}}(F_\cdot, \hat{X}_i)
      = \frac{\mathrm{tr}(K_{F_\cdot} H K_{\hat{X}_i} H)}{\|H K_{F_\cdot} H\|_F \, \|H K_{\hat{X}_i} H\|_F},

  – Normalized ratios r_i = MI(F_y, \hat{X}_i) / MI(F_{a_i}, \hat{X}_i):

    Ratio       r_ℓ     r_v     r_a
    CMU-MOSI    0.307   0.030   0.107

• Gradient-Based Interpretation
  – Fine-grained analysis.
  – Generated data \hat{x}_i = [\hat{x}_i^1, \ldots, \hat{x}_i^t, \ldots, \hat{x}_i^T], where

    \hat{x}_i = F_i(f_{a_i}, f_y), \quad f_{a_i} = G_{a_i}(z_{a_i}), \quad f_y = G_y(z_y),
    \quad z_{a_i} \sim Q(Z_{a_i} \mid X_i = x_i), \quad z_y \sim Q(Z_y \mid X_{1:M} = x_{1:M}).

  – Gradient flow:

    \nabla_{f_y}(\hat{x}_i) := [\, \|\nabla_{f_y} \hat{x}_i^1\|_F^2, \; \|\nabla_{f_y} \hat{x}_i^2\|_F^2, \; \ldots, \; \|\nabla_{f_y} \hat{x}_i^T\|_F^2 \,].

  – Example (poster figure): gradient flow for the utterance "Umm, in a way, a lot of the themes in 'never let me go', which were very profound and deep" across the language, visual, and acoustic modalities, annotated with cues such as hesitancy, emphasis, a neutral tone, a slight smile, and uninformative segments over the utterance's time steps.
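As referenced above, here is a minimal sketch of the normalized HSIC score used for the information-based interpretation. The RBF kernel, its bandwidth, and the sample shapes are illustrative assumptions rather than the exact setup in the paper.

import torch

def rbf_kernel(a, sigma=1.0):
    # Pairwise RBF (Gaussian) kernel matrix over the rows of a.
    return torch.exp(-torch.cdist(a, a).pow(2) / (2 * sigma ** 2))

def normalized_hsic(f, x):
    # MI(F, X_hat) = tr(K_F H K_X H) / (||H K_F H||_F * ||H K_X H||_F),
    # where f holds n factor samples and x the corresponding generated data.
    n = f.shape[0]
    H = torch.eye(n) - torch.ones(n, n) / n          # centering matrix
    K_f, K_x = rbf_kernel(f), rbf_kernel(x)
    num = torch.trace(K_f @ H @ K_x @ H)
    return num / ((H @ K_f @ H).norm() * (H @ K_x @ H).norm())

# Ratio r_i = MI(F_y, X_hat_i) / MI(F_{a_i}, X_hat_i) on hypothetical factor samples.
f_y, f_a, x_hat = torch.randn(64, 16), torch.randn(64, 16), torch.randn(64, 32)
r_i = normalized_hsic(f_y, x_hat) / normalized_hsic(f_a, x_hat)
print(float(r_i))

By the definition of r_i, a ratio below 1 (as in the CMU-MOSI row above) indicates that the generated modality depends more strongly on its modality-specific factor F_{a_i} than on the multimodal discriminative factor F_y.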
References

[1] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation, 2008.
[2] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pages 63–77. Springer, 2005.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278–2324, 1998.
[4] L.-P. Morency, R. Mihalcea, and P. Doshi. Towards multimodal sentiment analysis: Harvesting opinions from the web. In ICMI, pages 169–176. ACM, 2011.
[5] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[6] S. Park, H. S. Shim, M. Chatterjee, K. Sagae, and L.-P. Morency. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In ICMI, 2014.
[7] V. Perez-Rosas, R. Mihalcea, and L.-P. Morency. Utterance-level multimodal sentiment analysis. In Association for Computational Linguistics (ACL), 2013.
[8] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L.-P. Morency. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53, 2013.
[9] D. Wu, Y. Zhao, Y.-H. H. Tsai, M. Yamada, and R. Salakhutdinov. "Dependency bottleneck" in auto-encoding architectures: An empirical study. arXiv preprint arXiv:1802.05408, 2018.
[10] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems, 2016.