Learning Multimodal Representations with Factorized Deep Generative Models

Yao-Hung Hubert Tsai*†, Paul Pu Liang*†, Amir Zadeh‡, Louis-Philippe Morency‡, and Ruslan Salakhutdinov†
†Machine Learning Department, ‡Language Technologies Institute, Carnegie Mellon University (*equal contribution)
{yaohungt,pliang,abagherz,morency,rsalakhu}@cs.cmu.edu

Multimodal Factorization Model

• Bayesian Network
  – Generative network and inference network (see poster diagram).
• Notation
  – X_{1:M}: multimodal data from M modalities; Y: labels
  – X̂_{1:M}: generated multimodal data; Ŷ: generated labels
  – Z_y: latent variable for the multimodal discriminative factor F_y; Z_{a_i}: modality-specific latent variables for the generative factors F_{a_i}
• Summary
  – Joint generative-discriminative objective for multimodal data.
  – Factorizes the representation into independent sets of factors:
    * multimodal discriminative factors
    * modality-specific generative factors
• Neural Architecture
  – The encoder Q(Z_y | X_{1:M}) can be parametrized by any model that performs multimodal fusion.
• Contributions
  – State-of-the-art performance on six multimodal datasets.
  – Flexible generation controlled by the independent factors.
  – Reconstruction of missing modalities.
  – Interpretation of multimodal interactions.

Generation, Inference, and Learning

• Generation
  – Factorization of the joint distribution:

    P(\hat{X}_{1:M}, \hat{Y})
      = \int_{F,Z} P(\hat{X}_{1:M}, \hat{Y} \mid F) \, P(F \mid Z) \, P(Z) \, dF \, dZ
      = \int_{F_y, F_{a_{1:M}}} \int_{Z_y, Z_{a_{1:M}}}
          \Big[ P(\hat{Y} \mid F_y) \prod_{i=1}^{M} P(\hat{X}_i \mid F_{a_i}, F_y) \Big]
          \Big[ P(F_y \mid Z_y) \prod_{i=1}^{M} P(F_{a_i} \mid Z_{a_i}) \Big]
          \Big[ P(Z_y) \prod_{i=1}^{M} P(Z_{a_i}) \Big] \, dF \, dZ,

    with dF = dF_y \prod_{i=1}^{M} dF_{a_i} and dZ = dZ_y \prod_{i=1}^{M} dZ_{a_i}.
• Inference: Joint-Distribution Wasserstein Distance
  – Approximation, since exact inference is intractable.
  – Proposition 1: For any functions G_y : Z_y → F_y, G_{a_{1:M}} : Z_{a_{1:M}} → F_{a_{1:M}}, D : F_y → Ŷ, and F_{1:M} : (F_{a_{1:M}}, F_y) → X̂_{1:M}, the joint-distribution Wasserstein distance is

    W_c(P_{X_{1:M}, Y}, P_{\hat{X}_{1:M}, \hat{Y}})
      = \inf_{Q_Z = P_Z} \mathbb{E}_{P_{X_{1:M}, Y}} \mathbb{E}_{Q(Z \mid X_{1:M}, Y)}
        \Big[ \sum_{i=1}^{M} c_{X_i}\big(X_i, F_i(G_{a_i}(Z_{a_i}), G_y(Z_y))\big)
              + c_Y\big(Y, D(G_y(Z_y))\big) \Big],

    where P_Z is the prior over Z = [Z_y, Z_{a_{1:M}}] and Q_Z is the aggregated posterior of the proposed approximate inference distribution Q(Z | X_{1:M}, Y).
  – Generalized mean-field assumption:

    Q(Z \mid X_{1:M}, Y) := Q(Z \mid X_{1:M}) := Q(Z_y \mid X_{1:M}) \prod_{i=1}^{M} Q(Z_{a_i} \mid X_i).
• Relaxed Objective (optimized in practice; see the sketch after this section)

    \min_{F, G_{a_{1:M}}, G_y, D} \; \inf_{Q(Z \mid \cdot) \in \mathcal{Q}}
      \mathbb{E}_{P_{X_{1:M}, Y}} \mathbb{E}_{Q(Z_{a_1} \mid X_1)} \cdots \mathbb{E}_{Q(Z_{a_M} \mid X_M)} \mathbb{E}_{Q(Z_y \mid X_{1:M})}
        \Big[ \sum_{i=1}^{M} c_{X_i}\big(X_i, F_i(G_{a_i}(Z_{a_i}), G_y(Z_y))\big)
              + c_Y\big(Y, D(G_y(Z_y))\big) \Big]
      + \lambda \, \mathrm{MMD}(Q_Z, P_Z),

    with P_Z the centered isotropic Gaussian N(0, I) and Z = [Z_y, Z_{a_{1:M}}].
• Surrogate Inference for Missing Modalities
  – Φ: surrogate inference network (e.g., when modality X_1 is missing at test time):

    \Phi^* = \arg\min_{\Phi} \mathbb{E}_{P_{X_{2:M}, \hat{X}_1}} \big[ -\log P_\Phi(\hat{X}_1 \mid X_{2:M}) \big],
    with P_\Phi(\hat{X}_1 \mid X_{2:M}) := \int P(\hat{X}_1 \mid Z_{a_1}, Z_y) \, Q_\Phi(Z_{a_1} \mid X_{2:M}) \, Q_\Phi(Z_y \mid X_{2:M}) \, dZ_{a_1} \, dZ_y.

  – Deterministic mappings are used in Q_Φ(·|·).
  – The label is predicted analogously: P_\Phi(\hat{Y} \mid X_{2:M}) := \int P(\hat{Y} \mid Z_y) \, Q_\Phi(Z_y \mid X_{2:M}) \, dZ_y.
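The sketch below illustrates the structure of the relaxed objective for M = 2 modalities. It is a minimal, hypothetical PyTorch example: the MLP encoders/decoders, feature sizes, concatenation fusion, deterministic inference mappings, squared-error costs, and the RBF-kernel MMD penalty are assumptions for illustration, not the architecture or hyperparameters used in the paper.

import torch
import torch.nn as nn

def mlp(d_in, d_out, d_hid=64):
    return nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(), nn.Linear(d_hid, d_out))

class MFM(nn.Module):
    def __init__(self, d_x=(32, 48), d_y=1, d_z=16, d_f=16):
        super().__init__()
        # Inference networks Q(Z_{a_i} | X_i) and Q(Z_y | X_{1:M}); the fusion
        # encoder here is simple concatenation, but any fusion model could be used.
        self.enc_a = nn.ModuleList([mlp(d, d_z) for d in d_x])
        self.enc_y = mlp(sum(d_x), d_z)
        # Factor maps G_{a_i}: Z_{a_i} -> F_{a_i} and G_y: Z_y -> F_y.
        self.G_a = nn.ModuleList([mlp(d_z, d_f) for _ in d_x])
        self.G_y = mlp(d_z, d_f)
        # Decoders F_i: (F_{a_i}, F_y) -> X_i and label predictor D: F_y -> Y.
        self.dec_x = nn.ModuleList([mlp(2 * d_f, d) for d in d_x])
        self.dec_y = mlp(d_f, d_y)

    def forward(self, xs):
        z_a = [enc(x) for enc, x in zip(self.enc_a, xs)]
        z_y = self.enc_y(torch.cat(xs, dim=-1))
        f_a = [G(z) for G, z in zip(self.G_a, z_a)]
        f_y = self.G_y(z_y)
        x_hat = [dec(torch.cat([fa, f_y], dim=-1)) for dec, fa in zip(self.dec_x, f_a)]
        return x_hat, self.dec_y(f_y), torch.cat(z_a + [z_y], dim=-1)

def mmd_rbf(z_q, z_p, sigma=1.0):
    # MMD(Q_Z, P_Z) between aggregated-posterior samples z_q and prior samples z_p.
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(z_q, z_q).mean() + k(z_p, z_p).mean() - 2 * k(z_q, z_p).mean()

# One training step on random placeholder data (batch of 8).
model = MFM()
x1, x2, y = torch.randn(8, 32), torch.randn(8, 48), torch.randn(8, 1)
x_hat, y_hat, z_q = model([x1, x2])
recon = sum(nn.functional.mse_loss(xh, x) for xh, x in zip(x_hat, [x1, x2]))  # sum_i c_{X_i}
pred = nn.functional.mse_loss(y_hat, y)                                       # c_Y
loss = recon + pred + 0.1 * mmd_rbf(z_q, torch.randn_like(z_q))               # + lambda * MMD
loss.backward()

Replacing a KL term with an MMD penalty between the aggregated posterior and the N(0, I) prior follows the Wasserstein-autoencoder-style relaxation stated above; the weight 0.1 and the kernel bandwidth here are arbitrary placeholders.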
Controllable Generation

• Digits Dataset
  – Handwritten digits (MNIST [3]) + Street View House Numbers (SVHN [5]).
• Results
  – Qualitative generations controlled by the factorized factors (see poster figure).

Multimodal Time Series Datasets

• Datasets in Human Multimodal Language
  – Multimodal personality trait recognition
    * Movie reviews (POM [6])
  – Multimodal sentiment analysis
    * Monologue opinion videos (CMU-MOSI [10])
    * Online social reviews (ICT-MMMO [8])
    * Product reviews and opinions (MOUD [7] / YouTube [4])
  – Multimodal emotion recognition
    * Recorded dyadic dialogues (IEMOCAP [1])
• Multimodal Features
  – Language: pre-trained GloVe word embeddings
  – Visual: facial action units from Facet
  – Acoustic: MFCCs from COVAREP
  – Modalities aligned with P2FA
• Results (SOTA1/SOTA2 are the best and second-best prior methods; superscripts distinguish the prior methods, legend on the poster)

  POM personality traits (metric: Pearson r)
  Trait   Con     Pas     Voi     Dom     Cre     Viv     Exp     Ent     Res     Tru     Rel     Out     Tho     Ner     Per     Hum
  SOTA2   0.359†  0.425†  0.166‡  0.235‡  0.358†  0.417†  0.450†  0.378‡  0.295   0.237   0.215‡  0.238   0.363†  0.258   0.344†  0.319†
  SOTA1   0.395#  0.428#  0.193#  0.313#  0.367#  0.431#  0.452#  0.395#  0.333#  0.296#  0.255#  0.259#  0.381#  0.318#  0.377#  0.386#
  MFM     0.431   0.450   0.197   0.411   0.380   0.448   0.467   0.452   0.368   0.212   0.309   0.333   0.404   0.333   0.334   0.408

  Sentiment analysis
  Dataset   CMU-MOSI                                  ICT-MMMO        YouTube         MOUD
  Metric    Acc7    Acc2    F1      MAE     r         Acc2    F1      Acc3    F1      Acc2    F1
  SOTA2     34.1#   77.1‡   77.0‡   0.968‡  0.625‡    72.5*   72.6*   48.3‡   45.1†   81.1#   80.4#
  SOTA1     34.7‡   77.4#   77.3#   0.965#  0.632#    73.8#   73.1#   51.7#   51.6#   81.1‡   81.2‡
  MFM       36.2    78.1    78.1    0.951   0.662     81.3    79.2    53.3    52.4    82.1    81.7

  IEMOCAP emotion recognition
  Emotion   Happy           Sad             Angry           Frustrated      Excited         Neutral
  Metric    Acc2    F1      Acc2    F1      Acc2    F1      Acc2    F1      Acc2    F1      Acc2    F1
  SOTA2     86.7‡   84.2§   83.4*   81.7†   85.1    84.5§   79.5‡   76.6‡   89.6‡   86.3#   68.8§   67.1§
  SOTA1     90.1#   85.3#   85.8#   82.8*   87.0#   86.0#   80.3#   76.8#   89.8#   87.1‡   71.8#   68.5§
  MFM       90.2    85.8    88.4    86.1    87.5    86.7    80.4    74.5    90.0    87.1    72.1    68.1

• Ablation Study
  – On CMU-MOSI (see poster figure).
• Missing Modalities
  – On CMU-MOSI, with the missing modality handled by the surrogate inference network:

  Task                 X̂ reconstruction               Ŷ prediction
  Metric               MSE(ℓ)   MSE(a)   MSE(v)       Acc7   Acc2   F1     MAE    r
  Purely generative and discriminative baselines
  ℓ(anguage) missing   0.0411   -        -            19.4   59.6   59.7   1.386  0.225
  a(udio) missing      -        0.0533   -            34.0   73.5   73.4   1.024  0.615
  v(isual) missing     -        -        0.0220       33.7   75.4   75.4   0.996  0.634
  Multimodal Factorization Model (MFM)
  ℓ(anguage) missing   0.0403   -        -            21.7   62.0   61.7   1.313  0.236
  a(udio) missing      -        0.0468   -            35.4   74.3   74.3   1.011  0.603
  v(isual) missing     -        -        0.0215       35.0   76.4   76.3   0.990  0.635
  all present          0.0391   0.0384   0.0182       36.2   78.1   78.1   0.951  0.662

Analyzing Multimodal Representations

• Information-Based Interpretation
  – Analysis of overall trends.
  – Hilbert-Schmidt Independence Criterion [2, 9] (a minimal sketch follows this section):

    MI(F_\cdot, \hat{X}_i) = \mathrm{HSIC}_{\mathrm{norm}}(F_\cdot, \hat{X}_i)
      = \frac{\mathrm{tr}(K_{F_\cdot} H K_{\hat{X}_i} H)}{\|H K_{F_\cdot} H\|_F \, \|H K_{\hat{X}_i} H\|_F},

  – Normalized ratios r_i = MI(F_y, \hat{X}_i) / MI(F_{a_i}, \hat{X}_i):

    Ratio       r_ℓ     r_v     r_a
    CMU-MOSI    0.307   0.030   0.107

• Gradient-Based Interpretation
  – Fine-grained analysis.
  – Generated data \hat{x}_i = [\hat{x}_i^1, \ldots, \hat{x}_i^t, \ldots, \hat{x}_i^T], where

    \hat{x}_i = F_i(f_{a_i}, f_y), \quad f_{a_i} = G_{a_i}(z_{a_i}), \quad f_y = G_y(z_y),
    \quad z_{a_i} \sim Q(Z_{a_i} \mid X_i = x_i), \quad z_y \sim Q(Z_y \mid X_{1:M} = x_{1:M}).

  – Gradient flow:

    \nabla_{f_y}(\hat{x}_i) := [\, \|\nabla_{f_y} \hat{x}_i^1\|_F^2, \; \|\nabla_{f_y} \hat{x}_i^2\|_F^2, \; \ldots, \; \|\nabla_{f_y} \hat{x}_i^T\|_F^2 \,].

  – Example (poster figure): gradient flow for the utterance "Umm, in a way, a lot of the themes in 'never let me go', which were very profound and deep" across the language, visual, and acoustic modalities, annotated with cues such as hesitancy, emphasis, a neutral tone, a slight smile, and uninformative segments over the utterance's time steps.
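As referenced above, here is a minimal sketch of the normalized HSIC score used for the information-based interpretation. The RBF kernel, its bandwidth, and the sample shapes are illustrative assumptions rather than the exact setup in the paper.

import torch

def rbf_kernel(a, sigma=1.0):
    # Pairwise RBF (Gaussian) kernel matrix over the rows of a.
    return torch.exp(-torch.cdist(a, a).pow(2) / (2 * sigma ** 2))

def normalized_hsic(f, x):
    # MI(F, X_hat) = tr(K_F H K_X H) / (||H K_F H||_F * ||H K_X H||_F),
    # where f holds n factor samples and x the corresponding generated data.
    n = f.shape[0]
    H = torch.eye(n) - torch.ones(n, n) / n          # centering matrix
    K_f, K_x = rbf_kernel(f), rbf_kernel(x)
    num = torch.trace(K_f @ H @ K_x @ H)
    return num / ((H @ K_f @ H).norm() * (H @ K_x @ H).norm())

# Ratio r_i = MI(F_y, X_hat_i) / MI(F_{a_i}, X_hat_i) on hypothetical factor samples.
f_y, f_a, x_hat = torch.randn(64, 16), torch.randn(64, 16), torch.randn(64, 32)
r_i = normalized_hsic(f_y, x_hat) / normalized_hsic(f_a, x_hat)
print(float(r_i))

By the definition of r_i, a ratio below 1 (as in the CMU-MOSI row above) indicates that the generated modality depends more strongly on its modality-specific factor F_{a_i} than on the multimodal discriminative factor F_y.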
References

[1] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation, 2008.
[2] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pages 63–77. Springer, 2005.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278–2324, 1998.
[4] L.-P. Morency, R. Mihalcea, and P. Doshi. Towards multimodal sentiment analysis: Harvesting opinions from the web. In ICMI, pages 169–176. ACM, 2011.
[5] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[6] S. Park, H. S. Shim, M. Chatterjee, K. Sagae, and L.-P. Morency. Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In ICMI, 2014.
[7] V. Perez-Rosas, R. Mihalcea, and L.-P. Morency. Utterance-level multimodal sentiment analysis. In Association for Computational Linguistics (ACL), 2013.
[8] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L.-P. Morency. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53, 2013.
[9] D. Wu, Y. Zhao, Y.-H. H. Tsai, M. Yamada, and R. Salakhutdinov. "Dependency bottleneck" in auto-encoding architectures: An empirical study. arXiv preprint arXiv:1802.05408, 2018.
[10] A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems, 2016.