Page 1: Multimodal Machine Learning

1

Louis-Philippe Morency

Multimodal Machine Learning

Lecture 10.2: New Directions

Page 2: Multimodal Machine Learning

Objectives of today’s class

▪ New research directions in multimodal ML

▪ Alignment

▪ Representation

▪ Fusion

▪ Translation

▪ Co-learning

Page 3: Multimodal Machine Learning

3

New Directions:

Alignment

Page 4: Multimodal Machine Learning

Phrase Grounding by Soft-Label Chain CRF

Two main problems:

(1) Dependencies between entities (2) Multiple region proposals

Liu J, Hockenmaier J. “Phrase Grounding by Soft-Label Chain Conditional Random Field” EMNLP 2019

Page 5: Multimodal Machine Learning

Liu J, Hockenmaier J. “Phrase Grounding by Soft-Label Chain Conditional Random Field” EMNLP 2019

Phrase Grounding by Soft-Label Chain CRF

Two main problems:

(1) Dependencies between entities (2) Multiple region proposals

Solution: Formulate phrase grounding as a sequence labeling task

❑ Treat the candidate regions as potential labels

❑ Propose Soft-Label Chain CRFs to model dependencies among regions

❑ Address the multiplicity of gold labels

Page 6: Multimodal Machine Learning

Phrase Grounding by Soft-Label Chain CRF

Standard CRF

▪ Cross-entropy Loss:

Soft-Label CRF:

▪ KL-divergence between the model and target distribution:

Liu J, Hockenmaier J. “Phrase Grounding by Soft-Label Chain Conditional Random Field” EMNLP 2019

- Input sequence: 𝒙 = 𝑥1:𝑇

- Label sequence: 𝒚 = 𝑦1:𝑇

- Score function: 𝑠(𝒙, 𝒚)

- Sequence of target distribution: 𝒒 = 𝑞1:𝑇

- Label distribution over all K possible labels for input 𝑥𝑡: 𝑞𝑡 ∈ ℝ𝐾

➢ Standard CRF: each input 𝑥𝑖 is associated with only one label 𝑦𝑖

➢ Soft-Label CRF: each input 𝑥𝑖 is associated with a distribution 𝑞𝑖 over the labels
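The loss expressions on this slide are figures in the original deck. A plausible reconstruction using the notation above (the factorization of the target distribution and the direction of the KL term are assumptions on our part):

```latex
% Standard CRF: cross-entropy (negative log-likelihood) of the single gold label sequence
\mathcal{L}_{\mathrm{CE}} = -\log p(\boldsymbol{y} \mid \boldsymbol{x})
  = -\,s(\boldsymbol{x}, \boldsymbol{y}) + \log \sum_{\boldsymbol{y}'} \exp s(\boldsymbol{x}, \boldsymbol{y}')

% Soft-Label CRF: KL divergence between the target distribution q and the model distribution p,
% with the target sequence distribution assumed to factorize as q(y | x) = \prod_t q_t(y_t)
\mathcal{L}_{\mathrm{KL}} = \mathrm{KL}\!\left( q(\boldsymbol{y} \mid \boldsymbol{x}) \,\middle\|\, p(\boldsymbol{y} \mid \boldsymbol{x}) \right)
  = \sum_{\boldsymbol{y}} q(\boldsymbol{y} \mid \boldsymbol{x}) \left[ \log q(\boldsymbol{y} \mid \boldsymbol{x}) - \log p(\boldsymbol{y} \mid \boldsymbol{x}) \right]
```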

Page 7: Multimodal Machine Learning

Phrase Grounding by Soft-Label Chain CRF

For efficiency: reduce the model to a first-order linear-chain CRF, whose scoring function factorizes as

𝑠(𝒙, 𝒚) = Σ𝑡 [ 𝜓𝑡(𝑦𝑡−1, 𝑦𝑡) + 𝜙𝑡(𝑦𝑡, 𝑥𝑡) ]

where 𝜓𝑡(𝑦𝑡−1, 𝑦𝑡) are the pairwise potentials between the labels at 𝑡 − 1 and 𝑡, and 𝜙𝑡(𝑦𝑡, 𝑥𝑡) are the unary potentials between the label and the input at 𝑡.

Liu J, Hockenmaier J. “Phrase Grounding by Soft-Label Chain Conditional Random Field” EMNLP 2019
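As a concrete illustration of this factorization (our sketch, not the authors' code), the score of a labeling is just the sum of unary scores plus transition scores along the chain, assuming a unary table and a transition matrix as inputs:

```python
import numpy as np

def chain_score(unary, pairwise, labels):
    """Score s(x, y) of a first-order linear-chain CRF.

    unary:    (T, K) array, unary[t, k]    = potential of label k at step t
    pairwise: (K, K) array, pairwise[i, j] = potential of transition i -> j
    labels:   length-T list of label indices (here: candidate region indices)
    """
    score = unary[0, labels[0]]
    for t in range(1, len(labels)):
        score += pairwise[labels[t - 1], labels[t]] + unary[t, labels[t]]
    return score

# Toy example: 3 phrases to ground, 4 candidate regions
T, K = 3, 4
rng = np.random.default_rng(0)
print(chain_score(rng.normal(size=(T, K)), rng.normal(size=(K, K)), [2, 0, 3]))
```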

Page 8: Multimodal Machine Learning

Phrase Grounding by Soft-Label Chain CRF

▪ Training Objective:

Liu J, Hockenmaier J. “Phrase Grounding by Soft-Label Chain Conditional Random Field” EMNLP 2019

(Figure: the pairwise and unary potentials are each computed by MLPs.)

Page 10: Multimodal Machine Learning

Phrase Grounding by Soft-Label Chain CRF

Liu J, Hockenmaier J. “Phrase Grounding by Soft-Label Chain Conditional Random Field” EMNLP 2019

Page 11: Multimodal Machine Learning

11

Self-supervised approach to learn an embedding space where two similar video sequences can be aligned temporally

Temporal Cycle-Consistency Learning

Page 12: Multimodal Machine Learning

12

Representation Learning by enforcing Cycle consistency

Temporal Cycle-Consistency Learning

Page 13: Multimodal Machine Learning

13

Temporal Cycle-Consistency Learning

Compute a “soft” (weighted) nearest neighbor: embed an anchor frame from one video, then take a distance-weighted (softmax) average of the other video’s frame embeddings.

Cycle back: find the nearest neighbor the other way and penalize the distance between where the cycle lands and the original anchor frame.
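A minimal sketch of this cycle-back penalty (an illustration under our own assumptions, not the authors' implementation), for embeddings u of video 1 and v of video 2:

```python
import torch
import torch.nn.functional as F

def cycle_back_distance(u, v, i):
    """Soft nearest neighbor of u[i] in v, then cycle back into u.

    u: (T1, D) frame embeddings of video 1
    v: (T2, D) frame embeddings of video 2
    i: index of the anchor frame in video 1
    Returns a differentiable penalty that is small when the cycle
    lands back near frame i.
    """
    # Soft nearest neighbor of u[i] among the frames of v
    alpha = F.softmax(-torch.cdist(u[i:i + 1], v).squeeze(0) ** 2, dim=0)    # (T2,)
    v_tilde = alpha @ v                                                      # (D,)

    # Cycle back: soft location of v_tilde among the frames of u
    beta = F.softmax(-torch.cdist(v_tilde[None], u).squeeze(0) ** 2, dim=0)  # (T1,)
    mu = (beta * torch.arange(len(u), dtype=u.dtype)).sum()                  # expected frame index

    return (mu - i) ** 2  # penalize landing far from the anchor frame

u, v = torch.randn(40, 128), torch.randn(50, 128)
loss = sum(cycle_back_distance(u, v, i) for i in range(len(u))) / len(u)
```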

Page 14: Multimodal Machine Learning

14

Nearest Neighbour Retrieval

Temporal Cycle-Consistency Learning

Page 15: Multimodal Machine Learning

15

Anomaly Detection

Temporal Cycle-Consistency Learning

Page 16: Multimodal Machine Learning

ViLBERT: Pretraining Task-Agnostic Visiolinguistic

Representations for Vision-and-Language Tasks

Lu J, Batra D, Parikh D, et al. “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.” NeurIPS 2019

▪ ViLBERT: Extending BERT to jointly represent images and text

▪ Two parallel BERT-style streams operating over image regions and text segments.

▪ Each stream is a series of transformer blocks (TRM) and novel co-attentional transformer layers (Co-TRM).

Page 17: Multimodal Machine Learning

ViLBERT: Pretraining Task-Agnostic Visiolinguistic

Representations for Vision-and-Language Tasks

▪ Co-attentional transformer layers

▪ Enable information exchange between modalities.

▪ Provide interaction between modalities at varying representation depths.

Lu J, Batra D, Parikh D, et al. “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.” NeurIPS 2019
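To make the co-attention idea concrete, here is a minimal PyTorch sketch of one co-attentional block (an illustration, not the released ViLBERT code): each stream uses its own queries over the other stream's keys and values.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One co-attentional layer: queries from one modality, keys/values from the other."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: (B, num_regions, dim) image-region features; txt: (B, num_tokens, dim) text features
        v_out, _ = self.vis_attends_txt(query=vis, key=txt, value=txt)
        t_out, _ = self.txt_attends_vis(query=txt, key=vis, value=vis)
        return self.norm_v(vis + v_out), self.norm_t(txt + t_out)

block = CoAttentionBlock()
vis, txt = torch.randn(2, 36, 768), torch.randn(2, 20, 768)
vis2, txt2 = block(vis, txt)
```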

Page 18: Multimodal Machine Learning

ViLBERT: Pretraining Task-Agnostic Visiolinguistic

Representations for Vision-and-Language Tasks

Two pretraining tasks:

1. Masked multi-modal modelling

▪ The model must reconstruct image region categories or words for masked inputs given the observed inputs

2. Multi-modal alignment prediction

▪ The model must predict whether or not the caption describes the image content.

Lu J, Batra D, Parikh D, et al. “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.” NeurIPS 2019

Page 19: Multimodal Machine Learning

Multi-Head Attention with Diversity for Learning

Grounded Multilingual Multimodal Representations

▪ Introduce a new multi-head attention diversity loss to encourage diversity among attention heads

▪ Synchronizes the multi-head attentions between the transformers

Huang, Po-Yao, et al. “Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations.” EMNLP 2019

Page 20: Multimodal Machine Learning

Multi-Head Attention with Diversity for Learning

Grounded Multilingual Multimodal Representations

Multi-head attention diversity loss

▪ Taking Image-English instances {V, E} as an example

Huang, Po-Yao, et al. “Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations.” EMNLP 2019

- 𝛼𝐷: diversity margin

- cos(·, ·): cosine similarity

- 𝑒𝑝𝑘: the k-th attention head of the English sentence representation

- [·]+ = max(0, ·): the hinge function

If they are from the same 𝑘-th head, then they should be close to each other… within a certain margin.

Page 21: Multimodal Machine Learning

Multi-Head Attention with Diversity for Learning

Grounded Multilingual Multimodal Representations

Multi-head attention diversity loss

▪ Taking Image-English instances {V, E} as an example

Huang, Po-Yao, et al. “Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations.” EMNLP 2019

▪ Diversity within modalities and across modalities:
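The loss itself appears as an equation image in the deck. As a rough, hypothetical sketch of a hinge-based diversity term over the heads of a single representation (our illustration with an assumed margin, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def diversity_loss(heads, margin=0.2):
    """Hinge-based diversity over attention heads.

    heads: (K, D) tensor, one row per attention head (e.g. the K heads
           summarizing an English sentence, or an image).
    Penalizes pairs of *different* heads whose cosine similarity
    exceeds the diversity margin.
    """
    h = F.normalize(heads, dim=-1)
    sim = h @ h.t()                                      # (K, K) cosine similarities
    off_diag = sim - torch.eye(len(h), device=h.device)  # zero out self-similarity
    return F.relu(off_diag - margin).sum()

loss = diversity_loss(torch.randn(4, 512))
```

The companion alignment described on the previous slide (same-index heads across modalities staying close within a margin) would take a similar hinge form with the comparison reversed.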

Page 22: Multimodal Machine Learning

Multi-Head Attention with Diversity for Learning

Grounded Multilingual Multimodal Representations

Huang, Po-Yao, et al. “Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations.” EMNLP 2019

Learned multilingual multimodal embeddings (note: the sentences are not translation pairs)

Page 23: Multimodal Machine Learning

23

New Directions:

Representation

Page 24: Multimodal Machine Learning

24

Learn vector representations for text using visual co-occurrences

Four types of co-occurrences:

(a) Object - Attribute

(b) Attribute - Attribute

(c) Context

(d) Object-Hypernym

ViCo: Word Embeddings from Visual Co-occurrences

Page 25: Multimodal Machine Learning

25

Relatedness through Co-occurrences

Since ViCo is learned from multiple types of co-occurrences, it is hypothesized to provide a richer sense of relatedness

ViCo: Word Embeddings from Visual Co-occurrences

➢ Learned using a multi-task Log-Bilinear Model
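As a rough GloVe-style illustration of what a multi-task log-bilinear objective over visual co-occurrence counts X^r_ij can look like (our notation and weighting, with r indexing the four co-occurrence types; not necessarily the paper's exact formulation):

```latex
J = \sum_{r \in \{\text{obj-attr},\ \text{attr-attr},\ \text{context},\ \text{obj-hyp}\}}
    \; \sum_{i,j} f\!\left(X^{r}_{ij}\right)
    \left( w_i^{\top} \tilde{w}^{\,r}_{j} + b^{r}_{i} + \tilde{b}^{\,r}_{j} - \log X^{r}_{ij} \right)^{2}
```

A shared word vector w_i with per-task context vectors is one way the tasks can be tied together; the key point is that each co-occurrence type contributes its own log-bilinear term.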

Page 26: Multimodal Machine Learning

26

ViCo leads to more homogeneous clusters compared to GloVe

ViCo: Word Embeddings from Visual Co-occurrences

Page 27: Multimodal Machine Learning

27

Neural-symbolic VQA

1) Image de-rendering

Kexin Yi, et al. “Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding.” NeurIPS 2018

Previously trained in a supervised way

Page 28: Multimodal Machine Learning

28

Neural-symbolic VQA

2) Parsing questions into programs

Kexin Yi, et al. “Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding.” NeurIPS 2018

Similar to neural module networks

Page 29: Multimodal Machine Learning

29

Neural-symbolic VQA

3) Program execution

Kexin Yi, et al. “Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding.” NeurIPS 2018

Execution of the program is somewhat easier given the “symbolic” representation of the image
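To illustrate what executing a program over a symbolic scene representation looks like, here is a toy sketch (the attribute names and operators are illustrative stand-ins, not the paper's actual DSL):

```python
# Toy symbolic scene: the image has been "de-rendered" into a table of objects.
scene = [
    {"shape": "cube",     "color": "red",  "size": "large"},
    {"shape": "sphere",   "color": "blue", "size": "small"},
    {"shape": "cylinder", "color": "red",  "size": "small"},
]

# A few illustrative program operators over the symbolic representation.
def filter_color(objs, color):
    return [o for o in objs if o["color"] == color]

def count(objs):
    return len(objs)

# Question: "How many red objects are there?" -> program: count(filter_color(scene, "red"))
program = [("filter_color", "red"), ("count",)]
ops = {"filter_color": filter_color, "count": count}

result = scene
for op, *args in program:
    result = ops[op](result, *args)

print(result)  # 2
```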

Page 32: Multimodal Machine Learning

32

Neural-symbolic VQA

Neural-symbolic programs give more accurate answers (shown in blue)

Kexin Yi, et al. “Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding.” NeurIPS 2018

Page 33: Multimodal Machine Learning

33

The Neuro-Symbolic Concept Learner

Extension of Neural-Symbolic VQA: learns visual concepts, words, and semantic parsing of sentences without explicit supervision on any of them, just by looking at images and reading paired questions and answers

Jiayuan Mao, et al. “The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision.” ICLR 2019

Page 35: Multimodal Machine Learning

35

The Neuro-symbolic Concept Learner

Jiayuan Mao, et al. “The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision.” ICLR 2019

Page 36: Multimodal Machine Learning

36

Time-Contrastive Networks:

Self-Supervised Learning from (Multi-View) Video

Main idea: Embeddings should be close if they come from synchronized frames of multi-view videos.

Goal: We want to observe and disentangle the world from many videos.

(Figure: multi-view videos with anchor, positive, and negative frames.)

Page 37: Multimodal Machine Learning

37

Time-Contrastive Networks:

Self-Supervised Learning from (Multi-View) Video

Let’s learn an embedding function f for a sequence x (embedding dimension d = 32).

Triplet objective over the set of all triplets in the training set, with a margin enforced between positive and negative pairs.

(Figure: multi-view videos with anchor, positive, and negative frames.)
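A minimal sketch of this time-contrastive triplet objective (an illustration, not the authors' implementation): the anchor and positive are embeddings of synchronized frames from different views, the negative an embedding of a temporally distant frame.

```python
import torch
import torch.nn.functional as F

def time_contrastive_loss(f_anchor, f_positive, f_negative, margin=0.2):
    """Triplet loss: pull synchronized (anchor, positive) frames together,
    push the temporally distant negative away by at least `margin`.
    All inputs are (B, d) embeddings, e.g. d = 32."""
    d_pos = (f_anchor - f_positive).pow(2).sum(dim=1)
    d_neg = (f_anchor - f_negative).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

loss = time_contrastive_loss(torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 32))
```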

Page 38: Multimodal Machine Learning

38

Time-Contrastive Networks:

Self-Supervised Learning from (Multi-View) Video

(Figure: anchor, positive, and negative frames.)

Page 39: Multimodal Machine Learning

39

Time-Contrastive Networks:

Self-Supervised Learning from Video

Learn RL policies from only one video

Page 40: Multimodal Machine Learning

40

Time-Contrastive Networks:

Self-Supervised Learning from Video

Demo: Pouring

Page 41: Multimodal Machine Learning

41

Time-Contrastive Networks:

Self-Supervised Learning from Video

Demo: Pose Imitation

Page 42: Multimodal Machine Learning

42

New Directions:

Fusion

Page 43: Multimodal Machine Learning

43

Pose multimodal fusion as an architectural search problem

Each fusion layer combines three inputs:

(a) Output from previous fusion layer

(b) Output from modality A

(c) Output from modality B

MFAS: Multimodal Fusion Architecture Search

Page 44: Multimodal Machine Learning

44

(Figure: two example realizations of the fusion architecture search space.)

The search space contains a large number of possible fusion architectures, and is naturally divided into complexity levels that can be interpreted as progression steps.

Exploration is performed by sequential model-based optimization.

MFAS: Multimodal Fusion Architecture Search
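As a rough sketch of the kind of fusion layer being searched over (layer sizes, activation choice, and which unimodal layers feed it are illustrative assumptions, not the paper's exact search space):

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """One searchable fusion layer: combines the previous fusion output
    with a chosen hidden feature from modality A and one from modality B."""

    def __init__(self, prev_dim, dim_a, dim_b, out_dim, activation=nn.ReLU):
        super().__init__()
        self.proj = nn.Linear(prev_dim + dim_a + dim_b, out_dim)
        self.act = activation()

    def forward(self, prev_fusion, feat_a, feat_b):
        return self.act(self.proj(torch.cat([prev_fusion, feat_a, feat_b], dim=-1)))

# A candidate architecture = which unimodal layers feed each fusion layer and
# which activation each layer uses; the search explores these discrete choices.
layer = FusionLayer(prev_dim=64, dim_a=128, dim_b=128, out_dim=64)
out = layer(torch.randn(4, 64), torch.randn(4, 128), torch.randn(4, 128))
```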

Page 45: Multimodal Machine Learning

Video Action Transformer Network

Recognizing and localizing human actions in video clips by attending to the person of interest and their context (other people, objects)

Rohit Girdhar, et al. “Video Action Transformer Network.” CVPR 2019

Page 46: Multimodal Machine Learning

Video Action Transformer Network

▪ Trunk: generates features and region proposals (RP) for the people present (using I3D)

▪ Action Transformer Head: uses the person box from the RPN as a ‘query’ to locate regions to attend to, and aggregates the information over the clip to classify their actions

Rohit Girdhar, et al. “Video Action Transformer Network.” CVPR 2019

(Figure label: initial actor representations)
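A minimal sketch of the head's query/key/value arrangement (our illustration using a generic multi-head attention module, not the released model): the person's RoI feature acts as the query over the clip's spatio-temporal features.

```python
import torch
import torch.nn as nn

class ActionAttentionHead(nn.Module):
    """Person RoI feature as the query; clip features as keys and values."""

    def __init__(self, dim=512, heads=8, num_actions=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_actions)

    def forward(self, person_feat, clip_feats):
        # person_feat: (B, 1, dim) RoI-pooled actor feature (the "query")
        # clip_feats:  (B, T*H*W, dim) flattened spatio-temporal trunk features
        ctx, _ = self.attn(query=person_feat, key=clip_feats, value=clip_feats)
        return self.classifier((person_feat + ctx).squeeze(1))  # per-person action logits

head = ActionAttentionHead()
logits = head(torch.randn(2, 1, 512), torch.randn(2, 14 * 16 * 16, 512))
```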

Page 47: Multimodal Machine Learning

Video Action Transformer Network

▪ Visualizing the key embeddings using a color-coded 3D PCA projection

▪ Different heads learn to track people at different levels.

▪ Attends to the face and hands of the person, and to other people/objects in the scene

Rohit Girdhar, et al. “Video Action Transformer Network.” CVPR 2019

Page 48: Multimodal Machine Learning

48

New Directions:

Translation

Page 49: Multimodal Machine Learning

49

Speech2Face

Page 50: Multimodal Machine Learning

50

Speech2Face

Voice encoder + face encoder + face decoder
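A rough sketch of how such a pipeline can fit together (the dimensions, losses, and training split here are our assumptions, not the paper's exact design): the voice encoder is trained to predict the embedding produced by a pretrained face encoder, and a pretrained face decoder renders a face from that predicted embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions only.
D_VOICE, D_FACE_IMG, D_EMB = 1024, 3 * 64 * 64, 256

voice_encoder = nn.Sequential(nn.Linear(D_VOICE, 512), nn.ReLU(), nn.Linear(512, D_EMB))
face_encoder  = nn.Linear(D_FACE_IMG, D_EMB)   # pretrained, kept frozen
face_decoder  = nn.Linear(D_EMB, D_FACE_IMG)   # pretrained, kept frozen

voice_feat = torch.randn(8, D_VOICE)      # e.g. a pooled spectrogram feature
face_img   = torch.randn(8, D_FACE_IMG)   # the speaker's face image (flattened)

# Train only the voice encoder: regress the frozen face encoder's embedding.
pred_emb = voice_encoder(voice_feat)
with torch.no_grad():
    target_emb = face_encoder(face_img)
loss = F.l1_loss(pred_emb, target_emb)

# At inference, render a face for an unseen voice via the frozen decoder.
reconstruction = face_decoder(pred_emb)
```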

Page 51: Multimodal Machine Learning

51

Speech2Face

Examples of reconstructed faces

Page 52: Multimodal Machine Learning

52

- Introduce the task of reconstructing a face from a voice

- Two adversaries:

(a) A discriminator to verify that the generated image is a face

(b) A classifier to assign the face image to the correct identity

Reconstructing faces from voices
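A hypothetical sketch of how the two adversaries enter the generator's training signal (our illustration of the setup above, not the paper's exact objective or architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical components, linear layers for brevity.
generator     = nn.Linear(128, 3 * 64 * 64)   # voice embedding -> face image
discriminator = nn.Linear(3 * 64 * 64, 1)     # (a) real face vs. generated
classifier    = nn.Linear(3 * 64 * 64, 100)   # (b) which of 100 identities

voice_emb  = torch.randn(8, 128)
speaker_id = torch.randint(0, 100, (8,))

fake_face = generator(voice_emb)

# (a) The discriminator should judge the generated image to be a face.
adv_loss = F.binary_cross_entropy_with_logits(discriminator(fake_face), torch.ones(8, 1))

# (b) The classifier should assign the generated face to the speaker's identity.
id_loss = F.cross_entropy(classifier(fake_face), speaker_id)

generator_loss = adv_loss + id_loss  # the discriminator and classifier have their own updates
```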

Page 53: Multimodal Machine Learning

53

The generated face images have identity associations with the true speaker.

The produced faces have features (e.g. hair) that are presumably not predicted by the voice, but simply obtained from their co-occurrence with other features.

Reconstructing faces from voices

Page 54: Multimodal Machine Learning

54

High-Resolution Image Synthesis and Semantic

Manipulation with CGANs

Page 58: Multimodal Machine Learning

58

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

Page 59: Multimodal Machine Learning

59

Model: Co-grounded Attention Streams

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
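A rough sketch of the auxiliary progress-estimation idea (our illustration; the progress target used here, fraction of the path completed, is an assumption rather than the paper's exact definition): alongside the action predictor, a small head regresses how far along the instruction-following episode the agent is.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfMonitoringHead(nn.Module):
    """Predicts the next action plus an auxiliary estimate of progress toward the goal."""

    def __init__(self, hidden=512, num_actions=6):
        super().__init__()
        self.action_head = nn.Linear(hidden, num_actions)
        self.progress_head = nn.Linear(hidden, 1)   # auxiliary progress monitor

    def forward(self, grounded_state):
        # grounded_state: (B, hidden) co-grounded instruction + visual state
        action_logits = self.action_head(grounded_state)
        progress = torch.sigmoid(self.progress_head(grounded_state)).squeeze(-1)
        return action_logits, progress

head = SelfMonitoringHead()
action_logits, progress = head(torch.randn(4, 512))

target_action = torch.randint(0, 6, (4,))
target_progress = torch.rand(4)   # assumed target: fraction of the path completed so far
loss = F.cross_entropy(action_logits, target_action) + F.mse_loss(progress, target_progress)
```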

Page 60: Multimodal Machine Learning

60

Results: Progress Monitoring

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

Page 61: Multimodal Machine Learning

61

New Directions:

Co-Learning

Page 62: Multimodal Machine Learning

Regularizing with Skeleton Seqs

▪ Better unimodal representation by regularizing using a different modality

[B. Mahasseni and S. Todorovic, “Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition,” in CVPR, 2016]

Non-parallel data!

Page 63: Multimodal Machine Learning

Multimodal Cyclic Translation

(Figure: cyclic translation between the verbal modality (e.g. “Today was a great day!”, spoken language) and the visual modality, using an encoder and decoder with a cyclic loss; the intermediate co-learning representation is used for sentiment prediction.)

Paul Pu Liang*, Hai Pham*, et al., “Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities”, AAAI 2019
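A minimal sketch of the cyclic translation idea (the direction of translation, module shapes, and equal loss weights are our assumptions): translate one modality into the other and back, penalize the cycle reconstruction, and predict sentiment from the intermediate joint representation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_VERBAL, D_VISUAL, D_JOINT = 300, 64, 128

encoder   = nn.Linear(D_VERBAL, D_JOINT)    # verbal -> joint representation
decoder   = nn.Linear(D_JOINT, D_VISUAL)    # joint  -> visual (forward translation)
back_enc  = nn.Linear(D_VISUAL, D_JOINT)    # visual -> joint
back_dec  = nn.Linear(D_JOINT, D_VERBAL)    # joint  -> verbal (cycle back)
sentiment = nn.Linear(D_JOINT, 1)

verbal = torch.randn(8, D_VERBAL)
visual = torch.randn(8, D_VISUAL)
label  = torch.randn(8, 1)

joint = encoder(verbal)
pred_visual   = decoder(joint)                    # forward translation
cycled_verbal = back_dec(back_enc(pred_visual))   # back-translation

loss = (F.mse_loss(pred_visual, visual)           # translation loss
        + F.mse_loss(cycled_verbal, verbal)       # cyclic loss
        + F.mse_loss(sentiment(joint), label))    # sentiment prediction
```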

Page 64: Multimodal Machine Learning

Taskonomy

Zamir, Amir R., et al. “Taskonomy: Disentangling Task Transfer Learning.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

Page 65: Multimodal Machine Learning

Associative Multichannel Autoencoder

▪ Learning representation through fusion and translation

▪ Use associated word prediction to address data sparsity

[Wang et al., Associative Multichannel Autoencoder for Multimodal Word Representation, 2018]

Page 66: Multimodal Machine Learning

Grounding Semantics in Olfactory Perception

▪ Grounding language in vision, sound, and smell

[Kiela et al., Grounding Semantics in Olfactory Perception, ACL-IJCNLP, 2015]