Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation

Presenter: Maresh Naresh Singh

Authors: Luo Jie, Barbara Caputo and Vittorio Ferrari

Aim

• Given: News items consisting of images with their associated text.

• Goal: Figure out who is doing what?

Who is doing what?

• Guess possible action of a person in the image.

• Use pose as well as verb for this purpose.

• Associate actions with the person in the image.

• Predict the name of the person.

(a) Four sets ... Roger Federer prepares to hit a backhand in a quarter-final match with Andy Roddick at the US Open.

(b) US Democratic presidential candidate Senator Barack Obama waves to supporters together with his wife Michelle Obama standing beside him at his North Carolina and Indiana primary election night rally in Raleigh.

Correspondence ambiguity problem.

• Multiple persons in the image and captions.

• Person in the image but not mentioned in the caption.

• Mention in the caption but not present in the image.
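To make the size of the ambiguity concrete, the three cases above can be sketched as an enumeration of all admissible links between detected persons and caption mentions. This is only an illustration: the function name `enumerate_assignments` and the use of `None` for "not mentioned in the caption" are assumptions, not notation from the paper.

```python
from itertools import permutations


def enumerate_assignments(num_persons, num_mentions):
    """List every way to link detected persons to caption mentions.

    Each assignment is a tuple with one entry per detected person: the index
    of the linked mention, or None if that person is not mentioned.
    A mention is linked to at most one person, and may stay unlinked.
    """
    # One None per person lets any number of persons remain unlinked.
    slots = list(range(num_mentions)) + [None] * num_persons
    # Repeated None slots produce duplicate tuples, so collapse them in a set.
    unique = set(permutations(slots, num_persons))
    return sorted(unique, key=lambda a: tuple(-1 if m is None else m for m in a))


if __name__ == "__main__":
    for assignment in enumerate_assignments(2, 2):
        print(assignment)
```

Even two persons and two mentions already allow several interpretations, which is why the correspondence is treated as unknown rather than read off directly.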

Idea

• The title “Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation”

Generative Model

• Observed variables: Names and verbs in the caption. Detected persons in the image.

• Latent Variables: Image-caption correspondence.

• Parameters: Visual appearances of face and pose classes corresponding to different names and verbs.

• EM to compute hidden variables.
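The alternation between estimating the correspondence and updating the appearance parameters can be sketched roughly as follows. This is a deliberately simplified hard-assignment variant, not the paper's formulation: the squared Euclidean distance, the fixed `null_cost` for leaving a person unlabeled, and all function names are assumptions made for illustration.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def e_step(detections, labels, centers, null_cost):
    """Pick the lowest-cost person-to-label assignment for one image."""
    # One None per detection lets any person stay unlabeled.
    slots = list(labels) + [None] * len(detections)
    best, best_cost = None, float("inf")
    for perm in set(permutations(slots, len(detections))):
        cost = sum(
            null_cost if lab is None else sq_dist(det, centers[lab])
            for det, lab in zip(detections, perm)
        )
        if cost < best_cost:
            best, best_cost = perm, cost
    return best


def m_step(items, assignments, centers):
    """Recompute each center as the mean of the detections assigned to it."""
    members = {lab: [] for lab in centers}
    for (_, dets), assign in zip(items, assignments):
        for det, lab in zip(dets, assign):
            if lab is not None:
                members[lab].append(det)
    return {
        # Keep the old center if nothing was assigned to this label.
        lab: [sum(c) / len(vs) for c in zip(*vs)] if vs else centers[lab]
        for lab, vs in members.items()
    }


def hard_em(items, centers, null_cost=4.0, iterations=10):
    """items: list of (labels, detections) pairs, one per image-caption pair."""
    assignments = []
    for _ in range(iterations):
        assignments = [e_step(d, l, centers, null_cost) for l, d in items]
        centers = m_step(items, assignments, centers)
    return assignments, centers
```

The exhaustive search in `e_step` is only practical for the handful of persons and labels found in a single news image; it is used here to keep the sketch short.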

Features

Face and pose recognition

• Uses a face detector and an upper-body detector.

• A face and an upper body are considered to belong to the same person if the face lies in the center of the upper-body bounding box.
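The face-to-body pairing rule can be sketched as a simple geometric test. The box layout `(x, y, width, height)`, the function names, and the `tolerance` fraction defining what counts as "the center" are assumptions for illustration; the slides do not give the exact criterion.

```python
def box_center(box):
    """Return the center point of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def same_person(face_box, body_box, tolerance=0.25):
    """Check whether the face center falls in the central region of the body box.

    tolerance is the allowed offset from the body-box center, expressed as a
    fraction of the body-box width and height (an assumed threshold).
    """
    fx, fy = box_center(face_box)
    cx, cy = box_center(body_box)
    _, _, bw, bh = body_box
    return abs(fx - cx) <= tolerance * bw and abs(fy - cy) <= tolerance * bh


if __name__ == "__main__":
    body = (0, 0, 100, 100)
    print(same_person((40, 40, 20, 20), body))  # face centered in the body box
    print(same_person((0, 0, 10, 10), body))    # face in the corner
```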

Name-Verb pair.

• Language parser extracts name-verb pair from each caption.

• Uses OpenNLP.

Summary from last class.

Probability Function

• Uses EM to maximize the above function.

…

• Maximizing the previous equation somehow boils down to minimizing the equation:

EM algorithm (Initialization)

• Compute distance matrix between faces/poses from images sharing some name/verb in the caption.

• For each name/verb pair, select all captions containing only that name/verb.

• If the corresponding images contain only one person, their faces/poses are used to initialize the center vectors.

• If the corresponding images always contain multiple persons, assign a person by random selection.
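The selection rules in this initialization can be sketched as below. The distance-matrix step is left out, and the data layout (a list of `(labels, detections)` pairs), the averaging of the selected vectors into one center, and the names `init_centers` and `mean_vector` are assumptions for illustration only.

```python
import random


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def init_centers(items, seed=0):
    """Initialize one center vector per name (or verb).

    items: list of (labels, detections); labels is the set of names (or verbs)
    in the caption, detections is a list of feature vectors, one per person.
    """
    rng = random.Random(seed)
    centers = {}
    all_labels = set().union(*(labels for labels, _ in items))
    for label in sorted(all_labels):
        # Captions that contain only this label, with at least one detection.
        exclusive = [dets for labels, dets in items if labels == {label} and dets]
        # Unambiguous case: exactly one person in the image.
        single = [dets[0] for dets in exclusive if len(dets) == 1]
        if single:
            centers[label] = mean_vector(single)
        elif exclusive:
            # Images always show several persons: pick one at random per image.
            centers[label] = mean_vector([rng.choice(dets) for dets in exclusive])
    return centers
```

Labels that never appear alone in a caption get no center from this sketch and would need a fallback.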

EM algorithm (E-Step)

EM algorithm (M-Step)

Experiment and observations

Comments

• Better results on the chosen dataset.

• Somewhat successful in recognizing persons in images without captions.

Comments

• Assumes independence between persons in an image.

• Limited dataset of 1610 images used for experimentation.

• Manual involvement in writing captions.

• Images collected using search queries like “Barack Obama” + “Shake hands”.

• Such queries result in images with strong correspondence between pose and face.

Thanks for tolerating me.
