Situation Recognition: Visual Semantic Role Labeling for Image Understanding
Mark Yatskar
in collaboration with Luke Zettlemoyer, Ali Farhadi
How can we summarize what is happening in an image?
LOADING
AGENT: WOMAN
ITEM: HORSE
DESTINATION: TRAILER
TOOL: ROPE
PLACE: OUTDOORS
Is the same thing happening in two images?
turkers say…
[Bar chart: yes / somewhat / no; axis 0 to 100]
why no?
[Bar chart: activity / other; axis 0 to 100]
Activity
Is the same thing happening in two images?
turkers say…
[Bar chart: yes / somewhat / no; axis 0 to 100]
why yes?
[Bar chart: throwing / playing; axis 0 to 100]
why no?
[Bar chart: ball / sport; axis 0 to 100]
Activity
Object
Is the same thing happening in two images?
turkers say…
[Bar chart: yes / somewhat / no; axis 0 to 100]
why yes?
[Bar chart: beer / pouring / glass / man; axis 0 to 100]
why no?
[Bar chart: destination / source / other; axis 0 to 100]
Activity
Object
Role
Systematically describe how objects participate in activities through roles
LOADING
AGENT: WOMAN
ITEM: HORSE
DESTINATION: TRAILER
TOOL: ROPE
PLACE: OUTDOORS
Situation Recognition
FIXING
AGENT: BOY
OBJECT: CAR
PART: TIRE
TOOL: TIRE IRON
PLACE: OUTDOORS
POURING
AGENT: MAN
SUBSTANCE: BEER
SOURCE: TAP
DESTINATION: GLASS
PLACE: BARROOM
POURING
AGENT: MAN
SUBSTANCE: BEER
SOURCE: GLASS
DESTINATION: MOUTH
PLACE: BACKYARD
Same
Different
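The comparison of the two POURING situations can be sketched in code. This is only an illustration of the role-centric idea, not the imSitu data format: the `Situation` class and the `compare` helper are hypothetical names, and a situation is assumed to be a verb plus a role-to-noun mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    """A verb plus (role, noun) pairs; hypothetical container for illustration."""
    verb: str
    roles: tuple  # tuple of (role, noun) pairs, immutable so the frame is hashable

    def as_dict(self):
        return dict(self.roles)


def compare(a, b):
    """Split the roles shared by two situations into matching and differing ones."""
    da, db = a.as_dict(), b.as_dict()
    shared = sorted(set(da) & set(db))
    same = [r for r in shared if da[r] == db[r]]
    different = [r for r in shared if da[r] != db[r]]
    return same, different


# The two POURING frames from the slide.
bar = Situation("pouring", (("agent", "man"), ("substance", "beer"),
                            ("source", "tap"), ("destination", "glass"),
                            ("place", "barroom")))
yard = Situation("pouring", (("agent", "man"), ("substance", "beer"),
                             ("source", "glass"), ("destination", "mouth"),
                             ("place", "backyard")))

same, different = compare(bar, yard)
print("same verb:", bar.verb == yard.verb)
print("same roles:", same)
print("different roles:", different)
```

Both frames share the verb, agent and substance, while source, destination and place differ, which is the kind of distinction an activity label alone cannot express.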
What is the space of possible situations?
imSitu: A Large Scale Situation Dataset
120k+ images, 500+ verbs, 100k+ situations
Natural Language Processing: Semantic Role Labeling
A boy is fixing a car tire with a tire iron outdoors.
Activity
Object
Role
FIXING
AGENT: BOY
OBJECT: CAR
PART: TIRE
TOOL: TIRE IRON
PLACE: OUTDOORS
Natural Language Processing: Semantic Role Labeling
A jockey falling from a horse onto the ground at a racetrack.
Activity
Object
Role
FALLING
AGENT: JOCKEY
SOURCE: HORSE
DESTINATION: GROUND
PLACE: RACETRACK
creating imSitu
FrameNet for Verb and Role Inventory
semantic role labeling ontology: FrameNet (8000 verbs)
FIXING roles: AGENT, OBJECT, PART, TOOL, PLACE
Visualness
filter verbs, semantic roles
~1000 visual verbs, ~3.5 roles/verb
WordNet for Noun Inventory
values from noun ontology: WordNet (80,000 nouns)
Filter Images
Google Images Search, Web N-grams
Fill Values
FIXING
AGENT: BOY
OBJECT: CAR
PART: TIRE
TOOL: TIRE IRON
PLACE: OUTDOORS
imSitu: Dataset Statistics
Verbs: 504
Images: 126,102
Situations/Image: 3
Roles (types): 1,788 (190)
Nouns (>= 3): 11,538 (6,794)
Annotations: 1,481,851
Images/Verb: 200-400
Uniq. situations (>= 3): 205,095 (21,505)
Despite 80,000 possible values, 2/3 annotators agree on 76.8% of role-value pairs
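One way such an agreement figure could be computed is sketched here. The slide does not define the measure, so this assumes "agreement" means that at least two of the three annotators chose the same noun for a given role of a given image. The function name and the toy annotations are hypothetical.

```python
from collections import Counter


def majority_agreement_rate(annotations):
    """Fraction of (image, role) slots where at least two annotators gave the same noun.

    `annotations` maps an (image_id, role) slot to the list of nouns chosen by
    the annotators (three per slot in this toy example).
    """
    agreed = 0
    for nouns in annotations.values():
        # Count of the most frequently chosen noun for this slot.
        top_count = Counter(nouns).most_common(1)[0][1]
        if top_count >= 2:
            agreed += 1
    return agreed / len(annotations)


# Hypothetical annotations, not taken from the dataset.
toy = {
    ("img1", "agent"): ["boy", "boy", "man"],
    ("img1", "tool"): ["tire iron", "wrench", "tire iron"],
    ("img1", "place"): ["outdoors", "street", "garage"],
    ("img2", "agent"): ["jockey", "jockey", "jockey"],
}
rate = majority_agreement_rate(toy)
print(rate)
```

In the toy data three of the four slots have a two-annotator majority, so the rate is 0.75.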
[Pipeline diagram: creating imSitu (WordNet, FrameNet, Visualness, Filter Images, Fill Values)]
Skew - not all verbs are equal
[Bar chart: VERBS vs. # NOUNS; axis 0 to 700]
Verbs shown: scooping, putting, feeding, fetching, climbing, stumbling, laughing, bowing, inflating, shelling, snuggling, twisting, dancing, flossing
Example role values: food: milk, food: carrot, receiver: piglet, receiver: dolphin, tool: pot, tool: scooper, item: poop, item: seed
Skew - not all nouns are equal
[Bar chart: NOUNS vs. # OF SEMANTIC ROLES; axis ticks 1, 10, 100, 1000]
Nouns shown: car, man, elephant, zebra, fireplace, octopus, priest, flower, nostril, ice, bacon, vacuum, bow, cherry
Example roles: splashing.agent, swimming.agent, riding.vehicle, attacking.victim, colliding.agent, driving.vehicle, pumping.dest, buckling.place
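The two skew charts count, respectively, nouns per verb and semantic roles per noun. A minimal sketch of how such counts could be tallied from annotated situations follows. It assumes each annotation is a verb with a role-to-noun mapping; `skew_counts` and the toy data are hypothetical and not the dataset's actual tooling.

```python
from collections import defaultdict


def skew_counts(situations):
    """Count distinct nouns per verb and distinct verb.role slots per noun."""
    nouns_per_verb = defaultdict(set)
    roles_per_noun = defaultdict(set)
    for verb, roles in situations:
        for role, noun in roles.items():
            nouns_per_verb[verb].add(noun)
            roles_per_noun[noun].add(f"{verb}.{role}")
    return ({v: len(n) for v, n in nouns_per_verb.items()},
            {n: len(r) for n, r in roles_per_noun.items()})


# Hypothetical annotations for illustration only.
toy = [
    ("feeding", {"agent": "man", "food": "milk", "receiver": "piglet"}),
    ("feeding", {"agent": "man", "food": "carrot", "receiver": "dolphin"}),
    ("riding", {"agent": "man", "vehicle": "car"}),
    ("driving", {"agent": "man", "vehicle": "car"}),
]
nouns_per_verb, roles_per_noun = skew_counts(toy)
print(nouns_per_verb)
print(roles_per_noun)
```

Even in this small example a frequent noun such as "man" fills more role slots than "milk", the same kind of imbalance the charts show at dataset scale.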
Situation Recognition: Models, Evaluation and Basic Results
structure matters
situation recognition improves object and activity recognition
CLEAN
AGENT: man
SOURCE: chimney
DIRT: soot
TOOL: brush
PLACE: roof
Neural Conditional Random Field
VGG
Convolutional Layers
Conditional Random Field
Verb Potential
Verb-Role-Noun Potential
Backpropagate CRF loss through VGG
$$p(S \mid i; \theta) \propto \psi_v(v, i; \theta) \prod_{(r, n_r) \in F} \psi_e(v, r, n_r, i; \theta)$$
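Because the distribution is a verb potential times one potential per role, the normalizer factorizes: for each verb, the sum over nouns can be taken role by role. The sketch below illustrates that computation in log space. It is a toy under stated assumptions, not the authors' implementation: the scores stand in for log-potentials that would come from the VGG network, and all names and numbers are hypothetical.

```python
import itertools
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_partition(verb_scores, role_scores):
    """log Z: for each verb, add the verb score and a per-role sum over nouns."""
    per_verb = []
    for verb, verb_score in verb_scores.items():
        total = verb_score
        for noun_scores in role_scores[verb].values():
            total += logsumexp(list(noun_scores.values()))
        per_verb.append(total)
    return logsumexp(per_verb)


def log_prob(verb, assignment, verb_scores, role_scores):
    """Log-probability of one situation (verb plus a noun for every role)."""
    score = verb_scores[verb]
    for role, noun in assignment.items():
        score += role_scores[verb][role][noun]
    return score - log_partition(verb_scores, role_scores)


def all_situations(role_scores):
    """Enumerate every (verb, role assignment) pair; feasible only for toy sizes."""
    for verb, roles in role_scores.items():
        names = list(roles)
        for nouns in itertools.product(*(roles[r] for r in names)):
            yield verb, dict(zip(names, nouns))


# Hypothetical log-potentials for a single image.
verb_scores = {"fixing": 1.2, "cleaning": 0.3}
role_scores = {
    "fixing": {"agent": {"boy": 0.9, "man": 0.4},
               "tool": {"tire iron": 1.1, "brush": -0.5}},
    "cleaning": {"agent": {"boy": 0.1, "man": 0.8},
                 "tool": {"tire iron": -0.7, "brush": 1.0}},
}

p = math.exp(log_prob("fixing", {"agent": "boy", "tool": "tire iron"},
                      verb_scores, role_scores))
total = sum(math.exp(log_prob(v, a, verb_scores, role_scores))
            for v, a in all_situations(role_scores))
print(p, total)
```

The brute-force enumeration is there only to check that the factorized normalizer yields probabilities that sum to one. In training, the negative log-probability of the annotated situation would serve as the loss that is backpropagated.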
Qualitative Examples
Legend: Gold, Correct, Incorrect
SPEARING
AGENT: PERSON | PERSON
VICTIM: FISH | FISH
PLACE: OCEAN | OCEAN
FALLING
AGENT: PERSON | PERSON
SOURCE: HORSE | HORSE
DESTINATION: GROUND | GROUND
PLACE: FIELD | FIELD
SWIMMING
AGENT: SNAKE | SNAKE
PLACE: OCEAN | OCEAN
Qualitative Examples
Legend: Gold, Correct, Incorrect
SHAVING
AGENT: MAN | PERSON
CO-AGENT: MAN | MAN
BODYPART: HEAD | HEAD
SUBSTANCE: S. CREAM
TOOL: RAZOR | RAZOR
PLACE: INSIDE | INSIDE
GIVING
AGENT: SOLDIER
RECIPIENT: GIRL
ITEM: BAG
PLACE: OUTSIDE
DETAINING
AGENT: SOLDIER
VICTIM: MAN
PLACE: OUTSIDE
Quantitative: Structured Prediction Crucial
FIXING
AGENT: BOY
OBJECT: CAR
PART: TIRE
TOOL: TIRE IRON
PLACE: OUTDOORS
[Bar charts: Verb, Verb-Role-Noun, Full Structure; top-1 and top-5; axis 0 to 70]
Baseline: 5040-way CNN classifier (10 most frequent situations per verb)
Situation CRF
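The three chart panels (Verb, Verb-Role-Noun, Full Structure) suggest scoring a prediction at three levels of strictness. A minimal sketch of such scoring for one image against one reference frame follows. The exact evaluation protocol is not given on the slide, so the exact-match rule, the handling of a wrong verb, and the name `score_prediction` are all assumptions.

```python
def score_prediction(pred_verb, pred_roles, gold_verb, gold_roles):
    """Return (verb, role_noun, full_structure) scores for a single image."""
    if pred_verb != gold_verb:
        # Assumption: roles are tied to the verb, so a wrong verb earns no role credit.
        return 0.0, 0.0, 0.0
    hits = sum(pred_roles.get(role) == noun for role, noun in gold_roles.items())
    role_noun = hits / len(gold_roles)                 # fraction of roles with the right noun
    full = 1.0 if hits == len(gold_roles) else 0.0     # every role must match
    return 1.0, role_noun, full


gold = {"agent": "boy", "object": "car", "part": "tire",
        "tool": "tire iron", "place": "outdoors"}
# Hypothetical prediction that gets the tool wrong.
pred = dict(gold, tool="wrench")

scores = score_prediction("fixing", pred, "fixing", gold)
print(scores)
```

Here the verb is right and four of five role values match, so the full-structure score is zero even though most of the frame is correct. This is why the full-structure numbers are the hardest of the three.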
Generalize to Unseen Combinations
Test situations, with the number of instances in Train:
FEEDING (instances in train: 35)
AGENT: MAN
EATER: BABY
FOOD: MILK
SOURCE: BOTTLE
PLACE: ROOM
FEEDING (instances in train: 7)
AGENT: GIRL
EATER: HORSE
FOOD: CARROT
SOURCE: HAND
PLACE: PEN
FEEDING (instances in train: 0)
AGENT: WOMAN
EATER: HORSE
FOOD: MILK
SOURCE: BOTTLE
PLACE: BARN
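The "instances in train" counts can be illustrated with a small lookup. The slide does not say how a match is defined, so this sketch assumes an exact match on the verb and on every role value. The helper names and the toy training set are hypothetical.

```python
from collections import Counter


def freeze(verb, roles):
    """Turn a situation into a hashable key: the verb plus sorted (role, noun) pairs."""
    return verb, tuple(sorted(roles.items()))


def train_support(train, test):
    """For each test situation, count identical situations in the training set."""
    counts = Counter(freeze(verb, roles) for verb, roles in train)
    return [counts[freeze(verb, roles)] for verb, roles in test]


baby = {"agent": "man", "eater": "baby", "food": "milk", "source": "bottle"}
horse = {"agent": "girl", "eater": "horse", "food": "carrot", "source": "hand"}
unseen = {"agent": "woman", "eater": "horse", "food": "milk", "source": "bottle"}

# Hypothetical training set: the unseen combination never occurs in it.
train = [("feeding", baby), ("feeding", baby), ("feeding", horse)]
test = [("feeding", baby), ("feeding", horse), ("feeding", unseen)]

support = train_support(train, test)
print(support)
```

A situation with zero support can still be predicted by a model that scores each role separately, since every individual role value may have been seen in other combinations.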
Situations Improve Object and Activity Recognition
FIXING
AGENT: BOY
OBJECT: CAR
PART: TIRE
TOOL: TIRE IRON
PLACE: OUTDOORS
[Bar chart: Activity; top-1 and top-5; axis 20 to 70]
[Bar chart: Object; top-1 and top-5; axis 60 to 100]
Legend: object, activity, situation
Errors
PRYING
AGENT: PERSON
ITEM: WOOD
SOURCE: FLOOR
TOOL: CROWBAR
PLACE: ROOM
PAINTING SPRAYING
PUMPING
AGENT: PERSON
ITEM: AIR
SOURCE: AIR
DESTINATION: WHEEL
TOOL: PUMP
PLACE: OUTSIDE
imsitu.org: data / browsing / demo / code
Conclusion
Introduced situation recognition: a role-centric structured representation of what's happening
Collected imSitu: 120k+ images, 500+ verbs, 100k+ situations
Introduced a simple neural CRF model for situations
structure matters
provides strong context for activity and object recognition