Learning Social Affordance for Human-Robot Interaction

Tianmin Shu¹, M. S. Ryoo², Song-Chun Zhu¹
¹ Center for Vision, Cognition, Learning, and Autonomy, UCLA
² Indiana University, Bloomington

Introduction

Objective: Learn explainable knowledge from noisy observations of human interactions in RGB-D videos to enable human-robot interactions.

Key idea: Going beyond traditional object and scene affordances, we propose weakly supervised learning of social affordances for HRI.

Contributions:
• First formulation and hierarchical representation of social affordance
• Weakly supervised learning from noisy skeleton input
• Efficient motion synthesis based on the learned hierarchical affordances

Learning

[Figure: hierarchical model for one interaction instance (e.g., Shake Hands): the interaction category $c$, with prior $p(c)$, generates latent sub-events $s_1, \dots, s_K$ over temporal segments $T_1, \dots, T_K$; each sub-event carries a joint selection and grouping $Z_{s_k}$, with prior $p(Z_{s_k})$, over the observed joint positions $J^t$.]

For one instance $G$ of category $c$, the joint probability is
$$p(G, Z_c) = p(G \mid Z_c)\, p(Z_c),$$
where
$$p(G \mid Z_c) \propto \underbrace{\prod_{k=1}^{K} p(\{J^t\}_{t \in T_k} \mid Z_{s_k}, s_k, c)}_{\text{likelihood}} \cdot \underbrace{p(c)}_{\text{interaction prior}} \cdot \underbrace{\prod_{k=2}^{K} p(s_k \mid s_{k-1}, c)}_{\text{sub-event transition}} \cdot \underbrace{\prod_{k=1}^{K} p(s_k \mid c)}_{\text{sub-event prior}},$$
the prior on joint selection and grouping factorizes over sub-event labels,
$$p(Z_c) = \prod_{s \in S} p(Z_s \mid c),$$
and the per-sub-event likelihood is the product of a sub-goal term $g$ and a motion term $m$:
$$p(\{J^t\}_{t \in T} \mid Z_s, s, c) = g(\{J^t\}_{t \in T}, Z_s, s)\; m(\{J^t\}_{t \in T}, Z_s, s).$$
For $N$ training examples of category $c$, $\mathcal{G} = \{G_n\}_{n=1,\dots,N}$:
$$p(\mathcal{G}, Z_c) = p(Z_c) \prod_{n=1}^{N} p(G_n \mid Z_c).$$

Goal: Obtain the optimal joint selection and grouping and the optimal interaction parsing by maximizing the joint probability.

Algorithm:
• Initialization: skeleton clustering for the initial sub-event parsing.
• Outer loop: a Metropolis-Hastings algorithm for latent sub-event parsing, with splitting, merging, and relabeling dynamics.
• Inner loop: Gibbs sampling for our modified Chinese restaurant process (CRP). The prior on assigning joint $i$ of agent $a$ to group $z^s_{ai}$ in sub-event $s$ is
$$p(z^s_{ai} \mid Z^s_{-ai}) = \begin{cases} \beta\,\dfrac{\gamma}{M - 1 + \gamma} & \text{if } z^s_{ai} > 0,\ M_{z^s_{ai}} = 0, \\[4pt] \beta\,\dfrac{M_{z^s_{ai}}}{M - 1 + \gamma} & \text{if } z^s_{ai} > 0,\ M_{z^s_{ai}} > 0, \\[4pt] 1 - \beta & \text{if } z^s_{ai} = 0, \end{cases}$$
where $z^s_{ai} = 0$ means the joint is not selected, $M$ is the total number of joints, $M_z$ is the number of other joints currently assigned to group $z$, $\beta$ is the probability that a joint is selected, and $\gamma$ is the concentration parameter. Each assignment is then sampled as
$$z^s_{ai} \sim p(\mathcal{G} \mid Z'_c)\, p(z^s_{ai} \mid Z^s_{-ai}).$$
(Implementation sketches of both sampling loops are given at the end of this poster.)

Motion Synthesis

Goal: Given the initial 10 frames (25 fps), synthesize the motion of one agent given the motion of the other agent and the interaction type.

Algorithm: at each time $t$,
1) estimate the current sub-event by dynamic programming (DP);
2) predict the sub-event's ending time $t_0$ and the corresponding sub-goal joint positions;
3) obtain the joint positions at $t + 5$ through interpolation (see the sketch at the end of this poster).

[Figure: synthesis illustration across sub-events $s_1$ and $s_2$, showing the human agent and the synthesized agent, the predicted ending time $t_0$, and the interpolated frame $t + 5$.]

Experiment

UCLA Human-Human-Object Interaction Dataset
• Five types of interactions (Shake Hands, High-Five, Pull Up, Hand Over a Cup, Throw and Catch); on average 23.6 instances per interaction, performed by 8 actors in total. Each instance lasts 2-7 s and is presented at 10-15 fps.
• RGB-D videos, skeletons, and annotations are available at http://www.stat.ucla.edu/~tianmin.shu/SocialAffordance

[Figure: examples of discovered latent sub-events and their sub-goals, with sub-event indices marked along each of the five interactions.]

[Figure: synthesis examples for the five interactions at selected frames, comparing GT, Ours, and an HMM baseline.]

Exp 1: Average joint distance in meters, compared with GT skeletons. [Figure: bar chart of average joint distance for each interaction.]

Exp 2: User study (14 subjects), rated from 1 (worst) to 5 (best). Q1: Successful? Q2: Natural? Q3: Human vs. robot? [Figure: frequencies of high scores (4 or 5) for Q3, comparing GT and Ours on a 0-1 scale.]

Acknowledgment

This research has been sponsored by grants DARPA SIMPLEX project N66001-15-C-4035 and ONR MURI project N00014-16-1-2007.
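Implementation Sketches

The following is a minimal sketch of the inner-loop Gibbs step, directly implementing the modified-CRP conditional stated above. It is an illustration under assumptions, not the authors' implementation: the names `beta`, `gamma`, and the `log_likelihood` stub (standing in for $\log p(\mathcal{G} \mid Z'_c)$) are hypothetical, and empty group labels are not compacted after a sweep.

```python
import numpy as np

def crp_prior(z, assignments, i, beta, gamma):
    """Modified-CRP prior p(z^s_ai | Z^s_-ai) from the poster.
    z == 0: joint i is not selected; z > 0: index of a joint group."""
    M = len(assignments)                          # total number of joints
    if z == 0:
        return 1.0 - beta                         # joint not selected
    M_z = np.sum(np.delete(assignments, i) == z)  # other joints already in group z
    if M_z == 0:
        return beta * gamma / (M - 1 + gamma)     # open a new group
    return beta * M_z / (M - 1 + gamma)           # join an existing group

def gibbs_sweep(assignments, beta, gamma, log_likelihood, rng):
    """One sweep of z^s_ai ~ p(G | Z'_c) p(z^s_ai | Z^s_-ai).
    `log_likelihood(Z)` is a hypothetical stub scoring the grouped joints."""
    for i in range(len(assignments)):
        K = int(assignments.max())                # highest group label in use
        candidates = np.arange(K + 2)             # 0 (unselected), 1..K, K+1 (new)
        log_w = []
        for z in candidates:
            proposal = assignments.copy()
            proposal[i] = z
            log_w.append(log_likelihood(proposal)
                         + np.log(crp_prior(z, assignments, i, beta, gamma)))
        log_w = np.asarray(log_w)
        w = np.exp(log_w - log_w.max())           # stable normalization in log space
        assignments[i] = rng.choice(candidates, p=w / w.sum())
    return assignments
```

Repeated sweeps inside each outer-loop iteration refine which joints are selected and how they are grouped under the current sub-event parse.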
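A companion sketch of the outer Metropolis-Hastings loop. The `propose` helper below is hypothetical: it implements the three dynamics (split / merge / relabel) in the simplest way and assumes symmetric proposals, so the acceptance ratio reduces to the posterior ratio; the actual data-driven proposals may differ. `log_posterior` stands in for $\log p(\mathcal{G}, Z_c)$ with the inner Gibbs loop folded in.

```python
import numpy as np

def propose(parse, move, num_labels, rng):
    """Hypothetical proposal: split a segment, merge two adjacent segments,
    or relabel a segment. A parse is a list of (start, end, label) segments."""
    parse = [list(seg) for seg in parse]
    k = int(rng.integers(len(parse)))
    start, end, label = parse[k]
    if move == "split" and end - start > 1:
        cut = int(rng.integers(start + 1, end))   # random split point
        parse[k:k + 1] = [[start, cut, label],
                          [cut, end, int(rng.integers(num_labels))]]
    elif move == "merge" and len(parse) > 1:
        k = min(k, len(parse) - 2)                # merge segment k with k+1
        parse[k:k + 2] = [[parse[k][0], parse[k + 1][1], parse[k][2]]]
    elif move == "relabel":
        parse[k][2] = int(rng.integers(num_labels))
    return [tuple(seg) for seg in parse]

def mh_parse(parse, log_posterior, num_labels, num_iters, rng):
    """Metropolis-Hastings over latent sub-event parses with
    split / merge / relabel dynamics (symmetric proposals assumed)."""
    cur_lp = log_posterior(parse)
    for _ in range(num_iters):
        move = rng.choice(["split", "merge", "relabel"])
        cand = propose(parse, move, num_labels, rng)
        cand_lp = log_posterior(cand)
        if np.log(rng.random()) < cand_lp - cur_lp:   # accept / reject
            parse, cur_lp = cand, cand_lp
    return parse
```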
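Lastly, a sketch of the synthesis loop. The DP sub-event estimation (step 1) and sub-goal prediction (step 2) are passed in as stubs; `estimate_subevent` and `predict_subgoal` are hypothetical names and signatures, and linear interpolation is assumed for concreteness in step 3.

```python
import numpy as np

def interpolate_step(J_t, t, t0, J_goal, step=5):
    """Step 3: joint positions at t + 5 by linear interpolation between the
    current pose J_t and the predicted sub-goal pose J_goal at ending time t0.
    J_t, J_goal: (num_joints, 3) arrays of 3D joint positions."""
    t_next = min(t + step, t0)
    # Fraction of the remaining time covered by this step, clamped to [0, 1].
    alpha = np.clip((t_next - t) / max(t0 - t, 1), 0.0, 1.0)
    return (1.0 - alpha) * J_t + alpha * J_goal

def synthesize(observed, initial, estimate_subevent, predict_subgoal, step=5):
    """One pass of the synthesis loop: at each time t, estimate the current
    sub-event (DP, stubbed), predict its ending time t0 and sub-goal pose,
    then interpolate the synthesized agent's next pose."""
    poses = list(initial)                  # the given initial frames
    t = len(poses) - 1
    while t + step < len(observed):
        s = estimate_subevent(observed[:t + 1], poses)          # stub for DP step 1
        t0, J_goal = predict_subgoal(s, observed[:t + 1], poses[-1])  # stub, step 2
        poses.append(interpolate_step(poses[-1], t, t0, J_goal, step))
        t += step
    return poses
```

Interpolating toward the predicted sub-goal keeps each synthesized step cheap; a smoother motion model could replace `interpolate_step` without changing the loop.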