Relative Attributes
Devi Parikh (TTIC) and Kristen Grauman (UT Austin)

1. Main Idea

Motivation: Categorical (binary) attributes are restrictive and can be unnatural.
[Figure: example images labeled Natural / Not natural / ? and Smiling / Not smiling / ?]

Proposed idea: Relative Attributes
- Richer communication between humans and machines
- Describe images or categories relatively, e.g. “dogs are furrier than giraffes”, “find less congested downtown Chicago scene than [image]”
- Learn a ranking function for each attribute

Enables new applications:
- Novel zero-shot learning from attribute comparisons
- Precise automatically generated textual descriptions of images

2. Learning Relative Attributes

For each attribute $a_m$, supervision is
- $O_m$: ordered pairs $\{(i, j), \ldots\}$, where image $i$ shows more of the attribute than image $j$
- $S_m$: similar pairs $\{(i, j), \ldots\}$, where images $i$ and $j$ show the attribute to a similar degree

Learn a scoring function
\[ r_m(x_i) = w_m^T x_i \]
that best satisfies the constraints:
\[ \forall (i, j) \in O_m : \; w_m^T x_i > w_m^T x_j \]
\[ \forall (i, j) \in S_m : \; w_m^T x_i = w_m^T x_j \]

Max-margin learning to rank formulation (adapted objective from [Joachims, 2002]):
\[ \min \left( \frac{1}{2} \| w_m^T \|_2^2 + C \left( \sum \xi_{ij}^2 + \sum \gamma_{ij}^2 \right) \right) \]
\[ \text{s.t.} \quad w_m^T (x_i - x_j) \ge 1 - \xi_{ij}, \; \forall (i, j) \in O_m ; \]
\[ | w_m^T (x_i - x_j) | \le \gamma_{ij}, \; \forall (i, j) \in S_m ; \]
\[ \xi_{ij} \ge 0 ; \quad \gamma_{ij} \ge 0 \]

[Figure: image, learnt relative attributes (Young: ..., Smiling: ..., …), relative attributes space]

3. Ranking Function vs. Binary Classifier Score

How do learned ranking functions differ from classifier outputs?
[Figure: $w_b$ (classifier) and $w_m$ (ranker)]

% correctly ordered pairs
                   Classifier   Ranker
Outdoor scenes     80%          89%
Celebrity faces    67%          82%

4. Relative Zero-shot Learning

Training: Images from S seen categories and descriptions of U unseen categories
Testing: Categorize image into one of N (= S + U) categories
- Need not use all attributes
- Need not relate to all S seen categories
- Infer image category using max-likelihood
[Figure: unseen categories described relative to seen categories on Young (H, C, S, Z, M) and Smiling (Z, M); categories S, H, Z, C, M placed in the Youth vs. Smiling space]

5. Describing Images Relatively

Auto-generate textual description of: [image]
[Figure: learnt relative attributes (Density: ..., …), relative attributes space; images ordered along Density (C C H H H C F H H M F F I F), 1/8 dataset]

Relative description: “more dense than [image], less dense than [image]” or “more dense than Highways, less dense than Forests”
Whereas conventional binary description: “not dense”
[Legend: Not dense: [images]; Dense: [images]]

6. Datasets

Outdoor Scene Recognition (OSR): 2688 images, 8 categories: coast (C), forest (F), highway (H), insidecity (I), mountain (M), opencountry (O), street (S) and tallbuilding (T); gist features.
Public Figure Face (PubFig): 800 images, 8 categories: Alex Rodriguez (A), Clive Owen (C), Hugh Laurie (H), Jared Leto (J), Miley Cyrus (M), Scarlett Johansson (S), Viggo Mortensen (V) and Zac Efron (Z); gist and color features.

OSR (binary column order: T I S H C O M F)
Attribute            Binary      Relative
natural              00001111    T≺I∼S≺H≺C∼O∼M∼F
open                 00011110    T∼F≺I∼S≺M≺H∼C∼O
perspective          11110000    O≺C≺M∼F≺H≺I≺S≺T
large-objects        11100000    F≺O∼M≺I∼S≺H∼C≺T
diagonal-plane       11110000    F≺O∼M≺C≺I∼S≺H≺T
close-depth          11110001    C≺M≺O≺T∼I∼S∼H∼F

PubFig (binary column order: A C H J M S V Z)
Attribute            Binary      Relative
Masculine-looking    11110011    S≺M≺Z≺V≺J≺A≺H≺C
White                01111111    A≺C≺H≺Z≺J≺S≺M≺V
Young                00001101    V≺H≺C≺J≺A≺S≺Z≺M
Smiling              11101101    J≺V≺H≺A∼C≺S∼Z≺M
Chubby               10000000    V≺J≺H≺C≺Z≺M≺S≺A
Visible-forehead     11101110    J≺Z≺M≺S≺A∼C∼H∼V
Bushy-eyebrows       01010000    M≺S≺Z≺V≺H≺A≺C≺J
Narrow-eyes          01100011    M≺J≺S≺A≺H≺C≺V≺Z
Pointy-nose          00100001    A≺C≺J∼M∼V≺S≺Z≺H
Big-lips             10001100    H≺J≺V≺Z≺C≺M≺A≺S
Round-face           10001100    H≺V≺J≺C≺Z≺A≺S≺M

7. Image Description Results

Human subject experiment: Which image is?
- More smiling than [image], less smiling than [image]
- More VisFHead than [image], less VisFHead than [image]
- More chubby than [image], less chubby than [image]
[Figure: % correct image in top choices vs. # top choices (1, 2, 3), Binary vs. Relative, for OSR and PubFig]

Example descriptions

Image 1
  Binary: not natural, not open, perspective
  Relative: more natural than tallbuilding; less natural than forest; more open than tallbuilding; less open than coast; more perspective than tallbuilding
Image 2
  Binary: not natural, not open, perspective
  Relative: more natural than insidecity; less natural than highway; more open than street; less open than coast; more perspective than highway; less perspective than insidecity
Image 3
  Binary: natural, open, perspective
  Relative: more natural than tallbuilding; less natural than mountain; more open than mountain; less perspective than opencountry
Image 4
  Binary: White, not Smiling, VisibleForehead
  Relative: more White than AlexRodriguez; more Smiling than JaredLeto; less Smiling than ZacEfron; more VisibleForehead than JaredLeto; less VisibleForehead than MileyCyrus
Image 5
  Binary: White, not Smiling, not VisibleForehead
  Relative: more White than AlexRodriguez; less White than MileyCyrus; less Smiling than HughLaurie; more VisibleForehead than ZacEfron; less VisibleForehead than MileyCyrus
Image 6
  Binary: not Young, BushyEyebrows, RoundFace
  Relative: more Young than CliveOwen; less Young than ScarlettJohansson; more BushyEyebrows than ZacEfron; less BushyEyebrows than AlexRodriguez; more RoundFace than CliveOwen; less RoundFace than ZacEfron

8. Zero-shot Learning Results

Baselines:
- Direct Attribute Prediction (DAP) [Lampert et al. 2009] (binary)
- Classifier instead of ranker (SRA)

\[ \hat{c}(x) = \arg\max_{c} \prod_{m} P(a_m^c \mid x) \]

[Figure: accuracy (0 to 60) of DAP, SRA and Proposed on OSR and PubFig, four panels:
  Number of unseen categories: # unseen categories (0 to 5)
  Amt. of labeled data to learn attributes: # labeled pairs (1, 2, 5, 15)
  Amount of description: # att to describe unseen (6 down to 1 for OSR, 11 down to 1 for PubFig)
  Quality of description: looseness of constraints (1, 2, 3)]

Plot annotations:
- classical recognition problem
- binary ~ relative supervision
- << baseline supervision
- can give unique ordering on all classes
- An attribute is more discriminative when used relatively
- Relative attributes jointly carve out space for unseen category
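
The max-margin ranking formulation of Section 2 can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' solver: the slacks are eliminated (ξ_ij = max(0, 1 − w·(x_i − x_j)) for ordered pairs, γ_ij = |w·(x_i − x_j)| for similar pairs) and the resulting smooth objective is minimized with plain gradient descent. The name `learn_ranker`, the step-size rule and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def learn_ranker(X, ordered, similar, C=1.0, iters=2000):
    """Fit w so that r(x) = w @ x respects the pairwise supervision.

    ordered: pairs (i, j) meaning image i has MORE of the attribute than j.
    similar: pairs (i, j) meaning both images have similar attribute strength.
    """
    X = np.asarray(X, dtype=float)
    dim = X.shape[1]
    # Difference vectors x_i - x_j for each kind of pair.
    d_o = np.array([X[i] - X[j] for i, j in ordered], dtype=float).reshape(-1, dim)
    d_s = np.array([X[i] - X[j] for i, j in similar], dtype=float).reshape(-1, dim)
    # Step size 1/L, with L an upper bound on the gradient's Lipschitz constant.
    lip = 1.0 + 2.0 * C * ((d_o ** 2).sum() + (d_s ** 2).sum())
    w = np.zeros(dim)
    for _ in range(iters):
        xi = np.maximum(0.0, 1.0 - d_o @ w)  # hinge slack of ordered pairs
        gamma = d_s @ w                      # signed score gap of similar pairs
        # Gradient of 0.5*||w||^2 + C*(sum xi^2 + sum gamma^2).
        grad = w + 2.0 * C * (d_s.T @ gamma - d_o.T @ xi)
        w -= grad / lip
    return w

# Toy data (illustrative only): the attribute strength is the first feature.
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [3.02, 1.0]])
ordered_demo = [(1, 0), (2, 1), (3, 2), (3, 0), (2, 0)]
similar_demo = [(4, 3)]
w_demo = learn_ranker(X_demo, ordered_demo, similar_demo, C=10.0)
scores_demo = X_demo @ w_demo  # real-valued attribute strength per image
```

Sorting images by `scores_demo` gives the learned ordering along the attribute; unlike a binary classifier output, the score is trained directly on pairwise comparisons.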
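
The zero-shot setup of Section 4 (seen categories with images, unseen categories with relative descriptions only, max-likelihood inference) can be illustrated as follows. The poster does not spell out the category model, so this sketch assumes one plausible choice: every category is an isotropic Gaussian in the space of ranker scores, and an unseen category's mean on each attribute is placed from its description relative to the seen means. The helpers `place_unseen` and `classify`, the placement rules and all numbers are hypothetical.

```python
import numpy as np

def place_unseen(seen_means, lower=None, upper=None):
    """Mean score of an unseen category on ONE attribute.

    seen_means: dict mapping seen category name to its mean ranker score.
    lower / upper: seen categories with lower < unseen < upper on this attribute
    (either may be omitted; both omitted means the attribute is not used).
    """
    vals = sorted(seen_means.values())
    gap = float(np.mean(np.diff(vals)))  # average spacing between seen means
    if lower is not None and upper is not None:
        return 0.5 * (seen_means[lower] + seen_means[upper])
    if lower is not None:
        return seen_means[lower] + gap
    if upper is not None:
        return seen_means[upper] - gap
    return float(np.mean(vals))

def classify(scores, category_means, var=1.0):
    """Max-likelihood category under Gaussians with a shared variance (assumed)."""
    names = list(category_means)
    means = np.array([category_means[n] for n in names], dtype=float)
    diff = np.asarray(scores, dtype=float) - means
    loglik = -(diff ** 2).sum(axis=1) / (2.0 * var)
    return names[int(np.argmax(loglik))]

# Made-up mean scores of three seen categories on two attributes.
young = {"H": 0.0, "C": 1.0, "S": 3.0}
smiling = {"H": 1.0, "C": 0.0, "S": 2.0}
# Unseen Z: "C is less young than Z, Z is less young than S; Z smiles more than S".
z_mean = [place_unseen(young, lower="C", upper="S"),
          place_unseen(smiling, lower="S")]
means = {name: [young[name], smiling[name]] for name in young}
means["Z"] = z_mean
predicted = classify([2.1, 2.9], means)
```

The description of Z uses only two seen categories per attribute, which matches the notes that an unseen category need not be related to all seen categories or described with all attributes.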
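
The relative descriptions of Section 5 ("more dense than Highways, less dense than Forests") can be generated from ranker scores with a simple lookup. This is a minimal sketch, assuming each reference (an image or a category) already has a score on the attribute; the function `describe`, its `offset` parameter (how many ranks away the references are taken) and the scores below are hypothetical.

```python
import numpy as np

def describe(query_score, ref_scores, ref_names, attribute, offset=1):
    """Describe a query relative to references ranked by attribute score."""
    order = np.argsort(ref_scores)
    sorted_scores = np.asarray(ref_scores, dtype=float)[order]
    # Number of references scoring strictly below the query;
    # a reference with an equal score counts as above the query.
    pos = int(np.searchsorted(sorted_scores, query_score))
    lo = pos - offset        # reference `offset` ranks below the query
    hi = pos + offset - 1    # reference `offset` ranks above the query
    parts = []
    if lo >= 0:
        parts.append(f"more {attribute} than {ref_names[order[lo]]}")
    if hi < len(order):
        parts.append(f"less {attribute} than {ref_names[order[hi]]}")
    return ", ".join(parts)

# Made-up density scores for three reference categories.
names = ["Forests", "Coasts", "Highways"]
density = [0.9, 0.1, 0.4]
text = describe(0.6, density, names, "dense")
```

A larger `offset` picks references further away in rank, which yields a looser but less redundant description; when no reference exists on one side, that half of the description is omitted.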