Improving Pedestrian Attribute Recognition With Weakly-Supervised Multi-Scale Attribute-Specific Localization

Chufeng Tang 1, Lu Sheng 2, Zhaoxiang Zhang 3, Xiaolin Hu 1*
1 Tsinghua University, 2 Beihang University, 3 Chinese Academy of Sciences

Summary
➢ Problem: previous pedestrian attribute recognition methods fail to indicate the attribute-region correspondence.
➢ Contribution: attribute-specific localization at multiple scales, which finds the most discriminative region for each attribute in a weakly-supervised manner.
➢ Results: improvement across three datasets; end-to-end trainable; lower computational cost.

Motivation
➢ Attribute-agnostic attention: attends to a broad region, with no attribute-region correspondence.
➢ Rigid body-part localization: simply fuses the local features and requires extra computation.
➢ We need attribute-specific localization:
  √ maintains the attribute-region correspondence
  √ fully adaptive, without region annotations
  √ interpretable and computationally efficient

Methodology
➢ Top-down feature pyramid
  low-level details: feature learning
  high-level semantics: localization
➢ Deep supervision for training: all 4 predictions are directly supervised by the ground truth (GT); they are trained insufficiently otherwise.
➢ Maximum voting for inference: choose the most confident prediction.
➢ Loss: weighted binary cross-entropy loss.

[Figure: network overview. Labels: 256 × 128, 32 × 16, 16 × 8, 8 × 4, up × 2, Concat, Attribute Localization Module (1, 2, ..., M), four loss terms, Element-wise Maximum, Attribute Prediction.]

Attribute Localization Module
➢ Spatial transformer: a simplified STN that learns to represent the attribute region. The localization should be adaptive and differentiable (RoI pooling is not).
➢ Feature alignment: a tiny channel-attention sub-network that modulates the inter-channel dependencies, since features from different levels should contribute unequally (some need more details).
➢ One module for each attribute, but still light-weight.

[Figure: Attribute Localization Module. Labels: 1×1 Conv, ReLU, Global Pool, 1×1 Conv, Sigmoid, Mul, Add, Spatial Transformer, FC.]

Ablation Study
Effectiveness of each component

Three different attribute-specific methods
➢ Each attention mask corresponds to one attribute. Over-adaptive: the masks try to cover all pixels but often fail, since there are no accurate localization labels.
➢ Each attribute is associated with predefined parts. Lacks adaptivity: the adaptive factors are discarded, which makes the method less robust to variances.
➢ We achieve a balance between the two extremes using attribute-specific bounding boxes, which are relatively coarse but more interpretable.

[Figure: comparison of Input, Rigid Parts, Attribute Regions, and Attention Masks. Examples: Boots, Glasses, Box.]

Quantitative Results

[Figure: per-attribute results, Baseline vs. Ours, axis from 0.5 to 1. Attributes: Female, Clerk, Boots, LongHair, TightTrousers, Jeans, Vest, Pushing, Hat, Skirt, Customer, Calling, Dress, Shirt, AgeLess16, LongTrousers, HandTrunk, Backpack, Glasses, LeatherShoes, CarryingbyHand, Suit-Up, ClothShoes, Cotton, SSBag, Jacket, ShortSleeve, ShortSkirt, HandBag, Box, Pulling, SportShoes, BaldHead, BlackHair, PlasticBag, TShirt, OtherAttachment, PaperBag, Muffler, Tight, Age31-45, Age17-30, Sweater, CarryingbyArm, CasualShoes, BodyFat, Gathering, BodyThin, Holding, BodyNormal, Talking.]

[Figure: localization examples. Rows: Input, Ours, Attention, Body-parts. Attributes: Longhair, BodyFat, Backpack, PlasticBag, Hat. Annotated gains: +3.1%, +1.7%, +1.3%.]
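The deep supervision and maximum voting steps can be illustrated with a small sketch. It assumes four branches that each output one probability per attribute; the probabilities, the helper names (`bce`, `deep_supervision_loss`, `max_vote`), and the optional `weights` argument are hypothetical, since the poster states only that the loss is a weighted binary cross-entropy and does not give the weighting formula.

```python
import math

def bce(probs, labels, weights=None, eps=1e-7):
    """Mean binary cross-entropy over the attributes of one branch.

    `weights` is a hypothetical per-attribute factor (default 1.0);
    the poster does not specify how the loss is weighted.
    """
    weights = weights or [1.0] * len(probs)
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

def deep_supervision_loss(branch_probs, labels, weights=None):
    """Training: every branch is supervised by the same ground truth."""
    return sum(bce(p, labels, weights) for p in branch_probs)

def max_vote(branch_probs):
    """Inference: element-wise maximum over the branches, per attribute."""
    return [max(scores) for scores in zip(*branch_probs)]

# Four branches, three attributes (made-up probabilities).
branches = [
    [0.20, 0.90, 0.40],
    [0.10, 0.70, 0.60],
    [0.30, 0.80, 0.55],
    [0.25, 0.95, 0.35],
]
labels = [0, 1, 1]

print(max_vote(branches))  # [0.3, 0.95, 0.6]
print(deep_supervision_loss(branches, labels))
```

Each attribute takes its score from whichever branch is most confident, so different attributes may be decided at different scales.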
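The feature alignment step (the tiny channel-attention sub-network) might look roughly like the sketch below. The operator order is an assumption: the poster's diagram lists 1×1 Conv, ReLU, Global Pool, Sigmoid, Mul and Add, and this sketch follows the common squeeze-and-excitation ordering (pool, reduce, ReLU, expand, sigmoid, multiply) with the Add read as a residual connection. The weights `w_reduce` and `w_expand` and the toy input are hypothetical.

```python
import math

def global_avg_pool(feat):
    """feat: C x H x W nested lists -> list of C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]

def matvec(weight, vec):
    """A 1x1 convolution on a pooled C-vector is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def channel_attention(feat, w_reduce, w_expand):
    """Re-weight channels by a learned gate and add the input back (assumed residual)."""
    pooled = global_avg_pool(feat)
    hidden = [max(0.0, h) for h in matvec(w_reduce, pooled)]               # ReLU
    gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w_expand, hidden)]  # Sigmoid
    # Mul then Add: scale each channel by its gate, then add the original value.
    return [[[x * g + x for x in row] for row in ch] for ch, g in zip(feat, gate)]

# Toy input: 2 channels of size 2 x 2.
feat = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
w_reduce = [[1.0, -1.0]]   # 2 channels -> 1 hidden unit
w_expand = [[0.0], [0.0]]  # zero weights give a gate of 0.5 for every channel

out = channel_attention(feat, w_reduce, w_expand)
print(out[0][0][0], out[1][0][0])  # 1.5 3.0
```

With trained weights the gate differs per channel, which is how features from different pyramid levels can contribute unequally.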