Deep Learning for Image Understanding

Olivier Morère (1), Julie Petta (2), Jie Lin (3), Vijay Chandrasekhar (3), Antoine Veillard (1)
(1) Université Pierre et Marie Curie, (2) Supélec, (3) A*STAR Institute for Infocomm Research

Image & Pervasive Access Lab, CNRS UMI 2955 - Singapore - www.ipal.cnrs.fr
Team Web & Data Science

Compact Image Representations for Image Similarity Search

[Figure: DeepHash pipeline. Input image -> deep convolutional neural network (or Fisher Vector, 4K dim.) -> high-dimensional global feature (8K-64K dim.). Training phase 1 (unsupervised): stacked regularized RBMs with weights W1, W2, ..., WL. Training phase 2 (fine-tuning): the model is transferred to a deep siamese network trained on matching and non-matching image pairs with losses Loss 1, Loss 2, ..., Loss L. Testing: the trained DeepHash model maps a high-dimensional image descriptor to a compact binary hash of 64-1K bits.]

Video Summarization

[Figure: example summaries with tunable centricity. A single parameter α controls the trade-off: α = 1 is more subject-centric, α = 0 more scene-centric, and 0 < α < 1 interpolates between the two.]

Image Classification

[Figure: CNN architectures. GoogLeNet [Szegedy et al., 2014]. Oxford VGG [Simonyan & Zisserman, 2014]: stacks of Conv-64, Conv-128, Conv-256 and Conv-512 layers with max-pooling, followed by FC-4096, FC-4096, FC-1000 and a softmax. AlexNet / Clarifai [Krizhevsky et al., 2012; Zeiler & Fergus, 2013]: five convolutional layers with max-pooling and normalization, followed by three fully connected layers and a softmax.]

ImageNet 2014 Challenge

LIMITED RESOURCES
• NVIDIA GTX580 (1.5GB memory)
• Two-month effort

OPTIMIZATION
• Multi-crop pooling: class scores from a single CNN are pooled over multiple crops of the input image
• Model fusion: pooled scores from several CNN models are fused

RESULTS
• Multi-crop pooling (single CNN, pooled scores): 12.1%
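The multi-crop pooling and model fusion steps can be sketched as score averaging. The poster does not specify the exact pooling operator, so mean pooling of softmax scores is an assumption here, and all array shapes are illustrative.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_crop_pool(crop_logits):
    """Average class scores over multiple crops of one image.

    crop_logits: (n_crops, n_classes) raw scores from a single CNN.
    Returns pooled class probabilities of shape (n_classes,).
    """
    return softmax(crop_logits).mean(axis=0)

def fuse_models(per_model_scores):
    """Fuse pooled scores from several independently trained CNNs.

    per_model_scores: (n_models, n_classes) pooled probabilities.
    Returns fused probabilities of shape (n_classes,).
    """
    return np.mean(per_model_scores, axis=0)

# Toy example: 2 models, 4 crops each, 5 classes.
rng = np.random.default_rng(0)
pooled = np.stack([multi_crop_pool(rng.normal(size=(4, 5))) for _ in range(2)])
fused = fuse_models(pooled)
prediction = int(np.argmax(fused))
```

Averaging probabilities rather than raw logits keeps each model's contribution on a comparable scale, which is one common way such ensembles are combined.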
• Model fusion over N CNN models (fused scores): 11.4%
• Single CNN on the query image: 15.4%

Learning Multimodal Representations

Tunable Automatic Video Summaries

For each video, a compact and multimodal subject-scene subspace is learnt from high-dimensional CNN descriptors using novel unsupervised deep learning methods. The multimodal representations are used to automatically generate compact summaries from videos. Subject-scene centricity can be tuned with a single parameter.

DEEPHASH
• Binary descriptors (hashes) computed from images
• Unsupervised and supervised deep learning pipelines
• Application to image similarity search

RESULTS
• Very compact binary descriptors in the 32-1024 bit range
• State-of-the-art retrieval results on many publicly available datasets
• Enables similarity search over internet-scale databases

Automatic image understanding with human-like accuracy is the new frontier of artificial intelligence research, and deep learning neural nets are frontrunning the race. While striving to reach and maintain state-of-the-art performance in large-scale image classification, the deep learning group at IPAL is also exploring how deep image models can be used to push the limits in various other fields of application such as image compression, similarity-based image search and automatic video summarization. Feel free to approach us for demos!

[Figure: multimodal representation learning. A subject DCNN and a scene DCNN each extract a high-dimensional descriptor; two RBMs project these into latent subject and scene spaces, each RBM regularized with the other modality (regularize with scene / regularize with subjects).]

[CNN model sizes: Oxford VGG, 16 layers, 138M parameters; AlexNet / Clarifai, 8 layers, 60M parameters.]
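As a rough illustration of the testing stage of a DeepHash-style pipeline, the sketch below binarizes real-valued descriptors by thresholding and retrieves neighbours by Hamming distance. The learnt RBM and siamese projection is replaced by a simple per-dimension median threshold, so this is a stand-in for the trained model, not the actual method; all names and sizes are illustrative.

```python
import numpy as np

def binarize(descriptors, thresholds=None):
    """Map real-valued descriptors to binary hashes by thresholding.

    Stand-in for the trained DeepHash layers: in the real pipeline the
    projection is learnt (stacked RBMs plus siamese fine-tuning).
    """
    if thresholds is None:
        thresholds = np.median(descriptors, axis=0)
    return (descriptors > thresholds).astype(np.uint8)

def hamming_distance(a, b):
    """Number of differing bits between two binary hashes."""
    return int(np.count_nonzero(a != b))

def search(query_hash, database_hashes, k=5):
    """Return indices of the k database hashes closest to the query."""
    dists = [hamming_distance(query_hash, h) for h in database_hashes]
    return np.argsort(dists)[:k]

rng = np.random.default_rng(1)
db = binarize(rng.normal(size=(100, 64)))  # 100 images, 64-bit hashes
q = db[42].copy()
q[:3] ^= 1                                 # perturb 3 bits of the query
top = search(q, db, k=1)
```

Compact binary hashes make large-scale search cheap: Hamming distance on packed bits is a handful of XOR and popcount instructions per comparison, which is what enables the internet-scale retrieval the results above refer to.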
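The single-parameter tuning of subject-scene centricity could look like the following convex combination of per-frame relevance scores. The actual formulation is not given on the poster, so `frame_scores`, the distance-to-mean relevance measure, and all shapes here are hypothetical.

```python
import numpy as np

def frame_scores(subject_feats, scene_feats, alpha):
    """Score frames by a convex combination of subject- and scene-space
    relevance (hypothetical formulation; the poster only states that a
    single parameter tunes centricity, with alpha = 1 subject-centric
    and alpha = 0 scene-centric).

    Relevance is taken as normalized distance to the video's mean
    descriptor, so distinctive frames score higher.
    """
    def relevance(feats):
        d = np.linalg.norm(feats - feats.mean(axis=0), axis=1)
        return d / (d.max() + 1e-12)
    return alpha * relevance(subject_feats) + (1 - alpha) * relevance(scene_feats)

def summarize(subject_feats, scene_feats, alpha, n_frames):
    """Pick the n_frames highest-scoring frames as the summary."""
    s = frame_scores(subject_feats, scene_feats, alpha)
    return np.sort(np.argsort(s)[-n_frames:])

# Toy example: 50 frames with 8-dim latent subject/scene descriptors.
rng = np.random.default_rng(2)
subj = rng.normal(size=(50, 8))
scen = rng.normal(size=(50, 8))
summary = summarize(subj, scen, alpha=0.7, n_frames=5)
```

At alpha = 1 the scene term vanishes entirely, so sliding alpha between 0 and 1 smoothly trades scene coverage for subject coverage in the selected frames.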