
Watch, Listen & Learn: Co-training on Captioned Images and Videos
Sonal Gupta, Joohyun Kim, Kristen Grauman, Raymond Mooney

The University of Texas at Austin
{sonaluta, scimitar, grauman, mooney}@cs.utexas.edu

[Motivating figure: three panels contrast watching a muted clip without sound or text, with sound or text, and with only sound or text.]

Introduction

Motivation
• Image recognition and human activity recognition in videos
  - Hard to classify; visual cues are ambiguous
  - Expensive to manually label instances
• Images and videos often come with text captions
  - Leverage multi-modal data
  - Use readily available unlabeled data to improve accuracy

Goals
• Classify images and videos with the help of associated text captions
• Use co-training to achieve better accuracy on image and video classification tasks

Datasets
• Image: 362 instances with 2 classes
• Video: 221 instances with 4 classes

Approach
• Combine two views (text and visual) of images and videos using the co-training learning algorithm (Blum and Mitchell '98)
• Text view
  - Caption of the image or video
  - Readily available
• Visual view
  - Color, texture, and temporal information in the image/video


Algorithm
• Co-training
  - Semi-supervised learning paradigm that exploits two mutually independent and sufficient views
• Features of the dataset can be divided into two sets:
  - The instance space: X = X1 × X2
  - Each example: x = (x1, x2)
• Proven to be effective in several domains
  - Web page classification (content and hyperlinks)
  - E-mail classification (header and body)
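To make the loop concrete, here is a minimal co-training sketch. It assumes NumPy feature matrices for the two views and scikit-learn SVMs; the classifier choice, number of rounds, and instances added per round are illustrative assumptions, not the exact settings behind the poster's experiments.

```python
import numpy as np
from sklearn.svm import SVC

def co_train(X_text, X_visual, y, labeled_idx, unlabeled_idx,
             rounds=20, per_round=4):
    """Minimal co-training loop over a text view and a visual view.

    `y` holds labels for the labeled instances; entries at unlabeled
    positions are ignored until they are pseudo-labeled.
    """
    labeled, unlabeled = list(labeled_idx), list(unlabeled_idx)
    y_work = np.array(y, dtype=float)          # pseudo-labels filled in over time
    text_clf = SVC(probability=True)
    visual_clf = SVC(probability=True)

    for _ in range(rounds):
        if not unlabeled:
            break
        # Retrain one classifier per view on the current labeled pool.
        text_clf.fit(X_text[labeled], y_work[labeled])
        visual_clf.fit(X_visual[labeled], y_work[labeled])

        # Each view pseudo-labels the unlabeled instances it is most confident about.
        newly_labeled = []
        for clf, X in ((text_clf, X_text), (visual_clf, X_visual)):
            probs = clf.predict_proba(X[unlabeled])
            most_confident = np.argsort(probs.max(axis=1))[-per_round:]
            for i in most_confident:
                idx = unlabeled[i]
                y_work[idx] = clf.classes_[probs[i].argmax()]
                newly_labeled.append(idx)

        # Move the newly labeled instances into the labeled pool.
        newly_labeled = list(dict.fromkeys(newly_labeled))   # drop duplicates
        labeled.extend(newly_labeled)
        unlabeled = [i for i in unlabeled if i not in newly_labeled]

    return text_clf, visual_clf
```

At test time, a new instance can be labeled by applying both returned classifiers to its two views and combining their predicted probabilities, as in the late-fusion example below.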

Experimental Results

Baselines
• Uni-modal
  - Image/Video view: only image/video features are used
  - Text view: only textual features are used
• Multi-modal
  - Early Fusion: concatenate visual and textual features and train a classifier
  - Late Fusion: run separate classifiers on each view and concatenate their results
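The two multi-modal baselines can be sketched as follows. The SVM classifier and the probability-averaging rule for late fusion are illustrative assumptions; the poster concatenates the per-view outputs rather than averaging them.

```python
import numpy as np
from sklearn.svm import SVC

def early_fusion_fit(X_text, X_visual, y):
    # Early fusion: concatenate the two feature vectors, train one classifier.
    return SVC(probability=True).fit(np.hstack([X_text, X_visual]), y)

def late_fusion_predict(text_clf, visual_clf, X_text, X_visual):
    # Late fusion (illustrative): average the per-class probabilities of the
    # two per-view classifiers and pick the most likely class.
    probs = (text_clf.predict_proba(X_text) +
             visual_clf.predict_proba(X_visual)) / 2.0
    return text_clf.classes_[probs.argmax(axis=1)]
```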

Conclusion
• Combining textual and visual features can help improve accuracy
• Co-training can be useful for combining textual and visual features to classify images and videos
• Co-training helps reduce the labeling effort for images and videos

References
[1] Bekkerman and Jeon, Multi-modal Clustering for Multimedia Collections, CVPR 2007.
[2] Blum and Mitchell, Combining Labeled and Unlabeled Data with Co-training, COLT 1998.
[3] Laptev, On Space-Time Interest Points, IJCV 2005.
[4] Witten and Frank, Weka Data Mining Tool.

Example image captions (image classes: Desert, Trees)
• Cultivating farming at Nabataean Ruins of the Ancient Avdat
• Bedouin Leads His Donkey That Carries Load Of Straw
• Ibex Eating In The Nature
• Entrance To Mikveh Israel Agricultural School

Example video commentary
• That was a very nice forward camel.
• Well I remember her performance last time.
• He has some delicate hand movement.
• She gave a small jump while gliding.
• He runs in to chip the ball with his right foot.
• He runs in to take the instep drive and executes it well.
• The small kid pushes the ball ahead with his tiny kicks.

Feature Extraction

Text Feature
• Raw text commentary → Porter stemmer → remove stop words
• Standard bag-of-words representation
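A minimal sketch of this text pipeline, assuming NLTK's Porter stemmer and scikit-learn's English stop-word list; the exact tokenization and vocabulary settings are not specified on the poster.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def stem_tokens(text):
    # Keep alphabetic tokens, drop stop words, reduce each word to its Porter stem.
    words = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(w) for w in words if w not in ENGLISH_STOP_WORDS]

captions = [
    "He runs in to chip the ball with his right foot.",
    "She gave a small jump while gliding.",
]
# Bag-of-words matrix: one row per caption, one column per stemmed word.
vectorizer = CountVectorizer(tokenizer=stem_tokens, lowercase=False, token_pattern=None)
X_text = vectorizer.fit_transform(captions)
```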

Image Feature
• Divide each image into a 4×6 grid
• Capture the texture and color distributions of each cell in a 30-dimensional vector
• Cluster the vectors with k-means to quantize the features into a dictionary of visual words
• Represent each image in terms of the dictionary
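A rough sketch of the grid step, using a simple per-channel color histogram (10 bins × 3 channels = 30 dimensions) as a stand-in for the poster's texture-and-color cell descriptor, which is not detailed here.

```python
import numpy as np

def grid_cell_descriptors(image, rows=4, cols=6, bins=10):
    """Split an H x W x 3 uint8 image into a rows x cols grid and describe each cell.

    Each cell is summarized by a concatenated per-channel color histogram
    (bins per channel x 3 channels = 30 dimensions).
    """
    h, w, _ = image.shape
    descriptors = []
    for r in range(rows):
        for c in range(cols):
            cell = image[r * h // rows:(r + 1) * h // rows,
                         c * w // cols:(c + 1) * w // cols]
            hist = [np.histogram(cell[..., ch], bins=bins, range=(0, 256))[0]
                    for ch in range(3)]
            descriptors.append(np.concatenate(hist).astype(float))
    return np.array(descriptors)          # shape: (rows * cols, 30)
```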

Video Feature
• Detect interest points: Harris-Förstner corner detector over both space and time
• Describe interest points: Histograms of Oriented Gradients (HoG)
• Create a spatio-temporal vocabulary: quantize the interest points into a dictionary of 200 visual words
• Represent each video in terms of the dictionary
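The quantization step is shared by the image and video pipelines. A minimal sketch with scikit-learn's k-means, assuming the local descriptors (grid-cell vectors or HoG descriptors of space-time interest points) have already been computed.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, n_words=200):
    # Cluster the pooled local descriptors into a dictionary of visual words.
    return KMeans(n_clusters=n_words, n_init=10).fit(all_descriptors)

def bag_of_visual_words(vocabulary, descriptors):
    # Assign each descriptor to its nearest visual word and count occurrences.
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)    # normalized histogram per image/video
```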

[Co-training diagram]
• Initially labeled instances: train a text classifier and a visual classifier, each on its own view, by supervised learning.
• Unlabeled instances: each classifier labels the unlabeled instances about which it is most confident.
• The newly labeled instances are added to the labeled pool and both classifiers are retrained; the process repeats.
• Label a new instance: apply the text and visual classifiers to its two views and combine their predictions.

[Results, image dataset: co-training vs. supervised SVM; co-training vs. semi-supervised EM]
[Results, video dataset: co-training vs. supervised SVM; co-training (tested on the video view) vs. SVM]