BUILDING TEXT FEATURES FOR OBJECT IMAGE CLASSIFICATION
Gang Wang, Derek Hoiem, David Forsyth
Feb 08, 2016
MAIN IDEA
Text-based image features are built using an auxiliary dataset of internet images annotated with tags.
Visual classifier with an object viewed under novel circumstances.
So, basically: text classifier + image classifier = unified classifier.
WHAT ARE THEY TRYING TO DO?
Determine which objects are present in an image based on the text that surrounds similar images drawn from large collections.
CHALLENGES
Sounds easy, but these vary from image to image:
Object appearance
Pose
Illumination
LOW LEVEL FEATURES CAN RESCUE BUT...
Color, texture, and SIFT features could help if we had millions of training samples, but this is unrealistic.
So what can help? Millions of images on the internet. They are not tagged, but the text associated with them helps classification.
EUREKA!!!!!!
Easier to determine image content using surrounding text than with currently available image features.
Given a large enough dataset, we are bound to find images very similar to an input image. So they infer likely text for an input image from the text of similar images.
THE COMMON APPROACH
Approach: Improve annotation quality or filter spurious search results that can be used for training.
The Problem: Noise or ambiguity in annotations can easily nullify any benefit.
Proposal: Learn a distance metric that causes images with similar surrounding text to be similar in visual feature space.
THEIR APPROACH
Build text features for object image classification, as they are expected to capture the direct semantic meaning of an image.
APPROACH EXPLAINED
Dataset = training + test images.
Auxiliary dataset = internet images (Flickr) that have associated text.
For each training image:
Extract visual features.
Find the K nearest neighbor images in the internet dataset.
Use the text associated with these internet images to build a text feature.
Train a classifier on the text features.
Repeat for visual features and combine both.
VISUAL FEATURES
SIFT:
Used for image matching and object recognition.
They use it to detect and describe local patches.
Extract 1000 local patches from each image.
The patches are quantized into 1000 clusters, and each patch is denoted by a cluster index.
Finally, each image is represented as a normalized histogram of cluster indices.
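The quantization and histogram steps above can be sketched as follows. This is a minimal illustration, assuming the patch descriptors and the cluster centers have already been computed; the slides do not say how the clusters are obtained, and the function names are hypothetical.

```python
def nearest_cluster(descriptor, centers):
    """Index of the center closest to the descriptor (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, center in enumerate(centers):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bow_histogram(descriptors, centers):
    """Normalized histogram of cluster indices for the patches of one image."""
    counts = [0.0] * len(centers)
    for descriptor in descriptors:
        counts[nearest_cluster(descriptor, centers)] += 1.0
    total = sum(counts)
    # An image with no patches yields an all-zero histogram.
    return [c / total for c in counts] if total else counts
```

With 1000 patches and 1000 centers per the slide, `bow_histogram` returns a 1000-dimensional vector that sums to 1.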
GIST:
Powerful in scene categorization and retrieval.
They represent each image as a 960-dimensional GIST descriptor.
Color:
Quantize each channel into 8 bins.
Each pixel value is then represented as an integer between 1 and 512 (8 * 8 * 8 = 512 combinations).
Each image is represented as a 512-dimensional histogram.
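A possible implementation of this color feature is sketched below. It assumes 8-bit RGB pixels and uses zero-based bin indices (0 to 511) rather than the 1 to 512 range on the slide; the function name and the channel ordering inside the index are illustrative choices, not taken from the paper.

```python
def color_histogram(pixels, bins_per_channel=8):
    """Normalized joint color histogram.

    pixels: iterable of (r, g, b) tuples with integer values in 0..255.
    Returns a list of bins_per_channel ** 3 values (512 by default).
    """
    step = 256 // bins_per_channel  # 32 intensity levels fall into each bin
    hist = [0.0] * (bins_per_channel ** 3)
    n_pixels = 0
    for r, g, b in pixels:
        # Combine the three per-channel bin numbers into one index.
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1.0
        n_pixels += 1
    return [h / n_pixels for h in hist] if n_pixels else hist
```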
Gradient:
Can be considered a global and coarse SIFT feature.
Divide the image into 4*4 cells.
In each cell, quantize the gradient into 16 bins.
The whole image is represented as a 256-dimensional vector (4 * 4 * 16 = 256).
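The gradient feature can be approximated with the sketch below. Several details are assumptions, since the slide does not give them: gradients come from central differences on a grayscale image, the 16 bins cover orientation over the full circle, and each pixel votes with its gradient magnitude. The function name is hypothetical.

```python
import math


def gradient_feature(image, grid=4, n_bins=16):
    """Coarse gradient-orientation histogram.

    image: 2D list of grayscale values (rows of equal length).
    Returns a normalized vector of grid * grid * n_bins values (256 by default).
    """
    height, width = len(image), len(image[0])
    feature = [0.0] * (grid * grid * n_bins)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat pixels cast no vote
            angle = math.atan2(gy, gx) % (2 * math.pi)
            # Clamp guards against angle rounding up to exactly 2*pi.
            bin_index = min(int(angle / (2 * math.pi) * n_bins), n_bins - 1)
            cell = (y * grid // height) * grid + (x * grid // width)
            feature[cell * n_bins + bin_index] += magnitude
    total = sum(feature)
    return [f / total for f in feature] if total else feature
```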
Unified:
Concatenation of the 4 previously described features.
Let the above features be f1, f2, f3, f4.
Resultant feature: [w1f1, w2f2, w3f3, w4f4]
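The weighted concatenation can be written in a few lines. The sketch below assumes each feature is a plain list of numbers and that the weights are already known; the function name is hypothetical.

```python
def unified_feature(features, weights):
    """Concatenate feature vectors, scaling the j-th vector by the j-th weight."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature vector")
    unified = []
    for weight, feature in zip(weights, features):
        unified.extend(weight * value for value in feature)
    return unified
```

With the four features described above, the result has 1000 + 960 + 512 + 256 = 2728 dimensions.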
HOW TO FIND WEIGHTS:
Learn weights from training images.
Aim to force images from the same category to be close and images from different categories to be far apart.
Randomly select N pairs of images from the training set.
For the ith pair, Si = 1 if the two images share at least one object class; otherwise Si = 0.
Calculate the chi-square distance for each feature fj for the ith pair.
Learn weights: the resulting constrained minimization can be solved directly using “fmincon” in Matlab.
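The inputs to the weight-learning step, the pair labels Si and the per-feature chi-square distances, could be prepared as in the sketch below. It stops short of the optimization itself. The data layout (a dict with "labels" and "features" keys), the function names, and the particular chi-square form are all assumptions made for illustration.

```python
import random


def chi_square(x, y):
    """A common chi-square distance between two histograms (assumed form)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def build_pairs(images, n_pairs, seed=0):
    """Sample image pairs and compute (S_i, [distance for each feature j]).

    images: list of dicts with "labels" (set of class names) and
            "features" (list of histograms, one per visual feature).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        a, b = rng.sample(images, 2)  # two distinct images
        same = 1 if a["labels"] & b["labels"] else 0  # share at least one class
        distances = [chi_square(fa, fb) for fa, fb in zip(a["features"], b["features"])]
        pairs.append((same, distances))
    return pairs
```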
CHI SQUARE???
Chi-square distance (http://www.stat.lsu.edu/faculty/moser/exst7037/geometry.pdf):
Denominator is the normalization component for each point in X.
So for n dimensions:
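For reference, one widely used form of the chi-square distance between two n-dimensional histograms x and y is given below. This is an assumption; the linked notes may normalize the denominator differently.

```latex
\chi^2(x, y) = \frac{1}{2} \sum_{k=1}^{n} \frac{(x_k - y_k)^2}{x_k + y_k}
```

Terms with x_k + y_k = 0 are conventionally skipped.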
FMINCON?????
Finds the minimum of a constrained nonlinear multivariable function.
x = fmincon(fun,x0,A,b)
x = fmincon(fun,x0,A,b,Aeq,beq)
x = fmincon(fun,x0,A,b,Aeq,beq,lb,ub)
...
http://www.mathworks.com/help/toolbox/optim/ug/fmincon.html
AUXILIARY DATASET
Collected from Flickr.
1 million images in total.
Of these, 700,000 images were collected for 58 object categories whose names come from the PASCAL and CALTECH 256 datasets.
The rest are random images collected from a group called “10 million photos”.
TEXT FEATURES
For each training/test image:
Find the K nearest neighbor images in the auxiliary dataset.
Extract the text associated with these images.
Build text features.
“Dogs! Dogs! Dogs!” is treated as a single item.
Use only the frequent tags and group names (6000) in the auxiliary dataset.
The text feature is a normalized histogram of tag and group name counts.
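The steps on this slide can be sketched end to end as follows. The sketch assumes the auxiliary images are already described by feature vectors and tag lists, and it uses squared Euclidean distance for the neighbor search, which is an illustrative choice and not necessarily the distance used in the paper. All names are hypothetical.

```python
def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_nearest(query, aux_features, k):
    """Indices of the k auxiliary images closest to the query feature."""
    order = sorted(range(len(aux_features)),
                   key=lambda i: squared_euclidean(query, aux_features[i]))
    return order[:k]


def text_feature(query, aux_features, aux_tags, vocabulary, k):
    """Normalized histogram of vocabulary items found on the k nearest neighbors.

    aux_tags[i] lists the tags and group names of auxiliary image i; each
    entry, e.g. "Dogs! Dogs! Dogs!", counts as a single item.
    """
    position = {term: i for i, term in enumerate(vocabulary)}
    counts = [0.0] * len(vocabulary)
    for i in k_nearest(query, aux_features, k):
        for term in aux_tags[i]:
            if term in position:  # infrequent items are outside the vocabulary
                counts[position[term]] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts
```

In the setting described here, `vocabulary` would hold the 6000 frequent tags and group names.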
CLASSIFIER
SVM classifier with a chi-squared kernel for text features.
The same classifier is used for visual features as well.
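A chi-squared kernel matrix suitable for an SVM that accepts precomputed kernels might be built as below. The exponential form exp(-gamma * chi2) and the `gamma` parameter are assumptions, since the slides do not state the exact kernel; the function names are hypothetical.

```python
import math


def chi_square(x, y):
    """A common chi-square distance between two histograms (assumed form)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def chi_square_kernel_matrix(rows, cols, gamma=1.0):
    """Matrix K with K[i][j] = exp(-gamma * chi_square(rows[i], cols[j]))."""
    return [[math.exp(-gamma * chi_square(r, c)) for c in cols] for r in rows]
```

For training, `rows` and `cols` would both be the training histograms; at test time, `rows` would be the test histograms and `cols` the training ones.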
FUSION
Build a visual classifier.
Build a text classifier.
A third classifier is trained to combine the confidence values of the above two to give the final prediction.
The final classifier is a logistic regression and is trained on a validation set.
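The fusion step can be illustrated with a small logistic regression over the two confidence values. This is a sketch under assumptions: it is fit with plain batch gradient descent, the learning rate and epoch count are arbitrary, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_fusion(confidences, labels, lr=0.5, epochs=2000):
    """Fit [bias, w_visual, w_text] on validation-set confidence pairs.

    confidences: list of (visual_score, text_score); labels: list of 0/1.
    """
    params = [0.0, 0.0, 0.0]
    n = len(labels)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (visual, text), y in zip(confidences, labels):
            error = sigmoid(params[0] + params[1] * visual + params[2] * text) - y
            grad[0] += error
            grad[1] += error * visual
            grad[2] += error * text
        # Step along the negative mean gradient of the log loss.
        params = [p - lr * g / n for p, g in zip(params, grad)]
    return params


def fuse(params, visual_score, text_score):
    """Final probability that the object is present."""
    return sigmoid(params[0] + params[1] * visual_score + params[2] * text_score)
```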
RESULTS
PASCAL VOC 2006: 10 object categories.
PASCAL VOC 2007: 20 object categories.
Performance is measured quantitatively using AUC (area under the ROC curve) for the 2006 dataset and AP (average precision) for the 2007 dataset.
150 nearest neighbor images are used in all experiments.
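AUC and AP, the two metrics used here, can be computed from classifier scores as sketched below. The AUC is the fraction of positive/negative pairs ranked correctly, with ties counted as half. The AP shown is the plain non-interpolated version, which is an assumption: the PASCAL VOC protocol defines its own interpolated variant, not reproduced here. Function names are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```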
PERFORMANCE METRICS
Performance of text features built with different visual features.
Effects of combining text and visual classifiers.
Effects of varying the number of training images.
Performance of the text features built with a varying number of internet images.
Effects of category names.
For the 2006 dataset:
Text classifier outperforms GIST KNN for each feature.
Unified is best amongst all.
Combination(V) etc. are obtained by training a logistic regression classifier on the validation dataset using the confidence values returned by the individual classifiers.
VARYING NUMBER OF AUXILIARY IMAGES
EXCLUDING CATEGORY NAMES
QUESTIONS???