BEYOND SIMPLE FEATURES: A LARGE-SCALE FEATURE SEARCH APPROACH TO UNCONSTRAINED FACE RECOGNITION
Nicolas Pinto (Massachusetts Institute of Technology), David Cox (The Rowland Institute at Harvard, Harvard University)
International Conference on Automatic Face and Gesture Recognition (FG), 2011.
Outline
- Introduction
- Method
  - V1-like visual representation
  - High-throughput-derived multilayer visual representations
  - Kernel Combination
- Experiment
- Result
- Discussion
Introduction
“Biologically-inspired” representations capture aspects of the computational architecture of the brain and mimic its computational abilities.
Introduction
Large-scale feature search framework: generate many models with different parameters, then screen them for performance.
Method - V1-like visual representation
“Null model”: represents only a first-order description of the primary visual cortex.
Details:
- Preprocessing: resize the image to 150 pixels (aspect ratio preserved) using bicubic interpolation
- Input normalization: divide each pixel’s intensity by the norm of the pixels in its 3x3 neighborhood
- Gabor wavelets: 16 orientations, 6 spatial frequencies
- Output normalization: divide by the norm of the pixels in the 3x3 neighborhood
- Thresholding and clipping: output values outside (0, 1) are clipped to {0, 1}
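The V1-like pipeline above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 9x9 Gabor support, the wavelength set, and the zero-padded "same" convolution are all assumptions made here for brevity.

```python
# Hedged sketch of the V1-like "null model": local input normalization,
# a 16-orientation x 6-frequency Gabor filterbank, local output
# normalization, and clipping to [0, 1].
import numpy as np
from scipy.signal import convolve2d

def local_norm(x, size=3, eps=1e-5):
    """Divide each pixel by the L2 norm of its size x size neighborhood."""
    ones = np.ones((size, size))
    norm = np.sqrt(convolve2d(x * x, ones, mode="same")) + eps
    return x / norm

def gabor(size, wavelength, orientation, sigma=None):
    """Real (cosine-phase) Gabor kernel; sigma heuristic is an assumption."""
    sigma = sigma or size / 6.0
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(orientation) + y * np.sin(orientation)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return g * np.cos(2 * np.pi * xr / wavelength)

def v1like_features(img):
    x = local_norm(img.astype(float))                      # input normalization
    maps = []
    for o in np.linspace(0, np.pi, 16, endpoint=False):    # 16 orientations
        for wl in (2, 3, 4, 6, 11, 18):                    # 6 frequencies (illustrative)
            r = convolve2d(x, gabor(9, wl, o), mode="same")
            r = local_norm(r)                              # output normalization
            maps.append(np.clip(r, 0.0, 1.0))              # threshold and clip
    return np.dstack(maps)

feats = v1like_features(np.random.rand(32, 32))
print(feats.shape)   # (32, 32, 96)
```

With 16 orientations and 6 frequencies, each input yields a 96-map feature stack.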
V1-like visual representation Gabor Filter
Method - High-throughput-derived multilayer visual representations
Model architecture: candidate models were composed of a hierarchy of two (HT-L2) or three (HT-L3) layers.
High-throughput-derived multilayer visual representations
Input size: HT-L2: 100 x 100 pixels; HT-L3: 200 x 200 pixels.
Input was converted to grayscale and locally normalized.
High-throughput-derived multilayer visual representations Linear Filter
The input is linearly filtered by a bank of filters to produce a stack of feature maps.
This operation is analogous to the weighted integration of synaptic inputs, where each filter in the bank represents a different cell.
High-throughput-derived multilayer visual representations Linear Filter (cont.)
Parameters:
- Filter shapes were chosen randomly from {3, 5, 7, 9}.
- Depending on the layer l considered, the number of filters was chosen randomly from a layer-dependent set.
- All filter kernels were fixed to random values drawn from a uniform distribution.
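A minimal sketch of this random filtering stage, under the assumptions stated here: square kernels, a naive "valid" correlation loop, and 16 filters over a 4-map input stack (the mean-subtraction step is illustrative, not from the slides).

```python
# Random filterbank: kernel size drawn from {3, 5, 7, 9}, weights drawn
# from a uniform distribution, applied to a stack of feature maps.
import numpy as np
rng = np.random.default_rng(0)

def random_filterbank(n_filters, size, depth):
    """n_filters kernels of shape (size, size, depth) with uniform weights."""
    fb = rng.uniform(size=(n_filters, size, size, depth))
    fb -= fb.mean(axis=(1, 2, 3), keepdims=True)   # illustrative centering
    return fb

def filter_layer(stack, fb):
    """'Valid' correlation of a feature-map stack with each filter."""
    n, fs, _, _ = fb.shape
    h, w = stack.shape[0] - fs + 1, stack.shape[1] - fs + 1
    out = np.empty((h, w, n))
    for k in range(n):
        for i in range(h):
            for j in range(w):
                out[i, j, k] = np.sum(stack[i:i+fs, j:j+fs, :] * fb[k])
    return out

fs = rng.choice([3, 5, 7, 9])          # filter shape drawn from {3, 5, 7, 9}
fb = random_filterbank(16, fs, depth=4)
out = filter_layer(rng.uniform(size=(20, 20, 4)), fb)
print(out.shape)
```

Each of the 16 filters plays the role of one "cell" integrating its inputs.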
High-throughput-derived multilayer visual representations Activation Function
Output values were clipped to lie within a parametrically defined range.
High-throughput-derived multilayer visual representations Activation Function (cont.)
Parameters: the lower clipping bound was randomly chosen to be −∞ or 0; the upper bound was randomly chosen to be 1 or +∞.
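The activation stage is just an elementwise clip; a tiny sketch with one sampled configuration (values here are made up for illustration):

```python
# Parametric activation: clip each output to [lower, upper], where the
# bounds are drawn per-model (e.g. lower in {-inf, 0}, upper in {1, +inf}).
import numpy as np

x = np.array([-2.0, -0.5, 0.3, 1.7])
lower, upper = 0.0, 1.0          # one sampled configuration
y = np.clip(x, lower, upper)
print(y)
```

With lower = −∞ and upper = +∞ the activation degenerates to the identity; with lower = 0 it acts as a rectifier.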
High-throughput-derived multilayer visual representations Pooling
Values in a neighboring region were then pooled together, and the resulting outputs were spatially downsampled.
High-throughput-derived multilayer visual representations Pooling (cont.)
Parameters:
- The stride was fixed to 2, resulting in a downsampling factor of 4.
- The neighborhood size was randomly chosen from {3, 5, 7, 9}.
- The exponent p was randomly chosen from {1, 2, 10}: p = 1 is equivalent to blurring; p = 2 or 10 is a p-norm.
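A sketch of this pooling stage, assuming a straightforward sliding-window p-norm (the 16x16 input and the size-5 neighborhood are illustrative choices):

```python
# p-norm pooling over a local neighborhood, followed by downsampling via
# stride 2; p = 1 reduces to blurring (a plain local sum), p = 2 or 10
# gives a p-norm as on the slide.
import numpy as np

def pnorm_pool(fmap, size=5, p=2, stride=2):
    h, w = fmap.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = np.sum(np.abs(patch) ** p) ** (1.0 / p)
    return out

fmap = np.random.rand(16, 16)
pooled = pnorm_pool(fmap, size=5, p=2, stride=2)
print(pooled.shape)   # (6, 6)
```

As p grows (e.g. p = 10), the p-norm approaches max-pooling over the neighborhood.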
High-throughput-derived multilayer visual representations Normalization
Draws biological inspiration from the competitive interactions observed in natural neuronal systems (e.g. contrast gain control mechanisms in cortical area V1, and elsewhere)
High-throughput-derived multilayer visual representations Normalization (cont.)
Parameters:
- The neighborhood size was randomly chosen from {3, 5, 7, 9}.
- The centering parameter was chosen from {0, 1}.
- The vector of neighboring values could also be stretched by a randomly chosen gain value.
- The threshold value was randomly chosen from a small fixed set.
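The normalization stage can be sketched as divisive normalization by the neighborhood norm. This is an interpretation under stated assumptions: the gain, threshold, and "divide only above threshold" behavior here are illustrative, not the paper's exact rule.

```python
# Divisive normalization: each value is divided by the gain-scaled L2 norm
# of its neighborhood; the centering flag mirrors the {0, 1} parameter.
import numpy as np
from scipy.signal import convolve2d

def div_norm(fmap, size=3, gain=1.0, thresh=1.0, center=True):
    ones = np.ones((size, size))
    x = fmap.copy()
    if center:                                    # the {0, 1} centering parameter
        x = x - convolve2d(x, ones / ones.size, mode="same")
    norm = gain * np.sqrt(convolve2d(x * x, ones, mode="same"))
    norm = np.where(norm < thresh, 1.0, norm)     # only divide above threshold
    return x / norm

out = div_norm(np.random.rand(12, 12))
print(out.shape)   # (12, 12)
```

This is the kind of contrast gain control the slide attributes to cortical area V1.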
Method - Evaluation
Binary hard-margin linear SVM, using 4 feature vectors per comparison.
Method Model overview
Method – Screening
Screening (model selection):
- Select the best five models on the LFW View 1 aligned set
- Output dimensionality ranges from 256 to 73,984
- Number of candidate models: HT-L2: 5,915; HT-L3: 6,917
Feature Augmentation - multiple rescaled crops
- Three different centered crops: 250x250, 150x150, 125x75
- Each resized to the standard input size
- SVMs trained separately per crop
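The crop-and-rescale augmentation can be sketched as below. The nearest-neighbor resize is a stand-in for the paper's interpolation, and the 200x200 target is an assumption (it matches the HT-L3 input size).

```python
# Multi-crop feature augmentation: three centered crops of fixed sizes,
# each resized back to the model's standard input size.
import numpy as np

def center_crop(img, ch, cw):
    h, w = img.shape[:2]
    top, left = (h - ch) // 2, (w - cw) // 2
    return img[top:top + ch, left:left + cw]

def resize_nn(img, out_h, out_w):
    """Nearest-neighbor resize (stand-in for proper interpolation)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[np.ix_(rows, cols)]

img = np.random.rand(250, 250)
crops = [resize_nn(center_crop(img, ch, cw), 200, 200)
         for ch, cw in [(250, 250), (150, 150), (125, 75)]]
print([c.shape for c in crops])   # [(200, 200), (200, 200), (200, 200)]
```

Each rescaled crop then feeds its own SVM, and the resulting kernels are blended downstream.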
Kernel Combination - three strategies
- Blend kernels resulting from different crops: simple kernel addition, with each kernel trace-normalized
- Blend the 5 best models within the same class
- Hierarchical blends across model classes: assign exponentially larger weights to higher-level representations (V1-like < HT-L2 < HT-L3)
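The blending strategies above can be sketched as trace-normalized kernel addition with level weights. The toy data, feature dimensions, and the exact weight base (powers of 2) are assumptions for illustration only.

```python
# Trace-normalized kernel addition: each kernel is rescaled so its trace
# equals n before summing, so no single representation dominates; weights
# grow with representation level (V1-like < HT-L2 < HT-L3).
import numpy as np
rng = np.random.default_rng(0)

def linear_kernel(X):
    return X @ X.T

def trace_normalize(K):
    return K * (K.shape[0] / np.trace(K))

# One toy feature matrix per "representation level" (hypothetical data).
levels = [rng.normal(size=(10, d)) for d in (8, 16, 32)]
weights = [2.0 ** i for i in range(len(levels))]   # exponentially larger weights
K = sum(w * trace_normalize(linear_kernel(X))
        for X, w in zip(levels, weights))
print(K.shape)   # (10, 10)
```

The combined K stays symmetric and positive semidefinite, so it can be fed to any kernel SVM as a precomputed kernel.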
Kernel Combination - Kernel Method
Kernel Combination
The original formulation is equivalent to the combined-kernel formulation.
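The equivalence alluded to here is standard kernel algebra (not specific to this paper): summing kernels is the same as concatenating the underlying feature maps,

```latex
k_1(x, y) + k_2(x, y)
  = \langle \phi_1(x), \phi_1(y) \rangle + \langle \phi_2(x), \phi_2(y) \rangle
  = \big\langle [\phi_1(x), \phi_2(x)],\; [\phi_1(y), \phi_2(y)] \big\rangle ,
```

so training an SVM on the summed kernel is equivalent to training a linear SVM on the concatenated feature vectors.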
Kernel Combination - Multiple Kernel Learning (MKL)
Learn the kernel directly from data.
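In the standard convex-combination form of MKL (again, textbook form rather than a formula from these slides), the combined kernel is

```latex
k(x, y) = \sum_{m=1}^{M} \beta_m \, k_m(x, y),
\qquad \beta_m \ge 0, \quad \sum_{m=1}^{M} \beta_m = 1,
```

with the weights \(\beta_m\) learned jointly with the SVM parameters instead of being fixed by hand as in the blending strategies above.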
Experiment
- Screen models on LFW View 1
- Train SVMs and evaluate using 10-fold cross-validation on LFW View 2
Result
Result Some error cases
Discussion
- The models use whole-image pixel values and do not explicitly deal with pose variation
- Do they take advantage of background information, or are they disturbed by the background?
- Performance increases when adding different crops
16-GPU Monster-Class Supercomputer
Environment: GNU/Linux; Python, C, C++, Cython; CUDA, PyCUDA