Spring 2020: Venue: Haag 315, Time: M/W 4-5:15pm
ECE 5582 Computer Vision
Lec 08: Feature Aggregation II
Zhu Li
Dept of CSEE, UMKC
Office: FH560E, Email: [email protected], Ph: x 2346
http://l.web.umkc.edu/lizhu
Slides created with WPS Office Linux and EqualX LaTeX equation editor
VL_FEAT: vl_dsift
• A dense description of the image: computes SIFT descriptors on a predetermined grid (no scale-space extrema detection)
• Supplements HoG as an alternative texture descriptor
VL_FEAT: vl_dsift
• Compute dense SIFT as a texture descriptor for the image
• [f, dsift] = vl_dsift(single(rgb2gray(im)), 'step', 2);
• There is also a FAST option:
• [f, dsift] = vl_dsift(single(rgb2gray(im)), 'fast', 'step', 2);
• A huge amount of SIFT data will be generated
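A minimal end-to-end sketch (the test image and step size are arbitrary choices; assumes VLFeat is on the path via vl_setup):

im = imread('peppers.png');              % any test image
I  = single(rgb2gray(im));               % vl_dsift expects single-precision grayscale
[frames, descrs] = vl_dsift(I, 'fast', 'step', 2);
% frames: 2 x M grid-point centers; descrs: 128 x M descriptors
fprintf('%d descriptors of dimension %d\n', size(descrs, 2), size(descrs, 1));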
Fisher Vector
• Fisher Vector and variations:
• Winning in image classification
• Winning in the MPEG object re-identification:
o SCFV (Scalable Coded Fisher Vector) in CDVS
Codebook: Gaussian Mixture Model (GMM)
• GMM is a generative model to express data
• Assume each feature is generated from a mixture of Gaussians with parameters {w_k, μ_k, Σ_k}:
x ~ \sum_{k=1}^{K} w_k \mathcal{N}(\mu_k, \Sigma_k)
\mathcal{N}(x; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)}
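A minimal MATLAB sketch of this density, assuming diagonal covariances (the function and variable names are mine):

% Evaluate a diagonal-covariance GMM density at a single point x
% w: 1 x K weights; mu: D x K means; sigma2: D x K per-dimension variances
function p = gmm_density(x, w, mu, sigma2)
  [D, K] = size(mu);
  p = 0;
  for k = 1:K
    diff  = x - mu(:,k);
    normc = (2*pi)^(D/2) * sqrt(prod(sigma2(:,k)));  % (2*pi)^(d/2) |Sigma_k|^(1/2)
    p = p + w(k) * exp(-0.5 * sum(diff.^2 ./ sigma2(:,k))) / normc;
  end
end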
A bit of Theory: Fisher Kernel
• Encode the deviation from the generative model
• Observed feature set {x1, x2, …, xn} in R^d, e.g., d = 128 for SIFT
• How do these observations deviate from the given GMM model with parameters λ = {w_k, μ_k, Σ_k}?
o i.e., how should a parameter, e.g., a mean, move to best fit the observations?
[Figure: GMM components being pulled toward the observed features x1, x2, …]
A bit of Theory: Fisher Kernel
• Score function w.r.t. the likelihood function u_λ(X)
• G_λ^X = ∇_λ log u_λ(X): the gradient of the log likelihood
• The score function has dimension m, the number of generative-model parameters (e.g., m = 3 parameter groups {w_k, μ_k, σ_k} for a GMM)
• Given the observed data X, the score function indicates how a likelihood parameter (e.g., a mean) should move to better fit the data
• Distance/deviation of two observations X, Y w.r.t. the generative model:
• Fisher Information Matrix (roughly the covariance in a Mahalanobis distance):
F_λ = E_X[ G_λ^X (G_λ^X)^T ]
• Fisher Kernel distance, normalized by the Fisher Information Matrix:
K_FK(X, Y) = (G_λ^X)^T F_λ^{-1} G_λ^Y
Fisher Vector
• K_FK(X, Y) is a measure of similarity w.r.t. the generative model
• As in the Mahalanobis-distance case, we can decompose this kernel via the Cholesky factorization F_λ^{-1} = L_λ^T L_λ:
K_FK(X, Y) = (G_λ^X)^T F_λ^{-1} G_λ^Y = (Φ_λ^X)^T Φ_λ^Y
• That gives us a kernel feature mapping of X to its Fisher Vector, Φ_λ^X = L_λ G_λ^X
• For observed image features {x_t}, it can be computed as
Φ_λ^X = L_λ (1/T) Σ_t ∇_λ log u_λ(x_t)
GMM Fisher Vector
• Encode the deviation from the generative model
• Observed feature set {x1, x2, …, xn} in R^d, e.g., d = 128 for SIFT
• How do these observations deviate from the given GMM model with parameters λ = {w_k, μ_k, Σ_k}?
• GMM log-likelihood gradient: let γ_t(k) = w_k u_k(x_t) / Σ_j w_j u_j(x_t) be the posterior of component k given x_t; then we have
o weight: ∂L/∂w_k = Σ_t [ γ_t(k)/w_k − γ_t(1)/w_1 ]
o mean: ∂L/∂μ_k^j = Σ_t γ_t(k) (x_t^j − μ_k^j) / (σ_k^j)^2
o variance: ∂L/∂σ_k^j = Σ_t γ_t(k) [ (x_t^j − μ_k^j)^2 / (σ_k^j)^3 − 1/σ_k^j ]
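To make this concrete, a minimal sketch of the posteriors and the mean gradient (diagonal covariances and all variable names are my assumptions):

% X: D x N features; w: 1 x K weights; mu, sigma2: D x K means/variances
[D, N] = size(X);  K = numel(w);
logp = zeros(K, N);
for k = 1:K
  diff = bsxfun(@minus, X, mu(:,k));
  logp(k,:) = log(w(k)) - 0.5*sum(log(2*pi*sigma2(:,k))) ...
            - 0.5*sum(bsxfun(@rdivide, diff.^2, sigma2(:,k)), 1);
end
gamma = exp(bsxfun(@minus, logp, max(logp, [], 1)));  % numerically stabilized
gamma = bsxfun(@rdivide, gamma, sum(gamma, 1));       % posteriors gamma_t(k)
% mean gradient for component k: sum_t gamma_t(k) (x_t - mu_k) ./ sigma_k^2
k = 1;
gMu = sum(bsxfun(@times, gamma(k,:), bsxfun(@minus, X, mu(:,k))), 2) ./ sigma2(:,k);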
GMM Fisher Vector VL_FEAT implementation
• GMM codebook
• For a K-component GMM we only allow 3K parameter groups, {w_k, μ_k, σ_k | k = 1..K}, i.e., each component has an independent diagonal covariance:
Σ_k = diag(σ_{k,1}^2, …, σ_{k,D}^2)
• Posterior probability of feature point x_i under GMM component k:
q_ik = w_k N(x_i; μ_k, Σ_k) / Σ_{j=1..K} w_j N(x_i; μ_j, Σ_j)
GMM Fisher Vector VL_FEAT implementation
• FV encoding
• Gradients w.r.t. the mean and variance of GMM component k, for j = 1..D:
u_jk = (1 / (N sqrt(w_k))) Σ_i q_ik (x_ji − μ_jk) / σ_jk
v_jk = (1 / (N sqrt(2 w_k))) Σ_i q_ik [ ((x_ji − μ_jk) / σ_jk)^2 − 1 ]
• In the end we have a 2K × D aggregation of the deviations w.r.t. the means and variances:
Φ^X = [u_1, u_2, …, u_K, v_1, v_2, …, v_K]
VL_FEAT GMM/FV API
• Compute a GMM model with VL_FEAT
• Prepare data:
numPoints = 1000; dimension = 2;
data = rand(dimension, numPoints);
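• Fit the GMM and Fisher-encode against it (a minimal sketch; numClusters = 30 is an arbitrary choice):

numClusters = 30;
[means, covariances, priors] = vl_gmm(data, numClusters);
% encode a new set of points against the learned GMM codebook
encoding = vl_fisher(rand(dimension, 500), means, covariances, priors);
% encoding has 2 * dimension * numClusters entries (mean and variance parts)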
• Bonus points:
• Encode HoG features with a Fisher Vector?
o Randomly collect 2~3 images from each class
o Stack all HoG features together into an n x 36 data matrix
o Compute its GMM
o Use this GMM to encode all image HoG features (instead of averaging); see the sketch below
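A rough sketch of that bonus pipeline (file names, cell size, and K are placeholder assumptions; vl_hog's Dalal-Triggs variant gives 36-dimensional cells):

% 1) stack HoG cells from a few training images into an n x 36 matrix
cellSize = 8;  X = [];
for f = {'img1.jpg', 'img2.jpg', 'img3.jpg'}   % placeholder file names
  hog = vl_hog(im2single(rgb2gray(imread(f{1}))), cellSize, 'variant', 'dalaltriggs');
  X = [X; reshape(hog, [], size(hog, 3))];     % each row is one 36-dim cell
end
% 2) GMM codebook on the 36 x n transposed data
[means, covariances, priors] = vl_gmm(X', 16); % K = 16, arbitrary
% 3) Fisher-encode any image's HoG cells against this codebook
hog = vl_hog(im2single(rgb2gray(imread('query.jpg'))), cellSize, 'variant', 'dalaltriggs');
fv  = vl_fisher(reshape(hog, [], size(hog, 3))', means, covariances, priors);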
Super Vector Aggregation – Speaker ID
• Fisher Vector: aggregates features against a GMM
• Super Vector: aggregates a GMM against a GMM
• Ref:
o W. M. Campbell, D. E. Sturim, D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, 13(5): 308-311, 2006
[Figure: speaker ID example, utterance "Yes, We Can!" → which speaker?]
Super Vector from MFCC
• Motivated by speaker ID work
• Speech is a continuous evolution of the vocal tract
• Need to extract a sequence of spectra, or a sequence of spectral coefficients
• Use a sliding window: 25 ms window, 10 ms shift
[Pipeline: windowed speech → |X(ω)| → log → DCT → MFCC]
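A minimal framing sketch of this windowing step (the sampling rate and names are assumptions; the mel filterbank is omitted for brevity):

fs    = 16000;                       % assumed sampling rate
wlen  = round(0.025 * fs);           % 25 ms window
shift = round(0.010 * fs);           % 10 ms shift
nF    = floor((numel(speech) - wlen) / shift) + 1;   % speech: column-vector waveform
C = zeros(13, nF);                   % 13 cepstral coefficients per frame
for t = 1:nF
  seg    = speech((t-1)*shift + (1:wlen)) .* hamming(wlen);
  logmag = log(abs(fft(seg)) + eps); % log|X(w)|; mel filterbank omitted
  c      = dct(logmag);              % DCT -> cepstral coefficients
  C(:,t) = c(1:13);
end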
GMM Model from MFCC
• GMM on MFCC features
• The acoustic vectors (MFCC) of speaker s are modeled by a probability density function parameterized by λ_s
• Gaussian mixture model (GMM) for speaker s:
p(x | λ_s) = Σ_{k=1..K} w_k^s N(x; μ_k^s, Σ_k^s), with λ_s = {w_k^s, μ_k^s, Σ_k^s}
Universal Background Model
• UBM GMM model:
• The acoustic vectors of a general population are modeled by another GMM called the universal background model (UBM):
p(x | λ_ubm) = Σ_{k=1..K} w_k N(x; μ_k, Σ_k)
• Parameters of the UBM: λ_ubm = {w_k, μ_k, Σ_k}
MAP Adaptation
• Given the UBM GMM, how do new observations {x_t} deviate from it?
• With soft counts n_k = Σ_t γ_t(k) and posterior means E_k(x) = (1/n_k) Σ_t γ_t(k) x_t, the adapted mean is given by:
μ̂_k = α_k E_k(x) + (1 − α_k) μ_k,  α_k = n_k / (n_k + r)
(r is a relevance factor controlling how fast the UBM mean moves toward the data)
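A sketch of this adapted-mean computation (the relevance factor r = 16 is a typical but assumed value; names are mine):

% gamma: K x T posteriors of T frames under the UBM; X: D x T features; muUBM: D x K
r  = 16;                                          % assumed relevance factor
nk = sum(gamma, 2)';                              % 1 x K soft counts n_k
Ex = bsxfun(@rdivide, X * gamma', max(nk, eps));  % D x K posterior means E_k(x)
alpha = nk ./ (nk + r);                           % adaptation coefficients
muHat = bsxfun(@times, Ex, alpha) + bsxfun(@times, muUBM, 1 - alpha);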
Supervector Distance
• Assume we have a UBM GMM model λ_ubm = {w_k, μ_k, Σ_k}, and adapted models share its priors and covariances
• Then for two utterance samples a and b, with GMM models
• λ_a = {w_k, μ_k^a, Σ_k},
• λ_b = {w_k, μ_k^b, Σ_k},
the SV distance is
K(λ_a, λ_b) = Σ_k ( sqrt(w_k) Σ_k^{-1/2} μ_k^a )^T ( sqrt(w_k) Σ_k^{-1/2} μ_k^b )
• That is, the means of the two models are compared under the Mahalanobis metric induced by the UBM covariances; this is also a linear kernel function scaled by the UBM covariances
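A direct transcription of this kernel for diagonal UBM covariances (variable names are mine):

% w: 1 x K UBM weights; sigma2: D x K diagonal UBM covariances
% muA, muB: D x K adapted means for utterances a and b
Kab = 0;
for k = 1:numel(w)
  sa = sqrt(w(k)) * (muA(:,k) ./ sqrt(sigma2(:,k)));  % sqrt(w_k) Sigma_k^(-1/2) mu_k^a
  sb = sqrt(w(k)) * (muB(:,k) ./ sqrt(sigma2(:,k)));
  Kab = Kab + sa' * sb;                               % accumulate linear-kernel terms
end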
Supervector Performance in NIST Speaker ID
• System 5: Gaussian SV
• DCF (Detection Cost Function)
[Figure: NIST speaker ID DCF results; the Gaussian SV system is System 5]
AKULA – Adaptive KLUster Aggregation
MPEG contribution m31491, 2013/10/25
Abhishek Nagar, Zhu Li, Gaurav Srivastava and Kyungmo Park
Outline
• Motivation
• Adaptive Aggregation
• Results with TM7
• Summary
Motivation
• Better aggregation:
• Fisher Vector and VLAD type aggregations depend on a global model
• AKULA removes this dependence, directly coding the cluster centroids and SIFT counts
• SCFV/RVD both have situations where clusters are turned off due to no assignment; this can be avoided in AKULA
• Benefits:
• Allows more DoF in aggregation optimization,
o by an outer-loop boosting scheme for subspace projection optimization,
o and an inner-loop adaptive clustering without the constraint of the global GMM model
• Simple weighted-distance-sum metric, with no need to tune a multi-dimensional decision boundary (see the sketch below)
• The overall pairwise matching matched TM7 SCFV with its 2-dimensional decision boundary
• In GD-only matching it outperforms the TM7 GD
• Good improvements to the localization accuracy
• Light in extraction, but still heavy in pairwise matching; needs a binarization and/or indexing scheme to work for retrieval
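As a rough illustration only (the matching rule and weighting are my assumptions, not the exact metric from m31491), a hypothetical count-weighted nearest-centroid distance in the spirit of the "simple weighted distance sum" above:

% Ca: D x Ka centroids, na: 1 x Ka feature counts for image A (likewise Cb, nb)
function d = akula_like_distance(Ca, na, Cb, nb)
  d = 0;
  for i = 1:size(Ca, 2)
    dists = sum(bsxfun(@minus, Cb, Ca(:,i)).^2, 1);  % squared L2 to all B centroids
    [dmin, j] = min(dists);
    d = d + min(na(i), nb(j)) * dmin;                % count-weighted accumulation
  end
end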