Discriminative Learning and Visual Interactive Behavior
Learning Techniques in Audio-Visual Information Processing (ICPR Tutorial)

Tony Jebara, Brian Clarkson, Sumit Basu, Alex Pentland
MIT Media Lab

Outline
-Motivation: Learning Tasks and Paradigms
-Discriminative / Conditional Learning: Maximum Entropy Discrimination
-Latent Variables and Reversing Jensen's Inequality: CEM and the Dual of EM
-Action-Reaction Learning: Behavior Analysis / Synthesis via Time Series Prediction
-Wearable Platforms: Personal Enhanced Reality System, Wearable Interaction Learning
-Conversational Context Learning

Learning Applications for A & V
-Classification
-Regression / Prediction
-Detection / Clustering
-Transduction
-Feature Selection
[Figure: schematic examples of each task on labeled (x), unlabeled (?) and outlier (O) points.]

Learning Paradigms
New competitors for Maximum Likelihood: Discriminative Learning & SVMs (NIPS, COLT, UAI, ICML)
-Time Series Prediction (Muller)
-Digit Recognition (Vapnik)
-Speech Recognition (Deng)
-Face Gender Classification (Moghaddam)
-Gene-Sequence Classification (Jaakkola)
-Text Classification (Joachims)
Minimize a regularization penalty: $J(\Theta)$
subject to classification constraints: $y_t\, L(X_t;\Theta) \ge 1, \quad t = 1,\ldots,T$

Example: SVM minimizes: $J(\Theta) = \tfrac{1}{2}\,\theta^T\theta$
with discriminant: $L(X;\Theta) = \theta^T X + b$
decision rule: $\hat{y} = \mathrm{sign}\, L(X;\Theta)$

[Figure: positive (x) and negative (o) examples separated by the maximum-margin hyperplane.]
Maximum Entropy Discrimination Approach
Many solutions may be valid. Use a coarser description of the solution instead of a single optimum: solve for a distribution $P(\Theta)$ over all good $\Theta$ (instead of a single $\Theta^*$).

Find the $P(\Theta,\gamma)$ that minimizes $KL(P \,\|\, P_0)$ subject to the constraints:

$\int P(\Theta,\gamma)\,\big[\,y_t\, L(X_t;\Theta) - \gamma_t\,\big]\, d\Theta\, d\gamma \;\ge\; 0 \quad \forall t$

$P_0$ = prior over models & margins (favors large margins).

Decision Rule: $\hat{y} = \mathrm{sign} \int P(\Theta)\, L(X;\Theta)\, d\Theta$

*Information Transfer / Projection: information is transferred to the prior after observations
*Entropic regularization and margin penalties are on the same scale

Solution: $P(\Theta,\gamma) = \frac{1}{Z(\lambda)}\, P_0(\Theta,\gamma)\, \exp\Big(\sum_t \lambda_t \big[y_t\, L(X_t;\Theta) - \gamma_t\big]\Big)$
* $Z(\lambda)$ is the normalization constant (partition function)
* $\lambda_t \ge 0$ are non-negative Lagrange multipliers
* $\lambda$ is solved via the unique maximum of the concave objective function $J(\lambda) = -\log Z(\lambda)$
Example: SVM
With a Gaussian prior $P_0(\theta) = \mathcal{N}(\theta; 0, I)$ and margin prior $P_0(\gamma_t) \propto \exp(-c(1-\gamma_t))$ for $\gamma_t \le 1$, the concave objective becomes

$J(\lambda) = \sum_t \big[\lambda_t + \log(1 - \lambda_t/c)\big] \;-\; \tfrac{1}{2} \sum_{t,t'} \lambda_t \lambda_{t'}\, y_t y_{t'}\, X_t^T X_{t'}$

Example: Generative Models (e-family)
Use a log-likelihood ratio of two generative models as the discriminant:

$L(X;\Theta) = \log P(X|\theta_+) - \log P(X|\theta_-) + b$

Use conjugate priors for $P_0(\theta_\pm)$. Here $\Theta$ = model parameters and structure, $b$ = bias term.
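The concave SVM-style objective above can be maximized with simple projected gradient ascent. A minimal numpy sketch on toy separable data (the function names, toy set, and step sizes are illustrative, not from the tutorial):

```python
import numpy as np

def med_svm_train(X, y, c=10.0, lr=0.01, iters=2000):
    """Maximize J(lam) = sum_t [lam_t + log(1 - lam_t/c)] - 0.5 lam^T Q lam
    by projected gradient ascent, where Q_ts = y_t y_s <X_t, X_s>."""
    T = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    lam = np.full(T, 0.1)
    for _ in range(iters):
        grad = 1.0 - 1.0 / (c - lam) - Q @ lam
        # project back onto the box where the log term stays finite
        lam = np.clip(lam + lr * grad, 1e-8, c - 1e-8)
    return lam

def med_svm_predict(Xtr, ytr, lam, Xnew):
    # posterior mean of theta under the Gaussian prior: sum_t lam_t y_t X_t
    theta = (lam * ytr) @ Xtr
    return np.sign(Xnew @ theta)

# toy linearly separable data
X = np.array([[2., 2.], [3., 3.], [-2., -2.], [-3., -3.]])
y = np.array([1., 1., -1., -1.])
lam = med_svm_train(X, y)
print(med_svm_predict(X, y, lam, X))  # classifies the training points correctly
```

The bias term is omitted here for brevity (a non-informative prior on $b$ would add the equality constraint $\sum_t \lambda_t y_t = 0$).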
Maximum Entropy Discrimination for Regression

Find the $P(\Theta,\gamma)$ that minimizes $KL(P \,\|\, P_0)$ subject to constraints (one pair per training point):

$\int P(\Theta,\gamma)\,\big[\,y_t - L(X_t;\Theta) + \gamma_t\,\big]\, d\Theta\, d\gamma \;\ge\; 0$
$\int P(\Theta,\gamma')\,\big[\,L(X_t;\Theta) - y_t + \gamma'_t\,\big]\, d\Theta\, d\gamma' \;\ge\; 0$

Decision Rule: $\hat{y} = \int P(\Theta)\, L(X;\Theta)\, d\Theta$

Solution: $P(\Theta,\gamma,\gamma') = \frac{1}{Z(\lambda,\lambda')}\, P_0\, \exp\Big(\sum_t \lambda_t\big[y_t - L(X_t;\Theta) + \gamma_t\big] + \lambda'_t\big[L(X_t;\Theta) - y_t + \gamma'_t\big]\Big)$

Margin Priors (epsilon-tube): e.g. $P_0(\gamma_t) \propto \exp\big(-c(\epsilon - \gamma_t)\big)$ for $\gamma_t \le \epsilon$; deviations inside the $\epsilon$-tube are free, and beyond it the effective penalty grows linearly.

Example: SVM
With a Gaussian prior on $\theta$, the concave objective $J(\lambda,\lambda') = -\log Z(\lambda,\lambda')$ reduces to a quadratic program analogous to support vector regression.

[Figure: margin prior (left) and the induced penalty function (right).]
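The penalty induced by the $\epsilon$-tube margin prior is the familiar $\epsilon$-insensitive loss; a quick illustrative sketch:

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Epsilon-tube penalty: zero inside the tube, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.05, 1.2, 0.7])
print(eps_insensitive_loss(y_true, y_pred))  # [0.  0.1 0.2]
```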
Classification and Regression Examples

Generative Model Classification and SVM Regression results.
[Figure panels: sinc function (clean) fit by SVM with a linear kernel and with a 3rd-order polynomial kernel; sinc function with Gaussian noise; maximum likelihood full-covariance Gaussians versus MED full-covariance Gaussians.]
Maximum Entropy Discrimination - Cancer
Maximum Entropy Discrimination - Crabs

Feature Selection (Extension)
*Isolates interesting dimensions of the data for the given task
*Typically needs exponential search: consider all possible subsets of dimensions
*Reduces complexity of data
*Also improves generalization
*Augments sparse vectors (SVMs) with sparse dimensions
*Is possible jointly with parameter estimation
*Can be done discriminatively and efficiently with MED
Feature Selection
Modify parameters to include a binary ON / OFF Switch
The model contains structural parameters to aggressively prune features.

Prior:

Switch Prior: Bernoulli distribution; the $p_0$ parameter smoothly selects from no pruning to aggressive pruning.
Linear discriminant with switches:
$L(X;\Theta) = \sum_i s_i\,\theta_i\,x_i + b, \qquad s_i \in \{0, 1\}$

The prior factorizes over the bias, weights and switches:
$P_0(\Theta) = P_0(b)\,\prod_i P_0(\theta_i)\,P_0(s_i)$

Bernoulli switch prior:
$P_0(s_i) = p_0^{\,s_i}\,(1 - p_0)^{\,1 - s_i}$

Summing over the binary switches remains analytic, so the partition function and the concave objective $J(\lambda)$ stay computable in closed form.
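To see how the Bernoulli switch prior controls pruning, one can sample switch configurations for different $p_0$; a small illustrative sketch (function names are hypothetical):

```python
import numpy as np

def sample_switch_masks(p0, dims, n_samples, seed=0):
    """Draw feature on/off switches s_i ~ Bernoulli(p0) under the prior."""
    rng = np.random.default_rng(seed)
    return rng.random((n_samples, dims)) < p0

# small p0 -> aggressive pruning; p0 near 1 -> (almost) no pruning
for p0 in (0.99, 0.5, 0.01):
    masks = sample_switch_masks(p0, dims=100, n_samples=1000)
    print(p0, masks.mean())  # fraction of active features is roughly p0
```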
[Figure: aggressive attenuation of linear coefficients at low values ($p_0 = .01$).]
[Figure: ROC of DNA splice site, 100 features (original 25 x GATC).]
[Figure: ROC of DNA splice site, ~5000 features, quadratic kernel.]
[Figure: CDF of linear coefficients, DNA splice site, 100 features. Dashed line: $p_0 = 0.99999$; solid line: $p_0 = 0.00001$.]

DNA Data: 2-class, 100-element binary vectors. Training set 500, testing 4724.
$\lambda$ constrained to the $[0,c]$ hyper-cube with the constraint $\sum_t \lambda_t\, y_t = 0$.
Feature Selection in SVM Regression & Results
As in classification, the concave objective $J(\lambda,\lambda')$ takes an SVM-regression form; summing over the Bernoulli switches contributes one analytic $\log\big[(1-p_0) + p_0\exp(\cdot)\big]$ term per input dimension, which attenuates the coefficients of uninformative dimensions.
Linear Model Estimator    Epsilon-Sensitive Linear Loss
Least-Squares             1.7584
MED p0 = 0.99999          1.7529
MED p0 = 0.1              1.6894
MED p0 = 0.001            1.5377
MED p0 = 0.00001          1.4808

Boston Housing Data: 13 scalar features. Training set 481, testing 25. Explicit quadratic kernel expansion used.

Linear Model Estimator    Epsilon-Sensitive Linear Loss
Least-Squares             3.609e+03
MED p0 = 0.00001          1.6734e+03

D. Ross Cancer Data: 67 scalar features. Training set 50, testing 3951.
Feature selection is not limited to SVMs. It applies to discriminative generative model estimation as well. But tractable computation sometimes needs approximations.

Each input example has additional unobservable properties; we only have a prior distribution over the unobservable, e.g. category, affine transform, latent variable, alignment.
Given: training examples: $\{X_1, \ldots, X_T\}$
binary (+/- 1) labels: $\{y_1, \ldots, y_T\}$
hidden transformations: $\{U_1, \ldots, U_T\}$
transformation function: $\tau(X, U)$
prior on transforms: $P_0(U)$

Solution: transductive and iterative. Solve by alternating the solutions of $P(\Theta)$ and $P(U)$:

$P(\Theta, U, \gamma) = \frac{1}{Z(\lambda)}\, P_0(\Theta, U, \gamma)\, \exp\Big(\sum_t \lambda_t\,\big[\,y_t\, L(\tau(X_t, U_t);\Theta) - \gamma_t\,\big]\Big)$

Example: $L(X;\Theta) = \theta^T \tau(X, U) + b$
Optimization & Bounded QP (Extension)
MED maximizes a concave objective with convex constraints, so axis-parallel, Newton, or gradient descent methods will converge.
Lower-bound the concave objective with a quadratic; one can then use SMO, QP, and other SVM optimizers.
Example: SVM Feature Selection
Iterate the bound (contact at the current solution $\tilde\lambda$) and the QP; each QP is seeded at the previous solution. Converges in about 10 fast iterations.

The feature-selection terms $\log\big[(1-p_0)+p_0\exp(\cdot)\big]$ in $J(\lambda)$ are lower-bounded at the contact point $\tilde\lambda$ by a quadratic,
$J(\lambda) \;\ge\; -\tfrac{1}{2}\,\lambda^T H \lambda + \lambda^T k + \mathrm{const},$
where $H$ and $k$ are determined by the contact at $\tilde\lambda$, so each iteration is a standard QP.
MED for the Exponential Family (Extension)
Proof: MED with generative models spans members of the exponential family (where Gaussians generate SVMs):
exponential family form: $P(X|\theta) = \exp\big(A(X) + X^T\theta - K(\theta)\big)$
conjugate prior: $P_0(\theta) \propto \exp\big(\alpha^T\theta - \beta\, K(\theta)\big)$

Analytic Partition Function for Classification:
With the log-likelihood-ratio discriminant $L(X;\Theta) = \log P(X|\theta_+) - \log P(X|\theta_-) + b$, the MED solution $P(\Theta,\gamma) \propto P_0\, \exp\big(\sum_t \lambda_t [y_t L(X_t;\Theta) - \gamma_t]\big)$ keeps each factor in conjugate form, so the partition function factorizes,
$Z(\lambda) = Z_{\theta_+}(\lambda)\; Z_{\theta_-}(\lambda)\; Z_b(\lambda)\; Z_\gamma(\lambda),$
with each factor available in closed form.
Using a non-informative prior on $b$ (which yields the constraint $\sum_t \lambda_t y_t = 0$).
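As a sanity check on "Gaussians generate SVMs": for two identity-covariance Gaussians the log-likelihood-ratio discriminant above is linear in $X$, i.e. an SVM-style decision rule. A small numeric sketch (the setup is assumed, not from the slides):

```python
import numpy as np

def gaussian_loglik(x, mu):
    # log N(x; mu, I), keeping all terms so the ratio is exact
    d = len(mu)
    return -0.5 * np.sum((x - mu) ** 2) - 0.5 * d * np.log(2 * np.pi)

def log_ratio(x, mu_pos, mu_neg, b=0.0):
    return gaussian_loglik(x, mu_pos) - gaussian_loglik(x, mu_neg) + b

mu_pos, mu_neg = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
x = np.array([0.3, -0.7])

# the ratio equals the linear form (mu_pos - mu_neg)^T x + const
theta = mu_pos - mu_neg
const = -0.5 * (mu_pos @ mu_pos - mu_neg @ mu_neg)
assert np.isclose(log_ratio(x, mu_pos, mu_neg), theta @ x + const)
```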
Concluding Ideas on MED and Feature Selection
Maximum Entropy Discrimination is a flexible Bayesian regularization approach. It provides a geometric view of learning as constrained minimization to prior distributions over margins, parameters, and latent variables. It simultaneously combines:
-probabilistic methods
-large margin discrimination and SVMs
-feature selection
-classification, regression, etc.
-parameter and structure estimation
-exponential family generative models
-transduction and detection

Feature selection is a particularly advantageous extension which provides increased sparsity (support vectors & support dimensions) and improves generalization.
Limitation of MED
Applies to the exponential family. Yet many models are MIXTURES of the e-family:
these are intractable models in discriminative & conditional settings. Thus use variational bounds to perform the calculations: invoke EM and derive its discriminative DUAL by reversing Jensen.
Reversing Jensen's Inequality
The Dual of EM for Discriminative Latent Learning
Tony Jebara, Alex Pentland
Jensen’s Inequality
-Inequalities allow us to integrate, maximize, evaluate and derive intractable expressions
-Convexity: 1905-1906 by J. Jensen (Danish mathematician & engineer)
-See "Convex Functions, Partial Orderings, and Statistical Applications" by J. Pecaric, F. Proschan and Y. Tong.

Jensen in Statistics and EM:
-Subsumes many information-theoretic bounds (Cover & Thomas)
-Subsumes the EM algorithm (Dempster, Laird & Rubin; Baum-Welch)
-EM casts latent variable problems as complete data by solving for a lower bound on likelihood.

Reversals of Jensen:
-Constrained reversals and converses have been explored and are active areas in mathematics (S.S. Dragomir).
-Reversals have yet to be applied to discriminative learning.
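Jensen's inequality for the concave log (the bound EM exploits) can be checked numerically; a tiny illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.random(1000) + 0.1          # positive samples
q = rng.random(1000); q /= q.sum()  # an arbitrary distribution q

# log of the expectation dominates the expectation of the log (log is concave)
lhs = np.log(np.sum(q * z))
rhs = np.sum(q * np.log(z))
print(lhs >= rhs)  # True; equality only when z is constant under q
```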
The EM Algorithm
[Figure: a bound on f(x) contacting the curve at iterates $x_1, \ldots, x_4$.]

Makes intractable maximization of likelihood and integration of Bayesian inference tractable via variational bounds.

E-step: Replace unknowns with their expected values under the current model (i.e., solving for a lower bound on likelihood using Jensen!).

M-step: Optimize the current model with the complete data (maximizing the Jensen lower bound!).

Applies and converges for exponential family mixtures, i.e. a very large space of models that covers most of contemporary machine learning: HMMs, Gaussian mixture models, etc.
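The E-step/M-step alternation above can be sketched for a two-component 1D Gaussian mixture (unit variances and equal weights assumed for brevity; all names are illustrative):

```python
import numpy as np

def em_two_gaussians(x, iters=50):
    """EM for a 1D mixture of two unit-variance, equal-weight Gaussians.
    E-step: responsibilities (expected assignments under the current model).
    M-step: re-estimate the means from the completed data."""
    mu = np.array([x.min(), x.max()])  # crude initialization
    for _ in range(iters):
        # E-step: posterior probability of each component for each point
        ll = -0.5 * (x[:, None] - mu[None, :]) ** 2
        r = np.exp(ll - ll.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted means maximize the Jensen lower bound
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return mu

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])
mu = em_two_gaussians(x)
print(np.sort(mu))  # close to [-3, 3]
```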
ARL (ACTION-REACTION LEARNING)
-Machine learning of correlations between stimulus & response via perceptual measurements of human interactions
-Imitation-based learning (Mataric)
-Behaviourism (Thorndike, Watson, Skinnerian, Gibsonian) -> reactionary
-Watch humans interacting to learn how to react to stimulus
SHORT TERM MEMORY PRE-PROCESSING (Wan, Elliot-Anderson)
$x(t) = \big[\, y(t-1)\;\; y(t-2)\;\; \cdots\;\; y(t-T) \,\big]$
$y(t) \approx g\big(y(t-1),\, y(t-2),\, \ldots,\, y(t-T)\big)$

y = features (blob features of both users, concatenated); g = prediction mapping; short-term memory window T = 120.
Temporal Modeling... PRINCIPAL COMPONENTS ANALYSIS
(linear; alternatives include FFT, wavelets, ...)

Gaussian distribution of STM (roughly 6.5 seconds). Dims = T x Feats = 120 x 30 -> 40 (95% energy of the submanifold). A low-dimensional (i.e. smoothed) characterization of the past interaction.

[Figure: eigenvalues, eigenvectors and eigenspace of the STM windows.]

LEARN THE MAPPING PROBABILISTICALLY: p(y|x) = p(future | STM), versus a deterministic y = g(x).
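The short-term-memory windowing and PCA compression described above can be sketched as follows (window and feature sizes shrunk from 120 x 30 for brevity; all names are illustrative, not the original code):

```python
import numpy as np

def stm_windows(y, T):
    """Stack the past T frames into one short-term-memory vector per step:
    x(t) = [y(t-1), ..., y(t-T)] flattened."""
    N, feats = y.shape
    return np.array([y[t - T:t][::-1].ravel() for t in range(T, N)])

def pca_energy(X, energy=0.95):
    """Keep the leading principal components holding `energy` of the variance."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), energy)) + 1
    return Xc @ Vt[:k].T, k

rng = np.random.default_rng(0)
y = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 6))  # 6 features, rank <= 3
X = stm_windows(y, T=12)   # 72-dim STM vectors
Z, k = pca_energy(X)       # low-dimensional characterization of the past
print(X.shape, Z.shape, k)
```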
Learning & The CEM Algorithm

EXPECTATION MAXIMIZATION (EM)
Learns p(x,y) by maximizing the joint likelihood (a joint model of the phenomenon).
Powerful convergence -- clean statistical framework. More global than gradient solutions -- can be deterministically annealed.
For conditional problems (input/output regression, classification), joint models are outperformed (i.e. NNs and RNNs versus HMMs) since they don't optimize output error (i.e. the testing criterion is not like the training criterion).

CONDITIONAL EXPECTATION MAXIMIZATION (CEM)
Learns p(y|x) by maximizing the conditional likelihood (a conditional model of the task).
Convergence properties like EM, but for conditional likelihood.
EM:  $\max_\Theta\ \prod_{i=1}^{N} p(x_i, y_i \mid \Theta)$
CEM: $\max_\Theta\ \prod_{i=1}^{N} p(y_i \mid x_i, \Theta)$
Variational bound maximization: monotonic, skips local minima, deterministically annealable.
Model: mixture of experts -- soft piece-wise linear, with Gaussian gates and conditioned Gaussian regressors. CEM applies to other models: HMMs, multinomial mixtures, etc.
Output: expectation or arg max (regression).
Computationally very efficient: simple matrix operations.
Learning and the CEM Algorithm...
[Figure: bound maximization on f(x) at iterates $x_1, \ldots, x_4$.]

Conditioned mixture of Gaussians (Gaussian gates with conditioned Gaussian regressors):

$p(y \mid x, \Theta) = \frac{\sum_{m=1}^{M} \alpha_m\, \mathcal{N}(x;\, \mu_m^x, \Sigma_m^{xx})\, \mathcal{N}\big(y;\, \hat{y}_m(x),\, \Sigma_m\big)}{\sum_{m=1}^{M} \alpha_m\, \mathcal{N}(x;\, \mu_m^x, \Sigma_m^{xx})}$

with per-expert regressors $\hat{y}_m(x) = \mu_m^y + \Sigma_m^{yx}\,(\Sigma_m^{xx})^{-1}\,(x - \mu_m^x)$. The regression output is the gated expectation $\hat{y}(x) = \int y\, p(y \mid x)\, dy = \sum_m h_m(x)\, \hat{y}_m(x)$, where $h_m(x) \propto \alpha_m\, \mathcal{N}(x;\, \mu_m^x, \Sigma_m^{xx})$.
[Figure: EM versus CEM on the integration step; graphical time-series models linking x(t), y(t-1) and y(t) through components A and B, in training and prediction configurations.]
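The conditioned Gaussian mixture (mixture-of-experts) regression $p(y|x)$ can be sketched in 1D with Gaussian gates and linear experts; a minimal illustrative example (the parameters are hypothetical):

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def moe_predict(x, alphas, mu_x, var_x, mu_y, slopes):
    """Expected output of a conditioned Gaussian mixture (1D in, 1D out):
    gates h_m(x) proportional to alpha_m N(x; mu_x_m, var_x_m),
    experts yhat_m(x) = mu_y_m + slope_m (x - mu_x_m)."""
    h = alphas * gauss(x, mu_x, var_x)
    h /= h.sum()
    return np.sum(h * (mu_y + slopes * (x - mu_x)))

# two experts: one responsible for x near -2, one for x near +2
alphas = np.array([0.5, 0.5])
mu_x = np.array([-2.0, 2.0])
var_x = np.array([1.0, 1.0])
mu_y = np.array([1.0, -1.0])
slopes = np.array([0.5, 0.5])

print(moe_predict(-2.0, alphas, mu_x, var_x, mu_y, slopes))  # ~1  (expert 1 dominates)
print(moe_predict(2.0, alphas, mu_x, var_x, mu_y, slopes))   # ~-1 (expert 2 dominates)
```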
TRAINING MODE
The system accumulates action/reaction pairs (x,y) and uses CEM to optimize a conditioned Gaussian mixture model for p(y|x).

INTERACTIVE MODE
The system completes the missing time series and synthesizes the reaction in graphical form for the user using p(y|x).

Integration... FEEDBACK MODE
The predicted measurement on the user can be fed back as a non-linear learned EKF to aid vision. Can also use p(x) to filter vision and find correspondence.

SIMULATION MODE
Fully synthesize both components of the time series (user A and user B). Some instabilities / bugs (no grounding) -> future work.
1 - Unsupervised Discovery of Simple Interactive Behaviour by Perceptual Observations and Statistical Time Series Prediction of Future Given Past or Reaction Given Action
2 - Imitation Based Learning of Behaviour
3 - Real-Time Behaviour Acquisition and Interactive Synthesis
4 - Small Amounts of Training Data and Non-Intrusive Training
5 - Non-Linear Predictive Model for Feedback Perception
6 - Monotonically Convergent Maximum Conditional Likelihood i.e. Discriminant Probabilistic Learning (CEM)
7 - No A Priori Segmentation or Classification of Gestures
Wearable Platform: Dynamic Personal Enhanced Reality System
Tony Jebara, Bernt Schiele, Nuria Oliver, Alex Pentland

DyPERS Architecture
* 3-button interface: Record, Associate and Discard
* User records live A/V clips with the wearable
* Associates them with a visual trigger object(s)
* Audio-video is replayed when computer vision sees the trigger object
Real-time probabilities of topics (politics, health, religion, etc.). Could also detect emotions & situations...
References
1) Tony Jebara and Alex Pentland. On Reversing Jensen's Inequality. In Neural Information Processing Systems 13 (NIPS*'00), Dec. 2000.
2) Tony Jebara, Yuri Ivanov, Ali Rahimi and Alex Pentland. Tracking Conversational Context for Machine Mediation of Human Discourse. In AAAI Fall 2000 Symposium - Socially Intelligent Agents - The Human in the Loop, Nov. 2000.
3) Tony Jebara and Tommi Jaakkola. Feature Selection and Dualities in Maximum Entropy Discrimination. In 16th Conf. on Uncertainty in Artificial Intelligence (UAI 2000).
4) Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum Entropy Discrimination. In Neural Information Processing Systems 12 (NIPS*'99).
5) Tony Jebara and Alex Pentland. Action Reaction Learning: Automatic Visual Analysis and Synthesis of Interactive Behaviour. In 1st Intl. Conf. on Computer Vision Systems (ICVS'99).
6) Bernt Schiele, Nuria Oliver, Tony Jebara and Alex Pentland. An Interactive Computer Vision System, DyPERS: Dynamic Personal Enhanced Reality System. In 1st Intl. Conf. on Computer Vision Systems (ICVS'99).
7) Tony Jebara and Alex Pentland. Maximum Conditional Likelihood via Bound Maximization and the CEM Algorithm. In Neural Information Processing Systems 11 (NIPS*'98).
8) Tony Jebara. Action-Reaction Learning: Analysis and Synthesis of Human Behaviour. Master's Thesis, MIT, May 1998.