Return to Big Picture
Main statistical goals of OODA:
Understanding population structure: low dimensional projections, PCA.
Classification (i.e. discrimination): understanding 2+ populations.
Time series of data objects: chemical spectra, mortality data.
Vertical integration of data types.
Kernel Embedding
Polynomial Embedding, Toy Example 3: Donut
FLD on the original data: very bad.
FLD after embedding (parabolic fold): somewhat better.
FLD after embedding (slice of paraboloid): good performance.

Kernel Embedding
Hot Topic Variation: Kernel Machines
Idea: replace polynomials by other nonlinear functions.
E.g. 1: sigmoid functions from neural nets.
E.g. 2: radial basis functions, i.e. Gaussian kernels.
Related to kernel density estimation (recall: smoothed histogram).

Kernel Embedding
Radial Basis Functions
Note: there are several ways to embed:
Naive embedding (evaluate on an equally spaced grid).
Explicit embedding (evaluate at the data).
Implicit embedding (inner product based).
(Everybody currently does the latter.)
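As a concrete illustration of the three embeddings, here is a minimal numpy sketch (the helper name rbf is illustrative, not from any particular package). It evaluates Gaussian radial basis functions on an equally spaced grid (naive), at the data points (explicit), and keeps only inner products (implicit).

```python
import numpy as np

def rbf(u, v, window=1.0):
    # Gaussian radial basis function with bandwidth ("window")
    d2 = np.sum((u - v) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * window ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy data: n = 100 points in d = 2

# Naive embedding: basis functions centered on an equally spaced grid
# (g grid points per axis gives g^d features -- intractable for large d)
g = 10
grid = np.stack(np.meshgrid(np.linspace(-3, 3, g),
                            np.linspace(-3, 3, g)), axis=-1).reshape(-1, 2)
Phi_naive = rbf(X[:, None, :], grid[None, :, :])     # n x g^d feature matrix

# Explicit embedding: basis functions centered at the data points themselves
Phi_explicit = rbf(X[:, None, :], X[None, :, :])     # n x n feature matrix

# Implicit embedding: keep only the inner products k(x_i, x_j).
# Numerically this is the same n x n matrix, but it is used only through
# the kernel trick, never as explicit coordinates.
K = Phi_explicit
```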
Kernel Embedding
Toy Example 4: Checkerboard
Very challenging!
FLD: linear is hopeless.
Polynomials don't have the needed flexibility.
Radial basis embedding + FLD is excellent!

Kernel Embedding
Drawbacks to naive embedding:
The equally spaced grid is too big in high dimension d.
Not computationally tractable (g^d grid points).
Approach: evaluate only at the data points, not on the full grid, but where the data live.

Support Vector Machines
Motivation: find a linear method that works well for embedded data.
Note: embedded data are very non-Gaussian.
Classical statistics (use a probability distribution) looks hopeless.

Support Vector Machines
Graphical view, using toy example:
SVMs, Optimization Viewpoint
Lagrange multipliers, primal formulation (separable case). Minimize
$$L_P(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \big[\, y_i (x_i^\top w + b) - 1 \,\big],$$
where the $\alpha_i \ge 0$ are Lagrange multipliers.

Dual Lagrangian version. Maximize
$$L_D(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
\qquad \text{subject to } \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0.$$

Get classification function:
$$f(x) = \operatorname{sign}\Big( \sum_i \alpha_i y_i\, x_i^\top x + b \Big).$$
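As a small numerical companion (a sketch only, using scikit-learn rather than anything from these notes), a fitted linear SVM exposes exactly the quantities above: the dual coefficients $\alpha_i y_i$ for the support vectors, the normal vector $w$, and the intercept $b$.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two (essentially) separable Gaussian classes in d = 2
X = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=+2, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

svm = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates the hard-margin (separable) case

alpha_y = svm.dual_coef_[0]        # alpha_i * y_i, one entry per support vector
sv = svm.support_vectors_          # the support vectors x_i
w = svm.coef_[0]                   # normal vector w
b = svm.intercept_[0]

# the classifier f(x) = sign(w^T x + b) is recovered from support vectors alone:
# w = sum_i alpha_i y_i x_i
assert np.allclose(w, alpha_y @ sv)
```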
SVMs, Computation
Major computational point: the classifier depends on the data only through inner products!
Thus it is enough to store only the inner products.
This creates big savings in optimization, especially for HDLSS data.
But it also creates variations in kernel embedding (interpretation?!?).
This is almost always done in practice.

SVMs, Computation & Embedding
For an embedding map $\Phi$, e.g. explicit embedding: maximize
$$L_D(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, \Phi(x_i)^\top \Phi(x_j),$$
and get classification function
$$f(x) = \operatorname{sign}\Big( \sum_i \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b \Big).$$
Straightforward application of embedding, but it loses the inner product advantage.

SVMs, Computation & Embedding
Implicit embedding: replace inner products by a kernel $k(x_i, x_j)$. Maximize
$$L_D(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j),$$
and get classification function
$$f(x) = \operatorname{sign}\Big( \sum_i \alpha_i y_i\, k(x_i, x) + b \Big).$$
Still defined only via inner products, so it retains the optimization advantage; thus it is used very commonly.
Comparison to explicit embedding? Which is better???
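The explicit/implicit distinction can be made concrete with scikit-learn (a sketch, under the assumption that the "window" of these slides corresponds to the Gaussian bandwidth, i.e. gamma = 1/(2*window^2)): the same Gaussian kernel can be supplied either as an explicit feature matrix fed to a linear SVM, or implicitly as a precomputed Gram matrix.

```python
import numpy as np
from sklearn.svm import SVC

def gauss_kernel(A, B, window=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 * window^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * window ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # donut-like labels

# Explicit embedding: kernel evaluations at the data become new coordinates,
# then an ordinary *linear* SVM is run in that n-dimensional feature space.
Phi = gauss_kernel(X, X)
svm_explicit = SVC(kernel="linear").fit(Phi, y)

# Implicit embedding: the Gram matrix enters only through the kernel trick.
K = gauss_kernel(X, X)
svm_implicit = SVC(kernel="precomputed").fit(K, y)

# Predicting at new points: explicit uses the new coordinates, implicit uses
# kernel values between the new points and the training data.
X_new = rng.normal(size=(5, 2))
print(svm_explicit.predict(gauss_kernel(X_new, X)))
print(svm_implicit.predict(gauss_kernel(X_new, X)))
```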
Support Vector Machines
Target toy data set:

Support Vector Machines
Explicit embedding (Gaussian kernel, i.e. radial basis function):
Window = 0.1.
Window = 1: pretty big change (a factor of 10).
Window = 10: not quite as good???
Window = 100: note the lost center (over-smoothed).

Support Vector Machines
Interesting alternative viewpoint: study projections in kernel space.
(Never done in the machine learning world.)

Support Vector Machines
Kernel space projections:
Window = 0.1: note data piling at the margin.
Window = 1: excellent separation (but less than window = 0.1).
Window = 10: still good (but some overlap).
Window = 100: some reds on the wrong side (missed the center).

Support Vector Machines
Implicit embedding:
Window = 0.1.
Window = 0.5.
Window = 1.
Window = 10.

Support Vector Machines
Notes on implicit embedding:
Similar large vs. small window lessons.
The range of reasonable results seems to be smaller (note the different range of windows).
Much different edge behavior.
Interesting topic for future work.
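The window comparison above can be reproduced in a rough way with scikit-learn's RBF SVM (a sketch; the correspondence gamma = 1/(2*window^2) is an assumption about how the slides' "window" maps to scikit-learn's parameterization).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.2, 1, -1)   # target-like toy data

for window in [0.1, 1.0, 10.0, 100.0]:
    gamma = 1.0 / (2.0 * window ** 2)
    acc = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean()
    print(f"window = {window:6.1f}   cv accuracy = {acc:.3f}")
# Small windows fit very locally; very large windows over-smooth and
# lose the center class, as in the slides.
```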
SVMs & Robustness
Usually not severely affected by outliers, but a possible weakness: there can be very influential points.
Toy example: only 2 points drive the SVM.
Notes:
Huge range of chosen hyperplanes.
But all are pretty good discriminators.
This only happens when the whole range is OK??? Good or bad?

SVMs & Robustness
Effect of violators:
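The influential-points phenomenon above reflects the fact that the fitted SVM hyperplane is determined entirely by its support vectors. A quick sketch (scikit-learn, illustrative data): deleting a point that is not a support vector leaves the direction essentially unchanged, while moving a support vector can swing it.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, size=(30, 2)), rng.normal(+2, 1, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

svm = SVC(kernel="linear", C=1e3).fit(X, y)
w_full = svm.coef_[0] / np.linalg.norm(svm.coef_[0])

# drop one point that is NOT a support vector: the direction is unchanged
non_sv = [i for i in range(len(y)) if i not in svm.support_][0]
keep = np.ones(len(y), dtype=bool)
keep[non_sv] = False
w_drop = SVC(kernel="linear", C=1e3).fit(X[keep], y[keep]).coef_[0]
w_drop = w_drop / np.linalg.norm(w_drop)
print("angle after dropping a non-support point (degrees):",
      np.degrees(np.arccos(np.clip(w_full @ w_drop, -1, 1))))
```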
SVMs, Tuning Parameter
Recall the regularization parameter C:
Controls the penalty for violation, i.e. lying on the wrong side of the plane.
Appears in the slack variables.
Affects the performance of the SVM.
Toy example: d = 50, spherical Gaussian data.

SVMs, Tuning Parameter
Toy example: d = 50, spherical Gaussian data.
X-axis: optimal direction. Other axis: SVM direction.
Small C: where is the margin? Small angle to optimal (generalizable).
Large C: more data piling; larger angle (less generalizable); bigger gap (but maybe not better???).
In between: very small range.

SVMs, Tuning Parameter
Toy example: d = 50, spherical Gaussian data.
Careful look at small C: put MD (the mean difference direction) on the horizontal axis.
Shows SVM and MD are the same for small C. Mathematics behind this?
They separate for large C. No data piling for MD.
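A sketch of this toy example (assumptions: scikit-learn's C plays the role of the C above, and the "optimal direction" for two spherical Gaussians differing only in mean is the direction of the mean difference, here the first coordinate axis).

```python
import numpy as np
from sklearn.svm import SVC

d, n = 50, 25                        # HDLSS-flavored: d = 50, n = 25 per class
rng = np.random.default_rng(4)
mu = np.zeros(d); mu[0] = 2.2        # classes differ only in the first coordinate
X = np.vstack([rng.normal(size=(n, d)) - mu, rng.normal(size=(n, d)) + mu])
y = np.array([-1] * n + [+1] * n)

opt_dir = np.zeros(d); opt_dir[0] = 1.0          # optimal (Bayes) direction
md_dir = X[y == 1].mean(0) - X[y == -1].mean(0)  # mean difference (MD) direction
md_dir /= np.linalg.norm(md_dir)

def angle(u, v):
    return np.degrees(np.arccos(np.clip(abs(u @ v), 0, 1)))

for C in [1e-3, 1e-1, 1e1, 1e3]:
    w = SVC(kernel="linear", C=C).fit(X, y).coef_[0]
    w /= np.linalg.norm(w)
    print(f"C = {C:8.0e}   angle(SVM, optimal) = {angle(w, opt_dir):5.1f} deg"
          f"   angle(SVM, MD) = {angle(w, md_dir):5.1f} deg")
# Small C: the SVM direction is close to the mean difference (and to optimal);
# large C: more data piling and a larger angle to the optimal direction.
```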
Support Vector Machines
Important extension: multi-class SVMs.
Hsu & Lin (2002); Lee, Lin, & Wahba (2002).
Defined for the implicit version.
Direction based variation???

Distance Weighted Discrimination
Improvement of SVM for HDLSS data.
Toy example (similar to the earlier movie):

Distance Weighted Discrimination
Toy example, Maximal Data Piling (MDP) direction:
Perfect separation, but gross overfitting, large angle, poor generalizability.

Distance Weighted Discrimination
Toy example, Support Vector Machine direction:
Bigger gap, smaller angle, better generalizability.
Feels support vectors too strongly??? Ugly subpopulations? Improvement?

Distance Weighted Discrimination
Toy example, Distance Weighted Discrimination:
Addresses these issues: smaller angle, better generalizability, nice subpopulations.
Replaces the minimum distance by an average distance.
Distance Weighted Discrimination
Based on an optimization problem. For residuals
$$r_i = y_i (x_i^\top w + b),$$
the idea is (roughly) to minimize the sum of reciprocal residuals,
$$\min_{w, b} \ \sum_i \frac{1}{r_i}, \qquad \|w\| \le 1,$$
which uses poles to push the plane away from the data.

More precisely: work in an appropriate penalty for violations, via slack variables $\xi_i \ge 0$:
$$\min_{w, b, \xi} \ \sum_i \frac{1}{r_i} + C \sum_i \xi_i,
\qquad r_i = y_i (x_i^\top w + b) + \xi_i > 0, \quad \|w\| \le 1.$$

Optimization method: Second Order Cone Programming.
Still a convex generalization of quadratic programming.
Allows fast greedy solution.
Can use available fast software (SDPT3, Michael Todd, et al).
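A minimal sketch of this optimization in cvxpy (an assumption: cvxpy's generic conic solver stands in for the SDPT3 software referenced below; this is illustrative, not the authors' implementation).

```python
import numpy as np
import cvxpy as cp

def dwd_fit(X, y, C=100.0):
    """Distance Weighted Discrimination via a generic convex (conic) solver.

    Minimizes sum_i 1/r_i + C * sum_i xi_i with
    r_i = y_i (x_i' w + b) + xi_i,  xi_i >= 0,  ||w|| <= 1.
    """
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    r = cp.multiply(y, X @ w + b) + xi
    objective = cp.Minimize(cp.sum(cp.inv_pos(r)) + C * cp.sum(xi))
    constraints = [cp.norm(w, 2) <= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# toy HDLSS example
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 1, (10, 40)), rng.normal(+1, 1, (10, 40))])
y = np.array([-1.0] * 10 + [+1.0] * 10)
w, b = dwd_fit(X, y)
print("training signs:", np.sign(X @ w + b).astype(int))
```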
Distance Weighted Discrimination
References for more on DWD:
Current paper: Marron, Todd and Ahn (2007).
Links to more papers: Ahn (2007).
R implementation of DWD: CRAN (2011).
SDPT3 software: Toh (2007).

Distance Weighted Discrimination
2-d visualization: pushes the plane away from the data.
All points have some influence (not just the support vectors).
Support Vector Machines
Graphical view, using toy example:

Distance Weighted Discrimination
Graphical view, using toy example:

Batch and Source Adjustment
Recall from class notes 8/26/14.
For the Stanford Breast Cancer Data (C. Perou).
Analysis in Benito, et al (2004), https://genome.unc.edu/pubsup/dwd/
Adjust for source effects: different sources of mRNA.
Adjust for batch effects: arrays fabricated at different times.
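A sketch of the adjustment idea (an assumption about the mechanics, in the spirit of the DWD adjustment described here): find a discrimination direction between the two batches, project onto it, and rigidly shift each batch so the projected means agree. The direction could come from the dwd_fit sketch above, or from any other discriminator such as an SVM normal vector.

```python
import numpy as np

def batch_adjust(X, batch, direction):
    """Rigidly shift each batch along a discrimination direction so that
    the projected batch means coincide (DWD-style adjustment sketch)."""
    w = direction / np.linalg.norm(direction)
    X_adj = X.astype(float).copy()
    for b in np.unique(batch):
        mask = (batch == b)
        proj_mean = (X_adj[mask] @ w).mean()   # mean projection of this batch
        X_adj[mask] -= proj_mean * w           # shift the whole batch along w
    return X_adj

# usage sketch: `direction` separating the two batches, e.g. from dwd_fit
# X_adj = batch_adjust(X, batch_labels, direction)
```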
Source Batch Adj: biological class colors & symbols.
Source Batch Adj: source colors.
Source Batch Adj: PC 1-3 & DWD direction.
Source Batch Adj: DWD source adjustment.
Source Batch Adj: source adjusted, PCA view.
Source Batch Adj: source & batch adjusted, adjusted PCA.
Why not adjust using SVM?
Major problem: projected distributional shape.
Triangular distributions (oppositely skewed) do not allow a sensible rigid shift.

Why not adjust using SVM?
Nicely fixed by DWD: projected distributions are near Gaussian, so a shift is sensible.

Why not adjust by means?
DWD is complicated: value added?
Because it is cool: recall it improves the SVM for HDLSS data.
Good empirical success: routinely used in the Perou Lab, many comparisons done, similar lessons from Wistar.
Proven statistical power.

Why not adjust by means?
But why not PAM (~ mean difference)? Simpler is better. Why not means, i.e. point cloud centerpoints?
Elegant answer: Xuxin Liu, et al (2009).
Drawback to PAM: poor handling of unbalanced biological subtypes. DWD is more resistant to unbalance.

Why not adjust by means?
Toy example: Gaussian clusters.
Two batches (denoted: + and o).
Two subtypes (red and blue).
Goal: bring together + and o within each subtype.
Challenge: unequal biological ratios within batches.
Twiddle the ratios of the subtypes.

Why not adjust by means?
Lessons from the DWD vs. PAM example:
Both very good for ratio ~ 0.7 - 1.0.
PAM weakens for ratio < 0.7.
DWD robust until ratio < 0.4.
PAM has some use until ratio < 0.2.
DWD has some use until ratio < 0.05.
Both methods eventually fail.
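A sketch of why plain mean (PAM-style) adjustment suffers under unbalanced subtypes (illustrative numbers only, not the slides' data): with unequal subtype ratios, the batch mean difference mixes the true batch artifact with the subtype difference, so shifting by it misaligns the subtypes.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 20
subtype_shift = np.zeros(d); subtype_shift[0] = 4.0   # red vs blue difference
batch_shift = np.zeros(d); batch_shift[1] = 2.0       # true batch artifact

def make_batch(n_red, n_blue, batch_offset):
    red = rng.normal(size=(n_red, d)) + batch_offset
    blue = rng.normal(size=(n_blue, d)) + subtype_shift + batch_offset
    return np.vstack([red, blue])

n = 100
for red_ratio in [0.5, 0.7, 0.9]:
    n_red = int(n * red_ratio)
    A = make_batch(n_red, n - n_red, 0.0)          # batch A: mostly red
    B = make_batch(n - n_red, n_red, batch_shift)  # batch B: mostly blue, plus artifact
    est_batch_shift = B.mean(0) - A.mean(0)        # PAM-style (mean) estimate of the artifact
    error = np.linalg.norm(est_batch_shift - batch_shift)
    print(f"red ratio {red_ratio:.1f}: error of mean-based batch estimate = {error:.2f}")
# Balanced batches recover the batch shift; unbalanced batches contaminate it
# with the subtype difference, which is what DWD adjustment is more resistant to.
```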
Why not adjust by means?
DWD is robust against non-proportional subtypes.
Mathematical statistical question: is there mathematics behind this? (Will answer later.)
DWD in Face Recognition
Face images as data (with M. Benito & D. Peña).
Male vs. female difference? Discrimination rule?
Represented as a long vector of pixel gray levels.
Registration is critical.