Return to Big Picture
Main statistical goals of OODA:
Understanding population structure: low dimensional projections, PCA.
Classification (i.e. discrimination): understanding 2+ populations.
Time series of data objects: chemical spectra, mortality data.
Vertical integration of data types.
Kernel Embedding
Polynomial Embedding, Toy Example 3: Donut
FLD on the original data: very bad.
FLD after embedding (parabolic fold): somewhat better.
FLD after embedding (slice of paraboloid): good performance.

Kernel Embedding
Hot Topic Variation: Kernel Machines
Idea: replace polynomials by other nonlinear functions.
E.g. 1: sigmoid functions from neural nets.
E.g. 2: radial basis functions, i.e. Gaussian kernels.
Related to kernel density estimation (recall: smoothed histogram).

Kernel Embedding
Radial Basis Functions
Note: there are several ways to embed:
Naive embedding (evaluate on an equally spaced grid).
Explicit embedding (evaluate at the data).
Implicit embedding (inner product based).
(Everybody currently does the latter.)
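As a concrete illustration of the three embeddings, here is a minimal numpy sketch (the helper name rbf is illustrative, not from any particular package). It evaluates Gaussian radial basis functions on an equally spaced grid (naive), at the data points (explicit), and keeps only inner products (implicit).

```python
import numpy as np

def rbf(u, v, window=1.0):
    # Gaussian radial basis function with bandwidth ("window")
    d2 = np.sum((u - v) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * window ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy data: n = 100 points in d = 2

# Naive embedding: basis functions centered on an equally spaced grid
# (g grid points per axis gives g^d features -- intractable for large d)
g = 10
grid = np.stack(np.meshgrid(np.linspace(-3, 3, g),
                            np.linspace(-3, 3, g)), axis=-1).reshape(-1, 2)
Phi_naive = rbf(X[:, None, :], grid[None, :, :])     # n x g^d feature matrix

# Explicit embedding: basis functions centered at the data points themselves
Phi_explicit = rbf(X[:, None, :], X[None, :, :])     # n x n feature matrix

# Implicit embedding: keep only the inner products k(x_i, x_j).
# Numerically this is the same n x n matrix, but it is used only through
# the kernel trick, never as explicit coordinates.
K = Phi_explicit
```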
Kernel Embedding
Toy Example 4: Checkerboard
Very challenging!
FLD: linear is hopeless.
Polynomials don't have the needed flexibility.
Radial basis embedding + FLD is excellent!

Kernel Embedding
Drawbacks to naive embedding:
The equally spaced grid is too big in high dimension d.
Not computationally tractable (g^d grid points).
Approach: evaluate only at the data points, not on the full grid, but where the data live.

Support Vector Machines
Motivation: find a linear method that works well for embedded data.
Note: embedded data are very non-Gaussian.
Classical statistics (use a probability distribution) looks hopeless.

Support Vector Machines
Graphical view, using toy example:
SVMs, Optimization Viewpoint
Lagrange multipliers, primal formulation (separable case). Minimize
$$L_P(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \big[\, y_i (x_i^\top w + b) - 1 \,\big],$$
where the $\alpha_i \ge 0$ are Lagrange multipliers.

Dual Lagrangian version. Maximize
$$L_D(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
\qquad \text{subject to } \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0.$$

Get classification function:
$$f(x) = \operatorname{sign}\Big( \sum_i \alpha_i y_i\, x_i^\top x + b \Big).$$
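As a small numerical companion (a sketch only, using scikit-learn rather than anything from these notes), a fitted linear SVM exposes exactly the quantities above: the dual coefficients $\alpha_i y_i$ for the support vectors, the normal vector $w$, and the intercept $b$.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two (essentially) separable Gaussian classes in d = 2
X = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=+2, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

svm = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates the hard-margin (separable) case

alpha_y = svm.dual_coef_[0]        # alpha_i * y_i, one entry per support vector
sv = svm.support_vectors_          # the support vectors x_i
w = svm.coef_[0]                   # normal vector w
b = svm.intercept_[0]

# the classifier f(x) = sign(w^T x + b) is recovered from support vectors alone:
# w = sum_i alpha_i y_i x_i
assert np.allclose(w, alpha_y @ sv)
```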
SVMs, Computation
Major computational point: the classifier depends on the data only through inner products!
Thus it is enough to store only the inner products.
This creates big savings in optimization, especially for HDLSS data.
But it also creates variations in kernel embedding (interpretation?!?).
This is almost always done in practice.

SVMs, Computation & Embedding
For an embedding map $\Phi$, e.g. explicit embedding: maximize
$$L_D(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, \Phi(x_i)^\top \Phi(x_j),$$
and get classification function
$$f(x) = \operatorname{sign}\Big( \sum_i \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b \Big).$$
Straightforward application of embedding, but it loses the inner product advantage.

SVMs, Computation & Embedding
Implicit embedding: replace inner products by a kernel $k(x_i, x_j)$. Maximize
$$L_D(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j),$$
and get classification function
$$f(x) = \operatorname{sign}\Big( \sum_i \alpha_i y_i\, k(x_i, x) + b \Big).$$
Still defined only via inner products, so it retains the optimization advantage; thus it is used very commonly.
Comparison to explicit embedding? Which is better???
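The explicit/implicit distinction can be made concrete with scikit-learn (a sketch, under the assumption that the "window" of these slides corresponds to the Gaussian bandwidth, i.e. gamma = 1/(2*window^2)): the same Gaussian kernel can be supplied either as an explicit feature matrix fed to a linear SVM, or implicitly as a precomputed Gram matrix.

```python
import numpy as np
from sklearn.svm import SVC

def gauss_kernel(A, B, window=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 * window^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * window ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # donut-like labels

# Explicit embedding: kernel evaluations at the data become new coordinates,
# then an ordinary *linear* SVM is run in that n-dimensional feature space.
Phi = gauss_kernel(X, X)
svm_explicit = SVC(kernel="linear").fit(Phi, y)

# Implicit embedding: the Gram matrix enters only through the kernel trick.
K = gauss_kernel(X, X)
svm_implicit = SVC(kernel="precomputed").fit(K, y)

# Predicting at new points: explicit uses the new coordinates, implicit uses
# kernel values between the new points and the training data.
X_new = rng.normal(size=(5, 2))
print(svm_explicit.predict(gauss_kernel(X_new, X)))
print(svm_implicit.predict(gauss_kernel(X_new, X)))
```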
Support Vector Machines
Target toy data set:

Support Vector Machines
Explicit embedding (Gaussian kernel, i.e. radial basis function):
Window = 0.1.
Window = 1: pretty big change (a factor of 10).
Window = 10: not quite as good???
Window = 100: note the lost center (over-smoothed).

Support Vector Machines
Interesting alternative viewpoint: study projections in kernel space.
(Never done in the machine learning world.)

Support Vector Machines
Kernel space projections:
Window = 0.1: note data piling at the margin.
Window = 1: excellent separation (but less than window = 0.1).
Window = 10: still good (but some overlap).
Window = 100: some reds on the wrong side (missed the center).

Support Vector Machines
Implicit embedding:
Window = 0.1.
Window = 0.5.
Window = 1.
Window = 10.

Support Vector Machines
Notes on implicit embedding:
Similar large vs. small window lessons.
The range of reasonable results seems to be smaller (note the different range of windows).
Much different edge behavior.
Interesting topic for future work.
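The window comparison above can be reproduced in a rough way with scikit-learn's RBF SVM (a sketch; the correspondence gamma = 1/(2*window^2) is an assumption about how the slides' "window" maps to scikit-learn's parameterization).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.2, 1, -1)   # target-like toy data

for window in [0.1, 1.0, 10.0, 100.0]:
    gamma = 1.0 / (2.0 * window ** 2)
    acc = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean()
    print(f"window = {window:6.1f}   cv accuracy = {acc:.3f}")
# Small windows fit very locally; very large windows over-smooth and
# lose the center class, as in the slides.
```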
SVMs & Robustness
Usually not severely affected by outliers, but a possible weakness: there can be very influential points.
Toy example: only 2 points drive the SVM.
Notes:
Huge range of chosen hyperplanes.
But all are pretty good discriminators.
This only happens when the whole range is OK??? Good or bad?

SVMs & Robustness
Effect of violators:
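The influential-points phenomenon above reflects the fact that the fitted SVM hyperplane is determined entirely by its support vectors. A quick sketch (scikit-learn, illustrative data): deleting a point that is not a support vector leaves the direction essentially unchanged, while moving a support vector can swing it.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, size=(30, 2)), rng.normal(+2, 1, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

svm = SVC(kernel="linear", C=1e3).fit(X, y)
w_full = svm.coef_[0] / np.linalg.norm(svm.coef_[0])

# drop one point that is NOT a support vector: the direction is unchanged
non_sv = [i for i in range(len(y)) if i not in svm.support_][0]
keep = np.ones(len(y), dtype=bool)
keep[non_sv] = False
w_drop = SVC(kernel="linear", C=1e3).fit(X[keep], y[keep]).coef_[0]
w_drop = w_drop / np.linalg.norm(w_drop)
print("angle after dropping a non-support point (degrees):",
      np.degrees(np.arccos(np.clip(w_full @ w_drop, -1, 1))))
```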
SVMs, Tuning Parameter
Recall the regularization parameter C:
Controls the penalty for violation, i.e. lying on the wrong side of the plane.
Appears in the slack variables.
Affects the performance of the SVM.
Toy example: d = 50, spherical Gaussian data.

SVMs, Tuning Parameter
Toy example: d = 50, spherical Gaussian data.
X-axis: optimal direction. Other axis: SVM direction.
Small C: where is the margin? Small angle to optimal (generalizable).
Large C: more data piling; larger angle (less generalizable); bigger gap (but maybe not better???).
In between: very small range.

SVMs, Tuning Parameter
Toy example: d = 50, spherical Gaussian data.
Careful look at small C: put MD (the mean difference direction) on the horizontal axis.
Shows SVM and MD are the same for small C. Mathematics behind this?
They separate for large C. No data piling for MD.
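A sketch of this toy example (assumptions: scikit-learn's C plays the role of the C above, and the "optimal direction" for two spherical Gaussians differing only in mean is the direction of the mean difference, here the first coordinate axis).

```python
import numpy as np
from sklearn.svm import SVC

d, n = 50, 25                        # HDLSS-flavored: d = 50, n = 25 per class
rng = np.random.default_rng(4)
mu = np.zeros(d); mu[0] = 2.2        # classes differ only in the first coordinate
X = np.vstack([rng.normal(size=(n, d)) - mu, rng.normal(size=(n, d)) + mu])
y = np.array([-1] * n + [+1] * n)

opt_dir = np.zeros(d); opt_dir[0] = 1.0          # optimal (Bayes) direction
md_dir = X[y == 1].mean(0) - X[y == -1].mean(0)  # mean difference (MD) direction
md_dir /= np.linalg.norm(md_dir)

def angle(u, v):
    return np.degrees(np.arccos(np.clip(abs(u @ v), 0, 1)))

for C in [1e-3, 1e-1, 1e1, 1e3]:
    w = SVC(kernel="linear", C=C).fit(X, y).coef_[0]
    w /= np.linalg.norm(w)
    print(f"C = {C:8.0e}   angle(SVM, optimal) = {angle(w, opt_dir):5.1f} deg"
          f"   angle(SVM, MD) = {angle(w, md_dir):5.1f} deg")
# Small C: the SVM direction is close to the mean difference (and to optimal);
# large C: more data piling and a larger angle to the optimal direction.
```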
Support Vector Machines
Important extension: multi-class SVMs.
Hsu & Lin (2002); Lee, Lin, & Wahba (2002).
Defined for the implicit version.
Direction based variation???

Distance Weighted Discrimination
Improvement of SVM for HDLSS data.
Toy example (similar to the earlier movie):

Distance Weighted Discrimination
Toy example, Maximal Data Piling (MDP) direction:
Perfect separation, but gross overfitting, large angle, poor generalizability.

Distance Weighted Discrimination
Toy example, Support Vector Machine direction:
Bigger gap, smaller angle, better generalizability.
Feels support vectors too strongly??? Ugly subpopulations? Improvement?

Distance Weighted Discrimination
Toy example, Distance Weighted Discrimination:
Addresses these issues: smaller angle, better generalizability, nice subpopulations.
Replaces the minimum distance by an average distance.
Distance Weighted Discrimination
Based on an optimization problem. For residuals
$$r_i = y_i (x_i^\top w + b),$$
the idea is (roughly) to minimize the sum of reciprocal residuals,
$$\min_{w, b} \ \sum_i \frac{1}{r_i}, \qquad \|w\| \le 1,$$
which uses poles to push the plane away from the data.

More precisely: work in an appropriate penalty for violations, via slack variables $\xi_i \ge 0$:
$$\min_{w, b, \xi} \ \sum_i \frac{1}{r_i} + C \sum_i \xi_i,
\qquad r_i = y_i (x_i^\top w + b) + \xi_i > 0, \quad \|w\| \le 1.$$

Optimization method: Second Order Cone Programming.
Still a convex generalization of quadratic programming.
Allows fast greedy solution.
Can use available fast software (SDPT3, Michael Todd, et al).
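A minimal sketch of this optimization in cvxpy (an assumption: cvxpy's generic conic solver stands in for the SDPT3 software referenced below; this is illustrative, not the authors' implementation).

```python
import numpy as np
import cvxpy as cp

def dwd_fit(X, y, C=100.0):
    """Distance Weighted Discrimination via a generic convex (conic) solver.

    Minimizes sum_i 1/r_i + C * sum_i xi_i with
    r_i = y_i (x_i' w + b) + xi_i,  xi_i >= 0,  ||w|| <= 1.
    """
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    r = cp.multiply(y, X @ w + b) + xi
    objective = cp.Minimize(cp.sum(cp.inv_pos(r)) + C * cp.sum(xi))
    constraints = [cp.norm(w, 2) <= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# toy HDLSS example
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 1, (10, 40)), rng.normal(+1, 1, (10, 40))])
y = np.array([-1.0] * 10 + [+1.0] * 10)
w, b = dwd_fit(X, y)
print("training signs:", np.sign(X @ w + b).astype(int))
```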
Distance Weighted Discrimination
References for more on DWD:
Current paper: Marron, Todd and Ahn (2007).
Links to more papers: Ahn (2007).
R implementation of DWD: CRAN (2011).
SDPT3 software: Toh (2007).

Distance Weighted Discrimination
2-d visualization: pushes the plane away from the data.
All points have some influence (not just the support vectors).
Support Vector Machines
Graphical view, using toy example:

Distance Weighted Discrimination
Graphical view, using toy example:

Batch and Source Adjustment
Recall from class notes 8/26/14.
For the Stanford Breast Cancer Data (C. Perou).
Analysis in Benito, et al (2004), https://genome.unc.edu/pubsup/dwd/
Adjust for source effects: different sources of mRNA.
Adjust for batch effects: arrays fabricated at different times.
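A sketch of the adjustment idea (an assumption about the mechanics, in the spirit of the DWD adjustment described here): find a discrimination direction between the two batches, project onto it, and rigidly shift each batch so the projected means agree. The direction could come from the dwd_fit sketch above, or from any other discriminator such as an SVM normal vector.

```python
import numpy as np

def batch_adjust(X, batch, direction):
    """Rigidly shift each batch along a discrimination direction so that
    the projected batch means coincide (DWD-style adjustment sketch)."""
    w = direction / np.linalg.norm(direction)
    X_adj = X.astype(float).copy()
    for b in np.unique(batch):
        mask = (batch == b)
        proj_mean = (X_adj[mask] @ w).mean()   # mean projection of this batch
        X_adj[mask] -= proj_mean * w           # shift the whole batch along w
    return X_adj

# usage sketch: `direction` separating the two batches, e.g. from dwd_fit
# X_adj = batch_adjust(X, batch_labels, direction)
```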
Source Batch Adj: biological class colors & symbols.
Source Batch Adj: source colors.
Source Batch Adj: PC 1-3 & DWD direction.
Source Batch Adj: DWD source adjustment.
Source Batch Adj: source adjusted, PCA view.
Source Batch Adj: source & batch adjusted, adjusted PCA.
Why not adjust using SVM?
Major problem: projected distributional shape.
Triangular distributions (oppositely skewed) do not allow a sensible rigid shift.

Why not adjust using SVM?
Nicely fixed by DWD: projected distributions are near Gaussian, so a shift is sensible.

Why not adjust by means?
DWD is complicated: value added?
Because it is cool: recall it improves the SVM for HDLSS data.
Good empirical success: routinely used in the Perou Lab, many comparisons done, similar lessons from Wistar.
Proven statistical power.

Why not adjust by means?
But why not PAM (~ mean difference)? Simpler is better. Why not means, i.e. point cloud centerpoints?
Elegant answer: Xuxin Liu, et al (2009).
Drawback to PAM: poor handling of unbalanced biological subtypes. DWD is more resistant to unbalance.

Why not adjust by means?
Toy example: Gaussian clusters.
Two batches (denoted: + and o).
Two subtypes (red and blue).
Goal: bring together + and o within each subtype.
Challenge: unequal biological ratios within batches.
Twiddle the ratios of the subtypes.

Why not adjust by means?
Lessons from the DWD vs. PAM example:
Both very good for ratio ~ 0.7 - 1.0.
PAM weakens for ratio < 0.7.
DWD robust until ratio < 0.4.
PAM has some use until ratio < 0.2.
DWD has some use until ratio < 0.05.
Both methods eventually fail.
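A sketch of why plain mean (PAM-style) adjustment suffers under unbalanced subtypes (illustrative numbers only, not the slides' data): with unequal subtype ratios, the batch mean difference mixes the true batch artifact with the subtype difference, so shifting by it misaligns the subtypes.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 20
subtype_shift = np.zeros(d); subtype_shift[0] = 4.0   # red vs blue difference
batch_shift = np.zeros(d); batch_shift[1] = 2.0       # true batch artifact

def make_batch(n_red, n_blue, batch_offset):
    red = rng.normal(size=(n_red, d)) + batch_offset
    blue = rng.normal(size=(n_blue, d)) + subtype_shift + batch_offset
    return np.vstack([red, blue])

n = 100
for red_ratio in [0.5, 0.7, 0.9]:
    n_red = int(n * red_ratio)
    A = make_batch(n_red, n - n_red, 0.0)          # batch A: mostly red
    B = make_batch(n - n_red, n_red, batch_shift)  # batch B: mostly blue, plus artifact
    est_batch_shift = B.mean(0) - A.mean(0)        # PAM-style (mean) estimate of the artifact
    error = np.linalg.norm(est_batch_shift - batch_shift)
    print(f"red ratio {red_ratio:.1f}: error of mean-based batch estimate = {error:.2f}")
# Balanced batches recover the batch shift; unbalanced batches contaminate it
# with the subtype difference, which is what DWD adjustment is more resistant to.
```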
Why not adjust by means?
DWD is robust against non-proportional subtypes.
Mathematical statistical question: is there mathematics behind this? (Will answer later.)
DWD in Face Recognition
Face images as data (with M. Benito & D. Peña).
Male vs. female difference? Discrimination rule?
Represented as a long vector of pixel gray levels.
Registration is critical.