Isaac Newton Institute - Cambridge: Object Oriented Data Analysis. J. S. Marron, Dept. of Statistics and Operations Research, University of North Carolina. June 14, 2022
Page 1: Isaac Newton Institute  -  Cambridge

UNC, Stat & OR

Isaac Newton Institute - Cambridge

Object Oriented Data Analysis

J. S. Marron

Dept. of Statistics and Operations Research, University of North Carolina

April 22, 2023

Page 2: Isaac Newton Institute  -  Cambridge

Personal Opinions on Mathematical Statistics

What is Mathematical Statistics?

Validation of existing methods

Asymptotics (n → ∞) & Taylor expansion

Comparison of existing methods

(requires hard math, but really “accounting”???)

Page 3: Isaac Newton Institute  -  Cambridge

Personal Opinions on Mathematical Statistics

What could Mathematical Statistics be?

Basis for invention of new methods

Complicated data → mathematical ideas

Do we value creativity?

Since we don’t do this, others do…

(where are the ₤₤₤s???)

Page 4: Isaac Newton Institute  -  Cambridge

Personal Opinions on Mathematical Statistics

Since we don’t do this, others do…

Pattern Recognition

Artificial Intelligence

Neural Nets

Data Mining

Machine Learning

???

Page 5: Isaac Newton Institute  -  Cambridge

Personal Opinions on Mathematical Statistics

Possible Litmus Test:

Creative Statistics

Clinical Trials Viewpoint:

Worst Imaginable Idea

Mathematical Statistics Viewpoint:

???

Page 6: Isaac Newton Institute  -  Cambridge

Object Oriented Data Analysis, I

What is the “atom” of a statistical analysis?

1st Course: Numbers

Multivariate Analysis Course : Vectors

Functional Data Analysis: Curves

More generally: Data Objects

Page 7: Isaac Newton Institute  -  Cambridge

Object Oriented Data Analysis, II

Examples:

Medical Image Analysis

Images as Data Objects?

Shape Representations as Objects

Micro-arrays

Just multivariate analysis?

Page 8: Isaac Newton Institute  -  Cambridge

Object Oriented Data Analysis, III

Typical Goals:

Understanding population variation

Visualization

Principal Component Analysis +

Discrimination (a.k.a. Classification)

Time Series of Data Objects

Page 9: Isaac Newton Institute  -  Cambridge

Object Oriented Data Analysis, IV

Major Statistical Challenge, I:

High Dimension Low Sample Size (HDLSS)

Dimension d >> sample size n

“Multivariate Analysis” nearly useless

Can’t “normalize the data”

Land of Opportunity for Statisticians

Need for “creative statisticians”
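
One common reading of “can’t normalize the data” is that the sample covariance matrix is singular when d >> n, so the data cannot be sphered by its inverse square root. The sketch below illustrates this under that assumption; the sizes n and d are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 200                      # hypothetical HDLSS sizes: d >> n
X = rng.standard_normal((n, d))     # each row is one data object

S = np.cov(X, rowvar=False)         # d x d sample covariance
rank_S = np.linalg.matrix_rank(S)   # at most n - 1 after mean centering
print(rank_S, "of", d)
```

With rank far below d, the inverse of S does not exist, which is why classical methods built on it break down in this setting.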

Page 10: Isaac Newton Institute  -  Cambridge

Object Oriented Data Analysis, V

Major Statistical Challenge, II:

Data may live in non-Euclidean space

Lie Group / Symmetric Spaces (manifold data)

Trees/Graphs as data objects

Interesting Issues:

What is “the mean” (pop’n center)?

How do we quantify “pop’n variation”?

Page 11: Isaac Newton Institute  -  Cambridge

Statistics in Image Analysis, I

First Generation Problems:

Denoising

Segmentation

Registration

(all about single images)

Page 12: Isaac Newton Institute  -  Cambridge

Statistics in Image Analysis, II

Second Generation Problems:

Populations of Images

Understanding Population Variation

Discrimination (a.k.a. Classification)

Complex Data Structures (& Spaces)

HDLSS Statistics

Page 13: Isaac Newton Institute  -  Cambridge

HDLSS Statistics in Imaging

Why HDLSS (High Dim, Low Sample Size)?

Complex 3-d Objects Hard to Represent

Often need d = 100’s of parameters

Complex 3-d Objects Costly to Segment

Often have n = 10’s cases

Page 14: Isaac Newton Institute  -  Cambridge

Medical Imaging – A Challenging Example

Male Pelvis

Bladder – Prostate – Rectum

How do they move over time (days)?

Critical to Radiation Treatment (cancer)

Work with 3-d CT

Very Challenging to Segment

Find boundary of each object?

Represent each Object?

Page 15: Isaac Newton Institute  -  Cambridge

Male Pelvis – Raw Data

One CT Slice (in 3d image)

Coccyx (Tail Bone)

Rectum

Prostate

Page 16: Isaac Newton Institute  -  Cambridge

Male Pelvis – Raw Data

Prostate:

manual segmentation

Slice by slice

Reassembled

Page 17: Isaac Newton Institute  -  Cambridge

Male Pelvis – Raw Data

Prostate:

Slices: Reassembled in 3d

How to represent?

Thanks: Ja-Yeon Jeong

Page 18: Isaac Newton Institute  -  Cambridge

Object Representation

Landmarks (hard to find)

Boundary Rep’ns (no correspondence)

Medial representations

Find “skeleton”

Discretize as “atoms” called M-reps

Page 19: Isaac Newton Institute  -  Cambridge

3-d m-reps

Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong)

• Medial Atoms provide “skeleton”

• Implied Boundary from “spokes” → “surface”

Page 20: Isaac Newton Institute  -  Cambridge

3-d m-reps

M-rep model fitting

• Easy, when starting from binary (blue)

• But very expensive (30 – 40 minutes technician’s time)

• Want automatic approach

• Challenging, because of poor contrast, noise, …

• Need to borrow information across training sample

• Use Bayes approach: prior & likelihood → posterior

• ~Conjugate Gaussians, but there are issues:

• Major HDLSS challenges

• Manifold aspect of data

Page 21: Isaac Newton Institute  -  Cambridge

PCA for m-reps, I

Major issue: m-reps live in $\mathbb{R}^3 \times \mathbb{R}^+ \times SO(3) \times SO(2)$ (locations, radius and angles)

E.g. “average” of: 2°, 3°, 358°, 359° = ???

Natural Data Structure is: Lie Groups ~ Symmetric spaces

(smooth, curved manifolds)
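
To make the angle example concrete, the sketch below contrasts the naive arithmetic mean of 2°, 3°, 358°, 359° with a mean taken on the circle. The circular mean used here (direction of the averaged unit vectors) is one standard choice, shown as an illustration rather than as the method used for m-reps.

```python
import numpy as np

angles_deg = np.array([2.0, 3.0, 358.0, 359.0])

# Naive mean treats angles as points on a line and lands on the far side of the circle.
naive_mean = angles_deg.mean()

# Circular mean: average the unit vectors, then take the direction of the result.
theta = np.deg2rad(angles_deg)
circ_mean = np.rad2deg(np.arctan2(np.sin(theta).mean(), np.cos(theta).mean())) % 360.0

print(naive_mean, circ_mean)
```

The naive mean is 180.5°, pointing away from every data point, while the circular mean is about 0.5°, which sits in the middle of the cluster.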

Page 22: Isaac Newton Institute  -  Cambridge

PCA for m-reps, II

PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces)

T. Fletcher: Principal Geodesic Analysis

Idea: replace “linear summary of data” with “geodesic summary of data”…

Page 23: Isaac Newton Institute  -  Cambridge

PGA for m-reps, Bladder-Prostate-Rectum

Bladder – Prostate – Rectum, 1 person, 17 days

PG 1 PG 2 PG 3

(analysis by Ja Yeon Jeong)

Page 24: Isaac Newton Institute  -  Cambridge

PGA for m-reps, Bladder-Prostate-Rectum

Bladder – Prostate – Rectum, 1 person, 17 days

PG 1 PG 2 PG 3

(analysis by Ja Yeon Jeong)

Page 25: Isaac Newton Institute  -  Cambridge

PGA for m-reps, Bladder-Prostate-Rectum

Bladder – Prostate – Rectum, 1 person, 17 days

PG 1 PG 2 PG 3

(analysis by Ja Yeon Jeong)

Page 26: Isaac Newton Institute  -  Cambridge

HDLSS Classification (i.e. Discrimination)

Background: Two Class (Binary) version:

Using “training data” from Class +1, and from Class -1

Develop a “rule” for assigning new data to a Class

Canonical Example: Disease Diagnosis

New Patients are “Healthy” or “Ill”

Determined based on measurements

Page 27: Isaac Newton Institute  -  Cambridge

HDLSS Classification (Cont.)

Ineffective Methods:

Fisher Linear Discrimination

Gaussian Likelihood Ratio

Less Useful Methods:

Nearest Neighbors

Neural Nets

(“black boxes”, no “directions” or intuition)

Page 28: Isaac Newton Institute  -  Cambridge

HDLSS Classification (Cont.)

Currently Fashionable Methods:

Support Vector Machines

Trees Based Approaches

New High Tech Method:

Distance Weighted Discrimination (DWD)

Specially designed for HDLSS data

Avoids “data piling” problem of SVM

Solves more suitable optimization problem

Page 29: Isaac Newton Institute  -  Cambridge

HDLSS Classification (Cont.)

Currently Fashionable Methods:

Trees Based Approaches

Support Vector Machines:

Page 30: Isaac Newton Institute  -  Cambridge

Distance Weighted Discrimination

Maximal Data Piling

Page 31: Isaac Newton Institute  -  Cambridge

Distance Weighted Discrimination

Based on Optimization Problem: $\min_{w,b} \sum_{i=1}^{n} \frac{1}{r_i}$

More precisely: work in appropriate penalty for violations

Optimization Method (Michael Todd):

Second Order Cone Programming

Still convex generalization of quadratic programming

Fast greedy solution

Can use existing software
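
As a rough illustration of the DWD criterion (the sum of inverse distances 1/r_i), the sketch below evaluates it for one candidate direction. It assumes r_i = y_i (x_i · w + b) with ||w|| = 1 and class labels y_i = ±1, which is the usual definition but is not spelled out on the slide. The function name `dwd_objective` and the toy data are hypothetical, and no second order cone solver or violation penalty is included.

```python
import numpy as np

def dwd_objective(X, y, w, b):
    """Sum of 1/r_i, where r_i = y_i * (x_i . w + b) and w is scaled to unit length."""
    w = w / np.linalg.norm(w)
    r = y * (X @ w + b)             # signed distances to the separating hyperplane
    if np.any(r <= 0):
        # A full DWD fit would add a penalized slack term here instead of failing.
        raise ValueError("direction does not separate the classes")
    return float(np.sum(1.0 / r))

# Toy separable data (hypothetical): two points per class in 2-d.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-4.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
value = dwd_objective(X, y, w=np.array([1.0, 0.0]), b=0.0)
print(value)
```

Because every point contributes 1/r_i, all data influence the direction, in contrast to the SVM where only the support vectors matter.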

Page 32: Isaac Newton Institute  -  Cambridge

DWD Bias Adjustment for Microarrays

Microarray data:

Simult. Measur’ts of “gene expression”

Intrinsically HDLSS

Dimension d ~ 1,000s – 10,000s

Sample Sizes n ~ 10s – 100s

My view: Each array is “point in cloud”

Page 33: Isaac Newton Institute  -  Cambridge

DWD Batch and Source Adjustment

For Perou’s Stanford Breast Cancer Data

Analysis in Benito, et al (2004) Bioinformatics

https://genome.unc.edu/pubsup/dwd/

Adjust for Source Effects

Different sources of mRNA

Adjust for Batch Effects

Arrays fabricated at different times

Page 34: Isaac Newton Institute  -  Cambridge

DWD Adj: Raw Breast Cancer data

Page 35: Isaac Newton Institute  -  Cambridge

DWD Adj: Source Colors

Page 36: Isaac Newton Institute  -  Cambridge

DWD Adj: Batch Colors

Page 37: Isaac Newton Institute  -  Cambridge

DWD Adj: Biological Class Colors

Page 38: Isaac Newton Institute  -  Cambridge

DWD Adj: Biological Class Colors & Symbols

Page 39: Isaac Newton Institute  -  Cambridge

DWD Adj: Biological Class Symbols

Page 40: Isaac Newton Institute  -  Cambridge

DWD Adj: Source Colors

Page 41: Isaac Newton Institute  -  Cambridge

DWD Adj: PC 1-2 & DWD direction

Page 42: Isaac Newton Institute  -  Cambridge

DWD Adj: DWD Source Adjustment

Page 43: Isaac Newton Institute  -  Cambridge

DWD Adj: Source Adj’d, PCA view

Page 44: Isaac Newton Institute  -  Cambridge

DWD Adj: Source Adj’d, Class Colored

Page 45: Isaac Newton Institute  -  Cambridge

DWD Adj: Source Adj’d, Batch Colored

Page 46: Isaac Newton Institute  -  Cambridge

DWD Adj: Source Adj’d, 5 PCs

Page 47: Isaac Newton Institute  -  Cambridge

DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD

Page 48: Isaac Newton Institute  -  Cambridge

DWD Adj: S. & B1,2 vs. 3 Adjusted

Page 49: Isaac Newton Institute  -  Cambridge

DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs

Page 50: Isaac Newton Institute  -  Cambridge

DWD Adj: S. & B Adj’d, B1 vs. 2 DWD

Page 51: Isaac Newton Institute  -  Cambridge

DWD Adj: S. & B Adj’d, B1 vs. 2 Adj’d

Page 52: Isaac Newton Institute  -  Cambridge

DWD Adj: S. & B Adj’d, 5 PC view

Page 53: Isaac Newton Institute  -  Cambridge

DWD Adj: S. & B Adj’d, 4 PC view

Page 54: Isaac Newton Institute  -  Cambridge

DWD Adj: S. & B Adj’d, Class Colors

Page 55: Isaac Newton Institute  -  Cambridge

DWD Adj: S. & B Adj’d, Adj’d PCA

Page 56: Isaac Newton Institute  -  Cambridge

DWD Bias Adjustment for Microarrays

Effective for Batch and Source Adj.

Also works for cross-platform Adj.

E.g. cDNA & Affy

Despite literature claiming contrary

“Gene by Gene” vs. “Multivariate” views

Funded as part of caBIG

“Cancer BioInformatics Grid”

“Data Combination Effort” of NCI

Page 57: Isaac Newton Institute  -  Cambridge

Interesting Benchmark Data Set

NCI 60 Cell Lines

Interesting benchmark, since same cells

Data Web available:

http://discover.nci.nih.gov/datasetsNature2000.jsp

Both cDNA and Affymetrix Platforms

8 Major cancer subtypes

Use DWD now for visualization

Page 58: Isaac Newton Institute  -  Cambridge

NCI 60: Views using DWD Dir’ns (focus on biology)

Page 59: Isaac Newton Institute  -  Cambridge

DWD in Face Recognition, I

Face Images as Data

(with M. Benito & D. Peña)

Registered using landmarks

Male – Female Difference?

Discrimination Rule?

Page 60: Isaac Newton Institute  -  Cambridge

DWD in Face Recognition, II

DWD Direction

Good separation

Images “make sense”

Garbage at ends? (extrapolation effects?)

Page 61: Isaac Newton Institute  -  Cambridge

Blood vessel tree data

Marron’s brain:

Segmented from MRA

Reconstruct trees in 3d

Rotate to view

Page 62: Isaac Newton Institute  -  Cambridge

Blood vessel tree data

Marron’s brain:

Segmented from MRA

Reconstruct trees in 3d

Rotate to view

Page 63: Isaac Newton Institute  -  Cambridge

Blood vessel tree data

Marron’s brain:

Segmented from MRA

Reconstruct trees in 3d

Rotate to view

Page 64: Isaac Newton Institute  -  Cambridge

Blood vessel tree data

Marron’s brain:

Segmented from MRA

Reconstruct trees in 3d

Rotate to view

Page 65: Isaac Newton Institute  -  Cambridge

Blood vessel tree data

Marron’s brain:

Segmented from MRA

Reconstruct trees in 3d

Rotate to view

Page 66: Isaac Newton Institute  -  Cambridge

Blood vessel tree data

Marron’s brain:

Segmented from MRA

Reconstruct trees in 3d

Rotate to view

Page 67: Isaac Newton Institute  -  Cambridge

Blood vessel tree data

Now look over many people (data objects)

Structure of population (understand variation?)

PCA in strongly non-Euclidean Space???

Page 68: Isaac Newton Institute  -  Cambridge

Blood vessel tree data

Possible focus of analysis:

• Connectivity structure only (topology)

• Location, size, orientation of segments

• Structure within each vessel segment

Page 69: Isaac Newton Institute  -  Cambridge

Blood vessel tree data

Present Focus:

Topology only

Already challenging

Later address others

Then add attributes to tree nodes

And extend analysis

Page 70: Isaac Newton Institute  -  Cambridge

Strongly Non-Euclidean Spaces

Statistics on Population of Tree-Structured Data Objects?

• Mean???

• Analog of PCA???

Strongly non-Euclidean, since:

• Space of trees not a linear space

• Not even approximately linear (no tangent plane)

Page 71: Isaac Newton Institute  -  Cambridge

Strongly Non-Euclidean Spaces

PCA on Tree Space?

Key Idea (Jim Ramsay):

• Replace 1-d subspace that best approximates data

• By 1-d representation that best approximates data

Wang and Marron (2007) define notion of Treeline (in structure space)

Page 72: Isaac Newton Institute  -  Cambridge

PCA for blood vessel tree data

Data Analytic Goals: Age, Gender

See these?

No…

Page 73: Isaac Newton Institute  -  Cambridge

Preliminary Tree-Curve Results

First Correlation of Structure to Age!

(Back Trees)

Page 74: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics

Why study asymptotics?

Page 75: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics

Why study asymptotics?

An interesting (naïve) quote:

“I don’t look at asymptotics, because I don’t have an infinite sample size”

Page 76: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics

Why study asymptotics?

An interesting (naïve) quote:

“I don’t look at asymptotics, because I don’t have an infinite sample size”

Suggested perspective:

Asymptotics are a tool for finding simple structure underlying complex entities

Page 77: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics

Which asymptotics?

n → ∞ (classical, very widely done)

d → ∞ ???

Sensible?

Follow typical “sampling process”?

Say anything, as noise level increases???

Page 78: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics

Which asymptotics?

n → ∞ & d → ∞

n >> d: a few results around

(still have classical info in data)

n ~ d: random matrices (Iain J., et al)

(nothing classically estimable)

HDLSS asymptotics: n fixed, d → ∞

Page 79: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics

HDLSS asymptotics: n fixed, d → ∞

Follow typical “sampling process”?

Page 80: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics

HDLSS asymptotics: n fixed, d → ∞

Follow typical “sampling process”?

Microarrays: # genes bounded

Proteomics, SNPs, …

A moot point, from perspective:

Asymptotics are a tool for finding simple structure underlying complex entities

Page 81: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics

HDLSS asymptotics: n fixed, d → ∞

Say anything, as noise level increases???

Page 82: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics

HDLSS asymptotics: n fixed, d → ∞

Say anything, as noise level increases???

Yes, there exists simple, perhaps surprising, underlying structure

Page 83: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics: Simple Paradoxes, I

For $d$ dim’al “Standard Normal” dist’n: $Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$): $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere of radius $\sqrt{d}$

- Yet origin is point of “highest density”???

- Paradox resolved by:

“density w. r. t. Lebesgue Measure”
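
A quick simulation can make the first paradox tangible: for standard normal vectors in high dimension, the Euclidean norm concentrates near √d. The sketch below is only an illustration; the dimension, sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 10_000, 50                     # arbitrary illustration sizes
Z = rng.standard_normal((n, d))       # n draws from N_d(0, I_d)

norms = np.linalg.norm(Z, axis=1)     # distance of each draw to the origin
max_dev = np.max(np.abs(norms - np.sqrt(d)))

print(np.sqrt(d), norms.min(), norms.max(), max_dev)
```

The norms stay within a few units of √d = 100, so relative to the radius the points sit in a thin shell even though the density is highest at the origin.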

Page 84: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics: Simple Paradoxes, II

For $d$ dim’al “Standard Normal” dist’n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):

Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go???

(we can only perceive 3 dim’ns)
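
The same kind of check works for pairwise distances between independent standard normal vectors, which concentrate near √(2d). This is a hedged illustration with arbitrary sizes and seed, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 10_000, 6                          # arbitrary illustration sizes
Z = rng.standard_normal((n, d))

i, j = np.triu_indices(n, k=1)            # all pairs with i < j
dists = np.linalg.norm(Z[i] - Z[j], axis=1)
ratio = dists / np.sqrt(2 * d)            # close to 1 for every pair

print(ratio.round(3))
```

All 15 pairwise distances come out nearly equal, which is the first hint of the rigid simplex structure discussed later.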

Page 85: Isaac Newton Institute  -  Cambridge

HDLSS Asymptotics: Simple Paradoxes, III

For $d$ dim’al “Standard Normal” dist’n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim’al Angles (as $d \to \infty$): $\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- “Everything is orthogonal”???

- Where do they all go???

(again our perceptual limitations)

- Again 1st order structure is non-random
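
To see the angle between two independent standard normal vectors drift toward 90° as the dimension grows, one can compute it for a few values of d. The helper name `angle_deg` and the chosen dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))

angles = {}
for d in (10, 1_000, 100_000):
    z1, z2 = rng.standard_normal((2, d))   # two independent N_d(0, I_d) draws
    angles[d] = angle_deg(z1, z2)

print(angles)
```

The deviation from 90° shrinks roughly like d^(-1/2), so at d = 100,000 the two vectors are orthogonal to within a fraction of a degree.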

Page 86: Isaac Newton Institute  -  Cambridge

HDLSS Asy’s: Geometrical Representation, I

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

a. Hyperplane through 0, of dimension $n$

b. Points are “nearly equidistant to 0”, & dist $\sim \sqrt{d}$

c. Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”

d. All Gaussian data sets are “near Unit Simplex Vertices”!!!

“Randomness” appears only in rotation of simplex

With P. Hall & A. Neeman
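
One way to check the simplex picture numerically is through the Gram matrix of the rescaled data: if Z Zᵀ / d is close to the n × n identity, the n points Z_i / √d are nearly orthonormal, so some rotation carries them close to the unit simplex vertices e_1, …, e_n. The sketch below assumes i.i.d. standard normal data; sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 20_000, 4                      # arbitrary illustration sizes
Z = rng.standard_normal((n, d))

G = Z @ Z.T / d                       # Gram matrix of the points Z_i / sqrt(d)
gram_err = np.max(np.abs(G - np.eye(n)))

print(G.round(3))
print(gram_err)
```

Since the Gram matrix determines a point configuration up to rotation, a near-identity G means the only remaining randomness is in the orientation of the simplex.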

Page 87: Isaac Newton Institute  -  Cambridge

HDLSS Asy’s: Geometrical Representation, II

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

a. $n - 1$ dimensional hyperplane

b. Points are pairwise equidistant, dist $\sim \sqrt{2d}$

c. Points lie at vertices of “regular $n$-hedron”

d. Again “randomness in data” is only in rotation

e. Surprisingly rigid structure in data?

Page 88: Isaac Newton Institute  -  Cambridge

HDLSS Asy’s: Geometrical Representation, III

Simulation View: shows “rigidity after rotation”

Page 89: Isaac Newton Institute  -  Cambridge

HDLSS Asy’s: Geometrical Representation, III

Straightforward Generalizations:

non-Gaussian data: only need moments

non-independent: use “mixing conditions” (with P. Hall & A. Neeman)

Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi)

All based on simple “Laws of Large Numbers”

Page 90: Isaac Newton Institute  -  Cambridge

HDLSS Asy’s: Geometrical Representation, IV

Explanation of Observed (Simulation) Behavior:

“everything similar for very high d”

2 popn’s are 2 simplices (i.e. regular n-hedrons)

All are same distance from the other class

i.e. everything is a support vector

i.e. all sensible directions show “data piling”

so “sensible methods are all nearly the same”

Including 1 - NN

Page 91: Isaac Newton Institute  -  Cambridge

HDLSS Asy’s: Geometrical Representation, V

Further Consequences of Geometric Representation

1. Inefficiency of DWD for uneven sample size (motivates “weighted version”, work in progress)

2. DWD more “stable” than SVM (based on “deeper limiting distributions”) (reflects intuitive idea “feeling sampling variation”) (something like “mean vs. median”)

3. 1-NN rule inefficiency is quantified.

Page 92: Isaac Newton Institute  -  Cambridge

2nd Paper on HDLSS Asymptotics

Ahn, Marron, Muller & Chi (2007) Biometrika

Assume 2nd Moments (and Gaussian)

Assume no eigenvalues too large in sense:

For $\epsilon = \frac{\left(\sum_{j=1}^{d} \lambda_j\right)^2}{d \sum_{j=1}^{d} \lambda_j^2}$ assume $\epsilon^{-1} = o(d)$, i.e. $\epsilon \gg 1/d$ (min possible)

(much weaker than previous mixing conditions…)
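
The eigenvalue condition can be explored numerically. The sketch below assumes the measure in question is ε = (Σ λ_j)² / (d Σ λ_j²), which is a reading of the garbled formula on this slide, and the function name `sphericity` is a label chosen here for illustration. It evaluates ε at the two extremes: all eigenvalues equal, and all variance in a single direction.

```python
import numpy as np

def sphericity(eigvals):
    """epsilon = (sum lambda_j)^2 / (d * sum lambda_j^2); ranges from 1/d to 1."""
    lam = np.asarray(eigvals, dtype=float)
    return float(lam.sum() ** 2 / (lam.size * np.sum(lam ** 2)))

d = 1000
eps_identity = sphericity(np.ones(d))                    # spherical covariance
eps_one_dir = sphericity(np.r_[1.0, np.zeros(d - 1)])    # one non-zero eigenvalue

print(eps_identity, eps_one_dir)
```

The spherical case gives ε = 1 and the one-direction case gives ε = 1/d, the minimum possible, so requiring ε >> 1/d rules out covariances dominated by very few eigenvalues.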

Page 93: Isaac Newton Institute  -  Cambridge

HDLSS Math. Stat. of PCA, I

Consistency & Strong Inconsistency:

Spike Covariance Model (Johnstone & Paul)

For Eigenvalues: $\lambda_{1,d} = d^{\alpha},\ \lambda_{2,d} = 1, \ldots, \lambda_{d,d} = 1$

1st Eigenvector: $u_1$

How good are empirical versions, $\hat\lambda_{1,d}, \ldots, \hat\lambda_{d,d}, \hat u_1$, as estimates?

Page 94: Isaac Newton Institute  -  Cambridge

HDLSS Math. Stat. of PCA, II

Consistency (big enough spike):

For $\alpha > 1$, $\mathrm{Angle}(u_1, \hat u_1) \to 0$

Strong Inconsistency (spike not big enough):

For $\alpha < 1$, $\mathrm{Angle}(u_1, \hat u_1) \to 90^\circ$
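
The consistency dichotomy can be seen in a small simulation. The sketch below assumes a single-spike Gaussian model with leading eigenvalue d^α along the first coordinate axis and all other eigenvalues equal to 1; the function name `first_pc_angle`, the sizes, and the two values of α are illustrative choices, and finite d only approximates the limiting behavior.

```python
import numpy as np

def first_pc_angle(alpha, d=2000, n=20, seed=0):
    """Angle (degrees) between u_1 = e_1 and the first sample PC direction."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X[:, 0] *= np.sqrt(d ** alpha)         # spike: lambda_1 = d^alpha, others 1
    # The true mean is zero here, so the uncentered SVD gives the PC directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    cos = abs(Vt[0, 0])                    # |<u_1_hat, e_1>|
    return float(np.degrees(np.arccos(min(cos, 1.0))))

angle_big_spike = first_pc_angle(alpha=1.5)     # spike exponent above 1
angle_small_spike = first_pc_angle(alpha=0.2)   # spike exponent below 1

print(angle_big_spike, angle_small_spike)
```

With the large spike the estimated direction lies within a few degrees of the truth, while with the small spike it is close to orthogonal, matching the consistency versus strong inconsistency split.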

Page 95: Isaac Newton Institute  -  Cambridge

HDLSS Math. Stat. of PCA, III

Consistency of eigenvalues?

$\hat\lambda_{1,d} / \lambda_{1,d} \xrightarrow{L} \chi^2_n / n$ as $d \to \infty$

Eigenvalues Inconsistent

But known distribution

Unless $n \to \infty$ as well

Page 96: Isaac Newton Institute  -  Cambridge

HDLSS Work in Progress, II

Canonical Correlations: Myung Hee Lee

Results similar to those for PCA

Singular values inconsistent

But directions converge under a much milder spike assumption.

Page 97: Isaac Newton Institute  -  Cambridge

HDLSS Work in Progress, III

Conditions for Geo. Rep’n & PCA Consist.:

John Kent example: $X \sim \frac{1}{2} N_d(0, I_d) + \frac{1}{2} N_d(0, 100 \cdot I_d)$

Can only say: $\|X\| = O_p(d^{1/2}) = \begin{cases} d^{1/2} & \text{w.p. } 1/2 \\ 10\, d^{1/2} & \text{w.p. } 1/2 \end{cases}$

not deterministic

Conclude: need some flavor of mixing
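
A simulation shows why this example breaks the geometric representation. The sketch below reads the example as an equal mixture of N_d(0, I_d) and N_d(0, 100·I_d), which is an interpretation of the garbled slide formula, and checks that ‖X‖/√d does not settle on a single constant.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 10_000, 200                                  # arbitrary illustration sizes

# Each draw picks standard deviation 1 or 10 (variance 1 or 100) with probability 1/2.
scale = np.where(rng.random(n) < 0.5, 1.0, 10.0)
X = scale[:, None] * rng.standard_normal((n, d))

ratio = np.linalg.norm(X, axis=1) / np.sqrt(d)      # would be ~constant for N_d(0, I_d)
print(np.unique(ratio.round(1)))
```

The ratios cluster near 1 and near 10, so the norm stays random at first order even as d grows, which motivates the need for some mixing-type condition.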

Page 98: Isaac Newton Institute  -  Cambridge

HDLSS Work in Progress, III

Conditions for Geo. Rep’n & PCA Consist.:

Conclude: need some flavor of mixing

Challenge: Classical mixing conditions require notion of time ordering

Not always clear, e.g. microarrays

Page 99: Isaac Newton Institute  -  Cambridge

HDLSS Work in Progress, III

Conditions for Geo. Rep’n & PCA Consist.:

Sungkyu Jung Condition:

$X_d \sim (0, \Sigma_d)$ where $\Sigma_d = U_d \Lambda_d U_d^t$

Define: $Z_d = \Lambda_d^{-1/2} U_d^t X_d$

Assume: ∃ a permutation, $\pi_d$,

So that $\pi_d Z_d$ is ρ-mixing

Page 100: Isaac Newton Institute  -  Cambridge

HDLSS Deep Open Problem

In PCA Consistency:

Strong Inconsistency - spike $\alpha < 1$

Consistency - spike $\alpha > 1$

What happens at boundary ($\alpha = 1$)???

Page 101: Isaac Newton Institute  -  Cambridge

The Future of HDLSS Asymptotics?

1. Address your favorite statistical problem…

2. HDLSS versions of classical optimality results?

3. Contiguity Approach (~Random Matrices)

4. Rates of convergence?

5. Improved Discrimination Methods?

It is early days…

Page 102: Isaac Newton Institute  -  Cambridge

Some Carry Away Lessons

Atoms of the Analysis: Object Oriented

Viewpoint: Object Space ↔ Feature Space

DWD is attractive for HDLSS classification

“Randomness” in HDLSS data is only in rotations

(Modulo rotation, have constant simplex shape)

How to put HDLSS asymptotics to work?