Statistical Paradises and Paradoxes in Big Data Xiao-Li Meng Department of Statistics, Harvard University Thanks to many students and colleagues 1

Statistical Paradises and Paradoxes in Big Data · ww2.amstat.org/misc/XiaoLiMengBDSSG.pdf

Sep 01, 2018

Transcript
Page 1:

Statistical Paradises and Paradoxes

in Big Data

Xiao-Li Meng

Department of Statistics,

Harvard University

Thanks to many students and colleagues

Page 2:

Paradises

• Much larger general pipeline:

• Much better airplane conversations

• Golden era for methodological research

• Emerging theoretical foundations

[Figure: Statistics Concentration (Major) Size at Harvard College]

Page 3:

Berkeley Group: Integrating Stat/Prob with CS, ML, IS, and Math

• Rigorous theory of the trade-off between statistical and computational efficiency, under confidentiality, etc., based on classical statistical decision theory.

• Wide-ranging statistical machine learning theory, methodology, algorithms, using empirical process, signal processing & information theory (e.g., MDL principle).

• Automated Targeted Learning and Super Learning built upon well-established semiparametric and nonparametric theory.

• Algebraic statistics, e.g., studying statistical hypothesis testing via algebraic geometry and computational and combinatorial techniques.

• ……

Page 4:

BFF group: Integrating Bayes, Frequentist, and Fiducial perspectives

• Fusion learning via confidence distributions (CD)

• Combining results from multiple analyses under possibly different perspectives

Page 5:

Jianqing Fan’s Group (Princeton):

Bringing statistical theory and methods to the forefront of Big Data

Fan et al. (2014) Challenges of Big Data Analysis

National Science Review (China) 1: 293-314

Salient features of Big Data

• Heterogeneity (Individuality)

• Noise accumulation

• Spurious correlation

• Incidental endogeneity

• FanBigDataReview.pdf

Page 6:

Great Promises and Grand Challenges

• Multi-Resolution Inference

• Multi-Phase Inference

• Multi-Source Inference

o Meng (2014) A Trio of Inference Problems That Could Win You a Nobel Prize in Statistics (if you help fund it). COPSS 50th Anniversary Volume.

o Blocker and Meng (2013) The Potential and Perils of Preprocessing: Building New Foundations. Bernoulli, 19, 1176-1211.

o Xie and Meng (2016) Dissecting Multiple Imputation from a Multiphase Inference Perspective: What Happens When God’s, Imputer’s and Analyst’s Models are Uncongenial? (With discussion). Statistica Sinica, to appear.

Page 7:

OnTheMap Project of US Census Bureau

• Developed by LED (Local Employment Dynamics).

• Users zoom into any region of the US for paired employee-employer information.

• Used diverse data sources: surveys and administrative datasets with confidential information.

Thanks to Jeremy Wu of the Census Bureau.

Page 8:

Multi-Resolution

Page 9:

Multi-Phase

• To protect confidentiality, the displayed data are synthetic: draws from a posterior.

• Each data source itself has gone through multiple “clean up” processes, most of which are gray boxes or even

Page 10:

Multi-Source

• Built from more than 20 data sources in the LEHD (Longitudinal Employer-Household Dynamics) system.

• Survey Samples: Monthly survey of 60,000 households covering only 0.05% of households.

• Administrative Records: Unemployment insurance wage records covering more than 90% of the US workforce; never intended for inference purposes.

• Census Data: Quarterly census of earnings and wages covering 98% of US jobs.

Page 11:

A Trio of NP-Hard Inference Problems

• Multi-Resolution: How do we infer estimands with resolution far exceeding any possible estimators? Is it possible for such inference to be qualitatively robust even if it cannot be quantitatively robust?

• Multi-Phase: (Big) Data are almost never collected, preprocessed, and analyzed in a single phase. What theory and methods accommodate this multi-phase setup?

• Multi-Source: Which one is better: a survey sample covering 1% or an administrative record covering 95% of the population? How should we combine information from these sources? Is it worth combining?

Page 12:

So which one is better for estimating the population mean: a 1% simple random sample (SRS) or a 95% administrative (observational) dataset (AD)?

1. 1% SRS

2. 95% AD

3. It depends!

4. Is this a trick question?

Page 13:

A fundamental principle of statistics: Variance-Bias Tradeoff

Total Error = Variance + Bias$^2$

• Probabilistic SRS: $\frac{1-f_s}{n} S^2 + 0$

• Large non-probabilistic data: $\approx 0 + r^2 \frac{1-f_a}{f_a} S^2$

• $f$ is the fraction of the population in the data, $f = n/N$ ($f_s$ for the SRS, $f_a$ for the large non-probabilistic data)

• $r$ is the correlation between the (honest) responded/recorded value $X$ and the probability of response/recording, $P(X)$

• “Big Data Paradox”: the larger the data, the more pronounced the bias
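
The two error expressions above can be compared numerically. The following is a minimal sketch that takes the slide's approximations at face value; the function names `mse_srs` and `mse_nonprob`, the population size `N`, and the choice S² = 1 are illustrative assumptions and do not come from the talk.

```python
def mse_srs(n: int, N: int, S2: float = 1.0) -> float:
    """Total error of the sample mean under simple random sampling.

    Pure variance, (1 - f_s) / n * S^2 with f_s = n / N; the bias is zero.
    """
    f_s = n / N
    return (1.0 - f_s) / n * S2


def mse_nonprob(f_a: float, r: float, S2: float = 1.0) -> float:
    """Approximate total error of the mean of a large non-probability dataset.

    The variance is treated as negligible; the squared bias is
    r^2 * (1 - f_a) / f_a * S^2.
    """
    return r ** 2 * (1.0 - f_a) / f_a * S2


N = 320_000_000  # hypothetical population size (assumption)
r = 0.1          # correlation between a value and its recording probability

print(f"SRS with n=100: {mse_srs(100, N):.5f}")
for f_a in (0.05, 0.50, 0.95):
    print(f"AD with f_a={f_a:.2f}: {mse_nonprob(f_a, r):.5f}")
```

Because the non-probabilistic error depends on the fraction f_a and not on the raw count, a dataset covering 5% of a very large population can still be far less accurate than 100 randomly sampled units.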

Page 14:

For estimating a population mean, if r=0.1, how large does an AD, as a percentage of the US population, need to be in order to produce a more accurate sample average than an SRS with n=100 does?

1. <0.5% (1.6M)

2. 5% (16M)

3. 10% (32M)

4. 20% (64M)

5. 50% (160M)

6. 75% (240M)

7. 90% (288M)

8. >95% (303M)
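
Under the approximation from the variance-bias slide (AD error ≈ r²(1−f_a)/f_a · S², SRS error = (1−f_s)/n · S²), the break-even fraction has a closed form. The sketch below is one way to work it out; `break_even_fraction` is a hypothetical helper, and N = 320 million is an assumption inferred from the answer options (5% paired with 16M).

```python
def break_even_fraction(n: int, r: float, N: int) -> float:
    """Population fraction f_a at which the AD's approximate error,
    r^2 * (1 - f_a) / f_a * S^2, equals the SRS error, (1 - n/N) / n * S^2.

    S^2 cancels, so equating the two gives f_a = 1 / (1 + c),
    where c = (1 - n/N) / (n * r^2). Larger fractions favor the AD.
    """
    c = (1.0 - n / N) / (n * r ** 2)
    return 1.0 / (1.0 + c)


N = 320_000_000  # assumed population size, implied by the answer options
f_star = break_even_fraction(n=100, r=0.1, N=N)
print(f"break-even fraction: {f_star:.2%} (about {f_star * N / 1e6:.0f}M records)")
```

With n = 100 and r = 0.1 the product n·r² equals 1, so the break-even point sits just above one half of the population. This holds only under the stated approximation and a known, fixed r.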

Page 15:

Big Data: Big Size or Big Fraction?

• Size matters, but only after having quality

• Importance of combining non-probabilistic samples with probabilistic ones, however small the latter are.

• More does NOT guarantee better:

• I got more data, my model is more refined, but my estimator is getting worse! Am I just dumb? (Meng and Xie, 2014, Econometric Reviews, 218-250)

Page 16:

So when/why do we need Big Data?

• Individualized treatments (e.g., medical; educational; marketing; news)

• Inference/prediction with very weak signal-to-noise ratio (e.g., climate change)

• Understand deeply connected (spatial) networks and (temporal) dynamics

Page 17:

What does Big Data mean for you?

We see you and others more clearly

2015/11/1

Page 18:

Gift: Treatment for you based only on data from people like you.

Curse: No one is perfectly like you.

Page 19:

Personalized Treatment: Sounds heavenly, but where on Earth did they find the right guinea pig for me?

Liu and Meng (2014) A Fruitful Resolution to Simpson’s Paradox via Multi-Resolution Inference, The American Statistician, 17-29

Page 20:

A Painful Problem

Page 21:

Kidney Stone Treatment

C. R. Charig, D. R. Webb, S. R. Payne, O. E. Wickham (March 1986) Br Med J (Clin Res Ed) 292 (6524): 879–882.

               Treatment A      Treatment B
Overall        78% (273/350)    83% (289/350)
Small Stone    93% (81/87)      87% (234/270)
Large Stone    73% (192/263)    69% (55/80)

A: Open Surgery; B: Percutaneous Nephrolithotomy

Page 22:

[Figure: Successful vs. unsuccessful outcomes by stone size and overall. Treatment A: large stones 73% successful, small stones 93%, overall 78%. Treatment B: large stones 69% successful, small stones 87%, overall 83%.]

Uneven distribution of stone sizes across treatments makes overall success rate misleading.
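
The reversal can be checked directly from the published counts. This is a small illustrative sketch using the success counts from the Charig et al. table; the dictionary layout and the helper names `rate` and `overall` are assumptions made for the example.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone size, from Charig et al. (1986)
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes: int, total: int) -> Fraction:
    """Exact success rate as a fraction."""
    return Fraction(successes, total)


def overall(treatment: str) -> Fraction:
    """Aggregate success rate, pooling both stone sizes."""
    s = sum(c[0] for c in counts[treatment].values())
    n = sum(c[1] for c in counts[treatment].values())
    return Fraction(s, n)


for size in ("small", "large"):
    a, b = rate(*counts["A"][size]), rate(*counts["B"][size])
    print(f"{size:>5} stones: A {float(a):.0%} vs B {float(b):.0%}")
print(f"overall:      A {float(overall('A')):.0%} vs B {float(overall('B')):.0%}")

# Share of each treatment's patients who had large (harder to treat) stones
for t in ("A", "B"):
    n_large = counts[t]["large"][1]
    n_total = sum(c[1] for c in counts[t].values())
    print(f"treatment {t}: {n_large / n_total:.0%} of patients had large stones")
```

Treatment A wins within each stone size but loses in the pooled comparison, because most of its patients had large stones while most of Treatment B's had small ones.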

Page 23:

Simpson’s Paradox

• Dealing with the disparities between aggregated analysis and disaggregated analyses

• Determining the right level (primary resolution) for analysis

• Understanding the bias-variance (relevance-robustness) trade-off

Page 24:

So what would be the right resolution?

Let’s take a Car Talk challenge (7/11/2015)

Page 25:

From Car Talk: “You are tested positive for D by a test with 95% accuracy. What’s the chance you actually have D, given the prevalence of D is 0.1%?”

1. 1-5%

2. 5-10%

3. 10-25%

4. 25-50%

5. 50-75%

6. 75-95%

7. Could be anything

8. I have no idea.

Page 26:

It could be anything … depending on the meaning of “accuracy” and …

• Need to know how accurate the test is among those with no disease (specificity) AND among those with the disease (sensitivity)

• The probability could be 1 if specificity = 100% (no false positives)

• For a rare disease, overall accuracy ≈ specificity

• Then the answer is less than 2%, if this was a random screening test
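
The role of the two accuracy measures can be written out with Bayes' theorem. The block below is the standard textbook formula, not something taken from the slides; $p$ denotes the prevalence, and "sens" and "spec" abbreviate sensitivity and specificity.

```latex
P(D \mid +) = \frac{\mathrm{sens} \cdot p}{\mathrm{sens} \cdot p + (1 - \mathrm{spec})(1 - p)},
\qquad
\text{accuracy} = \mathrm{sens} \cdot p + \mathrm{spec} \cdot (1 - p) \approx \mathrm{spec}
\quad \text{when } p \text{ is small.}
```

With sens = spec = 0.95 and p = 0.001, the first expression gives 0.00095 / (0.00095 + 0.04995) ≈ 1.87%.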

Page 27:

100,000 People for Screening

• 100 D (0.1%): 95 pos (95%), 5 neg (5%)

• 99,900 no D (99.9%): 4,995 pos (5%), 94,905 neg (95%)

• 95/(95+4,995) = 1.87%

1,000 with Symptoms

• 100 D (10%): 95 pos (95%), 5 neg (5%)

• 900 no D (90%): 45 pos (5%), 855 neg (95%)

• 95/(95+45) = 67.9%
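
The two trees differ only in the prevalence of D in the group being tested. As a minimal sketch, assuming sensitivity and specificity are both 95% as in the diagrams, the counting argument can be reproduced as follows; `positive_predictive_value` is a hypothetical helper name.

```python
def positive_predictive_value(population: int, prevalence: float,
                              sensitivity: float, specificity: float) -> float:
    """P(disease | positive test), computed by counting people as in the trees."""
    diseased = population * prevalence
    healthy = population - diseased
    true_pos = diseased * sensitivity            # diseased people who test positive
    false_pos = healthy * (1.0 - specificity)    # healthy people who test positive
    return true_pos / (true_pos + false_pos)


# Random screening: prevalence 0.1% among 100,000 people
screening = positive_predictive_value(100_000, 0.001, 0.95, 0.95)
# People who already show symptoms: prevalence 10% among 1,000 people
symptomatic = positive_predictive_value(1_000, 0.10, 0.95, 0.95)

print(f"screening:   {screening:.2%}")
print(f"symptomatic: {symptomatic:.1%}")
```

The same test result carries very different meaning depending on which population the tested person is conditioned on.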

Conditioning is the Soul of Statistics

--- Joe Blitzstein

Page 28:

Bayes’ Theorem

When the facts change, I change my opinion. What do you do, sir?

~ John Maynard Keynes

Page 29:

Useful Statistical Principles/Concepts for Data Science

Data Selection and Replication Mechanisms:

Randomization, sampling, experiments, observational studies, missing data mechanisms; latent variables/constructs; potential outcomes; confidentiality protections

Conditioning vs. Marginalizing:

Disaggregation vs. aggregation, sub-population analysis, individualized inference, Simpson’s paradox, ecological fallacy

Bias-Variance Trade-off:

Efficiency vs. Robustness, Relevance vs. Robustness; model predictability vs. fitness

Inference principles/perspectives:

Likelihood principle; Bayesian thinking; fiducial argument for objectivity; uncertainty quantifications

…….

Page 30:

A Traditional Statistical Theme/Aim:

Seeking representative samples to infer about populations

A Big-Data Statistical Theme/Aim:

Constructing approximating populations to infer about individuals

[Figure: Targeted Individual; Approx. Population]

Page 31:

One more V for Big Data: Veracity

Page 32:

I find your presentation …

1. Inspiring and thought provoking

2. informative and I learned a few things

3. confusing and not very helpful

4. what a waste of my time!