Transcript
  • On the Computational and Statistical Interface and Big Data

    Michael I. Jordan University of California, Berkeley

    May 26, 2014

    With: Venkat Chandrasekaran, John Duchi, Martin Wainwright and Yuchen Zhang

  • What Is the Big Data Phenomenon?

    Science in confirmatory mode (e.g., particle physics)

    Science in exploratory mode (e.g., astronomy, genomics)

    Measurement of human activity, particularly online activity, is generating massive datasets that can be used (e.g.) for personalization and for creating markets

    Sensor networks are becoming pervasive


  • What Are the Conceptual/Mathematical Issues?

    The need to control statistical risk under constraints on algorithmic runtime: how do risk and runtime trade off as a function of the amount of data?

    Statistical inference with distributed and streaming data: how is inferential quality impacted by communication constraints?

    The tradeoff between statistical risk and privacy

    Many other issues that require a blend of statistical thinking (e.g., a focus on sampling, confidence intervals, evaluation, diagnostics, causal inference) and computational thinking (e.g., scalability, abstraction)


  • Data as a Resource

    Computer science studies the management of resources, such as time, space, and energy

    Data has not been viewed as a resource, but as a workload

    The fundamental issue is that data now needs to be viewed as a resource: the data resource combines with other resources to yield timely, cost-effective, high-quality decisions and inferences

    Just as with time or space, it should be the case (to first order) that the more of the data resource, the better; this is not true in our current state of knowledge


  • Big Data, Big Problems

    Model complexity often grows much faster than the number of data points

    Statistical control is thus essential, but such control involves algorithms, and these may (and often do) scale poorly with the number of data points

    Even worse, the more data, the less likely it is that a sophisticated algorithm will run in an acceptable time frame; we then have to back off to cheaper algorithms that may be more error-prone, or we can subsample, but this requires knowing the statistical value of each data point, which we generally don't know a priori


  • Our Approach

    Take (classical) statistical decision theory as a mathematical point of departure

    Treat computation, communication, privacy, etc. as constraints on statistical risk

    This induces tradeoffs among these quantities and the number of data points

    Under the hood: geometry, information theory, and optimization

  • Outline

    Background on minimax decision theory

    Privacy constraints

    Communication constraints

    Computational constraints (via optimization)

  • Background

    In the 1930s, Wald laid the foundations of statistical decision theory

    Given a family of probability distributions $\mathcal{P}$, a parameter $\theta(P)$ for each $P \in \mathcal{P}$, an estimator $\hat{\theta}$, and a loss $l(\hat{\theta}, \theta(P))$, define the risk:

    $$R_P(\hat{\theta}) := \mathbb{E}_P\Big[l\big(\hat{\theta}, \theta(P)\big)\Big]$$

    Minimax principle [Wald '39, '43]: choose the estimator minimizing the worst-case risk:

    $$\sup_{P \in \mathcal{P}} \mathbb{E}_P\Big[l\big(\hat{\theta}, \theta(P)\big)\Big]$$
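    As a concrete instance (a standard textbook example, not spelled out on the slide): take the Gaussian location family with known variance, $\mathcal{P} = \{N(\theta, \sigma^2) : \theta \in \mathbb{R}\}$, with squared loss $l(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$. The sample mean $\bar{X}_n$ has constant risk

    $$R_P(\bar{X}_n) = \mathbb{E}_P\big[(\bar{X}_n - \theta)^2\big] = \frac{\sigma^2}{n} \quad \text{for every } P \in \mathcal{P},$$

    and it is in fact minimax: no estimator achieves a smaller worst-case risk.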

  • Part I: Privacy and Minimax Risk

    with John Duchi and Martin Wainwright, University of California, Berkeley

  • Privacy and Risk

    Individuals are not generally willing to allow their personal data to be used without control over how they will be used and how much privacy loss they will incur

    We will quantify privacy loss via differential privacy

    We then treat differential privacy as a constraint on inference via statistical decision theory

    This yields (personal) tradeoffs between privacy loss and inferential gain


  • A Model of Privacy

    Local privacy: providers do not trust the collector [Warner '65; Evfimievski et al. '03]

    Individuals with private data $X_i \stackrel{\mathrm{iid}}{\sim} P$, $i \in \{1, \ldots, n\}$, each release only a privatized view $Z_i \sim Q(\cdot \mid X_i)$ through a channel $Q$; the estimator sees only these views, $Z_1^n \mapsto \hat{\theta}(Z_1^n)$


  • Definitions of Privacy

    Definition [Dwork, McSherry, Nissim, Smith '06]: the channel $Q$ is $\alpha$-differentially private if

    $$\sup_{S,\; x \in \mathcal{X},\; x' \in \mathcal{X}} \frac{Q(Z \in S \mid x)}{Q(Z \in S \mid x')} \le \exp(\alpha)$$

    i.e., $\big|\log Q(z \mid x) - \log Q(z \mid x')\big| \le \alpha$ uniformly

    Testing interpretation, via Neyman-Pearson [Wasserman & Zhou '10]: given $Z$, one cannot (reliably) recover $x$
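    To make the definition concrete, here is a minimal sketch (mine, not from the talk) of the classical randomized-response channel for a single bit; its likelihood ratios are bounded by $e^\alpha$, so it is $\alpha$-differentially private. Function names are illustrative:

    import math
    import random

    def randomized_response(x: int, alpha: float) -> int:
        """Release bit x in {0, 1}, keeping it with probability e^a / (1 + e^a).

        For any output z and inputs x, x', Q(z | x) / Q(z | x') <= e^alpha,
        so this channel is alpha-differentially private.
        """
        p_keep = math.exp(alpha) / (1 + math.exp(alpha))
        return x if random.random() < p_keep else 1 - x

    def debiased_mean(reports, alpha):
        """Estimate E[X] from private reports: E[Z] = (2p - 1) E[X] + (1 - p)."""
        p = math.exp(alpha) / (1 + math.exp(alpha))
        z_bar = sum(reports) / len(reports)
        return (z_bar - (1 - p)) / (2 * p - 1)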

  • Private Minimax Risk

    Central object of study: the minimax risk, for a family of distributions $\mathcal{P}$, a parameter $\theta(P)$, and a loss $\ell$ measuring error:

    $$\mathfrak{M}_n\big(\theta(\mathcal{P}), \ell\big) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\Big[\ell\big(\hat{\theta}(X_1^n), \theta(P)\big)\Big]$$


  • $\alpha$-Private Minimax Risk

    The $\alpha$-private minimax risk (the minimax risk under the privacy constraint, using the best $\alpha$-private channel) adds an infimum over the family $\mathcal{Q}_\alpha$ of $\alpha$-private channels:

    $$\mathfrak{M}_n\big(\theta(\mathcal{P}), \ell, \alpha\big) := \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\Big[\ell\big(\hat{\theta}(Z_1^n), \theta(P)\big)\Big]$$


  • Vignette: Private Mean (Location) Estimation

    Example: estimate reasons for hospital visits. Patients are admitted to the hospital for substance abuse; estimate the prevalence of different substances

    Each patient's record is a binary indicator vector over substances, and the parameter is the vector of proportions:

    Alcohol: $X_1 = 1$, $\theta_1 = .45$
    Cocaine: $X_2 = 1$, $\theta_2 = .32$
    Heroin: $X_3 = 0$, $\theta_3 = .16$
    Cannabis: $X_4 = 0$, $\theta_4 = .20$
    LSD: $X_5 = 0$, $\theta_5 = .00$
    Amphetamines: $X_6 = 0$, $\theta_6 = .02$


  • Vignette: Mean Estimation

    Consider estimation of the mean $\theta(P) := \mathbb{E}_P[X] \in \mathbb{R}^d$, with errors measured in the $\ell_\infty$-norm, i.e., $\mathbb{E}[\|\hat{\theta} - \theta\|_\infty]$, over $\mathcal{P}_d :=$ {distributions $P$ supported on $[-1, 1]^d$}

    Proposition: the minimax rate (achieved by the sample mean) is

    $$\mathfrak{M}_n\big(\mathcal{P}_d, \|\cdot\|_\infty\big) \asymp \min\Bigg\{1, \frac{\sqrt{\log d}}{\sqrt{n}}\Bigg\}$$

    while the private minimax rate, for $\alpha = O(1)$, is

    $$\mathfrak{M}_n\big(\mathcal{P}_d, \|\cdot\|_\infty, \alpha\big) \asymp \min\Bigg\{1, \frac{\sqrt{d \log d}}{\sqrt{n\alpha^2}}\Bigg\}$$

    Note: effective sample size $n \mapsto n\alpha^2 / d$
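    Spelling out the slide's note: substituting the effective sample size $n\alpha^2/d$ for $n$ in the non-private rate recovers exactly the private rate,

    $$\frac{\sqrt{\log d}}{\sqrt{n\alpha^2 / d}} = \frac{\sqrt{d \log d}}{\sqrt{n\alpha^2}},$$

    so local $\alpha$-privacy costs a dimension-dependent factor of $d/\alpha^2$ in sample size.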


  • Optimal Mechanism?

    Non-private observation: $X = (1, 0, 1, 0, 0)^\top$

    Idea 1: add independent noise (e.g., the standard Laplace mechanism),

    $$Z = X + W = (1 + W_1,\; 0 + W_2,\; 1 + W_3,\; 0 + W_4,\; 0 + W_5)^\top$$

    Problem: the noise magnitude is much too large (and this is unavoidable: the mechanism is provably sub-optimal)
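    To see why (a sketch of mine, not from the talk): in the local model the Laplace mechanism must calibrate its noise to the $\ell_1$-sensitivity of the entire vector, which is $d$ for binary vectors, so every coordinate receives noise of scale $d/\alpha$:

    import numpy as np

    def laplace_mechanism(x: np.ndarray, alpha: float, rng) -> np.ndarray:
        """Local Laplace mechanism for x in {0,1}^d.

        The l1-distance between two arbitrary binary vectors can be as large
        as d, so alpha-differential privacy forces per-coordinate Laplace
        noise of scale d / alpha: the "much too large" magnitude noted above.
        """
        d = x.size
        return x + rng.laplace(scale=d / alpha, size=d)

    rng = np.random.default_rng(0)
    z = laplace_mechanism(np.array([1, 0, 1, 0, 0]), alpha=1.0, rng=rng)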


  • Optimal Mechanism

    Non-private observation: $X = (1, 0, 1, 0, 0)^\top$

    Draw $v$ uniformly in $\{0, 1\}^d$, giving two candidate views: View 1, $v = (0, 1, 1, 0, 0)^\top$, and View 2, $1 - v = (1, 0, 0, 1, 1)^\top$

    With probability $\frac{e^\alpha}{1 + e^\alpha}$, release whichever of $v$ and $1 - v$ is closer to $X$ (here $v$, with 3 overlapping coordinates, versus 2 for $1 - v$); otherwise release the farther one

    At the end: compute the sample average of the released vectors and de-bias
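    A runnable sketch of this mechanism for binary data, as I read it off the slide; the tie-breaking rule and the exact de-biasing constant below are my reconstruction, not the authors' code:

    import math
    import numpy as np

    def privatize(x: np.ndarray, alpha: float, rng) -> np.ndarray:
        """Draw v uniform in {0,1}^d; with prob e^a/(1+e^a) release the closer
        of {v, 1-v} to x, otherwise the farther (ties broken by a fair coin)."""
        d = x.size
        v = rng.integers(0, 2, size=d)
        agree = int(np.sum(v == x))
        closer = v if (2 * agree > d or (2 * agree == d and rng.random() < 0.5)) else 1 - v
        p = math.exp(alpha) / (1 + math.exp(alpha))
        return closer if rng.random() < p else 1 - closer

    def agreement_prob(d: int, alpha: float) -> float:
        """q = P(Z_j = X_j), by conditioning on the Binomial(d-1, 1/2) number
        of agreements on the other coordinates; used for de-biasing."""
        pmf = [math.comb(d - 1, k) / 2 ** (d - 1) for k in range(d)]
        a = sum(pmf[k] for k in range(d) if 2 * (k + 1) > d)  # v strictly closer
        if d % 2 == 0:
            a += 0.5 * pmf[d // 2 - 1]                        # tie, fair coin
        p = math.exp(alpha) / (1 + math.exp(alpha))
        return p * a + (1 - p) * (1 - a)

    def estimate(Z: np.ndarray, alpha: float) -> np.ndarray:
        """De-bias the sample average: E[Z_j] = (2q - 1) theta_j + (1 - q)."""
        q = agreement_prob(Z.shape[1], alpha)
        return (Z.mean(axis=0) - (1 - q)) / (2 * q - 1)

    rng = np.random.default_rng(0)
    theta = np.array([.45, .32, .16, .20, .00, .02])    # true proportions
    X = (rng.random((50_000, 6)) < theta).astype(int)   # one record per patient
    Z = np.array([privatize(x, alpha=1.0, rng=rng) for x in X])
    print(estimate(Z, alpha=1.0))                       # approximately theta

    Each released vector stays in $\{0, 1\}^d$, so the report magnitude matches the data magnitude, in contrast to the Laplace mechanism above.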

  • Empirical Evidence

    Estimate the proportion of emergency-room visits involving different substances

    Data source: Drug Abuse Warning Network

    (Plot: estimation error versus sample size n)


  • Sample Size Reductions

    Given an $\alpha$-private channel $Q$, a pair $\{P_1, P_2\}$ induces the marginals

    $$M_j(S) := \int Q(S \mid x_1, \ldots, x_n)\, dP_j^n(x_1, \ldots, x_n), \qquad j \in \{1, 2\}$$

    Question: how much contraction does privacy induce?

    Theorem (data processing): for any $\alpha$-private channel $Q$ and an i.i.d. sample of size $n$,

    $$D_{\mathrm{kl}}(M_1 \,\|\, M_2) + D_{\mathrm{kl}}(M_2 \,\|\, M_1) \le 4n\,(e^\alpha - 1)^2\, \|P_1 - P_2\|_{\mathrm{TV}}^2$$

    Note: for $\alpha \lesssim 1$, $(e^\alpha - 1)^2 \asymp \alpha^2$, so effectively $n \mapsto n\alpha^2$
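    One way to read the theorem (my gloss): by Le Cam's two-point method, reliably distinguishing $P_1$ from $P_2$ requires the KL divergence between the induced marginals to be of constant order, hence

    $$n\alpha^2\, \|P_1 - P_2\|_{\mathrm{TV}}^2 \gtrsim 1, \quad \text{i.e.,} \quad n \gtrsim \frac{1}{\alpha^2\, \|P_1 - P_2\|_{\mathrm{TV}}^2},$$

    whereas without privacy on the order of $1 / \|P_1 - P_2\|_{\mathrm{TV}}^2$ samples suffice; this is the sample-size reduction $n \mapsto n\alpha^2$.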

  • Final Remarks: Privacy

    Rough technique: reduce estimation to testing, then apply information-theoretic testing lower bounds [Le Cam, Hasminskii, Ibragimov, Assouad, Birgé, Barron, Yu, ...]

    Key: this allows identification of new optimal mechanisms

    Additional examples: fixed-design regression, convex risk minimization, multinomial estimation, nonparametric density estimation

    Almost always: effective sample size reduction $n \mapsto n\alpha^2$; in $d$-dimensional problems, $n \mapsto n\alpha^2 / d$

  • Part II: Communication and Minimax Risk

    with John Duchi, Martin Wainwright and Yuchen Zhang

    University of California, Berkeley


  • Communication Constraints

    Large data necessitates distributed storage; independent data collection (hospitals); privacy?

    Setting: each of $m$ agents has a sample of size $n$, $X^i = (X^i_1, X^i_2, \ldots, X^i_n)$, and sends a message $Z_i$ to a fusion center, which computes an estimator $\hat{\theta}$

    Question: what are the tradeoffs between communication and statistical utility?

    [Yao '79; Abelson '80; Tsitsiklis & Luo '87; Han & Amari '98; Tatikonda & Mitter '04; ...]


  • Minimax Risk with B-Bounded Communication

    Central object of study: the minimax risk under communication constraints, for a parameter $\theta(P)$, a family of distributions $\mathcal{P}$, and loss $\|\cdot\|_2^2$:

    $$\mathfrak{M}_n\big(\theta(\mathcal{P}), B\big) := \inf_{\Pi \in \Pi_B} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\Big[\big\|\hat{\theta}(Z_1^m) - \theta(P)\big\|_2^2\Big]$$

    where the outer infimum is over the best protocol whose messages $Z_i = \Pi_i(X^i)$ are constrained to be at most $B$ bits


  • Vignette: Mean Estimation

    Consider estimation in the normal location family: $X^i_j \stackrel{\mathrm{iid}}{\sim} N(\theta, \sigma^2 I_{d \times d})$ with $\theta \in [-1, 1]^d$

    Minimax rate without communication constraints: when each agent has a sample of size $n$,

    $$\mathbb{E}\Big[\big\|\hat{\theta}(X^1, \ldots, X^m) - \theta\big\|_2^2\Big] \asymp \frac{\sigma^2 d}{nm}$$

    Theorem (minimax rate with $B$-bounded communication): when each agent has a sample of size $n$ and a budget of $B$ bits,

    $$\frac{d}{\frac{B}{\log m} \wedge d} \cdot \frac{\sigma^2 d}{nm} \;\lesssim\; \mathfrak{M}_n(\mathcal{N}_d, B) \;\lesssim\; \frac{d \log m}{B \wedge d \log m} \cdot \frac{\sigma^2 d}{nm}$$

    Consequence: each agent sends on the order of $d \log m$ bits for order-optimal estimation
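    For concreteness, a naive protocol sketch (my illustration, not the optimal scheme behind the theorem): each agent quantizes each coordinate of its local sample mean to k bits, so a message costs B = d * k bits, and the fusion center averages the dequantized values:

    import numpy as np

    def quantize(theta_local: np.ndarray, k: int) -> np.ndarray:
        """Uniformly quantize each coordinate of a vector in [-1, 1]^d to k bits."""
        levels = 2 ** k
        idx = np.round((theta_local + 1) / 2 * (levels - 1))
        return np.clip(idx, 0, levels - 1).astype(int)

    def dequantize(idx: np.ndarray, k: int) -> np.ndarray:
        return idx / (2 ** k - 1) * 2 - 1

    def fusion_center(messages, k):
        """Average the m dequantized local means."""
        return np.mean([dequantize(z, k) for z in messages], axis=0)

    rng = np.random.default_rng(1)
    d, m, n, k = 8, 100, 50, 6                   # B = d * k bits per agent
    theta = rng.uniform(-1, 1, size=d)
    X = theta + rng.normal(scale=1.0, size=(m, n, d))
    messages = [quantize(X[i].mean(axis=0).clip(-1, 1), k) for i in range(m)]
    print(fusion_center(messages, k))            # approximately theta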

  • Part III: Computation/Statistics Tradeoffs via Convex Relaxation

    with Venkat Chandrasekaran, Caltech

  • Time-Data Tradeoffs

    Consider an inference problem with fixed risk; inference procedures are viewed as points in a plot of runtime against the number of samples n

    Vertical lines (fixed n): classical estimation theory, well understood

    Horizontal lines (fixed runtime): complexity-theoretic lower bounds, poorly understood, and dependent on the computational model

    Tradeoff upper bounds: more data means a smaller runtime upper bound, so we need weaker algorithms for larger datasets

  • An Estimation Problem

    Signal $x^\star$ from a known (bounded) set $S$; noise $z_i$ (i.i.d. Gaussian)

    Observation model: $y_i = x^\star + z_i$

    Observe $n$ i.i.d. samples $\{y_1, \ldots, y_n\}$

  • Convex Programming Estimator

    The sample mean $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ is a sufficient statistic

    Natural estimator: least squares over the signal set (or its convex hull)

    Convex relaxation: $\hat{x}_n = \arg\min_{x \in C} \|\bar{y} - x\|_2^2$, where $C$ is a convex set such that $S \subseteq C$
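    A minimal sketch (my illustration, under the added assumption that $S$ is a set of sparse bounded vectors whose convex relaxation $C$ is an $\ell_1$-ball): the estimator is then just a Euclidean projection of the sample mean onto $C$:

    import numpy as np

    def project_l1_ball(v: np.ndarray, radius: float) -> np.ndarray:
        """Euclidean projection onto the l1-ball of the given radius
        (the sort-based algorithm of Duchi et al., 2008)."""
        if np.abs(v).sum() <= radius:
            return v
        u = np.sort(np.abs(v))[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
        tau = (css[rho] - radius) / (rho + 1.0)
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def convex_relaxation_estimator(Y: np.ndarray, radius: float) -> np.ndarray:
        """argmin_{x in C} ||ybar - x||_2^2 for C an l1-ball containing S
        is simply the projection of the sample mean onto C."""
        return project_l1_ball(Y.mean(axis=0), radius)

    rng = np.random.default_rng(2)
    x_star = np.zeros(100); x_star[:3] = 1.0     # a sparse signal in S
    Y = x_star + rng.normal(scale=1.0, size=(200, 100))
    x_hat = convex_relaxation_estimator(Y, radius=3.0)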

  • Statistical Performance of Estimator

    Consider the cone of feasible directions $T_C(x^\star)$ into $C$ at the true signal $x^\star$

    Theorem: the risk of the estimator scales as $\frac{\sigma^2}{n}\, g\big(T_C(x^\star)\big)^2$, with $g(\cdot)$ the Gaussian complexity of the (unit-ball-restricted) cone

    Intuition: only the error in the feasible cone need be considered

    This can be refined for better bias-variance tradeoffs
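    A standard instance (my illustration, not from the slides): if $C$ is an appropriately scaled $\ell_1$-ball around a $k$-sparse $x^\star \in \mathbb{R}^p$, the feasible cone has Gaussian complexity $g^2 \asymp k \log(p/k)$, so

    $$\mathbb{E}\Big[\|\hat{x}_n - x^\star\|_2^2\Big] \lesssim \frac{\sigma^2\, k \log(p/k)}{n}.$$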

  • Hierarchy of Convex Relaxations

    Corollary: to obtain risk of at most 1, a sample size on the order of $n \gtrsim \sigma^2\, g\big(T_C(x^\star)\big)^2$ suffices

    Key point: if we have access to larger $n$, we can use a larger $C$, and thereby obtain a weaker (cheaper) estimation algorithm

  • Hierarchy of Convex Relaxations

    If the signal set is algebraic, then one can obtain a family of outer convex approximations: polyhedral, semidefinite, and hyperbolic relaxations (Sherali-Adams, Parrilo, Lasserre, Gårding, Renegar)

    The sets are ordered by computational complexity, and a central role is played by lift-and-project methods

  • Example 1

    The signal set consists of cut matrices

    E.g., collaborative filtering, clustering

  • Example 2

    The signal set consists of all perfect matchings in the complete graph

    E.g., network inference

  • Example 3

    The signal set consists of all adjacency matrices of graphs containing only a clique on square-root many nodes

    E.g., sparse PCA, gene expression patterns; Kolar et al. (2010)

  • Example 4

    Banding estimators for covariance matrices: Bickel-Levina (2007) and many others assume a known variable ordering

    Stylized problem: let $M$ be a known tridiagonal matrix; the signal set consists of the matrices obtained from $M$ by an unknown reordering of the variables

  • Remarks

    In several examples, not too many extra samples are required to move to really simple algorithms

    Approximation ratios vs. Gaussian complexities: the approximation ratio of a relaxation might be bad, but this doesn't matter as much for statistical inference

    A goal: understand the Gaussian complexities of LP/SDP hierarchies, in contrast to the worst-case approximation-ratio focus of theoretical CS


  • Conclusions

    Many conceptual and mathematical challenges arise in taking seriously the problem of Big Data

    Facing these challenges will require a rapprochement between computer science and statistics, bringing them together at the level of their foundations, thus reshaping both disciplines

    This won't happen overnight