Transcript
  • On the Computational and Statistical Interface and Big Data

    Michael I. Jordan University of California, Berkeley

    May 26, 2014

    With: Venkat Chandrasekaran, John Duchi, Martin Wainwright and Yuchen Zhang

  • What Is the Big Data Phenomenon?

    Science in confirmatory mode (e.g., particle physics)

    Science in exploratory mode (e.g., astronomy, genomics)

    Measurement of human activity, particularly online activity, is generating massive datasets that can be used (e.g.) for personalization and for creating markets

    Sensor networks are becoming pervasive


  • What Are the Conceptual/Mathematical Issues?

    The need to control statistical risk under constraints on algorithmic runtime: how do risk and runtime trade off as a function of the amount of data?

    Statistical inference with distributed and streaming data: how is inferential quality impacted by communication constraints?

    The tradeoff between statistical risk and privacy

    Many other issues that require a blend of statistical thinking (e.g., a focus on sampling, confidence intervals, evaluation, diagnostics, causal inference) and computational thinking (e.g., scalability, abstraction)


  • Data as a Resource

    Computer science studies the management of resources, such as time, space, and energy

    Data has not been viewed as a resource, but as a workload

    The fundamental issue is that data now needs to be viewed as a resource: the data resource combines with other resources to yield timely, cost-effective, high-quality decisions and inferences

    Just as with time or space, it should be the case (to first order) that the more of the data resource, the better; this is not true in our current state of knowledge


  • Big Data, Big Problems

    Model complexity often grows much faster than the number of data points

    Statistical control is thus essential, but such control involves algorithms, and these may (and often do) scale poorly with the number of data points

    Even worse, the more data, the less likely it is that a sophisticated algorithm will run in an acceptable time frame; we then have to back off to cheaper algorithms that may be more error-prone, or we can subsample, but this requires knowing the statistical value of each data point, which we generally don't know a priori


  • Our Approach

    Take (classical) statistical decision theory as a mathematical point of departure

    Treat computation, communication, privacy, etc. as constraints on statistical risk

    This induces tradeoffs among these quantities and the number of data points

    Under the hood: geometry, information theory, and optimization

  • Outline

    Background on minimax decision theory

    Privacy constraints

    Communication constraints

    Computational constraints (via optimization)

  • Background

    In the 1930s, Wald laid the foundations of statistical decision theory

    Given a family of probability distributions $\mathcal{P}$, a parameter $\theta(P)$ for each $P \in \mathcal{P}$, an estimator $\hat{\theta}$, and a loss $l(\hat{\theta}, \theta(P))$, define the risk:

    $$R_P(\hat{\theta}) := \mathbb{E}_P\Big[l\big(\hat{\theta}, \theta(P)\big)\Big]$$

    Minimax principle [Wald '39, '43]: choose the estimator minimizing the worst-case risk:

    $$\sup_{P \in \mathcal{P}} \mathbb{E}_P\Big[l\big(\hat{\theta}, \theta(P)\big)\Big]$$
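    As a concrete instance (a standard textbook example, not spelled out on the slide): take the Gaussian location family with known variance, $\mathcal{P} = \{N(\theta, \sigma^2) : \theta \in \mathbb{R}\}$, with squared loss $l(\hat{\theta}, \theta) = (\hat{\theta} - \theta)^2$. The sample mean $\bar{X}_n$ has constant risk

    $$R_P(\bar{X}_n) = \mathbb{E}_P\big[(\bar{X}_n - \theta)^2\big] = \frac{\sigma^2}{n} \quad \text{for every } P \in \mathcal{P},$$

    and it is in fact minimax: no estimator achieves a smaller worst-case risk.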

  • Part I: Privacy and Minimax Risk

    with John Duchi and Martin Wainwright, University of California, Berkeley

  • Privacy and Risk

    Individuals are not generally willing to allow their personal data to be used without control over how they will be used and how much privacy loss they will incur

    We will quantify privacy loss via differential privacy

    We then treat differential privacy as a constraint on inference via statistical decision theory

    This yields (personal) tradeoffs between privacy loss and inferential gain


  • A Model of Privacy

    Local privacy: providers do not trust the collector [Warner '65; Evfimievski et al. '03]

    Individuals with private data $X_i \stackrel{\mathrm{iid}}{\sim} P$, $i \in \{1, \ldots, n\}$, each release only a privatized view $Z_i \sim Q(\cdot \mid X_i)$ through a channel $Q$; the estimator sees only these views, $Z_1^n \mapsto \hat{\theta}(Z_1^n)$


  • Definitions of Privacy

    Definition [Dwork, McSherry, Nissim, Smith '06]: the channel $Q$ is $\alpha$-differentially private if

    $$\sup_{S,\; x \in \mathcal{X},\; x' \in \mathcal{X}} \frac{Q(Z \in S \mid x)}{Q(Z \in S \mid x')} \le \exp(\alpha)$$

    i.e., $\big|\log Q(z \mid x) - \log Q(z \mid x')\big| \le \alpha$ uniformly

    Testing interpretation, via Neyman-Pearson [Wasserman & Zhou '10]: given $Z$, one cannot (reliably) recover $x$
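    To make the definition concrete, here is a minimal sketch (mine, not from the talk) of the classical randomized-response channel for a single bit; its likelihood ratios are bounded by $e^\alpha$, so it is $\alpha$-differentially private. Function names are illustrative:

    import math
    import random

    def randomized_response(x: int, alpha: float) -> int:
        """Release bit x in {0, 1}, keeping it with probability e^a / (1 + e^a).

        For any output z and inputs x, x', Q(z | x) / Q(z | x') <= e^alpha,
        so this channel is alpha-differentially private.
        """
        p_keep = math.exp(alpha) / (1 + math.exp(alpha))
        return x if random.random() < p_keep else 1 - x

    def debiased_mean(reports, alpha):
        """Estimate E[X] from private reports: E[Z] = (2p - 1) E[X] + (1 - p)."""
        p = math.exp(alpha) / (1 + math.exp(alpha))
        z_bar = sum(reports) / len(reports)
        return (z_bar - (1 - p)) / (2 * p - 1)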

  • Private Minimax Risk

    Central object of study: the minimax risk, for a family of distributions $\mathcal{P}$, a parameter $\theta(P)$, and a loss $\ell$ measuring error:

    $$\mathfrak{M}_n\big(\theta(\mathcal{P}), \ell\big) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\Big[\ell\big(\hat{\theta}(X_1^n), \theta(P)\big)\Big]$$


  • $\alpha$-Private Minimax Risk

    The $\alpha$-private minimax risk (the minimax risk under the privacy constraint, using the best $\alpha$-private channel) adds an infimum over the family $\mathcal{Q}_\alpha$ of $\alpha$-private channels:

    $$\mathfrak{M}_n\big(\theta(\mathcal{P}), \ell, \alpha\big) := \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\Big[\ell\big(\hat{\theta}(Z_1^n), \theta(P)\big)\Big]$$


  • Vignette: Private Mean (Location) Estimation

    Example: estimate reasons for hospital visits. Patients are admitted to the hospital for substance abuse; estimate the prevalence of different substances

    Each patient's record is a binary indicator vector over substances, and the parameter is the vector of proportions:

    Alcohol: $X_1 = 1$, $\theta_1 = .45$
    Cocaine: $X_2 = 1$, $\theta_2 = .32$
    Heroin: $X_3 = 0$, $\theta_3 = .16$
    Cannabis: $X_4 = 0$, $\theta_4 = .20$
    LSD: $X_5 = 0$, $\theta_5 = .00$
    Amphetamines: $X_6 = 0$, $\theta_6 = .02$


  • Vignette: Mean Estimation

    Consider estimation of the mean $\theta(P) := \mathbb{E}_P[X] \in \mathbb{R}^d$, with errors measured in the $\ell_\infty$-norm, i.e., $\mathbb{E}[\|\hat{\theta} - \theta\|_\infty]$, over $\mathcal{P}_d :=$ {distributions $P$ supported on $[-1, 1]^d$}

    Proposition: the minimax rate (achieved by the sample mean) is

    $$\mathfrak{M}_n\big(\mathcal{P}_d, \|\cdot\|_\infty\big) \asymp \min\Bigg\{1, \frac{\sqrt{\log d}}{\sqrt{n}}\Bigg\}$$

    while the private minimax rate, for $\alpha = O(1)$, is

    $$\mathfrak{M}_n\big(\mathcal{P}_d, \|\cdot\|_\infty, \alpha\big) \asymp \min\Bigg\{1, \frac{\sqrt{d \log d}}{\sqrt{n\alpha^2}}\Bigg\}$$

    Note: effective sample size $n \mapsto n\alpha^2 / d$
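    Spelling out the slide's note: substituting the effective sample size $n\alpha^2/d$ for $n$ in the non-private rate recovers exactly the private rate,

    $$\frac{\sqrt{\log d}}{\sqrt{n\alpha^2 / d}} = \frac{\sqrt{d \log d}}{\sqrt{n\alpha^2}},$$

    so local $\alpha$-privacy costs a dimension-dependent factor of $d/\alpha^2$ in sample size.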


  • Optimal Mechanism?

    Non-private observation: $X = (1, 0, 1, 0, 0)^\top$

    Idea 1: add independent noise (e.g., the standard Laplace mechanism),

    $$Z = X + W = (1 + W_1,\; 0 + W_2,\; 1 + W_3,\; 0 + W_4,\; 0 + W_5)^\top$$

    Problem: the noise magnitude is much too large (and this is unavoidable: the mechanism is provably sub-optimal)
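    To see why (a sketch of mine, not from the talk): in the local model the Laplace mechanism must calibrate its noise to the $\ell_1$-sensitivity of the entire vector, which is $d$ for binary vectors, so every coordinate receives noise of scale $d/\alpha$:

    import numpy as np

    def laplace_mechanism(x: np.ndarray, alpha: float, rng) -> np.ndarray:
        """Local Laplace mechanism for x in {0,1}^d.

        The l1-distance between two arbitrary binary vectors can be as large
        as d, so alpha-differential privacy forces per-coordinate Laplace
        noise of scale d / alpha: the "much too large" magnitude noted above.
        """
        d = x.size
        return x + rng.laplace(scale=d / alpha, size=d)

    rng = np.random.default_rng(0)
    z = laplace_mechanism(np.array([1, 0, 1, 0, 0]), alpha=1.0, rng=rng)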


  • Optimal Mechanism

    Non-private observation: $X = (1, 0, 1, 0, 0)^\top$

    Draw $v$ uniformly in $\{0, 1\}^d$, giving two candidate views: View 1, $v = (0, 1, 1, 0, 0)^\top$, and View 2, $1 - v = (1, 0, 0, 1, 1)^\top$

    With probability $\frac{e^\alpha}{1 + e^\alpha}$, release whichever of $v$ and $1 - v$ is closer to $X$ (here $v$, with 3 overlapping coordinates, versus 2 for $1 - v$); otherwise release the farther one

    At the end: compute the sample average of the released vectors and de-bias
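    A runnable sketch of this mechanism for binary data, as I read it off the slide; the tie-breaking rule and the exact de-biasing constant below are my reconstruction, not the authors' code:

    import math
    import numpy as np

    def privatize(x: np.ndarray, alpha: float, rng) -> np.ndarray:
        """Draw v uniform in {0,1}^d; with prob e^a/(1+e^a) release the closer
        of {v, 1-v} to x, otherwise the farther (ties broken by a fair coin)."""
        d = x.size
        v = rng.integers(0, 2, size=d)
        agree = int(np.sum(v == x))
        closer = v if (2 * agree > d or (2 * agree == d and rng.random() < 0.5)) else 1 - v
        p = math.exp(alpha) / (1 + math.exp(alpha))
        return closer if rng.random() < p else 1 - closer

    def agreement_prob(d: int, alpha: float) -> float:
        """q = P(Z_j = X_j), by conditioning on the Binomial(d-1, 1/2) number
        of agreements on the other coordinates; used for de-biasing."""
        pmf = [math.comb(d - 1, k) / 2 ** (d - 1) for k in range(d)]
        a = sum(pmf[k] for k in range(d) if 2 * (k + 1) > d)  # v strictly closer
        if d % 2 == 0:
            a += 0.5 * pmf[d // 2 - 1]                        # tie, fair coin
        p = math.exp(alpha) / (1 + math.exp(alpha))
        return p * a + (1 - p) * (1 - a)

    def estimate(Z: np.ndarray, alpha: float) -> np.ndarray:
        """De-bias the sample average: E[Z_j] = (2q - 1) theta_j + (1 - q)."""
        q = agreement_prob(Z.shape[1], alpha)
        return (Z.mean(axis=0) - (1 - q)) / (2 * q - 1)

    rng = np.random.default_rng(0)
    theta = np.array([.45, .32, .16, .20, .00, .02])    # true proportions
    X = (rng.random((50_000, 6)) < theta).astype(int)   # one record per patient
    Z = np.array([privatize(x, alpha=1.0, rng=rng) for x in X])
    print(estimate(Z, alpha=1.0))                       # approximately theta

    Each released vector stays in $\{0, 1\}^d$, so the report magnitude matches the data magnitude, in contrast to the Laplace mechanism above.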

  • Empirical Evidence

    Estimate the proportion of emergency-room visits involving different substances

    Data source: Drug Abuse Warning Network

    (Plot: estimation error versus sample size n)


  • Sample Size Reductions

    Given an $\alpha$-private channel $Q$, a pair $\{P_1, P_2\}$ induces the marginals

    $$M_j(S) := \int Q(S \mid x_1, \ldots, x_n)\, dP_j^n(x_1, \ldots, x_n), \qquad j \in \{1, 2\}$$

    Question: how much contraction does privacy induce?

    Theorem (data processing): for any $\alpha$-private channel $Q$ and an i.i.d. sample of size $n$,

    $$D_{\mathrm{kl}}(M_1 \,\|\, M_2) + D_{\mathrm{kl}}(M_2 \,\|\, M_1) \le 4n\,(e^\alpha - 1)^2\, \|P_1 - P_2\|_{\mathrm{TV}}^2$$

    Note: for $\alpha \lesssim 1$, $(e^\alpha - 1)^2 \asymp \alpha^2$, so effectively $n \mapsto n\alpha^2$
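    One way to read the theorem (my gloss): by Le Cam's two-point method, reliably distinguishing $P_1$ from $P_2$ requires the KL divergence between the induced marginals to be of constant order, hence

    $$n\alpha^2\, \|P_1 - P_2\|_{\mathrm{TV}}^2 \gtrsim 1, \quad \text{i.e.,} \quad n \gtrsim \frac{1}{\alpha^2\, \|P_1 - P_2\|_{\mathrm{TV}}^2},$$

    whereas without privacy on the order of $1 / \|P_1 - P_2\|_{\mathrm{TV}}^2$ samples suffice; this is the sample-size reduction $n \mapsto n\alpha^2$.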

  • Final Remarks: Privacy

    Rough technique: reduce estimation to testing, then apply information-theoretic testing lower bounds [Le Cam, Hasminskii, Ibragimov, Assouad, Birgé, Barron, Yu, ...]

    Key: this allows identification of new optimal mechanisms

    Additional examples: fixed-design regression, convex risk minimization, multinomial estimation, nonparametric density estimation

    Almost always: effective sample size reduction $n \mapsto n\alpha^2$; in $d$-dimensional problems, $n \mapsto n\alpha^2 / d$

  • Part II: Communication and Minimax Risk

    with John Duchi, Martin Wainwright and Yuchen Zhang

    University of California, Berkeley


  • Communication Constraints

    Large data necessitates distributed storage; independent data collection (hospitals); privacy?

    Setting: each of $m$ agents has a sample of size $n$, $X^i = (X^i_1, X^i_2, \ldots, X^i_n)$, and sends a message $Z_i$ to a fusion center, which computes an estimator $\hat{\theta}$

    Question: what are the tradeoffs between communication and statistical utility?

    [Yao '79; Abelson '80; Tsitsiklis & Luo '87; Han & Amari '98; Tatikonda & Mitter '04; ...]


  • Minimax Risk with B-Bounded Communication

    Central object of study: the minimax risk under communication constraints, for a parameter $\theta(P)$, a family of distributions $\mathcal{P}$, and loss $\|\cdot\|_2^2$:

    $$\mathfrak{M}_n\big(\theta(\mathcal{P}), B\big) := \inf_{\Pi \in \Pi_B} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\Big[\big\|\hat{\theta}(Z_1^m) - \theta(P)\big\|_2^2\Big]$$

    where the outer infimum is over the best protocol whose messages $Z_i = \Pi_i(X^i)$ are constrained to be at most $B$ bits


  • Vignette: Mean Estimation

    Consider estimation in the normal location family: $X^i_j \stackrel{\mathrm{iid}}{\sim} N(\theta, \sigma^2 I_{d \times d})$ with $\theta \in [-1, 1]^d$

    Minimax rate without communication constraints: when each agent has a sample of size $n$,

    $$\mathbb{E}\Big[\big\|\hat{\theta}(X^1, \ldots, X^m) - \theta\big\|_2^2\Big] \asymp \frac{\sigma^2 d}{nm}$$

    Theorem (minimax rate with $B$-bounded communication): when each agent has a sample of size $n$ and a budget of $B$ bits,

    $$\frac{d}{\frac{B}{\log m} \wedge d} \cdot \frac{\sigma^2 d}{nm} \;\lesssim\; \mathfrak{M}_n(\mathcal{N}_d, B) \;\lesssim\; \frac{d \log m}{B \wedge d \log m} \cdot \frac{\sigma^2 d}{nm}$$

    Consequence: each agent sends on the order of $d \log m$ bits for order-optimal estimation
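    For concreteness, a naive protocol sketch (my illustration, not the optimal scheme behind the theorem): each agent quantizes each coordinate of its local sample mean to k bits, so a message costs B = d * k bits, and the fusion center averages the dequantized values:

    import numpy as np

    def quantize(theta_local: np.ndarray, k: int) -> np.ndarray:
        """Uniformly quantize each coordinate of a vector in [-1, 1]^d to k bits."""
        levels = 2 ** k
        idx = np.round((theta_local + 1) / 2 * (levels - 1))
        return np.clip(idx, 0, levels - 1).astype(int)

    def dequantize(idx: np.ndarray, k: int) -> np.ndarray:
        return idx / (2 ** k - 1) * 2 - 1

    def fusion_center(messages, k):
        """Average the m dequantized local means."""
        return np.mean([dequantize(z, k) for z in messages], axis=0)

    rng = np.random.default_rng(1)
    d, m, n, k = 8, 100, 50, 6                   # B = d * k bits per agent
    theta = rng.uniform(-1, 1, size=d)
    X = theta + rng.normal(scale=1.0, size=(m, n, d))
    messages = [quantize(X[i].mean(axis=0).clip(-1, 1), k) for i in range(m)]
    print(fusion_center(messages, k))            # approximately theta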

  • Part III: Computation/Statistics Tradeoffs via Convex Relaxation

    with Venkat Chandrasekaran, Caltech

  • Time-Data Tradeoffs

    Consider an inference problem with fixed risk; inference procedures are viewed as points in a plot of runtime against the number of samples n

    Vertical lines (fixed n): classical estimation theory, well understood

    Horizontal lines (fixed runtime): complexity-theoretic lower bounds, poorly understood, and dependent on the computational model

    Tradeoff upper bounds: more data means a smaller runtime upper bound, so we need weaker algorithms for larger datasets

  • An Estimation Problem

    Signal $x^\star$ from a known (bounded) set $S$; noise $z_i$ (i.i.d. Gaussian)

    Observation model: $y_i = x^\star + z_i$

    Observe $n$ i.i.d. samples $\{y_1, \ldots, y_n\}$

  • Convex Programming Estimator

    The sample mean $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ is a sufficient statistic

    Natural estimator: least squares over the signal set (or its convex hull)

    Convex relaxation: $\hat{x}_n = \arg\min_{x \in C} \|\bar{y} - x\|_2^2$, where $C$ is a convex set such that $S \subseteq C$
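    A minimal sketch (my illustration, under the added assumption that $S$ is a set of sparse bounded vectors whose convex relaxation $C$ is an $\ell_1$-ball): the estimator is then just a Euclidean projection of the sample mean onto $C$:

    import numpy as np

    def project_l1_ball(v: np.ndarray, radius: float) -> np.ndarray:
        """Euclidean projection onto the l1-ball of the given radius
        (the sort-based algorithm of Duchi et al., 2008)."""
        if np.abs(v).sum() <= radius:
            return v
        u = np.sort(np.abs(v))[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
        tau = (css[rho] - radius) / (rho + 1.0)
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def convex_relaxation_estimator(Y: np.ndarray, radius: float) -> np.ndarray:
        """argmin_{x in C} ||ybar - x||_2^2 for C an l1-ball containing S
        is simply the projection of the sample mean onto C."""
        return project_l1_ball(Y.mean(axis=0), radius)

    rng = np.random.default_rng(2)
    x_star = np.zeros(100); x_star[:3] = 1.0     # a sparse signal in S
    Y = x_star + rng.normal(scale=1.0, size=(200, 100))
    x_hat = convex_relaxation_estimator(Y, radius=3.0)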

  • Statistical Performance of Estimator

    Consider the cone of feasible directions $T_C(x^\star)$ into $C$ at the true signal $x^\star$

    Theorem: the risk of the estimator scales as $\frac{\sigma^2}{n}\, g\big(T_C(x^\star)\big)^2$, with $g(\cdot)$ the Gaussian complexity of the (unit-ball-restricted) cone

    Intuition: only the error in the feasible cone need be considered

    This can be refined for better bias-variance tradeoffs
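    A standard instance (my illustration, not from the slides): if $C$ is an appropriately scaled $\ell_1$-ball around a $k$-sparse $x^\star \in \mathbb{R}^p$, the feasible cone has Gaussian complexity $g^2 \asymp k \log(p/k)$, so

    $$\mathbb{E}\Big[\|\hat{x}_n - x^\star\|_2^2\Big] \lesssim \frac{\sigma^2\, k \log(p/k)}{n}.$$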

  • Hierarchy of Convex Relaxations

    Corollary: to obtain risk of at most 1, a sample size on the order of $n \gtrsim \sigma^2\, g\big(T_C(x^\star)\big)^2$ suffices

    Key point: if we have access to larger $n$, we can use a larger $C$, and thereby obtain a weaker (cheaper) estimation algorithm

  • Hierarchy of Convex Relaxations

    If the signal set is algebraic, then one can obtain a family of outer convex approximations: polyhedral, semidefinite, and hyperbolic relaxations (Sherali-Adams, Parrilo, Lasserre, Gårding, Renegar)

    The sets are ordered by computational complexity, and a central role is played by lift-and-project methods

  • Example 1

    The signal set consists of cut matrices

    E.g., collaborative filtering, clustering

  • Example 2

    The signal set consists of all perfect matchings in the complete graph

    E.g., network inference

  • Example 3

    The signal set consists of all adjacency matrices of graphs containing only a clique on square-root many nodes

    E.g., sparse PCA, gene expression patterns; Kolar et al. (2010)

  • Example 4

    Banding estimators for covariance matrices: Bickel-Levina (2007) and many others assume a known variable ordering

    Stylized problem: let $M$ be a known tridiagonal matrix; the signal set consists of the matrices obtained from $M$ by an unknown reordering of the variables

  • Remarks

    In several examples, not too many extra samples are required to move to really simple algorithms

    Approximation ratios vs. Gaussian complexities: the approximation ratio of a relaxation might be bad, but this doesn't matter as much for statistical inference

    A goal: understand the Gaussian complexities of LP/SDP hierarchies, in contrast to the worst-case approximation-ratio focus of theoretical CS


  • Conclusions

    Many conceptual and mathematical challenges arise in taking seriously the problem of Big Data

    Facing these challenges will require a rapprochement between computer science and statistics, bringing them together at the level of their foundations, thus reshaping both disciplines

    This won't happen overnight