. Multivariate Information Bottleneck Noam Slonim Princeton University Lewis-Sigler Institute for Integrative Genomics Nir Friedman Naftali Tishby Hebrew University School of Computer Science and Engineering

Dec 21, 2015

Multivariate Information Bottleneck

Noam Slonim
Princeton University, Lewis-Sigler Institute for Integrative Genomics

Nir Friedman, Naftali Tishby
Hebrew University, School of Computer Science and Engineering


Multivariate Information Bottleneck - Preview

- A general framework for specifying a new family of clustering problems

- Almost all of these problems are not treated by standard clustering approaches

- Insights into, and demonstrations of, why these problems are important

- A general optimal solution for all these problems, based on a single Information Theoretic principle

- Applications for text analysis, gene expression data and more...

Multivariate IB – introduction

- Original IB: compressing one variable while preserving the information about some other single variable

[Diagram: X_1 and X_2 with joint distribution $P(X_1, X_2)$; X_1 is compressed into T_1, giving T_1 and X_2 with $P(T_1, X_2)$]

Multivariate IB – introduction (cont.)

- However, we could think of other problems, e.g. symmetric compression:

[Diagram: given $P(X_1, X_2)$, X_1 is compressed into T_1 and X_2 is compressed into T_2]

Question: How to formulate and solve all such problems under one unifying principle?

(a few words about …) Bayesian Networks

- A Bayes net over (X1,…,Xn) is a DAG G in which vertices correspond to the random variables

$$P(X_1,\dots,X_n) = \prod_i P\left(X_i \mid \mathbf{Pa}_i^G\right)$$

- P(X1,…,Xn) is consistent with G iff each Xi is independent of all the other (non-descendant) variables, given its parents Pai

[Diagram: example DAG over X_1, X_2, X_3 and X_4, annotated with a conditional independence statement Ind(·) involving X_1, X_2 and X_4]

Multi-information and Bayes nets

- The information that (X1,…,Xn) contain about each other is captured by:

$$I(X_1,\dots,X_n) = \sum_{x_1,\dots,x_n} P(x_1,\dots,x_n)\,\log\frac{P(x_1,\dots,x_n)}{P(x_1)\cdots P(x_n)}$$

- If P(X1,…,Xn) is consistent with G then:

$$I(X_1,\dots,X_n) = \sum_i I\left(X_i;\mathbf{Pa}_i^G\right)$$
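As a rough illustration of the multi-information definition on this slide, the following Python sketch computes I(X1,…,Xn) from a joint probability table. The function name `multi_information` and the NumPy array representation (one axis per variable) are assumptions made for this example, not part of the original slides. Natural logarithms are used, so values are in nats.

```python
import numpy as np

def multi_information(joint):
    """Multi-information of a joint table with one axis per variable (in nats)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()          # normalise, in case counts were passed
    n = joint.ndim

    # Product of the single-variable marginals, built by broadcasting.
    prod = np.ones_like(joint)
    for axis in range(n):
        other_axes = tuple(a for a in range(n) if a != axis)
        marginal = joint.sum(axis=other_axes, keepdims=True)
        prod = prod * marginal

    # Sum only over cells with non-zero probability (0 * log 0 is taken as 0).
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / prod[mask])))

# Two independent variables share no information.
p_independent = np.outer([0.3, 0.7], [0.5, 0.5])
# Two identical fair bits share log(2) nats.
p_copy = np.array([[0.5, 0.0], [0.0, 0.5]])

print(multi_information(p_independent))
print(multi_information(p_copy))
```

For two variables this quantity reduces to the ordinary mutual information I(X1;X2).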

Original IB through Bayes net formulation

[Diagram: X_1, X_2 and T_1, with links labelled $I(T_1;X_1)$ and $I(T_1;X_2)$]

- G_in (what compresses what): T_1 compresses X_1

$$I^{G_{in}} = I(T_1;X_1) + I(X_1;X_2), \quad \text{where } I(X_1;X_2) \text{ is constant}$$

- G_out (what predicts what): T_1 predicts X_2

$$I^{G_{out}} = I(T_1;X_2)$$

New generalized formulation:

$$\text{Minimize: } \mathcal{L}\left[P(T_1 \mid X_1)\right] = I^{G_{in}} - \beta\, I^{G_{out}}$$

Which in this case means:

$$\text{Minimize: } I(T_1;X_1) - \beta\, I(T_1;X_2) + \text{Const}$$

Alternative formulation: preliminaries

For a given DAG G, define:

$$KL(P \,\|\, G) = \min_{Q \models G} KL(P \,\|\, Q)$$

For P which is consistent with G_in:

$$KL(P \,\|\, G_{out}) = I^{G_{in}} - I^{G_{out}}$$

- $I^{G_{in}}$: real multi-info in P(X,T)
- $I^{G_{out}}$: multi-info as though P(X,T) is consistent with G_out

Beyond the original IB [Slonim, Friedman, Tishby]

- Input variables: X_1, X_2, X_3, X_4, …, X_n
- Compression (bottleneck) variables: T_1, T_2, T_3, …, T_k
- G_in dependencies (minimize)
- G_out dependencies (maximize)
- Parameters: $P\left(T_j \mid \mathbf{Pa}_j^{G_{in}}\right)$

A simple example: Symmetric IB

- G_in (what compresses what): T_1 compresses X_1 and T_2 compresses X_2

$$I^{G_{in}} = I(T_1;X_1) + I(T_2;X_2)$$

- G_out (what predicts what): T_1 and T_2 predict each other

$$I^{G_{out}} = I(T_1;T_2)$$

$$\text{Minimize: } \mathcal{L} = I(T_1;X_1) + I(T_2;X_2) - \beta\, I(T_1;T_2)$$
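To make the symmetric IB functional concrete, here is a small Python sketch that evaluates L = I(T1;X1) + I(T2;X2) − β I(T1;T2) for given cluster assignments. The function names, the argument layout and the assumption that T1 depends only on X1 and T2 only on X2 (as in G_in) are choices made for this illustration; the slides do not prescribe an implementation.

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in nats for a 2-D joint probability table."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px * py)[mask])))

def symmetric_ib_objective(p_x1x2, q_t1_given_x1, q_t2_given_x2, beta):
    """Value of I(T1;X1) + I(T2;X2) - beta * I(T1;T2).

    p_x1x2:        (n1, n2) joint P(X1, X2)
    q_t1_given_x1: (n1, k1) rows are P(T1 | x1)
    q_t2_given_x2: (n2, k2) rows are P(T2 | x2)
    """
    p = np.asarray(p_x1x2, dtype=float)
    p = p / p.sum()
    q1 = np.asarray(q_t1_given_x1, dtype=float)
    q2 = np.asarray(q_t2_given_x2, dtype=float)

    p_x1t1 = p.sum(axis=1)[:, None] * q1   # P(X1, T1)
    p_x2t2 = p.sum(axis=0)[:, None] * q2   # P(X2, T2)
    p_t1t2 = q1.T @ p @ q2                 # P(T1, T2), summing out X1 and X2

    compression = mutual_information(p_x1t1) + mutual_information(p_x2t2)
    return compression - beta * mutual_information(p_t1t2)

# Toy joint: X1 and X2 are identical fair bits.
p_toy = np.array([[0.5, 0.0], [0.0, 0.5]])
identity = np.eye(2)              # no compression: T copies X
collapse = np.ones((2, 1))        # full compression: a single cluster

print(symmetric_ib_objective(p_toy, identity, identity, beta=2.0))
print(symmetric_ib_objective(p_toy, collapse, collapse, beta=2.0))
```

On this toy joint both extremes give the same value at β = 2: copying costs 2·log 2 of compression and preserves log 2 of information, while collapsing costs and preserves nothing.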

A multivariate formal optimal solution

Minimizing $I^{G_{in}} - \beta\, I^{G_{out}}$ w.r.t. $P\left(t_j \mid \mathbf{Pa}_j^{G_{in}}\right)$:

$$P\left(t_j \mid \mathbf{Pa}_j^{G_{in}}\right) = \frac{P(t_j)}{Z\left(\mathbf{Pa}_j^{G_{in}}, \beta\right)} \exp\left(-\beta\, d\left(\mathbf{Pa}_j^{G_{in}}, t_j\right)\right)$$

- Where now d(Paj,tj) is a generalized (KL) distortion measure…

- For example, in symmetric IB:

$$d(x_1,t_1) = D_{KL}\left[P(T_2 \mid x_1) \,\|\, P(T_2 \mid t_1)\right]$$
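The exponential form of the solution on this slide can be sketched in Python as a single update step: given cluster priors P(t), a distortion matrix d and a value of β, it returns the normalised assignment probabilities. The helper names `kl_rows` and `self_consistent_update`, and the clipping constant `eps`, are assumptions made for this sketch. A full algorithm would also re-estimate P(t) and the distortions and iterate, which is not shown here.

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    """Matrix of D_KL[p[i] || q[j]] for row-stochastic p (n, m) and q (k, m).

    Zeros are clipped to eps, so an infinite divergence shows up as a large
    finite number rather than inf.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    log_p = np.log(np.clip(p, eps, None))
    log_q = np.log(np.clip(q, eps, None))
    return np.sum(p * log_p, axis=1)[:, None] - p @ log_q.T

def self_consistent_update(p_t, distortion, beta, eps=1e-12):
    """Rows are P(t | pa) proportional to P(t) * exp(-beta * d(pa, t))."""
    p_t = np.asarray(p_t, dtype=float)
    distortion = np.asarray(distortion, dtype=float)
    logits = np.log(np.clip(p_t, eps, None))[None, :] - beta * distortion
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)   # division by Z(pa, beta)

# Toy example: two inputs, two clusters.
p_t = np.array([0.25, 0.75])
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])

print(self_consistent_update(p_t, d, beta=0.0))    # ignores distortion: rows equal P(t)
print(self_consistent_update(p_t, d, beta=50.0))   # nearly hard assignment
```

For the symmetric IB example, the distortion matrix would be `kl_rows(P(T2|x1), P(T2|t1))`, with one row per value of x1 and one column per cluster t1.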

Multivariate IB algorithms – example for aIB [Slonim, Friedman, Tishby, 2002]

[Diagram: agglomerative merging, from the singletons W1 W2 W3 W4 W5 … WN, through W1 W2 {W3,W4} W5 … WN, to the single cluster {W1,W2,…,WN}]

- Which pair to merge?

$$\Delta\mathcal{L}\left(t_j^l, t_j^r\right) = \mathcal{L}^{before} - \mathcal{L}^{after}$$

$$\Delta\mathcal{L}\left(t_j^l, t_j^r\right) = \left[P(t_j^l) + P(t_j^r)\right] \cdot \tilde{d}\left(t_j^l, t_j^r\right)$$

- Where now $\tilde{d}(t_j^l, t_j^r)$ is a generalized (JS) distortion measure…

- For example, in symmetric aIB:

$$\tilde{d}\left(t_1^l, t_1^r\right) = D_{JS}\left[P(T_2 \mid t_1^l),\, P(T_2 \mid t_1^r)\right]$$
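The merge criterion can be illustrated with a short Python sketch that scores every pair of clusters by [P(t_l) + P(t_r)] times a Jensen-Shannon divergence and picks the cheapest merge. Weighting the JS divergence by the normalised cluster priors is an assumption made here, as are the function names; cluster priors are assumed to be strictly positive.

```python
from itertools import combinations

import numpy as np

def kl(p, q):
    """D_KL[p || q] in nats; q must be positive wherever p is."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def merge_cost(p_l, p_r, cond_l, cond_r):
    """[P(t_l) + P(t_r)] * JS divergence between the two conditional distributions."""
    p_bar = p_l + p_r
    pi_l, pi_r = p_l / p_bar, p_r / p_bar       # assumed JS weights
    mix = pi_l * cond_l + pi_r * cond_r         # distribution of the merged cluster
    js = pi_l * kl(cond_l, mix) + pi_r * kl(cond_r, mix)
    return p_bar * js

def best_merge(priors, conds):
    """Return (cost, l, r) for the pair of clusters that is cheapest to merge."""
    conds = np.asarray(conds, dtype=float)
    return min(
        (merge_cost(priors[l], priors[r], conds[l], conds[r]), l, r)
        for l, r in combinations(range(len(priors)), 2)
    )

# Three clusters; the first two have identical conditionals, so merging them is free.
priors = [0.4, 0.4, 0.2]
conds = [[1.0, 0.0],
         [1.0, 0.0],
         [0.0, 1.0]]

print(best_merge(priors, conds))
```

An agglomerative procedure would repeat this step, replacing the chosen pair with the merged cluster (prior `p_bar`, conditional `mix`) until the desired number of clusters remains.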

Symmetric aIB compression: documents, words

- Accuracy of symmetric aIB vs. original aIB over 3 small datasets:

[Bar chart: accuracy of original aIB and symmetric aIB on the datasets Test2, Test4 and Test5; vertical axis ticks from 60 to 90]

Word clusters provide a more robust representation…

Symmetric IB through Deterministic Annealing

Data: 20,000 messages from 20 different discussion groups [Lang, 95]

W – a word in the corpus
C – the class (newsgroup) of the message

P(W=‘bible’,C=‘alt.atheism’): the probability that choosing a random position in the corpus would select the word ‘bible’ in a message of the newsgroup (class) ‘alt.atheism’…

[Figure: heat map of $\log P(W,C)$; axes: Words, Classes]
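The joint distribution P(W,C) described on this slide can be estimated from a word-by-class count matrix, as in the Python sketch below. The tiny vocabulary, the count values and the function name `joint_from_counts` are made up for illustration and are not taken from the 20 newsgroups data.

```python
import numpy as np

def joint_from_counts(counts):
    """Estimate P(W, C): the fraction of corpus positions holding word w in class c."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

# Hypothetical counts: rows are words, columns are classes.
words = ["bible", "car"]
classes = ["alt.atheism", "rec.autos"]
counts = np.array([[30, 1],
                   [2, 67]])

p_wc = joint_from_counts(counts)
i, j = words.index("bible"), classes.index("alt.atheism")
print(p_wc[i, j])   # P(W='bible', C='alt.atheism') under these toy counts
```

Zero counts give zero probabilities, so taking the logarithm for a plot like the one on this slide needs some care, for example by masking empty cells.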

Symmetric IB through Deterministic Annealing

[Figure; axes: Word, Newsgroup]

Symmetric IB through Deterministic Annealing

[Figure: P(TC,TW); axes: Word, Newsgroup]

- Newsgroup cluster: alt.atheism, rec.autos, rec.motorcycles, rec.sport.*, sci.med, sci.space, soc.religion.christian, talk.politics.*
- Newsgroup cluster: comp.*, misc.forsale, sci.crypt, sci.electronics
- Word cluster: car, turkish, game, team, jesus, gun, hockey, …
- Word cluster: xfile, image, encryption, window, dos, mac, …

Symmetric IB through Deterministic Annealing

[Three further figures of P(TC,TW), one per slide; axes: Word, Newsgroup]

Symmetric IB through Deterministic Annealing

[Figure: P(TC,TW); axes: Word, Newsgroup]

- Word cluster: atheists, christianity, jesus, bible, sin, faith, …
- Newsgroup cluster: alt.atheism, soc.religion.christian, talk.religion.misc

Symmetric aIB compression: genes, samples

Data: Gene expression of 500 “informative” genes vs. 72 leukemia samples (Golub et al., 1999)

[Figure: heat map of log P over genes G and samples S; axes: Genes, Samples]

Symmetric aIB compression: genes, samples

Data after symmetric aIB compression:

[Figure: heat map of P over gene clusters T_G and sample clusters T_S; 10 gene clusters, 8 sample clusters; scale ticks from 0.1 to 0.7]

- Sample cluster annotations: ALL B-cell hosp1; ALL B-cell hosp1; ALL T-cell hosp1 Male; BM B-cell; BM B-cell; AML; AML hosp2; AML hosp3
- Genes listed on the slide: X00437_s_at, M12886_at, X76223_s_at, M59807_at, U23852_s_at, D00749_s_at, U89922_s_at, X03934_at, U50743_at, M21624_at, M28826_at, M37271_s_at, X59871_at, X14975_at, M16336_s_at, L05148_at, M28825_at

Another example: parallel IB

- Consider a document collection with different topics, and different writing styles:

[Figure: a collection of documents labelled topic1, topic2, topic3 and topic4; two are also labelled “Science”]

Another example: parallel IB (cont.)

- One possible “legitimate” partition is by the topic:

[Figure: the documents grouped into four clusters: Topic1, Topic2, Topic3, Topic4]

Another example: parallel IB (cont.)

- And another possible “legitimate” partition is by the writing style:

[Figure: the same documents grouped into three clusters: Style1, Style2, Style3; each cluster mixes documents from several topics]

There might be more than one “legitimate” partition…

Parallel IB: solution

- G_in (minimize dependencies): T_1 and T_2 both compress X_1

$$I^{G_{in}} = I(T_1;X_1) + I(T_2;X_1)$$

- G_out (maximize dependencies): T_1 and T_2 together predict X_2

$$I^{G_{out}} = I(T_1,T_2;X_2)$$

$$\text{Minimize: } \mathcal{L} = I(T_1;X_1) + I(T_2;X_1) - \beta\, I(T_1,T_2;X_2)$$

Effective distortion:

$$d(x_1,t_1) = E_{P(T_2 \mid x_1)}\left[D_{KL}\left(P(X_2 \mid x_1,T_2) \,\|\, P(X_2 \mid t_1,T_2)\right)\right]$$

Parallel sIB: Text analysis results

- Data: ~1,500 “documents” taken from
  E. R. Burroughs: The Beasts of Tarzan & The Gods of Mars
  R. Kipling: The Jungle Book & Rewards and Fairies

- X1 corresponds to “documents”, X2 corresponds to words

Author     Book                  T1,a  T1,b  T2,a  T2,b
Burroughs  The Beasts of Tarzan   315     2   315     2
Burroughs  The Gods of Mars       407     0     1   406
Kipling    The Jungle Book          0   255   254     1
Kipling    Rewards and Fairies      0   367    42   325

Parallel sIB: Gene Expression data results

- Data: Gene expression of 500 “informative” genes vs. 72 leukemia samples (Golub et al., 1999)

- X1 corresponds to samples, X2 corresponds to genes

          T1,a  T1,b  T2,a  T2,b  T3,a  T3,b  T4,a  T4,b
AML         23     2    14    11    12    13    13    12
ALL          0    47    37    10     9    38    22    25
B-cell       0    38    37     1     6    32    20    18
T-cell       0     9     0     9     3     6     2     7
<PS>       .64   .72   .71   .66   .53   .76   .70   .69

Another Example: Triplet IB

- Consider the following sequence data:

s(1) s(2) s(3) … s(t-1) s(t) s(t+1) …

- Can we extract features s.t. their combination is informative about a symbol between them?

[Diagram: X_p, X_m, X_n over the sequence; T_p compresses X_p and T_n compresses X_n]

Triplet IB: solution

- G_in (minimize dependencies): T_p compresses X_p and T_n compresses X_n

$$I^{G_{in}} = I(T_p;X_p) + I(T_n;X_n)$$

- G_out (maximize dependencies): T_p and T_n together predict X_m

$$I^{G_{out}} = I(T_p,T_n;X_m)$$

$$\text{Minimize: } \mathcal{L} = I(T_p;X_p) + I(T_n;X_n) - \beta\, I(T_p,T_n;X_m)$$

Triplet IB Data

(E. R. Burroughs, “Tarzan the Terrible”)

“… As Tarzan ascended the platform his eyes narrowed angrily at the sight which met them… ‘What means this?’ he cried angrily…”

- Xp: 1st word in triplet
- Xm: 2nd word in triplet
- Xn: 3rd word in triplet

Xm = {apemans, apes, eyes, girl, great, jungle, tarzan, time, two, way}

Data: Tarzan and the Jewels of Opar, Tarzan of the Apes, Tarzan the Terrible, Tarzan the Untamed, The Beasts of Tarzan, The Jungle Tales of Tarzan, The Return of Tarzan

Joint distribution P(Xp,Xm,Xn) of dimension 90 x 10 x 233

Triplet sIB: Text analysis results

- Given Xp and Xn, two schemes to predict the middle word:
  - from the clusters: Xm = argmax P( xm’ | tp,tn )
  - from the original words: Xm = argmax P( xm’ | xp,xn )

- Test on a NEW sequence, “The son of Tarzan”:

                Precision (%)     Recall (%)
Xm              Tp,Tn   Xp,Xn     Tp,Tn   Xp,Xn
Apes (78)          43      26        17      14
Eyes (177)         83      81        32      28
Girl (240)         43      30         5       1
Great (219)        92      92        50      48
Jungle (241)       49      54        27      24
Tarzan (48)        41      67        40      25
Time (145)         70      82        48      26
Two (148)          41      92        11       8
Way (101)          60      81        28      21
Average            53      55        28      22
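The prediction rule on this slide, and the per-word precision and recall it is scored with, can be sketched in Python as follows. The array layout `[tp, xm, tn]`, the function names and the toy numbers are assumptions made for this illustration. Returning `None` when a cluster pair was never observed is also a design choice of this sketch, not something stated on the slides.

```python
import numpy as np

def predict_middle(p_tp_xm_tn, tp, tn):
    """Return argmax over xm of P(xm | tp, tn), or None if (tp, tn) was never seen."""
    scores = p_tp_xm_tn[tp, :, tn]      # proportional to P(xm | tp, tn)
    if scores.sum() == 0:
        return None
    return int(np.argmax(scores))

def precision_recall(true_labels, predicted, label):
    """Precision and recall for a single middle word."""
    hits = sum(1 for t, p in zip(true_labels, predicted) if t == label and p == label)
    n_predicted = sum(1 for p in predicted if p == label)
    n_true = sum(1 for t in true_labels if t == label)
    precision = hits / n_predicted if n_predicted else float("nan")
    recall = hits / n_true if n_true else float("nan")
    return precision, recall

# Toy joint over 2 clusters for Tp, 3 middle words, 2 clusters for Tn.
p = np.zeros((2, 3, 2))
p[0, 1, 0] = 0.3
p[0, 0, 0] = 0.1
p[1, 2, 1] = 0.6

contexts = [(0, 0), (1, 1), (1, 1), (0, 1)]   # (tp, tn) pairs in a test sequence
truth = [1, 1, 2, 0]                          # the actual middle words
predictions = [predict_middle(p, tp, tn) for tp, tn in contexts]

print(predictions)
print(precision_recall(truth, predictions, label=1))
```

The same functions apply to the second scheme by indexing a joint table over the original words (xp, xm, xn) instead of the clusters.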


Summary

- The IB method is a principled framework for extracting “informative” structure out of a joint distribution P(X1,X2).

- The Multivariate IB extends this framework to extract “informative” structure from more complex joint distributions, P(X1,…,Xn), in various ways.

- This enables us to define and solve a new family of optimization problems, under a single unifying Information Theoretic principle.

- References: www.cs.huji.ac.il/~noamm

- “Clustering” conceals a family of distinct problems which deserve special consideration. The multivariate IB framework makes it possible to define these sub-problems, solve them, and demonstrate their importance.