PREDICTING AND UNDERSTANDING NETWORKS USING GRAPH EMBEDDING
Michiel Stock (michielstock), KERMIT
24

2018 presentation montréal_handouts

Jan 28, 2018

Michiel Stock
Transcript
Page 1: 2018 presentation montréal_handouts

PREDICTING AND UNDERSTANDING NETWORKS USING GRAPH EMBEDDING

Michiel Stock michielstock

KERMIT

Page 2: 2018 presentation montréal_handouts

RATIONALISM VS EMPIRICISM

Rationalism “the view that regards reason as the chief source and test of knowledge”

Historically associated with mathematics and physics.

Deduction: A -> B

Plato René Descartes

Empiricism “a theory that states that knowledge comes only or primarily from sensory experience”

Historically associated with biology, chemistry, geology…

Experimentation

Aristotle John Locke

Page 3: 2018 presentation montréal_handouts

RATIONALISM: BUILDING THEORIES

THEORY

Theory of evolution

Allometric scaling: $Y = Y_0 M^a$

Central dogma of molecular biology

Page 4: 2018 presentation montréal_handouts

EMPIRICISM: COLLECTING DATA

DATA Dataism

Page 5: 2018 presentation montréal_handouts

GRAPHS AND NETWORKS: THE BEST OF BOTH WORLDS

A graph G=(V, E) consists of a set of vertices V together with a set of edges E representing connections between the vertices.

Can be interpreted as a mechanistic model. Draws from a very rich body of mathematical theory.

Can be experimentally determined. Data structure.
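
To make the "data structure" view concrete, here is a minimal sketch of how a graph G = (V, E) could be stored as an adjacency map built from an edge list. The species names and the helper `build_adjacency` are invented for illustration and do not come from the talk.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Return a dict mapping each vertex to the set of its neighbours."""
    adjacency = defaultdict(set)
    for u, v in edges:
        # Undirected graph: store every edge in both directions.
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)

# Hypothetical toy pollination network (illustrative names only).
edges = [("bee", "clover"), ("bee", "thistle"), ("hoverfly", "clover")]
graph = build_adjacency(edges)

vertices = set(graph)                                  # the vertex set V
degree = {v: len(nbrs) for v, nbrs in graph.items()}   # number of edges per vertex
```

Simple numerical summaries such as the degree of each vertex can then be read directly from this structure.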

a graph

Page 6: 2018 presentation montréal_handouts

NETWORKS IN SCIENCE

systems biology

social networks

food flavour network

Page 7: 2018 presentation montréal_handouts

ECOLOGICAL NETWORKS

Networks in ecology:

parasitism, food webs, pollination

Sampling of species interaction networks:

Page 8: 2018 presentation montréal_handouts

ANALYSIS OF NETWORK DATA

Page 9: 2018 presentation montréal_handouts

MY STORY

➤ Bioscience engineer (cellular biotech)

➤ Into biology, not into experimenting

➤ PhD in machine learning

➤ past focus: molecular biology

➤ current focus: ecological networks

Page 10: 2018 presentation montréal_handouts

MY RESEARCH PROJECT: MACHINE LEARNING FOR ECOLOGICAL NETWORKS

Understand

➤ What are species doing?

➤ How do we compare networks?

➤ How can we extract numerical features?

Predict

➤ Find missing interactions.

➤ Changes in time and space.

➤ Uncertainty in predictions?

Control / manage

➤ Effective monitoring.

➤ How to increase productivity/stability?

➤ Encourage/discourage interactions.

Page 11: 2018 presentation montréal_handouts

MACHINE LEARNING

The branch of computer science concerned with giving computers the ability to learn without being explicitly programmed.

The study and development of algorithms that can detect stable patterns in finite data sets.

Page 12: 2018 presentation montréal_handouts

SUPERVISED LEARNING

Given a data set of labeled examples, find a function f(·) that predicts the label of new data points.

Example: detecting animals on camera trap images

➤ hog

➤ human

➤ deer

➤ empty

[Figure: regression (y against x) and classification (in the x1, x2 plane)]
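
As a rough illustration of learning a function f(·) from labeled examples, the sketch below uses a one-nearest-neighbour rule. This rule is chosen only for brevity and is not the method used in the talk; the two-dimensional features and labels are made up.

```python
import math

def nearest_neighbour_predict(train_x, train_y, x_new):
    """Predict the label of x_new as the label of the closest training point."""
    distances = [math.dist(x, x_new) for x in train_x]
    return train_y[distances.index(min(distances))]

# Made-up two-dimensional features (x1, x2) with class labels.
train_x = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.2)]
train_y = ["empty", "empty", "deer", "deer"]

# Classify a new, unlabeled data point.
prediction = nearest_neighbour_predict(train_x, train_y, (0.8, 0.9))
```

Real camera trap images would first have to be turned into numerical features, which is a separate step that this sketch leaves out.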

Page 13: 2018 presentation montréal_handouts

UNSUPERVISED LEARNING

Find simple structure in a complex data set. Often dimension reduction or clustering.

[Figure 2 from Saeys et al. (2016), Nature Reviews Immunology, volume 16, July 2016, p. 456. Panels a to c: marker visualization of mouse splenocytes using (a) SPADE, (b) FlowSOM and (c) t-SNE. Panel d: scaffold maps for a mouse immune reference data set. SPADE and FlowSOM cluster the cells and show the clusters in a minimal spanning tree, whereas t-SNE-based methods show each cell individually in two new dimensions. Code to replicate the figure: https://github.com/saeyslab/FlowCytometryScripts]

Saeys et al. (2016) “Computational flow cytometry: helping to make sense of high-dimensional immunology data”

Page 14: 2018 presentation montréal_handouts

LEARNING FROM/ON GRAPHS

VS

typical dataset

Page 15: 2018 presentation montréal_handouts

PAIRWISE DATA (TWO OBJECTS AND A LABEL)

[Figure: examples shown as triplets (person, book, label), where the label is 0 or 1]

Page 16: 2018 presentation montréal_handouts

A PAIRWISE MODEL

‘Learn’ a function f(person, book) on pairs based on observed data, such that a high score indicates that someone would be interested in a book.

Main idea: combine the ‘description’ of the objects (person and book) into a large pairwise feature vector.
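
One common way to build such a pairwise feature vector is the Kronecker product of the two description vectors. The sketch below assumes that construction together with a linear scoring function; the slide does not say which combination is actually used. The feature meanings and the hand-picked weights are hypothetical.

```python
def kronecker(u, v):
    """Kronecker product of two feature vectors: all products u_i * v_j."""
    return [ui * vj for ui in u for vj in v]

def pairwise_score(person, book, weights):
    """Linear scoring function f(person, book) on the pairwise feature vector."""
    features = kronecker(person, book)
    return sum(w * x for w, x in zip(weights, features))

# Hypothetical descriptions: person = (likes fiction, likes science),
# book = (is fiction, is science).
person = [1.0, 0.0]
book_novel = [1.0, 0.0]
book_textbook = [0.0, 1.0]

# Hand-picked weights that reward matching tastes; in practice they are learned.
weights = [1.0, -1.0, -1.0, 1.0]

score_novel = pairwise_score(person, book_novel, weights)
score_textbook = pairwise_score(person, book_textbook, weights)
```

With these weights the fiction reader gets a higher score for the novel than for the textbook, which is the behaviour a learned pairwise model is meant to capture.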

Page 17: 2018 presentation montréal_handouts

FINDING INTERACTING SPECIES = FINDING INTERESTING BOOKS

For example: plant-pollinator interactions

pollinators => persons

plants => books

pollination => reading

Page 18: 2018 presentation montréal_handouts

PAIRWISE LEARNING TO FIND FALSE NEGATIVES IN SPECIES INTERACTIONS

$$\text{precision} = \frac{\#\ \text{interactions}}{\text{size top}}$$

Improvement compared to random selection

Stock et al. (2017)
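
The precision of a top list can be computed as sketched below. The scores and labels are invented, and expressing the "improvement compared to random selection" as a ratio to the overall interaction rate is an assumption, not necessarily the definition used in Stock et al. (2017).

```python
def precision_at_top(scores, labels, k):
    """Fraction of true interactions among the k highest-scoring pairs (k >= 1)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)

# Made-up model scores for five candidate pairs; label 1 marks a true interaction.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0]

precision = precision_at_top(scores, labels, k=2)

# Random selection finds interactions at the overall rate (assumed baseline).
baseline = sum(labels) / len(labels)
improvement = precision / baseline
```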

Page 19: 2018 presentation montréal_handouts

GRAPH EMBEDDING: ENCODING-DECODING

Graphs are complex! Can we extract a numerical representation from them that we can easily use in conventional machine learning?

[Figure: graph → embedding z → neighbourhood prediction, classification, visualisation]

Inspired by Hamilton et al. (2017)

Page 20: 2018 presentation montréal_handouts

PROBABILISTIC MATRIX FACTORIZATION FOR NETWORKS

Find a low-dimensional representation of the species, such that the inner product corresponds to the probability of interacting.

$$Y = P \approx \sigma(U V^{\top})$$

logistic map σ: squeezes input to [0, 1]
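
The idea of a low-dimensional representation whose inner products give interaction probabilities can be sketched as a logistic matrix factorization fitted by stochastic gradient ascent. This is an illustrative stand-in, not the exact model or optimizer from the talk; the rank, learning rate, L2 penalty and toy matrix are all assumptions.

```python
import math
import random

def sigmoid(x):
    """Logistic map: squeezes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def factorize(Y, rank=2, steps=500, lr=0.1, reg=0.01, seed=0):
    """Fit U and V such that sigmoid(U V^T) approximates the 0/1 matrix Y."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    # Small random start; an all-zero start would never move.
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(m)]
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                p = sigmoid(sum(U[i][k] * V[j][k] for k in range(rank)))
                err = Y[i][j] - p  # log-likelihood gradient w.r.t. the inner product
                for k in range(rank):
                    u, v = U[i][k], V[j][k]
                    U[i][k] += lr * (err * v - reg * u)
                    V[j][k] += lr * (err * u - reg * v)
    return U, V

def predict(U, V):
    """Matrix of predicted interaction probabilities sigmoid(u_i . v_j)."""
    return [[sigmoid(sum(a * b for a, b in zip(u, v))) for v in V] for u in U]

# Toy interaction matrix with two groups of species (made up).
Y = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

U, V = factorize(Y)
P = predict(U, V)
```

Each row of `U` and `V` is the embedding of one species. In a real network, a high predicted probability for a pair with an observed 0 would be a candidate missing interaction.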

Page 21: 2018 presentation montréal_handouts

PROBABILISTIC MATRIX FACTORIZATION REVEALS STRUCTURES

Page 22: 2018 presentation montréal_handouts

NETWORK PROPERTIES ARE RETAINED

Page 23: 2018 presentation montréal_handouts

DETECTING (MISSING) INTERACTIONS

Page 24: 2018 presentation montréal_handouts

IN CONCLUSION

data = awesome

networks connect different disciplines

algorithms needed to analyse these networks