2018 presentation montréal_handouts

PREDICTING AND UNDERSTANDING NETWORKS USING GRAPH EMBEDDING

Michiel Stock michielstock


KERMIT

RATIONALISM VS EMPIRICISM

Rationalism “the view that regards reason as the chief source and test of knowledge”

Historically associated with mathematics and physics.

Deduction: A -> B

Plato, René Descartes

Empiricism “a theory that states that knowledge comes only or primarily from sensory experiences”

Historically associated with biology, chemistry, geology…

Experimentation

Aristotle, John Locke


RATIONALISM: BUILDING THEORIES


THEORY

Theory of evolution

Allometric scaling: $Y = Y_0 M^a$

Central dogma of molecular biology

EMPIRICISM: COLLECTING DATA


DATA Dataism

GRAPHS AND NETWORKS: THE BEST OF BOTH WORLDS

A graph G=(V, E) consists of a set of vertices V together with a set of edges E representing connections between the vertices.

Can be interpreted as a mechanistic model. Draws from a very rich body of mathematical theory.

Can be experimentally determined. Data structure.
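
As a minimal sketch of the definition above, a graph can be stored as a vertex set and an edge set and turned into an adjacency list. The species names and the helper `adjacency` are hypothetical and serve only as an illustration.

```python
# Hypothetical toy pollination network; the species names are illustrative only.
vertices = {"bee", "hoverfly", "clover", "daisy"}
edges = {("bee", "clover"), ("bee", "daisy"), ("hoverfly", "daisy")}

def adjacency(vertices, edges):
    """Build an undirected adjacency list from the vertex set V and edge set E."""
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return neighbours

neighbours = adjacency(vertices, edges)

# The degree of a vertex is its number of neighbours.
degree = {v: len(nbrs) for v, nbrs in neighbours.items()}
print(degree)
```

The same structure serves both readings of a graph: the edge set can be measured in the field, and the adjacency list is the object that graph theory reasons about.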


a graph

NETWORKS IN SCIENCE

systems biology

social networks

food flavour network

ECOLOGICAL NETWORKS


Networks in ecology:

parasitism, food webs, pollination

Sampling of species interaction networks:

ANALYSIS OF NETWORK DATA


MY STORY

➤ Bioscience engineer (cellular biotech)

➤ Into biology, not into experimenting

➤ PhD in machine learning

➤ past focus: molecular biology

➤ current focus: ecological networks


MY RESEARCH PROJECT: MACHINE LEARNING FOR ECOLOGICAL NETWORKS


Understand

➤ What are species doing?

➤ How do we compare networks?

➤ How can we extract numerical features?

Predict

➤ Find missing interactions.

➤ Changes in time and space.

➤ Uncertainty in predictions?

Control / manage

➤ Effective monitoring.

➤ How to increase productivity/stability?

➤ Encourage/discourage interactions.

MACHINE LEARNING

The branch of computer science concerned with giving computers the ability to learn without being explicitly programmed.

The study and development of algorithms that can detect stable patterns in finite data sets.

SUPERVISED LEARNING

Given a data set of labeled examples, find a function f(.) to predict the label of new data points.


Example: detecting animals on camera trap images

➤ hog

➤ human

➤ deer

➤ empty

[Figure: regression (predicting y from x) and classification (points in the x1, x2 plane)]
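
A minimal sketch of the regression case, assuming synthetic data and a linear model fitted by least squares. The data, the true coefficients, and the names `w`, `b`, and `f` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labeled examples: y depends linearly on x, plus a little noise.
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=50)

# Fit f(x) = w * x + b by ordinary least squares.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

def f(x_new):
    """Predict the label of new data points with the fitted line."""
    return w * np.asarray(x_new) + b

print(f([3.0, 7.5]))
```

Classification follows the same pattern, with a discrete label (such as hog, human, deer, or empty) in place of the continuous y.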

UNSUPERVISED LEARNING

Find simple structure in a complex data set. Often dimension reduction or clustering.
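
Dimension reduction can be sketched with principal component analysis (PCA), chosen here only because it is short to write down. It is not one of the methods named on this slide, and the synthetic data is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points in 5 dimensions that mostly vary along one direction.
direction = np.array([1.0, 2.0, 0.0, -1.0, 0.5])
latent = rng.normal(size=(200, 1))
X = latent * direction + rng.normal(scale=0.05, size=(200, 5))

# PCA: centre the data and project onto the leading right singular vectors.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                      # 2-dimensional representation of each point

# Fraction of the total variance captured by each component.
explained = s**2 / np.sum(s**2)
print(Z.shape, explained[0])
```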

Finally, it should be noted that the computational complexity, and thus the running time, differs greatly between algorithms. While t-SNE based methods and SPADE are typically only able to process a few tens of thousands of cells, methods such as FlowSOM scale to much larger datasets than the other visualization algorithms, allowing the algorithm to efficiently process millions of cells.

[Figure panels: a SPADE, b FlowSOM, c t-SNE, d Scaffold map. Markers: MHC class II, CD19, CD64, CD11c, CD3, autofluorescence, CD11b, LY6G, NK1.1. Scaffold map labels: long-term HSC, short-term HSC, MEP, CMP, GMP, CLP, mast cells, monocytes, neutrophils, macrophages, mDCs, pDCs, NK cells, NKT cells, CD4+ T cells, CD8+ T cells, γδ T cells, B cells, basophils, eosinophils.]

Figure 2 | Marker visualization of mouse splenocytes. a–c | Visualization of mouse splenocytes using SPADE (spanning tree progression of density normalized events), FlowSOM (flow cytometry data analysis using self-organizing maps) and t-SNE (t-stochastic neighbour embedding). SPADE uses density-based downsampling and hierarchical clustering to group similar cells, which are visualized in a minimal spanning tree. FlowSOM also uses a minimal spanning tree but does not use subsampling, and it clusters the cells using a self-organizing map. By contrast, methods based on t-SNE (such as viSNE and ACCENSE) do not cluster the cells but show each cell individually in two new dimensions that take similarities in all the original dimensions into account. For SPADE and t-SNE, a subplot is shown for each individual marker, in which the colour is more saturated for higher expression levels. By comparing the different subplots, the cell type can be determined. FlowSOM uses pie charts, combining all markers in a single plot. The height of each part indicates the expression level. Owing to the density-based subsampling, SPADE analysis will even out the distribution of the different cell types. Although FlowSOM does not do this, it is still able to distinguish populations as small as 0.7% (such as neutrophils (which are CD11b+LY6G+) in this dataset), while at the same time running almost two orders of magnitude faster (9 seconds versus 700 seconds on a single-threaded processor). FlowSOM also offers additional visualization options, such as the original self-organizing map grid or a t-SNE mapping of the nodes. All cells were used by SPADE and FlowSOM, but owing to computational limitations only 10,000 cells were processed using t-SNE. d | Visualization of scaffold maps for the mouse immune reference data set from REF. 53. Code to replicate these figures is available at https://github.com/saeyslab/FlowCytometryScripts.
CLP, common lymphoid progenitor; CMP, common myeloid progenitor; GMP, granulocyte–monocyte progenitor; HSC, haematopoietic stem cell; mDC, myeloid dendritic cell; MEP, megakaryocyte–erythroid progenitor; NK, natural killer; NKT, natural killer T; pDC, plasmacytoid dendritic cell.

Saeys et al. (2016) “Computational flow cytometry: helping to make sense of high-dimensional immunology data”, Nature Reviews Immunology, volume 16, July 2016, www.nature.com/nri

LEARNING FROM/ON GRAPHS


VS

typical dataset

PAIRWISE DATA (TWO OBJECTS AND A LABEL)

Each example is a triple (person, book, label), with label 0 or 1:

(person, book, 1)

(person, book, 0)

…

A PAIRWISE MODEL


‘Learn’ a function on pairs based on observed data:

f(person, book)

such that a high score indicates that someone would be interested in a book.

Main idea: combine the ‘description’ of the objects (person and book) into a large pairwise feature vector.
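
One common way to build such a pairwise feature vector is the Kronecker product of the two descriptions. The sketch below assumes that choice, which the slide does not specify. The feature values and the weight vector `w` are hypothetical.

```python
import numpy as np

def pairwise_features(person, book):
    """Combine two feature vectors into one pairwise vector (Kronecker product)."""
    return np.kron(person, book)

# Hypothetical descriptions: 3 features per person, 2 per book.
person = np.array([1.0, 0.0, 2.0])
book = np.array([0.5, 1.0])

phi = pairwise_features(person, book)   # length 3 * 2 = 6

# A linear pairwise model f(person, book) = w . phi; in practice w is learned from data.
w = np.ones(6)
score = float(w @ phi)
print(phi, score)
```

The Kronecker product contains every product of one person feature with one book feature, so a linear model on it can express interactions between the two objects. Simple concatenation of the two vectors cannot.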

FINDING INTERACTING SPECIES = FINDING INTERESTING BOOKS


For example: plant-pollinator interactions

pollinators => persons

plants => books

pollination => reading

PAIRWISE LEARNING TO FIND FALSE NEGATIVES IN SPECIES INTERACTIONS


$\text{precision} = \dfrac{\#\,\text{interactions in top}}{\text{size of top}}$

[Figure: improvement compared to random selection]

Stock et al. (2017)
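
The precision of a ranked list of candidate interactions can be sketched as follows. The scores and labels are made up, and expressing the improvement as a ratio to the random-selection baseline is an assumption about how the comparison is made.

```python
def precision_at_top(scores, labels, k):
    """Fraction of true interactions among the k highest-scoring pairs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / k

# Hypothetical scores for six candidate pairs and whether each truly interacts.
scores = [0.9, 0.8, 0.4, 0.7, 0.1, 0.3]
labels = [1, 0, 0, 1, 0, 1]

precision = precision_at_top(scores, labels, k=3)

# Expected precision when the k pairs are selected at random.
baseline = sum(labels) / len(labels)
print(precision, baseline, precision / baseline)
```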

GRAPH EMBEDDING: ENCODING-DECODING


Graphs are complex! Can we extract a numerical representation from them that we can easily use in conventional machine learning?

[Figure: the graph is encoded into an embedding z, which is decoded for neighbourhood prediction, classification, and visualisation.]

Inspired by Hamilton et al. (2017)

PROBABILISTIC MATRIX FACTORIZATION FOR NETWORKS


Find a low-dimensional representation of the species, such that the inner product corresponds to the probability of interacting.

$Y = P \approx \sigma(U V^\top)$

logistic map $\sigma$: squeezes input to [0,1]
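
A minimal sketch of this idea, assuming the factors U and V are fitted to a binary interaction matrix Y by gradient descent on a regularized cross-entropy loss. The fitting procedure, the hyperparameters, and the toy matrix are assumptions for illustration, not the method used in the talk.

```python
import numpy as np

def sigmoid(x):
    """Logistic map: squeezes its input into the interval between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

def logistic_mf(Y, rank=2, lr=0.1, reg=0.01, n_iter=2000, seed=0):
    """Fit P = sigmoid(U V^T) to a binary matrix Y by gradient descent."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.normal(size=(n, rank))
    V = 0.1 * rng.normal(size=(m, rank))
    for _ in range(n_iter):
        P = sigmoid(U @ V.T)
        G = P - Y                          # gradient of the cross-entropy w.r.t. U V^T
        U_new = U - lr * (G @ V + reg * U)
        V = V - lr * (G.T @ U + reg * V)   # uses the old U, so both steps are simultaneous
        U = U_new
    return U, V

# Hypothetical 4 x 5 interaction matrix (rows and columns are two groups of species).
Y = np.array([[1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0]], dtype=float)

U, V = logistic_mf(Y)
P = sigmoid(U @ V.T)
print(np.round(P, 2))
```

Each row of U and V is a low-dimensional representation of one species. Pairs with no observed interaction in Y but a relatively high value in P could then be inspected as candidate missing interactions.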

PROBABILISTIC MATRIX FACTORIZATION REVEALS STRUCTURES


NETWORK PROPERTIES ARE RETAINED


DETECTING (MISSING) INTERACTIONS


IN CONCLUSION


data = awesome

networks connect different disciplines

algorithms needed to analyse these networks


Science