PREDICTING AND UNDERSTANDING NETWORKS USING GRAPH EMBEDDING
Michiel Stock michielstock
1
KERMIT
RATIONALISM VS EMPIRICISM
Rationalism “the view that regards reason as the chief source and test of knowledge”
Historically associated with mathematics and physics.
Deduction: A -> B
Plato, René Descartes
Empiricism “a theory that states that knowledge comes only or primarily from sensory experiences”
Historically associated with biology, chemistry, geology…
Experimentation
Aristotle, John Locke
2
RATIONALISM: BUILDING THEORIES
3
THEORY
Theory of evolution
Allometric scaling: $Y = Y_0 M^a$
Central dogma of molecular biology
EMPIRICISM: COLLECTING DATA
4
DATA Dataism
GRAPHS AND NETWORKS: THE BEST OF BOTH WORLDS
A graph G=(V, E) consists of a set of vertices V together with a set of edges E representing connections between the vertices.
Can be interpreted as a mechanistic model. Draws from a very rich body of mathematical theory.
Can be experimentally determined. Data structure.
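As a minimal sketch of the definition above, a graph can be stored as a set of vertices plus a set of edges, from which a neighbour lookup follows directly. The species names and the helper `adjacency` are hypothetical and only serve as illustration.

```python
# Minimal sketch: an undirected graph G = (V, E) as plain Python sets.
# The vertex names are made up for illustration.
V = {"bee", "fly", "clover", "daisy"}
E = {("bee", "clover"), ("bee", "daisy"), ("fly", "daisy")}


def adjacency(vertices, edges):
    """Build a neighbour lookup {vertex: set of neighbours} from (V, E)."""
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)  # undirected: store both directions
    return neighbours


adj = adjacency(V, E)
degree = {v: len(adj[v]) for v in V}  # number of connections per vertex
```

The same pair (V, E) serves both readings from the slide: as a data structure it is what an experiment records, and as a mathematical object it supports quantities such as the degree of each vertex.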
5
a graph
NETWORKS IN SCIENCE
systems biology
social networks
food flavour network
6
ECOLOGICAL NETWORKS
7
Networks in ecology:
food webs, parasitism, pollination
Sampling of species interaction networks:
ANALYSIS OF NETWORK DATA
8
MY STORY
➤ Bioscience engineer (cellular biotech)
➤ Into biology, not into experimenting
➤ PhD in machine learning
➤ past focus: molecular biology
➤ current focus: ecological networks
9
MY RESEARCH PROJECT: MACHINE LEARNING FOR ECOLOGICAL NETWORKS
10
Understand
➤ What are species doing?
➤ How do we compare networks?
➤ How can we extract numerical features?
Predict
➤ Find missing interactions.
➤ Changes in time and space.
➤ Uncertainty in predictions?
Control / manage
➤ Effective monitoring.
➤ How to increase productivity/stability?
➤ Encourage/discourage interactions.
MACHINE LEARNING
The branch of computer science concerned with giving computers the ability to learn without being explicitly programmed.
11
The study and development of algorithms that can detect stable patterns in finite data sets.
SUPERVISED LEARNING
Given a data set of labeled examples, find a function f(.) to predict the label of new data points.
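One of the simplest possible choices for f(.) is a nearest-neighbour rule: a new point receives the label of the most similar labeled example. The sketch below assumes two invented numerical features per example and made-up labels. It only illustrates the idea and is not how camera trap images are classified in practice.

```python
import math

# Hypothetical labeled examples: (features, label).
train = [
    ((1.0, 1.2), "deer"),
    ((0.9, 1.0), "deer"),
    ((3.0, 3.1), "hog"),
    ((3.2, 2.9), "hog"),
]


def f(x):
    """Predict the label of x as the label of its nearest training point."""
    nearest = min(train, key=lambda example: math.dist(example[0], x))
    return nearest[1]


prediction_a = f((1.1, 0.9))  # a new point close to the "deer" examples
prediction_b = f((2.8, 3.0))  # a new point close to the "hog" examples
```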
12
Example: detecting animals on camera trap images
➤ hog
➤ human
➤ deer
➤ empty
[Figure: regression (y as a function of x) and classification (classes in the x1, x2 plane)]
UNSUPERVISED LEARNING
Find simple structure in a complex data set. Often dimension reduction or clustering.
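Clustering can be sketched with plain k-means, which alternates between assigning points to the closest centre and moving each centre to the mean of its points. The data, the hand-picked initial centres and the function name `kmeans` are assumptions for illustration. The methods in the figure that follows (SPADE, FlowSOM, t-SNE) are considerably more elaborate.

```python
import math

# Hypothetical 2-D measurements forming two loose groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]


def kmeans(data, centres, n_iter=10):
    """Plain k-means: alternate between assigning points and updating centres."""
    for _ in range(n_iter):
        # Assignment step: index of the closest centre for every point.
        labels = [min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
                  for p in data]
        # Update step: move each centre to the mean of its assigned points.
        new_centres = []
        for k in range(len(centres)):
            members = [p for p, lab in zip(data, labels) if lab == k]
            if members:
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(centres[k])  # keep the centre of an empty cluster
        centres = new_centres
    return labels, centres


# Initial centres are picked by hand here; in practice they are usually random.
labels, centres = kmeans(points, centres=[points[0], points[3]])
```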
Finally, it should be noted that the computational complexity, and thus the running time, differs greatly between algorithms. While t-SNE based methods and SPADE are typically only able to process a few tens of thousands of cells, methods such as FlowSOM scale to much larger datasets than the other visualization algorithms, allowing the algorithm to efficiently process millions of cells.
[Figure 2 panels: a SPADE, b FlowSOM, c t-SNE, d Scaffold map. Markers: MHC class II, CD19, CD64, CD11c, CD3, CD11b, LY6G, NK1.1, autofluorescence. Scaffold map populations: long-term HSC, short-term HSC, CMP, CLP, GMP, MEP, mast cells, monocytes, neutrophils, macrophages, mDCs, pDCs, NK cells, NKT cells, CD4+ T cells, CD8+ T cells, γδ T cells, B cells, basophils, eosinophils.]
Figure 2 | Marker visualization of mouse splenocytes. a–c | Visualization of mouse splenocytes using SPADE (spanning tree progression of density normalized events), FlowSOM (flow cytometry data analysis using self-organizing maps) and t-SNE (t-distributed stochastic neighbour embedding). SPADE uses density-based downsampling and hierarchical clustering to group similar cells, which are visualized in a minimal spanning tree. FlowSOM also uses a minimal spanning tree but does not use subsampling, and it clusters the cells using a self-organizing map. By contrast, methods based on t-SNE (such as viSNE and ACCENSE) do not cluster the cells but show each cell individually in two new dimensions that take similarities in all the original dimensions into account. For SPADE and t-SNE, a subplot is shown for each individual marker, in which the colour is more saturated for higher expression levels. By comparing the different subplots, the cell type can be determined. FlowSOM uses pie charts, combining all markers in a single plot. The height of each part indicates the expression level. Owing to the density-based subsampling, SPADE analysis will even out the distribution of the different cell types. Although FlowSOM does not do this, it is still able to distinguish populations as small as 0.7% (such as neutrophils (which are CD11b+LY6G+) in this dataset), while at the same time running almost two orders of magnitude faster (9 seconds versus 700 seconds on a single-threaded processor). FlowSOM also offers additional visualization options, such as the original self-organizing map grid or a t-SNE mapping of the nodes. All cells were used by SPADE and FlowSOM, but owing to computational limitations only 10,000 cells were processed using t-SNE. d | Visualization of scaffold maps for the mouse immune reference data set from REF. 53. Code to replicate these figures is available at https://github.com/saeyslab/FlowCytometryScripts.
CLP, common lymphoid progenitor; CMP, common myeloid progenitor; GMP, granulocyte–monocyte progenitor; HSC, haematopoietic stem cell; mDC, myeloid dendritic cell; MEP, megakaryocyte–erythroid progenitor; NK, natural killer; NKT, natural killer T; pDC, plasmacytoid dendritic cell.
Saeys et al. (2016) “Computational flow cytometry: helping to make sense of high-dimensional immunology data”, Nature Reviews Immunology, volume 16 (July 2016), figure on p. 456, www.nature.com/nri
13
LEARNING FROM/ON GRAPHS
14
VS
typical dataset
PAIRWISE DATA (TWO OBJECTS AND A LABEL)
15
Each data point is a triplet (person, book, label), with label 0 or 1:
(person, book, 1), (person, book, 0), …
A PAIRWISE MODEL
16
‘Learn’ a function f(person, book) on pairs based on observed data, such that a high score indicates that someone would be interested in a book.
Main idea: combine the ‘description’ of the objects (person and book) into a large pairwise feature vector.
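One common way to build such a pairwise feature vector is the Kronecker product of the two descriptions, so that every person feature is multiplied with every book feature. The slide does not say which combination is used, so the Kronecker product, the feature values and the fixed weights below are assumptions for illustration only.

```python
# Hypothetical feature descriptions of one person and one book.
person = [1.0, 0.0, 2.0]   # e.g. interest in fiction, science, history
book = [0.5, 1.0]          # e.g. length, difficulty


def pairwise_features(u, v):
    """Kronecker product: every person feature times every book feature."""
    return [ui * vj for ui in u for vj in v]


def f(u, v, weights):
    """Linear score on the pairwise feature vector."""
    return sum(w * x for w, x in zip(weights, pairwise_features(u, v)))


phi = pairwise_features(person, book)
# Weights would normally be learned from the observed triplets; fixed here.
weights = [1.0, -1.0, 0.0, 0.0, 0.5, 0.5]
score = f(person, book, weights)
```

With 3 person features and 2 book features the pairwise vector has 3 × 2 = 6 entries, which is why it grows large quickly for rich descriptions.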
FINDING INTERACTING SPECIES = FINDING INTERESTING BOOKS
17
For example: plant-pollinator interactions
pollinators => persons
plants => books
pollination => reading
PAIRWISE LEARNING TO FIND FALSE NEGATIVES IN SPECIES INTERACTIONS
18
$\text{precision} = \frac{\#\,\text{interactions}}{\text{size top}}$
Improvement compared to random selection
Stock et al. (2017)
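The precision of a top list can be sketched as follows: rank all candidate pairs by predicted score, keep the top k, and count how many are true interactions. The scores and labels are invented, and expressing the improvement as a ratio to the random-selection baseline is an assumption, not necessarily the measure used in Stock et al. (2017).

```python
def precision_at_top(scores, labels, k):
    """Fraction of the k highest-scoring pairs that are true interactions."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    return sum(labels[i] for i in top) / k


# Hypothetical predicted scores and true labels (1 = interaction) for 8 pairs.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

precision = precision_at_top(scores, labels, k=4)
baseline = sum(labels) / len(labels)  # expected precision of random selection
improvement = precision / baseline
```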
GRAPH EMBEDDING: ENCODING-DECODING
19
Graphs are complex! Can we extract a numerical representation from them that we can easily use in conventional machine learning?
[Figure: the graph is encoded into an embedding z, which is used for neighbourhood prediction, classification and visualisation]
Inspired by Hamilton et al. (2017)
PROBABILISTIC MATRIX FACTORIZATION FOR NETWORKS
20
Find a low-dimensional representation of the species, such that the inner product corresponds to the probability of interacting.
$Y = P \approx \sigma(U V^\top)$
logistic function $\sigma$: squeezes input to the interval (0, 1)
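A minimal sketch of such a factorization, assuming a Bernoulli likelihood for each entry of Y, a small L2 penalty and plain stochastic gradient ascent. The interaction matrix, the function names `factorize` and `predict`, and all hyperparameters are hypothetical, and this is not necessarily the fitting procedure behind the results on the following slides.

```python
import math
import random


def sigma(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def factorize(Y, rank=2, lr=0.1, n_iter=500, reg=0.01, seed=0):
    """Fit P = sigma(U V^T) to a binary matrix Y by stochastic gradient
    ascent on the L2-regularised Bernoulli log-likelihood."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(m)]
    for _ in range(n_iter):
        for i in range(n):
            for j in range(m):
                p = sigma(sum(U[i][r] * V[j][r] for r in range(rank)))
                err = Y[i][j] - p  # log-likelihood gradient w.r.t. the inner product
                for r in range(rank):
                    u, v = U[i][r], V[j][r]
                    U[i][r] += lr * (err * v - reg * u)
                    V[j][r] += lr * (err * u - reg * v)
    return U, V


def predict(U, V):
    """Matrix of interaction probabilities P = sigma(U V^T)."""
    return [[sigma(sum(a * b for a, b in zip(u, v))) for v in V] for u in U]


# Hypothetical interaction matrix with two blocks of interacting species.
Y = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

U, V = factorize(Y)
P = predict(U, V)
```

Each row of U and V is the low-dimensional representation of one species. Entries of P that are high where Y is 0 are candidates for missing interactions.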
PROBABILISTIC MATRIX FACTORIZATION REVEALS STRUCTURES
21
NETWORK PROPERTIES ARE RETAINED
22
DETECTING (MISSING) INTERACTIONS
23
IN CONCLUSION
24
data = awesome
networks connect different disciplines
algorithms needed to analyse these networks