Discovering Regulatory Networks from Gene Expression and Promoter Sequence Eran Segal Stanford University
Jan 11, 2016
From Parts to Systems
Parts Modules Interactions
Activity
Gene Regulation
DNA
Gene 2Gene 1
RNA
Protein
DNA → RNA is a tightly regulated process
Gene Regulation
DNA
Gene 2Gene 1
RNA
CodingControlCodingControl
Regulator (transcription factor): Swi5
Motif: ACGTGC
Genome-wide Available Data
[Figure: Gene 1 and Gene 2, each with a control region and a coding region]
DNA sequence: ……ACTAGCGGCTATAATGACTGGACCTACGTACCGATATAATGTCAGCTAGCA……
Gene expression (DNA microarray): mRNA level of all genes, measured in different conditions
Gene Regulation
[Figure: Gene 1 and Gene 2, each with a control region and a coding region; the regulator Swi5 and the motif ACGTGC]
Many diagnostic, prognostic and therapeutic implications
How are genes regulated?
Who regulates whom?
Under which conditions?
Which genes are co-regulated?
Example: Finding Motifs
Cluster gene expression profiles
Search for motifs in control regions of clustered genes
Procedural:
  Apply a different method to each type of data
  Use output of one method as input to the next
[Figure: clustering of the expression matrix (genes × experiments), followed by a motif search in the control regions of the clustered genes]
Control regions:
  Gene I    AGCTAGCTGAGACTGCACAC
  Gene II   TTCGGACTGCGCTATATAGA
  Gene III  GACTGCAGCTAGTAGAGCTC
  Gene IV   CTAGAGCTCTATGACTGCCG
  Gene V    ATTGCGGGGCGTCTGAGCTC
  Gene VI   TTTGCTCTTGACTGCCGCTT
Motif: GACTGC
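The motif-search step of this procedural approach can be sketched as a count of how many control regions contain each k-mer. This is only an illustration on the six sequences shown on the slide; the function name `shared_kmers` and the choice k = 6 are assumptions, and real motif finders model degenerate motifs rather than exact k-mers.

```python
from collections import Counter

def shared_kmers(sequences, k):
    """For every k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set, so that a k-mer repeated inside one sequence is counted once.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts

# Control regions of the clustered genes, copied from the slide.
control_regions = [
    "AGCTAGCTGAGACTGCACAC",
    "TTCGGACTGCGCTATATAGA",
    "GACTGCAGCTAGTAGAGCTC",
    "CTAGAGCTCTATGACTGCCG",
    "ATTGCGGGGCGTCTGAGCTC",
    "TTTGCTCTTGACTGCCGCTT",
]

counts = shared_kmers(control_regions, 6)
best_kmer, n_genes = counts.most_common(1)[0]
print(best_kmer, n_genes)
```

On these sequences the most widely shared 6-mer is the slide's motif, GACTGC, and one control region does not contain it, which already hints at why a clustering followed by a separate motif search can be fragile.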
Our Approach: Model Based
What is a model?
A description of the biological process that could have generated the observed data (stochastic, probabilistic)
Our Approach: Model Based
Statistical modeling language for biological domains
  Based on Bayesian networks
Classes of objects
Properties
  Observed: gene sequence, experiment conditions
  Hidden: gene module
Interactions
  Expression level as a function of gene and experiment properties
[Figure: classes Gene (Module), Experiment (Condition, Tumor) and Expression]
STGFK ’01 (ISMB)
Probabilistic Model
Defines a joint distribution
[Figure: the class-level model (Gene: Module; Experiment: Tumor, Condition; Expression: Level) unrolled into a network over Module1, Module2, Tumor1, Tumor2, Condition1, Condition2 and Level1,1, Level1,2, Level2,1, Level2,2]
Bayesian Network
P(Level2,1 | Module2,Condition2,Tumor2)
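Joint distribution:
$$P(\text{Module}, \text{Tumor}, \text{Condition}, \text{Level}) = \prod_{g \in G} P(g.\text{Module}) \prod_{e \in E} P(e.\text{Tumor})\, P(e.\text{Condition}) \prod_{g \in G} \prod_{e \in E} P(\text{Level}_{g,e} \mid g.\text{Module}, e.\text{Condition}, e.\text{Tumor})$$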
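A small worked example may help to see how such a joint distribution factorizes over genes, experiments and gene-experiment pairs. Everything numeric below is hypothetical: the probability tables, the Gaussian form of the Level distribution and the names `p_level` and `log_joint` are assumptions made for illustration, not the parameterization used in the talk.

```python
import math

# Hypothetical probability tables (illustrative values only).
p_module = {1: 0.5, 2: 0.5}
p_tumor = {"A": 0.7, "B": 0.3}
p_condition = {"treated": 0.4, "control": 0.6}

def p_level(level, module, condition, tumor):
    """Toy P(Level | Module, Condition, Tumor): unit-variance Gaussian density."""
    mean = 1.0 if module == 1 else -1.0
    mean += 0.5 if tumor == "A" else 0.0
    mean += 0.25 if condition == "treated" else 0.0
    return math.exp(-0.5 * (level - mean) ** 2) / math.sqrt(2 * math.pi)

def log_joint(gene_module, exp_tumor, exp_condition, levels):
    """Sum of log-factors: one per gene, two per experiment, one per (gene, experiment)."""
    total = sum(math.log(p_module[m]) for m in gene_module.values())
    total += sum(math.log(p_tumor[t]) for t in exp_tumor.values())
    total += sum(math.log(p_condition[c]) for c in exp_condition.values())
    for (g, e), level in levels.items():
        total += math.log(p_level(level, gene_module[g], exp_condition[e], exp_tumor[e]))
    return total

levels = {("g1", "e1"): 1.5}
in_module_1 = log_joint({"g1": 1}, {"e1": "A"}, {"e1": "control"}, levels)
in_module_2 = log_joint({"g1": 2}, {"e1": "A"}, {"e1": "control"}, levels)
print(in_module_1 > in_module_2)
```

Comparing the two scores is the basic operation behind assigning a hidden module to a gene: the assignment that explains the observed level better gets the higher joint probability.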
Probabilistic Model Defines a joint distribution Learned automatically from data
Parameterization Structure Assignment to hidden variables
Find model M that maximizes P(M | D)
Tumor
Module
Level
Condition
Exper.
Gene
Expression
Learn parameterization and structure of distributions
Learn network structure
  Thousands of variables
  Space of possible networks is super-exponential
Probabilistic inference in the Bayesian network
  Millions of hidden variables
  Variables are highly dependent
NP-Hard
Convex optimization, graph theoretic algorithms, dynamic programming, heuristic search
Problem-specific structure; modularity in biological systems
STGFK ’01 (ISMB)
Scheme
Biological problem, data, model design, learn model, analyze results
  Model design: classes of objects, properties, interactions
  Learn model: automatically from data (structure, parameterization)
  Analyze results: visualization, literature, statistics
Derive biological insights from model
STGFK ’01 (ISMB)
Outline
Who regulates whom and when? Model Learning algorithm Evaluation Wet lab experiments
How are genes regulated?
Regulation of multi-functional genes
Evolution of gene regulation
Ongoing Biological Debate
Can we discover actual regulators from gene expression data alone?
[Figure: an activator and a repressor controlling a regulated gene, shown in three states (State 1, State 2, State 3)]
Gene Regulation: Simple Example
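[Figure: expression of the regulators and of the regulated gene measured by DNA microarray. A regulation tree asks "Activator?" and then "Repressor?" (true/false branches) and ends in State 1, State 2 or State 3. The tree is the regulation program; it is shown above the module genes together with activator expression and repressor expression.]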
SSRPBKF ’03 (Nature Genetics)
Genes in the same module share the same regulation program
Module Networks
Goal: Discover regulatory modules and their regulators
  Module genes: set of genes that are similarly controlled
  Regulation program: expression as function of regulators
[Figure: modules, each with a regulation tree over regulators such as HAP4 and CMK1 (true/false branches)]
SSRPBKF ’03 (Nature Genetics)
Expression level in each module is a function of expression of regulators
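A regulation program of this kind can be written as a small decision tree over regulator expression. The sketch below follows the simple activator/repressor example from the earlier slide; the exact tree layout, the threshold of 0 and the per-state mean values in `STATE_MEAN` are assumptions for illustration only.

```python
def regulation_program(activator_expr, repressor_expr, threshold=0.0):
    """Walk a two-level regulation tree and return the state (leaf) it reaches."""
    if activator_expr <= threshold:   # "Activator?" is false
        return "State 1"
    if repressor_expr <= threshold:   # "Repressor?" is false
        return "State 2"
    return "State 3"

# Hypothetical mean expression of the module genes in each state.
STATE_MEAN = {"State 1": 0.0, "State 2": 2.0, "State 3": -1.0}

def predicted_module_expression(activator_expr, repressor_expr):
    """All genes in the module share the same program, hence the same prediction."""
    return STATE_MEAN[regulation_program(activator_expr, repressor_expr)]

print(regulation_program(1.2, -0.4))           # activator on, repressor off
print(predicted_module_expression(1.2, 0.9))   # both regulators on
```

Because every gene in a module shares one tree, the number of parameters grows with the number of modules rather than with the number of genes.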
Module Network Probabilistic Model
[Figure: Gene has a Module variable; Experiment has Regulator1, Regulator2, Regulator3; Expression has Level]
Module: what module does gene “g” belong to?
Regulator1: expression level of Regulator1 in experiment
P(Level | Module, Regulators): one regulation tree per module, over regulators such as BMH1, GIC2, HAP4 and CMK1
SSRPBKF ’03 (Nature Genetics)
Outline
Who regulates whom and when? Model Learning algorithm Evaluation Wet lab experiments
How are genes regulated?
Regulation of multi-functional genes
Evolution of gene regulation
Learning Problem
[Figure: the module network model (Level as a function of Module and Regulator1, Regulator2, Regulator3), with a regulation tree over HAP4 and CMK1]
Goal: find gene module assignments and tree structures that maximize P(M|D)
Hard
  Genes: 5000-10000
  Regulators: ~500
SSRPBKF ’03 (Nature Genetics)
Learning Algorithm Overview
[Figure: learning loop]
  clustering gives an initial gene module assignment
  Learn regulation programs (trees over regulators such as HAP4 and CMK1), giving regulatory modules
  Relearn gene assignments to modules, and repeat
SSRPBKF ’03 (Nature Genetics)
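The alternation between learning regulation programs and relearning gene assignments can be sketched as follows. This is a deliberately simplified stand-in: each module's "program" is just the mean expression profile of its genes (which makes the loop equivalent to k-means), whereas the method in the talk learns regulation trees and scores them with a Bayesian score. All function names are hypothetical.

```python
def learn_programs(expr, assignment, n_modules):
    """Stand-in for program learning: the mean profile of each module's genes."""
    n_exp = len(next(iter(expr.values())))
    programs = []
    for m in range(n_modules):
        members = [expr[g] for g in expr if assignment[g] == m]
        if members:
            programs.append([sum(col) / len(members) for col in zip(*members)])
        else:
            programs.append([0.0] * n_exp)  # empty module: neutral profile
    return programs

def reassign(expr, programs):
    """Move each gene to the module whose program predicts its profile best."""
    def sq_err(profile, program):
        return sum((x - y) ** 2 for x, y in zip(profile, program))
    return {g: min(range(len(programs)), key=lambda m: sq_err(p, programs[m]))
            for g, p in expr.items()}

def learn_module_network(expr, assignment, n_modules, max_iter=20):
    """Iterate the two steps until no gene changes module."""
    programs = learn_programs(expr, assignment, n_modules)
    for _ in range(max_iter):
        new_assignment = reassign(expr, programs)
        if new_assignment == assignment:
            break
        assignment = new_assignment
        programs = learn_programs(expr, assignment, n_modules)
    return assignment, programs

expr = {"g1": [2.0, 2.0, -2.0, -2.0], "g2": [2.1, 1.9, -2.0, -2.1],
        "g3": [-2.0, -2.0, 2.0, 2.0], "g4": [-1.9, -2.1, 2.1, 2.0]}
start = {"g1": 0, "g2": 1, "g3": 1, "g4": 1}   # a poor initial clustering
final, _ = learn_module_network(expr, start, n_modules=2)
print(final)
```

Even in this toy version a gene that starts in the wrong module is moved once the programs are learned, which mirrors the slide's point that the initial clustering is only a starting point.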
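Learning Regulation Programs
[Figure: expression of the module genes across experiments, first with experiments sorted in original order, then with experiments sorted by Hap4 expression (the regulator)]
No split:
$$\log P(M \mid D) \propto \log P(D \mid \mu, \sigma) + \log P(\mu, \sigma)$$
Split on HAP4 (subscripts 1 and 2 label the two groups of experiments created by a split):
$$\log P(M \mid D) \propto \log P(D_{HAP4,1} \mid \mu_{HAP4,1}, \sigma_{HAP4,1}) + \log P(D_{HAP4,2} \mid \mu_{HAP4,2}, \sigma_{HAP4,2}) + \log P(\mu_{HAP4,1}, \sigma_{HAP4,1}, \mu_{HAP4,2}, \sigma_{HAP4,2})$$
Split on SIP4:
$$\log P(M \mid D) \propto \log P(D_{SIP4,1} \mid \mu_{SIP4,1}, \sigma_{SIP4,1}) + \log P(D_{SIP4,2} \mid \mu_{SIP4,2}, \sigma_{SIP4,2}) + \log P(\mu_{SIP4,1}, \sigma_{SIP4,1}, \mu_{SIP4,2}, \sigma_{SIP4,2})$$
Split on HAP4, then on CMK1:
$$\log P(M \mid D) \propto \log P(D_{HAP4,1} \mid \mu_{HAP4,1}, \sigma_{HAP4,1}) + \log P(D_{CMK1,1} \mid \mu_{CMK1,1}, \sigma_{CMK1,1}) + \log P(D_{CMK1,2} \mid \mu_{CMK1,2}, \sigma_{CMK1,2}) + \ldots$$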
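The comparison between candidate splits (for example HAP4 versus SIP4) can be sketched by scoring each split with the Gaussian log-likelihood of the two groups of experiments it creates. This sketch makes several assumptions: it uses one average module expression value per experiment, a fixed threshold on regulator expression, and maximum-likelihood Gaussians, and it omits the prior term log P(μ, σ) that appears in the score on the slide.

```python
import math

def gaussian_loglik(values):
    """Log-likelihood of values under a Gaussian fitted to those same values."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    var = max(ss / n, 1e-6)  # floor keeps the log finite for constant groups
    return -0.5 * n * math.log(2 * math.pi * var) - ss / (2 * var)

def split_score(module_expr, regulator_expr, threshold=0.0):
    """Score of splitting the experiments by regulator expression."""
    up = [x for x, r in zip(module_expr, regulator_expr) if r > threshold]
    down = [x for x, r in zip(module_expr, regulator_expr) if r <= threshold]
    if not up or not down:
        return float("-inf")  # a split must create two non-empty groups
    return gaussian_loglik(up) + gaussian_loglik(down)

def best_regulator(module_expr, regulators):
    """Pick the regulator whose split explains the module expression best."""
    return max(regulators, key=lambda name: split_score(module_expr, regulators[name]))

module_expr = [2.0, 2.1, 1.9, -2.0, -2.1, -1.9]          # hypothetical data
regulators = {"HAP4": [1, 1, 1, -1, -1, -1], "SIP4": [1, -1, 1, -1, 1, -1]}
print(best_regulator(module_expr, regulators))
```

A regulator whose expression separates the experiments into groups with tight, distinct module expression gets a higher score, which is the intuition behind sorting the experiments by Hap4 expression on the slide.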
Learning Algorithm Performance
[Plot: Bayesian score (avg. per gene), from -131 to -128, versus algorithm iterations (0 to 20)]
[Plot: gene module assignment changes (% from total), from 0 to 50, versus algorithm iterations (0 to 20)]
Significant improvements across learning iterations
Many genes (50%) change module assignment in learning
SPRKF ’03 (UAI)
Outline
Who regulates whom and when? Model Learning algorithm Evaluation Wet lab experiments
How are genes regulated?
Regulation of multi-functional genes
Evolution of gene regulation
Yeast Stress Data
Genes: selected 2355 that showed activity
Experiments (173): diverse environmental stress conditions (heat shock, nitrogen depletion, …)
Comparison to Bayesian Networks
Bayesian Network (Friedman et al. ’00; Hartemink et al. ’01)
Expression level of each gene is a function of expression of regulators
[Figure: fragment of learned Bayesian network over genes including Cmk1, Hap4, Mig1, Ste12, Yap1 and Gic1]
  2355 variables (genes)
  173 instances (experiments)
Problems: robustness, interpretability
Module Network (SPRKF ’03 (UAI))
Solutions
  Robustness: sharing parameters
  Interpretability: module-level model
[Figure: module network in which Level depends on Module and on Regulator1, Regulator2 and Regulator3]
Comparison to Bayesian Networks
Problems: robustness, interpretability
Solutions
  Robustness: sharing parameters
  Interpretability: module-level model
[Plot: test data log-likelihood (gain per instance), from -150 to 150, versus number of modules (0 to 500), shown relative to Bayesian network performance]
SPRKF ’03 (UAI)
Learn which parameters are shared (by learning which genes are in the same module)
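From Model to Regulatory Modules
[Figure: the learned model (Level as a function of Module and Regulator1, Regulator2, Regulator3) yields regulatory modules with regulation trees over regulators such as HAP4 and CMK1]
Biologically relevant?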
SSRPBKF ’03 (Nature Genetics)
Respiration Module
[Figure: regulation program and module genes, with the predicted regulator marked]
Module genes functionally coherent? Energy production (oxid. phos. 26/55, P < 10^-30)
Module genes known targets of predicted regulators? Hap4+Msn4 known to regulate module genes
SSRPBKF ’03 (Nature Genetics)
Energy, Osmolarity, & cAMP Signaling
Module genes known targets of predicted regulators? Regulation by non-TFs (Tpk1, a cAMP-dependent protein kinase)
[Figure: regulation program and module genes]
Biological Evaluation Summary
Are the module genes functionally coherent? 46/50
Are some module genes known targets of the predicted regulators? 30/50
Functionally coherent = module genes enriched for GO annotations with hypergeometric p-value < 0.01 (corrected for multiple hypotheses)
Known targets = direct biological experiments reported in the literature
SSRPBKF ’03 (Nature Genetics)
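The hypergeometric p-value used in the definition of functional coherence can be computed directly from binomial coefficients. The sketch below is generic: the function name and the example counts are hypothetical, and the multiple-hypothesis correction mentioned on the slide is not included because the slide does not say which correction was used.

```python
from math import comb

def hypergeom_pvalue(population, annotated, module_size, hits):
    """P(X >= hits) when module_size genes are drawn without replacement
    from a population in which `annotated` genes carry the annotation."""
    total = comb(population, module_size)
    upper = min(annotated, module_size)
    tail = sum(comb(annotated, k) * comb(population - annotated, module_size - k)
               for k in range(hits, upper + 1))
    return tail / total

# Hypothetical example: 10 genes, 3 annotated, a module of 3 genes, all annotated.
p = hypergeom_pvalue(population=10, annotated=3, module_size=3, hits=3)
print(p)
```

A module counts as enriched for an annotation when this tail probability, after correction, falls below the chosen cutoff (0.01 on the slide).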
Outline
Who regulates whom and when? Model Learning algorithm Evaluation Wet lab experiments
How are genes regulated?
Regulation of multi-functional genes
Evolution of gene regulation
From Model to Detailed Predictions
Prediction: Regulator ‘X’ regulates process ‘Y’
Experiment: Knock out ‘X’ and repeat experiment
[Figure: regulation tree containing HAP4 and Ypl230w, the candidate regulator X]
SSRPBKF ’03 (Nature Genetics)
Does ‘X’ Regulate Predicted Genes?
Experiment: knock out Ypl230w (stationary phase)
Wild-type vs. mutant, >4x: 1334 regulated genes (312 expected by chance)
Rank modules by regulated genes
Modules predicted to be regulated by Ypl230w:
  Module                            Sig.
  Protein folding                   P<0.0001
  Cell differentiation              P<0.02
  Glycolysis and folding            P<0.04
  Mitochondrial and protein fate    P<0.04
Ypl230w regulates computationally predicted genes
SSRPBKF ’03 (Nature Genetics)
Ppt1 knockout (hypo-osmotic stress), wild-type vs. mutant: regulated genes (1014)
Kin82 knockout (heat shock), wild-type vs. mutant: regulated genes (1034)
Predicted modules, Kin82:
  Module                                   Sig.
  Energy and osmotic stress                P<0.0001
  Energy, osmolarity & cAMP signaling      P<0.006
  mRNA, rRNA and tRNA processing           P<0.02
Predicted modules, Ppt1:
  Module                                   Sig.
  Ribosomal and phosphate metabolism       P<0.009
  Amino acid and purine metabolism         P<0.01
  mRNA, rRNA and tRNA processing           P<0.02
  Protein folding                          P<0.02
  Cell cycle                               P<0.02
Does ‘X’ Regulate Predicted Genes?
SSRPBKF ’03 (Nature Genetics)
Wet Lab Experiments Summary
3/3 regulators regulate computationally predicted genes
New yeast biology suggested
  Ypl230w activates protein-folding, cell wall and ATP-binding genes
  Ppt1 represses phosphate metabolism and rRNA processing
  Kin82 activates energy and osmotic stress genes
SSRPBKF ’03 (Nature Genetics)
Ongoing Biological Debate
Can we discover actual regulators from gene expression data alone?
Many regulatory relationships can be induced from gene expression data
SSRPBKF ’03 (Nature Genetics)
Assumption: Regulators are transcriptionally regulated
Feedforward, auto-regulatory “motifs” (Shen-Orr et al. 2002)
TFs and SMs have detectable expression signature
[Figure legend: undetected regulators, detected regulators, detected target]
  Regulator chain (Respiration): Phd1 (TF), Hap4 (TF); targets Cox4, Cox6, Atp17
  Auto regulation (Snf kinase regulated processes): Yap6 (TF); targets Vid24, Tor1, Gut2
  Positive signaling loop (Sporulation & cAMP): Sip2 (SM), Msn4 (TF); targets Vid24, Tor1, Gut2
Why Does it Work?
Statistical methods can infer their regulatory relationships from gene expression data
SSRPBKF ’03 (Nature Genetics)
Outline
Who regulates whom and when?
How are genes regulated? Model Evaluation
Regulation of multi-functional genes
Evolution of gene regulation
From Sequence to Expression
[Figure: Gene 1, Gene 2 and Gene 3 with DNA control sequences containing the motifs ACGTGC and GATAG, bound by an activator and a repressor. Genes are grouped by the motifs in their control sequence (combinations of ACGTGC and GATAG, or no motifs) and compared with their expression measured by DNA microarray.]
Sequence → Expression
Goal: Explain how expression arises from sequence
  Construct mechanistic model of gene regulation
  Learn the model from sequence and expression data
Two Phase Approach (I)
Cluster gene expression profiles
Search for motifs in control regions of clustered genes
Procedural:
  Apply a different method to each type of data
  Use output of one method as input to the next
[Figure: clustering of the expression matrix (genes × experiments), followed by a motif search in the control regions of the clustered genes]
Control regions:
  Gene I    AGCTAGCTGAGACTGCACAC
  Gene II   TTCGGACTGCGCTATATAGA
  Gene III  GACTGCAGCTAGTAGAGCTC
  Gene IV   CTAGAGCTCTATGACTGCCG
  Gene V    ATTGCGGGGCGTCTGAGCTC
  Gene VI   TTTGCTCTTGACTGCCGCTT
Motif: GACTGC
Two Phase Approach: Problems
Expression clustering is not perfect
[Figure: two alternative clusterings (Clustering A and Clustering B) of the same genes into Cluster I and Cluster II, with the genes that share a motif marked]
Two Phase Approach (II)
Iterate over all sequences of length k
Find all genes that have each k-mer in their promoter
Keep k-mers whose genes are coherent in expression
[Figure: promoters containing k-mers such as GATACC, TCGACT and AAATGC]
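The three steps of this second two-phase approach can be sketched as below. The coherence measure (mean pairwise Pearson correlation), the thresholds `min_genes` and `min_coherence`, and the toy promoters and expression profiles are all assumptions chosen for illustration; the slide does not specify how coherence is measured.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def coherence(profiles):
    """Mean pairwise correlation among the expression profiles."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def coherent_kmers(promoters, expression, k, min_genes=3, min_coherence=0.8):
    # Step 1 and 2: index every k-mer by the genes whose promoter contains it.
    genes_with = {}
    for gene, seq in promoters.items():
        for i in range(len(seq) - k + 1):
            genes_with.setdefault(seq[i:i + k], set()).add(gene)
    # Step 3: keep k-mers whose genes are coherent in expression.
    kept = {}
    for kmer, genes in genes_with.items():
        if len(genes) >= min_genes:
            if coherence([expression[g] for g in sorted(genes)]) >= min_coherence:
                kept[kmer] = sorted(genes)
    return kept

promoters = {"g1": "TTGATACCAA", "g2": "CCGATACCGG", "g3": "AGGATACCTCGACT",
             "g4": "GGTCGACTGG", "g5": "AATCGACTCC"}
expression = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.5], "g3": [0, 1, 2, 3.2],
              "g4": [4, 3, 2, 1], "g5": [1, 3, 1, 3]}
kept = coherent_kmers(promoters, expression, k=6)
print(kept)
```

In this toy data TCGACT occurs in three promoters but is dropped because its genes are not co-expressed, which is exactly the situation in which a single motif fails to show coherent expression.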
Two Phase Approach: Problems
Single motifs may not have coherent expression
[Figure: an activator motif (TCGACTGC) and a repressor motif (GATAC); genes carrying TCGACTGC alone are compared with genes carrying TCGACTGC + GATAC]
Two Phase Approach: Problems
Are we missing motifs?
[Figure: TCGACTGC alone, or TCGACTGC + CCAAT?]
[Figure: DNA sequences scanned for the motifs TCGACTGC, GATAC, CCAAT and GCAGTT; the motif occurrences in each gene are combined into motif profiles, and the motif profiles are linked to expression profiles]
Unified Model of Gene Regulation
SYK ’03 (ISMB)
[Figure: for each gene, sequence, then motifs, then motif profiles, then expression profiles; genes that share a motif profile and an expression profile form cis-regulatory modules]
Regulatory Module
  Expression of module genes (across experiments)
  DNA control sequences of module genes
  Motif profile: TCGACTGC + GATAC
SYK ’03 (ISMB)
Our Approach
Unified model of gene regulation using sequence and expression
Model trained as a whole
  Motif profiles are predictive of expression
  Expression clusters share motif profiles
  Motifs added to make profiles predictive
Model learned without prior knowledge
  Input I: sequence data
  Input II: expression data
SYK ’03 (ISMB)
Problems and Solutions
  Expression clustering is not perfect → Unified model for expression and motifs
  A single motif cannot explain variation in expression → Use combinatorial motif profiles
  Are we missing motifs? → Dynamically add motifs to explain expression
SYK ’03 (ISMB)
Probabilistic Model
[Figure: Gene has sequence variables S1, S2, S3, S4 and motif variables R1, R2, R3; the pipeline is sequence, motifs, motif profiles, expression profiles]
R_i: is motif i “active” in gene g?
P(R2|S): Position Specific Scoring Matrix (PSSM)
SYK ’03 (ISMB)
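To make the role of the PSSM concrete, here is a minimal sketch that turns a promoter sequence into a probability that a motif is "active". The slide does not show the actual form of P(R|S), so the link used here (a logistic function of the best log-odds window score plus a bias) is an assumption, as are the function names, the uniform background and the toy PSSM built around the consensus ACGTGC.

```python
import math

def pssm_log_odds(pssm, window, background=0.25):
    """Log-odds score of one window: sum over positions of log(p / background)."""
    return sum(math.log(pssm[i][base] / background) for i, base in enumerate(window))

def motif_active_probability(pssm, sequence, bias=0.0):
    """Assumed link: logistic of the best window score (not the slide's formula)."""
    k = len(pssm)
    best = max(pssm_log_odds(pssm, sequence[i:i + k])
               for i in range(len(sequence) - k + 1))
    return 1.0 / (1.0 + math.exp(-(best + bias)))

# Toy PSSM: 0.85 for the consensus base at each position, 0.05 for the others.
consensus = "ACGTGC"
pssm = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in consensus]

p_with = motif_active_probability(pssm, "TTACGTGCAA")
p_without = motif_active_probability(pssm, "TTTTTTTTTT")
print(round(p_with, 3), round(p_without, 3))
```

The point of a probabilistic motif variable is that a near-match still receives an intermediate probability instead of being discarded by an exact-match rule.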
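Probabilistic Model
[Figure: Gene has sequence variables S1 to S4, motif variables R1, R2, R3 and a Module variable; motif profiles map the motif variables to modules 1, 2, 3]
P(Module | R) = softmax:
$$P(\text{Module} = m \mid R_1 = r_1, \ldots, R_L = r_L) = \frac{\exp\left(\sum_{i=1}^{L} w_{mi} r_i\right)}{\sum_{m'=1}^{K} \exp\left(\sum_{i=1}^{L} w_{m'i} r_i\right)}$$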
Motif profile 1: R1 R2
SYK ’03 (ISMB)
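The softmax distribution over modules can be evaluated in a few lines. The weights below are hypothetical and chosen so that module 1 has the motif profile "R1 and R2"; only the softmax form itself is taken from the slide.

```python
import math

def module_posterior(weights, r):
    """P(Module = m | R = r) for each m, with weights[m][i] = w_mi."""
    scores = [sum(w_i * r_i for w_i, r_i in zip(w_m, r)) for w_m in weights]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical weights for K = 3 modules and L = 3 motifs.
weights = [
    [2.0, 2.0, 0.0],   # module 1: motif profile R1 + R2
    [0.0, 0.0, 3.0],   # module 2: motif profile R3
    [0.0, 0.0, 0.0],   # module 3: no motifs
]
probs = module_posterior(weights, r=[1, 1, 0])   # R1 and R2 active, R3 inactive
print([round(p, 3) for p in probs])
```

A gene whose active motifs match a module's profile gets most of the probability mass for that module, while other modules keep a small share.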
Probabilistic Model
[Figure: the model above extended with Experiment (ID) and Expression (Level)]
Every module has a unique expression profile: P(Level | Module, ID)
SYK ’03 (ISMB)
Probabilistic Model
[Figure: the full model (sequence S1 to S4, motifs R1 to R3, Module, experiment ID, Level); genes are grouped into regulatory modules, each with a motif profile and an expression profile]
SYK ’03 (ISMB)
Learning Problem
[Figure: the full model (sequence S1 to S4, motifs R1 to R3, Module, experiment ID, Level)]
Genes: 5000-10000
Variables per gene
  Sequence: 1000
  Expression: 200-500
  Motifs: 50-100 (hidden)
  Module: 1 (hidden)
Learn module assignments, “active” motifs per gene, and motif profiles that maximize P(M|D)
Hard
SYK ’03 (ISMB)
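Learning Algorithm Overview
[Figure: learning loop. Clustering gives an initial gene partition and motif search gives an initial motif set; the E-step and M-step then alternate to produce regulatory modules, with motifs added or deleted between rounds]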
Learning the Set of Active Motifs
Motif set: add all sequences of length k as motifs (e.g. ACGTGC, GCTGGT, TTTTAC)? No: overfitting
Use the expression data to guide the search for new motifs
Dynamically Adding Motifs
Examine all regulatory modules
Compare genes with motif profile to module genes
Add motif initialized to common motif in missed genes
[Figure: Regulatory Module 1 (motif profile, expression profile): all genes match motif profile. Regulatory Module 2: many genes do not match motif profile, so add motif CCAAT]
Outline
Who regulates whom and when?
How are genes regulated? Model Evaluation
Regulation of multi-functional genes
Evolution of gene regulation
Application of Method to Data
                 Yeast                        Human
  Input          4 expression datasets        4 expression datasets
                 500bp upstream seq.          1000bp upstream seq.
  Learned        77 motif profiles            62 motif profiles
                 65 motifs                    80 motifs
  Known motifs   25 known (out of 37)         10 known
TRANSFAC (37 known motifs)
Method found many known motifs in yeast
SYK ’03 (ISMB)
Comparison to Standard Approach (Recovery of known motifs)
                      Yeast   Human
  Our method          25      10
  Standard approach   12      4
Our method found many more known motifs from the literature
SYK ’03 (ISMB)
Cell Division Module in Human
Module genes: Caspase 3, Cyclin A2, Cyclin F, CDC 2, Centromere A, Centromere E, kinesin family, karyopherin alpha 2, polo-like kinase, RGS3, Serine kinase 6, topoisomerase II, TTK protein kinase, aurora kinase B, Kinase family 23, extra spindle pole 1, ARHGAP11A, HEC, Ubiquitin-conjugating, CDC8, DKFZp762E1312, NALP2, C20orf129, DDA3, UBF-fl
[Figure: DNA control sequence of module genes (NFAT motif, novel motif) and expression of module genes]
Module genes functionally coherent? Module genes involved in mitosis (10/25, P < 10^-9)
Module genes known to be regulated by predicted motifs? NFAT regulates cytokine (cell division) genes
SYK ’03 (ISMB)
Biological Evaluation Summary
Human: module genes functionally coherent? 40/62
Yeast: module genes functionally coherent? 65/77
Functionally coherent = module genes enriched for GO annotations with hypergeometric p-value < 0.01 (corrected for multiple hypotheses)
SYK ’03 (ISMB)
Evaluating Human Motifs
Signal or overfitting?
  Hide sequence of gene i
  Learn motif model for module
  Assign gene i to module if gene is in module with Prob. ≥ 0.5
  Repeat for all genes
DNA control sequences of module genes:
  Gene 1: TTGACTGCACTCGGCAATTACTATACT
  Gene 2: AGCACTGCACTGCACTCGACTATACTA
  Gene 3: TTTTACTATCTCACGATGCACTCGGCC
  Gene 4: ACACTTACTATACCCTTGCACTCGTAG
[Figure: non-module genes (Gene 5 to Gene 8) with their DNA control sequences; Gene 1 is held out while the motif model is learned from the remaining module genes]
P<0.5 (false positive)
P ≥ 0.5 (true positive)
Classification margin = True positives (%) - False positives (%)
SS ’04 (RECOMB)
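The classification margin defined above is a simple difference of two rates. In the sketch below the probabilities are hypothetical held-out predictions, and the function name and the 0.5 cutoff parameter are illustrative; fractions are returned rather than percentages.

```python
def classification_margin(module_probs, non_module_probs, threshold=0.5):
    """True-positive rate minus false-positive rate at the given cutoff.

    module_probs: P(gene in module) for held-out genes that are in the module.
    non_module_probs: the same probability for genes outside the module.
    """
    tp = sum(p >= threshold for p in module_probs) / len(module_probs)
    fp = sum(p >= threshold for p in non_module_probs) / len(non_module_probs)
    return tp - fp

# Hypothetical held-out predictions for one module.
margin = classification_margin([0.9, 0.8, 0.6, 0.2], [0.1, 0.7, 0.3, 0.2])
print(margin)
```

A margin near zero means the learned motifs do not separate module genes from other genes once the gene's own sequence is hidden, which is the overfitting case the slide asks about.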
Motifs: TGCACTCG, TTACTAT
[Plot: classification margin (0 to 0.7) for each module, with modules labelled by functional annotation (e.g. protein folding, mitochondrial membrane, regulation of cdk, RNA splicing); a reference line marks the best classification margin from 100 random modules]
Motif: HSF. Genes: Protein folding. HSF is known to regulate protein folding
Motif: GATA. Genes: Mitochondrial. GATA is known to activate mitochondrial membrane genes
Evaluating Human Motifs
[Plot: the same modules, each labelled with the known motif(s) matched in it (e.g. ETS1, GATA1, E2F, HSF, NFKAPPAB)]
SS ’04 (RECOMB)
Compendium of human cis-regulatory modules
Module genes are functionally coherent
Module genes similarly expressed in external datasets
Learned motifs characterize module genes
Biological Evaluation Summary
Incorporating Protein-DNA binding
Protein-DNA Binding
  Identifies all the genes that are bound by a regulator
  Noisy assay
[Figure: a regulator bound to the control regions of Gene 1 and Gene 2]
Incorporating Protein-DNA binding
[Figure: the sequence and expression model (S1 to S4, R1 to R3, Module, ID, Level) extended with protein-DNA binding variables P1, P2, P3]
R3: is the motif recognized by regulator 3 “active” in gene g?
P3: does regulator 3 bind to gene g?
Protein-DNA data for regulator i is a noisy sensor for regulation by motif i
SBSFK ’02 (RECOMB)
Outline
Who regulates whom and when?
How are genes regulated?
Regulation of multi-functional genes
Evolution of gene regulation
Model Assumption
[Figure: the module network model (Level as a function of Module and Regulator1, Regulator2, Regulator3)]
Assumption: every gene belongs to exactly one module
Multi-Functional Genes Model
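Every gene can belong to multiple modules
[Figure: Module 1 and Module 2 over Gene 1, Gene 2 and Gene 3, with Gene 2 belonging to both modules]
The expression of a gene is the sum of its expression in each module it participates in
[Figure: Gene 2 expression = its expression in one module + its expression in the other]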
Multi-Functional Genes Model
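[Figure: Gene has membership variables M1, M2, M3; Experiment has activity variables A1, A2, A3; Expression has Level]
M_i: is gene “g” part of module i?
A_i: activity level of module i in experiment
Expression is a sum of activity level of all modules:
$$\text{Level}_{g,e} \sim N\left(\sum_i g.M_i \cdot e.A_i,\ \sigma\right)$$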
SBK ’03 (PSB)
Connection to SVD
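Singular Value Decomposition (Golub et al. ’96; Alter et al. ’00)
[Figure: the genes × experiments matrix written as a product of a genes × modules matrix, a modules × modules matrix and a modules × experiments matrix]
$$E = M A^T$$
$$\text{Level}_{g,e} = \sum_i g.M_i \cdot e.A_i$$
$$\text{Level}_{g,e} \sim N\left(\sum_i g.M_i \cdot e.A_i,\ \sigma\right)$$
[Figure: Gene (M1, M2, M3), Experiment (A1, A2, A3), Expression (Level)]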
SBK ’03 (PSB)
Learning problem: module assignments and module activity levels
Difference to our model: discrete module assignments
Learning Assignments and Activities
[Figure: Bayesian network for 3 modules, 2 genes, 2 experiments, with assignments M11, M12, M13, activities A11 to A23 and levels Level11, Level12, Level21, Level22]
Assignments hidden and activities hidden: hard
Every pair of hidden variables is dependent
  Genes: 5000-10000
  Experiments: ~200
  Modules: 50-100
  1,000,000 dependent hidden variables
Standard approximations
  Loopy belief propagation
  Variational methods
At best, local maximum of approximate energy function
SBK ’03 (PSB)
Learning Assignments and Activities
[Figure: the same Bayesian network (3 modules, 2 genes, 2 experiments)]
Activities observed, assignments hidden: easy
Activities hidden, assignments observed: easy
Initialize, then alternate:
  Optimize activities given assignments
  Optimize assignments given activities
Standard approximations converge (at best) to local maximum of approximate energy function
Our algorithm converges to strong local maximum
SBK ’03 (PSB)
[Figure: the same Bayesian network (3 modules, 2 genes, 2 experiments); activities hidden, assignments observed: easy]
Learning Module Activity Levels
A_ij variables are continuous
Optimization problem:
$$A^{*} = \arg\max_{A} P(A \mid M, \text{Level})$$
Standard least squares problem
SBK ’03 (PSB)
[Figure: the same Bayesian network (3 modules, 2 genes, 2 experiments); activities observed, assignments hidden]
Learning Module Assignments
M_ij variables are discrete
Optimization problem:
$$M^{*} = \arg\max_{M} P(M \mid A, \text{Level}) \quad \text{s.t. } M_{ij} \in \{0, 1\}$$
For each gene, combinatorial search in time 2^m
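For a single gene, the combinatorial search over its m binary module memberships can be sketched by brute force. The sketch assumes Gaussian noise with fixed σ and a uniform prior over assignments, so that maximizing P(M | A, Level) reduces to minimizing squared error; the function name and the toy activities are hypothetical.

```python
from itertools import product

def best_assignment(levels, activities):
    """Exhaustive search over the 2^m membership vectors of one gene.

    levels: expression of the gene in each experiment.
    activities: activities[e][i] is the activity of module i in experiment e.
    Returns the binary tuple minimizing the squared error of the sum model.
    """
    m = len(activities[0])

    def sq_err(assign):
        return sum(
            (lvl - sum(a for a, on in zip(act, assign) if on)) ** 2
            for lvl, act in zip(levels, activities)
        )

    return min(product((0, 1), repeat=m), key=sq_err)

# Toy data: 3 experiments, 3 modules; the gene truly belongs to modules 1 and 3.
activities = [[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0],
              [2.0, 2.0, 0.0]]
levels = [3.0, -1.0, 2.0]   # activity of module 1 + module 3 in each experiment
print(best_assignment(levels, activities))
```

The search is independent per gene once the activities are fixed, but its cost doubles with every additional module, which motivates restricting it to a few candidate modules per gene.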
[Figure: the same Bayesian network (3 modules, 2 genes, 2 experiments); activities observed, assignments hidden]
Learning Module Assignments
Optimize for continuous M_ij
For each gene i, select k largest variables from {M_i1, …, M_im}
Combinatorial search in time 2^k
Optimization problem:
$$M^{*} = \arg\max_{M} P(M \mid A, \text{Level}) \quad \text{s.t. } M_{ij} \in \{0, 1\}$$
Comparison to Plaid (Lazzeroni and Owen ’02)
[Plot: -Log (P-value) of each annotation under Plaid versus under our method, both axes from 0 to 20]
Compare P-value of enrichment for functional annotations (GO) (P-value of annotation enrichment = best hypergeometric p-value in any module)
122 of 137 annotations more significant in our model
SBK ’03 (PSB)
Comparison to Standard Clustering
Compare P-value of enrichment for functional annotations (GO) (P-value of annotation enrichment = best hypergeometric p-value in any module)
[Plot: -Log (P-value) of each annotation under hierarchical clustering versus under our method, both axes from 0 to 20]
120 of 137 annotations more significant in our model
SBK ’03 (PSB)
Adding the Regulation Model
[Figure: the multi-functional genes model (M1 to M3, A1 to A3, Level) in which the activity level of module i in each array is given by a regulation program over Regulator1, Regulator2 and Regulator3 (e.g. HAP4, CMK1)]
BSK ’04 (RECOMB)
Outline
Who regulates whom and when?
How are genes regulated?
Regulation of multi-functional genes
Evolution of gene regulation Robust prediction of gene function Identifying conserved modules
Single Species Gene Expression
Co-expression is not always functionally relevant
  Noise in DNA microarray technology
  Biological sloppiness
Use evolution as a filter
Multiple Species Gene Expression
Different organisms share many of their genes
Can we learn something from observing the expression of the same gene in multiple species?
[Figure: yeast and human genes linked as orthologs]
~30% of yeast genes are conserved in human
Irrelevant co-expression is uncorrelated in different species
Relevant co-expression confers selective advantage
Combining expression from multiple species can improve gene function and regulatory module discovery
Conserved Co-Expression Network
Yeast (643) Worm (949) Fly (155) Human (1202)
Connect genes that are co-expressed in at least two organisms
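The rule for building the conserved network can be sketched as an edge count across organisms. The sketch assumes that genes have already been mapped to shared ortholog groups (here the made-up identifiers A to D) and that a set of co-expressed pairs is available per organism; how co-expression is called within one organism is outside this example.

```python
from collections import Counter

def conserved_edges(coexpression_by_organism, min_organisms=2):
    """Keep pairs of ortholog groups co-expressed in at least min_organisms organisms."""
    counts = Counter()
    for edges in coexpression_by_organism.values():
        # frozenset makes (A, B) and (B, A) the same undirected edge;
        # the set comprehension counts each edge once per organism.
        counts.update({frozenset(edge) for edge in edges})
    return {edge for edge, n in counts.items() if n >= min_organisms}

# Hypothetical co-expressed pairs per organism.
coexpression = {
    "yeast": [("A", "B"), ("B", "C")],
    "worm":  [("B", "A")],
    "fly":   [("C", "D")],
    "human": [("C", "D"), ("A", "B")],
}
network = conserved_edges(coexpression)
print(sorted(sorted(edge) for edge in network))
```

Pairs seen in only one organism, such as B and C above, are filtered out, which is the "evolution as a filter" idea applied to co-expression.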
3D visualization of network (SSKK ’03 (Science))
Network regions: ribosome biogenesis, energy generation, cell cycle, secretion, neuronal, proteasome, general transcription, ribosomal subunits, signaling, translation initiation and elongation, lipid metabolism, unknown
Conserved Co-Expression Network
SSKK ’03 (Science)
Predicting Gene Function
Predict function using guilt-by-association scheme
[Plot: classification accuracy (%), from 0 to 100, for each gene annotation (Gene Ontology), e.g. protein modification]
40 annotations at 50% accuracy
70 annotations at 30% accuracy
SSKK ’03 (Science)
Predicting Protein Modification
[Plot: classification accuracy (%) of the 50 most confident predictions, from 0 to 90]
Predictions using single species: Yeast 12%, Worm 18%, Fly 15%, Human 13%
Multiple species prediction: 76%
Significant improvements over any single species network
SSKK ’03 (Science)
Biological Experiment
Prediction: ZK652.1 plays a role in cell proliferation
Experiment: Knock-out ZK652.1 and test mutant
Excess nuclei in mutant: consistent with cell proliferation prediction
SSKK ’03 (Science)
Outline
Who regulates whom and when?
How are genes regulated?
Regulation of multi-functional genes
Evolution of gene regulation Robust prediction of gene function Identifying conserved modules
Conserved Gene Regulation Model
[Figure: two copies of the module network model (Level as a function of Module and Regulator1, Regulator2, Regulator3), one for Organism 1 and one for Organism 2 (mouse and human)]
Compatibility potential (Module, Module), over modules 1, 2, 3: orthologs are more likely to be in the same module
Regulation programs for the same module are more likely to share regulators
Conserved Regulation
Human (138)
  Normal brain (4)
  Brain tumors: Gliomas (57), Medulloblastoma (60), Miscellaneous (17)
Mouse (42)
  Brain development (39)
  Brain tumors: Medulloblastoma (3)
Goal: Discover regulators in brain that are shared between human and mouse
Comparison to Single Species
[Plot: test data log-likelihood (gain per gene) for human and for mouse, single species versus multiple species]
By combining expression data from mouse, we can learn a better model of gene regulation in human
Neuron Differentiation Module
[Figure: mouse and human regulation programs, both with NeuroD1 as regulator]
Module genes functionally coherent? Brain expressed genes (18/34, P < 10^-12)
Module genes known targets of predicted regulators? NeuroD known to regulate module genes
Summary: Probabilistic Framework
Rich Modeling Language for Biological Processes
  Finding regulators: SSRPBKF ’03 (Nature Gen.), SPRKF ’03 (UAI), BSK ’04 (RECOMB)
  Finding motifs: SS ’04 (RECOMB), SBSFK ’02 (RECOMB), SYK ’03 (ISMB)
  Finding conserved regulators (mouse, human): SSKK ’03 (Science)
Summary: Probabilistic Framework
Rich Modeling Language for Biological Processes
Gene regulation
Two-sided clustering
Learning abstraction hierarchies
Discovering molecular pathways
Learning with clinical data
SOK ’01 (NIPS), SK ’02 (RECOMB), STGFK ’01 (ISMB), SWK ’03 (ISMB), SSKK ’03 (Science), SSRPBKF ’03 (Nature Gen.), SPRKF ’03 (UAI), SS ’04 (RECOMB), SBSFK ’02 (RECOMB), SYK ’03 (ISMB), SBK ’03 (PSB), BSK ’04 (RECOMB)
Summary: Probabilistic Framework
Rich Modeling Language for Biological Processes
Unified Approach for Heterogeneous Data
  Gene expression
  DNA sequence
  Protein-DNA binding data
  Multiple species data
  Protein-protein interaction data
SBSFK ’02 (RECOMB)
SWK ’03 (ISMB)
SSKK ’03 (Science)
SSRPBKF ’03 (Nature Gen.)
SYK ’03 (ISMB)
SS ’04 (RECOMB)
SBK ’03 (PSB)
Summary: Probabilistic Framework
Rich Modeling Language for Biological Processes
Unified Approach for Heterogeneous Data
Model Automatically Learned from Data
  Scheme: data, model design, learn model, analyze results
  Convex optimization
  Graph theoretic algorithms
  Dynamic programming
  Heuristic search
  Exploit modularity in biological system
  Exploit problem-specific structure
Summary: Probabilistic Framework
Rich Modeling Language for Biological Processes
Unified Approach for Heterogeneous Data
Model Automatically Learned from Data
Model Evaluation Methods
  Comparison to existing methods
  Cross validation
  Enrichment for known biological function
  Relative to current knowledge in literature
Summary: Probabilistic Framework
Rich Modeling Language for Biological Processes
Unified Approach for Heterogeneous Data
Model Automatically Learned from Data
Model Evaluation Methods
Testable Biological Hypotheses
  Generate novel hypotheses from model
  Wet-lab validation of predictions
  SSKK ’03 (Science), SSRPBKF ’03 (Nature Gen.)
Summary: Probabilistic Framework
Rich Modeling Language for Biological Processes
Unified Approach for Heterogeneous Data
Model Automatically Learned from Data
Model Evaluation Methods
Testable Biological Hypotheses
Visualization Software
The Challenge Ahead
Organisms
Conditions: developmental, physiological, environmental, clinical, metabolic, experimental
Data types: protein expression, tissue specific expression, interaction data, location data, …
Biological information?