3/4/2013
Bioinformatics and Evolutionary Genomics
Evolution of Genomes, Proteomes, Networks and Complexes
Berend Snel, Associate Professor
Theoretical Biology and Bioinformatics, Department of Biology
Science Faculty, Utrecht University
Today
• Lecture on homology and domains
• Introduction on general aims of the course and on procedural stuff
Requests
• The group is very heterogeneous with respect to previous knowledge (IBMB, GB, research projects, PhD students)
• PLEASE: interrupt / ask questions when I am going too fast, when I use jargon, or when I make jumps/conclusions that seem obvious and 100% logical to me but erratic to you; please point out my implicit assumptions regarding what everybody knows
• Computer exercises: more experienced people help the others
• And also apologies for some redundancy
The (human) genome: why does it look the way it does?
Why? To do stuff! Molecular biology and systems biology.
But the design logic is not so obvious: it is the result of an evolutionary process.
And the classic two types of why
Gene Content
bag of genes
Why does the genome contain the genes that it does?
Gene loss, gene duplication and gene invention shaped these phyletic patterns (and thus our genome)
[Figure: genome tree with bootstrap values and a 0.1 scale bar. Species: C. pneumoniae, C. trachomatis, M. tuberculosis, M. pneumoniae, M. genitalium, B. subtilis, T. pallidum, T. maritima, B. burgdorferi, P. horikoshii, M. thermoautotrophicum, A. fulgidus, M. jannaschii, S. cerevisiae, C. elegans, A. aeolicus, E. coli, H. influenzae, R. prowazekii, H. pylori 26695, Synechocystis sp., H. pylori J99, A. pernix. Labelled clades: Proteobacteria, Eukarya, Euryarchaeota, Low G+C Gram-Positive Bacteria]
Snel Bork Huynen Nature Genet 1999 Huynen Snel Bork Science 1999
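To make the "bag of genes" view concrete, here is a minimal Python sketch of a phyletic pattern and of a pairwise gene-content distance. The genome names, orthologous group IDs and the choice to normalise by the smaller genome are illustrative assumptions, not necessarily the procedure used in the papers cited above.

```python
from itertools import combinations

# Hypothetical toy data: each genome is a bag (set) of orthologous group IDs.
genomes = {
    "genome_A": {"og1", "og2", "og3", "og4"},
    "genome_B": {"og1", "og2", "og5"},
    "genome_C": {"og1", "og6"},
}


def phyletic_pattern(group, genomes):
    """Presence/absence of one orthologous group across all genomes."""
    return {name: group in genes for name, genes in genomes.items()}


def gene_content_distance(a, b):
    """1 - (shared genes / size of the smaller genome).

    Dividing by the smaller genome is an assumed normalisation; it keeps a
    small genome that is a subset of a large one at distance 0.
    """
    if not a or not b:
        raise ValueError("both genomes must contain at least one gene")
    return 1.0 - len(a & b) / min(len(a), len(b))


# Pairwise distances, keyed by alphabetically ordered genome pairs.
distances = {
    (x, y): gene_content_distance(genomes[x], genomes[y])
    for x, y in combinations(sorted(genomes), 2)
}

pattern_og2 = phyletic_pattern("og2", genomes)
```

A distance matrix like `distances` could then be passed to any distance-based tree method; gene loss and gene invention both change the shared fraction, which is why such a tree reflects gene content rather than sequence divergence.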
Fritz-Laylin et al. Cell 2010
… however
Fokkens and Snel PLoS Comp Biol 2009
Mediator
• Essential for transcription
• Associated with general transcription machinery
• Bridge
• 25 subunits
• Four submodules
[Figure: eukaryotic species tree (root marked), shown twice. Species: S.po, Y.lip, S.cer, K.lac, C.alb, D.han, A.fum, A.nid, G.zea, N.cra, U.may, C.neo, P.chr, L.bic, R.ory, E.cun, D.mel, H.sap, M.mus, C.ele, E.hist, D.dis, C.mer, A.thal, O.tau, C.rei, T.the, Cryp, Thei, P.falc, Phyt, T.pse, P.tri, Giard, N.gru, L.maj, Tryp. Labelled groups: Animals, Amoebozoa, Fungi (Ascomycota, Basidiomycota), Excavata, Chromalveolates*, Archaeplastids ?]
[Figure: presence/absence matrix of Mediator subunits (rows, grouped by submodule: Head, Middle, Tail, CDK, unknown) across these species (columns, grouped by division). Subunits: Med15, Med16, Med14, Med3, Med2; Med10, Med1, Med4, Med7, Med9, Med5, Med31, Med21; Med11, Med6, Med20, Med18, Med17, Med22, Med8, Med19; Cdk8, CycC, Med13, Med12; Med23, Med24, Med25, Med26, Med27, Med28, Med29, Med30]
many partial losses, few gains
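One way to arrive at a statement such as "many losses, few gains" from a presence/absence pattern is Dollo parsimony: assume a single gain at the last common ancestor of all species that carry the gene, then count the independent losses below it. The sketch below is only an illustration of that idea; the tree, the species names and the nested-tuple representation are hypothetical, and it is not claimed to be the method used for the Mediator analysis.

```python
def leaves(node):
    """Set of leaf names under a node; a leaf is a string, a clade a tuple."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def dollo_losses(tree, present):
    """Number of losses implied by Dollo parsimony (exactly one gain)."""
    if not present:
        return 0  # gene never observed: no gain, so no losses to count
    # The gain sits at the smallest clade that contains all carriers.
    node = tree
    while not isinstance(node, str):
        with_gene = [c for c in node if leaves(c) & present]
        if len(with_gene) != 1:
            break
        node = with_gene[0]

    # Below the gain, every maximal clade without carriers is one loss.
    def count(n):
        if not (leaves(n) & present):
            return 1
        if isinstance(n, str):
            return 0
        return sum(count(child) for child in n)

    return count(node)


# Hypothetical five-species tree and presence sets for two subunits.
toy_tree = ((("sp_A", "sp_B"), "sp_C"), ("sp_D", "sp_E"))
losses_patchy = dollo_losses(toy_tree, {"sp_A", "sp_C", "sp_D"})
losses_recent = dollo_losses(toy_tree, {"sp_A", "sp_B"})
```

Dollo parsimony forbids regaining a lost gene, so it is biased towards inferring losses; undetected homologs are therefore counted as losses too.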
Presence of genes across genomes: orthology
bags of genes
Why does the genome contain the genes that it does?
Bag of genes, what about the homologs within a genome? (paralogs, duplicates)
[Figure: Anaphase Promoting Complex/Cyclosome (APC/C), initiating anaphase; labels: Ub, Cdk1, Cyclin B, Securin, Mitotic Checkpoint]
[Figure: The Mitotic Checkpoint (marked X); labels: Cdk1, Cyclin B, Securin]
[Figure: The Mitotic Checkpoint Complex (MCC); labels: BubR1, Mad2, Cdc20, Mps1, Kinetochore, AurB, Bub1]
[Figure: domain architectures of ScBub1 / HsBub1 (fungi & vertebrates), hsBubR1 (vertebrates) and scMad3p (fungi); domain labels: TPR, KEN, BUB3-binding, CDI, Kinase]
9 independent duplications. 7 cases where a mad3-like and a bub1-like protein arose out of a bubmad-like ancestor. Parallel or convergent evolution?
What about the kinase domain in human BubR1?
Degeneration of motifs essential for catalysis
• The preserved ‘catalytic’ residues are essential for BUBR1 conformational stability in vitro and in cells. Uncoupling potential enzymatic activity from structural stability shows that catalysis is dispensable for BUBR1 function in mitosis.
• This suggests that BUBR1 is an atypical pseudokinase.
What about the kinase domain in human (and fly)
What about all these BubMad containing species?
What about the ancestor?
? Subfunctionalization? (cf. TOR but with convergent evolution of domain architecture)
What about all these BubMad-containing species? What about the ancestor? Experimentally testing subfunctionalization
An hsBubMad protein assembled from BubR1 1-714 and Bub1 734-1085 is able to functionally replace both Bub1 and BubR1 in human cells!
This course
• I want to study the evolution of genomes, pathways and networks, and that is why I study gene/protein evolution
• At the end, the aim is to be more or less able to replicate e.g. the bubmad and Mediator analyses
• Understanding that many bioinformatic challenges are a mix of conceptual and technical problems (e.g. why orthology is such an incredibly persistent problem)
• “What you should ~know” in order to do this kind of research
• Topics are interrelated
– e.g. orthology already comes up in the homology lecture, but the proper explanation follows a day later
– e.g. trees can be used to time a duplication to eukaryogenesis, but the proper discussion of eukaryogenesis has its own lecture
Homology (& domains)
• Absolute basis of any comparative analysis; affects MSA and trees; detection is still being improved
Gene Phylogeny & Orthology
• How do we get such trees and how do we interpret them
• Trees reveal some of the most important genome evolution processes: LGT, duplication, loss,
(Eukaryotic) tree of life
• New genomes, cool/exotic animals / protists
• Berend / Eelco / John / Like / Leny sitting behind his/her computer and thinking: should I include this genome? How should I interpret an absence? Which species could I best use as a source for homology-based gene prediction?
• Crucial when interpreting gene trees:
– Knowing it by heart >>> having to look it up
• With regard to evolutionary signaling cell biology (kinases, small GTPases etc.), the diversity in present-day genomes is staggering and dwarfs e.g. the human-fruit fly difference
Large scale orthology
• Needed to move beyond anecdotes, but difficult to get
[Figure: gene tree; labelled groups: RapGAP (animals (LSE), fungi, dicty); RalGAPB (oomycetes, dicty, naegleria, fungi, animals); RalGAPA (dicty, naegleria, fungi, animals); RheBGAP (TSC2; oomycetes, diatoms, red algae, animals, fungi, dicty, tetrahymena). Labelled sequences: PHYSOJ14061 Phytophthora sojae 142624; PHYINF15173 Phytophthora infestans PITG 15173]
Eukaryogenesis / LECA
• Eukaryogenesis / LECA is a biological topic about which these types of analyses are telling us a lot. But it also impacts a lot of things we do: we see it back in gene trees, and it affects how we obtain orthologous groups across eukaryotes.
Gene content evolution
• Fundamental level of genome evolution
• Gene invention: an inability to detect homologs is not the same as a real lack of homologs, so it does not simply mean a novel gene
• Evolutionary modules?
• Trying to move to large scale, but remember the pitfalls
Whole Genome Duplication (WGD)
• Like LECA, WGD is important biology that needs bioinformatics to be studied, but which also impacts our data
• And it is a welcome source of information for our analyses (Lidija, bubmad): an independent and reliable reconstruction of part of the history of genes
Using HTP data to study evolution of networks / complexes
• Is the number of conserved interactions between e.g. yeast and human 10% or 95%???
• On top of all the genome analysis pitfalls also all the HTP data pitfalls …
• Duplicates vs orthologs
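The question of whether 10% or 95% of interactions are conserved depends heavily on what is counted. The sketch below is a hedged illustration with hypothetical gene names and toy data: an interaction is only scored if both partners have an ortholog, and it counts as conserved if any pair of orthologs interacts in the other species. These scoring rules are assumptions made for the example, not an established standard.

```python
def conserved_fraction(interactions_a, interactions_b, orthologs):
    """Return (fraction conserved, number of scorable interactions).

    interactions_a / interactions_b: iterables of (gene, gene) pairs.
    orthologs: dict mapping a species-A gene to a set of species-B genes.
    """
    seen_b = {frozenset(pair) for pair in interactions_b}
    scorable = conserved = 0
    for g1, g2 in interactions_a:
        o1 = orthologs.get(g1, set())
        o2 = orthologs.get(g2, set())
        if not o1 or not o2:
            continue  # a partner without ortholog cannot be scored
        scorable += 1
        # Duplicates: any ortholog pair that interacts counts as conserved.
        if any(frozenset((x, y)) in seen_b for x in o1 for y in o2):
            conserved += 1
    fraction = conserved / scorable if scorable else float("nan")
    return fraction, scorable


# Hypothetical toy data; y3 has two co-orthologs (a duplication in species B).
yeast_ppi = [("y1", "y2"), ("y2", "y3"), ("y3", "y4"), ("y4", "y5")]
human_ppi = [("h1", "h2"), ("h3b", "h2")]
orthologs = {"y1": {"h1"}, "y2": {"h2"}, "y3": {"h3a", "h3b"}, "y4": {"h4"}}

fraction, scorable = conserved_fraction(yeast_ppi, human_ppi, orthologs)
```

Changing the denominator (all interactions instead of only scorable ones) or the treatment of duplicates shifts the result considerably, and the incompleteness of HTP data lowers any such estimate further.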
Techniques AND biology
• Detective/forensics vs concepts; Large scale biology vs small-scale biology; Bioinformatics biology vs data/technique problems;
• A lot like police investigation … and less like Nobel prize winning physics ...
• Anything goes in genome evolution; many processes often entangled (i.e. google subneofunctionalization)
Literature discussion
• You should have read the papers before the day of the discussion
• On the day itself the group is split into critique / defense (not WGD)
• Groups prepare defense and critique
• Discussion (critique starts, because we have all read the paper):
– Critique gives an outline of why the paper is weak
– break
– Defense responds to critique
– break
– Critique gives final comments, to which an immediate response can be made
– Defense gives final comments, to which an immediate response can be made
Computer Exercises
• There are computer exercises for some topics; for many others this is more difficult (e.g. evolution of interaction networks based on HTP analysis).
• Previous years were too much cookbook. This is an attempt at less cookbook and more playing around → you should learn more, but it is slower
• Ask help from fellow students.
• Ties strongly into mini-projects
Mini projects
• A protein
• What do Swiss-Prot / SGD already know about your protein function-wise (GO) (pathway)?
• Protein topology (domains, disorder, TM?, motifs?)
• "Most diverged" sequence in the same genome ... that is still a homolog
• Homologs across the tree of life
• Point of invention of the family? If a bacterial invention: mito or archaeal route?
• Does it have WGD duplicates?
• Tree of relevant sequences in diverse genomes
• Orthologs in relevant genomes
• (Normally the relevant genomes would be a few metazoa, fungi, other opisthokonts, amoebozoa, stramenopiles, alveolates, plantae and excavates, see e.g. bubmad, but it depends on your protein)