The STRING database
Michael Kuhn
EMBL Heidelberg
protein interactions
example
Tryptophan synthase beta chain
E. coli K12
many sources
genomic context
curated knowledge
experimental evidence
literature
373 genomes
(only completely sequenced genomes)
1.5 million genes
(not proteins)
Genome Reviews
RefSeq
Ensembl
model organism databases
data integration
genomic context methods
gene fusion
gene neighborhood
phylogenetic profiles
Cell
Cellulosomes
Cellulose
automatic inference of interactions
correct interactions
wrong associations
gene fusion
score: sequence similarity
gene neighborhood
score: sum of intergenic distances
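As a rough illustration of the gene neighborhood score, the sketch below sums the gaps between consecutive genes lying between two genes of interest. The coordinates, the function name `sum_intergenic_distances`, and the choice to count overlapping genes as a gap of zero are assumptions made for this example, not details taken from the slides.

```python
def sum_intergenic_distances(genes, first, last):
    """Sum the gaps between consecutive genes from index `first` to `last`.

    `genes` is a list of (start, end) coordinates sorted by start position.
    Overlapping neighbours contribute a gap of zero (an assumption).
    """
    total = 0
    for left, right in zip(genes[first:last], genes[first + 1:last + 1]):
        total += max(0, right[0] - left[1])
    return total


# Hypothetical gene coordinates on one strand of a chromosome.
genes = [(100, 400), (420, 900), (950, 1500), (3000, 3600)]

close_run = sum_intergenic_distances(genes, 0, 2)  # tightly packed genes
loose_run = sum_intergenic_distances(genes, 0, 3)  # includes a large gap
```

In this toy setup a small sum suggests that the genes sit in one tightly packed run, while a large sum suggests that they are only loosely co-located.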
phylogenetic profiles
SVD
singular value decomposition (removes redundancy)
score: Euclidean distance
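The phylogenetic profile idea can be sketched in a few lines: build a presence/absence matrix, reduce it with SVD, and compare genes by Euclidean distance in the reduced space. This is a minimal sketch that assumes NumPy is available; the gene rows, the number of retained components `k`, and the function names are hypothetical and do not reflect STRING's actual parameters.

```python
import numpy as np


def reduced_profiles(profiles, k):
    """Project gene profiles (rows) onto the top-k singular components."""
    u, s, _vt = np.linalg.svd(profiles, full_matrices=False)
    return u[:, :k] * s[:k]


def profile_distance(reduced, i, j):
    """Euclidean distance between two genes in the reduced space."""
    return float(np.linalg.norm(reduced[i] - reduced[j]))


# Rows: hypothetical genes. Columns: hypothetical genomes (1 = gene present).
profiles = np.array([
    [1, 1, 0, 0, 1, 1],  # gene A
    [1, 1, 0, 0, 1, 1],  # gene B, same pattern as A
    [0, 0, 1, 1, 0, 0],  # gene C, complementary pattern
    [1, 0, 1, 0, 1, 0],  # gene D
], dtype=float)

reduced = reduced_profiles(profiles, k=2)
d_ab = profile_distance(reduced, 0, 1)  # identical profiles
d_ac = profile_distance(reduced, 0, 2)  # different profiles
```

Genes with the same presence/absence pattern end up at distance zero, and a smaller distance means a more similar profile. Like the other raw scores, the distance is not a probability.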
all scores are “raw scores”
not comparable
sequence similarity
sum of intergenic distances
Euclidean distance
benchmarking
calibrate against “gold standard” (KEGG)
raw scores
probabilistic scores
e.g. “70% chance for an association”
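One simple way to picture the calibration step is to bin pairs by raw score and record, for each bin, the fraction of pairs confirmed by the gold standard. The sketch below does exactly that; the equal-sized bins, the toy data, and the function names are assumptions for illustration, not the fitting procedure actually used.

```python
from bisect import bisect_right


def calibrate(scored_pairs, n_bins):
    """Turn (raw_score, in_gold_standard) pairs into per-bin probabilities."""
    ordered = sorted(scored_pairs)
    size = max(1, len(ordered) // n_bins)
    thresholds, probabilities = [], []
    for start in range(0, len(ordered), size):
        chunk = ordered[start:start + size]
        thresholds.append(chunk[0][0])  # lowest raw score in this bin
        hits = sum(1 for _, hit in chunk if hit)
        probabilities.append(hits / len(chunk))
    return thresholds, probabilities


def to_probability(raw_score, thresholds, probabilities):
    """Look up the calibrated probability for a new raw score."""
    index = max(0, bisect_right(thresholds, raw_score) - 1)
    return probabilities[index]


# Toy data: raw score and whether the pair is in the gold standard.
pairs = [(0.1, False), (0.2, False), (0.3, False), (0.4, True), (0.5, False),
         (0.6, True), (0.7, True), (0.8, False), (0.9, True), (1.0, True)]

thresholds, probabilities = calibrate(pairs, n_bins=2)
p_high = to_probability(0.75, thresholds, probabilities)
p_low = to_probability(0.30, thresholds, probabilities)
```

Because the mapping is learned from the benchmark rather than assumed, the same procedure works whether a high raw score is good (sequence similarity) or bad (a distance), and the resulting probabilities are comparable across methods.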
curated knowledge
KEGG
Kyoto Encyclopedia of Genes and Genomes
Reactome
GO
Gene Ontology
primary experimental data
many sources
many parsers
BIND
Biomolecular Interaction Network Database
GRID
General Repository for Interaction Datasets
HPRD
Human Protein Reference Database
co-expression
microarray data
GEO
Gene Expression Omnibus
correlation coefficient
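A co-expression raw score can be illustrated with the Pearson correlation coefficient between two expression profiles. The slides only say "correlation coefficient", so the choice of Pearson, the gene names, and the expression values below are assumptions for this sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical expression levels across four microarray experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises together with gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]   # falls as gene_a rises

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

A coefficient near 1 means the two genes rise and fall together across experiments. Like the other raw scores, it still has to be calibrated before it can be combined with other evidence.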
literature mining
different gene identifiers
synonyms list
Medline
SGD
Saccharomyces Genome Database
The Interactive Fly
OMIM
Online Mendelian Inheritance in Man
simple scheme
co-mentioning
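The co-mentioning scheme can be sketched as counting how many abstracts mention both genes, after mapping every name to a canonical identifier through the synonyms list. The synonym table and the second sentence below are made-up toy data, and `co_mentions` is a hypothetical helper, not part of any real pipeline.

```python
import re
from collections import Counter
from itertools import combinations

# Toy synonyms list: every known name maps to one canonical identifier.
# The alias entry is illustrative only.
synonyms = {"CYC1": "CYC1", "CYC7": "CYC7", "HAP1": "HAP1", "CYP1": "HAP1"}


def co_mentions(abstracts, synonyms):
    """Count, for each gene pair, the abstracts that mention both genes."""
    counts = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[A-Za-z0-9]+", text))
        genes = sorted({synonyms[t] for t in tokens if t in synonyms})
        counts.update(combinations(genes, 2))
    return counts


abstracts = [
    "The expression of the cytochrome genes CYC1 and CYC7 "
    "is controlled by HAP1",
    "We also measured CYC1 and CYP1.",  # made-up toy sentence
]

counts = co_mentions(abstracts, synonyms)
```

Each pair is counted at most once per abstract. The counts are again raw scores that need calibration against the gold standard.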
more advanced
NLP
Natural Language Processing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1
calibrate against gold standard
combine all evidence
Bayesian scoring scheme
e.g.: two scores of 0.7
combined probability: ?
e.g.: two scores of 0.7
combined probability: 0.91
1 - (1 - 0.7)^2 = 0.91
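The calculation on the slide generalizes to any number of evidence types: the combined score is one minus the probability that every source is wrong. The sketch below assumes the sources are independent; `combine_scores` is a name chosen for this example.

```python
def combine_scores(scores):
    """Combine probabilistic scores, assuming independent evidence."""
    all_wrong = 1.0
    for score in scores:
        all_wrong *= 1.0 - score  # chance that this source is wrong too
    return 1.0 - all_wrong


combined_two = combine_scores([0.7, 0.7])
combined_three = combine_scores([0.7, 0.7, 0.5])
```

With two scores of 0.7 this gives 0.91, as on the slide. Adding further evidence can only raise the combined score, never lower it.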
evidence transfer
evidence spread over many species
transfer by orthology
(or “fuzzy orthology”)
von Mering et al., Nucleic Acids Research, 2005
two modes
COG mode
higher coverage
lower specificity
includes all available evidence
some orthologous groups are too large to be meaningful
proteins mode
maximum specificity
lower coverage
information will be relevant for selected species
Demo
outlook
take home message
STRING integrates information and predicts interactions
You can always go to the sources
Proteins mode: specific species
COG mode: more coverage, especially for prokaryotic genes
Acknowledgements
The STRING team
Lars Jensen
Peer Bork
Christian von Mering & group in Zurich
Berend Snel
Martijn Huynen
Thank you for your attention
Exercises: tinyurl.com/36twzq
(or via course wiki)
Alternative server: xi.embl.de
Bork et al., Current Opinion in Structural Biology, 2004