From Genome Sequences to Regulatory Network Phenotypes
• Study the systematic operation of genes and their products in whole genome, whole cell contexts.
• Discover the effect of every gene on growth, expression, & interaction .
• Test quantitative network models.
(bioinformatic functional genomics:)
Growth, Expression, & Interaction
Harvard Center for
Small genome size: Mycoplasma, Haemophilus, Methanococcus
Energy relevance: Methanobacterium, Synechocystis
Major Pathogens: Mycobacterium, Escherichia, Helicobacter
Biotech Production: Escherichia, Saccharomyces, Homo
Recombinant protein production, in vivo combinatorial chemistry, BACs, gene delivery, etc.
15 going on 40 complete genomes. 30,000 going on 150,000 complete genes (& intergenic regions).
Smith, et al. (1997) J. Bacteriol. 179:7135-55. Methanobacterium
Blattner, et al. (1997) Science 277, 1453-74. Escherichia
Goffeau, et al. (1996) Science 274, 563-7. Saccharomyces
Varma & Palsson (1994) Appl. Env. Micro. 60:3724.
Karp et al. (1998) NAR 26:50. EcoCyc
Selkov, et al. (1997) NAR 25:37. WIT
Robison and Church http://arep.med.harvard.edu
[Diagram: experiment data model. Relationship labels between entities: has, exhibits, used in, described by, input to.]
Entities and their attributes:
Strain Phenotype Expt: Starting Cell Count, Starting Cell Density
Condition Set: Condition Set Number, Description, Comment
Experiment Measures Set: Expt Measures Set No, Time of Measurement, Expt Measures Set Type, Description, Comment, Raw Data Sets Descrip, Data Transform Descrip, Outcome Comment, Success Code, Date Recorded, Sample Size, Open Ind
Growth: Rel Growth Mutant, Std dev Rel Growth Mutant, Winner Mutant Ind, Rel Growth All, Std dev Rel Growth All, Winner All Ind
mRNA Expression: mRNA Expression Level, Std dev Express Level
Protein Expression: Cell Fraction, Protein State Exp Level, Std Dev Prot State Level
(entity label not shown): DNA Seq Bind Const Num, DNA Sequence, Binding Constant, Std Dev Binding Constant
Protein Preparation Set: Prot Prep Set Number, Description, Comment
Protein Protein Binding: Binding Level, Std Dev Binding Level
Protein Protein Binding Expt
Competition Phenotype Expt: Starting Cell Count, Starting Cell Density
Non Specific DNA Binding: Non Specific Binding Const, Std Dev Non Spec Bind Const
Submodel cross-references: * = main model, C = Condition Set Entities, D = DNA and Protein Elements, N = Names, P = Protein Preparation Entities, S = Strain and Strain Mix Entities
Cross-reference tags shown in the diagram: (P), (S), (C), (S,N)
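As a rough illustration of how part of this data model could be represented in code, here is a minimal sketch using Python dataclasses. The attribute names come from the diagram; the Python field names, the field types, and the choice of which entities to show are assumptions made for illustration, and the links between entities are left out because the diagram's connections are not recoverable here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExperimentMeasuresSet:
    # Field types are assumptions; the diagram lists only attribute names.
    expt_measures_set_no: int
    time_of_measurement: Optional[str] = None
    expt_measures_set_type: Optional[str] = None
    description: str = ""
    comment: str = ""
    sample_size: Optional[int] = None


@dataclass
class Growth:
    rel_growth_mutant: float
    std_dev_rel_growth_mutant: float
    winner_mutant_ind: bool  # "Ind" read here as a yes/no indicator (assumption)
    rel_growth_all: float
    std_dev_rel_growth_all: float
    winner_all_ind: bool


@dataclass
class MRNAExpression:
    mrna_expression_level: float
    std_dev_express_level: float


# Hypothetical record, values made up for illustration only.
example = Growth(0.98, 0.01, False, 1.02, 0.02, True)
print(example.rel_growth_mutant)
```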
Ratio of strains, $R$, over environments, $e$, times, $t_e$, and selection coefficients, $s_e$:
$R = R_0 \exp[-s_e t_e]$
80% of 34 random yeast insertions have s<-0.3% or s>0.3% (t=160 generations, e=1, rich media); ~50% for t=15, e=7.
Should allow comparisons with population allele models.
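The exponential relation above can be inverted to estimate a selection coefficient from two measurements of the strain ratio. The following is a minimal sketch under that single-environment reading of the formula; the function names and the example numbers are illustrative assumptions, not values from the experiments.

```python
import math


def strain_ratio(r0, s, t):
    """R = R0 * exp(-s * t): strain ratio after t generations."""
    return r0 * math.exp(-s * t)


def selection_coefficient(r0, r, t):
    """Invert R = R0 * exp(-s * t) to recover s from the start and end ratios."""
    return -math.log(r / r0) / t


# Hypothetical example: start at a 1:1 ratio, s = 0.3% per generation, 160 generations.
r = strain_ratio(1.0, 0.003, 160)
print(r)
print(selection_coefficient(1.0, r, 160))
```

With a coefficient as small as 0.3%, the ratio shifts appreciably only after many generations, which is why the longer competition resolves smaller effects.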
Other multiplex competitive growth experiments:
Thatcher, et al. (1998) PNAS 95:253.
Link AJ (1994) thesis; (1997) J Bacteriol 179:6228.
Smith V, et al. (1995) PNAS 92:6479.
Shoemaker D, et al. (1996) Nat Genet 14:450.
Multiplex DNA sequencing. Church GM, Kieffer-Higgins S. (1988) Science 240:185.
Physical mapping of complex genomes by cosmid multiplex analysis. Evans GA, Lewis KA. (1989) PNAS 86:5030.
Multiplexed biochemical assays with biological chips. Fodor SP, et al. (1993) Nature 364:555.
Lashkari DA, et al. (1995) An automated multiplex oligonucleotide synthesizer. PNAS 92(17):7912.
Multiplex: Tag(Mix) > Process > Decode
Internal standards, identical conditions, microscale
Genome Engineering
Challenges: Construct any mutant in any background, multiple mutants, minimizing hitchhiking mutants.
Avoid undesired residual activities and neomorphic effects on adjacent genes in most deletion, insertion, nonsense, or antisense alleles.
Full in-frame replacements, computationally track gene overlaps, primer & genomic repeats.
Link, et al. (1997) J. Bacteriol. 179: 6228-6237. (pKO3)
http://arep.med.harvard.edu
Crossover PCR in-frame deletions / tag substitutions
[Figure: crossover PCR scheme. Labels: ATG, TAA, c-tag, tag, primer with NotI site, primer with Bam site.]
Similarity searching for environments, growth, expression, & interaction data, and then the challenges of DNA sequence motifs: short motifs & limited alphabet (4)
[Figure: correlation plot. Row and column labels: Yggn, pspAo85, YiaK, carAB, f214, hrsAf105, ppiA, o184, mtlA 5’, mtlA 3’, rspA, YidX, kdgT. Regions marked A to F. Color key: positive correlation, negative correlation.]
Catabolite repression: glucose & Crp regulated
$\mathrm{CorFun} = Z_g Z_g^{T} / n$
n = # environments + genotypes, g = gene sites
(switching n & g gives CorEnv)
Log vs. stationary-phase regulated
growth, expression, &/or interaction
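The CorFun expression can be read as a matrix of Pearson correlations between gene sites, computed from z-scored measurements. Below is a minimal pure-Python sketch under that reading. It assumes each row of the input holds one gene site measured across the n environments and genotypes, and that z-scores use the population standard deviation; the example data are made up.

```python
import math


def z_scores(row):
    """Standardize one row to mean 0 and (population) standard deviation 1."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in row) / n)
    return [(x - mean) / sd for x in row]  # fails if a row is constant (sd == 0)


def cor_fun(data):
    """data[g][n]: rows are gene sites, columns are environments + genotypes."""
    z = [z_scores(row) for row in data]
    n = len(data[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def cor_env(data):
    """Switch n and g: correlate conditions across gene sites instead."""
    return cor_fun([list(col) for col in zip(*data)])


# Hypothetical measurements: 3 gene sites x 4 conditions.
data = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
for row in cor_fun(data):
    print([round(v, 3) for v in row])
```

In this toy input the first two rows are perfectly correlated and the third is perfectly anti-correlated with both, which corresponds to the positive and negative ends of the color key.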
Expression data from four cultures, allow three comparisons
glucose 30°C, Mating type a
galactose 30°C, Mating type a
glucose 30°C, Mating type α
glucose 30°C -> 39°C shock, Mating type a
Expression Quantitation Options
1) n-dimensional cDNA or protein displays
2) Computer selected oligomer-arrays: photolithographic or piezoelectric deposition
3) Gridded microarrays from clones
4) Counting 13-bp cDNA tags (SAGE) (20,000 tags means <800 RNAs have S/N>4)
Lockhart, et al. (1997) Nature Biotechnology 15:1359.
DeRisi, et al. (1997) Science 278:680.
Velculescu, et al. (1997) Cell 88:243.
Galactose Regulatory Network
[Figure: network diagram. Nodes: GALACTOSE, Gal3p, Gal1p, GAL3, GAL4, GAL80, GCY1 (marked "?"), Gal4p-Gal80p active complex, Gal4p-Gal80p inactive complex. Structural Genes For Galactose Metabolism: GAL1, MEL1, GAL7, PGM2, GAL2, GAL10.]
Fold Change in GAL3 in Galactose vs. Glucose (Median Fold Change is 3.1)
[Figure: bar chart, "GAL3: Fold Change in Expression between Growth in Galactose and Growth in Glucose". x-axis: Probe Number (1 to 19); y-axis: Fold Change (0 to 25).]
orfID/gene:chip   #probes  medFC   consFC  thrshld  missingMM?  expr ratio  log expr ratio  BINS log expr ratio  FREQ
YBR020w/GAL1:A    21       64.81   24.57   2                    64.81       1.81164202      -2                   0
YBR018c/GAL7:A    21       41.91   10.58   2                    41.91       1.62231766      -1.95                0
YBR019c/GAL10:A   20       37.8    13.03   2                    37.8        1.5774918       -1.9                 0
YDR345c/HXT3:A    20       -25.05  -13.58                       0.03992016  -1.39880773     -1.85                0
YMR256c/COX7:D    21       2.84    1.64                         2.84        0.45331834      -0.45                3
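The "expr ratio" and "log expr ratio" columns appear to follow from the signed median fold change: a negative fold change is converted to a ratio below 1, and the logarithm is base 10. The sketch below shows that conversion as inferred from the rows above; the function names are illustrative.

```python
import math


def expression_ratio(fold_change):
    """Signed fold change -> ratio: +x means x-fold up, -x means x-fold down."""
    return fold_change if fold_change > 0 else 1.0 / abs(fold_change)


def log_expression_ratio(fold_change):
    """Base-10 log of the expression ratio."""
    return math.log10(expression_ratio(fold_change))


# medFC values taken from the table rows above.
for gene, med_fc in [("GAL1", 64.81), ("HXT3", -25.05), ("COX7", 2.84)]:
    print(gene, expression_ratio(med_fc), log_expression_ratio(med_fc))
```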
Relative expression of all genes: Galactose vs. Glucose
[Figure: histogram. x-axis: Log of Fold Change (-2.0 to 2.0); y-axis: Number of Genes (log scale, 0.1 to 10000).]
To analyze the most induced genes, we...
• Extracted the intergenic DNA sequence upstream of each translation start using the Saccharomyces Genome Database.
• Used an algorithm for multiple sequence alignment to look for sequence motifs conserved among the most induced (or repressed) genes.
• Looked at the intersection of genes which both matched a conserved motif and were induced (or repressed).
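The last step above amounts to a set intersection between the genes whose expression changed and the genes matching a motif. Here is a minimal sketch. The gene names, log ratios, and motif matches are placeholders invented for illustration, and the 2-fold cutoff (log10 of 2) is one possible threshold, not necessarily the one used for every analysis.

```python
import math


def candidate_targets(log_fold_change, motif_genes, min_abs_log_fc):
    """Genes that both changed expression and match the motif."""
    changed = {
        gene for gene, lfc in log_fold_change.items() if abs(lfc) >= min_abs_log_fc
    }
    return changed & set(motif_genes)


# Placeholder data: log10 expression ratios and motif matches are made up.
log_fc = {"geneA": 1.8, "geneB": -1.4, "geneC": 0.1, "geneD": 0.45}
motif_matches = {"geneA", "geneC", "geneD"}

print(sorted(candidate_targets(log_fc, motif_matches, math.log10(2))))
```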
Gibbs Motif Sampling Strategy
1 Initialize the alignment by choosing a random subset of all possible sites as the ‘site’ alignment, and use all remaining sequences to give a ‘non-site’ alignment.
2 Select a potential site from among all possible sites.
3 If the site is in the alignment, take it out.
4 Calculate the relative likelihood that the potential site belongs with the site alignment rather than the ‘non-site’ alignment, based on a Bayesian multinomial distribution model.
5 Randomly choose whether or not to add the site, weighted by this relative likelihood.
6 Repeat from Step 2.
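The six steps can be sketched in a few dozen lines of Python. This is a simplified illustration, not the sampler used in the study: it assumes a fixed motif width, sequences containing only A, C, G, T, a uniform background in place of a fitted ‘non-site’ model, pseudocounts of 1 per base, a fixed prior odds factor, and no handling of overlapping sites. All names and parameter values are illustrative.

```python
import random

BASES = "ACGT"


def all_sites(seqs, width):
    """Every (sequence index, start position) window of the given width."""
    return [(i, p) for i, s in enumerate(seqs) for p in range(len(s) - width + 1)]


def column_counts(seqs, alignment, width, pseudo=1.0):
    """Per-column base counts of the aligned sites, with pseudocounts."""
    counts = [dict.fromkeys(BASES, pseudo) for _ in range(width)]
    for i, p in alignment:
        for j in range(width):
            counts[j][seqs[i][p + j]] += 1.0
    return counts


def likelihood_ratio(word, counts, background):
    """P(word | site model) / P(word | background)."""
    ratio = 1.0
    for j, base in enumerate(word):
        total = sum(counts[j].values())
        ratio *= (counts[j][base] / total) / background[base]
    return ratio


def gibbs_sample(seqs, width, n_init, n_iter, prior_odds=0.01, seed=0):
    rng = random.Random(seed)
    sites = all_sites(seqs, width)
    background = {b: 0.25 for b in BASES}       # assumption: uniform background
    alignment = set(rng.sample(sites, n_init))  # step 1
    for _ in range(n_iter):                     # step 6: repeat
        site = rng.choice(sites)                # step 2
        alignment.discard(site)                 # step 3
        i, p = site
        counts = column_counts(seqs, alignment, width)
        odds = prior_odds * likelihood_ratio(seqs[i][p:p + width], counts, background)  # step 4
        if rng.random() < odds / (1.0 + odds):  # step 5
            alignment.add(site)
    return alignment


# Made-up sequences for illustration.
seqs = ["ACGTACGTAC", "TTACGTTTGG", "GGGACGTCCC"]
print(sorted(gibbs_sample(seqs, width=4, n_init=3, n_iter=200)))
```

Step 5 converts the weighted likelihood ratio into an acceptance probability of odds / (1 + odds), so a site that fits the current alignment well is usually, but not always, added back.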
‘DNAGibbs’: A Modified Gibbs Motif Sampler Optimized for DNA searches.
• Either forward or reverse strand of a potential site -- but not both -- may be added to the alignment.
• Near-optimum sampling method was improved so that it is faster and tends to result in higher scoring alignments.
• Simultaneous multiple motif searching was replaced with a more efficient iterative masking approach.
• The model for base frequencies of non-site sequence was fixed using the average nucleotide frequencies of S. cerevisiae.
• Now runs on DEC Unix and Windows platforms, in addition to the formerly supported SGI and Sun Unix platforms.
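Two of the modifications listed above, considering either strand of a site and iterative masking, rely on simple string operations. The sketch below shows one plausible form of each. How DNAGibbs actually scores strands and masks found sites is not described here, so the function names and the choice of "N" as the mask character are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def mask_sites(seq, starts, width, mask_char="N"):
    """Replace each found site with mask characters before the next search."""
    chars = list(seq)
    for start in starts:
        end = min(start + width, len(chars))
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


print(reverse_complement("CGGA"))
print(mask_sites("ACGTACGT", [2], 3))
```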
Finally, exclude motifs with:
• DNAGibbs (maximum log a posteriori likelihood ratio) scores less than 5.
• Good matches (Z < 3 sd below the mean of the aligned positive motifs) with greater than 10% of all yeast genes (ORFs).
*O.G. Berg & P.H. von Hippel, J. Mol. Biol., 193: 723-750 (1987)
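The two exclusion criteria can be expressed as a simple filter. This sketch assumes the score and the fraction of ORFs with good matches have already been computed elsewhere; the function name and argument names are illustrative.

```python
def passes_filters(map_score, fraction_of_orfs_matched,
                   min_score=5.0, max_fraction=0.10):
    """Keep a motif unless its score is below 5 or it matches over 10% of ORFs."""
    return map_score >= min_score and fraction_of_orfs_matched <= max_fraction


# Hypothetical motifs: (score, fraction of ORFs with good matches).
for score, fraction in [(12.0, 0.02), (4.2, 0.02), (12.0, 0.25)]:
    print(score, fraction, passes_filters(score, fraction))
```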
Using the top 10 genes induced in galactose, DNAGibbs found UASG, the site recognized by Gal4p
[Figure: sequence logo of the motif found by DNAGibbs. y-axis: Information (Bits).]
Sequence logos were developed by T.D. Schneider & R.M. Stephens, Nucleic Acids Res., 18: 6097-6100 (1990).
CGYTCGGA-GA-AGT---CCGA Previous UASG consensus
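The height of each position in a sequence logo is its information content in bits: for DNA, 2 bits minus the Shannon entropy of the base frequencies at that position. The sketch below computes this from a set of aligned sites. It omits the small-sample correction used by Schneider & Stephens, and the example sites are made up.

```python
import math


def column_information(column, alphabet="ACGT"):
    """Information (bits) of one aligned column: log2(4) minus Shannon entropy."""
    n = len(column)
    bits = math.log2(len(alphabet))
    for base in alphabet:
        p = column.count(base) / n
        if p > 0:
            bits += p * math.log2(p)
    return bits


def logo_information(sites):
    """Information per position for equal-length aligned sites."""
    return [column_information("".join(col)) for col in zip(*sites)]


# Made-up aligned sites for illustration.
print(logo_information(["CGGA", "CGGT", "CGAA", "CGCT"]))
```

A fully conserved position carries 2 bits and a position with equal base frequencies carries 0, which is why conserved half-sites stand tall in a logo while spacer positions nearly vanish.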
Genes that changed between galactose and glucose by more than 2-fold and have strong matches to the UASG motif
RsaI Digestion of a Fixed Density Double-Stranded DNA Chip with a Variable Spacer Length of 0 to 14 bp Between the Half-Sites
Conclusion: Loss of Signal Intensity Corresponds to Cleavage of dsDNA by RsaI
Significance:
1) Double-Stranded DNA is Created by Primer Extension of ssDNA Chips
2) Double-Stranded DNA on the Surface of the Chip is Accessible for Interaction with a DNA-Binding Protein
[Figure: chip-bound double-stranded DNA (5' end marked) carrying RsaI recognition sites, GTAC paired with CA*TG, cut by RsaI.]
Interaction Quantitation Options
Over-expression:
Yeast two-hybrid screens (in vivo complexity)
In vitro chip assays
Natural levels, environmental regulation:
Subcellular fractionation (unstable)
In vivo footprinting (partners unknown)
In vivo crosslinking
Martin Steffen, Andy Link
Isolate in vivo crosslinked complexes
by nucleic acid: CsCl (or hybridization)
by protein: epitope tag
analyze protein by DNase, 2D gel, trypsin-LC-ESI-MS/MS
analyze DNA/RNA by chip
[Figure: 2D gel. Axes: pH, kdal.]
Link et al. (1997) Electrophoresis 18:1259 & 1314
Rich media log-phase, in vivo crosslink, DNaseI digest
[Figure: 2D gel. x-axis: pH (4 to 7); y-axis: kdal (10 to 100). Labeled spots: lacI, fur, grpE, dps, hns, efp, purE, dps, sspA, ihfB, ssb.]
In vivo crosslinking & footprinting summary
11% of the E. coli genome is non-coding.
About 340 / 4328 proteins are likely DNA-binding proteins (2 of the top 380 proteins).
24/25 footprinted GATC sites are non-coding. Odds = 10^-27.
2/3 crosslinked DNA molecules are likely regulatory binding sites. Odds = 0.04.
8/11 top DNA-crosslinked proteins are known DNA-binding proteins. Odds = 10^-16.
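The slide does not say how these odds were calculated. One way to get a feel for such numbers is a binomial tail probability: the chance of seeing at least k hits in n draws if each draw hits with the background rate p. The sketch below makes that assumption and applies it to the last two lines, using 11% non-coding DNA and the 2-in-380 rate quoted in parentheses as background rates. Under those assumptions it gives roughly 0.03 and 1e-16, the same order of magnitude as the quoted figures. It is not a reconstruction of the original calculation.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed background rates; see the hedges in the text above.
print(binomial_tail(2, 3, 0.11))       # 2 of 3 crosslinked DNAs in non-coding regions
print(binomial_tail(8, 11, 2 / 380))   # 8 of 11 top proteins are DNA-binding
```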
Thoughts on chips for crosslinked epitope selections (& generally).
An easy 10-fold enrichment, but with 40,000 fragments, means an expensive 1:4000 Signal:Noise if sequencing (or SAGE) were used.
However, spread over a chip, 1:10.
E. coli oligonucleotide chip challenges:
#1) Closely spaced transcripts, e.g. carAB: (Intergenic 25-mers overlap, start 6 bp apart on average)