gggatttagctcagtt gggagagcgccagact gaa gat ttg gagsites.fas.harvard.edu/~bphys101/lecturenotes/03e_oct14_r...Expanded sequence dependence of thermodynamic parameters improves prediction
Post on 12-Mar-2020
1
DNA2: Aligning ancient diversity (Last week)
• Comparing types of alignments & algorithms
• Dynamic programming (DP)
• Multi-sequence alignment
• Space-time-accuracy tradeoffs
• Finding genes: motif profiles
• Hidden Markov Model (HMM) for CpG Islands
2
RNA1: Structure & Quantitation
• Integration with previous topics (HMM & DP for RNA structure)
• Goals of molecular quantitation (maximal fold-changes, clustering & classification of genes & conditions/cell types, causality)
• Genomics-grade measures of RNA and protein and how we choose and integrate (SAGE, oligo-arrays, gene-arrays)
• Sources of random and systematic errors (reproducibility of RNA source(s), biases in labeling, non-polyA RNAs, effects of array geometry, cross-talk)
• Interpretation issues (splicing, 5' & 3' ends, gene families, small RNAs, antisense, apparent absence of RNA)
• Time series data: causality, mRNA decay, time-warping
3
Discrete & continuous bell-curves
[Figure: probability (.00 to .10) vs. value (0 to 50) for the distributions listed below]
Normal (m=20, s=4.47)
Poisson (m=20, s^2=m)
Binomial (N=2020, p=.01, m=Np)
t-dist (m=20, s=4.47, dof=2)
ExtrVal(u=20, L=1/4.47)
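The first three curves can be compared numerically. The sketch below is a minimal illustration using only the Python standard library; the function names are hypothetical, and the parameters are the ones in the legend above (m=20, s=sqrt(20), N=2020, p=.01).

```python
import math

def normal_pdf(x, m, s):
    """Density of a normal distribution with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def poisson_pmf(k, m):
    """Poisson probability of k events with mean m (log form avoids overflow)."""
    return math.exp(k * math.log(m) - m - math.lgamma(k + 1))

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials (fine for modest k)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

m = 20.0
s = math.sqrt(m)  # a Poisson with mean 20 has s^2 = m, so s is about 4.47

for k in (10, 20, 30):
    print(k,
          round(normal_pdf(k, m, s), 4),
          round(poisson_pmf(k, m), 4),
          round(binomial_pmf(k, 2020, 0.01), 4))
```

With a mean this large, the discrete Poisson and binomial values should land close to the continuous normal curve near the peak.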
4
gggatttagctcagttgggagagcgccagactgaa gatttg gaggtcctgtgttcgatccacagaattcgcacca
Primary to tertiary structure
5
Non-Watson-Crick bps (ref)
[Figure label: -CH3]
6
Modified bases & bps in RNA (ref)
[Figure labels: positions 1 and 72]
7
Covariance
$$M_{ij} = \sum_{x_i x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}$$
M = 0 to 2 bits; x = base type; see Durbin et al. p. 266-8.
[Figure labels: D-stem, anticodon, TψC, 3' acc]
8
Mutual Information
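Alignment (i = 1, j = 6):
ACUUAU
CCUUAG
GCUUGC
UCUUGA
$$M_{1,6} = \sum_{x_1 x_6} f_{x_1 x_6} \log_2 \frac{f_{x_1 x_6}}{f_{x_1} f_{x_6}} = f_{AU} \log_2 \frac{f_{AU}}{f_A f_U} + \dots = 4 \times 0.25 \log_2 \frac{0.25}{0.25 \times 0.25} = 2$$
$$M_{1,2} = 4 \times 0.25 \log_2 \frac{0.25}{0.25 \times 1} = 0$$
$$M_{ij} = \sum_{x_i x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}$$
M = 0 to 2 bits; x = base type; see Durbin et al. p. 266-8.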
See Shannon entropy, multinomial Grendar
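The two worked values on this slide can be checked with a few lines of code. This is a minimal sketch; the function name `mutual_information` is hypothetical, and the alignment is the four-sequence example from the slide.

```python
import math
from collections import Counter

def mutual_information(seqs, i, j):
    """Mutual information in bits between alignment columns i and j (1-based)."""
    n = len(seqs)
    col_i = [s[i - 1] for s in seqs]
    col_j = [s[j - 1] for s in seqs]
    f_i = Counter(col_i)               # single-column base counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))  # joint counts of base pairs
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi

alignment = ["ACUUAU", "CCUUAG", "GCUUGC", "UCUUGA"]
print(mutual_information(alignment, 1, 6))  # covarying columns
print(mutual_information(alignment, 1, 2))  # column 2 is invariant
```

Pairs that never occur contribute nothing to the sum, which is why the loop only visits observed (a, b) combinations.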
9
RNA secondary structure prediction
Mathews DH, Sabina J, Zuker M, Turner DH. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 1999 May 21;288(5):911-40
Each set of 750 generated structures contains one structure that, on average, has 86% of the known base pairs.
10
Stacked bp & ss
11
Initial 1981 O(N^2) DP methods: Circular Representation of RNA Structure
Did not handle pseudoknots
[Figure: circular representation of an RNA structure with the 5' and 3' ends marked]
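To make the DP idea concrete, here is a minimal sketch of nested (pseudoknot-free) base-pair maximization in the style of Nussinov. It is a swapped-in simplification, not the energy-minimization method of the 1981 papers: it only counts pairs, the function name and the `min_loop` parameter are hypothetical, and as written it uses O(N^2) memory and O(N^3) time.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    min_loop is the minimum number of unpaired bases inside a hairpin.
    Pairs cannot cross, so pseudoknots are never considered.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = most pairs achievable within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAAUCC"))
```

Because case 2 splits the interval into an independent left part and an enclosed inner part, every solution is a set of nested arcs in the circular representation, which is exactly why crossing (pseudoknotted) pairs are out of reach for this recursion.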
12
RNA pseudoknots, important biologically, but challenging for structure searches
13
Dynamic programming finally handles RNA pseudoknots too.
Rivas E, Eddy SR J Mol Biol 1999 Feb 5;285(5):2053-68 A dynamic programming algorithm for RNA structure prediction including pseudoknots. (ref)
Worst case complexity of O(N^6) in time and O(N^4) in memory space.
Bioinformatics 2000 Apr;16(4):334-40 (ref)
14
CpG island (+) in an ocean of (-)
First-order Markov model
MM = 16, HMM = 64 transition probabilities (adjacent bp)
States: A+, C+, G+, T+ (island) and A-, C-, G-, T- (ocean); the +/- label is hidden
[Figure labels on transitions: P(G+|C+) >, P(A+|A+), P(C-|A+) >]
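The 16 and 64 counts follow from enumerating ordered state pairs. The sketch below does that enumeration and also estimates first-order transition probabilities from a sequence; the function name `transition_probs` and the toy sequence are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

# Plain Markov model: 4 states, so 4 x 4 ordered transitions.
mm_transitions = list(product(BASES, repeat=2))

# HMM: each base exists in an island (+) and an ocean (-) version: 8 states.
hmm_states = [base + sign for sign in "+-" for base in BASES]
hmm_transitions = list(product(hmm_states, repeat=2))

print(len(mm_transitions), len(hmm_transitions))

def transition_probs(seq):
    """Estimate P(next base | current base); keys are (current, next)."""
    counts = Counter(zip(seq, seq[1:]))
    totals = Counter(seq[:-1])
    return {(a, b): counts[(a, b)] / totals[a]
            for a, b in product(BASES, repeat=2) if totals[a]}

toy = "ACGCGCGTACGCG"  # hypothetical CpG-rich training sequence
probs = transition_probs(toy)
print(probs[("C", "G")])
```

Training one table on island sequences and another on ocean sequences would give the two 4 x 4 blocks of the model; the remaining HMM transitions are the switches between + and - states.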
15
Small nucleolar (sno)RNA structure & function
Lowe et al. Science (ref)
16
SnoRNA Search
17
Performance of RNA-fold matching algorithms
Algorithm         CPU bp/sec   True pos.   False pos.
TRNASCAN '91      400          95.1%       0.4x10^-6
TRNASCAN-SE '97   30,000       99.5%       <7x10^-11
SnoRNAs '99       -            >93%        <10^-7
(See p. 258, 297 of Durbin et al.; Lowe et al 1999)
18
Putative Sno RNA gene disruption effects on rRNA modification
Primer extension pauses at 2'O-Me positions forming bands at low dNTP.
Lowe et al. Science 1999 283:1168-71 (ref)
20
RNA (array) & Protein/metabolite (MS) quantitation
RNA measures are closer to genomic regulatory motifs & transcriptional control
Protein/metabolite measures are closer to Flux & growth phenotypes.
21
Integrating 8 regulon measures
Coregulated sets of genes (center of figure), supported by:
• Protein fusions (A-B vs. AB)
• Microarray data
• Phylogenetic profiles (figure: proteins P1 to P7 across EC, SC, BS, HI; patterns shown: 1101101, 0110111, 1010110)
• Metabolic pathways (e.g. TCA cycle)
• Conserved operons (purM purN purH purD in B. subtilis; purH purD and purM purN in E. coli)
• Known regulons in other organisms
• In vivo crosslinking & selection (1-hybrid)
• In vitro array binding or selection
22
Check regulons from conserved operons (chromosomal proximity)
B. subtilis: purE purK purB purC purL purF purM purN purH purD
C. acetobutylicum: purE purC purF purM purN purH purD
In E. coli, each color above is a separate but coregulated operon:
• purE purK
• purB purC purL purF
• purM purN
• purH purD
E. coli PurR motif
Predicting regulons and their cis-regulatory motifs by comparative genomics. McGuire & Church (2000) Nucleic Acids Research 28:4523-30.
23
Predicting the PurR regulon by piecing together smaller operons
Organisms: M. jannaschii, P. furiosus, C. jejuni, P. horikoshii, M. tuberculosis, E. coli
Gene clusters shown in the figure:
• purE purK purM purN purH purD
• purM purF purH purN
• purF purC
• purQ purC purL purH
• purM purF purC purY
• purQ purL purY
The above predicts regulon connections among these genes: FC, MN, HD, EKL, QY
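Piecing operons together amounts to finding connected components: two genes are linked if they share an operon in any genome. The sketch below uses a small union-find for this. It assumes each gene list on the slide is one operon (the figure layout is lost, so this grouping is an assumption), and the function name is hypothetical.

```python
def connected_gene_sets(operons):
    """Group genes that are linked, directly or indirectly, by shared operons."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for operon in operons:
        root = find(operon[0])
        for gene in operon[1:]:
            parent[find(gene)] = root  # merge this gene's set into the operon's set

    groups = {}
    for gene in parent:
        groups.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(group) for group in groups.values())

# Gene lists as they appear on the slide (assumed to be one operon each).
slide_operons = [
    "purE purK purM purN purH purD".split(),
    "purM purF purH purN".split(),
    "purF purC".split(),
    "purQ purC purL purH".split(),
    "purM purF purC purY".split(),
    "purQ purL purY".split(),
]
for group in connected_gene_sets(slide_operons):
    print(group)
```

Under that assumption every pur gene on the slide ends up in a single connected set, which is the sense in which small operons from several genomes predict one regulon.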
24
(Whole genome) RNA quantitation objectives
• RNAs showing maximum change; minimum change detectable/meaningful
• RNA absolute levels (compare protein levels); minimum amount detectable/meaningful
• Network: direct causality, motifs
• Classify (e.g. stress, drug effects, cancers)
25
(Sub)cellular inhomogeneity
(see figure)
Dissected tissues have mixed cell types.
Cell-cycle differences in expression.
XIST RNA localized on inactive X-chromosome
26
Fluorescent in situ hybridization (FISH)
• Time resolution: 1 msec
• Sensitivity: 1 molecule
• Multiplicity: >24
• Space: 10 nm (3-dimensional, in vivo)
10 nm accuracy with far-field optics: energy-transfer fluorescent beads, nanocrystal quantum dots, closed-loop piezo-scanner (ref)
28
Steady-state population-average RNA quantitation methodology
Microarrays (1): ~1000 bp hybridization of experiment and control to each ORF
• R/G ratios
• R, G values
• quality indicators
Affymetrix (2): 25-bp hybridization, PM and MM probes for each ORF
• Averaged PM-MM
• “presence”
SAGE (3): sequence counting of SAGE tag concatamers
• Counts of SAGE 14 to 22-mers sequence tags for each ORF
MPSS (4)
References:
1 DeRisi, et al., Science 278:680-686 (1997)
2 Lockhart, et al., Nat Biotech 14:1675-1680 (1996)
3 Velculescu, et al., Serial Analysis of Gene Expression, Science 270:484-487 (1995)
4 Brenner et al,
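The per-gene readouts named on this slide reduce to simple arithmetic. The sketch below shows an averaged PM-MM value for one probe set and a log2 R/G ratio for one spot. The function names and all intensity values are hypothetical, and real pipelines add background correction, outlier handling and normalization that are not shown here.

```python
import math

def average_difference(pm, mm):
    """Mean of (perfect match - mismatch) over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and the same length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

def log2_ratio(red, green):
    """log2 of the experiment/control (R/G) intensity ratio for one spot."""
    return math.log2(red / green)

# Hypothetical intensities for one gene with four probe pairs.
pm = [1200.0, 950.0, 1430.0, 1100.0]
mm = [300.0, 280.0, 410.0, 350.0]
print(average_difference(pm, mm))

# Hypothetical two-color spot: a 2-fold increase gives a log2 ratio of 1.
print(log2_ratio(2000.0, 1000.0))
```

Working in log2 makes up- and down-regulation symmetric: a 2-fold decrease gives -1.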
29
Biotinylated RNA from experiment
GeneChip expression analysis probe array
Image of hybridized probe array
Each probe cell contains millions of copies of a specific oligonucleotide probe
Streptavidin-phycoerythrin conjugate
30
Yeast RNA 25-mer array
Wodicka, Lockhart, et al. (1997) Nature Biotech 15:1359-67
Most RNAs < 1 molecule per cell. (ref)
Reproducibility confidence intervals to find significant deviations.
31
(Public) RNA array analyses (1, 2)
Image analysis: dCHIP Wong/Affy, ArrayTools NCI-BRB, ArrayViewer TIGR, F/P-SCAN NIH, ScanAlyze Eisen, GAPS HMS
Statistics: AFM, AMiADA, Churchill, R-SMA, SAM, VERA-SAM ISB, R-Bioconductor
Database & management: GeneX, MAPS, AMAD
Cluster & Visualize: CLUSFAVOR, Cluster-TreeView Eisen, GeneCluster WI, J-Express Bergen, PaGE, Plaid, SVDMAN LANL, TreeArrange Waterloo, XCluster Stanford
Free vs Open: OSI FSF ISCB
32
Statistical models for repeated array data (RNA vs. experiment repeats)
Li & Wong (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol 2(8):0032
Kuo et al. (2002) Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18(3):405-12
Tusher, Tibshirani and Chu (2001) Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98(9):5116-21.
Selinger, et al. (2000) RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nature Biotech. 18, 1262-7.
33
“Significant” distributions
t-test: t = (Mean / SD) * sqrt(N). Degrees of freedom = N-1.
H0: The mean value of the difference = 0. If the difference distribution is not normal, use the Wilcoxon Matched-Pairs Signed-Ranks Test.
[Figure: probability density (.00 to .10) vs. difference (-30 to 30) for the distributions listed below]
Normal (m=0, s=4.47)
t-dist (m=0, s=4.47, dof=2)
ExtrVal(u=0, L=1/4.47)
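The t statistic on this slide is straightforward to compute from paired differences. A minimal sketch with the standard library follows; the function name and the example differences are hypothetical, and turning t into a p-value would need a t-distribution table or a statistics package, which is not shown.

```python
import math
import statistics

def paired_t(differences):
    """Return (t, degrees of freedom) for H0: mean difference = 0."""
    n = len(differences)
    mean = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation (N-1 denominator)
    return mean / sd * math.sqrt(n), n - 1

# Hypothetical differences between repeated measurements of one gene.
t, dof = paired_t([1.0, 2.0, 3.0])
print(t, dof)
```

With only N = 3 repeats there are 2 degrees of freedom, which is the heavy-tailed t-dist (dof=2) curve in the legend: far larger t values are needed for significance than a normal curve would suggest.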
34
Independent Experiments
Microarray analysis of the transcriptional network controlled by the photoreceptor homeobox gene Crx. Livesey, et al. (2000) Current Biology
35
RNA quantitation
Is less than a 2-fold RNA-ratio ever important?
Yes; 1.5-fold in trisomies.
Why oligonucleotides rather than cDNAs?
Alternative splicing, 5' & 3' ends; gene families.
What about using a subset of the genome or ratios to a variety of control RNAs?
It makes trouble for later (meta) analyses.
36
37
(Whole genome) RNA quantitation methods
Method                               Advantages
Genes immobilized, labeled RNA       Chip manufacture
RNAs immobilized, labeled genes      -
Northern gel blot                    RNA sizes
QRT-PCR                              Sensitivity 1e-10
Reporter constructs                  No crosshybridization
Fluorescent In Situ Hybridization    Spatial relations
Tag counting (SAGE)                  Gene discovery
Differential display & subtraction   "Selective" discovery
38
Microarray to Northern
39
Genomic oligonucleotide microarrays
E. coli 25-mer array
295,936 oligonucleotides (including controls)
Intergenic regions: ~6 bp spacing
Genes: ~70 bp spacing
Not polyA (or 3' end) biased
[Figure labels: Non-coding sequences; Protein coding 25-mers; (12% of genome); tRNAs, rRNAs]
Strengths: Gene family paralogs, RNA fine structure (adjacent promoters), untranslated & antisense RNAs, DNA-protein interactions.
Affymetrix: Mei, Gentalen, Johansen, Lockhart (Novartis Inst)
HMS: Church, Bulyk, Cheung, Tavazoie, Petti, Selinger
40
Random & Systematic Errors in RNA quantitation
• Secondary structure
• Position on array (mixing, scattering)
• Amount of target per spot
• Cross-hybridization
• Unanticipated transcripts
41
Spatial Variation in Control Intensity
[Figure: two 3-D surface plots of control intensity vs. array position (X, Y), labeled Experiment 1 and Experiment 2; the intensity axes span 0 to 22000 and 0 to 3000]
Selinger et al
42
Detection of Antisense and Untranslated RNAs
b0671 - ORF of unknown function, tiled in the opposite orientation (panels: Expression Chip, Reverse Complement Chip)
“intergenic region 1725” - is actually a small untranslated RNA (csrB) (panels: Crick Strand, Watson Strand, same chip)
43
Mapping deviations from expected repeat ratios
Li & Wong
45
Independent oligos analysis of RNA structure
[Figure: intensity, (PM - MM) / Smax (-1 to 2), vs. bases from translation start (-300 to 400). Series: Log, Stationary, Genomic DNA. Annotations: Known Hairpin; Known Transcription Start (position -33); Translation Stop (237 bases)]
Selinger et al
46
Predicting RNA-RNA interactions
Human RNA-splice junctions sequence matrix
http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html
47
Experimental annotation of the human genome using microarray technology. Shoemaker, et al. (2001) Nature 409:922-7.
49
Time courses
•To discriminate primary vs secondary effects we need conditional gene knockouts.
•Conditional control via transcription/translation is slow (>60 sec up & much longer for down regulation)
•Chemical knockouts can be more specific than temperature (ts-mutants).
50
Beyond steady state: mRNA turnover rates (rifampicin time-course)
[Figure: fraction of initial (16S normalized), 0 to 1.4, vs. time (0 to 18 min) for cspE Chip, lpp Chip, cspE Northern, lpp Northern]
cspE half life: chip 2.4 min, Northern 2.9 min
lpp half life: chip >20 min, Northern >300 min
Chip metric = Smax
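A half-life can be estimated from such a time course by assuming first-order (exponential) decay after transcription is blocked and fitting a line to the log of the remaining fraction. The sketch below is one simple way to do that, not necessarily the fitting procedure used for the figure; the function name is hypothetical and the data points are synthetic, generated from an assumed 2.4 min half-life.

```python
import math

def half_life(times, fractions):
    """Least-squares fit of ln(fraction) = c - k*t; returns ln(2) / k."""
    logs = [math.log(f) for f in fractions]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
             / sum((t - mean_t) ** 2 for t in times))
    if slope >= 0:
        raise ValueError("no decay detected; half-life is undefined")
    return math.log(2) / -slope

# Synthetic, noise-free points (minutes, fraction of initial signal).
times = [0.0, 2.0, 4.0, 8.0]
fractions = [0.5 ** (t / 2.4) for t in times]
print(half_life(times, fractions))
```

For a very stable message such as lpp the fitted slope is close to zero within an 18 min window, so the estimate is only a lower bound, consistent with the ">20 min" reported for the chip.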
51
TimeWarp: pairs of expression series, discrete or interpolative
[Figure: panels a to f. Time points t0 to t6 of series a are aligned to u0 to u4 of series b; DP grids are indexed by series a (i, i+1) on one axis and series b (j, j+1) on the other]
Aach & Church
52
TimeWarp: cell-cycle experiments
53
TimeWarp: alignment example
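Aligning two expression series reuses the dynamic programming grid from sequence alignment, with time points in place of bases. The sketch below is generic discrete dynamic time warping, offered as an illustration of the idea rather than the specific scoring or interpolation scheme of Aach & Church; the function name and the example series are hypothetical.

```python
def time_warp_distance(a, b):
    """Minimum total |a_i - b_j| over monotone alignments of series a and b.

    Each point of one series may map to one or more consecutive points
    of the other, so stretches and compressions in time cost nothing extra.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best score aligning the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both series
                                 cost[i - 1][j],      # advance a only
                                 cost[i][j - 1])      # advance b only
    return cost[n][m]

series_a = [0.0, 1.0, 2.0, 1.0, 0.0]
series_b = [0.0, 1.0, 1.0, 2.0, 1.0, 0.0]  # same shape, one time point stretched
print(time_warp_distance(series_a, series_b))
```

The three moves in the `min` correspond to the diagonal, vertical and horizontal steps between grid cells (i, j) and (i+1, j+1) in the figure.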