RNA bioinformatics
Marcela Davila-Lopez
Department of Medical Biochemistry and Cell Biology
Institute of Biomedicine
Medical genomics and bioinformatics, 2009
RNA bioinformatics 2
[Figure: from DNA to protein. Labels: DNA, transcription, RNA, modification (5' cap, poly(A) tail) and export, mRNA, alternative splicing, translation, protein A, protein B]
Overview
• RNA and ncRNA: importance (disease related), structure types
• RNA regulatory elements: riboswitches, SECIS, IRE, miRNA
• How to predict ncRNA secondary structure: Mfold, mutual information
• How to identify ncRNA genes: pattern matching (PatScan), SCFG (cmsearch), phylogenetic analysis
General concepts
Types and roles of ncRNAs
• mRNA codes for proteins
• A non-coding RNA (ncRNA) is any RNA molecule that is not translated into a protein
• Genomic stability: telomerase
• RNA processing and modification: spliceosomal snRNA, U7 snRNA, RNase P, RNase MRP
• Transcription: 7SK RNA, 6S RNA
• Translation: tRNA, tmRNA, rRNA
• Protein trafficking: SRP RNA
Gisela Storz, Shoshy Altuvia and Karen M. Wassarman (2005)
Matera, A.G., R.M. Terns, and M.P. Terns, Nat Rev Mol Cell Biol, 2007.
ncRNA content
Are ncRNAs responsible for the complexity in different organisms?
Huttenhofer, A., P. Schattner, and N. Polacek, Trends Genet, 2005
Disease
Prasanth, K.V. and D.L. Spector, Genes Dev, 2007.
Costa, F.F. Drug Discov Today 2009
Pandey, A.K., P. Agarwal, K. Kaur, and M. Datta. Cell Physiol Biochem 2009
• miR: diabetes
• MRP RNA: cartilage-hair hypoplasia
Disease
Thiel, C.T., G. Mortier, I. Kaitila, A. Reis, and A. Rauch. Am J Hum Genet 2007
Cartilage-hair hypoplasia
MRP RNA: processing of pre-rRNA
Protein - Primary sequence
ClustalW
Sequence similarity → biological relation → same function
ncRNA - Primary sequence
No sequence conservation, but structural conservation
Covariation: Consistent and compensatory mutations that (often) conserve the structure
A single mutation can radically change the structure
Canonical pairs (A-U, G-C); non-canonical pairs: G-U wobble
http://prion.bchs.uh.edu/bp_type/bp_structure.html
Secondary structure
RNA functionality depends on structure
[Figure labels: external base, stem, loop, hairpin, internal loop, bulge, multibranched loop, pseudoknot]
Tertiary structure
RNA tertiary structure comprises interactions between secondary structure (SS) elements:
• two helices
• two unpaired regions
• one unpaired region and a double-stranded helix
Prediction of RNA 3D structure is very difficult and RNA bioinformatics is therefore dominated by the prediction and analysis of secondary structure.
Family structure
tRNA, telomerase RNA, P RNA
Each family typically adopts a characteristic secondary structure
However...
Dictyostelium discoideum
Candida albicans
Trypanosoma brucei
U1 snRNA
MRP RNA
Examples: RNA regulatory elements
• Riboswitches
• SECIS
• IRE
• miRNA
RNA regulatory elements
A cis-regulatory element or cis-element is a region of RNA that regulates the expression of genes located on that same strand.
Trans-regulatory elements are RNAs that may modify the expression of genes, distant from the gene that was originally transcribed to create them.
[Figure: mRNA with 5' m7G cap, CDS, AAUAA signal and poly(A) tail; a miRNA pairs with the mRNA]
Cis and trans regulatory elements
Dominski, Z. and W.F. Marzluff. Gene, 2007
[Figure: processing of histone pre-mRNA. Labels: DNA, histones, histone pre-mRNA, stem-loop motif of histone pre-mRNA, U7 snRNA, B, D3, E, F, G, Lsm10, Lsm11, SLBP, ZFP-100, Symplekin, CPSF-73, CPSF-100]
Riboswitch (2002)
Part of an mRNA molecule that can directly bind a small target molecule, affecting the gene's activity (auto-regulation)
• Typically found in the 5' UTR
• Control biosynthesis, catabolism and transport of various cellular metabolites (amino acids [K, G], cofactors, nucleotides and metal ions)
• Most known riboswitches occur in Bacteria
Tucker, B.J. and R.R. Curr Opin Struct Biol, 2005
Riboswitch examples
Serganov A, Patel DJ. Biochim Biophys Acta. 2009
Transcription Translation
Shine-Dalgarno
Riboswitch identification
Henkin TM. Genes Dev. 2008
Mandal M, et al. Cell. 2003
Comparative analysis of upstream regions of several genes:
• BLAST to find UTRs homologous to all UTRs in, for example, Bacillus subtilis
• Inspection for conserved RNA-like structure motifs
• Experimental confirmation
Guanine Riboswitch
Selenoproteins
• At least 25 selenoproteins
• Present in all lineages of life (bacteria, archaea and eukarya)
• Function of most selenoproteins is currently unknown
• Prevention of some forms of cancer (?), therapeutic targets (?)
• Selenium: antioxidant activity; chemopreventive, anti-inflammatory, and antiviral properties
• Moderate selenium deficiency has been linked to increased cancer and infection risk, male infertility, decrease in immune and thyroid function, and several neurologic conditions, including Alzheimer's and Parkinson's disease
• Selenium is not a cofactor: it is incorporated into the polypeptide chain as selenocysteine (Sec), the 21st amino acid
Papp, LV, et al. ANTIOXIDANTS & REDOX SIGNALING 2007
SECIS
Kryukov, G.V., et al., Science, 2003
• Overall low sequence similarity
• Secondary structures are highly conserved and contain consensus sequences that are indispensable for Sec incorporation
• Eukaryotic SECIS: non-canonical A-G base pairs, K-turn motif
IRE: Iron responsive element
Iron:
• Essential for oxygen transport, cellular respiration, and DNA synthesis
• [↓] Cellular growth arrest and death; anemia, retardation in children
• [↑] Generates hydroxyl or lipid radicals that damage lipid membranes, proteins, and nucleic acids; hemochromatosis, liver/heart failure
• Balance: iron-responsive element / iron regulatory protein regulatory system
IRE: long hairpin of 26–30 nt with a CAGUGN apical loop sequence, found in the 5' UTR or the 3' UTR
Muckenthaler MU, Galy B, Hentze MW. Annu Rev Nutr. 2008
Piccinelli P, Samuelsson T. RNA, 2007
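A first-pass screen for the CAGUGN apical loop can be sketched in a few lines of Python. This is only an illustration: the function name `find_ire_loops` and the toy sequence are assumptions, and a real IRE search would also have to check the flanking hairpin stem.

```python
import re

# CAGUGN apical loop consensus from the slide; N stands for any nucleotide.
IRE_LOOP = re.compile(r"CAGUG[ACGU]")

def find_ire_loops(rna):
    """Return 0-based start positions of CAGUGN motifs (hypothetical helper)."""
    rna = rna.upper().replace("T", "U")  # accept DNA or RNA alphabets
    return [m.start() for m in IRE_LOOP.finditer(rna)]

if __name__ == "__main__":
    toy = "GGGUUUCCUGCUUCAACAGUGCUUGGACGGAACCC"  # toy sequence, not a real UTR
    print(find_ire_loops(toy))
```

A motif hit alone gives many false positives, which is why the stem and the hairpin length are part of the definition above.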
IRE regulation
Muckenthaler MU, Galy B, Hentze MW. Annu Rev Nutr. 2008
Gene identification and secondary structure (SS) prediction
Protein vs RNA identification
Protein:
• Conserved primary sequence
• Promoters (Pol II)
• Identification: sequence-similarity based
RNA:
• Primary sequence not conserved
• Promoters (Pol II, Pol III)
• Identification: sequence-similarity based, secondary structure based, comparative genomics
Methods
• Nussinov algorithm
• Mfold (prediction of secondary structure)
• Analysis of mutual information
• Pattern matching
• SCFG (stochastic context-free grammar models)
• Phylogenetic analysis
Nussinov algorithm: find the structure with the most base pairs (dynamic programming)
Drawbacks:
• The structure found is not unique
• Testing all possible structures is numerically impossible
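The dynamic programming idea can be sketched as follows. This is a minimal version that only returns the maximum number of base pairs. The pairing set (Watson-Crick plus G-U) and the `min_loop` parameter are assumptions made for the illustration.

```python
# Allowed pairs: Watson-Crick plus the G-U wobble (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def nussinov(seq, min_loop=0):
    """Maximum number of nested base pairs in seq (Nussinov recursion)."""
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence i..j (inclusive)
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])   # i or j left unpaired
            if j - i > min_loop and (seq[i], seq[j]) in PAIRS:
                best = max(best, dp[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):                   # bifurcation
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

if __name__ == "__main__":
    print(nussinov("GGGAAACCC"))
```

The table is filled in O(n^3) time instead of enumerating every structure. Several different structures can reach the same maximum, which is the first drawback listed above.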
Methods: Mfold (prediction of secondary structure)
Zuker folding algorithm (1981): the correct structure is the one with the lowest equilibrium free energy (ΔG), which is the sum of individual contributions from loops, base pairs and other secondary structure elements
Every system seeks to achieve a minimum of free energy (MFE)
However... the MFE structure is not always the biologically relevant one
Methods: analysis of mutual information
Mutual information: a quantity that measures the mutual dependence of two variables (here, two alignment positions). The unit of measurement is the bit.
Covarying positions: consistent and compensatory mutations that conserve the structure
Mutual information - example
$f_{x_i}$ = frequency of one of the 4 bases in column i
$f_{x_i x_j}$ = frequency of one of the 16 base pairs in columns i and j
$M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}}$
$M_{ij} = 2$: maximum value, informative
$M_{ij} = 0$: conserved positions, not informative
Alignment (columns 1 to 4):
1 2 3 4
G G C C
G C C G
G A C U
G U C A
Columns 2-4: pairs GC, CG, AU, UA
fG=1/4 fC=1/4 fGC=1/4
fC=1/4 fG=1/4 fCG=1/4
fA=1/4 fU=1/4 fAU=1/4
fU=1/4 fA=1/4 fUA=1/4
Each pair contributes fGC*log2(fGC/(fG*fC)) = 1/4*log2(0.25/(0.25*0.25)) = 0.5
MI = 4 * 0.5 = 2
Columns 1-3: pair GC only
fG=4/4 fC=4/4 fGC=4/4
4/4*log2(1/(1*1)) = 0
MI = 0
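The worked example can be reproduced with a short Python sketch. The function name `mutual_information` and the 1-based column indexing are choices made for this illustration.

```python
from collections import Counter
from math import log2

def mutual_information(alignment, i, j):
    """Mutual information (bits) between alignment columns i and j (1-based)."""
    n = len(alignment)
    col_i = [seq[i - 1] for seq in alignment]
    col_j = [seq[j - 1] for seq in alignment]
    count_i = Counter(col_i)                 # single-column base counts
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))    # observed base-pair counts
    mi = 0.0
    for (a, b), c in count_ij.items():
        f_ab = c / n
        mi += f_ab * log2(f_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

# The four aligned sequences from the slide.
ALIGNMENT = ["GGCC", "GCCG", "GACU", "GUCA"]

if __name__ == "__main__":
    print(mutual_information(ALIGNMENT, 2, 4))  # covarying columns
    print(mutual_information(ALIGNMENT, 1, 3))  # fully conserved columns
```

Only observed pairs are summed, since pairs with zero frequency contribute nothing.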
Mutual information - exercise
Mutual information plot
Diagonals of covarying positions correspond to the four stems of the tRNA. Dashed lines indicate some of the additional tertiary contacts observed in the yeast tRNA-Phe crystal structure.
Methods: pattern matching
PatScan is a pattern matcher (deterministic motifs as well as secondary structure constraints) that searches protein or nucleotide sequence archives
Example pattern: p1=5...7 GGAA ~p1
Drawback: yes/no answer
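What a pattern such as `p1=5...7 GGAA ~p1` asks for is a stem of 5 to 7 nt, the loop GGAA, then the reverse complement of the stem. The Python sketch below imitates that search. It is not PatScan itself: `find_hairpins` is a hypothetical helper, and it allows only strict Watson-Crick pairing.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(rna):
    """Watson-Crick reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]

def find_hairpins(seq, loop="GGAA", min_stem=5, max_stem=7):
    """Return (start, end, stem) for every stem + loop + reverse-complement hit."""
    seq = seq.upper().replace("T", "U")
    hits = []
    for start in range(len(seq)):
        for stem_len in range(min_stem, max_stem + 1):
            loop_start = start + stem_len
            right_start = loop_start + len(loop)
            end = right_start + stem_len
            if end > len(seq):
                break  # longer stems cannot fit either
            stem = seq[start:loop_start]
            if (seq[loop_start:right_start] == loop
                    and seq[right_start:end] == reverse_complement(stem)):
                hits.append((start, end, stem))
    return hits

if __name__ == "__main__":
    toy = "AA" + "GCAUC" + "GGAA" + "GAUGC" + "AA"  # toy sequence
    print(find_hairpins(toy))
```

Like the pattern, the function either reports a hit or it does not. No score ranks weaker candidates, which is the drawback named above.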
PatScan - Example
r1={au,ua,gc,cg,gu,ug}
p1=6...7
GGG[1,0,0]
p2=8...9
4...4
r1~p2[1,0,1]
3...4
~p1
[1,0,0] = allowed [mismatches, deletions, insertions]
Methods: SCFG (stochastic context-free grammar models)
Regular grammar: primary sequence models
S → aT | bS
T → aS | bT | ε
S ⇒ aT ⇒ aaS ⇒ aabS ⇒ aabaT ⇒ aabaε = aaba
Model repeat regions (e.g. the FMR-1 triplet repeat region)
S → gW1
W1 → cW2
W2 → gW3
W3 → cW4
W4 → gW5
W5 → gW6
W6 → cW7 | aW4 | cW4
W7 → tW8
W8 → g
gcg cgg ctggcg cgg agg cgg ctggag agg ctggcg agg cgg ctggcg agg cgg cgg
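A regular grammar can be run as a finite automaton. The sketch below encodes the FMR-1 productions (S, W1 to W8) as a transition table and tracks the set of reachable nonterminals, because W6 has two productions that start with c. The names `GRAMMAR` and `accepts` are assumptions of this illustration.

```python
# nonterminal -> list of (emitted base, next nonterminal); None ends the derivation
GRAMMAR = {
    "S":  [("g", "W1")],
    "W1": [("c", "W2")],
    "W2": [("g", "W3")],
    "W3": [("c", "W4")],
    "W4": [("g", "W5")],
    "W5": [("g", "W6")],
    "W6": [("c", "W7"), ("a", "W4"), ("c", "W4")],
    "W7": [("t", "W8")],
    "W8": [("g", None)],
}

def accepts(seq, grammar=GRAMMAR, start="S"):
    """True if the grammar can derive seq (spaces between triplets are ignored)."""
    states = {start}
    for base in seq.lower().replace(" ", ""):
        # follow every production that emits this base from any current state
        states = {nxt
                  for st in states if st is not None
                  for sym, nxt in grammar[st] if sym == base}
        if not states:
            return False
    return None in states

if __name__ == "__main__":
    print(accepts("gcg cgg ctg"))
    print(accepts("gcg cgg agg cgg ctg"))
```

The language generated is gcg cgg, then any number of agg or cgg triplets, then ctg.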
Methods: SCFG (stochastic context-free grammar models)
Context-free grammar: primary sequence models, palindromes
S → aSa | bSb | aa | bb
S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aabaabaa
RNA secondary structure
Sequences: CAGGAAACUG, GCUGCAAAGC, GCUGCAACUG
S → aW1u | cW1g | gW1c | uW1a
W1 → aW2u | cW2g | gW2c | uW2a
W2 → aW3u | cW3g | gW3c | uW3a
W3 → ggaa | gcaa
[Figure: the three sequences drawn as hairpins. In the first two the stem positions pair (C.G, A.U, G.C and G.C, C.G, U.A); in the third they do not (GxG, CxU, UxC)]
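The stem-loop grammar emits both partners of a base pair in one production, so paired positions depend on each other. A recursive recognizer makes this explicit. This is a minimal sketch: `derives` is a hypothetical helper, and `LOOPS` copies the W3 productions exactly as written on the slide.

```python
# Pair productions shared by S, W1 and W2: x W y with x-y a Watson-Crick pair.
PAIRS = {("a", "u"), ("c", "g"), ("g", "c"), ("u", "a")}
LOOPS = {"ggaa", "gcaa"}  # W3 productions as written on the slide

def derives(seq, depth=3):
    """True if seq is `depth` nested base pairs wrapped around an allowed loop."""
    seq = seq.lower()
    if depth == 0:
        return seq in LOOPS
    if len(seq) < 2 or (seq[0], seq[-1]) not in PAIRS:
        return False
    # strip the outer pair and continue with the next nonterminal
    return derives(seq[1:-1], depth - 1)

if __name__ == "__main__":
    print(derives("GCUGCAAAGC"))  # stem pairs and loop GCAA
    print(derives("GCUGCAACUG"))  # outer positions G and G cannot pair
```

A regular grammar cannot express this constraint, because it emits symbols strictly left to right and cannot link the first base to the last.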
Methods: SCFG (stochastic context-free grammar models)
Stochastic regular grammar: weighted (probabilistic) primary sequence models
S → rW1 (0.45) | kW1 (0.45) | nW1 (0.10)
Hidden Markov models
[Figure: hidden Markov model with A, C, G, T emissions; labels ε and β]
Methods: SCFG (stochastic context-free grammar models)
Stochastic context-free grammar
Covariance models: probabilistic models that flexibly describe the secondary structure and primary sequence consensus of an RNA sequence family
Infernal Package
•Search for additional and family-related sequences in sequence databases
CM example
Build a model (automatically) from an existing sequence alignment
CM example
Database containing information about ncRNA families and other structured RNA elements.
Structural alignments
Phylogenetic distribution
Methods: phylogenetic analysis
EvoFold:
- Conserved elements alignment
- SCFG secondary structure
- Fold
- Phylogenetic evaluation
miRNA
miRNA
Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009.
[Figure: miRNA paired with its target mRNA (m7G cap, CDS, AAUAA, poly(A) tail)]
• Single-stranded RNA
• ~22 nucleotides
• Inhibit the translation of mRNAs into their protein products by binding to specific regions in the 3' UTR
• Account for ~1% of all transcripts in humans and potentially regulate 10%-30% of all genes
• Expressed ubiquitously and highly conserved in Metazoa (animal kingdom)
miRNA
Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009.
Biological processes: apoptosis, cell proliferation, cell differentiation, development, organism defense against infections, tissue morphogenesis, regulation of metabolism
Diseases: cancer, viral infections, neurodegenerative disorders, cardiac pathologies, muscle disorders, diabetes
miRNA
Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009.
He, L. and G.J. Hannon, Nat Rev Genet 2004
Multiple binding sites: lin-4 is partially complementary to 7 sites in the lin-14 3′ UTR
miRNA genes
Kim VN. Nat Rev Mol Cell Biol. 2005
Winter J et al. Nat Cell Biol. 2009
• Exonic miRNAs in non-coding transcripts
• Intronic miRNAs in non-coding transcripts
• Intronic miRNAs in protein-coding transcripts
• Single or clustered
miRNA biogenesis
Winter, J., S. Jung, S. Keller, R.I. Gregory, and S. Diederichs. Nat Cell Biol 2009.
Paul S. Meltzer, Nature, 2005
Canonical
Non-Canonical
miRNA structure
Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009.
Hairpin structure: miRNA, miRNA*, intervening loop
• High conservation: mature miRNA
• Lower conservation: loop
• Human genome: ~11 million hairpins
miRNA computational identification
• Homology search based: BLAST; miRAlign, ProMiR, microHARVESTER
• Gene finding: identification of conserved genomic regions, folding of the identified regions (Mfold, RNAfold), evaluation of hairpins; miRseeker, miRscan
• Neighbour stem loop (~42% of human miRNA genes are clustered together): check the surroundings of a known miRNA for candidate secondary structures
• Comparative genomics: BLAST intergenic sequences of two genomes against each other, then filter with rules inferred from known miRNAs; miRFinder
• Intragenomic matching (a functional miRNA should have at least one target): miRNAs show perfect complementarity to their targets (?); miRNAs and their targets are predicted simultaneously; miMatcher
miRNA experimental validation through sequencing
Experimental approach:
– Purify small RNAs (15-35 nt)
– Deep sequencing of the RNA library
– Map sequence traces to the genome
Ruby JG. et al. Genome Res., 2007
miRNA target prediction
Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009.
• Predicting miRNA targets in plants is easier, due to the perfect complementarity to the miRNAs
• In animals, perfect complementarity is not common
– miRNA seed complementarity (6 to 9 nt)
– High false-positive rate
• Common approach
– Experimental evidence
– Validated miRNA/target pairs
– TarBase, miRecords
• Computational methods:
– Base-pairing rules and sequence features of binding sites
– Conservation
– Thermodynamics
Base-pairing rules
Bartel, D.P. Cell 2009.
• Seed pairing of 6-9 nt, usually starting at position 2 (P2)
• P1 is typically unpaired or starts with U
• Often flanked by A
• Usually no G:U wobbles (vs regulation)
Site types:
• Canonical sites
• 5' dominant sites
• 3' compensatory sites: may compensate for insufficient base pairing in the seed
• Atypical sites
Example: lsy-6/cog-1 3' UTR
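The seed rule can be sketched as a plain string search. The sketch assumes the seed is miRNA positions 2 to 8 and that only Watson-Crick pairs count (no G:U wobbles). The function name `seed_sites` and both sequences in the demo are toy assumptions, not validated miRNA/target pairs.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def seed_sites(mirna, utr, start=2, length=7):
    """Return 0-based UTR positions that are complementary to the miRNA seed."""
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    seed = mirna[start - 1:start - 1 + length]   # positions 2..8 by default
    site = seed.translate(COMPLEMENT)[::-1]      # reverse complement of the seed
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)
    return hits

if __name__ == "__main__":
    toy_mirna = "UGAGGUAGUAGGUUGUAUAGUU"                        # toy 22-nt miRNA
    toy_utr = "AAAA" + "CUACCUC" + "GGGG" + "CUACCUC" + "A"    # toy 3' UTR
    print(seed_sites(toy_mirna, toy_utr))
```

Seed matches this short occur often by chance, so predictions based on them alone have a high false-positive rate and need the conservation and thermodynamic filters as well.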
More methods ...
Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009.
• Conservation: search for conserved seeds in the UTRs across different species
• Thermodynamics: evaluation of the ΔG of predicted duplexes, usually < -20 kcal/mol. This discards false positives, but favorable interactions do not always correspond to an actual duplex
• Structural accessibility: the target site on the mRNA should not be involved in any intramolecular base pairs; any existing secondary structure must first be removed
miRNA
Bartel, D.P. Cell 2009
miRNA gene expression in cancer
Negrini, M., M.S. Nicoloso, and G.A. Calin. Curr Opin Cell Biol 2009.
miRNA in cancer
Lu, J., et al., Nature, 2005
miRNAs as tumor suppressors
Carlo Croce 2009
MiR-29b inhibits leukemic growth in vivo.
[Figure, panels A to D. Recoverable labels: K562 cells injected SC; miR-29b or scrambled oligos injection (5 µg); time line in days 0, 3, 7, 10, 14; tumor volume (mm3) against days 0 to +14 for mock, scrambled and miR-29b (* P<0.003); tumor weight (grams) for scrambled and miR-29b (P<0.001)]
(A) Diagram illustrating the experimental design of the mouse xenograft experiment.
(B) Graph of the tumor volume determinations at the indicated days during the experiment for the three groups: mock (n=6), scrambled (n=12) and synthetic miR-29b (n=12).
(C) Tumor weight averages for the scrambled and synthetic miR-29b treated mouse groups at the end of the experiment (Day +14). P-values were obtained using a t-test. Bars represent ±S.D.
(D) Photographs of two mice injected with miR-29b (left flank) or scrambled (right flank).
miR DBs
• Published miRNAs
• Experimentally supported targets
• Prediction of miRNA targets
• miRNA-disease relationships reported in the literature