Sequence analysis

• Analysis of primary, secondary, not tertiary ... structures
• Biological sequences. Central dogma.
• Similarities (orthologs, paralogs)
• Methods, algorithms (alignments, models)
• Databases (primary, secondary)

Sequences: DNA, RNA, protein ...

Genome: DNA
→ transcription →
Primary transcript: pre-mRNA, pre-ncRNA
→ processing (splicing*, cleavage) →
Processed transcript: mRNA, ncRNA (tRNA, rRNA ...)
→ translation, modification →
[a] Translated sequence: protein (amino acids). [b] Mature ncRNA
→ protein cleavage ... →
Mature protein.

[ESTs are nucleotide sequences; they might be unspliced, spliced ...]
* Spliceosomal splicing of pre-mRNA only occurs in Eukaryotes.

SEQUENCE ANALYSIS
Where and why?
• Sequencing projects, assembly of sequence data
• Identification of functional elements in sequences
• Sequence comparison
• Classification of proteins
• Comparative genomics
• RNA structure prediction
• Protein structure prediction
• Evolutionary history

Alignments and database searches (Summary)

Common biological problem: We have a novel protein sequence. What can we infer from this sequence about the biological function of the protein?

* Sequence homology - BLAST, FASTA, SSEARCH
  Simple example: an unknown human protein is highly similar to a protein with known function from another organism
  => The human protein probably has the same or a similar function (it's a homolog: ortholog or paralog)
* Pattern/profile search – PROSITE, Profile search - Pfam
** Secondary structure prediction
** Prediction of transmembrane domains (~25 % of all proteins are membrane bound!)

Comparing non-identical sequences
Protein sequence comparison - basic concepts

When two protein sequences are compared and the similarity is considered statistically significant, it is highly likely that the two proteins are evolutionarily related.

There are two kinds of biological relationships:
Orthologs: proteins that carry out the same function in different species
Paralogs: proteins that perform different but related functions within one organism

Proteins are homologous if they are related by divergence from a common ancestor.

Homology: orthologs & paralogs
Orthology describes genes in different species that derive from a common ancestor (= MouseA, ChickA, FrogA that come from the alpha-chain gene in a common ancestor).
Paralogy describes homologous genes within a single species that diverged by gene duplication (= MouseA and MouseB).
Source: bio.lundberg.gu.se/courses/vt06/seq_anal_1-4_phd_march06.pdf
Methods in sequence analysis

• Simple transformation/extraction
  a) Translation: RNA > protein
  b) Reverse translation: protein > RNA
  c) Splicing (removing introns in pre-mRNA, pre-rRNA ...)
• Analyzing for other properties
  a) statistical composition
  b) profile analysis (PSI-BLAST)
  c) HMMs (probabilities of aa in position, Pfam)
  d) higher order structure (secondary structure in RNA/protein)
Translation of sequences

• Different nucleotide sequences may translate into identical amino acid sequences.
• One nucleotide sequence may yield different amino acid sequences (6 reading frames).
• Reverse translation does not give a unique nucleotide sequence.
• Different splicing of pre-mRNA: 1 gene – several proteins!
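The six reading frames can be made concrete with a short sketch. It assumes the standard genetic code and an RNA string containing only A, C, G and U; the names `CODON_TABLE`, `translate` and `six_frames` are illustrative and not taken from any particular package.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order U, C, A, G.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def translate(rna):
    """Translate codon by codon; an incomplete trailing codon is ignored."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def six_frames(rna):
    """Return frames 1-3 (given strand) followed by frames 4-6 (reverse complement)."""
    reverse = rna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (rna, reverse) for offset in range(3)]


for number, protein in enumerate(six_frames("AUGUUGGGUUGA"), start=1):
    print("frame", number, protein)
```

Only frame 1 of this toy sequence starts with M and ends with a stop codon, which is the kind of evidence used to pick the likely coding frame.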
The (degenerate) Genetic code

UUU Phe F    UCU Ser S    UAU Tyr Y     UGU Cys C
UUC Phe F    UCC Ser S    UAC Tyr Y     UGC Cys C
UUA Leu L    UCA Ser S    UAA Stop *    UGA Stop *
UUG Leu L    UCG Ser S    UAG Stop *    UGG Trp W

CUU Leu L    CCU Pro P    CAU His H     CGU Arg R
CUC Leu L    CCC Pro P    CAC His H     CGC Arg R
CUA Leu L    CCA Pro P    CAA Gln Q     CGA Arg R
CUG Leu L    CCG Pro P    CAG Gln Q     CGG Arg R

AUU Ile I    ACU Thr T    AAU Asn N     AGU Ser S
AUC Ile I    ACC Thr T    AAC Asn N     AGC Ser S
AUA Ile I    ACA Thr T    AAA Lys K     AGA Arg R
AUG Met M    ACG Thr T    AAG Lys K     AGG Arg R

GUU Val V    GCU Ala A    GAU Asp D     GGU Gly G
GUC Val V    GCC Ala A    GAC Asp D     GGC Gly G
GUA Val V    GCA Ala A    GAA Glu E     GGA Gly G
GUG Val V    GCG Ala A    GAG Glu E     GGG Gly G
Translation:

AUGUUGGGUUGA = MLG*
||| | || | |
AUGCUAGGAUAA = MLG*

Reverse translation:

MLG* = AUG UUA GGU UAA    1
       AUG UUA GGU UAG    2
       AUG UUA GGU UGA    3
       ...
       AUG CUG GGG UGA   72
(1 x 6 x 4 x 3 = 72 possible sequences)
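The count of 72 can be checked with a small sketch, assuming the standard genetic code. The helper names (`CODONS_FOR`, `count_reverse_translations`, `reverse_translations`) are made up for this illustration.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in the order U, C, A, G.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Invert the table: amino acid (or '*') -> list of codons that encode it.
CODONS_FOR = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    CODONS_FOR[amino_acid].append(codon)


def translate(rna):
    """Translate codon by codon; an incomplete trailing codon is ignored."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def count_reverse_translations(protein):
    """Multiply the number of synonymous codons at each position."""
    total = 1
    for amino_acid in protein:
        total *= len(CODONS_FOR[amino_acid])
    return total


def reverse_translations(protein):
    """Enumerate every RNA sequence that translates to the given protein."""
    choices = (CODONS_FOR[amino_acid] for amino_acid in protein)
    return ["".join(codons) for codons in product(*choices)]


print(translate("AUGUUGGGUUGA"), translate("AUGCUAGGAUAA"))
print(count_reverse_translations("MLG*"))
```

Enumerating all candidates is only practical for very short peptides, since the count grows multiplicatively with length.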
3rd position is not so important!
In the genetic code table above, codons that differ only in the 3rd position often code for the same amino acid (e.g. GGU, GGC, GGA and GGG all code for Gly).
Example unknown RNA:

Frames 1-3
M F R L T L T K R L A R A S A H V T P S
C S V S R S P N G * P A L L H T S L R R
V P S H A H Q T A S P R F C T R H S V A
------------------------------------------------------------
Frames 4-6
H E T E R E G F P * G A S R C V D S R R
T G D * A * W V A L G R K Q V R * E T A
N R R V S V L R S A R A E A C T V G D G
Translation tables
• The coding for amino acids depends on species and/or nuclear/mitochondrial DNA.
• At least 17 translation tables exist:
  * The Standard Code
  * The Vertebrate Mitochondrial Code
  * The Yeast Mitochondrial Code
  * The Mold, Protozoan, and Coelenterate Mitochondrial Code and ...
  * The Invertebrate Mitochondrial Code
  * The Ciliate, Dasycladacean and Hexamita Nuclear Code
  * The Echinoderm and Flatworm Mitochondrial Code
  * The Euplotid Nuclear Code ...
  * ...
Tables with comments may be found at NCBI: http://www.ncbi.nlm.nih.gov/Taxonomy/
Translation tables (cont), examples
Example:
The Vertebrate Mitochondrial Code (transl_table=2)
Differences from the Standard Code:
         Code 2     Standard
AGA      Ter *      Arg R
AGG      Ter *      Arg R
AUA      Met M      Ile I
UGA      Trp W      Ter *
Example:
The Yeast Mitochondrial Code (transl_table=3)
Differences from the Standard Code:
         Code 3     Standard
AUA      Met M      Ile I
CUU      Thr T      Leu L
CUC      Thr T      Leu L
CUA      Thr T      Leu L
CUG      Thr T      Leu L
UGA      Trp W      Ter *
CGA      absent     Arg R
CGC      absent     Arg R
Big differences if start (initiation) and stop (termination) codons differ!
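To see how much a different table can change a translation, the sketch below applies the four Code 2 differences listed above as overrides on the standard code. The function names and the test RNA are invented for illustration; start-codon differences are not modelled.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order U, C, A, G.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Differences of the Vertebrate Mitochondrial Code (transl_table=2) from the standard code.
VERTEBRATE_MITO_OVERRIDES = {"AGA": "*", "AGG": "*", "AUA": "M", "UGA": "W"}


def make_table(overrides=None):
    """Copy the standard table and apply codon reassignments on top of it."""
    table = dict(STANDARD)
    table.update(overrides or {})
    return table


def translate(rna, table):
    """Translate codon by codon with the given table."""
    return "".join(table[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


rna = "AUGAUAUGAAGA"  # hypothetical sequence that touches three reassigned codons
print(translate(rna, make_table()))                           # standard code
print(translate(rna, make_table(VERTEBRATE_MITO_OVERRIDES)))  # code 2
```

With the standard table the stop appears at the third codon; with Code 2 the same codon is read as Trp and the stop moves to the fourth codon.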
Ambiguous sequence notation
Nucleotide examples:
A or C, [AC]: symbol M
A or G, [AG]: symbol R
A or T, [AT]: symbol W
A or C or G, [ACG]: symbol V
... etc.

G A A A A C
G A G A T C
G C A A C C
G C G A G C
-----------------
G[AC][AG]A[ATCG]C

The 4-sequence example may be written as a sequence: GMRANC, or as a pattern: G-[AC]-[AG]-A-x(1)-C
Wildcard: x(N) represents N arbitrary symbols.
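The four-sequence example can be reproduced with a short sketch that maps each alignment column to its IUPAC symbol. The lookup table follows the standard IUPAC nucleotide codes; the function names are illustrative, and the input is assumed to be gap-free DNA of equal length.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H", frozenset("AGT"): "D",
    frozenset("CGT"): "B", frozenset("ACGT"): "N",
}


def consensus(sequences):
    """One IUPAC symbol per column of equally long, ungapped sequences."""
    return "".join(IUPAC[frozenset(column)] for column in zip(*sequences))


def to_pattern(sequences):
    """Write the same columns in the pattern notation used above."""
    elements = []
    for column in zip(*sequences):
        bases = sorted(set(column))
        if len(bases) == 1:
            elements.append(bases[0])
        elif len(bases) == 4:
            elements.append("x(1)")
        else:
            elements.append("[" + "".join(bases) + "]")
    return "-".join(elements)


example = ["GAAAAC", "GAGATC", "GCAACC", "GCGAGC"]
print(consensus(example))
print(to_pattern(example))
```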
Identity (pattern matching)

• Finding short exact matches
  GAATTC – recognition site for enzyme EcoRI
  GDSGGP – typical of serine proteases (e.g. G-[DE]-S-G-[GS]-[SAPHV])
• Patterns for multiple matches
  GA-[AG]-L-[ST] : GA + A or G + L + S or T
    GAALS, GAGLS, GAALT, GAGLT match
  GA-x-G-[STLAG] : GA + any 1 aa + G + S or T or L or A or G
    100 different sequences match
  C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
    pattern for zinc finger proteins (millions of possible sequences)

Programs that use these kinds of patterns:
"Findpatterns" searches a sequence (or set of sequences) for a pattern.
"Motifs" searches a sequence for motifs present in the PROSITE database.
PROSITE has patterns for >1000 protein families.
Important: Match or no match – just true or false, no score!
("Profiles" have probabilities for different amino acids in certain positions.)
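Patterns of this kind map almost directly onto regular expressions. The converter below is a minimal sketch, not the code used by Findpatterns or Motifs: it handles only literal residues, `[..]` classes and `x`, `x(n)`, `x(n,m)` wildcards, and ignores other PROSITE syntax such as `{..}` exclusions and anchors.

```python
import re

WILDCARD = re.compile(r"x(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        wildcard = WILDCARD.fullmatch(element)
        if wildcard is None:
            parts.append(element)          # literal residue(s) or a [..] class
        elif wildcard.group(1) is None:
            parts.append(".")              # bare x: any one residue
        elif wildcard.group(2) is None:
            parts.append(".{%s}" % wildcard.group(1))
        else:
            parts.append(".{%s,%s}" % wildcard.group(1, 2))
    return "".join(parts)


regex = prosite_to_regex("GA-[AG]-L-[ST]")
for candidate in ["GAALS", "GAGLS", "GAALT", "GAGLT", "GACLS"]:
    print(candidate, bool(re.fullmatch(regex, candidate)))

# Exact-match search, e.g. EcoRI sites in a made-up DNA string.
print([m.start() for m in re.finditer("GAATTC", "TTGAATTCAAGAATTC")])
```

As noted above, the result is only true or false per position; there is no score.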
Pairwise alignments:
Global alignment
Considers similarity across the full extent of the sequences.

Local alignment (most common)
Considers regions of similarity in parts of the sequences only.
[Figure: two sequences sharing a central region of similarity]
Comparing 2 sequences - Dotplot analysis
[Dotplot: M A K L Q G A L G K R Y (horizontal) against M A K I Q G A L A K R Y (vertical); identical residues are marked with *]

Sequence alignment (2 mismatches)

M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y

Comparing 2 sequences - Gaps
[Dotplot: M A K L Q L G K R Y (horizontal) against M A K L Q G A L G K R Y (vertical); identical residues are marked with *]

Sequence alignment (gap)

M A K L Q - - L G K R Y
* * * * *     * * * * *
M A K L Q G A L G K R Y
Comparing 2 sequences: What are gaps?
Gaps are results of mutations (changes in DNA) that occur during evolution
For instance, consider this deletion mutation:

Before the deletion
DNA      AACTTGACG GACTGGGCGTAT CGGGCACCG
         TTGAACTGC CTGACCCGCATA GCCCGTGGC
protein  N L T D W A Y R A P

After the deletion of the middle segment
DNA      AACTTGACG CGGGCACCG
         TTGAACTGC GCCCGTGGC
protein  N L T R A P
Alignment report example
Red lines = matches full sequence (high identity)
Purple lines = matches contain gap (good identity)
Best alignment = highest score!
Give scores for match, mismatch and gap (and gap extension).
What is better: mismatch or gap?
Calculate best score for each position, “trace back” to find best alignment.
“Dynamic programming” algorithms.
Slow algorithm (time grows with the product of the two sequence lengths); too slow for routine searches of large databases!
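The scoring and trace-back steps can be sketched as a small global alignment routine. This is a minimal illustration with assumed scores (match +1, mismatch -1, gap -1) and a single linear gap penalty, so it has no separate gap extension penalty; the function name is made up.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the dynamic programming matrix, then trace back one best alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,        # gap in b
                              score[i][j - 1] + gap)        # gap in a

    # Trace back from the bottom-right corner to recover the alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = global_align("MAKLQLGKRY", "MAKLQGALGKRY")
print(best)
print(top)
print(bottom)
```

Whether a mismatch or a gap is "better" depends entirely on the chosen scores: with these values one mismatch costs the same as one gap position, and changing the gap penalty changes which alignment wins.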
Improvement of speed as compared to local alignment algorithm: BLAST and FastA

BLAST lists all matching "words"* between query and subject.
For each short match, the program tries to extend in both directions.

* A word is 7-11 nucleotides or 3-.. aa
Searching databases with BLAST
Initial search is for short words. Word hits are then extended in either direction.
→ we only extend words that are in both sequences
→ fast, but gap can't be long between two close words
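The word-then-extend idea can be illustrated with a toy sketch. It is deliberately simpler than BLAST itself: it uses exact word matches only (no neighbourhood words scored with a substitution matrix) and stops extending at the first mismatch instead of using a score drop-off. The function names and the subject sequence are invented.

```python
from collections import defaultdict


def word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) for every word shared by both sequences."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            hits.append((i, j))
    return hits


def extend(query, subject, i, j, word_size=3):
    """Extend a word hit in both directions, without gaps, while residues are identical."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + word_size, j + word_size
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q   # start in query, start in subject, length


query, subject = "MAKLQGALGKRY", "XXMAKIQGALAKRYZZ"
hits = word_hits(query, subject)
print(hits)
print(sorted({extend(query, subject, i, j) for i, j in hits}))
```

Only positions that share a word are ever examined, which is where the speed comes from; the price is that similarity without any shared word is missed.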
Searching databases with FastA
Initial search for short words. Words are extended, but also linked if they are close!
→ slower, but longer alignments
Aligning two sequences - Gap extension penalty. Alignment of genomic sequence with mRNA (Global alignment!)
Alignment of the following two sequences: V00594 (Human mRNA for metallothionein) and J00271 (corresponding genomic sequence).
Default setting: Extend gap = 3
In a global alignment all residues are matched.

New settings: Extend gap = 0
[Figure: alignment of the mRNA against the genomic sequence, with Exon 1, Exon 2 and Exon 3 marked]
Output from Blast
BLASTP 2.0.11 [Jan-20-2000]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq (75 letters)

Database: nr
457,798 sequences; 140,871,481 total letters
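E-value: expected number of hits with at least this score that occur by chance in a database of this size.
E-value, as important as score!
[Figure labels: Score, Alignments]

Expect Value
E = number of database hits you expect to find by chance
E depends on the size of the database and on your score: the expected number of random hits rises with database size and falls as the score rises.
Small database = few random hits. Big database = many random hits!
For the same score, a small database gives a lower E-value than a big one.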
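The dependence on database size and score can be written as E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below uses that relation with raw lengths; BLAST itself uses corrected "effective" lengths, so the numbers are only indicative. The function name is made up, and the lengths are taken from the output header above.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score: m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


query_length = 75                 # letters in the query shown above
database_length = 140_871_481     # total letters in the database shown above

for bits in (30, 40, 50):
    print(bits, expect_value(bits, query_length, database_length))

# The same score in a database one tenth of the size gives a tenfold lower E-value.
print(expect_value(40, query_length, database_length // 10))
```

Each additional bit of score halves E, and doubling the database doubles it.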
High score

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
Length = 65
Unitary matrices (nucleotide, protein)
All matches get '10', all mismatches '0'. Used for nucleotide seqs. Gives bad protein hits, because identities can occur by chance.

Point Accepted Mutation, PAM (proteins)
PAM30, PAM70 ... matrices. Based on evolutionary distance: 1 PAM = 1 accepted point mutation per 100 residues. Can't handle distant relationships well.

Blocks Substitution Matrix, BLOSUM (proteins)
BLOSUM50, BLOSUM62 ... matrices. Based on alignments in the BLOCKS db. Sequence segments above a certain identity are clustered. The most used matrices. BLOSUM62 (segments with at least 62% identity clustered) is the default in BLAST.
Remember: Any substitution matrix is making a statement about the probability of observing a pair of aligned residues in real alignments!
ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
M A K L E K L N Q A G L M V A G

60% nucleotide identity
ATGGCTAGGTTGGAGAAGATAAACCAAGCTGGGATAATAGTTGCAGGA
M V R L E K I N Q A G L L V A G
69% amino acid identity

M V R I Q K I N E K G A L L A G
38%

Q V R I Q K I Y E K G A L L A A
19% ('twilight zone')

Q V R I Q K I Y E K T A L L F A
6% ('midnight zone')

Evolution of protein genes: secondary and tertiary structure conserved

Blast report
Sequences producing significant alignments:                            (bits)  Value

pir||F69494 (R)-hydroxyglutaryl-CoA dehydratase activator (hgdC)...     462  e-129
gb|AAD31675.1| (AF123384) (R)-2-hydroxyglutaryl-CoA dehydratase ...     233  1e-060
sp|P39383|YJIL_ECOLI HYPOTHETICAL 27.4 KD PROTEIN IN IADA-MCRD I...     184  9e-046
emb|CAA67409.1| (X98916) orf6 [Methanopyrus kandleri]                   170  1e-041
gb|AAF13150.1|AF156260_1 (AF156260) unknown [Methanosarcina bark...     143  2e-033
pir||A69117 activator of (R)-2-hydroxyglutaryl-CoA - Methanobact...     132  4e-030
pir||A72369 (R)-2-hydroxyglutaryl-CoA dehydratase activator-rela...     129  4e-029
gb|AAC23928.1| (U75363) benzoyl-CoA reductase subunit [Rhodopseu...     117  1e-025
pir||S04476 hypothetical protein (hdgA 5' region) - Acidaminococ...     104  1e-021
sp|P27542|DNAK_CHLPN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...      42  0.005
gb|AAC15473.1| (AF016711) heat shock protein 70 [Burkholderia ps...      39  0.036
pir||F75029 o-sialoglycoprotein endopeptidase (gcp) PAB1159 - Py...      38  0.082
pir||F72514 probable glucokinase APE2091 - Aeropyrum pernix (str...      37  0.18
sp|P42373|DNAK_BURCE DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...      37  0.18
emb|CAA10035.1| (AJ012470) mitochondrial-type hsp70 [Encephalito...      36  0.31
sp|P56836|DNAK_CHLMU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...      36  0.41
gb|AAF39496.1| (AE002336) dnaK protein [Chlamydia muridarum]             36  0.41
pir||B70189 rod shape-determining protein (mreB-1) homolog - Lym...      36  0.41
sp|O57716|GCP_PYRHO PUTATIVE O-SIALOGLYCOPROTEIN ENDOPEPTIDASE (...      36  0.54
sp|O33522|DNAK_ALCEU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...      36  0.54
ref|NP_012874.1| Ykl050cp >gi|549677|sp|P35736|YKF0_YEAST HYPOTH...      36  0.54
emb|CAA53420.1| (X75781) D513 [Saccharomyces cerevisiae] >gi|158...      36  0.54
sp|P30722|DNAK_PAVLU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) >gi|99...      36  0.54
pir||A40158 dnaK-type molecular chaperone - Chlamydia trachomati...      34  1.2
gb|AAF07742.1|AE001584_39 (AE001584) hypothetical protein [Borre...      34  1.6
gb|AAF07521.1|AE001577_35 (AE001577) hypothetical protein [Borre...      34  1.6
gb|AAF38963.1| (AE002276) cell shape-determining protein MreB [C...      34  2.1
gb|AAG08147.1|AE004889_10 (AE004889) DnaK protein [Pseudomonas a...      33  2.7
dbj|BAB03215.1| (AB017035) dnaK [Bacillus thermoglucosidasius]           33  2.7
sp|P43736|DNAK_HAEIN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...      33  2.7
sp|P45554|DNAK_STAAU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...      33  2.7
sp|Q58303|FLA3_METJA FLAGELLIN B3 PRECURSOR                              32  4.7
gb|AAG08239.1|AE004898_10 (AE004898) phosphoribosylaminoimidazol...      32  6.1
Low complexity sequence tends to
(1) increase the number of non-specific hits to database sequences
(2) correspond to regions in proteins not associated with a known biological function (typically unstructured parts of the protein)
Therefore, low complexity parts are filtered out by default in BLAST searches. (Don’t use filtering if you want exact matches.)
Blast variants:
            Query      Database
blastp      Protein    Protein
blastn      DNA        DNA
tblastn     Protein    DNA
blastx      DNA        Protein
tblastx     DNA        DNA

Example: Searching a new genome assembly for a protein homolog.
Input: protein. Database: DNA (genome sequences)
→ tblastn
Rules of database searches (like BLAST)
• Database sequence searches involving proteins should be carried out at the protein level and not at the DNA level *
• Use the smallest possible database (not too small though)
• Sequence statistics should be used rather than percent identity/similarity as criterion for homology
• Consider different scoring matrices and gap penalties

* 1) DNA sequences encoding the same protein sequence can be very different, due to the degeneracy of the genetic code.

TTTCGATTCTCAACAAGAAGC
**  * **    **  *   *
TTCAGGTTTAGCACGCGGTCC
F  R  F  S  T  R  S

2) For nucleotide-nucleotide searches, it is often good to set the word size low (-W 7)
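The two DNA sequences in the footnote make the point numerically: they encode the same peptide but share fewer than half of their nucleotides. A minimal sketch, assuming the standard genetic code; the helper names are illustrative.

```python
from itertools import product

# Standard genetic code on the DNA alphabet, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_dna(dna):
    """Translate a coding-strand DNA sequence codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(a, b):
    """Count identical positions in two equally long, ungapped sequences."""
    return sum(x == y for x, y in zip(a, b))


dna1 = "TTTCGATTCTCAACAAGAAGC"
dna2 = "TTCAGGTTTAGCACGCGGTCC"
protein1, protein2 = translate_dna(dna1), translate_dna(dna2)

print(protein1, protein2)
print("DNA identity: %d/%d" % (identical_positions(dna1, dna2), len(dna1)))
print("protein identity: %d/%d" % (identical_positions(protein1, protein2), len(protein1)))
```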
BLAST at NCBI
tblastn
BLAST output at NCBI
1 perfect hit, some hits with parts of sequence matched
Score = 36.2 bits (18), Expect = 7.7
Identities = 18/18 (100%)
Strand = Plus / Minus

Query: 25       aggctacacactgaggac 42
                ||||||||||||||||||
Sbjct: 42727936 aggctacacactgaggac 42727919

Note: Only the best HSP is shown in the list before the alignments. Check the positions to understand in which order the HSPs match. The strand must be the same!
Databases at NCBI available for BLAST searches
Protein sequence databases
nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
swissprot the last major release of SWISS-PROT
DNA sequence Databases
nr All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
dbest Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
You may also blast against single genomes ...
Multiple alignments - applications
• Identify conserved motifs - patterns (PROSITE)
• Profiles (Pfam)
• Phylogenetic studies
• Prediction of protein secondary structure
• Experimental: design of probes
PileUp does a series of progressive, pairwise alignments between sequences and clusters of sequences to generate the final multiple alignment. A cluster consists of two or more already-aligned sequences.
PileUp begins by doing pairwise alignments that score the similarity between every possible pair of sequences. These similarity scores are used to create a clustering order that can be represented as a dendrogram. The clustering strategy represented by the dendrogram is called UPGMA, which stands for unweighted pair-group method using arithmetic averages (Sneath, P.H.A. and Sokal, R.R. (1973) in Numerical Taxonomy (pp. 230-234), W.H. Freeman and Company, San Francisco, California, USA).
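The clustering order can be sketched with a small UPGMA routine. This is an illustration of the method, not PileUp's code: it works on pairwise distances rather than similarity scores, and the four sequence names and distances are invented.

```python
from itertools import combinations


def upgma_order(names, dist):
    """Return the merge order as (cluster_a, cluster_b, average_distance) tuples.

    dist maps frozenset({x, y}) to the distance between leaves x and y.
    """
    clusters = {name: (name,) for name in names}   # cluster label -> member leaves
    merges = []

    def average(x, y):
        # Arithmetic mean over all leaf pairs, so every leaf has equal weight.
        pairs = [(p, q) for p in clusters[x] for q in clusters[y]]
        return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((x, y, average(x, y)))
        clusters["(%s,%s)" % (x, y)] = clusters.pop(x) + clusters.pop(y)
    return merges


raw = {("A", "B"): 2, ("C", "D"): 4, ("A", "C"): 8, ("A", "D"): 8, ("B", "C"): 8, ("B", "D"): 8}
distances = {frozenset(pair): value for pair, value in raw.items()}
for step in upgma_order(["A", "B", "C", "D"], distances):
    print(step)
```

In a progressive aligner the merge order is then used as the guide: the closest pair is aligned first, and clusters are aligned to each other in the order in which they are joined.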
The dendrogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment.

Trees from MSA
[Figure: tree with 3 large groups: SRP54, SRPR, FtsY]
Multiple alignment software
Pileup (GCG)
Clustalw / Clustalx
MSA (program that in principle finds the true optimal multiple alignment by the dynamic programming method)
T-coffee
Multiple alignment editors/viewers
SeqLab (GCG)
MACAW (search for motifs, blocks)
Jalview
CINEMA
Genedoc
Bioedit
Boxshade
Clustalx
njplot
Colours of amino acids according to type: charged, hydrophobic ...
Makes it easier to see matches.
How to find homologs with low sequence identity
• Sequence identity high if evolutionary distance is small, but low if the distance is big.
• Many amino acid positions change.
• An amino acid may be substituted differently in different species.
• If we have many known homologs, we can search with "all of them" as queries, but the unknown sequence may have yet another set of substitutions compared to the known homologs.
→ align known sequences and make a "profile"
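A profile can be sketched as a table of position-specific scores built from the aligned homologs. The version below is a minimal illustration: it assumes a gap-free alignment, a uniform background frequency for the 20 amino acids and a pseudocount of 1, none of which reflects how PSI-BLAST or Pfam weight real data. The three aligned sequences are invented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one {amino_acid: log-odds score} dictionary per alignment column."""
    depth = len(alignment)
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        scores = {}
        for amino_acid in AMINO_ACIDS:
            frequency = (column.count(amino_acid) + pseudocount) / (
                depth + pseudocount * len(AMINO_ACIDS))
            scores[amino_acid] = math.log2(frequency / background)
        profile.append(scores)
    return profile


def score(profile, sequence):
    """Sum the position-specific scores of a sequence as long as the profile."""
    return sum(column[amino_acid] for column, amino_acid in zip(profile, sequence))


profile = build_profile(["MAKL", "MVRL", "MVRI"])
print(round(profile[3]["L"], 2), round(profile[3]["I"], 2), round(profile[3]["S"], 2))
print(round(score(profile, "MVKL"), 2), round(score(profile, "QSSS"), 2))
```

The same residue scores differently depending on the column, which is what a single substitution matrix cannot express.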
Example sequence. How does Serine score in positions 211 and 216?
Amino acids
PSIBLAST – a more sensitive BLAST!
PSI-BLAST is an important tool to identify remote protein similarity. It proceeds by way of the following steps:
(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program.

(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query.

(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.

(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.
(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.
Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension of BLAST; the results produced in iterative search steps are comparable to those produced from the first pass.
Advantage : Unlike most profile-based search methods, PSI-BLAST runs as one program, starting with a single protein sequence, and the intermediate steps of multiple alignment and profile construction are invisible to the user.
PSI-BLAST creates profiles automatically
[Figure: 1st, 2nd and 3rd BLAST rounds; hits above the threshold go into the profile used for the next round]

When no more new sequences are found, the search terminates.

Problem: If bad sequences enter the profile, it finds only trash!
Example of homology: SRP9/14/21
• SRP9 & SRP14 are related (common ancestor)
• SRP9 is not found in Fungi (but SRP21 is)
• But weak SRP9 hit in the fungus S. pombe (YE07)
• Weak similarity SRP9 S. pombe & SRP21
• Make a profile of known SRP21 sequences and search a database of all known proteins!
Can we detect any similarity SRP9/21?
Profilesearch - based on Saccharomyces SRP21 sequences
[Table columns: Sequence, ZScore, Orig, Length, Comment]

Green box = sequences in profile (should be first!)
Yellow box = unknown SRP21 (incl. YE07 from S. pombe)
Red box = SRP9 sequences (Best hits in db of >1 million proteins!)
SRP21 aligned to SRP9 & 14
[Alignment figure: rows labelled 21, 9 and 14; one box is unaligned]

Secondary structure prediction by PSI-Pred also showed the conserved αβββα structure.

SRP9/14 αβββα secondary structure (Birse et al.) shown as cylinders (alpha helices) and arrows (beta strands).

The most conserved residues are in secondary structure elements. SRP9, SRP21 more similar.

Residues marked according to similarity in sequence and chemical properties.
Proteins share domains
• In primary sequence searches the found proteins are aligned because they share domains
• If the sequences are very different outside the shared domain, they may be paralogs.
• The next example shows a MSA in which the middle part is a GTPase domain. The first or last part is missing ...
N-terminal
C-terminal
Two different proteins (4+4 sequences ) are aligned. They share a domain.
Pfam – protein domains DB
• From multiple alignments of many related proteins, profiles (HMMs) are made
Scores for sequence family classification (score includes all domains):
Model          Description            Score    E-value   N
--------       -----------            -----    -------   ---
RNase_P_pop3   RNase P subunit Pop3   332.1    8.6e-97   1
• Sometimes you want to find out if there are short sequences (often called words) that are in a set of sequences. They may, for instance, be transcription factor binding sites ...
• Alignment programs won't find these ...
• MEME is a program that finds "words" of a specified length in a set of sequences.
• MAST may be used to search for known words
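The simplest version of the word problem is to list the exact words that occur in every sequence of a set. The sketch below does only that; MEME fits a probabilistic motif model by expectation maximization and tolerates mismatches, so this is a much cruder stand-in. The function name and the three sequences are invented.

```python
def shared_words(sequences, length):
    """Return the sorted words of the given length that occur in every sequence."""
    common = None
    for sequence in sequences:
        words = {sequence[i:i + length] for i in range(len(sequence) - length + 1)}
        common = words if common is None else common & words
    return sorted(common or [])


promoters = ["ACGTTGACGT", "TTTGACGAA", "GGTGACGC"]   # hypothetical upstream regions
print(shared_words(promoters, 5))
print(shared_words(promoters, 4))
```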
But what about RNA genes?
• RNA genes are genes that do not code for protein (they are not translated). They are usually called "noncoding RNAs".
• There are structural, catalytic and regulatory ncRNAs; few are conserved in all organisms
• Many ncRNAs are part of ribonucleoprotein complexes (RNPs)
• Some commonly known ncRNAs are: ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), signal recognition particle RNA (SRP RNA), ribonuclease P RNA (RNase P RNA)

Sequence alignment of annotated SRP RNA from Bacillus subtilis and identified SRP RNA from the newly sequenced and "fully" annotated Bacillus licheniformis. Sequence identity = 94%! Still no SRP RNA is annotated. SRPDB is needed.

ncRNAs in the 3 Kingdoms of Life

Rfam: annotating non-coding RNAs in complete genomes. Sam Griffiths-Jones, Simon Moxon, Mhairi Marshall, Ajay Khanna, Sean R. Eddy and Alex Bateman. Nucleic Acids Res. 2005 33:D121-D124.
  A A       C A       C A
 G   A     G   A     G   A
  G-C       U-A       U C
  A-U       C-G       C U
  C-G       G-C       G G

  seq1      seq2      seq3
Seq1 and seq2 are not similar, but they both have a hairpin structure, which is not shared by seq3!
The alignment of the primary sequences (structure) doesn’t give us any information.
Secondary structure pattern
Compensatory base changes maintain secondary structure.
We need a way to specify the base pairing!
Secondary structure pattern
Pattern:
h1 s1 h1'
h1 NNN:NNN
s1 GMAA

Note:
M = [AC]
N = [AUGC]

"h" stands for helix
"s" stands for strand

  A A       C A
 G   A     G   A
  G-C       U-A
  A-U       C-G
  C-G       G-C

  seq1      seq2
Programs that search for secondary structure: Patscan, RNAbob.
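The `h1 s1 h1'` pattern can be sketched as a direct scan: a 3-nt stem, a G[AC]AA loop, and a second stem strand that pairs with the first. This is an illustration only and does not reproduce Patscan or RNAbob syntax or behaviour; it accepts Watson-Crick pairs only (no G-U wobble). The three test sequences are the hairpins seq1, seq2 and seq3 as read 5' to 3' from the figure above.

```python
import re

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}   # Watson-Crick only
LOOP = re.compile(r"G[AC]AA")                              # s1 = GMAA with M = [AC]


def find_hairpins(rna, stem=3):
    """Return (start, h1, s1, h1') for every stem-loop-stem match in the sequence."""
    hits = []
    width = 2 * stem + 4
    for start in range(len(rna) - width + 1):
        left = rna[start:start + stem]
        loop = rna[start + stem:start + stem + 4]
        right = rna[start + stem + 4:start + width]
        # The first base of h1 pairs with the last base of h1', and so on inwards.
        paired = all((x, y) in PAIRS for x, y in zip(left, reversed(right)))
        if paired and LOOP.fullmatch(loop):
            hits.append((start, left, loop, right))
    return hits


for name, rna in [("seq1", "CAGGAAACUG"), ("seq2", "GCUGCAAAGC"), ("seq3", "GCUGCAACUG")]:
    print(name, find_hairpins(rna))
```

seq1 and seq2 share only the loop consensus and the pairing rule, yet both match, while seq3 is closer to seq2 in primary sequence and does not.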
Creating probabilistic covariance models from alignments (Rfam)

tetraloop with stem

Both primary sequence and secondary structure conservation captured in probabilistic model.

COVE (Eddy 1994) used for creating models, searching.

Covariance models are equivalent to stochastic context-free grammars.

Patterns: Hard to make large patterns and patterns that find new structures. Yes/No match, no scoring. Fast!

Models: covariance model of which bases appear together, created automatically from alignment. Time complexity: O(n^3). Slow!

Idea: Use smaller pattern to filter, use covariance model on filtered sequences → Fast and sensitive!
Multiple Sequence Alignments for ncRNAs must specify basepairings
The Rfam database is the “Pfam for ncRNAs”
ncRNA structure evolution
1. Mutations in ncRNAs maintain the secondary structure
→ primary sequence is poorly conserved
→ hard to detect similarities by primary sequence searches

2. Structure evolves by losing / adding helices
→ big gaps in alignments even when primary sequence is conserved
An example ...
SRP RNA variants
Helices H3 and H4 missing in yeast!
Bacteria
Archaea
Eukarya
Comment: t1 and t2 depict tertiary interactions
Helix 8 is the only part found in all SRP RNA!
Fungi SRPRNA lack helices 3,4
[Figures: secondary structure diagrams of SRP RNA from Yarrowia lipolytica, Neurospora crassa, Candida albicans and S. pombe]
These RNAs are < 300 nts.
BUT ...
The S. cerevisiae SRP RNA (length 519 nts) could not be fitted to this type of SRP RNA.
How do we decide on the structure of this gene?
Note that Yarrowia (bottom) has an extra helix.
Comparative analysis of SRP RNA Saccharomyces species
Using the known SRP sequences from Saccharomyces cerevisiae as queries, regions of the genomes of S. paradoxus, S. mikatae, S. kudriavzevii, S. castelli and S. kluyveri were retrieved from Washington Univ., St. Louis.
By comparative analysis, SRPRNA sequences (453-547 nts) and structures were identified*.
The results showed that all species had large inserts in the helix 5 region, especially close to the small Alu domain, and that helix 7 also was variable.
* The secondary structures were predicted with MFOLD.
[Figures: secondary structure diagrams of SRP RNA from Yarrowia lipolytica and S. bayanus]

Saccharomyces have a unique inserted part in helix 5 close to the Alu domain.

This was found in all Saccharomyces species.
Saccharomyces helix insertions
Helix 7
This structure is not in C.albicans, S.pombe.
MicroRNAs – regulatory ncRNAs
Red part is the mature miRNA, the sequence is complementary to mRNA!
RNAi pathway
Cell. 2004 Apr 2;117(1):1-3. miRNA and siRNA work in a similar fashion
Cross-species genomic sequence conservation can be used
1. for discovery of new regions with regulatory functions
2. to enhance gene predictions
3. to enhance alternative splicing predictions (1 gene → >1 mRNA → >1 protein)
4. to reveal transcription factor binding sites

Cross-species gene location conservation can be used for
1. identification of unknown ORFs (predicted proteins)
2. adding evidence for discovered new genes

Cross-species gene prevalence can be used for prediction of
1. the probability for the existence of a gene in a species (Keep looking!)
2. the function of a certain gene/protein/RNA (Is the product essential?)
Post-genomic Bioinformatics/Genomics
Cross-species genome comparisons
And much more ... (We will show some examples later ...)
SRP component searches
This is part of the secretory pathway.
The SRP pathway is conserved in all domains of life: Eukarya, Bacteria, Archaea.

All organisms have an SRP particle, but it looks different.

Mitochondria and chloroplasts are endosymbionts

Origin of photosynthetic organisms (have chloroplasts with own genome!)

Primary endosymbiosis: Cyanobacteria + Eukaryote → algae
Secondary endosymbiosis: algae + Eukaryote

Genome map of P. purpurea chloroplast at NCBI

We downloaded 26 chloroplast genomes and searched with pattern and model for bacterial SRP RNA.
Red algal group
Odontella and Guillardia have chloroplasts of secondary endosymbiosis origin
Green plant group
Found SRP RNA candidates (low scores) in 8 chloroplast genomes
Genome position for SRP RNA gene candidates in “green plant” group
Conserved clusters
The candidates in phylogenetically linked organisms are all found in this position.
No overlap with known genes!
(Conserved gene clusters are marked with ‘3’, ‘4’ ...)
Some of these also contain rnpB (gene for RNase P RNA)
• 2 clear groups: Red algae and Green algae

Genome locations of SRP RNA candidates in chloroplasts

The predicted SRP RNAs have conserved promoters (as in cyanobacteria)
Distances between –10 TATA box and sequence (5-8 nts), and promoter sequences are consistent with experimentally verified promoters in Prochlorococcus (Vogel et al. 2003)