3.31.2005 BIOL497 Undergraduate Presentation, Stanislav Luban, Member of Kihara Lab, Purdue Univ.
Comparative Study of Small RNA and Small Peptides in Complete Genome Sequences
Stanislav Luban 1,2
Daisuke Kihara 2,1
1. Department of Computer Sciences 2. Department of Biological Sciences
Purdue University, West Lafayette, IN
Introduction: Structural Small RNA (sRNA)
Genes which produce non-coding transcripts that function directly as structural, regulatory, or catalytic RNAs
Include rRNAs, tRNAs, small nucleolar RNAs, spliceosomal RNAs, viral associated RNAs, microRNAs, ctRNAs, and others
The Rfam (RNA families) database stores 34,496 sRNA entries distributed among 352 known families
In E. coli, about 50 sRNAs are known
(figure from Rfam database: http://www.sanger.ac.uk/Software/Rfam/)
Methods: QRNA
Model distinctive pattern of mutation:
Conserved Structural RNA: pattern of compensatory mutations consistent with base-paired secondary structure; Pair Stochastic Context-Free Grammar Model
Conserved Coding Region: pattern of synonymous codon substitutions; Pair Hidden Markov Model
Other Types of Conserved Regions: approximated by “null hypothesis” that mutations occur position independently, without pattern; Pair Hidden Markov Model
Each model's score is a log likelihood; these are used to calculate the final log-odds score of the RNA model relative to the other two models
(Figure: Rivas et al, Current Biol. 2001)
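To make the scoring step concrete, here is a minimal sketch of how three log likelihoods could be combined into one log-odds score for the RNA model. It is not QRNA's exact formula: the equal weighting of the coding and null models, the base-2 logarithms, and the function names are all assumptions made for illustration. The default threshold of 5.0 is the one quoted in the Conclusions.

```python
import math


def log2_sum_exp2(values):
    """Return log2(sum(2**v for v in values)) without numeric overflow."""
    m = max(values)
    return m + math.log2(sum(2.0 ** (v - m) for v in values))


def rna_log_odds(loglik_rna, loglik_cod, loglik_oth):
    """Log2 odds of the RNA model against the two alternatives combined.

    Assumption: the coding and null ("other") models get equal weight.
    """
    return loglik_rna - log2_sum_exp2([loglik_cod, loglik_oth])


def classify(loglik_rna, loglik_cod, loglik_oth, threshold=5.0):
    """Label an alignment window as RNA if its log-odds score meets the threshold."""
    score = rna_log_odds(loglik_rna, loglik_cod, loglik_oth)
    label = "RNA" if score >= threshold else "not RNA"
    return label, score


if __name__ == "__main__":
    # Hypothetical log2 likelihoods for one alignment window.
    print(classify(-100.0, -110.0, -110.0))
```

A window is called RNA only when the RNA model explains the alignment clearly better than both alternatives together, which is the intent of comparing the RNA model "to the other two models".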
Result Verification
A total of 71 E. coli-related sRNAs already annotated in the Rfam database were used as a benchmark
Of those:
15 – found by computational method that were also listed in Rfam and not tRNAs
6 – not found due to shortcomings of method
29 – tRNAs already annotated as gene loci in E. coli genome sequence used
10 – E. coli plasmid loci not found in full E. coli genome sequence used
2 – 4.5S RNAs already annotated as gene loci in E. coli genome sequence used
2 – E. coli reverse transcriptase loci not found in full E. coli genome sequence used
1 – E. coli insertion sequence not found in full E. coli genome sequence used
1 – E. coli small RNA annotated separately, not found in full E. coli genome sequence used
1 – Antisense RNA already annotated as gene locus in E. coli genome sequence used
1 – Cloning vector with E. coli promoter not found in full E. coli genome sequence used
1 – E. coli transposable element not found in full E. coli genome sequence used
1 – Reporter vector not found in full E. coli genome sequence used
1 – E. coli retron not found in full E. coli genome sequence used
Candidates for Experimental Verification of Findings
For the following 2 slides:
Family designation is expressed as [Organism name] [locus absolute start location] [locus absolute end location] and is synonymous with the first (header) entry of that family
Entries refers to the number of sRNA entries in the family that come from different organisms (2 chromosomes of one organism are counted separately)
Length (nt) and score refer only to the header entry of the family
Scores are the log odds ("log odds post") calculated by the QRNA program for the RNA model relative to the null hypothesis
Candidates for Experimental Verification of Findings
Top 10 highest statistically scoring E. coli sRNA loci found by computational method:
Family designation: Ecoli 3941194 3941327 Length: 133 Score: 34.114
Family designation: Ecoli 2744345 2744445 Length: 100 Score: 29.631
Family designation: Ecoli 780875 781068 Length: 193 Score: 29.194
Family designation: Ecoli 2687537 2687689 Length: 152 Score: 27.734
Family designation: Ecoli 2519348 2519548 Length: 200 Score: 23.876
Family designation: Ecoli 4169337 4169400 Length: 63 Score: 21.625
Family designation: Ecoli 4038218 4038281 Length: 63 Score: 21.596
Family designation: Ecoli 2751994 2752022 Length: 28 Score: 20.893
Family designation: Ecoli 3420989 3421058 Length: 69 Score: 20.821
Family designation: Ecoli 3808832 3808858 Length: 26 Score: 16.995
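Because each family designation encodes the locus coordinates, the listed lengths can be recomputed from the designation itself. The sketch below parses a designation of the form [Organism name] [start] [end]; the `Locus` type and `parse_designation` helper are hypothetical names introduced here, and the length convention (end minus start) is inferred from the values in the list rather than stated in the presentation.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    organism: str
    start: int  # absolute start location in the genome (nt)
    end: int    # absolute end location in the genome (nt)

    @property
    def length(self) -> int:
        # Assumption: length is end - start, which reproduces the listed values.
        return self.end - self.start


def parse_designation(text: str) -> Locus:
    """Parse a designation such as 'Ecoli 3941194 3941327'."""
    organism, start, end = text.split()
    return Locus(organism, int(start), int(end))


if __name__ == "__main__":
    locus = parse_designation("Ecoli 3941194 3941327")
    print(locus.organism, locus.length)
```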
Candidates for Experimental Verification of Findings
Top 10 largest sRNA families found by computational method:
Detailed Study of Located Sample sRNA
Most Likely (Lowest Free Energy) Predicted Fold of 80 nt Segment of Sequence
Fold predicted with Mfold (Zuker et al., 2004)
Another Approach to Finding sRNAs in E. coli: Paper Summary
Method Used in Paper to Find Putative sRNAs
A database of all E. coli intergenic DNA sequences was created from the gene annotations in an early release of the EcoGene database and used as input to a profile search program (pftools2.2, Swiss Institute of Bioinformatics) set to find sigma-70 promoters
A terminator motif was searched for in the database using the following criteria: (1) an 11-nt A-rich region; (2) a variable-length hairpin; (3) a variable-length spacer; (4) a 5-nt T-rich region nearest the hairpin; and (5) a 7-nt distal extra T-rich region
Predicted promoter and terminator pairs were combined to generate putative sRNAs if (1) the pair was on the same strand; and (2) the pair was more than 45 but less than 350 nt apart
To verify, open reading frames and possible ribosome binding sites were searched for downstream of each promoter
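The pairing rule above (same strand, more than 45 but less than 350 nt apart) can be sketched as a simple filter over predicted sites. This is an illustration, not the paper's implementation: the `Site` type, the use of a single coordinate per site, and the requirement that the terminator lie downstream of the promoter in the direction of transcription are assumptions.

```python
from typing import List, NamedTuple, Tuple


class Site(NamedTuple):
    position: int  # genome coordinate of the predicted site (nt)
    strand: str    # "+" (forward) or "-" (reverse)


def pair_sites(promoters: List[Site], terminators: List[Site],
               min_gap: int = 45, max_gap: int = 350) -> List[Tuple[Site, Site]]:
    """Return promoter/terminator pairs that could bound a putative sRNA."""
    pairs = []
    for p in promoters:
        for t in terminators:
            if p.strand != t.strand:
                continue  # criterion (1): both sites on the same strand
            # Assumption: the terminator must be downstream of the promoter,
            # so the signed gap is measured along the direction of transcription.
            gap = t.position - p.position if p.strand == "+" else p.position - t.position
            if min_gap < gap < max_gap:  # criterion (2): 45 < gap < 350 nt
                pairs.append((p, t))
    return pairs


if __name__ == "__main__":
    promoters = [Site(1000, "+"), Site(5000, "-")]
    terminators = [Site(1200, "+"), Site(1030, "+"), Site(4800, "-")]
    for promoter, terminator in pair_sites(promoters, terminators):
        print(promoter, terminator)
```

The nested loop is quadratic in the number of sites; for a single bacterial genome that is usually acceptable, and sorting sites by position would allow a faster sweep if needed.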
Synopsis of Method Used in Paper
Using the E. coli MG1655 genome, the paper searched for DNA regions that contained a sigma-70 promoter within a short distance of a rho-independent terminator
The paper predicted 227 putative sRNAs between 80 and 400 nt in length in E. coli, 32 of which were already known to be sRNAs
Transcripts of some of the candidate loci were verified using Northern hybridization
The approach may also be usable for annotating sRNA loci in other bacterial genomes
Verification of Paper Results with Results Using Our Method
Along with other results, the paper gives a detailed listing of the 227 sRNAs predicted, including the designation, strand orientation (forward or reverse), left and right boundaries (nt from genome start position), and length (nt) of each sRNA
Left and right boundary positions in genome given by paper were compared with left and right boundary positions of putative sRNAs found by our method
If an sRNA candidate from the paper was within 100 nt of any sRNA predicted by our method, that sRNA was scored as ‘found’
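The comparison described above can be sketched as an interval-distance test. The presentation does not say how "within 100 nt" is measured, so this sketch assumes the distance is the gap between the two (left, right) intervals, taken as 0 when they overlap; `interval_gap` and `count_found` are hypothetical helper names.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # (left boundary, right boundary) in nt


def interval_gap(a: Interval, b: Interval) -> int:
    """Gap in nt between two intervals; 0 if they overlap or touch."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def count_found(paper: Iterable[Interval], ours: Iterable[Interval],
                threshold: int = 100) -> int:
    """Count paper candidates lying within `threshold` nt of any of our predictions."""
    ours = list(ours)
    return sum(
        1 for candidate in paper
        if any(interval_gap(candidate, prediction) <= threshold for prediction in ours)
    )


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    paper = [(1000, 1100), (5000, 5100), (9000, 9050)]
    ours = [(1150, 1250), (5300, 5400)]
    print(count_found(paper, ours, threshold=100))
```

Keeping the threshold as a parameter makes it straightforward to repeat the comparison at other distances.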
Results of Verification
227 candidate sRNAs were predicted in E. coli by the paper
Among them, 150 (66.1 %) were localized by our method using the 100 nt criterion described above
The test was re-run with a 50 nt threshold, yielding 140 hits (61.7 %), a 10 nt threshold, yielding 128 hits (56.4 %), and a 1000 nt threshold, yielding 187 hits (82.4 %)
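The percentages quoted on this slide follow directly from the hit counts and the 227 candidates predicted by the paper. A short check, using only the numbers given above (the function name is a hypothetical helper):

```python
def recovery_rates(hits_by_threshold, total):
    """Percentage of paper candidates recovered at each distance threshold (nt)."""
    return {
        threshold: round(100.0 * hits / total, 1)
        for threshold, hits in sorted(hits_by_threshold.items())
    }


if __name__ == "__main__":
    total_candidates = 227
    hits = {10: 128, 50: 140, 100: 150, 1000: 187}  # threshold (nt) -> hits
    for threshold, percent in recovery_rates(hits, total_candidates).items():
        print(f"{threshold} nt: {percent} %")
```

The recovery rate rises steadily as the threshold is loosened, from 56.4 % at 10 nt to 82.4 % at 1000 nt.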
Preliminary Procedure for Extracting Small Peptides
Conclusions
Possible sRNAs are found in 20 to 39% of the intergenic regions of each organism
Among them, ~31% of the sRNAs satisfy the log-odds score threshold of 5.0 or higher
137 “families” are conserved in 5 or more organisms
Being well conserved, sRNAs may be responsible for fundamental functions of living organisms
Future Direction
The search for sRNAs will be expanded to a larger number of more diverse genomes
Secondary structure prediction will later be employed in greater detail to verify well-conserved sRNA regions among multiple evolutionarily distant organisms
Experimental verification of the findings of this study is under way (particularly for Shewanella oneidensis)
Comparative genomics will be used to discover the function associated with each sRNA and possibly to learn its role in a pathway