Page 1
NON-CODING RNA PREDICTION OF CLINICALLY IMPORTANT MYCOPLASMA BY COMPARATIVE
GENOMIC ANALYSIS
Dissertation submitted to the Madurai Kamaraj University in partial fulfillment of the requirements for the degree of Master of Science in Biotechnology
Regn. No:A242009
School of Biotechnology Madurai Kamaraj University
Madurai
Page 2
OBJECTIVES:
• To choose the best possible approach for predicting ncRNAs.
• To standardize the procedure required for the selected approach.
• To identify and characterize ncRNAs from clinically important Mycoplasma species.
• To lay the groundwork for automating the ncRNA prediction procedure.
Page 3
Past
• Sequence similarity search, statistical analysis, transcription signal analysis, comparative genomic analysis.
• Existing methods are biased towards particular classes of ncRNAs.
• Examples: tRNAscan-SE, MiRscan, etc.
QRNA - A Blend
• Secondary structure alone is not statistically significant for the detection of ncRNAs.
• Sequences that code for proteins or perform other important functions are conserved across related organisms.
• QRNA was developed to distinguish conserved RNA secondary structures from the background of other conserved sequences.
Page 4
OUTLINE
INTERGENIC REGIONS OF ORGANISM OF INTEREST
↓
SEARCH FOR HOMOLOGY ACROSS RELATED ORGANISMS (blastn)
↓
PARSE THE ALIGNMENTS WITH CERTAIN CUTOFFS (Perl scripts)
↓
GIVE THE ALIGNMENTS AS INPUT TO QRNA
↓
PUTATIVE ncRNA
Page 5
PROTEIN CODING REGION→INTERGENIC REGION
.ptt file
↓
Co-ordinates of protein coding regions
↓
Intergenic region co-ordinates
↓
Intergenic region co-ordinates (difference > 50 nucleotides)
↓
Range file
↓
Intergenic sequence extraction by EMBOSS application
extractseq -regions @rangefile -separate
Page 6
GENOME LENGTH COMPARISON OF THE MYCOPLASMA
Abbreviations:
M.gen - Mycoplasma genitalium
M.pne - Mycoplasma pneumoniae
M.pul - Mycoplasma pulmonis
M.gal - Mycoplasma gallisepticum
M.myc - Mycoplasma mycoides
M.pen - Mycoplasma penetrans
[Bar graph: Genome Size Comparison; genome length (0 to 1,500,000) for M.pen, M.myc, M.gal, M.pul, M.pne, M.gen]
Organism          Genome size (bp)
M.gallisepticum     996,422
M.genitalium        580,074
M.mycoides        1,211,703
M.penetrans       1,358,633
M.pneumoniae        816,394
M.pulmonis          963,879
Page 7
MYCOPLASMA GENOME – INTERGENIC REGION
BAR GRAPH SHOWING THE PERCENTAGE OF INTERGENIC REGION IN THE GENOME OF MYCOPLASMA
[Bar graph: y-axis from 0% to 100%]
Page 8
PROTEIN TABLE OF THE GENOME
Mycoplasma genitalium G37 complete genome - 0..580074
480 proteins
Location      Strand  Length  PID      Gene   Synonym  Code  COG  Product
735..1829     +       364     3844620  MG001  -        -     -    (dnaN)
1829..2761    +       310     1045670  MG002  -        -     -    dnaJ
2846..4798    +       650     1045671  MG003  -        -     -    (gyrB)
4813..7323    +       836     1045672  MG004  -        -     -    (gyrA)
7295..8548    +       417     1045673  MG005  -        -     -    (serS)
8552..9184    +       210     1045674  MG006  -        -     -    (tmk)
9157..9921    +       254     1045675  MG007  -        -     -    hypothetical
9924..11252   +       442     1045676  MG008  -        -     -    (tdhF)
……
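A minimal sketch of how the "Location" column of such a protein table could be read into numeric coordinates. It is written in Python for illustration only (the deck itself mentions Perl scripts and a C programme); the function name `parse_ptt_locations` and the sample rows are hypothetical, and the sketch assumes whitespace-separated fields with the location in the first column.

```python
import re

# Matches a location field such as "735..1829".
LOCATION = re.compile(r"^(\d+)\.\.(\d+)$")

def parse_ptt_locations(lines):
    """Return (start, end, strand) for every row whose first field is a location."""
    genes = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        match = LOCATION.match(fields[0])
        if match is None:
            continue  # title line, protein count, column header or filler row
        genes.append((int(match.group(1)), int(match.group(2)), fields[1]))
    return genes

# Sample rows copied from the protein table on this slide.
ptt_rows = [
    "Mycoplasma genitalium G37 complete genome - 0..580074",
    "480 proteins",
    "Location Strand Length PID Gene Synonym Code COG Product",
    "735..1829 + 364 3844620 MG001 - - - (dnaN)",
    "1829..2761 + 310 1045670 MG002 - - - dnaJ",
    "2846..4798 + 650 1045671 MG003 - - - (gyrB)",
]
genes = parse_ptt_locations(ptt_rows)
print(genes)
```

Rows that do not start with a `start..end` field are skipped rather than treated as errors, so the header lines need no special handling.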
Page 9
Protein Co-ordinates          Intergenic Co-ordinates
Start   End                   Start   End
735     1829                  1       734
1829    2761                  2762    2845
2846    4798          →       4799    4812
4813    7323                  7324    7294
7295    8548                  8549    8551
8552    9184                  9185    9156
9157    9921                  …….     …….
…….     …….
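The conversion from protein co-ordinates to intergenic co-ordinates can be sketched as follows. This is an illustrative Python version, not the programme used in the work; `intergenic_intervals` is a hypothetical name, and the sketch assumes 1-based inclusive coordinates sorted by start position.

```python
def intergenic_intervals(genes):
    """Return (start, end, length) for the region before each gene.

    `genes` is a list of (start, end) pairs, 1-based and inclusive,
    sorted by start. Length is end - start + 1, so overlapping
    neighbours yield a negative length.
    """
    gaps = []
    previous_end = 0
    for start, end in genes:
        gap_start = previous_end + 1
        gap_end = start - 1
        gaps.append((gap_start, gap_end, gap_end - gap_start + 1))
        previous_end = end
    return gaps

# Protein co-ordinates taken from the slide.
genes = [(735, 1829), (1829, 2761), (2846, 4798), (4813, 7323),
         (7295, 8548), (8552, 9184), (9157, 9921)]
for gap in intergenic_intervals(genes):
    print(*gap)
```

With these inputs the sketch also reports the one-base overlap between the first two genes (1830 to 1828, length -1), a row the slide does not list. Negative lengths mark overlapping genes and are removed in the curing step.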
Page 10
CURING
Raw intergenic coordinates
Starting  Ending  Length
1         734     734
2762      2845    84
4799      4812    14
7324      7294    -29
8549      8551    3
9185      9156    -28
9922      9923    2
11253     11251   -1
12041     12068   28
12726     12701   -24
13566     13569   4
14434     14395   -38
15317     15555   239
Intergenic region coordinates which are more than 50 nucleotides in length
Starting  Ending  Length
1         734     734
2762      2845    84
15317     15555   239
[Bar graph: Curing of Intergenic Regions; number of intergenic regions (0 to 1200) before and after curing]
GRAPH SHOWING THE CULLING OF THE INTERGENIC SEQUENCES BY THE C PROGRAMME THAT SELECTS ONLY THE REGIONS WHOSE LENGTH IS GREATER THAN OR EQUAL TO 50 NUCLEOTIDES
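The curing step can be sketched in a few lines. The version below is an illustrative Python stand-in for the C programme, with hypothetical function names. It uses the "greater than or equal to 50" rule from the graph caption, and it assumes the range file holds one "start end" pair per line.

```python
def cure_intervals(intervals, min_length=50):
    """Keep (start, end) pairs whose inclusive length is at least min_length."""
    cured = []
    for start, end in intervals:
        length = end - start + 1  # negative when neighbouring genes overlap
        if length >= min_length:
            cured.append((start, end))
    return cured

def range_file_text(intervals):
    """Format intervals as one 'start end' pair per line (assumed layout)."""
    return "".join(f"{start} {end}\n" for start, end in intervals)

# Raw intergenic coordinates from the slide.
raw = [(1, 734), (2762, 2845), (4799, 4812), (7324, 7294), (8549, 8551),
       (9185, 9156), (9922, 9923), (11253, 11251), (12041, 12068),
       (12726, 12701), (13566, 13569), (14434, 14395), (15317, 15555)]
cured = cure_intervals(raw)
print(range_file_text(cured), end="")
```

For the rows shown on the slide this keeps the same three regions as the cured table.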
Page 11
INTERGENIC SEQUENCES
>L43967_2762_2845 Mycoplasma genitalium G37 intergenic sequence
AAAACCTTTCATTTTTAATGTGTTATAATTATTTGTTATGCCATAAATTTAGTTTGTGGCAAAAGCTTCTGTACTGTTTATTTA
>L43967_15317_15555 Mycoplasma genitalium G37 intergenic sequence
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAAGCATAATCCTAAGGGTATTTAACTAATTATTATAACAATTTTAATTTAACCAAAATACCCCTCGAATTTTAACAGTTTTTATAATCAAAACAGCTAATTTT
>L43967_19760_19824 Mycoplasma genitalium G37 intergenic sequence
ATAAATTTAATAGTGTTGAAAGACAAACATTATTAATTTTTGATCAGCTAAATAAAACAAAGCAA
>L43967_20356_20543 Mycoplasma genitalium G37 intergenic sequence
CTCAAAAAACTAATACATCAAACTTCAACCGTTTACTTTTTTATGAACAAGCACTACAAAGGTTTTATGAAGAATTATTTCAAATAGATTATTTAAGAAGATTTGAAAACATTCCCATTAAAGATAAGAATCAAATTGCGCTTTTTAAAACTGTTTTTGATGATTACAAAACCATTGATTTAGCAGAA
……
Intergenic sequences extracted in FASTA format
Page 12
Similarity Search - WU BLAST 2.0
Organism         Database created  Organisms in database
M.gallisepticum  gempppdb          M.genitalium, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.genitalium     gampppdb          M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.mycoides       ggpppdb           M.gallisepticum, M.genitalium, M.penetrans, M.pneumoniae, M.pulmonis
M.penetrans      ggmpnpudb         M.gallisepticum, M.genitalium, M.mycoides, M.pneumoniae, M.pulmonis
M.pneumoniae     ggmpepudb         M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pulmonis
M.pulmonis       ggmpepndb         M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pneumoniae
Table showing the list of databases made and the organisms in each
• Six genome databases were made, each excluding one organism.
• The intergenic sequences of each organism were searched for similarity (blastn) against the database that does not contain that organism's own genome.
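The leave-one-out composition of the six databases can be written down compactly. The Python sketch below only works out which genomes belong in each database; it does not build any BLAST database, and `leave_one_out` is a hypothetical helper name.

```python
ORGANISMS = [
    "M.gallisepticum", "M.genitalium", "M.mycoides",
    "M.penetrans", "M.pneumoniae", "M.pulmonis",
]

def leave_one_out(organisms):
    """Map each query organism to the genomes that go into its database."""
    return {query: [o for o in organisms if o != query] for query in organisms}

databases = leave_one_out(ORGANISMS)
for query, members in databases.items():
    print(f"{query}: {', '.join(members)}")
```

Each query organism is searched against five genomes, so a hit can never come from the query's own genome.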
Page 13
Parsing alignments - Factors
• A Perl script, blastn2qrnadepth.pl, is used to parse the blastn alignments.
• Factors considered in parsing
– First trimming
• E-value
• Minimum and maximum identity of alignments
• Length of the alignment
– Second trimming
• Score
• Depth of alignments
• Shift
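The two trimming passes can be illustrated with a simplified Python sketch. It is not a port of blastn2qrnadepth.pl. The first pass applies the E-value, identity and length cutoffs as plain filters. The second pass is an assumed reading of "depth": alignments are taken in order of decreasing score and kept only if they cover some query position that fewer than `depth` kept alignments already cover. The "shift" parameter is not modelled, and all field names are hypothetical.

```python
from collections import defaultdict

def first_trimming(alignments, max_evalue=0.01, min_identity=0.0,
                   max_identity=100.0, min_length=1):
    """Filter on E-value, percent identity and alignment length."""
    return [a for a in alignments
            if a["evalue"] <= max_evalue
            and min_identity <= a["identity"] <= max_identity
            and a["length"] >= min_length]

def second_trimming(alignments, depth=1):
    """Keep best-scoring alignments until every covered position has `depth` hits."""
    coverage = defaultdict(lambda: defaultdict(int))  # query -> position -> count
    kept = []
    for a in sorted(alignments, key=lambda a: a["score"], reverse=True):
        counts = coverage[a["query"]]
        positions = range(a["q_start"], a["q_end"] + 1)
        if any(counts[p] < depth for p in positions):
            kept.append(a)
            for p in positions:
                counts[p] += 1
    return kept

# Toy alignments for one query (all values invented for illustration).
hits = [
    {"query": "q1", "q_start": 1, "q_end": 100, "length": 100,
     "identity": 90.0, "evalue": 1e-20, "score": 200},
    {"query": "q1", "q_start": 20, "q_end": 80, "length": 61,
     "identity": 80.0, "evalue": 1e-5, "score": 90},
    {"query": "q1", "q_start": 150, "q_end": 199, "length": 50,
     "identity": 75.0, "evalue": 1e-4, "score": 70},
    {"query": "q1", "q_start": 1, "q_end": 40, "length": 40,
     "identity": 60.0, "evalue": 0.5, "score": 30},
]
after_first = first_trimming(hits)
after_second = second_trimming(after_first)
print(len(hits), len(after_first), len(after_second))  # prints: 4 3 2
```

In the toy data the fourth hit fails the E-value cutoff, and the second hit is dropped because a better-scoring alignment already covers its whole query range.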
Page 14
Parsing alignments – QRNA input
• The Perl script generates several files:
– QRNA input file: filename.q
• A collection of sequences in FASTA format, in which each consecutive pair of sequences forms the two components of an alignment, with gaps left in place.
– Parsing report file: filename.q.rep
• A report of the blastn alignments that were pruned in the process of creating the QRNA input file.
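A small Python sketch of the pairwise layout described for the .q file: two FASTA records per alignment, with gap characters kept so both sequences have the same length. The function name, record names and sequences are hypothetical, and the exact header format written by the Perl script is not reproduced.

```python
def qrna_input_text(alignment_pairs):
    """Format alignments as consecutive pairs of gapped FASTA records."""
    records = []
    for (query_name, query_seq), (subject_name, subject_seq) in alignment_pairs:
        if len(query_seq) != len(subject_seq):
            raise ValueError("aligned sequences must have equal length")
        records.append(f">{query_name}\n{query_seq}\n>{subject_name}\n{subject_seq}\n")
    return "".join(records)

# One toy alignment; "-" marks a gap that is kept in the output.
pairs = [(("query_1", "ACGT-ACGT"), ("subject_1", "ACGTTACGA"))]
text = qrna_input_text(pairs)
print(text, end="")
```

The length check reflects that each pair is one alignment: removing gaps from either sequence would break the column correspondence.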
Page 15
QRNA input file
>L43967_15317_15555-1>179-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAAGCATAATCCTAAGGGTAT-TTAACTA-ATTATTATAACAATTT
>gb-U00089--19096>19275-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAGGCAAAATCTTC-GTACAGGTTCGCTTCATAATTATATTAATTT
>L43967_19760_19824-5<65-Mycoplasma
TTGCTTTGTTTTATTTAGCTGATCAA-AAATTAATAATGTTTGTCTTTCAACACTATTAAAT
>emb-BX293980.1--57200>57261-Mycoplasma
TTGTTTTGTTTTATTTAATTGATCAATAAATTGATTTAGTTTATCTTTATTTATTAATAAAT
Page 16
Parsing Report File
FILE: genblast
DIR: /home/kalyankpy/coput2/blast//
FIRST TRIMMING
Minimum length = 1
Maximum Evalue = 0.01
Minimum %id = 0
Maximum %id = 100
SECOND TRIMMING
Alignments culled by = SC
Depth of alignments = 1
shift = 1

113-QUERY: L43967_546708_546877 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 1121
After First trimming: 88
After Second trimming: 2

57-QUERY: L43967_325878_326027 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 152
After First trimming: 3
After Second trimming: 3
……
Total #Queries 122
Total #Alignments 53927 ave_len = 309.5
After first trimming 18851 ave_len = 552.6
After second trimming 386 ave_len = 404.2
Page 17
GRAPH SHOWING NUMBER OF ALIGNMENTS SELECTED FOR QRNA INPUT FOR EACH GENOME THROUGH THE PERL SCRIPT
[Bar graph: No. of alignments and No. of blastn hits selected for QRNA input, for M.pen, M.myc, M.gal, M.pul, M.pne, M.gen]
No. of blastn hits selected for QRNA input: 1852, 1787, 850, 1012, 565, 386
No. of alignments: 53927, 44433, 360830, 154026, 430551, 560263
Page 18
QRNA – PARAMETERS
• Scanning window approach
– Window = 150 nt; Extension = 50 nt
• Maximum length 9999999
• Local Viterbi algorithm
• RIBOPROB matrix
• Shuffling the sequence while maintaining the composition
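The scanning window settings can be illustrated with a short Python sketch that lists the windows for an alignment of a given length. It assumes the 50 nt extension is the step between consecutive window starts (the QRNA output header reports it as "slide = 50") and that the last window is clipped at the end of the alignment. QRNA's own handling of the final window may differ, and `scanning_windows` is a hypothetical name.

```python
def scanning_windows(length, window=150, slide=50):
    """Return 0-based inclusive (start, end) windows over an alignment."""
    windows = []
    for start in range(0, length, slide):
        end = min(start + window, length) - 1  # clip the final window (assumption)
        windows.append((start, end))
        if end == length - 1:
            break
    return windows

# An alignment of 664 columns, the length reported in the QRNA output example.
windows = scanning_windows(664)
print(len(windows), windows[0], windows[-1])
```

Because the step is smaller than the window, neighbouring windows overlap by 100 columns, so a structured region that straddles a window boundary is still scored as a whole in another window.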
Page 19
QRNA OUTPUT
#---------------------------------------------------------------------
# qrna 2.0.1 (Tue Aug 19 11:30:55 CDT 2003) using squid 1.5m (Sept 1997)
#---------------------------------------------------------------------
# PAM model = BLOSUM62
#---------------------------------------------------------------------
# RNA model = /mix_tied_linux.cfg
# RIBOPROB matrix = /RIBOPROB85-60.mat
#---------------------------------------------------------------------
# seq file = /home/kalyankpy/perlscriptresult/genblast.q
# #seqs: 772 (max_len = 3420)
#---------------------------------------------------------------------
# window version: window = 150 slide = 50 -- length range = [0,9999999]
#---------------------------------------------------------------------
# 1 [both strands] (sre_shuffled)
>L43967_1_734-90>722-Mycoplasma (664)
>gb-U00089--130>767-Mycoplasma (664)

length of whole alignment after removing common gaps: 664
Divergence time (variable): 0.401
[alignment ID = 61.75 MUT = 29.67 GAP = 8.58
…… ( CONTD..)
Page 20
QRNA OUTPUT
length alignment: 150 (id=61.33) (mut=32.67) (gap=6.00) (sre_shuffled)
posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)
posY: 0-149 [0-144](145) -- (0.37 0.11 0.06 0.46)

L43967_1_734-90 TTAATTTTATTAAAACTATAACTTATTTTTTATAAACATTCTATGTTTTT
gb-U00089--130> TTTATTTTATTAAAATTATAATGTATTTTTGTTAAATTTT.TAATTCTTT
……
LOCAL_DIAG_VITERBI -- [Inside SCFG]
OTH ends *(+) = (0..[150]..149)    OTH ends (-) = (0..[150]..149)
COD ends *(+) = (120..[27]..146)   COD ends (-) = (41..[12]..52)
RNA ends *(+) = (0..[21]..20)      RNA ends (-) = (0..[150]..149)
winner = OTH
OTH = 184.281 COD = 166.408 RNA = 179.710
logoddspostOTH = 0.000 logoddspostCOD = -17.873 logoddspostRNA = -4.571
sigmoidalOTH = 4.571 sigmoidalCOD = -17.932 sigmoidalRNA = -4.571
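The relationship between the three model scores in this window can be checked with a short calculation. The printed logoddspost values equal each model's score minus the OTH score, and the winner is the model with the highest score. The Python sketch below reproduces only that arithmetic from the numbers on the slide. It is an observation about this output, not a statement of how QRNA computes its posteriors, and the sigmoidal scores are not reproduced.

```python
# Scores for one window, copied from the QRNA output above.
scores = {"OTH": 184.281, "COD": 166.408, "RNA": 179.710}

def log_odds_vs_oth(model_scores):
    """Score of each model relative to the OTH (null) model."""
    baseline = model_scores["OTH"]
    return {model: round(score - baseline, 3)
            for model, score in model_scores.items()}

winner = max(scores, key=scores.get)
log_odds = log_odds_vs_oth(scores)
print(winner, log_odds)
```

A window is a candidate ncRNA only when RNA is the winning model. Here OTH wins, so this window is not called as RNA.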
Page 21
Number of ncRNA predicted for each organism
[Bar graph: No. of ncRNAs predicted (y-axis 0 to 60) for M.pen, M.myc, M.gal, M.pul, M.pne, M.gen]
Page 22
Length Range of Non-coding RNA predicted
[Plot: length in nt (y-axis 0 to 350) for M.pen, M.myc, M.gal, M.pul, M.pne, M.gen]
PICTURE SHOWING THE LENGTH RANGE OF NON-CODING RNAs.
(Vertical bars represent the spread of scores and horizontal bars represent the average)
Page 23
Putative Vs Annotated
• The predicted ncRNAs were searched for similarity against the biochemically characterized ncRNAs of bacteria (Non-coding RNA database at http://biobases.ibch.poznan.pl/nc, updated 2002).
• One of the putative ncRNAs was found to be similar to the Mc_MCS4 ncRNA of Mycoplasma capricolum.
• Mc_MCS4 has already been characterized as having extensive homology with the eukaryotic U6 snRNA.
• Another motif in one of the putative ncRNAs was found to be conserved across E.coli, S.typhi and K.pneumoniae as a part of the MicF ncRNA in these organisms.
• MicF has been characterized as regulating the expression of the OmpF protein in these organisms.
• Similarity was also found with the OxyS ncRNA of E.coli.
• OxyS was found to modulate the expression of various genes in response to hydrogen peroxide.
Page 24
- In Eukaryotes
• Similarity was observed with a few miRNAs present in the miRNA database (Rfam miRNA registry).
• The same stretch of sequence was present in human, rat and mouse miRNA.
• Small stretches of similarity were also observed with various ncRNAs that play a role in the regulation of development.
Page 25
Sequences producing High-scoring Segment Pairs:                    Score  P(N)  N

hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop                 91   0.26  1
rno-mir-190 MI0000933 Rattus norvegicus miR-190 stem-loop            91   0.26  1
mmu-mir-190 MI0000232 Mus musculus miR-190 stem-loop                 86   0.48  1

>hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop
Length = 85

Minus Strand HSPs:

Score = 91 (19.7 bits), Expect = 0.31, P = 0.26
Identities = 45/68 (66%), Positives = 45/68 (66%), Strand = Minus / Plus

Query: 77 AGGTTTAGGTGTTCT-TATTT-ATTTATTAGGTTGTTTAGTT--TC-AATTATTTTTGGA 23
          |||  |  ||||  | | ||| || |||||||||||| | ||  || || ||| | |  |
Sbjct:  4 AGGCCTCTGTGTGATATGTTTGATATATTAGGTTGTT-ATTTAATCCAACTATATATCAA 62

Query: 22 ATACTAGT 15
          | | || |
Sbjct: 63 ACA-TATT 69
Page 26
>Hs_NTT Length = 17,572
Plus Strand HSPs:
Score = 116 (23.5 bits), Expect = 0.025, P = 0.024
Identities = 60/94 (63%), Positives = 60/94 (63%), Strand = Plus / Plus

Query:   11 TATTTAATATTTATAATTGCTATTTAGCATCTTAAAA-AAGA-CG-TCTTT-AAA-TATA 65
            ||  |||| |  || |||  |  | || |  |||| | |||  |  ||||  ||| ||||
Sbjct: 5336 TACATAAT-TAGATCATTTATTCTAAGTAAATTAAGAGAAGCTCTATCTTCCAAAATATA 5394

Query:   66 GATAGTTATACTAATTAGAAAATAGTTAAT-AAG 98
            ||||  | ||  ||| |||| |   ||||| |||
Sbjct: 5395 GATATCTCTAGCAAT-AGAAGAGTTTTAATTAAG 5427
Page 27
CONCLUSIONS
• Comparative genomic analysis was selected for the ncRNA prediction.
• Procedure for the prediction was standardized.
• One of the putative ncRNAs was found to be similar to an already characterized ncRNA from the same genus.
• A conserved region of MicF was also found to be present in a putative ncRNA.
• A eukaryotic miRNA counterpart was identified in Mycoplasma.
Page 28
Future Plans
• To develop programmes that derive the intergenic region co-ordinates from the protein table file given as input.
• To verify the genuineness of the predictions beyond the homologous regions found in bacteria.
• To extend the prediction procedure to Eukaryotes.
• To develop the procedure required for classifying the predicted ncRNAs into subclasses.
• To identify the functions of the putative ncRNAs by searching for their effector targets.
• To automate the whole procedure.
Page 29
ACKNOWLEDGMENTS
Dr. Z. A. Rafi
Dr. S. Krishnaswamy
The Whole SBT family
Ministry of Human Resource Development
Department of Education
Department of Science and Technology
Department of Biotechnology
All my classmates