M Sc Project

NON-CODING RNA PREDICTION OF CLINICALLY IMPORTANT MYCOPLASMA BY COMPARATIVE GENOMIC ANALYSIS

Dissertation submitted to the Madurai Kamaraj University in partial fulfillment of the requirements for the degree of Master of Science in Biotechnology

Regn. No:A242009

School of Biotechnology, Madurai Kamaraj University, Madurai

OBJECTIVES:

• To choose the best possible approach to predict ncRNAs.

• To standardize the procedure required for the selected approach.

• To identify and characterize the ncRNAs from clinically important Mycoplasma.

• To lay the groundwork for automating the ncRNA prediction procedure.

Past

• Sequence similarity search, statistical analysis, transcription signal analysis, comparative genomic analysis.

• Existing methods are biased toward particular classes of ncRNAs.

• Examples: tRNAscan-SE, MiRscan, etc.

QRNA - A Blend

• Secondary structure alone is not statistically significant for the detection of ncRNAs.

• Sequences that code for proteins or perform other important functions are also conserved across related organisms.

QRNA was developed to distinguish conserved RNA secondary structures from the background of other conserved sequences.

OUTLINE

INTERGENIC REGIONS OF ORGANISM OF INTEREST

↓

SEARCH FOR HOMOLOGY ACROSS RELATED ORGANISMS (blastn)

↓

PARSE THE ALIGNMENTS WITH CERTAIN CUTOFFS (Perl scripts)

↓

GIVE THE ALIGNMENTS AS INPUT TO QRNA

↓

PUTATIVE ncRNA

PROTEIN CODING REGION→INTERGENIC REGION

.ptt file

↓

Co-ordinates of protein coding regions

↓

Intergenic region co-ordinates

↓

Intergenic region co-ordinates (difference > 50 nucleotides)

↓

Range file

↓

Intergenic sequence extraction by the EMBOSS application extractseq

extractseq -regions @rangefile -separate
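The range file passed to extractseq can be produced with a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and it assumes the EMBOSS range-file layout of one whitespace-separated start/end pair per line, with "#" starting a comment.

```python
def format_range_file(regions):
    """Return range-file text: one 'start<TAB>end' pair per line.

    Assumption: extractseq reads @rangefile as whitespace-separated
    start/end pairs and ignores lines that begin with '#'.
    """
    lines = ["# start end (1-based, inclusive)"]
    lines += [f"{start}\t{end}" for start, end in regions]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two of the regions listed in the slides.
    with open("rangefile", "w") as handle:
        handle.write(format_range_file([(1, 734), (2762, 2845)]))
```

The resulting file is the one referenced as @rangefile in the command above.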

GENOME LENGTH COMPARISON OF THE MYCOPLASMA

M.gen - Mycoplasma genitalium
M.pne - Mycoplasma pneumoniae
M.pul - Mycoplasma pulmonis
M.gal - Mycoplasma gallisepticum
M.myc - Mycoplasma mycoides
M.pen - Mycoplasma penetrans

Figure: Genome Size Comparison. Bar chart of genome length (axis ticks 0 to 1500000) for M.pen, M.myc, M.gal, M.pul, M.pne and M.gen.

Organism          Genome size (bp)
M.gallisepticum   996,422
M.genitalium      580,074
M.mycoides        1,211,703
M.penetrans       1,358,633
M.pneumoniae      816,394
M.pulmonis        963,879

MYCOPLASMA GENOME – INTERGENIC REGION

Figure: Bar graph showing the percentage of intergenic region in the genome of each Mycoplasma (axis: 0% to 100%).

PROTEIN TABLE OF THE GENOME

Mycoplasma genitalium G37 complete genome - 0..580074
480 proteins
Location Strand Length PID Gene Synonym Code COG Product
735..1829 + 364 3844620 MG001 - - - (dnaN)
1829..2761 + 310 1045670 MG002 - - - dnaJ
2846..4798 + 650 1045671 MG003 - - - (gyrB)
4813..7323 + 836 1045672 MG004 - - - (gyrA)
7295..8548 + 417 1045673 MG005 - - - (serS)
8552..9184 + 210 1045674 MG006 - - - (tmk)
9157..9921 + 254 1045675 MG007 - - - hypothetical
9924..11252 + 442 1045676 MG008 - - - (tdhF)
…… …….. … ….. ……….. ……… .. .. .. …
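Reading the protein co-ordinates out of such a table is mostly a matter of recognising the Location column. The following is a minimal sketch, not the project's own script: the function name is hypothetical, and it assumes every data row begins with "start..end" followed by the strand, as in the rows shown above.

```python
import re

# A data row starts with "start..end", whitespace, then the strand sign.
LOCATION = re.compile(r"^(\d+)\.\.(\d+)\s+([+-])")


def parse_ptt_locations(lines):
    """Return (start, end, strand) for each data row; other lines are skipped."""
    coords = []
    for line in lines:
        match = LOCATION.match(line.strip())
        if match:
            coords.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return coords


if __name__ == "__main__":
    rows = [
        "Mycoplasma genitalium G37 complete genome - 0..580074",
        "480 proteins",
        "735..1829 + 364 3844620 MG001 - - - (dnaN)",
    ]
    print(parse_ptt_locations(rows))
```

Title and header lines do not match the pattern, so they drop out without special handling.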

Protein Co-ordinates

735   1829
1829  2761
2846  4798
4813  7323
7295  8548
8552  9184
9157  9921
…….   …….

→

Intergenic Co-ordinates

1     734
2762  2845
4799  4812
7324  7294
8549  8551
9185  9156
…….   …….
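The conversion from protein co-ordinates to intergenic co-ordinates can be sketched as follows. This is an illustration under stated assumptions, not the project's program: the function name is hypothetical, each region runs from the base after the previous gene's end to the base before the next gene's start, and the stretch after the last gene is left out. For the adjacent pair 735..1829 and 1829..2761 this rule yields 1830..1828, an entry the slide's list does not show; the sketch keeps it as a negative-length region like the other overlaps.

```python
def intergenic_regions(proteins):
    """Return the (start, end) gap before each gene, 1-based and inclusive.

    Overlapping neighbours give regions whose end precedes their start;
    they are kept here so that a later filtering step can discard them.
    """
    regions = []
    prev_end = 0
    for start, end in sorted(proteins):
        regions.append((prev_end + 1, start - 1))
        prev_end = end
    return regions


if __name__ == "__main__":
    genes = [(735, 1829), (1829, 2761), (2846, 4798), (4813, 7323), (7295, 8548)]
    for region in intergenic_regions(genes):
        print(*region)
```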

CURING

Raw intergenic coordinates

Starting  Ending  Length
1         734     734
2762      2845    84
4799      4812    14
7324      7294    -29
8549      8551    3
9185      9156    -28
9922      9923    2
11253     11251   -1
12041     12068   28
12726     12701   -24
13566     13569   4
14434     14395   -38
15317     15555   239

Figure: Curing of Intergenic Regions. Bar chart of the No. of Intergenic Regions (axis: 0 to 1200), Before and After curing.

Starting  Ending  Length
1         734     734
2762      2845    84
15317     15555   239

Intergenic region coordinates which are more than 50 nucleotides in length

GRAPH SHOWING THE CULLING OF THE INTERGENIC SEQUENCES BY THE C PROGRAMME THAT SELECTS ONLY THE REGIONS WHOSE LENGTH IS GREATER THAN OR EQUAL TO 50 NUCLEOTIDES
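The curing step amounts to a length filter over the raw coordinates. The sketch below is written in Python rather than C and is only an approximation of the programme described: the function name is hypothetical, length is taken as end - start + 1 (which reproduces the Length column of the raw table), and the cutoff is treated as inclusive, since the slides give it both as "more than 50" and as "greater than or equal to 50".

```python
def cure_regions(regions, min_length=50):
    """Keep the regions whose inclusive length reaches min_length.

    Assumption: the cutoff is inclusive. Regions with an end before their
    start (overlapping genes) have non-positive lengths and always drop out.
    """
    return [(start, end) for start, end in regions if end - start + 1 >= min_length]


if __name__ == "__main__":
    raw = [(1, 734), (2762, 2845), (4799, 4812), (7324, 7294), (15317, 15555)]
    print(cure_regions(raw))
```

On the thirteen raw regions listed in the slides, either reading of the cutoff keeps the same three regions, because none has a length of exactly 50.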

INTERGENIC SEQUENCES

>L43967_2762_2845 Mycoplasma genitalium G37 intergenic sequence
AAAACCTTTCATTTTTAATGTGTTATAATTATTTGTTATGCCATAAATTTAGTTTGTGGCAAAAGCTTCTGTACTGTTTATTTA
>L43967_15317_15555 Mycoplasma genitalium G37 intergenic sequence
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAAGCATAATCCTAAGGGTATTTAACTAATTATTATAACAATTTTAATTTAACCAAAATACCCCTCGAATTTTAACAGTTTTTATAATCAAAACAGCTAATTTT
>L43967_19760_19824 Mycoplasma genitalium G37 intergenic sequence
ATAAATTTAATAGTGTTGAAAGACAAACATTATTAATTTTTGATCAGCTAAATAAAACAAAGCAA
>L43967_20356_20543 Mycoplasma genitalium G37 intergenic sequence
CTCAAAAAACTAATACATCAAACTTCAACCGTTTACTTTTTTATGAACAAGCACTACAAAGGTTTTATGAAGAATTATTTCAAATAGATTATTTAAGAAGATTTGAAAACATTCCCATTAAAGATAAGAATCAAATTGCGCTTTTTAAAACTGTTTTTGATGATTACAAAACCATTGATTTAGCAGAA
…………

Intergenic sequences extracted in FASTA format
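For readers without EMBOSS, the same extraction can be approximated directly. This is a hedged sketch, not a replacement for extractseq: the function name is hypothetical, the header layout (accession_start_end followed by a description) simply mirrors the records shown above, and coordinates are taken to be 1-based and inclusive.

```python
def intergenic_fasta(genome, accession, regions, description):
    """Return FASTA text with one record per (start, end) region.

    Coordinates are 1-based and inclusive, so region (3, 6) of a sequence
    is the slice genome[2:6].
    """
    records = []
    for start, end in regions:
        sequence = genome[start - 1:end]
        records.append(f">{accession}_{start}_{end} {description}\n{sequence}")
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    # Toy genome for illustration, not real Mycoplasma sequence.
    print(intergenic_fasta("ACGTACGTAC", "X1", [(3, 6)], "toy intergenic sequence"), end="")
```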

Similarity Search - WU BLAST 2.0

Organism          Database created   Organisms in database
M.gallisepticum   gempppdb           M.genitalium, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.genitalium      gampppdb           M.gallisepticum, M.mycoides, M.penetrans, M.pneumoniae, M.pulmonis
M.mycoides        ggpppdb            M.gallisepticum, M.genitalium, M.penetrans, M.pneumoniae, M.pulmonis
M.penetrans       ggmpnpudb          M.gallisepticum, M.genitalium, M.mycoides, M.pneumoniae, M.pulmonis
M.pneumoniae      ggmpepudb          M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pulmonis
M.pulmonis        ggmpepndb          M.gallisepticum, M.genitalium, M.mycoides, M.penetrans, M.pneumoniae

Table showing the list of databases made and the organisms in each

• Six genome databases were made, each excluding one organism.

• The intergenic sequences of each organism were searched for similarity (blastn) against the database that does not contain that organism's genome.
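The leave-one-out design of the six databases can be written down compactly. The sketch below only computes which genomes belong in each database; the function name is hypothetical, and building the actual WU BLAST databases is left out.

```python
ORGANISMS = [
    "M.gallisepticum", "M.genitalium", "M.mycoides",
    "M.penetrans", "M.pneumoniae", "M.pulmonis",
]


def leave_one_out(organisms):
    """Map each query organism to the other genomes that form its database."""
    return {query: [o for o in organisms if o != query] for query in organisms}


if __name__ == "__main__":
    for query, members in leave_one_out(ORGANISMS).items():
        print(f"{query}: {', '.join(members)}")
```

Searching each organism only against the other five keeps a genome from matching itself, so every reported alignment reflects conservation between species.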

Parsing alignments - Factors

• A Perl script, blastn2qrnadepth.pl, is used to parse the blastn alignments.

• Factors considered in parsing

– I trimming

• E-value

• Minimum and maximum identity of alignments

• Length of the alignment

– II trimming

• Score

• Depth of alignments

• Shift
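The first trimming can be pictured as a simple filter over the blastn hits. The sketch below is not the logic of blastn2qrnadepth.pl: the Hit record and function name are hypothetical, and the default cutoffs are the ones printed in the parsing report of this project (minimum length 1, maximum E-value 0.01, identity between 0 and 100 percent). The second trimming (score, depth, shift) is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one blastn alignment."""
    evalue: float
    pct_identity: float
    length: int


def first_trimming(hits, min_length=1, max_evalue=0.01, min_id=0.0, max_id=100.0):
    """Keep the hits that pass the length, E-value and identity cutoffs."""
    return [
        hit for hit in hits
        if hit.length >= min_length
        and hit.evalue <= max_evalue
        and min_id <= hit.pct_identity <= max_id
    ]


if __name__ == "__main__":
    hits = [Hit(1e-5, 80.0, 120), Hit(0.5, 90.0, 60)]
    print(first_trimming(hits))  # only the first hit passes the E-value cutoff
```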

Parsing alignments – QRNA input

• The Perl script generates several files

– QRNA input file: filename.q

• It is a collection of sequences in FASTA format, in which each consecutive pair of sequences holds the two components of an alignment, with gaps left in place.

– Parsing report file: filename.q.rep

• It is a report of the blastn alignments that have been pruned in the process of creating the QRNA input file.
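The pairwise layout of the .q file can be illustrated with a short sketch. It assumes only what is stated above (two consecutive FASTA records per alignment, gaps kept) plus one check added here for safety: both gapped rows must have the same length. The function name and the example headers are hypothetical.

```python
def qrna_pair(query_header, query_row, subject_header, subject_row):
    """Return the two gapped rows of one alignment as consecutive FASTA records."""
    if len(query_row) != len(subject_row):
        raise ValueError("gapped alignment rows must have equal length")
    return f">{query_header}\n{query_row}\n>{subject_header}\n{subject_row}\n"


if __name__ == "__main__":
    # Toy alignment; '-' marks a gap left in place.
    print(qrna_pair("query-1>5", "AC-GT", "subject-10>14", "ACTGT"), end="")
```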

QRNA input file

>L43967_15317_15555-1>179-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAAGCATAATCCTAAGGGTAT-TTAACTA-ATTATTATAACAATTT
>gb-U00089--19096>19275-Mycoplasma
ACCCTCAACCTCCTGAGTGCAAATCAGGTGCTCTATCAGTTGAGCTACATCCCCATTATTGGTGGAAGTAAATGGACTTGAACCATCGACCTCACCCTTATCAGGGGTGTGCTCTAACCAACTGAGCTATACTTCCAGGCAAAATCTTC-GTACAGGTTCGCTTCATAATTATATTAATTT
>L43967_19760_19824-5<65-Mycoplasma
TTGCTTTGTTTTATTTAGCTGATCAA-AAATTAATAATGTTTGTCTTTCAACACTATTAAAT
>emb-BX293980.1--57200>57261-Mycoplasma
TTGTTTTGTTTTATTTAATTGATCAATAAATTGATTTAGTTTATCTTTATTTATTAATAAAT

Parsing Report File

FILE: genblast
DIR: /home/kalyankpy/coput2/blast//
FIRST TRIMMING
Minimum length = 1
Maximum Evalue = 0.01
Minimum %id = 0
Maximum %id = 100
SECOND TRIMMING
Alignments culled by = SC
Depth of alignments = 1
shift = 1

113-QUERY: L43967_546708_546877 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 1121
After First trimming: 88
After Second trimming: 2

57-QUERY: L43967_325878_326027 Mycoplasma genitalium G37 intergenic sequence
Total # alignments: 152
After First trimming: 3
After Second trimming: 3

……

Total #Queries 122
Total #Alignments 53927 ave_len = 309.5
After first trimming 18851 ave_len = 552.6
After second trimming 386 ave_len = 404.2

GRAPH SHOWING NUMBER OF ALIGNMENTS SELECTED FOR QRNA INPUT FOR EACH GENOME THROUGH THE PERL SCRIPT

Figure: No. of Blast hits and No. of alignments for each genome, with the No. of blastn hits selected for qrna input (organism labels: M.pen, M.myc, M.gal, M.pul, M.pne, M.gen).

Data labels, No. of Blast hits: 1852, 1787, 850, 1012, 565, 386

Data labels, No. of alignments: 53927, 44433, 360830, 154026, 430551, 560263

QRNA – PARAMETERS

• Scanning window approach

– Window = 150 nt; Extension = 50 nt

• Maximum length 9999999

• Local Viterbi algorithm

• RIBOPROB matrix

• Sequences shuffled while maintaining their composition
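The scanning-window setting can be made concrete with a small sketch that lists the windows laid over one alignment. It is an illustration of the general technique rather than QRNA's own code: the function name is hypothetical, positions are 0-based and inclusive, the window advances by the 50 nt step, and the last window is assumed to be cut short at the end of the alignment.

```python
def scan_windows(length, window=150, slide=50):
    """Return (start, end) windows, 0-based and inclusive, covering an alignment.

    Assumption: the final window is truncated at the alignment end and
    scanning stops once a window reaches that end.
    """
    windows = []
    start = 0
    while start < length:
        end = min(start + window, length) - 1
        windows.append((start, end))
        if end == length - 1:
            break
        start += slide
    return windows


if __name__ == "__main__":
    # 664 is the length of the first alignment in the QRNA output of this project.
    for start, end in scan_windows(664):
        print(start, end)
```

Under these assumptions a 664-column alignment is scored in twelve overlapping windows, the first covering positions 0 to 149.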

QRNA OUTPUT

#---------------------------------------------------------------------
# qrna 2.0.1 (Tue Aug 19 11:30:55 CDT 2003) using squid 1.5m (Sept 1997)
#---------------------------------------------------------------------
# PAM model = BLOSUM62
#---------------------------------------------------------------------
# RNA model = /mix_tied_linux.cfg
# RIBOPROB matrix = /RIBOPROB85-60.mat
#---------------------------------------------------------------------
# seq file = /home/kalyankpy/perlscriptresult/genblast.q
# #seqs: 772 (max_len = 3420)
#---------------------------------------------------------------------
# window version: window = 150 slide = 50 -- length range = [0,9999999]
#---------------------------------------------------------------------
# 1 [both strands] (sre_shuffled)
>L43967_1_734-90>722-Mycoplasma (664)
>gb-U00089--130>767-Mycoplasma (664)

length of whole alignment after removing common gaps: 664
Divergence time (variable): 0.401
[alignment ID = 61.75 MUT = 29.67 GAP = 8.58 ……
( CONTD..)

length alignment: 150 (id=61.33) (mut=32.67) (gap=6.00)(sre_shuffled)
posX: 0-149 [0-145](146) -- (0.42 0.08 0.06 0.43)
posY: 0-149 [0-144](145) -- (0.37 0.11 0.06 0.46)

L43967_1_734-90 TTAATTTTATTAAAACTATAACTTATTTTTTATAAACATTCTATGTTTTT
gb-U00089--130> TTTATTTTATTAAAATTATAATGTATTTTTGTTAAATTTT.TAATTCTTT
……

LOCAL_DIAG_VITERBI -- [Inside SCFG]
OTH ends *(+) = (0..[150]..149) OTH ends (-) = (0..[150]..149)
COD ends *(+) = (120..[27]..146) COD ends (-) = (41..[12]..52)
RNA ends *(+) = (0..[21]..20) RNA ends (-) = (0..[150]..149)
winner = OTH

OTH = 184.281 COD = 166.408 RNA = 179.710
logoddspostOTH = 0.000 logoddspostCOD = -17.873 logoddspostRNA = -4.571
sigmoidalOTH = 4.571 sigmoidalCOD = -17.932 sigmoidalRNA = -4.571
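The relationship between the three model scores and the reported winner can be checked by hand. In this window the logoddspost values are consistent with each model's score minus the OTH score, and the winner is the model with the highest score. The sketch below reproduces only that arithmetic; the function name is hypothetical, the relationship is inferred from this single output, and the sigmoidal values are not modelled.

```python
def summarize_window(scores):
    """Return (winner, log-odds of each model relative to OTH).

    Assumption: log-odds = model score - OTH score, as suggested by the
    values printed for this window.
    """
    winner = max(scores, key=scores.get)
    logodds = {model: round(score - scores["OTH"], 3) for model, score in scores.items()}
    return winner, logodds


if __name__ == "__main__":
    winner, logodds = summarize_window({"OTH": 184.281, "COD": 166.408, "RNA": 179.710})
    print(winner, logodds)
```

A window is counted as a putative ncRNA only when RNA is the winning model; here OTH wins, so this window is not reported.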

QRNA OUTPUT

Figure: Number of ncRNAs predicted for each organism (x-axis: M.pen, M.myc, M.gal, M.pul, M.pne, M.gen; y-axis: Number, 0 to 60).

Range of Non-coding RNA

Figure: Length range of the non-coding RNAs predicted (x-axis: M.pen, M.myc, M.gal, M.pul, M.pne, M.gen; y-axis: Length (nt), 0 to 350). Vertical bars represent the spread of scores and the horizontal bar represents the average.

Putative Vs Annotated

• The predicted ncRNAs were searched for similarity against the biochemically characterized ncRNAs of bacteria (Non-coding RNA database at http://biobases.ibch.poznan.pl/nc, updated 2002).

• One of them was found to be similar to the Mc_MCS4 ncRNA of Mycoplasma capricolum.

• Mc_MCS4 has already been characterized as having extensive homology with the eukaryotic U6 snRNA.

• Another motif in one of the putative ncRNAs was found to be conserved across E.coli, S.typhi and K.pneumoniae as a part of the MicF ncRNA in these organisms.

• MicF has been characterized as regulating the expression of the OmpF protein in these organisms.

• Similarity was also found with the OxyS ncRNA of E.coli.

• OxyS was found to modulate the expression of various genes in response to hydrogen peroxide.

- In Eukaryotes

• Similarity was observed with a few miRNAs present in the miRNA database (Rfam miRNA registry).

• The same stretch of sequence was present in Human, Rat and Mouse miRNA.

• Small stretches of similarity were also observed with various ncRNAs that play a role in the regulation of development.

Sequences producing High-scoring Segment Pairs: Score P(N) N

hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop 91 0.26 1
rno-mir-190 MI0000933 Rattus norvegicus miR-190 stem-loop 91 0.26 1
mmu-mir-190 MI0000232 Mus musculus miR-190 stem-loop 86 0.48 1

>hsa-mir-190 MI0000486 Homo sapiens miR-190 stem-loop
Length = 85

Minus Strand HSPs:

Score = 91 (19.7 bits), Expect = 0.31, P = 0.26
Identities = 45/68 (66%), Positives = 45/68 (66%), Strand = Minus / Plus

Query: 77 AGGTTTAGGTGTTCT-TATTT-ATTTATTAGGTTGTTTAGTT--TC-AATTATTTTTGGA 23
||| | |||| | | ||| || |||||||||||| | || || || ||| | | |
Sbjct: 4 AGGCCTCTGTGTGATATGTTTGATATATTAGGTTGTT-ATTTAATCCAACTATATATCAA 62

Query: 22 ATACTAGT 15
| | || |
Sbjct: 63 ACA-TATT 69

>Hs_NTT
Length = 17,572

Plus Strand HSPs:

Score = 116 (23.5 bits), Expect = 0.025, P = 0.024
Identities = 60/94 (63%), Positives = 60/94 (63%), Strand = Plus / Plus

Query: 11 TATTTAATATTTATAATTGCTATTTAGCATCTTAAAA-AAGA-CG-TCTTT-AAA-TATA 65
|| |||| | || ||| | | || | |||| | ||| | |||| ||| ||||
Sbjct: 5336 TACATAAT-TAGATCATTTATTCTAAGTAAATTAAGAGAAGCTCTATCTTCCAAAATATA 5394

Query: 66 GATAGTTATACTAATTAGAAAATAGTTAAT-AAG 98
|||| | || ||| |||| | ||||| |||
Sbjct: 5395 GATATCTCTAGCAAT-AGAAGAGTTTTAATTAAG 5427

CONCLUSIONS

• Comparative genomic analysis was selected for the ncRNA prediction.

• The procedure for the prediction was standardized.

• One of the putative ncRNAs was found to be similar to an already characterized ncRNA from the same genus.

• A conserved region of MicF was also found to be present in one of the putative ncRNAs.

• A counterpart of eukaryotic miRNA was identified in Mycoplasma.

Future Plans

• To develop programmes that produce the intergenic region co-ordinates when given the protein table file as input.

• To verify the validity of the predictions beyond the homologous regions found in bacteria.

• To extend the prediction procedure to Eukaryotes.

• To develop the procedure required for classifying the predicted ncRNAs into subclasses.

• To identify the functions of the putative ncRNAs by searching for their effector targets.

• To automate the whole procedure.

ACKNOWLEDGMENTS

Dr. Z. A. Rafi

Dr. S. Krishnaswamy

The Whole SBT family

Ministry of Human Resource Development

Department of Education

Department of Science and Technology

Department of Biotechnology

All my classmates
