Tracking down ncRNAs in the genomes
Jan 21, 2016

trygg

Transcript
Page 1: Tracking down ncRNAs in the genomes

Tracking down ncRNAs in the genomes

Page 2: Tracking down ncRNAs in the genomes

How to find ncRNA genes

• The stability of ncRNA secondary structure is not sufficiently different from the predicted stability of a random sequence. [Rivas and Eddy Bioinformatics (2000)].

Page 3: Tracking down ncRNAs in the genomes

RNA secondary structure prediction problem

• Algorithms/programs to compute the minimum energy:

– Nussinov et al. (1978), Waterman (1978), Smith and Waterman (1978), and Zuker and Sankoff (1984).

– Mfold (Zuker 2003) and RNAfold (ViennaRNA) (Hofacker 2003).

• RNA folding via energy minimization has its shortcomings:

– Prediction depends on correct energy parameters.

– Sometimes, the true structure does not have the minimum energy.
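To make the dynamic programming idea behind these folding programs concrete, here is a minimal sketch in the spirit of the Nussinov recursion. It maximizes the number of base pairs rather than minimizing a free energy, so it is a simplified stand-in for what Mfold or RNAfold compute. The function name `nussinov_max_pairs` and the `min_loop` parameter are illustrative choices, not taken from the slides.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style recursion).

    Assumption: Watson-Crick and G-U wobble pairs are allowed, and at least
    min_loop unpaired bases must separate the two bases of a pair.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, splitting the interval
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

For example, `nussinov_max_pairs("GGGGAAAACCCC")` finds the four G-C pairs of a single hairpin. Real energy-based folding replaces the "+1 per pair" score with loop-based energy terms.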

Page 4: Tracking down ncRNAs in the genomes

How to find ncRNA genes from a multiple alignment

– RNAs with similar functions often have similar structures.

– Sequence changes can be tolerated by covarying mutations.

GCAUCGGugGUUCagu--gguaGAAU---//---CCGAUGCa

UCUAAUAugGCAUauu-----aGUGC---//---UAUUAGAa

GGGGAUGuaGCUUagu--gguaGAGC---//---UAUCCCCa

GCCGCCGuaGCUCagcccgggaGAGC---//---CGGCGGCa

GGGCCCGuaGCUUagcucgguaGAGC---//---CGGGCCCa
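The covariation in this alignment can be checked directly. The sketch below takes the first seven columns and the seven columns before the final base of each sequence above (the two arms of the outer stem) and, for each column pair, reports whether every sequence can form a canonical or G-U pair and how many distinct pair types occur. The helper name `column_pair_types` and the choice to count G-U as a valid pair are assumptions made for illustration.

```python
# Canonical Watson-Crick pairs plus G-U wobble pairs (an assumption here).
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

# The 5' and 3' arms of the outer stem, copied from the alignment above.
FIVE_PRIME = ["GCAUCGG", "UCUAAUA", "GGGGAUG", "GCCGCCG", "GGGCCCG"]
THREE_PRIME = ["CCGAUGC", "UAUUAGA", "UAUCCCC", "CGGCGGC", "CGGGCCC"]


def column_pair_types(five, three):
    """For each stem column, return (all sequences pair?, distinct pair types)."""
    width = len(five[0])
    result = []
    for col in range(width):
        # Column col of the 5' arm pairs with the mirrored column of the 3' arm.
        pairs = [f[col] + t[width - 1 - col] for f, t in zip(five, three)]
        consistent = all(p in CANONICAL for p in pairs)
        result.append((consistent, len(set(pairs))))
    return result
```

Every column pair is consistent across the five sequences, and several columns carry three or four different pair types. Pairing preserved while the bases change is the compensatory mutation signal that suggests a conserved structure.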

Page 5: Tracking down ncRNAs in the genomes

RNAalifold

• If we have correct multiple alignments, looking for covarying mutations and finding a consensus structure is a good way to do structure prediction.

– RNAalifold (Hofacker et al. 2002)

– The consensus structure prediction is more accurate.

– Finding an energetically stable consensus structure is more statistically significant.

– It still computes the minimum free energy (MFE).

– Covariance information is incorporated into the energy model by rewarding compensatory and consistent mutations.

Page 6: Tracking down ncRNAs in the genomes

Loop based energy models

• Stacks (contiguous nested base pairs) are the dominant stabilizing force and contribute negative energy.

• Unpaired bases form loops, which contribute positive energy.

– Hairpin loops, bulge/internal loops, and multiloops.

• Take into account the covariance contribution.

• Take into account inconsistent sequences.

• Put the terms together.

Page 7: Tracking down ncRNAs in the genomes

Mountain representation of E. coli 16 S rRNA

Page 8: Tracking down ncRNAs in the genomes

How to use it to detect new ncRNAs?

• Finding an energetically stable consensus structure is more statistically significant.

• The MFE can be used to compute the statistical significance.

– MFE of the alignment: m

– Mean of the MFEs of the randomized alignments: μ

– Standard deviation of those MFEs: σ

– z-score: z = (m - μ)/σ

• We need to randomize the multiple sequence alignment.

– Shuffle the columns of the input alignment.

• Do not destroy the gap structure.

• Preserve certain sequence patterns.
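The randomization and the z-score can be sketched as follows. This is a minimal illustration, not the Alifoldz implementation: columns are only exchanged with other columns that have the same gap pattern (one simple way to leave the gap structure intact), and the background MFE values are passed in as plain numbers because no folding routine is included here. The names `shuffle_columns` and `z_score` are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def shuffle_columns(alignment, rng):
    """Permute alignment columns, but only among columns sharing a gap pattern."""
    n_cols = len(alignment[0])
    columns = ["".join(row[j] for row in alignment) for j in range(n_cols)]
    # Group column indices by which rows have a gap in that column.
    groups = defaultdict(list)
    for j, col in enumerate(columns):
        groups[tuple(ch == "-" for ch in col)].append(j)
    shuffled = list(columns)
    for positions in groups.values():
        permuted = positions[:]
        rng.shuffle(permuted)
        for target, source in zip(positions, permuted):
            shuffled[target] = columns[source]
    # Turn the list of columns back into a list of rows.
    return ["".join(col[i] for col in shuffled) for i in range(len(alignment))]


def z_score(m, background):
    """z = (m - mu) / sigma, with mu and sigma estimated from background MFEs."""
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    return (m - mu) / sigma
```

In use, one would fold the real alignment to get m, fold many shuffled alignments to get the background values, and call `z_score(m, background)`. A strongly negative z-score means the real alignment is more stable than its shuffled versions.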

Page 9: Tracking down ncRNAs in the genomes

Alifoldz

Page 10: Tracking down ncRNAs in the genomes

Distribution of z-scores for the tRNA test sets

Page 11: Tracking down ncRNAs in the genomes

Sensitivity depends on sequence divergence and the quality of the alignment

(Based on pairwise alignments of SRP RNAs)

Page 12: Tracking down ncRNAs in the genomes

Sensitivity on known ncRNAs in S. cerevisiae

• Use MultiPipMaker to generate the multiple alignment of S. cerevisiae and 6 other related yeast genomes.

• Extract the regions of annotated ncRNAs.

• Refine the poorly aligned regions.

• Window size = 150, slide = 20.

• False-positive rate: 0.25%.

• 30 CPU days.
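The window scan can be sketched as below, using the window size of 150 and slide of 20 from the slide. How the original screen handled the end of an alignment is not stated, so the extra final window that covers the tail is an assumption, as is the function name `sliding_windows`.

```python
def sliding_windows(n_columns, size=150, step=20):
    """Return half-open (start, end) column ranges covering an alignment."""
    if n_columns <= size:
        # Short alignments are scored as a single window.
        return [(0, n_columns)]
    windows = [(s, s + size) for s in range(0, n_columns - size + 1, step)]
    # Assumption: add one last window so the final columns are not skipped.
    if windows[-1][1] < n_columns:
        windows.append((n_columns - size, n_columns))
    return windows
```

Each window would then be scored independently, which is why the z-score computation dominates the running time.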

Page 13: Tracking down ncRNAs in the genomes

Some issues about Alifoldz

• Time consuming to compute the z-score

• If the sequence identity is too high (>95%), it does not work.

• If the sequence identity is too low (<60%), it does not work either.

Page 14: Tracking down ncRNAs in the genomes

RNAz (PNAS, 2005)

• z-score (for an individual sequence)

– Computed using Support Vector Machine (SVM) regression.

– Using 1,000 random sequences at each of ~10,000 sample points to compute the distribution.

• Same length.

• Same base composition.

– Mean (μ) and standard deviation (σ)

• Are functions of the length and base composition.

– z-score: z = (m - μ)/σ

– For an alignment, use the mean of the z-scores.

Page 15: Tracking down ncRNAs in the genomes

z-scores calculated by SVM vs. sampled z-scores

Page 16: Tracking down ncRNAs in the genomes

Structure conservation index (SCI)

• It is a measure of structure conservation based on computing a consensus structure.

– Using RNAalifold.

– If SCI -> 0, RNAalifold does not find a consensus structure.

– If SCI -> 1, the sequences have a conserved consensus structure.
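In the RNAz paper, the SCI is defined as the ratio of the consensus MFE computed by RNAalifold to the mean MFE of the individual sequences folded on their own. The sketch below only does that arithmetic. The energies are supplied by the caller, and the function name `structure_conservation_index` and the zero return for unfoldable sequences are illustrative assumptions.

```python
def structure_conservation_index(consensus_mfe, single_mfes):
    """SCI = consensus MFE / mean of the single-sequence MFEs.

    Energies are in kcal/mol and are negative for stable structures.
    """
    mean_single = sum(single_mfes) / len(single_mfes)
    if mean_single == 0:
        # Assumption: no sequence folds at all, so report no conservation.
        return 0.0
    return consensus_mfe / mean_single
```

A consensus MFE of 0 gives SCI = 0 (no common structure), while a consensus MFE equal to the average single-sequence MFE gives SCI = 1.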

Page 17: Tracking down ncRNAs in the genomes

Classification based on both scores

• Estimate the probability (P) that the alignment is classified as a functional RNA, based on:

– SCI

– z-score

– Average pairwise identity

– Number of sequences

• This classification is also done by an SVM.
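Of the four classifier inputs, the average pairwise identity is the easiest to compute directly. The sketch below is one possible definition, assumed here for illustration: columns where both sequences have a gap are ignored, and a gap against a base counts as a mismatch. RNAz's own gap handling may differ.

```python
from itertools import combinations


def mean_pairwise_identity(alignment):
    """Average fraction of identical columns over all sequence pairs."""
    if len(alignment) < 2:
        raise ValueError("need at least two aligned sequences")
    values = []
    for a, b in combinations(alignment, 2):
        # Skip columns that are gaps in both sequences (an assumption).
        columns = [(x, y) for x, y in zip(a, b)
                   if not (x == "-" and y == "-")]
        matches = sum(x == y for x, y in columns)
        values.append(matches / len(columns) if columns else 0.0)
    return sum(values) / len(values)
```

The same quantity explains the limits noted for Alifoldz: with identity above about 95% there is too little variation to see covariation, and below about 60% the alignment itself becomes unreliable.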

Page 18: Tracking down ncRNAs in the genomes

Classification based on z scores and SCI by a SVM

Page 19: Tracking down ncRNAs in the genomes

Some test families

Page 20: Tracking down ncRNAs in the genomes

Comparison with other methods

Page 21: Tracking down ncRNAs in the genomes

Screening the Comparative Regulatory Genomics (CORG) Database

Page 22: Tracking down ncRNAs in the genomes

Using RNAz to screen the human genome

• Nature Biotechnology  23, 1383 - 1390 (Nov. 2005), “Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome”

• Input:

– Genome-wide alignments of vertebrates from the UCSC genome browser.

– Use the PhastCons program to find the most conserved regions.

– Adjacent conserved regions (less than 50 bp apart) are joined together.

– Keep all regions longer than 50 bp.

– Remove all “known genes” and “RefSeq genes”.

• Output:

– Predicted structured RNA elements in the human genome using RNAz.
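The joining and length filtering of conserved regions in the input step can be sketched as a standard interval merge. The coordinates are assumed to be half-open (start, end) pairs in base pairs on one chromosome, and the thresholds are read as "gap strictly below 50" and "length strictly above 50". The function name `merge_conserved_regions` is illustrative.

```python
def merge_conserved_regions(regions, max_gap=50, min_length=50):
    """Join regions closer than max_gap, then keep those longer than min_length."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < max_gap:
            # Close enough to the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s > min_length]
```

For instance, regions (0, 30) and (60, 100) are 30 bp apart and are joined into (0, 100), while an isolated 20 bp region is dropped.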

Page 23: Tracking down ncRNAs in the genomes

Results

Page 24: Tracking down ncRNAs in the genomes
Page 25: Tracking down ncRNAs in the genomes

Statistical analysis of predicted RNAs

Page 26: Tracking down ncRNAs in the genomes

Comparison with some RNA database

Page 27: Tracking down ncRNAs in the genomes
Page 28: Tracking down ncRNAs in the genomes

Problem:

• Correctly aligning multiple and divergent RNA sequences without taking into account the structural information is difficult.

• To get covarying mutation, we need an alignment.

• Do the multiple alignment and structure prediction at the same time.

Page 29: Tracking down ncRNAs in the genomes

RNA consensus folding problem

RNA consensus folding problem: computing the common secondary structure for a set of unaligned RNA sequences

– Sankoff (1985) first proposed an algorithm to simultaneously align RNA sequences and find the optimal common fold.

• Time complexity is O(n^6) for two sequences of length n.

• Implemented as Dynalign (Mathews and Turner 2002).

• It is not practical for multiple sequences.

– Eddy and Durbin (1994) and other groups used stochastic context-free grammars to predict the consensus structure.

• Start from a seed alignment.

• Stochastic iteration to improve the alignment and predict the structure.

• Need a good seed alignment.

Page 30: Tracking down ncRNAs in the genomes

Motivation for a new approach

– Base pairs appear in ‘clusters’: we call them stacks, which are energetically favorable.

– Most of the stability of the RNA secondary structure is determined by stacks.

ACCUU AAGGA

p = (1/4)^5 < 0.001.

– Stacks are much less likely to occur by chance.
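The probability on this slide follows from a simple model, assumed here for illustration: bases are independent and uniform over A, C, G, U, and each position of one arm must be matched by exactly one complementary base on the other arm. The chance that a stack of a given length arises at a fixed pair of positions is then (1/4) raised to that length.

```python
def stack_chance_probability(length):
    """Probability that `length` consecutive positions pair by chance.

    Assumes independent, uniformly distributed bases and exactly one
    allowed partner per base (no G-U wobble pairs).
    """
    return 0.25 ** length
```

For a stack of length 5 this gives 0.0009765625, just under the 0.001 quoted above. Allowing G-U pairs would raise the per-position probability, so the figure is a lower bound under this model.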

Page 31: Tracking down ncRNAs in the genomes

Statistics of the stacks in Rfam database

[Figure: fraction of true stacks missed (y-axis, 0 to 1) versus length of stacks (x-axis, 1 to 10).]

Page 32: Tracking down ncRNAs in the genomes

Using stacks as anchors for predictions

• The idea of anchors as constraints has been used in multiple genomic sequence alignment.

– MAVID (Bray and Pachter, 2004)

– TBA (Blanchette et al., 2004)

• Several heuristic methods have been developed by finding anchored stacks:

– Waterman (1989) used a statistical approach to choose conserved stacks within fixed-size windows.

– Ji and Stormo (2004) and Perriquet et al. (2003) use primary sequence conservation of the stacks and the length of loop regions to reduce the search space.

– Stack anchors have low sequence similarity.

– It is hard to find the correct anchors.

Page 33: Tracking down ncRNAs in the genomes

Problem:

• Selecting one stack at a time may lead to wrongly matched stacks.

Page 34: Tracking down ncRNAs in the genomes

A global approach: configuration of stacks

• RNA secondary structure can be viewed as stacks plus unpaired loops. (no individual base-pairs)

• The energy of the structure is the sum of the energies of stacks and loops.

• Stack configuration:

– Nested stacks

– Parallel stacks

– Crossing stacks (pseudoknots)

• More generalized stacks can include mismatches in the stacks.
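The three configurations can be told apart from the positions of the two arms of each stack. The sketch below is illustrative only: the `Stack` record, its inclusive coordinates, and the extra "overlapping" label for stacks that share positions are assumptions, not part of the slides.

```python
from typing import NamedTuple


class Stack(NamedTuple):
    """A stack given by the inclusive positions of its 5' and 3' arms."""
    five_start: int
    five_end: int
    three_start: int
    three_end: int


def configuration(s, t):
    """Classify two stacks as 'parallel', 'nested', 'crossing' or 'overlapping'."""
    if s.five_start > t.five_start:
        s, t = t, s  # make s the stack that starts first
    if s.three_end < t.five_start:
        return "parallel"  # t lies entirely to the right of s
    if s.five_end < t.five_start and t.three_end < s.three_start:
        return "nested"  # t lies inside the loop enclosed by s
    if (s.five_end < t.five_start and t.five_end < s.three_start
            and s.three_end < t.three_start):
        return "crossing"  # arms interleave: a pseudoknot
    return "overlapping"  # the stacks share sequence positions
```

A consensus folding method that excludes pseudoknots would accept only the parallel and nested cases when assembling a configuration.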

Page 35: Tracking down ncRNAs in the genomes

RNA Stack-based Consensus Folding (RNAscf) problem

• Find conserved stack configurations for a set of unaligned RNA sequences.

• Optimize both stability (free energy) of the structure and sequence similarity computed based on these common stacks as anchors.
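Before stack configurations can be compared across sequences, the candidate stacks in each sequence have to be listed. The sketch below enumerates maximal runs of consecutive base pairs by brute force. It is not the RNAscf procedure: the parameter names `min_len` and `min_loop`, the inclusion of G-U pairs, and the requirement of exact pairing (no mismatches) are assumptions made for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
         ("C", "G"), ("G", "U"), ("U", "G")}


def candidate_stacks(seq, min_len=4, min_loop=3):
    """List maximal stacks as (i, j, length), where i pairs with j outermost."""
    n = len(seq)
    stacks = []
    for i in range(n):
        for j in range(i + 1, n):
            if (seq[i], seq[j]) not in PAIRS:
                continue
            # Skip pairs that merely continue a stack starting further out.
            if i > 0 and j + 1 < n and (seq[i - 1], seq[j + 1]) in PAIRS:
                continue
            length = 0
            # Extend inward while bases pair and the enclosed loop is long enough.
            while ((j - length) - (i + length) - 1 >= min_loop
                   and (seq[i + length], seq[j - length]) in PAIRS):
                length += 1
            if length >= min_len:
                stacks.append((i, j, length))
    return stacks
```

On the toy hairpin `"GGGGAAAACCCC"` this returns the single stack `(0, 11, 4)`. On real sequences the list is longer, and choosing a consistent subset of it is the configuration problem described above.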

Page 36: Tracking down ncRNAs in the genomes

RNA stack-based consensus folding for pairwise sequences

Page 37: Tracking down ncRNAs in the genomes

Matching stack configurations on two sequences

The cost function combines the following terms:

– Weights of the different costs.

– Energy of the consensus structure.

– Sequence similarity of stacks.

– Sequence similarity of unpaired regions.

Page 38: Tracking down ncRNAs in the genomes

RNA Stack-based Consensus Folding for multiple sequences

Page 39: Tracking down ncRNAs in the genomes

Cost function for multiple sequences

A1,1 A1,2 A1,3 A1,4 A1,5 A1,6 ... A1,k-2 A1,k-1 A1,k

A2,1 A2,2 A2,3 A2,4 A2,5 A2,6 ... A2,k-2 A2,k-1 A2,k

...

As,1 As,2 As,3 As,4 As,5 As,6 ... As,k-2 As,k-1 As,k

Page 40: Tracking down ncRNAs in the genomes

Compute an optimal stack configuration for two sequences

• A dynamic programming algorithm is used to align RNA sequences and find an optimal configuration at the same time.

– The algorithm is similar to prior work (Sankoff 1985, Bafna et al. 1995).

– Differences:

• We use stacks as the basic structural elements.

• Prior work used individual base pairs.

– The computational time is O(n^4) (n is the number of stacks).

• Sankoff’s algorithm is O(m^6) (m is the length of the sequences).

• The number of possible stacks (size >= 4) is much smaller than the length of the sequence.

• It is much faster.

Page 41: Tracking down ncRNAs in the genomes

For any pair of stacks, there are three choices:

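[Figure: the three cases for a pair of matched stacks PA and PB: hairpin loop (Loop(PA) and Loop(PB)), interior loop/bulge (inner stacks PX and PY), and multi-loop (inner stacks P1A ... PiA and P1B ... PjB).]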

Page 42: Tracking down ncRNAs in the genomes

The score of matching stacks:

[Figure: matched stacks PA and PB.]

Page 43: Tracking down ncRNAs in the genomes

The score of matching hairpin loops:

[Figure: matched stacks PA and PB closing the hairpin loops Loop(PA) and Loop(PB).]

Page 44: Tracking down ncRNAs in the genomes

The score of matching interior loops or bulges:

[Figure: matched stacks PA and PB with inner stacks PX and PY and the loop regions Loop(PX,PA) and Loop(PY,PA).]

Page 45: Tracking down ncRNAs in the genomes

The score of matching two multi-loops:

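[Figure: matched multi-loops closed by PA and PB, with inner stacks P1A ... PiA and P1B ... PjB and the loop regions Loop(Pi,PA) and Loop(Pi,PB).]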

Page 46: Tracking down ncRNAs in the genomes

Consensus folding for multiple sequences

• We use a heuristic method based on the notion of star alignment.

– Compute an optimal configuration from a random seed pair.

– Align all individual sequences to this configuration.

– Choose the conserved stack configuration in all sequences.

– Allow some stacks to be partially conserved (appearing in at least a certain fraction of the sequences).
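The last step of the heuristic, keeping stacks that are conserved in enough sequences, can be sketched as a simple vote. The input format is hypothetical: for each sequence, the set of seed-configuration stack labels that were matched when that sequence was aligned to the seed. The name `conserved_stacks` and the `min_fraction` parameter are also illustrative.

```python
from collections import Counter


def conserved_stacks(matches, min_fraction):
    """Return stack labels found in at least min_fraction of the sequences.

    matches: one collection of matched stack labels per sequence.
    """
    n = len(matches)
    counts = Counter(stack for found in matches for stack in set(found))
    return sorted(s for s, c in counts.items() if c / n >= min_fraction)
```

Setting `min_fraction` to 1.0 demands full conservation, while lower values allow the partially conserved stacks mentioned above.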

Page 47: Tracking down ncRNAs in the genomes

Compute the stack configuration for multiple sequences: RNAscf(k,h,f)


Page 48: Tracking down ncRNAs in the genomes

Iterative procedure for RNAscf

1. P = RNAscf(k, h, f).

2. In each sequence, extract the unpaired regions according to the loop regions in P.

3. Predict additional putative stacks that do not cross P, using smaller k’ and h’.

4. Recompute the alignment with the additional putative stacks using RNAscf(k’,h’,f).

Page 49: Tracking down ncRNAs in the genomes

Test dataset

• We choose a set of 12 RNA families from the Rfam database:

– 20 sequences with annotated structures are chosen from each family (except for CRE and glmS, for which we choose 10 sequences).

– There are 953 stacks.

– We compare RNAscf with 3 other programs that are available online for RNA folding:

• RNAfold (energy-based minimization) (Hofacker 2003)

• COVE (covariance model) (Eddy and Durbin 1994)

– COVE needs a starting seed alignment, which is produced by ClustalW.

• comRNA (computing anchors in multiple sequences) (Ji, Xu and Stormo 2004).

– Sensitivity: the fraction of true stacks that overlap with predicted stacks.

– Accuracy: the fraction of predicted stacks that overlap with true stacks.
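The two measures can be computed as below. The slides do not say what counts as an overlap, so this sketch assumes that two stacks overlap when they share at least one base pair. Stacks are written as (i, j, length) with i paired to j as the outermost pair, a representation chosen here for illustration.

```python
def stack_pairs(i, j, length):
    """Expand a stack (i, j, length) into its set of base pairs."""
    return {(i + d, j - d) for d in range(length)}


def overlaps(a, b):
    """Assumed overlap rule: the two stacks share at least one base pair."""
    return bool(stack_pairs(*a) & stack_pairs(*b))


def sensitivity_and_accuracy(true_stacks, predicted_stacks):
    """Return (sensitivity, accuracy) as defined on the slide."""
    hit_true = sum(any(overlaps(t, p) for p in predicted_stacks)
                   for t in true_stacks)
    hit_pred = sum(any(overlaps(p, t) for t in true_stacks)
                   for p in predicted_stacks)
    sensitivity = hit_true / len(true_stacks) if true_stacks else 0.0
    accuracy = hit_pred / len(predicted_stacks) if predicted_stacks else 0.0
    return sensitivity, accuracy
```

With true stacks `[(0, 30, 5), (10, 20, 4)]` and predictions `[(1, 29, 4), (40, 60, 4)]`, one of two true stacks is recovered and one of two predictions is correct, so both measures are 0.5.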

Page 50: Tracking down ncRNAs in the genomes

Test results

Page 51: Tracking down ncRNAs in the genomes

Test results

[Figure: sensitivity (y-axis, 0 to 1) of RNAfold, COVE, comRNA, and RNAscf for each family: 5s_rRNA, CRE_220(*), ctRNA_236(+), glmS(*), hammer_3, intron_II(+), lysine, purine, sam_ribo, thiamine(+), tRNA, ykok_element.]

Page 52: Tracking down ncRNAs in the genomes

Test results

[Figure: accuracy (y-axis, 0 to 1) of RNAfold, COVE, comRNA, and RNAscf for each family: 5s_rRNA, CRE_220(*), ctRNA_236(+), glmS(*), hammer_3, intron_II(+), lysine, purine, sam_ribo, thiamine(+), tRNA, ykok_element.]

Page 53: Tracking down ncRNAs in the genomes

Performance improves when the number of sequences increases

[Figure: sensitivity and accuracy (y-axis, 0.8 to 0.94) versus number of input sequences (x-axis, 0 to 80).]

(Using Thiamine riboswitch subfamily (RF00059))

Page 54: Tracking down ncRNAs in the genomes

RNAscf always finds the right consensus stack configuration.

(Sam riboswitch (RF00162))

Page 55: Tracking down ncRNAs in the genomes

Conclusion and future work

• RNAscf is a valid approach to RNA consensus structure prediction.

– Use stack configuration to represent RNA secondary structure.

– Propose a dynamic programming algorithm to find the optimal stack configuration for pairwise sequences.

– Use both primary sequence information and energy information.

– Use a star-alignment-like heuristic method to get the consensus structure for multiple sequences.

• Future work:

– Correct errors by using a stochastic iterative scheme (such as Gibbs sampling).

– Provide a P-value for each prediction.

– Use RNAscf to find new families of ncRNAs.

– Perform constrained folding to refine the predicted structure by adding some minor base pairs.