Noncoding RNA Genes Pt. 2 SCFGs CS374 Vincent Dorie.

Dec 21, 2015

Page 1:

Noncoding RNA Genes Pt. 2: SCFGs

CS374

Vincent Dorie

Page 2:

Motivation

Noncoding RNA genes can be anywhere

Noncoding RNA genes can do anything

Page 3:

Location

rRNA, snRNA

Exons?

Introns

Viral vectors

Page 4:

Function

Page 5:

Function, pt. 2

Page 6:

Overview

“RSEARCH: Finding homologs of single structured RNA sequences” by Klein and Eddy (2003)

“Pairwise RNA Structure Comparison with Stochastic Context-Free Grammars” by Holmes and Rubin (2002)

Page 7:

Comparison - Methodology

RSEARCH DART (Stemloc)

Sequence

Page 8:

Comparison, Pt. 2 - Uses

RSEARCH: Find parts of a genome which may be homologous to query sequence

RSEARCH: More practical in comparative genomics

DART (Stemloc): Investigate a specific sequence suspected of being homologous to query sequence

Page 9:

Comparison, Pt. 3 - Complexity

RSEARCH: $O((M - B)LD + BLD^2)$ to scan; $O(M^4)$ to calculate statistics

DART (Stemloc): between $O(LM)$ and $O(L^3 M^3)$

Page 10:

Background: Context Free Grammars

Four-tuple {N, T, S, P}

N is a set of nonterminals

T is a set of terminals

S is the start symbol, S ∈ N

P is a set of productions

Page 11:

Context Free Grammars, pt. 2: Sample Grammar

N = {S, A, B}

T = {a, u, c, g, }

P = {

S -> A | B,

A -> aAc | aBc | g,

B -> g

}

Page 12:

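Context Free Grammars, pt. 3: Parse Trees

Parse: aagcc

[Figure: two parse trees for aagcc. Tree 1 applies S -> A, A -> aAc, A -> aAc, A -> g. Tree 2 applies S -> A, A -> aAc, A -> aBc, B -> g.]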
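
The sample grammar is small enough to check by brute force. The sketch below encodes its productions as a dictionary and counts the distinct parses of a string; `GRAMMAR` and `count_parses` are illustrative names, and the code assumes single-character symbols with uppercase letters as nonterminals.

```python
from functools import lru_cache

# Productions of the sample grammar; uppercase = nonterminal, lowercase = terminal.
GRAMMAR = {
    "S": ["A", "B"],
    "A": ["aAc", "aBc", "g"],
    "B": ["g"],
}


@lru_cache(maxsize=None)
def count_parses(symbols: str, text: str) -> int:
    """Count the distinct ways the symbol string `symbols` can derive `text`."""
    if not symbols:
        return 1 if not text else 0
    head, rest = symbols[0], symbols[1:]
    if head not in GRAMMAR:
        # Terminal: it must match the next character of the text.
        return count_parses(rest, text[1:]) if text[:1] == head else 0
    total = 0
    # Nonterminal: try every production and every split point of the text.
    for rhs in GRAMMAR[head]:
        for cut in range(len(text) + 1):
            left = count_parses(rhs, text[:cut])
            if left:
                total += left * count_parses(rest, text[cut:])
    return total
```

Calling `count_parses("S", "aagcc")` counts the parse trees of the example string. More than one parse means the grammar is ambiguous, which is what makes probabilities over parses useful.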

Page 13:

Stochastic CFG

Each production associated with a probability

Probabilities for all productions starting from a given nonterminal sum to one

Superset of HMMs

Assigns a probability to a parse

E.g. S -> A, 0.3 | B, 0.7
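
A minimal sketch of how an SCFG scores a parse: the probability of a parse is the product of the probabilities of the productions it uses. Only the S probabilities (0.3 and 0.7) come from the slide; the values for A's productions are assumed for illustration, and `RULES`, `check_normalized` and `parse_probability` are hypothetical names.

```python
import math

# S -> A (0.3) and S -> B (0.7) are from the slide.
# The probabilities for A's productions are assumed for illustration only.
RULES = {
    "S": {"A": 0.3, "B": 0.7},
    "A": {"aAc": 0.4, "aBc": 0.1, "g": 0.5},  # assumed values
    "B": {"g": 1.0},
}


def check_normalized(rules) -> bool:
    """Probabilities of all productions from one nonterminal must sum to one."""
    return all(math.isclose(sum(p.values()), 1.0) for p in rules.values())


def parse_probability(derivation) -> float:
    """Multiply the probabilities of the productions used in a parse."""
    prob = 1.0
    for lhs, rhs in derivation:
        prob *= RULES[lhs][rhs]
    return prob


# The two parses of "aagcc" under the sample grammar.
parse_1 = [("S", "A"), ("A", "aAc"), ("A", "aAc"), ("A", "g")]
parse_2 = [("S", "A"), ("A", "aAc"), ("A", "aBc"), ("B", "g")]
```

With these assumed numbers the two parses of aagcc get different probabilities, so the grammar can rank alternative structures for the same sequence.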

Page 14:

Pairwise (profile) SCFG

Terminals in each production can exist in each of two strings

E.g. $W \to x_i \, y_k \, V \, x_j \, y_l$

Page 15:

RSEARCH: pSCFG Simplified

Each secondary structure specifies (most of) a grammar, creating a “Model Architecture”

Eschews probabilistic interpretation

Problem becomes fitting target to model architecture

Sequence

Page 16:
Page 17:

Node Types vs. Node States

Node types are what we want to do given the model (e.g. MATP is match pair)

Node state represents what happens when scanning a target sequence

E.g. Node type is MATP, target sequence doesn’t have a pair in that location -> insert a gap

Page 18:

Node States

Set of node states possible for node type

Page 19:

Gap Classes

Gap class per node type/state pair

Page 20:

Transition Scores

Gap class determines transition scores

Gap penalties are affine
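
As a reminder of what "affine" means here, a minimal sketch: a gap costs one opening penalty plus a smaller penalty for each further position. The convention below (the first gapped position pays only the open cost) and the name `affine_gap_penalty` are assumptions; the slides do not give RSEARCH's exact convention or values.

```python
def affine_gap_penalty(length: int, gap_open: float, gap_extend: float) -> float:
    """Penalty for a gap of `length` positions under an affine model.

    Assumed convention: the first position costs `gap_open`,
    every additional position costs `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend
```

With `gap_extend` smaller than `gap_open`, one long gap is cheaper than several short ones of the same total length, which is the point of using affine rather than linear penalties.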

Page 21:

Emission Scores

Emission scores determined empirically

Page 22:
Page 23:

Parameterizing the Model: Emission Scores

Substitution Matrices

Single nucleotides:

$s_{ij} = \log_2 \frac{f_{ij}}{g_i g_j}$

     A       U       C       G
A    s_AA    s_AU    s_AC    s_AG
U    -       s_UU    s_UC    s_UG
C    -       -       s_CC    s_CG
G    -       -       -       s_GG

Base pairs:

$s_{ijkl} = \log_2 \frac{f_{ijkl}}{g_i g_j g_k g_l}$

      AA        AU        AC        AG        UA        …
AA    s_AAAA    s_AAAU    s_AAAC    s_AAAG    s_AAUA    …
AU    -         s_AUAU    s_AUAC    s_AUAG    s_AUUA    …
AC    -         -         s_ACAC    s_ACAG    s_ACUA    …
AG    -         -         -         s_AGAG    s_AGUA    …
UA    -         -         -         -         s_UAUA    …
…     …         …         …         …         …         …

Scores are observed / random
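
Both score formulas have the same shape: log base 2 of an observed joint frequency over the product of background frequencies. A minimal sketch, with `log_odds_score` as an illustrative name and made-up frequencies in the usage note:

```python
import math
from typing import Sequence


def log_odds_score(observed: float, background: Sequence[float]) -> float:
    """log2(observed joint frequency / product of background frequencies).

    Pass two background frequencies for a single-nucleotide score s_ij,
    four for a base-pair score s_ijkl.
    """
    expected = math.prod(background)  # frequency expected at random
    return math.log2(observed / expected)
```

For example, with a uniform background of 0.25 per nucleotide, an aligned pair seen with frequency 0.125 is twice as common as chance, so `log_odds_score(0.125, [0.25, 0.25])` is 1 bit. These numbers are invented for illustration and are not RIBOSUM values.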

Page 24:

RIBOSUM Matrices

Start with MSA

Whose MSA?

RIBOSUM[X, Y]

Sequences X% identical are reweighted to sum to 1

Only sequences Y% identical are counted in making matrices

Page 25:

Model Parameters

Gap open penalty (single and pair)

Gap extension penalty (single and pair)

Internal start penalty

Internal end penalty

Page 26:

Solution

Guess and check

“We might have been able to derive a more robust parameter set had we used a more comprehensive set of tests, but the long running time required by RSEARCH makes such an approach infeasible.”

Page 27:

Digression: Biostatistics

Confidence intervals

Expectation values

Page 28:

Gumbel Distribution

Parameterized by $\lambda$ and $K$

$E = K N e^{-\lambda x}$, $P = 1 - e^{-E}$
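
A minimal sketch of turning a score into an E-value and a P-value with the Gumbel formulas on this slide. The function names are illustrative, `lam` is the rate parameter (λ), and the meaning of N is assumed to be the size of the search space, which the slide does not spell out.

```python
import math


def e_value(score: float, lam: float, k: float, n: float) -> float:
    """Expected number of chance hits scoring at least `score`: E = K * N * exp(-lam * x)."""
    return k * n * math.exp(-lam * score)


def p_value(e: float) -> float:
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is tiny, where P is approximately E.
    return -math.expm1(-e)
```

For small E the two numbers are nearly equal, so the distinction only matters for marginal hits.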

Page 29:

Gumbel Distribution, pt. 2

K and λ depend on G+C content of target database

For database with heterogeneous G+C content, compute K and λ for G+C bins
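
One way the binning could look, as a sketch only: compute the G+C fraction of a database window and map it to a bin, so that each bin can carry its own Gumbel parameters. The number of bins and the names `gc_fraction` and `gc_bin` are assumptions; the slide does not say how the bins are defined.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C residues in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc_bin(seq: str, n_bins: int = 10) -> int:
    """Index of the equal-width G+C bin a sequence falls into (n_bins is assumed)."""
    # A fraction of exactly 1.0 is clamped into the last bin.
    return min(int(gc_fraction(seq) * n_bins), n_bins - 1)
```

A hit would then be scored with the parameters stored for `gc_bin(window)` rather than with one global pair.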

Page 30:

Putting it All Together

Run against database substrings of length two times the query

Greedily take K best, non-overlapping hits

Recover alignments

Report: score, position in database, alignment, E-value, P-value

Statistics need to be calculated for every query and target database
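
The greedy hit-selection step can be sketched as follows: sort candidate hits by score and keep each one that does not overlap a hit already kept, until K are chosen. The hit representation (score, start, end with an exclusive end) and the name `greedy_non_overlapping` are assumptions, not RSEARCH's actual data structures.

```python
from typing import List, Tuple

Hit = Tuple[float, int, int]  # (score, start, end); end is exclusive (assumed layout)


def greedy_non_overlapping(hits: List[Hit], k: int) -> List[Hit]:
    """Keep up to k highest-scoring hits whose database intervals do not overlap."""
    if k <= 0:
        return []
    chosen: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h[0], reverse=True):
        _, start, end = hit
        # Two half-open intervals are disjoint if one ends before the other starts.
        if all(end <= s or start >= e for _, s, e in chosen):
            chosen.append(hit)
            if len(chosen) == k:
                break
    return chosen
```

Greedy selection is simple and fast, but it can drop a hit that overlaps a slightly better one even if keeping it would have allowed a better overall set.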

Page 31:

Time

For a 113 nt sequence against $2.1 \times 10^7$ nt database, 2.9 CPU days. 2% computing statistics

For a 330 nt sequence against $2.1 \times 10^7$ nt database, 38 CPU days. 7% computing statistics

Parallelized to 33 minutes and 7.4 hours respectively
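
A quick back-of-the-envelope check of the speedup implied by these timings. This is arithmetic on the numbers above, not a figure reported on the slide; `speedup` is an illustrative helper.

```python
def speedup(cpu_days: float, wall_hours: float) -> float:
    """Ratio of serial CPU time to parallel wall-clock time."""
    return cpu_days * 24.0 / wall_hours


short_query = speedup(2.9, 33 / 60)  # 113 nt query: 2.9 CPU days -> 33 minutes
long_query = speedup(38, 7.4)        # 330 nt query: 38 CPU days -> 7.4 hours
```

Both ratios come out at roughly 125-fold, which would be consistent with on the order of a hundred processors, though the slide does not state how many were used.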

Page 32:

Shifting Gears: Fold Envelopes

Pre-enumerate the pSCFG search space

Present conditional versions of dynamic programming algorithms

User-defined complexity

Page 33:

Fold Envelopes, pt. 2

Conceptualize search over grammars and parse trees

Each node in tree accounts for subsequence

[Figure: a parse-tree node $W_u$. The subtree below it accounts for the inside sequence $X_{i..j}$; the rest of the tree accounts for the outside sequence $X_{0..i}$ and $X_{j..L}$.]

Page 34:

Analogy: Message Passing

Inside algorithm: likelihood of sequence over all possible parses

Cocke-Younger-Kasami algorithm: maximum likelihood parse of a sequence

Inside-Outside algorithm: expected number of times each grammar production is used

Use fold envelopes to limit messages by restricting subsequences considered

Page 35:

The Inside Algorithm

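To compute

$a(i, j, V) = P(x_i \ldots x_j \text{ produced by } V)$

$a(i, j, V) = \sum_{X} \sum_{Y} \sum_{k} a(i, k, X) \, a(k+1, j, Y) \, P(V \to XY)$

[Figure: V spans $x_i \ldots x_j$ and splits into X over $x_i \ldots x_k$ and Y over $x_{k+1} \ldots x_j$.]

Batzoglou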
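
A minimal sketch of the inside recursion for a grammar in Chomsky normal form. The toy grammar (S -> SS with 0.4, S -> a with 0.6) is hypothetical, and the base case for single residues is an assumption, since the slide shows only the recursive step.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (not from the slides).
BINARY = {("S", "S", "S"): 0.4}  # P(V -> X Y), keyed by (V, X, Y)
UNARY = {("S", "a"): 0.6}        # P(V -> terminal), keyed by (V, terminal)


def inside(seq, binary=BINARY, unary=UNARY):
    """Fill a[(i, j, V)] = P(x_i..x_j produced by V), with 0-based inclusive indices."""
    n = len(seq)
    a = defaultdict(float)
    # Assumed base case: a single residue comes from a terminal production.
    for i, residue in enumerate(seq):
        for (v, terminal), p in unary.items():
            if terminal == residue:
                a[(i, i, v)] += p
    # Recursive step: sum over X, Y and the split point k, shortest spans first.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (v, x, y), p in binary.items():
                for k in range(i, j):
                    a[(i, j, v)] += a[(i, k, x)] * a[(k + 1, j, y)] * p
    return a


def likelihood(seq, start="S"):
    """Probability of the whole sequence, summed over all parses from `start`."""
    if not seq:
        return 0.0
    return inside(seq)[(0, len(seq) - 1, start)]
```

Replacing the sums with maximizations gives the CYK variant, which recovers the single best parse instead of the total likelihood.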

Page 36:

Constructing Fold Envelopes

Constrain to possible secondary structures

Constrain to primary sequence alignment
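
To illustrate the first kind of constraint, the sketch below builds an envelope for a single sequence from a set of known base pairs: a subsequence is allowed only if it contains both ends of every pair or neither. This is a simplified, single-sequence stand-in for the pairwise envelopes of the paper, and `full_envelope` and `structure_envelope` are hypothetical names.

```python
def full_envelope(n: int) -> set:
    """All subsequences (i, j), 0-based inclusive, of a length-n sequence."""
    return {(i, j) for i in range(n) for j in range(i, n)}


def structure_envelope(n: int, pairs) -> set:
    """Subsequences that do not split any of the given base pairs (p, q)."""
    def splits(i, j, p, q):
        # True when exactly one end of the pair lies inside i..j.
        return (i <= p <= j) != (i <= q <= j)

    return {
        (i, j)
        for (i, j) in full_envelope(n)
        if not any(splits(i, j, p, q) for p, q in pairs)
    }
```

For a five-residue hairpin with pairs (0, 4) and (1, 3), the envelope shrinks from 15 subsequences to 3. A dynamic programming pass restricted to the envelope only fills those cells, which is where the savings come from.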

Page 37:

Summary

RSEARCH to find a set of possible homologs, sorted by score and statistics

Fold Envelopes permit greater search depth in case of unfolded comparisons

RSEARCH employs simplified pSCFGs

Fold Envelopes are useful over full spectrum of comparisons but represent more computationally complex situations