
Graduate School ETD Form 9 (Revised 12/07)

PURDUE UNIVERSITY GRADUATE SCHOOL

Thesis/Dissertation Acceptance

This is to certify that the thesis/dissertation prepared

By Kejie Li

Entitled A Graph Theoretic Approach for Identifying RNA Structure and Function Relationships

For the degree of Doctor of Philosophy

Is approved by the final examining committee:

Wen Jiang

Michael Gribskov

Daisuke Kihara

Dabao Zhang

To the best of my knowledge and as understood by the student in the Research Integrity and Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy on Integrity in Research” and the use of copyrighted material.

Approved by Major Professor(s): Michael Gribskov

Approved by: Peter J. Hollenbeck, Head of the Graduate Program. Date: 07/25/2011

Graduate School Form 20 (Revised 9/10)

PURDUE UNIVERSITY GRADUATE SCHOOL

Research Integrity and Copyright Disclaimer

Title of Thesis/Dissertation: A Graph Theoretic Approach for Identifying RNA Structure and Function Relationships

For the degree of Doctor of Philosophy

I certify that in the preparation of this thesis, I have observed the provisions of Purdue University Executive Memorandum No. C-22, September 6, 1991, Policy on Integrity in Research.*

Further, I certify that this work is free of plagiarism and all materials appearing in this thesis/dissertation have been properly quoted and attributed.

I certify that all copyrighted material incorporated into this thesis/dissertation is in compliance with the United States’ copyright law and that I have received written permission from the copyright owners for my use of their work, which is beyond the scope of the law. I agree to indemnify and save harmless Purdue University from any and all claims that may be asserted or that may arise from any copyright violation.

Printed Name and Signature of Candidate: Kejie Li

Date (month/day/year): 07/25/2011

*Located at http://www.purdue.edu/policies/pages/teach_res_outreach/c_22.html

A GRAPH THEORETIC APPROACH FOR IDENTIFYING RNA STRUCTURE AND FUNCTION RELATIONSHIPS

A Dissertation

Submitted to the Faculty of Purdue University

by Kejie Li

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

August 2011

Purdue University

West Lafayette, Indiana


Dedicated to my beloved wife Juan Liao, my great father Changgui Li, and my dear

mother Hanfang He.

以此献给我心爱的妻子：廖娟，我伟大的父亲：李常贵，以

及我亲爱的母亲：何汉芳。


ACKNOWLEDGEMENTS

I would like to express my deepest and most sincere gratitude to my major advisor, Dr. Michael Gribskov, for his support, patience, understanding and encouragement during my graduate study at Purdue. I thank him for giving me the freedom and support to explore the research areas I am interested in. He has always been ready to help and has been a superb mentor throughout the development of my projects. I sincerely appreciate the time he spent improving my writing in the annual progress reports, the thesis and publications. To me, he is like a huge library that holds all kinds of sources: I can retrieve the information I need at any time, often faster than I could by searching Google.

I would like to extend my sincere appreciation to the members of my PhD advisory

committee: Dr. Wen Jiang, Dr. Daisuke Kihara and Dr. Dabao Zhang. I truly appreciate all

their support, advice and guidance throughout my graduate study.

Special thanks go to Reazur Rahman and Aditi Gupta, who are my lab mates as well as

RNA group mates. We worked closely and I do enjoy our great teamwork.

Extra thanks go to the past and current members of the Gribskov lab: Hao Jiang, Ying Li, Damion Junk, Doug Yatcilla, Omer Ijaz, Prasad Siddavatam, Greg Ziegler, Emre Demirors, James Hengenius, Qiong Wu, Minming Li, Jiajie Huang, Biaobin Jiang, and Junhui Wang (in the order in which I met them) for their help and advice on my projects. I would also like to thank Nina Robinson for her full support of our lab activities. She is priceless to us.

Finally, to my mom and dad, Hanfang He and Changgui Li: without their true love and support, I would not have been able to reach this stage of my life. To my beloved wife, Juan Liao, who is the most important part of my life: with your trust and belief, we will always hold hands tightly and be together on our life journey.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 INTRODUCTION
1.1 RNA’s double life
1.2 Importance of Pseudoknots
1.2.1 Definition of pseudoknot
1.2.2 Pseudoknot functional significance
1.3 Representations of RNA secondary structure
1.3.1 Simple representations
1.3.2 Graph theoretical representations
1.4 Methods for studying RNA structure
1.4.1 Experimental approaches
1.4.2 Computational approaches
1.5 Decision making in novel molecule design
1.6 Summary/organization of this work
1.7 Data sources
1.7.1 Manually curated dataset
1.7.2 STRAND dataset
CHAPTER 2 PATTERN MATCHING IN RNA STRUCTURES
2.1 Introduction
2.2 XIOS RNA graphs
2.2.1 Definition
2.2.2 Training data
2.2.3 DFS Lexicographical ordering
2.2.4 Enumeration N-stem structures
2.3 Greatest conserved structures
2.3.1 Extension of the gSpan algorithm
2.3.2 Graph matching algorithm (similar to gSpan)
2.3.3 Greatest conserved structure(s) in a set of RNAs
2.3.4 Characteristics of biological graphs
2.4 Future directions
2.4.1 Graph preprocessing
2.4.2 Reduction of graph complexity
2.4.3 Adding labels
2.4.4 Motif identification tool
2.4.5 Database search tool
CHAPTER 3 RNA STRUCTURAL FINGERPRINT
3.1 Enumeration of XIOS graphs
3.2 Structural motif library construction
3.3 RNA structural fingerprint
3.3.1 Background
3.3.2 Definition of RNA structural fingerprint
3.3.3 Fingerprint searching algorithms
3.3.4 Possible applications
CHAPTER 4 MATCHING UNKNOWN RNA STRUCTURES
4.1 Introduction
4.2 Methods and dataset
4.2.1 XIOS Graph
4.2.2 Dataset
4.2.3 Indexing and searching
4.2.4 Scoring function
4.3 Results
4.3.1 Validation using known biological structures
4.3.2 Size only graph database search
4.3.3 Embedding simulation
4.3.4 Blast search
4.4 Discussion
CHAPTER 5 IDENTIFICATION OF TOPOLOGICAL FEATURES THAT DISCRIMINATE BETWEEN RNA CLASSES
5.1 Introduction
5.1.1 RNA importance, RNA function determined by RNA structure
5.1.2 Contribution
5.2 Material and Methods
5.2.1 Reverse cIndex basic feature selection on RNA fingerprints
5.2.2 RNA structure classification
5.2.3 RNA structure datasets
5.3 Results
5.3.1 Feature selection on the fingerprint generated
5.3.2 Top unique feature selected (in the same order as weights)
5.3.3 Validation of RNA structure classification
5.4 Discussion
5.5 Future directions
LIST OF REFERENCES
VITA
PUBLICATIONS


LIST OF TABLES

Table 1.1 Manually curated structures
Table 1.2 RNaseP Sequences Used
Table 1.3 Group I Intron Sequences Used
Table 1.4 tmRNA Sequences Used
Table 1.5 Dataset collected from STRAND database
Table 2.1 Brief description of RNA datasets
Table 2.2 Number of possible RNA topologies for different numbers of stems, N
Table 3.1 Design of the NH index array
Table 4.1 Kolmogorov-Smirnov test results
Table 4.2 Statistics of embedding simulation
Table 5.1 Statistics of the selected structural features for four RNA families from two datasets
Table 5.2 Classification performance
Table 5.3 Leave one out cross validation result
Table 5.4 Classification test


LIST OF FIGURES

Figure 1.1 Common types of Pseudoknots
Figure 1.2 Common RNA secondary structure representations
Figure 1.3 Rooted-labeled tree
Figure 1.4 Dot plots
Figure 1.5 RNA tree graph, RNA dual graph and RNA digraph representations
Figure 2.1 XIOS definition
Figure 2.2 tRNA 3D structure and corresponding XIOS graph representation
Figure 2.3 Unique three-stem XIOS graphs, including pseudoknots
Figure 2.4 Identification of the common structure in S. cerevisiae and H. sapiens RNase P RNA
Figure 2.5 Correlation between number of stems and sequence length
Figure 2.6 Length of RNA stem structures in biological RNAs
Figure 3.1 Enumeration of XIOS graphs
Figure 3.2 RNA secondary structure visualization
Figure 3.3 Flow of generating fingerprint
Figure 3.4 RNA structural fingerprint tRNA example
Figure 3.5 Architecture comparison of CPU and GPU
Figure 3.6 Comparison of CPU and GPU
Figure 3.7 Prefix tree structure stores structural motif library for efficient subgraph isomorphism check
Figure 3.8 Neighborhood indexing (NH indexing)
Figure 3.9 Triangle descriptors
Figure 3.10 Full List of Mathematically Possible Triangle Descriptors
Figure 4.1 Positive hit ratio (PHR) in a Blast search
Figure 4.2 NH database search result
Figure 4.3 Size only database search result
Figure 4.4 Embedding simulation
Figure 4.5 Embedding simulation database search result
Figure 4.6 Blast search result
Figure 5.1 Idea of graph containment search
Figure 5.2 Graph feature matrix
Figure 5.3 Feature selection weight vs. iteration
Figure 5.4 Selected top unique structural features in dataset Table 1.1
Figure 5.5 Selected top unique structural features in dataset Table 1.5
Figure 5.6 Link from structure to function


LIST OF ABBREVIATIONS

3D Three Dimensional

CGI Common Gateway Interface

CM Covariance Model

CPU Central Processing Unit

CUDA Compute Unified Device Architecture

DDT dichlorodiphenyltrichloroethane

DFS Depth First Search

DOS Diversity Oriented Synthesis

DP Dynamic Programming

GPGPU General Purpose computing on GPUs

GPU Graphics Processing Unit

HMM Hidden Markov Model

HTS High Throughput Screening


LOOCV Leave One Out Cross Validation

LWP The World-Wide Web Library for Perl

MFE Minimum Free Energy

MSA Multiple Sequence Alignment

ncRNA non-coding RNA

NGS Next-Generation Sequencing

NH Neighboring Index

PDB Protein Data Bank

PHR Positive Hit Ratio

QSAR Quantitative Structure-Activity Relationships

RAG RNA As Graphs

SCFG Stochastic Context Free Grammars

STRAND The RNA secondary STRucture and statistical Analysis Database

TOS Target Oriented Synthesis

UID Unique ID

VARNA Visualization Applet for RNA

XIOS eXclusive, Included, Overlap and Serial


ABSTRACT

Li, Kejie. Ph.D., Purdue University, August 2011. A Graph Theoretic Approach for Identifying RNA Structure and Function Relationships. Major Professor: Michael Gribskov.

Understanding structure-function mapping is crucial to the study of biopolymers. This mapping can be used to extract information that aids in predicting molecular function from structural topological patterns. This study presents a graph theoretical approach for understanding RNA structural topological features and for revealing the mapping from the topological features of biological RNA structures to biological functions. We have built a package that represents ensembles of suboptimal RNA structures as a graph, the XIOS graph, for easy structural comparison and analysis by an extended version of the gSpan algorithm. To detect structural similarities, the Neighbor Indexing algorithm has been extended with additional RNA structure-specific information, and the concept of an RNA structural fingerprint has been introduced, from a structural descriptor point of view, to represent the topological information of ensembles of RNA structures. Based on the cIndex feature selection strategy, I have developed and applied a new feature selection approach for RNA structures that reveals important structural topological patterns providing specific information about the functional class of RNAs. This information can be used to relate RNA structural patterns to function. In addition, I have developed a novel structure indexing and database searching method for finding RNAs with similar characteristics (topological modules).

It is remarkable that even without using RNA primary sequence information RNA

structures can be classified into the correct classes. By combining information from both

sequence and topology, unclassified or misclassified RNAs can be correctly classified and

categorized with high confidence. The structure-based classification described here is

significantly better than sequence-based classification using Blast (Kolmogorov-Smirnov

test).


CHAPTER 1 INTRODUCTION

1.1 RNA’s double life

The central dogma of molecular biology (2) says that RNA is the biopolymer copied from

DNA (transcription) that serves as the template during protein synthesis (translation). In

other words, RNA is the intermediate message interpreter from heritable DNA genetic

information to various biological functions. This view implies that it is genes

that encode proteins, and proteins play most of the important biological roles such as

catalytic and regulatory functions. From that point of view, biological complexity is

determined by the number of protein-coding genes.

Two great figures initiated the study of evolution in vitro, the so-called RNA evolution

in the test tube: Sol Spiegelman and Manfred Eigen. Spiegelman’s serial transfer

experiments with the Qβ assay (3-5) and Eigen’s extensive kinetic study of the

mechanism of Qβ RNA replication (6-8) revealed that the primary sequence and spatial

structure of the same RNA molecule are its genotype and phenotype, respectively. RNA

molecules have to satisfy structural requirements in order to be recognized and

replicated by enzymes (9). Structure alone is not sufficient to infer function, but a

complete understanding of the functional molecule requires information about its


organization in space. It has been suggested that the spatial structure of RNA is a crucial

factor in determining its function.

In the meantime, Carl Woese discovered that RNA forms complex secondary structures,

which suggested, for the first time, that RNA could act as a catalyst (10). Later in the

1980s, Thomas R. Cech (11-13) and Sidney Altman (14,15) separately discovered the

catalytic properties of RNA molecules, making proteins no longer the only biopolymers

with catalytic function. An RNA molecule, therefore, is not just a carrier of genetic

information that is chemically very similar to DNA; remarkably, it also possesses

catalytic activity and can execute functions itself. Although catalytic RNA molecules,

or ribozymes (13), have a limited catalytic repertoire compared with proteins, it is more

than sufficient to process genetic information and to self-replicate in a pre-biotic

environment. This gave rise to the idea that RNA molecules could play a bridging role

between the lifeless pre-biotic environment and the beginning of life (16-18): the RNA

world hypothesis. The RNA world provides a possible answer to the long-standing

question of the origin of life. It suggests that versatile RNA, able both to store

information like DNA and to catalyze enzymatic reactions like proteins, came first.

RNA-encoded proteins evolved after RNA but before DNA (19).

The traditional definition of RNA secondary structure is based on base-pairing

interactions within an RNA molecule: Watson-Crick base pairs (A-U, G-C) (20) and wobble

base pairs (G-U, I-U, I-A and I-C) (21). The basic structural elements are stacked

base pairs (or stems), hairpin loops, interior loops, bulge loops and multiple loops

(Figure 1.2 A). In classical secondary structures all structural elements are nested,

which means that a base-pairing region lies either completely within the loop of another

base-pairing region or completely outside it. Early studies (22,23) have found that the

catalytic cores of many

ribozymes have uniquely shaped conserved RNA secondary structures that allow them

to perform their catalytic function. More recent studies show that small ribozymes

exhibit a broad range of catalytic activities (24-31) and that RNA catalysis plays essential

roles in the metabolism of cells (32-34). The involvement of RNA in such diverse

catalytic functions gives further support to the RNA world hypothesis.

Traditionally, the genome was believed to be a simple collection of separate genes,

one gene → one protein. Most gene transcripts were thought to be protein-coding, and

only rarely non-coding RNAs (ncRNAs); tRNA and rRNA, which make up the bulk of cellular

RNA, were the exceptions. ncRNAs are RNA molecules that perform biological functions

without being translated into proteins. In the course of the recent rapid development of

high throughput techniques, comprehensive large scale transcriptome studies (35-37)

across species as diverse as plants, bacteria and mammals (38-46) have changed our

understanding of RNA. New functions of ncRNAs have been discovered, and ncRNAs and

RNA-based biological processes are now known to exist in all life forms. Gradually, RNA has

been recognized as a central player in cellular regulation (47). In particular, ncRNAs play

active roles in multiple regulatory layers, from transcription, through RNA maturation and

RNA modification, to translational regulation (47). The current view is that transcripts are

potentially overlapping and bidirectional, and non-coding transcripts are abundant. In


spite of the importance and ubiquity of ncRNAs, we still know relatively little about

them (48).

1.2 Importance of Pseudoknots

It is widely accepted that RNA functions are mainly determined by RNA structures.

This dependence of function on structure demands comprehensive study and analysis of RNA

structures, in order to better understand RNA catalytic and regulatory functions.

1.2.1 Definition of pseudoknot

Any analysis of RNA structure must consider an important structural element, the

pseudoknot. In contrast to nested RNA secondary structures, pseudoknots are base-paired

regions that are only partially nested: bases in the loop of one base-paired region

pair with a region outside that base-paired region. Consider two stems, S1 and S2,

where S1 is formed by base-paired regions A and B, and S2 is formed by regions C and D,

and the order of these regions in the RNA sequence is A, C, B, D. S1 and S2 form a

pseudoknot because region C of stem S2 lies inside the “loop” of stem S1, while

region D of S2 lies outside that loop. Such knotted structures were first discovered in

turnip yellow mosaic virus in 1982 (24), and they occur frequently in RNA functional sites

and catalytic cores, often being directly involved in RNA catalytic and regulatory functions.

There are many types of pseudoknots; simple and common ones include the H-type

pseudoknot (classic pseudoknot), kissing pseudoknot, simple recursive pseudoknot and

hairpin-bulge pseudoknot (three-stemmed pseudoknot) (Figure 1.1).
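
The A, C, B, D ordering in the definition above can be checked mechanically. The following is a minimal Python sketch, assuming each base-paired region is given as a (start, end) pair of sequence positions and that regions do not overlap; the names `Stem` and `is_pseudoknot` are illustrative only and are not part of the software described in this work.

```python
from typing import NamedTuple


class Stem(NamedTuple):
    """A stem formed by two base-paired regions, each a (start, end) position pair."""
    left: tuple[int, int]   # 5' region, e.g. region A of S1
    right: tuple[int, int]  # 3' region, e.g. region B of S1


def is_pseudoknot(s1: Stem, s2: Stem) -> bool:
    """Return True if the regions of the two stems interleave as A, C, B, D."""
    # Order the stems so that the one starting first plays the role of S1.
    first, second = sorted((s1, s2), key=lambda s: s.left[0])
    a, b = first
    c, d = second
    # C must lie between A and B (inside the loop of S1); D must lie after B.
    return a[1] < c[0] and c[1] < b[0] and b[1] < d[0]


# Hypothetical coordinates chosen only to illustrate the three cases.
s1 = Stem((1, 4), (20, 23))
knotted = Stem((10, 13), (30, 33))  # order A, C, B, D
nested = Stem((6, 9), (14, 17))     # order A, C, D, B
serial = Stem((30, 33), (40, 43))   # order A, B, C, D
```

With these coordinates only the `knotted` stem forms a pseudoknot with `s1`; the nested and serial arrangements do not.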


1.2.2 Pseudoknot functional significance

Pseudoknots are well known to play diverse fundamental roles in cells (49). The diversity

of pseudoknots corresponds to their various functions. The formation of H-type

pseudoknots, for instance, can lead to compact structures. Site-directed mutagenesis

kinetic studies showed that slight sequence changes can destabilize pseudoknot

structures. It has further been suggested that pseudoknotted structures are just slightly

more stable than the corresponding non-knotted hairpin structures (less than 2 kcal/mol)

(50,51). This small free energy difference allows the knotted structure to fold (compact

structure) and unfold (less compact structure) without having to cross high energy

barriers. This suggests a possible role for pseudoknots as conformational switches or

control elements.

Pseudoknots fold locally in RNA molecules, so their positions in the sequence may also

reflect their function (52). In several viruses, replicase expression is controlled by

ribosomal frameshifting (53-57) or in-frame stop codon readthrough (58). In these cases,

pseudoknot formation is necessary, which suggests that pseudoknots near the 5’ end of

mRNAs may be involved in translational control. tRNA-like motifs at the 3’ end of several

groups of plant viral RNA genomes also contain pseudoknot structures (52), and in

tobacco mosaic virus, pseudoknots upstream of the tRNA-like domain were shown to be

required to substitute for a poly(A) tail, stabilizing the mRNA and increasing gene

expression (up to 100-fold) (59). This evidence shows that 3’ end pseudoknots have roles in

preserving replication signals. In catalytic and regulatory RNA molecules, pseudoknots


are often found at the core of the tertiary structures (60). In addition to these functions,

evolutionarily conserved pseudoknots are also involved in self-splicing (49).

Overall, pseudoknots are biologically important due to their appearance in critical

regions of functional RNAs. Among all the fundamental RNA structural elements,

pseudoknots are possibly the most important.

1.3 Representations of RNA secondary structure

1.3.1 Simple representations

There are many ways of representing RNA secondary structures. The most common are

stem-loop diagrams, circle plots, dome plots, mountain plots, linear dot-bracket

representations, dot plots and tree graphs.

One of the most common representations is the biological stem-loop or “squiggle”

diagram (Figure 1.2 A and Figure 1.2 B). It describes RNA secondary structure in terms of

stem-loop structural elements. Stem-loop diagrams are not rotationally invariant and

therefore very similar structures may appear very different.

Nussinov suggested a new representation called a circle plot in 1978 (61) (Figure 1.2 C

and E). The backbone nucleotides of RNA are arranged along a circle, and base pairs are

drawn as arcs linking the paired bases. With this representation, sequences of different

lengths can look quite different.

Dome plots can be considered as a simple alternative representation (Figure 1.2 D and F).

Instead of organizing the RNA sequence along a circle, nucleotides are placed along a

line. Base pairs are still drawn as arcs. In contrast to circle plots, dome plots look similar

even when sequence lengths differ. Both circle plots and dome plots can be used to

show classical stem-loop secondary structures as well as pseudoknots.

Hogeweg and Konings (62) developed a graph called the mountain representation, or

mountain plot, to compare RNA secondary structures (Figure 1.2 G). The x-axis of a

mountain plot corresponds to the RNA sequence, and the y-axis shows the number of

base-pairs in which a specific nucleotide is nested. While mountain plots have the

advantage of being rotationally invariant, they cannot be used to show pseudoknots.
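
As a minimal sketch of how the y-values of a mountain plot could be computed, the following function assumes the structure is given as a nested dot-bracket string (the notation described in the next paragraph). The function name and the convention that a paired base sits at the level of the pairs strictly enclosing it are illustrative assumptions, not part of the cited work.

```python
def mountain_heights(structure):
    """Return, for each position, the number of base pairs that enclose it.

    Assumes a pseudoknot-free dot-bracket string using only '(', ')' and '.'.
    """
    heights = []
    depth = 0  # number of currently open base pairs
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            heights.append(depth)  # the opening base is not enclosed by its own pair
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched ')' at position {pos}")
            heights.append(depth)  # the closing base sits at its partner's level
        elif symbol == ".":
            heights.append(depth)
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if depth != 0:
        raise ValueError("unmatched '(' in structure")
    return heights
```

For the hairpin "((..))" this yields the profile [0, 1, 2, 2, 1, 0], which rises along the stem, plateaus over the loop and descends again.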

The linear dot-bracket representation (or Vienna notation) (Figure 1.2 H) is both

common and useful. It is a string of dots and brackets, where dots indicate unpaired

positions, and a pair of matching brackets at positions i and j indicates that there is a base-

pair (i, j) between the bases at positions i and j. Pseudoknots can be included in this

representation by using different types of parentheses to distinguish the corresponding

base-paired regions.
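
The notation can be read with one stack per bracket type. The sketch below is illustrative only: the set of bracket characters, the 0-based positions and the function name are assumptions made for this example.

```python
# Closing bracket -> matching opening bracket (assumed set of bracket types).
BRACKETS = {")": "(", "]": "[", "}": "{", ">": "<"}


def parse_dot_bracket(structure):
    """Return the sorted list of base pairs (i, j), 0-based, with i < j."""
    stacks = {opener: [] for opener in BRACKETS.values()}
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol in stacks:
            stacks[symbol].append(pos)  # remember the opening position
        elif symbol in BRACKETS:
            stack = stacks[BRACKETS[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {pos}")
            pairs.append((stack.pop(), pos))  # pair with the most recent opener
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {opener!r} at position {stack[-1]}")
    return sorted(pairs)
```

Because each bracket type has its own stack, a string such as "((..[[..))..]]" is parsed without error even though the square-bracket stem crosses the round-bracket stem.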

The dot bracket representation can be translated into a rooted-labeled tree (63). In this

representation, the interior nodes of the tree correspond to base-pairs, and the leaf

nodes are unpaired nucleotides. An additional dummy node is added as the root of the

tree graph which serves as the parent of all nodes in the tree to ensure structures with

free end(s) are not represented by a forest (disconnected trees). Stems appear as chains

of interior nodes (rope-like) and multi-loops appear as bush-like branching centers in

this representation (Figure 1.3). Similar representations have been proposed by other

groups (64,65). Trees can capture all the pairwise interactions of an RNA secondary

structure but not other tertiary interactions, such as pseudoknots and base triples (66).
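
The translation from a dot-bracket string to a rooted tree can be sketched as follows. The `Node` class, the `render` helper and its compact text output are hypothetical names introduced for illustration; only nested (pseudoknot-free) structures are handled, in line with the limitation noted above.

```python
class Node:
    """Tree node: kind is 'root', 'pair' (interior) or 'unpaired' (leaf)."""

    def __init__(self, kind, positions=()):
        self.kind = kind
        self.positions = positions  # sequence positions covered by this node
        self.children = []


def dot_bracket_to_tree(structure):
    """Build the rooted tree of a nested dot-bracket string."""
    root = Node("root")  # dummy root keeps free ends in a single tree
    stack = [root]
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            node = Node("pair", (pos,))
            stack[-1].children.append(node)
            stack.append(node)  # following symbols are nested in this pair
        elif symbol == ")":
            if len(stack) == 1:
                raise ValueError(f"unmatched ')' at position {pos}")
            node = stack.pop()
            node.positions = (node.positions[0], pos)
        elif symbol == ".":
            stack[-1].children.append(Node("unpaired", (pos,)))
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if len(stack) != 1:
        raise ValueError("unmatched '(' in structure")
    return root


def render(node):
    """Compact text form: R(...) root, P(...) base pair, u unpaired base."""
    if node.kind == "unpaired":
        return "u"
    inner = ",".join(render(child) for child in node.children)
    return ("R" if node.kind == "root" else "P") + "(" + inner + ")"
```

For ".((..))" the rendered tree is "R(u,P(P(u,u)))": the free 5' base hangs directly off the dummy root, and the two stacked pairs form a chain of interior nodes above the two loop leaves.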

Dot plots represent structures as a two-dimensional matrix in which a dot is placed at

the (i,j) position of each base-pair. There are two common types of dot plots. One is a

visualization produced by Zuker’s mfold algorithm called the energy dot plot for an RNA

sequence (67,68). For each possible pair (i, j), position (i, j) of this plot shows the lowest

free energy for a structure that contains base pair (i, j). It provides a picture of all

alternative secondary structures within a specific free energy increment from the lowest

free energy structure (Figure 1.4). When many dots are displayed with a small energy

increment, it suggests a less well-determined secondary structure prediction (i.e., the

presence of many competing structures with similar energy). The other type comes from the

Vienna RNA package, which implements McCaskill’s partition function algorithm (69). The

output of this program is a base-pair binding probability matrix (Figure 1.4), which is shown

as a so-called partition function dot plot. It is a visualization of the thermodynamics of the

ensemble of structures.

1.3.2 Graph theoretical representations

1.3.2.1 Basics of graph theory

A graph is a collection of vertices (nodes) and edges linking the vertices. Graph theory,

which uses graphs to model relationships of natural and artificial objects, has been used

in many different areas, such as communication networks, network flow and data

organization, molecular structures in chemistry and physics, and social network analysis

in sociology. A graph is normally represented by a connectivity matrix, which describes the

connectivity of vertices by different types of edges. In this connectivity matrix, the

value in the i-th row and j-th column describes the edge (such as its directionality, weight

or edge type) that links vertices i and j in the graph. In the simplest case, the values

are either zero (no edge between vertices i and j) or one (an edge exists between i

and j).
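
For the simplest case, a connectivity matrix can be built from an edge list as in the following sketch. It assumes an undirected, unweighted graph with vertices numbered from 0; the function name is illustrative.

```python
def connectivity_matrix(n_vertices, edges):
    """Return the 0/1 connectivity matrix of an undirected graph."""
    matrix = [[0] * n_vertices for _ in range(n_vertices)]
    for i, j in edges:
        if not (0 <= i < n_vertices and 0 <= j < n_vertices):
            raise ValueError(f"edge ({i}, {j}) refers to an unknown vertex")
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected: the matrix is symmetric
    return matrix
```

Directed or typed edges could be stored the same way by dropping the symmetric assignment or by writing an edge label instead of 1.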

To find the difference between graphs, methods for graph comparison are required. An

important concept related to graph comparison is graph isomorphism, which describes a

structure preserving bijection between the sets of vertices of two graphs. If an

isomorphism exists between two graphs, these two graphs are called isomorphic. Put

simply, if two graphs are isomorphic, they are structurally

equivalent (have equivalent vertices linked by equivalent edges) even though they have

different layouts of the vertices and edges. Considerable work has been done on the

graph isomorphism problem, i.e., how to determine if two graphs are structurally

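equivalent or isomorphic. One way to test for isomorphism is to compare the

eigenvalues of the connectivity matrices of the graphs. Isomorphic graphs have identical

eigenvalues, so differing eigenvalues rule out isomorphism. Identical eigenvalues are

necessary but not sufficient, because non-isomorphic graphs can share the same eigenvalues.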
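
An eigenvalue comparison is best treated as a quick screen. The sketch below compares characteristic polynomials (equal polynomials mean equal eigenvalues) using exact arithmetic, and contrasts the result with a brute-force isomorphism test that is only practical for very small graphs. All function names are illustrative, and the two example graphs (a four-leaf star, and a four-cycle plus an isolated vertex) are a standard textbook pair chosen for this example.

```python
from fractions import Fraction
from itertools import permutations


def adjacency(n, edges):
    """Symmetric 0/1 connectivity matrix of an undirected graph."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1
    return a


def matmul(x, y):
    n = len(x)
    return [[sum(x[r][k] * y[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def char_poly(a):
    """Coefficients of det(xI - A), highest power first (Faddeev-LeVerrier)."""
    n = len(a)
    coeffs = [Fraction(1)]
    m = [[Fraction(0)] * n for _ in range(n)]
    for k in range(1, n + 1):
        am = matmul(a, m)
        # M_k = A * M_(k-1) + (latest coefficient) * I
        m = [[am[r][c] + (coeffs[-1] if r == c else 0) for c in range(n)]
             for r in range(n)]
        am_next = matmul(a, m)
        trace = sum(am_next[r][r] for r in range(n))
        coeffs.append(-trace / k)
    return coeffs


def are_isomorphic(a, b):
    """Brute force over all vertex relabelings; exponential, tiny graphs only."""
    n = len(a)
    if n != len(b):
        return False
    return any(
        all(a[i][j] == b[p[i]][p[j]] for i in range(n) for j in range(n))
        for p in permutations(range(n))
    )


star = adjacency(5, [(0, 1), (0, 2), (0, 3), (0, 4)])
cycle_plus_vertex = adjacency(5, [(0, 1), (1, 2), (2, 3), (3, 0)])
same_spectrum = char_poly(star) == char_poly(cycle_plus_vertex)
same_graph = are_isomorphic(star, cycle_plus_vertex)
```

Here `same_spectrum` is true while `same_graph` is false, so a matching spectrum should be followed by a direct structural comparison before two graphs are declared isomorphic.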

Graph similarity is related to another problem called subgraph isomorphism, which

involves the determination of whether graph G contains a subgraph that is isomorphic

to graph H. The subgraph isomorphism problem is NP-complete (70).

Given a method for measuring the similarity of graphs, available clustering and

classification methods can be efficiently applied, and novel analyses and predictions can

be made.

In chemistry, chemical structure graphs are used to model molecules. The vertices

represent atoms and bonds are represented by edges. Chemical structure graphs have

been used to identify similar chemical structures, for structure search and for function

identification (71,72). One specific area, called quantitative structure-activity

relationships (QSAR), quantitatively studies the correlation of small molecule structure

and function by using structural determinants to generate graph models (73). Linear and

nonlinear relationships between structure and activity are considered. Such

relationships allow predictive models of synthetic small molecule activity based on

existing knowledge of small molecules.

1.3.2.2 Using graph theory in RNA study

Similar to chemical studies, secondary structures of RNA molecules can also be

represented as planar graphs; examples are shown in section 1.4.1. RNA graphs use

vertices to represent nucleotides or structural elements; edges between vertices

correspond to relationships such as bonds, ordering or connectivity.

The RNA tree graph (see also section 1.4.1) was introduced by the Schlick group (74)

(Figure 1.5). It was used to represent the connectivity of secondary structure elements

(Figure 1.2 A). Vertices in the graph represent loops and the edges correspond to stems.

This graph models topological, rather than geometric, aspects of RNA molecules. For

example, it does not encode stem length, loop length and so on. It simply provides a

coarse resolution image of the actual secondary structure. One advantage of this

representation is that there are existing tree enumeration theorems (75,76). This is

useful since enumeration can be used to estimate the size of the RNA structural space.

In addition, new RNA folds can be proposed by enumerating graphs that have not yet been

observed. However, RNA tree graphs cannot represent pseudoknotted structures. Later, the

Schlick group proposed a new representation called a dual graph to include

pseudoknots (74) (Figure 1.5). In the dual graph, vertices correspond to stems and loops

are represented by arcs. Similarly to RNA tree graphs, the dual graph also captures the

topological characteristics of a folded RNA. One issue with the dual graph

representation is that different RNA topologies can share the same dual graph. The digraph

representation is an RNA dual graph with directed edges (Figure 1.5). The direction of

the edges can resolve some ambiguity in representing RNA topologies.

1.4 Methods for studying RNA structure

As mentioned before, biological complexity has been measured in terms of the number

of protein-coding genes. At the molecular level, the spatial structure of molecules is the

phenotype. All biological functions are properties of phenotypes (77), and the mapping

from genotypes into phenotypes is the key to understanding information and complexity in

biology (77). To crack the code that relates molecular structures to biological

phenomena is the greatest current challenge for life science (77).

The genetic code, the relationship between the nucleotide sequences of DNA or RNA

and the amino acid sequences of proteins (2), is well accepted as only one part of the

highly complex language of evolution. We have clear knowledge about how the

biological machinery processes genetic information according to the central dogma. The

language linking sequences and three-dimensional structures of biopolymers has been

called the second half of the genetic code (78). Coarse grained notions of structures of

biopolymers, like RNA secondary structures, can be considered to be phenotypes (77).

Thus the connection between RNA structure and biological function could be considered

to be a third part or extension of the language of evolution.

1.4.1 Experimental approaches

Several experiments have been developed to determine the location of base-paired and

single-stranded regions in RNA. Nuclease protection assays use ribonucleases which

specifically cleave single-stranded RNA regions (or specifically cleave double-stranded

RNA regions), leaving the double-stranded regions intact (or leaving the single-stranded

regions intact). Many ribonucleases can be used; examples include V1 ribonuclease,

which cleaves phosphodiester bonds 3′ of double-stranded RNA regions, and S1

nuclease, which specifically cleaves 3′ of single-stranded RNA regions. The limitation of

this kind of approach is that it is time consuming and costly.

Pioneering efforts also have been made by projects such as Doudna’s structural

genomics of RNA project, which sought to decipher the biological and functional

properties of RNA molecules by determining their molecular three-dimensional

structures (79). The Puglisi group is trying to understand RNA function in terms of

molecular structure and dynamics using biophysical tools. They use nuclear magnetic

resonance (NMR) spectroscopy to determine the structures of RNA and RNA-protein

complexes, in order to determine the role of RNA in cellular processes and diseases.

These approaches could lead to novel therapeutic strategies targeting processes

involving RNA. NMR spectroscopy is a powerful biophysical tool for solving RNA

structures, but solving RNA structures by NMR is more difficult than for proteins due to

the intrinsic biophysical and biochemical properties of RNA. One important reason is

that the lower proton concentration of RNA molecules (compared to proteins) results in

fewer restraints for structure calculations. These projects are likely to be time

consuming, but by integrating the results, we could have a better picture of the true life

of RNA molecules.

1.4.2 Computational approaches

Today, there are many publicly accessible genomic sequences, RNA secondary

structures, RNA three-dimensional (3D) structures, protein sequences and 3D structures.

The main challenge in biological studies is computing. Data mining and computational

predictions are prevalent. Compared with experimental RNA structure determination,

computational methods hope to speed up the annotation of RNA structures by using the

power of computer models and predictions, with much lower cost. Many bioinformatics

tools help to pull important information related to RNA from the vast amounts of

sequence data generated by high throughput techniques, such as whole genome and

transcriptome sequencing. However, accurate annotation of RNAs is still an extremely

difficult issue.

1.4.2.1 The comparative covariance approach

For a given RNA sequence, computer programs can enumerate all possible RNA

secondary structures based on thermodynamic and free energy minimization

considerations. But the selection of the true structure out of the large set of predicted

structures requires more information, such as the experimental evidence mentioned in

section 1.4.1.

Alternatively, a comparative approach can be used to solve this problem more

efficiently. The assumption is that evolution of RNA structures is slower than the

evolution of RNA sequences. In order to maintain the same biological function, RNA

molecules should have similar structures even if they do not share high sequence

similarity. A single mutation in a base pair can break the structure, yet many

RNA structures in the same family share similar structural patterns. That means changes in

sequence are coordinated to conserve structure (base-paired bases), which is also called

covariance or co-evolution. Covariance ensures that base-pairs are maintained and RNA

structure is conserved. The presence of the same stable tRNA structures in different

organisms is a good example. The idea of the comparative approach is that when

homologous RNA sequences are available, the consensus secondary structure can be

inferred from the multiple sequences. Covariance models (CM) have been used in

sequence analysis to search for regions sharing similar base complementary patterns

(66). Similar base pairing patterns indicate similar structures. The CM is a generalization

of hidden Markov models (HMMs). A CM uses primary sequence consensus and pairwise

covariations with respect to consensus secondary structure to describe an RNA multiple

sequence alignment (MSA) and to represent the consensus secondary structure of a

set of RNA sequences. Put simply, a CM generates a probabilistic

representative structure for a set of RNAs. With more sequences, the inferred structure

becomes less ambiguous due to additional sequence diversity and support for the

pairwise covariations. However, there are only four nucleotides, and the random chance

of finding complementary bases is high. The lack of primary sequence conservation,

which makes it difficult to obtain a reliable MSA for divergent RNA families, is the

primary limitation for CM.
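
The covariation signal itself can be illustrated with the mutual information between two alignment columns, a classic statistic for detecting compensatory changes. This sketch is not a covariance model; it assumes an ungapped alignment of equal-length sequences, and the function name and toy alignment are illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, col_i, col_j):
    """Mutual information (in bits) between two columns of an alignment."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    pair_counts = Counter((seq[col_i], seq[col_j]) for seq in alignment)
    i_counts = Counter(seq[col_i] for seq in alignment)
    j_counts = Counter(seq[col_j] for seq in alignment)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = i_counts[a] / n
        p_b = j_counts[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 4 always form a Watson-Crick pair,
# while columns 1 to 3 are fully conserved.
toy_alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
```

In the toy alignment, columns 0 and 4 reach the maximum of 2 bits for a four-letter alphabet, whereas any pair of conserved columns scores 0 bits. This also shows why conserved positions carry no covariation evidence, and why sequence diversity is needed to support a proposed base pair.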

1.4.2.2 The thermodynamic energy based models

Thermodynamically, calorimetric data on the stability of RNA secondary structures are

used to build a base-pair energy model. There are four main categories in this area:

minimum free energy RNA structure prediction (80-89); RNA sequence secondary

structure prediction including pseudoknots (90-96); RNA secondary structure prediction

including suboptimal structures (97-100); and RNA secondary structure prediction with

suboptimal structures and pseudoknots (101).

In the 1970s, Tinoco et al. first studied the thermodynamics of RNA folding (base-pairing)

by using short oligonucleotides (102,103). They measured the free energy decrease

between the denatured RNA molecule and its native state. In the Tinoco model, the

total free energy difference is the summation of the contributions of independent

elements in the structure. If we assume that the

interactions in the secondary structure are stronger than those in the tertiary structure, the

sum of the free energies of the secondary structure elements is a good approximation of the total

free energy of the RNA molecule. Stacked base-paired regions and hydrogen bonding

are generally considered as the source of stabilization of RNA molecules, while loops

generally destabilize the RNA molecular structure due to the introduced entropy.

Two factors that make it difficult to find solutions to biological problems are the size of

biological data and high computational complexity. We are always trying to find the

“true” solution. Sometimes computational models do not exactly mirror the physical

world, or sometimes we are forced for computational reasons to simplify the problem.

Instead of searching for the “true” solution of the problem, focusing on the optimal

solution is always a good alternative. In biology, we usually do not know the exact

answer for which we are looking. What biologists do is to try to find optimal solutions

that closely resemble the true solution. For example, genome sequence assembly tries

to use all the sequence reads from a genome, and assemble them in the correct order

and location to find the correct sequence of the whole genome. In the process of

assembly, there are problems such as repeat sequences. There are many choices for

how to place the repeat sequences in the genome. Without further experimental results,

such as genome walking, it’s hard to know the correct answer. But computer programs

either leave repeats out of the assembled sequence or try to calculate the likelihood of

possible locations for these repeat sequences; these are optimal approximations of the true

solution. Understanding the mechanism of the effect of a certain gene is another

example of how the lack of biological knowledge prevents our models from exactly

describing the biology. Until recently, there was no way to find all of the regulatory

factors related to gene X. However, with the support of microarray or RNA-seq

experiments, it is now possible to find a set of genes, even if not a complete set, that are

correlated with X. From here, biologists can construct the regulatory network to gain

some understanding of the pathways in which X is involved, and can propose possible hypotheses

about the mechanisms of the effect of gene X. At this point, the solution is optimal, but

it can get closer to the true answer with further experimental support or technology

which can provide more detail.

In computer science, a popular technique called dynamic programming (DP) was

developed by Bellman in the 1950s to solve optimization problems (104). DP enables the

combination of sub-problem solutions to solve the overall problem. It calculates the

solution for each sub-problem only once, and stores them in a matrix. By doing so, it

avoids recalculating the answer when the same sub-problem is reencountered during

the calculation. However, the limitation of DP is that it only deals with problems with

recursively nested sub-problems. In RNA research, we know RNA function is determined

by its spatial structure, and DP has been the main approach used to predict minimum

free energy (MFE) RNA structures (95). Nussinov et al. tried to computationally predict

RNA secondary structures by maximizing the number of base-pairs (61). Later, Nussinov

and Jacobson introduced DP to RNA structure prediction by implementing a simple

recurrence based on the decomposition of structure into base-paired elements (105). RNA

molecules are most likely to fold into MFE structures (not necessarily unique),

which are the most stable conformations. Zuker et al. proposed a method to predict

RNA secondary structures by looking for the base-pair combinations with the minimum

free energy (83). It uses the same recurrence principle as Nussinov’s approach, and also

decomposes structures into base-paired elements. Zuker pointed out that the free

energy of a structure is associated with the regions between bonds (base stacking

regions), rather than hydrogen bonding, as was done by Nussinov et al. (61). Zuker’s

energy function is based on free energy, and experimental thermodynamic data was

taken into consideration, such as loops (Zuker treated multiloops as interior loops, and

the positive energy of loops is usually modeled as dependent on the log of the loop

length). Stacking regions have stabilizing effects, while various loops have destabilizing

effects. The Zuker algorithm basically uses DP to do a forward recursion and completely

fill a matrix with the lowest energies of admissible structures ending at each base-pair.

The last number to be computed is the overall MFE of an admissible structure. One can

recursively follow the path that produced the MFE value, tracing back the MFE structure

from the optimal free energy. Zuker’s algorithm has computational complexity O(n⁴) and

O(n²) memory requirement. By limiting the interior loop size to a constant value, for

example 30, its computational complexity can be further reduced to O(n³).
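
The fill-and-traceback pattern described above can be sketched with the simpler Nussinov-style objective of maximizing the number of base pairs. This is not Zuker's energy model: the scoring (one point per pair), the allowed pairs, the minimum hairpin loop of three unpaired bases and all names are assumptions made for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return (maximum number of base pairs, one optimal dot-bracket string)."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    # Forward recursion: fill the matrix for increasingly long subsequences.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    # Traceback: follow the choices that produced the optimal score.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # interval too short to contain a base pair
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (best[0][n - 1] if n else 0), "".join(structure)
```

For "GGGAAACCC" the sketch returns three base pairs and the structure "(((...)))". The two nested loops over (i, j) with an inner loop over k give cubic time and quadratic memory for this simplified model.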

The prediction of single optimal structures has received the most attention, but it is

worthwhile to explore a broader energy and conformational landscape. RNA structures

are dynamic. In slightly different environments, such as varying temperature and

salt concentration, the conformations of RNA molecules vary widely. Even in a uniform stable

environment, RNA molecules exist as an ensemble of structures instead of as a single

specific conformation. This ensemble of structures is a distribution of structures, in

which the expected frequency of each structure is determined by its free energy.

McCaskill described this ensemble of structures in terms of a partition function (69). That

allows the calculation of structural melting curves, base-pair binding probabilities and

frequencies of possible structures at thermodynamic equilibrium. Basically, it relates

the RNA folding problem to the Boltzmann distribution. This ensemble structural

partition-function algorithm can be derived from the MFE algorithm by replacing the

minimum operations with summations and the additions with multiplications. Suboptimal

structure prediction involves searching the near optimal conformational space for

structures with low, but greater than minimum, free energies. There are some

algorithms that compute MFE and near optimal (suboptimal) structures. Some of them

include pseudoknot prediction. Hofacker et al. implemented both MFE and partition

function algorithms in the Vienna RNA package (85). Similar to the Vienna RNA package,

UNAFOLD (100) also provides suboptimal structure predictions. Both packages identify an

MFE structure using DP, and identify base-pairs that have the potential to form

pseudoknots.
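
The relation between optimization and the partition function can be sketched on a toy model in which every base pair contributes the same free energy. Sums replace the minimum over alternatives, and products of Boltzmann weights replace sums of energies. The constant pair energy of -1 kcal/mol, the value RT = 0.616 kcal/mol (roughly 37 °C), the minimum loop size and the function name are assumptions for illustration, not McCaskill's full energy model.

```python
from math import exp

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_function(seq, pair_energy=-1.0, rt=0.616, min_loop=3):
    """Sum of Boltzmann weights over all nested structures of seq."""
    n = len(seq)
    weight = exp(-pair_energy / rt)  # Boltzmann factor of a single base pair
    # z[i][j] holds the partition function of seq[i..j]; intervals too short
    # to contain a pair only admit the open chain, whose weight is 1.
    z = [[1.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            total = z[i][j - 1]  # structures in which base j is unpaired
            for k in range(i, j - min_loop):  # structures in which j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = z[i][k - 1] if k > i else 1.0
                    total += left * weight * z[k + 1][j - 1]
            z[i][j] = total
    return z[0][n - 1] if n else 1.0
```

Each structure is counted exactly once because base j is either unpaired or paired with exactly one k. With `pair_energy=0.0` every structure has weight 1, so the function simply counts the admissible structures (six for "GGAAACC"). The equilibrium probability of a structure with m pairs is its weight, `weight ** m`, divided by the returned value.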

Pseudoknot prediction is computationally challenging for such MFE structure prediction

algorithms because pseudoknot base-paired regions are not in linear order along the

sequence, which breaks the basic recursion. The standard nested DP recursion cannot deal with this kind of non-nested

problem. Pseudoknots can be predicted by heuristic approaches; however, some of the

approaches have restrictions such that they can only be applied to certain types of

pseudoknots, and the overall performance of such heuristic approaches is good only when

the RNA sequence is short. Rivas and Eddy proposed a new approach using DP to predict

single MFE structures with pseudoknots (95). More recently, Reeder and Giegerich

reduced the computational complexity by restricting the prediction to only certain types

of pseudoknots (which are easy to compute) (94,106). Ren et al. applied Zuker’s

algorithm to predict the substructures and identify pseudoknots. Among those

substructures, they search for the energetically favorable pseudoknots (90).

Sperschneider et al. used base pair probability dot-plots, produced by RNAfold from the

Vienna RNA package, to identify candidate substructures and predict pseudoknots

(107,108).
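
The non-nested property that defeats the basic recursion is easy to state on a list of base pairs: two pairs (i, j) and (k, l) cross when i < k < j < l. The sketch below reports every crossing pair of base pairs; the function name and the 0-based positions are assumptions for illustration.

```python
def find_crossing_pairs(pairs):
    """Return all pairs of base pairs ((i, j), (k, l)) with i < k < j < l."""
    ordered = sorted((min(i, j), max(i, j)) for i, j in pairs)
    crossings = []
    for index, (i, j) in enumerate(ordered):
        for k, l in ordered[index + 1:]:
            if i < k < j < l:  # k opens inside (i, j) but closes outside it
                crossings.append(((i, j), (k, l)))
    return crossings
```

A structure is free of pseudoknots exactly when this list is empty. The quadratic scan is adequate for single structures, though a sweep over sorted endpoints would be preferable for large inputs.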

The ensemble of structures captures the dynamics of RNA molecules, and I believe it could

be used to enhance the identification and classification of RNAs. Pseudoknots can

potentially be predicted with higher confidence using dynamic information. The XIOS

framework I propose in this thesis provides the ability to specifically include multiple

structures as well as pseudoknots in a single graph representation.

The predictions of thermodynamic energy model approaches mainly depend on the

thermodynamic parameters. But there is doubt about the reliability of the optimization

approaches and lack of precision in the energetic parameters. A better understanding of

the thermodynamic parameters would provide better predictions, but there are some

limitations. The Watson-Crick and wobble base pairs are not the only base pair

interactions in RNA. Hoogsteen base pairs and pyrimidine-pyrimidine base-pairs are

frequently found in hairpin loops and the GNRA-tetraloops. Tandem A-G pairs and non-

Watson-Crick A-U pairs are also present in the three-way junction of the hammerhead

ribozyme. Trans Hoogsteen/sugar edge base-pairing in RNA plays an important role in

stabilizing folded RNA molecules. Those noncanonical (non-Watson-Crick) base-paired

regions are not considered in existing prediction programs. The presence of such

noncanonical regions will lead to incorrect local or global predictions. Without

considering noncanonical interactions and possible RNA protein binding interactions, we

cannot get a reliable energy model for RNA structure prediction, which is also true for

protein structure prediction.

In short, in the computational biology area, efficient RNA secondary structure prediction

algorithms have been available since the 1980s. The accuracy of these algorithms,

handicapped by biochemical and computational limitations, is still not very satisfactory.

Currently, the accuracy of RNA secondary structure prediction is around 80% (60-80%

for pseudoknot structure predictions). Those programs can be used to obtain a general

scaffold of RNA structure, which still needs further validation and support. In the

meantime, there is plenty of room to improve the prediction accuracy, such as by incorporating

noncanonical interactions into the prediction models. In my thesis work, I use existing

programs to predict RNA secondary structures of RNA families (if there is no structure

available). Despite their unsatisfactory performance, at least some of the structure

prediction results still capture some true aspects of the conserved structure patterns. By

combining all the predicted structures together, my method is still able to find the

important structure patterns conserved across an RNA family and achieve the goal of my

study.

1.5 Decision making in novel molecule design

In the modern age, with the advances in biology, chemistry, laboratory technologies and

equipment, we understand and realize that many small natural products can benefit

human beings. One well known example is the discovery of penicillin antibiotics. It was

the first drug against many previously serious diseases and infections. The average

human life span was increased by 8 years due to the introduction of antibiotics. Also, we

have been designing and synthesizing many kinds of substances to accomplish desirable

tasks, such as the famous synthetic pesticide dichlorodiphenyltrichloroethane (DDT).

With the discoveries of the effects of natural products and chemical intuition about

them, synthetic chemists use synthetic organic techniques to synthesize compounds

with desirable effects. We use synthetic products as a basic part of our daily life: in food

products, commodities and drugs. In cancer research, scientists would like to find drugs

that target cancer cells in order to cure cancer. Virologists are also looking for ways to

inhibit viral effects on hosts.

But the big problem is that we still do not have a deep understanding of these chemicals.

The need to have ways to describe molecules in terms of mathematical or logical rules

has never been greater. Without that there is no means to efficiently design new

molecules with novel effects, and the progress would be slow and costly. Structure

determines function. The understanding of molecular structures and knowledge of their

structural diversity are essential in aiding the novel synthetic compound design process.

Now there are two main strategies in modern drug discovery: target oriented synthesis

(TOS) and diversity oriented synthesis (DOS) (109). Both of them involve small molecule

high throughput screening (HTS) for those that can bind to specific targets. Planning of

the small molecule library is the most important step related to the efficiency of the

drug discovery process and result. With predictive models and classification methods,

based on structure to function correlations, one could provide practical insights into

construction, selection, and screening of small molecule libraries. Chemists can prioritize

the synthetic process and focus on the small molecules with higher likelihood of having

higher binding specificity to the target.

Similarly in the case of RNA, with strategies such as in vitro evolution and the knowledge

of the catalytic functions of various RNA molecules, it is possible to make RNA molecules

for predefined purposes, such as RNA catalysts. This has been called evolutionary

biotechnology or applied molecular evolution (110-113) in the pharmaceutical drug

engineering described above and biotechnology related applications. Nowadays, the

standard procedure is to perform selection experiments on a large pool of partially or

completely randomly synthesized nucleotide sequences for desired functions (114-118).

The strategy is similar to DOS, in chemical organic synthesis, which is expensive and

inefficient without guidance from the predictive models.

Computationally, understanding the link between RNA structure and its function would

give a solid foundation for computational predictive models, which calculate the

likelihood of having a specific desired function for a given RNA sequence.

1.6 Summary/organization of this work

RNA related research is the shining star of the booming next-generation sequencing

(NGS) industry. Research focus is shifting from protein to protein-RNA complexes and

RNA molecules. At the current stage, high throughput techniques have enabled

researchers to collect enormous amounts of genomic and transcriptomic data at an ever

increasing speed. There’s no way for human beings to analyze and annotate this large

scale data manually. This is the perfect time for bioinformatics and computational

biology to aid the hypothesis generating process.

In Chapter 2, Pattern matching in RNA structures, I present a new graph theoretic RNA

secondary structure representation, XIOS, which describes RNA secondary structures on

a topological basis, and includes pseudoknots. XIOS also includes ensembles of

structures in a single graph in order to capture the dynamics of the RNA molecule. It

allows analysis of RNA structures using graph matching approaches, such as similarity

comparison and classification.

Chapter 3, RNA structural fingerprint. This chapter describes a new concept, the RNA

structural fingerprint, that describes the topological characteristics of an ensemble of

RNA suboptimal structures. A small structural motif library has been constructed by

graph enumeration, and an advanced graph indexing technique is modified and

improved to speed up the fingerprint searching process.

Chapter 4, Matching unknown RNA structures. This chapter demonstrates a database

searching tool which identifies topologically similar structures based on a query

structure. It uses the XIOS representation as a core framework and applies an advanced,

modified graph indexing technique to provide fast structure searching ability. The

search result provides insight into the possible function of the query structure.

Chapter 5 describes an RNA structure topological feature selection study, based on the RNA

structural fingerprint, in which a feature selection method is applied to study the

correlation between topological features and function. It reveals important structural-

topological patterns that provide information that can be used to relate RNA structural

topology to RNA function. I demonstrate that topological patterns can be used to

classify known types of RNAs.


1.7 Data sources

1.7.1 Manually curated dataset

For this thesis, we have identified a diverse set of RNA sequences of varied lengths

which are involved in a variety of biological processes: tRNA, tmRNA, RNase P RNA and

Group I Intron RNA (Table 1.1) from different reliable resources (119-122). Each dataset

contains only non-redundant sequences with low sequence similarity (< 50%). Further

steps taken in curating datasets for each category are detailed below.

1.7.1.1 tRNA

Sixteen non-redundant tRNA sequences with crystal structures (resolution < 3 Å) were obtained from the
Protein Data Bank (PDB) (120). Base pairing information was extracted by RNAView
(123); single base-pairs were not included. The PDB IDs are: 1C0A, 1F7U, 1GAX, 1H4S, 1QF6,
1QTQ, 1QU2, 1TTT, 2BTE, 2CSX, 2DXI, 2FMT, 2ZM5, 2ZUF, 2ZZM, 3EPH. Note that
non-Watson-Crick base-pairs are often formed adjacent to the stems of the RNA cloverleaf

secondary structure, giving rise to at least one pseudoknot in most tRNA tertiary

structures.

1.7.1.2 RNase P RNA

Representative RNase P RNA sequences were selected from classes as defined by Ellis

and Brown (124). In each case, all sequences of a class were retrieved from the RNaseP

database (125) and purged to remove sequences with greater than 50% sequence

identity. Stems and pseudoknots were then reviewed manually and adjusted for

consistent labeling. A total of 39 curated structures were obtained from the following


classes: A1 (5), A2 (2), A3 (1), A4 (2), A5 (2), AX (2), B1 (7), B2 (1), B3 (3), BX (1), C (1),

Archaeal type A (11), and Archaeal type M (3). Secondary structures and pseudoknots

were assigned according to Ellis and Brown (124) and folding diagrams in the RNase P

database entries (125). A complete list of sources is given in Table 1.2.

1.7.1.3 Group I Intron RNA

152 sequences were downloaded from the RNA STRAND database (121). The shortest

and longest 10% of the sequences were removed on the assumption that these were

most likely to be incomplete or poorly annotated. Sequences with greater than 50%

sequence identity were purged leaving 36 sequences ranging from 240 to 602 bases in

length. With one exception, PDB structure 1L8V, for which the RNA structure is assigned

by RNAView (126), stems and pseudoknots were assigned according to expert curation

in the CRW database (127). A complete list of the sequences is given in Table 1.3.
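The length-based filtering step described above can be sketched as follows. This is a minimal illustration and not the procedure actually used for the thesis: the function name `trim_length_extremes` is hypothetical, and rounding the 10% cut-off down is an assumption, since the text does not say how the cut-off was rounded.

```python
def trim_length_extremes(lengths, fraction=0.10):
    """Drop the shortest and longest `fraction` of the sequence lengths.

    Hypothetical helper: the cut-off count is rounded down, which is an
    assumption made only for this sketch.
    """
    ordered = sorted(lengths)
    k = int(len(ordered) * fraction)  # number removed from each end
    return ordered[k:len(ordered) - k]


# Toy example with 20 made-up lengths; two are removed from each end.
toy_lengths = list(range(100, 2100, 100))
kept = trim_length_extremes(toy_lengths)
```

Under these assumptions, a starting set of 152 lengths would be reduced to 122 before the sequence-identity purge is applied.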

1.7.1.4 tmRNA

632 complete sequences (from 514 species) with structural assignments were obtained

from the tmRNA website (128). Sequences were purged to remove sequences with >40%

sequence identity, leaving 165 sequences. 48 of these sequences contained asterisks,

indicating the absence of some bases, and were removed. The final dataset consists of

117 sequences, with sequence lengths ranging from 230 to 393 bases. The structure

assignments in the tmRNA database are used as the curated structures. A complete list

of sequences is given in Table 1.4.


1.7.2 STRAND dataset

RNA STRAND (The RNA secondary STRucture and statistical ANalysis Database) is a
database containing a comprehensive collection of known RNA secondary structures
(experimentally solved and computationally predicted) from different organisms. The datasets
collected from STRAND correspond to the four RNA families in our manually curated
dataset: tRNA, RNAseP, Group I intron RNA and tmRNA (Table 1.5).

Compared to the manually curated dataset, which is clean and of high quality, the dataset
from STRAND is a mixture of reliable structure data, partial structures, noise, and even
misclassified entries. The STRAND dataset therefore contains structure data of widely
varying quality, and represents the average quality of other RNA structural databases.


Figure 1.1 Common types of Pseudoknots


Figure 1.2 Common RNA secondary structure representations.

A. Stem-loop diagram depicts RNA secondary structure elements: S stacking base pair (or stem), H hairpin loop, I interior loop, B bulge and M multiple loop. B. Stem-loop diagram with Pseudoknot. C. Circle Plot. The backbone nucleotides of RNA are arranged along a circle, and base pairs are drawn as arcs. D. Dome Plot. The backbone nucleotides of RNA are placed in a line, and base pairs are drawn as arcs. E. Circle plot including a Pseudoknot. Pseudoknot structure is indicated by arcs crossing each other. F. Dome plot including a Pseudoknot. Pseudoknot structure is again indicated by the arcs crossing each other. G. Mountain Plot. The x-axis of the Mountain plot corresponds to the RNA sequence, and the y-axis shows the number of base pairs in which a specific nucleotide is enclosed. H. Primary sequence (X means any one of the four nucleotides) and its linear dot-bracket representation. A dot indicates a non-base-paired position, and a pair of matching brackets at positions i and j indicates there is a base-pair (i, j) between positions i and j.


Figure 1.2


Figure 1.3 Rooted-labeled tree

Closed circles represent base-pairs, and leaf nodes represent unpaired nucleotides. The root of the tree, box, is a dummy node added as the root of the tree graph which serves as the parent of all nodes in the tree to ensure structures with free end(s) are not represented by a forest (disconnected trees). Stems are rope-like and loops are bush-like.


Figure 1.4 Dot plots

Two types of dot plots are shown here. This is a tRNA example. Center is the stem-loop diagram showing the familiar tRNA cloverleaf. Left. Partition function dot plot is a base pairing binding probability matrix. It is a visualization of the thermodynamics of an ensemble of structures. The color of the dot reflects the negative log of base pair binding probability of the base pair. The order from red to brownish, to green and then to blue is the decreasing order of the probability. Right. Energy dot plot shows, for a specific base pair, the lowest free energy for a structure that ends at this base pair. The order from red to brownish, to green and then to blue is the increasing order of the free energy. Four stem regions are highlighted by the colored boxes. Dot plots are generated by the RNAStructure software (1).

Figure 1.4


Figure 1.5 RNA tree graph, RNA dual graph and RNA digraph representations.

A is the tRNA secondary structure in its squiggle notation. B is its tree graph representation. Each vertex of the tree graph is a loop region (the 3’ and 5’ ends of a stem are also considered a loop region), and edges represent stems. C is its dual graph representation. Vertices represent stems, and edges are loop regions (the 3’ and 5’ ends of a stem are not a loop in this representation). D is the digraph representation. This is an RNA dual graph with directed edges. The direction of the edges can resolve some ambiguity in representing RNA topologies.


Table 1.1 Manually curated structures.

| Dataset | tRNA | RNAseP | Group I intron | tmRNA |
|---|---|---|---|---|
| Sample size | 16 | 40 | 36 | 117 |
| Source | PDB (120) | RNAseP database (122) | STRAND (121) | The tmRNA Website (119) |
| Min graph size | 3 | 11 | 9 | 4** |
| Max graph size | 6 | 26 | 25 | 22 |
| Average graph size | 5.25 | 19.55 | 16.11 | 16.65 |

** tmRNA has an outlier, tmRNA/BaciPhage_G; this structure has only 4 stems.


Table 1.2 RNaseP Sequences Used.

RNaseP stems and pseudoknots were assigned based on expert curation in the RNAseP database, and structural types were assigned according to Ellis and Brown (2009). All structural assignments were manually reviewed; in some cases minor adjustments had to be made to the RNaseP database structures to make the labeling of stems consistent across all structures.

| Species | Length (bases) | Type |
|---|---|---|
| Buchnera APS (a) | 376 | A1 |
| Carboxydothermus hydrogenoformans | 331 | A1 |
| Neisseria meningitidis | 360 | A1 |
| Pseudomonas fluorescens | 354 | A1 |
| Serratia marcescens | 378 | A1 |
| Cupriavidus necator | 341 | A2 |
| Nitrosomonas europaea | 285 | A2 |
| Chlamydia pneumoniae | 406 | A3 |
| Anacystis nidulans | 385 | A4 |
| Pseudanabaena PCC6903 (b) | 450 | A4 |
| Bacillus pertussis | 414 | A5 |
| Chlorobium tepidum | 381 | A5 |
| Mycobacterium leprae | 423 | AX (d) |
| Streptomyces lividans | 405 | AX (d) |
| Bacillus anthracis | 408 | B1 |
| Bacillus megaterium | 408 | B1 |
| Enterococcus faecalis | 389 | B1 |
| Mycoplasma capricolum | 356 | B1 |
| Staphylococcus epidermidis | 401 | B1 |
| Staphylococcus gordonii | 382 | B1 |
| Ureaplasma urealyticum | 370 | B1 |
| Mycoplasma flocculare | 412 | B2 |
| Mycoplasma fermentans | 302 | B3 |
| Mycoplasma pneumoniae | 369 | BX (c) |
| Thermomicrobium roseum | 350 | C |
| Aeropyrum pernix | 330 | Archaeal type A |
| Halobacterium cutirubrum | 375 | Archaeal type A (e) |
| Halococcus morrhuae | 475 | Archaeal type A (f) |
| Metallosphaera sedula | 304 | Archaeal type A |
| Methanobacterium thermoautotrophicum | 293 | Archaeal type A |
| Methanosarcina barkeri | 371 | Archaeal type A |
| Natronobacterium gregoryi | 474 | Archaeal type A |
| Pyrococcus abyssi | 330 | Archaeal type A |
| Sulfolobus acidocaldarius | 315 | Archaeal type A |
| Sulfolobus solfataricus | 311 | Archaeal type A (h) |
| Thermoplasma volcanum (g) | 305 | Archaeal type A |
| Archaeoglobus fulgidus | 229 | Archaeal type M |
| Methanococcus jannaschii | 252 | Archaeal type M |
| Methanococcus maripaludis | 233 | Archaeal type M |

(a) Also known as Ralstonia or Alcaligenes eutrophus
(b) Labeling of stems does not fall easily into standard scheme due to second stem coming off of L15


(c) This structure is midway between B1 and B2 having stem 10.1 and P9, but lacking P19. Also has an extra pseudoknot between the L9 and the region before P20.
(d) Clearly A type due to presence of P6, P13 and P14 and lack of P15.1. Has additional stem (annotated in this work as P16.1) coming off of L15.
(e) This structure is difficult to label due to three stems branching from P12. In this work these were annotated as P12.1 - P12.3. The structure given for the P15-P17 region may not be correct.
(f) RNaseP database annotated structure may not be entirely correct.
(g) RNaseP database diagram labelled T. volvanum
(h) No RNAML file available, structure annotated based on .ct file and structure diagram.


Table 1.3 Group I Intron Sequences Used.

| RNA Strand ID | CRW ID (a) | Genbank Accession |
|---|---|---|
| CRW_00009 | b.I1.e.C.hypophloia.E.SSU.989.bpseq | AF015912 |
| CRW_00012 | b.I1.e.H.rubra.1.C1.SSU.1506.bpseq | L19345 |
| CRW_00015 | b.I1.e.M.anisopliae.2.E.LSU.2066.bpseq | AF197123 |
| CRW_00608 | a.I1.b.Dermocarpa.sp.ATCC29371.C3.tMET.bpseq | U10480 |
| CRW_00609 | a.I1.b.P.hollandica.1.C3.trnL.bpseq | U29955 |
| CRW_00619 | a.I1.c.N.tabacum.C3.tLEU.bpseq | M16898, Z00044 |
| CRW_00626 | a.I1.e.A.adeninivorans.C1.LSU.2449.bpseq | Z50840 |
| CRW_00633 | a.I1.e.B.ciliata.JCM6865.C1.SSU.1506.bpseq | D38233 |
| CRW_00634 | a.I1.e.B.ciliata.JCM6865.C1.SSU.943.bpseq | D38233 |
| CRW_00637 | a.I1.e.C.botrytis.C1.SSU.1506.bpseq | X77453 |
| CRW_00639 | a.I1.e.C.ellipsoidea.IAMC-87.C1.SSU.1506.bpseq | D13324 |
| CRW_00641 | a.I1.e.C.grayi.UNK.SSU.1046.bpseq | Z14026 |
| CRW_00642 | a.I1.e.C.grayi.UNK.SSU.1516.bpseq | Z14026 |
| CRW_00643 | a.I1.e.C.luteoviridis.B.C1.SSU.1052.bpseq | X73998 |
| CRW_00645 | a.I1.e.C.merochlorophea.UNK.SSU.1210.bpseq | Z14025 |
| CRW_00651 | a.I1.e.C.saxonicum.C1.SSU.1506.bpseq | X79497 |
| CRW_00652 | a.I1.e.C.sorokiniana.C1.SSU.323.bpseq | X73993 |
| CRW_00653 | a.I1.e.D.parva.C1.SSU.1512.bpseq | M62998 |
| CRW_00656 | a.I1.e.E.dermatitidis.C1.SSU.943.bpseq | Z75304 |
| CRW_00657 | a.I1.e.G.planctonica.C1.SSU.943.bpseq | Z28970 |
| CRW_00658 | a.I1.e.G.spirotaenia.C1.SSU.1506.bpseq | X74753 |
| CRW_00659 | a.I1.e.L.dispersa.UNK.SSU.1046.bpseq | L37734 |
| CRW_00662 | a.I1.e.L.dispersa.UNK.SSU.1516.bpseq | L37734 |
| CRW_00664 | a.I1.e.L.dispersa.UNK.SSU.516.bpseq | L37734 |
| CRW_00673 | a.I1.e.P.clavariiformis.C1.SSU.943.bpseq | AB003945 |
| CRW_00687 | a.I1.e.S.paniceum.UNK.SSU.1052.bpseq | D49657 |
| CRW_00689 | a.I1.e.S.paniceum.UNK.SSU.1210.bpseq | D49657 |
| CRW_00690 | a.I1.e.S.paniceum.UNK.SSU.1506.bpseq | D49657 |
| CRW_00692 | a.I1.e.S.sclerotiorum_1837.C1.LSU.798.bpseq | AJ226089 |
| CRW_00694 | a.I1.e.Staurastrum.sp.M753.C1.SSU.1506.bpseq | X77452 |
| CRW_00703 | a.I1.m.M.grisea.B2.ND1.bpseq | X96412 |
| CRW_00715 | a.I1.m.S.luteus.A1.LSU.2504.bpseq | L47586 |
| CRW_00717 | a.I1.m.S.luteus.B4.LSU.1923 | L47586 |
| CRW_00721 | a.I1.m.S.pombe.B1.OX1.bpseq | M15669 |
| CRW_00814 | b.I1.e.B.fuscopurpurea.7.C1.SSU.516.bpseq | AF172557 |
| PDB_00140 (a) | PDB | 1L8V |

(a) The final structure is from PDB, not CRW.


Table 1.4 tmRNA Sequences Used

| Sequence ID | Length (Bases) | Sequence ID | Length (Bases) | Sequence ID | Length (Bases) |
|---|---|---|---|---|---|
| Acibm_capsu | 339 | Fibro_succi | 346 | Propi_acne2 | 354 |
| Acthi_ferr2 | 353 | Frnkia_EAN1 | 367 | Psalt_atlan | 340 |
| Alkph_metal | 334 | Fusob_nucl2 | 329 | Psmon_aerug | 337 |
| Aquif_aeoli | 336 | Geobl_stear | 334 | Rbrbr_xylan | 343 |
| Artbr_sFB24 | 354 | Geobr_metal | 340 | Rhopi_balti | 366 |
| Bacil_liche | 340 | Graci_tenui | 342 | Rumin_albus | 339 |
| BaciPhage_G | 264 | Guill_theta | 267 | Salbr_ruber | 355 |
| Bdell_bacte | 333 | Helcb_hepat | 349 | Solib_usita | 343 |
| Bifid_long2 | 385 | Helcb_muste | 343 | Stapc_aure3 | 347 |
| Bloch_flori | 375 | Helcb_pylo2 | 365 | Stmyc_averm | 372 |
| Bloch_penns | 374 | Kineo_radio | 348 | Strpc_zooep | 330 |
| Borde_pertu | 371 | Lacbl_casei | 343 | Sulfh_azore | 336 |
| Borre_garin | 349 | Lacbl_plan2 | 350 | Syntr_acidi | 341 |
| Brevi_linen | 364 | Laccc_lact1 | 335 | Tanne_forsy | 380 |
| Buchn_aphi1 | 352 | Lactb_aciph | 347 | Thala_pseud | 254 |
| Buchn_aphi3 | 351 | Lawso_intra | 364 | Thana_tengc | 338 |
| Bvora_marin | 374 | Leifso_xyli | 379 | Thdsb_commu | 339 |
| Caldc_sacch | 354 | Lepta_inter | 336 | Therv_yello | 338 |
| Campbr_lari | 340 | Leptm_grup2 | 348 | Thmic_cruno | 340 |
| Chlbm_tepid | 393 | Leuno_mesen | 339 | Thmic_roseu | 339 |
| Chlfl_auran | 345 | Magcocc_MC1 | 343 | Thmus_ther2 | 333 |
| Chrbm_viol2 | 351 | Marbr_aquae | 350 | Thtog_neapo | 343 |
| Clavi_michi | 360 | Mecoc_capsu | 339 | Trepo_denti | 339 |
| Clost_aceto | 340 | Mesos_virid | 298 | Trepo_palli | 342 |
| Clost_diffi | 331 | Micbu_degra | 373 | Troph_whipp | 347 |
| Clost_perf2 | 344 | Moore_therm | 342 | Uncul_bone3 | 343 |
| Clost_therm | 374 | Mycpl_arthr | 371 | Uncul_farm3 | 352 |
| Copth_prote | 337 | Mycpl_mobil | 355 | Uncul_farm7 | 346 |
| Coxie_burne | 348 | Mycpl_pneum | 368 | Uncul_flatw | 280 |
| Crbth_hydro | 344 | Mycpl_pulmo | 354 | Uncul_phako | 356 |
| Cyanp_parad | 231 | Myxoc_xanth | 351 | Uncult_HF10 | 323 |
| Cytph_hutch | 389 | Nephr_oliva | 264 | Verru_spino | 336 |
| Dehco_ethen | 339 | Nostoc_7120 | 374 | Wiggl_gloss | 344 |
| Deinc_radio | 336 | Ntsco_ocean | 345 | Wolin_succi | 349 |
| Desfm_aceto | 336 | Ocebl_iheyi | 343 | Xanmo_camp1 | 380 |
| Destl_pschr | 347 | Odont_sinen | 281 | | |
| Dicgl_therm | 337 | Oenoco_oeni | 333 | | |
| Diche_nodos | 335 | Paeni_larva | 345 | | |
| Emiln_huxle | 230 | Phobm_profu | 351 | | |
| Entco_faecm | 348 | Porpa_purpu | 266 | | |
| Exiguoba_sp | 346 | Prevo_inter | 387 | | |


Table 1.5 Dataset collected from STRAND database.

The datasets collected here correspond to the four RNA families in our manually curated dataset: tRNA, RNAseP, Group I intron RNA and tmRNA.

| Dataset | tRNA (> 50 nt) | RNAseP (100 - 300 nt) | Group I intron (100 - 300 nt) | tmRNA (100 - 300 nt) |
|---|---|---|---|---|
| Sample size | 601 | 36 | 21 | 30 |
| Min graph size | 3 | 5 | 7 | 8 |
| Max graph size | 6 | 22 | 15 | 23 |
| Average graph size | 4.13 | 11.47 | 11.05 | 16.71 |


CHAPTER 2 PATTERN MATCHING IN RNA STRUCTURES¹

2.1 Introduction

RNA molecules perform a variety of important biological functions in addition to

carrying information from the chromosome to the ribosome, or acting as structural

scaffolds. Catalytic RNAs play key roles in translation, RNA processing and splicing, and

gene regulation (36). Motifs that are important for RNA function are structural and

correspond to base-paired regions of secondary structure, which in turn, provide the

scaffold for the three-dimensional fold of the RNA (129,130). RNA sequences that have

the same structural motifs may have sequences that are impossible to align because

they have no detectable sequence similarity.

While programs that predict RNA secondary structure have been available since the

1980s, RNA structure prediction is handicapped by both biochemical and computational

limitations. Firstly, RNA exists as an ensemble of rapidly interconverting structures.

Protein structures (usually) show relatively minor fluctuations from a single minimum

free-energy state. The case is much different for RNA where there are usually many

1 This is the paper published in the Proceedings of the 2008 International Symposium on Bioinformatics Research and Applications (ISBRA2008). My contributions were participating in the algorithm and experiment design, implementing the algorithm, analyzing the data and results, making figures and writing the manuscript.

Full reference: Li, K., Rahman, R., Gupta, A., Siddavatam, P. and Gribskov, M. (2008) In Mandoiu, I., Sunderraman, R. and Zelikovsky, A. (eds.), Proceedings of the 2008 International Symposium on Bioinformatics Research and Applications. Bioinformatics Research and Applications, Atlanta, GA, Vol. 4983/2008, pp. 317-330.


structures with similar free-energies; these structures may be distinctly different in

terms of base-pairing (67,97). Secondly, while we know that pseudoknot structures are

very important in RNA structure and catalytic function (49), it remains difficult to

reliably predict pseudoknotted structures. This is due both to our incomplete

understanding of the energetics of pseudoknot formation, as well as to the

computational time complexity. The most efficient pseudoknot prediction algorithms,
e.g., pknotsRG, run in $O(n^4)$ time for certain classes of RNAs (94), but achieve this by
placing significant limitations on which structures can be found. Memory complexity of
RNA structure prediction is $O(n^2)$, where n is the length of the RNA sequence; n
usually ranges from 10,000 to 100,000 bases for primary RNA transcripts.

In biology, functionally important features can often be recognized because they are

conserved over evolutionary time. A common approach is to obtain a set of sequences

using some biological criterion (such as similarity of regulation), and use pattern

recognition methods to identify unusually conserved features. Searching for sequence

motifs (approximately common substrings) in this way has been a powerful tool for

analysis of DNA and proteins; this approach does not work as effectively with RNA

because conserved RNA structures may have no detectable sequence similarity. And

while great progress has been made, it remains difficult to accurately predict MFE

structures for RNA sequences. To further complicate the picture, RNAs exist as

ensembles of structures, in addition to the MFE structure, that are constantly

interconverting and fluctuating. The biologically important structures (those that are


conserved over evolutionary time) may be present only transiently, or as minor

components of this structural ensemble. The problem is further complicated by the fact

that biology is messy; one can rarely get completely clean sets of sequence data in

which every sequence actually contains the structure of interest. This makes many

approaches infeasible. In addition, in biological systems, conservation is only
approximate; no set of structures will match exactly.

We are building a system that allows one to find the greatest approximately conserved

structure(s) in a set of RNA sequences, in the presence of extraneous sequences that do

not share a common structure. This conserved common structure can then be used as

the basis for hypotheses about the importance of the structure in the biological

functioning of the RNA. These hypotheses can be tested either experimentally or by

further computational work.

We convert RNA structures to a graph representation that specifically includes

pseudoknots and is capable of representing an ensemble of RNA structures in a single

graph. Computationally, finding conserved structures corresponds to finding the

greatest approximately isomorphous subgraphs in a set of graphs, where each graph

represents a single RNA sequence. We use modifications of existing maximal subgraph

isomorphism algorithms to identify the similar portions of the graphs, and propose to

combine this with constrained MFE structure prediction tools (131), and a database

search capability.


Graph theoretical approaches have previously been applied to RNA structures (74,132),

but our approach differs significantly. The XIOS approach introduces the ability to

represent ensembles of structures, and emphasizes the topology of stems. Our

approach is most similar to that of Gan et al., but focuses on stem topologies rather

than the topology of loops and bulges (74). The XIOS approach also allows structural

motifs to be exactly matched without using heuristics (132).

2.2 XIOS RNA graphs

In this section, we describe the graph framework that we have developed to represent

ensembles of RNA structural topologies. We introduce the XIOS RNA graph

representation for RNAs, and discuss extensions to existing subgraph isomorphism

algorithms as they apply to XIOS RNA graphs.

2.2.1 Definition

XIOS RNA graphs represent ensembles of RNA structural topologies. In XIOS graphs,

each base-paired stem is represented by a vertex, and the edges connecting the vertices

indicate the topological relationship between the stems. Topologically, two stems can

be eXclusive (X, i.e., both cannot simultaneously form because they use the same

sequence ranges), Included (I, i.e., one is nested within the loop of the other; I indicates
the direction of the edge with respect to the higher numbered vertex and J indicates the
opposite), Overlapping (O, i.e., the stems have a pseudoknot relationship) or Serial
(S, i.e., adjacent, non-overlapping stem and loop structures) (Figure 2.1). Each pair of
vertices is related by one and only one X, I, O or S relationship.
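The four relationships can be illustrated with a short sketch that classifies a pair of stems from their sequence coordinates. This is an illustration under stated assumptions, not the thesis implementation: the `Stem` record (each half stem given as an inclusive position range), the function name `relation`, and the convention that `I` means "the second stem is nested inside the first" while `J` means the opposite are all hypothetical choices made for this example.

```python
from typing import NamedTuple


class Stem(NamedTuple):
    # Hypothetical record: the 5' half spans [l5, r5], the 3' half spans [l3, r3].
    l5: int
    r5: int
    l3: int
    r3: int


def _halves(s):
    return ((s.l5, s.r5), (s.l3, s.r3))


def _overlap(p, q):
    # True if two inclusive position ranges share at least one base.
    return p[0] <= q[1] and q[0] <= p[1]


def relation(a, b):
    """Classify the pair (a, b) as X, I, J, O or S (conventions assumed)."""
    if any(_overlap(p, q) for p in _halves(a) for q in _halves(b)):
        return "X"  # the stems compete for the same bases
    if a.r3 < b.l5 or b.r3 < a.l5:
        return "S"  # one stem lies entirely before the other
    if a.r5 < b.l5 and b.r3 < a.l3:
        return "I"  # b is nested in the loop of a
    if b.r5 < a.l5 and a.r3 < b.l3:
        return "J"  # a is nested in the loop of b
    return "O"      # the halves interleave: pseudoknot


outer = Stem(1, 5, 20, 24)
inner = Stem(8, 10, 14, 16)
knot = Stem(10, 12, 30, 32)
later = Stem(30, 33, 40, 43)
clash = Stem(3, 7, 12, 16)
```

Because the tests are applied in a fixed order and the last branch catches everything else, every pair of stems receives exactly one label, which mirrors the statement that each pair of vertices has one and only one relationship.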


2.2.2 Training data

We have developed Perl packages that translate Vienna RNA format (85) and the

MFOLD (83) connect format into XIOS graphs. Because the predicted MFE structure is

only one structure in a structural ensemble, we enumerate all energetically favorable

short stems and label the entire set as X, I, O, and S, as described above. The graph is

therefore an image of the entire structural ensemble. Our test datasets are described in

Table 2.1. Highly similar sequences with sequence identity >40% are removed from the

dataset to avoid selection bias.
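As a rough illustration of the first step of such a translation, the sketch below extracts stems from a dot-bracket string. It is not the Perl code referred to above: the function name `dot_bracket_to_stems` is hypothetical, only plain `(`, `)` and `.` characters are handled (so pseudoknots cannot be expressed), and a stem is assumed to be a maximal run of directly stacked base pairs.

```python
def dot_bracket_to_stems(db):
    """Return stems as (l5, r5, l3, r3) tuples using 1-based positions.

    Assumption for this sketch: a stem is a maximal run of stacked pairs
    (i, j), (i+1, j-1), ... with no unpaired bases in between.
    """
    stack, pairs = [], []
    for pos, ch in enumerate(db, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            pairs.append((stack.pop(), pos))
    pairs.sort()

    stems = []
    for i, j in pairs:
        # Extend the current stem if this pair stacks directly on its last pair.
        if stems and stems[-1][-1] == (i - 1, j + 1):
            stems[-1].append((i, j))
        else:
            stems.append([(i, j)])
    return [(s[0][0], s[-1][0], s[-1][1], s[0][1]) for s in stems]


hairpin = dot_bracket_to_stems("((..))")
two_hairpins = dot_bracket_to_stems("((..))..((...))")
branched = dot_bracket_to_stems("((..)(..))")
```

Each returned tuple gives the coordinate ranges of the two half stems, which is the information needed to assign X, I, O and S labels to pairs of stems.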

2.2.3 DFS Lexicographical ordering

DFS (Depth-First Search) lexicographical ordering was originally developed by Yan et al.

(133,134) in their gSpan algorithm for identifying common chemical structures in

chemical datasets. In the chemical structure case, both the vertices (atoms) and edges

(bonds: single, double and triple) are labeled, and all edges are undirected. gSpan is a

powerful search algorithm that reduces the search space for isomorphous subgraphs

using a clever DFS preordered search tree.

The traversal order of edges and vertices in the DFS of a graph can be canonically

ordered. This is called the DFS tree, or when serialized, the DFS code. Yan et al. proved

that graphs with the same DFS code are, by definition, isomorphous. Lexicographic rules

provide an unambiguous best order to the canonical DFS code (133).
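The idea of a canonical code can be illustrated with a deliberately naive sketch. Note that this is not the gSpan DFS code: it substitutes a brute-force canonical form that tries every vertex numbering and keeps the lexicographically smallest edge list, which is only practical for very small graphs. The edge dictionary format, the function name `canonical_form`, and the rule that an `I` label becomes `J` when the vertex order of an edge is reversed are assumptions made for this example.

```python
from itertools import permutations

# Directed inclusion edges swap label when read from the other endpoint.
FLIP = {"I": "J", "J": "I"}


def canonical_form(n, edges):
    """Smallest relabeled edge list over all numberings of n vertices.

    `edges` maps (u, v) with u < v to a label in {X, I, J, O, S}.
    Brute force over n! permutations; a stand-in for a DFS code.
    """
    best = None
    for perm in permutations(range(n)):
        code = []
        for (u, v), label in edges.items():
            pu, pv = perm[u], perm[v]
            if pu > pv:
                pu, pv = pv, pu
                label = FLIP.get(label, label)
            code.append((pu, pv, label))
        code = tuple(sorted(code))
        if best is None or code < best:
            best = code
    return best


# The same three-stem topology under two different vertex numberings.
g1 = {(0, 1): "I", (0, 2): "I", (1, 2): "S"}
g2 = {(0, 1): "S", (0, 2): "J", (1, 2): "J"}
# A different topology: the two nested stems form a pseudoknot.
g3 = {(0, 1): "I", (0, 2): "I", (1, 2): "O"}
```

Two graphs are isomorphous exactly when their canonical forms are equal, which is the property the DFS code provides at far lower cost by pruning non-minimal codes during the search.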


The direct path from the first traversed vertex (root) to the most recently added vertex

(right-most vertex) is the right-most path. The extension of DFS graphs by edge growth

is restricted to extension from the rightmost path, similarly to the approach of

TreeminerV (135). Graphs are extended in the following order: edges to existing vertices

(backward edges), edges to new vertices extending from the right-most vertex, and

extension from internal vertices on the right-most path. An intrinsic property of the DFS

lexicographical ordering is that it creates a preorder that can be used to efficiently

explore the search tree when searching for isomorphous subgraphs. Isomorphic forms

of a graph fall in different positions in the search tree, but the canonical DFS

representation of a particular isomorph is guaranteed to be found first. Hence, the

lexicographically first instance of an isomorph in the search tree is its minimum

representation or canonical labeling and other instances can be efficiently pruned. Each

edge in the DFS code is described by a 3-tuple, $(v_i, v_j, l_{i,j})$, where $v_i$ and $v_j$ are two
connected vertices and $l_{i,j}$ is the label of the edge. Figure 2.3 shows how the canonical
labeling can easily be identified using lexicographic rules even though many different
DFS codes are possible. There are two additional rules that prune the search space.
Firstly, if the initial edge of a minimum DFS code is type $e_0$, then no following edge can
have a lexically smaller edge label, and secondly, for any backward edge growth to $v_j$, an
edge cannot be lexically smaller than any edge that is already connected to $v_j$ or $v_{rightmost}$

(133). Each distinct mapping of vertices to a DFS code is the support for that potential

solution. Since many such mappings are possible, each graph may have multiple supports
for a DFS code. As a simple example, Figure 2.2 shows the XIOS graph for a tRNA,

according to the experimentally determined 3-dimensional structure (PDB ID: 1EHZ).

2.2.4 Enumeration of N-stem structures

Every RNA structure can be represented by a XIOS graph. For n stems, the upper bound
on the number of possible structures², N, can be calculated by Equation (1),

$$N = \frac{(2n)!}{2^{n} \cdot n!} \qquad (1)$$

For example, there is only one possible one-stem structure, two possible two-stem

structures, and 10 possible three-stem structures, but only eight unique structures

(Table 2.2). Figure 2.3 shows the XIOS graphs for the eight unique structures that can be

formed from three stems. The other two three-stem structures are either redundant or

physically impossible.
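The upper bound of Equation (1) can be checked numerically against the product form given in the footnote, (2n-1)*(2n-3)*…*3*1. The sketch below is only an arithmetic check; the function names are hypothetical.

```python
from math import factorial


def upper_bound(n):
    """Closed form of Equation (1): (2n)! / (2^n * n!)."""
    return factorial(2 * n) // (2 ** n * factorial(n))


def odd_product(n):
    """Product form from the footnote: (2n-1) * (2n-3) * ... * 3 * 1."""
    product = 1
    for k in range(1, 2 * n, 2):
        product *= k
    return product


bounds = [upper_bound(n) for n in range(1, 6)]
```

The two forms agree for every n tested here. Because Equation (1) is an upper bound, the number of structures that are actually distinct and physically possible can be smaller than these values.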

2.3 Greatest conserved structures

2.3.1 Extension of the gSpan algorithm

XIOS graphs have several differences from the chemical structure graphs considered by

Yan and Han. XIOS graphs

2 For the n-stem case, there are 2n half stems. We assign integer labels to each half stem from 1 to 2n-1. By definition, the first half stem is labeled 1, and there are 2n-1 possible half stems that can pair with the first half stem; the third half stem has only one possible label (2 or 3), and there are 2n-3 possible half stems that can pair with this half stem, and so on. The upper bound on the number of possible n-stem structures is therefore: (2n-1)*(2n-3)*(2n-5)*…*5*3*1.


• have both directed and undirected edges. I edges are directed because it is

highly important whether a stem is nested within or outside another stem. X, O,

and S edges are undirected.

• do not have vertex labels. Because every vertex is simply an anonymous

elemental stem, no labels are available.

The use of unlabeled vertices with the gSpan algorithm is fairly straightforward, but

results in a decreased ability to rapidly prune the search tree. Directed edges are a little

more difficult to accommodate because the direction of the edge depends on the vertex

from which one looks. The simplest approach is to label the edge as either I or J from the

point of view of the lowest numbered vertex. I and J are treated as lexicographically

distinguishable edges.

In the original application of gSpan to chemical structures, Yan and Han were interested

in identifying frequently occurring chemical substructures. In their case, structures that

occur many times in a single graph are equally interesting. The case of RNA differs;

motifs that occur in multiple graphs (molecules), rather than many times in a single

graph (molecule), are considered more important. In addition, the presence of

incorrectly classified sequences, i.e., sequences that have no common structure, means

that not all graphs will support the biologically relevant subgraph. For XIOS graphs,

therefore, support is calculated as the number of graphs that contain a subgraph,

rather than the total count of matching subgraphs.
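The difference between the two ways of counting support can be shown with a simplified sketch. To keep the example short, the pattern is restricted to a single labeled edge, so no subgraph isomorphism search is needed. The edge dictionary format and both function names are assumptions made for this illustration.

```python
def graph_support(graphs, label):
    """Number of graphs containing at least one edge with this label."""
    return sum(1 for g in graphs if label in g.values())


def occurrence_count(graphs, label):
    """Total number of edges with this label across all graphs."""
    return sum(list(g.values()).count(label) for g in graphs)


# Three toy XIOS-like graphs, each mapping a vertex pair to an edge label.
graphs = [
    {(0, 1): "I", (0, 2): "I", (1, 2): "S"},
    {(0, 1): "O"},
    {(0, 1): "I", (0, 2): "O", (1, 2): "O"},
]
```

In this toy data the `I` edge occurs three times in total but in only two of the three graphs; under the per-graph definition a support threshold would be compared against the latter value.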


2.3.2 Graph matching algorithm (similar to gSpan³)

begin: For a XIOS graph G with edges eG
  I. Sort edges in eG by edge type eG ∈ {X,I,O,S}
  II. For each edge type E
    1. Find all lexicographically minimal one edge subgraphs, S, from the given XIOS graphs;
    2. For each edge e in S
    3. Do Subgraph_mining(G, S, e):
       i. If the graph is NOT a minimum graph according to DFS lexicographical order, return;
       ii. Generate all potential children with one edge growth, enew
       iii. If support for each child is above threshold
            Recursively call Subgraph_mining with updated edge list (G, S+, enew)
    4. Remove all edges of edge type E from G after all descendents have been searched
    5. If eG = Ø, break;
end.

2.3.3 Greatest conserved structure(s) in a set of RNAs

Many computational approaches use pairwise or multiple DNA or protein sequence

alignments to find conserved motifs, but this approach is generally impossible with RNA

sequences because of their lack of conserved sequences, and because of the difficulty of

obtaining unambiguously correct alignments. However, secondary and higher order

structures in RNA are conserved, so matching the topology of two RNA structures with a

graph matching approach can identify conserved motifs that cannot be seen in the

sequences. The pre-ordered DFS search approach of gSpan provides an effective

approach to this problem.

3 Adapted, with minor modification, from (133): Yan, X. and Han, J. (2002), Proceedings of the 2002 IEEE International Conference on Data Mining. IEEE Computer Society, Maebashi City, Japan, pp. 721.


The time complexity for the worst case of this algorithm is suggested to be O(kn)

(133,134), where k is the maximum number of subgraph isomorphisms existing between

the two graphs and n is the size of the greatest common match. Figure 2.4 shows the

application of the XIOS graph approach to the structure of S. cerevisiae and H. sapiens

RNase P.

2.3.4 Characteristics of biological graphs

The graph isomorphism approach is limited by the size of the graphs. We examined

sequences from snoRNA, 5S rRNA, microRNA, tRNA, and RNase P (See Appendix for

details) to determine how the number of stems varies with sequence length in biological

RNAs. The sequences were obtained from online databases (Table 2.1) and their

predicted MFE structures were obtained using the RNAsubopt program of the Vienna

RNA package (97). Predicted MFE structures were also obtained for random sequences

in a similar fashion. Random sequences were obtained by randomizing the order of

bases in the corresponding biological sequences, thus preserving the base composition

and sequence length.
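The randomization described above amounts to shuffling the bases of each sequence. A minimal sketch follows; the function name `shuffle_sequence` and the optional `seed` argument are assumptions added for reproducibility and are not part of the original procedure.

```python
import random
from collections import Counter


def shuffle_sequence(seq, seed=None):
    """Return a random permutation of the bases in `seq`.

    Shuffling keeps the base composition and the length unchanged.
    """
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


original = "GGGAAACCCUUUGCGC"
randomized = shuffle_sequence(original, seed=1)
same_composition = Counter(original) == Counter(randomized)
```

Only mononucleotide composition is preserved by this approach; dinucleotide frequencies are in general not preserved.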

Figure 2.5 indicates an overall trend of linear increase in the number of stems as a function
of sequence length. This rapid increase in the number of stems is due to the intricately
folded structures of the RNAs. This observation further necessitates the development of
an efficient system for searching for biologically relevant structural patterns in RNA. It is
notable that biological RNAs and random RNAs have very similar numbers of structures.


As one can see in Figure 2.6, stem structures in biological RNAs are predominantly less

than ten base-pairs long.

2.4 Future directions

The number of stem structures in an RNA MFE structure can be very large (Figure 2.5);
the total number of possible stems, however, grows quadratically with the length of the
sequence. If one assumes that stem-loop structures require on average 24 bases, the
number of possible stems would be approximately (SequenceLength/24)^2. For a relatively
short 10 kb mRNA sequence this would lead to graphs with over 150,000 vertices. Our

ultimate goal is to analyze 10-20 sequences of much longer length (many biological

RNAs are over 100,000 bases long), a daunting problem. There are a number of

approaches that can be used to reduce the size of the problem. These include

preprocessing the structure to include only the most interesting stems (rather than all

possible stems), the application of graph contraction methods, and the introduction of

vertex labels.
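As a worked check of the estimate above, assuming a sequence length of exactly 10,000 bases and 24 bases per stem-loop:

```latex
\left(\frac{10\,000}{24}\right)^{2} \approx (416.7)^{2} \approx 1.74 \times 10^{5}
```

This is roughly 174,000 vertices, consistent with the figure of over 150,000 quoted in the text.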

2.4.1 Graph preprocessing

While the most biologically interesting RNA structure need not be the minimum free

energy (MFE) structure, it is likely that the important structures are close to the MFE

(136). This follows from the Boltzmann relationship, which indicates that the relative

frequency of a given structure in the structural ensemble depends on its energy. Rather

than identifying all short energetically favorable stems, we can greatly reduce the size of

the problem by including only stems that participate in a structure within some energy


interval, ∂, from the predicted MFE structure. The total number of stems can be

controlled by altering ∂; ∂=0 produces the MFE structure.
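A minimal sketch of this preprocessing step is given here, assuming that each predicted structure is available as a pair of its free energy (kcal/mol) and the set of stems it contains. The data layout, the stem names, and both function names are hypothetical; the Boltzmann weight is included only to illustrate why structures far above the MFE contribute little to the ensemble.

```python
import math

# Gas constant in kcal/(mol*K) multiplied by 310.15 K (37 degrees C).
RT_KCAL_PER_MOL = 0.0019872 * 310.15


def boltzmann_weight(energy, mfe, rt=RT_KCAL_PER_MOL):
    """Relative frequency of a structure compared with the MFE structure."""
    return math.exp(-(energy - mfe) / rt)


def stems_within_window(structures, delta):
    """Collect stems from every structure whose energy is within delta of the MFE.

    structures: list of (energy, set_of_stem_names) pairs (hypothetical layout).
    """
    mfe = min(energy for energy, _ in structures)
    kept = set()
    for energy, stems in structures:
        if energy - mfe <= delta:
            kept |= stems
    return kept


# Toy ensemble: three predicted structures with made-up energies and stem names.
structures = [(-30.0, {"a", "b"}), (-29.2, {"a", "c"}), (-25.0, {"d"})]
print(sorted(stems_within_window(structures, 0.0)))  # stems of the MFE structure only
print(sorted(stems_within_window(structures, 1.0)))  # a wider window admits more stems
```

With a window of zero only the stems of the MFE structure are kept, and widening the window adds stems from suboptimal structures.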

2.4.2 Reduction of graph complexity

Graph contraction reduces graph complexity by pruning irrelevant vertices and edges.

There are a number of different approaches one can take to pruning XIOS graphs. Firstly,

as we pointed out above, one can simply discard the S edges; since there are exactly

four edge types and each pair of vertices has exactly one edge, only three edge types

need be used. Secondly, we can place limits on the construction of edges of other types,

especially of I edges. One of the advantages of the XIOS representation is that nested

stems, represented by I edges, have an edge with every other stem in which they are

included. This embedding can be many levels deep, generating a huge number of highly

connected vertices. This is a great advantage because it obviates the need for

introducing gaps (137) which make the matching problem much more complex (and ad

hoc since there is no way to determine correct gap parameters). We postulate that we

would lose little matching power if the depth of I edge nesting was limited to a fixed

depth such as four. This would still permit extraneous stems to be easily omitted but

greatly reduce the number of edges in the graphs. Finally, because we can enumerate all

possible XIOS structures with a fixed number of stems, we can create a dictionary of

these substructures and condense the graphs to a smaller number of vertices based on

this dictionary, at the same time converting the unlabelled vertices to labeled vertices

(the labels then correspond to the dictionary structures).


2.4.3 Adding labels

The dictionary strategy, described above, faces difficulties since the isomorphous

structure of interest is buried in a huge field of random noise. If the dictionary based

labels are dominated by the non-matching (noise) portion of the graph, the re-encoded

graph will lose the information needed to match to other graphs (e.g., if the dictionary

structures overlap but do not exactly correspond to the interesting conserved

structures). A similar strategy, unique to the XIOS graph, is to examine all three-vertex

triangles, of which there are a strictly limited number of types due to the limitations

both of the graph and of the biochemistry of RNA, and replace each triangle with a

corresponding labeled vertex. Triangles may share one or two edges which can be

incorporated as an extended set of edge labels. Such graphs would be modestly smaller,

but much more heavily labeled, greatly increasing the search speed. At the same time,

little information is lost since the original graph can be almost completely reconstructed

from the triangle-condensed graph.

2.4.4 Motif identification tool

RNAs that interact with specific molecules, such as proteins, generally have common

topological motifs. For example, in alternative splicing the donor, acceptor, and branch

point all have specific conserved structures important in recognition and catalysis. Such

conserved structures, when identified in molecules of unknown function, immediately

generate experimentally testable hypotheses. Once motifs are identified, they can be

used to search for additional sequences that could form the same structure. This


provides a means for both statistically evaluating the significance of the structural motif,

as well as for validating matches by examining them for biological similarities, e.g., by

comparing the GO annotations (138) of the sequences. A number of approaches may be

suitable for this, including stochastic context-free grammars (SCFG) (139), which are

frequently used to identify RNA structures based on biological knowledge (140).

2.4.5 Database search tool

For searching of large databases, SCFGs are likely to be too slow. We are developing a

fast database search tool for RNA motifs. Since we can enumerate all possible XIOS

graphs for structures of up to 7 or 8 stems (hundreds of thousands), we believe that

we can use the enumerated structures to prescreen graphs in much the same way that

BLAST (141) uses identically matching words. This is closely related to the dictionary

concept introduced above. Because matching to the enumerated structures in the

dictionary can be precalculated, we plan to develop a fast system based on the

observation that one need not do the complete isomorphous subgraph search if two

sequences share no dictionary motifs, and that if they do, the isomorphism search can

be seeded by the matching motifs. Such a search tool would allow users to both extend

and validate motifs found through subgraph isomorphism matching, and would also

provide a means to functionally classify unknown RNAs. RNA is still rather poorly

understood and such an approach will be of great use in identifying novel structural and

functional motifs.
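The prescreening idea can be sketched as follows, assuming that each sequence has already been reduced to the set of dictionary motifs it contains. The function names, the motif identifiers, and the `full_search` callback (standing in for the complete subgraph isomorphism search) are all hypothetical.

```python
def shared_motifs(motifs_a, motifs_b):
    """Dictionary motifs present in both graphs."""
    return set(motifs_a) & set(motifs_b)


def prescreen(query_motifs, database, full_search):
    """Run the expensive search only for entries sharing a motif with the query.

    full_search(name, seeds) is a placeholder for the complete isomorphism
    search, seeded by the shared motifs.
    """
    hits = {}
    for name, motifs in database.items():
        seeds = shared_motifs(query_motifs, motifs)
        if not seeds:
            continue  # no dictionary motif in common: skip the full search
        hits[name] = full_search(name, seeds)
    return hits


# Toy data: motif identifiers are illustrative only.
query_motifs = {"2_row_motif_1", "3_row_motif_4"}
database = {
    "rna_a": {"1_row_motif_2"},
    "rna_b": {"3_row_motif_4", "2_row_motif_7"},
}
searched = []


def record_search(name, seeds):
    searched.append(name)
    return sorted(seeds)


hits = prescreen(query_motifs, database, record_search)
print(hits)
```

Only `rna_b` reaches the full search in this toy example, because `rna_a` shares no motif with the query.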


Because RNA structures are relatively degenerate, it is likely that a post-processing

system will be needed to identify the most interesting possible structures out of a large

number of possibilities. This issue is similar to the problem of relevance ranking in web

indexing. In sequence comparisons, statistical probability calculations are commonly

used as a relevance ranking mechanism, and this may be possible in the XIOS system;

we anticipate that the distribution of maximal matching structures will follow an

extreme value distribution. Any two large RNAs, however, will have common structures

that are almost completely trivial: they will match as a long series of serial stems. This is

generally not biologically interesting, suggesting that there is a notion of biological

complexity which can be used as a relevance ranking function. This biological notion of

complexity may or may not correspond to mathematical notions of graph complexity

(142). Another possible relevance function would be to choose only structural motifs

that can form near-MFE predicted structures using a constrained folding approach

(motif stems are constrained to base-pair in the predicted structure) such as are

available in MFOLD and the Vienna RNA package.

The XIOS graph representation has great promise for identifying biologically interesting

structural motifs in RNA based on sequence alone. Constructing a sufficiently fast motif

search system will allow RNA studies to take advantage of the same bootstrap process

that is commonly used for DNA and protein sequences, namely 1) identify biologically

related sequences, 2) identify statistically significant structural motifs, 3) use structural


motifs to identify additional candidate sequences (iterating to convergence), and 4) use

the structural motif as a basis for laboratory experiments.


Figure 2.1 XIOS definition.

Relationships (edges) are defined as X (exclusive), I (included), O (overlapping), and S (serial). I indicates the direction of I edges with respect to the higher numbered vertex and J indicates the opposite.


Figure 2.2 tRNA 3D structure and corresponding XIOS graph representation.

I.A. 3-D structure of tRNA (PDB ID, 1EHZ). I.B, the simple three-leaf clover shape of tRNA is shown, where the acceptor stem, D-arm, anticodon-arm, and T-arm are represented by vertices 0, 3, 2 and 5 respectively. Vertex 1 represents an interaction between the D-loop and a region between the D-arm and acceptor-arm, and vertex 4 represents an interaction between the D-loop region and the region between anticodon-arm and T-arm. In the XIOS representation (I.C), vertex 1 is included in the acceptor stem and overlaps with the D-arm, vertex 4 overlaps with the D-arm and the anticodon-arm is included in vertex 4. II a, b, and c show the sequential extension of the DFS graph, and II d shows the minimum DFS tree and corresponding DFS code. At each stage of graph extension, all the possible extensions are shown in dotted lines. For each edge extension, only the canonical graph (shown by dotted ellipse) is used in the next stage.


Figure 2.3 Unique three-stem XIOS graphs, including pseudoknots.

Fifteen XIOS graphs with three vertices are possible. Three of them, in the first row, are not true three-stem topologies (at least one of the stems has only S relationships with other stems); a further four three-stem structures are either redundant or physically impossible.


Figure 2.4 Identification of the common structure in S. cerevisiae and H. sapiens RNase P RNA.

Left panel (top) shows the secondary structure of the S. cerevisiae RNase P RNA. Each stem is labeled with a capital letter A-L. Left panel, bottom, shows the XIOS graph. I edges are shown as single lines and O edges as double lines. Right panel shows the secondary structure (A-N) and XIOS graphs for a single human RNase P RNA. In both panels, matching secondary structures are enclosed by boxes and the uniquely matching part of the XIOS graphs is shown in dark lines. Dotted lines in the XIOS graphs indicate where there are multiple mappings between stems H and I of the S. cerevisiae structure and the human structure; these multiply mapped stems are also indicated by arrows in the secondary structure diagrams. The right panel shows two of the mappings as an example.


Figure 2.5 Correlation between number of stems and sequence length.

Number of stems in biological (♦) and randomized (×) RNA sequences versus sequence length. The number of stems increases roughly linearly with sequence length. Each biological sequence was permuted to generate a corresponding random sequence, preserving the sequence length and base composition.

[Figure 2.5 plot panels: miRNA, snoRNA, RNaseP, tRNA, 5S rRNA. x-axis: Sequence Length (bases); y-axis: Number of Stems.]

Figure 2.6 Length of RNA stem structures in biological RNAs

[Figure 2.6 histogram. x-axis: Stem Length (bins 0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99, 100-199, 200-299, >300); y-axis: Frequency.]

Table 2.1 Brief description of RNA datasets.

Formats are: A, alignment; C, MFOLD connect; S, sequence; V, Vienna RNA package.

Type of RNA     Database or Program       Format  Link
microRNA        miRNA                     S       http://microrna.sanger.ac.uk/sequences/index.shtml
5S rRNA         Database                  S       http://biobases.ibch.poznan.pl/5SData/
rRNA            RDP II                    A, S    http://rdp.cme.msu.edu/index.jsp
RNase P         RNase P Database          C       http://www.mbio.ncsu.edu/RNaseP/
snoRNA          snoRNABase                S       http://www-snorna.biotoul.fr/
snoRNA          Plant snoRNA Database     A, S    http://bioinf.scri.sari.ac.uk/cgi-bin/plant_snorna/home
snoRNA          Human snoRNA Database     S       http://www.trex.uqam.ca/~snorna/Seqs.html
tRNA            GtRNAdb                   V       http://lowelab.ucsc.edu/GtRNAdb/
tmRNA           tmRNA                     A, S    http://www.indiana.edu/~tmrna/
Noncoding RNA   ncRNA Database            S       http://biobases.ibch.poznan.pl/ncRNA/
All             Pseudobase                V       http://biology.leidenuniv.nl/~batenburg/PKB.html
All             RNAbase                   S       http://www.rnabase.org/
All             Rfam                      A, S    http://www.sanger.ac.uk/Software/Rfam/index.shtml
All             RNAfold/MFOLD             C, V    Installed on local server


Table 2.2 Number of possible RNA topologies for different numbers of stems, N.

In the enumerated graph results, there are isomorphic graphs (redundant structures).

Unique topologies are the remaining graphs after removing isomorphic graphs.

N   Total    Unique Topologies   % Unique
1   1        1                   100
2   2        2                   100
3   10       8                   80
4   78       46                  58.97
5   746      368                 49.33
6   8566     3914                45.69
7   114834   51390               44.75


CHAPTER 3 RNA STRUCTURAL FINGERPRINT

3.1 Enumeration of XIOS graphs

In order to understand and explore the RNA topology space, I have developed a

systematic way to efficiently enumerate all small XIOS graphs which are physically

possible. An n-vertex XIOS graph corresponds to an n-stem structure. Based on the

results shown in Figure 2.5 and Figure 2.6, the average size of RNA stems is ~20 nt (not

counting loop regions).

For an n-stem structure, there are 2n half-stems (each stem has two base-paired regions,

and each region is called a half-stem). In enumerating the possible structures, we assign

labels to the 2n half-stems such that the label on the left half-stem of each stem is lower

than all labels on the left half-stems to its right, and the label on each right half-stem is

higher than all labels on the right half-stems to its left. By definition, the label of the first

half-stem is 1, and there are 2n-1 possible regions that can pair with half-stem 1. The

half-stem chosen to pair with half-stem 1 is labeled 2. At this point, one stem (1, 2) is

formed, with the two half-stems 1 and 2 paired. Here I have defined the sequence

direction from labeled half-stem 1 to half-stem 2 as the positive direction, and the

opposite direction is the negative. For the third half-stem, there are three possible


locations it can be placed relative to the position of half-stem 1 and half-stem 2: A. in

the negative direction from half-stem 1, B. between half-stem 1 and half-stem 2 and C.

in the positive direction from half-stem 2. The directionality chosen here is arbitrary.

Cases A and C are symmetric, which means they lead to redundant structures. From

now on, we just consider the positive direction cases (B and C). If the third half-stem is

between half-stem 1 and half-stem 2, by definition, the third half-stem should be

labeled as half-stem 2 (the half-stem previously labeled as half-stem 2 is now assigned

label 3). Otherwise, the third half-stem is in the positive direction from half-stem 2, and

the third half-stem is then labeled 3. As a result, in a unique structure the third half-

stem could only have one possible label (2 or 3), and there are 2n-3 possible half-stems

that can pair with this half-stem, and so on (Figure 3.1 A).

As described in section 2.2.4 the upper bound of the number of possible n-stem

structures is therefore $(2n-1)(2n-3)(2n-5)\cdots 5\cdot 3\cdot 1 = \frac{(2n)!}{2^{n}\,n!}$. The final structure is

stored in a format I call paired format (Figure 3.1 B). Each pair of matching parentheses

in the paired format represents a stem. The labels of the two half-stems associated with

the stem are shown inside the parentheses.
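The counting argument can be illustrated with a short sketch that pairs 2n half-stem positions in every possible way and compares the number of pairings with the upper bound (2n)!/(2^n n!). This is a simplified illustration, not the enumeration program used in this work: the labels here are simply positions 1 to 2n along the sequence, and the output string only imitates the paired format.

```python
from math import factorial


def enumerate_pairings(half_stems):
    """Yield every way to pair the given ordered half-stem labels into stems."""
    if not half_stems:
        yield []
        return
    first, rest = half_stems[0], half_stems[1:]
    # The first unpaired half-stem may pair with any of the remaining ones.
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in enumerate_pairings(remaining):
            yield [(first, partner)] + tail


def paired_format(pairing):
    """Render a pairing as a string with one pair of parentheses per stem."""
    return "".join(f"({a},{b})" for a, b in pairing)


def upper_bound(n):
    """(2n-1)(2n-3)...5*3*1 = (2n)! / (2^n * n!)."""
    return factorial(2 * n) // (2 ** n * factorial(n))


for n in range(1, 5):
    count = sum(1 for _ in enumerate_pairings(list(range(1, 2 * n + 1))))
    print(n, count, upper_bound(n))

print([paired_format(p) for p in enumerate_pairings([1, 2, 3, 4])])
```

The first half-stem has 2n-1 possible partners, the next unpaired half-stem has 2n-3, and so on, which is why the recursion reproduces the product formula.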

This enumeration method only guarantees the structure can form physically. In the

enumerated graph results, there are isomorphic graphs (redundant structures). All the

paired format representations are converted into graphs and then into their minimum

DFS code. Yan et al. proved that two graphs are isomorphic if their minimum DFS codes

are the same (133). In order to purge the redundant graphs, all the minimum DFS codes


were built into a Perl hash data structure, and only unique minimum DFS codes were

kept in the final set. After these steps, this set contains only unique XIOS graphs for

further research.
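The purge step amounts to keeping one graph per canonical code in a hash. The sketch below shows that idea in Python rather than Perl. Note that `canonical_key` is only a stand-in for the minimum DFS code: it tries every vertex ordering, which is feasible only for very small graphs, and it treats edge labels as undirected for simplicity.

```python
from itertools import permutations


def canonical_key(num_vertices, edges):
    """Stand-in for a minimum DFS code: the smallest relabeled edge list.

    edges maps (u, v) vertex pairs to an edge label; labels are treated as
    undirected here, which is a simplification of the XIOS edge types.
    """
    best = None
    for perm in permutations(range(num_vertices)):
        relabeled = tuple(sorted(
            (min(perm[u], perm[v]), max(perm[u], perm[v]), label)
            for (u, v), label in edges.items()
        ))
        if best is None or relabeled < best:
            best = relabeled
    return (num_vertices, best)


def purge_redundant(graphs):
    """Keep the first graph seen for each canonical key."""
    unique = {}
    for num_vertices, edges in graphs:
        unique.setdefault(canonical_key(num_vertices, edges), (num_vertices, edges))
    return list(unique.values())


# Toy graphs: the first two differ only in vertex numbering.
graphs = [
    (3, {(0, 1): "I", (1, 2): "O"}),
    (3, {(0, 2): "O", (1, 2): "I"}),
    (3, {(0, 1): "I", (1, 2): "I"}),
]
print(len(purge_redundant(graphs)))
```

Because isomorphic graphs map to the same key, the dictionary retains exactly one representative of each topology.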

In my thesis work, I have tried to enumerate as many small XIOS graphs as possible. One

observation I have made is that once the graph size (vertex number) reaches 8 or 9, the

total number of possible graphs becomes very large. In my experiments, I have

enumerated the entire set of possible 1 to 10 stem XIOS graphs following the steps

described in Figure 3.1 A. However, as a result of the large number and size of the graphs,

it took over 7 TB of hard drive storage space, not to mention the time-consuming steps of

generating minimum DFS code for each of the structures in the dataset and purging the

redundant structures. The take home message is that it is possible to use our approach

to enumerate as many graphs as one wants. But due to the limitations of the computer

hardware and intractable computational time, one needs to decide where to stop. I

chose to keep all 1 to 7 stem unique XIOS graphs for later work in this thesis. The

non-redundant set of 1-7 stem graphs comprises 55,728 graphs in total as described in

Table 2.2.

3.2 Structural motif library construction

As mentioned above, all 1 to 7 stem small XIOS graphs were enumerated. I define a

concept called the structural motif, which represents the building blocks used in

biological RNA XIOS graphs. I further define the collection of 55,728 enumerated small

XIOS graphs as the RNA structural motif library. This library can be extended when more


graphs are enumerated and built into this collection. My assumption here is that

different RNA structures contain different structural motifs which correspond to their

functional differences. The RNA structural motifs embedded in an RNA XIOS graph are its

intrinsic properties.

An N-vertex graph has a maximum of N*(N-1)/2 edges. In the structural motif library,

the graphs may have up to 21 edges, namely 21 spatial RNA stem-stem relationships. I

clustered all the 55,728 graphs into groups based on the number of edges. The rationale

behind this clustering criterion is that each edge in the XIOS graph represents one

topological relationship between a pair of stems, and it is reasonable to group

structures with the same number of pairwise stem relationships. Initially, I clustered the

graphs based on their stem number. However,

some graphs have strong correlations with each other, and small graphs appear within

bigger graphs. In chapters four and five, which focus on RNA classification and

identification, the stem number-based graph-clustering strategy does not work well. As

an alternative, I then developed the following edge based clustering strategy. Each

graph has a pre-generated minimum DFS code associated with it (calculated in the

redundant graph purge step). Within each N-edge graph group, all the graphs are sorted

based on the alphabetical order of their minimum DFS codes. By doing so, a unique ID

(UID) was assigned to each of the graphs of the form N_row_motif_X, where N is the

number of edges in the graph and X is the rank of the graph in the sorted order. I call


these row motifs because the clustering was based on their edge numbers, which in

terms of the DFS code is also the number of rows of code.
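The UID assignment can be sketched as follows, assuming that each minimum DFS code is held as a tuple of row strings. The row format shown and the choice to start ranks at 1 are assumptions made for illustration.

```python
from collections import defaultdict


def assign_uids(min_dfs_codes):
    """Group codes by number of rows, sort each group, and name each graph.

    min_dfs_codes: iterable of codes, each a tuple of row strings
    (the row format used here is illustrative only).
    """
    groups = defaultdict(list)
    for code in min_dfs_codes:
        groups[len(code)].append(code)
    uids = {}
    for n_rows in sorted(groups):
        # Alphabetical order of the codes gives the rank X in N_row_motif_X.
        for rank, code in enumerate(sorted(groups[n_rows]), start=1):
            uids[f"{n_rows}_row_motif_{rank}"] = code
    return uids


codes = [("0 1 O",), ("0 1 I",), ("0 1 I", "1 2 O"), ("0 1 I", "1 2 I")]
uids = assign_uids(codes)
for uid, code in uids.items():
    print(uid, code)
```

Since the number of rows equals the number of edges, the prefix N of each UID records the edge count of the motif.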

In order to manage this big set of graphs, a MySQL database was set up to store them

and provide easy access. A table called row_motif was created with the graph UID as a

primary key and the minimum DFS codes stored as a column in the table.

Each structural motif is represented by its minimum DFS code, which is an abstract

concept. For better visualization and understanding of structural motifs, in the

beginning, I wrote a Perl CGI (Common Gateway Interface) script, row_motif_check.cgi,

to render them as RNA stem-loop diagrams by using minimum DFS code as input. This

script uses the LWP (the World-Wide Web library for Perl) package to interact with the

PseudoViewer3 web service (143) and retrieve the result. PseudoViewer3 is an excellent

visualization tool which was the first one developed for the automatic drawing of RNA

with any type of pseudoknots as a planar graph (Figure 3.2). Despite its useful features,

we experienced many internet connection difficulties and inconsistencies since the server

is located in Korea. As an alternative tool, I added another visualization application

called VARNA (Visualization Applet for RNA) (144), which is a lightweight Java applet

that draws RNA secondary structures, to the row_motif_check.cgi script. VARNA runs

locally from our server, which guarantees its speed and availability. One drawback is

that VARNA cannot produce as nice a layout of pseudoknotted structures as

PseudoViewer3 (Figure 3.2). Since we do not have a strict requirement about the layout,

VARNA is sufficient to perform the visualization task.


3.3 RNA structural fingerprint

3.3.1 Background

In the bioinformatics research related to DNA and protein function, it is the constraint

that function places on mutational change that gives rise to the observed sequence

conservation. Traditionally bioinformatics tools rely on sequence conservation;

sequence similarity often translates into functional similarity. But in the case of RNA,

function is often only weakly linked to sequence, while the main player is the structure.

Our RNA XIOS graph representation captures the dynamic topological characteristics of

a folded RNA molecule, and it can be thought of as a coarse resolution picture of the actual

structure without details such as sequence and stem length, and loop size. The XIOS

graph framework is topology based, compared to sequence based (145) or shape based

(146-148) frameworks. I present a tool which can use such structural topological

features to identify functionally related RNA molecules.

3.3.2 Definition of RNA structural fingerprint

With the RNA structural motif library constructed, based on the assumption that

different RNA structures contain different RNA structural motifs, I propose a new

concept called the RNA structural fingerprint (simply referred to as fingerprint in the

later part of the thesis). The definition of the RNA structural fingerprint is a list of

structural motifs (defined in section 3.2) that are found in a specific biological RNA

structure. This comprehensive list of structural motifs summarizes the spatial

relationships between the stems in an RNA structure.


The fingerprint idea is simple and straightforward. Figure 3.3 shows the workflow of

generating fingerprints for RNA structures or sequences. If one has an RNA sequence,

the RNA folding program, UNAFOLD, is used to predict its suboptimal structures (up to 5%

above its MFE). Note that UNAFOLD can only predict non-pseudoknotted structures.

Our lab developed a strategy to compare suboptimal structures and identify possible

pseudoknotted structures (149), representing the ensemble of structures as a single

XIOS graph. Alternatively, if one has an RNA secondary structure, it can be directly converted into a

XIOS graph by using the XIOS package described in Chapter 2. The XIOS graph is used as

query to search against the RNA structural motif library using a subgraph matching

process. This search identifies all the structural motifs that are found in the RNA

XIOS graph; this is the fingerprint of the RNA XIOS graph. The fingerprint can be stored as a

feature vector: each element corresponds to one specific RNA structural motif in our library,

and the value of that element is the count of that motif in the query graph.
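As an illustration, the fingerprint vector can be built from the list of motif matches as sketched here. The function name and the motif identifiers are hypothetical, and the list of matches is assumed to come from the subgraph matching step.

```python
from collections import Counter


def fingerprint_vector(matched_motif_uids, library_uids):
    """One element per library motif, holding how often that motif matched."""
    counts = Counter(matched_motif_uids)
    return [counts[uid] for uid in library_uids]


# Toy library of three motifs and the matches reported for one query graph.
library_uids = ["1_row_motif_1", "1_row_motif_2", "2_row_motif_1"]
matches = ["1_row_motif_1", "1_row_motif_1", "2_row_motif_1"]
vector = fingerprint_vector(matches, library_uids)
print(vector)
```

Keeping the library order fixed means that vectors from different RNAs can be compared element by element.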

The concept of an RNA structural fingerprint is slightly abstract and an example of tRNA

is given in Figure 3.4 to better illustrate it. This example shows the actual structural

motifs that comprise a tRNA structural fingerprint. The tRNA XIOS graph is surrounded

by 1 to 3 row structural motifs with their corresponding XIOS graphs. There are

additional, bigger structural motifs embedded in the tRNA XIOS graph but they are not

shown here for simplicity. The colors of the stems represent one of their possible

corresponding stems found in tRNA XIOS graph. The tRNA secondary structure, 3D


structure and XIOS graph are colored using the same color scheme to better highlight

the corresponding structures.

3.3.3 Fingerprint searching algorithms

The fingerprint generating process, which requires searching against the RNA structural

motif library, would be time consuming if a brute-force method were used.

This task mainly involves determining whether a query XIOS

graph contains a subgraph that is isomorphic to a specific structural motif XIOS graph. If

the query RNA structure is compared with every structure in the library, the

computational complexity of such a search is O(nmm), where n is the number of

structural motifs to be compared and m is the number of edges in the query XIOS graph.

For the structural motif library, n is 55,728; it could be even bigger in a more complete

motif library. This subgraph isomorphism problem is known to be NP-complete (70).

It is inefficient to scan the whole library to match structural motifs one by one. An

efficient strategy is needed to make this fingerprint search faster. A filter and

verification method is a common approach to speed up the search efficiency of

subgraph isomorphism checking over large sets of graphs. The filtering step, which

omits graphs that do not satisfy restraints defined by the user, is the key to improve

search efficiency, since the efficiency is largely determined by the number of graphs left

to be checked in the verification step (the fewer graphs left, the faster the search is).

Therefore, many approaches have proposed using indexing techniques to speed up the


filtering (150-158). Here I am going to describe the strategies I used in my fingerprint

search, including CUDA GPU programming and two indexing techniques.

3.3.3.1 CUDA GPU programming

The graphics processing unit (GPU) is a specialized circuit designed to efficiently

manipulate computer graphics. GPUs are normally embedded in a graphics card, or

integrated on the motherboard. The highly parallel, multithreaded, multi-core processor

structure of the GPU makes it more powerful than a central processing unit (CPU) when

processing large blocks of data in parallel (Figure 3.5). A simple comparison of floating

point operations per second (GFLOP/s) and memory bandwidth (GB/s) between CPU

and GPU is shown in Figure 3.6.

NVIDIA Corporation, a major supplier of graphics cards, released the Compute Unified

Device Architecture (CUDA), an extension of C/C++, for general

purpose computing on GPUs (GPGPU). GPGPU is becoming one of the most active computational

research areas with promise to advance computationally challenging problems in areas

such as large database searching, protein folding, and molecular dynamics simulation.

The RNA structural fingerprint search process includes thousands of independent

subgraph isomorphism checks. GPGPU was used to parallelize the searching process and

improve its searching efficiency. We used the NVIDIA GeForce 9800 GTX+ graphics card

(16 multiprocessors, 128 streaming cores) to implement CUDA code and perform the

search. While it seems that the subgraph isomorphism problem is suitable for


implementation on a GPU, my fingerprint search requires a reasonably large amount of

memory. I stopped implementing the search code in CUDA due to the graphics card

memory limitation (512MB). GPUs with larger memory could be used to solve this

problem and speed up the search process.

3.3.3.2 Prefix tree search

Binary tree searching has O(nlogn) computational complexity, which is far better than

O(nmm). In Chapter 2, I described using a graph sequentialization method, the gSpan

algorithm DFS coding (133,134), to translate a graph into its minimum DFS code which

can be considered to be a canonical labeling of the graph. Yan and Han showed that if two

graphs are isomorphic, they must share the same minimum DFS code (133). I proposed

a prefix tree data structure to efficiently store and retrieve structural motifs in the

library (Figure 3.7). As mentioned before, the minimum DFS codes were pre-calculated

for each structural motif, and those codes were stored in this prefix tree. The prefix tree

stores all n row motifs in level n. Each node of the tree only holds one row of DFS code.

In order to retrieve the complete minimum DFS code of a structural motif, it is necessary

to trace from root node to the leaf node representing the last row of the DFS code for

the structure and retrieve the code. Node X is a parent of node Y if and only if the DFS

code from the root node to X is a prefix of the DFS code from the root node to Y.

Most subgraph isomorphism checking methods used on large-scale graph sets are filter-

and-verification, which means they first filter out graphs that do not satisfy restraints

defined by the user, and then perform isomorphism checking on remaining graphs. My


prefix tree strategy employs the verification-and-filter style. In contrast to filter-and-

verification, this style starts isomorphism checking with a small graph. The program

filters out the large number of graphs which are extensions from this small graph, if this

small graph fails to pass the isomorphism checking. The details of the prefix tree

searching are as follows: when doing fingerprint search for a query XIOS graph, it starts

from the root of the prefix tree. The root node contains just the “zero” row motif graph

which has only serial stem-to-stem topological relationships among all the vertices, for

example, the first graph in Figure 2.3. There are two one-row motifs that are children of

this root node, which are the 2nd and 3rd graphs in the first row of Figure 2.3. From here

on, a depth first searching strategy is used in the search. One of the two motifs is chosen

for subgraph isomorphism checking with respect to the query XIOS graph. If this motif

passes the check (matches), then one of its child node motifs is retrieved and used for

subgraph isomorphism checking with respect to the query XIOS graph. This search

process is repeated until a motif M which fails the subgraph isomorphism checking is

found. Since this is a prefix tree data structure, all the motifs represented by the child

nodes must contain M as a subgraph. If M is not a subgraph of the query XIOS graph,

then all its child nodes represent motifs that cannot be subgraphs of the query XIOS

graph, because they are basically extensions from M.

The filtering power of this strategy is that once the subgraph isomorphism checking fails

at a specific node, the whole child branch of this node no longer needs to be checked,

since this node is a prefix of all its child nodes.
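The pruning behaviour of the prefix tree can be sketched as follows. This is a Python illustration rather than the Perl implementation used in this work; the class and function names are hypothetical, and `matches_query` is a placeholder for the subgraph isomorphism check. In the toy example a code "matches" if every row belongs to a small set, which is enough to show that a failed node removes its whole branch from consideration.

```python
class PrefixTreeNode:
    def __init__(self):
        self.children = {}    # next DFS-code row -> child node
        self.is_motif = False


def insert(root, code):
    """Store a code (tuple of rows), one row per tree level."""
    node = root
    for row in code:
        node = node.children.setdefault(row, PrefixTreeNode())
    node.is_motif = True


def search(root, matches_query):
    """Depth-first search that skips the whole branch below a failed node."""
    found, checks = [], 0
    stack = [((), root)]
    while stack:
        prefix, node = stack.pop()
        for row, child in node.children.items():
            code = prefix + (row,)
            checks += 1
            if not matches_query(code):
                continue  # prune: every descendant extends this failed code
            if child.is_motif:
                found.append(code)
            stack.append((code, child))
    return found, checks


# Toy library of five codes and a toy matcher standing in for isomorphism checking.
root = PrefixTreeNode()
for code in [("a",), ("a", "b"), ("c",), ("c", "a"), ("c", "b")]:
    insert(root, code)

query_rows = {"a", "b"}
found, checks = search(root, lambda code: all(row in query_rows for row in code))
print(found, checks)
```

Here only three of the five stored codes are checked, because the failure at ("c",) removes both of its extensions.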


During the test of this approach, I experienced some inconsistency in searching speeds.

In spite of this, the overall searching speed was faster than the brute-force method.

After carefully looking at the layout of the tree in the memory, I found that the order in

which the tree nodes are allocated in memory is very important. Using the Perl language,

one does not have full control of memory allocation; the order in which the nodes of the

tree were constructed was the cause of the inconsistency. We found that a memory paging

problem occurs whenever the tree is large, spanning more than one page. If the

address of a child node lies more than one page away from the address of its parent node,

the memory access time is far higher. One take home message is that

efficient layout of the tree in memory is important, which can potentially save a lot of

computational time.

3.3.3.3 NH indexing

Prefix tree searching improved the fingerprint generating speed, but it was still not

satisfactory. Inspired by the Neighborhood indexing (NH indexing) method (159)

(Figure 3.8), I have developed a modified version of the NH indexing strategy to speed up

the RNA structure database searching process. The main idea of the NH indexing

strategy is that a vertex plays a role proportional to its significance when we are

matching two graphs. The neighbors of the vertex and its degree can be used to

determine the significance of a vertex in the matching process. This information is used

in the indexing as well as in the query search process. Besides the neighbor and degree

information, I also define triangle descriptors (Figure 3.9) that describe vertex properties.


For a given vertex i, vertices j and k are its neighbors (connected by edges). A triangle can

be formed by i, j and k. Depending on the edges linking these three vertices, 36 different

triangles are possible (Figure 3.10).

XIOS graphs are further separated into connected components (modules), i.e., distinct

subgraphs that have only serial (non-nested) relationships with each other. Modules

represent independent pieces found in biological RNA structures. Every vertex of each

module in the database is indexed by the modified NH indexing strategy (159).

The triangle descriptors shown in Figure 3.9 are those that can physically form; other descriptors are mathematically feasible but cannot physically form. A complete list of all 36 mathematically possible triangle descriptors can be found in Figure 3.10. I use a list called the NH index array to store the vertex properties for each vertex. The design of the NH index array is shown in Table 3.1.

The NH index array has 42 elements. It includes the counts of all 36 triangle descriptors listed in Figure 3.10, the number of I, J, O, and X edges that extend from the vertex, its degree (d) and the number of edges between its neighbor vertices (nc). The details of generating the NH index array for a specific vertex are described in Algorithm 3.1.

Algorithm 3.1 build_NH_index_array

Input: graph vertex ni
Output: NH index array NH(ni)

Initialize NH index array of vertex ni with zeroes
Find all neighbors (vertices connected to ni) of vertex ni, and put them into neighbor list NB
FOR each vertex nj in neighbor list NB
    FOR each vertex nk ≠ nj in neighbor list NB
        FOR each triangle descriptor Tl
            IF vertices ni, nj and nk match triangle Tl THEN
                increment the count of triangle descriptor Tl by 1
            END IF
        END FOR
        eij is the type of edge between vertex ni and nj, eij ∈ {I, J, O, X}
        eik is the type of edge between vertex ni and nk, eik ∈ {I, J, O, X}
        Increment the count of eij and eik in NH(ni)
    END FOR
END FOR
RETURN NH index array of vertex ni
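As a rough illustration of how such an array could be assembled, the following Python sketch may help. It rests on several assumptions that are not stated in the text: edges are held in a dictionary keyed by ordered vertex pairs, a triangle is counted only when the two neighbors are themselves joined by an I, J, O or X edge, each neighbor pair and each incident edge is counted once, and the descriptor numbering is a hypothetical one obtained by sorting canonical edge codes (it does not reproduce the T0 to T35 order of Figure 3.10). All names are illustrative.

```python
from itertools import combinations, product
from typing import Dict, Hashable, List, Tuple

EDGE_TYPES = ("I", "J", "O", "X")
REVERSE = {"I": "J", "J": "I", "O": "O", "X": "X"}
Edges = Dict[Tuple[Hashable, Hashable], str]


def canonical(e_ij: str, e_ik: str, e_jk: str) -> Tuple[str, str, str]:
    """Make the triangle code independent of which neighbor is called nj."""
    return min((e_ij, e_ik, e_jk), (e_ik, e_ij, REVERSE[e_jk]))


# Hypothetical numbering: sorted canonical codes, NOT the order of Figure 3.10.
DESCRIPTOR_ID = {
    code: idx
    for idx, code in enumerate(
        sorted({canonical(*t) for t in product(EDGE_TYPES, repeat=3)})
    )
}
N_DESC = len(DESCRIPTOR_ID)


def add_edge(edges: Edges, u: Hashable, v: Hashable, kind: str) -> None:
    """Store an edge in both directions (I seen from u is J seen from v)."""
    edges[(u, v)] = kind
    edges[(v, u)] = REVERSE[kind]


def build_nh_index_array(edges: Edges, ni: Hashable) -> List[int]:
    """Return [descriptor counts..., I, J, O, X, d, nc] for vertex ni."""
    nh = [0] * (N_DESC + 6)
    neighbors = [v for (u, v) in edges if u == ni]
    for nj in neighbors:                        # counts of I, J, O, X edges
        nh[N_DESC + EDGE_TYPES.index(edges[(ni, nj)])] += 1
    nh[N_DESC + 4] = len(neighbors)             # degree d
    for nj, nk in combinations(neighbors, 2):   # each neighbor pair once
        e_jk = edges.get((nj, nk))
        if e_jk is None:                        # serial pair: no triangle
            continue
        code = canonical(edges[(ni, nj)], edges[(ni, nk)], e_jk)
        nh[DESCRIPTOR_ID[code]] += 1
        nh[N_DESC + 5] += 1                     # nc
    return nh
```

Under these assumptions, treating a triangle and its mirror image (nj and nk swapped) as one descriptor yields 36 classes and a 42-element array, which agrees with the counts given in the text.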

Each RNA structure is converted to a XIOS graph, and further separated into its XIO-edge-connected components (modules). For each module, I calculate the NH index array for every vertex of the graph, and create a database of structure vertices indexed by triangle descriptors. For example, if vertex n of graph S and its neighbors can form triangle descriptors T0 and T18, features T0 and T18 are used as keys to index vertex n and graph S. After the vertices in the entire structure database have been indexed, it is easy and fast to look up all the vertices and graphs in the database that are associated with a specific triangle descriptor feature.

Algorithm 3.2 NH indexing

Input: all database XIOS graphs (S)
Output: index D, where each entry Dk is the list of database graph vertex ids containing feature Tk

FOR each graph Si in the XIOS graph database S
    separate graph Si into its module list*
    FOR each module ml in the module list of Si
        FOR each graph vertex ni in module ml
            CALL build_NH_index_array function with vertex ni, it returns array NH(ni)
            FOR each of the 36 possible triangle descriptors T0 to T35
                IF vertex ni is associated with descriptor Tk THEN
                    Append vertex ni to index entry Dk
                END IF
            END FOR
        END FOR
    END FOR
END FOR
RETURN index D

* x,i,o edge connected components
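The indexing step amounts to building an inverted index from descriptors to vertices. A minimal sketch follows, assuming that the NH index arrays have already been computed and that the first 36 positions hold the descriptor counts; the vertex id scheme (graph id, module number, vertex name) and the function name `build_index` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

N_DESC = 36  # number of triangle descriptors stated in the text
# Hypothetical id scheme: (graph id, module number, vertex name)
VertexId = Tuple[str, int, str]


def build_index(nh_arrays: Dict[VertexId, List[int]]) -> Dict[int, List[VertexId]]:
    """Map each descriptor k to the vertices whose count for k is non-zero."""
    index: Dict[int, List[VertexId]] = defaultdict(list)
    for vertex, nh in nh_arrays.items():
        for k in range(N_DESC):
            if nh[k] > 0:
                index[k].append(vertex)
    return dict(index)
```

With the example from the text, a vertex n of graph S that forms descriptors T0 and T18 would appear in the index entries for keys 0 and 18 only.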

The searching method uses the NH indexing aided database search as well as complete

subgraph matching. The database comprises a set of XIOS graphs derived from biological

RNA structures. Search queries are indexed in the same way as the database and each

query vertex is compared to the database index in order to find topologically similar

vertices in the database as candidates/seeding vertices. This is the NH indexing

screening step. My search strategy does not require searching against the whole

database, but just the graphs that contain the seeding vertices found by the NH indexing

screening step. The performance of the search is greatly improved due to the smaller

searching space. Algorithm 3.3 describes the search process step by step.

Algorithm 3.3 NH indexing search

Input: query XIOS graph Q and index D
Output: search hit list H, where each Hi indicates a database module that matches the query

1: Separate query XIOS graph Q into its connected components (module list)
2: FOR each module ml in the module list
3:     FOR each graph vertex ni in module ml
4:         CALL build_NH_index_array function with vertex ni, it returns list NH(ni)*
5:         FOR each of the 36 triangle descriptors
6:             IF vertex ni is not associated with descriptor Tk THEN
7:                 Put all the vertices from the list Dk into the non-candidate list NC(ni)
8:             END IF
9:         END FOR
10:        FOR each of the 36 triangle descriptors
11:            IF vertex ni is associated with descriptor Tk THEN
12:                FOR each vertex nj in list Dk
13:                    IF nj is not in the non-candidate list NC(ni) THEN
14:                        Append nj to the candidate list C(ni)
15:                    END IF
16:                END FOR
17:                FOR each vertex nc in the candidate list C(ni)
18:                    FOR each k-th value in list NH(ni): NH(ni)[k], where 1<=k<=42
19:                        IF k-th value in list NH(nc): NH(nc)[k] is larger than NH(ni)[k] THEN
20:                            skip to the next vertex in the candidate list C(ni)
21:                        END IF
22:                    END FOR
23:                    lookup the database graph module mhit that contains vertex nc
24:                    Append mhit into hit list H
25:                END FOR
26:        END FOR
27:    END FOR
28:    FOR each graph module hit mhit from hit list H
29:        IF query module ml has an equal or bigger number of vertices than graph module hit mhit THEN
30:            IF simple_subgraph_isomorphism_check(ml, mhit)** is NOT true THEN
31:                Remove this graph hit mhit from the hit list H
32:            END IF
33:        END IF
34:    END FOR
35: END FOR
36: RETURN search hit list H

* The details of this function are described in Algorithm 3.1. It returns a list of the counts of each of the 36 triangle descriptors associated with the vertex, the counts of the edge types going out from this vertex and the degree of the vertex.
** This function takes two graphs as input and performs a complete subgraph match test.

Lines 2-27: graph vertex filtering. According to the graph containment search exclusive

logic (155), if a feature f is not embedded in query graph Q, any graph Gi in the database that has feature f should not be a matching candidate. First, I find out which triangle descriptors are not associated with the query structure vertex ni. From the database index, I identify all the vertices that contain any triangle descriptor feature not associated with ni and push them onto the non-candidate vertex list NC(ni). All the vertices that contain any of the triangle descriptor features of the query vertex are included in the candidate vertex list C(ni). The intersection of NC(ni) and C(ni) (NC(ni) ∩ C(ni)) is then removed from the candidate list C(ni). Further screening is done by checking the feature counts. The k-th count (1 ≤ k ≤ 42) of the query structure vertex, NH(ni)[k], should not be smaller than the k-th count of the database vertex nc, NH(nc)[k]. If NH(ni)[k] < NH(nc)[k] for any k, nc is removed from the candidate list C(ni). Later, a module list mhit is built from the candidate list C(ni) by looking up the index of the vertex to module association. At this point, a list of candidate vertices C(ni) and a graph module list mhit are available for the next step.
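The two screening steps for a single query vertex can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used here: the index maps a descriptor number to a list of database vertex ids, the NH index arrays have 42 entries with the descriptor counts in the first 36 positions, and a database vertex survives only if none of its counts exceeds the corresponding query count. The function name `screen_candidates` is hypothetical.

```python
from typing import Dict, Hashable, List, Set

N_DESC = 36  # number of triangle descriptors stated in the text


def screen_candidates(
    query_nh: List[int],
    index: Dict[int, List[Hashable]],
    db_nh: Dict[Hashable, List[int]],
) -> Set[Hashable]:
    """Return the database vertices that pass both screens for one query vertex."""
    non_candidates: Set[Hashable] = set()
    candidates: Set[Hashable] = set()
    for k in range(N_DESC):
        # Descriptors present in the query feed C(ni); absent ones feed NC(ni).
        target = candidates if query_nh[k] > 0 else non_candidates
        target.update(index.get(k, []))
    candidates -= non_candidates  # exclusion logic: drop NC(ni) ∩ C(ni)
    # Count screen: every database count must fit within the query count.
    return {
        nc
        for nc in candidates
        if all(d <= q for d, q in zip(db_nh[nc], query_nh))
    }
```

Using sets makes the removal of the intersection a single operation; the surviving vertices are the seeds whose modules go on to the size test and the full containment check.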

Lines 28-35: module size screening and simple subgraph isomorphism check. This step efficiently rules out candidates from the list mhit, leaving a small number of candidates for the most accurate and most time-consuming test. In order to perform a specific biological function, an RNA needs to have a certain set of topological modules, and each of the modules needs to be complete. That is to say, the query structure topological module needs to have the same or a bigger size (number of vertices) than the database module. This simple module size test filters out false positive matches very quickly. If the size requirement is not met, that database module is discarded from the list mhit. The next step, a simple subgraph isomorphism check, is the most computationally expensive step of the searching process. It performs an exact, complete subgraph containment search, using the query XIOS graph module to search against all the database module hits in the module list mhit. The search loops through all candidate vertex combinations until the first complete match is found; in the worst case it goes through all combinations. This step looks for database modules that are the same size as the query module or completely nested in the query module. False positives are removed from the result hit list H.
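A brute-force containment test of the kind described above might look like the sketch below. It assumes that each graph is given as a vertex list plus a dictionary of labelled edges keyed by ordered vertex pairs (with an I edge stored as J in the reverse direction), and that containment means every labelled edge of the smaller graph maps onto an identically labelled edge of the larger one. The function name `is_contained` is hypothetical, and the seeding by candidate vertices is left out for brevity.

```python
from itertools import permutations
from typing import Dict, Hashable, List, Tuple

Edges = Dict[Tuple[Hashable, Hashable], str]


def is_contained(
    small_v: List[Hashable], small_e: Edges,
    big_v: List[Hashable], big_e: Edges,
) -> bool:
    """True if some injective vertex mapping sends every labelled edge of the
    small graph onto an identically labelled edge of the big graph."""
    if len(small_v) > len(big_v):  # module size screening
        return False
    for image in permutations(big_v, len(small_v)):
        mapping = dict(zip(small_v, image))
        if all(
            big_e.get((mapping[u], mapping[v])) == kind
            for (u, v), kind in small_e.items()
        ):
            return True  # stop at the first complete match
    return False
```

The loop visits up to n!/(n-m)! mappings for graphs with n and m vertices, which is why the cheaper screening steps are applied first.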

3.3.4 Possible applications

It is intuitive that RNA molecules contain different structural motifs, but members of the

same RNA family share more common motifs than RNAs from different families. For a

given biological RNA, we represent its ensemble of suboptimal secondary structures

(predicted by UNAFOLD) by a XIOS graph and describe its topological features by a fingerprint. Comparison of the fingerprints of a set of RNAs can give clues about their relationships and functional similarities. It is a natural extension of this work to index experimentally determined or computationally predicted RNA structures from different publicly accessible data sources by their fingerprints. This approach allows

the construction of an RNA topology database with all RNA topological information.

Furthermore, a database search utility can be developed to perform RNA topological

similarity search. With the aid of the RNA topology database, feature selection and


classification methods can be used to identify important topological features

corresponding to specific biological functions.


Figure 3.1 Enumeration of XIOS graphs.

A. The steps of assigning half-stem labels. For the n-stem case, there are 2n half-stems. We assign integer labels from 1 to 2n to the half-stems. By definition, the first half-stem is labeled 1, and there are 2n-1 possible regions that can pair with the first half-stem. Here we define the sequence direction from half-stem 1 to half-stem 2 as the positive direction; the opposite direction is the negative direction. The third half-stem has only one possible label (2 or 3), and there are 2n-3 possible half-stems that can pair with this half-stem, and so on. The upper bound on the number of possible n-stem structures is therefore (2n-1)*(2n-3)*(2n-5)*…*5*3*1. B. tRNA example. This result resembles the tRNA structure (three-leaf cloverleaf structure). The code below is the paired format representation of this structure. Each pair of matching parentheses represents a stem. Each stem has two half-stems associated with it; their labels are inside the parentheses.
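For concreteness, the upper bound in panel A can be evaluated with a few lines of Python. The function name `max_structures` is hypothetical and is used only for this illustration.

```python
def max_structures(n_stems: int) -> int:
    """Upper bound (2n-1)*(2n-3)*...*3*1 on the number of n-stem structures."""
    if n_stems < 1:
        raise ValueError("n_stems must be at least 1")
    bound = 1
    for odd in range(1, 2 * n_stems, 2):  # 1, 3, ..., 2n-1
        bound *= odd
    return bound
```

For three stems the bound is 5*3*1 = 15, and for four stems it is 7*5*3*1 = 105.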


Figure 3.2 RNA secondary structure visualization.

The top of the figure shows the dot-bracket representation of an example structure. A. RNA structure visualization done by PseudoViewer3. B. RNA structure visualized by VARNA.


Figure 3.3 Flow of generating fingerprint.


Figure 3.4 RNA structural fingerprint tRNA example.

This example shows actual structural motifs listed in the tRNA structural fingerprint. A. The tRNA XIOS graph is located in the center, surrounded by 1 to 3 row structural motifs with their XIOS graphs. Bigger structural motifs are also embedded in the tRNA XIOS graph, but they are not shown here for simplicity. The color of each stem indicates one of its possible corresponding stems in the tRNA XIOS graph. B. tRNA 3D structure. C. tRNA secondary structure stem-loop diagram. Note that the tRNA secondary structure, 3D structure and XIOS graph use the same color scheme to distinguish the stems.


Figure 3.5 Architecture comparison of CPU and GPU.

Adapted from CUDA C Programming Guide 4.0


Figure 3.6 Comparison of CPU and GPU.

Left: floating-point operations per second. Right: memory bandwidth. Adapted from CUDA C Programming Guide 4.0


Figure 3.7 Prefix tree structure stores structural motif library for efficient subgraph isomorphism check.

Figure 3.8 Neighborhood indexing (NH indexing).

Left panel: the open circle in the center is the vertex of interest. All the squares are its direct neighbors, and the diamond is not its neighbor. Information such as the number of I, J, O and X edges that extend from the vertex, the degree of the vertex (d) and the number of connections between its neighbors (nc) is considered to be the properties of the vertex. A list of all this information is called the NH index array of the vertex. Right panel: With the help of the NH index array, some vertices can be easily anchored from the query graph (smaller graph on the left) to the database graph (bigger graph on the right). Those vertices serve as seeds (closed circles) for the initial step of graph matching. Extending to their neighbor vertices (open circles) leads to the maximum subgraph shared by the two graphs more efficiently.


Figure 3.9 Triangle descriptors.

All physically possible triangle descriptors are shown. A complete list of all mathematically possible triangle descriptors can be found in Figure 3.10. Each vertex of the XIOS graph represents an RNA stem and each edge/link corresponds to one of four spatial stem-stem relationships: exclusive (X) (not shown here), included (I) (directed edge), overlap (O) (undirected edge) and serial (S) (if no edge is shown between two vertices, it is an S edge). The vertex on the left side of the triangle is the target vertex ni (closed circle ●), and the two vertices on the right are its neighbors (nj on the top and nk on the bottom, open circles ○). Each descriptor is coded by groups of three letters, which represent the edge type between ni and nj, the edge type between ni and nk and the edge type between nj and nk. For example, descriptor T0 is coded by III and IIJ. This means there are two possible triangles for descriptor T0. In both cases, the edges between ni and nj and between ni and nk are included (I) edges. The edges between nj and nk differ: included (I) and reverse included (J), respectively. Each descriptor also has one DFS (depth first search) code associated with it; see (160) for more detail.


Figure 3.10 Full List of Mathematically Possible Triangle Descriptors.

Each vertex of the XIOS graph represents an RNA stem and each edge/link corresponds to one of four spatial stem-stem relationships: exclusive (X) (not shown here), included (I) (directed edge), overlap (O) (undirected edge) and serial (S) (if no edge is shown between two vertices, it is an S edge). The vertex on the left side of the triangle is the target vertex ni (closed circle ●), and the two vertices on the right are its neighbors (nj on the top and nk on the bottom, open circles ○). Each descriptor is coded by groups of three letters, which represent the edge type between ni and nj, the edge type between ni and nk and the edge type between nj and nk. For example, descriptor T0 is coded by III and IIJ. This means there are two possible triangles for descriptor T0. In both cases, the edges between ni and nj and between ni and nk are included (I) edges. The edges between nj and nk differ: included (I) and reverse included (J), respectively. Each descriptor also has one DFS (depth first search) code associated with it; see (160) for more detail.


Table 3.1 Design of the NH index array.

There are 42 elements in this array: the counts of the 36 mathematically possible triangle descriptors (T0 to T35), the counts of I, J, O and X edges (I, J, O, X), the degree of the vertex (d) and the number of edges connecting its neighbor vertices (nc).

T0 T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17 T18 T19 T20 T21 T22 T23 T24 T25 T26 T27 T28 T29 T30 T31 T32 T33 T34 T35 I J O X d nc

CHAPTER 4 MATCHING UNKNOWN RNA STRUCTURES

4.1 Introduction

Graph theoretical approaches have been used to identify chemical moieties associated

with functional properties (71,72,153,161). Chemical structures have been represented

by molecular graphs in quantitative structure-activity relationships (QSAR) studies, using

structural determinants to model and predict physicochemical and biological properties

(73). Graphical representation of chemical structures has been used to compare

structure similarity and to identify function (71,72). The correlation between chemical

properties and function can be used to predict the function of novel molecules.

ChemIDplus (161), PubChem (162), ChemBank (163) and BindingDB (164) are examples

of databases using graph theoretical approaches for chemical structure search and

comparison.

While increasing attention has been drawn to the important biological roles of RNA,

RNA functional annotation remains difficult. Approaches based on RNA primary

sequence alone have been extensively studied and implemented (66,140,146,165,166),

but there are many cases where structurally similar RNAs have little or no detectable

sequence similarity; in these cases sequence based approaches fail to correctly identify


and classify the RNAs. RNA function is dependent on RNA tertiary folding, and tertiary

structure, in turn, is largely determined by base-paired secondary structures and

pseudoknotted structures, which are not secondary structure. The RNA-XIOS database

provides a means to link RNA secondary structure, including pseudoknots, to its

biological function and physicochemical properties by associating topological patterns

with the functions of currently known RNA families.

Similar to graph database studies in chemical informatics, the RNA structure XIOS graph

database provides extensive RNA-secondary-structure topological information, including

pseudoknots, and ensembles of suboptimal RNA secondary structures (160). For a given

query RNA structure, it can quickly identify topologically similar RNAs for further

analysis, such as function identification.

Several RNA motif databases based on graph theory are currently available (167), but

they do not provide a RNA structural topological searching service that includes

pseudoknot topologies. Additionally, techniques for efficiently identifying structural

similarity between RNAs are not well developed. Our approach is similar to the

RNAshapes approach of Giegerich et al. (98), but we also consider pseudoknots and

suboptimal structural ensembles. It is also similar to the RAG (RNA-As-Graphs) approach (168),

which describes RNA structures as graphs. However, some RAG structures that are

mathematically possible cannot form in the physical world. In our approach we only

enumerate physically possible graphs, greatly reducing the search space for topological

similarity.


4.2 Methods and dataset

4.2.1 XIOS Graph

We have developed a framework, XIOS, which represents an ensemble of RNA

secondary structures in a single graph; pseudoknots and suboptimal structures are

specifically included (160). XIOS graphs are constructed based on base-pairing in actual

and/or predicted stems. Each vertex in the XIOS graph represents a RNA stem and each

edge/link corresponds to one of the four spatial stem-stem relationships: exclusive (X),

included (I), overlapping (O) and serial (S). A special case of the reverse relationship of

included (I) is denoted as J (Figure 2.1). As this is a complete list of possible relationships

between base-paired regions in a RNA, the XIOS approach can be used to enumerate all

physically possible graphs. The XIOS graph is then converted into minimum depth first

search (minimum DFS) code for fast RNA structure comparison (133). In contrast to

traditional sequence-based approaches, XIOS is a topology-based approach, which

allows comprehensive and efficient exploration of the RNA structure space.

4.2.2 Dataset

The data set used here is the manually curated dataset described in Table 1.1.

4.2.3 Indexing and searching

The basics of the database search and the fingerprint search are the same as described in Chapter 3. The indexing algorithm is the same as Algorithm 3.2, and the search algorithm is the same as Algorithm 3.3. The difference lies in the database: in the fingerprint search it is the structural motif library (see section 3.2), while here it is the manually curated dataset (Table 1.1). Real biological RNA structures are indexed in the RNA

structural database considered in this chapter. The structures I consider here are

significantly bigger than the enumerated structural motifs and more biologically

relevant. The large size and complexity of real biological RNAs are challenging aspects in

this database search study.

4.2.4 Scoring function

The similarity of RNA structures is evaluated based on the number of indexed subgraphs

they have in common. For a query structure, each query module has a true candidate

database-module-list generated by algorithm 3.3. By examining the combinations of lists

of all graph modules, the database structures with the most module hits, as well as the

closest size-matches, can be found. A combination of these terms is used to define the

best matching structure.

For a specific query structure, a XIOS graph with certain number of modules, large

graphs in the database would tend to have a higher number of module hits and larger

matches. This is because large graphs have larger modules and more kinds of small

modules, regardless of their true similarity to the query. Large database graph modules

will therefore tend to have larger matches to query graph modules. For example,

consider a query graph module with N (N≥2) vertices, namely size N. Suppose there are

two database graph modules A and B (where the size of A is N and the size of B is less

than A), and that both A and B match to the query. Module A would tend to have a

larger match with the query module, since A is bigger than B. In general, larger database


modules will therefore tend to have larger matches regardless of the query. We penalize

the unmatched regions of a database graph match to correct for this effect. The bigger

the unmatched region, the higher the penalty it receives. Among the best database hits

with the same matching size and number etc., higher scores are given to database

modules that are the most similar in size to the query graph.

The database search result is affected by the following factors: structure overlap size

(the number of vertices that can be mapped between query and hit), and query and hit

module size differences. We added penalties to the scoring function (eq 4.1) to penalize

the unmatched part of the structures. This helps to promote the structures, with similar

size to the query graph, to the top of the result hit list.

score = ___ _ _    (eq 4.1)

where score is in the range of (0, _ /2].

The denominator of eq 4.1 adjusts for size differences between the query and database

modules. If the query and hit sizes are the same, the hit score reaches its maximum

value. If the query and hit size are very different, say query size >> hit size or query size

<< hit size, the hit score reaches its minimum value which asymptotically approaches

zero. For all other cases, the hit score would lie between the two extreme values.


4.3 Results

4.3.1 Validation using known biological structures

An NH indexing database search was conducted using each XIOS graph in the manually curated structure dataset (Table 1.1) as a query against the whole database of manually curated structures. Performance of the database search is evaluated by the positive hit ratio (PHR), which is the number of correctly labeled hits divided by the total number of hits. The label is the known family of its best match in the database. A sample PHR calculation for a BLAST search is shown in Figure 4.1. In this example, the black line represents a query search sequence, red lines are the true positive hits, and blue lines are false positive hits. The positive hit ratio for this search is 3/5. Figure 4.2 shows that we correctly identified and classified RNA structures using topological criteria across four distantly related RNA families: tRNA, Group I Intron RNA, RNAseP RNA and tmRNA. The Y-axis of the charts represents the percentage of NH index searches that achieved a certain PHR. For example, our dataset contains 16 tRNAs, and I performed a separate search for each of them. In 10 of the 16 searches I observed 100% PHR, so the fraction of tRNA queries with 100% PHR is 0.625 (62.5%). Over 75% of Group I Intron RNA, RNAseP RNA and tmRNA queries retrieve 100% PHR in the top 5 hits, while over 55% of tRNA queries have 100% PHR. We also examined the top 10 scoring hits for each RNA family. In this case, Group I Intron RNA and tmRNA queries still show high recall, while RNAseP RNA queries rank somewhat lower.
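The two quantities used in this evaluation can be written down in a few lines of Python. The sketch assumes that each hit carries a family label and that a query "reaches" a level when its PHR is at least that level; the function names are hypothetical.

```python
from typing import Sequence


def positive_hit_ratio(query_family: str, hit_families: Sequence[str]) -> float:
    """Fraction of hits whose family label equals the family of the query."""
    if not hit_families:
        raise ValueError("no hits to score")
    correct = sum(1 for fam in hit_families if fam == query_family)
    return correct / len(hit_families)


def fraction_reaching(phrs: Sequence[float], level: float) -> float:
    """Fraction of queries whose PHR is at least `level`."""
    if not phrs:
        raise ValueError("no queries to summarize")
    return sum(1 for p in phrs if p >= level) / len(phrs)
```

With three correct hits out of five the PHR is 3/5, and with 10 of 16 queries at 100% PHR the fraction is 10/16 = 0.625, as in the tRNA example above.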


Classification accuracy is lowest for tRNA queries. There are two possible reasons: First,

tRNA XIOS graphs are small; such small motifs may be found in many larger, but

unrelated, graphs in the database. Second, the number of tRNAs in the database is

relatively small compared to the other groups (16 structures collected from PDB

database). If a couple of tRNAs match to other RNA families, the fraction would be

relatively big in comparison to the other three RNA families which have more instances

in the database. Overall, this result confirms that our database search is able to identify

RNAs with similar structure and function based on topological matching.

4.3.2 Size only graph database search

Many RNAs within a functional class have very similar lengths. This raises the possibility that the classification shown in Figure 4.2 was simply due to matching between structures with similar sizes (the structure (graph) size is the number of vertices in the XIOS graph). To test this possibility, we implemented a function that scored the RNAs based only on their XIOS graph sizes. Figure 4.3 shows that matching by size alone achieves only about 20% classification accuracy, much lower than the level achieved by the topological matching (the Kolmogorov-Smirnov test result is shown in Table 4.1). This shows that graph size is not the main factor contributing to the correct RNA classification.
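A size-only control of this kind could be sketched as follows. The exact scoring used for the control is not given in the text, so the sketch assumes the simplest choice, ranking database graphs by the absolute difference between their vertex count and that of the query; the function name `rank_by_size` is hypothetical.

```python
from typing import Dict, List


def rank_by_size(query_size: int, db_sizes: Dict[str, int], top: int = 5) -> List[str]:
    """Return the `top` database entries whose vertex count is closest to the query's."""
    return sorted(
        db_sizes,
        key=lambda name: (abs(db_sizes[name] - query_size), name),  # ties by name
    )[:top]
```

Because the ranking ignores edge types and topology entirely, any classification accuracy it achieves reflects size similarity alone.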

4.3.3 Embedding simulation

Another key issue in matching to a structural database is whether a topologically and functionally similar structure can be found even when it is embedded in a larger structure. Such embedding could occur either in a biologically meaningful sense (a true relationship), or be due to misassembly or misannotation of the source sequence. For each of the structures in our dataset, we performed an embedding simulation, which automatically mutated the sequence of the query RNA structure graph to generate unique structures that are bigger (in size) than the input structure and have the input structure embedded in them (Figure 4.4). We call these bigger structures extended structures. Table 4.2 lists the statistics of the embedding simulation.

For each of the original query structures in the database, we generated circa 15

extended structures (Table 1.1). An NH index database search is then performed using

the extended structures as queries to see if the original graph and its related graphs can

still be found. This embedding simulation (Figure 4.5) clearly shows that the NH index database search is able to find the embedded structure and its related structures, compared with the results shown in Figure 4.2 (the Kolmogorov-Smirnov test result is shown in Table 4.1). It further suggests that the NH indexing XIOS database search acts more like a local similarity search than a global similarity search, since it can identify local graph patterns within a larger overall graph pattern. Indeed, the performance of the extended query search is even better than that of the original queries, most likely because the bigger the

structure is, the more likely it is to have hits to its family. The topological structure

database search is therefore robust and it successfully classified RNA structures into

their correct RNA families.


4.3.4 Blast search

The NH indexing database search was compared with a BLAST search. The result is shown in Figure 4.6. BLAST performs well for RNAseP and tmRNA, but not for tRNA and Group_I_Intron RNA. The Kolmogorov-Smirnov test (Table 4.1) shows that the NH indexing database search and BLAST achieve equivalent performance for RNAseP and tmRNA, where both searches perform well. It also shows that the NH indexing database search results for Group_I_Intron RNA and tRNA are significantly better than the BLAST search results.

4.4 Discussion

Identification of conserved sequences has played an important role in establishing

sequence-structure-function relationships in proteins, but has been less useful with RNA

because it is the folded structure rather than the sequence that is most closely related

to function. We have developed a structure database searching algorithm, the NH indexing database search, that can identify and classify topologically similar, and probably functionally similar, RNA structures. This knowledge can be used to build experimentally testable

hypotheses about the function of the query RNA.

The NH-indexing database-search algorithm can accurately classify RNA structures

without using primary sequence information. Integrating primary sequence information

into the framework would improve performance for RNAs that have significant

sequence similarity to others in their functional class. Combining both sequence and

topological information should improve the classification of unclassified or misclassified


sequences with low sequence identity but relatively close structural/topological

distances. Any significant sequence similarity should improve the ability to assign the

novel member to the correct family.

The NH-indexing database-search algorithm can also be used as a topological distance

measure of RNA structure similarity. Identifying conserved sequence motifs associated with unknown functions using multiple alignments and HMMs has been highly useful in identifying and classifying proteins according to their function (169-171), and should be similarly useful for RNA.

The current design of the database search requires that the database modules be

smaller or equal in size to the query module. This work can be extended to implement a

subgraph similarity search that would allow a certain amount of mismatch in order to

maximize the searching power of our approach. If mismatches are allowed, more

database structures would be included in the search result, but the results, presumably,

would include more false positives. Thus more information can be used to classify and identify query structures, but at the expense of increased noise. Another concern is

whether the current module size correction is reasonable. Currently, if the structure is

small, fewer results would be found in database search, since fewer indexed graph

modules are found in small XIOS graphs. It is likely that if we add smaller and family

specific motifs to the database, the search performance would be better for small RNA

query structures. Such family specific motifs can be identified and obtained by applying

feature selection methods to specific RNA family structure datasets.


Figure 4.1 Positive hit ratio (PHR) in a Blast search.

The positive hit ratio is the number of true positives divided by the total number of hits in the result. In this example, the black line represents a query search sequence, red lines are the true positive hits, and blue lines are false positive hits. The positive hit ratio for this search is 3/5.


Figure 4.2 NH database search result.

Topological criteria can be used to correctly identify and classify RNA structures across 4 distantly related RNA families: tRNA, Group I Intron RNA, RNAseP RNA and tmRNA. The x-axis shows the Positive hit ratio, which is calculated as the count of correct hits over total number of hits. For the top 5 hits case, the total number of hits considered is 5. For top 10 hits case, the total number of hits considered is 10. The y-axis is the percentage of queries showing the specified positive hit ratio.

[Plots for Figure 4.2, two panels. Recovered panel title: "Top 5 hits, fraction with penalty"; the title of the second panel is not recoverable. x-axis: Positive hit ratio (0 to 1). y-axis: Percentage (0 to 1). Series: All, tRNA, group1, RNAseP, tmRNA.]

Figure 4.3 Size only database search result.

The horizontal axis shows the PHR, and the vertical axis the fraction of queries reaching the specified level. The results here show distributions of close to random matching, indicating that matching between structures based on size alone is not the main factor contributing to the classification result shown in Figure 4.2.

[Plots for Figure 4.3, two panels. Panel titles: "Top 5 hits, fraction, graph size only" and "Top 10 hits, fraction, graph size only". Both axes run from 0 to 1. Series: All, tRNA, group1, RNAseP, tmRNA.]

Figure 4.4 Embedding simulation.

A. Sequence embedding. A given sequence (blue) is embedded between two flanking sequences (yellow and orange), resulting in a longer sequence. B. Graph embedding. Using the same idea as in sequence embedding, graph embedding is applied to a graph by adding extra vertices (orange) and edges to form a bigger graph. In our study, we implemented a Perl script that automatically mutates the base pairing of the input RNA structure graph at the sequence level and generates unique XIOS graphs that are bigger (in number of vertices) than the input graph and contain the input graph as an embedded subgraph.
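The graph embedding idea in panel B can be sketched directly at the graph level. This is only an illustration under stated assumptions: the thesis script works at the sequence level by mutating base pairing, whereas the hypothetical `embed_graph` function below simply adds new vertices and random edges to an adjacency-set graph, and it ignores the edge types of XIOS graphs. New edges always involve a newly added vertex, so the input graph is preserved as a subgraph.

```python
import random
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets


def embed_graph(graph: Graph, extra_vertices: int,
                edge_prob: float = 0.3, seed: int = 0) -> Graph:
    """Return a larger graph that contains `graph` as an induced subgraph."""
    rng = random.Random(seed)
    embedded: Graph = {v: set(nbrs) for v, nbrs in graph.items()}
    for k in range(extra_vertices):
        new_vertex = ("extra", k)  # hypothetical label, assumed unused by the input
        existing = list(embedded)  # vertices present before adding new_vertex
        embedded[new_vertex] = set()
        for v in existing:
            if rng.random() < edge_prob:
                embedded[new_vertex].add(v)
                embedded[v].add(new_vertex)
        if existing and not embedded[new_vertex]:
            # Attach an isolated new vertex to one existing vertex.
            v = rng.choice(existing)
            embedded[new_vertex].add(v)
            embedded[v].add(new_vertex)
    return embedded


# Hypothetical three-stem input graph.
small = {"s1": {"s2"}, "s2": {"s1", "s3"}, "s3": {"s2"}}
bigger = embed_graph(small, extra_vertices=4)
print(len(small), "->", len(bigger), "vertices")
```

Fixing `seed` makes each simulated graph reproducible; varying it would yield different embeddings of the same input graph.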

Figure 4.5 Embedding simulation database search result.

The x-axis shows the positive hit ratio (PHR), and the vertical axis shows the fraction of queries in each family achieving the specified PHR. For the top 5 hits case, the total number of hits considered is 5. For the top 10 hits case, the total number of hits considered is 10.

[Plots for Figure 4.5, two panels; the panel titles are not recoverable. x-axis: Positive hit ratio (0 to 1). y-axis: Percentage (0 to 1). Series: All, tRNA, group1, RNAseP, tmRNA.]

Figure 4.6 Blast search result

A Blast search using the RNA sequences was performed on the manually curated dataset.

[Plots for Figure 4.6, two panels. Panel titles: "Blast Search Top 5 hits" and "Blast Search Top 10 hits". x-axis: Positive hit ratio (0 to 1). y-axis: Percentage (0 to 1). Series: All, tRNA, RNAseP, group1, tmRNA.]

Table 4.1 Kolmogorov-Smirnov test results

Kolmogorov-Smirnov test p-values

                     vs. blast search           vs. size only              vs. embedment
RNA family           top 5 hits   top 10 hits   top 5 hits   top 10 hits   top 5 hits   top 10 hits
tRNA_M               0.0657       0.0657        0.999        0.0071        0.8998       0.3544
RNaseP_M             0.2307       0.005         0            0.0009        1            1
Group_I_Intron_M     0.0059       0.001         0            0             1            0.5769
tmRNA_M              0.866        0.058         0            0             0.9631       0.9919
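Table 4.1 compares distributions of positive hit ratios using the Kolmogorov-Smirnov test. As a rough illustration, and assuming a two-sample test was used (the variant is not stated here), the test statistic D is the largest gap between the two empirical cumulative distribution functions. The sketch below computes only D, with a hypothetical function name and made-up sample values; a p-value could then be obtained from a statistics package such as SciPy (`scipy.stats.ks_2samp`).

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic D (maximum ECDF difference)."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The ECDFs only change at observed values, so checking those is enough.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Made-up positive hit ratios for two search methods (illustration only).
method_one = [1.0, 0.8, 1.0, 0.6, 1.0]
method_two = [0.2, 0.4, 0.2, 0.0, 0.6]
print(ks_statistic(method_one, method_two))
```

A value of D near 0 indicates similar distributions, and a value near 1 indicates almost no overlap.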

Table 4.2 Statistics of embedding simulation

                           tRNA    RNAseP   group1   tmRNA    All
Simulated graph number     237     567      631      1778     3213
Simulation per structure   14.81   15.75    15.78    15.20    15.37
Min graph size             4       6        12       5
Max graph size             12      29       33       28
Average graph size         7.80    15.88    22.42    19.48

CHAPTER 5 IDENTIFICATION OF TOPOLOGICAL FEATURES THAT DISCRIMINATE BETWEEN RNA CLASSES

5.1 Introduction

5.1.1 RNA importance, RNA function determined by RNA structure

Like proteins, RNAs also perform important cellular functions, and our understanding of

this fact is increasing rapidly (172,173). As our existing knowledge about RNA grows, the

large scale characterization and analysis of RNA structures and functions, namely

structural genomics of RNA, becomes increasingly important (148,174). The core focus

of the structural genomics of RNA is to find all unique structural motifs and 3D folds,

because molecular structure determines function. Current RNA function predictions are mostly

based on finding conserved sequence motifs, similar to what is done with proteins. In

order to identify sequence motifs, multiple sequence alignments have to be generated.

In principle, conserved motifs can be identified from the alignments and the function of

RNAs predicted. The problem is that, compared to proteins, there are not many RNA

classes currently known, and RNA sequences sharing the same structural motifs may

have no detectable primary sequence similarity, making it impossible to align them. RNA

secondary structure prediction can help the prediction and identification of conserved

tertiary structure, but accurately predicting RNA secondary structures from sequence

information alone is not trivial.

As an alternative approach, graphs have been used to represent RNA secondary

structures. Notable efforts in this direction include the RNAshapes approach of the

Giegerich group (98), but RNAshapes does not include pseudoknots.

Another graphical approach is RAG (RNA-As-Graphs) from the Schlick group (168). However,

RAG enumerates all mathematically possible graphs, and thus the search space is very

large.

Reliable RNA secondary structural information is needed to solve the RNA function

prediction problem because experimental determination of large RNA structures is very

difficult. While there are many RNA secondary structure prediction programs, few of

them can predict the key elements called pseudoknots. Pseudoknots are a common

structural motif in many RNA classes, such as self-splicing introns and telomerase. They

play important roles in the catalytic functions of RNA, such as forming the catalytic core of

ribozymes, and in altering gene expression by inducing ribosomal frameshifting in many

viruses (175). The ability to annotate novel RNA secondary structures can give insight

into the possible different functions and roles of RNA. In addition, the ability to find

novel RNA secondary structures can help with the design of pharmaceuticals by

providing an accurate target site for drug recognition.

5.1.2 Contribution

Biological RNAs have different topological features than random RNAs (98,168,174). By

examining RNA families and selecting the most important features, and filtering out

common features shared by different families, one should be able to identify the unique

features contributing to the unique functions of different RNA classes. This would

constitute a mapping between topological features and function. Such a topology-function

mapping would lead to a better understanding of RNA structural patterns and

to more efficient engineering and design of RNA-based drugs and complexes with

specific functions and effects. In addition, identifying important topological

features in biological RNA molecules would help us to further refine our motif library to

contain the most discriminatory structural motifs.

5.2 Materials and Methods

5.2.1 Reverse cIndex basic feature selection on RNA fingerprints

For each specific RNA family, the members of the family should share similar structural

patterns (topological features) because they perform similar biological functions. On the

other hand, different RNA families should contain a lower fraction of similar structural

patterns because they play different roles in the biological processes. Feature selection,

that is, identification of the features that most powerfully discriminate between

different classes of RNAs, based on RNA fingerprints should provide useful information

about which structural motifs contribute to a specific RNA family and, implicitly, to

specific RNA biological functions.

Our feature selection strategy is based on the cIndex idea (155) (Figure 5.1), but uses the

opposite selection order. The cIndex strategy selects features from the most frequent to the

least frequent; our strategy selects features from the rarest to the most common.

We refer to this algorithm as cIndex-Basic-Min. Algorithm 5.1 outlines its

pseudocode. A graph feature matrix M is used to show the containment relationship

between features and graphs (Figure 5.2). Its (i,j) value is the count of feature i found

in graph j. The support of a feature f is the number of graphs in the

graph feature matrix that contain f. For a given graph feature matrix M, the algorithm first

selects a feature with minimum support (greater than or equal to 1), then removes this

feature (row) and all graphs that contain this feature (columns). This process is repeated until

the graph feature matrix is empty. The rationale for reversing the cIndex feature

selection order is that, in our study, the library contains many small structural motifs.

These small motifs appear essentially at random in every structure. If we used the original

cIndex algorithm, those small motifs would be selected as important features in the first few

iterations, and the graph feature matrix would then be empty. The selection would fail to

capture the structural motif features that are truly important.

Algorithm 5.1 cIndex-Basic-Min
Input: Graph feature matrix M.
Output: Selected feature list F.
Set the selected feature list F to the empty list { }
FOR each feature f in M
    IF support*(f) > 0
        Index f by its support(f) value
    END IF
END FOR
REPEAT
    Let Feature be the list of all features with the minimum support, support(f), in the index
    Append the features in the list Feature to F
    Record the iteration number iter and support(f)
    FOR each feature in the list Feature
        Find the corresponding row and delete all columns with non-zero values in that row (remove feature hits) from M
        Delete the corresponding row from M
    END FOR
UNTIL matrix M is empty
RETURN selected feature list F

* The support of a feature f is the number of graphs in the graph feature matrix that contain f.
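The selection loop of Algorithm 5.1 can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation. It assumes the graph feature matrix is given as a mapping from each feature to the set of graphs that contain it (counts greater than zero reduce to membership), and it recomputes support on the remaining graphs at every iteration, as the prose description suggests. The function name `cindex_basic_min` and the toy motif and graph names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def cindex_basic_min(feature_graphs: Dict[str, Set[str]]) -> List[Tuple[str, int, int]]:
    """Select features from rarest to most common.

    Returns (feature, iteration number, support at selection) tuples.
    """
    # Keep only features with support > 0, copying the sets so the input is untouched.
    remaining = {f: set(gs) for f, gs in feature_graphs.items() if gs}
    selected: List[Tuple[str, int, int]] = []
    iteration = 0
    while remaining:  # stop once no feature has a hit in any remaining graph
        iteration += 1
        min_support = min(len(gs) for gs in remaining.values())
        batch = sorted(f for f, gs in remaining.items() if len(gs) == min_support)
        covered: Set[str] = set()
        for f in batch:
            selected.append((f, iteration, min_support))
            covered |= remaining.pop(f)  # delete the feature's row
        # Delete the covered graphs (columns) and drop features left with no hits.
        remaining = {f: gs - covered for f, gs in remaining.items() if gs - covered}
    return selected


# Hypothetical toy matrix: motif m1 occurs in every graph, the others are rarer.
toy_matrix = {
    "m1": {"g1", "g2", "g3", "g4"},
    "m2": {"g1"},
    "m3": {"g2", "g3"},
    "m4": {"g3", "g4"},
}
selection = cindex_basic_min(toy_matrix)
print(selection)
```

In this toy case the ubiquitous motif `m1` is never selected, because all graphs it occurs in have already been removed by rarer features, which is the behaviour the reversed selection order is intended to produce.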

5.2.2 RNA structure classification

With the information gathered by cIndex-Basic-Min feature selection, we have

developed a classification method to classify RNA structures.

The classification scoring function is based on the following four factors: 1) feature

number (the number of features in each RNA family); 2) iteration number (for a feature

selected in iteration n of the cIndex-Basic-Min procedure, n is the iteration number); 3)

feature support (number of structures containing this feature); 4) feature size (number

of stems in the feature graph).

1. Iteration function: iteration(i) = − … + …, where variable i is the iteration number.

2. Support function: support(s, i) = s × iteration(i), where variable s is the support.

Support is the number of graphs in the graph feature matrix that contain that specific feature. The support function is essentially a combination of the support and the iteration function, so features with the same support do not necessarily get the same weight. More weight is given to common features.

3. Feature size function: size(fs) = fs^2, where variable fs is the feature size.

We give higher weights to bigger structural motif features.

Overall classification scoring function:

score = iteration(i) × support(s, i) × size(fs) / feature_number.

The classification scoring function associates each selected structural motif feature with

an RNA family, and gives each feature a specific weight for the later classification

calculation.

The RNA structural classification is based on the features found in the query and their

corresponding weights (scores). The query RNA structure receives a score for each RNA

family in the database. The query structure is assigned to the RNA family with the highest

score.
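The classification step described above can be sketched as follows. The sketch assumes that a family's score is the sum of the weights of that family's selected features found in the query; the exact aggregation is not spelled out here, so this is an assumption. The function name `classify`, the motif names, and the weights are hypothetical and are not values from this study.

```python
from typing import Dict, Iterable, Tuple


def classify(query_features: Iterable[str],
             family_weights: Dict[str, Dict[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Score the query against every family and return the best family."""
    found = set(query_features)
    scores = {
        family: sum(w for feature, w in weights.items() if feature in found)
        for family, weights in family_weights.items()
    }
    # Assumes at least one family; ties go to the first family in insertion order.
    best_family = max(scores, key=lambda family: scores[family])
    return best_family, scores


# Hypothetical per-family feature weights (illustration only).
weights = {
    "tRNA": {"motif_a": 0.9, "motif_b": 0.4},
    "tmRNA": {"motif_c": 0.5, "motif_d": 0.7},
}
family, family_scores = classify(["motif_a", "motif_c"], weights)
print(family, family_scores)
```

Returning the full score table alongside the winning family makes it easy to see how close the runner-up families are.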

This feature selection process finds the features that are important to a group of RNA

structures. These features can be associated with functions in known families and aid in

forming hypotheses about the function of novel RNAs.

5.2.3 RNA structure datasets

The datasets used in this work are a manually curated dataset (Table 1.1) and a dataset

collected from the STRAND database (Table 1.5).

5.3 Results

5.3.1 Feature selection on the fingerprint generated

Using the cIndex-Basic-Min feature selection strategy, we have successfully identified

features which are important to specific RNA families in our manually curated dataset

(Table 1.1), as well as in a dataset downloaded from the STRAND database (Table 1.5).

Each dataset contains only non-redundant sequences with low sequence similarity (<

50%). The manually curated dataset contains more reliable RNA secondary structures

(the curation process is described in section 1.7.1). Because the STRAND dataset is collected

from many different sources, noise and partial structures are likely to be more common in

this dataset.

Table 5.1 shows the statistics of the selected structural motif features. Figure 5.3 shows

that in our feature selection strategy, higher weights are given to the features which are

neither too rare nor too common. This agrees with a well-accepted observation (176).

5.3.2 Top unique features selected (in the same order as weights)

Figure 5.4 and Figure 5.5 show the four structural features found to be the most

important in discriminating each of the RNA families from our datasets (Table 1.1 and

Table 1.5). These top structural motif features highlight characteristics known to be

important in the RNA families (Figure 5.6). Conceptually, this provides a means to

decode the relationship between structure and function. In other words, we can

understand the mapping from structural motif features to RNA biological functions. One

can ask whether these features are specific enough, since the actual motif sizes in the

structural motif library range from 1 to 7 stems. From the result in Figure 5.6, we can see

that they are sufficient to describe the RNA family structure.

5.3.3 Validation of RNA structure classification

Table 5.2 shows the RNA structure classification result. The performance of the

classification is good for most of the RNA families included here; only RNAseP (STRAND)

performs noticeably worse (0.78). This likely reflects the poorer curation of the structures in the STRAND

database; the RNaseP dataset we downloaded from STRAND database contains many

small/partial structures which can be easily misclassified.

To further assess the classification performance, we performed leave-one-out cross-validation

(LOOCV). The LOOCV classification rates are shown in Table 5.3. All the RNA families

show a high LOOCV classification rate, with the exception of RNAseP STRAND (0.58),

discussed above. This is expected, since that dataset contains small/partial structures and the

classification performance on that family was already weaker than on the other families.
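The leave-one-out procedure can be sketched generically in Python. The helper `loocv_rate` below is hypothetical and takes the training and prediction steps as functions. In the setting of this chapter, the training step would rerun feature selection and weighting on the remaining structures; the toy `train_union` and `predict_overlap` functions used here are simple stand-ins (per-family feature unions and maximum overlap), not the thesis classifier.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Set, Tuple

Sample = Tuple[FrozenSet[str], str]  # (features found in a structure, family label)
Model = Dict[str, Set[str]]


def loocv_rate(samples: Sequence[Sample],
               train: Callable[[List[Sample]], Model],
               predict: Callable[[Model, FrozenSet[str]], str]) -> float:
    """Fraction of samples classified correctly when each is held out in turn."""
    if len(samples) < 2:
        raise ValueError("LOOCV needs at least two samples")
    correct = 0
    for i, (features, label) in enumerate(samples):
        training = [s for j, s in enumerate(samples) if j != i]  # hold out sample i
        model = train(training)
        if predict(model, features) == label:
            correct += 1
    return correct / len(samples)


def train_union(training: List[Sample]) -> Model:
    """Toy training step: collect every feature seen in each family."""
    model: Model = {}
    for features, label in training:
        model.setdefault(label, set()).update(features)
    return model


def predict_overlap(model: Model, features: FrozenSet[str]) -> str:
    """Toy prediction step: pick the family sharing the most features."""
    return max(sorted(model), key=lambda family: len(features & model[family]))


# Hypothetical toy dataset (illustration only).
toy_samples = [
    (frozenset({"a", "b"}), "tRNA"),
    (frozenset({"a", "c"}), "tRNA"),
    (frozenset({"x", "y"}), "tmRNA"),
    (frozenset({"x", "z"}), "tmRNA"),
]
rate = loocv_rate(toy_samples, train_union, predict_overlap)
print(rate)
```

Because the held-out structure never contributes to training, the resulting rate is a less optimistic estimate than classifying the same structures that were used for feature selection.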

Finally, we classified RNA structures collected from the STRAND source by using the

selected features from the manually curated dataset and vice versa. The performance

(Table 5.4) is still reasonably good, but slightly worse than the result we saw above.

When features selected from the manually curated dataset are used to classify STRAND

structures, the overall performance is weaker: the correct classification rates for RNAseP

STRAND and Group I intron STRAND are 0.58 and 0.57, respectively. Again, we note that the

STRAND dataset contains partial structures and misclassified structures. These

structures are likely to be misclassified, either because they actually belong to other

classes, or because their small size interrupts or truncates the selected features on

which the classification is based.

On the other hand, when features learned from the STRAND dataset are used to classify the

manually curated structures, the performance is very good, with one notable exception:

when the manually curated tRNA structures are classified using the features selected from

tRNA STRAND, only 1 of 16 structures is correctly classified. This is what we expect to see,

because the manually curated tRNA data are based on three-dimensional structures in PDB

and therefore contain pseudoknots. Conversely, when classifying

tRNA STRAND data using the tRNA manually curated selected features, the performance

is as good as when using the features selected from tRNA STRAND data (STRAND data is

based on secondary structure and does not contain pseudoknots). This indirectly shows

the robustness of our feature selection method.

5.4 Discussion

This feature selection process should be able to find the structural features that are

important to different RNA families. Identification of these features, in turn, should

help create a mapping between structural features and biological function.

The maximum structural motif size used in this study is only 21 edges (7 stems), which is

relatively small compared with the large, complicated graphs seen in biological RNA

structures. This limitation can be overcome by enumerating subgraphs

of biological structures, which would generate bigger motifs that are subgraphs of big

biological structures. As we collect more biological RNA structure data, we will be able

to add bigger structural motifs back into the motif library. That would provide bigger,

biologically relevant structural motifs. The feature selection result would be more

specific and, possibly, more biologically meaningful.

Nevertheless, our results show that the 1 to 7 stem structural motifs are already

sufficient to describe RNA family structures well. As the size of the structural motifs

increases, the match to a structure becomes more specific. In our situation, we would

like the matching of the structural motifs to be neither too specific nor too

random (176). More work needs to be done in order to understand what size structural

motifs achieve the best performance.

From our classification test (Table 5.4), we found that features learned from our reliable

dataset (the manually curated dataset) perform poorly against more poorly annotated

datasets. Interestingly, features learned from the dataset with lower quality (the STRAND

dataset) can give very good classification results. We believe

that the features learned from a low-quality dataset (such as STRAND) would still tend

to include features that are representative of the dominant family of structures in the

dataset, which can represent the specific RNA family, as well as extraneous features

arising from misclassified structures. This feature set can be considered a superset

of the features that are important to a specific RNA family. In the superset, all good

features are included, so the reliable RNA structures can still be correctly classified. This

suggests our approach can detect the poorer quality of the uncurated dataset.

A further application is that our feature selection framework can be used to clean up

RNA structure data in publicly accessible databases. Obtaining a gold-standard structure

dataset is currently a tedious task for RNA structure analysts. Our

strategy can provide a potential solution.

We are continuously retrieving RNA sequence and structure information from available

public databases (NCBI, RFAM, RNase P Database, etc.), and building a comprehensive

RNA fingerprint database containing the structural features, sequence, and function for

each entry. The sequence and function information for the known structures is

available to aid in assignment of function to novel RNA queries.

In the pharmaceutical industry, design of inhibitory or therapeutic RNAs often begins

with randomly generated RNA sequences of specific lengths, followed by tests of whether the

generated molecules have specific function(s). This random approach is both

time-consuming and costly due to the combinatorial search space. With the help of our feature

selection approach, we can identify structural motifs from a relevant biological

sequence pool, which would be likely to perform desired functions (based on similarity

to known molecules). For researchers, this significantly reduces the search space to

identify functional RNA molecules of therapeutic use and saves time and expense.

Overall, by exploiting molecular structure information, our new application could

benefit biologists in many ways. Our web service can be accessed freely by the public

at http://xios.genomics.purdue.edu.

5.5 Future directions

The RNA structure classification problem is complicated. Different feature

selection and extraction methods, such as PCA, SVM, cosine-distance clustering, other machine

learning methods (e.g., Naive Bayes), and semi-supervised statistical methods, could be

applied. In the end, we would like to find a "sequence to structure N-to-1 mapping" as

well as a "structure to function N-to-N mapping". With this knowledge, RNA family

classification, annotation and RNA structural and functional prediction will greatly

benefit from our new approach.

Figure 5.1 Idea of graph containment search.

Modified from (155).

Figure 5.2 Graph feature matrix.

This matrix records which graphs (g) in the database contain each feature (f).

Figure 5.3 Feature selection weight vs. iteration

[Plots for Figure 5.3, two panels, both titled "Weight/Score vs iteration". x-axis: iteration number (0 to 1). y-axis: weight/score (0 to 1). Series in the first panel: tRNA, group1, RNAseP, tmRNA. Series in the second panel: tRNA STRAND, group1 STRAND, RNAseP STRAND, tmRNA STRAND.]

tRNA manually curated: 5_stem_motif_253, 4_stem_motif_5, 4_stem_motif_26, 3_stem_motif_1
RNAseP manually curated: 7_stem_motif_38580, 7_stem_motif_13161, 7_stem_motif_6225, 7_stem_motif_11091
Group I intron manually curated: 6_stem_motif_2232, 5_stem_motif_41, 5_stem_motif_35, 4_stem_motif_20
tmRNA manually curated: 7_stem_motif_13849, 7_stem_motif_31245, 7_stem_motif_9944, 7_stem_motif_14051

Figure 5.4 Selected top unique structural features in dataset Table 1.1.

Structural features selected by using Algorithm 5.1 for RNA families in the manually

curated dataset. Features are listed in the same order as their weights for

that specific RNA family.

tRNA STRAND: 4_stem_motif_27, 5_stem_motif_211, 5_stem_motif_180, 3_stem_motif_1
RNAseP STRAND: 7_stem_motif_36950, 7_stem_motif_49000, 7_stem_motif_11091, 7_stem_motif_18408
Group I intron STRAND: 7_stem_motif_27347, 7_stem_motif_38147, 7_stem_motif_37624, 7_stem_motif_45173
tmRNA STRAND: 7_stem_motif_13161, 7_stem_motif_32198, 7_stem_motif_36615, 7_stem_motif_15157

Figure 5.5 Selected top unique structural features in dataset Table 1.5.

Structural features selected by using Algorithm 5.1 for RNA families in the STRAND dataset. Features are listed in the same order as their weights for that specific RNA family.

Figure 5.6 Link from structure to function.

Left, selected features for RNaseP manually curated and STRAND (blue frame), as well as RNAseP secondary structure. Right, selected features for tRNA manually curated and STRAND (blue frame), as well as tRNA secondary structure plus pseudoknots found in 3D structure.

Table 5.1 Statistics of the selected structural features for four RNA families from two datasets.

RNA family                         Selected Feature #
tRNA manually curated              25
RNAseP manually curated            1307
Group I intron manually curated    131
tmRNA manually curated             276
tRNA STRAND                        15
RNAseP STRAND                      753
Group I intron STRAND              85
tmRNA STRAND                       176

Table 5.2 Classification performance

RNA family                         Classification performance
tRNA manually curated              16 out of 16 (1.00)
RNAseP manually curated            40 out of 40 (1.00)
Group I intron manually curated    36 out of 36 (1.00)
tmRNA manually curated             115 out of 117 (0.98)
tRNA STRAND                        585 out of 601 (0.97)
RNAseP STRAND                      28 out of 36 (0.78)
Group I intron STRAND              21 out of 21 (1.00)
tmRNA STRAND                       30 out of 30 (1.00)

Table 5.3 Leave one out cross validation result

RNA family                         LOOCV    Sample size
tRNA manually curated              0.94     16
RNAseP manually curated            1.00     40
Group I intron manually curated    1.00     36
tmRNA manually curated             0.99     117
tRNA STRAND                        0.97     601
RNAseP STRAND                      0.58     36
Group I intron STRAND              1.00     21
tmRNA STRAND                       1.00     30

Table 5.4 Classification test

Manually curated features tested on STRAND data

RNA family                         Classification performance
tRNA STRAND                        558 out of 601 (0.93)
RNAseP STRAND                      21 out of 36 (0.58)
Group I intron STRAND              12 out of 21 (0.57)
tmRNA STRAND                       30 out of 30 (1.00)

STRAND features tested on manually curated data

RNA family                         Classification performance
tRNA manually curated              1 out of 16 (0.06)
RNAseP manually curated            40 out of 40 (1.00)
Group I intron manually curated    34 out of 36 (0.94)
tmRNA manually curated             115 out of 117 (0.98)

LIST OF REFERENCES

1. Reuter, J.S. and Mathews, D.H. (2010) RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 11, 129-137.

2. Crick, F.H. (1958) On protein synthesis. Symp Soc Exp Biol, 12, 138-163.

3. Mills, D.R., Peterson, R.L. and Spiegelman, S. (1967) An extracellular Darwinian experiment with a self-duplicating nucleic acid molecule. Proc Natl Acad Sci U S A, 58, 217-224.

4. Spiegelman, S. (1971) An approach to the experimental analysis of precellular evolution. Q Rev Biophys, 4, 213-253.

5. Kramer, F.R., Mills, D.R., Cole, P.E., Nishihara, T. and Spiegelman, S. (1974) Evolution in vitro: sequence and phenotype of a mutant RNA resistant to ethidium bromide. Journal of molecular biology, 89, 719-736.

6. Eigen, M. (1971) Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften, 58, 465-523.

7. Biebricher, C.K., Eigen, M. and Gardiner, W.C., Jr. (1983) Kinetics of RNA replication. Biochemistry, 22, 2544-2559.

8. Biebricher, C.K., Eigen, M. and Gardiner, W.C., Jr. (1985) Kinetics of RNA replication: competition and selection among self-replicating RNA species. Biochemistry, 24, 6550-6560.

9. Biebricher, C.K. (1987) Replication and evolution of short-chained RNA species replicated by Q beta replicase. Cold Spring Harb Symp Quant Biol, 52, 299-306.

10. Woese, C. (1967) The Genetic Code: The Molecular Basis for Genetic Expression. Harper.

11. Cech, T. (1986) RNA as an enzyme. Scientific American, 255, 64-75.

12. Cech, T.R. (1990) Self-splicing of group I introns. Annu Rev Biochem, 59, 543-568.

13. Kruger, K., Grabowski, P.J., Zaug, A.J., Sands, J., Gottschling, D.E. and Cech, T.R. (1982) Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena. Cell, 31, 147-157.

14. Guerrier-Takada, C., Gardiner, K., Marsh, T., Pace, N. and Altman, S. (1983) The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell, 35, 849-857.

15. Guerrier-Takada, C. and Altman, S. (1984) Catalytic activity of an RNA molecule prepared by transcription in vitro. Science, 223, 285-286.

16. Gilbert, W. (1986) Origin of life: The RNA world. Nature, 319, 618-618.

17. Joyce, G.F. (1989) RNA evolution and the origins of life. Nature, 338, 217-224.

18. Joyce, G.F. (1991) The rise and fall of the RNA world. New Biol, 3, 399-407.

19. Freeland, S.J., Knight, R.D. and Landweber, L.F. (1999) Do Proteins Predate DNA? Science, 286, 690-692.

20. Watson, J.D. and Crick, F.H. (1953) Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171, 737-738.

21. Crick, F.H. (1966) Codon--anticodon pairing: the wobble hypothesis. Journal of molecular biology, 19, 548-555.

22. Pyle, A.M., Murphy, F.L. and Cech, T.R. (1992) RNA substrate binding site in the catalytic core of the Tetrahymena ribozyme. Nature, 358, 123-128.

23. Cate, J.H., Gooding, A.R., Podell, E., Zhou, K., Golden, B.L., Kundrot, C.E., Cech, T.R. and Doudna, J.A. (1996) Crystal Structure of a Group I Ribozyme Domain: Principles of RNA Packing. Science, 273, 1678-1685.

24. Unrau, P.J. and Bartel, D.P. (1998) RNA-catalysed nucleotide synthesis. Nature, 395, 260-263.

25. Illangasekare, M. and Yarus, M. (1999) A tiny RNA that catalyzes both aminoacyl-RNA and peptidyl-RNA synthesis. RNA, 5, 1482-1489.

26. Lee, N., Bessho, Y., Wei, K., Szostak, J.W. and Suga, H. (2000) Ribozyme-catalyzed tRNA aminoacylation. Nat Struct Biol, 7, 28-33.

27. Johnston, W.K., Unrau, P.J., Lawrence, M.S., Glasner, M.E. and Bartel, D.P. (2001) RNA-catalyzed RNA polymerization: accurate and general RNA-templated primer extension. Science, 292, 1319-1325.

28. Baskerville, S. and Bartel, D.P. (2002) A ribozyme that ligates RNA to protein. Proceedings of the National Academy of Sciences of the United States of America, 99, 9154-9159.

29. Joyce, G.F. (2002) The antiquity of RNA-based evolution. Nature, 418, 214-221.

30. Serganov, A. and Patel, D.J. (2007) Ribozymes, riboswitches and beyond: regulation of gene expression without proteins. Nat Rev Genet, 8, 776-790.

31. Strobel, S.A. and Cochrane, J.C. (2007) RNA catalysis: ribozymes, ribosomes, and riboswitches. Curr Opin Chem Biol, 11, 636-643.

32. Jeffares, D.C., Poole, A.M. and Penny, D. (1998) Relics from the RNA world. J Mol Evol, 46, 18-36.

33. Moore, P.B. and Steitz, T.A. (2002) The involvement of RNA in ribosome function. Nature, 418, 229-235.

34. Doudna, J.A. and Cech, T.R. (2002) The chemical repertoire of natural ribozymes. Nature, 418, 222-228.

35. Maeda, N., Kasukawa, T., Oyama, R., Gough, J., Frith, M., Engstrom, P.G., Lenhard, B., Aturaliya, R.N., Batalov, S., Beisel, K.W. et al. (2006) Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genet, 2, e62.

36. Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R., Gingeras, T.R., Margulies, E.H., Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E. et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447, 799-816.

37. Ravasi, T., Suzuki, H., Pang, K.C., Katayama, S., Furuno, M., Okunishi, R., Fukuda, S., Ru, K., Frith, M.C., Gongora, M.M. et al. (2006) Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Res, 16, 11-19.

38. Majdalani, N., Chen, S., Murrow, J., St John, K. and Gottesman, S. (2001) Regulation of RpoS by a novel small RNA: the characterization of RprA. Mol Microbiol, 39, 1382-1394.

39. Havilio, M., Levanon, E.Y., Lerman, G., Kupiec, M. and Eisenberg, E. (2005) Evidence for abundant transcription of non-coding regions in the Saccharomyces cerevisiae genome. BMC Genomics, 6, 93-100.

40. David, L., Huber, W., Granovskaia, M., Toedling, J., Palm, C.J., Bofkin, L., Jones, T., Davis, R.W. and Steinmetz, L.M. (2006) A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci U S A, 103, 5320-5325.

41. Manak, J.R., Dike, S., Sementchenko, V., Kapranov, P., Biemar, F., Long, J., Cheng, J., Bell, I., Ghosh, S., Piccolboni, A. et al. (2006) Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nature genetics, 38, 1151-1158.

42. Miura, F., Kawaguchi, N., Sese, J., Toyoda, A., Hattori, M., Morishita, S. and Ito, T. (2006) A large-scale full-length cDNA analysis to explore the budding yeast transcriptome. Proc Natl Acad Sci U S A, 103, 17846-17851.

43. Ravasi, T., Suzuki, H., Pang, K.C., Katayama, S., Furuno, M., Okunishi, R., Fukuda, S., Ru, K., Frith, M.C., Gongora, M.M. et al. (2006) Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Research, 16, 11-19.

44. He, H., Wang, J., Liu, T., Liu, X.S., Li, T., Wang, Y., Qian, Z., Zheng, H., Zhu, X., Wu, T. et al. (2007) Mapping the C. elegans noncoding transcriptome with a whole-genome tiling microarray. Genome Res, 17, 1471-1477.

45. Li, D., Willkomm, D.K., Schon, A. and Hartmann, R.K. (2007) RNase P of the Cyanophora paradoxa cyanelle: a plastid ribozyme. Biochimie, 89, 1528-1538.

46. Wilhelm, B.T., Marguerat, S., Watt, S., Schubert, F., Wood, V., Goodhead, I., Penkett, C.J., Rogers, J. and Bahler, J. (2008) Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature, 453, 1239-1243.

47. Bompfunewerer, A.F., Flamm, C., Fried, C., Fritzsch, G., Hofacker, I.L., Lehmann, J., Missal, K., Mosig, A., Muller, B., Prohaska, S.J. et al. (2005) Evolutionary patterns of non-coding RNAs. Theory Biosci, 123, 301-369.

48. Caetano-Anollés, G. (2010) Evolutionary Genomics and Systems Biology. Wiley-Blackwell.

49. Staple, D.W. and Butcher, S.E. (2005) Pseudoknots: RNA structures with diverse functions. PLoS Biol, 3, e213.

50. Puglisi, J.D., Wyatt, J.R. and Tinoco, I. (1991) RNA pseudoknots. Accounts of Chemical Research, 24, 152-158.

51. Mans, R.M., Van Steeg, M.H., Verlaan, P.W., Pleij, C.W. and Bosch, L. (1992) Mutational analysis of the pseudoknot in the tRNA-like structure of turnip yellow mosaic virus RNA. Aminoacylation efficiency and RNA pseudoknot stability. Journal of molecular biology, 223, 221-232.

52. Mans, R.M., Pleij, C.W. and Bosch, L. (1991) tRNA-like structures. Structure, function and evolutionary significance. Eur J Biochem, 201, 303-324.

53. Brierley, I., Rolley, N.J., Jenner, A.J. and Inglis, S.C. (1991) Mutational analysis of the RNA pseudoknot component of a coronavirus ribosomal frameshifting signal. Journal of molecular biology, 220, 889-902.

54. Tzeng, T.H., Tu, C.L. and Bruenn, J.A. (1992) Ribosomal frameshifting requires a pseudoknot in the Saccharomyces cerevisiae double-stranded RNA virus. J Virol, 66, 999-1006.

55. Chamorro, M., Parkin, N. and Varmus, H.E. (1992) An RNA pseudoknot and an optimal heptameric shift site are required for highly efficient ribosomal frameshifting on a retroviral messenger RNA. Proc Natl Acad Sci U S A, 89, 713-717.

56. ten Dam, E.B., Pleij, C.W. and Bosch, L. (1990) RNA pseudoknots: translational frameshifting and readthrough on viral RNAs. Virus Genes, 4, 121-136.

57. Dinman, J.D., Icho, T. and Wickner, R.B. (1991) A -1 ribosomal frameshift in a double-stranded RNA virus of yeast forms a gag-pol fusion protein. Proc Natl Acad Sci U S A, 88, 174-178.

58. Wills, N.M., Gesteland, R.F. and Atkins, J.F. (1991) Evidence that a downstream pseudoknot is required for translational read-through of the Moloney murine leukemia virus gag stop codon. Proceedings of the National Academy of Sciences, 88, 6991-6995.

59. Gallie, D.R., Feder, J.N., Schimke, R.T. and Walbot, V. (1991) Functional analysis of the tobacco mosaic virus tRNA-like structure in cytoplasmic gene regulation. Nucleic acids research, 19, 5031-5036.

60. Westhof, E. and Jaeger, L. (1992) RNA pseudoknots. Current Opinion in Structural Biology, 2, 327-333.

61. Nussinov, R., Pieczenik, G., Griggs, J.R. and Kleitman, D.J. (1978) Algorithms for Loop Matchings. SIAM Journal on Applied Mathematics, 35, 68-82.

62. Konings, D.A. and Hogeweg, P. (1989) Pattern analysis of RNA secondary structure similarity and consensus of minimal-energy folding. Journal of molecular biology, 207, 597-614.

63. Le, S.-Y., Nussinov, R. and Maizel, J.V. (1989) Tree graphs of RNA secondary structures and their comparisons. Computers and Biomedical Research, 22, 461-473.

64. Zuker, M. and Sankoff, D. (1984) RNA secondary structures and their prediction. Bulletin of Mathematical Biology, 46, 591-621.

65. Shapiro, B.A. and Zhang, K. (1990) Comparing multiple RNA secondary structures using tree comparisons. Computer applications in the biosciences : CABIOS, 6, 309-318.

66. Eddy, S.R. and Durbin, R. (1994) RNA sequence analysis using covariance models. Nucleic acids research, 22, 2079-2088.

67. Zuker, M. (1989), Science, Vol. 244, pp. 48-52.

68. Zuker, M. (1994) Prediction of RNA secondary structure by energy minimization. Methods Mol. Biol, 25, 267-294.

69. McCaskill, J.S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105-1119.

70. Cook, S.A. (1971), Proceedings of the third annual ACM symposium on Theory of computing. ACM, Shaker Heights, Ohio, United States, pp. 151-158.

71. Hagadone, T.R. (1992) Molecular substructure similarity searching: efficient retrieval in two-dimensional structure databases. Journal of Chemical Information and Computer Sciences, 32, 515-521.

72. Willett, P., Barnard, J.M. and Downs, G.M. (1998) Chemical Similarity Searching. Journal of Chemical Information and Computer Sciences, 38, 983-996.

73. Hansch, C., Muir, R.M., Fujita, T., Maloney, P.P., Geiger, F. and Streich, M. (1963) The Correlation of Biological Activity of Plant Growth Regulators and Chloromycetin Derivatives with Hammett Constants and Partition Coefficients. Journal of the American Chemical Society, 85, 2817-2824.

74. Gan, H.H., Pasquali, S. and Schlick, T. (2003) Exploring the repertoire of RNA secondary motifs using graph theory; implications for RNA design. Nucleic acids research, 31, 2926-2943.

75. Harary, F. and Prins, G. (1959) The number of homeomorphically irreducible trees, and other species. Acta Mathematica, 101, 141-162.

76. Harary, F. (1969) Graph Theory. Addison-Wesley, Reading, MA.

77. Schuster, P. (1997) Genotypes with phenotypes: adventures in an RNA toy world. Biophys Chem, 66, 75-110.

78. Gierasch, L.M. and King, J. (eds.) (1990) Protein Folding: Deciphering the Second Half of the Genetic Code. American Association for the Advancement of Science.

79. Doudna, J.A. (2000) Structural genomics of RNA. Nat Struct Biol, 7 Suppl, 954-956.

80. Ogurtsov, A.Y., Shabalina, S.A., Kondrashov, A.S. and Roytberg, M.A. (2006) Analysis of internal loops within the RNA secondary structure in almost quadratic time. Bioinformatics (Oxford, England), 22, 1317-1324.

81. Do, C.B., Woods, D.A. and Batzoglou, S. (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics (Oxford, England), 22, e90-98.

82. Flamm, C., Fontana, W., Hofacker, I.L. and Schuster, P. (2000) RNA folding at elementary step resolution. RNA, 6, 325-338.

83. Zuker, M. and Stiegler, P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic acids research, 9, 133-148.

84. Ying, X., Luo, H., Luo, J. and Li, W. (2004) RDfolder: a web server for prediction of RNA secondary structure. Nucleic acids research, 32, W150-153.

85. Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, L.S., Tacker, M. and Schuster, P. (1994) Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie / Chemical Monthly, 125, 167-188.

86. Hofacker, I.L. and Stadler, P.F. (2006) Memory efficient folding algorithms for circular RNA secondary structures. Bioinformatics (Oxford, England), 22, 1172-1176.

87. Danilova, L.V., Pervouchine, D.D., Favorov, A.V. and Mironov, A.A. (2006) RNAKinetics: a web server that models secondary structure kinetics of an elongating RNA. J Bioinform Comput Biol, 4, 589-596.

88. Ding, Y., Chan, C.Y. and Lawrence, C.E. (2004) Sfold web server for statistical folding and rational design of nucleic acids. Nucleic acids research, 32, W135-141.

89. Dawson, W., Fujiwara, K., Kawai, G., Futamura, Y. and Yamamoto, K. (2006) A method for finding optimal RNA secondary structures using a new entropy model (vsfold). Nucleosides Nucleotides Nucleic Acids, 25, 171-189.

90. Ren, J., Rastegari, B., Condon, A. and Hoos, H.H. (2005) HotKnots: heuristic prediction of RNA secondary structures including pseudoknots. RNA, 11, 1494-1504.

91. Huang, C.H., Lu, C.L. and Chiu, H.T. (2005) A heuristic approach for detecting RNA H-type pseudoknots. Bioinformatics (Oxford, England), 21, 3501-3508.

92. Xayaphoummine, A., Bucher, T. and Isambert, H. (2005) Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots. Nucleic acids research, 33, W605-610.

93. Zadeh, J.N., Steenberg, C.D., Bois, J.S., Wolfe, B.R., Pierce, M.B., Khan, A.R., Dirks, R.M. and Pierce, N.A. (2011) NUPACK: Analysis and design of nucleic acid systems. J Comput Chem, 32, 170-173.

94. Reeder, J. and Giegerich, R. (2004) Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5, 104-115.

95. Rivas, E. and Eddy, S.R. (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of molecular biology, 285, 2053-2068.

96. Huang, X. and Ali, H. (2007) High sensitivity RNA pseudoknot prediction. Nucleic acids research, 35, 656-663.

97. Wuchty, S., Fontana, W., Hofacker, I.L. and Schuster, P. (1999) Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers, 49, 145-165.

98. Steffen, P., Voss, B., Rehmsmeier, M., Reeder, J. and Giegerich, R. (2006) RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics (Oxford, England), 22, 500-503.

99. Clote, P. (2005) RNALOSS: a web server for RNA locally optimal secondary structures. Nucleic acids research, 33, W600-W604.

100. Markham, N.R. and Zuker, M. (2008) UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol, 453, 3-31.

101. Shapiro, B.A., Kasprzak, W., Grunewald, C. and Aman, J. (2006) Graphical exploratory data analysis of RNA secondary structure dynamics predicted by the massively parallel genetic algorithm. J Mol Graph Model, 25, 514-531.

102. Tinoco, I., Jr., Borer, P.N., Dengler, B., Levin, M.D., Uhlenbeck, O.C., Crothers, D.M. and Bralla, J. (1973) Improved estimation of secondary structure in ribonucleic acids. Nat New Biol, 246, 40-41.

103. Tinoco, I., Jr., Uhlenbeck, O.C. and Levine, M.D. (1971) Estimation of secondary structure in ribonucleic acids. Nature, 230, 362-367.

104. Bellman, R. (1952) On the Theory of Dynamic Programming. Proc Natl Acad Sci U S A, 38, 716-719.

105. Nussinov, R. and Jacobson, A.B. (1980) Fast Algorithm for Predicting the Secondary Structure of Single-Stranded RNA. Proc. Natl. Acad. Sci. U. S. A., 77, 6309-6313.

106. Reeder, J., Steffen, P. and Giegerich, R. (2007) pknotsRG: RNA pseudoknot folding including near-optimal structures and sliding windows. Nucleic Acids Res., 35, W320-W324.

107. Sperschneider, J., Datta, A. and Wise, M.J. (2011) Heuristic RNA pseudoknot prediction including intramolecular kissing hairpins. RNA, 17, 27-38.

108. Sperschneider, J. and Datta, A. (2010) DotKnot: pseudoknot prediction using the probability dot plot under a refined energy model. Nucleic acids research, 38, e103.

109. Schreiber, S.L. (2000) Target-Oriented and Diversity-Oriented Organic Synthesis in Drug Discovery. Science, 287, 1964-1969.

110. Joyce, G.F. (1992) Directed molecular evolution. Sci Am, 267, 90-97.

111. Kauffman, S.A. (1986) Autocatalytic sets of proteins. J Theor Biol, 119, 1-24.

112. Kauffman, S.A. (1992) Applied molecular evolution. J Theor Biol, 157, 1-7.

113. Eigen, M. and Gardiner, W.C. (1984) Evolutionary molecular engineering based on RNA replication. Pure Appl. Chem., 56, 967-978.

114. Horwitz, M.S., Dube, D.K. and Loeb, L.A. (1989) Selection of new biological activities from random nucleotide sequences: evolutionary and practical considerations. Genome, 31, 112-117.

115. Ellington, A.D. and Szostak, J.W. (1990) In vitro selection of RNA molecules that bind specific ligands. Nature, 346, 818-822.

116. Bartel, D.P. and Szostak, J.W. (1993) Isolation of new ribozymes from a large pool of random sequences. Science, 261, 1411-1418.

117. Chapman, K.B. and Szostak, J.W. (1994) In vitro selection of catalytic RNAs. Curr Opin Struct Biol, 4, 618-622.

118. Lorsch, J.R. and Szostak, J.W. (1994) In vitro evolution of new ribozymes with polynucleotide kinase activity. Nature, 371, 31-36.

119. The tmRNA Website. http://www.indiana.edu/~tmrna/.

120. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data Bank. Nucleic acids research, 28, 235-242.

121. Andronescu, M., Bereg, V., Hoos, H.H. and Condon, A. (2008) RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics, 9, 340.

122. Ellis, J.C. and Brown, J.W. (2009) The RNase P family. RNA Biol, 6, 362-369.

123. Yang, H., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H. and Westhof, E. (2003) Tools for the automatic identification and classification of RNA base pairs. Nucleic acids research, 31, 3450-3460.

124. Ellis, J.C. and Brown, J.W. (2009) The RNase P family. RNA Biol., 6, 362-369.

125. Brown, J.W. (1999) The Ribonuclease P Database. Nucleic Acids Res., 27, 314-314.

126. Yang, H.W., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H. and Westhof, E. (2003) Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res., 31, 3450-3460.

127. Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y.S., Feng, B., Lin, N., Madabusi, L.V., Muller, K.M. et al. (2002) The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3, 2.

128. Williams, K.P. (2002) The tmRNA Website: invasion by an intron. Nucleic Acids Res., 30, 179-182.

129. Zarrinkar, P.P. and Williamson, J.R. (1996) The kinetic folding pathway of the Tetrahymena ribozyme reveals possible similarities between RNA and protein folding. Nat Struct Biol, 3, 432-438.

130. Doherty, E.A. and Doudna, J.A. (1997) The P4-P6 domain directs higher order folding of the Tetrahymena ribozyme core. Biochemistry, 36, 3159-3169.

131. Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M. and Turner, D.H. (2004) Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci U S A, 101, 7287-7292.

132. Kim, N., Shiffeldrim, N., Gan, H.H. and Schlick, T. (2004) Candidates for novel RNA topologies. Journal of molecular biology, 341, 1129-1144.

133. Yan, X. and Han, J. (2002), Proceedings of the 2002 IEEE International Conference on Data Mining. IEEE Computer Society, Maebashi City, Japan, pp. 721.

134. Yan, X. and Han, J. (2003), Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, Washington, D.C.

135. Zaki, M.J. (2002), Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, Edmonton, Alberta, Canada.

136. Jaeger, J.A., Turner, D.H. and Zuker, M. (1989) Improved predictions of secondary structures for RNA. Proc Natl Acad Sci U S A, 86, 7706-7710.

137. Wang, Z. and Zhang, K. (2001), Proceedings of the 26th International Symposium on Mathematical Foundations of Computer Science. Springer-Verlag, pp. 690-702.

138. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics, 25, 25-29.

139. Grate, L., Herbster, M., Hughey, R., Haussler, D., Mian, I.S. and Noller, H. (1994) RNA modeling using Gibbs sampling and stochastic context free grammars. Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB, 2, 138-146.

140. Lowe, T.M. and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research, 25, 955-964.

141. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. Journal of molecular biology, 215, 403-410.

142. Pudlák, P., Rödl, V. and Savický, P. (1988) Graph complexity. Acta Informatica, 25, 515-535.

143. Byun, Y. and Han, K. (2009) PseudoViewer3: generating planar drawings of large-scale RNA structures with pseudoknots. Bioinformatics (Oxford, England), 25, 1435-1437.

144. Darty, K., Denise, A. and Ponty, Y. (2009) VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics (Oxford, England), 25, 1974-1975.

145. Janssen, S., Reeder, J. and Giegerich, R. (2008) Shape based indexing for faster search of RNA family databases. BMC Bioinformatics, 9, 131.

146. Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A. and Eddy, S.R. (2003) Rfam: an RNA family database. Nucleic acids research, 31, 439-441.

147. Weinberg, Z. and Ruzzo, W.L. (2004) Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics (Oxford, England), 20, i334-i341.

148. Weinberg, Z. and Ruzzo, W.L. (2006) Sequence-based heuristics for faster annotation of non-coding RNA families. Bioinformatics (Oxford, England), 22, 35-39.

149. Gupta, A., Rahman, R., Li, K. and Gribskov, M. (2011) Identifying Complete RNA Structural Ensembles Including Pseudoknots. Submitted.

150. Eppstein, D. (1995), Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, San Francisco, California, United States, pp. 632-640.

151. Kukluk, J.P., Holder, L.B. and Cook, D.J. (2004) Algorithm and experiments in testing planar graphs for isomorphism. Journal of Graph Algorithms and Applications, 8, 313-356.

152. Yan, X., Yu, P.S. and Han, J. (2004), Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM, Paris, France, pp. 335-346.

153. Yan, X., Yu, P.S. and Han, J. (2005), Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, Baltimore, Maryland, pp. 766-777.

154. Yan, X., Yu, P.S. and Han, J. (2005) Graph indexing based on discriminative frequent structure analysis. ACM Trans. Database Syst., 30, 960-993.

155. Chen, C., Yan, X., Yu, P.S., Han, J., Zhang, D.-Q. and Gu, X. (2007), Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, Vienna, Austria, pp. 926-937.

156. Williams, D.W., Huan, J. and Wang, W. (2007), Proceedings of 23rd International Conference on Data Engineering. IEEE, Istanbul, Turkey, pp. 976-985.

157. Zhao, P., Yu, J.X. and Yu, P.S. (2007), Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, Vienna, Austria, pp. 938-949.

158. Shang, H., Zhang, Y., Lin, X. and Yu, J.X. (2008) Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. Proc. VLDB Endow., 1, 364-375.

159. Tian, Y. and Patel, J.M. (2008), Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pp. 963-972.

160. Li, K., Rahman, R., Gupta, A., Siddavatam, P. and Gribskov, M. (2008) In Mandoiu, I., Sunderraman, R. and Zelikovsky, A. (eds.), Proceeding of 2008 International Symposium on Bioinformatics Research and Applications. Bioinformatics Research and Applications, Atlanta, GA, Vol. 4983/2008, pp. 317-330.

161. ChemIDplus. http://chem.sis.nlm.nih.gov/chemidplus.

162. Wang, Y., Xiao, J., Suzek, T.O., Zhang, J., Wang, J. and Bryant, S.H. (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic acids research, 37, W623-633.

163. Seiler, K.P., George, G.A., Happ, M.P., Bodycombe, N.E., Carrinski, H.A., Norton, S., Brudz, S., Sullivan, J.P., Muhlich, J., Serrano, M. et al. (2008) ChemBank: a small-molecule screening and cheminformatics resource database. Nucleic acids research, 36, D351-D359.

164. Liu, T., Lin, Y., Wen, X., Jorissen, R.N. and Gilson, M.K. (2007) BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic acids research, 35, D198-201.

165. Eddy, S.R. (2002) Computational genomics of noncoding RNA genes. Cell, 109, 137-140.

166. Eddy, S.R. (2002) A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3, 18.

167. Fera, D., Kim, N., Shiffeldrim, N., Zorn, J., Laserson, U., Gan, H.H. and Schlick, T. (2004) RAG: RNA-As-Graphs web resource. BMC Bioinformatics, 5, 88.

168. Gan, H.H., Fera, D., Zorn, J., Shiffeldrim, N., Tang, M., Laserson, U., Kim, N. and Schlick, T. (2004) RAG: RNA-As-Graphs database—concepts, analysis, and features. Bioinformatics (Oxford, England), 20, 1285-1291.

169. Sonnhammer, E.L., Eddy, S.R., Birney, E., Bateman, A. and Durbin, R. (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic acids research, 26, 320-322.

170. Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Finn, R.D. and Sonnhammer, E.L. (1999) Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic acids research, 27, 260-262.

171. Sigrist, C.J., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., Bairoch, A. and Bucher, P. (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform, 3, 265-274.

172. Washietl, S., Hofacker, I.L., Lukasser, M., Huttenhofer, A. and Stadler, P.F. (2005) Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat Biotechnol, 23, 1383-1390.

173. Pedersen, J.S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad-Toh, K., Lander, E.S., Kent, J., Miller, W. and Haussler, D. (2006) Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol, 2, e33.

174. Giegerich, R., Voss, B. and Rehmsmeier, M. (2004) Abstract shapes of RNA. Nucleic acids research, 32, 4843-4851.

175. Hofacker, I.L., Fekete, M. and Stadler, P.F. (2002) Secondary structure prediction for aligned RNA sequences. Journal of molecular biology, 319, 1059-1066.

176. Zipf, G. (1949) Human behavior and the principle of least effort: An introduction to human ecology. Addison-Wesley Press., Oxford, England.

VITA

Kejie Li

Department of Biological Sciences, Purdue University

Education

B.S., Biological Sciences, 2004, Sichuan University, Chengdu, Sichuan, P.R. China

Ph.D., Biological Sciences, 2011, Purdue University, West Lafayette, Indiana

Kejie Li was born in Chengdu, Sichuan Province, P.R. China, on June 4, 1982. He grew up in his hometown and entered Sichuan University in 2000. While at Sichuan University, he was selected for an educational exchange program and spent his junior year at the University of Washington, Seattle, USA, as a visiting student. He graduated from Sichuan University in 2004 with a bachelor’s degree in Biological Sciences. In the fall of the same year, he was admitted to the Bioinformatics master’s program at Wageningen University, Wageningen, Netherlands. In the fall semester of 2005, he was admitted to the Ph.D. program in the Department of Biological Sciences at Purdue University, West Lafayette, USA, and joined the laboratory of Dr. Michael Gribskov. His research focuses on understanding RNA structure and function relationships. Kejie completed his Ph.D. studies and received his degree in August 2011. He will pursue postdoctoral studies at the Broad Institute, Boston, USA.

PUBLICATIONS

Li, K., Gupta, A., Rahman, R. and Gribskov, M. (2011) RNA structure topological pattern study reveals link between topology and function. (In preparation)

Li, K., Gupta, A., Rahman, R. and Gribskov, M. (2011) Matching unknown RNA structures: RNA XIOS topological pattern database. (In preparation)

Gupta, A., Rahman, R., Li, K. and Gribskov, M. (2011) Identifying Complete RNA Structural Ensembles Including Pseudoknots. (Submitted)

Banks, J.A., Nishiyama, T., Hasebe, M., Bowman, J.L., Gribskov, M., Li, K. et al. (2011) The Selaginella Genome Identifies Genetic Changes Associated with the Evolution of Vascular Plants. Science, 332, 960-963.

Li, K., Rahman, R., Gupta, A., Siddavatam, P. and Gribskov, M. (2008) In Mandoiu, I., Sunderraman, R. and Zelikovsky, A. (eds.), Proceeding of 2008 International Symposium on Bioinformatics Research and Applications. Bioinformatics Research and Applications, Atlanta, GA, Vol. 4983/2008, pp. 317-330.
