Graduate School ETD Form 9 (Revised 12/07)
PURDUE UNIVERSITY GRADUATE SCHOOL
Thesis/Dissertation Acceptance
This is to certify that the thesis/dissertation prepared
By Kejie Li
Entitled A Graph Theoretic Approach for Identifying RNA Structure and Function Relationships
For the degree of Doctor of Philosophy
Is approved by the final examining committee:
Wen Jiang
Michael Gribskov
Daisuke Kihara
Dabao Zhang
To the best of my knowledge and as understood by the student in the Research Integrity and Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy on Integrity in Research” and the use of copyrighted material.
Approved by Major Professor(s): Michael Gribskov
Approved by: Peter J. Hollenbeck, Head of the Graduate Program
Date: 07/25/2011
Graduate School Form 20 (Revised 9/10)
PURDUE UNIVERSITY GRADUATE SCHOOL
Research Integrity and Copyright Disclaimer
Title of Thesis/Dissertation: A Graph Theoretic Approach for Identifying RNA Structure and Function Relationships
For the degree of Doctor of Philosophy
I certify that in the preparation of this thesis, I have observed the provisions of Purdue University Executive Memorandum No. C-22, September 6, 1991, Policy on Integrity in Research.*
Further, I certify that this work is free of plagiarism and all materials appearing in this thesis/dissertation have been properly quoted and attributed.
I certify that all copyrighted material incorporated into this thesis/dissertation is in compliance with the United States’ copyright law and that I have received written permission from the copyright owners for my use of their work, which is beyond the scope of the law. I agree to indemnify and save harmless Purdue University from any and all claims that may be asserted or that may arise from any copyright violation.
Printed Name and Signature of Candidate: Kejie Li
Date (month/day/year): 07/25/2011
*Located at http://www.purdue.edu/policies/pages/teach_res_outreach/c_22.html
A GRAPH THEORETIC APPROACH FOR IDENTIFYING RNA STRUCTURE AND FUNCTION RELATIONSHIPS
A Dissertation
Submitted to the Faculty
of
Purdue University
by
Kejie Li
In Partial Fulfillment of the
Requirements for the Degree
of
Doctor of Philosophy
August 2011
Purdue University
West Lafayette, Indiana
Dedicated to my beloved wife Juan Liao, my great father Changgui Li, and my dear
mother Hanfang He.
以此献给我心爱的妻子:廖娟,我伟大的父亲:李常贵,以
及我亲爱的母亲:何汉芳。
ACKNOWLEDGEMENTS
I would like to express my greatest and most sincere gratitude to my major advisor,
Dr. Michael Gribskov, for his support, patience, understanding and encouragement
during my graduate study at Purdue. I thank him for giving me the freedom and support to
explore the research areas I am interested in. He is always ready to help and has been a
superb mentor throughout the development of my projects. I sincerely appreciate the
time he spent improving my writing of the annual progress reports, the graduate thesis
and publications. To me, he is like a huge library that holds all kinds of sources. I can
retrieve the information I need at any time, and the process can take less time than
searching for it on Google.
I would like to extend my sincere appreciation to the members of my PhD advisory
committee: Dr. Wen Jiang, Dr. Daisuke Kihara and Dr. Dabao Zhang. I truly appreciate all
their support, advice and guidance throughout my graduate study.
Special thanks go to Reazur Rahman and Aditi Gupta, who are my lab mates as well as
RNA group mates. We worked closely and I do enjoy our great teamwork.
Extra thanks go to the past and current members in the Gribskov lab: Hao Jiang, Ying Li,
STRAND The RNA secondary STRucture and statistical Analysis Database
TOS Target Oriented Synthesis
UID Unique ID
VARNA Visualization Applet for RNA
XIOS eXclusive, Included, Overlap and Serial
ABSTRACT
Li, Kejie. Ph.D., Purdue University, August 2011. A Graph Theoretic Approach for Identifying RNA Structure and Function Relationships. Major Professor: Michael Gribskov.
Understanding of structure-function mapping is crucial to the study of the nature of
biopolymers. This mapping can be used to extract information to aid in the prediction of
molecular function based on structural topological patterns. This study presents a graph
theoretical approach for understanding RNA structural topological features, and
revealing the mapping from biological RNA structural topological features to biological
functions. We have built a package that represents ensembles of suboptimal RNA
structures as a graph, the XIOS graph, for easy structural comparison and analysis by an
extended version of the gSpan algorithm. In order to detect structural similarities, the
Neighbor Indexing algorithm has been extended by adding additional RNA structure-
specific information, and introducing the concept of an RNA structural fingerprint, from
a structural descriptor point of view, to represent the topological information of
ensembles of RNA structures. Based on the cIndex feature selection strategy, I have
developed and applied a new feature selection approach for RNA structures which
reveals important structural topological patterns that provide specific information about
the functional class of RNAs. This information can be used to relate RNA structural
patterns to function. In addition, I have developed a novel structure indexing and
database searching method for finding RNAs with similar characteristics (topological
modules).
It is remarkable that even without using RNA primary sequence information RNA
structures can be classified into the correct classes. By combining information from both
sequence and topology, unclassified or misclassified RNAs can be correctly classified and
categorized with high confidence. The structure-based classification described here is
significantly better than sequence-based classification using BLAST (Kolmogorov-Smirnov
test).
CHAPTER 1 INTRODUCTION
1.1 RNA’s double life
The central dogma of molecular biology (2) says that RNA is the biopolymer copied from
DNA (transcription) that serves as the template during protein synthesis (translation). In
other words, RNA is the intermediate message interpreter from heritable DNA genetic
information to various biological functions. This view assumes that it is genes
that encode proteins, and proteins play most of the important biological roles such as
catalytic and regulatory functions. From that point of view, biological complexity is
determined by the number of protein-coding genes.
Two great figures initiated the study of evolution in vitro, the so-called RNA evolution in
the test-tube: Sol Spiegelman and Manfred Eigen. Spiegelman’s serial transfer
experiments with the Qβ assay (3-5) and Eigen’s extensive kinetic study of the
mechanism of Qβ RNA replication (6-8) revealed that the primary sequence and spatial
structure of the same RNA molecule are its genotype and phenotype, respectively. RNA
molecules have to satisfy structural requirements in order to be recognized and
replicated by enzymes (9). Structure alone is not sufficient to infer function, but a
complete understanding of the functional molecule requires information about its
organization in space. It has been suggested that the spatial structure of RNA is a crucial
factor in determining its function.
In the meantime, Carl Woese discovered that RNA forms complex secondary structures,
which suggested, for the first time, that RNA could act as a catalyst (10). Later in the
1980s, Thomas R. Cech (11-13) and Sidney Altman (14,15) separately discovered the
catalytic properties of RNA molecules, making proteins no longer the only biopolymers
with catalytic function. An RNA molecule, therefore, is not just a chemical entity that
carries genetic information and is chemically very similar to DNA but, remarkably, also
possesses catalytic activity as a function executor. Although, when compared with
proteins, RNA molecules, ribozymes (13), have a limited catalytic repertoire, it is more
than sufficient to process genetic information and (self) replicate in a pre-biotic
environment. This gave rise to the idea that RNA molecules could play a bridging role
between the lifeless pre-biotic environment and the beginning of life (16-18), the RNA
world hypothesis. The RNA world provides a possible answer to the long-standing
question: the origin of life. It suggests that versatile RNA, with its abilities for both
storing information like DNA and catalyzing enzymatic reactions like proteins, came first.
RNA-encoded proteins evolved after RNA but before DNA (19).
The traditional definition of RNA secondary structure is based on base-pairing
interactions: Watson-Crick base pairs (A∷U, G⋮⋮C) (20) and wobble base pairs (G::U, I::U,
I::A and I::C) (21), within an RNA molecule. The basic structural elements are stacked
base-pairs (or stems), hairpin loops, interior loops, bulges and multiple loops (Figure
1.2 A). In classical secondary structures there are only nested structural elements, which
means that one base-pairing region must lie completely within the loop of the other
base-pairing region. Early studies (22,23) have found that the catalytic cores of many
ribozymes have uniquely shaped conserved RNA secondary structures that allow them
to perform their catalytic function. More recent studies show that small ribozymes
exhibit a broad range of catalytic activities (24-31) and that RNA catalysis plays essential
roles in the metabolism of cells (32-34). The involvement of RNA in such diverse
catalytic functions gives further support to the RNA world hypothesis.
Traditionally, we believed that the genome is a simple combination of separate genes,
one gene → one protein. Most gene transcripts were thought to be protein-coding, and
only rarely non-coding RNAs (ncRNAs); tRNA and rRNA, which make up the bulk of cellular
RNA, are exceptions. ncRNAs are RNA molecules that perform biological functions
without being translated into proteins. In the course of the recent rapid development of
high throughput techniques, comprehensive large scale transcriptome studies (35-37)
across species as diverse as plants, bacteria and mammals (38-46) have changed our
understanding of RNA. New functions of ncRNAs have been discovered, and ncRNAs and
RNA-based biological processes are now known to exist in all life forms. Gradually, RNA has
been recognized as a central player in cellular regulation (47). In particular, ncRNAs play
active roles in multiple regulatory layers from transcription, to RNA maturation, and
RNA modification to translational regulation (47). The current view is that transcripts are
potentially overlapping and bidirectional, and non-coding transcripts are abundant. In
spite of the importance and ubiquity of ncRNAs, we still know relatively little about
them (48).
1.2 Importance of Pseudoknots
It is widely accepted that RNA functions are mainly determined by RNA structures.
Reciprocal relationships like this demand comprehensive study and analysis of RNA
structures, in order to better understand RNA catalytic and regulatory functions.
1.2.1 Definition of pseudoknot
With regard to RNA structure analysis, I have to mention an important structural
element called a pseudoknot. Compared with nested RNA secondary structures,
pseudoknots are base-paired regions that are only partially nested: bases in the loop
region of one base-paired region pair with bases in a region outside that
base-paired region. Let us consider two stems, S1 and S2, where S1 is formed by
base-paired regions A and B, and S2 is formed by regions C and D. The sequential order of
those base paired regions is A, C, B, D in the RNA sequence. S1 and S2 form a
pseudoknot structure because region C of stem S2 lies inside the “loop” of stem S1, and
region D of S2 is outside of that loop. Such knotted structures were first discovered in
yellow mosaic virus in 1982 (24), and they occur frequently in RNA functional sites and
catalytic cores, often being directly involved in RNA catalytic and regulatory functions.
There are many types of pseudoknots; simple and common ones include the H-type
pseudoknot (classic pseudoknot), kissing pseudoknot, simple recursive pseudoknot and
Archaeal type A (11), and Archaeal type M (3). Secondary structures and pseudoknots
were assigned according to Ellis and Brown (124) and folding diagrams in the RNase P
database entries (125). A complete list of sources is given in Table 1.2.
1.7.1.3 Group I Intron RNA
152 sequences were downloaded from the RNA STRAND database (121). The shortest
and longest 10% of the sequences were removed on the assumption that these were
most likely to be incomplete or poorly annotated. Sequences with greater than 50%
sequence identity were purged leaving 36 sequences ranging from 240 to 602 bases in
length. With one exception, PDB structure 1L8V, for which the RNA structure is assigned
by RNAView (126), stems and pseudoknots were assigned according to expert curation
in the CRW database (127). A complete list of the sequences is given in Table 1.3.
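The two filtering steps described for the Group I intron set (dropping the length tails, then purging similar sequences) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the function names are hypothetical, the purge is a simple greedy pass, and `naive_identity` is a placeholder position-by-position measure standing in for whatever alignment-based identity was actually computed.

```python
from typing import Callable, Dict


def trim_length_tails(seqs: Dict[str, str], fraction: float = 0.10) -> Dict[str, str]:
    """Drop the shortest and the longest `fraction` of the sequences."""
    ranked = sorted(seqs, key=lambda name: len(seqs[name]))
    k = int(len(ranked) * fraction)  # number removed from each tail
    kept = ranked[k:len(ranked) - k]
    return {name: seqs[name] for name in kept}


def naive_identity(a: str, b: str) -> float:
    """Placeholder identity: matching positions divided by the longer length.

    Assumption for illustration only; a real analysis would align first.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def purge_by_identity(
    seqs: Dict[str, str],
    identity: Callable[[str, str], float],
    max_identity: float = 0.50,
) -> Dict[str, str]:
    """Greedy purge: keep a sequence only if it is not too similar to any kept one."""
    kept = []
    for name in seqs:
        if all(identity(seqs[name], seqs[other]) <= max_identity for other in kept):
            kept.append(name)
    return {name: seqs[name] for name in kept}


toy = {"a": "AAAA", "b": "AAAT", "c": "CCCC"}
print(list(purge_by_identity(toy, naive_identity)))  # ['a', 'c']
```

A greedy purge depends on the order in which sequences are visited, so a different ordering can keep a different representative from each group of similar sequences.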
1.7.1.4 tmRNA
632 complete sequences (from 514 species) with structural assignments were obtained
from the tmRNA website (128). Sequences were purged to remove sequences with >40%
sequence identity, leaving 165 sequences. 48 of these sequences contained asterisks,
indicating the absence of some bases, and were removed. The final dataset consists of
117 sequences, with sequence lengths ranging from 230 to 393 bases. The structure
assignments in the tmRNA database are used as the curated structures. A complete list
of sequences is given in Table 1.4.
1.7.2 STRAND dataset
RNA STRAND (The RNA secondary STRucture and statistical Analysis Database) is a
database with a comprehensive collection of known RNA secondary structures
(experimentally solved and computationally predicted) from different organisms. The
datasets collected from STRAND correspond to the four RNA families in our manually
curated dataset: tRNA, RNAseP, Group I intron RNA and tmRNA (Table 1.5).
Compared to the manually curated dataset, the dataset from STRAND is a mixture of
reliable structure data, partial structures, noise, and even misclassified entries.
The manually curated dataset is a clean, high-quality dataset. The STRAND dataset is a
grab bag of structure data, good and bad, which represents the average quality of
other RNA structural databases.
Figure 1.1 Common types of Pseudoknots
Figure 1.2 Common RNA secondary structure representations.
A. Stem-loop diagram depicting RNA secondary structure elements: S stacking base pair (or stem), H hairpin loop, I interior loop, B bulge and M multiple loop. B. Stem-loop diagram with a pseudoknot. C. Circle plot. The backbone nucleotides of the RNA are arranged along a circle, and base pairs are drawn as arcs. D. Dome plot. The backbone nucleotides of the RNA are placed in a line, and base pairs are drawn as arcs. E. Circle plot including a pseudoknot. The pseudoknot structure is indicated by arcs crossing each other. F. Dome plot including a pseudoknot. The pseudoknot structure is again indicated by arcs crossing each other. G. Mountain plot. The x-axis of the mountain plot corresponds to the RNA sequence, and the y-axis shows the number of base pairs in which a specific nucleotide is enclosed. H. Primary sequence (X means any one of the four nucleotides) and its linear dot-bracket representation. A dot indicates a non-base-paired position, and a pair of matching brackets at positions i and j indicates that there is a base pair (i, j) between positions i and j.
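The dot-bracket representation in panel H and the mountain plot in panel G can both be derived with a single pass over the string. The sketch below is illustrative only: the function names are hypothetical, positions are numbered from 1, and only round brackets are handled, so crossing (pseudoknotted) pairs, which need a second bracket type, are out of scope.

```python
from typing import List, Tuple


def parse_dot_bracket(structure: str) -> List[Tuple[int, int]]:
    """Return the base pairs (i, j), 1-based, encoded by a dot-bracket string."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def enclosure_counts(structure: str) -> List[int]:
    """Mountain-plot heights: how many base pairs enclose each position."""
    heights: List[int] = []
    depth = 0
    for ch in structure:
        if ch == "(":
            heights.append(depth)  # an opening base is not enclosed by its own pair
            depth += 1
        elif ch == ")":
            depth -= 1
            heights.append(depth)
        else:
            heights.append(depth)
    return heights


print(parse_dot_bracket("((..))."))  # [(1, 6), (2, 5)]
print(enclosure_counts("((..))."))   # [0, 1, 2, 2, 1, 0, 0]
```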
Figure 1.3 Rooted-labeled tree
Closed circles represent base-pairs, and leaf nodes represent unpaired nucleotides. The root of the tree, box, is a dummy node added as the root of the tree graph which serves as the parent of all nodes in the tree to ensure structures with free end(s) are not represented by a forest (disconnected trees). Stems are rope-like and loops are bush-like.
Figure 1.4 Dot plots
Two types of dot plots are shown here. This is a tRNA example. Center is the stem-loop diagram showing the familiar tRNA cloverleaf. Left. Partition function dot plot is a base pairing binding probability matrix. It is a visualization of the thermodynamics of an ensemble of structures. The color of the dot reflects the negative log of base pair binding probability of the base pair. The order from red to brownish, to green and then to blue is the decreasing order of the probability. Right. Energy dot plot shows, for a specific base pair, the lowest free energy for a structure that ends at this base pair. The order from red to brownish, to green and then to blue is the increasing order of the free energy. Four stem regions are highlighted by the colored boxes. Dot plots are generated by the RNAStructure software (1).
Figure 1.5 RNA tree graph, RNA dual graph and RNA digraph representations.
A is the tRNA secondary structure in its squiggle notation. B is its tree graph representation. Each vertex of the tree graph is a loop region (the 3’ and 5’ ends of a stem are also considered a loop region), and edges represent stems. C is its dual graph representation. Vertices represent stems, and edges are loop regions (the 3’ and 5’ ends of a stem are not a loop in this representation). D is the digraph representation. This is an RNA dual graph with directed edges. The direction of the edges can resolve some ambiguity in representing RNA topologies.
Table 1.1 Manually curated structures.
Dataset              tRNA    RNAseP   Group I intron   tmRNA
Sample size          16      40       36               117
Min graph size       3       11       9                4**
Max graph size       6       26       25               22
Average graph size   5.25    19.55    16.11            16.65
** tmRNA has an outlier, tmRNA/BaciPhage_G; this structure has only 4 stems.
Table 1.2 RNaseP Sequences Used.
RNaseP stems and pseudoknots were assigned based on expert curation in the RNaseP database, and structural types were assigned according to Ellis and Brown (2009). All structural assignments were manually reviewed; in some cases minor adjustments had to be made to the RNaseP database structures to make the labeling of stems consistent across all structures.
Thermomicrobium roseum                 350   C
Aeropyrum pernix                       330   Archaeal type A
Halobacterium cutirubrum               375   Archaeal type A [e]
Halococcus morrhuae                    475   Archaeal type A [f]
Metallosphaera sedula                  304   Archaeal type A
Methanobacterium thermoautotrophicum   293   Archaeal type A
Methanosarcina barkeri                 371   Archaeal type A
Natronobacterium gregoryi              474   Archaeal type A
Pyrococcus abyssi                      330   Archaeal type A
Sulfolobus acidocaldarius              315   Archaeal type A
Sulfolobus solfataricus                311   Archaeal type A [h]
Thermoplasma volcanum [g]              305   Archaeal type A
Archaeoglobus fulgidus                 229   Archaeal type M
Methanococcus jannaschii               252   Archaeal type M
Methanococcus maripaludis              233   Archaeal type M
[a] Also known as Ralstonia or Alcaligenes eutrophus.
[b] Labeling of stems does not fall easily into the standard scheme due to a second stem coming off of L15.
[c] This structure is midway between B1 and B2, having stem 10.1 and P9 but lacking P19. It also has an extra pseudoknot between L9 and the region before P20.
[d] Clearly A type due to the presence of P6, P13 and P14 and the lack of P15.1. Has an additional stem (annotated in this work as P16.1) coming off of L15.
[e] This structure is difficult to label due to three stems branching from P12. In this work these were annotated as P12.1 - P12.3. The structure given for the P15-P17 region may not be correct.
[f] The RNaseP database annotated structure may not be entirely correct.
[g] RNaseP database diagram labelled T. volvanum.
[h] No RNAML file available; structure annotated based on the .ct file and structure diagram.
The datasets collected here correspond to the four RNA families in our manually curated dataset: tRNA, RNAseP, Group I intron RNA and tmRNA.
Dataset              tRNA (> 50 nt)   RNAseP (100 - 300 nt)   Group I intron (100 - 300 nt)   tmRNA (100 - 300 nt)
Sample size          601              36                      21                              30
Min graph size       3                5                       7                               8
Max graph size       6                22                      15                              23
Average graph size   4.13             11.47                   11.05                           16.71
CHAPTER 2 PATTERN MATCHING IN RNA STRUCTURES¹
2.1 Introduction
RNA molecules perform a variety of important biological functions in addition to
carrying information from the chromosome to the ribosome, or acting as structural
scaffolds. Catalytic RNAs play key roles in translation, RNA processing and splicing, and
gene regulation (36). Motifs that are important for RNA function are structural and
correspond to base-paired regions of secondary structure, which in turn, provide the
scaffold for the three-dimensional fold of the RNA (129,130). RNA sequences that have
the same structural motifs may have sequences that are impossible to align because
they have no detectable sequence similarity.
While programs that predict RNA secondary structure have been available since the
1980s, RNA structure prediction is handicapped by both biochemical and computational
limitations. Firstly, RNA exists as an ensemble of rapidly interconverting structures.
Protein structures (usually) show relatively minor fluctuations from a single minimum
free-energy state. The case is much different for RNA, where there are usually many
structures with similar free energies; these structures may be distinctly different in
terms of base-pairing (67,97). Secondly, while we know that pseudoknot structures are
very important in RNA structure and catalytic function (49), it remains difficult to
reliably predict pseudoknotted structures. This is due both to our incomplete
understanding of the energetics of pseudoknot formation and to the
computational time complexity. The most efficient pseudoknot prediction algorithms,
e.g., pknotsRG, run in O(n^4) time for certain classes of RNAs (94), but achieve this by
placing significant limitations on which structures can be found. The memory complexity of
RNA structure prediction is O(n^2), where n is the length of the RNA sequence, which
usually ranges from 10,000 to 100,000 bases for primary RNA transcripts.
¹ This is the paper published in the Proceeding of 2008 International Symposium on Bioinformatics Research and Applications (ISBRA2008). My contributions were participating in the algorithm and experiment design, implementing the algorithm, analyzing the data and results, making figures and writing the manuscript.
Full reference: Li, K., Rahman, R., Gupta, A., Siddavatam, P. and Gribskov, M. (2008) In Mandoiu, I., Sunderraman, R. and Zelikovsky, A. (eds.), Proceeding of 2008 International Symposium on Bioinformatics Research and Applications. Bioinformatics Research and Applications, Atlanta, GA, Vol. 4983/2008, pp. 317-330.
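To give a feel for what quadratic memory means at the transcript lengths quoted above, the following back-of-the-envelope calculation sizes a single n × n table. The 8 bytes per cell is an assumption made for illustration (one double-precision value per cell); actual prediction programs may store more or less per cell, or keep only part of the table.

```python
def table_bytes(n: int, bytes_per_cell: int = 8) -> int:
    """Memory for one full n x n table, assuming `bytes_per_cell` bytes per entry."""
    return n * n * bytes_per_cell


for n in (10_000, 100_000):
    gigabytes = table_bytes(n) / 1e9
    print(f"n = {n:>7,}: {gigabytes:g} GB")
# n =  10,000: 0.8 GB
# n = 100,000: 80 GB
```

A tenfold increase in sequence length multiplies the table size by one hundred, which is why whole primary transcripts quickly become impractical to fold in one piece.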
In biology, functionally important features can often be recognized because they are
conserved over evolutionary time. A common approach is to obtain a set of sequences
using some biological criterion (such as similarity of regulation), and use pattern
recognition methods to identify unusually conserved features. Searching for sequence
motifs (approximately common substrings) in this way has been a powerful tool for
analysis of DNA and proteins; this approach does not work as effectively with RNA
because conserved RNA structures may have no detectable sequence similarity. And
while great progress has been made, it remains difficult to accurately predict minimum
free energy (MFE) structures for RNA sequences. To further complicate the picture, RNAs exist as
ensembles of structures, in addition to the MFE structure, that are constantly
interconverting and fluctuating. The biologically important structures (those that are
conserved over evolutionary time) may be present only transiently, or as minor
components of this structural ensemble. The problem is further complicated by the fact
that biology is messy; one can rarely get completely clean sets of sequence data in
which every sequence actually contains the structure of interest. This makes many
approaches unfeasible. In addition, in biological systems, conservation is only
approximate; no set of structures will exactly match.
We are building a system that allows one to find the greatest approximately conserved
structure(s) in a set of RNA sequences, in the presence of extraneous sequences that do
not share a common structure. This conserved common structure can then be used as
the basis for hypotheses about the importance of the structure in the biological
functioning of the RNA. These hypotheses can be tested either experimentally or by
further computational work.
We convert RNA structures to a graph representation that specifically includes
pseudoknots and is capable of representing an ensemble of RNA structures in a single
graph. Computationally, finding conserved structures corresponds to finding the
greatest approximately isomorphous subgraphs in a set of graphs, where each graph
represents a single RNA sequence. We use modifications of existing maximal subgraph
isomorphism algorithms to identify the similar portions of the graphs, and propose to
combine this with constrained MFE structure prediction tools (131), and a database
search capability.
Graph theoretical approaches have previously been applied to RNA structures (74,132),
but our approach differs significantly. The XIOS approach introduces the ability to
represent ensembles of structures, and emphasizes the topology of stems. Our
approach is most similar to that of Gan et al., but focuses on stem topologies rather
than the topology of loops and bulges (74). The XIOS approach also allows structural
motifs to be exactly matched without using heuristics (132).
2.2 XIOS RNA graphs
In this section, we describe the graph framework that we have developed to represent
ensembles of RNA structural topologies. We introduce the XIOS RNA graph
representation for RNAs, and discuss extensions to existing subgraph isomorphism
algorithms as they apply to XIOS RNA graphs.
2.2.1 Definition
XIOS RNA graphs represent ensembles of RNA structural topologies. In XIOS graphs,
each base-paired stem is represented by a vertex, and the edges connecting the vertices
indicate the topological relationship between the stems. Topologically, two stems can
be eXclusive (X, i.e., both cannot simultaneously form because they use the same
sequence ranges), Included (I, i.e., one is nested within the loop of the other; I
indicates the direction of the edge with respect to the higher numbered vertex and J
indicates the opposite direction), Overlapping (O, i.e., the stems have a pseudoknot
relationship) or Serial (S, i.e., adjacent, non-overlapping stem and loop structures)
(Figure 2.1). Each pair of vertices is related by one and only one X, I, O or S relationship.
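The four relationships can be decided from the sequence coordinates of the two stems alone. The sketch below is an illustration, not the Perl implementation described in this chapter: the `Stem` type and the `xios_relation` function are hypothetical names, coordinates are inclusive positions, and the choice to return I when the first stem encloses the second (and J for the reverse) is an assumption made for the example.

```python
from typing import NamedTuple


class Stem(NamedTuple):
    """A stem as two half-stems, each an inclusive range of sequence positions."""
    left_start: int
    left_end: int
    right_start: int
    right_end: int


def _ranges_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    return a_start <= b_end and b_start <= a_end


def xios_relation(s: Stem, t: Stem) -> str:
    """Classify the topological relationship between stems s and t."""
    s_halves = [(s.left_start, s.left_end), (s.right_start, s.right_end)]
    t_halves = [(t.left_start, t.left_end), (t.right_start, t.right_end)]
    # eXclusive: the stems compete for the same bases.
    if any(_ranges_overlap(*h, *k) for h in s_halves for k in t_halves):
        return "X"
    # Included: one stem lies entirely inside the loop of the other.
    if s.left_end < t.left_start and t.right_end < s.right_start:
        return "I"  # assumed convention: s encloses t
    if t.left_end < s.left_start and s.right_end < t.right_start:
        return "J"  # assumed convention: t encloses s
    # Serial: one stem ends before the other begins.
    if s.right_end < t.left_start or t.right_end < s.left_start:
        return "S"
    # Otherwise the half-stems interleave (order A, C, B, D): a pseudoknot.
    return "O"


s1 = Stem(1, 5, 20, 24)
print(xios_relation(s1, Stem(7, 9, 15, 17)))    # I
print(xios_relation(s1, Stem(10, 13, 30, 33)))  # O
print(xios_relation(s1, Stem(40, 43, 50, 53)))  # S
print(xios_relation(s1, Stem(3, 6, 12, 15)))    # X
```

Once exclusive pairs are ruled out, the four half-stems are disjoint, so only the nested, serial and interleaved orderings remain; this is why the final case can be returned as O without a further test.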
2.2.2 Training data
We have developed Perl packages that translate Vienna RNA format (85) and the
MFOLD (83) connect format into XIOS graphs. Because the predicted MFE structure is
only one structure in a structural ensemble, we enumerate all energetically favorable
short stems and label the entire set as X, I, O, and S, as described above. The graph is
therefore an image of the entire structural ensemble. Our test datasets are described in
Table 2.1. Highly similar sequences with sequence identity >40% are removed from the
dataset to avoid selection bias.
2.2.3 DFS Lexicographical ordering
DFS (Depth-First Search) lexicographical ordering was originally developed by Yan et al.
(133,134) in their gSpan algorithm for identifying common chemical structures in
chemical datasets. In the chemical structure case, both the vertices (atoms) and edges
(bonds: single, double and triple) are labeled, and all edges are undirected. gSpan is a
powerful search algorithm that reduces the search space for isomorphous subgraphs
using a clever DFS preordered search tree.
The traversal order of edges and vertices in the DFS of a graph can be canonically
ordered. This is called the DFS tree, or when serialized, the DFS code. Yan et al. proved
that graphs with the same DFS code are, by definition, isomorphous. Lexicographic rules
provide an unambiguous best order to the canonical DFS code (133).
The direct path from the first traversed vertex (root) to the most recently added vertex
(right-most vertex) is the right-most path. The extension of DFS graphs by edge growth
is restricted to extension from the rightmost path, similarly to the approach of
TreeminerV (135). Graphs are extended in the following order: edges to existing vertices
(backward edges), edges to new vertices extending from the right-most vertex, and
extension from internal vertices on the right-most path. An intrinsic property of the DFS
lexicographical ordering is that it creates a preorder that can be used to efficiently
explore the search tree when searching for isomorphous subgraphs. Isomorphic forms
of a graph fall in different positions in the search tree, but the canonical DFS
representation of a particular isomorph is guaranteed to be found first. Hence, the
lexicographically first instance of an isomorph in the search tree is its minimum
representation or canonical labeling and other instances can be efficiently pruned. Each
edge in the DFS code is described by a 3-tuple, (v_i, v_j, l_{i,j}), where v_i and v_j are two
connected vertices and l_{i,j} is the label of the edge. Figure 2.3 shows how the canonical
labeling can easily be identified using lexicographic rules even though many different
DFS codes are possible. There are two additional rules that prune the search space.
Firstly, if the initial edge of a minimum DFS code is type e_0, then no following edge can
have a lexically smaller edge label, and secondly, for any backward edge growth to v_j, an
edge cannot be lexically smaller than any edge that is already connected to v_j or v_rightmost
(133). Each distinct mapping of vertices to a DFS code is the support for that potential
solution. Since many such mappings are possible, each graph may have multiple supports
for a DFS code. As a simple example, Figure 2.2 shows the XIOS graph for a tRNA,
according to the experimentally determined 3-dimensional structure (PDB ID: 1EHZ).
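The idea of a canonical labeling, one minimum code shared by every isomorphic form of a graph, can be shown with a much simpler stand-in. The sketch below is not the gSpan DFS code: it takes the lexicographically smallest edge list over all vertex renumberings, which costs n! and is only practical for a handful of stems. The function names and the edge encoding are assumptions for illustration: each edge is keyed by (u, v) with u < v, and its label is read from the lower numbered vertex, so renumbering that reverses an edge swaps I and J.

```python
from itertools import permutations
from typing import Dict, Tuple

Edges = Dict[Tuple[int, int], str]  # (u, v) with u < v  ->  "X", "I", "J", "O" or "S"

FLIP = {"I": "J", "J": "I"}  # directed labels swap when an edge is reversed


def relabel(edges: Edges, perm: Tuple[int, ...]) -> Tuple[Tuple[int, int, str], ...]:
    """Renumber vertices by `perm` and return the sorted edge triples."""
    out = []
    for (u, v), label in edges.items():
        pu, pv = perm[u], perm[v]
        if pu > pv:  # keep the lower numbered vertex first
            pu, pv = pv, pu
            label = FLIP.get(label, label)
        out.append((pu, pv, label))
    return tuple(sorted(out))


def canonical_code(n: int, edges: Edges) -> Tuple[Tuple[int, int, str], ...]:
    """Brute-force canonical form: the minimum code over all n! renumberings."""
    return min(relabel(edges, perm) for perm in permutations(range(n)))


def isomorphic(n1: int, e1: Edges, n2: int, e2: Edges) -> bool:
    return n1 == n2 and canonical_code(n1, e1) == canonical_code(n2, e2)


# Stem 0 encloses stems 1 and 2, which are serial to each other.
g1 = {(0, 1): "I", (0, 2): "I", (1, 2): "S"}
# The same topology with the enclosing stem numbered 2.
g2 = {(0, 1): "S", (0, 2): "J", (1, 2): "J"}
# Different topology: the two enclosed stems form a pseudoknot.
g3 = {(0, 1): "I", (0, 2): "I", (1, 2): "O"}

print(isomorphic(3, g1, 3, g2))  # True
print(isomorphic(3, g1, 3, g3))  # False
```

The DFS lexicographic ordering reaches the same goal without enumerating every renumbering, which is what makes pruning of the search tree possible.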
2.2.4 Enumeration of N-stem structures
Every RNA structure can be represented by a XIOS graph. For n stems, the upper bound
on the number of possible structures², N, can be calculated by Equation (1),
N = \frac{(2n)!}{2^{n} \cdot n!}    (1)
For example, there is only one possible one-stem structure, two possible two-stem
structures, and 10 possible three-stem structures, but only eight unique structures
(Table 2.2). Figure2.3 shows the XIOS graphs for the eight unique structures that can be
formed from three stems. The other two three-stem structures are either redundant or
physically impossible.
Footnote 2: For the n-stem case, there are 2n half stems. We assign integer labels to each half stem from 1 to 2n. By definition, the first half stem is labeled 1, and there are 2n-1 possible half stems that can pair with the first half stem; the third half stem has only one possible label (2 or 3), and there are 2n-3 possible half stems that can pair with this half stem, and so on. The upper bound on the number of possible n-stem structures is therefore: (2n-1)*(2n-3)*(2n-5)*…*5*3*1.
2.3 Greatest conserved structures
2.3.1 Extension of the gSpan algorithm
XIOS graphs differ in several ways from the chemical structure graphs considered by
Yan and Han. XIOS graphs
• have both directed and undirected edges. I edges are directed because it is
highly important whether a stem is nested within or outside another stem. X, O,
and S edges are undirected.
• do not have vertex labels. Because every vertex is simply an anonymous
elemental stem, no labels are available.
The use of unlabeled vertices with the gSpan algorithm is fairly straightforward, but
results in a decreased ability to rapidly prune the search tree. Directed edges are a little
more difficult to accommodate because the direction of the edge depends on the vertex
from which one looks. The simplest approach is to label the edge as either I or J from the
point of view of the lowest numbered vertex. I and J are treated as lexicographically
distinguishable edges.
In the original application of gSpan to chemical structures, Yan and Han were interested
in identifying frequently occurring chemical substructures. In their case, structures that
occur many times in a single graph are equally interesting. The case of RNA differs;
motifs that occur in multiple graphs (molecules), rather than many times in a single
graph (molecule), are considered more important. In addition, the presence of
incorrectly classified sequences, i.e., sequences that have no common structure, means
that not all graphs will support the biologically relevant subgraph. For XIOS graphs,
therefore, support is calculated as the number of graphs that contain a subgraph,
rather than the total count of matching subgraphs.
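The difference between the two notions of support can be illustrated with a minimal sketch. The match counts and graph names below are hypothetical and serve only to show the calculation; they are not results from this work.

```python
from typing import Dict


def graph_support(embeddings_per_graph: Dict[str, int]) -> int:
    """Support as used for XIOS graphs: number of graphs with at least one match."""
    return sum(1 for count in embeddings_per_graph.values() if count > 0)


def total_occurrences(embeddings_per_graph: Dict[str, int]) -> int:
    """Alternative notion: total number of matching subgraphs over all graphs."""
    return sum(embeddings_per_graph.values())


# Hypothetical number of matches of one candidate subgraph in four RNA graphs.
counts = {"rna_a": 3, "rna_b": 1, "rna_c": 0, "rna_d": 5}
print(graph_support(counts), total_occurrences(counts))
```

Under the per-graph definition, a subgraph that matches many times in one molecule contributes no more than a subgraph that matches once.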
2.3.2 Graph matching algorithm (similar to gSpan; see footnote 3)
begin: For a XIOS graph G with edges eG
I. Sort edges in eG by edge type, eG ∈ {X, I, O, S}
II. For each edge type E
    1. Find all lexicographically minimal one-edge subgraphs, S, from the given XIOS graphs;
    2. For each edge e in S
    3. Do Subgraph_mining(G, S, e):
        i. If the graph is NOT a minimum graph according to DFS lexicographical order, return;
        ii. Generate all potential children with one edge growth, enew
        iii. If support for each child is above threshold,
             recursively call Subgraph_mining with updated edge list (G, S+, enew)
    4. Remove all edges of edge type E from G after all descendants have been searched
    5. If eG = Ø, break;
end.
2.3.3 Greatest conserved structure(s) in a set of RNAs
Many computational approaches use pairwise or multiple DNA or protein sequence
alignments to find conserved motifs, but this is generally impossible with RNA
sequences because they lack sequence conservation and because unambiguously
correct alignments are difficult to obtain. However, secondary and higher order
structures in RNA are conserved, so matching the topology of two RNA structures with a
graph matching approach can identify conserved motifs that cannot be seen in the
sequences. The pre-ordered DFS search of gSpan provides an effective
solution to this problem.
Footnote 3: Adapted, with minor modification, from (133): Yan, X. and Han, J. (2002), Proceedings of the 2002 IEEE International Conference on Data Mining. IEEE Computer Society, Maebashi City, Japan, pp. 721.
The time complexity for the worst case of this algorithm is suggested to be O(kn)
(133,134), where k is the maximum number of subgraph isomorphisms existing between
the two graphs and n is the size of the greatest common match. Figure 2.4 shows the
application of the XIOS graph approach to the structure of S. cerevisiae and H. sapiens
RNase P.
2.3.4 Characteristics of biological graphs
The graph isomorphism approach is limited by the size of the graphs. We examined
sequences from snoRNA, 5S rRNA, microRNA, tRNA, and RNase P (See Appendix for
details) to determine how the number of stems varies with sequence length in biological
RNAs. The sequences were obtained from online databases (Table 2.1) and their
predicted MFE structures were obtained using the RNAsubopt program of the Vienna
RNA package (97). Predicted MFE structures were also obtained for random sequences
in a similar fashion. Random sequences were obtained by randomizing the order of
bases in the corresponding biological sequences, thus preserving the base composition
and sequence length.
Figure 2.5 shows that the number of stems increases roughly linearly with
sequence length. This rapid increase in the number of stems is due to the intricately
folded structures of the RNAs. This observation further necessitates the development of
an efficient system for searching biologically relevant structural patterns in RNA. It is
notable that the biological RNAs and random RNAs have very similar numbers of structures.
As one can see in Figure 2.6, stem structures in biological RNAs are predominantly less
than ten base-pairs long.
2.4 Future directions
The number of stem structures in an RNA MFE structure can be very large (Figure 2.5);
the total number of possible stems, however, grows quadratically with the length of the
sequence. If one assumes that stem-loop structures require on average 24 bases, the
number of possible stems would be something like (SequenceLength/24)^2. For a relatively
short 10 kb mRNA sequence this would lead to graphs with over 150,000 vertices. Our
ultimate goal is to analyze 10-20 sequences of much longer length (many biological
RNAs are over 100,000 bases long), a daunting problem. There are a number of
approaches that can be used to reduce the size of the problem. These include
preprocessing the structure to include only the most interesting stems (rather than all
possible stems), the application of graph contraction methods, and the introduction of
vertex labels.
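As a worked check of the estimate above, taking a sequence length of 10,000 bases and the assumed average of 24 bases per stem-loop:

```latex
\left(\frac{10\,000}{24}\right)^{2} \approx (416.7)^{2} \approx 1.7 \times 10^{5}
```

This is consistent with the figure of over 150,000 vertices quoted for a 10 kb sequence.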
2.4.1 Graph preprocessing
While the most biologically interesting RNA structure need not be the minimum free
energy (MFE) structure, it is likely that the important structures are close to the MFE
(136). This follows from the Boltzmann relationship, which indicates that the relative
frequency of a given structure in the structural ensemble depends on its energy. Rather
than identifying all short energetically favorable stems, we can greatly reduce the size of
the problem by including only stems that participate in a structure within some energy
interval, ∂, from the predicted MFE structure. The total number of stems can be
controlled by altering ∂; ∂=0 produces the MFE structure.
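The Boltzmann relationship referred to above can be written out as a short worked example. The temperature of 37 °C (T = 310.15 K) is an assumption made here for illustration; with R = 1.987 × 10^-3 kcal mol^-1 K^-1 this gives RT ≈ 0.616 kcal/mol.

```latex
\frac{p(s)}{p(s_{\mathrm{MFE}})} = \exp\!\left(-\frac{E(s) - E_{\mathrm{MFE}}}{RT}\right)
                                 = \exp\!\left(-\frac{\partial}{RT}\right),
\qquad
\partial = 1\ \text{kcal/mol} \Rightarrow \approx 0.20,
\qquad
\partial = 3\ \text{kcal/mol} \Rightarrow \approx 0.008
```

Under this assumption, a structure 3 kcal/mol above the MFE is already more than a hundredfold less populated than the MFE structure, which is why a modest energy interval may capture most of the relevant stems.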
2.4.2 Reduction of graph complexity
Graph contraction reduces graph complexity by pruning irrelevant vertices and edges.
There are a number of different approaches one can take to pruning XIOS graphs. Firstly,
as we pointed out above, one can simply discard the S edges; since there are exactly
four edge types and each pair of vertices has exactly one edge, only three edge types
need be used. Secondly, we can place limits on the construction of edges of other types,
especially of I edges. One of the advantages of the XIOS representation is that nested
stems, represented by I edges, have an edge with every other stem in which they are
included. This embedding can be many levels deep, generating a huge number of highly
connected vertices. This is a great advantage because it obviates the need for
introducing gaps (137) which make the matching problem much more complex (and ad
hoc since there is no way to determine correct gap parameters). We postulate that we
would lose little matching power if the depth of I edge nesting were limited to a fixed
depth such as four. This would still permit extraneous stems to be easily omitted but
greatly reduce the number of edges in the graphs. Finally, because we can enumerate all
possible XIOS structures with a fixed number of stems, we can create a dictionary of
these substructures and condense the graphs to a smaller number of vertices based on
this dictionary, at the same time converting the unlabeled vertices to labeled vertices
(the labels then correspond to the dictionary structures).
2.4.3 Adding labels
The dictionary strategy, described above, faces difficulties since the isomorphous
structure of interest is buried in a huge field of random noise. If the dictionary based
labels are dominated by the non-matching (noise) portion of the graph, the re-encoded
graph will lose the information needed to match to other graphs (e.g., if the dictionary
structures overlap but do not exactly correspond to the interesting conserved
structures). A similar strategy, unique to the XIOS graph, is to examine all three-vertex
triangles, of which there are a strictly limited number of types due to the limitations
both of the graph and of the biochemistry of RNA, and replace each triangle with a
corresponding labeled vertex. Triangles may share one or two edges which can be
incorporated as an extended set of edge labels. Such graphs would be modestly smaller,
but much more heavily labeled, greatly increasing the search speed. At the same time,
little information is lost since the original graph can be almost completely reconstructed
from the triangle-condensed graph.
2.4.4 Motif identification tool
RNAs that interact with specific molecules, such as proteins, generally have common
topological motifs. For example, in alternative splicing the donor, acceptor, and branch
point all have specific conserved structures important in recognition and catalysis. Such
conserved structures, when identified in molecules of unknown function, immediately
generate experimentally testable hypotheses. Once motifs are identified, they can be
used to search for additional sequences that could form the same structure. This
provides a means both for statistically evaluating the significance of the structural motif
and for validating matches by examining them for biological similarities, e.g., by
comparing the GO annotations (138) of the sequences. A number of approaches may be
suitable for this, including stochastic context free grammars (SCFG) (139) which are
frequently used to identify RNA structures based on biological knowledge (140).
2.4.5 Database search tool
For searching of large databases, SCFGs are likely to be too slow. We are developing a
fast database search tool for RNA motifs. Since we can enumerate all possible XIOS
graphs for structures of up to 7 or 8 stems (hundreds of thousands) we believe that
we can use the enumerated structures to prescreen graphs in much the same way that
BLAST (141) uses identically matching words. This is closely related to the dictionary
concept introduced above. Because matching to the enumerated structures in the
dictionary can be precalculated, we plan to develop a fast system based on the
observation that one need not do the complete isomorphous subgraph search if two
sequences share no dictionary motifs, and that if they do, the isomorphism search can
be seeded by the matching motifs. Such a search tool would allow users to both extend
and validate motifs found through subgraph isomorphism matching, and would also
provide a means to functionally classify unknown RNAs. RNA is still rather poorly
understood and such an approach will be of great use in identifying novel structural and
functional motifs.
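The prescreening idea described above can be sketched as follows. This is only an illustration of the control flow, assuming that each sequence has a precalculated set of dictionary motif identifiers; the function name, the motif identifiers, and the stub standing in for the full isomorphism search are all hypothetical.

```python
from typing import Callable, Optional, Set


def prescreen_then_match(
    motifs_a: Set[str],
    motifs_b: Set[str],
    full_search: Callable[[Set[str]], object],
) -> Optional[object]:
    """Run the expensive search only when the two motif sets intersect.

    motifs_a and motifs_b are assumed to be precalculated dictionary motif IDs.
    full_search is a placeholder for a seeded isomorphous subgraph search.
    """
    shared = motifs_a & motifs_b
    if not shared:
        return None  # no shared dictionary motifs: skip the full search
    return full_search(shared)  # the shared motifs act as seeds


# Hypothetical usage with a stub in place of the real search.
seeded = prescreen_then_match({"m1", "m7"}, {"m7", "m9"}, lambda seeds: sorted(seeds))
skipped = prescreen_then_match({"m1"}, {"m9"}, lambda seeds: sorted(seeds))
print(seeded, skipped)
```

The set intersection plays the role that identical word matches play in BLAST: it is cheap, and it both rejects unrelated pairs and supplies starting points for the detailed comparison.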
Because RNA structures are relatively degenerate, it is likely that a post-processing
system will be needed to identify the most interesting possible structures out of a large
number of possibilities. This issue is similar to the problem of relevance ranking in web
indexing. In sequence comparisons, statistical probability calculations are commonly
used as a relevance ranking mechanism, and this may be possible in the XIOS system;
we anticipate that the distribution of maximal matching structures will follow an
extreme value distribution. Any two large RNAs, however, will have common structures
that are almost completely trivial: they will match as a long series of serial stems. This is
generally not biologically interesting, suggesting that there is a notion of biological
complexity which can be used as a relevance ranking function. This biological notion of
complexity may or may not correspond to mathematical notions of graph complexity
(142). Another possible relevance function would be to choose only structural motifs
that can form near-MFE predicted structures using a constrained folding approach
(motif stems are constrained to base-pair in the predicted structure) such as are
available in MFOLD and the Vienna RNA package.
The XIOS graph representation has great promise for identifying biologically interesting
structural motifs in RNA based on sequence alone. Constructing a sufficiently fast motif
search system will allow RNA studies to take advantage of the same bootstrap process
that is commonly used for DNA and protein sequences, namely 1) identify biologically
related sequences, 2) identify statistically significant structural motifs, 3) use structural
motifs to identify additional candidate sequences (iterating to convergence), and 4) use
the structural motif as a basis for laboratory experiments.
Figure 2.1 XIOS definition.
Relationships (edges) are defined as X (exclusive), I (included), O (overlapping), and S (serial). I indicates the direction of I edges with respect to the higher numbered vertex and J indicates the opposite.
Figure 2.2 tRNA 3D structure and corresponding XIOS graph representation.
I.A, 3-D structure of tRNA (PDB ID: 1EHZ). I.B, the simple three-leaf clover shape of tRNA is shown, where the acceptor stem, D-arm, anticodon-arm, and T-arm are represented by vertices 0, 3, 2 and 5 respectively. Vertex 1 represents an interaction between the D-loop and a region between the D-arm and acceptor-arm, and vertex 4 represents an interaction between the D-loop region and the region between the anticodon-arm and T-arm. In the XIOS representation (I.C), vertex 1 is included in the acceptor stem and overlaps with the D-arm, vertex 4 overlaps with the D-arm, and the anticodon-arm is included in vertex 4. II a, b, and c show the sequential extension of the DFS graph, and II d shows the minimum DFS tree and corresponding DFS code. At each stage of graph extension, all the possible extensions are shown in dotted lines. For each edge extension, only the canonical graph (shown by dotted ellipse) is used in the next stage.
Figure 2.3 Unique three-stem XIOS graphs, including pseudoknots.
Fifteen XIOS graphs with three vertices are possible. Three of them, in the first row, are not true three-stem topologies (at least one of the stems has only S relationships with other stems); the other four three-stem structures are either redundant or physically impossible.
Figure 2.4 Identification of the common structure in S. cerevisiae and H. sapiens RNase P RNA.
Left panel (top) shows the secondary structure of the S. cerevisiae RNase P RNA. Each stem is labeled with a capital letter A-L. Left panel (bottom) shows the XIOS graph. I edges are shown as single lines and O edges as double lines. Right panel shows the secondary structure (A-N) and XIOS graphs for a single human RNase P RNA. In both panels, matching secondary structures are enclosed by boxes and the uniquely matching part of the XIOS graphs is shown in dark lines. Dotted lines in the XIOS graphs indicate where there are multiple mappings between stems H and I of the S. cerevisiae structure and the human structure; these multiply mapped stems are also indicated by arrows in the secondary structure diagrams. The right panel shows two of the mappings as an example.
Figure 2.5 Correlation between number of stems and sequence length.
Number of stems in biological (♦) and randomized (×) RNA sequences versus sequence length. The number of stems increases roughly linearly with sequence length. Each biological sequence was permuted to generate a corresponding random sequence, preserving the sequence length and base composition.
[Figure 2.5 plot panels: miRNA, snoRNA, RNase P, tRNA, 5S rRNA. x-axis: Sequence Length (bases); y-axis: Number of Stems.]
Figure 2.6 Length of RNA stem structures in biological RNAs
microRNA miRNA S http://microrna.sanger.ac.uk/sequences/index.shtml
5S rRNA Database S http://biobases.ibch.poznan.pl/5SData/
rRNA RDP II A, S http://rdp.cme.msu.edu/index.jsp
RNase P RNase P Database C http://www.mbio.ncsu.edu/RNaseP/
snoRNA snoRNABase S http://www-snorna.biotoul.fr/
snoRNA Plant snoRNA
tRNA GtRNAdb V http://lowelab.ucsc.edu/GtRNAdb/
tmRNA tmRNA A, S http://www.indiana.edu/~tmrna/
Noncoding RNA ncRNA Database S http://biobases.ibch.poznan.pl/ncRNA/
All Pseudobase V http://biology.leidenuniv.nl/~batenburg/PKB.html
All RNAbase S http://www.rnabase.org/
All Rfam A, S http://www.sanger.ac.uk/Software/Rfam/index.shtml
All RNAfold/MFOLD C, V Installed on local server
Table 2.2 Number of possible RNA topologies for different numbers of stems, N.
In the enumerated graph results, there are isomorphic graphs (redundant structures).
Unique topologies are the remaining graphs after removing isomorphic graphs.
In order to understand and explore the RNA topology space, I have developed a
systematic way to efficiently enumerate all small XIOS graphs which are physically
possible. An n-vertex XIOS graph corresponds to an n-stem structure. Based on the
results shown in Figure 2.5 and Figure 2.6, the average size of RNA stems is ~20 nt (not
counting loop regions).
For an n-stem structure, there are 2n half-stems (each stem has two base-paired regions,
and each region is called a half-stem). In enumerating the possible structures, we assign
labels to the 2n half-stems such that the label on the left half-stem of each stem is lower
than all labels on the left half-stems to its right, and the label on each right half-stem is
higher than all labels on the right half-stems to its left. By definition, the label of the first
half-stem is 1, and there are 2n-1 possible regions that can pair with half-stem 1. The
half-stem chosen to pair with half-stem 1 is labeled 2. At this point, one stem (1, 2) is
formed, with the two half-stems 1 and 2 paired. Here I have defined the sequence
direction from labeled half-stem 1 to half-stem 2 as the positive direction, and the
opposite direction as the negative. For the third half-stem, there are three possible
locations where it can be placed relative to half-stem 1 and half-stem 2: (A) in
the negative direction from half-stem 1, (B) between half-stem 1 and half-stem 2, and (C)
in the positive direction from half-stem 2. The directionality chosen here is arbitrary.
Cases A and C are symmetric, which means they lead to redundant structures. From
now on, we consider only the positive direction cases (B and C). If the third half-stem is
between half-stem 1 and half-stem 2, by definition, the third half-stem should be
labeled as half-stem 2 (the half-stem previously labeled as half-stem 2 is now assigned
label 3). Otherwise, the third half-stem is in the positive direction from half-stem 2, and
the third half-stem is then labeled 3. As a result, in a unique structure the third half-stem
can have only one possible label (2 or 3), and there are 2n-3 possible half-stems
that can pair with this half-stem, and so on (Figure 3.1 A).
As described in section 2.2.4 the upper bound of the number of possible n-stem
structures is therefore: (2n-1)*(2n-3)*(2n-5)*…*5*3*1 = $\frac{(2n)!}{2^{n} \cdot n!}$. The final structure is
stored in a format I call paired format (Figure 3.1 B). Each pair of matching parentheses
in the paired format represents a stem. The labels of the two half-stems associated with
the stem are shown inside the parentheses.
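The counting argument can be checked with a small sketch that enumerates every way of pairing 2n half-stems and compares the count with the closed form. This is a simplified illustration, not the enumeration program used in this work: half-stems are labeled here simply by their position along the sequence (an assumption), and the string layout produced by `to_paired_format` is a hypothetical rendering of the paired format.

```python
from math import factorial
from typing import Iterator, List, Tuple

Pairing = List[Tuple[int, int]]


def enumerate_pairings(labels: List[int]) -> Iterator[Pairing]:
    """Yield every way of pairing the given half-stem labels into stems."""
    if not labels:
        yield []
        return
    first, rest = labels[0], labels[1:]
    # The first unpaired half-stem may pair with any remaining half-stem.
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in enumerate_pairings(remaining):
            yield [(first, partner)] + tail


def upper_bound(n: int) -> int:
    """Closed form (2n)! / (2^n * n!) = (2n-1)*(2n-3)*...*3*1."""
    return factorial(2 * n) // (2 ** n * factorial(n))


def to_paired_format(pairing: Pairing) -> str:
    """Hypothetical text rendering: one parenthesized label pair per stem."""
    return "".join(f"({a},{b})" for a, b in pairing)


for n in (1, 2, 3):
    structures = list(enumerate_pairings(list(range(1, 2 * n + 1))))
    print(n, len(structures), upper_bound(n))

print([to_paired_format(p) for p in enumerate_pairings([1, 2, 3, 4])])
```

The recursion mirrors the argument in the text: the first half-stem has 2n-1 possible partners, the next unpaired half-stem has 2n-3, and so on.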
This enumeration method only guarantees that the structure can form physically. The
enumerated graph results contain isomorphic graphs (redundant structures). All the
paired format representations are converted into graphs and then into their minimum
DFS codes. Yan et al. proved that two graphs are isomorphic if their minimum DFS codes
are the same (133). In order to purge the redundant graphs, all the minimum DFS codes
were stored in a Perl hash data structure, and only unique minimum DFS codes were
kept in the final set. After these steps, this set contains only unique XIOS graphs for
further research.
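The purge step can be sketched with a dictionary keyed on a canonical code. Note the substitution: instead of the minimum DFS code, this sketch uses a brute-force canonical form (the smallest sorted edge list over all vertex relabelings), which is practical only for very small graphs. The edge encoding is also an assumption: each edge is keyed by (u, v) with u < v, and an I or J label is read from the point of view of the lower numbered vertex.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Edge = Tuple[int, int]
Graph = Tuple[int, Dict[Edge, str]]  # (number of vertices, edge labels)
FLIP = {"I": "J", "J": "I"}  # X, O and S are undirected and never flip


def canonical_form(n: int, edges: Dict[Edge, str]) -> Tuple[Tuple[int, int, str], ...]:
    """Smallest sorted edge list over all relabelings (brute force, small n only)."""
    best = None
    for perm in permutations(range(n)):
        relabeled = []
        for (u, v), label in edges.items():
            a, b = perm[u], perm[v]
            if a > b:  # keep the lower vertex first; a directed label flips
                a, b, label = b, a, FLIP.get(label, label)
            relabeled.append((a, b, label))
        code = tuple(sorted(relabeled))
        if best is None or code < best:
            best = code
    return best


def unique_graphs(graphs: List[Graph]) -> List[Graph]:
    """Keep the first graph seen for each canonical form (dictionary-based purge)."""
    seen: Dict[tuple, Graph] = {}
    for n, edges in graphs:
        seen.setdefault(canonical_form(n, edges), (n, edges))
    return list(seen.values())


g1 = (3, {(0, 1): "I", (1, 2): "O"})
g2 = (3, {(0, 1): "O", (1, 2): "J"})  # g1 with vertices 0 and 2 swapped
g3 = (3, {(0, 1): "I", (1, 2): "I"})
print(len(unique_graphs([g1, g2, g3])))
```

The dictionary plays the role of the Perl hash: isomorphic graphs collide on the same key, so only one representative survives.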
In my thesis work, I have tried to enumerate as many small XIOS graphs as possible. One
observation I have made is that once the graph size (vertex number) reaches 8 or 9, the
total number of possible graphs becomes very large. In my experiments, I have
enumerated the entire set of possible 1 to 10 stem XIOS graphs following the steps
described in Figure 3.1 A. However, as a result of the large number and size of the graphs,
it took over 7 TB of hard drive storage space, not to mention the time-consuming steps of
generating the minimum DFS code for each of the structures in the dataset and purging the
redundant structures. The take-home message is that our approach can, in principle,
enumerate as many graphs as one wants, but the limitations of computer
hardware and intractable computation time mean that one needs to decide where to stop. I
chose to keep all unique 1 to 7 stem XIOS graphs for later work in this thesis. The
non-redundant set of 1-7 stem graphs comprises 55,728 graphs in total.
3.2 Structural motif library construction
As mentioned above, all 1 to 7 stem small XIOS graphs were enumerated. I define a
concept called the structural motif, which represents the building blocks used in
biological RNA XIOS graphs. I further define the collection of 55,728 enumerated small
XIOS graphs as the RNA structural motif library. This library can be extended when more
graphs are enumerated and built into this collection. My assumption here is that
different RNA structures contain different structural motifs which correspond to their
functional differences. The RNA structural motifs embedded in an RNA XIOS graph are its
intrinsic properties.
An N-vertex graph has a maximum of N*(N-1)/2 edges. In the structural motif library,
the graphs have at most seven vertices and therefore up to 21 edges, namely 21 spatial
RNA stem-stem relationships. I
clustered all the 55,728 graphs into groups based on the number of edges. The rationale
behind this clustering criterion is that each edge in the XIOS graph represents one
topological relationship between a pair of stems, and it is reasonable to group
structures with the same number of pairwise stem relationships. Initially, I clustered the
graphs based on their stem number. However,
some graphs have strong correlations with each other, and small graphs appear within
bigger graphs. In chapters four and five, which focus on RNA classification and
identification, the stem number-based graph-clustering strategy does not work well. As
an alternative, I developed the following edge-based clustering strategy. Each
graph has a pre-generated minimum DFS code associated with it (calculated in the
redundant graph purge step). Within each N-edge graph group, all the graphs are sorted
based on the alphabetical order of their minimum DFS codes. Each graph is then
assigned a unique ID (UID) of the form N_row_motif_X, where N is the
number of edges in the graph and X is the rank of the graph in the sorted order. I call
these row motifs because the clustering was based on their edge numbers, which in
terms of the DFS code is also the number of rows of code.
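The UID scheme can be illustrated with a short sketch. The row strings below (for example "0 1 I") are a hypothetical text encoding of DFS code rows, and starting the rank at 1 is an assumption; only the N_row_motif_X pattern itself comes from the description above.

```python
from collections import defaultdict
from typing import Dict, List


def assign_uids(min_dfs_codes: List[List[str]]) -> Dict[str, List[str]]:
    """Group codes by number of rows (edges), sort each group, and assign UIDs."""
    groups: Dict[int, List[List[str]]] = defaultdict(list)
    for code in min_dfs_codes:
        groups[len(code)].append(code)  # one row of DFS code per edge

    uids: Dict[str, List[str]] = {}
    for n_edges in sorted(groups):
        # Lists of strings sort lexicographically, row by row.
        for rank, code in enumerate(sorted(groups[n_edges]), start=1):
            uids[f"{n_edges}_row_motif_{rank}"] = code
    return uids


# Hypothetical minimum DFS codes: two one-edge motifs and one two-edge motif.
codes = [["0 1 O"], ["0 1 I"], ["0 1 I", "1 2 O"]]
uids = assign_uids(codes)
for uid, code in uids.items():
    print(uid, code)
```

Because the rank depends only on the sorted order within an edge-count group, the same library always yields the same UIDs, regardless of the order in which graphs were enumerated.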
In order to manage this big set of graphs, a MySQL database was set up to store them
and provide easy access. A table called row_motif was created with the graph UID as a
primary key and the minimum DFS codes stored as a column in the table.
Each structural motif is represented by its minimum DFS code, which is an abstract
concept. For better visualization and understanding of structural motifs, I first
wrote a Perl CGI (Common Gateway Interface) script, row_motif_check.cgi,
to render them as RNA stem-loop diagrams using the minimum DFS code as input. This
script uses the LWP (The World-Wide Web library for Perl) package to interact with the
PseudoViewer3 web service (143) and retrieve the result. PseudoViewer3 is an excellent
visualization tool which was the first one developed for the automatic drawing of RNA
with any type of pseudoknots as a planar graph (Figure 3.2). Despite its useful features,
we experienced many internet connection difficulties and inconsistencies since the server
is located in Korea. As an alternative tool, I added another visualization application
called VARNA (Visualization Applet for RNA) (144), which is a lightweight Java applet
that draws RNA secondary structures, to the row_motif_check.cgi script. VARNA runs
locally from our server, which guarantees its speed and availability. One drawback is
that VARNA cannot produce as nice a layout of pseudoknotted structures as
PseudoViewer3 (Figure 3.2). Since we do not have a strict requirement about the layout,
VARNA is sufficient to perform the visualization task.
3.3 RNA structural fingerprint
3.3.1 Background
In the bioinformatics research related to DNA and protein function, it is the constraint
that function places on mutational change that gives rise to the observed sequence
conservation. Traditionally bioinformatics tools rely on sequence conservation;
sequence similarity often translates into functional similarity. But in the case of RNA,
function is often only weakly linked to sequence, while the main player is the structure.
Our RNA XIOS graph representation captures the dynamic topological characteristics of
a folded RNA molecule, and it can be thought of as a coarse resolution picture of the actual
structure without details such as sequence, stem length, and loop size. The XIOS
graph framework is topology based, compared to sequence based (145) or shape based
(146-148) frameworks. I present a tool which can use such structural topological
features to identify functionally related RNA molecules.
3.3.2 Definition of RNA structural fingerprint
With the RNA structural motif library constructed, based on the assumption that
different RNA structures contain different RNA structural motifs, I propose a new
concept called the RNA structural fingerprint (simply referred to as fingerprint in the
later part of the thesis). The definition of the RNA structural fingerprint is a list of
structural motifs (defined in section 3.2) that are found in a specific biological RNA
structure. This comprehensive list of structural motifs summarizes the spatial
relationships between the stems in an RNA structure.
The fingerprint idea is simple and straightforward. Figure 3.3 shows the workflow for
generating fingerprints for RNA structures or sequences. If one has an RNA sequence,
the RNA folding program UNAFOLD is used to predict its suboptimal structures (up to 5%
above its MFE). Note that UNAFOLD can only predict non-pseudoknotted structures.
Our lab developed a strategy to compare suboptimal structures and identify possible
pseudoknotted structures (149), representing the ensemble of structures as a single
XIOS graph. Alternatively, if one has an RNA secondary structure, it can be directly
converted into a XIOS graph by using the XIOS package described in Chapter 2. The XIOS
graph is used as a query to search against the RNA structural motif library using a
subgraph matching process. This search identifies all the structural motifs that are found
in the RNA XIOS graph; this is the fingerprint of the RNA XIOS graph. The fingerprint is
stored as a feature vector: each element corresponds to one specific RNA structural motif
in our library, and the value of that element is the corresponding count.
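As a minimal sketch of the feature vector, assume the subgraph matching step returns one motif UID per match found in the query graph. The UIDs and the tiny three-motif library below are hypothetical; the real library holds 55,728 motifs.

```python
from collections import Counter
from typing import Iterable, List


def fingerprint_vector(matched_motif_ids: Iterable[str], library_ids: List[str]) -> List[int]:
    """One element per library motif, holding the number of matches of that motif."""
    counts = Counter(matched_motif_ids)
    return [counts.get(uid, 0) for uid in library_ids]


# Hypothetical library order and hypothetical matches for one query graph.
library = ["1_row_motif_1", "1_row_motif_2", "2_row_motif_1"]
matches = ["1_row_motif_1", "1_row_motif_1", "2_row_motif_1"]
print(fingerprint_vector(matches, library))
```

Fixing the library order once means that fingerprints of different RNAs are directly comparable element by element.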
The concept of an RNA structural fingerprint is slightly abstract, so an example for tRNA
is given in Figure 3.4 to better illustrate it. This example shows the actual structural
motifs that comprise a tRNA structural fingerprint. The tRNA XIOS graph is surrounded
by 1 to 3 row structural motifs with their corresponding XIOS graphs. There are
additional, bigger structural motifs embedded in the tRNA XIOS graph but they are not
shown here for simplicity. The colors of the stems represent one of their possible
corresponding stems found in the tRNA XIOS graph. The tRNA secondary structure, 3D
structure and XIOS graph are colored using the same color scheme to better highlight
the corresponding structures.
3.3.3 Fingerprint searching algorithms
The fingerprint generating process, which requires searching against the RNA structural
motif library, would be time consuming if a brute-force method were used.
This task mainly involves determining whether a query XIOS
graph contains a subgraph that is isomorphic to a specific structural motif XIOS graph. If
the query RNA structure is compared with every structure in the library, the
computational complexity of such a search is O(n·m^m), where n is the number of
structural motifs to be compared and m is the number of edges in the query XIOS graph.
For the structural motif library, n is 55,728; it could be even bigger in a more complete
motif library. This subgraph isomorphism problem is known to be NP-complete (70).
It is inefficient to scan the whole library to match structural motifs one by one. An
efficient strategy is needed to make this fingerprint search faster. A filter and
verification method is a common approach to improving the efficiency of
subgraph isomorphism checking over large sets of graphs. The filtering step, which
omits graphs that do not satisfy restraints defined by the user, is the key to improving
search efficiency, since the efficiency is largely determined by the number of graphs left
to be checked in the verification step (the fewer graphs left, the faster the search is).
Therefore, many approaches have proposed using indexing techniques to speed up the
filtering (150-158). Here I describe the strategies I used in my fingerprint
search, including CUDA GPU programming and two indexing techniques.
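The generic filter-and-verification pattern can be sketched as follows. This is an illustration of the pattern only, not the indexing methods of (150-158) or the search code used in this work. The filter used here (the query must contain at least as many edges of each type as the motif) and the brute-force verifier are assumptions chosen for brevity, as is the edge encoding: edges are keyed by (u, v) with u < v and I/J labels are read from the lower numbered vertex.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Optional, Tuple

Edges = Dict[Tuple[int, int], str]
FLIP = {"I": "J", "J": "I"}  # X, O and S are undirected


def label_counts(edges: Edges) -> Counter:
    """Edge-type histogram; J is counted as I because it is the same edge reversed."""
    return Counter("I" if lab == "J" else lab for lab in edges.values())


def passes_filter(query: Edges, motif: Edges) -> bool:
    """Cheap necessary condition: enough edges of every type in the query."""
    q, m = label_counts(query), label_counts(motif)
    return all(q[lab] >= cnt for lab, cnt in m.items())


def edge_label(edges: Edges, a: int, b: int) -> Optional[str]:
    """Label of edge a-b as seen from vertex a, or None if there is no such edge."""
    if a < b:
        return edges.get((a, b))
    lab = edges.get((b, a))
    return FLIP.get(lab, lab)


def verify(query_n: int, query: Edges, motif_n: int, motif: Edges) -> bool:
    """Brute force: try every injective mapping of motif vertices to query vertices."""
    for image in permutations(range(query_n), motif_n):
        if all(edge_label(query, image[u], image[v]) == lab for (u, v), lab in motif.items()):
            return True
    return False


def search(query_n: int, query: Edges, library: Dict[str, Tuple[int, Edges]]) -> List[str]:
    """Return the IDs of library motifs that pass the filter and are then verified."""
    return [uid for uid, (n, motif) in library.items()
            if passes_filter(query, motif) and verify(query_n, query, n, motif)]


query = {(0, 1): "I", (1, 2): "O"}
library = {
    "m1": (2, {(0, 1): "O"}),
    "m2": (2, {(0, 1): "X"}),                # rejected by the filter
    "m3": (2, {(0, 1): "J"}),
    "m4": (3, {(0, 1): "O", (1, 2): "O"}),   # rejected by the filter
}
print(search(3, query, library))
```

Only motifs that survive the inexpensive filter reach the exponential verification step, which is where the savings come from.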
3.3.3.1 CUDA GPU programming
The graphics processing unit (GPU) is a specialized circuit designed to efficiently
manipulate computer graphics. GPUs are normally embedded in a graphics card, or
integrated on the motherboard. The highly parallel, multithreaded, multi-core processor
structure of the GPU makes it more powerful than a central processing unit (CPU) when
processing large blocks of data in parallel (Figure 3.5). A simple comparison of floating
point operations per second (GFLOP/s) and memory bandwidth (GB/s) between CPU
and GPU is shown in Figure 3.6.
NVIDIA Corporation, a major supplier of graphics cards, released the Compute Unified
Device Architecture (CUDA), which extends C/C++ for general
purpose computing on GPUs (GPGPU). GPGPU is becoming one of the hot computational
research areas with promise to advance computationally challenging problems in areas
such as large database searching, protein folding, and molecular dynamics simulation.
The RNA structural fingerprint search process includes thousands of independent
subgraph isomorphism checks. GPGPU was used to parallelize the searching process and
improve its efficiency. We used an NVIDIA GeForce 9800 GTX+ graphics card
(16 multiprocessors, 128 streaming cores) to run the CUDA code and perform the
search. While the subgraph isomorphism problem seems suitable for
implementation on a GPU, my fingerprint search requires a reasonably large amount of
memory. I stopped implementing the search code in CUDA due to the graphics card
memory limitation (512 MB). GPUs with larger memory could be used to solve this
problem and speed up the search process.
3.3.3.2 Prefix tree search
Binary tree searching has O(nlogn) computational complexity, which is far better than
O(nmm). In Chapter 2, I described using a graph sequentialization method, the DFS coding
of the gSpan algorithm (133,134), to translate a graph into its minimum DFS code, which
can be considered a canonical labeling of the graph. Yan and Han showed that if two
graphs are isomorphic, they must share the same minimum DFS code (133). I proposed
a prefix tree data structure to efficiently store and retrieve structural motifs in the
library (Figure 3.7). As mentioned before, the minimum DFS codes were pre-calculated
for each structural motif, and those codes were stored in this prefix tree. The prefix tree
stores all n-row motifs at level n. Each node of the tree holds only one row of DFS code.
To retrieve the complete minimum DFS code of a structural motif, it is necessary
to trace from the root node to the node representing the last row of the DFS code for
the structure, collecting one row per node. Node X is an ancestor of node Y if and only if the DFS
code from the root node to X is a prefix of the DFS code from the root node to Y.
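The storage scheme described above can be sketched as follows. This is a minimal illustration in Python rather than the Perl implementation used in this work; the class names, the `motif_id` field, and the representation of a DFS code row as a tuple are assumptions made for the example.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

# One row of a minimum DFS code. The tuple layout is an assumption
# made for illustration; any hashable row representation would work.
Row = Tuple[Hashable, ...]


class PrefixTreeNode:
    """A node holds exactly one row of DFS code (the root holds none)."""

    def __init__(self, row: Optional[Row] = None,
                 parent: Optional["PrefixTreeNode"] = None) -> None:
        self.row = row
        self.parent = parent
        self.children: Dict[Row, "PrefixTreeNode"] = {}
        self.motif_id: Optional[str] = None  # hypothetical motif label


class MotifPrefixTree:
    def __init__(self) -> None:
        self.root = PrefixTreeNode()

    def insert(self, code: Sequence[Row], motif_id: str) -> PrefixTreeNode:
        """Store a motif; an n-row code ends at a node on level n."""
        node = self.root
        for row in code:
            child = node.children.get(row)
            if child is None:
                child = PrefixTreeNode(row, node)
                node.children[row] = child
            node = child
        node.motif_id = motif_id
        return node

    @staticmethod
    def code_of(node: PrefixTreeNode) -> List[Row]:
        """Recover the full DFS code by tracing the path back to the root."""
        rows: List[Row] = []
        while node.parent is not None:
            rows.append(node.row)
            node = node.parent
        rows.reverse()
        return rows


tree = MotifPrefixTree()
one_row = tree.insert([(0, 1, "I")], "motif_1row")
two_row = tree.insert([(0, 1, "I"), (1, 2, "O")], "motif_2row")
```

Because the two-row code extends the one-row code, `two_row` is stored as a child of `one_row`, which is the prefix relationship the search relies on.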
Most subgraph isomorphism checking methods used on large sets of graphs are
filter-and-verification methods: they first filter out graphs that do not satisfy
user-defined constraints, and then perform isomorphism checking on the remaining graphs. My
prefix tree strategy employs a verification-and-filter style instead. Compared with
filter-and-verification, this style starts isomorphism checking with a small graph. If this
small graph fails the isomorphism check, the program filters out the large number of
graphs that are extensions of it. The details of the prefix tree
search are as follows. A fingerprint search for a query XIOS graph starts
from the root of the prefix tree. The root node contains just the “zero”-row motif graph,
which has only serial stem-to-stem topological relationships among all the vertices (for
example, the first graph in Figure 2.3). Two one-row motifs are children of
this root node: the 2nd and 3rd graphs in the first row of Figure 2.3. From here
on, a depth-first strategy is used in the search. One of the two motifs is chosen
for subgraph isomorphism checking against the query XIOS graph. If this motif
passes the check (matches), then one of its child motifs is retrieved and checked
against the query XIOS graph in the same way. This
process is repeated until a motif M that fails the subgraph isomorphism check is
found. Since this is a prefix tree, all the motifs represented by the descendant
nodes of M must contain M as a subgraph. If M is not a subgraph of the query XIOS graph,
then none of the motifs in its subtree can be subgraphs of the query XIOS
graph, because they are extensions of M.
The filtering power of this strategy is that once the subgraph isomorphism checking fails
at a specific node, the whole child branch of this node no longer needs to be checked,
since this node is a prefix of all its child nodes.
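The pruning behaviour can be illustrated with a short sketch. The `Node` class and the `is_subgraph` callback are hypothetical stand-ins: in the toy example a motif is a set of labels and "subgraph" is modelled as set inclusion, which is only a placeholder for a real subgraph isomorphism test.

```python
from typing import Callable, List, Optional, Tuple


class Node:
    """Hypothetical prefix tree node: a motif plus its extensions."""

    def __init__(self, motif, children: Optional[List["Node"]] = None) -> None:
        self.motif = motif
        self.children = children or []


def fingerprint_search(root: Node, query,
                       is_subgraph: Callable) -> Tuple[list, int]:
    """Depth-first verification-and-filter search.

    Returns the motifs found in the query and the number of
    subgraph checks that were actually performed.
    """
    hits = []
    checks = 0
    stack = [root]
    while stack:
        node = stack.pop()
        checks += 1
        if not is_subgraph(node.motif, query):
            continue  # prune: the whole subtree under this node is skipped
        hits.append(node.motif)
        # Reverse so that children are visited in their stored order.
        stack.extend(reversed(node.children))
    return hits, checks


# Toy library of 7 motifs; set inclusion stands in for the real check.
root = Node(frozenset(), [
    Node(frozenset("a"), [Node(frozenset("ab")), Node(frozenset("ac"))]),
    Node(frozenset("x"), [Node(frozenset("xy")), Node(frozenset("xz"))]),
])
query = frozenset("abc")
hits, checks = fingerprint_search(root, query, lambda m, q: m <= q)
```

In this toy run the motif `{x}` fails, so its two extensions are never tested: five checks are performed instead of seven.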
During testing of this approach, I observed some inconsistency in search speed, although
the overall search was still faster than the brute-force method.
After carefully examining the layout of the tree in memory, I found that the order in
which the tree nodes are allocated is very important. Perl
does not give the programmer full control over memory allocation, and the order in which
the tree nodes were constructed was the cause of the inconsistency. A memory paging
problem occurs whenever the tree is large enough to span more than one page: if the
address of a child node is more than one page away from the address of its parent node,
the memory access time is far higher. The take-home message is that
an efficient layout of the tree in memory is important and can potentially save a lot of
computational time.
3.3.3.3 NH indexing
Prefix tree searching improved the speed of fingerprint generation, but it was still not
satisfactory. Inspired by the neighborhood indexing (NH indexing) method (159)
(Figure 3.8), I developed a modified version of the NH indexing strategy to speed up
the RNA structure database search. The main idea of the NH indexing
strategy is that a vertex plays a role proportional to its significance when
two graphs are matched. The neighbors of the vertex and its degree can be used to
determine the significance of a vertex in the matching process. This information is used
in the indexing as well as in the query search. Besides the neighbor and degree
information, I also define triangle descriptors (Figure 3.9) that describe vertex properties.
For a given vertex i, let vertices j and k be two of its neighbors (vertices connected to i
by edges). A triangle can
be formed by i, j and k. Depending on the edges linking these three vertices, 36 different
triangle descriptors are possible (Figure 3.10).
XIOS graphs are further separated into connected components (modules), i.e., distinct
subgraphs that have only serial (non-nested) relationships with each other. Modules
represent independent pieces found in biological RNA structures. Every vertex of each
module in the database is indexed by the modified NH indexing strategy (159).
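Separating a XIOS graph into modules amounts to finding connected components when serial (S) relationships are treated as the absence of an edge. A minimal sketch, assuming the graph is supplied as a vertex list plus a list of `(u, v, label)` edges (this input format and the function name are illustrative):

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Tuple


def split_into_modules(
    vertices: Iterable[Hashable],
    edges: Iterable[Tuple[Hashable, Hashable, str]],
) -> List[list]:
    """Return the X/I/J/O edge-connected components of a XIOS graph.

    Serial (S) relationships are assumed to be represented by the
    absence of an edge, so they never join two modules.
    """
    adjacency = defaultdict(set)
    for u, v, _label in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    modules = []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        component = []
        stack = [start]
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        modules.append(sorted(component))
    return modules


# Stems 0-2 are linked by I and O edges; stems 3-4 by an X edge;
# stem 5 is only serial to everything else.
modules = split_into_modules(
    [0, 1, 2, 3, 4, 5],
    [(0, 1, "I"), (1, 2, "O"), (3, 4, "X")],
)
```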
Figure 3.9 shows only the triangle descriptors that can physically form; other cases are
mathematically feasible but cannot physically form. A complete list of all
36 mathematically possible triangle descriptors can be found in Figure 3.10. I use a
list called the NH index array to store the properties of each vertex. The design
of the NH index array is shown in Table 3.1.
The NH index array has 42 elements: the counts of all 36 triangle descriptors
listed in Figure 3.10, the numbers of I, J, O, and X edges that extend from the vertex, its
degree (d), and the number of edges between its neighbor vertices (nc). The details of
generating the NH index array for a specific vertex are described in Algorithm 3.1.
Algorithm 3.1 build_NH_index_array
Input: graph vertex ni
Output: NH index array NH(ni)
0: Initialize NH index array of vertex ni with zeroes
1: Find all neighbors (vertices connected to ni) of vertex ni, and put them into neighbor list NB
2: FOR each vertex nj in neighbor list NB
3:   FOR each vertex nk ≠ nj in neighbor list NB
4:     FOR each triangle descriptor Tl
5:       IF vertices ni, nj and nk match triangle Tl
6:         increment the count of triangle descriptor Tl by 1
7:     Determine the type of the edge between vertex ni and nj, one of {I, J, O, X}
8:     Determine the type of the edge between vertex ni and nk, one of {I, J, O, X}
9:     Increment the counts of these two edge types in NH(ni)
10:   END FOR
11: END FOR
12: RETURN NH index array of vertex ni
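The following sketch shows one way the quantities in the NH index array could be computed. It makes several assumptions that are not part of Algorithm 3.1: the graph is a dictionary mapping a directed vertex pair to its edge label, each triangle is keyed by its raw triple of edge types instead of a descriptor number T0 to T35 (the mapping in Figure 3.10 is not reproduced here), and each neighbor pair and each incident edge is counted once.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Tuple

Edges = Dict[Tuple[Hashable, Hashable], str]

# Seen from the other endpoint, an included (I) edge reads as J and vice versa.
REVERSE = {"I": "J", "J": "I", "O": "O", "X": "X"}


def edge_type(edges: Edges, u: Hashable, v: Hashable) -> str:
    """Edge label seen from u towards v; 'S' (serial) when no edge is stored."""
    if (u, v) in edges:
        return edges[(u, v)]
    if (v, u) in edges:
        return REVERSE[edges[(v, u)]]
    return "S"


def build_nh_index_array(edges: Edges, ni: Hashable) -> dict:
    """Collect triangle counts, edge-type counts, degree d and nc for ni."""
    neighbors = sorted(
        {v for (u, v) in edges if u == ni} | {u for (u, v) in edges if v == ni}
    )
    edge_counts = Counter(edge_type(edges, ni, nj) for nj in neighbors)
    triangles = Counter()
    nc = 0  # number of edges between the neighbors of ni
    for nj, nk in combinations(neighbors, 2):
        e_jk = edge_type(edges, nj, nk)
        if e_jk != "S":
            nc += 1
        # Stand-in key for a triangle descriptor (assumption, see lead-in).
        triangles[(edge_type(edges, ni, nj), edge_type(edges, ni, nk), e_jk)] += 1
    return {"triangles": triangles, "edges": edge_counts,
            "d": len(neighbors), "nc": nc}


example_edges = {(0, 1): "I", (0, 2): "I", (1, 2): "O", (3, 0): "I"}
nh = build_nh_index_array(example_edges, 0)
```

A full implementation would translate each edge-type triple into its descriptor number and flatten the result into the 42-element array of Table 3.1.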
Each RNA structure is converted to a XIOS graph and further separated into its
XIO-edge-connected components (modules). For each module, I calculate the NH index
array for every vertex of the graph, and create a database of structure vertices indexed
by triangle descriptors. For example, if vertex n of graph S and its neighbors can form
triangle descriptors T0 and T18, then features T0 and T18 are used as keys to index vertex n
and graph S. After the vertices in the entire structure database have been indexed, it is
easy and fast to look up all the vertices and graphs in the database that are associated
with a specific triangle descriptor feature.
Algorithm 3.2 NH indexing
Input: all database XIOS graphs S
Output: index D, where each entry Dk lists the database graph vertex ids containing feature Tk
1: FOR each graph Si in the XIOS graph database S
2:   separate graph Si into its module list*
3:   FOR each module ml in the module list
4:     FOR each graph vertex ni in module ml
5:       CALL build_NH_index_array function with vertex ni, it returns array NH(ni)
6:       FOR each of the 36 possible triangle descriptors T0 to T35
7:         IF vertex ni is associated with descriptor Tk THEN
8:           Append vertex ni to index entry Dk
9:         END IF
10:       END FOR
11:     END FOR
12:   END FOR
13: END FOR
14: RETURN index D
* X, I, O edge-connected components
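The index built by Algorithm 3.2 is essentially an inverted index from triangle descriptor to vertices. A minimal sketch, assuming the descriptors of each vertex have already been computed and that a vertex is identified by a hypothetical `(graph id, module id, vertex id)` tuple:

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple

VertexKey = Tuple[Hashable, Hashable, Hashable]  # (graph, module, vertex)


def build_feature_index(
    vertex_features: Dict[VertexKey, Set[int]],
) -> Dict[int, List[VertexKey]]:
    """Map each triangle descriptor number to the vertices that contain it."""
    index = defaultdict(list)
    for vertex_key in sorted(vertex_features):
        for descriptor in sorted(vertex_features[vertex_key]):
            index[descriptor].append(vertex_key)
    return dict(index)


# Vertex n of graph S forms descriptors T0 and T18; vertex m forms only T0.
features = {
    ("S", 0, "n"): {0, 18},
    ("S", 0, "m"): {0},
}
index = build_feature_index(features)
```

Looking up `index[18]` then returns every database vertex associated with descriptor T18 without scanning the graphs themselves.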
The search method combines an NH-indexing-aided database search with complete
subgraph matching. The database comprises a set of XIOS graphs derived from biological
RNA structures. Search queries are indexed in the same way as the database, and each
query vertex is compared to the database index in order to find topologically similar
vertices in the database as candidate (seeding) vertices. This is the NH indexing
screening step. My search strategy does not require searching the whole
database, only the graphs that contain the seeding vertices found by the NH indexing
screening step. The performance of the search is greatly improved by the smaller
search space. Algorithm 3.3 describes the search process step by step.
Algorithm 3.3 NH indexing search
Input: query XIOS graph Q and index D
Output: search hit list H, where each Hi indicates a database module that matches the query
1: Separate query XIOS graph Q into its connected components (module list)
2: FOR each module ml in the module list
3:   FOR each graph vertex ni in module ml
4:     CALL build_NH_index_array function with vertex ni, it returns list NH(ni)*
5:     FOR each of the 36 triangle descriptors
6:       IF vertex ni is not associated with descriptor Tk THEN
7:         Put all the vertices from the list Dk into the non-candidate list NC(ni)
8:       END IF
9:     END FOR
10:     FOR each of the 36 triangle descriptors
11:       IF vertex ni is associated with descriptor Tk THEN
12:         FOR each vertex nj in list Dk
13:           IF nj is not in the non-candidate list NC(ni) THEN
14:             Append nj to the candidate list C(ni)
15:           END IF
16:         END FOR
17:         FOR each vertex nc in the candidate list C(ni)
18:           FOR each k-th value in list NH(ni): NH(ni)[k], where 1<=k<=42
19:             IF k-th value in list NH(nc): NH(nc)[k] is larger than NH(ni)[k] THEN
20:               skip to the next vertex in the candidate list C(ni)
21:             END IF
22:           END FOR
23:           lookup the database graph module mhit that contains vertex nc
24:           Append mhit into hit list H
25:         END FOR
26:     END FOR
27:   END FOR
28:   FOR each graph module hit mhit from hit list H
29:     IF query module ml has an equal or bigger number of vertices than graph module hit mhit THEN
30:       IF simple_subgraph_isomorphism_check(ml, mhit)** is NOT true THEN
31:         Remove this graph hit mhit from the hit list H
32:       END IF
33:     END IF
34:   END FOR
35: END FOR
36: RETURN search hit list H
* This function is described in Algorithm 3.1. It returns a list of the counts of each of the 36 triangle descriptors associated with the vertex, the counts of the edge types extending from the vertex, and the degree of the vertex.
** This function takes two graphs as input and performs a complete subgraph match test.
Lines 2-27: graph vertex filtering. According to the graph containment search exclusive
logic (155), if a feature f is not embedded in query graph Q, any graph Gi in the database
that has feature f should not be a matching candidate. First, I determine which triangle
descriptors are not associated with the query structure vertex ni. From the database
index, I identify all the vertices that contain any triangle descriptor feature
not associated with ni and push them onto the non-candidate vertex list NC(ni). All the
vertices that contain any of the triangle descriptor features of the query vertex are
included in the candidate vertex list C(ni). The intersection of NC(ni) and
C(ni), NC(ni) ∩ C(ni), is then removed from the candidate list C(ni). Further screening is
done by checking the NH index array counts. The k-th count (1 ≤ k ≤ 42) of the
query structure vertex, NH(ni)[k], should not be smaller than the k-th count NH(nc)[k],
where nc is the database vertex. If NH(ni)[k] < NH(nc)[k] for any k, nc is removed from the
candidate list C(ni). A module list mhit is then built from the candidate list C(ni) by
looking up the vertex-to-module association in the index. At this point, a list of
candidate vertices C(ni) and a graph module list mhit are available for the next step.
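The vertex filtering step can be sketched as below. The function and argument names are illustrative, the count vectors are shortened to two elements instead of 42 to keep the example readable, and the sketch assumes the containment direction described in the text: a database vertex survives only if none of its counts exceeds the corresponding count of the query vertex.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def screen_candidates(
    query_features: Set[int],
    query_counts: Sequence[int],
    index: Dict[int, List[Hashable]],
    db_counts: Dict[Hashable, Sequence[int]],
    vertex_to_module: Dict[Hashable, Hashable],
) -> Tuple[list, list]:
    """Return the surviving candidate vertices and the modules they belong to."""
    candidates = set()
    non_candidates = set()
    for descriptor, vertices in index.items():
        if descriptor in query_features:
            candidates.update(vertices)
        else:
            # Exclusive logic: a feature absent from the query vertex
            # rules out every database vertex that has it.
            non_candidates.update(vertices)
    candidates -= non_candidates

    # Count screening: the query count must not be smaller than the
    # database count at any position of the NH index array.
    survivors = sorted(
        v for v in candidates
        if all(q >= d for q, d in zip(query_counts, db_counts[v]))
    )
    modules = sorted({vertex_to_module[v] for v in survivors})
    return survivors, modules


index = {0: ["a", "b"], 18: ["b", "c"], 5: ["c"]}
db_counts = {"a": [1, 1], "b": [3, 0], "c": [1, 1]}
vertex_to_module = {"a": "m1", "b": "m1", "c": "m2"}
survivors, modules = screen_candidates({0, 18}, [2, 1], index,
                                       db_counts, vertex_to_module)
```

Here vertex `c` is excluded because it carries descriptor T5, which the query vertex lacks, and vertex `b` is excluded because its first count (3) exceeds the query count (2).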
Lines 28-35: module size screening and simple subgraph isomorphism check. This step
efficiently rules out candidates from the list mhit, leaving a small number of candidates
for the most accurate and most time-consuming test. In order to perform a specific
biological function, an RNA needs to have a certain set of topological modules, and each
of the modules needs to be complete. That is, the query structure topological module
needs to be the same size as, or bigger than, a database module (size being the number of
vertices). This simple module size test filters out false positive matches very quickly.
If the size requirement is not met, that database module is discarded from the list mhit.
The next step, a simple subgraph isomorphism check, is the most computationally expensive
step of the search. It performs an exact, complete
subgraph containment search (looping through all candidate vertex combinations until the
first complete match is found; in the worst case, all combinations are examined), testing
the query XIOS graph module against every database module hit in the
module list mhit. The search looks for database modules that are the same size as the query
module or completely nested in the query module. The false positives are removed from
the result hit list H.
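A brute-force version of the complete subgraph check can be sketched as follows. The data layout (a vertex list plus a dictionary of directed edge labels) is an assumption, as is the matching rule used here: every pair of mapped vertices must have the same relationship in both graphs, including serial (S) pairs.

```python
from itertools import permutations
from typing import Dict, Hashable, List, Tuple

Edges = Dict[Tuple[Hashable, Hashable], str]
Graph = Tuple[List[Hashable], Edges]

REVERSE = {"I": "J", "J": "I", "O": "O", "X": "X"}


def relation(edges: Edges, u: Hashable, v: Hashable) -> str:
    """Relationship of u to v; 'S' (serial) when no edge is stored."""
    if (u, v) in edges:
        return edges[(u, v)]
    if (v, u) in edges:
        return REVERSE[edges[(v, u)]]
    return "S"


def simple_subgraph_isomorphism_check(query: Graph, hit: Graph) -> bool:
    """True if the database module `hit` is nested in the query module."""
    q_vertices, q_edges = query
    h_vertices, h_edges = hit
    if len(h_vertices) > len(q_vertices):
        return False  # module size screening
    # Try every assignment of hit vertices to distinct query vertices
    # and stop at the first complete match.
    for image in permutations(q_vertices, len(h_vertices)):
        mapping = dict(zip(h_vertices, image))
        if all(
            relation(h_edges, a, b) == relation(q_edges, mapping[a], mapping[b])
            for i, a in enumerate(h_vertices)
            for b in h_vertices[i + 1:]
        ):
            return True
    return False


query = ([0, 1, 2], {(0, 1): "I", (1, 2): "O"})
overlap_motif = (["a", "b"], {("a", "b"): "O"})
exclusive_motif = (["a", "b"], {("a", "b"): "X"})
```

The number of assignments grows factorially with module size, which is why the cheaper index and size screens are applied first.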
3.3.4 Possible applications
It is intuitive that RNA molecules contain different structural motifs, but members of the
same RNA family share more common motifs than RNAs from different families. For a
given biological RNA, we represent its ensemble of suboptimal secondary structures
(predicted by UNAFold) by a XIOS graph and describe its topological features by its
fingerprint. Comparing the fingerprints of a set of RNAs can give clues about
their relationships and functional similarities. It is a natural extension of this work to
index experimentally determined or computationally predicted RNA structures from
different publicly accessible data sources by their fingerprints. This approach allows
the construction of an RNA topology database with all RNA topological information.
Furthermore, a database search utility can be developed to perform RNA topological
similarity search. With the aid of the RNA topology database, feature selection and
classification methods can be used to identify important topological features
corresponding to specific biological functions.
Figure 3.1 Enumeration of XIOS graphs.
A. The steps of assigning half-stem labels. For the n-stem case, there are 2n half-stems. We assign integer labels from 1 to 2n to the half-stems. By definition, the first half-stem is labeled 1, and there are 2n-1 possible regions that can pair with the first half-stem. Here we define the sequence direction from half-stem 1 to half-stem 2 as the positive direction; the opposite direction is negative. The third half-stem has only one possible label (2 or 3), and there are 2n-3 possible half-stems that can pair with this half-stem, and so on. The upper bound on the number of possible n-stem structures is therefore (2n-1)*(2n-3)*(2n-5)*…*5*3*1. B. tRNA example. This result resembles the tRNA structure (three-leaf cloverleaf structure). The code below is the paired format representation of this structure. Each pair of matching parentheses represents a stem. Each stem has two half-stems associated with it; their labels are inside the parentheses.
Figure 3.2 RNA secondary structure visualization.
The top of the figure shows the dot-bracket representation of an example structure. A. RNA structure visualized by PseudoViewer3. B. RNA structure visualized by VARNA.
This example shows actual structural motifs listed in the tRNA structural fingerprint. A. The tRNA XIOS graph is located in the center, surrounded by one- to three-row structural motifs with their XIOS graphs. Bigger structural motifs are also embedded in the tRNA XIOS graph, but they are not shown here for simplicity. The color of each motif stem indicates one of its possible corresponding stems in the tRNA XIOS graph. B. tRNA 3D structure. C. tRNA secondary structure stem-loop diagram. Note that the tRNA secondary structure, 3D structure and XIOS graph use the same color scheme to distinguish different stems.
Figure 3.5 Architecture comparison of CPU and GPU.
Adapted from CUDA C Programming Guide 4.0
Figure 3.6 Comparison of CPU and GPU.
Left: floating-point operations per second. Right: memory bandwidth. Adapted from CUDA C Programming Guide 4.0
Figure 3.7 Prefix tree structure stores structural motif library for efficient subgraph isomorphism check.
Figure 3.8 Neighborhood indexing (NH indexing).
Left panel: the open circle in the center is the vertex of interest. All the squares are its direct neighbors, and the diamond is not its neighbor. Information such as the number of I, J, O and X edges extending from the vertex, the degree of the vertex (d) and the number of connections between its neighbors (nc) is considered as the properties of the vertex. The list of all this information is called the NH index array of the vertex. Right panel: with the help of the NH index array, some vertices can be easily anchored from the query graph (smaller graph on the left) to the database graph (bigger graph on the right). Those vertices serve as seeds (closed circles) for the initial step of graph matching. Extending to their neighbor vertices (open circles) leads to the maximum common subgraph of the two graphs more efficiently.
Figure 3.9 Triangle descriptors.
All physically possible triangle descriptors are shown. A complete list of all mathematically possible triangle descriptors can be found in Figure 3.10. Each vertex of the XIOS graph represents an RNA stem and each edge/link corresponds to one of four spatial stem-stem relationships: exclusive (X) (not shown here), included (I) (directed edge), overlap (O) (undirected edge) and serial (S) (if no edge is shown between two vertices, it is an S edge). The vertex on the left side of the triangle is the target vertex ni (closed circle ●), and the two vertices on the right are its neighbors (nj on the top and nk on the bottom, open circles ○). Each descriptor is coded by groups of three letters, which represent the edge type between ni and nj, the edge type between ni and nk, and the edge type between nj and nk. For example, descriptor T0 is coded by III and IIJ. This means that two possible triangles form descriptor T0. In both cases, the edges between ni and nj and between ni and nk are included (I) edges. The edges between nj and nk differ: included (I) and reverse included (J), respectively. Each descriptor also has one DFS (depth-first search) code associated with it; see (160) for more detail.
Figure 3.10 Full List of Mathematically Possible Triangle Descriptors.
Each vertex of the XIOS graph represents an RNA stem and each edge/link corresponds to one of four spatial stem-stem relationships: exclusive (X) (not shown here), included (I) (directed edge), overlap (O) (undirected edge) and serial (S) (if no edge is shown between two vertices, it is an S edge). The vertex on the left side of the triangle is the target vertex ni (closed circle ●), and the two vertices on the right are its neighbors (nj on the top and nk on the bottom, open circles ○). Each descriptor is coded by groups of three letters, which represent the edge type between ni and nj, the edge type between ni and nk, and the edge type between nj and nk. For example, descriptor T0 is coded by III and IIJ. This means that two possible triangles form descriptor T0. In both cases, the edges between ni and nj and between ni and nk are included (I) edges. The edges between nj and nk differ: included (I) and reverse included (J), respectively. Each descriptor also has one DFS (depth-first search) code associated with it; see (160) for more detail.
Table 3.1 Design of the NH index array.
There are 42 elements in this array. The elements of the array are: the counts of the 36 mathematically possible triangle descriptors (T0 to T35), the counts of I, J, O and X edges (I, J, O, X), the degree of the vertex (d) and the number of edges connecting its neighbor vertices (nc).
Graph theoretical approaches have been used to identify chemical moieties associated
with functional properties (71,72,153,161). Chemical structures have been represented
by molecular graphs in quantitative structure-activity relationships (QSAR) studies, using
structural determinants to model and predict physicochemical and biological properties
(73). Graphical representation of chemical structures has been used to compare
structure similarity and to identify function (71,72). The correlation between chemical
properties and function can be used to predict the function of novel molecules.
ChemIDplus (161), PubChem (162), ChemBank (163) and BindingDB (164) are examples
of databases using graph theoretical approaches for chemical structure search and
comparison.
While increasing attention has been drawn to the important biological roles of RNA,
RNA functional annotation remains difficult. Approaches based on RNA primary
sequence alone have been extensively studied and implemented (66,140,146,165,166),
but there are many cases where structurally similar RNAs have little or no detectable
sequence similarity; in these cases sequence-based approaches fail to correctly identify
and classify the RNAs. RNA function depends on RNA tertiary folding, and tertiary
structure, in turn, is largely determined by base-paired secondary structures and
pseudoknotted structures, which are not strictly secondary structure. The RNA-XIOS database
provides a means to link RNA secondary structure, including pseudoknots, to
biological function and physicochemical properties by associating topological patterns
with the functions of currently known RNA families.
Similar to graph database studies in chemical informatics, the RNA structure XIOS graph
database provides extensive RNA-secondary-structure topological information, including
pseudoknots, and ensembles of suboptimal RNA secondary structures (160). For a given
query RNA structure, it can quickly identify topologically similar RNAs for further
analysis, such as function identification.
Several RNA motif databases based on graph theory are currently available (167), but
they do not provide an RNA structural topology search service that includes
pseudoknot topologies. Additionally, techniques for efficiently identifying structural
similarity between RNAs are not well developed. Our approach is similar to the
RNAshapes approach of Giegerich et al. (98), but we also consider pseudoknots and
suboptimal structural ensembles. It is also similar to RAG (RNA-As-Graphs) (168),
which describes RNA structures as graphs. However, some RAG structures that are
mathematically possible cannot form in the physical world. In our approach we
enumerate only physically possible graphs, greatly reducing the search space for
topological similarity.
4.2 Methods and dataset
4.2.1 XIOS Graph
We have developed a framework, XIOS, which represents an ensemble of RNA
secondary structures in a single graph; pseudoknots and suboptimal structures are
specifically included (160). XIOS graphs are constructed based on base-pairing in actual
and/or predicted stems. Each vertex in the XIOS graph represents an RNA stem and each
edge/link corresponds to one of the four spatial stem-stem relationships: exclusive (X),
included (I), overlapping (O) and serial (S). The reverse of the
included (I) relationship is denoted J (Figure 2.1). As this is a complete list of possible
relationships between base-paired regions in an RNA, the XIOS approach can be used to
enumerate all physically possible graphs. The XIOS graph is then converted into its
minimum depth-first search (minimum DFS) code for fast RNA structure comparison (133).
In contrast to traditional sequence-based approaches, XIOS is a topology-based approach,
which allows comprehensive and efficient exploration of the RNA structure space.
4.2.2 Dataset
The data set used here is the manually curated dataset described in Table 1.1.
4.2.3 Indexing and searching
The basics of the database search are the same as those of the fingerprint search
described in Chapter 3. The indexing algorithm is Algorithm 3.2, and the search algorithm
is Algorithm 3.3. The difference is the database: in the fingerprint
search it is the structural motif library (see section 3.2), while here it is the manually
curated dataset (Table 1.1). Real biological RNA structures are indexed in the RNA
structural database considered in this chapter. The structures I consider here are
significantly bigger than the enumerated structural motifs and more biologically
relevant. The large size and complexity of real biological RNAs are the challenging
aspects of this database search study.
4.2.4 Scoring function
The similarity of RNA structures is evaluated based on the number of indexed subgraphs
they have in common. For a query structure, each query module has a true candidate
database-module-list generated by Algorithm 3.3. By examining the combinations of lists
of all graph modules, the database structures with the most module hits, as well as the
closest size-matches, can be found. A combination of these terms is used to define the
best matching structure.
For a specific query structure (a XIOS graph with a certain number of modules), large
graphs in the database tend to have a higher number of module hits and larger
matches. This is because large graphs have larger modules and more kinds of small
modules, regardless of their true similarity to the query. For example,
consider a query graph module with N (N≥2) vertices, i.e., of size N. Suppose there are
two database graph modules A and B (where the size of A is N and the size of B is less
than that of A), and that both A and B match the query. Module A would tend to have a
larger match with the query module, since A is bigger than B. In general, larger database
modules will therefore tend to have larger matches regardless of the query. We penalize
the unmatched regions of a database graph match to correct for this effect: the bigger
the unmatched region, the higher the penalty. Among database hits
with the same match size and number of module hits, higher scores are given to database
modules that are the most similar in size to the query graph.
The database search result is affected by the following factors: the structure overlap size
(the number of vertices that can be mapped between query and hit), and the difference
between the query and hit module sizes. We added penalties to the scoring function
(eq 4.1) to penalize the unmatched part of the structures. This helps to promote
structures of similar size to the query graph to the top of the result hit list.
score = \frac{\text{overlap\_size}}{\dfrac{\text{query\_size}}{\text{hit\_size}} + \dfrac{\text{hit\_size}}{\text{query\_size}}}    (eq 4.1)
where score is in the range (0, overlap_size/2].
The denominator of eq 4.1 adjusts for size differences between the query and database
modules. If the query and hit sizes are the same, the hit score reaches its maximum
value. If the query and hit sizes are very different, say query size >> hit size or query size
<< hit size, the hit score reaches its minimum value, which asymptotically approaches
zero. In all other cases, the hit score lies between the two extreme values.
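A minimal sketch of a score with the properties described above, assuming the form overlap_size / (query_size/hit_size + hit_size/query_size). The function and argument names are illustrative, and this form is an assumption chosen because it peaks at overlap_size/2 when the two sizes are equal and tends to zero when they differ greatly.

```python
def hit_score(overlap_size: float, query_size: float, hit_size: float) -> float:
    """Size-penalized match score (assumed form, see lead-in).

    The denominator equals 2 when the sizes match and grows without
    bound as the query and hit sizes diverge in either direction.
    """
    if query_size <= 0 or hit_size <= 0:
        raise ValueError("module sizes must be positive")
    size_penalty = query_size / hit_size + hit_size / query_size
    return overlap_size / size_penalty


same_size = hit_score(4, 4, 4)      # sizes match: 4 / 2
smaller_hit = hit_score(4, 8, 4)    # hit half the query size: 4 / 2.5
much_larger_hit = hit_score(4, 4, 400)
```

With equal overlap, the hit whose size matches the query scores highest, which is the ranking behaviour the penalty is meant to produce.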
4.3 Results
4.3.1 Validation using known biological structures
An NH indexing database search was conducted using each XIOS graph in the manually
curated structure dataset (Table 1.1) as a query against the whole database of manually
curated structures. Performance of the database search is evaluated by the positive hit
ratio (PHR), which is the number of correctly labeled hits divided by the total number of
hits. The label is the known family of its best match in the database. A sample PHR
calculation for a BLAST search is shown in Figure 4.1. In this example, the black line
represents the query sequence, red lines are the true positive hits, and blue lines
are false positive hits. The positive hit ratio for this search is 3/5. Figure 4.2 shows that
we correctly identified and classified RNA structures using topological criteria across
four distantly related RNA families: tRNA, Group I Intron RNA, RNase P RNA and tmRNA.
The Y-axis of the charts represents the fraction of NH index searches that achieved a
certain PHR. For example, our dataset contains 16 tRNAs, and I performed a separate
search with each of them. In 10 of the 16 searches I observed 100% PHR, so the fraction
of tRNA searches with 100% PHR is 0.625 (62.5%). Over 75% of Group I Intron RNA,
RNase P RNA and tmRNA queries retrieve 100% PHR in the top 5 hits, while over 55% of
tRNA queries have 100% PHR. We also examined the top 10 scoring hits for each RNA family. In this
case, Group I Intron RNA and tmRNA queries still show high recall, while RNase P RNA
queries rank somewhat lower.
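The two quantities used in this evaluation can be computed as in the sketch below. The function names and the list-based inputs are illustrative; the example numbers are the ones quoted above (3 correct hits out of 5, and 10 of 16 tRNA searches with 100% PHR), with the remaining PHR values filled in arbitrarily.

```python
from typing import Sequence


def positive_hit_ratio(query_family: str, hit_families: Sequence[str]) -> float:
    """Fraction of hits whose family label matches the query family."""
    if not hit_families:
        return 0.0
    correct = sum(1 for family in hit_families if family == query_family)
    return correct / len(hit_families)


def fraction_with_full_phr(phr_values: Sequence[float]) -> float:
    """Fraction of searches that reached a PHR of 100%."""
    return sum(1 for phr in phr_values if phr == 1.0) / len(phr_values)


# Three true positives among five hits, as in the Figure 4.1 example.
phr = positive_hit_ratio(
    "tRNA", ["tRNA", "tRNA", "tmRNA", "tRNA", "Group I Intron"]
)

# 16 tRNA searches, 10 of which reached 100% PHR (other values are made up).
tRNA_fraction = fraction_with_full_phr([1.0] * 10 + [0.8] * 6)
```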
Classification accuracy is lowest for tRNA queries. There are two possible reasons. First,
tRNA XIOS graphs are small, and such small motifs may be found in many larger but
unrelated graphs in the database. Second, the number of tRNAs in the database is
relatively small compared to the other groups (16 structures collected from the PDB
database). If a couple of tRNAs match other RNA families, the affected fraction is
relatively large in comparison with the other three RNA families, which have more instances
in the database. Overall, this result confirms that our database search is able to identify
RNAs with similar structure and function based on topological matching.
4.3.2 Size only graph database search
Many RNAs within a functional class have very similar lengths. This raises the possibility
that the classification shown in Figure 4.2 was simply due to matching between
structures with similar sizes (the structure (graph) size is the number of vertices in the
XIOS graph). To eliminate this possibility, we implemented a function that scored the
RNAs based only on their XIOS graph sizes. Figure 4.3 shows that matching by size alone
achieves only about 20% classification accuracy, much lower than the level
achieved by topological matching (the Kolmogorov-Smirnov test result is shown in Table
4.1). This shows that graph size is not the main factor contributing to the correct RNA
classification.
4.3.3 Embedding simulation
Another key issue in matching to a structural database is whether a topologically and
functionally similar structure can be found even when it is embedded in a larger
structure. Such embedding could occur either in a biologically meaningful sense (a true
relationship) or through misassembly or misannotation of the source sequence. For
each of the structures in our dataset, we performed an embedding simulation that
automatically mutated the sequence of the query RNA structure graph, generating
unique structures that are bigger than the input structure and have the input structure
embedded in them (Figure 4.4). We call these bigger structures extended structures. Table
4.2 lists the statistics of the embedding simulation.
For each of the original query structures in the database, we generated circa 15
extended structures (Table 1.1). An NH index database search was then performed using
the extended structures as queries to see whether the original graph and its related
graphs could still be found. Compared with the results shown in Figure 4.2, this embedding
simulation (Figure 4.5) clearly shows that the NH index
database search is able to find the embedded structure and its related structures
(the Kolmogorov-Smirnov test result is shown in
Table 4.1). It further suggests that the NH indexing XIOS database search acts more like a
local similarity search than a global similarity search, since it can identify local graph
patterns within a larger overall graph pattern. Indeed, the performance of the extended
query search is even better than that of the original queries, most likely because the
bigger the structure is, the more likely it is to have hits to its family. The topological
structure database search is therefore robust, and it successfully classified RNA
structures into their correct RNA families.
111
4.3.4 Blast search
The NH indexing database search was compared with a Blast search; the result is shown
in Figure 4.6. Blast performs well for RNAseP and tmRNA, but not for tRNA and
Group_I_Intron RNA. The Kolmogorov-Smirnov test (Table 4.1) shows that the NH indexing
database search achieves the same performance as Blast for RNAseP and tmRNA, where
both searches give good results, and that the NH indexing database search results for
Group_I_Intron RNA and tRNA are statistically significantly better than those of the Blast search.
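As an illustration of how two positive-hit-ratio distributions can be compared, the following sketch computes the two-sample Kolmogorov-Smirnov statistic D, the largest gap between the two empirical cumulative distributions. It uses only the standard library and does not compute a p-value; the sample values are invented for illustration and are not results from this study.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the maximum absolute difference between the empirical CDFs of
    the two samples, evaluated at every observed value.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical per-query positive hit ratios for two search methods.
phr_method_1 = [1.0, 0.8, 1.0, 0.6]
phr_method_2 = [0.2, 0.0, 0.4, 0.2]
d_value = ks_statistic(phr_method_1, phr_method_2)
```

A statistics library routine such as `scipy.stats.ks_2samp` would normally be used in practice, since it also returns the p-value.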
4.4 Discussion
Identification of conserved sequences has played an important role in establishing
sequence-structure-function relationships in proteins, but has been less useful with RNA
because it is the folded structure rather than the sequence that is most closely related
to function. We have developed a structure database searching algorithm, the NH indexing
database search, that can identify and classify topologically, and probably functionally,
similar RNA structures. This knowledge can be used to build experimentally testable
hypotheses about the function of the query RNA.
The NH-indexing database-search algorithm can accurately classify RNA structures
without using primary sequence information. Integrating primary sequence information
into the framework would improve performance for RNAs that have significant
sequence similarity to others in their functional class. Combining both sequence and
topological information should improve the classification of unclassified or misclassified
sequences with low sequence identity but relatively close structural/topological
distances. Any significant sequence similarity should improve the ability to assign the
novel member to the correct family.
The NH-indexing database-search algorithm can also be used as a topological distance
measure of RNA structure similarity. Identifying conserved sequence motifs associated
with unknown functions using multiple alignments and HMMs has been highly useful
in identifying and classifying proteins according to their function (169-171), and should
be similarly useful for RNA.
The current design of the database search requires that the database modules be
smaller than or equal in size to the query module. This work can be extended to implement a
subgraph similarity search that would allow a certain amount of mismatch in order to
maximize the searching power of our approach. If mismatches are allowed, more
database structures would be included in the search result, but the results, presumably,
would include more false positives. Thus more information can be used to classify and
identify query structures, but at the expense of increased noise. Another concern is
whether the current module size correction is reasonable. Currently, if the structure is
small, fewer results are found in the database search, since fewer indexed graph
modules are found in small XIOS graphs. It is likely that if we add smaller, family-specific
motifs to the database, the search performance would be better for small RNA
query structures. Such family-specific motifs can be identified and obtained by applying
feature selection methods to specific RNA family structure datasets.
Figure 4.1 Positive hit ratio (PHR) in a Blast search.
The positive hit ratio is the number of true positives divided by the total number of hits in the result. In this example, the black line represents the query sequence, red lines are true positive hits, and blue lines are false positive hits. The positive hit ratio for this search is 3/5.
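The positive hit ratio described in this caption can be written as a short function. This is a minimal sketch; the function name and the way hits are represented (a ranked list of family labels) are assumptions made for illustration.

```python
def positive_hit_ratio(query_family, hit_families, top_k):
    """Fraction of the top_k hits that belong to the query's family.

    hit_families is the ranked list of family labels returned by a search.
    """
    top = hit_families[:top_k]
    if not top:
        return 0.0
    true_positives = sum(1 for family in top if family == query_family)
    return true_positives / len(top)


# The example in the caption: three true positives among five hits.
example = positive_hit_ratio(
    "tRNA", ["tRNA", "tmRNA", "tRNA", "tRNA", "RNAseP"], top_k=5
)
```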
Figure 4.2 NH database search result.
Topological criteria can be used to correctly identify and classify RNA structures across 4 distantly related RNA families: tRNA, Group I Intron RNA, RNAseP RNA and tmRNA. The x-axis shows the Positive hit ratio, which is calculated as the count of correct hits over total number of hits. For the top 5 hits case, the total number of hits considered is 5. For top 10 hits case, the total number of hits considered is 10. The y-axis is the percentage of queries showing the specified positive hit ratio.
[Figure 4.2 plots: two panels, "Top 5 hits" and "Top 10 hits" (fraction with penalty). x-axis: positive hit ratio (0 to 1); y-axis: percentage (0 to 1). Series: All, tRNA, group1, RNAseP, tmRNA.]
Figure 4.3 Size only database search result.
The horizontal axis shows the PHR, and the vertical axis the fraction of queries reaching the specified level. The results here show distributions of close to random matching, indicating that matching between structures based on size alone is not the main factor contributing to the classification result shown in Figure 4.2.
[Figure 4.3 plots: two panels, "Top 5 hits" and "Top 10 hits" (fraction, graph size only). Both axes run from 0 to 1. Series: All, tRNA, group1, RNAseP, tmRNA.]
Figure 4.4 Embedding simulation.
A. Sequence embedding. A given sequence (blue) is embedded between two flanking sequences (yellow and orange), resulting in a longer sequence. B. Graph embedding. Using the same idea as in sequence embedding, graph embedding is applied to a graph by adding extra vertices (orange) and edges to form a bigger graph. In our study, we implemented a Perl script to automatically mutate the base pairing of the input RNA structure graph at the sequence level and generate unique XIOS graphs that are bigger (in number of vertices) than the input graph and have the input graph embedded in them as a subgraph.
The x-axis shows the positive hit ratio, and the vertical axis shows the fraction of queries in each family achieving the specified PHR. For the top 5 hits case, the total number of hits considered is 5. For the top 10 hits case, the total number of hits considered is 10.
[Plots: two panels, "Top 5 hits" and "Top 10 hits" (fraction with penalty). x-axis: positive hit ratio (0 to 1); y-axis: percentage (0 to 1). Series: All, tRNA, group1, RNAseP, tmRNA.]
Figure 4.6 Blast search result
A Blast search using the RNA sequences was performed on the manually curated dataset.
[Figure 4.6 plots: two panels, "Blast Search Top 5 hits" and "Blast Search Top 10 hits". x-axis: positive hit ratio (0 to 1); y-axis: percentage (0 to 1). Series: All, tRNA, RNAseP, group1, tmRNA.]
Table 4.1 Kolmogorov-Smirnov test results
Kolmogorov-Smirnov test p-values vs. blast search vs. size only vs. embedment
Simulation per structure 14.81 15.75 15.78 15.20 15.37
Min graph size 4 6 12 5
Max graph size 12 29 33 28
Average graph size 7.80 15.88 22.42 19.48
CHAPTER 5 IDENTIFICATION OF TOPOLOGICAL FEATURES THAT DISCRIMINATE
BETWEEN RNA CLASSES
5.1 Introduction
5.1.1 RNA importance, RNA function determined by RNA structure
Like proteins, RNAs also perform important cellular functions, and our understanding of
this fact is increasing rapidly (172,173). As our existing knowledge about RNA grows, the
large scale characterization and analysis of RNA structures and functions, namely
structural genomics of RNA, becomes increasingly important (148,174). The core focus
of the structural genomics of RNA is to find all unique structural motifs and 3D folds,
because molecular structure determines function. Current RNA function predictions are mostly
based on finding conserved sequence motifs, similar to what is done with proteins. In
order to identify sequence motifs, multiple sequence alignments have to be generated.
In principle, conserved motifs can be identified from the alignments and the function of
RNAs predicted. The problem is that, compared to proteins, there are not many RNA
classes currently known, and RNA sequences sharing the same structural motifs may
have no detectable primary sequence similarity, making it impossible to align them. RNA
secondary structure prediction can help the prediction and identification of conserved
tertiary structure, but accurately predicting RNA secondary structures from sequence
information alone is not trivial.
As an alternative approach, graphs have been used to represent RNA secondary
structures. Notable efforts in this direction include the RNAshapes approach of the
Giegerich group (98), but RNAshapes does not include pseudoknots.
Another graphical approach is RAG (RNA-As-Graphs) from the Schlick group (168), but
it enumerates all mathematically possible graphs, and thus the search space is very
large.
Reliable RNA secondary structural information is needed to solve the RNA function
prediction problem because experimental determination of large RNA structures is very
difficult. While there are many RNA secondary structure prediction programs, few of
them can predict the key elements called pseudoknots. Pseudoknots are a common
structural motif in many RNA classes, such as self-splicing introns and telomerase. They
play important roles in the catalytic functions of RNA, such as forming the catalytic core of
ribozymes, and altering gene expression by inducing ribosomal frame shifting in many
viruses (175). The ability to annotate novel RNA secondary structures can give insight
into the possible different functions and roles of RNA. In addition, the ability to find
novel RNA secondary structures can help with the design of pharmaceuticals by
providing an accurate target site for drug recognition.
5.1.2 Contribution
Biological RNAs have different topological features than random RNAs (98,168,174). By
examining RNA families and selecting the most important features, and filtering out
common features shared by different families, one should be able to identify the unique
features contributing to the unique functions of different RNA classes. This would
constitute a mapping between topological features and function. Such a topology-function
mapping would lead to a better understanding of RNA structural patterns and
also to more efficient engineering and design of RNA-based drugs and complexes with
specific functions and effects. In addition, identifying the important topological
features in biological RNA molecules would help us to further refine our motif library to
contain the most discriminatory structural motifs.
5.2 Material and Methods
5.2.1 Reverse cIndex basic feature selection on RNA fingerprints
For each specific RNA family, the members of the family should share similar structural
patterns (topological features) because they perform similar biological functions. On the
other hand, different RNA families should contain a lower fraction of similar structural
patterns because they play different roles in the biological processes. Feature selection,
that is, identification of the features that most powerfully discriminate between
different classes of RNAs, based on RNA fingerprint should provide useful information
about which structural motifs contribute to a specific RNA family and, implicitly, to
specific RNA biological functions.
Our feature selection strategy is based on the cIndex idea (155) (Figure 5.1), but uses the
opposite selection order. The cIndex strategy selects features from the most frequent to the
least frequent; our strategy is to select features from the rarest to the most common.
We refer to this algorithm as cIndex-Basic-Min. Algorithm 5.1 outlines its
pseudocode. A graph feature matrix is used to show the containment relationship
between features and graphs (Figure 5.2). Its (i,j) value is the count of feature i found
in graph j. The support of a feature f in the graph feature matrix is the number of graphs in the
matrix that contain f. For a given graph feature matrix, the algorithm first
selects a feature with minimum support (greater than or equal to 1), then removes this
feature (row) and all graphs that contain this feature (columns). This process is repeated until
the graph feature matrix is empty. The rationale for reversing the cIndex feature
selection order is that, in our study, we have many small structural motifs in the library.
Those small motifs appear randomly everywhere, in every structure. If we used the original
cIndex algorithm, those small motifs would be selected as important features in the first
few iterations. The graph feature matrix would then be empty, and the selection would fail to
capture the truly important structural motif features.
Algorithm 5.1 cIndex-Basic-Min
Input: Graph Feature Matrix M.
Output: Selected Feature list F.
Set the selected feature list F as an empty list { }
FOR each feature f in M
    IF support* of feature f, support(f) > 0
        Index f by its support(f) value
    END IF
END FOR
REPEAT
    Feature is the list of all features with the minimum support, support(f), from the index
    Append the features from the list Feature to F
    Record the iteration number iter and support(f)
    FOR each feature from the list Feature
        Find the corresponding row and delete all columns with non-zero values (remove feature hits) in M
        Delete the corresponding row in M
    END FOR
UNTIL Matrix M is empty
RETURN selected feature list F
* The support of a feature f is the number of graphs in the graph feature matrix that contain f.
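The procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 5.1 rather than the original implementation: the matrix is represented as a dictionary of dictionaries, and support is recomputed on the remaining graphs at every iteration, which is an assumption because the pseudocode does not state whether the index is refreshed after columns are deleted. The feature and graph names are hypothetical.

```python
def cindex_basic_min(matrix):
    """Select features from rarest to most common.

    matrix maps feature -> {graph: count}; a count > 0 means the graph
    contains the feature. Returns (feature, iteration, support) tuples.
    """
    features = set(matrix)
    graphs = {g for counts in matrix.values() for g, c in counts.items() if c > 0}
    selected = []
    iteration = 0
    while graphs and features:
        # Support = number of remaining graphs that contain the feature
        # (assumption: recomputed every iteration).
        support = {}
        for f in features:
            s = sum(1 for g in graphs if matrix[f].get(g, 0) > 0)
            if s > 0:
                support[f] = s
        if not support:
            break
        iteration += 1
        lowest = min(support.values())
        batch = sorted(f for f, s in support.items() if s == lowest)
        for f in batch:
            selected.append((f, iteration, lowest))
            # Remove every graph (column) hit by this feature.
            graphs -= {g for g, c in matrix[f].items() if c > 0}
        features -= set(batch)  # remove the selected rows
    return selected


# Toy matrix: "common" occurs in every graph and is never selected.
toy_matrix = {
    "rare_1": {"g1": 1},
    "rare_2": {"g4": 1},
    "mid_1": {"g1": 1, "g2": 2},
    "mid_2": {"g3": 1, "g4": 1},
    "common": {"g1": 3, "g2": 1, "g3": 2, "g4": 1},
}
result = cindex_basic_min(toy_matrix)
```

In the toy example the ubiquitous feature is left unselected because all graphs have been removed by rarer features before it reaches minimum support, which mirrors the rationale given for reversing the cIndex order.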
5.2.2 RNA structure classification
With the information gathered by cIndex-Basic-Min feature selection, we have
developed a classification method to classify RNA structures.
The classification scoring function is based on the following four factors: 1) feature
number (the number of features in each RNA family); 2) iteration number (for a feature
selected in iteration n of the cIndex-Basic-Min procedure, n is the iteration number); 3)
feature support (number of structures containing this feature); 4) feature size (number
of stems in the feature graph).
1. Iteration function: ( ) = − + , where variable i is the iteration number.
2. Support function: ( , ) = × ( ) , where variable s is the support.
Support is the number of graphs in the graph feature matrix that contain the specific feature. The support function is a combination of the support and the iteration function, so features with the same support do not necessarily get the same weight. More weight is given to common features.
3. Feature size function: $\mathrm{size}(fs) = fs^{2}$, where variable fs is the feature size.
We give higher weights to bigger structural motif features.
Overall classification scoring function:
= ( ) ∗ ( , ) ∗ ( )/ _ .
The classification scoring function associates each selected structural motif feature with
an RNA family, and gives each feature a specific weight for the later classification
calculation.
The RNA structural classification is based on the features found in the query and their
corresponding weights (scores). The query RNA structure receives a score for each RNA
family in the database, and is assigned to the RNA family with the highest
score.
This feature selection process finds the features that are important to a group of RNA
structures. These features can be associated with functions in known families, and aid in
forming hypotheses about the function of novel RNAs.
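The final classification step can be sketched as follows. The per-feature weights are taken as given inputs, and summing the weights of the features found in the query is an assumption about how the per-family score is accumulated; the function name, motif names, and weight values are hypothetical.

```python
def classify(query_features, family_weights):
    """Assign a query to the family with the highest score.

    query_features: set of feature identifiers found in the query.
    family_weights: family -> {feature: weight} from feature selection.
    The per-family score is assumed to be the sum of the weights of the
    family's selected features that occur in the query.
    """
    scores = {
        family: sum(w for f, w in weights.items() if f in query_features)
        for family, weights in family_weights.items()
    }
    # Iterating over sorted names makes tie-breaking deterministic.
    best = max(sorted(scores), key=scores.get)
    return best, scores


# Hypothetical weights for two families and a query with two motifs.
family_weights = {
    "tRNA": {"motif_1": 2.0, "motif_2": 1.0},
    "tmRNA": {"motif_2": 0.5, "motif_3": 4.0},
}
best_family, family_scores = classify({"motif_1", "motif_2"}, family_weights)
```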
5.2.3 RNA structure datasets
The datasets used in this work are a manually curated dataset (Table 1.1) and a dataset
collected from the STRAND database (Table 1.5).
5.3 Results
5.3.1 Feature selection on the fingerprint generated
Using the cIndex-Basic-Min feature selection strategy, we have successfully identified
features that are important to specific RNA families in our manually curated dataset
(Table 1.1), as well as in a dataset downloaded from the STRAND database (Table 1.5).
Each dataset contains only non-redundant sequences with low sequence similarity (<
50%). The manually curated dataset contains more reliable RNA secondary structures
(the curation process is described in section 1.7.1). Because the STRAND dataset is collected
from many different sources, noise and partial structures are likely to be more common in
this dataset.
Table 5.1 shows the statistics of the selected structural motif features. Figure 5.3 shows
that in our feature selection strategy, higher weights are given to the features that are
neither too rare nor too common. This agrees with what is generally accepted (176).
5.3.2 Top unique features selected (in the same order as weights)
Figure 5.4 and Figure 5.5 show the top four structural features found to be the most
important in discriminating each of the RNA families in our datasets (Table 1.1 and
Table 1.5). These top structural motif features highlight characteristics known to be
important in the RNA families (Figure 5.6). Conceptually, this provides us with a means to
decode the relationship between structure and function. In other words, we can
understand the mapping from structural motif features to RNA biological functions. One
can ask whether these features are specific enough, since the actual motif sizes in the
structural motif library range from 1 to 7 stems. From the result in Figure 5.6, we can see
that they are sufficient to describe the RNA family structure.
5.3.3 Validation of RNA structure classification
Table 5.2 shows the RNA structure classification result. The performance of the
classification is good for most of the RNA families included here. Only RNAseP (STRAND)
performs somewhat poorly (0.78). This reflects the poorer curation of the structures in the STRAND
database; the RNaseP dataset we downloaded from the STRAND database contains many
small/partial structures which can be easily misclassified.
To further assess the classification performance, we performed leave-one-out cross validation
(LOOCV). The LOOCV classification rates are shown in Table 5.3. All the RNA families
show a high LOOCV classification rate, with the exception of RNAseP STRAND,
discussed above. This is expected, since that family contains small/partial structures and the
performance on it is not as good as the performance on the other families.
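The leave-one-out procedure itself is generic and can be sketched independently of the feature selection method. In the sketch below, `train` and `predict` are toy stand-ins (a family is modeled as the union of its members' features, and prediction picks the family with the largest overlap); they are assumptions for illustration and are not the scoring scheme of section 5.2.2.

```python
def leave_one_out_rate(samples, train, predict):
    """Fraction of samples classified correctly when each is held out once.

    samples is a list of (features, label) pairs.
    """
    correct = 0
    for i, (features, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]  # hold out sample i
        model = train(training)
        if predict(model, features) == label:
            correct += 1
    return correct / len(samples)


def train(training):
    """Toy model: each family is the union of its members' features."""
    model = {}
    for features, label in training:
        model.setdefault(label, set()).update(features)
    return model


def predict(model, features):
    """Pick the family sharing the most features with the query."""
    return max(sorted(model), key=lambda family: len(model[family] & features))


# Hypothetical structures described by sets of motif identifiers.
samples = [
    ({"a", "b"}, "X"),
    ({"a", "c"}, "X"),
    ({"d", "e"}, "Y"),
    ({"d", "f"}, "Y"),
    ({"g"}, "Y"),  # shares no motif with its family, so it is misclassified
]
rate = leave_one_out_rate(samples, train, predict)
```

The last sample plays the role of a small or partial structure: once held out, nothing in the training data links it to its family, so the rate drops below 1.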
Finally, we classified RNA structures collected from the STRAND source by using the
features selected from the manually curated dataset, and vice versa. The performance
(Table 5.4) is still reasonably good, but slightly worse than the result we saw above.
When using features selected from the manually curated dataset to classify STRAND
structures, the performance is good for tRNA and tmRNA but poor for the other two
families: the RNAseP STRAND and Group I intron STRAND correct classification rates are
around 0.57. Again, we note that the
STRAND dataset contains partial structures and misclassified structures. These
structures are likely to be misclassified, either because they actually belong to other
classes, or because their small size interrupts or truncates the selected features on
which the classification is based.
On the other hand, when using features learned from the STRAND dataset to classify the
manually curated structures, the performance is very good, with one exception: for the
manually curated tRNA structures classified using the tRNA STRAND
selected features, there is only 1 correctly classified case. This is what we expect to see,
since the manually curated tRNA data are based on three-dimensional structures in the PDB
and contain pseudoknotted structures. In contrast, when classifying
tRNA STRAND data using the features selected from the manually curated tRNA data, the performance
is nearly as good as when using the features selected from the tRNA STRAND data (0.93 versus 0.97; STRAND data are
based on secondary structure and do not contain pseudoknots). This indirectly shows
the robustness of our feature selection method.
5.4 Discussion
This feature selection process should be able to find the structural features that are
important to different RNA families. Identification of these features, in turn, should
help create a mapping between structural features and biological function.
The maximum structural motif size used in this study is only 21 edges (7 stems), which is
relatively small compared with the large, complicated graphs seen in biological RNA
structures. This limitation can be overcome by enumerating subgraphs
of biological structures, which would generate bigger motifs that are subgraphs of big
biological structures. As we collect more biological RNA structure data, we will be able
to add bigger structural motifs back into the motif library. That would provide bigger,
biologically relevant structural motifs. The feature selection result would be more
specific and, possibly, more biologically meaningful.
Nevertheless, our results show that the 1 to 7 stem structural motifs are already
sufficient to describe RNA family structures well. As the size of the structural motifs
increases, the match to a structure becomes more specific. In our situation, we would
like the matching of the structural motifs to be neither too specific nor too
random (176). More work needs to be done in order to understand what size of structural
motif achieves the best performance.
From our classification test (Table 5.4), we found that features learned from our reliable
dataset (the manually curated dataset) perform poorly against more poorly annotated
datasets. Interestingly, features learned from the dataset of lower quality (the STRAND
dataset) can give very good classification results. We believe
that the features learned from a low-quality dataset (such as STRAND) still tend
to include features that are representative of the dominant family of structures in the
dataset, which can represent the specific RNA family, as well as extraneous features
arising from misclassified structures. This feature set can be considered a superset
of the features that are important to the specific RNA family. Because all good
features are included in the superset, the reliable RNA structures can still be correctly classified. This
suggests our approach can detect the poorer quality of the uncurated dataset.
One further use is that our feature selection framework could help clean up
RNA structure data in the publicly accessible databases. Obtaining a gold standard
structure dataset is currently a tedious task for RNA structure analysts, and our
strategy can provide a potential solution.
We are continuously retrieving RNA sequence and structure information from available
public databases (NCBI, RFAM, RNASPE database, etc.), and building a comprehensive
RNA fingerprint database containing the structural features, sequence, and function for
each entry. The sequence and function information for the known structures are
available to aid in assignment of function to novel RNA queries.
In the pharmaceutical industry, the design of inhibitory or therapeutic RNAs often begins
with randomly generated RNA sequences of specific lengths, which are then tested for
specific function(s). This random approach is both time
consuming and costly due to the combinatorial search space. With the help of our feature
selection approach, we can identify structural motifs from a relevant biological
sequence pool that would be likely to perform desired functions (based on similarity
to known molecules). For researchers, this significantly reduces the search space for
identifying functional RNA molecules of therapeutic use, and saves time and expense.
Above all, our new application, by making use of molecular structures, could
benefit biologists in many ways. Our web service can be accessed freely by the public
at http://xios.genomics.purdue.edu.
5.5 Future directions
The RNA structure classification problem is complicated. Different feature
selection/extraction methods, such as PCA, SVM, or cosine-distance clustering, machine
learning methods (e.g., Naive Bayes), and semi-supervised statistical methods could be
applied. In the end we would like to find the "sequence to structure N to 1 mapping" as
well as a "structure to function N to N mapping". With this knowledge, RNA family
classification, annotation, and RNA structural and functional prediction will greatly
benefit from our new approach.
Figure 5.1 Idea of graph containment search.
Modified from (155).
Figure 5.2 Graph feature matrix.
This matrix records, for each feature (f), the graphs (g) in the database that contain it.
Figure 5.5 Selected top unique structural features in dataset Table 1.5.
Structural features selected by using Algorithm 5.1 for RNA families in the STRAND dataset. The order of appearance is the same as the order of the weights contributing to that specific RNA family.
Figure 5.6 Link from structure to function.
Left, selected features for RNaseP manually curated and STRAND (blue frame), as well as RNAseP secondary structure. Right, selected features for tRNA manually curated and STRAND (blue frame), as well as tRNA secondary structure plus pseudoknots found in 3D structure.
Table 5.1 Statistics of the selected structural features for four RNA families from two datasets.
RNA family                        Selected feature #
tRNA manually curated             25
RNAseP manually curated           1307
Group I intron manually curated   131
tmRNA manually curated            276
tRNA STRAND                       15
RNAseP STRAND                     753
Group I intron STRAND             85
tmRNA STRAND                      176
Table 5.2 Classification performance
RNA family                        Classification performance
tRNA manually curated             16 out of 16 (1.00)
RNAseP manually curated           40 out of 40 (1.00)
Group I intron manually curated   36 out of 36 (1.00)
tmRNA manually curated            115 out of 117 (0.98)
tRNA STRAND                       585 out of 601 (0.97)
RNAseP STRAND                     28 out of 36 (0.78)
Group I intron STRAND             21 out of 21 (1.00)
tmRNA STRAND                      30 out of 30 (1.00)
Table 5.3 Leave one out cross validation result
RNA family                        LOOCV   Sample size
tRNA manually curated             0.94    16
RNAseP manually curated           1.00    40
Group I intron manually curated   1.00    36
tmRNA manually curated            0.99    117
tRNA STRAND                       0.97    601
RNAseP STRAND                     0.58    36
Group I intron STRAND             1.00    21
tmRNA STRAND                      1.00    30
Table 5.4 Classification test
Manually curated features tested on STRAND data
RNA family                        Classification performance
tRNA STRAND                       558 out of 601 (0.93)
RNAseP STRAND                     21 out of 36 (0.58)
Group I intron STRAND             12 out of 21 (0.57)
tmRNA STRAND                      30 out of 30 (1.00)

STRAND features tested on manually curated data
RNA family                        Classification performance
tRNA manually curated             1 out of 16 (0.06)
RNAseP manually curated           40 out of 40 (1.00)
Group I intron manually curated   34 out of 36 (0.94)
tmRNA manually curated            115 out of 117 (0.98)
LIST OF REFERENCES
1. Reuter, J.S. and Mathews, D.H. (2010) RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 11, 129-137.
2. Crick, F.H. (1958) On protein synthesis. Symp Soc Exp Biol, 12, 138-163.
3. Mills, D.R., Peterson, R.L. and Spiegelman, S. (1967) An extracellular Darwinian experiment with a self-duplicating nucleic acid molecule. Proc Natl Acad Sci U S A, 58, 217-224.
4. Spiegelman, S. (1971) An approach to the experimental analysis of precellular evolution. Q Rev Biophys, 4, 213-253.
5. Kramer, F.R., Mills, D.R., Cole, P.E., Nishihara, T. and Spiegelman, S. (1974) Evolution in vitro: sequence and phenotype of a mutant RNA resistant to ethidium bromide. Journal of molecular biology, 89, 719-736.
6. Eigen, M. (1971) Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften, 58, 465-523.
7. Biebricher, C.K., Eigen, M. and Gardiner, W.C., Jr. (1983) Kinetics of RNA replication. Biochemistry, 22, 2544-2559.
8. Biebricher, C.K., Eigen, M. and Gardiner, W.C., Jr. (1985) Kinetics of RNA replication: competition and selection among self-replicating RNA species. Biochemistry, 24, 6550-6560.
9. Biebricher, C.K. (1987) Replication and evolution of short-chained RNA species replicated by Q beta replicase. Cold Spring Harb Symp Quant Biol, 52, 299-306.
10. Woese, C. (1967) The Genetic Code: The Molecular Basis for Genetic Expression. Harper.
11. Cech, T. (1986) RNA as an enzyme. Scientific American, 255, 64-75.
12. Cech, T.R. (1990) Self-splicing of group I introns. Annu Rev Biochem, 59, 543-568.
13. Kruger, K., Grabowski, P.J., Zaug, A.J., Sands, J., Gottschling, D.E. and Cech, T.R. (1982) Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena. Cell, 31, 147-157.
14. Guerrier-Takada, C., Gardiner, K., Marsh, T., Pace, N. and Altman, S. (1983) The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell, 35, 849-857.
15. Guerrier-Takada, C. and Altman, S. (1984) Catalytic activity of an RNA molecule prepared by transcription in vitro. Science, 223, 285-286.
16. Gilbert, W. (1986) Origin of life: The RNA world. Nature, 319, 618-618.
17. Joyce, G.F. (1989) RNA evolution and the origins of life. Nature, 338, 217-224.
18. Joyce, G.F. (1991) The rise and fall of the RNA world. New Biol, 3, 399-407.
19. Freeland, S.J., Knight, R.D. and Landweber, L.F. (1999) Do Proteins Predate DNA? Science, 286, 690-692.
20. Watson, J.D. and Crick, F.H. (1953) Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171, 737-738.
21. Crick, F.H. (1966) Codon--anticodon pairing: the wobble hypothesis. Journal of molecular biology, 19, 548-555.
22. Pyle, A.M., Murphy, F.L. and Cech, T.R. (1992) RNA substrate binding site in the catalytic core of the Tetrahymena ribozyme. Nature, 358, 123-128.
23. Cate, J.H., Gooding, A.R., Podell, E., Zhou, K., Golden, B.L., Kundrot, C.E., Cech, T.R. and Doudna, J.A. (1996) Crystal Structure of a Group I Ribozyme Domain: Principles of RNA Packing. Science, 273, 1678-1685.
25. Illangasekare, M. and Yarus, M. (1999) A tiny RNA that catalyzes both aminoacyl-RNA and peptidyl-RNA synthesis. RNA, 5, 1482-1489.
26. Lee, N., Bessho, Y., Wei, K., Szostak, J.W. and Suga, H. (2000) Ribozyme-catalyzed tRNA aminoacylation. Nat Struct Biol, 7, 28-33.
27. Johnston, W.K., Unrau, P.J., Lawrence, M.S., Glasner, M.E. and Bartel, D.P. (2001) RNA-catalyzed RNA polymerization: accurate and general RNA-templated primer extension. Science, 292, 1319-1325.
28. Baskerville, S. and Bartel, D.P. (2002) A ribozyme that ligates RNA to protein. Proceedings of the National Academy of Sciences of the United States of America, 99, 9154-9159.
29. Joyce, G.F. (2002) The antiquity of RNA-based evolution. Nature, 418, 214-221.
30. Serganov, A. and Patel, D.J. (2007) Ribozymes, riboswitches and beyond: regulation of gene expression without proteins. Nat Rev Genet, 8, 776-790.
31. Strobel, S.A. and Cochrane, J.C. (2007) RNA catalysis: ribozymes, ribosomes, and riboswitches. Curr Opin Chem Biol, 11, 636-643.
32. Jeffares, D.C., Poole, A.M. and Penny, D. (1998) Relics from the RNA world. J Mol Evol, 46, 18-36.
33. Moore, P.B. and Steitz, T.A. (2002) The involvement of RNA in ribosome function. Nature, 418, 229-235.
34. Doudna, J.A. and Cech, T.R. (2002) The chemical repertoire of natural ribozymes. Nature, 418, 222-228.
35. Maeda, N., Kasukawa, T., Oyama, R., Gough, J., Frith, M., Engstrom, P.G., Lenhard, B., Aturaliya, R.N., Batalov, S., Beisel, K.W. et al. (2006) Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genet, 2, e62.
36. Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R., Gingeras, T.R., Margulies, E.H., Weng, Z., Snyder, M., Dermitzakis, E.T., Thurman, R.E. et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447, 799-816.
37. Ravasi, T., Suzuki, H., Pang, K.C., Katayama, S., Furuno, M., Okunishi, R., Fukuda, S., Ru, K., Frith, M.C., Gongora, M.M. et al. (2006) Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Res, 16, 11-19.
38. Majdalani, N., Chen, S., Murrow, J., St John, K. and Gottesman, S. (2001) Regulation of RpoS by a novel small RNA: the characterization of RprA. Mol Microbiol, 39, 1382-1394.
39. Havilio, M., Levanon, E.Y., Lerman, G., Kupiec, M. and Eisenberg, E. (2005) Evidence for abundant transcription of non-coding regions in the Saccharomyces cerevisiae genome. BMC Genomics, 6, 93-100.
40. David, L., Huber, W., Granovskaia, M., Toedling, J., Palm, C.J., Bofkin, L., Jones, T., Davis, R.W. and Steinmetz, L.M. (2006) A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci U S A, 103, 5320-5325.
41. Manak, J.R., Dike, S., Sementchenko, V., Kapranov, P., Biemar, F., Long, J., Cheng, J., Bell, I., Ghosh, S., Piccolboni, A. et al. (2006) Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nature genetics, 38, 1151-1158.
42. Miura, F., Kawaguchi, N., Sese, J., Toyoda, A., Hattori, M., Morishita, S. and Ito, T. (2006) A large-scale full-length cDNA analysis to explore the budding yeast transcriptome. Proc Natl Acad Sci U S A, 103, 17846-17851.
43. Ravasi, T., Suzuki, H., Pang, K.C., Katayama, S., Furuno, M., Okunishi, R., Fukuda, S., Ru, K., Frith, M.C., Gongora, M.M. et al. (2006) Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Research, 16, 11-19.
44. He, H., Wang, J., Liu, T., Liu, X.S., Li, T., Wang, Y., Qian, Z., Zheng, H., Zhu, X., Wu, T. et al. (2007) Mapping the C. elegans noncoding transcriptome with a whole-genome tiling microarray. Genome Res, 17, 1471-1477.
45. Li, D., Willkomm, D.K., Schon, A. and Hartmann, R.K. (2007) RNase P of the Cyanophora paradoxa cyanelle: a plastid ribozyme. Biochimie, 89, 1528-1538.
46. Wilhelm, B.T., Marguerat, S., Watt, S., Schubert, F., Wood, V., Goodhead, I., Penkett, C.J., Rogers, J. and Bahler, J. (2008) Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature, 453, 1239-1243.
47. Bompfunewerer, A.F., Flamm, C., Fried, C., Fritzsch, G., Hofacker, I.L., Lehmann, J., Missal, K., Mosig, A., Muller, B., Prohaska, S.J. et al. (2005) Evolutionary patterns of non-coding RNAs. Theory Biosci, 123, 301-369.
48. Caetano-Anollés, G. (2010) Evolutionary Genomics and Systems Biology. Wiley-Blackwell.
49. Staple, D.W. and Butcher, S.E. (2005) Pseudoknots: RNA structures with diverse functions. PLoS Biol, 3, e213.
50. Puglisi, J.D., Wyatt, J.R. and Tinoco, I. (1991) RNA pseudoknots. Accounts of Chemical Research, 24, 152-158.
51. Mans, R.M., Van Steeg, M.H., Verlaan, P.W., Pleij, C.W. and Bosch, L. (1992) Mutational analysis of the pseudoknot in the tRNA-like structure of turnip yellow mosaic virus RNA. Aminoacylation efficiency and RNA pseudoknot stability. Journal of molecular biology, 223, 221-232.
52. Mans, R.M., Pleij, C.W. and Bosch, L. (1991) tRNA-like structures. Structure, function and evolutionary significance. Eur J Biochem, 201, 303-324.
53. Brierley, I., Rolley, N.J., Jenner, A.J. and Inglis, S.C. (1991) Mutational analysis of the RNA pseudoknot component of a coronavirus ribosomal frameshifting signal. Journal of molecular biology, 220, 889-902.
54. Tzeng, T.H., Tu, C.L. and Bruenn, J.A. (1992) Ribosomal frameshifting requires a pseudoknot in the Saccharomyces cerevisiae double-stranded RNA virus. J Virol, 66, 999-1006.
55. Chamorro, M., Parkin, N. and Varmus, H.E. (1992) An RNA pseudoknot and an optimal heptameric shift site are required for highly efficient ribosomal frameshifting on a retroviral messenger RNA. Proc Natl Acad Sci U S A, 89, 713-717.
56. ten Dam, E.B., Pleij, C.W. and Bosch, L. (1990) RNA pseudoknots: translational frameshifting and readthrough on viral RNAs. Virus Genes, 4, 121-136.
57. Dinman, J.D., Icho, T. and Wickner, R.B. (1991) A -1 ribosomal frameshift in a double-stranded RNA virus of yeast forms a gag-pol fusion protein. Proc Natl Acad Sci U S A, 88, 174-178.
58. Wills, N.M., Gesteland, R.F. and Atkins, J.F. (1991) Evidence that a downstream pseudoknot is required for translational read-through of the Moloney murine leukemia virus gag stop codon. Proceedings of the National Academy of Sciences, 88, 6991-6995.
59. Gallie, D.R., Feder, J.N., Schimke, R.T. and Walbot, V. (1991) Functional analysis of the tobacco mosaic virus tRNA-like structure in cytoplasmic gene regulation. Nucleic acids research, 19, 5031-5036.
60. Westhof, E. and Jaeger, L. (1992) RNA pseudoknots. Current Opinion in Structural Biology, 2, 327-333.
61. Nussinov, R., Pieczenik, G., Griggs, J.R. and Kleitman, D.J. (1978) Algorithms for Loop Matchings. SIAM Journal on Applied Mathematics, 35, 68-82.
62. Konings, D.A. and Hogeweg, P. (1989) Pattern analysis of RNA secondary structure similarity and consensus of minimal-energy folding. Journal of molecular biology, 207, 597-614.
63. Le, S.-Y., Nussinov, R. and Maizel, J.V. (1989) Tree graphs of RNA secondary structures and their comparisons. Computers and Biomedical Research, 22, 461-473.
64. Zuker, M. and Sankoff, D. (1984) RNA secondary structures and their prediction. Bulletin of Mathematical Biology, 46, 591-621.
65. Shapiro, B.A. and Zhang, K. (1990) Comparing multiple RNA secondary structures using tree comparisons. Computer applications in the biosciences : CABIOS, 6, 309-318.
66. Eddy, S.R. and Durbin, R. (1994) RNA sequence analysis using covariance models. Nucleic acids research, 22, 2079-2088.
67. Zuker, M. (1989), Science, Vol. 244, pp. 48-52.
68. Zuker, M. (1994) Prediction of RNA secondary structure by energy minimization. Methods Mol. Biol, 25, 267-294.
69. McCaskill, J.S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29, 1105-1119.
70. Cook, S.A. (1971), Proceedings of the third annual ACM symposium on Theory of computing. ACM, Shaker Heights, Ohio, United States, pp. 151-158.
71. Hagadone, T.R. (1992) Molecular substructure similarity searching: efficient retrieval in two-dimensional structure databases. Journal of Chemical Information and Computer Sciences, 32, 515-521.
72. Willett, P., Barnard, J.M. and Downs, G.M. (1998) Chemical Similarity Searching. Journal of Chemical Information and Computer Sciences, 38, 983-996.
73. Hansch, C., Muir, R.M., Fujita, T., Maloney, P.P., Geiger, F. and Streich, M. (1963) The Correlation of Biological Activity of Plant Growth Regulators and Chloromycetin Derivatives with Hammett Constants and Partition Coefficients. Journal of the American Chemical Society, 85, 2817-2824.
74. Gan, H.H., Pasquali, S. and Schlick, T. (2003) Exploring the repertoire of RNA secondary motifs using graph theory; implications for RNA design. Nucleic acids research, 31, 2926-2943.
75. Harary, F. and Prins, G. (1959) The number of homeomorphically irreducible trees, and other species. Acta Mathematica, 101, 141-162.
76. Harary, F. (1969) Graph Theory. Addison-Wesley, Reading, MA.
77. Schuster, P. (1997) Genotypes with phenotypes: adventures in an RNA toy world. Biophys Chem, 66, 75-110.
78. Gierasch, L.M. and King, J. (eds.) (1990) Protein Folding: Deciphering the Second Half of the Genetic Code. American Association for the Advancement of Science.
79. Doudna, J.A. (2000) Structural genomics of RNA. Nat Struct Biol, 7 Suppl, 954-956.
80. Ogurtsov, A.Y., Shabalina, S.A., Kondrashov, A.S. and Roytberg, M.A. (2006) Analysis of internal loops within the RNA secondary structure in almost quadratic time. Bioinformatics (Oxford, England), 22, 1317-1324.
81. Do, C.B., Woods, D.A. and Batzoglou, S. (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics (Oxford, England), 22, e90-98.
82. Flamm, C., Fontana, W., Hofacker, I.L. and Schuster, P. (2000) RNA folding at elementary step resolution. RNA, 6, 325-338.
83. Zuker, M. and Stiegler, P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic acids research, 9, 133-148.
84. Ying, X., Luo, H., Luo, J. and Li, W. (2004) RDfolder: a web server for prediction of RNA secondary structure. Nucleic acids research, 32, W150-153.
85. Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, L.S., Tacker, M. and Schuster, P. (1994) Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie / Chemical Monthly, 125, 167-188.
86. Hofacker, I.L. and Stadler, P.F. (2006) Memory efficient folding algorithms for circular RNA secondary structures. Bioinformatics (Oxford, England), 22, 1172-1176.
87. Danilova, L.V., Pervouchine, D.D., Favorov, A.V. and Mironov, A.A. (2006) RNAKinetics: a web server that models secondary structure kinetics of an elongating RNA. J Bioinform Comput Biol, 4, 589-596.
88. Ding, Y., Chan, C.Y. and Lawrence, C.E. (2004) Sfold web server for statistical folding and rational design of nucleic acids. Nucleic acids research, 32, W135-141.
89. Dawson, W., Fujiwara, K., Kawai, G., Futamura, Y. and Yamamoto, K. (2006) A method for finding optimal RNA secondary structures using a new entropy model (vsfold). Nucleosides Nucleotides Nucleic Acids, 25, 171-189.
90. Ren, J., Rastegari, B., Condon, A. and Hoos, H.H. (2005) HotKnots: heuristic prediction of RNA secondary structures including pseudoknots. RNA, 11, 1494-1504.
91. Huang, C.H., Lu, C.L. and Chiu, H.T. (2005) A heuristic approach for detecting RNA H-type pseudoknots. Bioinformatics (Oxford, England), 21, 3501-3508.
92. Xayaphoummine, A., Bucher, T. and Isambert, H. (2005) Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots. Nucleic acids research, 33, W605-610.
93. Zadeh, J.N., Steenberg, C.D., Bois, J.S., Wolfe, B.R., Pierce, M.B., Khan, A.R., Dirks, R.M. and Pierce, N.A. (2011) NUPACK: Analysis and design of nucleic acid systems. J Comput Chem, 32, 170-173.
94. Reeder, J. and Giegerich, R. (2004) Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5, 104-115.
95. Rivas, E. and Eddy, S.R. (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of molecular biology, 285, 2053-2068.
96. Huang, X. and Ali, H. (2007) High sensitivity RNA pseudoknot prediction. Nucleic acids research, 35, 656-663.
97. Wuchty, S., Fontana, W., Hofacker, I.L. and Schuster, P. (1999) Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers, 49, 145-165.
98. Steffen, P., Voss, B., Rehmsmeier, M., Reeder, J. and Giegerich, R. (2006) RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics (Oxford, England), 22, 500-503.
99. Clote, P. (2005) RNALOSS: a web server for RNA locally optimal secondary structures. Nucleic acids research, 33, W600-W604.
100. Markham, N.R. and Zuker, M. (2008) UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol, 453, 3-31.
101. Shapiro, B.A., Kasprzak, W., Grunewald, C. and Aman, J. (2006) Graphical exploratory data analysis of RNA secondary structure dynamics predicted by the massively parallel genetic algorithm. J Mol Graph Model, 25, 514-531.
102. Tinoco, I., Jr., Borer, P.N., Dengler, B., Levin, M.D., Uhlenbeck, O.C., Crothers, D.M. and Bralla, J. (1973) Improved estimation of secondary structure in ribonucleic acids. Nat New Biol, 246, 40-41.
103. Tinoco, I., Jr., Uhlenbeck, O.C. and Levine, M.D. (1971) Estimation of secondary structure in ribonucleic acids. Nature, 230, 362-367.
104. Bellman, R. (1952) On the Theory of Dynamic Programming. Proc Natl Acad Sci U S A, 38, 716-719.
105. Nussinov, R. and Jacobson, A.B. (1980) Fast Algorithm for Predicting the Secondary Structure of Single-Stranded RNA. Proc. Natl. Acad. Sci. U. S. A., 77, 6309-6313.
106. Reeder, J., Steffen, P. and Giegerich, R. (2007) pknotsRG: RNA pseudoknot folding including near-optimal structures and sliding windows. Nucleic Acids Res., 35, W320-W324.
107. Sperschneider, J., Datta, A. and Wise, M.J. (2011) Heuristic RNA pseudoknot prediction including intramolecular kissing hairpins. RNA, 17, 27-38.
108. Sperschneider, J. and Datta, A. (2010) DotKnot: pseudoknot prediction using the probability dot plot under a refined energy model. Nucleic acids research, 38, e103.
109. Schreiber, S.L. (2000) Target-Oriented and Diversity-Oriented Organic Synthesis in Drug Discovery. Science, 287, 1964-1969.
on RNA replication. Pure Appl. Chem., 56, 967-978.
114. Horwitz, M.S., Dube, D.K. and Loeb, L.A. (1989) Selection of new biological activities from random nucleotide sequences: evolutionary and practical considerations. Genome, 31, 112-117.
115. Ellington, A.D. and Szostak, J.W. (1990) In vitro selection of RNA molecules that bind specific ligands. Nature, 346, 818-822.
116. Bartel, D.P. and Szostak, J.W. (1993) Isolation of new ribozymes from a large pool of random sequences. Science, 261, 1411-1418.
117. Chapman, K.B. and Szostak, J.W. (1994) In vitro selection of catalytic RNAs. Curr Opin Struct Biol, 4, 618-622.
118. Lorsch, J.R. and Szostak, J.W. (1994) In vitro evolution of new ribozymes with polynucleotide kinase activity. Nature, 371, 31-36.
119. The tmRNA Website. http://www.indiana.edu/~tmrna/.
120. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data Bank. Nucleic acids research, 28, 235-242.
121. Andronescu, M., Bereg, V., Hoos, H.H. and Condon, A. (2008) RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics, 9, 340.
122. Ellis, J.C. and Brown, J.W. (2009) The RNase P family. RNA Biol, 6, 362-369.
123. Yang, H., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H. and Westhof, E. (2003) Tools for the automatic identification and classification of RNA base pairs. Nucleic acids research, 31, 3450-3460.
124. Ellis, J.C. and Brown, J.W. (2009) The RNase P family. RNA Biol., 6, 362-369.
125. Brown, J.W. (1999) The Ribonuclease P Database. Nucleic Acids Res., 27, 314-314.
126. Yang, H.W., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H. and Westhof, E. (2003) Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res., 31, 3450-3460.
127. Cannone, J.J., Subramanian, S., Schnare, M.N., Collett, J.R., D'Souza, L.M., Du, Y.S., Feng, B., Lin, N., Madabusi, L.V., Muller, K.M. et al. (2002) The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3, 2.
128. Williams, K.P. (2002) The tmRNA Website: invasion by an intron. Nucleic Acids Res., 30, 179-182.
129. Zarrinkar, P.P. and Williamson, J.R. (1996) The kinetic folding pathway of the Tetrahymena ribozyme reveals possible similarities between RNA and protein folding. Nat Struct Biol, 3, 432-438.
130. Doherty, E.A. and Doudna, J.A. (1997) The P4-P6 domain directs higher order folding of the Tetrahymena ribozyme core. Biochemistry, 36, 3159-3169.
131. Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M. and Turner, D.H. (2004) Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci U S A, 101, 7287-7292.
132. Kim, N., Shiffeldrim, N., Gan, H.H. and Schlick, T. (2004) Candidates for novel RNA topologies. Journal of molecular biology, 341, 1129-1144.
133. Yan, X. and Han, J. (2002), Proceedings of the 2002 IEEE International Conference on Data Mining. IEEE Computer Society, Maebashi City, Japan, pp. 721.
134. Yan, X. and Han, J. (2003), Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, Washington, D.C.
135. Zaki, M.J. (2002), Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, Edmonton, Alberta, Canada.
136. Jaeger, J.A., Turner, D.H. and Zuker, M. (1989) Improved predictions of secondary structures for RNA. Proc Natl Acad Sci U S A, 86, 7706-7710.
137. Wang, Z. and Zhang, K. (2001), Proceedings of the 26th International Symposium on Mathematical Foundations of Computer Science. Springer-Verlag, pp. 690-702.
138. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics, 25, 25-29.
139. Grate, L., Herbster, M., Hughey, R., Haussler, D., Mian, I.S. and Noller, H. (1994) RNA modeling using Gibbs sampling and stochastic context free grammars. Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB, 2, 138-146.
140. Lowe, T.M. and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research, 25, 955-964.
141. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. Journal of molecular biology, 215, 403-410.
142. Pudlák, P., Rödl, V. and Savický, P. (1988) Graph complexity. Acta Informatica, 25, 515-535.
143. Byun, Y. and Han, K. (2009) PseudoViewer3: generating planar drawings of large-scale RNA structures with pseudoknots. Bioinformatics (Oxford, England), 25, 1435-1437.
144. Darty, K., Denise, A. and Ponty, Y. (2009) VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics (Oxford, England), 25, 1974-1975.
145. Janssen, S., Reeder, J. and Giegerich, R. (2008) Shape based indexing for faster search of RNA family databases. BMC Bioinformatics, 9, 131.
146. Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A. and Eddy, S.R. (2003) Rfam: an RNA family database. Nucleic acids research, 31, 439-441.
147. Weinberg, Z. and Ruzzo, W.L. (2004) Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics (Oxford, England), 20, i334-i341.
148. Weinberg, Z. and Ruzzo, W.L. (2006) Sequence-based heuristics for faster annotation of non-coding RNA families. Bioinformatics (Oxford, England), 22, 35-39.
149. Gupta, A., Rahman, R., Li, K. and Gribskov, M. (2011) Identifying Complete RNA Structural Ensembles Including Pseudoknots. Submitted.
150. Eppstein, D. (1995), Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, San Francisco, California, United States, pp. 632-640.
151. Kukluk, J.P., Holder, L.B. and Cook, D.J. (2004) Algorithm and experiments in testing planar graphs for isomorphism. Journal of Graph Algorithms and Applications, 8, 313-356.
152. Yan, X., Yu, P.S. and Han, J. (2004), Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM, Paris, France, pp. 335-346.
153. Yan, X., Yu, P.S. and Han, J. (2005), Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, Baltimore, Maryland, pp. 766-777.
154. Yan, X., Yu, P.S. and Han, J. (2005) Graph indexing based on discriminative frequent structure analysis. ACM Trans. Database Syst., 30, 960-993.
155. Chen, C., Yan, X., Yu, P.S., Han, J., Zhang, D.-Q. and Gu, X. (2007), Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, Vienna, Austria, pp. 926-937.
156. Williams, D.W., Huan, J. and Wang, W. (2007), Proceedings of 23rd International Conference on Data Engineering. IEEE, Istanbul, Turkey, pp. 976-985.
157. Zhao, P., Yu, J.X. and Yu, P.S. (2007), Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, Vienna, Austria, pp. 938-949.
158. Shang, H., Zhang, Y., Lin, X. and Yu, J.X. (2008) Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. Proc. VLDB Endow., 1, 364-375.
159. Tian, Y. and Patel, J.M. (2008), Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pp. 963-972.
160. Li, K., Rahman, R., Gupta, A., Siddavatam, P. and Gribskov, M. (2008) In Mandoiu, I., Sunderraman, R. and Zelikovsky, A. (eds.), Proceeding of 2008 International Symposium on Bioinformatics Research and Applications. Bioinformatics Research and Applications, Atlanta, GA, Vol. 4983/2008, pp. 317-330.
161. ChemIDplus. http://chem.sis.nlm.nih.gov/chemidplus.
162. Wang, Y., Xiao, J., Suzek, T.O., Zhang, J., Wang, J. and Bryant, S.H. (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic acids research, 37, W623-633.
163. Seiler, K.P., George, G.A., Happ, M.P., Bodycombe, N.E., Carrinski, H.A., Norton, S., Brudz, S., Sullivan, J.P., Muhlich, J., Serrano, M. et al. (2008) ChemBank: a small-molecule screening and cheminformatics resource database. Nucleic acids research, 36, D351-D359.
164. Liu, T., Lin, Y., Wen, X., Jorissen, R.N. and Gilson, M.K. (2007) BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic acids research, 35, D198-201.
166. Eddy, S.R. (2002) A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3, 18.
167. Fera, D., Kim, N., Shiffeldrim, N., Zorn, J., Laserson, U., Gan, H.H. and Schlick, T. (2004) RAG: RNA-As-Graphs web resource. BMC Bioinformatics, 5, 88.
168. Gan, H.H., Fera, D., Zorn, J., Shiffeldrim, N., Tang, M., Laserson, U., Kim, N. and Schlick, T. (2004) RAG: RNA-As-Graphs database—concepts, analysis, and features. Bioinformatics (Oxford, England), 20, 1285-1291.
169. Sonnhammer, E.L., Eddy, S.R., Birney, E., Bateman, A. and Durbin, R. (1998) Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic acids research, 26, 320-322.
170. Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Finn, R.D. and Sonnhammer, E.L. (1999) Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic acids research, 27, 260-262.
171. Sigrist, C.J., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., Bairoch, A. and Bucher, P. (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform, 3, 265-274.
172. Washietl, S., Hofacker, I.L., Lukasser, M., Huttenhofer, A. and Stadler, P.F. (2005) Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat Biotechnol, 23, 1383-1390.
173. Pedersen, J.S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad-Toh, K., Lander, E.S., Kent, J., Miller, W. and Haussler, D. (2006) Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol, 2, e33.
174. Giegerich, R., Voss, B. and Rehmsmeier, M. (2004) Abstract shapes of RNA. Nucleic acids research, 32, 4843-4851.
175. Hofacker, I.L., Fekete, M. and Stadler, P.F. (2002) Secondary structure prediction for aligned RNA sequences. Journal of molecular biology, 319, 1059-1066.
176. Zipf, G. (1949) Human behavior and the principle of least effort: An introduction to human ecology. Addison-Wesley Press, Oxford, England.
VITA
Kejie Li
Department of Biological Sciences, Purdue University
Education
B.S., Biological Sciences, 2004, Sichuan University, Chengdu, Sichuan, P.R. China
Ph.D., Biological Sciences, 2011, Purdue University, West Lafayette, Indiana
Kejie Li was born in Chengdu, Sichuan Province, P.R. China, on June 4th, 1982. Kejie grew up in his hometown and entered Sichuan University in 2000. At Sichuan University, Kejie was selected for an Educational Exchange Program and spent his junior year at the University of Washington, Seattle, USA, as a visiting student. Kejie graduated from Sichuan University in 2004 with a Bachelor’s Degree in Biological Sciences. In the fall of the same year, Kejie was admitted to the Bioinformatics master's program at Wageningen University, Wageningen, Netherlands. In the fall semester of 2005, Kejie was admitted to the PhD program in the Department of Biological Sciences at Purdue University, West Lafayette, USA, and joined the laboratory of Dr. Michael Gribskov. His research focuses on understanding RNA structure and function relationships. Kejie finished his PhD studies and received his Ph.D. degree in August 2011. Kejie will pursue postdoctoral studies at the Broad Institute, Boston, USA.
PUBLICATIONS
Li, K., Gupta, A., Rahman, R. and Gribskov, M. (2011) RNA structure topological
pattern study reveals link between topology and function. (In preparation)
Li, K., Gupta, A., Rahman, R. and Gribskov, M. (2011) Matching unknown RNA
structures: RNA XIOS topological pattern database. (In preparation)
Gupta, A., Rahman, R., Li, K. and Gribskov, M. (2011) Identifying Complete RNA
Structural Ensembles Including Pseudoknots. (Submitted)
Banks, J.A., Nishiyama, T., Hasebe, M., Bowman, J.L., Gribskov, M., Li, K. et al. (2011)
The Selaginella Genome Identifies Genetic Changes Associated with the Evolution of
Vascular Plants. Science, 332, 960-963.
Li, K., Rahman, R., Gupta, A., Siddavatam, P. and Gribskov, M. (2008) In Mandoiu, I.,
Sunderraman, R. and Zelikovsky, A. (eds.), Proceeding of 2008 International Symposium
on Bioinformatics Research and Applications. Bioinformatics Research and Applications,