CLASSIFICATION OF CODING AND NON-CODING RNA IN RNA-SEQ DATA
by
Hisanaga Mark Okada
B.Sc., Simon Fraser University, 2008
a Thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science
in the School of Computing Science
© Hisanaga Mark Okada 2011
SIMON FRASER UNIVERSITY
Spring 2011
All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.
APPROVAL
Name: Hisanaga Mark Okada
Degree: Master of Science
Title of Thesis: Classification of coding and non-coding RNA in RNA-Seq
data
Examining Committee: Dr. Anoop Sarkar
Associate Professor, Computing Science
Simon Fraser University
Chair
Dr. Martin Ester
Professor, Computing Science
Simon Fraser University
Senior Supervisor
Dr. Cenk Sahinalp
Professor, Computing Science
Simon Fraser University
Supervisor
Dr. Kay Wiese
Associate Professor, Computing Science
Simon Fraser University
Examiner
Date Approved: February 28, 2011
Declaration of Partial Copyright Licence
The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.
The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the “Institutional Repository” link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.
The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.
It is understood that copying or publication of this work for financial gain shall not be allowed without the author’s written permission.
Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.
While licensing SFU to permit the above uses, the author retains copyright in the thesis, project or extended essays, including the right to change the work for subsequent purposes, including editing and publishing the work in whole or in part, and licensing other parties, as the author may desire.
The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.
Simon Fraser University Library Burnaby, BC, Canada
Abstract
Recently, the coverage of non-protein-coding RNA in the scientific literature has expanded
dramatically. While the functions for many are unknown, strong interest in this aspect
of cellular biology is driving development of methods for detecting non-coding genes and
transcripts.
During the same period, RNA sequencing's high throughput and high spatial resolution
have established it as the preferred method for characterising transcriptomes. Many groups
are now sequencing transcriptomes. De novo transcriptome assembly methods are being
developed to address cases in which no reference genome is available.
We propose a methodology that is compatible with de novo transcriptome assembly,
that uses sequence, structural and genomic features to classify transcripts as non-coding vs.
protein-coding RNA, and to classify different non-coding RNA types. We have applied our
technique on a variety of known RNA sequences and have explored its use on contigs from
the Trans-ABySS assembly pipeline for RNA-Seq data from normal mouse tissues.
To family and friends
“As iron sharpens iron, so one man sharpens another”
— Proverbs 27:17
Acknowledgments
I wish to express my deepest gratitude to the many individuals whose support and assistance
made the work described in this thesis possible.
I thank my senior supervisor, Martin Ester, for giving me the academic and personal
guidance I needed. I am grateful for his patience and encouragement during the entire
length of my research. I wish to also thank the members of my committee for their invaluable
counsel. I wish to thank the members of the Data Mining Lab, and the Computing Science
Department at Simon Fraser University for providing the environment I needed to perform
this research. Thanks especially to Phuong Dao for his expertise in countless matters.
This work was possible because of our collaboration with the Michael Smith Genome
Sciences Centre (GSC). I thank Inanc Birol, Jacqueline Schein, Pamela Hoodless and especially
Gordon Robertson for providing me with so many opportunities, and for going above
and beyond their supervisory roles. I gratefully acknowledge the GSC's making available the
seven mouse transcriptome datasets generated in the Genome Canada MORGEN project. I
would like to acknowledge in particular: Sam Lee for generating the RNA reagents; Yongjun
Zhao, who manages the library construction teams; Nina Thiessen and An He, who apply the
GSC's production WTSS pipeline; and Shaun Jackman, Readman Chiu, Rong She, Jenny
Qian and Karen Mungall for de novo contig data from ABySS and Trans-ABySS.
This work was funded by the Canadian Institute of Health Research / Michael Smith
Foundation for Health Research Bioinformatics Training Program. I am extremely grateful
that they have provided such a supportive community for bioinformatics research. I wish
to acknowledge in particular Marco Marra, Steve Jones and Sharon Ruschkowski.
Lastly, I wish to thank my family and friends for their unconditional love and support.
As chaotic as it seemed at times, they kept me grounded. Thanks to them, I will always
acid composition, amino acid composition, and protein complexity. These sequence-based
features are properties calculated from the sequence at hand. Second, because non-coding
RNA transcripts have functional secondary structures, a number of approaches assume that
the secondary structures predicted for a sequence can be used to calculate the likelihood
of the transcript being non-coding. Examples include RNApfold and RNAz, both of which
are a part of or rely on the Vienna RNA package [49, 50]. As structural features are
more computationally expensive, attempts have emerged that use sliding windows. Finally,
a number of approaches use genomically-mapped evidence like transcript and expressed
sequence tag (EST) alignments, chromatin profiles and evolutionarily conserved regions.
For this project, we used mapped RNA-Seq reads, mapped de novo contigs, chromatin
profiles, and conservation data.
RNA-Seq, based on second generation deep sequencing technologies, is an effective tool
for quantifying the expression levels of the transcriptome using short sequence reads originating
from fragmented transcripts [132]. Although RNA-Seq has been primarily used to
detect protein coding transcripts, the technology has increasingly been applied
to detect non-coding RNAs [32, 67, 59].
In this thesis, we introduce the Sequence-Structure-Genome Classifier (SSGC). Using
SSGC, we investigate the transcript classification problem using short-read sequencing data.
Existing studies on non-coding RNAs using RNA-Seq have relied on mapping reads to a reference
genome; we investigate the classification problem using contigs from a non-reference-based
approach, de novo transcriptome assembly. By introducing assembly to
non-coding RNA classification, we make it possible to work in de novo settings. In our investigation,
we also take into consideration the large size and noisy nature of these datasets.
We demonstrate the effectiveness of the various feature sets under an assortment of test
conditions.
The major steps in building and running SSGC are:
1. Create contigs
- RNA-Seq assembly
2. Build classifier
- label contigs as protein coding or non-coding RNAs
- train SVM model using labelled contigs
3. Run contigs on classifier
- include genomically mapped evidence as attributes
Figure 1.1: A top level structure of our approach, from the short read sequences down to the classification of RNA transcripts. We are interested both in using reads and contigs as part of the input and in the potential to classify different non-coding RNA families.
[Diagram labels: Reads; Contigs; Protein coding genes; non-coding RNAs (miRNA, pre-miRNA, tRNA, piRNA, lincRNA, lncRNA, rRNA, snoRNA)]
1.2 Contributions
In this thesis, we extend existing work in which transcript sequences from public databases
were classified into two groups, i.e. protein coding vs non-coding. First, we extend the
classification to discriminate between non-coding RNA families (Figure 1.1). Then, we
apply the classifier to RNA-Seq data, and to de novo transcriptome assembly that uses such
short-read data to generate contigs [110]. De novo assembly can be used with non-model
species for which a reference genome is not available, and can detect chimeric transcripts
that are not represented by a reference genome's gene models but can be important in disease
(ref). We show that non-coding RNA family types can be identified in RNA-Seq data, and in
de novo transcriptome contigs. We outline potential constraints, related to expression level
and sequencing depth, in comprehensively characterising non-coding RNA in sequence data.
The software developed for this thesis is available for use with high-throughput RNA-Seq
and de novo transcriptome assembly pipelines.
1.3 How this thesis is organised
The first three chapters give background material: Chapter 2 briefly summarises the bio-
logical concepts, and Chapter 3 summarises published work related to the thesis. The next
two chapters describe the classifier: Chapter 4 explains concepts, and Chapter 5 provides
details on the tools and methods used. Chapter 6 explains the results of using the classi-
fier on database sequences and de novo transcriptome contigs from real biological samples.
Chapter 7 concludes with final remarks and possible future directions.
Chapter 2
Biological background
Bioinformatics is an interdisciplinary field, and a wide variety of topics is covered in this
thesis. This chapter acts as a primer on the biological terms and concepts that are used in
this thesis.
2.1 Second generation sequencing and transcriptomics
DNA sequencing has existed since the beginning of molecular biology. The Sanger method [113]
is the well known and revolutionary first generation technology based on dideoxy chain termination;
first generation technology has been used to read sequence lengths on the order
of several hundred base-pairs. Second generation sequencing technologies emerged decades
later, towards the end of the first human genome project. The dominant platforms, Illumina,
Roche 454, and ABI SOLiD, have high throughput but generate shorter sequence
reads [86].
Transcription is the synthesis of RNA from ribonucleotides by a polymerase, using a DNA sequence
as the template. Transcriptome studies have been an important part of molecular
biology and bioinformatics research as expressed RNA is often a precursor for protein synthesis [1].
RNA-Seq, or whole transcriptome shotgun sequencing, is a recently developed
method that uses second generation sequencing on a transcriptome to survey the RNA expression
landscape [92, 91]. RNA-Seq is performed by capturing RNA transcripts by their
poly-A tail, converting the RNA sequence to double stranded DNA using reverse transcriptase,
then fragmenting and sequencing using second generation technology [132]. RNA-Seq has been
shown to be effective in profiling the expression level of transcripts [132, 81, 4, 92], as well
as in identifying novel transcription events [110, 41, 126, 40].
2.2 Central dogma of molecular biology
Molecular biology is the study of the formation, organisation and activity of macromolecules
essential to life [56]. This is encapsulated by the Central Dogma, which states that the
flow of genetic information in cells is from DNA to RNA to protein [1]. For a given gene, this
can be broken down into two steps: transcription and translation (Figure 2.1). Transcription
is the process of synthesising a chain of ribonucleotides from the sequence of a DNA
template. The resulting polynucleotide chain, or transcript, is known as the messenger
RNA (mRNA).
Translation is the process of synthesising amino acid polymers by reading the open
reading frame (ORF) found within the transcript sequence. The ORF of a transcript is
the segment of the transcript that encodes the amino acid sequence. It is the
chemical properties of the amino acid, or peptide, sequence that give it its structure and
function. The regions of a transcript outside the ORF are called untranslated regions
(UTRs). Transcripts, like all DNA and RNA molecules, have a direction of synthesis.
A transcript begins at the 5′ end and terminates at the 3′ end. After being copied from
the DNA template, a transcript receives a 5′ cap containing
a modified guanine nucleotide and a poly-adenylation (poly-A) tail on the 3′ end consisting
of a long run of adenosine residues [1].
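The relationship between a transcript, its ORF and its UTRs can be made concrete with a small sketch. The code below is illustrative only and is not the ORF extraction used in this thesis: it assumes the standard start codon ATG and the three standard stop codons, scans only the three forward reading frames, and reports the longest complete ORF. The function names `longest_orf` and `utr_lengths` are hypothetical.

```python
# Illustrative sketch only: a naive ORF scan on the forward strand.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def longest_orf(seq):
    """Return (start, end) of the longest ATG...stop ORF, or None.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Only the three forward reading frames are scanned.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or cDNA input
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best


def utr_lengths(seq):
    """Return (5' UTR length, 3' UTR length) around the longest ORF."""
    orf = longest_orf(seq)
    if orf is None:
        return None
    return orf[0], len(seq) - orf[1]


transcript = "GGCAUGGCUUGGUAAGCC"
print(longest_orf(transcript))   # ORF coordinates
print(utr_lengths(transcript))   # lengths of the flanking UTRs
```

A transcript with no complete ORF returns `None`, which is the kind of signal an ORF-based feature could expose to a classifier; handling of the reverse strand and of ORFs truncated at contig ends is deliberately left out.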
2.3 Non-coding RNA
Despite the fundamental significance of the Central Dogma, we have come to realise important
exceptions to this principle. Of the dry weight of RNA extracted from a cell, only
3-5% consists of mRNA, similar to the proportion of genes that make up the genome [1]. In
contrast, as much as 62% of the mouse genome [125], 85% of the fruit fly genome [80], and
93% of the human genome [8] has been estimated to be transcribed.
Non-protein coding, or non-coding, RNAs are RNA products that are not translated to
proteins after transcription (Figure 2.1). Recently there has been an explosion of research into
micro-RNAs (miRNAs), their critical roles in gene regulation [85, 97], and their implications
for tumorigenesis [111, 13, 84, 131]. miRNAs, along with other small RNAs, were once named
the breakthrough of the year by Science magazine [23]. Overall, there are a number of non-coding
RNA types: those involved in the translation process, such as ribosomal RNA (rRNA)
and transfer RNA (tRNA); small non-coding RNAs such as micro RNA (miRNA), small
interfering RNA (siRNA), small temporal RNA (stRNA), small nuclear RNA (snRNA),
small nucleolar RNA (snoRNA) and piwi-interacting RNA (piRNA); and the more elusive long
non-coding RNA (lncRNA), which includes long intergenic non-coding RNA (lincRNA).
ORF features
Protein coding mRNAs have characteristics that are well defined, as explained earlier. ORFs
are mostly thought to be unique to protein coding genes. There are exceptions to
this concept, as bifunctional RNAs have been documented to have functioning ORFs [25, 2].
There are, however, controversies surrounding non-coding RNAs, as the function of many
annotated non-coding RNAs is not known. Of the transcript products found in the FANTOM
database [125], there are reports that many are the result of undegraded
protein coding mRNA, undegraded introns, internal priming or putative protein coding
genes, and that some have low conservation across species [95]. There have also been reports in which
large deletions in gene deserts associated with non-coding DNA had no effect on mice [93].
Recently, comparisons of newer RNA-Seq methods with potentially noisier microarrays have shown
that non-coding RNAs may not be transcribed as extensively as once thought [91].
Figure 2.1: The Central Dogma of molecular biology. On the left are the typical transcription
and translation steps for a given gene. The end product is a translated amino acid sequence
that eventually forms a protein. On the right is the transcription of a non-coding RNA, the
3-D structure consisting of its secondary structure.¹
[Diagram labels: genome; exons; introns; transcription; pre-mRNA; mRNA with 5' cap, 5' UTR, ORF, 3' UTR and poly-A tail; translation; peptide sequence; protein; non-coding RNA; folded non-coding RNA]
¹ 3-D images from PDB (http://www.pdb.org/) and EBI (http://www.ebi.ac.uk/)
Chapter 3
Related work
Many non-coding RNAs have been known for decades [27], though only recently have
computational methods to detect them started to emerge. Using various
methodologies, many attempts have been made to classify, find, validate and store non-coding
RNAs. In this chapter, we summarise these methodologies.
3.1 Discovery of non-coding RNAs
In this section, we review strategies in the literature that find non-coding RNAs by cate-
gorising the methods into groups based on sequence, structure, comparative genomics, and
scanning methods.
3.1.1 Sequence based approaches
Sequence based methods classify entities as non-coding RNAs or protein coding RNA by
using the primary nucleotide sequence as input. The literature shows that many biologically
relevant features can be extracted from the sequence such as GC content, sequence motifs,
and nucleotide usage. The extracted features can be converted to numerical values that can
be fed into a machine learning model.
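As a minimal sketch of how sequence-derived properties become numerical inputs, the following code turns one sequence into a short feature vector. The particular features (length, GC content, single-nucleotide frequencies) and the function name `sequence_features` are assumptions chosen for illustration; they are not the feature set of any tool cited in this section.

```python
from collections import Counter

NUCLEOTIDES = "ACGU"


def sequence_features(seq):
    """Return a small numeric feature vector for one RNA sequence.

    Features: length, GC content, then the relative frequency of A, C, G, U.
    """
    seq = seq.upper().replace("T", "U")  # treat cDNA and RNA input alike
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    gc = (counts["G"] + counts["C"]) / n
    freqs = [counts[base] / n for base in NUCLEOTIDES]
    return [float(n), gc] + freqs


print(sequence_features("GGCAUGCC"))
```

Each sequence yields a vector of the same length, so a set of sequences forms a matrix that a standard learner such as an SVM can consume.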
CRITICA [6] uses two types of features: comparative genomics features that use DNA
alignment from a DNA database (refer to section 3.1.3), and sequence based features that
compute distributions of hexanucleotides in coding frames and take into account dicodon
biases. DIANA-EST [45] uses artificial neural networks to find coding regions from ESTs.
ESTSCAN [76] also finds the coding regions of ESTs, using a Hidden Markov Model. PORTRAIT [3]
and SOM-PORTRAIT [119] both extract sequence and ORF-related features
and perform classification using support vector machines and artificial neural networks.
CONC [74] and CPC [64] use a large collection of simple features such as length, amino
acid composition, GC content, nucleotide identity, 3-periodicity, and simple thermodynamics,
to feed into a machine learning method to perform the classification; much of
their information comes from comparative methods using BLASTX. Creanza et al. [24]
and Re et al. [104] also use a large collection of features to perform classification, the most
effective feature reportedly being synonymous nucleotide substitutions. Clamp et al. [18],
Li et al. [72], Jia et al. [58], and Wu et al. [137] use methods to extract the open reading
frame of transcripts. Siederdissen et al. [117] use covariance models with only sequence
information to distinguish between many non-coding RNA families.
3.1.2 Secondary structure based approaches
Secondary structure based classifiers assume that functional non-coding RNAs have secondary
structures that can be fully or partially predicted and used to extract properties that distinguish
non-coding RNA from other elements. These properties can include stem loop features
such as prevalence, size and GC content [94, 122], while other strategies
estimate fold energies in both global and local contexts. Also, despite the fact that 3′ UTRs
of mRNAs also contain secondary structure [25], a number of secondary structure based
methods have been shown to have reliable rates of success. Another major consideration is
that secondary structure prediction is computationally expensive, forcing workarounds such
as local secondary structure input. These methods perform a scan of the input sequences
and for every window calculate the local secondary structure and consequent attributes.
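The windowed scan described above can be sketched as follows. This is a minimal illustration under stated assumptions: the window and step sizes are arbitrary, and the per-window attribute is GC content, used here as a cheap stand-in because an actual structure prediction would require an external folding program. The names `sliding_windows`, `gc_content` and `scan` are hypothetical.

```python
def sliding_windows(seq, window, step):
    """Yield (start, subsequence) for each full-length window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def gc_content(subseq):
    """Stand-in per-window feature; a real pipeline would fold the window."""
    return sum(base in "GC" for base in subseq) / len(subseq)


def scan(seq, window=4, step=2, feature=gc_content):
    """Apply `feature` to every window and return (start, value) pairs."""
    return [(start, feature(sub))
            for start, sub in sliding_windows(seq, window, step)]


print(scan("GGCCAAUU"))
```

Because the feature function is passed in as an argument, a local folding energy could be substituted without changing the scanning logic; the cost per window stays bounded by the window size rather than the transcript length.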
Xue et al. [139] and Noel et al. [94] extract local features within
the largest stem loop to classify real and pseudo miRNA precursors. The miRanalyzer
web tool [42] scans the genome using the local secondary structure prediction program
RNAfold [51] and for every window extracts features strongly related to folding and loop
energy such as length, stem length, Mfe, and GC. Classification is done using the random
forest scheme found in the WEKA package [43]. Langenberger et al. [67] scan for RNA
folds in a sliding window along mapped reads. Horesh et al. [52] also use a sliding window
along a genome to find locally stable RNA structures,
and investigate dinucleotide biases that have an effect on the minimal free energies. Childs
et al. [16] build a classifier to infer functionality based on a system where each molecule
of an RNA structure is represented as a graph. miRTRAP [47] assesses features derived from
loops of miRNA to identify miRNAs from high throughput sequencing data.
3.1.3 Comparative Genomics based approaches
Another common method of finding non-coding RNA is to use information from several
sources such as alignment data from related species. This method is known as comparative
genomics. These methods are especially useful when genomic and transcriptomic information
from related species is known. Many approaches use a combination of existing tools.
RNAz [133] was one of the first major methods to predict functional non-coding RNA by
using a combination of sequence alignments, secondary structure and SVM classification.
Dynalign [128] detects non-coding RNAs by predicting secondary structures and thermal energy
for multiple aligned RNAs using a combination of methods including RNAz [133]
and QRNA [107]. Mignone et al. [87] compare the genomes of human and mouse to find
conserved sequences and evaluate protein coding potential, using the notion of conserved sequence
tags (CSTs) to produce blocks of BLAST-like high scoring pairs. Voß et al. [130]
predict non-coding RNAs by using the alignment tool ClustalW [68] and the consensus
structure prediction tool RNAlishapes [129]. Weinberg et al. [134] have uncovered non-coding
RNA by using a number of structure and motif based methods such as CMfinder [140]. CentroidFold [114]
is a web server for RNA secondary structure prediction that takes
an RNA sequence along with its alignment as input. Mathelier et al. [83] find miRNA using
five parameters that are heavily influenced by fold properties and energies. Tseng et al. [127]
use genome scale blasting that combines secondary structure and primary sequences by
using folded-BLAST in intergenic regions.
3.1.4 Genome scanning / mapping approaches
The last category we investigate consists of methods that find non-coding RNA by
scanning the genome to identify new RNAs. These methods use the genomic sequence
as the primary input and use subtle clues to pinpoint locations of possible non-coding RNAs.
Although these are not directly part of this thesis, their goals and strategies are insightful
for our purposes. This category includes strategies that observe motifs and read alignments
from transcriptomes.
Hiller et al. [48] scan the genome for conserved introns to find novel transcripts, especially
focusing on the set of mRNA-like non-coding RNAs. Salari et al. [112] employ a method
of scanning for motifs of k-mer length along a reference genome. Erhard et al. [30]
and Chol et al. [59] both use mapped reads from transcriptome experiments and mainly
use their position and size to find and classify non-coding RNA on the genome. Hofacker et
al. [50] use local RNA folding on a genome wide scale to discover potential RNA structures.
3.2 RNA databases
In response to the expanding set of non-coding RNAs discovered, a number of databases
have emerged to accommodate their unique characteristics. Many cater to specific types
while others are more inclusive.
Although technically a transcriptome database, FANTOM [125] is known to house many
known and unknown EST sequences including non-coding RNAs. RNAdb [99], fRNAdb [63],
NONCODE [46], and RFam [36] are databases that have their own set of classifications or
family types and all have a user interface available publicly on their servers. RFam [36]
is a database of published non-coding RNAs that uses various tools, from covariance models
to WU-Blast, to categorise entries into its extensive set of families. RNAdb [99] is
a database that specifically applies to mammalian non-coding RNAs, combining several
sources. fRNAdb [63] is a database that aims to categorise functional RNA candidates and
includes tools to analyse structure motifs and EST support evaluation. NONCODE [46]
examines a number of non-coding RNA family types (excluding tRNAs and rRNAs) and
categorises these non-coding RNAs into nine biologically related categories.
The following databases each serve a special niche. miRbase [38] is a
database specifically for miRNAs and lists detailed information on both precursor and mature
miRNA structures along with a target prediction pipeline. piRNABank [66] is a database
specifically for PIWI interacting RNAs. Sno/scaRNAbase [138] is a curated database for
small nucleolar RNAs and Cajal body-specific RNAs. NRED [25] is a database containing only long
non-coding RNAs 200 nucleotides or longer, taken from microarray and in situ hybridisation
experiments for the mouse and human. ncRNAimprint [141] is a database of mammalian
non-coding RNAs that are imprinted. lncRNAdb [2] is a database for long non-coding
RNAs that have biological functions in eukaryotic cells and viruses, which include functional
mRNAs.
Chapter 4
Classification
The goal of this thesis is to create a practical, accurate and reliable classifier that can
distinguish different classes of transcript sequences from noisy data in real biological settings.
In particular we distinguish protein coding from non-protein coding RNA in data derived from
RNA-Seq experiments, i.e. from short sequence reads. Using de novo assembly we generate
transcript contigs that represent the transcriptional landscape.
This chapter describes the concepts of the various aspects of our classifier, SSGC, which
aims to fulfil these goals. Section 4.1 describes concepts of the RNA-Seq reads and their
pre-processing. Section 4.2 describes the features used to classify input sequences. Section
4.3 describes the concepts of the classification and how its performance can be assessed.
4.1 Preprocessing reads
The output of the RNA-Seq procedure consists of very short fragments of RNA sequences.
As we are interested in working with long sequences that depict transcripts, we utilise the
process of assembly to build contig sequences.
4.1.1 Assembly
Assembly is a process in which contiguous sequences, or contigs, are created by piecing
together smaller sequences. ABySS [120] is a popular assembler that has been
successfully demonstrated on transcriptome sequencing [9]. ABySS is based on the de Bruijn
graph model, first introduced by Pevzner et al. [100]. This method fits into the category of
de novo assemblers, i.e. those that use only the short read sequence information, without
any external data source such as a reference sequence.
De Bruijn graphs built from short read sequences rely on a given value k, such that sequencing
reads are chopped up into k-mers, or subsequences of length k. Each k-mer is represented in
the graph as a node, directed edges represent overlaps of k − 1 bases between adjacent k-mers, and
the paths traversed along edges represent contiguous sequences, or contigs, assembled from
sequenced reads. One of the challenges with de Bruijn based assemblers is that, depending
on the coverage and the value of k, assembly can produce a high number of fragmented
contigs [9], though some fragmentation is unavoidable due to repeats and low
coverage. It is also unclear whether assembly is the sole cause of fragmentation, as it can also
be argued that cDNAs such as those found in the FANTOM database are also fragmented
versions of longer transcripts [35].
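The graph construction described above can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not the ABySS implementation: reads are assumed error-free, reverse complements and coverage are ignored, and a contig is extended only while the path is unambiguous. The function names `kmers`, `build_graph` and `walk_contig` are hypothetical.

```python
from collections import Counter


def kmers(read, k):
    """Return all length-k subsequences of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Nodes are k-mers; an edge u -> v joins k-mers adjacent in some read,
    so the last k-1 bases of u equal the first k-1 bases of v."""
    graph = {}
    for read in reads:
        parts = kmers(read, k)
        for node in parts:
            graph.setdefault(node, set())
        for u, v in zip(parts, parts[1:]):
            graph[u].add(v)
    return graph


def walk_contig(graph, start):
    """Follow unambiguous edges from `start` and spell out the contig."""
    in_degree = Counter(v for targets in graph.values() for v in targets)
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if in_degree[nxt] != 1 or nxt in seen:
            break  # stop at a merge point or a cycle
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ACGTT", "CGTTGA"]
graph = build_graph(reads, k=3)
print(walk_contig(graph, "ACG"))
```

The two overlapping reads are joined into a single contig because every node on the path has exactly one successor. When two reads diverge after a shared k-mer, the walk stops at the branch, which is one way fragmentation arises.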
To reduce the number of fragmented short contigs, a merging technique has been shown
to be successful [110]. This technique is based on the strategy of assembling a large set of
contigs using multiple k-mer values, then removing every contig that is a perfect subsequence
of another contig. This procedure is also accompanied by a filtering step to further
reduce the number of small contigs.
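A minimal sketch of this merging idea follows. It is an assumption-laden illustration rather than the Trans-ABySS procedure: contigs are compared as plain strings on one strand, containment is tested by brute force, and the `min_length` parameter is a hypothetical stand-in for the filtering step, whose actual criteria are not specified here.

```python
def merge_contigs(contig_sets, min_length=0):
    """Pool contigs from several assemblies, then drop contained ones.

    `contig_sets` holds one list of contig strings per k value.
    A contig is dropped if it is an exact substring of a kept contig.
    """
    pooled = {c for contigs in contig_sets for c in contigs
              if len(c) >= min_length}
    # Longest first, so any container is examined before what it contains.
    ordered = sorted(pooled, key=lambda c: (-len(c), c))
    kept = []
    for contig in ordered:
        if not any(contig in longer for longer in kept):
            kept.append(contig)
    return kept


k25_contigs = ["ACGTACGT", "TTGA"]
k31_contigs = ["GTAC", "CCTTGACC"]
print(merge_contigs([k25_contigs, k31_contigs]))
```

Processing contigs from longest to shortest means containment only needs to be checked against contigs already kept. The pairwise comparison is quadratic in the number of contigs, so a real dataset would call for an index rather than this direct test.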
4.1.2 Mapping to RNA database
Our approach is not only to run, but also to train, the classifier using contigs; contigs must therefore be
assigned a label from the class definitions. After assembly, contig sequences are mapped
to protein coding and non-coding RNA databases. Based on the mapping criteria and
threshold set, subsets of contigs inherit the labels of the elements in the databases (Figure
4.1). In the case of multiple mappings, contigs are assigned labels in a greedy manner,
based on mapping score. The resulting set of labelled contigs is used to train and test the
classifier.
To assess the performance of the classifier on contig sequences, we first create class labels
for each contig sequence. This is done by aligning the contig set against the annotated
protein coding and non-coding database entries with the BLAT aligner [61] and labelling each
contig according to its mapping scores. For each contig-annotation pair, we can choose to
accept or reject the pairing by comparing the BLAT alignment parameters $bat_c$ and $bat_a$,
for contig and annotation respectively, to threshold values. The parameters are calculated as:

$$bat_c = \frac{\mathit{numbasesmatch}}{\mathit{length}_{contig}}, \qquad bat_a = \frac{\mathit{numbasesmatch}}{\mathit{length}_{annotation}}$$

Figure 4.1: Overview of the contig assembly and labelling procedure. From short
transcriptome reads, contigs are assembled and merged. Contigs are mapped individually to
protein coding and non-coding RNA datasets. Contigs inherit the labels of the database
elements with the best matched mapping score, which must be above a set threshold. For each
mapping score, there are two threshold values, one for the contig and one for the
annotation. The labelled contigs are used as training and testing sequences for the
classifier.

To find the best annotation mapping for a given contig, we choose the annotation with the
highest score, calculated as $score = bat_c + bat_a$. The procedure of assigning contigs to
annotations consists of the following steps: set a threshold between 0 and 1; calculate the
score for each contig and annotation pair with each bat term above the threshold; from the
highest to the lowest score, label the contig as the annotation and remove all future
instances of the contig and annotation from consideration.
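
The greedy assignment described above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions, not the code used in this thesis: the function name
`label_contigs` and the tuple layout of each hit are hypothetical, and the alignment results
are assumed to have already been parsed into matched-base counts and sequence lengths.

```python
from typing import Dict, List, Tuple

# Hypothetical hit layout:
# (contig_id, annotation_id, matched_bases, contig_length, annotation_length)
Hit = Tuple[str, str, int, int, int]


def label_contigs(hits: List[Hit], threshold: float) -> Dict[str, str]:
    """Greedily give each contig at most one annotation, best score first."""
    scored = []
    for contig, annotation, matched, contig_len, annotation_len in hits:
        bat_c = matched / contig_len
        bat_a = matched / annotation_len
        # Both terms must be above the threshold for the pair to be considered.
        if bat_c > threshold and bat_a > threshold:
            scored.append((bat_c + bat_a, contig, annotation))

    labels: Dict[str, str] = {}
    used_annotations = set()
    # Walk from the highest to the lowest score; once a contig or an
    # annotation has been used, every later pair involving it is skipped.
    for _, contig, annotation in sorted(scored, key=lambda s: s[0], reverse=True):
        if contig in labels or annotation in used_annotations:
            continue
        labels[contig] = annotation
        used_annotations.add(annotation)
    return labels
```

Note that, because a used annotation is removed from consideration, a contig whose only
acceptable annotation has already been claimed by a better-scoring contig stays unlabelled.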
4.2 Feature extraction
Given a set of sequences, the classifier attempts to separate the set into classes, whether
these are protein coding and non-coding, or non-coding RNA family types. This is done by
extracting features, that is, properties derived from the sequence. This section describes
the features used by the classifier. The features are categorised as sequence based
features, structure based features, and genomic map based features, as represented in
Figure 4.2 and further expanded in Table 4.1. The following sections describe the features
at a conceptual level, and section 5.1 provides further details on the implementation.
4.2.1 Sequence based features
Various methods found in the literature have explored features directly computed from the
sequence itself. The functional unit of a protein is the peptide folded in three dimensions,
while the functional unit of many non-coding RNAs is their secondary structure. The
selection pressures on these functional units are responsible for many features that are
embedded in the sequence information of coding and non-coding RNA transcripts [117]. This
section explains the methods involved in extracting sequence based features from a given
sequence.
Nucleotide usage
From the four nucleotides that make up the alphabet used in RNA, there are reports of
biases in the nucleotide composition of certain transcript types. One way to measure
the composition is to compare the distribution of unigrams, bigrams, and trigrams over the
entire length of the transcript. This alone yields 84 features, one for each possible
word: 64 possible trigrams, 16 possible bigrams, and 4 possible unigrams.

Figure 4.2: The classification approach starting from the sequence reads down to the
testing of RNA transcripts. We propose a classifier that draws on three categories of
features based on sequence, secondary structure, and genome mapped data, which we name the
Sequence-Structure-Genome Classifier (SSGC). For de novo experiments, we only consider
sequence and secondary structure based features.

Table 4.1: Features available from the prediction model. Sequence and secondary structure
based features make up the de novo set of features. The concepts of the features are
described in section 4.2, and the implementation in section 5.1.

An alternative is to compute a single feature, GC content (essentially the merging of two
bins, C and G, divided by the total number of nucleotides), which has been used in the
past to distinguish coding from non-coding transcripts [67, 104]. These methods exploit the
tendency of protein coding sequences to have a GC content of approximately 50%,
statistically distinct from intergenic sequences [79, 24].
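
As a rough illustration of the nucleotide usage features, the following Python sketch
computes the 84 word frequencies and the GC content of a sequence. The function names are
hypothetical, and counting overlapping words and normalising by the number of windows are
assumptions made for this example rather than details taken from the thesis implementation.

```python
from collections import Counter
from itertools import product
from typing import Dict

ALPHABET = "ACGU"


def ngram_frequencies(seq: str) -> Dict[str, float]:
    """Relative frequency of every 1-, 2- and 3-mer (4 + 16 + 64 = 84 words)."""
    seq = seq.upper().replace("T", "U")
    features: Dict[str, float] = {}
    for n in (1, 2, 3):
        windows = max(len(seq) - n + 1, 0)
        # Assumption: words are counted in overlapping windows.
        counts = Counter(seq[i:i + n] for i in range(windows))
        for word in map("".join, product(ALPHABET, repeat=n)):
            features[word] = counts[word] / windows if windows else 0.0
    return features


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
```

For example, `gc_content("AUGC")` evaluates to 0.5, and `ngram_frequencies` always returns
84 entries regardless of the input length.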
Length
Among the non-coding RNA families, two classes, tRNAs and miRNAs, stand out as they
have a well defined structure and length [1]. As such, mining for these particular
non-coding RNAs in a large dataset has been shown to be possible by restricting the length
of the transcript and/or the secondary structure [67, 47]. Non-coding RNAs can vary greatly
in length: transcripts smaller than 200 nucleotides are often associated with microRNAs,
PIWI-associated RNAs, and endogenous small interfering RNAs [25], whereas RNAs in the long
non-coding RNA class have transcripts in the same order of magnitude as protein coding
genes, with some transcripts as large as a hundred kilobases in length [99].
ORF features
Protein coding mRNAs have well defined characteristics: they have a 5′ cap, 5′
and 3′ untranslated regions, an open reading frame and a polyadenylated tail [1] (refer to
Figure 2.1). The portion of the RNA that is translated into a peptide sequence is called
the open reading frame (ORF), and it is mostly thought to be unique to protein coding
genes; exceptions to this rule are bifunctional RNAs, which are documented to have
functioning ORFs [25, 2].
A crude way to detect ORFs within a transcript sequence is to search the 6-frame
translations for the longest ORF, that is, the longest stretch that begins with a start
codon and ends with a stop codon. More robust methods have been proposed by Slater et
al. [121] and Shimizu et al. [116]; these use machine learning to take into account
erroneous input sequences and frameshifts.
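
The crude six-frame search can be sketched as follows. This is a minimal illustration of
the naive approach only, not of the more robust methods cited above; the function names are
hypothetical, and ATG is assumed to be the only start codon.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq: str) -> Tuple[str, int, int]:
    """Return (orf, strand, frame) for the longest ATG..stop ORF in six frames.

    strand is +1 or -1 and frame is 0, 1 or 2; ("", 0, 0) means no ORF found.
    """
    seq = seq.upper().replace("U", "T")
    best = ("", 0, 0)
    for strand, strand_seq in ((1, seq), (-1, reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the first ATG of the currently open ORF
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orf = strand_seq[start:i + 3]
                    if len(orf) > len(best[0]):
                        best = (orf, strand, frame)
                    start = None
    return best
```

An ORF that is opened but never closed by a stop codon is ignored here, which is one reason
this search is fragile on truncated or erroneous contigs.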
Once an ORF is predicted, we can investigate protein coding biases such as the
log-odds score, compositional entropy, the amino acid composition, isoelectric point, and
mean hydropathy. The drawback is that if a protein coding gene’s ORF is mis-predicted, the
features derived from it will likely yield poor results.
The amino acid composition is the makeup of amino acids used in the peptide sequence;
it can be measured as a histogram of amino acid unigrams. This can be a crude measure
to distinguish a peptide from the assumed random peptide sequence expected from a
non-coding RNA. The log-odds score is an effective and often used measure of the likelihood
that a given sequence is not from a random source. It makes use of the fact that, of the 64
possible codon triplets, there are heavy biases in the usage found in nature. By measuring
the in-frame nucleotide usage, the log-odds score gives a measure of the quality of the
sequence [137].
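
One common form of such a score, shown here purely as an assumption-laden sketch, sums the
log ratio of each in-frame codon's frequency in known coding sequences to a uniform
background of 1/64. The function names, the uniform background and the pseudocount are
choices made for this example; the exact scoring used by the tools in this thesis may
differ.

```python
import math
from collections import Counter
from typing import Dict, Iterable

CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codon_frequencies(orfs: Iterable[str], pseudocount: float = 1.0) -> Dict[str, float]:
    """Estimate in-frame codon frequencies from training ORFs."""
    counts: Counter = Counter()
    for orf in orfs:
        orf = orf.upper().replace("U", "T")
        counts.update(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    # The pseudocount keeps unseen codons from getting zero probability.
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def log_odds_score(orf: str, coding: Dict[str, float]) -> float:
    """Sum of log2(P_coding(codon) / P_background(codon)) over in-frame codons."""
    background = 1.0 / 64  # assumption: uniform random source
    orf = orf.upper().replace("U", "T")
    score = 0.0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in coding:  # codons with ambiguous bases are skipped
            score += math.log2(coding[codon] / background)
    return score
```

A positive score indicates codon usage closer to the training ORFs than to the uniform
background, and a negative score indicates the opposite.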
Compositional entropy is used here to describe the degree to which low-complexity regions
occur in the peptide sequence of the ORF. Low-complexity regions are repetitive or
homopolymeric sequences of residues such as Ser, Asn, Gln, Asp, Glu and Thr [37], found
in peptide sequences that code for nonglobular domains. Although their function is not
known, this is a well documented trait found in many protein coding genes [101].
The isoelectric point of a protein is the pH at which it has no net charge. By examining
the amino acid side chains of a peptide, the buffering characteristics can be determined at
different pH levels. Since living systems have very narrow ranges of pH, it is expected that
peptide sequences would also have a narrow range of isoelectric points to be useful in a
living organism [1].
Hydropathy is used here to measure how hydrophobic the regions of a peptide are, i.e.
whether they are polar or non-polar depending on the side chains of the amino acids used.
Kyte and Doolittle [65] proposed a method to calculate the hydropathy character of a
protein. Here we use the mean hydropathy across the entire length of the peptide sequence,
which may be problematic because parts of the peptide are hidden in globular pockets of the
folded protein structure.
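
The mean hydropathy feature can be illustrated with the short sketch below, which averages
the Kyte-Doolittle hydropathy index over a peptide. The function name is hypothetical and
skipping non-standard residue symbols is an assumption made for this example; it is not the
BioPerl implementation used in this thesis.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(peptide: str) -> float:
    """Average hydropathy over all standard residues in the peptide."""
    # Assumption: symbols outside the 20 standard residues (e.g. '*', 'X') are skipped.
    values = [KYTE_DOOLITTLE[aa] for aa in peptide.upper() if aa in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0
```

Positive values indicate a predominantly hydrophobic peptide and negative values a
predominantly hydrophilic one.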
4.2.2 Secondary structure based features
RNA secondary structure
Some non-coding RNA types are known to have secondary structures that are key to their
function, such as ribosomal RNAs and tRNAs. Here we assume that there are no significant
secondary structures associated with protein coding RNAs. In a long chain of
ribonucleotides, secondary structures result from segments of intramolecular base pairing,
producing distinguishable structures such as stems, loops and bulges. Given a ribonucleotide
sequence, the most likely secondary structure would be the one with the lowest free energy
among all candidate structures. However, enumerating all possible candidates is infeasible
due to the sheer number of possible structures [142]. Lyngsø and Pedersen [78] show that
prediction of secondary structures taking into account pseudo-knots is NP-complete.
Zuker and Stiegler [143] describe an $O(n^3)$ dynamic programming algorithm that
assumes a simplistic thermodynamic model and disregards pseudo-knots.
The Vienna package [50] contains an implementation of this global secondary structure
prediction, in addition to an $O(nl^2)$ local secondary structure prediction that only
considers sub-structures within a sliding window of size $l$ of the input sequence. It has
been shown that non-coding RNAs can be reliably detected solely by using local structures
such as hairpins and stem loops [31].
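
To give a feel for this kind of cubic-time dynamic programming, the sketch below implements
the much simpler Nussinov-style recurrence, which maximises the number of nested base pairs
instead of minimising free energy. It is explicitly not the Zuker and Stiegler algorithm or
the Vienna package implementation; the function name and the minimum hairpin loop size of
three bases are assumptions made for this example.

```python
def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested (pseudo-knot free) base pairs in seq."""
    seq = seq.upper().replace("T", "U")
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is left unpaired
            # case 2: base j pairs with some k, leaving at least min_loop
            # unpaired bases between them
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

The three nested loops over `span`, `i` and `k` give the cubic running time; energy-based
methods follow the same overall shape but score loops and stacks with thermodynamic
parameters.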
We examine the RNA folding ability of each transcript by predicting the pseudo-knot
free secondary structure. Given their success in distinguishing miRNA and pre-miRNA,
we focus on the quality of stem loops, as shown in Xue et al. [139] and Hackenberg et
al. [42]. By extracting the longest stem loops, these methods obtain features
based on the length, GC content, number of symmetric and asymmetric bulges, and structure
motifs, and feed them to a machine learning program to make their predictions. In addition
to these features, we also extract the triplet-SVM features proposed by Xue et al. [139].
Given a secondary structure represented by an alphabet of brackets and dots, together with
the ribonucleotide sequence, we count the occurrences of each of the eight possible
structure trigrams (combinations of dots and brackets) paired with each of the four RNA
bases found at the middle position of the trigram. The eight possible trigrams are
[(((, ((., (.(, (.., .((, .(., ..(, and ...], which combined with the four bases gives
32 features.
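
The triplet counting can be sketched as follows. This is an illustrative Python version,
not the modified code of Xue et al. used in this thesis: the function name is hypothetical,
closing brackets are folded into opening brackets so that only the eight listed trigrams
occur, and normalising the counts to frequencies is an assumption made for this example.

```python
from itertools import product
from typing import Dict


def triplet_features(seq: str, structure: str) -> Dict[str, float]:
    """32 features: middle base (A, C, G, U) combined with a 3-symbol structure motif."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.upper().replace("T", "U")
    # Paired positions are written '(' whether they open or close a pair.
    struct = structure.replace(")", "(")
    keys = [base + "".join(motif) for base in "ACGU" for motif in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + struct[i - 1:i + 2]
        if key in counts:  # positions with ambiguous bases are skipped
            counts[key] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}
```

For the toy hairpin `GGAAACC` with structure `((...))`, five positions have a full
three-symbol context, so each observed feature such as `G((.` has frequency 0.2.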
There is clear potential in investigating secondary structures, but there is also
a limitation in exclusively examining dynamic programming solutions. One of the major
drawbacks is that dynamic programming solutions search for the minimum free energy
structure; however, the biologically functional RNA product is not always the candidate
structure with the minimum free energy [115].
Another practical issue is that computing structural motifs is computationally
expensive, and large transcripts are expected to increase the running time significantly.
In that case, we have two options: either to compute structures only for contigs
below a certain size cutoff, or to run only localised structure predictions in a sliding
window. Both strategies can potentially limit the structures predicted, and can additionally
be affected by the selection of size thresholds and step sizes. Our experiments use the
sliding-window approach.
4.2.3 Genomic map based features
Genomic map based strategies use data that are mapped onto genome coordinates. With
the ability to map transcripts back to the originating genome, several pieces of information
become available. The two strategies used in this thesis are to observe the splicing
patterns of a transcript and to mine data associated with the bases at a transcript’s
genomic coordinates. As such, we are limited to using data for a species with a known
reference genome, which excludes these features from de novo type experiments.
For this thesis, we focus on extracting features relating to the number of exons predicted
and mapped, as well as extracting data from the regions each transcript or contig maps to,
namely scores relating to evolutionary conservation and histone modifications, explained in
the subsequent sections.
Evolutionary conservation
Genomic conservation is a tool to measure evolutionary distance between two or more species
at a particular location. Incorporated in our classifier, it is used to measure specific
sequences on the genome that are conserved, in order to detect functional regions in the
genome [44, 75, 12, 60, 82, 136]. Analyses of sequenced genomes and data from comparative
genomic studies have shown that large portions of the genome are functional elements
that have not been identified [19, 15, 21, 20, 118, 89].
Two algorithms are often used to measure the conservation between species at a base-by-base
level on a reference genome: VISTA [34] and Phastcons [118]. Phastcons is an HMM
based program that uses phylogeny and genome alignments to calculate conservation between
multiple species, whereas VISTA calculates conservation between pairs of species.
In the context of classification, it is widely accepted that protein coding RNAs are
conserved [1]; however, there are inconsistent reports of conservation levels between
protein coding RNAs and non-coding RNAs. Studies have shown that long non-coding RNAs are
conserved across species in varying degrees [5, 17, 39, 57]. In contrast, it has also been
reported that conservation is expected only in short non-coding RNAs, while longer
non-coding RNAs are not expected to be conserved [98].
Histone modification data
The development of next-generation sequencing has not only provided higher throughput and
lower costs, it has also found its way into many different applications. Chromatin
immunoprecipitation (ChIP) is one technology that benefits from this powerful sequencing
technology. First described by Solomon et al. [123], ChIP uses cross linking between
protein and DNA to produce genome wide maps of where transcription factors bind. ChIP-Seq
expands this method by introducing next-generation sequencing and mapping to rapidly
determine a map of transcription factor binding sites [109].
Using ChIP-Seq technology, the discovery of sites of histone modifications associated with
gene expression has been shown to be successful in studying transcription factor
binding [108]. In addition, chromatin state maps [88] have been used to discover a large
set of long intergenic non-coding RNAs [39]. In this thesis, we investigate the effect of
using chromatin state maps in our classifier for the task of distinguishing protein coding
and non-coding RNAs.
4.3 Classification
The primary goal of the classifier is to accurately detect whether an input RNA sequence
originated from a protein coding or a non-coding gene. The secondary objective is to further
classify a sequence that is predicted to be non-coding into its non-coding RNA family types.
To make the decision, the classifier makes use of features extracted from the three categories
of features described above. We investigate the classifier in two settings: one to assess the
performance by performing cross-validation of all contigs that map to known annotated
protein coding and non-coding sequences, and the other by running the classifier on the full
contig set to create a list of contigs ranked by prediction confidence. In both the training and
testing steps, features are processed and are ultimately fed into a support vector machine
that makes up the classifier model.
4.3.1 Support vector machines
The main engine used in determining the class and family types of RNA is a support vector
machine (SVM), a popular method used in classification, regression and novelty
detection [10]. SVMs have become particularly useful in classification problems in
computational biology due to their high accuracy, robustness with large, high-dimensional
data, and flexibility in handling diverse data sources [7]. SVMs model classification
problems by representing data as points in a high dimensional space. Within that space, an
SVM learns a hyperplane which maximally separates the two classes of a training dataset.
The trained model is then used to classify new instances [22, 135].
4.3.2 Performance evaluation
A standard procedure to assess the accuracy of a model consists of splitting a dataset into
training and testing sets; a model is created with the training set and is evaluated with
the test set. Cross validation is an alternative to this approach that uses multiple rounds
of training and testing. This is especially useful when the size of the dataset is limited.
One such type is K-fold cross validation. It is performed by splitting the dataset into K
partitions; an SVM is trained using K − 1 partitions and evaluated with the remaining
partition. This is repeated so that each partition is used once for evaluation [22, 135].
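
The partitioning step of K-fold cross validation can be sketched as below. The generator
name `k_fold_indices`, the fixed shuffling seed and the round-robin assignment of samples to
folds are choices made for this illustration; the SVM training and evaluation that would
use each split are not shown.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so averaging the per-fold results uses every
labelled sequence for evaluation exactly once.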
For our thesis, we utilise cross validation to assess the performance of the classifier in
both the binary and multiclass classification problems. As SVMs are binary classifiers that
can only handle two classes, multiclass problems are addressed using strategies that combine
multiple rounds of one-against-one or one-against-all classifications combined with voting.
For our classifier, we rely on the one-against-one implementation [54].
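
The voting step of the one-against-one strategy can be illustrated with the following
sketch. It is not the implementation referenced above: the function name, the dictionary of
pairwise classifiers and the tie-breaking rule (the class listed first wins) are all
assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")


def one_vs_one_predict(
    sample: Sample,
    classes: Sequence[str],
    pairwise: Dict[Tuple[str, str], Callable[[Sample], str]],
) -> str:
    """Run every pairwise binary classifier and return the class with most votes."""
    votes: Counter = Counter()
    for a, b in combinations(classes, 2):
        # Each binary classifier returns either a or b for the sample.
        votes[pairwise[(a, b)](sample)] += 1
    # Ties are broken in favour of the class listed first in `classes`.
    return max(classes, key=lambda c: votes[c])
```

As a toy usage example, with hypothetical length-based rules standing in for trained SVMs:

```python
toy = {
    ("miRNA", "tRNA"): lambda n: "miRNA" if n < 50 else "tRNA",
    ("miRNA", "snoRNA"): lambda n: "miRNA" if n < 50 else "snoRNA",
    ("tRNA", "snoRNA"): lambda n: "tRNA" if n < 100 else "snoRNA",
}
print(one_vs_one_predict(75, ["miRNA", "tRNA", "snoRNA"], toy))
```

With N classes this requires N(N − 1)/2 binary classifiers, one per pair of classes.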
For each classification experiment, the accuracy, precision and recall are calculated.
These are evaluated based on the true counts (TP and TN) and the false counts (FP and FN)
from the confusion matrix (Table 4.2).
Accuracy is a measure of the total number of correct predictions from the total sample
size [96].

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision is a measure of accuracy for the true positives from all samples predicted as
true [96].

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall is a measure of all true positives that were correctly predicted from all samples
that are actually positive.

$$\text{Recall} = \frac{TP}{TP + FN}$$

Table 4.2: Confusion matrix (or coincidence matrix) for a two-class classification problem.
The correct predictions, true positives and true negatives, are shaded while the erroneous
predictions, false positives and false negatives, are not.
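
As a small worked illustration of these measures, the following Python functions compute
them from the four confusion matrix counts. The function names are hypothetical, recall is
taken in its standard form TP / (TP + FN), and returning 0.0 when a denominator is zero is
a convention chosen for this example.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were predicted correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of truly positive samples that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0
```

For instance, with TP = 90, TN = 80, FP = 10 and FN = 20, the accuracy is 170/200 = 0.85,
the precision is 90/100 = 0.9 and the recall is 90/110.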
4.3.3 Cross validation evaluation
We evaluate the performance of the classifier on annotated sequences, that is, sequences
with known class. This allows us to evaluate the performance of the classifier under
different settings.
Binary coding vs. non-coding classification
SSGC is applied to binary classification, that is, differentiating coding from non-coding
RNA sequences. Physically, both sets of sequences can be similar, as they are composed
of the same alphabet and overlap in sequence size. Using the features of the SSGC, we
demonstrate its ability to predict the class of input sequences. This is performed using
SVMs with cross validation on sequences with known classes or on annotated contigs.
Multiclass RNA family classification
Many strategies found in the literature perform their classification based on the two crude
classes of non-coding RNA and protein coding mRNA. This can be a naive approach, as
non-coding RNAs have many family types that differ in size, structure and function. Our
classifier attempts to distinguish not just non-coding RNA from protein coding RNA, but
also among the multiple non-coding families. Some family types that we apply our classifier
to include piRNA, miRNA, pre-miRNA, snoRNA, snRNA, tRNA and rRNA. To solve this multiclass
problem, we use a one-versus-one implementation of the support vector machine
classifier. In addition, we investigate a multi-phase classifier that
performs multiclass classification once protein coding sequences are removed.
4.3.4 Full contig prediction
Applying the classifier to labelled sequences makes it possible to evaluate the classifier.
However, this limits its use to sequences that are already known and classified. In
particular, on assembled contigs, evaluation is only possible for contigs that map to
known sequences. Although the performance cannot be directly determined, we investigate
the ability to predict the class of the entire contig set.
Classification on the entire contig set is achieved by first training an SVM model using
the subset of contigs mapped to known sequences. The model can then be applied to the
entire contig set to predict the class and the confidence of each contig (refer to Figure 4.3).
4.3.5 Feature set ranking
We also investigate the effectiveness of our feature set. It is possible that some features
will not be available for some datasets. Also, many features do not apply to all possible
transcript types. Notably, numerous features associated with the ORFs of proteins do not
apply to non-coding RNA and, analogously, secondary structure features do not apply to
protein coding genes. If a transcript can be identified as a protein coding gene, we would
be uninterested in measuring the degree of secondary structure, just as we would be
uninterested in computing ORF features for non-coding RNA. Computing unneeded features can
be a strain on resources.
We investigate which features are the most effective in our classification experiments.
Once the feature set is assessed, we propose that subsets of features be called upon under
certain conditions. Ultimately, we envision a multiple step classifier, one that will have
multiple feature extraction and classification steps. For this thesis, we are interested in
first separating the transcripts representing all genes, and then separating the
transcripts into the multiple classes, as shown in Figure 1.1.

Figure 4.3: Contig prediction procedure for the full contig set. A subset of contigs mapped
to protein coding and non-coding sequences from Ensembl and fRNAdb, respectively, is used
to train an SVM model. The SVM model is used to classify the entire contig set, predicting
the class and p-value for each contig. The p-value allows the contigs to be ranked, from
strongly protein coding (0) to non-coding (1).
Chapter 5
Implementation
This chapter describes the steps taken to construct the classifier, and to run the experi-
ments. Section 5.1 describes the steps involved in computing the features from a set of
sequences. Section 5.2 describes the steps used to assess the classifier performance, predict
novel transcripts, and to rank the features used.
5.1 Feature extraction
The classifier is designed to distinguish one set of sequences from another using a number
of feature extraction strategies. Feature extraction was designed as a set of modular tools
that can be turned on or off depending on the data available, their effectiveness, and the
time and space constraints of the system used. The central programs are accessible from the
command line and are controlled using a set of arguments as well as a configuration file.
In total, 169 features are configured for the classifier: 159 are de novo and an additional
10 are genome based. Table 4.1 lists the features used by the classifier. These features are
fed to a support vector machine that makes up the core of the model building and decision
making process. The following sections explain in detail each of the components used in
the feature extraction procedure.
5.1.1 Sequence based feature extraction
Programming for sequence based feature extraction was done in Perl in a UNIX environment.
Perl was used to manage the components of the system, to perform some of the feature
extraction calculations, and as the scripting language that drove the classification
tools.
Perl was used for feature extraction for the following feature types: GC content, length,
nucleotide composition, amino acid composition, ORF analysis and, through the BioPerl
libraries [124], isoelectric point and mean hydropathy. The pK values of the amino acid
side chains used to calculate the isoelectric point were based on the values found in the
EMBOSS toolkit [106]. Mean hydropathy was calculated using a BioPerl implementation of the
method proposed by Kyte and Doolittle [65].
To extract the ORF from a transcript or contig sequence, the ESTate package [121] was
used, as it is specially tailored to handle potential sequencing and frameshift errors in
the input data, making it ideal for assembled contigs. The training data was used to
extract the word usage and probabilities, and framefinder was used both to extract the ORF
and to calculate the log-odds score.
Low-complexity regions were detected using the Compositional Bias Detection Algorithm [102]
with the default values. The compositional entropy feature was calculated as the number of
masked residues divided by the total length of the ORF.
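
The ratio itself is simple to compute once the masked ORF is available. The sketch below
assumes, hypothetically, that masked residues have been replaced by a single mask character
such as X; the actual output format of the low-complexity detection tool may differ and
would need to be parsed accordingly.

```python
def compositional_entropy(masked_orf: str, mask_char: str = "X") -> float:
    """Fraction of ORF residues that were masked as low complexity."""
    # Assumption: masked residues appear as `mask_char` in the peptide string.
    if not masked_orf:
        return 0.0
    return masked_orf.count(mask_char) / len(masked_orf)
```

A value of 0 means that no low-complexity region was found, and a value of 1 means that the
entire ORF was masked.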
5.1.2 Secondary structure feature extraction
We examine the RNA folding ability of each transcript using tools from the
Vienna package [49, 50]. We have the option of running either full secondary structure
predictions using RNAfold or local secondary structure predictions using RNALfold. In the
interest of running time, we perform all our tests using local secondary structure
prediction, with the span size set to 150 bp.
From the output of these structure prediction programs, we extract the longest stem loop
using a modified version of code available from Xue et al. [139]. This also gives us the
32 triplet-SVM features, which are 3-character motifs from the structure sequence made up
of dots (mismatches) and brackets (matches) for each of the four possible bases A, C, G, and
U. Once we extract the longest stem loop, we extract features for the stem length, minimum
free energy in the hairpin, loop length, loop GC, asymmetric bulges, symmetric bulges, and
the longest bulge.
5.1.3 Genomic map based feature extraction
For non-de novo experiments, where we have the reference sequence available, we can observe
the splicing patterns of the transcript, and take into account the number of exons as well as
their placement. For assembled contig sequences, genome coordinates are predicted using
BLAT [61] for each contig, mapped to the mouse mm9 (NCBI m37) reference genome. For
multiple genomic candidates, a single coordinate is chosen based on the highest score:
Table 6.1: SSGC performance compared with PORTRAIT for the dataset composed of Swiss-prot and EMBL for the protein coding set, and Rfam, RNADB and NONCODE for the non-coding set. Precision and recall are shown for the non-coding class.
6.1.2 Ensembl protein coding vs. non-coding
Preparation
To simulate the full length mRNAs found in transcriptome studies, we also look to mm9
mRNAs obtained from Ensembl v60 [55]. From the range of biotypes available from Ensembl,
we consider sequences with the biotype protein coding, consisting of 88,186 sequences. In
the same manner as for the EMBL dataset in the previous section, we performed
BLASTCLUST [26] using the same arguments and restricted the sequences to the same size
ranges, resulting in 46,261 total sequences.
The same non-coding RNA dataset consisting of 60,849 sequences explained in the pre-
vious section was used.
Classification
SSGC was compared with PORTRAIT [3] using Ensembl v60 [55] protein coding transcripts
as the protein coding set, and the same non-coding RNA set as in section 6.1.1. The results
are summarised in Table 6.2. In this case, SSGC outperforms PORTRAIT in terms of
accuracy, precision and recall. The difference in performance between this dataset and the
last is striking. As the same non-coding set is used, and transcripts are clustered and
size-selected for both, the difference between the inputs is likely that the EMBL sequences
contain purely the ORF containing portion of the mRNA, while the Ensembl set contains
the full mRNA sequence including the UTRs. For the purpose of contig classification in the
transcriptome, we expect to see full-length mRNAs that include UTR sequences resemble

Table 6.2: SSGC performance compared with PORTRAIT for the dataset composed of Ensembl
protein coding, and Rfam, RNADB and NONCODE for the non-coding set. Precision and recall
are shown for the non-coding class.
6.1.3 Ensembl vs. fRNAdb
Preparation
To test the ability to distinguish a range of different non-coding RNA types, we look to
fRNAdb [63] for mouse mm9 sequences, downloaded March 1st, 2010. fRNAdb has in total
83,826 sequences divided into nine RNA types: fly-smallRNA, mat-miRNA, misc, piRNA,
pre-miRNA, rRNA, snoRNA, snRNA and tRNA.
Protein coding sequences are made up of Ensembl v60 [55] with biotype protein coding
as before. To allow comparison with the smaller non-coding RNAs found in fRNAdb, no
filtering was performed based on similarity or size.
Classification
The previous sections presented our findings for the binary ‘coding vs. non-coding’ class
problem using exclusively de novo features. In this section we expand our methods in two
ways: we compare the performance of the complete feature set (which includes genome based
features and the de novo feature set) with that of the de novo feature set alone, and we
investigate the multiclass problem by including several non-coding RNA types in our
classification. Our investigation is performed using datasets from Ensembl [55] protein
coding and the multiple non-coding RNA types from fRNAdb [63].
We investigate our classifier performance using the entire feature set and the de novo
feature set for the binary class problem using Ensembl and fRNAdb. Table 6.3 presents the
performance of the classification. Using the full feature set results in a slightly better
overall performance.
Table 6.4 presents the results of the pairwise binary classification between Ensembl
protein coding elements and each non-coding element found in fRNAdb, using both all
features and only de novo features. The resulting accuracies are high for each pair of RNA
elements; the misc class has the lowest classification performance.
Features   Accuracy   Precision [nc]   Recall [nc]
all        96.3       0.966            0.976
de novo    95.6       0.961            0.97

Table 6.3: Binary classification performance between Ensembl protein coding with all
fRNAdb non-coding sequences. The first row represents the experiment where all features
are used. The second row represents the experiment where only the de novo features were
used.
In addition to the pairwise binary classification between protein coding sequences and
each non-coding RNA type, pairwise binary classification was performed on each pair of
non-coding RNA types. Table 6.5 presents the results of our tests using all features, and
Table 6.6 presents the tests using strictly de novo features. The number of samples per
class varies and likely causes fluctuations in the precision and recall but, overall, the
feature sets used are promising in this binary classification problem.
In addition to the binary pairwise classification experiments, we performed multiclass
classifications between non-coding RNAs both with and without protein coding sequences.
Table 6.7 presents the confusion matrix of the multiclass classification for the nine
non-coding RNA types found in fRNAdb. The higher numbers along the shaded diagonal cells,
the true positives, indicate the potential of our classifier to be used on multiple
non-coding RNAs. However, we do observe a skew in predictions towards RNA types that are
heavily represented in fRNAdb. Having small test sets for some RNA elements alongside
very large test sets indicates potential limitations in our current multiclass methodology.
CHAPTER 6. EXPERIMENTAL RESULTS 39
Features Class 1 Class 2 Elements Accuracy Precision Recall[nc] [nc]
Table 6.4: Pairwise classification performance between Ensembl protein coding elements vs.each RNA type found in fRNAdb. The first half represents the results where all featuresare used. The second half represents the results where only de novo features were used,thereby excluding genome mapped information such as the number of exons and cross-species conservation scores.
CHAPTER 6. EXPERIMENTAL RESULTS 40
Class 1 Class 2 Elements Accuracy Precision Recall[nc] [nc]
Table 6.5: Pairwise classification performance using the complete feature set for fRNAdbnon-coding RNA. Precision and recall are only shown for the second class.
CHAPTER 6. EXPERIMENTAL RESULTS 41
Class 1 Class 2 Elements Accuracy Precision Recall[nc] [nc]
Table 6.6: Pairwise classification performance using de novo feature set for fRNAdb non-coding RNA, similar to Table 6.5. Precision and recall are only shown for the second class.
Despite this, the results suggest that our method is a good initial step towards classifying
among different non-coding RNA types. Addressing this limitation is a possible subject of further study.
Table 6.7: Confusion matrix for the multiclass classification using fRNAdb RNA types, using the entire feature set. The cells represent the number of predictions for each type, the shaded cells represent the number of true positives. Each RNA type is labelled from a to i, representing in order: fly-smallRNA, mat-miRNA, misc, piRNA, pre-miRNA, rRNA, snoRNA, snRNA and tRNA.
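The skew towards heavily represented RNA types can be made concrete by reading per-class precision and recall off a confusion matrix. The following is a minimal sketch, not the code used in this work. It assumes that rows hold the true class and columns the predicted class (the orientation of Table 6.7 is not restated here), and the counts are made up for illustration.

```python
def per_class_precision_recall(matrix, labels):
    """Return {label: (precision, recall)} from a square confusion matrix.

    Assumption: matrix[i][j] counts elements of true class i predicted as class j.
    """
    size = len(labels)
    scores = {}
    for k, label in enumerate(labels):
        true_positives = matrix[k][k]
        predicted_as_k = sum(matrix[i][k] for i in range(size))  # column total
        actually_k = sum(matrix[k])                               # row total
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up counts for three RNA types; these are not results from the thesis.
labels = ["tRNA", "snoRNA", "snRNA"]
toy_matrix = [
    [50, 2, 0],
    [10, 5, 1],
    [3, 0, 4],
]
scores = per_class_precision_recall(toy_matrix, labels)
for label in labels:
    precision, recall = scores[label]
    print(f"{label}: precision={precision:.3f} recall={recall:.3f}")
```

In this toy matrix the large class absorbs most of the errors of the small classes, which lowers recall for the small classes while the overall accuracy stays high.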
6.2 The RNA-Seq dataset
Classification was performed on data derived from transcriptome sequencing experiments,
using contig sets created using the Trans-ABySS [110] pipeline.
In our analysis, we first examine how well coding and non-coding RNA transcripts are
represented by the RNA-Seq reads. This is done using two methods: a genome
mapping procedure that measures read coverage at the annotated locations of Ensembl and
fRNAdb elements, followed by a direct mapping from assembled contigs to annotations using a
range of mapping thresholds. Our results ultimately show that non-coding RNAs are
represented as contigs, but that too few non-coding RNA types are represented to
support multiclass classification. We therefore continue our investigation of contig
classification using the binary ’protein coding vs. non-coding’ classes.
6.2.1 Contig preparation
Contig sets were generated from six RNA-Seq libraries: MM0490, MM0564, MM0566, MM0570,
MM0571, and MM0581. Each library consists of 50 bp paired-end poly(A)+ RNA as
described in Robertson et al. [110]. These six libraries represent various developmental stages
and tissue types of the C57BL/6J mouse. Table 6.8 lists the libraries along with their tissue of
origin, age, and the number of transcriptome reads sequenced.
Library  Tissue                           Age    Reads
MM0490   Liver                            E14.5  157M
MM0564   Heart-Atrioventricular-Cushions  E12.5  229M
MM0566   Heart-Atrioventricular-Cushions  E11.5  257M
MM0570   Dorsal Aorta                     E11.5  217M
MM0571   U and V Aorta                    E14.5  235M
MM0581   Endoderm-Definitive              E8.5   250M
Table 6.8: Six seven-lane RNA-Seq mouse libraries were examined.
6.2.2 Transcriptome reads mapped to the genome
We map the transcriptome reads to the mouse mm9 genome and calculate the read coverage
using the coordinates of each annotated element. Each read is mapped using
BWA [70] and SAMtools [71] to a modified mouse genome, one that contains pre-spliced
junctions between possible exon pairs as described in Morin et al. [90]. For this study,
these steps are taken for the Ensembl [55] v60 annotation of the mouse. Exon-exon
junction coordinates are defined from Ensembl [55], RefSeq [103] and UCSC Known Genes [53]
annotations.
Coverage is calculated for each annotation in Ensembl v60. Figures 6.1 and 6.2 show the
breakdown of read coverage
for a set of non-coding RNA-related biotype annotations using MM0564 reads. Protein
coding annotations are well expressed, as expected, but the non-coding annotations have
varying amounts of coverage. Assembling transcripts de novo from an RNA-Seq experiment
requires higher read coverage than reference based methods [110]. From this mapping
experiment alone it is unclear what fraction of the different non-coding biotypes will be
available as assembled contigs.
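The per-biotype summaries in Figures 6.1 and 6.2 are empirical cumulative distribution functions (ECDFs) of read coverage. The sketch below shows how such a curve can be computed from a list of per-annotation coverage values. It is an illustration only: the coverage definition used here (aligned read bases divided by annotation length) is an assumption, and the numbers are made up.

```python
import math
from bisect import bisect_right


def mean_coverage(aligned_bases, annotation_length):
    """Average read depth over one annotation (assumed definition)."""
    return aligned_bases / annotation_length


def ecdf(values):
    """Return Fn, where Fn(x) is the fraction of values that are <= x."""
    ordered = sorted(values)
    count = len(ordered)

    def fn(x):
        return bisect_right(ordered, x) / count

    return fn


def log10_nonzero(values):
    """log10 of the positive values; zero-coverage annotations are skipped."""
    return [math.log10(v) for v in values if v > 0]


# Made-up coverage values for one biotype, not data from the thesis.
coverage = [0.0, 0.5, 10.0, 100.0, 1000.0]
fn = ecdf(coverage)
print(fn(10.0))  # fraction of annotations with coverage <= 10
print(log10_nonzero(coverage))
```

An ECDF that rises early indicates a biotype in which most annotations have low coverage, and such annotations are less likely to be recovered as contigs by de novo assembly.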
[Figure 6.1 panels: Distribution of transcript coverages for library MM0564. One pair of plots per biotype (protein_coding, lincRNA, miRNA, misc_RNA, pseudogene, rRNA, snoRNA, snRNA): an ECDF, Fn(x), and a histogram of log10 read coverage.]
Figure 6.1: Read coverage for Ensembl broken down to biotypes, for RNA-Seq reads from library MM0564. Each biotype is represented as an ECDF and as a distribution of log10 read coverage.
[Figure 6.2 plot: ECDF of Ensembl v60 transcript read coverage for RNA-Seq library MM0564.]
Figure 6.2: Empirical cumulative distribution function representing the read coverage for a select number of Ensembl biotypes mapped to the mm9 reference genome from Figure 6.1.
6.2.3 Contig assembly and merging
Each RNA-Seq library was assembled and merged using Trans-ABySS [110], assembling the
reads for every even k-mer length from 26 to 50 and producing a set of contigs for each library.
One of the issues with de Bruijn graph based assemblers is that, depending on the coverage and
the k-mer length k, the assembly can contain very fragmented and overlapping contigs. We
therefore processed the contig sets using the contig merging method [110]. To prevent the potential
exclusion of non-coding RNAs from the merged dataset, we examined merged contig sets with
filtering turned both on and off. The resulting sets of contigs are summarised in Table 6.9.
Filter  Library  Reads  Number of contigs  Min size  Max size  Ave. size  Med. size  N50
Table 6.9: Six seven-lane RNA-Seq libraries were assembled, merged to create the contig sets. These contigs were used as input for the classifier.
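The summary columns of Table 6.9 can be derived directly from the list of contig lengths. The sketch below uses the common definition of N50 (the contig length at which contigs of that length or longer contain at least half of all assembled bases); the function names and the example lengths are illustrative and are not taken from the assembly pipeline.

```python
import statistics


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("n50 needs at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def summarise(lengths):
    """Summary statistics matching the size columns of a contig table."""
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "n50": n50(lengths),
    }


# Made-up contig lengths in bp, for illustration only.
print(summarise([2, 3, 4, 5, 6]))
```

Because N50 weights contigs by their length, a set dominated by short fragments can have a low median but still a moderate N50.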
6.2.4 Contig to annotation mapping
The unfiltered contig set from each library was mapped to known protein coding mRNAs and
non-coding RNAs found in the databases Ensembl and fRNAdb using a range of thresholds
from 0.7 to 1.0. Figure 6.3 shows the number of contigs that map to annotated protein
coding and non-coding elements at different thresholds, for filtered and unfiltered
contigs respectively. From this figure we make a number of observations. First, fRNAdb
non-coding elements are mapped in lower quantities than Ensembl elements, but their number is
still in the order of hundreds and is likely sufficient for classification experiments. Second,
comparison between filtered and unfiltered contigs shows that filtering appears to affect
non-coding RNA sequences in fRNAdb but not Ensembl sequences. Third, as the mapping
threshold increases, the number of annotated contigs drops quite uniformly for both coding
and non-coding transcripts; it is therefore not obvious whether a single threshold value is
practical for all our mapping, and this is a possible topic of future work.
We further investigate both coding and non-coding annotation sets by breaking down
individual biotypes (Figure 6.4) and non-coding RNA families (Figure 6.5). From these two
figures, it is evident that not all types are represented in this mapping, indicating that either
their transcripts are not mapped well with the contig set, or are not present at high enough
levels in the RNA-Seq library, given the protocol and sequencing depth. From Figure 6.5,
for thresholds between 0.7 and 1.0, there are not enough individual RNA types found in the
fRNAdb dataset to perform pairwise or multiclass RNA classification as was done in section
6.1.3. For classification using contig sets, we focus on the binary coding vs. non-coding
classification problem.
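The threshold sweep behind Figures 6.3 to 6.5 amounts to counting, for each threshold, the contigs whose best alignment score reaches that threshold. The sketch below assumes each contig has already been reduced to a single best mapping score in [0, 1] and that a contig counts as mapped when its score is greater than or equal to the threshold; both the scoring and the contig identifiers are hypothetical.

```python
def count_mapped(best_scores, thresholds):
    """Count contigs whose best mapping score reaches each threshold.

    best_scores: dict of contig id -> best mapping score in [0, 1] (assumed input).
    """
    return {
        threshold: sum(1 for score in best_scores.values() if score >= threshold)
        for threshold in thresholds
    }


# Thresholds from 0.70 to 1.00 in steps of 0.05, as on the figure axes.
thresholds = [step / 100 for step in range(70, 101, 5)]

# Hypothetical contigs and scores, for illustration only.
best_scores = {"contig_a": 0.72, "contig_b": 0.90, "contig_c": 1.00, "contig_d": 0.65}

for threshold, count in count_mapped(best_scores, thresholds).items():
    print(f"{threshold:.2f}: {count}")
```

Running the same count separately per biotype or per RNA type shows directly which classes retain enough members for classification at a given threshold.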
6.2.5 Contig cross validation
Feature values were computed from contig sequences in the same manner as for the database
annotated sequences in earlier sections. We performed the mapping, feature extraction
and classification on all six transcriptome libraries. All resulted in similar findings and
performance; in the interest of space and to avoid repetition, we do not include
all the results in this thesis.
Contigs were mapped to protein coding or non-coding sequences using the mapping
criteria in section 6.2.4, resulting in sets of contigs labelled as Ensembl protein coding
RNAs and fRNAdb non-coding RNAs, for a range of mapping thresholds from 0.7 to 1.0.
We performed binary classification between the labelled contigs. The top half of Table 6.10
summarises the classification performance for labelled contigs derived from library MM0564.
We also performed the same classification using the original annotation sequence that
each contig represented, presented as ‘DB elements’ in the lower half of the table. For these
experiments, the total accuracies are quite consistent for both sets. However, the precision and
recall of the non-coding sequences are low. This is most likely caused by the difference in
sample size, as there are more coding contigs than non-coding contigs.
To avoid the effect on performance due to differences in sample sizes between the two
classes, a stratified test set is made so that each class is equal in size. Table 6.11 shows
[Figure 6.3 panels: (a) Ensembl / filtered mapped, (b) Ensembl / unfiltered mapped, (c) fRNAdb / filtered mapped, (d) fRNAdb / unfiltered mapped. x-axis: BLAT alignment thresholds (0.70 to 1.00); y-axis: number of annotations; one series per library: MM0490, MM0564, MM0566, MM0570, MM0571, MM0581.]
Figure 6.3: Number of unique contigs that map to the sequence annotation databases fRNAdb and Ensembl using a range of mapping thresholds for all six mouse libraries. (a) and (c) represent the filtered contig set mappings, (b) and (d) represent the unfiltered contig set mappings.
Source  Threshold  Elements  Accuracy  Precision  Recall  [nc] [nc] [nc]
Table 6.10: Classification performance using the contigs from the library MM0564, using the full feature set. The contig sets are mapped to protein coding sequences from Ensembl, and non-coding RNA sets from fRNAdb using a series of mapping thresholds. The top half of the table represents the classification results using features extracted from the contig sequences. The lower half represents the classification results using the features extracted from the original sequence from either Ensembl or fRNAdb that each contig mapped to.
[Figure 6.4 panels: (a) Ensembl filtered MM0564 contigs, (b) Ensembl unfiltered MM0564 contigs. x-axis: BLAT alignment thresholds (0.70 to 1.00); y-axis: number of annotations; series: snRNA, snoRNA, rRNA, pseudogene, misc_RNA, miRNA, lincRNA, protein_coding.]
Figure 6.4: Ensembl transcripts mapped by filtered (a) and unfiltered (b) MM0564 contigs, broken down into individual biotypes.
the performance of the classifier on this stratified set for the same contigs. In comparison
to Table 6.10 it is evident that the accuracy decreases slightly but, at the same time, the
precision and recall rise to levels comparable with the accuracy.
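One simple way to obtain classes of equal size is to randomly downsample the larger class. The sketch below illustrates that idea only; it assumes random downsampling with a fixed seed, which is not necessarily how the stratified set in Table 6.11 was built, and the element names are hypothetical.

```python
import random


def balanced_subsample(coding, noncoding, seed=0):
    """Draw equally sized random samples from both classes.

    Assumption: the larger class is downsampled to the size of the smaller one.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    size = min(len(coding), len(noncoding))
    return rng.sample(coding, size), rng.sample(noncoding, size)


# Hypothetical labelled contigs: many coding, few non-coding.
coding_contigs = [f"coding_{i}" for i in range(100)]
noncoding_contigs = [f"noncoding_{i}" for i in range(5)]

coding_sample, noncoding_sample = balanced_subsample(coding_contigs, noncoding_contigs)
print(len(coding_sample), len(noncoding_sample))
```

With balanced classes, a classifier can no longer reach high accuracy by favouring the majority class, which is consistent with accuracy dropping slightly while non-coding precision and recall rise.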
The cause of the difference in classification performance across threshold values
is not immediately clear: the trend could result from the rising threshold
values or simply from the decrease in the number of elements tested. However, we note that
accuracy increases for the contigs as the threshold increases, while the accuracy for the
database elements does not change to the same degree. This suggests that the number of
elements in the test set is not responsible for the difference in performance. The only
remaining difference is the quality of the sequences, determined by the threshold values.
Comparing the performance between the contigs and the database elements shows that they
converge to approximately the same value as the threshold approaches 1.0 (both to 96% in
Table 6.11). Lower thresholds produce lower classification performance. This suggests that
higher thresholds force the mapped
contigs to resemble real coding and non-coding sequences, improving the performance of the
classifier. At the same time, as the threshold increases there are fewer elements to train
[Figure 6.5 panels: (a) fRNAdb filtered MM0564 contigs, (b) fRNAdb unfiltered MM0564 contigs. x-axis: BLAT alignment thresholds (0.70 to 1.00); y-axis: number of annotations; series: flysmallRNA, matmiRNA, misc, piRNA, premiRNA, rRNA, snoRNA, snRNA, tRNA.]
Figure 6.5: fRNAdb transcripts mapped by filtered (a) and unfiltered (b) MM0564 contigs, broken down into individual RNA types.
and test the classifier. These observations again show the difficulty in choosing
a suitable value, or set of values, for the threshold. This is a major issue that must be
considered in order to classify raw contig sequences.
PORTRAIT was also applied to the contig sets and the database annotations in the same
way as our classifier. Feature computation was not possible for the contig sets
due to software errors. However, we were able to extract the features from the database
elements mapped to the contigs. The results on the database elements for SSGC and
PORTRAIT are compared in Table 6.12. The accuracy is comparable for both methods on
the unbalanced set but is quite different on the stratified set, where the numbers of protein
coding and non-coding elements were equal. This again illustrates the effect of unbalanced
class sizes in our dataset.
6.2.6 Full contig set classification
The cross-validation experiments in the previous sections were applied to labelled data sets.
From the tens of millions of contigs produced in the assembly, only tens of thousands were
Source  Threshold  Elements  Accuracy  Precision  Recall  [nc] [nc] [nc]
Table 6.11: Classification performance for the stratified contigs from library MM0564, using the full feature set. In comparison to Table 6.10, the number of elements in each class are equal. The contig sets are mapped to protein coding sequences from Ensembl, and non-coding RNA sets from fRNAdb using a series of mapping thresholds. The top half of the table represents the classification results using features extracted from the contig sequences. The lower half represents the classification results using the features extracted from the original sequence from either Ensembl or fRNAdb that each contig mapped to. Note that for thresholds at 1.0, there are not enough elements to perform classification.
Type  Threshold  Elements  SSGC: Acc.  Prec  Recall  PORTRAIT: Acc.  Prec  Recall
Table 6.12: Classification performance for the database sequences mapped by the unfiltered contig sets from MM0564; each classification is compared with PORTRAIT. The precision and recall is only shown for the non-coding class. We were not able to compare the classification accuracies for the actual contig sets themselves. Note the number of elements is lower for PORTRAIT due to the size restrictions for their input.
used in the cross validation experiments. In this section, we investigate the use of SSGC
applied on the full contig set. From the unannotated contig sequences, we attempt to use
the classifier predictions to find potential novel non-coding and protein coding transcripts
in the data.
We created an SVM model from 3124 annotated contig sequences that represent both
classes, in equal proportions, from the mouse library MM0564 using 0.8 as the mapping
threshold. The SVM model was applied on the entire contig set to obtain a class prediction
and a confidence value, the p-value (Figure 4.3).
Each contig is assigned a p-value in [0,1], where a contig with a value below 0.5 is
classified as protein coding and a contig with a value above 0.5 is classified as non-coding.
Figure 6.6 shows the distribution of contig predictions as well as the p-values. The p-values
are heavily skewed towards very high values, suggesting that the vast majority of the
assembled contigs are strongly predicted as non-coding. Figure 6.7 shows the mapping
scores and sizes of contigs that are at either extreme of the p-value distribution, and are
therefore likely non-coding or protein coding. We looked closely at possible novel non-coding
and protein coding contigs by examining sequences with p-values above 0.95 or below 0.05
that do not map to any known mm9 mouse fRNAdb or Ensembl protein coding sequences
using a BLAT alignment.
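The selection of candidate novel contigs described above can be sketched as a simple filter over the classifier output. This is an illustration under stated assumptions: each contig is represented as a record with a p-value and a flag saying whether it has any BLAT alignment to known sequences, the cut-offs are inclusive as in the caption of Figure 6.7, and a p-value of exactly 0.5 (not covered by the rule in the text) is arbitrarily treated as protein coding. The two `example:` records are hypothetical.

```python
def predict_class(p_noncoding):
    """Class label from the non-coding p-value (tie at 0.5 is an assumption)."""
    return "non-coding" if p_noncoding > 0.5 else "protein coding"


def novel_candidates(contigs, low=0.05, high=0.95):
    """Split unmapped contigs with extreme p-values into two candidate lists."""
    coding, noncoding = [], []
    for contig in contigs:
        if contig["mapped"]:
            continue  # already explained by a known annotation
        if contig["p"] <= low:
            coding.append(contig["id"])
        elif contig["p"] >= high:
            noncoding.append(contig["id"])
    return coding, noncoding


contigs = [
    {"id": "k50:177614", "p": 1.0, "mapped": False},   # contig shown in Figure 6.8
    {"id": "k29:3267973", "p": 0.0, "mapped": False},  # contig shown in Figure 6.9
    {"id": "example:1", "p": 0.97, "mapped": True},    # hypothetical, already annotated
    {"id": "example:2", "p": 0.40, "mapped": False},   # hypothetical, not extreme
]

coding, noncoding = novel_candidates(contigs)
print("candidate protein coding:", coding)
print("candidate non-coding:", noncoding)
```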
Our analysis of potential non-coding contigs shows that many are found in intronic and
UTR regions of known genes. Figure 6.8 shows one such contig, k50:177614, in the UCSC
Genome Browser [62]; it has a p-value of 1.0 and no BLAT alignments with any
sequences in fRNAdb or Ensembl protein coding. It is likely that this sequence is located
within a novel polyadenylation tail of the gene Fstl4. Although there is no evidence of the
sequence being functional, its location in the 3′ tail suggests that the classifier was correct
in classifying the contig as non-coding.
Figure 6.9 represents the alignment of contig k29:3267973 to the mm9 mouse genome.
The aligned RNA-Seq reads show pileups that resemble a spliced gene. In addition, the
exonic regions are highly conserved across some species. Figure 6.10 shows the contig with
the mouse sequence coordinate lifted from the mouse mm9 genome to the human hg18
genome using the UCSC LiftOver tool [62]. From the viewer, it is evident that one of the
exons is aligned to the AceView Gene Model glertee.aApr07. This suggests that the classifier
was correct in classifying the contig as protein coding.
Our analysis shows that many potential novel protein coding contigs are aligned to
[Figure 6.6 panels: (a) MM0564 contig predictions (protein coding, non-coding), (b) p-value distribution for all contigs, (c) Contigs with no alignments, (d) Contigs ≥ 500bp. x-axis for (b) to (d): non-coding RNA p-value (0.0 to 1.0); y-axis: number of contigs.]
Figure 6.6: The full MM0564 contig set is predicted by the SVM model, and are assigned probabilities. Contigs with p-values below 0.5 are classified as protein coding, while contigs with p-values above 0.5 are classified as non-coding. (a) is the class prediction for all contigs. (b) is the p-value distribution of all the contigs, (c) is the p-value of contigs with no alignments to any known non-coding transcripts. (d) is the p-value for all contigs 500bp and larger.
[Figure 6.7 panels: (a) Contigs / p-value ≤ 0.05, x-axis: protein coding mapping scores; (b) Contigs / p-value ≥ 0.95, x-axis: non-coding RNA mapping scores; (c) p-value ≤ 0.05 and no mapping, x-axis: contig size (bp); (d) p-value ≥ 0.95 and no mapping, x-axis: contig size (bp). y-axis: number of contigs (log).]
Figure 6.7: Mapping scores and sizes of contigs strongly predicted as protein coding (p-value ≤ 0.05) and non-coding (p-value ≥ 0.95). a,b) Distribution of mapping scores with the best-aligned a) protein-coding Ensembl sequence, b) non-coding fRNAdb sequence. c,d) Distribution of contig sizes (white). In (c), the red regions represent strongly protein coding (p-value ≤ 0.05) which do not map to any known sequences in Ensembl or fRNAdb. In (d), the orange regions represent strongly non-coding (p-value ≥ 0.95) which do not map to any known sequences.
[Figure 6.8 image: UCSC Genome Browser view of chr11 (52995000 to 53015000) showing contig k50:177614, the Contigs MM0564 BLAT track, WTSS read pileups for libraries MM0490 (E14.5 liver), MM0564 (E12.5 heart AV cushion), MM0566 (E11.5 heart AV cushion), MM0570 (E11.5 dorsal aorta), MM0571 (E14.5 umbilical & vitelline artery) and MM0581 (E8.5 definitive endoderm), gene annotation tracks (Fstl4), and mammal conservation.]
Figure 6.8: Contig k50:177614 aligned in the mouse mm9 genome. The top track represents the multiple contigs that are mapped to this location. The second set of tracks are the pileups for the RNA-Seq read alignments for the six mouse transcriptome libraries. Below the contig track is the gene track and the conservation track. This contig has a p-value of 1.0 and does not map to any known non-coding or protein coding sequences.
transcripts that are similar to previously known protein coding sequences, but which are not yet
labelled as protein coding in the Ensembl database.
From these two simple examples, we demonstrate the ability of SSGC to detect potential
novel coding and non-coding contigs from the full contig set. From manual inspection,
sequences at either extreme of the p-value distribution do resemble real non-coding and
protein coding elements. However, SSGC’s use as a gene finder, especially for novel
sequences, is currently limited. For practical use, it would be
desirable to distinguish a real transcript from an assembly artifact, and to
distinguish functional from non-functional non-coding RNAs.
6.3 Feature ranking
We also investigate the effectiveness of the features used in the classification experiments
by ranking features for different conditions. Table 6.13 shows the top twenty ranked features
for the classification experiments between Ensembl protein coding and fRNAdb non-coding
sequences as in section 6.1.3.
The first two columns represent the ranked features used in the binary classification
between coding and non-coding. ORF-related features are prevalent in the list, which is
understandable as non-coding sequences are not expected to have ORF sequences. We also
see the importance of the trigrams TAG and TAA in the first four columns. These are two
of the three stop codons within an ORF. We can also observe that a number of features
not available in the de novo set are important for this binary classification. This is again
understandable as we would expect the number of exons to be important in identifying
non-coding RNAs. Conservation is also represented, further supporting the notion that protein
coding sequences are much better conserved than non-coding sequences in the genome.
The multiclass experiments are shown in the middle and the last pair of columns. We
observe that once protein coding sequences are removed from the classifier (last two columns),
new features emerge in the list, notably length and secondary structure. The length is
a key feature used to distinguish some of the smaller sized from the larger sized non-coding
RNAs. The secondary structure based feature ‘Total energy’ likely plays a larger role as
some RNA types are known to have very distinct conformations.
We also examine the effectiveness in classification using subsets of the top-ranked features
using the information gain ranking filter. Table 6.14 represents the performance for the
[Figure 6.9 image: UCSC Genome Browser view showing contig k29:3267973, WTSS read pileups for libraries MM0490, MM0564, MM0566, MM0570, MM0571 and MM0581, gene annotation tracks (Mark4), and mammal conservation.]
Figure 6.9: Contig k29:3267973 aligned in the mouse mm9 genome. Similar to Figure 6.8, the tracks represent the assembled contigs, RNA-Seq read pileups, the contig, known gene annotations, and conservation. This contig has a p-value of 0 and does not map to any known non-coding or protein coding sequences.
Figure 6.10: Contig k29:3267973 (from Figure 6.9) represented in the human hg18 genome, using the LiftOver tool from the UCSC Genome Browser [62]. The tracks represent the contig coordinate (from the LiftOver), the contig BLAT alignment, known human gene models, histone modification tracks, and the conservation.
      Coding vs. non-coding             Multiclass (Prot + RNA)                  Multiclass (RNA)
Rank  All features          de novo     All features             de novo         All features       de novo
1     ORF pro-
4     ORF score             ORF score   TAG                      ORF-size        conserv-Num-bases  ORF proportion
5     Number of exons (h)   CG          Histones with coverage   TA              Total-energy       TG
6     Number of exons (c)   TA          ORF-size                 ORF score       length             GA
7     CG                    CGA         Bases with conservation  T               ORF proportion     GT
8     Conserved exons       TTA         TA                       TT              TG                 GC-content
9     TA                    TAA         ORF score                CG              GA                 G
10    Conservation score    aaD         T                        Total-energy    GT                 A
11    CGA                   TTT         TT                       GC-content      GC-content         T
12    TTA                   TT          CG                       GA              G                  AT
13    TAA                   CCG         Total-energy             TAA             A                  AG
14    aaD                   T           GC-content               TTA             T                  TGA
15    TTT                   CGG         GA                       TTT             AT                 TC
16    TT                    GTA         TAA                      GTT             AG                 C
17    CCG                   GGA         TTA                      GGA             TGA                ORF end
18    T                     GTT         TTT                      GC              TC                 AC
19    CGG                   GAC         GTT                      G               C                  CT
20    GTA                   TCG         GGA                      GTA             ORF end            CA
Table 6.13: The top twenty ranked features based on classification effectiveness from the Ensembl and fRNAdb datasets. The first pair of columns lists the most effective features from binary class experiments, coding versus non-coding. The second pair of columns lists the features for the multiclass considering RNA types and proteins. The last pair of columns is from the multiclass using only RNA types. Both the complete feature set and the de novo feature sets are considered in each of the three experiment types.
binary classification experiment between Ensembl protein coding and fRNAdb
non-coding RNAs. Starting with the top ranked feature, ‘ORF proportion’, we run the classifier,
then add features one at a time in order of their rank and classify at each step. We
can see a steady rise in performance as features are added. The accuracy
rises to 94.8% by the time the top 20 features are used. The complete feature set achieved
an accuracy of 96.3% (Table 6.3).
Table 6.14: Classification performance using incrementally, the top twenty ranked features from the Ensembl and fRNAdb datasets, for the binary classifier. As more features are added, there is a steady rise in the accuracy, precision and recall. The full model containing all features has an accuracy of 96.3%, precision of 0.966, and recall of 0.976 as shown in Table 6.3.
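The ranking by information gain used in this section can be illustrated with a small sketch. Information gain is the reduction in class entropy obtained by knowing a feature's value, IG = H(class) - H(class | feature). The sketch assumes that feature values have already been discretised into bins; how continuous features were handled by the ranking filter is not restated here, and the feature names and toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) minus the entropy of labels within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(feature_table, labels):
    """Feature names sorted from highest to lowest information gain."""
    gains = {name: information_gain(values, labels) for name, values in feature_table.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy data: four sequences, two per class, with pre-binned feature values.
labels = ["coding", "coding", "non-coding", "non-coding"]
feature_table = {
    "uninformative bin": ["a", "b", "a", "b"],
    "ORF proportion (binned)": ["high", "high", "low", "low"],
}
print(rank_features(feature_table, labels))
```

An incremental experiment in the style of Table 6.14 would then train and evaluate a classifier on the first one, two, three, and so on, features of this ranked list.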
Chapter 7
Conclusion and future work
Over a short period, our understanding of non-coding RNA has increased dramatically.
RNA is no longer seen as just an intermediate for protein synthesis: non-coding RNAs have
been shown to be involved in numerous roles in cell biology. At the same time, advancements
in transcriptome studies using RNA-Seq have continued to provide a platform for new research.
Our work explored the prediction of non-coding RNA using an RNA-Seq approach.
7.1 Summary
In this thesis, we present a method and software for classifying transcript sequences as
protein coding vs non-coding, and extend this to distinguish different non-coding RNA
families, which has not been reported in the literature. We also propose a method for
classifying de novo transcriptome contigs from short read RNA-Seq data.
Our results show that the performance of our classifier is comparable to, or in most cases
surpasses, what is reported in the current literature, and suggest that machine learning
based methods can be used to discriminate between different families of non-coding RNA.
The software tools generated in this work are designed to be modular and to be modified
to suit particular needs.
As the number of transcriptome studies continues to increase, especially de novo
non-reference based studies, we expect to see more methods emerge to handle the
sometimes noisy output sequences of these studies. Our investigation into assembled contigs
indicates that classifiers can be expected to contribute to such studies. With improvements in our
understanding of non-coding RNAs, the quality of non-coding databases, the quality of
transcriptome experiments and of different assembly algorithms, we expect machine learning
approaches to such problems to continue to improve.
7.2 Future work
Here, we outline a number of areas for improving the calculations described, and directions
that we have yet to explore.
• In our investigation of the full contig set, we found many elements that seem to
be neither functional protein coding nor non-coding, e.g. fragmented contigs and
transcript runoffs in intronic and UTR regions of genes. Depending on the assembly
used, we have seen many fragmented contigs that cannot be merged. It is possible
that these fragmented contigs have features that can be used to classify them
into an alternative class of non-functional non-coding RNAs.
• In a true de novo setting in which classification would be applied to a species that
does not have a well-annotated genome sequence, we cannot expect to have database
annotated coding and non-coding sequences for all species. To assess a strictly de novo
classifier we must also explore the ability of building models in one training species
and testing on another.
• Using relative RNA-Seq read coverage as a classifier feature has been shown to be
effective [30, 59, 77]. While this could be done for transcripts and de novo contigs,
our initial focus was on de novo methodologies, and we did not assess this. A quick
follow up could add the RNA-Seq read coverage for each transcript or contig.
• In our collaboration with the Trans-ABySS group we also assessed detecting
polyadenylation sites within both transcript and contig sequences [110]. This could be
considered as a source of information when inferring the direction of the transcript,
as well as when searching for polyadenylation signals found in certain 3′ UTRs.
Currently, certain features in the feature extraction are not optimised for reverse
complement inputs; this is a topic of further study.
• We assessed only one contig assembly program, ABySS [120], for use in the de novo
setting. De novo assembly requires higher coverage than reference-based methods for
reconstructing the transcriptome. Reference-based methods [40, 126] may increase the
sensitivity of transcript detection, though they are also known to increase false
positive results. Evaluating the performance of our classifier with reference-based
assembly may also be of interest.
• Our study, along with many others that utilise RNA-Seq, uses protocols that are
designed primarily for sequencing protein-coding transcripts. Alternative sequencing
protocols are available that allow the detection of many small non-coding
sequences such as miRNAs. As many non-coding RNAs are small, investigating
these protocols may provide a more informative framework in which to test our classifier.
• This thesis investigated different non-coding RNA types and families, and for that task
we focussed mainly on the types found in fRNAdb. Rfam is another database
annotated using RNA families. However, in our experience it was difficult
to work with, as there were many families with very few entries, as well as entries that
belonged to many families. Given its strong growth over the years, we do not want to
abandon this resource because of these factors, and feel that it should be
investigated again.
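To make the polyadenylation item above more concrete, the following Python sketch shows one way a signal search could hint at the direction of a contig. It is only an illustration under stated assumptions, not the method used in this thesis or by Trans-ABySS: the hexamers AATAAA and ATTAAA, the 50-base window, and the function names reverse_complement, has_polya_signal and infer_orientation are all choices made for this example. The sketch checks for a signal near the 3′ end of the contig as given and of its reverse complement, and reports a direction only when exactly one of the two contains a signal.

```python
# Illustrative sketch only. The signal hexamers, the window size and all
# function names are assumptions made for this example.

# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# Canonical polyadenylation signal and one common variant (assumed choice).
POLYA_SIGNALS = ("AATAAA", "ATTAAA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def has_polya_signal(seq: str, window: int = 50) -> bool:
    """Return True if a signal hexamer occurs in the last `window` bases."""
    tail = seq[-window:].upper()
    return any(signal in tail for signal in POLYA_SIGNALS)


def infer_orientation(contig: str, window: int = 50) -> str:
    """Return '+', '-' or '?' depending on which strand shows a 3' signal."""
    forward = has_polya_signal(contig, window)
    reverse = has_polya_signal(reverse_complement(contig), window)
    if forward and not reverse:
        return "+"   # signal only near the 3' end of the sequence as given
    if reverse and not forward:
        return "-"   # signal only near the 3' end of the reverse complement
    return "?"       # no signal, or a signal on both strands


if __name__ == "__main__":
    contig = "G" * 60 + "AATAAA" + "C" * 20
    print(infer_orientation(contig))
    print(infer_orientation(reverse_complement(contig)))
```

A hexamer match alone is weak evidence, since these short motifs also occur by chance, so a result of this kind would more plausibly serve as one classifier feature among others than as a definitive orientation call.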
Bibliography
[1] Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter. Molecular Biology of the Cell. Garland Science, 270 Madison Avenue, New York, New York, 5th edition, 2008.
[2] Paulo P. Amaral, Michael B. Clark, Dennis K. Gascoigne, Marcel E. Dinger, and John S. Mattick. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Research, 39(suppl 1):D146–D151, 01 2011.
[3] Roberto Arrial, Roberto Togawa, and Marcelo Brigido. Screening non-coding RNAs in transcriptomes from neglected species using PORTRAIT: case study of the pathogenic fungus Paracoccidioides brasiliensis. BMC Bioinformatics, 10(1):239, 2009.
[4] Yan W. Asmann, Michael B. Wallace, and E. Aubrey Thompson. Transcriptome profiling using next-generation sequencing. Gastroenterology, 135(5):1466–1468, 11 2008.
[5] Courtney C. Babbitt, Olivier Fedrigo, Adam D. Pfefferle, Alan P. Boyle, Julie E. Horvath, Terrence S. Furey, and Gregory A. Wray. Both noncoding and protein-coding RNAs contribute to gene expression evolution in the primate brain. Genome Biology and Evolution, 2010(0):67–79, 2010.
[6] JH Badger and GJ Olsen. CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol, 16(4):512–524, 1999.
[7] Asa Ben-Hur, Cheng Soon Ong, Soren Sonnenburg, Bernhard Scholkopf, and Gunnar Ratsch. Support vector machines and kernels for computational biology. PLoS Comput Biol, 4(10):e1000173, 10 2008.
[8] E. Birney, J.A. Stamatoyannopoulos, A. Dutta, R. Guigo, T.R. Gingeras, E.H. Margulies, Z. Weng, M. Snyder, and E.T. Dermitzakis. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, 06 2007.
[9] Inanc Birol, Shaun D. Jackman, Cydney B. Nielsen, Jenny Q. Qian, Richard Varhol, Greg Stazyk, Ryan D. Morin, Yongjun Zhao, Martin Hirst, Jacqueline E. Schein, Doug E. Horsman, Joseph M. Connors, Randy D. Gascoyne, Marco A. Marra, and Steven J. M. Jones. De novo transcriptome assembly with ABySS. Bioinformatics, 25(21):2872–2877, 11 2009.
[10] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, Cambridge CB3 0FB, U.K., 2006.
[11] Brigitte Boeckmann, Amos Bairoch, Rolf Apweiler, Marie-Claude Blatter, Anne Estreicher, Elisabeth Gasteiger, Maria J. Martin, Karine Michoud, Claire O’Donovan, Isabelle Phan, Sandrine Pilbout, and Michel Schneider. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, 31(1):365–370, 1 2003.
[12] Dario Boffelli, Jon McAuliffe, Dmitriy Ovcharenko, Keith D. Lewis, Ivan Ovcharenko, Lior Pachter, and Edward M. Rubin. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science, 299(5611):1391–1394, 02 2003.
[13] George A. Calin, Chang-gong Liu, Manuela Ferracin, Terry Hyslop, Riccardo Spizzo, Cinzia Sevignani, Muller Fabbri, Amelia Cimmino, Eun Joo Lee, Sylwia E. Wojcik, Masayoshi Shimizu, Esmerina Tili, Simona Rossi, Cristian Taccioli, Flavia Pichiorri, Xiuping Liu, Simona Zupo, Vlad Herlea, Laura Gramantieri, Giovanni Lanza, Hansjuerg Alder, Laura Rassenti, Stefano Volinia, Thomas D. Schmittgen, Thomas J. Kipps, Massimo Negrini, and Carlo M. Croce. Ultraconserved regions encoding ncRNAs are altered in human leukemias and carcinomas. Cancer Cell, 12(3):215–229, Sep 2007.
[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a Library for Support Vector Machines. National Taiwan University, 2001.
[15] F. Chiaromonte, R. J. Weber, K. M. Roskin, M. Diekhans, W. J. Kent, and D. Haussler. The share of human genomic DNA under selection estimated from human–mouse genomic alignments. Cold Spring Harbor Symposia on Quantitative Biology, 68:245–254, 01 2003.
[16] Liam Childs, Zoran Nikoloski, Patrick May, and Dirk Walther. Identification and classification of ncRNA molecules using graph properties. Nucleic Acids Research, 37(9):e66, 05 2009.
[17] Rebecca Chodroff, Leo Goodstadt, Tamara Sirey, Peter Oliver, Kay Davies, Eric Green, Zoltan Molnar, and Chris Ponting. Long noncoding RNA genes: conservation of sequence and brain expression among diverse amniotes. Genome Biology, 11(7):R72, 2010.
[18] Michele Clamp, Ben Fry, Mike Kamal, Xiaohui Xie, James Cuff, Michael F. Lin, Manolis Kellis, Kerstin Lindblad-Toh, and Eric S. Lander. Distinguishing protein-coding and noncoding genes in the human genome. Proceedings of the National Academy of Sciences, 104(49):19428–19433, 12 2007.
[19] Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–562, 12 2002.
[20] Rat Genome Sequencing Project Consortium. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428(6982):493–521, 04 2004.
[21] Gregory M. Cooper, Michael Brudno, Eric A. Stone, Inna Dubchak, Serafim Batzoglou, and Arend Sidow. Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Research, 14(4):539–548, 04 2004.
[22] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 09 1995.
[23] Jennifer Couzin. Breakthrough of the year: Small RNAs Make Big Splash. Science, 298(5602):2296–2297, 2002.
[24] Teresa Creanza, David Horner, Annarita D’Addabbo, Rosalia Maglietta, Flavio Mignone, Nicola Ancona, and Graziano Pesole. Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements. BMC Bioinformatics, 10(Suppl 6):S2, 2009.
[25] Marcel E. Dinger, Ken C. Pang, Tim R. Mercer, and John S. Mattick. Differentiating protein-coding and noncoding RNA: Challenges and ambiguities. PLoS Comput Biol, 4(11):e1000176, 11 2008.
[26] I. Dondoshansky. Blastclust (NCBI software development toolkit), 6.1 edition, 2002.
[27] Sean R. Eddy. Non-coding RNA genes and the modern RNA world. Nat Rev Genet, 2(12):919–929, 12 2001.
[28] Sean R. Eddy and Richard Durbin. RNA sequence analysis using covariance models. Nucleic Acids Research, 22(11):2079–2088, 06 1994.
[29] Yasser EL-Manzalawy and Vasant Honavar. WLSVM: Integrating LibSVM into Weka Environment, 2005.
[30] Florian Erhard and Ralf Zimmer. Classification of ncRNAs using position and size information in deep sequencing data. Bioinformatics, 26(18):i426–i432, 09 2010.
[31] N. Erho and K. Wiese. An exploration of individual RNA structural elements in RNA gene finding. Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2010 IEEE Symposium on, pages 1–9, 2-5 May 2010.
[32] Noah Fahlgren, Miya D. Howell, Kristin D. Kasschau, Elisabeth J. Chapman, Christopher M. Sullivan, Jason S. Cumbie, Scott A. Givan, Theresa F. Law, Sarah R. Grant, Jeffery L. Dangl, and James C. Carrington. High-throughput sequencing of Arabidopsis microRNAs: Evidence for frequent birth and death of MIRNA genes. PLoS ONE, 2(2):e219, 2007.
[33] Alistair R. R. Forrest, Rehab F. Abdelhamid, and Piero Carninci. Annotating non-coding transcription using functional genomics strategies. Briefings in Functional Genomics & Proteomics, 8(6):437–443, 11 2009.
[34] Kelly A. Frazer, Lior Pachter, Alexander Poliakov, Edward M. Rubin, and Inna Dubchak. VISTA: computational tools for comparative genomics. Nucleic Acids Research, 32(suppl 2):W273–W279, 07 2004.
[35] Masaaki Furuno, Ken C Pang, Noriko Ninomiya, Shiro Fukuda, Martin C Frith, Carol Bult, Chikatoshi Kai, Jun Kawai, Piero Carninci, Yoshihide Hayashizaki, John S Mattick, and Harukazu Suzuki. Clusters of Internally Primed Transcripts Reveal Novel Long Noncoding. PLoS Genet, 2(4):e37, 04 2006.
[36] Paul P. Gardner, Jennifer Daub, John G. Tate, Eric P. Nawrocki, Diana L. Kolbe, Stinus Lindgreen, Adam C. Wilkinson, Robert D. Finn, Sam Griffiths-Jones, Sean R. Eddy, and Alex Bateman. Rfam: updates to the RNA families database. Nucleic Acids Research, pages gkn766, 10 2008.
[37] G.B. Golding. Simple sequence is abundant in eukaryotic proteins. PRS, 8(06):1358–1361, 1999.
[38] Sam Griffiths-Jones, Russell J. Grocock, Stijn van Dongen, Alex Bateman, and Anton J. Enright. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Research, 34(suppl 1):D140–144, 1 2006.
[39] Mitchell Guttman, Ido Amit, Manuel Garber, Courtney French, Michael F. Lin, David Feldser, Maite Huarte, Or Zuk, Bryce W. Carey, John P. Cassady, Moran N. Cabili, Rudolf Jaenisch, Tarjei S. Mikkelsen, Tyler Jacks, Nir Hacohen, Bradley E. Bernstein, Manolis Kellis, Aviv Regev, John L. Rinn, and Eric S. Lander. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature, 458(7235):223–227, 03 2009.
[40] Mitchell Guttman, Manuel Garber, Joshua Z Levin, Julie Donaghey, James Robinson, Xian Adiconis, Lin Fan, Magdalena J Koziol, Andreas Gnirke, Chad Nusbaum, John L Rinn, Eric S Lander, and Aviv Regev. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotech, 28(5):503–510, 05 2010.
[41] Brian J Haas and Michael C Zody. Advancing RNA-Seq analysis. Nat Biotech, 28(5):421–423, 05 2010.
[42] Michael Hackenberg, Martin Sturm, David Langenberger, Juan Manuel Falcon-Perez, and Ana M. Aransay. miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Research, 37(suppl 2):W68–W76, 07 2009.
[43] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations Newsletter, 11(1), June 2009.
[44] Ross C. Hardison, John Oeltjen, and Webb Miller. Long human–mouse sequence alignments reveal novel regulatory elements: A reason to sequence the mouse genome. Genome Research, 7(10):959–966, 10 1997.
[45] Artemis G. Hatzigeorgiou, Petko Fiziev, and Martin Reczko. DIANA-EST: a statistical analysis. Bioinformatics, 17(10):913–919, 10 2001.
[46] Shunmin He, Changning Liu, Geir Skogerbo, Haitao Zhao, Jie Wang, Tao Liu, Baoyan Bai, Yi Zhao, and Runsheng Chen. NONCODE v2.0: decoding the non-coding. Nucl. Acids Res., page gkm1011, 2007.
[47] David Hendrix, Michael Levine, and Weiyang Shi. miRTRAP, a computational method for the systematic identification of miRNAs from high throughput sequencing data. Genome Biology, 11(4):R39, 2010.
[48] Michael Hiller, Sven Findeiß, Sandro Lein, Manja Marz, Claudia Nickel, Dominic Rose, Christine Schulz, Rolf Backofen, Sonja J. Prohaska, Gunter Reuter, and Peter F. Stadler. Conserved introns reveal novel transcripts in Drosophila melanogaster. Genome Research, 19(7):1289–1300, 07 2009.
[49] I. L. Hofacker, W. Fontana, P. F. Stadler, L. S. Bonhoeffer, M. Tacker, and P. Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte fur Chemie / Chemical Monthly, 125(2):167–188, 02 1994.
[50] I. L. Hofacker, B. Priwitzer, and P. F. Stadler. Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20(2):186–190, 1 2004.
[51] Ivo L. Hofacker. Vienna RNA secondary structure server. Nucleic Acids Research, 31(13):3429–3431, 7 2003.
[52] Yair Horesh, Ydo Wexler, Ilana Lebenthal, Michal Ziv-Ukelson, and Ron Unger. RNAslider: a faster engine for consecutive windows folding and its application to the analysis of genomic folding asymmetry. BMC Bioinformatics, 10(1):76, 2009.
[53] Fan Hsu, W. James Kent, Hiram Clawson, Robert M. Kuhn, Mark Diekhans, and David Haussler. The UCSC Known Genes. Bioinformatics, 22(9):1036–1046, 05 2006.
[54] Tzu-Kuo Huang, Ruby C. Weng, and Chih-Jen Lin. Generalized Bradley-Terry models and multi-class probability estimates. J. Mach. Learn. Res., 7:85–115, December 2006.
[55] T. J. P. Hubbard, B. L. Aken, S. Ayling, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen, P. Clapham, L. Clarke, G. Coates, S. Fairley, S. Fitzgerald, J. Fernandez-Banet, L. Gordon, S. Graf, S. Haider, M. Hammond, R. Holland, K. Howe, A. Jenkinson, N. Johnson, A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl, B. Overduin, A. Parker, B. Pritchard, D. Rios, M. Schuster, G. Slater, D. Smedley, W. Spooner, G. Spudich, S. Trevanion, A. Vilella, J. Vogel, S. White, S. Wilder, A. Zadissa, E. Birney, F. Cunningham, V. Curwen, R. Durbin, X. M. Fernandez-Suarez, J. Herrero, A. Kasprzyk, G. Proctor, J. Smith, S. Searle, and P. Flicek. Ensembl 2009. Nucleic Acids Research, 37(suppl 1):D690–D697, 01 2009.
[56] A. M. Hughes. Oxford English Dictionary. Isis, 99(3):586, Sep 2008.
[57] D. E. Janes, C. Chapus, Y. Gondo, D. F. Clayton, S. Sinha, C. A. Blatti, C. L. Organ, M. K. Fujita, C. N. Balakrishnan, and S. V. Edwards. Reptiles and mammals have differentially retained long conserved noncoding sequences from the amniote ancestor. Genome Biology and Evolution, 3:102–113, 01 2011.
[58] Hui Jia, Maureen Osak, Gireesh K. Bogu, Lawrence W. Stanton, Rory Johnson, and Leonard Lipovich. Genome-wide computational identification and manual annotation of human long noncoding RNA genes. RNA, 16(8):1478–1487, 08 2010.
[59] Chol-Hee Jung, Martin Hansen, Igor Makunin, Darren Korbie, and John Mattick. Identification of novel non-coding RNAs using profiles of short sequence reads from next generation sequencing data. BMC Genomics, 11(1):77, 2010.
[60] Manolis Kellis, Nick Patterson, Matthew Endrizzi, Bruce Birren, and Eric S. Lander. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423(6937):241–254, 05 2003.
[61] W. James Kent. BLAT–the BLAST-like alignment tool. Genome Research, 12(4):656–664, 04 2002.
[62] W. James Kent, Charles W. Sugnet, Terrence S. Furey, Krishna M. Roskin, Tom H. Pringle, Alan M. Zahler, and David Haussler. The human genome browser at UCSC. Genome Research, 12(6):996–1006, 06 2002.
[63] Taishin Kin, Kouichirou Yamada, Goro Terai, Hiroaki Okida, Yasuhiko Yoshinari, Yukiteru Ono, Aya Kojima, Yuki Kimura, Takashi Komori, and Kiyoshi Asai. fRNAdb: a platform for mining/annotating functional RNA candidates from non-coding RNA sequences. Nucleic Acids Research, 35(suppl 1):D145–148, 1 2007.
[64] Lei Kong, Yong Zhang, Zhi-Qiang Ye, Xiao-Qiao Liu, Shu-Qi Zhao, Liping Wei, and Ge Gao. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Research, 35(suppl 2):W345–349, 7 2007.
[65] Jack Kyte and Russell F. Doolittle. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157(1):105–132, 1982.
[66] S. Sai Lakshmi and Shipra Agrawal. piRNABank: a web resource on classified and clustered Piwi-interacting RNAs. Nucleic Acids Research, 36(suppl 1):D173–D177, 01 2008.
[67] David Langenberger, Clara Bermudez-Santana, Jana Hertel, Steve Hoffmann, Philipp Khaitovich, and Peter F. Stadler. Evidence for human microRNA-offset RNAs in small RNA sequencing data. Bioinformatics, 25(18):2298–2301, 2009.
[68] M. A. Larkin, G. Blackshields, N. P. Brown, R. Chenna, P. A. McGettigan, H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson, T. J. Gibson, and D. G. Higgins. Clustal W and Clustal X version 2.0. Bioinformatics, 23(21):2947–2948, 11 2007.
[69] Rasko Leinonen, Ruth Akhtar, Ewan Birney, James Bonfield, Lawrence Bower, Matt Corbett, Ying Cheng, Fehmi Demiralp, Nadeem Faruque, Neil Goodgame, Richard Gibson, Gemma Hoad, Christopher Hunter, Mikyung Jang, Steven Leonard, Quan Lin, Rodrigo Lopez, Michael Maguire, Hamish McWilliam, Sheila Plaister, Rajesh Radhakrishnan, Siamak Sobhany, Guy Slater, Petra Ten Hoopen, Franck Valentin, Robert Vaughan, Vadim Zalunin, Daniel Zerbino, and Guy Cochrane. Improvements to services at the European Nucleotide Archive. Nucleic Acids Research, 38(suppl 1):D39–D45, 01 2010.
[70] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 07 2009.
[71] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078–2079, 08 2009.
[72] Jiong-Tang Li, Yong Zhang, Lei Kong, Qing-Rong Liu, and Liping Wei. Trans-natural antisense transcripts including noncoding RNAs in 10 species: implications for expression regulation. Nucleic Acids Research, 36(15):4833–4844, 09 2008.
[73] Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 7 2006.
[74] Jinfeng Liu, Julian Gough, and Burkhard Rost. Distinguishing protein-coding from non-coding RNAs through support vector machines. PLoS Genet, 2(4):e29, 04 2006.
[75] G. G. Loots, R. M. Locksley, C. M. Blankespoor, Z. E. Wang, W. Miller, E. M. Rubin, and K. A. Frazer. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288(5463):136–140, 04 2000.
[76] C. Lottaz, C. Iseli, C. V. Jongeneel, and P. Bucher. Modeling sequencing errors by combining Hidden Markov models. Bioinformatics, 19(suppl 2):ii103–112, 9 2003.
[77] Zhi John Lu, Kevin Y. Yip, Guilin Wang, Chong Shou, LaDeana W. Hillier, Ekta Khurana, Ashish Agarwal, Raymond Auerbach, Joel Rozowsky, Chao Cheng, Masaomi Kato, David M. Miller, Frank Slack, Michael Snyder, Robert H. Waterson, Valerie Reinke, and Mark Gerstein. Prediction and characterization of non-coding RNAs in C. elegans by integrating conservation, secondary structure and high throughput sequencing and array data. Genome Research, 10.1101/gr.110189.110, December 2010.
[78] R. B. Lyngsø and C. N. Pedersen. RNA pseudoknot prediction in energy-based models. J Comput Biol, 7(3-4):409–427, 2000.
[79] Ariane Machado-Lima, Hernando del Portillo, and Alan Durham. Computational methods in noncoding RNA research. Journal of Mathematical Biology, 56(1):15–49, 01 2008.
[80] J.R. Manak, S. Dike, V. Sementchenko, P. Kapranov, F. Biemar, J. Long, J. Cheng, I. Bell, S. Ghosh, A. Piccolboni, and T.R. Gingeras. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, 06 2007.
[81] Samuel Marguerat, Brian T. Wilhelm, and Jurg Bahler. Next-generation sequencing: applications beyond genomes. Biochemical Society transactions, 36(Pt 5):1091–1096, October 2008.
[82] Elliott H. Margulies, Mathieu Blanchette, NISC Comparative Sequencing Program, David Haussler, and Eric D. Green. Identification and characterization of multi-species conserved sequences. Genome Research, 13(12):2507–2518, 12 2003.
[83] Anthony Mathelier and Alessandra Carbone. MIReNA: finding microRNAs with high accuracy and no learning at genome scale and from deep sequencing data. Bioinformatics, 26(18):2226–2234, 09 2010.
[84] Pedro P. Medina, Mona Nolde, and Frank J. Slack. OncomiR addiction in an in vivo model of microRNA-21-induced pre-B-cell lymphoma. Nature, 467(7311):86–90, 09 2010.
[85] Tim R. Mercer, Marcel E. Dinger, and John S. Mattick. Long non-coding RNAs: insights into functions. Nat Rev Genet, 10(3):155–159, 03 2009.
[86] Michael L. Metzker. Sequencing technologies – the next generation. Nat Rev Genet, 11(1):31–46, 01 2010.
[87] Flavio Mignone, Anna Anselmo, Giacinto Donvito, Giorgio Maggi, Giorgio Grillo, and Graziano Pesole. Genome-wide identification of coding and non-coding conserved sequence tags in human and mouse genomes. BMC Genomics, 9(1):277, 2008.
[88] Tarjei S. Mikkelsen, Manching Ku, David B. Jaffe, Biju Issac, Erez Lieberman, Georgia Giannoukos, Pablo Alvarez, William Brockman, Tae-Kyung Kim, Richard P. Koche, William Lee, Eric Mendenhall, Aisling O’Donovan, Aviva Presser, Carsten Russ, Xiaohui Xie, Alexander Meissner, Marius Wernig, Rudolf Jaenisch, Chad Nusbaum, Eric S. Lander, and Bradley E. Bernstein. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature, 448(7153):553–560, 08 2007.
[89] The modENCODE Consortium, Sushmita Roy, Jason Ernst, Peter V. Kharchenko, Pouya Kheradpour, Nicolas Negre, Matthew L. Eaton, Jane M. Landolin, Christopher A. Bristow, Lijia Ma, Michael F. Lin, Stefan Washietl, Bradley I. Arshinoff, Ferhat Ay, Patrick E. Meyer, Nicolas Robine, Nicole L. Washington, Luisa Di Stefano, Eugene Berezikov, Christopher D. Brown, Rogerio Candeias, Joseph W. Carlson, Adrian Carr, Irwin Jungreis, Daniel Marbach, Rachel Sealfon, Michael Y. Tolstorukov, Sebastian Will, Artyom A. Alekseyenko, Carlo Artieri, Benjamin W. Booth, Angela N. Brooks, Qi Dai, Carrie A. Davis, Michael O. Duff, Xin Feng, Andrey A. Gorchakov, Tingting Gu, Jorja G. Henikoff, Philipp Kapranov, Renhua Li, Heather K. MacAlpine, John Malone, Aki Minoda, Jared Nordman, Katsutomo Okamura, Marc Perry, Sara K. Powell, Nicole C. Riddle, Akiko Sakai, Anastasia Samsonova, Jeremy E. Sandler, Yuri B. Schwartz, Noa Sher, Rebecca Spokony, David Sturgill, Marijke van Baren, Kenneth H. Wan, Li Yang, Charles Yu, Elise Feingold, Peter Good, Mark Guyer, Rebecca Lowdon, Kami Ahmad, Justen Andrews, Bonnie Berger, Steven E. Brenner, Michael R. Brent, Lucy Cherbas, Sarah C. R. Elgin, Thomas R. Gingeras, Robert Grossman, Roger A. Hoskins, Thomas C. Kaufman, William Kent, Mitzi I. Kuroda, Terry Orr-Weaver, Norbert Perrimon, Vincenzo Pirrotta, James W. Posakony, Bing Ren, Steven Russell, Peter Cherbas, Brenton R. Graveley, Suzanna Lewis, Gos Micklem, Brian Oliver, Peter J. Park, Susan E. Celniker, Steven Henikoff, Gary H. Karpen, Eric C. Lai, David M. MacAlpine, Lincoln D. Stein, Kevin P. White, and Manolis Kellis. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science, 330(6012):1787–1797, 12 2010.
[90] Ryan D. Morin, Matthew Bainbridge, Anthony Fejes, Martin Hirst, Martin Krzywinski, Trevor J. Pugh, Helen McDonald, Richard Varhol, Steven J.M. Jones, and Marco A. Marra. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques, 45(1):81–94, July 2008.
[91] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth, 5(7):621–628, 07 2008.
[92] Ugrappa Nagalakshmi, Zhong Wang, Karl Waern, Chong Shou, Debasish Raha, Mark Gerstein, and Michael Snyder. The transcriptional landscape of the Yeast genome defined by RNA sequencing. Science, 320(5881):1344–1349, 06 2008.
[93] Marcelo A. Nobrega, Yiwen Zhu, Ingrid Plajzer-Frick, Veena Afzal, and Edward M. Rubin. Megabase deletions of gene deserts result in viable mice. Nature, 431(7011):988–993, 10 2004.
[94] Kirt Noel. Examining stem-loops as a sequence signal for identifying structural RNA genes. Master’s thesis, Simon Fraser University, April 2005.
[95] Karl J. V. Nordstrom, Majd A. I. Mirza, Markus Sallman Almen, David E. Gloriam, Robert Fredriksson, and Helgi B. Schloth. Critical evaluation of the FANTOM3 non-coding RNA transcripts. Genomics, 94(3):169–176, 9 2009.
[96] David L. Olson and Dursun Delen. Advanced Data Mining Techniques. Springer Publishing Company, Incorporated, 1st edition, 2008.
[97] Ulf Andersson Ørom, Thomas Derrien, Malte Beringer, Kiranmai Gumireddy, Alessandro Gardini, Giovanni Bussotti, Fan Lai, Matthias Zytnicki, Cedric Notredame, Qihong Huang, Roderic Guigo, and Ramin Shiekhattar. Long noncoding RNAs with enhancer-like function in human cells. Cell, 143(1):46–58, 10 2010.
[98] Ken C. Pang, Martin C. Frith, and John S. Mattick. Rapid evolution of noncoding RNAs: lack of conservation does not mean lack of function. Trends in Genetics, 22(1):1–5, 1 2006.
[99] Ken C. Pang, Stuart Stephen, Marcel E. Dinger, Par G. Engstrom, Boris Lenhard, and John S. Mattick. RNAdb 2.0–an expanded database of mammalian non-coding RNAs. Nucleic Acids Research, 35(suppl 1):D178–182, 1 2007.
[100] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America, 98(17):9748–9753, 08 2001.
[101] Elisabetta Pizzi and Clara Frontali. Low-complexity regions in Plasmodium falciparum proteins. Genome Research, 11:218–229, 2001.
[102] Vasilis J. Promponas, Anton J. Enright, Sophia Tsoka, David P. Kreil, Christophe Leroy, Stavros Hamodrakas, Chris Sander, and Christos A. Ouzounis. CAST: an iterative algorithm for the complexity analysis of sequence tracts. Bioinformatics, 16(10):915–922, 10 2000.
[103] Kim D. Pruitt, Tatiana Tatusova, and Donna R. Maglott. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, pages gkl842, 11 2006.
[104] Matteo Re, Graziano Pesole, and David Horner. Accurate discrimination of conserved coding and non-coding regions through multiple indicators of evolutionary dynamics. BMC Bioinformatics, 10(1):282, 2009.
[105] Brooke Rhead, Donna Karolchik, Robert M. Kuhn, Angie S. Hinrichs, Ann S. Zweig, Pauline A. Fujita, Mark Diekhans, Kayla E. Smith, Kate R. Rosenbloom, Brian J. Raney, Andy Pohl, Michael Pheasant, Laurence R. Meyer, Katrina Learned, Fan Hsu, Jennifer Hillman-Jackson, Rachel A. Harte, Belinda Giardine, Timothy R. Dreszer, Hiram Clawson, Galt P. Barber, David Haussler, and W. James Kent. The UCSC Genome Browser database: update 2010. Nucleic Acids Research, 38(suppl 1):D613–D619, 01 2010.
[106] Peter Rice, Ian Longden, and Alan Bleasby. EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 16(6):276–277, 2000.
[107] E. Rivas and S. R. Eddy. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics, 2(1):8, 2001.
[108] A. Gordon Robertson, Mikhail Bilenky, Angela Tam, Yongjun Zhao, Thomas Zeng, Nina Thiessen, Timothee Cezard, Anthony P. Fejes, Elizabeth D. Wederell, Rebecca Cullum, Ghia Euskirchen, Martin Krzywinski, Inanc Birol, Michael Snyder, Pamela A. Hoodless, Martin Hirst, Marco A. Marra, and Steven J. M. Jones. Genome-wide relationship between histone H3 lysine 4 mono- and tri-methylation and transcription factor binding. Genome Research, 18(12):1906–1917, 12 2008.
[109] Gordon Robertson, Martin Hirst, Matthew Bainbridge, Misha Bilenky, Yongjun Zhao, Thomas Zeng, Ghia Euskirchen, Bridget Bernier, Richard Varhol, Allen Delaney, Nina Thiessen, Obi L Griffith, Ann He, Marco Marra, Michael Snyder, and Steven Jones. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Meth, 4(8):651–657, 08 2007.
[110] Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew Field, Shaun D Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada, Jenny Q Qian, Malachi Griffith, Anthony Raymond, Nina Thiessen, Timothee Cezard, Yaron S Butterfield, Richard Newsome, Simon K Chan, Rong She, Richard Varhol, Baljit Kamoh, Anna-Liisa Prabhu, Angela Tam, YongJun Zhao, Richard A Moore, Martin Hirst, Marco A Marra, Steven J M Jones, Pamela A Hoodless, and Inanc Birol. De novo assembly and analysis of RNA-seq data. Nature Methods, advance online publication, October 2010.
[111] Brid M. Ryan, Ana I. Robles, and Curtis C. Harris. Genetic variation in microRNA networks: the implications for cancer research. Nat Rev Cancer, 10(6):389–402, 06 2010.
[112] R. Salari, C. Aksay, E. Karakoc, P. J. Unrau, I. Hajirasouliha, S. C. Sahinalp, and S. Maas. smyRNA: A Novel Ab Initio ncRNA Gene Finder. PLoS ONE, 4:5433, May 2009.
[113] F. Sanger, S. Nicklen, and A. R. Coulson. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463–5467, 12 1977.
[114] Kengo Sato, Michiaki Hamada, Kiyoshi Asai, and Toutai Mituyama. CentroidFold: a web server for RNA secondary structure prediction. Nucleic Acids Research, 37(suppl 2):W277–W280, 07 2009.
[115] Bruce A Shapiro, Yaroslava G Yingling, Wojciech Kasprzak, and Eckart Bindewald. Bridging the gap in RNA structure prediction. Current Opinion in Structural Biology, 17(2):157–165, 2007. Theory and simulation / Macromolecular assemblages.
[116] Kana Shimizu, Jun Adachi, and Yoichi Muraoka. ANGLE: a sequencing errors resistant program for predicting protein coding regions in unfinished cDNA. Journal of Bioinformatics and Computational Biology, 4(3):649–64, June 2006.
[117] Christian Honer zu Siederdissen and Ivo L. Hofacker. Discriminatory power of RNA family models. Bioinformatics, 26(18):i453–i459, 09 2010.
[118] Adam Siepel, Gill Bejerano, Jakob S. Pedersen, Angie S. Hinrichs, Minmei Hou, Kate Rosenbloom, Hiram Clawson, John Spieth, LaDeana W. Hillier, Stephen Richards, George M. Weinstock, Richard K. Wilson, Richard A. Gibbs, W. James Kent, Webb Miller, and David Haussler. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15(8):1034–1050, 08 2005.
[119] Tulio C. Silva, Pedro A. Berger, Roberto T. Arrial, Roberto C. Togawa, Marcelo M. Brigido, and Maria Emilia M. T. Walter. SOM-PORTRAIT: Identifying Non-coding RNAs Using Self-Organizing Maps, volume 5676/2009 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2009.
[120] Jared T. Simpson, Kim Wong, Shaun D. Jackman, Jacqueline E. Schein, Steven J.M. Jones, and Inanc Birol. ABySS: A parallel assembler for short read sequence data. Genome Research, 19:1117–1123, February 2009.
[121] G.S.C. Slater. Algorithms for the Analysis of Expressed Sequence Tags. PhD thesis, University of Cambridge, Cambridge, 2000.
[122] Tomasz Smolinski, Mariofanna Milanova, Aboul-Ella Hassanien, Kirt Noel, and Kay Wiese. Considering Stem-Loops as Sequence Signals for Finding Ribosomal RNA Genes, volume 151, pages 337–357. Springer Berlin / Heidelberg, 2008.
[123] MJ Solomon, PL Larsen, and A Varshavsky. Mapping protein-DNA interactions in vivo with formaldehyde: evidence that histone H4 is retained on a highly transcribed gene. Cell, 53(6):937–947, 06 1988.
[124] Jason E. Stajich, David Block, Kris Boulez, Steven E. Brenner, Stephen A. Chervitz,Chris Dagdigian, Georg Fuellen, James G.R. Gilbert, Ian Korf, Hilmar Lapp, HeikkiLehvaslaiho, Chad Matsalla, Chris J. Mungall, Brian I. Osborne, Matthew R. Pocock,Peter Schattner, Martin Senger, Lincoln D. Stein, Elia Stupka, Mark D. Wilkinson,and Ewan Birney. The Bioperl toolkit: Perl modules for the life sciences. GenomeResearch, 12:1611–1618, 2002.
[125] The FANTOM Consortium. The transcriptional landscape of the mammalian genome.Science, 309(5740):1559–1563, 9 2005.
[126] Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech, 28(5):511–515, 05 2010.
[127] Huei-Hun H. Tseng, Zasha Weinberg, Jeremy Gore, Ronald R. Breaker, and Walter L. Ruzzo. Finding non-coding RNAs through genome-scale clustering. Journal of Bioinformatics and Computational Biology, 7(2):373–388, April 2009.
[128] Andrew Uzilov, Joshua Keegan, and David Mathews. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(1):173, 2006.
[130] Bjorn Voß, Jens Georg, Verena Schon, Susanne Ude, and Wolfgang Hess. Biocomputational prediction of non-coding RNAs in model cyanobacteria. BMC Genomics, 10(1):123, 2009.
[131] Jiayi Wang, Xiangfan Liu, Huacheng Wu, Peihua Ni, Zhidong Gu, Yongxia Qiao, Ning Chen, Fenyong Sun, and Qishi Fan. CREB up-regulates long non-coding RNA, HULC expression through interaction with microRNA-372 in liver cancer. Nucleic Acids Research, 38(16):5366–5383, 09 2010.
[132] Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10(1):57–63, 01 2009.
[133] Stefan Washietl, Ivo L. Hofacker, and Peter F. Stadler. Fast and reliable prediction of noncoding RNAs. Proceedings of the National Academy of Sciences of the United States of America, 102(7):2454–2459, 2005.
[134] Zasha Weinberg, Jonathan Perreault, Michelle M. Meyer, and Ronald R. Breaker. Exceptional structured noncoding RNAs revealed by bacterial metagenome analysis. Nature, 462(7273):656–659, 12 2009.
[135] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
[136] Adam Woolfe, Martin Goodson, Debbie K Goode, Phil Snell, Gayle K McEwen, Tanya Vavouri, Sarah F Smith, Phil North, Heather Callaway, Krys Kelly, Klaudia Walter, Irina Abnizova, Walter Gilks, Yvonne J. K Edwards, Julie E Cooke, and Greg Elgar. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol, 3(1):e7, 11 2004.
[137] Jing Wu. Testing the coding potential of conserved short genomic sequences. Advances in Bioinformatics, Article ID 287070, 8 pages, 2010.
[138] Jun Xie, Ming Zhang, Tao Zhou, Xia Hua, LiSha Tang, and Weilin Wu. scaRNAbase: a curated database for small nucleolar RNAs and cajal body-specific RNAs. Nucleic Acids Research, 35(suppl 1):D183–D187, 2006.
[139] Chenghai Xue, Fei Li, Tao He, Guo-Ping Liu, Yanda Li, and Xuegong Zhang. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 6(1):310, 2005.
[140] Zizhen Yao, Zasha Weinberg, and Walter L. Ruzzo. CMfinder—a covariance model based RNA motif finding algorithm. Bioinformatics, 22(4):445–452, 2006.
[141] Ying Zhang, Dao-Gang Guan, Jian-Hua Yang, Peng Shao, Hui Zhou, and Liang-Hu Qu. ncRNAimprint: A comprehensive database of mammalian imprinted noncoding RNAs. RNA, pages –, 08 2010.
[142] Michael Zuker and David Sankoff. RNA secondary structures and their prediction. Bulletin of Mathematical Biology, 46(4):591–621, 07 1984.
[143] Michael Zuker and Patrick Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1):133–148, 1 1981.