-
Implementation of computational pipelines to support next gen applications
Winter School in Mathematical and Computational Biology, 6th – 10th July 2009
Roberto Barrero ([email protected])
miR-Seq: microRNA profiling
ChIP-Seq: Chromatin modification
Bi-Seq: DNA methylation analysis
-
Bioplatforms Australia
NCRIS 5.1 Evolving Biomolecular Platforms and Informatics
-
[Organisation chart: NCRIS 5.1 comprises the Australian Bioinformatics Facility alongside Genomics Australia, Proteomics Australia and Metabolomics Australia; each of the three platforms lists Embedded Activities, Non-Embedded Activities and Projects 1 to 3.]
Development of cross-omics Platform Projects
Development of cross NCRIS Investment Projects
NCRIS 5.16 NeAT: BioNeAT (pending)
NCRIS 5.16 NCI: Specialised Facility in Bioinformatics (pending)
-
Overview
• Implementation of a short read mapping pipeline
– Benchmarking of freely available aligners
• miR-Seq: Profiling of miRNAs and miRNAs*
• ChIP-Seq: Defining genomic regions associated with histone modifications
• Bi-Seq: Determining genome-wide CpG, CHG and CHH methylation marks
-
Initial list of available tools (as of April 2008)

Tool       Performance  Sanger  Illumina  SOLiD  454  Mismatches  Indels  Quality  Tested platforms
ELAND      FAST         N       Y         N      N    Y           N       Y        Linux, OSX
GMAP       FAST         Y       N         N      Y    Y           Y       N        Linux
MOSAIK     SLOW         Y       Y         N      Y    Y           Y       Y        Linux, OSX
SHRiMP     SLOW         N       Y         Y      N    Y           Y       N        Linux
MAQ        FAST         N       Y         Y      N    Y           Y       Y        All BSD platforms, Linux, OSX
NOVOALIGN  FAST         N       Y         N      N    Y           Y       Y        Linux-64, OSX
SOAP       VARIABLE     N       Y         N      N    Y           Y       N        Linux-64/32, OSX
SSAHA      FAST         Y       Y         N      Y    Y           N       N        Linux

Columns: Sanger = Sanger capillary reads; Mismatches = finds mismatches; Indels = finds indels; Quality = uses quality information.

Tool names:
ELAND: Efficient Large-Scale Alignment of Nucleotide Databases
GMAP: Genomic Mapping and Alignment Program
MOSAIK: Reference guided aligner/assembler
SHRiMP: Maps short reads to a reference sequence
MAQ: Mapping and Assembly with Qualities
NOVOALIGN: Genomic Mapping and SNP/indel finder
SOAP: Short Oligonucleotide Alignment Program
SSAHA: Sequence Search and Alignment by Hashing Algorithm
-
Mapping Tools
• ELAND performs ungapped alignment of SE/PE reads up to 32 nt in length and generates accurate mapping qualities.
• MAQ uses probability models to measure the alignment quality of each read using sequence quality information.
• SHRiMP uses seeding and a Smith-Waterman algorithm for aligning short reads to a reference genome.
• RMAP maps reads taking base-call quality scores into account to determine important positions.
• NOVOALIGN finds globally optimal alignments using the full Needleman-Wunsch algorithm with affine gap penalties.
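
As an illustration of the dynamic programming behind the Smith-Waterman step mentioned for SHRiMP, here is a minimal local-alignment scorer. It is a sketch only: the scoring values (match=2, mismatch=-1, gap=-2), the linear gap penalty and the function name are assumptions for this example, and the seeding stage that real aligners use to restrict the search is omitted.

```python
def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Return (best_score, ref_end) for the best local alignment of read to ref.

    ref_end is the 1-based reference position where the best alignment ends.
    Scores and the linear gap penalty are illustrative assumptions.
    """
    prev = [0] * (len(ref) + 1)  # previous row of the score matrix
    best_score, best_end = 0, 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + step,  # match or mismatch
                prev[j] + gap,       # gap in the reference
                curr[j - 1] + gap,   # gap in the read
            )
            if curr[j] > best_score:
                best_score, best_end = curr[j], j
        prev = curr
    return best_score, best_end


score, end = smith_waterman("ACGT", "TTACGTTT")
print(score, end)
```

Only two rows of the matrix are kept in memory, which is enough to report the best score and its end position; recovering the full alignment would need the whole matrix or a traceback structure.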
-
Genome coverage range by distinct next gen applications
[Figure: genome coverage of Bi-Seq, ChIP-Seq and small RNA sequencing relative to the genome]
-
Generation of Simulated Short Reads (1)
• Tool: MAQ-simulate
• DNA template: Human chromosome 22
• Read length: 36 bases
• Mutation rates: 0.1% up to 16%
• Number of reads: 70,000 x 3 per mutation rate
• Number of SNPs: 220~3,500
• Indels: 10% probability of SNPs
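
The idea behind this kind of simulation can be sketched in a few lines: introduce substitutions into a template at a chosen mutation rate, then sample fixed-length reads from the mutated sequence while remembering where each read came from. This is not MAQ-simulate itself; the function names are hypothetical, the template is random rather than chromosome 22, and indels, base qualities and sequencing errors are left out.

```python
import random


def mutate(template, mutation_rate, rng):
    """Substitute bases at the given per-base rate; return (sequence, snps)."""
    bases = list(template)
    snps = {}  # position -> (reference base, alternative base)
    for pos, base in enumerate(bases):
        if rng.random() < mutation_rate:
            alt = rng.choice([b for b in "ACGT" if b != base])
            bases[pos] = alt
            snps[pos] = (base, alt)
    return "".join(bases), snps


def sample_reads(sequence, n_reads, read_len, rng):
    """Draw reads uniformly; keep the true start so mappings can be scored."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(sequence) - read_len + 1)
        reads.append((start, sequence[start:start + read_len]))
    return reads


rng = random.Random(42)  # fixed seed so the toy dataset is reproducible
template = "".join(rng.choice("ACGT") for _ in range(5000))
mutated, snps = mutate(template, 0.01, rng)
reads = sample_reads(mutated, 100, 36, rng)
print(len(snps), "SNPs introduced;", len(reads), "reads of 36 bases")
```

Because each read carries its true origin, an aligner's output can later be checked position by position against the known answer.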
-
Single-end mapping performance at various mutation rates
• 70,000 reads
• Triplicate
-
Pair-end mapping and SNP calling
-
Mapping performance of real data
HapMap project NA18507 10.2 million SE reads
5.1 million PE reads
-
Generation of Simulated Short Reads (2)
Arabidopsis thaliana (chr 5)
36bp-long reads datasets were generated using MAQ-simulate
Selected tools: NOVOALIGN, MAQ, BOWTIE, BWA

Dataset                          Simulated Set 1  Simulated Set 2
Mutation Rate                    0.1%             1.0%
Number of single-end reads (1)   8,453,489        8,453,489
Number of paired-end reads (2)   16,906,978       16,906,978
Number of insertions             1,512            15,131
Number of deletions              1,514            15,166
Total number of indels           3,026            30,297
Number of Heterozygous SNPs      18,024           182,055
Number of Homozygous SNPs        9,034            90,985
Total number of SNPs             27,058           273,040
Total number of SNPs+indels      30,084           303,337

(1) Total number of single-end (SE) reads utilized in the comparisons
(2) Total number of paired-end (PE) reads utilized in the comparisons
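
With a simulated dataset the true variants are known, so each aligner's calls can be scored directly. A minimal sketch of that comparison follows, assuming variants are represented as (position, reference base, alternative base) tuples; the representation, the function name and the toy values are illustrative and are not taken from the benchmark.

```python
def score_calls(truth, called):
    """Compare called variants with the simulated truth set."""
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # called and really present
    fp = len(called - truth)   # called but not simulated
    fn = len(truth - called)   # simulated but missed
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "sensitivity": tp / len(truth) if truth else 0.0,
        "precision": tp / len(called) if called else 0.0,
    }


# Toy example: hypothetical variants as (position, ref, alt)
truth = {(100, "A", "G"), (250, "C", "T"), (900, "G", "A"), (1200, "T", "C")}
called = {(100, "A", "G"), (250, "C", "T"), (900, "G", "C"), (1500, "A", "T")}
result = score_calls(truth, called)
print(result)
```

A call at the right position with the wrong allele, such as position 900 above, counts as both a false positive and a false negative under this strict matching.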
-
Mapping
(Colin et al. Submitted)
-
SNPs
(Colin et al. Submitted)
-
Indels
(Colin et al. Submitted)
-
Run Time
(Colin et al. Submitted)
-
Benchmarking Conclusion
In these benchmarks, NOVOALIGN was the best overall aligner for mapping both SE and PE reads, as well as for SNP calling and indel detection.
-
microRNA pathway
miRNA function in Drosophila:
• Cell proliferation/anti-apoptosis (bantam)
• Fat storage/anti-apoptosis (mir-14)
• Homeostasis/anti-apoptosis (mir-278)
• Anti-apoptosis (mir-2)
• Photoreceptor differentiation (mir-7)
• Neurogenesis/neurodegeneration (mir-9)
• Muscle differentiation (mir-1)
• Homeotic transformation (iab-4)
• Energy metabolism/fat storage (mir-278)
• Metamorphosis (let-7, mir-100, 125, 34)
-
Drosophila melanogaster microRNA clusters
[Figure: clusters drawn along chromosome arms 2L, 2R, 3L, 3R, X, 4 and U]
• mir-959, 960, 961, 962, 963, 964
• mir-275, 305
• mir-1002, 968
• mir-306, 79, 9b
• mir-100, 125, let-7
• mir-2a-2, 2a-1, 2b-2
• mir-281-2, 281-1
• mir-6-3, 6-2, 6-2, 5, 4, 286, 3, 309
• mir-310, 311, 312, 313, 991, 992
• mir-277, 34
• mir-994, 318
• mir-13b-1, 13a, 2c
• mir-998, 11
• mir-982, 303, 983-1, 983-2, 984
• mir-283, 304, 12
• mir-972, 973, 974
• mir-975, 976, 977, 978, 979
• 148 dme-miRNAs • 17 clusters (=
-
Prediction of miRNA targets

Method      Type of method        Resource
miRanda     Complementarity       http://www.microrna.org/
TargetScan  Seed complementarity  http://www.targetscan.org/
PicTar      Thermodynamics        http://pictar.bio.nyu.edu/

[Figure: target site types: canonical site, dominant seed, compensatory site]
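
Seed complementarity, the principle listed for TargetScan, can be illustrated with a plain string search: take the miRNA seed (nucleotides 2 to 8 are assumed here), reverse-complement it, and look for that site in a 3' UTR. This is a toy sketch, not the scoring used by any of the tools above; the function name, the let-7-like miRNA sequence and the UTR are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")  # Watson-Crick pairs for RNA


def seed_sites(mirna, utr, seed_start=2, seed_end=8):
    """Return (site, hits): the seed-match motif and its 0-based UTR positions."""
    seed = mirna[seed_start - 1:seed_end]      # 1-based, inclusive seed range
    site = seed.translate(COMPLEMENT)[::-1]    # reverse complement of the seed
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)          # allow overlapping matches
    return site, hits


mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # illustrative let-7-like sequence
utr = "AAA" + "CUACCUC" + "AAAGGG" + "CUACCUC" + "UU"  # made-up 3' UTR
site, hits = seed_sites(mirna, utr)
print(site, hits)
```

Real predictors add conservation filters, pairing beyond the seed and thermodynamic terms on top of this basic match.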
-
A timescale of arthropod evolution
[Figure: timescale covering Chelicerata, Crustacea, Myriapoda and Insecta]
Pisani et al. (2004) BMC Biology
-
Tick Genomic Resources
• Ixodes scapularis
– ESTs: 183,834
– Genome Project (version: IscaW1; released 3Dec08)
• Supercontigs: 369,492
• Annotated genes: 24,925
• Peptides: 20,486
• Rhipicephalus microplus
– ESTs: 13,643
-
microRNA locus
[Figure: a genomic locus giving rise to a precursor miRNA (pre-miRNA) with miRNA and miRNA* on its 5P and 3P arms; example pre-dme-miR-33, Drosophila melanogaster compared with Rhipicephalus microplus]
-
Simplified data processing pipeline
1. Illumina short reads
2. Unique Seq Reads (USR); retain clone count
3. Adaptor removal: USR w/o adaptors
4. Map onto genome
• NOVOALIGN
• Up to 3 mismatches
• Single-locus mapping
5. Mapped reads compared with miRBase and miRNA clusters
• Extract coordinates of miRNAs
- mature miRNAs
- pre-miRNAs
OUTPUTS: miRNA, miRNA*, Multiple Alignments, etc
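
The first steps of this pipeline, removing the adaptor and collapsing identical reads while retaining the clone count, can be sketched as follows. The adaptor sequence, the 6-base adaptor prefix used for matching and the 18 to 30 nt length filter are assumptions made for the example, not settings reported for the actual pipeline.

```python
from collections import Counter


def trim_adaptor(read, adaptor, min_overlap=6):
    """Cut the read at the first occurrence of the adaptor's leading bases."""
    pos = read.find(adaptor[:min_overlap])
    return read[:pos] if pos != -1 else read


def collapse_reads(reads, adaptor, min_len=18, max_len=30):
    """Return unique inserts with their clone counts after adaptor removal."""
    counts = Counter()
    for read in reads:
        insert = trim_adaptor(read, adaptor)
        if min_len <= len(insert) <= max_len:  # keep small-RNA sized inserts
            counts[insert] += 1
    return counts


ADAPTOR = "TCGTATGCCGTCTTCTGCTTG"  # hypothetical 3' adaptor sequence
reads = (
    ["TGAGGTAGTAGGTTGTATAGTT" + ADAPTOR[:14]] * 3
    + ["ACGTACGTACGTACGTACGT" + ADAPTOR[:16]]
    + ["ACGT" + ADAPTOR]  # insert too short, filtered out
)
unique = collapse_reads(reads, ADAPTOR)
for seq, clones in unique.most_common():
    print(seq, clones)
```

Mapping only the unique sequences and carrying the clone counts along keeps the alignment step small without losing expression information.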
-
microRNA discovery
1. Collect Total RNA/small RNA fraction
• Eggs
• Larvae (frustrated larvae, larvae)
• Adult ticks (female, male)
2. Construct small RNA libraries
3. Illumina/Solexa sequencing
• Eggs: 4,215,404
• Larvae: 9,437,803
• Adult ticks: 8,319,734
21,972,942 short reads
4. Data Analysis Pipeline
[Figure: tick life cycle: eggs, larva, nymph, adult (female, male)]
-
Highly conserved tick miRNAs
We found 58 miRNAs in Rhipicephalus microplus expressed at various life cycle stages that are highly conserved in Drosophila melanogaster.
[Venn diagram: Eggs (37), Larvae (46), Adults (44); region counts 26, 2, 1, 9, 5, 1, 0]
[Figure: fold-increment in miRNA expression (log scale, 0.0010 to 1.0000) and reads per million (0 to 140) for Eggs, Frustrated Larvae, Larvae, Female and Male]
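
Reads per million, the unit on the expression chart, rescales raw clone counts by library size so that libraries of different depth can be compared. A minimal sketch follows; the library size is the Eggs total reported earlier in the talk, while the miRNA names and counts are made up for illustration.

```python
def reads_per_million(counts, library_size):
    """Scale raw read counts to reads per million sequenced reads."""
    return {name: n * 1_000_000 / library_size for name, n in counts.items()}


EGGS_LIBRARY_SIZE = 4_215_404  # Eggs library size from the sequencing summary
raw_counts = {"mir-example-a": 421, "mir-example-b": 42}  # hypothetical counts
rpm = reads_per_million(raw_counts, EGGS_LIBRARY_SIZE)
for name, value in rpm.items():
    print(f"{name}: {value:.2f} RPM")
```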
-
miRNA:miRNA* co-expression
[Figure: read coverage over the pre-miRNA and mature miRNA in Eggs, Larvae and Adults]
-
To whom it may concern:
Slides containing unpublished data were removed. We appreciate your understanding.
RB.
-
mir-9a is conserved in the Ixodes scapularis genome
Finding I. scapularis miRNAs
• 369,492 supercontigs
• BLAT onto D. melanogaster genome
• Inspect known miRNA loci
Only mir-9a was identified in the current I. scapularis supercontigs
-
Active Gene Expression
Less Gene Expression
Acetylation Methylation
Implications of Chromatin Modifications
-
ChIP-Seq simplified processing pipeline
1. Illumina short reads
2. Map onto genome (NOVOALIGN)
3. Mapped reads
4. Peak detection (cisGenome, FindPeaks)
OUTPUTS: Genomic regions associated with chromatin modifications
-
cisGenome
Ji et al. (2008) Nature Biotechnology 26: 1293-1300
One Sample Data Processing
• Scans the genome with sliding windows and identifies regions whose read counts exceed a user-chosen cutoff as bona fide binding regions.
• False Discovery Rates (FDRs) are estimated by modeling the read count in nonbinding windows using a negative binomial distribution. This allows the background rate of read occurrence to vary across the genome according to a flexible gamma distribution.
• Uses the directionality of reads to refine peak boundaries and filter out low-quality predictions.
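
The sliding-window scan described above can be sketched as follows. This is not cisGenome's implementation: the window size, step, cutoff and function name are assumptions for the example, and both the negative binomial FDR estimate and the strand-based boundary refinement are left out.

```python
from bisect import bisect_left


def candidate_regions(read_starts, window=100, step=25, cutoff=10):
    """Merge overlapping windows whose read count exceeds the cutoff."""
    positions = sorted(read_starts)
    if not positions:
        return []
    regions = []
    for start in range(0, positions[-1] + 1, step):
        end = start + window
        # Number of reads starting in the half-open interval [start, end)
        count = bisect_left(positions, end) - bisect_left(positions, start)
        if count > cutoff:
            if regions and start <= regions[-1][1]:
                regions[-1][1] = end          # extend the previous region
            else:
                regions.append([start, end])  # open a new region
    return [tuple(region) for region in regions]


# Toy data: 12 reads piled up near position 1000 plus scattered background
starts = list(range(1000, 1012)) + [50, 300, 5000]
print(candidate_regions(starts))
```

Sorting once and counting with binary search keeps each window query cheap, which matters when the scan covers a whole genome.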
-
Protein-DNA Interactions
-
Diverse genomic contexts for chromatin marks
-
Arabidopsis thaliana nucleosome
-
Bisulfite Sequencing (Bi-Seq; BS-Seq)
Next Gen Sequencing
-
Genome-wide Methylation Marks
1. Illumina short reads
2. Map onto genome (MAQ)
3. Mapped reads
4. Check bisulfite conversion (C to T)
5. Genome-wide methylation marks (CpG, CHG, CHH)
OUTPUTS: Bisulfite conversion report; genome-wide methylation marks
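
The three methylation contexts are defined by the two bases following a cytosine, where H stands for A, C or T. A minimal sketch of that classification on the forward strand is shown below; the function name and the toy sequence are illustrative, and the reverse strand (cytosines opposite a G) is not handled.

```python
from collections import Counter


def cytosine_context(seq, pos):
    """Classify the cytosine at seq[pos] as CpG, CHG or CHH (else None)."""
    if seq[pos] != "C":
        return None
    following = seq[pos + 1:pos + 3]
    if any(base not in "ACGT" for base in following):
        return None                       # ambiguous bases such as N
    if len(following) >= 1 and following[0] == "G":
        return "CpG"
    if len(following) == 2 and following[1] == "G":
        return "CHG"
    if len(following) == 2:
        return "CHH"
    return None                           # too close to the sequence end


seq = "ACGTCAGTCATC"
contexts = Counter(
    ctx for ctx in (cytosine_context(seq, i) for i in range(len(seq))) if ctx
)
print(contexts)
```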
-
Checking Bisulfite Conversion Efficiency

Aligned reads onto the Arabidopsis chloroplast genome

Sample  Read Sequences  Unique Alignments  Gapped Alignments  Aligned
run 1   11,653,511      133,712            4,129              275,944
run 2   11,540,171      132,690            4,251              273,806

Bisulfite conversion efficiency of the chloroplast genome

Sample  C       T (%)           Y (%)       Unconverted (%)  Converted (%)
run 1   10,806  10,577 (97.88)  183 (1.69)  14 (0.13)        10,760 (99.57)
run 2   10,837  10,570 (97.54)  219 (2.02)  11 (0.10)        10,789 (99.56)
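
The percentages in the conversion table can be reproduced from the counts: each category is divided by the number of cytosine positions (column C), and the converted total is T plus Y. In the sketch below Y is read as the IUPAC code for a C-or-T call, which is an assumption about the table; the function name is also hypothetical.

```python
def conversion_report(total_c, t, y, unconverted):
    """Return counts and percentages for one bisulfite-treated sample."""
    converted = t + y  # fully (T) and partially (Y) converted positions

    def pct(n):
        return round(100 * n / total_c, 2)

    return {
        "T_pct": pct(t),
        "Y_pct": pct(y),
        "unconverted_pct": pct(unconverted),
        "converted": converted,
        "converted_pct": pct(converted),
    }


run1 = conversion_report(total_c=10_806, t=10_577, y=183, unconverted=14)
run2 = conversion_report(total_c=10_837, t=10_570, y=219, unconverted=11)
print(run1)
print(run2)
```

The chloroplast genome serves as the control here because it is expected to be unmethylated, so nearly every cytosine should read as converted.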
-
Bisulfite conversion of the Arabidopsis thaliana chloroplast
genome
-
Genome-wide Methylation Marks
[Figure: coverage and CHG, CHH and CpG methylation tracks along chr1 to chr5]
-
http://ccg.murdoch.edu.au/yabi
Web HPC-Enabled
-
Acknowledgements
• Zhang Bing
• Ala Lew-Tabor
• Zayed Albertyn
• Matthew Bellgard
• Frances Shannon
• Jun Fan
• Liz Dennis
• Ian Greaves
• Sameer Tiwari
• Colin Hercus
NCRIS 5.1: An Australian Government Initiative, National Collaborative Research Infrastructure Strategy
Department of Primary Industries and Fisheries, Queensland Government